Wrong results when using columns in option #523
This is not a reproducible example. A reproducible example with code is required to debug bugs.
With columns:
> function meminfo() {
... global.gc()
... const memoryUsage = process.memoryUsage();
... console.log('RSS:', memoryUsage.rss / (1024 * 1024), 'MB');
... console.log('Heap Total:', memoryUsage.heapTotal / (1024 * 1024), 'MB');
... console.log('Heap Used:', memoryUsage.heapUsed / (1024 * 1024), 'MB');
... }
undefined
>
undefined
> const {ParquetFile, readParquet, readParquetStream, wasmMemory} = require("parquet-wasm")
undefined
> const {parseTable, parseRecordBatch} = require("arrow-js-ffi")
undefined
> var WASM_MEMORY = wasmMemory();
undefined
>
> meminfo()
RSS: 70.890625 MB
Heap Total: 17.921875 MB
Heap Used: 7.547981262207031 MB
undefined
>
> console.time('pq')
undefined
> var table = await ParquetFile.fromUrl('http://localhost:8000/lineitem.parquet')
undefined
> var numRows = table.metadata().fileMetadata().numRows()
undefined
> var option = {
... columns : ['l_quantity', 'l_extendedprice', 'l_discount', 'l_tax', 'l_returnflag', 'l_linestatus', 'l_shipdate'],
... batchSize : 122_880,
... }
undefined
> meminfo()
RSS: 95.42578125 MB
Heap Total: 14.0703125 MB
Heap Used: 10.886177062988281 MB
undefined
>
> var ffiTable = (await table.read(option)).intoFFI();
undefined
> meminfo()
RSS: 641.65234375 MB
Heap Total: 14.0703125 MB
Heap Used: 10.615234375 MB
undefined
>
> var arrowTable = parseTable(
... WASM_MEMORY.buffer,
... ffiTable.arrayAddrs(),
... ffiTable.schemaAddr()
... )
undefined
> console.timeEnd('pq')
pq: 2.231s
undefined
> meminfo()
RSS: 1097.15234375 MB
Heap Total: 17.3203125 MB
Heap Used: 11.910774230957031 MB
undefined
>
> console.log(numRows, table.metadata().fileMetadata().createdBy());
6001215 DuckDB
undefined
> arrowTable.batches[0].data.children[4]
Data {
type: Decimal { typeId: 7, scale: 2, precision: 15, bitWidth: 128 },
children: [],
dictionary: undefined,
offset: 0,
length: 122880,
_nullCount: 0,
stride: 4,
values: Uint32Array(491520) [
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99,
... 491420 more items
],
nullBitmap: Uint8Array(0) []
}
Without columns:
> console.time('pq')
undefined
> var table = await ParquetFile.fromUrl('http://localhost:8000/lineitem.parquet')
undefined
> var numRows = table.metadata().fileMetadata().numRows()
undefined
> var option = {
... // columns : ['l_quantity', 'l_extendedprice', 'l_discount', 'l_tax', 'l_returnflag', 'l_linestatus', 'l_shipdate'],
... batchSize : 122_880,
... }
undefined
> meminfo()
RSS: 1099.27734375 MB
Heap Total: 14.3203125 MB
Heap Used: 11.8231201171875 MB
undefined
>
> var ffiTable = (await table.read(option)).intoFFI();
undefined
> meminfo()
RSS: 2150.1640625 MB
Heap Total: 15.0703125 MB
Heap Used: 11.835853576660156 MB
undefined
>
> var arrowTable = parseTable(
... WASM_MEMORY.buffer,
... ffiTable.arrayAddrs(),
... ffiTable.schemaAddr()
... )
undefined
> console.timeEnd('pq')
pq: 5.992s
undefined
> meminfo()
RSS: 3131.0390625 MB
Heap Total: 17.8203125 MB
Heap Used: 11.851348876953125 MB
undefined
>
> console.log(numRows, table.metadata().fileMetadata().createdBy());
6001215 DuckDB
undefined
> arrowTable.batches[0].data.children[9]
Data {
type: Utf8 { typeId: 5 },
children: [],
dictionary: undefined,
offset: 0,
length: 122880,
_nullCount: 0,
stride: 1,
valueOffsets: Int32Array(122881) [
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99,
... 122781 more items
],
values: Uint8Array(122880) [
79, 79, 79, 79, 79, 79, 79, 70, 70, 70, 70, 70,
70, 79, 70, 70, 70, 70, 79, 79, 79, 79, 79, 79,
79, 79, 79, 79, 79, 79, 79, 70, 70, 70, 70, 79,
79, 79, 79, 79, 79, 79, 79, 79, 79, 70, 70, 70,
79, 79, 79, 79, 79, 79, 79, 70, 70, 79, 79, 70,
70, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79,
79, 79, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 79, 79, 79, 79, 79, 79, 70, 70, 70, 70,
70, 70, 70, 70,
... 122780 more items
],
nullBitmap: Uint8Array(0) []
}
The bug might be:
The default column at position 4 is:
Field {
name: 'l_quantity',
type: Decimal { typeId: 7, scale: 2, precision: 15, bitWidth: 128 },
nullable: true,
metadata: Map(0) {}
}
With columns specified:
Field {
name: 'l_linestatus',
type: Utf8 { typeId: 5 },
nullable: true,
metadata: Map(0) {}
}
I'm sorry, I still don't understand what your issue is. Are you able to provide a reproducible data file along with your code? Can you remove all the memory reporting that is irrelevant to this issue, and just include the minimum amount of code to show your issue? See:
Also note that I believe the ordering of the names in
Yes, the default positions are 0 to 15. As for the TPC-H lineitem Parquet file, you can generate it with DuckDB.
I still don't know what you're saying doesn't work. It looks like you're able to access the data correctly. You need to check the schema to find the positions of the columns in the output data.
I am not going to go out of my way to generate data without a reproducible example. Please supply a minimal, reproducible example or I'll close this issue.
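A minimal sketch of the schema check suggested above: build a name-to-position map from the output schema instead of hard-coding an index such as 4 or 9. The helper name `columnPositions` is hypothetical, and the stand-in `exampleSchema` object only imitates the `{ fields: [{ name }] }` shape of an Arrow JS schema; its field order is illustrative and says nothing about the order parquet-wasm actually returns.

```javascript
// Hypothetical helper: map each column name to its position in a schema.
// `schema` is assumed to have the shape { fields: [{ name: string }, ...] },
// which is the shape arrowTable.schema has in the transcript above.
function columnPositions(schema) {
  const positions = new Map();
  schema.fields.forEach((field, index) => positions.set(field.name, index));
  return positions;
}

// Stand-in for arrowTable.schema. The order is illustrative only.
const exampleSchema = {
  fields: [
    { name: 'l_quantity' },
    { name: 'l_extendedprice' },
    { name: 'l_returnflag' },
  ],
};

const examplePositions = columnPositions(exampleSchema);

// Look the column up by name, then index into the batch children with it,
// e.g. arrowTable.batches[0].data.children[examplePositions.get('l_returnflag')].
console.log(examplePositions.get('l_returnflag'));
```

Looking up by name keeps the same code working whether or not `columns` is set, since the position of a column may differ between the two reads.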
v0.6, tested with await ParquetFile.from[File/Url], for example on TPC-H lineitem SF1:
Without columns in option: l_extendedprice returns three 0s between every two numbers, in a typed array with 4x the length.
Picking l_extendedprice and l_returnflag gives wrong values:
l_extendedprice returns a 0 between every two numbers, in a typed array with the same length.
l_returnflag returns values as indices.
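For context on the "three 0s between every two numbers" observation: the transcript shows the Decimal column with bitWidth 128 and stride 4 over a Uint32Array, so each value occupies four 32-bit words, and small positive values leave the upper three words at 0. The sketch below decodes one such value, assuming little-endian word order within each value; the helper names are hypothetical and this is not part of parquet-wasm or arrow-js-ffi.

```javascript
// Hypothetical helper: read the i-th 128-bit decimal from a Uint32Array
// that stores four 32-bit words per value (assumed least significant first).
function decimal128ToBigInt(words, i) {
  let value = 0n;
  for (let w = 3; w >= 0; w--) {
    value = (value << 32n) | BigInt(words[i * 4 + w]);
  }
  // Reinterpret the unsigned 128-bit pattern as signed two's complement.
  return BigInt.asIntN(128, value);
}

// Format the raw integer using the column's scale (number of fractional digits).
function decimal128ToString(words, i, scale) {
  const raw = decimal128ToBigInt(words, i);
  const negative = raw < 0n;
  const digits = (negative ? -raw : raw).toString().padStart(scale + 1, '0');
  const split = digits.length - scale;
  const text = scale > 0 ? `${digits.slice(0, split)}.${digits.slice(split)}` : digits;
  return negative ? `-${text}` : text;
}

// Two made-up values: 1700 (17.00 at scale 2) and -1 (-0.01 at scale 2).
const sampleWords = new Uint32Array([
  1700, 0, 0, 0,
  0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
]);
console.log(decimal128ToString(sampleWords, 0, 2));
console.log(decimal128ToString(sampleWords, 1, 2));
```

Under that assumption, a 4x typed array length for a Decimal column is the expected layout rather than a symptom by itself; the values-as-index output is the part that looks wrong.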