Parquet files written by DuckDB 0.10.1 and 0.10.2 cause Apache Impala to fail when reading them, with the following error:
"metadata is corrupt. Dictionary page (offset=2943) must come before any data pages (offset=2943)"
This error appears to be raised when col_chunk.meta_data.dictionary_page_offset >= col_start, where col_start is the column chunk's data_page_offset:
Status ParquetMetadataUtils::ValidateColumnOffsets(const string& filename,
    int64_t file_length, const parquet::RowGroup& row_group) {
  for (int i = 0; i < row_group.columns.size(); ++i) {
    const parquet::ColumnChunk& col_chunk = row_group.columns[i];
    RETURN_IF_ERROR(ValidateOffsetInFile(filename, i, file_length,
        col_chunk.meta_data.data_page_offset, "data page offset"));
    int64_t col_start = col_chunk.meta_data.data_page_offset;
    // The file format requires that if a dictionary page exists, it be before data pages.
    if (col_chunk.meta_data.__isset.dictionary_page_offset) {
      RETURN_IF_ERROR(ValidateOffsetInFile(filename, i, file_length,
          col_chunk.meta_data.dictionary_page_offset, "dictionary page offset"));
      if (col_chunk.meta_data.dictionary_page_offset >= col_start) {
        return Status(Substitute("Parquet file '$0': metadata is corrupt. Dictionary "
            "page (offset=$1) must come before any data pages (offset=$2).",
            filename, col_chunk.meta_data.dictionary_page_offset, col_start));
      }
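To check whether a given file would trip this validation without running Impala, the same comparison can be reproduced over the column chunk offsets. The following is only a rough sketch, not Impala's implementation: the function names `find_bad_dictionary_offsets` and `chunks_from_parquet` are made up for illustration, and the second helper assumes pyarrow is installed and reads the offsets through its `ColumnChunkMetaData` attributes.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# (row_group, column, dictionary_page_offset, data_page_offset)
ChunkOffsets = Tuple[int, int, Optional[int], int]


def find_bad_dictionary_offsets(chunks: Iterable[ChunkOffsets]) -> List[str]:
    """Return one message per column chunk that fails the comparison.

    dictionary_page_offset is None when the chunk records no dictionary page,
    which mirrors the __isset guard in the excerpt above.
    """
    problems = []
    for row_group, column, dict_offset, data_offset in chunks:
        if dict_offset is None:
            continue  # no dictionary page recorded, nothing to compare
        if dict_offset >= data_offset:
            problems.append(
                f"row group {row_group}, column {column}: dictionary page "
                f"(offset={dict_offset}) must come before any data pages "
                f"(offset={data_offset})"
            )
    return problems


def chunks_from_parquet(path: str) -> Iterator[ChunkOffsets]:
    """Yield offset tuples from a real file (hypothetical helper, needs pyarrow)."""
    import pyarrow.parquet as pq  # lazy import: the check itself has no dependency

    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            dict_offset = (
                chunk.dictionary_page_offset if chunk.has_dictionary_page else None
            )
            yield rg, col, dict_offset, chunk.data_page_offset
```

Calling `find_bad_dictionary_offsets(chunks_from_parquet("output.parquet"))` on a file written by the affected versions should list the chunks whose two offsets are equal, if the metadata is indeed what Impala objects to.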
What happens?
Hi All,
Related thread on the Apache mailing list: https://lists.apache.org/thread/0mmlmt02hgb2btlr9hg1n2fs01dylskl
Downgrading to 0.10.0 works perfectly.
Regards
GP
To Reproduce
COPY (SELECT * FROM tbl) TO 'output.parquet' (FORMAT PARQUET);
Files written with DuckDB 0.10.1 or 0.10.2 produce "metadata is corrupt" errors when read with Apache Impala; files written with 0.10.0 or earlier work fine.
OS:
Windows
DuckDB Version:
0.10.1
DuckDB Client:
Java
Full Name:
Pronzato
Affiliation:
Personal
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?