Corrupt Parquet Metadata (migrating from 0.10.0 to 0.10.1) #12096

Closed
2 tasks done
pronzato opened this issue May 16, 2024 · 1 comment · Fixed by #12109

Comments

@pronzato

pronzato commented May 16, 2024

What happens?

Hi All,

It looks like Parquet files written by DuckDB 0.10.1 and 0.10.2 cannot be read by Apache Impala, which fails with the following error:

"metadata is corrupt. Dictionary page (offset=2943) must come before any data pages (offset=2943)"

This error appears to be raised when col_chunk.meta_data.dictionary_page_offset >= col_start, as in this Impala validation code:

Status ParquetMetadataUtils::ValidateColumnOffsets(const string& filename,
    int64_t file_length, const parquet::RowGroup& row_group) {
  for (int i = 0; i < row_group.columns.size(); ++i) {
    const parquet::ColumnChunk& col_chunk = row_group.columns[i];
    RETURN_IF_ERROR(ValidateOffsetInFile(filename, i, file_length,
        col_chunk.meta_data.data_page_offset, "data page offset"));
    int64_t col_start = col_chunk.meta_data.data_page_offset;
    // The file format requires that if a dictionary page exists, it be before data pages.
    if (col_chunk.meta_data.__isset.dictionary_page_offset) {
      RETURN_IF_ERROR(ValidateOffsetInFile(filename, i, file_length,
            col_chunk.meta_data.dictionary_page_offset, "dictionary page offset"));
      if (col_chunk.meta_data.dictionary_page_offset >= col_start) {
        return Status(Substitute("Parquet file '$0': metadata is corrupt. Dictionary "
            "page (offset=$1) must come before any data pages (offset=$2).",
            filename, col_chunk.meta_data.dictionary_page_offset, col_start));
      }

https://lists.apache.org/thread/0mmlmt02hgb2btlr9hg1n2fs01dylskl
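The ordering rule in the quoted check can be isolated in a small standalone sketch. This is only an illustration under assumptions: `ColumnChunkOffsets`, `DictionaryPrecedesData` and `FindRejectedColumns` are hypothetical names that stand in for the Thrift-generated Parquet metadata types, and they are not part of Impala or DuckDB. With the offsets from the reported error (dictionary page at 2943, data page at 2943) the check fails because the two offsets are equal, and the rule requires the dictionary page offset to be strictly smaller.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the Parquet column chunk metadata; only the
// fields used by the offset ordering rule are modelled here.
struct ColumnChunkOffsets {
    std::int64_t data_page_offset;
    bool has_dictionary_page_offset;      // mirrors __isset.dictionary_page_offset
    std::int64_t dictionary_page_offset;  // meaningful only when the flag is true
};

// Returns true when the chunk passes the ordering rule from the quoted code:
// a dictionary page offset, if set, must be strictly less than the data page offset.
bool DictionaryPrecedesData(const ColumnChunkOffsets& chunk) {
    if (!chunk.has_dictionary_page_offset) {
        return true;  // nothing to compare
    }
    return chunk.dictionary_page_offset < chunk.data_page_offset;
}

// Collects the indices of the column chunks in a row group that the rule rejects.
std::vector<std::size_t> FindRejectedColumns(
        const std::vector<ColumnChunkOffsets>& row_group) {
    std::vector<std::size_t> rejected;
    for (std::size_t i = 0; i < row_group.size(); ++i) {
        if (!DictionaryPrecedesData(row_group[i])) {
            rejected.push_back(i);
        }
    }
    return rejected;
}
```

Calling `DictionaryPrecedesData(ColumnChunkOffsets{2943, true, 2943})` returns false, which matches the condition under which the error above is reported.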

Downgrading DuckDB to 0.10.0 resolves the problem.

Regards

GP

To Reproduce

COPY (SELECT * FROM tbl) TO 'output.parquet' (FORMAT PARQUET);

Files written with DuckDB 0.10.1 or 0.10.2 produce "metadata is corrupt" errors when read with Apache Impala; files written with 0.10.0 or earlier work fine.

OS:

Windows

DuckDB Version:

0.10.1

DuckDB Client:

Java

Full Name:

Pronzato

Affiliation:

Personal

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@hannes
Member

hannes commented May 17, 2024

Thanks for reporting this. I've opened a PR with a fix.

hannes linked a pull request May 21, 2024 that will close this issue