Enable updating Parquet key-value metadata of opened ParquetWriter before closing #41608

Open
kylebarron opened this issue May 9, 2024 · 3 comments

@kylebarron
Contributor

kylebarron commented May 9, 2024

Describe the enhancement requested

In pyarrow, it would be great to have the ability to update Parquet key-value metadata of an existing ParquetWriter instance before closing the file. For example:

import pyarrow.parquet as pq

with pq.ParquetWriter("output.parquet", schema) as writer:
    writer.write_batch(...)
    # Proposed API: attach extra key-value metadata before the file is closed
    writer.add_key_value_metadata({"hello": "world"})

This is akin to the arrow-rs ArrowWriter::append_key_value_metadata, which lets you mutate an existing writer to add key-value metadata before writing the final metadata.

My use case for this is writing GeoParquet efficiently. Part of the GeoParquet metadata (stored as Parquet key-value metadata) includes an optional bounding box of all geometries in the file. But many applications need a pass over the dataset to infer the bounding box of the data. If this key-value metadata can only be defined when opening the ParquetWriter, then streaming applications (such as converting other file formats to GeoParquet) might need two passes over the data: one to infer the bounding box and construct the schema to pass to the ParquetWriter constructor, and another to actually write the batches.
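To illustrate the single-pass pattern this would enable, here is a minimal sketch in plain Python. It only shows the bookkeeping: the bounding box is accumulated while batches are streamed, so it is known only after the last batch. `RunningBBox` and `stream_batches` are hypothetical names, the list append stands in for `writer.write_batch(...)`, and the JSON payload is reduced to a single `bbox` field rather than the full GeoParquet metadata.

```python
import json
from typing import Iterable, Iterator, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Point = Tuple[float, float]


class RunningBBox:
    """Accumulates a bounding box incrementally, one batch at a time."""

    def __init__(self) -> None:
        self.bbox: Optional[BBox] = None

    def update(self, points: Iterable[Point]) -> None:
        for x, y in points:
            if self.bbox is None:
                self.bbox = (x, y, x, y)
            else:
                xmin, ymin, xmax, ymax = self.bbox
                self.bbox = (min(xmin, x), min(ymin, y), max(xmax, x), max(ymax, y))


def stream_batches() -> Iterator[List[Point]]:
    # Hypothetical stand-in for batches read from another file format.
    yield [(0.0, 1.0), (2.0, -1.0)]
    yield [(-3.0, 0.5), (1.0, 4.0)]


running = RunningBBox()
written = []
for batch in stream_batches():
    written.append(batch)   # stand-in for writer.write_batch(batch)
    running.update(batch)   # the bbox grows as data streams through

# The bbox is only known here, after the last batch. With the requested API
# it could be attached at this point, before the writer is closed.
geo_metadata = json.dumps({"bbox": list(running.bbox)})
print(geo_metadata)
```

Without a way to add metadata after writing, the loop above would have to run twice: once to compute `running.bbox` and once to write.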

Component(s)

Python

@mapleFU
Member

mapleFU commented May 13, 2024

C++ has this API, but it might not be integrated into the Python SDK yet. Would you like to add support for it?

This means exposing this API (3bd57e3#diff-0042b97c44521fde7da4ce3b4446cd7bc85bbc25eae63d2b4896ce58241e3b94) in the Python SDK.

@kylebarron
Contributor Author

That would be great. I don't know C++ myself, so I can't help with the implementation, but I may be able to test a branch if necessary.

@mapleFU
Member

mapleFU commented May 13, 2024

I will finish it this week.
