Describe the enhancement requested
In pyarrow, it would be great to have the ability to update Parquet key-value metadata of an existing ParquetWriter instance before closing the file. For example:
This is akin to the arrow-rs ArrowWriter::append_key_value_metadata, which lets you mutate an existing writer to add key-value metadata before writing the final metadata.
My use case for this is writing GeoParquet efficiently. Part of the GeoParquet metadata (stored as Parquet key-value metadata) includes an optional bounding box of all geometries in the file, and computing that bounding box requires a pass over the data. If this key-value metadata can only be defined when opening the ParquetWriter, then streaming applications (such as converting other file formats to GeoParquet) would need two passes over the data: one to infer the bounding box and construct the schema to pass to the ParquetWriter constructor, and another to actually write the batches.
Component(s)
Python