Describe the enhancement requested
In pyarrow, it would be great to have the ability to update Parquet key-value metadata of an existing ParquetWriter instance before closing the file. For example:
This is akin to the arrow-rs ArrowWriter::append_key_value_metadata, which lets you mutate an existing writer to add key-value metadata before writing the final metadata.
My use case for this is writing GeoParquet efficiently. Part of the GeoParquet metadata (stored as Parquet key-value metadata) includes an optional bounding box of all geometries in the file, and computing that bounding box requires a pass over the data. If this key-value metadata can only be defined when opening the ParquetWriter, then streaming applications (such as converting other file formats to GeoParquet) would need two passes over the data: one to infer the bounding box and construct the schema to pass to the ParquetWriter constructor, and another to actually write the batches.
Component(s)
Python