Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41579: [C++][Python][Parquet] Support reading/writing key-value metadata from/to ColumnChunkMetaData #41580

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

clee704
Copy link

@clee704 clee704 commented May 8, 2024

Rationale for this change

Parquet standard allows reading/writing key-value metadata from/to ColumnChunkMetaData, but there is no way to do that with Parquet C++.

What changes are included in this PR?

Support reading/writing key-value metadata from/to ColumnChunkMetaData with Parquet C++ reader/writer. Support reading key-value metadata from ColumnChunkMetaData with pyarrow.parquet.

Are these changes tested?

Yes, unit tests are added

Are there any user-facing changes?

Yes.

  • Users can read or write key-value metadata for column chunks with Parquet C++.
  • Users can read key-value metadata for column chunks with PyArrow.
  • parquet-reader tool prints key-value metadata in column chunks when --print-key-value-metadata option is used.

Copy link

github-actions bot commented May 8, 2024

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@clee704 clee704 changed the title Support reading/writing key-value metadata from/to ColumnMetaData [GH-41579] Support reading/writing key-value metadata from/to ColumnMetaData May 8, 2024
@clee704 clee704 changed the title [GH-41579] Support reading/writing key-value metadata from/to ColumnMetaData GH-41579: [C++] Support reading/writing key-value metadata from/to ColumnMetaData May 8, 2024
@clee704 clee704 changed the title GH-41579: [C++] Support reading/writing key-value metadata from/to ColumnMetaData GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnMetaData May 8, 2024
Copy link

github-actions bot commented May 8, 2024

⚠️ GitHub issue #41579 has been automatically assigned in GitHub to PR creator.

@clee704 clee704 changed the title GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnMetaData GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnChunkMetaData May 8, 2024
@clee704 clee704 marked this pull request as ready for review May 8, 2024 05:00
@clee704 clee704 requested a review from wgtmac as a code owner May 8, 2024 05:00
@mapleFU mapleFU changed the title GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnChunkMetaData GH-41579: [C++][Python][Parquet] Support reading/writing key-value metadata from/to ColumnChunkMetaData May 8, 2024
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea general LGTM

cc @pitrou @emkornfield @wgtmac

Comment on lines +1622 to +1626
KeyValueMetadata& key_value_metadata() {
if (!key_value_metadata_) {
key_value_metadata_ = std::make_unique<KeyValueMetadata>();
}
return *key_value_metadata_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So once key_value_metadata() is called, the metadata is generated, and the metadata would be put in SetKeyValueMetadata?

Should we just add a "Put" here?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 12, 2024
@mapleFU mapleFU requested a review from pitrou May 12, 2024 05:46
@wgtmac
Copy link
Member

wgtmac commented May 12, 2024

I saw a related issue: https://issues.apache.org/jira/browse/PARQUET-1648. It seems that parquet-mr does not use it yet.

@mapleFU
Copy link
Member

mapleFU commented May 12, 2024

It seems that parquet-mr does not use it yet.

Can parquet-mr reads that?

@mapleFU
Copy link
Member

mapleFU commented May 12, 2024

🤔at least this patch is ok and seems other implementation has thrift, it doesn't break the standard..

@wgtmac
Copy link
Member

wgtmac commented May 12, 2024

I agree. It should not block us from implementing this.

@@ -1405,6 +1405,10 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
return pages_change_on_record_boundaries_;
}

KeyValueMetadata& key_value_metadata() override {
return metadata_->key_value_metadata();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also check !close_ here?

kv_pair.__set_value(source.value(i));
metadata.key_value_metadata.push_back(kv_pair);
}
metadata.__isset.key_value_metadata = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use void __set_key_value_metadata(const std::vector<KeyValue> & val); instead?

@@ -181,6 +184,10 @@ class PARQUET_EXPORT ColumnWriter {
/// \brief The file-level writer properties
virtual const WriterProperties* properties() = 0;

/// \brief Return KeyValueMetadata that can be used to store key-value
/// metadata in ColumnChunkMetaData.
virtual KeyValueMetadata& key_value_metadata() = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not to give the freedom to get the mutable object. Instead, we can implement a similar function equivalent to https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_writer.h#L163

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.

Comment on lines 31 to +32
class Array;
class KeyValueMetadata;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please remove this and instead include arrow/type_fwd.h?

@@ -53,6 +54,8 @@ class Encryptor;
class OffsetIndexBuilder;
class WriterProperties;

using KeyValueMetadata = ::arrow::KeyValueMetadata;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's generally better to avoid using directives in non-internal .h files.

@@ -181,6 +184,10 @@ class PARQUET_EXPORT ColumnWriter {
/// \brief The file-level writer properties
virtual const WriterProperties* properties() = 0;

/// \brief Return KeyValueMetadata that can be used to store key-value
/// metadata in ColumnChunkMetaData.
virtual KeyValueMetadata& key_value_metadata() = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.

// column metadata
void SetStatistics(const EncodedStatistics& stats);

KeyValueMetadata& key_value_metadata();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please let's avoid non-const references, as per our C++ conventions. Instead, you could follow the current builder pattern:

Suggested change
KeyValueMetadata& key_value_metadata();
void SetKeyValueMetadata(const std::shared_ptr<const KeyValueMetadata>&);


void PrintKeyValueMetadata(std::ostream& stream,
const KeyValueMetadata& key_value_metadata,
int indent_level = 0, int indent_width = 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it would be easier to just pass the indentation string?

Suggested change
int indent_level = 0, int indent_width = 1) {
std::string_view indent = "") {

Comment on lines +513 to +521
cdef:
unordered_map[c_string, c_string] metadata
const CKeyValueMetadata* underlying_metadata
underlying_metadata = self.metadata.key_value_metadata().get()
if underlying_metadata != NULL:
underlying_metadata.ToUnorderedMap(&metadata)
return metadata
else:
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use pyarrow_wrap_metadata instead. Something like (untested):

Suggested change
cdef:
unordered_map[c_string, c_string] metadata
const CKeyValueMetadata* underlying_metadata
underlying_metadata = self.metadata.key_value_metadata().get()
if underlying_metadata != NULL:
underlying_metadata.ToUnorderedMap(&metadata)
return metadata
else:
return None
wrapped = pyarrow_wrap_metadata(self.metadata.key_value_metadata())
if wrapped is not None:
return wrapped.to_dict()
else:
return wrapped

def test_column_chunk_key_value_metadata(datadir):
metadata = pq.read_metadata(datadir / 'column-chunk-key-value-metadata.parquet')
key_value_metadata = metadata.row_group(0).column(0).metadata
print(key_value_metadata)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please remove this print() call (which I assume was meant for debugging)?

metadata = pq.read_metadata(datadir / 'column-chunk-key-value-metadata.parquet')
key_value_metadata = metadata.row_group(0).column(0).metadata
print(key_value_metadata)
assert key_value_metadata[b'foo'] == b'bar'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can you test the entire dict contents?
  2. Can you add a test where the metadata is None?

@pitrou
Copy link
Member

pitrou commented May 13, 2024

@clee704 Could you please submit a test file to https://github.com/apache/parquet-testing instead of adding it in this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants