New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41579: [C++][Python][Parquet] Support reading/writing key-value metadata from/to ColumnChunkMetaData #41580
base: main
Are you sure you want to change the base?
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea general LGTM
KeyValueMetadata& key_value_metadata() { | ||
if (!key_value_metadata_) { | ||
key_value_metadata_ = std::make_unique<KeyValueMetadata>(); | ||
} | ||
return *key_value_metadata_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So once key_value_metadata()
is called, the metadata is generated, and the metadata would be put in SetKeyValueMetadata
?
Should we just add a "Put" here?
I saw a related issue: https://issues.apache.org/jira/browse/PARQUET-1648. It seems that parquet-mr does not use it yet. |
Can parquet-mr reads that? |
🤔at least this patch is ok and seems other implementation has thrift, it doesn't break the standard.. |
I agree. It should not block us from implementing this. |
@@ -1405,6 +1405,10 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter< | |||
return pages_change_on_record_boundaries_; | |||
} | |||
|
|||
KeyValueMetadata& key_value_metadata() override { | |||
return metadata_->key_value_metadata(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also check !close_
here?
kv_pair.__set_value(source.value(i)); | ||
metadata.key_value_metadata.push_back(kv_pair); | ||
} | ||
metadata.__isset.key_value_metadata = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use void __set_key_value_metadata(const std::vector<KeyValue> & val);
instead?
@@ -181,6 +184,10 @@ class PARQUET_EXPORT ColumnWriter { | |||
/// \brief The file-level writer properties | |||
virtual const WriterProperties* properties() = 0; | |||
|
|||
/// \brief Return KeyValueMetadata that can be used to store key-value | |||
/// metadata in ColumnChunkMetaData. | |||
virtual KeyValueMetadata& key_value_metadata() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not to give the freedom to get the mutable object. Instead, we can implement a similar function equivalent to https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_writer.h#L163
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
class Array; | ||
class KeyValueMetadata; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please remove this and instead include arrow/type_fwd.h
?
@@ -53,6 +54,8 @@ class Encryptor; | |||
class OffsetIndexBuilder; | |||
class WriterProperties; | |||
|
|||
using KeyValueMetadata = ::arrow::KeyValueMetadata; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's generally better to avoid using
directives in non-internal .h
files.
@@ -181,6 +184,10 @@ class PARQUET_EXPORT ColumnWriter { | |||
/// \brief The file-level writer properties | |||
virtual const WriterProperties* properties() = 0; | |||
|
|||
/// \brief Return KeyValueMetadata that can be used to store key-value | |||
/// metadata in ColumnChunkMetaData. | |||
virtual KeyValueMetadata& key_value_metadata() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
// column metadata | ||
void SetStatistics(const EncodedStatistics& stats); | ||
|
||
KeyValueMetadata& key_value_metadata(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please let's avoid non-const references, as per our C++ conventions. Instead, you could follow the current builder pattern:
KeyValueMetadata& key_value_metadata(); | |
void SetKeyValueMetadata(const std::shared_ptr<const KeyValueMetadata>&); |
|
||
void PrintKeyValueMetadata(std::ostream& stream, | ||
const KeyValueMetadata& key_value_metadata, | ||
int indent_level = 0, int indent_width = 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it would be easier to just pass the indentation string?
int indent_level = 0, int indent_width = 1) { | |
std::string_view indent = "") { |
cdef: | ||
unordered_map[c_string, c_string] metadata | ||
const CKeyValueMetadata* underlying_metadata | ||
underlying_metadata = self.metadata.key_value_metadata().get() | ||
if underlying_metadata != NULL: | ||
underlying_metadata.ToUnorderedMap(&metadata) | ||
return metadata | ||
else: | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use pyarrow_wrap_metadata
instead. Something like (untested):
cdef: | |
unordered_map[c_string, c_string] metadata | |
const CKeyValueMetadata* underlying_metadata | |
underlying_metadata = self.metadata.key_value_metadata().get() | |
if underlying_metadata != NULL: | |
underlying_metadata.ToUnorderedMap(&metadata) | |
return metadata | |
else: | |
return None | |
wrapped = pyarrow_wrap_metadata(self.metadata.key_value_metadata()) | |
if wrapped is not None: | |
return wrapped.to_dict() | |
else: | |
return wrapped |
def test_column_chunk_key_value_metadata(datadir): | ||
metadata = pq.read_metadata(datadir / 'column-chunk-key-value-metadata.parquet') | ||
key_value_metadata = metadata.row_group(0).column(0).metadata | ||
print(key_value_metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please remove this print() call (which I assume was meant for debugging)?
metadata = pq.read_metadata(datadir / 'column-chunk-key-value-metadata.parquet') | ||
key_value_metadata = metadata.row_group(0).column(0).metadata | ||
print(key_value_metadata) | ||
assert key_value_metadata[b'foo'] == b'bar' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can you test the entire dict contents?
- Can you add a test where the metadata is
None
?
@clee704 Could you please submit a test file to https://github.com/apache/parquet-testing instead of adding it in this PR? |
Rationale for this change
Parquet standard allows reading/writing key-value metadata from/to ColumnChunkMetaData, but there is no way to do that with Parquet C++.
What changes are included in this PR?
Support reading/writing key-value metadata from/to ColumnChunkMetaData with Parquet C++ reader/writer. Support reading key-value metadata from ColumnChunkMetaData with pyarrow.parquet.
Are these changes tested?
Yes, unit tests are added
Are there any user-facing changes?
Yes.
--print-key-value-metadata
option is used.