[PROTOCOL RFC] Support for collated strings in the schema and statistics #2894

Open
1 of 3 tasks
olaky opened this issue Apr 16, 2024 · 1 comment

Comments

@olaky
Contributor

olaky commented Apr 16, 2024

Protocol Change Request

Description of the protocol change

Spark is introducing support for collated strings (see SPARK-46830), and Delta tables should support collated columns and fields as well. This requires changes to two parts of the Delta protocol:

  • The schema: to store collation information for columns and fields.
  • Statistics: collations, for example those in ICU, are versioned, and the collation rules can change between versions. To keep data skipping correct, statistics must be annotated with the version of the collation that was used to generate them. We should also consider storing statistics for multiple versions at the same time. This would give a smooth upgrade path and make data skipping easier for clients that have different versions of a collation available.
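
To illustrate why statistics need a collation version, here is a minimal Python sketch of version-aware data skipping. It is not the layout proposed in the RFC or the Design Doc. Every name in it (`ColumnStats`, `by_collation`, `can_skip_file`, `ci_compare`, the collation name `ci_demo`, and the versions `1.0` and `2.0`) is hypothetical, and a case-insensitive comparison stands in for a real versioned collation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical key: (collation name, collation version).
CollationKey = Tuple[str, str]


@dataclass
class ColumnStats:
    # Hypothetical layout: one (min, max) pair per collation version,
    # so several versions can be stored side by side.
    by_collation: Dict[CollationKey, Tuple[str, str]]


def ci_compare(a: str, b: str) -> int:
    """Stand-in collation: case-insensitive three-way comparison."""
    a, b = a.casefold(), b.casefold()
    return (a > b) - (a < b)


def can_skip_file(
    stats: ColumnStats,
    collation: str,
    reader_version: str,
    value: str,
    compare: Callable[[str, str], int],
) -> bool:
    """Return True only if the file provably has no row equal to `value`."""
    bounds = stats.by_collation.get((collation, reader_version))
    if bounds is None:
        # No statistics for the reader's collation version: the bounds
        # cannot be trusted, so the file must be read.
        return False
    lo, hi = bounds
    return compare(value, lo) < 0 or compare(value, hi) > 0


stats = ColumnStats(by_collation={("ci_demo", "1.0"): ("apple", "Mango")})
skip_zebra = can_skip_file(stats, "ci_demo", "1.0", "zebra", ci_compare)
skip_banana = can_skip_file(stats, "ci_demo", "1.0", "BANANA", ci_compare)
skip_other_version = can_skip_file(stats, "ci_demo", "2.0", "zebra", ci_compare)
```

In this sketch a reader that only has version `2.0` of the collation falls back to reading the file, because the stored bounds were computed under version `1.0`. Storing bounds for several versions at once would let both readers skip files.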

More details about the idea can be found in the Design Doc.

Willingness to contribute

The Delta Lake Community encourages protocol innovations. Would you or another member of your organization be willing to contribute this feature to the Delta Lake code base?

  • Yes. I can contribute.
  • Yes. I would be willing to contribute with guidance from the Delta Lake community.
  • No. I cannot contribute at this time.
@olaky
Contributor Author

olaky commented May 17, 2024

Protocol RFC PR is open: #3068
