[Feature Request][Spark] Optimize automated batching #3081

Open
2 of 8 tasks
Kimahriman opened this issue May 10, 2024 · 1 comment · May be fixed by #3089
Labels
enhancement New feature or request

Comments

@Kimahriman
Contributor

Feature request

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

Currently, optimize is an all-or-nothing operation over every file in the table, or over the subset selected by a partition filter. The partition filter lets you manually batch subsets of the table, but for clustered tables there is no option to do partition filtering, so that workaround is unavailable. We should add batch support inside optimize, so that chunks of optimized files can be committed to the transaction log incrementally.

Motivation

Currently, you could rewrite an entire petabyte of data, fail on the last file, and lose all of that work, wasting a lot of compute time and storage space. With automatic batching, the results of every completed batch would be saved along the way, and only the batch that failed would have to be retried.

Further details

I think this can be fairly straightforward: just group the existing bins into another layer of batches.
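
To make the idea concrete, here is a rough sketch of what "another layer of batches" could look like. This is not Delta's actual code: `group_bins_into_batches`, `optimize_in_batches`, `max_batch_bytes`, and the `rewrite_and_commit` callback are all hypothetical names, and a bin is modeled simply as a list of file sizes in bytes. The assumption is that each batch is rewritten and committed as its own transaction, so a failure only loses the batch in flight.

```python
from typing import Callable, Iterator, List, Sequence

# Hypothetical model: a bin is just the byte sizes of the files it contains.
Bin = List[int]


def group_bins_into_batches(
    bins: Sequence[Bin], max_batch_bytes: int
) -> Iterator[List[Bin]]:
    """Greedily pack consecutive bins into batches of at most max_batch_bytes.

    A single bin larger than the limit still gets a batch of its own,
    so every bin is always assigned to exactly one batch.
    """
    batch: List[Bin] = []
    batch_bytes = 0
    for b in bins:
        size = sum(b)
        # Close the current batch if adding this bin would exceed the limit.
        if batch and batch_bytes + size > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(b)
        batch_bytes += size
    if batch:
        yield batch


def optimize_in_batches(
    bins: Sequence[Bin],
    max_batch_bytes: int,
    rewrite_and_commit: Callable[[List[Bin]], None],
) -> int:
    """Rewrite and commit one batch at a time; return the number committed.

    If rewrite_and_commit raises, the exception propagates, but batches
    committed before the failure are assumed to stay in the log.
    """
    committed = 0
    for batch in group_bins_into_batches(bins, max_batch_bytes):
        rewrite_and_commit(batch)  # hypothetical: one transaction per batch
        committed += 1
    return committed
```

With something like this, a retry after a failure would only need to redo the bins that were not part of an already committed batch.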

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
@Kimahriman added the enhancement label May 10, 2024
@Kimahriman
Contributor Author

I think this is already a thing in Databricks, so it would be great to know if there are any plans to open-source that before I spend a bunch of time on this! @scottsand-db

@Kimahriman linked a pull request May 14, 2024 that will close this issue
5 tasks