[Feature] Optimize splits generator for unaware bucket mode #3235

Open
2 tasks done
Aitozi opened this issue Apr 18, 2024 · 0 comments

Aitozi commented Apr 18, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

In streaming mode, the unaware bucket mode uses the FIFOSplitAssigner, so we could avoid bin-packing several files into one split. IMO, this has two benefits:

  • In the FIFOSplitAssigner, we have a work-stealing mechanism that allows for higher total throughput as the split size decreases.
  • We can avoid skew between splits, e.g. when two files have similar row counts but very different file sizes.
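
To illustrate the first point, here is a minimal sketch of FIFO assignment, assuming each idle reader simply pulls the next split from the queue. The class `FifoAssignSketch`, the method `makespan`, and the abstract "cost" per split are hypothetical and are not the actual FIFOSplitAssigner code.

```java
import java.util.List;
import java.util.PriorityQueue;

public class FifoAssignSketch {
    /**
     * Simulates readers pulling splits in FIFO order: whichever reader
     * becomes idle first takes the next split. Returns the time at which
     * the last reader finishes (the makespan).
     */
    static long makespan(List<Long> splitCosts, int readers) {
        // Each entry is the time at which one reader becomes idle.
        PriorityQueue<Long> freeAt = new PriorityQueue<>();
        for (int i = 0; i < readers; i++) {
            freeAt.add(0L);
        }
        long finish = 0;
        for (long cost : splitCosts) {
            long start = freeAt.poll(); // earliest idle reader takes the next split
            long end = start + cost;
            freeAt.add(end);
            finish = Math.max(finish, end);
        }
        return finish;
    }
}
```

With 4 readers, two packed splits of cost 100 keep two readers idle, while the same work as eight splits of cost 25 is spread over all four readers and finishes in half the time in this model.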

Solution

Generate one split per file in unaware bucket mode.
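
A rough sketch of the difference, assuming files are represented only by their sizes in bytes. `SplitGenSketch`, `binPack`, `onePerFile`, and the `targetSize` parameter are hypothetical names for illustration and do not mirror the real splits generator.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitGenSketch {
    /** Simple ordered bin-packing: files are grouped until targetSize would be exceeded. */
    static List<List<Long>> binPack(List<Long> fileSizes, long targetSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would overflow it.
            if (!current.isEmpty() && currentSize + size > targetSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    /** Proposed behaviour: every file becomes its own split. */
    static List<List<Long>> onePerFile(List<Long> fileSizes) {
        List<List<Long>> splits = new ArrayList<>();
        for (long size : fileSizes) {
            splits.add(List.of(size));
        }
        return splits;
    }
}
```

For file sizes 10, 20, 30, and 100 with a target of 64, `binPack` produces two splits while `onePerFile` produces four, which gives the assigner more, smaller units to hand out.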

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@Aitozi Aitozi added the enhancement New feature or request label Apr 18, 2024
@Aitozi Aitozi self-assigned this Apr 18, 2024