
Add warning about input_shard_per_output_shard performance #187

Open
rom1504 opened this issue Sep 30, 2022 · 1 comment
Labels
enhancement New feature or request

Comments


rom1504 commented Sep 30, 2022

For good performance, the current implementation requires:

input_shard_per_output_shard >= num_prepro_workers

For example:
num_prepro_workers = 8
sample_per_output_shard = 1000000
sample_per_input_shard = 10000
input_shard_per_output_shard = sample_per_output_shard / sample_per_input_shard
input_shard_per_output_shard = 100

In that case the speed is optimal only while at least 100 input shards remain to be processed.
That means datasets smaller than 100 shards will be slow throughout, and a dataset of exactly 100 shards will start fast and then get slower and slower as the remaining shards run out.

Action item:

  • warn if input_shard_per_output_shard < num_prepro_workers
  • warn if input_shard_count < input_shard_per_output_shard
  • recommend better parameters for small datasets
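The checks above could be sketched roughly as follows. This is a hypothetical illustration, not the actual implementation: the function name `check_sharding_params`, the argument names, and the warning wording are all assumptions, only the parameter names (`sample_per_output_shard`, `sample_per_input_shard`, `num_prepro_workers`, `input_shard_count`) come from the issue text.

```python
import warnings


def check_sharding_params(sample_per_output_shard, sample_per_input_shard,
                          num_prepro_workers, input_shard_count):
    """Hypothetical sketch: derive input_shard_per_output_shard and warn
    about the two slow configurations described in this issue."""
    input_shard_per_output_shard = sample_per_output_shard // sample_per_input_shard

    # Not enough input shards per output shard to keep all readers busy.
    if input_shard_per_output_shard < num_prepro_workers:
        warnings.warn(
            f"input_shard_per_output_shard ({input_shard_per_output_shard}) < "
            f"num_prepro_workers ({num_prepro_workers}): readers will starve; "
            "consider a larger sample_per_output_shard or fewer workers."
        )

    # Dataset too small: it never reaches the optimal regime at all.
    if input_shard_count < input_shard_per_output_shard:
        warnings.warn(
            f"dataset has only {input_shard_count} input shards, fewer than "
            f"input_shard_per_output_shard ({input_shard_per_output_shard}); "
            "consider a smaller sample_per_output_shard for this dataset."
        )

    return input_shard_per_output_shard
```

With the example numbers from this issue (1,000,000 samples per output shard, 10,000 per input shard), this yields 100 input shards per output shard, so a 50-shard dataset would trigger the small-dataset warning.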

rom1504 commented Sep 30, 2022

An option to consider may be to introduce the concept of a task that contains multiple output shards and hence can keep reading the same input shards. That would solve this problem.
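One way to picture the task idea: group several output shards into a single task so that each task spans a long contiguous range of input shards, keeping readers busy even near the end of the dataset. Everything here is a hypothetical sketch, `make_tasks` and `output_shards_per_task` are invented names, not part of the codebase.

```python
def make_tasks(input_shard_count, input_shard_per_output_shard,
               output_shards_per_task):
    """Hypothetical sketch: yield (task_id, input_shard_ids) pairs where each
    task covers the input shards feeding several consecutive output shards."""
    shards_per_task = input_shard_per_output_shard * output_shards_per_task
    for task_id, start in enumerate(range(0, input_shard_count, shards_per_task)):
        end = min(start + shards_per_task, input_shard_count)
        yield task_id, list(range(start, end))
```

For instance, with 100 input shards per output shard and 4 output shards per task, each task reads 400 input shards, so even the final task keeps many readers fed instead of tailing off one output shard at a time.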

@rom1504 rom1504 added the enhancement New feature or request label Jan 13, 2024