For good performance the current implementation requires:
input_shard_per_output_shard >= num_prepro_workers

For example:
num_prepro_workers = 8
sample_per_output_shard = 1000000
sample_per_input_shard = 10000
input_shard_per_output_shard = sample_per_output_shard / sample_per_input_shard
input_shard_per_output_shard = 100

In that case the speed will be optimal only while at least 100 input shards are still left to be done.
That means that for datasets smaller than 100 shards processing will be slow, and for a dataset of exactly 100 shards it will be fast initially, then get slower and slower.

Action items:
warn if input_shard_per_output_shard < num_prepro_workers
warn if input_shard_per_output_shard < input_shard_count
recommend better parameters for small datasets

An option to consider may be to introduce the concept of tasks that contain multiple output shards and hence can keep reading from the same input shards. That would solve this problem.
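The suggested warnings could be sketched roughly as below. `check_shard_parameters` is a hypothetical helper, not part of the actual codebase; the parameter names are taken from the issue text. Note the small-dataset check here warns when `input_shard_count` is below `input_shard_per_output_shard`, matching the "datasets smaller than 100 shards will be slow" description above.

```python
import warnings


def check_shard_parameters(sample_per_output_shard, sample_per_input_shard,
                           num_prepro_workers, input_shard_count):
    """Warn when the sharding parameters would lead to poor throughput."""
    # How many input shards must be read to fill one output shard.
    input_shard_per_output_shard = sample_per_output_shard // sample_per_input_shard

    if input_shard_per_output_shard < num_prepro_workers:
        warnings.warn(
            "input_shard_per_output_shard (%d) < num_prepro_workers (%d): "
            "some preprocessing workers will sit idle"
            % (input_shard_per_output_shard, num_prepro_workers))

    if input_shard_count < input_shard_per_output_shard:
        warnings.warn(
            "dataset has only %d input shards but %d are needed per output "
            "shard: consider a smaller sample_per_output_shard"
            % (input_shard_count, input_shard_per_output_shard))

    return input_shard_per_output_shard


# With the numbers from the example above: 1000000 / 10000 = 100,
# which satisfies both conditions, so no warning is emitted.
print(check_shard_parameters(1_000_000, 10_000,
                             num_prepro_workers=8,
                             input_shard_count=100))  # 100
```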