
3x Faster Text Preprocessing #6711

Open · wants to merge 1 commit into main

Conversation

@ashvardanian (Author)

I was preparing some datasets for AI training and noticed that the datasets library by HuggingFace uses the conventional open mechanism to read files and split them into chunks. I thought this could be significantly accelerated, so I started with a benchmark:

$ pip install --upgrade --force-reinstall datasets
$ python benchmark_huggingface_datasets.py xlsum.csv 
Generating train split: 1004598 examples [00:47, 21116.16 examples/s]
Time taken to load the dataset: 48.66838526725769 seconds
Time taken to chunk the dataset into parts of size 10000: 0.11466407775878906 seconds
Total time taken: 48.78304934501648 seconds
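
For context, the benchmark script follows roughly this shape (a sketch of my setup, not the exact benchmark_huggingface_datasets.py; the "text" builder choice and the chunking loop here are assumptions):

import sys
import time

from datasets import load_dataset

path = sys.argv[1]        # e.g. "xlsum.csv"
chunk_size = 10_000

start = time.time()
dataset = load_dataset("text", data_files=path, split="train")
loaded = time.time()
print(f"Time taken to load the dataset: {loaded - start} seconds")

# Slice the dataset into fixed-size parts, mirroring the chunking step above.
chunks = [dataset[i : i + chunk_size] for i in range(0, len(dataset), chunk_size)]
chunked = time.time()
print(f"Time taken to chunk the dataset into parts of size {chunk_size}: {chunked - loaded} seconds")
print(f"Total time taken: {chunked - start} seconds")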

For the benchmark I used a large CSV file with mixed UTF-8 content, typical of modern large-scale pre-training pipelines. I then patched the datasets library to use StringZilla, which resulted in significantly lower memory consumption and a 2.9x throughput improvement on AWS r7iz instances. That was measured with slow SSDs mounted over the network; performance on local SSDs, on something like a DGX-H100, should be even higher:

$ pip install -e .
$ python benchmark_huggingface_datasets.py xlsum.csv 
Generating train split: 1004598 examples [00:15, 64529.90 examples/s]
Time taken to load the dataset: 16.45028805732727 seconds
Time taken to chunk the dataset into parts of size 10000: 0.1291060447692871 seconds
Total time taken: 16.579394102096558 seconds
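
The core of the patch swaps the buffered open-and-split loop for StringZilla's memory-mapped file handling and SIMD-accelerated splitting. Conceptually it looks like this (a simplified sketch rather than the exact diff; Str, File, splitlines, and split are StringZilla's Python bindings as I understand them, and the two-newline paragraph separator is a simplification):

from stringzilla import File, Str

def iter_lines(path: str):
    text = Str(File(path))          # zero-copy, memory-mapped view of the file
    for line in text.splitlines():  # SIMD-accelerated newline search
        yield str(line)             # materialize a Python str only when needed

def iter_paragraphs(path: str):
    text = Str(File(path))
    for paragraph in text.split("\n\n"):  # naive paragraph boundary
        yield str(paragraph)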

I've already pushed the patches to my fork, and would love to contribute them to the upstream repository.


All the tests pass, but they leave a couple of important questions open. The default Python open(..., newline=None) uses universal newlines, where \n, \r, and \r\n are all converted to \n on the fly. I am not sure whether that is a good idea for a general-purpose dataset preparation pipeline.
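
To make the difference concrete, here is what universal newlines do compared to splitting the raw bytes, using only the standard library:

import io

data = b"alpha\r\nbeta\rgamma\n"

# Universal newlines: \r\n and \r are translated to \n on the fly,
# so all three records come back in the same form.
with io.TextIOWrapper(io.BytesIO(data), newline=None) as f:
    print(f.read().splitlines())   # ['alpha', 'beta', 'gamma']

# Splitting the raw bytes on b"\n" alone, closer to what C-style code does,
# leaves a stray \r attached and merges the \r-terminated record.
print(data.split(b"\n"))           # [b'alpha\r', b'beta\rgamma', b'']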

I can simulate the same behavior (which I don't do yet) for the "line" splitter; adjusting it for the "paragraph" splitter would be harder. Should we stick exactly to the old Pythonic behavior, or stay closer to how C and other programming languages handle newlines?

@lhoestq (Member)

Hi! Unfortunately we don't like adding new dependencies to datasets.

Maybe we can improve speed and memory usage using built-in tools though, like regex finditer?
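
Something like this, maybe (just a rough, untested sketch of the idea):

import re

# A line body plus its terminator (\r\n, \r, or \n); the second alternative
# catches a trailing line without a terminator.
_LINE = re.compile(r"([^\r\n]*)(?:\r\n|\r|\n)|([^\r\n]+)$")

def iter_lines(text: str):
    for match in _LINE.finditer(text):
        yield match.group(1) if match.group(1) is not None else match.group(2)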

@ashvardanian (Author)

Unfortunately, that won't improve the performance. The StringZilla repository has extensive benchmarks comparing it against the built-in functionality of several programming languages. Using re.finditer for tokenization is practically the slowest anti-pattern I've encountered in any language: the gap between it and a SIMD-accelerated kernel can be as large as 10 MB/s vs. 10 GB/s.
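
Anyone curious can reproduce that kind of gap with a quick measurement along these lines (a sketch; absolute numbers depend heavily on the machine and the file, and the Str/count calls refer to StringZilla's Python bindings, so treat the exact API as an assumption):

import re
import time

import stringzilla as sz

def throughput(label, fn, payload_bytes):
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {result} newlines, {payload_bytes / elapsed / 1e6:.1f} MB/s")

path = "xlsum.csv"                  # any large UTF-8 text file
raw = open(path, "rb").read()
text = raw.decode("utf-8", errors="replace")

throughput("re.finditer", lambda: sum(1 for _ in re.finditer(r"\n", text)), len(raw))
throughput("str.count", lambda: text.count("\n"), len(raw))
throughput("stringzilla.Str.count", lambda: sz.Str(text).count("\n"), len(raw))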

I understand the need to keep dependencies minimal; it helps the package remain small and portable. At this point, StringZilla provides 105 binaries for different OS and hardware combinations (more portable than NumPy), and the binary size generally ranges from 50 KB to 250 KB, smaller than a single JPEG.

@mariosasko (Collaborator)

The text builder is not very popular, so I'm also not a fan of introducing a dependency for it.

Moreover, I couldn't find any projects of comparable size/usage depending on StringZilla (via GitHub search), so we should at least wait for greater adoption before merging this PR.
