
Fast alternative to text tokenization with SimpleTokenizer #755

Open
michael-p opened this issue Dec 6, 2023 · 4 comments

Comments

@michael-p

We've been working on a re-implementation of the original OpenAI text tokenizer (SimpleTokenizer) in Rust, with bindings for Python, called instant-clip-tokenizer.
In our benchmarks it is around 70x faster than the current Python implementation.

Are you interested in mentioning this library in your README as an alternative to the SimpleTokenizer included in this repository? If yes, I'm happy to send in a PR!
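
A minimal usage sketch (it assumes only the Tokenizer() constructor and encode() method from the library's Python bindings, as exercised in the benchmark below):

# Hedged sketch for instant-clip-tokenizer's Python bindings: only
# Tokenizer() and encode() are used here; encode() is assumed to return
# the BPE token ids for the input text.
from instant_clip_tokenizer import Tokenizer

tokenizer = Tokenizer()
tokens = tokenizer.encode("a photo of a cat")
print(tokens)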

@bryant1410
Contributor

I wonder how it compares with CLIPTokenizerFast from transformers since it's supposed to do the same thing: Rust-backed tokenization for CLIP.

I've been using transformers' CLIP tokenizer as a replacement for SimpleTokenizer, and I haven't measured the runtime performance, but it does tokenize in the same way.
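
One quick way to check that they agree on a given input (a sketch; it assumes this repository's clip.simple_tokenizer import path and the checkpoint name used in the benchmark below):

# Sanity check that CLIPTokenizerFast matches SimpleTokenizer on a sample
# input. With add_special_tokens=False the fast tokenizer emits no
# <|startoftext|> / <|endoftext|> ids, so the raw BPE ids should line up.
from clip.simple_tokenizer import SimpleTokenizer
from transformers import CLIPTokenizerFast

simple = SimpleTokenizer()
fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch16")

text = "a photo of a cat"
assert simple.encode(text) == fast.encode(text, add_special_tokens=False)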

@michael-p
Author

I just ran a short benchmark (timings via IPython's %timeit): on my machine it is around 47x faster at encoding a single input than the Rust-backed CLIPTokenizerFast implementation from transformers:

from transformers import CLIPTokenizerFast
from instant_clip_tokenizer import Tokenizer as InstantTokenizer

tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch16")
tokenizer_instant = InstantTokenizer()

INPUT = "If yes I'm happy to send in a PR!" # some random sentence :)

%timeit tokenizer_fast.encode(INPUT, add_special_tokens=False)
# -> 80.1 µs ± 626 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit tokenizer_instant.encode(INPUT)
# -> 1.7 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@bryant1410
Contributor

Cool! Good to know about this faster tokenizer!

Have you tried batch tokenization?

However, in my use cases, tokenization is not a bottleneck when training CLIP-like models.

@michael-p
Author

> Have you tried batch tokenization?

We mostly care about tokenization performance for single inputs (we use it for inference). Nevertheless, we provide a tokenize_batch method which is around 3x to 10x faster (depending on batch size) than the corresponding method from CLIPTokenizerFast.
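
For reference, a batch comparison in the same style as the benchmark above. The exact tokenize_batch signature is an assumption here (this thread only names the method); the transformers side uses the standard call-on-a-list API:

from transformers import CLIPTokenizerFast
from instant_clip_tokenizer import Tokenizer as InstantTokenizer

tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch16")
tokenizer_instant = InstantTokenizer()

batch = ["a photo of a cat", "a photo of a dog"] * 64  # 128 short inputs

%timeit tokenizer_fast(batch, add_special_tokens=False)
%timeit tokenizer_instant.tokenize_batch(batch)  # assumed to accept a list of strings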
