
Fast alternative to text tokenization with SimpleTokenizer #755

Open
michael-p opened this issue Dec 6, 2023 · 4 comments

Comments

@michael-p

We've been working on a re-implementation of the original OpenAI text tokenizer (SimpleTokenizer) in Rust, with bindings for Python, called instant-clip-tokenizer.
In our benchmarks it is around 70x faster than the current Python implementation.

Are you interested in mentioning this library in your README as an alternative to the SimpleTokenizer included in this repository? If yes, I'm happy to send in a PR!
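
A minimal usage sketch (it assumes only the Tokenizer() constructor and encode() method from the library's Python bindings, as exercised in the benchmark below):

# Hedged sketch for instant-clip-tokenizer's Python bindings: only
# Tokenizer() and encode() are used here; encode() is assumed to return
# the BPE token ids for the input text.
from instant_clip_tokenizer import Tokenizer

tokenizer = Tokenizer()
tokens = tokenizer.encode("a photo of a cat")
print(tokens)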

@bryant1410
Contributor

I wonder how it compares with CLIPTokenizerFast from transformers since it's supposed to do the same thing: Rust-backed tokenization for CLIP.

I've been using transformers' CLIP tokenizer as a replacement for SimpleTokenizer, and I haven't measured the runtime performance, but it does tokenize in the same way.
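
One quick way to check that they agree on a given input (a sketch; it assumes this repository's clip.simple_tokenizer import path and the checkpoint name used in the benchmark below):

# Sanity check that CLIPTokenizerFast matches SimpleTokenizer on a sample
# input. With add_special_tokens=False the fast tokenizer emits no
# <|startoftext|> / <|endoftext|> ids, so the raw BPE ids should line up.
from clip.simple_tokenizer import SimpleTokenizer
from transformers import CLIPTokenizerFast

simple = SimpleTokenizer()
fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch16")

text = "a photo of a cat"
assert simple.encode(text) == fast.encode(text, add_special_tokens=False)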

@michael-p
Author

I just ran a short benchmark (timings via IPython's %timeit): on my machine it is around 47x faster at encoding a single input than the Rust-backed CLIPTokenizerFast implementation from transformers:

from transformers import CLIPTokenizerFast
from instant_clip_tokenizer import Tokenizer as InstantTokenizer

tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch16")
tokenizer_instant = InstantTokenizer()

INPUT = "If yes I'm happy to send in a PR!" # some random sentence :)

%timeit tokenizer_fast.encode(INPUT, add_special_tokens=False)
# -> 80.1 µs ± 626 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit tokenizer_instant.encode(INPUT)
# -> 1.7 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@bryant1410
Contributor

Cool! Good to know about this faster tokenizer!

Have you tried batch tokenization?

However, in my use cases, tokenization is not a bottleneck when training CLIP-like models.

@michael-p
Author

> Have you tried batch tokenization?

We mostly care about tokenization performance for single inputs (we use it for inference). Nevertheless, we provide a tokenize_batch method which is around 3x to 10x faster (depending on batch size) than the corresponding method from CLIPTokenizerFast.
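
For reference, a batch comparison in the same style as the benchmark above. The exact tokenize_batch signature is an assumption here (this thread only names the method); the transformers side uses the standard call-on-a-list API:

from transformers import CLIPTokenizerFast
from instant_clip_tokenizer import Tokenizer as InstantTokenizer

tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch16")
tokenizer_instant = InstantTokenizer()

batch = ["a photo of a cat", "a photo of a dog"] * 64  # 128 short inputs

%timeit tokenizer_fast(batch, add_special_tokens=False)
%timeit tokenizer_instant.tokenize_batch(batch)  # assumed to accept a list of strings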
