Loading tokenizer.model with Rust API #1518

Open
EricLBuehler opened this issue Apr 28, 2024 · 5 comments

@EricLBuehler

Hello all,

Thank you for your excellent work here. I am trying to load a tokenizer.model file in my Rust application. However, it seems that the Tokenizer::from_file function only supports loading from a tokenizer.json file. This is a problem because asking users to run a small script to produce the tokenizer.json is error-prone and hard to discover. Is there a way to load a tokenizer.model file directly?

@ArthurZucker
Collaborator

You cannot load a tokenizer.model directly; you need to write a converter.
This is because the file does not come from the tokenizers library but from either tiktoken or sentencepiece, and there is no secret recipe. We need to adapt to the content of the file, which is not entirely straightforward.

https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L544 is the simplest way to understand the process!
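
As a rough illustration of the first step of such a converter, here is a minimal std-only sketch that pulls the vocabulary out of a tokenizer.model. It assumes the file is a SentencePiece model, i.e. a serialized protobuf in which top-level field 1 holds repeated piece messages with `piece` (field 1, string), `score` (field 2, float) and `type` (field 3, enum), as laid out in sentencepiece_model.proto. The names `Piece` and `parse_pieces` are hypothetical and not part of the tokenizers crate. It does not handle tiktoken files, and it stops short of building a tokenizer.json (normalizer, pre-tokenizer, merges and so on would still need to be derived).

```rust
/// One vocabulary entry read from a SentencePiece model (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Piece {
    pub text: String,
    pub score: f32,
    /// Raw enum value of the piece type; 1 (NORMAL) is the proto default.
    pub kind: u64,
}

/// Reads a base-128 varint at `*pos` and advances `*pos` past it.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None; // malformed: varint longer than 10 bytes
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Reads a length-delimited payload (protobuf wire type 2).
fn read_bytes<'a>(buf: &'a [u8], pos: &mut usize) -> Option<&'a [u8]> {
    let len = read_varint(buf, pos)?;
    if len > (buf.len() - *pos) as u64 {
        return None; // truncated input
    }
    let end = *pos + len as usize;
    let out = &buf[*pos..end];
    *pos = end;
    Some(out)
}

/// Skips over a field this sketch does not interpret.
fn skip_field(buf: &[u8], pos: &mut usize, wire_type: u64) -> Option<()> {
    let fixed = match wire_type {
        0 => return read_varint(buf, pos).map(|_| ()),
        2 => return read_bytes(buf, pos).map(|_| ()),
        1 => 8, // 64-bit fixed
        5 => 4, // 32-bit fixed
        _ => return None, // groups and unknown wire types are not supported
    };
    buf.get(*pos..*pos + fixed)?;
    *pos += fixed;
    Some(())
}

/// Decodes one nested piece message.
fn parse_piece(buf: &[u8]) -> Option<Piece> {
    let mut piece = Piece { text: String::new(), score: 0.0, kind: 1 };
    let mut pos = 0;
    while pos < buf.len() {
        let key = read_varint(buf, &mut pos)?;
        match (key >> 3, key & 7) {
            (1, 2) => {
                piece.text = String::from_utf8_lossy(read_bytes(buf, &mut pos)?).into_owned()
            }
            (2, 5) => {
                let raw = buf.get(pos..pos + 4)?;
                piece.score = f32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]);
                pos += 4;
            }
            (3, 0) => piece.kind = read_varint(buf, &mut pos)?,
            (_, wire_type) => skip_field(buf, &mut pos, wire_type)?,
        }
    }
    Some(piece)
}

/// Collects every piece from the serialized model, ignoring all other fields.
/// Returns `None` if the bytes are not well-formed protobuf.
pub fn parse_pieces(model: &[u8]) -> Option<Vec<Piece>> {
    let mut pieces = Vec::new();
    let mut pos = 0;
    while pos < model.len() {
        let key = read_varint(model, &mut pos)?;
        match (key >> 3, key & 7) {
            (1, 2) => pieces.push(parse_piece(read_bytes(model, &mut pos)?)?),
            (_, wire_type) => skip_field(model, &mut pos, wire_type)?,
        }
    }
    Some(pieces)
}
```

The bytes would typically come from `std::fs::read` on the tokenizer.model path, and the index of each entry in the returned vector is the token id. Hand-rolling the wire format keeps the sketch dependency-free; a real converter would more likely generate types from the .proto file.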

@EricLBuehler
Author

EricLBuehler commented Apr 30, 2024

Ok, I understand. Do you know of a way or a library to do this in Rust without reaching for the Python transformers converter?

@ArthurZucker
Collaborator

There is no library for this, but we should be able to come up with a small piece of Rust code to do it 😉

@EricLBuehler
Author

@ArthurZucker are there any specifications or example loaders that I can look at to implement this?

@chenwanqq

I have the same question, for LLaVA reasons 😉
