
Saving and reloading the pretrained model's vocab breaks the tokenizer. #9024

Open
owos opened this issue Apr 23, 2024 · 2 comments
Labels
bug Something isn't working

Comments


owos commented Apr 23, 2024

Describe the bug

So I picked nvidia/parakeet-ctc-0.6b and untarred the .nemo file.
After that, I loaded the model and changed the vocab this way:

Steps/Code to reproduce bug

model.change_vocabulary(
    new_tokenizer_dir=vocab_extension_path, new_tokenizer_type="bpe"
)

where vocab_extension_path is the path of the pretrained model.

Expected behavior
The model's tokenizer is supposed to remain intact and not start generating gibberish, because I am just reloading the exact tokenizer that was used to pretrain the model.

Why I need this
I need to replace some tokens in the model's vocab while keeping the order of the tokens intact. If I can't keep the other parts of the tokenizer intact, then my replacement of tokens cannot work.
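The order-preserving replacement described above can be sketched in plain Python, independent of NeMo (a minimal sketch with a hypothetical replace_tokens helper operating on the lines of a vocab file; the tokens shown are illustrative):

```python
# Hypothetical sketch: substitute selected tokens in a vocab list
# while preserving every token's position (index = model output id).
def replace_tokens(vocab_lines, replacements):
    """replacements maps old token -> new token; order and indices are unchanged."""
    return [replacements.get(tok, tok) for tok in vocab_lines]

vocab = ["<unk>", "_the", "_cat", "s"]
new_vocab = replace_tokens(vocab, {"_cat": "_dog"})
# The replaced token keeps the exact index of the token it replaced,
# and all other tokens stay where they were.
assert new_vocab == ["<unk>", "_the", "_dog", "s"]
assert new_vocab.index("_dog") == vocab.index("_cat")
```

Because only the token strings change and never their indices, the model's output ids continue to map to the intended positions.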

@owos owos added the bug Something isn't working label Apr 23, 2024
@nithinraok
Collaborator

It should be the path to a tokenizer directory, not the model.

the directory should contain:

  • tokenizer.model
  • tokenizer.vocab
  • vocab.txt
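Before calling change_vocabulary, a quick sanity check that the directory actually contains those three files can rule out a wrong-path mistake (a minimal stdlib-only sketch; check_tokenizer_dir is a hypothetical helper, and the file names are taken from the list above):

```python
import os
import tempfile

# File names from the comment above.
REQUIRED = ("tokenizer.model", "tokenizer.vocab", "vocab.txt")

def check_tokenizer_dir(path):
    """Return the list of required tokenizer files missing from `path`."""
    return [f for f in REQUIRED if not os.path.isfile(os.path.join(path, f))]

# Demo with a temporary directory that contains only vocab.txt:
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "vocab.txt"), "w").close()
    missing = check_tokenizer_dir(d)
    assert missing == ["tokenizer.model", "tokenizer.vocab"]
```

An empty `missing` list means the directory at least has the expected layout for a BPE tokenizer dir.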


owos commented May 8, 2024

Yes, that's what I'm doing.
In fact, I've been able to edit the pretrained model's tokenizer and change the tokens inside it.
What I found is that merely reloading the pretrained tokenizer with the change_vocabulary method breaks the whole decoding process.
