
Fixed NoneType attribute crash in tokenization_utils_base.py #30721

Open
wants to merge 1 commit into main

Conversation

@ElleLeonne commented May 8, 2024

The attribute self.model_max_length is not universally set in all tokenizers.

In these cases, the tokenizer will crash the program without the listed change.

I noticed it specifically when loading tokenizers from disk, as the attribute appeared to get lost.
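
For context, a minimal, self-contained sketch of the kind of None guard this description implies; the function name and signature here are hypothetical and not the actual helper in tokenization_utils_base.py:

import warnings

def warn_if_too_long(ids, model_max_length, max_length=None, verbose=True):
    # Only warn when no explicit max_length was requested, the tokenizer actually
    # has a model_max_length set, and the encoded sequence exceeds it.
    if (
        max_length is None
        and model_max_length is not None
        and len(ids) > model_max_length
        and verbose
    ):
        warnings.warn(
            "Token indices sequence length is longer than the specified maximum "
            f"sequence length for this model ({len(ids)} > {model_max_length})."
        )

Checking model_max_length against None before the length comparison is what avoids the crash when the attribute was never set.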

@amyeroberts
Collaborator

cc @ArthurZucker

@ArthurZucker
Collaborator

Thanks! Do you have a small reproducer?

@ElleLeonne
Author

Thanks! Do you have a small reproducer?

Oh dear, thank you for asking. It appears I've made a very small mistake and jumped to conclusions early.

Allow me to properly demonstrate the problem. As part of my code, I actually manually override this attribute on the tokenizer.

However, it appears that this method loses track of the change when I do this: the attribute vanishes in the snippet below and causes a crash.

This will reproduce the issue, and the proposed change makes the code work as expected and produce the expected output; however, there may be a deeper root cause as to why the attribute is None here.

from transformers import GemmaTokenizer
import random

tokenizer = GemmaTokenizer.from_pretrained("google/Gemma-7B")
tokenizer.model_max_length = 10

# Generate some hard-to-tokenize nonsense.
string = ' '.join(random.choices('abcdefghijklmnopqrstuvwxyz', k=20))

text = tokenizer.encode(string)

@ArthurZucker
Collaborator

cc @itazap if you can have a look!

@itazap

itazap commented May 14, 2024

Hi @ElleLeonne!

I am unable to reproduce the error with the code snippet provided. I only observe the following warning:

Token indices sequence length is longer than the specified maximum sequence length for this model (21 > 10). Running this sequence through the model will result in indexing errors

After this warning, the code continues to run successfully, and the result in text is the full encoded sequence (with length 21). So even though the sequence is too long, the expected behaviour is still to encode it in full. If your use case requires it to be truncated to model_max_length, you can use tokenizer.encode(string, truncation=True).
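
For illustration, a short snippet of the truncation behaviour described above, reusing the checkpoint from the reproducer (access to that checkpoint is assumed):

from transformers import GemmaTokenizer

tokenizer = GemmaTokenizer.from_pretrained("google/Gemma-7B")
tokenizer.model_max_length = 10

string = "a b c d e f g h i j k l m n o p q r s t"

# Without truncation, the full (too long) sequence is returned and only a warning is emitted.
full_ids = tokenizer.encode(string)

# With truncation=True and no explicit max_length, the output is cut to model_max_length tokens.
truncated_ids = tokenizer.encode(string, truncation=True)
assert len(truncated_ids) <= tokenizer.model_max_length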

If you are experiencing an error / crash, can you please provide a stacktrace?

Thanks!
