
Fixed NoneType attribute crash in tokenization_utils_base.py #30721

Open
wants to merge 1 commit into main

Conversation

@ElleLeonne commented May 8, 2024

The attribute self.model_max_length is not universally set in all tokenizers.

In these cases, the tokenizer will crash the program without the listed change.

I noticed it specifically when loading tokenizers from disk, as the attribute appeared to get lost.
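
For context, a minimal, self-contained sketch of the kind of None guard this description implies; the function name and signature here are hypothetical and not the actual helper in tokenization_utils_base.py:

import warnings

def warn_if_too_long(ids, model_max_length, max_length=None, verbose=True):
    # Only warn when no explicit max_length was requested, the tokenizer actually
    # has a model_max_length set, and the encoded sequence exceeds it.
    if (
        max_length is None
        and model_max_length is not None
        and len(ids) > model_max_length
        and verbose
    ):
        warnings.warn(
            "Token indices sequence length is longer than the specified maximum "
            f"sequence length for this model ({len(ids)} > {model_max_length})."
        )

Checking model_max_length against None before the length comparison is what avoids the crash when the attribute was never set.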

@amyeroberts
Collaborator

cc @ArthurZucker

@ArthurZucker
Collaborator

Thanks! Do you have a small reproducer?

@ElleLeonne
Author

Thanks! Do you have a small reproducer?

Oh dear, thank you for asking. It appears I've made a very small mistake and jumped to conclusions early.

Allow me to properly demonstrate the problem. As part of my code, I actually manually override this attribute on the tokenizer.

However, it appears that this method loses track of the change when I do this: the attribute vanishes in the snippet below and causes a crash.

This will reproduce the issue, and the proposed change makes the code work as expected and produce the expected output; however, there may be a deeper root cause as to why the attribute is None here.

from transformers import GemmaTokenizer
import random

tokenizer = GemmaTokenizer.from_pretrained("google/Gemma-7B")
tokenizer.model_max_length = 10

# Generate some hard-to-tokenize nonsense.
string = ' '.join(random.choices('abcdefghijklmnopqrstuvwxyz', k=20))

text = tokenizer.encode(string)

@ArthurZucker
Collaborator

cc @itazap if you can have a look!

@itazap

itazap commented May 14, 2024

Hi @ElleLeonne!

I am unable to reproduce the error with the code snippet provided. I only observe the following warning:

Token indices sequence length is longer than the specified maximum sequence length for this model (21 > 10). Running this sequence through the model will result in indexing errors

After this warning, the code continues to run successfully, and the result in text is the full encoded sequence (with length 21). So even though the sequence is too long, the expected behaviour is still to encode it in full. If your use case requires it to be truncated to model_max_length, you can use tokenizer.encode(string, truncation=True).
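
For illustration, a short snippet of the truncation behaviour described above, reusing the checkpoint from the reproducer (access to that checkpoint is assumed):

from transformers import GemmaTokenizer

tokenizer = GemmaTokenizer.from_pretrained("google/Gemma-7B")
tokenizer.model_max_length = 10

string = "a b c d e f g h i j k l m n o p q r s t"

# Without truncation, the full (too long) sequence is returned and only a warning is emitted.
full_ids = tokenizer.encode(string)

# With truncation=True and no explicit max_length, the output is cut to model_max_length tokens.
truncated_ids = tokenizer.encode(string, truncation=True)
assert len(truncated_ids) <= tokenizer.model_max_length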

If you are experiencing an error / crash, can you please provide a stacktrace?

Thanks!
