
Special token handling breaks idempotency of sentencepiece due to extra spaces #1527

Open
cat-state opened this issue May 9, 2024 · 4 comments


@cat-state

cat-state commented May 9, 2024

SentencePiece tokenizers have the property that Decode(Encode(Normalize(input))) == Normalize(input). This property is very useful when combining and re-inferring prompts. However, when used through tokenizers with special tokens added for BOS/EOS etc., tokenizers will inject an extra space around special tokens when decoding: i.e., <s>A becomes <s> A, and each further encode/decode round trip adds another space (<s>  A, <s>   A, and so on).
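For example (a minimal sketch, reusing the Vicuna checkpoint from the snippet later in this thread; the exact output depends on the transformers version):

from transformers import LlamaTokenizer

# Assumption: any SentencePiece-backed tokenizer with <s>/</s> registered
# as special tokens should behave similarly.
tokenizer = LlamaTokenizer.from_pretrained(
    "lmsys/vicuna-13b-delta-v1.1", add_bos_token=False
)

text = "<s>A"
for _ in range(3):
    ids = tokenizer(text)["input_ids"]
    text = tokenizer.decode(ids)
    # on affected versions, a space after <s> accumulates on each round trip
    print(repr(text))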

A previous issue (#1237) was raised about this but was incorrectly closed as intended behavior/unfixable. Although not all tokenizers have this property, SentencePiece is very widely used now due to Llama and Mistral, so it would make sense for this behavior to be preserved.

There could be two fixes for this: either don't add the extra space, or tokenize <s> A the same as <s>A (which I think could be accomplished by changing the AddedToken parameters for these tokens); see the sketch below.
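A sketch of the second option (an assumption about which parameters matter; whether rstrip alone is sufficient depends on the tokenizer's normalizer):

from transformers import AddedToken, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-13b-delta-v1.1")

# Assumption: re-registering the special tokens with rstrip=True makes the
# tokenizer consume the whitespace to the right of the token, so that
# "<s> A" and "<s>A" encode to the same ids.
tokenizer.add_special_tokens({
    "bos_token": AddedToken("<s>", lstrip=False, rstrip=True, normalized=False),
    "eos_token": AddedToken("</s>", lstrip=False, rstrip=True, normalized=False),
})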

@ArthurZucker
Collaborator

Do you have a reproducer?
I'd love to fix it, but I'm not sure this is still happening.

@ArthurZucker
Collaborator

Llama-based tokenizers don't have this issue anymore; it was fixed by the Metaspace refactoring.
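For context, a sketch of the setting that refactoring introduced (an assumption about the relevant option; it requires a recent version of tokenizers):

from tokenizers.pre_tokenizers import Metaspace

# Assumption: Metaspace now takes a prepend_scheme argument; "first"
# prepends the '▁' prefix only to the first segment, so text that follows
# a special token such as <s> no longer gains an extra space.
pre_tok = Metaspace(replacement="▁", prepend_scheme="first")
print(pre_tok.pre_tokenize_str("Hello world"))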

@ArthurZucker
Collaborator

Are you using legacy=False? (Mistral does not.)
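For reference, the flag is passed at load time (a sketch, reusing the checkpoint from the snippet below; legacy=False opts into the corrected SentencePiece handling, while checkpoints like Mistral's still default to the legacy behavior):

from transformers import LlamaTokenizer

# Assumption: legacy=False selects the fixed handling of text that
# follows special tokens.
tokenizer = LlamaTokenizer.from_pretrained(
    "lmsys/vicuna-13b-delta-v1.1", legacy=False, add_bos_token=False
)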

@ArthurZucker
Collaborator

Also, the snippet shared:

from transformers import LlamaTokenizer
model_id = "lmsys/vicuna-13b-delta-v1.1"
tokenizer = LlamaTokenizer.from_pretrained(model_id, add_bos_token=False)
message = "<s>hello</s>"
decoded = tokenizer.decode(tokenizer(message)['input_ids'])
print(decoded, decoded == message)

this is on the transformers side, not tokenizers. I'll open a PR right away; it's super weird that it was not caught until now.
