SentencePiece tokenizers have the property that `Decode(Encode(Normalize(input))) == Normalize(input)`. This property is very useful when combining and re-inferring prompts. However, when they are used through `tokenizers` with special tokens added for BOS/EOS etc., `tokenizers` injects an extra space around special tokens when decoding: `<s>A` becomes `<s> A`, and each further encode/decode round trip adds another space (`<s>  A`, `<s>   A`, etc.).
A previous issue was raised about this but was incorrectly closed as intended behavior/unfixable: #1237. Although not all tokenizers have this round-trip property, SentencePiece is now very widely used because of Llama and Mistral, so it would make sense for this behavior to be preserved.
There could be two fixes for this: either do not add the extra space, or tokenize `<s> A` the same as `<s>A` (I think this could be accomplished by changing the `AddedToken` params for these tokens).
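To make the drift concrete, here is a minimal, self-contained sketch of the round-trip check. It does not call the real `tokenizers` library: `toy_encode` and `toy_decode` are hypothetical stand-ins written only to mimic the behavior reported above (a space inserted after a special token on decode), and `round_trip_history` and `is_round_trip_stable` are illustrative helper names. The same helpers could presumably be pointed at a real tokenizer's encode and decode functions.

```python
from typing import Callable, List

BOS = "<s>"  # special token used in the example from this issue


def toy_encode(text: str) -> List[str]:
    # Hypothetical stand-in, NOT the real tokenizer: split off a leading
    # BOS token and treat every remaining character as its own token.
    if text.startswith(BOS):
        return [BOS] + list(text[len(BOS):])
    return list(text)


def toy_decode(tokens: List[str]) -> str:
    # Hypothetical stand-in that mimics the reported behavior:
    # a space is inserted after each special token when decoding.
    return "".join(tok + " " if tok == BOS else tok for tok in tokens)


def round_trip_history(
    text: str,
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    rounds: int = 3,
) -> List[str]:
    # Record the text after each successive encode/decode round trip.
    history = [text]
    for _ in range(rounds):
        text = decode(encode(text))
        history.append(text)
    return history


def is_round_trip_stable(
    text: str,
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
) -> bool:
    # The property from the issue: decoding the encoding returns the input.
    return decode(encode(text)) == text


if __name__ == "__main__":
    for step in round_trip_history("<s>A", toy_encode, toy_decode):
        print(repr(step))
```

With the toy decoder, plain text such as `A` round-trips unchanged, while `<s>A` gains one space per round trip, which is the failure mode described above.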