import tokenizers

def show_tokenization(tok, s):
    ids = tok.encode(s, add_special_tokens=False).ids
    print([(i, tok.decode([i])) for i in ids])

def show_tokenization_from_id(tok, id):
    s = tok.decode([id])
    print(f"id {id} decodes to {s!r}, which encodes to...")
    show_tokenization(tok, s)

fb_tok = tokenizers.Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
show_tokenization_from_id(fb_tok, 112328)
v0.19.0
id 112328 decodes to ' Arthropoda', which encodes to...
[(1676, ' Ar'), (98643, 'throp'), (14320, 'oda')]
v0.19.1
id 112328 decodes to ' Arthropoda', which encodes to...
[(112328, ' Arthropoda')]
I have good evidence that the new behaviour matches how the model was trained, but the patch-release announcement should perhaps advise more loudly that users, e.g., retokenize all training data for the affected model families.
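To find out whether a given vocabulary is affected, one could scan for ids that fail to round-trip (decode then re-encode yields a different id sequence) under the installed tokenizers version. This is a minimal sketch; `find_non_roundtrip_ids` is an illustrative helper of my own, not part of the tokenizers API, and it assumes a tokenizers-style object exposing `decode` and `encode(..., add_special_tokens=False)`.

```python
def find_non_roundtrip_ids(tok, vocab_size):
    """Return every id i for which encode(decode([i])) != [i].

    Under tokenizers v0.19.0 this would flag ids like 112328
    (' Arthropoda') for the Llama 3 tokenizer; under v0.19.1 it
    should not. Hypothetical helper, not a tokenizers API method.
    """
    bad = []
    for i in range(vocab_size):
        s = tok.decode([i])
        ids = tok.encode(s, add_special_tokens=False).ids
        if ids != [i]:
            bad.append(i)
    return bad
```

Any id this returns is one whose pre-tokenized training data would differ between the two library versions, which is exactly the situation where retokenizing is warranted.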