Breaking changes in v0.19.1 for tiktoken/llama3 #1512

Open · sanderland opened this issue Apr 24, 2024 · 3 comments

@sanderland
import tokenizers

def show_tokenization(tok, s):
    # Encode without special tokens and print (id, decoded piece) pairs.
    ids = tok.encode(s, add_special_tokens=False).ids
    print([(i, tok.decode([i])) for i in ids])

def show_tokenization_from_id(tok, id):
    # Decode a single id back to text, then show how that text re-encodes.
    s = tok.decode([id])
    print(f"id {id} decodes to {s!r}, which encodes to...")
    show_tokenization(tok, s)

fb_tok = tokenizers.Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
show_tokenization_from_id(fb_tok, 112328)

v0.19.0
id 112328 decodes to ' Arthropoda', which encodes to...
[(1676, ' Ar'), (98643, 'throp'), (14320, 'oda')]

v0.19.1
id 112328 decodes to ' Arthropoda', which encodes to...
[(112328, ' Arthropoda')]

I have good evidence that the new behaviour matches how the model was trained, but the announcement of the patch release should perhaps be a little louder in advising users to, e.g., retokenize all training data for the affected model families (a quick consistency check is sketched below).
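
One quick way to see whether cached tokenizations need to be redone after upgrading is to decode the stored ids and re-encode the text with the current tokenizer. This is only a minimal sketch; the check_cached_ids helper and the cached_ids list are hypothetical, not an official migration recipe, and a decode/re-encode round trip is not a rigorous drift check in general:

import tokenizers

def check_cached_ids(tok, cached_ids):
    # Decode the cached ids back to text, re-encode with the current
    # tokenizer, and report whether the two id sequences still agree.
    text = tok.decode(cached_ids)
    fresh_ids = tok.encode(text, add_special_tokens=False).ids
    return fresh_ids == cached_ids

tok = tokenizers.Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
cached_ids = [1676, 98643, 14320]  # ' Arthropoda' as tokenized under v0.19.0
print(check_cached_ids(tok, cached_ids))  # False under v0.19.1, which encodes it as [112328]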

@thusinh1969 commented May 6, 2024

I am waiting for the #1513 breaking changes to land before starting continual pretraining of LLaMA-3 with an extended vocabulary and so on.

Not sure when this merge will happen (v0.19.2, I guess), as it is critical for LLaMA-3 on non-English corpora.

Cheers,
Steve
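
For reference, extending the vocabulary for a non-English corpus would look roughly like the sketch below. The example tokens are hypothetical placeholders (real extensions would come from a frequency analysis of the target-language corpus), and this only illustrates the tokenizers add_tokens API, not the exact continual-pretraining setup discussed here:

import tokenizers
from tokenizers import AddedToken

tok = tokenizers.Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print("vocab size before:", tok.get_vocab_size())

# Hypothetical example tokens for a non-English corpus.
new_tokens = [AddedToken("xin chào", normalized=False), AddedToken("cảm ơn", normalized=False)]
num_added = tok.add_tokens(new_tokens)
print("added:", num_added, "vocab size after:", tok.get_vocab_size())

# The decoder must round-trip added tokens correctly, which is what the
# decoder fix referenced below is about.
ids = tok.encode("xin chào", add_special_tokens=False).ids
print(ids, tok.decode(ids))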

@sanderland (Author)

@thusinh1969 What are you finding wrong with 0.19.1?

@thusinh1969 commented May 7, 2024

@thusinh1969 What are you finding wrong with 0.19.1?

The decoder was buggy for added tokens when extending the vocabulary for non-English text. I believe it is being fixed:

meta-llama/llama3#67

Steve
