Add support for loading checkpoints with newly added tokens. #272

Open
charlesCXK wants to merge 1 commit into main

Conversation

charlesCXK

No description provided.

@charlesCXK charlesCXK changed the title Add support for newly added tokens. Add support for loading checkpoints with newly added tokens. Mar 22, 2024
@danielhanchen
Contributor

Wait, would this load the lm_head and embed_tokens matrices correctly?

@danielhanchen
Contributor

Would it not cause them to be randomly initialized?

@charlesCXK
Author

Would it not cause them to be randomly initialized?

I have tested the code with the following setup:

1. First, I add new tokens to the tokenizer:
########################################
# Add special tokens to the tokenizer.
########################################
if True:
    old_vocab_size = tokenizer.vocab_size  # note: vocab_size excludes any previously added tokens
    print('old vocab size: ', old_vocab_size)
    tokenizer.add_tokens("<NEWTOKEN>", special_tokens=True)
    tokenizer.add_tokens("</NEWTOKEN>", special_tokens=True)

    # test case
    print(tokenizer.tokenize("This is an example with <NEWTOKEN> and </NEWTOKEN> token."))  

    # We resize the embeddings to avoid index errors.
    model.resize_token_embeddings(len(tokenizer))
    model.config.vocab_size = len(tokenizer)

    # Initialize the new token embeddings with the average of the existing embeddings
    num_new_tokens = len(tokenizer) - old_vocab_size
    print("num_new_tokens:", num_new_tokens)
    input_embeddings = model.get_input_embeddings().weight.data
    output_embeddings = model.get_output_embeddings().weight.data
    input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
        dim=0, keepdim=True)
    output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
        dim=0, keepdim=True)
    input_embeddings[-num_new_tokens:] = input_embeddings_avg
    output_embeddings[-num_new_tokens:] = output_embeddings_avg

    # Unfreeze lm_head and the input embeddings so the new rows can be trained
    model.lm_head.weight.requires_grad = True
    model.get_input_embeddings().weight.requires_grad = True
2. I trained the model on a dataset for several steps and saved the LoRA checkpoint:
import os
import shutil

save_path = "/home/xxx"
if os.path.exists(save_path):
    shutil.rmtree(save_path)
model.save_pretrained(save_path)
3. Then I use the saved checkpoint for inference:
print('Use saved model for inference.')
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path, # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    new_token_num = 0,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    "Continue the fibonnaci sequence. 1, 1, 2, 3, 5, 8"
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
4. The output is the same as with the original model.
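For completeness, a small sanity check along the lines below can confirm that the added tokens and their embedding rows survive the save/reload cycle. This is only a sketch: it reuses tokenizer, model, save_path, and the loading arguments from the steps above, names the reloaded objects model2 / tokenizer2, and assumes the embedding weights are directly comparable (4-bit loading may introduce small numerical differences).

# Reload the checkpoint saved in step 2.
model2, tokenizer2 = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# The reloaded vocab should contain the new tokens ...
assert "<NEWTOKEN>" in tokenizer2.get_vocab()
# ... and the embedding matrix should match the resized vocab.
assert model2.get_input_embeddings().weight.shape[0] == len(tokenizer2)

# The new rows should not be re-initialized: compare them with the rows set before saving.
new_ids = tokenizer2.convert_tokens_to_ids(["<NEWTOKEN>", "</NEWTOKEN>"])
before = model.get_input_embeddings().weight.data[new_ids].float().cpu()
after  = model2.get_input_embeddings().weight.data[new_ids].float().cpu()
print("max abs diff of new embedding rows:", (before - after).abs().max().item())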

@chtmp223

chtmp223 commented Apr 4, 2024

Hi @charlesCXK, when using this code, I noticed that the loaded model doesn't include the new token that I added before fine-tuning. Do you have to add the new token again for inference? For example,

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path, # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    new_token_num = 1,        # 1 new added token
)
if "<pad>" not in tokenizer.get_vocab():
    tokenizer.add_tokens(["<pad>"], special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))  

# Inference code goes here
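One thing worth noting about this workaround (a sketch, not part of the PR): whether the learnt row survives depends on whether the saved checkpoint already carries the extra embedding row. If it does, resize_token_embeddings(len(tokenizer)) is a no-op and only the tokenizer entry needs re-adding; if it does not, the resize allocates a freshly initialized row, so the learnt input/output rows would have to be restored by hand, e.g. from tensors exported before saving (the file name below is hypothetical):

import torch

if "<pad>" not in tokenizer.get_vocab():
    tokenizer.add_tokens(["<pad>"], special_tokens=True)
    pad_id = tokenizer.convert_tokens_to_ids("<pad>")
    if model.get_input_embeddings().weight.shape[0] < len(tokenizer):
        # The checkpoint did not carry the extra row: resize, then restore the learnt weights.
        model.resize_token_embeddings(len(tokenizer))
        saved = torch.load("new_token_embeds.pt")  # hypothetical tensors exported before saving
        model.get_input_embeddings().weight.data[pad_id]  = saved["input"]
        model.get_output_embeddings().weight.data[pad_id] = saved["output"]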

@danielhanchen
Contributor

Whoopsies sorry on the horrible delay - I'll review this PR and test it out - so sorry!

@danielhanchen
Contributor

@charlesCXK @chtmp223 Extreme apologies on the delay - I think I might have fixed it. You need to call add_new_tokens before get_peft_model to update the vocab, resize the embeddings, and also save the learnt embeddings:

from unsloth import add_new_tokens
from unsloth import FastLanguageModel

add_new_tokens(model, tokenizer, ["new_token_1", "new_token_2"])
model = FastLanguageModel.get_peft_model(model, ...)
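For anyone landing here later, a fuller sketch of how this is intended to slot into the usual flow might look like the following. The base model name and LoRA hyperparameters are placeholders taken from the common unsloth examples, not part of this PR:

from unsloth import add_new_tokens
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",  # placeholder base model
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Add the new tokens BEFORE attaching the LoRA adapters, so the vocab is
# updated and the embedding matrices are resized first.
add_new_tokens(model, tokenizer, ["new_token_1", "new_token_2"])

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
)

# ... training happens here ...

# The learnt (resized) embeddings are saved together with the LoRA checkpoint.
model.save_pretrained("lora_model")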
