
Error occurs when I add new tokens to the tokenizer. #237

Open
charlesCXK opened this issue Mar 12, 2024 · 6 comments
Labels: currently fixing (Am fixing now!)

Comments


charlesCXK commented Mar 12, 2024

Hi,
I want to add new tokens to the tokenizer through:

tokenizer.add_tokens("<NEW1>", special_tokens=True) 
tokenizer.add_tokens("<NEW2>", special_tokens=True) 
model.resize_token_embeddings(len(tokenizer))
model.config.vocab_size = len(tokenizer)

Then I save the model as LoRA adapters through:

model.save_pretrained(save_path)

When I load the model, the error occurs:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path, # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
        size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([32002, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
        size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([32002, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).

It seems that the saved checkpoint does not match the predefined model architecture (whose vocabulary still has 32000 rows). What should I do to solve this issue?

Thanks,
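One generic workaround, sketched here with plain transformers and peft rather than FastLanguageModel (the base model id is a placeholder and not from this thread), is to resize the base model's embeddings to the saved tokenizer's length before attaching the adapter:

# Sketch only: load the base model, grow its embeddings to match the checkpoint,
# then attach the LoRA adapter so the 32002-row matrices fit.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained(save_path)              # tokenizer saved with the new tokens
base_model = AutoModelForCausalLM.from_pretrained("base-model-name")  # placeholder base model id
base_model.resize_token_embeddings(len(tokenizer))                # 32000 -> 32002 rows
model = PeftModel.from_pretrained(base_model, save_path)          # adapter shapes now match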


danielhanchen commented Mar 13, 2024

@charlesCXK Oh I think you'll have to add modules_to_save, i.e. https://github.com/unslothai/unsloth/wiki#finetuning-the-lm_head-and-embed_tokens-matrices

I haven't yet fixed some parts, so hopefully I'll fix this by today! Sorry on the delay!
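For reference, the linked wiki approach looks roughly like the sketch below; it assumes FastLanguageModel.get_peft_model forwards modules_to_save to PEFT's LoraConfig, and the other arguments are illustrative defaults rather than values from this thread:

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    # Train and checkpoint the resized input/output embeddings in full,
    # so the enlarged embed_tokens / lm_head matrices are restored on load.
    modules_to_save = ["embed_tokens", "lm_head"],
)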

danielhanchen added the "currently fixing (Am fixing now!)" label on Mar 13, 2024

charlesCXK commented Mar 17, 2024

@danielhanchen
Dear author,

Thanks for your reply! I think the core problem is not related to adding modules_to_save. We can see that the model is already saved ("copying a param with shape torch.Size([32002, 4096]) from checkpoint"). I am wondering how I can load the saved model using FastLanguageModel.from_pretrained. I want to use the saved model (with the new vocabulary) for inference.


charlesCXK commented Mar 22, 2024

Dear author,
I have fixed the issue and created a pull request: #272.
Now we can successfully load checkpoints with newly added special tokens through the new_token_num argument of the FastLanguageModel.from_pretrained function.
@danielhanchen
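A hedged usage sketch of the proposed argument (its exact semantics are defined in #272; here new_token_num is assumed to be the number of tokens added on top of the base vocabulary):

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    new_token_num = 2,   # "<NEW1>" and "<NEW2>" from the snippet above
)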

danielhanchen commented

@charlesCXK Oh thanks!! So sorry again on the issue! I'll take a look at your PR - thanks so much again!


chtmp223 commented Apr 4, 2024

Hi, bumping this up again! I added a new token to the tokenizer. Now I want to load my LoRA checkpoint using from_pretrained, but I got the same error:

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
	size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
	size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).

@danielhanchen Would you mind reviewing charlesCXK's PR?

danielhanchen commented

@charlesCXK @chtmp223 Whoops I actually totally missed this, but now using resize_model_vocab = True in FastLanguageModel.from_pretrained(...) should hopefully fix the issue
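For anyone landing here with the same mismatch, the suggested call looks roughly like this (a sketch only; depending on the installed Unsloth version, resize_model_vocab may expect the new vocabulary size, e.g. len(tokenizer), rather than a boolean):

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    resize_model_vocab = True,   # or the target vocab size, depending on version
)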
