
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained #30762

Open

yingqianch opened this issue May 11, 2024 · 5 comments

@yingqianch commented May 11, 2024

System Info

Hello, I am running the Qwen1.5-0.5B-Chat model. According to https://modelscope.cn/models/qwen/Qwen1.5-0.5B-Chat/summary, the Quickstart section gives:

from modelscope import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen1.5-0.5B-Chat",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen1.5-0.5B-Chat")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

I put the code in run_qwen-1.5-0.5B-Chat.py. When I run run_qwen-1.5-0.5B-Chat.py, I get the warning: Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. And then there's no output. I don't know how to resolve the warning.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from modelscope import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen1.5-0.5B-Chat",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen1.5-0.5B-Chat")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

Expected behavior

Does anyone else run into the same problem with the Qwen1.5-0.5B-Chat model? If anyone knows the solution to this problem, please let me know; I would be very grateful.

@amyeroberts (Collaborator) commented

cc @ArthurZucker

@ArthurZucker (Collaborator) commented

Hey! You are using modelscope, which is not transformers.
Anyway, the warning just means that special tokens have been added to the tokenizer; more often than not, the word embeddings are already correctly resized.
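
A minimal sketch of how one might verify this, assuming the `model` and `tokenizer` objects from the script above (modelscope's classes wrap the transformers ones, so `len(tokenizer)` and `get_input_embeddings` should behave the same):

# Sketch: compare the tokenizer's vocabulary size (including added special
# tokens) against the number of rows in the model's input embedding matrix.
vocab_size = len(tokenizer)
embedding_rows = model.get_input_embeddings().weight.shape[0]
print(f"tokenizer vocab: {vocab_size}, embedding rows: {embedding_rows}")
# If embedding_rows >= vocab_size, every token id maps to a valid row and
# the warning can be safely ignored.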

@arihant-neohuman commented May 13, 2024

@ArthurZucker, I got a bit confused by your response. Please correct me if I'm wrong: the model's embedding size doesn't need to be resized in the mentioned code, since the tokenizer already has the instruction tokens added?

@arasaahov commented
print(response)
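
(That is, the reproduction script decodes the response but never displays it, which would explain the "no output"; appending the line below to the end of the script makes the generation visible:)

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)  # the posted script stops after decoding, so nothing is shown without this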

@ArthurZucker (Collaborator) commented

It does: if you add a token to the tokenizer, you increase the vocab size. But if you don't do the equivalent operation on the embedding matrix, then you are going to index past the dimension of the embedding matrix!
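
A minimal sketch of the add-then-resize pattern described here, assuming a transformers-style model and tokenizer; the token name is purely illustrative:

# Illustrative only: "<my_new_token>" is a made-up token, not one Qwen uses.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<my_new_token>"]})
if num_added > 0:
    # Grow the embedding matrix so the new token ids index valid rows.
    # The new rows are randomly initialized, hence the warning: they should
    # be fine-tuned or trained before the new tokens are used for inference.
    model.resize_token_embeddings(len(tokenizer))

Skipping the resize leaves token ids that are larger than the embedding matrix, which is the out-of-range indexing described above.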
