Hey guys, I am trying to add some custom tokens to a code model, deepseek-ai/deepseek-coder-6.7b-instruct. However, the model completely loses performance after I use tokenizer.add_tokens(special_tokens).
For this model, the embedding layer has shape [32256, 4096], while the initial vocabulary size is 32022. It looks like the model already reserves some slots for extra tokens, so I did not call model.resize_token_embeddings(len(tokenizer)) to enlarge the embedding layer.
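For reference, here is a minimal sketch of how these sizes can be compared (standard transformers API; nothing here is specific to my setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Compare the embedding matrix with the tokenizer vocabulary to see how
# many embedding rows are not yet used by the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")

embed_rows = model.get_input_embeddings().weight.shape[0]
print(f"embedding rows: {embed_rows}")                   # 32256
print(f"tokenizer size: {len(tokenizer)}")               # 32022
print(f"free slots:     {embed_rows - len(tokenizer)}")  # 234
```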
After adding the new tokens to the tokenizer, I checked the encoding behavior with the following code:
```python
from transformers import AutoModel, AutoTokenizer

# Text
text = """
Translate this Fortran code to C++:
program DRB093_doall2_collapse_orig_no
use omp_lib
use DRB093
implicit none
integer :: len, i, j
len = 100
allocate (a(len, len))
do i = 1, len
do j = 1, len
a(i, j) = a(i, j) + 1
end do
end do
!$omp end parallel do
end program
"""

# Check old tokenizer
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
print(f"len(tokenizer): {len(tokenizer)}")  # ---> 32022
ids = tokenizer.encode(text, add_special_tokens=True)
print(f"ids before: {ids}")

# Check new tokenizer
special_tokens = ["<API_PYTHON_START>", "<API_PYTHON_STOP>", "<API_PIP_START>", "<API_PIP_STOP>"]
tokenizer.add_tokens(special_tokens)
tokenizer.save_pretrained("/path/to/output/directory/")
tokenizer = AutoTokenizer.from_pretrained("/path/to/output/directory/")
print(f"len(tokenizer): {len(tokenizer)}")  # ---> 32026
ids = tokenizer.encode(text, add_special_tokens=True)
print(f"ids after: {ids}")

# Save the model (AutoModel loads the base transformer, without the causal-LM head)
model = AutoModel.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
model.save_pretrained("/path/to/output/directory/")
```
The encoded ids are exactly the same before and after adding the tokens (which makes sense, since the sample text does not contain any of the new tokens). However, when I use the new tokenizer to generate output with this model, it completely loses performance.
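As a quick sanity check (a minimal sketch; the probe string is made up), a prompt that actually contains one of the added tokens should encode to a single new id, 32022 or above, rather than being split into sub-tokens:

```python
from transformers import AutoTokenizer

# Probe the saved tokenizer with a string that contains an added token.
tokenizer = AutoTokenizer.from_pretrained("/path/to/output/directory/")
probe = "<API_PYTHON_START>print('hello')<API_PYTHON_STOP>"
print(tokenizer.encode(probe, add_special_tokens=False))
print(tokenizer.convert_tokens_to_ids("<API_PYTHON_START>"))  # expected: 32022
```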
Here is the script that I use to generate the output:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

checkpoint = "/path/to/output/directory/"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(checkpoint, legacy=False)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
)

input_prompt = """
Translate this Fortran code to C++:
program DRB093_doall2_collapse_orig_no
use omp_lib
use DRB093
implicit none
integer :: len, i, j
len = 100
allocate (a(len, len))
do i = 1, len
do j = 1, len
a(i, j) = a(i, j) + 1
end do
end do
!$omp end parallel do
end program
"""
print(f"input_prompt: \n{input_prompt}")
new_chars = pipe(
    input_prompt,
    do_sample=True,
    temperature=0.2,
    max_new_tokens=512,
)[0]["generated_text"][len(input_prompt):]
print(f"Answer: {new_chars}")
```
The original model works well (checkpoint = "deepseek-ai/deepseek-coder-6.7b-instruct"), but the same model with the new tokenizer performs badly (checkpoint = "/path/to/output/directory/").
Are there any suggestions? Thanks!
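One diagnostic that might help narrow this down (a sketch only; I have not confirmed it isolates the issue): run identical input ids through the original and the re-saved checkpoint and compare the logits. If they differ, the problem is in the saved weights (for example, a head that was dropped or re-initialized when saving through AutoModel) rather than in the tokenizer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same input ids through both checkpoints; any large difference in logits
# points at the weights, not the tokenizer.
tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
ids = tok("def add(a, b):", return_tensors="pt").input_ids

orig = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
resaved = AutoModelForCausalLM.from_pretrained("/path/to/output/directory/")

with torch.no_grad():
    diff = (orig(ids).logits - resaved(ids).logits).abs().max()
print(diff)
```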