
Deepseeker model completely loses performance after using tokenizer.add_tokens(special_tokens) #1490

Closed
bin123apple opened this issue Apr 11, 2024 · 1 comment
Hey guys, I am trying to add some customized new tokens to a code-related model, deepseek-ai/deepseek-coder-6.7b-instruct. However, the model completely loses performance after I call tokenizer.add_tokens(special_tokens).

Here is how I add new tokens:

special_tokens = ["<API_PYTHON_START>", "<API_PYTHON_STOP>", "<API_PIP_START>", "<API_PIP_STOP>"]
tokenizer.add_tokens(special_tokens)

For this model, the embedding layer has shape [32256, 4096], while the initial vocabulary size is 32022. It looks like the model reserves some slots for special tokens, so I did not call model.resize_token_embeddings(len(tokenizer)) to enlarge the embedding layer.
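To make the size argument concrete, here is a toy sketch (shapes taken from the issue, hidden size shrunk, numpy stand-in rather than the actual model) of why no resize is needed for indexing, even though the extra rows were never trained:

```python
import numpy as np

# Toy stand-in for the embedding matrix: 32256 reserved rows, of which only
# the first 32022 correspond to trained vocabulary entries.
vocab_size, reserved_rows, hidden = 32022, 32256, 8  # hidden shrunk for the sketch
rng = np.random.default_rng(0)
embedding = rng.normal(size=(reserved_rows, hidden))

# add_tokens assigns the next free ids: 32022..32025 for the four markers.
new_token_ids = [32022, 32023, 32024, 32025]

# The new ids index into the matrix without resizing...
assert max(new_token_ids) < reserved_rows
# ...but the rows they hit were never trained, so their values are
# effectively random with respect to the model.
print(embedding[new_token_ids].shape)  # (4, 8)
```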

After adding the new tokens to the tokenizer, I checked the encoding behavior with the following code:

from transformers import AutoModel, AutoTokenizer

# Text
text = """
Translate this Fortran code to C++: 
program DRB093_doall2_collapse_orig_no
  use omp_lib
  use DRB093
  implicit none

  integer :: len, i, j
  len = 100  

  allocate (a(len, len))

  do i = 1, len
    do j = 1, len
      a(i, j) = a(i, j) + 1
    end do
  end do
  !$omp end parallel do

end program
"""

# Check old tokenizer
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
print(f"len(tokenizer): {len(tokenizer)}") # ---> 32022
id = tokenizer.encode(text, add_special_tokens=True)
print(f"id before: {id}")

# Check new tokenizer
special_tokens = ["<API_PYTHON_START>", "<API_PYTHON_STOP>", "<API_PIP_START>", "<API_PIP_STOP>"]
tokenizer.add_tokens(special_tokens)
tokenizer.save_pretrained("/path/to/output/directory/")
tokenizer = AutoTokenizer.from_pretrained("/path/to/output/directory/")
print(f"len(tokenizer): {len(tokenizer)}") # ---> 32026
id = tokenizer.encode(text, add_special_tokens=True)
print(f"id after: {id}")

# Save model
model = AutoModel.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
model.save_pretrained("/path/to/output/directory/")

I found the encoded ids are exactly the same before and after adding the tokens.
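Identical ids are expected for this particular text: the Fortran prompt contains none of the four new markers, so the added-token matcher never fires and encode() has nothing new to do. A minimal pure-string sketch of that check:

```python
special_tokens = ["<API_PYTHON_START>", "<API_PYTHON_STOP>",
                  "<API_PIP_START>", "<API_PIP_STOP>"]

# The Fortran prompt used above contains none of the added markers, so
# adding them cannot change how this string is tokenized.
text = "Translate this Fortran code to C++: \nprogram DRB093_doall2_collapse_orig_no"
assert not any(tok in text for tok in special_tokens)

# A string that does contain a marker is the case where the encoding
# would actually differ before and after add_tokens.
marked = "<API_PYTHON_START>print('hi')<API_PYTHON_STOP>"
assert any(tok in marked for tok in special_tokens)
```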

However, when I use the new tokenizer to generate results with this model, it completely loses performance.

Here is the script that I use to generate the output:

import torch
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "/path/to/output/directory/"

model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                             device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(checkpoint,
                                          legacy=False)

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                torch_dtype=torch.float16)

input_prompt = """
Translate this Fortran code to C++: 
program DRB093_doall2_collapse_orig_no
  use omp_lib
  use DRB093
  implicit none

  integer :: len, i, j
  len = 100  

  allocate (a(len, len))

  do i = 1, len
    do j = 1, len
      a(i, j) = a(i, j) + 1
    end do
  end do
  !$omp end parallel do

end program
"""

print(f"input_prompt: \n{input_prompt}")
new_chars = pipe(input_prompt, 
            do_sample=True,
            temperature=0.2,
            max_new_tokens=512,)[0]["generated_text"][len(input_prompt):]
print(f"Answer:{new_chars}")

The original model works well (checkpoint = "deepseek-ai/deepseek-coder-6.7b-instruct").

The same model performs badly with the new tokenizer (checkpoint = "/path/to/output/directory/").

Are there any suggestions? Thanks!
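One thing worth checking (my assumption, not something confirmed in this issue): even though the reserved rows make the new ids safe to index, they were never trained, so any path that reaches them produces effectively random vectors. A common heuristic when extending a vocabulary is to initialize the new rows to the mean of the trained embeddings before fine-tuning. A toy numpy sketch of that idea (same shrunken shapes as above, not the actual model):

```python
import numpy as np

# Toy sketch: initialize untrained embedding rows to the mean of the trained
# rows, so new tokens start near the trained embedding distribution instead
# of at a random point. This is a heuristic, not a fix from the issue.
vocab_size, reserved_rows, hidden = 32022, 32256, 8  # hidden shrunk for the sketch
rng = np.random.default_rng(0)
embedding = rng.normal(size=(reserved_rows, hidden))

new_token_ids = [32022, 32023, 32024, 32025]
mean_vec = embedding[:vocab_size].mean(axis=0)
embedding[new_token_ids] = mean_vec

# All four new rows now equal the mean of the trained rows.
assert np.allclose(embedding[new_token_ids], mean_vec)
```

With transformers, the analogous operation would edit model.get_input_embeddings().weight in place under torch.no_grad(); the snippet above only illustrates the initialization logic itself.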

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 12, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 18, 2024