
Deepseeker model completely loses performance after using tokenizer.add_tokens(special_tokens) #1490

Closed
bin123apple opened this issue Apr 11, 2024 · 1 comment
Hey guys, I am trying to add some customized new tokens to a code-related model, deepseek-ai/deepseek-coder-6.7b-instruct. However, the model completely loses performance after I call tokenizer.add_tokens(special_tokens).

Here is how I add new tokens:

special_tokens = ["<API_PYTHON_START>", "<API_PYTHON_STOP>", "<API_PIP_START>", "<API_PIP_STOP>"]
tokenizer.add_tokens(special_tokens)

For this model, the embedding layer has shape [32256, 4096], while the initial vocabulary size is 32022. It looks like the model reserves some slots for special tokens, so I did not call model.resize_token_embeddings(len(tokenizer)) to enlarge the embedding layer.
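To make the size argument concrete, here is a toy sketch (shapes taken from the issue, hidden size shrunk, numpy stand-in rather than the actual model) of why no resize is needed for indexing, even though the extra rows were never trained:

```python
import numpy as np

# Toy stand-in for the embedding matrix: 32256 reserved rows, of which only
# the first 32022 correspond to trained vocabulary entries.
vocab_size, reserved_rows, hidden = 32022, 32256, 8  # hidden shrunk for the sketch
rng = np.random.default_rng(0)
embedding = rng.normal(size=(reserved_rows, hidden))

# add_tokens assigns the next free ids: 32022..32025 for the four markers.
new_token_ids = [32022, 32023, 32024, 32025]

# The new ids index into the matrix without resizing...
assert max(new_token_ids) < reserved_rows
# ...but the rows they hit were never trained, so their values are
# effectively random with respect to the model.
print(embedding[new_token_ids].shape)  # (4, 8)
```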

After adding the new tokens to the tokenizer, I checked the encoding behavior with the following code:

from transformers import AutoModel, AutoTokenizer

# Text
text = """
Translate this Fortran code to C++: 
program DRB093_doall2_collapse_orig_no
  use omp_lib
  use DRB093
  implicit none

  integer :: len, i, j
  len = 100  

  allocate (a(len, len))

  do i = 1, len
    do j = 1, len
      a(i, j) = a(i, j) + 1
    end do
  end do
  !$omp end parallel do

end program
"""

# Check old tokenizer
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
print(f"len(tokenizer): {len(tokenizer)}") # ---> 32022
id = tokenizer.encode(text, add_special_tokens=True)
print(f"id before: {id}")

# Check new tokenizer
special_tokens = ["<API_PYTHON_START>", "<API_PYTHON_STOP>", "<API_PIP_START>", "<API_PIP_STOP>"]
tokenizer.add_tokens(special_tokens)
tokenizer.save_pretrained("/path/to/output/directory/")
tokenizer = AutoTokenizer.from_pretrained("/path/to/output/directory/")
print(f"len(tokenizer): {len(tokenizer)}") # ---> 32026
id = tokenizer.encode(text, add_special_tokens=True)
print(f"id after: {id}")

# Save model
model = AutoModel.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
model.save_pretrained("/path/to/output/directory/")

I found the encoded ids are exactly the same before and after adding the tokens.
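Identical ids are expected for this particular text: the Fortran prompt contains none of the four new markers, so the added-token matcher never fires and encode() has nothing new to do. A minimal pure-string sketch of that check:

```python
special_tokens = ["<API_PYTHON_START>", "<API_PYTHON_STOP>",
                  "<API_PIP_START>", "<API_PIP_STOP>"]

# The Fortran prompt used above contains none of the added markers, so
# adding them cannot change how this string is tokenized.
text = "Translate this Fortran code to C++: \nprogram DRB093_doall2_collapse_orig_no"
assert not any(tok in text for tok in special_tokens)

# A string that does contain a marker is the case where the encoding
# would actually differ before and after add_tokens.
marked = "<API_PYTHON_START>print('hi')<API_PYTHON_STOP>"
assert any(tok in marked for tok in special_tokens)
```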

However, when I use the new tokenizer to generate results with this model, it completely loses performance.

Here is the script that I use to generate the output:

import torch
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "/path/to/output/directory/"

model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                             device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(checkpoint,
                                          legacy=False)

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                torch_dtype=torch.float16)

input_prompt = """
Translate this Fortran code to C++: 
program DRB093_doall2_collapse_orig_no
  use omp_lib
  use DRB093
  implicit none

  integer :: len, i, j
  len = 100  

  allocate (a(len, len))

  do i = 1, len
    do j = 1, len
      a(i, j) = a(i, j) + 1
    end do
  end do
  !$omp end parallel do

end program
"""

print(f"input_prompt: \n{input_prompt}")
new_chars = pipe(input_prompt, 
            do_sample=True,
            temperature=0.2,
            max_new_tokens=512,)[0]["generated_text"][len(input_prompt):]
print(f"Answer:{new_chars}")

The original model works well (checkpoint = "deepseek-ai/deepseek-coder-6.7b-instruct").

The same model performs badly with the new tokenizer (checkpoint = "/path/to/output/directory/").

Are there any suggestions? Thanks!
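One thing worth checking (my assumption, not something confirmed in this issue): even though the reserved rows make the new ids safe to index, they were never trained, so any path that reaches them produces effectively random vectors. A common heuristic when extending a vocabulary is to initialize the new rows to the mean of the trained embeddings before fine-tuning. A toy numpy sketch of that idea (same shrunken shapes as above, not the actual model):

```python
import numpy as np

# Toy sketch: initialize untrained embedding rows to the mean of the trained
# rows, so new tokens start near the trained embedding distribution instead
# of at a random point. This is a heuristic, not a fix from the issue.
vocab_size, reserved_rows, hidden = 32022, 32256, 8  # hidden shrunk for the sketch
rng = np.random.default_rng(0)
embedding = rng.normal(size=(reserved_rows, hidden))

new_token_ids = [32022, 32023, 32024, 32025]
mean_vec = embedding[:vocab_size].mean(axis=0)
embedding[new_token_ids] = mean_vec

# All four new rows now equal the mean of the trained rows.
assert np.allclose(embedding[new_token_ids], mean_vec)
```

With transformers, the analogous operation would edit model.get_input_embeddings().weight in place under torch.no_grad(); the snippet above only illustrates the initialization logic itself.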

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 12, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 18, 2024