LlamaTokenizer class issue #40

Open
SJayYangNN opened this issue Aug 8, 2023 · 0 comments
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. The class this function is called from is 'LlamaTokenizer'.

Hi! I'm running the LLM Tuner UI and ran into this issue, which was solved in another issue: https://github.com/huggingface/transformers/issues/22222#issuecomment-1477171703. However, whenever I simply change the tokenizer class name to LlamaTokenizer in the tokenizer_config.json inside the Hugging Face cache (~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf), other issues pop up when running the app.
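For reference, a minimal sketch of that edit, assuming the standard Hugging Face cache layout (the snapshot directory name varies per download, so it is globbed here rather than hard-coded):

```python
# Minimal sketch of the fix from the linked transformers issue: rename the
# tokenizer class in the cached tokenizer_config.json from "LLaMATokenizer"
# to "LlamaTokenizer" so it matches the actual transformers class name.
import json
from pathlib import Path

cache = Path.home() / ".cache/huggingface/hub/models--decapoda-research--llama-7b-hf"
for cfg_path in cache.glob("snapshots/*/tokenizer_config.json"):
    cfg = json.loads(cfg_path.read_text())
    cfg["tokenizer_class"] = "LlamaTokenizer"  # was "LLaMATokenizer"
    cfg_path.write_text(json.dumps(cfg, indent=2))
```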

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 33/33 [00:13<00:00,  2.52it/s]
Traceback (most recent call last):
  File "llm_tuner/app.py", line 147, in <module>
    fire.Fire(main)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "llm_tuner/app.py", line 119, in main
    prepare_base_model(Config.default_base_model_name)
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 262, in prepare_base_model
    Global.new_base_model_that_is_ready_to_be_used = get_new_base_model(
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 80, in get_new_base_model
    tokenizer = get_tokenizer(base_model_name)
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 156, in get_tokenizer
    raise e
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 143, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 700, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1811, in from_pretrained
    return cls._from_pretrained(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1965, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 89, in __init__
    super().__init__(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 114, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 1288, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 445, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
    _descriptor.EnumValueDescriptor(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 796, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
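For context, this final TypeError appears to be the well-known incompatibility between protobuf >= 4.21 and pre-generated `_pb2` modules such as transformers' `sentencepiece_model_pb2`, rather than anything specific to the tokenizer rename. A minimal sketch of the commonly cited workarounds (the version pin is an assumption, untested here):

```python
# Sketch of the usual workarounds for "Descriptors cannot not be created
# directly" (protobuf >= 4.21 refusing to load older generated _pb2 code):
import os

# Option 1: force the pure-Python protobuf implementation. This must run
# before protobuf/transformers are imported (or be exported in the shell).
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# Option 2 (shell): downgrade protobuf to the 3.20.x series, e.g.
#   pip install "protobuf==3.20.3"
```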

Any ideas on how to tackle this so that the model and tokenizer match properly? And any insight into whether fine-tuning results will be affected if the class names weren't matched up earlier?
