Megatron-LM for LLaMa3 #818
There is a tokenizer.model file in the Hugging Face checkpoints under the /original folder; see https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/original/tokenizer.model |
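For reference, a minimal sketch of pulling that file down programmatically with huggingface_hub (this assumes you have been granted access to the gated meta-llama repo and are either logged in via `huggingface-cli login` or pass a token):

```python
# Minimal sketch: download Llama 3's original tokenizer.model from the Hub.
# Assumes huggingface_hub is installed and you have access to the gated repo.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    filename="original/tokenizer.model",
    # token="hf_...",  # only needed if not logged in via `huggingface-cli login`
)
print(path)  # local cache path of the downloaded file
```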
Try starting again from scratch, carefully checking what you typed. Sometimes machines fail. |
Also check out the llama3 example in the NeMo Framework Launcher: https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/examples/training/llama/h100/llama3_8b_bf16.sh |
This doesn't work directly because that model file cannot be loaded by sentencepiece. Here's the error:
Traceback (most recent call last):
  File "/Users/dsdsdds/Downloads/check_tokenizer_model.py", line 5, in <module>
    print(sp.Load("./tokenizer.model"))
  File "/Users/dsdsdds/anaconda3/envs/moe/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/Users/dsdsdds/anaconda3/envs/moe/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from ./tokenizer.model
|
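For completeness, a minimal repro of that failure (assuming sentencepiece is installed and Llama 3's tokenizer.model sits in the current directory):

```python
# Minimal repro: Llama 3's tokenizer.model is not a sentencepiece ModelProto,
# so Load() raises "RuntimeError: Internal: could not parse ModelProto".
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
print(sp.Load("./tokenizer.model"))
```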
That's true! But bypassing this is pretty easy: just create a new tokenizer class like the Llama 2 one (https://github.com/NVIDIA/Megatron-LM/blob/c3677e09aa4e2eec37048307bd795928b8f8324a/megatron/training/tokenizer/tokenizer.py#L441). You can set self.tokenizer = AutoTokenizer.from_pretrained() and adapt a few methods (for example, def tokenize(...): return self.tokenizer(...)). |
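To make that concrete, here is a rough sketch of such a wrapper. The class name and method set (tokenize, detokenize, vocab_size, eod) are assumptions modeled on Megatron's existing tokenizer classes; check the base class in megatron/training/tokenizer/tokenizer.py for the exact interface your Megatron version expects:

```python
# Rough sketch of a Hugging Face-backed Llama 3 tokenizer for Megatron-LM.
# Assumes transformers is installed; the interface mirrors Megatron's
# existing tokenizer classes but is not taken verbatim from the repo.
from transformers import AutoTokenizer


class HuggingFaceLlama3Tokenizer:
    def __init__(self, model_name_or_path="meta-llama/Meta-Llama-3-8B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    def tokenize(self, text):
        # Return plain token ids rather than a BatchEncoding.
        return self.tokenizer(text, add_special_tokens=False)["input_ids"]

    def detokenize(self, token_ids):
        return self.tokenizer.decode(token_ids)

    @property
    def vocab_size(self):
        return len(self.tokenizer)

    @property
    def eod(self):
        # Llama 3 has no separate end-of-document token; eos is the usual stand-in.
        return self.tokenizer.eos_token_id
```

You can then wire this into Megatron's build_tokenizer() behind a new --tokenizer-type value (names here assume the current layout of tokenizer.py).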
True
|
I also encountered the same problem. Could you please share your configuration? Thank you very much! |
I'm attempting to train LLaMA-3 using Megatron-LM but have encountered an issue: LLaMA-3 utilizes Tiktoken for tokenization and doesn't provide a tokenizer.model file, which is required by Megatron-LM. How can I adapt or generate a compatible tokenizer.model for Megatron-LM? Any guidance or workaround would be greatly appreciated!
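For context on the format mismatch: the tokenizer.model that ships under /original is a tiktoken BPE-ranks file, not a sentencepiece ModelProto, which is why sentencepiece refuses to parse it. A minimal sketch of reading it the way Meta's reference code does (assumes tiktoken and blobfile are installed):

```python
# Llama 3's tokenizer.model stores tiktoken BPE merge ranks, not a
# sentencepiece ModelProto -- hence the "could not parse ModelProto" error.
from tiktoken.load import load_tiktoken_bpe  # needs blobfile for local paths

ranks = load_tiktoken_bpe("./tokenizer.model")
print(len(ranks))  # number of base BPE tokens (128000 for Llama 3)
```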