
Megatron-LM for LLaMa3 #818

Open
SDsly opened this issue May 10, 2024 · 7 comments

@SDsly

SDsly commented May 10, 2024

I'm attempting to train LLaMA-3 using Megatron-LM but have encountered an issue: LLaMA-3 utilizes Tiktoken for tokenization and doesn't provide a tokenizer.model file, which is required by Megatron-LM. How can I adapt or generate a compatible tokenizer.model for Megatron-LM? Any guidance or workaround would be greatly appreciated!

@TJ-Solergibert

There is a tokenizer.model file in the Hugging Face checkpoint under the /original folder; see https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/original/tokenizer.model

@felipeliliti

Try starting over from scratch, carefully reviewing what you typed. Sometimes machines fail.

@ethanhe42
Member

Also check out the Llama 3 example in the Megatron launcher: https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/examples/training/llama/h100/llama3_8b_bf16.sh

@shamanez

shamanez commented May 19, 2024

> There is a tokenizer.model file in the Hugging Face checkpoint under the /original folder; see https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/original/tokenizer.model

This doesn't work directly, because this model file can't be loaded by SentencePiece.

Here's the error:

Traceback (most recent call last):
  File "/Users/dsdsdds/Downloads/check_tokenizer_model.py", line 5, in <module>
    print(sp.Load("./tokenizer.model"))
  File "/Users/dsdsdds/anaconda3/envs/moe/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/Users/dsdsdds/anaconda3/envs/moe/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from ./tokenizer.model
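
(Aside: SentencePiece rejects the file because Llama 3 ships a tiktoken BPE ranks file under the tokenizer.model name, not a SentencePiece ModelProto. A minimal sketch of loading it with tiktoken instead; the `pat_str` and the two special tokens below are copied/abridged from Meta's llama3 repo and should be treated as assumptions, not the full set:)

```python
# Sketch: load Llama 3's tokenizer.model with tiktoken rather than SentencePiece.
from tiktoken.load import load_tiktoken_bpe
import tiktoken

# BPE merge ranks; this is the actual content of Llama 3's tokenizer.model.
mergeable_ranks = load_tiktoken_bpe("./tokenizer.model")

enc = tiktoken.Encoding(
    name="llama3",
    # Split regex taken from Meta's llama3 repo (assumption: unchanged upstream).
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    # Only the first two of Llama 3's reserved special tokens, for brevity.
    special_tokens={
        "<|begin_of_text|>": len(mergeable_ranks),
        "<|end_of_text|>": len(mergeable_ranks) + 1,
    },
)
print(enc.encode("Hello, Llama 3!"))
```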


@TJ-Solergibert

> This doesn't work directly, because this model file can't be loaded by SentencePiece.

That's true! But bypassing this is pretty easy: just create a new tokenizer class like the Llama 2 one (`class _Llama2Tokenizer(_SentencePieceTokenizer)`). You can do `self.tokenizer = AutoTokenizer.from_pretrained(...)` and change a few of its methods (for example, `def tokenize(...): return self.tokenizer(...)`).
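
(A minimal sketch of that wrapper, assuming Megatron-LM's usual tokenizer interface of tokenize/detokenize/vocab_size/eod as in `_SentencePieceTokenizer`; the class name and wiring here are illustrative, so adapt them to the actual base class in megatron/training/tokenizer/tokenizer.py in your checkout:)

```python
# Sketch: a Megatron-style tokenizer backed by Hugging Face's AutoTokenizer,
# sidestepping SentencePiece entirely. Names mirror, but are not, the actual
# Megatron-LM classes.
from transformers import AutoTokenizer


class _HFLlama3Tokenizer:
    """Drop-in style replacement for the SentencePiece-based tokenizers."""

    def __init__(self, model_name_or_path):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    def tokenize(self, text):
        # Megatron expects a plain list of token ids.
        return self.tokenizer(text, add_special_tokens=False)["input_ids"]

    def detokenize(self, ids):
        return self.tokenizer.decode(ids)

    @property
    def vocab_size(self):
        return len(self.tokenizer)

    @property
    def eod(self):
        # Llama 3's <|end_of_text|> doubles as the end-of-document token.
        return self.tokenizer.eos_token_id
```

To use it, add a branch for it where Megatron builds its tokenizer (`build_tokenizer` in megatron/training/tokenizer/tokenizer.py) and pass the Hugging Face model path through the existing tokenizer arguments.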

@shamanez

shamanez commented May 20, 2024 via email

@IronMan-WangJinxi

Really.

I also encountered the same problem. Could you please share your configuration? Thank you very much
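
(For anyone after a concrete configuration: a hedged sketch of the relevant flags, assuming a Megatron-LM checkout recent enough to ship a HuggingFaceTokenizer type; older checkouts need the wrapper-class approach above. Verify the available --tokenizer-type choices in your checkout's arguments.py.)

```sh
# Illustrative flags only; everything else follows the launcher example linked above.
torchrun pretrain_gpt.py \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    ...
```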
