
Megatron-LM for LLaMa3 #818

Open
SDsly opened this issue May 10, 2024 · 7 comments

@SDsly

SDsly commented May 10, 2024

I'm attempting to train LLaMA-3 using Megatron-LM but have encountered an issue: LLaMA-3 utilizes Tiktoken for tokenization and doesn't provide a tokenizer.model file, which is required by Megatron-LM. How can I adapt or generate a compatible tokenizer.model for Megatron-LM? Any guidance or workaround would be greatly appreciated!

@TJ-Solergibert

There is a tokenizer.model file in the Hugging Face checkpoint under the /original folder; see https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/original/tokenizer.model

@felipeliliti

Try starting over from scratch, carefully reviewing what you typed. Sometimes machines fail.

@ethanhe42
Member

Also check out the Llama 3 example in the Megatron launcher: https://github.com/NVIDIA/NeMo-Framework-Launcher/blob/main/examples/training/llama/h100/llama3_8b_bf16.sh

@shamanez

shamanez commented May 19, 2024

> There is a tokenizer.model file in the Hugging Face checkpoint under the /original folder; see https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/original/tokenizer.model

This doesn't work directly, because this model file can't be loaded by SentencePiece.

Here's the error:

Traceback (most recent call last):
  File "/Users/dsdsdds/Downloads/check_tokenizer_model.py", line 5, in <module>
    print(sp.Load("./tokenizer.model"))
  File "/Users/dsdsdds/anaconda3/envs/moe/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/Users/dsdsdds/anaconda3/envs/moe/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from ./tokenizer.model
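
(Aside: SentencePiece rejects the file because Llama 3 ships a tiktoken BPE ranks file under the tokenizer.model name, not a SentencePiece ModelProto. A minimal sketch of loading it with tiktoken instead; the `pat_str` and the two special tokens below are copied/abridged from Meta's llama3 repo and should be treated as assumptions, not the full set:)

```python
# Sketch: load Llama 3's tokenizer.model with tiktoken rather than SentencePiece.
from tiktoken.load import load_tiktoken_bpe
import tiktoken

# BPE merge ranks; this is the actual content of Llama 3's tokenizer.model.
mergeable_ranks = load_tiktoken_bpe("./tokenizer.model")

enc = tiktoken.Encoding(
    name="llama3",
    # Split regex taken from Meta's llama3 repo (assumption: unchanged upstream).
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    # Only the first two of Llama 3's reserved special tokens, for brevity.
    special_tokens={
        "<|begin_of_text|>": len(mergeable_ranks),
        "<|end_of_text|>": len(mergeable_ranks) + 1,
    },
)
print(enc.encode("Hello, Llama 3!"))
```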


@TJ-Solergibert

> This doesn't work directly, because this model file can't be loaded by SentencePiece.

That's true! But bypassing this is pretty easy: just create a new tokenizer class like the Llama 2 one (`class _Llama2Tokenizer(_SentencePieceTokenizer)`). You can do `self.tokenizer = AutoTokenizer.from_pretrained(...)` and change a few of its methods (for example, `def tokenize(...): return self.tokenizer(...)`).
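
(A minimal sketch of that wrapper, assuming Megatron-LM's usual tokenizer interface of tokenize/detokenize/vocab_size/eod as in `_SentencePieceTokenizer`; the class name and wiring here are illustrative, so adapt them to the actual base class in megatron/training/tokenizer/tokenizer.py in your checkout:)

```python
# Sketch: a Megatron-style tokenizer backed by Hugging Face's AutoTokenizer,
# sidestepping SentencePiece entirely. Names mirror, but are not, the actual
# Megatron-LM classes.
from transformers import AutoTokenizer


class _HFLlama3Tokenizer:
    """Drop-in style replacement for the SentencePiece-based tokenizers."""

    def __init__(self, model_name_or_path):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    def tokenize(self, text):
        # Megatron expects a plain list of token ids.
        return self.tokenizer(text, add_special_tokens=False)["input_ids"]

    def detokenize(self, ids):
        return self.tokenizer.decode(ids)

    @property
    def vocab_size(self):
        return len(self.tokenizer)

    @property
    def eod(self):
        # Llama 3's <|end_of_text|> doubles as the end-of-document token.
        return self.tokenizer.eos_token_id
```

To use it, add a branch for it where Megatron builds its tokenizer (`build_tokenizer` in megatron/training/tokenizer/tokenizer.py) and pass the Hugging Face model path through the existing tokenizer arguments.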

@shamanez

shamanez commented May 20, 2024 via email

@IronMan-WangJinxi

Really.

I also encountered the same problem. Could you please share your configuration? Thank you very much
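
(For anyone after a concrete configuration: a hedged sketch of the relevant flags, assuming a Megatron-LM checkout recent enough to ship a HuggingFaceTokenizer type; older checkouts need the wrapper-class approach above. Verify the available --tokenizer-type choices in your checkout's arguments.py.)

```sh
# Illustrative flags only; everything else follows the launcher example linked above.
torchrun pretrain_gpt.py \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    ...
```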
