
[QUESTION] found NaN in local grad norm in backward pass before data-parallel communication collective #780

Open
ftgreat opened this issue Apr 16, 2024 · 2 comments

ftgreat commented Apr 16, 2024

While continuing training of MoE models (resuming from an existing checkpoint), assertion errors occurred at some steps as follows:
"found NaN in local grad norm in backward pass before data-parallel communication collective".

f'Rank {global_rank}: found NaN in local grad norm in '
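
For context, this message comes from the NaN check that Megatron-LM runs on the flattened local gradient buffer just before the data-parallel communication collective (gated by a check-for-NaN option; the exact file and flag name vary across Megatron-LM versions). A minimal standalone sketch of the same check, not the library's actual code:

```python
import torch

def check_local_grad_norm(grad_buffer: torch.Tensor) -> None:
    """Standalone sketch of the check: compute the L2 norm of the local
    gradient buffer and assert it is finite before the data-parallel
    reduce-scatter/all-reduce runs."""
    rank = torch.distributed.get_rank() if torch.distributed.is_initialized() else 0
    norm = grad_buffer.norm(p=2)
    assert not norm.isnan(), (
        f'Rank {rank}: found NaN in local grad norm in '
        f'backward pass before data-parallel communication collective'
    )
```

Because the norm is taken over the whole buffer, a single NaN in any parameter's gradient on that rank trips the assertion, which is why the message alone does not identify the offending layer.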

Main Settings

  • tp=1, pp=8, ep=2
  • use_mcore=True
  • impl=transformer_engine
  • distributed_optimizer=True
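
For anyone trying to reproduce this configuration, the settings above roughly correspond to the following Megatron-LM launch flags (flag names assumed from Megatron-LM CLIs of this period; verify against your version):

```
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 8
--expert-model-parallel-size 2
--use-mcore-models
--transformer-impl transformer_engine
--use-distributed-optimizer
```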

Questions

    1. At step A, an assertion error occurred; however, when resuming training from the latest checkpoint, the assertion error did not occur at step A (the sample sequence is fixed). Moreover, in the resumed run, except for the loss at the very first step, the losses of all subsequent steps show tiny numeric differences from the original run. Could you explain the reasons?
    2. How can the NaN error above be tracked down? Could you give me some advice on debugging details? Thanks.
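
Not an authoritative answer to question 2, but one common way to narrow down a NaN gradient is to attach per-parameter gradient hooks so the offending tensors are reported by name; the hooks fire on the autograd gradients, upstream of the grad-buffer check that raises this assertion. A minimal plain-PyTorch sketch (register_nan_grad_hooks is a hypothetical helper, not part of Megatron-LM):

```python
import torch

def register_nan_grad_hooks(model: torch.nn.Module) -> None:
    """Hypothetical debugging helper: print the name of every parameter
    whose gradient contains NaN or Inf during the backward pass."""
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        def make_hook(pname):
            def hook(grad):
                if not torch.isfinite(grad).all():
                    rank = (torch.distributed.get_rank()
                            if torch.distributed.is_initialized() else 0)
                    print(f'[rank {rank}] non-finite grad in {pname}: '
                          f'nan={torch.isnan(grad).sum().item()}, '
                          f'inf={torch.isinf(grad).sum().item()}')
                return grad
            return hook

        param.register_hook(make_hook(name))
```

Calling this once after building the model helps pinpoint which layer first produces a non-finite gradient, which can separate a data issue from a numerical issue in specific (e.g., MoE router) layers.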

D1026 commented Apr 19, 2024

I got the same error when using Megatron to train a DeepSeek model with SFT. Does anybody know what the problem is?

ftgreat (Author) commented Apr 20, 2024

> I got the same error when using Megatron to train a DeepSeek model with SFT. Does anybody know what the problem is?

@D1026 did you train the DeepSeek dense model or the DeepSeek-MoE model?
Often this error happens because of the data.
However, in my case the data seems fine. I am not sure whether this case is related to MoE pretraining.
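
In case it helps others rule data in or out: a cheap pre-flight scan of each batch can catch two common data problems, out-of-range token IDs and an all-zero loss mask (the latter turns a masked loss average into 0/0, which propagates NaN). A sketch under assumed batch keys ('tokens', 'loss_mask'), which may differ from your dataloader:

```python
def sanity_check_batch(batch: dict, vocab_size: int, step: int) -> None:
    """Hypothetical pre-flight check run before the forward pass."""
    tokens = batch['tokens']        # assumed key; adapt to your pipeline
    loss_mask = batch['loss_mask']  # assumed key; adapt to your pipeline
    assert 0 <= tokens.min().item() and tokens.max().item() < vocab_size, (
        f'step {step}: token id outside [0, {vocab_size}) range'
    )
    assert loss_mask.sum().item() > 0, f'step {step}: loss mask is all zeros'
```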
