[BUG FIX] Fix world_size bug in QuickStart Example #747
BUG Description
When I followed the quick start in the developer guide (https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start) and ran the example script `run_simple_mcore_train_loop.py`, the terminal did not respond for nearly an hour and then threw an exception.
BUG Reason
In the `initialize_distributed()` function in `Megatron-LM/examples/run_simple_mcore_train_loop.py`, `world_size` is set to `torch.cuda.device_count()`. This is wrong when the script runs on an 8-GPU node but `torchrun --nproc-per-node` is set to any number other than 8: `world_size` then no longer matches the number of GPUs the script actually uses. Even worse, the terminal may stop responding for a very long time.
BUG Reproduce
Set `torchrun --nproc-per-node` to a value that differs from the total number of GPUs on the machine you are using.
BUG Fix
This change fixes the bug and avoids the exception without requiring any change to the launch command. The change is applied to both the script `examples/run_simple_mcore_train_loop.py` and the documentation file `megatron/core/QuickStart.md`.
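
For illustration only, here is a minimal sketch of one way to keep `world_size` consistent with the launch command. It is not necessarily the exact diff in this PR. It assumes the script is started with `torchrun`, which exports the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables to every worker. The names `read_torchrun_env` and `initialize_distributed_sketch` are hypothetical and do not exist in Megatron-LM.

```python
import os


def read_torchrun_env(env=None):
    """Return (rank, local_rank, world_size) from torchrun's environment.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    rank = int(env["RANK"])
    local_rank = int(env["LOCAL_RANK"])
    # WORLD_SIZE reflects the processes torchrun actually launched,
    # unlike torch.cuda.device_count(), which counts every visible GPU.
    world_size = int(env["WORLD_SIZE"])
    return rank, local_rank, world_size


def initialize_distributed_sketch():
    """Hypothetical stand-in for the torch setup in initialize_distributed()."""
    # Imported lazily so read_torchrun_env() works without PyTorch installed.
    import torch

    rank, local_rank, world_size = read_torchrun_env()
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(world_size=world_size, rank=rank)
```

With this approach, launching with `torchrun --nproc-per-node 2` on an 8-GPU node would give `world_size == 2`, matching the two processes that actually join the process group.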