Slow training on Mixtral-8x22B when DP size > 1 #9031

Open

sunilitggu opened this issue Apr 24, 2024 · 2 comments

Labels: bug (Something isn't working)

sunilitggu commented Apr 24, 2024

Describe the bug

We are in the process of fine-tuning Mixtral-8x22B on an instruction fine-tuning dataset. The model is partitioned with PP=8 and TP=4. Our experiments run on DGX nodes, each equipped with 8 H100 GPUs; the nodes are interconnected with 3.2 Tbps InfiniBand.

We tested the model with various DP sizes: 1, 2, 4, and 8. Throughout all experiments we kept the micro batch size at 1 and the global batch size at 128, so the number of gradient accumulation steps follows from the DP size.
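
For reference, here is a minimal sketch (plain Python, not part of the training setup) of how the DP size and the number of gradient accumulation steps follow from these settings, assuming DP = total GPUs / (TP × PP) and ignoring the expert-parallel dimension:

```python
# Sketch: how DP size and gradient accumulation steps follow from the setup above.
# Assumes DP = total_gpus / (TP * PP) and accumulation = GBS / (MBS * DP).
TP, PP = 4, 8            # tensor / pipeline model parallel sizes
MBS, GBS = 1, 128        # micro / global batch size
GPUS_PER_NODE = 8

for nodes in (4, 8, 16, 32):
    total_gpus = nodes * GPUS_PER_NODE
    dp = total_gpus // (TP * PP)       # 1, 2, 4, 8
    accum = GBS // (MBS * dp)          # 128, 64, 32, 16 micro-steps per replica
    print(f"nodes={nodes:2d}  DP={dp}  grad_accum_steps={accum}")
```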

Here are the average global batch processing times (GBPT):

#nodes=4, DP=1, GBPT = 2 sec
#nodes=8, DP=2, GBPT = 10 sec
#nodes=16, DP=4, GBPT = 32 sec
#nodes=32, DP=8, GBPT = 36 sec

Adding more nodes sharply increases the time per global batch instead of reducing it.
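
Converted into throughput, the numbers above look like this (a quick back-of-the-envelope sketch using only the values reported here):

```python
# Sketch: effective throughput implied by the reported global-batch times (EP=2 run).
GBS = 128
gbpt = {4: 2.0, 8: 10.0, 16: 32.0, 32: 36.0}   # nodes -> seconds per global batch

for nodes, t in gbpt.items():
    samples_per_s = GBS / t
    per_gpu = samples_per_s / (nodes * 8)
    print(f"{nodes:2d} nodes: {samples_per_s:6.1f} samples/s "
          f"({per_gpu:.3f} per GPU per second)")
```

Going from 4 to 32 nodes cuts absolute throughput from roughly 64 to 3.6 samples/s, so per-GPU throughput drops by more than two orders of magnitude.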

Steps/Code to reproduce bug

TRAIN="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]"
VALID="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]"
TEST="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]"
MODEL="/checkpoints/mixtral-8x22b-v0.1/nemo-checkpoints/"

VALID_NAMES="v10p1"
CONCAT_SAMPLING_PROBS="[1]"

# Build the training command as one string; the trailing backslashes are line
# continuations inside the unquoted heredoc.
read -r -d '' cmd <<EOF
echo "STARTING*" \
&& python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
trainer.precision=bf16 \
trainer.devices=8 \
trainer.num_nodes=16 \
trainer.val_check_interval=1000 \
trainer.max_steps=10000 \
trainer.log_every_n_steps=1 \
trainer.use_distributed_sampler=False \
model.restore_from_path=${MODEL} \
model.micro_batch_size=1 \
model.global_batch_size=128 \
model.tensor_model_parallel_size=4 \
model.pipeline_model_parallel_size=8 \
model.sequence_parallel=True \
+model.expert_model_parallel_size=2 \
+model.data.train_ds.pad_to_max_length=True \
+model.data.test_ds.pad_to_max_length=True \
+model.data.validation_ds.pad_to_max_length=True \
model.optim.name=fused_adam \
model.megatron_amp_O2=True \
model.optim.lr=5e-6 \
model.answer_only_loss=True \
model.peft.peft_scheme=none \
model.data.train_ds.file_names=${TRAIN} \
model.data.validation_ds.file_names=${VALID} \
model.data.test_ds.file_names=${TEST} \
model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
model.data.train_ds.max_seq_length=2048 \
model.data.train_ds.num_workers=4 \
model.data.validation_ds.num_workers=4 \
model.data.test_ds.num_workers=4 \
model.data.validation_ds.metric.name=loss \
model.data.test_ds.metric.name=loss \
exp_manager.create_wandb_logger=False \
exp_manager.explicit_log_dir=./result-2/ \
exp_manager.resume_if_exists=True \
exp_manager.resume_ignore_no_checkpoint=True \
exp_manager.create_checkpoint_callback=True \
exp_manager.name=exp-2 \
exp_manager.checkpoint_callback_params.monitor=validation_loss \
exp_manager.checkpoint_callback_params.save_best_model=False \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
exp_manager.checkpoint_callback_params.mode=min
EOF

# Launch on 16 nodes x 8 GPUs (must match trainer.num_nodes and trainer.devices above).
srun -N 16 -o logs/log-moe-16nodes-%j.txt \
  --gpus-per-node=8 --ntasks-per-node=8 --cpus-per-task=8 --mem=2000G \
  --partition=nlp \
  --container-mounts="/vast/core42-nlp/shared/model_checkpoints/:/checkpoints/" \
  --job-name=nemo-train \
  --container-workdir=$PWD \
  --container-image="/vast/core42-nlp/users/sunil.sahu/nemo_24_03_py3.sqsh" \
  bash -c "${cmd}"

Expected behaviour

Increasing the number of nodes increases the DP size and reduces the number of gradient accumulation steps per replica, which should speed up training.
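
For concreteness, under that expectation the time per global batch should shrink roughly in proportion to DP, since each replica only processes GBS/DP micro-batches. A rough sketch comparing this ideal (which ignores all communication cost) with the observed times:

```python
# Sketch: ideal strong scaling vs. the observed times. With GBS fixed, each replica
# processes GBS/DP micro-batches, so ideally GBPT ~ (DP=1 time) / DP.
observed = {1: 2.0, 2: 10.0, 4: 32.0, 8: 36.0}   # DP -> seconds per global batch (EP=2)
base = observed[1]

for dp, t in observed.items():
    ideal = base / dp
    print(f"DP={dp}: ideal ~{ideal:4.2f}s, observed {t:4.1f}s "
          f"(~{t / ideal:.0f}x slower than ideal)")
```

The gap widens with DP, which suggests, though this is only an inference from the reported numbers and not a measurement, that cross-node communication (e.g. the data-parallel gradient reduction) rather than per-replica compute dominates the step time.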

Environment details
NVIDIA Docker image: nvcr.io/nvidia/nemo:24.03.framework

sunilitggu added the bug label on Apr 24, 2024
sunilitggu changed the title from "Slow training on DP size > 1" to "Slow training on Mixtral-8x22B when DP size > 1" on Apr 26, 2024
akoumpa self-assigned this on Apr 26, 2024

akoumpa (Collaborator) commented Apr 26, 2024

Hi, thanks for reporting this,

Can you retry without EP and report back whether this improves the speed? In addition, I would encourage trying different TP/PP configurations to determine the optimal.

Thank you.

sunilitggu (Author) commented

> Hi, thanks for reporting this,
>
> Can you retry without EP and report back whether this improves the speed? In addition, I would encourage trying different TP/PP configurations to determine the optimal.
>
> Thank you.

Thank you for your response. We have already attempted the process without EP. However, it proved to be slower compared to when EP was utilized. Below are the average times recorded without EP:

#nodes=4, DP=1, GBPT = 2 sec
#nodes=8, DP=2, GBPT = 12 sec
#nodes=16, DP=4, GBPT = 34 sec

We have also experimented with different TP and PP combinations, such as 8x4, 4x8, and 8x8. In terms of speed, all of them performed worse than the configuration reported above.
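
For easier comparison, the two sets of timings reported in this thread side by side (a small sketch using only the values above):

```python
# Sketch: the global-batch times reported in this thread, with and without EP.
with_ep    = {4: 2, 8: 10, 16: 32, 32: 36}   # nodes -> seconds (EP=2)
without_ep = {4: 2, 8: 12, 16: 34}           # nodes -> seconds (no EP)

for nodes in sorted(without_ep):
    print(f"{nodes:2d} nodes: EP=2 -> {with_ep[nodes]:2d}s, no EP -> {without_ep[nodes]:2d}s")
```

Both settings degrade at roughly the same rate once DP > 1, so the slowdown does not appear to be specific to expert parallelism.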
