Can you retry without EP and report back whether this improves the speed? In addition, I would encourage trying different TP/PP configurations to determine the optimal one.
Thank you.
Thank you for your response. We have already tried running without EP, but it proved slower than with EP. Below are the average times recorded without EP:
#nodes=4, DP=1, GBPT = 2 sec
#nodes=8, DP=2, GBPT = 12 sec
#nodes=16, DP=4, GBPT = 34 sec
We have also experimented with different TP/PP combinations, such as 8x4, 4x8, and 8x8. In terms of speed, all of these configurations performed worse than the one reported in the issue.
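As a sanity check on the layouts discussed above, the GPU counts implied by each parallel configuration can be sketched as follows. This assumes the usual Megatron-style relation world_size = TP × PP × DP, and assumes expert parallelism partitions experts within the data-parallel group rather than adding GPUs on top of that product.

```python
def world_size(tp: int, pp: int, dp: int) -> int:
    """Total GPUs implied by a tensor/pipeline/data-parallel layout."""
    return tp * pp * dp

# Layout reported in the issue: TP=4, PP=8 -> one model replica spans 32 GPUs.
assert world_size(4, 8, 1) == 32    # 4 DGX H100 nodes
assert world_size(4, 8, 4) == 128   # 16 nodes
assert world_size(4, 8, 8) == 256   # 32 nodes

# GPUs per replica for the alternative TP x PP combinations tried here:
for tp, pp in [(8, 4), (4, 8), (8, 8)]:
    print(f"TP={tp}, PP={pp}: {world_size(tp, pp, 1)} GPUs per replica")
```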
Describe the bug
We are in the process of fine-tuning Mixtral-8x22b using an instruction fine-tuning dataset. The model is partitioned with PP=8 and TP=4. Our experiments are conducted on DGX nodes, each equipped with 8 H100 GPUs. Nodes are interconnected via 3.2 Tbps InfiniBand.
The model was tested with various DP sizes: 1, 2, 4, and 8. Throughout all experiments we kept the micro batch size at 1 and the global batch size at 128; the number of gradient-accumulation steps therefore varies with the DP size.
Here are the average global batch processing times (GBPT):
#nodes=4, DP=1, GBPT = 2 sec
#nodes=8, DP=2, GBPT = 10 sec
#nodes=16, DP=4, GBPT = 32 sec
#nodes=32, DP=8, GBPT = 36 sec
Adding more nodes increases the global batch processing time instead of reducing it.
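The scaling arithmetic behind these numbers can be sketched as follows (using only the values reported above): with a fixed global batch of 128 and a micro batch of 1, the number of gradient-accumulation steps per global batch is GBS / (MBS × DP), so larger DP should mean fewer steps and a shorter global-batch time.

```python
# GBS/MBS and the observed GBPT values are taken from the issue report.
GBS, MBS = 128, 1
observed = {1: 2.0, 2: 10.0, 4: 32.0, 8: 36.0}  # DP -> GBPT in seconds

baseline_t = observed[1]
for dp, t in observed.items():
    accum = GBS // (MBS * dp)
    ideal = baseline_t / dp  # perfect linear scaling from the DP=1 baseline
    print(f"DP={dp}: {accum} accumulation steps, "
          f"ideal GBPT ~{ideal:.2f}s, observed {t}s")

# Instead of approaching the ideal, the observed GBPT grows with DP,
# which points at a communication bottleneck rather than compute.
```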
Steps/Code to reproduce bug
TRAIN="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]"
VALID="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]"
TEST="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]"
MODEL="/checkpoints/mixtral-8x22b-v0.1/nemo-checkpoints/"
VALID_NAMES="v10p1"
CONCAT_SAMPLING_PROBS="[1]"
read -r -d '' cmd <<EOF
echo "STARTING*" \
&& python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.precision=bf16 \
    trainer.devices=8 \
    trainer.num_nodes=16 \
    trainer.val_check_interval=1000 \
    trainer.max_steps=10000 \
    trainer.log_every_n_steps=1 \
    trainer.use_distributed_sampler=False \
    model.restore_from_path=${MODEL} \
    model.micro_batch_size=1 \
    model.global_batch_size=128 \
    model.tensor_model_parallel_size=4 \
    model.pipeline_model_parallel_size=8 \
    model.sequence_parallel=True \
    +model.expert_model_parallel_size=2 \
    +model.data.train_ds.pad_to_max_length=True \
    +model.data.test_ds.pad_to_max_length=True \
    +model.data.validation_ds.pad_to_max_length=True \
    model.optim.name=fused_adam \
    model.megatron_amp_O2=True \
    model.optim.lr=5e-6 \
    model.answer_only_loss=True \
    model.peft.peft_scheme=none \
    model.data.train_ds.file_names=${TRAIN} \
    model.data.validation_ds.file_names=${VALID} \
    model.data.test_ds.file_names=${TEST} \
    model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
    model.data.train_ds.max_seq_length=2048 \
    model.data.train_ds.num_workers=4 \
    model.data.validation_ds.num_workers=4 \
    model.data.test_ds.num_workers=4 \
    model.data.validation_ds.metric.name=loss \
    model.data.test_ds.metric.name=loss \
    exp_manager.create_wandb_logger=False \
    exp_manager.explicit_log_dir=./result-2/ \
    exp_manager.resume_if_exists=True \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.name=exp-2 \
    exp_manager.checkpoint_callback_params.monitor=validation_loss \
    exp_manager.checkpoint_callback_params.save_best_model=False \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
    exp_manager.checkpoint_callback_params.mode=min
EOF
srun -N 16 -o logs/log-moe-16nodes-%j.txt \
    --gpus-per-node=8 --ntasks-per-node=8 --cpus-per-task=8 --mem=2000G \
    --partition=nlp \
    --container-mounts="/vast/core42-nlp/shared/model_checkpoints/:/checkpoints/" \
    --job-name=nemo-train \
    --container-workdir=$PWD \
    --container-image="/vast/core42-nlp/users/sunil.sahu/nemo_24_03_py3.sqsh" \
    bash -c "${cmd}"
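Since the degradation appears only at multi-node scale, one common first check (not something tried in this thread, so treat it as a suggestion) is to confirm that NCCL is actually using the InfiniBand fabric rather than falling back to TCP sockets. The environment variables below are standard NCCL settings; the log path matches the srun command above.

```shell
# Diagnostic sketch: surface NCCL's transport selection in the job logs.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# After a short run, check which transport NCCL selected:
#   "NET/IB"     -> InfiniBand is in use, as expected
#   "NET/Socket" -> fallback to TCP, which would explain poor scaling
grep -E "NET/(IB|Socket)" logs/log-moe-16nodes-*.txt
```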
Expected behaviour
Increasing the number of nodes raises the DP size and reduces the number of gradient-accumulation steps per global batch, which should speed up training.
Environment details
NVIDIA Docker image: nvcr.io/nvidia/nemo:24.03.framework