Can you retry without EP and report back whether this improves the speed? In addition, I would encourage trying different TP/PP configurations to determine the optimal one.
Thank you.
Thank you for your response. We have already tried running without EP, but it proved slower than with EP. Below are the average times recorded without EP:
#nodes=4, DP=1, GBPT = 2 sec
#nodes=8, DP=2, GBPT = 12 sec
#nodes=16, DP=4, GBPT = 34 sec
We have also experimented with different TP/PP combinations, such as 8x4, 4x8, and 8x8. In terms of speed, all of these configurations performed worse than the one reported in the issue.
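As a sanity check on the layouts discussed above, the GPU counts implied by each parallel configuration can be sketched as follows. This assumes the usual Megatron-style relation world_size = TP × PP × DP, and assumes expert parallelism partitions experts within the data-parallel group rather than adding GPUs on top of that product.

```python
def world_size(tp: int, pp: int, dp: int) -> int:
    """Total GPUs implied by a tensor/pipeline/data-parallel layout."""
    return tp * pp * dp

# Layout reported in the issue: TP=4, PP=8 -> one model replica spans 32 GPUs.
assert world_size(4, 8, 1) == 32    # 4 DGX H100 nodes
assert world_size(4, 8, 4) == 128   # 16 nodes
assert world_size(4, 8, 8) == 256   # 32 nodes

# GPUs per replica for the alternative TP x PP combinations tried here:
for tp, pp in [(8, 4), (4, 8), (8, 8)]:
    print(f"TP={tp}, PP={pp}: {world_size(tp, pp, 1)} GPUs per replica")
```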
Describe the bug
We are in the process of fine-tuning Mixtral-8x22b using an instruction fine-tuning dataset. The model is partitioned with PP=8 and TP=4. Our experiments are conducted on DGX nodes, each equipped with 8 H100 GPUs. Nodes are interconnected via 3.2 Tbps InfiniBand.
The model was tested with various DP sizes: 1, 2, 4, and 8. Throughout all experiments we kept the micro batch size at 1 and the global batch size at 128; the number of gradient-accumulation steps therefore varies with the DP size.
Here are the average global batch processing times (GBPT):
#nodes=4, DP=1, GBPT = 2 sec
#nodes=8, DP=2, GBPT = 10 sec
#nodes=16, DP=4, GBPT = 32 sec
#nodes=32, DP=8, GBPT = 36 sec
Adding more nodes increases the global batch processing time instead of reducing it.
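The scaling arithmetic behind these numbers can be sketched as follows (using only the values reported above): with a fixed global batch of 128 and a micro batch of 1, the number of gradient-accumulation steps per global batch is GBS / (MBS × DP), so larger DP should mean fewer steps and a shorter global-batch time.

```python
# GBS/MBS and the observed GBPT values are taken from the issue report.
GBS, MBS = 128, 1
observed = {1: 2.0, 2: 10.0, 4: 32.0, 8: 36.0}  # DP -> GBPT in seconds

baseline_t = observed[1]
for dp, t in observed.items():
    accum = GBS // (MBS * dp)
    ideal = baseline_t / dp  # perfect linear scaling from the DP=1 baseline
    print(f"DP={dp}: {accum} accumulation steps, "
          f"ideal GBPT ~{ideal:.2f}s, observed {t}s")

# Instead of approaching the ideal, the observed GBPT grows with DP,
# which points at a communication bottleneck rather than compute.
```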
Steps/Code to reproduce bug
TRAIN="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]"
VALID="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]"
TEST="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]"
MODEL="/checkpoints/mixtral-8x22b-v0.1/nemo-checkpoints/"
VALID_NAMES="v10p1"
CONCAT_SAMPLING_PROBS="[1]"
read -r -d '' cmd <<EOF
echo "STARTING*" \
&& python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.precision=bf16 \
    trainer.devices=8 \
    trainer.num_nodes=16 \
    trainer.val_check_interval=1000 \
    trainer.max_steps=10000 \
    trainer.log_every_n_steps=1 \
    trainer.use_distributed_sampler=False \
    model.restore_from_path=${MODEL} \
    model.micro_batch_size=1 \
    model.global_batch_size=128 \
    model.tensor_model_parallel_size=4 \
    model.pipeline_model_parallel_size=8 \
    model.sequence_parallel=True \
    +model.expert_model_parallel_size=2 \
    +model.data.train_ds.pad_to_max_length=True \
    +model.data.test_ds.pad_to_max_length=True \
    +model.data.validation_ds.pad_to_max_length=True \
    model.optim.name=fused_adam \
    model.megatron_amp_O2=True \
    model.optim.lr=5e-6 \
    model.answer_only_loss=True \
    model.peft.peft_scheme=none \
    model.data.train_ds.file_names=${TRAIN} \
    model.data.validation_ds.file_names=${VALID} \
    model.data.test_ds.file_names=${TEST} \
    model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
    model.data.train_ds.max_seq_length=2048 \
    model.data.train_ds.num_workers=4 \
    model.data.validation_ds.num_workers=4 \
    model.data.test_ds.num_workers=4 \
    model.data.validation_ds.metric.name=loss \
    model.data.test_ds.metric.name=loss \
    exp_manager.create_wandb_logger=False \
    exp_manager.explicit_log_dir=./result-2/ \
    exp_manager.resume_if_exists=True \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.name=exp-2 \
    exp_manager.checkpoint_callback_params.monitor=validation_loss \
    exp_manager.checkpoint_callback_params.save_best_model=False \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
    exp_manager.checkpoint_callback_params.mode=min
EOF
srun -N 16 -o logs/log-moe-16nodes-%j.txt \
    --gpus-per-node=8 --ntasks-per-node=8 --cpus-per-task=8 --mem=2000G \
    --partition=nlp \
    --container-mounts="/vast/core42-nlp/shared/model_checkpoints/:/checkpoints/" \
    --job-name=nemo-train \
    --container-workdir=$PWD \
    --container-image="/vast/core42-nlp/users/sunil.sahu/nemo_24_03_py3.sqsh" \
    bash -c "${cmd}"
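Since the degradation appears only at multi-node scale, one common first check (not something tried in this thread, so treat it as a suggestion) is to confirm that NCCL is actually using the InfiniBand fabric rather than falling back to TCP sockets. The environment variables below are standard NCCL settings; the log path matches the srun command above.

```shell
# Diagnostic sketch: surface NCCL's transport selection in the job logs.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# After a short run, check which transport NCCL selected:
#   "NET/IB"     -> InfiniBand is in use, as expected
#   "NET/Socket" -> fallback to TCP, which would explain poor scaling
grep -E "NET/(IB|Socket)" logs/log-moe-16nodes-*.txt
```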
Expected behaviour
Increasing the number of nodes raises the DP size and reduces the number of gradient-accumulation steps per global batch, which should speed up training.
Environment details
NVIDIA Docker image: nvcr.io/nvidia/nemo:24.03.framework