Example for benchmarking ML workloads using Torch Profiler and NSight #322

Open
syedazi wants to merge 32 commits into base: main

Conversation

@syedazi (Collaborator) commented May 10, 2024

This recipe is built on Meta's llama recipe, with modifications that allow model pretraining (LLAMA2) with FSDP and the additional ability to profile the workloads using either Torch Profiler or NSight.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
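
For context, a minimal sketch of how the two profiling modes could be switched at launch time. The PROFILER toggle is hypothetical; the nsys flags and the torchrun invocation are taken from the recipe's launch script reviewed below.

# Hypothetical PROFILER toggle; TORCHRUN_ARGS, TRAIN_SCRIPT, and MODEL_ARGS
# are the variables the recipe's launch script already defines.
if [ "${PROFILER:-none}" = "nsight" ]; then
    # NSight mode: wrap the launcher in nsys so the whole job is traced.
    nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas \
        --output "report_job${SLURM_JOB_ID}_rank${SLURM_PROCID}.nsys-rep" \
        torchrun "${TORCHRUN_ARGS[@]}" "$TRAIN_SCRIPT" "${MODEL_ARGS[@]}"
else
    # Torch Profiler mode: tracing is driven from inside the training script.
    torchrun "${TORCHRUN_ARGS[@]}" "$TRAIN_SCRIPT" "${MODEL_ARGS[@]}"
fi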

@syedazi self-assigned this May 10, 2024
## Plenty of EFA level variables
## Comment out for non-efa instances (G4d, P3)
## For G5.12x, Comment out RDMA and Fork safe
## For G4dn and other G5, comment out all
Contributor:
You may want to add sections for clarity. Otherwise it can be confusing, as there is a list of variables, some of which are commented out, but only one explanation at the top.
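
For example, one possible sectioned layout (a sketch using only variables already present in this script; the grouping and headers are illustrative):

## EFA settings (comment out on non-EFA instances such as G4dn and P3;
## on G5.12x, comment out only the RDMA and fork-safe lines)
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1   # use for p4d
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1

## NCCL settings
export NCCL_DEBUG=INFO
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
#export NCCL_SOCKET_IFNAME=ens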

export FI_PROVIDER=efa
export NCCL_DEBUG=INFO
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
#export NCCL_SOCKET_IFNAME=ens
Contributor:

Same here

export FI_PROVIDER=efa
export NCCL_DEBUG=INFO
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
#export NCCL_SOCKET_IFNAME=ens
Contributor:

Split into sections.

export NCCL_DEBUG=INFO
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

#export NCCL_SOCKET_IFNAME=ens
Contributor:

Split into sections, or remove.

conda activate llamapretrain

# Install pytorch and other dependencies
conda install -y pytorch==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia
Contributor:

Put the versions at the top, alongside the mamba version.
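
Something like this sketch, perhaps (the mamba version value is elided here; it would be filled in with whatever the recipe pins):

# Pinned versions, declared once at the top of the script
MAMBA_VERSION=...                 # placeholder; pin the recipe's mamba version here
PYTORCH_VERSION=2.3.0
CUDA_VERSION=11.8

conda activate llamapretrain

# Install pytorch and other dependencies, using the pinned versions
conda install -y pytorch==${PYTORCH_VERSION} pytorch-cuda=${CUDA_VERSION} -c pytorch -c nvidia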

--output /fsx/nsys_profiles2/llama2/report_llama2_job%q{SLURM_JOB_ID}_rank%q{SLURM_PROCID}_on_%q{HOSTNAME}.nsys-rep \
torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${MODEL_ARGS[@]}"

#srun -u -l "${ENROOT_ARGS[@]}" /usr/local/cuda/bin/nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas \
Contributor:

Why this command?

## For G4dn and other G5, comment out all
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4d
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
Contributor:

Suggested change:
- export FI_LOG_LEVEL=1
+ export FI_LOG_LEVEL=warn
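
For background (an assumption worth verifying against the libfabric docs): FI_LOG_LEVEL is typically set to a named level such as warn, info, or debug, so warn limits output to warnings rather than relying on a numeric value.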
