
[ResNet-50/pytorch] FP32 and AMP Mode taking same time to complete 90 Epochs #1365

Open
kishoryd opened this issue Nov 29, 2023 · 2 comments
Labels: bug

Comments

@kishoryd

I'm trying to benchmark the performance of V100 and A100 GPUs on our cluster. I'm running 90 epochs and capturing the time to convergence.

The FP32 run gives a throughput of 1400 images/sec (90-epoch time: 2776m17.375s), while AMP gives 3950 images/sec (90-epoch time: 2767m31.195s), i.e. both runs take essentially the same wall-clock time.

The code is pulled from the NVIDIA DeepLearningExamples repo: PyTorch/Classification/ConvNets/resnet50v1.5

To Reproduce
I ran this code using Slurm and pyxis. The script below is used for the FP32 run.

For the AMP run, I changed the precision to AMP, yet the time to convergence over 90 epochs does not match the submitted results.

Further, there is no implementation that stops training once the MLCommons convergence accuracy of 76% is reached (see the sketch after the script below).

#!/bin/bash
#SBATCH --job-name=pytorch_4GPU
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --partition=gpu
#SBATCH --reservation=testing
#SBATCH --time=3-00:00:00
#SBATCH --nodelist=gpu035
#SBATCH --exclusive
#SBATCH --output=pytorch_gpu_4_%j.out
#SBATCH --error=pytorch_gpu_4_%j.err

time srun --container-image=/scratch/cdacapp/enroot/nvidia+pytorch+21.03-py3.sqsh \
    --container-name=pytorch \
    --container-mounts=/var/share/slurm/slurm.taskprolog:/var/share/slurm/slurm.taskprolog,/scratch/cdacapp:/scratch/cdacapp \
    sh -c 'cd /scratch/cdacapp/pytorch/DeepLearningExamples/PyTorch/Classification/ConvNets && python ./multiproc.py \
        --nproc_per_node 4 --nnodes 1 ./launch.py --model resnet50 \
        --precision FP32 --mode convergence --platform DGX1V /scratch/cdacapp/pytorch/image2012 \
        --raport-file benchmark_4GPU.json --epochs 90 --no-checkpoints \
        --optimizer-batch-size 1024 --batch-size 256 --workers 4 --prefetch 4 --seed 100'
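
As a reference point, here is a minimal sketch (not the repo's launch.py; the model, data, and hyperparameters are placeholders chosen only to keep the example self-contained) of an AMP training loop that stops once a target top-1 accuracy is reached:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"   # AMP path only applies on GPU; pure FP32 otherwise
TARGET_TOP1 = 76.0                # convergence target in top-1 %

# Tiny random stand-ins for the ImageNet loaders, only so the sketch runs end to end.
train_ds = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
val_ds = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)  # stand-in for ResNet-50
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # no-op when enabled=False

def validate(model, loader):
    # Returns top-1 accuracy in percent on the given loader.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            correct += (model(images).argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    return 100.0 * correct / total

for epoch in range(90):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast(enabled=use_amp):  # mixed-precision forward pass
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()  # loss scaling avoids FP16 gradient underflow
        scaler.step(optimizer)
        scaler.update()

    top1 = validate(model, val_loader)
    if top1 >= TARGET_TOP1:
        print(f"Reached {top1:.2f}% top-1 at epoch {epoch}; stopping early")
        break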

Environment

  • Container version : pytorch:21.03-py3
  • GPUs in the system : 4x Tesla V100-SXM2-16GB
  • CUDA driver version : 470.57.02
kishoryd added the bug label Nov 29, 2023

sanjeebtiwary commented Dec 29, 2023

import torch

# ... your model and data loading code ...

# Early stopping parameters
patience = 5
best_accuracy = 0
counter = 0

# Training loop
for epoch in range(num_epochs):
    # ... training code ...

    # Validation step
    model.eval()
    with torch.no_grad():
        val_accuracy = validate(model, val_loader)  # implement your validation function
    model.train()

    # Check for improvement
    if val_accuracy > best_accuracy:
        best_accuracy = val_accuracy
        counter = 0
    else:
        counter += 1

    # Early stopping condition: stop after `patience` epochs without improvement
    if counter >= patience:
        print(f'Early stopping at epoch {epoch}')
        break
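
Note that the snippet above implements patience-based early stopping (it stops once validation accuracy has not improved for a number of epochs); to stop at the MLCommons target instead, val_accuracy can be compared directly against the 76% threshold, as in the sketch earlier in this issue.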

@iamsh4shank

Hi @kishoryd, I have one question: how much training time is needed to train for 90 epochs?
