
Issue with Increasing VRAM/Shared GPU Memory Usage During Training on EfficientVIT-M2 and EfficientNet_lite0 #2128

Answered by thoj
thoj asked this question in Q&A

After encountering significant VRAM overflow issues during the training of an EfficientVIT-M2 model, I developed a workaround. It's important to note that my explanation for why this solution works is based on a theory regarding the NVIDIA driver's memory management behavior.

I theorize that the underlying issue arises from the NVIDIA driver's memory manager on Windows, which appears to optimize VRAM usage by preemptively transferring data to shared GPU memory. This seems to happen to prevent complete VRAM saturation: the process starts when VRAM usage is just shy of its maximum capacity (around 9.8GB in my scenario), leaving about 200MB of VRAM "free." PyTorch, recogniz…
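
Since the original workaround text is cut off above, the following is only a minimal sketch of one mitigation consistent with that theory, not necessarily the author's actual fix: capping PyTorch's CUDA allocator below the point where the driver begins spilling VRAM into shared memory. The 0.9 fraction and the single-GPU device index are assumptions to be tuned per setup.

```python
import torch

# Minimal sketch (assumption, not necessarily the author's workaround):
# cap the CUDA caching allocator below the threshold at which the Windows
# driver starts offloading VRAM to shared GPU memory (system RAM).
if torch.cuda.is_available():
    device = 0  # assumed single-GPU setup
    total_gb = torch.cuda.get_device_properties(device).total_memory / 1e9
    # Keep allocations at ~90% of VRAM; tune the fraction for your card
    # (e.g. a cap a little below the ~9.8GB spill point observed above).
    torch.cuda.set_per_process_memory_fraction(0.9, device=device)
    print(f"Total VRAM: {total_gb:.1f} GB; PyTorch allocations capped at 90%")
```

Recent NVIDIA drivers also expose a "CUDA - Sysmem Fallback Policy" setting in the NVIDIA Control Panel that can disable this spill-over per application; whether that matches the author's actual workaround cannot be determined from the truncated text.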

Replies: 1 comment

Answer selected by thoj