PyTorch Training Error on Multi-GPU Setup with SLURM: 'No Space Left on Device' Despite Ample Space #19565
eyad-al-shami asked this question in DDP / multi-GPU / multi-node (Unanswered)
I'm currently running multiple experiments across 4 GPUs on a single node managed by Slurm. Each GPU runs a distinct experiment with the same model and dataset but different hyperparameters. The submission script for each batch of four experiments looks something like this:
experiment_1.sbatch:
```bash
#!/bin/bash
#SBATCH -p normal
#SBATCH -o logs/baselines/%j.out
#SBATCH -t 7:00:00
#SBATCH --gres=gpu:full:4
#SBATCH -c 40

# Launch one training run per GPU in the background, then wait for all four to finish.
CUDA_VISIBLE_DEVICES=0 python3 train.py --parameters 1 &
CUDA_VISIBLE_DEVICES=1 python3 train.py --parameters 2 &
CUDA_VISIBLE_DEVICES=2 python3 train.py --parameters 3 &
CUDA_VISIBLE_DEVICES=3 python3 train.py --parameters 4 &
wait
```
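For context, the train/val `DataLoader`s in `train.py` are built inside a `LightningDataModule` with several worker processes. The sketch below is a simplified, hypothetical version (the class name, the stand-in dataset, and `num_workers=8` are placeholders, not my actual code), but it shows the shape of the code the traceback goes through: with `num_workers > 0`, merely iterating a dataloader during the sanity check makes PyTorch spawn worker processes and create multiprocessing queues and semaphores.

```python
# Hypothetical, simplified sketch of the data pipeline in train.py; the class
# name, dataset, and num_workers value are placeholders, not my real code.
import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch import LightningDataModule


class MyDataModule(LightningDataModule):
    def __init__(self, num_workers: int = 8):
        super().__init__()
        self.num_workers = num_workers

    def setup(self, stage=None):
        # Stand-in tensors; the real datamodule loads the shared dataset from disk.
        self.train_set = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
        self.val_set = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=64, num_workers=self.num_workers)

    def val_dataloader(self):
        # num_workers > 0 is what makes the DataLoader spawn worker processes and
        # the multiprocessing queue whose semaphore creation fails in the traceback.
        return DataLoader(self.val_set, batch_size=64, num_workers=self.num_workers)
```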
This setup had been working flawlessly until recently, when some jobs started failing. Interestingly, when a job fails, all four commands within it fail simultaneously during the sanity-check phase with the following error:
```
Sanity Checking: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/home/train.py", line 162, in <module>
    trainer.fit(model, datamodule=my_dataset_module)
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1021, in _run_stage
    self._run_sanity_check()
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1050, in _run_sanity_check
    val_loop.run()
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 98, in run
    self.setup_data()
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 168, in setup_data
    _check_dataloader_iterable(dl, source, trainer_fn)
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py", line 391, in _check_dataloader_iterable
    iter(dataloader)  # type: ignore[call-overload]
    ^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 436, in __iter__
    self._iterator = self._get_iterator()
                     ^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1015, in __init__
    self._worker_result_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/multiprocessing/context.py", line 103, in Queue
    return Queue(maxsize, ctx=self.get_context())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/multiprocessing/queues.py", line 43, in __init__
    self._rlock = ctx.Lock()
                  ^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/multiprocessing/context.py", line 68, in Lock
    return Lock(ctx=self.get_context())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/envs/t/lib/python3.11/multiprocessing/synchronize.py", line 167, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/home/anaconda3/envs/t/lib/python3.11/multiprocessing/synchronize.py", line 57, in __init__
    sl = self._semlock = _multiprocessing.SemLock(
         ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 28] No space left on device
```
However, I'm confident the issue isn't actual disk space: the filesystems have plenty of free space, and not every job I submit fails with this error.
I would greatly appreciate any insights or suggestions on how to resolve this issue.
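One more data point that may be relevant: the call that actually fails, `_multiprocessing.SemLock`, creates a POSIX semaphore, which on Linux is backed by the `/dev/shm` tmpfs rather than the main disk, so the "device" that is full may be shared memory rather than the filesystem I was checking. Below is a minimal sketch of checks I can run on the compute node; the paths are standard Linux defaults, not anything specific to my cluster.

```python
import os
import shutil

# Compare free space on the shared-memory tmpfs (which backs POSIX semaphores
# and multiprocessing queues) with the filesystem I had been checking.
for path in ("/dev/shm", "/home"):
    usage = shutil.disk_usage(path)
    print(f"{path}: {usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")

# Entries left behind in /dev/shm by earlier or crashed jobs (stale semaphores,
# shared-memory segments) can fill the tmpfs even when the disks look empty.
entries = os.listdir("/dev/shm")
print(f"/dev/shm entries: {len(entries)}")
print(entries[:20])
```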