[nnUNET/PyTorch] Training step running into "RuntimeError: Critical error in pipeline: Error when executing CPU operator readers__Numpy, instance name: "ReaderX", encountered: CUDA allocation failed Current pipeline object is no longer valid." #1373

Open
Navee402 opened this issue Feb 1, 2024 · 0 comments
Labels
bug Something isn't working

Navee402 commented Feb 1, 2024

Related to nnUNet/PyTorch

I am trying to use BraTS21.ipynb and BraTS22.ipynb to train the nnU-Net model, but I constantly run into the following error: "RuntimeError: Critical error in pipeline: Error when executing CPU operator readers__Numpy, instance name: "ReaderX", encountered: CUDA allocation failed Current pipeline object is no longer valid."

Detailed description:

834 training, 417 validation, 1251 test examples
Provided checkpoint /mnt/e/Naveen/Datasets/BraTS2021/check_points/ is not a file. Starting training from scratch.
Filters: [64, 128, 256, 512, 768, 1024],
Kernels: [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]
Strides: [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]]
precision=16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

834 training, 417 validation, 1251 test examples
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

| Name | Type | Params

0 | model | UNet3D | 177 M
1 | model.input_block | InputBlock | 119 K
2 | model.downsamples | ModuleList | 40.5 M
3 | model.bottleneck | ConvBlock | 49.5 M
4 | model.upsamples | ModuleList | 87.2 M
5 | model.output_block | OutputBlock | 195
6 | model.deep_supervision_heads | ModuleList | 1.2 K
7 | loss | LossBraTS | 0
8 | loss.dice | DiceLoss | 0
9 | loss.ce | BCEWithLogitsLoss | 0

177 M Trainable params
0 Non-trainable params
177 M Total params
709.241 Total estimated model params size (MB)
Epoch 0/9 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:14 • 0:00:00 1.36it/s 1.31it/s
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━ 3/4 0:00:07 • 0:00:01 1.58it/s
Traceback (most recent call last):
File "/home/navi/nnUNet/notebooks/../main.py", line 128, in
main()
File "/home/navi/nnUNet/notebooks/../main.py", line 110, in main
trainer.fit(model, datamodule=data_module)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
return function(*args, **kwargs)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
self.fit_loop.run()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
self.advance()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 137, in run
self.on_advance_end(data_fetcher)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 285, in on_advance_end
self.val_loop.run()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 127, in run
batch, batch_idx, dataloader_idx = next(data_fetcher)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 127, in next
batch = super().next()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 56, in next
batch = next(self.iterator)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 326, in next
out = next(self._iterator)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 132, in next
out = next(self.iterators[0])
File "/home/navi/nnUNet/data_loading/dali_loader.py", line 236, in next
out = super().next()[0]
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/nvidia/dali/plugin/pytorch/init.py", line 245, in next
outputs = self._get_outputs()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/nvidia/dali/plugin/base_iterator.py", line 340, in _get_outputs
outputs.append(p.share_outputs())
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 1135, in share_outputs
return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline:
Error when executing CPU operator readers__Numpy, instance name: "ReaderX", encountered:
CUDA allocation failed
Current pipeline object is no longer valid.
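
In case it helps with triage: below is a minimal, standalone sketch to check whether the DALI Numpy reader alone reproduces the allocation failure, outside of Lightning and the training loop. The data directory, file filter, and batch size are placeholders, not the actual arguments used by data_loading/dali_loader.py.

import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=2, num_threads=4, device_id=0)
def numpy_reader_pipeline(data_dir):
    # CPU-side Numpy reader, matching the operator named in the error
    # ("CPU operator readers__Numpy", instance name "ReaderX").
    return fn.readers.numpy(
        device="cpu",
        file_root=data_dir,
        file_filter="*_x.npy",  # assumption: preprocessed image volumes use this suffix
        name="ReaderX",
    )

pipe = numpy_reader_pipeline("/mnt/e/Naveen/Datasets/BraTS2021/11_3d/")  # placeholder path
pipe.build()
outputs = pipe.run()
print(len(outputs[0]), "samples read in the first batch")

If this standalone reader also hits the CUDA allocation failure, that would point at memory pressure from the data pipeline itself rather than the model; lowering the pipeline's prefetch_queue_depth (a standard nvidia.dali Pipeline argument) is one knob that reduces how much data is staged at once.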

Steps to reproduce the behavior:

  1. Set up the data directories according to the notebook requirements and launch JupyterLab on the system.
  2. cd into nnUNet/notebooks and open either BraTS21.ipynb or BraTS22.ipynb.
  3. Run the training step with the following command:
    "!python ../main.py --brats --brats22_model --data /mnt/e/Naveen/Datasets/BraTS2021/11_3d/ --results /mnt/e/Naveen/Datasets/BraTS2021/ --ckpt_path /mnt/e/Naveen/Datasets/BraTS2021/check_points/ --ckpt_store_dir /mnt/e/Naveen/Datasets/BraTS2021/check_points/ --scheduler --learning_rate 0.05 --epochs 10 --fold 0 --gpus 1 --amp --task 11 --nfolds 5 --save_ckpt"

Expected behavior:
The training step completes successfully and produces the trained model.

Environment

  • Installed all the requirements according to requirements.txt
  • PyTorch: 2.2.0+cu121
  • CUDA: Cuda compilation tools, release 12.1, V12.1.66; Build cuda_12.1.r12.1/compiler.32415258_0
  • Platform: WSL2 on Windows
  • GPU: NVIDIA RTX 3090
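
For completeness, GPU memory visibility under WSL2 can be checked from PyTorch with a generic torch.cuda query like the one below (not part of the repo; just something to run alongside training to see how much of the 3090's memory is actually available when the error occurs):

import torch

# Generic CUDA memory report; shows how much of the RTX 3090's memory is
# visible to PyTorch inside WSL2 and how much is already in use.
props = torch.cuda.get_device_properties(0)
total_gib = props.total_memory / 2**30
allocated_gib = torch.cuda.memory_allocated(0) / 2**30
reserved_gib = torch.cuda.memory_reserved(0) / 2**30
print(f"{props.name}: {total_gib:.1f} GiB total, "
      f"{allocated_gib:.2f} GiB allocated by PyTorch, {reserved_gib:.2f} GiB reserved")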