[nnUNET/PyTorch] Training step running into "RuntimeError: Critical error in pipeline: Error when executing CPU operator readers__Numpy, instance name: "ReaderX", encountered: CUDA allocation failed Current pipeline object is no longer valid." #1373

Open
Navee402 opened this issue Feb 1, 2024 · 0 comments
Labels
bug Something isn't working

Navee402 commented Feb 1, 2024

Related to nnUNet/PyTorch

I am trying to use BraTS21.ipynb and BraTS22.ipynb to train the nnU-Net model, but I constantly run into the following error: "RuntimeError: Critical error in pipeline: Error when executing CPU operator readers__Numpy, instance name: "ReaderX", encountered: CUDA allocation failed Current pipeline object is no longer valid."

Detailed description:

834 training, 417 validation, 1251 test examples
Provided checkpoint /mnt/e/Naveen/Datasets/BraTS2021/check_points/ is not a file. Starting training from scratch.
Filters: [64, 128, 256, 512, 768, 1024],
Kernels: [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]
Strides: [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]]
precision=16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

834 training, 417 validation, 1251 test examples
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

| Name | Type | Params

0 | model | UNet3D | 177 M
1 | model.input_block | InputBlock | 119 K
2 | model.downsamples | ModuleList | 40.5 M
3 | model.bottleneck | ConvBlock | 49.5 M
4 | model.upsamples | ModuleList | 87.2 M
5 | model.output_block | OutputBlock | 195
6 | model.deep_supervision_heads | ModuleList | 1.2 K
7 | loss | LossBraTS | 0
8 | loss.dice | DiceLoss | 0
9 | loss.ce | BCEWithLogitsLoss | 0

177 M Trainable params
0 Non-trainable params
177 M Total params
709.241 Total estimated model params size (MB)
Epoch 0/9 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 0:00:14 • 0:00:00 1.36it/s 1.31it/s
Validation ━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━ 3/4 0:00:07 • 0:00:01 1.58it/s
Traceback (most recent call last):
File "/home/navi/nnUNet/notebooks/../main.py", line 128, in
main()
File "/home/navi/nnUNet/notebooks/../main.py", line 110, in main
trainer.fit(model, datamodule=data_module)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
return function(*args, **kwargs)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
self.fit_loop.run()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
self.advance()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 137, in run
self.on_advance_end(data_fetcher)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 285, in on_advance_end
self.val_loop.run()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 127, in run
batch, batch_idx, dataloader_idx = next(data_fetcher)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 127, in next
batch = super().next()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 56, in next
batch = next(self.iterator)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 326, in next
out = next(self._iterator)
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 132, in next
out = next(self.iterators[0])
File "/home/navi/nnUNet/data_loading/dali_loader.py", line 236, in next
out = super().next()[0]
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/nvidia/dali/plugin/pytorch/init.py", line 245, in next
outputs = self._get_outputs()
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/nvidia/dali/plugin/base_iterator.py", line 340, in _get_outputs
outputs.append(p.share_outputs())
File "/home/navi/miniconda3/envs/myenv/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 1135, in share_outputs
return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline:
Error when executing CPU operator readers__Numpy, instance name: "ReaderX", encountered:
CUDA allocation failed
Current pipeline object is no longer valid.
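
In case it helps with triage: below is a minimal, standalone sketch to check whether the DALI Numpy reader alone reproduces the allocation failure, outside of Lightning and the training loop. The data directory, file filter, and batch size are placeholders, not the actual arguments used by data_loading/dali_loader.py.

import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=2, num_threads=4, device_id=0)
def numpy_reader_pipeline(data_dir):
    # CPU-side Numpy reader, matching the operator named in the error
    # ("CPU operator readers__Numpy", instance name "ReaderX").
    return fn.readers.numpy(
        device="cpu",
        file_root=data_dir,
        file_filter="*_x.npy",  # assumption: preprocessed image volumes use this suffix
        name="ReaderX",
    )

pipe = numpy_reader_pipeline("/mnt/e/Naveen/Datasets/BraTS2021/11_3d/")  # placeholder path
pipe.build()
outputs = pipe.run()
print(len(outputs[0]), "samples read in the first batch")

If this standalone reader also hits the CUDA allocation failure, that would point at memory pressure from the data pipeline itself rather than the model; lowering the pipeline's prefetch_queue_depth (a standard nvidia.dali Pipeline argument) is one knob that reduces how much data is staged at once.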

Steps to reproduce the behavior:

  1. Set up the data directories according to the notebook requirements and launch JupyterLab on the system.
  2. cd into nnUNet/notebooks and open either BraTS21.ipynb or BraTS22.ipynb.
  3. Run the training step with the following command:
    "!python ../main.py --brats --brats22_model --data /mnt/e/Naveen/Datasets/BraTS2021/11_3d/ --results /mnt/e/Naveen/Datasets/BraTS2021/ --ckpt_path /mnt/e/Naveen/Datasets/BraTS2021/check_points/ --ckpt_store_dir /mnt/e/Naveen/Datasets/BraTS2021/check_points/ --scheduler --learning_rate 0.05 --epochs 10 --fold 0 --gpus 1 --amp --task 11 --nfolds 5 --save_ckpt"

Expected behavior:
The training step completes successfully and produces the trained model.

Environment

  • Installed all the requirements according to requirements.txt
  • PyTorch: 2.2.0+cu121
  • CUDA: Cuda compilation tools, release 12.1, V12.1.66; Build cuda_12.1.r12.1/compiler.32415258_0
  • Platform: WSL2 on Windows
  • GPU: NVIDIA RTX 3090
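
For completeness, GPU memory visibility under WSL2 can be checked from PyTorch with a generic torch.cuda query like the one below (not part of the repo; just something to run alongside training to see how much of the 3090's memory is actually available when the error occurs):

import torch

# Generic CUDA memory report; shows how much of the RTX 3090's memory is
# visible to PyTorch inside WSL2 and how much is already in use.
props = torch.cuda.get_device_properties(0)
total_gib = props.total_memory / 2**30
allocated_gib = torch.cuda.memory_allocated(0) / 2**30
reserved_gib = torch.cuda.memory_reserved(0) / 2**30
print(f"{props.name}: {total_gib:.1f} GiB total, "
      f"{allocated_gib:.2f} GiB allocated by PyTorch, {reserved_gib:.2f} GiB reserved")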