RuntimeError: NCCL Error 2: unhandled system error #198

Open
waduhekx opened this issue Aug 23, 2021 · 2 comments
Comments

@waduhekx

When I use two GPUs to run main.py to train a model on the sthv2 dataset, I get the error below:

Traceback (most recent call last):
File "main.py", line 378, in <module>
main()
File "main.py", line 194, in main
train(train_loader, model, criterion, optimizer, epoch, log_training, tf_writer)
File "main.py", line 244, in train
output = model(input_var)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 151, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 156, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
tensor_copies = Broadcast.apply(devices, *tensors)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 21, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error
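
The message "unhandled system error" says little on its own. One possible next step, sketched here as an assumption rather than a confirmed fix, is to re-run the same script with NCCL's documented `NCCL_DEBUG=INFO` environment variable set, so that NCCL prints details about the failing call. The helper names `nccl_debug_env` and `rerun_with_nccl_logging` are hypothetical and only illustrate the idea.

```python
import os
import subprocess
import sys


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with NCCL logging turned on.

    The input mapping is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # NCCL_DEBUG is a documented NCCL variable; INFO prints setup and error details.
    env["NCCL_DEBUG"] = "INFO"
    return env


def rerun_with_nccl_logging(script_args):
    """Run `python <script_args...>` with NCCL logging enabled.

    Returns the exit code of the child process.
    """
    completed = subprocess.run([sys.executable, *script_args], env=nccl_debug_env())
    return completed.returncode
```

To use it, call `rerun_with_nccl_logging` with `"main.py"` followed by the same arguments as the failing run (the report does not show them), then read the `NCCL INFO` and `NCCL WARN` lines printed before the traceback.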

@waduhekx
Author

How can I solve this problem, please?

@Luffy03

Luffy03 commented Feb 10, 2022

Have you solved the problem? Would you please share your solution? thx
