Watchdog caught collective operation timeout #2511

Closed
srdfjy opened this issue May 1, 2024 · 3 comments
srdfjy commented May 1, 2024

Hi

When I train with 4 machines (2 GPUs per machine, 8 GPUs total) there are no problems, and the same is true with fewer GPUs. However, when I train with 4 machines (4 GPUs per machine, 16 GPUs total), the following error occurs.

Version and model: v3.0.1, U2++ Conformer

2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO Bootstrap : Using eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO cudaDriverVersion 11070
2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO Bootstrap : Using eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Failed to open libibverbs.so[.1]
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Using network Socket
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO NET/Socket : Using [0]eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO cudaDriverVersion 11070
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Failed to open libibverbs.so[.1]
2024/04/30 23:29:35 NCCL version 2.18.6+cuda11.8
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Using network Socket
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO comm 0x885c980 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId b1000 commId 0x108ef55e18c5d28a - Init START
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO comm 0x6359eb40 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x108ef55e18c5d28a - Init START
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Setting affinity for GPU 1 to aaaa,aaaaaaaa
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO P2P Chunksize set to 131072
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/0 : 15[1] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/0 : 15[1] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->2
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO P2P Chunksize set to 131072
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Connected all rings
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Connected all rings
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Connected all trees
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Connected all trees
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO comm 0x6359eb40 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x108ef55e18c5d28a - Init COMPLETE
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO comm 0x885c980 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId b1000 commId 0x108ef55e18c5d28a - Init COMPLETE
2024/04/30 23:29:36 2024-04-30 23:29:36,090 INFO Checkpoint: save to checkpoint /data/exp/init.pt
2024/04/30 23:29:36 2024-04-30 23:29:36,107 INFO Epoch 0 TRAIN info lr 8.333333333333334e-09 rank 1
2024/04/30 23:29:38 2024-04-30 23:29:38,370 INFO Epoch 0 TRAIN info lr 8.333333333333334e-09 rank 0
2024/04/30 23:29:38 2024-04-30 23:29:38,422 INFO using accumulate grad, new batch size is 16 times larger than before
2024/04/30 23:29:38 2024-04-30 23:29:38,422 INFO using accumulate grad, new batch size is 16 times larger than before
2024/05/01 00:00:56 [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
2024/05/01 00:00:57 job-2046325-339461780-0:306:327 [1] NCCL INFO [Service thread] Connection closed by localRank 1
2024/05/01 00:00:57 job-2046325-339461780-0:306:318 [0] NCCL INFO comm 0x885c980 rank 1 nranks 16 cudaDev 1 busId b1000 - Abort COMPLETE
2024/05/01 00:00:57 [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
2024/05/01 00:00:57 [E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
2024/05/01 00:00:57 [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
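For context (not part of the original report): the 1800000 ms in the watchdog message is PyTorch's default 30-minute NCCL collective timeout, which is fixed when the process group is created. A minimal sketch of where that value comes from, assuming the standard torch.distributed API with env:// rendezvous:

```python
import datetime

import torch.distributed as dist

# Illustrative only: requires the usual rendezvous variables set by torchrun
# (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE). The timeout below is the
# default 30 minutes reported as Timeout(ms)=1800000 in the watchdog error;
# raising it only hides a hang, it does not fix the underlying desync.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(minutes=30),
)
```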


srdfjy commented May 6, 2024

@xingchensong

I have temporarily worked around this issue by changing num_workers=4 and prefetch=250 to num_workers=2 and prefetch=125. However, I'm not sure why a higher num_workers leads to this error; it seems the number of workers has to be matched to the number of GPU cards.
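Not from the original comment: a minimal sketch of how these two settings typically map onto a PyTorch DataLoader, assuming WeNet's prefetch option corresponds to PyTorch's prefetch_factor; DummyDataset is a placeholder for the real dataset pipeline.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class DummyDataset(Dataset):
    """Placeholder for the real WeNet dataset pipeline (hypothetical)."""

    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return torch.randn(100, 80)  # fake 100 frames of 80-dim features


# Values that worked for the reporter: 2 workers, prefetch 125.
# Roughly num_workers * prefetch_factor batches sit queued in host memory per
# process, so halving both cuts the loader's memory footprint about fourfold
# (4 * 250 = 1000 batches down to 2 * 125 = 250).
loader = DataLoader(
    DummyDataset(),
    batch_size=8,
    num_workers=2,
    prefetch_factor=125,
    pin_memory=True,
)

if __name__ == "__main__":
    for batch in loader:
        pass  # training step would go here
```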


Mddct commented May 6, 2024

Maybe some OOM occurs during training. Keep:

 num_workers * GPUs per machine <= CPU cores
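Not from the original comment: a minimal sketch of checking that rule of thumb on one node, using os.cpu_count() and torch.cuda.device_count() as assumed proxies for the cores and GPUs actually allocated to the job.

```python
import os

import torch

# Rule of thumb from the comment above: num_workers * GPUs per machine <= CPU cores.
cpu_cores = os.cpu_count() or 1
gpus_per_node = max(torch.cuda.device_count(), 1)
num_workers = 4  # the value that triggered the timeout in this report

budget = cpu_cores // gpus_per_node
print(
    f"{cpu_cores} cores / {gpus_per_node} GPUs -> up to {budget} workers per rank "
    f"(requested: {num_workers})"
)
```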


srdfjy commented May 7, 2024

> Maybe some OOM occurs during training. Keep:
>
>  num_workers * GPUs per machine <= CPU cores

I am using 4 machines in total, each equipped with 4 V100 GPUs (16 GB each) and 100 dedicated CPU cores.

srdfjy closed this as completed May 24, 2024