Watchdog caught collective operation timeout #2511

Closed
srdfjy opened this issue May 1, 2024 · 3 comments
srdfjy commented May 1, 2024

Hi

When I train with 4 machines (2 GPUs per machine, 8 GPUs total) there are no problems, and the same is true with fewer GPUs. However, when I train with 4 machines (4 GPUs per machine, 16 GPUs total), the following error occurs.

Version and model: v3.0.1, U2++ Conformer

2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO Bootstrap : Using eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO cudaDriverVersion 11070
2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO Bootstrap : Using eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Failed to open libibverbs.so[.1]
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Using network Socket
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO NET/Socket : Using [0]eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO cudaDriverVersion 11070
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Failed to open libibverbs.so[.1]
2024/04/30 23:29:35 NCCL version 2.18.6+cuda11.8
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Using network Socket
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO comm 0x885c980 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId b1000 commId 0x108ef55e18c5d28a - Init START
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO comm 0x6359eb40 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x108ef55e18c5d28a - Init START
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Setting affinity for GPU 1 to aaaa,aaaaaaaa
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO P2P Chunksize set to 131072
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/0 : 15[1] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/0 : 15[1] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->2
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO P2P Chunksize set to 131072
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Connected all rings
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Connected all rings
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Connected all trees
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Connected all trees
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO comm 0x6359eb40 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x108ef55e18c5d28a - Init COMPLETE
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO comm 0x885c980 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId b1000 commId 0x108ef55e18c5d28a - Init COMPLETE
2024/04/30 23:29:36 2024-04-30 23:29:36,090 INFO Checkpoint: save to checkpoint /data/exp/init.pt
2024/04/30 23:29:36 2024-04-30 23:29:36,107 INFO Epoch 0 TRAIN info lr 8.333333333333334e-09 rank 1
2024/04/30 23:29:38 2024-04-30 23:29:38,370 INFO Epoch 0 TRAIN info lr 8.333333333333334e-09 rank 0
2024/04/30 23:29:38 2024-04-30 23:29:38,422 INFO using accumulate grad, new batch size is 16 times larger than before
2024/04/30 23:29:38 2024-04-30 23:29:38,422 INFO using accumulate grad, new batch size is 16 times larger than before
2024/05/01 00:00:56 [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
2024/05/01 00:00:57 job-2046325-339461780-0:306:327 [1] NCCL INFO [Service thread] Connection closed by localRank 1
2024/05/01 00:00:57 job-2046325-339461780-0:306:318 [0] NCCL INFO comm 0x885c980 rank 1 nranks 16 cudaDev 1 busId b1000 - Abort COMPLETE
2024/05/01 00:00:57 [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
2024/05/01 00:00:57 [E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
2024/05/01 00:00:57 [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
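For context (not part of the original report): the 1800000 ms in the watchdog message is PyTorch's default 30-minute NCCL collective timeout, which is fixed when the process group is created. A minimal sketch of where that value comes from, assuming the standard torch.distributed API with env:// rendezvous:

```python
import datetime

import torch.distributed as dist

# Illustrative only: requires the usual rendezvous variables set by torchrun
# (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE). The timeout below is the
# default 30 minutes reported as Timeout(ms)=1800000 in the watchdog error;
# raising it only hides a hang, it does not fix the underlying desync.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(minutes=30),
)
```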


srdfjy commented May 6, 2024

@xingchensong

I have temporarily worked around this issue by changing num_workers=4 and prefetch=250 to num_workers=2 and prefetch=125. However, I'm not sure why a higher num_workers leads to this error; it seems the number of workers has to be matched to the number of GPU cards.
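Not from the original comment: a minimal sketch of how these two settings typically map onto a PyTorch DataLoader, assuming WeNet's prefetch option corresponds to PyTorch's prefetch_factor; DummyDataset is a placeholder for the real dataset pipeline.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class DummyDataset(Dataset):
    """Placeholder for the real WeNet dataset pipeline (hypothetical)."""

    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return torch.randn(100, 80)  # fake 100 frames of 80-dim features


# Values that worked for the reporter: 2 workers, prefetch 125.
# Roughly num_workers * prefetch_factor batches sit queued in host memory per
# process, so halving both cuts the loader's memory footprint about fourfold
# (4 * 250 = 1000 batches down to 2 * 125 = 250).
loader = DataLoader(
    DummyDataset(),
    batch_size=8,
    num_workers=2,
    prefetch_factor=125,
    pin_memory=True,
)

if __name__ == "__main__":
    for batch in loader:
        pass  # training step would go here
```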


Mddct commented May 6, 2024

Maybe some OOM occurs during training. Keep:

 num_workers * GPUs per machine <= CPU cores
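Not from the original comment: a minimal sketch of checking that rule of thumb on one node, using os.cpu_count() and torch.cuda.device_count() as assumed proxies for the cores and GPUs actually allocated to the job.

```python
import os

import torch

# Rule of thumb from the comment above: num_workers * GPUs per machine <= CPU cores.
cpu_cores = os.cpu_count() or 1
gpus_per_node = max(torch.cuda.device_count(), 1)
num_workers = 4  # the value that triggered the timeout in this report

budget = cpu_cores // gpus_per_node
print(
    f"{cpu_cores} cores / {gpus_per_node} GPUs -> up to {budget} workers per rank "
    f"(requested: {num_workers})"
)
```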


srdfjy commented May 7, 2024

> Maybe some OOM occurs during training. Keep:
>
>  num_workers * GPUs per machine <= CPU cores

I am using 4 machines in total, each equipped with 4 V100 GPUs (16 GB each) and 100 dedicated CPU cores.

srdfjy closed this as completed May 24, 2024