1%|▏ | 4817/360910 [38:19<22:40:36, 4.36it/s]
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=48128, NumelOut=192512, Timeout(ms)=1800000) ran for 1800644 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=49152, NumelOut=196608, Timeout(ms)=1800000) ran for 1800838 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6029, OpType=ALLREDUCE, NumelIn=65084417, NumelOut=65084417, Timeout(ms)=600000) ran for 1801622 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6206, OpType=ALLGATHER, NumelIn=8192, NumelOut=32768, Timeout(ms)=1800000) ran for 1801713 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6206, OpType=ALLGATHER, NumelIn=8192, NumelOut=32768, Timeout(ms)=1800000) ran for 1801713 milliseconds before timing out.
Sorry, we cannot determine the cause of this issue based on provided information.
You can try to run the code several more times to see if it can successfully run to completion.
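What causes this problem? Can I avoid it by setting --negatives_cross_device False?
Would setting negatives_cross_device to False significantly degrade model quality? Are there any better suggestions, for example somewhere to set a maximum number of negatives so as to avoid the timeout?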
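In case it helps with triage, here is a minimal sketch that groups the watchdog timeout lines by rank so the sequence numbers and input sizes can be compared side by side. The function name `summarize_timeouts` and the regular expression are my own (hypothetical, not part of PyTorch or any library); the sample lines are copied from the log above. It only reports what the log says and does not by itself establish the cause.

```python
import re
from collections import defaultdict

# Matches the fields of one watchdog timeout message as it appears in the log above.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
)


def summarize_timeouts(log_text):
    """Return {rank: sorted list of (SeqNum, OpType, NumelIn)} for every timeout found."""
    by_rank = defaultdict(set)  # a set drops messages that the log repeats
    for m in PATTERN.finditer(log_text):
        by_rank[int(m["rank"])].add((int(m["seq"]), m["op"], int(m["numel_in"])))
    return {rank: sorted(entries) for rank, entries in by_rank.items()}


# Timeout messages copied from the log in this issue (file/line prefixes omitted).
sample_log = """
[Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=48128, NumelOut=192512, Timeout(ms)=1800000) ran for 1800644 milliseconds before timing out.
[Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=49152, NumelOut=196608, Timeout(ms)=1800000) ran for 1800838 milliseconds before timing out.
[Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6029, OpType=ALLREDUCE, NumelIn=65084417, NumelOut=65084417, Timeout(ms)=600000) ran for 1801622 milliseconds before timing out.
[Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6206, OpType=ALLGATHER, NumelIn=8192, NumelOut=32768, Timeout(ms)=1800000) ran for 1801713 milliseconds before timing out.
"""

summary = summarize_timeouts(sample_log)
for rank in sorted(summary):
    print(rank, summary[rank])
```

On this log, the summary shows rank 1 waiting in an ALLGATHER at SeqNum 6206 while ranks 0 and 2 are at SeqNum 6207, and the three ranks report different NumelIn values (8192, 49152, 48128). I am not sure whether that difference is related to the hang.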