all_gather timeout #764

Open
trillionmonster opened this issue May 11, 2024 · 1 comment
Comments

@trillionmonster

1%|▏ | 4817/360910 [38:19<22:40:36, 4.36it/s]
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=48128, NumelOut=192512, Timeout(ms)=1800000) ran for 1800644 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=49152, NumelOut=196608, Timeout(ms)=1800000) ran for 1800838 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6029, OpType=ALLREDUCE, NumelIn=65084417, NumelOut=65084417, Timeout(ms)=600000) ran for 1801622 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6206, OpType=ALLGATHER, NumelIn=8192, NumelOut=32768, Timeout(ms)=1800000) ran for 1801713 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6206, OpType=ALLGATHER, NumelIn=8192, NumelOut=32768, Timeout(ms)=1800000) ran for 1801713 milliseconds before timing out.

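What causes this problem? Can I avoid it by setting --negatives_cross_device False?
Would setting negatives_cross_device to False significantly hurt model quality? Is there a better option, such as a place to cap the maximum number of negatives so that the timeout is avoided?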

@staoxiao
Collaborator

Sorry, we cannot determine the cause of this issue from the information provided.
You can try running the code a few more times to see whether it runs to completion.
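
One check that can be made from the watchdog output itself, offered as a sketch rather than a diagnosis: an all_gather typically expects every rank to contribute a tensor of the same size, and in the log above ranks 0 and 2 report different NumelIn values (49152 and 48128) for the ALLGATHER with SeqNum=6207, while rank 1 is still on SeqNum=6206. The helper below parses watchdog lines and lists the collectives where ranks disagree. The function names and the regular expression are illustrative assumptions, not part of this project or of PyTorch, and the pattern is only known to fit the line format shown in this issue.

```python
import re
from collections import defaultdict

# Assumed pattern: fits the ProcessGroupNCCL watchdog lines quoted in this issue.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+)"
)


def parse_watchdog_lines(log_text):
    """Return {(seq_num, op_type): {rank: numel_in}} for each timeout line."""
    table = defaultdict(dict)
    for m in WATCHDOG_RE.finditer(log_text):
        key = (int(m["seq"]), m["op"])
        table[key][int(m["rank"])] = int(m["numel_in"])
    return dict(table)


def find_mismatches(table):
    """Keep only collectives where ranks reported different NumelIn values."""
    return {k: v for k, v in table.items() if len(set(v.values())) > 1}


# Lines copied from the log in this issue (progress-bar prefix dropped).
LOG = """
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=48128, NumelOut=192512, Timeout(ms)=1800000) ran for 1800644 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=49152, NumelOut=196608, Timeout(ms)=1800000) ran for 1800838 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6206, OpType=ALLGATHER, NumelIn=8192, NumelOut=32768, Timeout(ms)=1800000) ran for 1801713 milliseconds before timing out.
"""

# Print every collective on which the reporting ranks disagree about input size.
for (seq, op), sizes in find_mismatches(parse_watchdog_lines(LOG)).items():
    print(seq, op, sizes)
```

A mismatch reported this way is only a hint. Confirming it would require logging the tensor shapes on each rank immediately before the gather, and the log alone does not show whether it is the cause of the timeout.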
