Thanks for releasing Ok-Topk. It is interesting work, and I am developing some features based on this repo.
Single-node training works. However, when I run Ok-Topk across 2 nodes with a total of 8 GPUs, I find that some values in all_indexes are negative.
Could you suggest how to debug this?
Thanks.
Hi,
I haven't encountered a similar issue on my side. The indices should never be negative. Printing the indices of the top-k values on each rank may help locate where the problem is introduced.
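One way to follow this suggestion without flooding the logs is to print a one-line summary per rank instead of the full index list. The following is only a minimal sketch: the helper name `summarize_indexes` and its arguments are hypothetical and not part of the Ok-Topk repo, and it assumes the indices can be passed in as a plain sequence (for example via `.tolist()` on an array or tensor).

```python
def summarize_indexes(rank, indexes):
    """Build a one-line summary of an index sequence for a given rank.

    `rank` and `indexes` are hypothetical arguments for illustration;
    pass in whatever rank id and top-k index sequence your code has.
    """
    values = [int(v) for v in indexes]
    if not values:
        return f"rank {rank}: no indices"
    # Count entries that can never be valid positions in a tensor.
    negatives = sum(1 for v in values if v < 0)
    return (
        f"rank {rank}: count={len(values)} "
        f"min={min(values)} max={max(values)} negatives={negatives}"
    )


if __name__ == "__main__":
    # Example with made-up values; in practice call this on every rank.
    print(summarize_indexes(0, [3, 1, 7]))
```

Comparing these summaries across ranks, both before and after communication, narrows down the step at which invalid values first appear.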
Actually, all values in local_topk_indexes are positive. They only become negative after the Allgatherv call.
My guess is that something specific to inter-node MPI communication, or to the hardware, causes this. I was just hoping for some directions to debug it. Anyway, thanks for your reply.
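Since the values only go wrong after the gather, one possible next step is to check which rank's segment of the gathered buffer holds the invalid entries. The sketch below is an assumption-laden illustration, not code from the repo: `find_bad_indexes`, `counts`, and `tensor_size` are hypothetical names, and it assumes the gathered buffer is laid out as contiguous per-rank segments in rank order.

```python
def find_bad_indexes(all_indexes, counts, tensor_size):
    """Return (rank, position_in_segment, value) for every invalid index.

    all_indexes: flat sequence gathered from all ranks (e.g. array.tolist()).
    counts:      number of indices contributed by each rank, in rank order
                 (assumed layout; adjust if displacements are used).
    tensor_size: valid indices satisfy 0 <= i < tensor_size.
    """
    if sum(counts) != len(all_indexes):
        raise ValueError("counts do not add up to the gathered buffer length")
    bad = []
    offset = 0
    for rank, count in enumerate(counts):
        segment = all_indexes[offset:offset + count]
        for pos, value in enumerate(segment):
            if value < 0 or value >= tensor_size:
                bad.append((rank, pos, int(value)))
        offset += count
    return bad


if __name__ == "__main__":
    # Made-up example: 3 ranks, 2 indices each, tensor of 8 elements.
    print(find_bad_indexes([0, 5, 2, -1, 7, 3], [2, 2, 2], 8))
```

If the bad entries all fall in the segments sent from the other node, that points at the inter-node path. If they start at a fixed offset regardless of sender, the receive counts, displacements, or element type passed to the gather are worth checking first.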