Multi-Node Sparse Training Error #2

Open
gaow0007 opened this issue May 14, 2022 · 3 comments

@gaow0007

Thanks for releasing Ok-Topk. It is interesting work, and I am developing some functionality based on this repo.
Single-node training works for me. However, when I run Ok-Topk across 2 nodes (8 GPUs in total), I find that some values in all_indexes are negative.

Could you give me some suggestions on how to debug this?

Thanks.

@Shigangli
Owner

Shigangli commented May 14, 2022

Hi,
I didn't encounter a similar issue on my side. Those values should not be negative. Printing out the indices of the top-k values may help to find where the issue is, for example starting from this line:

indexes, values = self._compression.compress(tensor=new_tensor, name=new_name, ratio=density)
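A minimal sketch of such a check (not code from this repo; `indexes` is assumed to be a PyTorch tensor of top-k indices, and the call sites below are illustrative):

```python
import torch

def check_indexes(indexes: torch.Tensor, where: str) -> None:
    # Debugging aid: report the index range and flag negatives at a given point in the pipeline.
    lo, hi = int(indexes.min()), int(indexes.max())
    print(f"[{where}] {indexes.numel()} indices, range [{lo}, {hi}], dtype={indexes.dtype}")
    if lo < 0:
        print(f"[{where}] WARNING: negative indices detected")

# Hypothetical call sites, e.g. right after compression and again after the allgather step:
# check_indexes(indexes, "after compress")
# check_indexes(all_indexes, "after allgatherv")
```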

@gaow0007
Author

Actually, all values in local_topk_indexes are positive. After executing AllgatherV, some of them become negative.

I suspect this is caused by some behavior of MPI communication across nodes, or by hardware characteristics. I am just hoping for some potential directions to debug. Anyway, thanks for your reply.

@Shigangli
Owner

I see. Then it is probably worth writing a simple standalone demo of AllgatherV to see whether the error can be reproduced.
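For reference, a minimal standalone demo along these lines could look like the sketch below (my own example using mpi4py and NumPy, not code from this repo; 64-bit indices and the per-rank counts are assumptions for illustration):

```python
# Minimal AllgatherV reproduction sketch.
# Run with, e.g.: mpirun -np 8 python allgatherv_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank contributes a different number of strictly non-negative "indices".
local_count = 1000 + rank
local_indexes = np.arange(local_count, dtype=np.int64) + rank * 10_000_000

# Every rank needs the per-rank counts and displacements to lay out the receive buffer.
counts = comm.allgather(local_count)                      # list of ints, one per rank
displs = [0] + [int(x) for x in np.cumsum(counts[:-1])]   # receive offsets
recv = np.empty(sum(counts), dtype=np.int64)

comm.Allgatherv([local_indexes, MPI.INT64_T],
                [recv, counts, displs, MPI.INT64_T])

negatives = int((recv < 0).sum())
print(f"rank {rank}: gathered {recv.size} indices, negatives = {negatives}")
```

If negative indices show up here as well when running across 2 nodes, that would point at the communication path rather than at the top-k selection.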
