Multi-Node Sparse Training Error #2

Open
gaow0007 opened this issue May 14, 2022 · 3 comments

@gaow0007

Thanks for releasing Ok-Topk. It is interesting work, and I am developing some functionality based on this repo.
Single-node training works for me. However, when I run Ok-Topk across 2 nodes (8 GPUs in total), I find that some values in all_indexes are negative.

Could you give me some suggestions on how to debug this?

Thanks.

@Shigangli
Owner

Shigangli commented May 14, 2022

Hi,
I didn't encounter a similar issue on my side. Those values should not be negative. Printing out the indices of the top-k values may help to find where the issue is, for example starting from this line:

indexes, values = self._compression.compress(tensor=new_tensor, name=new_name, ratio=density)
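A minimal sketch of such a check (not code from this repo; `indexes` is assumed to be a PyTorch tensor of top-k indices, and the call sites below are illustrative):

```python
import torch

def check_indexes(indexes: torch.Tensor, where: str) -> None:
    # Debugging aid: report the index range and flag negatives at a given point in the pipeline.
    lo, hi = int(indexes.min()), int(indexes.max())
    print(f"[{where}] {indexes.numel()} indices, range [{lo}, {hi}], dtype={indexes.dtype}")
    if lo < 0:
        print(f"[{where}] WARNING: negative indices detected")

# Hypothetical call sites, e.g. right after compression and again after the allgather step:
# check_indexes(indexes, "after compress")
# check_indexes(all_indexes, "after allgatherv")
```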

@gaow0007
Author

Actually, all values in local_topk_indexes are positive. After executing AllgatherV, some of them become negative.

I suspect this is caused by some behavior of MPI communication across nodes, or by hardware characteristics. I am just hoping for some potential directions to debug. Anyway, thanks for your reply.

@Shigangli
Owner

I see. Then it is probably worth writing a simple standalone demo of AllgatherV to see whether the error can be reproduced.
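For reference, a minimal standalone demo along these lines could look like the sketch below (my own example using mpi4py and NumPy, not code from this repo; 64-bit indices and the per-rank counts are assumptions for illustration):

```python
# Minimal AllgatherV reproduction sketch.
# Run with, e.g.: mpirun -np 8 python allgatherv_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank contributes a different number of strictly non-negative "indices".
local_count = 1000 + rank
local_indexes = np.arange(local_count, dtype=np.int64) + rank * 10_000_000

# Every rank needs the per-rank counts and displacements to lay out the receive buffer.
counts = comm.allgather(local_count)                      # list of ints, one per rank
displs = [0] + [int(x) for x in np.cumsum(counts[:-1])]   # receive offsets
recv = np.empty(sum(counts), dtype=np.int64)

comm.Allgatherv([local_indexes, MPI.INT64_T],
                [recv, counts, displs, MPI.INT64_T])

negatives = int((recv < 0).sum())
print(f"rank {rank}: gathered {recv.size} indices, negatives = {negatives}")
```

If negative indices show up here as well when running across 2 nodes, that would point at the communication path rather than at the top-k selection.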
