Skip to content

Shigangli/eager-SGD

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Eager-SGD

Eager-SGD is a decentralized asynchronous SGD for distributed deep learning training based on gradient averaging. It utilizes novel partial collectives operations (partial allreduce) to accumulate the gradients across all the processes. Different from the traditional collectives operations (such as MPI, NCCL), a partial collective is an asynchronous operation where a subset of the processes can trigger and contribute the latest data to the collective operation.

Eager-SGD may bring staleness to the gradients. Thanks to our sophisticated implementation of solo-allreduce and majority-allreduce, the staleness is bounded and therefore eager-SGD is stale-synchronous. Due to the asynchrony feature of eager-SGD, it can better handle the deep learning training with load imbalance. To the best of our knowledge, this is the first work that implements asynchronous and stale-synchronous decentralized SGD where the messages propagate to all nodes in one step.

Demo

A script to run eager-SGD on ResNet-50/ImageNet with SLURM job scheduler can be found here. Generally, to evaluate other neural network models with the customized optimizers (e.g., gradient averaging using solo/majority-allreduce), one can simply wrap the default optimizer using the customized optimizers. See the example for ResNet-50 here.

Publication

The work of eager-SGD is pulished in PPoPP'20, Best Paper Finalist. See the paper for details. If you use eager-SGD, cite us:

@inproceedings{li2020taming,
  title={Taming unbalanced training workloads in deep learning with partial collective operations},
  author={Li, Shigang and Ben-Nun, Tal and Girolamo, Salvatore Di and Alistarh, Dan and Hoefler, Torsten},
  booktitle={Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  pages={45--61},
  year={2020}
}

License

See LICENSE.

About

Eager-SGD is a decentralized asynchronous SGD. It utilizes novel partial collectives operations to accumulate the gradients across all the processes.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published