GitHub - Shigangli/eager-SGD: Eager-SGD is a decentralized asynchronous SGD. It uses novel partial collective operations to accumulate gradients across all processes.

Eager-SGD

Eager-SGD is a decentralized asynchronous SGD for distributed deep learning training based on gradient averaging. It uses novel partial collective operations (partial allreduce) to accumulate gradients across all processes. Unlike traditional collective operations (such as those provided by MPI and NCCL), a partial collective is an asynchronous operation in which a subset of the processes can trigger the operation and contribute their latest data to it.
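The idea of a partial allreduce can be sketched with a toy, single-process simulation. This is not the repository's implementation: the function name `partial_allreduce_round` and the carry-over policy (a late process contributes zero now and adds its late gradient to its next contribution) are assumptions made for illustration.

```python
from typing import List


def partial_allreduce_round(
    grads: List[float], ready: List[bool], carry: List[float]
) -> float:
    """Simulate one round of gradient averaging with a partial allreduce.

    grads[i] is the gradient process i computes in this round.
    ready[i] says whether process i finished before the collective fired.
    carry[i] accumulates gradients that process i could not contribute in
    time; it is updated in place (hypothetical policy, see the text above).
    """
    contributions = []
    for i, (grad, is_ready) in enumerate(zip(grads, ready)):
        if is_ready:
            # On-time process: send the fresh gradient plus anything owed.
            contributions.append(grad + carry[i])
            carry[i] = 0.0
        else:
            # Late process: the collective proceeds without waiting for it.
            contributions.append(0.0)
            carry[i] += grad
    # Every process receives the same average over all process slots.
    return sum(contributions) / len(contributions)


carry = [0.0, 0.0, 0.0, 0.0]
# Round 1: process 3 is a straggler, so its gradient (4.0) is deferred.
avg1 = partial_allreduce_round([1.0, 2.0, 3.0, 4.0], [True, True, True, False], carry)
# Round 2: everyone is on time; process 3 also delivers the deferred 4.0.
avg2 = partial_allreduce_round([1.0, 1.0, 1.0, 1.0], [True, True, True, True], carry)
```

In this sketch the straggler never blocks the other processes, and its gradient still reaches the average one round later, which is the kind of delayed contribution the rest of this README refers to as staleness.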

Eager-SGD may introduce staleness into the gradients. Thanks to our implementation of solo-allreduce and majority-allreduce, the staleness is bounded, and eager-SGD is therefore stale-synchronous. Because it is asynchronous, eager-SGD can better handle deep learning training with load imbalance. To the best of our knowledge, this is the first work that implements asynchronous and stale-synchronous decentralized SGD where the messages propagate to all nodes in one step.
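To see why triggering on a subset of processes helps under load imbalance, the sketch below compares when a collective could start under three policies. It is a simplified model, not the repository's code: `trigger_time` is a hypothetical helper, and "majority" is modelled here as the moment at least half of the processes are ready, which may not match how the implementation actually selects the initiating process.

```python
from typing import List


def trigger_time(arrivals: List[float], mode: str) -> float:
    """Return the time at which the collective can start.

    arrivals[i] is the time at which process i finishes its gradient.
    mode is "sync" (wait for all), "solo" (first ready process triggers),
    or "majority" (at least half of the processes are ready).
    """
    ordered = sorted(arrivals)
    if mode == "sync":
        return ordered[-1]
    if mode == "solo":
        return ordered[0]
    if mode == "majority":
        # Index of the ceil(P / 2)-th arrival in the sorted list.
        return ordered[(len(ordered) - 1) // 2]
    raise ValueError(f"unknown mode: {mode}")


# Four processes, one of which is a heavy straggler.
arrivals = [1.0, 1.2, 1.1, 5.0]
sync_t = trigger_time(arrivals, "sync")
solo_t = trigger_time(arrivals, "solo")
majority_t = trigger_time(arrivals, "majority")
```

With these example arrival times, a fully synchronous allreduce is held back by the straggler, while the solo and majority policies start as soon as the fast processes are ready.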

Demo

A script to run eager-SGD on ResNet-50/ImageNet with the SLURM job scheduler can be found here. In general, to evaluate other neural network models with the customized optimizers (e.g., gradient averaging using solo/majority-allreduce), simply wrap the default optimizer with a customized optimizer. See the example for ResNet-50 here.
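The wrapping pattern described above can be sketched in plain Python. All names here (`SGD`, `GradientAveragingOptimizer`, `allreduce_avg`) are hypothetical and are not the classes shipped in eager-SGD-modules; the real optimizers wrap a TensorFlow optimizer, so refer to the ResNet-50 example for the actual usage.

```python
from typing import Callable, List


class SGD:
    """Stand-in for a framework's default optimizer (hypothetical)."""

    def __init__(self, lr: float) -> None:
        self.lr = lr

    def apply_gradients(self, params: List[float], grads: List[float]) -> List[float]:
        return [p - self.lr * g for p, g in zip(params, grads)]


class GradientAveragingOptimizer:
    """Wraps a base optimizer and averages gradients before applying them."""

    def __init__(self, base: SGD, allreduce_avg: Callable[[float], float]) -> None:
        self.base = base
        # allreduce_avg stands in for a solo/majority allreduce call.
        self.allreduce_avg = allreduce_avg

    def apply_gradients(self, params: List[float], grads: List[float]) -> List[float]:
        averaged = [self.allreduce_avg(g) for g in grads]
        return self.base.apply_gradients(params, averaged)


def fake_allreduce(local_grad: float) -> float:
    """Pretend a single peer process contributed a gradient of 3.0."""
    return (local_grad + 3.0) / 2


optimizer = GradientAveragingOptimizer(SGD(lr=0.5), fake_allreduce)
new_params = optimizer.apply_gradients([1.0], [1.0])
```

The model code only sees an object with the same `apply_gradients` interface as the default optimizer, so switching the averaging strategy does not require touching the model definition.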

Publication

The work on eager-SGD was published at PPoPP'20 (Best Paper Finalist). See the paper for details. If you use eager-SGD, please cite us:

@inproceedings{li2020taming,
  title={Taming unbalanced training workloads in deep learning with partial collective operations},
  author={Li, Shigang and Ben-Nun, Tal and Girolamo, Salvatore Di and Alistarh, Dan and Hoefler, Torsten},
  booktitle={Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
  pages={45--61},
  year={2020}
}

License

See LICENSE.

