
Voda Scheduler

Note that everything is experimental and may change significantly at any time.

Voda Scheduler is a GPU scheduler for elastic deep learning workloads, built on Kubernetes, Kubeflow Training Operator, and Horovod.

Voda Scheduler is designed to be easily deployed in any Kubernetes cluster. For more architectural details, see design.


Contents

  • Why Elastic Training?
  • Why Voda Scheduler?
  • Demo
  • Prerequisite
  • Getting Started
  • Scheduling Algorithms
  • Docker Images
  • Prometheus Metrics Exposed
  • Related Projects
  • Reference

Why Elastic Training?

Elastic training enables distributed training jobs to be scaled up and down dynamically at runtime, without interrupting the training process.

With elastic training, the scheduler can have training jobs utilize idle resources when any are available, and make the most efficient resource allocations when the cluster is heavily loaded, thus increasing cluster throughput and reducing overall training time.

For more information about elastic training, see Elastic Horovod, Torch Distributed Elastic or Elastic Training.
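
To make this concrete, the sketch below shows what an elastic training loop looks like from the job's side, using Elastic Horovod with PyTorch. The model, optimizer, and data are placeholders, and nothing here is specific to Voda Scheduler; it simply follows Elastic Horovod's public API.

```python
# Minimal Elastic Horovod sketch (PyTorch). The model and data are
# placeholders; see the Elastic Horovod docs for the full API.
import torch
import horovod.torch as hvd

hvd.init()
model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = hvd.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01))

@hvd.elastic.run
def train(state):
    # Horovod re-synchronizes `state` across the new worker set whenever
    # workers are added or removed, so training resumes instead of restarting.
    for state.epoch in range(state.epoch, 5):
        batch = torch.randn(32, 10)  # placeholder data
        loss = model(batch).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state.commit()  # checkpoint progress for recovery

state = hvd.elastic.TorchState(model, optimizer, epoch=0)
train(state)
```

Such a job is typically launched with horovodrun's elastic flags (e.g. --min-np/--max-np), which define the range within which workers can be added or removed at runtime.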

Why Voda Scheduler?

Voda Scheduler provides several critical features for elastic deep learning workloads as follows:

  • Rich Scheduling Algorithms (with resource elasticity) to choose from
  • Topology-Aware Scheduling & Worker Migration
    • Actively consolidate resources to maximize cluster throughput
    • Particularly important for elastic training since resource allocations can be dynamically adjusted
  • Node Addition/Deletion Awareness
    • Works in concert with existing cluster autoscalers
    • Makes the best use of spot instances that may come and go with little warning
    • Tolerates failing nodes
  • Fault-Tolerance

Demo

Check out the demo to see how resource allocations are dynamically adjusted (and how worker pods are migrated) to maximize cluster throughput.

Prerequisite

A Kubernetes cluster, on-cloud or on-premise, that can schedule GPUs. Voda Scheduler is tested with Kubernetes v1.20.

Getting Started

  1. Config Scheduler
  2. Deploy Scheduler
  3. Submit Training Job to Scheduler
  4. API Endpoints
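
The exact job spec is covered in Submit Training Job to Scheduler; purely as an illustration, assuming jobs are expressed as Kubeflow MPIJob custom resources, a programmatic submission with the official Kubernetes Python client might look like the sketch below (the group/version, names, and elided spec fields are assumptions, not Voda's documented interface).

```python
# Hypothetical sketch: submitting a Kubeflow MPIJob for Voda to schedule.
# Field names and values are assumptions; consult the project's
# "Submit Training Job to Scheduler" guide for the real spec.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

mpijob = {
    "apiVersion": "kubeflow.org/v1",  # assumed group/version
    "kind": "MPIJob",
    "metadata": {"name": "mnist-elastic", "namespace": "default"},
    "spec": {
        # ... launcher/worker pod templates, GPU requests, elastic
        # min/max worker counts, etc. ...
    },
}

api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="default", plural="mpijobs", body=mpijob,
)
```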

Scheduling Algorithms

| Algorithm              | Elastic | Reference |
| ---------------------- | ------- | --------- |
| FIFO                   |         |           |
| Elastic-FIFO (default) | ✔️      |           |
| SRJF                   |         |           |
| Elastic-SRJF           | ✔️      |           |
| Tiresias               |         | Gu, Juncheng, et al. "Tiresias: A GPU Cluster Manager for Distributed Deep Learning." 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), 2019. https://www.usenix.org/conference/nsdi19/presentation/gu |
| Elastic-Tiresias       | ✔️      | Wu, Yidi, et al. "Elastic Deep Learning in Multi-Tenant GPU Clusters." IEEE Transactions on Parallel and Distributed Systems, 2021. https://ieeexplore.ieee.org/abstract/document/9373916 |
| FfDL Optimizer         | ✔️      | Saxena, Vaibhav, et al. "Effective Elastic Scaling of Deep Learning Workloads." 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), IEEE, 2020. https://ieeexplore.ieee.org/abstract/document/9285954 |
| AFS-L                  | ✔️      | Hwang, Changho, et al. "Elastic Resource Sharing for Distributed Deep Learning." NSDI 21, 2021. https://www.usenix.org/system/files/nsdi21-hwang.pdf |
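
For intuition about how these policies differ, SRJF favors the job with the least estimated remaining work, while the elastic variants can also grow or shrink each job's GPU count instead of making all-or-nothing grants. The toy sketch below illustrates plain (non-elastic) SRJF ordering; it is not Voda's implementation, and the remaining-work estimates are assumed to come from the user or a profiler.

```python
# Toy SRJF ordering sketch (not Voda's implementation): grant GPUs to
# the jobs with the least estimated remaining work first.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    remaining_time: float  # assumed user- or profiler-supplied estimate
    gpus_requested: int

def srjf_allocate(jobs, total_gpus):
    """Return {job name: granted GPUs} under plain (non-elastic) SRJF."""
    allocation, free = {}, total_gpus
    for job in sorted(jobs, key=lambda j: j.remaining_time):
        granted = job.gpus_requested if job.gpus_requested <= free else 0
        allocation[job.name] = granted
        free -= granted
    return allocation

jobs = [Job("resnet", 12.0, 4), Job("bert", 3.5, 8), Job("gan", 7.0, 2)]
print(srjf_allocate(jobs, total_gpus=10))  # bert: 8, gan: 2, resnet: 0
```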

Docker Images

Prometheus Metrics Exposed

Related Projects

Reference

T. -T. Hsieh and C. -R. Lee, "Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Clusters," 2023 IEEE International Conference on Cloud Engineering (IC2E), Boston, MA, USA, 2023, pp. 131-140, doi: 10.1109/IC2E59103.2023.00023. https://ieeexplore.ieee.org/document/10305838

@INPROCEEDINGS{10305838,
  author={Hsieh, Tsung-Tso and Lee, Che-Rung},
  booktitle={2023 IEEE International Conference on Cloud Engineering (IC2E)},
  title={Voda: A GPU Scheduling Platform for Elastic Deep Learning in Kubernetes Clusters},
  year={2023},
  pages={131-140},
  doi={10.1109/IC2E59103.2023.00023}}