AllReduce Training Using DLRover on Public Cloud

This document explains how to run a DLRover elastic job with torchrun on a public cloud, namely Alibaba Cloud Container Service for Kubernetes (ACK).

Preliminary

Install Go 1.18.
Create a Kubernetes cluster on ACK.
Configure cluster credentials on your local computer.
Create NAS storage and mount it to the cluster.

If you do not have a Kubernetes cluster in the cloud, you can instead start a local Kubernetes cluster by following the Minikube start guide.
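Before continuing, it can help to confirm that the command-line tools used in this document are installed. The following is a minimal sketch, not part of DLRover: `missing_tools` is a hypothetical helper name, and the tool list in the usage comment is an assumption based on the commands that appear later in this document.

```shell
# Print the name of every tool from the argument list that is not on PATH.
# `missing_tools` is a hypothetical helper, not something DLRover ships.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Usage (assumed tool list; adjust to your environment):
#   missing_tools go git make kubectl
```

An empty result means every listed tool was found.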

Deploy the ElasticJob CRD on the Kubernetes Cluster

Clone the repo to your host.

git clone git@github.com:intelligent-machine-learning/dlrover.git

Deploy the controller on the cluster.

cd dlrover/dlrover/go/operator/
make deploy IMG=easydl/elasticjob-controller:master  # Go 1.18

Grant the DLRover master permission to access CRDs.

kubectl -n dlrover apply -f config/manifests/bases/default-role.yaml

Submit a Job

Submit a job to train a CNN model on the MNIST dataset. The manifest path below is relative to the repository root.

kubectl -n dlrover apply -f examples/pytorch/mnist/elastic_job.yaml

Check the job status.

kubectl -n dlrover get elasticjob torch-mnist

NAME          PHASE     AGE
torch-mnist   Running   19h

Check the Pod status.

kubectl -n dlrover get pods -l elasticjob-name=torch-mnist

NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          26s
torch-mnist-edljob-worker-0             1/1     Running   0          29s
torch-mnist-edljob-worker-1             1/1     Running   0          32s
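When scripting around this step, the table above can be reduced to a single number. The sketch below counts worker Pods in the Running state. `count_running_workers` is a hypothetical helper name, and it assumes worker Pod names end in `-worker-<index>` as in the sample output above.

```shell
# Count worker Pods whose STATUS column is "Running".
# Reads the table printed by `kubectl get pods` on stdin.
# Assumption: worker Pod names end in "-worker-<index>", as in the sample output.
count_running_workers() {
    awk '$1 ~ /-worker-[0-9]+$/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Usage:
#   kubectl -n dlrover get pods -l elasticjob-name=torch-mnist | count_running_workers
```

The header row and the master Pod are skipped because their first column does not match the worker name pattern.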

View the training log of a worker with the following command.

kubectl -n dlrover logs torch-mnist-edljob-worker-0

loss = 0.016916541382670403, step = 400
Save checkpoint.
loss = 0.05502168834209442, step = 420
loss = 0.13794168829917908, step = 440
loss = 0.023234723135828972, step = 460
Test model after epoch 18
Test the model ...

Test set: Average loss: 0.0499, Accuracy: 9828/10000 (98%)
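To track progress without reading the whole log, the most recent loss line can be pulled out with a small filter. This is a sketch that assumes the log format shown above (`loss = <value>, step = <n>`); `latest_loss` is a hypothetical helper name.

```shell
# Print "<step> <loss>" for the last "loss = ..., step = ..." line on stdin.
# Assumption: log lines follow the format shown in the sample output.
latest_loss() {
    awk '$1 == "loss" && $2 == "=" {
             sub(/,$/, "", $3)      # drop the trailing comma after the loss value
             loss = $3; step = $6
         }
         END { if (step != "") print step, loss }'
}

# Usage:
#   kubectl -n dlrover logs torch-mnist-edljob-worker-0 | latest_loss
```

Lines that do not start with `loss =`, such as checkpoint and test messages, are ignored.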

Test Fault-tolerance

Delete a worker.

kubectl -n dlrover delete pod torch-mnist-edljob-worker-1

Listing the Pods again shows that only one worker remains.

NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m12s
torch-mnist-edljob-worker-0             1/1     Running   0          1m15s

After a while, DLRover restores the deleted worker.

NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m52s
torch-mnist-edljob-worker-0             1/1     Running   0          1m55s
torch-mnist-edljob-worker-1             1/1     Running   0          32s
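Instead of re-running the listing by hand, the recovery can be awaited with a polling loop. The following is a sketch under stated assumptions: `wait_for_workers` is a hypothetical helper, the namespace and label selector are the ones used earlier in this document, and the default attempt count and delay are arbitrary values.

```shell
# Poll until at least $1 worker Pods are Running.
# $2 = maximum attempts (default 30), $3 = seconds between attempts (default 10).
# Both defaults are arbitrary assumptions; tune them for your cluster.
wait_for_workers() {
    want=$1
    attempts=${2:-30}
    delay=${3:-10}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Count Pods named "*-worker-<index>" whose STATUS column is Running.
        running=$(kubectl -n dlrover get pods -l elasticjob-name=torch-mnist --no-headers 2>/dev/null |
            awk '$1 ~ /-worker-[0-9]+$/ && $3 == "Running" { n++ } END { print n + 0 }')
        if [ "$running" -ge "$want" ]; then
            echo "workers running: $running"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out waiting for $want workers" >&2
    return 1
}

# Usage: wait for both workers of the sample job to be back.
#   wait_for_workers 2
```

The function returns a non-zero status on timeout, so it can gate later steps in a script.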
