AllReduce Training Using DLRover on Public Cloud

This document explains how to run a DLRover elastic job with torchrun on a public cloud, using Alibaba Cloud Container Service for Kubernetes (ACK) as the example.

Preliminary

  • Install Go 1.18.
  • Create a Kubernetes cluster on ACK.
  • Configure cluster credentials on your local computer.
  • Create a NAS storage and mount it to the cluster.

If you do not have a Kubernetes cluster in the cloud, you can start a local Kubernetes cluster with Minikube (minikube start).
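Before moving on, it can help to confirm that the tools used in the following steps are on your PATH. The snippet below is a minimal sketch, not part of DLRover; the helper name check_tools is hypothetical, and the tool list in the example is simply the set of commands this guide invokes.

```shell
# Report which of the given command-line tools are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (not run here, since it needs a configured cluster):
#   check_tools go git make kubectl && kubectl cluster-info
```

If every tool is found, kubectl cluster-info then confirms that the local credentials reach the cluster.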

Deploy the ElasticJob CRD on the Kubernetes Cluster

  1. Clone the repo to your host.
git clone git@github.com:intelligent-machine-learning/dlrover.git
  2. Deploy the controller on the cluster.
cd dlrover/dlrover/go/operator/
make deploy IMG=easydl/elasticjob-controller:master  # requires Go 1.18
  3. Grant permission for the DLRover master to access CRDs.
kubectl -n dlrover apply -f config/manifests/bases/default-role.yaml
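After the three steps above, one way to check that the deployment took effect is to look for the ElasticJob CRD among the installed CRDs. This is a sketch under an assumption: it expects the CRD name to contain the substring "elasticjob", which the source does not state explicitly. The helper name elasticjob_crd_installed is hypothetical.

```shell
# Succeed if any installed CRD name contains "elasticjob".
# The substring is an assumption; adjust it if your CRD is named differently.
elasticjob_crd_installed() {
  kubectl get crd -o name | grep -qi 'elasticjob'
}

# Example (not run here, since it needs a cluster):
#   elasticjob_crd_installed && kubectl -n dlrover get pods
```

The follow-up command lists the Pods in the dlrover namespace used throughout this guide.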

Submit a Job

  • Submit a job to train a CNN model on the MNIST dataset.
kubectl -n dlrover apply -f examples/pytorch/mnist/elastic_job.yaml
  • Check the job status.
kubectl -n dlrover get elasticjob torch-mnist
NAME          PHASE     AGE
torch-mnist   Running   19h
  • Check the Pod status.
kubectl -n dlrover get pods -l elasticjob-name=torch-mnist
NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          26s
torch-mnist-edljob-worker-0             1/1     Running   0          29s
torch-mnist-edljob-worker-1             1/1     Running   0          32s
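Rather than re-running the status command by hand, the check can be scripted. The sketch below polls the job until its phase reads Running. It assumes the column layout shown above (NAME, PHASE, AGE); the helper names job_phase and wait_for_running are hypothetical.

```shell
# Print the PHASE column for one ElasticJob (second column of the table output).
job_phase() {
  kubectl -n "$1" get elasticjob "$2" --no-headers | awk '{ print $2 }'
}

# Poll until the job reports Running.
# Arguments: namespace, job name, max attempts (default 30), delay in seconds (default 10).
wait_for_running() {
  ns=$1
  job=$2
  tries=${3:-30}
  delay=${4:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if [ "$(job_phase "$ns" "$job")" = "Running" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (not run here, since it needs a cluster):
#   wait_for_running dlrover torch-mnist && echo "job is running"
```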

View the training log of a worker with:

kubectl -n dlrover logs torch-mnist-edljob-worker-0
loss = 0.016916541382670403, step = 400
Save checkpoint.
loss = 0.05502168834209442, step = 420
loss = 0.13794168829917908, step = 440
loss = 0.023234723135828972, step = 460
Test model after epoch 18
Test the model ...

Test set: Average loss: 0.0499, Accuracy: 9828/10000 (98%)
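To follow the loss curve without the surrounding messages, the log can be filtered down to step and loss pairs. This is a small sketch that assumes the log lines keep the exact "loss = ..., step = ..." format shown above; the helper name loss_by_step is hypothetical.

```shell
# Read worker logs on stdin and print "<step> <loss>" for each loss line.
# Lines that do not match the assumed format are dropped.
loss_by_step() {
  sed -n 's/^loss = \([0-9.e-]*\), step = \([0-9]*\).*$/\2 \1/p'
}

# Example (not run here, since it needs a cluster):
#   kubectl -n dlrover logs torch-mnist-edljob-worker-0 | loss_by_step
```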

Test Fault-tolerance

  • Delete a worker.
kubectl -n dlrover delete pod torch-mnist-edljob-worker-1

Listing the Pods again shows that only one worker remains.

NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m12s
torch-mnist-edljob-worker-0             1/1     Running   0          1m15s

After a while, DLRover restores the deleted worker.

NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m52s
torch-mnist-edljob-worker-0             1/1     Running   0          1m55s
torch-mnist-edljob-worker-1             1/1     Running   0          32s
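The recovery can also be checked from a script by counting the workers in the Running state. The sketch below assumes the Pod naming and column layout shown above (worker Pods contain "-worker-" in their name, STATUS is the third column); the helper name running_workers is hypothetical.

```shell
# Count worker Pods of a job that are in the Running state.
# Arguments: namespace, job name.
running_workers() {
  kubectl -n "$1" get pods -l "elasticjob-name=$2" --no-headers |
    awk '$1 ~ /-worker-/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Example (not run here, since it needs a cluster):
#   running_workers dlrover torch-mnist
```

To watch the Pods change live instead, add the -w flag to the listing command: kubectl -n dlrover get pods -l elasticjob-name=torch-mnist -w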