This document explains how to run a DLRover elastic job using torchrun on a public cloud, namely Alibaba Cloud Container Service for Kubernetes (ACK).
- Install Go 1.18.
- Create a Kubernetes cluster on ACK.
- Configure cluster credentials on your local computer.
- Create a NAS storage and mount it to the cluster.
If you do not have a Kubernetes cluster in the cloud, you can also start a local Kubernetes cluster by running minikube start.
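Before starting, it can help to confirm that the command-line tools used in the steps below are installed. The following is a minimal sketch: missing_tools is a hypothetical helper, not part of DLRover, and the tool list (go, kubectl, git, make) is inferred from the commands in this document.

```shell
#!/bin/sh
# missing_tools is a hypothetical helper (an assumption, not a DLRover command).
# It prints each given tool that is not found on PATH, one per line,
# and returns non-zero if at least one tool is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example (not run here):
#   missing_tools go kubectl git make
```

An empty result means every listed tool was found. The sketch does not check versions, so the Go 1.18 requirement still has to be verified with go version.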
- Clone the repo to your host.
git clone git@github.com:intelligent-machine-learning/dlrover.git
- Deploy the controller on the cluster.
cd dlrover/dlrover/go/operator/
make deploy IMG=easydl/elasticjob-controller:master # requires Go 1.18
- Grant permission for the DLRover master to access CRDs.
kubectl -n dlrover apply -f config/manifests/bases/default-role.yaml
- Submit a job to train a CNN model on the MNIST dataset.
kubectl -n dlrover apply -f examples/pytorch/mnist/elastic_job.yaml
- Check the job status.
kubectl -n dlrover get elasticjob torch-mnist
NAME PHASE AGE
torch-mnist Running 19h
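To script the status check instead of reading it by eye, the PHASE column can be extracted from the table that kubectl prints. The following is a minimal sketch: job_phase is a hypothetical helper, and it assumes the NAME PHASE AGE column layout shown above.

```shell
#!/bin/sh
# job_phase is a hypothetical helper (an assumption, not part of kubectl or DLRover).
# It reads the table printed by "kubectl get elasticjob" on stdin and prints
# the PHASE column of the row whose NAME equals the first argument.
job_phase() {
    awk -v name="$1" 'NR > 1 && $1 == name { print $2 }'
}

# Example (not run here):
#   kubectl -n dlrover get elasticjob torch-mnist | job_phase torch-mnist
```

The helper prints nothing if no row matches, so a caller can treat empty output as "job not found".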
- Check the Pod status.
kubectl -n dlrover get pods -l elasticjob-name=torch-mnist
NAME READY STATUS RESTARTS AGE
elasticjob-torch-mnist-dlrover-master 1/1 Running 0 26s
torch-mnist-edljob-worker-0 1/1 Running 0 29s
torch-mnist-edljob-worker-1 1/1 Running 0 32s
We can view the training log of a worker with:
kubectl -n dlrover logs torch-mnist-edljob-worker-0
loss = 0.016916541382670403, step = 400
Save checkpoint.
loss = 0.05502168834209442, step = 420
loss = 0.13794168829917908, step = 440
loss = 0.023234723135828972, step = 460
Test model after epoch 18
Test the model ...
Test set: Average loss: 0.0499, Accuracy: 9828/10000 (98%)
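To follow the training loss over time, the loss lines can be filtered out of the worker log. The following is a minimal sketch: loss_by_step is a hypothetical helper, and it assumes the "loss = X, step = N" line format shown in the log above.

```shell
#!/bin/sh
# loss_by_step is a hypothetical helper (an assumption, not part of DLRover).
# It reads a worker log on stdin, keeps only lines of the form
# "loss = X, step = N", and prints them as "N X" pairs, one per line.
# All other lines (checkpoint and test messages) are dropped.
loss_by_step() {
    sed -n 's/^loss = \([^,]*\), step = \([0-9][0-9]*\).*/\2 \1/p'
}

# Example (not run here):
#   kubectl -n dlrover logs torch-mnist-edljob-worker-0 | loss_by_step
```

The step-first output order makes the result easy to sort or to feed into a plotting tool.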
- Delete a worker.
kubectl -n dlrover delete pod torch-mnist-edljob-worker-1
Then, we can see that there is only one worker.
NAME READY STATUS RESTARTS AGE
elasticjob-torch-mnist-dlrover-master 1/1 Running 0 1m12s
torch-mnist-edljob-worker-0 1/1 Running 0 1m15s
After a while, DLRover restores the deleted worker.
NAME READY STATUS RESTARTS AGE
elasticjob-torch-mnist-dlrover-master 1/1 Running 0 1m52s
torch-mnist-edljob-worker-0 1/1 Running 0 1m55s
torch-mnist-edljob-worker-1 1/1 Running 0 32s
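The recovery shown above can also be checked automatically by polling the Pod list until the expected number of workers is Running again. The following is a minimal sketch: count_running_workers and wait_for_workers are hypothetical helpers, and they assume the Pod naming ("-worker-" in the name) and the column layout shown in the output above.

```shell
#!/bin/sh
# Both functions are hypothetical helpers (assumptions, not part of DLRover).

# Read the table printed by "kubectl get pods" on stdin and print the number
# of rows whose NAME contains "-worker-" and whose STATUS is "Running".
count_running_workers() {
    awk 'NR > 1 && $1 ~ /-worker-/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# wait_for_workers WANT TRIES DELAY CMD...
# Run CMD (a command that lists Pods) up to TRIES times, sleeping DELAY seconds
# between attempts. Return 0 as soon as at least WANT workers are Running,
# or 1 if that never happens.
wait_for_workers() {
    want=$1
    tries=$2
    delay=$3
    shift 3
    i=0
    while [ "$i" -lt "$tries" ]; do
        have=$("$@" | count_running_workers)
        if [ "$have" -ge "$want" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (not run here): wait up to 30 x 10 seconds for two workers.
#   wait_for_workers 2 30 10 kubectl -n dlrover get pods -l elasticjob-name=torch-mnist
```

Passing the listing command as arguments keeps the polling logic independent of the namespace and label selector, which may differ in other jobs.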