This document shows how to run ElasticDL jobs on your personal computer using Minikube.
- Install Minikube, preferably >= v1.11.0, following the installation guide. Minikube runs a single-node Kubernetes cluster in a virtual machine on your personal computer.
- Install Docker CE, preferably >= 18.x, following the installation guide. We use Docker to build images containing user-defined models and the ElasticDL framework.
- Install Python, preferably >= 3.6, because the ElasticDL command-line tool is written in Python.
Among the machine learning toolkits that ElasticDL can work with, TensorFlow is the most tested and widely used. In this tutorial, we use a model from the model zoo directory, defined using the TensorFlow Keras API. To write your own models, please refer to this tutorial.
We use the MNIST dataset in this tutorial. The dynamic data partitioning mechanism of ElasticDL requires that the training data files are in the RecordIO format. To download the MNIST dataset and convert it into RecordIO files, please run the following command.
```bash
docker run --rm -it \
  -v $HOME/.keras:/root/.keras \
  -v $PWD:/work \
  -w /work \
  elasticdl/elasticdl:dev bash -c "/scripts/gen_dataset.sh data"
```
After this command finishes, we will see the generated dataset files in the directory `./data`.
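To verify the conversion, we can list the generated files. The `train` and `test` subdirectories under `./data/mnist` are the paths that the training commands later in this tutorial expect:

```bash
# Recursively list the generated RecordIO files; expect subdirectories
# such as ./data/mnist/train and ./data/mnist/test.
ls -R ./data
```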
The following command starts a Kubernetes cluster locally using Minikube. It uses hyperkit, a hypervisor that comes with macOS, to create the virtual machine that hosts the cluster. If you want, please feel free to use other hypervisors, including VirtualBox.
```bash
minikube start --vm-driver=hyperkit \
  --cpus 2 --memory 6144 --disk-size=50gb \
  --mount=true --mount-string="./data:/data"
eval $(minikube docker-env)
```
The command-line option `--mount-string` exposes the directory `./data` on the host to Minikube as `/data`, which we can later bind-mount into containers running on the Kubernetes cluster.
The command `minikube docker-env` returns a set of Bash environment variable exports that configure your local environment to reuse the Docker daemon inside the Minikube instance.
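Before moving on, it is worth a quick sanity check that the cluster is up and that the current shell now talks to Minikube's Docker daemon:

```bash
# The Minikube node should report the Ready status.
kubectl get nodes
# After eval $(minikube docker-env), this should print "minikube",
# the hostname of the Docker daemon inside the virtual machine.
docker info --format '{{.Name}}'
```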
The following command enables the RBAC (role-based access control) settings that ElasticDL requires.

```bash
kubectl apply -f \
    https://raw.githubusercontent.com/sql-machine-learning/elasticdl/develop/elasticdl/manifests/elasticdl-rbac.yaml
```
If you happen to live in a region where `raw.githubusercontent.com` is banned, you might want to Git-clone the above repository to get the YAML file.
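A sketch of this workaround; the local path mirrors the URL above:

```bash
# Clone the repository and apply the RBAC manifest from the local copy.
git clone https://github.com/sql-machine-learning/elasticdl
kubectl apply -f elasticdl/elasticdl/manifests/elasticdl-rbac.yaml
```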
The following command installs the command-line tool `elasticdl`, which talks to the Kubernetes cluster and operates ElasticDL jobs.

```bash
pip install elasticdl_client
```
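To confirm the installation, print the tool's help message (assuming the `elasticdl` entry point supports the conventional `--help` flag, as most Python CLIs do):

```bash
# Lists the available subcommands, such as zoo and train.
elasticdl --help
```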
Kubernetes runs Docker containers, so we need to package the training system, which consists of the user-defined model, the ElasticDL trainer, and all their dependencies, into a Docker image.
In this tutorial, we use a predefined model in the ElasticDL repository. To retrieve the source code, please run the following command.
```bash
git clone https://github.com/sql-machine-learning/elasticdl
```

Model definitions are in the directory `elasticdl/model_zoo`.
The following commands build the Docker image `elasticdl:mnist_ps`:

```bash
cd elasticdl
elasticdl zoo init --model_zoo=model_zoo
elasticdl zoo build --image=elasticdl:mnist_ps .
```
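Because we ran `eval $(minikube docker-env)` earlier, the image is built by the Docker daemon inside Minikube, so the cluster can use it without pulling from a registry. We can confirm that it is there:

```bash
# The image must be visible to Minikube's Docker daemon because the
# training job below uses --image_pull_policy=Never.
docker images elasticdl
```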
We have not released ElasticDL packages with AllReduce support yet, so we need to build them manually. First, build the image `elasticdl:dev_allreduce` using the following script.

```bash
scripts/travis/build_images.sh
```

Then use this image to build packages with AllReduce support.

```bash
scripts/docker_build_wheel.sh
```
After this, we can build the AllReduce training image `elasticdl:mnist_allreduce` with the model definitions in `model_zoo`.

```bash
elasticdl zoo init \
  --base_image=elasticdl:dev_allreduce \
  --model_zoo=model_zoo \
  --local_pkg_dir=./build
elasticdl zoo build --image=elasticdl:mnist_allreduce .
```
The following command submits a training job:
```bash
elasticdl train \
  --image_name=elasticdl:mnist_ps \
  --model_zoo=model_zoo \
  --model_def=mnist.mnist_functional_api.custom_model \
  --training_data=/data/mnist/train \
  --validation_data=/data/mnist/test \
  --num_epochs=2 \
  --master_resource_request="cpu=0.2,memory=1024Mi" \
  --master_resource_limit="cpu=1,memory=2048Mi" \
  --worker_resource_request="cpu=0.4,memory=1024Mi" \
  --worker_resource_limit="cpu=1,memory=2048Mi" \
  --ps_resource_request="cpu=0.2,memory=1024Mi" \
  --ps_resource_limit="cpu=1,memory=2048Mi" \
  --minibatch_size=64 \
  --num_minibatches_per_task=2 \
  --num_ps_pods=1 \
  --num_workers=1 \
  --evaluation_steps=50 \
  --job_name=test-mnist \
  --image_pull_policy=Never \
  --volume="host_path=/data,mount_path=/data" \
  --need_elasticdl_job_service=true \
  --distribution_strategy=ParameterServerStrategy
```
We exposed the directory `./data` to Minikube in a previous section. Here, the option `--volume="host_path=/data,mount_path=/data"` bind-mounts it into the containers/pods.
The above command starts a Kubernetes job with only one container, or pod (the two terms are interchangeable in this document): the master pod.
The option `--num_workers=1` tells the master container to start a worker pod. The option `--distribution_strategy=ParameterServerStrategy` chooses the parameter server strategy for the distributed stochastic gradient descent (SGD) algorithm, and the option `--num_ps_pods=1` tells the master to start one parameter server pod. For more details about the parameter server strategy, please refer to the design doc.
After the job submission, we can run the command `kubectl get pods` to list the related pods.

```
NAME                            READY   STATUS    RESTARTS   AGE
elasticdl-test-mnist-master     1/1     Running   0          33s
elasticdl-test-mnist-ps-0       1/1     Running   0          30s
elasticdl-test-mnist-worker-0   1/1     Running   0          30s
```
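To watch the job as it runs, we can stream the master pod's log with the standard `-f` (follow) flag of `kubectl logs`:

```bash
# Stream new log lines as the job progresses; press Ctrl-C to stop.
kubectl logs -f elasticdl-test-mnist-master
```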
In particular, we can trace the training progress through the evaluation metrics reported in this log. The following command filters for the metrics, which change over iterations.

```bash
kubectl logs elasticdl-test-mnist-master | grep "Evaluation"
```
The output looks like the following.
```
[2020-04-14 02:46:21,836] [INFO] [master.py:192:prepare] Evaluation service started
[2020-04-14 02:46:40,750] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=50]: {'accuracy': 0.21933334}
[2020-04-14 02:46:53,827] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=100]: {'accuracy': 0.5173333}
[2020-04-14 02:47:07,529] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=150]: {'accuracy': 0.6253333}
[2020-04-14 02:47:23,251] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=200]: {'accuracy': 0.752}
[2020-04-14 02:47:35,746] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=250]: {'accuracy': 0.77}
[2020-04-14 02:47:52,082] [INFO] [master.py:249:_stop] Evaluation service stopped
```
The logs show that the accuracy reaches 0.77 after 250 steps.
The following command submits a training job that uses the AllReduce distribution strategy instead:

```bash
elasticdl train \
  --image_name=elasticdl:mnist_allreduce \
  --model_zoo=model_zoo \
  --model_def=mnist.mnist_functional_api.custom_model \
  --training_data=/data/mnist/train \
  --num_epochs=1 \
  --master_resource_request="cpu=0.2,memory=1024Mi" \
  --master_resource_limit="cpu=1,memory=2048Mi" \
  --worker_resource_request="cpu=0.4,memory=1024Mi" \
  --worker_resource_limit="cpu=1,memory=2048Mi" \
  --minibatch_size=64 \
  --num_minibatches_per_task=2 \
  --num_workers=2 \
  --job_name=test-mnist-allreduce \
  --image_pull_policy=Never \
  --volume="host_path=/data,mount_path=/data" \
  --need_elasticdl_job_service=true \
  --distribution_strategy=AllreduceStrategy
```
After the job submission, we can run the command `kubectl get pods` to list the related pods.

```
NAME                                      READY   STATUS    RESTARTS   AGE
elasticdl-test-mnist-allreduce-master     1/1     Running   0          102s
elasticdl-test-mnist-allreduce-worker-0   1/1     Running   0          98s
elasticdl-test-mnist-allreduce-worker-1   1/1     Running   0          98s
```
Then, we can view the loss in the worker log using the following command.

```bash
kubectl logs elasticdl-test-mnist-allreduce-worker-0 | grep Loss
```
The output looks like the following.
```
[2020-08-27 13:22:47,930] [INFO] [worker.py:627:_process_minibatch] Loss = 2.686038017272949, steps = 2
[2020-08-27 13:23:17,254] [INFO] [worker.py:627:_process_minibatch] Loss = 0.08301685750484467, steps = 100
[2020-08-27 13:23:47,887] [INFO] [worker.py:627:_process_minibatch] Loss = 0.0823458805680275, steps = 200
[2020-08-27 13:24:19,067] [INFO] [worker.py:627:_process_minibatch] Loss = 0.14079990983009338, steps = 300
```
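When we finish experimenting, we can tear down the local cluster:

```bash
# Stop the Minikube virtual machine; run `minikube delete` instead to
# remove it entirely, including the images built in this tutorial.
minikube stop
```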