This document describes how to contribute to DLRover.
- Fork the DLRover repository into your own namespace.
git clone [email protected]:intelligent-machine-learning/dlrover.git
cd dlrover
git remote rename origin upstream
git remote add origin ${YOUR_OWN_REPO}
git checkout -b {DEV-BRANCH}
git push -u origin {DEV-BRANCH}
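Before opening the pull request, it is usually worth syncing your branch with upstream first; a minimal sketch, assuming the upstream default branch is master:
git fetch upstream
git rebase upstream/master
git push -f origin {DEV-BRANCH}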
Then, you can create a pull request on GitHub. If you have modified code in the repo, you need to check the code style with pre-commit and run the unit tests, as in the following steps.
docker run -v `pwd`:/dlrover -it easydl/dlrover:ci /bin/bash
cd /dlrover
pip install deprecated kubernetes pynvml psutil ray torch && sh scripts/build_wheel.sh
pre-commit run -a
python -m pytest dlrover/python/tests
python -m pytest dlrover/trainer/tests
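During development you usually do not need the full check on every change; a narrower sketch (the --files list and the -k keyword below are only examples):
pre-commit run --files $(git diff --name-only upstream/master)
python -m pytest dlrover/python/tests -k "master"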
- Go (1.18 or later)
Create a symbolic link inside your GOPATH to the location where you checked out the code:
mkdir -p $(go env GOPATH)/src/github.com/intelligent-machine-learning
ln -sf ${GIT_TRAINING} $(go env GOPATH)/src/github.com/intelligent-machine-learning/dlrover
- GIT_TRAINING should be the location where you checked out https://github.com/intelligent-machine-learning/dlrover
Install dependencies
go mod vendor
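A quick sanity check of the Go environment before building the operator:
go version    # should report go1.18 or later
go env GOPATH # the symlink above must live under this path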
Running the operator locally (as opposed to deploying it on a K8s cluster) is convenient for debugging/development.
Install minikube on your laptop.
To enable GPU support, follow these docs:
- Install cri-dockerd and the NVIDIA Container Toolkit
- Enable the k8s-device-plugin
- Test your GPU with the official gpu-pod
It is highly recommended to have more than one GPU resource in your workspace. If you only have a single GPU, there is a workaround to divide it into multiple schedulable resources: enable shared access to GPUs with CUDA time-slicing. Check the doc and modify your nvidia-k8s-device-plugin, or simply update the plugin with helm using the following command (see more details about getting GPU resources):
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--version=0.13.0 \
--namespace nvidia-device-plugin \
--create-namespace \
--set-file config.map.config=./dlrover/go/operator/config/gpu/nvidia-device-plugin-gpu-shared.yaml
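For reference, the config file passed via --set-file typically contains a CUDA time-slicing spec along the following lines; this is only a sketch, and the actual nvidia-device-plugin-gpu-shared.yaml in the repo may differ (e.g. in the replicas count):
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2 # expose the single physical GPU as 2 schedulable GPUs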
Then check your GPU resources by running:
$ kubectl get nodes -ojson | jq '.items[].status.capacity'
>
{
  "cpu": "8",
  "ephemeral-storage": "229336240Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "32596536Ki",
  "nvidia.com/gpu": "2", # time-slicing created one more GPU resource on your laptop
  "pods": "110"
}
Create this deployment to test your GPU resources.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-gpu
spec:
  replicas: 2 # replace this with the number of GPU resources you have
  selector:
    matchLabels:
      app: test-gpu
  template:
    metadata:
      labels:
        app: test-gpu
    spec:
      containers:
        - name: cuda-container
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
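Apply the manifest and list the Pods; the file name below is only an example, and the output that follows is roughly what you should see:
kubectl apply -f test-gpu.yaml
kubectl get pods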
NAME READY STATUS RESTARTS AGE
dlrover-controller-manager-6c464d59f8-np7tg 2/2 Running 0 55m
test-gpu-59c9677b99-qtxbv 0/1 Completed 2 (24s ago) 27s
test-gpu-59c9677b99-sxd6n 0/1 Completed 2 (24s ago) 27s
$ kubectl logs test-gpu-59c9677b99-qtxbv
>
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
After preparing minikube, you can start it with the command:
minikube start --vm-driver=docker --cpus 6 --memory 6144
# If you wish to run minikube with GPUs, the recommended command is as follows (root privilege required).
minikube start --driver=none --container-runtime='containerd' --apiserver-ips 127.0.0.1 \
--apiserver-name localhost --cpus 6 --memory 6144
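A quick check that the cluster is up before going further:
minikube status
kubectl get nodes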
We can configure the operator to run locally using the configuration available in your kubeconfig to communicate with a K8s cluster. Set your environment:
export KUBECONFIG=$(echo ~/.kube/config)
export KUBEFLOW_NAMESPACE=$(your_namespace)
- KUBEFLOW_NAMESPACE is used when deploying on Kubernetes; we use this variable to create other resources (e.g. the resource lock) inside the same namespace. It is optional and falls back to the default namespace if not set.
We can run the ElasticJob in the terminal or deploy the controller with a docker image.
- Run the controller in the terminal.
cd dlrover/go/operator
make install
make run
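You can then verify that the ElasticJob CRD was installed; the grep pattern below is only an assumption about the CRD name:
kubectl get crd | grep -i elasticjob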
- Deploy the controller with Go 1.18.
make deploy IMG=easydl/elasticjob-controller:master
kubectl apply -f dlrover/go/operator/config/manifests/bases/default-role.yaml
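Then check that the controller has rolled out; the dlrover namespace and the Deployment name below are taken from other examples in this doc and may differ in your setup:
kubectl -n dlrover rollout status deployment/dlrover-controller-manager
kubectl -n dlrover get pods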
Build the master image from the code:
docker build -t easydl/dlrover-master:test -f docker/master.dockerfile .
Build the training image for PyTorch models:
docker build -t easydl/dlrover-train:test -f docker/pytorch/mnist.dockerfile .
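If you build the images against your host Docker daemon instead of minikube's (see the eval $(minikube docker-env) step below), you can also load them into minikube explicitly; a sketch using the image tags built above:
minikube image load easydl/dlrover-master:test
minikube image load easydl/dlrover-train:test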
We can set the training image (line 18) and the master image (line 42) in the debug job examples/pytorch/mnist/elastic_job.yaml.
Then, we can submit a job with the above images.
eval $(minikube docker-env)
kubectl -n dlrover apply -f examples/pytorch/mnist/elastic_test_job.yaml
Check the training nodes:
kubectl -n dlrover get pods
NAME READY STATUS RESTARTS AGE
elasticjob-torch-mnist-master 1/1 Running 0 2m47s
torch-mnist-edljob-chief-0 1/1 Running 0 2m42s
torch-mnist-edljob-worker-0 1/1 Running 0 2m42s
torch-mnist-edljob-worker-1 1/1 Running 0 2m42s
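You can also inspect the ElasticJob object itself; the elasticjob resource name and the torch-mnist job name are inferred from the Pods above and may differ:
kubectl -n dlrover get elasticjob
kubectl -n dlrover describe elasticjob torch-mnist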
Change the pip package version and the docker image tag when creating a new release.
On Ubuntu, the default go package appears to be gccgo-go, which has known problems (see the related issue), and the golang-go package is also quite old, so install Go from the official golang tarballs instead.
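A sketch of installing Go from the official tarball; the version below is only an example that satisfies the 1.18+ requirement:
wget https://go.dev/dl/go1.18.10.linux-amd64.tar.gz
sudo rm -rf /usr/local/go
sudo tar -C /usr/local -xzf go1.18.10.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
go version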