diff --git a/docs/how-to/installation/kubernetes.txt b/docs/how-to/installation/kubernetes.txt index c53e71ce82a..dbf6aa18c01 100644 --- a/docs/how-to/installation/kubernetes.txt +++ b/docs/how-to/installation/kubernetes.txt @@ -6,7 +6,7 @@ This document describes how to install Determined on `Kubernetes `__. The installation is performed using the -:download:`Determined Helm Chart `. For +:download:`Determined Helm Chart `. For general information about using Determined with Kubernetes, refer the :ref:`determined-on-kubernetes` guide. @@ -36,7 +36,7 @@ the following prerequisites are satisfied: ``fluent/fluent-bit:1.6`` Docker image from Docker Hub. You should also download a copy of the :download:`Determined Helm Chart -` and extract it on your local machine. +` and extract it on your local machine. If you do not yet have a Kubernetes cluster deployed and you want to use Determined in a public cloud environment, we recommend using a managed @@ -270,6 +270,22 @@ perform TLS termination in the load-balancer: serviceName: determined-master-service- servicePort: masterPort configured in values.yaml +Default Scheduler (Optional) +============================ + +Determined includes support for the `lightweight coscheduling plugin +`__, +which extends the default Kubernetes scheduler to provide gang +scheduling. This feature is currently in beta and is not enabled by +default. To activate the plugin, set the ``defaultScheduler`` field to +``coscheduler``. If the field is empty or doesn't exist, Determined will +use the default Kubernetes scheduler to schedule all experiments and +tasks. + +.. code:: yaml + + defaultScheduler: coscheduler + *********************** Installing Determined *********************** @@ -283,7 +299,7 @@ Determined run: helm install determined-helm-chart ``determined-helm-chart`` is a relative path to where the -:download:`Determined Helm Chart ` is +:download:`Determined Helm Chart ` is located. 
It may take a few minutes for all resources to come up. If you encounter issues during installation please follow our list of :ref:`useful kubectl commands `. Helm will diff --git a/docs/reference/helm-config.txt b/docs/reference/helm-config.txt index d7f7d3cd9cf..8daf87f3ab4 100644 --- a/docs/reference/helm-config.txt +++ b/docs/reference/helm-config.txt @@ -7,7 +7,7 @@ When installing :ref:`Determined on Kubernetes ` via Helm, the deployment should be configured by editing the ``values.yaml`` and ``Chart.yaml`` files in the :download:`Determined -Helm Chart `. +Helm Chart `. ************************* ``Chart.yaml`` Settings @@ -237,3 +237,10 @@ Helm Chart `. than a path as it does in the cluster configuration. This can be conveniently set at the command line using ``helm install --set-file logging.security.tls.certificate=``. + +- ``defaultScheduler``: Configures the default scheduler that + Determined will use. Currently only supports the ``coscheduler`` + option, which enables the `lightweight coscheduling plugin + `__. + Unless specified as ``coscheduler``, Determined will use the default + Kubernetes scheduler. diff --git a/docs/topic-guides/deployment/determined-on-kubernetes.txt b/docs/topic-guides/deployment/determined-on-kubernetes.txt index 9ea9ca6fb00..956fbe2bb6b 100644 --- a/docs/topic-guides/deployment/determined-on-kubernetes.txt +++ b/docs/topic-guides/deployment/determined-on-kubernetes.txt @@ -43,34 +43,16 @@ Kubernetes. Scheduling ========== -Determined on Kubernetes does not currently support the scheduling -policies that are available when deploying Determined on VMs. These -policies include: priority scheduling, fair sharing resources across -experiments, and gang-scheduling for distributed training. Determined -relies on Kubernetes to handle scheduling, which does not natively -support these scheduling policies. 
- -:ref:`Distributed training ` experiments that use -multiple pods require all pods to be scheduled and running in order to -make progress. Due to the lack of gang-scheduling in Kubernetes, when -running distributed training experiments it is possible to deadlock the -Kubernetes cluster such that none of the experiments will make any -progress. For example, if you have a cluster with three 4-GPU nodes, -scheduling an experiment that requires four such nodes will deadlock the -cluster. Three pods will start up on the available nodes and occupy all -of their GPUs while waiting for the fourth pod to launch before training -can start. Because the fourth pod will never start (due to insufficient -resources), the job will never make progress. Similarly, if you launch -two experiments simultaneously that both attempt to use 12 GPUs on a -cluster with only 12 GPUs, it is likely that Kubernetes will assign some -of the GPUs to one experiment and some GPUs to the other. Because -neither experiment will receive the resources it needs to begin -executing, the system will wait indefinitely. - -To avoid deadlocking your cluster, we recommend enabling the cluster -autoscaler if possible. If a potential deadlock is detected, a warning -will be displayed in the trial logs. Upon encountering a deadlock, users -should pause, cancel, or kill one or more of the deadlocked experiments. +By default, the Kubernetes scheduler does not support gang scheduling or +preemption. This can be problematic for distributed deep learning +workloads that require multiple pods to be scheduled before execution +starts. Determined includes built-in support for the `lightweight +coscheduling plugin +`__, +which extends the default Kubernetes scheduler to support gang +scheduling. The coscheduling plugin is not enabled by default. For more +details and instructions on how to enable the coscheduling plugin, refer +to :ref:`scheduling-on-kubernetes`. 
Dynamic Agents ============== diff --git a/docs/topic-guides/system-concepts/scheduling.txt b/docs/topic-guides/system-concepts/scheduling.txt index da28c3d3175..b15db9b9a0e 100644 --- a/docs/topic-guides/system-concepts/scheduling.txt +++ b/docs/topic-guides/system-concepts/scheduling.txt @@ -119,40 +119,69 @@ enabled: priority 1 then starts running. Once that experiment is complete, distributed training experiment with priority 2 runs. -******************************* - Gang-scheduling on Kubernetes -******************************* - -Kubernetes does not natively support gang-scheduling and its default -scheduler will schedule pods on a first-come-first-serve basis. This -approach is problematic if a user submits several multi-pod jobs at -once; first-come-first-serve could result in a cluster deadlock. -Determined is able to support gang-scheduling by using the lightweight -coscheduling plugin, which extends the Kubernetes scheduler and blocks -scheduling of pods unless there are enough resources for all the pods in -the job. To function, the plugin requires special labels to be set that -specify the amount of nodes that each job needs for execution. -Determined is able to automatically calculate and set these labels for -its experiments. - -Importantly, the coscheduling plugin does not work with cluster -autoscaling. Static node pools must be used to achieve gang-scheduling. -Also, while the plugin allocates resources to jobs based on their -priority, it does not support preemption. Any low priority task will be -able to finish before a higher priority task can begin running. -Additionally, there isn't an implementation of ``max_slots`` or -``max_concurrent_trials`` that would limit the resources of an -experiment, i.e. one for hyperparameter search. Lastly, Determined's -capability to automatically set pod labels is restricted to GPU -experiments; it is unable to do the same for CPU experiments or user -commands. 
If gang-scheduling is desired for these, it must be set -manually via the environment field in the config. For instance: - -.. code:: +.. _scheduling-on-kubernetes: + +************************** + Scheduling on Kubernetes +************************** + +By default, the Kubernetes scheduler does not perform gang scheduling or +support preemption of pods. While it does take pod priority into +account, it greedily schedules pods without consideration for the job +each pod belongs to. This can result in problematic behavior for deep +learning workloads, particularly for distributed training jobs that use +many GPUs. A distributed training job that uses multiple pods requires +all pods to be scheduled and running in order to make progress. Because +Kubernetes does not support gang scheduling by default, cluster +deadlocks can arise. For example, suppose that two experiments are +launched simultaneously that each require 16 GPUs on a cluster with only +16 GPUs. It is possible that Kubernetes will assign some GPUs to one +experiment and some GPUs to the other. Because neither experiment will +receive the resources it needs to begin executing, the system will wait +indefinitely. + +Determined addresses these problems through the use of the `lightweight +coscheduling plugin +`__, +which extends the Kubernetes scheduler to support priority-based gang +scheduling. To implement gang scheduling, the coscheduling plugin will +not schedule a pod unless there are enough available resources to also +schedule the rest of the pods in the same job. To function, the plugin +requires special labels to be set that specify the number of nodes that +each job needs for execution. Determined automatically calculates and +sets these labels for GPU experiments that it launches. + +The coscheduling plugin is in beta and is therefore not enabled by +default. To enable it, edit ``values.yaml`` in the Determined Helm chart +to set the ``defaultScheduler`` field to ``coscheduler``. 
+ +Importantly, the coscheduling plugin does not work with Kubernetes' +cluster autoscaling feature: static node pools must be used to achieve +gang scheduling. Also, while the plugin allocates resources to jobs +based on their priority, it does not support preemption. For example, if +the cluster is full of low-priority jobs and a new high-priority job is +submitted, the high-priority job will not be scheduled until one of the +low-priority jobs finishes. Additionally, there isn't an implementation +of ``max_slots`` or ``max_concurrent_trials`` that would limit the +resources of an experiment, e.g. one for hyperparameter search. Lastly, +Determined's capability to automatically set pod labels is restricted to +GPU experiments; Determined does not currently set labels for CPU +experiments or user commands. + +To enable gang scheduling with commands or CPU experiments, enable the +coscheduler in ``values.yaml`` and modify the experiment config to +contain the following: + +.. code:: yaml environment: pod_spec: metadata: labels: - pod-group.scheduling.sigs.k8s.io/name: determined - pod-group.scheduling.sigs.k8s.io/min-available: "2" + pod-group.scheduling.sigs.k8s.io/name: + pod-group.scheduling.sigs.k8s.io/min-available: <# of GPUs required> + spec: + schedulerName: coscheduler + +You can also use ``schedulerName: default-scheduler`` to use the default +Kubernetes scheduler. diff --git a/helm/charts/determined/Chart.yaml b/helm/charts/determined/Chart.yaml index d7c29f4fb81..e72058b99ec 100644 --- a/helm/charts/determined/Chart.yaml +++ b/helm/charts/determined/Chart.yaml @@ -1,7 +1,7 @@ apiVersion: v1 name: determined description: A Helm chart for Determined -version: "0.4.0" +version: "0.5.0" icon: https://github.com/determined-ai/determined/blob/master/determined-logo.png?raw=true home: https://github.com/determined-ai/determined.git