diff --git a/docs/how-to/installation/kubernetes.txt b/docs/how-to/installation/kubernetes.txt
index c53e71ce82a..dbf6aa18c01 100644
--- a/docs/how-to/installation/kubernetes.txt
+++ b/docs/how-to/installation/kubernetes.txt
@@ -6,7 +6,7 @@
This document describes how to install Determined on `Kubernetes
`__. The installation is performed using the
-:download:`Determined Helm Chart `. For
+:download:`Determined Helm Chart `. For
general information about using Determined with Kubernetes, refer the
:ref:`determined-on-kubernetes` guide.
@@ -36,7 +36,7 @@ the following prerequisites are satisfied:
``fluent/fluent-bit:1.6`` Docker image from Docker Hub.
You should also download a copy of the :download:`Determined Helm Chart
-` and extract it on your local machine.
+` and extract it on your local machine.
If you do not yet have a Kubernetes cluster deployed and you want to use
Determined in a public cloud environment, we recommend using a managed
@@ -270,6 +270,22 @@ perform TLS termination in the load-balancer:
serviceName: determined-master-service-
servicePort: masterPort configured in values.yaml
+Default Scheduler (Optional)
+============================
+
+Determined includes support for the `lightweight coscheduling plugin
+`__,
+which extends the default Kubernetes scheduler to provide gang
+scheduling. This feature is currently in beta and is not enabled by
+default. To activate the plugin, set the ``defaultScheduler`` field to
+``coscheduler``. If the field is empty or doesn't exist, Determined will
+use the default Kubernetes scheduler to schedule all experiments and
+tasks.
+
+.. code:: yaml
+
+ defaultScheduler: coscheduler
+
***********************
Installing Determined
***********************
@@ -283,7 +299,7 @@ Determined run:
helm install determined-helm-chart
``determined-helm-chart`` is a relative path to where the
-:download:`Determined Helm Chart ` is
+:download:`Determined Helm Chart ` is
located. It may take a few minutes for all resources to come up. If you
encounter issues during installation please follow our list of
:ref:`useful kubectl commands `. Helm will
diff --git a/docs/reference/helm-config.txt b/docs/reference/helm-config.txt
index d7f7d3cd9cf..8daf87f3ab4 100644
--- a/docs/reference/helm-config.txt
+++ b/docs/reference/helm-config.txt
@@ -7,7 +7,7 @@
When installing :ref:`Determined on Kubernetes `
via Helm, the deployment should be configured by editing the
``values.yaml`` and ``Chart.yaml`` files in the :download:`Determined
-Helm Chart `.
+Helm Chart `.
*************************
``Chart.yaml`` Settings
@@ -237,3 +237,10 @@ Helm Chart `.
than a path as it does in the cluster configuration. This can be
conveniently set at the command line using ``helm install
--set-file logging.security.tls.certificate=``.
+
+- ``defaultScheduler``: Configures the default scheduler that
+ Determined will use. Currently only supports the ``coscheduler``
+ option, which enables the `lightweight coscheduling plugin
+ `__.
+ Unless specified as ``coscheduler``, Determined will use the default
+ Kubernetes scheduler.
diff --git a/docs/topic-guides/deployment/determined-on-kubernetes.txt b/docs/topic-guides/deployment/determined-on-kubernetes.txt
index 9ea9ca6fb00..956fbe2bb6b 100644
--- a/docs/topic-guides/deployment/determined-on-kubernetes.txt
+++ b/docs/topic-guides/deployment/determined-on-kubernetes.txt
@@ -43,34 +43,16 @@ Kubernetes.
Scheduling
==========
-Determined on Kubernetes does not currently support the scheduling
-policies that are available when deploying Determined on VMs. These
-policies include: priority scheduling, fair sharing resources across
-experiments, and gang-scheduling for distributed training. Determined
-relies on Kubernetes to handle scheduling, which does not natively
-support these scheduling policies.
-
-:ref:`Distributed training ` experiments that use
-multiple pods require all pods to be scheduled and running in order to
-make progress. Due to the lack of gang-scheduling in Kubernetes, when
-running distributed training experiments it is possible to deadlock the
-Kubernetes cluster such that none of the experiments will make any
-progress. For example, if you have a cluster with three 4-GPU nodes,
-scheduling an experiment that requires four such nodes will deadlock the
-cluster. Three pods will start up on the available nodes and occupy all
-of their GPUs while waiting for the fourth pod to launch before training
-can start. Because the fourth pod will never start (due to insufficient
-resources), the job will never make progress. Similarly, if you launch
-two experiments simultaneously that both attempt to use 12 GPUs on a
-cluster with only 12 GPUs, it is likely that Kubernetes will assign some
-of the GPUs to one experiment and some GPUs to the other. Because
-neither experiment will receive the resources it needs to begin
-executing, the system will wait indefinitely.
-
-To avoid deadlocking your cluster, we recommend enabling the cluster
-autoscaler if possible. If a potential deadlock is detected, a warning
-will be displayed in the trial logs. Upon encountering a deadlock, users
-should pause, cancel, or kill one or more of the deadlocked experiments.
+By default, the Kubernetes scheduler does not support gang scheduling or
+preemption. This can be problematic for distributed deep learning
+workloads that require multiple pods to be scheduled before execution
+starts. Determined includes built-in support for the `lightweight
+coscheduling plugin
+`__,
+which extends the default Kubernetes scheduler to support gang
+scheduling. The coscheduling plugin is not enabled by default. For more
+details and instructions on how to enable the coscheduling plugin, refer
+to :ref:`scheduling-on-kubernetes`.
Dynamic Agents
==============
diff --git a/docs/topic-guides/system-concepts/scheduling.txt b/docs/topic-guides/system-concepts/scheduling.txt
index da28c3d3175..b15db9b9a0e 100644
--- a/docs/topic-guides/system-concepts/scheduling.txt
+++ b/docs/topic-guides/system-concepts/scheduling.txt
@@ -119,40 +119,69 @@ enabled:
priority 1 then starts running. Once that experiment is complete,
distributed training experiment with priority 2 runs.
-*******************************
- Gang-scheduling on Kubernetes
-*******************************
-
-Kubernetes does not natively support gang-scheduling and its default
-scheduler will schedule pods on a first-come-first-serve basis. This
-approach is problematic if a user submits several multi-pod jobs at
-once; first-come-first-serve could result in a cluster deadlock.
-Determined is able to support gang-scheduling by using the lightweight
-coscheduling plugin, which extends the Kubernetes scheduler and blocks
-scheduling of pods unless there are enough resources for all the pods in
-the job. To function, the plugin requires special labels to be set that
-specify the amount of nodes that each job needs for execution.
-Determined is able to automatically calculate and set these labels for
-its experiments.
-
-Importantly, the coscheduling plugin does not work with cluster
-autoscaling. Static node pools must be used to achieve gang-scheduling.
-Also, while the plugin allocates resources to jobs based on their
-priority, it does not support preemption. Any low priority task will be
-able to finish before a higher priority task can begin running.
-Additionally, there isn't an implementation of ``max_slots`` or
-``max_concurrent_trials`` that would limit the resources of an
-experiment, i.e. one for hyperparameter search. Lastly, Determined's
-capability to automatically set pod labels is restricted to GPU
-experiments; it is unable to do the same for CPU experiments or user
-commands. If gang-scheduling is desired for these, it must be set
-manually via the environment field in the config. For instance:
-
-.. code::
+.. _scheduling-on-kubernetes:
+
+**************************
+ Scheduling on Kubernetes
+**************************
+
+By default, the Kubernetes scheduler does not perform gang scheduling or
+support preemption of pods. While it does take pod priority into
+account, it greedily schedules pods without consideration for the job
+each pod belongs to. This can result in problematic behavior for deep
+learning workloads, particularly for distributed training jobs that use
+many GPUs. A distributed training job that uses multiple pods requires
+all pods to be scheduled and running in order to make progress. Because
+Kubernetes does not support gang scheduling by default, cluster
+deadlocks can arise. For example, suppose that two experiments are
+launched simultaneously that each require 16 GPUs on a cluster with only
+16 GPUs. It is possible that Kubernetes will assign some GPUs to one
+experiment and some GPUs to the other. Because neither experiment will
+receive the resources it needs to begin executing, the system will wait
+indefinitely.
+
+Determined addresses these problems through the use of the `lightweight
+coscheduling plugin
+`__,
+which extends the Kubernetes scheduler to support priority-based gang
+scheduling. To implement gang scheduling, the coscheduling plugin will
+not schedule a pod unless there are enough available resources to also
+schedule the rest of the pods in the same job. To function, the plugin
+requires special labels to be set that specify the number of nodes that
+each job needs for execution. Determined automatically calculates and
+sets these labels for GPU experiments that it launches.
+
+The coscheduling plugin is in beta and is therefore not enabled by
+default. To enable it, edit ``values.yaml`` in the Determined Helm chart
+to set the ``defaultScheduler`` field to ``coscheduler``.
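+
+As a minimal sketch, the relevant entry in ``values.yaml`` would look
+like the following, with all other fields left unchanged:
+
+.. code:: yaml
+
+   # values.yaml (excerpt): opt in to the beta coscheduling plugin.
+   defaultScheduler: coscheduler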
+
+Importantly, the coscheduling plugin does not work with Kubernetes'
+cluster autoscaling feature: static node pools must be used to achieve
+gang scheduling. Also, while the plugin allocates resources to jobs
+based on their priority, it does not support preemption. For example, if
+the cluster is full of low-priority jobs and a new high-priority job is
+submitted, the high-priority job will not be scheduled until one of the
+low-priority jobs finishes. Additionally, there is no implementation
+of ``max_slots`` or ``max_concurrent_trials`` that would limit the
+resources of an experiment, e.g., a hyperparameter search. Lastly,
+Determined's capability to automatically set pod labels is restricted to
+GPU experiments; Determined does not currently set labels for CPU
+experiments or user commands.
+
+To enable gang scheduling with commands or CPU experiments, enable the
+coscheduler in ``values.yaml`` and modify the experiment config to
+contain the following:
+
+.. code:: yaml
    environment:
      pod_spec:
        metadata:
          labels:
-           pod-group.scheduling.sigs.k8s.io/name: determined
-           pod-group.scheduling.sigs.k8s.io/min-available: "2"
+           pod-group.scheduling.sigs.k8s.io/name:
+           pod-group.scheduling.sigs.k8s.io/min-available: <# of GPUs required>
+       spec:
+         schedulerName: coscheduler
+
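+As an illustration only, a filled-in version for a command or CPU
+experiment might look like the following. The group name
+``my-cpu-job`` and the count ``"2"`` are hypothetical placeholders;
+the count is quoted because Kubernetes label values must be strings.
+
+.. code:: yaml
+
+   environment:
+     pod_spec:
+       metadata:
+         labels:
+           # Hypothetical pod group name for this job.
+           pod-group.scheduling.sigs.k8s.io/name: my-cpu-job
+           # Hypothetical count; quoted so YAML yields a string.
+           pod-group.scheduling.sigs.k8s.io/min-available: "2"
+       spec:
+         schedulerName: coscheduler
+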
+You can also use ``schedulerName: default-scheduler`` to use the default
+Kubernetes scheduler.
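+
+As a sketch of that opt-out, the experiment config would only need the
+``spec`` portion of ``pod_spec``. The pod-group labels are omitted
+here on the assumption that only the coscheduling plugin reads them:
+
+.. code:: yaml
+
+   environment:
+     pod_spec:
+       spec:
+         # Name of the stock Kubernetes scheduler.
+         schedulerName: default-scheduler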
diff --git a/helm/charts/determined/Chart.yaml b/helm/charts/determined/Chart.yaml
index d7c29f4fb81..e72058b99ec 100644
--- a/helm/charts/determined/Chart.yaml
+++ b/helm/charts/determined/Chart.yaml
@@ -1,7 +1,7 @@
apiVersion: v1
name: determined
description: A Helm chart for Determined
-version: "0.4.0"
+version: "0.5.0"
icon: https://github.com/determined-ai/determined/blob/master/determined-logo.png?raw=true
home: https://github.com/determined-ai/determined.git