docs: further improve k8s coscheduling docs (#2099)
eecsliu authored and justin-determined-ai committed Mar 19, 2021
1 parent c4508da commit 2d16658
Showing 5 changed files with 99 additions and 65 deletions.
22 changes: 19 additions & 3 deletions docs/how-to/installation/kubernetes.txt
@@ -6,7 +6,7 @@

This document describes how to install Determined on `Kubernetes
<https://kubernetes.io/>`__. The installation is performed using the
:download:`Determined Helm Chart </helm/determined-0.4.0.tgz>`. For
:download:`Determined Helm Chart </helm/determined-0.5.0.tgz>`. For
general information about using Determined with Kubernetes, refer to the
:ref:`determined-on-kubernetes` guide.

@@ -36,7 +36,7 @@ the following prerequisites are satisfied:
``fluent/fluent-bit:1.6`` Docker image from Docker Hub.

You should also download a copy of the :download:`Determined Helm Chart
</helm/determined-0.4.0.tgz>` and extract it on your local machine.
</helm/determined-0.5.0.tgz>` and extract it on your local machine.

If you do not yet have a Kubernetes cluster deployed and you want to use
Determined in a public cloud environment, we recommend using a managed
@@ -270,6 +270,22 @@ perform TLS termination in the load-balancer:
serviceName: determined-master-service-<name for your deployment>
servicePort: masterPort configured in values.yaml

Default Scheduler (Optional)
============================

Determined includes support for the `lightweight coscheduling plugin
<https://github.com/kubernetes-sigs/scheduler-plugins/tree/release-1.18/pkg/coscheduling>`__,
which extends the default Kubernetes scheduler to provide gang
scheduling. This feature is currently in beta and is not enabled by
default. To activate the plugin, set the ``defaultScheduler`` field to
``coscheduler``. If the field is empty or doesn't exist, Determined will
use the default Kubernetes scheduler to schedule all experiments and
tasks.

.. code:: yaml

   defaultScheduler: coscheduler

***********************
Installing Determined
***********************
@@ -283,7 +299,7 @@ Determined run:
helm install <name for your deployment> determined-helm-chart

``determined-helm-chart`` is a relative path to where the
:download:`Determined Helm Chart </helm/determined-0.4.0.tgz>` is
:download:`Determined Helm Chart </helm/determined-0.5.0.tgz>` is
located. It may take a few minutes for all resources to come up. If you
encounter issues during installation, please refer to our list of
:ref:`useful kubectl commands <useful-kubectl-commands>`. Helm will
9 changes: 8 additions & 1 deletion docs/reference/helm-config.txt
@@ -7,7 +7,7 @@
When installing :ref:`Determined on Kubernetes <install-on-kubernetes>`
via Helm, the deployment should be configured by editing the
``values.yaml`` and ``Chart.yaml`` files in the :download:`Determined
Helm Chart </helm/determined-0.4.0.tgz>`.
Helm Chart </helm/determined-0.5.0.tgz>`.

*************************
``Chart.yaml`` Settings
@@ -237,3 +237,10 @@ Helm Chart </helm/determined-0.4.0.tgz>`.
than a path as it does in the cluster configuration. This can be
conveniently set at the command line using ``helm install
--set-file logging.security.tls.certificate=<path>``.

- ``defaultScheduler``: Configures the default scheduler that
Determined will use. Currently only supports the ``coscheduler``
option, which enables the `lightweight coscheduling plugin
<https://github.com/kubernetes-sigs/scheduler-plugins/tree/release-1.18/pkg/coscheduling>`__.
Unless specified as ``coscheduler``, Determined will use the default
Kubernetes scheduler.
38 changes: 10 additions & 28 deletions docs/topic-guides/deployment/determined-on-kubernetes.txt
@@ -43,34 +43,16 @@ Kubernetes.
Scheduling
==========

Determined on Kubernetes does not currently support the scheduling
policies that are available when deploying Determined on VMs. These
policies include: priority scheduling, fair sharing resources across
experiments, and gang-scheduling for distributed training. Determined
relies on Kubernetes to handle scheduling, which does not natively
support these scheduling policies.

:ref:`Distributed training <multi-gpu-training>` experiments that use
multiple pods require all pods to be scheduled and running in order to
make progress. Due to the lack of gang-scheduling in Kubernetes, when
running distributed training experiments it is possible to deadlock the
Kubernetes cluster such that none of the experiments will make any
progress. For example, if you have a cluster with three 4-GPU nodes,
scheduling an experiment that requires four such nodes will deadlock the
cluster. Three pods will start up on the available nodes and occupy all
of their GPUs while waiting for the fourth pod to launch before training
can start. Because the fourth pod will never start (due to insufficient
resources), the job will never make progress. Similarly, if you launch
two experiments simultaneously that both attempt to use 12 GPUs on a
cluster with only 12 GPUs, it is likely that Kubernetes will assign some
of the GPUs to one experiment and some GPUs to the other. Because
neither experiment will receive the resources it needs to begin
executing, the system will wait indefinitely.

To avoid deadlocking your cluster, we recommend enabling the cluster
autoscaler if possible. If a potential deadlock is detected, a warning
will be displayed in the trial logs. Upon encountering a deadlock, users
should pause, cancel, or kill one or more of the deadlocked experiments.
By default, the Kubernetes scheduler does not support gang scheduling or
preemption. This can be problematic for distributed deep learning
workloads that require multiple pods to be scheduled before execution
starts. Determined includes built-in support for the `lightweight
coscheduling plugin
<https://github.com/kubernetes-sigs/scheduler-plugins/tree/release-1.18/pkg/coscheduling>`__,
which extends the default Kubernetes scheduler to support gang
scheduling. The coscheduling plugin is not enabled by default. For more
details and instructions on how to enable the coscheduling plugin, refer
to :ref:`scheduling-on-kubernetes`.

Dynamic Agents
==============
93 changes: 61 additions & 32 deletions docs/topic-guides/system-concepts/scheduling.txt
@@ -119,40 +119,69 @@ enabled:
priority 1 then starts running. Once that experiment is complete,
distributed training experiment with priority 2 runs.

*******************************
Gang-scheduling on Kubernetes
*******************************

Kubernetes does not natively support gang-scheduling and its default
scheduler will schedule pods on a first-come-first-serve basis. This
approach is problematic if a user submits several multi-pod jobs at
once; first-come-first-serve could result in a cluster deadlock.
Determined is able to support gang-scheduling by using the lightweight
coscheduling plugin, which extends the Kubernetes scheduler and blocks
scheduling of pods unless there are enough resources for all the pods in
the job. To function, the plugin requires special labels to be set that
specify the amount of nodes that each job needs for execution.
Determined is able to automatically calculate and set these labels for
its experiments.

Importantly, the coscheduling plugin does not work with cluster
autoscaling. Static node pools must be used to achieve gang-scheduling.
Also, while the plugin allocates resources to jobs based on their
priority, it does not support preemption. Any low priority task will be
able to finish before a higher priority task can begin running.
Additionally, there isn't an implementation of ``max_slots`` or
``max_concurrent_trials`` that would limit the resources of an
experiment, i.e. one for hyperparameter search. Lastly, Determined's
capability to automatically set pod labels is restricted to GPU
experiments; it is unable to do the same for CPU experiments or user
commands. If gang-scheduling is desired for these, it must be set
manually via the environment field in the config. For instance:

.. code::
.. _scheduling-on-kubernetes:

**************************
Scheduling on Kubernetes
**************************

By default, the Kubernetes scheduler does not perform gang scheduling or
support preemption of pods. While it does take pod priority into
account, it greedily schedules pods without consideration for the job
each pod belongs to. This can result in problematic behavior for deep
learning workloads, particularly for distributed training jobs that use
many GPUs. A distributed training job that uses multiple pods requires
all pods to be scheduled and running in order to make progress. Because
Kubernetes does not support gang scheduling by default, cluster
deadlocks can arise. For example, suppose that two experiments are
launched simultaneously that each require 16 GPUs on a cluster with only
16 GPUs. It is possible that Kubernetes will assign some GPUs to one
experiment and some GPUs to the other. Because neither experiment will
receive the resources it needs to begin executing, the system will wait
indefinitely.

Determined addresses these problems through the use of the `lightweight
coscheduling plugin
<https://github.com/kubernetes-sigs/scheduler-plugins/tree/release-1.18/pkg/coscheduling>`__,
which extends the Kubernetes scheduler to support priority-based gang
scheduling. To implement gang scheduling, the coscheduling plugin will
not schedule a pod unless there are enough available resources to also
schedule the rest of the pods in the same job. To function, the plugin
requires special labels to be set that specify the number of nodes that
each job needs for execution. Determined automatically calculates and
sets these labels for GPU experiments that it launches.

The coscheduling plugin is in beta and is therefore not enabled by
default. To enable it, edit ``values.yaml`` in the Determined Helm chart
to set the ``defaultScheduler`` field to ``coscheduler``.
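
As a minimal sketch, the relevant portion of ``values.yaml`` might look
like the following. Only the ``defaultScheduler`` field comes from this
guide; the comments are illustrative and the rest of the file is
omitted.

.. code:: yaml

   # values.yaml (excerpt)
   # Leave this field empty or remove it to keep the default
   # Kubernetes scheduler.
   defaultScheduler: coscheduler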

Importantly, the coscheduling plugin does not work with Kubernetes'
cluster autoscaling feature: static node pools must be used to achieve
gang scheduling. Also, while the plugin allocates resources to jobs
based on their priority, it does not support preemption. For example, if
the cluster is full of low priority jobs and a new high priority job is
submitted, the high priority job will not be scheduled until one of the
low priority jobs finishes. Additionally, there is no implementation
of ``max_slots`` or ``max_concurrent_trials`` to limit the resources of
an experiment, e.g., one performing a hyperparameter search. Lastly,
Determined's capability to automatically set pod labels is restricted to
GPU experiments; Determined does not currently set labels for CPU
experiments or user commands.

To enable gang scheduling with commands or CPU experiments, enable the
coscheduler in ``values.yaml`` and modify the experiment config to
contain the following:

.. code:: yaml

   environment:
     pod_spec:
       metadata:
         labels:
           pod-group.scheduling.sigs.k8s.io/name: determined
           pod-group.scheduling.sigs.k8s.io/min-available: "2"
           pod-group.scheduling.sigs.k8s.io/name: <unique task name>
           pod-group.scheduling.sigs.k8s.io/min-available: <# of GPUs required>
       spec:
         schedulerName: coscheduler
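
For illustration, a filled-in version of this configuration might look
like the sketch below. The task name ``my-cpu-experiment`` and the value
``"4"`` are hypothetical placeholders, not defaults. The numeric value
is quoted because Kubernetes label values are strings.

.. code:: yaml

   environment:
     pod_spec:
       metadata:
         labels:
           # Hypothetical name; choose one that is unique to this task.
           pod-group.scheduling.sigs.k8s.io/name: my-cpu-experiment
           # Hypothetical value; quoted because label values must be strings.
           pod-group.scheduling.sigs.k8s.io/min-available: "4"
       spec:
         schedulerName: coscheduler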

You can also use ``schedulerName: default-scheduler`` to use the default
Kubernetes scheduler.
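
A sketch of that opt-out, assuming the same ``pod_spec`` structure as
the coscheduler configuration. The pod-group labels are left out here on
the assumption that the default scheduler does not use them.

.. code:: yaml

   environment:
     pod_spec:
       spec:
         # Name of the stock Kubernetes scheduler.
         schedulerName: default-scheduler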
2 changes: 1 addition & 1 deletion helm/charts/determined/Chart.yaml
@@ -1,7 +1,7 @@
apiVersion: v1
name: determined
description: A Helm chart for Determined
version: "0.4.0"
version: "0.5.0"
icon: https://github.com/determined-ai/determined/blob/master/determined-logo.png?raw=true
home: https://github.com/determined-ai/determined.git
