From 8abd69d8f540a68de24da85d28c8bd5bbe7ca2e2 Mon Sep 17 00:00:00 2001 From: adam Date: Mon, 23 Dec 2024 14:26:35 -0700 Subject: [PATCH] Add node selection documentation --- docs/admin/index.rst | 1 + docs/admin/node-selection.rst | 31 +++++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+) create mode 100644 docs/admin/node-selection.rst diff --git a/docs/admin/index.rst b/docs/admin/index.rst index f16c5d0c2..1804b853d 100644 --- a/docs/admin/index.rst +++ b/docs/admin/index.rst @@ -12,6 +12,7 @@ Guides to common problems and instructions on how to integrate properly with Pha :maxdepth: 2 :caption: Guides + node-selection home-directories init-containers setup-gar diff --git a/docs/admin/node-selection.rst b/docs/admin/node-selection.rst new file mode 100644 index 000000000..d9041c76e --- /dev/null +++ b/docs/admin/node-selection.rst @@ -0,0 +1,31 @@ +###################### +Forcing Node Selection +###################### + +You may want to force user labs and fileservers to run on particular nodes. +A particular use case is when you are running on a Kubernetes cluster that supports autoscaling, and you would prefer that: + +* User labs be the only workload on a particular set of nodes, so they are easily separated from the rest of the Phalanx machinery. +* These nodes be optimized for resource usage rather than distribution, so scaling is less painful because the pods to be disrupted are gathered rather than widely spread. + +Taints and Tolerations +====================== + +In order to ensure that lab workload is the only workload on given nodes, you would assign those nodes particular taint keys and values, and then arrange for the controller or fileserver configuration to tolerate those taints. +We conventionally use the key ``nublado.lsst.io/permitted`` with a value (which is ignored) of ``true`` and an effect of ``NoExecute``. +The corresponding toleration in the configuration restates the key and effect, and uses the operator ``Exists`` (which is why the value does not matter). +Note that you should repeat this in the configuration for both the controller and the fileserver. + +Node Selection +============== + +However, this is only half the job. +Having told kubernetes that only user labs and fileservers can execute on tained nodes, we now want to ensure that they go there. + +There are two ways to do this. +One is with the ``nodeSelector`` configuration item. +This is the preferred method. +It is much simpler than using ``Affinity`` and, more important, the prepuller understands ``nodeSelector`` and if you use that method, the prepuller will only pull to nodes that can be selected for Lab workloads. +If you use ``Affinity`` the workload will still end up being scheduled onto the right nodes, and those nodes will still have prepulled images, but the prepuller will pull to other nodes as well. +Assuming that you can use a simple ``nodeSelector`` label, which you probably can, do that. +If, for instance, you have a node pool all of whose members are labelled with ``node_pool`` equal to ``user-lab-pool``, then you should have ``nodeSelector: { node_pool: "user-lab-pool" }`` in your fileserver and controller configuration.