Automatically taint nodes under high CPU pressure and evict Pods from them. Derived from kubernetes-loadwatcher.
The load average describes the average length of the run queue at the times scheduling decisions are made, but it does not tell us how often processes were actually waiting for CPU time. The kernel's pressure stall information (PSI, contributed by Facebook) describes how often there was not enough CPU available.
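To illustrate what consuming PSI looks like, here is a minimal sketch of parsing the `some` line of the kernel's `/proc/pressure/cpu` format (available since Linux 4.20). The type and function names are hypothetical, not the project's actual code; note that the kernel's `avg300` field is a 300-second, i.e. 5-minute, average.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// CPUPressure holds the "some" CPU pressure averages as reported by the
// kernel in /proc/pressure/cpu. The values are percentages of wall time
// in which at least one task was stalled waiting for CPU.
type CPUPressure struct {
	Avg10, Avg60, Avg300 float64
}

// parseCPUPressure parses a line such as
//
//	some avg10=1.05 avg60=0.80 avg300=0.23 total=123456
//
// into a CPUPressure struct. The field names are fixed by the PSI format.
func parseCPUPressure(line string) (CPUPressure, error) {
	var p CPUPressure
	fields := strings.Fields(line)
	if len(fields) == 0 || fields[0] != "some" {
		return p, fmt.Errorf("expected a 'some' pressure line, got %q", line)
	}
	for _, f := range fields[1:] {
		kv := strings.SplitN(f, "=", 2)
		if len(kv) != 2 {
			continue
		}
		v, err := strconv.ParseFloat(kv[1], 64)
		if err != nil {
			continue
		}
		switch kv[0] {
		case "avg10":
			p.Avg10 = v
		case "avg60":
			p.Avg60 = v
		case "avg300":
			p.Avg300 = v
		}
	}
	return p, nil
}

func main() {
	p, err := parseCPUPressure("some avg10=1.05 avg60=0.80 avg300=0.23 total=123456")
	if err != nil {
		panic(err)
	}
	fmt.Printf("avg10=%.2f avg60=%.2f avg300=%.2f\n", p.Avg10, p.Avg60, p.Avg300)
}
```

Since PSI only exposes averages up to 5 minutes, a 15-minute average as mentioned below would have to be derived by the controller itself from periodic samples.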
A Kubernetes node can be overcommitted on CPU: the processes on it may want more CPU than they requested. This can easily happen due to variable resource usage per Pod, variance in hardware, or variance in Pod distribution. By default, Kubernetes will not evict Pods from a node based on CPU usage, since CPU is considered a compressible resource. However, if a node does not have enough CPU resources to handle all Pods, it will impose additional latencies that can be undesirable depending on the workload (e.g. web/interactive traffic).
This project contains a small Kubernetes controller that watches each node's CPU pressure; when a certain threshold is exceeded, the node will be tainted (so that no additional workloads are scheduled on an already-overloaded node) and finally the controller will start to evict Pods from the node.
Pressure metrics are more sensitive to small overloads; for example, with pressure information it is easy to express "there is an up to 20% chance of not getting CPU instantly when needed".
This controller can be started with two threshold flags: `-taint-threshold` and `-evict-threshold`. There are also the safeguard flags `-min-pod-age` and `-eviction-backoff`.
The controller will continuously monitor a node's CPU pressure.
- If the CPU pressure (5 min average) exceeds the taint threshold, the node will be tainted with a `pressurecooker/load-exceeded` taint with the `PreferNoSchedule` effect. This instructs Kubernetes not to schedule any additional workloads on this node if at all possible.
- If the CPU pressure (both the 5 min and 15 min averages) falls back below the taint threshold, the taint will be removed again.
- If the CPU pressure (15 min average) exceeds the eviction threshold, the controller will pick a suitable Pod running on the node and evict it. However, the following types of Pods will not be evicted:
  - Pods with the `Guaranteed` QoS class
  - Pods belonging to StatefulSets
  - Pods belonging to DaemonSets
  - Standalone Pods not managed by any kind of controller
  - Pods running in the `kube-system` namespace or with a critical `priorityClassName`
  - Pods newer than `min-pod-age`
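The taint/untaint/evict rules above can be condensed into a pure decision function. The following is an illustrative sketch under assumed names and signatures, not the controller's actual code:

```go
package main

import "fmt"

// Action describes what the controller should do for a node, given its
// current pressure averages. The two thresholds correspond to the
// -taint-threshold and -evict-threshold flags.
type Action int

const (
	NoAction Action = iota
	Taint
	Untaint
	Evict
)

// decide reduces the rules above to one function: evict on a high
// 15-minute average, taint on a high 5-minute average, and remove the
// taint once both averages have recovered below the taint threshold.
func decide(avg5, avg15, taintThreshold, evictThreshold float64, tainted bool) Action {
	switch {
	case avg15 > evictThreshold:
		return Evict
	case !tainted && avg5 > taintThreshold:
		return Taint
	case tainted && avg5 < taintThreshold && avg15 < taintThreshold:
		return Untaint
	default:
		return NoAction
	}
}

func main() {
	// A 5-minute average above the taint threshold on an untainted node.
	fmt.Println(decide(12, 5, 10, 25, false) == Taint)
}
```

Keeping the decision separate from the Kubernetes API calls makes the threshold logic trivially testable.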
After a Pod has been evicted, the next Pod will only be evicted after a configurable eviction backoff (controllable using the `-eviction-backoff` argument), and only if the 15 min pressure average is still above the eviction threshold.
Older Pods will be evicted first. The rationale for removing old Pods first is that it is usually better to move well-behaved Pods away from bad neighbors than to move the bad neighbors around the cluster. And since a node normally starts out in a healthy state, it can be assumed that the older Pods are less likely to be the cause of an overload.