proposal: change node resource amplification from api to scheduling level #1700

Merged (1 commit) on Oct 10, 2023
@@ -4,9 +4,10 @@ authors:
- "@zqzten"
reviewers:
- "@saintube"
- "@eahydra"
- "@hormes"
creation-date: 2023-08-11
-last-updated: 2023-08-29
+last-updated: 2023-10-08
---

# Node Resource Amplification
@@ -32,7 +33,7 @@ last-updated: 2023-08-29

## Summary

-Introduce a mechanism to amplify the resources of Kubernetes Nodes **at API level**, which enables the node to meet higher pod CPU requests. This feature can then be utilized by other higher-level ones such as CPU normalization.
+Introduce a mechanism to amplify the resources of Kubernetes Nodes **at scheduling level**, which enables the node to meet higher pod CPU requests. This feature can then be utilized by other higher-level ones such as CPU normalization.

## Motivation

@@ -43,7 +44,8 @@ There are cases where users would like to amplify the resources of a node, for e

### Goals

-- Introduce a mechanism to amplify resources (capacity & allocatable) of Kubernetes Nodes at API level.
+- Introduce a mechanism to amplify resources (capacity & allocatable) of Kubernetes Nodes at scheduling level.
+- Make sure CPUSet Pods get the raw CPUs they request on scheduling.

### Non-Goals/Future Work

@@ -61,6 +63,8 @@ On k8s side, user (end user or other system) annotates the Node to specify its r

On koord-manager side, the Node mutating webhook intercepts the Node status updating request from kubelet and modifies its `capacity` and `allocatable` according to the amplification ratio. It also records their raw values in annotations of the Node.
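
The webhook step above can be sketched as follows. This is only an illustration under stated assumptions, not the koord-manager implementation: `AmplifyResources` is a hypothetical helper, resource quantities are simplified to plain `int64` values (for example milli-CPU) instead of Kubernetes quantities, and rounding down is an assumed choice that the proposal does not specify.

```go
package main

import "math"

// AmplifyResources returns a copy of raw in which every resource that has an
// amplification ratio greater than 1 is multiplied by that ratio. Resources
// without a ratio are copied unchanged. The input map is not modified, so the
// caller can still record the raw values elsewhere (the proposal stores them
// in Node annotations).
//
// Hypothetical helper: rounding down is an assumption of this sketch.
func AmplifyResources(raw map[string]int64, ratios map[string]float64) map[string]int64 {
	amplified := make(map[string]int64, len(raw))
	for name, value := range raw {
		ratio, ok := ratios[name]
		if !ok || ratio <= 1 {
			amplified[name] = value
			continue
		}
		amplified[name] = int64(math.Floor(float64(value) * ratio))
	}
	return amplified
}
```

With a raw allocatable of 4000 milli-CPU and a CPU ratio of 2, the sketch reports 8000 milli-CPU while leaving resources without a ratio untouched.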

+On koord-scheduler side, the `NodeNUMAResource` scheduling plugin filters and scores Nodes with logic similar to the `NodeResourcesFit` plugin for those without NUMA topology policy, with one difference: it amplifies the request of CPUSet Pods by the Node's CPU amplification ratio. CPUSet Pods get the real physical CPUs they request, so on Nodes whose allocatable CPU is amplified they occupy more CPU resources than their requests specify. For example, a CPUSet Pod with CPU request 1 occupies 1 physical CPU on a Node; if the Node's CPU amplification ratio is 2, the Pod actually occupies 2 amplified CPUs (which map to 1 physical CPU) on this Node. Similarly, for those with NUMA topology policy, the plugin filters and scores Nodes with the same special logic, with the CPU resources of each NUMA zone amplified.
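
The request accounting described above can be sketched as follows. This is a simplified illustration, not the `NodeNUMAResource` plugin code: `AmplifiedMilliCPU` and `Fits` are hypothetical names, CPU is expressed in milli-CPU as `int64`, and rounding up is an assumption of the sketch.

```go
package main

import "math"

// AmplifiedMilliCPU returns how much amplified CPU a Pod is accounted for on a
// Node. CPUSet Pods occupy real physical CPUs, so their request is multiplied
// by the Node's CPU amplification ratio; other Pods are accounted as-is.
//
// Hypothetical helper: rounding up is an assumption of this sketch.
func AmplifiedMilliCPU(requestMilli int64, ratio float64, isCPUSet bool) int64 {
	if !isCPUSet || ratio <= 1 {
		return requestMilli
	}
	return int64(math.Ceil(float64(requestMilli) * ratio))
}

// Fits reports whether a Pod accounted at podMilli still fits on a Node whose
// amplified allocatable CPU is allocatableMilli and on which requestedMilli
// is already accounted for.
func Fits(allocatableMilli, requestedMilli, podMilli int64) bool {
	return requestedMilli+podMilli <= allocatableMilli
}
```

For a Node with 4 physical CPUs and ratio 2 (8000 amplified milli-CPU), a CPUSet Pod requesting 4 CPUs is accounted as 8000 milli-CPU under this sketch, so no further Pod fits on the Node.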

### API

- Introduce a new Node annotation `node.koordinator.sh/resource-amplification-ratio` whose value is a JSON object string of type `map[corev1.ResourceName]float64`. Each float value must be greater than 1. Users set this annotation to specify the amplification ratio for each resource of the Node.
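
A minimal sketch of reading this annotation, assuming plain `string` keys in place of `corev1.ResourceName` to keep the example free of Kubernetes dependencies. `ParseAmplificationRatios` is a hypothetical helper, not part of the Koordinator API; only the annotation key and the "greater than 1" rule come from the proposal.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AnnotationKey is the Node annotation name defined by the proposal.
const AnnotationKey = "node.koordinator.sh/resource-amplification-ratio"

// ParseAmplificationRatios decodes the amplification ratio annotation from a
// Node's annotations. It returns nil if the annotation is absent, and an error
// if the value is not a JSON object of numbers or any ratio is not > 1.
//
// Hypothetical helper: resource names are plain strings in this sketch.
func ParseAmplificationRatios(annotations map[string]string) (map[string]float64, error) {
	raw, ok := annotations[AnnotationKey]
	if !ok || raw == "" {
		return nil, nil
	}
	ratios := map[string]float64{}
	if err := json.Unmarshal([]byte(raw), &ratios); err != nil {
		return nil, fmt.Errorf("invalid %s annotation: %w", AnnotationKey, err)
	}
	for name, ratio := range ratios {
		if ratio <= 1 {
			return nil, fmt.Errorf("amplification ratio for %q must be > 1, got %v", name, ratio)
		}
	}
	return ratios, nil
}
```

Under this sketch, an annotation value such as `{"cpu": 1.5}` is accepted, while `{"cpu": 1}` is rejected because the ratio is not greater than 1.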
@@ -80,9 +84,8 @@ To mitigate this risk, a Node validating webhook can be introduced to validate t

#### Unexpected Resource Overcommitment

-Since the resource amplification only takes effect at API level, unexpected resource overcommitment may happen if this feature is not used carefully, for example:
+Since the resource amplification only takes effect at scheduling level, unexpected resource overcommitment may happen if this feature is not used carefully, for example:

-- A Node with 4 allocatable physical CPUs and CPU amplification 2 would claim 8 allocatable CPUs, if there's a CPUSet Pod with CPU request 4 on it, it would occupy all the physical CPUs of the Node. However, the scheduler can still schedule Pods with total CPU request <= 4 to this Node, and thus leads to unexpected CPU starvation.
-- A Node with 4 allocatable physical CPUs and CPU amplification 2 would claim 8 allocatable CPUs, if there's two CPUShare Pods with the same CPU request and limit 4 on this Node, user would suppose that there would be no CPU competition (because their total limits equal to the Node's allocatable CPU), however, each Pod would be given a CFS quota to use all the 4 physical CPUs of the Node actually, so if both Pods use CPU more than 2, there would be a CPU competition.
+A Node with 4 allocatable physical CPUs and a CPU amplification ratio of 2 claims 8 allocatable CPUs. If two CPUShare Pods, each with CPU request and limit 4, run on this Node, the user would expect no CPU competition because their total limits equal the Node's allocatable CPU. However, each Pod is actually given a CFS quota that lets it use all 4 physical CPUs of the Node, so if both Pods use more than 2 CPUs, they compete for CPU.

To mitigate this risk, users should apply amplification carefully, preferably only on Nodes that are regularly idle. Alternatively, they can use end-to-end resource-specific features such as CPU normalization, which is covered in a separate proposal.