CNV-55593: Load-aware VM rebalancing through descheduler #1760
Conversation
@ingvagabund: This pull request references CNV-55593 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@ingvagabund: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close. /lifecycle stale
/remove-lifecycle stale
I just have a few questions, but otherwise LGTM
### Update LowNodeUtilization plugin to consume Prometheus metrics

Currently, the descheduler's `LowNodeUtilization` plugin classifies node utilization based on Kubernetes metrics. However, these metrics primarily capture CPU and memory usage at the node and pod levels, which are not well-suited for VMs. By consuming Prometheus metrics, the plugin can leverage a broader range of data that better aligns with VM-specific needs.
switching from metrics API to consuming raw Prometheus metrics, how do we ensure that the right version of the components that provide these metrics are available? if not available, then do we have a plan to surface this to the admin in any status object?
Also, for future, is it feasible to abstract these metrics into the metrics API?
I think that the idea is to have our descheduler operator manage a few Prometheus recording rules and have the descheduler act based on these recording rules instead of having it fetch and aggregate low-level metrics. This should be our interface.
The responsibility for making sure the right metrics (in version and values) are consumed is moved to the consuming components. The same descheduler version can be used across multiple versions of the CNV stack, and each CNV stack version can keep improving the metrics and producing more accurate results. Thus, for now, ensuring the right version of the components producing the metrics is bound to the OCP version and to which versions of the descheduler and CNV are supported there. In 4.18 we have CNV 4.18 + KDO 5.1.1-5.1.2; in 4.19, CNV 4.19 + KDO 5.2.
The existing plugin code is already structured to support additional resource clients, making integration straightforward. To simplify implementation, the plugin will accept a single metric entry per node, with values in the [0, 1] interval, where a higher value indicates higher node utilization. Since resource thresholds are expressed as percentages by default, this classification remains intuitive and easy to interpret.
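To make the threshold handling concrete, here is a minimal sketch of how such a single-metric configuration could look in a `DeschedulerPolicy`. The `metricsUtilization`/`prometheus` field names follow my reading of recent upstream descheduler releases and may differ in the actual implementation, and the query name is a hypothetical recording rule; treat this as illustrative only:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: load-aware                     # hypothetical profile name
  pluginConfig:
  - name: "LowNodeUtilization"
    args:
      # Percentage thresholds applied to the single [0, 1] metric per node:
      # below 20% => under-utilized, above 50% => over-utilized.
      # Using "MetricResource" as the resource key is an assumption.
      thresholds:
        MetricResource: 20
      targetThresholds:
        MetricResource: 50
      # Assumed wiring of the Prometheus-backed client; the query is a
      # hypothetical recording rule returning one [0, 1] value per node.
      metricsUtilization:
        source: Prometheus
        prometheus:
          query: instance:node_pressure_memory_waiting_seconds:rate5m
  plugins:
    balance:
      enabled:
      - "LowNodeUtilization"
```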
In the current iteration of the design, PSI pressure metrics were selected. Due to their nature, it is not possible to predict how evicting a single VM will impact a node's reported PSI pressure. Therefore, the plugin will by default evict only one VM per over-utilized node per cycle, with an option to evict more. Since it is safe to assume that each evicted VM will undergo live migration and the number of evictions per node is limited, running the descheduling cycle more frequently (e.g., every 3 minutes) approximates the effect of evicting multiple VMs within a single cycle.
how evicting a single VM will impact a node's reported PSI
Question:
- is a "single VM" represented by a Pod object?
- what does the Node object represent here, the VM or something else?
Can you please clarify this topology? It would be helpful for someone not familiar with this domain.
With node we mean an OpenShift Node, so for the OpenShift Virtualization case a physical machine.
A VM is a Virtual Machine managed by OpenShift Virtualization.
A running Virtual Machine is executed inside a pod containing libvirt and qemu.
Only during a live migration do we have two pods active at the same time for a single VM: the source and the destination one.
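To make the topology above concrete, a minimal (deliberately incomplete) VirtualMachine sketch follows; the names are hypothetical and disks/networks are omitted. The `evictionStrategy: LiveMigrate` setting is what turns an eviction of the VM's virt-launcher pod into a live migration rather than a shutdown:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm          # hypothetical
  namespace: my-vms         # hypothetical
spec:
  running: true
  template:
    spec:
      # Evicting this VM's virt-launcher pod triggers a live migration
      # instead of stopping the guest.
      evictionStrategy: LiveMigrate
      domain:
        devices: {}
        resources:
          requests:
            memory: 2Gi
```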
The taint controller needs to be allowed to switch to a mode where it only removes the soft taint(s) from all affected nodes, in case an administrator decides to disable the load-aware functionality.
In general, eviction, scheduling, and tainting steps can occur at any time, making VM rebalancing inherently non-deterministic. As a result, the process may not always converge to an optimal state or may take a long time to do so.
if I understand it correctly, the steps involved are:
- a1) descheduler plugin reads the PSI metrics, and classifies a node (overcommitted, undercommitted, normal)
- a2) descheduler plugin evicts a Pod (a VM in this case) based on results from a1
- b) taint controller applies soft tainting label to an overcommitted node
- c) scheduler schedules the evicted Pod to a new node
a, b and c are async steps, is this a correct summary of the overall flow?
step c is a direct consequence of step a2: the descheduler evicts a VM with the eviction API, an OpenShift Virtualization controller notices it and creates a new pod as the target for the live migration.
That pod has to be scheduled by the scheduler like any other pod.
b is completely async.
I'd rephrase it as:
a1) descheduler plugin reads the PSI metrics, and classifies a node (overcommitted, undercommitted, normal)
a2) descheduler plugin evicts a Pod (a VM in this case) based on results from a1
a3) a KubeVirt controller notices the eviction request for a pod related to a VM and translates it to a live migration. A live migration requires a target pod and so the scheduler will schedule it on a different node.
a4) once the target pod for the live migration is running, the KubeVirt controller will start a live migration from qemu on the source pod (the pod we are trying to evict) to the target one
a5) once the migration is completed, the source pod is freed up and the eviction is completed
b) taint controller applies soft tainting label to an overcommitted node
a1-a5 are sequential, a* and b are async.
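As a concrete illustration of step a2: the descheduler's eviction is an ordinary request against the pod's eviction subresource, roughly like the sketch below (pod and namespace names are hypothetical); KubeVirt then reacts as described in a3-a5.

```yaml
# Posted to the pod's eviction subresource, i.e.
# /api/v1/namespaces/my-vms/pods/virt-launcher-example-vm-abcde/eviction
apiVersion: policy/v1
kind: Eviction
metadata:
  name: virt-launcher-example-vm-abcde   # hypothetical virt-launcher pod
  namespace: my-vms                      # hypothetical namespace
```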
Over time, these reconciliation loops will inevitably desynchronize due to factors such as transient outages, the number of pods and nodes to process, container restarts, and other operational randomness. So, while it may be tempting to share the same snapshot of metrics between both controllers, the benefit would be minimal unless explicit synchronization is required.
The following enumerates some approaches for sharing the node classification:
if the descheduling plugin and the taint controller were part of the same process, the descheduling plugin could publish its result to an in-memory synchronized data structure, and the taint controller could read off of it, eliminating any need for sharing.
are there other constraints that prohibit the above approach?
Basically only this:
- Keeping RBAC rules restricted: The descheduler should have read-only permissions, except for accessing the eviction API. Any expansion of permitted rules must be kept to a minimum.
With two decoupled and independent controllers and two different service accounts (where the soft-tainter one and all of its ancillary resources are deployed only if requested) we can avoid overloading the RBAC rules granted to the descheduler SA.
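A rough sketch of the "read-only plus eviction" rule set described above (the role name is hypothetical; the real manifests are managed by the operator):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: descheduler-minimal              # hypothetical name
rules:
# Read-only access for classifying nodes and selecting eviction candidates.
- apiGroups: [""]
  resources: ["nodes", "pods", "namespaces"]
  verbs: ["get", "list", "watch"]
# The single write path: the eviction subresource.
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
```

The soft-tainter's separate service account would additionally need update/patch on nodes to manage taints, which is exactly the kind of permission this split keeps away from the descheduler SA.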
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tkashem. The full list of commands accepted by this bot can be found here. The pull request process is described here.
### Goals

* **Load-aware eviction**: VM pods are evicted based on their actual resource consumption, such as high PSI (Pressure Stall Information) resource pressure. The descheduler will be enhanced to incorporate Prometheus metrics for classifying node utilization.
* **Guided scheduling**: While keeping the default Kubernetes scheduler unchanged, nodes are soft-tainted when their actual resource utilization exceeds a predefined threshold. Soft taints help prevent overutilized nodes from becoming completely unschedulable while making them less desirable for scheduling. A soft taint (`effect: PreferNoSchedule`) acts as a preference rather than a strict rule, meaning the scheduler will attempt to avoid placing pods on nodes flagged as overutilized by the descheduler, but it is not guaranteed.
predefined threshold
-> I'd suggest removing "predefined", since we also want to support adaptive thresholds.
I would argue that even with Adaptive we have predefined thresholds. :-)
On the descheduler side:
1. The node classification thresholds get configured as static (using `thresholds` and `targetThresholds`, as configured in the current `LowNodeUtilization` plugin) or dynamic, based on the state of other nodes in the cluster (using the `useDeviationThresholds` option within the plugin).
1. The `LowNodeUtilization` plugin classifies nodes into three categories: under-utilized, appropriately-utilized, and over-utilized, as it does today. The metric used for classification will be derived from a PromQL query executed against a Prometheus instance. Prometheus will handle the aggregation of historical data and time series processing. The query is expected to return a scalar [0-1] value for each node.
Maybe it's worth mentioning that we intend to use recording rules on the Prometheus side, to emphasize that we want to keep the aggregation of historical data and time series processing there, keeping the descheduler simpler.
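For example, a PrometheusRule of roughly this shape could pre-compute per-node PSI values in [0, 1] from the node_exporter pressure counters, so the descheduler only runs a trivial query; the rule and object names are illustrative, not the actual rules shipped by the operator:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: descheduler-psi-rules            # hypothetical
  namespace: openshift-monitoring        # assumed location
spec:
  groups:
  - name: node-psi.rules
    rules:
    # rate() over the PSI "waiting seconds" counters yields the fraction of
    # time (0..1) in which some tasks were stalled on that resource, per node.
    - record: instance:node_pressure_memory_waiting_seconds:rate5m
      expr: rate(node_pressure_memory_waiting_seconds_total[5m])
    - record: instance:node_pressure_cpu_waiting_seconds:rate5m
      expr: rate(node_pressure_cpu_waiting_seconds_total[5m])
```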
1. Looping through the node list:
    1. The soft-tainter controller will apply a soft taint (`effect: PreferNoSchedule`) to the node if it is not already present and the node is classified as overutilized (see the sketch after this list).
    1. The soft-tainter controller will remove the soft taint from the node if it is present and the node is now classified as either low or appropriately utilized.
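A sketch of what the soft-tainter would leave on an over-utilized node; the taint key is hypothetical, while the `PreferNoSchedule` effect is what makes it a preference rather than a hard exclusion:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-1                                  # hypothetical node
spec:
  taints:
  # Added when the node is classified as over-utilized, removed again once
  # it drops back to appropriately or under-utilized.
  - key: descheduler.openshift.io/overutilized    # hypothetical key
    effect: PreferNoSchedule
```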
should we mention the multi-soft-taints logic here?
e.g. 2 soft taints for over-utilized nodes, 1 for appropriately utilized, 0 for under-utilized.
We know that the scheduler counts the number of soft taints while weighing the nodes, so under-utilized nodes will be preferred over appropriately utilized ones, which will in turn be preferred over over-utilized ones.
- List of key tuning options is identified
- Customers confirm the tuning options lead to sufficiently fast rebalancing
- The new descheduler operator profile is available by default
I'd like to mention that we should also provide sane defaults (ideally fitting a relevant share of typical cluster sizes) for all the profileCustomizations coming with the new profile.
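For illustration, the admin-facing surface could end up looking roughly like the KubeDescheduler CR below; the profile name is a placeholder and the customizations are left empty, while `deschedulingIntervalSeconds: 180` matches the "run the cycle more frequently, e.g. every 3 minutes" idea from the design:

```yaml
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  # Short cycle so that one-eviction-per-node-per-cycle still converges quickly.
  deschedulingIntervalSeconds: 180
  mode: Automatic                 # evictions enabled, as opposed to the dry-run Predictive mode
  profiles:
  - LoadAware                     # placeholder for the new profile name
  profileCustomizations: {}       # the new tuning options (with sane defaults) would live here
```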
Propose a collaboration between the descheduler and the KubeVirt stack to enable more seamless VM rebalancing. In December 2024, we released version 5.1.1 of the descheduler operator, which included Tech Preview (TP) functionality. This proposal outlines the design process and the rationale behind key decisions. Our upcoming release will introduce a new profile with enhanced tuning options to further enrich the customer experience while gathering valuable feedback to drive continuous improvements.