
CNV-55593: Load-aware VM rebalancing through descheduler #1760


Open · wants to merge 1 commit into master from descheduler-kubevirt-intergration

Conversation

ingvagabund
Member

@ingvagabund ingvagabund commented Feb 26, 2025

Propose a collaboration between the descheduler and the kube-virt stack to enable more seamless VM rebalancing. In December 2024, we released version 5.1.1 of the descheduler operator, which included TP functionality. This proposal outlines the design process and the rationale behind key decisions. Our upcoming release will introduce a new profile with enhanced tuning options to further enrich the customer experience while gathering valuable feedback to drive continuous improvements.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 26, 2025
@openshift-ci-robot

openshift-ci-robot commented Feb 26, 2025

@ingvagabund: This pull request references CNV-55593 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

TBD

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2025
@openshift-ci-robot

openshift-ci-robot commented Feb 26, 2025

@ingvagabund: This pull request references CNV-55593 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

TBD

TODO:

  • notes about extending the operator with rendering the taint controller (two separate deployments, two separate SAs)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@ingvagabund ingvagabund force-pushed the descheduler-kubevirt-intergration branch from 2c3f6bd to 196cadf on February 26, 2025 17:15
@openshift-ci-robot

openshift-ci-robot commented Feb 26, 2025

@ingvagabund: This pull request references CNV-55593 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Lay down a proposal for collaboration between the descheduler and the kube-virt stack to create friendlier rebalancing for VMs. We released the 5.1.1 descheduler operator with the described TP functionality in December 2024. This proposal describes the design process and the reasoning behind some of the decisions. Another release is scheduled that will introduce a new profile with new tuning options to make the customer experience richer, while collecting valuable customer feedback to enhance the experience even more.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Feb 26, 2025

@ingvagabund: This pull request references CNV-55593 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Propose a collaboration between the descheduler and the kube-virt stack to enable more seamless VM rebalancing. In December 2024, we released version 5.1.1 of the descheduler operator, which included TP functionality. This proposal outlines the design process and the rationale behind key decisions. Our upcoming release will introduce a new profile with enhanced tuning options to further enrich the customer experience while gathering valuable feedback to drive continuous improvements.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@ingvagabund ingvagabund changed the title wip: CNV-55593: Load-aware VM rebalancing through descheduler CNV-55593: Load-aware VM rebalancing through descheduler Feb 26, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2025
@ingvagabund ingvagabund force-pushed the descheduler-kubevirt-intergration branch from 196cadf to f27bce5 on March 7, 2025 15:21
Contributor

openshift-ci bot commented Mar 7, 2025

@ingvagabund: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | f27bce5 | link | true | `/test markdownlint` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2025
@ingvagabund
Member Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 7, 2025
Contributor

@tkashem tkashem left a comment


I just have a few questions, but otherwise LGTM.

### Update LowNodeUtilization plugin to consume Prometheus metrics

Currently, the descheduler's `LowNodeUtilization` plugin classifies node utilization based on Kubernetes metrics. However, these metrics primarily capture CPU and memory usage at the node and pod levels, which are not well-suited for VMs. By consuming Prometheus metrics, the plugin can leverage a broader range of data that better aligns with VM-specific needs.

Contributor


Switching from the metrics API to consuming raw Prometheus metrics, how do we ensure that the right versions of the components that provide these metrics are available? If they are not available, do we have a plan to surface this to the admin in any status object?

Also, for the future, is it feasible to abstract these metrics into the metrics API?


I think the idea is to have our descheduler operator manage a few Prometheus recording rules and have the descheduler act based on these recording rules, instead of having it fetch and aggregate low-level metrics. This should be our interface.

Member Author


The responsibility for making sure the right metrics (in version and values) are consumed is moved to the consuming components. The same descheduler version can be used across multiple versions of the CNV stack, and each CNV stack version can keep improving the metrics and producing more accurate results. Thus, for now, ensuring the right version of the components producing the metrics is bound to the OCP version and to which versions of the descheduler and CNV are supported there. In 4.18 we have CNV 4.18 + KDO 5.1.1-5.1.2; in 4.19, CNV 4.19 + KDO 5.2.


The existing plugin code is already structured to support additional resource clients, making integration straightforward. To simplify implementation, the plugin will accept a single metric entry per node, with values in the <0;1> interval, where a higher value indicates higher node utilization. Since resource thresholds are expressed as percentages by default, this classification remains intuitive and easy to interpret.

In the current iteration of the design, PSI pressure metrics were selected. Due to their nature, it is not possible to predict how evicting a single VM will impact a node's reported PSI pressure. Therefore, the plugin will be designed to evict only one VM per over-utilized node per cycle by default, with an option to evict more. Since it is safe to assume that each evicted VM will undergo live migration and the number of evictions per node is limited, running the descheduling cycle more frequently (e.g., every 3 minutes) will approximate the effect of evicting multiple VMs within a single cycle.
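For illustration, a per-node value in the <0;1> interval could be produced from node_exporter's PSI counters by a Prometheus recording rule along the lines of the sketch below. The rule and record names are hypothetical, not the rules the operator actually ships; only the `node_pressure_*` metric names come from node_exporter's pressure collector.

```yaml
# Illustrative sketch only: condense node_exporter PSI "waiting" counters
# into one per-node score in [0, 1]. Record names are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: descheduler-psi-score-example
spec:
  groups:
  - name: descheduler.node.psi
    rules:
    - record: instance:node_psi_pressure_score:avg5m
      # rate() over the PSI counters gives the fraction of time (0-1) that
      # tasks stalled on CPU/memory during the last 5 minutes; averaging the
      # two keeps the result in the [0, 1] interval the plugin expects.
      expr: |
        (
          rate(node_pressure_cpu_waiting_seconds_total[5m])
          + rate(node_pressure_memory_waiting_seconds_total[5m])
        ) / 2
```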
Contributor


how evicting a single VM will impact a node's reported PSI

Question:

  • is a "single VM" represented by a Pod object?
  • what does the Node object represent here, the VM or something else?

Can you please clarify this topology? It would be helpful for someone not familiar with this domain.


With node we mean an OpenShift Node, so, for the OpenShift Virtualization case, a physical machine.
A VM is a Virtual Machine managed by OpenShift Virtualization.
A running Virtual Machine is executed inside a pod containing libvirt and qemu.
Only during a live migration do we have two pods active at the same time for a single VM: the source and the destination one.


The taint controller needs to be allowed to switch to a mode where it only removes the soft-taint(s) from all affected nodes, in case an administrator decides to disable the load-aware functionality.

In general, eviction, scheduling, and tainting steps can occur at any time, making VM rebalancing inherently non-deterministic. As a result, the process may not always converge to an optimal state or may take a long time to do so.
Contributor


if i understand it correctly, the steps involved are:

  • a1) descheduler plugin reads the PSI metrics, and classifies a node (overcommitted, undercommitted, normal)
  • a2) descheduler plugin evicts a Pod (a VM in this case) based on results from a1
  • b) taint controller applies a soft taint to an overcommitted node
  • c) scheduler schedules the evicted Pod to a new node

a, b and c are async steps; is this a correct summary of the overall flow?


step c is a direct consequence of step a2: the descheduler evicts a VM with the evict API, an OpenShift Virtualization controller notices it and creates a new pod as the target for the live migration.
That pod has to be scheduled by the scheduler as for any other pod.

b is completely async.

I'd rephrase it as:

a1) descheduler plugin reads the PSI metrics, and classifies a node (overcommitted, undercommitted, normal)
a2) descheduler plugin evicts a Pod (a VM in this case) based on results from a1
a3) a KubeVirt controller notices the eviction request for a pod related to a VM and translates it to a live migration. A live migration requires a target pod and so the scheduler will schedule it on a different node.
a4) once the target pod for the live migration is running, the KubeVirt controller will start a live migration from qemu on the source pod (the pod we are trying to evict) to the target one
a5) once the migration is completed, the source pod is freed up and the eviction is completed

b) taint controller applies a soft taint to an overcommitted node

a1-a5 are sequential, a* and b are async.
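For readers unfamiliar with the KubeVirt side of step a3: the evacuation is realized through a migration object equivalent to the minimal sketch below. The names are hypothetical, and in this flow the object is created by the KubeVirt controller, not by an administrator.

```yaml
# Minimal sketch of the migration object KubeVirt uses to realize step a3.
# Names are hypothetical; the KubeVirt evacuation controller creates this.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: evacuation-example
  namespace: vms
spec:
  # VMI whose virt-launcher pod the descheduler tried to evict; KubeVirt
  # schedules a target pod on another node and live-migrates the guest.
  vmiName: example-vm
```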


Over time, these reconciliation loops will inevitably desynchronize due to factors such as transient outages, the number of pods and nodes to process, container restarts, and other operational randomness. So, while it may be tempting to share the same snapshot of metrics between both controllers, the benefit would be minimal unless explicit synchronization is required.

The following is an enumeration of some approaches for sharing the node classification:
Contributor


If the descheduling plugin and the taint controller were part of the same process, the descheduling plugin could publish its results to an in-memory synchronized data structure, and the taint controller could read off of it, eliminating any need for sharing.
Are there other constraints that prohibit the above approach?


Basically only this:

  • Keeping RBAC rules restricted: The descheduler should have read-only permissions, except for accessing the eviction API. Any expansion of permitted rules must be kept to a minimum.

With two decoupled and independent controllers and two different service accounts (where the soft-tainter one and all of its ancillary resources are deployed only if requested), we can avoid overloading the RBAC rules granted to the descheduler SA.
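To make that split concrete, the two service accounts could be bound to roles roughly like the sketch below. The role and account names are hypothetical; the actual manifests are rendered by the operator.

```yaml
# Hypothetical sketch of the RBAC split between the two service accounts.
# The descheduler SA stays read-only except for the eviction subresource;
# the soft-tainter SA additionally needs to update/patch Node objects and
# is only deployed when the load-aware profile is requested.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: descheduler-example
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "namespaces"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: soft-tainter-example
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch", "update", "patch"]
```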

@tkashem
Contributor

tkashem commented May 2, 2025

/approve

Contributor

openshift-ci bot commented May 2, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2025
### Goals

* **Load-aware eviction**: VM pods are evicted based on their actual resource consumption, such as high PSI (Pressure Stall Information) resource pressure. The descheduler will be enhanced to incorporate Prometheus metrics for classifying node utilization.
* **Guided scheduling**: While keeping the default Kubernetes scheduler unchanged, nodes are soft-tainted when their actual resource utilization exceeds a predefined threshold. Soft taints help prevent overutilized nodes from becoming completely unschedulable while making them less desirable for scheduling. A soft taint (`effect: PreferNoSchedule`) acts as a preference rather than a strict rule, meaning the scheduler will attempt to avoid placing pods on nodes flagged as overutilized by the descheduler, but it is not guaranteed.
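For illustration, a soft taint of this kind on an overutilized node would look roughly like the following; the taint key is a made-up placeholder, not the key the controller actually uses.

```yaml
# Illustrative only: a PreferNoSchedule ("soft") taint on an overutilized node.
apiVersion: v1
kind: Node
metadata:
  name: worker-1
spec:
  taints:
  - key: example.io/descheduler-overutilized   # hypothetical key
    value: "true"
    effect: PreferNoSchedule   # a preference, not a hard NoSchedule exclusion
```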


predefined threshold -> I'd suggest removing "predefined" since we also want to support adaptive thresholds.

Contributor


I would argue that even with Adaptive we have predefined thresholds. :-)


On the descheduler side:
1. The node classification thresholds are configured either as static (using `thresholds` and `targetThresholds`, as in the current `LowNodeUtilization` plugin) or as dynamic, based on the state of other nodes in the cluster (using the `useDeviationThresholds` option within the plugin).
1. The `LowNodeUtilization` plugin classifies nodes into three categories: under-utilized, appropriately-utilized, and over-utilized, as it does today. The metric used for classification will be derived from a PromQL query executed against a Prometheus instance. Prometheus will handle the aggregation of historical data and time-series processing. The query is expected to return a scalar [0-1] value for each node.
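A hedged sketch of how this could look in a descheduler policy is shown below. `thresholds`, `targetThresholds`, and `useDeviationThresholds` are existing `LowNodeUtilization` arguments; the `psi-pressure` resource key and the wiring of the Prometheus-derived metric are illustrative assumptions, since the exact field names depend on the descheduler version.

```yaml
# Sketch of a possible LowNodeUtilization configuration for the load-aware case.
# thresholds/targetThresholds/useDeviationThresholds are existing plugin args;
# the "psi-pressure" key standing in for the Prometheus-derived per-node metric
# is illustrative only.
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
- name: load-aware-rebalancing
  pluginConfig:
  - name: LowNodeUtilization
    args:
      # Nodes below 20% are under-utilized, nodes above 50% over-utilized;
      # everything in between counts as appropriately utilized.
      thresholds:
        psi-pressure: 20
      targetThresholds:
        psi-pressure: 50
      # Alternatively, derive the thresholds from the cluster-wide average:
      # useDeviationThresholds: true
  plugins:
    balance:
      enabled:
      - LowNodeUtilization
```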


Maybe it's worth mentioning that we intend to use recording rules on the Prometheus side, to emphasize that we want to keep the aggregation of historical data and time-series processing there, keeping the descheduler simpler.

1. Looping through the node list:
1. The soft-tainter controller will apply a soft taint (`effect: PreferNoSchedule`) to the node if it is not already present and the node is classified as overutilized.
1. The soft-tainter controller will remove the soft taint from the node if it is present and the node is now classified as either low or appropriately utilized.


Should we mention the multi-soft-taint logic here?
E.g. 2 soft taints for overutilized nodes, 1 for appropriately utilized, 0 for underutilized.
We know that the scheduler counts the number of soft taints while weighing the nodes, so this will lead underutilized nodes to be preferred over appropriately utilized ones, which will in turn be preferred over overutilized ones.
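To make the multi-soft-taint idea concrete, the classification could map to taints roughly as in the sketch below (the keys are again purely illustrative). Because the scheduler's TaintToleration scoring penalizes each untolerated PreferNoSchedule taint, nodes with fewer taints score higher.

```yaml
# Illustrative mapping of node classification to soft taints (keys made up):
#   under-utilized         -> 0 taints
#   appropriately utilized -> 1 taint
#   over-utilized          -> 2 taints (example node shown below)
apiVersion: v1
kind: Node
metadata:
  name: worker-overutilized
spec:
  taints:
  - key: example.io/utilization-above-target
    effect: PreferNoSchedule
  - key: example.io/utilization-above-threshold
    effect: PreferNoSchedule
```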


- List of key tuning options is identified
- Customers confirm the tuning options lead to sufficiently fast rebalancing
- The new descheduler operator profile is available by default


I'd like to mention that we should also provide sane defaults (ideally fitting a relevant range of typical cluster sizes) for all the profileCustomizations coming with the new profile.
