
CNV-55593: Load-aware VM rebalancing through descheduler #1760


Open · wants to merge 1 commit into master from descheduler-kubevirt-intergration

Conversation

ingvagabund
Member

@ingvagabund ingvagabund commented Feb 26, 2025

Propose a collaboration between the descheduler and the kube-virt stack to enable more seamless VM rebalancing. In December 2024, we released version 5.1.1 of the descheduler operator, which included TP functionality. This proposal outlines the design process and the rationale behind key decisions. Our upcoming release will introduce a new profile with enhanced tuning options to further enrich the customer experience while gathering valuable feedback to drive continuous improvements.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 26, 2025
@openshift-ci-robot

openshift-ci-robot commented Feb 26, 2025

@ingvagabund: This pull request references CNV-55593 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

TBD

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2025
@openshift-ci-robot

openshift-ci-robot commented Feb 26, 2025

@ingvagabund: This pull request references CNV-55593 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

TBD

TODO:

  • notes about extending the operator with rendering the taint controller (two separate deployments, two separate SAs)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@ingvagabund ingvagabund force-pushed the descheduler-kubevirt-intergration branch from 2c3f6bd to 196cadf on February 26, 2025 17:15
@openshift-ci-robot

openshift-ci-robot commented Feb 26, 2025

@ingvagabund: This pull request references CNV-55593 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Lay down a proposal for collaboration between the descheduler and the kube-virt stack to create friendlier rebalancing for VMs. We released the 5.1.1 descheduler operator with the described TP functionality in December 2024. This proposal describes the design process and the reasoning behind some of the decisions. Another release is scheduled that will introduce a new profile with new tuning options to make the customer experience richer, while collecting valuable customer feedback to enhance the experience even more.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Feb 26, 2025

@ingvagabund: This pull request references CNV-55593 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Propose a collaboration between the descheduler and the kube-virt stack to enable more seamless VM rebalancing. In December 2024, we released version 5.1.1 of the descheduler operator, which included TP functionality. This proposal outlines the design process and the rationale behind key decisions. Our upcoming release will introduce a new profile with enhanced tuning options to further enrich the customer experience while gathering valuable feedback to drive continuous improvements.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@ingvagabund ingvagabund changed the title wip: CNV-55593: Load-aware VM rebalancing through descheduler CNV-55593: Load-aware VM rebalancing through descheduler Feb 26, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2025
@ingvagabund ingvagabund force-pushed the descheduler-kubevirt-intergration branch from 196cadf to f27bce5 on March 7, 2025 15:21
Contributor

openshift-ci bot commented Mar 7, 2025

@ingvagabund: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | f27bce5 | link | true | `/test markdownlint` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2025
@ingvagabund
Member Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 7, 2025
Contributor

@tkashem tkashem left a comment


I just have a few questions, but otherwise LGTM.

### Update LowNodeUtilization plugin to consume Prometheus metrics

Currently, the descheduler's `LowNodeUtilization` plugin classifies node utilization based on Kubernetes metrics. However, these metrics primarily capture CPU and memory usage at the node and pod levels, which are not well-suited for VMs. By consuming Prometheus metrics, the plugin can leverage a broader range of data that better aligns with VM-specific needs.

Contributor


Switching from the metrics API to consuming raw Prometheus metrics, how do we ensure that the right versions of the components that provide these metrics are available? If they are not available, do we have a plan to surface this to the admin in any status object?

Also, for the future, is it feasible to abstract these metrics into the metrics API?


I think the idea is to have our descheduler operator manage a few Prometheus recording rules and have the descheduler act based on these recording rules, instead of having it fetch and aggregate low-level metrics. This should be our interface.

Member Author


The responsibility for making sure the right metrics (in version and values) are consumed is moved to the consuming components. The same descheduler version can be used across multiple versions of the CNV stack, and each CNV stack version can keep improving the metrics and producing more accurate results. Thus, for now, ensuring the right version of the components producing the metrics is bound to the OCP version and to which versions of the descheduler and CNV are supported there. In 4.18 we have CNV 4.18 + KDO 5.1.1-5.1.2; in 4.19, CNV 4.19 + KDO 5.2.


The existing plugin code is already structured to support additional resource clients, making integration straightforward. To simplify implementation, the plugin will accept a single metric entry per node, with values in the <0;1> interval, where a higher value indicates higher node utilization. Since resource thresholds are expressed as percentages by default, this classification remains intuitive and easy to interpret.

In the current iteration of the design, PSI pressure metrics were selected. Due to their nature, it is not possible to predict how evicting a single VM will impact a node's reported PSI pressure. Therefore, the plugin will be designed to evict only one VM per over-utilized node per cycle by default, with an option to evict more. Since it is safe to assume that each evicted VM will undergo live migration and the number of evictions per node is limited, running the descheduling cycle more frequently (e.g., every 3 minutes) will approximate the effect of evicting multiple VMs within a single cycle.
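For illustration, a per-node value in the <0;1> interval could be produced from node_exporter's PSI counters by a Prometheus recording rule along the lines of the sketch below. The rule and record names are hypothetical, not the rules the operator actually ships; only the `node_pressure_*` metric names come from node_exporter's pressure collector.

```yaml
# Illustrative sketch only: condense node_exporter PSI "waiting" counters
# into one per-node score in [0, 1]. Record names are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: descheduler-psi-score-example
spec:
  groups:
  - name: descheduler.node.psi
    rules:
    - record: instance:node_psi_pressure_score:avg5m
      # rate() over the PSI counters gives the fraction of time (0-1) that
      # tasks stalled on CPU/memory during the last 5 minutes; averaging the
      # two keeps the result in the [0, 1] interval the plugin expects.
      expr: |
        (
          rate(node_pressure_cpu_waiting_seconds_total[5m])
          + rate(node_pressure_memory_waiting_seconds_total[5m])
        ) / 2
```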
Contributor


how evicting a single VM will impact a node's reported PSI

Question:

  • is a "single VM" represented by a Pod object?
  • what does the Node object represent here, the VM or something else?

Can you please clarify this topology? It would be helpful for someone not familiar with this domain.


With node we mean an OpenShift Node, so, for the OpenShift Virtualization case, a physical machine.
A VM is a Virtual Machine managed by OpenShift Virtualization.
A running Virtual Machine is executed inside a pod containing libvirt and qemu.
Only during a live migration do we have two pods active at the same time for a single VM: the source and the destination one.


The taint controller needs to be allowed to switch to a mode where it only removes the soft-taint(s) from all affected nodes, in case an administrator decides to disable the load-aware functionality.

In general, eviction, scheduling, and tainting steps can occur at any time, making VM rebalancing inherently non-deterministic. As a result, the process may not always converge to an optimal state or may take a long time to do so.
Contributor


if i understand it correctly, the steps involved are:

  • a1) descheduler plugin reads the PSI metrics, and classifies a node (overcommitted, undercommitted, normal)
  • a2) descheduler plugin evicts a Pod (a VM in this case) based on results from a1
  • b) taint controller applies a soft taint to an overcommitted node
  • c) scheduler schedules the evicted Pod to a new node

a, b and c are async steps; is this a correct summary of the overall flow?


step c is a direct consequence of step a2: the descheduler evicts a VM with the evict API, an OpenShift Virtualization controller notices it and creates a new pod as the target for the live migration.
That pod has to be scheduled by the scheduler as for any other pod.

b is completely async.

I'd rephrase it as:

a1) descheduler plugin reads the PSI metrics, and classifies a node (overcommitted, undercommitted, normal)
a2) descheduler plugin evicts a Pod (a VM in this case) based on results from a1
a3) a KubeVirt controller notices the eviction request for a pod related to a VM and translates it to a live migration. A live migration requires a target pod and so the scheduler will schedule it on a different node.
a4) once the target pod for the live migration is running, the KubeVirt controller will start a live migration from qemu on the source pod (the pod we are trying to evict) to the target one
a5) once the migration is completed, the source pod is freed up and the eviction is completed

b) taint controller applies a soft taint to an overcommitted node

a1-a5 are sequential, a* and b are async.
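For readers unfamiliar with the KubeVirt side of step a3: the evacuation is realized through a migration object equivalent to the minimal sketch below. The names are hypothetical, and in this flow the object is created by the KubeVirt controller, not by an administrator.

```yaml
# Minimal sketch of the migration object KubeVirt uses to realize step a3.
# Names are hypothetical; the KubeVirt evacuation controller creates this.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: evacuation-example
  namespace: vms
spec:
  # VMI whose virt-launcher pod the descheduler tried to evict; KubeVirt
  # schedules a target pod on another node and live-migrates the guest.
  vmiName: example-vm
```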


Over time, these reconciliation loops will inevitably desynchronize due to factors such as transient outages, the number of pods and nodes to process, container restarts, and other operational randomness. So, while it may be tempting to share the same snapshot of metrics between both controllers, the benefit would be minimal unless explicit synchronization is required.

The following is an enumeration of some approaches for sharing the node classification:
Contributor


If the descheduling plugin and the taint controller were part of the same process, the descheduling plugin could publish its results to an in-memory synchronized data structure, and the taint controller could read off of it, eliminating any need for sharing.
Are there other constraints that prohibit the above approach?


Basically only this:

  • Keeping RBAC rules restricted: The descheduler should have read-only permissions, except for accessing the eviction API. Any expansion of permitted rules must be kept to a minimum.

With two decoupled and independent controllers and two different service accounts (where the soft-tainter one and all of its ancillary resources are deployed only if requested), we can avoid overloading the RBAC rules granted to the descheduler SA.
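To make that split concrete, the two service accounts could be bound to roles roughly like the sketch below. The role and account names are hypothetical; the actual manifests are rendered by the operator.

```yaml
# Hypothetical sketch of the RBAC split between the two service accounts.
# The descheduler SA stays read-only except for the eviction subresource;
# the soft-tainter SA additionally needs to update/patch Node objects and
# is only deployed when the load-aware profile is requested.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: descheduler-example
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "namespaces"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: soft-tainter-example
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch", "update", "patch"]
```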

@tkashem
Contributor

tkashem commented May 2, 2025

/approve

Contributor

openshift-ci bot commented May 2, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2025
### Goals

* **Load-aware eviction**: VM pods are evicted based on their actual resource consumption, such as high PSI (Pressure Stall Information) resource pressure. The descheduler will be enhanced to incorporate Prometheus metrics for classifying node utilization.
* **Guided scheduling**: While keeping the default Kubernetes scheduler unchanged, nodes are soft-tainted when their actual resource utilization exceeds a predefined threshold. Soft taints help prevent overutilized nodes from becoming completely unschedulable while making them less desirable for scheduling. A soft taint (`effect: PreferNoSchedule`) acts as a preference rather than a strict rule, meaning the scheduler will attempt to avoid placing pods on nodes flagged as overutilized by the descheduler, but it is not guaranteed.
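For illustration, a soft taint of this kind on an overutilized node would look roughly like the following; the taint key is a made-up placeholder, not the key the controller actually uses.

```yaml
# Illustrative only: a PreferNoSchedule ("soft") taint on an overutilized node.
apiVersion: v1
kind: Node
metadata:
  name: worker-1
spec:
  taints:
  - key: example.io/descheduler-overutilized   # hypothetical key
    value: "true"
    effect: PreferNoSchedule   # a preference, not a hard NoSchedule exclusion
```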


predefined threshold -> I'd suggest removing "predefined" since we also want to support adaptive thresholds.

Contributor


I would argue that even with Adaptive we have predefined thresholds. :-)


On the descheduler side:
1. The node classification thresholds are configured either as static (using `thresholds` and `targetThresholds`, as in the current `LowNodeUtilization` plugin) or as dynamic, based on the state of other nodes in the cluster (using the `useDeviationThresholds` option within the plugin).
1. The `LowNodeUtilization` plugin classifies nodes into three categories: under-utilized, appropriately-utilized, and over-utilized, as it does today. The metric used for classification will be derived from a PromQL query executed against a Prometheus instance. Prometheus will handle the aggregation of historical data and time-series processing. The query is expected to return a scalar [0-1] value for each node.
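A hedged sketch of how this could look in a descheduler policy is shown below. `thresholds`, `targetThresholds`, and `useDeviationThresholds` are existing `LowNodeUtilization` arguments; the `psi-pressure` resource key and the wiring of the Prometheus-derived metric are illustrative assumptions, since the exact field names depend on the descheduler version.

```yaml
# Sketch of a possible LowNodeUtilization configuration for the load-aware case.
# thresholds/targetThresholds/useDeviationThresholds are existing plugin args;
# the "psi-pressure" key standing in for the Prometheus-derived per-node metric
# is illustrative only.
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
- name: load-aware-rebalancing
  pluginConfig:
  - name: LowNodeUtilization
    args:
      # Nodes below 20% are under-utilized, nodes above 50% over-utilized;
      # everything in between counts as appropriately utilized.
      thresholds:
        psi-pressure: 20
      targetThresholds:
        psi-pressure: 50
      # Alternatively, derive the thresholds from the cluster-wide average:
      # useDeviationThresholds: true
  plugins:
    balance:
      enabled:
      - LowNodeUtilization
```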


Maybe it's worth mentioning that we intend to use recording rules on the Prometheus side, to emphasize that we want to keep the aggregation of historical data and time-series processing there, keeping the descheduler simpler.

1. Looping through the node list:
1. The soft-tainter controller will apply a soft taint (`effect: PreferNoSchedule`) to the node if it is not already present and the node is classified as overutilized.
1. The soft-tainter controller will remove the soft taint from the node if it is present and the node is now classified as either low or appropriately utilized.


Should we mention the multi-soft-taint logic here?
E.g. 2 soft taints for overutilized nodes, 1 for appropriately utilized, 0 for underutilized.
We know that the scheduler counts the number of soft taints while weighing the nodes, so this will lead underutilized nodes to be preferred over appropriately utilized ones, which will in turn be preferred over overutilized ones.
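To make the multi-soft-taint idea concrete, the classification could map to taints roughly as in the sketch below (the keys are again purely illustrative). Because the scheduler's TaintToleration scoring penalizes each untolerated PreferNoSchedule taint, nodes with fewer taints score higher.

```yaml
# Illustrative mapping of node classification to soft taints (keys made up):
#   under-utilized         -> 0 taints
#   appropriately utilized -> 1 taint
#   over-utilized          -> 2 taints (example node shown below)
apiVersion: v1
kind: Node
metadata:
  name: worker-overutilized
spec:
  taints:
  - key: example.io/utilization-above-target
    effect: PreferNoSchedule
  - key: example.io/utilization-above-threshold
    effect: PreferNoSchedule
```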


- List of key tuning options is identified
- Customers confirm the tuning options lead to sufficiently fast rebalancing
- The new descheduler operator profile is available by default


I'd like to mention that we should also provide sane defaults (ideally fitting a relevant range of typical cluster sizes) for all the profileCustomizations coming with the new profile.
