OCPBUGS-7747: Do not set cpu system reserve below the default value #5046

Open
wants to merge 1 commit into
base: main
from

Conversation

harche
Contributor

@harche harche commented May 13, 2025

- What I did
If the number of CPUs is not large enough, the system reserved CPU value recommended by auto node sizing can be lower than the default of 500m. This PR discards the value calculated by the auto node sizing script if it is lower than 0.5.

- How to verify it
Enable auto node sizing on worker nodes that do not have a large number of CPUs; the value set for system reserved CPU should be at least 0.5.

- Description for the changelog

The system reserved CPU value now falls back to the 500m default when the value calculated by auto node sizing is lower than that.
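
The verification step above could be scripted roughly as follows. This is a minimal sketch and not part of the PR: it assumes the node sizing result is available in a file containing a `SYSTEM_RESERVED_CPU=<value>` line (the format the test script in this conversation prints), and the helper name `check_reserved_cpu` is hypothetical.

```shell
#!/bin/bash
# check_reserved_cpu FILE [MIN]  (hypothetical helper, not from the PR)
# Succeeds if FILE has a SYSTEM_RESERVED_CPU=<number> line whose value
# is at least MIN (default 0.5). Returns 2 if no such line is found.
check_reserved_cpu() {
    local file=$1 min=${2:-0.5} value
    # Take the last SYSTEM_RESERVED_CPU assignment in the file.
    value=$(sed -n 's/^SYSTEM_RESERVED_CPU=//p' "$file" | tail -n 1)
    if [ -z "$value" ]; then
        echo "SYSTEM_RESERVED_CPU not found in $file" >&2
        return 2
    fi
    # Compare numerically; awk exits 0 when value >= min, 1 otherwise.
    awk -v val="$value" -v min="$min" 'BEGIN { exit !(val + 0 >= min + 0) }'
}
```

Call it as `check_reserved_cpu FILE`, where FILE is whatever file the node sizing script writes on the node; this page does not name that path, so it is left as a parameter.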

@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels May 13, 2025
@openshift-ci-robot
Contributor

@harche: This pull request references Jira Issue OCPBUGS-7747, which is invalid:

  • expected the bug to target only the "4.20.0" version, but multiple target versions were set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did
If the number of CPUs is below 32, the system reserved CPU value recommended by auto node sizing could be lower than the default 500m value. This PR discards the value calculated by the auto node sizing script if it is lower than 0.5.

- How to verify it
Enable auto node sizing on worker nodes that have fewer than 32 CPUs; the value set for system reserved CPU should be at least 0.5.

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 13, 2025
@harche
Contributor Author

harche commented May 13, 2025

/hold for testing.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 13, 2025
Contributor

openshift-ci bot commented May 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: harche
Once this PR has been reviewed and has the lgtm label, please assign umohnani8 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@harche
Contributor Author

harche commented May 13, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 13, 2025
@openshift-ci-robot
Contributor

@harche: This pull request references Jira Issue OCPBUGS-7747, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

/jira refresh

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 13, 2025
@harche
Contributor Author

harche commented May 13, 2025

I created a standalone script for testing:

$ cat test.sh 
#!/bin/bash
set -e

VERSION_1=1
VERSION_2=2

function dynamic_cpu_sizing {
    total_cpu=$2  # passed-in for testing
    if [ -z "$total_cpu" ]; then
        total_cpu=$(getconf _NPROCESSORS_ONLN)
    fi

    if [ "$1" -eq "$VERSION_1" ]; then
        recommended_systemreserved_cpu=0
        # Tier 1: 6% of the first core
        if (($total_cpu <= 1)); then
            recommended_systemreserved_cpu=$(echo "$total_cpu * 0.06" | bc -l)
            total_cpu=0
        else
            recommended_systemreserved_cpu=0.06
            total_cpu=$((total_cpu - 1))
        fi
        # Tier 2: 1% of the second core
        if (($total_cpu <= 1)); then
            recommended_systemreserved_cpu=$(echo "$recommended_systemreserved_cpu + ($total_cpu * 0.01)" | bc -l)
            total_cpu=0
        else
            recommended_systemreserved_cpu=$(echo "$recommended_systemreserved_cpu + 0.01" | bc -l)
            total_cpu=$((total_cpu - 1))
        fi
        # Tier 3: 0.5% of each of the next two cores
        if (($total_cpu <= 2)); then
            recommended_systemreserved_cpu=$(echo "$recommended_systemreserved_cpu + ($total_cpu * 0.005)" | bc -l)
            total_cpu=0
        else
            recommended_systemreserved_cpu=$(echo "$recommended_systemreserved_cpu + 0.01" | bc -l)
            total_cpu=$((total_cpu - 2))
        fi
        # Tier 4: 0.25% of every core beyond the fourth
        if (($total_cpu >= 0)); then
            recommended_systemreserved_cpu=$(echo "$recommended_systemreserved_cpu + ($total_cpu * 0.0025)" | bc -l)
        fi
    else
        base_allocation_fraction=0.06
        increment_per_cpu_fraction=0.012
        if ((total_cpu > 1)); then
            recommended_systemreserved_cpu=$(awk -v base="$base_allocation_fraction" -v increment="$increment_per_cpu_fraction" -v cpus="$total_cpu" 'BEGIN {printf "%.3f\n", base + increment * (cpus - 1)}')
        else
            recommended_systemreserved_cpu=$base_allocation_fraction
        fi
    fi

    # Enforce minimum threshold of 0.5 CPU
    recommended_systemreserved_cpu=$(awk -v val="$recommended_systemreserved_cpu" 'BEGIN {if (val < 0.5) print 0.5; else printf "%.3f\n", val}')

    echo "SYSTEM_RESERVED_CPU=${recommended_systemreserved_cpu}"
}

function run_cpu_tests {
    for version in $VERSION_1 $VERSION_2; do
        echo "=== CPU Sizing Tests (Version $version) ==="
        for cpu in 1 2 4 8 16 32 64 128 256 512; do
            echo -n "CPU Count $cpu, "
            dynamic_cpu_sizing $version $cpu
        done
        echo
    done
}

if [[ "$1" == "true" ]]; then
    run_cpu_tests
else
    echo "Usage: $0 true"
    exit 1
fi

Looking at the output, the changes appear to work as expected:

$ ./test.sh true 
=== CPU Sizing Tests (Version 1) ===
CPU Count 1, SYSTEM_RESERVED_CPU=0.5
CPU Count 2, SYSTEM_RESERVED_CPU=0.5
CPU Count 4, SYSTEM_RESERVED_CPU=0.5
CPU Count 8, SYSTEM_RESERVED_CPU=0.5
CPU Count 16, SYSTEM_RESERVED_CPU=0.5
CPU Count 32, SYSTEM_RESERVED_CPU=0.5
CPU Count 64, SYSTEM_RESERVED_CPU=0.5
CPU Count 128, SYSTEM_RESERVED_CPU=0.5
CPU Count 256, SYSTEM_RESERVED_CPU=0.710
CPU Count 512, SYSTEM_RESERVED_CPU=1.350

=== CPU Sizing Tests (Version 2) ===
CPU Count 1, SYSTEM_RESERVED_CPU=0.5
CPU Count 2, SYSTEM_RESERVED_CPU=0.5
CPU Count 4, SYSTEM_RESERVED_CPU=0.5
CPU Count 8, SYSTEM_RESERVED_CPU=0.5
CPU Count 16, SYSTEM_RESERVED_CPU=0.5
CPU Count 32, SYSTEM_RESERVED_CPU=0.5
CPU Count 64, SYSTEM_RESERVED_CPU=0.816
CPU Count 128, SYSTEM_RESERVED_CPU=1.584
CPU Count 256, SYSTEM_RESERVED_CPU=3.120
CPU Count 512, SYSTEM_RESERVED_CPU=6.192
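
The CPU count at which each version stops being clamped can be worked out from the two formulas in the test script. The sketch below is not part of the PR: it re-implements both calculations in integer units of 1/10000 CPU (an assumption made here to avoid floating-point edge cases at exactly 0.5), and the function names are hypothetical.

```shell
#!/bin/bash
# All values are in units of 1/10000 CPU so the arithmetic stays in integers.
MIN_UNITS=5000  # 0.5 CPU, the floor enforced by the test script

# Version 1: tiered formula (6%, 1%, 2 x 0.5%, then 0.25% per extra core).
v1_units() {
    local n=$1 r=0
    if (( n <= 1 )); then r=$(( n * 600 )); n=0; else r=600; n=$(( n - 1 )); fi
    if (( n <= 1 )); then r=$(( r + n * 100 )); n=0; else r=$(( r + 100 )); n=$(( n - 1 )); fi
    if (( n <= 2 )); then r=$(( r + n * 50 )); n=0; else r=$(( r + 100 )); n=$(( n - 2 )); fi
    echo $(( r + n * 25 ))
}

# Version 2: 0.06 base plus 0.012 for every core after the first.
v2_units() {
    local n=$1
    if (( n > 1 )); then echo $(( 600 + 120 * (n - 1) )); else echo 600; fi
}

# Print the smallest CPU count whose unclamped value reaches the floor.
first_cpu_at_or_above_min() {
    local fn=$1 n=1
    while (( $("$fn" "$n") < MIN_UNITS )); do n=$(( n + 1 )); done
    echo "$n"
}
```

With these definitions, `first_cpu_at_or_above_min v1_units` prints 172 and `first_cpu_at_or_above_min v2_units` prints 38. That agrees with the output above: Version 1 is still clamped to 0.5 at 128 CPUs and not at 256, and Version 2 is clamped at 32 CPUs and not at 64.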

@openshift-ci-robot
Contributor

@harche: This pull request references Jira Issue OCPBUGS-7747, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

@harche
Contributor Author

harche commented May 14, 2025

/test unit

Contributor

openshift-ci bot commented May 14, 2025

@harche: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 10f431d | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
| ci/prow/e2e-gcp-op-single-node | 10f431d | link | true | /test e2e-gcp-op-single-node |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
2 participants