Skip to content

WIP: DRA: automated upgrade/downgrade testing #132295

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Jun 13, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Promotion of any feature, in this case core DRA to GA, depends on (so far) manually testing the upgrade/downgrade path. It's hard to document what exactly was tested in a way that others can verify the procedure and/or replicate it.

This PR adds helper packages for upgrading/downgrading a kind cluster and running E2E tests against, and uses that to test some DRA scenarios. It runs as part of the normal e2e.test invocation in pull/ci-kubernetes-kind-dra.

Which issue(s) this PR is related to:

#128965
KEP: kubernetes/enhancements#4381

Special notes for your reviewer:

It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done as E2E test mostly for pragmatic reasons:

  • It can reuse the existing helper code.
  • It can run in the normal DRA E2E jobs, with timeouts defined there.
  • Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed.

The new helper code for managing a kind cluster is written so that it could be used both in an integration test and an E2E test. #122481 could make that a bit easier in an E2E test, but is not absolutely required.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 13, 2025
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2025
@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 13, 2025
@k8s-ci-robot k8s-ci-robot requested review from bart0sh and liggitt June 13, 2025 18:39
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 14, 2025
@pohly pohly force-pushed the dra-version-skew branch from 2a1de75 to ba9e6e5 Compare June 17, 2025 16:16
pohly added 5 commits June 17, 2025 18:37
EOF occurs after restarting the API server and, despite a retry loop in
client-go/rest/request.go, sometimes is returned to the application.
Getting slices can be done with the helper.
Can be done via -vmodule, albeit not precisely because other controllers
also have a controller.go file.
During upgrade/downgrade testing, errors are encountered while the apiserver is
down. This is normal and handled via retrying, so we don't need to be verbose.
Hiding the error in WithError is the right choice for example
when it is used inside ktesting.Eventually. Most callers probably want to deal
with the unexpected error themselves. For those who don't, WithErrorLogging
continues to log it.
@pohly pohly force-pushed the dra-version-skew branch from ba9e6e5 to 764287b Compare June 17, 2025 16:39
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 17, 2025
@pohly
Copy link
Contributor Author

pohly commented Jun 17, 2025

/test pull-kubernetes-kind-dra

@pohly pohly changed the title DRA: automated upgrade/downgrade testing WIP: DRA: automated upgrade/downgrade testing Jun 17, 2025
Copy link
Contributor Author

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeding the review by calling out some aspects...

tCtx = ktesting.End(tCtx)

clusterName := "dra"
tCtx = ktesting.Begin(tCtx, fmt.Sprintf("bring up v%d.%d", major, previousMinor))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still undecided whether helper code like the new kindcluster should raise assertions itself or return an error that the caller must wrap to provide more context.

In this PR I am using "steps" to provide that context and avoid returning errors. Not having to handle those makes the test simpler and the step context also shows up in log entries, which helps when the same log message is emitted in different parts of the same test:

  I0617 17:34:42.433225 951392 stepcontext.go:75] bring up v1.33: Starting...
  I0617 17:34:42.876173 951392 versionskew.go:83] bring up v1.33: Running command: kind create cluster --retain --name dra --config /tmp/ginkgo3832759802/kind.yaml --kubeconfig /tmp/ginkgo3832759802/kube.config --image kind/cluster:1.33
  I0617 17:34:43.008304 951392 kindcluster.go:452] bring up v1.33: kind: Creating cluster "dra" ...
...
  I0617 17:35:08.992754 951392 stepcontext.go:86] bring up v1.33: Done, duration 26.5595254s

The downsides of steps is that it's easier to forget about setting them up, compared to the more usual error wrapping, and then assertion failures lack additional context.

// They must support bringing up a cluster with a node image built from
// the current Kubernetes version on the host on which the E2E suite runs.
// The repo root must point to the Kubernetes source code.
KindCommand = framework.WithFeature(framework.ValidFeatures.Add("KindCommand"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea: does this look reasonable to you?

From the commit message:

It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done mostly for pragmatic reasons:

  • It can reuse the existing helper code.
  • It can run in the normal DRA E2E jobs, with timeouts defined there.
  • Integration tests are limited to 10 minutes per directory and
    pull-kubernetes-integration is already a big and slow job.
    A different approach for per-SIG or per-feature integration
    testing would be needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea wrote:

I don't feel we should embed this in the e2e framework, some light kubetest2 kind of thing fits better

If we don't build this into the existing e2e.test, then we will need a new Ginkgo or Go testsuite and some mechanism to invoke it, including separate jobs. I'm not sure whether it's worth it.

But let's discuss the alternative. Suppose we put that under test/upgrade/dra (name TBD) and treat it like another E2E suite, i.e. exclude it from running in pull-kubernetes-unit by adding it to:

-e '^k8s.io/kubernetes/third_party(/.*)?$' \
-e '^k8s.io/kubernetes/cmd/kubeadm/test(/.*)?$' \
-e '^k8s.io/kubernetes/test/e2e$' \
-e '^k8s.io/kubernetes/test/e2e_node(/.*)?$' \
-e '^k8s.io/kubernetes/test/e2e_kubeadm(/.*)?$' \
-e '^k8s.io/.*/test/integration(/.*)?$'

To implement the test, the shared helper code ideally shouldn't need the Framework instance, but that could be cleaned up later. I probably would do it as a Go test instead of Ginkgo suite. make test WHAT=test/upgrade/dra might work.

Technically this seems doable. I don't like the need to spin up another job because, besides the need to maintain that job, my gut feeling is that we'll get better resource utilization when running this test in parallel to other, normal E2E tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also use test/e2e/dra/upgrade-downgrade, then we don't need a top-level test/upgrade-downgrade. That package then wouldn't get imported into the test/e2e suite, i.e. remain as a standalone test.

It still would need to be added to that exclude list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or simply leave it where it is and turn it into a Go unit test which skips itself unless a command line flag or env variable enables it...

I'm going to try that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The e2e test framework, should focus on testing the application's functionality, and now becomes responsible for provisioning and de-provisioning the test environment. This blurs responsibilities violating the separation of concerns.

I really do not like this coupling

If we don't build this into the existing e2e.test, then we will need a new Ginkgo or Go testsuite and some mechanism to invoke it, including separate jobs. I'm not sure whether it's worth it.

This already exist https://github.com/kubernetes/kubeadm/blob/main/kinder/doc/test-upgrades.md

kubeadm folks were testing upgrades for a long time https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-upgrade-1-33-latest


var _ = framework.SIGDescribe("node")(framework.WithLabel("DRA"), feature.DynamicResourceAllocation, framework.WithFeatureGate(features.DynamicResourceAllocation), framework.WithSlow() /* TODO: add feature.KindCommand once jobs are updated*/, func() {

ginkgo.It("upgrade/downgrade works", func(ctx context.Context) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lauralorenz, @SergeyKanzhelev: once this PR is merged with this initial base test, it would be great to extend it with other scenarios described in #128965.

// Beware that the apiserver not coming up again can be caused by not
// having up-to-date release images, so run "make quick-release" manually.

var _ = framework.SIGDescribe("node")(framework.WithLabel("DRA"), feature.DynamicResourceAllocation, framework.WithFeatureGate(features.DynamicResourceAllocation), framework.WithSlow() /* TODO: add feature.KindCommand once jobs are updated*/, func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea: we need to agree on this new feature, then I need to whitelist it in the DRA jobs, then use it here.

We do not absolutely need it for DRA (tests currently only run in jobs which support kind), but it would be cleaner to have it.

@pohly
Copy link
Contributor Author

pohly commented Jun 17, 2025

/test pull-kubernetes-kind-dra

deleted := newObj != nil
deleted := newObj == nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts exactly 😢

#132334

Included here as a stop-gap, not meant to be merged via this PR.

tCtx = ktesting.WithContext(tCtx, ctx)

tCtx = ktesting.Begin(tCtx, "get source code version")
gitVersion, dockerTag, err := sourceVersion(tCtx, framework.TestContext.RepoRoot)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't you get that from the apiserver?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This runs before there is a cluster, so there is no apiserver yet.

// sourceVersion identifies the Kubernetes git version based on hack/print-workspace-status.sh.
//
// Adapted from https://github.com/kubernetes-sigs/kind/blob/3df64e784cc0ea74125b2a2e9877817418afa3af/pkg/build/nodeimage/internal/kube/source.go#L71-L104
func sourceVersion(tCtx ktesting.TContext, kubeRoot string) (gitVersion string, dockerTag string, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel we should embed this in the e2e framework, some light kubetest2 kind of thing fits better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss in #132295 (comment).

@pohly pohly force-pushed the dra-version-skew branch from 764287b to 7fd0653 Compare June 17, 2025 18:00
@pohly
Copy link
Contributor Author

pohly commented Jun 17, 2025

/test pull-kubernetes-kind-dra

Temporarily removed the WithSlow to get it to run...

@pohly
Copy link
Contributor Author

pohly commented Jun 17, 2025

/test pull-kubernetes-kind-dra

2 similar comments
@pohly
Copy link
Contributor Author

pohly commented Jun 17, 2025

/test pull-kubernetes-kind-dra

@pohly
Copy link
Contributor Author

pohly commented Jun 17, 2025

/test pull-kubernetes-kind-dra

@pohly pohly force-pushed the dra-version-skew branch from d77a4d9 to 069aa93 Compare June 17, 2025 19:54
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 17, 2025
pohly added 4 commits June 18, 2025 16:33
This allows declaring a code region as one step without having to use
an anonymous callback function, which has the advantage that variables
set during the step are visible afterwards.

In Python, this would be done as

    with ktesting.Step(tctx) as tcxt:
        // some code code inside step
    // code not in the same step

But Go has no such construct.

In contrast to WithStep, the start and end of the step are logged, including
timing information.
This is a DRA-specific stop-gap solution for using the E2E framework together
with ktesting. Long-term this should better land in the E2E framework itself.
The test brings up the kind cluster and uses that power to run through
an upgrade/downgrade. Version skew testing (running tests while the cluster
is partially up- or downgraded) could be added.

Several tests could run in parallel because each cluster is independent.
Inside the test the steps have to be sequential.

It is debatable whether this should be an E2E test at all. Technically this
could also be an integration test. It's currently done mostly for pragmatic
reasons:
- It can reuse the existing helper code.
- It can run in the normal DRA E2E jobs, with timeouts defined there.
- Integration tests are limited to 10 minutes per directory and
  pull-kubernetes-integration is already a big and slow job.
  A different approach for per-SIG or per-feature integration
  testing would be needed.

The new helper code for managing a kind cluster is written so that it could be
used both in an integration test and an E2E
test. kubernetes#122481 could make that a
bit easier in an E2E test, but is not absolutely required.
"KIND_COMMAND=kind go test ./test/e2e/dra" can run it, as if it was a unit test.
-ginkgo.junit-report can be used to produce a JUnit report. The main advantage
over a unit test is that the report will only contain the actual failure
message, because Ginkgo (in contrast to `go test`) tracks failures separately
from output.

From a practical perspective, doing this in a Ginkgo suite in the same
directory as before has the advantage that none of the helper code has to be
updated. Doing it as a Go unit test had been tried and turned into a major
effort.

To avoid running this in "make test", the test function returns unless
KIND_COMMAND is set.
@pohly pohly force-pushed the dra-version-skew branch from 069aa93 to cf1a211 Compare June 18, 2025 19:34
@pohly
Copy link
Contributor Author

pohly commented Jun 18, 2025

/test pull-kubernetes-kind-dra pull-kubernetes-unit pull-kubernetes-integration

line := scanner.Text()
line = strings.TrimSuffix(line, "\n")
tCtx.Logf("%s: %s", name, line)
output.WriteString(line)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Danger! DATA RACE.

Will fix...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 🆕 New
Status: Needs Triage
Development

Successfully merging this pull request may close these issues.

3 participants