WIP: DRA: automated upgrade/downgrade testing #132295

pohly · 2025-06-13T18:38:49Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Promotion of any feature, in this case core DRA to GA, depends on (so far) manually testing the upgrade/downgrade path. It's hard to document what exactly was tested in a way that others can verify the procedure and/or replicate it.

This PR adds helper packages for upgrading/downgrading a kind cluster and running E2E tests against, and uses that to test some DRA scenarios. It runs as part of the normal e2e.test invocation in pull/ci-kubernetes-kind-dra.

Which issue(s) this PR is related to:

#128965
KEP: kubernetes/enhancements#4381

Special notes for your reviewer:

It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done as E2E test mostly for pragmatic reasons:

It can reuse the existing helper code.
It can run in the normal DRA E2E jobs, with timeouts defined there.
Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed.

The new helper code for managing a kind cluster is written so that it could be used both in an integration test and an E2E test. #122481 could make that a bit easier in an E2E test, but is not absolutely required.

Does this PR introduce a user-facing change?

NONE

k8s-ci-robot · 2025-06-13T18:38:52Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2025-06-13T18:38:58Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

EOF occurs after restarting the API server and, despite a retry loop in client-go/rest/request.go, sometimes is returned to the application.

Getting slices can be done with the helper.

Can be done via -vmodule, albeit not precisely because other controllers also have a controller.go file.

During upgrade/downgrade testing, errors are encountered while the apiserver is down. This is normal and handled via retrying, so we don't need to be verbose.

Hiding the error in WithError is the right choice for example when it is used inside ktesting.Eventually. Most callers probably want to deal with the unexpected error themselves. For those who don't, WithErrorLogging continues to log it.

pohly · 2025-06-17T16:39:47Z

/test pull-kubernetes-kind-dra

pohly

Seeding the review by calling out some aspects...

pohly · 2025-06-17T16:47:58Z

test/e2e/dra/versionskew.go

+		tCtx = ktesting.End(tCtx)
+
+		clusterName := "dra"
+		tCtx = ktesting.Begin(tCtx, fmt.Sprintf("bring up v%d.%d", major, previousMinor))


I'm still undecided whether helper code like the new kindcluster should raise assertions itself or return an error that the caller must wrap to provide more context.

In this PR I am using "steps" to provide that context and avoid returning errors. Not having to handle those makes the test simpler and the step context also shows up in log entries, which helps when the same log message is emitted in different parts of the same test:

I0617 17:34:42.433225 951392 stepcontext.go:75] bring up v1.33: Starting... I0617 17:34:42.876173 951392 versionskew.go:83] bring up v1.33: Running command: kind create cluster --retain --name dra --config /tmp/ginkgo3832759802/kind.yaml --kubeconfig /tmp/ginkgo3832759802/kube.config --image kind/cluster:1.33 I0617 17:34:43.008304 951392 kindcluster.go:452] bring up v1.33: kind: Creating cluster "dra" ... ... I0617 17:35:08.992754 951392 stepcontext.go:86] bring up v1.33: Done, duration 26.5595254s

The downsides of steps is that it's easier to forget about setting them up, compared to the more usual error wrapping, and then assertion failures lack additional context.

pohly · 2025-06-17T16:49:57Z

test/e2e/feature/feature.go

+	// They must support bringing up a cluster with a node image built from
+	// the current Kubernetes version on the host on which the E2E suite runs.
+	// The repo root must point to the Kubernetes source code.
+	KindCommand = framework.WithFeature(framework.ValidFeatures.Add("KindCommand"))


@aojea: does this look reasonable to you?

From the commit message:

It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done mostly for pragmatic reasons:

It can reuse the existing helper code.

It can run in the normal DRA E2E jobs, with timeouts defined there.

Integration tests are limited to 10 minutes per directory and
pull-kubernetes-integration is already a big and slow job.
A different approach for per-SIG or per-feature integration
testing would be needed.

@aojea wrote:

I don't feel we should embed this in the e2e framework, some light kubetest2 kind of thing fits better

If we don't build this into the existing e2e.test, then we will need a new Ginkgo or Go testsuite and some mechanism to invoke it, including separate jobs. I'm not sure whether it's worth it.

But let's discuss the alternative. Suppose we put that under test/upgrade/dra (name TBD) and treat it like another E2E suite, i.e. exclude it from running in pull-kubernetes-unit by adding it to:

kubernetes/hack/make-rules/test.sh

Lines 50 to 55 in 211e586

-e '^k8s.io/kubernetes/third_party(/.*)?$' \

-e '^k8s.io/kubernetes/cmd/kubeadm/test(/.*)?$' \

-e '^k8s.io/kubernetes/test/e2e$' \

-e '^k8s.io/kubernetes/test/e2e_node(/.*)?$' \

-e '^k8s.io/kubernetes/test/e2e_kubeadm(/.*)?$' \

-e '^k8s.io/.*/test/integration(/.*)?$'

To implement the test, the shared helper code ideally shouldn't need the Framework instance, but that could be cleaned up later. I probably would do it as a Go test instead of Ginkgo suite. make test WHAT=test/upgrade/dra might work.

Technically this seems doable. I don't like the need to spin up another job because, besides the need to maintain that job, my gut feeling is that we'll get better resource utilization when running this test in parallel to other, normal E2E tests.

We can also use test/e2e/dra/upgrade-downgrade, then we don't need a top-level test/upgrade-downgrade. That package then wouldn't get imported into the test/e2e suite, i.e. remain as a standalone test.

It still would need to be added to that exclude list.

Or simply leave it where it is and turn it into a Go unit test which skips itself unless a command line flag or env variable enables it...

I'm going to try that.

New job: kubernetes/test-infra#35004

The e2e test framework, should focus on testing the application's functionality, and now becomes responsible for provisioning and de-provisioning the test environment. This blurs responsibilities violating the separation of concerns.

I really do not like this coupling

If we don't build this into the existing e2e.test, then we will need a new Ginkgo or Go testsuite and some mechanism to invoke it, including separate jobs. I'm not sure whether it's worth it.

This already exist https://github.com/kubernetes/kubeadm/blob/main/kinder/doc/test-upgrades.md

kubeadm folks were testing upgrades for a long time https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-upgrade-1-33-latest

pohly · 2025-06-17T16:51:39Z

test/e2e/dra/versionskew.go

+
+var _ = framework.SIGDescribe("node")(framework.WithLabel("DRA"), feature.DynamicResourceAllocation, framework.WithFeatureGate(features.DynamicResourceAllocation), framework.WithSlow() /* TODO: add feature.KindCommand once jobs are updated*/, func() {
+
+	ginkgo.It("upgrade/downgrade works", func(ctx context.Context) {


@lauralorenz, @SergeyKanzhelev: once this PR is merged with this initial base test, it would be great to extend it with other scenarios described in #128965.

pohly · 2025-06-17T16:53:25Z

test/e2e/dra/versionskew.go

+// Beware that the apiserver not coming up again can be caused by not
+// having up-to-date release images, so run "make quick-release" manually.
+
+var _ = framework.SIGDescribe("node")(framework.WithLabel("DRA"), feature.DynamicResourceAllocation, framework.WithFeatureGate(features.DynamicResourceAllocation), framework.WithSlow() /* TODO: add feature.KindCommand once jobs are updated*/, func() {


@aojea: we need to agree on this new feature, then I need to whitelist it in the DRA jobs, then use it here.

We do not absolutely need it for DRA (tests currently only run in jobs which support kind), but it would be cleaner to have it.

pohly · 2025-06-17T16:54:32Z

/test pull-kubernetes-kind-dra

aojea · 2025-06-17T17:18:08Z

pkg/controller/resourceclaim/controller.go

-	deleted := newObj != nil
+	deleted := newObj == nil


My thoughts exactly 😢

#132334

Included here as a stop-gap, not meant to be merged via this PR.

aojea · 2025-06-17T17:20:13Z

test/e2e/dra/versionskew.go

+		tCtx = ktesting.WithContext(tCtx, ctx)
+
+		tCtx = ktesting.Begin(tCtx, "get source code version")
+		gitVersion, dockerTag, err := sourceVersion(tCtx, framework.TestContext.RepoRoot)


can't you get that from the apiserver?

This runs before there is a cluster, so there is no apiserver yet.

aojea · 2025-06-17T17:32:55Z

test/e2e/dra/versionskew.go

+// sourceVersion identifies the Kubernetes git version based on hack/print-workspace-status.sh.
+//
+// Adapted from https://github.com/kubernetes-sigs/kind/blob/3df64e784cc0ea74125b2a2e9877817418afa3af/pkg/build/nodeimage/internal/kube/source.go#L71-L104
+func sourceVersion(tCtx ktesting.TContext, kubeRoot string) (gitVersion string, dockerTag string, err error) {


I don't feel we should embed this in the e2e framework, some light kubetest2 kind of thing fits better

Let's discuss in #132295 (comment).

pohly · 2025-06-17T18:20:28Z

/test pull-kubernetes-kind-dra

Temporarily removed the WithSlow to get it to run...

pohly · 2025-06-17T18:52:46Z

/test pull-kubernetes-kind-dra

pohly · 2025-06-17T19:39:23Z

/test pull-kubernetes-kind-dra

pohly · 2025-06-17T19:43:52Z

/test pull-kubernetes-kind-dra

This allows declaring a code region as one step without having to use an anonymous callback function, which has the advantage that variables set during the step are visible afterwards. In Python, this would be done as with ktesting.Step(tctx) as tcxt: // some code code inside step // code not in the same step But Go has no such construct. In contrast to WithStep, the start and end of the step are logged, including timing information.

This is a DRA-specific stop-gap solution for using the E2E framework together with ktesting. Long-term this should better land in the E2E framework itself.

The test brings up the kind cluster and uses that power to run through an upgrade/downgrade. Version skew testing (running tests while the cluster is partially up- or downgraded) could be added. Several tests could run in parallel because each cluster is independent. Inside the test the steps have to be sequential. It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done mostly for pragmatic reasons: - It can reuse the existing helper code. - It can run in the normal DRA E2E jobs, with timeouts defined there. - Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed. The new helper code for managing a kind cluster is written so that it could be used both in an integration test and an E2E test. kubernetes#122481 could make that a bit easier in an E2E test, but is not absolutely required.

"KIND_COMMAND=kind go test ./test/e2e/dra" can run it, as if it was a unit test. -ginkgo.junit-report can be used to produce a JUnit report. The main advantage over a unit test is that the report will only contain the actual failure message, because Ginkgo (in contrast to `go test`) tracks failures separately from output. From a practical perspective, doing this in a Ginkgo suite in the same directory as before has the advantage that none of the helper code has to be updated. Doing it as a Go unit test had been tried and turned into a major effort. To avoid running this in "make test", the test function returns unless KIND_COMMAND is set.

pohly · 2025-06-18T20:10:53Z

/test pull-kubernetes-kind-dra pull-kubernetes-unit pull-kubernetes-integration

pohly · 2025-06-18T20:13:09Z

test/utils/kindcluster/kindcluster.go

+			line := scanner.Text()
+			line = strings.TrimSuffix(line, "\n")
+			tCtx.Logf("%s: %s", name, line)
+			output.WriteString(line)


Danger! DATA RACE.

Will fix...

github-project-automation bot added this to SIG Node CI/Test Board and SIG Node: code and documentation PRs Jun 13, 2025

k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2025

github-project-automation bot moved this to Triage in SIG Node CI/Test Board Jun 13, 2025

github-project-automation bot moved this to Triage in SIG Node: code and documentation PRs Jun 13, 2025

k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 13, 2025

k8s-ci-robot requested review from bart0sh and liggitt June 13, 2025 18:39

github-project-automation bot added this to Dynamic Resource Allocation Jun 13, 2025

github-project-automation bot moved this to 🆕 New in Dynamic Resource Allocation Jun 13, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2025

pohly mentioned this pull request Jun 13, 2025

DRA: version skew testing #128965

Open

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 14, 2025

pohly force-pushed the dra-version-skew branch from 2a1de75 to ba9e6e5 Compare June 17, 2025 16:16

pohly added 5 commits June 17, 2025 18:37

E2E framework: retry also on EOF errors

efedf5b

EOF occurs after restarting the API server and, despite a retry loop in client-go/rest/request.go, sometimes is returned to the application.

DRA E2E: simplify IsGone

3bc2a31

Getting slices can be done with the helper.

DRA E2E: increase verbosity of resource claim controller

f109a8f

Can be done via -vmodule, albeit not precisely because other controllers also have a controller.go file.

DRA E2E: reduce port proxy verbosity

c4b080f

During upgrade/downgrade testing, errors are encountered while the apiserver is down. This is normal and handled via retrying, so we don't need to be verbose.

ktesting: skip logging error when capturing it

21cf9f3

Hiding the error in WithError is the right choice for example when it is used inside ktesting.Eventually. Most callers probably want to deal with the unexpected error themselves. For those who don't, WithErrorLogging continues to log it.

pohly force-pushed the dra-version-skew branch from ba9e6e5 to 764287b Compare June 17, 2025 16:39

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 17, 2025

pohly changed the title ~~DRA: automated upgrade/downgrade testing~~ WIP: DRA: automated upgrade/downgrade testing Jun 17, 2025

pohly commented Jun 17, 2025

View reviewed changes

aojea reviewed Jun 17, 2025

View reviewed changes

pohly force-pushed the dra-version-skew branch from 764287b to 7fd0653 Compare June 17, 2025 18:00

pohly force-pushed the dra-version-skew branch from d77a4d9 to 069aa93 Compare June 17, 2025 19:54

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 17, 2025

pohly mentioned this pull request Jun 18, 2025

DRA canary: add integration test job kubernetes/test-infra#35004

Merged

pohly added 4 commits June 18, 2025 16:33

DRA E2E: support using ktesting

5f2e663

This is a DRA-specific stop-gap solution for using the E2E framework together with ktesting. Long-term this should better land in the E2E framework itself.

pohly force-pushed the dra-version-skew branch from 069aa93 to cf1a211 Compare June 18, 2025 19:34

pohly commented Jun 18, 2025

View reviewed changes

	-e '^k8s.io/kubernetes/third_party(/.*)?$' \
	-e '^k8s.io/kubernetes/cmd/kubeadm/test(/.*)?$' \
	-e '^k8s.io/kubernetes/test/e2e$' \
	-e '^k8s.io/kubernetes/test/e2e_node(/.*)?$' \
	-e '^k8s.io/kubernetes/test/e2e_kubeadm(/.*)?$' \
	-e '^k8s.io/./test/integration(/.)?$'


		var _ = framework.SIGDescribe("node")(framework.WithLabel("DRA"), feature.DynamicResourceAllocation, framework.WithFeatureGate(features.DynamicResourceAllocation), framework.WithSlow() /* TODO: add feature.KindCommand once jobs are updated*/, func() {

		ginkgo.It("upgrade/downgrade works", func(ctx context.Context) {

WIP: DRA: automated upgrade/downgrade testing #132295

Are you sure you want to change the base?

WIP: DRA: automated upgrade/downgrade testing #132295

Conversation

pohly commented Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Uh oh!

k8s-ci-robot commented Jun 13, 2025

Uh oh!

k8s-ci-robot commented Jun 13, 2025

Uh oh!

pohly commented Jun 17, 2025

Uh oh!

pohly left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pohly commented Jun 17, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pohly commented Jun 17, 2025

Uh oh!

pohly commented Jun 17, 2025

Uh oh!

pohly commented Jun 17, 2025

Uh oh!

pohly commented Jun 17, 2025

Uh oh!

pohly commented Jun 18, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pohly commented Jun 13, 2025 •

edited

Loading