WIP: DRA: automated upgrade/downgrade testing #132295
base: master
Conversation
Skipping CI for Draft Pull Request.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
EOF occurs after restarting the API server and, despite a retry loop in client-go/rest/request.go, is sometimes returned to the application.
Getting slices can be done with the helper.
Can be done via -vmodule, albeit not precisely because other controllers also have a controller.go file.
During upgrade/downgrade testing, errors are encountered while the apiserver is down. This is normal and handled via retrying, so we don't need to be verbose.
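A minimal sketch of that tolerant retry pattern, using the apimachinery wait helpers rather than the PR's actual helper code (the interval and timeout values are illustrative assumptions):

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForAPIServer polls until the apiserver answers again after a restart.
// Transient errors (EOF, connection refused) are expected here and swallowed
// instead of being logged verbosely.
func waitForAPIServer(ctx context.Context, client kubernetes.Interface) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			if _, err := client.Discovery().ServerVersion(); err != nil {
				return false, nil // still down, keep retrying
			}
			return true, nil
		})
}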
Hiding the error in WithError is the right choice, for example, when it is used inside ktesting.Eventually. Most callers probably want to deal with the unexpected error themselves. For those who don't, WithErrorLogging continues to log it.
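A purely conceptual sketch of that split; the names below illustrate the idea and are not the actual ktesting API:

// Hypothetical: a failure gets recorded into an error for the caller to
// handle instead of failing the test directly; the logging variant
// additionally logs it.
type failHandler func(error)

func withError(dst *error) failHandler {
	return func(e error) { *dst = e } // hidden from output, surfaced to caller
}

func withErrorLogging(dst *error, logf func(format string, args ...any)) failHandler {
	return func(e error) {
		logf("unexpected error: %v", e) // still shows up in the log
		*dst = e
	}
}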
/test pull-kubernetes-kind-dra
Seeding the review by calling out some aspects...
test/e2e/dra/versionskew.go (Outdated)

tCtx = ktesting.End(tCtx)

clusterName := "dra"
tCtx = ktesting.Begin(tCtx, fmt.Sprintf("bring up v%d.%d", major, previousMinor))
I'm still undecided whether helper code like the new kindcluster should raise assertions itself or return an error that the caller must wrap to provide more context.
In this PR I am using "steps" to provide that context and avoid returning errors. Not having to handle those makes the test simpler and the step context also shows up in log entries, which helps when the same log message is emitted in different parts of the same test:
I0617 17:34:42.433225 951392 stepcontext.go:75] bring up v1.33: Starting...
I0617 17:34:42.876173 951392 versionskew.go:83] bring up v1.33: Running command: kind create cluster --retain --name dra --config /tmp/ginkgo3832759802/kind.yaml --kubeconfig /tmp/ginkgo3832759802/kube.config --image kind/cluster:1.33
I0617 17:34:43.008304 951392 kindcluster.go:452] bring up v1.33: kind: Creating cluster "dra" ...
...
I0617 17:35:08.992754 951392 stepcontext.go:86] bring up v1.33: Done, duration 26.5595254s
The downside of steps is that they're easier to forget to set up, compared to the more usual error wrapping, and then assertion failures lack additional context.
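To make the trade-off concrete, here is a sketch of both styles. kindcluster.Create and kindcluster.CreateWithError are hypothetical names; the Begin/End calls follow the diff above:

// Step style: the helper asserts internally, the step provides the context.
func bringUpWithSteps(tCtx ktesting.TContext, major, previousMinor int) {
	tCtx = ktesting.Begin(tCtx, fmt.Sprintf("bring up v%d.%d", major, previousMinor))
	cluster := kindcluster.Create(tCtx, "dra") // fails the test itself on error
	_ = cluster
	tCtx = ktesting.End(tCtx)
}

// Error style: the caller wraps each error to attach the same context.
func bringUpWithErrors(ctx context.Context, major, previousMinor int) error {
	if _, err := kindcluster.CreateWithError(ctx, "dra"); err != nil {
		return fmt.Errorf("bring up v%d.%d: %w", major, previousMinor, err)
	}
	return nil
}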
// They must support bringing up a cluster with a node image built from
// the current Kubernetes version on the host on which the E2E suite runs.
// The repo root must point to the Kubernetes source code.
KindCommand = framework.WithFeature(framework.ValidFeatures.Add("KindCommand"))
@aojea: does this look reasonable to you?
From the commit message:
It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done mostly for pragmatic reasons:
- It can reuse the existing helper code.
- It can run in the normal DRA E2E jobs, with timeouts defined there.
- Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed.
@aojea wrote:
I don't feel we should embed this in the e2e framework, some light kubetest2 kind of thing fits better
If we don't build this into the existing e2e.test, then we will need a new Ginkgo or Go testsuite and some mechanism to invoke it, including separate jobs. I'm not sure whether it's worth it.
But let's discuss the alternative. Suppose we put that under test/upgrade/dra (name TBD) and treat it like another E2E suite, i.e. exclude it from running in pull-kubernetes-unit by adding it to kubernetes/hack/make-rules/test.sh, lines 50 to 55 in 211e586:
  -e '^k8s.io/kubernetes/third_party(/.*)?$' \
  -e '^k8s.io/kubernetes/cmd/kubeadm/test(/.*)?$' \
  -e '^k8s.io/kubernetes/test/e2e$' \
  -e '^k8s.io/kubernetes/test/e2e_node(/.*)?$' \
  -e '^k8s.io/kubernetes/test/e2e_kubeadm(/.*)?$' \
  -e '^k8s.io/.*/test/integration(/.*)?$'
To implement the test, the shared helper code ideally shouldn't need the Framework instance, but that could be cleaned up later. I probably would do it as a Go test instead of a Ginkgo suite. make test WHAT=test/upgrade/dra might work.
Technically this seems doable. I don't like the need to spin up another job because, besides the need to maintain that job, my gut feeling is that we'll get better resource utilization when running this test in parallel to other, normal E2E tests.
We can also use test/e2e/dra/upgrade-downgrade, then we don't need a top-level test/upgrade-downgrade. That package then wouldn't get imported into the test/e2e suite, i.e. it would remain a standalone test. It still would need to be added to that exclude list.
Or simply leave it where it is and turn it into a Go unit test which skips itself unless a command line flag or env variable enables it...
I'm going to try that.
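A sketch of that self-skipping entry point, assuming the KIND_COMMAND gating described later in this thread (the function name is illustrative):

package dra

import (
	"os"
	"testing"
)

func TestUpgradeDowngrade(t *testing.T) {
	if os.Getenv("KIND_COMMAND") == "" {
		// Keeps "make test" and pull-kubernetes-unit unaffected: the
		// expensive cluster test only runs when explicitly enabled.
		t.Skip("set KIND_COMMAND to run the upgrade/downgrade test")
	}
	// ... bring up a kind cluster, upgrade, test, downgrade ...
}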
New job: kubernetes/test-infra#35004
The e2e test framework should focus on testing the application's functionality, but now it becomes responsible for provisioning and de-provisioning the test environment. This blurs responsibilities, violating the separation of concerns.
I really do not like this coupling.
If we don't build this into the existing e2e.test, then we will need a new Ginkgo or Go testsuite and some mechanism to invoke it, including separate jobs. I'm not sure whether it's worth it.
This already exists: https://github.com/kubernetes/kubeadm/blob/main/kinder/doc/test-upgrades.md
kubeadm folks have been testing upgrades for a long time: https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-upgrade-1-33-latest
test/e2e/dra/versionskew.go (Outdated)

var _ = framework.SIGDescribe("node")(framework.WithLabel("DRA"), feature.DynamicResourceAllocation, framework.WithFeatureGate(features.DynamicResourceAllocation), framework.WithSlow() /* TODO: add feature.KindCommand once jobs are updated*/, func() {

ginkgo.It("upgrade/downgrade works", func(ctx context.Context) {
@lauralorenz, @SergeyKanzhelev: once this PR is merged with this initial base test, it would be great to extend it with other scenarios described in #128965.
test/e2e/dra/versionskew.go (Outdated)

// Beware that the apiserver not coming up again can be caused by not
// having up-to-date release images, so run "make quick-release" manually.

var _ = framework.SIGDescribe("node")(framework.WithLabel("DRA"), feature.DynamicResourceAllocation, framework.WithFeatureGate(features.DynamicResourceAllocation), framework.WithSlow() /* TODO: add feature.KindCommand once jobs are updated*/, func() {
@aojea: we need to agree on this new feature, then I need to whitelist it in the DRA jobs, then use it here.
We do not absolutely need it for DRA (tests currently only run in jobs which support kind), but it would be cleaner to have it.
/test pull-kubernetes-kind-dra
- deleted := newObj != nil
+ deleted := newObj == nil
ugh
test/e2e/dra/versionskew.go (Outdated)

tCtx = ktesting.WithContext(tCtx, ctx)

tCtx = ktesting.Begin(tCtx, "get source code version")
gitVersion, dockerTag, err := sourceVersion(tCtx, framework.TestContext.RepoRoot)
can't you get that from the apiserver?
This runs before there is a cluster, so there is no apiserver yet.
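Roughly what sourceVersion does, per the doc comment in the diff below: shell out to hack/print-workspace-status.sh and pick two keys out of its "key value" output. The key names follow the kind code it was adapted from, so treat this as an approximation:

import (
	"context"
	"fmt"
	"os/exec"
	"path/filepath"
	"strings"
)

func sourceVersionSketch(ctx context.Context, kubeRoot string) (gitVersion, dockerTag string, err error) {
	cmd := exec.CommandContext(ctx, filepath.Join(kubeRoot, "hack", "print-workspace-status.sh"))
	cmd.Dir = kubeRoot
	out, err := cmd.Output()
	if err != nil {
		return "", "", fmt.Errorf("print-workspace-status.sh: %w", err)
	}
	// Each output line is a "key value" pair.
	for _, line := range strings.Split(string(out), "\n") {
		parts := strings.SplitN(line, " ", 2)
		if len(parts) != 2 {
			continue
		}
		switch parts[0] {
		case "gitVersion":
			gitVersion = parts[1]
		case "STABLE_DOCKER_TAG":
			dockerTag = parts[1]
		}
	}
	return gitVersion, dockerTag, nil
}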
test/e2e/dra/versionskew.go (Outdated)

// sourceVersion identifies the Kubernetes git version based on hack/print-workspace-status.sh.
//
// Adapted from https://github.com/kubernetes-sigs/kind/blob/3df64e784cc0ea74125b2a2e9877817418afa3af/pkg/build/nodeimage/internal/kube/source.go#L71-L104
func sourceVersion(tCtx ktesting.TContext, kubeRoot string) (gitVersion string, dockerTag string, err error) {
I don't feel we should embed this in the e2e framework, some light kubetest2 kind of thing fits better
Let's discuss in #132295 (comment).
/test pull-kubernetes-kind-dra
Temporarily removed the …
/test pull-kubernetes-kind-dra
This allows declaring a code region as one step without having to use an anonymous callback function, which has the advantage that variables set during the step are visible afterwards. In Python, this would be done as:

with ktesting.Step(tctx) as tctx:
    # code inside step
# code not in the same step

But Go has no such construct. In contrast to WithStep, the start and end of the step are logged, including timing information.
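The Go equivalent, pieced together from the diffs in this PR (createCluster is a hypothetical helper):

// The step wraps a plain code region instead of a callback, so variables
// declared inside stay in scope afterwards; Begin/End also log timing.
func example(tCtx ktesting.TContext) {
	tCtx = ktesting.Begin(tCtx, "bring up cluster")
	clusterName := "dra"
	cluster := createCluster(tCtx, clusterName)
	tCtx = ktesting.End(tCtx)
	_ = cluster // still visible here, unlike with a callback-based step
	_ = clusterName
}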
This is a DRA-specific stop-gap solution for using the E2E framework together with ktesting. Long-term this should better land in the E2E framework itself.
The test brings up the kind cluster and uses that power to run through an upgrade/downgrade. Version skew testing (running tests while the cluster is partially up- or downgraded) could be added. Several tests could run in parallel because each cluster is independent. Inside the test the steps have to be sequential.

It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done mostly for pragmatic reasons:
- It can reuse the existing helper code.
- It can run in the normal DRA E2E jobs, with timeouts defined there.
- Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed.

The new helper code for managing a kind cluster is written so that it could be used both in an integration test and an E2E test. kubernetes#122481 could make that a bit easier in an E2E test, but is not absolutely required.
"KIND_COMMAND=kind go test ./test/e2e/dra" can run it, as if it was a unit test. -ginkgo.junit-report can be used to produce a JUnit report. The main advantage over a unit test is that the report will only contain the actual failure message, because Ginkgo (in contrast to `go test`) tracks failures separately from output. From a practical perspective, doing this in a Ginkgo suite in the same directory as before has the advantage that none of the helper code has to be updated. Doing it as a Go unit test had been tried and turned into a major effort. To avoid running this in "make test", the test function returns unless KIND_COMMAND is set.
/test pull-kubernetes-kind-dra pull-kubernetes-unit pull-kubernetes-integration
line := scanner.Text()
line = strings.TrimSuffix(line, "\n")
tCtx.Logf("%s: %s", name, line)
output.WriteString(line)
Danger! DATA RACE.
Will fix...
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Promotion of any feature, in this case core DRA to GA, has so far depended on manually testing the upgrade/downgrade path. It's hard to document what exactly was tested in a way that others can verify and/or replicate the procedure.
This PR adds helper packages for upgrading/downgrading a kind cluster and running E2E tests against it, and uses them to test some DRA scenarios. It runs as part of the normal e2e.test invocation in pull/ci-kubernetes-kind-dra.
Which issue(s) this PR is related to:
#128965
KEP: kubernetes/enhancements#4381
Special notes for your reviewer:
It is debatable whether this should be an E2E test at all. Technically this could also be an integration test. It's currently done as an E2E test mostly for pragmatic reasons:
- It can reuse the existing helper code.
- It can run in the normal DRA E2E jobs, with timeouts defined there.
- Integration tests are limited to 10 minutes per directory and pull-kubernetes-integration is already a big and slow job. A different approach for per-SIG or per-feature integration testing would be needed.
The new helper code for managing a kind cluster is written so that it could be used both in an integration test and an E2E test. #122481 could make that a bit easier in an E2E test, but is not absolutely required.
Does this PR introduce a user-facing change?