release-24.3: roachtest/mixedversion: upgrade plan should be bounded #140674

srosenberg · 2025-02-07T14:45:28Z

Backport 1/1 commits from #137963.

/cc @cockroachdb/release

Previously, the mixedversion framework did not bound
the total number of steps in a test plan. Since steps
are generated according to different pseudo-random
distributions, the total number of resulting steps
can vary significantly.
E.g., for tpcc/mixed-headroom/n5cpu16, the smallest
test plan has 14 steps, whereas the largest, based
on a sampling of 1_000_000 valid test plans, has
135 steps!

The running time of a test is directly proportional
to the size of its test plan, so highly variable plan
sizes yield highly variable running times.
Thus, a very large test plan can cause a test
to time out, due to exceeding its max running time.
That is the case for tpcc/mixed-headroom/n5cpu16.

This PR adds an option, namely MaxNumPlanSteps,
which enforces an upper bound. If a generated
test plan exceeds the specified value, a new
one is generated until the resulting length
is within the specification.

This PR also adds a primitive dry-run mode,
which can be useful for debugging test plans.
If the MVT_DRY_RUN_MODE environment variable is set,
the mixedversion test plan is printed and the test exits.

Resolves: #138014
Informs: #137332
Epic: none
Release note: None
Release Justification: test-only change
Resolves: #140059

blathers-crl · 2025-02-07T14:45:32Z

cockroach-teamcity · 2025-02-07T14:45:40Z


srosenberg · 2025-02-07T15:20:14Z

Interesting... wasn't expecting,

=== RUN   Test_maxNumPlanSteps
    planner_test.go:313:
        Error Trace:    pkg/cmd/roachtest/roachtestutil/mixedversion/planner_test.go:313
        Error:          Received unexpected error:
                        error creating test plan: unable to generate a test plan with at most 15 steps [owner=test-eng]
        Test:           Test_maxNumPlanSteps
--- FAIL: Test_maxNumPlanSteps (0.41s)

DarrylWong · 2025-02-07T17:47:44Z

I had a suspicion it was something to do with setting the build version when you mentioned it only fails when the entire package is run, since that has caused me grief in the past. Sure enough, TestTestPlanner sets the build version to v24.3, and adding:

diff --git a/pkg/cmd/roachtest/roachtestutil/mixedversion/planner_test.go b/pkg/cmd/roachtest/roachtestutil/mixedversion/planner_test.go
index 2f418c179bd..110bff0a584 100644
--- a/pkg/cmd/roachtest/roachtestutil/mixedversion/planner_test.go
+++ b/pkg/cmd/roachtest/roachtestutil/mixedversion/planner_test.go
@@ -52,6 +52,7 @@ var (
this seed
 
 func TestTestPlanner(t *testing.T) {
+       defer setDefaultVersions()
 to the test.
        mutatorsAvailable := append([]mutator{
                concurrentUserHooksMutator{},

Seems to make it work as suspected. Seems like we should be calling setDefaultVersions before every test though, surprised this hasn't caused issues before.
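
The leak being described is the usual hazard of package-level state shared between tests. A minimal sketch, assuming a single global version variable: `buildVersion`, `defaultBuildVersion`, `leakyTest`, and `wellBehavedTest` are hypothetical, the version strings are placeholders, and the body of `setDefaultVersions` here is a stand-in rather than the real helper.

```go
package main

import "fmt"

// defaultBuildVersion and buildVersion are hypothetical stand-ins for
// the package-level version state read by the planner.
const defaultBuildVersion = "vDefault"

var buildVersion = defaultBuildVersion

// setDefaultVersions is a stand-in for the real helper of the same
// name; here it only resets the one global used in this sketch.
func setDefaultVersions() {
	buildVersion = defaultBuildVersion
}

// leakyTest overrides the global and never restores it, so whatever
// runs next in the same process sees the overridden value.
func leakyTest() {
	buildVersion = "vOverride"
}

// wellBehavedTest overrides the global but restores it on exit.
func wellBehavedTest() {
	defer setDefaultVersions()
	buildVersion = "vOverride"
}

func main() {
	leakyTest()
	fmt.Println("after leaky test:", buildVersion)

	setDefaultVersions()
	wellBehavedTest()
	fmt.Println("after well-behaved test:", buildVersion)
}
```

In a real `_test.go` file, `t.Cleanup(setDefaultVersions)` would be an alternative to `defer` that also runs when the test fails or calls `t.FailNow`.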


srosenberg · 2025-02-07T20:06:33Z

Sure enough, TestTestPlanner sets the build version to v24.3 and adding:

Nice catch! I went with your approach to reduce changes to the backport. It turns out the 15-step plan does exist but it requires more retries to find it, i.e.,

--- a/pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
+++ b/pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
@@ -801,7 +801,7 @@ func (t *Test) plan() (plan *TestPlan, retErr error) {
        var retries int
        // In case the length of the test plan exceeds `opts.maxNumPlanSteps`, retry up to 100 times.
        // N.B. Statistically, the expected number of retries is miniscule; see #138014 for more info.
-       for ; retries < 100; retries++ {
+       for ; retries < 1000; retries++ {

Now we get,

planner_test.go:317: Seed:               12345
        Upgrades:           v24.2.2 → <current>
        Deployment mode:    system-only
        Plan:
        ├── start cluster at version "v24.2.2" (1)
        ├── wait for all nodes (:1-4) to acknowledge cluster version '24.2' on system tenant (2)
        └── upgrade cluster from "v24.2.2" to "<current>"
           ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (3)
           ├── upgrade nodes :1-4 from "v24.2.2" to "<current>"
           │   ├── restart node 1 with binary version <current> (4)
           │   ├── run "mixed-version 2" (5)
           │   ├── restart node 2 with binary version <current> (6)
           │   ├── restart node 3 with binary version <current> (7)
           │   ├── restart node 4 with binary version <current> (8)
           │   └── run mixed-version hooks concurrently
           │      ├── run "on startup 1", after 0s delay (9)
           │      └── run "mixed-version 1", after 5s delay (10)
           ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (11)
           ├── run mixed-version hooks concurrently
           │   ├── run "on startup 1", after 0s delay (12)
           │   └── run "mixed-version 1", after 30s delay (13)
           ├── wait for all nodes (:1-4) to acknowledge cluster version <current> on system tenant (14)
           └── run "after finalization" (15)

that's similar (not identical) to the plan when restoring setDefaultVersions,

planner_test.go:317: Seed:               12345
        Upgrades:           v24.1.1 → <current>
        Deployment mode:    system-only
        Mutators:           preserve_downgrade_option_randomizer
        Plan:
        ├── install fixtures for version "v24.1.1" (1)
        ├── start cluster at version "v24.1.1" (2)
        ├── wait for all nodes (:1-4) to acknowledge cluster version '24.1' on system tenant (3)
        └── upgrade cluster from "v24.1.1" to "<current>"
           ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (4)
           ├── upgrade nodes :1-4 from "v24.1.1" to "<current>"
           │   ├── restart node 3 with binary version <current> (5)
           │   ├── run "mixed-version 1" (6)
           │   ├── restart node 2 with binary version <current> (7)
           │   ├── run "on startup 1" (8)
           │   ├── restart node 1 with binary version <current> (9)
           │   ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (10)
           │   ├── run "mixed-version 2" (11)
           │   └── restart node 4 with binary version <current> (12)
           ├── run "mixed-version 2" (13)
           ├── wait for all nodes (:1-4) to acknowledge cluster version <current> on system tenant (14)
           └── run "after finalization" (15)

The reason it wasn't finding it (with 100 retries) is that the initial upgrade path sequence is longer,

upgradePath=[v23.1.17 v23.2.4 v24.1.1 v24.2.2 <current>]

vs

upgradePath=[v23.1.17 v23.2.4 v24.1.1 <current>]

srosenberg · 2025-02-07T20:14:19Z

Seems like we should be calling setDefaultVersions before every test though, surprised this hasn't caused issues before.

Right. I'm not sure why this hasn't flaked on master or 25.1. It must be the case that TestTestPlanner isn't run before Test_maxNumPlanSteps?! 🤔

DarrylWong · 2025-02-07T20:51:19Z

It must be the case that TestTestPlanner isn't run before Test_maxNumPlanSteps?! 🤔

I see the same thing happening on master. Running the entire pkg, the build version is 24.3, running the test standalone the build version is 24.2. I guess whatever rng changed on 25.1+ just happened to be lucky and make this test work. We should probably make the same fix on master/25.1 to avoid any future confusing bugs.

srosenberg · 2025-02-07T23:24:43Z

I see the same thing happening on master. Running the entire pkg, the build version is 24.3, running the test standalone the build version is 24.2. I guess whatever rng changed on 25.1+ just happened to be lucky and make this test work. We should probably make the same fix on master/25.1 to avoid any future confusing bugs.

Right. Initially, I thought it was only the length of the upgrade sequence which would determine the next plan iteration, but there is more in choosePreviousReleases. E.g., they start out with the same length but diverge after a few iterations, with the one on this branch being an "unlucky" one :).

We should probably make the same fix on master/25.1 to avoid any future confusing bugs.

Yep, I'll send another PR.

srosenberg · 2025-02-07T23:25:43Z

TFTR!

We saw that `Test_maxNumPlanSteps` suddenly failed in an otherwise unrelated backport [1]. The reason turned out to be non-determinism: a different unit test, namely `TestTestPlanner`, did not restore the default version (`clusterupgrade.TestBuildVersion`). This change forward-ports the missing restore to ensure future test executions follow the same PRNG sequence.

[1] cockroachdb#140674

Epic: none
Release note: None


srosenberg requested a review from a team as a code owner February 7, 2025 14:45

srosenberg requested review from herkolategan and DarrylWong and removed request for a team February 7, 2025 14:45

blathers-crl bot added the backport label (PRs that are backports to older release branches) Feb 7, 2025

DarrylWong approved these changes Feb 7, 2025


srosenberg force-pushed the backport24.3-137963 branch from cbdbabe to 02eebdc Compare February 7, 2025 20:01

srosenberg merged commit 303bfcd into cockroachdb:release-24.3 Feb 7, 2025
20 of 21 checks passed

srosenberg mentioned this pull request Feb 7, 2025

roachtest/mixedversion: TestTestPlanner should restore default version #140753

Merged

blathers-crl bot mentioned this pull request Feb 10, 2025

release-25.1: roachtest/mixedversion: TestTestPlanner should restore default version #141025

Merged
