Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-24.3: roachtest/mixedversion: upgrade plan should be bounded #140674

Merged

Conversation

srosenberg
Copy link
Member

@srosenberg srosenberg commented Feb 7, 2025

Backport 1/1 commits from #137963.

/cc @cockroachdb/release


Previously, the mixedversion framework did not bound
the total number of steps in a test plan. Since steps
are generated according to different pseudo-random
distributions, the total number of resulting steps
can vary significantly.
E.g., for tpcc/mixed-headroom/n5cpu16, the smallest
test plan has 14 steps, whereas the largest, based
on a sampling of 1_000_000 valid test plans, has
135 steps!

High variability in the size of the test plan
is directly proportional to the running time.
Thus, a very large test plan can cause a test
to time out, due to exceeding its max running time.
That is the case for tpcc/mixed-headroom/n5cpu16.

This PR adds an option, namely MaxNumPlanSteps,
which enforces an upper bound. If a generated
test plan exceeds the specified value, a new
one is generated until the resulting length
is within the specification.

This PR also adds a primitive dry-run mode,
which can be useful for debugging test plans.
If MVT_DRY_RUN_MODE env. var. is set, print
the mixedversion test plan and exit.

Resolves: #138014
Informs: #137332
Epic: none
Release note: None
Release Justification: test-only change
Resolves: #140059

@srosenberg srosenberg requested a review from a team as a code owner February 7, 2025 14:45
@srosenberg srosenberg requested review from herkolategan and DarrylWong and removed request for a team February 7, 2025 14:45
Copy link

blathers-crl bot commented Feb 7, 2025

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious
    issues
    or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined
    here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning areas TL. For more information as to how that review should be conducted, please consult the backport
    policy
    .
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack
    channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Feb 7, 2025
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@srosenberg
Copy link
Member Author

Interesting... wasn't expecting,

=== RUN   Test_maxNumPlanSteps    planner_test.go:313:         	Error Trace:	pkg/cmd/roachtest/roachtestutil/mixedversion/planner_test.go:313        	Error:      	Received unexpected error:        	            	error creating test plan: unable to generate a test plan with at most 15 steps [owner=test-eng]        	Test:       	Test_maxNumPlanSteps--- FAIL: Test_maxNumPlanSteps (0.41s)

@DarrylWong
Copy link
Contributor

Had a suspicion it was something to do with setting the build version when you mentioned it only fails with the entire package is run since that has caused me grief in the past. Sure enough, TestTestPlanner sets the build version to v24.3 and adding:

diff --git a/pkg/cmd/roachtest/roachtestutil/mixedversion/planner_test.go b/pkg/cmd/roachtest/roachtestutil/mixedversion/planner_test.go
index 2f418c179bd..110bff0a584 100644
--- a/pkg/cmd/roachtest/roachtestutil/mixedversion/planner_test.go
+++ b/pkg/cmd/roachtest/roachtestutil/mixedversion/planner_test.go
@@ -52,6 +52,7 @@ var (
this seed
 
 func TestTestPlanner(t *testing.T) {
+       defer setDefaultVersions()
 to the test.
        mutatorsAvailable := append([]mutator{
                concurrentUserHooksMutator{},

Seems to make it work as suspected. Seems like we should be calling setDefaultVersions before every test though, surprised this hasn't caused issues before.

Previously, the mixedversion framework did not bound
the total number of steps in a test plan. Since steps
are generated according to different pseudo-random
distributions, the total number of resulting steps
can vary significantly.
E.g., for `tpcc/mixed-headroom/n5cpu16`, the smallest
test plan has 14 steps, whereas the largest, based
on a sampling of 1_000_000 valid test plans, has
135 steps!

High variability in the size of the test plan
is directly proportional to the running time.
Thus, a very large test plan can cause a test
to time out, due to exceeding its max running time.
That is the case for `tpcc/mixed-headroom/n5cpu16`.

This PR adds an option, namely `MaxNumPlanSteps`,
which enforces an upper bound. If a generated
test plan exceeds the specified value, a new
one is generated until the resulting length
is within the specification.

This PR also adds a primitive `dry-run` mode,
which can be useful for debugging test plans.
If `MVT_DRY_RUN_MODE` env. var. is set, print
the mixedversion test plan and exit.

Resolves: cockroachdb#138014
Informs: cockroachdb#137332
Epic: none
Release note: None
@srosenberg
Copy link
Member Author

srosenberg commented Feb 7, 2025

Sure enough, TestTestPlanner sets the build version to v24.3 and adding:

Nice catch! I went with your approach to reduce changes to the backport. It turns out the 15-step plan does exist but it requires more retries to find it, i.e.,

--- a/pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
+++ b/pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
@@ -801,7 +801,7 @@ func (t *Test) plan() (plan *TestPlan, retErr error) {
        var retries int
        // In case the length of the test plan exceeds `opts.maxNumPlanSteps`, retry up to 100 times.
        // N.B. Statistically, the expected number of retries is miniscule; see #138014 for more info.
-       for ; retries < 100; retries++ {
+       for ; retries < 1000; retries++ {

Now we get,

planner_test.go:317: Seed:               12345
        Upgrades:           v24.2.2 → <current>
        Deployment mode:    system-only
        Plan:
        ├── start cluster at version "v24.2.2" (1)
        ├── wait for all nodes (:1-4) to acknowledge cluster version '24.2' on system tenant (2)
        └── upgrade cluster from "v24.2.2" to "<current>"
           ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (3)
           ├── upgrade nodes :1-4 from "v24.2.2" to "<current>"
           │   ├── restart node 1 with binary version <current> (4)
           │   ├── run "mixed-version 2" (5)
           │   ├── restart node 2 with binary version <current> (6)
           │   ├── restart node 3 with binary version <current> (7)
           │   ├── restart node 4 with binary version <current> (8)
           │   └── run mixed-version hooks concurrently
           │      ├── run "on startup 1", after 0s delay (9)
           │      └── run "mixed-version 1", after 5s delay (10)
           ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (11)
           ├── run mixed-version hooks concurrently
           │   ├── run "on startup 1", after 0s delay (12)
           │   └── run "mixed-version 1", after 30s delay (13)
           ├── wait for all nodes (:1-4) to acknowledge cluster version <current> on system tenant (14)
           └── run "after finalization" (15)

that's similar (not identical) to the plan when restoring setDefaultVersions,

planner_test.go:317: Seed:               12345
        Upgrades:           v24.1.1 → <current>
        Deployment mode:    system-only
        Mutators:           preserve_downgrade_option_randomizer
        Plan:
        ├── install fixtures for version "v24.1.1" (1)
        ├── start cluster at version "v24.1.1" (2)
        ├── wait for all nodes (:1-4) to acknowledge cluster version '24.1' on system tenant (3)
        └── upgrade cluster from "v24.1.1" to "<current>"
           ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (4)
           ├── upgrade nodes :1-4 from "v24.1.1" to "<current>"
           │   ├── restart node 3 with binary version <current> (5)
           │   ├── run "mixed-version 1" (6)
           │   ├── restart node 2 with binary version <current> (7)
           │   ├── run "on startup 1" (8)
           │   ├── restart node 1 with binary version <current> (9)
           │   ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (10)
           │   ├── run "mixed-version 2" (11)
           │   └── restart node 4 with binary version <current> (12)
           ├── run "mixed-version 2" (13)
           ├── wait for all nodes (:1-4) to acknowledge cluster version <current> on system tenant (14)
           └── run "after finalization" (15)

The reason it wasn't finding it (with 100 retries) was because the initial upgrade path sequence is longer,

upgradePath=[v23.1.17 v23.2.4 v24.1.1 v24.2.2 <current>]

vs

upgradePath=[v23.1.17 v23.2.4 v24.1.1 <current>]

@srosenberg
Copy link
Member Author

srosenberg commented Feb 7, 2025

Seems like we should be calling setDefaultVersions before every test though, surprised this hasn't caused issues before.

Right. I'm not sure why this hasn't flaked on master or 25.1. It must be the case that TestTestPlanner isn't run before Test_maxNumPlanSteps?! 🤔

@DarrylWong
Copy link
Contributor

It must be the case that TestTestPlanner isn't run before Test_maxNumPlanSteps?! 🤔

I see the same thing happening on master. Running the entire pkg, the build version is 24.3, running the test standalone the build version is 24.2. I guess whatever rng changed on 25.1+ just happened to be lucky and make this test work. We should probably make the same fix on master/25.1 to avoid any future confusing bugs.

@srosenberg
Copy link
Member Author

I see the same thing happening on master. Running the entire pkg, the build version is 24.3, running the test standalone the build version is 24.2. I guess whatever rng changed on 25.1+ just happened to be lucky and make this test work. We should probably make the same fix on master/25.1 to avoid any future confusing bugs.

Right. Initially, I thought it was only the length of the upgrade sequence which would determine the next plan iteration, but there is more in choosePreviousReleases. E.g., they start out with the same length but diverge after a few iterations, with the one on this branch being an "unlucky" one :).

We should probably make the same fix on master/25.1 to avoid any future confusing bugs.

Yep, I'll send another PR.

@srosenberg
Copy link
Member Author

TFTR!

@srosenberg srosenberg merged commit 303bfcd into cockroachdb:release-24.3 Feb 7, 2025
20 of 21 checks passed
srosenberg added a commit to srosenberg/cockroach that referenced this pull request Feb 7, 2025
We saw that `Test_maxNumPlanSteps` suddenly failed in an
otherwise unrelated backport [1]. The reason turned
out to be non-determinisim. That is, a different
unit test, namely `TestTestPlanner` did not
restore the default version (`clusterupgrade.TestBuildVersion`).

This change forwardports the missing restore to
ensure future test executions follow the same
PRNG sequence.

[1] cockroachdb#140674

Epic: none

Release note: None
craig bot pushed a commit that referenced this pull request Feb 10, 2025
140753: roachtest/mixedversion: TestTestPlanner should restore default version r=herkolategan,darrylwon a=srosenberg

We saw that `Test_maxNumPlanSteps` suddenly failed in an otherwise unrelated backport [1]. The reason turned out to be non-determinisim. That is, a different
unit test, namely `TestTestPlanner` did not
restore the default version (`clusterupgrade.TestBuildVersion`).

This change forwardports the missing restore to
ensure future test executions follow the same
PRNG sequence.

[1] #140674

Epic: none

Release note: None

Co-authored-by: Stan Rosenberg <[email protected]>
craig bot pushed a commit that referenced this pull request Feb 10, 2025
140753: roachtest/mixedversion: TestTestPlanner should restore default version r=herkolategan,darrylwon a=srosenberg

We saw that `Test_maxNumPlanSteps` suddenly failed in an otherwise unrelated backport [1]. The reason turned out to be non-determinisim. That is, a different
unit test, namely `TestTestPlanner` did not
restore the default version (`clusterupgrade.TestBuildVersion`).

This change forwardports the missing restore to
ensure future test executions follow the same
PRNG sequence.

[1] #140674

Epic: none

Release note: None

140785: rac2,kvserver: start the StreamCloseScheduler r=sumeerbhola a=pav-kv

This commit enables the `StreamCloseScheduler`, which is responsible for closing RACv2 streams some time (400ms) after they enter `StateProbe`.

The initialization had to move from `NewStore` to `Store.Start()` because it needs the `stopper` to start the job.

Epic: none
Release note: none

140791: sem/tree: avoid assertion error on unimplemented builtins in views r=yuzefovich a=yuzefovich

Previously, we would hit an assertion error when trying to use an unimplemented builtin in the CREATE VIEW statement and this is now fixed. The issue is that we resolve unimplemented builtins as a definition with zero overloads, and all places that previously assumed at least one existing overload have been audited. The resolved function definition has been updated to have "unsupported with issue" integer indicating why there are no overloads.

Additionally, `UnsupportedWithIssue` property is now changed to be `uint` since we didn't use "negative value as having no corresponding issue" ability.

I decided to not include a release note since this seems like an edge case and we've only seen this a handful of times in sentry.

Fixes: #128535.

Release note: None

141009: sqlstats: use `BatchProcessLatencyBuckets` for flush latency r=xinhaoz a=dhartunian

The max of 10s on the IO latency buckets is too short to measure flush latency effectively. Previously, this metric was measuring per-statement flush latency, but this was altered in #122919.

Release note: None

Co-authored-by: Stan Rosenberg <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
craig bot pushed a commit that referenced this pull request Feb 10, 2025
140753: roachtest/mixedversion: TestTestPlanner should restore default version r=herkolategan,darrylwon a=srosenberg

We saw that `Test_maxNumPlanSteps` suddenly failed in an otherwise unrelated backport [1]. The reason turned out to be non-determinisim. That is, a different
unit test, namely `TestTestPlanner` did not
restore the default version (`clusterupgrade.TestBuildVersion`).

This change forwardports the missing restore to
ensure future test executions follow the same
PRNG sequence.

[1] #140674

Epic: none

Release note: None

140785: rac2,kvserver: start the StreamCloseScheduler r=sumeerbhola a=pav-kv

This commit enables the `StreamCloseScheduler`, which is responsible for closing RACv2 streams some time (400ms) after they enter `StateProbe`.

The initialization had to move from `NewStore` to `Store.Start()` because it needs the `stopper` to start the job.

Epic: none
Release note: none

Co-authored-by: Stan Rosenberg <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Feb 10, 2025
We saw that `Test_maxNumPlanSteps` suddenly failed in an
otherwise unrelated backport [1]. The reason turned
out to be non-determinisim. That is, a different
unit test, namely `TestTestPlanner` did not
restore the default version (`clusterupgrade.TestBuildVersion`).

This change forwardports the missing restore to
ensure future test executions follow the same
PRNG sequence.

[1] #140674

Epic: none

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants