-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-24.3: roachtest/mixedversion: upgrade plan should be bounded #140674
release-24.3: roachtest/mixedversion: upgrade plan should be bounded #140674
Conversation
Thanks for opening a backport. Please check the backport criteria before merging:
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
Also, please add a brief release justification to the body of your PR to justify this |
Interesting... wasn't expecting,
|
Had a suspicion it was something to do with setting the build version when you mentioned it only fails with the entire package is run since that has caused me grief in the past. Sure enough,
Seems to make it work as suspected. Seems like we should be calling |
Previously, the mixedversion framework did not bound the total number of steps in a test plan. Since steps are generated according to different pseudo-random distributions, the total number of resulting steps can vary significantly. E.g., for `tpcc/mixed-headroom/n5cpu16`, the smallest test plan has 14 steps, whereas the largest, based on a sampling of 1_000_000 valid test plans, has 135 steps! High variability in the size of the test plan is directly proportional to the running time. Thus, a very large test plan can cause a test to time out, due to exceeding its max running time. That is the case for `tpcc/mixed-headroom/n5cpu16`. This PR adds an option, namely `MaxNumPlanSteps`, which enforces an upper bound. If a generated test plan exceeds the specified value, a new one is generated until the resulting length is within the specification. This PR also adds a primitive `dry-run` mode, which can be useful for debugging test plans. If `MVT_DRY_RUN_MODE` env. var. is set, print the mixedversion test plan and exit. Resolves: cockroachdb#138014 Informs: cockroachdb#137332 Epic: none Release note: None
cbdbabe
to
02eebdc
Compare
Nice catch! I went with your approach to reduce changes to the backport. It turns out the 15-step plan does exist but it requires more retries to find it, i.e.,
Now we get,
that's similar (not identical) to the plan when restoring
The reason it wasn't finding it (with 100 retries) was because the initial upgrade path sequence is longer,
vs
|
Right. I'm not sure why this hasn't flaked on master or 25.1. It must be the case that |
I see the same thing happening on master. Running the entire pkg, the build version is 24.3, running the test standalone the build version is 24.2. I guess whatever rng changed on 25.1+ just happened to be lucky and make this test work. We should probably make the same fix on master/25.1 to avoid any future confusing bugs. |
Right. Initially, I thought it was only the length of the upgrade sequence which would determine the next plan iteration, but there is more in
Yep, I'll send another PR. |
TFTR! |
We saw that `Test_maxNumPlanSteps` suddenly failed in an otherwise unrelated backport [1]. The reason turned out to be non-determinisim. That is, a different unit test, namely `TestTestPlanner` did not restore the default version (`clusterupgrade.TestBuildVersion`). This change forwardports the missing restore to ensure future test executions follow the same PRNG sequence. [1] cockroachdb#140674 Epic: none Release note: None
140753: roachtest/mixedversion: TestTestPlanner should restore default version r=herkolategan,darrylwon a=srosenberg We saw that `Test_maxNumPlanSteps` suddenly failed in an otherwise unrelated backport [1]. The reason turned out to be non-determinisim. That is, a different unit test, namely `TestTestPlanner` did not restore the default version (`clusterupgrade.TestBuildVersion`). This change forwardports the missing restore to ensure future test executions follow the same PRNG sequence. [1] #140674 Epic: none Release note: None Co-authored-by: Stan Rosenberg <[email protected]>
140753: roachtest/mixedversion: TestTestPlanner should restore default version r=herkolategan,darrylwon a=srosenberg We saw that `Test_maxNumPlanSteps` suddenly failed in an otherwise unrelated backport [1]. The reason turned out to be non-determinisim. That is, a different unit test, namely `TestTestPlanner` did not restore the default version (`clusterupgrade.TestBuildVersion`). This change forwardports the missing restore to ensure future test executions follow the same PRNG sequence. [1] #140674 Epic: none Release note: None 140785: rac2,kvserver: start the StreamCloseScheduler r=sumeerbhola a=pav-kv This commit enables the `StreamCloseScheduler`, which is responsible for closing RACv2 streams some time (400ms) after they enter `StateProbe`. The initialization had to move from `NewStore` to `Store.Start()` because it needs the `stopper` to start the job. Epic: none Release note: none 140791: sem/tree: avoid assertion error on unimplemented builtins in views r=yuzefovich a=yuzefovich Previously, we would hit an assertion error when trying to use an unimplemented builtin in the CREATE VIEW statement and this is now fixed. The issue is that we resolve unimplemented builtins as a definition with zero overloads, and all places that previously assumed at least one existing overload have been audited. The resolved function definition has been updated to have "unsupported with issue" integer indicating why there are no overloads. Additionally, `UnsupportedWithIssue` property is now changed to be `uint` since we didn't use "negative value as having no corresponding issue" ability. I decided to not include a release note since this seems like an edge case and we've only seen this a handful of times in sentry. Fixes: #128535. Release note: None 141009: sqlstats: use `BatchProcessLatencyBuckets` for flush latency r=xinhaoz a=dhartunian The max of 10s on the IO latency buckets is too short to measure flush latency effectively. Previously, this metric was measuring per-statement flush latency, but this was altered in #122919. Release note: None Co-authored-by: Stan Rosenberg <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: David Hartunian <[email protected]>
140753: roachtest/mixedversion: TestTestPlanner should restore default version r=herkolategan,darrylwon a=srosenberg We saw that `Test_maxNumPlanSteps` suddenly failed in an otherwise unrelated backport [1]. The reason turned out to be non-determinisim. That is, a different unit test, namely `TestTestPlanner` did not restore the default version (`clusterupgrade.TestBuildVersion`). This change forwardports the missing restore to ensure future test executions follow the same PRNG sequence. [1] #140674 Epic: none Release note: None 140785: rac2,kvserver: start the StreamCloseScheduler r=sumeerbhola a=pav-kv This commit enables the `StreamCloseScheduler`, which is responsible for closing RACv2 streams some time (400ms) after they enter `StateProbe`. The initialization had to move from `NewStore` to `Store.Start()` because it needs the `stopper` to start the job. Epic: none Release note: none Co-authored-by: Stan Rosenberg <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>
We saw that `Test_maxNumPlanSteps` suddenly failed in an otherwise unrelated backport [1]. The reason turned out to be non-determinisim. That is, a different unit test, namely `TestTestPlanner` did not restore the default version (`clusterupgrade.TestBuildVersion`). This change forwardports the missing restore to ensure future test executions follow the same PRNG sequence. [1] #140674 Epic: none Release note: None
Backport 1/1 commits from #137963.
/cc @cockroachdb/release
Previously, the mixedversion framework did not bound
the total number of steps in a test plan. Since steps
are generated according to different pseudo-random
distributions, the total number of resulting steps
can vary significantly.
E.g., for
tpcc/mixed-headroom/n5cpu16
, the smallesttest plan has 14 steps, whereas the largest, based
on a sampling of 1_000_000 valid test plans, has
135 steps!
High variability in the size of the test plan
is directly proportional to the running time.
Thus, a very large test plan can cause a test
to time out, due to exceeding its max running time.
That is the case for
tpcc/mixed-headroom/n5cpu16
.This PR adds an option, namely
MaxNumPlanSteps
,which enforces an upper bound. If a generated
test plan exceeds the specified value, a new
one is generated until the resulting length
is within the specification.
This PR also adds a primitive
dry-run
mode,which can be useful for debugging test plans.
If
MVT_DRY_RUN_MODE
env. var. is set, printthe mixedversion test plan and exit.
Resolves: #138014
Informs: #137332
Epic: none
Release note: None
Release Justification: test-only change
Resolves: #140059