-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: mark polled VM preemptions as non reportable #139075
Conversation
eff81d5
to
d8c1c51
Compare
On the fence on the second commit. One one hand there have been two cases where teams fix their tests, but in general it appears most engineers just close out the issue. Arguable how much value is even gained from fixing said tests anyway (saving money from not having the vms hang?). |
pkg/cmd/roachtest/test_test.go
Outdated
// Test that if VM preemption polling finds a preempted VM but the post test failure | ||
// check doesn't, the test is still marked as a flake. | ||
t.Run("post test check doesn't catch preemption", func(t *testing.T) { | ||
setPollPreemptionInterval(50 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would flake under --stress
, right?
} | ||
mockTest.Timeout = 50 * time.Millisecond | ||
|
||
err := runner.Run(ctx, []registry.TestSpec{mockTest}, 1, /* count */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above... timeout may never occur under --stress
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lowered the timeout to 10 ms, we have another unit test that uses that time and seems to be doing fine, so hopefully it should be for this as well. Stressed 3000 times with no flakes.
d8c1c51
to
ade241d
Compare
Previously, polled VM preemptions would simply cancel the test, as post test processing would recheck for preemptions again. However, we've seen some cases in AWS where the post test check returns no preemptions despite the polling returning preemptions. This may be just be the AWS check being eventually consistent, so we want to avoid posting if either check finds preemptions.
bce3ef7
to
9885bd9
Compare
Yup,
So this is a possible data race we could see in actual roachtests if it times out. From what I can tell, the only impact is that we might lose the failure being logged to For these unit tests, I just made sure the above race doesn't happen since it doesn't affect what it's testing. Not sure if we want to add a real fix for the data race though. |
Not entirely following... I thought |
Sure, here's a simple repro of what I'm seeing:
Running
So the cockroach/pkg/cmd/roachtest/test_impl.go Line 349 in 1de6c82
isn't because we replace the logger after the test is done: cockroach/pkg/cmd/roachtest/test_runner.go Lines 1090 to 1098 in 3c64af2
Note the comment: So my understanding is that the following ordering can happen:
The original test was hitting the same race w/ the same stack trace, it was just the polling preemption adding the failure instead of the test itself. |
Yep. That's a perfectly plausible scenario. We should fix it; e.g., changing the type of |
return nil, nil | ||
} | ||
}() | ||
// Make sure the preemption polling is called and the test context is cancelled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good. This removes the clock, and the corresponding "flakiness", from the equation.
This change resets failures in the case of a vm preemption, in case a timeout occurred which normally takes precedence over all other failures. While a timeout suggests that something should be fixed with the test (usually respecting the test context cancellation), we see that in practice, engineers tend to close the issue without investigating as soon as they see the preemption. This also removes the potential duplicate vm_preemption failure that may have been added by the preemption polling.
9885bd9
to
031e055
Compare
SGTM, will put it out as a followup PR. |
TFTR! bors r=srosenberg |
blathers backport 25.1.0-rc |
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #139004: branch-release-25.1.0-rc. Issue #139931: branch-release-25.1.0-rc. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Previously, polled VM preemptions would simply cancel the test, as post test processing would recheck for preemptions again. However, we've seen some cases in AWS where the post test check returns no preemptions despite the polling returning preemptions.
This may be just be the AWS check being eventually consistent, so we want to avoid posting if either check finds preemptions.
The second change resets failures in the case of a vm preemption, in case a timeout occurred which normally takes precedence over all other failures. While a timeout suggests that something should be fixed with the test (usually respecting the test context cancellation), we see that in practice, engineers tend to close the issue without investigating as soon as they see the preemption.
This also removes the potential duplicate vm_preemption failure that may have been added by the preemption polling.
Fixes: #139004
Fixes: #139931
Release note: none
Epic: none