roachtest: mark polled VM preemptions as non-reportable #139075

Merged Feb 6, 2025 (2 commits)

Conversation

@DarrylWong (Contributor) commented Jan 14, 2025

Previously, polled VM preemptions would simply cancel the test, since post-test processing would recheck for preemptions. However, we've seen cases in AWS where the post-test check returns no preemptions despite the polling having found them.

This may just be the AWS check being eventually consistent, so we now avoid posting an issue if either check finds preemptions.

The second change resets a test's failures in the case of a VM preemption, covering the case where a timeout occurred, which normally takes precedence over all other failures. While a timeout suggests that the test itself needs fixing (usually to respect the test context cancellation), we see that in practice, engineers tend to close the issue without investigating as soon as they see the preemption.

This also removes the potential duplicate vm_preemption failure that may have been added by the preemption polling.

Fixes: #139004
Fixes: #139931
Release note: none
Epic: none
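
For illustration, here is a minimal sketch of the combined behavior described above, using stand-in types rather than the real roachtest internals (failure, testState, and finalize are all hypothetical names):

package main

import "fmt"

// failure is a stand-in for roachtest's internal failure record.
type failure struct {
	kind       string
	reportable bool
}

// testState is a hypothetical stand-in for the test's failure-tracking state.
type testState struct {
	polledPreemption   bool // set by the VM preemption poller
	postTestPreemption bool // set by the post-test preemption check
	failures           []failure
}

// finalize sketches the described logic: if either check saw a preemption,
// discard earlier failures (a timeout would otherwise take precedence) and
// record a single non-reportable vm_preemption failure.
func (t *testState) finalize() {
	if t.polledPreemption || t.postTestPreemption {
		t.failures = t.failures[:0]
		t.failures = append(t.failures, failure{kind: "vm_preemption", reportable: false})
	}
}

func main() {
	t := &testState{polledPreemption: true}
	t.failures = append(t.failures, failure{kind: "timeout", reportable: true})
	t.finalize()
	fmt.Printf("%+v\n", t.failures) // [{kind:vm_preemption reportable:false}]
}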

@cockroach-teamcity (Member) commented:

This change is Reviewable

@DarrylWong force-pushed the preemption-polling-fix branch 2 times, most recently from eff81d5 to d8c1c51 on January 14, 2025 23:00
@DarrylWong (Contributor, Author) commented Jan 28, 2025

I'm on the fence about the second commit. On the one hand, there have been two cases where teams fixed their tests, but in general it appears most engineers just close out the issue. It's arguable how much value is even gained from fixing said tests anyway (saving money from not leaving the VMs hanging?).

@DarrylWong marked this pull request as ready for review January 28, 2025 16:02
@DarrylWong requested a review from a team as a code owner January 28, 2025 16:02
@DarrylWong requested review from herkolategan and srosenberg and removed the request for a team January 28, 2025 16:02
// Test that if VM preemption polling finds a preempted VM but the post test failure
// check doesn't, the test is still marked as a flake.
t.Run("post test check doesn't catch preemption", func(t *testing.T) {
setPollPreemptionInterval(50 * time.Millisecond)
Member commented:

This would flake under --stress, right?

}
mockTest.Timeout = 50 * time.Millisecond

err := runner.Run(ctx, []registry.TestSpec{mockTest}, 1, /* count */
Member commented:

Same as above... timeout may never occur under --stress.

Contributor (Author) replied:

Lowered the timeout to 10 ms; we have another unit test that uses that duration and seems to be doing fine, so hopefully it will be fine here as well. Stressed 3000 times with no flakes.

@DarrylWong force-pushed the preemption-polling-fix branch from d8c1c51 to ade241d on January 29, 2025 22:42
Previously, polled VM preemptions would simply cancel the
test, since post-test processing would recheck for
preemptions. However, we've seen cases in AWS where the
post-test check returns no preemptions despite the polling
returning preemptions.

This may just be the AWS check being eventually consistent,
so we want to avoid posting if either check finds
preemptions.
@DarrylWong force-pushed the preemption-polling-fix branch 2 times, most recently from bce3ef7 to 9885bd9 on January 30, 2025 15:40
@DarrylWong (Contributor, Author) commented:

Yup, stress didn't fail but race did. Interestingly, the failure being added by the preemption polling races with the post-test assertion check replacing the logger. This isn't specific to preemption polling; I can repro it by just running:

mockTest.Run = func(ctx context.Context, t test.Test, c cluster.Cluster) {
	t.Error("test failed")
}
mockTest.Timeout = 1 * time.Nanosecond

So this is a possible data race we could see in actual roachtests if a test times out. From what I can tell, the only impact is that we might lose the failure being logged to test.log, which is not a huge deal since the test timed out, and the timeout is most likely the real failure anyway.

For these unit tests, I just made sure the above race doesn't happen, since it doesn't affect what they're testing. Not sure if we want to add a real fix for the data race though.

@srosenberg (Member) commented:

Interestingly, the failure being added by the preemption polling races with the post test assertion check replacing the logger.

Not entirely following... I thought addFailure was thread-safe. Can you post the report from the data race, i.e., the racing goroutine stacks?

@DarrylWong (Contributor, Author) commented:

Sure, here's a simple repro of what I'm seeing:

func TestRunnerTestTimeoutRace(t *testing.T) {
	ctx := context.Background()
	stopper := stop.NewStopper()
	defer stopper.Stop(ctx)
	cr := newClusterRegistry()
	runner := newUnitTestRunner(cr, stopper)

	var buf syncedBuffer
	copt := defaultClusterOpt()
	lopt := defaultLoggingOpt(&buf)
	test := registry.TestSpec{
		Name:             `timeout`,
		Owner:            OwnerUnitTest,
		Timeout:          1 * time.Millisecond,
		Cluster:          spec.MakeClusterSpec(0),
		CompatibleClouds: registry.AllExceptAWS,
		Suites:           registry.Suites(registry.Nightly),
		CockroachBinary:  registry.StandardCockroach,
		Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
			t.Error("test failed")
		},
	}
	_ = runner.Run(ctx, []registry.TestSpec{test}, 1, /* count */
		defaultParallelism, copt, testOpts{}, lopt)
}

Running ./dev test pkg/cmd/roachtest -f TestRunnerTestTimeoutRace --stress --race hits:

==================
WARNING: DATA RACE
Read at 0x00c004b9e888 by goroutine 40:
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testImpl).L()
      pkg/cmd/roachtest/test_impl.go:262 +0x520
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testImpl).addFailure()
      pkg/cmd/roachtest/test_impl.go:470 +0x530
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testImpl).addFailureAndCancel()
      pkg/cmd/roachtest/test_impl.go:441 +0x5c
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testImpl).Error()
      pkg/cmd/roachtest/test_impl.go:432 +0x4c
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.TestRunnerTestTimeoutRace.func1()
      pkg/cmd/roachtest/test_test.go:381 +0x88
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runTest.func2()
      pkg/cmd/roachtest/test_runner.go:1315 +0x168

Previous write at 0x00c004b9e888 by goroutine 39:
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testImpl).ReplaceL()
      pkg/cmd/roachtest/test_impl.go:268 +0x1d0
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runTest.func4()
      pkg/cmd/roachtest/test_runner.go:1380 +0x1cc
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runTest()
      pkg/cmd/roachtest/test_runner.go:1402 +0x14d8
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runWorker()
      pkg/cmd/roachtest/test_runner.go:937 +0x2b6c
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).Run.func1()
      pkg/cmd/roachtest/test_runner.go:389 +0x3b4
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      pkg/util/stop/stopper.go:500 +0x214
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.gowrap1()
      pkg/util/stop/stopper.go:501 +0x54

Goroutine 40 (running) created at:
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runTest()
      pkg/cmd/roachtest/test_runner.go:1296 +0xba8
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runWorker()
      pkg/cmd/roachtest/test_runner.go:937 +0x2b6c
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).Run.func1()
      pkg/cmd/roachtest/test_runner.go:389 +0x3b4
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      pkg/util/stop/stopper.go:500 +0x214
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.gowrap1()
      pkg/util/stop/stopper.go:501 +0x54

Goroutine 39 (running) created at:
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx()
      pkg/util/stop/stopper.go:489 +0x52c
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask()
      pkg/util/stop/stopper.go:360 +0xfd4
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).Run()
      pkg/cmd/roachtest/test_runner.go:378 +0xcc4
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.TestRunnerTestTimeoutRace()
      pkg/cmd/roachtest/test_test.go:384 +0x5e4
  testing.tRunner()
      GOROOT/src/testing/testing.go:1689 +0x180
  testing.(*T).Run.gowrap1()
      GOROOT/src/testing/testing.go:1742 +0x40
==================

So the failures struct in addFailure is thread-safe, but the logger it uses here:

t.L().Printf("test failure #%d: full stack retained in %s.log: %s", failureNum, failureLog, msg)

isn't, because we replace the logger after the test is done:

if !t.Failed() {
	// Awkward file name to keep it close to test.log.
	l.Printf("running post test assertions (test-post-assertions.log)")
	replaceLogger("test-post-assertions")
	// We still want to run the post-test assertions even if the test timed out as it
	// might provide useful information about the health of the nodes. Any assertion failures
	// will be recorded against, and eventually fail, the test.
	if err := r.postTestAssertions(ctx, t, c, 10*time.Minute); err != nil {
Note the comment ("We still want to run the post-test assertions even if the test timed out"), along with the fact that on test timeouts we purposely don't cancel the test context and instead let the test keep running.

So my understanding is that the following ordering can happen:

  1. The test times out and the test runner stops blocking.
  2. The test runner starts the post-test assertion checks, replacing the logger.
  3. The test itself runs t.Error("test failed"), which eventually calls t.L().Printf. This races with the above.

The original test was hitting the same race with the same stack trace; it was just the preemption polling adding the failure instead of the test itself.
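
For anyone who wants a minimal standalone illustration of this class of race (this is not roachtest code; logger and testImpl here are toy stand-ins), the following program typically trips the race detector under go run -race:

package main

import "fmt"

type logger struct{ name string }

// testImpl mimics the racy shape: an unguarded logger field that one
// goroutine reads via L() while another replaces it via ReplaceL().
type testImpl struct{ l *logger }

func (t *testImpl) L() *logger         { return t.l } // racy read
func (t *testImpl) ReplaceL(l *logger) { t.l = l }    // racy write

func main() {
	t := &testImpl{l: &logger{name: "test"}}
	done := make(chan struct{})
	go func() {
		fmt.Println(t.L().name) // analogous to t.L().Printf in addFailure
		close(done)
	}()
	t.ReplaceL(&logger{name: "test-post-assertions"}) // analogous to replaceLogger
	<-done
}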

@srosenberg (Member) commented:

The test times out and the test runner stops blocking.
The test runner starts the post-test assertion checks, replacing the logger.
The test itself runs t.Error("test failed") which eventually calls t.L().Printf. This races with the above.

Yep. That's a perfectly plausible scenario. We should fix it; e.g., changing the type of testImpl.L to atomic.Pointer.
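
A minimal sketch of that suggestion, applied to the toy stand-ins from the repro discussion above (the real testImpl is more involved); storing the logger behind an atomic.Pointer makes the concurrent read and replacement race-free:

package main

import (
	"fmt"
	"sync/atomic"
)

type logger struct{ name string }

// testImpl now stores the logger behind an atomic.Pointer instead of a
// bare field, so L() and ReplaceL() no longer race.
type testImpl struct{ l atomic.Pointer[logger] }

func (t *testImpl) L() *logger         { return t.l.Load() }
func (t *testImpl) ReplaceL(l *logger) { t.l.Store(l) }

func main() {
	t := &testImpl{}
	t.ReplaceL(&logger{name: "test"})
	done := make(chan struct{})
	go func() {
		fmt.Println(t.L().name) // safe concurrent read
		close(done)
	}()
	t.ReplaceL(&logger{name: "test-post-assertions"}) // safe concurrent write
	<-done
}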

return nil, nil
}
}()
// Make sure the preemption polling is called and the test context is cancelled
Member commented:

Good. This removes the clock, and the corresponding "flakiness", from the equation.

This change resets a test's failures in the case of a VM
preemption, covering the case where a timeout occurred,
which normally takes precedence over all other failures.
While a timeout suggests that the test itself needs fixing
(usually to respect the test context cancellation), we see
that in practice, engineers tend to close the issue without
investigating as soon as they see the preemption.

This also removes the potential duplicate vm_preemption
failure that may have been added by the preemption polling.
@DarrylWong force-pushed the preemption-polling-fix branch from 9885bd9 to 031e055 on February 6, 2025 15:44
@DarrylWong (Contributor, Author) commented:

We should fix it; e.g., changing the type of testImpl.L to atomic.Pointer

SGTM, will put it out as a follow-up PR.

@DarrylWong (Contributor, Author) commented:

TFTR!

bors r=srosenberg

@srosenberg (Member) commented:

blathers backport 25.1.0-rc

blathers-crl (bot) commented Feb 18, 2025

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #139004: branch-release-25.1.0-rc.
Issue #139931: branch-release-25.1.0-rc.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Successfully merging this pull request may close these issues:

roachtest: tpccbench/nodes=3/cpu=16 failed
roachtest: kv0/enc=false/nodes=1/mt-shared-process failed