roachtest: mark polled VM preemptions as non-reportable #139075

Merged Feb 6, 2025 (2 commits)

Conversation

@DarrylWong (Contributor) commented Jan 14, 2025

Previously, polled VM preemptions would simply cancel the test, since post-test processing would recheck for preemptions. However, we've seen cases in AWS where the post-test check returns no preemptions despite the polling having found them.

This may just be the AWS check being eventually consistent, so we now avoid posting an issue if either check finds preemptions.

The second change resets a test's failures in the case of a VM preemption, covering the case where a timeout occurred, which normally takes precedence over all other failures. While a timeout suggests that the test itself needs fixing (usually to respect the test context cancellation), we see that in practice, engineers tend to close the issue without investigating as soon as they see the preemption.

This also removes the potential duplicate vm_preemption failure that may have been added by the preemption polling.

Fixes: #139004
Fixes: #139931
Release note: none
Epic: none
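
For illustration, here is a minimal sketch of the combined behavior described above, using stand-in types rather than the real roachtest internals (failure, testState, and finalize are all hypothetical names):

package main

import "fmt"

// failure is a stand-in for roachtest's internal failure record.
type failure struct {
	kind       string
	reportable bool
}

// testState is a hypothetical stand-in for the test's failure-tracking state.
type testState struct {
	polledPreemption   bool // set by the VM preemption poller
	postTestPreemption bool // set by the post-test preemption check
	failures           []failure
}

// finalize sketches the described logic: if either check saw a preemption,
// discard earlier failures (a timeout would otherwise take precedence) and
// record a single non-reportable vm_preemption failure.
func (t *testState) finalize() {
	if t.polledPreemption || t.postTestPreemption {
		t.failures = t.failures[:0]
		t.failures = append(t.failures, failure{kind: "vm_preemption", reportable: false})
	}
}

func main() {
	t := &testState{polledPreemption: true}
	t.failures = append(t.failures, failure{kind: "timeout", reportable: true})
	t.finalize()
	fmt.Printf("%+v\n", t.failures) // [{kind:vm_preemption reportable:false}]
}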

@cockroach-teamcity (Member) commented:

This change is Reviewable

@DarrylWong force-pushed the preemption-polling-fix branch 2 times, most recently from eff81d5 to d8c1c51 on January 14, 2025 23:00
@DarrylWong (Contributor, Author) commented Jan 28, 2025

I'm on the fence about the second commit. On the one hand, there have been two cases where teams fixed their tests, but in general it appears most engineers just close out the issue. It's arguable how much value is even gained from fixing said tests anyway (saving money from not leaving the VMs hanging?).

@DarrylWong marked this pull request as ready for review January 28, 2025 16:02
@DarrylWong requested a review from a team as a code owner January 28, 2025 16:02
@DarrylWong requested review from herkolategan and srosenberg and removed the request for a team January 28, 2025 16:02
// Test that if VM preemption polling finds a preempted VM but the post test failure
// check doesn't, the test is still marked as a flake.
t.Run("post test check doesn't catch preemption", func(t *testing.T) {
setPollPreemptionInterval(50 * time.Millisecond)
Member commented:

This would flake under --stress, right?

}
mockTest.Timeout = 50 * time.Millisecond

err := runner.Run(ctx, []registry.TestSpec{mockTest}, 1, /* count */
Member commented:

Same as above... timeout may never occur under --stress.

Contributor (Author) replied:

Lowered the timeout to 10 ms; we have another unit test that uses that duration and seems to be doing fine, so hopefully it will be fine here as well. Stressed 3000 times with no flakes.

@DarrylWong force-pushed the preemption-polling-fix branch from d8c1c51 to ade241d on January 29, 2025 22:42
Previously, polled VM preemptions would simply cancel the
test, since post-test processing would recheck for
preemptions. However, we've seen cases in AWS where the
post-test check returns no preemptions despite the polling
returning preemptions.

This may just be the AWS check being eventually consistent,
so we want to avoid posting if either check finds
preemptions.
@DarrylWong force-pushed the preemption-polling-fix branch 2 times, most recently from bce3ef7 to 9885bd9 on January 30, 2025 15:40
@DarrylWong (Contributor, Author) commented:

Yup, stress didn't fail but race did. Interestingly, the failure being added by the preemption polling races with the post-test assertion check replacing the logger. This isn't specific to preemption polling; I can repro it by just running:

mockTest.Run = func(ctx context.Context, t test.Test, c cluster.Cluster) {
	t.Error("test failed")
}
mockTest.Timeout = 1 * time.Nanosecond

So this is a possible data race we could see in actual roachtests if a test times out. From what I can tell, the only impact is that we might lose the failure being logged to test.log, which is not a huge deal since the test timed out, and the timeout is most likely the real failure anyway.

For these unit tests, I just made sure the above race doesn't happen, since it doesn't affect what they're testing. Not sure if we want to add a real fix for the data race though.

@srosenberg (Member) commented:

Interestingly, the failure being added by the preemption polling races with the post test assertion check replacing the logger.

Not entirely following... I thought addFailure was thread-safe. Can you post the report from the data race, i.e., the racing goroutine stacks?

@DarrylWong (Contributor, Author) commented:

Sure, here's a simple repro of what I'm seeing:

func TestRunnerTestTimeoutRace(t *testing.T) {
	ctx := context.Background()
	stopper := stop.NewStopper()
	defer stopper.Stop(ctx)
	cr := newClusterRegistry()
	runner := newUnitTestRunner(cr, stopper)

	var buf syncedBuffer
	copt := defaultClusterOpt()
	lopt := defaultLoggingOpt(&buf)
	test := registry.TestSpec{
		Name:             `timeout`,
		Owner:            OwnerUnitTest,
		Timeout:          1 * time.Millisecond,
		Cluster:          spec.MakeClusterSpec(0),
		CompatibleClouds: registry.AllExceptAWS,
		Suites:           registry.Suites(registry.Nightly),
		CockroachBinary:  registry.StandardCockroach,
		Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
			t.Error("test failed")
		},
	}
	_ = runner.Run(ctx, []registry.TestSpec{test}, 1, /* count */
		defaultParallelism, copt, testOpts{}, lopt)
}

Running ./dev test pkg/cmd/roachtest -f TestRunnerTestTimeoutRace --stress --race hits:

==================
WARNING: DATA RACE
Read at 0x00c004b9e888 by goroutine 40:
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testImpl).L()
      pkg/cmd/roachtest/test_impl.go:262 +0x520
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testImpl).addFailure()
      pkg/cmd/roachtest/test_impl.go:470 +0x530
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testImpl).addFailureAndCancel()
      pkg/cmd/roachtest/test_impl.go:441 +0x5c
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testImpl).Error()
      pkg/cmd/roachtest/test_impl.go:432 +0x4c
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.TestRunnerTestTimeoutRace.func1()
      pkg/cmd/roachtest/test_test.go:381 +0x88
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runTest.func2()
      pkg/cmd/roachtest/test_runner.go:1315 +0x168

Previous write at 0x00c004b9e888 by goroutine 39:
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testImpl).ReplaceL()
      pkg/cmd/roachtest/test_impl.go:268 +0x1d0
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runTest.func4()
      pkg/cmd/roachtest/test_runner.go:1380 +0x1cc
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runTest()
      pkg/cmd/roachtest/test_runner.go:1402 +0x14d8
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runWorker()
      pkg/cmd/roachtest/test_runner.go:937 +0x2b6c
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).Run.func1()
      pkg/cmd/roachtest/test_runner.go:389 +0x3b4
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      pkg/util/stop/stopper.go:500 +0x214
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.gowrap1()
      pkg/util/stop/stopper.go:501 +0x54

Goroutine 40 (running) created at:
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runTest()
      pkg/cmd/roachtest/test_runner.go:1296 +0xba8
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).runWorker()
      pkg/cmd/roachtest/test_runner.go:937 +0x2b6c
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).Run.func1()
      pkg/cmd/roachtest/test_runner.go:389 +0x3b4
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      pkg/util/stop/stopper.go:500 +0x214
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.gowrap1()
      pkg/util/stop/stopper.go:501 +0x54

Goroutine 39 (running) created at:
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx()
      pkg/util/stop/stopper.go:489 +0x52c
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask()
      pkg/util/stop/stopper.go:360 +0xfd4
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.(*testRunner).Run()
      pkg/cmd/roachtest/test_runner.go:378 +0xcc4
  github.com/cockroachdb/cockroach/pkg/cmd/roachtest.TestRunnerTestTimeoutRace()
      pkg/cmd/roachtest/test_test.go:384 +0x5e4
  testing.tRunner()
      GOROOT/src/testing/testing.go:1689 +0x180
  testing.(*T).Run.gowrap1()
      GOROOT/src/testing/testing.go:1742 +0x40
==================

So the failures struct in addFailure is thread-safe, but the logger it uses here:

t.L().Printf("test failure #%d: full stack retained in %s.log: %s", failureNum, failureLog, msg)

isn't, because we replace the logger after the test is done:

if !t.Failed() {
	// Awkward file name to keep it close to test.log.
	l.Printf("running post test assertions (test-post-assertions.log)")
	replaceLogger("test-post-assertions")
	// We still want to run the post-test assertions even if the test timed out as it
	// might provide useful information about the health of the nodes. Any assertion failures
	// will be recorded against, and eventually fail, the test.
	if err := r.postTestAssertions(ctx, t, c, 10*time.Minute); err != nil {
Note the comment ("We still want to run the post-test assertions even if the test timed out"), along with the fact that on test timeouts we purposely don't cancel the test context and instead let the test keep running.

So my understanding is that the following ordering can happen:

  1. The test times out and the test runner stops blocking.
  2. The test runner starts the post-test assertion checks, replacing the logger.
  3. The test itself runs t.Error("test failed"), which eventually calls t.L().Printf. This races with the above.

The original test was hitting the same race with the same stack trace; it was just the preemption polling adding the failure instead of the test itself.
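
For anyone who wants a minimal standalone illustration of this class of race (this is not roachtest code; logger and testImpl here are toy stand-ins), the following program typically trips the race detector under go run -race:

package main

import "fmt"

type logger struct{ name string }

// testImpl mimics the racy shape: an unguarded logger field that one
// goroutine reads via L() while another replaces it via ReplaceL().
type testImpl struct{ l *logger }

func (t *testImpl) L() *logger         { return t.l } // racy read
func (t *testImpl) ReplaceL(l *logger) { t.l = l }    // racy write

func main() {
	t := &testImpl{l: &logger{name: "test"}}
	done := make(chan struct{})
	go func() {
		fmt.Println(t.L().name) // analogous to t.L().Printf in addFailure
		close(done)
	}()
	t.ReplaceL(&logger{name: "test-post-assertions"}) // analogous to replaceLogger
	<-done
}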

@srosenberg (Member) commented:

The test times out and the test runner stops blocking.
The test runner starts the post-test assertion checks, replacing the logger.
The test itself runs t.Error("test failed") which eventually calls t.L().Printf. This races with the above.

Yep. That's a perfectly plausible scenario. We should fix it; e.g., changing the type of testImpl.L to atomic.Pointer.
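
A minimal sketch of that suggestion, applied to the toy stand-ins from the repro discussion above (the real testImpl is more involved); storing the logger behind an atomic.Pointer makes the concurrent read and replacement race-free:

package main

import (
	"fmt"
	"sync/atomic"
)

type logger struct{ name string }

// testImpl now stores the logger behind an atomic.Pointer instead of a
// bare field, so L() and ReplaceL() no longer race.
type testImpl struct{ l atomic.Pointer[logger] }

func (t *testImpl) L() *logger         { return t.l.Load() }
func (t *testImpl) ReplaceL(l *logger) { t.l.Store(l) }

func main() {
	t := &testImpl{}
	t.ReplaceL(&logger{name: "test"})
	done := make(chan struct{})
	go func() {
		fmt.Println(t.L().name) // safe concurrent read
		close(done)
	}()
	t.ReplaceL(&logger{name: "test-post-assertions"}) // safe concurrent write
	<-done
}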

return nil, nil
}
}()
// Make sure the preemption polling is called and the test context is cancelled
Member commented:

Good. This removes the clock, and the corresponding "flakiness", from the equation.

This change resets a test's failures in the case of a VM
preemption, covering the case where a timeout occurred,
which normally takes precedence over all other failures.
While a timeout suggests that the test itself needs fixing
(usually to respect the test context cancellation), we see
that in practice, engineers tend to close the issue without
investigating as soon as they see the preemption.

This also removes the potential duplicate vm_preemption
failure that may have been added by the preemption polling.
@DarrylWong force-pushed the preemption-polling-fix branch from 9885bd9 to 031e055 on February 6, 2025 15:44
@DarrylWong (Contributor, Author) commented:

We should fix it; e.g., changing the type of testImpl.L to atomic.Pointer

SGTM, will put it out as a follow-up PR.

@DarrylWong (Contributor, Author) commented:

TFTR!

bors r=srosenberg

@srosenberg (Member) commented:

blathers backport 25.1.0-rc

blathers-crl (bot) commented Feb 18, 2025

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #139004: branch-release-25.1.0-rc.
Issue #139931: branch-release-25.1.0-rc.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Successfully merging this pull request may close these issues:

roachtest: tpccbench/nodes=3/cpu=16 failed
roachtest: kv0/enc=false/nodes=1/mt-shared-process failed