Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: run post test health checks on system #141430

Merged
merged 2 commits into from
Feb 14, 2025

Conversation

DarrylWong
Copy link
Contributor

@DarrylWong DarrylWong commented Feb 13, 2025

This changes the post test assertions to always run the health checks on the system tenant (KV pod). Previously it would run these health checks on the default virtual cluster, which may be a SQL pod. We run consistency checks on the system tenant so we care about the liveness of the system tenant here.

Fixes: #140774


After this change, running decomission/mixed-versions:

test-post-assertions: 2025/02/13 16:23:02 cluster.go:2577: running cmd `./cockroach auth-session lo...` on nodes [:1]; details in run_162302.356113000_n1_cockroach-authsessio.log
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1502: n1: https://127.0.0.1:29001/health?ready=1 status=503 body={
  "error": "node is shutting down",
  "code": 14,
  "message": "node is shutting down",
  "details": [
  ]
}
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1509: n2: https://127.0.0.1:29005/health?ready=1 status=200 ok
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1509: n3: https://127.0.0.1:29003/health?ready=1 status=200 ok
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1509: n4: https://127.0.0.1:29007/health?ready=1 status=200 ok
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1525: running validation checks on node 2 (<10m)

We can see it correctly identifies node 1 as shutting down and runs the validation checks on node 2. It also only attempts one auth-session login.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@DarrylWong DarrylWong force-pushed the fix-health-check-tenant-url branch from 8f5fb57 to 261de61 Compare February 13, 2025 19:06
@DarrylWong DarrylWong changed the title roachtest: run post test health checks on kv pod roachtest: run post test health checks on system Feb 13, 2025
This changes the post test assertions to always run the health checks on the KV pod.
Previously it would run these health checks on the default virtual cluster, which may be a SQL pod.

Fixes: cockroachdb#140774
Before, the defaultHTTPClient would grab the lock
only when adding cookies. However, we also want to
protect the check of a sessionID from concurrent
access as well. Previously, it was possible that
concurrent calls would all see the sessionID as
not set and all attempt to authenticate with the
cluster.
@DarrylWong DarrylWong force-pushed the fix-health-check-tenant-url branch from 261de61 to f33b7c5 Compare February 13, 2025 19:08
@DarrylWong DarrylWong marked this pull request as ready for review February 13, 2025 19:09
@DarrylWong DarrylWong requested a review from a team as a code owner February 13, 2025 19:09
@DarrylWong DarrylWong requested review from herkolategan and srosenberg and removed request for a team February 13, 2025 19:09
@@ -134,6 +134,9 @@ func (r *RoachtestHTTPClient) addCookies(ctx context.Context, cookieUrl string)
if !r.cluster.IsSecure() {
return nil
}

r.mu.Lock()
defer r.mu.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find! Because of this data race, there was a high bias to return n1; now, all other nodes are equally likely to be picked (in postTestAssertions).

@DarrylWong
Copy link
Contributor Author

TFTR!

bors r=srosenberg

craig bot pushed a commit that referenced this pull request Feb 14, 2025
141430: roachtest: run post test health checks on system r=srosenberg a=DarrylWong

This changes the post test assertions to always run the health checks on the system tenant (KV pod). Previously it would run these health checks on the default virtual cluster, which may be a SQL pod. We run consistency checks on the system tenant so we care about the liveness of the system tenant here.

Fixes: #140774


-----

After this change, running `decomission/mixed-versions`:

```
test-post-assertions: 2025/02/13 16:23:02 cluster.go:2577: running cmd `./cockroach auth-session lo...` on nodes [:1]; details in run_162302.356113000_n1_cockroach-authsessio.log
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1502: n1: https://127.0.0.1:29001/health?ready=1 status=503 body={
  "error": "node is shutting down",
  "code": 14,
  "message": "node is shutting down",
  "details": [
  ]
}
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1509: n2: https://127.0.0.1:29005/health?ready=1 status=200 ok
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1509: n3: https://127.0.0.1:29003/health?ready=1 status=200 ok
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1509: n4: https://127.0.0.1:29007/health?ready=1 status=200 ok
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1525: running validation checks on node 2 (<10m)
```

We can see it correctly identifies node 1 as shutting down and runs the validation checks on node 2. It also only attempts one `auth-session login`.

Co-authored-by: DarrylWong <[email protected]>
@craig
Copy link
Contributor

craig bot commented Feb 14, 2025

Build failed:

@DarrylWong
Copy link
Contributor Author

unrelated flake

bors retry

@DarrylWong DarrylWong added backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.3.x Flags PRs that need to be backported to 24.3 backport-25.1.x Flags PRs that need to be backported to 25.1 labels Feb 14, 2025
craig bot pushed a commit that referenced this pull request Feb 14, 2025
141430: roachtest: run post test health checks on system r=srosenberg a=DarrylWong

This changes the post test assertions to always run the health checks on the system tenant (KV pod). Previously it would run these health checks on the default virtual cluster, which may be a SQL pod. We run consistency checks on the system tenant so we care about the liveness of the system tenant here.

Fixes: #140774


-----

After this change, running `decomission/mixed-versions`:

```
test-post-assertions: 2025/02/13 16:23:02 cluster.go:2577: running cmd `./cockroach auth-session lo...` on nodes [:1]; details in run_162302.356113000_n1_cockroach-authsessio.log
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1502: n1: https://127.0.0.1:29001/health?ready=1 status=503 body={
  "error": "node is shutting down",
  "code": 14,
  "message": "node is shutting down",
  "details": [
  ]
}
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1509: n2: https://127.0.0.1:29005/health?ready=1 status=200 ok
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1509: n3: https://127.0.0.1:29003/health?ready=1 status=200 ok
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1509: n4: https://127.0.0.1:29007/health?ready=1 status=200 ok
test-post-assertions: 2025/02/13 16:23:17 test_runner.go:1525: running validation checks on node 2 (<10m)
```

We can see it correctly identifies node 1 as shutting down and runs the validation checks on node 2. It also only attempts one `auth-session login`.

141442: *: add testutils as disallowed import to top-level packages r=rafiss a=rafiss

See the individual commits to make this easier to review.

This PR is a step towards completing #81375, with the ultimate goal
of no longer depending on test packages in our production binary. More
work is needed to avoid depending on any of the subpackages inside of `testutils`.

informs #81375

Co-authored-by: DarrylWong <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@craig
Copy link
Contributor

craig bot commented Feb 14, 2025

Build failed (retrying...):

@craig craig bot merged commit b374872 into cockroachdb:master Feb 14, 2025
24 checks passed
Copy link

blathers-crl bot commented Feb 14, 2025

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #140774: branch-release-24.1, branch-release-24.3, branch-release-25.1.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Copy link

blathers-crl bot commented Feb 14, 2025

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 98dcef2 to blathers/backport-release-24.1-141430: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.1.x failed. See errors above.


error creating merge commit from 98dcef2 to blathers/backport-release-24.3-141430: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.3.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.3.x Flags PRs that need to be backported to 24.3 backport-25.1.x Flags PRs that need to be backported to 25.1 target-release-25.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: decommission/mixed-versions failed
3 participants