Skip to content

test failed in CI: test_instance_failed_by_stop_request_does_not_reincarnate #8119

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
iximeow opened this issue May 8, 2025 · 1 comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@iximeow
Copy link
Member

iximeow commented May 8, 2025

This test failed on a CI run on 5b6db25 (imminently a PR, just not one yet):

This is the runs page, though I reran the failed job because.. well, it's a flake: https://github.com/oxidecomputer/omicron/runs/41756574687

Buildomat from the failed run though: https://buildomat.eng.oxide.computer/wg/0/details/01JTKR9B4DY2HPFFKKXH60N0QC/2WSRkGyhWpcAbSCPmZLleIzTW0fVbYxTlNQrx6qWIENCOeiz/01JTKR9SBGS5SP7K0ZHFM64VJ6 .

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01JTKR9B4DY2HPFFKKXH60N0QC/2WSRkGyhWpcAbSCPmZLleIzTW0fVbYxTlNQrx6qWIENCOeiz/01JTKR9SBGS5SP7K0ZHFM64VJ6#S7310

Excerpt from the log showing the failure:

  [nexus/tests/integration_tests/instances.rs:1463:5] nexus_test_utils::background::activate_background_task(&cptestctx.internal_client,
7277  2025-05-06T22:34:22.953Z  "instance_reincarnation",).await = BackgroundTask {
7278  2025-05-06T22:34:22.953Z      current: Idle,
7279  2025-05-06T22:34:22.953Z      description: "schedules start sagas for failed instances that can be automatically restarted",
7280  2025-05-06T22:34:22.953Z      last: Completed(
7281  2025-05-06T22:34:22.953Z          LastResultCompleted {
7282  2025-05-06T22:34:22.953Z              details: Object {
7283  2025-05-06T22:34:22.953Z                  "changed_state": Array [],
7284  2025-05-06T22:34:22.953Z                  "disabled": Bool(false),
7285  2025-05-06T22:34:22.953Z                  "errors": Array [],
7286  2025-05-06T22:34:22.953Z                  "instances_found": Object {
7287  2025-05-06T22:34:22.953Z                      "Failed": Number(0),
7288  2025-05-06T22:34:22.953Z                      "SagaUnwound": Number(0),
7289  2025-05-06T22:34:22.953Z                  },
7290  2025-05-06T22:34:22.953Z                  "instances_reincarnated": Array [],
7291  2025-05-06T22:34:22.953Z                  "restart_errors": Array [],
7292  2025-05-06T22:34:22.953Z              },
7293  2025-05-06T22:34:22.953Z              elapsed: Duration {
7294  2025-05-06T22:34:22.953Z                  nanos: 172882315,
7295  2025-05-06T22:34:22.953Z                  secs: 0,
7296  2025-05-06T22:34:22.953Z              },
7297  2025-05-06T22:34:22.953Z              iteration: 2,
7298  2025-05-06T22:34:22.954Z              reason: Signaled,
7299  2025-05-06T22:34:22.954Z              start_time: 2025-05-06T22:31:41.967322213Z,
7300  2025-05-06T22:34:22.954Z          },
7301  2025-05-06T22:34:22.954Z      ),
7302  2025-05-06T22:34:22.954Z      name: "instance_reincarnation",
7303  2025-05-06T22:34:22.954Z      period: Duration {
7304  2025-05-06T22:34:22.954Z          nanos: 0,
7305  2025-05-06T22:34:22.954Z          secs: 600,
7306  2025-05-06T22:34:22.954Z      },
7307  2025-05-06T22:34:22.954Z  }
7308  2025-05-06T22:34:22.954Z  [nexus/tests/integration_tests/instances.rs:1490:5] expect_instance_start_ok(client, instance_name).await = ()
7309  2025-05-06T22:34:22.954Z  
7310  2025-05-06T22:34:22.954Z  thread 'integration_tests::instances::test_instance_failed_by_stop_request_does_not_reincarnate' panicked at nexus/tests/integration_tests/instances.rs:6922:19:
7311  2025-05-06T22:34:22.954Z  instance 3aca89b7-bb2f-4ad1-81b5-54b5346d84c6 did not transition to Failed after 120s: timed out after 120.602060563s

which is to say that this expected transition to Failed didn't happen within 120s: https://github.com/oxidecomputer/omicron/blob/ef01db1/nexus/tests/integration_tests/instances.rs#L1518-L1522

(@hawkw's dbg!() in the tests is immediately validated because the print at 1490 rules out any of the other earlier instance_wait_for_state, nice)

this log looks like the expect_instance_reboot_fail just before the fated instance_wait_for_state, but it looks like everything is in a reasonable state. i think what happened here is that, just before we'd instance_wait_for_state, the reincarnation background task ran (see here) and reincarnated the instance. then the instance was Running so we'd never reach Failed in instance_wait_for_state() because Nexus kind of beat us to running the tail end of this test.

@hawkw i'm mostly not sure if background tasks would run themselves in tests like this, does that sound right? maybe we should have the test look for Failed or Running, and if it's already Running we instead assume the background task independently ran and did what we wanted?

@iximeow iximeow added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label May 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Projects
None yet
Development

No branches or pull requests

2 participants