Skip to content

test failed in CI: test_instance_failed_by_stop_request_does_not_reincarnate #8119

Closed
@iximeow

Description

@iximeow

This test failed on a CI run on 5b6db25 (imminently a PR, just not one yet):

This is the runs page, though I reran the failed job because.. well, it's a flake: https://github.com/oxidecomputer/omicron/runs/41756574687

Buildomat from the failed run though: https://buildomat.eng.oxide.computer/wg/0/details/01JTKR9B4DY2HPFFKKXH60N0QC/2WSRkGyhWpcAbSCPmZLleIzTW0fVbYxTlNQrx6qWIENCOeiz/01JTKR9SBGS5SP7K0ZHFM64VJ6 .

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01JTKR9B4DY2HPFFKKXH60N0QC/2WSRkGyhWpcAbSCPmZLleIzTW0fVbYxTlNQrx6qWIENCOeiz/01JTKR9SBGS5SP7K0ZHFM64VJ6#S7310

Excerpt from the log showing the failure:

  [nexus/tests/integration_tests/instances.rs:1463:5] nexus_test_utils::background::activate_background_task(&cptestctx.internal_client,
7277  2025-05-06T22:34:22.953Z  "instance_reincarnation",).await = BackgroundTask {
7278  2025-05-06T22:34:22.953Z      current: Idle,
7279  2025-05-06T22:34:22.953Z      description: "schedules start sagas for failed instances that can be automatically restarted",
7280  2025-05-06T22:34:22.953Z      last: Completed(
7281  2025-05-06T22:34:22.953Z          LastResultCompleted {
7282  2025-05-06T22:34:22.953Z              details: Object {
7283  2025-05-06T22:34:22.953Z                  "changed_state": Array [],
7284  2025-05-06T22:34:22.953Z                  "disabled": Bool(false),
7285  2025-05-06T22:34:22.953Z                  "errors": Array [],
7286  2025-05-06T22:34:22.953Z                  "instances_found": Object {
7287  2025-05-06T22:34:22.953Z                      "Failed": Number(0),
7288  2025-05-06T22:34:22.953Z                      "SagaUnwound": Number(0),
7289  2025-05-06T22:34:22.953Z                  },
7290  2025-05-06T22:34:22.953Z                  "instances_reincarnated": Array [],
7291  2025-05-06T22:34:22.953Z                  "restart_errors": Array [],
7292  2025-05-06T22:34:22.953Z              },
7293  2025-05-06T22:34:22.953Z              elapsed: Duration {
7294  2025-05-06T22:34:22.953Z                  nanos: 172882315,
7295  2025-05-06T22:34:22.953Z                  secs: 0,
7296  2025-05-06T22:34:22.953Z              },
7297  2025-05-06T22:34:22.953Z              iteration: 2,
7298  2025-05-06T22:34:22.954Z              reason: Signaled,
7299  2025-05-06T22:34:22.954Z              start_time: 2025-05-06T22:31:41.967322213Z,
7300  2025-05-06T22:34:22.954Z          },
7301  2025-05-06T22:34:22.954Z      ),
7302  2025-05-06T22:34:22.954Z      name: "instance_reincarnation",
7303  2025-05-06T22:34:22.954Z      period: Duration {
7304  2025-05-06T22:34:22.954Z          nanos: 0,
7305  2025-05-06T22:34:22.954Z          secs: 600,
7306  2025-05-06T22:34:22.954Z      },
7307  2025-05-06T22:34:22.954Z  }
7308  2025-05-06T22:34:22.954Z  [nexus/tests/integration_tests/instances.rs:1490:5] expect_instance_start_ok(client, instance_name).await = ()
7309  2025-05-06T22:34:22.954Z  
7310  2025-05-06T22:34:22.954Z  thread 'integration_tests::instances::test_instance_failed_by_stop_request_does_not_reincarnate' panicked at nexus/tests/integration_tests/instances.rs:6922:19:
7311  2025-05-06T22:34:22.954Z  instance 3aca89b7-bb2f-4ad1-81b5-54b5346d84c6 did not transition to Failed after 120s: timed out after 120.602060563s

which is to say that this expected transition to Failed didn't happen within 120s: https://github.com/oxidecomputer/omicron/blob/ef01db1/nexus/tests/integration_tests/instances.rs#L1518-L1522

(@hawkw's dbg!() in the tests is immediately validated because the print at 1490 rules out any of the other earlier instance_wait_for_state, nice)

this log looks like the expect_instance_reboot_fail just before the fated instance_wait_for_state, but it looks like everything is in a reasonable state. i think what happened here is that, just before we'd instance_wait_for_state, the reincarnation background task ran (see here) and reincarnated the instance. then the instance was Running so we'd never reach Failed in instance_wait_for_state() because Nexus kind of beat us to running the tail end of this test.

@hawkw i'm mostly not sure if background tasks would run themselves in tests like this, does that sound right? maybe we should have the test look for Failed or Running, and if it's already Running we instead assume the background task independently ran and did what we wanted?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Test FlakeTests that work. Wait, no. Actually yes. Hang on. Something is broken.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions