Description
This test failed on a CI run on 5b6db25 (imminently a PR, just not one yet):
This is the runs page, though I reran the failed job because.. well, it's a flake: https://github.com/oxidecomputer/omicron/runs/41756574687
Buildomat from the failed run though: https://buildomat.eng.oxide.computer/wg/0/details/01JTKR9B4DY2HPFFKKXH60N0QC/2WSRkGyhWpcAbSCPmZLleIzTW0fVbYxTlNQrx6qWIENCOeiz/01JTKR9SBGS5SP7K0ZHFM64VJ6 .
Log showing the specific test failure:
Excerpt from the log showing the failure:
[nexus/tests/integration_tests/instances.rs:1463:5] nexus_test_utils::background::activate_background_task(&cptestctx.internal_client,
7277 2025-05-06T22:34:22.953Z "instance_reincarnation",).await = BackgroundTask {
7278 2025-05-06T22:34:22.953Z current: Idle,
7279 2025-05-06T22:34:22.953Z description: "schedules start sagas for failed instances that can be automatically restarted",
7280 2025-05-06T22:34:22.953Z last: Completed(
7281 2025-05-06T22:34:22.953Z LastResultCompleted {
7282 2025-05-06T22:34:22.953Z details: Object {
7283 2025-05-06T22:34:22.953Z "changed_state": Array [],
7284 2025-05-06T22:34:22.953Z "disabled": Bool(false),
7285 2025-05-06T22:34:22.953Z "errors": Array [],
7286 2025-05-06T22:34:22.953Z "instances_found": Object {
7287 2025-05-06T22:34:22.953Z "Failed": Number(0),
7288 2025-05-06T22:34:22.953Z "SagaUnwound": Number(0),
7289 2025-05-06T22:34:22.953Z },
7290 2025-05-06T22:34:22.953Z "instances_reincarnated": Array [],
7291 2025-05-06T22:34:22.953Z "restart_errors": Array [],
7292 2025-05-06T22:34:22.953Z },
7293 2025-05-06T22:34:22.953Z elapsed: Duration {
7294 2025-05-06T22:34:22.953Z nanos: 172882315,
7295 2025-05-06T22:34:22.953Z secs: 0,
7296 2025-05-06T22:34:22.953Z },
7297 2025-05-06T22:34:22.953Z iteration: 2,
7298 2025-05-06T22:34:22.954Z reason: Signaled,
7299 2025-05-06T22:34:22.954Z start_time: 2025-05-06T22:31:41.967322213Z,
7300 2025-05-06T22:34:22.954Z },
7301 2025-05-06T22:34:22.954Z ),
7302 2025-05-06T22:34:22.954Z name: "instance_reincarnation",
7303 2025-05-06T22:34:22.954Z period: Duration {
7304 2025-05-06T22:34:22.954Z nanos: 0,
7305 2025-05-06T22:34:22.954Z secs: 600,
7306 2025-05-06T22:34:22.954Z },
7307 2025-05-06T22:34:22.954Z }
7308 2025-05-06T22:34:22.954Z [nexus/tests/integration_tests/instances.rs:1490:5] expect_instance_start_ok(client, instance_name).await = ()
7309 2025-05-06T22:34:22.954Z
7310 2025-05-06T22:34:22.954Z thread 'integration_tests::instances::test_instance_failed_by_stop_request_does_not_reincarnate' panicked at nexus/tests/integration_tests/instances.rs:6922:19:
7311 2025-05-06T22:34:22.954Z instance 3aca89b7-bb2f-4ad1-81b5-54b5346d84c6 did not transition to Failed after 120s: timed out after 120.602060563s
which is to say that this expected transition to Failed didn't happen within 120s: https://github.com/oxidecomputer/omicron/blob/ef01db1/nexus/tests/integration_tests/instances.rs#L1518-L1522
(@hawkw's dbg!()
in the tests is immediately validated because the print at 1490 rules out any of the other earlier instance_wait_for_state
, nice)
this log looks like the expect_instance_reboot_fail
just before the fated instance_wait_for_state
, but it looks like everything is in a reasonable state. i think what happened here is that, just before we'd instance_wait_for_state
, the reincarnation background task ran (see here) and reincarnated the instance. then the instance was Running so we'd never reach Failed in instance_wait_for_state()
because Nexus kind of beat us to running the tail end of this test.
@hawkw i'm mostly not sure if background tasks would run themselves in tests like this, does that sound right? maybe we should have the test look for Failed or Running, and if it's already Running we instead assume the background task independently ran and did what we wanted?