Retry mechanism for discover hosts #2501

rdiazcam · 2024-10-30T16:09:10Z

This retry mechanism is intended to raise an error if no computes are
found after the dataplane deployment, preventing futher execution of
tasks (which depend on a successful dataplane deployment)[0].
Limitation: a discover_hosts execution with --strict option is
considered successful (exit code 0) when an unmapped host is
discovered so if for example we have two hosts to discover and only
one is found no error will be thrown.

[0] https://issues.redhat.com/browse/OSPRH-11052

openshift-ci · 2024-10-30T16:09:16Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign raukadah for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2024-10-30T16:09:21Z

Thanks for the PR! ❤️
I'm marking it as a draft, once your happy with it merging and the PR is passing CI, click the "Ready for review" button below.

This retry mechanism is intended to raise an error if no computes are found after the dataplane deployment, preventing futher execution of tasks (which depend on a successful dataplane deployment)[0]. Limitation: a discover_hosts execution with --strict option is considered successful (exit code 0) when an unmapped host is discovered so if for example we have two hosts to discover and only one is found no error will be thrown. [0] https://issues.redhat.com/browse/OSPRH-11052

softwarefactory-project-zuul · 2024-10-31T12:48:19Z

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/61f1cb66198142b393699bc98fdf71a0

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 47m 37s
❌ podified-multinode-edpm-deployment-crc FAILURE in 59m 14s
❌ cifmw-crc-podified-edpm-baremetal FAILURE in 1h 24m 04s
❌ podified-multinode-hci-deployment-crc FAILURE in 1h 18m 26s
✔️ noop SUCCESS in 0s
❌ cifmw-pod-ansible-test NODE_FAILURE Node request 100-0007647437 failed in 0s
❌ cifmw-pod-pre-commit NODE_FAILURE Node request 100-0007647438 failed in 0s
✔️ build-push-container-cifmw-client SUCCESS in 37m 34s
✔️ cifmw-molecule-edpm_deploy SUCCESS in 4m 23s
✔️ cifmw-molecule-edpm_deploy_baremetal SUCCESS in 4m 18s

lewisdenny · 2024-11-04T01:48:19Z

Thank you for the patch @rdiazcam, I would love input from @gibizer about the use of the --strict argument.

Do we ever have a reason to deploy without compute nodes?

lewisdenny · 2024-11-04T01:48:45Z

/cc @gibizer

gibizer · 2024-11-04T07:42:47Z

playbooks/06-deploy-architecture.yml

@@ -276,6 +276,11 @@
        -n openstack
        nova-cell0-conductor-0
        nova-manage cell_v2 discover_hosts --verbose
+        --strict


From the doc:

If --strict is specified, the command will only return 0 if an unmapped host was discovered and mapped successfully.

So this change means that the discover_hosts will be called until at least 1 host is discovered. This could help a bit with the race condition between starting the nova-compute service by the DataPlaneDeployment, letting the nova-compute start up and report the first status to the DB, and between the ci_framework running the discover_hosts command to discover those computes.

However it will not be the final solution due to the following. If there are more than one compute in the deployment then discovering one of them is enough for this change to move forward. Which means maybe not all the compute will be discovered.

The final solution requires either of the two directions:

a) do not let the DataPlaneDeployment finish until the nova-compute not just started but also successfully reported its first status to the DB.

b) retry discover_hosts until not just one but all the deployed compute nodes are discovered.

gibizer · 2024-11-05T16:31:34Z

I suggest to pull this into the mainline of ci framework instead https://github.com/openstack-k8s-operators/ci-framework/pull/2459/files as it waits until all the expected computes are discovered not just the first.

lewisdenny · 2024-11-06T00:49:54Z

@gibizer My understanding was the content in the hook was only a workaround, a hook is the correct place for a workaround but if this is how you/nova team want the discover_hosts logic to work then I guess we should update our call to discover_hosts to use the retry logic as you suggest.

@rdiazcam Would you like to update you PR to match the logic? My initial thoughts would be creating a tiny role that handles discover_hosts would be a nice route to go.

gibizer · 2024-11-06T07:49:36Z

@gibizer My understanding was the content in the hook was only a workaround, a hook is the correct place for a workaround but if this is how you/nova team want the discover_hosts logic to work then I guess we should update our call to discover_hosts to use the retry logic as you suggest.

I think the long term solution is to have a proper healthcheck endpoint implemented for nova-compute and let the container starting logic in edpm_ansible waiting for nova-compute to report a healthy state before returning (including the compute node creation). This way when the DataPlaneDeployment returns Ready=True, all the computes are in a discoverable state. However this still fairly fare from the product perspective as the work is ongoing upstream in https://blueprints.launchpad.net/nova/+spec/per-process-healthchecks (and probably won't be fully ready in Epoxy either).

So while the retry in #2459 is a workaround, it is not something we will be able to remove soon. Keeping it as an optional hook makes the problem widespread as not all job has this workaround enabled and creates a load of triaging CI issues due to this race condition. I suggest add the logic of the workaround to the main codepath of the ci-framework, or at least make this workaround enabled by default in every job that does EDPM deployment.

@rdiazcam Would you like to update you PR to match the logic? My initial thoughts would be creating a tiny role that handles discover_hosts would be a nice route to go.

rdiazcam added the do-not-merge/work-in-progress label Oct 30, 2024

openshift-ci bot removed the do-not-merge/work-in-progress label Oct 30, 2024

github-actions bot marked this pull request as draft October 30, 2024 16:09

openshift-ci bot added the do-not-merge/work-in-progress label Oct 30, 2024

rdiazcam force-pushed the discover_hots_retry branch from 9c924e9 to 609d1e5 Compare October 31, 2024 10:32

lewisdenny requested a review from a team November 4, 2024 01:48

openshift-ci bot requested a review from gibizer November 4, 2024 01:48

gibizer reviewed Nov 4, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry mechanism for discover hosts #2501

Retry mechanism for discover hosts #2501

rdiazcam commented Oct 30, 2024 •

edited

Loading

openshift-ci bot commented Oct 30, 2024

github-actions bot commented Oct 30, 2024

softwarefactory-project-zuul bot commented Oct 31, 2024

lewisdenny commented Nov 4, 2024

lewisdenny commented Nov 4, 2024

gibizer Nov 4, 2024

gibizer commented Nov 5, 2024 •

edited

Loading

lewisdenny commented Nov 6, 2024

gibizer commented Nov 6, 2024

Retry mechanism for discover hosts #2501

Are you sure you want to change the base?

Retry mechanism for discover hosts #2501

Conversation

rdiazcam commented Oct 30, 2024 • edited Loading

openshift-ci bot commented Oct 30, 2024

github-actions bot commented Oct 30, 2024

softwarefactory-project-zuul bot commented Oct 31, 2024

lewisdenny commented Nov 4, 2024

lewisdenny commented Nov 4, 2024

gibizer Nov 4, 2024

Choose a reason for hiding this comment

gibizer commented Nov 5, 2024 • edited Loading

lewisdenny commented Nov 6, 2024

gibizer commented Nov 6, 2024

rdiazcam commented Oct 30, 2024 •

edited

Loading

gibizer commented Nov 5, 2024 •

edited

Loading