Retry mechanism for discover hosts #2501
base: main
Conversation
rdiazcam commented Oct 30, 2024 (edited)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Thanks for the PR! ❤️
This retry mechanism is intended to raise an error if no computes are found after the dataplane deployment, preventing further execution of tasks that depend on a successful dataplane deployment [0].
Limitation: a discover_hosts run with the --strict option is considered successful (exit code 0) as soon as a single unmapped host is discovered and mapped, so if, for example, there are two hosts to discover and only one is found, no error is raised.
[0] https://issues.redhat.com/browse/OSPRH-11052
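A minimal sketch of what such a retry could look like as an Ansible task, assuming an `oc rsh` invocation against the nova-cell0-conductor-0 pod as in the diff further down; the task name, retry count, and delay are illustrative and not necessarily what this PR implements:

```yaml
# Sketch only: the task name and the retries/delay values are assumptions.
- name: Discover computes and fail if none are mapped
  ansible.builtin.command: >-
    oc rsh -n openstack nova-cell0-conductor-0
    nova-manage cell_v2 discover_hosts --verbose --strict
  register: _discover_result
  # With --strict, nova-manage returns non-zero until at least one unmapped
  # host has been discovered and mapped, so the loop keeps polling until then.
  retries: 10
  delay: 30
  until: _discover_result.rc == 0
  changed_when: _discover_result.rc == 0
```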
Force-pushed from 9c924e9 to 609d1e5
Build failed (check pipeline). https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/61f1cb66198142b393699bc98fdf71a0
✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 47m 37s
/cc @gibizer
@@ -276,6 +276,11 @@
    -n openstack
    nova-cell0-conductor-0
    nova-manage cell_v2 discover_hosts --verbose
    --strict
From the doc:
If --strict is specified, the command will only return 0 if an unmapped host was discovered and mapped successfully.
So this change means that discover_hosts will be called until at least one host is discovered. This could help a bit with the race condition between the DataPlaneDeployment starting the nova-compute service, nova-compute starting up and reporting its first status to the DB, and the ci_framework running the discover_hosts command to discover those computes.
However, it will not be the final solution: if there is more than one compute in the deployment, then discovering a single one is enough for this change to move forward, which means that not all the computes may be discovered.
The final solution requires one of two directions:
- a) do not let the DataPlaneDeployment finish until nova-compute has not just started but also successfully reported its first status to the DB.
- b) retry discover_hosts until not just one but all of the deployed compute nodes are discovered (see the sketch after this list).
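A rough sketch of direction (b), in the spirit of the retry loop referenced below in PR #2459: keep running discover_hosts and only stop once the number of mapped hosts reaches the expected compute count. The variable name `cifmw_expected_computes`, the grep pattern, and the retry/delay values are assumptions for illustration, not the actual ci-framework implementation.

```yaml
# Illustrative only: cifmw_expected_computes and the grep pattern are assumptions.
- name: Discover hosts until all expected computes are mapped
  ansible.builtin.shell: >-
    oc rsh -n openstack nova-cell0-conductor-0
    nova-manage cell_v2 discover_hosts --verbose >/dev/null &&
    oc rsh -n openstack nova-cell0-conductor-0
    nova-manage cell_v2 list_hosts | grep -c 'compute'
  register: _mapped_count
  # grep -c exits non-zero when nothing matches, so never fail the task here;
  # the until condition decides when the loop is done.
  failed_when: false
  retries: 30
  delay: 20
  until: (_mapped_count.stdout | trim | int) >= (cifmw_expected_computes | int)
```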
I suggest pulling this into the mainline of the ci-framework instead: https://github.com/openstack-k8s-operators/ci-framework/pull/2459/files waits until all the expected computes are discovered, not just the first.
@gibizer My understanding was that the content in the hook was only a workaround, and a hook is the correct place for a workaround; but if this is how you/the nova team want the discover_hosts logic to work, then I guess we should update our call to discover_hosts to use the retry logic as you suggest. @rdiazcam Would you like to update your PR to match that logic? My initial thought is that creating a tiny role to handle discover_hosts would be a nice route to go (a rough sketch follows).
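For illustration, this is one hypothetical way such a tiny role could be consumed from a playbook; the role name and variables are made up here, not something that exists in the ci-framework today:

```yaml
# Hypothetical role invocation; names and defaults are illustrative only.
- name: Discover dataplane computes
  hosts: localhost
  tasks:
    - name: Run discover_hosts with retries until all computes are mapped
      ansible.builtin.include_role:
        name: discover_hosts
      vars:
        cifmw_discover_hosts_expected_computes: 2
        cifmw_discover_hosts_retries: 30
        cifmw_discover_hosts_delay: 20
```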
I think the long-term solution is to have a proper healthcheck endpoint implemented for nova-compute and to let the container-starting logic in edpm_ansible wait for nova-compute to report a healthy state (including the compute node creation) before returning. This way, when the DataPlaneDeployment returns Ready=True, all the computes are in a discoverable state. However, this is still fairly far off from a product perspective, as the work is ongoing upstream in https://blueprints.launchpad.net/nova/+spec/per-process-healthchecks (and probably won't be fully ready in Epoxy either). So while the retry in #2459 is a workaround, it is not something we will be able to remove soon.
Keeping it as an optional hook keeps the problem widespread, as not all jobs have this workaround enabled, and it creates a lot of CI triage load due to this race condition. I suggest adding the logic of the workaround to the main codepath of the ci-framework, or at least enabling this workaround by default in every job that does an EDPM deployment.
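If the retry logic were to stay as a hook rather than landing in the main codepath, enabling it by default in EDPM jobs could look something like the snippet below; the variable names are hypothetical and only illustrate the idea of a default-on toggle, they are not existing ci-framework knobs:

```yaml
# Hypothetical scenario defaults for enabling the workaround everywhere.
cifmw_discover_hosts_retry_enabled: true
cifmw_discover_hosts_expected_computes: "{{ groups['computes'] | default([]) | length }}"
```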