Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry mechanism for discover hosts #2501

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rdiazcam
Copy link
Contributor

@rdiazcam rdiazcam commented Oct 30, 2024

This retry mechanism is intended to raise an error if no computes are
found after the dataplane deployment, preventing futher execution of
tasks (which depend on a successful dataplane deployment)[0].
Limitation: a discover_hosts execution with --strict option is
considered successful (exit code 0) when an unmapped host is
discovered so if for example we have two hosts to discover and only
one is found no error will be thrown.

[0] https://issues.redhat.com/browse/OSPRH-11052

Copy link
Contributor

openshift-ci bot commented Oct 30, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign raukadah for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions github-actions bot marked this pull request as draft October 30, 2024 16:09
Copy link

Thanks for the PR! ❤️
I'm marking it as a draft, once your happy with it merging and the PR is passing CI, click the "Ready for review" button below.

This retry mechanism is intended to raise an error if no computes are
found after the dataplane deployment, preventing futher execution of
tasks (which depend on a successful dataplane deployment)[0].
Limitation: a discover_hosts execution with --strict option is
considered successful (exit code 0) when an unmapped host is
discovered so if for example we have two hosts to discover and only
one is found no error will be thrown.

[0] https://issues.redhat.com/browse/OSPRH-11052
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/61f1cb66198142b393699bc98fdf71a0

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 47m 37s
podified-multinode-edpm-deployment-crc FAILURE in 59m 14s
cifmw-crc-podified-edpm-baremetal FAILURE in 1h 24m 04s
podified-multinode-hci-deployment-crc FAILURE in 1h 18m 26s
✔️ noop SUCCESS in 0s
cifmw-pod-ansible-test NODE_FAILURE Node request 100-0007647437 failed in 0s
cifmw-pod-pre-commit NODE_FAILURE Node request 100-0007647438 failed in 0s
✔️ build-push-container-cifmw-client SUCCESS in 37m 34s
✔️ cifmw-molecule-edpm_deploy SUCCESS in 4m 23s
✔️ cifmw-molecule-edpm_deploy_baremetal SUCCESS in 4m 18s

@lewisdenny
Copy link
Collaborator

Thank you for the patch @rdiazcam, I would love input from @gibizer about the use of the --strict argument.

Do we ever have a reason to deploy without compute nodes?

@lewisdenny lewisdenny requested a review from a team November 4, 2024 01:48
@lewisdenny
Copy link
Collaborator

/cc @gibizer

@openshift-ci openshift-ci bot requested a review from gibizer November 4, 2024 01:48
@@ -276,6 +276,11 @@
-n openstack
nova-cell0-conductor-0
nova-manage cell_v2 discover_hosts --verbose
--strict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the doc:

If --strict is specified, the command will only return 0 if an unmapped host was discovered and mapped successfully.

So this change means that the discover_hosts will be called until at least 1 host is discovered. This could help a bit with the race condition between starting the nova-compute service by the DataPlaneDeployment, letting the nova-compute start up and report the first status to the DB, and between the ci_framework running the discover_hosts command to discover those computes.

However it will not be the final solution due to the following. If there are more than one compute in the deployment then discovering one of them is enough for this change to move forward. Which means maybe not all the compute will be discovered.

The final solution requires either of the two directions:

  • a) do not let the DataPlaneDeployment finish until the nova-compute not just started but also successfully reported its first status to the DB.
  • b) retry discover_hosts until not just one but all the deployed compute nodes are discovered.

@gibizer
Copy link
Contributor

gibizer commented Nov 5, 2024

I suggest to pull this into the mainline of ci framework instead https://github.com/openstack-k8s-operators/ci-framework/pull/2459/files as it waits until all the expected computes are discovered not just the first.

@lewisdenny
Copy link
Collaborator

@gibizer My understanding was the content in the hook was only a workaround, a hook is the correct place for a workaround but if this is how you/nova team want the discover_hosts logic to work then I guess we should update our call to discover_hosts to use the retry logic as you suggest.

@rdiazcam Would you like to update you PR to match the logic? My initial thoughts would be creating a tiny role that handles discover_hosts would be a nice route to go.

@gibizer
Copy link
Contributor

gibizer commented Nov 6, 2024

@gibizer My understanding was the content in the hook was only a workaround, a hook is the correct place for a workaround but if this is how you/nova team want the discover_hosts logic to work then I guess we should update our call to discover_hosts to use the retry logic as you suggest.

I think the long term solution is to have a proper healthcheck endpoint implemented for nova-compute and let the container starting logic in edpm_ansible waiting for nova-compute to report a healthy state before returning (including the compute node creation). This way when the DataPlaneDeployment returns Ready=True, all the computes are in a discoverable state. However this still fairly fare from the product perspective as the work is ongoing upstream in https://blueprints.launchpad.net/nova/+spec/per-process-healthchecks (and probably won't be fully ready in Epoxy either).

So while the retry in #2459 is a workaround, it is not something we will be able to remove soon. Keeping it as an optional hook makes the problem widespread as not all job has this workaround enabled and creates a load of triaging CI issues due to this race condition. I suggest add the logic of the workaround to the main codepath of the ci-framework, or at least make this workaround enabled by default in every job that does EDPM deployment.

@rdiazcam Would you like to update you PR to match the logic? My initial thoughts would be creating a tiny role that handles discover_hosts would be a nice route to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants