Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: preserve NVIPAM CM after upgrade #672

Merged
merged 1 commit into from
Nov 16, 2023
Merged

Conversation

rollandf
Copy link
Member

@rollandf rollandf commented Nov 14, 2023

In order to allow migration from config map to IP Pool CRs in NV IPAM, the config map needs to be kept.

  • Removing the state label will make sure it is not deleted as a stale object.
  • Remove the helm hook annotation on NiClusterPolicy CR template, that causes delete and recreate of CR and the delete-cr hook. Note that this will prevent CR validation in install stage for admission controller.
  • Make sure to stop existing Network Operator before updating CR, so that it does not delete the config map since config is removed.

@rollandf rollandf added the on hold This enhancement is currently on hold pending additional clarification and evaluation label Nov 14, 2023
ykulazhenkov
ykulazhenkov previously approved these changes Nov 14, 2023
@e0ne
Copy link
Collaborator

e0ne commented Nov 14, 2023

/retest-nic_operator_helm

@rollandf rollandf force-pushed the keep-ipam-cm branch 2 times, most recently from 25e892f to 6dc56e7 Compare November 14, 2023 21:36
@rollandf rollandf removed the on hold This enhancement is currently on hold pending additional clarification and evaluation label Nov 14, 2023
@rollandf rollandf changed the title fix: remove state label on NVIPAM CM fix: preserve NVIPAM CM after upgrade Nov 15, 2023
@@ -46,6 +49,10 @@ func migrate(ctx context.Context, log logr.Logger, c client.Client) error {
// not critical for the operator operation, safer to ignore
log.V(consts.LogLevelWarning).Info("ignore error during whereabouts CronJob removal")
}
if err := removeStateLabelFromNVIpamConfigMap(ctx, log, c); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to return error if we ignore it? We've got logging inside removeStateLabelFromNVIpamConfigMap so no need to return an error

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the same pattern as removeWhereaboutsIPReconcileCronJob.
WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep it as is. removeStateLabelFromNVIpamConfigMap function do the action and an error can happen during this action - it is fair to return the error from the function. It is up to caller to decide what to do with the error.

Copy link
Collaborator

@ykulazhenkov ykulazhenkov Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this should not block the PR

Copy link
Collaborator

@ykulazhenkov ykulazhenkov Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have one more thing. It is pretty safe to run removeWhereaboutsIPReconcileCronJob in a "best effort" mode and ignore any error if it happens (we will have an unneeded Job in the cluster, but this will not break anything). But with removeStateLabelFromNVIpamConfigMap, we are in a bit different situation - if we fail to remove the label, we will probably lose all nv-ipam allocations. Maybe we should crash in this case and rely on k8s to restart the Pod? WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, returning error to main.

@ykulazhenkov ykulazhenkov self-requested a review November 15, 2023 07:10
Copy link
Collaborator

@e0ne e0ne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove 'request changes' label since we've got the same patter in other part of code

@ykulazhenkov ykulazhenkov dismissed their stale review November 15, 2023 07:58

updates in the PR

Copy link
Collaborator

@ykulazhenkov ykulazhenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the PR on my env. I tried to upgrade from 23.7 to the version from this PR. All nv-ipam allocations were preserved. I added one comment about the error handling for annotation removal call - maybe we should react on an error

@ykulazhenkov
Copy link
Collaborator

/retest-all

@e0ne
Copy link
Collaborator

e0ne commented Nov 15, 2023

/retest-image_scan

Copy link
Collaborator

@e0ne e0ne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It could be merged once CI pass

log.V(consts.LogLevelError).Error(err, "fail to get NVIPAM configmap")
return err
}
log.V(consts.LogLevelDebug).Info("clear State label from NVIPAM configmap")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we first check the label exists ? or will patch op suceed if label is not there ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a check before the patch

In order to allow migration from config map
to IP Pool CRs in NV IPAM, the config map needs to be kept.

- Removing the state label will make sure it is not deleted as
a stale object.

- Remove the helm hook annotation on NiClusterPolicy CR template,
  that causes delete and recreate of CR and the delete-cr hook.
  Note that this will prevent CR validation in install stage.

- Make sure to stop existing Network Operator before updating CR,
  so that it does not delete the config map since config is removed.

Signed-off-by: Fred Rolland <[email protected]>
@e0ne e0ne merged commit 5c901b3 into Mellanox:master Nov 16, 2023
10 checks passed
@rollandf rollandf deleted the keep-ipam-cm branch February 28, 2024 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants