feat: implement clusterStagedUpdateRun execution #1000

Open · wants to merge 2 commits into main

Conversation

jwtty (Contributor) commented Dec 23, 2024

Description of your changes

Implement clusterStagedUpdateRun execution. This also includes some refactoring.

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

continue
}
// The cluster status is not deleting yet
if err := r.Client.Delete(ctx, binding); err != nil {
Contributor Author

Compared with the original implementation, I did not check whether the binding is up-to-date before deleting it. I don't think that's necessary. Do we want that check?

Contributor

We do support multiple update runs in parallel; I am not sure if there is any race condition where we may delete some other run's binding.

Contributor Author

In case of both resource changes and scheduling policy changes, the binding to be deleted may not have up-to-date resources. If there are concurrent update runs, the same binding could be deleted by either updateRun. If a new scheduling policy change marks this to-be-deleted binding as scheduled again, this updateRun can fail during validation. And if there's a chance that such a change happens right after validation passes and before the bindings are deleted, I can add a check to make sure the binding's state is still unScheduled, and report an unexpected error if it is not.
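For illustration, a minimal sketch of such a check (the helper name deleteUnscheduledBindings is hypothetical and the import paths are assumed; errStagedUpdatedAborted and controller.NewUnexpectedBehaviorError already exist in this package, and Spec.State / BindingStateUnscheduled are assumed from the current fleet placement v1beta1 API):

package updaterun

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/klog/v2"
	"sigs.k8s.io/controller-runtime/pkg/client"

	placementv1beta1 "go.goms.io/fleet/apis/placement/v1beta1"
	"go.goms.io/fleet/pkg/utils/controller"
)

// deleteUnscheduledBindings deletes the to-be-deleted bindings, refusing to touch any
// binding that a concurrent scheduling change has re-claimed after validation passed.
func deleteUnscheduledBindings(ctx context.Context, c client.Client, bindings []*placementv1beta1.ClusterResourceBinding) error {
	for _, binding := range bindings {
		if binding.Spec.State != placementv1beta1.BindingStateUnscheduled {
			// The binding is no longer unscheduled; deleting it now could race with another update run.
			unexpectedErr := controller.NewUnexpectedBehaviorError(
				fmt.Errorf("binding %s is no longer unscheduled, refusing to delete it", binding.Name))
			klog.ErrorS(unexpectedErr, "Aborting the delete stage", "binding", klog.KObj(binding))
			return fmt.Errorf("%w: %s", errStagedUpdatedAborted, unexpectedErr.Error())
		}
		if err := c.Delete(ctx, binding); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}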

Contributor Author

I do have a concern: in the rollout controller, we make sure to complete the deletion of the bindings on a cluster before scheduling a newly created binding on the same cluster: https://github.com/Azure/fleet/blob/main/pkg/controllers/rollout/controller.go#L226. I wonder if we should have similar logic in the updateRun controller too.
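To illustrate the kind of safeguard being discussed (a rough sketch only; the helper name and the use of Spec.TargetCluster / DeletionTimestamp are assumptions, not code from this PR), the updateRun controller could collect the clusters that still have a binding mid-deletion and requeue instead of touching a newer binding on those clusters:

// clustersWithPendingDeletion returns the clusters that still have a binding in the
// middle of deletion; the caller would requeue for those clusters rather than update
// a newer binding on them, mirroring the rollout controller's wait-for-cleanup logic.
func clustersWithPendingDeletion(deletingBindings []*placementv1beta1.ClusterResourceBinding) map[string]bool {
	pending := make(map[string]bool, len(deletingBindings))
	for _, binding := range deletingBindings {
		if !binding.DeletionTimestamp.IsZero() {
			// The binding still has finalizers to clear; its target cluster is not safe to reuse yet.
			pending[binding.Spec.TargetCluster] = true
		}
	}
	return pending
}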

klog.V(2).InfoS("Executing the clusterStagedUpdateRun", "clusterStagedUpdateRun", runObjRef, "updatingStageIndex", updatingStageIndex,
"toBeUpdatedBindings count", len(toBeUpdatedBindings), "toBeDeletedBindings count", len(toBeDeletedBindings))
return runtime.Result{RequeueAfter: stageUpdatingWaitTime}, nil
// The previous run is completed but the update to the status failed.
Contributor

There was a check on the staged run status that got removed. I don't think this comment is accurate without that check.

Contributor Author

The check I removed covers the case when all the updating stages have finished and the deletion stage is about to start; in that case, updatingStageIndex == len(updatingStages). When updatingStageIndex == -1, all the stages, including the deletion stage, have finished but the recordUpdateRunSucceeded call on line 143 somehow failed.
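For readers following the thread, a simplified illustration of that index convention (the function and its string results are made up for the example; only recordUpdateRunSucceeded is a real helper in the controller):

// nextExecutionStep maps the special updatingStageIndex values onto the action the
// controller takes in the next reconcile.
func nextExecutionStep(updatingStageIndex, numUpdatingStages int) string {
	switch {
	case updatingStageIndex == -1:
		// Every stage, including the delete stage, already finished; only the earlier
		// recordUpdateRunSucceeded status update failed, so it just needs to be retried.
		return "retry marking the update run as succeeded"
	case updatingStageIndex == numUpdatingStages:
		// All updating stages are done; the delete stage runs next.
		return "execute the delete stage"
	default:
		// Still working through the updating stages.
		return "execute the current updating stage"
	}
}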

pkg/controllers/updaterun/execution.go (resolved comment, collapsed)
}
klog.V(2).InfoS("Updated the status of a binding to bound", "binding", klog.KObj(binding), "cluster", clusterStatus.ClusterName, "stage", updatingStageStatus.StageName, "clusterStagedUpdateRun", updateRunRef)
} else {
if _, updateErr := checkClusterUpdateResult(binding, clusterStatus, updatingStageStatus, updateRun); updateErr != nil {
Contributor

are those errors always retriable?

Contributor Author

For now, yes, because we currently do not identify any terminal errors for cluster updates (I have a TODO in this function). But even in the future, when we can identify some, I'll return an errExecutionAborted inside the checkClusterUpdateResult function and can still return the updateErr directly.
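A sketch of that future direction (hypothetical; the isTerminal classifier is passed in because nothing like it exists yet, and the snippet assumes the package's fmt import and the existing errExecutionAborted sentinel): wrap the terminal failure with errExecutionAborted where it is detected, so callers keep returning updateErr unchanged and the run still aborts.

// wrapIfTerminal shows how a terminal cluster-update failure could be wrapped with
// errExecutionAborted inside checkClusterUpdateResult, leaving the caller's error
// handling exactly as it is today.
func wrapIfTerminal(err error, isTerminal func(error) bool) error {
	if err != nil && isTerminal(err) {
		// Terminal failures stop the whole update run instead of being retried.
		return fmt.Errorf("%w: %s", errExecutionAborted, err.Error())
	}
	return err
}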

Comment on lines -308 to -317
// The delete stage is still updating.
if condition.IsConditionStatusTrue(deleteStageProgressingCond, updateRun.Generation) {
klog.InfoS("The delete stage is updating", "clusterStagedUpdateRun", updateRunRef)
return totalStages, nil
}
// All stages have finished, but the delete stage is not active or finished.
unexpectedErr := controller.NewUnexpectedBehaviorError(fmt.Errorf("the delete stage is not active, but all stages finished"))
klog.ErrorS(unexpectedErr, "There is no stage active", "clusterStagedUpdateRun", updateRunRef)
return -1, fmt.Errorf("%w: %s", errStagedUpdatedAborted, unexpectedErr.Error())
}
Contributor

why delete this check?

Contributor Author

@jwtty jwtty Dec 26, 2024

After the last updating stage completes, execution does not mark the deletion stage as started. Instead, it returns a 0 wait-time and moves on to the delete stage in the next reconcile loop, just like finishing a regular updating stage and moving on to the next updating stage. With this check, the validation in the next reconcile would fail because the deletion stage is not marked as started yet, so I removed this part.
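Roughly, the flow being described (a sketch, not the PR's literal code; waitBeforeNextReconcile and its boolean parameter are invented for illustration, a time import is assumed, and stageUpdatingWaitTime is the existing package constant):

// waitBeforeNextReconcile picks the requeue delay after finishing a stage's work in this pass.
func waitBeforeNextReconcile(finishedAllUpdatingStages bool) time.Duration {
	if finishedAllUpdatingStages {
		// Requeue immediately; the delete stage is picked up on the next reconcile.
		return 0
	}
	// Otherwise keep polling the in-progress updating stage.
	return stageUpdatingWaitTime
}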

pkg/controllers/updaterun/execution_integration_test.go (outdated comment, resolved)