
traffic is switched before replicaset is fully available when using rollbackWindow #3941

Open
staffanselander opened this issue Nov 13, 2024 · 5 comments
Labels
bug Something isn't working traffic-routing

Comments


staffanselander commented Nov 13, 2024

Describe the bug

We had an incident last Sunday. A team rolled out a new release using the canary strategy provided by Argo Rollouts.

The canary finished successfully and eventually transitioned to stable. Afterwards, the team discovered a bug and decided to roll back the release.

As the previous deployment was within the "rollback window", traffic was switched as soon as a single replica in the "new" replicaset became available. However, the replicaset was still scaling up to match the replica count of the previous replicaset, so it could not handle the load and stopped responding.

The service in question has a fairly long and nondeterministic start-up time, which makes this issue more visible.

To Reproduce

1: Create a Rollout resource using the canary strategy

  • With rollback window:
    • Example: spec.rollbackWindow.revisions: 5
    • Example: spec.revisionHistoryLimit: 5
  • With trafficRouting:
    • Example: trafficRouting.traefik.weightedTraefikServiceName: xxx

2: Roll out a new change

Apply a change that starts a canary deployment and wait until it is fully promoted and the old replicaset is scaled down.

3: Roll back the change

Apply a "rollback", or modifications that match the old replicaset's pod template.

Note:

  • When reproducing the issue, adding a random startup delay to the service makes the problem more visible.
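For reference, the setup in step 1 can be sketched as a manifest like the following. All names, the image, the replica count, and the canary steps are illustrative placeholders, not taken from the incident:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: backend-podinfo
spec:
  replicas: 6
  revisionHistoryLimit: 5
  rollbackWindow:
    revisions: 5
  selector:
    matchLabels:
      app: backend-podinfo
  template:
    metadata:
      labels:
        app: backend-podinfo
    spec:
      containers:
        - name: podinfo
          image: stefanprodan/podinfo:6.7.0   # placeholder image/tag
  strategy:
    canary:
      canaryService: backend-podinfo-canary   # placeholder Service names
      stableService: backend-podinfo-stable
      trafficRouting:
        traefik:
          weightedTraefikServiceName: backend-podinfo   # placeholder TraefikService
      steps:
        - setWeight: 20
        - pause: {duration: 1m}
```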

Expected behavior

I would expect the replicaset to become fully available before traffic is switched back to the "old" replicaset, or at least an option that enables this behaviour.

Version

  • Controller version: v1.7.2

Logs

Logs are from a local environment where the issue was later reproduced.

time="2024-11-13T12:00:33Z" level=info msg="Rollback within the window: 0 (5)" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Rollout completed update to revision 16 (57bd56cdd8): Rollback within window" event_reason=RolloutCompleted namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"podinfo\", Name:\"backend-podinfo\", UID:\"08eb9dd2-8504-4540-bb84-2177f0a22206\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"71866\", FieldPath:\"\"}): type: 'Normal' reason: 'RolloutCompleted' Rollout completed update to revision 16 (57bd56cdd8): Rollback within window"
time="2024-11-13T12:00:33Z" level=info msg="Patched: {\"status\":{\"availableReplicas\":6,\"conditions\":[{\"lastTransitionTime\":\"2024-11-13T07:17:01Z\",\"lastUpdateTime\":\"2024-11-13T07:17:01Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-11-13T11:36:53Z\",\"lastUpdateTime\":\"2024-11-13T11:36:53Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-11-13T11:54:18Z\",\"lastUpdateTime\":\"2024-11-13T11:54:18Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-11-13T11:57:31Z\",\"lastUpdateTime\":\"2024-11-13T12:00:33Z\",\"message\":\"ReplicaSet \\\"backend-podinfo-57bd56cdd8\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2024-11-13T12:00:33Z\",\"lastUpdateTime\":\"2024-11-13T12:00:33Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"}],\"message\":null,\"phase\":\"Healthy\",\"readyReplicas\":6,\"stableRS\":\"57bd56cdd8\"}}" generation=22 namespace=podinfo resourceVersion=71866 rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="persisted to informer" generation=22 namespace=podinfo resourceVersion=71978 rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Reconciliation completed" generation=22 namespace=podinfo resourceVersion=71866 rollout=backend-podinfo time_ms=31.628003000000003
time="2024-11-13T12:00:33Z" level=info msg="Start processing" resource=podinfo/backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Processing completed" resource=podinfo/backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Started syncing rollout" generation=22 namespace=podinfo resourceVersion=71978 rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Cleaning up 6 old replicasets from revision history limit 5" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Looking to cleanup old replica sets" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Trying to cleanup replica set \"backend-podinfo-7b6f4759c9\"" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Enqueueing parent of podinfo/backend-podinfo-7b6f4759c9: Rollout podinfo/backend-podinfo"
time="2024-11-13T12:00:33Z" level=info msg="Switched selector for service 'backend-podinfo-stable' from '5bffc65f84' to '57bd56cdd8'" event_reason=SwitchService namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Reconciling TrafficRouting with type 'Traefik'" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="syncing service" namespace=podinfo rollout=backend-podinfo service=backend-podinfo-stable
time="2024-11-13T12:00:33Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"podinfo\", Name:\"backend-podinfo\", UID:\"08eb9dd2-8504-4540-bb84-2177f0a22206\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"71978\", FieldPath:\"\"}): type: 'Normal' reason: 'SwitchService' Switched selector for service 'backend-podinfo-stable' from '5bffc65f84' to '57bd56cdd8'"
time="2024-11-13T12:00:33Z" level=info msg="Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:100,ServiceName:backend-podinfo-canary,PodTemplateHash:57bd56cdd8,},Stable:WeightDestination{Weight:0,ServiceName:backend-podinfo-stable,PodTemplateHash:5bffc65f84,},Additional:[]WeightDestination{},Verified:nil,}" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:backend-podinfo-canary,PodTemplateHash:57bd56cdd8,},Stable:WeightDestination{Weight:100,ServiceName:backend-podinfo-stable,PodTemplateHash:57bd56cdd8,},Additional:[]WeightDestination{},Verified:nil,}" namespace=podinfo rollout=backend-podinfo
time="2024-11-13T12:00:33Z" level=info msg="Traffic weight updated from 100 to 0" event_reason=TrafficWeightUpdated namespace=podinfo rollout=backend-podinfo

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@staffanselander staffanselander added the bug Something isn't working label Nov 13, 2024

staffanselander commented Nov 13, 2024

I'll be working on a patch that delays traffic switching until the "old" replicaset is fully available. I'd be happy to contribute it upstream.

From my first "naive" look at the source code, I would add an additional condition after this block:

if dynamicallyRollingBackToStable, currSelector := isDynamicallyRollingBackToStable(c.rollout, c.newRS); dynamicallyRollingBackToStable {
    // User may have interrupted an update in order to go back to stableRS, and is using dynamic
    // stable scaling. If that is the case, the stableRS might be undersized and if we blindly
    // switch service selector we could overwhelm stableRS pods.
    // If we get here, we detected that the canary service needs to be pointed back to
    // stable, but stable is not fully available. Skip the service switch for now.
    c.log.Infof("delaying fully promoted service switch of '%s' from %s to %s: ReplicaSet '%s' not fully available",
        c.rollout.Spec.Strategy.Canary.CanaryService, currSelector, replicasetutil.GetPodTemplateHash(c.newRS), c.newRS.Name)
    return nil
}

No idea what the best naming for the configuration would be though:

.spec.rollbackWindow.requireAvailability: full
.spec.rollbackWindow.trafficSwitchWhenAvailability: full | partial

Please make some recommendations here :')

@zachaller
Collaborator

Is your rollout configured with dynamicStableScale: true?

@zachaller
Collaborator

I would actually also check this code path:

} else if c.isRollbackWithinWindow() && replicasetutil.IsActive(c.newRS) {

IsActive seems to possibly not be doing a thorough enough check.

@staffanselander
Author

Is your rollout configured with dynamicStableScale: true?

No, dynamicStableScale is left at its default value of false.

I would actually also check this code path:

} else if c.isRollbackWithinWindow() && replicasetutil.IsActive(c.newRS) {

IsActive seems to possibly not be doing a thorough enough check.

Thank you for pointing that out, I'll have a look there as well.

@staffanselander
Author

I wonder if #3878 will solve this issue as well. I'll check it out.
