Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable high availability (HA) configuration for ASO #4445

Merged
merged 6 commits into from
Dec 6, 2024

Conversation

bingikarthik
Copy link
Contributor

What this PR does

Closes #4215
As we adopt Azure Service Operator (ASO), it's essential to run multiple replicas for production workloads, especially during cluster upgrade operations. This PR enables high availability (HA) for ASO by increasing the replica count and implementing a Pod Disruption Budget (PDB) to maintain consistent uptime and resilience during potential disruptions.

How does this PR make you feel?

gif

Checklist

  • this PR contains documentation
  • this PR contains tests
  • this PR contains YAML Samples

config/manager/manager.yaml Outdated Show resolved Hide resolved
config/default/manager_pod_disruption_budget.yaml Outdated Show resolved Hide resolved
config/default/manager_pod_disruption_budget.yaml Outdated Show resolved Hide resolved
Copy link
Member

@theunrepentantgeek theunrepentantgeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will things behave when upgrading ASO to the next version?

I'm worried about the scenario where the version N+1 of ASO deploys new CRD versions which (still running) version N doesn't understand.

In this scenario, since we use the newest resource version as the storage (hub) version of the resource, any resource that has been touched by the version N+1 will be unintelligible to version N, resulting in a panic/crash.

@bingikarthik
Copy link
Contributor Author

@matthchr I've made the requested changes. When you have a moment, could you kindly review the PR?

@matthchr
Copy link
Member

matthchr commented Nov 18, 2024

@bingikarthik - thanks! I think we need to make a deployment rollout change to make this safe (as @theunrepentantgeek called out). I'll send a separate PR for that and link it here, and then once that merges we can make sure this change is safe w/ it, and assuming it is, merge this too.

Thanks for your patience

@nishant221
Copy link
Contributor

Does ASO support "leader" kind of approach? If multiple replicas are running (and reconciling independently), can the same request can be processed by multiple instances of ASO resulting in duplicate calls to Azure?

@bingikarthik
Copy link
Contributor Author

Does ASO support "leader" kind of approach? If multiple replicas are running (and reconciling independently), can the same request can be processed by multiple instances of ASO resulting in duplicate calls to Azure?

Yes, it was indeed. Please check: https://github.com/Azure/azure-service-operator/blob/main/v2/charts/azure-service-operator/templates/apps_v1_deployment_azureserviceoperator-controller-manager.yaml#L59
https://github.com/Azure/azure-service-operator/blob/main/main.go#L63

@matthchr
Copy link
Member

had a few comments - once they're fixed I'll kick off CI and we can merge this.

@theunrepentantgeek
Copy link
Member

/ok-to-test sha=2857eec

@theunrepentantgeek theunrepentantgeek added this pull request to the merge queue Dec 6, 2024
@theunrepentantgeek
Copy link
Member

Thanks @bingikarthik, much appreciated.

Merged via the queue into Azure:main with commit d1ba547 Dec 6, 2024
7 checks passed
@theunrepentantgeek theunrepentantgeek added this to the v2.12.0 milestone Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Ready for Release
Development

Successfully merging this pull request may close these issues.

Change default pod count to 2
5 participants