Bug: Reconciling VirtualNetworksSubnet fails with "Request entity too large: limit is 3145728" #4428
Comments
Can you share what the spec for the subnet looks like, as managed by CAPZ? |
I think the issue we've got here is the fact that there are 14k entries for the There is also a max resource size boundary for Azure, I believe, but I think it's 4 MB, not the 1.5 MB which AFAIK is the default on Kubernetes. |
AMCP resource:
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedControlPlane
spec:
  virtualNetwork:
    cidrBlock: 10.0.0.0/16
    name: example-cluster-vnet
    resourceGroup: example-cluster-rg
    subnet:
      cidrBlock: 10.0.0.0/16
      name: example-cluster-subnet
      serviceEndpoints:
      - locations:
        - '*'
        service: Microsoft.Sql
      - locations:
        - '*'
        service: Microsoft.KeyVault
      - locations:
        - '*'
        service: Microsoft.Storage
      - locations:
        - '*'
        service: Microsoft.AzureCosmosDB
      - locations:
        - '*'
        service: Microsoft.ServiceBus
      - locations:
        - '*'
        service: Microsoft.EventHub
And the Subnet it creates:
apiVersion: network.azure.com/v1api20201101
kind: VirtualNetworksSubnet
spec:
  addressPrefix: 10.0.0.0/16
  addressPrefixes:
  - 10.0.0.0/16
  azureName: example-cluster-subnet
  owner:
    name: example-cluster-vnet
  serviceEndpoints:
  - locations:
    - '*'
    service: Microsoft.Sql
  - locations:
    - '*'
    service: Microsoft.KeyVault
  - locations:
    - '*'
    service: Microsoft.Storage
  - locations:
    - '*'
    service: Microsoft.AzureCosmosDB
  - locations:
    - '*'
    service: Microsoft.ServiceBus
  - locations:
    - '*'
    service: Microsoft.EventHub
|
I looked at this some more and I think this comes down to a mismatch between the allowed max size of an Azure resource (which I think is somewhere in the 4 MB range) and the allowed max size of a Kubernetes resource, which is ~1.5 MB. Since we fundamentally cannot fit this much data into etcd, there's not really much we can do here other than elide the @nojnhuh - is CAPZ using |
It is not, so however you handle that should work for CAPZ. |
Hey @matthchr, thanks so much for looking into this. Regarding the etcd limit, the problem seems to manifest in different ways depending on the size of the object in Azure. Note that in the original ticket I opened in CAPZ, the error was different and it came from etcd:
Also note that in that case the Subnet was not as large: when the error was observed, the subnet size was around 2.9 MB. Now the subnet object in Azure has reached around 5.6 MB and the error seems to come from the Kubernetes API server itself. This limit is hardcoded in more than one place, e.g. here. I think in this case the object did not reach etcd. |
Thanks @danilo404 - I suppose a more precise phrasing of the problem is not so much etcd but: Azure allows larger resources than Kubernetes. I think once the etcd limit is crossed it won't work in k8s, though I didn't know about the hardcoded apiserver limit that ends up giving a different error if the request gets large enough. |
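The two failure modes described above can be sketched as a simple size check. This is a minimal illustration, not Kubernetes or ASO code: the API server figure is the one quoted in this issue's title, the etcd figure assumes etcd's default request size limit of 1.5 MiB, and the function name first_rejecting_layer is hypothetical. It also treats the object size in Azure as if it were the request size, which ignores serialization differences.

```python
# Limits in bytes. APISERVER_LIMIT is the value from the error message in the
# issue title; ETCD_DEFAULT_LIMIT assumes etcd's default request size limit.
APISERVER_LIMIT = 3 * 1024 * 1024   # 3145728
ETCD_DEFAULT_LIMIT = 1_572_864      # 1.5 MiB


def first_rejecting_layer(object_bytes: int) -> str:
    """Return which layer would be expected to reject a write of this size."""
    if object_bytes > APISERVER_LIMIT:
        # Too large for the API server, so the request never reaches etcd.
        return "apiserver"
    if object_bytes > ETCD_DEFAULT_LIMIT:
        # Accepted by the API server but too large for etcd to store.
        return "etcd"
    return "none"


# Approximate sizes reported in the thread (decimal megabytes assumed).
for label, size in [("2.9 MB subnet", 2_900_000), ("5.6 MB subnet", 5_600_000)]:
    print(label, "->", first_rejecting_layer(size))
```

Under these assumptions the 2.9 MB object lands between the two limits, which would match the etcd error from the original CAPZ ticket, while the 5.6 MB object exceeds the API server limit, which would match the error in this issue.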
In terms of plan to fix this, it didn't make 2.11.0 (which has already shipped). I think we can try getting a fix merged before most of us go on holiday, which could enable consumption of the fix via the experimental release, but official release will probably need to wait until next year. There's also the added wrinkle of CAPZ using a slightly older version of ASO which may delay uptake in vanilla CAPZ as well. Unfortunately I don't really see a workaround for this problem other than "keep the cluster small" in the meantime, though possibly this issue isn't actually breaking things severely if CAPZ isn't trying to update the subnet? Can you share what the impact is to you @danilo404, and if you have any workaround to it currently? |
Thanks for the update @matthchr. We don't have workarounds for this case, but the impact for now is not blocking. What happens now is that the CAPZ object |
Ok the experimental build should have a fix for this now @danilo404. |
Describe the bug
The bug manifests on our cluster created with the following networking parameters:
And it has 20 Agent Pools, with the following sizes:
az aks show --subscription ExampleSubscription -n example-cluster-name -g example-cluster-name-rg -o table --query "agentPoolProfiles[].{Count: count, maxCount: maxCount, maxPods: maxPods}"
Count    MaxCount    MaxPods
-------  ----------  ---------
0        3           20
5        7           150
0        3           80
2        5           110
0        50          100
36       60          100
27       100         100
12       20          100
11       33          110
1        4           110
3        8           80
0        0           100
5        10          100
2        7           100
4        30          100
5        30          100
0        3           20
15       30          100
3        3           20
2        7           80
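As a rough cross-check of the numbers in the az aks show output above, the following sketch sums Count x MaxPods over the 20 pools. It assumes each running node reserves one IP configuration per potential pod in the subnet; the cluster's networking parameters are not shown here, so that is an assumption rather than something confirmed by this issue.

```python
# (count, max_pods) pairs copied from the az aks show output above.
pools = [
    (0, 20), (5, 150), (0, 80), (2, 110), (0, 100),
    (36, 100), (27, 100), (12, 100), (11, 110), (1, 110),
    (3, 80), (0, 100), (5, 100), (2, 100), (4, 100),
    (5, 100), (0, 20), (15, 100), (3, 20), (2, 80),
]

# Total running nodes across all agent pools.
nodes = sum(count for count, _ in pools)

# Assumed upper bound on pod IP configurations: one per potential pod per node.
pod_ip_configs = sum(count * max_pods for count, max_pods in pools)

print("nodes:", nodes)
print("pod IP configurations (assumed):", pod_ip_configs)
```

The result is in the same ballpark as the 14k entries mentioned in the comments, which is consistent with the subnet filling up as the pools scale.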
CAPZ created a VirtualNetworksSubnet ASO CR for that cluster with the following configuration:
When the AgentPools reach somewhere close to the "counts" above, the VirtualNetworksSubnet object in Azure grows in size to around 5.6 MB; it fills up with thousands of entries in the ipConfigurations field:
ASO then tries to persist the ipConfigurations into the VirtualNetworksSubnet CR's status and this causes the API server to return:
Azure Service Operator Version: v2.8.0
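To illustrate how thousands of ipConfigurations entries can push a status past the 3145728-byte limit, here is a minimal sketch. The resource ID shape, the helper names, and the idea of dropping the list before persisting are all assumptions made for illustration; this is not ASO's code or its actual fix, and real IDs will differ in length.

```python
import json

APISERVER_LIMIT = 3145728  # bytes, from the error message in the issue title


def fake_ip_configuration(i: int) -> dict:
    """Build one hypothetical ipConfigurations entry with a made-up resource ID."""
    return {
        "id": (
            "/subscriptions/00000000-0000-0000-0000-000000000000"
            "/resourceGroups/example-nodes-rg"
            "/providers/Microsoft.Compute/virtualMachineScaleSets/example-vmss"
            f"/virtualMachines/{i}/networkInterfaces/example-nic"
            f"/ipConfigurations/ipconfig{i}"
        )
    }


def build_status(entries: int) -> dict:
    """Build a hypothetical subnet status holding the given number of entries."""
    return {
        "addressPrefix": "10.0.0.0/16",
        "ipConfigurations": [fake_ip_configuration(i) for i in range(entries)],
    }


def serialized_size(obj: dict) -> int:
    """Size in bytes of the object when serialized as JSON."""
    return len(json.dumps(obj).encode("utf-8"))


def without_ip_configurations(status: dict) -> dict:
    """Return a copy of the status with the large list dropped."""
    return {k: v for k, v in status.items() if k != "ipConfigurations"}


status = build_status(14_000)
print("over limit with entries:", serialized_size(status) > APISERVER_LIMIT)
trimmed = without_ip_configurations(status)
print("over limit without entries:", serialized_size(trimmed) > APISERVER_LIMIT)
```

With IDs of roughly this length, about 14,000 entries are enough to exceed the limit on their own, while the rest of the status stays tiny.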
Expected behavior
The VirtualNetworksSubnet should continue reconciling successfully at any size my Agent Pools scale to.
To Reproduce
Create a VirtualNetworksSubnet CR for an Azure Cloud Subnet with a large number of ipConfigurations and wait for the controller to attempt to sync it.
Additional context
This issue relates to another issue in the CAPZ project kubernetes-sigs/cluster-api-provider-azure#4649