What happened:
We have enabled Security Groups for Pods, which means each pod must be assigned a branch network interface before it can start. We have observed pods stuck in the ContainerCreating state with the following error:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox
This is because the node has run out of branch interfaces to assign to new pods: the Kubernetes scheduler believes branch interfaces are still available on the node when in fact all of them are occupied.
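To illustrate the mismatch, this is roughly how the node's advertised branch-interface capacity can be compared with the pods scheduled on it (a minimal sketch; <node-name> is a placeholder):

```shell
# What the scheduler sees: the node's vpc.amazonaws.com/pod-eni capacity/allocatable
kubectl describe node <node-name> | grep -i 'vpc.amazonaws.com/pod-eni'

# Pods currently scheduled on that node (each Security Groups for Pods pod holds one branch interface)
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
```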
We noticed that the logs indicate ENIs are not being cleaned up properly when a pod is deleted.
/var/log/aws-routed-eni/ipamd.log:
{"level":"info","ts":"2024-08-28T19:50:51.147Z","caller":"rpc/rpc.pb.go:881","msg":"Received DelNetwork for Sandbox 8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3"}
{"level":"debug","ts":"2024-08-28T19:50:51.147Z","caller":"rpc/rpc.pb.go:881","msg":"DelNetworkRequest: K8S_POD_NAME:\"test-deployment-7fdb64448-g4dnq\" K8S_POD_NAMESPACE:\"sean-scale\" K8S_POD_INFRA_CONTAINER_ID:\"8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3\" Reason:\"PodDeleted\" ContainerID:\"8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2024-08-28T19:50:51.147Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: IP address pool stats: total 28, assigned 4, sandbox aws-cni/8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3/eth0"}
{"level":"debug","ts":"2024-08-28T19:50:51.147Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2024-08-28T19:50:51.147Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3/unknown"}
{"level":"warn","ts":"2024-08-28T19:50:51.147Z","caller":"rpc/rpc.pb.go:881","msg":"Send DelNetworkReply: Failed to get pod spec: error while trying to retrieve pod info: Pod \"test-deployment-7fdb64448-g4dnq\" not found"}
/var/log/aws-routed-eni/plugin.log:
{"level":"info","ts":"2024-08-28T19:38:51.056Z","caller":"routed-eni-cni-plugin/cni.go:314","msg":"Received CNI del request: ContainerID(2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f) Netns() IfName(eth0) Args(K8S_POD_INFRA_CONTAINER_ID=2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f;K8S_POD_UID=14582d59-8109-41bd-8028-ea0215dd75dc;IgnoreUnknown=1;K8S_POD_NAMESPACE=sean-scale;K8S_POD_NAME=test-deployment-7fdb64448-kr9jm) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"standard\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"error","ts":"2024-08-28T19:38:51.058Z","caller":"routed-eni-cni-plugin/cni.go:314","msg":"Error received from DelNetwork gRPC call for container 2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f: rpc error: code = Unknown desc = error while trying to retrieve pod info: Pod \"test-deployment-7fdb64448-kr9jm\" not found"}
{"level":"info","ts":"2024-08-28T19:38:51.058Z","caller":"routed-eni-cni-plugin/cni.go:393","msg":"PrevResult not available for pod. Pod may have already been deleted."}
{"level":"info","ts":"2024-08-28T19:38:51.058Z","caller":"routed-eni-cni-plugin/cni.go:314","msg":"Could not teardown pod using prevResult: ContainerID(2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f) Netns() IfName(eth0) PodNamespace(sean-scale) PodName(test-deployment-7fdb64448-kr9jm)"}
By searching the EC2 network interfaces we can see that the branch interfaces still exist for about a minute after scaling down the pods, and they are only cleaned up by garbage collection.
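For reference, a rough way to watch the branch ENIs linger after scale-down (illustrative only; the aws-k8s-branch-eni description filter is an assumption, adjust it to match the description or tags your branch ENIs actually carry):

```shell
# Re-run this after scaling the deployment to 0; the branch ENIs remain for ~60s.
# NOTE: the description value is an assumption -- verify it against your own ENIs.
aws ec2 describe-network-interfaces \
  --filters "Name=description,Values=aws-k8s-branch-eni" \
  --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Status:Status}' \
  --output table
```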
What you expected to happen:
Once a pod is deleted, its branch interface should be cleaned up and immediately made available for use by another pod.
How to reproduce it (as minimally and precisely as possible):
Enable Security Groups for Pods (SGFP) by setting ENABLE_POD_ENI = "true" on the aws-node daemonset
Add the following request to a deployment (see the illustrative sketch after these steps)
Scale up the deployment so that it is assigned branch interfaces
Scale the deployment back to 0
Observe that branch interfaces are still attached for ~60 seconds after pod deletion
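A minimal sketch of this reproduction. The namespace, names, replica count, and image are placeholders, and the explicit vpc.amazonaws.com/pod-eni limit is only an assumption of what the request added to the deployment looks like, since the original snippet is not reproduced here:

```yaml
# Illustrative pod-template fragment for the deployment -- placeholders throughout;
# the explicit pod-eni limit is an assumption
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/nginx:stable   # placeholder image
          resources:
            limits:
              vpc.amazonaws.com/pod-eni: "1"
```

```shell
# Enable Security Groups for Pods on the VPC CNI, then scale up and back down
kubectl set env daemonset aws-node -n kube-system ENABLE_POD_ENI=true
kubectl -n sean-scale scale deployment test-deployment --replicas=30
kubectl -n sean-scale scale deployment test-deployment --replicas=0
# The branch ENIs still exist for ~60s after the pods are gone
```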
Anything else we need to know?:
We came across this tip https://aws.github.io/aws-eks-best-practices/networking/sgpp/#verify-terminationgraceperiodseconds-in-pod-specification-file and have set terminationGracePeriodSeconds to 30, but we are still experiencing the issue.
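For reference, a minimal sketch of where that setting lives in the deployment's pod template (the rest of the spec is omitted):

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30   # per the SGP best-practices tip linked above
```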
Environment:
Kubernetes version (kubectl version): v1.29.7-eks-2f46c53
OS (cat /etc/os-release): Ubuntu 22.04.4 LTS
Kernel (uname -a): Linux ip-172-23-175-189 6.5.0-1024-aws #24~22.04.1-Ubuntu SMP Thu Jul 18 10:43:12 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux