
Security Groups for Pods branch interfaces not being cleaned up after pod deletion #3018

Open
SeanEmac opened this issue Aug 28, 2024 · 3 comments
Labels: bug, stale

Comments

@SeanEmac

SeanEmac commented Aug 28, 2024

What happened:
We have enabled Security Groups for Pods, which means each pod must be assigned a branch interface before it can start. We have observed pods stuck in the ContainerCreating state with the following error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox

This happens because the node has run out of branch interfaces to assign to new pods: the Kubernetes scheduler believes branch interfaces are still available on the node when in fact all of them are occupied.
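For anyone hitting this, a quick way to compare what the scheduler thinks is free against what the node advertises is to inspect the vpc.amazonaws.com/pod-eni extended resource on the node (the node name here is a placeholder):

kubectl get node <node-name> -o jsonpath='{.status.allocatable.vpc\.amazonaws\.com/pod-eni}'
kubectl describe node <node-name> | grep -A 8 'Allocated resources'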

We noticed that the logs indicate ENIs are not being cleaned up properly when a pod is deleted.
/var/log/aws-routed-eni/ipamd.log

{"level":"info","ts":"2024-08-28T19:50:51.147Z","caller":"rpc/rpc.pb.go:881","msg":"Received DelNetwork for Sandbox 8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3"}
{"level":"debug","ts":"2024-08-28T19:50:51.147Z","caller":"rpc/rpc.pb.go:881","msg":"DelNetworkRequest: K8S_POD_NAME:\"test-deployment-7fdb64448-g4dnq\" K8S_POD_NAMESPACE:\"sean-scale\" K8S_POD_INFRA_CONTAINER_ID:\"8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3\" Reason:\"PodDeleted\" ContainerID:\"8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2024-08-28T19:50:51.147Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: IP address pool stats: total 28, assigned 4, sandbox aws-cni/8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3/eth0"}
{"level":"debug","ts":"2024-08-28T19:50:51.147Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2024-08-28T19:50:51.147Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3/unknown"}
{"level":"warn","ts":"2024-08-28T19:50:51.147Z","caller":"rpc/rpc.pb.go:881","msg":"Send DelNetworkReply: Failed to get pod spec: error while trying to retrieve pod info: Pod \"test-deployment-7fdb64448-g4dnq\" not found"}

/var/log/aws-routed-eni/plugin.log

{"level":"info","ts":"2024-08-28T19:38:51.056Z","caller":"routed-eni-cni-plugin/cni.go:314","msg":"Received CNI del request: ContainerID(2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f) Netns() IfName(eth0) Args(K8S_POD_INFRA_CONTAINER_ID=2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f;K8S_POD_UID=14582d59-8109-41bd-8028-ea0215dd75dc;IgnoreUnknown=1;K8S_POD_NAMESPACE=sean-scale;K8S_POD_NAME=test-deployment-7fdb64448-kr9jm) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"standard\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"error","ts":"2024-08-28T19:38:51.058Z","caller":"routed-eni-cni-plugin/cni.go:314","msg":"Error received from DelNetwork gRPC call for container 2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f: rpc error: code = Unknown desc = error while trying to retrieve pod info: Pod \"test-deployment-7fdb64448-kr9jm\" not found"}
{"level":"info","ts":"2024-08-28T19:38:51.058Z","caller":"routed-eni-cni-plugin/cni.go:393","msg":"PrevResult not available for pod. Pod may have already been deleted."}
{"level":"info","ts":"2024-08-28T19:38:51.058Z","caller":"routed-eni-cni-plugin/cni.go:314","msg":"Could not teardown pod using prevResult: ContainerID(2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f) Netns() IfName(eth0) PodNamespace(sean-scale) PodName(test-deployment-7fdb64448-kr9jm)"}

We can see, by searching the network interfaces in the EC2 console, that the branch interfaces still exist for about a minute after scaling down the pods:

[Screenshot: EC2 console network interface search, 2024-08-28 1:10 PM]

And they are only cleaned up later by garbage collection.
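The same check can be scripted instead of using the console; this is a sketch that assumes the branch ENIs carry the aws-k8s-branch-eni description the VPC resource controller normally sets:

# List branch ENIs and their status; lingering interfaces show up here until GC removes them
aws ec2 describe-network-interfaces \
  --filters "Name=description,Values=aws-k8s-branch-eni" \
  --query 'NetworkInterfaces[].[NetworkInterfaceId,Status]' \
  --output table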

What you expected to happen:
Once a pod is deleted, its branch interface should be cleaned up and made immediately available for use by another pod.

How to reproduce it (as minimally and precisely as possible):

  1. Enable Security Groups for Pods on the aws-node daemonset: ENABLE_POD_ENI = "true" (see the command sketch after this list)
  2. Add the following request to a deployment:

     resources:
       requests:
         vpc.amazonaws.com/pod-eni: "1"

  3. Scale up the deployment so that its pods are assigned branch interfaces
  4. Scale the deployment back to 0
  5. Observe that the branch interfaces are still attached for ~60 seconds after pod deletion
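For concreteness, the steps above translate to roughly the following commands (the deployment name and namespace come from the full spec posted in the comments below):

# Enable Security Groups for Pods on the VPC CNI
kubectl set env daemonset aws-node -n kube-system ENABLE_POD_ENI=true

# Scale up, then back to zero, and watch the pods terminate
kubectl scale deployment test-deployment -n sean-scale --replicas=10
kubectl scale deployment test-deployment -n sean-scale --replicas=0
kubectl get pods -n sean-scale -w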

Anything else we need to know?:
We came across this tip https://aws.github.io/aws-eks-best-practices/networking/sgpp/#verify-terminationgraceperiodseconds-in-pod-specification-file and have set terminationGracePeriodSeconds to 30, but we are still experiencing the issue.

Environment:

  • Kubernetes version (use kubectl version): v1.29.7-eks-2f46c53
  • CNI Version: v1.18.0
  • OS (e.g: cat /etc/os-release): Ubuntu 22.04.4 LTS
  • Kernel (e.g. uname -a): Linux ip-172-23-175-189 6.5.0-1024-aws #24~22.04.1-Ubuntu SMP Thu Jul 18 10:43:12 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
@SeanEmac SeanEmac added the bug label Aug 28, 2024
@orsenthil
Member

Were there any services running that could cause the application pods not to be deleted? Were the application pods deleted properly?

@SeanEmac
Author

SeanEmac commented Aug 28, 2024

@orsenthil yes, the pods were deleted properly. This is the full spec of what I used to reproduce:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
  namespace: sean-scale
  labels:
    app: test-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-deployment
  template:
    metadata:
      labels:
        app: test-deployment
    spec:
      containers:
      - image: busybox
        name: busybox
        command: ["sh", "-c", "sleep infinity"]
        resources:
          requests:
            vpc.amazonaws.com/pod-eni: "1"
          limits:
            vpc.amazonaws.com/pod-eni: "1"
      terminationGracePeriodSeconds: 30

The pods terminate straight away once I scale the deployment to 0.

github-actions bot commented Oct 29, 2024

This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Oct 29, 2024