
Security Groups for Pods branch interfaces not being cleaned up after pod deletion #3018

Open
SeanEmac opened this issue Aug 28, 2024 · 3 comments
Labels: bug, stale

Comments

@SeanEmac

SeanEmac commented Aug 28, 2024

What happened:
We have enabled Security Groups for Pods, which means each pod must be assigned a branch interface before it can start. We have observed pods stuck in the ContainerCreating state with the following error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox

This happens because the node has run out of branch interfaces to assign to new pods: the Kubernetes scheduler believes branch interfaces are still available on the node when in fact all of them are occupied.
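For anyone hitting this, a quick way to compare what the scheduler thinks is free against what the node advertises is to inspect the vpc.amazonaws.com/pod-eni extended resource on the node (the node name here is a placeholder):

kubectl get node <node-name> -o jsonpath='{.status.allocatable.vpc\.amazonaws\.com/pod-eni}'
kubectl describe node <node-name> | grep -A 8 'Allocated resources'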

We noticed that the logs indicate ENIs are not being cleaned up properly when a pod is deleted.
/var/log/aws-routed-eni/ipamd.log

{"level":"info","ts":"2024-08-28T19:50:51.147Z","caller":"rpc/rpc.pb.go:881","msg":"Received DelNetwork for Sandbox 8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3"}
{"level":"debug","ts":"2024-08-28T19:50:51.147Z","caller":"rpc/rpc.pb.go:881","msg":"DelNetworkRequest: K8S_POD_NAME:\"test-deployment-7fdb64448-g4dnq\" K8S_POD_NAMESPACE:\"sean-scale\" K8S_POD_INFRA_CONTAINER_ID:\"8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3\" Reason:\"PodDeleted\" ContainerID:\"8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2024-08-28T19:50:51.147Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: IP address pool stats: total 28, assigned 4, sandbox aws-cni/8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3/eth0"}
{"level":"debug","ts":"2024-08-28T19:50:51.147Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2024-08-28T19:50:51.147Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/8afed98db5bd696ba1a46dcfa3e237be048d9667a9425c5bea1683beb35496e3/unknown"}
{"level":"warn","ts":"2024-08-28T19:50:51.147Z","caller":"rpc/rpc.pb.go:881","msg":"Send DelNetworkReply: Failed to get pod spec: error while trying to retrieve pod info: Pod \"test-deployment-7fdb64448-g4dnq\" not found"}

/var/log/aws-routed-eni/plugin.log

{"level":"info","ts":"2024-08-28T19:38:51.056Z","caller":"routed-eni-cni-plugin/cni.go:314","msg":"Received CNI del request: ContainerID(2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f) Netns() IfName(eth0) Args(K8S_POD_INFRA_CONTAINER_ID=2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f;K8S_POD_UID=14582d59-8109-41bd-8028-ea0215dd75dc;IgnoreUnknown=1;K8S_POD_NAMESPACE=sean-scale;K8S_POD_NAME=test-deployment-7fdb64448-kr9jm) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.4.0\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"podSGEnforcingMode\":\"standard\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"error","ts":"2024-08-28T19:38:51.058Z","caller":"routed-eni-cni-plugin/cni.go:314","msg":"Error received from DelNetwork gRPC call for container 2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f: rpc error: code = Unknown desc = error while trying to retrieve pod info: Pod \"test-deployment-7fdb64448-kr9jm\" not found"}
{"level":"info","ts":"2024-08-28T19:38:51.058Z","caller":"routed-eni-cni-plugin/cni.go:393","msg":"PrevResult not available for pod. Pod may have already been deleted."}
{"level":"info","ts":"2024-08-28T19:38:51.058Z","caller":"routed-eni-cni-plugin/cni.go:314","msg":"Could not teardown pod using prevResult: ContainerID(2e970a5ecd4137a9731736f9caf2f109023e2b6b694bdee92632fe559f84400f) Netns() IfName(eth0) PodNamespace(sean-scale) PodName(test-deployment-7fdb64448-kr9jm)"}

We can see, by searching the network interfaces in the EC2 console, that the branch interfaces still exist for about a minute after scaling down the pods:

[Screenshot: EC2 console network interface search, 2024-08-28 1:10 PM]

And they are only cleaned up later by garbage collection.
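The same check can be scripted instead of using the console; this is a sketch that assumes the branch ENIs carry the aws-k8s-branch-eni description the VPC resource controller normally sets:

# List branch ENIs and their status; lingering interfaces show up here until GC removes them
aws ec2 describe-network-interfaces \
  --filters "Name=description,Values=aws-k8s-branch-eni" \
  --query 'NetworkInterfaces[].[NetworkInterfaceId,Status]' \
  --output table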

What you expected to happen:
Once a pod is deleted, its branch interface should be cleaned up and made immediately available for use by another pod.

How to reproduce it (as minimally and precisely as possible):

  1. Enable Security Groups for Pods on the aws-node daemonset: ENABLE_POD_ENI = "true" (see the command sketch after this list)
  2. Add the following request to a deployment:

     resources:
       requests:
         vpc.amazonaws.com/pod-eni: "1"

  3. Scale up the deployment so that its pods are assigned branch interfaces
  4. Scale the deployment back to 0
  5. Observe that the branch interfaces are still attached for ~60 seconds after pod deletion
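For concreteness, the steps above translate to roughly the following commands (the deployment name and namespace come from the full spec posted in the comments below):

# Enable Security Groups for Pods on the VPC CNI
kubectl set env daemonset aws-node -n kube-system ENABLE_POD_ENI=true

# Scale up, then back to zero, and watch the pods terminate
kubectl scale deployment test-deployment -n sean-scale --replicas=10
kubectl scale deployment test-deployment -n sean-scale --replicas=0
kubectl get pods -n sean-scale -w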

Anything else we need to know?:
We came across this tip https://aws.github.io/aws-eks-best-practices/networking/sgpp/#verify-terminationgraceperiodseconds-in-pod-specification-file and have set terminationGracePeriodSeconds to 30, but we are still experiencing the issue.

Environment:

  • Kubernetes version (use kubectl version): v1.29.7-eks-2f46c53
  • CNI Version: v1.18.0
  • OS (e.g: cat /etc/os-release): Ubuntu 22.04.4 LTS
  • Kernel (e.g. uname -a): Linux ip-172-23-175-189 6.5.0-1024-aws #24~22.04.1-Ubuntu SMP Thu Jul 18 10:43:12 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
@SeanEmac SeanEmac added the bug label Aug 28, 2024
@orsenthil
Member

Were there any services running that could cause the application pods not to be deleted? Were the application pods deleted properly?

@SeanEmac
Author

SeanEmac commented Aug 28, 2024

@orsenthil yes, the pods were deleted properly. This is the full spec of what I used to reproduce:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
  namespace: sean-scale
  labels:
    app: test-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-deployment
  template:
    metadata:
      labels:
        app: test-deployment
    spec:
      containers:
      - image: busybox
        name: busybox
        command: ["sh", "-c", "sleep infinity"]
        resources:
          requests:
            vpc.amazonaws.com/pod-eni: "1"
          limits:
            vpc.amazonaws.com/pod-eni: "1"
      terminationGracePeriodSeconds: 30

The pods terminate straight away once I scale the deployment to 0.

github-actions bot commented Oct 29, 2024

This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Oct 29, 2024