`aws-node` pod stuck after starting #3016

carlosrejano · 2024-08-28T09:55:12Z

What happened:
We are running an EKS cluster on Kubernetes 1.28 that uses Karpenter to scale dynamically. There is constant churn in the cluster, so new nodes appear all the time.

We've found that in some cases, after a new node appears, the aws-node pod on that node gets stuck. New pods cannot start because aws-node is not reachable, so container networking is never configured. See the pod event:

Warning  FailedCreatePodSandBox  3m53s (x268 over 62m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "94ee9c38f7a821bf5abf88a511f2ec99b13c3743e3dfc6327e82ad84833a9e69": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:50051: connect: connection refused"
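The event shows the CNI plugin being refused on 127.0.0.1:50051. For anyone trying to confirm the same symptom, here is a minimal sketch of a TCP probe against that address. The function name `ipamd_port_open` is hypothetical, and the script is assumed to run on the affected node itself, in the host network namespace. It only checks that something accepts TCP connections on the port. It does not speak gRPC.

```python
import socket


def ipamd_port_open(host: str = "127.0.0.1", port: int = 50051, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, False otherwise.

    The default address is the one from the FailedCreatePodSandBox event.
    """
    try:
        # create_connection raises OSError (including ConnectionRefusedError
        # and timeouts) when nothing is accepting connections.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "accepting connections" if ipamd_port_open() else "refused or unreachable"
    print(f"127.0.0.1:50051 is {state}")
```

A "refused or unreachable" result on a node whose aws-node pod shows 1/2 READY would match the behavior described here.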

Checking the aws-node pod, I see two things:

The pod is not fully ready; only one of its two containers is ready:

NAME             READY   STATUS    RESTARTS   AGE
aws-node-4xbrr   1/2     Running   0          4h5m
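Since new nodes keep appearing, spotting pods in this state by eye does not scale. A rough sketch of a filter over the tabular pod listing shown above follows. The helper name `not_fully_ready` is hypothetical, and it assumes the input has a header row followed by rows whose second column is READY in `ready/total` form.

```python
from typing import List


def not_fully_ready(pod_table: str) -> List[str]:
    """Return names of pods whose READY column shows ready < total.

    Assumes the first line is the header (NAME READY STATUS ...).
    """
    stuck = []
    lines = pod_table.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # not a pod row in the expected shape
        ready, total = fields[1].split("/", 1)
        if ready.isdigit() and total.isdigit() and int(ready) < int(total):
            stuck.append(fields[0])
    return stuck


if __name__ == "__main__":
    sample = """\
NAME             READY   STATUS    RESTARTS   AGE
aws-node-4xbrr   1/2     Running   0          4h5m
"""
    print(not_fully_ready(sample))
```

A pod that stays in this list long after its node joined is a candidate for the restart workaround mentioned below.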

The logs show that it seems to get stuck:

Defaulted container "aws-node" out of: aws-node, aws-eks-nodeagent, aws-vpc-cni-init (init)
Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-cni
time="2024-08-28T05:11:45Z" level=info msg="Starting IPAM daemon... "
time="2024-08-28T05:11:45Z" level=info msg="Checking for IPAM connectivity... "
time="2024-08-28T05:11:50Z" level=info msg="Copying config file... "
time="2024-08-28T05:11:50Z" level=info msg="Successfully copied CNI plugin binary and config file."
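To show how long the container has been silent, the `time=... level=... msg=...` lines above can be parsed and the last timestamp compared with the pod age. This is a sketch only: the regular expression is an assumption based on the four lines in this report, and `parse_log_lines` is a hypothetical helper, not part of any aws-node tooling.

```python
import re
from datetime import datetime, timezone

# Matches lines shaped like the ones in this report:
#   time="2024-08-28T05:11:45Z" level=info msg="Starting IPAM daemon... "
LOG_LINE = re.compile(r'time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>.*)"$')


def parse_log_lines(text):
    """Return a list of (timestamp, level, message) for every matching line."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # e.g. the "Installed /host/opt/cni/bin/..." lines
        stamp = datetime.strptime(match["time"], "%Y-%m-%dT%H:%M:%SZ")
        entries.append((stamp.replace(tzinfo=timezone.utc), match["level"], match["msg"].strip()))
    return entries


if __name__ == "__main__":
    sample = '''Installed /host/opt/cni/bin/aws-cni
time="2024-08-28T05:11:45Z" level=info msg="Starting IPAM daemon... "
time="2024-08-28T05:11:50Z" level=info msg="Successfully copied CNI plugin binary and config file."
'''
    last_time, _, last_msg = parse_log_lines(sample)[-1]
    print(f"last log line at {last_time.isoformat()}: {last_msg}")
```

In this report the last line is stamped 05:11:50Z, while the pod listing shows an age of 4h5m, so the container logged nothing further for hours.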

I did not have debug logs enabled.

After restarting the pod it works again.

Node AMI: v1.28.11-eks-1552ad0
AWS CNI: v1.18.2-eksbuild.1

Thanks!

github-actions · 2024-10-28T00:04:29Z

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

carlosrejano added the bug label Aug 28, 2024

github-actions bot added the stale Issue or PR is stale label Oct 28, 2024
