aws-node pod stuck after starting #3016

Open
carlosrejano opened this issue Aug 28, 2024 · 1 comment
Labels
bug, stale (Issue or PR is stale)


carlosrejano commented Aug 28, 2024

What happened:
We are running an EKS cluster on Kubernetes 1.28 and use Karpenter to scale it dynamically. The cluster is in constant flux, so new nodes appear all the time.

We've found that, in some cases, the aws-node pod on a newly created node gets stuck. New pods on that node cannot start because aws-node is unreachable, so container networking is never configured. See the pod event:

Warning  FailedCreatePodSandBox  3m53s (x268 over 62m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "94ee9c38f7a821bf5abf88a511f2ec99b13c3743e3dfc6327e82ad84833a9e69": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:50051: connect: connection refused"
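
For anyone trying to confirm the same symptom: the event says the CNI plugin is refused on 127.0.0.1:50051. A minimal sketch of a probe for that port follows. It is only a plain TCP connect and not a real gRPC health check, the function name `ipamd_port_open` is made up for illustration, and it assumes you run it on the affected node itself so that 127.0.0.1 is the node's loopback.

```python
import socket


def ipamd_port_open(host="127.0.0.1", port=50051, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper: host and port default to the address shown in
    the FailedCreatePodSandBox event. A successful connect only proves
    something is listening, not that the gRPC service is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("port 50051 reachable:", ipamd_port_open())
```

If this reports the port as unreachable while the aws-node container shows as Running, that would match the "connection refused" seen by the plugin.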

Checking the aws-node pod, I see two things:

  1. The pod is not fully ready; only one of its two containers is ready:
NAME             READY   STATUS    RESTARTS   AGE
aws-node-4xbrr   1/2     Running   0          4h5m
  2. The logs of the aws-node container stop right after startup, as if it were stuck:
Defaulted container "aws-node" out of: aws-node, aws-eks-nodeagent, aws-vpc-cni-init (init)
Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-cni
time="2024-08-28T05:11:45Z" level=info msg="Starting IPAM daemon... "
time="2024-08-28T05:11:45Z" level=info msg="Checking for IPAM connectivity... "
time="2024-08-28T05:11:50Z" level=info msg="Copying config file... "
time="2024-08-28T05:11:50Z" level=info msg="Successfully copied CNI plugin binary and config file."
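
Because the pod stays in `Running` with zero restarts, the only visible hint in the pod listing is the READY column. A rough sketch of how that listing could be scanned for pods in this state is below. The function name `not_fully_ready` is hypothetical, and it assumes the default column layout of `kubectl get pods` (NAME first, READY second) as pasted above.

```python
def not_fully_ready(kubectl_output):
    """Return names of pods whose READY column is not n/n.

    Assumes the default `kubectl get pods` table: a header line, then
    one pod per line with NAME and READY as the first two columns.
    """
    stuck = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # not a pod row in the expected format
        ready, total = fields[1].split("/", 1)
        if ready != total:
            stuck.append(fields[0])
    return stuck


# Listing copied from this report.
sample = """NAME             READY   STATUS    RESTARTS   AGE
aws-node-4xbrr   1/2     Running   0          4h5m
"""
print(not_fully_ready(sample))
```

This only flags candidates. A pod that is still starting normally also shows fewer ready containers for a short while, so the AGE column needs to be taken into account as well.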

I did not have debug logs enabled.

After restarting the pod, it works again.

Node AMI: v1.28.11-eks-1552ad0
AWS CNI: v1.18.2-eksbuild.1

Thanks!


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions bot added the stale label Oct 28, 2024