VPC CNI stuck in crash loop without insights #2945

duxing · 2024-06-05T03:58:57Z

What happened:

I'm working on testing autoscaling for my EKS cluster (1.29) and karpenter is frequently scaling nodes up and down during my test.
At a certain point, all newly launched nodes stuck in NotReady due to VPC CNI pod stuck in crash loop.

The symptom is very similar to hitting EC2/ENI API rate limit, however, I can't find out useful logs / metrics from the client (VPC CNI pod) to help me confirm/diagnose, despite AWS_VPC_K8S_CNI_LOGLEVEL is set to DEBUG (AWS_VPC_K8S_PLUGIN_LOG_LEVEL is also DEBUG if it matters)

The version i'm using is v1.18.1-eksbuild.3 (the EKS optimized addon) and the logs are attached below.

Attach logs

unable to run sudo bash /opt/cni/bin/aws-cni-support.sh: the image 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.18.1-eksbuild.3 seems to be distroless

logs from kubectl logs -f ...:

successfully launched aws-node pod (AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG):

Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-cni
time="2024-06-04T23:27:31Z" level=info msg="Starting IPAM daemon... "
time="2024-06-04T23:27:31Z" level=info msg="Checking for IPAM connectivity... "
time="2024-06-04T23:27:33Z" level=info msg="Copying config file... "
time="2024-06-04T23:27:33Z" level=info msg="Successfully copied CNI plugin binary and config file."

stuck aws-node pod (also AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG):

Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-cni
time="2024-06-05T03:40:00Z" level=info msg="Starting IPAM daemon... "
time="2024-06-05T03:40:00Z" level=info msg="Checking for IPAM connectivity... "
// stuck here indefinitely until container is restarted

What you expected to happen:

with AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG, logs should be more verbose, spitting out information about what the process is doing.
whatever error/exception that caused initialization failure should be surfaced to the log stream under pretty much any log level (should be ERROR log level for these entries)

if there are exponential backoff retry for 429 responses, it needs to be surfaced during verbose mode (debug log level)

How to reproduce it (as minimally and precisely as possible):

launch [email protected] and install VPC CNI v1.18.1-eksbuild.3 from EKS addons
frequently launch and terminate (after nodes become ready) nodes in batches (batch of 15 instances, every 10mins)
wait until nodes are unable to become healthy due to aws-node pod stuck in crash loop

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version):

Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-eks-036c24b

CNI Version: v1.18.1-eksbuild.3
OS (e.g: cat /etc/os-release):
Kernel (e.g. uname -a):

The text was updated successfully, but these errors were encountered:

orsenthil · 2024-06-05T17:11:23Z

sudo bash /opt/cni/bin/aws-cni-support.sh should be run from the node where you are observing the crash loop. Can you also see what the ipamd.log and plugin.log say in the same node. They are located in /var/log/aws-routed-eni/ipamd.log and /var/log/aws-routed-eni/plugin.log

duxing · 2024-06-05T18:48:14Z

@orsenthil thx for the clarification.
I missed the window when this issue can be consistently reproduced (I suspect it has something to do with AWS API rate limiting) and I'm unable to reproduce this right now, but I'll check if I can find the logs from the affected nodes from that time.

update:
unable to do that. the daemonset pod failed initialization due to VPC CNI stuck in crash loop. I may have to reproduce this again to see the logs

however, I did get your point about the log verbosity setting is affecting the /var/log/aws-routed-eni/ipamd.log file, and I think this issue can be closed.

would you recommend setting AWS_VPC_K8S_CNI_LOG_FILE to be /dev/stdout to make the logs available for kubectl logs -f?

duxing · 2024-06-05T19:51:21Z

I was to reproduce this on an EKS node that I have SSH access to and the results have been sent to [email protected]

after going through ipamd.log and other log files under /var/log/aws-routed-eni/, I couldn't find any useful log lines that can help me confirm if this is a rate limit issue or identify what's causing the failure.

as this comment is written, the node has been in this state for 19m (together with many nodes that are stuck for 40mins):

kubectl get nodes --sort-by=.metadata.creationTimestamp
...
ip-10-231-76-39.ec2.internal     Ready      <none>   33h     v1.29.0-eks-5e0fdde
ip-10-231-75-201.ec2.internal    Ready      <none>   159m    v1.29.0-eks-5e0fdde
ip-10-231-107-179.ec2.internal   NotReady   <none>   37m     v1.29.0-eks-5e0fdde
...
ip-10-231-87-255.ec2.internal    NotReady   <none>   19m     v1.29.0-eks-5e0fdde

if it helps:
I managed to launch 1 node 159ms ago (pool is homogenous): after this issue is reported but before this current response.

I guess the "issue" I have is: the ipamd log or kubectl logs -f should provide some insights on why this particular failure is happening

duxing · 2024-06-06T23:20:11Z

posted some of my own findings in a separate issue #2948

the logs uploaded for this issue have the same issue as the example logs in #2948

github-actions · 2024-08-06T00:03:40Z

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

duxing · 2024-08-07T02:37:11Z

update on this:

I don't think this issue is fixable and as suggested, configuring kube-proxy to run in ipvs mode in a large cluster would alleviate this issue.

however, it would be helpful if vpc-cni can throw a log when this happens so that this is more self-contained. I could not figure out why when I go through the verbose log of vpc-cni, it wasn't until I cross check the timestamp of kube-proxy logs from the same node did I realize why this happened.

dejwsz · 2024-09-18T10:35:21Z

I think I had the same issue several times in EKS 1.29 with latest vpc cni (and previous versions too), any idea?

time="2024-09-16T14:56:39Z" level=info msg="Checking for IPAM connectivity... "
time="2024-09-16T14:56:48Z" level=info msg="Copying config file... "
time="2024-09-16T14:56:48Z" level=info msg="Successfully copied CNI plugin binary and config file."
	>  sigs.k8s.io/controller-runtime/pkg/client.New(0x55c436c39f30?, {0x0, 0xc0000e2e00, {0x0, 0x0}, 0xc000822390, {0x0, 0x0}, 0x0})
	>  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/client/client.go:110 +0x7d
	>  github.com/aws/amazon-vpc-cni-k8s/pkg/k8sapi.CreateKubeClient({0x55c43451b974, 0x8})
	>  	/go/src/github.com/aws/amazon-vpc-cni-k8s/pkg/k8sapi/k8sutils.go:115 +0x34a
	>  main._main()
	>  	/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/aws-k8s-agent/main.go:59 +0x15a
	>  main.main()
	>  	/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/aws-k8s-agent/main.go:42 +0x13```

dejwsz · 2024-09-18T10:58:58Z

is it for sure that ipvs mode can fix this ? or it is only a guess ?

duxing · 2024-09-18T17:56:43Z

@dejwsz what's the number of Services you have on your cluster?
the root issue is lock contention and ipvs mode can "fix" this issue in most of the cases.

still, I think vpc cni should be more verbose about lock contention to be more self-contained. kube-proxy is capable of throwing logs when iptable lock is held by a different process so I expect vpc-cni to be capable to providing the same level of verbosity

yash97 · 2024-09-18T21:18:45Z

@dejwsz from your trace it looks like code panicked in this function 115. It could be because of some null pointer. Is it possible to provide more information regarding this panic. Thanks.

yash97 · 2024-09-18T21:57:22Z

Also regarding vpc-cni not logging about iptable lock busy, core issue is with coreos-iptable by default use wait option where it will wait indefinitely or until timeout.Package Code Link. We can generate error like Another app is currently holding the xtables lock. by passing Timeout and retry mechanism.
eg to add timeout:

iptables.New(iptables.Timeout(1))

duxing added the bug label Jun 5, 2024

duxing mentioned this issue Jun 6, 2024

iptables contention between vpc-cni and kube-proxy #2948

Closed

github-actions bot added the stale Issue or PR is stale label Aug 6, 2024

github-actions bot removed the stale Issue or PR is stale label Aug 8, 2024

yash97 assigned yash97 and unassigned yash97 Sep 18, 2024

yash97 added the good first issue label Sep 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VPC CNI stuck in crash loop without insights #2945

VPC CNI stuck in crash loop without insights #2945

duxing commented Jun 5, 2024

orsenthil commented Jun 5, 2024

duxing commented Jun 5, 2024 •

edited

Loading

duxing commented Jun 5, 2024

duxing commented Jun 6, 2024

github-actions bot commented Aug 6, 2024

duxing commented Aug 7, 2024

dejwsz commented Sep 18, 2024 •

edited

Loading

dejwsz commented Sep 18, 2024

duxing commented Sep 18, 2024

yash97 commented Sep 18, 2024

yash97 commented Sep 18, 2024 •

edited

Loading

VPC CNI stuck in crash loop without insights #2945

VPC CNI stuck in crash loop without insights #2945

Comments

duxing commented Jun 5, 2024

orsenthil commented Jun 5, 2024

duxing commented Jun 5, 2024 • edited Loading

duxing commented Jun 5, 2024

duxing commented Jun 6, 2024

github-actions bot commented Aug 6, 2024

duxing commented Aug 7, 2024

dejwsz commented Sep 18, 2024 • edited Loading

dejwsz commented Sep 18, 2024

duxing commented Sep 18, 2024

yash97 commented Sep 18, 2024

yash97 commented Sep 18, 2024 • edited Loading

duxing commented Jun 5, 2024 •

edited

Loading

dejwsz commented Sep 18, 2024 •

edited

Loading

yash97 commented Sep 18, 2024 •

edited

Loading