[EKS] Pods stuck in ContainerCreating status after upgrading to Kubernetes version 1.30 #2970

Closed
Gier32o opened this issue Jun 27, 2024 · 8 comments
Labels
bug, stale

Comments

@Gier32o

Gier32o commented Jun 27, 2024

Pods are stuck in ContainerCreating status after upgrading to Kubernetes version 1.30 on EKS.
We have the 'Security Groups for Pods' feature turned on, and when we try to upgrade from:

ami_id             = "ami-066d744867bb80fce"
vpc_cni_version    = "v1.16.2-eksbuild.1"
kubernetes_version = "1.29"

to

ami_id             = "ami-05e7e986227a095a9"
vpc_cni_version    = "v1.18.2-eksbuild.1"
kubernetes_version = "1.30"

we're getting failing pods:

  Warning  FailedCreatePodSandBox  19m                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "cffd4f13c293011d5f6e967bd5859c234ab1f83731fbf1e40c46330e6276fdd7": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  66s (x85 over 19m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "38c6783c31d39443b9b0fe4873868fdf972c92d499176b5b44c9df42b4461865": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

There is no issue when the 'Security Groups for Pods' feature is turned off.
How to reproduce: https://github.com/Gier32o/k8s-upgrade-problem

@Gier32o Gier32o added the bug label Jun 27, 2024
@orsenthil
Member

Hello @Gier32o, do the /var/log/aws-routed-eni/plugin.log or /var/log/aws-routed-eni/ipamd.log logs show any details about the IP assignment or the failure? Is the aws-node pod running?
Usually the CNI version does not change during a K8s upgrade: we keep the CNI version the same while performing the K8s upgrade, and upgrade the CNI afterwards. Does this workflow give the desired outcome?
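
As a rough illustration of the two-step workflow described above, the Python sketch below splits one combined change into ordered phases: Kubernetes version and worker AMI first with the CNI addon pinned, then the CNI addon on its own. `ClusterState` and `plan_upgrade` are hypothetical names made up for this example; they are not part of Terraform, EKS, or the VPC CNI. The values are the ones from the original report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical container for the three settings shown in the report."""
    kubernetes_version: str
    ami_id: str
    vpc_cni_version: str


def plan_upgrade(current: ClusterState, target: ClusterState) -> list:
    """Return the ordered list of changes to apply, one dict per phase.

    Phase 1 changes the Kubernetes version and worker AMI while keeping
    the CNI addon at its current version. Phase 2 changes only the addon.
    """
    phases = []

    phase1 = {}
    if current.kubernetes_version != target.kubernetes_version:
        phase1["kubernetes_version"] = target.kubernetes_version
    if current.ami_id != target.ami_id:
        phase1["ami_id"] = target.ami_id
    if phase1:
        # Pin the addon explicitly so it does not move in the same apply.
        phase1["vpc_cni_version"] = current.vpc_cni_version
        phases.append(phase1)

    if current.vpc_cni_version != target.vpc_cni_version:
        phases.append({"vpc_cni_version": target.vpc_cni_version})

    return phases


if __name__ == "__main__":
    current = ClusterState("1.29", "ami-066d744867bb80fce", "v1.16.2-eksbuild.1")
    target = ClusterState("1.30", "ami-05e7e986227a095a9", "v1.18.2-eksbuild.1")
    for number, phase in enumerate(plan_upgrade(current, target), start=1):
        print(f"phase {number}: {phase}")
```

Each phase would correspond to one separate apply of the variables; the sketch only decides the order and does not talk to AWS.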

@Gier32o
Author

Gier32o commented Jun 28, 2024

Hi, the aws-node pods are running fine. You were right: upgrading the addon version before or at the same time as Kubernetes and the worker AMIs results in this error. If I run the upgrade in two batches, 1. (K8s + AMIs) -> 2. (Addon), it works fine. Thanks!
Is there any way to fix such a broken cluster afterwards?

@orsenthil
Member

orsenthil commented Jun 28, 2024

> Is there any way to fix such a broken cluster afterwards?

I am not sure what could have led to this state. But you can downgrade the addon to the previous version, restart the pods, and then upgrade the addon again.
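
The recovery sequence suggested above (downgrade, restart, upgrade) could be sketched as follows. This Python snippet only builds the command lines and does not run them. `recovery_commands` is a hypothetical helper, and the cluster name `my-cluster`, the workload `deployment/my-app`, and the namespace `default` are placeholders, not values from this issue. The addon versions are the ones from the original report, and whether a given downgrade is accepted depends on the cluster.

```python
import shlex


def recovery_commands(cluster, previous_version, target_version, workload, namespace):
    """Build the three steps as argument lists, in the order to run them.

    Assumption: the addon is managed as the EKS addon "vpc-cni" and the
    affected pods belong to a workload that supports "rollout restart".
    """
    def update_addon(version):
        return [
            "aws", "eks", "update-addon",
            "--cluster-name", cluster,
            "--addon-name", "vpc-cni",
            "--addon-version", version,
        ]

    return [
        # 1. Go back to the addon version that worked before.
        update_addon(previous_version),
        # 2. Recreate the pods that are stuck in ContainerCreating.
        ["kubectl", "rollout", "restart", workload, "-n", namespace],
        # 3. Upgrade the addon again once the pods are healthy.
        update_addon(target_version),
    ]


if __name__ == "__main__":
    steps = recovery_commands(
        cluster="my-cluster",            # placeholder
        previous_version="v1.16.2-eksbuild.1",
        target_version="v1.18.2-eksbuild.1",
        workload="deployment/my-app",    # placeholder
        namespace="default",             # placeholder
    )
    for step in steps:
        print(shlex.join(step))
```

It is probably worth waiting for the addon update to finish before moving from one step to the next.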

@hikouki-gumo

Hi @orsenthil, I upgraded in the order you recommended, (K8s + AMIs) first and then the addon, but got the same problem. The failure is also intermittent: it does not happen every time.

vpc_cni_version    = "v1.16.3-eksbuild.2"
kubernetes_version = "1.28"

to

vpc_cni_version    = "v1.18.2-eksbuild."
kubernetes_version = "1.29"

@Gier32o
Author

Gier32o commented Jul 1, 2024

Trying to downgrade the plugin 1.18.2 -> 1.16.3 fails with:

Error: updating EKS Add-On (test:vpc-cni): operation error EKS: UpdateAddon, https response error StatusCode: 400, RequestID: 608f24fe-795a-4c7c-acba-8d11836aa01b, InvalidParameterException: Addon version specified is not supported

Nothing changed when I downgraded to 1.17.1.
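
The "Addon version specified is not supported" error may mean the requested version is not offered for the cluster's Kubernetes version. One way to check before attempting a downgrade is to inspect the JSON printed by `aws eks describe-addon-versions --addon-name vpc-cni`. The Python sketch below filters such a payload by cluster version. `supported_versions` is a hypothetical helper, and the sample payload is made-up illustrative data with fake version strings, not real AWS output.

```python
def supported_versions(payload: dict, cluster_version: str) -> list:
    """List addon versions whose compatibilities include cluster_version.

    Expects the parsed JSON shape of describe-addon-versions:
    addons[].addonVersions[].compatibilities[].clusterVersion
    """
    versions = []
    for addon in payload.get("addons", []):
        for entry in addon.get("addonVersions", []):
            compatibilities = entry.get("compatibilities", [])
            if any(c.get("clusterVersion") == cluster_version for c in compatibilities):
                versions.append(entry["addonVersion"])
    return versions


if __name__ == "__main__":
    # Made-up sample data; the version strings are deliberately fake.
    sample = {
        "addons": [
            {
                "addonName": "vpc-cni",
                "addonVersions": [
                    {
                        "addonVersion": "v0.0.2-example",
                        "compatibilities": [{"clusterVersion": "1.30"}],
                    },
                    {
                        "addonVersion": "v0.0.1-example",
                        "compatibilities": [{"clusterVersion": "1.29"}],
                    },
                ],
            }
        ]
    }
    wanted = "v0.0.1-example"
    available = supported_versions(sample, "1.30")
    print(f"{wanted} offered for 1.30: {wanted in available}")
```

If the version being downgraded to is missing from the list for the cluster's Kubernetes version, the update request would be expected to be rejected.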

@Gier32o
Author

Gier32o commented Jul 18, 2024

So is this a bug, or is something wrong with the configuration?


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale label Sep 17, 2024

github-actions bot commented Oct 1, 2024

Issue closed due to inactivity.

@github-actions github-actions bot closed this as not planned Oct 1, 2024