
aws-node should untaint node #2808

Open
runningman84 opened this issue Feb 23, 2024 · 21 comments

@runningman84

What would you like to be added:
In our EKS Bottlerocket use case we see Karpenter provisioning nodes which receive the aws-node pod and some application pods right at startup. Unfortunately, the aws-node pod takes several dozen seconds to become ready. The application pods try to start in the meantime and fail because they do not get an IP address.

Should we use Karpenter to taint the nodes until aws-node is ready? Is the CNI plugin able to remove a startup taint once it is ready?

Why is this needed:
Pods constantly fail during startup due to missing IP addresses because the aws-node pod is not yet ready.

@jdn5126
Contributor

jdn5126 commented Feb 23, 2024

@runningman84 the node should not be marked as "Ready" until the aws-node pod copies the CNI configuration to /etc/cni/net.d/, which it does after it finishes initialization. So the scheduler should not schedule application pods on a "Not Ready" node (unless those pods tolerate the node being not ready for some reason).
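
For reference, a quick way to verify that on an affected node (the k8s-app=aws-node label is the one the DaemonSet uses; the on-node path assumes the standard VPC CNI layout):

# the aws-node pod for the node has to become Ready first
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

# on the node itself (e.g. via the Bottlerocket admin container + sheltie),
# the CNI config written by aws-node after initialization should exist
ls -l /etc/cni/net.d/    # expect 10-aws.conflist once aws-node has initialized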

@jdn5126
Contributor

jdn5126 commented Feb 23, 2024

The VPC CNI plugin is not able to modify node taints, as that would be a sizable security risk

@runningman84
Author

Okay, what could be the reason for us seeing this behaviour?

@jdn5126
Contributor

jdn5126 commented Feb 23, 2024

Okay, what could be the reason for us seeing this behaviour?

Do you see the nodes as "Not Ready" during this time window? Do these application pods have their own tolerations?

@runningman84
Author

runningman84 commented Feb 26, 2024

I have double-checked that ...

t = point in time... there can be seconds or even minutes between two numbers

t0 node appears as not ready
t1 daemon set pods are scheduled to it
t2 aws-node pod is in initializing
t2 kube-proxy pod is in initializing
t3 kube-proxy is ready
t4 node becomes ready (aws-node still not in running state)
t5 additional pods are scheduled to it
t6 all additional pods stay in ContainerCreating (warnings due to "failed to assign IP address to container")
t7 aws-node pod is ready
t8 new pods get ready within seconds
t9 the additional pods from t6 start to get ready

It looks like the node becomes Ready too quickly, without waiting for the aws-node pod...
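
To pin down that timeline it can help to watch the node condition and the aws-node pod side by side while a new node comes up, and then compare the transition timestamps (<new-node> is a placeholder):

# terminal 1: node conditions
kubectl get nodes -w -o wide

# terminal 2: aws-node pods
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide -w

# afterwards, compare when each side became Ready
kubectl get node <new-node> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'
kubectl get pods -n kube-system -l k8s-app=aws-node --field-selector spec.nodeName=<new-node> \
  -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'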

@jdn5126
Contributor

jdn5126 commented Feb 26, 2024

@runningman84 can you please share the node logs for this timeline? Mainly we would need to look at the CNI and IPAMD logs in /var/log/aws-routed-eni/. You can email them to [email protected] and we can take a look.

@runningman84
Author

runningman84 commented Feb 27, 2024

I just sent the logs and we also have an open case id: 170893944201879

@jdn5126
Contributor

jdn5126 commented Feb 27, 2024

Thanks @runningman84, let's work through the support case, as the support team will triage and then bring in the service team if needed.

@mathieuherbert

Hi, any news on this one? We have the same issue.

@runningman84
Author

The AWS support case did not really solve that; we got the suggestion to try prefix delegation mode or something similar to speed up IP allocation. The general question remains: should a node be unready until aws-node is up and running?
I could imagine that behaviour would also have downsides, because today images get pulled before the node is ready. The current situation is that pods might not start due to the missing IP address, but at least all images are already pulled by the time they eventually start fine...
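
For completeness, the prefix delegation suggestion from the support case is just an env var on the aws-node DaemonSet (as far as I know this matches the VPC CNI docs; whether it actually removes the race is a separate question):

# allocate IPs in /28 prefixes so new pods get an address faster (Nitro instances only)
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

# optionally keep a warm prefix around so a freshly scheduled pod does not wait for an EC2 API call
kubectl set env daemonset aws-node -n kube-system WARM_PREFIX_TARGET=1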

@tooptoop4

tooptoop4 commented May 18, 2024

I am facing the same issue on EKS 1.28.

kubelet shows Ready status on the node
pods start being scheduled onto that node
a few seconds later the node goes into NetworkNotReady state
the pods above get stuck forever
a few seconds later kubelet switches back to Ready state and new pods work, but the earlier ones don't

Seems I will have to make my own workaround by monitoring new nodes myself and assigning a label aged=y after they have been there for a minute (rough sketch below).

Then make all my pods use a nodeAffinity looking for that label.

Ideally the AWS pods would add the label to the node themselves.
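
A minimal sketch of that workaround, assuming a small cron job with permission to label nodes (GNU date; the label key aged=y is from above, everything else is made up for illustration):

#!/bin/sh
# label every node older than 60s with aged=y so that pods with a matching
# nodeSelector/nodeAffinity only land on nodes that had time to finish CNI setup
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  created=$(kubectl get node "$node" -o jsonpath='{.metadata.creationTimestamp}')
  age=$(( $(date +%s) - $(date -d "$created" +%s) ))   # GNU date parses the ISO timestamp
  if [ "$age" -gt 60 ]; then
    kubectl label node "$node" aged=y --overwrite
  fi
done

The pods would then need a required nodeAffinity (or plain nodeSelector) on aged=y, at the cost of roughly a minute of extra scheduling latency on every new node.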

Any ideas @jdn5126?

@orsenthil
Member

The general question is should a node be unready until aws-node is up and running?

The node should stay marked as Not Ready until the aws-node pod copies the CNI configuration files to /etc/cni/net.d/, which it does after it finishes initialization.

Marking the node as Ready or Not Ready is done by the kubelet process.
The CNI should not be tainting or untainting the node.
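
The kubelet's reasoning is visible in the node's Ready condition; until the CNI config exists it usually reports something like NetworkPluginNotReady (exact wording depends on the kubelet version, <node-name> is a placeholder):

kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[?(@.type=="Ready")]}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'

# while the CNI config is missing this typically shows:
# False  KubeletNotReady  container runtime network not ready: NetworkReady=false
#                         reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized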

@tooptoop4 - did you manage to resolve this issue at your end?

@tooptoop4

@orsenthil not resolved; the node goes 1. NotReady > 2. Ready > 3. NotReady > 4. Ready, and pods are getting scheduled and stuck between 2. and 3.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Aug 26, 2024
@tooptoop4

tooptoop4 commented Sep 4, 2024

still an issue in EKS v1.29.7 with amazon-k8s-cni:v1.18.2

kubectl events shows this timeline:

default 40m Normal Starting node/redactip kubelet, redactip Starting kubelet. 40m 1 redactip
default 40m Warning InvalidDiskCapacity node/redactip kubelet, redactip invalid capacity 0 on image filesystem 40m 1 redactip
default 40m Normal NodeHasSufficientMemory node/redactip kubelet, redactip Node redactip status is now: NodeHasSufficientMemory 40m 2 redactip
default 40m Normal NodeHasNoDiskPressure node/redactip kubelet, redactip Node redactip status is now: NodeHasNoDiskPressure 40m 2 redactip
default 40m Normal NodeHasSufficientPID node/redactip kubelet, redactip Node redactip status is now: NodeHasSufficientPID 40m 2 redactip
default 40m Normal NodeAllocatableEnforced node/redactip kubelet, redactip Updated Node Allocatable limit across pods 40m 1 redactip
default 39m Normal Synced node/redactip cloud-node-controller Node synced successfully 39m 1 redactip
default 39m Normal RegisteredNode node/redactip node-controller Node redactip event: Registered Node redactip in Controller 39m 1 redactip
default 39m Normal Starting node/redactip kube-proxy, kube-proxy-redactip 39m 1 redactip
default 39m Normal NodeReady node/redactip kubelet, redactip Node redactip status is now: NodeReady 39m 1 redactip
default 39m Normal Starting node/redactip kubelet, redactip Starting kubelet. 39m 1 redactip
default 39m Warning InvalidDiskCapacity node/redactip kubelet, redactip invalid capacity 0 on image filesystem 39m 1 redactip
default 39m Normal NodeHasSufficientMemory node/redactip kubelet, redactip Node redactip status is now: NodeHasSufficientMemory 39m 2 redactip
default 39m Normal NodeHasNoDiskPressure node/redactip kubelet, redactip Node redactip status is now: NodeHasNoDiskPressure 39m 2 redactip
default 39m Normal NodeHasSufficientPID node/redactip kubelet, redactip Node redactip status is now: NodeHasSufficientPID 39m 2 redactip
default 39m Normal NodeAllocatableEnforced node/redactip kubelet, redactip Updated Node Allocatable limit across pods 39m 1 redactip

@github-actions github-actions bot removed the stale Issue or PR is stale label Sep 5, 2024
@youwalther65

youwalther65 commented Sep 5, 2024

I have two, somewhat opposite points:

  1. Other CNIs like Cilium work just fine together with Karpenter - the Karpenter NodePool adds the Cilium startupTaint and the Cilium CNI removes it after a successful start (see the sketch after this list). So why can't the AWS VPC CNI be implemented that way too? Sure, it requires some more permissions to patch the Kubernetes Node object, so it might be a security concern.
  2. On the other side we need to figure out what is happening here that causes pods to be scheduled / the kubelet reporting the node Ready even though the AWS VPC CNI is not ready. Because this works in almost all cases, we must be hitting some kind of corner case here.
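
A rough sketch of what that pattern would look like for the VPC CNI, assuming the Karpenter NodePool adds a custom startup taint (the key example.com/cni-not-ready is purely hypothetical) via its startupTaints field and an out-of-band job removes it once aws-node is Ready - to be clear, this is not something the VPC CNI does today:

TAINT_KEY="example.com/cni-not-ready"            # hypothetical startup taint from the NodePool
NODE="ip-10-0-0-1.eu-west-1.compute.internal"    # placeholder node name

# wait until the aws-node pod scheduled on this node reports Ready
until kubectl get pods -n kube-system -l k8s-app=aws-node \
      --field-selector "spec.nodeName=${NODE}" \
      -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q True; do
  sleep 5
done

# then remove the startup taint, similar to what the Cilium agent does for its own agent-not-ready taint
kubectl taint node "${NODE}" "${TAINT_KEY}:NoSchedule-"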

@youwalther65

youwalther65 commented Sep 5, 2024

@runningman84 @tooptoop4 Have you tried the following type of debugging to narrow down the timing of the issue?

  1. Enable all EKS control plane logs
  2. Use CloudWatch Logs Insights queries for the control plane CloudWatch log group with appropriate timeframe to watch the node and AWS VPC CNI pod lifecycle or other things like kubelet posting ready state

Example 1

fields @timestamp, @message
| filter @message like /ip-<redacted>.<redacted AWS region>.compute.internal/
| sort @timestamp asc

Example 2 (remove the # to view only messages for userAgent kubelet):

fields @timestamp, @message
| filter @message like /<K8s node name>/
#| filter userAgent like /kubelet/
| filter @message like /kubelet is posting ready status/
| display @timestamp, userAgent, verb, requestURI
| sort @timestamp asc

Note: There are some CloudWatch costs associated with this.

Another question just to be sure: have you modified kubelet maxPods so that it is larger than the number of IPs available for this particular instance type? See Maximum IP addresses per network interface.
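
A quick way to sanity-check that (<node-name> is a placeholder): compare kubelet's advertised pod capacity, which comes from maxPods, with the number of pods actually placed on the node.

# pod capacity advertised by kubelet (derived from maxPods)
kubectl get node <node-name> -o jsonpath='{.status.capacity.pods}{"\n"}'

# pods currently scheduled on that node
kubectl get pods -A --field-selector spec.nodeName=<node-name> --no-headers | wc -l

If the capacity is higher than what the instance type can back with ENI IPs (and prefix delegation is off), the surplus pods will sit in ContainerCreating with exactly the "failed to assign IP address" warnings described above.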

@orsenthil
Member

1.NotReady>2.Ready>3.NotReady>4.Ready pods are getting scheduled and stuck between 2. and 3.

This is strange. Do you have any logs that indicate why the node is getting marked as NotReady from the Ready state? This is key to understanding the behavior here. The pods were probably scheduled while the node was in the Ready state, but then it went to NotReady for some reason.

@youwalther65

youwalther65 commented Sep 5, 2024

@runningman84 @tooptoop4 if you have additional logs please send them to [email protected].
For collecting the logs, use the admin container and either chroot or sheltie (based on this issue here) to get:

  • journalctl -xeu kubelet
  • journalctl -xeu containerd
  • content of /var/log/aws-routed-eni/ipamd.log
  • content of /var/log/aws-routed-eni/plugin.log

@orsenthil Anything else? Please advise!

Thank you!

@tooptoop4

@youwalther65

kubelet logs for the affected node:

I0904 14:03:22.313659    2015 log.go:245] http: TLS handshake error from 127.0.0.1:57056: no serving certificate available for the kubelet"}
I0904 14:03:23.058421    2015 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"auth/fluent-bit-n4cvm\" podStartSLOduration=2.7230901039999997 podStartE2EDuration=\"6.058375092s\" podCreationTimestamp=\"2024-09-04 14:03:17 +0000 UTC\" firstStartedPulling=\"2024-09-04 14:03:18.740862166 +0000 UTC m=+13.605813629\" lastFinishedPulling=\"2024-09-04 14:03:22.076147154 +0000 UTC m=+16.941098617\" observedRunningTime=\"2024-09-04 14:03:23.056421468 +0000 UTC m=+17.921372941\" watchObservedRunningTime=\"2024-09-04 14:03:23.058375092 +0000 UTC m=+17.923326545\""}
I0904 14:03:26.198922    2015 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: ebs.csi.aws.com endpoint: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock versions: 1.0.0"}
I0904 14:03:26.199009    2015 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: ebs.csi.aws.com at endpoint: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock"}
I0904 14:03:27.072469    2015 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"kube-system/ebs-csi-node-lcm6l\" podStartSLOduration=14.045209864 podStartE2EDuration=\"21.072403703s\" podCreationTimestamp=\"2024-09-04 14:03:06 +0000 UTC\" firstStartedPulling=\"2024-09-04 14:03:19.39328969 +0000 UTC m=+14.258241143\" lastFinishedPulling=\"2024-09-04 14:03:26.420483529 +0000 UTC m=+21.285434982\" observedRunningTime=\"2024-09-04 14:03:27.070570631 +0000 UTC m=+21.935522104\" watchObservedRunningTime=\"2024-09-04 14:03:27.072403703 +0000 UTC m=+21.937355156\""}
I0904 14:03:27.081623    2015 topology_manager.go:215] \"Topology Admit Handler\" podUID=\"6b9f76aa-3164-4a7c-87be-25091855c916\" podNamespace=\"auth\" podName=\"redactmypodtooearly\""}
I0904 14:03:27.172891    2015 reconciler_common.go:258] \"operationExecutor.VerifyControllerAttachedVolume started for volume \\\"var-run-argo\\\" (UniqueName: \\\"kubernetes.io/empty-dir/6b9f76aa-3164-4a7c-87be-25091855c916-var-run-argo\\\") pod \\\"redactmypodtooearly\\\" (UID: \\\"6b9f76aa-3164-4a7c-87be-25091855c916\\\") \" pod=\"auth/redactmypodtooearly\""}
I0904 14:03:27.172985    2015 reconciler_common.go:258] \"operationExecutor.VerifyControllerAttachedVolume started for volume \\\"redactacm-sh\\\" (UniqueName: \\\"kubernetes.io/configmap/6b9f76aa-3164-4a7c-87be-25091855c916-redactacm-sh\\\") pod \\\"redactmypodtooearly\\\" (UID: \\\"6b9f76aa-3164-4a7c-87be-25091855c916\\\") \" pod=\"auth/redactmypodtooearly\""}
E0904 14:03:27.395878    2015 kubelet_pods.go:417] \"Hostname for pod was too long, truncated it\" podName=\"redactmypodtooearly\" hostnameMaxLen=63 truncatedHostname=\"redactmypodtooearly\""}
E0904 14:03:36.349734    2015 cadvisor_stats_provider.go:501] \"Partial failure issuing cadvisor.ContainerInfoV2\" err=\"partial failures: [\\\"/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podfb3b1b0e_c2b1_43a9_9fb4_61f90fccbe2e.slice/cri-containerd-ec9066265bcb1a93f47607cabf6defe2be7b46dccd9620fec39caf57efe0707e.scope\\\": RecentStats: unable to find data in memory cache]\""}
I0904 14:03:37.104227    2015 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"kube-system/node-configuration-dqmn7\" podStartSLOduration=18.792128877 podStartE2EDuration=\"31.104161659s\" podCreationTimestamp=\"2024-09-04 14:03:06 +0000 UTC\" firstStartedPulling=\"2024-09-04 14:03:23.705144657 +0000 UTC m=+18.570096110\" lastFinishedPulling=\"2024-09-04 14:03:36.017177439 +0000 UTC m=+30.882128892\" observedRunningTime=\"2024-09-04 14:03:37.097340999 +0000 UTC m=+31.962292482\" watchObservedRunningTime=\"2024-09-04 14:03:37.104161659 +0000 UTC m=+31.969113122\""}
time=\"2024-09-04T14:03:38Z\" level=info msg=\"Image exists, fetching cached image from image store\" ref=\"ecr.aws/arn:aws:ecr:ap-southeast-2:602401143452:repository/eks/pause:3.1-eksbuild.1\"","_PID":"5377","_SYSTEMD_INVOCATION_ID":"314655562d714198a878006d98ee77b4"}
time=\"2024-09-04T14:03:38Z\" level=info msg=\"tagging image\" img=\"602401143452.dkr.ecr.ap-southeast-2.amazonaws.com/eks/pause:3.1-eksbuild.1\""}
Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.","_COMM":"kubelet","_EXE":"/x86_64-bottlerocket-linux-gnu/sys-root/usr/bin/kubelet","_CMDLINE":"/usr/bin/kubelet --cloud-provider external --kubeconfig /etc/kubernetes/kubelet/kubeconfig --config /etc/kubernetes/kubelet/config --container-runtime-endpoint=unix:///run/containerd/containerd.sock --containerd=/run/containerd/containerd.sock --root-dir /var/lib/kubelet --cert-dir /var/lib/kubelet/pki --image-credential-provider-bin-dir /usr/libexec/kubernetes/kubelet/plugins --image-credential-provider-config /etc/kubernetes/kubelet/credential-provider-config.yaml --hostname-override redactmyip.ap-southeast-2.compute.internal --node-ip redactmyipfmt --node-labels eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup=redactng,eks.amazonaws.com/nodegroup-image=redactami,eks.amazonaws.com/sourceLaunchTemplateId=redactlt,eks.amazonaws.com/sourceLaunchTemplateVersion=1,type=large --register-with-taints \"\" --pod-infra-container-image 602401143452.dkr.ecr.ap-southeast-2.amazonaws.com/eks/pause:3.1-eksbuild.1 --runtime-cgroups=/runtime.slice/containerd.service","_SYSTEMD_INVOCATION_ID":"314655562d714198a878006d98ee77b4","_STREAM_ID":"897e8d904fab461fa30b20d6e0876866","_PID":"5413"}
Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.","_SYSTEMD_INVOCATION_ID":"314655562d714198a878006d98ee77b4","_STREAM_ID":"897e8d904fab461fa30b20d6e0876866","_PID":"5413"}
Flag --register-with-taints has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.","_SYSTEMD_INVOCATION_ID":"314655562d714198a878006d98ee77b4","_STREAM_ID":"897e8d904fab461fa30b20d6e0876866","_PID":"5413"}
Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.","_SYSTEMD_INVOCATION_ID":"314655562d714198a878006d98ee77b4","_STREAM_ID":"897e8d904fab461fa30b20d6e0876866","_PID":"5413"}
I0904 14:03:38.957155    5413 server.go:204] \"--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime\""}
I0904 14:03:38.962790    5413 server.go:487] \"Kubelet version\" kubeletVersion=\"v1.29.5-eks-1109419\""}
I0904 14:03:38.962860    5413 server.go:489] \"Golang settings\" GOGC=\"\" GOMAXPROCS=\"\" GOTRACEBACK=\"\""}
I0904 14:03:38.966945    5413 dynamic_cafile_content.go:157] \"Starting controller\" name=\"client-ca-bundle::/etc/kubernetes/pki/ca.crt\""}
I0904 14:03:38.983684    5413 container_manager_linux.go:265] \"Container manager verified user specified cgroup-root exists\" cgroupRoot=[]"}
I0904 14:03:38.983990    5413 container_manager_linux.go:270] \"Creating Container Manager object based on Node Config\" nodeConfig={\"RuntimeCgroupsName\":\"/runtime.slice/containerd.service\",\"SystemCgroupsName\":\"\",\"KubeletCgroupsName\":\"\",\"KubeletOOMScoreAdj\":-999,\"ContainerRuntime\":\"\",\"CgroupsPerQOS\":true,\"CgroupRoot\":\"/\",\"CgroupDriver\":\"systemd\",\"KubeletRootDir\":\"/var/lib/kubelet\",\"ProtectKernelDefaults\":true,\"KubeReservedCgroupName\":\"/runtime\",\"SystemReservedCgroupName\":\"\",\"ReservedSystemCPUs\":{},\"EnforceNodeAllocatable\":{\"pods\":{}},\"KubeReserved\":{\"cpu\":\"90m\",\"ephemeral-storage\":\"1Gi\",\"memory\":\"739Mi\"},\"SystemReserved\":null,\"HardEvictionThresholds\":[{\"Signal\":\"memory.available\",\"Operator\":\"LessThan\",\"Value\":{\"Quantity\":\"100Mi\",\"Percentage\":0},\"GracePeriod\":0,\"MinReclaim\":null},{\"Signal\":\"nodefs.available\",\"Operator\":\"LessThan\",\"Value\":{\"Quantity\":null,\"Percentage\":0.1},\"GracePeriod\":0,\"MinReclaim\":null},{\"Signal\":\"nodefs.inodesFree\",\"Operator\":\"LessThan\",\"Value\":{\"Quantity\":null,\"Percentage\":0.05},\"GracePeriod\":0,\"MinReclaim\":null},{\"Signal\":\"imagefs.available\",\"Operator\":\"LessThan\",\"Value\":{\"Quantity\":null,\"Percentage\":0.15},\"GracePeriod\":0,\"MinReclaim\":null}],\"QOSReserved\":{},\"CPUManagerPolicy\":\"none\",\"CPUManagerPolicyOptions\":null,\"TopologyManagerScope\":\"container\",\"CPUManagerReconcilePeriod\":10000000000,\"ExperimentalMemoryManagerPolicy\":\"None\",\"ExperimentalMemoryManagerReservedMemory\":null,\"PodPidsLimit\":1048576,\"EnforceCPULimits\":true,\"CPUCFSQuotaPeriod\":100000000,\"TopologyManagerPolicy\":\"none\",\"TopologyManagerPolicyOptions\":null}"}
I0904 14:03:38.984045    5413 topology_manager.go:138] \"Creating topology manager with none policy\""}
I0904 14:03:38.984060    5413 container_manager_linux.go:301] \"Creating device plugin manager\""}
I0904 14:03:38.984104    5413 state_mem.go:36] \"Initialized new in-memory state store\""}
I0904 14:03:38.984248    5413 kubelet.go:396] \"Attempting to sync node with API server\""}
I0904 14:03:38.984285    5413 kubelet.go:301] \"Adding static pod path\" path=\"/etc/kubernetes/static-pods/\""}
I0904 14:03:38.984350    5413 kubelet.go:312] \"Adding apiserver pod source\""}
I0904 14:03:38.984370    5413 apiserver.go:42] \"Waiting for node sync before watching apiserver pods\""}
I0904 14:03:38.985350    5413 kuberuntime_manager.go:258] \"Container runtime initialized\" containerRuntime=\"containerd\" version=\"1.6.34+bottlerocket\" apiVersion=\"v1\""}
I0904 14:03:38.986115    5413 certificate_store.go:130] Loading cert/key pair from \"/var/lib/kubelet/pki/kubelet-server-current.pem\"."}
I0904 14:03:38.986545    5413 kubelet.go:809] \"Not starting ClusterTrustBundle informer because we are in static kubelet mode\""}
I0904 14:03:38.987073    5413 server.go:1256] \"Started kubelet\""}
I0904 14:03:38.987222    5413 server.go:162] \"Starting to listen\" address=\"0.0.0.0\" port=10250"}
I0904 14:03:38.988733    5413 ratelimit.go:55] \"Setting rate limiting for endpoint\" service=\"podresources\" qps=100 burstTokens=10"}
I0904 14:03:38.989829    5413 server.go:233] \"Starting to serve the podresources API\" endpoint=\"unix:/var/lib/kubelet/pod-resources/kubelet.sock\""}
I0904 14:03:38.991602    5413 fs_resource_analyzer.go:67] \"Starting FS ResourceAnalyzer\""}
I0904 14:03:38.992840    5413 volume_manager.go:291] \"Starting Kubelet Volume Manager\""}
I0904 14:03:38.993006    5413 server.go:461] \"Adding debug handlers to kubelet server\""}
I0904 14:03:38.993240    5413 desired_state_of_world_populator.go:151] \"Desired state populator starts to run\""}
I0904 14:03:38.996540    5413 factory.go:221] Registration of the systemd container factory successfully"}
I0904 14:03:38.997429    5413 factory.go:219] Registration of the crio container factory failed: Get \"http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info\": dial unix /var/run/crio/crio.sock: connect: no such file or directory"}
I0904 14:03:38.998756    5413 dynamic_cafile_content.go:171] \"Shutting down controller\" name=\"client-ca-bundle::/etc/kubernetes/pki/ca.crt\""}

One of my pods got allocated to this node by Argo Workflows at 14:03:26 and got stuck.

@youwalther65

@tooptoop4 Thank you, I see. But we need the other logs mentioned here for this timeframe as well!
