
Cilium sometimes ends up in a failed state unable to contact K8s API Server #592

Closed
playworker opened this issue Aug 8, 2024 · 3 comments


@playworker

Summary

Sometimes, when things go wrong with the k8s cluster, Cilium ends up in a failed state and I don't know how to recover it.

The Cilium failure looks a lot like this: cilium/cilium#20679

The Cilium Operator and the DaemonSet pods are trying to contact the K8s API Server but can't. I don't believe the IP address they're trying is correct; it's a 10.x.x.x address.
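
To check which address the agents were actually given, one option (assuming a standard Cilium install that stores the override in the cilium-config ConfigMap; these keys may be absent if no override was set) is:

kubectl -n kube-system get configmap cilium-config -o yaml | grep -E 'k8s-service-(host|port)'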

I've not been able to recover from this error, but I haven't really tried much. Disabling the network using the k8s CLI tool doesn't have any effect.

What Should Happen Instead?

I'm not sure what the underlying issue is. If it is an issue with the API Server IP address being wrong, then I guess that needs to be set correctly somehow.

Reproduction Steps

The most recent time this happened, I had set the containerd_custom_registries setting to a bad value; it included a stray semicolon in the middle of the string:

juju config k8s containerd_custom_registries='[{"url": "https://hostname";, "host": "host:4567", "username": "user", "password": "pass"}]'

I corrected the setting, but the k8s cluster in Juju ended up in an errored state, and the Cilium Operator and pods ended up in the situation described above. I managed to recover the k8s units in Juju by downgrading the release and then bumping it back up again, but I am unable to recover the Cilium installation back to a working state.
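
For reference, the corrected value should presumably be the same string with the stray semicolon after the URL removed (hostname, host:4567, user and pass are placeholders):

juju config k8s containerd_custom_registries='[{"url": "https://hostname", "host": "host:4567", "username": "user", "password": "pass"}]'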

System information

inspection-report-20240808_102202.tar.gz

Can you suggest a fix?

No response

Are you interested in contributing with a fix?

No response

@playworker changed the title from "Cilium sometimes ends up in a failed state" to "Cilium sometimes ends up in a failed state unable to contact K8s API Server" on Aug 8, 2024
@mateoflorido
Member

Hello @playworker,
We are aware of this issue and are currently working on a fix. In the meantime, here are a couple of workarounds we've tested to temporarily fix the issue:

  • Run the following command on the affected node (an example invocation via Juju is shown after this list):
    /opt/cni/bin/cilium-dbg cleanup --all-state --force
  • Restart the affected node.
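
A rough sketch of running the first workaround on a Canonical Kubernetes deployment (the unit name k8s/0 is an assumption; repeat for each affected unit):

juju ssh k8s/0 -- sudo /opt/cni/bin/cilium-dbg cleanup --all-state --force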

@marcofranssen

marcofranssen commented Sep 10, 2024

I'm experiencing this issue as well on my EKS cluster.

I configured Cilium 1.16.1 as follows:

helm upgrade --install --create-namespace --namespace kube-system cilium cilium/cilium \
        --values cilium-bootstrap-values.yaml \
        --set cluster.id=1 \
        --set cluster.name="$cluster_name" \
        --set eni.iamRole="$cilium_role_arn" \
        --set "serviceAccounts.operator.annotations.eks\.amazonaws\.com/role-arn"="$cilium_role_arn" \
        --set k8sServiceHost="$cluster_api_endpoint"

The important part is k8sServiceHost, which I pointed at my EKS cluster API endpoint.
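
For completeness, $cluster_api_endpoint was populated roughly like this (a sketch; it assumes the AWS CLI is configured for the cluster's account and that Cilium wants the endpoint host without the https:// scheme):

cluster_api_endpoint="$(aws eks describe-cluster --name "$cluster_name" \
        --query 'cluster.endpoint' --output text | sed 's|^https://||')"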

In my case this results in nodes being continuously destroyed and recreated by Karpenter, which is our autoscaler.

@eaudetcobello
Contributor

This is resolved in the latest version of the snap.
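
Assuming this refers to the Canonical k8s snap, picking up the fix should be a matter of refreshing the snap on each node, for example:

sudo snap refresh k8s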
