
Hetzner Cloud Control manager not connecting with Hetzner #663

Closed
abishekas opened this issue Jun 12, 2024 · 6 comments
Labels: bug (Something isn't working), stale

Comments

@abishekas

TL;DR

Hi Everyone,

We are planning to move our production environment to Hetzner Cloud, so we provisioned a self-managed Kubernetes cluster on Hetzner servers for our project. To connect the cluster with Hetzner, we deployed the Hetzner cloud controller manager by following the document below.

https://community.hetzner.com/tutorials/install-kubernetes-cluster#:~:text=Now%20deploy%20the%20Hetzner%20Cloud%20controller%20manager%20into%20the%20cluster

Expected behavior

We deployed this in March 2024 and everything worked as expected until yesterday. Today, however, when we create a new server in the Hetzner Cloud console and add it to the same cluster, the hcloud providerID and the region topology labels are not added to that node. We also use ingress-nginx as a LoadBalancer for this setup; normally, applying ingress-nginx automatically connects it with the load balancer in the cloud, but as of today that connection is not working either.

Observed behavior

We tried to troubleshoot this using the logs from the Hetzner cloud controller manager, but we couldn't see any errors in them. I'm sharing the log data below for reference. We also tried provisioning a new setup to see if that works, but we ran into the same issue. We verified network connectivity from our server to Hetzner Cloud through API calls and PING requests, and it works fine. We even created a new setup in another region, but the issue still persists.

We have planned our production migration for this weekend, so any quick help would be greatly appreciated. Thanks.

Minimal working example

No response

Log output

I0612 17:13:22.834269       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:13:22.834282       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:13:52.838954       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:13:52.839004       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:13:52.839018       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:13:52.839030       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:13:56.277820       1 load_balancers.go:137] "ensure Load Balancer" op="hcloud/loadBalancers.EnsureLoadBalancer" service="ingress-nginx-controller" nodes=["jenkins-server","postgresql-testing","hcloud-owrker"]
I0612 17:13:56.277968       1 event.go:307] "Event occurred" object="ingress-nginx/ingress-nginx-controller" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0612 17:13:56.777461       1 load_balancer.go:820] "update service" op="hcops/LoadBalancerOps.ReconcileHCLBServices" port=80 loadBalancerID=1798225
I0612 17:13:57.567504       1 load_balancer.go:820] "update service" op="hcops/LoadBalancerOps.ReconcileHCLBServices" port=443 loadBalancerID=1798225
E0612 17:13:58.576626       1 controller.go:298] error processing service ingress-nginx/ingress-nginx-controller (retrying with exponential backoff): failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): 
I0612 17:13:58.576714       1 event.go:307] "Event occurred" object="ingress-nginx/ingress-nginx-controller" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: hcloud/loadBalancers.EnsureLoadBalancer: hcops/LoadBalancerOps.ReconcileHCLBTargets: providerID does not have one of the the expected prefixes (hcloud://, hrobot://, hcloud://bm-): "
I0612 17:14:22.848432       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:14:22.848502       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:14:22.848524       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:14:22.848543       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:14:52.888746       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:14:52.888807       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:14:52.888831       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:14:52.888856       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:15:22.824760       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:15:22.824815       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:15:22.824830       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:15:22.824843       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:15:52.824370       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:15:52.824415       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:15:52.824433       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:15:52.824448       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:16:22.965547       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:16:22.965596       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:16:22.965618       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:16:22.965637       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:16:52.946044       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:16:52.946077       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:16:52.946092       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:16:52.946105       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
I0612 17:17:22.863320       1 route_controller.go:216] action for Node "hcloud-owrker" with CIDR "10.244.4.0/24": "keep"
I0612 17:17:22.863365       1 route_controller.go:216] action for Node "jenkins-server" with CIDR "10.244.1.0/24": "keep"
I0612 17:17:22.863382       1 route_controller.go:216] action for Node "master" with CIDR "10.244.0.0/24": "keep"
I0612 17:17:22.863402       1 route_controller.go:216] action for Node "postgresql-testing" with CIDR "10.244.2.0/24": "keep"
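The key line in the log above is the ReconcileHCLBTargets error: HCCM refuses to add a node as a load-balancer target when its spec.providerID does not carry one of the known prefixes. A minimal shell sketch of that prefix check, under the assumption that it is a simple prefix match (the provider ID value below is made up):

```shell
# Sketch of the providerID prefix check that fails in the log above. HCCM
# accepts IDs starting with hcloud://, hrobot://, or hcloud://bm- (the last
# is already covered by the hcloud://* pattern). A node that was never
# initialized by a cloud provider has an empty providerID.
provider_id="hcloud://12345678"   # made-up example value

case "$provider_id" in
  hcloud://*|hrobot://*) echo "providerID accepted: $provider_id" ;;
  *)                     echo "providerID rejected: '$provider_id'" ;;
esac
```

Note the trailing blank after the colon in the error message: the rejected providerID was empty, which is consistent with a node that was never initialized by the cloud controller manager.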

Additional information

No response

@abishekas abishekas added the bug Something isn't working label Jun 12, 2024
@apricote
Member

Hey @abishekas,

first off, there was an incident in our API yesterday around 17:00-17:40 CEST which might have been the cause for this. Can you try again today?

If it still does not work:

  • Can you send us logs of HCCM from 5 minutes before to 5 minutes after you add the node?
  • Can you send us the output of kubectl get node <your-broken-node> -o yaml?
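For the second item, the relevant field is spec.providerID. An illustrative way to pull it out, with a sample node manifest inlined so the commands are self-contained (against a live cluster you would feed in the output of kubectl get node <name> -o yaml instead; the node name, label, and ID below are all made up):

```shell
# Illustrative: extract spec.providerID from a node manifest.
# All values in this manifest are sample data, not from a real cluster.
cat > /tmp/node.yaml <<'EOF'
apiVersion: v1
kind: Node
metadata:
  name: sample-node
  labels:
    topology.kubernetes.io/region: sample-region
spec:
  providerID: hcloud://12345678
EOF

# On an affected node this prints nothing, because the field is missing.
awk '/providerID:/ {print $2}' /tmp/node.yaml
# -> hcloud://12345678
```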

@apricote apricote self-assigned this Jun 13, 2024
@vigneshb118

Hey @apricote ,

@abishekas and I are on the same team. Here are the logs you requested.
hccm-logs.txt

Screenshot:
Screenshot 2024-06-13 at 1 33 39 PM

P.S.: Below is the error I get when I try to install the Kubernetes packages on a new server inside the old project where I faced the actual issue yesterday:

curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
curl: (22) The requested URL returned error: 403
gpg: no valid OpenPGP data found.

So I created a new project and tried again by adding a new server there, because on the older machine I kept getting this error while installing the Kubernetes packages.

@apricote
Member

The node does not have the uninitialized taint that HCCM expects. Are you sure you started the kubelet on that node with --cloud-provider=external? HCCM will only "adopt" the node if that taint is set.

You can try to re-add the taint with kubectl taint node master node.cloudprovider.kubernetes.io/uninitialized:NoSchedule
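One way to verify the kubelet flag mentioned above: on the node itself you might inspect the live process, e.g. with ps -o args= -C kubelet. In this sketch the argument string is inlined as a made-up sample so the pipeline is self-contained:

```shell
# Hypothetical check: does the kubelet command line include
# --cloud-provider=external? The argument string below is a sample;
# on a real node you would capture it from the running kubelet process.
kubelet_args="--kubeconfig=/etc/kubernetes/kubelet.conf --cloud-provider=external"

# Split args onto separate lines and look for the flag.
echo "$kubelet_args" | tr ' ' '\n' | grep -- '--cloud-provider'
# -> --cloud-provider=external
```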


curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
curl: (22) The requested URL returned error: 403
gpg: no valid OpenPGP data found.

This sounds like your IP is blocked by pkgs.k8s.io. This unfortunately happens from time to time, and you will need to try with another IP. We recommend mirroring all assets you need for your production infrastructure to local services; you cannot rely on pkgs.k8s.io being available at all times. See this thread for previous discussions of this topic: kubernetes/registry.k8s.io#138

@vigneshb118

Hi @apricote ,
Thanks for your valuable response. We always bring up our nodes with the --cloud-provider=external flag in the kubelet configuration, and the taint is already present on my master machines; I'm attaching a screenshot for your reference.

Screenshot 2024-06-13 at 10 48 33 PM

Today around 13:30 UTC+0 we saw maintenance work on the cloud API and cloud console on Hetzner's side, and after that our clusters were able to connect with the Hetzner cloud again. I am attaching a screenshot of that maintenance window for your reference.

Screenshot 2024-06-13 at 10 48 41 PM

It was resolved right after the maintenance window; we're not sure if anything changed on your end. We want this setup to be future-proof, so as a precaution, do you have any suggestions in case such an issue happens again?

@apricote
Member

Good to hear that everything works now.

I am not really sure what the issue was, so I do not have any suggestions on what you can improve for the future.

If you ever encounter issues again, you can run HCCM with the environment variable HCLOUD_DEBUG=true and the flag -v=5 to get far more detailed logs.
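A configuration sketch of how those two settings could be applied to a running deployment. The deployment name and namespace below (hcloud-cloud-controller-manager in kube-system) are assumed from the default hcloud-ccm manifests; verify them against your own installation before running anything:

```shell
# Configuration sketch only: assumed deployment name and namespace,
# check yours with `kubectl -n kube-system get deployments` first.
kubectl -n kube-system set env deployment/hcloud-cloud-controller-manager \
  HCLOUD_DEBUG=true

# Append -v=5 to the container args (assumes a single container whose
# args list already exists in the pod spec).
kubectl -n kube-system patch deployment hcloud-cloud-controller-manager \
  --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"-v=5"}]'
```

Remember to revert both changes afterwards, as debug output can include sensitive request details and is very verbose.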

@github-actions (bot)
Contributor

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

@github-actions github-actions bot added the stale label Sep 12, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 13, 2024