Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ELB health check fails with Kubernetes >=v1.30.x #5139

Open
dkoshkin opened this issue Oct 7, 2024 · 14 comments
Open

ELB health check fails with Kubernetes >=v1.30.x #5139

dkoshkin opened this issue Oct 7, 2024 · 14 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone

Comments

@dkoshkin
Copy link
Contributor

dkoshkin commented Oct 7, 2024

/kind bug

What steps did you take and what happened:
Follow the quickstart documentation with Kubernetes v1.30.5 and a custom built AMI (the public AMIs are missing for that version and the default v1.31.0 version).
The ELB Health Check fails and the cluster is stuck after creating the first control-plane instance. The AWS console shows that 0 of 1 instanced are in service.

What did you expect to happen:
The defaults should result in a working cluster.

Anything else you would like to add:

  1. Changing the health check to TCP in the AWS console did fix the check, but this update is not allowed by a webhook here and even after removing the webhook, the new value from AWSCluster never got updated.

  2. Setting this on the apiserver and other control-plane components allowed the ELB health check to pass

tls-cipher-suites: ...,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_3DES_EDE_CBC_SHA
  1. Using an NLB loadbalancer works
  controlPlaneLoadBalancer:
    loadBalancerType: nlb

Some discussion about this in the Kuberentes slack https://kubernetes.slack.com/archives/C3QUFP0QM/p1726622974749509

Environment:

  • Cluster-api-provider-aws version: 2.6.1
  • Kubernetes version: (use kubectl version): v1.30.5
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 7, 2024
@AndiDog
Copy link
Contributor

AndiDog commented Oct 10, 2024

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Oct 10, 2024
@richardcase
Copy link
Member

I'm running into this as well.

@richardcase
Copy link
Member

/priority critical-urgent
/milestone v2.7.0

@k8s-ci-robot k8s-ci-robot added this to the v2.7.0 milestone Oct 15, 2024
@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Oct 15, 2024
@richardcase
Copy link
Member

This was discussed at the office hours 14th October 2024. The summary is that:

  • We should update the templates, docs (and at releases) notes on how to explicitly specify the tls cipher suites to ythe kube components. Also, recommend that new clusters consider using nlb instead of classic elb.
  • In the future with an API version bump we could consider making nlb the default.

@richardcase
Copy link
Member

/help

@k8s-ci-robot
Copy link
Contributor

@richardcase:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Oct 15, 2024
@richardcase
Copy link
Member

/milestone v2.8.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v2.7.0, v2.8.0 Oct 28, 2024
@richardcase
Copy link
Member

/milestone v2.7.2

@k8s-ci-robot k8s-ci-robot modified the milestones: v2.8.0, v2.7.2 Oct 29, 2024
@richardcase
Copy link
Member

I tried setting the tls cipher suites but that didn't work:

https://gist.github.com/richardcase/47118a404bc832904c399ba1360462f2

@dkoshkin
Copy link
Contributor Author

@richardcase I wasn't able to just apply your spec directly because of some IAM issues, but was able to create by explicitly setting this public AMI:

spec:
  ami:
    id: ami-0797a44e8719c5e53

AWSCluster

  controlPlaneLoadBalancer:
    crossZoneLoadBalancing: false
    healthCheckProtocol: HTTPS
    loadBalancerType: classic
    scheme: internet-facing

and KCP:

    initConfiguration:
      localAPIEndpoint: {}
      nodeRegistration:
        imagePullPolicy: IfNotPresent
        kubeletExtraArgs:
          cloud-provider: external
          tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_3DES_EDE_CBC_SHA
        name: '{{ ds.meta_data.local_hostname }}'
    joinConfiguration:
      discovery: {}
      nodeRegistration:
        imagePullPolicy: IfNotPresent
        kubeletExtraArgs:
          cloud-provider: external
          tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_3DES_EDE_CBC_SHA
        name: '{{ ds.meta_data.local_hostname }}'

@AndiDog
Copy link
Contributor

AndiDog commented Nov 13, 2024

I got it working with this additional argument in the template (I'm using cluster-template-flatcar-machinepool.yaml but that shouldn't matter):

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: "${CLUSTER_NAME}-control-plane"
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          cloud-provider: external

          # This is needed for Kubernetes v1.30+ since else it uses the Go defaults which don't
          # work with AWS classic load balancers, see
          # https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5139. If you use
          # another load balancer type such as NLB, this is not needed.
          #
          # The list consists of the secure ciphers from Go 1.23.3, plus some less secure
          # RSA ciphers which the AWS classic load balancer instance health check supports.
          tls-cipher-suites: TLS_AES_128_GCM_SHA256,TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA

Do we really want to hardcode these less secure settings in the template? This makes it very likely for users to blindly take it over. I'm rather thinking of other options:

  • Explicitly set NLB as load balancer type in templates, making it easier for users to make the switch later once we make breaking changes to the API specs. However, these template changes should also go into new 2.x release, not a patch release. Document what needs to be done for ELBs. This gives some chance of silent breakage for users (different LB type than expected; what happens if they apply the changed manifest on existing clusters...).
  • Only create a documentation page about ELBs. Hard to figure out for users what's wrong.
  • (My preference:) Perform the correction in CAPA code with an if block. This is the easiest way to make the defaults work immediately without changes by the users, and also it's easy to rip it out later if AWS should improve TLS cipher support in ELB instance health checks.

@AndiDog
Copy link
Contributor

AndiDog commented Nov 25, 2024

We talked in the office hours to consider switching the default type to NLB.

Next steps:

  • Test what happens when the LB type changes (downtime?). The webhook code currently doesn't forbid a change of that field.
  • Test use of secondaryControlPlaneLoadBalancer or other alternatives to do a migration

@k8s-triage-robot
Copy link

This issue is labeled with priority/critical-urgent but has not been updated in over 30 days, and should be re-triaged.
Critical-urgent issues must be actively worked on as someone's top priority right now.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority {important-soon, important-longterm, backlog}
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Dec 25, 2024
@yoctozepto
Copy link

Next steps:

  • Test what happens when the LB type changes (downtime?). The webhook code currently doesn't forbid a change of that field.

The reconciler fails afterwards. Looks like a change in the loadbalancer naming scheme has happened as well (test-apiserver vs default-test-apiserver):

	could not register machine to load balancer: could not register control plane instance "i-03312db3fdf64ffea" with load balancer - error determining registration status: error describing ELB "default-test-apiserver": LoadBalancerNotFound: Load balancers '[default-test-apiserver]' not found
		status code: 400, request id: dffbc514-8563-4761-980d-d1cbfad3a6e7
 >
E0116 13:03:59.918681       1 controller.go:324] "Reconciler error" err=<
	could not register machine to load balancer: could not register control plane instance "i-03312db3fdf64ffea" with load balancer - error determining registration status: error describing ELB "default-test-apiserver": LoadBalancerNotFound: Load balancers '[default-test-apiserver]' not found
		status code: 400, request id: dffbc514-8563-4761-980d-d1cbfad3a6e7
 > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/test-control-plane-2dcjf" namespace="default" name="test-control-plane-2dcjf" reconcileID="1a2929bf-1ca0-49e9-af28-4d9394c84718"
E0116 13:04:00.700097       1 awsmachine_controller.go:628] "failed to reconcile LB attachment" err=<
	could not register machine to load balancer: could not register control plane instance "i-03312db3fdf64ffea" with load balancer - error determining registration status: error describing ELB "default-test-apiserver": LoadBalancerNotFound: Load balancers '[default-test-apiserver]' not found
		status code: 400, request id: ce6b72c5-f4bf-4720-bf3f-125b1549dac3
 >
E0116 13:04:00.705525       1 controller.go:324] "Reconciler error" err=<
	could not register machine to load balancer: could not register control plane instance "i-03312db3fdf64ffea" with load balancer - error determining registration status: error describing ELB "default-test-apiserver": LoadBalancerNotFound: Load balancers '[default-test-apiserver]' not found
		status code: 400, request id: ce6b72c5-f4bf-4720-bf3f-125b1549dac3
 > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/test-control-plane-2dcjf" namespace="default" name="test-control-plane-2dcjf" reconcileID="a4ded9f9-337f-49d6-aa8f-4a0c33305763"
E0116 13:04:01.003343       1 awscluster_controller.go:279] "failed to reconcile load balancer" err="no loadbalancer exists for the AWSCluster test, the cluster has become unrecoverable and should be deleted manually: no load balancer found with name: \"default-test-apiserver\"" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/test" namespace="default" name="test" reconcileID="3b022292-e663-444c-b59c-945484e1649b" cluster="default/test"
E0116 13:04:01.040211       1 controller.go:324] "Reconciler error" err="no loadbalancer exists for the AWSCluster test, the cluster has become unrecoverable and should be deleted manually: no load balancer found with name: \"default-test-apiserver\"" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/test" namespace="default" name="test" reconcileID="3b022292-e663-444c-b59c-945484e1649b"

Even if it did work, it would break existing clients as there is no way to convert ELB, just create a new one and then manually (or with the help of DNS, think Route 53) migrate the existing clients.


Test use of secondaryControlPlaneLoadBalancer or other alternatives to do a migration

This looks interesting as the enabler of the proper way forward in my opinion (read on).


Since the working (NLB) alternative uses the TCP healthcheck by default - why not make the default classic ELB use TCP healthchecks by default as a current workaround? I can agree that the verification of SSL/TLS is good beyond just TCP but still not as comprehensive as proper HTTPS healthcheck and so this default change does not really dumb down the defaults much in my opinion.

Note there are no raw SSL/TLS checks in non-classic ELBs.


I see the way forward would be to migrate to NLB with HTTPS checks as they should be the most comprehensive (and working), i.e.:

spec:
  controlPlaneLoadBalancer:
    loadBalancerType: nlb
    healthCheckProtocol: HTTPS

See also notes below.


Other notes:

  • switching from NLB to ALB in place is not possible either [expected]
  • another bug: ALB fails to reconcile because the code creates target groups only with TCP protocol
    • what is worse, it spams new and new target groups - I ended up with 200+ (!) after a while of leaving it running - it did not clean them up after removing the cluster either
  • HTTPS works as the healthCheckProtocol for NLB (apart from TCP in case of classic and NLB)
    • note though that healthCheckProtocol is marked immutable so this needs a deletion and recreation of the cluster resource
      • but there is nothing on AWS side that mandates having this immutable - lifting this limitation would help with making switches
    • a minor bug: the HTTPS healthcheck seems to be accepting way too many codes as success, it should only really accept 200

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

No branches or pull requests

6 participants