Add cluster-api based cloudprovider #1866

frobware · 2019-04-05T14:52:50Z

This is a new cloudprovider implementation based on the cluster-api project.

This PR has been cut from openshift@b38dd11 and updated to reflect changes in autoscaler master (1.15); the openshift version is using 1.13. This implementation has been working well for many months and will scale up/down via a MachineSet or a MachineDeployment.

Known limitations:

- does not support scale from 0
- scale down is not currently atomic

Node groups are represented when a MachineSet or MachineDeployment has positive scaling values. The min/max values are encoded as annotations on the respective objects, for example:

apiVersion: cluster.k8s.io/v1alpha1
kind: MachineSet
metadata:
  annotations:
    cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
    cluster.k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
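To illustrate how the annotations above could be turned into scaling bounds, here is a minimal sketch in plain Go. It is not the code from this PR: `parseScalingBounds` and `errNotANodeGroup` are hypothetical names, and reading "positive scaling values" as `0 <= min <= max` with `max > 0` is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Annotation keys exactly as quoted in the PR description.
const (
	minSizeAnnotation = "cluster.k8s.io/cluster-api-autoscaler-node-group-min-size"
	maxSizeAnnotation = "cluster.k8s.io/cluster-api-autoscaler-node-group-max-size"
)

// errNotANodeGroup is a hypothetical sentinel error for objects that should
// not be treated as node groups.
var errNotANodeGroup = errors.New("object does not describe a node group")

// parseScalingBounds is a hypothetical helper that reads the min/max sizes
// from the annotations of a MachineSet or MachineDeployment.
func parseScalingBounds(annotations map[string]string) (minSize, maxSize int, err error) {
	minStr, hasMin := annotations[minSizeAnnotation]
	maxStr, hasMax := annotations[maxSizeAnnotation]
	if !hasMin || !hasMax {
		return 0, 0, errNotANodeGroup
	}
	if minSize, err = strconv.Atoi(minStr); err != nil {
		return 0, 0, fmt.Errorf("invalid min size %q: %w", minStr, err)
	}
	if maxSize, err = strconv.Atoi(maxStr); err != nil {
		return 0, 0, fmt.Errorf("invalid max size %q: %w", maxStr, err)
	}
	// Assumption: "positive scaling values" means 0 <= min <= max and max > 0.
	if minSize < 0 || maxSize <= 0 || minSize > maxSize {
		return 0, 0, errNotANodeGroup
	}
	return minSize, maxSize, nil
}

func main() {
	annotations := map[string]string{
		minSizeAnnotation: "1",
		maxSizeAnnotation: "10",
	}
	minSize, maxSize, err := parseScalingBounds(annotations)
	fmt.Println(minSize, maxSize, err)
}
```

Objects without both annotations are rejected rather than defaulted, so an unannotated MachineSet is simply not considered for autoscaling in this sketch.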

To map between nodes and machines we currently depend on the following annotation getting added to the node object, for example:

annotations:
    cluster.k8s.io/machine: "machine-namespace/machine-name"

We currently do this using a nodelink-controller but we have future plans to remove this and rely on the node.Spec.ProviderID value.
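As a rough illustration of the node-to-machine mapping, the sketch below splits the `cluster.k8s.io/machine` annotation value into a namespace and a name. `machineKeyForNode` is a hypothetical helper, and requiring exactly one `/` separator is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation key as quoted in the PR description.
const machineAnnotation = "cluster.k8s.io/machine"

// machineKeyForNode is a hypothetical helper that extracts the
// "namespace/name" machine reference from a node's annotations.
func machineKeyForNode(nodeAnnotations map[string]string) (namespace, name string, err error) {
	value, ok := nodeAnnotations[machineAnnotation]
	if !ok {
		return "", "", fmt.Errorf("node has no %s annotation", machineAnnotation)
	}
	// Assumption: the value always has the form "<namespace>/<name>".
	parts := strings.Split(value, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("malformed %s value %q", machineAnnotation, value)
	}
	return parts[0], parts[1], nil
}

func main() {
	node := map[string]string{machineAnnotation: "machine-namespace/machine-name"}
	namespace, name, err := machineKeyForNode(node)
	fmt.Println(namespace, name, err)
}
```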

For scale down the cloudprovider implementation annotates the machine object with:

annotations:
    cluster.k8s.io/delete-machine: <date>

and the machine controller will drain the node, delete the machine, then finally delete the node.
Using cluster.k8s.io/delete-machine will force the betterDelete deletion policy in the machineset controller. The default deletion policy is random but machines annotated with cluster.k8s.io/delete-machine will be deleted in preference.
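The marking step can be sketched as follows. This is an illustration only: `markForDeletion` and `exampleTime` are hypothetical names, and the RFC 3339 timestamp format is an assumption, since the description only says the value is a `<date>`.

```go
package main

import (
	"fmt"
	"time"
)

// Annotation key as quoted in the PR description.
const deleteMachineAnnotation = "cluster.k8s.io/delete-machine"

// markForDeletion is a hypothetical helper that returns a copy of a machine's
// annotations with the delete-machine annotation set. The input map is left
// untouched so callers can diff old and new state before updating the object.
func markForDeletion(annotations map[string]string, now time.Time) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	// Assumption: the <date> value is an RFC 3339 timestamp in UTC.
	out[deleteMachineAnnotation] = now.UTC().Format(time.RFC3339)
	return out
}

// exampleTime returns a fixed timestamp so the example output is stable.
func exampleTime() time.Time {
	return time.Date(2019, time.April, 5, 14, 52, 50, 0, time.UTC)
}

func main() {
	marked := markForDeletion(map[string]string{"example": "value"}, exampleTime())
	fmt.Println(marked[deleteMachineAnnotation])
}
```

After the annotation is written, the machine set's replica count still has to be reduced for the annotated machine to be picked for deletion; that step is outside this sketch.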

We plan to address the scale from 0 limitation using MachineClasses.
We plan to address scale down using strategies in the machine set controller.

frobware · 2019-04-25T09:34:38Z

As mentioned in the PR description we use an annotation to delete a specific machine during scale down. This is the corresponding cluster-api implementation: kubernetes-sigs/cluster-api#726

frobware · 2019-04-26T09:52:33Z

I did a PoC of scale from 0 in the openshift implementation: openshift#89. This PR is derived from the openshift implementation.

alvaroaleman

A couple of nits, but overall this just works, thanks for your work! :)))

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_machinedeployment.go

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup.go

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_provider.go

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_utils.go

k8s-ci-robot · 2019-04-29T12:33:10Z

The following users are mentioned in OWNERS file(s) but are not members of the kubernetes org.

@sig-scheduling
- cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/OWNERS
- cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/apis/config/OWNERS
@sig-scheduling-maintainers
- cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/OWNERS
- cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/apis/config/OWNERS
@api-reviewers
- cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/apis/config/OWNERS
@api-approvers
- cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/apis/config/OWNERS

damiandabrowski5 · 2019-05-28T17:30:26Z

Tested and works like a charm. frobware was extremely helpful with setting it up.

Unfortunately, at the moment it depends on nodelink-controller, and at some point openshift renamed cluster.k8s.io to machine.openshift.io, so you need to use a commit from before these changes (e87de64297ff7b9352b72d7512f1f50abf9c15a4) to get it working outside openshift. (Can't wait to abandon this dependency.)

frobware · 2019-05-28T17:35:44Z

> Unfortunately, at the moment it depends on nodelink-controller, and at some point openshift renamed cluster.k8s.io to machine.openshift.io, so you need to use a commit from before these changes (e87de64297ff7b9352b72d7512f1f50abf9c15a4) to get it working outside openshift. (Can't wait to abandon this dependency.)

Merging the following PRs from OpenShift will remove the nodelink-controller dependency:

~~UPSTREAM: <carry>: openshift: drop machine annotation linkage openshift/kubernetes-autoscaler#97~~
~~UPSTREAM: <carry>: openshift: use ProviderID for node lookup openshift/kubernetes-autoscaler#99~~
UPSTREAM: <carry>: openshift: prioritise search by Provider ID openshift/kubernetes-autoscaler#100 - Done in 8b32581

If I push these two commits into this PR you will have to manually accommodate the following PR to make openstack work again:

Add providerID kubernetes-sigs/cluster-api-provider-openstack#274

frobware · 2019-06-05T18:04:45Z

Scale down can fail unless you have PR #2096.

…1 alias This is largely to be consistent with other usages (in the community) but really to be at parity with the upstream PR [1] that uses this import alias already. This also makes it easier to backport changes made from openshift/autoscaler into upstream. [1] kubernetes#1866

embik · 2019-07-17T09:31:30Z

We're evaluating this feature and overall it looks quite solid with our cluster-api (v1alpha1) implementation. Is there any reason this PR gets very little feedback / approval?

hardikdr · 2020-03-10T02:27:02Z

/lgtm

These are copied to facilitate testing. They are not meant to reflect upstream clusterapi/v1alpha1 - in fact, fields have been removed. They are here to support the switch to unstructured types in the tests without having to rewrite all of the unit tests.

The autoscaler expects each provider's node group implementation of Nodes() to return the instances belonging to the group, regardless of whether they have become a Kubernetes node or not. This information is then used, for instance, to detect unregistered nodes: https://github.com/kubernetes/autoscaler/blob/bf3a9fb52e3214dff0bea5ef2b97f17ad00a7702/cluster-autoscaler/clusterstate/clusterstate.go#L307-L311
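A simplified sketch of that contract, using stand-in types rather than the real cloudprovider interfaces: `machine`, `instance`, `nodeGroupInstances`, and `unregistered` are all hypothetical names, and every machine is assumed to already carry a provider ID.

```go
package main

import "fmt"

// machine is a hypothetical stand-in for a cluster-api Machine.
type machine struct {
	providerID string
	nodeName   string // empty until the machine has registered as a node
}

// instance is a hypothetical stand-in for the autoscaler's instance type.
type instance struct {
	ID string
}

// nodeGroupInstances returns one instance per machine in the group, whether
// or not the machine has become a Kubernetes node yet.
func nodeGroupInstances(machines []machine) []instance {
	instances := make([]instance, 0, len(machines))
	for _, m := range machines {
		instances = append(instances, instance{ID: m.providerID})
	}
	return instances
}

// unregistered returns the instances that have no matching registered node,
// which is roughly how the full instance list can be used by the caller.
func unregistered(instances []instance, registeredIDs map[string]bool) []instance {
	var missing []instance
	for _, inst := range instances {
		if !registeredIDs[inst.ID] {
			missing = append(missing, inst)
		}
	}
	return missing
}

func main() {
	machines := []machine{
		{providerID: "test:///machine-a", nodeName: "node-a"},
		{providerID: "test:///machine-b"}, // still booting, no node yet
	}
	registered := map[string]bool{"test:///machine-a": true}
	fmt.Println(unregistered(nodeGroupInstances(machines), registered))
}
```

If the node group only reported machines that already had nodes, the second machine would never show up as unregistered.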

We index on providerID but it turns out that those values on node and machine are not always consistent. Some encode region, some do not, for example. This commit normalizes all values through the normalizedProviderString(). To ensure that we catch all places I've introduced a new type and made the find() functions take this new type in lieu of a string. Unit tests have also been adjusted to introduce a 'test:///' prefix on the providerID value to further validate the change. This change allows CAPI to work out-of-the-box, assuming v1alpha2. It's also reasonable to assert that this consistency should be enforced elsewhere and to make this behaviour easily revertable I'm leaving this as a separate commit in this patch series.
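The idea of a dedicated type can be sketched like this. The function name `normalizedProviderString` comes from the commit message above; the type name `normalizedProviderID`, the `findMachine` helper, and the rule of keeping only the last `/`-separated segment are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizedProviderID is a distinct string type (assumed name), so a raw
// provider ID cannot be passed to a lookup without being normalized first.
type normalizedProviderID string

// normalizedProviderString keeps only the last path segment of a provider ID.
// Assumption: the trailing segment identifies the instance, and anything
// before it (scheme, region, zone) may differ between node and machine.
func normalizedProviderString(s string) normalizedProviderID {
	parts := strings.Split(s, "/")
	return normalizedProviderID(parts[len(parts)-1])
}

// findMachine is a hypothetical lookup that only accepts normalized IDs.
func findMachine(index map[normalizedProviderID]string, id normalizedProviderID) (string, bool) {
	name, ok := index[id]
	return name, ok
}

func main() {
	// The machine records a provider ID that includes a zone.
	index := map[normalizedProviderID]string{
		normalizedProviderString("aws:///us-east-1a/i-0123"): "machine-a",
	}
	// The node reports the same instance without the zone.
	name, ok := findMachine(index, normalizedProviderString("aws:///i-0123"))
	fmt.Println(name, ok)
}
```

Because `findMachine` takes `normalizedProviderID` rather than `string`, passing an un-normalized string variable is a compile error, which is what makes missed call sites easy to catch.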

frobware · 2020-03-10T11:07:57Z

New changes are detected. LGTM label has been removed.

@MaciekPytel @enxebre @elmiko @hardikdr @detiber: I rebased for the updated vendor in PR #2914

I also added context.TODO() calls to places where the signature had changed post the vendor update in cloudprovider/clusterapi. I also squashed that change so that the addition of the clusterapi provider is still one commit.

enxebre · 2020-03-10T11:08:20Z

/lgtm

elmiko · 2020-03-10T12:25:59Z

/lgtm

thanks @frobware !

hardikdr · 2020-03-10T12:29:34Z

/lgtm

detiber · 2020-03-10T13:06:59Z

/lgtm

MaciekPytel · 2020-03-12T15:43:24Z

/approve

k8s-ci-robot · 2020-03-12T15:43:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [MaciekPytel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: yes and size/XXL labels Apr 5, 2019

k8s-ci-robot requested review from aleksandra-malinowska and piosz April 5, 2019 14:53

alvaroaleman reviewed Apr 28, 2019

alvaroaleman mentioned this pull request Apr 29, 2019

Integrate cluster-autoscaler kubermatic/kubermatic#3370

Closed

k8s-ci-robot added the do-not-merge/invalid-owners-file label Apr 29, 2019

kron4eg mentioned this pull request Apr 29, 2019

Integrate cluster-autoscaler as addon kubermatic/kubeone#391

Closed

k8s-ci-robot added the needs-rebase label May 17, 2019

frobware mentioned this pull request May 23, 2019

Add providerID kubernetes-sigs/cluster-api-provider-openstack#274

Closed

k8s-ci-robot removed the do-not-merge/invalid-owners-file label May 28, 2019

frobware force-pushed the clusterapi-cloudprovider branch from 7e4b91f to 00edc56 Compare May 31, 2019 14:14

k8s-ci-robot removed the needs-rebase label May 31, 2019

frobware force-pushed the clusterapi-cloudprovider branch from 8b32581 to b4efe3c Compare June 5, 2019 17:51

frobware force-pushed the clusterapi-cloudprovider branch from b4efe3c to 7a31c91 Compare June 11, 2019 14:34

k8s-ci-robot added the sig/scheduling and sig/storage labels Jun 11, 2019

frobware force-pushed the clusterapi-cloudprovider branch from 7a31c91 to c3c1551 Compare June 12, 2019 08:27

frobware mentioned this pull request Jul 17, 2019

UPSTREAM: <carry>: openshift: reference k8s.io/api/core/v1 as corev1 openshift/kubernetes-autoscaler#111

Merged

alvaroaleman mentioned this pull request Jul 31, 2019

Autoscaling kubermatic/machine-controller#605

Closed

k8s-ci-robot added the needs-rebase label Aug 27, 2019

k8s-ci-robot assigned hardikdr Mar 10, 2020

frobware and others added 10 commits March 10, 2020 10:27

config/options: add KubeConfigPath

1efc258

Access to this is required by cloudprovider/clusterapi.

cloudprovider/builder: add clusterapi

b95eeb7

Enable cloudprovider/clusterapi.

cloudprovider/clusterapi: new provider

46bb9b4

This adds a new cloudprovider based on the cluster-api project: https://github.com/kubernetes-sigs/cluster-api

Ensure DeleteNodes doesn't delete a node twice

eae1579

Make machine API swappable as an env variable

7ba9798

Update OWNERS

c5fa2b4

Updating vendor against git@github.com:kubernetes/kubernetes.git:f8ff…

3955223

…8f44206ff4dd9b58386d96462b01a3d79fb1 (f8ff8f4)

frobware force-pushed the clusterapi-cloudprovider branch from cf4575f to 3955223 Compare March 10, 2020 11:03

k8s-ci-robot removed the lgtm label Mar 10, 2020

k8s-ci-robot assigned enxebre Mar 10, 2020

k8s-ci-robot added the lgtm label Mar 10, 2020

k8s-ci-robot added the approved label Mar 12, 2020

k8s-ci-robot merged commit 150bf34 into kubernetes:master Mar 12, 2020

enxebre mentioned this pull request Mar 17, 2020

Rebase 1.18 openshift/kubernetes-autoscaler#139

Merged

elmiko mentioned this pull request Apr 6, 2020

Cluster Autoscaler / Cluster API Integration kubernetes/enhancements#609

Closed

This was referenced Apr 15, 2020

CAPI: Do not normalize Node IDs outside of CAPI provider #3057

Merged

UPSTREAM: 3057: openshift: Do not normalize Node IDs outside of CAPI provider openshift/kubernetes-autoscaler#142

Merged

detiber mentioned this pull request Jun 2, 2020

[CA-1.18] #3057 cherry-pick: CAPI: Do not normalize Node IDs outside of CAPI provider #3175

Merged
