Skip to content

Navigation Menu

Explore
By company size
By use case
By industry
View all solutions
Topics
- AI
- DevOps
- Security
- Software Development
- View all
Explore
- GitHub Sponsors
  Fund open source developers
- The ReadME Project
  GitHub community articles
Repositories
- Enterprise platform
  AI-powered developer platform
Available add-ons
Pricing

Search code, repositories, users, issues, pull requests...

Search

Clear

Search syntax tips

Provide feedback

We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Saved searches

Use saved searches to filter your results more quickly

Name

Query

To see all available qualifiers, see our documentation.

You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session.

Dismiss alert

gardener / machine-controller-manager Public

Notifications You must be signed in to change notification settings
Fork 117
Star 257

Code
Issues 78
Pull requests 1
Actions
Projects
Security
Insights

Additional navigation options

Code
Issues
Pull requests
Actions
Projects
Security
Insights

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Refactor MCM to leverage controller-runtime #724

Open

4 of 49 tasks

himanshu-kun opened this issue May 25, 2022 · 8 comments

Open

4 of 49 tasks

Refactor MCM to leverage controller-runtime #724

himanshu-kun opened this issue May 25, 2022 · 8 comments

Labels

Output qualification (tests, checks, scans, automation in general, etc.) related

Effort for issue is around 6 months

Issue that requires expert level knowledge

important-soon kind/cleanup

Something that is not needed anymore and can be cleaned up

kind/enhancement

Enhancement, improvement, extension

lifecycle/stale

Nobody worked on this for 6 months (will further age)

Needs (more) planning with other MCM maintainers

Priority (lower number equals higher priority)

Comments

Copy link

Contributor

himanshu-kun commented May 25, 2022 •

edited by unmarshall

Loading

How to categorize this issue?

/area quality
/kind enhancement
/priority 2

What would you like to be added:
MCM should be re-written using controller-runtime framework. Currently we use client-go directly . As a part of this we would also remove the in-tree deprecated code #622

Why is this needed:
Some advantages are:

lesser LOC
automatic metric registration
(more could be added here)

FAQ for controller runtime -> https://github.com/kubernetes-sigs/controller-runtime/blob/v0.12.1/FAQ.md

BrainStorm / Ideas

Machine disruption budget (Similar to PDB)
Distribution of machines based on affinity rules and not necessarily creating one machine deployment per zone.
Controlled Eviction of Pods during Cluster Rollout #470

Small refactoring / changes

node should be tainted as well during cordoning before drain, not just marking unschedulable=true

TODO

changes for better MCM - CA interaction
- Early abort/backoff support for Gardener nodegroups a.k.a machinedeployments autoscaler#154
Check instance reachability in machine status
Redesign maxReplacement logic , refer PR An option to throttle deletion of Failed Machines #482 for new ideas. Created issue Make MaxReplacement configurable #688
Use labels instead of tags to identify Machines machine-controller-manager-provider-gcp#7
Reduce load on the APIServer #293
Reduce CPU utilization #304
- We need to profile cpu,memory Before and After the refactoring.
Improve overshooting logic #346
- need to reproduce the case where overshooting could happen to see if this logic is still required.
events on machine objects and nodes
not allow health timeout deletion of old machines during rolling update as it leaves many pods unschedulable causing downtimes. example -> Live Issue #2164
add signal handling to MCM for its graceful termination.
Move drain logic into a separate controller #621
Add support for error codes #590
We need to fix the health and readiness endpoints for the MC/MCM. The health endpoint is too simplistic. We should also consider developing a readiness endpoint for the MC for all the cases where the controller loop could not be started.
Remove duplicate construction of EventRecorders.
MC and MCM currently run as 2 containers within the same Pod. Consider running them as single container.
Relook at MC metrics, especially commented code and remove. See Improve Monitoring/Alerting/Metrics #211
reconcileClusterNodeKey is useless and not being called.
Optimize secret enqueues while reconciling cluster secrets. Don't react on non-MCM secrets controlCoreInformerFactory.Core().V1().Secrets() is far too generic.
Remove MC finalizer add code from ValidateMachineClass. As a corollary, consider optimizing validateNodeTemplate as this is repeated in the flow of reconciling machine objects.
Our reconcile functions are too heavy-weight. Consider making them fine grained. Look at gardener actuator pattern.
addMachineFinalizers adds MCMFinalizer instead of MCFinalizer.
Provider implementations of Driver should return appropriate error Codes . Also status.FromError(err) in machine controller does not seem necessary since Driver methods already returns errors as status objects. (No need to convert to string and regex parse and build status again)
Relook at logic and rationale for NotManagedByMCM. This has an issue for multiple MC controller processes.

if _, annotationPresent := node.ObjectMeta.Annotations[machineutils.NotManagedByMCM]; annotationPresent

isMachineStatusSimilar is confounding, delicate and easily broken. Replace with a better implementation. This is happening because the Description field mixes arbitrary text with constants such as machineutils.GetVMStatus, etc. Consider alternative approach such as use of custom conditions / introducing operation label as separate field, etc
The triggerCreationFlow in the MC file pkg/util/provider/machinecontroller/machine.go has logic to check for Azure specific stale node. Provider specific handling should not percolate to the common library code. Re-think on this while refactoring.
Fix Machine.Status.LastOperation being used in several places as the next operation when Machine.Status.LastOperation.State is MachineStateProcessing. Also MachineState is bad name - it actually means MachineOperationState. Instead of computing and recording the next operation, reconciler should compute this based on current conditions. Consider introducing new conditions and leveraging the gardener flow library for conditional execution.
After setting machine phase to Terminating, if we get stuck in the controller.getVmStatus under the cases: case codes.Unknown, codes.DeadlineExceeded, codes.Aborted, codes.Unavailable:, then we keep repeating getVMStatus with a short retry. We should prefer exponential backoff here.
In controller.getVMStatus consider failing permanently for Unimplemented or error in decoding statuses returned by Driver instead of just queuing again. Check all the Driver.GetMachineStatus implementations for providers before doing this.
In controller.drainNode, we need to revise the logic for skipping the drain if the read only FS condition is true.
In controller.deleteVM we have a perma-loop when status decode fails.
controller.deleteMachineFinalizers(ctx, machine) need not return a retry period as we ignore it.
controller.triggerUpdationFlow is not being used currently. Study and consider removal.
drain.Options.getPodsForDeletion should be simplified.
In drain.Options.evictPod we should not use the beta1 Policy client.
Providers
- We need to enhance the Driver interface for providers and enhance GetMachineStatusResponse , ListMachinesResponse to get the status of each individual resource of a machine and not just the current barebones providerId and nodeName. This will permit us to differentiate between when there are issues with the VM and issues in VM resource creation.
- Use labels instead of tags to identify Machines machine-controller-manager-provider-gcp#7
Relook at Provide a way to initialize out-of-tree providers #513
[Edge Case] Creation/Health Timeout not respected #811
Make creation and deletion of resources associated with a single machine robust. One idea is to retry creation of all resources (NIC, Disks, VM) for a machine till the point that machine contract/spec changes. At which point mark the sub-resources (NIC, Disk) that got created as stale/orphan and that should be removed via the orphan collector. This avoids multiple calls to delete each resource if its creation fails. The reason is that the deletion of that resource can also fail as well. We have seen that in azure many times.
For better tracing, every request made to the driver must have a unique request-id. This will then be used as a correlation-id for one more more provide API calls. For example: If you wish to delete a machine in Azure provider, then there are many different AZ API calls that are made. In order to easily diagnose why the call to the Driver failed, all AZ API calls needs a correlation. This can easily be achieved via a correlation-ID which will be the request-ID. This will require changes to be done in the API request/response structs.
Error handling is not up-to-the-mark. We use util/provider/machinecodes/status but it currently loses the underline error. There is also no method that accepts an error and a code. The custom error need to be enhanced.
Re-look at the need to have multiple machine-sets per machine-deployment. Is there a real need for this complication or it could drastically simplified?
Static distribution of total min replicas across machine-deployments should not be done at the time of creation of change of worker pools at the shoot level.
Create a state machine to handle deterministic transitions between VM states.
Right now MachineClass is validated by mcm-provider- driver implementation methods repeatedly even if there is no change being done to the MachineClass. This is sub-optimal. We should ideally add a validating webhook for MachineClass that will do the validation before persisting the object to etcd. Majority of the validations can be done inside the webhook. Gardener extensions has validating webhooks today but they currently do not validate the ProviderConfig. Current view is that we should not link our webhooks to extension webhooks as they will directly tie us to g/g. To support use cases where customers can choose to use MCM independent of g/g shoot validation (done in extension webhooks) should not be used. Instead MCM should create its own validating webhooks which are invoked in addition to the ones defined in gardener extensions.
Provide a generic way to inject faults. This allows to run different scenarios by simulating provider responses.
Rolling updates do not timeout today. Typically customers have a limited window within which they wish to complete the updates. During the rolling updates we cordon the nodes to prevent new workloads to be scheduled onto them. It is also possible that due to insufficient capacity CA is not able to launch new nodes. This leads to customers unable to schedule workloads as many old nodes would have been cordoned off and new nodes cannot be launched. We need a way to end the update process after a defined timeout.

The text was updated successfully, but these errors were encountered:

All reactions

himanshu-kun added kind/enhancement

Enhancement, improvement, extension

Issue that requires expert level knowledge

Something that is not needed anymore and can be cleaned up

important-soon effort/3m Effort for issue is around 3 months labels

Copy link

Contributor Author

himanshu-kun commented May 25, 2022

All reactions

Sorry, something went wrong.

gardener-robot added area/quality

Output qualification (tests, checks, scans, automation in general, etc.) related

priority/2 Priority (lower number equals higher priority) labels

himanshu-kun mentioned this issue

Improve MCM log messages/status for meltdown scenario #742

Closed

Copy link

gardener-robot commented Oct 11, 2022

@himanshu-kun You have mentioned internal references in the public. Please check.

All reactions

Sorry, something went wrong.

himanshu-kun self-assigned this

Copy link

gardener-robot commented Oct 21, 2022

@himanshu-kun You have mentioned internal references in the public. Please check.

All reactions

Sorry, something went wrong.

himanshu-kun mentioned this issue

Retry for Machine reconciliation happening quicker than cache update leading to the object has been modified errors #767

Closed

himanshu-kun mentioned this issue

[WIP] Add Context to Run method of the manager #757

Closed

Copy link

gardener-robot commented Feb 6, 2023

@himanshu-kun You have mentioned internal references in the public. Please check.

All reactions

Sorry, something went wrong.

This was referenced Feb 14, 2023

Cannot use the internalversion clientset #381

Open

Improvements for MCM #287

Closed

Copy link

gardener-robot commented Feb 21, 2023

@himanshu-kun You have mentioned internal references in the public. Please check.

All reactions

Sorry, something went wrong.

3 similar comments

Copy link

gardener-robot commented Feb 21, 2023

@himanshu-kun You have mentioned internal references in the public. Please check.

All reactions

Sorry, something went wrong.

Copy link

gardener-robot commented Feb 21, 2023

@himanshu-kun You have mentioned internal references in the public. Please check.

All reactions

Sorry, something went wrong.

Copy link

gardener-robot commented Feb 21, 2023

@himanshu-kun You have mentioned internal references in the public. Please check.

All reactions

Sorry, something went wrong.

himanshu-kun added needs/planning

Needs (more) planning with other MCM maintainers

Effort for issue is around 6 months

priority/3 Priority (lower number equals higher priority) and removed effort/3m

Effort for issue is around 3 months

priority/2 Priority (lower number equals higher priority) labels

gardener-robot added priority/2 Priority (lower number equals higher priority) and removed priority/3 Priority (lower number equals higher priority) labels

unmarshall changed the title ~~Refactor MCM with controller runtime~~ Refactor MCM

himanshu-kun removed their assignment

himanshu-kun pinned this issue

unmarshall changed the title ~~Refactor MCM~~ Refactor MCM to use controller-runtime

unmarshall changed the title ~~Refactor MCM to use controller-runtime~~ Refactor MCM to leverage controller-runtime

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Assignees

No one assigned

Labels

Output qualification (tests, checks, scans, automation in general, etc.) related

Effort for issue is around 6 months

Issue that requires expert level knowledge

important-soon kind/cleanup

Something that is not needed anymore and can be cleaned up

kind/enhancement

Enhancement, improvement, extension

lifecycle/stale

Nobody worked on this for 6 months (will further age)

Needs (more) planning with other MCM maintainers

Priority (lower number equals higher priority)

Projects

None yet

Milestone

No milestone

Development

No branches or pull requests

2 participants

Footer

© 2024 GitHub, Inc.

Footer navigation

Terms
Privacy
Security
Status
Docs
Contact

You can’t perform that action at this time.