Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add console manager supervisor logic w/ restart option #270

Merged

Conversation

floreks
Copy link
Member

@floreks floreks commented Sep 17, 2024

  • Added supervised controller start functionality that can restart the controller if a last poll/reconcile time indicates that it might have died
  • Updated Reconciler interface and Controller implementation to allow it to be restarted
  • Removed 3 different logger implementations usage across the codebase and replaced it with a single klog logger.
  • poll/refresh/jitter interval args are now correctly used by the controllers
  • Cleaned up console Reconcilers. Some struct fields were not used anywhere
  • Refactored gate cache queue into a standalone cache that can be safely reused by multiple goroutines now
  • Refactored queue usage across console reconcilers to use getter instead of reference to a variable
  • Refactored PollUntilContextCancel usage in the console controller manager not to rely on our internal method implementation when deciding when to stop polling. Internal method will only return error now that can be logged but the poll function will always return false, nil (never stop).
  • Added controller restart metric counter to be able to track the number of per controller restarts (if any)

Copy link

linear bot commented Sep 17, 2024

@floreks floreks changed the title wip: feat: add console manager supervisor logic w/ restart option feat: add console manager supervisor logic w/ restart option Sep 18, 2024
@floreks floreks self-assigned this Sep 18, 2024
Copy link
Member

@michaeljguarino michaeljguarino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly lgtm for the most part but I think @zreigz should give it a close review as well.

One thing i'm wondering about though is if the heartbeat approach is unnecessary. You only really need heartbeats for multi-process communication (monitoring liveness in a distributed system). Couldn't we have some wrapper class like:

type MonitoredRoutine struct {
  alive bool
  Runnable func()
}

func (r *MonitoredRoutine) Run() {
  defer func() { 
    // catch panics
    r.alive = false
  }()
  
  r.Runnable()
}

func (r *MonitoredRoutine) Alive() bool { return r.alive }

and then if any of them die, just force restart the controller according to some mechanism?

I suppose part of the problem here is you don't have a natural way to restart child goroutines too, but this seems like a more robust and general api for liveness than a heartbeat, could also write to a channel to signal a parent goroutine that it needs to restart.

@floreks
Copy link
Member Author

floreks commented Sep 18, 2024

What might be problematic with this approach is detecting if the controller is still running or not. Heartbeat in this case is the last poll time. Since we have information about how often polling should be executed, we can calculate the time difference between last poll time and current time to see if controller could be dead.

Recovering from panic technically does not help us much since if it will panic the app should crash and pod will be restarted anyway.

We should try to avoid a situation where there is no panic but controller for some unknown reason stopped polling/reconciling.

@maciaszczykm
Copy link
Member

I reviewed as well, then we talked about it with @floreks and @zreigz. It looks good to me, issues with pollers being stuck for any reasons should not happen anymore. One thing that can be added is validation for args to avoid situations like poll interval or jitter being too short.

@floreks floreks merged commit f7fa8a7 into main Sep 25, 2024
33 checks passed
@floreks floreks deleted the sebastian/prod-2611-deployment-operator-service-reconcilers-died branch September 25, 2024 10:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants