Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Event based Kubernetes liveness probe #319

Merged
merged 5 commits into from
Mar 21, 2023

Conversation

bestie
Copy link
Contributor

@bestie bestie commented Mar 7, 2023

This PR is functionally similar to @jbdietrich's PR #308

The main loop of the runner touches a file and a script will exit 0 or 1 depending on whether it was touched recently enough.

Why this implementation?

  • The health check is triggered by the in-built instrumentation events so doesn't clutter up the main loop code and is entirely optional.
  • The 'probe' is a first class object, allowing for extensibility such as monitoring individual partition processing or rebalancing (Document how a Kubernetes-style liveness probe may be implemented #309)
  • The script is written in Ruby, I think we agree we like more than bash and can spare the extra ~10ms of CPU to run it.
  • There's some appealing refactoring and test fixes bundled with it.

This has been back-ported into Voice and will be deployed shortly. In testing it has achieved very smooth rolling deploys with minimal disruption to message processing.

After production deployment I'd be happy to update the k8s deployment section of the README.

@bestie bestie force-pushed the bestie/liveness_probe branch from 40be758 to ba06e60 Compare March 7, 2023 15:45
@bestie bestie requested review from dasch, pfemo, leonmaia, deepredsky and martinstiago and removed request for dasch March 7, 2023 15:48
Copy link
Contributor

@dasch dasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! Couple of comments.

I think it would be best to use the existing Config system for this rather than introduce new ENV vars.

Also... should liveness_probe either 1) be a separate thing in exe/, e.g. racecar-liveness-probe or 2) be a new subcommand in racecarctl?

#!/usr/bin/env -S ruby --disable=gems

$file_path = ENV.fetch("RACECAR_LIVENESS_PROBE_FILE_PATH", "")
$liveness_max_interval = Integer(ENV.fetch("RACECAR_LIVENESS_PROBE_MAX_INTERVAL")) rescue nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use the config system here? Add the vars to Racecar::Config, e.g. liveness_probe_file_path? Then folks can pass the value as ENV vars or just configure with YAML.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion about that, my thinking here was that I was competing with a bash script.

Something that executes quickly with no dependencies is more palatable to ops people and they will likely want to put those values in the ENV anyway.

If you don't buy that argument then I'm happy to change it.

Copy link
Contributor Author

@bestie bestie Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also be concerned that people would assume their Rails app is being loaded for the probe, that wouldn't make sense to do that but someone will think that.

I will of course add docs and we can try to make that clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should target the dev ops folks, basically the engineers who write the consumers, and they'd be surprised by having a separate config system. Many people like having their config in a single YAML file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had some discussions around the this when mobbing on #308. With the bash version in that PR, reusing the config was not easy. I agree with @dasch about using the same config although it comes with additional load. cc @jbdietrich

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I would tend to agree that it is better to make this consistent with the existing config API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; let's use Racecar::Config for this configuration - besides making it easier to configure, it also helps us maintain documentation on Racecar.

Copy link
Contributor Author

@bestie bestie Mar 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback.

It's now executed via racecarctl liveness_probe, it uses Racecar::Config and loads config/racecar.yml if RAILS_ENV is set in the environment.

Looking for RAILS_ENV seems to be the best way to check. I expect this to be set in a Kubernetes deployment and we can't load Rails to get the value.

bin/liveness_probe Outdated Show resolved Hide resolved
bin/liveness_probe Outdated Show resolved Hide resolved
bin/liveness_probe Outdated Show resolved Hide resolved
bin/liveness_probe Outdated Show resolved Hide resolved
bin/liveness_probe Outdated Show resolved Hide resolved
@dasch
Copy link
Contributor

dasch commented Mar 8, 2023

You should be able to extend https://github.com/zendesk/racecar/blob/master/lib/racecar/ctl.rb with a new subcommand. Perhaps pull the probe logic into a new class, e.g. Racecar::LivenessProbe?

@bestie
Copy link
Contributor Author

bestie commented Mar 9, 2023

Great stuff! Couple of comments.

I think it would be best to use the existing Config system for this rather than introduce new ENV vars.

Also... should liveness_probe either 1) be a separate thing in exe/, e.g. racecar-liveness-probe or 2) be a new subcommand in racecarctl?

Definitely a neater solution but let me know what you think about the "easier for ops people" argument.

@bestie bestie force-pushed the bestie/liveness_probe branch from 1b47fc7 to 6534537 Compare March 9, 2023 13:14
@dasch
Copy link
Contributor

dasch commented Mar 9, 2023

@leonmaia would you also like to review this?

Copy link
Contributor

@deepredsky deepredsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually wanted to rewrite the bash part in #308 into ruby as well, so thanks for this PR! :)

lib/racecar/liveness_probe.rb Outdated Show resolved Hide resolved
lib/racecar/liveness_probe.rb Outdated Show resolved Hide resolved
bin/liveness_probe Outdated Show resolved Hide resolved
Copy link
Contributor

@dasch dasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! 🚀

README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
Copy link
Contributor

@dasch dasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wohoo!

@bestie bestie force-pushed the bestie/liveness_probe branch from 63a85df to c452171 Compare March 20, 2023 15:42
- Probe command executed via `$ racecarctl liveness_probe`
- Liveness probe behavior is disabled by default, set `liveness_probe_enabled` to enable
- Triggered by instrumentation events to keep overhead and coupling low
@bestie bestie force-pushed the bestie/liveness_probe branch from c452171 to c7f88f9 Compare March 20, 2023 18:07
@dasch
Copy link
Contributor

dasch commented Mar 20, 2023

Alrighty! Let's add an entry to the CHANGELOG and I can merge this!

Co-authored-by: Daniel Schierbeck <[email protected]>
@bestie bestie force-pushed the bestie/liveness_probe branch from c7f88f9 to ea9b1c4 Compare March 21, 2023 12:01
@leonmaia leonmaia merged commit 9fdcf1e into zendesk:master Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants