
Add ability to change k8s readiness/liveness probes #2363

Open
Meemaw opened this issue Jan 6, 2023 · 7 comments · May be fixed by #6018
Comments

@Meemaw (Contributor) commented Jan 6, 2023

Describe the solution you'd like
Flexibility around the readiness/liveness probes in the Helm chart, especially the timeouts.

Describe alternatives you've considered
N/A

@Geal (Contributor) commented Jan 9, 2023

Could you elaborate a bit? What kind of flexibility do you want, and what should be configurable?

@Meemaw (Contributor, Author) commented Jan 9, 2023

Timeouts and failure thresholds. We've seen the router tip over at scale (when under extreme I/O pressure). What made things worse is that the liveness check is fairly aggressive (3 failures), so the pods start restarting, which makes things even worse.

@Geal (Contributor) commented Jan 9, 2023

Under what kind of load did you see issues?

@Meemaw (Contributor, Author) commented Jan 9, 2023

Tried to deploy the router in front of a monolithic API (just a single subgraph) that receives ~5k rps, and it tipped over at 3, 8, and 16 pods. Given that CPU was not extreme (3 cores at most), I assume the problem was in the I/O (HTTP/TCP layer) and router/axum/hyper was not able to handle that many in-flight connections. We stabilised the router at ~40 pods to handle that kind of load.

We didn't see any errors related to file descriptor limits being hit, but that is one of my assumptions as to what might have gone wrong. Our long-tail requests have quite high latency, so the number of in-flight requests can pile up very quickly.

@garypen (Contributor) commented Jan 9, 2023

As things stand, we just take the defaults.

I'd be happy to see the deployment template enhanced to add something like:

            initialDelaySeconds: {{ .Values.probes.livenessProbe.initialDelaySeconds }}
            periodSeconds: {{ .Values.probes.livenessProbe.periodSeconds }}
            timeoutSeconds: {{ .Values.probes.livenessProbe.timeoutSeconds }}
            successThreshold: {{ .Values.probes.livenessProbe.successThreshold }}
            failureThreshold: {{ .Values.probes.livenessProbe.failureThreshold }}

(Same again for readiness and perhaps startup)

This would facilitate customisation of these probes.
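
For reference, here is a minimal sketch of what the matching values.yaml defaults could look like, assuming the probes.livenessProbe/probes.readinessProbe key layout used above (the key names are illustrative, not the chart's actual schema). The numbers mirror the Kubernetes probe defaults, so out-of-the-box behaviour would be unchanged:

    probes:
      livenessProbe:
        initialDelaySeconds: 0   # Kubernetes default
        periodSeconds: 10        # Kubernetes default
        timeoutSeconds: 1        # Kubernetes default
        successThreshold: 1      # Kubernetes default
        failureThreshold: 3      # Kubernetes default
      readinessProbe:
        initialDelaySeconds: 0
        periodSeconds: 10
        timeoutSeconds: 1
        successThreshold: 1
        failureThreshold: 3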

@Meemaw (Contributor, Author) commented Jan 11, 2023

@garypen yes! Another thing is an option to disable probes. For example, I believe the liveness probe is widely misunderstood and misused in k8s. Given that the router returns a dummy 200 response on health checks, it is not appropriate for a liveness check: if the server has issues, a restart will only make things worse.

This is a good post that points this out and gives one example of what might be checked in a liveness probe: https://www.linkedin.com/posts/llarsson_betterdevopsonkubernetes-devops-devsecops-activity-7018587202121076736-LRxE
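
One common Helm pattern for making a probe optional (a sketch assuming an enabled flag in values.yaml, not necessarily how this chart would expose it) is to gate the whole probe block in the deployment template:

    {{- if .Values.probes.livenessProbe.enabled }}
    livenessProbe:
      httpGet:
        path: {{ .Values.probes.livenessProbe.path }}   # placeholder value key
        port: http                                      # placeholder port name
      periodSeconds: {{ .Values.probes.livenessProbe.periodSeconds }}
      failureThreshold: {{ .Values.probes.livenessProbe.failureThreshold }}
    {{- end }}

Setting probes.livenessProbe.enabled to false would then render a Deployment with no livenessProbe at all, leaving restarts to the container's own exit behaviour rather than to the health check.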

@garypen (Contributor) commented Jun 29, 2023

As an extension to this, it would be very useful to enable the router to access at least .Values.probes.livenessProbe.periodSeconds during shutdown so that we can interact cleanly with the readinessProbe.

It may be possible to achieve this using: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/ to expose these details. If not, we could probably access the data programmatically.
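
One caveat: the Downward API exposes pod fields such as metadata (labels, annotations) and resource requests/limits, but not the probe spec itself. Since the chart already knows the value at render time, a simpler route may be to pass it straight into the container as an environment variable; a sketch (the variable name is illustrative, not an existing router setting):

    env:
      - name: PROBE_PERIOD_SECONDS   # illustrative name, not a real router environment variable
        value: {{ .Values.probes.livenessProbe.periodSeconds | quote }}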

@ahiggins0 linked a pull request Sep 17, 2024 that will close this issue.