
Add ability to change k8s readiness/liveness probes #2363

Open
Meemaw opened this issue Jan 6, 2023 · 7 comments · May be fixed by #6018
Comments

@Meemaw (Contributor) commented Jan 6, 2023

Describe the solution you'd like
Flexibility around the readiness/liveness probes in the Helm chart, especially the timeouts.

Describe alternatives you've considered
N/A

@Geal (Contributor) commented Jan 9, 2023

Could you elaborate a bit? What kind of flexibility do you want, and what should be configurable?

@Meemaw (Contributor, Author) commented Jan 9, 2023

Timeouts and failure thresholds. We've seen the router tip over at scale (when under extreme I/O pressure). What made things worse is that the liveness check is fairly aggressive (3 failures), so the pods start restarting, which makes things even worse.

@Geal (Contributor) commented Jan 9, 2023

Under what kind of load did you see issues?

@Meemaw (Contributor, Author) commented Jan 9, 2023

Tried to deploy the router in front of a monolithic API (just a single subgraph) that receives ~5k rps, and it tipped over at 3, 8, and 16 pods. Given that CPU was not extreme (3 cores at most), I assume the problem was in the I/O (HTTP/TCP layer) and router/axum/hyper was not able to handle that many in-flight connections. We stabilised the router at ~40 pods to handle that kind of load.

We didn't see any errors related to file descriptor limits being hit, but that is one of my assumptions as to what might have gone wrong. Our long-tail requests have quite high latency, so the number of in-flight requests can pile up very quickly.

@garypen (Contributor) commented Jan 9, 2023

As things stand, we just take the defaults.

I'd be happy to see the deployment template enhanced to add something like:

            initialDelaySeconds: {{ .Values.probes.livenessProbe.initialDelaySeconds }}
            periodSeconds: {{ .Values.probes.livenessProbe.periodSeconds }}
            timeoutSeconds: {{ .Values.probes.livenessProbe.timeoutSeconds }}
            successThreshold: {{ .Values.probes.livenessProbe.successThreshold }}
            failureThreshold: {{ .Values.probes.livenessProbe.failureThreshold }}

(Same again for readiness and perhaps startup)

This would facilitate customisation of these probes.
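
For reference, here is a minimal sketch of what the matching values.yaml defaults could look like, assuming the probes.livenessProbe/probes.readinessProbe key layout used above (the key names are illustrative, not the chart's actual schema). The numbers mirror the Kubernetes probe defaults, so out-of-the-box behaviour would be unchanged:

    probes:
      livenessProbe:
        initialDelaySeconds: 0   # Kubernetes default
        periodSeconds: 10        # Kubernetes default
        timeoutSeconds: 1        # Kubernetes default
        successThreshold: 1      # Kubernetes default
        failureThreshold: 3      # Kubernetes default
      readinessProbe:
        initialDelaySeconds: 0
        periodSeconds: 10
        timeoutSeconds: 1
        successThreshold: 1
        failureThreshold: 3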

@Meemaw (Contributor, Author) commented Jan 11, 2023

@garypen yes! Another thing is an option to disable probes. For example, I believe the liveness probe is widely misunderstood and misused in k8s. Given that the router returns a dummy 200 response on health checks, it is not appropriate for a liveness check: if the server has issues, a restart will only make things worse.

This is a good post that points this out and gives one example of what might be checked in a liveness probe: https://www.linkedin.com/posts/llarsson_betterdevopsonkubernetes-devops-devsecops-activity-7018587202121076736-LRxE
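
One common Helm pattern for making a probe optional (a sketch assuming an enabled flag in values.yaml, not necessarily how this chart would expose it) is to gate the whole probe block in the deployment template:

    {{- if .Values.probes.livenessProbe.enabled }}
    livenessProbe:
      httpGet:
        path: {{ .Values.probes.livenessProbe.path }}   # placeholder value key
        port: http                                      # placeholder port name
      periodSeconds: {{ .Values.probes.livenessProbe.periodSeconds }}
      failureThreshold: {{ .Values.probes.livenessProbe.failureThreshold }}
    {{- end }}

Setting probes.livenessProbe.enabled to false would then render a Deployment with no livenessProbe at all, leaving restarts to the container's own exit behaviour rather than to the health check.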

@garypen (Contributor) commented Jun 29, 2023

As an extension to this, it would be very useful to enable the router to access at least .Values.probes.livenessProbe.periodSeconds during shutdown so that we can interact cleanly with the readinessProbe.

It may be possible to achieve this using: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/ to expose these details. If not, we could probably access the data programmatically.
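
One caveat: the Downward API exposes pod fields such as metadata (labels, annotations) and resource requests/limits, but not the probe spec itself. Since the chart already knows the value at render time, a simpler route may be to pass it straight into the container as an environment variable; a sketch (the variable name is illustrative, not an existing router setting):

    env:
      - name: PROBE_PERIOD_SECONDS   # illustrative name, not a real router environment variable
        value: {{ .Values.probes.livenessProbe.periodSeconds | quote }}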

@ahiggins0 linked a pull request Sep 17, 2024 that will close this issue.