Add ability to change k8s readiness/liveness probes #2363
Comments
Could you elaborate a bit? What kind of flexibility do you want? What should be configurable?
Timeouts and failure thresholds. We've seen the router tip over at scale (when under extreme IO pressure). What made things worse is that the liveness check is somewhat aggressive (3 failures), so the pods start restarting, which compounds the problem.
Under what kind of load did you see issues?
Tried to deploy the router in front of a monolithic API (just a single subgraph) that receives ~5k rps, and it tipped over at 3, 8, and 16 pods. Given that CPU was not at the extremes (3 cores at max), I assume the problem was in the IO (HTTP/TCP layer) and router/axum/hyper was not able to handle so many in-flight connections. We stabilised the router at ~40 pods to handle that kind of load. We didn't see any errors related to file descriptor limits being hit, but that is one of my assumptions as to what might have gone wrong. Our long-tail requests have quite high latency, so the number of in-flight requests can pile up very quickly.
As things stand, we just take the defaults. I'd be happy to see the deployment template enhanced so that it added something like the sketch below (and the same again for readiness and perhaps startup probes). This would facilitate customisation of these probes.
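For illustration, a minimal sketch of what values-driven probe configuration might look like; the `probes` values key, its defaults (which mirror the Kubernetes probe defaults), and the `health` port name are assumptions for this sketch, not the chart's actual schema:

```yaml
# values.yaml (hypothetical keys; defaults mirror the k8s probe defaults)
probes:
  liveness:
    initialDelaySeconds: 0
    periodSeconds: 10
    timeoutSeconds: 1
    failureThreshold: 3

# templates/deployment.yaml (excerpt)
livenessProbe:
  httpGet:
    path: "/health"   # router health-check endpoint
    port: health      # assumes a named containerPort for the health endpoint
  initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }}
  periodSeconds: {{ .Values.probes.liveness.periodSeconds }}
  timeoutSeconds: {{ .Values.probes.liveness.timeoutSeconds }}
  failureThreshold: {{ .Values.probes.liveness.failureThreshold }}
```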
@garypen yes! Another thing is an option to disable probes. For example, I believe the liveness probe is widely misunderstood & misused in k8s. Given that the router returns a dummy 200 response on health checks, it is not appropriate for a liveness check. If the server has issues, a restart will only make things worse. This is a good post that points this out and gives one example of what might be checked in a liveness probe: https://www.linkedin.com/posts/llarsson_betterdevopsonkubernetes-devops-devsecops-activity-7018587202121076736-LRxE
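Disabling could then be a simple conditional in the template (same hypothetical `probes` values layout as above, with an assumed `enabled` flag):

```yaml
# templates/deployment.yaml (excerpt) — render the probe only when enabled
{{- if .Values.probes.liveness.enabled }}
livenessProbe:
  httpGet:
    path: "/health"
    port: health
{{- end }}
```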
As an extension to this, it would be very useful to enable the router to access at least some of the pod's details. It may be possible to achieve this using https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/ to expose these details. If not, we could probably access the data programmatically.
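The Downward API approach from that link would look roughly like this in the deployment template; which fields the router actually needs is an open question, so `POD_NAME`/`POD_NAMESPACE` here are just examples:

```yaml
# templates/deployment.yaml (excerpt) — expose pod details to the router
# as env vars via the Kubernetes Downward API
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
```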
Describe the solution you'd like
Flexibility around readiness/liveness probes in the helm chart, especially the timeouts.
Describe alternatives you've considered
N/A