TCP RST and 502 errors when pods are terminated #7935

Open

tweeks-reify opened this issue Oct 8, 2024 · 2 comments

Comments

@tweeks-reify

Issue Description

We see 502 errors when an apollo-server pod is gracefully terminated in k8s. We host apollo-server in EKS and expose it using AWS ALBs.

AWS provides a troubleshooting guide for 502s, and our issue matches the criterion "The load balancer received a TCP RST from the target when attempting to establish a connection" (see screenshot below).

We attempted to increase the stopGracePeriodMillis in ApolloServerPluginDrainHttpServer so that it was higher than the Kubernetes terminationGracePeriodSeconds and the ALB target group deregistration_delay, but saw no change in behavior.
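
A minimal sketch of that configuration, assuming Apollo Server 4 with the Express integration (the schema is a placeholder, and 90s is just an example value larger than both the 60s terminationGracePeriodSeconds and the deregistration delay):

    import http from 'http';
    import express from 'express';
    import { ApolloServer } from '@apollo/server';
    import { expressMiddleware } from '@apollo/server/express4';
    import { ApolloServerPluginDrainHttpServer } from '@apollo/server/plugin/drainHttpServer';

    // Placeholder schema so the sketch is self-contained.
    const typeDefs = `#graphql
      type Query { ok: Boolean }
    `;
    const resolvers = { Query: { ok: () => true } };

    const app = express();
    const httpServer = http.createServer(app);

    const server = new ApolloServer({
      typeDefs,
      resolvers,
      plugins: [
        // Drains in-flight requests when the server stops;
        // stopGracePeriodMillis bounds how long draining may take
        // before remaining sockets are forcibly closed.
        ApolloServerPluginDrainHttpServer({
          httpServer,
          stopGracePeriodMillis: 90_000, // example: > 60s k8s grace period
        }),
      ],
    });

    await server.start();
    app.use('/graphql', express.json(), expressMiddleware(server));
    httpServer.listen(4000);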

We have also set httpServer.keepAliveTimeout and httpServer.headersTimeout higher than the ALB idle timeout.
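
For context, those are plain properties on Node's http.Server; a sketch assuming a 60s ALB idle timeout (the actual values in our deployment may differ):

    // Node's default keepAliveTimeout is 5s. If it is shorter than the ALB
    // idle timeout, the server can close an idle keep-alive socket that the
    // ALB still considers reusable, and the next request proxied over that
    // socket is answered with a TCP RST, which the ALB reports as a 502.
    httpServer.keepAliveTimeout = 65_000; // > assumed 60s ALB idle timeout
    httpServer.headersTimeout = 66_000;   // should exceed keepAliveTimeout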

[Screenshot: ALB logs]

Link to Reproduction

https://repost.aws/knowledge-center/elb-alb-troubleshoot-502-errors

Reproduction Steps

  1. Terminate a pod running apollo-server in EKS behind an AWS ALB
glasser (Member) commented Oct 8, 2024

I think you'd want the apollo server grace period to be smaller than the k8s-level grace period so it has time to finish up before k8s kills it...

tweeks-reify (Author) commented

That is the default behavior (10s vs 60s in K8s). Our theory is that although K8s says it sends the SIGTERM and stops traffic at the same time, the two are not actually simultaneous (i.e. AWS does not tell the target group to stop sending traffic until after the SIGTERM is issued). Any sessions using keepAlive that the ALB still has open can then hit a server that has already shut down, producing the TCP RSTs and 502s.

We added a preStop lifecycle hook to the container spec that runs sleep 30, like this, and we haven't encountered the issue since:

        lifecycle:
          preStop:
            # preStop runs before SIGTERM, so the pod keeps serving for 30s
            # while the ALB finishes deregistering the target.
            exec:
              command: [ "/bin/sleep", "30" ]
