TCP RST and 502 errors when pods are terminated #7935

Open

tweeks-reify opened this issue Oct 8, 2024 · 2 comments

Comments

@tweeks-reify

Issue Description

We see 502 errors when an apollo-server pod is gracefully terminated in k8s. We host apollo-server in EKS and expose it using AWS ALBs.

AWS provides a troubleshooting guide for 502s, and our issue matches the criterion "The load balancer received a TCP RST from the target when attempting to establish a connection" (see screenshot below).

We attempted to increase the stopGracePeriodMillis in ApolloServerPluginDrainHttpServer so that it was higher than the Kubernetes terminationGracePeriodSeconds and the ALB target group deregistration_delay, but saw no change in behavior.
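
A minimal sketch of that configuration, assuming Apollo Server 4 with the Express integration (the schema is a placeholder, and 90s is just an example value larger than both the 60s terminationGracePeriodSeconds and the deregistration delay):

    import http from 'http';
    import express from 'express';
    import { ApolloServer } from '@apollo/server';
    import { expressMiddleware } from '@apollo/server/express4';
    import { ApolloServerPluginDrainHttpServer } from '@apollo/server/plugin/drainHttpServer';

    // Placeholder schema so the sketch is self-contained.
    const typeDefs = `#graphql
      type Query { ok: Boolean }
    `;
    const resolvers = { Query: { ok: () => true } };

    const app = express();
    const httpServer = http.createServer(app);

    const server = new ApolloServer({
      typeDefs,
      resolvers,
      plugins: [
        // Drains in-flight requests when the server stops;
        // stopGracePeriodMillis bounds how long draining may take
        // before remaining sockets are forcibly closed.
        ApolloServerPluginDrainHttpServer({
          httpServer,
          stopGracePeriodMillis: 90_000, // example: > 60s k8s grace period
        }),
      ],
    });

    await server.start();
    app.use('/graphql', express.json(), expressMiddleware(server));
    httpServer.listen(4000);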

We have also set httpServer.keepAliveTimeout and httpServer.headersTimeout higher than the ALB idle timeout.
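
For context, those are plain properties on Node's http.Server; a sketch assuming a 60s ALB idle timeout (the actual values in our deployment may differ):

    // Node's default keepAliveTimeout is 5s. If it is shorter than the ALB
    // idle timeout, the server can close an idle keep-alive socket that the
    // ALB still considers reusable, and the next request proxied over that
    // socket is answered with a TCP RST, which the ALB reports as a 502.
    httpServer.keepAliveTimeout = 65_000; // > assumed 60s ALB idle timeout
    httpServer.headersTimeout = 66_000;   // should exceed keepAliveTimeout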

[Screenshot: ALB logs]

Link to Reproduction

https://repost.aws/knowledge-center/elb-alb-troubleshoot-502-errors

Reproduction Steps

  1. Terminate a pod running apollo-server in EKS behind an AWS ALB
glasser (Member) commented Oct 8, 2024

I think you'd want the apollo server grace period to be smaller than the k8s-level grace period so it has time to finish up before k8s kills it...

tweeks-reify (Author) commented

That is the default behavior (10s vs 60s in K8s). Our theory is that although K8s says it sends the SIGTERM and stops traffic at the same time, the two are not actually simultaneous (i.e. AWS does not tell the target group to stop sending traffic until after the SIGTERM is issued). Any sessions using keepAlive that the ALB still has open can then hit a server that has already shut down, producing the TCP RSTs and 502s.

We added a preStop lifecycle hook to the container spec that runs sleep 30, like this, and we haven't encountered the issue since:

        lifecycle:
          preStop:
            # preStop runs before SIGTERM, so the pod keeps serving for 30s
            # while the ALB finishes deregistering the target.
            exec:
              command: [ "/bin/sleep", "30" ]
