-
Notifications
You must be signed in to change notification settings - Fork 741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IPv6 containers experience connectivity issues with large simultaneous file downloads #2817
Comments
@chen-anders I suggest filing an AWS support case here, as the complexity for this issue will likely require debug sessions and cluster access. In the meantime, I recommend collecting the node logs from the AL2 reproduction by executing the following bash script: https://github.com/awslabs/amazon-eks-ami/blob/main/log-collector-script/linux/eks-log-collector.sh |
Hi @jdn5126 , We don't have an AWS support plan that would allow us to file a technical support case. In the meantime, I'm going to be working with the team to try to get the requisite logs you're asking for. The production workload currently runs on Bottlerocket OS - is there a similar log collector script we can use there? |
I see that Bottlerocket has a section on logs: https://github.com/bottlerocket-os/bottlerocket#logs, but it does not look like it collects everything that we would need. I wonder if we can use the same strategy laid out there to execute the EKS AMI bash script |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
Sorry for the delay on our end. We're still planning to collect and share logs. |
We've repeated our tests over the past few days and are not able to repro the download stall anymore. We haven't made any related changes to our infrastructure and are still puzzled by the behavior. A few notes for anyone who might run into the same problem:
Hopefully this is resolved. Thanks for your help! |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
Issue closed due to inactivity. |
What happened:
Observed behavior is that large simultaneous downloads stall out and eventually we receive a "connection reset by peer" error. Sometimes, we also see TLS connection errors and DNS resolution errors, which cause some downloads to immediately error out.
These errors only affect downloads from IPv6 servers/endpoints. IPv4 works perfectly fine.
Example error output
Sometimes we see errors around establishing connections over HTTPS:
We host-mounted the CNI logs on the hosts we performed the testing, but didn't see any associated logs during our testing.
What you expected to happen:
Downloads complete without connection errors
How to reproduce it (as minimally and precisely as possible):
We have a Procfile that runs 9 downloads of a 700MB file in parallel.
Debian Slim Container
Launch a container:
kubectl run -it --rm ipv6-reset-test-debian --image public.ecr.aws/debian/debian:bullseye-slim --command -- bash
Alpine Container
Launch a container:
kubectl run -it --rm ipv6-reset-test-debian --image public.ecr.aws/docker/library/alpine:3.19.1 --command -- ash
`
Anything else we need to know?:
Environment is a dualstack IPv4/IPv6 VPC. We've been able to reproduce this on both nodes on public/private subnets.
Environment:
Kubernetes Versions:
Reproduced across AL2/Ubuntu/Bottlerocket with Kernel versions via EKS Managed Nodegroups:
-AL2:
5.10.209-198.858.amzn2.aarch64
/5.10.209-198.858.amzn2.x86_64
6.2.0-1017-aws #17~22.04.1-Ubuntu SMP
5.15.0-1048-aws #53~20.04.1-Ubuntu SMP
Reproduced on AWS VPC CNI versions:
Instance types used:
The text was updated successfully, but these errors were encountered: