
Too many open files #1317

Closed
edfungus opened this issue Mar 15, 2019 · 6 comments
Comments

@edfungus

Describe the bug
I seem to reach the open file limit fairly quickly, which causes pods to restart, leading to downtime. I see
[2019-03-15 00:14:11.730][000182][critical][assert] [source/common/network/listener_impl.cc:82] panic: listener accept failure: Too many open files
in the logs, and the container is restarted by Kubernetes. The load per Ambassador pod is between ~1,300 and ~1,800 requests per second. The ulimit seems to default to 2048.

To Reproduce
Steps to reproduce the behavior:

  1. We previously tested our service without Ambassador at up to 100,000 rps. Now we are adding Ambassador, which routes / to this service only. Our service's test endpoint returns 200 with a random latency of 0 to 1 sec.
  2. Ambassador is scaled to 10 pods (the backend is also scaled to 10 pods). The Kubernetes cluster has 5 workers in total.
  3. Using Vegeta to load test at 16,000 rps total (~1,600 rps per Ambassador pod), there are no issues. (A sketch of the kind of Vegeta command used is shown after this list.)
  4. Using Vegeta to load test at 18,000 rps total (~1,800 rps per Ambassador pod), Ambassador pods start restarting because of liveness probe failures. The logs also show "too many open files".
  5. I tried with 30 Ambassador pods and was able to get up to 40,000 rps total before similar pod failures, which averages out to ~1,300 rps per Ambassador pod.
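
For reference, a minimal sketch of the kind of Vegeta invocation used for these tests (the target URL, duration, and targets file name are hypothetical placeholders; only the rates come from the steps above):

# hypothetical endpoint routed through Ambassador; substitute the real URL
echo "GET http://ambassador.example.com/" > targets.txt
# drive 16,000 rps for 60 s against the targets file and summarize the results
vegeta attack -targets=targets.txt -rate=16000 -duration=60s | vegeta report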

Expected behavior
I didn't expect Ambassador pods to die/restart at that rps. I was hoping to see similar throughput to our test service. Is this around the expected load for an Ambassador pod?

Also, the default ulimit seems pretty low. Is there a way to increase that? On our own Docker image, I was able to set the ulimit before starting the application (roughly as sketched below).
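
For context, the way I do this in our image is roughly the following (a sketch; the 65536 value and the binary path are illustrative, and this can only raise the limit up to the container's hard limit unless the process is privileged):

#!/bin/sh
# entrypoint sketch: raise the open-file limit, then hand off to the service
ulimit -n 65536
exec /usr/local/bin/our-service "$@"

When running the container directly, docker run --ulimit nofile=65536:65536 <image> achieves the same thing without changing the entrypoint.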

Versions (please complete the following information):

  • Ambassador: 0.51.2
  • Kubernetes environment: EKS
  • Version: 1.11

Additional context
I also ensured that the load balancing between the Ambassador pods was equal.

Here is a graph of the tests from the perspective of the downstream service. The green line is the total rps received by the service, and the lower lines are the load from each Ambassador pod.

[Screenshot: Screen Shot 2019-03-15 at 11 08 14 AM]

The first green block is at 16,000 rps total; once I bumped it up to 18,000 rps, traffic no longer reached the downstream service because Ambassador pods were restarting.

The second green block is at 18,000 rps total with 30 Ambassador pods.

The last green block was a ramp-up test, which failed at around 40,000 rps total with the same 30 Ambassador pods.

Ambassador pods have 3Gi memory and 2000m CPU allocations to make sure those resources aren't the bottleneck.

@richarddli
Contributor

Hm, FWIW, we've seen workloads of up to 60K RPS per node on an m4.large, so the throughput you're seeing is definitely low.

Could you just exec into the Ambassador pod and raise the ulimit? We run Alpine Linux images; here's a thread:

http://lists.alpinelinux.org/alpine-user/0526.html

@edfungus
Author

I seem to get sh: error setting limit: Operation not permitted when trying to raise the ulimit in the pod. I couldn't quite follow the link you gave. Where should the lxc config be? Thanks!

@richarddli
Contributor

Hm, odd. I think the lxc link is a red herring. I just did this:

$ kubectl exec -it ambassador-d6cb9db66-r9lc6 -- /bin/sh
Defaulting container name to ambassador.
/ambassador # ulimit -n 90000
/ambassador # ulimit -Hn
90000

This doesn't work for you?

@edfungus
Author

hmm, no it doesn't :/

edmundfung: ~/go/src/edge/visage
± |dev {24} U:7 ?:2 ✗| → kubectl exec -it k8-ambassador-859f984bbf-glvdf -- /bin/sh
/ambassador $ ulimit -n 90000
sh: error setting limit: Operation not permitted
/ambassador $ ulimit -Hn
8192
/ambassador $ ulimit -Sn
2048

Ok, it sounds like if I can get the ulimit up then Ambassador will probably work as expected?

@richarddli
Contributor

I think this is an issue with EKS: pires/kubernetes-elasticsearch-cluster#215.

@edfungus
Author

Ok, I can confirm that with the ulimit increase, Ambassador is working a lot better. Essentially, for anyone else working with EKS who has trouble with ulimits: override the Docker config JSON /etc/docker/daemon.json with higher default ulimits in the user data script (which runs when the EC2 instance is initialized), and restart Docker afterwards.
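
For concreteness, here is a sketch of the relevant part of a user data script (the 65536 values are illustrative, and on EKS worker AMIs /etc/docker/daemon.json may already exist, in which case merge the default-ulimits key into it rather than overwriting the file):

#!/bin/bash
# set higher default ulimits for every container this Docker daemon starts
cat <<'EOF' > /etc/docker/daemon.json
{
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Soft": 65536,
      "Hard": 65536
    }
  }
}
EOF
# restart Docker so containers pick up the new defaults
systemctl restart docker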
