
Too many open files #1317

Closed
edfungus opened this issue Mar 15, 2019 · 6 comments
Comments

@edfungus

Describe the bug
I seem to reach the open file limit fairly quickly, which causes pods to restart, leading to downtime. I see
[2019-03-15 00:14:11.730][000182][critical][assert] [source/common/network/listener_impl.cc:82] panic: listener accept failure: Too many open files
in the logs, and the container is restarted by Kubernetes. The load per Ambassador pod is between ~1,300 and ~1,800 requests per second. The ulimit seems to default to 2048.

To Reproduce
Steps to reproduce the behavior:

  1. We previously tested our service without Ambassador at up to 100,000 rps. Now we are adding Ambassador, which routes / to this service only. Our service's test endpoint returns 200 with a random latency of 0 to 1 sec.
  2. Ambassador is scaled to 10 pods (the backend is also scaled to 10 pods). The Kubernetes cluster has 5 workers in total.
  3. Using Vegeta to load test at 16,000 rps total (~1,600 rps per Ambassador pod), there are no issues. (A sketch of the kind of Vegeta command used is shown after this list.)
  4. Using Vegeta to load test at 18,000 rps total (~1,800 rps per Ambassador pod), Ambassador pods start restarting because of liveness probe failures. The logs also show "too many open files".
  5. I tried with 30 Ambassador pods and was able to get up to 40,000 rps total before similar pod failures, which averages out to ~1,300 rps per Ambassador pod.
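
For reference, a minimal sketch of the kind of Vegeta invocation used for these tests (the target URL, duration, and targets file name are hypothetical placeholders; only the rates come from the steps above):

# hypothetical endpoint routed through Ambassador; substitute the real URL
echo "GET http://ambassador.example.com/" > targets.txt
# drive 16,000 rps for 60 s against the targets file and summarize the results
vegeta attack -targets=targets.txt -rate=16000 -duration=60s | vegeta report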

Expected behavior
I didn't expect Ambassador pods to die/restart at that rps. I was hoping to see similar throughput to our test service. Is this around the expected load for an Ambassador pod?

Also, the default ulimit seems pretty low. Is there a way to increase that? On our own Docker image, I was able to set the ulimit before starting the application (roughly as sketched below).
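
For context, the way I do this in our image is roughly the following (a sketch; the 65536 value and the binary path are illustrative, and this can only raise the limit up to the container's hard limit unless the process is privileged):

#!/bin/sh
# entrypoint sketch: raise the open-file limit, then hand off to the service
ulimit -n 65536
exec /usr/local/bin/our-service "$@"

When running the container directly, docker run --ulimit nofile=65536:65536 <image> achieves the same thing without changing the entrypoint.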

Versions (please complete the following information):

  • Ambassador: 0.51.2
  • Kubernetes environment: EKS
  • Version: 1.11

Additional context
I also ensured that the load balancing between the Ambassador pods was equal.

Here is a graph of the tests from the perspective of the downstream service. The green line is the total rps received by the service, and the lower lines are the load from each Ambassador pod.

[Screenshot: Screen Shot 2019-03-15 at 11 08 14 AM]

The first green block is at 16,000 rps total; once I bumped it up to 18,000 rps, traffic no longer reached the downstream service because Ambassador pods were restarting.

The second green block is at 18,000 rps total with 30 Ambassador pods.

The last green block was a ramp-up test, which failed at around 40,000 rps total with the same 30 Ambassador pods.

Ambassador pods have 3Gi memory and 2000m CPU allocations to make sure those resources aren't the bottleneck.

@richarddli
Contributor

Hm, FWIW, we've seen workloads of up to 60K RPS per node on an m4.large, so the throughput you're seeing is definitely low.

Could you just exec into the Ambassador pod and raise the ulimit? We run Alpine Linux images; here's a thread:

http://lists.alpinelinux.org/alpine-user/0526.html

@edfungus
Author

I seem to get sh: error setting limit: Operation not permitted when trying to raise the ulimit in the pod. I couldn't quite follow the link you gave. Where should the lxc config be? Thanks!

@richarddli
Contributor

Hm, odd. I think the lxc link is a red herring. I just did this:

$ kubectl exec -it ambassador-d6cb9db66-r9lc6 -- /bin/sh
Defaulting container name to ambassador.
/ambassador # ulimit -n 90000
/ambassador # ulimit -Hn
90000

This doesn't work for you?

@edfungus
Author

hmm, no it doesn't :/

edmundfung: ~/go/src/edge/visage
± |dev {24} U:7 ?:2 ✗| → kubectl exec -it k8-ambassador-859f984bbf-glvdf -- /bin/sh
/ambassador $ ulimit -n 90000
sh: error setting limit: Operation not permitted
/ambassador $ ulimit -Hn
8192
/ambassador $ ulimit -Sn
2048

Ok, it sounds like if I can get the ulimit up then Ambassador will probably work as expected?

@richarddli
Contributor

I think this is an issue with EKS: pires/kubernetes-elasticsearch-cluster#215.

@edfungus
Author

Ok, I can confirm that with the ulimit increase, Ambassador is working a lot better. Essentially, for anyone else working with EKS who has trouble with ulimits: override the Docker config JSON /etc/docker/daemon.json with higher default ulimits in the user data script (which runs when the EC2 instance is initialized), and restart Docker afterwards.
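
For concreteness, here is a sketch of the relevant part of a user data script (the 65536 values are illustrative, and on EKS worker AMIs /etc/docker/daemon.json may already exist, in which case merge the default-ulimits key into it rather than overwriting the file):

#!/bin/bash
# set higher default ulimits for every container this Docker daemon starts
cat <<'EOF' > /etc/docker/daemon.json
{
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Soft": 65536,
      "Hard": 65536
    }
  }
}
EOF
# restart Docker so containers pick up the new defaults
systemctl restart docker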
