
Memory leak in proxy? #388

Open
snickell opened this issue Apr 7, 2022 · 41 comments

@snickell

snickell commented Apr 7, 2022

Aloha, we've been seeing a pattern of growing daily memory usage (followed by increasing sluggishness, then non-responsiveness above around 1-2GB of RAM) in the 'proxy' pod:
[screenshot: daily memory-usage growth in the proxy pod]

The different colors are fresh proxy reboots, which have been required to keep the cluster running.

[screenshot: Screen Shot 2022-04-07 at 5 01 00 AM]

-Seth

@snickell snickell added the bug label Apr 7, 2022
@snickell
Author

snickell commented Apr 7, 2022

Sorry, clipped the units:
[screenshot: the same memory graph with units visible]

The pattern is nearly identical on the other cluster.

@snickell
Author

snickell commented Apr 7, 2022

We're running z2jh chart version 1.1.3-n354.h751bc313 (I believe the latest ~3 weeks ago), but as you can see, this pattern predates this chart version by quite a bit.

@consideRatio consideRatio transferred this issue from jupyterhub/zero-to-jupyterhub-k8s Apr 7, 2022
@snickell
Author

snickell commented Apr 7, 2022

We start seeing serious performance problems at about 1.5GB, which is suspiciously close to the heap limit for node 🤔 So maybe it's a memory leak that then cascade-fails at the heap limit into some sort of .... garbage collection nightmare? or?
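For reference, the V8 heap limit for the node process can be checked from inside the proxy pod (a sketch; assumes the z2jh deployment is named proxy and that node is on PATH in the image):

kubectl -n <namespace> exec deploy/proxy -- \
  node -e "console.log(require('v8').getHeapStatistics().heap_size_limit / 1024 / 1024, 'MB')"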

@manics
Member

manics commented Apr 7, 2022

Do you happen to know if the memory increases are correlated with particular events, e.g. a user starting a new server, or connecting to a particular service?

@snickell
Author

snickell commented Apr 7, 2022

No, but I'm looking into it. My vague suspicion: websockets? We push them pretty hard, e.g. many users are streaming VNC over websocket. Is there a log mode that has useful stats about e.g. the routing table?
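One way to at least watch the routing table is CHP's REST API. A sketch, assuming the z2jh layout where the API listens on port 8001 with CONFIGPROXY_AUTH_TOKEN set in the proxy container, and that busybox wget is available in the image:

kubectl -n <namespace> exec deploy/proxy -- sh -c \
  'wget -qO- --header "Authorization: token $CONFIGPROXY_AUTH_TOKEN" http://localhost:8001/api/routes'

If the number of entries there grows without bound, the routing table itself is leaking; if not, the leak is likely elsewhere (e.g. sockets).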

@snickell
Author

OK, so a further development: since high RAM usage correlated with performance problems, I added a k8s memory limit to the pod, thinking it would get killed when it passed 1.4GB of RAM and reboot fresh, a decent-ish workaround for now.
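For reference, the limit was added with z2jh values of roughly this shape (a sketch; assumes the proxy.chp.resources values path, numbers approximate):

proxy:
  chp:
    resources:
      limits:
        memory: 1.4Gi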

Here's what happened instead:
[screenshot: proxy memory usage after adding the limit]

Note that there's one other unusual thing here: I kubectl exec'ed several 200MB "ram balloon" processes to try to push it over the edge faster for testing. They clearly didn't work haha, and I doubt that's why this is not growing at the normal leakage rate, but worth mentioning.

Did something else change, or did adding a k8s memory limit suddenly change the behavior?

@snickell
Author

(note: this otherwise consistent memory growth pattern goes back to January, across a number of z2jh chart version upgrades since..... this is.... weird)

@consideRatio
Member

Hmmm, so when the pod restarts, is it because it has been evicted from a node, or because the process restarted within the container, etc?

Being evicted from a node can happen based on external logic, while managing memory within the container happens based on more internal logic, which can be triggered by limits that make clear it must not surpass a certain amount.

I need to learn more about how the OOM killer acts within the container vs via the kubelet etc, but perhaps you ended up helping it avoid getting evicted for surpassing its memory limit. Hmmm..
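One way to tell the two apart (a sketch; assumes the z2jh component=proxy label):

kubectl -n <namespace> get events --sort-by=.lastTimestamp | grep -iE 'evict|oom|proxy'
kubectl -n <namespace> describe pod -l component=proxy | grep -A 5 'Last State'

An in-container OOM kill shows up as Last State: Terminated with Reason: OOMKilled, while an eviction shows up as an Evicted pod status/event.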

@rcthomas
Contributor

rcthomas commented Jul 6, 2022

@snickell was what you observed related to load at all? Like, on weekend days do you observe this behavior? We're currently experiencing relatively-speaking high load on our deployment, and I observe something similar. Memory consumption in the proxy will just suddenly shoot up and it becomes non-responsive. Are you still using CHP for your proxy? I am considering swapping it for Traefik in the coming days here.

@consideRatio
Member

consideRatio commented Jul 14, 2022

@snickell have you experienced this with older versions of z2jh -> chp as well?

@marcelofernandez

Still happening on the latest version (v4.5.6).

@shaneknapp

shaneknapp commented Jun 6, 2024

see also #434

i believe the socket leak is the root cause of the memory leak. on our larger, more active hubs we've seen constant spiking of the chp ram under "load", and chp running out of heap space: #434 (comment)

"load" is ~300+ users logging in around the "same time".

"same time" is anywhere from 15m to a couple of hours.

i don't believe that increasing the chp heap size is the correct fix, as the memory/socket leak still needs to be addressed. however, increasing it may help, but that would need some experimentation.
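for anyone wanting to run that experiment, a sketch of bumping node's heap via the z2jh chart (assumes proxy.chp.extraEnv is available in the chart version in use; this buys headroom, it doesn't fix the leak):

proxy:
  chp:
    extraEnv:
      NODE_OPTIONS: --max-old-space-size=2048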

@marcelofernandez

We finally replaced chp with traefik in our z2jh deployment, and this problem was clearly fixed. 😬

Check out that alternative just in case you are experiencing this.

@shaneknapp

shaneknapp commented Jun 7, 2024 via email

@consideRatio
Member

@marcelofernandez are you able to share config for your setup?

@shaneknapp

We finally replaced chp with traefik in our z2jh deployment, and this problem was clearly fixed. 😬

Check out that alternative just in case you are experiencing this.

echoing @consideRatio -- do you have any relevant traefik config bits you could share? this would be super useful! :)

thanks in advance...

@marcelofernandez

Hey guys, sure!

First and foremost, I'm sorry I can't give you all the details of my company's internal PR because:

  • I don't wanna go into any IP issues and (most importantly),
  • We're still using a very old version of z2jh, so I'm not sure how much of this is still relevant to the latest versions. In a perfect world I'd prepare a PR for z2jh without a hitch.

That said, I can give you an overview of what I did.

The complicated part was that it seemed like nobody had done this before, so I based my work on this (far more ambitious) previous, rejected PR, which originally aimed to replace both proxies:

  • HTTP -> HTTPS one (the TLS frontend terminator called autohttps), and
  • configurable-http-proxy, but:
    • Also making it HA-ready, supporting more than one instance of the proxy (making it more scalable), and
    • Creating a new service called Consul to store all the proxies' shared config, etc., which brought more complexity to the PR.

The only thing I did (because I only wanted stability) based on that PR was to:

  • Drop the configurable-http-proxy Pod, and
  • Replace it with just one container of Traefik inside the Hub Pod,
  • Using the JupyterHub Traefik Proxy component (running in the Hub container) to automatically configure the Traefik container.
  • Now, both containers (Hub + Traefik) run in the same Pod still called Hub.

Based on Z2JH's architecture diagram, here are the changes.

Before:
[diagram: Z2JH architecture with the separate configurable-http-proxy pod]

After:
[diagram: Z2JH architecture with Traefik running as a sidecar container in the hub pod]

Once I had defined what I wanted, I had to drop the unneeded code from the PR above, configure the hub to call the proxy in the same pod (http://localhost:8081), and that's it.
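To give a rough idea of the shape of the change (illustrative only, not the internal PR; assumes jupyterhub-traefik-proxy's file-provider entry point and the chart's hub.extraContainers hook, and omits the Traefik static/dynamic config plus the chart surgery that drops the proxy Deployment):

hub:
  config:
    JupyterHub:
      proxy_class: traefik_file   # entry point provided by jupyterhub-traefik-proxy
  extraContainers:
    - name: traefik               # sidecar in the hub pod; image tag illustrative
      image: traefik:v2.10
      # Traefik configuration and volume mounts omitted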

I implemented this about a year and a half ago; if you have more questions, just let me know...

Regards

@manics
Member

manics commented Aug 13, 2024

4.6.2 was released 2 months ago with a fix for the leaking sockets. Is there still a memory leak or can we close this issue?

@shaneknapp

@manics i don't think we should close this yet... we still saw chp run out of nodejs heap on hubs w/lots of traffic and usage even after we deployed 4.6.2, but since summer is slow it hasn't bitten us yet.

i'm sure that within a few weeks we'll see OOMs/socket leaks once the fall term ramps up.

@minrk
Member

minrk commented Aug 15, 2024

If anyone can make a stress test to provoke this, ideally with just CHP (or the JupyterHub Proxy API, like the traefik proxy benchmarks) I can test if the migration to http2-proxy will help. I tried a simple local test with a simple backend and apache-bench, but many millions of requests and hundreds of gigabytes later, I see no significant increase in memory or socket consumption (still sub-100MB). So there must be something relevant in typical use (websockets, connections dropped in a particular way, adding/removing routes, etc.) that a naïve benchmark doesn't trigger.
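For reference, the naive local test was roughly of this shape (a sketch; the exact backend, counts and flags may have differed):

npm install -g configurable-http-proxy
python3 -m http.server 9000 &                      # trivial backend
configurable-http-proxy --default-target=http://127.0.0.1:9000 &
ab -n 1000000 -c 100 http://127.0.0.1:8000/        # hammer CHP's proxy port

A real reproducer probably needs websockets, route add/remove churn, and abruptly dropped connections on top of this.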

@shaneknapp

shaneknapp commented Sep 12, 2024

If anyone can make a stress test to provoke this, ideally with just CHP (or the JupyterHub Proxy API, like the traefik proxy benchmarks) I can test if the migration to http2-proxy will help. I tried a simple local test with a simple backend and apache-bench, but many millions of requests and hundreds of gigabytes later, I see no significant increase in memory or socket consumption (still sub-100MB). So there must be something relevant in typical use (websockets, connections dropped in a particular way, adding/removing routes, etc.) that a naïve benchmark doesn't trigger.

re benchmarking... we really don't have the cycles, available staff or deep understanding of how the proxy works to do this.

re the something: we're seeing mildly improved performance w/4.6.2 but are still experiencing pretty regular, albeit much shorter (and self-recovering) outages at "peak"[1] usage.

[1] peak can be anywhere from ~200 up to ~800 users on a hub.

for example, last night between 845p and 9p, we had ~188 students logged on to datahub (the lowest end of 'peak') and saw the proxy peg at 100% CPU and 1.16G ram.

[screenshots: proxy CPU and memory graphs during the outage]

hub cpu hovered around ~40% until the outage, and during that 15m dropped to nearly 0%. hub memory usage was steady at around ~477M

only during the 15m of the outage (~845p - 9p) were our chp logs full of entries like this:
[screenshot: chp error log entries]

not surprisingly, the readiness probes couldn't find either the hub or proxy during this outage (and also the hub just after things recovered?):
[screenshot: readiness probe failures for the hub and proxy]

i'll dig more through the logs and see what needles i can winnow out of the haystacks.

@shaneknapp

shaneknapp commented Sep 12, 2024

running dmesg -T on the core node where this proxy was running indeed shows that chp is still getting OOMKilled:

[Wed Sep 11 21:48:22 2024] nginx invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=995
[Wed Sep 11 21:48:22 2024] CPU: 11 PID: 3160400 Comm: nginx Not tainted 6.1.90+ #1
[Wed Sep 11 21:48:22 2024] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024
[Wed Sep 11 21:48:22 2024] Call Trace:
[Wed Sep 11 21:48:22 2024]  <TASK>
[Wed Sep 11 21:48:22 2024]  dump_stack_lvl+0x4a/0x70
[Wed Sep 11 21:48:22 2024]  dump_header+0x52/0x250
[Wed Sep 11 21:48:22 2024]  oom_kill_process+0x10a/0x220
[Wed Sep 11 21:48:22 2024]  out_of_memory+0x3dc/0x5c0
[Wed Sep 11 21:48:22 2024]  ? mem_cgroup_iter+0x1c6/0x240
[Wed Sep 11 21:48:22 2024]  try_charge_memcg+0x827/0xa90
[Wed Sep 11 21:48:22 2024]  charge_memcg+0x3f/0x1f0
[Wed Sep 11 21:48:22 2024]  __mem_cgroup_charge+0x2b/0x80
[Wed Sep 11 21:48:22 2024]  handle_mm_fault+0xf80/0x16b0
[Wed Sep 11 21:48:22 2024]  do_user_addr_fault+0x271/0x4d0
[Wed Sep 11 21:48:22 2024]  exc_page_fault+0x78/0xf0
[Wed Sep 11 21:48:22 2024]  asm_exc_page_fault+0x22/0x30
[Wed Sep 11 21:48:22 2024] RIP: 0033:0x5cd7083be658
[Wed Sep 11 21:48:22 2024] Code: 10 e8 1c 5e 00 00 49 89 87 08 02 00 00 48 85 c0 0f 84 24 02 00 00 49 8b 8f e8 01 00 00 48 85 c9 74 20 31 d2 66 0f 1f 44 00 00 <80> 48 0a 01 49 8b 8f e8 01 00 00 48 83 c2 01 48 83 c0 60 48 39 ca
[Wed Sep 11 21:48:22 2024] RSP: 002b:00007fff48f73f70 EFLAGS: 00010283
[Wed Sep 11 21:48:22 2024] RAX: 00007ba90b837030 RBX: 00007ba90bad2050 RCX: 0000000000004000
[Wed Sep 11 21:48:22 2024] RDX: 000000000000112a RSI: 0000000000180000 RDI: 00005cd709921ea0
[Wed Sep 11 21:48:22 2024] RBP: 00007ba90a329380 R08: 00000000000000fe R09: 0000000000000000
[Wed Sep 11 21:48:22 2024] R10: 00007ba90b950ffc R11: 0000000000000006 R12: 00005cd70856a530
[Wed Sep 11 21:48:22 2024] R13: 00005cd7084eb121 R14: 00005cd7084eb12e R15: 00007ba90a329380
[Wed Sep 11 21:48:22 2024]  </TASK>
[Wed Sep 11 21:48:22 2024] memory: usage 2097168kB, limit 2097152kB, failcnt 806
[Wed Sep 11 21:48:22 2024] swap: usage 0kB, limit 9007199254740988kB, failcnt 0

this repeated regularly for about 30m, and then for another 30m we saw lots of messages like this:

[Wed Sep 11 22:38:28 2024] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podc982fb8d_dd48_4bff_a016_38cda0c5767a.slice/cri-containerd-58fd1ffa87a46e002e02ede50e42cf51b21de27dc6ffccf8f22032be3bcc2f80.scope are going to be killed due to memory.oom.group set
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3223198 (dumb-init) total-vm:220kB, anon-rss:8kB, file-rss:0kB, shmem-rss:0kB, UID:101 pgtables:24kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3223258 (nginx) total-vm:155116kB, anon-rss:37304kB, file-rss:7056kB, shmem-rss:756kB, UID:101 pgtables:188kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3225302 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3226667 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227348 (nginx) total-vm:171012kB, anon-rss:48508kB, file-rss:5728kB, shmem-rss:2528kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227349 (nginx) total-vm:172136kB, anon-rss:49644kB, file-rss:5728kB, shmem-rss:2492kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227350 (nginx) total-vm:172080kB, anon-rss:49560kB, file-rss:5728kB, shmem-rss:2544kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227351 (nginx) total-vm:171792kB, anon-rss:49336kB, file-rss:5728kB, shmem-rss:2504kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227382 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227423 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227567 (nginx) total-vm:171948kB, anon-rss:49412kB, file-rss:5728kB, shmem-rss:2520kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227592 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227680 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227713 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227746 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3228024 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3228067 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3228303 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3230430 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3230824 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3230826 (nginx) total-vm:170948kB, anon-rss:48460kB, file-rss:5728kB, shmem-rss:4652kB, UID:101 pgtables:216kB oom_score_adj:995
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231122 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231172 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231288 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231347 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231814 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231921 (nginx) is already exiting. Skip killing the task

during all of this, the chp logs are full of thousands of 503 GET or 503 POST errors for active users attempting to get work done. :\

so: something is still amiss. chp is running out of heap space. there's absolutely a memory/socket leak remaining somewhere.

@consideRatio
Member

consideRatio commented Sep 12, 2024

We have discussed two kinds of memory, network TCP memory and normal RAM memory - getting OOM-killed on normal memory is a consequence of surpassing the pod's memory limit in k8s.

This is normal memory killing I think, so what memory request/limit is configured for the proxy pod?

Note that i think the graph you have in grafana may represent an average combination of pods if you have multiple proxy pods in the k8s cluster, so then you could see memory usage below the requested amount even though an individual pod goes above it. I recall an issue opened about this... Found it: jupyterhub/grafana-dashboards#128

Is normal memory still growing without bound over time as users come and go, with chp getting memory killed for that reason, making requesting more memory just a matter of gaining time before a crash?

@shaneknapp

We have discussed two kinds of memory, network TCP memory and normal RAM memory - getting OOM-killed on normal memory is a consequence of surpassing the pod's memory limit in k8s.

This is normal memory killing I think, so what memory request/limit is configured for the proxy pod?

      resources:
        requests:
          cpu: 0.001
          memory: 64Mi
        limits:
          memory: 1.5Gi

Note that i think the graph you have in grafana may represent an average combination of pods if you have multiple proxy pods in the k8s cluster, so then you could see memory usage below the requested amount even though an individual pod goes above it. I recall an issue opened about this... Found it: jupyterhub/grafana-dashboards#128

nope -- this is only one deployment, not a sum of them all.

Is normal memory still growing without bound over time as users come and go, with chp getting memory killed for that reason, making requesting more memory just a matter of gaining time before a crash?

it seems to grow over time, and as it surpasses the Node.js max heap, cpu will inevitably ramp up before things start getting OOMKilled.

@shaneknapp

We have discussed two kinds of memory, network TCP memory and normal RAM memory - getting OOM-killed on normal memory is a consequence of surpassing the pod's memory limit in k8s.
This is normal memory killing I think, so what memory request/limit is configured for the proxy pod?

      resources:
        requests:
          cpu: 0.001
          memory: 64Mi
        limits:
          memory: 1.5Gi

fwiw i'm about to bump this to 3Gi "just to see".

@shaneknapp

this just happened again and it really looks like we're running out of ephemeral ports...

@shaneknapp

we're actually thinking about putting our chp pods in their own pool, one proxy per (very small) node to get past this until a fix or better solution comes along.

@shaneknapp

currently on the impacted proxy node:

/srv/configurable-http-proxy $ netstat -natp|wc -l
24536
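for comparison, the ephemeral range configured in that same network namespace (a sketch, run from the same shell):

cat /proc/sys/net/ipv4/ip_local_port_range    # configured ephemeral port range
netstat -nat | grep -c ESTABLISHED            # how many connections are actually established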

@shaneknapp

hmm, well, this sheds some light on things: #465

some of our hubs have > 1000 users.

@felder

felder commented Sep 13, 2024

Noticed that when this behavior happens at berkeley we see accompanying logs indicating EADDRNOTAVAIL and tons of connection failures to the hub pod.

Really looks like the issue described here:
https://blog.cloudflare.com/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/

https://devops-insider.mygraphql.com/zh-cn/latest/network/tcp/ephemeral-port/ephemeral-port-exhaustion-and-how-to-avoid-it.html

Running netstat -natp on the chp proxy indicates it’s very possible we have enough connections to be running out of ephemeral ports.

Would likely explain the sudden increase in cpu as well, because I think the proxy rapidly retries when this behavior starts.

@shaneknapp

shaneknapp commented Sep 18, 2024

for those continuing to be held in rapture by this enthralling story, i have some relatively useful updates!

  1. for large deployments, chp needs at least 2.5Gi of ram. we're seeing ram usage up to nearly 2Gi when many users are logged in and working. we have 3Gi allocated for this, and many users == ~300+
  2. after meeting w/@minrk and @consideRatio and discussing what was going on, we decided to try setting the chp timeout and proxy-timeout to 5 seconds (see the config sketch after this list). this gave us a huge improvement in reliability, and while this issue is still impacting us, we have a little bit more breathing room: https://github.com/berkeley-dsep-infra/datahub/blob/staging/hub/values.yaml#L47
  3. we are definitely, 100% running out of ephemeral ports in the chp pod for hubs that are using jupyterlab/notebooks/etc.
  4. we are definitely, 100% NOT running out of ephemeral ports on hubs w/similar usage patterns that are deployed with RStudio, VSCode or anything using a proxy like jupyter-rsession-proxy and jupyter-vscode-proxy.
  5. during peak usage times, the ephemeral ports used will increase significantly (25k+) and drop down to ~2k+ after peak. at least once or twice a day we'll hit the 28k limit, and the best/only way to stop a significant outage is to kill/restart the hub and proxy pods. at least we now have the time to catch and kill before an outage w/the timeouts set to 5 seconds. this is a big win, but still sub-optimal.
  6. if we don't intervene and chp gets OOMKilled, this will lead to a short outage of ~10-30 minutes before things auto-recover. users will see 503/service unavailable errors, as chp is unable to route them to the hub and user pods.
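a sketch of the timeouts from point 2, expressed as z2jh values (assumes proxy.chp.extraCommandLineFlags is available; CHP takes these in milliseconds):

proxy:
  chp:
    extraCommandLineFlags:
      - --timeout=5000
      - --proxy-timeout=5000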

point 4 is important... and makes me wonder if something in lab or notebook has buggy socket handling code.

@felder

felder commented Sep 18, 2024

A big difference between hubs that use one of the mentioned proxies and those that don't is that instead of a majority of the connections going from chp -> hub:8081, the connections go to the user pods directly. Since the user pods are spread out across a subnet, the connections aren't all focused on a single ip:port destination, so there is no issue with ephemeral port exhaustion.

@shaneknapp

A big difference between hubs that use one of the mentioned proxies and those that don't is that instead of a majority of the connections going from chp -> hub:8081, the connections go to the user pods directly. Since the user pods are spread out across a subnet, the connections aren't all focused on a single ip:port destination, so there is no issue with ephemeral port exhaustion.

yep, thanks for clarifying @felder !

@consideRatio
Member

consideRatio commented Sep 18, 2024

About where connections go etc, i expect:

  • ingress controller -> chp + chp -> user pods (incoming connections for browser -> jupyter server traffic)
  • user pods -> chp + chp -> hub (reporting user activity, server checking auth, maybe more?)
  • user pods -> internet (jupyter server making internet request via node's public ip without chp involved)

If you could figure out something more about where the traffic that amounts to huge numbers of requests is going, that may allow us to tune lab/hub etc; for example, how often the user server reports activity to the hub can be reduced, I think. I also think lab checks connectivity with hub regularly.
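A sketch of one such knob, reducing how often user servers report activity to the hub (assumes the JUPYTERHUB_ACTIVITY_INTERVAL environment variable, in seconds with a default of 300, is honored by the singleuser servers; worth verifying for your versions):

singleuser:
  extraEnv:
    JUPYTERHUB_ACTIVITY_INTERVAL: "600"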

@felder

felder commented Sep 18, 2024

@consideRatio Do you have ideas for determining more information regarding these connections? Currently we're just looking at reporting from netstat and seeing the ip:port pairs for source and destination on the connections.

Here's sample output for what we're seeing with netstat on the chp pod, which unfortunately is not particularly helpful with regard to getting more specifics.

$ netstat -natp | grep ESTABLISHED | more
tcp        0      0 10.28.7.157:52072       10.28.11.251:8888       ESTABLISHED 1/node
tcp        0      0 10.28.7.157:36022       10.28.30.220:8888       ESTABLISHED 1/node
tcp        0      0 10.28.7.157:36018       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:48910       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:33400       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:50372       10.28.5.202:8888        ESTABLISHED 1/node
tcp        0      0 10.28.7.157:38690       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:43694       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:55028       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:40936       hubip:8081     ESTABLISHED 1/node
...
...
...

In this case 10.28.7.157 is the chp pod ip, I replaced the actual ip for the hub with "hubip" and we can see a few connections going to user pods on port 8888.

What I do not know at this time is how to figure out why the connections were opened in the first place, or what user/process they are associated with. We're just getting started on this line of investigation (we only discovered the ephemeral port exhaustion late last week), and while I would love to give you the information you're requesting, I don't know how to obtain it.

Also, out of curiosity do you see a similar ratio as we do on your deployments with regard to ephemeral port usage?

We're seeing this issue primarily on hubs with > 200 users, which would suggest roughly ~100 connections per user from chp -> hub:8081, but again I have no way at this time of associating the ephemeral ports with anything meaningful, so for all I know it could be some activity on the part of a subset of users or processes.
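A rough per-destination breakdown of the same netstat data at least separates hub traffic from user-pod traffic (a sketch):

netstat -nat | awk '/ESTABLISHED/ {split($5, d, ":"); print d[1]}' | sort | uniq -c | sort -rn | head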

@benz0li

benz0li commented Sep 19, 2024

3. we are definitely, 100% running out of ephemeral ports in the chp pod for hubs that are using jupyterlab/notebooks/etc.

@shaneknapp What image(s) do you use?

  1. Jupyter Docker Stacks?
  2. Self-built?
    • Based on Jupyter Docker Stacks: What do you add?
    • From scratch: Dockerfile(s)?

@benz0li

benz0li commented Sep 19, 2024

@shaneknapp If ephemeral port exhaustion is not caused by chp itself, this discussion should be moved to another issue.

@snickell Could you test if there is still a memory leak with v4.6.2?

IMHO these are separate issues.

@shaneknapp

  1. we are definitely, 100% running out of ephemeral ports in the chp pod for hubs that are using jupyterlab/notebooks/etc.

@shaneknapp What image(s) do you use?

  1. Jupyter Docker Stacks (https://github.com/jupyter/docker-stacks)?
  2. Self-built?
    • Based on Jupyter Docker Stacks: What do you add?
    • From scratch: Dockerfile(s)?

all of our images are built from scratch... some are dockerfile-based, most are pure repo2docker, and everything is built w/r2d. of the three hubs most impacted by this, one has a complex Dockerfile-based build, and the other two are very straightforward python repo2docker builds.

@shaneknapp If ephemeral port exhaustion is not caused by chp itself, this discussion should be moved to another issue.

well, until literally late last thursday we had no idea what was going on, and we still don't know what is causing the port spam yet. if you have any suggestions i'd be happy to move the conversation there.

perhaps #434 ?

@snickell Could you test if there is still a memory leak with v4.6.2?

IMHO these are separate issues.

@benz0li

benz0li commented Sep 19, 2024

perhaps #434 ?

No.

@shaneknapp

shaneknapp commented Sep 19, 2024

perhaps #434 ?

No.

until we figure out a better home for this issue, i will continue to update our findings here. :)

anyways...

  1. we had to revert the --timeout=5000 and --proxy-timeout=5000 changes as this broke PDF exporting (i know, right!?!). we're considering setting these to be 5 or 10m long.
  2. yesterday we increased the number of ephemeral ports and this seems to give us enough headroom for the impacted chps to continue working... this feels much more like a bandaid than a solution to me, but for now it seems to be working pretty well. we will continue to monitor and intervene as needed to avert outages.
spec:
  template:
    spec:
      initContainers:
      - command:
        - sysctl
        - -w
        - net.ipv4.ip_local_port_range=10000 65000
        image: busybox
        imagePullPolicy: IfNotPresent
        name: init-sysctl
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File

deployed via kubectl -n <namespace> patch deployment proxy --patch-file chp-pod-deployment-patch.yaml

this gives us 55000 (65000 - 10000) ephemeral ports.

@consideRatio
Member

I figure we have the following kinds of issues; I opened #557 to represent the third kind - let's switch to that!

  1. Memory leak
    For observations about normal RAM memory usage increasing over time without being reclaimed, making it grow indefinitely.
  2. Socket leak
    For observations about sockets increasing over time without being reclaimed, making the socket count grow indefinitely.
    Related to this are probably any networking-related memory issues, as unbounded growth of sockets could go hand in hand with using more and more of that memory.
  3. Running low on ephemeral ports
    I think this is separate from the other issues, and perhaps also isn't a bug in CHP but, for example, in software spamming connections.
