
Memory leak in proxy? #388

Open
snickell opened this issue Apr 7, 2022 · 41 comments

@snickell

snickell commented Apr 7, 2022

Aloha, we've been seeing a pattern of growing daily memory usage (followed by increasing sluggishness, then non-responsiveness above around 1-2GB of RAM) in the 'proxy' pod:
[screenshot: daily memory-usage growth in the proxy pod]

The different colors are fresh proxy reboots, which have been required to keep the cluster running.

[screenshot: Screen Shot 2022-04-07 at 5 01 00 AM]

-Seth

@snickell snickell added the bug label Apr 7, 2022
@snickell
Author

snickell commented Apr 7, 2022

Sorry, clipped the units:
[screenshot: the same memory graph with units visible]

The pattern is nearly identical on the other cluster.

@snickell
Author

snickell commented Apr 7, 2022

We're running z2jh chart version 1.1.3-n354.h751bc313 (I believe the latest ~3 weeks ago), but as you can see, this pattern predates this chart version by quite a bit.

@consideRatio consideRatio transferred this issue from jupyterhub/zero-to-jupyterhub-k8s Apr 7, 2022
@snickell
Author

snickell commented Apr 7, 2022

We start seeing serious performance problems at about 1.5GB, which is suspiciously close to the heap limit for node 🤔 So maybe it's a memory leak that then cascade-fails at the heap limit into some sort of .... garbage collection nightmare? or?
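For reference, the V8 heap limit for the node process can be checked from inside the proxy pod (a sketch; assumes the z2jh deployment is named proxy and that node is on PATH in the image):

kubectl -n <namespace> exec deploy/proxy -- \
  node -e "console.log(require('v8').getHeapStatistics().heap_size_limit / 1024 / 1024, 'MB')"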

@manics
Member

manics commented Apr 7, 2022

Do you happen to know if the memory increases are correlated with particular events, e.g. a user starting a new server, or connecting to a particular service?

@snickell
Author

snickell commented Apr 7, 2022

No, but I'm looking into it. My vague suspicion: websockets? We push them pretty hard, e.g. many users are streaming VNC over websocket. Is there a log mode that has useful stats about e.g. the routing table?
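One way to at least watch the routing table is CHP's REST API. A sketch, assuming the z2jh layout where the API listens on port 8001 with CONFIGPROXY_AUTH_TOKEN set in the proxy container, and that busybox wget is available in the image:

kubectl -n <namespace> exec deploy/proxy -- sh -c \
  'wget -qO- --header "Authorization: token $CONFIGPROXY_AUTH_TOKEN" http://localhost:8001/api/routes'

If the number of entries there grows without bound, the routing table itself is leaking; if not, the leak is likely elsewhere (e.g. sockets).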

@snickell
Author

OK, so a further development: since high RAM usage correlated with performance problems, I added a k8s memory limit to the pod, thinking it would get killed when it passed 1.4GB of RAM and reboot fresh, a decent-ish workaround for now.
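For reference, the limit was added with z2jh values of roughly this shape (a sketch; assumes the proxy.chp.resources values path, numbers approximate):

proxy:
  chp:
    resources:
      limits:
        memory: 1.4Gi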

Here's what happened instead:
[screenshot: proxy memory usage after adding the limit]

Note that there's one other unusual thing here: I kubectl exec'ed several 200MB "ram balloon" processes to try to push it over the edge faster for testing. They clearly didn't work haha, and I doubt that's why this is not growing at the normal leakage rate, but worth mentioning.

Did something else change, or did adding a k8s memory limit suddenly change the behavior?

@snickell
Author

(note: this otherwise consistent memory growth pattern goes back to January, across a number of z2jh chart version upgrades since..... this is.... weird)

@consideRatio
Member

Hmmm, so when the pod restarts, is it because it has been evicted from a node, or because the process restarted within the container, etc?

Being evicted from a node can happen based on external logic, while managing memory within the container happens based on more internal logic, which can be triggered by limits that make clear it must not surpass a certain amount.

I need to learn more about how the OOM killer acts within the container vs via the kubelet etc, but perhaps you ended up helping it avoid getting evicted for surpassing its memory limit. Hmmm..
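One way to tell the two apart (a sketch; assumes the z2jh component=proxy label):

kubectl -n <namespace> get events --sort-by=.lastTimestamp | grep -iE 'evict|oom|proxy'
kubectl -n <namespace> describe pod -l component=proxy | grep -A 5 'Last State'

An in-container OOM kill shows up as Last State: Terminated with Reason: OOMKilled, while an eviction shows up as an Evicted pod status/event.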

@rcthomas
Contributor

rcthomas commented Jul 6, 2022

@snickell was what you observed related to load at all? Like, on weekend days do you observe this behavior? We're currently experiencing relatively-speaking high load on our deployment, and I observe something similar. Memory consumption in the proxy will just suddenly shoot up and it becomes non-responsive. Are you still using CHP for your proxy? I am considering swapping it for Traefik in the coming days here.

@consideRatio
Member

consideRatio commented Jul 14, 2022

@snickell have you experienced this with older versions of z2jh -> chp as well?

@marcelofernandez

Still happening on the latest version (v4.5.6).

@shaneknapp

shaneknapp commented Jun 6, 2024

see also #434

i believe the socket leak is the root cause of the memory leak. on our larger, more active hubs we've seen constant spiking of the chp ram under "load", and chp running out of heap space: #434 (comment)

"load" is ~300+ users logging in around the "same time".

"same time" is anywhere from 15m to a couple of hours.

i don't believe that increasing the chp heap size is the correct fix, as the memory/socket leak still needs to be addressed. however, increasing it may help, but that would need some experimentation.
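for anyone wanting to run that experiment, a sketch of bumping node's heap via the z2jh chart (assumes proxy.chp.extraEnv is available in the chart version in use; this buys headroom, it doesn't fix the leak):

proxy:
  chp:
    extraEnv:
      NODE_OPTIONS: --max-old-space-size=2048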

@marcelofernandez

We finally replaced chp with traefik in our z2jh deployment, and this problem was clearly fixed. 😬

Check out that alternative just in case you are experiencing this.

@shaneknapp

shaneknapp commented Jun 7, 2024 via email

@consideRatio
Member

@marcelofernandez are you able to share config for your setup?

@shaneknapp

We finally replaced chp with traefik in our z2jh deployment, and this problem was clearly fixed. 😬

Check out that alternative just in case you are experiencing this.

echoing @consideRatio -- do you have any relevant traefik config bits you could share? this would be super useful! :)

thanks in advance...

@marcelofernandez

Hey guys, sure!

First and foremost, I'm sorry I can't give you all the details of my company's internal PR because:

  • I don't wanna go into any IP issues and (most importantly),
  • We're still using a very old version of z2jh, so I'm not sure how much of this is still relevant to the latest versions. In a perfect world I'd prepare a PR for z2jh without a hitch.

That said, I can give you an overview of what I did.

The complicated part was that it seemed like nobody had done this before, so I based my work on this (far more ambitious) previous, rejected PR, which originally aimed to replace both proxies:

  • HTTP -> HTTPS one (the TLS frontend terminator called autohttps), and
  • configurable-http-proxy, but:
    • Also making it HA-ready, supporting more than one instance of the proxy (making it more scalable), and
    • Creating a new service called Consul to store all the proxies' shared config, etc., which brought more complexity to the PR.

The only thing I did (because I only wanted stability) based on that PR was to:

  • Drop the configurable-http-proxy Pod, and
  • Replace it with just one container of Traefik inside the Hub Pod,
  • Using the JupyterHub Traefik Proxy component (running in the Hub container) to automatically configure the Traefik container.
  • Now, both containers (Hub + Traefik) run in the same Pod still called Hub.

Based on Z2JH's architecture diagram, here are the changes.

Before:
[diagram: Z2JH architecture with the separate configurable-http-proxy pod]

After:
[diagram: Z2JH architecture with Traefik running as a sidecar container in the hub pod]

Once I had defined what I wanted, I had to drop the unneeded code from the PR above, configure the hub to call the proxy in the same pod (http://localhost:8081), and that's it.
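To give a rough idea of the shape of the change (illustrative only, not the internal PR; assumes jupyterhub-traefik-proxy's file-provider entry point and the chart's hub.extraContainers hook, and omits the Traefik static/dynamic config plus the chart surgery that drops the proxy Deployment):

hub:
  config:
    JupyterHub:
      proxy_class: traefik_file   # entry point provided by jupyterhub-traefik-proxy
  extraContainers:
    - name: traefik               # sidecar in the hub pod; image tag illustrative
      image: traefik:v2.10
      # Traefik configuration and volume mounts omitted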

I implemented this about a year and a half ago; if you have more questions, just let me know...

Regards

@manics
Member

manics commented Aug 13, 2024

4.6.2 was released 2 months ago with a fix for the leaking sockets. Is there still a memory leak or can we close this issue?

@shaneknapp

@manics i don't think we should close this yet... we still saw chp run out of nodejs heap on hubs w/lots of traffic and usage even after we deployed 4.6.2, but since summer is slow it hasn't bitten us yet.

i'm sure that within a few weeks we'll see OOMs/socket leaks once the fall term ramps up.

@minrk
Member

minrk commented Aug 15, 2024

If anyone can make a stress test to provoke this, ideally with just CHP (or the JupyterHub Proxy API, like the traefik proxy benchmarks) I can test if the migration to http2-proxy will help. I tried a simple local test with a simple backend and apache-bench, but many millions of requests and hundreds of gigabytes later, I see no significant increase in memory or socket consumption (still sub-100MB). So there must be something relevant in typical use (websockets, connections dropped in a particular way, adding/removing routes, etc.) that a naïve benchmark doesn't trigger.
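For reference, the naive local test was roughly of this shape (a sketch; the exact backend, counts and flags may have differed):

npm install -g configurable-http-proxy
python3 -m http.server 9000 &                      # trivial backend
configurable-http-proxy --default-target=http://127.0.0.1:9000 &
ab -n 1000000 -c 100 http://127.0.0.1:8000/        # hammer CHP's proxy port

A real reproducer probably needs websockets, route add/remove churn, and abruptly dropped connections on top of this.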

@shaneknapp

shaneknapp commented Sep 12, 2024

If anyone can make a stress test to provoke this, ideally with just CHP (or the JupyterHub Proxy API, like the traefik proxy benchmarks) I can test if the migration to http2-proxy will help. I tried a simple local test with a simple backend and apache-bench, but many millions of requests and hundreds of gigabytes later, I see no significant increase in memory or socket consumption (still sub-100MB). So there must be something relevant in typical use (websockets, connections dropped in a particular way, adding/removing routes, etc.) that a naïve benchmark doesn't trigger.

re benchmarking... we really don't have the cycles, available staff or deep understanding of how the proxy works to do this.

re the something: we're seeing mildly improved performance w/4.6.2 but are still experiencing pretty regular, albeit much shorter (and self-recovering) outages at "peak"[1] usage.

[1] peak can be anywhere from ~200 up to ~800 users on a hub.

for example, last night between 845p and 9p, we had ~188 students logged on to datahub (the lowest end of 'peak') and saw the proxy peg at 100% CPU and 1.16G ram.

[screenshots: proxy CPU and memory graphs during the outage]

hub cpu hovered around ~40% until the outage, and during that 15m dropped to nearly 0%. hub memory usage was steady at around ~477M

only during the 15m of the outage (~845p - 9p) were our chp logs full of entries like this:
[screenshot: chp error log entries]

not surprisingly, the readiness probes couldn't find either the hub or proxy during this outage (and also the hub just after things recovered?):
[screenshot: readiness probe failures for the hub and proxy]

i'll dig more through the logs and see what needles i can winnow out of the haystacks.

@shaneknapp

shaneknapp commented Sep 12, 2024

running dmesg -T on the core node where this proxy was running indeed shows that chp is still getting OOMKilled:

[Wed Sep 11 21:48:22 2024] nginx invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=995
[Wed Sep 11 21:48:22 2024] CPU: 11 PID: 3160400 Comm: nginx Not tainted 6.1.90+ #1
[Wed Sep 11 21:48:22 2024] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024
[Wed Sep 11 21:48:22 2024] Call Trace:
[Wed Sep 11 21:48:22 2024]  <TASK>
[Wed Sep 11 21:48:22 2024]  dump_stack_lvl+0x4a/0x70
[Wed Sep 11 21:48:22 2024]  dump_header+0x52/0x250
[Wed Sep 11 21:48:22 2024]  oom_kill_process+0x10a/0x220
[Wed Sep 11 21:48:22 2024]  out_of_memory+0x3dc/0x5c0
[Wed Sep 11 21:48:22 2024]  ? mem_cgroup_iter+0x1c6/0x240
[Wed Sep 11 21:48:22 2024]  try_charge_memcg+0x827/0xa90
[Wed Sep 11 21:48:22 2024]  charge_memcg+0x3f/0x1f0
[Wed Sep 11 21:48:22 2024]  __mem_cgroup_charge+0x2b/0x80
[Wed Sep 11 21:48:22 2024]  handle_mm_fault+0xf80/0x16b0
[Wed Sep 11 21:48:22 2024]  do_user_addr_fault+0x271/0x4d0
[Wed Sep 11 21:48:22 2024]  exc_page_fault+0x78/0xf0
[Wed Sep 11 21:48:22 2024]  asm_exc_page_fault+0x22/0x30
[Wed Sep 11 21:48:22 2024] RIP: 0033:0x5cd7083be658
[Wed Sep 11 21:48:22 2024] Code: 10 e8 1c 5e 00 00 49 89 87 08 02 00 00 48 85 c0 0f 84 24 02 00 00 49 8b 8f e8 01 00 00 48 85 c9 74 20 31 d2 66 0f 1f 44 00 00 <80> 48 0a 01 49 8b 8f e8 01 00 00 48 83 c2 01 48 83 c0 60 48 39 ca
[Wed Sep 11 21:48:22 2024] RSP: 002b:00007fff48f73f70 EFLAGS: 00010283
[Wed Sep 11 21:48:22 2024] RAX: 00007ba90b837030 RBX: 00007ba90bad2050 RCX: 0000000000004000
[Wed Sep 11 21:48:22 2024] RDX: 000000000000112a RSI: 0000000000180000 RDI: 00005cd709921ea0
[Wed Sep 11 21:48:22 2024] RBP: 00007ba90a329380 R08: 00000000000000fe R09: 0000000000000000
[Wed Sep 11 21:48:22 2024] R10: 00007ba90b950ffc R11: 0000000000000006 R12: 00005cd70856a530
[Wed Sep 11 21:48:22 2024] R13: 00005cd7084eb121 R14: 00005cd7084eb12e R15: 00007ba90a329380
[Wed Sep 11 21:48:22 2024]  </TASK>
[Wed Sep 11 21:48:22 2024] memory: usage 2097168kB, limit 2097152kB, failcnt 806
[Wed Sep 11 21:48:22 2024] swap: usage 0kB, limit 9007199254740988kB, failcnt 0

this repeated regularly for about 30m, and then for another 30m we saw lots of messages like this:

[Wed Sep 11 22:38:28 2024] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podc982fb8d_dd48_4bff_a016_38cda0c5767a.slice/cri-containerd-58fd1ffa87a46e002e02ede50e42cf51b21de27dc6ffccf8f22032be3bcc2f80.scope are going to be killed due to memory.oom.group set
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3223198 (dumb-init) total-vm:220kB, anon-rss:8kB, file-rss:0kB, shmem-rss:0kB, UID:101 pgtables:24kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3223258 (nginx) total-vm:155116kB, anon-rss:37304kB, file-rss:7056kB, shmem-rss:756kB, UID:101 pgtables:188kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3225302 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3226667 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227348 (nginx) total-vm:171012kB, anon-rss:48508kB, file-rss:5728kB, shmem-rss:2528kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227349 (nginx) total-vm:172136kB, anon-rss:49644kB, file-rss:5728kB, shmem-rss:2492kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227350 (nginx) total-vm:172080kB, anon-rss:49560kB, file-rss:5728kB, shmem-rss:2544kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227351 (nginx) total-vm:171792kB, anon-rss:49336kB, file-rss:5728kB, shmem-rss:2504kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227382 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227423 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227567 (nginx) total-vm:171948kB, anon-rss:49412kB, file-rss:5728kB, shmem-rss:2520kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227592 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227680 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227713 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227746 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3228024 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3228067 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3228303 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3230430 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3230824 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3230826 (nginx) total-vm:170948kB, anon-rss:48460kB, file-rss:5728kB, shmem-rss:4652kB, UID:101 pgtables:216kB oom_score_adj:995
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231122 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231172 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231288 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231347 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231814 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231921 (nginx) is already exiting. Skip killing the task

during all of this, the chp logs are full of thousands of 503 GET or 503 POST errors for active users attempting to get work done. :\

so: something is still amiss. chp is running out of heap space. there's absolutely a memory/socket leak remaining somewhere.

@consideRatio
Member

consideRatio commented Sep 12, 2024

We have discussed two kinds of memory, network TCP memory and normal RAM memory - getting OOM-killed on normal memory is a consequence of surpassing the pod's memory limit in k8s.

This is normal memory killing I think, so what memory request/limit is configured for the proxy pod?

Note that i think the graph you have in grafana may represent an average combination of pods if you have multiple proxy pods in the k8s cluster, so then you could see memory usage below the requested amount even though an individual pod goes above it. I recall an issue opened about this... Found it: jupyterhub/grafana-dashboards#128

Is normal memory still growing without bound over time as users come and go, with chp getting memory killed for that reason, making requesting more memory just a matter of gaining time before a crash?

@shaneknapp

We have discussed two kinds of memory, network TCP memory and normal RAM memory - getting OOM-killed on normal memory is a consequence of surpassing the pod's memory limit in k8s.

This is normal memory killing I think, so what memory request/limit is configured for the proxy pod?

      resources:
        requests:
          cpu: 0.001
          memory: 64Mi
        limits:
          memory: 1.5Gi

Note that i think the graph you have in grafana may represent an average combination of pods if you have multiple proxy pods in the k8s cluster, so then you could see memory usage below the requested amount even though an individual pod goes above it. I recall an issue opened about this... Found it: jupyterhub/grafana-dashboards#128

nope -- this is only one deployment, not a sum of them all.

Is normal memory still growing without bound over time as users come and go, with chp getting memory killed for that reason, making requesting more memory just a matter of gaining time before a crash?

it seems to grow over time, and as it surpasses the Node.js max heap, cpu will inevitably ramp up before things start getting OOMKilled.

@shaneknapp

We have discussed two kinds of memory, network TCP memory and normal RAM memory - getting OOM-killed on normal memory is a consequence of surpassing the pod's memory limit in k8s.
This is normal memory killing I think, so what memory request/limit is configured for the proxy pod?

      resources:
        requests:
          cpu: 0.001
          memory: 64Mi
        limits:
          memory: 1.5Gi

fwiw i'm about to bump this to 3Gi "just to see".

@shaneknapp

this just happened again and it really looks like we're running out of ephemeral ports...

@shaneknapp

we're actually thinking about putting our chp pods in their own pool, one proxy per (very small) node to get past this until a fix or better solution comes along.

@shaneknapp

currently on the impacted proxy node:

/srv/configurable-http-proxy $ netstat -natp|wc -l
24536
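for comparison, the ephemeral range configured in that same network namespace (a sketch, run from the same shell):

cat /proc/sys/net/ipv4/ip_local_port_range    # configured ephemeral port range
netstat -nat | grep -c ESTABLISHED            # how many connections are actually established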

@shaneknapp

hmm, well, this sheds some light on things: #465

some of our hubs have > 1000 users.

@felder

felder commented Sep 13, 2024

Noticed that when this behavior happens at berkeley we see accompanying logs indicating EADDRNOTAVAIL and tons of connection failures to the hub pod.

Really looks like the issue described here:
https://blog.cloudflare.com/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/

https://devops-insider.mygraphql.com/zh-cn/latest/network/tcp/ephemeral-port/ephemeral-port-exhaustion-and-how-to-avoid-it.html

Running netstat -natp on the chp proxy indicates it’s very possible we have enough connections to be running out of ephemeral ports.

Would likely explain the sudden increase in cpu as well, because I think the proxy rapidly retries when this behavior starts.

@shaneknapp

shaneknapp commented Sep 18, 2024

for those continuing to be held in rapture by this enthralling story, i have some relatively useful updates!

  1. for large deployments, chp needs at least 2.5Gi of ram. we're seeing ram usage up to nearly 2Gi when many users are logged in and working. we have 3Gi allocated for this, and many users == ~300+
  2. after meeting w/@minrk and @consideRatio and discussing what was going on, we decided to try setting the chp timeout and proxy-timeout to 5 seconds (see the config sketch after this list). this gave us a huge improvement in reliability, and while this issue is still impacting us, we have a little bit more breathing room: https://github.com/berkeley-dsep-infra/datahub/blob/staging/hub/values.yaml#L47
  3. we are definitely, 100% running out of ephemeral ports in the chp pod for hubs that are using jupyterlab/notebooks/etc.
  4. we are definitely, 100% NOT running out of ephemeral ports on hubs w/similar usage patterns that are deployed with RStudio, VSCode or anything using a proxy like jupyter-rsession-proxy and jupyter-vscode-proxy.
  5. during peak usage times, the ephemeral ports used will increase significantly (25k+) and drop down to ~2k+ after peak. at least once or twice a day we'll hit the 28k limit, and the best/only way to stop a significant outage is to kill/restart the hub and proxy pods. at least we now have the time to catch and kill before an outage w/the timeouts set to 5 seconds. this is a big win, but still sub-optimal.
  6. if we don't intervene and chp gets OOMKilled, this will lead to a short outage of ~10-30 minutes before things auto-recover. users will see 503/service unavailable errors, as chp is unable to route them to the hub and user pods.
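a sketch of the timeouts from point 2, expressed as z2jh values (assumes proxy.chp.extraCommandLineFlags is available; CHP takes these in milliseconds):

proxy:
  chp:
    extraCommandLineFlags:
      - --timeout=5000
      - --proxy-timeout=5000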

point 4 is important... and makes me wonder if something in lab or notebook has buggy socket handling code.

@felder

felder commented Sep 18, 2024

A big difference between hubs that use one of the mentioned proxies and those that don't is that instead of a majority of the connections going from chp -> hub:8081, the connections go to the user pods directly. Since the user pods are spread out across a subnet, the connections aren't all focused on a single ip:port destination, so there is no issue with ephemeral port exhaustion.

@shaneknapp

A big difference between hubs that use one of the mentioned proxies and those that don't is that instead of a majority of the connections going from chp -> hub:8081, the connections go to the user pods directly. Since the user pods are spread out across a subnet, the connections aren't all focused on a single ip:port destination, so there is no issue with ephemeral port exhaustion.

yep, thanks for clarifying @felder !

@consideRatio
Member

consideRatio commented Sep 18, 2024

About where connections go etc, i expect:

  • ingress controller -> chp + chp -> user pods (incoming connections for browser -> jupyter server traffic)
  • user pods -> chp + chp -> hub (reporting user activity, server checking auth, maybe more?)
  • user pods -> internet (jupyter server making internet request via node's public ip without chp involved)

If you could figure out something more about where the traffic that amounts to huge numbers of requests is going, that may allow us to tune lab/hub etc; for example, how often the user server reports activity to the hub can be reduced, I think. I also think lab checks connectivity with hub regularly.
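A sketch of one such knob, reducing how often user servers report activity to the hub (assumes the JUPYTERHUB_ACTIVITY_INTERVAL environment variable, in seconds with a default of 300, is honored by the singleuser servers; worth verifying for your versions):

singleuser:
  extraEnv:
    JUPYTERHUB_ACTIVITY_INTERVAL: "600"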

@felder

felder commented Sep 18, 2024

@consideRatio Do you have ideas for determining more information regarding these connections? Currently we're just looking at reporting from netstat and seeing the ip:port pairs for source and destination on the connections.

Here's sample output for what we're seeing with netstat on the chp pod, which unfortunately is not particularly helpful with regard to getting more specifics.

$ netstat -natp | grep ESTABLISHED | more
tcp        0      0 10.28.7.157:52072       10.28.11.251:8888       ESTABLISHED 1/node
tcp        0      0 10.28.7.157:36022       10.28.30.220:8888       ESTABLISHED 1/node
tcp        0      0 10.28.7.157:36018       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:48910       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:33400       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:50372       10.28.5.202:8888        ESTABLISHED 1/node
tcp        0      0 10.28.7.157:38690       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:43694       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:55028       hubip:8081     ESTABLISHED 1/node
tcp        0      0 10.28.7.157:40936       hubip:8081     ESTABLISHED 1/node
...
...
...

In this case 10.28.7.157 is the chp pod ip, I replaced the actual ip for the hub with "hubip" and we can see a few connections going to user pods on port 8888.

What I do not know at this time is how to figure out why the connections were opened in the first place, or what user/process they are associated with. We're just getting started on this line of investigation (we only discovered the ephemeral port exhaustion late last week), and while I would love to give you the information you're requesting, I don't know how to obtain it.

Also, out of curiosity do you see a similar ratio as we do on your deployments with regard to ephemeral port usage?

We're seeing this issue primarily on hubs with > 200 users, which would suggest roughly ~100 connections per user from chp -> hub:8081, but again I have no way at this time of associating the ephemeral ports with anything meaningful, so for all I know it could be some activity on the part of a subset of users or processes.
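A rough per-destination breakdown of the same netstat data at least separates hub traffic from user-pod traffic (a sketch):

netstat -nat | awk '/ESTABLISHED/ {split($5, d, ":"); print d[1]}' | sort | uniq -c | sort -rn | head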

@benz0li

benz0li commented Sep 19, 2024

3. we are definitely, 100% running out of ephemeral ports in the chp pod for hubs that are using jupyterlab/notebooks/etc.

@shaneknapp What image(s) do you use?

  1. Jupyter Docker Stacks?
  2. Self-built?
    • Based on Jupyter Docker Stacks: What do you add?
    • From scratch: Dockerfile(s)?

@benz0li

benz0li commented Sep 19, 2024

@shaneknapp If ephemeral port exhaustion is not caused by chp itself, this discussion should be moved to another issue.

@snickell Could you test if there is still a memory leak with v4.6.2?

IMHO these are separate issues.

@shaneknapp

  1. we are definitely, 100% running out of ephemeral ports in the chp pod for hubs that are using jupyterlab/notebooks/etc.

@shaneknapp What image(s) do you use?

  1. Jupyter Docker Stacks (https://github.com/jupyter/docker-stacks)?
  2. Self-built?
    • Based on Jupyter Docker Stacks: What do you add?
    • From scratch: Dockerfile(s)?

all of our images are built from scratch... some are dockerfile-based, most are pure repo2docker, and everything is built w/r2d. of the three hubs most impacted by this, one has a complex Dockerfile-based build, and the other two are very straightforward python repo2docker builds.

@shaneknapp If ephemeral port exhaustion is not caused by chp itself, this discussion should be moved to another issue.

well, until literally late last thursday we had no idea what was going on, and we still don't know what is causing the port spam yet. if you have any suggestions i'd be happy to move the conversation there.

perhaps #434 ?

@snickell Could you test if there is still a memory leak with v4.6.2?

IMHO these are separate issues.

@benz0li

benz0li commented Sep 19, 2024

perhaps #434 ?

No.

@shaneknapp

shaneknapp commented Sep 19, 2024

perhaps #434 ?

No.

until we figure out a better home for this issue, i will continue to update our findings here. :)

anyways...

  1. we had to revert the --timeout=5000 and --proxy-timeout=5000 changes as this broke PDF exporting (i know, right!?!). we're considering setting these to be 5 or 10m long.
  2. yesterday we increased the number of ephemeral ports and this seems to give us enough headroom for the impacted chps to continue working... this feels much more like a bandaid than a solution to me, but for now it seems to be working pretty well. we will continue to monitor and intervene as needed to avert outages.
spec:
  template:
    spec:
      initContainers:
      - command:
        - sysctl
        - -w
        - net.ipv4.ip_local_port_range=10000 65000
        image: busybox
        imagePullPolicy: IfNotPresent
        name: init-sysctl
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File

deployed via kubectl -n <namespace> patch deployment proxy --patch-file chp-pod-deployment-patch.yaml

this gives us 55000 (65000 - 10000) ephemeral ports.

@consideRatio
Member

I figure we have the following kinds of issues; I opened #557 to represent the third kind - let's switch to that!

  1. Memory leak
    For observations about normal RAM memory usage increasing over time without being reclaimed, making it grow indefinitely.
  2. Socket leak
    For observations about sockets increasing over time without being reclaimed, making the socket count grow indefinitely.
    Related to this are probably any networking-related memory issues, as unbounded growth of sockets could go hand in hand with using more and more of that memory.
  3. Running low on ephemeral ports
    I think this is separate from the other issues, and perhaps also isn't a bug in CHP but, for example, in software spamming connections.
