
Intermittent OOM issues for Girder #489

Closed · arjunrajlab opened this issue Aug 7, 2023 · 10 comments

@arjunrajlab (Collaborator)

We get out-of-memory issues that seem to kill Girder intermittently:

[screenshot]

The issue seems to occur when multiple people are using the server, especially with somewhat larger files, but it is intermittent.

@arjunrajlab added the bug label on Aug 7, 2023
@arjunrajlab added this to the Alpha-Version milestone on Aug 7, 2023
@arjunrajlab (Collaborator, Author)

@manthey I found a file that reproduces the issue! If you load this file:

https://www.dropbox.com/scl/fi/xduq3y1buv6h8acu2rxbo/20230807_133502_470_LVE016-HKC-1dpi-HSV-1-WT-ICP4_GFP-MX1_Cy3-IFNb_A594-IRF3_Cy5-DDX58_Cy7_Stitched.nd2?rlkey=kf1vjnyl4nefq2bn2cu37env6&dl=0

then just scrub rapidly up and down in Z before the tile cache builds. That seems to cause the issue.

@arjunrajlab (Collaborator, Author)

Here's what I found in the syslog:

arjun@raj:/var/log$ grep -E 'OOM|kill' syslog
Aug  8 12:46:26 raj kernel: [8295223.437605] girder invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
Aug  8 12:46:26 raj kernel: [8295223.437630]  oom_kill_process.cold+0xb/0x10
Aug  8 12:46:26 raj kernel: [8295223.437839] [   7384]  1000  7384   116627      164   118784       62             0 gsd-rfkill
Aug  8 12:46:26 raj kernel: [8295223.438051] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-d22742cef66d743d213f48000505f4f5af9277e1b75911e06e8b3d5d137ff7a0.scope,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-d22742cef66d743d213f48000505f4f5af9277e1b75911e06e8b3d5d137ff7a0.scope,task=girder,pid=1641364,uid=0
Aug  8 12:46:26 raj systemd[1]: docker-d22742cef66d743d213f48000505f4f5af9277e1b75911e06e8b3d5d137ff7a0.scope: A process of this unit has been killed by the OOM killer.
arjun@raj:/var/log$ 

@arjunrajlab (Collaborator, Author)

(Reproduced on two separate machines.)

@manthey (Collaborator) commented Aug 11, 2023

There are several things going on here. I'll break them out into individual issues.

  • Immediately after upload, we scheduled (a) a maxmerge version of the data and (b) the tile_frames sprites for rapid scrubbing. We would then create a multi-source file and, optionally, transcode to tiff. The multi-source version isn't technically the same as the original file, even if it were a direct passthrough. This means we were largely duplicating the maxmerge/tile_frames work, which increased memory, I/O, and CPU usage. See "Change when we schedule caching tile_frames and maxmerge" #490 for a fix.
  • We do both the maxmerge and the tile_frames caching concurrently. We should do these sequentially to avoid overloading I/O, memory, and CPU.
  • We should check whether, when the UI asks for the tile_frames, we ever duplicate work that is already scheduled via the cacher.
  • We make a number of concurrent requests based on the number of channels. A browser will only allow 6 concurrent requests before queuing the rest. We have a single queue for tiles; we should also use this queue for tile_frames and histogram requests to reduce query saturation.
  • We could precache histograms, if we aren't already doing so, to improve responsiveness, but only if they are queued properly. I thought we were doing this; if not, we should.
  • We saturate the browser queue with pixel requests. These are fired one per layer, debounced based on time. We could make a single-call endpoint and debounce based on promise status rather than time.
  • Many of the browser requests start a separate thread to respond and temporarily load parts of the file into memory; before caching, this ends up with multiple requests all holding similar data. We can reduce memory load through better queuing and better server locks when handling data (see the sketch after this list).
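
Regarding that last point, here is a minimal sketch of per-key locking ("request coalescing"): when several browser requests ask for the same not-yet-cached data, only one thread loads it and the rest wait for the cached result. This is a generic illustration, not the actual Girder/large_image code; the `CoalescingCache` class and `loader` callback are hypothetical.

```python
# Minimal sketch of per-key locking ("request coalescing"), not the actual
# Girder/large_image code. When several requests ask for the same uncached
# data, only one thread runs the expensive load; the rest wait for the result
# instead of each holding a copy of the data in memory.
import threading
from collections import defaultdict

class CoalescingCache:
    def __init__(self, loader):
        self._loader = loader                  # hypothetical expensive load, e.g. read a frame region
        self._cache = {}
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()

    def get(self, key):
        with self._guard:
            if key in self._cache:
                return self._cache[key]
            lock = self._locks[key]            # one lock per key
        with lock:                             # only one loader per key at a time
            with self._guard:
                if key in self._cache:         # another thread finished while we waited
                    return self._cache[key]
            value = self._loader(key)          # expensive: loads part of the file into memory
            with self._guard:
                self._cache[key] = value
            return value
```

In the real server the key would identify a frame or region and the loader would be the expensive file read; the point is that N simultaneous requests cost one read instead of N.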

@manthey (Collaborator) commented Aug 11, 2023

This particular test file exposes a bunch of the problems because each frame is larger than 2k x 2k, the file is not optimally chunked internally, AND it has 6 channels.
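
If anyone wants to verify those characteristics locally, here is a quick sketch using the large_image Python package (assuming large_image and its nd2 tile source are installed and the file has been downloaded; the path below is a placeholder):

```python
# Inspect per-frame size, tiling, and channel count of a downloaded copy of the
# test file. Assumes the large_image package with an nd2-capable source.
import large_image

source = large_image.getTileSource('/path/to/stitched.nd2')
meta = source.getMetadata()
print(meta['sizeX'], meta['sizeY'])            # per-frame pixel dimensions
print(meta['tileWidth'], meta['tileHeight'])   # how the source is served in tiles
print(len(meta.get('frames', [])))             # total frames (Z planes x channels x ...)
print(meta.get('channels'))                    # channel names, if reported
```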

@arjunrajlab (Collaborator, Author)

@manthey could you document here the progress to date and what remains? I think the histogram pre-cache and pixel value requests are important next steps, but I'm not sure.

@manthey (Collaborator) commented Aug 31, 2023

I discovered a bug (the fix is girder/large_image#1283) where, depending on what we did, we could prematurely close nd2 file handles. This could be forced by asking for a bad style, but could also happen under other conditions. If we saw segfaults in syslog that weren't OOM kills, this could have been the cause. The proximate symptom was a numpy operation attempting to access memory it wasn't allowed to.
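
For anyone reading along, a generic illustration of the hazard (this is not large_image's actual handle management; the `SharedHandle` wrapper and `open_fn` callback are made up for the example): if one code path closes a handle while another thread is still reading through it, the reader can fault, so a simple guard is to reference-count the handle and defer the close until no reader remains.

```python
# Generic sketch, not large_image's actual code: reference-count a shared file
# handle so a close requested by one code path is deferred until no reader is
# still using it.
import threading

class SharedHandle:
    def __init__(self, open_fn):
        self._handle = open_fn()          # open_fn returns an object with .close()
        self._refs = 0
        self._close_requested = False
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            if self._handle is None:
                raise RuntimeError('handle already closed')
            self._refs += 1
            return self._handle

    def release(self):
        with self._lock:
            self._refs -= 1
            if self._close_requested and self._refs == 0:
                self._handle.close()
                self._handle = None

    def close(self):
        with self._lock:
            self._close_requested = True
            if self._refs == 0 and self._handle is not None:
                self._handle.close()
                self._handle = None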

@manthey (Collaborator) commented Aug 31, 2023

With the various changes and follow-up issues, I recommend we close this issue; if we see Girder crash again, we can create a new issue with the syslog entry for the crash (whether an OOM kill or a segfault).

@manthey (Collaborator) commented Aug 31, 2023

For reference, this has generated issue #502 for precaching histograms and #503 for making fewer pixel requests, plus a variety of PRs in large_image to address memory and stability.

@arjunrajlab (Collaborator, Author)

Excellent! Yes, I think we can close the issue for now. We were still having some OOM stability issues when multiple users were accessing the server, but we also just increased the server's memory to 128GB, and everything seems fine so far.
