
Clear out /dev/shm when things go wrong #378

Open
IanHeywood opened this issue May 15, 2020 · 8 comments

@IanHeywood
Collaborator

My DDFacet job just failed on an IDIA node with /dev/shm issues. Nothing surprising whatsoever about that, but after the job had failed I noticed:

ianh@slwrk-301:~$ du -hs /dev/shm/
71G	/dev/shm/

which turns out to be from a (probably failed) CubiCal run on May 11:

ianh@slwrk-301:~$ ls -l /dev/shm/
total 0
drwxr-xr-x 5 [redacted] idia-group 100 May 11 16:58 cubical.32097

That's hogging a fair amount of real estate there. Some suggestions for good citizenship, in order of decreasing hassle for users (increasing hassle for devs):

  1. Users remember to log in and clean up their mess after a crash.

  2. Every time I invoke DDFacet in a script, I run CleanSHM.py immediately afterwards (a sketch of this pattern follows the list). Pipeline people could consider implementing something similar. I guess this script could also be adapted to handle the CubiCal output in /dev/shm.

  3. Develop some kind of Lazarus the Janitor feature for CubiCal, where it comes back to life just long enough to tidy up after it's been killed.
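
For concreteness, here is a minimal sketch of suggestion 2: a wrapper that runs the janitor after a DDFacet call even if it crashes. It assumes DDF.py and CleanSHM.py (both shipped with DDFacet) are on the PATH; the parset name is just a placeholder, not part of any existing pipeline.

```python
# Sketch of suggestion 2: always run CleanSHM.py after a DDFacet call, even on failure.
# Assumes DDF.py and CleanSHM.py are on the PATH; the parset name is a placeholder.
import subprocess

def run_ddfacet_with_cleanup(parset):
    try:
        subprocess.run(["DDF.py", parset], check=True)
    finally:
        # Janitor pass: wipe stale shared-memory segments regardless of exit status.
        subprocess.run(["CleanSHM.py"], check=False)

if __name__ == "__main__":
    run_ddfacet_with_cleanup("my_image.parset")
```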

I think this is important, especially when testing out pipelines on new systems. A build-up of this junk might make things fail when they otherwise wouldn't.

Cheers.

@SpheMakh
Collaborator

@o-smirnov we should do this for the cubical and ddfacet cabs!

@o-smirnov
Collaborator

Great idea. But CleanSHM.py per se is too brute force: I think it will nuke any other DDF jobs you may have running on that box. That's OK when you do it manually, but it's not suitable for automatic use in a pipeline.

Rather do this for DDF:

python -c "from DDFacet.Other import Multiprocessing; Multiprocessing.cleanupStaleShm()"

And this for CubiCal:

python -c "from cubical.tools import shm_utils; shm_utils.cleanupStaleShm()"

@SpheMakh
Collaborator

Cool, I'll add these commands after the cubical and ddfacet runs.

@IanHeywood
Collaborator Author

Great idea. But CleanSHM.py per se is too brute force: I think it will nuke any other DDF jobs you may have running on that box. That's OK when you do it manually, but it's not suitable for automatic use in a pipeline.

Whoops! I'll steal your method instead.

@SpheMakh
Collaborator

Hehe, have you been mistakenly nuking people's DDF jobs @IanHeywood?

@IanHeywood
Collaborator Author

Well, I book entire nodes for DDFacet, so if I have, it's what they get for sneaking around trying to jump the queue.

@o-smirnov
Collaborator

Sadly, filesystem permissions won't let you do that to others. But it's a great way to shoot yourself in the foot!

@bennahugo
Collaborator

bennahugo commented Aug 19, 2021

Yup... let's please have an atexit handler to, at minimum, handle a graceful SIGINT - I'm seeing a lot of stuff left in shared memory. The gridder already has an atexit handler, so it must be something in the visibility machine...
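
Something along these lines, perhaps (a sketch of the requested hook, not existing CubiCal code; it only assumes the shm_utils call quoted earlier in the thread):

```python
# Sketch of an atexit + SIGINT hook that tidies /dev/shm when a run is interrupted.
# Not existing CubiCal code; uses the cleanup call quoted earlier in this thread.
import atexit
import signal
import sys

from cubical.tools import shm_utils

# Run the stale-shm cleanup on any normal interpreter exit.
atexit.register(shm_utils.cleanupStaleShm)

def _handle_sigint(signum, frame):
    # Exit via sys.exit() so the atexit handlers (and the cleanup) still run.
    sys.exit(1)

signal.signal(signal.SIGINT, _handle_sigint)
```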
