Too many open file descriptors #190

Closed · Icemole opened this issue May 23, 2024 · 5 comments · Fixed by #191
Comments

Icemole (Collaborator) commented May 23, 2024

Hi, I was using sisyphus today for a big recipe and got an error in my worker complaining about too many open files:

OSError: [Errno 24] Unable to synchronously open file (unable to open file: name = <filename>, errno = 24, error message = 'Too many open files', flags = 0, o_flags = 0)

Not only did the worker crash, but the manager also crashed with this error. Moreover, the terminal (tmux pane) entered a state where every character I typed was converted into a newline, so I couldn't type any command and ended up killing the tmux pane and recreating it.

I started investigating what was happening and wrote a really small test:

# test.py: minimal recipe; sisyphus uses py() as the entry point, and here it does nothing

def py():
    pass

I found that running the sisyphus manager on this test (bare, without any settings.py) opened ~2.7k files: from my baseline of 524 open files to 3254 open files after running sis m test.py, according to lsof | grep <user> | wc.
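For reference, the count for a single process can also be read from /proc instead of lsof; a minimal sketch (the helper is only for illustration):

import os

def count_open_fds(pid: int) -> int:
    # Each entry in /proc/<pid>/fd is a symlink for one open descriptor (Linux only).
    return len(os.listdir(f"/proc/{pid}/fd"))

print(count_open_fds(os.getpid()))  # e.g. run inside the manager process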

Besides that, every job triggered by the manager added exactly 105 files to the list of open file descriptors. However, I can't reproduce this starting from scratch, which leads me to think it might be a problem with how our code base interacts with sisyphus (or with our code base alone). I'll keep investigating and keep you posted.

Is sisyphus opening this many files on purpose, e.g. because of some caching strategy or related mechanism? Has this ever been addressed before?

If you need more details, I'll be glad to provide them. Thank you in advance.

Icemole (Collaborator, Author) commented May 31, 2024

I had this happen again. It was with a relatively big setup, but I'm not sure yet what causes the issue, since my manager shouldn't be opening many files, if any. Please find the corresponding stack trace from the manager attached here.

Note that the trailing newlines in the stack trace are relevant: they are me trying to type anything at all, with every character turning into a newline, as described before:

Moreover, the terminal (tmux pane) entered a state where every character I typed was converted into a newline character. As a result, I couldn't type any instruction, so I ended up killing the tmux pane and recreating it.

I think this could be an interaction with the manager prompting me for my SSH password many times while I was away from the tmux session, and then crashing, leaving the terminal in an unstable state (i.e. with echo still disabled, as it is while typing an SSH password).

Icemole (Collaborator, Author) commented May 31, 2024

Analyzing the stack trace, I found that the two issues (too many open files and the SSH password prompt) could be related. After each password prompt, sisyphus seems to run a subprocess with the squeue command (I'm running on SLURM, but this also used to happen on SGE, so it should be cluster-independent). Right now I'm running my setup with the gateway="..." option in settings.py, but I recall it also happened without that option.

This happens every 30 seconds, which is the interval at which my sisyphus is configured to scan the queue. With an open file cap of 1024 in the manager (assuming sisyphus doesn't open any other files and none are open to begin with), the time needed to reach the cap would be 1024 * 30 = 30720 seconds ≈ 8.5 hours. In practice the cap is reached sooner, because the manager has other files open, but it matches how long I left the ssh/tmux session unattended (evening/night).
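In code, the estimate is simply (assuming a single leaked descriptor per scan and nothing else open):

fd_cap = 1024          # open-file limit of the manager process
scan_interval_s = 30   # queue scan interval
seconds_to_cap = fd_cap * scan_interval_s
print(seconds_to_cap, seconds_to_cap / 3600)  # 30720 s, ~8.5 h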

I'll try to solve it on my end, but I think it would also make sense to fix it in sisyphus. How can we tell sisyphus to wait before launching a new queue scan? Maybe by storing the last queue command issued and setting it to None after completion?

Edit: a pretty easy solution would probably be to set the timeout of the corresponding SSH queue-scan command to the number of seconds after which sisyphus issues the next queue command.

albertz (Member) commented May 31, 2024

What are the open files? (I think you should be able to see that via /proc.)

Are there uncleaned-up zombie procs, or sub procs that are actually still alive?

Icemole (Collaborator, Author) commented May 31, 2024

Thanks for the pointer. These seem to be anonymous pipes, as expected from the corresponding chunk of code.

$ ls -lha /proc/25427/fd/
total 0
dr-x------ 2 nbeneitez domain_users  0 May 31 04:46 .
dr-xr-xr-x 9 nbeneitez domain_users  0 May 29 14:22 ..
lr-x------ 1 nbeneitez domain_users 64 May 31 04:46 0 -> 'pipe:[1164667388]'
l-wx------ 1 nbeneitez domain_users 64 May 31 04:46 1 -> 'pipe:[1164667389]'
l-wx------ 1 nbeneitez domain_users 64 May 31 04:46 2 -> 'pipe:[1164667390]'
lrwx------ 1 nbeneitez domain_users 64 May 31 04:46 3 -> 'socket:[1164925030]'
lrwx------ 1 nbeneitez domain_users 64 May 31 04:46 4 -> /dev/tty

So every 30 seconds, 4 files (stdin/stdout/stderr pipes plus a socket for the SSH session, I assume) were being opened but never closed. Combine this with several managers running on the same machine and you reach the cap much faster.
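Plugging that back into the earlier estimate (4 leaked descriptors per 30-second scan, 1024 fd cap):

leaked_per_scan = 4
scan_interval_s = 30
fd_cap = 1024
print(fd_cap / leaked_per_scan * scan_interval_s / 3600)  # ~2.1 hours per manager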

albertz (Member) commented May 31, 2024

The actual issue is that the subprocess.Popen is never properly cleaned up, i.e. the subproc is still alive (you can check that as well). So the fix would be to properly kill it when the timeout is triggered; then those pipes also get closed. But as discussed in #191, it is simpler to just use subprocess.run, which itself kills the proc in case of a timeout.
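Roughly, the difference looks like this (just a sketch, not the actual sisyphus code; cmd stands in for the real ssh/squeue invocation):

import subprocess

cmd = ["ssh", "gateway", "squeue"]  # placeholder command

# With Popen, the caller has to clean up on timeout itself,
# otherwise the child and its pipes stay around:
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    out, err = proc.communicate(timeout=30)
except subprocess.TimeoutExpired:
    proc.kill()                      # without this, the pipes are leaked
    out, err = proc.communicate()    # reap the child after killing it

# With subprocess.run, the child is killed and waited for automatically
# when the timeout expires:
try:
    result = subprocess.run(cmd, capture_output=True, timeout=30)
except subprocess.TimeoutExpired:
    pass  # nothing is left open here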
