Too many open file descriptors #190
I had this happen again. It was with a relatively big setup, but I'm not sure what causes the issue yet, since my manager shouldn't be opening many files, if any. Please find attached the corresponding stack trace from the manager. Note that the last newlines in the stack trace are relevant: they correspond to me trying to type anything at all, with every character turning into a newline:
I think this could be an interaction with the manager prompting me for my SSH password many times because I had left the tmux session, and then crashing, thus leaving the prompt in an unstable state (i.e. whatever you type is invisible, as when entering your SSH key password).
Analyzing the stack trace, I found that both issues (too many open files, and the SSH key prompt) could be related. What sisyphus seems to be doing after each password prompt is running a subprocess with the queue command. This happens every 30 seconds, which is the interval at which my sisyphus is configured to scan the queue. With an open-file cap of 1024 in the manager (and assuming sisyphus doesn't open any other files and none are open to begin with), the time needed to reach the cap would be 1024 * 30 s = 30,720 s ≈ 8.5 hours. In practice the cap is reached sooner because the manager has other files open, but this matches the length of time for which I abandoned the ssh/tmux session (evening/night).

I'll try to solve it on my end, but I think it could also make sense to fix it in sisyphus. How can we tell sisyphus to wait before running a new queue scan job? Maybe by storing the last queue command issued and setting a timeout on it?

Edit: a pretty easy solution would probably be to set the timeout of the corresponding SSH queue-scan command to the number of seconds after which sisyphus runs a new queue command.
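As a concrete sketch of that last suggestion (hypothetical code, not sisyphus's actual implementation): bound each queue scan by the scan interval, so that a scan stuck at an SSH password prompt is killed and its pipes closed before the next scan starts.

```python
# Hypothetical sketch, not sisyphus's actual code path.
import subprocess

SCAN_INTERVAL = 30  # seconds between queue scans, as configured in my setup


def scan_queue(cmd):
    try:
        # subprocess.run() waits for the child and closes its pipes on return,
        # so no file descriptors are left behind.
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=SCAN_INTERVAL)
        return result.stdout
    except subprocess.TimeoutExpired:
        # A hung command (e.g. ssh waiting for a password) is terminated
        # instead of accumulating open descriptors on every scan.
        return None
```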
What are the open files? (I think you should be able to see that via …) Are there uncleaned zombie procs? Or actual still-alive sub-procs?
Thanks for the pointer. These seem to be anonymous pipes, as expected from the chunk of code.
So every 30 seconds there were 4 files (stdin/out/err + a socket for the SSH session, I assume) being opened but never closed. Combine this with a few managers running on the same machine and you reach the cap much faster.
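To make the pattern explicit (illustrative code only, not the actual sisyphus source): a `subprocess.Popen` with piped stdin/stdout/stderr allocates anonymous pipes, and if the `Popen` objects are kept around without being closed or reaped, every 30-second scan leaves descriptors behind.

```python
# Illustrative only -- not the actual sisyphus code.
import subprocess

_procs = []  # references kept (e.g. for later status checks), never cleaned up


def leaky_scan(cmd):
    # Allocates three anonymous pipes per call; since the Popen object stays
    # referenced and is never closed, each scan leaves descriptors behind.
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    _procs.append(proc)
    return proc.stdout.read()


def clean_scan(cmd):
    # The context manager closes stdin/stdout/stderr and reaps the child,
    # so the descriptor count stays flat across scans.
    with subprocess.Popen(cmd, stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc:
        out, _ = proc.communicate()
    return out
```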
The actual issue is that the …
Hi, I was using sisyphus today for a big recipe and I got an error in my worker which claimed `too many open files`:

However, not only had the worker crashed, but the manager crashed with this error as well. Moreover, the terminal (tmux pane) entered a state where every character I typed was converted into a newline. As a result, I couldn't type any instruction, so I ended up killing the tmux pane and recreating it.
I started investigating what was happening and put together a really small test:
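Roughly, by a really small test I mean a bare config of this shape (hypothetical names; assuming the usual `Job`/`Task`/`tk.register_output` pattern and the default `main()` config entry point):

```python
# Hypothetical minimal recipe/config -- names are illustrative.
from sisyphus import Job, Task, tk


class DummyJob(Job):
    def __init__(self):
        self.out = self.output_path("out.txt")

    def run(self):
        with open(self.out.get_path(), "w") as f:
            f.write("done\n")

    def tasks(self):
        yield Task("run", mini_task=True)


def main():
    # Entry point picked up by the manager when loading the config
    # (assuming the default "main" convention).
    tk.register_output("dummy/out.txt", DummyJob().out)
```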
I found out that running the sisyphus manager on the test (bare; without any `settings.py`) opened ~3k files: from my baseline of 524 open files to 3254 open files after running `sis m test.py`, according to `lsof | grep <user> | wc`.

Besides that, every job triggered by the manager added exactly 105 entries to the list of open file descriptors. However, I can't reproduce this starting from scratch, which leads me to think that it might be a problem with how our code base interacts with sisyphus (or just with our code base). I'll keep investigating and keep you posted.
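To narrow such numbers down, a small helper like the following (hypothetical, not part of sisyphus, Linux-only) counts the descriptors of a single process via `/proc`, which is more targeted than `lsof | grep <user> | wc`, since lsof also lists memory-mapped files and descriptors of every other process owned by the same user.

```python
# Hypothetical helper: count open file descriptors of one process (e.g. the
# manager) via /proc on Linux.
import os
import sys


def open_fd_count(pid: int) -> int:
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    print(f"pid {pid}: {open_fd_count(pid)} open file descriptors")
```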
Is it intended that sisyphus opens this many files, e.g. because of some caching strategy or related machinery? Has this ever been addressed?
If you need more details, I'll be glad to provide them. Thank you in advance.