Buffer overflow error for large number of jobs #96
I've just discovered a bug: setting […] Furthermore, I tried to finally get rid of all database problems once and for all by avoiding all database transactions on the nodes (although read-only access should in theory be safe). This is now also in the devel branch and automatically enabled if […]
I'm having issues with […]

In […] I realize that the number of jobs is high, but the number of chunks should be reasonable in a way that submission should be done within a minute (without explicit […]). I'm thinking that this is mainly due to file system load that already starts during submission and later continues while the jobs are running: I designed my algorithm in a way that bigger objects are all passed via […] If I understand my debugging correctly, […]
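For illustration, a minimal sketch of that kind of setup, assuming a shared file system; `big.object`, `process()`, and the paths are invented names:

```r
library(BatchJobs)

# Persist the big object once and hand each job only its path via
# more.args, so nothing large goes through the registry itself.
saveRDS(big.object, file = "shared/big_object.rds")

reg <- makeRegistry(id = "bigrun", file.dir = "bigrun-files")
batchMap(reg, function(i, path) {
  big <- readRDS(path)  # each node reads the object from the shared volume
  process(big, i)       # process() stands in for the real per-task work
}, i = seq_len(275000), more.args = list(path = "shared/big_object.rds"))
```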
Well, that's basically the tradeoff: you either rely on the database to retrieve the information and risk running into a locked database, or you store everything on the file system to avoid querying the database on the nodes, which might be a big overhead. Are you sure that https://github.com/tudo-r/BatchJobs/blob/master/R/writeFiles.R#L34 is the bottleneck? I've tried it myself with 1e6 jobs and found that it takes less than 5 minutes on a local (but slow) HDD. If you're sure that the […]
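For reference, a hedged sketch of how these options fit together in a `.BatchJobs.R` configuration file; the scheduler template and the numeric values are placeholders (the pragmas are the ones reported in this thread):

```r
# Sketch of a .BatchJobs.R configuration file (example values, not
# recommendations). staged.queries makes the nodes communicate status
# via files instead of querying the central SQLite database.
cluster.functions <- makeClusterFunctionsTorque("torque.tmpl")  # adapt to your scheduler
staged.queries <- TRUE
db.options <- list(pragmas = c("busy_timeout=5000", "journal_mode=WAL"))
fs.timeout <- 65  # seconds to wait for files on a slow shared file system
```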
The other question would be: is it really required for the master to have information about every job on each slave at any time? I certainly don't need that, and if not, you could avoid most of the db or file system load altogether: for instance, use one R session per chunk and only report chunk statistics. This would make the whole thing a lot more scalable (but I realize this would be a major undertaking). I am sure that […]
We kind of do this already by using a buffer which is flushed every 5-10 minutes (cf. […]).
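This is not BatchJobs's actual implementation, just a sketch of the general pattern: updates accumulate in memory and the file system is only touched once per flush interval.

```r
# Generic time-based write buffer (illustrative only).
makeBuffer <- function(flush.fun, flush.interval = 300) {
  items <- list()
  last.flush <- Sys.time()
  function(x) {
    items[[length(items) + 1L]] <<- x
    if (difftime(Sys.time(), last.flush, units = "secs") > flush.interval) {
      flush.fun(items)  # one batched write instead of many small ones
      items <<- list()
      last.flush <<- Sys.time()
    }
  }
}
```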
Please try whether 0f914a2 mitigates the runtime issues.
With 0f914a2, submission is down to 9 minutes (a 30x speedup); "Syncing registry" afterwards takes 10 minutes. Overall file system load is still too high (with about 300 jobs running), so I had to manually stop and resume jobs to keep the volume responsive. Reducing the results overnight only got through 4%, with the time remaining shown as 99 hours. When it completed after 4 days, R crashed with a message saying:

[…]
Well, that sounds acceptable to me.

I'll set the update frequency for chunked jobs more conservatively. But also check your log files: if you produce a lot of output, this could be the problem. You could probably try to redirect logs to […]
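If the output comes from the jobs themselves, one way to silence it is to sink everything to /dev/null inside the job function; the `quietly()` wrapper below is invented for illustration (the suggestion above may instead refer to the scheduler's log files):

```r
# Invented helper: run a job function with all console output discarded.
quietly <- function(fun) {
  function(...) {
    con <- file("/dev/null", open = "w")
    sink(con)                    # divert stdout
    sink(con, type = "message")  # divert messages/warnings
    on.exit({ sink(type = "message"); sink(); close(con) })
    fun(...)
  }
}

batchMap(reg, quietly(my.job.fun), i = seq_len(500000))  # my.job.fun is a placeholder
```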
I think I did not touch anything here... have you solved this? I would assume this is a (temporary) file system problem. We just iterate over the results and load them, nothing special.
The crash was caused by a bug in dplyr (used to assemble my results afterwards; this has got nothing to do with BatchJobs). The time it takes to reduce the results remains an issue, however.
Well, maybe reading 500,000 files just takes some time. But if you give me some more information, I can try to optimize this step a bit.
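For context, the two standard ways to collect results in BatchJobs, assuming a registry `reg` with finished jobs; the combine logic is a placeholder:

```r
done <- findDone(reg)

# Incremental reduce: result files are read one at a time and folded into
# the aggregate, so memory stays flat even for 500,000 files.
total <- reduceResults(reg, ids = done,
                       fun = function(aggr, job, res) aggr + res, init = 0)

# Alternatively, load everything into one list and combine afterwards.
res.list <- reduceResultsList(reg, ids = done)
```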
Thank you for your continued efforts!

In general, I think that the approach of having one result file with one […] I also played around using […]
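One hypothetical way to restructure along those lines, sketched with a placeholder `run.task()`: let each job process a whole block of tasks and return a single combined object, so 500,000 tasks produce only 500 result files.

```r
# Hypothetical: map over blocks of task indices instead of single tasks.
task.blocks <- chunk(seq_len(500000), n.chunks = 500)
batchMap(reg, function(block) {
  do.call(rbind, lapply(block, run.task))  # run.task() is a stand-in
}, block = task.blocks)
```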
When submitting a large number of jobs, BatchJobs still fails for me (this is somewhat similar to #58, but the number of jobs is almost 50 times higher).
I submit between 275,000 and 500,000 jobs in 1, 2, 10, and 25 chunks.
Submitting jobs in one chunk always works, as does sending 2 chunks; 10 chunks sometimes works and sometimes doesn't, and 25 chunks never works.
If `staged.queries = TRUE` (otherwise same behaviour as in #58), and independent of `db.options = list(pragmas = c("busy_timeout=5000", "journal_mode=WAL"))` and `fs.timeout`:

- the `submitJobs()` call itself runs fine until `return(invisible(ids))`
- in `waitForJobs()`, R segfaults
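A minimal sketch of the failing setup as described above, assuming `staged.queries = TRUE` is set in the configuration; the mapped function is a trivial stand-in for the real workload:

```r
library(BatchJobs)

reg <- makeRegistry(id = "overflow", file.dir = "overflow-files")
batchMap(reg, function(i) i^2, i = seq_len(500000))

# 25 chunks of roughly 20,000 jobs each; submission completes...
submitJobs(reg, ids = chunk(getJobIds(reg), n.chunks = 25))
waitForJobs(reg)  # ...but this is where R reportedly segfaults
```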