Buffer overflow error for large number of jobs #96

Open

mschubert opened this issue Aug 1, 2015 · 10 comments

When submitting a large number of jobs, BatchJobs still fails for me (this is somewhat similar to #58, but the number of jobs is almost 50 times higher).

I submit between 275,000 and 500,000 jobs in 1, 2, 10, and 25 chunks.
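
Roughly, the submission looks like this (a minimal sketch with placeholder names and a dummy function, not my actual code):

```r
library(BatchJobs)

# Minimal sketch, placeholder names and a dummy function -- not my actual code.
reg <- makeRegistry(id = "bigrun", file.dir = "bigrun-files")
batchMap(reg, fun = function(i) sqrt(i), seq_len(500000))

# Group the ~500k jobs into 25 chunks, i.e. 25 scheduler jobs.
ids <- chunk(getJobIds(reg), n.chunks = 25, shuffle = TRUE)
submitJobs(reg, ids)
```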

BatchJobs_1.7 BBmisc_1.9

Submitting jobs in one chunk always works, so does sending 2 chunks. 10 chunks sometimes works and sometimes doesn't, and 25 chunks never works.

With staged.queries = TRUE (otherwise the same behaviour as in #58), and independent of db.options = list(pragmas = c("busy_timeout=5000", "journal_mode=WAL")) and fs.timeout:

  • the submitJobs() call itself runs fine until return(invisible(ids))
  • then the message "Might take some time, do not interrupt this!" is printed
  • after this, all jobs are killed/crash/disappear
  • if I call waitForJobs(), R segfaults
Might take some time, do not interrupt this!
Syncing registry ...
Waiting [S:550000 D:0 E:0 R:0] |+                                 |   0% (00:00:00)
Status for 550000 jobs at 2015-08-01 18:00:21
Submitted: 550000 (100.00%)
Started:   550000 (100.00%)
Running:      0 (  0.00%)
Done:         0 (  0.00%)
Errors:       0 (  0.00%)
Expired:   550000 (100.00%)
Time: min=NAs avg=NAs max=NAs
       n submitted started done error running expired t_min t_avg t_max
1 550000    550000  550000    0     0       0  550000    NA    NA    NA
*** buffer overflow detected ***: /usr/lib/R/bin/exec/R terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x3ec4302527]
/lib64/libc.so.6[0x3ec4300410]
/lib64/libc.so.6[0x3ec42ff2c7]
/usr/lib/R/lib/libR.so(+0xfc01b)[0x2b722845901b]
/usr/lib/R/lib/libR.so(+0xfff4e)[0x2b722845cf4e]
/usr/lib/R/lib/libR.so(+0xfffbf)[0x2b722845cfbf]

mllg commented Aug 3, 2015

I've just discovered a bug: setting fs.timeout disabled staging (D'OH!). I've already pushed a fix for this (cefc9c2).

Furthermore, I tried to finally get rid of all database problems once and for all by avoiding all database transactions on the nodes (although read-only transactions should in theory be safe). This is now also in the devel branch and automatically enabled if staged.queries is set. Could you please give it a try on your system?
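
Something along these lines should be enough for a quick test (sketch; adjust to your setup):

```r
# Sketch: install the current devel branch and make sure staged queries are on.
# Assumes devtools is available; adjust the config to your setup.
devtools::install_github("tudo-r/BatchJobs")

library(BatchJobs)
setConfig(staged.queries = TRUE)
```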

mschubert (Author) commented:

I'm having issues with submitJobs() in cefc9c2.

Backend: LSF
Submitting 292 chunks / 1457500 jobs.

In writeFiles(), Map(...) causes the submission of jobs to take over 5 hours total.

I realize that the number of jobs is high, but the number of chunks is small enough that submission should be done within a minute (without an explicit job.delay; note that with staged.queries = FALSE submission takes about 30 minutes, but then some jobs fail because of a locked db).

I suspect this is mainly due to file system load that already starts during submission and continues while jobs are running. I designed my algorithm so that the bigger objects are all passed via more.args and each individual function call (= job) only needs an index into those objects (a couple of bytes). Still, the file system becomes a lot less responsive after a couple of chunks are started.
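
For illustration, the pattern looks roughly like this (placeholder names, not my actual code):

```r
# Sketch with placeholder names: the big objects are passed once via more.args,
# and each job only receives a small integer index into them.
big_data <- replicate(1000, rnorm(1e4), simplify = FALSE)

fit_one <- function(i, data) {
  x <- data[[i]]   # look up the big object by index
  mean(x)          # stand-in for the actual per-job computation
}

batchMap(reg, fun = fit_one, seq_along(big_data),
         more.args = list(data = big_data))
```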

If I understand my debugging correctly, BatchJobs starts a new R instance for every job (not per chunk, both for staged.queries = TRUE and FALSE)? If that's the case, then BatchJobs cannot be used for approximately 1M function calls with > 100 chunks (which would be a pity).


mllg commented Aug 5, 2015

Well, that's basically the tradeoff: you either rely on the database to retrieve the information and risk running into a locked database, or you store everything on the file system to avoid querying the database on the nodes, which can be a big overhead.

Are you sure that https://github.com/tudo-r/BatchJobs/blob/master/R/writeFiles.R#L34 is the bottleneck? I've tried it myself with 1e6 jobs and found that it takes less than 5 minutes on a local (but slow) hdd. If you're sure that the Map() takes most of the time, I'll try to optimize this for high-latency file systems, i.e. writing the job information in chunks.
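
To check, you could profile the submission roughly like this (sketch):

```r
# Sketch: profile submitJobs() to see whether writeFiles()/Map() dominates.
Rprof("submit.prof")
submitJobs(reg, ids)
Rprof(NULL)
head(summaryRprof("submit.prof")$by.total, 20)
```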

mschubert (Author) commented:

The other question would be: is it really required for the master to have information about every job on each slave at any time? I certainly don't, and if not, then you could avoid most of the db or file system load altogether: for instance, you use one R session per chunk and only report chunk statistics. This would make the whole thing a lot more scalable (but I realize this would be a major undertaking).

I am sure that this Map() call is the problem for the first 20 chunks submitted (stepped through in the debugger). After the first couple of chunks it also gets noticeably slower, but I haven't looked at that specifically. I can do a full profiling run, but this will take at least a day because of the nature of the issue.


mllg commented Aug 5, 2015

> The other question would be: is it really required for the master to have information about every job on each slave at any time? I certainly don't, and if not, then you could avoid most of the db or file system load altogether: for instance, you use one R session per chunk and only report chunk statistics. This would make the whole thing a lot more scalable (but I realize this would be a major undertaking).

We kind of do this already by using a buffer which is flushed every 5-10 minutes (cf. doJob.R, msg.buf).

> I am sure that this Map() call is the problem for the first 20 chunks submitted (stepped through in the debugger). After the first couple of chunks it also gets noticeably slower, but I haven't looked at that specifically. I can do a full profiling run, but this will take at least a day because of the nature of the issue.

Please check whether 0f914a2 mitigates the runtime issues.

mschubert (Author) commented:

With 0f914a2, submission is down to 9 minutes (a 30x speedup); the subsequent "Syncing registry" step takes 10 minutes.

File system load overall is still too high (at about 300 jobs running); I had to manually stop and resume jobs to keep the volume responsive.

Reducing the results overnight only got through 4%; the time remaining is shown as 99 hours. When it eventually completed after 4 days, R crashed with this message:

*** caught bus error ***
address 0x2abbf7d08e50, cause 'non-existent physical address'
Bus error (core dumped)


mllg commented Oct 1, 2015

> With 0f914a2, submission is down to 9 minutes (a 30x speedup); the subsequent "Syncing registry" step takes 10 minutes.

Well, that sounds acceptable to me.

> File system load overall is still too high (at about 300 jobs running); I had to manually stop and resume jobs to keep the volume responsive.

I'll set the update frequency for chunked jobs more conservatively. But also check your log files -- if you produce a lot of output, this could be the problem. You could probably try to redirect logs to /dev/null.

> Reducing the results overnight only got through 4%; the time remaining is shown as 99 hours. When it eventually completed after 4 days, R crashed with this message:
>
> *** caught bus error ***
> address 0x2abbf7d08e50, cause 'non-existent physical address'
> Bus error (core dumped)

I think I did not touch anything here ... have you solved this? I would assume this is a (temporary) file system problem. We just iterate over the results and load them, nothing special.

mllg added a commit that referenced this issue Oct 1, 2015
mschubert (Author) commented:

The crash was caused by a bug in dplyr (which I used to assemble my results afterwards; this has nothing to do with BatchJobs).

The long time it takes to reduce the results remains an issue, however.


mllg commented Oct 1, 2015

Well, maybe reading 500,000 files just takes some time. But if you give me some more information, I can try to optimize this step a bit.

  • What does your result look like, i.e. how big is a single result (as reported by object.size())?
  • What function do you call exactly? reduceResults()?
  • What is the runtime of a reduction step? Do you think it is linear/quadratic/exponential with respect to the number of jobs?
  • Is it possible to use reduceResultsList() instead? This one pre-allocates and thus does not need to copy aggr in every iteration, which is a likely bottleneck (see the sketch below).
  • Is reduceResultsParallel() a viable alternative?
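
For a single numeric per job, something along these lines avoids growing an aggregate in every iteration (untested sketch):

```r
# Untested sketch: collect one numeric per job without copying an aggregate.
res <- reduceResultsList(reg, fun = function(job, res) res)
errors <- unlist(res)

# reduceResultsVector() may also work directly for scalar results.
errors <- reduceResultsVector(reg, fun = function(job, res) res)
```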

mschubert (Author) commented:

Thank you for your continued efforts!

  • My result is a single numeric representing the mean cross-validation error of a trained glmnet model
  • I already use reduceResultsList()
  • reduceResultsParallel() would put additional strain on the file system and I'd like to avoid that
  • I haven't tested how the timing behaves with different numbers of jobs, and unfortunately I don't have time for the next two weeks (I'll ping you here once I get around to doing it)

In general, I think that the approach of having one result file containing one numeric per function call is not feasible with >1M calls on a high-latency fs.

I also played around with rzmq, which bypasses the file system altogether, and saw 100x speedups.
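
For illustration, the kind of pattern I mean (a minimal sketch, not my actual code; n_jobs, job_id and cv_error are placeholders):

```r
library(rzmq)

# Master process: pre-allocate and pull one numeric result per job over a socket
# instead of reading one result file per function call.
ctx <- init.context()
pull_sock <- init.socket(ctx, "ZMQ_PULL")
bind.socket(pull_sock, "tcp://*:5555")

results <- numeric(n_jobs)
for (k in seq_len(n_jobs)) {
  msg <- receive.socket(pull_sock)   # a list(id = ..., value = ...) sent by a worker
  results[msg$id] <- msg$value
}

# Worker process: push its single result back to the master.
ctx <- init.context()
push_sock <- init.socket(ctx, "ZMQ_PUSH")
connect.socket(push_sock, "tcp://master-host:5555")
send.socket(push_sock, list(id = job_id, value = cv_error))
```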
