How to reassign a specific task when some remote cluster connection closed? #188
This is related to questions/comments in Issue #154. There's currently no automatic, built-in "failover" mechanism in the future framework. Here's a minimal reproducible example that emulates an R worker going down:
library("future")
plan(multisession, workers = 2L)
f <- future( quit("no") )
v <- value(f)
# Error in unserialize(node$con) :
# Failed to retrieve the value of MultisessionFuture from cluster node #1 (on 'localhost'). The reason reported was 'error reading from connection'
As a first step, what needs to be added to the future framework is a way to distinguish this type of error from regular errors produced by evaluating the future expression itself. This minimal extension is on my todo list.
A quick follow-up: with the release of future 1.8.0, the first two items below should now be possible:
Errors due to orchestration of futures (e.g. connection errors) are now of class FutureError:
> library("future")
> plan(multisession, workers = 2L)
> f <- future( quit("no") )
> res <- tryCatch(v <- value(f), FutureError = identity)
> str(res)
List of 2
$ message: chr "Failed to retrieve the value of MultisessionFuture from cluster node #1 (on 'localhost'). The reason reported "| __truncated__
$ call : language unserialize(node$con)
- attr(*, "class")= chr [1:5] "FutureError" "simpleError" "error" "FutureCondition" ...
- attr(*, "future")=Classes 'MultisessionFuture', 'ClusterFuture', 'MultiprocessFuture', 'Future', 'environment' <environment: 0x40122a8>
The actual relaunching of a failed future is discussed in Issue #205; more work is certainly needed there. Another issue is what happens to the state of the workers and how to recover them. A naive approach is to restart the workers by temporarily switching to another plan and back:
plan(sequential)
plan(multisession, workers = 2L)
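Putting the two pieces above together, here is a rough sketch of how one might track which tasks failed for orchestration reasons and relaunch only those. The helper `relaunch_failed()` and its arguments `tasks`, `workers`, and `max_rounds` are hypothetical names made up for this illustration; they are not part of the future package. The sketch assumes future >= 1.8.0, so that connection problems are signalled as FutureError, and it assumes the naive plan-switching restart is enough to get fresh workers.

```r
library("future")

# Hypothetical helper (not part of the future package).
# 'tasks' is a list of zero-argument functions, one per unit of work.
relaunch_failed <- function(tasks, workers = 2L, max_rounds = 3L) {
  results <- vector("list", length(tasks))
  pending <- seq_along(tasks)

  for (round in seq_len(max_rounds)) {
    if (length(pending) == 0L) break

    # Launch one future per pending task. Launching may itself signal a
    # FutureError if a worker is already gone, so catch it here as well.
    fs <- lapply(tasks[pending], function(task) {
      tryCatch(future(task()), FutureError = identity)
    })

    failed <- integer(0L)
    for (k in seq_along(pending)) {
      f <- fs[[k]]
      if (inherits(f, "FutureError")) {
        res <- f
      } else {
        # Only orchestration errors are caught; a regular error raised by
        # the task itself is not a FutureError and propagates as usual.
        res <- tryCatch(value(f), FutureError = identity)
      }
      if (inherits(res, "FutureError")) {
        failed <- c(failed, pending[k])
      } else {
        # Assign via list() so that a NULL result does not drop the slot.
        results[pending[k]] <- list(res)
      }
    }

    if (length(failed) > 0L) {
      message("Round ", round, ": relaunching task(s) ",
              paste(failed, collapse = ", "))
      # Naive worker recovery: switch to another plan and back.
      plan(sequential)
      plan(multisession, workers = workers)
    }
    pending <- failed
  }

  # 'failed' holds the indices that never succeeded within max_rounds.
  list(results = results, failed = pending)
}

plan(multisession, workers = 2L)
tasks <- list(function() 1 + 1, function() "a")
out <- relaunch_failed(tasks)
```

A task that always kills its worker, such as `function() quit("no")`, would simply end up in `out$failed` after `max_rounds` rounds. Note that restarting all workers this way also discards any state held on workers that were still healthy.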
Hi,
I want to know how to reassign a specific task whose result could not be retrieved because a remote cluster connection closed.
Sometimes, when I work in a for loop with listenv::listenv(), the remote cluster connection closes because of internet connectivity issues, or the remote R session dies unexpectedly. I then do not know which tasks failed, were lost, or succeeded, so I get an incomplete result or none at all.
Currently, with my limited knowledge, I handle this as shown in the code linked below, which wastes a lot of time in many situations. I am trying to parallelise over one thousand model estimation tasks during Bayesian calibrations.
https://github.com/seonghobae/kaefa/blob/master/R/kaefa.R#L441-L462
https://github.com/seonghobae/kaefa/blob/master/R/newEngine.R#L226-L435
Here are my detailed questions:
Best,
Seongho