Long Running Remote Future Terminated #236

bwbioinfo · 2018-06-20T19:40:42Z

I have a long-running future that runs remotely. It gets killed after ~2 hours. I've run future in debug mode and looked at some of the logs on my system. At some point the system management software seems to cause a blip in SSH connectivity, and if that blip coincides with a poll, the future is killed (the run has succeeded in the past). Is it possible to pass an option to retry the poll or extend the wait time?

bwbioinfo · 2018-06-26T18:18:05Z

Update:
I tried running it with a longer interval between polls, but I still get the same error:

Error in unserialize(node$con) :
Failed to retrieve the value of ClusterFuture from cluster node #1 (on ‘myserver’). The reason reported was ‘error reading from connection’
Calls: source ... value -> value.Future -> result -> result.ClusterFuture
Execution halted

HenrikBengtsson · 2018-06-28T22:00:23Z

The "cluster" backend is a wrapper around the clusterApply framework of the parallel package (?parallel::clusterApply) with a PSOCK cluster (?parallel::makeCluster). To connect to other machines, the default protocol is SSH. I haven't tried it, but the ssh client accepts lots of -o options. For instance, maybe you can use -o ConnectTimeout=<seconds> to work around timeouts in the connection, if that's the underlying cause.
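As a rough, untested sketch of the suggestion above: extra ssh client options can be forwarded through the `rshopts` argument of `makeClusterPSOCK()` (exported by future at the time of this thread, now provided by the parallelly package). The host name `myserver` is taken from the error message. The helper name `connect_worker` is hypothetical, and the two keep-alive options and all values are assumptions, added because `ConnectTimeout` only applies while the connection is being established. Whether any of these help depends on what actually causes the blip.

```r
# Sketch only: options forwarded to the ssh client that launches the worker.
# The values are illustrative assumptions, not tested recommendations.
ssh_opts <- c(
  "-o", "ConnectTimeout=60",       # from the suggestion above; applies while connecting
  "-o", "ServerAliveInterval=30",  # assumption: send a keep-alive probe every 30 s
  "-o", "ServerAliveCountMax=20"   # assumption: tolerate 20 unanswered probes
)

# Hypothetical helper: set up one remote PSOCK worker over SSH and
# register it as the backend for subsequent futures.
connect_worker <- function(host, rshopts = ssh_opts) {
  cl <- parallelly::makeClusterPSOCK(host, rshopts = rshopts)
  future::plan(future::cluster, workers = cl)
  invisible(cl)
}

# Usage (requires SSH access to the host):
# cl <- connect_worker("myserver")
```

With a setup along these lines, futures created afterwards are resolved on the remote worker as before; only the ssh invocation changes.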

If the connection drops, then apart from fixing the connection itself and the ongoing discussion about adding support for restarting futures (e.g. Issues #188, #205), I'm not sure there's anything that can be "fixed" in the future package per se.

Having said this, nothing prevents someone from implementing a more robust future backend that, for instance, can reconnect and restart a remote worker if it goes down. But the underlying PSOCK workers provided by the parallel package don't support this, and I'm pretty sure they never will. It's possible that batchtools has some mechanism for this. I'm not fully up to date with its features, but I know people have asked about restarting batchtools jobs if R crashes. If batchtools supports this, then you could try the future.batchtools backend.
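Short of a backend that reconnects on its own, a restart can be approximated on the caller's side. The following is a minimal sketch, not a feature of the future package: `value_with_restart`, `make_future`, `max_tries`, and `get_value` are hypothetical names. It assumes `make_future()` sets up a fresh worker and creates the future each time it is called, since a PSOCK worker whose connection has dropped cannot be reused. Note that it reruns the whole computation from scratch, and that it also retries when the future's own code fails, because `value()` re-signals those errors too.

```r
# Hypothetical helper: create the future, wait for its value, and start over
# from scratch if retrieving the value signals an error.
value_with_restart <- function(make_future, max_tries = 3L,
                               get_value = future::value) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = get_value(make_future())),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) return(result$value)
    message(sprintf("Attempt %d failed: %s",
                    attempt, conditionMessage(result$error)))
  }
  stop("All ", max_tries, " attempts failed")
}

# Usage (assumes a reachable host; `long_computation()` is a placeholder):
# res <- value_with_restart(function() {
#   cl <- parallelly::makeClusterPSOCK("myserver")
#   future::plan(future::cluster, workers = cl)
#   future::future(long_computation())
# })
```

The `get_value` argument exists only so the retry logic can be exercised without a real cluster; in normal use the default, `future::value`, is what retrieves the result.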

bwbioinfo closed this as completed Jun 26, 2018

bwbioinfo reopened this Jun 26, 2018

HenrikBengtsson added the question label Nov 11, 2018
