Long Running Remote Future Terminated #236

Open
bwbioinfo opened this issue Jun 20, 2018 · 2 comments
@bwbioinfo

I have a long-running future that runs remotely, and it gets killed after about 2 hours. I've run future in debug mode and looked at some of the logs on my system. At some point the system management software seems to cause a blip in SSH connectivity, and if that blip coincides with a poll, the future is killed (the run has succeeded in the past). Is it possible to pass an option to retry the poll or extend the wait time?

@bwbioinfo
Author

Update:
I tried running it with a longer interval between polls, but I still get the same error:

Error in unserialize(node$con) :
Failed to retrieve the value of ClusterFuture from cluster node #1 (on ‘myserver’). The reason reported was ‘error reading from connection’
Calls: source ... value -> value.Future -> result -> result.ClusterFuture
Execution halted

@bwbioinfo bwbioinfo reopened this Jun 26, 2018
@HenrikBengtsson
Collaborator

The "cluster" backend is a wrapper around the clusterApply framework of the parallel package (?parallel::clusterApply) with a PSOCK cluster (?parallel::makeCluster). To connect to other machines, the default protocol is SSH. I haven't tried it, but the ssh client accepts lots of -o options. For instance, maybe you can use -o ConnectTimeout=<seconds> to work around timeouts in the connection, if that's the underlying cause.
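
As a rough, untested sketch of how such -o options could be passed along: makeClusterPSOCK() takes an rshopts argument whose elements are handed to the ssh client. The helper names ssh_rshopts() and connect_remote() below are hypothetical, 'myserver' is just the placeholder host from the error message, and the two ServerAlive* keepalive options are an assumption that goes beyond what was suggested above (they are standard OpenSSH client options, but whether they help with this particular connectivity blip is unverified). In current releases makeClusterPSOCK() lives in the parallelly package; at the time of this thread it was provided by future itself.

```r
# Hypothetical helper (not part of future, parallelly, or parallel): turn a
# named vector of ssh options into the "-o", "key=value" pairs ssh expects.
ssh_rshopts <- function(opts) {
  # rbind() interleaves "-o" with each "key=value"; as.vector() flattens
  # the resulting 2-row matrix column by column.
  as.vector(rbind("-o", paste0(names(opts), "=", opts)))
}

rshopts <- ssh_rshopts(c(
  ConnectTimeout      = 30,  # the option mentioned in this thread
  ServerAliveInterval = 60,  # assumption: send a keepalive probe every 60 s
  ServerAliveCountMax = 10   # assumption: tolerate 10 unanswered probes
))

# Hypothetical wrapper, so that nothing connects until it is called.
connect_remote <- function(host = "myserver") {
  cl <- parallelly::makeClusterPSOCK(host, rshopts = rshopts)
  future::plan(future::cluster, workers = cl)
  invisible(cl)
}
```

Calling connect_remote() would then set up the "cluster" backend over an ssh connection that uses these options; the numeric values are placeholders to tune, not recommendations.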

If the connection drops, then apart from fixing the connection itself and the ongoing discussion on adding support for restarting futures (e.g. Issues #188, #205), I'm not sure there's anything that can be "fixed" in the future package per se.

Having said this, nothing prevents someone from implementing a more robust future backend that, for instance, can reconnect and restart a remote worker if it goes down. However, the underlying PSOCK workers provided by the parallel package don't support this, and I'm pretty sure they never will. It's possible that batchtools has some mechanisms for this. I'm not fully up to date with its features, but I know people have asked about restarting batchtools jobs if R crashes. If batchtools supports this, then you could try the future.batchtools backend.
