
Executing many runs in parallel with ssh cluster targets

Jaeho Shin edited this page Sep 1, 2014 · 3 revisions

If you have a set of machines accessible via SSH, 3X provides a simple way to execute a large number of runs on them in parallel. All you need to do is define a new target of type ssh-cluster with a list of hostnames, start the queue, and retrieve results periodically until all runs are done. 3X takes all planned runs in the queue and distributes them across the machines you have listed based on how busy each machine is. Unlike with ssh targets, runs on ssh-cluster targets execute remotely in complete isolation, so you no longer need to worry about, for example, the laptop running 3X losing power or its Wi-Fi connection.

Defining an ssh-cluster target

Use the following command to create a new target.

3x target TARGET define ssh-cluster REMOTE... SHARED_PATH [NAME[=VALUE]]...

For example,

3x target mycluster define ssh-cluster user@foo.example.org:tmp user@bar.example.org:/localdisk/tmp/user differentuser@baz.example.org:/ssd/tmp/user /shared/users/user/3x-shared

will create a target named mycluster that executes runs on three machines:

  • foo.example.org (with login user and keeping temporary files under ~/tmp/)
  • bar.example.org (with login user and keeping temporary files under /localdisk/tmp/user/)
  • baz.example.org (with login differentuser and keeping temporary files under /ssd/tmp/user/)

All shared data is kept under the path /shared/users/user/3x-shared/ (the last argument), which every machine reads from and writes to while executing individual runs.
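To make the REMOTE format concrete, here is a minimal sketch that splits a REMOTE argument into host, login, and temporary directory. It assumes every REMOTE has the USER@HOST:PATH form used in the example above, and that a relative path such as tmp is resolved against the remote home directory. parse_remote is a hypothetical helper for illustration, not part of 3X.

```shell
# parse_remote is a hypothetical helper, not a 3X command.
# It assumes the USER@HOST:PATH form shown in the example above.
parse_remote() {
    login=${1%%@*}          # text before the first "@"
    rest=${1#*@}            # HOST:PATH
    host=${rest%%:*}        # text before the first ":"
    tmp_path=${rest#*:}     # text after the first ":"
    case $tmp_path in
        /*) ;;                            # absolute path: keep as is
        *)  tmp_path="~/$tmp_path" ;;     # relative path: under the home directory
    esac
    printf '%s %s %s\n' "$host" "$login" "$tmp_path"
}

for remote in \
    user@foo.example.org:tmp \
    user@bar.example.org:/localdisk/tmp/user \
    differentuser@baz.example.org:/ssd/tmp/user
do
    parse_remote "$remote"
done
# prints:
#   foo.example.org user ~/tmp
#   bar.example.org user /localdisk/tmp/user
#   baz.example.org differentuser /ssd/tmp/user
```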

Any environment variables necessary for execution can be passed as additional arguments of the form NAME or NAME=VALUE after the shared path.
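As a sketch of what such a definition might look like, the following passes two variables after the shared path, following the NAME=VALUE form from the usage line. The variable names OMP_NUM_THREADS and LC_ALL are only placeholders, and the DRY_RUN switch and define_mycluster wrapper are hypothetical conveniences that let you review the command before running it.

```shell
# define_mycluster and DRY_RUN are hypothetical; only the 3x command line
# itself follows the usage shown above. The variable names are placeholders.
define_mycluster() {
    # With DRY_RUN set, the command is printed instead of executed.
    ${DRY_RUN:+echo} 3x target mycluster define ssh-cluster \
        user@foo.example.org:tmp \
        user@bar.example.org:/localdisk/tmp/user \
        /shared/users/user/3x-shared \
        OMP_NUM_THREADS=4 LC_ALL=C
}

DRY_RUN=1
define_mycluster    # prints the command; unset DRY_RUN to really define the target
```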

Installing 3X on the cluster

Note that 3X must be installed on every machine of the ssh-cluster target for remote execution to work, i.e., the 3x executable should be on the PATH of each machine when you log into it. If it is not installed already, you can, for example, use the following commands to copy the current executable to the shared path and configure the target to use the absolute path to it (stored in the 3x-path file under the target's directory). These commands should be run from the root of the 3X repository.

path_to_3x=/shared/users/user/3x-shared/3x
scp "$(type -p 3x)" [email protected]:$path_to_3x
echo $path_to_3x >run/target/foo/3x-path
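Before starting the queue, it may be worth confirming that each machine can find the 3x executable. The sketch below uses only ssh and the standard `command -v` builtin. check_3x and the SSH override variable are hypothetical helpers, not part of 3X. Keep in mind that a non-interactive ssh session may see a different PATH than an interactive login, and that the check does not apply to targets configured with an absolute path in 3x-path.

```shell
# check_3x is a hypothetical helper, not a 3X command.
# Usage: check_3x USER@HOST...
# Set SSH to substitute another command for ssh (for example, when testing).
check_3x() {
    missing=0
    for remote in "$@"; do
        if ${SSH:-ssh} "$remote" 'command -v 3x' >/dev/null 2>&1; then
            echo "ok $remote"
        else
            echo "missing $remote"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every machine reported a 3x executable.
    [ "$missing" -eq 0 ]
}

# Example (contacts the machines, so it is left commented out):
# check_3x user@foo.example.org user@bar.example.org differentuser@baz.example.org
```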

Starting execution of runs

Next, use the standard command to set the target of the current queue to the ssh-cluster target just created.

3x target TARGET

Then, start the execution of planned runs in the queue on the target.

3x start

This will first create a clone of the experiment repository under the shared path via one of the machines, then send subsets of runs to all accessible machines in the target and start executing them in parallel. Note that this command returns after setting up and initiating the execution; it does not wait for all the runs to finish.

Retrieving results back as you need

Finally, use the following command to retrieve results of finished runs.

3x sync

3X does not synchronize automatically, so the status shown in the GUI or CLI will not update until you run this command. To retrieve results periodically, say every five minutes or every 30 seconds, use one of the following shell loops:

while :; do 3x sync; sleep 300; done  # every five minutes
while :; do 3x sync; sleep 30; done   # every 30 seconds

Once 3x sync finds that all runs have finished execution, it performs the necessary cleanup on the cluster and marks the queue as stopped.
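If you would rather not leave an endless loop running, a bounded variant is one option. The sketch below calls 3x sync a fixed number of times and stops early if a sync fails. sync_periodically and the SYNC_CMD override are hypothetical helpers, and the sketch assumes that 3x sync exits with a non-zero status on failure, which this page does not confirm. It does not detect that the queue has stopped; it simply gives up after the given number of rounds.

```shell
# sync_periodically is a hypothetical helper, not a 3X command.
# Usage: sync_periodically INTERVAL_SECONDS MAX_ROUNDS
# Set SYNC_CMD to substitute another command for "3x sync" (for example, when testing).
sync_periodically() {
    interval=$1
    rounds=$2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Assumption: a failed sync is reported through a non-zero exit status.
        ${SYNC_CMD:-3x sync} || return 1
        i=$((i + 1))
        # Sleep between rounds, but not after the last one.
        if [ "$i" -lt "$rounds" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Example (left commented out because it needs a 3X queue):
# sync_periodically 300 288   # every five minutes for about a day
```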