Executing many runs in parallel with ssh cluster targets
If you have a set of machines accessible via SSH, 3X provides a simple way to execute a large number of runs on them in parallel. All you need to do is define a new target of type ssh-cluster with a list of hostnames, start the queue, and retrieve results periodically until all runs are done. 3X takes all planned runs in the queue and distributes them across the machines you listed based on how busy each machine is. Unlike ssh type targets, ssh-cluster targets execute runs remotely in complete isolation, so you no longer need to worry about, for example, the laptop running 3X losing power or its Wi-Fi connection.
Use the following command to create a new target.
3x target TARGET define ssh-cluster REMOTE... SHARED_PATH [NAME[=VALUE]]...
For example,
3x target mycluster define ssh-cluster user@foo.example.org:tmp user@bar.example.org:/localdisk/tmp/user differentuser@baz.example.org:/ssd/tmp/user /shared/users/user/3x-shared
will create a target named mycluster that executes runs on three machines:
- foo.example.org (with login user, keeping temporary files under ~/tmp/)
- bar.example.org (with login user, keeping temporary files under /localdisk/tmp/user/)
- baz.example.org (with login differentuser, keeping temporary files under /ssd/tmp/user/)
All shared data is kept under /shared/users/user/3x-shared/ (the last argument), which each machine reads from and writes to when executing individual runs.
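To make the REMOTE form concrete, here is a small shell sketch that splits an argument of the shape LOGIN@HOST:DIR into its parts, reading a relative DIR as relative to the login's home directory, as in the foo.example.org case above. The describe_remote function is a hypothetical helper for illustration only; it is not part of 3X and may not match how 3X parses the argument internally.

```shell
# Hypothetical helper (not part of 3X): split LOGIN@HOST:DIR into its parts.
describe_remote() {
    remote=$1
    login=${remote%%@*}   # text before the first "@"
    rest=${remote#*@}     # text after the first "@"
    host=${rest%%:*}      # text before the first ":"
    dir=${rest#*:}        # text after the first ":"
    # A relative directory is shown relative to the login's home directory.
    case $dir in
        /*) ;;
        *) dir="~/$dir" ;;
    esac
    echo "host=$host login=$login tmpdir=$dir"
}
```

For instance, describe_remote user@foo.example.org:tmp prints host=foo.example.org login=user tmpdir=~/tmp.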
Any environment variables necessary for execution can be passed as arguments after the shared path.
Note that 3X must be installed on every machine of the ssh-cluster target for remote execution to work, i.e., the 3x executable should be on the PATH of each machine when you log into it.
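One way to verify this requirement before starting the queue is to ask each machine where it finds 3x. The following is a minimal sketch, not a 3X feature: check_3x_on_hosts and the SSH_CMD override are hypothetical names introduced here, and the check assumes that a non-interactive ssh command sees the same PATH as an interactive login, which is not true on every system.

```shell
# Hypothetical helper (not part of 3X): report whether `3x` is on PATH
# on each LOGIN@HOST given as an argument.
# SSH_CMD is an assumed override hook (defaults to plain ssh) so the
# function can be dry-run without contacting real machines.
check_3x_on_hosts() {
    missing=0
    for host in "$@"; do
        # `command -v 3x` succeeds only if the remote shell can resolve 3x.
        if ${SSH_CMD:-ssh} "$host" 'command -v 3x' >/dev/null 2>&1; then
            echo "ok $host"
        else
            echo "missing $host"
            missing=$((missing + 1))
        fi
    done
    # Exit status is non-zero when at least one host lacks 3x.
    [ "$missing" -eq 0 ]
}
```

It could be called as check_3x_on_hosts user@foo.example.org user@bar.example.org differentuser@baz.example.org, passing only the LOGIN@HOST part of each REMOTE.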
If it is not already installed, you can, for example, use the following commands to copy the current executable to the shared path and configure the target to use its absolute path (stored in the 3x-path file under the target's directory). These should be run from the root of the 3X repository.
path_to_3x=/shared/users/user/3x-shared/3x
scp "$(type -p 3x)" user@foo.example.org:$path_to_3x
echo $path_to_3x >run/target/mycluster/3x-path
Next, use the standard command to set the target of the current queue to the ssh-cluster target just created.
3x target TARGET
Then, start the execution of planned runs in the queue on the target.
3x start
This first creates a clone of the experiment repository under the shared path via one of the machines, then sends subsets of runs to all accessible machines in the target and starts execution in parallel. Note that this command returns after setting up and initiating execution; it does not wait for all the runs to finish.
Finally, use the following command to retrieve results of finished runs.
3x sync
3X does not synchronize automatically, so no status in the GUI or CLI is updated unless this command is run manually. To retrieve results periodically, say every five minutes or every 30 seconds, use one of the following shell loops:
while :; do 3x sync; sleep 5m; done # every five minutes
while :; do 3x sync; sleep 30s; done # every 30 seconds
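The loops above run until interrupted. If a bounded variant is preferred, the sketch below syncs a fixed number of times and then returns. The sync_repeatedly function and the THREEX_CMD override are hypothetical names introduced for illustration; the sketch does not detect whether the queue has finished, it only limits the number of attempts.

```shell
# Hypothetical helper (not part of 3X): run `3x sync` COUNT times,
# waiting INTERVAL between attempts (INTERVAL is passed to sleep as-is).
# THREEX_CMD is an assumed override hook for the 3x executable.
sync_repeatedly() {
    interval=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        i=$((i + 1))
        # Keep going even if one sync attempt fails, but report it.
        ${THREEX_CMD:-3x} sync || echo "3x sync failed (attempt $i)" >&2
        # No sleep after the last attempt, so the function returns promptly.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, sync_repeatedly 300 12 would sync every five minutes for about an hour. A plain number of seconds is used because the 5m and 30s suffixes are not accepted by every sleep implementation.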
Once 3x sync finds that all runs have finished execution, it performs the necessary cleanup on the cluster and marks the queue as stopped.