This system is a job scheduler that works cooperatively with MPI programs. It consists of the following three programs.
- rapi works with MPI programs through the MPI profiling interface.
- rapid manages MPI programs.
- rapictld manages all rapid instances.
- rapi sends MPI program information, such as whether the program is interactive or how to stop it. This allows the job scheduler to manage jobs more flexibly.
- rapi sends the MPI program's running state, such as "now entering the MPI_Send function". This allows the job scheduler to adjust the timing of suspending and resuming jobs.
- Clone this repository
git clone https://github.com/nomlab/rapi.git && cd rapi
- Build
make
- Install
sudo make install # Install rapi.so, rapid, rapictld into /usr/local/bin/
make install-local # Alternatively, install locally into ~/.local/bin/
The above procedure installs on only one node. Repeat it on every node you use, or share the installation through a shared filesystem such as NFS.
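Without a shared filesystem, the per-node steps above could be repeated over ssh. The following is a minimal sketch, not part of rapi: the helper names `install_cmd_for` and `print_install_cmds` are hypothetical, passwordless ssh to each node is assumed, and `make install-local` is used so that no sudo prompt is needed on the remote side. The functions only print the commands so they can be reviewed first.

```shell
# Hypothetical helper: print the ssh command that would clone, build, and
# install rapi on one node. It prints only; nothing is executed here.
install_cmd_for() {
    node="$1"
    printf 'ssh %s "git clone https://github.com/nomlab/rapi.git && cd rapi && make && make install-local"\n' "$node"
}

# Print one install command per node given as an argument (dry run).
print_install_cmds() {
    for node in "$@"; do
        install_cmd_for "$node"
    done
}
```

For example, `print_install_cmds computing_node1 computing_node2` prints two commands; run them by hand, or pipe the output to `sh` once it looks right.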
- Launch rapictld on the control node
# Interval between suspend and resume is 10ms
# Compute nodes that run rapid are "computing_node1" and "computing_node2"
rapictld -t 10 -a computing_node1,computing_node2
# Print debug messages
rapictld -t 10 -a computing_node1,computing_node2 -d Debug
- Launch rapid on all compute nodes
# Launch rapid on each node
rapid -a control_node
# Launch over ssh
ssh computing_node1 rapid -a control_node
# Or launch in the background (but then rapid is difficult to kill)
ssh -f computing_node1 rapid -a control_node
# Print debug messages
rapid -a control_node -d Debug
- Launch an MPI program with "LD_PRELOAD" set
# The -x option sets an environment variable for all MPI processes
mpirun -x LD_PRELOAD=/usr/local/bin/rapi.so mpi_program
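The start-up order above (rapictld first, then rapid on every compute node, then the MPI program) can be sketched as a small dry-run helper. This is an illustration only: `join_nodes` and `print_launch_plan` are hypothetical names, the commands are printed rather than executed, and the rapi.so path assumes the default `sudo make install` location.

```shell
# Join node names with commas, the form passed to `rapictld -a`.
join_nodes() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Print (not run) the commands in start-up order.
# Arguments: control node, interval in ms, MPI program, compute nodes...
print_launch_plan() {
    control="$1"; interval="$2"; program="$3"
    shift 3
    # 1. rapictld on the control node, told about every compute node
    printf 'rapictld -t %s -a %s\n' "$interval" "$(join_nodes "$@")"
    # 2. rapid on each compute node, pointed at the control node
    for node in "$@"; do
        printf 'ssh %s rapid -a %s\n' "$node" "$control"
    done
    # 3. the MPI program with rapi.so preloaded
    printf 'mpirun -x LD_PRELOAD=/usr/local/bin/rapi.so %s\n' "$program"
}
```

Calling `print_launch_plan control_node 10 mpi_program computing_node1 computing_node2` prints the four commands in order, which makes it easy to spot a mistyped host name before anything is started.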
You can use a script to launch rapid, rapictld, and the MPI program (with rapi).
- Setup rapi
  - Install rapi.so and rapid on all computing-nodes, and add their location to PATH
  - Install rapictld on the control-node, and add its location to PATH
- Setup MPI
  - Install mpirun on all computing-nodes
  - Install the MPI program you use at the same location on all computing-nodes
  - Install hostfile on the control-node
- Setup ssh
  - Control-node can ssh to all computing-nodes with
    - the hostname in hostfile
    - no password or passphrase
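The ssh requirement above can be checked before running the script. The sketch below is an assumption-laden illustration, not part of rapi: `hostfile_hosts` and `check_ssh` are hypothetical helpers, and the hostfile is assumed to list one host per line, optionally followed by fields such as `slots=N` and `#` comments.

```shell
# Print the host names from a hostfile: drop comments and blank lines and
# keep the first field of each remaining line (assumed hostfile layout).
hostfile_hosts() {
    awk '{ sub(/#.*/, "") } NF { print $1 }' "$1"
}

# Try a non-interactive login to every host. BatchMode makes ssh fail
# instead of prompting for a password or passphrase.
check_ssh() {
    status=0
    for host in $(hostfile_hosts "$1"); do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            status=1
        fi
    done
    return "$status"
}
```

Running `check_ssh hostfile` on the control-node reports each host and returns a non-zero status if any login would have needed a password.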
Run the script on the control-node
script/run.sh -h hostfile -t 100 -p example_program -l log/$(date "+%Y%m%d%H%M%S").log