jobstart

Jobstart-related software and information

A. Get Slurm deploy scripts:

  1. Go to the root directory of the experiment:
$ cd <rootdir>
  1. Clone the deploy scripts:
$ git clone https://github.com/artpol84/jobstart.git
  1. Go to the deploy directory:
$ cd jobstart/slurm_deploy/
  1. Set up the configuration in deploy_ctl.conf. NOTE: Set INSTALL_DIR to a directory that is unique to each node (for example, /tmp/slurm_deploy). Otherwise, the Slurm daemon instances will conflict over the common files.
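
The INSTALL_DIR note above can be checked before deploying. The following is a minimal sketch, assuming that "unique for each node" means "not on a filesystem shared between nodes" and that GNU coreutils `stat` is available. The helper name `is_shared_fstype` and the list of filesystem types are illustrative assumptions, not part of the deploy scripts.

```shell
#!/bin/sh
# Hypothetical helper (not part of jobstart): succeeds if the given
# filesystem type name is one that is commonly shared between nodes.
# The list of types is an assumption; extend it for your cluster.
is_shared_fstype() {
    case "$1" in
        nfs*|lustre|gpfs|cifs) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (GNU stat assumed; replace /tmp with the parent of INSTALL_DIR):
#   if is_shared_fstype "$(stat -f -c %T /tmp)"; then
#       echo "warning: INSTALL_DIR appears to be on a shared filesystem" >&2
#   fi
```

A directory such as /tmp/slurm_deploy normally passes this check because /tmp is usually local to each node, but that depends on how the cluster is set up.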

B. Build and start the installation

  1. Allocate resources:
$ salloc -N <x> -t <y>
  1. Download all of the packages:
$ ./deploy_cmd.sh source_prepare
  1. Build and install all of the packages:
$ ./deploy_cmd.sh build_all
  1. Distribute everything:
$ ./deploy_cmd.sh distribute_all
  1. Configure Slurm. See jobstart/slurm_deploy/files/slurm.conf.in for the general configuration, and provide a customization file <local.conf> that describes the control machine and partitions (see jobstart/slurm_deploy/files/local.conf for an example):
$ ./deploy_cmd.sh slurm_config ./files/local.conf
  1. Start the Slurm instance:
$ ./deploy_cmd.sh slurm_start
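
The deploy_cmd.sh steps in this section can be chained so that a failed step stops the sequence. This is a minimal sketch, not something shipped with the repository: `run_deploy`, `DEPLOY_CMD` and `LOCAL_CONF` are hypothetical names introduced here, and the sketch assumes it is run from jobstart/slurm_deploy/ inside the allocation obtained with salloc.

```shell
#!/bin/sh
# Sketch: run the deploy steps of section B in order and stop at the
# first failure. DEPLOY_CMD and LOCAL_CONF are hypothetical variables.
DEPLOY_CMD=${DEPLOY_CMD:-./deploy_cmd.sh}
LOCAL_CONF=${LOCAL_CONF:-./files/local.conf}

run_deploy() {
    for step in source_prepare build_all distribute_all \
                "slurm_config $LOCAL_CONF" slurm_start; do
        echo "==> $step"
        # $step is intentionally unquoted so that "slurm_config <file>"
        # splits into the subcommand and its argument. As a consequence,
        # LOCAL_CONF must not contain spaces.
        $DEPLOY_CMD $step || { echo "step failed: $step" >&2; return 1; }
    done
}

# Call run_deploy explicitly once the configuration is in place:
#   run_deploy
```

Stopping at the first failure avoids distributing or starting a partially built installation.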

C. Check the installation

NOTE: Run these steps from another terminal!

  1. Check that the deployment is functional:
$ export SLURMDEP_INST=<INSTALL_DIR from deploy_ctl.conf>
$ cd $SLURMDEP_INST/slurm/bin
$ ./sinfo
<check that the output is correct>
  1. Allocate nodes inside the deployed Slurm installation:
$ ./salloc -N <X> <other options>
  1. Run hostname to test:
$ ./srun hostname

  1. Run hostname with the pmix plugin:
$ ./srun --mpi=pmix hostname
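
The manual inspection of the sinfo output can be partly automated. The sketch below sums the nodes reported as idle. It assumes sinfo is invoked with `-h -o "%D %t"` (no header, node count and compact state per line); `count_idle_nodes` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: sum the node counts that sinfo reports in the "idle" state.
# Reads lines of the form "<count> <state>" on stdin, as printed by
#   ./sinfo -h -o "%D %t"
# States with a suffix (for example "idle*") are deliberately not counted,
# because the exact match below accepts only plain "idle".
count_idle_nodes() {
    awk '$2 == "idle" { n += $1 } END { print n + 0 }'
}

# Example use from $SLURMDEP_INST/slurm/bin:
#   ./sinfo -h -o "%D %t" | count_idle_nodes
```

If the printed number is lower than the number of nodes you expect, the deployed Slurm instance is probably not fully up yet.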

D. Check with a distributed application

NOTE: Run these steps from within the allocation of the deployed Slurm (same terminal as in C).

  1. Go to the test app directory:
$ cd <rootdir>/jobstart/shmem/
  1. Compile the program:
$ $SLURMDEP_INST/ompi/bin/oshcc -o hello_oshmem_c -g hello_oshmem_c.c # SLURMDEP_INST is INSTALL_DIR from deploy_ctl.conf
  1. Launch the application:
$ cd <rootdir>/jobstart/launch/
$ ./run.sh {dtcp|ducx|sapi} [early|noearly] [openib] [timing] -N <nnodes> -n <nprocs> <other-slurm-opts> ./hello_oshmem_c

Use the following commands to re-deploy Slurm after the initial allocation has been lost:

export SLURMDEP_INST=<INSTALL_DIR from deploy_ctl.conf>
./deploy_cmd.sh slurm_stop
./deploy_cmd.sh cleanup_remote
rm --preserve-root ${SLURMDEP_INST}/slurm/tmp/*
rm --preserve-root ${SLURMDEP_INST}/slurm/var/*
./deploy_cmd.sh distribute_all
./deploy_cmd.sh slurm_config
./deploy_cmd.sh slurm_start
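
If SLURMDEP_INST is empty or unset, the two rm commands above expand to /slurm/tmp/* and /slurm/var/*. The following is a minimal sketch of a guard against that. `clean_state_dir` is a hypothetical helper, not part of deploy_cmd.sh, and it uses plain `rm -f` instead of `rm --preserve-root` so that an already empty directory is not an error; that substitution is an assumption of this sketch.

```shell
#!/bin/sh
# Sketch: remove the files in <prefix>/slurm/<sub>, refusing to run when
# the install prefix is empty. clean_state_dir is a hypothetical helper.
clean_state_dir() {
    prefix=$1
    sub=$2
    if [ -z "$prefix" ]; then
        echo "refusing to clean: install prefix is empty" >&2
        return 1
    fi
    dir=$prefix/slurm/$sub
    [ -d "$dir" ] || return 0      # nothing to clean
    # -f makes an unmatched glob (empty directory) a no-op instead of an error.
    rm -f -- "$dir"/*
}

# Example use in place of the two rm commands:
#   clean_state_dir "$SLURMDEP_INST" tmp
#   clean_state_dir "$SLURMDEP_INST" var
```

Like the original commands, this removes files only; subdirectories under tmp or var are left in place and cause rm to report an error.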