Updated wrap_rrdesi to fix multiple use cases.

The root bug was that when the number of available GPUs was greater than the size of the communicator, the script made the bad assumption that you wanted to use at least ngpu ranks. So when wrap_rrdesi was called directly without srun, the communicator size was obviously 1, but the node had 4 GPUs, so the script split the input files four ways and rank 0 took only 1/4 of them, while there were no other ranks to run the rest.
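For illustration, here is a minimal standalone sketch of that failure mode and of the clamp that fixes it. The variable names mirror the diff below, but the communicator size, the file list, and the use of min() are illustrative only; this is not the wrap_rrdesi code itself:

    import numpy as np

    comm_size = 1                    # e.g. wrap_rrdesi invoked directly, without srun
    gpu_per_node, nhosts = 4, 1      # one 4-GPU node
    inputfiles = [f'coadd-{i}.fits' for i in range(18)]      # hypothetical inputs

    ngpu = gpu_per_node * nhosts     # 4 GPUs visible in the allocation
    # Old behavior: split the files across ngpu groups even though only comm_size
    # ranks exist, so rank 0 took 1/4 of the files and nothing processed the rest.
    # Fix: never plan for more worker groups than there are MPI ranks.
    ngpu = min(ngpu, comm_size)
    chunks = np.array_split(inputfiles, ngpu)
    print(len(chunks), [len(c) for c in chunks])             # -> 1 [18]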

I fixed this, added informative warning messages where appropriate, and cleaned up the login node logic that had been copy/pasted from elsewhere. Here are a bunch of example test cases:

    Run on login node

cdwarner@perlmutter:login16:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
wrap_rrdesi should not be run on a login node.

The following were all run after getting an interactive node with

 salloc -N 1 -C gpu -q interactive -t 3:00:00 -A desi_g --gpus-per-node=4

    Run directly - now works with warnings:

cdwarner@nid001173:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
WARNING: Detected that wrap_rrdesi is not being run with srun command.
WARNING: Calling directly can lead to under-utilizing resources.
Recommended syntax: srun -N nodes -n tasks -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi [options]
	Ex: 8 tasks each with GPU support on 2 nodes:
		srun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0  wrap_rrdesi ...
	Ex: 64 tasks on 1 node and 4 GPUs - this will run on both GPU and non-GPU nodes at once:
		srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0  wrap_rrdesi ...
WARNING: wrap_rrdesi was called with 4 GPUs but only 1 MPI ranks.
WARNING: Will only use 1 GPUs.
Running 18 input files on 1 GPUs and 1 total procs...

    Run with srun and n < ngpu:

cdwarner@nid001173:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 1 -n 2 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
WARNING: wrap_rrdesi was called with 4 GPUs but only 2 MPI ranks.
WARNING: Will only use 2 GPUs.
Running 18 input files on 2 GPUs and 2 total procs...

    Run as expected:

cdwarner@nid001173:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 1 -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
Running 18 input files on 4 GPUs and 4 total procs...

    Run with GPU + CPU:

cdwarner@nid001173:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
Running 18 input files on 4 GPUs and 6 total procs...
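As a sanity check on the "4 GPUs and 6 total procs" numbers above, this is roughly how the counts fall out of the formulas in the diff below. Note that cpu_per_task = 32 and ncomm = ngpu + ncpu_ranks are assumptions on my part (neither appears in the hunks of this commit), so treat this as a sketch of the bookkeeping rather than the script itself:

    comm_size = 64                 # srun -n 64 on a single 4-GPU node
    nhosts = 1
    gpu_per_node = 4
    cpu_per_task = 32              # assumed; groups leftover CPU ranks into workers

    ngpu = min(gpu_per_node * nhosts, comm_size)               # 4 GPU workers
    ncpu_ranks = (comm_size - ngpu - 1) // cpu_per_task + 1    # 2 CPU worker groups
    ncomm = ngpu + ncpu_ranks                                  # 6 "total procs"
    print(ngpu, ncpu_ranks, ncomm)                             # -> 4 2 6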

    Run with -n 64 but --gpuonly

cdwarner@nid001133:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite --gpuonly
Running 18 input files on 4 GPUs and 4 total procs...

    Run with too many nodes requested (handled by srun):

cdwarner@nid001173:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
srun: error: Only allocated 1 nodes asked for 2

The following were all run after getting an interactive allocation of 2 nodes with

salloc --nodes 2 --qos interactive --time 4:00:00 --constraint gpu --gpus-per-node=4 --account desi_g

    Run as expected

cdwarner@nid001048:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
Running 18 input files on 8 GPUs and 8 total procs...

    Run with too few n

cdwarner@nid001048:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 2 -n 6 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
WARNING: wrap_rrdesi was called with 8 GPUs but only 6 MPI ranks.
WARNING: Will only use 6 GPUs.
Running 18 input files on 6 GPUs and 6 total procs...

The following were run on an interactive node obtained with the -n argument:

salloc --nodes 1 -n 128 --qos interactive --time 4:00:00 --constraint gpu --gpus-per-node=4 --account desi_g

    Run directly

cdwarner@nid001133:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
WARNING: Detected that wrap_rrdesi is not being run with srun command.
WARNING: Calling directly can lead to under-utilizing resources.
Recommended syntax: srun -N nodes -n tasks -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi [options]
	Ex: 8 tasks each with GPU support on 2 nodes:
		srun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0  wrap_rrdesi ...
	Ex: 64 tasks on 1 node and 4 GPUs - this will run on both GPU and non-GPU nodes at once:
		srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0  wrap_rrdesi ...
WARNING: wrap_rrdesi was called with 4 GPUs but only 1 MPI ranks.
WARNING: Will only use 1 GPUs.
Running 18 input files on 1 GPUs and 1 total procs...

    Run as expected

cdwarner@nid001133:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 1 -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
Running 18 input files on 4 GPUs and 4 total procs...

Finally, if MPI is not available:

    try:
        import mpi4py.MPI as MPI
    except ImportError:
        have_mpi = False
        print ("MPI not available - required to run wrap_rrdesi")
        sys.exit(0)
craigwarner-ufastro committed Dec 19, 2024
1 parent 08d4acd commit 510426e
bin/wrap_rrdesi: 29 additions & 2 deletions
@@ -17,14 +17,15 @@ from desispec.scripts import qsoqn, qsomgii, emlinefit
 # MPI environment availability
 have_mpi = None
 if nersc_login_node():
-    have_mpi = False
+    print ("wrap_rrdesi should not be run on a login node.")
+    sys.exit(0)
 else:
     have_mpi = True
     try:
         import mpi4py.MPI as MPI
     except ImportError:
         have_mpi = False
-        print ("MPI not available")
+        print ("MPI not available - required to run wrap_rrdesi")
         sys.exit(0)
 
 parser = argparse.ArgumentParser(allow_abbrev=False)
@@ -61,6 +62,18 @@ afterburners = args.afterburners
 comm = MPI.COMM_WORLD
 comm_rank = comm.rank
 
+#print ("COMM", comm.size, comm.rank)
+env = os.environ
+if not 'SLURM_STEP_RESV_PORTS' in os.environ and comm.rank == 0:
+    print ("WARNING: Detected that wrap_rrdesi is not being run with srun command.")
+    print ("WARNING: Calling directly can lead to under-utilizing resources.")
+    print ("Recommended syntax: srun -N nodes -n tasks -c 2 --gpu-bind=map_gpu:3,2,1,0 ./wrap_rrdesi [options]")
+    print ("\tEx: 8 tasks each with GPU support on 2 nodes:")
+    print ("\t\tsrun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi ...")
+    print ("\tEx: 64 tasks on 1 node and 4 GPUs - this will run on both GPU and non-GPU nodes at once:")
+    print ("\t\tsrun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi ...")
+
+
 #Get number of nodes
 nhosts = os.getenv('SLURM_NNODES')
 if nhosts is None:
@@ -84,11 +97,21 @@ if args.gpu:
     gpu_per_node = int(gpu_per_node)
     ngpu = gpu_per_node*nhosts
 
+    if ngpu > comm.size:
+        if comm.rank == 0:
+            print (f"WARNING: wrap_rrdesi was called with {ngpu} GPUs but only {comm.size} MPI ranks.")
+            print (f"WARNING: Will only use {comm.size} GPUs.")
+        ngpu = comm.size
+
 #Set GPU nodes
 #We want the first gpu_per_node ranks of each host
 ranks_per_host = comm.size // nhosts
 use_gpu = (comm_rank % ranks_per_host) < gpu_per_node
 ncpu_ranks = (comm.size - ngpu -1) // cpu_per_task + 1
+#if comm.rank == 0:
+#    print (f'{ngpu=}, {gpu_per_node=}, {nhosts=}')
+#    print (f'{ranks_per_host=}, {use_gpu=}, {ncpu_ranks=}')
+#    print (f'{comm.size=}, {comm_rank=}, {cpu_per_task=}')
 if args.gpuonly:
     ncpu_ranks = 0
 
Expand Down Expand Up @@ -119,6 +142,7 @@ if use_gpu:
else:
myhost = ngpu + (comm.rank - gpu_per_node*(comm.rank // ranks_per_host)) // cpu_per_task
subcomm = comm.Split(myhost)
#print (f'{comm.rank=}, {ncomm=}, {myhost=}, {subcomm.size=}')

if comm.rank == 0:
print("Running "+str(len(inputfiles))+" input files on "+str(ngpu)+" GPUs and "+str(ncomm)+" total procs...")
@@ -127,6 +151,8 @@ if comm.rank == 0:
 # In --gpuonly mode, CPU procs will not enter this block
 if myhost < ncomm:
     myfiles = np.array_split(inputfiles, ncomm)[myhost]
+    nfiles = len(myfiles)
+    #print (f'DEBUG: {myhost=} {ncomm=} {nfiles=} {myfiles=}, {comm.rank=}')
     for infile in myfiles:
         redrockfile = os.path.join(outdir, os.path.basename(infile).replace('coadd-', 'redrock-'))
         if os.path.isfile(redrockfile) and not overwrite:
@@ -145,6 +171,7 @@ if myhost < ncomm:
         opts.extend(args_to_pass)
         if use_gpu:
             opts.append('--gpu')
+        print (f'Running rrdesi on {myhost=} {subcomm.rank=} with options {opts=}')
         desi.rrdesi(opts, comm=subcomm)
 
         # optionally run all the afterburners
