Updated wrap_rrdesi to fix multiple use cases.

The root bug was that when the number of available GPUs was greater than the size of the communicator, the script made the bad assumption that you wanted to use at least ngpu ranks. So when wrap_rrdesi was called directly without srun, the communicator size was obviously 1, but the node had 4 GPUs, so the script split the input files four ways and rank 0 took only 1/4 of them, while there were no other ranks to run the rest.
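For illustration, here is a minimal standalone sketch of that failure mode and of the clamp that fixes it. The variable names mirror the diff below, but the communicator size, the file list, and the use of min() are illustrative only; this is not the wrap_rrdesi code itself:

    import numpy as np

    comm_size = 1                    # e.g. wrap_rrdesi invoked directly, without srun
    gpu_per_node, nhosts = 4, 1      # one 4-GPU node
    inputfiles = [f'coadd-{i}.fits' for i in range(18)]      # hypothetical inputs

    ngpu = gpu_per_node * nhosts     # 4 GPUs visible in the allocation
    # Old behavior: split the files across ngpu groups even though only comm_size
    # ranks exist, so rank 0 took 1/4 of the files and nothing processed the rest.
    # Fix: never plan for more worker groups than there are MPI ranks.
    ngpu = min(ngpu, comm_size)
    chunks = np.array_split(inputfiles, ngpu)
    print(len(chunks), [len(c) for c in chunks])             # -> 1 [18]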

I fixed this, added informative warning messages where appropriate, and cleaned up the login node logic that had been copy/pasted from elsewhere. Here are a bunch of example test cases:

    Run on login node

cdwarner@perlmutter:login16:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
wrap_rrdesi should not be run on a login node.

The following were all run after getting an interactive node with

 salloc -N 1 -C gpu -q interactive -t 3:00:00 -A desi_g --gpus-per-node=4

    Run directly - now works with warnings:

cdwarner@nid001173:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
WARNING: Detected that wrap_rrdesi is not being run with srun command.
WARNING: Calling directly can lead to under-utilizing resources.
Recommended syntax: srun -N nodes -n tasks -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi [options]
	Ex: 8 tasks each with GPU support on 2 nodes:
		srun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0  wrap_rrdesi ...
	Ex: 64 tasks on 1 node and 4 GPUs - this will run on both GPU and non-GPU nodes at once:
		srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0  wrap_rrdesi ...
WARNING: wrap_rrdesi was called with 4 GPUs but only 1 MPI ranks.
WARNING: Will only use 1 GPUs.
Running 18 input files on 1 GPUs and 1 total procs...

    Run with srun and n < ngpu:

cdwarner@nid001173:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 1 -n 2 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
WARNING: wrap_rrdesi was called with 4 GPUs but only 2 MPI ranks.
WARNING: Will only use 2 GPUs.
Running 18 input files on 2 GPUs and 2 total procs...

    Run as expected:

cdwarner@nid001173:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 1 -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
Running 18 input files on 4 GPUs and 4 total procs...

    Run with GPU + CPU:

cdwarner@nid001173:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
Running 18 input files on 4 GPUs and 6 total procs...
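As a sanity check on the "4 GPUs and 6 total procs" numbers above, this is roughly how the counts fall out of the formulas in the diff below. Note that cpu_per_task = 32 and ncomm = ngpu + ncpu_ranks are assumptions on my part (neither appears in the hunks of this commit), so treat this as a sketch of the bookkeeping rather than the script itself:

    comm_size = 64                 # srun -n 64 on a single 4-GPU node
    nhosts = 1
    gpu_per_node = 4
    cpu_per_task = 32              # assumed; groups leftover CPU ranks into workers

    ngpu = min(gpu_per_node * nhosts, comm_size)               # 4 GPU workers
    ncpu_ranks = (comm_size - ngpu - 1) // cpu_per_task + 1    # 2 CPU worker groups
    ncomm = ngpu + ncpu_ranks                                  # 6 "total procs"
    print(ngpu, ncpu_ranks, ncomm)                             # -> 4 2 6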

    Run with -n 64 but --gpuonly

cdwarner@nid001133:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite --gpuonly
Running 18 input files on 4 GPUs and 4 total procs...

    Run with too many nodes requested (handled by srun):

cdwarner@nid001173:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
srun: error: Only allocated 1 nodes asked for 2

The following were all run after getting an interactive allocation of 2 nodes with

salloc --nodes 2 --qos interactive --time 4:00:00 --constraint gpu --gpus-per-node=4 --account desi_g

    Run as expected

cdwarner@nid001048:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
Running 18 input files on 8 GPUs and 8 total procs...

    Run with too few n

cdwarner@nid001048:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 2 -n 6 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
WARNING: wrap_rrdesi was called with 8 GPUs but only 6 MPI ranks.
WARNING: Will only use 6 GPUs.
Running 18 input files on 6 GPUs and 6 total procs...

The following were run on an interactive node obtained with the -n argument:

salloc --nodes 1 -n 128 --qos interactive --time 4:00:00 --constraint gpu --gpus-per-node=4 --account desi_g

    Run directly

cdwarner@nid001133:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
WARNING: Detected that wrap_rrdesi is not being run with srun command.
WARNING: Calling directly can lead to under-utilizing resources.
Recommended syntax: srun -N nodes -n tasks -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi [options]
	Ex: 8 tasks each with GPU support on 2 nodes:
		srun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0  wrap_rrdesi ...
	Ex: 64 tasks on 1 node and 4 GPUs - this will run on both GPU and non-GPU nodes at once:
		srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0  wrap_rrdesi ...
WARNING: wrap_rrdesi was called with 4 GPUs but only 1 MPI ranks.
WARNING: Will only use 1 GPUs.
Running 18 input files on 1 GPUs and 1 total procs...

    Run as expected

cdwarner@nid001133:/global/cfs/cdirs/desi/users/cdwarner/code/desispec/bin> srun -N 1 -n 4 -c 2 --gpu-bind=map_gpu:3,2,1,0  ./wrap_rrdesi -i $MYRRDIR/list_coadds.ascii -o $SCRATCH/wrap/ --gpu --overwrite
Running 18 input files on 4 GPUs and 4 total procs...

Finally, if MPI is not available:

    try:
        import mpi4py.MPI as MPI
    except ImportError:
        have_mpi = False
        print ("MPI not available - required to run wrap_rrdesi")
        sys.exit(0)
craigwarner-ufastro committed Dec 19, 2024
1 parent 08d4acd commit 510426e
bin/wrap_rrdesi: 29 additions & 2 deletions
@@ -17,14 +17,15 @@ from desispec.scripts import qsoqn, qsomgii, emlinefit
 # MPI environment availability
 have_mpi = None
 if nersc_login_node():
-    have_mpi = False
+    print ("wrap_rrdesi should not be run on a login node.")
+    sys.exit(0)
 else:
     have_mpi = True
     try:
         import mpi4py.MPI as MPI
     except ImportError:
         have_mpi = False
-        print ("MPI not available")
+        print ("MPI not available - required to run wrap_rrdesi")
         sys.exit(0)
 
 parser = argparse.ArgumentParser(allow_abbrev=False)
@@ -61,6 +62,18 @@ afterburners = args.afterburners
 comm = MPI.COMM_WORLD
 comm_rank = comm.rank
 
+#print ("COMM", comm.size, comm.rank)
+env = os.environ
+if not 'SLURM_STEP_RESV_PORTS' in os.environ and comm.rank == 0:
+    print ("WARNING: Detected that wrap_rrdesi is not being run with srun command.")
+    print ("WARNING: Calling directly can lead to under-utilizing resources.")
+    print ("Recommended syntax: srun -N nodes -n tasks -c 2 --gpu-bind=map_gpu:3,2,1,0 ./wrap_rrdesi [options]")
+    print ("\tEx: 8 tasks each with GPU support on 2 nodes:")
+    print ("\t\tsrun -N 2 -n 8 -c 2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi ...")
+    print ("\tEx: 64 tasks on 1 node and 4 GPUs - this will run on both GPU and non-GPU nodes at once:")
+    print ("\t\tsrun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 wrap_rrdesi ...")
+
+
 #Get number of nodes
 nhosts = os.getenv('SLURM_NNODES')
 if nhosts is None:
@@ -84,11 +97,21 @@ if args.gpu:
     gpu_per_node = int(gpu_per_node)
     ngpu = gpu_per_node*nhosts
 
+    if ngpu > comm.size:
+        if comm.rank == 0:
+            print (f"WARNING: wrap_rrdesi was called with {ngpu} GPUs but only {comm.size} MPI ranks.")
+            print (f"WARNING: Will only use {comm.size} GPUs.")
+        ngpu = comm.size
+
 #Set GPU nodes
 #We want the first gpu_per_node ranks of each host
 ranks_per_host = comm.size // nhosts
 use_gpu = (comm_rank % ranks_per_host) < gpu_per_node
 ncpu_ranks = (comm.size - ngpu -1) // cpu_per_task + 1
+#if comm.rank == 0:
+#    print (f'{ngpu=}, {gpu_per_node=}, {nhosts=}')
+#    print (f'{ranks_per_host=}, {use_gpu=}, {ncpu_ranks=}')
+#    print (f'{comm.size=}, {comm_rank=}, {cpu_per_task=}')
 if args.gpuonly:
     ncpu_ranks = 0
 
Expand Down Expand Up @@ -119,6 +142,7 @@ if use_gpu:
else:
myhost = ngpu + (comm.rank - gpu_per_node*(comm.rank // ranks_per_host)) // cpu_per_task
subcomm = comm.Split(myhost)
#print (f'{comm.rank=}, {ncomm=}, {myhost=}, {subcomm.size=}')

if comm.rank == 0:
print("Running "+str(len(inputfiles))+" input files on "+str(ngpu)+" GPUs and "+str(ncomm)+" total procs...")
@@ -127,6 +151,8 @@ if comm.rank == 0:
 # In --gpuonly mode, CPU procs will not enter this block
 if myhost < ncomm:
     myfiles = np.array_split(inputfiles, ncomm)[myhost]
+    nfiles = len(myfiles)
+    #print (f'DEBUG: {myhost=} {ncomm=} {nfiles=} {myfiles=}, {comm.rank=}')
     for infile in myfiles:
         redrockfile = os.path.join(outdir, os.path.basename(infile).replace('coadd-', 'redrock-'))
         if os.path.isfile(redrockfile) and not overwrite:
@@ -145,6 +171,7 @@ if myhost < ncomm:
         opts.extend(args_to_pass)
         if use_gpu:
             opts.append('--gpu')
+        print (f'Running rrdesi on {myhost=} {subcomm.rank=} with options {opts=}')
         desi.rrdesi(opts, comm=subcomm)
 
         # optionally run all the afterburners
