Memory issues for large healpix jobs in Jura #2279

Open
akremin opened this issue Jun 12, 2024 · 2 comments
@akremin
Member

akremin commented Jun 12, 2024

We ran into crashes, MPI rank OOMs, and timeouts in Jura with healpix jobs that had a large number (N>1000) of inputs. A closely related ticket is issue #2277, but that one is specific to issues in logging rather than crashes/timeouts.

I have pushed a branch that may help in this regard, though it wasn't necessary for Jura: prunespectragroup.

That branch modifies the spectral grouping code to subselect, from each loaded frame, only the fibers that overlap a given healpixel, rather than loading all 500 fibers from every input file before subselecting to a given healpixel. In principle this should reduce the memory footprint, since only a fraction of a frame's fibers overlap any one healpixel, so fewer fibers are retained in memory. I have not yet checked that the new code produces results identical to the old code, so I have not opened a pull request.
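
For context, here is a minimal sketch of that kind of per-frame subselection. The nside=64 nested healpix scheme and the TARGET_RA/TARGET_DEC fibermap columns are assumptions for illustration, not details taken from the branch:

# Sketch only: keep just the fibers overlapping one healpixel as each frame
# is read, instead of holding all 500 fibers of every input frame in memory.
import healpy as hp
from desispec.io import read_frame

def fibers_in_healpix(framefile, healpix, nside=64):
    frame = read_frame(framefile)
    ra = frame.fibermap["TARGET_RA"]
    dec = frame.fibermap["TARGET_DEC"]
    pix = hp.ang2pix(nside, ra, dec, nest=True, lonlat=True)
    keep = pix == healpix
    # subselect before accumulating across frames
    return frame.flux[keep], frame.ivar[keep], frame.fibermap[keep]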

Beyond this, we may want to consider scaling the MPI ranks, number of nodes, or other parallelism for the spectral grouping, while not making the redrock and afterburner processing too inefficient.
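
To illustrate the knob being discussed, here is a rough sketch of dividing MPI ranks into per-healpix sub-communicators; the round-robin mapping and variable names are hypothetical, not how desi_zproc actually assigns ranks:

# Sketch only: fewer healpix per job means more ranks (and memory) per healpix.
from mpi4py import MPI

comm = MPI.COMM_WORLD
healpix_list = [7015, 7017, 7018, 7019]    # hypothetical job contents
color = comm.rank % len(healpix_list)      # which healpix this rank works on
subcomm = comm.Split(color=color, key=comm.rank)
my_healpix = healpix_list[color]
# spectral grouping for my_healpix would then run on subcomm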

@sbailey
Contributor

sbailey commented Jul 5, 2024

Examples from Jura for debugging:

Jobs OOM-failed when running N>1 healpix in the same job, but worked when split out into 1 job per healpix

Update: OOM fixed by PR #2290. Might still need to tune job runtimes.

srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 --cpu-bind=cores \
desi_zproc --groupname healpix --max-gpuprocs 4 --mpi --survey sv1 --program dark \
  --healpix 7015 7017 7018 7019 7020 7021 7022 7023 7026 7032 --expfiles \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7015/hpixexp-sv1-dark-7015.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7017/hpixexp-sv1-dark-7017.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7018/hpixexp-sv1-dark-7018.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7019/hpixexp-sv1-dark-7019.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7020/hpixexp-sv1-dark-7020.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7021/hpixexp-sv1-dark-7021.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7022/hpixexp-sv1-dark-7022.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7023/hpixexp-sv1-dark-7023.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7026/hpixexp-sv1-dark-7026.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7032/hpixexp-sv1-dark-7032.csv

Caveat: some of these timed out when run as a single healpix and needed more walltime, but they didn't OOM.

OOM during spectra creation

/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27256.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27258.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27259.slurm

e.g.

srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 --cpu-bind=cores \
  desi_zproc --max-gpuprocs 4 --mpi \
  --groupname healpix --survey special --program other --healpix 27256 \
  --expfiles /global/cfs/cdirs/desi/spectro/redux/jura/healpix/special/other/272/27256/hpixexp-special-other-27256.csv

These required long custom runs on CPU interactive nodes to get the spectra files generated before proceeding.

Slow spectra grouping

sv1 dark healpix 27258 spent 21 minutes in groupspec combining 975 frame files, and then only needed 43 seconds for redrock (!)

@sbailey
Contributor

sbailey commented Sep 9, 2024

Kibo report

Kibo was run bundling 10 healpix per job, with only two jobs having memory problems:

zpix-special-dark-26192-27251.slurm
zpix-special-dark-27256-27345.slurm

with commands like

srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 --cpu-bind=cores desi_zproc --groupname healpix --survey special --program dark --healpix 27256 27257 27258 27259 27260 27262 27263 27333 27344 27345 --expfiles ...

For these we resorted to generating the spectra and coadd files in a separate interactive job with commands like

# assumes SPECPROD, SURVEY, PROGRAM, and HEALPIX are already set in the environment
HPIXDIR=$DESI_SPECTRO_REDUX/$SPECPROD/healpix/$SURVEY/$PROGRAM/272/$HEALPIX
echo Logging to $HPIXDIR/logs/spectra-$SURVEY-$PROGRAM-$HEALPIX.log
srun -n 64 -c 2 desi_group_spectra --mpi --healpix $HEALPIX \
  --expfile $HPIXDIR/hpixexp-$SURVEY-$PROGRAM-$HEALPIX.csv \
  --header SURVEY=$SURVEY PROGRAM=$PROGRAM \
  -o $HPIXDIR/spectra-$SURVEY-$PROGRAM-$HEALPIX.fits.gz \
  -c $HPIXDIR/coadd-$SURVEY-$PROGRAM-$HEALPIX.fits > $HPIXDIR/logs/spectra-$SURVEY-$PROGRAM-$HEALPIX.log

Note: the "272" subdirectory was hardcoded and used for healpix 272xx, and similarly "273" for 273xx, etc.
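
For reference, a hedged sketch of deriving that subdirectory instead of hardcoding it. The healpix // 100 pattern is inferred from the paths in this thread (70/7015, 272/27256) and should be verified against desispec.io before relying on it:

# Sketch: derive the healpix group subdirectory rather than hardcoding "272".
import os

def healpix_dir(reduxdir, specprod, survey, program, healpix):
    group = healpix // 100
    return os.path.join(reduxdir, specprod, "healpix", survey, program,
                        str(group), str(healpix))

# e.g. healpix_dir("/global/cfs/cdirs/desi/spectro/redux", "kibo",
#                  "special", "dark", 27256)
# -> .../kibo/healpix/special/dark/272/27256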

After generating the spectra and coadd files, we then resubmitted the original zpix-special-dark-*-*.slurm files. That got through Redrock and some of the afterburners, but then hit memory problems running emlinefit in parallel on the 10 healpix. We eventually dropped down to running that in serial in an interactive node.

At minimum, it would be useful to add a desi_zproc option to run desi_group_spectra one healpix at a time instead of processing 10 healpix in parallel using sub-communicators. desi_healpix_redshifts could use that option when generating slurm scripts for jobs it recognizes as having an especially large number of inputs. Alternatively, or in addition, desi_healpix_redshifts could group a variable number of healpix per job so that it doesn't put multiple very large healpix together in a single job.
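
As a rough illustration of that last idea, a greedy grouping that caps both the number of healpix and the total number of inputs per job might look like the sketch below. The function, the cap values, and the input format are hypothetical, not existing desi_healpix_redshifts behavior:

# Illustrative only: group healpix so that no single job accumulates too many
# input exposure rows, and oversized healpix end up in their own jobs.
def group_healpix_jobs(n_inputs_per_healpix, max_inputs=2000, max_healpix=10):
    jobs, current, current_inputs = [], [], 0
    for hpix, n in sorted(n_inputs_per_healpix.items()):
        if current and (current_inputs + n > max_inputs or len(current) >= max_healpix):
            jobs.append(current)
            current, current_inputs = [], 0
        current.append(hpix)
        current_inputs += n
    if current:
        jobs.append(current)
    return jobs

# e.g. a very large healpix (3000 inputs) lands alone in its own job:
# group_healpix_jobs({27256: 3000, 27257: 50, 27258: 975, 27259: 40})
# -> [[27256], [27257, 27258, 27259]]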

The zproc wrapper for emlinefit will need more study to understand why it ran out of memory when the other afterburners did not, and what could be done about it.
