Memory issues for large healpix jobs in Jura #2279

Open
akremin opened this issue Jun 12, 2024 · 2 comments
@akremin
Member

akremin commented Jun 12, 2024

We ran into crashes, MPI rank OOMs, and timeouts in Jura with healpix jobs that had a large number (N>1000) of inputs. A closely related ticket is issue #2277, but that one is specific to issues in logging rather than crashes/timeouts.

I have pushed a branch that may help in this regard, though it wasn't necessary for Jura: prunespectragroup.

That branch modifies the spectral grouping code to subselect, from each loaded frame, only the fibers that overlap a given healpixel, rather than loading all 500 fibers from every input file before subselecting to a given healpixel. In principle this should reduce the memory footprint, since only a fraction of a frame's fibers overlap any one healpixel, so fewer fibers are retained in memory. I have not yet checked that the new code produces results identical to the old code, so I have not opened a pull request.
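
For context, here is a minimal sketch of that kind of per-frame subselection. The nside=64 nested healpix scheme and the TARGET_RA/TARGET_DEC fibermap columns are assumptions for illustration, not details taken from the branch:

# Sketch only: keep just the fibers overlapping one healpixel as each frame
# is read, instead of holding all 500 fibers of every input frame in memory.
import healpy as hp
from desispec.io import read_frame

def fibers_in_healpix(framefile, healpix, nside=64):
    frame = read_frame(framefile)
    ra = frame.fibermap["TARGET_RA"]
    dec = frame.fibermap["TARGET_DEC"]
    pix = hp.ang2pix(nside, ra, dec, nest=True, lonlat=True)
    keep = pix == healpix
    # subselect before accumulating across frames
    return frame.flux[keep], frame.ivar[keep], frame.fibermap[keep]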

Beyond this, we may want to consider scaling the MPI ranks, number of nodes, or other parallelism for the spectral grouping, while not making the redrock and afterburner processing too inefficient.
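
To illustrate the knob being discussed, here is a rough sketch of dividing MPI ranks into per-healpix sub-communicators; the round-robin mapping and variable names are hypothetical, not how desi_zproc actually assigns ranks:

# Sketch only: fewer healpix per job means more ranks (and memory) per healpix.
from mpi4py import MPI

comm = MPI.COMM_WORLD
healpix_list = [7015, 7017, 7018, 7019]    # hypothetical job contents
color = comm.rank % len(healpix_list)      # which healpix this rank works on
subcomm = comm.Split(color=color, key=comm.rank)
my_healpix = healpix_list[color]
# spectral grouping for my_healpix would then run on subcomm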

@sbailey
Contributor

sbailey commented Jul 5, 2024

Examples from Jura for debugging:

Jobs OOM-failed when running N>1 healpix in the same job, but worked when split out into 1 job per healpix

Update: OOM fixed by PR #2290. Might still need to tune job runtimes.

srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 --cpu-bind=cores \
desi_zproc --groupname healpix --max-gpuprocs 4 --mpi --survey sv1 --program dark \
  --healpix 7015 7017 7018 7019 7020 7021 7022 7023 7026 7032 --expfiles \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7015/hpixexp-sv1-dark-7015.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7017/hpixexp-sv1-dark-7017.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7018/hpixexp-sv1-dark-7018.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7019/hpixexp-sv1-dark-7019.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7020/hpixexp-sv1-dark-7020.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7021/hpixexp-sv1-dark-7021.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7022/hpixexp-sv1-dark-7022.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7023/hpixexp-sv1-dark-7023.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7026/hpixexp-sv1-dark-7026.csv \
  /global/cfs/cdirs/desi/spectro/redux/jura/healpix/sv1/dark/70/7032/hpixexp-sv1-dark-7032.csv

Caveat: some of these timed out when run as a single healpix and needed more walltime, but they didn't OOM.

OOM during spectra creation

/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27256.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27258.slurm
/global/cfs/cdirs/desi/spectro/redux/jura/run/scripts/healpix/special/other/272/zpix-special-other-27259.slurm

e.g.

srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 --cpu-bind=cores \
  desi_zproc --max-gpuprocs 4 --mpi \
  --groupname healpix --survey special --program other --healpix 27256 \
  --expfiles /global/cfs/cdirs/desi/spectro/redux/jura/healpix/special/other/272/27256/hpixexp-special-other-27256.csv

These required long custom runs on CPU interactive nodes to get the spectra files generated before proceeding.

Slow spectra grouping

sv1 dark healpix 27258 spent 21 minutes in groupspec combining 975 frame files, and then only needed 43 seconds for redrock (!)

@sbailey
Contributor

sbailey commented Sep 9, 2024

Kibo report

Kibo was run bundling 10 healpix per job, with only two jobs having memory problems:

zpix-special-dark-26192-27251.slurm
zpix-special-dark-27256-27345.slurm

with commands like

srun -N 1 -n 64 -c 2 --gpu-bind=map_gpu:3,2,1,0 --cpu-bind=cores desi_zproc --groupname healpix --survey special --program dark --healpix 27256 27257 27258 27259 27260 27262 27263 27333 27344 27345 --expfiles ...

For these we resorted to generating the spectra and coadd files in a separate interactive job with commands like

# assumes SPECPROD, SURVEY, PROGRAM, and HEALPIX are already set in the environment
HPIXDIR=$DESI_SPECTRO_REDUX/$SPECPROD/healpix/$SURVEY/$PROGRAM/272/$HEALPIX
echo Logging to $HPIXDIR/logs/spectra-$SURVEY-$PROGRAM-$HEALPIX.log
srun -n 64 -c 2 desi_group_spectra --mpi --healpix $HEALPIX \
  --expfile $HPIXDIR/hpixexp-$SURVEY-$PROGRAM-$HEALPIX.csv \
  --header SURVEY=$SURVEY PROGRAM=$PROGRAM \
  -o $HPIXDIR/spectra-$SURVEY-$PROGRAM-$HEALPIX.fits.gz \
  -c $HPIXDIR/coadd-$SURVEY-$PROGRAM-$HEALPIX.fits > $HPIXDIR/logs/spectra-$SURVEY-$PROGRAM-$HEALPIX.log

Note: the "272" subdirectory was hardcoded and used for healpix 272xx, and similarly "273" for 273xx, etc.
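
For reference, a hedged sketch of deriving that subdirectory instead of hardcoding it. The healpix // 100 pattern is inferred from the paths in this thread (70/7015, 272/27256) and should be verified against desispec.io before relying on it:

# Sketch: derive the healpix group subdirectory rather than hardcoding "272".
import os

def healpix_dir(reduxdir, specprod, survey, program, healpix):
    group = healpix // 100
    return os.path.join(reduxdir, specprod, "healpix", survey, program,
                        str(group), str(healpix))

# e.g. healpix_dir("/global/cfs/cdirs/desi/spectro/redux", "kibo",
#                  "special", "dark", 27256)
# -> .../kibo/healpix/special/dark/272/27256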

After generating the spectra and coadd files, we then resubmitted the original zpix-special-dark-*-*.slurm files. That got through Redrock and some of the afterburners, but then hit memory problems running emlinefit in parallel on the 10 healpix. We eventually dropped down to running that in serial in an interactive node.

At minimum, it would be useful to add a desi_zproc option to run desi_group_spectra one healpix at a time instead of processing 10 healpix in parallel using sub-communicators. desi_healpix_redshifts could use that option when generating slurm scripts for jobs it recognizes as having an especially large number of inputs. Alternatively, or in addition, desi_healpix_redshifts could group a variable number of healpix per job so that it doesn't put multiple very large healpix together in a single job.
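
As a rough illustration of that last idea, a greedy grouping that caps both the number of healpix and the total number of inputs per job might look like the sketch below. The function, the cap values, and the input format are hypothetical, not existing desi_healpix_redshifts behavior:

# Illustrative only: group healpix so that no single job accumulates too many
# input exposure rows, and oversized healpix end up in their own jobs.
def group_healpix_jobs(n_inputs_per_healpix, max_inputs=2000, max_healpix=10):
    jobs, current, current_inputs = [], [], 0
    for hpix, n in sorted(n_inputs_per_healpix.items()):
        if current and (current_inputs + n > max_inputs or len(current) >= max_healpix):
            jobs.append(current)
            current, current_inputs = [], 0
        current.append(hpix)
        current_inputs += n
    if current:
        jobs.append(current)
    return jobs

# e.g. a very large healpix (3000 inputs) lands alone in its own job:
# group_healpix_jobs({27256: 3000, 27257: 50, 27258: 975, 27259: 40})
# -> [[27256], [27257, 27258, 27259]]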

The zproc wrapper for emlinefit will need more study to understand why it ran out of memory when the other afterburners did not, and what could be done about it.
