How do I reduce Pilgrim overhead? #38
Comments
Did you see any output suggesting the simulation was still running? Going from 15 seconds to 3 hours doesn't look like an overhead issue. More likely it is a deadlock or blocking bug in Pilgrim. Can you share your LAMMPS configuration file? I could try it on my side.
Sure thing! Here is the input I mentioned:
`# 3d Lennard-Jones melt
variable N index off # Newton Setting
variable x index 1
variable xx equal $x
newton $N
units lj
lattice fcc 0.8442
velocity all create 1.44 87287 loop geom
pair_style lj/cut 2.5
neighbor 0.3 bin
fix 1 all nve
if "$p > 0" then "run_style verlet/power"
if "$w > 0" then "run $w"`
Just tried your input and a few other configurations for LAMMPS and they all worked fine on my side.
Any scientific application (LAMMPS, WarpX) I try to plug Pilgrim into ends up not being able to run at all. For example, I used a very simple LAMMPS Lennard-Jones (LJ) case with a very small problem size that finishes in about 15 seconds on a single node. When I turn on Pilgrim like so:
`#!/bin/bash -l
#SBATCH ...
ml load cray-mpich/8.1.25
ml load PrgEnv-gnu/8.3.3
export PILGRIM_INSTALL=""
export PILGRIM_DEBUG=0
export PILGRIM_TIMING_MODE=ZSTD # or LOSSLESS, or AGGREGATED; I've tried them all
export PILGRIM_TRACING=ON
export PILGRIM_TRACING_MODE=DEFAULT
pilgrim_flags="--export=ALL,LD_PRELOAD=${PILGRIM_INSTALL}/.libs/libpilgrim.so"
EXE=../bin/warpx.3d.MPI.CUDA.DP.PDP.OPMD.QED
INPUTS=./inputs
export MPICH_OFI_NIC_POLICY=GPU
GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"
SRUN_FLAGS="--cpus-per-task=16 --cpu-bind=cores"
srun --cpu-bind=cores $pilgrim_flags bash -c "
export CUDA_VISIBLE_DEVICES=$((3-SLURM_LOCALID));
${EXE} ${INPUTS} ${GPU_AWARE_MPI}" \
${PILGRIM_INSTALL}/pilgrim2text ./pilgrim-logs`
The job times out after 3 hours. Any suggestions to reduce the overhead so I can get the job to finish? I have been successful with small test cases but not with any "real world" apps.
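A side note on the script as posted, which may or may not be related to the timeout: inside a double-quoted `bash -c "..."` string, an unescaped `$((3-SLURM_LOCALID))` is evaluated by the submitting shell, where `SLURM_LOCALID` is normally unset, rather than by each task. It is possible the backslash was simply lost when the script was pasted. The sketch below reproduces the difference without Slurm by setting `SLURM_LOCALID=2` by hand for the child shell; that value is purely illustrative.

```bash
#!/bin/bash
# Mimic the batch-script context, where SLURM_LOCALID is not defined.
unset SLURM_LOCALID

# Unescaped: the outer shell expands the arithmetic before the child starts,
# so the child only ever sees the literal result.
unescaped=$(SLURM_LOCALID=2 bash -c "echo $((3-SLURM_LOCALID))")

# Escaped: the expression reaches the child shell intact and is evaluated
# there, using the child's own SLURM_LOCALID.
escaped=$(SLURM_LOCALID=2 bash -c "echo \$((3-SLURM_LOCALID))")

echo "unescaped=${unescaped} escaped=${escaped}"
# prints: unescaped=3 escaped=1
```

If the job script really contains the unescaped form, every task on a node would get the same `CUDA_VISIBLE_DEVICES` value; writing `\$((3-SLURM_LOCALID))` defers the evaluation to each task.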