freeze_iers problematically slow on KNL #180
Apologies, I meant to mention @weaverba137, not -at-weaver.
Interesting. My hypothesis is that this has always been a problem, because we've never bothered to change the cache directory to anything other than the default one on the home filesystem. What has changed is that there have been many, many filesystem issues on both home and CFS since everest was run, and that we are now effectively running with a degraded filesystem.
FWIW, /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/f2mk was run on Nov 30 using master+KNL and did not have these super-slow freeze_iers calls. Maybe it was just lucky at that time? E.g. see run/scripts/night/20210702/psfnight-20210702-00096944-a0123456789-timing-51219681.json and run/scripts/night/20210702/psfnight-20210702-00096944-a0123456789-51219681.log, which still has a bunch of those "Cache is locked after 5.01 s..." messages, but was still able to do the freeze in ~3 seconds.
OK, so that raises the question: was this just an unlucky run? In other words, have all runs, every day since 30 November, had this problem, or was it only observed in the most recent run?
I have had many timed-out runs recently, which motivated trying to isolate a specific problem for study, and thus this ticket. The underlying problem may well be on the NERSC side, but astropy caching to $HOME isn't helping, and I'd appreciate someone else digging into this with me to find workarounds on both the astropy and the NERSC side.
I'm just trying to make a scientific assessment of the scope of the problem. The implication from the initial description was that only one run was affected. I don't mean to imply that there isn't an underlying problem with caching on the home directory.
My reading is that the astropy PRs implementing a lock-free cache have been backported to several recent versions of astropy. Could you tell me what version of astropy you have been using for these recent runs? In theory, a sufficiently recent astropy should no longer hit the cache-lock problem.
These were done with the test-21.11 environment, which uses the same underlying desiconda as master and 21.7e (everest), i.e. with astropy 4.0.1.post1. Please test with an environment with a newer astropy to see if the problem is fixed.
As a side note, it might be best to remove desiconda version 20210706-1.5.0 completely. It appears to contain a bad compile of numpy. I don't have the means to quickly create an environment where this can be tested, and it's already well into Friday evening.
I spoke too soon; in fact, I was able to run tests in three environments: astropy/4.0.6, astropy/4.3.1, and astropy/5.0. One exception: one test on haswell with astropy/4.0.6 had a wall time more like 2 minutes; I repeated that test with a fresh session. It may not be fair to compare wall clock time between these systems, since different numbers of nodes are involved. Unless I'm missing something? In no case did I ever see the warning mentioned in the description, nor any warning at all. That's ~1200 individual runs of that Python command.
Thanks for the checks. It's nice to hear that the N>>1 lock WARNING / RuntimeError messages are gone with newer versions of astropy, but it sounds like the updates to the underlying lock mechanism didn't fix whatever the current slowdown is. My primary concern is not the relative speed difference between Haswell and KNL (regardless of number of nodes), but rather between KNL during the summer (Everest) and KNL now: 200-rank freeze_iers used to take seconds and now takes minutes (see log files and json timing files under everest/run/scripts/night/YEARMMDD/). I also don't think …
I think the next step would be to move the cache to a different location.
I was able to set up one test this morning, but I didn't see any significant difference between having the astropy cache dir on HOME versus the alternative location.
For the record, here's the full procedure for moving the config and cache directories:
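(A sketch of one way to do this, assuming astropy's documented XDG behavior: astropy uses $XDG_CONFIG_HOME/astropy and $XDG_CACHE_HOME/astropy only if those directories already exist. The $CSCRATCH paths below are placeholders, not necessarily the directories actually used.)

```bash
# Placeholder locations; any filesystem other than $HOME works the same way.
export XDG_CONFIG_HOME=$CSCRATCH/xdg/config
export XDG_CACHE_HOME=$CSCRATCH/xdg/cache

# astropy only honors these if an "astropy" subdirectory already exists.
mkdir -p $XDG_CONFIG_HOME/astropy $XDG_CACHE_HOME/astropy

# Optionally copy any existing cache contents over.
cp -a ~/.astropy/cache/. $XDG_CACHE_HOME/astropy/

# Confirm where astropy will now look for its config and cache.
python -c "from astropy.config import get_config_dir, get_cache_dir; print(get_config_dir(), get_cache_dir())"
```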
However, I still have not seen any significant change in performance with the config dir shifted. Also note that …
Thanks for these additional checks. Using $TMPDIR to get the node-local RAM disk should provide the best possible cache speed per node, though whatever solution we end up with will need fallbacks so that we don't rely solely upon a NERSC-specific directory or environment variable. However, I'm also wondering if the problem isn't due to writing/reading the cache itself, but something like the network query to IERS etc. It might not even be a NERSC problem, if the IERS server changed policy to throttle simultaneous connections from the same domain. Does astropy (any version) have an "offline mode" where you can configure it to never even attempt a network query (vs. attempt and gracefully fail, albeit potentially slowly)?
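(For concreteness, the closest knobs I can find are astropy's IERS configuration items, available at least in astropy 4.x; this is just a sketch of those settings, not a claim about what the pipeline should do.)

```bash
python -c "
from astropy.utils import iers
iers.conf.auto_download = False  # never attempt a network query for IERS data
iers.conf.auto_max_age = None    # and do not warn/error about the table being stale
print('auto_download =', iers.conf.auto_download)
print('auto_max_age  =', iers.conf.auto_max_age)
"
```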
I hope I'm contributing useful information here; if not, please ignore! There's lots of discussion of IERS and the kinds of issues being reported here on the astropy ticket system. Here are a couple of the more recent ones, with many related issues and PRs, including (I think) the ability to suppress downloading the IERS file. In addition, @dstndstn worked out a solution in the context of the imaging pipeline by pre-downloading the IERS file once, ahead of time. However, this "trick" was for production with a Docker container, so it may not apply or work here.
I think there's a bit of a misunderstanding about what freeze_iers actually does: it loads a frozen IERS table from the desiutil/data directory rather than downloading anything. To directly answer Stephen's comment: my current gut feeling is that reading from the desiutil/data directory may be slow due to unannounced performance degradation on /global/common/software.
Here is a snippet that performs the exact same I/O operations as freeze_iers:
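(A sketch of the kind of I/O involved, assuming freeze_iers reads the frozen IERS table shipped as desiutil/data/iers_frozen.ecsv; the file name and reader options are assumptions to verify against the desiutil source.)

```bash
srun -n 200 python -c "
from pkg_resources import resource_filename
from astropy.table import Table

# assumed location of the frozen IERS table inside the installed desiutil package
path = resource_filename('desiutil', 'data/iers_frozen.ecsv')
t = Table.read(path, format='ascii.ecsv')
print(len(t))
"
```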
I've only had a chance to run one test so far, but the time reported was very similar to the KNL times reported above.
New data point: I'm rerunning the original 200-rank freeze_iers command:
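(Presumably the same command used for the reproductions elsewhere in this thread.)

```bash
srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"
```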
It is taking ~30 seconds instead of ~4 minutes. It also takes ~30 seconds to just do the import, i.e.:
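(Presumably just the bare import, along these lines.)

```bash
srun -n 200 python -c "import desiutil.iers"
```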
Trying again with just the astropy table part:
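(Again a guess at the exact invocation; the table path is the same assumed desiutil data file as above.)

```bash
srun -n 200 python -c "
from pkg_resources import resource_filename
from astropy.table import Table
t = Table.read(resource_filename('desiutil', 'data/iers_frozen.ecsv'), format='ascii.ecsv')
"
```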
took 20 seconds on one run, followed by 3m23s immediately after that, then back to 18 seconds.
That supports my hypothesis that we're seeing performance hits on /global/common/software itself. I don't think examining astropy caching further is worthwhile: as far as I can tell, the cache is not where the time is going. Finally, it would probably be worthwhile to set up a test with a real code profiler.
Some of the slowness on KNL could be related to core/task affinity. I've noticed performance issues in the past when the number of tasks is not equal to the number of allocated cores, and that seems particularly bad on KNL. That doesn't explain changes in behavior from a few months ago, though, so it might be worth pursuing with a NERSC ticket. Here are some warnings about this in the NERSC docs: https://docs.nersc.gov/jobs/affinity/#cori-knl-slurm-affinity

Reproduce slow behavior in a 3 KNL node interactive job:
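(Roughly; the salloc options are the standard NERSC interactive-QOS ones and may need adjusting.)

```bash
# 3 KNL nodes, interactive QOS
salloc --nodes=3 --constraint=knl --qos=interactive --time=30

# the original 200-rank test
time srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"
```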
Try with 68 cores per KNL node * 3 KNL nodes -> 204 tasks:
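(Presumably the same command with 204 ranks.)

```bash
time srun -n 204 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"
```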
Use --cpu-bind=cores:
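(Again a sketch of the presumed command.)

```bash
time srun -n 204 --cpu-bind=cores python -c "import desiutil.iers; desiutil.iers.freeze_iers()"
```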
You can also set this with an environment variable:
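(Presumably the same variable used in the batch script further down.)

```bash
export SLURM_CPU_BIND=cores
time srun -n 204 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"
```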
@dmargala thanks for these suggestions, but I suspect that in this case the fluctuations are random, not actually tied to --cpu-bind=cores, -n 200 vs. -n 204, etc. E.g., rerunning your examples now:
Last Friday a ~4 minute runtime was quite reproducible, but now ~30s seems common, with occasional fluctuations to 3-4 minutes. @weaverba137 agreed that the evidence is pointing towards /global/common/software performance fluctuations, not caching. FYI, see DESI-6042 for some tips about profiling single-core code. I'm not adept at multi-core profiling, but @dmargala is and could perhaps point us at some NERSC docs/presentations.
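(For example, a generic per-rank cProfile run along these lines; this is only a sketch, not the DESI-6042 recipe, and the output file naming is arbitrary.)

```bash
srun -n 200 python -c "
import cProfile, os
# one profile file per MPI rank, named by SLURM_PROCID
cProfile.run('import desiutil.iers; desiutil.iers.freeze_iers()',
             'freeze_iers.prof.' + os.environ.get('SLURM_PROCID', '0'))
"

# inspect one rank's profile afterwards, sorted by cumulative time
python -c "import pstats; pstats.Stats('freeze_iers.prof.0').sort_stats('cumulative').print_stats(20)"
```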
I see fairly consistent behavior. I have not seen fluctuations like you are reporting. The table below shows trials over a 30 minute period where the cpu binding method is chosen at random.
Here's the batch script that I used to generate the rows for that table.

```bash
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --constraint=knl
#SBATCH --time=30
#SBATCH --nodes=3

for i in $(seq 0 100); do
    number=$RANDOM
    let "number %= 2"
    if [ "$number" -eq "0" ]; then
        export SLURM_CPU_BIND=none
    else
        export SLURM_CPU_BIND=cores
    fi
    t=$(/usr/bin/time -f "%e" bash -c 'srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()" >/dev/null 2>&1' 2>&1)
    echo "$i|SLURM_CPU_BIND=$SLURM_CPU_BIND|$t"
done
```

**crude profiling with strace**

The output from strace is quite verbose, but I see some different characteristics in the distribution of time spent in memory-related syscalls. The following is a snippet of strace output from a run with …
Here is a comparison between the two cpu binding modes of the time spent in those syscalls: …
Commands to generate strace output and quick analysis from the command line:

```bash
# generate per-rank strace output (this will create 200 output files in the working directory)
time srun -n 200 bash -c 'strace -T -o strace.out.$SLURM_PROCID python -s -c "import desiutil.iers; desiutil.iers.freeze_iers()"'

# sort by the per-call time (reverse the fields on each line, sort, then reverse again to restore the original field order)
cat strace.out.48 | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' | sort -V | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }'

# count occurrences of a specific syscall (mmap in this example)
cat strace.out.48 | grep mmap | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' | sort -V | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' | wc -l
```
@dmargala that does look pretty convincing :) Thanks for the details. Perhaps my …
After live discussions with @sbailey, we are closing this ticket in favor of using …
Addendum: we have to be careful with …
desiutil.iers.freeze_iers is problematically slow on KNL with MPI jobs of a few hundred ranks. This is much worse than it was for everest, where it typically took ~3 seconds but now takes ~4 minutes (command sketched below). Both before and after that ran, there is no /global/homes/s/sjbailey/.astropy/cache/download/py3/lock directory, so the "Cache is locked" warning messages are self-inflicted from just that command.
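(Presumably the same 200-rank command used for the reproductions later in this thread.)

```bash
srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"
```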
On Haswell it isn't super fast, but at least it runs in ~20 seconds instead of >4 minutes (it requires 7 nodes instead of 3 to fit 200 ranks). FWIW, also using 7 nodes on KNL to spread out the I/O brought it down to ~2 minutes, but that is still pretty slow.
Items to study:

- Would $XDG_CACHE_DIR pointing to the node-local ramdisk help?

@weaver could you look into this performance issue and follow up with NERSC as needed?