
freeze_iers problematically slow on KNL #180

Closed
sbailey opened this issue Dec 3, 2021 · 26 comments

@sbailey (Contributor) commented Dec 3, 2021

desiutil.iers.freeze_iers is problematically slow on KNL with MPI jobs of a few hundred ranks. This is much worse than it was for everest, when it typically took ~3 seconds; now it takes ~4 minutes:

salloc -N 3 -q interactive -t 1:00:00 -C knl
time srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"
...
WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.01 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /global/homes/s/sjbailey/.astropy/cache/download/py3/lock. Lock claims to be held by process 153915.') [astropy.time.core]
WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.01 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /global/homes/s/sjbailey/.astropy/cache/download/py3/lock. Lock claims to be held by process 153915.') [astropy.time.core]
WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.01 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /global/homes/s/sjbailey/.astropy/cache/download/py3/lock. Lock claims to be held by process 153915.') [astropy.time.core]

real	4m17.687s
user	0m0.478s
sys	0m0.217s

Both before and after running that command, there is no /global/homes/s/sjbailey/.astropy/cache/download/py3/lock directory, so those lock messages are self-inflicted by that command alone.

On Haswell it isn't super fast, but at least runs in ~20 seconds instead of >4 minutes (requires 7 nodes instead of 3 to fit 200 ranks):

salloc -N 7 -q interactive -t 1:00:00 -C haswell
time srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"
...
[same messages]
...
real	0m20.578s
user	0m0.071s
sys	0m0.065s

FWIW, also using 7 nodes on KNL to spread out the I/O brought it down to 2 minutes, but that is still pretty slow.

Items to study:

@weaver could you look into this performance issue and follow up with NERSC as needed?

@sbailey (Contributor, Author) commented Dec 3, 2021

Apologies, meant to mention @weaverba137 not -at-weaver

@weaverba137 (Member) commented:

Interesting. My hypothesis is that this has always been a problem, because we've never bothered to change the cache directory to anything other than the default one on the home filesystem. What has changed is that there have been many, many filesystem issues both on home and CFS since everest was run, and that we are now effectively running with a degraded filesystem.

@sbailey (Contributor, Author) commented Dec 3, 2021

FWIW, /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/f2mk was run on Nov 30 using master+KNL and did not have these super-slow freeze_iers. Maybe it was just lucky at that time? e.g. see run/scripts/night/20210702/psfnight-20210702-00096944-a0123456789-timing-51219681.json and run/scripts/night/20210702/psfnight-20210702-00096944-a0123456789-51219681.log, which still has a bunch of those "Cache is locked after 5.01 s..." messages, but was still able to do the freeze in ~3 seconds.

@weaverba137 (Member) commented:

OK, so that raises the question: was this just an unlucky run? In other words, has every run since 30 November had this problem, or was it only observed in the most recent run?

@sbailey (Contributor, Author) commented Dec 3, 2021

I have had many runs time out recently, which motivated isolating a specific problem for study, hence this ticket. The underlying problem may well be on the NERSC side, but astropy caching to $HOME isn't helping, and I'd appreciate someone else running with this to find workarounds on both the astropy and the NERSC sides.

@weaverba137 (Member) commented:

I'm just trying to make a scientific assessment of the scope of the problem. The implication from the initial description was that only one run was affected. I don't mean to imply that there isn't an underlying problem with caching on the home directory.

@weaverba137 (Member) commented:

My reading is that the astropy PRs implementing a lock-free cache have been backported to several recent versions of astropy. Could you tell me what version of astropy you have been using for these recent runs? In theory, v5.0, v4.3.1, and v4.0.6 should all have this patch.

@sbailey (Contributor, Author) commented Dec 3, 2021

These were done with the test-21.11 environment, which uses the same underlying desiconda as master and 21.7e (everest), i.e. with astropy 4.0.1.post1. Please test with an environment with a newer astropy to see if the problem is fixed.

@weaverba137 (Member) commented:

As a side note, it might be best to remove desiconda version 20210706-1.5.0 completely. It appears to contain a bad compile of numpy.

I don't have the means to quickly create an environment where this can be tested, and it's already well into Friday evening.

@weaverba137 (Member) commented:

I spoke too soon. In fact, I was able to run tests in three environments: astropy 4.0.6, 4.3.1, and 5.0. On knl and haswell I got user/sys/wall times very similar to what you reported, for all versions. That's six different tests.

Exception: one haswell/astropy 4.0.6 test had a wall time closer to 2 minutes. When I repeated the test with a fresh salloc, I got a wall-clock time comparable to the other haswell results.

It may not be fair to compare wall-clock times between these systems, since different numbers of nodes are involved. Unless I'm missing something?

In no case did I ever see the warning mentioned in the description. Nor any warning at all. That's ~1200 individual runs of that Python command.

@sbailey (Contributor, Author) commented Dec 6, 2021

Thanks for the checks. It's nice to hear that the N>>1 lock WARNING / RuntimeError messages are gone with newer versions of astropy, but it sounds like the updates to the underlying lock mechanism didn't fix whatever the current slowdown is.

My primary concern is not the relative speed difference between Haswell and KNL (regardless of number of nodes), but rather between KNL during the summer (Everest) and KNL now. 200 rank freeze_iers used to take seconds and now takes minutes (see log files and json timing files under everest/run/scripts/night/YEARMMDD/). I also don't think freeze_iers is uniquely the problem; it's just a well isolated example of the general slowdown at NERSC for debugging what potential workarounds might be (e.g. caching to non-$HOME locations).

@weaverba137 (Member) commented:

I think the next step would be to move the cache to /global/common/software, rather than CFS or HOME.

@weaverba137 (Member) commented:

I was able to set up one test this morning, but I didn't see any significant difference between having the astropy cache dir on HOME versus /global/common/software.

@weaverba137 (Member) commented:

For the record, here's the full procedure for moving the config and cache directories:

$ mkdir -p /global/common/software/desi/users/bweaver/astropy/cache/astropy
$ mkdir -p /global/common/software/desi/users/bweaver/astropy/config/astropy
$ export XDG_CACHE_HOME=/global/common/software/desi/users/bweaver/astropy/cache
$ export XDG_CONFIG_HOME=/global/common/software/desi/users/bweaver/astropy/config
$ python -c "from astropy.config import create_config_file; create_config_file('astropy')"
INFO: The configuration file has been successfully written to /global/common/software/desi/users/bweaver/astropy/config/astropy/astropy.cfg [astropy.config.configuration]

However, I still have not seen any significant changes in performance with the config dir shifted. Also note that /global/common/software/ is mounted read-only on the compute nodes. I had forgotten about that.

@sbailey (Contributor, Author) commented Dec 6, 2021

Thanks for these additional checks. Using $TMPDIR to get the node-local RAM disk should provide the best possible per-node cache speed, though whatever solution we end up with will need fallbacks so that we don't rely solely upon a NERSC-specific directory or environment variable.

However, I'm also wondering whether the problem is not reading/writing the cache itself, but something like the network query to the IERS server. It might not even be a NERSC problem, if the IERS server changed policy to throttle simultaneous connections from the same domain.

Does astropy (any version) have an "offline mode" where you can configure it to never even attempt a network query (vs. attempt and gracefully fail, albeit potentially slowly)?
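
To make the $TMPDIR idea concrete, here is a rough sketch (untested at scale) of choosing a fast, writable cache location with a HOME fallback before astropy is imported, following the XDG_CACHE_HOME procedure shown above; the astropy-cache directory name is just an example:

import os

# Sketch only: prefer the node-local RAM disk via TMPDIR, falling back to HOME.
for base in (os.environ.get("TMPDIR"), os.path.expanduser("~")):
    if base and os.path.isdir(base) and os.access(base, os.W_OK):
        cache_root = os.path.join(base, "astropy-cache")
        # astropy looks for an "astropy" subdirectory under XDG_CACHE_HOME,
        # as in the mkdir steps shown earlier in this thread
        os.makedirs(os.path.join(cache_root, "astropy"), exist_ok=True)
        os.environ["XDG_CACHE_HOME"] = cache_root
        break

# set the environment before importing astropy/desiutil so the new cache dir is seen
import desiutil.iers
desiutil.iers.freeze_iers()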

@moustakas (Member) commented:

I hope I'm contributing useful information here; if not, please ignore!

There's lots of discussion of IERS and the kinds of issues that are being reported on the astropy ticket system. Here are a couple of the more recent ones with many related issues and PRs, including (I think) the ability to suppress downloading the IERS file--
astropy/astropy#10457
astropy/astropy#9227

In addition, @dstndstn worked out a solution in the context of the imaging pipeline by pre-downloading the IERS file once, ahead of time--
https://github.com/legacysurvey/legacypipe/blob/main/docker-nersc/Dockerfile#L224-L226

However, this "trick" was for production with a Docker container, so it may not apply or work here.

@weaverba137 (Member) commented:

I think there's a bit of a misunderstanding about what desiutil.iers.freeze_iers() actually does. If it is called properly, it should never make a network call to any remote server. In contrast, desiutil.iers.update_iers() downloads a file that is added to desiutil/data, and that file is read by freeze_iers(). In other words, a local file read is substituted for a network call. If we detect that freeze_iers() is actually making a network request, we have bigger problems than slow performance.

To directly answer Stephen's comment, freeze_iers() is meant to do exactly the steps to put astropy into an "offline mode".

My current gut feeling is that reading from the desiutil/data directory may be slow due to unannounced performance degradation on /global/common/software. It would be great if we could get a code profile, but I'm not an expert on how to do that, especially in parallel operation.
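
For anyone unfamiliar with those knobs, here is a rough illustration of that kind of "offline mode" configuration using the astropy.utils.iers configuration items; this is a sketch of the idea, not the actual desiutil implementation:

from astropy.table import Table
from astropy.utils import iers
from astropy.utils.data import get_pkg_data_path

# never attempt to download IERS data, and don't complain about the table's age
iers.conf.auto_download = False
iers.conf.auto_max_age = None

# substitute the frozen, package-local table for anything astropy would fetch
name = get_pkg_data_path('data/iers_frozen.ecsv', package='desiutil')
frozen = Table.read(name, format='ascii.ecsv').filled()
iers.IERS.iers_table = frozen
iers.IERS_Auto.iers_table = frozen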

@weaverba137 (Member) commented:

Here is a snippet that performs the exact same IO operations as desiutil.iers.freeze_iers():

time srun -n 200 python -c "from astropy.utils.data import get_pkg_data_path; from astropy.table import Table; name = get_pkg_data_path('data/iers_frozen.ecsv', package='desiutil'); t = Table.read(name, format='ascii.ecsv').filled(); print(t[0])"

I've only had a chance to run one test so far, but the time reported was very similar to the knl times reported above.

@sbailey (Contributor, Author) commented Dec 7, 2021

New data point: I'm rerunning the original

time srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"

and it is taking ~30 seconds instead of ~4 minutes. It also takes ~30 seconds to just do the import, i.e. freeze_iers() itself isn't the current slow part:

time srun -n 200 python -c "import desiutil.iers; desiutil.iers"

real	0m35.276s
user	0m0.438s
sys	0m0.278s

Trying again with just the astropy table part:

time srun -n 200 python -c "from astropy.table import Table"

took 20 seconds on one run followed by 3m23 sec immediately after that, then back to 18 sec.
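
In case it helps isolate things further, here is a small per-rank timing sketch (standard library only; SLURM_PROCID is set by srun) that separates the import time from the freeze_iers() time:

import os
import time

rank = os.environ.get("SLURM_PROCID", "0")

t0 = time.perf_counter()
import desiutil.iers  # deliberately imported after the timer starts
t1 = time.perf_counter()
desiutil.iers.freeze_iers()
t2 = time.perf_counter()

print("rank {}: import {:.2f}s, freeze_iers {:.2f}s".format(rank, t1 - t0, t2 - t1))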

@weaverba137 (Member) commented:

That supports my hypothesis that we're seeing performance hits on /global/common/software. I think that is the path that should be pursued with NERSC.

I don't think examining astropy caching further is worthwhile: as far as I can tell freeze_iers() doesn't cache anything. However, it does perform IO on /global/common/software.

Finally, it would probably be worthwhile to set up a test with a real code profiler.
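
One possible approach, sketched here with the standard-library cProfile writing one stats file per Slurm rank (the file names are arbitrary):

import cProfile
import os
import pstats

rank = os.environ.get("SLURM_PROCID", "0")
profile_file = "freeze_iers_profile.{}".format(rank)

# profile the import plus the freeze, writing stats to one file per rank
cProfile.run("import desiutil.iers; desiutil.iers.freeze_iers()", profile_file)

# print the 10 most expensive calls (by cumulative time) for this rank
pstats.Stats(profile_file).sort_stats("cumulative").print_stats(10)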

@dmargala commented Dec 7, 2021

Some of the slowness on KNL could be related to core/task affinity. I've noticed performance issues in the past when the number of tasks is not equal to the number of allocated cores, and that seems particularly bad on KNL. That doesn't explain the change in behavior from a few months ago, though, so it might be worth pursuing with a NERSC ticket.

Here are some warnings about this in the NERSC docs:

https://docs.nersc.gov/jobs/affinity/#cori-knl-slurm-affinity
https://docs.nersc.gov/performance/io/knl/#process-affinity

Reproduce slow behavior in 3 KNL node interactive job:

time srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"

real	3m39.741s
user	0m0.439s
sys	0m0.297s

Try with 68 cores per KNL node * 3 KNL nodes -> 204 tasks:

time srun -n 204 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"

real	0m26.951s
user	0m0.408s
sys	0m0.310s

Use --cpu-bind=cores srun option:

time srun -n 200 --cpu-bind=cores python -c "import desiutil.iers; desiutil.iers.freeze_iers()"

real	0m23.129s
user	0m0.395s
sys	0m0.325s

You can also set this with an env var SLURM_CPU_BIND=cores:

time SLURM_CPU_BIND=cores srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"

real	0m24.940s
user	0m0.399s
sys	0m0.319s

@sbailey (Contributor, Author) commented Dec 7, 2021

@dmargala thanks for these suggestions but I suspect that in this case the fluctuations are random, not actually tied to cpu-bind=cores, -n 200, -n 204 etc. e.g. rerunning your examples now:

| options | Daniel | Stephen |
|---------|--------|---------|
| -n 200 | 3m40s | 41s |
| -n 204 | 27s | 30s |
| -n 200 --cpu-bind=cores | 25s | 40s |

Last Friday a ~4 minute runtime was quite reproducible, but now ~30s seems common with occasional fluctuations to 3-4 minutes.

@weaverba137 agreed that the evidence is pointing towards /global/common/software performance fluctuations, not caching. FYI, see DESI-6042 for some tips about profiling single-core code. I'm not adept at multi-core profiling, but @dmargala is and could perhaps point us at some NERSC docs/presentations.

@dmargala commented Dec 7, 2021

I see fairly consistent behavior. I have not seen fluctuations like you are reporting. The table below shows trials over a 30 minute period where the cpu binding method is chosen at random.

| Trial | Setting | Elapsed Time (seconds) |
|-------|---------|------------------------|
| 0 | SLURM_CPU_BIND=none | 220.18 |
| 1 | SLURM_CPU_BIND=none | 228.21 |
| 2 | SLURM_CPU_BIND=cores | 24.52 |
| 3 | SLURM_CPU_BIND=none | 216.42 |
| 4 | SLURM_CPU_BIND=cores | 23.74 |
| 5 | SLURM_CPU_BIND=none | 217.03 |
| 6 | SLURM_CPU_BIND=cores | 23.71 |
| 7 | SLURM_CPU_BIND=cores | 23.73 |
| 8 | SLURM_CPU_BIND=cores | 20.84 |
| 9 | SLURM_CPU_BIND=none | 220.06 |
| 10 | SLURM_CPU_BIND=none | 218.94 |
| 11 | SLURM_CPU_BIND=cores | 22.97 |
| 12 | SLURM_CPU_BIND=none | 225.48 |

Here's the batch script that I used to generate the rows for that table.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --constraint=knl
#SBATCH --time=30
#SBATCH --nodes=3

for i in $(seq 0 100); do
    # randomly choose the cpu-binding mode for this trial
    number=$RANDOM
    let "number %= 2"
    if [ "$number" -eq "0" ]; then
        export SLURM_CPU_BIND=none
    else
        export SLURM_CPU_BIND=cores
    fi
    # record the elapsed wall-clock time (seconds) of the 200-rank freeze_iers test
    t=$(/usr/bin/time -f "%e" bash -c 'srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()" >/dev/null 2>&1' 2>&1)
    echo "$i|SLURM_CPU_BIND=$SLURM_CPU_BIND|$t"
done

crude profiling with strace

The output from strace is quite verbose, but I see some different characteristics in the distribution of time spent in the memory-related syscalls brk and mmap.

The following is a snippet of strace output from a run with SLURM_CPU_BIND=none. The number between the < > at the end of each line is the time spent in the system call (in seconds):

...
brk(0x555556937000)                     = 0x555556937000 <0.255716>
brk(0x555556959000)                     = 0x555556959000 <0.018900>
brk(0x55555697c000)                     = 0x55555697c000 <0.013232>
brk(0x55555699e000)                     = 0x55555699e000 <0.049271>
brk(0x5555569c0000)                     = 0x5555569c0000 <0.054858>
brk(0x5555569e2000)                     = 0x5555569e2000 <0.019420>
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaae1e00000 <0.554662>
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaae1e40000 <0.235079>
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaae1e80000 <0.720408>
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaae1ec0000 <0.150882>
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaae1f00000 <0.828170>
brk(0x555556a03000)                     = 0x555556a03000 <0.023716>
brk(0x555556a44000)                     = 0x555556a44000 <0.177744>
brk(0x555556a84000)                     = 0x555556a84000 <0.732115>
brk(0x555556aa5000)                     = 0x555556aa5000 <0.500225>
brk(0x555556ae5000)                     = 0x555556ae5000 <0.463878>
brk(0x555556b06000)                     = 0x555556b06000 <0.584732>
brk(0x555556b46000)                     = 0x555556b46000 <0.514777>
...

Here is a comparison of the brk and mmap syscalls between the two cpu-binding modes, with crude estimates of the 90th and 99th percentiles of time spent per call (seconds):

| Setting | syscall | Count | ~90% | ~99% |
|---------|---------|-------|------|------|
| SLURM_CPU_BIND=none | mmap | 937 | 0.036 | 0.85 |
| SLURM_CPU_BIND=none | brk | 175 | 0.392 | 0.784 |
| SLURM_CPU_BIND=cores | mmap | 923 | 0.00007 | 0.0002 |
| SLURM_CPU_BIND=cores | brk | 170 | 0.00005 | 0.003 |

Commands to generate strace output and quick analysis from the command line:

# generate per rank strace output (this will create 200 output files in the working directory)
time srun -n 200 bash -c 'strace -T -o strace.out.$SLURM_PROCID python -s -c "import desiutil.iers; desiutil.iers.freeze_iers()"'

# sort by time (reverse lines, sort, reverse lines again to restore original order)
cat strace.out.48 | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' | sort -V | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' 

# count occurrences of a specific syscall 
cat strace.out.48 | grep mmap | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' | sort -V | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' | wc -l
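
For anyone who wants to reproduce the percentile numbers above, here is a rough Python equivalent of that analysis (it assumes strace.out.<rank> files in the current working directory, as generated by the srun command above; the percentile estimates are crude, as noted):

import glob
import re
from collections import defaultdict

# strace -T appends "<seconds>" to each line; pull out the syscall name and the time
pattern = re.compile(r'^(\w+)\(.*<([\d.]+)>\s*$')

times = defaultdict(list)
for path in glob.glob('strace.out.*'):
    with open(path) as stream:
        for line in stream:
            match = pattern.match(line.strip())
            if match and match.group(1) in ('brk', 'mmap'):
                times[match.group(1)].append(float(match.group(2)))

for syscall, values in sorted(times.items()):
    values.sort()
    p90 = values[int(0.90 * (len(values) - 1))]
    p99 = values[int(0.99 * (len(values) - 1))]
    print('{}: count={} ~90%={:.5f}s ~99%={:.5f}s'.format(syscall, len(values), p90, p99))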

@sbailey (Contributor, Author) commented Dec 7, 2021

@dmargala that does look pretty convincing :) Thanks for the details. Perhaps my -n 200 run was a lucky downward fluctuation. At minimum we can add --cpu-bind=cores to our pipeline srun calls; I don't see any downside to that (other than verbosity).

@weaverba137 (Member) commented:

After live discussions with @sbailey, we are closing this ticket in favor of using --cpu-bind=cores more widely. We are also pursuing INC0178137 with the NERSC help desk.

@sbailey (Contributor, Author) commented Dec 8, 2021

Addendum: we have to be careful with --cpu-bind=cores in any steps that use multiprocessing. Empirically it appears to also bind the multiprocessing child processes to the same core as the parent process, rendering multiprocessing useless because the workers aren't fanning out to other cores.
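
One possible workaround, sketched below but not something we've adopted, would be to explicitly widen the CPU affinity inherited from --cpu-bind=cores inside each worker process; os.sched_setaffinity is Linux-only and the pool size here is arbitrary:

import multiprocessing
import os

def widen_affinity():
    # undo the single-core binding inherited from the srun-launched parent process
    os.sched_setaffinity(0, range(os.cpu_count()))

def work(i):
    return i * i

if __name__ == '__main__':
    # run each worker through the affinity-widening initializer before any work
    with multiprocessing.Pool(processes=8, initializer=widen_affinity) as pool:
        print(pool.map(work, range(16)))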
