
freeze_iers problematically slow on KNL #180

Closed
sbailey opened this issue Dec 3, 2021 · 26 comments

@sbailey (Contributor) commented Dec 3, 2021

desiutil.iers.freeze_iers is problematically slow on KNL with MPI jobs of a few hundred ranks. This is much worse than it was for everest, when it typically took ~3 seconds; now it takes ~4 minutes:

salloc -N 3 -q interactive -t 1:00:00 -C knl
time srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"
...
WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.01 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /global/homes/s/sjbailey/.astropy/cache/download/py3/lock. Lock claims to be held by process 153915.') [astropy.time.core]
WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.01 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /global/homes/s/sjbailey/.astropy/cache/download/py3/lock. Lock claims to be held by process 153915.') [astropy.time.core]
WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.01 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /global/homes/s/sjbailey/.astropy/cache/download/py3/lock. Lock claims to be held by process 153915.') [astropy.time.core]

real	4m17.687s
user	0m0.478s
sys	0m0.217s

Both before and after running that command, there is no /global/homes/s/sjbailey/.astropy/cache/download/py3/lock directory, so those lock messages are self-inflicted by that command alone.

On Haswell it isn't super fast, but at least runs in ~20 seconds instead of >4 minutes (requires 7 nodes instead of 3 to fit 200 ranks):

salloc -N 7 -q interactive -t 1:00:00 -C haswell
time srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"
...
[same messages]
...
real	0m20.578s
user	0m0.071s
sys	0m0.065s

FWIW, also using 7 nodes on KNL to spread out the I/O brought it down to 2 minutes, but that is still pretty slow.

Items to study:

@weaver could you look into this performance issue and follow up with NERSC as needed?

@sbailey (Contributor, Author) commented Dec 3, 2021

Apologies, meant to mention @weaverba137 not -at-weaver

@weaverba137 (Member) commented:

Interesting. My hypothesis is that this has always been a problem, because we've never bothered to change the cache directory to anything other than the default one on the home filesystem. What has changed is that there have been many, many filesystem issues both on home and CFS since everest was run, and that we are now effectively running with a degraded filesystem.

@sbailey (Contributor, Author) commented Dec 3, 2021

FWIW, /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/f2mk was run on Nov 30 using master+KNL and did not have these super-slow freeze_iers. Maybe it was just lucky at that time? e.g. see run/scripts/night/20210702/psfnight-20210702-00096944-a0123456789-timing-51219681.json and run/scripts/night/20210702/psfnight-20210702-00096944-a0123456789-51219681.log, which still has a bunch of those "Cache is locked after 5.01 s..." messages, but was still able to do the freeze in ~3 seconds.

@weaverba137 (Member) commented:

OK, so that raises the question: was this just an unlucky run? In other words, has every run since 30 November had this problem, or was it only observed in the most recent run?

@sbailey (Contributor, Author) commented Dec 3, 2021

I have had many runs time out recently, which motivated isolating a specific problem for study, hence this ticket. The underlying problem may well be on the NERSC side, but astropy caching to $HOME isn't helping, and I'd appreciate someone else running with this to find workarounds on both the astropy and the NERSC sides.

@weaverba137 (Member) commented:

I'm just trying to make a scientific assessment of the scope of the problem. The implication from the initial description was that only one run was affected. I don't mean to imply that there isn't an underlying problem with caching on the home directory.

@weaverba137 (Member) commented:

My reading is that the astropy PRs implementing a lock-free cache have been backported to several recent versions of astropy. Could you tell me what version of astropy you have been using for these recent runs? In theory, v5.0, v4.3.1, and v4.0.6 should all have this patch.

@sbailey (Contributor, Author) commented Dec 3, 2021

These were done with the test-21.11 environment, which uses the same underlying desiconda as master and 21.7e (everest), i.e. with astropy 4.0.1.post1. Please test with an environment with a newer astropy to see if the problem is fixed.

@weaverba137 (Member) commented:

As a side note, it might be best to remove desiconda version 20210706-1.5.0 completely. It appears to contain a bad compile of numpy.

I don't have the means to quickly create an environment where this can be tested, and it's already well into Friday evening.

@weaverba137 (Member) commented:

I spoke too soon. In fact, I was able to run tests in three environments: astropy 4.0.6, 4.3.1, and 5.0. On knl and haswell I got user/sys/wall times very similar to what you reported, for all versions. That's six different tests.

Exception: one haswell/astropy 4.0.6 test had a wall time closer to 2 minutes. When I repeated the test with a fresh salloc, I got a wall-clock time comparable to the other haswell results.

It may not be fair to compare wall-clock times between these systems, since different numbers of nodes are involved. Unless I'm missing something?

In no case did I ever see the warning mentioned in the description. Nor any warning at all. That's ~1200 individual runs of that Python command.

@sbailey (Contributor, Author) commented Dec 6, 2021

Thanks for the checks. It's nice to hear that the N>>1 lock WARNING / RuntimeError messages are gone with newer versions of astropy, but it sounds like the updates to the underlying lock mechanism didn't fix whatever the current slowdown is.

My primary concern is not the relative speed difference between Haswell and KNL (regardless of number of nodes), but rather between KNL during the summer (Everest) and KNL now. 200 rank freeze_iers used to take seconds and now takes minutes (see log files and json timing files under everest/run/scripts/night/YEARMMDD/). I also don't think freeze_iers is uniquely the problem; it's just a well isolated example of the general slowdown at NERSC for debugging what potential workarounds might be (e.g. caching to non-$HOME locations).

@weaverba137 (Member) commented:

I think the next step would be to move the cache to /global/common/software, rather than CFS or HOME.

@weaverba137 (Member) commented:

I was able to set up one test this morning, but I didn't see any significant difference between having the astropy cache dir on HOME versus /global/common/software.

@weaverba137 (Member) commented:

For the record, here's the full procedure for moving the config and cache directories:

$ mkdir -p /global/common/software/desi/users/bweaver/astropy/cache/astropy
$ mkdir -p /global/common/software/desi/users/bweaver/astropy/config/astropy
$ export XDG_CACHE_HOME=/global/common/software/desi/users/bweaver/astropy/cache
$ export XDG_CONFIG_HOME=/global/common/software/desi/users/bweaver/astropy/config
$ python -c "from astropy.config import create_config_file; create_config_file('astropy')"
INFO: The configuration file has been successfully written to /global/common/software/desi/users/bweaver/astropy/config/astropy/astropy.cfg [astropy.config.configuration]

However, I still have not seen any significant changes in performance with the config dir shifted. Also note that /global/common/software/ is mounted read-only on the compute nodes. I had forgotten about that.

@sbailey (Contributor, Author) commented Dec 6, 2021

Thanks for these additional checks. Using $TMPDIR to get the node-local RAM disk should provide the best possible per-node cache speed, though whatever solution we end up with will need fallbacks so that we don't rely solely upon a NERSC-specific directory or environment variable.

However, I'm also wondering whether the problem is not reading/writing the cache itself, but something like the network query to the IERS server. It might not even be a NERSC problem, if the IERS server changed policy to throttle simultaneous connections from the same domain.

Does astropy (any version) have an "offline mode" where you can configure it to never even attempt a network query (vs. attempt and gracefully fail, albeit potentially slowly)?
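
To make the $TMPDIR idea concrete, here is a rough sketch (untested at scale) of choosing a fast, writable cache location with a HOME fallback before astropy is imported, following the XDG_CACHE_HOME procedure shown above; the astropy-cache directory name is just an example:

import os

# Sketch only: prefer the node-local RAM disk via TMPDIR, falling back to HOME.
for base in (os.environ.get("TMPDIR"), os.path.expanduser("~")):
    if base and os.path.isdir(base) and os.access(base, os.W_OK):
        cache_root = os.path.join(base, "astropy-cache")
        # astropy looks for an "astropy" subdirectory under XDG_CACHE_HOME,
        # as in the mkdir steps shown earlier in this thread
        os.makedirs(os.path.join(cache_root, "astropy"), exist_ok=True)
        os.environ["XDG_CACHE_HOME"] = cache_root
        break

# set the environment before importing astropy/desiutil so the new cache dir is seen
import desiutil.iers
desiutil.iers.freeze_iers()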

@moustakas (Member) commented:

I hope I'm contributing useful information here; if not, please ignore!

There's lots of discussion of IERS and the kinds of issues that are being reported on the astropy ticket system. Here are a couple of the more recent ones with many related issues and PRs, including (I think) the ability to suppress downloading the IERS file--
astropy/astropy#10457
astropy/astropy#9227

In addition, @dstndstn worked out a solution in the context of the imaging pipeline by pre-downloading the IERS file once, ahead of time--
https://github.com/legacysurvey/legacypipe/blob/main/docker-nersc/Dockerfile#L224-L226

However, this "trick" was for production with a Docker container, so it may not apply or work here.

@weaverba137 (Member) commented:

I think there's a bit of a misunderstanding about what desiutil.iers.freeze_iers() actually does. If it is called properly, it should never make a network call to any remote server. In contrast, desiutil.iers.update_iers() downloads a file that is added to desiutil/data, and that file is read by freeze_iers(). In other words, a local file read is substituted for a network call. If we detect that freeze_iers() is actually making a network request, we have bigger problems than slow performance.

To directly answer Stephen's comment, freeze_iers() is meant to do exactly the steps to put astropy into an "offline mode".

My current gut feeling is that reading from the desiutil/data directory may be slow due to unannounced performance degradation on /global/common/software. It would be great if we could get a code profile, but I'm not an expert on how to do that, especially in parallel operation.
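
For anyone unfamiliar with those knobs, here is a rough illustration of that kind of "offline mode" configuration using the astropy.utils.iers configuration items; this is a sketch of the idea, not the actual desiutil implementation:

from astropy.table import Table
from astropy.utils import iers
from astropy.utils.data import get_pkg_data_path

# never attempt to download IERS data, and don't complain about the table's age
iers.conf.auto_download = False
iers.conf.auto_max_age = None

# substitute the frozen, package-local table for anything astropy would fetch
name = get_pkg_data_path('data/iers_frozen.ecsv', package='desiutil')
frozen = Table.read(name, format='ascii.ecsv').filled()
iers.IERS.iers_table = frozen
iers.IERS_Auto.iers_table = frozen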

@weaverba137 (Member) commented:

Here is a snippet that performs the exact same IO operations as desiutil.iers.freeze_iers():

time srun -n 200 python -c "from astropy.utils.data import get_pkg_data_path; from astropy.table import Table; name = get_pkg_data_path('data/iers_frozen.ecsv', package='desiutil'); t = Table.read(name, format='ascii.ecsv').filled(); print(t[0])"

I've only had a chance to run one test so far, but the time reported was very similar to the knl times reported above.

@sbailey (Contributor, Author) commented Dec 7, 2021

New data point: I'm rerunning the original

time srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"

and it is taking ~30 seconds instead of ~4 minutes. It also takes ~30 seconds to just do the import, i.e. freeze_iers() itself isn't the current slow part:

time srun -n 200 python -c "import desiutil.iers; desiutil.iers"

real	0m35.276s
user	0m0.438s
sys	0m0.278s

Trying again with just the astropy table part:

time srun -n 200 python -c "from astropy.table import Table"

took 20 seconds on one run followed by 3m23 sec immediately after that, then back to 18 sec.
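
In case it helps isolate things further, here is a small per-rank timing sketch (standard library only; SLURM_PROCID is set by srun) that separates the import time from the freeze_iers() time:

import os
import time

rank = os.environ.get("SLURM_PROCID", "0")

t0 = time.perf_counter()
import desiutil.iers  # deliberately imported after the timer starts
t1 = time.perf_counter()
desiutil.iers.freeze_iers()
t2 = time.perf_counter()

print("rank {}: import {:.2f}s, freeze_iers {:.2f}s".format(rank, t1 - t0, t2 - t1))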

@weaverba137 (Member) commented:

That supports my hypothesis that we're seeing performance hits on /global/common/software. I think that is the path that should be pursued with NERSC.

I don't think examining astropy caching further is worthwhile: as far as I can tell freeze_iers() doesn't cache anything. However, it does perform IO on /global/common/software.

Finally, it would probably be worthwhile to set up a test with a real code profiler.
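
One possible approach, sketched here with the standard-library cProfile writing one stats file per Slurm rank (the file names are arbitrary):

import cProfile
import os
import pstats

rank = os.environ.get("SLURM_PROCID", "0")
profile_file = "freeze_iers_profile.{}".format(rank)

# profile the import plus the freeze, writing stats to one file per rank
cProfile.run("import desiutil.iers; desiutil.iers.freeze_iers()", profile_file)

# print the 10 most expensive calls (by cumulative time) for this rank
pstats.Stats(profile_file).sort_stats("cumulative").print_stats(10)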

@dmargala commented Dec 7, 2021

Some of the slowness on KNL could be related to core/task affinity. I've noticed performance issues in the past when the number of tasks is not equal to the number of allocated cores, and that seems particularly bad on KNL. That doesn't explain the change in behavior from a few months ago, though, so it might be worth pursuing with a NERSC ticket.

Here are some warnings about this in the NERSC docs:

https://docs.nersc.gov/jobs/affinity/#cori-knl-slurm-affinity
https://docs.nersc.gov/performance/io/knl/#process-affinity

Reproduce slow behavior in 3 KNL node interactive job:

time srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"

real	3m39.741s
user	0m0.439s
sys	0m0.297s

Try with 68 cores per KNL node * 3 KNL nodes -> 204 tasks:

time srun -n 204 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"

real	0m26.951s
user	0m0.408s
sys	0m0.310s

Use --cpu-bind=cores srun option:

time srun -n 200 --cpu-bind=cores python -c "import desiutil.iers; desiutil.iers.freeze_iers()"

real	0m23.129s
user	0m0.395s
sys	0m0.325s

You can also set this with an env var SLURM_CPU_BIND=cores:

time SLURM_CPU_BIND=cores srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()"

real	0m24.940s
user	0m0.399s
sys	0m0.319s

@sbailey (Contributor, Author) commented Dec 7, 2021

@dmargala thanks for these suggestions but I suspect that in this case the fluctuations are random, not actually tied to cpu-bind=cores, -n 200, -n 204 etc. e.g. rerunning your examples now:

| options | Daniel | Stephen |
|---------|--------|---------|
| -n 200 | 3m40s | 41s |
| -n 204 | 27s | 30s |
| -n 200 --cpu-bind=cores | 25s | 40s |

Last Friday a ~4 minute runtime was quite reproducible, but now ~30s seems common with occasional fluctuations to 3-4 minutes.

@weaverba137 agreed that the evidence is pointing towards /global/common/software performance fluctuations, not caching. FYI, see DESI-6042 for some tips about profiling single-core code. I'm not adept at multi-core profiling, but @dmargala is and could perhaps point us at some NERSC docs/presentations.

@dmargala commented Dec 7, 2021

I see fairly consistent behavior. I have not seen fluctuations like you are reporting. The table below shows trials over a 30 minute period where the cpu binding method is chosen at random.

| Trial | Setting | Elapsed Time (seconds) |
|-------|---------|------------------------|
| 0 | SLURM_CPU_BIND=none | 220.18 |
| 1 | SLURM_CPU_BIND=none | 228.21 |
| 2 | SLURM_CPU_BIND=cores | 24.52 |
| 3 | SLURM_CPU_BIND=none | 216.42 |
| 4 | SLURM_CPU_BIND=cores | 23.74 |
| 5 | SLURM_CPU_BIND=none | 217.03 |
| 6 | SLURM_CPU_BIND=cores | 23.71 |
| 7 | SLURM_CPU_BIND=cores | 23.73 |
| 8 | SLURM_CPU_BIND=cores | 20.84 |
| 9 | SLURM_CPU_BIND=none | 220.06 |
| 10 | SLURM_CPU_BIND=none | 218.94 |
| 11 | SLURM_CPU_BIND=cores | 22.97 |
| 12 | SLURM_CPU_BIND=none | 225.48 |

Here's the batch script that I used to generate the rows for that table.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --constraint=knl
#SBATCH --time=30
#SBATCH --nodes=3

for i in $(seq 0 100); do
    # randomly choose the cpu-binding mode for this trial
    number=$RANDOM
    let "number %= 2"
    if [ "$number" -eq "0" ]; then
        export SLURM_CPU_BIND=none
    else
        export SLURM_CPU_BIND=cores
    fi
    # record the elapsed wall-clock time (seconds) of the 200-rank freeze_iers test
    t=$(/usr/bin/time -f "%e" bash -c 'srun -n 200 python -c "import desiutil.iers; desiutil.iers.freeze_iers()" >/dev/null 2>&1' 2>&1)
    echo "$i|SLURM_CPU_BIND=$SLURM_CPU_BIND|$t"
done

crude profiling with strace

The output from strace is quite verbose, but I see some different characteristics in the distribution of time spent in the memory-related syscalls brk and mmap.

The following is a snippet of strace output from a run with SLURM_CPU_BIND=none. The number between the < > at the end of each line is the time spent in the system call (in seconds):

...
brk(0x555556937000)                     = 0x555556937000 <0.255716>
brk(0x555556959000)                     = 0x555556959000 <0.018900>
brk(0x55555697c000)                     = 0x55555697c000 <0.013232>
brk(0x55555699e000)                     = 0x55555699e000 <0.049271>
brk(0x5555569c0000)                     = 0x5555569c0000 <0.054858>
brk(0x5555569e2000)                     = 0x5555569e2000 <0.019420>
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaae1e00000 <0.554662>
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaae1e40000 <0.235079>
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaae1e80000 <0.720408>
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaae1ec0000 <0.150882>
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaae1f00000 <0.828170>
brk(0x555556a03000)                     = 0x555556a03000 <0.023716>
brk(0x555556a44000)                     = 0x555556a44000 <0.177744>
brk(0x555556a84000)                     = 0x555556a84000 <0.732115>
brk(0x555556aa5000)                     = 0x555556aa5000 <0.500225>
brk(0x555556ae5000)                     = 0x555556ae5000 <0.463878>
brk(0x555556b06000)                     = 0x555556b06000 <0.584732>
brk(0x555556b46000)                     = 0x555556b46000 <0.514777>
...

Here is a comparison of the brk and mmap syscalls between the two cpu-binding modes, with crude estimates of the 90th and 99th percentiles of time spent per call (seconds):

| Setting | syscall | Count | ~90% | ~99% |
|---------|---------|-------|------|------|
| SLURM_CPU_BIND=none | mmap | 937 | 0.036 | 0.85 |
| SLURM_CPU_BIND=none | brk | 175 | 0.392 | 0.784 |
| SLURM_CPU_BIND=cores | mmap | 923 | 0.00007 | 0.0002 |
| SLURM_CPU_BIND=cores | brk | 170 | 0.00005 | 0.003 |

Commands to generate strace output and quick analysis from the command line:

# generate per rank strace output (this will create 200 output files in the working directory)
time srun -n 200 bash -c 'strace -T -o strace.out.$SLURM_PROCID python -s -c "import desiutil.iers; desiutil.iers.freeze_iers()"'

# sort by time (reverse lines, sort, reverse lines again to restore original order)
cat strace.out.48 | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' | sort -V | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' 

# count occurrences of a specific syscall 
cat strace.out.48 | grep mmap | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' | sort -V | awk '{for (i=NF; i>0; i--) {printf "%s ", $i;} printf "\n" }' | wc -l
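
For anyone who wants to reproduce the percentile numbers above, here is a rough Python equivalent of that analysis (it assumes strace.out.<rank> files in the current working directory, as generated by the srun command above; the percentile estimates are crude, as noted):

import glob
import re
from collections import defaultdict

# strace -T appends "<seconds>" to each line; pull out the syscall name and the time
pattern = re.compile(r'^(\w+)\(.*<([\d.]+)>\s*$')

times = defaultdict(list)
for path in glob.glob('strace.out.*'):
    with open(path) as stream:
        for line in stream:
            match = pattern.match(line.strip())
            if match and match.group(1) in ('brk', 'mmap'):
                times[match.group(1)].append(float(match.group(2)))

for syscall, values in sorted(times.items()):
    values.sort()
    p90 = values[int(0.90 * (len(values) - 1))]
    p99 = values[int(0.99 * (len(values) - 1))]
    print('{}: count={} ~90%={:.5f}s ~99%={:.5f}s'.format(syscall, len(values), p90, p99))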

@sbailey (Contributor, Author) commented Dec 7, 2021

@dmargala that does look pretty convincing :) Thanks for the details. Perhaps my -n 200 run was a lucky downward fluctuation. At minimum we can add --cpu-bind=cores to our pipeline srun calls; I don't see any downside to that (other than verbosity).

@weaverba137 (Member) commented:

After live discussions with @sbailey, we are closing this ticket in favor of using --cpu-bind=cores more widely. We are also pursuing INC0178137 with the NERSC help desk.

@sbailey (Contributor, Author) commented Dec 8, 2021

Addendum: we have to be careful with --cpu-bind=cores in any steps that use multiprocessing. Empirically it appears to also bind the multiprocessing child processes to the same core as the parent process, rendering multiprocessing useless because the workers aren't fanning out to other cores.
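
One possible workaround, sketched below but not something we've adopted, would be to explicitly widen the CPU affinity inherited from --cpu-bind=cores inside each worker process; os.sched_setaffinity is Linux-only and the pool size here is arbitrary:

import multiprocessing
import os

def widen_affinity():
    # undo the single-core binding inherited from the srun-launched parent process
    os.sched_setaffinity(0, range(os.cpu_count()))

def work(i):
    return i * i

if __name__ == '__main__':
    # run each worker through the affinity-widening initializer before any work
    with multiprocessing.Pool(processes=8, initializer=widen_affinity) as pool:
        print(pool.map(work, range(16)))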
