
web dyno getting OOMKilled #118

Open
mvandenburgh opened this issue Mar 25, 2024 · 67 comments
Labels
high priority Work on these first OOM Dealing with out-of-memory errors performance Efficient use of time and space

Comments

@mvandenburgh
Member

In the last 24 hours, we've had over 20 instances of the dandidav web dyno exceeding its memory limit of 512MB and getting killed by Heroku as a result. See https://dashboard.heroku.com/apps/dandidav/metrics/web?ending=0-hours-ago&starting=24-hours-ago

@yarikoptic @jwodder is 512MB enough for the dandidav server, or should we provision a higher tier dyno?

@jwodder
Member

jwodder commented Mar 25, 2024

My best guess is that the high memory usage probably comes from the caching of Zarr manifest files. However, no more than 16 files are cached at a time, and most manifest files seem to be around 10-20 MB, which doesn't add up to 512 MB.

@yarikoptic
Member

Do we have access to dandidav logs?

Could you @jwodder add the amount of memory the process is using to each log line -- it might help to identify which actions are consuming memory? (I had something similar done long ago in pymvpa -- IIRC it came in handy on a number of occasions.)
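
For illustration, here is a minimal, purely conceptual Python sketch of that pymvpa-style idea -- a logging filter that stamps every record with the process's current RSS. dandidav itself is Rust, so this is not its code; the sketch reads /proc/self/statm and is therefore Linux-only:

import logging
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")


class MemoryFilter(logging.Filter):
    """Attach the current resident set size (in MB) to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        with open("/proc/self/statm") as fp:
            rss_pages = int(fp.read().split()[1])  # second field is RSS in pages
        record.rss_mb = rss_pages * PAGE_SIZE / 2**20
        return True


logging.basicConfig(format="%(asctime)s [%(rss_mb).1f MB RSS] %(message)s")
logger = logging.getLogger("demo")
logger.addFilter(MemoryFilter())
logger.warning("handled request X")  # prints with the current RSS prepended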

@yarikoptic
Member

10-20 MB, which doesn't add up to 512 MB.

FWIW some manifest files are much larger than that (e.g. 62MB for ./5c3/7c2/5c37c233-222f-4e60-96e7-a7536e08ef61/4cb549b2e2346bb1a30f493b50fb6a2e-526320--1023396474554.json), but that is their serialized form, so indeed they should not take up as much in memory.

So how much memory is actually consumed by the process should be investigated... maybe there are multiple caches for the same thing across threads, etc.?

@yarikoptic
Member

FWIW, this remains the case. @jwodder - in case it might be of use, we do dump logs on drogon, which you can inspect:

dandi@drogon:/mnt/backup/dandi/heroku-logs/dandidav$ tail 20240326-0901.log
2024-03-26T13:17:54.887354+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:17:54.889073+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:18:15.933714+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:18:15.935432+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:18:36.885785+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:18:36.888306+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:18:57.096775+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:18:57.098827+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2024-03-26T13:19:19.324735+00:00 heroku[web.1]: Process running mem=635M(124.1%)
2024-03-26T13:19:19.326660+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)

@jwodder
Member

jwodder commented Mar 26, 2024

@yarikoptic

Could you @jwodder add the amount of memory the process is using to each log line

I don't think that's possible. The closest thing would be to emit dedicated log lines listing the current memory used, either at the start and/or end of each request or periodically.

may be multiple caches for the same thing across threads etc?

That shouldn't be the case. There should only be one manifest cache in the whole process.

@yarikoptic
Member

FWIW, an idea -- you could "replay" the sequence of visits from those logs against a local dandidav instance and monitor memory consumption any way you like.
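
For example, a rough Python sketch of such a replay (the local base URL and the log-file argument are placeholders, not actual dandidav settings); memory would still have to be watched separately, e.g. with ps or top:

import re
import sys
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8080"  # assumed address of the local dandidav instance

# Pull request paths out of Heroku router lines like
# `heroku[router]: at=info method=GET path="/zarrs/..." ...`
path_re = re.compile(r'heroku\[router\]:.*method=GET path="([^"]+)"')

with open(sys.argv[1]) as fp:
    for line in fp:
        m = path_re.search(line)
        if not m:
            continue
        url = BASE_URL + m.group(1)
        try:
            with urlopen(url) as resp:
                resp.read()
            print("OK ", url)
        except Exception as exc:
            print("ERR", url, exc)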

@jwodder
Member

jwodder commented Mar 28, 2024

dandidav now logs the current memory usage before & after handling each request as an INFO message of the form "Current memory usage: {int} physical, {int} virtual".

@mvandenburgh
Member Author

Do we have access to dandidav logs?

Currently, we have access to the live logs that Heroku provides:
https://dashboard.heroku.com/apps/dandidav/logs

We could also enable Papertrail if needed.

@yarikoptic
Member

Great -- did you pinpoint what causes that consistent memory consumption growth?

dandi@drogon:/mnt/backup/dandi/heroku-logs/dandidav$ sed -ne '/usage/s,.*usage:,,gp' 20240328-1401.log | uniq -c 
      5  201826304 physical, 714584064 virtual
      1  201826304 physical, 714571776 virtual
      1  244264960 physical, 781692928 virtual
      1  244264960 physical, 781680640 virtual
      1  244264960 physical, 781692928 virtual
      1  244264960 physical, 781680640 virtual
      2  244264960 physical, 781692928 virtual
      1  246632448 physical, 781692928 virtual
      1  246632448 physical, 781680640 virtual
      1  246632448 physical, 781692928 virtual
      1  246632448 physical, 781680640 virtual
      1  306573312 physical, 869711872 virtual
      3  306573312 physical, 869699584 virtual
      1  306573312 physical, 869711872 virtual
      9  306573312 physical, 869699584 virtual
      1  308158464 physical, 869711872 virtual
      1  308158464 physical, 869699584 virtual
      1  308158464 physical, 869711872 virtual
      1  308158464 physical, 869699584 virtual
      1  308174848 physical, 869711872 virtual
      3  308174848 physical, 869699584 virtual
      9  308174848 physical, 869711872 virtual
      5  308174848 physical, 869699584 virtual
      1  325824512 physical, 869711872 virtual
      1  325824512 physical, 869699584 virtual
      1  325857280 physical, 869711872 virtual

Looking at a sample:

dandi@drogon:/mnt/backup/dandi/heroku-logs/dandidav$ grep -B 3 ' 325824512 physical' 20240328-1401.log
2024-03-28T18:11:38.319749+00:00 app[web.1]: 2024-03-28T18:11:38.319732347Z DEBUG request{method=GET uri=/zarrs/9f0/a96/9f0a9681-f4e9-4103-a205-7faecb311246/a77cc8f6ced80f8fb0c712b8b207e14b-55031--94971788668.zarr/4/ version=HTTP/1.1}:outgoing-request{url=https://datasets.datalad.org/dandi/zarr-manifests/zarr-manifests-v2-sorted/9f0/a96/9f0a9681-f4e9-4103-a205-7faecb311246/a77cc8f6ced80f8fb0c712b8b207e14b-55031--94971788668.json method=GET}: reqwest::connect: starting new connection: https://datasets.datalad.org/
2024-03-28T18:11:38.408735+00:00 app[web.1]: 2024-03-28T18:11:38.408630383Z DEBUG request{method=GET uri=/zarrs/9f0/a96/9f0a9681-f4e9-4103-a205-7faecb311246/a77cc8f6ced80f8fb0c712b8b207e14b-55031--94971788668.zarr/4/ version=HTTP/1.1}:outgoing-request{url=https://datasets.datalad.org/dandi/zarr-manifests/zarr-manifests-v2-sorted/9f0/a96/9f0a9681-f4e9-4103-a205-7faecb311246/a77cc8f6ced80f8fb0c712b8b207e14b-55031--94971788668.json method=GET}: dandidav::httputil: Response received status=200 OK
2024-03-28T18:11:38.672798+00:00 heroku[router]: at=info method=GET path="/zarrs/9f0/a96/9f0a9681-f4e9-4103-a205-7faecb311246/a77cc8f6ced80f8fb0c712b8b207e14b-55031--94971788668.zarr/4/" host=webdav.dandiarchive.org request_id=1a482587-988a-4235-b677-13c4b77419f5 fwd="85.208.96.201" dyno=web.1 connect=4ms service=358ms status=200 bytes=3562 protocol=https
2024-03-28T18:11:38.671895+00:00 app[web.1]: 2024-03-28T18:11:38.671796029Z  INFO request{method=GET uri=/zarrs/9f0/a96/9f0a9681-f4e9-4103-a205-7faecb311246/a77cc8f6ced80f8fb0c712b8b207e14b-55031--94971788668.zarr/4/ version=HTTP/1.1}: dandidav: Current memory usage: 325824512 physical, 869711872 virtual

So it seems to be about Zarrs and their manifests... please investigate further and see what could be the reason it hogs memory.

@jwodder jwodder added this to the Current priorities milestone Apr 28, 2024
@yarikoptic
Member

A data point - we had a period without any memory issues despite having traffic:

(dandisets-2) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandidav$ echo -e "filename\t\t#memory\t#lines"; for f in 20240417-*.log; do echo -en "$f\t"; grep -c 'Memory quota exceeded' $f | tr '\n' '\t'; wc -l $f | awk '{print $1;}' ; done
filename                #memory #lines
20240417-0001.log       201     1807
20240417-0101.log       189     1104
20240417-0201.log       198     1197
20240417-0301.log       207     1180
20240417-0401.log       193     2239
20240417-0501.log       187     2433
20240417-0601.log       187     1770
20240417-0701.log       197     1036
20240417-0801.log       198     619
20240417-0901.log       215     724
20240417-1001.log       208     669
20240417-1101.log       210     730
20240417-1201.log       190     643
20240417-1301.log       206     673
20240417-1401.log       27      198
20240417-1501.log       0       249
20240417-1601.log       0       238
20240417-1701.log       0       196
20240417-1801.log       0       236
20240417-1901.log       0       429
20240417-2001.log       0       1128
20240417-2101.log       0       8187
20240417-2201.log       0       7785
20240417-2301.log       100     2731

@jwodder
Member

jwodder commented May 30, 2024

I've adjusted dandidav's logging so that it now also logs the change in memory usage after each request, to make investigating the logs easier.

@jwodder
Member

jwodder commented Jul 11, 2024

@yarikoptic

  • The requests that increase the amount of virtual memory [1] used by the largest amounts are almost all for Zarrs under /zarrs/.

  • While Heroku has been complaining about excess memory usage constantly, since the start of July dandidav has only been restarted (presumably always for memory issues) about once a day:

    Script
    #!/bin/bash
    set -u
    printf 'Log\t\t\tQuota\tReboots\tDeploys\tLines\n'
    for f in raw/*
    do
        complaints="$(grep -cF 'Memory quota exceeded' "$f")"
        restarts="$(grep -cF SIG "$f")"
        if [ "$complaints" -gt 0 ] || [ "$restarts" -gt 0 ]
        then
            deploys="$(grep -cF 'Deploy by' "$f")"
            lines="$(wc -l "$f" | awk '{print $1}')"
            printf '%s\t%d\t%d\t%d\t%d\n' "$f" "$complaints" "$restarts" "$deploys" "$lines"
        fi
    done
    Output
    Log			Quota	Reboots	Deploys	Lines
    raw/20240701-0001.log	177	0	0	2260
    raw/20240701-0101.log	178	0	0	2211
    raw/20240701-0201.log	179	0	0	2182
    raw/20240701-0301.log	176	0	0	1981
    raw/20240701-0401.log	182	0	0	1993
    raw/20240701-0501.log	181	0	0	2175
    raw/20240701-0601.log	175	0	0	1616
    raw/20240701-0701.log	175	0	0	2489
    raw/20240701-0801.log	177	0	0	2785
    raw/20240701-0901.log	175	0	0	1506
    raw/20240701-1001.log	181	0	0	1599
    raw/20240701-1101.log	188	0	0	1676
    raw/20240701-1201.log	183	0	0	1535
    raw/20240701-1301.log	185	0	0	1618
    raw/20240701-1401.log	187	0	0	1317
    raw/20240701-1501.log	188	0	0	1212
    raw/20240701-1601.log	180	0	0	1986
    raw/20240701-1701.log	180	0	0	2328
    raw/20240701-1801.log	182	0	0	2450
    raw/20240701-1901.log	178	0	0	3585
    raw/20240701-2001.log	141	1	0	7133
    raw/20240701-2301.log	93	0	0	3310
    raw/20240702-0001.log	169	0	0	3539
    raw/20240702-0101.log	171	0	0	3242
    raw/20240702-0201.log	171	0	0	3602
    raw/20240702-0301.log	168	0	0	3985
    raw/20240702-0401.log	168	0	0	3761
    raw/20240702-0501.log	167	0	0	3636
    raw/20240702-0601.log	168	0	0	3589
    raw/20240702-0701.log	168	0	0	3932
    raw/20240702-0801.log	169	0	0	4231
    raw/20240702-0901.log	167	0	0	3458
    raw/20240702-1001.log	171	0	0	3928
    raw/20240702-1101.log	166	0	0	3677
    raw/20240702-1201.log	167	0	0	3589
    raw/20240702-1301.log	171	0	0	2559
    raw/20240702-1401.log	169	0	0	2885
    raw/20240702-1501.log	170	0	0	3002
    raw/20240702-1601.log	171	0	0	2880
    raw/20240702-1701.log	169	0	0	2343
    raw/20240702-1801.log	173	0	0	1725
    raw/20240702-1901.log	168	0	0	1592
    raw/20240702-2001.log	173	0	0	1509
    raw/20240702-2101.log	59	1	0	1557
    raw/20240702-2201.log	179	0	0	2201
    raw/20240702-2301.log	174	0	0	2350
    raw/20240703-0001.log	176	0	0	1578
    raw/20240703-0101.log	177	0	0	1558
    raw/20240703-0201.log	187	0	0	1640
    raw/20240703-0301.log	183	0	0	1616
    raw/20240703-0401.log	185	0	0	2131
    raw/20240703-0501.log	180	0	0	2242
    raw/20240703-0601.log	181	0	0	2482
    raw/20240703-0701.log	177	0	0	2746
    raw/20240703-0801.log	177	0	0	2988
    raw/20240703-0901.log	180	0	0	2437
    raw/20240703-1001.log	175	0	0	2546
    raw/20240703-1101.log	177	0	0	3419
    raw/20240703-1201.log	175	0	0	4224
    raw/20240703-1301.log	177	0	0	2327
    raw/20240703-1401.log	184	0	0	2752
    raw/20240703-1501.log	176	0	0	2505
    raw/20240703-1601.log	122	0	0	3114
    raw/20240703-1801.log	152	0	0	2029
    raw/20240703-1901.log	182	0	0	3215
    raw/20240703-2001.log	173	0	0	5082
    raw/20240703-2101.log	183	0	0	2937
    raw/20240703-2201.log	182	0	0	2131
    raw/20240703-2301.log	189	0	0	2072
    raw/20240704-0001.log	176	0	0	2462
    raw/20240704-0101.log	179	0	0	2657
    raw/20240704-0201.log	179	0	0	2619
    raw/20240704-0301.log	172	0	0	3320
    raw/20240704-0401.log	141	1	0	3181
    raw/20240704-0501.log	180	0	0	3346
    raw/20240704-0601.log	177	0	0	2648
    raw/20240704-0701.log	182	0	0	2146
    raw/20240704-0801.log	178	0	0	1943
    raw/20240704-0901.log	183	0	0	2088
    raw/20240704-1001.log	175	0	0	1900
    raw/20240704-1101.log	181	0	0	2395
    raw/20240704-1201.log	176	0	0	2672
    raw/20240704-1301.log	179	0	0	2483
    raw/20240704-1401.log	180	0	0	3105
    raw/20240704-1501.log	180	0	0	3142
    raw/20240704-1601.log	175	0	0	3443
    raw/20240704-1701.log	171	0	0	3422
    raw/20240704-1801.log	177	0	0	3165
    raw/20240704-1901.log	180	0	0	2734
    raw/20240704-2001.log	179	0	0	3173
    raw/20240704-2101.log	176	0	0	3011
    raw/20240704-2201.log	176	0	0	3198
    raw/20240704-2301.log	181	0	0	3185
    raw/20240705-0001.log	178	0	0	3494
    raw/20240705-0101.log	177	0	0	2611
    raw/20240705-0201.log	182	0	0	2300
    raw/20240705-0301.log	178	0	0	2230
    raw/20240705-0401.log	180	0	0	1932
    raw/20240705-0501.log	125	1	0	1635
    raw/20240705-0601.log	174	0	0	1848
    raw/20240705-0701.log	171	0	0	1476
    raw/20240705-0801.log	180	0	0	1267
    raw/20240705-0901.log	177	0	0	1085
    raw/20240705-1001.log	178	0	0	1322
    raw/20240705-1101.log	179	0	0	1195
    raw/20240705-1201.log	180	0	0	1091
    raw/20240705-1301.log	178	0	0	1080
    raw/20240705-1401.log	180	0	0	1252
    raw/20240705-1501.log	180	0	0	1041
    raw/20240705-1601.log	182	0	0	994
    raw/20240705-1701.log	182	0	0	829
    raw/20240705-1801.log	181	0	0	819
    raw/20240705-1901.log	187	0	0	829
    raw/20240705-2001.log	181	0	0	856
    raw/20240705-2101.log	191	0	0	809
    raw/20240705-2201.log	198	0	0	728
    raw/20240705-2301.log	188	0	0	847
    raw/20240706-0001.log	185	0	0	636
    raw/20240706-0101.log	199	0	0	548
    raw/20240706-0201.log	202	0	0	538
    raw/20240706-0301.log	202	0	0	588
    raw/20240706-0401.log	202	0	0	740
    raw/20240706-0501.log	130	1	0	739
    raw/20240706-0701.log	81	0	0	942
    raw/20240706-0801.log	196	0	0	991
    raw/20240706-0901.log	208	0	0	924
    raw/20240706-1001.log	206	0	0	1023
    raw/20240706-1101.log	205	0	0	1022
    raw/20240706-1201.log	195	0	0	915
    raw/20240706-1301.log	208	0	0	754
    raw/20240706-1401.log	204	0	0	904
    raw/20240706-1501.log	212	0	0	884
    raw/20240706-1601.log	192	0	0	967
    raw/20240706-1701.log	200	0	0	853
    raw/20240706-1801.log	200	0	0	902
    raw/20240706-1901.log	198	0	0	911
    raw/20240706-2001.log	196	0	0	914
    raw/20240706-2101.log	205	0	0	939
    raw/20240706-2201.log	210	0	0	842
    raw/20240706-2301.log	211	0	0	776
    raw/20240707-0001.log	214	0	0	753
    raw/20240707-0101.log	202	0	0	1386
    raw/20240707-0201.log	190	0	0	1064
    raw/20240707-0301.log	193	0	0	1335
    raw/20240707-0401.log	192	0	0	1244
    raw/20240707-0501.log	196	0	0	1467
    raw/20240707-0601.log	37	1	0	1054
    raw/20240707-0701.log	28	0	0	926
    raw/20240707-0801.log	176	0	0	1571
    raw/20240707-0901.log	176	0	0	1353
    raw/20240707-1001.log	176	0	0	1374
    raw/20240707-1101.log	170	0	0	1629
    raw/20240707-1201.log	175	0	0	1513
    raw/20240707-1301.log	176	0	0	2816
    raw/20240707-1401.log	179	0	0	1144
    raw/20240707-1501.log	185	0	0	1389
    raw/20240707-1601.log	173	0	0	1359
    raw/20240707-1701.log	180	0	0	1291
    raw/20240707-1801.log	183	0	0	3557
    raw/20240707-1901.log	166	0	0	10994
    raw/20240707-2001.log	165	0	0	34459
    raw/20240707-2101.log	164	0	0	53459
    raw/20240707-2201.log	166	0	0	963
    raw/20240707-2301.log	192	0	0	744
    raw/20240708-0001.log	189	0	0	603
    raw/20240708-0101.log	194	0	0	594
    raw/20240708-0201.log	199	0	0	683
    raw/20240708-0301.log	181	0	0	809
    raw/20240708-0401.log	181	0	0	912
    raw/20240708-0501.log	186	0	0	972
    raw/20240708-0601.log	166	1	0	845
    raw/20240708-0701.log	44	1	0	772
    raw/20240708-0801.log	196	0	0	895
    raw/20240708-0901.log	206	0	0	947
    raw/20240708-1001.log	206	0	0	1277
    raw/20240708-1101.log	196	0	0	1335
    raw/20240708-1201.log	193	0	0	1079
    raw/20240708-1301.log	199	0	0	892
    raw/20240708-1401.log	206	0	0	876
    raw/20240708-1501.log	199	0	0	823
    raw/20240708-1601.log	200	0	0	1019
    raw/20240708-1701.log	199	0	0	865
    raw/20240708-1801.log	205	0	0	960
    raw/20240708-1901.log	202	0	0	1026
    raw/20240708-2001.log	196	0	0	935
    raw/20240708-2101.log	203	0	0	945
    raw/20240708-2201.log	196	0	0	887
    raw/20240708-2301.log	204	0	0	1560
    raw/20240709-0001.log	186	0	0	1043
    raw/20240709-0101.log	206	0	0	1513
    raw/20240709-0201.log	185	0	0	1422
    raw/20240709-0301.log	192	0	0	1570
    raw/20240709-0401.log	186	0	0	1550
    raw/20240709-0501.log	186	0	0	1500
    raw/20240709-0601.log	189	0	0	1298
    raw/20240709-0701.log	67	1	0	1442
    raw/20240709-0801.log	78	0	0	766
    raw/20240709-0901.log	180	0	0	1000
    raw/20240709-1001.log	178	0	0	1026
    raw/20240709-1101.log	94	0	0	830
    raw/20240709-1201.log	174	0	0	1204
    raw/20240709-1301.log	57	2	2	1049
    raw/20240709-1401.log	36	0	0	12384
    raw/20240709-1501.log	164	0	0	22790
    raw/20240709-1601.log	166	0	0	14761
    raw/20240709-1701.log	171	0	0	2401
    raw/20240709-1801.log	180	0	0	1769
    raw/20240709-1901.log	172	0	0	1822
    raw/20240709-2001.log	175	0	0	1720
    raw/20240709-2101.log	180	0	0	2228
    raw/20240709-2201.log	96	1	1	2604
    raw/20240709-2301.log	188	0	0	2324
    raw/20240710-0001.log	188	0	0	2805
    raw/20240710-0101.log	188	0	0	3052
    raw/20240710-0201.log	187	0	0	2974
    raw/20240710-0301.log	185	0	0	3531
    raw/20240710-0401.log	187	0	0	3223
    raw/20240710-0501.log	188	0	0	2958
    raw/20240710-0601.log	186	0	0	3715
    raw/20240710-0701.log	186	0	0	3751
    raw/20240710-0801.log	184	0	0	3816
    raw/20240710-0901.log	185	0	0	3403
    raw/20240710-1001.log	166	1	1	3525
    raw/20240710-1101.log	169	0	0	3505
    raw/20240710-1201.log	167	0	0	4814
    raw/20240710-1301.log	168	0	0	5547
    raw/20240710-1401.log	147	1	0	5871
    raw/20240710-1501.log	173	0	0	7685
    raw/20240710-1601.log	175	0	0	4539
    raw/20240710-1701.log	176	0	0	2942
    raw/20240710-1801.log	175	0	0	2782
    raw/20240710-1901.log	180	0	0	2852
    raw/20240710-2001.log	181	0	0	3097
    raw/20240710-2101.log	179	0	0	2372
    raw/20240710-2201.log	175	0	0	3761
    raw/20240710-2301.log	175	0	0	3454
    raw/20240711-0001.log	175	0	0	3117
    raw/20240711-0101.log	178	0	0	2867
    raw/20240711-0201.log	178	0	0	2919
    raw/20240711-0301.log	179	0	0	5960
    raw/20240711-0401.log	174	0	0	3625
    raw/20240711-0501.log	177	0	0	2789
    raw/20240711-0601.log	190	0	0	6081
    raw/20240711-0701.log	188	0	0	1411
    raw/20240711-0801.log	184	0	0	5573
    raw/20240711-0901.log	102	0	0	3628
    

Footnotes

  1. I'm assuming that virtual memory (rather than physical memory) is what Heroku cares about. For some reason, the memory values reported by Heroku and the memory logging are very different, though they do appear to always be the same order of magnitude.

@yarikoptic
Member

Thanks for the investigation!

  • The fact that it restarts just once a day is good for now, but it might deteriorate quickly whenever the service gets used more heavily, so I would not count on these events staying conveniently sparse in time.
  • Given that memory consumption does go down sometimes, but not by much, and under the assumption that memory management in Rust does reclaim everything unused correctly (no memory leaks... though running under valgrind might be useful) - my guess is that it is likely some caching of folder listings, since there are so many of them under zarrs, which hogs memory, and the caching mechanism is not "limiting itself" by the amount of free memory?

@jwodder
Member

jwodder commented Jul 11, 2024

@yarikoptic The Zarr manifest cache is currently limited simply by the number of manifests it can store at once (set in the code to 16); if a new manifest is added while the cache is full, the least-frequently-used manifest in the cache is deleted. It seems that the cache dandidav uses can be configured to weigh entries by size, but I don't know what would happen if the code tried to cache a manifest larger than the maximum weight; it might refuse to cache it, resulting in a very large manifest being fetched & parsed for every URL requested within that Zarr.
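
For a rough picture of the size-weighted alternative, here is a hypothetical Python illustration of a cache bounded by the total size of its entries rather than by entry count (this only sketches the policy being described; it is not moka's API):

from collections import OrderedDict


class SizeBoundedCache:
    """LRU-style cache whose entries' total size may not exceed a budget."""

    def __init__(self, max_total_size: int) -> None:
        self.max_total_size = max_total_size
        self.entries: OrderedDict[str, bytes] = OrderedDict()
        self.total_size = 0

    def get(self, key: str) -> bytes | None:
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        return None

    def put(self, key: str, value: bytes) -> None:
        if len(value) > self.max_total_size:
            # An entry larger than the whole budget is never cached, so every
            # request for it would trigger a fresh fetch & parse -- the risk
            # mentioned above.
            return
        if key in self.entries:
            self.total_size -= len(self.entries.pop(key))
        self.entries[key] = value
        self.total_size += len(value)
        # Evict least-recently-used entries until we are back under the budget.
        while self.total_size > self.max_total_size:
            _, evicted = self.entries.popitem(last=False)
            self.total_size -= len(evicted)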

@yarikoptic
Member

So we might want to either increase memory on the node (IMHO preferable!) or decrease the cache size from 16 to e.g. 12 and see how frequently we exhaust memory. Also, getting some stats on cache hits logged once in a while would be great, to help us choose between the two solutions.

@jwodder
Member

jwodder commented Jul 11, 2024

@yarikoptic

Also getting some stats on cache hits logged

I'm not sure how to do that. The caching library (moka) uses a different logging system (log) than the rest of the code (tracing), and I couldn't get a supposed bridge between them to work. Switching the code over to just use log entirely wouldn't work, as the web framework (axum) and multiple other dependencies only use tracing.

@yarikoptic
Member

Eh, with
[Screenshot_20240711-113844]
and the last release half a year ago, for a logging library (core, functionally used)... sounds like a ticking bomb...

log: a release two weeks ago and
[Screenshot_20240711-114320], but it has half the stars, so less adopted I guess.

Not sure what to advise pragmatically on that, but I guess long-term projects might need to move to log

@jwodder
Member

jwodder commented Jul 12, 2024

@yarikoptic

So we might want to either increase memory on the node (IMHO preferable!) or decrease the cache size from 16 to e.g. 12 and see how frequently we exhaust memory.

I believe the amount of memory is determined by the dyno type. dandidav is currently using a "Basic" dyno, which has the smallest amount of RAM available (512 MB). See https://devcenter.heroku.com/articles/dyno-types for all the possible dyno types and related information, and see https://devcenter.heroku.com/articles/usage-and-billing for the per-dyno pricing.

@jwodder
Member

jwodder commented Jul 12, 2024

@yarikoptic I believe I've figured out how to log a message on every cache hit and/or miss. Would that be acceptable? What exactly should be logged? For cache misses, do you want the log message to be emitted before the Zarr manifest is fetched & parsed, in case an out-of-memory death occurs during the fetching & parsing?

@yarikoptic
Member

  • instance: so we would need to jump to Standard-2X to see any difference, as it is the first tier that doubles RAM to 1 GB. I think it would be OK to do that for a week or two to see if there is a positive effect.
  • logging:
    • would you at the moment of logging know a number of active cache entries and if hitting the cache - which of the instances we hit so to see how recent entry we are hitting?
    • Also, is there logging for cache cleanup, i.e. whenever there is a miss -- would we know that some record was evicted to accommodate the new one?
    • It would probably make sense to emit messages at the beginning and end of processing a cache miss (with some identifier to pair up each couple of such messages), so we would indeed be able to tell whether an out-of-memory death happened in the middle.
    • Do we have any control over the cache, e.g. if at the end of processing a miss we see that we are close to the RAM limit and would exceed it -- could we evict some least-recent cache entry?

@jwodder
Member

jwodder commented Jul 12, 2024

@yarikoptic

would you at the moment of logging know a number of active cache entries

I can get an approximate count. (I don't think calling run_pending_tasks() on every cache interaction would be very efficient.)

if hitting the cache - which of the instances we hit so to see how recent entry we are hitting?

I would know which instance was hit. I'm not sure what you mean by "how recent entry."

Also, is there logging for cache cleanup, i.e. whenever there is a miss -- would we know that some record was evicted to accommodate the new one?

As I said above, we don't have access to whatever logs moka generates, but we can add our own logging around the cache, and this could include logging of removed items.

Do we have any control over the cache, e.g. if at the end of processing a miss we see that we are close to the RAM limit and would exceed it -- could we evict some least-recent cache entry?

Selectively removing cache items is possible, but checking whether we're near the RAM limit would be problematic as (a) I'm not sure exactly what memory stat Heroku cares about, in part because the memory values that Heroku itself logs don't match the memory values that dandidav gets from memory-stats, and (b) using memory-stats slows down response times by an order of magnitude, so I was hoping we could remove or disable it once we've gotten past this issue.

@jwodder
Member

jwodder commented Jul 12, 2024

@yarikoptic I've set dandidav's dyno type to Standard-2X.

@jwodder jwodder added the OOM Dealing with out-of-memory errors label Jul 17, 2024
@jwodder
Member

jwodder commented Jul 19, 2024

@yarikoptic Ping.

@jwodder
Member

jwodder commented Jul 19, 2024

@yarikoptic Now that dandidav has been running on a Standard-2X dyno for a week, here's an updated summary of the logs for July:

Date        Requests   MemMsgs  Restarts    Cycles   Deploys
2024-07-01      3497      3607         0         0         0
2024-07-02      6995      3619         0         1         0
2024-07-03      4586      3901         0         1         0
2024-07-04      5930      4254         0         0         0
2024-07-05      4191      4250         0         1         0
2024-07-06      1057      4382         0         1         0
2024-07-07      3668      4176         0         1         0
2024-07-08      2910      4447         0         2         0
2024-07-09      5573      3842         0         1         2
2024-07-10      7001      4197         2         0         2
2024-07-11     10045      4189         1         0         0
2024-07-12     29855      3948         1         1         2
2024-07-13     27899      3460         0         1         0
2024-07-14      6904         0         1         0         2
2024-07-15      3448         0         0         1         0
2024-07-16      4287       148         0         1         0
2024-07-17      3513         0         0         1         0
2024-07-18      5516         4         0         1         0
2024-07-19      6372         0         0         1         0
Script used to generate the summary
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = []
# ///
from __future__ import annotations
from collections import Counter
from datetime import UTC, date, datetime
import re
import sys


def main() -> None:
    days = set()
    requests_per_day: dict[date, int] = Counter()
    quota_msgs_per_day: dict[date, int] = Counter()
    deployments_per_day: dict[date, int] = Counter()
    cyclings_per_day: dict[date, int] = Counter()
    restart_stamps = []
    cycling = False
    for fname in sys.argv[1:]:
        with open(fname) as fp:
            for line in fp:
                m = re.match(
                    r"\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d\.\d+[-+]\d\d:\d\d\s*", line
                )
                if not m:
                    # Warn? Error?
                    continue
                timestamp = datetime.fromisoformat(m[0].strip()).astimezone(UTC)
                day = timestamp.date()
                days.add(day)
                line = line[m.end() :].strip()
                if line.startswith("heroku[router]:"):
                    requests_per_day[day] += 1
                elif re.fullmatch(
                    r"heroku\[web\.1\]: Error (?:R14 \(Memory quota exceeded\)"
                    r"|R15 \(Memory quota vastly exceeded\))",
                    line,
                ):
                    quota_msgs_per_day[day] += 1
                elif line == "heroku[web.1]: Cycling":
                    cycling = True
                elif re.fullmatch(
                    r"heroku\[web\.1\]: Stopping (all processes with SIGTERM"
                    r"|process with SIGKILL)",
                    line,
                ):
                    if cycling:
                        cyclings_per_day[day] += 1
                        cycling = False
                    else:
                        restart_stamps.append(timestamp)
                elif line.startswith(("app[api]: Deploy by", "app[api]: Scaled to")):
                    assert not cycling
                    deployments_per_day[day] += 1
                    try:
                        restart_stamps.pop()
                    except IndexError:
                        pass
    restarts_per_day = Counter(ts.date() for ts in restart_stamps)
    print("Date        Requests   MemMsgs  Restarts    Cycles   Deploys")
    for d in sorted(days):
        requests = requests_per_day[d]
        restarts = restarts_per_day[d]
        cycles = cyclings_per_day[d]
        quota_msgs = quota_msgs_per_day[d]
        deployments = deployments_per_day[d]
        print(
            f"{d}  {requests:8d}  {quota_msgs:8d}  {restarts:8d}  {cycles:8d}"
            f"  {deployments:8d}"
        )


if __name__ == "__main__":
    main()

About the summary:

  • Dates are in UTC
  • "Requests" is the number of HTTP requests made to the server
  • "MemMsgs" is the number of log messages emitted about error R14 (Memory quota exceeded) or R15 (Memory quota vastly exceeded)
  • "Restarts" is the number of restarts not caused by cycling or deployment
  • "Cycles" is the number of restarts caused by Heroku automatically restarting dynos every 24 hours
  • "Deploys" is the number of times a new server executable was deployed

@yarikoptic
Member

So starting from which date exactly? Is that the one with 1 restart:

2024-07-14      6904         0         1

? Then we have not really tested it under restart-causing conditions, which seem to happen whenever there are tens of thousands of requests... I guess we are not running healthchecks any longer... indeed we are not.

Here is the diff I am reverting for run.sh:
dandi@drogon:~/cronlib/dandisets-healthstatus$ git diff
diff --git a/tools/run.sh b/tools/run.sh
index 585de7d5..c7d3ee63 100755
--- a/tools/run.sh
+++ b/tools/run.sh
@@ -14,11 +14,11 @@ cd "$(dirname "$0")"/..
 
 "$PYTHON" -m virtualenv --clear venv
 . venv/bin/activate
-pip install ./code
+chronic pip install ./code
 #pip install 'git+https://github.com/fsspec/filesystem_spec'
 #pip install 'git+https://github.com/jwodder/filesystem_spec@rlock-cache'
-dandisets-healthstatus check -m "$MOUNT_PATH" -J 10 "$@"
-dandisets-healthstatus report
+dandisets-healthstatus check -m "$MOUNT_PATH" -J 10 --mode random-outdated-asset-first 000027
+#dandisets-healthstatus report
 
 # TODO: Uncomment this block when setting up the cronjob:
 #git add .
@@ -28,7 +28,7 @@ dandisets-healthstatus report
 #else echo "No changes to commit"
 #fi
 
-datalad save -m "Results from a sweep of an archive: $*"
-datalad push
+#datalad save -m "Results from a sweep of an archive: $*"
+#datalad push
 
 # vim:set et sts=4:
But apparently we are not on the `main` branch on drogon for healthchecks:
dandi@drogon:~/cronlib/dandisets-healthstatus$ git pull --ff-only
remote: Enumerating objects: 49, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 49 (delta 18), reused 18 (delta 18), pack-reused 28
Unpacking objects: 100% (49/49), 20.18 KiB | 666.00 KiB/s, done.
From github.com:dandi/dandisets-healthstatus
   eef70d99..789e8071  main       -> origin/main
Your configuration specifies to merge with the ref 'refs/heads/traverse-api'
from the remote, but no such ref was fetched.
dandi@drogon:~/cronlib/dandisets-healthstatus$ git branch
  ds-arg
  gh-2
  gh-47
  gh-9
  main
* traverse-api

@jwodder please bring back that repository onto the "main" line (I do not want to risk destroying any other changes you might want to keep) and invoke tools/run_loop.sh under screen so we get healthchecks going again and adding to the Requests counts above

@jwodder
Member

jwodder commented Jul 22, 2024

@yarikoptic

so starting from which date exactly?

Starting from the 12th at 3:45 PM EDT.

please bring back that repository into "main" line ... and invoke tools/run_loop.sh under screen

Done.

@jwodder
Member

jwodder commented Jul 22, 2024

@yarikoptic

@jwodder
Member

jwodder commented Jul 22, 2024

@yarikoptic Also, here is an up-to-date summary of the logs for July after fixing a couple of issues with the script:

Date        Requests   MemMsgs  Restarts    Cycles   Deploys
2024-07-01      3336      3431         0         0         0
2024-07-02      6771      3501         0         1         0
2024-07-03      4390      3737         0         1         0
2024-07-04      5710      4075         1         0         0
2024-07-05      3962      3948         0         1         0
2024-07-06       917      3822         0         1         0
2024-07-07      3467      3816         0         1         0
2024-07-08      2763      3972         0         1         0
2024-07-09      5389      3583         0         1         2
2024-07-10      6788      4051         1         0         2
2024-07-11      9824      4031         1         0         0
2024-07-12     29642      3882         2         1         2
2024-07-13     27674      3326         0         1         0
2024-07-14      6676         0         0         0         2
2024-07-15      3174         0         0         1         0
2024-07-16      4037       130         0         1         0
2024-07-17      3247         0         0         1         0
2024-07-18      5249         4         0         1         0
2024-07-19      7277       367         0         1         0
2024-07-20      5866      2644         0         1         0
2024-07-21      6655         0         0         1         0
2024-07-22      5370         0         0         0         1
Script
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = []
# ///
from __future__ import annotations
from collections import Counter
from datetime import UTC, datetime, timedelta
import re
import sys

# Assume any restarts this close to a deployment are due to it:
DEPLOY_WINDOW = timedelta(seconds=120)


def main() -> None:
    days = set()
    # Sometimes the same log message ends up in more than one log file, so we
    # have to dedup events by timestamp and then, once that's done, bucket them
    # by date.
    requests_stamps = set()
    quota_msgs_stamps = set()
    deployments_stamps = set()
    cyclings_stamps = set()
    restart_stamps = []
    current_deploy_time: datetime | None = None
    cycling = False
    for fname in sys.argv[1:]:
        with open(fname) as fp:
            for line in fp:
                m = re.match(
                    r"\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d\.\d+[-+]\d\d:\d\d\s*", line
                )
                if not m:
                    # Warn? Error?
                    continue
                timestamp = datetime.fromisoformat(m[0].strip()).astimezone(UTC)
                day = timestamp.date()
                days.add(day)
                line = line[m.end() :].strip()
                if line.startswith("heroku[router]:"):
                    requests_stamps.add(timestamp)
                elif re.fullmatch(
                    r"heroku\[web\.1\]: Error (?:R14 \(Memory quota exceeded\)"
                    r"|R15 \(Memory quota vastly exceeded\))",
                    line,
                ):
                    quota_msgs_stamps.add(timestamp)
                elif line == "heroku[web.1]: Cycling":
                    cycling = True
                elif re.fullmatch(
                    r"heroku\[web\.1\]: Stopping (all processes with SIGTERM"
                    r"|process with SIGKILL)",
                    line,
                ):
                    if cycling:
                        cyclings_stamps.add(timestamp)
                        cycling = False
                    elif not (
                        current_deploy_time is not None
                        and abs(timestamp - current_deploy_time) < DEPLOY_WINDOW
                    ):
                        restart_stamps.append(timestamp)
                elif line.startswith(("app[api]: Deploy by", "app[api]: Scaled to")):
                    assert not cycling
                    deployments_stamps.add(timestamp)
                    current_deploy_time = timestamp
                    while (
                        restart_stamps
                        and abs(timestamp - restart_stamps[-1]) < DEPLOY_WINDOW
                    ):
                        restart_stamps.pop()
    requests_per_day = Counter(ts.date() for ts in requests_stamps)
    quota_msgs_per_day = Counter(ts.date() for ts in quota_msgs_stamps)
    deployments_per_day = Counter(ts.date() for ts in deployments_stamps)
    cyclings_per_day = Counter(ts.date() for ts in cyclings_stamps)
    restarts_per_day = Counter(ts.date() for ts in set(restart_stamps))
    print("Date        Requests   MemMsgs  Restarts    Cycles   Deploys")
    for d in sorted(days):
        requests = requests_per_day[d]
        restarts = restarts_per_day[d]
        cycles = cyclings_per_day[d]
        quota_msgs = quota_msgs_per_day[d]
        deployments = deployments_per_day[d]
        print(
            f"{d}  {requests:8d}  {quota_msgs:8d}  {restarts:8d}  {cycles:8d}"
            f"  {deployments:8d}"
        )


if __name__ == "__main__":
    main()

@jwodder jwodder added the high priority Work on these first label Jul 22, 2024
@jwodder jwodder removed this from the Current priorities milestone Jul 22, 2024
@yarikoptic
Member

@yarikoptic

  • hit_recencies - histogram of how recently accessed before (on miss or hit) each hit was -- pretty much a histogram with int per each of the "size(cache)" entries.

To be clear, you want a report on the durations between each hit and the previous hit or miss for that manifest path, correct? I'm not sure how to interpret "a histogram with int per each of the 'size(cache)' entries."

not for that manifest path, but rather for that "cache index" if cache index how recently hit index. I.e. how recent we typically hit cache entries in terms of their recency -- do we hit most recent right away within a ms or it takes seconds?
What about least recent -- does it take hours to hit any later entry or it is also hit within seconds (hence we better increase cache size)?

@yarikoptic Also, do you want the log analysis tool to produce output in a specific format? Your use of field names in the description suggests that you want JSON or similar. Is the histogram supposed to be an actual graph produced with matplotlib, or would just a textual list of numbers be sufficient?

Textual would be sufficient; it could even be just a structured JSON/YAML document.

  • Are you sure the Zarr manifest cache logs should include the total memory used by the process? Getting that value slows things down.

If it slows things down, OK - let's skip it. But can we somehow track the sizes of the records in the cache entries? This relates to the questions below:

  • Do you want the Zarr manifest cache to continue to be bounded by number of entries, or do you want it to be bounded by the total size of the entries instead?

Can we augment the eviction policy, e.g. having registered a miss, check the total size and possibly trigger evictions until we fall under some size threshold?

So 86 MB -- good to know... so we should be able to cache at least a few.

@jwodder
Member

jwodder commented Jul 23, 2024

@yarikoptic

not for that manifest path, but rather for that "cache index"

If by "cache index" you mean the specific entry in the cache or its key, that's what I was referring to by "manifest path," as the cache entries are keyed by an identifier for the Zarr manifest. If that's not what you meant, then I have no idea what you're saying here.

if cache index how recently hit index. I.e. how recent we typically hit cache entries in terms of their recency -- do we hit most recent right away within a ms or it takes seconds?
What about least recent -- does it take hours to hit any later entry or it is also hit within seconds (hence we better increase cache size)?

I can't tell what you're trying to say here. Could you reword this or otherwise elaborate on exactly what information you want regarding recency? Could you give an example?

can we somehow track sizes of records in the cache entries?

I can log the size of an individual cache entry whenever it's hit, missed, or evicted. If you want the total size of all entries, we get that automatically if we choose to limit the cache by total entry size. Do you want either of these in the logs?

Can we augment the eviction policy, e.g. having registered a miss, check the total size and possibly trigger evictions until we fall under some size threshold?

If by "augment" you mean that you want the cache to be limited by both the number of entries and their total size, no. It's one or the other.

@jwodder
Member

jwodder commented Jul 25, 2024

@yarikoptic Ping.

@yarikoptic
Member

not for that manifest path, but rather for that "cache index"

If by "cache index" you mean the specific entry in the cache or its key, that's what I was referring to by "manifest path," as the cache entries are keyed by an identifier for the Zarr manifest. If that's not what you meant, then I have no idea what you're saying here.

if cache index how recently hit index. I.e. how recent we typically hit cache entries in terms of their recency -- do we hit most recent right away within a ms or it takes seconds?
What about least recent -- does it take hours to hit any later entry or it is also hit within seconds (hence we better increase cache size)?

I can't tell what you're trying to say here. Could you reword this or otherwise elaborate on exactly what information you want regarding recency? Could you give an example?

OK, let's play out an example with an LRU cache of size 2 and paths 1 2 3 4. First I list the requested path, then the cache contents ordered by recency of access among the entries -- the position of a path within that list is the "index" I was talking about -- and then cache-index-hit, i.e. which index was hit in the prior cache state by this new request:

  • 1 [1] cache-index-hit: None
  • 2 [ 2 1 ] cache-index-hit: None
  • 3 [ 3 2 ] cache-index-hit: None
  • 2 [ 2 3 ] cache-index-hit: 1
  • 2 [ 2 3 ] cache-index-hit: 0
  • 2 [ 2 3 ] cache-index-hit: 0
  • 3 [ 3 2 ] cache-index-hit: 1
  • 2 [ 2 3 ] cache-index-hit: 1

So in this case we would have a histogram of index-hits of {None: 3, 0: 2, 1: 3}: we had 3 misses (no index at all) and overall slightly preferred the 2nd most recent entry.
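
For concreteness, a tiny Python sketch that reproduces this example (an LRU cache of size 2 and the request sequence above) and produces that same histogram:

from collections import Counter

CACHE_SIZE = 2
requests = [1, 2, 3, 2, 2, 2, 3, 2]

cache = []      # paths ordered most-recently-used first
hits = Counter()

for path in requests:
    index = cache.index(path) if path in cache else None  # None == cache miss
    hits[index] += 1
    if index is not None:
        cache.pop(index)
    cache.insert(0, path)     # the requested path becomes most recent
    del cache[CACHE_SIZE:]    # evict anything beyond the cache size

print(dict(hits))  # -> {None: 3, 1: 3, 0: 2}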

can we somehow track sizes of records in the cache entries?

I can log the size of an individual cache entry whenever it's hit, missed, or evicted. If you want the total size of all entries, we get that automatically if we choose to limit the cache by total entry size. Do you want either of these in the logs?

but also "total" is just a sum of entries in the cache, right? so we can log individual sizes and compute by tracking history of miss/evict entries, right?

Given that we are experiencing out-of-memory conditions, shouldn't we indeed just limit the cache by total entry size and not by number of entries?

Can we augment the eviction policy, e.g. having registered a miss, check the total size and possibly trigger evictions until we fall under some size threshold?

If by "augment" you mean that you want the cache to be limited by both the number of entries and their total size, no. It's one or the other.

Gotcha... I was thinking about a really custom handler (some callback) which, upon a miss, would decide what (if any) entry to evict and/or whether to actually cache this new miss (e.g. we might not cache any entry which exceeds our memory size, or not cache queries known to be rare but large, while favoring caching other, smaller ones).

@jwodder
Member

jwodder commented Jul 25, 2024

@yarikoptic

but also "total" is just a sum of entries in the cache, right? so we can log individual sizes and compute by tracking history of miss/evict entries, right?

Yes. Do you want the logs to contain individual entry sizes, total entry sizes, both, or neither? Should any of this (or statistics derived therefrom) be included in the output from the log-analysis tool?

Given that we are experiencing out-of-memory conditions, shouldn't we indeed just limit the cache by total entry size and not by number of entries?

Do you still want the total size to be 100 MB as stated in this comment or a different value? And is that a decimal MB (= 1,000,000 bytes) or binary MB (= 1,048,576 bytes)?

@jwodder
Member

jwodder commented Jul 25, 2024

@yarikoptic Regarding cache indices: There's no guarantee that the logs fed to the analysis tool will start right at the start of a server run, so the tool could encounter records of hits for entries that it never saw previous accesses for, leading to potentially inaccurate cache hit indices. Do you have a preference for how that should be handled?

@yarikoptic
Member

@yarikoptic Regarding cache indices: There's no guarantee that the logs fed to the analysis tool will start right at the start of a server run

Why? Couldn't we restart the server and feed it the logs from that point on? Or do you mean that there is log rotation happening, so we might need to feed multiple log files for the complete "history"?

@jwodder
Member

jwodder commented Jul 26, 2024

@yarikoptic Yes, if we're careful, things will work, but if the tool is run on an arbitrary collection of logs without being careful, the logs will start in the middle of a server run. We can't count on always being careful.

@jwodder
Member

jwodder commented Jul 29, 2024

@yarikoptic Ping on the following questions:

  • Assuming you want the cache to be limited by total entry size, what should the limit be? You previously suggested 100 MB. Should that be the limit, and is "MB" here decimal (= 1,000,000 bytes) or binary (= 1,048,576 bytes)?

  • Do you want the logs to contain individual entry sizes, total entry sizes, both, or neither? Should any of this (or statistics derived therefrom) be included in the output from the log-analysis tool?

  • There's no guarantee that the logs fed to the analysis tool will start right at the start of a server run, so the tool could encounter records of hits for entries that it never saw previous accesses for, leading to potentially inaccurate cache hit indices. Do you have a preference for how that should be handled?

@jwodder
Member

jwodder commented Aug 5, 2024

@yarikoptic Ping.

@yarikoptic
Member

  • I don't really care whether it is base 1000 or base 1024. Overall - should be configurable
  • let's log both if cheap
  • Would we have access to the full cache to do a periodic (e.g. each hour) log dump of what is in the cache and the sizes? Maybe it could be done on receiving a HUP signal - then we could trigger it as frequently as we like as well

@jwodder
Member

jwodder commented Aug 5, 2024

@yarikoptic

Overall - should be configurable

Do you mean configurable via a CLI option or a constant in the code or something else?

  • Would we have access to the full cache to do a periodic (e.g. each hour) log dump of what is in the cache and the sizes?

Yes, that's doable.

Maybe it could be done on receiving a HUP signal - then we could trigger it as frequently as we like as well

That seems a bit superfluous. Is there even a way to send a SIGHUP on Heroku?

@jwodder
Member

jwodder commented Aug 5, 2024

@yarikoptic FYI, the logs for Zarr cache events currently look like this:

{"timestamp":"2024-08-05T10:50:13.135433-04:00","level":"DEBUG","fields":{"message":"Cache miss for Zarr manifest; about to fetch from repository","cache_event":"miss_pre","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","approx_cache_len":0,"approx_cache_size":0},"target":"dandidav::zarrman","span":{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.zarr/","version":"HTTP/1.1","name":"request"},{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"}]}
{"timestamp":"2024-08-05T10:50:13.965786-04:00","level":"DEBUG","fields":{"message":"Fetched Zarr manifest from repository","cache_event":"miss_post","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","manifest_size":18170624,"approx_cache_len":0,"approx_cache_size":0},"target":"dandidav::zarrman","span":{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.zarr/","version":"HTTP/1.1","name":"request"},{"id":"76b4f503-1be7-4316-8ba1-1daf548d7ce5","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"}]}
{"timestamp":"2024-08-05T10:50:15.469629-04:00","level":"DEBUG","fields":{"message":"Fetched Zarr manifest from cache","cache_event":"hit","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","manifest_size":18170624,"approx_cache_len":1,"approx_cache_size":18170624},"target":"dandidav::zarrman","span":{"id":"0bf7db6c-2f1a-4599-bbfa-3c8facc7f5cf","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.zarr/0/","version":"HTTP/1.1","name":"request"},{"id":"0bf7db6c-2f1a-4599-bbfa-3c8facc7f5cf","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","name":"get_zarr_manifest"}]}
{"timestamp":"2024-08-05T10:50:29.57411-04:00","level":"DEBUG","fields":{"message":"Zarr manifest evicted from cache","cache_event":"evict","cache":"zarr-manifests","manifest":"001/e3b/001e3b6d-26fb-463f-af28-520a25680ab4/326273bcc8730474323a66ea4e3daa49-113328--97037755426.json","manifest_size":18170624,"cause":"Expired"},"target":"dandidav::zarrman","span":{"id":"8561ffcf-3a2e-4e6e-9375-40c89cace59c","manifest":"003/eb0/003eb053-d45c-42e2-a040-79bca28c84ae/481a2f77ab786a0f45aafd5db0971caa-0--0.json","name":"get_zarr_manifest"},"spans":[{"method":"GET","uri":"/zarrs/003/eb0/003eb053-d45c-42e2-a040-79bca28c84ae/481a2f77ab786a0f45aafd5db0971caa-0--0.zarr/","version":"HTTP/1.1","name":"request"},{"id":"8561ffcf-3a2e-4e6e-9375-40c89cace59c","manifest":"003/eb0/003eb053-d45c-42e2-a040-79bca28c84ae/481a2f77ab786a0f45aafd5db0971caa-0--0.json","name":"get_zarr_manifest"}]}

@yarikoptic
Member

@yarikoptic

Overall - should be configurable

Do you mean configurable via a CLI option or a constant in the code or something else?

CLI or env var, or you could even establish a config file

May be could be done on receiving HUP signal - then we could trigger a frequently as we like as well

That seems a bit superfluous. Is there even a way to send a SIGHUP on Heroku?

I dunno, didn't check. Just know that HUP is often used as a signal to trigger some action (like reload of config or log rotation).

The logs for the cache look good to me. (I guess the cache size was 1 and that led to the eviction in the last entry?)

@jwodder
Member

jwodder commented Aug 6, 2024

@yarikoptic #170 should be finished and ready for review now.

Also, here is the latest summary of the Heroku logs:

Date        Requests   MemMsgs  Restarts    Cycles   Deploys
2024-07-01      3336      3431         0         0         0
2024-07-02      6771      3501         0         1         0
2024-07-03      4390      3737         0         1         0
2024-07-04      5710      4075         1         0         0
2024-07-05      3962      3948         0         1         0
2024-07-06       917      3822         0         1         0
2024-07-07      3467      3816         0         1         0
2024-07-08      2763      3972         0         1         0
2024-07-09      5389      3583         0         1         2
2024-07-10      6788      4051         1         0         2
2024-07-11      9824      4031         1         0         0
2024-07-12     29642      3882         2         1         2
2024-07-13     27674      3326         0         1         0
2024-07-14      6676         0         0         0         2
2024-07-15      3174         0         0         1         0
2024-07-16      4037       130         0         1         0
2024-07-17      3247         0         0         1         0
2024-07-18      5249         4         0         1         0
2024-07-19      7277       367         0         1         0
2024-07-20      5866      2644         0         1         0
2024-07-21      6655         0         0         1         0
2024-07-22     17268         0         0         0         2
2024-07-23     23557         0         0         1         0
2024-07-24     30840         0         0         1         0
2024-07-25     23018      2360         0         1         0
2024-07-26     23830         0         0         0         1
2024-07-27     30217         0         0         1         0
2024-07-28     33752      1511         0         1         0
2024-07-29     32343       484         0         1         0
2024-07-30     28100       574         0         1         0
2024-07-31     29263         0         0         1         0
2024-08-01     27751         0         0         1         0
2024-08-02     37921         0         0         1         0
2024-08-03     35148         0         0         1         0
2024-08-04     35380         0         0         1         0
2024-08-05     34857         0         0         1         0
2024-08-06     20461         0         0         0         0

@yarikoptic
Member

@jwodder could you put that script that processes the Heroku logs on a cron job somewhere and post a pointer to the file here? I am interested in where we are ATM on this.

@jwodder
Member

jwodder commented Aug 26, 2024

@yarikoptic

@yarikoptic
Member

@yarikoptic

yes

  • Are you saying that you want a cronjob created that runs the script periodically?

yes

If so, how often,

seems to be doing daily stats, so daily

what logfiles should be provided as input,

whatever you gave it to produce those stats.

and where should the output be saved?

It is not for public consumption, so anywhere, e.g. ~dandi/.cache/dandi/zarr-cache-stats.log. You could even tee it so the cron job emails those stats daily.

@jwodder
Member

jwodder commented Aug 26, 2024

@yarikoptic The first question was an either/or question, not a yes/no question.

@yarikoptic
Member

I was talking about "the script for generating the Requests/MemMsgs/Restarts/Cycles/Deploys table".

@jwodder
Member

jwodder commented Aug 27, 2024

@yarikoptic Done. The output is saved in /home/dandi/cronlib/dandidav-restarts/report.txt.
