Understand nextstrain.org caching #832
This is certainly odd behaviour. Off the bat, I can't say with certainty what the cause might be. It can certainly be understood, but it's the kind of situation where being able to reproduce and poke at the behaviour is immensely useful for gaining that understanding. It would also be useful to see the server's logs from running your test script, as they'll show caching information.

Between these two uploads, did you access the dataset via the nextstrain.org server?

In any case, unless you preserved the full context in which you saw the behaviour, or we can reproduce it (I can't offhand), I suspect it's not worth digging into further, as we'll be missing access to important information.

Yes (although maybe only for one of the APIs, see below). That's how the cache became invalid, although it should have been detected as such during revalidation by one of the many layers involved here.
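For anyone following along, the revalidation in question is an ordinary conditional request: the cache sends its stored ETag back as If-None-Match, and a 304 means "reuse your stored copy" while a 200 means "here's a replacement". A minimal sketch (the URL and ETag are just illustrative, copied from the logs further down):

// Sketch of the revalidation step that should have caught the change.
const url = "https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json";
const cachedEtag = '"a6fb19653970598e7cb514f01122cf73"';   // whatever the cache stored

const res = await fetch(url, { headers: { "If-None-Match": cachedEtag } });
console.log(res.status === 304
  ? "304: entry revalidated, reuse the stored body and headers"
  : `200: object changed, replace the cache entry (new etag ${res.headers.get("etag")})`);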
Server logs (while running the test script whose output is shown above, and some line breaks added for clarity)
We may decide it's not worth digging into this one, but it's important for the wider team to realise that the server has caching bugs in it. We've talked about server logging improvements, and this would be a good example where debug-level logging (to a file?) would have been useful; we could also use this in tests to better debug the many CI failures we are getting. Node.js v18 (we were using v16 until a few weeks ago) has a browser-compatible fetch built in.
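A minimal sketch of what a request to upstream looks like with that built-in fetch (not our server code; the URL is illustrative). Note that, unlike make-fetch-happen, it ships with no persistent HTTP cache, so conditional revalidation would be ours to implement:

// Node >= 18 built-in, browser-compatible fetch; no HTTP cache of its own.
const res = await fetch("https://nextstrain-staging.s3.amazonaws.com/zika.json", {
  headers: { Accept: "application/json" },
});

console.log(res.status);
console.log("content-encoding:", res.headers.get("content-encoding"));
console.log("etag:", res.headers.get("etag"));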
A little more digging and logging of the upstream response (S3 + cache) in the server:

RESTful API:

proxyFromUpstream
request url: https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json
request accept-encoding: gzip
[verbose] [fetch] GET https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache: no-cache)
[verbose] [fetch] 200 OK https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache revalidated, timestamp Fri, 19 Apr 2024 04:53:46 GMT)
proxyResponseBodyFromUpstream
(upstream response) headers:
Content-Encoding null **** zika shows "gzip" here *****
Content-Type application/json
etag "a6fb19653970598e7cb514f01122cf73"
last-modified Tue, 16 Apr 2024 04:11:22 GMT
cache key make-fetch-happen%3Arequest-cache%3Ahttps%3A%2F%2Fnextstrain-staging.s3.amazonaws.com%2Fseasonal-flu_h3n2_ha_2y.json
cache hash sha512-AxkNCcsms9SmJPycZsJaZPX2n7dGH8NCQ5px9t4opKF4eL8CFUmHsCE8%2BrX9VxKrKYje0pigr6IA8FsPNr4TfQ%3D%3D
cache time Fri, 19 Apr 2024 04:53:46 GMT
(upstream request) compress? false
(_our_ response) Content-Encoding undefined **** zika shows "gzip" here *****
proxyResponseBodyFromUpstream streaming to client (without decompressing)

Charon API:

Getting (nextstrain) datasets for: prefix=staging/seasonal-flu/h3n2/ha/2y
proxyFromUpstream
request url: https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json
request accept-encoding: gzip
[verbose] [fetch] GET https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache: no-cache)
[verbose] [fetch] 200 OK https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json (cache revalidated, timestamp Mon, 22 Apr 2024 01:01:47 GMT)
proxyResponseBodyFromUpstream
(upstream response) headers:
Content-Encoding null **** zika shows "gzip" here *****
Content-Type application/json
etag "a6fb19653970598e7cb514f01122cf73"
last-modified Tue, 16 Apr 2024 04:11:22 GMT
cache key make-fetch-happen%3Arequest-cache%3Ahttps%3A%2F%2Fnextstrain-staging.s3.amazonaws.com%2Fseasonal-flu_h3n2_ha_2y.json
cache hash sha512-AxkNCcsms9SmJPycZsJaZPX2n7dGH8NCQ5px9t4opKF4eL8CFUmHsCE8%2BrX9VxKrKYje0pigr6IA8FsPNr4TfQ%3D%3D
cache time Mon, 22 Apr 2024 01:01:47 GMT
(upstream request) compress? false
(_our_ response) Content-Encoding undefined **** zika shows "gzip" here *****
proxyResponseBodyFromUpstream streaming to client (without decompressing)

Looking at the server, it seems clear that the upstream response headers (S3 via a cache layer) don't have Content-Encoding set. When you clear the cache it is set to gzip (as expected), so I think it's pretty clear this is a cache-revalidation bug in one of the fetch layers. It's probably not worth spending much more time here, but there is one question I'd like to answer to better understand the system: from my logging (above) it seems our server does not set the Content-Encoding header, yet the test script ("client") shows it set (to "gzip") for the RESTful API but not the Charon API. I can't explain this difference, and would really like to know where in the chain that header is being added.

P.S. I'm pretty sure the cache times change between requests because the first request updates the cache before the second request arrives, so that's not strange like I originally thought. It happens for zika as well.

Server commands to produce the logging shown above:

diff --git a/src/upstream.js b/src/upstream.js
index 6a892b89..20c4ca98 100644
--- a/src/upstream.js
+++ b/src/upstream.js
@@ -43,6 +43,11 @@ export async function deleteByUrls(urls) {
* @param {String} accept - HTTP Accept header value
*/
export async function proxyFromUpstream(req, res, url, accept) {
+
+ console.log("proxyFromUpstream")
+ console.log("\trequest url:", url)
+ console.log("\trequest accept-encoding:", req.acceptsEncodings("gzip") ? 'gzip' : 'identity')
+
return await proxyResponseBodyFromUpstream(req, res, new Request(url, {
headers: {
Accept: accept,
@@ -190,7 +195,6 @@ export async function proxyToUpstream(req, res, upstreamUrlExtractor, contentTyp
*/
async function proxyResponseBodyFromUpstream(req, res, upstreamReq) {
const upstreamRes = await fetch(upstreamReq);
-
switch (upstreamRes.status) {
case 200:
case 304:
@@ -238,6 +242,19 @@ async function proxyResponseBodyFromUpstream(req, res, upstreamReq) {
res.set(copyHeaders(upstreamRes.headers, forwardedUpstreamResHeaders));
+
+ console.log(`proxyResponseBodyFromUpstream`)
+
+ console.log("(upstream response) headers:")
+ console.log("\tContent-Encoding", upstreamRes.headers.get('Content-Encoding'))
+ console.log("\tContent-Type", upstreamRes.headers.get('Content-Type'))
+ console.log("\tetag", upstreamRes.headers.get('etag'))
+ console.log("\tlast-modified", upstreamRes.headers.get('last-modified'))
+ console.log("\tcache key", upstreamRes.headers.get('x-local-cache-key'))
+ console.log("\tcache hash", upstreamRes.headers.get('x-local-cache-hash'))
+ console.log("\tcache time", upstreamRes.headers.get('x-local-cache-time'))
+ console.log("(upstream request) compress?", upstreamReq.compress)
+ console.log("(_our_ response) Content-Encoding", res.get("Content-Encoding"))
/* Allow private (e.g. browser) caches to store this response, but require
* them to revalidate it every time before use. They'll make conditional
* requests which we can respond to quickly from our own server-side cache.
@@ -252,6 +269,7 @@ async function proxyResponseBodyFromUpstream(req, res, upstreamReq) {
res.removeHeader("Content-Type");
res.removeHeader("Content-Length");
res.removeHeader("Content-Encoding");
+ console.log("**** SENDING 304 ****")
return res.status(304).end();
}
@@ -266,6 +284,7 @@ async function proxyResponseBodyFromUpstream(req, res, upstreamReq) {
* -trs, 9 Dec 2021
*/
if (res.get("Content-Encoding") === "gzip" && !req.acceptsEncodings("gzip")) {
+ console.log("proxyResponseBodyFromUpstream Unzipping because client can't handle content-encoding")
res.removeHeader("Content-Encoding");
res.removeHeader("Content-Length");
@@ -275,6 +294,7 @@ async function proxyResponseBodyFromUpstream(req, res, upstreamReq) {
await pipeline(upstreamRes.body, zlib.createGunzip(), res);
} else {
+ console.log("proxyResponseBodyFromUpstream streaming to client (without decompressing)")
try {
await pipeline(upstreamRes.body, res);
} catch (err) {
@@ -286,6 +306,7 @@ async function proxyResponseBodyFromUpstream(req, res, upstreamReq) {
if (err?.code !== 'ERR_STREAM_PREMATURE_CLOSE') throw err;
}
}
+ console.log("\n\n\n")
return res.end();
}
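For anyone who wants to poke at this caching layer outside the server, a rough standalone sketch (the cachePath and URL are placeholders) that surfaces the same x-local-cache-* headers as the logging above:

// Probe make-fetch-happen's cache/revalidation behaviour directly.
import makeFetchHappen from "make-fetch-happen";

const fetch = makeFetchHappen.defaults({ cachePath: "/tmp/mfh-probe" });
const url = "https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json";

for (const attempt of [1, 2]) {
  // cache: "no-cache" = a stored response may only be reused after it has
  // been revalidated against the origin (conditional GET using the stored ETag).
  const res = await fetch(url, { cache: "no-cache" });
  console.log(`attempt ${attempt}:`, res.status,
    "| x-local-cache-status:", res.headers.get("x-local-cache-status"),
    "| content-encoding:", res.headers.get("content-encoding"));
}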
@jameshadfield Do you have a reproduction for this locally, starting from a fresh cache? Because try as I might, I can't reproduce this with a test dataset that I manipulate on S3 between requests to a local server. I see the cache updated (not revalidated) on the first request after switching the dataset in S3 from no compression to compressed.

The compression middleware (nextstrain.org/src/routing/setup.js, line 30 in b031d11) may also be in play here. You can see its logs by running the server with debug logging enabled for it (via the DEBUG environment variable).

Looking back at what you said you did and the logs you experienced… how did you download the JSONs?
I'm generally for moving from third-party libraries to the stdlib when possible, but I'm skeptical about this one. It was marked "Stability 1 - Experimental" up until v20 LTS, and is only just stable in v21. The experimental stability level means "may have backwards-incompatible breakages at any time; do not use in production".
Node's stdlib fetch comes from Undici. It is also light on cache documentation, but does expose this: https://undici.nodejs.org/#/docs/api/CacheStorage

In general, Undici looks useful, but I'm still not sold on switching. Switching to using its…
OK, I'm going to stop looking at this for now, but what I found re: caching in Undici is not promising. It's…
Thanks! This was the layer I had forgotten about. I will find some time to check whether the upstreamResponse body is uncompressed and we're compressing it in the compression middleware, or whether the body is already gzipped and we're gzipping it again. The script output in the original issue here would point to the latter, especially if the content-encoding is being added for the RESTful API but not Charon, but I'll try to confirm this.
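For reference, a quick way to make that check would be to peek at the body's gzip magic bytes; a tiny sketch (looksGzipped is just an illustrative helper, not code in the repo):

// A buffer that starts with 0x1f 0x8b is (almost certainly) already gzipped,
// so the compression middleware should not gzip it again.
function looksGzipped(buf) {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}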
Nope (and I haven't tried to recreate it). I'm sure it's not a common problem we will run into, but it's certainly a bug, and it's been a useful exercise for me to learn more about caching. Thanks for the pointers!
Caching bugs are certainly possible and in general I wouldn't bet against them, but in this case I don't think there's a bug in our server. With some more testing, all signs point to a sequence of events whose last step is a cache revalidation, not a cache update, because S3 does not change the ETag when only the object's Content-Encoding header changes. In my prior testing the S3 object's bytes were changing, so the ETag was changing, and I couldn't reproduce what you reported. But once I ensure the bytes don't change and only the Content-Encoding changes, I can reproduce it.

$ echo '{"hello":"gzip"}' | gzip > gzipped
$ aws s3api put-object --bucket nextstrain-staging --key trs_test.json --body gzipped --content-type application/json
{
"ETag": "\"3c099fc7a460e777422ae4445291de38\"",
"ServerSideEncryption": "AES256"
}
$ curl -si --compressed https://nextstrain-staging.s3.amazonaws.com/trs_test.json | (while read -r line; do echo "$line"; [[ "$line" == $'\r' ]] && exec hexdump -C; done)
HTTP/1.1 200 OK
x-amz-id-2: qzk1gCZiVQtAjWeBe3jQZuDywixkXKcQFk0FyxiV4BmRag9bFUZXmei4UjDyB2ovhMiOZQKxJak=
x-amz-request-id: EEDMETHC1F1R9JTE
Date: Mon, 22 Apr 2024 21:24:11 GMT
Last-Modified: Mon, 22 Apr 2024 21:23:56 GMT
ETag: "3c099fc7a460e777422ae4445291de38"
x-amz-server-side-encryption: AES256
x-amz-version-id: null
Accept-Ranges: bytes
Content-Type: application/json
Server: AmazonS3
Content-Length: 37
00000000 1f 8b 08 00 60 d5 26 66 00 03 ab 56 ca 48 cd c9 |....`.&f...V.H..|
00000010 c9 57 b2 52 4a af ca 2c 50 aa e5 02 00 f5 54 66 |.W.RJ..,P.....Tf|
00000020 66 11 00 00 00 |f....|
00000025
$ aws s3api put-object --bucket nextstrain-staging --key trs_test.json --body gzipped --content-type application/json --content-encoding gzip
{
"ETag": "\"3c099fc7a460e777422ae4445291de38\"",
"ServerSideEncryption": "AES256"
}
$ curl -si --compressed https://nextstrain-staging.s3.amazonaws.com/trs_test.json | (while read -r line; do echo "$line"; [[ "$line" == $'\r' ]] && exec hexdump -C; done)
HTTP/1.1 200 OK
x-amz-id-2: AMoa8yHwNoWimnCMV+uckIfl94btqoWZGV21vlcpcT2wY8WVfnZV3qINkvqu+wLdk8lz0/VN1sU=
x-amz-request-id: ZERP234SJ600759G
Date: Mon, 22 Apr 2024 21:24:24 GMT
Last-Modified: Mon, 22 Apr 2024 21:24:21 GMT
ETag: "3c099fc7a460e777422ae4445291de38"
x-amz-server-side-encryption: AES256
Content-Encoding: gzip
x-amz-version-id: null
Accept-Ranges: bytes
Content-Type: application/json
Server: AmazonS3
Content-Length: 37
00000000 7b 22 68 65 6c 6c 6f 22 3a 22 67 7a 69 70 22 7d |{"hello":"gzip"}|
00000010 0a |.|
00000011

Start our server with a fresh cache and then run your test script (pointed at trs_test.json) to reproduce.
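The same repro sketched with the AWS SDK for JavaScript v3, in case the aws CLI isn't handy (bucket and key as above; this is a sketch, not code from the repo). The ETag, being the MD5 of the bytes for a non-multipart upload, stays identical across both puts:

// Upload the same gzipped bytes twice: once without Content-Encoding, once with.
import { S3Client, PutObjectCommand, HeadObjectCommand } from "@aws-sdk/client-s3";
import { gzipSync } from "node:zlib";

const s3 = new S3Client({});
const Bucket = "nextstrain-staging";
const Key = "trs_test.json";
const Body = gzipSync('{"hello":"gzip"}\n');

for (const ContentEncoding of [undefined, "gzip"]) {
  await s3.send(new PutObjectCommand({
    Bucket, Key, Body,
    ContentType: "application/json",
    ...(ContentEncoding ? { ContentEncoding } : {}),
  }));
  const head = await s3.send(new HeadObjectCommand({ Bucket, Key }));
  console.log({ ContentEncoding, ETag: head.ETag, headContentEncoding: head.ContentEncoding });
}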
ETags are supposed to be aware of content-encoding, but S3's apparently aren't.
Yeah, that sounds about right. And in reality that's a perfectly plausible series of events which we should expect to happen every now and again when uploading datasets. To summarise: the upstream content-encoding changed and we did not update our cache. We may say the problem is upstream of our server, and it may well be, but at the end of the day this is a bug which manifests in our server. Maybe it's not worth solving (I'm guessing it's not), but it's something we should be aware of. The compression middleware, and more specifically its different behaviour across our two APIs, was the piece of the puzzle that made the output very hard to understand. Thanks for looking into it for me; the pointers helped me figure out what's happening.
Yeah. This automatic compression wasn't being applied to the Charon responses because they lack a Content-Type; there's a separate issue for that:
For example, our RESTful endpoints perform content negotiation and use the result to determine Content-Type (and also what to Accept in the request to upstream), but our Charon endpoints do not set Content-Type at all. We should set it based on the upstream response in that situation. I noticed this bug (while investigating another issue¹) because the automatic compression middleware wasn't applying to the /charon/getDataset endpoint due to the lack of Content-Type.

¹ <#832>
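A rough sketch of what "set it based on the upstream response" could mean in practice (a hypothetical helper, not the actual Charon handler):

// Forward the upstream Content-Type so downstream middleware (e.g. the
// compression middleware) can key off it; fall back to JSON, on the
// assumption that Charon datasets are JSON.
function forwardContentType(res, upstreamRes) {
  const contentType = upstreamRes.headers.get("Content-Type");
  if (contentType) {
    res.set("Content-Type", contentType);   // Express response
  } else {
    res.type("application/json");
  }
}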
As part of writing up this issue I fixed it, but it's still really weird to me, and I'd like to understand why this happened.
Description of bug
As part of #811 I uploaded new datasets to our staging bucket, by downloading then uploading the flu/seasonal/h3n2/ha/2y JSONs and renaming them to seasonal-flu, so we could update our tests to use the latter. I initially uploaded them using aws s3 cp (so the data wasn't zipped), and then re-uploaded them using nextstrain remote upload ... (which magically zips the data). The strangeness is to do with the associated content-encoding metadata and/or the actual encoding of the file. The file is gzipped (it's 1.1 MB, and uncompressed it's 37 MB), and the content encoding is gzip (you can see this in the AWS console).

Using a small script to access this file (cf. the staging/zika dataset) via various APIs gives me weird results.
Test script
Run the server locally at the default port 5000 and then run the following script:
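A rough stand-in for the script (not the exact script that was run; the RESTful and Charon URL shapes and the use of Node's fetch are illustrative):

// Fetch the dataset three ways and report status, content-encoding, and the
// first body bytes (1f 8b 08 means the bytes are still gzipped).
const targets = {
  restful: "http://localhost:5000/staging/seasonal-flu/h3n2/ha/2y",
  charon:  "http://localhost:5000/charon/getDataset?prefix=staging/seasonal-flu/h3n2/ha/2y",
  s3:      "https://nextstrain-staging.s3.amazonaws.com/seasonal-flu_h3n2_ha_2y.json",
};

for (const [name, url] of Object.entries(targets)) {
  const res = await fetch(url, { headers: { Accept: "application/json" } });
  const bytes = Buffer.from(await res.arrayBuffer());
  console.log(name, res.status,
    "| content-encoding:", res.headers.get("content-encoding"),
    "| first bytes:", bytes.subarray(0, 4).toString("hex"));
}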
When I run this script locally I get:
The weirdness (for me) is:

The new dataset is used in requests.test.js, and this test passes on GitHub Actions but not locally. This is what led me down this rabbit hole, and hinted at a caching issue.

I have a mad hunch that the server's reported content-encoding: gzip for the REST API response is somehow corrupted, and the script's fetch call is not respecting it, but I don't understand the mechanics well enough.

The fix
call is not respecting it, but I don't understand the mechanics well enough.The fix
The following patch, which creates a new (empty) cache for the server's
fetch
machinery, fixed it (response content encoding: gzip for all responses, and correct JSON contents):Note: keeping the old cache but changing to
cache: "no-store"
, orcache: "no-cache"
, orcache: "reload"
did not fix the bug.Questions
Why was the cache we are using not detecting that the cached data was now invalid? We set cache: no-cache on the request, so the response must be validated with the origin server before each reuse, but it seems we didn't pick up that the contents/metadata had changed. (I mean, is this what happened? Clearly the caching used by fetch (make-fetch-happen) is at fault here somehow.)

Why was the content-encoding different on the Charon (null) vs RESTful API (gzip) response, and why did this seemingly not affect decompression (or lack thereof) of the response?

P.S. It looks like macOS will remove /tmp files after 3 days, so there were a few different things that lined up in order for me to crash into this bug.
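Re: "The fix" above: the patch itself isn't shown here, but the gist of giving the server's fetch machinery a new, empty cache with make-fetch-happen would look something like this (the directory naming is illustrative, not the actual change):

// Point make-fetch-happen at a fresh cache directory and keep revalidation on.
import makeFetchHappen from "make-fetch-happen";
import { mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

export const fetch = makeFetchHappen.defaults({
  cachePath: mkdtempSync(join(tmpdir(), "nextstrain-fetch-cache-")),  // fresh, empty dir
  cache: "no-cache",  // always revalidate stored entries before reuse
});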