Incorrect values in pages.jsonl for javascript redirects #760
Comments
Hm, yeah this is a bit tricky, since by the time the page load completes, it would have already redirected...
Actually, something like this is possible, where both are separate entries.
{"format":"json-pages-1.0","id":"pages","title":"Seed Pages","hasText":"false"}
{"id":"2014f52c-2a13-45bf-a1e9-76a30d5f3bb3","url":"https://www.energy.gov/justice/no-fear-act-data","loadState":1,"ts":"2025-02-21T06:41:38.883Z","status":403,"seed":true,"depth":0}
{"id":"9e3dc480-d9c3-4726-a95a-f6747e6fa11f","url":"https://www.energy.gov/","title":"Department of Energy","loadState":4,"ts":"2025-02-21T06:41:34.757Z","mime":"text/html","status":200,"seed":true,"depth":0,"favIconUrl":"https://www.energy.gov/themes/custom/energy_gov/favicon.ico"}
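With separate entries like these, the error response can be picked out of pages.jsonl by its status. The following is a minimal sketch, assuming the file layout shown above (one header line carrying a "format" key, then one JSON object per page). The function names `parse_pages` and `error_pages` are made up for illustration and are not part of the crawler.

```python
import json

# Sample lines copied from the comment above; a real run would read pages.jsonl from disk.
SAMPLE = '''{"format":"json-pages-1.0","id":"pages","title":"Seed Pages","hasText":"false"}
{"id":"2014f52c-2a13-45bf-a1e9-76a30d5f3bb3","url":"https://www.energy.gov/justice/no-fear-act-data","loadState":1,"ts":"2025-02-21T06:41:38.883Z","status":403,"seed":true,"depth":0}
{"id":"9e3dc480-d9c3-4726-a95a-f6747e6fa11f","url":"https://www.energy.gov/","title":"Department of Energy","loadState":4,"ts":"2025-02-21T06:41:34.757Z","mime":"text/html","status":200,"seed":true,"depth":0,"favIconUrl":"https://www.energy.gov/themes/custom/energy_gov/favicon.ico"}'''


def parse_pages(text):
    """Return the page records, skipping blank lines and the header line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if "format" in entry:  # header line, not a page record
            continue
        records.append(entry)
    return records


def error_pages(records):
    """Return records whose recorded HTTP status is 400 or above."""
    return [r for r in records if r.get("status", 0) >= 400]


pages = parse_pages(SAMPLE)
for page in error_pages(pages):
    print(page["status"], page["url"])
```

On the sample above this prints only the 403 entry for the removed page; the home page entry is left out because its status is 200.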
Is this what should already be happening right now? I don’t think I saw this, but I’ll go back and check. (It’s also a little complicated in my specific case, since I think both of these would have been seeds in my crawl — this is for Environmental Data & Governance Initiative (EDGI), monitoring changes to sites under Trump.)
I suppose my specific case makes this more complicated! Another approach that might be useful would be to just have one listing in pages.jsonl, with the redirect history included:
// Pretty-formatted for ease of reading
{
"id": "2014f52c-2a13-45bf-a1e9-76a30d5f3bb3",
"url": "https://energy.gov/justice/no-fear-act-data", // Tweaked this URL slightly to show a more interesting redirect history
"loadState": 4,
"ts": "2025-02-21T06:41:38.883Z",
"status": 200,
"seed": true,
"depth": 0,
"history": [
{"url": "https://energy.gov/justice/no-fear-act-data", "status": 302},
{"url": "https://www.energy.gov/justice/no-fear-act-data", "status": 403},
{"url": "https://www.energy.gov/", "status": 200}
]
}
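To illustrate how a consumer might use an entry shaped like this, here is a rough sketch. The "history" field is only the proposal from this comment, not something pages.jsonl is known to contain, and the helper names `client_redirect_hops` and `hides_error` are hypothetical. It assumes every hop that follows a non-3xx response was triggered on the client side.

```python
# Hypothetical entry using the proposed "history" field (not an existing pages.jsonl field).
entry = {
    "url": "https://energy.gov/justice/no-fear-act-data",
    "status": 200,
    "history": [
        {"url": "https://energy.gov/justice/no-fear-act-data", "status": 302},
        {"url": "https://www.energy.gov/justice/no-fear-act-data", "status": 403},
        {"url": "https://www.energy.gov/", "status": 200},
    ],
}


def client_redirect_hops(entry):
    """Return (previous, next) hop pairs where the previous response was not an HTTP 3xx.

    An HTTP redirect always starts from a 3xx response, so a hop that follows any
    other status is assumed here to come from JavaScript or a meta refresh.
    """
    history = entry.get("history", [])
    pairs = []
    for prev, nxt in zip(history, history[1:]):
        if not 300 <= prev["status"] < 400:
            pairs.append((prev, nxt))
    return pairs


def hides_error(entry):
    """True if any client-side hop started from a 4xx or 5xx response."""
    return any(prev["status"] >= 400 for prev, _ in client_redirect_hops(entry))


for prev, nxt in client_redirect_hops(entry):
    print(prev["status"], prev["url"], "->", nxt["url"])
```

With a record like this, the 302 hop would count as an ordinary HTTP redirect, while the 403 to 200 hop would be flagged as a client redirect that masks an error.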
This is an unusual situation and what’s right is probably debatable. There are a few pages I’m crawling where the server responds with a 403 error, but the error page includes JavaScript that immediately navigates to a different URL, which has a 200 status. The listing in pages.jsonl records values from the page that was redirected to via JS, i.e. it lists a 200 status and the title from the target page.
This page is a good example: https://www.energy.gov/justice/no-fear-act-data. The page has been removed, but instead of stopping to show the error, it immediately directs the user’s browser to the DOE home page at https://www.energy.gov/. If you use a client that doesn’t run JS, you’ll see this snippet in the source:
There’s probably room for debate as to what should be recorded in pages.jsonl here. I’m hitting this in a case where the redirect target is not really a meaningful equivalent and functions more as a way to hide the error, so I’d like to clearly differentiate HTTP vs. client redirects here. But I can also imagine lots of sites on static file servers (e.g. GitHub Pages) using this technique to implement dynamic routing. Maybe the pages.jsonl entry could record info about both responses in this kind of case?
I imagine the same or similar issues exist with <meta http-equiv="refresh"> redirects, too.
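For the meta refresh case, the redirect target can be read straight from the HTML without running any scripts. Below is a minimal sketch using Python's standard `html.parser`. The class and function names are invented for illustration, the sample markup is hypothetical, and the parsing of the `content` attribute is deliberately simple (it expects the common `delay; url=target` form with no spaces around the equals sign).

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collects the target URLs of <meta http-equiv="refresh"> tags."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        content = attr.get("content") or ""
        # Content usually looks like "0; url=https://example.org/"; the URL part is optional.
        _, _, rest = content.partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.targets.append(rest[4:].strip().strip("'\""))


def meta_refresh_targets(html):
    """Return every meta refresh target URL found in the given HTML string."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.targets


# Hypothetical markup, not taken from any page discussed in this issue.
sample = '<html><head><meta http-equiv="refresh" content="0; url=https://example.org/"></head></html>'
print(meta_refresh_targets(sample))
```

A check like this only covers meta refresh; a JavaScript redirect would still need the page to be loaded in a browser, or some pattern matching on the script source.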