Incorrect values in pages.jsonl for javascript redirects #760

Open
Mr0grog opened this issue Feb 9, 2025 · 3 comments

Mr0grog commented Feb 9, 2025

This is an unusual situation and what’s right is probably debatable, but there are a few pages I’m crawling where the server responds with a 403 error, but the error page includes javascript that immediately navigates to a different URL, which has a 200 status. The listing in pages.jsonl records values from the page that was redirected to via JS, i.e. it lists a 200 status and the title from the target page.

This page is a good example: https://www.energy.gov/justice/no-fear-act-data — the page has been removed, but instead of stopping to show the error, it just immediately directs the user’s browser to the DOE home page at https://www.energy.gov/. If you use a client that doesn’t run JS, you’ll see this snippet in the source:

<script type="text/javascript">
  window.location.href = "https://www.energy.gov/";
</script>

There’s probably room for debate as to what should be recorded in pages.jsonl here. I’m hitting this in a case where the redirect target is not really a meaningful equivalent and is functioning more in a way that hides the error, and so I’d like to clearly differentiate HTTP vs. client redirects here. But I can also imagine lots of sites on static file servers (e.g. GitHub pages) using this technique to implement dynamic routing. Maybe the pages.jsonl entry could record info about both responses in this kind of case?

I imagine the same or similar issues exist with <meta http-equiv="refresh"> redirects, too.
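For anyone checking their own crawls for this pattern, here is a minimal sketch of spotting both kinds of client-side redirect in a response body, using only the Python standard library. It is a heuristic, not how the crawler works: the function name `find_client_redirects` and the two regular expressions are assumptions made for illustration, and they only catch simple string-literal targets like the snippet above.

```python
import re
from html.parser import HTMLParser

# Heuristic patterns (assumptions): `location = "..."`, `window.location.href = "..."`,
# and `location.replace("...")` / `location.assign("...")` with a literal target.
JS_ASSIGN = re.compile(r"""\blocation(?:\.href)?\s*=\s*["']([^"']+)["']""")
JS_CALL = re.compile(r"""\blocation\.(?:replace|assign)\(\s*["']([^"']+)["']""")
META_URL = re.compile(r"""url\s*=\s*['"]?([^'";]+)""", re.IGNORECASE)


class _RedirectFinder(HTMLParser):
    """Collects (kind, target) pairs for client-side redirects found in HTML."""

    def __init__(self):
        super().__init__()
        self.targets = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script":
            self._in_script = True
        elif tag == "meta" and (attr.get("http-equiv") or "").lower() == "refresh":
            match = META_URL.search(attr.get("content") or "")
            if match:
                self.targets.append(("meta-refresh", match.group(1).strip()))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Only look inside <script> bodies, so prose that mentions
        # "location = ..." is not mistaken for a redirect.
        if self._in_script:
            for pattern in (JS_ASSIGN, JS_CALL):
                for match in pattern.finditer(data):
                    self.targets.append(("javascript", match.group(1)))


def find_client_redirects(html):
    """Return a list of (kind, target_url) tuples; empty if none are found."""
    finder = _RedirectFinder()
    finder.feed(html)
    finder.close()
    return finder.targets


snippet = (
    '<script type="text/javascript">\n'
    '  window.location.href = "https://www.energy.gov/";\n'
    "</script>"
)
print(find_client_redirects(snippet))
```

Redirects built from variables or triggered by timers would need a real browser to detect, which is part of why recording this inside the crawler would be more reliable than post-processing.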

@ikreymer
Member

Hm, yeah this is a bit tricky, since by the time the page load completes, it would have already redirected...
I agree that it should save the original page if possible. It looks like the original URL is used, but the status is from the newly navigated-to page. The status should be easier to fix, but the title might be tricky as well, since it's never actually shown in the browser.

@ikreymer
Member

Actually, something like this is possible, where both are separate entries.
The first one would have loadState of 1, while the second is the actual page that gets loaded.
Also unsure if both should be marked as seeds. I guess the answer is yes, since the redirected page becomes a seed internally.

{"format":"json-pages-1.0","id":"pages","title":"Seed Pages","hasText":"false"}
{"id":"2014f52c-2a13-45bf-a1e9-76a30d5f3bb3","url":"https://www.energy.gov/justice/no-fear-act-data","loadState":1,"ts":"2025-02-21T06:41:38.883Z","status":403,"seed":true,"depth":0}
{"id":"9e3dc480-d9c3-4726-a95a-f6747e6fa11f","url":"https://www.energy.gov/","title":"Department of Energy","loadState":4,"ts":"2025-02-21T06:41:34.757Z","mime":"text/html","status":200,"seed":true,"depth":0,"favIconUrl":"https://www.energy.gov/themes/custom/energy_gov/favicon.ico"}

Author

Mr0grog commented Feb 21, 2025

> something like this is possible, where both are separate entries. The first one would have loadState of 1, while the second is the actual page that gets loaded.

Is this what should already be happening right now? I don’t think I saw this, but I’ll go back and check. (It’s also a little complicated in my specific case, since I think both of these would have been seeds in my crawl — this is for Environmental Data & Governance Initiative (EDGI), monitoring changes to sites under Trump.)

> Also unsure if both should be marked as seeds

I suppose my specific case makes this more complicated! Would https://www.energy.gov/ be listed multiple times (presumably with different id values) if it was both a literal seed and the destination of a client redirect from this other seed?
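To make the question concrete, here is a sketch of what a consumer would need to do if the same URL could appear more than once: group entries by `url` and keep every `id`. The two sample lines and their `id` values are hypothetical, invented for illustration, and are not real crawler output.

```python
import json
from collections import defaultdict

# Hypothetical lines: the same URL listed once as a literal seed and once
# as the target of a client redirect. The ids are made up.
HYPOTHETICAL_LINES = [
    '{"id":"seed-entry","url":"https://www.energy.gov/","status":200,"seed":true,"depth":0}',
    '{"id":"redirect-target-entry","url":"https://www.energy.gov/","status":200,"seed":true,"depth":0}',
]


def ids_by_url(lines):
    """Map each URL to the list of entry ids that recorded it."""
    grouped = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        if "url" in record:  # skips a header line, which has no "url"
            grouped[record["url"]].append(record["id"])
    return dict(grouped)


duplicates = {url: ids for url, ids in ids_by_url(HYPOTHETICAL_LINES).items() if len(ids) > 1}
print(duplicates)
```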

Another approach that might be useful would be to just have one listing in pages.jsonl, but include a redirect history, e.g.:

// Pretty-formatted for ease of reading
{
  "id": "2014f52c-2a13-45bf-a1e9-76a30d5f3bb3",
  "url": "https://energy.gov/justice/no-fear-act-data",  // Tweaked this URL slightly to show a more interesting redirect history
  "loadState": 4,
  "ts": "2025-02-21T06:41:38.883Z",
  "status": 200,
  "seed": true,
  "depth": 0,
  "history": [
    {"url": "https://energy.gov/justice/no-fear-act-data", status: 302},
    {"url": "https://www.energy.gov/justice/no-fear-act-data", status: 403},
    {"url": "https://www.energy.gov/", status: 200}
  ]
}
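With a `history` list like this, telling HTTP redirects from client redirects becomes a small post-processing step. The sketch below assumes the proposed (not existing) `history` field, and assumes that any non-3xx hop followed by another hop was left via a client-side redirect; the function names are hypothetical.

```python
def classify_hops(history):
    """Label each hop as 'http-redirect', 'client-redirect', or 'final'."""
    labels = []
    for index, hop in enumerate(history):
        if index == len(history) - 1:
            kind = "final"
        elif 300 <= hop["status"] < 400:
            kind = "http-redirect"
        else:
            # Assumption: a non-3xx response that still led somewhere else
            # must have navigated via JavaScript or a meta refresh.
            kind = "client-redirect"
        labels.append((hop["url"], hop["status"], kind))
    return labels


def hidden_error(history):
    """Return the first non-final hop with status >= 400, or None."""
    for hop in history[:-1]:
        if hop["status"] >= 400:
            return hop
    return None


# The history from the example entry above.
history = [
    {"url": "https://energy.gov/justice/no-fear-act-data", "status": 302},
    {"url": "https://www.energy.gov/justice/no-fear-act-data", "status": 403},
    {"url": "https://www.energy.gov/", "status": 200},
]

for url, status, kind in classify_hops(history):
    print(status, kind, url)
print(hidden_error(history))
```

A single entry with history also sidesteps the question of whether the redirect target should be listed as a second seed.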
