Redirect URLs not resolved correctly #19
When a request is redirected to a new URL, the downloader middleware cannot resolve the redirect and will always return a 404 status code. An example of this: if the start URL https://www.example.com/ redirects to https://www.example.com/other_page/ and you run the example spider with a previously generated WACZ (using the WACZ extension), the request will return a 404 (because https://www.example.com/ is not in the CDXJ index) and scraping stops.
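To make the failure concrete, here is a minimal sketch of the lookup described above. The in-memory index and the lookup_response helper are invented for illustration and are not the extension's actual API:

```python
# Only the final response of the crawl was archived, so only the final
# URL appears in this (invented, in-memory) CDXJ index.
CDXJ_INDEX = {
    "https://www.example.com/other_page/": {"status": 200, "offset": 1024},
}

def lookup_response(url: str) -> dict:
    """Return the indexed record for `url`, or a synthetic 404."""
    record = CDXJ_INDEX.get(url)
    if record is None:
        # The start URL was never stored, so replaying the crawl from
        # the archive stops here with a 404.
        return {"status": 404}
    return record

print(lookup_response("https://www.example.com/"))             # {'status': 404}
print(lookup_response("https://www.example.com/other_page/"))  # the 200 record
```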
Comments

I've added a proposal in #20 (WIP). To summarize the issues with the current situation:

The main problem is that when using the downloader middleware with a previously generated WACZ, redirected requests cannot be resolved and come back as 404s.

Questions
What happens when you change the order of the middlewares, so that this one comes before the redirect middleware?
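For context, middleware order in Scrapy is controlled by the priority numbers in DOWNLOADER_MIDDLEWARES; lower numbers have their process_request called earlier. A sketch of the two orderings the question contrasts, with an assumed import path for the WACZ middleware (RedirectMiddleware's default priority of 600 is real Scrapy):

```python
# settings.py - sketch only; the WaczMiddleware path is an assumption.
DOWNLOADER_MIDDLEWARES = {
    # Scrapy's built-in RedirectMiddleware is enabled at priority 600.
    # At 500 the archive middleware sees requests *before* the redirect
    # middleware and can answer them straight from the WACZ:
    "scrapy_webarchive.middleware.WaczMiddleware": 500,
    # At 700 it would instead run after RedirectMiddleware:
    # "scrapy_webarchive.middleware.WaczMiddleware": 700,
}
```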
Hmmm, good thing to think about. At first glance, I would think that ideally all network interactions would (be able to) end up in the archive, including redirects, and perhaps even the work of middlewares that handle things like login or CDN evasion. But there are also cases where it is more convenient to have just the 'initial' requests and 'final' responses.

To make the spider work when crawling from the archive, just as it does online, it needs to find the response corresponding to a request, also when that request was redirected. I think storing the redirects is the most straightforward approach, actually; a lookup can then follow the stored redirect chain, as sketched below. There could be alternatives, like extension-specific fields, but I'd rather stick to what is standard.

If a request fails, we might want to store the failure too - but that is a different issue and we don't need to address it here, I think.
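As an illustration of the "store the redirects" approach: if the 3xx records are archived alongside the final responses, a lookup can walk the chain from the originally requested URL. A minimal sketch with an invented in-memory index (CDXJ_INDEX and resolve are not the extension's API):

```python
# A minimal sketch, assuming redirect records are stored in the archive.
CDXJ_INDEX = {
    "https://www.example.com/": {
        "status": 301,
        "location": "https://www.example.com/other_page/",
    },
    "https://www.example.com/other_page/": {"status": 200, "offset": 1024},
}

def resolve(url: str, max_hops: int = 10) -> dict | None:
    """Follow stored redirect records until a non-3xx record is found."""
    for _ in range(max_hops):
        record = CDXJ_INDEX.get(url)
        if record is None:
            return None  # this URL (or a hop of the chain) was never archived
        if 300 <= record["status"] < 400:
            url = record["location"]  # hop to the stored redirect target
            continue
        return record
    return None  # give up on a loop or an overly long chain

print(resolve("https://www.example.com/"))  # -> the archived 200 record
```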
For WACZs generated elsewhere, there can be duplicate requests/responses. When iterating over all responses, you will get all of them. When using the index, one may want to use the last entry found for a given URL (there will probably be multiple entries in the index), as in the sketch below.
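A small sketch of that "last one wins" strategy when building a lookup table from CDXJ lines. The lines below are made up, but follow the CDXJ shape of searchable key, timestamp, JSON payload:

```python
import json

cdxj_lines = [
    'com,example)/ 20240101000000 {"url": "https://www.example.com/", "status": "200"}',
    'com,example)/ 20240301000000 {"url": "https://www.example.com/", "status": "200"}',
]

index: dict[str, dict] = {}
for line in cdxj_lines:
    key, timestamp, payload = line.split(" ", 2)
    # Later lines overwrite earlier ones, so the last entry per key wins.
    index[key] = {"timestamp": timestamp, **json.loads(payload)}

print(index["com,example)/"]["timestamp"])  # 20240301000000
```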
…ed instead of only the success (#19)
Super!