
Redirect URLs not resolved correctly #19

Closed · leewesleyv opened this issue Nov 5, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

@leewesleyv (Collaborator)

When a request is redirected to a new URL, the downloader middleware cannot resolve the redirect and will always return a 404 status code. An example of this:

from scrapy import Spider


class ExampleSpider(Spider):
    name = "example"
    start_urls = ["https://www.example.com/"]

If the start URL redirects to https://www.example.com/other_page/ and you run the example spider against a previously generated WACZ (created with the WACZ extension), the request will return a 404 (because https://www.example.com/ is not in the CDXJ index) and the crawl will stop.

leewesleyv added the bug label Nov 5, 2024
leewesleyv added this to the Initial release milestone Nov 5, 2024

@leewesleyv (Collaborator, Author) commented Nov 5, 2024

I've added a proposal in #20 (WIP). To summarize the issues with the current situation:

  1. The start URLs contain https://www.example.com/; this is where the scraping process starts
  2. The request goes through the downloader middleware, and a new Request object is returned for the redirect (https://docs.scrapy.org/en/2.11/topics/downloader-middleware.html#scrapy.downloadermiddlewares.DownloaderMiddleware.process_response); see the sketch after this list
  3. The WaczExporter's response_received method is never called for the first request/response, so the redirect request/response is not written to the WARC
  4. The redirected request is scheduled and yielded, eventually returning a 200 and ending up in the WaczExporter's response_received, where the successful request/response is written to the WARC
  5. At the end of the crawl the WACZ is created and exported
  6. Starting a new crawl against the WACZ URI with the WaczMiddleware begins with a lookup of https://www.example.com/ in the CDXJ index. This URL is not present in the index, so a 404 is returned
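
To clarify step 2, this is roughly what Scrapy's built-in RedirectMiddleware does (heavily simplified, not the actual source): returning a Request from process_response replaces the response entirely, so the engine never sees the 3xx response and no response_received signal is fired for it.

# Heavily simplified sketch of Scrapy's RedirectMiddleware, for illustration only.
from urllib.parse import urljoin


class SimplifiedRedirectMiddleware:
    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308) and b"Location" in response.headers:
            location = response.headers[b"Location"].decode()
            # Returning a Request here short-circuits the chain: the 3xx response
            # is dropped and only the new (redirected) request continues.
            return request.replace(url=urljoin(request.url, location))
        return response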

The main problem is that, when using the WaczMiddleware downloader middleware, the requests are generated through the spider, but the crawl process does not have access to the live resource and does not know that a redirect exists for this URL. We also do not want to rely on the live data to be able to crawl the archive, since that would defeat the purpose of crawling against the archive. This means we need not only the 200 responses in the archive, but also the other responses (or only the redirects, 307, 308?).
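
For illustration, recording the redirects could look something like this (just a sketch with made-up names, not the actual extension code), using warcio in a downloader middleware whose process_response runs before the redirect is turned into a new Request:

# Hypothetical sketch: write 3xx responses to a WARC before RedirectMiddleware
# replaces them with a new Request. Names and file handling are made up.
from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter


class RedirectRecorderMiddleware:
    def __init__(self):
        self.warc_file = open("redirects.warc.gz", "wb")
        self.writer = WARCWriter(self.warc_file, gzip=True)

    def process_response(self, request, response, spider):
        if 300 <= response.status < 400:
            http_headers = StatusAndHeaders(
                f"{response.status} Redirect",  # reason phrase simplified
                # first header value only, simplified
                [(k.decode(), v[0].decode()) for k, v in response.headers.items()],
                protocol="HTTP/1.1",
            )
            record = self.writer.create_warc_record(
                response.url,
                "response",
                payload=BytesIO(response.body),
                http_headers=http_headers,
            )
            self.writer.write_record(record)
        # Return the response unchanged so the redirect middleware still handles it.
        return response

With the default middleware order this would need a priority number above 600 (where Scrapy's RedirectMiddleware sits), so that its process_response is called before the redirect is consumed.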

Questions

  • What are some things in the redirect middleware that we need to take into account while implementing this in our archive extension/downloader middleware/spider middleware?
  • Should we/do we need to write any other status codes to the archive that we currently do not write yet?
  • Is there an alternative to recording them in the archive?
  • How do we prevent duplicate requests/responses from being written to the archive? We can probably just add a flag when writing that we can check at a later point (see the sketch below)
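
For that last point, the flag could be as simple as a marker in request.meta (the key and helper names below are made up, just to illustrate the idea):

# Hypothetical: mark a request once its exchange has been written to the WARC,
# so later hooks can skip it instead of writing a duplicate record.
WARC_WRITTEN_META_KEY = "__warc_written"  # made-up meta key


def write_exchange_once(writer, request, response):
    if request.meta.get(WARC_WRITTEN_META_KEY):
        return  # already archived, skip
    writer.write_exchange(request, response)  # placeholder for the real WARC-writing call
    request.meta[WARC_WRITTEN_META_KEY] = True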

@wvengen (Member) commented Nov 11, 2024

> What are some things in the redirect middleware that we need to take into account while implementing this in our archive extension/downloader middleware/spider middleware?

What happens when you change the order of the middlewares, so that this one comes before the redirect middleware?
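
For example (the scrapy_webarchive module path below is a guess, purely to illustrate the ordering), giving the archive middleware a priority number above RedirectMiddleware's default of 600 would make its process_response run first:

# settings.py -- illustrative only; the module path for the WACZ middleware is assumed.
DOWNLOADER_MIDDLEWARES = {
    # Higher number = closer to the downloader, so its process_response runs earlier.
    "scrapy_webarchive.middleware.WaczMiddleware": 650,
    # Scrapy's built-in RedirectMiddleware sits at 600 by default.
}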

> Should we/do we need to write any other status codes to the archive that we currently do not write yet?

Hmmm, a good thing to think about. At first glance, I would think that ideally all network interactions would (be able to) end up in the archive, including redirects, and perhaps even middlewares doing things like login or CDN evasion. But there are also cases where it is more convenient to have just the 'initial' requests and 'final' responses.
When you configure e.g. the retry middleware, you already configure response codes. So it would also make sense to say: retries are handled there, we don't need to register the failed responses.

To make the spider work when crawling from the archive just as it does online, it needs to find the response corresponding to the request, also when it was redirected. I think storing the redirects is the most straightforward approach, actually. There could be alternatives, like extension-specific fields or so, but I'd rather stick to what is standard.
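
Roughly, with the redirects stored, the archive lookup could resolve them without touching the live site (a sketch with a made-up index_lookup helper, not the actual middleware code):

# Sketch: resolve archived redirects by following the stored Location header
# inside the archive itself. index_lookup() is a made-up helper that returns
# the archived record (status, headers, body) for a URL, or None.
from urllib.parse import urljoin


def resolve_from_archive(url, index_lookup, max_redirects=10):
    for _ in range(max_redirects):
        record = index_lookup(url)
        if record is None:
            return None  # not archived -> the middleware would return a 404 response
        if record.status in (301, 302, 303, 307, 308) and "Location" in record.headers:
            url = urljoin(url, record.headers["Location"])
            continue  # follow the archived redirect, no live request needed
        return record
    return None  # too many redirects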

If a request fails, we might want to store the failure too - but that would be a different issue. We don't need to address that here, I think.

> How do we prevent duplicate requests/responses from being written to the archive?

For WACZs generated elsewhere, there can be duplicate requests/responses. When iterating over all responses, you will get all of them. When using the index, one may want to use the last one found (there will probably be multiple entries in the index).
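
Picking the last matching entry could look like this (a sketch; CDXJ lines are of the form "urlkey timestamp json", and the helper name is made up):

# Sketch: when a URL has multiple entries in the CDXJ index, keep the last one.
# A CDXJ line looks roughly like:
#   com,example)/ 20241105123456 {"url": "https://www.example.com/", "status": "200", ...}
import json


def last_index_entry(cdxj_path, surt_key):
    last = None
    with open(cdxj_path) as f:
        for line in f:
            key, timestamp, json_block = line.rstrip("\n").split(" ", 2)
            if key == surt_key:
                last = json.loads(json_block)  # later entries overwrite earlier ones
    return last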

leewesleyv pushed a commit that referenced this issue Nov 13, 2024
@wvengen (Member) commented Nov 14, 2024

Super!
