Recommendations for returning status codes / redirect flow #173

benjohnsonn · 2024-03-19T06:51:17Z

Hi there! I've been experimenting with spider for a bit to crawl page content and various on-page elements.

From what I understand, the library is lower-level than something like Scrapy (which I used to use), and so seems designed to add scraping functionality to a more robust crawler project.

I was wondering if you have any recommendations for tools/libraries that would also return values like status codes, redirect URL chains, and more? I've used reqwest a little, but I'm not sure if there is a better solution out there!

For context, one use case would be feeding URLs from Google Search Console, Google Analytics, the sitemap, and URLs discovered from a crawl into an exhaustive index of all possible pages for a site (real, redirected, errored, etc.). spider is useful for the discovery crawl, but page status, redirect chain info, and other network info are also key, so I'm trying to figure out how to capture this info efficiently!

Answered by j-mendez

j-mendez · 2024-03-19T13:13:06Z

Maintainer

Hi @benjohnsonn, yes, Spider is low level. You can get the redirect URL, error, and page status from the Page object. The values are stored if you use one of the scrape methods, and they are also available in the response when using subscriptions. The sitemap can be handled using the sitemap feature flag. For request libraries, reqwest is battle-tested and handles a lot of cases. If you need a bit more control, you can use hyper, which reqwest is built on top of. Hope this helps!
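For the "exhaustive index" use case in the question, here is a minimal sketch of how the per-URL results might be stored and classified once you have them. It uses only the Rust standard library. Every name in it (`Source`, `Kind`, `PageRecord`, `SiteIndex`, `add_url`, `record_fetch`) is hypothetical and not part of spider or reqwest, and it assumes that whatever fetch layer you pick hands back a final status, the redirect hops, and any transport error.

```rust
use std::collections::BTreeMap;

/// Where a URL was seen (hypothetical labels for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Source {
    SearchConsole,
    Analytics,
    Sitemap,
    Crawl,
}

/// Coarse classification matching the "real, redirected, errored" split.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Unfetched,
    Real,
    Redirected,
    Errored,
}

/// Everything recorded about one URL.
#[derive(Debug, Clone, Default)]
pub struct PageRecord {
    /// Every source that reported this URL, without duplicates.
    pub sources: Vec<Source>,
    /// Final HTTP status, if a response arrived.
    pub status: Option<u16>,
    /// URLs visited after the original one, in order; empty if no redirect.
    pub redirect_chain: Vec<String>,
    /// Transport-level failure (DNS, timeout, ...), if any.
    pub error: Option<String>,
}

impl PageRecord {
    /// Assumption of this sketch: a transport error or a final status
    /// of 400 or above counts as errored, even at the end of a redirect chain.
    pub fn kind(&self) -> Kind {
        match (self.status, &self.error) {
            (_, Some(_)) => Kind::Errored,
            (None, None) => Kind::Unfetched,
            (Some(s), None) if s >= 400 => Kind::Errored,
            (Some(_), None) if !self.redirect_chain.is_empty() => Kind::Redirected,
            (Some(_), None) => Kind::Real,
        }
    }
}

/// URL -> record, kept sorted so reports come out in a stable order.
#[derive(Debug, Default)]
pub struct SiteIndex {
    pages: BTreeMap<String, PageRecord>,
}

impl SiteIndex {
    pub fn new() -> Self {
        Self::default()
    }

    /// Register a URL from one of the sources; repeated calls merge sources.
    pub fn add_url(&mut self, url: &str, source: Source) {
        let rec = self.pages.entry(url.to_string()).or_default();
        if !rec.sources.contains(&source) {
            rec.sources.push(source);
        }
    }

    /// Store the network result for a URL, overwriting any earlier result.
    pub fn record_fetch(
        &mut self,
        url: &str,
        status: Option<u16>,
        redirect_chain: Vec<String>,
        error: Option<String>,
    ) {
        let rec = self.pages.entry(url.to_string()).or_default();
        rec.status = status;
        rec.redirect_chain = redirect_chain;
        rec.error = error;
    }

    pub fn get(&self, url: &str) -> Option<&PageRecord> {
        self.pages.get(url)
    }

    /// Number of URLs currently classified as `kind`.
    pub fn count(&self, kind: Kind) -> usize {
        self.pages.values().filter(|r| r.kind() == kind).count()
    }
}
```

The idea is to call `add_url` for every URL from each source first, then `record_fetch` as results arrive, so a URL that appears in both the sitemap and the crawl ends up as a single record listing both sources.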
