Recommendations for returning status codes / redirect flow #173
Replies: 1 comment
-
Hi @benjohnsonn, yes, Spider is low-level. You can get the redirect URL, error, and page status from the Page object. These values are stored when you use one of the scrape methods. You can also read them from the response when using subscriptions. The sitemap can be handled using the |
-
Hi there! I've been experimenting with spider for a bit to crawl page content and various on-page elements.
From what I can understand, the library is lower-level than something like Scrapy (which I used to use), and so seems designed to add scraping functionality to a more robust crawler project.
I was wondering if you have any recommendations for tools/libraries that would also return values like status codes, redirect URL chains, and more? I've used reqwest a little, but I'm not sure if there is a better solution out there!
For context, one use case would be feeding URLs from Google Search Console, Google Analytics, the sitemap, and URLs discovered from a crawl into a single exhaustive index of all possible pages for a site (real, redirected, errored, etc.). spider is useful for the discovery crawl, but page status, redirect chain info, and other network info are also key, so I'm trying to figure out how to capture this info efficiently!
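To make "redirect chain info" concrete, here is a minimal sketch of the kind of record I'm after, using only the standard library. Everything in it is an assumption for illustration: `Hop`, `resolve_chain`, and the `fetch` closure are hypothetical names, and `fetch` stands in for a single HTTP request made with automatic redirects turned off (for example, through whatever HTTP client ends up being used). The URLs are made up.

```rust
use std::collections::HashMap;

/// One hop in a redirect chain: the URL requested and the status it returned.
#[derive(Debug, Clone, PartialEq)]
struct Hop {
    url: String,
    status: u16,
}

/// Follows redirects by hand so every intermediate status is recorded.
/// `fetch` is a hypothetical stand-in for one HTTP request with automatic
/// redirects disabled; it returns (status, Location header value if any).
/// `max_hops` caps the chain so redirect loops terminate.
fn resolve_chain<F>(start: &str, max_hops: usize, fetch: F) -> Vec<Hop>
where
    F: Fn(&str) -> (u16, Option<String>),
{
    let mut chain = Vec::new();
    let mut current = start.to_string();
    for _ in 0..max_hops {
        let (status, location) = fetch(&current);
        chain.push(Hop { url: current.clone(), status });
        match location {
            // Keep going only on a 3xx response that names a target.
            Some(next) if (300..400).contains(&status) => current = next,
            _ => break,
        }
    }
    chain
}

fn main() {
    // Canned responses standing in for a real site (made-up URLs).
    let mut site: HashMap<&str, (u16, Option<&str>)> = HashMap::new();
    site.insert("http://example.com/old", (301, Some("http://example.com/new")));
    site.insert("http://example.com/new", (302, Some("http://example.com/final")));
    site.insert("http://example.com/final", (200, None));

    // Unknown URLs are treated as 404 in this toy setup.
    let fetch = |url: &str| match site.get(url) {
        Some((status, loc)) => (*status, loc.map(String::from)),
        None => (404, None),
    };

    for hop in resolve_chain("http://example.com/old", 10, fetch) {
        println!("{} {}", hop.status, hop.url);
    }
}
```

Keeping the request behind a closure means the same chain logic could be fed URLs from any of the sources above, and the per-hop records could then be stored alongside the crawl results.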