Scraping a large website (millions of pages) is challenging.
One example of such a website is https://forums.gentoo.org/ where it looks like we have between 1M and 6M pages to crawl. See openzim/zim-requests#1057
Most pages are, however, static, i.e. they rarely change from one crawl to the next, so some caching could definitely help, but I have no idea how we could implement it.
For now, I don't know how we can crawl such big sites in a reasonable manner.
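One way such caching is commonly done is with HTTP conditional requests: store each page's `ETag` / `Last-Modified` validators on the first crawl, then send them back on the next crawl so the server can reply `304 Not Modified` and we skip re-downloading unchanged pages. This is only a minimal sketch of that idea (the names and structure are hypothetical, not existing crawler code):

```python
# Hypothetical conditional-request cache for re-crawls.
# On the first crawl we record each page's ETag / Last-Modified;
# on the next crawl we send them back so the server can answer
# "304 Not Modified" and we reuse the cached body instead of re-fetching.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CachedPage:
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None

def conditional_headers(cached: Optional[CachedPage]) -> dict:
    """Build the request headers for re-fetching a previously crawled URL."""
    headers = {}
    if cached is None:
        return headers
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers

def update_cache(cache: dict, url: str, status: int,
                 body: bytes, etag: Optional[str] = None,
                 last_modified: Optional[str] = None) -> bytes:
    """Store a fresh response, or reuse the cached body on a 304."""
    if status == 304 and url in cache:
        # Page unchanged since the last crawl: no body was transferred.
        return cache[url].body
    cache[url] = CachedPage(body, etag, last_modified)
    return body
```

The big open question this sketch does not answer is where the cache lives between crawls (millions of entries) and what to do when the server sends no validators at all, in which case every page must be re-fetched anyway.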