New request: forums.gentoo.org #1057
Recipe created
@RavanJAltaie, why do I see the following on https://farm.openzim.org/pipeline/e0b8e527-0cab-4514-880b-9434ea0a32b2/debug:
/login.php and /posting.php are disallowed by https://forums.gentoo.org/robots.txt
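For anyone wanting to reproduce the robots.txt observation, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The `"*"` user agent is an assumption; the crawler's real user agent string may match different rules.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the forum's robots.txt
rp = RobotFileParser()
rp.set_url("https://forums.gentoo.org/robots.txt")
rp.read()

# "*" is a placeholder user agent, not necessarily the one the crawler sends
for path in ("/login.php", "/posting.php", "/viewtopic.php"):
    allowed = rp.can_fetch("*", f"https://forums.gentoo.org{path}")
    print(f"{path}: {'allowed' if allowed else 'disallowed'}")
```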
@benoit74 could you please check the above?
I'm not sure what you want me to check. What I can state is that:
Does it answer your request?
FYI, I've opened webrecorder/browsertrix-crawler#631 to discuss the second point with the webrecorder folks.
Note: I've cancelled the task and disabled the recipe for now. The configuration is wrong: more than 1M pages have been found, and there is no point in continuing to crawl with the current configuration. Let's discuss this this afternoon.
Hello, I've added the exclude parameter and requested the recipe again; it's been running for 2 days now with only 5% progress.
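As an illustration of what such an exclude parameter does, here is a minimal sketch of how exclusion regexes filter candidate URLs. The patterns below are assumptions mirroring the paths disallowed by robots.txt; the actual matching is done by zimit/browsertrix-crawler, not by this code.

```python
import re

# Hypothetical exclusion patterns, mirroring the paths disallowed by robots.txt
EXCLUDE_PATTERNS = [re.compile(p) for p in (r"login\.php", r"posting\.php")]

def is_excluded(url: str) -> bool:
    """Return True if any exclusion pattern matches the URL."""
    return any(p.search(url) for p in EXCLUDE_PATTERNS)

urls = [
    "https://forums.gentoo.org/viewtopic-t-1.html",
    "https://forums.gentoo.org/login.php",
    "https://forums.gentoo.org/posting.php?mode=quote",
]
for u in urls:
    print(u, "-> excluded" if is_excluded(u) else "-> crawled")
```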
5% in 2 days is indeed a concern, because it means it would take about 40 days to complete. What is even more concerning is that it already displays 1172895 pages to fetch (more than 1M). Given that a page takes on average 4 secs to load (a rough number, just to give an envelope), that gives about 46 days, so the scraper is behaving normally.

The 1 million pages figure is also probably correct: a quick sum of the numbers displayed on the main page gives 721207 topics, and since some topics span multiple pages due to high activity, the crawler now seems to be properly configured.

So the problem is not really a misconfiguration of the crawler but rather a lack of analysis of the size of the website. Do we really want to build such a big ZIM? How usable would it be? How big? Are there alternatives to explore?
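The back-of-the-envelope numbers above can be reproduced directly; a minimal sketch, assuming the 4 s/page figure and the rounded 1M page count stated in the comment:

```python
SECONDS_PER_PAGE = 4           # rough average time to fetch and record one page
SECONDS_PER_DAY = 24 * 60 * 60

def crawl_days(pages: int, secs_per_page: float = SECONDS_PER_PAGE) -> float:
    """Estimate how many days a sequential crawl of `pages` pages would take."""
    return pages * secs_per_page / SECONDS_PER_DAY

print(crawl_days(1_000_000))   # ~46.3 days for the rounded 1M figure
print(crawl_days(1_172_895))   # ~54.3 days for the exact page count reported
```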
I'm actually surprised it would take that long, considering those pages are mostly text, aren't they? Generally speaking, is the 4s throttling from us or from them?
Maybe for phpBB the crawler could read the totals shown at the bottom of the index page to predict the crawl time.
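A rough sketch of that idea, assuming the board index footer uses the usual phpBB wording ("... a total of N articles" / "... a total of N topics"); the exact phrasing and markup on forums.gentoo.org may differ, so the patterns are assumptions:

```python
import re
import urllib.request

# Assumed phpBB footer phrasing; adjust the patterns to the actual board text
TOTALS = {
    "articles": re.compile(r"total of\s+([\d,]+)\s+articles", re.IGNORECASE),
    "topics": re.compile(r"total of\s+([\d,]+)\s+topics", re.IGNORECASE),
}

def board_totals(index_url: str) -> dict:
    """Scrape the board index and return whichever totals the patterns find."""
    html = urllib.request.urlopen(index_url).read().decode("utf-8", "replace")
    return {
        name: int(m.group(1).replace(",", ""))
        for name, pattern in TOTALS.items()
        if (m := pattern.search(html))
    }

# e.g. board_totals("https://forums.gentoo.org/") -> {"articles": ..., "topics": ...}
```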
4 secs is a conservative estimate based on how long it usually takes to crawl one page on any website. This is the time it takes to fetch all resources, check that there is no video, no audio, and no special assets to fetch (this easily takes 1 sec), and record everything inside the WARC. Maybe it could be 3 secs, but I've never seen a website achieve 2 secs (just retrieving the page usually takes 2 secs).
Counting 4s per page, that would make it almost 300 days for a single pass. Irrespective of the file size, which the ZIM can manage, I do not think this is reasonable.
Yep ... I propose to mark this as upstream and scraper_needed, because clearly we need to find a solution (it might be a change in the way zimit operates, or it might be another solution to scrape phpBB websites). For now I've disabled the recipe. @vitaly-zdanevich sorry for that, but I do not expect we can quickly find a solution to this problem, since I think our tooling is clearly insufficient to handle this case.
Edit: 6M is the number of articles; isn't that the number of posts rather than topics? Since we have multiple posts per page, I think the real number of pages is somewhere between the number of topics and the number of articles. Anyway, this is too much.
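To make that bound explicit, a small sketch reusing the 4 s/page assumption and the figures quoted in the thread (721207 topics as the lower bound, 6M articles as the upper bound):

```python
SECONDS_PER_PAGE = 4
SECONDS_PER_DAY = 24 * 60 * 60

bounds = {
    "lower bound (one page per topic)": 721_207,
    "upper bound (one page per article)": 6_000_000,
}
for label, pages in bounds.items():
    days = pages * SECONDS_PER_PAGE / SECONDS_PER_DAY
    print(f"{label}: {pages} pages -> ~{days:.0f} days")
# lower bound: ~33 days; upper bound: ~278 days ("almost 300 days" above)
```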
See openzim/zimit#333 for the upstream issue, should zimit be the solution to scrape such large websites.
@benoit74 noted.