New request: forums.gentoo.org #1057

Open · vitaly-zdanevich opened this issue Jun 19, 2024 · 17 comments

Assignees: RavanJAltaie
Labels:
  • Bug: Something isn't working
  • Computer Science: Content is related to coding and software
  • Scraper Needed: We need to build a dedicated scraper for this website
  • Upstream: For tickets which are waiting for an upstream modification (typically scraper or target website)
  • Zimit

@vitaly-zdanevich

@RavanJAltaie self-assigned this Jun 23, 2024
@RavanJAltaie added the Computer Science and Zimit labels Jun 23, 2024
@RavanJAltaie (Contributor)

Recipe created
https://farm.openzim.org/recipes/forums.gentoo.org_en_all
I'll update the library link once ready

@AngryLoki

@RavanJAltaie, why do I see the following on https://farm.openzim.org/pipeline/e0b8e527-0cab-4514-880b-9434ea0a32b2/debug:

{"workerid":0,"page":"https://forums.gentoo.org/posting.php?mode=reply&t=193199%22%7D%7D
{"timestamp":"2024-06-25T15:49:46.110Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":49103,"total":960486,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-06-25T15:49:46.109Z\",\"extraHops\":0,\"url\":\"https:\\/\\/forums.gentoo.org\\/posting.php?mode=reply&t=193199\",\"added\":\"2024-06-23T23:18:07.166Z\",\"depth\":3}"]}}
{"timestamp":"2024-06-25T15:49:46.309Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://forums.gentoo.org/posting.php?mode=reply&t=193199%22%2C%22workerid%22%3A0%7D%7D

...

{"timestamp":"2024-06-25T15:50:34.199Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":1,"frameUrls":["https://forums.gentoo.org/login.php?redirect=posting.php&mode=reply&t=1112802%22%2C%22workerid%22%3A0%7D%7D

/login.php and /posting.php are disallowed by https://forums.gentoo.org/robots.txt
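
This is easy to confirm with Python's stdlib robots.txt parser; a minimal sketch (my own check, not part of any recipe), fed the two URLs from the logs above:

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt referenced above.
rp = RobotFileParser("https://forums.gentoo.org/robots.txt")
rp.read()

# Both URLs are taken from the crawl logs quoted in this comment.
for url in (
    "https://forums.gentoo.org/posting.php?mode=reply&t=193199",
    "https://forums.gentoo.org/login.php?redirect=posting.php&mode=reply&t=1112802",
):
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "disallowed")

Both should print "disallowed" as long as robots.txt still carries the same rules.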

@RavanJAltaie (Contributor)

@benoit74 could you please check the above?

@benoit74 (Contributor)

I'm not sure what you want me to check.

What I can state is that:

  • your current recipe configuration is not in line with what the robots.txt asks crawlers to ignore: you have no exclude parameter (see the sketch below)
  • nothing in browsertrix-crawler ensures that the robots.txt Disallow rules are automatically respected (at least by default)

Does that answer your question?
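
As a sketch of what such an exclude parameter could contain (an assumption on my side: exclusions are regexes matched against page URLs, as with browsertrix-crawler's --exclude option), tested in plain Python:

import re

# Hypothetical exclusion patterns covering the two paths flagged above.
EXCLUDES = [r"/posting\.php", r"/login\.php"]

def is_excluded(url: str) -> bool:
    # A URL is skipped if any exclusion regex matches it.
    return any(re.search(pattern, url) for pattern in EXCLUDES)

print(is_excluded("https://forums.gentoo.org/posting.php?mode=reply&t=193199"))  # True
print(is_excluded("https://forums.gentoo.org/index.php"))                        # False

The exact parameter syntax depends on the recipe/zimit configuration; the patterns themselves are the important part.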

@benoit74 (Contributor)

FYI, I've opened webrecorder/browsertrix-crawler#631 to discuss the second point with webrecorder folks.

@benoit74 (Contributor)

Note: I've cancelled the task and disabled the recipe for now. Since the configuration is wrong and more than 1M pages have been found, there is no point in continuing to crawl with the current configuration. Let's discuss it this afternoon.

@RavanJAltaie added the Bug label Jun 28, 2024
@RavanJAltaie (Contributor)

Hello, I've added the exclude parameter and requested the recipe again. It's been running for 2 days now with only 5% progress.
@benoit74 I assume this didn't fix the problem?

@benoit74 (Contributor) commented Jul 1, 2024

5% in 2 days is indeed a concern, because it means the crawl will take about 40 days to complete.

What is even more concerning is that it already shows 1,172,895 pages to fetch (more than 1M). Given that a page takes on average 4 seconds to load (a rough number, just to give an envelope), 1M pages works out to about 46 days, so the scraper is behaving normally.

The 1-million-page figure is also probably correct: a quick sum of the numbers displayed on the main page gives 721,207 topics, and some topics span multiple pages due to high activity. So the crawler now seems to be properly configured.

The problem is therefore not really a misconfiguration of the crawler but a lack of analysis of the size of the website. Do we really want to build such a big ZIM? How usable will it be? How big? Are there alternatives to explore?
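
For the record, the back-of-the-envelope arithmetic behind these figures (my own sketch; it assumes a single sequential pass at ~4 s per page and ignores parallel workers):

def crawl_days(pages: int, secs_per_page: float = 4.0) -> float:
    # Rough single-pass crawl duration in days: pages * seconds per page.
    return pages * secs_per_page / 86_400

print(f"{crawl_days(1_000_000):.0f} days")  # ~46 days for the ~1M-page estimate
print(f"{crawl_days(721_207):.0f} days")    # ~33 days as a topics-only lower bound
print(f"{crawl_days(6_369_939):.0f} days")  # ~295 days for the article count cited below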

@Popolechien (Collaborator)

I'm actually surprised it would take that long, considering those pages are mostly text. Generally speaking, is the 4 s throttling on our side or theirs?

@vitaly-zdanevich (Author)

At the bottom of the main page I see:

Our users have posted a total of 6_369_939 articles

@vitaly-zdanevich (Author)

a lack of analysis of the size of the website

Maybe for phpBB sites the crawler could read that number from the bottom of the page to predict the crawl time.
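
Something along these lines could work as a pre-flight size check (a hypothetical sketch: phpbb_post_count is not an existing function in any scraper, and it assumes the standard English phpBB footer wording and that the board doesn't block plain urllib requests):

import re
import urllib.request

def phpbb_post_count(index_url: str) -> int | None:
    # Fetch the board index and read the footer statistic
    # "Our users have posted a total of N articles".
    with urllib.request.urlopen(index_url) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    text = re.sub(r"<[^>]+>", " ", text)  # drop markup so the number is plain text
    match = re.search(r"total of\s+([\d,]+)\s+articles", text)
    return int(match.group(1).replace(",", "")) if match else None

print(phpbb_post_count("https://forums.gentoo.org/"))  # ~6.4M at the time of writing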

@benoit74 (Contributor) commented Jul 1, 2024

I'm actually surprised it would take that long, considering those pages are mostly text. Generally speaking, is the 4 s throttling on our side or theirs?

4 secs is a conservative estimate of how long it usually takes to crawl one page on any website. It is the time needed to fetch all resources, check that there is no video, no audio, and no special assets to fetch (this alone easily takes 1 sec), and record everything inside the WARC. Maybe it could be 3 secs, but I've never seen a website achieve 2 secs (just retrieving the page usually takes 2 secs).

@Popolechien (Collaborator)

Our users have posted a total of 6_369_939 articles

Counting 4 s per page, that would make it almost 300 days for a single pass. Irrespective of the file size, which the ZIM format can manage, I do not think this is reasonable.

@benoit74 (Contributor) commented Jul 1, 2024

Counting 4 s per page, that would make it almost 300 days for a single pass. Irrespective of the file size, which the ZIM format can manage, I do not think this is reasonable.

Yep ... I propose to mark this as Upstream and Scraper Needed, because we clearly need to find a solution (it might be a change in the way zimit operates, or another solution entirely for scraping phpBB websites).

For now I've disabled the recipe.

@vitaly-zdanevich sorry for that, but I do not expect we can quickly find a solution to this problem, since our tooling is clearly insufficient to handle this case.

@benoit74 added the Scraper Needed and Upstream labels Jul 1, 2024
@benoit74 (Contributor) commented Jul 1, 2024

Edit: 6M is the number of articles; isn't that the number of posts rather than topics? Since there are multiple posts per page, I think the real number of pages lies somewhere between the number of topics and the number of articles. Anyway, this is too much.

@benoit74 (Contributor) commented Jul 1, 2024

See openzim/zimit#333 for the upstream issue, should zimit be the solution to scrape such large websites.

@RavanJAltaie (Contributor)

@benoit74 noted.
