New request: forums.gentoo.org #1057

Open · vitaly-zdanevich opened this issue Jun 19, 2024 · 17 comments

Assignees: RavanJAltaie
Labels:
  • Bug: Something isn't working
  • Computer Science: Content is related to coding and software
  • Scraper Needed: We need to build a dedicated scraper for this website
  • Upstream: For tickets which are waiting for an upstream modification (typically scraper or target website)
  • Zimit

@vitaly-zdanevich

@RavanJAltaie self-assigned this Jun 23, 2024
@RavanJAltaie added the Computer Science and Zimit labels Jun 23, 2024
@RavanJAltaie (Contributor)

Recipe created
https://farm.openzim.org/recipes/forums.gentoo.org_en_all
I'll update the library link once ready

@AngryLoki

@RavanJAltaie, why do I see the following on https://farm.openzim.org/pipeline/e0b8e527-0cab-4514-880b-9434ea0a32b2/debug:

{"workerid":0,"page":"https://forums.gentoo.org/posting.php?mode=reply&t=193199%22%7D%7D
{"timestamp":"2024-06-25T15:49:46.110Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":49103,"total":960486,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-06-25T15:49:46.109Z\",\"extraHops\":0,\"url\":\"https:\\/\\/forums.gentoo.org\\/posting.php?mode=reply&t=193199\",\"added\":\"2024-06-23T23:18:07.166Z\",\"depth\":3}"]}}
{"timestamp":"2024-06-25T15:49:46.309Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://forums.gentoo.org/posting.php?mode=reply&t=193199%22%2C%22workerid%22%3A0%7D%7D

...

{"timestamp":"2024-06-25T15:50:34.199Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":1,"frameUrls":["https://forums.gentoo.org/login.php?redirect=posting.php&mode=reply&t=1112802%22%2C%22workerid%22%3A0%7D%7D

/login.php and /posting.php are disallowed by https://forums.gentoo.org/robots.txt
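
This is easy to confirm with Python's stdlib robots.txt parser; a minimal sketch (my own check, not part of any recipe), fed the two URLs from the logs above:

from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt referenced above.
rp = RobotFileParser("https://forums.gentoo.org/robots.txt")
rp.read()

# Both URLs are taken from the crawl logs quoted in this comment.
for url in (
    "https://forums.gentoo.org/posting.php?mode=reply&t=193199",
    "https://forums.gentoo.org/login.php?redirect=posting.php&mode=reply&t=1112802",
):
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "disallowed")

Both should print "disallowed" as long as robots.txt still carries the same rules.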

@RavanJAltaie (Contributor)

@benoit74 could you please check the above?

@benoit74 (Contributor)

I'm not sure what you want me to check.

What I can state is that:

  • your current recipe configuration is not in line with what the robots.txt asks crawlers to ignore: you have no exclude parameter (see the sketch below)
  • nothing in browsertrix-crawler ensures that the robots.txt Disallow rules are automatically respected (at least by default)

Does that answer your question?
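
As a sketch of what such an exclude parameter could contain (an assumption on my side: exclusions are regexes matched against page URLs, as with browsertrix-crawler's --exclude option), tested in plain Python:

import re

# Hypothetical exclusion patterns covering the two paths flagged above.
EXCLUDES = [r"/posting\.php", r"/login\.php"]

def is_excluded(url: str) -> bool:
    # A URL is skipped if any exclusion regex matches it.
    return any(re.search(pattern, url) for pattern in EXCLUDES)

print(is_excluded("https://forums.gentoo.org/posting.php?mode=reply&t=193199"))  # True
print(is_excluded("https://forums.gentoo.org/index.php"))                        # False

The exact parameter syntax depends on the recipe/zimit configuration; the patterns themselves are the important part.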

@benoit74 (Contributor)

FYI, I've opened webrecorder/browsertrix-crawler#631 to discuss the second point with webrecorder folks.

@benoit74 (Contributor)

Note: I've cancelled the task and disabled the recipe for now. Since the configuration is wrong and more than 1M pages have been found, there is no point in continuing to crawl with the current configuration. Let's discuss it this afternoon.

@RavanJAltaie added the Bug label Jun 28, 2024
@RavanJAltaie (Contributor)

Hello, I've added the exclude parameter and requested the recipe again. It's been running for 2 days now with only 5% progress.
@benoit74 I assume this didn't fix the problem?

@benoit74 (Contributor) commented Jul 1, 2024

5% in 2 days is indeed a concern, because it means the crawl will take about 40 days to complete.

What is even more concerning is that it already shows 1,172,895 pages to fetch (more than 1M). Given that a page takes on average 4 seconds to load (a rough number, just to give an envelope), 1M pages works out to about 46 days, so the scraper is behaving normally.

The 1-million-page figure is also probably correct: a quick sum of the numbers displayed on the main page gives 721,207 topics, and some topics span multiple pages due to high activity. So the crawler now seems to be properly configured.

The problem is therefore not really a misconfiguration of the crawler but a lack of analysis of the size of the website. Do we really want to build such a big ZIM? How usable will it be? How big? Are there alternatives to explore?
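
For the record, the back-of-the-envelope arithmetic behind these figures (my own sketch; it assumes a single sequential pass at ~4 s per page and ignores parallel workers):

def crawl_days(pages: int, secs_per_page: float = 4.0) -> float:
    # Rough single-pass crawl duration in days: pages * seconds per page.
    return pages * secs_per_page / 86_400

print(f"{crawl_days(1_000_000):.0f} days")  # ~46 days for the ~1M-page estimate
print(f"{crawl_days(721_207):.0f} days")    # ~33 days as a topics-only lower bound
print(f"{crawl_days(6_369_939):.0f} days")  # ~295 days for the article count cited below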

@Popolechien (Collaborator)

I'm actually surprised it would take that long, considering those pages are mostly text. Generally speaking, is the 4 s throttling on our side or theirs?

@vitaly-zdanevich (Author)

At the bottom of the main page I see:

Our users have posted a total of 6_369_939 articles

@vitaly-zdanevich (Author)

a lack of analysis of the size of the website

Maybe for phpBB sites the crawler could read that number from the bottom of the page to predict the crawl time.
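
Something along these lines could work as a pre-flight size check (a hypothetical sketch: phpbb_post_count is not an existing function in any scraper, and it assumes the standard English phpBB footer wording and that the board doesn't block plain urllib requests):

import re
import urllib.request

def phpbb_post_count(index_url: str) -> int | None:
    # Fetch the board index and read the footer statistic
    # "Our users have posted a total of N articles".
    with urllib.request.urlopen(index_url) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    text = re.sub(r"<[^>]+>", " ", text)  # drop markup so the number is plain text
    match = re.search(r"total of\s+([\d,]+)\s+articles", text)
    return int(match.group(1).replace(",", "")) if match else None

print(phpbb_post_count("https://forums.gentoo.org/"))  # ~6.4M at the time of writing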

@benoit74 (Contributor) commented Jul 1, 2024

I'm actually surprised it would take that long, considering those pages are mostly text. Generally speaking, is the 4 s throttling on our side or theirs?

4 secs is a conservative estimate of how long it usually takes to crawl one page on any website. It is the time needed to fetch all resources, check that there is no video, no audio, and no special assets to fetch (this alone easily takes 1 sec), and record everything inside the WARC. Maybe it could be 3 secs, but I've never seen a website achieve 2 secs (just retrieving the page usually takes 2 secs).

@Popolechien (Collaborator)

Our users have posted a total of 6_369_939 articles

Counting 4 s per page, that would make it almost 300 days for a single pass. Irrespective of the file size, which the ZIM format can manage, I do not think this is reasonable.

@benoit74 (Contributor) commented Jul 1, 2024

Counting 4 s per page, that would make it almost 300 days for a single pass. Irrespective of the file size, which the ZIM format can manage, I do not think this is reasonable.

Yep ... I propose to mark this as Upstream and Scraper Needed, because we clearly need to find a solution (it might be a change in the way zimit operates, or another solution entirely for scraping phpBB websites).

For now I've disabled the recipe.

@vitaly-zdanevich sorry for that, but I do not expect we can quickly find a solution to this problem, since our tooling is clearly insufficient to handle this case.

@benoit74 added the Scraper Needed and Upstream labels Jul 1, 2024
@benoit74 (Contributor) commented Jul 1, 2024

Edit: 6M is the number of articles; isn't that the number of posts rather than topics? Since there are multiple posts per page, I think the real number of pages lies somewhere between the number of topics and the number of articles. Anyway, this is too much.

@benoit74 (Contributor) commented Jul 1, 2024

See openzim/zimit#333 for the upstream issue, should zimit be the solution to scrape such large websites.

@RavanJAltaie (Contributor)

@benoit74 noted.
