Missing pages in Common Crawl #22

hadiasghari · 2022-11-02T16:11:51Z

Hello all, and thank you for the great research project.

I have a question about pages that are missing from the OSCAR dataset for a particular language. For instance, this page: https://www.bmbf.de/bmbf/de/service/leichte-sprache/leichte-sprache_node.html is missing from the German dataset, even though other pages from the same domain (bmbf.de) are in the data and this page is only one click from the homepage.

After asking this question on Discourse, I understand that Common Crawl isn't complete and uses random sampling. It would still be good to know the sampling strategy used, e.g., what % of websites are typically crawled, and whether this is weighted by some popularity factor. (I couldn't find this information on their website.)

For the specific URL referenced above, I additionally queried the full index (via the Athena interface) and saw that it has never been crawled. This IMO is an unfortunate limitation for some research questions.
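For anyone who wants to repeat this kind of check without setting up Athena, here is a minimal sketch that uses Common Crawl's public CDX index server (index.commoncrawl.org) instead. Note that this is not the Athena query I ran: the function names are made up for illustration, the crawl ID `CC-MAIN-2022-40` is only an example, and treating an HTTP 404 as "no captures" is an assumption about how the index server responds. It also checks one crawl at a time, so ruling out a URL entirely means looping over every crawl ID.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Public CDX index server run by Common Crawl.
CDX_ROOT = "https://index.commoncrawl.org"


def build_cdx_query(crawl_id, page_url):
    """Build the CDX lookup URL for one page in one crawl (e.g. "CC-MAIN-2022-40")."""
    params = urllib.parse.urlencode({"url": page_url, "output": "json"})
    return f"{CDX_ROOT}/{crawl_id}-index?{params}"


def parse_cdx_response(body):
    """The index answers with one JSON object per line; blank lines are skipped."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def lookup_captures(crawl_id, page_url, timeout=30.0):
    """Return the capture records for page_url in the given crawl.

    Assumption: a 404 from the index server means "no captures found",
    so it is mapped to an empty list. Other HTTP errors are re-raised.
    """
    query = build_cdx_query(crawl_id, page_url)
    try:
        with urllib.request.urlopen(query, timeout=timeout) as resp:
            return parse_cdx_response(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return []
        raise
```

Calling `lookup_captures("CC-MAIN-2022-40", "https://www.bmbf.de/bmbf/de/service/leichte-sprache/leichte-sprache_node.html")` would then return a list of capture records for that crawl, or an empty list if the page was not fetched in it.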

It might thus be nice to share information about other large-scale crawl projects/datasets that could be used in place of CC.
