Missing pages in Common Crawl #22

Open
hadiasghari opened this issue Nov 2, 2022 · 0 comments


Hello all, and thank you for the great research project.

I have a question about pages that are missing from the OSCAR dataset for a particular language. For instance, this page: https://www.bmbf.de/bmbf/de/service/leichte-sprache/leichte-sprache_node.html is missing from the German dataset, even though other pages from the same domain (bmbf.de) are in the data and this page is only one click away from the homepage.

After asking this question on Discourse, I understand that Common Crawl isn't complete and uses random sampling. It would still be good to know the sampling strategy used, e.g., what % of websites is typically crawled, and whether this is weighted by some popularity factor. (I couldn't find this information on their website.)

For the specific URL referenced above, I also queried the full index (via the Athena interface) and found that it has never been crawled. In my opinion, this is an unfortunate limitation for some research questions.
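
For anyone who wants to repeat this kind of check without Athena, here is a minimal sketch that asks the public Common Crawl index server about a single URL in a single crawl. It is not the query I ran. The crawl ID `CC-MAIN-2022-40` is only an example, the helper names are made up for illustration, and treating an HTTP 404 as "no captures" is an assumption about how the server responds.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Common Crawl URL index server.
INDEX_SERVER = "https://index.commoncrawl.org"


def build_query_url(crawl_id, page_url):
    """Build the index query URL for one crawl, e.g. 'CC-MAIN-2022-40' (example ID)."""
    params = urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def parse_cdx_lines(body):
    """Parse the response body: one JSON record per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def lookup(crawl_id, page_url):
    """Return the capture records for page_url in one crawl (empty list if none)."""
    try:
        with urlopen(build_query_url(crawl_id, page_url), timeout=30) as resp:
            return parse_cdx_lines(resp.read().decode("utf-8"))
    except HTTPError as err:
        # Assumption: the server answers 404 when the URL has no captures.
        if err.code == 404:
            return []
        raise


# Example call (needs network access, so it is left commented out):
# records = lookup(
#     "CC-MAIN-2022-40",
#     "https://www.bmbf.de/bmbf/de/service/leichte-sprache/leichte-sprache_node.html",
# )
# print(len(records), "capture(s) found")
```

This only covers one crawl per call. Checking whether a page was ever crawled means repeating the lookup for every crawl ID, which is what makes the Athena route more convenient.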

It would therefore be helpful to share information about other large-scale crawl projects or datasets that could be used in place of Common Crawl.
