Hello all, and thank you for the great research project.
I have a question about pages that are missing from the OSCAR dataset for a particular language. For instance, this page: https://www.bmbf.de/bmbf/de/service/leichte-sprache/leichte-sprache_node.html is missing from the German dataset, even though other pages from the same domain (bmbf.de) are in the data and this page is only one click from the homepage.
After asking this question on Discourse, I understand that Common Crawl isn't complete and uses random sampling. It would still be good to know the sampling strategy used, e.g., what percentage of websites is typically crawled, and whether this is weighted by some popularity factor. (I couldn't find this information on their website.)
For the specific URL referenced above, I additionally queried the full index (via the Athena interface) and saw that it has never been crawled. IMO this is an unfortunate limitation for some research questions.
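For anyone who wants to repeat this kind of check without Athena, here is a minimal sketch that uses Common Crawl's public CDX index API instead (a different route from the Athena query I ran). It only covers one crawl per request. The crawl ID passed in is whatever you choose from the list at https://index.commoncrawl.org/collinfo.json, and the function names `build_query`, `parse_captures`, and `lookup` are my own, not part of any library. Treating an HTTP 404 as "no captures" is an assumption about how the index server responds.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_HOST = "https://index.commoncrawl.org"


def build_query(crawl_id, page_url):
    """Build the CDX index query URL for one crawl and one page URL."""
    params = urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_captures(body):
    """Parse the response body: one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def lookup(crawl_id, page_url):
    """Return the capture records for page_url in the given crawl.

    Assumption: the index server answers 404 when it has no captures,
    so that case is mapped to an empty list instead of an error.
    """
    try:
        with urlopen(build_query(crawl_id, page_url), timeout=30) as resp:
            return parse_captures(resp.read().decode("utf-8"))
    except HTTPError as err:
        if err.code == 404:
            return []
        raise
```

Calling `lookup("CC-MAIN-2023-50", "https://www.bmbf.de/bmbf/de/service/leichte-sprache/leichte-sprache_node.html")` would check a single crawl (the ID here is just an example). An empty list for one crawl does not show that the page was never crawled, so the call has to be repeated for every crawl ID of interest.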
It might thus be nice to share information about other large-scale crawl projects/datasets that could be used in place of CC.