This is the home of the Statistical Scraping Interest Group for methodologically sound use of web data in official statistics.
Roughly stated: statistical scraping is the use of web data in such a way that the relationship with the statistical context is known. It often involves search operations on the web. It is sometimes called selective scraping.
For example:
- an automated search for the price of a well-defined product on the internet is statistical scraping
- an automated search for the most likely URL for an enterprise of a known statistical population is statistical scraping
- interpretation of textual or structured data on an internet site that is known to belong to a statistical unit (organisation) of a known statistical population is statistical scraping
- collecting all prices and product characteristics of all products sold by a specific web portal (bulk scraping) is not statistical scraping (although this can be very useful in certain cases)
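To make the enterprise URL example above more concrete, here is a minimal Python sketch of choosing the most likely URL for one known statistical unit. It is an illustration only, not the method used by the group: the candidate URLs are assumed to come from some search step that is not shown, and the function names, the name-to-domain similarity heuristic and the `threshold` value are all hypothetical.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse


def domain_label(url: str) -> str:
    """Return the first host label after an optional 'www' (a crude heuristic)."""
    host = urlparse(url).hostname or ""
    parts = [p for p in host.split(".") if p and p != "www"]
    return parts[0] if parts else ""


def name_similarity(enterprise_name: str, url: str) -> float:
    """Similarity in [0, 1] between the enterprise name and the domain label."""
    name = "".join(ch for ch in enterprise_name.lower() if ch.isalnum())
    label = "".join(ch for ch in domain_label(url).lower() if ch.isalnum())
    return SequenceMatcher(None, name, label).ratio()


def most_likely_url(enterprise_name, candidate_urls, threshold=0.6):
    """Pick the best-scoring candidate, or None if no candidate is convincing.

    The threshold is an assumed value; in practice it would need tuning
    against units whose URL is already known.
    """
    scored = [(name_similarity(enterprise_name, u), u) for u in candidate_urls]
    if not scored:
        return None
    best_score, best_url = max(scored)
    return best_url if best_score >= threshold else None


# Hypothetical unit from a business register and hypothetical search results.
candidates = ["https://www.acme-bakery.nl/", "https://www.example.com/"]
print(most_likely_url("Acme Bakery", candidates))
```

Because the search starts from a unit in a known population, a result of `None` is itself informative: it records that no URL was found for that specific unit.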
One of the advantages of statistical scraping is that, if applied at the unit level, well-defined and proven quality indicators from survey methodology can be calculated. We note that bulk scraping certainly has value in certain statistical use cases. Indeed, where the statistical population has yet to be discovered, it may be the only option. However, we think that where statistical scraping can be applied, it may complement or in some cases replace bulk scraping methods.
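As one illustration of a unit-level quality indicator, the sketch below computes a response rate per stratum: the share of sampled units for which the scraping step returned usable data. The choice of indicator, the function name and the input layout are assumptions made for this example; the source does not prescribe any of them.

```python
from collections import defaultdict


def response_rates(outcomes):
    """Compute the share of units with usable scraped data, per stratum.

    outcomes: iterable of (stratum, found) pairs, one per sampled unit,
    where found is True if usable data was obtained for that unit.
    """
    totals = defaultdict(int)
    found = defaultdict(int)
    for stratum, ok in outcomes:
        totals[stratum] += 1
        if ok:
            found[stratum] += 1
    return {s: found[s] / totals[s] for s in totals}


# Hypothetical outcomes for four sampled enterprises in two size classes.
outcomes = [("small", True), ("small", False), ("large", True), ("large", True)]
print(response_rates(outcomes))
```

Such a rate can only be computed because each scraped result is tied to a unit of a known population; with bulk scraping the denominator is generally unknown.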
For a more detailed explanation, we refer to this paper: Statistical scraping: informed plough begets finer crops, Q2024 conference, Estoril (pdf)
Statistical scraping is for now just a concept, an idea that we think might be useful. We are about to start implementing and testing it, and this interest group was created to share experiences, best practices, software and tools for statistical scraping in official statistics. It was created by the research groups of Statistics Austria and Statistics Netherlands and will also be used to communicate about meetings of the group.
More details will be communicated here later. Please follow this page if you want to stay informed.
Questions, suggestions, additions: send an e-mail to olav dot tenbosch at gmail dot com
This work is licensed under a Creative Commons Attribution 4.0 International License.