
# Scraping German News Websites

Scripts to scrape large German news websites, plus the resulting data set of one million German news articles from 01.01.2020 to 31.12.2022. To get the code, simply run

```
git clone https://github.com/kssrr/german-media-scrape
```

If you are unfamiliar with git, you can instead copy, paste & run the `setup.R` script, which will also install the dependencies for you.

## Getting the data

### Direct download (compressed `.tar.gz`)

We assembled a demo data set that includes all articles published between January 1st, 2020 and December 31st, 2022 by the media outlets taz, Zeit, Süddeutsche, Spiegel & Welt. It contains a little over one million German-language news articles (uncompressed ~3.5 GB) of varying length. Article titles are missing for some sites due to an earlier problem with the scrapes; we plan to add them in later versions. The data is hosted here.

The data set broadly covers several impactful events that could be fruitfully analysed, like the 2021 German federal election, the COVID-19 pandemic, the 2022 soccer World Cup, and of course the Russian invasion of Ukraine in early 2022.

In principle, the scripts can also scrape data going back as far as each newspaper's archive allows: simply change the code near the top of each script where the years to scrape are specified.

## Example usage

An elaborate example (topic modelling) is shown here, but you can also do a lot of interesting, more basic exploratory analysis with this kind of data, for example by examining reporting on political parties:

Reporting on political parties on two German news websites.
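A comparison like the one above boils down to counting, per day, how many articles mention each party. Here is a minimal sketch in Python (the repository's own scripts are R); the column names `date` and `body`, the party list, and the toy articles are assumptions for illustration, not the actual schema of the data set:

```python
import pandas as pd

# Toy sample standing in for the real data; column names and texts
# are hypothetical, not the data set's actual schema.
articles = pd.DataFrame({
    "date": pd.to_datetime(["2021-09-01", "2021-09-02", "2021-09-02"]),
    "body": [
        "Die SPD und die CDU streiten über die Steuerpolitik.",
        "Die Grünen stellen ihr Wahlprogramm vor.",
        "CDU und CSU einigen sich auf einen Kanzlerkandidaten.",
    ],
})

parties = ["SPD", "CDU", "Grünen"]

# One indicator column per party: does the article mention it?
for p in parties:
    articles[p] = articles["body"].str.contains(p, regex=False)

# Share of each day's articles that mention each party
coverage = articles.groupby(articles["date"].dt.date)[parties].mean()
print(coverage)
```

On the full data set, plotting `coverage` over time (e.g. with a rolling mean to smooth daily noise) yields curves like the ones in the figure.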

You could also look at the salience of particular topics:

Media attention on Ukraine after the 2022 invasion.

World Cup 2022

Or investigate pairwise correlation clusters of keywords (see here for the methodology):

Keyword correlation networks (including one for Welt).
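The networks above rest on a simple idea: represent each article as a set of keyword indicators, then correlate the keywords pairwise; strongly correlated pairs form the clusters in the plots. A minimal Python sketch of that idea, with made-up texts and keywords (the linked methodology describes the actual approach):

```python
import pandas as pd

# Tiny stand-in corpus; texts and keywords are illustrative only.
texts = [
    "Ukraine Krieg Sanktionen",
    "Ukraine Krieg Waffen",
    "Corona Impfung Inzidenz",
    "Corona Inzidenz Lockdown",
]
keywords = ["Ukraine", "Krieg", "Corona", "Inzidenz"]

# Article-by-keyword indicator matrix (1 if the keyword occurs)
indicators = pd.DataFrame(
    {k: [int(k in t.split()) for t in texts] for k in keywords}
)

# Pairwise Pearson correlation between keyword occurrences; pairs
# with high correlation would be linked in the network plots.
corr = indicators.corr()
print(corr.round(2))
```

For a plot, one would keep only pairs above some correlation threshold and draw them as edges of a graph (e.g. with `igraph` or `networkx`).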

Special thanks to the University of Münster for providing us with additional computational resources for this project.