
# Scraping German News Websites

Scripts to scrape large German news websites, plus the resulting data set of one million German news articles from 01.01.2020 to 31.12.2022. To get the code, simply run

```
git clone https://github.com/kssrr/german-media-scrape
```

If you are unfamiliar with git, you can instead copy, paste & run the `setup.R` script, which will also install the dependencies for you.

## Getting the data

### Direct download (compressed `.tar.gz`)

We assembled a demo data set that includes all articles published between January 1st, 2020 and December 31st, 2022 by the media outlets taz, Zeit, Süddeutsche, Spiegel & Welt. It contains a little over one million German-language news articles (uncompressed ~3.5 GB) of varying length. Article titles are missing for some sites due to an earlier problem with the scrapes; we plan to add them in later versions. The data is hosted here.

The data set broadly covers several impactful events that could be fruitfully analysed, like the 2021 German federal election, the COVID-19 pandemic, the 2022 soccer World Cup, and of course the Russian invasion of Ukraine in early 2022.

In principle, the scripts can also scrape data going back as far as each newspaper's archive allows: simply change the code near the top of each script where the years to scrape are specified.

## Example usage

An elaborate example (topic modelling) is shown here, but you can also do a lot of interesting, more basic exploratory analysis with this kind of data, for example by examining reporting on political parties:

Reporting on political parties on two German news websites.
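A comparison like the one above boils down to counting, per day, how many articles mention each party. Here is a minimal sketch in Python (the repository's own scripts are R); the column names `date` and `body`, the party list, and the toy articles are assumptions for illustration, not the actual schema of the data set:

```python
import pandas as pd

# Toy sample standing in for the real data; column names and texts
# are hypothetical, not the data set's actual schema.
articles = pd.DataFrame({
    "date": pd.to_datetime(["2021-09-01", "2021-09-02", "2021-09-02"]),
    "body": [
        "Die SPD und die CDU streiten über die Steuerpolitik.",
        "Die Grünen stellen ihr Wahlprogramm vor.",
        "CDU und CSU einigen sich auf einen Kanzlerkandidaten.",
    ],
})

parties = ["SPD", "CDU", "Grünen"]

# One indicator column per party: does the article mention it?
for p in parties:
    articles[p] = articles["body"].str.contains(p, regex=False)

# Share of each day's articles that mention each party
coverage = articles.groupby(articles["date"].dt.date)[parties].mean()
print(coverage)
```

On the full data set, plotting `coverage` over time (e.g. with a rolling mean to smooth daily noise) yields curves like the ones in the figure.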

You could also look at the salience of particular topics:

Media attention on Ukraine after the 2022 invasion.

World Cup 2022

Or investigate pairwise correlation clusters of keywords (see here for the methodology):

Keyword correlation networks (including one for Welt).
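The networks above rest on a simple idea: represent each article as a set of keyword indicators, then correlate the keywords pairwise; strongly correlated pairs form the clusters in the plots. A minimal Python sketch of that idea, with made-up texts and keywords (the linked methodology describes the actual approach):

```python
import pandas as pd

# Tiny stand-in corpus; texts and keywords are illustrative only.
texts = [
    "Ukraine Krieg Sanktionen",
    "Ukraine Krieg Waffen",
    "Corona Impfung Inzidenz",
    "Corona Inzidenz Lockdown",
]
keywords = ["Ukraine", "Krieg", "Corona", "Inzidenz"]

# Article-by-keyword indicator matrix (1 if the keyword occurs)
indicators = pd.DataFrame(
    {k: [int(k in t.split()) for t in texts] for k in keywords}
)

# Pairwise Pearson correlation between keyword occurrences; pairs
# with high correlation would be linked in the network plots.
corr = indicators.corr()
print(corr.round(2))
```

For a plot, one would keep only pairs above some correlation threshold and draw them as edges of a graph (e.g. with `igraph` or `networkx`).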

Special thanks to the University of Münster for providing us with additional computational resources for this project.