The Parlamentsspiegel collects the parliamentary documentation of the German state parliaments (Länder).
Unfortunately, the documents on Parlamentsspiegel lack consistent metadata. For example, they use idiosyncratic abbreviations for the German federal states, like SACA
for Sachsen-Anhalt. That's far from common standards like NUTS or ISO 3166. The mapping between Parlamentsspiegel's state codes and more widely used codes can be found in `input/lookup_laender_ps.csv`.
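A minimal sketch of how that lookup table could be used to translate the codes. The column names `ps_code` and `iso_3166_2` are assumptions, not the file's actual header; check the CSV before relying on them:

```python
import csv

def load_state_lookup(path):
    """Read the lookup CSV into a dict {Parlamentsspiegel code: ISO code}.

    Assumed columns: ps_code, iso_3166_2 (the real header may differ).
    """
    with open(path, newline="", encoding="utf-8") as f:
        return {row["ps_code"]: row["iso_3166_2"] for row in csv.DictReader(f)}

# Hypothetical usage, using the Sachsen-Anhalt example from above:
# lookup = load_state_lookup("input/lookup_laender_ps.csv")
# lookup.get("SACA")  # "DE-ST" (ISO 3166-2 for Sachsen-Anhalt)
```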
Also, the HTML structure has no well-defined CSS classes, which makes the parsing a bit tedious.
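Without usable classes, elements have to be picked by their position in the document tree. A toy sketch of that idea (the actual parsing happens in R; the fragment below is a made-up, well-formed stand-in for the real markup):

```python
import xml.etree.ElementTree as ET

# Hypothetical two-column label/value table, like the detail pages use.
# The structure is an assumption for illustration, not the real markup.
html = """
<table>
  <tr><td>Titel</td><td>Some bill</td></tr>
  <tr><td>Beratungsstand</td><td>In Ausschussberatung</td></tr>
</table>
"""

root = ET.fromstring(html)
# Build {label: value} purely from row positions, no classes needed:
meta = {tr[0].text: tr[1].text for tr in root.iter("tr")}
print(meta["Beratungsstand"])  # → In Ausschussberatung
```

The downside of positional selection is fragility: any change to the page layout silently shifts which cell ends up where.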
The crawler is written in Python, the parsing in R.
- create a virtual environment: `python3 -m venv env`
- activate it: `source env/bin/activate`
- install the requirements: `pip3 install -r requirements.txt`
- write your search words for the search input field to `input/searchwords.csv`
- and the official tags ("Schlagworte") you're interested in to `input/keywords.csv`
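For illustration, `input/searchwords.csv` might look like this (one term per line is an assumption; check the scripts for the expected format):

```
Klimaschutz
Wasserstoff
```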
`01_get_overview.py`
: fetches all overview tables, which are stored as `html` files in `input/html/overview/`, and the relevant links in `input/data/links_beratungsstand.csv`
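A rough sketch of what this step does, with the network call stubbed out. The URL and query parameter are placeholders, not the script's real endpoint:

```python
from pathlib import Path

BASE_URL = "https://example.invalid/parlamentsspiegel/search"  # hypothetical

def fetch(url):
    """Stub: the real crawler performs an HTTP request here."""
    return f"<html><!-- overview page for {url} --></html>"

def save_overviews(searchwords, out_dir="input/html/overview"):
    """Fetch one overview page per search word and store it as html."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for word in searchwords:
        page = fetch(f"{BASE_URL}?q={word}")
        (out / f"{word}.html").write_text(page, encoding="utf-8")
```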
`02_get_detailpages.py`
: fetches the information for every single document; the resulting `html` files can be found in `input/html/beratungsstand/`
`03_parsing.R`
: extracts the metadata and writes the parsed information to a csv file, which can be found in `input/data/df.csv`
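The shape of that last step, sketched in Python for consistency with the crawler (the actual script is R, and the column names `land`, `titel`, `beratungsstand` are assumptions):

```python
import csv
from pathlib import Path

def write_df(rows, path="input/data/df.csv"):
    """Write one dict of parsed metadata per document to a csv file.

    The fieldnames below are assumed columns, not the real df.csv schema.
    """
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["land", "titel", "beratungsstand"])
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical usage:
# write_df([{"land": "DE-ST", "titel": "Some bill",
#            "beratungsstand": "erledigt"}])
```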
- clear distinction between `input` and `output`
- automate folder creation
- improve metadata parsing
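One possible approach to the "automate folder creation" TODO: create every directory the pipeline writes to before the scripts run. The folder list mirrors the paths mentioned above:

```python
from pathlib import Path

# Directories the pipeline writes to, per the paths named in this README.
FOLDERS = [
    "input/html/overview",
    "input/html/beratungsstand",
    "input/data",
    "output",
]

def create_folders(base="."):
    """Create all pipeline folders; existing folders are left untouched."""
    for folder in FOLDERS:
        Path(base, folder).mkdir(parents=True, exist_ok=True)
```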