The Parlamentsspiegel collects the parliamentary documentation of the German state parliaments (Länder).
Unfortunately, the documents on Parlamentsspiegel lack consistent metadata. For example, they use idiosyncratic abbreviations for the German federal states, like SACA
for Sachsen-Anhalt. That's far from common standards like NUTS or ISO 3166. The mapping between Parlamentsspiegel's state codes and more widely used codes can be found in `input/lookup_laender_ps.csv`.
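A minimal sketch of how that lookup table could be used to translate the codes. The column names `ps_code` and `iso_3166_2` are assumptions, not the file's actual header; check the CSV before relying on them:

```python
import csv

def load_state_lookup(path):
    """Read the lookup CSV into a dict {Parlamentsspiegel code: ISO code}.

    Assumed columns: ps_code, iso_3166_2 (the real header may differ).
    """
    with open(path, newline="", encoding="utf-8") as f:
        return {row["ps_code"]: row["iso_3166_2"] for row in csv.DictReader(f)}

# Hypothetical usage, using the Sachsen-Anhalt example from above:
# lookup = load_state_lookup("input/lookup_laender_ps.csv")
# lookup.get("SACA")  # "DE-ST" (ISO 3166-2 for Sachsen-Anhalt)
```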
Also, the HTML structure has no well-defined CSS classes, which makes the parsing a bit tedious.
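Without usable classes, elements have to be picked by their position in the document tree. A toy sketch of that idea (the actual parsing happens in R; the fragment below is a made-up, well-formed stand-in for the real markup):

```python
import xml.etree.ElementTree as ET

# Hypothetical two-column label/value table, like the detail pages use.
# The structure is an assumption for illustration, not the real markup.
html = """
<table>
  <tr><td>Titel</td><td>Some bill</td></tr>
  <tr><td>Beratungsstand</td><td>In Ausschussberatung</td></tr>
</table>
"""

root = ET.fromstring(html)
# Build {label: value} purely from row positions, no classes needed:
meta = {tr[0].text: tr[1].text for tr in root.iter("tr")}
print(meta["Beratungsstand"])  # → In Ausschussberatung
```

The downside of positional selection is fragility: any change to the page layout silently shifts which cell ends up where.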
The crawler is written in Python, the parsing in R.
- create a virtual environment: `python3 -m venv env`
- activate it: `source env/bin/activate`
- install the requirements: `pip3 install -r requirements.txt`
- write your search words for the search input field to `input/searchwords.csv`
- and the official tags ("Schlagworte") you're interested in to `input/keywords.csv`
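For illustration, `input/searchwords.csv` might look like this (one term per line is an assumption; check the scripts for the expected format):

```
Klimaschutz
Wasserstoff
```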
`01_get_overview.py`
: fetches all overview tables, which are stored as `html` files in `input/html/overview/`, and the relevant links in `input/data/links_beratungsstand.csv`
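A rough sketch of what this step does, with the network call stubbed out. The URL and query parameter are placeholders, not the script's real endpoint:

```python
from pathlib import Path

BASE_URL = "https://example.invalid/parlamentsspiegel/search"  # hypothetical

def fetch(url):
    """Stub: the real crawler performs an HTTP request here."""
    return f"<html><!-- overview page for {url} --></html>"

def save_overviews(searchwords, out_dir="input/html/overview"):
    """Fetch one overview page per search word and store it as html."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for word in searchwords:
        page = fetch(f"{BASE_URL}?q={word}")
        (out / f"{word}.html").write_text(page, encoding="utf-8")
```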
`02_get_detailpages.py`
: fetches the information for every single document; the resulting `html` files can be found in `input/html/beratungsstand/`
`03_parsing.R`
: extracts the metadata and writes the parsed information to a csv file, which can be found in `input/data/df.csv`
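The shape of that last step, sketched in Python for consistency with the crawler (the actual script is R, and the column names `land`, `titel`, `beratungsstand` are assumptions):

```python
import csv
from pathlib import Path

def write_df(rows, path="input/data/df.csv"):
    """Write one dict of parsed metadata per document to a csv file.

    The fieldnames below are assumed columns, not the real df.csv schema.
    """
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["land", "titel", "beratungsstand"])
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical usage:
# write_df([{"land": "DE-ST", "titel": "Some bill",
#            "beratungsstand": "erledigt"}])
```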
- clear distinction between `input` and `output`
- automate folder creation
- improve metadata parsing
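One possible approach to the "automate folder creation" TODO: create every directory the pipeline writes to before the scripts run. The folder list mirrors the paths mentioned above:

```python
from pathlib import Path

# Directories the pipeline writes to, per the paths named in this README.
FOLDERS = [
    "input/html/overview",
    "input/html/beratungsstand",
    "input/data",
    "output",
]

def create_folders(base="."):
    """Create all pipeline folders; existing folders are left untouched."""
    for folder in FOLDERS:
        Path(base, folder).mkdir(parents=True, exist_ok=True)
```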