Architecture, processes, methods and code used to process Open Government Data (OGD) for Canton Basel-Stadt, Switzerland.
The Open Data infrastructure of Basel-Stadt consists of the following platforms:
- Data Processing Server (internal)
- Web Server https://data-bs.ch
- Data Platform https://data.bs.ch
Usually, data is delivered by data-producing governmental entities to the Fachstelle OGD via internal network drives. From there, jobs running on the data processing server read, extract and transform the data, then load (ETL) the resulting dataset onto the web server via (S)FTP. These datasets are then retrieved and published by the data platform so that they can be consumed by the public.
Data Processing Server (internal)
- There are Linux mount points below the folder "/mnt" that serve the data received from other government entities.
- It runs a Docker daemon that hosts Docker containers, each containing its own isolated data transformation job.
- ETL jobs are programmed in Python. The source code of these jobs is located in subfolders of the present repository, see e.g. aue-umweltlabor.
- ETL jobs are containerized in Docker images, so that each job has its own isolated environment to run in. The environment is configured using the job's Dockerfile, see e.g. here.
- The Apache Airflow workflow scheduler runs as a Docker container, see configuration.
- Every ETL job has its own Apache Airflow Directed Acyclic Graph (DAG) file. It is written in Python and defines when the containerized ETL job is run, and how to proceed if the job fails. DAG files are stored in the Airflow repo, see e.g. this one (a minimal DAG sketch follows this list).
- Airflow DAG files can be found on the server in the folder '/data/dev/workspace/docker-airflow/dags', ETL jobs in '/data/dev/workspace/data-processing'.
- Deployment of source code is done via git: push from the development environment to GitHub, then pull from GitHub to the live environment into the above-mentioned folders.
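The following is a minimal sketch of what such a DAG file might look like, not the actual production configuration: the DAG id, Docker image, command and schedule are placeholders, and the DockerOperator import path depends on the Airflow version in use (shown here in the Airflow 1.x style).

```python
# Minimal DAG sketch: runs one containerized ETL job on a daily schedule.
# All names below are placeholders, not the production configuration.
from datetime import datetime

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator  # Airflow 1.x import path

default_args = {
    'owner': 'ogd',
    'email_on_failure': True,  # how to proceed on failure is defined here / in the DAG
    'retries': 0,
}

with DAG(
    dag_id='example_etl_job',               # placeholder
    default_args=default_args,
    schedule_interval='0 5 * * *',          # placeholder: run daily at 05:00
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    run_etl = DockerOperator(
        task_id='run_etl_job',
        image='example-etl-job:latest',     # placeholder image built from the job's Dockerfile
        command='python3 etl.py',           # placeholder entry point
        auto_remove=True,
    )
```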
Web Server https://data-bs.ch
- Linux server that is primarily used to host data ready to be published on the data platform.
- Hosts the RUES Viz for real-time Rhein data, see the source code.
- Hosts some cron jobs; these are being transitioned to run as Airflow jobs on the data processing server.
- All data on this server is public, including data that is being processed on this server before publication.
Data Platform https://data.bs.ch
- The data platform is a cloud service hosted by Opendatasoft, not on the BS network.
- It presents data to the public in diverse formats (table, file export, Viz, API).
- Simple processing steps can also be applied here.
- All data on this platform is public, including data that is being processed there before publication.
- Data is retrieved from the web server via FTP or HTTPS. Exceptions include the following:
- Real-time data that is pushed into the data platform via the Opendatasoft Real Time API, e.g. Occupation of charging stations (a minimal push sketch follows this list)
- Real-time data that is retrieved every minute directly by the data platform without involvement of any Airflow jobs on the data processing server, e.g. Occupation status of parking lots
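As an illustration of the real-time push mechanism, the sketch below sends a single record to a dataset's push endpoint. The push URL, field names and values are placeholders and assumptions; the actual URL and expected record schema are defined in the dataset's real-time source configuration on the Opendatasoft platform.

```python
import requests

# Placeholder: the actual push URL (including the security key) is provided by the
# real-time source configuration of the dataset on the Opendatasoft platform.
PUSH_URL = "<push URL from the dataset's real-time source configuration>"

# Placeholder record; field names must match the dataset schema.
record = {
    "timestamp": "2021-10-22T09:00:00+02:00",
    "station_id": "42",
    "status": "occupied",
}

response = requests.post(PUSH_URL, json=record)
response.raise_for_status()
print("Pushed record:", response.status_code)
```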
Individual datasets are sourced via these mechanisms:
- Direct https source, e.g. Coronavirus Fallzahlen Basel-Stadt
- Opendatasoft real-time API, e.g. Real-time occupation status of charging stations
- FTP(S) sourcing of a directory, e.g. Smart Climate Schallpegelmessungen
About using an FTP(S) directory instead of a file as the source for a dataset: "Using a directory is often the preferred solution to automate incremental updates between a customer's information system and the platform. All the files in the directory need to have the same format and schema (e.g. CSV files with the same column titles). In case of automation, whenever the dataset is published, new and updated files are fetched from the remote location and processed, thanks to Opendatasoft's native deduplication strategy." For more technical information on how these mechanisms work, see the Opendatasoft documentation.
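For context, the sketch below shows roughly how a processed CSV file could be uploaded into such an FTP directory on the web server using Python's standard library; host, credentials and paths are placeholders, and the actual jobs in this repository use their own shared helper code for this step.

```python
import ftplib
import os

# Placeholder connection details; real jobs read these from their configuration.
FTP_HOST = "ftp.example.com"
FTP_USER = "user"
FTP_PASS = "password"
REMOTE_DIR = "/remote/dataset/directory"   # the directory sourced by the data platform
LOCAL_FILE = "data.csv"

# Upload one processed CSV file into the remote directory (use ftplib.FTP_TLS for FTPS).
with ftplib.FTP(FTP_HOST, FTP_USER, FTP_PASS) as ftp:
    ftp.cwd(REMOTE_DIR)
    with open(LOCAL_FILE, "rb") as f:
        ftp.storbinary(f"STOR {os.path.basename(LOCAL_FILE)}", f)
```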
Catalogs of datasets are harvested via the Opendatasoft "FTP with meta CSV" harvester. Currently these include the following:
- OGD datasets by Statistisches Amt Basel-Stadt
- Metadata of datasets to be harvested by the data portal are saved onto the web server in folder "/public_html/opendatasoft/harvesters/stata/ftp-csv/" by the (closed source) publishing process run by members of the Statistisches Amt.
- Open Datasets by Grundbuch- und Vermessungsamt Basel-Stadt
- Data and metadata of datasets to be harvested by the data platform are created daily by the data processing job gva_geodatenshop and uploaded to the web server into the folder "/public_html/opendatasoft/harvesters/GVA/".
The data platform can be harvested by other data platforms, e.g. via the DCAT-AP for Switzerland API, by using a URL in the form of https://data.bs.ch/api/v2/catalog/exports/dcat_ap_ch (see here for further technical information).
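As a quick illustration, this catalog export can be fetched like any other HTTP resource; the following minimal sketch assumes the public export requires no authentication and simply saves the response (an RDF serialization) to a local file.

```python
import requests

# Fetch the DCAT-AP for Switzerland catalog export of the data platform.
url = "https://data.bs.ch/api/v2/catalog/exports/dcat_ap_ch"
response = requests.get(url, timeout=60)
response.raise_for_status()

# Save the exported catalog for further processing or inspection.
with open("catalog_dcat_ap_ch.rdf", "wb") as f:
    f.write(response.content)
```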
To our knowledge, the only current direct consumer/harvester of our data platform metadata is https://opendata.swiss, which in turn is being harvested by the European Data Portal, and possibly others.
As an example, see how this dataset is presented by different data portals:
- In the data portal Basel-Stadt (original): https://data.bs.ch/explore/dataset/100042
- In opendata.swiss (harvested from the above): https://opendata.swiss/de/dataset/statistische-raumeinheiten-wohnviertel
- In the European Data portal (harvested from the above): https://www.europeandataportal.eu/data/datasets/100042-statistisches-amt-kanton-basel-stadt
- On the data processing server, we use the Docker image 'alpine/git:v2.26.2' as a git client, see https://hub.docker.com/r/alpine/git.
- First usage on the Docker host, to download the Docker image and see 'git --version' executed:
docker run -ti --rm -v ${HOME}:/root -v $(pwd):/git alpine/git:v2.26.2 --version
- Adding a custom 'git' function in ~/.bashrc:
# User specific aliases and functions
function git () {
(docker run -ti --rm -v ${HOME}:/root -v $(pwd):/git alpine/git:v2.26.2 "$@")
}
- To create an embargo on a dataset based on a CSV file named "data.csv", place a file named "data_embargo.txt" into the folder where the data file resides.
- The "_embargo.txt" file must contain a datetime string in the form YYYY-MM-DDThh:mm, e.g.
2021-10-22T09:00
- The data processing job must be enhanced to use the embargo function (a usage sketch follows this list):
common.is_embargo_over(data_file_path)
- Always update the embargo file before uploading new data!
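Below is a minimal sketch of how a job might use this check before publishing, assuming that is_embargo_over() returns True once the datetime in the embargo file has passed; the upload function is a placeholder, not the repository's actual helper.

```python
import common  # shared helper module of this repository


def upload_to_web_server(path: str) -> None:
    """Placeholder for the job's actual upload step (e.g. (S)FTP upload to the web server)."""
    print(f"Uploading {path}...")


data_file_path = "data.csv"  # the corresponding embargo file is "data_embargo.txt"

# Assumption: is_embargo_over() returns True once the datetime in the
# embargo file (e.g. 2021-10-22T09:00) has passed.
if common.is_embargo_over(data_file_path):
    upload_to_web_server(data_file_path)
else:
    print("Embargo still active, not publishing.")
```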