Architecture, processes, methods and code used to process Open Government Data (OGD) for Canton Basel-Stadt, Switzerland.
The Open Data infrastructure of Basel-Stadt consists of the following platforms:
- Data Processing Server (internal)
- Web Server https://data-bs.ch
- Data Platform https://data.bs.ch
Usually, data is delivered by data-producing governmental entities to the Fachstelle OGD via internal network drives. From there, jobs running on the data processing server read, extract and transform the data, then load (ETL) the resulting dataset onto the web server via (S)FTP. These datasets are then retrieved and published by the data platform so that they can be consumed by the public.
Data Processing Server (internal)
- There are Linux mount points below the folder "/mnt" that serve the data received from other government entities.
- It runs a Docker daemon that hosts Docker containers, each containing its own isolated data transformation job.
- ETL jobs are programmed in Python. The source code of these jobs is located in subfolders of the present repository, see e.g. aue-umweltlabor.
- ETL jobs are containerized in Docker images, so that each job has its own isolated environment to run in. The environment is configured using the job's Dockerfile, see e.g. here.
- The Apache Airflow workflow scheduler runs as a Docker container, see configuration.
- Every ETL job has its own Apache Airflow Directed Acyclic Graph (DAG) file. It is written in Python and defines when the containerized ETL job is run, and how to proceed if the job fails. DAG files are stored in the Airflow repo, see e.g. this one (a minimal DAG sketch follows this list).
- Airflow DAG files can be found on the server in the folder '/data/dev/workspace/docker-airflow/dags', ETL jobs in '/data/dev/workspace/data-processing'.
- Deployment of source code is done via git: push from the development environment to GitHub, then pull from GitHub to the live environment into the above-mentioned folders.
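The following is a minimal sketch of what such a DAG file might look like, not the actual production configuration: the DAG id, Docker image, command and schedule are placeholders, and the DockerOperator import path depends on the Airflow version in use (shown here in the Airflow 1.x style).

```python
# Minimal DAG sketch: runs one containerized ETL job on a daily schedule.
# All names below are placeholders, not the production configuration.
from datetime import datetime

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator  # Airflow 1.x import path

default_args = {
    'owner': 'ogd',
    'email_on_failure': True,  # how to proceed on failure is defined here / in the DAG
    'retries': 0,
}

with DAG(
    dag_id='example_etl_job',               # placeholder
    default_args=default_args,
    schedule_interval='0 5 * * *',          # placeholder: run daily at 05:00
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    run_etl = DockerOperator(
        task_id='run_etl_job',
        image='example-etl-job:latest',     # placeholder image built from the job's Dockerfile
        command='python3 etl.py',           # placeholder entry point
        auto_remove=True,
    )
```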
Web Server https://data-bs.ch
- Linux server that is primarily used to host data ready to be published on the data platform.
- Hosts the RUES Viz for real-time Rhein data, see the source code.
- Hosts some cron jobs; these are being transitioned to run as Airflow jobs on the data processing server.
- All data on this server is public, including data that is being processed on this server before publication.
Data Platform https://data.bs.ch
- The data platform is a cloud service hosted by Opendatasoft, not on the BS network.
- It presents data to the public in diverse formats (table, file export, Viz, API).
- Simple processing steps can also be applied here.
- All data on this platform is public, including data that is being processed there before publication.
- Data is retrieved from the web server via FTP or HTTPS. Exceptions include the following:
- Real-time data that is pushed into the data platform via the Opendatasoft Real Time API, e.g. Occupation of charging stations (a minimal push sketch follows this list)
- Real-time data that is retrieved every minute directly by the data platform without involvement of any Airflow jobs on the data processing server, e.g. Occupation status of parking lots
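As an illustration of the real-time push mechanism, the sketch below sends a single record to a dataset's push endpoint. The push URL, field names and values are placeholders and assumptions; the actual URL and expected record schema are defined in the dataset's real-time source configuration on the Opendatasoft platform.

```python
import requests

# Placeholder: the actual push URL (including the security key) is provided by the
# real-time source configuration of the dataset on the Opendatasoft platform.
PUSH_URL = "<push URL from the dataset's real-time source configuration>"

# Placeholder record; field names must match the dataset schema.
record = {
    "timestamp": "2021-10-22T09:00:00+02:00",
    "station_id": "42",
    "status": "occupied",
}

response = requests.post(PUSH_URL, json=record)
response.raise_for_status()
print("Pushed record:", response.status_code)
```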
Individual datasets are sourced via these mechanisms:
- Direct https source, e.g. Coronavirus Fallzahlen Basel-Stadt
- Opendatasoft real-time API, e.g. Real-time occupation status of charging stations
- FTP(S) sourcing of a directory, e.g. Smart Climate Schallpegelmessungen
About using an FTP(S) directory instead of a file as the source for a dataset: "Using a directory is often the preferred solution to automate incremental updates between a customer's information system and the platform. All the files in the directory need to have the same format and schema (e.g. CSV files with the same column titles). In case of automation, whenever the dataset is published, new and updated files are fetched from the remote location and processed, thanks to Opendatasoft's native deduplication strategy." For more technical information on how these mechanisms work, see the Opendatasoft documentation.
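For context, the sketch below shows roughly how a processed CSV file could be uploaded into such an FTP directory on the web server using Python's standard library; host, credentials and paths are placeholders, and the actual jobs in this repository use their own shared helper code for this step.

```python
import ftplib
import os

# Placeholder connection details; real jobs read these from their configuration.
FTP_HOST = "ftp.example.com"
FTP_USER = "user"
FTP_PASS = "password"
REMOTE_DIR = "/remote/dataset/directory"   # the directory sourced by the data platform
LOCAL_FILE = "data.csv"

# Upload one processed CSV file into the remote directory (use ftplib.FTP_TLS for FTPS).
with ftplib.FTP(FTP_HOST, FTP_USER, FTP_PASS) as ftp:
    ftp.cwd(REMOTE_DIR)
    with open(LOCAL_FILE, "rb") as f:
        ftp.storbinary(f"STOR {os.path.basename(LOCAL_FILE)}", f)
```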
Catalogs of datasets are harvested via the Opendatasoft "FTP with meta CSV" harvester. Currently these include the following:
- OGD datasets by Statistisches Amt Basel-Stadt
- Metadata of datasets to be harvested by the data portal are saved onto the web server in folder "/public_html/opendatasoft/harvesters/stata/ftp-csv/" by the (closed source) publishing process run by members of the Statistisches Amt.
- Open Datasets by Grundbuch- und Vermessungsamt Basel-Stadt
- Data and metadata of datasets to be harvested by the data platform are created daily by the data processing job gva_geodatenshop and uploaded to the web server into the folder "/public_html/opendatasoft/harvesters/GVA/".
The data platform can be harvested by other data platforms, e.g. via the DCAT-AP for Switzerland API, by using a URL in the form of https://data.bs.ch/api/v2/catalog/exports/dcat_ap_ch (see here for further technical information).
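As a quick illustration, this catalog export can be fetched like any other HTTP resource; the following minimal sketch assumes the public export requires no authentication and simply saves the response (an RDF serialization) to a local file.

```python
import requests

# Fetch the DCAT-AP for Switzerland catalog export of the data platform.
url = "https://data.bs.ch/api/v2/catalog/exports/dcat_ap_ch"
response = requests.get(url, timeout=60)
response.raise_for_status()

# Save the exported catalog for further processing or inspection.
with open("catalog_dcat_ap_ch.rdf", "wb") as f:
    f.write(response.content)
```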
To our knowledge, the only current direct consumer/harvester of our data platform metadata is https://opendata.swiss, which in turn is being harvested by the European Data Portal, and possibly others.
As an example, see how this dataset is presented by different data portals:
- In the data portal Basel-Stadt (original): https://data.bs.ch/explore/dataset/100042
- In opendata.swiss (harvested from the above): https://opendata.swiss/de/dataset/statistische-raumeinheiten-wohnviertel
- In the European Data portal (harvested from the above): https://www.europeandataportal.eu/data/datasets/100042-statistisches-amt-kanton-basel-stadt
- On the data processing server, we use the Docker image 'alpine/git:v2.26.2' as a git client, see https://hub.docker.com/r/alpine/git.
- First usage on the Docker host, to download the Docker image and see 'git --version' executed:
docker run -ti --rm -v ${HOME}:/root -v $(pwd):/git alpine/git:v2.26.2 --version
- Adding a custom 'git' function in ~/.bashrc:
# User specific aliases and functions
function git () {
(docker run -ti --rm -v ${HOME}:/root -v $(pwd):/git alpine/git:v2.26.2 "$@")
}
- To create an embargo on a dataset based on a CSV file named "data.csv", place a file named "data_embargo.txt" into the folder where the data file resides.
- The "_embargo.txt" file must contain a datetime string in the form YYYY-MM-DDThh:mm, e.g.
2021-10-22T09:00
- The data processing job must be enhanced to use the embargo function (a usage sketch follows this list):
common.is_embargo_over(data_file_path)
- Always update the embargo file before uploading new data!
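Below is a minimal sketch of how a job might use this check before publishing, assuming that is_embargo_over() returns True once the datetime in the embargo file has passed; the upload function is a placeholder, not the repository's actual helper.

```python
import common  # shared helper module of this repository


def upload_to_web_server(path: str) -> None:
    """Placeholder for the job's actual upload step (e.g. (S)FTP upload to the web server)."""
    print(f"Uploading {path}...")


data_file_path = "data.csv"  # the corresponding embargo file is "data_embargo.txt"

# Assumption: is_embargo_over() returns True once the datetime in the
# embargo file (e.g. 2021-10-22T09:00) has passed.
if common.is_embargo_over(data_file_path):
    upload_to_web_server(data_file_path)
else:
    print("Embargo still active, not publishing.")
```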