Skip to content

opendatabs/data-processing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Open Government Data Processing Basel-Stadt

Architecture, processes, methods and code used to process Open Government Data (OGD) for Canton Basel-Stadt, Switzerland.

Overview of Architecture and Processes

Involved Servers

The Open Data infrastructure of Basel-Stadt consists of the following platforms:

Outline of the ETL Process

Usually, data is published from data-producing governmental entities on internal network drives to the Fachstelle OGD. From there, jobs running on the data processing server read and extract, transform and then load (ETL) the resulting dataset to the web server via (S)FTP. These datasets are then retrieved and published by the data platform so that they can be consumed by the public.

Technical Implementation

Involved Systems and their purpose

  1. Data Processing Server (internal)

    • There are Linux mount points below the folder "/mnt" that serve the data received from other government entities.
    • It runs Docker daemon which hosts docker containers that each contain their own isolated data transformation job.
    • ETL jobs programmed in Python. Source code of these jobs are in subfolders of the present repository, see e.g. aue-umweltlabor.
    • ETL jobs containerized in Docker images, so that each job has its own containerized environment to run in. The environment is configured using the Dockerfile, see e.g. here.
    • AirFlow workflow scheduler. Runs as a docker container, see configuration.
    • Every ETL job to run has its own Apache Airflow Directed Acyclical Graph (DAG) file. It is written in Python and defines when a containerized ETL job is run, and how to proceed if the job fails. DAG files are stored in the AirFlow repo, see e.g. this one.
    • AirFlow DAG jobs can be found on the server in the folder '/data/dev/workspace/docker-airflow/dags', ETL jobs in '/data/dev/workspace/data-processing'.
    • Deployment of source code is done via git: Push from development environment to github, pull from github to live environment in above mentioned folders.
  2. Web Server https://data-bs.ch

    • Linux server that is primarly used to host data ready to be published onto the data portal.
    • Hosts the RUES Viz for real-time Rhein data, see the source code.
    • Hosts some cron jobs, those are being trainsitioned to run as AirFlow jobs on the data processing server.
    • All data on this server is public, including data that is being processed on this server before publication.
  3. Data Platform https://data.bs.ch

    • The data platform is a cloud service that is not hosted on the BS network, but by Opendatasoft.
    • It presents data to the public in diverse formats (table, file export, Viz, API).
    • Simple processing steps can be applied also here.
    • All data on this server is public, including data that is being processed on this server before publication.
    • Data is retrieved from the web server via FTP or HTTPS. Exceptions include the following:

Data Harvesting Mechanisms

Into the Data Platform

  1. Single datasets are sourced via these mechanisms:

    About using an FTP(S) directory instead of a file as the source for a dataset: "Using a directory is often the prefered solution to automate incremental updates between a customer's information system and the platform. All the files in the directory need to have the same format and schema (e.g. CSV files with the same column titles). In case of automation, whenever the dataset is published, new and updated files are fetched from the remote location and processed and thanks to Opendatasoft's native deduplication strategy". For more technical information how these mechanisms work see the Opendatasoft documentation.

  2. Catalogs of datasets are harvested via the FTP with meta CSV harvester. Currently these include the following:

    1. OGD datasets by Statistisches Amt Basel-Stadt
      • Metadata of datasets to be harvested by the data portal are saved onto the web server in folder "/public_html/opendatasoft/harvesters/stata/ftp-csv/" by the (closed source) publishing process run by members of the Statistisches Amt.
    2. Open Datasets by Grundbuch- und Vermessungsamt Basel-Stadt
      • Data and metadata of datasets to be harvested by the data platform are daily created by the data processing job gva_geodatenshop and uploaded to the web server into folder "/public_html/opendatasoft/harvesters/GVA/".

Out of the Data Platform

The data platform can be harvested by other data platforms e.g. via the DCAT-AP for Switzerland API by using an URL in the form of https://data.bs.ch/api/v2/catalog/exports/dcat_ap_ch (see here for further technical information).

To our knowledge, the only direct current consumer/harvester of our data platform metadata is https://opendata.swiss, which in turn is being harvested by the European Data Portal, and possibly others.

As an example, see how this dataset presented by different data portals:

Miscellaneous

Usage of git

  • On the data processing server we use the Docker container 'alpine/git:v2.26.2' as a git client, see https://hub.docker.com/r/alpine/git.
  • First usage on the Docker host to download the Docker image and see git --versionexecuted:
docker run -ti --rm -v ${HOME}:/root -v $(pwd):/git alpine/git:v2.26.2 --version
  • Adding a custom 'git' function in ~/.bashrc:
# User specific aliases and functions
function git () {
    (docker run -ti --rm -v ${HOME}:/root -v $(pwd):/git alpine/git:v2.26.2 "$@")
}

Embargo Feature

  • To create an embargo on a dataset based on a csv file named "data.csv", place a file named "data_embargo.txt" into the folder where the data file resides.
  • The "_embargo.txt" file must contain a datetime string in the form YYYY-MM-DDThh:mm, e.g.
2021-10-22T09:00
  • The data processing job must be enhanced to use the embargo function:
common.is_embargo_over(data_file_path)
  • Always update the embargo file before uploading new data!