Template Maintainer: Tim
This repository serves as a template for developers writing scripts to load data into covidgraph.
We try to keep data loading scripts in a common format so that we can build an automated data loading pipeline.
Once a developer has created a script, it can be registered at https://git.connect.dzd-ev.de/dzdconnectpipeline/pipeline/-/blob/master/pipeline.yml
Motherlode will run the script to load the data into the DEV instance of covidgraph. Once the function and content of the data are verified, motherlode can run it against the PROD instance.
If you write your script in Python, you can use this repo as a template. If you write in another language, take it as inspiration and just follow the "Connect to the correct neo4j instance" part.
All data source scripts will be wrapped in a docker container. Motherlode is just a python script that pulls the data source docker images from docker hub and runs them.
The datasource script is responsible for the following tasks:
- Download the source data
- Transform the source data
- Connect to the correct neo4j instance
- Load the data in an idempotent* way into the database
* "Idempotent" basically means: merge your nodes. If your script fails halfway through the first run, we want to be able to simply re-run it without duplicating the nodes that are already loaded in the DB.
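As a minimal sketch of what "merge your nodes" can look like in Python: the function below sends a Cypher MERGE statement instead of CREATE, so re-running it with the same rows matches the existing nodes rather than duplicating them. The label `Paper`, the properties `paper_id` and `title`, and the function name `load_papers` are hypothetical; the sketch assumes `graph` is a py2neo `Graph` (or any object with a compatible `run` method).

```python
# Hypothetical label and property names; adapt them to your own data model.
MERGE_PAPERS = (
    "UNWIND $rows AS row "
    "MERGE (p:Paper {paper_id: row.paper_id}) "
    "SET p.title = row.title"
)


def load_papers(graph, rows):
    """Merge each row as a :Paper node, keyed on paper_id.

    MERGE matches an existing node with the same paper_id or creates it,
    so running this twice with the same rows does not duplicate nodes.
    `graph` is assumed to offer a py2neo-style run(cypher, **parameters).
    """
    graph.run(MERGE_PAPERS, rows=rows)
```

Choosing a stable key property (here `paper_id`) is the important design decision: MERGE only deduplicates on the properties given inside the pattern.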
After that, you need to do the following to make your script work with covidgraph:
- Publish your script as a docker image
- Register the script at https://github.com/covidgraph/motherlode
Motherlode will hand over the following environment variables when your data source script is called:
ENV: will be PROD or DEV
NEO4J: A JSON string with the connection details. Parameter names are based on the py2neo.Graph format, e.g. NEO4J can be {"host":"db.covidgraph.com","user":"neo4j","password":"somepw"}
In Python you can use this conveniently with:
import json
import os
from py2neo import Graph
NEO4J_CONFIG_STRING = os.getenv("NEO4J")
NEO4J_CONFIG_DICT = json.loads(NEO4J_CONFIG_STRING)
graph = Graph(**NEO4J_CONFIG_DICT)
You have to make sure that your script uses these variables to connect to the database.
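A slightly more defensive way of reading the two variables, as a sketch: it fails early with a clear message when ENV or NEO4J is missing or malformed, for example when you start the script by hand without setting them. The function name `read_pipeline_env` is hypothetical and not part of the template.

```python
import json
import os


def read_pipeline_env(environ=os.environ):
    """Return (env, neo4j_config) from the variables motherlode hands over."""
    env = environ.get("ENV")
    if env not in ("PROD", "DEV"):
        raise ValueError(f"ENV must be PROD or DEV, got {env!r}")

    raw = environ.get("NEO4J")
    if raw is None:
        raise ValueError("NEO4J is not set")

    # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
    config = json.loads(raw)
    if not isinstance(config, dict):
        raise ValueError("NEO4J must be a JSON object")
    return env, config
```

The returned dictionary can then be passed on as `Graph(**config)`, and `env` can be used to decide, for instance, how much data to load against DEV.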
We share the scripts via docker images at https://hub.docker.com/
When you have finished developing your script, you have to "dockerize"/"containerize" it.
There are multiple ways to build and publish your script. The easiest is to just run:
docker build -t data-my-datasource-script .
docker login --username my-docker-hub-username
docker tag data-my-datasource-script covidgraph/data-my-datasource-script:$tag
docker push covidgraph/data-my-datasource-script:$tag
(To be able to publish your script to the Docker Hub organization covidgraph, you need to be a member. Ping Martin or Tim for that.)
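Before pushing, it can help to run the image locally with the same two environment variables that motherlode hands over. The following is only a sketch of such a local test, not how motherlode itself is implemented; the function names are hypothetical, and the only Docker options used are `--rm` and `-e`.

```python
import json
import subprocess


def build_docker_run_command(image, env, neo4j_config):
    """Build a `docker run` command that passes ENV and NEO4J to the container."""
    return [
        "docker", "run", "--rm",
        "-e", f"ENV={env}",
        "-e", f"NEO4J={json.dumps(neo4j_config)}",
        image,
    ]


def run_image_locally(image, env, neo4j_config):
    """Run the image and raise CalledProcessError if the script exits non-zero."""
    subprocess.run(build_docker_run_command(image, env, neo4j_config), check=True)
```

For example, `run_image_locally("data-my-datasource-script", "DEV", config)` with `config` holding the connection details of a test database you control. Running it twice in a row is also a quick check that the load is idempotent.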
A more convenient way of publishing your image is to let GitHub take care of it. You can use GitHub Actions for that.
As an example, have a look at https://github.com/covidgraph/data_cord19 or ping Tim.
To register your script, just ping Tim and tell him the location of your Docker image.
That's it! :)
All following text belongs to the template README.md. Cut here when you create a new template ✂---------------------------------
This script loads data X and Y from source Z into the neo4j-based covidgraph.
Maintainer: [{You}](https://github.com/{YourGitHub.Com-id})
Version: 0.0.1
Docker image location: https://hub.docker.com/repository/docker/covidgraph/data-template