A Cookiecutter template for Flowbot repos, used for instantiating and maintaining reporting data pipelines.
Make sure you have a `.geojson` of your affected areas before you begin - it's helpful to have the path to it in your clipboard.
To instantiate:
- Clone this repo: `git clone https://github.com/Flowminder/flowbot-pipeline-root.git`, then `cd flowbot-pipeline-root`
- Install deps with `pipenv install`
- Run Cookiecutter with `bash run.sh . -o ..`
- Choose 'active crisis' for the first option
- Follow the prompts onscreen, pasting the path to the geojson when asked. Some prompts are not needed for the active crisis
- Copy the result to the relevant server's `dags` or `flowbot` folder; you should see the new pipeline in Airflow's interface
- Copy the replay file from `~/.cookiecutter_replays/{your_project_slug}` to this folder; this will support merging changes from the original root repo into your rendered repo
- Optional but strongly advised: run `git init` to create a git repo for tracking changes to the new pipeline
A pipeline is a combination of a country and a report configuration, used to render a new repository from the contents of `{{cookiecutter.__project_slug}}`. The rendering process populates the relevant `*_config.py` file and removes any irrelevant notebooks and dags.
A pipeline for monitoring an upcoming or ongoing crisis. To define this pipeline, you need:
- The geographic area of the crisis; this is currently a polygon in EPSG:4326
- A key start date to begin monitoring
- An optional end date
This pipeline produces a daily active crisis report showing total and newly displaced subscribers from affected areas to host locations. For an example report, see https://www.flowminder.org/resources/publications-reports/haiti-gang-violence-in-downtown-port-au-prince-mobility-situation-report-29-february-12-march-2024
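Before running the template, it can be worth sanity-checking the affected-area file. The sketch below is not part of this repo; it assumes the file holds Polygon or MultiPolygon geometries and only checks that every vertex is a plausible longitude/latitude pair, which is a heuristic for EPSG:4326 and not proof of the CRS. The function names and the example coordinates are made up for illustration.

```python
import json


def polygon_rings(geojson: dict):
    """Yield every linear ring from Polygon/MultiPolygon geometries in a GeoJSON object."""
    kind = geojson.get("type")
    if kind == "FeatureCollection":
        for feature in geojson.get("features", []):
            yield from polygon_rings(feature)
    elif kind == "Feature":
        yield from polygon_rings(geojson.get("geometry") or {})
    elif kind == "Polygon":
        yield from geojson["coordinates"]
    elif kind == "MultiPolygon":
        for polygon in geojson["coordinates"]:
            yield from polygon


def looks_like_epsg4326(geojson: dict) -> bool:
    """Heuristic: at least one ring exists and all vertices fall in lon/lat ranges."""
    rings = list(polygon_rings(geojson))
    if not rings:
        return False
    return all(
        -180 <= point[0] <= 180 and -90 <= point[1] <= 90
        for ring in rings
        for point in ring
    )


# Illustrative area only (made-up coordinates, not a real crisis polygon).
example_area = json.loads("""
{
  "type": "Feature",
  "properties": {},
  "geometry": {
    "type": "Polygon",
    "coordinates": [[[-72.35, 18.53], [-72.33, 18.53], [-72.33, 18.55], [-72.35, 18.53]]]
  }
}
""")

print(looks_like_epsg4326(example_area))
```

To check a file on disk, load it with `json.load` and pass the result to `looks_like_epsg4326`.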
A pipeline for producing monthly reports on national mobility trends. To define this pipeline, you need:
- A reference date range - this will be a period that builds up a baseline for the rest of the reports to refer to.
- A start date
For more information, see https://haiti.mobility-dashboard.org/files/HTI%20platform%20release%20documentation%20Nov%202024.pdf
Some other key parts of the repo:
A set of modules that contain the specific configurations for a country and a report type. They are part of the pipenv used in rendering the pipelines.
- CountryStaticData defines a dataclass for country-server specific configurations; for example, if there is a query to fetch the administrative geometry of a country from `flowdb`, it would live in here. If you are implementing existing pipelines on a new in-country server, start here.
- ProjectStructure specifies which parts of `{{cookiecutter._project_slug}}` are kept for a given pipeline, and which `*_config.py` is to be rendered. If you are implementing a new pipeline, start here.
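As a rough illustration of the "keep some parts, remove the rest" idea behind ProjectStructure, here is a minimal sketch. It is not the repo's actual definition: the field names (`config_template`, `notebooks`, `dags`), the helper `files_to_remove`, and all file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProjectStructure:
    """Hypothetical stand-in: the files a given pipeline keeps after rendering."""

    config_template: str
    notebooks: frozenset = field(default_factory=frozenset)
    dags: frozenset = field(default_factory=frozenset)


def files_to_remove(all_files, structure):
    """Return the rendered files that the pipeline does not keep, in sorted order."""
    keep = structure.notebooks | structure.dags | {structure.config_template}
    return sorted(set(all_files) - keep)


# Hypothetical file names, for illustration only.
active_crisis = ProjectStructure(
    config_template="active_crisis_config.py",
    notebooks=frozenset({"notebooks/displacement.ipynb"}),
    dags=frozenset({"dags/active_crisis_dag.py"}),
)
rendered = [
    "active_crisis_config.py",
    "monthly_config.py",
    "notebooks/displacement.ipynb",
    "notebooks/trends.ipynb",
    "dags/active_crisis_dag.py",
    "dags/monthly_dag.py",
]
print(files_to_remove(rendered, active_crisis))
```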
The cookiecutter hooks for populating the repo.
- `pre_gen_project` contains some simple validation rules
- `post_gen_project` is the machinery that implements the logic defined in `flowbot_dataclasses`
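For readers new to Cookiecutter hooks, the sketch below shows the kind of check a pre-generation hook might perform. It is an assumption-laden example and not the rules this repo ships: the context keys (`project_slug`, `start_date`, `end_date`) and the slug pattern are hypothetical. In a real hook, exiting with a non-zero status stops project generation.

```python
import re
from datetime import date

# Hypothetical rule: lowercase letters, digits and underscores, starting with a letter.
SLUG_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


def validate_context(context: dict) -> list:
    """Return a list of human-readable problems; an empty list means the context is valid."""
    errors = []

    slug = str(context.get("project_slug", ""))
    if not SLUG_PATTERN.match(slug):
        errors.append(f"project_slug {slug!r} is not a valid identifier-style slug")

    start = None
    try:
        start = date.fromisoformat(str(context.get("start_date", "")))
    except ValueError:
        errors.append("start_date must be an ISO date (YYYY-MM-DD)")

    end_raw = context.get("end_date")
    if end_raw:  # the end date is optional
        try:
            end = date.fromisoformat(str(end_raw))
        except ValueError:
            errors.append("end_date must be an ISO date (YYYY-MM-DD)")
        else:
            if start is not None and end < start:
                errors.append("end_date is before start_date")

    return errors


if __name__ == "__main__":
    problems = validate_context(
        {"project_slug": "hti_active_crisis", "start_date": "2024-02-29", "end_date": ""}
    )
    for problem in problems:
        print(f"ERROR: {problem}")
    # A real hook would call sys.exit(1) here when problems is non-empty.
```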
Defines a set of custom Jinja filters for accessing `flowbot_dataclasses`.
This is the only Python file populated by the Cookiecutter template (the only other file populated is the README). This is imported into the DAG - if you need to make changes to an existing pipeline, this is where you should start looking.
Contains the DAGs associated with this pipeline. Although all pipelines currently consist of a single DAG, this may change in the future and is not an assumption you can rely on. If you are extending or modifying a DAG, it is recommended that you add any configuration variables (dates, static file names, region lists, etc.) to the pipeline's `config.py`.
Contains the parameterised notebooks and associated helper libraries that form the bulk of the pipeline processing via `flowpyter-task`. These should be as generic as possible and take any date ranges, static data sources or similar from the `config.py` file via a `params` cell. For further information, see https://github.com/Flowminder/flowpyter-task
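One way to keep DAGs and notebooks generic is to gather pipeline-specific values in one place and hand them over as plain, serialisable parameters. The sketch below illustrates that pattern only; it does not show the rendered `config.py` or the flowpyter-task API, and every name in it (`PipelineConfig`, `notebook_params`, the field names and values) is hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for values a DAG and its notebooks share."""

    start_date: date
    end_date: Optional[date]
    affected_regions: tuple
    population_file: str

    def notebook_params(self) -> dict:
        """Flatten the config into JSON-friendly values suitable for a params cell."""
        return {
            "start_date": self.start_date.isoformat(),
            "end_date": self.end_date.isoformat() if self.end_date else None,
            "affected_regions": list(self.affected_regions),
            "population_file": self.population_file,
        }


# Illustrative values only.
config = PipelineConfig(
    start_date=date(2024, 2, 29),
    end_date=None,
    affected_regions=("region_a", "region_b"),
    population_file="static/population_estimates.csv",
)
print(config.notebook_params())
```

Keeping the values in a single frozen object means a change to a date or file name happens in one place, and the notebooks never need to know which pipeline they belong to.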
Contains artefacts that are unchanging between dagruns, but may be changed by users. The current two use cases for this are static data files that contain population estimates and the Jinja templates used for reports.
Room for information that must be added manually per dagrun. The main case for this is the key observations, but it is also used to provide references to previous reports for the back matter.
These folders are populated by `flowpyter-task` tasks - they exist, but should be left empty.
There is a simple approval test in the `.circleci` dir. If you make a change to the structure of the repo, run `pipenv run cookiecutter . --replay-file .circleci/test-replay.json -o .circleci/approved -f` to update the approval test. If you make changes to the cookiecutter parameters, you'll need to generate a new replay file; run cookiecutter with your new parameters and copy the replay file to `.circleci/test-replay.json` as above.
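An approval test of this kind boils down to comparing a freshly rendered tree against the approved one. The sketch below shows that comparison with the standard library's `filecmp.dircmp`; it is an illustration of the idea, not the check the CI configuration actually runs, and the function name and directory arguments are assumptions.

```python
import filecmp


def tree_differences(approved: str, received: str) -> list:
    """Return sorted, human-readable differences between two directory trees."""
    differences = []

    def walk(comparison, prefix):
        for name in comparison.left_only:
            differences.append(f"{prefix}{name}: only in approved")
        for name in comparison.right_only:
            differences.append(f"{prefix}{name}: only in received")
        # Note: dircmp compares files shallowly by default (type, size, mtime first),
        # so same-size files with identical timestamps may be treated as equal.
        for name in comparison.diff_files:
            differences.append(f"{prefix}{name}: contents differ")
        for name, sub in comparison.subdirs.items():
            walk(sub, f"{prefix}{name}/")

    walk(filecmp.dircmp(approved, received), "")
    return sorted(differences)
```

An empty list means the rendered output matches the approved copy; anything else is a change that needs reviewing and, if intended, re-approving.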