This is the V2 API for MGnify, EBI's metagenomics platform. It is:
- an architecture for the DB schema/ORM and API (based on Django and Postgres) intended to replace EMG API v1
- a "production automation system" (based on Prefect) intended to replace MI Automation
- Development setup and running locally
- Developing data models
- Developing workflows: basic and detailed
- Developing the slurm integration: basic and detailed
- Deploying to production environment: summary and detailed
Clone the repo.
pip install -r requirements-dev.txt
(or just have pre-commit installed somehow).
pre-commit install
There are three main parts: an API server, a Prefect server, and a Prefect agent. (There are also database and object stores to run; these can be sqlite/local-fs, but this aims to be a more production-like setup.) In the real world these would probably live on separate VMs: on HPC, on hosted DBs, and on K8s. For local development, these are all run in a docker-compose environment.
There is also a docker-compose setup of Slurm, so that automation of HPC scheduling can be developed.
This creates a tiny slurm cluster called donco (not codon).
This is in the slurm-dev-environment directory: see slurm-dev-environment/README.md for more.
E.g. following the docker docs or using Podman or Colima, as you prefer. In theory all should work.
(There is a docker compose file rooted at ./docker-compose.yaml, so later on you can do normal docker/compose things like rebuild a container with docker-compose build app.
However, most common tasks are covered by the Taskfile, see below.)
That file is used in development (docker-compose.yml) to export variables into the environment. Currently, it has two mandatory variables: the username and password for the assembly uploader, which uses webin-cli:
export EMG_WEBIN__EMG_WEBIN_ACCOUNT="Webin-XXX"
export EMG_WEBIN__EMG_WEBIN_PASSWORD="password"
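As a sanity check before starting the containers, something like the following could confirm that both mandatory variables are present. This is only a minimal sketch: the helper name missing_webin_variables is hypothetical and not part of the project; only the two variable names come from the export lines above.

```python
import os
from typing import List, Mapping

# The two variable names are taken from the export lines above.
REQUIRED_VARIABLES = (
    "EMG_WEBIN__EMG_WEBIN_ACCOUNT",
    "EMG_WEBIN__EMG_WEBIN_PASSWORD",
)


def missing_webin_variables(environ: Mapping[str, str] = os.environ) -> List[str]:
    """
    Hypothetical helper (not part of the project).

    :param environ: Mapping of environment variables to inspect
    :return: Names of the required variables that are unset or empty
    """
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_webin_variables()
    if missing:
        print("Missing variables: " + ", ".join(missing))
    else:
        print("All mandatory Webin variables are set.")
```

Passing the mapping as a parameter keeps the check easy to exercise without touching the real environment.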
The project has Taskfiles to simplify some common activities. So, install Task. These are just helpers for local development and deployment.
task --list-all # this shows you all of the available commands.
task make-dev-data
This will have created a Django-managed DB on a dockerized Postgres host, and put some fixtures into it.
It will ask you for a password, which is for a django admin user called emgdev. This password can be anything.
(You could also just create/migrate the DB, without the fixture placement, using task manage -- migrate.)
task prefect -- block register -m prefect_slack.credentials # this enables a prefect -> slack notification system
FLOW=realistic_example task deploy-flow # this "deploys" workflows/flows/realistic_example.py:realistic_example to your local prefect server
# This flow is just a minimal demo to show how the prefect+django integration works.
FILE=workflows/prefect_utils/slurm_flow.py FLOW=move_data task deploy-flow
# if a flow filename + function name don't match, specify FILE separately.
# This move_data flow needs to be deployed, because it is used by other flows.
Run everything (the databases, the Django app, the Prefect workflow server, a Prefect work agent, and a small Slurm cluster with associated controllers+dbs.)
task run
Be aware this runs 7 containers using ~2GB of RAM. Configure your Podman Machine / Docker Desktop / Colima setup accordingly.
You'll see logs from all the containers.
Depending on your containerisation setup, you may need to tweak the CPUs=4 line of slurm-dev-environment/configs/slurm_single_node.conf:45, e.g. setting it to 1. This is related to the number of CPUs on your host machine or on the container VM you're using: e.g. what you set in Docker Desktop.
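To judge whether the CPUs value needs lowering, one could compare it with the CPU count that Python can see. This is a rough sketch, assuming the config contains a CPUs=<number> token as described above. The function names and the sample config line are illustrative and not copied from the real file. Note that os.cpu_count() reports the CPUs of the machine where Python runs, which may differ from what your container VM is given.

```python
import os
import re
from typing import Optional


def configured_cpus(conf_text: str) -> Optional[int]:
    """
    Hypothetical helper: find the first CPUs=<n> token in slurm config text.

    :param conf_text: Contents of a slurm config file
    :return: The configured CPU count, or None if no CPUs= token is present
    """
    match = re.search(r"\bCPUs=(\d+)", conf_text)
    return int(match.group(1)) if match else None


def cpus_setting_is_too_high(conf_text: str, available: Optional[int] = None) -> bool:
    """
    :param conf_text: Contents of a slurm config file
    :param available: CPUs available; defaults to what os.cpu_count() reports
    :return: True if the configured value exceeds the available CPUs
    """
    if available is None:
        available = os.cpu_count() or 1
    configured = configured_cpus(conf_text)
    return configured is not None and configured > available


# Illustrative line only; the real file's contents are not reproduced here.
SAMPLE_LINE = "NodeName=example-node CPUs=4 State=UNKNOWN"

if __name__ == "__main__":
    print(cpus_setting_is_too_high(SAMPLE_LINE))
```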
You can then go to http://127.0.0.1:4200 to see the Prefect dashboard (workflows to be run).
You can also go to http://localhost:8000/api/v2/docs to see the Django app.
The django admin dashboard is at http://localhost:8000/admin (username: emgdev if you used task make-dev-data).
Prefect flows are just Python. There is a hello-world like example in workflows/flows/simple_example.py.
It can be run using Python, e.g. inside the app container:
docker-compose exec app python workflows/flows/simple_example.py
You'll see that the flow and task decorators break the workflow up into individually executable bits of work. You can use this kind of approach to debug things. Meaningful flows, however, are run on separate infrastructure, and that is what the slurm and prefect agent dev environments are for.
FLOW=realistic_example task deploy-flow
This "builds" a prefect flow (from the workflows/flows/ directory, in a file named realistic_example with an @flow-decorated method also called realistic_example).
(Use FILE=... FLOW=... task deploy-flow if the filename doesn't match the method name.)
It also "applies" the "flow deployment", which means the Prefect server knows how to execute it.
It will register it as requiring a worker from the workpool "slurm" to run it.
The Prefect agent in the docker compose setup is labelled as being from this "slurm" pool, so will pick it up.
This agent simulates a worker node on an HPC cluster, e.g. it can submit nextflow pipeline executions which can in turn launch slurm jobs.
Note that this is a very minimal development environment... the entire "slurm cluster" is just two docker containers on your computer.
Either: open the Prefect dashboard, or use a POST request on the MGnify API, or use the prefect CLI via docker compose.
E.g., use the Prefect dashboard to do a "quick run" of the Realistic Example flow you just deployed with accession PRJNA521078.
This example will:
- make an ENA Study in the database
- suspend itself and wait to be "resumed" in the Prefect dashboard, because it needs to know a "sample limit" from the admin user
- get a list of samples from the ENA API, in an @task
- run a nextflow pipeline for each sample, on slurm, that downloads the read runs
You could also run this newly deployed flow from the command line, using the Prefect CLI. e.g.:
task prefect -- deployment run "Download a study read-runs/realistic_example_deployment" --param accession=PRJNA521078
(Note that you can't run this one in the same way as simple_example.py, because realistic_example.py does not have a __main__).
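If you want to script such runs, the command line can be assembled programmatically. The sketch below only builds the argument list for the CLI invocation shown above. build_deployment_run_command is a hypothetical helper, not something the project provides, and actually executing the result assumes Task and the docker compose setup are available.

```python
import shlex
from typing import Dict, List


def build_deployment_run_command(deployment: str, params: Dict[str, str]) -> List[str]:
    """
    Hypothetical helper (not part of the project).

    :param deployment: Deployment name, in the "flow name/deployment name" form
    :param params: Flow parameters to pass via repeated --param options
    :return: Argument list suitable for subprocess.run
    """
    command = ["task", "prefect", "--", "deployment", "run", deployment]
    for key, value in params.items():
        command += ["--param", f"{key}={value}"]
    return command


command = build_deployment_run_command(
    "Download a study read-runs/realistic_example_deployment",
    {"accession": "PRJNA521078"},
)
# Show a shell-quoted form of the command for copy/pasting.
print(shlex.join(command))
```

The list could then be handed to subprocess.run(command, check=True); passing a list rather than a single string avoids quoting problems with the spaces in the deployment name.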
- Use type hinting: def my_func(param: List[str]) -> int:
- Prefer to use pathlib instead of os.path, e.g. for joining parts: Path("/nfs/my/dir") / "subdir" / "file.txt"
- Config parameters (like the URL for ENA etc.) should use structured Pydantic Settings. See settings.EMG_CONFIG. EMG_CONFIG should always be imported via django.conf.settings: e.g. from django.conf import settings; EMG_CONFIG = settings.EMG_CONFIG
- When you have a list of acceptable options for something, use Enums or TextChoices (a kind of enum for Django db fields): class AssemblyStatuses(str, Enum):...
- Use Django/postgres JSONFields liberally (they can save a load of complicated JOINs)
- Apply a schema to JSONFields, using Enums, default dicts, custom pydantic types... see WithDownloadsModel for an example
- Use class mixins and Django abstract models liberally, to add shared/similar functionality to multiple models
- API list endpoints should not perform many/any JOINs. Prefer to have less information on the endpoint than to introduce JOINs (it can be separately indexed for search)
- API detail endpoints may perform JOINs
- API action endpoints (e.g. /analyses/MGYA1/taxonomies) should be used where a very large dataset (taxonomies) is to be returned. This means taxonomies is not needed (so can be deferred) on the main analysis detail endpoint.
- Use a variable for the labels of JSON/dicts: STATUS = "status"; my_dict = {STATUS: get_status_of_run()}
- Use ReST style docstrings for functions: :param sample_accession: The sample to be analysed
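Several of the guidelines above (type hints, pathlib, Enums, label variables, ReST docstrings) can be seen together in one small function. This is a sketch for illustration only: summarise_runs, RESULTS_PATH, the enum members and the file name results.txt are made up here and do not come from the codebase.

```python
from enum import Enum
from pathlib import Path
from typing import Dict, List

# Labels for dict/JSON keys are held in variables, not repeated as literals.
STATUS = "status"
RESULTS_PATH = "results_path"


class AssemblyStatuses(str, Enum):
    # Illustrative members; the real options are not listed in this document.
    PENDING = "pending"
    COMPLETED = "completed"


def summarise_runs(run_accessions: List[str], base_dir: Path) -> Dict[str, Dict[str, str]]:
    """
    Build a small summary dict for each run (hypothetical example function).

    :param run_accessions: The runs to be summarised
    :param base_dir: Directory under which per-run results would live
    :return: Mapping of run accession to its summary
    """
    return {
        accession: {
            STATUS: AssemblyStatuses.PENDING.value,
            # pathlib's "/" operator joins path parts.
            RESULTS_PATH: str(base_dir / accession / "results.txt"),
        }
        for accession in run_accessions
    }


summary = summarise_runs(["ERR0000001"], Path("/nfs/my/dir"))
```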
There are two real Django apps here:
- ena, for models that mirror objects in ENA: studies, samples, etc.
- analyses, for models associated with MGnify analysis production work (MGYS, MGYA etc).
- TODO: other models like genomes could live in separate apps.
There is one fake Django app, workflows, which is used to tie Prefect (the workflow scheduler) into Django.
This is bidirectional: it creates a manage.py prefectcli command to run Prefect, and it allows Prefect tasks to use instantiated Django.
The API is implemented with ninja (emgapiv2/api.py), and uses the OpenAPI spec with Swagger.
See the workflows/README.md for details. In short: add Python/Prefect code to a file in workflows/flows/ and then FLOW=my_flow task deploy-flow.
The project uses the pytest framework. Prefect has some helpers for testing. We also use Pytest-django to help with Django testing.
Testing libraries are in requirements-dev.txt. These are installed in the docker compose app container. So:
task test
# ...will run everything. Or for a subset, use pytest arguments after -- e.g.:
task test -- -k study
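pytest's -k option selects tests by matching an expression against their names, so the subset command above would pick up any test with "study" in its name. Here is a minimal, self-contained sketch of such a test. The function under test, normalise_accession, is hypothetical and exists only to give the test something to check.

```python
def normalise_accession(raw: str) -> str:
    """
    Hypothetical function under test (not part of the project).

    :param raw: An accession as typed by a user
    :return: The accession with surrounding whitespace removed, upper-cased
    """
    return raw.strip().upper()


def test_study_accession_is_normalised():
    # "study" appears in this test's name, so "-k study" would select it.
    assert normalise_accession("  prjna521078\n") == "PRJNA521078"
```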
See the slurm-dev-environment/README.md for details. In short: task slurm and you're on a slurm node of the containerised slurm "cluster".
The deployment/ folder has deployment configs for different environments.
Each should have its own Taskfile, included in the main Taskfile.
E.g. see the EBI WP K8s HL deployment README.
Run e.g. task ebi-wp-k8s-hl:update-api to build/push/restart the EMG API service in that deployment (requires some secrets setup).
Run e.g. FLOW=assembly_study task ebi-wp-k8s-hl:deploy-flow to deploy a new flow to this production environment.
Note that the prefect workers ALSO need to have your new flow code, which is currently deployed separately. For EBI-WP-K8s-HL, there is a Jenkins job to deploy those workers to Codon.
- DB Schema parity with EMG DB (v1) and EMG Backlog
- Job cleanup flows
- Legacy data importers