AICS Cookiecutter template for a simple data + code workflow:
- git(hub) for code
- quilt for data
- prefect to combine
An Example Workflow Produced with this Template
To use this template for a new workflow, use the following commands and then follow the prompts from the terminal.
pip install cookiecutter
cookiecutter gh:AllenCellModeling/cookiecutter-stepworkflow
Once you've followed the prompts, you should have a template repository that we need to
- install as a Python package
- connect to GitHub
- connect to Quilt
First, we'll make a conda environment to house this project's Python dependencies. If you don't have conda installed, install it with miniconda.
Make a conda environment with the same name as your project:
conda create --name <project_name> python=3.7
and activate it with
conda activate <project_name>
To install the project as a Python package, cd into the project directory and install it with pip:
cd <project_name>
pip install -e .[dev]
This will install your package in editable mode with all the required development dependencies.
Create an empty repository on GitHub with the same name as your project (you need to do this via the GitHub website). Don't initialize it with a README or any other files.
Once the GitHub repo is created, push your project up to GitHub with
git remote add origin git@github.com:AllenCellModeling/<project_name>.git
git push -u origin master
If you get permission errors, make sure you have SSH keys installed, or use https://github.com/ instead of git@github.com: in the origin address above.
Your initial commit will show a broken build badge. To fix this, configure codecov and a documentation generation access token following the instructions here.
Access to Quilt data in S3 requires two files.
~/.aws/credentials:
[default]
aws_access_key_id=<your_access_key_id>
aws_secret_access_key=<your_secret_access_key>
~/.aws/config:
[default]
region=us-west-2
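If pushes or pulls fail with credential errors, it can help to confirm that both files exist and contain the expected keys. The following is a minimal sketch using only the Python standard library; `check_aws_files` is a hypothetical helper, not part of the template, and it only checks that the keys are present, not that they are valid.

```python
import configparser
from pathlib import Path


def check_aws_files(aws_dir: Path) -> list:
    """Return a list of problems found; an empty list means both files look usable.

    Hypothetical helper for illustration; it is not provided by the template.
    """
    problems = []
    # File name -> keys expected under the [default] section.
    required = {
        "credentials": ["aws_access_key_id", "aws_secret_access_key"],
        "config": ["region"],
    }
    for filename, keys in required.items():
        path = aws_dir / filename
        if not path.is_file():
            problems.append(f"missing file: {path}")
            continue
        # Disable interpolation so '%' characters in values are never interpreted.
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(path)
        for key in keys:
            if not parser.has_option("default", key):
                problems.append(f"{path}: no '{key}' under [default]")
    return problems
```

Calling `check_aws_files(Path.home() / ".aws")` and printing the result gives a quick list of anything that still needs to be filled in.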
This template comes with an example first workflow step, Raw.
You should be able to run this with the command
<project_name> raw run
This will write out some "raw data" (some randomly generated images) to local_staging/raw.
You should edit the run function of the Raw class in <project_name>/steps/raw/raw.py to do something relevant to your workflow, e.g. aggregating raw data and getting it ready to push to Quilt.
To push the data in local_staging to Quilt, use
<project_name> raw push
If your current git branch is master, this will save your data in Quilt to aics/<project_name>/master/raw.
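The naming convention above can be summarized in a few lines. This is only a sketch of the pattern described in this README; `quilt_package_name` is a hypothetical helper, and the template's own code may build the name differently.

```python
def quilt_package_name(
    project_name: str, branch: str, step_name: str, org: str = "aics"
) -> str:
    """Build a package name of the form <org>/<project_name>/<branch>/<step_name>.

    Hypothetical helper; it only illustrates the convention described in the README.
    """
    return "/".join([org, project_name, branch, step_name])
```

For example, `quilt_package_name("my_project", "master", "raw")` gives `aics/my_project/master/raw`.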
To download the remote data and overwrite your local data, use
<project_name> raw checkout
To download the remote data needed as input to run a step, use
<project_name> raw pull
Since Raw is the first step and doesn't need any inputs, this doesn't do anything here.
To make a new step in your workflow, in the main project directory use
make_new_step <StepName>
This will create a StepName class in <project_name>/steps/step_name/step_name.py, with a run method that is ready for you to edit.
If your step directly depends on the output of another step for its input data, set the direct_upstream_tasks kwarg in the class __init__ method to a list of the steps this one depends on. The list should contain step classes, e.g. direct_upstream_tasks = [Raw].
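As a rough illustration of the dependency declaration, the sketch below uses a stand-in `Step` base class so that it runs on its own. The stand-in, the `Norm` step, and the exact `__init__` signature are assumptions; the generated step file in your project shows the real signature.

```python
from typing import List, Optional, Type


class Step:
    """Stand-in for the template's step base class (assumption, illustration only)."""

    def __init__(self, direct_upstream_tasks: Optional[List[Type["Step"]]] = None):
        self.direct_upstream_tasks = direct_upstream_tasks or []


class Raw(Step):
    """First step: it has no upstream dependencies."""


class Norm(Step):
    """Hypothetical second step that reads the output of Raw."""

    def __init__(self, direct_upstream_tasks: Optional[List[Type[Step]]] = None):
        # Pass step classes, not instances or strings.
        if direct_upstream_tasks is None:
            direct_upstream_tasks = [Raw]
        super().__init__(direct_upstream_tasks=direct_upstream_tasks)
```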
For your step to run successfully, you need to save a dataframe manifest of the files you're writing out to self.manifest, and then save that as manifest.csv. See the Raw step for an example.
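To make the shape of the manifest concrete, here is a dependency-free sketch that writes one row per output file to manifest.csv. It uses the standard-library csv module rather than a dataframe so that it runs anywhere; `write_manifest` and the `filepath` column name are assumptions, so check the Raw step for the columns your project actually uses.

```python
import csv
from pathlib import Path


def write_manifest(step_dir: Path, filepaths: list) -> Path:
    """Write one row per output file to <step_dir>/manifest.csv and return its path.

    Hypothetical helper; the template itself stores the manifest as a dataframe
    on self.manifest before saving it.
    """
    step_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = step_dir / "manifest.csv"
    with open(manifest_path, "w", newline="") as f:
        # "filepath" is an assumed column name.
        writer = csv.DictWriter(f, fieldnames=["filepath"])
        writer.writeheader()
        for filepath in filepaths:
            writer.writerow({"filepath": str(filepath)})
    return manifest_path
```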
To run all of your steps at once, use
<project_name> all run
push and checkout also work with all in the same way, to push or check out all of your data at once.
If you add a new step to your workflow, you should also edit <project_name>/bin/all.py and, in the All class, change self.step_list to include your new steps in the order in which you want to run them.
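The idea behind the ordered list can be sketched as follows. This is a simplified stand-in: the step classes here are placeholders, the hypothetical `Norm` step is not part of the template, and whether the real step_list holds instances or classes should be checked in the generated all.py.

```python
class Raw:
    """Placeholder for the first step."""

    def run(self):
        return "raw"


class Norm:
    """Placeholder for a hypothetical step added after Raw."""

    def run(self):
        return "norm"


class All:
    """Simplified stand-in for the All class in <project_name>/bin/all.py."""

    def __init__(self):
        # Order matters: list upstream steps before the steps that consume their output.
        self.step_list = [Raw(), Norm()]

    def run(self):
        # Run each step in the order given by step_list.
        return [step.run() for step in self.step_list]
```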
You won't be able to push data to Quilt unless your git status is clean. This is intended to maintain parity between the data we save and the code that generated it. To keep an alternate version of your workflow data, switch to a new git branch with
git checkout -b <new_branch_name>
Pushing data to Quilt with e.g. <project_name> raw push will then save your data to aics/<project_name>/<new_branch_name>/raw.
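If you want to check for a clean working tree yourself before pushing, one way is to inspect the output of `git status --porcelain`, which prints nothing when there are no modified, staged, or untracked files. The sketch below is not the template's own check; both function names are hypothetical, and treating untracked files as "dirty" is an assumption.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """Return True when `git status --porcelain` printed nothing."""
    return porcelain_output.strip() == ""


def git_status_is_clean(repo_dir: str = ".") -> bool:
    """Run git in repo_dir and report whether the working tree is clean.

    Hypothetical helper; requires git on the PATH and raises
    CalledProcessError if repo_dir is not a git repository.
    """
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return is_clean(result.stdout)
```

Keeping the parsing in `is_clean` separate from the subprocess call makes the check easy to test without a repository.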
See the README here for all of the optional infrastructure you can (and should) add, e.g. docs, testing, etc.