An approach to integrate metagenomics, metatranscriptomics and metaproteomics data found in public resources such as MGnify (for metagenomics/metatranscriptomics) and the PRIDE database (for metaproteomics). When these omics techniques are applied to the same sample, their integration offers new opportunities to understand the structure (metagenome) and functional expression (metatranscriptome and metaproteome) of the microbiome.
You need a working installation of Snakemake. Then:
git clone <this-repo>
The pipeline uses conda environments to manage dependencies, which are handled automatically if you run snakemake with the --use-conda flag.
It also relies on some tools (ThermoRawFileParser, SearchGUI and PeptideShaker) which do not have conda packages or Docker images available for the versions we used.
These tools are downloaded on-the-fly by snakemake, so you do not need to install them separately.
There is a small test-data set, using a few assemblies from MGnify and two RAW files from PRIDE. To fetch the (~GB size) RAW files, which are too big for this git repository:
./test-data/pride/fetch-pride-test-data.sh
This downloads two RAW files into test-data/pride/.
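Before starting the run, it can be worth confirming that both files actually arrived. The helper below is a minimal sketch rather than part of the pipeline: the function name count_raw_files is hypothetical, and it assumes the downloaded files end in .raw or .RAW, which the fetch script may or may not guarantee.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): count files ending in
# .raw, in any letter case, directly inside the given directory.
count_raw_files() {
    local dir="$1"
    # -maxdepth 1 limits the search to the folder itself;
    # -iname matches both .raw and .RAW; a missing folder counts as 0.
    find "$dir" -maxdepth 1 -type f -iname '*.raw' 2>/dev/null | wc -l | tr -d '[:space:]'
}

# Example, run from the repository root after the fetch script has finished:
# count_raw_files test-data/pride/
```

If the assumption about the file extension holds, the count for the test data set should be 2.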
Then:
conda activate snakemake # (assuming you installed snakemake with conda, into an env called snakemake)
cd MetaPUF
snakemake --cores 4 --use-conda
This will run the pipeline on the small dataset, and put results into ../test-run.
Edit the config/config.proteomics.yaml and sample_info.csv files to point the pipeline at real data.
sample_info.csv is the mapping of MGnify to PRIDE datasets, and in the config, parameters.input_dir and parameters.raw_dir refer to the MGnify and PRIDE data folders respectively.
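A long run can fail late if one of these paths is mistyped, so a quick pre-flight check may help. The sketch below is an illustration only: check_inputs is a hypothetical helper that does not read the YAML config itself, so you pass it the same paths you wrote into the config, and the paths in the example are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the pipeline).
# Arguments: the sample sheet, the MGnify folder (parameters.input_dir)
# and the PRIDE folder (parameters.raw_dir), as written in the config.
check_inputs() {
    local sample_info="$1" input_dir="$2" raw_dir="$3"
    local status=0
    # -s is true only for a file that exists and is not empty.
    [ -s "$sample_info" ] || { echo "missing or empty: $sample_info" >&2; status=1; }
    [ -d "$input_dir" ]   || { echo "not a directory (MGnify data): $input_dir" >&2; status=1; }
    [ -d "$raw_dir" ]     || { echo "not a directory (PRIDE data): $raw_dir" >&2; status=1; }
    return "$status"
}

# Example with placeholder paths; the dry-run only starts if every check passes:
# check_inputs sample_info.csv /path/to/mgnify /path/to/pride && snakemake -np
```

The function reports every problem it finds instead of stopping at the first one, so a single call shows all paths that need fixing.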
- You can run a dry-run to check for any syntax errors
snakemake -np
- To run the workflow
snakemake --cores 4 --use-conda
- Using LSF on an HPC cluster:
bsub -n 4 -R "rusage[mem=4096]" -J metapuf -u $USER -o job.log -e job.err snakemake --cores 4 --use-conda
- Tip: if the pipeline crashes partway through a run, first do a dry-run with snakemake -np to check which rules still need to be executed. If you are sure that some output files were generated correctly, you can use snakemake --cleanup-metadata <filenames> so that they are not re-generated. If snakemake --cleanup-metadata <filenames> does not work, you can also try manually deleting the .snakemake/incomplete directory.
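The recovery steps above can be sketched as follows. The two snakemake commands are the ones named in the tip, shown as comments with a placeholder file name; clear_incomplete_markers is a hypothetical helper for the manual fallback, and it assumes you pass the directory from which snakemake was started.

```shell
#!/usr/bin/env bash
# 1. See what Snakemake still plans to run:
#      snakemake -np
# 2. Clear the stored metadata (including the "incomplete" mark) for
#    output files you trust; the path here is a placeholder:
#      snakemake --cleanup-metadata path/to/output.file
# 3. Fallback: remove the incomplete-job markers by hand.
clear_incomplete_markers() {
    # Hypothetical helper; defaults to the current directory.
    local workdir="${1:-.}"
    local marker_dir="$workdir/.snakemake/incomplete"
    if [ -d "$marker_dir" ]; then
        rm -rf -- "$marker_dir"
        echo "removed $marker_dir"
    else
        echo "nothing to remove in $workdir"
    fi
}

# Example, run from the directory where the workflow was launched:
# clear_incomplete_markers .
```

Deleting the markers only tells Snakemake to stop treating those outputs as unfinished, so it is worth repeating the dry-run afterwards to confirm the remaining plan looks right.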
This repository also contains a utility for packaging the pipeline's output GFF files as RO-Crate, suitable for distribution and visualisation on the MGnify website. See utils/package_as_rocrate/ for full details.
git clone https://github.com/PRIDE-reanalysis/MetaPUF.git
cd MetaPUF
conda activate snakemake # or another conda/venv if you prefer
pip install ".[dev,docs]"
pre-commit install
This installs the development requirements and the pre-commit hooks, which format the code correctly when committing changes.
You can also manually format the code by running black . in the repository directory.
It also installs mkdocs, which is used to build the documentation.
Change the markdown files in the docs/ folder.
Then run mkdocs serve to view the documentation site locally.
As part of our efforts toward delivering open and inclusive science, we follow the Contributor Covenant Code of Conduct for Open Source Projects.