RNA SeA-SnaP is an RNA-(Se)q (A)nalysis (Sna)kemake (P)ipeline tool that combines two tasks:
- A sub-pipeline that maps fastq files to a reference genome/transcriptome using STAR or Salmon and includes extensive quality control (Fastqc, Dupradar, Qualimap, RNASeQC, Preseq, infer_experiment, Multiqc)
- A sub-pipeline for Differential Expression (DE) analysis with DESeq2
Both pipelines are based on Snakemake.
The focus of RNA SeA-SnaP is to be as easy to use, adapt and develop as possible. To this end, SeA-SnaP is divided into three main parts:
- Pipeline: Nearly all code corresponding to a specific (sub-)pipeline is included in one file.
- Configuration: Setting parameters and pipeline configuration is done via a separate config file (YAML).
- Tools: Generic functions and tools that are part of the pipeline framework are located in a separate file.
Finally, there is also a directory with R Markdown snippets for the DE sub-pipeline. Based on the settings in the config file, individual snippets can be assembled into a customized report. Splitting the report into snippets makes it easy to develop, share and include different analyses of the results.
After cloning this git repository:
git clone [email protected]:CUBI/Pipelines/seasnap-pipeline.git
all required tools and packages can be installed via conda.
Currently there are two separate conda environments, one for the mapping pipeline and one for the DE pipeline.
Download and install them into new environments called sea_snap_mapping and sea_snap_de:
conda env create -f conda_env_mapping.yaml
conda env create -f conda_env_DE.yaml
The files conda_env_mapping.yaml and conda_env_DE.yaml are located in the main directory of the git repository.
Each time before using SeA-SnaP, activate the environment with:
conda activate seasnap-mapping
or
conda activate seasnap-de
Finally, run the following commands to execute install_r_packages.R in the seasnap-de environment:
conda activate seasnap-de
Rscript install_r_packages.R
set up a working directory
Set up a working directory to store the results produced by the pipeline.
(For CUBI projects, create a project directory on the cluster under /fast/groups/cubi/projects/.)
To create a directory and copy the files required for configuring your pipeline, run:
path/to/git/sea-snap.py working_dir
This will create a directory called results_<year>_<month>_<day>/ in the location from which you run the command and add config files for both pipelines. You can customize this behaviour via the command line options (type sea-snap.py working_dir -h for help).
Directory names you provide can include formatting instructions for Python's time package.
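As a rough illustration of how such formatting instructions expand, the sketch below uses `time.strftime` from Python's standard library. The helper name `expand_dir_name` is hypothetical, and it is an assumption that SeA-SnaP applies the pattern in exactly this way; the pattern shown simply mirrors the default results_<year>_<month>_<day>/ naming described above.

```python
import time

def expand_dir_name(pattern, when=None):
    """Expand strftime-style directives (e.g. %Y, %m, %d) in a directory name.

    Hypothetical helper for illustration only; not part of SeA-SnaP.
    """
    if when is None:
        when = time.localtime()  # default to the current local time
    return time.strftime(pattern, when)

# A pattern mirroring the default naming scheme, expanded for today's date.
print(expand_dir_name("results_%Y_%m_%d/"))
```

Any other directive understood by `time.strftime` (for example `%H` for the hour) could be used in the same way.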
Then cd <dir_name> into the newly created working directory.
SeA-SnaP also creates a symbolic link to the sea-snap.py script, so from now on you can use ./sea-snap to run helpers or pipelines from the working directory.
You should always run pipelines and helpers from there.
run the pipeline
The next steps depend on whether you want to run:
The results of an analysis can also be exported to a new folder structure, e.g. to upload them to SODAR.
Let's first introduce the general structure of SeA-SnaP.
As outlined above, the core pipeline functionality is separated from the pipeline configuration and from additional generic tools such as the path handler (which handles where files are stored). The config file is loaded in Snakemake, and its static parts (like parameter values) can be accessed directly in the pipeline rules. For the 'dynamic' parts of the configuration, such as file paths described by path patterns or the report and contrast configuration, tools are provided that can be used within the pipeline to access this information.
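To make the distinction between static and dynamic configuration more concrete, here is a minimal sketch. The config keys, the pattern string and the `resolve_path` helper are all hypothetical and are not taken from SeA-SnaP's actual config file or path handler; the sketch only shows the general idea of reading a fixed parameter directly versus filling a path pattern with wildcard values.

```python
# Hypothetical configuration, as it might look after loading a YAML file.
config = {
    # Static part: a plain parameter value that a rule can read directly.
    "rule_options": {"star": {"threads": 8}},
    # Dynamic part: a pattern that only becomes a path once wildcards are known.
    "path_pattern": "mapping/{step}/{sample}/out",
}

def resolve_path(pattern, **wildcards):
    """Fill a path pattern with concrete wildcard values (illustrative only)."""
    return pattern.format(**wildcards)

# Static access: a direct dictionary lookup.
threads = config["rule_options"]["star"]["threads"]

# Dynamic access: the pattern is resolved for a given step and sample.
path = resolve_path(config["path_pattern"], step="star", sample="sample1")
print(threads, path)
```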
In addition, there is a directory with report snippets for the DE pipeline: small pieces of R Markdown code that each run a single analysis step, such as producing a PCA plot. The configuration file sets which snippets to use and in which order to assemble them into a full report.
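The idea of assembling snippets in a configured order can be sketched as follows. This is not SeA-SnaP's implementation: the function `assemble_report`, the snippet names and their contents are made up for illustration, and snippets are passed in as strings here, whereas a real setup would presumably read them from the snippet directory.

```python
def assemble_report(snippets, order):
    """Concatenate snippet texts in the requested order.

    snippets: dict mapping a snippet name to its R Markdown text
    order:    list of snippet names, e.g. as given in a config file
    (Hypothetical helper; names and structure are assumptions.)
    """
    missing = [name for name in order if name not in snippets]
    if missing:
        raise KeyError(f"unknown snippets: {missing}")
    # Separate snippets with a blank line so headings do not run together.
    return "\n\n".join(snippets[name].strip("\n") for name in order) + "\n"

# Made-up snippet contents for illustration.
snippets = {
    "pca": "## PCA\nPCA plot code goes here.",
    "ma_plot": "## MA plot\nMA plot code goes here.",
}
report = assemble_report(snippets, ["ma_plot", "pca"])
print(report)
```

Keeping the order in a plain list means the same snippets can be reused across reports by changing only the configuration.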
Finally, there are some helper functions that can be accessed via the ./sea-snap wrapper, e.g. to automatically produce a covariate file or sample information.
There are also the folders external_scripts/, where scripts used in the pipeline can be placed (although it is preferable to keep small pieces of code inside the Snakemake file), and report/R_common/, where generic R functions that may be used in several report snippets can be put.
The pipelines can be easily extended.
See the separate sections for:
Available commands in the ./sea-snap wrapper:
helpers:
- working_dir: set up a new working directory for pipeline results
- sample_info: generate a yaml file with sample information used by the mapping pipeline
- covariate_file: generate a table with information required by the DE pipeline
- select_contrast: display information to help choose a contrast definition
run pipeline:
- mapping: run the mapping pipeline
- DE: run the DE pipeline
Type ./sea-snap -h or ./sea-snap COMMAND -h for help.
The following has been inferred from single-end data:
- STAR reports the total number of input reads, the number of uniquely mapped reads, and the number of reads mapped to multiple loci (each counted once)
- STAR does not directly report the total number of unmapped reads, but it does report the number of reads unmapped due to mapping to too many locations
- featureCounts reports in its summary file the total number of reads found in the alignment file; multi-mapping reads are counted several times
- hence, summing up all the numbers from featureCounts will not give the number of input reads reported by STAR
- however, the number of unmapped reads should be the same, and summing up all categories except Unassigned_MultiMapping and Unassigned_Unmapped should give the number of uniquely mapped reads reported by STAR
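The bookkeeping in the last two points can be checked with a few lines of Python. The counts below are invented purely for illustration and do not come from any real run; the category names follow the featureCounts summary file, and the helper name is hypothetical.

```python
def uniquely_mapped_from_featurecounts(summary):
    """Sum all featureCounts categories except multi-mapping and unmapped reads."""
    excluded = {"Unassigned_MultiMapping", "Unassigned_Unmapped"}
    return sum(n for status, n in summary.items() if status not in excluded)

# Invented featureCounts summary (multi-mapping reads appear once per alignment).
fc_summary = {
    "Assigned": 700,
    "Unassigned_Unmapped": 50,
    "Unassigned_MultiMapping": 300,
    "Unassigned_NoFeatures": 150,
    "Unassigned_Ambiguity": 50,
}

# Invented STAR numbers for the same hypothetical sample.
star_uniquely_mapped = 900

fc_total = sum(fc_summary.values())
fc_unique = uniquely_mapped_from_featurecounts(fc_summary)

# The grand total is inflated by multi-mapping reads being counted several times,
# while the filtered sum matches STAR's uniquely mapped reads.
print(fc_total, fc_unique, fc_unique == star_uniquely_mapped)
```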
Address questions to Patrick Pett ([email protected])