Analysis of mutations at codon 625 of SF3B1 gene in uveal melanoma.
A slightly enhanced version of this file is available in gitbook, pdf, and epub formats here.
The pipeline runs on Nextflow, a domain-specific language created to automate data-analysis pipelines whilst maximising reproducibility. Nextflow lets scientists focus on their analyses by isolating the parts of a pipeline into processes, whose dependencies can be handled with containers and virtual environments using technologies such as Docker, Singularity, and Anaconda.
The recommended way to install Nextflow is via conda, using the environment file.
conda env create -f nextflow_conda_env.yml # will create an env called "nextflow"
conda activate nextflow
# Edit the environment file as needed, especially if the environment name
# conflicts with a pre-existing conda env on your system
Docker should be installed as well:
sudo apt install docker.io # on Debian/Ubuntu, the Docker engine is packaged as docker.io
Once Nextflow is installed, it will automatically retrieve the Docker images used within the pipeline.
Nextflow workflows should form a DAG (i.e. directed acyclic graph), which represents the flow of data through the different steps required to produce the final result.
This pipeline will generate a set of figures, representing differential gene expression analysis of RNA-Seq data.
The pipeline requires a machine with at least 32 GB of free RAM (needed to build the index and to map reads to the reference genome). The recommended configuration is 64 GB; by default, the mapping process is configured to use 50 GB.
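The memory figures above can be checked before launching anything. The following is a minimal sketch, assuming a Linux machine where /proc/meminfo exposes a MemAvailable field; the variable names are illustrative and not part of the pipeline.

```shell
# Sketch: compare available memory with the minimum quoted above.
# Assumes Linux; MemAvailable is reported in kB in /proc/meminfo.

required_gb=32   # minimum from the requirements; the mapping default is 50 GB

# MemAvailable estimates how much memory can be used without swapping.
available_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
available_gb=$(( ${available_kb:-0} / 1024 / 1024 ))   # falls back to 0 if the field is missing

if [ "$available_gb" -lt "$required_gb" ]; then
    echo "Only ${available_gb} GB available; at least ${required_gb} GB is needed." >&2
else
    echo "${available_gb} GB available; the ${required_gb} GB minimum is met."
fi
```

If the check fails, closing other programs or lowering the memory reserved for mapping may help, although indexing itself still needs roughly 30 GB.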
Read more about the setup used to develop this pipeline in the documentation.
- Clone the repo to your machine
git clone https://github.com/bio-TAGI/Hackathon.git
cd Hackathon
- Create and activate the virtual environment
conda env create -f nextflow_conda_env.yml
conda activate nextflow
- Run the workflow with default parameters.
cd Nextflow
nextflow run main.nf
- If you had to stop the workflow run, or if some error occurred, you can always resume the execution as follows:
nextflow run main.nf -resume
- Specifying parameters from the command line
nextflow run main.nf --param1 value1 \
    --param2 value2 \
    --paramn valuen # these are generic names, not actual parameters for the pipeline
- index_cpus: number of CPUs reserved for the genome indexing process (default=14)
- mapping_cpus: number of CPUs reserved for the mapping process, which creates the BAM files (default=14)
- counting_cpus: number of CPUs reserved for the counting process (default=7)
- mapping_memory: RAM reserved for mapping (default=50GB)
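As an illustration of the resource parameters above, the sketch below scales the CPU settings to the local machine. The parameter names come from the list; using half the cores for counting simply mirrors the 14/7 ratio of the defaults and is otherwise an arbitrary choice. It assumes `nproc` (GNU coreutils) is available.

```shell
# Sketch: derive CPU settings from the number of cores on this machine.

cores=$(nproc)                     # cores visible to this shell
counting_cpus=$(( cores / 2 ))     # half the cores, as in the 14/7 defaults
[ "$counting_cpus" -ge 1 ] || counting_cpus=1   # never drop below one core

cmd="nextflow run main.nf --index_cpus $cores --mapping_cpus $cores --counting_cpus $counting_cpus"

# Print the command for review; paste it into the shell once it looks right.
echo "$cmd"
```

mapping_memory is left at its default here; on a machine with less RAM it would need to be lowered as well.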
If you already possess some of the files needed to execute the pipeline, you can specify them as follows:
- reads: path to a directory containing the FASTQ files
- genome: path to a directory containing the genome FASTA file
- index: directory containing the index files
- mapping: directory containing the BAM files
- counting: full path to the count file, including the file name
- metadata: full path to the metadata file, including the file name
If unspecified, the pipeline runs with the default values from the config file, nextflow.config. These too can be tweaked and overridden:
- ids: list of SRR accession numbers used to fetch the paired-end FASTQ files.
  - default: ['SRR628582', 'SRR628583', 'SRR628584', 'SRR628585', 'SRR628586', 'SRR628587', 'SRR628588', 'SRR628589']
- genome_url: URL from which to download the reference genome.
  - default: ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
- annotation_url: URL from which to download the reference genome's annotation.
  - default: ftp://ftp.ensembl.org/pub/release-101/gtf/homo_sapiens/Homo_sapiens.GRCh38.101.chr.gtf.gz
- sjdbOverhang: a STAR-specific parameter (default=99).
  - For further information about this parameter, see this tutorial, or the STAR manual.
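When several defaults need changing, the overrides can be kept in a file and passed with Nextflow's `-params-file` option instead of being typed on the command line. The sketch below writes such a file. It assumes the pipeline accepts `ids` as a YAML list, and the read length of 100 is a placeholder to check against your own data. The STAR manual suggests that sjdbOverhang should ideally be the read length minus one, which is the rule applied here.

```shell
# Sketch: write parameter overrides to params.yml.
# read_length is an assumed value, not something read from the data.

read_length=100
sjdb_overhang=$(( read_length - 1 ))   # STAR manual: ideally read length - 1

# Two accessions from the default list, for a quick trial run.
cat > params.yml <<EOF
ids:
  - SRR628582
  - SRR628583
sjdbOverhang: $sjdb_overhang
EOF

cat params.yml
# Then launch with: nextflow run main.nf -params-file params.yml
```

A two-sample run of this kind is only meant to exercise the download and mapping steps; the differential expression step may well need the full sample set.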
- A good internet connection is required. Retrieving fastq files can be really slow and is thus a bottleneck.
- fasterq-dump will randomly segfault. At first we thought this was caused by connection problems, but running ping ruled this out. Apparently, the segfault is a known issue.
- The workflow will inevitably fail if you try building the genome's index on a machine with less than ~30 GB of RAM available.
- As a general rule, tweak all parameters to reasonable values that fit your setup and needs. We don't know your hardware, you do ;)
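One way to cope with the random fasterq-dump segfaults when downloading by hand is to retry the command a few times. The `retry` helper below is a hypothetical sketch and not part of the pipeline; inside the workflow itself, resuming with -resume, or Nextflow's `errorStrategy 'retry'` directive, serves the same purpose.

```shell
# retry MAX CMD [ARGS...]: rerun CMD until it succeeds or MAX attempts are used up.
retry() {
    max=$1
    shift
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        sleep 1   # short pause before the next attempt
    done
}

# Example with an accession from the default list:
# retry 3 fasterq-dump --split-files SRR628582
```

The helper treats every non-zero exit status the same way, so a genuinely invalid accession is retried too before it gives up.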