wf-iclipseq is a bioinformatic pipeline developed in Nextflow for the analysis of iCLIP data. It takes demultiplexed single-end reads (.fastq.gz), a gene annotation file, and a reference genome file to perform pre-processing (incl. quality control (QC), adapter trimming, rRNA filtering), post-processing (incl. genome alignment and crosslink extraction) and downstream analysis (gene annotation, motif analysis). The gene annotation results can be further visualized, taking the bash and R scripts in the scripts/ folder as reference.
- Removing whitespaces and special characters in read IDs (Bash)
- QC on raw reads (FastQC)
- Barcode extraction (UMI-Tools)
- Adapter and quality trimming (Trim Galore!)
- Ribosomal RNA filtering (SortMeRNA or Ribodetector*)
- Quality control on non-RNA reads (FastQC)
- Genome alignment (STAR)
- Quality control on unmapped reads (FastQC)
- Deduplication (UMI-Tools)
- Alignment sorting and indexing (SAMtools)
- Quality control on deduplicated reads (FastQC)
- Extract crosslink events (BEDTools)
- BED to BigWig coverage tracks (bedGraphToBigWig)
- Peak calling (PureCLIP and/or MACS2)
- Gene annotation (Bash & BEDTools and/or HOMER)
- Custom scripts for visualization of results (R and various packages)
- Motif detection (STREME)
- Summarizing results up to the last quality control step (MultiQC)
Steps in the pipeline (with the exception of STAR alignment, for the time being) can be skipped using the conf/params.config file. The pipeline contains multiple quality control steps which can be used to closely monitor the data before peak calling. By default, quality control before and after trimming is enabled, while the others are disabled.
*not included in the pipeline yet, but the pipeline does contain the nextflow script.
By default, the pipeline performs the following things:
- QC on raw reads by FastQC
- Quality and adapter trimming by Trim Galore!
- Genome indexing by STAR
- Genome alignment by STAR (allowing multimappers, with
--outFilterMultimapNmax 20
and doing chimeric detection) - Alignment sorting and indexing by SAMtools
- Deduplication by UMI-Tools
- Crosslink extraction
- Peak calling by PureCLIP
- Resizing PureCLIP binding sites for gene annotation by 8bp (change this to match your RBPs binding footprint, can also be deactivated)
- Gene annotation by Bash & BEDTools, and HOMER
- Motif detection by STREME
- Report by MultiQC
The output of a default run creates the following tree:
├── 01_fastqc
├── 02_trimgalore
│ └── fastqc
├── 04_star
│ ├── log
│ ├── samtools_stats
│ └── umitools
│ ├── bai
│ ├── bam
│ └── fastqc
├── 05_extract_crosslinks
│ └── 06_crosslink
│ ├── merged
│ ├── neg
│ └── pos
├── 06_peak_calling
│ └── pureclip
├── 07_downstream_analysis
│ ├── 00_sites_with_peakID
│ ├── 01_bedtools_annotations
│ ├── 01_homer_annotations
│ └── 02_streme_motifs
│ └── bedtools_resized
├── 08_multiqc
│ ├── multiqc_data
│ └── multiqc_plots
│ ├── pdf
│ ├── png
│ └── svg
└── pipeline_info
If rRNA filtering is enabled, there will be a folder named 03_sortmerna/
.
Nextflow (>23.04.4) and Singularity are required to run this pipeline. Both can be installed using Conda. To download Conda, follow this tutorial. Check that Conda is up-to-date:
conda --version
conda 23.9.0
Make a conda environment to download the packages in and activate it:
conda create -n nf_env
conda activate nf_env
You can install Nextflow using Bioconda, after setting up the proper channels: conda install -c conda-forge -c bioconda nextflow
Or a specific version: conda install -c bioconda nextflow=23.10.0
Make sure to install Singularity from conda-forge and not bioconda, as conda-forge has a more recent version:
conda install -c conda-forge singularity=3.8.6
Check the version of the programs to confirm you have a decently up-to-date program:
nextflow -version
N E X T F L O W
version 23.10.0 build 5889
created 15-10-2023 15:07 UTC (17:07 CEST)
cite doi:10.1038/nbt.3820
http://nextflow.io
singularity --version
singularity version 3.8.6
To download the repository, you can use:
git clone https://github.com/lumc-sasc/wf-iclipseq.git
It is mandatory to provide a samplesheet.csv in the input/
folder that has 5 columns and may look like this:
sample,fastq_1,fastq_2,control_bam,control_bai
s_1,/path/to/fastq.gz/,,,
s_2,/path/to/R1_fastq.gz/,/path/to/R2_fastq.gz/,,
s_3,/path/to/fastq.gz/,,/path/to/input.bam,/path/to/input.bam.bai
Those in bold are mandatory to run the pipeline.
- sample: sample name, make sure it does not contain periods as that may cause errors
- fastq_1: a demultiplexed fastq.gz file
- fastq_2: a second demultiplexed fastq.gz file, in case of paired-end reads
- control_bam: .bam file of input/control data (used for peak calling with PureCLIP/MACS2)
- control_bai: .bai file of input/control data (used for peak calling with PureCLIP)
If control_bam is provided, it is imperative that control_bai is provided as well! Note: This pipeline has been tested for single-end reads, and may not support paired-end reads (remains to be tested).
By default, the reference genome and annotation are specified in the conf/params.config file. You can use these and move on to running the pipeline or you can replace these with your own.
fasta = "https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
gtf = "https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz"
Finally, you can run the pipeline using the following command:
nextflow run main.nf -profile singularity
If you are running from a different (sub)directory, make sure you point to the folder main.nf
is located in. It is recommended to add the -resume
command when partially rerunning the pipeline. You may also specify the output directory with the --outdir
parameter.
- On SLURM, exit code 137 occurs when the memory limit of a process has been exceeded. For example, SortMeRNA (but also STAR) may run out of resources, especially if your sample is large (e.g. >2GB) and contains mostly rRNA (e.g. >70%) sequences you wish to filter out. You can increase these resources in the conf/resources.config file.
- PureCLIP may cause an error. It is usually solved by simply rerunning using the
-resume
command. - Genome indexing is computationally expensive and time-consuming. Therefore, we recommend generating these files once by setting
save_reference
parameter in conf/params.config totrue
and then simply running the pipeline as normal the first time/ This saves the indices in the results directory ('04_star/genome'). Afterwards, you can move/copy the STAR index folder to theinput/
folder and specify `star_index = '/path/to/STAR/index/' in conf/params.config, for example: '$projectDir/input/genome/index/star'.
Possible improvements include:
- Benchmarking with public iCLIP datasets
- Including a test case
- Including a list of parameters
- Compatibility with other container runtimes (e.g. Docker, Appcontainer, conda) to improve accessibility (currently only Singularity is supported)
- Testing and improving performance (runtime, speed, resources)
- Error testing
- Testing for different operating systems (has only been tested on Linux Ubuntu 22.04.3)
- Defining computational requirements (has only been run on an HPC)
- Specify some commands in its own process file instead of the config file
- Optimize check_samplesheet.py further by implementing error raising (instead of an extra function)
- Rename 'fasta' parameter to 'reference_fasta'
- Potentially exclude nf-core modules and subworkflows from the pipeline itself (downloadable by user), although this does not seem to be custom
This pipeline was originally made by Amarise Silié (@amarisesilie) for her MSc internship in Bioinformatics from September 2023 to April 2024. The pipeline uses modules and subworkflows from nf-core, nf-core/rnaseq and goodwright/clipseq in addition to local modules and subworkflows. The paper by Busch et al. 2020 was also used. See CITATION.md file for a full list of references.