A Nextflow pipeline for processing and calling of variants: FASTQ to VCF
Welcome to this FOCYTE Pipeline repository! This pipeline is designed to process FASTQ files, perform quality control, align reads to a reference genome, and call genetic variants. The workflow uses Nextflow to orchestrate the different steps of the analysis. Below you'll find details about the stages involved in this pipeline along with usage instructions.
The pipeline processes paired-end FASTQ files and generates a VCF file containing the variants. It consists of two main phases:
- Quality Control & Trimming: Raw reads are cleaned and trimmed using fastp.
- Variant Mapping & Calling: Reads are mapped to a reference genome using bwa, and variants are called using bcftools.
This process uses the tool fastp to clean and trim raw FASTQ files, generating high-quality reads for downstream analysis. Each sample's paired-end reads are taken as input, and the output includes:
- Trimmed FASTQ files
- A quality control report (JSON and HTML formats)
process FASTP {
    tag "FASTP on $sample_id"
    publishDir params.outdir, mode: 'copy'

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path('trim_*.fastq.gz'), emit: reads
    tuple val(sample_id), path("${sample_id}.json"), emit: json
    tuple val(sample_id), path("${sample_id}.html"), emit: html

    script:
    """
    fastp -i ${reads[0]} -I ${reads[1]} \
        -o trim_${reads[0]} -O trim_${reads[1]} \
        --json ${sample_id}.json \
        --html ${sample_id}.html
    """
}
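The output names follow directly from the script block: each trimmed file keeps its input file name with a trim_ prefix, and the two reports are named after the sample ID. As a rough illustration, the shell sketch below prints the files FASTP would be expected to publish for one sample. The helper name expected_fastp_outputs and the example file names are hypothetical and only mirror the naming used in the process above.

```shell
# Sketch: list the files the FASTP process is expected to publish for one sample.
# expected_fastp_outputs is a hypothetical helper, not part of the pipeline.
expected_fastp_outputs() {
    local sample_id="$1" read1="$2" read2="$3"
    # Trimmed reads keep the input file name, prefixed with "trim_"
    printf 'trim_%s\n' "$(basename "$read1")" "$(basename "$read2")"
    # Reports are named after the sample ID
    printf '%s.json\n%s.html\n' "$sample_id" "$sample_id"
}

# Example with made-up file names
expected_fastp_outputs sampleA sampleA_1.fastq.gz sampleA_2.fastq.gz
```

Because the output glob in the process is trim_*.fastq.gz, input files that do not end in .fastq.gz would not be captured as trimmed reads.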
The cleaned FASTQ files are aligned to a reference genome using bwa, and the pipeline converts the SAM file into BAM format, sorts it, and indexes it for further analysis.
The reference genome is first indexed, and the index is then used to align the trimmed reads from the Process.nf pipeline, producing one SAM file per sample.
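process INDEX {
    input:
    path transcriptome   // reference genome FASTA

    output:
    // bwa names its index files after the FASTA; this glob assumes that name starts with "genome"
    path "genome*"

    script:
    """
    bwa index $transcriptome
    """
}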
process MAPPING {
    tag "BWA on $sample_id"
    publishDir params.outdir, mode: 'copy'

    input:
    tuple val(sample_id), path(reads1), path(reads2)
    path index

    output:
    path "${sample_id}.sam"

    script:
    """
    bwa mem ${params.transcriptome_file} ${reads1} ${reads2} > ${sample_id}.sam
    """
}
The BAM file is generated, sorted, and indexed for variant calling:
process BAMCONVERT {
    input:
    path sam_file

    output:
    path "${sam_file.baseName}.bam"

    script:
    """
    samtools view -h -S -b -o ${sam_file.baseName}.bam ${sam_file}
    """
}

process BAMSORT {
    input:
    path bam_file

    output:
    path "${bam_file.baseName}_sorted.bam"

    script:
    """
    samtools sort ${bam_file} -o ${bam_file.baseName}_sorted.bam
    """
}

process BAMINDEX {
    input:
    path sorted_bam_file

    output:
    path "${sorted_bam_file.baseName}.bam.bai"

    script:
    """
    samtools index ${sorted_bam_file}
    """
}
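Each of these three processes derives its output name from baseName, which is the input file name without its last extension. The shell sketch below mimics that rule to show how one SAM file name propagates through the three steps. The helpers base_name and bam_stage_names and the example file name are hypothetical; they only reproduce the naming, not the samtools calls.

```shell
# Sketch: mimic Nextflow's baseName (file name without its last extension)
# to follow the file names through BAMCONVERT, BAMSORT and BAMINDEX.
# base_name and bam_stage_names are hypothetical helpers.
base_name() {
    local name
    name="$(basename "$1")"
    printf '%s\n' "${name%.*}"
}

bam_stage_names() {
    local sam="$1" bam sorted
    bam="$(base_name "$sam").bam"              # BAMCONVERT output
    sorted="$(base_name "$bam")_sorted.bam"    # BAMSORT output
    # Last line printed is the BAMINDEX output
    printf '%s\n' "$bam" "$sorted" "$(base_name "$sorted").bam.bai"
}

# Example with a made-up file name
bam_stage_names sampleA.sam
```

For sampleA.sam this prints sampleA.bam, sampleA_sorted.bam and sampleA_sorted.bam.bai, one per line.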
In this stage, variants are called using bcftools. The pipeline uses mpileup to create a BCF file, which is then processed to generate a VCF file with the called variants.
process BCFPILEUP {
    input:
    path flagged_bam_file
    path index

    output:
    path "${flagged_bam_file.baseName}.bcf"

    script:
    """
    bcftools mpileup -O b -o ${flagged_bam_file.baseName}.bcf -f ${params.transcriptome_file} ${flagged_bam_file}
    """
}

process BCFCALL {
    input:
    path flagged_bcf_file

    output:
    path "${flagged_bcf_file.baseName}.vcf"

    script:
    """
    bcftools call --ploidy 1 -m -v -o ${flagged_bcf_file.baseName}.vcf ${flagged_bcf_file}
    """
}
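Since BCFCALL runs with -v, the resulting VCF holds only variant sites, so counting its non-header lines gives a quick sanity check of how many variants were called. The following is a minimal sketch; count_variants is a hypothetical helper and it assumes the uncompressed VCF written by the process above.

```shell
# Sketch: count variant records in an uncompressed VCF.
# Header lines start with '#'; every other line is one record.
# count_variants is a hypothetical helper, not part of the pipeline.
count_variants() {
    # grep exits non-zero when the count is 0; "|| true" keeps that from
    # aborting scripts that run with "set -e"
    grep -vc '^#' "$1" || true
}
```

For example, count_variants sampleA_sorted.vcf would print the number of records in that file (the file name is illustrative).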
Clone this repository and navigate into it. Within your project directory, create the following folders:
mkdir -p ./data/ref-genome
In data, place your paired .fastq files for analysis. In ref-genome, place a FASTA file of your genome of interest.
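Before running anything, it can help to confirm that every read-1 file has its read-2 mate. The sketch below does this in plain shell. It assumes the files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz, which is a guess at the naming rather than something this pipeline mandates; check_pairs is a hypothetical helper.

```shell
# Sketch: report read-1 files in a directory that have no read-2 mate.
# Assumes files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz;
# adjust the suffixes if your files follow a different convention.
# check_pairs is a hypothetical helper, not part of the pipeline.
check_pairs() {
    local dir="$1" r1 r2 missing=0
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue          # glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $(basename "$r1")"
            missing=1
        fi
    done
    # Exit status 1 if any mate is missing, 0 otherwise
    return "$missing"
}
```

Calling check_pairs ./data prints one line per unpaired file and returns a non-zero status if any are found.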
Create your conda environment containing the required tools and dependencies by loading the .yml file:
conda env create -f variant.yml
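Once the environment is activated, a quick check that the tools named in this README are on your PATH can save a failed run. This is a small sketch, assuming variant.yml provides fastp, bwa, samtools and bcftools; missing_tools is a hypothetical helper, and you can add nextflow to the list if you want to check it too.

```shell
# Sketch: print every requested command that is not found on PATH.
# missing_tools is a hypothetical helper; the tool list mirrors the
# programs named in this README.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Prints nothing when all four tools are available
missing_tools fastp bwa samtools bcftools
```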
Run the Process.nf pipeline:
nextflow run Process.nf
Check the read quality after trimming using the fastp HTML or JSON report for each sample.
Run the MapCall.nf pipeline:
nextflow run MapCall.nf
Clinical isolates from patients with SARS-CoV-2 infection, collected by the COVID-19 Genomics UK (COG-UK) Consortium, were obtained from the European Nucleotide Archive. Eight samples were randomly selected, and variants were called against the Wuhan-Hu-1 reference sequence (GCA_009858895.3) using this pipeline.
VCF files were visualised against the Wuhan-Hu-1 genome assembly in IGV: