- Author(s): Boas van der Putten, Roxanne Wolthuis
- Organization: Rijksinstituut voor Volksgezondheid en Milieu (RIVM)
- Department: Infektieziekteonderzoek, Diagnostiek en Laboratorium Surveillance (IDS), Bacteriologie (BPD)
- Start date: 07 - 04 - 2023
- Commissioned by: Thijs Bosch
Apollo-mapping is the first pipeline created in the Apollo pipeline series. The Goal of these pipelines is to set up a routine surveillance for fungi (A.fumigatus, Candida). The apollo-mapping pipeline is created with the juno-template and juno-library.
The input of the pipeline is raw Illumina paired-end data in the form of two fastq files (with extension .fastq, .fastq.gz, .fq or .fq.gz), containing the forward and the reversed reads ('R1' and 'R2' must be part of the file name, respectively).
The pipeline uses the following tools(NOT COMPLETE):
- FastQC (Andrews, 2010) is used to assess the quality of the raw Illumina reads
- FastP (Chen, Zhou, Chen and Gu, 2018) is used to remove poor quality data and adapter sequences
- Picard determines the library fragment lengths
- MultiQC (Ewels, Magnusson, Lundin, & Käller, 2016) is used to summarize analysis results and quality assessments in a single report for dynamic visualization.
- Kraken2 and Bracken for identification of fungal species.
- Linux environment
- (mini)conda
- Python 3.11
- Clone the repository.
git clone https://github.com/RIVM-bioinformatics/apollo-mapping.git
- Go to the pipeline directory.
cd apollo-mapping
- Create & activate mamba environment.
conda env update -f envs/mamba.yaml
conda activate mamba
- Create & activate apollo environment.
mamba env update -f envs/apollo_mapping.yaml
conda activate apollo_mapping
- Example of run:
python3 apollo_mapping.py -i [input] -o [output] -s [species]
-h, --help
Shows the help of the pipeline
-i, --input
Relative or absolute path to the input directory. It must contain all the raw reads (fastq) files for all samples to be processed (not in subfolders)-s, --species
Species to use, choose from: ['candida_auris', 'aspergillus_fumigatus']
-o --output
Relative or absolute path to the output directory. If none is given, an 'output' directory will be created in the current directory-w, --workdir
Relative or absolute path to the working directory. If none is given, the current directory is used.-ex, --exclusionfile
Path to the file that contains samplenames to be excluded.-p, --prefix
Conda or singularity prefix. Basically a path to the place where you want to store the conda environments or the singularity images.-l, --local
If this flag is present, the pipeline will be run locally (not attempting to send the jobs to an HPC cluster**). The default is to assume that you are working on a cluster. **Note that currently only LSF clusters are supported.-tl, --time-limit
Time limit per job in minutes (passed as -W argument to bsub). Jobs will be killed if not finished in this time.-u, --unlock
Unlock output directory (passed to snakemake).-n, --dryrun
Dry run printing steps to be taken in the pipeline without actually running it (passed to snakemake).-q, --queue
Name of the queue that the job will be submitted to if working on a cluster.-mpt, --mean-quality-treshold
Phred score to be used as threshold for cleaning (filtering) fastq files.-ws, --window-size
Window size to use for cleaning (filtering) fastq files.-ml, --minimum-lenth
Minimum length for fastq reads to be kept after trimming.--no-containers
Use conda environments instead of containers.--snakemake-args
Extra arguments to be passed to snakemake API (https://snakemake.readthedocs.io/en/stable/api_reference/snakemake.html).--reference
Reference genome to use default is chosen based on species argument, defaults per species can be found in: /mnt/db/apollo/mapping/[species]--db-dir
Kraken2 database directory (should include fungi!)
python3 apollo-mapping.py -i [dir/to/fasta_or_fastq_files] -s [species]
python3 apollo-mapping.py -i [dir/to/fasta_or_fastq_files] -o [/path/to/output/location] -s aspergillus_fumigatus
Detailed information about the pipeline can be found in the [documentation](link to other docs). This documentation is only suitable for users that have access to the RIVM Linux environment.
- audit_trail: Logs of conda, git and the pipeline, a sample sheet, the used parameters and a snakemake report.
- clean_fastq: cleaned fastq files.
- identify_species: Output of kraken and bracken for species identification.
- log: Log with output and error file from the cluster for each Snakemake rule/step that is performed.
- mapped_reads: Mapping output.
- multiqc: Multiqc output and multiqc html report.
- qc_clean_fastq: Quality control of clean fastq reads.
- qc_mapping: Quality control of mapping.
- reference: Reference genome used.
- variant: Variant calling results.
- This pipeline only works on the RIVM cluster.
- Make this pipeline available and user friendly for users outside RIVM.
This pipeline is licensed with a AGPL3 license. Detailed information can be found inside the 'LICENSE' file in this repository.
- Contact person: IDS-Bioinformatics
- Email: [email protected]
Apollo pipelines use a feature branch workflow. To work on features, create a branch from the main
branch to make changes to. This branch can be merged to the main branch via a pull request. Hotfixes for bugs can be committed to the main
branch.
Please adhere to the conventional commits specification for commit messages. These commit messages can be picked up by release please to create meaningful release messages.