Snakemake workflow: `QCforSeqCode`

Author: Richard Stöckl

About

A Snakemake pipeline that checks whether a prokaryotic assembly meets the requirements for inclusion in the SeqCode initiative.

The requirements are outlined in APPENDIX I of the SeqCode.

Usage

Detailed usage instructions are available in the Snakemake workflow catalog.

A rough overview:

Install conda (mamba or miniconda is fine).
Install snakemake with:

conda install -c conda-forge -c bioconda snakemake

Download the CheckM2 database (e.g. via wget https://zenodo.org/api/files/fd3bc532-cd84-4907-b078-2e05a1e46803/checkm2_database.tar.gz).
Download the GTDB-Tk database (e.g. via wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz).
Download the latest release from this repo and cd into it.
Edit config/config.yaml to provide the paths to your results and logs directories, the paths to the databases you downloaded, and any parameters you want to change.
Edit config/sampleData.csv with the specific details for each assembly you want to check. The pipeline automatically adjusts which steps it runs based on what you enter here.
Open a terminal in the main directory and start a dry-run of the pipeline with the following command. This will download and install all the dependencies for the pipeline (this step may take some time) and will show you whether the paths are set up correctly:

snakemake --sdm conda -n --cores

Run the pipeline with:

snakemake --sdm conda --cores
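
Before the dry-run, it can help to confirm that the sample sheet is filled in consistently. The following is a minimal sketch of such a pre-flight check and is not part of the pipeline. The function name `check_sample_sheet` and the column names in the usage comment are hypothetical; the actual columns are defined by the config/sampleData.csv file shipped with the workflow.

```python
import csv
from pathlib import Path


def check_sample_sheet(csv_path, required_columns, path_columns=()):
    """Return a list of human-readable problems found in a sample sheet.

    required_columns: columns that must exist and be non-empty in every row.
    path_columns: columns whose non-empty values must point to existing files.
    All column names are supplied by the caller (assumption: none of them
    are taken from the pipeline itself).
    """
    problems = []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        # Report missing columns once, even if a column is listed twice.
        wanted = list(dict.fromkeys([*required_columns, *path_columns]))
        missing = [col for col in wanted if col not in header]
        if missing:
            return [f"missing column(s): {', '.join(missing)}"]
        # Line 1 is the header, so data rows start at line 2.
        for line_no, row in enumerate(reader, start=2):
            for col in required_columns:
                if not (row[col] or "").strip():
                    problems.append(f"line {line_no}: '{col}' is empty")
            for col in path_columns:
                value = (row[col] or "").strip()
                if value and not Path(value).is_file():
                    problems.append(
                        f"line {line_no}: file not found for '{col}': {value}"
                    )
    return problems


# Hypothetical usage; replace the column names with those in your sheet:
# for problem in check_sample_sheet(
#     "config/sampleData.csv", ["sampleID"], ["assembly"]
# ):
#     print(problem)
```

An empty list means no problems were detected by these two simple rules; it does not guarantee that the pipeline will accept the file.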

TODO and planned features

add 16S rRNA gene truncation check
add automatic switches for kingdom-specific modes of some tools
automate CheckM2 and GTDB-Tk database downloads
add checks that the config file and the sample file are filled in correctly

Tools used in the pipeline and reasoning. Please cite these tools if you use this pipeline.

Taxonomy
- GTDB-Tk v2.4.0 - toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes. Used to get full genome taxonomic classification.
- Infernal v1.1.5 - RNA secondary structure/sequence profiles for homology search and alignment. Used to find and extract rRNA genes in the genomes.
- DECIPHER v2.30.0 - Tools for curating, analyzing, and manipulating biological sequences. Used to get 16S rRNA gene taxonomic classification by comparing to SILVA db.
- SILVA r138 - rRNA database. Used as the source of rRNA gene taxonomy.
Contamination and Completeness
- CheckM2 v1.0.1 - Assessing the quality of metagenome-derived genome bins using machine learning. Used to get completeness and contamination stats. Unlike CheckM1 (one of the most popular tools for completeness and contamination prediction), CheckM2 has universally trained machine learning models that it applies regardless of taxonomic lineage. This allows it to work better with organisms that have only a few known representative genomes.
tRNA gene occurrence
- tRNAscan-SE v2.0.12 - An improved tool for transfer RNA detection. Used to find tRNA genes in the genomes.
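
To illustrate how tRNA gene occurrence could be summarised from a tRNAscan-SE result, here is a minimal sketch that collects the distinct tRNA types reported in the tool's default tab-separated table. It assumes the usual layout (three header lines, tRNA type in the fifth column, an optional "pseudo" remark in the tenth column); the function name `count_trna_isotypes` is hypothetical and this is not necessarily how the pipeline evaluates the result.

```python
def count_trna_isotypes(trnascan_text, skip_pseudogenes=True):
    """Return the set of distinct tRNA type labels in a tRNAscan-SE table.

    Assumes the default tab-separated output: tRNA index in column 2,
    tRNA type in column 5, free-text note in column 10.
    """
    isotypes = set()
    for line in trnascan_text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        # Data rows carry a numeric tRNA index; the header lines do not.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        note = fields[9] if len(fields) > 9 else ""
        if skip_pseudogenes and "pseudo" in note:
            continue
        # "Undet" marks predictions without a determined type.
        if fields[4] != "Undet":
            isotypes.add(fields[4])
    return isotypes


# Small made-up table in the assumed layout.
example = "\n".join([
    "Sequence \ttRNA \tBounds\tof tRNA\ttRNA\tAnti\tIntron Bounds\tInf\t",
    "Name     \ttRNA #\tBegin\tEnd\tType\tCodon\tBegin\tEnd\tScore\tNote",
    "-------- \t------\t-----\t------\t----\t-----\t-----\t----\t------\t------",
    "contig_1 \t1\t100\t172\tAla\tTGC\t0\t0\t75.1\t",
    "contig_1 \t2\t500\t574\tGly\tGCC\t0\t0\t70.3\t",
    "contig_1 \t3\t900\t972\tAla\tGGC\t0\t0\t68.0\t",
    "contig_1 \t4\t1200\t1270\tLeu\tCAA\t0\t0\t25.4\tpseudo",
])
print(sorted(count_trna_isotypes(example)))  # ['Ala', 'Gly']
```

The size of the returned set can then be compared against whatever minimum the SeqCode requirements specify.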
General stats, file manipulation, alignment, and reporting
- seqkit v2.8.2 - ultrafast toolkit for FASTA/Q file manipulation. Used for quick and easy general stat gathering and sequence concatenation.
- minimap2 v2.28 - versatile pairwise aligner for genomic and spliced nucleotide sequences. Used to align sequencing reads to assembly to get coverage stats.
- samtools v1.20 - Tools for manipulating next-generation sequencing data. Used to calculate coverage stats.
- tidyverse v2.0.0 - R packages for data science. Used for general data manipulation for reporting.
- fs v1.6.4 - cross-platform file operations. Used for file manipulation for reporting.
- tinytable v0.4.0 - Simple and Customizable Tables. Used to generate the final report.
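
As an illustration of the coverage statistics obtained from the minimap2 and samtools steps listed above, the sketch below computes a length-weighted mean depth from the tabular output of `samtools coverage`. Whether the pipeline uses that particular subcommand is an assumption, and the function name `mean_depth_from_samtools_coverage` is hypothetical.

```python
import csv
import io


def mean_depth_from_samtools_coverage(table_text):
    """Length-weighted mean depth over all contigs.

    Expects the tab-separated table printed by `samtools coverage`,
    including its header line (#rname, startpos, endpos, ..., meandepth).
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    total_bases = 0
    weighted_depth = 0.0
    for row in reader:
        # Positions are 1-based and inclusive.
        length = int(row["endpos"]) - int(row["startpos"]) + 1
        total_bases += length
        weighted_depth += float(row["meandepth"]) * length
    if total_bases == 0:
        raise ValueError("no contigs found in coverage table")
    return weighted_depth / total_bases


# Made-up two-contig example in the expected column layout.
example_table = (
    "#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage"
    "\tmeandepth\tmeanbaseq\tmeanmapq\n"
    "contig_1\t1\t1000\t50\t1000\t100\t30.0\t35\t60\n"
    "contig_2\t1\t3000\t60\t2900\t96.6667\t10.0\t35\t60\n"
)
print(mean_depth_from_samtools_coverage(example_table))  # 15.0
```

Weighting by contig length keeps a short, deeply covered contig from inflating the genome-wide figure.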

Notes on the Test data:

data/GCF_000007305.1_ASM730v1_genomic.fna - This is the reference genome of Pyrococcus furiosus, which meets the SeqCode criteria. It was acquired from the RefSeq database.
data/GCA_015662175.1_ASM1566217v1_genomic.fna - This is the assembly of Thermococcus paralvinellae, which does not meet the SeqCode criteria. It was acquired from the GenBank database.
data/SRR8767914_subsampled.fastq.gz - This is a DNA-Seq dataset of Pyrococcus furiosus DSM 3638 that was subsampled for quicker testing via `zcat SRR8767914.fastq.gz | seqkit sample --rand-seed 42 -p 0.1 -o SRR8767914_subsampled.fastq.gz`.

Copyright Richard Stöckl 2024.
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE or copy at https://www.boost.org/LICENSE_1_0.txt)
