Snakemake Pipeline to check the requirements for a prokaryotic assembly to be included in the SeqCode initiative

richardstoeckl/QCforSeqCode

Snakemake workflow: QCforSeqCode

Author: Richard Stöckl

About

Snakemake Pipeline to check the requirements for a prokaryotic assembly to be included in the SeqCode initiative.

The requirements are outlined in APPENDIX I of the SeqCode.

Usage

See the usage instructions in the Snakemake workflow catalog.

But here is a rough overview:

  1. Install conda (mamba or miniconda is fine).
  2. Install snakemake with:
     conda install -c conda-forge -c bioconda snakemake
  3. Download the CheckM2 database (via wget https://zenodo.org/api/files/fd3bc532-cd84-4907-b078-2e05a1e46803/checkm2_database.tar.gz).
  4. Download the GTDB-Tk database (via wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz).
  5. Download the latest release from this repo and cd into it.
  6. Edit config/config.yaml to provide the paths to your results and logs directories and to the databases you downloaded, and adjust any parameters you want to change.
  7. Edit config/sampleData.csv with the specific details for each assembly you want to check. The pipeline automatically adjusts which steps it runs based on what you enter here.
  8. Open a terminal in the main directory and start a dry run of the pipeline with the following command. This will download and install all the dependencies for the pipeline (this step may take some time) and show you whether you set up the paths correctly:
     snakemake --sdm conda -n --cores
  9. Run the pipeline with:
     snakemake --sdm conda --cores
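
Before the dry run, it can help to confirm that the paths entered during setup actually exist. The following is a minimal sketch and not part of the pipeline: the labels and locations in `configured` are hypothetical placeholders, so substitute the values you put in config/config.yaml.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the labels whose path does not exist on disk."""
    return [label for label, location in paths.items() if not Path(location).exists()]


# Hypothetical labels and locations (assumptions for illustration only);
# replace them with the values from your own config/config.yaml.
configured = {
    "checkm2_db": "databases/CheckM2_database",
    "gtdbtk_db": "databases/gtdbtk_r220",
    "sample_sheet": "config/sampleData.csv",
}

for label in missing_paths(configured):
    print(f"missing: {label} -> {configured[label]}")
```

If nothing is printed, every listed path was found. The dry run remains the authoritative check, since it also resolves the workflow's rules.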

TODO and planned features

  • add 16S rRNA gene truncation check
  • add automatic switches for kingdom-specific modes of some tools
  • automate CheckM2 and GTDB-Tk database downloads
  • add checks that the config file and the sample file are filled in correctly

Tools used in the pipeline and reasoning. Please cite these tools if you use this pipeline.

  • Taxonomy
    • GTDB-Tk v2.4.0 - toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes. Used to get full genome taxonomic classification.
    • Infernal v1.1.5 - RNA secondary structure/sequence profiles for homology search and alignment. Used to find and extract rRNA genes in the genomes.
    • DECIPHER v2.30.0 - Tools for curating, analyzing, and manipulating biological sequences. Used to get 16S rRNA gene taxonomic classification by comparing to the SILVA database.
    • SILVA r138 - rRNA database. Used as the source of rRNA gene taxonomy.
  • Contamination and Completeness
    • CheckM2 v1.0.1 - Assessing the quality of metagenome-derived genome bins using machine learning. Used to get completeness and contamination stats. Unlike CheckM1 (one of the most popular tools for completeness and contamination prediction), CheckM2 applies universally trained machine learning models regardless of taxonomic lineage. This allows it to work better with organisms that have only a few known representative genomes.
  • tRNA gene occurrence
    • tRNAscan-SE v2.0.12 - An improved tool for transfer RNA detection. Used to find tRNA genes in the genomes.
  • General stats, file manipulation, alignment, and reporting
    • seqkit v2.8.2 - ultrafast toolkit for FASTA/Q file manipulation. Used for quick and easy general stat gathering and sequence concatenation.
    • minimap2 v2.28 - versatile pairwise aligner for genomic and spliced nucleotide sequences. Used to align sequencing reads to assembly to get coverage stats.
    • samtools v1.20 - Tools for manipulating next-generation sequencing data. Used to calculate coverage stats.
    • tidyverse v2.0.0 - R packages for data science. Used for general data manipulation for reporting.
    • fs v1.6.4 - cross-platform file operations. Used for file manipulation for reporting.
    • tinytable v0.4.0 - Simple and Customizable Tables. Used to generate the final report.
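
To illustrate what a coverage statistic from the minimap2 and samtools steps can look like, here is a small sketch that averages per-position depth for each contig. It assumes input in the three-column, tab-separated format written by `samtools depth -a` (contig, position, depth). This README does not say which samtools subcommand the pipeline uses, so treat this as one possible approach rather than the pipeline's own method.

```python
import io
from collections import defaultdict


def mean_depth_per_contig(lines):
    """Average the depth column per contig from `samtools depth -a` style rows."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        contig, _position, depth = line.split("\t")[:3]
        totals[contig] += int(depth)
        counts[contig] += 1
    return {contig: totals[contig] / counts[contig] for contig in totals}


# Made-up rows for illustration: contig name, 1-based position, depth.
example = "contig_1\t1\t10\ncontig_1\t2\t14\ncontig_2\t1\t3\n"
print(mean_depth_per_contig(io.StringIO(example)))
# → {'contig_1': 12.0, 'contig_2': 3.0}
```

The `-a` flag matters here: without it, `samtools depth` omits positions with zero depth, and the mean would be overestimated.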

Notes on the Test data:

  • data/GCF_000007305.1_ASM730v1_genomic.fna - This is the reference genome of Pyrococcus furiosus, which meets the SeqCode criteria. It was acquired from the RefSeq database.
  • data/GCA_015662175.1_ASM1566217v1_genomic.fna - This is the assembly of Thermococcus paralvinellae, which does not meet the SeqCode criteria. It was acquired from the GenBank database.
  • data/SRR8767914_subsampled.fastq.gz - This is a DNA-Seq dataset of Pyrococcus furiosus DSM 3638 that was subsampled for quicker testing via zcat SRR8767914.fastq.gz | seqkit sample --rand-seed 42 -p 0.1 -o SRR8767914_subsampled.fastq.gz.
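
The subsampling command above keeps roughly 10% of the reads and fixes the random seed so the selection is repeatable. The sketch below shows the general idea of seeded sampling by proportion. It is a conceptual illustration, not seqkit's implementation, and Python's random number generator will not select the same reads as seqkit does. The read names are made up.

```python
import random


def subsample(records, proportion, seed):
    """Keep each record independently with the given probability, using a fixed seed."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < proportion]


# Hypothetical read identifiers standing in for FASTQ records.
reads = [f"read_{i}" for i in range(1000)]

first = subsample(reads, 0.1, seed=42)
second = subsample(reads, 0.1, seed=42)

# The same seed yields the same selection; the count is close to, but not exactly, 10%.
print(len(first), first == second)
```

Because each read is kept independently, the size of the subsample varies slightly around the requested proportion, while the fixed seed makes the result reproducible.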

Copyright Richard Stöckl 2024.
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE or copy at https://www.boost.org/LICENSE_1_0.txt)
