prepare shotgun databases

Building shotgun databases

- Overview
- Host genome(s)
  - Human
- Kraken2
- Bracken

Overview

Before running the steps below, check whether the shotgun databases you require have already been prepared on the cluster.
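As a quick way of checking for an existing Kraken2 database, the sketch below tests a folder for the three files a built Kraken2 database contains (hash.k2d, opts.k2d, and taxo.k2d). The function name kraken2_db_ready is made up for illustration, and the check only confirms the files are present and non-empty, not that the database contains the libraries you need.

```bash
#!/bin/bash

# Sketch: report whether a folder looks like a built Kraken2 database.
# kraken2_db_ready is a hypothetical helper name, not part of Kraken2.
kraken2_db_ready() {
    local db_dir="$1"
    local f
    for f in hash.k2d opts.k2d taxo.k2d; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "$db_dir/$f" ]; then
            echo "Missing or empty: $db_dir/$f" >&2
            return 1
        fi
    done
    echo "Kraken2 database found in $db_dir"
}
```

For example, kraken2_db_ready /path/to/kraken2_database returns 0 when all three files are present, so it can be used in an if statement before starting a new build.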

You require the following databases (at a minimum) to run the Sunbeam pipeline and assign taxonomy:

- Host genome(s) for decontamination
- Kraken2 databases for taxonomy
- Bracken databases (built from the Kraken2 database)

The recommended smux parameters for database preparation are:

smux n --time=7-00:00:00 --mem=32GB --cpuspertask=2 --ntasks=1 -J Build-Databases

Host genome(s)

We require the host genomes to remove host reads before metagenomics analysis. Of note, these need to be located in a separate folder, be decompressed, and be of file type .fasta.
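The requirements above can be checked with a short script. This is a minimal sketch: check_host_genome_dir is a hypothetical helper name, and it only looks at the file extension and the first character of each file, so it will not catch every malformed FASTA.

```bash
#!/bin/bash

# Sketch: flag files in a host-genome folder that are not uncompressed .fasta files.
# check_host_genome_dir is a hypothetical helper name used for illustration.
check_host_genome_dir() {
    local dir="$1"
    local bad=0
    local f
    if [ ! -d "$dir" ]; then
        echo "Not a directory: $dir" >&2
        return 2
    fi
    for f in "$dir"/*; do
        [ -e "$f" ] || continue   # empty folder: the glob did not match anything
        case "$f" in
            *.fasta)
                # An uncompressed FASTA file starts with a '>' header line
                if [ "$(head -c 1 "$f")" != ">" ]; then
                    echo "Does not look like plain FASTA: $f" >&2
                    bad=1
                fi
                ;;
            *)
                echo "Unexpected file (not .fasta): $f" >&2
                bad=1
                ;;
        esac
    done
    return "$bad"
}
```

Running check_host_genome_dir on the host genome folder prints one line per offending file and returns a non-zero status if any were found.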

Human

For shotgun metagenomics data of human-derived samples, we will combine two genomes to ensure maximum removal of human genetic material.

- CHM13: The Telomere-to-Telomere (T2T) consortium CHM13 genome is the assembly obtained by sequencing the CHM13hTERT human cell line with multiple technologies. The sequencing data included 30x PacBio HiFi, 120x Oxford Nanopore, 70x PacBio CLR, 50x 10X Genomics, as well as BioNano DLS and Arima Genomics HiC. Its use was highlighted in a 2023 article by Gihawi et al. on the importance of host decontamination in shotgun microbiome data, an analysis that led to the retraction of a 2020 Nature publication.
- GRCh38 (1000 genomes): The GRCh38 reference genome as distributed by the 1000 Genomes Project (full analysis set plus decoy and HLA sequences).

The image below provides a quick look at the CHM13 genome vs. GRCh38, but good news – the Y chromosome has been included with the CHM13 genome now!

To prepare a combined FASTA file from these two genomes, save the following code to a bash script called prepare_human_genome.sh and start it running.

#!/bin/bash

# Define the output folder
HUMANGENOME_DIR="/home/mmacowan/mf33/Databases/shotgun/human"
mkdir -p "$HUMANGENOME_DIR"
cd "$HUMANGENOME_DIR" || exit 1

# STEP 1: Download and decompress the CHM13 human genome
echo "Downloading the CHM13 human genome..."
wget -c https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz
gzip -d chm13v2.0.fa.gz

# STEP 2: Download the 1000 Genomes human genome (already uncompressed)
echo "Downloading the 1000 Genomes human genome..."
wget -c http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

# STEP 3: Concatenate the two files and clean up
echo "Combining the two genomes into a single FASTA file..."
cat chm13v2.0.fa GRCh38_full_analysis_set_plus_decoy_hla.fa > chm13v2.0_GRCh38_full_plus_decoy.fasta
rm -f chm13v2.0.fa GRCh38_full_analysis_set_plus_decoy_hla.fa
echo "Finished preparing the human genome."
cd ..
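Because the script deletes the two source files once they are concatenated, it can be worth confirming the combined file is complete first. The sketch below compares the number of FASTA header lines in the combined file with the total across the source files. The function names are hypothetical, and the check would need to run before the rm step. It also catches the case where a source file lacks a trailing newline, which would fuse the next file's first header onto a sequence line.

```bash
#!/bin/bash

# Sketch: confirm a concatenated FASTA has as many records as its source files.
# Both function names are hypothetical helpers, not part of any pipeline.
count_fasta_records() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    grep -c '^>' "$1" || true
}

verify_combined_fasta() {
    local combined="$1"
    shift
    local expected=0 actual src
    for src in "$combined" "$@"; do
        if [ ! -f "$src" ]; then
            echo "File not found: $src" >&2
            return 2
        fi
    done
    # Sum the header counts of every source file
    for src in "$@"; do
        expected=$((expected + $(count_fasta_records "$src")))
    done
    actual=$(count_fasta_records "$combined")
    if [ "$actual" -ne "$expected" ]; then
        echo "Header mismatch: expected $expected, found $actual" >&2
        return 1
    fi
    echo "OK: $actual sequences in $combined"
}
```

In the script above this could be called as verify_combined_fasta chm13v2.0_GRCh38_full_plus_decoy.fasta chm13v2.0.fa GRCh38_full_analysis_set_plus_decoy_hla.fa, with the clean-up only running if it succeeds.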

Kraken2

Kraken2 will be used to assign taxonomy – we will go in-depth here and include archaeal, fungal, and viral reference libraries too. Another thing pointed out by Gihawi et al. in their 2023 paper was the importance of including host genomes in the kraken2 database. The rationale behind this is that even if some host genetic material remains in the data at the time of taxonomic assignment, it should be caught and assigned to the host instead of incorrectly annotated as a bacterial read.

As such, we will include human, mouse, and rat libraries into the kraken2 database. Even if you think you will just use human-derived samples, there's no harm in including the mouse and rat libraries just in case.
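To see how much residual host material was caught at the classification step, the percentage of reads assigned to a host taxon can be read from a Kraken2 report. The sketch below assumes the standard six-column, tab-separated report written by kraken2 --report (percentage, clade reads, direct reads, rank code, NCBI taxonomy ID, name). The function name host_read_percent is hypothetical. The NCBI taxonomy IDs are 9606 for human, 10090 for mouse, and 10116 for rat.

```bash
#!/bin/bash

# Sketch: print the percentage of reads assigned to a host clade in a Kraken2 report.
# host_read_percent is a hypothetical helper; it assumes the standard 6-column report.
host_read_percent() {
    local report="$1"
    local taxid="${2:-9606}"   # default: Homo sapiens
    awk -F'\t' -v id="$taxid" '
        $5 == id {
            gsub(/ /, "", $1)  # column 1 is padded with leading spaces
            print $1
            found = 1
            exit
        }
        END { if (!found) print "0.00" }
    ' "$report"
}
```

For example, host_read_percent sample.report 10090 would print the percentage of reads assigned to mouse, or 0.00 if the taxon does not appear in the report (sample.report is a placeholder file name).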

Prepare and run the following script, prepare_kraken2db.sh, to generate your database. Ensure you alter the KRAKEN2_DIR and KRAKEN2_DB variables to whatever you require.

#!/bin/bash

# Set the directory for Kraken2 installation and database
KRAKEN2_DIR="/home/mmacowan/mf33/Databases/shotgun/kraken2"
KRAKEN2_DB="/home/mmacowan/mf33/Databases/shotgun/kraken2_database"
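KRAKEN2_BUILD="$KRAKEN2_DIR/kraken2-build"  # install_kraken2.sh writes the configured scripts into its target directory

# Step 1: Install Kraken2
echo "Cloning the Kraken2 repository into $KRAKEN2_DIR..."
git clone https://github.com/DerrickWood/kraken2 "$KRAKEN2_DIR"
cd "$KRAKEN2_DIR" || exit 1
echo "Installing Kraken2..."
# install_kraken2.sh requires the destination directory as its only argument;
# here Kraken2 is installed into the cloned repository folder itself
bash install_kraken2.sh "$KRAKEN2_DIR"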

# Verify the installation and correct path to kraken2-build
if [ ! -f "$KRAKEN2_BUILD" ]; then
    echo "Error: kraken2-build not found at $KRAKEN2_BUILD."
    echo "Please check the installation path and ensure Kraken2 is installed correctly."
    exit 1
fi

# Step 2: Create a new database and download the taxonomy
echo "Creating Kraken2 database at $KRAKEN2_DB and downloading taxonomy..."
"$KRAKEN2_BUILD" --download-taxonomy --db "$KRAKEN2_DB"

# Step 3: Load the BLAST module (necessary for low-complexity sequence masking)
echo "Loading BLAST module..."
module load blast

# Step 4: Download standard reference libraries
echo "Downloading reference libraries: archaea, bacteria, fungi, viral..."
"$KRAKEN2_BUILD" --download-library archaea --db "$KRAKEN2_DB"
"$KRAKEN2_BUILD" --download-library bacteria --db "$KRAKEN2_DB"
"$KRAKEN2_BUILD" --download-library fungi --db "$KRAKEN2_DB"
"$KRAKEN2_BUILD" --download-library viral --db "$KRAKEN2_DB"

# Step 5: Download the human library so residual host reads are assigned to the host
echo "Downloading the human genome reference..."
"$KRAKEN2_BUILD" --download-library human --db "$KRAKEN2_DB"

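# Step 6: Download mouse and rat genomes for future taxonomy assignment
# NOTE: "mouse" and "rat" may not be accepted by --download-library in every Kraken2
# release. If they are rejected, download the genomes separately and add them to the
# database with --add-to-library.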
echo "Downloading mouse genome reference..."
"$KRAKEN2_BUILD" --download-library mouse --db "$KRAKEN2_DB"

echo "Downloading rat genome reference..."
"$KRAKEN2_BUILD" --download-library rat --db "$KRAKEN2_DB"

# Step 7: Build the Kraken2 database
echo "Building the Kraken2 database. This may take some time..."
"$KRAKEN2_BUILD" --build --db "$KRAKEN2_DB"

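# Step 8: Clean up intermediate files (optional but recommended)
# NOTE: --clean removes the downloaded library sequences. bracken-build reads those
# sequences, so if you plan to build the Bracken files, consider running that step
# before cleaning.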
echo "Cleaning up intermediate files..."
"$KRAKEN2_BUILD" --clean --db "$KRAKEN2_DB"

echo "Kraken2 database build completed successfully."

Bracken

Bracken re-estimates species-level abundances from the Kraken2 classifications. Prepare and run the following script, prepare_brackendb.sh, ensuring you have set the directory variables to the right values.

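#!/bin/bash

# Set the directories for the Bracken installation, the Kraken2 installation,
# and the built Kraken2 database (bracken-build reads from and writes to this database)
BRACKEN_DIR="/home/mmacowan/mf33/Databases/shotgun/bracken"
KRAKEN2_DIR="/home/mmacowan/mf33/Databases/shotgun/kraken2"
KRAKEN2_DB="/home/mmacowan/mf33/Databases/shotgun/kraken2_database"

# Install Bracken
git clone https://github.com/jenniferlu717/Bracken "$BRACKEN_DIR"
cd "$BRACKEN_DIR" || exit 1
bash install_bracken.sh

# Build the Bracken files with 2 threads (-t); -d is the built Kraken2 database
# and -x is the folder containing the Kraken2 executables
"$BRACKEN_DIR/bracken-build" -d "$KRAKEN2_DB" -t 2 -x "$KRAKEN2_DIR"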
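Once the build finishes, a quick check is to look for the k-mer distribution file that bracken-build writes into the database folder passed to -d. The file is named database<READ_LEN>mers.kmer_distrib, and the read length defaults to 100 when -l is not given. The sketch below uses a hypothetical helper name, bracken_db_ready, and only confirms the file exists and is non-empty.

```bash
#!/bin/bash

# Sketch: check that the Bracken k-mer distribution file exists for a given read length.
# bracken_db_ready is a hypothetical helper name used for illustration.
bracken_db_ready() {
    local db_dir="$1"
    local read_len="${2:-100}"   # bracken-build uses a read length of 100 by default
    local distrib="$db_dir/database${read_len}mers.kmer_distrib"
    if [ ! -s "$distrib" ]; then
        echo "Missing or empty: $distrib" >&2
        return 1
    fi
    echo "Bracken files found for read length $read_len in $db_dir"
}
```

If your reads are a different length, for example 150 bp, bracken-build would need to be run with -l 150 and the check called as bracken_db_ready /path/to/database 150.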
