hbctraining · Gammerdinger · Apr 25, 2025
diff --git a/lessons/03_sequence_alignment_theory.md b/lessons/03_sequence_alignment_theory.md
@@ -200,8 +200,8 @@ Next, we need to **add the modules** that we will be using for alignment:
 
 ```
 # Load modules
-module load gcc/6.2.0
-module load bwa/0.7.17
+module load gcc/14.2.0
+module load bwa/0.7.18
 ```
 
 > NOTE: On O2, many of the common tools were compiled using `GCC` version 6.2.0, so to be able to access them, we first need to load the `GCC` module.
@@ -215,7 +215,7 @@ bwa mem \
     -M \
     -t 8 \
     -R "@RG\tID:syn3_normal\tPL:illumina\tPU:$SAMPLE\tSM:syn3_normal" \
-    /n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa \
+    /n/groups/hbctraining/variant_calling/reference/GRCh38.fa \
     ~/variant_calling/raw_data/syn3_normal_1.fq.gz \
     ~/variant_calling/raw_data/syn3_normal_2.fq.gz \
     -o /n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam
@@ -233,11 +233,11 @@ Another advantage of using `bash` variables in this way is that it can reduce ty
 
 ```
 # Assign files to bash variables
-REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
+REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
 LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_normal_1.fq.gz
 RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz`
 SAMPLE=`basename $LEFT_READS _1.fq.gz`
-SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam
+SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.sam
 ```
 
 > **NOTE:** `$RIGHT_READS` uses some `bash` string manipulation in order to swap the last parts of their filename. We also use `basename` to parse out the path from a file and when coupled with an argument after the filename, it will trim the end of the filename as well as we can see with the `$SAMPLE` variable.
@@ -270,14 +270,14 @@ bwa mem \
 #SBATCH -o bwa_alignment_normal_%j.out
 #SBATCH -e bwa_alignment_normal_%j.err<br>
 # Load modules
-module load gcc/6.2.0
-module load bwa/0.7.17<br>
+module load gcc/14.2.0
+module load bwa/0.7.18<br>
 # Assign files to bash variables
-REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
+REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
 LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_normal_1.fq.gz
 RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz`
 SAMPLE=`basename $LEFT_READS _1.fq.gz`
-SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam<br>
+SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.sam<br>
 # Align reads with bwa
 bwa mem \
     -M \
@@ -319,14 +319,14 @@ $ sed 's/normal/tumor/g' bwa_alignment_normal.sbatch >  bwa_alignment_tumor.sbat
 #SBATCH -o bwa_alignment_tumor_%j.out
 #SBATCH -e bwa_alignment_tumor_%j.err<br>
 # Load modules
-module load gcc/6.2.0
-module load bwa/0.7.17<br>
+module load gcc/14.2.0
+module load bwa/0.7.18<br>
 # Assign files to bash variables
-REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
+REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
 LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_tumor_1.fq.gz
 RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz`
 SAMPLE=`basename $LEFT_READS _1.fq.gz`
-SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam<br>
+SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.sam<br>
 # Align reads with bwa
 bwa mem \
     -M \
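
The variable assignments in the hunks above rely on suffix removal and `basename` to derive the second read file, the sample name, and the output path from a single input path. A minimal sketch of that derivation follows. It assumes a hypothetical user name `jdoe` stored in an illustrative `DEMO_USER` variable (not part of the lesson), and it uses `cut -c1` as a portable stand-in for the `${USER:0:1}` substring expansion used in the lesson. No files need to exist for it to run.

```shell
# Hypothetical values for illustration only; DEMO_USER is not part of the lesson.
DEMO_USER="jdoe"
LEFT_READS="/home/${DEMO_USER}/variant_calling/raw_data/syn3_normal_1.fq.gz"

# ${VAR%pattern} removes the shortest match of pattern from the end of the value,
# so "1.fq.gz" is dropped and "2.fq.gz" is appended in its place.
RIGHT_READS="${LEFT_READS%1.fq.gz}2.fq.gz"

# basename strips the directory; the optional second argument is a suffix to trim.
SAMPLE="$(basename "$LEFT_READS" _1.fq.gz)"

# First letter of the user name; in bash this is what ${USER:0:1} returns.
FIRST_LETTER="$(printf '%s' "$DEMO_USER" | cut -c1)"
SAM_FILE="/n/scratch/users/${FIRST_LETTER}/${DEMO_USER}/variant_calling/alignments/${SAMPLE}_GRCh38.sam"

echo "$RIGHT_READS"
echo "$SAMPLE"
echo "$SAM_FILE"
```

Because every path is derived from `LEFT_READS`, switching samples only requires changing that one line, which is what the `sed 's/normal/tumor/g'` step relies on.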

diff --git a/lessons/04_alignment_file_processing.md b/lessons/04_alignment_file_processing.md
diff --git a/lessons/05_alignment_QC.md b/lessons/05_alignment_QC.md
@@ -8,8 +8,8 @@ Approximate time: 30 minutes
 
 ## Learning Objectives
 
-- Verify alignment rates using `Picard`
-- Merge `Picard` QC metrics with `FastQC` metrics using `MultiQC`
+- Verify alignment rates using `GATK`/`Picard`
+- Merge `GATK`/`Picard` QC metrics with `FastQC` metrics using `MultiQC`
 
 ## Collecting Alignment Statistics
 
@@ -19,7 +19,7 @@ The next step of QC is where we need to evaluate the quality of the alignments.
 <img src="../img/Alignment_QC.png" width="800">
 </p>
 
-We are going to use `Picard` once again in order to collect our alignment statistics. `Picard` has many packages for collecting different types of data, but the one we will be using is [`CollectAlignmentSummaryMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360040507751-CollectAlignmentSummaryMetrics-Picard). This tool takes a **SAM/BAM file input** and **produces metrics** (in a tab delimited `.txt` file) detailing the quality of the read alignments. _Note that these quality filters are specific to Illumina data._  
+We are going to use `GATK`/`Picard` once again in order to collect our alignment statistics. `GATK`/`Picard` has many packages for collecting different types of data, but the one we will be using is [`CollectAlignmentSummaryMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360040507751-CollectAlignmentSummaryMetrics-Picard). This tool takes a **SAM/BAM file input** and **produces metrics** (in a tab delimited `.txt` file) detailing the quality of the read alignments. _Note that these quality filters are specific to Illumina data._  
 
 Some examples of metrics reported include (but, are not limited to):
 
@@ -44,55 +44,55 @@ Let's start creating an `sbatch` script for collecting metrics:
 
 ```
 cd ~/variant_calling/scripts/
-vim picard_metrics_normal.sbatch
+vim gatk_metrics_normal.sbatch
 ```
 
 First, we need to add our shebang line, description and `sbatch` directives to the script:
 
 ```
 #!/bin/bash
-# This sbatch script is for collecting alignment metrics using Picard 
+# This sbatch script is for collecting alignment metrics using GATK
 
 # Assign sbatch directives
 #SBATCH -p priority
 #SBATCH -t 0-00:30:00
 #SBATCH -c 1
 #SBATCH --mem 16G
-#SBATCH -o picard_metrics_normal_%j.out
-#SBATCH -e picard_metrics_normal_%j.err
+#SBATCH -o gatk_metrics_normal_%j.out
+#SBATCH -e gatk_metrics_normal_%j.err
 ```
 
-Next, we need to load `Picard`:
+Next, we need to load `GATK`:
 
 ```
-# Load picard
-module load picard/2.27.5
+# Load GATK
+module load gatk/4.6.1.0
 ```
 
 Next, let's assign our files to variables:
 
 ```
 # Assign variables
-INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam
-REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
-OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt
+INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam
+REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
+OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
 ```
 
-Lastly, we can add the `Picard` command to gather the alignment metrics. 
+Lastly, we can add the `GATK`/`Picard` command to gather the alignment metrics. 
 
 ```
-# Run Picard CollectAlignmentSummaryMetrics
-java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
+# Run GATK CollectAlignmentSummaryMetrics
+gatk CollectAlignmentSummaryMetrics \
   --INPUT $INPUT_BAM \
   --REFERENCE_SEQUENCE $REFERENCE \
   --OUTPUT $OUTPUT_METRICS_FILE
 ```
 
 We can break down this command into each of its components:
 
-* `java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics` Calls the `CollectAlignmentSummaryMetrics` package from within `Picard`
-* `--INPUT $INPUT_BAM` This is the output BAM file from our previous `Picard` alignment processing steps.
-* `--REFERENCE_SEQUENCE $REFERENCE` This isn't a required parameter, but `Picard` can do a subset of mismatch-related metrics if this is provided.
+* `gatk CollectAlignmentSummaryMetrics` Calls the `CollectAlignmentSummaryMetrics` package from within `GATK`/`Picard`
+* `--INPUT $INPUT_BAM` This is the output BAM file from our previous `GATK`/`Picard` alignment processing steps.
+* `--REFERENCE_SEQUENCE $REFERENCE` This isn't a required parameter, but `GATK`/`Picard` can do a subset of mismatch-related metrics if this is provided.
 * `--OUTPUT $OUTPUT_METRICS_FILE` This is the file to write the output metrics to.
 
 
@@ -102,22 +102,22 @@ Now this script is all set to run! **Go ahead and save and quit.**
   <summary><b>Click here to see what our final <code>sbatch</code>code script for collecting the normal sample alignment metrics should look like</b></summary> 
   <pre>
 #!/bin/bash
-# This sbatch script is for collecting alignment metrics using Picard<br>
+# This sbatch script is for collecting alignment metrics using GATK<br>
 # Assign sbatch directives
 #SBATCH -p priority
 #SBATCH -t 0-00:30:00
 #SBATCH -c 1
 #SBATCH --mem 16G
-#SBATCH -o picard_metrics_normal_%j.out
-#SBATCH -e picard_metrics_normal_%j.err<br>
-# Load picard
-module load picard/2.27.5<br>
+#SBATCH -o gatk_metrics_normal_%j.out
+#SBATCH -e gatk_metrics_normal_%j.err<br>
+# Load GATK
+module load gatk/4.6.1.0<br>
 # Assign variables
-INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam
-REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
-OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt<br>
-# Run Picard CollectAlignmentSummaryMetrics
-java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
+INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam
+REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
+OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt<br>
+# Run GATK CollectAlignmentSummaryMetrics
+gatk CollectAlignmentSummaryMetrics \
   --INPUT $INPUT_BAM \
   --REFERENCE_SEQUENCE $REFERENCE \
   --OUTPUT $OUTPUT_METRICS_FILE
@@ -129,29 +129,29 @@ java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
 Now we will want to **create the tumor version of this submission script using `sed`** (as we have done previously):
 
 ```
-sed 's/normal/tumor/g' picard_metrics_normal.sbatch > picard_metrics_tumor.sbatch
+sed 's/normal/tumor/g' gatk_metrics_normal.sbatch > gatk_metrics_tumor.sbatch
 ```
 
 <details>
   <summary><b>Click here to see what our final <code>sbatch</code>code script for collecting the tumor sample alignment metrics should look like</b></summary> 
   <pre>
 #!/bin/bash
-# This sbatch script is for collecting alignment metrics using Picard<br>
+# This sbatch script is for collecting alignment metrics using GATK<br>
 # Assign sbatch directives
 #SBATCH -p priority
 #SBATCH -t 0-00:30:00
 #SBATCH -c 1
 #SBATCH --mem 16G
-#SBATCH -o picard_metrics_tumor_%j.out
-#SBATCH -e picard_metrics_tumor_%j.err<br>
-# Load picard
-module load picard/2.27.5<br>
+#SBATCH -o gatk_metrics_tumor_%j.out
+#SBATCH -e gatk_metrics_tumor_%j.err<br>
+# Load GATK
+module load gatk/4.6.1.0<br>
 # Assign variables
-INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.p7.coordinate_sorted.bam
-REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
-OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.p7.CollectAlignmentSummaryMetrics.txt<br>
-# Run Picard CollectAlignmentSummaryMetrics
-java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
+INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.coordinate_sorted.bam
+REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
+OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_tumor/syn3_tumor_GRCh38.CollectAlignmentSummaryMetrics.txt<br>
+# Run GATK CollectAlignmentSummaryMetrics
+gatk CollectAlignmentSummaryMetrics \
   --INPUT $INPUT_BAM \
   --REFERENCE_SEQUENCE $REFERENCE \
   --OUTPUT $OUTPUT_METRICS_FILE
@@ -166,17 +166,17 @@ Before we submit our jobs, let's **check the status of our previous `Picard` ali
 squeue --me
 ```
 
-* **If your `Picard` alignment processing steps are completed**, and you have the required input files then you can submit these jobs to collect alignment metrics:
+* **If your `GATK`/`Picard` alignment processing steps are completed**, and you have the required input files then you can submit these jobs to collect alignment metrics:
 
 ```bash
-sbatch picard_metrics_normal.sbatch
-sbatch picard_metrics_tumor.sbatch
+sbatch gatk_metrics_normal.sbatch
+sbatch gatk_metrics_tumor.sbatch
 ```
 > **NOTE:** Each of these scripts should only take about 15 minutes to run.  
 
 ## Collecting Coverage Metrics
 
-Coverage is the average level of alignment for any random locus in the genome.  `Picard` also has a package called [`CollectWgsMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard) which is also very nice for collecting data about coverage for alignments. However, **since our data set is whole exome sequencing rather than whole genome sequencing** and thus only compromises about 1-2% of the human genome, average **coverage across the whole genome is not a very useful metric**. However, if one did have whole genome data, then running `CollectWgsMetrics` would be useful. As such, in the dropdown box below we provide the code that you could use to collect this information.
+Coverage is the average level of alignment for any random locus in the genome. `GATK`/`Picard` also has a package called [`CollectWgsMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard), which is useful for collecting coverage data for alignments. However, **since our data set is whole exome sequencing rather than whole genome sequencing** and thus comprises only about 1-2% of the human genome, average **coverage across the whole genome is not a very useful metric**. If one did have whole genome data, then running `CollectWgsMetrics` would be useful, so in the dropdown box below we provide the code that you could use to collect this information.
 
 <p align="center">
 <img src="../img/coverge.png" width="800">
@@ -185,21 +185,21 @@ Coverage is the average level of alignment for any random locus in the genome.
 _Image source: [Coverage analysis from the command line](https://medium.com/ngs-sh/coverage-analysis-from-the-command-line-542ef3545e2c)_
 
 <details>
-<summary><b>Click here to find out more on collecting coverage metrics for WGS datasets in <code>Picard</code></b></summary>
-<br>The tool in <code>Picard</code> used for collecting coverage metrics for WGS datasets is called <code>CollectWgsMetrics</code>. The code used to run <code>CollectWgsMetrics</code> can be found below.<br><br>
+<summary><b>Click here to find out more on collecting coverage metrics for WGS datasets in <code>GATK</code>/<code>Picard</code></b></summary>
+<br>The tool in <code>GATK</code>/<code>Picard</code> used for collecting coverage metrics for WGS datasets is called <code>CollectWgsMetrics</code>. The code used to run <code>CollectWgsMetrics</code> can be found below.<br><br>
   <pre>
   # Assign paths to bash variables
   COORDINATE_SORTED_BAM_FILE=/path/to/sample.coordinate_sorted.bam
   METRICS_OUTPUT_FILE=/home/$USER/variant_calling/reports/picard/sample.CollectWgsMetrics.txt
-  $REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa<br>
+  REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa<br>
   # Run GATK CollectWgsMetrics
-  java -jar $PICARD/picard.jar CollectWgsMetrics \
+  gatk CollectWgsMetrics \
     --INPUT $COORDINATE_SORTED_BAM_FILE \
     --OUTPUT $METRICS_OUTPUT_FILE \
     --REFERENCE_SEQUENCE $REFERENCE
   </pre>
 
-  <ul><li><code>java -jar $PICARD/picard.jar CollectWgsMetrics</code> This calls the <code>CollectWgsMetrics</code> package within <code>Picard</code></li>
+  <ul><li><code>gatk CollectWgsMetrics</code> This calls the <code>CollectWgsMetrics</code> package within <code>GATK</code>/<code>Picard</code></li>
   <li><code>--INPUT $COORDINATE_SORTED_BAM_FILE</code> This is the input coordinate-sorted BAM file</li>
   <li><code>--OUTPUT $METRICS_OUTPUT_FILE</code> This is the output report file </li>
   <li><code>--REFERENCE_SEQUENCE $REFERENCE</code> This is the path to the reference genome that was used for the alignment.</li></ul>
@@ -208,11 +208,11 @@ _Image source: [Coverage analysis from the command line](https://medium.com/ngs-
 
 ## Factors Impacting Alignment
 
-While we mentioned above the various metrics that are computed as part of the Picard command, one of the **most important metrics for your alignment file is the alignment rate**. When aligning high-quality reads to a high quality reference genome, **one should expect to see alignment rates at 90% or better**. If alignment rates dipped below 80-85%, then there could be reason for further inspection. 
+While we mentioned above the various metrics that are computed as part of the GATK/Picard command, one of the **most important metrics for your alignment file is the alignment rate**. When aligning high-quality reads to a high-quality reference genome, **one should expect to see alignment rates at 90% or better**. If alignment rates dip below 80-85%, then there could be reason for further inspection.
 
 Alignment rates can vary based upon many factors, including:
 
-- **Quality of reference assembly** - A high-quality assembly like GRCh38.p7 will provide an excellent reference genome for alignment. However, if you were studying a organism with a poorly assembled genome, parts of the reference genome could be missing from the assembly. Therefore, high-quality reads might not align because they there is missing reference sequence to align to that corresponds to their sequence.
+- **Quality of reference assembly** - A high-quality assembly like GRCh38 will provide an excellent reference genome for alignment. However, if you were studying an organism with a poorly assembled genome, parts of the reference genome could be missing from the assembly. Therefore, high-quality reads might not align because the reference sequence that corresponds to them is missing.
 - **Quality of libraries** - If the library generation was poor and there wasn't enough input DNA, then your sequencing could be filled with low-quality reads
 - **Quality of the reads** - If the reads are poor quality, then it can make alignment more uncertain. If your `FASTQC` report shows any anomalous signs, contact your sequencing center for support.
 - **Contamination** - If your samples are contaminated, then it can also skew your alignment. For example, if your samples were heavily contaminated with some bacteria, then much of what you will sequence will be bacterial DNA and not your sample DNA. As a result, most of the sequence reads will not align to your target sequence. If you suspect contamination might be the source of a poor alignment, you could consider running [Kraken](https://ccb.jhu.edu/software/kraken/) to evaluate the levels of contamination in your samples.
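
The alignment-rate guideline above can be checked directly against a `CollectAlignmentSummaryMetrics` report. The sketch below is illustrative only. It builds a made-up, three-column miniature of the metrics table (a real report has many more columns and header lines), looks up the `PCT_PF_READS_ALIGNED` column by name, and compares the `PAIR` row to a 0.90 threshold. It assumes the value is stored as a fraction rather than a percentage, and the threshold and the `STATUS` labels are hypothetical choices for this example.

```shell
# Made-up miniature metrics table for illustration; not real results.
METRICS_FILE="$(mktemp)"
{
  printf '## METRICS CLASS\tpicard.analysis.AlignmentSummaryMetrics\n'
  printf 'CATEGORY\tTOTAL_READS\tPCT_PF_READS_ALIGNED\n'
  printf 'FIRST_OF_PAIR\t1000\t0.95\n'
  printf 'SECOND_OF_PAIR\t1000\t0.93\n'
  printf 'PAIR\t2000\t0.94\n'
} > "$METRICS_FILE"

# Skip comment and blank lines, find the column by its header name,
# then print that column for the PAIR row.
ALIGNMENT_RATE="$(awk -F '\t' '
  /^#/ || NF == 0 { next }
  !col { for (i = 1; i <= NF; i++) if ($i == "PCT_PF_READS_ALIGNED") col = i; next }
  $1 == "PAIR" { print $col }
' "$METRICS_FILE")"

# Hypothetical threshold: flag anything below a 90% alignment rate.
if awk -v rate="$ALIGNMENT_RATE" 'BEGIN { exit !(rate >= 0.90) }'; then
  STATUS="ok"
else
  STATUS="inspect"
fi

echo "Alignment rate: ${ALIGNMENT_RATE} (${STATUS})"
rm -f "$METRICS_FILE"
```

To try this on a real report, point `METRICS_FILE` at the path stored in `$OUTPUT_METRICS_FILE` and drop the lines that create and remove the temporary file.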