From d48e8575fa6cdac0c0806568417f7ef016ea1cab Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Fri, 25 Apr 2025 16:29:14 -0400
Subject: [PATCH 01/22] Changed version numbers

---
 lessons/03_sequence_alignment_theory.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/lessons/03_sequence_alignment_theory.md b/lessons/03_sequence_alignment_theory.md
index 80e9313..00e8ae0 100644
--- a/lessons/03_sequence_alignment_theory.md
+++ b/lessons/03_sequence_alignment_theory.md
@@ -200,8 +200,8 @@ Next, we need to **add the modules** that we will be using for alignment:
 ```
 # Load modules
-module load gcc/6.2.0
-module load bwa/0.7.17
+module load gcc/14.2.0
+module load bwa/0.7.18
 ```
 
 > NOTE: On O2, many of the common tools were compiled using `GCC` version 6.2.0, so to be able to access them, we first need to load the `GCC` module.
@@ -270,8 +270,8 @@ bwa mem \
 #SBATCH -o bwa_alignment_normal_%j.out
 #SBATCH -e bwa_alignment_normal_%j.err
 
 # Load modules
-module load gcc/6.2.0
-module load bwa/0.7.17
+module load gcc/14.2.0
+module load bwa/0.7.18
 
 # Assign files to bash variables
 REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
 LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_normal_1.fq.gz
@@ -319,8 +319,8 @@ $ sed 's/normal/tumor/g' bwa_alignment_normal.sbatch > bwa_alignment_tumor.sbat
 #SBATCH -o bwa_alignment_tumor_%j.out
 #SBATCH -e bwa_alignment_tumor_%j.err
 
 # Load modules
-module load gcc/6.2.0
-module load bwa/0.7.17
+module load gcc/14.2.0
+module load bwa/0.7.18
 
 # Assign files to bash variables
 REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
 LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_tumor_1.fq.gz

From d8049fee38ed51fcfb91e0911ad28361950a6704 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Fri, 25 Apr 2025 16:39:52 -0400
Subject: [PATCH 02/22] Updated GATK version number

---
 lessons/07_variant_calling.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lessons/07_variant_calling.md b/lessons/07_variant_calling.md
index e3128f1..5f9cf37 100644
--- a/lessons/07_variant_calling.md
+++ b/lessons/07_variant_calling.md
@@ -145,7 +145,7 @@ Next to add the `GATK4` module we are going to load:
 ```
 # Load the GATK module
-module load gatk/4.1.9.0
+module load gatk/4.6.1.0
 ```
 
 And now, we need to create our variables:

From e5bc4a305cd2d01267432547203b18bde656f8ed Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Fri, 25 Apr 2025 16:41:54 -0400
Subject: [PATCH 03/22] Updated GATK and snpEff versions

---
 lessons/08_variant_filtering.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lessons/08_variant_filtering.md b/lessons/08_variant_filtering.md
index 642be64..31481b9 100644
--- a/lessons/08_variant_filtering.md
+++ b/lessons/08_variant_filtering.md
@@ -67,8 +67,8 @@ Next, we need to add the modules that we will be loading:
 ```
 # Load modules
-module load gatk/4.1.9.0
-module load snpEff/4.3g
+module load gatk/4.6.1.0
+module load snpEff/5.2f
 ```
 
 Next, we will add our variables:

From 4f59f5f3b78197d7b043fff7a3de47a8bf085dcb Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Fri, 25 Apr 2025 16:43:11 -0400
Subject: [PATCH 04/22] Updated snpEff version

---
 lessons/09_variant_annotation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lessons/09_variant_annotation.md b/lessons/09_variant_annotation.md
index c20521e..2daeb28 100644
--- a/lessons/09_variant_annotation.md
+++ b/lessons/09_variant_annotation.md
@@ -40,7 +40,7 @@ The first step in annotating your VCF file is finding the appropriate SnpEff dat
 To see if your genome of interest is in the `SnpEff` database, we first need to load the `SnpEff` module:
 
 ```
-module load snpEff/4.3g
+module load snpEff/5.2f
 ```
 
 With the `SnpEff` module loaded, let's use the following command to browse all of the currently available genomes:

From 2013d6996dac2950cbf5aaca78f9a8cb40322962 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Mon, 28 Apr 2025 09:23:43 -0400
Subject: [PATCH 05/22] Updated annotation

---
 lessons/09_variant_annotation.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lessons/09_variant_annotation.md b/lessons/09_variant_annotation.md
index 2daeb28..3488efa 100644
--- a/lessons/09_variant_annotation.md
+++ b/lessons/09_variant_annotation.md
@@ -49,13 +49,13 @@ With the `SnpEff` module loaded, let's use the following command to browse all o
 java -jar $SNPEFF/snpEff.jar databases | less
 ```
 
-The first column is the database name and the second column in the `Genus_species` for the organism. There is also a database download link where the database can be downloaded at but this can be ignored as SnpEff will automatically download the database if needed. As you can see there are tens of thousands of these pre-built databases. So let's exit the `less` buffer page and **see which GRCh databases are available**:
+The first column is the database name and the second column is the `Genus_species` for the organism. There is also a database download link where the database can be downloaded, but this can be ignored as SnpEff will automatically download the database if needed. As you can see, there are tens of thousands of these pre-built databases. So let's exit the `less` buffer page and **see which hg38 databases are available**:
 
 ```
-java -jar $SNPEFF/snpEff.jar databases | grep "GRCh"
+java -jar $SNPEFF/snpEff.jar databases | grep "hg38"
 ```
 
-We can see that this build of SnpEff has five possible GRCh databases that we can use for annotation, including one for GRCh38.p7 called GRCh38.p7.RefSeq. Now that we have found the database that we would like to use for our analysis, we can run `SnpEff`.
+We can see that this build of SnpEff has three possible hg38 databases that we can use for annotation. We will use the one labelled hg38. Now that we have found the database that we would like to use for our analysis, we can run `SnpEff`.
 
 ### Running SnpEff
 
@@ -119,7 +119,7 @@ Next, we will add the line to load the modules that we will need:
 # Load modules
 module load gcc/9.2.0
 module load bcftools/1.14
-module load snpEff/4.3g
+module load snpEff/5.2f
 # Assign variables
 REPORTS_DIRECTORY=/home/$USER/variant_calling/reports/snpeff/
 SAMPLE_NAME=mutect2_syn3_normal_syn3_tumor

From f92ce818b2c4758b89cc720cdc84a4daae7bb659 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Mon, 28 Apr 2025 09:27:07 -0400
Subject: [PATCH 06/22] Updated versions

---
 lessons/09_variant_annotation.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/lessons/09_variant_annotation.md b/lessons/09_variant_annotation.md
index 3488efa..ed8e6de 100644
--- a/lessons/09_variant_annotation.md
+++ b/lessons/09_variant_annotation.md
@@ -117,8 +117,8 @@ Next, we will add the line to load the modules that we will need:
 ```
 # Load modules
-module load gcc/9.2.0
-module load bcftools/1.14
+module load gcc/14.2.0
+module load bcftools/1.21
 module load snpEff/5.2f
 ```
@@ -279,8 +279,8 @@ Let's discuss each part of this command:
 Click here to see how to annotate our VCF file with the dbSNP annotation in bcftools
 The first thing we are going to need to do is load the modules that we will be using:
-module load gcc/9.2.0
-module load bcftools/1.14
+module load gcc/14.2.0
+module load bcftools/1.21
 
Assuming we have already indexed our dbSNP VCF file, the first thing that we are going to need to do is compress the VCF file that we wish to annotate with:
@@ -346,8 +346,8 @@ Let's explain each part of this command:
 #SBATCH -o variant_annotation_syn3_normal_syn3_tumor_%j.out
 #SBATCH -e variant_annotation_syn3_normal_syn3_tumor_%j.err
 # Load modules
-module load gcc/9.2.0
-module load bcftools/1.14
+module load gcc/14.2.0
+module load bcftools/1.21
 module load snpEff/5.2f
 # Assign variables
 REPORTS_DIRECTORY=/home/$USER/variant_calling/reports/snpeff/
 SAMPLE_NAME=mutect2_syn3_normal_syn3_tumor

From 81d0bcc947d5c32be414f1282246f0fcb916ff4a Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Mon, 28 Apr 2025 09:28:08 -0400
Subject: [PATCH 07/22] Updated packages

---
 lessons/09_variant_annotation.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lessons/09_variant_annotation.md b/lessons/09_variant_annotation.md
index ed8e6de..8f0a593 100644
--- a/lessons/09_variant_annotation.md
+++ b/lessons/09_variant_annotation.md
@@ -230,8 +230,8 @@ Next, we are going to index this file. While it is not necesscary for us to inde
 We are going to be using tabix, which is part of the HTSlib module. First, we will need to load the HTSlib module, which also requires us to load the gcc module as well:
-module load gcc/9.2.0
-module load htslib/1.14
+module load gcc/14.2.0
+module load htslib/1.21 
 
 In order to index our dbSNP file using tabix, we just need to run the following command:

From bac022fcce1b3760e36af3e880c043dd18a05c58 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Tue, 29 Apr 2025 14:41:42 -0400
Subject: [PATCH 08/22] Updated version number

---
 lessons/10_variant_prioritization.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lessons/10_variant_prioritization.md b/lessons/10_variant_prioritization.md
index 73c7731..004c740 100644
--- a/lessons/10_variant_prioritization.md
+++ b/lessons/10_variant_prioritization.md
@@ -32,7 +32,7 @@ Before we do anything, let's move to the directory with our VCF files and load t
 ```
 cd /n/scratch/users/${USER:0:1}/$USER/variant_calling/vcf_files/
-module load snpEff/4.3g
+module load snpEff/5.2f
 ```
 
 **SnpSift filter** is one of the most useful SnpSift commands. Using SnpSift filter you can filter VCF files **using arbitrary expressions.** In the most simple case, you can filter your SnpEff annotated VCF file based upon any of the **first seven fields** of the VCF file:

From 23cf593bae9b47b84b87072d2f31dc40dc1a1647 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Thu, 15 May 2025 10:10:08 -0400
Subject: [PATCH 09/22] Conversion of Picard to GATK

---
 lessons/04_alignment_file_processing.md | 168 ++++++++++++------
 1 file changed, 87 insertions(+), 81 deletions(-)

diff --git a/lessons/04_alignment_file_processing.md b/lessons/04_alignment_file_processing.md
index c970473..b04ed73 100644
--- a/lessons/04_alignment_file_processing.md
+++ b/lessons/04_alignment_file_processing.md
@@ -21,9 +21,15 @@ The alignment files that come from `bwa` are raw alignment and need some process

-## Pipeline for processing alignment files with Picard
+## Pipeline for processing alignment files with GATK
 
-[Picard](https://broadinstitute.github.io/picard/) is a set of command line tools for **processing high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF**. It is maintained by the Broad Institute, and is open-source under the MIT license and free for all uses. Picard is written in Java and does not have functionality for multi-threading.
+[Picard](https://broadinstitute.github.io/picard/) is a set of command line tools for **processing high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF**. It is maintained by the Broad Institute, and is open-source under the MIT license and free for all uses. The Broad Institute also maintains the tool [GATK](https://gatk.broadinstitute.org/hc/en-us), which we will be using for variant calling and will discuss more later on. All of the functions of Picard have been ported over to GATK, so instead of using Picard, we will just use Picard's functions, which are now part of GATK. To see all of the functions within GATK you can use:
+
+```
+gatk --list
+```
+
+You will notice that some packages have `(Picard)` after them; these represent tools that have been brought over from Picard and incorporated into GATK. As a result of this history and subsequent merging of the tools, you may see people reference GATK or Picard interchangeably for some of these functions. Picard tools were written without multithreading, and thus when they were brought over to GATK many, if not all, of them retained their inability to multithread.
 
 > ### Why not use `samtools`?
 > The processing of the alignment files (SAM/BAM files) can also be done with [`samtools`](https://github.com/samtools/samtools). While there are some advantages to using samtools (i.e. more user-friendly, multi-threading capability), there are slight formatting differences which we may want to take advantage of downstream. Since we will be using GATK later in this workshop (also from the Broad Institute), Picard seemed like a more suitable fit.
@@ -36,10 +42,10 @@ The alignment files that come from `bwa` are raw alignment and need some process
 
 > Click here if you need to merge alignment files from the same sample
 > You can merge alignment files with different read group IDs from the same sample in both Picard and samtools. In the dropdowns below we will outline each method:
 >
-> Click to see how to merge SAM/BAM files in Picard -> First, we need to load the Picard module: +> Click to see how to merge SAM/BAM files in GATK +> First, we need to load the gatk module: >
->   module load picard/2.27.5
+> module load gatk/4.6.1.0
> We can define our variables as: >
 >   INPUT_BAM_1=Read_group_1.bam
@@ -47,12 +53,12 @@ The alignment files that come from `bwa` are raw alignment and need some process
 >   MERGED_OUTPUT=Merged_output.bam
> Here is the command we would need to run to merge the SAM/BAM files: >
->   java -jar $PICARD/picard.jar MergeSamFiles \
+>   gatk MergeSamFiles \
 >     --INPUT $INPUT_BAM_1 \
 >     --INPUT $INPUT_BAM_2 \
 >     --OUTPUT $MERGED_OUTPUT
> We can breakdown this command: ->
  • java -jar $PICARD/picard.jar MergeSamFiles This calls the MergeSamFiles from within Picard
  • +>
    • gatk MergeSamFiles This calls the MergeSamFiles from within gatk
    • >
    • --INPUT $INPUT_BAM_1 This is the first SAM/BAM file that we would like to merge.
    • >
    • --INPUT $INPUT_BAM_2 This is the second SAM/BAM file that we would like to merge. We can continue to add --INPUT lines as needed.
    • >
    • --OUTPUT $MERGED_OUTPUT This is the output merged SAM/BAM file
    @@ -62,8 +68,8 @@ The alignment files that come from `bwa` are raw alignment and need some process > Click to see how to merge SAM/BAM files in samtools > First, we need to load the samtools module, which also requires gcc to be loaded: >
    ->   module load gcc/6.2.0
    ->   module load samtools/1.15.1
    +> module load gcc/14.2.0 +> module load samtools/1.21 > We can define our variables as: >
     >   INPUT_BAM_1=Read_group_1.bam
    @@ -77,18 +83,18 @@ The alignment files that come from `bwa` are raw alignment and need some process
     >     $INPUT_BAM_1 \
     >     $INPUT_BAM_2 \
     >     --output-fmt BAM \
    ->     -@ $THREADS
    +> --threads $THREADS > We can break down this command: >
    • samtools merge This calls the merge package within samtools.
    • >
    • -o $MERGED_OUTPUT This is the merged output file.
    • >
    • $INPUT_BAM_1 This is the first SAM/BAM file that we would like to merge.
    • >
    • $INPUT_BAM_2 This is the second SAM/BAM file that we would like to merge. We can continue to add additional input SAM/BAM files to this list as needed.
    • >
    • --output-fmt BAM This specifies the output format as BAM. If for some reason you wanted a SAM output file then you would use --output-fmt SAM instead.
    • ->
    • -@ $THREADS This specifies the number of threads we want to use for this process. We are using 8 threads in this example, but this could be different depending on the parameters that you would like to use.
    +>
  • --threads $THREADS This specifies the number of threads we want to use for this process. We are using 8 threads in this example, but this could be different depending on the parameters that you would like to use.
 >
 >
-Before we start processing our alignment SAM file with Picard, let's take a quick look at the steps involved in this pipeline.
+Before we start processing our alignment SAM file with `GATK`/`Picard`, let's take a quick look at the steps involved in this pipeline.

@@ -100,32 +106,32 @@ Let's begin by creating a script for alignment processing. Make a new `sbatch` s
 ```
 $ cd ~/variant_calling/scripts/
-$ vim picard_alignment_processing_normal.sbatch
+$ vim gatk_alignment_processing_normal.sbatch
 ```
 
 As always, we start the `sbatch` script with our shebang line, description of the script and our `sbatch` directives to request appropriate resources from the O2 cluster.
 
 ```
 #!/bin/bash
-# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Picard
+# This sbatch script is for processing the alignment output from bwa and preparing it for use
 
 # Assign sbatch directives
 #SBATCH -p priority
 #SBATCH -t 0-04:00:00
 #SBATCH -c 1
 #SBATCH --mem 32G
-#SBATCH -o picard_alignment_processing_normal_%j.out
-#SBATCH -e picard_alignment_processing_normal_%j.err
+#SBATCH -o gatk_alignment_processing_normal_%j.out
+#SBATCH -e gatk_alignment_processing_normal_%j.err
 ```
 
-Next we load the `Picard` module:
+Next we load the `GATK` module:
 
 ```
 # Load module
-module load picard/2.27.5
+module load gatk/4.6.1.0
 ```
 
-**Note: `Picard` is software that does NOT require gcc/6.2.0 to also be loaded**
+**Note: `GATK` is software that does NOT require gcc/14.2.0 to also be loaded**
 
 Next, let's define some variables that we will be using:
 
@@ -133,14 +139,14 @@ Next, let's define some variables that we will be using:
 # Assign file paths to variables
 SAMPLE_NAME=syn3_normal
 SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam
-REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/picard/${SAMPLE_NAME}/
+REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/gatk/${SAMPLE_NAME}/
 QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam`
 REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam`
 METRICS_FILE=${REPORTS_DIRECTORY}/${SAMPLE_NAME}.remove_duplicates_metrics.txt
 COORDINATE_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}coordinate_sorted.bam`
 ```
-Finally, we can also make a directory to hold the `Picard` reports:
+Finally, we can also make a directory to hold the `GATK`/`Picard` reports:
 
 ```
 # Make reports directory
@@ -155,7 +161,7 @@ As you might suspect, because SAM files hold alignment information for all of th
 
 ### 2. Query-sort alignment file
 
-Alignment files are initally ordered by the order of the reads in the FASTQ file, which is not particularly useful. `Picard` can more exhaustively look for duplicates if the file is sorted by read-name (**query-sorted**). Oftentimes, when people discuss sorted BAM/SAM files, they are refering to **coordinate-sorted** BAM/SAM files.
+Alignment files are initially ordered by the order of the reads in the FASTQ file, which is not particularly useful. `GATK`/`Picard` can more exhaustively look for duplicates if the file is sorted by read-name (**query-sorted**). Oftentimes, when people discuss sorted BAM/SAM files, they are referring to **coordinate-sorted** BAM/SAM files.
 
 - **Query**-sorted BAM/SAM files are sorted based upon their read names and ordered lexiographically
 - **Coordinate**-sorted BAM/SAM files are sorted by their aligned sequence name (chromosome/linkage group/scaffold) and position
@@ -164,15 +170,15 @@ Alignment files are initally ordered by the order of the reads in the FASTQ file

-Picard can mark and remove duplicates in either coordinate-sorted or query-sorted BAM/SAM files, however, if the alignments are query-sorted it can test secondary alignments for duplicates. A brief discussion of this nuance is discussed in the [`MarkDuplicates` manual of `Picard`](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-). As a result, we will first **query**-sort our SAM file and convert it to a BAM file.
+GATK/Picard can mark and remove duplicates in either coordinate-sorted or query-sorted BAM/SAM files, however, if the alignments are query-sorted it can test secondary alignments for duplicates. A brief discussion of this nuance is discussed in the [`MarkDuplicates` manual of `GATK`/`Picard`](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-). As a result, we will first **query**-sort our SAM file and convert it to a BAM file.
 
-**While we query-sort the reads, we are also going to convert our SAM file to a BAM file**. We don't need to specify this conversion explicitly, because `Picard` will make this change by interpreting the file extensions that we provide in the `INPUT` and `OUTPUT` file options.
+**While we query-sort the reads, we are also going to convert our SAM file to a BAM file**. We don't need to specify this conversion explicitly, because `GATK`/`Picard` will make this change by interpreting the file extensions that we provide in the `INPUT` and `OUTPUT` file options.
 Add the following command to our script:
 
 ```
 # Query-sort alginment file and convert to BAM
-java -jar $PICARD/picard.jar SortSam \
+gatk SortSam \
 --INPUT $SAM_FILE \
 --OUTPUT $QUERY_SORTED_BAM_FILE \
 --SORT_ORDER queryname
@@ -180,13 +186,13 @@ java -jar $PICARD/picard.jar SortSam \
 
 The components of this command are:
 
-* `java -jar $PICARD/picard.jar SortSam ` Calls Picard's `SortSam` software package
+* `gatk SortSam ` Calls `GATK`/`Picard`'s `SortSam` software package
 * `--INPUT $SAM_FILE` This is where we provide the SAM input file
 * `--OUTPUT $QUERY_SORTED_BAM_FILE` This is the BAM output file.
 * `--SORT_ORDER queryname` The options here are either `queryname` or `coordinate`.
 
 > #### Why does this command look different from the Picard documentation?
-> The **syntax that Picard uses** is quite particular and the syntax shown in the documentation is **not always consistent**. There are two main ways for providing input for Picard: Traditional and New (Barcalay) Syntax. Commands written in either syntax are **equally valid and produce the same output**. To better understand the different syntax, we recommend you take a look [at this short lesson](picard_syntax.md).
+> The **syntax that GATK/Picard uses** is quite particular and the syntax shown in the documentation is **not always consistent**. There are two main ways for providing input for GATK/Picard: Traditional and New (Barclay) Syntax. Commands written in either syntax are **equally valid and produce the same output**. To better understand the different syntax, we recommend you take a look [at this short lesson](picard_syntax.md).
 
 ### 3. Mark and Remove Duplicates
 
@@ -200,7 +206,7 @@ Now we will add the command to our script that allows us to mark and remove dupl
 
 ```
 # Mark and remove duplicates
-java -jar $PICARD/picard.jar MarkDuplicates \
+gatk MarkDuplicates \
 --INPUT $QUERY_SORTED_BAM_FILE \
 --OUTPUT $REMOVE_DUPLICATES_BAM_FILE \
 --METRICS_FILE $METRICS_FILE \
@@ -209,7 +215,7 @@ java -jar $PICARD/picard.jar MarkDuplicates \
 
 The components of this command are:
 
-* `java -jar $PICARD/picard.jar MarkDuplicates` Calls `Picard`'s `MarkDuplicates` program
+* `gatk MarkDuplicates` Calls `GATK`/`Picard`'s `MarkDuplicates` program
 * `--INPUT $QUERY_SORTED_BAM_FILE` Uses our query-sorted BAM file as input
 * `--OUTPUT $REMOVE_DUPLICATES_BAM_FILE` Writes the output to a BAM file
 * `--METRICS_FILE $METRICS_FILE` Creates a metrics file (required by `Picard MarkDuplicates`)
@@ -217,11 +223,11 @@ The components of this command are:
 
 ### 4. Coordinate-sort the Alignment File
 
-For most downstream processes, coordinate-sorted alignment files are required. As a result, we will need to **change our alignment file from being query-sorted to being coordinate-sorted** and we will once again use the `SortSam` command within `Picard` to accomplish this. Since this BAM file will be the final BAM file that we make and will use for downstream analyses, **we will need to create an index for it at the same time**. The command we will be using for coordinate-sorting and indexing our BAM file is:
+For most downstream processes, coordinate-sorted alignment files are required. As a result, we will need to **change our alignment file from being query-sorted to being coordinate-sorted** and we will once again use the `SortSam` command within `GATK` to accomplish this. Since this BAM file will be the final BAM file that we make and will use for downstream analyses, **we will need to create an index for it at the same time**. The command we will be using for coordinate-sorting and indexing our BAM file is:
 
 ```
 # Coordinate-sort BAM file and create BAM index file
-java -jar $PICARD/picard.jar SortSam \
+gatk SortSam \
 --INPUT $REMOVE_DUPLICATES_BAM_FILE \
 --OUTPUT $COORDINATE_SORTED_BAM_FILE \
 --SORT_ORDER coordinate \
@@ -230,7 +236,7 @@ java -jar $PICARD/picard.jar SortSam \
 
 The components of this command are:
 
-* `java -jar $PICARD/picard.jar SortSam` Calls `Picard`'s `SortSam` program
+* `gatk SortSam` Calls `GATK`'s `SortSam` program
 * `--INPUT $REMOVE_DUPLICATES_BAM_FILE` Our BAM file once we have removed the duplicate reads. **NOTE: Even though the software is called `SortSam`, it can use BAM or SAM files as input and also BAM or SAM files as output.**
 * `--OUTPUT $COORDINATE_SORTED_BAM_FILE` Our BAM output file sorted by coordinates.
 * `--SORT_ORDER coordinate` Sort the output file by **coordinates**
@@ -242,20 +248,20 @@ Go ahead and save and quit. **Don't run it just yet!**
 
 Click here to see what our final sbatch script for the normal sample should look like
 #!/bin/bash
-# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Picard
+# This sbatch script is for processing the alignment output from bwa and preparing it
 # Assign sbatch directives
 #SBATCH -p priority
 #SBATCH -t 0-04:00:00
 #SBATCH -c 1
 #SBATCH --mem 32G
-#SBATCH -o picard_alignment_processing_normal_%j.out
-#SBATCH -e picard_alignment_processing_normal_%j.err
+#SBATCH -o gatk_alignment_processing_normal_%j.out
+#SBATCH -e gatk_alignment_processing_normal_%j.err
 # Load module
-module load picard/2.27.5
+module load gatk/4.6.1.0
 # Assign file paths to variables
 SAMPLE_NAME=syn3_normal
 SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam
-REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/picard/${SAMPLE_NAME}/
+REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/gatk/${SAMPLE_NAME}/
 QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam`
 REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam`
 METRICS_FILE=${REPORTS_DIRECTORY}/${SAMPLE_NAME}.remove_duplicates_metrics.txt
@@ -263,18 +269,18 @@ COORDINATE_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}coordinate_sorted.bam`
 # Make reports directory
 mkdir -p $REPORTS_DIRECTORY
 # Query-sort alginment file and convert to BAM
-java -jar $PICARD/picard.jar SortSam \
+gatk SortSam \
 --INPUT $SAM_FILE \
 --OUTPUT $QUERY_SORTED_BAM_FILE \
 --SORT_ORDER queryname
 # Mark and remove duplicates
-java -jar $PICARD/picard.jar MarkDuplicates \
+gatk MarkDuplicates \
 --INPUT $QUERY_SORTED_BAM_FILE \
 --OUTPUT $REMOVE_DUPLICATES_BAM_FILE \
 --METRICS_FILE $METRICS_FILE \
 --REMOVE_DUPLICATES true
# Coordinate-sort BAM file and create BAM index file -java -jar $PICARD/picard.jar SortSam \ +gatk SortSam \ --INPUT $REMOVE_DUPLICATES_BAM_FILE \ --OUTPUT $COORDINATE_SORTED_BAM_FILE \ --SORT_ORDER coordinate \ @@ -284,19 +290,19 @@ java -jar $PICARD/picard.jar SortSam \ > #### Do I need to add read groups? -> Some pipelines will have you add read groups while procressing your alignment files. It is usually not necessary because you can typically do it during alignment. **If you are needing to add read groups, we recommend doing it first (before all the processing steps outlined above)**. You can use Picard `AddOrReplaceReadGroups`, which has the added benefit of allowing you to also sort your alignment file (our first step anyways) in the same step as adding the read group information. The dropdown below discusses how to add or replace read groups within `Picard`. +> Some pipelines will have you add read groups while procressing your alignment files. It is usually not necessary because you can typically do it during alignment. **If you are needing to add read groups, we recommend doing it first (before all the processing steps outlined above)**. You can use Picard `AddOrReplaceReadGroups`, which has the added benefit of allowing you to also sort your alignment file (our first step anyways) in the same step as adding the read group information. The dropdown below discusses how to add or replace read groups within `GATK`/`Picard`. > >
-> Click here if you need to add or replace read groups using Picard -> In order to add or replace read groups, we are going to use Picard's AddOrReplaceReadGroups tool. First we would need to load the Picard module: +> Click here if you need to add or replace read groups using GATK/Picard +> In order to add or replace read groups, we are going to use GATK/Picard's AddOrReplaceReadGroups tool. First we would need to load the GATK module: >
 >  # Load module
->  module load picard/2.27.5
+> module load gatk/4.6.1.0
> > The general syntax for AddOrReplaceReadGroups is: >
 >  # Add or replace read group information
->  java -jar $PICARD/picard.jar AddOrReplaceReadGroups \
+>  gatk AddOrReplaceReadGroups \
 >    --INPUT $SAM_FILE \
 >    --OUTPUT $BAM_FILE \
 >    --RGID $READ_GROUP_ID \
@@ -305,9 +311,9 @@ java -jar $PICARD/picard.jar SortSam \
 >    --RGPU $READ_GROUP_PLATFORM_UNIT \
 >    --RGSM $READ_GROUP_SAMPLE
> ->
  • java -jar $PICARD/picard.jar AddOrReplaceReadGroups This calls the AddOrReplaceReadGroups package within Picard
  • ->
  • --INPUT $SAM_FILEThis is your input file. It could be a BAM/SAM alignment file, but because we recommend doing this first if you need to do it, this would be a SAM file. You don't need to specifiy that it is a BAM/SAM file, Picard with figure that out from the provided extension.
  • ->
  • --OUTPUT $BAM_FILEThis would be your output file. It could be BAM/SAM, but you would mostly likely pick BAM because you'd like to save space on the cluster. You don't need to specifiy that it is a BAM/SAM file, Picard with figure that out from the provided extension.
  • +>
    • gatk AddOrReplaceReadGroups This calls the AddOrReplaceReadGroups package within GATK/Picard
    • +>
    • --INPUT $SAM_FILEThis is your input file. It could be a BAM/SAM alignment file, but because we recommend doing this first if you need to do it, this would be a SAM file. You don't need to specifiy that it is a BAM/SAM file, GATK/Picard with figure that out from the provided extension.
    • +>
    • --OUTPUT $BAM_FILEThis would be your output file. It could be BAM/SAM, but you would mostly likely pick BAM because you'd like to save space on the cluster. You don't need to specifiy that it is a BAM/SAM file, GATK/Picard with figure that out from the provided extension.
    • >
    • --RGID $READ_GROUP_IDThis is your read group ID and must be unique
    • >
    • --RGLB $READ_GROUP_LIBRARYThis is your read group library
    • >
    • --RGPL $READ_GROUP_PLATFORMThis is the platform used for the sequencing
    • @@ -355,8 +361,8 @@ Next, we are going to need to set-up our sbatch submission script w #SBATCH -o samtools_processing_normal_%j.out #SBATCH -e samtools_processing_normal_%j.err
 # Load modules
-module load gcc/6.2.0
-module load samtools/1.15.1
+module load gcc/14.2.0
+module load samtools/1.21
 # Assign file paths to variables
 SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam
 QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam`
@@ -372,7 +378,7 @@ Similarly to Picard, we are going to need to initally query-
       # Sort SAM file and convert it to a query name sorted BAM file
       samtools sort \
      -  -@ 8 \
      +  --threads 8 \
         -n \
         -o $QUERY_SORTED_BAM_FILE \
         $SAM_FILE
      @@ -382,7 +388,7 @@ The components of this line of code are:
           
       
      • samtools sort This calls the sort function within samtools.
      • -
      • -@ 8 This tells samtools to use 8 threads when it multithreads this task. Since we requested 8 cores for this sbatch submission, let's go ahead and use them all.
      • +
      • --threads 8 This tells samtools to use 8 threads when it multithreads this task. Since we requested 8 cores for this sbatch submission, let's go ahead and use them all.
      • -n This argument tells samtools sort to sort by read name as opposed to the default sorting which is done by coordinate.
      • @@ -401,7 +407,7 @@ Next, we are going to add more mate-pair information to the alignments including
         # Score mates
         samtools fixmate \
        -  -@ 8 \
        +  --threads 8 \
           -m \
           $QUERY_SORTED_BAM_FILE \
           $FIXMATE_BAM_FILE
        @@ -411,7 +417,7 @@ The parts of this command are:
         
         
        • samtools fixmate This calls the fixmate command in samtools
        • -
        • -@ 8 This tells samtools to use 8 threads when it multithreads this task.
        • +
        • --threads 8 This tells samtools to use 8 threads when it multithreads this task.
        • -m This will add the mate score tag that will be critically important later for samtools markdup
        • @@ -429,7 +435,7 @@ Now that we have added the fixmate information, we need to coord
           # Sort BAM file by coordinate   
           samtools sort \
          -  -@ 8 \
          +  --threads 8 \
             -o $COORDINATE_SORTED_BAM_FILE \
             $FIXMATE_BAM_FILE
           
          @@ -448,7 +454,7 @@ Now we are going to mark and remove the duplicate reads: samtools markdup \ -r \ --write-index \ - -@ 8 \ + --threads 8 \ $COORDINATE_SORTED_BAM_FILE \ ${REMOVED_DUPLICATES_BAM_FILE}##idx##${REMOVED_DUPLICATES_BAM_FILE}.bai
        @@ -461,7 +467,7 @@ The components of this command are:
      • --write-index This writes an index file of the output BAM file
      • -
      • -@ 8 This sets that we will be using 8 threads
      • +
      • --threads 8 This sets that we will be using 8 threads
      • $BAM_FILE This is our input BAM file
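The `##idx##` token in the output name is how `samtools` (together with `--write-index`) receives both destinations in a single argument: the text before `##idx##` names the BAM file, the text after names its index. A minimal `bash` sketch of how that argument is assembled (the path is made up for illustration):

```shell
# Hypothetical output path, for illustration only
REMOVED_DUPLICATES_BAM_FILE=/tmp/syn3_example.removed_duplicates.bam

# samtools reads OUT##idx##OUT.bai as "write the BAM to OUT and the index to OUT.bai"
MARKDUP_OUTPUT="${REMOVED_DUPLICATES_BAM_FILE}##idx##${REMOVED_DUPLICATES_BAM_FILE}.bai"

echo "$MARKDUP_OUTPUT"
```

Because both names are derived from the same variable, the index always sits next to its BAM file with a matching name.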
      • @@ -487,8 +493,8 @@ The final script should look like: #SBATCH -o samtools_processing_normal_%j.out #SBATCH -e samtools_processing_normal_%j.err
        # Load modules -module load gcc/6.2.0 -module load samtools/1.15.1
        +module load gcc/14.2.0 +module load samtools/1.21
        # Assign file paths to variables SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` @@ -497,26 +503,26 @@ COORDINATE_SORTED_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}coordi REMOVED_DUPLICATES_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}removed_duplicates.bam`
        # Sort SAM file and convert it to a query name sorted BAM file samtools sort \ - -@ 8 \ + --threads 8 \ -n \ -o $QUERY_SORTED_BAM_FILE \ $SAM_FILE
        # Score mates samtools fixmate \ - -@ 8 \ + --threads 8 \ -m \ $QUERY_SORTED_BAM_FILE \ $FIXMATE_BAM_FILE
        # Sort BAM file by coordinate samtools sort \ - -@ 8 \ + --threads 8 \ -o $COORDINATE_SORTED_BAM_FILE \ $FIXMATE_BAM_FILE
        # Mark and remove duplicates and then index the output file samtools markdup \ -r \ --write-index \ - -@ 8 \ + --threads 8 \ $COORDINATE_SORTED_BAM_FILE \ ${REMOVED_DUPLICATES_BAM_FILE}##idx##${REMOVED_DUPLICATES_BAM_FILE}.bai
      @@ -540,7 +546,7 @@ The final sbatch submission script for the tumor sample should look #SBATCH -o samtools_processing_tumor_%j.out #SBATCH -e samtools_processing_tumor_%j.err
      # Load modules -module load gcc/6.2.0 +module load gcc/14.2.0 module load samtools/1.15.1
      # Assign file paths to variables SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.p7.sam @@ -550,26 +556,26 @@ COORDINATE_SORTED_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}coordi REMOVED_DUPLICATES_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}removed_duplicates.bam`
      # Sort SAM file and convert it to a query name sorted BAM file samtools sort \ - -@ 8 \ + --threads 8 \ -n \ -o $QUERY_SORTED_BAM_FILE \ $SAM_FILE
      # Score mates samtools fixmate \ - -@ 8 \ + --threads 8 \ -m \ $QUERY_SORTED_BAM_FILE \ $FIXMATE_BAM_FILE
      # Sort BAM file by coordinate samtools sort \ - -@ 8 \ + --threads 8 \ -o $COORDINATE_SORTED_BAM_FILE \ $FIXMATE_BAM_FILE
      # Mark and remove duplicates and then index the output file samtools markdup \ -r \ --write-index \ - -@ 8 \ + --threads 8 \ $COORDINATE_SORTED_BAM_FILE \ ${REMOVED_DUPLICATES_BAM_FILE}##idx##${REMOVED_DUPLICATES_BAM_FILE}.bai
      @@ -596,26 +602,26 @@ Is this SAM file's sort order likely: unsorted, query-sorted, coordinate-sorted Similar to the `bwa` script, we will now need to use `sed` to create a `sbatch` script that will be used for processing the tumor SAM file into a BAM file that can be used as input to GATK. The `sed` command to do this would be: ``` -sed 's/normal/tumor/g' picard_alignment_processing_normal.sbatch > picard_alignment_processing_tumor.sbatch +sed 's/normal/tumor/g' gatk_alignment_processing_normal.sbatch > gatk_alignment_processing_tumor.sbatch ``` -_As a result your tumor `Picard` alignment processing script should look almost identical but `normal` has been replaced by `tumor`._ +_As a result, your tumor `GATK`/`Picard` alignment processing script should look almost identical, except that `normal` has been replaced by `tumor`._
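To see what `sed 's/normal/tumor/g'` does to a single line, here is a minimal, self-contained sketch (the `#SBATCH` line is just an illustrative stand-in for the script's contents):

```shell
# One line as it might appear in the normal-sample script (illustrative)
line='#SBATCH -o gatk_alignment_processing_normal_%j.out'

# s/normal/tumor/g substitutes every occurrence on the line, not just the first
tumor_line=$(printf '%s\n' "$line" | sed 's/normal/tumor/g')

echo "$tumor_line"
```

Because the substitution is global, every `normal` in the script — file names, variables, and log names alike — gets rewritten, which is why a consistent naming scheme makes this trick work.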
      Click here to see what our final `sbatch` script for the tumor sample should look like
       #!/bin/bash
      -# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Picard
      +# This sbatch script is for processing the alignment output from bwa and preparing it for use
      # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 #SBATCH -c 1 #SBATCH --mem 32G -#SBATCH -o picard_alignment_processing_tumor_%j.out -#SBATCH -e picard_alignment_processing_tumor_%j.err
      +#SBATCH -o gatk_alignment_processing_tumor_%j.out +#SBATCH -e gatk_alignment_processing_tumor_%j.err
      # Load module -module load picard/2.27.5
      +module load gatk/4.6.1.0
      # Assign file paths to variables SAMPLE_NAME=syn3_tumor SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam @@ -627,18 +633,18 @@ COORDINATE_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}coordinate_sorted.bam`
      # Make reports directory mkdir -p $REPORTS_DIRECTORY
      # Query-sort alignment file and convert to BAM -java -jar $PICARD/picard.jar SortSam \ +gatk SortSam \ --INPUT $SAM_FILE \ --OUTPUT $QUERY_SORTED_BAM_FILE \ --SORT_ORDER queryname
      # Mark and remove duplicates -java -jar $PICARD/picard.jar MarkDuplicates \ +gatk MarkDuplicates \ --INPUT $QUERY_SORTED_BAM_FILE \ --OUTPUT $REMOVE_DUPLICATES_BAM_FILE \ --METRICS_FILE $METRICS_FILE \ --REMOVE_DUPLICATES true
      # Coordinate-sort BAM file and create BAM index file -java -jar $PICARD/picard.jar SortSam \ +gatk SortSam \ --INPUT $REMOVE_DUPLICATES_BAM_FILE \ --OUTPUT $COORDINATE_SORTED_BAM_FILE \ --SORT_ORDER coordinate \ @@ -646,9 +652,9 @@ java -jar $PICARD/picard.jar SortSam \
      -## Submitting scripts for Picard processing +## Submitting scripts for processing -Now we are ready to submit our normal and tumor `Picard` processing scripts to the O2 cluster. However, we might have a problem. If you managed to go quickly into this lesson from the previous lesson, **your `bwa` alignment scripts may still be running and your SAM files are not complete yet!** +Now we are ready to submit our normal and tumor `GATK`/`Picard` processing scripts to the O2 cluster. However, we might have a problem. If you managed to go quickly into this lesson from the previous lesson, **your `bwa` alignment scripts may still be running and your SAM files are not complete yet!** First, we need to check the status of our `bwa` scripts and we can do this with the command: @@ -661,8 +667,8 @@ squeue --me * **If the only job running is your interactive job,** then it should be time to start your `Picard` processing scripts. You can go ahead and submit your `sbatch` scripts for `Picard` processing with: ``` -sbatch picard_alignment_processing_normal.sbatch -sbatch picard_alignment_processing_tumor.sbatch +sbatch gatk_alignment_processing_normal.sbatch +sbatch gatk_alignment_processing_tumor.sbatch ``` > **NOTE:** These scripts will take ~2 hours for each to run! 
From 20c08204d8f9d349b6ab4e9fd2cc1602c6d6ce85 Mon Sep 17 00:00:00 2001 From: Will Gammerdinger Date: Thu, 15 May 2025 11:03:33 -0400 Subject: [PATCH 10/22] Updated Dictionary creation to gatk --- lessons/07_variant_calling.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/lessons/07_variant_calling.md b/lessons/07_variant_calling.md index 5f9cf37..02fcba0 100644 --- a/lessons/07_variant_calling.md +++ b/lessons/07_variant_calling.md @@ -152,7 +152,7 @@ And now, we need to create our variables: ``` # Assign variables -REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa +REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa REFERENCE_DICTIONARY=`echo ${REFERENCE_SEQUENCE%fa}dict` NORMAL_SAMPLE_NAME=syn3_normal NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam @@ -238,21 +238,21 @@ In order to run `MuTect2` we also **need to have a FASTA index file of our refer > >
      > Click here for the commands to create a sequence dictionary -> We can create the required sequence dictionary in Picard. But first, let's double check we have the Picard module loaded: +> We can create the required sequence dictionary in GATK/Picard. But first, let's double-check we have the GATK/Picard module loaded: >
      ->  module load picard/2.27.5
      +> module load gatk/4.6.1.0 > > The command to create the sequence dictionary is:
      >
       >  # YOU DON'T NEED TO RUN THIS
      ->  java -jar $PICARD/picard.jar CreateSequenceDictionary \
      ->  --REFERENCE /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.fa
      ->  --OUTPUT /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.dict
      +> gatk CreateSequenceDictionary \ +> --REFERENCE /n/groups/hbctraining/variant_calling/reference/GRCh38.fa +> --OUTPUT /n/groups/hbctraining/variant_calling/reference/GRCh38.dict > > The components of this command are: ->
      • java -jar $PICARD/picard.jar CreateSequenceDictionary This calls the CreateSequenceDictionary command within Picard
      • ->
      • --REFERENCE /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.fa This is the reference sequence to create the sequence dictionary from.
      • ->
      • --OUTPUT /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.dict This is the output sequence dictionary.
      +>
      • gatk CreateSequenceDictionary This calls the CreateSequenceDictionary command within GATK/Picard
      • +>
      • --REFERENCE /n/groups/hbctraining/variant_calling/reference/GRCh38.fa This is the reference sequence to create the sequence dictionary from.
      • +>
      • --OUTPUT /n/groups/hbctraining/variant_calling/reference/GRCh38.dict This is the output sequence dictionary.
      > > Like indexing, once you have created the sequence dictionary for a reference genome, you won't need to do it again. >
      @@ -273,7 +273,7 @@ In order to run `MuTect2` we also **need to have a FASTA index file of our refer # Load the GATK module module load gatk/4.1.9.0
      # Assign variables -REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa +REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa REFERENCE_DICTIONARY=`echo ${REFERENCE_SEQUENCE%fa}dict` NORMAL_SAMPLE_NAME=syn3_normal NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam From 24e776f5f26981776f7d2825e717ed97d0bf7f20 Mon Sep 17 00:00:00 2001 From: Will Gammerdinger Date: Thu, 15 May 2025 11:36:39 -0400 Subject: [PATCH 11/22] Updated reference --- lessons/03_sequence_alignment_theory.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/lessons/03_sequence_alignment_theory.md b/lessons/03_sequence_alignment_theory.md index 00e8ae0..6660a63 100644 --- a/lessons/03_sequence_alignment_theory.md +++ b/lessons/03_sequence_alignment_theory.md @@ -215,7 +215,7 @@ bwa mem \ -M \ -t 8 \ -R "@RG\tID:syn3_normal\tPL:illumina\tPU:$SAMPLE\tSM:syn3_normal" \ - /n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa \ + /n/groups/hbctraining/variant_calling/reference/GRCh38.fa \ ~/variant_calling/raw_data/syn3_normal_1.fq.gz \ ~/variant_calling/raw_data/syn3_normal_2.fq.gz \ -o /n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam @@ -233,11 +233,11 @@ Another advantage of using `bash` variables in this way is that it can reduce ty ``` # Assign files to bash variables -REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa +REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_normal_1.fq.gz RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz` SAMPLE=`basename $LEFT_READS _1.fq.gz` -SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam +SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.sam ``` > **NOTE:** 
`$RIGHT_READS` uses some `bash` string manipulation in order to swap the final parts of the filename. We also use `basename` to strip the path from a file; when coupled with an argument after the filename, it will trim the end of the filename as well, as we can see with the `$SAMPLE` variable. @@ -273,11 +273,11 @@ bwa mem \ module load gcc/14.2.0 module load bwa/0.7.18
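These two `bash` idioms can be tried in isolation; the path below is made up for illustration:

```shell
# Hypothetical left-read path, for illustration only
LEFT_READS=/home/alice/variant_calling/raw_data/syn3_normal_1.fq.gz

# ${var%pattern} removes the shortest match of pattern from the END of $var,
# so stripping "1.fq.gz" and appending "2.fq.gz" swaps in the mate's filename
RIGHT_READS=${LEFT_READS%1.fq.gz}2.fq.gz

# basename drops the directory part; the optional second argument also trims
# the given suffix, leaving just the sample name
SAMPLE=$(basename "$LEFT_READS" _1.fq.gz)

echo "$RIGHT_READS"
echo "$SAMPLE"
```

The same `%` suffix-trimming trick is what derives the BAM file names (e.g. `${SAM_FILE%sam}query_sorted.bam`) throughout these scripts.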
      # Assign files to bash variables -REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa +REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_normal_1.fq.gz RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz` SAMPLE=`basename $LEFT_READS _1.fq.gz` -SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam
      +SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.sam
      # Align reads with bwa bwa mem \ -M \ @@ -322,11 +322,11 @@ $ sed 's/normal/tumor/g' bwa_alignment_normal.sbatch > bwa_alignment_tumor.sbat module load gcc/14.2.0 module load bwa/0.7.18
      # Assign files to bash variables -REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa +REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_tumor_1.fq.gz RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz` SAMPLE=`basename $LEFT_READS _1.fq.gz` -SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam
      +SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.sam
      # Align reads with bwa bwa mem \ -M \ From 1f3fa183fcd39c7e6a3da05284ffa1992f9199dc Mon Sep 17 00:00:00 2001 From: Will Gammerdinger Date: Thu, 15 May 2025 12:03:33 -0400 Subject: [PATCH 12/22] Updated samtools faidx --- lessons/07_variant_calling.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lessons/07_variant_calling.md b/lessons/07_variant_calling.md index 02fcba0..d0e2fc0 100644 --- a/lessons/07_variant_calling.md +++ b/lessons/07_variant_calling.md @@ -218,8 +218,8 @@ In order to run `MuTect2` we also **need to have a FASTA index file of our refer >
      FASTA index files for reference sequences are a fairly common requirement for a variety of NGS software packages. Picard currently does not provide a way to create a FASTA index file. However, samtools is a very popular tool used for a wide range of BAM/SAM processing tasks, and it also includes functionality for creating FASTA index files. First, we will need to load the gcc and samtools modules: > >
      ->  module load gcc/6.2.0
      ->  module load samtools/1.15.1
      +> module load gcc/14.2.0 +> module load samtools/1.21 > > The command for indexing a FASTA file is straightforward and should run pretty quickly: >
      
      From 7adca19bde657e81f4b1a57b979a63300a842285 Mon Sep 17 00:00:00 2001
      From: Will Gammerdinger 
      Date: Thu, 15 May 2025 12:14:13 -0400
      Subject: [PATCH 13/22] Updated comments
      
      ---
       lessons/04_alignment_file_processing.md | 8 ++++----
       1 file changed, 4 insertions(+), 4 deletions(-)
      
      diff --git a/lessons/04_alignment_file_processing.md b/lessons/04_alignment_file_processing.md
      index b04ed73..1a5020d 100644
      --- a/lessons/04_alignment_file_processing.md
      +++ b/lessons/04_alignment_file_processing.md
      @@ -248,7 +248,7 @@ Go ahead and save and quit. **Don't run it just yet!**
  Click here to see what our final `sbatch` script for the normal sample should look like 
         
       #!/bin/bash
      -# This sbatch script is for processing the alignment output from bwa and preparing it
      +# This sbatch script is for processing the alignment output from bwa and preparing it for use
# Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 @@ -352,7 +352,7 @@ vim samtools_processing_normal.sbatch Next, we are going to need to set up our sbatch submission script with our shebang line, description, sbatch directives, modules to load and file variables.
       #!/bin/bash
      -# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Samtools
      +# This sbatch script is for processing the alignment output from bwa and preparing it for use
      # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 @@ -484,7 +484,7 @@ The final script should look like:
       #!/bin/bash
      -# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Samtools
      +# This sbatch script is for processing the alignment output from bwa and preparing it for use
      # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 @@ -537,7 +537,7 @@ sed 's/normal/tumor/g' samtools_processing_normal.sbatch > samtools_p The final sbatch submission script for the tumor sample should look like:
       #!/bin/bash
      -# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Samtools
      +# This sbatch script is for processing the alignment output from bwa and preparing it for use
      # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-04:00:00 From 14ae02d70622872be7189c69e86443f4583bae81 Mon Sep 17 00:00:00 2001 From: Will Gammerdinger Date: Thu, 15 May 2025 13:20:26 -0400 Subject: [PATCH 14/22] Corrected SAm file name --- lessons/04_alignment_file_processing.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/lessons/04_alignment_file_processing.md b/lessons/04_alignment_file_processing.md index 1a5020d..d491b91 100644 --- a/lessons/04_alignment_file_processing.md +++ b/lessons/04_alignment_file_processing.md @@ -138,7 +138,7 @@ Next, let's define some variables that we will be using: ``` # Assign file paths to variables SAMPLE_NAME=syn3_normal -SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam +SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.sam REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/gatk/${SAMPLE_NAME}/ QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam` @@ -260,7 +260,7 @@ Go ahead and save and quit. **Don't run it just yet!** module load gatk/4.6.1.0
      # Assign file paths to variables SAMPLE_NAME=syn3_normal -SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam +SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.sam REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/gatk/${SAMPLE_NAME}/ QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam` @@ -364,7 +364,7 @@ Next, we are going to need to set-up our sbatch submission script w module load gcc/14.2.0 module load samtools/1.21
      # Assign file paths to variables -SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam +SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.sam QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` FIXMATE_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}fixmates.bam` COORDINATE_SORTED_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}coordinate_sorted.bam` @@ -496,7 +496,7 @@ The final script should look like: module load gcc/14.2.0 module load samtools/1.21
      # Assign file paths to variables -SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam +SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.sam QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` FIXMATE_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}fixmates.bam` COORDINATE_SORTED_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}coordinate_sorted.bam` @@ -549,7 +549,7 @@ The final sbatch submission script for the tumor sample should look module load gcc/14.2.0 module load samtools/1.15.1
      # Assign file paths to variables -SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.p7.sam +SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.sam QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` FIXMATE_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}fixmates.bam` COORDINATE_SORTED_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}coordinate_sorted.bam` @@ -624,7 +624,7 @@ _As a result your tumor `GATK`/`Picard` alignment processing script should look module load gatk/4.6.1.0
      # Assign file paths to variables SAMPLE_NAME=syn3_tumor -SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam +SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.sam REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/picard/${SAMPLE_NAME}/ QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam` REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam` From 698ea7f4b5761a2d81f879cec12eebfbe830deb2 Mon Sep 17 00:00:00 2001 From: Will Gammerdinger Date: Thu, 15 May 2025 13:29:11 -0400 Subject: [PATCH 15/22] Updated packages and reference genome --- lessons/05_alignment_QC.md | 86 +++++++++++++++++++------------------- 1 file changed, 43 insertions(+), 43 deletions(-) diff --git a/lessons/05_alignment_QC.md b/lessons/05_alignment_QC.md index c0009f5..174c8b2 100644 --- a/lessons/05_alignment_QC.md +++ b/lessons/05_alignment_QC.md @@ -8,7 +8,7 @@ Approximate time: 30 minutes ## Learning Objectives -- Verify alignment rates using `Picard` +- Verify alignment rates using `GATK`/`Picard` - Merge `Picard` QC metrics with `FastQC` metrics using `MultiQC` ## Collecting Alignment Statistics @@ -19,7 +19,7 @@ The next step of QC is where we need to evaluate the quality of the alignments.

      -We are going to use `Picard` once again in order to collect our alignment statistics. `Picard` has many packages for collecting different types of data, but the one we will be using is [`CollectAlignmentSummaryMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360040507751-CollectAlignmentSummaryMetrics-Picard). This tool takes a **SAM/BAM file input** and **produces metrics** (in a tab delimited `.txt` file) detailing the quality of the read alignments. _Note that these quality filters are specific to Illumina data._ +We are going to use `GATK`/`Picard` once again in order to collect our alignment statistics. `GATK`/`Picard` has many packages for collecting different types of data, but the one we will be using is [`CollectAlignmentSummaryMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360040507751-CollectAlignmentSummaryMetrics-Picard). This tool takes a **SAM/BAM file input** and **produces metrics** (in a tab delimited `.txt` file) detailing the quality of the read alignments. 
_Note that these quality filters are specific to Illumina data._ Some examples of metrics reported include (but, are not limited to): @@ -44,45 +44,45 @@ Let's start creating an `sbatch` script for collecting metrics: ``` cd ~/variant_calling/scripts/ -vim picard_metrics_normal.sbatch +vim gatk_metrics_normal.sbatch ``` First, we need to add our shebang line, description and `sbatch` directives to the script: ``` #!/bin/bash -# This sbatch script is for collecting alignment metrics using Picard +# This sbatch script is for collecting alignment metrics using GATK # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-00:30:00 #SBATCH -c 1 #SBATCH --mem 16G -#SBATCH -o picard_metrics_normal_%j.out -#SBATCH -e picard_metrics_normal_%j.err +#SBATCH -o gatk_metrics_normal_%j.out +#SBATCH -e gatk_metrics_normal_%j.err ``` -Next, we need to load `Picard`: +Next, we need to load `GATK`: ``` -# Load picard -module load picard/2.27.5 +# Load GATK +module load gatk/4.6.1.0 ``` Next, let's assign our files to variables: ``` # Assign variables -INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam -REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa -OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt +INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam +REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa +OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt ``` -Lastly, we can add the `Picard` command to gather the alignment metrics. +Lastly, we can add the `GATK`/`Picard` command to gather the alignment metrics. 
``` -# Run Picard CollectAlignmentSummaryMetrics -java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ +# Run GATK/Picard CollectAlignmentSummaryMetrics +gatk CollectAlignmentSummaryMetrics \ --INPUT $INPUT_BAM \ --REFERENCE_SEQUENCE $REFERENCE \ --OUTPUT $OUTPUT_METRICS_FILE @@ -90,9 +90,9 @@ java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ We can breakdown this command into each of its components: -* `java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics` Calls the `CollectAlignmentSummaryMetrics` package from within `Picard` -* `--INPUT $INPUT_BAM` This is the output BAM file from our previous `Picard` alignment processing steps. -* `--REFERENCE_SEQUENCE $REFERENCE` This isn't a required parameter, but `Picard` can do a subset of mismatch-related metrics if this is provided. +* `gatk CollectAlignmentSummaryMetrics` Calls the `CollectAlignmentSummaryMetrics` package from within `GATK`/`Picard` +* `--INPUT $INPUT_BAM` This is the output BAM file from our previous `GATK`/`Picard` alignment processing steps. +* `--REFERENCE_SEQUENCE $REFERENCE` This isn't a required parameter, but `GATK`/`Picard` can do a subset of mismatch-related metrics if this is provided. * `--OUTPUT $OUTPUT_METRICS_FILE` This is the file to write the output metrics to. @@ -102,20 +102,20 @@ Now this script is all set to run! **Go ahead and save and quit.** Click here to see what our final sbatchcode script for collecting the normal sample alignment metrics should look like
       #!/bin/bash
      -# This sbatch script is for collecting alignment metrics using Picard
      +# This sbatch script is for collecting alignment metrics using GATK
      # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-00:30:00 #SBATCH -c 1 #SBATCH --mem 16G -#SBATCH -o picard_metrics_normal_%j.out -#SBATCH -e picard_metrics_normal_%j.err
      -# Load picard -module load picard/2.27.5
      +#SBATCH -o gatk_metrics_normal_%j.out +#SBATCH -e gatk_metrics_normal_%j.err
      +# Load GATK +module load gatk/4.6.1.0
      # Assign variables -INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam -REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa -OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt
      +INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam +REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa +OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
      # Run Picard CollectAlignmentSummaryMetrics java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ --INPUT $INPUT_BAM \ @@ -129,27 +129,27 @@ java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ Now we will want to **create the tumor version of this submission script using `sed`** (as we have done previously): ``` -sed 's/normal/tumor/g' picard_metrics_normal.sbatch > picard_metrics_tumor.sbatch +sed 's/normal/tumor/g' gatk_metrics_normal.sbatch > gatk_metrics_tumor.sbatch ```
      Click here to see what our final `sbatch` script for collecting the tumor sample alignment metrics should look like
       #!/bin/bash
      -# This sbatch script is for collecting alignment metrics using Picard
      +# This sbatch script is for collecting alignment metrics using GATK
      # Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-00:30:00 #SBATCH -c 1 #SBATCH --mem 16G -#SBATCH -o picard_metrics_tumor_%j.out -#SBATCH -e picard_metrics_tumor_%j.err
      -# Load picard -module load picard/2.27.5
      +#SBATCH -o gatk_metrics_tumor_%j.out +#SBATCH -e gatk_metrics_tumor_%j.err
      +# Load GATK +module load gatk/4.6.1.0
      # Assign variables -INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.p7.coordinate_sorted.bam -REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa -OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.p7.CollectAlignmentSummaryMetrics.txt
      +INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.coordinate_sorted.bam +REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa +OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.CollectAlignmentSummaryMetrics.txt
      # Run Picard CollectAlignmentSummaryMetrics java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ --INPUT $INPUT_BAM \ @@ -176,7 +176,7 @@ sbatch picard_metrics_tumor.sbatch ## Collecting Coverage Metrics -Coverage is the average level of alignment for any random locus in the genome. `Picard` also has a package called [`CollectWgsMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard) which is also very nice for collecting data about coverage for alignments. However, **since our data set is whole exome sequencing rather than whole genome sequencing** and thus only compromises about 1-2% of the human genome, average **coverage across the whole genome is not a very useful metric**. However, if one did have whole genome data, then running `CollectWgsMetrics` would be useful. As such, in the dropdown box below we provide the code that you could use to collect this information. +Coverage is the average level of alignment for any random locus in the genome. `GATK`/`Picard` also has a package called [`CollectWgsMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard) which is also very nice for collecting data about coverage for alignments. However, **since our data set is whole exome sequencing rather than whole genome sequencing** and thus only compromises about 1-2% of the human genome, average **coverage across the whole genome is not a very useful metric**. However, if one did have whole genome data, then running `CollectWgsMetrics` would be useful. As such, in the dropdown box below we provide the code that you could use to collect this information.

@@ -185,21 +185,21 @@ Coverage is the average level of alignment for any random locus in the genome.
 
 _Image source: [Coverage analysis from the command line](https://medium.com/ngs-sh/coverage-analysis-from-the-command-line-542ef3545e2c)_

-Click here to find out more on collecting coverage metrics for WGS datasets in Picard
-The tool in Picard used for collecting coverage metrics for WGS datasets is called CollectWgsMetrics. The code used to run CollectWgsMetrics can be found below.
+Click here to find out more on collecting coverage metrics for WGS datasets in GATK/Picard
+The tool in GATK/Picard used for collecting coverage metrics for WGS datasets is called CollectWgsMetrics. The code used to run CollectWgsMetrics can be found below.

   # Assign paths to bash variables
   COORDINATE_SORTED_BAM_FILE=/path/to/sample.coordinate_sorted.bam
   METRICS_OUTPUT_FILE=/home/$USER/variant_calling/reports/picard/sample.CollectWgsMetrics.txt
-  REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
+  REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa

   # Run CollectWgsMetrics
-  java -jar $PICARD/picard.jar CollectWgsMetrics \
+  gatk CollectWgsMetrics \
   --INPUT $COORDINATE_SORTED_BAM_FILE \
   --OUTPUT $METRICS_OUTPUT_FILE \
   --REFERENCE_SEQUENCE $REFERENCE
-• java -jar $PICARD/picard.jar CollectWgsMetrics This calls the CollectWgsMetrics package within Picard
+• gatk CollectWgsMetrics This calls the CollectWgsMetrics package within GATK/Picard
        • --INPUT $COORDINATE_SORTED_BAM_FILE This is the input coordinate-sorted BAM file
        • --OUTPUT $METRICS_OUTPUT_FILE This is the output report file
        • --REFERENCE_SEQUENCE $REFERENCE This is the path to the reference genome that was used for the alignment.
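The coverage idea above can be made concrete with a quick back-of-the-envelope calculation: mean coverage is roughly (number of aligned reads × read length) / target size. A minimal bash sketch — the read count, read length, and genome size below are made-up illustrative values, not numbers from this dataset:

```shell
#!/bin/bash
# Illustrative numbers only -- not from the syn3 dataset
READ_COUNT=600000000      # aligned reads
READ_LENGTH=150           # bp per read
GENOME_SIZE=3200000000    # approximate human genome length in bp

# Mean coverage ~= total aligned bases / genome size (integer division)
MEAN_COVERAGE=$(( READ_COUNT * READ_LENGTH / GENOME_SIZE ))
echo "Approximate mean coverage: ${MEAN_COVERAGE}x"
```

With these made-up numbers the script prints `Approximate mean coverage: 28x`; this is the same quantity `CollectWgsMetrics` reports more rigorously per-position.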
@@ -208,11 +208,11 @@ _Image source: [Coverage analysis from the command line](https://medium.com/ngs-
 
 ## Factors Impacting Alignment
 
-While we mentioned above the various metrics that are computed as part of the Picard command, one of the **most important metrics for your alignment file is the alignment rate**. When aligning high-quality reads to a high-quality reference genome, **one should expect to see alignment rates of 90% or better**. If alignment rates dip below 80-85%, then there could be reason for further inspection.
+While we mentioned above the various metrics that are computed as part of the GATK/Picard command, one of the **most important metrics for your alignment file is the alignment rate**. When aligning high-quality reads to a high-quality reference genome, **one should expect to see alignment rates of 90% or better**. If alignment rates dip below 80-85%, then there could be reason for further inspection.
 
 Alignment rates can vary based upon many factors, including:
 
-- **Quality of reference assembly** - A high-quality assembly like GRCh38.p7 will provide an excellent reference genome for alignment. However, if you were studying an organism with a poorly assembled genome, parts of the reference genome could be missing from the assembly. Therefore, high-quality reads might not align because the reference sequence corresponding to them is missing from the assembly.
+- **Quality of reference assembly** - A high-quality assembly like GRCh38 will provide an excellent reference genome for alignment. However, if you were studying an organism with a poorly assembled genome, parts of the reference genome could be missing from the assembly. Therefore, high-quality reads might not align because the reference sequence corresponding to them is missing from the assembly.
- **Quality of libraries** - If the library generation was poor and there wasn't enough input DNA, then your sequencing could be filled with low-quality reads
- **Quality of the reads** - If the reads are poor quality, then it can make alignment more uncertain. If your `FASTQC` report shows any anomalous signs, contact your sequencing center for support.
- **Contamination** - If your samples are contaminated, then it can also skew your alignment. For example, if your samples were heavily contaminated with some bacteria, then much of what you will sequence will be bacterial DNA and not your sample DNA. As a result, most of the sequence reads will not align to your target sequence. If you suspect contamination might be the source of a poor alignment, you could consider running [Kraken](https://ccb.jhu.edu/software/kraken/) to evaluate the levels of contamination in your samples.

From 0dfa57de47f489a2424142a08131aa35f4df5dac Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Thu, 15 May 2025 13:37:58 -0400
Subject: [PATCH 16/22] Update scripts

---
 lessons/05_alignment_QC.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lessons/05_alignment_QC.md b/lessons/05_alignment_QC.md
index 174c8b2..d9cd964 100644
--- a/lessons/05_alignment_QC.md
+++ b/lessons/05_alignment_QC.md
@@ -51,7 +51,7 @@ First, we need to add our shebang line, description and `sbatch` directives to t
 
 ```
 #!/bin/bash
-# This sbatch script is for collecting alignment metrics using GATK
+# This sbatch script is for collecting alignment metrics using GATK
 
 # Assign sbatch directives
 #SBATCH -p priority
@@ -116,8 +116,8 @@ module load gatk/4.6.1.0
 INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam
 REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
 OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
-# Run Picard CollectAlignmentSummaryMetrics
-java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
+# Run GATK CollectAlignmentSummaryMetrics
+gatk CollectAlignmentSummaryMetrics \
 --INPUT $INPUT_BAM \
 --REFERENCE_SEQUENCE $REFERENCE \
 --OUTPUT $OUTPUT_METRICS_FILE
@@ -150,8 +150,8 @@ module load gatk/4.6.1.0
 INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.coordinate_sorted.bam
 REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
 OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.CollectAlignmentSummaryMetrics.txt
-# Run Picard CollectAlignmentSummaryMetrics
-java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
+# Run GATK CollectAlignmentSummaryMetrics
+gatk CollectAlignmentSummaryMetrics \
 --INPUT $INPUT_BAM \
 --REFERENCE_SEQUENCE $REFERENCE \
 --OUTPUT $OUTPUT_METRICS_FILE

From fe612e4f5f23db48c712070768595ff8c756ab4b Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Thu, 15 May 2025 14:10:54 -0400
Subject: [PATCH 17/22] Updated script

---
 lessons/05_alignment_QC.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lessons/05_alignment_QC.md b/lessons/05_alignment_QC.md
index d9cd964..c9bf347 100644
--- a/lessons/05_alignment_QC.md
+++ b/lessons/05_alignment_QC.md
@@ -81,7 +81,7 @@ OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn
 Lastly, we can add the `GATK`/`Picard` command to gather the alignment metrics.
 
 ```
-# Run GATK/Picard CollectAlignmentSummaryMetrics
+# Run GATK CollectAlignmentSummaryMetrics
 gatk CollectAlignmentSummaryMetrics \
 --INPUT $INPUT_BAM \
 --REFERENCE_SEQUENCE $REFERENCE \

From 0c3ed20de0f24172962cb489518c71a29ee172b3 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Fri, 16 May 2025 13:21:27 -0400
Subject: [PATCH 18/22] Fixed path

---
 lessons/05_alignment_QC.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/lessons/05_alignment_QC.md b/lessons/05_alignment_QC.md
index c9bf347..27aa100 100644
--- a/lessons/05_alignment_QC.md
+++ b/lessons/05_alignment_QC.md
@@ -75,7 +75,7 @@ Next, let's assign our files to variables:
 
 # Assign variables
 INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam
 REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
-OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
+OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
 ```
 
 Lastly, we can add the `GATK`/`Picard` command to gather the alignment metrics.
@@ -115,7 +115,7 @@ module load gatk/4.6.1.0
 # Assign variables
 INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam
 REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
-OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
        +OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
 # Run GATK CollectAlignmentSummaryMetrics
 gatk CollectAlignmentSummaryMetrics \
 --INPUT $INPUT_BAM \
@@ -149,7 +149,7 @@ module load gatk/4.6.1.0
 # Assign variables
 INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.coordinate_sorted.bam
 REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
-OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.CollectAlignmentSummaryMetrics.txt
        +OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_tumor/syn3_tumor_GRCh38.CollectAlignmentSummaryMetrics.txt
 # Run GATK CollectAlignmentSummaryMetrics
 gatk CollectAlignmentSummaryMetrics \
 --INPUT $INPUT_BAM \
@@ -166,11 +166,11 @@ Before we submit our jobs, let's **check the status of our previous `Picard` ali
 squeue --me
 ```
 
-* **If your `Picard` alignment processing steps are completed**, and you have the required input files then you can submit these jobs to collect alignment metrics:
+* **If your `GATK`/`Picard` alignment processing steps are completed**, and you have the required input files then you can submit these jobs to collect alignment metrics:
 
 ```bash
-sbatch picard_metrics_normal.sbatch
-sbatch picard_metrics_tumor.sbatch
+sbatch gatk_metrics_normal.sbatch
+sbatch gatk_metrics_tumor.sbatch
 ```
 
 > **NOTE:** Each of these scripts should only take about 15 minutes to run.

From b2b4675c9aadc1b435c8b541d276e05d116315f8 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Fri, 16 May 2025 14:12:40 -0400
Subject: [PATCH 19/22] Updated reference

---
 lessons/07_variant_calling.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/lessons/07_variant_calling.md b/lessons/07_variant_calling.md
index d0e2fc0..b751f8f 100644
--- a/lessons/07_variant_calling.md
+++ b/lessons/07_variant_calling.md
@@ -155,26 +155,26 @@ And now, we need to create our variables:
 REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
 REFERENCE_DICTIONARY=`echo ${REFERENCE_SEQUENCE%fa}dict`
 NORMAL_SAMPLE_NAME=syn3_normal
-NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
+NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.coordinate_sorted.bam
 TUMOR_SAMPLE_NAME=syn3_tumor
-TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
-VCF_OUTPUT_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_${NORMAL_SAMPLE_NAME}_${TUMOR_SAMPLE_NAME}_GRCh38.p7-raw.vcf
+TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.coordinate_sorted.bam
+VCF_OUTPUT_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_${NORMAL_SAMPLE_NAME}_${TUMOR_SAMPLE_NAME}_GRCh38.raw.vcf
 ```
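The scratch paths in these variables lean on bash substring expansion: `${USER:0:1}` extracts the first letter of the username, which O2 uses as a sharding subdirectory under `/n/scratch/users/`. A quick sketch of the expansion, using the made-up username `jd123` for illustration:

```shell
#!/bin/bash
# ${VAR:offset:length} is bash substring expansion;
# ${USER_EXAMPLE:0:1} takes 1 character starting at offset 0.
USER_EXAMPLE=jd123   # hypothetical username, for illustration only
echo "/n/scratch/users/${USER_EXAMPLE:0:1}/${USER_EXAMPLE}/variant_calling"
```

This prints `/n/scratch/users/j/jd123/variant_calling`, matching the layout the scripts above assume for the real `$USER`.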
Click here if you used samtools instead of Picard to process the alignment files

Very little needs to be changed in these variables to adapt them for the samtools output. However, the file created by samtools ends in .removed_duplicates.bam rather than .coordinate_sorted.bam. As a result, we need to edit the variables a bit. Change:
        -  NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
        + NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.coordinate_sorted.bam
      To:
      -  NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.removed_duplicates.bam
      + NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.removed_duplicates.bam
      And also change:
      -  TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
      + TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.coordinate_sorted.bam
      To:
      -  TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.p7.removed_duplicates.bam
      + TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.removed_duplicates.bam
After those changes have been made, the rest of the script should be the same.
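The `REFERENCE_DICTIONARY` variable in the script above is derived from the FASTA path with bash suffix stripping: `${REFERENCE_SEQUENCE%fa}` removes the shortest trailing match of `fa`, and appending `dict` swaps the extension. A quick sketch of how that expansion behaves:

```shell
#!/bin/bash
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa

# ${VAR%pattern} deletes the shortest suffix matching the pattern,
# so "GRCh38.fa" becomes "GRCh38." and appending "dict" gives "GRCh38.dict"
REFERENCE_DICTIONARY=${REFERENCE_SEQUENCE%fa}dict
echo "$REFERENCE_DICTIONARY"
```

This prints `/n/groups/hbctraining/variant_calling/reference/GRCh38.dict`, the sequence-dictionary path that sits alongside the FASTA (no `echo` subshell is actually needed for the expansion itself).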
      @@ -276,10 +276,10 @@ module load gatk/4.1.9.0
 REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
 REFERENCE_DICTIONARY=`echo ${REFERENCE_SEQUENCE%fa}dict`
 NORMAL_SAMPLE_NAME=syn3_normal
-NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
+NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.coordinate_sorted.bam
 TUMOR_SAMPLE_NAME=syn3_tumor
-TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
-VCF_OUTPUT_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_${NORMAL_SAMPLE_NAME}_${TUMOR_SAMPLE_NAME}_GRCh38.p7-raw.vcf
+TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.coordinate_sorted.bam
+VCF_OUTPUT_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_${NORMAL_SAMPLE_NAME}_${TUMOR_SAMPLE_NAME}_GRCh38.raw.vcf
 # Run MuTect2
 gatk Mutect2 \
 --sequence-dictionary $REFERENCE_DICTIONARY \

From 721a16cf4aff36d9d2b18189d0daabbffcbcbb41 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Mon, 19 May 2025 09:18:20 -0400
Subject: [PATCH 20/22] Fixed typo

---
 lessons/05_alignment_QC.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lessons/05_alignment_QC.md b/lessons/05_alignment_QC.md
index 27aa100..23125ec 100644
--- a/lessons/05_alignment_QC.md
+++ b/lessons/05_alignment_QC.md
@@ -9,7 +9,7 @@ Approximate time: 30 minutes
 ## Learning Objectives
 
 - Verify alignment rates using `GATK`/`Picard`
-- Merge `Picard` QC metrics with `FastQC` metrics using `MultiQC`
+- Merge `GATK`/`Picard` QC metrics with `FastQC` metrics using `MultiQC`
 
 ## Collecting Alignment Statistics

From 38e4b8742cda7bd1c8aecf4cc49a9248f4e94f43 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Mon, 19 May 2025 10:19:12 -0400
Subject: [PATCH 21/22] Add pip install

---
 lessons/06_aggregate_multiqc.md | 65 ++++++++++++++++++++++++++++-----
 1 file changed, 56 insertions(+), 9 deletions(-)

diff --git a/lessons/06_aggregate_multiqc.md b/lessons/06_aggregate_multiqc.md
index 41c2339..dbcbf03 100644
--- a/lessons/06_aggregate_multiqc.md
+++ b/lessons/06_aggregate_multiqc.md
@@ -23,7 +23,7 @@ The goal of this lesson is to show you how to **combine numerical stats from mul
 One nice feature of `MultiQC` is that it accepts many different file formats. It figures out which format was submitted and tailors the report to that type of analysis. For this workflow we will combine the following QC stats:
 
 * FastQC
-* Alignment QC from Picard
+* Alignment QC from `GATK`/`Picard`
 
 We have already discussed in great detail the FASTQC html report in a [previous lesson](02_fastqc.md). But we haven't yet looked at the output from Picard `CollectAlignmentSummaryMetrics`.
@@ -33,7 +33,7 @@ Once your scripts from the previous lesson have finished running, we can take a
 Let's use `less` to view it:
 
 ```bash
-less ~/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt
+less ~/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
 ```
 
 This is a **tab-delimited file** which contains a header component, followed by a table with many columns. Each column lists a different metric and the associated value for this syn3_normal sample. It is difficult to view this in the terminal and so you can use the screenshot below to see the contents:
@@ -70,14 +70,59 @@ First, we will add our shebang line, description and `sbatch` directives
 #SBATCH -e multiqc_alignment_metrics_%j.err
 ```
 
-Next, we will load our modules:
+Next, we will load our modules and source a virtual environment. We haven't previously sourced a virtual environment, so let's talk about this briefly. HMS-RC, who manages the O2 cluster, would like to delegate some of the tools that rely on a `pip3 install` installation to users for a handful of reasons, including giving them the freedom to manage versions themselves. We have installed a version of `MultiQC` to use for the workshop in our space. However, when you do your analysis with your own data, **you will need to create your own environment to source** for two reasons:
+
+1) We may not always be using the most current version of the tool and you should be trying to use that whenever possible
+2) We may upgrade our version at some point and you may cite the wrong version number if you aren't careful
+
+For instructions on how to install a virtual environment, please use the dropdown menu below.
+
+ Click here to see how to install multiqc in a virtual environment
+ First, navigate to where you would like to install your multiqc virtual environment:
+
      + cd /path/to/install/multiqc/
+ Next, you will need to load python and the gcc module that it is compiled against. Technically, you could compile it against the base version of python that is on the RedHat operating system; however, it is a better practice to compile it against a module as it can give you a bit more flexibility. +
      + module load gcc/14.2.0
      + module load python/3.13.1
      + Then, we will open a virtual environment in python with the virtualenv command and name it multiqc_env, but you can name it whatever you would like: +
      + virtualenv multiqc_env
      +After we have created the virtual environment, we will need to source it: +
      +source multiqc_env/bin/activate
      +
      NOTE: If you named it something other than multiqc_env, then you will need to use that name in the above line instead of multiqc_env.

      +
      NOTE: Now that you have activated your virtual environment, your command-line should be preceded by (multiqc_env). This represents the virtual environment that you're in and you should try to only be in one virtual environment at a time otherwise, you may run into dependency conflicts.

      +Now we will install multiqc using a pip3 install command: +
      +pip3 install multiqc
+After it finishes installing, you can test that it works with: +
      +multiqc --help
      +You can now exit the virtual environment with the deactivate command: +
      +deactivate
+In the future, if you would like to use this virtual environment, you will need to be sure to load the same version of python that the virtual environment was built within and then source the virtual environment: +
      + module load gcc/14.2.0
      + module load python/3.13.1
      + source /path/to/install/multiqc/multiqc_env/bin/activate
+If you would like to update your version of multiqc, then you will need to activate your virtual environment and use: +
      + pip3 install --upgrade multiqc
      +More information on managing your personal python packages can be found on HMS-RC's website. +
      +
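To see what sourcing an environment actually does, here is a minimal sketch using python's built-in `venv` module, which behaves like `virtualenv` for this purpose: activation sets `$VIRTUAL_ENV` and prepends the environment's `bin/` directory to `PATH`, so `python` and `pip3` resolve inside the environment. `demo_env` is a throwaway name for illustration:

```shell
#!/bin/bash
set -e
# Create and activate a throwaway environment, then inspect what changed
python3 -m venv demo_env
source demo_env/bin/activate

echo "VIRTUAL_ENV is set to: $VIRTUAL_ENV"
command -v python    # now resolves to demo_env/bin/python

# Leaving the environment restores the original PATH
deactivate
rm -rf demo_env
```

This is why only one environment should be active at a time: each activation stacks another `bin/` directory onto `PATH`, and whichever comes first wins.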
```
# Load modules
-module load gcc/9.2.0
-module load multiqc/1.21
+module load gcc/14.2.0
+module load python/3.13.1
+
+# Source an environment to use
+source /n/groups/hbctraining/workshop_environments/variant_analysis/multiqc_env/bin/activate
 ```
 
-> NOTE: `MultiQC` version 1.12 requires `gcc/9.2.0` on the O2 cluster.
 
 Next, we will assign our variables:
@@ -86,7 +131,7 @@ Next, we will assign our variables:
 REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/
 NORMAL_SAMPLE_NAME=syn3_normal
 TUMOR_SAMPLE_NAME=syn3_tumor
-REFERENCE=GRCh38.p7
+REFERENCE=GRCh38
 NORMAL_PICARD_METRICS=${REPORTS_DIRECTORY}picard/${NORMAL_SAMPLE_NAME}/${NORMAL_SAMPLE_NAME}_${REFERENCE}.CollectAlignmentSummaryMetrics.txt
 TUMOR_PICARD_METRICS=${REPORTS_DIRECTORY}picard/${TUMOR_SAMPLE_NAME}/${TUMOR_SAMPLE_NAME}_${REFERENCE}.CollectAlignmentSummaryMetrics.txt
 NORMAL_FASTQC_1=${REPORTS_DIRECTORY}fastqc/${NORMAL_SAMPLE_NAME}_1_fastqc.zip
@@ -130,8 +175,10 @@ multiqc \
 #SBATCH -o multiqc_alignment_metrics_%j.out
 #SBATCH -e multiqc_alignment_metrics_%j.err
 # Load modules
-module load gcc/9.2.0
-module load multiqc/1.21
+module load gcc/14.2.0
+module load python/3.13.1
+# Source an environment to use
+source /n/groups/hbctraining/workshop_environments/variant_analysis/multiqc_env/bin/activate
 # Assign variables
 REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/
 NORMAL_SAMPLE_NAME=syn3_normal

From 12a72a813e3d872495578dae29087b5e69bb0fab Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Mon, 19 May 2025 10:20:14 -0400
Subject: [PATCH 22/22] Fixed typo

---
 lessons/06_aggregate_multiqc.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lessons/06_aggregate_multiqc.md b/lessons/06_aggregate_multiqc.md
index dbcbf03..7222aa8 100644
--- a/lessons/06_aggregate_multiqc.md
+++ b/lessons/06_aggregate_multiqc.md
@@ -59,7 +59,7 @@ First, we will add our shebang line, description and `sbatch` directives
 
 ```
 #!/bin/bash
-# This sbatch script is for collating alignment metrics from FastQC and Picard using MultiQC
+# This sbatch script is for collating alignment metrics from FastQC and GATK/Picard using MultiQC
 
 # Assign sbatch directives
 #SBATCH -p priority
@@ -166,7 +166,7 @@ multiqc \
 Click here to see what our final sbatch script for running multiqc should look like
       #!/bin/bash
      -# This sbatch script is for collating alignment metrics from FastQC and Picard using MultiQC
      +# This sbatch script is for collating alignment metrics from FastQC and GATK/Picard using MultiQC
 # Assign sbatch directives
 #SBATCH -p priority
 #SBATCH -t 0-00:10:00