From d48e8575fa6cdac0c0806568417f7ef016ea1cab Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
# Load modules
-module load gcc/6.2.0
-module load bwa/0.7.17
+module load gcc/14.2.0
+module load bwa/0.7.18
# Assign files to bash variables
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_normal_1.fq.gz
@@ -319,8 +319,8 @@ $ sed 's/normal/tumor/g' bwa_alignment_normal.sbatch > bwa_alignment_tumor.sbat
#SBATCH -o bwa_alignment_tumor_%j.out
#SBATCH -e bwa_alignment_tumor_%j.err
# Load modules
-module load gcc/6.2.0
-module load bwa/0.7.17
+module load gcc/14.2.0
+module load bwa/0.7.18
# Assign files to bash variables
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_tumor_1.fq.gz
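The hunk header above shows the tumor script being generated from the normal script with `sed 's/normal/tumor/g'`. As a minimal sketch of what that substitution does, the snippet below applies it to two lines written the way the normal-sample script would have them. The variable names `NORMAL_LINES` and `TUMOR_LINES` are ours, for illustration only.

```bash
#!/bin/bash
# Two lines as they would appear in the normal-sample script
# (single quotes keep $USER literal, exactly as it is written in the script)
NORMAL_LINES='#SBATCH -o bwa_alignment_normal_%j.out
LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_normal_1.fq.gz'

# s/normal/tumor/g replaces every occurrence of "normal" on each line
TUMOR_LINES=$(printf '%s\n' "$NORMAL_LINES" | sed 's/normal/tumor/g')

printf '%s\n' "$TUMOR_LINES"
```

Because the substitution is purely textual and global, it rewrites every occurrence of the word "normal", including any that appear in comments.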
From d8049fee38ed51fcfb91e0911ad28361950a6704 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
+module load snpEff/5.2f
# Assign variables
REPORTS_DIRECTORY=/home/$USER/variant_calling/reports/snpeff/
SAMPLE_NAME=mutect2_syn3_normal_syn3_tumor
From f92ce818b2c4758b89cc720cdc84a4daae7bb659 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
`bcftools`
-module load gcc/9.2.0
-module load bcftools/1.14
+module load gcc/14.2.0
+module load bcftools/1.21
Assuming we have already indexed our dbSNP VCF file, the first thing that we are going to need to do is compress the VCF file that we wish to annotate with:
@@ -346,8 +346,8 @@ Let's explain each part of this command:
#SBATCH -o variant_annotation_syn3_normal_syn3_tumor_%j.out
#SBATCH -e variant_annotation_syn3_normal_syn3_tumor_%j.err
# Load modules
-module load gcc/9.2.0
-module load bcftools/1.14
+module load gcc/14.2.0
+module load bcftools/1.21
module load snpEff/5.2f
# Assign variables
REPORTS_DIRECTORY=/home/$USER/variant_calling/reports/snpeff/
From 81d0bcc947d5c32be414f1282246f0fcb916ff4a Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
`tabix`, which is part of the `HTSlib` module. First, we will need to load the `HTSlib` module, which also requires us to load the `gcc` module as well:
-module load gcc/9.2.0
-module load htslib/1.14
+module load gcc/14.2.0
+module load htslib/1.21
In order to index our dbSNP file using `tabix`, we just need to run the following command:
From bac022fcce1b3760e36af3e880c043dd18a05c58 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
`Picard` and `samtools`. In the dropdowns below we will outline each method:
> Picard
`Picard` module:
+> GATK
`gatk` module:
>
-> module load picard/2.27.5
+> module load gatk/4.6.1.0
> We can define our variables as:
>
> INPUT_BAM_1=Read_group_1.bam
@@ -47,12 +53,12 @@ The alignment files that come from `bwa` are raw alignment and need some process
> MERGED_OUTPUT=Merged_output.bam
> Here is the command we would need to run to merge the SAM/BAM files:
>
-> java -jar $PICARD/picard.jar MergeSamFiles \
+> gatk MergeSamFiles \
> --INPUT $INPUT_BAM_1 \
> --INPUT $INPUT_BAM_2 \
> --OUTPUT $MERGED_OUTPUT
> We can break down this command:
-> `java -jar $PICARD/picard.jar MergeSamFiles` This calls the `MergeSamFiles` from within `Picard`
+> `gatk MergeSamFiles` This calls the `MergeSamFiles` from within `gatk`
> `--INPUT $INPUT_BAM_1` This is the first SAM/BAM file that we would like to merge.
> `--INPUT $INPUT_BAM_2` This is the second SAM/BAM file that we would like to merge. We can continue to add `--INPUT` lines as needed.
> `--OUTPUT $MERGED_OUTPUT` This is the output merged SAM/BAM file
samtools
`samtools` module, which also requires `gcc` to be loaded:
>
-> module load gcc/6.2.0
-> module load samtools/1.15.1
+> module load gcc/14.2.0
+> module load samtools/1.21
> We can define our variables as:
>
> INPUT_BAM_1=Read_group_1.bam
@@ -77,18 +83,18 @@ The alignment files that come from `bwa` are raw alignment and need some process
> $INPUT_BAM_1 \
> $INPUT_BAM_2 \
> --output-fmt BAM \
-> -@ $THREADS
+> --threads $THREADS
> We can break down this command:
>
> `samtools merge` This calls the `merge` package within `samtools`.
> `-o $MERGED_OUTPUT` This is the merged output file.
> `$INPUT_BAM_1` This is the first SAM/BAM file that we would like to merge.
> `$INPUT_BAM_2` This is the second SAM/BAM file that we would like to merge. We can continue to add additional input SAM/BAM files to this list as needed.
> `--output-fmt BAM` This specifies the output format as `BAM`. If for some reason you wanted a `SAM` output file then you would use `--output-fmt SAM` instead.
-> `-@ $THREADS` This specifies the number of threads we want to use for this process. We are using 8 threads in this example, but this could be different depending on the parameters that you would like to use.
+> `--threads $THREADS` This specifies the number of threads we want to use for this process. We are using 8 threads in this example, but this could be different depending on the parameters that you would like to use.
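Rather than hard-coding the thread count, `$THREADS` could be derived from the cores that Slurm actually allocated. This is only a sketch: it assumes the job was submitted with `-c`/`--cpus-per-task`, in which case Slurm sets `SLURM_CPUS_PER_TASK`, and it falls back to a single thread otherwise. The merge command is echoed rather than run, so no `samtools` installation is needed to try it.

```bash
#!/bin/bash
# Use the Slurm allocation when available; default to 1 thread otherwise
THREADS=${SLURM_CPUS_PER_TASK:-1}

# Print the merge command that would be run (illustration only)
echo "samtools merge -o \$MERGED_OUTPUT \$INPUT_BAM_1 \$INPUT_BAM_2 --output-fmt BAM --threads $THREADS"
```

Tying the thread count to the allocation avoids asking `samtools` for more threads than the `sbatch` directives requested.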
@@ -100,32 +106,32 @@ Let's begin by creating a script for alignment processing. Make a new `sbatch` s
```
$ cd ~/variant_calling/scripts/
-$ vim picard_alignment_processing_normal.sbatch
+$ vim gatk_alignment_processing_normal.sbatch
```
As always, we start the `sbatch` script with our shebang line, description of the script and our `sbatch` directives to request appropriate resources from the O2 cluster.
```
#!/bin/bash
-# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Picard
+# This sbatch script is for processing the alignment output from bwa and preparing it for use
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-04:00:00
#SBATCH -c 1
#SBATCH --mem 32G
-#SBATCH -o picard_alignment_processing_normal_%j.out
-#SBATCH -e picard_alignment_processing_normal_%j.err
+#SBATCH -o gatk_alignment_processing_normal_%j.out
+#SBATCH -e gatk_alignment_processing_normal_%j.err
```
-Next we load the `Picard` module:
+Next we load the `GATK` module:
```
# Load module
-module load picard/2.27.5
+module load gatk/4.6.1.0
```
-**Note: `Picard` is software that does NOT require gcc/6.2.0 to also be loaded**
+**Note: `GATK` is software that does NOT require gcc/14.2.0 to also be loaded**
Next, let's define some variables that we will be using:
@@ -133,14 +139,14 @@ Next, let's define some variables that we will be using:
# Assign file paths to variables
SAMPLE_NAME=syn3_normal
SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam
-REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/picard/${SAMPLE_NAME}/
+REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/gatk/${SAMPLE_NAME}/
QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam`
REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam`
METRICS_FILE=${REPORTS_DIRECTORY}/${SAMPLE_NAME}.remove_duplicates_metrics.txt
COORDINATE_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}coordinate_sorted.bam`
```
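The variable assignments above lean on two bash parameter expansions: `${USER:0:1}` takes the first character of the user name, and `${SAM_FILE%sam}` strips the trailing `sam` so that a new suffix can be appended. Below is a minimal sketch with a made-up user name; `jdoe` and `EXAMPLE_USER` are illustrative and not part of the lesson. It assigns the result directly, which gives the same value here as wrapping the expansion in backticks and `echo`.

```bash
#!/bin/bash
# Hypothetical user name, used instead of $USER so the output is predictable
EXAMPLE_USER=jdoe
SAMPLE_NAME=syn3_normal

# ${EXAMPLE_USER:0:1} expands to the first character of the name ("j")
SAM_FILE=/n/scratch/users/${EXAMPLE_USER:0:1}/${EXAMPLE_USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam

# ${SAM_FILE%sam} removes the trailing "sam", leaving the final dot in place
QUERY_SORTED_BAM_FILE=${SAM_FILE%sam}query_sorted.bam

echo "$SAM_FILE"
echo "$QUERY_SORTED_BAM_FILE"
```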
-Finally, we can also make a directory to hold the `Picard` reports:
+Finally, we can also make a directory to hold the `GATK`/`Picard` reports:
```
# Make reports directory
@@ -155,7 +161,7 @@ As you might suspect, because SAM files hold alignment information for all of th
### 2. Query-sort alignment file
-Alignment files are initially ordered by the order of the reads in the FASTQ file, which is not particularly useful. `Picard` can more exhaustively look for duplicates if the file is sorted by read-name (**query-sorted**). Oftentimes, when people discuss sorted BAM/SAM files, they are referring to **coordinate-sorted** BAM/SAM files.
+Alignment files are initially ordered by the order of the reads in the FASTQ file, which is not particularly useful. `GATK`/`Picard` can more exhaustively look for duplicates if the file is sorted by read-name (**query-sorted**). Oftentimes, when people discuss sorted BAM/SAM files, they are referring to **coordinate-sorted** BAM/SAM files.
- **Query**-sorted BAM/SAM files are sorted based upon their read names and ordered lexicographically
- **Coordinate**-sorted BAM/SAM files are sorted by their aligned sequence name (chromosome/linkage group/scaffold) and position
@@ -164,15 +170,15 @@ Alignment files are initially ordered by the order of the reads in the FASTQ file
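To make "ordered lexicographically" concrete, here is a small sketch that sorts three made-up read names byte by byte with `sort`. The read names are hypothetical, and this only illustrates the ordering itself, not how any particular tool implements its sort.

```bash
#!/bin/bash
# LC_ALL=C forces plain byte-wise (lexicographic) comparison
SORTED_NAMES=$(printf '%s\n' read10 read2 read1 | LC_ALL=C sort)

printf '%s\n' "$SORTED_NAMES"
```

Here `read10` sorts before `read2` because the names are compared character by character rather than numerically.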
`sbatch` code script for the normal sample should look like
#!/bin/bash
-# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Picard
+# This sbatch script is for processing the alignment output from bwa and preparing it
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-04:00:00
#SBATCH -c 1
#SBATCH --mem 32G
-#SBATCH -o picard_alignment_processing_normal_%j.out
-#SBATCH -e picard_alignment_processing_normal_%j.err
+#SBATCH -o gatk_alignment_processing_normal_%j.out
+#SBATCH -e gatk_alignment_processing_normal_%j.err
# Load module
-module load picard/2.27.5
+module load gatk/4.6.1.0
# Assign file paths to variables
SAMPLE_NAME=syn3_normal
SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam
-REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/picard/${SAMPLE_NAME}/
+REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/gatk/${SAMPLE_NAME}/
QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam`
REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam`
METRICS_FILE=${REPORTS_DIRECTORY}/${SAMPLE_NAME}.remove_duplicates_metrics.txt
@@ -263,18 +269,18 @@ COORDINATE_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}coordinate_sorted.bam`
# Make reports directory
mkdir -p $REPORTS_DIRECTORY
# Query-sort alignment file and convert to BAM
-java -jar $PICARD/picard.jar SortSam \
+gatk SortSam \
    --INPUT $SAM_FILE \
    --OUTPUT $QUERY_SORTED_BAM_FILE \
    --SORT_ORDER queryname
# Mark and remove duplicates
-java -jar $PICARD/picard.jar MarkDuplicates \
+gatk MarkDuplicates \
    --INPUT $QUERY_SORTED_BAM_FILE \
    --OUTPUT $REMOVE_DUPLICATES_BAM_FILE \
    --METRICS_FILE $METRICS_FILE \
    --REMOVE_DUPLICATES true
# Coordinate-sort BAM file and create BAM index file
-java -jar $PICARD/picard.jar SortSam \
+gatk SortSam \
    --INPUT $REMOVE_DUPLICATES_BAM_FILE \
    --OUTPUT $COORDINATE_SORTED_BAM_FILE \
    --SORT_ORDER coordinate \
@@ -284,19 +290,19 @@ java -jar $PICARD/picard.jar SortSam \
> #### Do I need to add read groups?
-> Some pipelines will have you add read groups while processing your alignment files. It is usually not necessary because you can typically do it during alignment. **If you are needing to add read groups, we recommend doing it first (before all the processing steps outlined above)**. You can use Picard `AddOrReplaceReadGroups`, which has the added benefit of allowing you to also sort your alignment file (our first step anyways) in the same step as adding the read group information. The dropdown below discusses how to add or replace read groups within `Picard`.
+> Some pipelines will have you add read groups while processing your alignment files. It is usually not necessary because you can typically do it during alignment. **If you are needing to add read groups, we recommend doing it first (before all the processing steps outlined above)**. You can use Picard `AddOrReplaceReadGroups`, which has the added benefit of allowing you to also sort your alignment file (our first step anyways) in the same step as adding the read group information. The dropdown below discusses how to add or replace read groups within `GATK`/`Picard`.
>
>
->Click here if you need to add or replace read groups using `Picard`
-> In order to add or replace read groups, we are going to use `Picard`'s `AddOrReplaceReadGroups` tool. First we would need to load the `Picard` module:
+>Click here if you need to add or replace read groups using `GATK`/`Picard`
+> In order to add or replace read groups, we are going to use `GATK`/`Picard`'s `AddOrReplaceReadGroups` tool. First we would need to load the `GATK` module:
>
> # Load module
-> module load picard/2.27.5
+> module load gatk/4.6.1.0
>
> The general syntax for `AddOrReplaceReadGroups` is:
>
> # Add or replace read group information
-> java -jar $PICARD/picard.jar AddOrReplaceReadGroups \
+> gatk AddOrReplaceReadGroups \
> --INPUT $SAM_FILE \
> --OUTPUT $BAM_FILE \
> --RGID $READ_GROUP_ID \
@@ -305,9 +311,9 @@ java -jar $PICARD/picard.jar SortSam \
> --RGPU $READ_GROUP_PLATFORM_UNIT \
> --RGSM $READ_GROUP_SAMPLE
>
-> `java -jar $PICARD/picard.jar AddOrReplaceReadGroups` This calls the `AddOrReplaceReadGroups` package within `Picard`
-> `--INPUT $SAM_FILE` This is your input file. It could be a BAM/SAM alignment file, but because we recommend doing this first if you need to do it, this would be a SAM file. You don't need to specify that it is a BAM/SAM file, `Picard` will figure that out from the provided extension.
-> `--OUTPUT $BAM_FILE` This would be your output file. It could be BAM/SAM, but you would most likely pick BAM because you'd like to save space on the cluster. You don't need to specify that it is a BAM/SAM file, `Picard` will figure that out from the provided extension.
+> `gatk AddOrReplaceReadGroups` This calls the `AddOrReplaceReadGroups` package within `GATK`/`Picard`
+> `--INPUT $SAM_FILE` This is your input file. It could be a BAM/SAM alignment file, but because we recommend doing this first if you need to do it, this would be a SAM file. You don't need to specify that it is a BAM/SAM file, `GATK`/`Picard` will figure that out from the provided extension.
+> `--OUTPUT $BAM_FILE` This would be your output file. It could be BAM/SAM, but you would most likely pick BAM because you'd like to save space on the cluster. You don't need to specify that it is a BAM/SAM file, `GATK`/`Picard` will figure that out from the provided extension.
> `--RGID $READ_GROUP_ID` This is your read group ID and must be unique
> `--RGLB $READ_GROUP_LIBRARY` This is your read group library
> `--RGPL $READ_GROUP_PLATFORM` This is the platform used for the sequencing
`sbatch` submission script w
#SBATCH -o samtools_processing_normal_%j.out
#SBATCH -e samtools_processing_normal_%j.err
`Picard`, we are going to need to initially query-
# Sort SAM file and convert it to a query name sorted BAM file
samtools sort \
-   -@ 8 \
+   --threads 8 \
    -n \
    -o $QUERY_SORTED_BAM_FILE \
    $SAM_FILE
@@ -382,7 +388,7 @@ The components of this line of code are:
`samtools sort` This calls the sort function within `samtools`.
-`-@ 8` This tells `samtools` to use 8 threads when it multithreads this task. Since we requested 8 cores for this `sbatch` submission, let's go ahead and use them all.
+`--threads 8` This tells `samtools` to use 8 threads when it multithreads this task. Since we requested 8 cores for this `sbatch` submission, let's go ahead and use them all.
`-n` This argument tells `samtools sort` to sort by read name as opposed to the default sorting which is done by coordinate.
# Score mates
samtools fixmate \
-   -@ 8 \
+   --threads 8 \
    -m \
    $QUERY_SORTED_BAM_FILE \
    $FIXMATE_BAM_FILE
@@ -411,7 +417,7 @@ The parts of this command are:
`samtools fixmate` This calls the `fixmate` command in `samtools`
-`-@ 8` This tells `samtools` to use 8 threads when it multithreads this task.
+`--threads 8` This tells `samtools` to use 8 threads when it multithreads this task.
`-m` This will add the mate score tag that will be critically important later for `samtools markdup`
`fixmate` information, we need to coord
# Sort BAM file by coordinate
samtools sort \
-   -@ 8 \
+   --threads 8 \
    -o $COORDINATE_SORTED_BAM_FILE \
    $FIXMATE_BAM_FILE
@@ -448,7 +454,7 @@ Now we are going to mark and remove the duplicate reads:
samtools markdup \
    -r \
    --write-index \
-   -@ 8 \
+   --threads 8 \
    $COORDINATE_SORTED_BAM_FILE \
    ${REMOVED_DUPLICATES_BAM_FILE}##idx##${REMOVED_DUPLICATES_BAM_FILE}.bai
@@ -461,7 +467,7 @@ The components of this command are:
`--write-index` This writes an index file of the output BAM file
-`-@ 8` This sets that we will be using 8 threads
+`--threads 8` This sets that we will be using 8 threads
`$BAM_FILE` This is our input BAM file
`sbatch` submission script for the tumor sample should look
#SBATCH -o samtools_processing_tumor_%j.out
#SBATCH -e samtools_processing_tumor_%j.err
`sbatch` code script for the tumor sample should look like
#!/bin/bash
-# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Picard
+# This sbatch script is for processing the alignment output from bwa and preparing it for use
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-04:00:00
#SBATCH -c 1
#SBATCH --mem 32G
-#SBATCH -o picard_alignment_processing_tumor_%j.out
-#SBATCH -e picard_alignment_processing_tumor_%j.err
+#SBATCH -o gatk_alignment_processing_tumor_%j.out
+#SBATCH -e gatk_alignment_processing_tumor_%j.err
# Load module
-module load picard/2.27.5
+module load gatk/4.6.1.0
# Assign file paths to variables
SAMPLE_NAME=syn3_tumor
SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam
@@ -627,18 +633,18 @@ COORDINATE_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}coordinate_sorted.bam`
# Make reports directory
mkdir -p $REPORTS_DIRECTORY
# Query-sort alignment file and convert to BAM
-java -jar $PICARD/picard.jar SortSam \
+gatk SortSam \
    --INPUT $SAM_FILE \
    --OUTPUT $QUERY_SORTED_BAM_FILE \
    --SORT_ORDER queryname
# Mark and remove duplicates
-java -jar $PICARD/picard.jar MarkDuplicates \
+gatk MarkDuplicates \
    --INPUT $QUERY_SORTED_BAM_FILE \
    --OUTPUT $REMOVE_DUPLICATES_BAM_FILE \
    --METRICS_FILE $METRICS_FILE \
    --REMOVE_DUPLICATES true
# Coordinate-sort BAM file and create BAM index file
-java -jar $PICARD/picard.jar SortSam \
+gatk SortSam \
    --INPUT $REMOVE_DUPLICATES_BAM_FILE \
    --OUTPUT $COORDINATE_SORTED_BAM_FILE \
    --SORT_ORDER coordinate \
@@ -646,9 +652,9 @@ java -jar $PICARD/picard.jar SortSam \
-> We can create the required sequence dictionary in `Picard`. But first, let's double check we have the `Picard` module loaded:
+> We can create the required sequence dictionary in `GATK`/`Picard`. But first, let's double check we have the `GATK`/`Picard` module loaded:
>
-> module load picard/2.27.5
+> module load gatk/4.6.1.0
>
> The command to create the sequence dictionary is:
> # YOU DON'T NEED TO RUN THIS
-> java -jar $PICARD/picard.jar CreateSequenceDictionary \
-> --REFERENCE /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.fa \
-> --OUTPUT /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.dict
+> gatk CreateSequenceDictionary \
+> --REFERENCE /n/groups/hbctraining/variant_calling/reference/GRCh38.fa \
+> --OUTPUT /n/groups/hbctraining/variant_calling/reference/GRCh38.dict
>
> The components of this command are:
-> `java -jar $PICARD/picard.jar CreateSequenceDictionary` This calls the `CreateSequenceDictionary` command within `Picard`
-> `--REFERENCE /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.fa` This is the reference sequence to create the sequence dictionary from.
-> `--OUTPUT /n/groups/hbctraining/variant_calling/reference/GRCh38.p7_genomic.dict` This is the output sequence dictionary.
+> `gatk CreateSequenceDictionary` This calls the `CreateSequenceDictionary` command within `GATK`/`Picard`
+> `--REFERENCE /n/groups/hbctraining/variant_calling/reference/GRCh38.fa` This is the reference sequence to create the sequence dictionary from.
+> `--OUTPUT /n/groups/hbctraining/variant_calling/reference/GRCh38.dict` This is the output sequence dictionary.
`Picard` currently does not feature an ability to create a FASTA index file. However, `samtools` is a very popular tool that is used for a variety of processes for processing BAM/SAM files and it also includes functionality for the creation of FASTA index files. First, we will need to load the `gcc` and `samtools` modules:
>
>
-> module load gcc/6.2.0
-> module load samtools/1.15.1
+> module load gcc/14.2.0
+> module load samtools/1.21
>
> The command for indexing a FASTA file is straightforward and should run pretty quickly:
>
From 7adca19bde657e81f4b1a57b979a63300a842285 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Thu, 15 May 2025 12:14:13 -0400
Subject: [PATCH 13/22] Updated comments
---
 lessons/04_alignment_file_processing.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/lessons/04_alignment_file_processing.md b/lessons/04_alignment_file_processing.md
index b04ed73..1a5020d 100644
--- a/lessons/04_alignment_file_processing.md
+++ b/lessons/04_alignment_file_processing.md
@@ -248,7 +248,7 @@ Go ahead and save and quit. **Don't run it just yet!**
Click here to see what our final `sbatch` code script for the normal sample should look like
#!/bin/bash
-# This sbatch script is for processing the alignment output from bwa and preparing it
+# This sbatch script is for processing the alignment output from bwa and preparing it for use
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-04:00:00
@@ -352,7 +352,7 @@ vim samtools_processing_normal.sbatch
Next, we are going to need to set-up our `sbatch` submission script with our shebang line, description, `sbatch` directives, modules to load and file variables.
#!/bin/bash
-# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Samtools
+# This sbatch script is for processing the alignment output from bwa and preparing it for use
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-04:00:00
@@ -484,7 +484,7 @@ The final script should look like:
#!/bin/bash
-# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Samtools
+# This sbatch script is for processing the alignment output from bwa and preparing it for use
After those changes have been made, the rest of the script should be the same.
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-04:00:00
@@ -537,7 +537,7 @@ sed 's/normal/tumor/g' samtools_processing_normal.sbatch > samtools_p
The final `sbatch` submission script for the tumor sample should look like:
#!/bin/bash
-# This sbatch script is for processing the alignment output from bwa and preparing it for use in GATK using Samtools
+# This sbatch script is for processing the alignment output from bwa and preparing it for use
To:
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-04:00:00
From 14ae02d70622872be7189c69e86443f4583bae81 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Thu, 15 May 2025 13:20:26 -0400
Subject: [PATCH 14/22] Corrected SAm file name
---
 lessons/04_alignment_file_processing.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/lessons/04_alignment_file_processing.md b/lessons/04_alignment_file_processing.md
index 1a5020d..d491b91 100644
--- a/lessons/04_alignment_file_processing.md
+++ b/lessons/04_alignment_file_processing.md
@@ -138,7 +138,7 @@ Next, let's define some variables that we will be using:
```
# Assign file paths to variables
SAMPLE_NAME=syn3_normal
-SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam
+SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.sam
REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/gatk/${SAMPLE_NAME}/
QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam`
REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam`
@@ -260,7 +260,7 @@ Go ahead and save and quit. **Don't run it just yet!**
module load gatk/4.6.1.0
# Assign file paths to variables
SAMPLE_NAME=syn3_normal
-SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam
+SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.sam
REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/gatk/${SAMPLE_NAME}/
QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam`
REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam`
@@ -364,7 +364,7 @@ Next, we are going to need to set-up our `sbatch` submission script w
module load gcc/14.2.0
module load samtools/1.21
# Assign file paths to variables
-SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam
+SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.sam
QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam`
FIXMATE_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}fixmates.bam`
COORDINATE_SORTED_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}coordinate_sorted.bam`
@@ -496,7 +496,7 @@ The final script should look like:
module load gcc/14.2.0
module load samtools/1.21
# Assign file paths to variables
-SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam
+SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.sam
QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam`
FIXMATE_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}fixmates.bam`
COORDINATE_SORTED_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}coordinate_sorted.bam`
@@ -549,7 +549,7 @@ The final `sbatch` submission script for the tumor sample should look
module load gcc/14.2.0
module load samtools/1.15.1
# Assign file paths to variables
-SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.p7.sam
+SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.sam
QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam`
FIXMATE_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}fixmates.bam`
COORDINATE_SORTED_BAM_FILE=`echo ${QUERY_SORTED_BAM_FILE%query_sorted.bam}coordinate_sorted.bam`
@@ -624,7 +624,7 @@ _As a result your tumor `GATK`/`Picard` alignment processing script should look
module load gatk/4.6.1.0
# Assign file paths to variables
SAMPLE_NAME=syn3_tumor
-SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.p7.sam
+SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE_NAME}_GRCh38.sam
REPORTS_DIRECTORY=/home/${USER}/variant_calling/reports/picard/${SAMPLE_NAME}/
QUERY_SORTED_BAM_FILE=`echo ${SAM_FILE%sam}query_sorted.bam`
REMOVE_DUPLICATES_BAM_FILE=`echo ${SAM_FILE%sam}remove_duplicates.bam`
From 698ea7f4b5761a2d81f879cec12eebfbe830deb2 Mon Sep 17 00:00:00 2001
From: Will Gammerdinger
Date: Thu, 15 May 2025 13:29:11 -0400
Subject: [PATCH 15/22] Updated packages and reference genome
---
 lessons/05_alignment_QC.md | 86 +++++++++++++++++++-------------------
 1 file changed, 43 insertions(+), 43 deletions(-)
diff --git a/lessons/05_alignment_QC.md b/lessons/05_alignment_QC.md
index c0009f5..174c8b2 100644
--- a/lessons/05_alignment_QC.md
+++ b/lessons/05_alignment_QC.md
@@ -8,7 +8,7 @@ Approximate time: 30 minutes
## Learning Objectives
-- Verify alignment rates using `Picard`
+- Verify alignment rates using `GATK`/`Picard`
 - Merge `Picard` QC metrics with `FastQC` metrics using `MultiQC`
## Collecting Alignment Statistics
@@ -19,7 +19,7 @@ The next step of QC is where we need to evaluate the quality of the alignments.
-We are going to use `Picard` once again in order to collect our alignment statistics. `Picard` has many packages for collecting different types of data, but the one we will be using is [`CollectAlignmentSummaryMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360040507751-CollectAlignmentSummaryMetrics-Picard). This tool takes a **SAM/BAM file input** and **produces metrics** (in a tab delimited `.txt` file) detailing the quality of the read alignments. _Note that these quality filters are specific to Illumina data._
+We are going to use `GATK`/`Picard` once again in order to collect our alignment statistics. `GATK`/`Picard` has many packages for collecting different types of data, but the one we will be using is [`CollectAlignmentSummaryMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360040507751-CollectAlignmentSummaryMetrics-Picard). This tool takes a **SAM/BAM file input** and **produces metrics** (in a tab delimited `.txt` file) detailing the quality of the read alignments. _Note that these quality filters are specific to Illumina data._
Some examples of metrics reported include (but are not limited to):
@@ -44,45 +44,45 @@ Let's start creating an `sbatch` script for collecting metrics:
```
cd ~/variant_calling/scripts/
-vim picard_metrics_normal.sbatch
+vim gatk_metrics_normal.sbatch
```
First, we need to add our shebang line, description and `sbatch` directives to the script:
```
#!/bin/bash
-# This sbatch script is for collecting alignment metrics using Picard
+# This sbatch script is for collecting alignment metrics using GATK
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:30:00
#SBATCH -c 1
#SBATCH --mem 16G
-#SBATCH -o picard_metrics_normal_%j.out
-#SBATCH -e picard_metrics_normal_%j.err
+#SBATCH -o gatk_metrics_normal_%j.out
+#SBATCH -e gatk_metrics_normal_%j.err
```
-Next, we need to load `Picard`:
+Next, we need to load `GATK`:
```
-# Load picard
-module load picard/2.27.5
+# Load GATK
+module load gatk/4.6.1.0
```
Next, let's assign our files to variables:
```
# Assign variables
-INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam
-REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
-OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt
+INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam
+REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
+OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
```
-Lastly, we can add the `Picard` command to gather the alignment metrics.
+Lastly, we can add the `GATK`/`Picard` command to gather the alignment metrics.
```
-# Run Picard CollectAlignmentSummaryMetrics
-java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
+# Run GATK/Picard CollectAlignmentSummaryMetrics
+gatk CollectAlignmentSummaryMetrics \
    --INPUT $INPUT_BAM \
    --REFERENCE_SEQUENCE $REFERENCE \
    --OUTPUT $OUTPUT_METRICS_FILE
@@ -90,9 +90,9 @@ java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
We can break down this command into each of its components:
-* `java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics` Calls the `CollectAlignmentSummaryMetrics` package from within `Picard`
-* `--INPUT $INPUT_BAM` This is the output BAM file from our previous `Picard` alignment processing steps.
-* `--REFERENCE_SEQUENCE $REFERENCE` This isn't a required parameter, but `Picard` can do a subset of mismatch-related metrics if this is provided.
+* `gatk CollectAlignmentSummaryMetrics` Calls the `CollectAlignmentSummaryMetrics` package from within `GATK`/`Picard`
+* `--INPUT $INPUT_BAM` This is the output BAM file from our previous `GATK`/`Picard` alignment processing steps.
+* `--REFERENCE_SEQUENCE $REFERENCE` This isn't a required parameter, but `GATK`/`Picard` can do a subset of mismatch-related metrics if this is provided.
* `--OUTPUT $OUTPUT_METRICS_FILE` This is the file to write the output metrics to.
@@ -102,20 +102,20 @@ Now this script is all set to run! **Go ahead and save and quit.**
Click here to see what our final
sbatch
code script for collecting the normal sample alignment metrics should look like#!/bin/bash -# This sbatch script is for collecting alignment metrics using PicardAnd also change:
+# This sbatch script is for collecting alignment metrics using GATK
# Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-00:30:00 #SBATCH -c 1 #SBATCH --mem 16G -#SBATCH -o picard_metrics_normal_%j.out -#SBATCH -e picard_metrics_normal_%j.err
-# Load picard -module load picard/2.27.5
+#SBATCH -o gatk_metrics_normal_%j.out +#SBATCH -e gatk_metrics_normal_%j.err
+# Load GATK +module load gatk/4.6.1.0
# Assign variables -INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam -REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa -OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt
+INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam +REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa +OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
# Run Picard CollectAlignmentSummaryMetrics java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ --INPUT $INPUT_BAM \ @@ -129,27 +129,27 @@ java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ Now we will want to **create the tumor version of this submission script using `sed`** (as we have done previously): ``` -sed 's/normal/tumor/g' picard_metrics_normal.sbatch > picard_metrics_tumor.sbatch +sed 's/normal/tumor/g' gatk_metrics_normal.sbatch > gatk_metrics_tumor.sbatch ```Click here to see what our final
sbatch
code script for collecting the tumor sample alignment metrics should look like
#!/bin/bash
-# This sbatch script is for collecting alignment metrics using Picard
+# This sbatch script is for collecting alignment metrics using GATK
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:30:00
#SBATCH -c 1
#SBATCH --mem 16G
-#SBATCH -o picard_metrics_tumor_%j.out
-#SBATCH -e picard_metrics_tumor_%j.err
-# Load picard
-module load picard/2.27.5
+#SBATCH -o gatk_metrics_tumor_%j.out
+#SBATCH -e gatk_metrics_tumor_%j.err
+# Load GATK
+module load gatk/4.6.1.0
# Assign variables
-INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.p7.coordinate_sorted.bam
-REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
-OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.p7.CollectAlignmentSummaryMetrics.txt
+INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.coordinate_sorted.bam
+REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
+OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.CollectAlignmentSummaryMetrics.txt
# Run Picard CollectAlignmentSummaryMetrics
java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
--INPUT $INPUT_BAM \
@@ -176,7 +176,7 @@ sbatch picard_metrics_tumor.sbatch
## Collecting Coverage Metrics

-Coverage is the average level of alignment for any random locus in the genome. `Picard` also has a package called [`CollectWgsMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard) which is also very nice for collecting data about coverage for alignments. However, **since our data set is whole exome sequencing rather than whole genome sequencing** and thus only compromises about 1-2% of the human genome, average **coverage across the whole genome is not a very useful metric**. However, if one did have whole genome data, then running `CollectWgsMetrics` would be useful. As such, in the dropdown box below we provide the code that you could use to collect this information.
+Coverage is the average level of alignment for any random locus in the genome. `GATK`/`Picard` also has a package called [`CollectWgsMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard), which is useful for collecting coverage data for alignments. However, **since our data set is whole exome sequencing rather than whole genome sequencing** and thus only comprises about 1-2% of the human genome, average **coverage across the whole genome is not a very useful metric**. If one did have whole genome data, then running `CollectWgsMetrics` would be useful. As such, in the dropdown box below we provide the code that you could use to collect this information.
@@ -185,21 +185,21 @@ Coverage is the average level of alignment for any random locus in the genome. _Image source: [Coverage analysis from the command line](https://medium.com/ngs-sh/coverage-analysis-from-the-command-line-542ef3545e2c)_
-Click here to find out more on collecting coverage metrics for WGS datasets in
-Picard
The tool inPicard
used for collecting coverage metrics for WGS datasets is calledCollectWgsMetrics
. The code used to runCollectWgsMetrics
can be found below.
+Click here to find out more on collecting coverage metrics for WGS datasets in
+GATK
/Picard
The tool inGATK
/Picard
used for collecting coverage metrics for WGS datasets is calledCollectWgsMetrics
. The code used to runCollectWgsMetrics
can be found below.
# Assign paths to bash variables
COORDINATE_SORTED_BAM_FILE=/path/to/sample.coordinate_sorted.bam
METRICS_OUTPUT_FILE=/home/$USER/variant_calling/reports/picard/sample.CollectWgsMetrics.txt
- $REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
-
+ REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
# Run Picard CollectWgsMetrics \ - java -jar $PICARD/picard.jar CollectWgsMetrics \ + gatk CollectWgsMetrics \ --INPUT $COORDINATE_SORTED_BAM_FILE \ --OUTPUT $METRICS_OUTPUT_FILE \ --REFERENCE_SEQUENCE $REFERENCE
- +
java -jar $PICARD/picard.jar CollectWgsMetrics
This calls theCollectWgsMetrics
package withinPicard
@@ -208,11 +208,11 @@ _Image source: [Coverage analysis from the command line](https://medium.com/ngs-
## Factors Impacting Alignment

-While we mentioned above the various metrics that are computed as part of the Picard command, one of the **most important metrics for your alignment file is the alignment rate**. When aligning high-quality reads to a high quality reference genome, **one should expect to see alignment rates at 90% or better**. If alignment rates dipped below 80-85%, then there could be reason for further inspection.
+While we mentioned above the various metrics that are computed as part of the GATK/Picard command, one of the **most important metrics for your alignment file is the alignment rate**. When aligning high-quality reads to a high-quality reference genome, **one should expect to see alignment rates at 90% or better**. If alignment rates dip below 80-85%, then there could be reason for further inspection.

Alignment rates can vary based upon many factors, including:

-- **Quality of reference assembly** - A high-quality assembly like GRCh38.p7 will provide an excellent reference genome for alignment. However, if you were studying a organism with a poorly assembled genome, parts of the reference genome could be missing from the assembly. Therefore, high-quality reads might not align because they there is missing reference sequence to align to that corresponds to their sequence.
+- **Quality of reference assembly** - A high-quality assembly like GRCh38 will provide an excellent reference genome for alignment. However, if you were studying an organism with a poorly assembled genome, parts of the reference genome could be missing from the assembly. Therefore, high-quality reads might not align because the reference sequence that corresponds to them is missing.
- **Quality of libraries** - If the library generation was poor and there wasn't enough input DNA, then your sequencing could be filled with low-quality reads - **Quality of the reads** - If the reads are poor quality, then it can make alignment more uncertain. If your `FASTQC` report shows any anomalous signs, contact your sequencing center for support. - **Contamination** - If your samples are contaminated, then it can also skew your alignment. For example, if your samples were heavily contaminated with some bacteria, then much of what you will sequence will be bacteria DNA and not your sample DNA. As a result, most of the sequence reads will not align to your target sequence. If you suspect contamination might be the source of a poor alignment, you could consider running [Kraken](https://ccb.jhu.edu/software/kraken/) to evaluate the levels of contamination in your samples. From 0dfa57de47f489a2424142a08131aa35f4df5dac Mon Sep 17 00:00:00 2001 From: Will Gammerdinger
gatk CollectWgsMetrics
This calls theCollectWgsMetrics
package withinGATK
/Picard
--INPUT $COORDINATE_SORTED_BAM_FILE
This is the input coordinate-sorted BAM file--OUTPUT $METRICS_OUTPUT_FILE
This is the output report file--REFERENCE_SEQUENCE $REFERENCE
This is the path to the reference genome that was used for the alignment.Date: Thu, 15 May 2025 13:37:58 -0400 Subject: [PATCH 16/22] Update scripts --- lessons/05_alignment_QC.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/lessons/05_alignment_QC.md b/lessons/05_alignment_QC.md index 174c8b2..d9cd964 100644 --- a/lessons/05_alignment_QC.md +++ b/lessons/05_alignment_QC.md @@ -51,7 +51,7 @@ First, we need to add our shebang line, description and `sbatch` directives to t ``` #!/bin/bash -# This sbatch script is for collecting alignment metrics using GATK +# This sbatch script is for collecting alignment metrics using GATK # Assign sbatch directives #SBATCH -p priority @@ -116,8 +116,8 @@ module load gatk/4.6.1.0
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
-# Run Picard CollectAlignmentSummaryMetrics -java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ +# Run GATK CollectAlignmentSummaryMetrics +gatk CollectAlignmentSummaryMetrics \ --INPUT $INPUT_BAM \ --REFERENCE_SEQUENCE $REFERENCE \ --OUTPUT $OUTPUT_METRICS_FILE @@ -150,8 +150,8 @@ module load gatk/4.6.1.0
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.coordinate_sorted.bam REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.CollectAlignmentSummaryMetrics.txt
-# Run Picard CollectAlignmentSummaryMetrics -java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ +# Run GATK CollectAlignmentSummaryMetrics +gatk CollectAlignmentSummaryMetrics \ --INPUT $INPUT_BAM \ --REFERENCE_SEQUENCE $REFERENCE \ --OUTPUT $OUTPUT_METRICS_FILE From fe612e4f5f23db48c712070768595ff8c756ab4b Mon Sep 17 00:00:00 2001 From: Will GammerdingerDate: Thu, 15 May 2025 14:10:54 -0400 Subject: [PATCH 17/22] Updated script --- lessons/05_alignment_QC.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lessons/05_alignment_QC.md b/lessons/05_alignment_QC.md index d9cd964..c9bf347 100644 --- a/lessons/05_alignment_QC.md +++ b/lessons/05_alignment_QC.md @@ -81,7 +81,7 @@ OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn Lastly, we can add the `GATK`/`Picard` command to gather the alignment metrics. ``` -# Run GATK/Picard CollectAlignmentSummaryMetrics +# Run GATK CollectAlignmentSummaryMetrics gatk CollectAlignmentSummaryMetrics \ --INPUT $INPUT_BAM \ --REFERENCE_SEQUENCE $REFERENCE \ From 0c3ed20de0f24172962cb489518c71a29ee172b3 Mon Sep 17 00:00:00 2001 From: Will Gammerdinger Date: Fri, 16 May 2025 13:21:27 -0400 Subject: [PATCH 18/22] Fixed path --- lessons/05_alignment_QC.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/lessons/05_alignment_QC.md b/lessons/05_alignment_QC.md index c9bf347..27aa100 100644 --- a/lessons/05_alignment_QC.md +++ b/lessons/05_alignment_QC.md @@ -75,7 +75,7 @@ Next, let's assign our files to variables: # Assign variables INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa -OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt 
+OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt ``` Lastly, we can add the `GATK`/`Picard` command to gather the alignment metrics. @@ -115,7 +115,7 @@ module load gatk/4.6.1.0
# Assign variables INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa -OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
+OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
# Run GATK CollectAlignmentSummaryMetrics gatk CollectAlignmentSummaryMetrics \ --INPUT $INPUT_BAM \ @@ -149,7 +149,7 @@ module load gatk/4.6.1.0
# Assign variables INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.coordinate_sorted.bam REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa -OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.CollectAlignmentSummaryMetrics.txt
+OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_tumor/syn3_tumor_GRCh38.CollectAlignmentSummaryMetrics.txt
# Run GATK CollectAlignmentSummaryMetrics gatk CollectAlignmentSummaryMetrics \ --INPUT $INPUT_BAM \ @@ -166,11 +166,11 @@ Before we submit our jobs, let's **check the status of our previous `Picard` ali squeue --me ``` -* **If your `Picard` alignment processing steps are completed**, and you have the required input files then you can submit these jobs to collect alignment metrics: +* **If your `GATK`/`Picard` alignment processing steps are completed**, and you have the required input files then you can submit these jobs to collect alignment metrics: ```bash -sbatch picard_metrics_normal.sbatch -sbatch picard_metrics_tumor.sbatch +sbatch gatk_metrics_normal.sbatch +sbatch gatk_metrics_tumor.sbatch ``` > **NOTE:** Each of these scripts should only take about 15 minutes to run. From b2b4675c9aadc1b435c8b541d276e05d116315f8 Mon Sep 17 00:00:00 2001 From: Will GammerdingerDate: Fri, 16 May 2025 14:12:40 -0400 Subject: [PATCH 19/22] Updated reference --- lessons/07_variant_calling.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/lessons/07_variant_calling.md b/lessons/07_variant_calling.md index d0e2fc0..b751f8f 100644 --- a/lessons/07_variant_calling.md +++ b/lessons/07_variant_calling.md @@ -155,26 +155,26 @@ And now, we need to create our variables: REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa REFERENCE_DICTIONARY=`echo ${REFERENCE_SEQUENCE%fa}dict` NORMAL_SAMPLE_NAME=syn3_normal -NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam +NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.coordinate_sorted.bam TUMOR_SAMPLE_NAME=syn3_tumor -TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam 
-VCF_OUTPUT_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_${NORMAL_SAMPLE_NAME}_${TUMOR_SAMPLE_NAME}_GRCh38.p7-raw.vcf +TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.coordinate_sorted.bam +VCF_OUTPUT_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/vcf_files/mutect2_${NORMAL_SAMPLE_NAME}_${TUMOR_SAMPLE_NAME}_GRCh38.raw.vcf ``` Click here if you used
Very little needs to be edited to these variables to adapt them for thesamtools
instead ofPicard
to process the alignment filessamtools
output. However, the end of the file that was created insamtools
was.removed_duplicates.bam
rather than.coordinate_sorted.bam
. As a result we need to edit the variables a bit. Change:
- NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
+ NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.coordinate_sorted.bam
To:
- NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.p7.removed_duplicates.bam
+ NORMAL_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${NORMAL_SAMPLE_NAME}_GRCh38.removed_duplicates.bam
And also change:
- TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.p7.coordinate_sorted.bam
+ TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.coordinate_sorted.bam
To:
- TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.p7.removed_duplicates.bam
+ TUMOR_BAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${TUMOR_SAMPLE_NAME}_GRCh38.removed_duplicates.bam
multiqc
in a virtual environmentmultiqc
virtual environment:
+ + cd /path/to/install/multiqc/+ Next, you will need to load
python
and the gcc
module that it is compiled against. Technically, you could compile it against the base version of python
that is on the RedHat operating system, however, it is a better practice to compile it against a module as it can give you a bit more flexibility.
+ + module load gcc/14.2.0 + module load python/3.13.1+ Then, we will open a virtual environment in
python
with the virtualenv
command and name it multiqc_env
, but you can name it whatever you would like:
+ + virtualenv multiqc_env+After we have created the virtual environment, we will need to source it: +
+source multiqc_env/bin/activate+
NOTE: If you named it something other thanmultiqc_env
, then you will need to use that name in the above line instead ofmultiqc_env
.
NOTE: Now that you have activated your virtual environment, your command-line should be preceded by (multiqc_env)
. This represents the virtual environment that you're in. You should try to be in only one virtual environment at a time; otherwise, you may run into dependency conflicts.
multiqc
using a pip3 install
command:
++pip3 install multiqc+After it finished installing you can test that it works with: +
+multiqc --help+You can now exit the virtual environment with the
deactivate
command:
++deactivate+In the future, if you would like to use this virtual environment, you will need to be sure to load the same version of
python
that the virtual environment was built within and then source
the virtual environment:
+ + module load gcc/14.2.0 + module load python/3.13.1 + source /path/to/install/multiqc/multiqc_env/bin/activate+If you would like to update your version of
multiqc
, then you will need to activate your virtual environment and use:
+ + pip3 install --upgrade multiqc+More information on managing your personal
python
packages can be found on HMS-RC's website.
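
The activation and deactivation steps above can also be checked from a script rather than by reading the prompt. The following is a minimal sketch, assuming only that the `activate` script created by `virtualenv` exports the `VIRTUAL_ENV` variable and that `deactivate` unsets it. The helper name `active_venv` is hypothetical and not part of this lesson.

```bash
#!/bin/bash
# Hypothetical helper (an illustration, not part of the lesson):
# report which virtual environment, if any, is currently active.
# Assumption: the virtualenv activate script exports VIRTUAL_ENV
# and deactivate unsets it.
active_venv() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        # Print only the directory name of the environment, e.g. multiqc_env
        basename "$VIRTUAL_ENV"
    else
        # No virtual environment has been sourced in this shell
        echo "none"
    fi
}

# Show the current state of this shell
active_venv
```

A check like this could be placed before a `multiqc` call in a submission script, so that a job fails early if the environment was not sourced.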
+ sbatch
code script for running multiqc
should look like#!/bin/bash -# This sbatch script is for collating alignment metrics from FastQC and Picard using MultiQC
+# This sbatch script is for collating alignment metrics from FastQC and GATK/Picard using MultiQC
# Assign sbatch directives #SBATCH -p priority #SBATCH -t 0-00:10:00