Changed version numbers #19

Open
wants to merge 22 commits into
base: main
26 changes: 13 additions & 13 deletions lessons/03_sequence_alignment_theory.md
@@ -200,8 +200,8 @@ Next, we need to **add the modules** that we will be using for alignment:

```
# Load modules
module load gcc/6.2.0
module load bwa/0.7.17
module load gcc/14.2.0
module load bwa/0.7.18
```

> NOTE: On O2, many of the common tools were compiled using `GCC` version 6.2.0, so to be able to access them, we first need to load the `GCC` module.
@@ -215,7 +215,7 @@ bwa mem \
-M \
-t 8 \
-R "@RG\tID:syn3_normal\tPL:illumina\tPU:$SAMPLE\tSM:syn3_normal" \
/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa \
/n/groups/hbctraining/variant_calling/reference/GRCh38.fa \
~/variant_calling/raw_data/syn3_normal_1.fq.gz \
~/variant_calling/raw_data/syn3_normal_2.fq.gz \
-o /n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.sam
@@ -233,11 +233,11 @@ Another advantage of using `bash` variables in this way is that it can reduce ty

```
# Assign files to bash variables
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_normal_1.fq.gz
RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz`
SAMPLE=`basename $LEFT_READS _1.fq.gz`
SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam
SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.sam
```

> **NOTE:** `$RIGHT_READS` uses some `bash` string manipulation in order to swap the last parts of their filename. We also use `basename` to parse out the path from a file and when coupled with an argument after the filename, it will trim the end of the filename as well as we can see with the `$SAMPLE` variable.
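
The string manipulation described in this note can be tried on its own. The following is a minimal sketch; the path under `/home/demo_user` is a made-up placeholder, and `$(...)` is used in place of backticks purely as a stylistic choice:

```bash
# Hypothetical path, used only for illustration
LEFT_READS=/home/demo_user/variant_calling/raw_data/syn3_normal_1.fq.gz

# ${VAR%pattern} removes the shortest match of pattern from the END of the value,
# so "1.fq.gz" is stripped and "2.fq.gz" is appended in its place
RIGHT_READS=${LEFT_READS%1.fq.gz}2.fq.gz

# basename drops the directory part; the optional second argument also drops that suffix
SAMPLE=$(basename "$LEFT_READS" _1.fq.gz)

echo "$RIGHT_READS"
echo "$SAMPLE"
```

Echoing the variables before using them in a real command is a cheap way to confirm that the paths were built as intended.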
@@ -270,14 +270,14 @@ bwa mem \
#SBATCH -o bwa_alignment_normal_%j.out
#SBATCH -e bwa_alignment_normal_%j.err<br>
# Load modules
module load gcc/6.2.0
module load bwa/0.7.17<br>
module load gcc/14.2.0
module load bwa/0.7.18<br>
# Assign files to bash variables
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_normal_1.fq.gz
RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz`
SAMPLE=`basename $LEFT_READS _1.fq.gz`
SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam<br>
SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.sam<br>
# Align reads with bwa
bwa mem \
-M \
@@ -319,14 +319,14 @@ $ sed 's/normal/tumor/g' bwa_alignment_normal.sbatch > bwa_alignment_tumor.sbat
#SBATCH -o bwa_alignment_tumor_%j.out
#SBATCH -e bwa_alignment_tumor_%j.err<br>
# Load modules
module load gcc/6.2.0
module load bwa/0.7.17<br>
module load gcc/14.2.0
module load bwa/0.7.18<br>
# Assign files to bash variables
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
REFERENCE_SEQUENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
LEFT_READS=/home/$USER/variant_calling/raw_data/syn3_tumor_1.fq.gz
RIGHT_READS=`echo ${LEFT_READS%1.fq.gz}2.fq.gz`
SAMPLE=`basename $LEFT_READS _1.fq.gz`
SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.p7.sam<br>
SAM_FILE=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/${SAMPLE}_GRCh38.sam<br>
# Align reads with bwa
bwa mem \
-M \
186 changes: 96 additions & 90 deletions lessons/04_alignment_file_processing.md

Large diffs are not rendered by default.

102 changes: 51 additions & 51 deletions lessons/05_alignment_QC.md
@@ -8,8 +8,8 @@ Approximate time: 30 minutes

## Learning Objectives

- Verify alignment rates using `Picard`
- Merge `Picard` QC metrics with `FastQC` metrics using `MultiQC`
- Verify alignment rates using `GATK`/`Picard`
- Merge `GATK`/`Picard` QC metrics with `FastQC` metrics using `MultiQC`

## Collecting Alignment Statistics

@@ -19,7 +19,7 @@ The next step of QC is where we need to evaluate the quality of the alignments.
<img src="../img/Alignment_QC.png" width="800">
</p>

We are going to use `Picard` once again in order to collect our alignment statistics. `Picard` has many packages for collecting different types of data, but the one we will be using is [`CollectAlignmentSummaryMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360040507751-CollectAlignmentSummaryMetrics-Picard). This tool takes a **SAM/BAM file input** and **produces metrics** (in a tab delimited `.txt` file) detailing the quality of the read alignments. _Note that these quality filters are specific to Illumina data._
We are going to use `GATK`/`Picard` once again in order to collect our alignment statistics. `GATK`/`Picard` has many packages for collecting different types of data, but the one we will be using is [`CollectAlignmentSummaryMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360040507751-CollectAlignmentSummaryMetrics-Picard). This tool takes a **SAM/BAM file input** and **produces metrics** (in a tab delimited `.txt` file) detailing the quality of the read alignments. _Note that these quality filters are specific to Illumina data._

Some examples of metrics reported include (but are not limited to):

@@ -44,55 +44,55 @@ Let's start creating an `sbatch` script for collecting metrics:

```
cd ~/variant_calling/scripts/
vim picard_metrics_normal.sbatch
vim gatk_metrics_normal.sbatch
```

First, we need to add our shebang line, description and `sbatch` directives to the script:

```
#!/bin/bash
# This sbatch script is for collecting alignment metrics using Picard
# This sbatch script is for collecting alignment metrics using GATK

# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:30:00
#SBATCH -c 1
#SBATCH --mem 16G
#SBATCH -o picard_metrics_normal_%j.out
#SBATCH -e picard_metrics_normal_%j.err
#SBATCH -o gatk_metrics_normal_%j.out
#SBATCH -e gatk_metrics_normal_%j.err
```

Next, we need to load `Picard`:
Next, we need to load `GATK`:

```
# Load picard
module load picard/2.27.5
# Load GATK
module load gatk/4.6.1.0
```

Next, let's assign our files to variables:

```
# Assign variables
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt
```
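
If the report directory has not already been created earlier in the workshop, the metrics tool may be unable to write its output. As a precaution, the directory can be created from the output path itself. This is a sketch under that assumption; the path below is a throwaway location under the temporary directory, not the lesson's real report path:

```bash
# Hypothetical output path under a temporary location, for illustration only
OUTPUT_METRICS_FILE="${TMPDIR:-/tmp}/variant_calling_demo/reports/gatk/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt"

# dirname returns the directory part of the path; mkdir -p creates it
# (and any missing parents) and does nothing if it already exists
mkdir -p "$(dirname "$OUTPUT_METRICS_FILE")"
```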

Lastly, we can add the `Picard` command to gather the alignment metrics.
Lastly, we can add the `GATK`/`Picard` command to gather the alignment metrics.

```
# Run Picard CollectAlignmentSummaryMetrics
java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
# Run GATK CollectAlignmentSummaryMetrics
gatk CollectAlignmentSummaryMetrics \
--INPUT $INPUT_BAM \
--REFERENCE_SEQUENCE $REFERENCE \
--OUTPUT $OUTPUT_METRICS_FILE
```

We can break down this command into each of its components:

* `java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics` Calls the `CollectAlignmentSummaryMetrics` package from within `Picard`
* `--INPUT $INPUT_BAM` This is the output BAM file from our previous `Picard` alignment processing steps.
* `--REFERENCE_SEQUENCE $REFERENCE` This isn't a required parameter, but `Picard` can do a subset of mismatch-related metrics if this is provided.
* `gatk CollectAlignmentSummaryMetrics` Calls the `CollectAlignmentSummaryMetrics` package from within `GATK`/`Picard`
* `--INPUT $INPUT_BAM` This is the output BAM file from our previous `GATK`/`Picard` alignment processing steps.
* `--REFERENCE_SEQUENCE $REFERENCE` This isn't a required parameter, but `GATK`/`Picard` can do a subset of mismatch-related metrics if this is provided.
* `--OUTPUT $OUTPUT_METRICS_FILE` This is the file to write the output metrics to.


@@ -102,22 +102,22 @@ Now this script is all set to run! **Go ahead and save and quit.**
<summary><b>Click here to see what our final <code>sbatch</code> script for collecting the normal sample alignment metrics should look like</b></summary>
<pre>
#!/bin/bash
# This sbatch script is for collecting alignment metrics using Picard<br>
# This sbatch script is for collecting alignment metrics using GATK<br>
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:30:00
#SBATCH -c 1
#SBATCH --mem 16G
#SBATCH -o picard_metrics_normal_%j.out
#SBATCH -e picard_metrics_normal_%j.err<br>
# Load picard
module load picard/2.27.5<br>
#SBATCH -o gatk_metrics_normal_%j.out
#SBATCH -e gatk_metrics_normal_%j.err<br>
# Load GATK
module load gatk/4.6.1.0<br>
# Assign variables
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.p7.coordinate_sorted.bam
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_normal/syn3_normal_GRCh38.p7.CollectAlignmentSummaryMetrics.txt<br>
# Run Picard CollectAlignmentSummaryMetrics
java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_normal_GRCh38.coordinate_sorted.bam
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_normal/syn3_normal_GRCh38.CollectAlignmentSummaryMetrics.txt<br>
# Run GATK CollectAlignmentSummaryMetrics
gatk CollectAlignmentSummaryMetrics \
--INPUT $INPUT_BAM \
--REFERENCE_SEQUENCE $REFERENCE \
--OUTPUT $OUTPUT_METRICS_FILE
@@ -129,29 +129,29 @@ java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
Now we will want to **create the tumor version of this submission script using `sed`** (as we have done previously):

```
sed 's/normal/tumor/g' picard_metrics_normal.sbatch > picard_metrics_tumor.sbatch
sed 's/normal/tumor/g' gatk_metrics_normal.sbatch > gatk_metrics_tumor.sbatch
```
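
Because the `g` flag replaces every occurrence of `normal`, including any that appear in comments or unrelated words, it can be worth checking the result before submitting. A minimal sketch of such a check, using a two-line stand-in file with hypothetical names rather than the real script:

```bash
# Work in a throwaway directory so no real scripts are touched
WORKDIR=$(mktemp -d)

# Two-line stand-in for the normal-sample script (hypothetical file name)
printf '%s\n' \
  '#SBATCH -o gatk_metrics_normal_%j.out' \
  'INPUT_BAM=syn3_normal_GRCh38.coordinate_sorted.bam' \
  > "$WORKDIR/demo_normal.sbatch"

# Same substitution as above
sed 's/normal/tumor/g' "$WORKDIR/demo_normal.sbatch" > "$WORKDIR/demo_tumor.sbatch"

# Count lines that still mention "normal" (grep exits non-zero on no match, hence || true)
REMAINING=$(grep -c 'normal' "$WORKDIR/demo_tumor.sbatch" || true)
echo "Lines still containing 'normal': $REMAINING"

# Show exactly which lines changed (diff exits non-zero when files differ)
diff "$WORKDIR/demo_normal.sbatch" "$WORKDIR/demo_tumor.sbatch" || true
```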

<details>
<summary><b>Click here to see what our final <code>sbatch</code> script for collecting the tumor sample alignment metrics should look like</b></summary>
<pre>
#!/bin/bash
# This sbatch script is for collecting alignment metrics using Picard<br>
# This sbatch script is for collecting alignment metrics using GATK<br>
# Assign sbatch directives
#SBATCH -p priority
#SBATCH -t 0-00:30:00
#SBATCH -c 1
#SBATCH --mem 16G
#SBATCH -o picard_metrics_tumor_%j.out
#SBATCH -e picard_metrics_tumor_%j.err<br>
# Load picard
module load picard/2.27.5<br>
#SBATCH -o gatk_metrics_tumor_%j.out
#SBATCH -e gatk_metrics_tumor_%j.err<br>
# Load GATK
module load gatk/4.6.1.0<br>
# Assign variables
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.p7.coordinate_sorted.bam
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa
OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/picard/syn3_tumor/syn3_tumor_GRCh38.p7.CollectAlignmentSummaryMetrics.txt<br>
# Run Picard CollectAlignmentSummaryMetrics
java -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
INPUT_BAM=/n/scratch/users/${USER:0:1}/${USER}/variant_calling/alignments/syn3_tumor_GRCh38.coordinate_sorted.bam
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa
OUTPUT_METRICS_FILE=/home/${USER}/variant_calling/reports/gatk/syn3_tumor/syn3_tumor_GRCh38.CollectAlignmentSummaryMetrics.txt<br>
# Run GATK CollectAlignmentSummaryMetrics
gatk CollectAlignmentSummaryMetrics \
--INPUT $INPUT_BAM \
--REFERENCE_SEQUENCE $REFERENCE \
--OUTPUT $OUTPUT_METRICS_FILE
@@ -166,17 +166,17 @@ Before we submit our jobs, let's **check the status of our previous `Picard` ali
squeue --me
```

* **If your `Picard` alignment processing steps are completed**, and you have the required input files then you can submit these jobs to collect alignment metrics:
* **If your `GATK`/`Picard` alignment processing steps are completed**, and you have the required input files then you can submit these jobs to collect alignment metrics:

```bash
sbatch picard_metrics_normal.sbatch
sbatch picard_metrics_tumor.sbatch
sbatch gatk_metrics_normal.sbatch
sbatch gatk_metrics_tumor.sbatch
```
> **NOTE:** Each of these scripts should only take about 15 minutes to run.

## Collecting Coverage Metrics

Coverage is the average number of reads aligned over any given locus in the genome. `Picard` also has a package called [`CollectWgsMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard) which is also very nice for collecting data about coverage for alignments. However, **since our data set is whole exome sequencing rather than whole genome sequencing** and thus only comprises about 1-2% of the human genome, average **coverage across the whole genome is not a very useful metric**. However, if one did have whole genome data, then running `CollectWgsMetrics` would be useful. As such, in the dropdown box below we provide the code that you could use to collect this information.
Coverage is the average number of reads aligned over any given locus in the genome. `GATK`/`Picard` also has a package called [`CollectWgsMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard) which is also very nice for collecting data about coverage for alignments. However, **since our data set is whole exome sequencing rather than whole genome sequencing** and thus only comprises about 1-2% of the human genome, average **coverage across the whole genome is not a very useful metric**. However, if one did have whole genome data, then running `CollectWgsMetrics` would be useful. As such, in the dropdown box below we provide the code that you could use to collect this information.
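
As a rough illustration of what average coverage means, expected depth can be estimated as the total number of aligned bases divided by the size of the region being covered. The numbers below are invented to keep the arithmetic simple and are not taken from this dataset:

```bash
# Hypothetical values, chosen only to make the arithmetic easy to follow
NUM_READS=600000000        # total aligned reads
READ_LENGTH=150            # bases per read
GENOME_SIZE=3000000000     # approximate size of the covered region, in bases

# Average coverage = (reads x read length) / region size
MEAN_COVERAGE=$(awk -v r="$NUM_READS" -v l="$READ_LENGTH" -v g="$GENOME_SIZE" \
  'BEGIN { printf "%.1f", r * l / g }')

echo "Estimated mean coverage: ${MEAN_COVERAGE}x"
```

With these assumed inputs the estimate is 30.0x. For exome data the same read count is concentrated on a much smaller target, which is why a genome-wide average is not informative.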

<p align="center">
<img src="../img/coverge.png" width="800">
@@ -185,21 +185,21 @@ Coverage is the average level of alignment for any random locus in the genome.
_Image source: [Coverage analysis from the command line](https://medium.com/ngs-sh/coverage-analysis-from-the-command-line-542ef3545e2c)_

<details>
<summary><b>Click here to find out more on collecting coverage metrics for WGS datasets in <code>Picard</code></b></summary>
<br>The tool in <code>Picard</code> used for collecting coverage metrics for WGS datasets is called <code>CollectWgsMetrics</code>. The code used to run <code>CollectWgsMetrics</code> can be found below.<br><br>
<summary><b>Click here to find out more on collecting coverage metrics for WGS datasets in <code>GATK</code>/<code>Picard</code></b></summary>
<br>The tool in <code>GATK</code>/<code>Picard</code> used for collecting coverage metrics for WGS datasets is called <code>CollectWgsMetrics</code>. The code used to run <code>CollectWgsMetrics</code> can be found below.<br><br>
<pre>
# Assign paths to bash variables
COORDINATE_SORTED_BAM_FILE=/path/to/sample.coordinate_sorted.bam
METRICS_OUTPUT_FILE=/home/$USER/variant_calling/reports/picard/sample.CollectWgsMetrics.txt
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.p7.fa<br>
REFERENCE=/n/groups/hbctraining/variant_calling/reference/GRCh38.fa<br>
# Run Picard CollectWgsMetrics
java -jar $PICARD/picard.jar CollectWgsMetrics \
gatk CollectWgsMetrics \
--INPUT $COORDINATE_SORTED_BAM_FILE \
--OUTPUT $METRICS_OUTPUT_FILE \
--REFERENCE_SEQUENCE $REFERENCE
</pre>

<ul><li><code>java -jar $PICARD/picard.jar CollectWgsMetrics</code> This calls the <code>CollectWgsMetrics</code> package within <code>Picard</code></li>
<ul><li><code>gatk CollectWgsMetrics</code> This calls the <code>CollectWgsMetrics</code> package within <code>GATK</code>/<code>Picard</code></li>
<li><code>--INPUT $COORDINATE_SORTED_BAM_FILE</code> This is the input coordinate-sorted BAM file</li>
<li><code>--OUTPUT $METRICS_OUTPUT_FILE</code> This is the output report file </li>
<li><code>--REFERENCE_SEQUENCE $REFERENCE</code> This is the path to the reference genome that was used for the alignment.</li></ul>
@@ -208,11 +208,11 @@ _Image source: [Coverage analysis from the command line](https://medium.com/ngs-

## Factors Impacting Alignment

While we mentioned above the various metrics that are computed as part of the Picard command, one of the **most important metrics for your alignment file is the alignment rate**. When aligning high-quality reads to a high-quality reference genome, **one should expect to see alignment rates at 90% or better**. If alignment rates dip below 80-85%, then there could be reason for further inspection.
While we mentioned above the various metrics that are computed as part of the GATK/Picard command, one of the **most important metrics for your alignment file is the alignment rate**. When aligning high-quality reads to a high-quality reference genome, **one should expect to see alignment rates at 90% or better**. If alignment rates dip below 80-85%, then there could be reason for further inspection.
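
The alignment rate can be pulled out of the metrics file on the command line. The sketch below assumes the usual layout of a `CollectAlignmentSummaryMetrics` report (comment lines starting with `#`, then a tab-delimited header row that includes `PCT_PF_READS_ALIGNED`, then one row per category). It runs against a small mock file with invented values and only a subset of columns, so the real report should be substituted for `$METRICS_FILE`:

```bash
# Build a tiny mock metrics file (invented values, reduced set of columns)
METRICS_FILE=$(mktemp)
{
  printf '## METRICS CLASS\tmock header line\n'
  printf 'CATEGORY\tTOTAL_READS\tPF_READS_ALIGNED\tPCT_PF_READS_ALIGNED\n'
  printf 'FIRST_OF_PAIR\t100\t98\t0.98\n'
  printf 'SECOND_OF_PAIR\t100\t96\t0.96\n'
  printf 'PAIR\t200\t194\t0.97\n'
} > "$METRICS_FILE"

# Find the PCT_PF_READS_ALIGNED column by name, then print it for the PAIR row
ALIGN_RATE=$(awk -F'\t' '
  /^#/ || NF == 0 { next }   # skip comment and blank lines
  !col { for (i = 1; i <= NF; i++) if ($i == "PCT_PF_READS_ALIGNED") col = i; next }
  $1 == "PAIR" { print $col }
' "$METRICS_FILE")

echo "Fraction of PF reads aligned (PAIR): $ALIGN_RATE"
```

Looking the column up by name rather than by position keeps the command working if the number or order of columns differs between tool versions.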

Alignment rates can vary based upon many factors, including:

- **Quality of reference assembly** - A high-quality assembly like GRCh38.p7 will provide an excellent reference genome for alignment. However, if you were studying an organism with a poorly assembled genome, parts of the reference genome could be missing from the assembly. Therefore, high-quality reads might not align because the reference sequence they correspond to is missing.
- **Quality of reference assembly** - A high-quality assembly like GRCh38 will provide an excellent reference genome for alignment. However, if you were studying an organism with a poorly assembled genome, parts of the reference genome could be missing from the assembly. Therefore, high-quality reads might not align because the reference sequence they correspond to is missing.
- **Quality of libraries** - If the library generation was poor and there wasn't enough input DNA, then your sequencing could be filled with low-quality reads.
- **Quality of the reads** - If the reads are poor quality, then it can make alignment more uncertain. If your `FastQC` report shows any anomalous signs, contact your sequencing center for support.
- **Contamination** - If your samples are contaminated, then it can also skew your alignment. For example, if your samples were heavily contaminated with some bacteria, then much of what you will sequence will be bacterial DNA and not your sample DNA. As a result, most of the sequence reads will not align to your target sequence. If you suspect contamination might be the source of a poor alignment, you could consider running [Kraken](https://ccb.jhu.edu/software/kraken/) to evaluate the levels of contamination in your samples.