Commit

Merge branch 'master' of github.com:bioinformatics-core-shared-training/Bulk_RNAseq_Course_Base
AshKernow committed Oct 1, 2024
2 parents 11068de + 393953c commit 4fdf91e
Showing 72 changed files with 3,686 additions and 879 deletions.
369 changes: 29 additions & 340 deletions Markdowns/01_Introduction_to_RNAseq_Methods.Rmd

Large diffs are not rendered by default.

342 changes: 32 additions & 310 deletions Markdowns/01_Introduction_to_RNAseq_Methods.html

Large diffs are not rendered by default.

66 changes: 19 additions & 47 deletions Markdowns/02_FastQC_introduction.Rmd
@@ -1,6 +1,6 @@
---
title: "Basic quality control with FastQC"
date: "March 2023"
date: "October 2024"
output:
ioslides_presentation:
css: css/stylesheet.css
@@ -15,22 +15,22 @@ output:

<div style="line-height: 50%;"><br></div>

<img src="images/workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">
<img src="images/01s_workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">

## Fastq file format
<img src="images/fq.png" style="width: 95%">
<img src="images/02s_fq.png" style="width: 95%">

## Fastq file format - Headers
<img src="images/fq_headers.png" style="width: 95%">
<img src="images/02s_fq_headers.png" style="width: 95%">

## Fastq file format - Sequences
<img src="images/fq_seq.png" style="width: 95%">
<img src="images/02s_fq_seq.png" style="width: 95%">

## Fastq file format - Third line
<img src="images/fq_3rd_line.png"style="width: 95%">
<img src="images/02s_fq_3rd_line.png"style="width: 95%">

## Fastq file format - Quality Scores
<img src="images/fq_quality.png" style="width: 95%">
<img src="images/02s_fq_quality.png" style="width: 95%">

## (Phred) Quality Scores

@@ -45,43 +45,15 @@ Sequence quality scores are transformed and translated p-values
* p-value of 0.01 inferred as 1 in 100 chance that called base is wrong
</div>

## (Phred) Quality Scores ...


How do we assign p-values to bases in the fastq file?

<div style="width: 50%;
float: left">
* P-values can be many characters long (e.g. 0.000005)
* Transform to Phred quality scores _Q_
* $Q = -10 \log_{10} P$ (e.g. P = 0.01 gives Q = 20, P = 0.001 gives Q = 30)
* Translate _Q_ values to ASCII characters by adding 33 (Q = 30 is "?", Q = 40 is "I")
</div>

<div style="margin-left: 60px;
float: none;">

<img src="images/ascii.png" style="width: 30%;
margin-left: auto;
margin-right: auto;
margin-bottom: auto;
display: block;">

</div>
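
The Phred transformation described on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the Phred+33 offset mentioned above; the function names are invented for the example and are not part of FastQC or any other tool.

```python
import math

def phred_from_p(p: float) -> int:
    """Convert an error probability P into a Phred score Q = -10 * log10(P)."""
    return round(-10 * math.log10(p))

def phred_to_char(q: int, offset: int = 33) -> str:
    """Encode a Phred score as a single ASCII character (Phred+33 assumed)."""
    return chr(q + offset)

def char_to_p(c: str, offset: int = 33) -> float:
    """Decode a fastq quality character back into an error probability."""
    q = ord(c) - offset
    return 10 ** (-q / 10)

# Walk through the probabilities used as examples on the slide.
for p in (0.01, 0.001, 0.0001):
    q = phred_from_p(p)
    print(p, q, phred_to_char(q))
```

Decoding works the same way in reverse: `char_to_p("I")` recovers an error probability of about 0.0001, i.e. Q = 40.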




## QC is important

Check for any problems before we put time and effort into analysing potentially bad data
At every stage we should check for any problems before we put time and effort into analysing potentially bad data

<div style="width: 40%;
float: left">
* Start with FastQC
* Start with FastQC on our sequencing outputs
* Quick
* Outputs an easy to read html report

@@ -90,7 +62,7 @@ Check for any problems before we put time and effort into analysing potentially
<div style="margin-left: 60px;
float: none;">

<img src="images/FastQC_logo.png" style="width: 25%;
<img src="images/02s_FastQC_logo.png" style="width: 25%;
margin-left: auto;
margin-right: auto;
display: block;">
@@ -112,15 +84,15 @@ but there are lots of other parameters which you can find to tailor your QC by t
<div style="text-align: center;">
<span style="color: #2e2892;">**Good Data**</span>
</div>
<img src="images/good1.png" style="width: 100%">
<img src="images/02s_fastqc_good1.png" style="width: 100%">
</div>

<div style="width: 47%;
float: right;">
<div style="text-align: center;">
<span style="color: #2e2892;">**Bad Data**</span>
</div>
<img src="images/bad1.png" style="width: 100%">
<img src="images/02s_fastqc_bad1.png" style="width: 100%">
</div>

## Per base sequence content
@@ -130,15 +102,15 @@ but there are lots of other parameters which you can find to tailor your QC by t
<div style="text-align: center;">
<span style="color: #2e2892;">**Good Data**</span>
</div>
<img src="images/good2.png" style="width: 100%">
<img src="images/02s_fastqc_good2.png" style="width: 100%">
</div>

<div style="width: 47%;
float: right;">
<div style="text-align: center;">
<span style="color: #2e2892;">**Bad Data**</span>
</div>
<img src="images/bad2.png" style="width: 100%">
<img src="images/02s_fastqc_bad2.png" style="width: 100%">
</div>

## Per sequence GC content
@@ -155,15 +127,15 @@ The spike is due to severe adapter contamination.
<div style="text-align: center;">
<span style="color: #2e2892;">**Good Data**</span>
</div>
<img src="images/good3.png" style="width: 100%">
<img src="images/02s_fastqc_good3.png" style="width: 100%">
</div>

<div style="width: 47%;
float: right;">
<div style="text-align: center;">
<span style="color: #2e2892;">**Bad Data**</span>
</div>
<img src="images/bad3.png" style="width: 100%">
<img src="images/02s_fastqc_bad3.png" style="width: 100%">
</div>

## Adaptor content
@@ -173,15 +145,15 @@ The spike is due to severe adapter contamination.
<div style="text-align: center;">
<span style="color: #2e2892;">**Good Data**</span>
</div>
<img src="images/good4.png" style="width: 100%">
<img src="images/02s_fastqc_good4.png" style="width: 100%">
</div>

<div style="width: 47%;
float: right;">
<div style="text-align: center;">
<span style="color: #2e2892;">**Bad Data**</span>
</div>
<img src="images/bad4.png" style="width: 100%">
<img src="images/02s_fastqc_bad4.png" style="width: 100%">
</div>


@@ -203,7 +175,7 @@ The spike is due to severe adapter contamination.
- the directory structure is like a tree, you can go back with cd ..
- Up arrows to get through history
- tab complete to avoid errors
- More to look at the files and q to exit
- Less or More to look at the files and q to exit
- ctrl-c


61 changes: 24 additions & 37 deletions Markdowns/02_FastQC_introduction.html

Large diffs are not rendered by default.

109 changes: 48 additions & 61 deletions Markdowns/03_Quantification_with_Salmon_introduction.Rmd
@@ -1,6 +1,6 @@
---
title: "Alignment and Quantification of Gene Expression with Salmon"
date: "March 2023"
date: "October 2024"
output:
ioslides_presentation:
css: css/stylesheet.css
@@ -15,26 +15,26 @@ bibliography: ref.bib

<div style="line-height: 50%;"><br></div>

<img src="images/workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">
<img src="images/01s_workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">


## Alignment and Quantification overview {#less_space_after_title}

<div style="line-height: 10%;"><br></div>

<img src="images/aln_quant_overview.png" class="centerimg" style="width: 48%; margin-top: 60px;">
<img src="images/03s_aln_quant_overview.svg" class="centerimg" style="width: 48%; margin-top: 60px;">


## Traditional Alignment

AIM: Given a reference sequence and a set of short reads, align each read to
the reference sequence finding the most likely origin of the read sequence.

<img src="images/SRAlignment.svg" class="centerimg" style="width: 100%; margin-top: 60px;">
<img src="images/03s_SRAlignment.svg" class="centerimg" style="width: 100%; margin-top: 60px;">

## Alignment - Splicing aware alignment

<img src="images/GappedAlignment.svg" class="centerimg" style="width: 100%; margin-top: 60px;">
<img src="images/03s_GappedAlignment.svg" class="centerimg" style="width: 100%; margin-top: 60px;">

Aligners: STAR, HISAT2

@@ -45,13 +45,13 @@ Aligners: STAR, HISAT2
* It is (relatively) slow and computationally intensive


<img src="images/quasi_mapping_1.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
<img src="images/03s_quasi_mapping_1.svg" class="centerimg" style="width: 90%; margin-top: 40px;">

## Alignment
* Traditional alignment performs base-by-base alignment
* Traditional alignment is (relatively) slow and computationally intensive

<img src="images/quasi_mapping_2.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
<img src="images/03s_quasi_mapping_2.svg" class="centerimg" style="width: 90%; margin-top: 40px;">



@@ -61,18 +61,46 @@ Aligners: STAR, HISAT2
* Traditional alignment performs base-by-base alignment
* Traditional alignment is (relatively) slow and computationally intensive

<img src="images/quasi_mapping_2.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
<img src="images/03s_quasi_mapping_2.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
Switch to *quasi-mapping* (Salmon) or *pseudo-alignment* (Kallisto)

## BAM/SAM file format

**S**equence **A**lignment/**M**ap (SAM) format is the standard format for files
containing aligned reads.

Definition of the format is available at https://samtools.github.io/hts-specs/SAMv1.pdf.

Two main parts:

* Header
- contains metadata (source of the reads, reference genome, aligner, etc.)
- header lines start with “@”
- header fields have standardized two-letter codes
- `@RG` for read group, used for merging BAMs together

* Alignment section
- 1 line for each alignment
- contains details of alignment position, mapping, base quality etc.
- 11 required fields, but other content may vary depending on aligner and other
tools used to create the file

* BAM is a binary version of SAM (not human readable)
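
To make the two-part layout concrete, here is a minimal Python sketch that separates header lines from alignment lines and names the 11 required fields. The example records are invented for illustration, and the parser is a teaching aid under simple assumptions (tab-separated text SAM, no validation); real pipelines would normally use samtools or a dedicated library.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam(lines):
    """Split SAM text into header lines and a list of alignment records."""
    header, alignments = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):          # header lines start with "@"
            header.append(line)
            continue
        cols = line.split("\t")
        record = dict(zip(SAM_FIELDS, cols[:11]))
        for name in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
            record[name] = int(record[name])   # numeric columns
        record["TAGS"] = cols[11:]        # optional, aligner-specific fields
        alignments.append(record)
    return header, alignments

# Invented example: two header lines and one alignment.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0",
]
header, alignments = parse_sam(example)
print(len(header), alignments[0]["RNAME"], alignments[0]["POS"])
```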

## BAM/SAM format - What does it look like?

<img src="images/03s_SAM_alignment_1c.png" class="centerimg" style="width: 100%; margin-top: 10px;">

## Why are Pseudo-alignment methods faster?

Switch to *quasi-mapping* (Salmon) or *pseudo-alignment* (Kallisto)

* These tools avoid base-by-base alignment of the reads
* ~20 times faster than traditional alignment tools such as STAR and HISAT2
* Unlike alignment-based methods, pseudo-alignment methods focus on the transcriptome (~2% of the genome in human)
* Use exact k-mer matching rather than aligning whole reads with mismatches and indels
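
The idea of exact k-mer matching can be illustrated with a toy Python index. This is only a sketch of the general principle, not how Salmon or Kallisto are implemented; the transcript sequences, the function names and the choice of k = 3 are assumptions made for the example.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read found in the index."""
    candidates = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # k-mer not in the index, e.g. because of a sequencing error
        candidates = set(hits) if candidates is None else candidates & hits
    return candidates or set()

# Invented toy transcriptome.
transcripts = {"t1": "ACGTACGGA", "t2": "TTGACGGAC", "t3": "GGGCCCAAA"}
index = build_kmer_index(transcripts, k=3)
print(sorted(compatible_transcripts("ACGGA", index, k=3)))
print(sorted(compatible_transcripts("GTACG", index, k=3)))
```

No base-by-base scoring of mismatches or indels happens here: each lookup is an exact dictionary hit, which is where most of the speed comes from.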

<img src="images/quasi_mapping_3.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
<img src="images/03s_quasi_mapping_3.svg" class="centerimg" style="width: 90%; margin-top: 40px;">


## Quantification tools
@@ -106,7 +134,7 @@ file (e.g. GFF or GTF)

So the simplest approach is to count how many reads overlap each gene.
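
As a rough sketch of this counting approach, the Python below tallies reads whose aligned interval overlaps a gene interval. It assumes a single chromosome, closed 1-based coordinates and invented gene and read positions; dedicated counting tools additionally handle strand, exons and ambiguous reads.

```python
def count_reads_per_gene(genes, reads):
    """Count reads whose interval overlaps each gene interval."""
    counts = {name: 0 for name in genes}
    for r_start, r_end in reads:
        for name, (g_start, g_end) in genes.items():
            # Two closed intervals overlap if each starts before the other ends.
            if r_start <= g_end and g_start <= r_end:
                counts[name] += 1
    return counts

# Invented coordinates for illustration.
genes = {"geneA": (100, 500), "geneB": (800, 1200)}
reads = [(90, 140), (450, 520), (600, 650), (1190, 1240), (300, 350)]
gene_counts = count_reads_per_gene(genes, reads)
print(gene_counts)
```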

<img src="images/Read_counting_2.svg" class="centerimg" style="width: 90%; margin-top: 20px;">
<img src="images/03s_Read_counting_2.svg" class="centerimg" style="width: 90%; margin-top: 20px;">

## What is read quantification?

@@ -124,60 +152,19 @@ Salmon also takes account of biases:
* Because Salmon searches the transcriptome, not the genome, it is not the right tool for finding new genes or isoforms


## Salmon

## Salmon workflow
* Salmon essential steps
1. Salmon indexing
2. Quasi-mapping and abundance quantification
<img src="images/Salmon_workflow_2.png" class="centerimg" style="width: 40%;">

<div style="text-align: right">
Patro *et al.* (2017) Nature Methods doi:10.1038/nmeth.4197
</div>


## Salmon: Salmon indexing

* Two essential steps
1. Create transcriptome index
* This makes the downstream quasi-mapping and quantification steps faster and more efficient
* Once you create an index, you can use it again and again
* Salmon indexing has two components
* Creates the reference transcriptome suffix array (SA)
* Each transcript in the reference transcriptome is mapped to its location in the SA using a hash table
2. Quasi-mapping and quantification


## Salmon: Quasi-mapping
<div class="columns-2">
<img src="images/quasi-mapping_overview.png" class="centerimg" style="width: 100%; height: 100%">

* The transcriptome (consisting of transcripts $t1,...,t6$) is converted into a \$ separated string "T"
* From "T", a suffix array, SA[T], and a hash table, h, are constructed (in the indexing step).
* The mapping operation begins with a k-mer (here, k = 3)
* From left to right, the read is scanned until a k-mer appears in the hash table.
* All suffixes containing the k-mer are found in the hash table and the SA intervals are retrieved
* The maximal matching prefix (MMP) is determined by finding the longest read sequence that exactly matches the reference suffix
* This process is repeated until the end of the read
* The final mapping is generated by determining the transcripts that appear in all MMPs for the read

</div>

\

Avi Srivastava *et al.* (2016) Bioinformatics 2016 Jun 15;32(12)
Two essential steps

1. Create transcriptome index

## Abundance estimation
* This makes the downstream quasi-mapping and quantification steps faster and more efficient
* Once you create an index, you can use it again and again
* Consider the k-mer size (default 31, suited to reads >75bp)
* Adding decoy sequences to filter contaminants

2. Quasi-mapping and quantification

* With the quasi-mapping method, the best mapping is determined for each read
* After modeling sample-specific parameters and biases, salmon will generate transcript abundance estimates
* A read that maps equally to more than one transcript will have its count divided among them (Isoform information not lost)
* A variety of complex modeling approaches are used to estimate transcript abundances, including Expectation Maximization (EM), which corrects for sample-specific biases.
* GC bias
* Positional bias
* Fragment length bias
* Sequence-based bias
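
How a multi-mapping read can have its count divided among transcripts is easiest to see with a toy Expectation Maximization loop. The Python below is a deliberately simplified sketch: it ignores transcript length and all of the bias models listed above, so it is not Salmon's actual model, and the read set and function name are invented for the example.

```python
def em_read_assignment(read_compat, transcripts, n_iter=200):
    """Fractionally assign reads to transcripts with a basic EM loop.

    read_compat: one tuple per read, listing the transcripts it maps to.
    Returns the expected read count for each transcript.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}  # initial abundances
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        # E-step: split each read in proportion to the current abundances.
        counts = {t: 0.0 for t in transcripts}
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            if total == 0:
                continue
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the fractional counts.
        assigned = sum(counts.values())
        if assigned == 0:
            break
        theta = {t: c / assigned for t, c in counts.items()}
    return counts

# Invented data: 6 reads unique to t1, 2 unique to t2, 4 mapping to both.
reads = [("t1",)] * 6 + [("t2",)] * 2 + [("t1", "t2")] * 4
em_counts = em_read_assignment(reads, ["t1", "t2"])
print({t: round(c, 2) for t, c in em_counts.items()})
```

In this toy case the four ambiguous reads end up split roughly 3 to 1 in favour of t1, because the uniquely mapping reads already indicate that t1 is the more abundant transcript.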


## Practical
105 changes: 40 additions & 65 deletions Markdowns/03_Quantification_with_Salmon_introduction.html

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions Markdowns/04_Quality_Control_introduction.Rmd
@@ -1,6 +1,6 @@
---
title: "QC of Aligned Reads"
date: "March 2023"
date: "October 2024"
output:
ioslides_presentation:
css: css/stylesheet.css
@@ -15,7 +15,7 @@ output:

<div style="line-height: 50%;"><br></div>

<img src="images/workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">
<img src="images/01s_workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">

## QC of aligned reads

@@ -41,14 +41,14 @@ output:
* Insert size is the length of the fragment of mRNA from which the reads are
derived

<img src="images/Insert_Size_QC.svg" class="centerimg">
<img src="images/Insert_Size.svg" class="centerimg" style="width: 80%">
<img src="images/04s_Insert_Size_QC.svg" class="centerimg">
<img src="images/04s_Insert_Size.svg" class="centerimg" style="width: 80%">

## QC of aligned reads - Transcript coverage

<div style="line-height: 50%;"><br></div>

<img src="images/TranscriptCoverage.svg" class="centerimg" style="width: 80%">
<img src="images/04s_TranscriptCoverage.svg" class="centerimg" style="width: 80%">

## QC Goals

36 changes: 22 additions & 14 deletions Markdowns/04_Quality_Control_introduction.html

Large diffs are not rendered by default.
