Merge branch 'master' of github.com:bioinformatics-core-shared-training/Bulk_RNAseq_Course_Base
bioinformatics-core-shared-training · Oct 1, 2024 · 4fdf91e
2 parents 11068de + 393953c
commit 4fdf91e
Showing 72 changed files with 3,686 additions and 879 deletions.
diff --git a/Markdowns/01_Introduction_to_RNAseq_Methods.Rmd b/Markdowns/01_Introduction_to_RNAseq_Methods.Rmd
diff --git a/Markdowns/01_Introduction_to_RNAseq_Methods.html b/Markdowns/01_Introduction_to_RNAseq_Methods.html
diff --git a/Markdowns/02_FastQC_introduction.Rmd b/Markdowns/02_FastQC_introduction.Rmd
@@ -1,6 +1,6 @@
 ---
 title: "Basic quality control with FastQC"
-date: "March 2023"
+date: "October 2024"
 output:
   ioslides_presentation:
     css: css/stylesheet.css
@@ -15,22 +15,22 @@ output:
 
 <div style="line-height: 50%;"><br></div>
 
-<img src="images/workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">
+<img src="images/01s_workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">
 
 ## Fastq file format
-<img src="images/fq.png" style="width: 95%">
+<img src="images/02s_fq.png" style="width: 95%">
 
 ## Fastq file format - Headers
-<img src="images/fq_headers.png" style="width: 95%">
+<img src="images/02s_fq_headers.png" style="width: 95%">
 
 ## Fastq file format - Sequences
-<img src="images/fq_seq.png" style="width: 95%">
+<img src="images/02s_fq_seq.png" style="width: 95%">
 
 ## Fastq file format - Third line 
-<img src="images/fq_3rd_line.png"style="width: 95%"> 
+<img src="images/02s_fq_3rd_line.png"style="width: 95%"> 
 
 ## Fastq file format - Quality Scores
-<img src="images/fq_quality.png" style="width: 95%">
+<img src="images/02s_fq_quality.png" style="width: 95%">
 
 ## (Phred) Quality Scores
 
@@ -45,43 +45,15 @@ Sequence quality scores are transformed and translated p-values
   * p-value of 0.01 inferred as 1 in 100 chance that called base is wrong 
 </div>
 
-## (Phred) Quality Scores  ... 
-
-
-How do we assign p-values to bases in the fastq file?
-
-<div style="width: 50%; 
-            float: left"> 
-            
-* P-vales can be many characters long (e.g.:0.000005)
-* Transform to Phred quality scores _Q_
-* $Q = -10(log_{10} P)$ (e.g.: 0.01 = Q value of 20,  0.001 = Q value of 30)
-* Translate _Q_ values to ASCII characters (adding 33) (Q value of 30 = ?, Q value of 40 = I )
-</div>
-
-<div style="margin-left: 60px;
-            float: none;">
-
-<img src="images/ascii.png" style="width: 30%;
-                                  margin-left: auto;
-                                  margin-right: auto;
-                                  margin-bottom: auto;
-                                  display: block;">
-
-</div>
-
-
-
-
 ## QC is important
 
-Check for any problems before we put time and effort into analysing potentially bad data
+At every stage we should check for any problems before we put time and effort into analysing potentially bad data
 
 <div style="width: 40%; 
             float: left"> 
           
           
-* Start with FastQC
+* Start with FastQC on our sequencing outputs
   * Quick
   * Outputs an easy to read html report
 
@@ -90,7 +62,7 @@ Check for any problems before we put time and effort into analysing potentially
 <div style="margin-left: 60px;
             float: none;">
 
-<img src="images/FastQC_logo.png" style="width: 25%;
+<img src="images/02s_FastQC_logo.png" style="width: 25%;
                                   margin-left: auto;
                                   margin-right: auto;
                                   display: block;">
@@ -112,15 +84,15 @@ but there are lots of other parameters which you can find to tailor your QC by t
 <div style="text-align: center;">
    <span style="color: #2e2892;">**Good Data**</span>
 </div>
-<img src="images/good1.png" style="width: 100%">
+<img src="images/02s_fastqc_good1.png" style="width: 100%">
 </div>
 
 <div style="width: 47%;
             float: right;">
 <div style="text-align: center;">
    <span style="color: #2e2892;">**Bad Data**</span>
 </div>
-<img src="images/bad1.png" style="width: 100%">
+<img src="images/02s_fastqc_bad1.png" style="width: 100%">
 </div>
 
 ## Per base sequence content
@@ -130,15 +102,15 @@ but there are lots of other parameters which you can find to tailor your QC by t
 <div style="text-align: center;">
    <span style="color: #2e2892;">**Good Data**</span>
 </div>
-<img src="images/good2.png" style="width: 100%">
+<img src="images/02s_fastqc_good2.png" style="width: 100%">
 </div>
 
 <div style="width: 47%;
             float: right;">
 <div style="text-align: center;">
    <span style="color: #2e2892;">**Bad Data**</span>
 </div>
-<img src="images/bad2.png" style="width: 100%">
+<img src="images/02s_fastqc_bad2.png" style="width: 100%">
 </div>
 
 ## Per sequence GC content
@@ -155,15 +127,15 @@ The spike is due to severe adapter contamination.
 <div style="text-align: center;">
    <span style="color: #2e2892;">**Good Data**</span>
 </div>
-<img src="images/good3.png" style="width: 100%">
+<img src="images/02s_fastqc_good3.png" style="width: 100%">
 </div>
 
 <div style="width: 47%;
             float: right;">
 <div style="text-align: center;">
    <span style="color: #2e2892;">**Bad Data**</span>
 </div>
-<img src="images/bad3.png" style="width: 100%">
+<img src="images/02s_fastqc_bad3.png" style="width: 100%">
 </div> 
 
 ## Adaptor content
@@ -173,15 +145,15 @@ The spike is due to severe adapter contamination.
 <div style="text-align: center;">
    <span style="color: #2e2892;">**Good Data**</span>
 </div>
-<img src="images/good4.png" style="width: 100%">
+<img src="images/02s_fastqc_good4.png" style="width: 100%">
 </div>
 
 <div style="width: 47%;
             float: right;">
 <div style="text-align: center;">
    <span style="color: #2e2892;">**Bad Data**</span>
 </div>
-<img src="images/bad4.png" style="width: 100%">
+<img src="images/02s_fastqc_bad4.png" style="width: 100%">
 </div> 
 
 
@@ -203,7 +175,7 @@ The spike is due to severe adapter contamination.
   - the directory structure is like a tree, you can go back with cd ..
   - Up arrows to get through history 
   - tab complete to avoid errors
-  - More to look at the files and q to exit
+  - Less or More to look at the files and q to exit
   - ctrl-c
 
 

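Review note on `Markdowns/02_FastQC_introduction.Rmd`: the slide removed in this file described how error probabilities become the quality characters in a fastq file ($Q = -10\log_{10} P$, then add 33 to get an ASCII code). A minimal Python sketch of that transform is below. The function names are illustrative and not part of the course material, and the Phred+33 offset is assumed, as stated on the removed slide.

```python
import math

PHRED_OFFSET = 33  # ASCII offset given on the removed slide (Phred+33 encoding)


def error_prob_to_phred(p: float) -> int:
    """Convert an error probability to a Phred score, Q = -10 * log10(P)."""
    return round(-10 * math.log10(p))


def phred_to_char(q: int) -> str:
    """Encode a Phred score as the single ASCII character stored in a fastq file."""
    return chr(q + PHRED_OFFSET)


def char_to_error_prob(c: str) -> float:
    """Decode a fastq quality character back to an error probability."""
    return 10 ** (-(ord(c) - PHRED_OFFSET) / 10)
```

With these helpers, an error probability of 0.001 gives Q = 30, which is encoded as `?`, and Q = 40 is encoded as `I`, matching the examples on the removed slide.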
diff --git a/Markdowns/02_FastQC_introduction.html b/Markdowns/02_FastQC_introduction.html
diff --git a/Markdowns/03_Quantification_with_Salmon_introduction.Rmd b/Markdowns/03_Quantification_with_Salmon_introduction.Rmd
@@ -1,6 +1,6 @@
 ---
 title: "Alignment and Quantification of Gene Expression with Salmon"
-date: "March 2023"
+date: "October 2024"
 output:
   ioslides_presentation:
     css: css/stylesheet.css
@@ -15,26 +15,26 @@ bibliography: ref.bib
 
 <div style="line-height: 50%;"><br></div>
 
-<img src="images/workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">
+<img src="images/01s_workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">
 
 
 ## Alignment and Quantification overview {#less_space_after_title}
 
 <div style="line-height: 10%;"><br></div>
 
-<img src="images/aln_quant_overview.png" class="centerimg" style="width: 48%; margin-top: 60px;">
+<img src="images/03s_aln_quant_overview.svg" class="centerimg" style="width: 48%; margin-top: 60px;">
 
 
 ## Traditional Alignment
 
 AIM: Given a reference sequence and a set of short reads, align each read to
 the reference sequence finding the most likely origin of the read sequence.
 
-<img src="images/SRAlignment.svg" class="centerimg" style="width: 100%; margin-top: 60px;">
+<img src="images/03s_SRAlignment.svg" class="centerimg" style="width: 100%; margin-top: 60px;">
 
 ## Alignment - Splicing aware alignment
 
-<img src="images/GappedAlignment.svg" class="centerimg" style="width: 100%; margin-top: 60px;">
+<img src="images/03s_GappedAlignment.svg" class="centerimg" style="width: 100%; margin-top: 60px;">
 
 Aligners: STAR, HISAT2
 
@@ -45,13 +45,13 @@ Aligners: STAR, HISAT2
 * It is (relatively) slow and computationally intensive
 
 
-<img src="images/quasi_mapping_1.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
+<img src="images/03s_quasi_mapping_1.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
 
 ## Alignment
 * Traditional alignment perform base-by-base alignment
 * Traditional alignment is (relatively) slow and computationally intensive
 
-<img src="images/quasi_mapping_2.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
+<img src="images/03s_quasi_mapping_2.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
 
 
 
@@ -61,18 +61,46 @@ Aligners: STAR, HISAT2
 * Traditional alignment perform base-by-base alignment
 * Traditional alignment is (relatively) slow and computationally intensive
 
-<img src="images/quasi_mapping_2.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
+<img src="images/03s_quasi_mapping_2.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
         
-Switch to *quasi-mapping* (Salmon) or *pseudo-alignment* (Kallisto)
 
+## BAM/SAM file format
+
+**S**equence **A**lignment/**M**ap (SAM) format is the standard format for files
+containing aligned reads.
+
+Definition of the format is available at https://samtools.github.io/hts-specs/SAMv1.pdf.
+
+Two main parts:  
+
+* Header  
+  	- contains meta data (source of the reads, reference genome, aligner, etc.)  
+  	- header lines start with “@”
+  	- header fields have standardized two-letter codes
+  	- `@RG` for read group, used for merging BAMs together
+
+* Alignment section  
+    - 1 line for each alignment  
+    - contains details of alignment position, mapping, base quality etc.  
+    - 11 required fields, but other content may vary depending on aligner and other
+      tools used to create the file
+
+* BAM is a binary version of SAM (not human readable)
+
+## BAM/SAM format - What does it look like?
+
+<img src="images/03s_SAM_alignment_1c.png" class="centerimg" style="width: 100%; margin-top: 10px;">
 
 ## Why are Pseudo-alignment methods faster?
+
+Switch to *quasi-mapping* (Salmon) or *pseudo-alignment* (Kallisto)
+
 * These tools avoids base-to-base alignment of the reads
 * ~ 20 times faster than the traditional alignment tools like STAR, HISAT2 etc
 * Unlike alignment based methods, pseudo-alignment methods focus on transcriptome (~2% of genome in human)
 * Use exact kmer matching rather than aligning whole reads with mismatches and indels
 
-<img src="images/quasi_mapping_3.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
+<img src="images/03s_quasi_mapping_3.svg" class="centerimg" style="width: 90%; margin-top: 40px;">
 
 
 ## Quantification tools
@@ -106,7 +134,7 @@ file (e.g. GFF or GTF)
 
 So the simplest approach is to count how many reads overlap each gene.
 
-<img src="images/Read_counting_2.svg" class="centerimg" style="width: 90%; margin-top: 20px;">
+<img src="images/03s_Read_counting_2.svg" class="centerimg" style="width: 90%; margin-top: 20px;">
 
 ## What is read quantification?
 
@@ -124,60 +152,19 @@ Salmon also takes account of biases:
 * Because salmon searches transcription, not genome, it's not the right tool for finding new genes or isoforms
 
 
+## Salmon
 
-## Salmon workflow
-* Salmon essential steps
-  1. Salmon indexing
-  2. Quasi-mapping and abundance quantification
-<img src="images/Salmon_workflow_2.png" class="centerimg" style="width: 40%;">
-
-<div style="text-align: right">
-  Patro *et al.* (2017) Nature Methods doi:10.1038/nmeth.4197
-</div>
-
-
-## Salmon: Salmon indexing
-
-* Two essential steps
-  1. Create transcriptome index
-    * This makes downstream quasi-mapping and quantification step efficient and faster
-    * Once you create an index, you can use it again and again
-    * Salmon indexing has two components
-      * Creates the reference transcriptome suffix array (SA)
-      * Each transcript in the reference transcriptome is mapped to its location in the SA using a hash table
-  2. Quasi-mapping and quantification    
-
-
-## Salmon: Quasi-mapping
-<div class="columns-2">
-<img src="images/quasi-mapping_overview.png" class="centerimg" style="width: 100%; height: 100%">
-
-  * The transcriptome (consisting of transcripts $t1,...,t6$) is converted into a \$ separated string "T" 
-  * On "T" suffix array, SA[T], and a hash table, h , are constructed (in indexing step).
-  * The mapping operation begins with a k-mer (here, k = 3) 
-  * From left to right, the read is scanned until a k-mer appears in the hash table.
-  * All suffixes containing the k-mer are found in the hash table and the SA intervals are retrieved
-  * The maximal matching prefix (MMP) is determined by finding the longest read sequence that exactly matches the reference suffix
-  * This process is repeated until the end of the read
-  * The final mapping is generated by determining the transcripts that appear in all MMPs for the read
-
-</div>
-
-\
-
-Avi Srivastava *et al.* (2016) Bioinformatics 2016 Jun 15;32(12)
+Two essential steps
 
+1. Create transcriptome index
 
-## Abundance estimation
+* This makes downstream quasi-mapping and quantification step efficient and faster
+* Once you create an index, you can use it again and again
+* Consider the Kmer size (default 31 for >75bp)
+* Adding decoy sequences to filter contaminants 
+
+2. Quasi-mapping and quantification    
 
-* With the quasi-mapping method, the best mapping is determined for each read
-* After modeling sample-specific parameters and biases, salmon will generate transcript abundance estimates
-* A read that maps equally to more than one transcript will have its count divided among them (Isoform information not lost)
-* A variety of complex modeling approaches are used to estimate transcript abundances, including Expectation Maximization (EM), which corrects for sample-specific biases.
-  * GC bias
-  * Positional bias
-  * Fragment length bias
-  * Sequence-based bias
 
 
 ## Practical

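Review note on `Markdowns/03_Quantification_with_Salmon_introduction.Rmd`: the new "BAM/SAM file format" slide mentions header lines starting with "@" and 11 required fields per alignment. A minimal Python sketch of splitting one SAM line into those fields may help make this concrete. The field names follow the SAM specification linked on the slide; `SamRecord`, `parse_sam_line`, and the example read are hypothetical and only for illustration. Real analyses would normally use samtools or a dedicated library instead.

```python
from typing import List, NamedTuple, Optional


class SamRecord(NamedTuple):
    """The 11 mandatory SAM fields, plus any optional tags as raw strings."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (Phred+33 characters)
    tags: List[str]


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Return None for header lines (starting with '@'), else a SamRecord."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


# Invented example read, not taken from the course data.
example = "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
record = parse_sam_line(example)
```

Anything after the eleventh column is kept as an unparsed tag string, which reflects the slide's point that the extra content varies by aligner.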
diff --git a/Markdowns/03_Quantification_with_Salmon_introduction.html b/Markdowns/03_Quantification_with_Salmon_introduction.html
diff --git a/Markdowns/04_Quality_Control_introduction.Rmd b/Markdowns/04_Quality_Control_introduction.Rmd
@@ -1,6 +1,6 @@
 ---
 title: "QC of Aligned Reads"
-date: "March 2023"
+date: "October 2024"
 output:
   ioslides_presentation:
     css: css/stylesheet.css
@@ -15,7 +15,7 @@ output:
 
 <div style="line-height: 50%;"><br></div>
 
-<img src="images/workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">
+<img src="images/01s_workflow_3Day.svg" class="centerimg" style="width: 80%; margin-top: 60px;">
 
 ## QC of aligned reads
 
@@ -41,14 +41,14 @@ output:
 * Insert size is the length of the fragment of mRNA from which the reads are
 derived
 
-<img src="images/Insert_Size_QC.svg" class="centerimg">
-<img src="images/Insert_Size.svg" class="centerimg" style="width: 80%">
+<img src="images/04s_Insert_Size_QC.svg" class="centerimg">
+<img src="images/04s_Insert_Size.svg" class="centerimg" style="width: 80%">
 
 ## QC of aligned reads - Transcript coverage
 
 <div style="line-height: 50%;"><br></div>
 
-<img src="images/TranscriptCoverage.svg" class="centerimg" style="width: 80%">
+<img src="images/04s_TranscriptCoverage.svg" class="centerimg" style="width: 80%">
 
 ## QC Goals 
 

diff --git a/Markdowns/04_Quality_Control_introduction.html b/Markdowns/04_Quality_Control_introduction.html