Notes for lab meeting, 10/7/2016:

Paper: Conesa et al 2016. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016 Aug 26;17(1):181.

Overview:

  • RNA is the key intermediate between the genome and the proteome
  • RNA sequencing has many benefits: it lets us ask what is transcribed, where, and when; the transcriptome is also (usually) much smaller than the genome, so sequencing the whole transcriptome of a new organism can be cheaper than sequencing its whole genome
  • There is no single optimal pipeline; there are many to choose from depending on the question: mapping vs. de novo assembly, or small RNAs such as microRNAs (which play a gene-regulatory role)
  • Paper is meant as an outline of available resources
  • Important to review the options so that the data generated can answer your biological question, but also to save money
  • experimental design -> quality control -> read alignment vs. de novo assembly -> quantification of transcripts -> expression analysis (won't get into other types of analysis besides gene expression analysis)

Questions:

  • @jthmiller had questions about size selection during library prep

Experimental Design Considerations:

  • The roadmap in Figure 1 has some good considerations, though they did not mention contamination
  • stranded libraries vs. not?
  • How many reads are necessary per sample?
  • power analysis for optimal # samples, see http://scotty.genetics.utah.edu/
  • Make a saturation curve to check whether sequencing depth is sufficient (see the sketch after this list)
  • this paper was cited, looks good to read in more depth: http://genome.cshlp.org/content/21/12/2213.full.pdf+html
  • PE vs. single? "The cheaper, short SE reads are normally sufficient for studies of gene expression levels in well-annotated organisms."
  • Table 1, statistical power to detect differential expression varies with fold change, millions of reads, and # replicates (3, 5, and 10 replicates per group)
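
A minimal sketch of how a saturation curve could be made, assuming you already have a vector of per-gene read counts for one sample; the counts here are simulated stand-ins, and binomial thinning stands in for subsampling raw reads:

```python
# Saturation-curve sketch: how many genes are detected as sequencing depth grows?
# Assumption: `counts` stands in for real per-gene read counts from one sample.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.negative_binomial(n=0.5, p=0.05, size=20_000)  # simulated gene counts

fractions = np.linspace(0.05, 1.0, 20)
for f in fractions:
    sub = rng.binomial(counts, f)        # keep each read with probability f
    detected = int((sub > 0).sum())      # genes with at least one read left
    print(f"{f:4.0%} of reads -> {detected} genes detected")

# If the number of detected genes is still climbing steeply near 100%,
# deeper sequencing would likely reveal more expressed genes.
```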

QC

Mapping to a reference genome/transcriptome

  • mapping to a genome vs. a transcriptome: mapping rates are slightly lower with a transcriptome because unannotated transcripts are lost, and there are more multi-mappings because reads fall onto exons shared by different isoforms (bowtie1 allows a setting with only 1 mapping per read)
  • poor quality RNA starting material will show 3' bias
  • quality control after quantification to make sure ribosomal content doesn't bleed through (some will be there, but it should be small)
  • run PCA to make sure there are no batch effects or other funny stuff going on with your samples (see the sketch after this list)
  • Figure 2, representative software includes more options than those listed, see lecture by @rob-p http://robpatro.com/redesign/Quantification.pdf
  • Box 3 is excellent summary of mapping to a reference
  • I've never used any of the transcript discovery software mentioned (GRIT, CAGE, RAMPAGE, SLIDE, etc.). They make a point of saying it is difficult, not trivial, unless you are really looking for novel transcripts with a reference and don't want to do de novo assembly
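
A minimal PCA sketch for this kind of sample-level check, assuming a genes x samples count matrix; the matrix and batch labels below are simulated stand-ins for real data:

```python
# PCA on log-transformed counts to look for batch effects among samples.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
counts = rng.poisson(lam=20, size=(5000, 12)).astype(float)  # genes x samples (simulated)
batch = ["A"] * 6 + ["B"] * 6                                # hypothetical batch labels

logged = np.log2(counts + 1)            # simple variance-stabilizing transform
pca = PCA(n_components=2)
coords = pca.fit_transform(logged.T)    # transpose so samples are rows

print("variance explained:", pca.explained_variance_ratio_)
for label, (pc1, pc2) in zip(batch, coords):
    print(f"batch {label}: PC1={pc1:8.2f}  PC2={pc2:8.2f}")

# Samples clustering by batch rather than by biological group on PC1/PC2
# is the kind of "funny stuff" to catch before differential expression.
```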

De novo assembly

  • Takes all raw reads and assembles them into contigs (each contig should be a full-length transcript, in theory)
  • Software often produces a lot of contigs; there is a sweet spot between enough reads and too many (too many complicates the de Bruijn graph and can lead to misassembly)
  • recommend diginorm to eliminate redundancy: https://khmer-protocols.readthedocs.io/en/ctb/mrnaseq/2-diginorm.html

Note: de novo vs. mapping, results from student exercise at NGS2016: https://docs.google.com/spreadsheets/d/12X06LqGM8j4a4oV3_IsM91Hvop6B7L84xSWR71dXK3E/edit#gid=0

Here's the exercise by @mestato: http://angus.readthedocs.io/en/2016/arabidopsis_assembly_challenge.html

Transcript quantification:

  • endpoint of the analysis, what RNAseq is most commonly used for
  • general idea is to quantify # reads mapping onto each transcript
  • many ways to do this: Sailfish uses k-mer counting
  • HTSeq-count or featureCounts aggregate raw counts of mapped reads using a GTF file
  • "Raw read counts alone are not suficient to compare expression levels among samples, as these values are afected by factors such as transcript length, total # reads, sequencing biases"
  • FPKM or RPKM (fragments or reads per kilobase of exon model per million mapped reads) and TPM (transcripts per million) take transcript length into account (see the sketch after this list)
  • different differential expression packages require different input measurements
  • Important point: "it is necessary for correctly ranking gene expression levels within the sample to account for the fact that longer genes accumulate more reads"
  • In quantifying reads with de novo assembled contigs, however, you can't guarantee that a contig represents a unique transcript, so this is tricky
  • Normalization methods vary across packages. DESeq and edgeR will do normalization for you, and require your counts to be raw
  • compute expression based on discrete probability distributions
  • DESeq allows for additional variance, or dispersion, beyond the variance expected from random sampling (because it's not random, there is co-expression)
  • sampling variance of small read counts is taken into account (as with a small number of replicates)
  • Box 4 compares software tools
  • caution with small # replicates
  • limma is good (I've used this for microarrays)
  • DESeq (regarded as "too conservative") and edgeR (regarded as "too liberal") are relatively similar
  • see comparison in workshop setting: https://monsterbashseq.wordpress.com/2015/08/26/rnaseq-differential-expression-analysis-ngs2015/
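
A small worked sketch of the length and depth normalizations mentioned above; the counts and lengths are invented for illustration, and this is not any particular package's implementation:

```python
# TPM vs. FPKM/RPKM from raw counts and transcript lengths.
import numpy as np

counts = np.array([100., 500., 250., 50.])       # raw reads per transcript (made up)
lengths = np.array([1000., 2000., 500., 4000.])  # transcript lengths in bp (made up)

# TPM: divide by length first, then scale so the sample sums to one million
reads_per_kb = counts / (lengths / 1_000)
tpm = reads_per_kb / reads_per_kb.sum() * 1_000_000

# FPKM/RPKM: divide by library size (in millions) first, then by length (in kb)
fpkm = counts / (counts.sum() / 1_000_000) / (lengths / 1_000)

print("TPM :", np.round(tpm, 1))
print("FPKM:", np.round(fpkm, 1))

# TPM values always sum to one million within a sample, which makes relative
# expression easier to compare across samples than FPKM/RPKM.
```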

Visualization:

Future:

  • I see a very big future in using long reads to resolve full-length transcripts and overcome assembly problems