Microarray analysis


There are two types of MA platforms:

  • spotted array -- 2 colors
  • synthesized oligos -- 1 color (Affymetrix)

We have: GeneChip Human Transcriptome Array 2.0 (HTA 2.0; Affymetrix, now Thermo Fisher Scientific)

  • Gene Level plus Alternative Splicing
  • 70% exon probes, 30% exon-exon spanning probes
  • additional files and manuals provided by Thermo Fisher

Typically used microarrays:

from https://bioinformatics.cancer.gov/sites/default/files/course_material/Btep-R-microA-presentation-Jan-Feb-2015.pdf

File formats of microarrays

  • .CEL: Expression Array feature intensity
  • .CDF:
    • Chip definition file
    • information relating probe pair sets to locations on the array ("mapping" of the probe to a gene annotation)
    • in principle, these mappings can be updated

Packages

  • oligo

    • supposed to replace affy for the more modern exon-based arrays
    • for data import and preprocessing
    • uses ExpressionSet
    • the best intro I found was the github wiki
  • affy

    • very comprehensive, but cannot read in HTA2.0 data
    • affycoretools has some functions to streamline array analysis, but they don't seem particularly fast
    • arrayQualityMetrics operates on AffyBatch
  • xps

    • uses ROOT to speed up storage and retrieval
  • affyPLM:

    • MAplot function will work on ExpressionSet

Turning fluorescence signal into biological signal

old MAs had mismatch probes to estimate the noise --> the RMA algorithm made those obsolete, so modern MAs only have perfect match (PM) probes

probeset = group of probes covering one gene

The data analyst can choose one of three probeset definitions for summarization to the transcript level:

  1. Core Probesets: supported by RefSeq and full-length mRNA GenBank records;
  2. Extended Probesets: supported by non-full-length mRNA GenBank records, EST sequences, ENSEMBL gene collections and others;
  3. Full Probesets: supported by gene and exon prediction algorithms only.

Which one to use?

WhitePaper probe sets

Each gene annotation is constructed from transcript annotations from one or more confidence levels. Some parts of a gene annotation may derive from high confidence core annotations, while other parts derive from the lower confidence extended or full annotations. White Paper Probe Sets II

Normalization methods

MAS5

basically subtracts out mismatch probes

  • Tukey's biweight estimator to provide a robust mean signal, Wilcoxon signed-rank test for the detection p-value
  • bckg estimation: weighted average of the lowest 2% of the feature intensities
  • makes use of mismatch probes (applicable to HTA?)
  • linear scaling with trimmed mean
  • analyzes each array independently --> reduced power compared to the other methods

info based on TAC User Manual, more details can be found in the slides of the Canadian Bioinfo Workshop 2012, pages 5-7
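The robust mean mentioned above can be illustrated with a one-step Tukey biweight in base R. This is a minimal sketch, not the TAC implementation: the function name `tukey_biweight`, the toy values, and the tuning constants `c = 5` and `eps = 1e-4` are assumptions chosen for illustration.

```r
# One-step Tukey biweight: a weighted mean in which values far from the
# median get little or no weight (constants c and eps are assumptions).
tukey_biweight <- function(x, c = 5, eps = 1e-4) {
  m <- median(x)
  s <- median(abs(x - m))                   # unscaled median absolute deviation
  u <- (x - m) / (c * s + eps)              # scaled distance from the median
  w <- ifelse(abs(u) < 1, (1 - u^2)^2, 0)   # weight 0 for clear outliers
  sum(w * x) / sum(w)
}

# toy log2 values of the probes in one probe set, with one outlier
probe_vals <- c(7.9, 8.0, 8.1, 8.2, 8.0, 12.5)
tb <- tukey_biweight(probe_vals)

tb                 # robust summary, close to the bulk of the values
mean(probe_vals)   # the plain mean is pulled up by the outlier
```

The outlying probe receives weight zero, so the summary stays within the range of the remaining five probes.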

Robust Multi-array Average (RMA)

RMA is a log-scale linear additive model that uses only perfect match probes and estimates the background from a statistical model instead of from mismatch probes (GCRMA additionally uses probe sequence information for the background correction)

info from Carvalho 2016, RMA paper

Steps implemented in rma():

  1. Background adjustment
    • noise from cross-hybridization and optical noise from the scanning
    • remove local artifacts so that measurements aren't so affected by their neighbors
    • bckg noise = normal distribution
    • true signal = exponential distribution that is probeset-specific
  2. Quantile normalization
    • remove array-specific effects
  3. Summarization --> obtaining expression levels
    • collapsing multiple probes per target into one signal
    • note that "probes" will be represented by background-adjusted, quantile-normalized, log-transformed PM intensities
    • rma
    • probe affinity a_j and chip effect beta_i must be estimated:
      • RMA default method: Tukey's Median Polish strategy (robust and fast, but no standard error estimates)
      • fits iteratively; successively removing row and column medians, and accumulating the terms until the process stabilizes; the residuals are what is left at the end
      • median polish
      • alternative: fitting a linear model (Probe Level Model, PLM)
      • PLM
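Steps 2 and 3 above can be sketched in base R on toy data. This is a simplified illustration, not the code behind `rma()`: the matrix `pm_adj`, the probe set labels, and the helper `quantile_normalize` are hypothetical, and the background adjustment of step 1 is assumed to have been done already.

```r
set.seed(1)
# toy data (assumption): 3 probe sets x 5 probes each, 4 arrays,
# background-adjusted PM intensities on the raw scale
probeset <- rep(paste0("ps", 1:3), each = 5)
pm_adj <- matrix(rexp(60, rate = 1 / 200) + 50, ncol = 4,
                 dimnames = list(NULL, paste0("array", 1:4)))
pm_adj[, 4] <- pm_adj[, 4] * 1.5   # array 4 is brighter overall

# step 2, quantile normalization: every array gets the same distribution
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))   # mean of the k-th smallest values
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}
pm_norm <- quantile_normalize(pm_adj)

# step 3, summarization with Tukey's median polish on log2 values:
# log2(PM) = overall + probe effect (row) + chip effect (column) + residual
expr <- sapply(split(seq_len(nrow(pm_norm)), probeset), function(idx) {
  mp <- medpolish(log2(pm_norm[idx, ]), trace.iter = FALSE)
  mp$overall + mp$col                     # one value per array
})
expr <- t(expr)                           # probe sets in rows, arrays in columns
expr
```

After normalization the sorted intensities of all arrays are identical, which removes the overall brightness difference of array 4 before the probes are collapsed.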

Comparison of correction and normalization approaches

PLIER is the proprietary (?) algorithm of Affymetrix/Thermo Fisher; Table taken from TAC Manual (Appendix)

White Paper Normalization | White Paper Probe Sets A | White Paper Probe Sets B

QC

According to McCall et al., 2011, the most useful QC measures for identifying poorly performing arrays are:

  • RLE
  • NUSE
  • percent present

Pseudo images

Chip pseudo-images are very useful for detecting spatial differences (artifacts) on the individual arrays (so not for comparing between arrays).

Pseudo-images are generated by fitting a probe-level model (PLM) to the data that assumes that all probes of a probe set behave the same in the different samples: probes that bind well to their target should do so on all arrays, probes that bind with low affinity should do so on all arrays.

You can create pseudo-images based on the residuals or the weights that result from a comparison of the model (the ideal data, without any noise) to the actual data. These weights or residuals may be graphically displayed using the image() function in Bioconductor (default: weights)

The model consists of a probe level (assuming that each probe should behave the same on all arrays) and an array level (taking into account that a gene can have different expression levels in different samples) parameter.

info from wiki.bits
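The idea of a probe-level parameter plus an array-level parameter can be sketched for a single probe set in base R. Note the assumptions: the toy data are simulated, and ordinary least squares via `lm()` is swapped in for the robust fit that a real PLM uses, so this only illustrates where the residuals come from.

```r
set.seed(6)
# toy probe set (assumption): 6 probes x 4 arrays, log2 scale
probe_eff <- c(-1, -0.5, 0, 0.2, 0.5, 0.8)   # probe-level parameters
array_eff <- c(7, 7.5, 8, 7.2)               # array-level parameters
y <- outer(probe_eff, array_eff, "+") + rnorm(24, sd = 0.1)
y[3, 2] <- y[3, 2] + 2                       # simulated artifact: probe 3 on array 2

d <- data.frame(y     = as.vector(y),
                probe = factor(as.vector(row(y))),
                array = factor(as.vector(col(y))))

# additive model: log2 intensity = probe effect + array effect + residual
fit <- lm(y ~ probe + array, data = d)
res <- matrix(residuals(fit), nrow = nrow(y))

round(res, 2)   # the largest residual sits at probe 3, array 2
```

On a real chip, residuals like these are drawn at the physical position of each probe, so a cluster of large residuals shows up as a visible blotch.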

Histograms of log2 intensity

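	# one histogram of the log2 PM intensities per array
	# (data = the raw array object, ph = its phenoData)
	for (i in 1:6) {
		hist(data[, i], lwd = 2, which = 'pm', ylab = 'Density',
		     xlab = 'Log2 intensities', main = ph@data$sample[i])
	}

	# ggplot2 way: start from the matrix of PM intensities
	pmexp = pm(data)
	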

Boxplots of log2 intensity per sample

pmexp = log2(pm(data))

Boxplots of log2 intensity per GC probe

from Affy's White Paper

MA plots

MA plots were developed for two-color arrays to detect differences between the two color labels on the same array, and for these arrays they became hugely popular. Affymetrix arrays use only a single color label, so there the plot is used to compare each array to a pseudo-array, which consists of the median intensity of each probe over all arrays.

The MA plot shows to what extent the variability in expression depends on the expression level (more variation on high expression values?). In an MA plot, M (y-axis) is plotted against A (x-axis):

  • M = difference between the intensity of a probe on the array and the median intensity of that probe over all arrays
  • A = average of the intensity of a probe on that array and the median intensity of that probe over all arrays; A = (logPMInt_array + logPMInt_medianarray)/2
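The two definitions translate directly into base R. A minimal sketch on simulated data; the matrix `log_pm` and the choice of array `i` are assumptions made for illustration.

```r
set.seed(2)
# toy matrix of log2 PM intensities (assumption): 1000 probes x 4 arrays
log_pm <- matrix(rnorm(4000, mean = 8, sd = 1.5), ncol = 4,
                 dimnames = list(NULL, paste0("array", 1:4)))

# pseudo-array: median intensity of each probe over all arrays
pseudo <- apply(log_pm, 1, median)

i <- 2                             # array to inspect
M <- log_pm[, i] - pseudo          # difference to the pseudo-array
A <- (log_pm[, i] + pseudo) / 2    # average with the pseudo-array

plot(A, M, pch = ".", main = colnames(log_pm)[i])
abline(h = 0, col = "blue")        # where the cloud should be centered
lines(lowess(A, M), col = "red")   # trend of M along A
```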

MA plot

Ideally, the cloud of data points should be centered around M=0 (blue line), because we assume that the majority of the genes are not DE and that the number of upregulated genes is similar to the number of downregulated genes. Additionally, the variability of the M values should be similar for different A values (average intensities). Here, the spread of the cloud increases with the average intensity, and the loess curve (red line) moves further and further away from M=0 as A increases. To remove (some of) this dependency, we will normalize the data.

for (i in 1:6)
{
name = paste("MAplot", i, ".jpg", sep = "")
jpeg(name)
# MA plot comparing array i to the reference given in ref
# if multiple ref are given, these samples will be used to calculate the median
# sample names work as well, e.g. which="20A", ref=c("20A","20B")
affyPLM::MAplot(eset.Dilution, which = i, ref = c(1, 2), plot.method = "smoothScatter")
dev.off()
}

Relative expression boxplot (RLE)

How much is the expression of a probe spread out relative to the same probe on other arrays?

  • large spread of RLE indicates large number of DE genes
  • Computed for each probeset by comparing the expression value on each array against the median expression value for that probeset across all arrays.
  • Ideally: most RLE values should be around zero.
  • does not depend on RMA model

see affyPLM
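The RLE computation described above is simple enough to sketch in base R. The expression matrix is simulated, and array 5 is deliberately made worse than the others; all names are hypothetical.

```r
set.seed(3)
# toy expression matrix on the log2 scale (assumption): 500 probe sets x 5 arrays
expr <- matrix(rnorm(2500, mean = 7, sd = 1), ncol = 5,
               dimnames = list(NULL, paste0("array", 1:5)))
expr[, 5] <- expr[, 5] + rnorm(500, mean = 0.8, sd = 1)   # one poor array

# RLE: expression value minus the median of that probe set across all arrays
rle_vals <- sweep(expr, 1, apply(expr, 1, median), "-")

boxplot(rle_vals, ylab = "RLE", las = 2)
abline(h = 0, lty = 2)

rle_median <- apply(rle_vals, 2, median)   # should be close to 0
rle_iqr    <- apply(rle_vals, 2, IQR)      # a wide box hints at a problem
```

In this toy example array 5 stands out with both a shifted median and a wider box.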

RLE

Normalized unscaled standard error (NUSE)

How much is the variability of probes within a gene spread out relative to probes of the same gene on other arrays?

see affyPLM
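Given the standard errors of the probe set estimates from a probe-level model fit, the NUSE scaling itself takes one line. A minimal sketch in base R; the standard errors here are simulated rather than taken from a real fit, and array 3 is made noisier on purpose.

```r
set.seed(4)
# toy standard errors of probe set estimates (assumption): 500 probe sets x 5 arrays
se <- matrix(rgamma(2500, shape = 20, rate = 200), ncol = 5,
             dimnames = list(NULL, paste0("array", 1:5)))
se[, 3] <- se[, 3] * 1.3   # array 3 is noisier than the rest

# NUSE: standard error divided by the median standard error
# of that probe set across all arrays
nuse <- sweep(se, 1, apply(se, 1, median), "/")

boxplot(nuse, ylab = "NUSE", las = 2)
abline(h = 1, lty = 2)     # good arrays are centered around 1

nuse_median <- apply(nuse, 2, median)
```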

NUSE

QC stat plot

see simpleaffy documentation

| Parameter | Meaning |
| --- | --- |
| x | A QCStats object |
| fc.line.col | The colour to mark fold change lines with |
| sf.ok.region | The colour to mark the region in which scale factors lie within appropriate bounds |
| chip.label.col | The colour to label the chips with |
| sf.thresh | Scale factors must be within this fold-range |
| gdh.thresh | Gapdh ratios must be within this range |
| ba.thresh | beta actin must be within this range |
| present.thresh | The percentage of genes called present must lie within this range |
| bg.thresh | Array backgrounds must lie within this range |
| label | What to call the chips |
| main | The title for the plot |
| usemid | If true use 3'/M ratios for the GAPDH and beta actin probes |
| cex | Value to scale character size by (e.g. 0.5 means that the text should be plotted half size) |
| ... | Other parameters to pass through to |

qc plot

  • lines = arrays, from the 0-fold line to the point that corresponds to its MAS5 scale factor. Affymetrix recommend that scale factors should lie within 3-fold of each other.

  • points: GAPDH and beta-actin 3'/5' ratios. Affy states that beta actin should be within 3, gapdh around 1. Any that fall outside these thresholds (1.25 for gapdh) are coloured red; the rest are blue.

  • number of genes called present on each array vs. the average background. These will vary according to the samples being processed, and Affy's QC suggests simply that they should be similar. If any chips have significantly different values this is flagged in red, otherwise the numbers are displayed in blue. By default, values count as similar when %-present are within 10% of each other and background intensities within 20 units. These last numbers are somewhat arbitrary and may need some tweaking to find values that suit the samples you're dealing with, and the overall nature of your setup.

  • BioB = spike-in; if not present on a chip, this will be flagged by printing 'BioB' in red; this is a control for the hybridization step

Source of variation

which attribute explains most of the variation (page 82f.)

Determine the fraction of the total variation of the samples that can be explained by a given attribute:

  1. compute variance of each probeset
  2. retain the 1000 probesets having the highest variance
  3. Accumulate the total sum of squares for each attribute
  4. The residual sum of squares (where the sum over j represents the sum over samples within the attribute level) is accumulated.
  5. The fraction of variance explained for the attribute is the mean of the fraction explained over all of the probesets.
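One way to read the steps above is as a per-probe-set ratio of sums of squares. The sketch below is an interpretation in base R, not the procedure from the cited manual: the simulated data, the attributes `group` and `batch`, and the helper `fraction_explained` are all assumptions.

```r
set.seed(5)
n_probesets <- 5000
group <- factor(rep(c("ctrl", "treated"), each = 4))   # attribute of interest
batch <- factor(rep(c("b1", "b2"), times = 4))         # a second attribute

expr <- matrix(rnorm(n_probesets * 8, mean = 7), ncol = 8)
# simulate a group effect in the first 500 probe sets
expr[1:500, group == "treated"] <- expr[1:500, group == "treated"] + 3

# steps 1 and 2: keep the 1000 probe sets with the highest variance
v   <- apply(expr, 1, var)
top <- expr[order(v, decreasing = TRUE)[1:1000], ]

# steps 3 to 5: per probe set 1 - RSS/TSS, then the mean over probe sets
fraction_explained <- function(x, attribute) {
  frac <- apply(x, 1, function(y) {
    tss <- sum((y - mean(y))^2)             # total sum of squares
    rss <- sum((y - ave(y, attribute))^2)   # deviations from the level means
    1 - rss / tss
  })
  mean(frac)
}

fe_group <- fraction_explained(top, group)
fe_batch <- fraction_explained(top, batch)
c(group = fe_group, batch = fe_batch)
```

In this simulation `group` explains far more of the variation than `batch`, because only `group` carries a simulated effect.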

Annotating probes with gene names

Thermo Fisher provides data bases with the mappings here

AnnotationDbi seems to be the native R way to do this.

For an overview of all bioconductor-hosted annotation data bases, see here. For HTA2.0, there are two options: transcript clusters and probe sets

  • probe sets: for HTA2.0, a probe set is more or less an exon, but not quite
    • old Exon ST arrays had four-probe probesets (e.g., four 25-mers that were summarized to estimate the expression of a 'probe set region', or PSR). A PSR was some or all of an exon, so it wasn't even that clear what you were measuring. If the exon was long, there might have been multiple PSRs for the exon, or if it was short maybe only one.
    • when you summarize at the probeset level on the HTA arrays, you are summarizing all the probes in a probeset, which may measure a PSR, or may also summarize a set of probes that are supposed to span an exon-exon junction
    • analyzing the data at this level is very complex: any significantly differentially expressed PSR or JUC (junction probe) just says something about a little chunk of the gene, and what that then means in the larger context of the gene is something that you have to explore further.
  • transcript clusters: contain all probe sets of a transcript
    • there may be multiple transcript probesets for a given gene
    • given the propensity for Affy to re-use probes in the later versions of arrays, the multiple probesets for a given gene may well include some of the same probes!
    • the transcript level probesets provide some relative measure of the underlying transcription level of a gene
    • different probesets for the same gene may measure different splice variants.

Ref1, Ref2

Stephen Turner has a blog entry on how to do the annotation before the limma analysis; he uses transcript clusters (= gene-level analysis)

DE Analysis

A very good summary of all the most important steps is given by James MacDonald at biostars.

library(oligo)
dat <- read.celfiles(list.celfiles())
eset <- rma(dat)

## you can then get rid of background probes and annotate using functions in my affycoretools package
library(affycoretools)
library(hta20transcriptcluster.db)
eset.main <- getMainProbes(eset, pd.hta.2.0)
eset.main <- annotateEset(eset.main, hta20transcriptcluster.db)

For probe-set level analysis (see caveats above!):

library(hta20probeset.db)
eset <- rma(dat, target = "probeset")
eset.main <- getMainProbes(eset, pd.hta.2.0)
eset.main <- annotateEset(eset.main, hta20probeset.db)

Affymetrix' TAC

  • Affymetrix' software (Windows only)
  • uses the following R packages:
    • apcluster - affinity propagation clustering
    • dbscan - density-based spatial clustering of applications with noise
    • Rtsne
    • limma
  • offers the following normalization methods:
    • RMA
    • MAS5
    • Plier PM-MM
  • QC:

Alternative splicing

References