Microarray analysis

MA Platforms
File Formats
R packages
Normalizations
QC
Annotation with gene names
- Tx cluster vs. probe set level
DE analysis
Affymetrix proprietary software
Alternative splicing analysis
References

There are two types of MA platforms:

spotted array -- 2 colors
synthesized oligos -- 1 color (Affymetrix)

We have: GeneChip Human Transcriptome Array 2.0 (Affymetrix, now Thermo Fisher Scientific)

Gene Level plus Alternative Splicing
70% exon probes, 30% exon-exon spanning probes
additional files and manuals provided by Thermo Fisher

Typically used microarrays:

File formats of microarrays

.CEL: Expression Array feature intensity
.CDF:
- Chip definition file
- information relating probe pair sets to locations on the array ("mapping" of the probe to a gene annotation)
- in principle, these mappings can be updated

Packages

oligo
- supposed to replace affy for the more modern exon-based arrays
- for data import and preprocessing
- uses ExpressionSet
- the best intro I found was the github wiki
affy
- very comprehensive, but cannot read in HTA2.0 data
- affycoretools has some functions to streamline array analysis, but they don't seem particularly fast
- arrayQualityMetrics operates on AffyBatch
xps
- uses ROOT to speed up storage and retrieval
affyPLM:
- MAplot function will work on ExpressionSet

Turning fluorescence signal into biological signal

old MAs had mismatch probes to estimate the noise --> the RMA algorithm made those obsolete, so modern MAs only have perfect match (PM) probes

probeset = group of probes covering one gene

The data analyst can choose one of three definitions of probesets for summarization to the transcript level:

Core Probesets: supported by RefSeq and full-length mRNA GenBank records;
Extended Probesets: supported by non-full-length mRNA GenBank records, EST sequences, ENSEMBL gene collections and others;
Full Probesets: supported by gene and exon prediction algorithms only.

Which one to use?

Each gene annotation is constructed from transcript annotations from one or more confidence levels. Some parts of a gene annotation may derive from high confidence core annotations, while other parts derive from the lower confidence extended or full annotations. White Paper Probe Sets II

Normalization methods

MAS5

basically subtracts out mismatch probes

Tukey's biweight estimator to provide a robust mean signal, Wilcoxon signed-rank test for the detection p-value
bckg estimation: weighted average of the lowest 2% of the feature intensities
makes use of mismatch probes (applicable to HTA?)
linear scaling with trimmed mean
analyzes each array independently --> reduced power compared to the other methods
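
To make the "robust mean" idea concrete, here is a minimal base-R sketch of a one-step Tukey biweight. It is an illustration only, not the TAC implementation: the function name `tukey_biweight`, the tuning constants `c = 5` and `eps = 1e-4`, and the toy probe values are all assumptions.

```r
# One-step Tukey biweight: a robust mean that downweights outlying probes.
# c and eps are assumed tuning constants for this sketch.
tukey_biweight <- function(x, c = 5, eps = 1e-4) {
  m <- median(x)                             # robust center
  s <- median(abs(x - m))                    # median absolute deviation (unscaled)
  u <- (x - m) / (c * s + eps)               # scaled distance from the center
  w <- ifelse(abs(u) <= 1, (1 - u^2)^2, 0)   # bisquare weights, zero for far outliers
  sum(w * x) / sum(w)                        # weighted mean
}

# made-up log2 values for the probes of one probeset, with one outlier
probe_values <- c(7.9, 8.0, 8.1, 8.2, 8.0, 12.5)
robust_mean  <- tukey_biweight(probe_values)
plain_mean   <- mean(probe_values)
```

In this toy example the outlier receives weight zero, so `robust_mean` stays close to 8 while `plain_mean` is pulled upwards.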

info based on TAC User Manual, more details can be found in the slides of the Canadian Bioinfo Workshop 2012, pages 5-7

Robust Microarray Average (RMA)

RMA is a log-scale linear additive model that uses only perfect match probes and estimates the background mathematically (GCRMA additionally corrects for mismatch probes)

info from Carvalho 2016, RMA paper

Steps implemented in rma():

Background adjustment
- noise from cross-hybridization and optical noise from the scanning
- remove local artifacts so that measurements aren't so affected by their neighbors
- bckg noise = normal distribution
- true signal = exponential distribution that is probeset-specific
Quantile normalization
- remove array-specific effects
Summarization --> obtaining expression levels
- collapsing multiple probes per target into one signal
- note that "probes" will be represented by background-adjusted, quantile-normalized, log-transformed PM intensities
- probe affinity a_j and chip effect beta_i must be estimated:
  - RMA default method: Tukey's Median Polish strategy (robust and fast, but no standard error estimates)
  - fits iteratively; successively removing row and column medians, and accumulating the terms until the process stabilizes; the residuals are what is left at the end
  - alternative: fitting a linear model (Probe Level Model, PLM)
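
The quantile normalization and median polish steps above can be sketched in base R. This is a toy illustration, not what `rma()` does internally: the helper `quantile_normalize`, the intensity values, and the assignment of probes to probesets are all made up, and background adjustment is skipped.

```r
# Toy log2 PM intensities: rows = probes, columns = arrays (all values made up).
# Probes 1-3 are assumed to belong to probeset A, probes 4-6 to probeset B.
log_pm <- matrix(c(8.0, 9.0, 7.0, 5.0, 6.1, 5.5,   # array1
                   8.9, 9.6, 7.4, 5.2, 6.0, 5.9,   # array2
                   7.8, 8.7, 6.9, 6.3, 7.2, 6.6),  # array3
                 nrow = 6,
                 dimnames = list(paste0("probe", 1:6), paste0("array", 1:3)))

# Quantile normalization: force every array onto the same intensity distribution.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each value within its array
  sorted <- apply(x, 2, sort)                          # each array sorted ascending
  target <- rowMeans(sorted)                           # reference distribution
  out <- apply(ranks, 2, function(r) target[r])        # map ranks back to reference values
  dimnames(out) <- dimnames(x)
  out
}
normalized <- quantile_normalize(log_pm)

# Summarization of probeset A with Tukey's median polish (stats::medpolish):
# value = overall + probe effect (row) + array effect (col) + residual
probeset_a <- normalized[1:3, ]
fit <- medpolish(probeset_a, trace.iter = FALSE)

# expression estimate of probeset A on each array
expr_probeset_a <- fit$overall + fit$col
```

`fit$row` corresponds to the probe affinities and `fit$residuals` holds what is left after removing row and column medians; median polish provides no standard errors, which is why a PLM fit is needed for measures such as NUSE.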

PLIER is the proprietary (?) algorithm of Affymetrix/Thermo Fisher; Table taken from TAC Manual (Appendix)

White Paper Normalization | White Paper Probe Sets A | White Paper Probe Sets B

QC

According to McCall et al., 2011, the most useful QC measures for identifying poorly performing arrays are:

RLE
NUSE
percent present

Pseudo images

Chip pseudo-images are very useful for detecting spatial differences (artifacts) on the individual arrays (so not for comparing between arrays).

Pseudo-images are generated by fitting a probe-level model (PLM) to the data that assumes that all probes of a probe set behave the same in the different samples: probes that bind well to their target should do so on all arrays, probes that bind with low affinity should do so on all arrays.

You can create pseudo-images based on the residuals or the weights that result from a comparison of the model (the ideal data, without any noise) to the actual data. These weights or residuals may be graphically displayed using the image() function in Bioconductor (default: weights)

The model consists of a probe level (assuming that each probe should behave the same on all arrays) and an array level (taking into account that a gene can have different expression levels in different samples) parameter.

info from wiki.bits

Histograms of log2 intensity

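	# ph is assumed to hold the phenoData of the raw data object, e.g. ph = data@phenoData
	for (i in 1:6) {
		# one histogram of the log2 PM intensities per array
		hist(data[, i], lwd = 2, which = 'pm', ylab = 'Density', xlab = 'Log2 intensities',
		     main = ph@data$sample[i])
	}

	# ggplot2 way: start from the matrix of PM intensities
	pmexp = pm(data)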

Boxplots of log2 intensity per sample

pmexp = log2(pm(data))

Boxplots of log2 intensity per GC probe

MA plots

MA plots were developed for two-color arrays to detect differences between the two color labels on the same array, and for these arrays they became hugely popular. Affymetrix arrays use only a single color label, so people adapted the plot by comparing each Affymetrix array to a pseudo-array. The pseudo-array consists of the median intensity of each probe over all arrays.

The MA plot shows to what extent the variability in expression depends on the expression level (more variation on high expression values?). In an MA-plot, A is plotted versus M:

M = difference between the intensity of a probe on the array and the median intensity of that probe over all arrays
A = average of the intensity of a probe on that array and the median intensity of that probe over all arrays; A = (logPMInt_array + logPMInt_medianarray)/2
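
The definitions of M and A translate directly into a few lines of base R. This is a minimal sketch on made-up log2 PM intensities; the matrix `log_pm` and its values are hypothetical.

```r
# Made-up log2 PM intensities: rows = probes, columns = arrays.
log_pm <- matrix(c( 6.1,  8.4, 10.2, 12.0, 7.3,   # array1
                    6.4,  8.1, 10.9, 12.6, 7.0,   # array2
                    5.9,  8.6, 10.0, 11.5, 7.5),  # array3
                 nrow = 5,
                 dimnames = list(paste0("probe", 1:5), paste0("array", 1:3)))

# pseudo-array: median intensity of each probe over all arrays
pseudo_array <- apply(log_pm, 1, median)

# M: difference between each array and the pseudo-array (per probe)
M <- log_pm - pseudo_array
# A: average of each array and the pseudo-array (per probe)
A <- (log_pm + pseudo_array) / 2
```

An MA plot for the first array is then `plot(A[, 1], M[, 1])`, optionally with `abline(h = 0)` and a smoother such as `lines(lowess(A[, 1], M[, 1]))`.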

Ideally, the cloud of data points should be centered around M=0 (blue line). This is because we assume that the majority of the genes is not DE and that the number of upregulated genes is similar to the number of downregulated genes. Additionally, the variability of the M values should be similar for different A values (average intensities). You see that the spread of the cloud increases with the average intensity: the loess curve (red line) moves further and further away from M=0 when A increases. To remove (some of) this dependency, we will normalize the data.

for (i in 1:6)
{
  name = paste("MAplot", i, ".jpg", sep = "")
  jpeg(name)
  # MA plot for array i; with ref left at its default, the reference should be
  # the pseudo-array (median over all arrays)
  affyPLM::MAplot(eset.Dilution, which = i, plot.method = "smoothScatter")
  # to compare against specific arrays instead, pass them via ref, e.g. ref = c(1, 2);
  # if multiple ref are given, these samples will be used to calculate the median
  # sample names work as well, e.g. which = "20A", ref = c("20A", "20B")
  dev.off()
}

Relative expression boxplot (RLE)

How much is the expression of a probe spread out relative to the same probe on other arrays?

large spread of RLE indicates large number of DE genes
Computed for each probeset by comparing the expression value on each array against the median expression value for that probeset across all arrays.
Ideally: most RLE values should be around zero.
does not depend on RMA model

see affyPLM
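
As a rough illustration of what the RLE computation amounts to, here is a base-R sketch on simulated expression values. It is not the affyPLM implementation; the data are random, and the fourth array is deliberately made noisier so that it stands out.

```r
set.seed(42)
n_probesets <- 200

# simulated log2 expression: one baseline per probeset plus array-specific noise
baseline <- runif(n_probesets, min = 4, max = 12)
expr <- cbind(array1 = baseline + rnorm(n_probesets, sd = 0.1),
              array2 = baseline + rnorm(n_probesets, sd = 0.1),
              array3 = baseline + rnorm(n_probesets, sd = 0.1),
              array4 = baseline + rnorm(n_probesets, sd = 1.0))  # assumed low-quality array

# RLE: expression on each array minus the probeset median across all arrays
probeset_median <- apply(expr, 1, median)
rle_values <- sweep(expr, 1, probeset_median, "-")

# per-array summaries: the median should be near zero, the IQR small
rle_median <- apply(rle_values, 2, median)
rle_iqr    <- apply(rle_values, 2, IQR)
```

`boxplot(rle_values)` gives the usual RLE plot; in this simulation the box for `array4` is much wider than the others.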

Normalized unscaled standard error (NUSE)

How much is the variability of probes within a gene spread out relative to probes of the same gene on other arrays?

see affyPLM
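
The normalization step behind NUSE can be sketched in base R once standard errors are available. The matrix `se` below is entirely made up; in practice the standard errors would come from a probe-level model fit (for example via affyPLM), not from hand-typed numbers.

```r
# Hypothetical standard errors of the probeset estimates: rows = probesets, columns = arrays.
se <- matrix(c(0.10, 0.20, 0.15,   # array1
               0.11, 0.19, 0.16,   # array2
               0.10, 0.21, 0.14,   # array3
               0.20, 0.40, 0.31),  # array4 (assumed low-quality array)
             nrow = 3,
             dimnames = list(paste0("probeset", 1:3), paste0("array", 1:4)))

# NUSE: divide each standard error by the median standard error of that
# probeset across arrays, so that a typical array sits around 1
nuse <- se / apply(se, 1, median)

# per-array summary; arrays with a median well above 1 deserve a closer look
nuse_median <- apply(nuse, 2, median)
```

In this toy example `array4` has a median NUSE close to 2, while the other arrays sit near 1.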

QC stat plot

see simpleaffy documentation

Parameter	Meaning
x	A QCStats object
fc.line.col	The colour to mark fold change lines with
sf.ok.region	The colour to mark the region in which scale factors lie within appropriate bounds
chip.label.col	The colour to label the chips with
sf.thresh	Scale factors must be within this fold-range
gdh.thresh	Gapdh ratios must be within this range
ba.thresh	beta actin must be within this range
present.thresh	The percentage of genes called present must lie within this range
bg.thresh	Array backgrounds must lie within this range
label	What to call the chips
main	The title for the plot
usemid	If true use 3'/M ratios for the GAPDH and beta actin probes
cex	Value to scale character size by (e.g. 0.5 means that the text should be plotted half size)
...	Other parameters to pass through to

lines = arrays, from the 0-fold line to the point that corresponds to its MAS5 scale factor. Affymetrix recommend that scale factors should lie within 3-fold of each other.
points: GAPDH and beta-actin 3'/5' ratios. Affy states that beta actin should be within 3, gapdh around 1. Any that fall outside these thresholds (1.25 for gapdh) are coloured red; the rest are blue.
number of genes called present on each array vs. the average background. These will vary according to the samples being processed, and Affy's QC suggests simply that they should be similar. If any chips have significantly different values this is flagged in red, otherwise the numbers are displayed in blue. By default, values count as similar if %-present are within 10% of each other and background intensities are within 20 units of each other. These last numbers are somewhat arbitrary and may need some tweaking to find values that suit the samples you're dealing with, and the overall nature of your setup.
BioB = spike-in; if not present on a chip, this will be flagged by printing 'BioB' in red; this is a control for the hybridization step

Source of variation

which attribute explains most of the variation (page 82f.)

Determine the fraction of the total variation of the samples that can be explained by a given attribute:

compute variance of each probeset
retain the 1000 probesets having the highest variance
Accumulate the total sum of squares for each attribute
Accumulate the residual sum of squares for each attribute (summing over the samples within each attribute level).
The fraction of variance explained for the attribute is the mean of the fraction explained over all of the probesets.
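
The steps above can be sketched in base R as follows. This is my reading of the procedure, not the TAC implementation: the function `fraction_explained`, the simulated data, and the two example attributes (`treatment`, `batch`) are assumptions for illustration.

```r
set.seed(1)
n_probesets <- 2000
treatment <- factor(rep(c("control", "treated"), each = 4))   # hypothetical attribute
batch     <- factor(rep(c("a", "b"), times = 4))              # hypothetical attribute

# simulated log2 expression; the first 500 probesets respond to treatment
expr <- matrix(rnorm(n_probesets * 8, mean = 8, sd = 0.3), nrow = n_probesets)
expr[1:500, treatment == "treated"] <- expr[1:500, treatment == "treated"] + 2

fraction_explained <- function(expr, attribute, n_top = 1000) {
  # keep the probesets with the highest variance
  probeset_var <- apply(expr, 1, var)
  top <- order(probeset_var, decreasing = TRUE)[seq_len(min(n_top, nrow(expr)))]
  x <- expr[top, , drop = FALSE]

  # total sum of squares per probeset
  total_ss <- rowSums((x - rowMeans(x))^2)

  # residual sum of squares: deviations from the mean of each attribute level
  level_means <- sapply(levels(attribute),
                        function(l) rowMeans(x[, attribute == l, drop = FALSE]))
  fitted <- level_means[, as.character(attribute), drop = FALSE]
  residual_ss <- rowSums((x - fitted)^2)

  # mean fraction explained over the retained probesets
  mean(1 - residual_ss / total_ss)
}

frac_treatment <- fraction_explained(expr, treatment)
frac_batch     <- fraction_explained(expr, batch)
```

In this simulation `frac_treatment` is clearly larger than `frac_batch`, because only the treatment attribute was given a real effect.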

Annotating probes with gene names

Thermo Fisher provides data bases with the mappings here

AnnotationDbi seems to be the native R way to do this.

For an overview of all bioconductor-hosted annotation data bases, see here. For HTA2.0, there are two options: transcript clusters and probe sets

probe sets: for HTA2.0, a probe set is more or less an exon, but not quite
- old Exon ST arrays had four-probe probesets (e.g., four 25-mers that were summarized to estimate the expression of a 'probe set region', or PSR). A PSR was some or all of an exon, so it wasn't even that clear what you were measuring. If the exon was long, there might have been multiple PSRs for the exon, or if it was short maybe only one.
- when you summarize at the probeset level on the HTA arrays, you are summarizing all the probes in a probeset, which may measure a PSR, or may also summarize a set of probes that are supposed to span an exon-exon junction
- analyzing the data at this level is very complex: any significantly differentially expressed PSR or JUC (junction probe) just says something about a little chunk of the gene, and what that then means in the larger context of the gene is something that you have to explore further.
transcript clusters: contain all probe sets of a transcript
- there may be multiple transcript probesets for a given gene
- given the propensity for Affy to re-use probes in the later versions of arrays, the multiple probesets for a given gene may well include some of the same probes!
- the transcript level probesets provide some relative measure of the underlying transcription level of a gene
- different probesets for the same gene may measure different splice variants.

Ref1, Ref2

Stephen Turner has a blog entry on how to do the annotation before the limma analysis; he uses transcript clusters (= gene-level analysis)

DE Analysis

A very good summary of all the most important steps is given by James MacDonald at biostars.

library(oligo)
dat <- read.celfiles(list.celfiles())
eset <- rma(dat)

## you can then get rid of background probes and annotate using functions in my affycoretools package
library(affycoretools)
library(hta20transcriptcluster.db)
eset.main <- getMainProbes(eset, pd.hta.2.0)
eset.main <- annotateEset(eset.main, hta20transcriptcluster.db)

For probe-set level analysis (see caveats above!):

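library(hta20probeset.db)
## summarize at the probe-set level instead of the transcript-cluster level
eset <- rma(dat, target = "probeset")
eset.main <- getMainProbes(eset, pd.hta.2.0)
eset.main <- annotateEset(eset.main, hta20probeset.db)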

Affymetrix' TAC

Affymetrix' software (Windows only)
- User manual
uses the following R packages:
- Apcluster - affinity propagation clustering
- Dbscan - density based clustering of applications with noise
- Rtsne
- limma
offers the following normalization methods:
- RMA
- MAS5
- Plier PM-MM
QC:
- Thermo Fisher White Paper
- PCA

Alternative splicing

EventPointer
- R vignette
- original paper
- code at github
- Example Data including GTF file

References

JR Stevens 2012
Canadian Bioinfo Workshop on Microarrays
