modelbased_normalization.Rmd

---
title: "**Normalization** strategy comparison for **Model-based** analysis of isobarically labeled proteomic data."
author: "Piotr Prostko, Joris Van Houtven"
date: '`r format(Sys.time(), "%B %d, %Y,%H:%M")`'
output: 
  html_document:
    toc: true
    toc_depth: 2
    toc_float: true
    number_sections: true
    theme: flatly
    code_folding: "hide"
editor_options: 
  chunk_output_type: console
params:
  input_data_p: 'data/input_data.rds'
  suffix_p: ''
  load_outputdata_p: FALSE
  save_outputdata_p: FALSE
  subsample_p: 0
---

```{r, setup, include=FALSE}
# default knitr options for all chunks
knitr::opts_chunk$set(
  message=FALSE,
  warning=FALSE,
  fig.width=12,
  fig.height=7
)
```

<span style="color: grey;">
_This notebook is one in a series of many, where we explore how different data analysis strategies affect the outcome of a proteomics experiment based on isobaric labeling and mass spectrometry. Each analysis strategy or 'workflow' can be divided up into different components; it is recommended that you read more about that in the [introduction notebook](intro.html)._
</span>

In this notebook specifically, we investigate the effect of varying the **Normalization** component on the outcome of the differential expression results. The three component variants are three different linear mixed-effects model specifications, termed shortly **LMM1**, **LMM2**, and **LMM3**. More details will be provided in the normalization section.

<span style="color: grey;">
_The R packages and helper scripts necessary to run this notebook are listed in the next code chunk: click the 'Code' button. Each code section can be expanded in a similar fashion. You can also download the [entire notebook source code](modelbased_normalization.Rmd)._
</span>

```{r}
library(caret)
library(lme4)
library(lmerTest)
library(ggplot2)
library(stringi)
library(gridExtra)
library(ggfortify)
library(dendextend)
library(psych)
library(kableExtra)
library(tidyverse)
library(dtplyr)
source('util/other_functions.R')
source('util/plotting_functions.R')
```

Let's load our PSM-level data set:

```{r}
data.list <- readRDS(params$input_data_p)
dat.l <- data.list$dat.l # data in long format
dat.w <- data.list$dat.w # data in wide format
display_dataframe_head(dat.l)
```

After the filtering done in `data_prep.R`, there are 19 UPS1 proteins remaining, even though 48 were originally spiked in.

```{r}
# which proteins were spiked in?
spiked.proteins <- dat.l %>% distinct(Protein) %>% filter(stri_detect(Protein, fixed='ups')) %>% pull %>% as.character
tmp <- dat.l %>% distinct(Protein) %>% pull %>% as.character
# protein subsampling
if (params$subsample_p>0 && params$subsample_p==floor(params$subsample_p) && params$subsample_p<=length(tmp)){
  sub.prot <- tmp[sample(1:length(tmp), size=params$subsample_p)]
  if (length(spiked.proteins)>0) sub.prot <- c(sub.prot,spiked.proteins)
  dat.l <- dat.l %>% filter(Protein %in% sub.prot)
}
```

We store the metadata in `sample.info` and show some entries below. We also pick technical replicates with a dilution factor of 0.5 as the reference condition of interest. Each condition is represented by two of eight reporter Channels in each Run. 

```{r}
# specify # of varying component variants and their names
variant.names <- c('LMM1', 'LMM2', 'LMM3')
n.comp.variants <- length(variant.names)

# get some data parameters created in the data_prep script
referenceCondition <- data.list$data.params$referenceCondition
condition.color <- data.list$data.params$condition.color
ma.onesample.num <- data.list$data.params$ma.onesample.num
ma.onesample.denom <- data.list$data.params$ma.onesample.denom
ma.allsamples.num <- data.list$data.params$ma.allsamples.num
ma.allsamples.denom <- data.list$data.params$ma.allsamples.denom
# create data frame with sample information
sample.info <- get_sample_info(dat.l, condition.color)
# get channel names
channelNames <- remove_factors(unique(sample.info$Channel))
```

```{r}
display_dataframe_head(sample.info)
referenceCondition
channelNames
```

# Unit scale component: log2 transformation of reporter ion intensities

We use the default unit scale: the log2-transformed reporter ion intensities.

```{r}
dat.unit.l <- dat.l %>% mutate(response=log2(intensity)) %>% select(-intensity)
```

# Summarization component: no summarization

As a default approach (consult the manuscript or the [introduction notebook](intro.html)) we opted for no summarization, meaning that all PSM-level data will be used in further analyses. This also means that multiple variants of the same peptide within a sample, which differ in charge state or modifications or were detected at different retention times, are kept as is.

```{r}
dat.summ.l <- dat.unit.l
```

# Normalization component

The manuscripts of [Hill et al.](https://doi.org/10.1021/pr070520u) and [Oberg et al.](https://doi.org/10.1021/pr700734f) illustrated the application of linear models for removing various biases potentially present in isobarically labelled data. While implementing their approach, we discovered that treating proteins and peptides as random effects instead of fixed effects dramatically speeds up computations. 

All the models below are fitted using the `lmer()` function based on the REML criterion. Afterwards, the "subject-specific" residuals of the models (which involve subtraction of the empirical Bayes estimates of the random effects) are treated as the resulting normalized values and used in further analyses.

```{r}
dat.norm.l <- emptyList(variant.names)
# copy dat.summ.l into every element of dat.norm.l list
dat.norm.l <- lapply(dat.norm.l, function(x) dat.summ.l)
```
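
To make the notion of "subject-specific" residuals concrete, here is a minimal sketch that is not evaluated as part of this notebook. It uses the `sleepstudy` example data that ships with `lme4` rather than our proteomics data, and the object names (`fit`, `res.subject`, `res.population`, `b`) are purely illustrative. It shows that `residuals()` on an `lmer` fit subtracts fitted values that include the predicted random effects, in contrast to population-level residuals based on the fixed effects only.

```{r, eval=FALSE}
library(lme4)

# random-intercept model on lme4's built-in sleepstudy data (illustration only)
fit <- lmer(Reaction ~ Days + (1|Subject), data=sleepstudy)

# subject-specific residuals: observed minus (fixed effects + predicted random effects)
res.subject <- residuals(fit)
# population-level residuals: observed minus the fixed-effects prediction only
res.population <- sleepstudy$Reaction - predict(fit, re.form=NA)

# the two differ exactly by the predicted random intercept of each subject
b <- ranef(fit)$Subject[as.character(sleepstudy$Subject), '(Intercept)']
all.equal(unname(res.population - res.subject), b)
```

In the normalization models below, the same mechanism removes the predicted protein and peptide effects from the log2 intensities.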

For the three normalization models, we adopt the following naming convention:

- $y_{i, j(i), q, l, s}$ is the reporter ion intensity of PSM $s$ of peptide $j$ (nested within protein $i$), measured in channel $l$ of run $q$,
- $u$ is the model intercept,
- $b_q$ is the multiplexed tandem-MS run effect,
- $v_{l(q)}$ corresponds to the quantification channel within MS run effect,
- $p_i$ stands for the protein effect,
- $f_{j(i)}$ describes the peptide within protein contribution,
- $\varepsilon_{i, j(i), q, l, s}$ is the error term.

## LMM1 (peptide-by-run interaction)

We start with a model that corrects the observed reporter ion intensities for imbalance stemming from run $b_q$ and run-channel $v_{l(q)}$ fixed effects, as well as protein $p_i$ and run-peptide $b_q \times f_{j(i)}$ random effects:

$$ \log_2y_{i, j(i), q, l, s} = u + b_q + v_{l(q)} + p_i + (b_q \times f_{j(i)}) + \varepsilon_{i, j(i), q, l, s} $$

where $p_i \sim N(0, \sigma_p^2),\, (b_q \times f_{j(i)}) \sim N(0, \sigma_f^2),\, \varepsilon_{i, j(i), q, l, s} \sim N(0, \sigma^2)$

```{r, eval=!params$load_outputdata_p}
LMM1 <- lmer(response ~ Run + Run:Channel + (1|Protein)  + (1|Run:Peptide), data=dat.summ.l)
dat.norm.l$LMM1$response <- residuals(LMM1)
```

## LMM2 (protein-by-run interaction)

In the next variant, we include a random interaction between Run and Protein, and keep the Peptide random effect constant across different runs:

$$ \log_2y_{i, j(i), q, l, s} = u + b_q + v_{l(q)} + (b_q \times p_i) + f_{j(i)} + \varepsilon_{i, j(i), q, l, s} $$

where $(b_q \times p_i) \sim N(0, \sigma_p^2),\, f_{j(i)} \sim N(0, \sigma_f^2),\, \varepsilon_{i, j(i), q, l, s} \sim N(0, \sigma^2)$

```{r, eval=!params$load_outputdata_p}
LMM2 <- lmer(response ~ Run + Run:Channel + (1|Run:Protein)  + (1|Peptide), data=dat.summ.l)
dat.norm.l$LMM2$response <- residuals(LMM2)
```

## LMM3 (protein and peptide main effects)

For completeness, we also consider a mixed-effects normalization model **without** any interaction between Run and (Protein or Peptide), hence: 

$$ \log_2y_{i, j(i), q, l, s} = u + b_q + v_{l(q)} + p_i + f_{j(i)} + \varepsilon_{i, j(i), q, l, s} $$

where $p_i \sim N(0, \sigma_p^2),\, f_{j(i)} \sim N(0, \sigma_f^2),\, \varepsilon_{i, j(i), q, l, s} \sim N(0, \sigma^2)$

```{r, eval=!params$load_outputdata_p}
LMM3 <- lmer(response ~ Run + Run:Channel + (1|Protein)  + (1|Peptide), data=dat.summ.l)
dat.norm.l$LMM3$response <- residuals(LMM3)
```
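
To illustrate why the choice of random-effects structure matters, the following sketch (not evaluated here) simulates toy data that contain a peptide-by-run effect and compares the residuals of an LMM1-style and an LMM3-style model. The design sizes, effect sizes and object names (`d`, `RunPep`, `with.int`, `no.int`, `cell.sd`) are assumptions made up for this illustration; they are not taken from our data set.

```{r, eval=FALSE}
library(lme4)
set.seed(1)

# toy design: 20 proteins x 3 peptides, measured in 3 runs x 4 channels (made-up sizes)
d <- expand.grid(Channel=factor(1:4), Run=factor(1:3), Pep=1:3, Protein=factor(1:20))
d$Peptide <- interaction(d$Protein, d$Pep, drop=TRUE)   # peptide nested within protein
d$RunPep <- interaction(d$Run, d$Peptide, drop=TRUE)    # peptide-by-run cells

# simulated log2 intensities: run + protein + (run x peptide) + noise
d$response <- rnorm(nlevels(d$Run))[as.integer(d$Run)] +
  rnorm(nlevels(d$Protein))[as.integer(d$Protein)] +
  rnorm(nlevels(d$RunPep), sd=1)[as.integer(d$RunPep)] +
  rnorm(nrow(d), sd=0.2)

# LMM1-style (with peptide-by-run interaction) versus LMM3-style (main effects only)
with.int <- lmer(response ~ Run + Run:Channel + (1|Protein) + (1|Run:Peptide), data=d)
no.int <- lmer(response ~ Run + Run:Channel + (1|Protein) + (1|Peptide), data=d)

# spread of the mean residual per peptide-by-run cell: leftover run-specific peptide signal
cell.sd <- function(fit) sd(tapply(residuals(fit), d$RunPep, mean))
c(with_interaction=cell.sd(with.int), without_interaction=cell.sd(no.int))
```

In this toy setting, the residuals of the model without the interaction should retain most of the run-specific peptide signal, whereas the interaction model absorbs it into its random effects.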

# QC plots

Before getting to the DEA section, let's do some basic quality control and take a sneak peek at the differences between the component variants we've chosen. First, however, we should make the data completely wide, so that each sample gets its own unique column.

```{r, eval=!params$load_outputdata_p}
dat.nonnorm.summ.l <- aggFunc(dat.summ.l, 'response', group.vars=c('Mixture', 'TechRepMixture', 'Run', 'Channel', 'Condition', 'BioReplicate', 'Protein', 'Peptide'), 'median') 
dat.nonnorm.summ.l <- aggFunc(dat.nonnorm.summ.l, 'response', group.vars=c('Mixture', 'TechRepMixture', 'Run', 'Channel', 'Condition', 'BioReplicate', 'Protein'), 'median')

dat.norm.summ.l <- lapply(dat.norm.l, function(x) aggFunc(x, 'response', group.vars=c('Mixture', 'TechRepMixture', 'Run', 'Channel', 'Condition', 'BioReplicate', 'Protein', 'Peptide'), 'median')) 
dat.norm.summ.l <- lapply(dat.norm.summ.l, function(x) aggFunc(x, 'response', group.vars=c('Mixture', 'TechRepMixture', 'Run', 'Channel', 'Condition', 'BioReplicate', 'Protein'), 'median')) 

# make data completely wide (also across runs)
## normalized data
dat.norm.summ.w2 <- lapply(dat.norm.summ.l, function(x) {
  dat.tmp <- pivot_wider(data=x, id_cols=Protein, names_from=Run:Channel, values_from=response, names_sep=':')
  return(dat.tmp)})

## non-normalized data
dat.nonnorm.summ.w2 <- pivot_wider(data=dat.nonnorm.summ.l, id_cols=Protein, names_from=Run:Channel, values_from=response, names_sep=':')
```
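
The widening step can be sketched on a small made-up table as follows (not evaluated here). The data frame `toy` and its values are hypothetical. Note that this sketch names the two columns explicitly with `c(Run, Channel)`, whereas the chunk above uses the tidyselect range `Run:Channel`, which selects every column from `Run` up to `Channel` and therefore relies on these two columns being adjacent.

```{r, eval=FALSE}
library(tidyr)

# hypothetical protein-level long data: 2 proteins x 2 runs x 2 channels
toy <- data.frame(
  Protein=rep(c('P1', 'P2'), each=4),
  Run=rep(rep(c('R1', 'R2'), each=2), times=2),
  Channel=rep(c('127C', '128N'), times=4),
  response=1:8)

# one row per protein, one column per sample (Run and Channel pasted with ':')
wide <- pivot_wider(data=toy, id_cols=Protein, names_from=c(Run, Channel),
                    values_from=response, names_sep=':')
colnames(wide)
```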

```{r, echo=FALSE, eval=params$load_outputdata_p}
load(paste0('modelbased_normalization_outdata', params$suffix_p, '.rda'))
```

## Boxplots

The three normalization models give rise to similar boxplots.

```{r}
par(mfrow=c(2,2))
boxplot_ils(dat.nonnorm.summ.l, 'raw')
for (i in 1:n.comp.variants){
  boxplot_ils(dat.norm.summ.l[[variant.names[i]]], paste('normalized', variant.names[i], sep='_'))}
```

## MA plots

We then make MA plots of two single samples taken from condition `r ma.allsamples.num` and condition `r ma.allsamples.denom`, measured in different MS runs (samples *`r ma.onesample.num`* and *`r ma.onesample.denom`*, respectively). Clearly, the normalization had a strong variance-reducing effect on the fold changes. LMM1-LMM3 lead to fairly similar MA plots, which are nicely centered around the zero horizontal line. Notice, however, the different locations of the spike-in proteins in the case of `LMM3`.

```{r}
p <- emptyList(c('raw', names(dat.norm.summ.w2)))
p[[1]] <- maplot_ils(dat.nonnorm.summ.w2, ma.onesample.num, ma.onesample.denom, scale='log', 'raw', spiked.proteins)
for (i in 1:n.comp.variants){
  p[[i+1]] <- maplot_ils(dat.norm.summ.w2[[variant.names[i]]], ma.onesample.num, ma.onesample.denom, scale='log', paste('normalized', variant.names[i], sep='_'), spiked.proteins)}
grid.arrange(grobs = p, ncol=2, nrow=2)
```
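
As a reminder of what is plotted, here is a minimal sketch (not evaluated here) of the MA coordinates for values that are already on the log2 scale. The vectors `num` and `denom` are made-up protein-level values, and we assume that `maplot_ils()` with `scale='log'` computes something equivalent; its internals are not shown in this notebook.

```{r, eval=FALSE}
# hypothetical log2-scale quantification values of three proteins in two samples
num <- c(10, 12, 8)     # sample in the numerator of the fold change
denom <- c(9, 12, 9)    # sample in the denominator of the fold change

M <- num - denom         # log2 fold change (y-axis): a difference on the log scale
A <- (num + denom) / 2   # average log2 abundance (x-axis)
data.frame(A=A, M=M)
```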

To increase the robustness of these results, let's make some more MA plots, but now for all samples from condition `r ma.allsamples.num` and condition `r ma.allsamples.denom` (quantification values averaged within condition).
Both the unnormalized and normalized data now show less variability, because using more samples in the fold change calculation (now 8 in both the numerator and the denominator instead of just one) makes the rolling average more robust. 
Now, the three models render even more similar plots. It also seems the spike-in proteins induce a small positive bias (blue curve is rolling average) for low abundance proteins.

```{r}
channels.num <- sample.info %>% filter(Condition==ma.allsamples.num) %>% distinct(Sample) %>% pull
channels.denom <- sample.info %>% filter(Condition==ma.allsamples.denom) %>% distinct(Sample) %>% pull
p <- emptyList(c('raw', names(dat.norm.summ.w2)))
p[[1]] <- maplot_ils(dat.nonnorm.summ.w2, channels.num, channels.denom, scale='log', 'raw', spiked.proteins)
for (i in 1:n.comp.variants){
  p[[i+1]] <- maplot_ils(dat.norm.summ.w2[[variant.names[i]]], channels.num, channels.denom, scale='log', paste('normalized', variant.names[i], sep='_'), spiked.proteins)}
grid.arrange(grobs = p, ncol=2, nrow=2)
```

## PCA plots

Now, let's check if these multi-dimensional data contain some kind of grouping; it's time to make PCA plots.

### Using all proteins

After LMM2 or LMM3 normalization the samples are very closely grouped according to run instead of the dilution factor. Only LMM1, which contains the peptide-run interaction, restores the correct relation between the samples. This finding, together with conclusions from the [datadriven_normalization](datadriven_normalization.html) notebook, cannot be overstated as it implies that correcting for peptide-run interaction is of utmost importance in obtaining valid inference from multi-run isobaric labeling datasets.

```{r}
par(mfrow=c(2, 2))
pcaplot_ils(dat.nonnorm.summ.w2 %>% select(-'Protein'), info=sample.info, 'raw')
for (i in 1:n.comp.variants){
  pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}
```
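
For readers unfamiliar with sample-level PCA, the sketch below (not evaluated here) shows the general idea on simulated data with an artificial run shift. The matrix `X`, the size of the shift and the sample labels are assumptions for illustration only, and `pcaplot_ils()` may differ in details such as centering, scaling or the handling of missing values.

```{r, eval=FALSE}
set.seed(3)
n.prot <- 200
runs <- rep(c('R1', 'R2'), each=4)

# rows = proteins, columns = samples (hypothetical 'Run:Channel' labels)
X <- matrix(rnorm(n.prot * 8), nrow=n.prot,
            dimnames=list(NULL, paste(runs, 1:4, sep=':')))
X[, runs == 'R2'] <- X[, runs == 'R2'] + 2   # inject an artificial run (batch) shift

# PCA on samples: prcomp() expects observations (here: samples) in rows
pca <- prcomp(t(X), center=TRUE, scale.=FALSE)
var.expl <- pca$sdev^2 / sum(pca$sdev^2)     # fraction of variance per component
round(var.expl[1:2], 2)
split(round(pca$x[, 'PC1'], 1), runs)        # PC1 scores grouped by run
```

With such a strong shift, the PC1 scores of the two runs should not overlap, which is the pattern that indicates a dominant batch effect in the plots above.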

There are only 19 proteins supposed to be differentially expressed in this data set, which is only a very small number in both relative (to the 4083 proteins total) and absolute (for a biological sample) terms.

### Using spiked proteins only

Therefore, let's see what the PCA plots look like if we were to only use the spiked proteins in the PCA. This time, both LMM1 and LMM2 successfully clustered the samples, but this check has only a theoretical value as in practice one does not know for sure which proteins are differentially abundant. 

```{r, eval=length(spiked.proteins)>0}
par(mfrow=c(2, 2))
pcaplot_ils(dat.nonnorm.summ.w2 %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, 'raw')
for (i in 1:n.comp.variants){
  pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}
```

Notice how for all PCA plots, the percentage of variance explained by PC1 is now much greater than when using data from all proteins.
In a real situation without spiked proteins, you might plot data corresponding to the top X most differential proteins instead.

## HC (hierarchical clustering) plots

The PCA plots of all proteins have a rather low fraction of variance explained by PC1. We can confirm this using the hierarchical clustering dendrograms below: when considering the entire multidimensional space, the different conditions are not very separable at all. This is not surprising as there is little biological variation between the conditions: there are only 19 truly differential proteins, and they all (ought to) covary in exactly the same manner (i.e., their variation can be captured in one dimension).

### Using all proteins

```{r, fig.width=12, fig.height=15}
par(mfrow=c(2,2))
dendrogram_ils(dat.nonnorm.summ.w2 %>% select(-'Protein'), info=sample.info, 'raw')
for (i in 1:n.comp.variants){
  dendrogram_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}
```
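
The clustering behind such dendrograms can be sketched as follows (not evaluated here). The simulated matrix `X`, the Euclidean distance and the complete linkage are assumptions chosen for illustration; `dendrogram_ils()` may use different settings.

```{r, eval=FALSE}
set.seed(3)
runs <- rep(c('R1', 'R2'), each=4)

# rows = proteins, columns = samples, with an artificial run shift in R2
X <- matrix(rnorm(200 * 8), nrow=200,
            dimnames=list(NULL, paste(runs, 1:4, sep=':')))
X[, runs == 'R2'] <- X[, runs == 'R2'] + 2

# distances between samples (hence the transpose), then agglomerative clustering
hc <- hclust(dist(t(X), method='euclidean'), method='complete')
clusters <- cutree(hc, k=2)      # cut the tree into two groups
table(run=runs, cluster=clusters)
# plot(hc) would draw the dendrogram
```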

## Run effect p-value plot

Our last quality check involves a measure of how well each variant was able to assist in removing the run effect. 
Below are the distributions of p-values from a linear model for the `response` variable with `Run` as a covariate.
If the run effect was removed successfully, these p-values ought to be large. Clearly, the raw data and the LMM3-normalized data still contain a large run effect, while LMM2 and, to an even larger extent, LMM1 remove it from the normalized data. 

```{r}
# combine raw and normalized data in one named list
dat <- c(list(raw=dat.nonnorm.summ.l), dat.norm.summ.l)
run_effect_plot(dat)
```
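
The idea behind this check can be sketched on simulated data (not evaluated here). We assume, without having shown the internals of `run_effect_plot()`, that one linear model with `Run` as the only covariate is fitted per protein; the data frame `toy`, the helper `run.pvalue()` and the effect sizes are hypothetical.

```{r, eval=FALSE}
set.seed(42)

# toy data: 50 proteins, 3 runs, 4 measurements per run (made-up sizes)
toy <- expand.grid(Rep=1:4, Run=factor(c('R1', 'R2', 'R3')), Protein=paste0('P', 1:50))
toy$with.run <- rnorm(nrow(toy)) + 2 * as.integer(toy$Run)   # strong run effect
toy$no.run <- rnorm(nrow(toy))                               # no run effect

# p-value of the F-test for the Run term in a per-protein linear model
run.pvalue <- function(df, col) {
  fit <- lm(reformulate('Run', response=col), data=df)
  anova(fit)['Run', 'Pr(>F)']
}
by.protein <- split(toy, toy$Protein)
p.with <- sapply(by.protein, run.pvalue, col='with.run')
p.no <- sapply(by.protein, run.pvalue, col='no.run')

# p-values pile up near zero when a run effect is present
c(median_with_run=median(p.with), median_without_run=median(p.no))
```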

# DEA component: linear (mixed-effects) model

A typical approach to Differential Expression Analysis, which we also employ here, assumes testing only one protein at a time. Therefore, for each slice of the normalized data corresponding to a certain protein $i$, we fit another linear mixed-effect model given by:

$$ w_{j(i), q, l, s} = m + r_c + z_{l(q)} + \eta_{j(i), c, q, l, s} $$

with $w$ as the normalized values (the subject-specific residuals of the normalization model), $m$ as the model intercept, $r_c$ as the difference in expression levels between the biological conditions, $z_{l(q)}$ as the random effect accounting for the potential correlation within each sample induced by the protein repeated measurements, and $\eta_{j(i),c,q,l,s}$ as the random error. Note the index $s$ which implies the PSM-level data (i.e. not aggregated data). 

**Technical comment 1**: obtaining log fold changes corresponding to the contrasts of interest when working with log intensities and 'treatment' model parametrization (i.e., model intercept represents the reference condition) is immediately straightforward: these are coefficients corresponding to the $r_c$ effect.

**Technical comment 2**: while introducing the $z_{l(q)}$ random effect into the DEA model is justified, not every protein will have enough repeated measurements (i.e., multiple peptides and/or PSMs corresponding to different peptide modifications, charge states and retention times) for the random effect to be estimable. However, in such cases the fixed effect is still estimable and its inference remains valid. 

**Technical comment 3**: after testing, we make a correction for multiple testing using the Benjamini-Hochberg method in order to keep the FDR under control.
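
The technical comments above can be illustrated with a small sketch for a single simulated protein (not evaluated here). The toy layout, the effect sizes and all object names are assumptions; `lmm_dea()` is a helper whose internals are not shown in this notebook, so this is not necessarily how it is implemented.

```{r, eval=FALSE}
library(lme4)
set.seed(7)

# toy PSM-level data for ONE protein: 2 runs x 4 channels x 5 PSMs (made-up layout)
toy <- expand.grid(PSM=1:5, Channel=factor(1:4), Run=factor(1:2))
toy$Condition <- factor(ifelse(as.integer(toy$Channel) <= 2, '0.5', '1'))
toy$Condition <- relevel(toy$Condition, ref='0.5')   # 'treatment' coding w.r.t. the reference

# simulate a sample (Run:Channel) effect and a true log2 fold change of 1
sample.id <- interaction(toy$Run, toy$Channel, drop=TRUE)
toy$response <- 1 * (toy$Condition == '1') +
  rnorm(nlevels(sample.id), sd=0.3)[as.integer(sample.id)] +
  rnorm(nrow(toy), sd=0.5)

fit <- lmer(response ~ Condition + (1|Run:Channel), data=toy)
logFC <- fixef(fit)[['Condition1']]   # coefficient of condition '1' versus reference '0.5'
logFC

# Benjamini-Hochberg adjustment of a vector of (hypothetical) per-protein p-values
q <- p.adjust(c(0.001, 0.01, 0.03, 0.04, 0.5), method='BH')
q
```

Because this toy design is balanced, the estimated coefficient coincides with the simple difference between the two condition means; in unbalanced data the mixed model weights the samples differently.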

```{r, eval=!params$load_outputdata_p}
dat.dea <- lapply(dat.norm.l, function(x){
  return(lmm_dea(dat=x, mod.formula='response ~ Condition + (1|Run:Channel)', referenceCondition, scale='log'))})

# also see what the unnormalized results would look like
n.comp.variants <- n.comp.variants + 1
variant.names <- c(variant.names, 'raw')
dat.dea$raw <- lmm_dea(dat=dat.summ.l, mod.formula='response ~ Condition + (1|Run:Channel)', referenceCondition, scale='log')
```

```{r}
# character vectors containing logFC and p-values columns
dea.cols <- colnames(dat.dea[[1]])
logFC.cols <- dea.cols[stri_detect_fixed(dea.cols, 'logFC')]
significance.cols <- dea.cols[stri_detect_fixed(dea.cols, 'q.mod')]
n.contrasts <- length(logFC.cols)
```

For each condition, we now get the fold changes, p-values, q-values (BH-adjusted p-values), and some other details (head of dataframe below).

```{r}
display_dataframe_head(dat.dea[[1]])
```

```{r, echo=FALSE, eval=params$save_outputdata_p}
# save output data
save(dat.nonnorm.summ.l
     ,dat.norm.summ.l
     ,dat.nonnorm.summ.w2
     ,dat.norm.summ.w2
     ,dat.norm.l
     ,dat.summ.l
     ,dat.dea, file=paste0('modelbased_normalization_outdata', params$suffix_p, '.rda'))
```

# Results comparison

Now, the most important part: let's find out how our component variants have affected the outcome of the DEA.

## Confusion matrix

A confusion matrix shows how many true and false positives/negatives each variant has given rise to. Spiked proteins that are DE are true positives, background proteins that are not DE are true negatives. We calculate this matrix for all conditions and then calculate some other informative metrics based on the confusion matrices: accuracy, sensitivity, specificity, positive predictive value and negative predictive value. 

A quick inspection of the results tells us that only LMM1 and LMM2 are still in the game and result in comparable classifications. The surprisingly good performance of the LMM2 variant, despite the strong batch effect seen in the PCA plot, can be explained by the spike-in/controlled aspect of the experiment that generated the data. The difference in expression patterns of spike-in proteins turns out to be stronger than the batch effect (rightfully so!) and drives the separation in the restricted PCA plot.

Finally, the biological difference in the `0.667 vs 0.5` contrast seems to be too small to be picked up by the proposed modelling approach, regardless of the normalization method.

```{r, results='asis'}
cm <- conf_mat(dat.dea, 'q.mod', 0.05, spiked.proteins)
print_conf_mat(cm, referenceCondition)
```
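
For reference, the metrics reported above follow from the confusion matrix counts as in this minimal sketch (not evaluated here). The q-values and protein names are invented, and the helper `conf_mat()` may compute these quantities differently, for instance via the `caret` package.

```{r, eval=FALSE}
# hypothetical q-values for two spiked ('ups') and three background proteins
q <- c(ups1=0.01, ups2=0.20, bg1=0.03, bg2=0.60, bg3=0.90)
spiked <- c('ups1', 'ups2')

called <- q < 0.05               # declared differentially expressed
truth <- names(q) %in% spiked    # truly differentially expressed (spiked in)

TP <- sum(called & truth);  FP <- sum(called & !truth)
FN <- sum(!called & truth); TN <- sum(!called & !truth)

metrics <- c(accuracy=(TP + TN) / length(q),
             sensitivity=TP / (TP + FN),
             specificity=TN / (TN + FP),
             PPV=TP / (TP + FP),
             NPV=TN / (TN + FN))
round(metrics, 3)
```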

## Scatter plots

To see whether the different normalization variants produce similar results on the detailed level of individual proteins, we make scatter plots and check the correlation between their fold changes and between their significance estimates (q-values, in our case). 

The q-values of the three normalization models are moderately correlated. LMM2 and LMM3 q-values are generally larger than those of LMM1, confirming the drastically altered variance structures in the observations when the normalization model is misspecified (i.e., does not include the peptide-run interaction term). All q-values of the "raw" variant are constant (approximately 1), so the correlation coefficient could not be computed (NA).

```{r}
scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
```

When it comes to log fold changes, the estimates are virtually unaffected by the choice among the three normalization models. The fold change estimates of the spiked proteins (in orange) based on unnormalized (raw) data are in line with the estimates based on normalized expression, emphasizing the successful technical execution of the experiment and the remarkable quality of the acquired data.

```{r}
scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)
```

## Volcano plots

The volcano plot combines information on fold changes and statistical significance. The spike-in proteins are colored blue; the magenta, dashed line indicates the theoretical fold change of the spike-ins. 

Excluding the `0.667 vs 0.5` contrast, where the biological difference is too subtle, we can notice the advantage of LMM1 over LMM2 embodied in more significant q-values. LMM3 and raw data analysis entirely miss biological differences masked by batch effect.

```{r}
for (i in 1:n.contrasts){
  volcanoplot_ils(dat.dea, i, spiked.proteins, referenceCondition)}
```

## Violin plots

A good way to assess the general trend of the fold change estimates on a more 'macroscopic' scale is to make a violin plot. Ideally, there will be some spike-in proteins that attain the expected fold change (red dashed line) that corresponds to their condition, while most (background) protein log2 fold changes are situated around zero.

Clearly, the empirical results _tend towards_ the theoretical truth, but not a single observation attained the fold change it should have attained. There is clearly a strong bias towards zero fold change, which may partly be explained by the ratio compression phenomenon in mass spectrometry, although the effect seems quite extreme here.

First of all, do not be misled by the raw data fold changes that lie unexpectedly close to the true values: we saw earlier that significance testing on the raw data returned zero hits. Other than that, the estimates are virtually unaffected by the choice among the three normalization models, meaning that these models primarily differ in how they shape the variance structure of the data.

```{r}
# plot theoretical value (horizontal lines) and violin per variant
if (length(spiked.proteins)>0) violinplot_ils(lapply(dat.dea, function(x) x[spiked.proteins, logFC.cols]), referenceCondition) else violinplot_ils(lapply(dat.dea, function(x) x[,logFC.cols]), referenceCondition,  show_truth = FALSE)
```

# Conclusions

The results presented here confirm the findings of [Murie et al.](https://doi.org/10.1016/j.jbior.2017.11.005), suggesting that a proper normalization by the use of LMMs in a model-based approach requires an interaction effect between run and peptide. Omitting this effect clearly and strongly groups samples by run instead of condition in a PCA plot and fails to remove the run/batch effect entirely. This consequence alone takes its toll on the data variability and, by extension, on the q-values. On the other hand, the outcomes for the different LMMs seem somewhat similar: for all conditions there is perfect or near-perfect correlation between their fold change estimates.

# Session information

```{r}
sessionInfo()
```