-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathintro.Rmd
291 lines (211 loc) · 17.1 KB
/
intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
---
title: "Strategies for the analysis of isobarically labeled samples in mass spectrometry-based proteomics"
author: "Piotr Prostko, Joris Van Houtven"
date: '`r format(Sys.time(), "%B %d, %Y")`'
output:
html_document:
toc: true
toc_depth: 2
toc_float: true
number_sections: true
theme: flatly
code_folding: "hide"
editor_options:
chunk_output_type: console
params:
input_data_p: 'data/input_data.rds'
suffix_p: ''
load_outputdata_p: FALSE
save_outputdata_p: FALSE
subsample_p: 0
---
<style>body {text-align: justify}</style>
```{r, libraries, message=FALSE, echo=FALSE, warning=FALSE}
library(tidyverse)
library(ggplot2)
library(stringi)
library(venn)
library(kableExtra)
source('util/other_functions.R')
source('util/plotting_functions.R')
```
# Introduction
In this notebook series, we explore how different data analysis strategies affect the outcome of a proteomics experiment based on isobaric labeling and mass spectrometry.
Each analysis strategy or 'workflow' can be divided up into different components.
In this notebook specifically, we give an overview of the different workflow components and briefly describe the data set used.
There is quite some complexity and experimental variability involved due to the isobaric labeling and pooling of samples and combining of data from different instrument runs, respectively. To make matters worse, LC-MS/MS data is already quite hierarchical in nature. If you need a refresher on any of this, expand the next paragraph.
<details>
<summary>Structure of isobarically labeled LC-MS/MS proteomics data.</summary>
A labeling approach enables the bottom-up analysis of multiple biological samples simultaneously within a single tandem-MS run. In isobaric labeling (e.g. TMT labels in Figure 1), there is no mass difference between signals of identical peptides, which further increases the comparability and quality of the signals, in this case represented by the reporter fragment ion intensities.
```{r f:tmtlabels, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 1: a) TMT 6-plex labels differ by isotopes (*) only. b) The reactive group binds to the amino-terminus of the peptides, the reporter breaks off during fragmentation and the mass normalizer/balancer guarantees that the intact labels are isobaric. Image source: [Rosenblatt et al.](https://assets.fishersci.com/TFS-Assets/BID/posters/D00337~.pdf)."}
knitr::include_graphics("figures/tmt.png")
```
Even though the reporters allow one to assign each spectrum to the correct sample, there is still a substantial amount of complexity to be dealt with. The figure below shows how each protein may be represented by many peptides, multiple times, in multiple 'shapes and forms'. This is why summarization and normalization steps are key components in every workflow.
```{r f:data_hierarchy, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 2: Tandem-MS data is complex and hierarchical; many different signals compete in determining the (relative) abundance of a protein . The latter are not measured directly, but are represented in possibly multiple runs as multiple variations of multiple peptides in different conditions and different (usually replicate) samples. Many different combinations of such signals co-exist and some are measured in each LC-MS run, while some are not (e.g.: peptide k is not measured in the rightmost run). RT=retention time, CS=charge state, PTM=post-translational modification."}
knitr::include_graphics("figures/data_hierarchy.png")
```
</details>
# Workflow components
The table below shows an overview of the many components that make up a workflow for performing a differential expression analysis (DEA). This list is probably not exhaustive, but for each of those components a decision has to be made.
Note that we have made a distinction between data-driven and model-based approaches, which refers to the type of normalization method they use.
<!-- use tablesgenerator.com OR the image below... -->
```{r f:table_selected_variants, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 3: An overview of our selected components and component variants for constructing an analysis workflow. When studying a particular component (e.g. data-driven normalization), each component variant (e.g. median sweeping, CONSTANd, ...) is applied to the data, while the other workflow components are executed using 'Variant 1' methods only. The same process applies when studying the other components (e.g. summarization), overall, leading to multiple sets of outcomes."}
knitr::include_graphics("figures/table_selected_variants.png")
```
Each of the notebooks in this series takes the default approach, except for one component for which different variants are explored in detail. We chose the **unit scale**, **normalization method**, **summarization method** and **DEA method** as most interesting components to investigate, because on one hand we expect them to be impactful and on the other hand they are the ones we see varied and published about the most. We have made a non-systematic, non-exhaustive, [publicly available list of publications and software packages](https://docs.google.com/document/d/14BeNQFh3KHiKESdoF5A4OUlo9QWmt9kv/edit) related to analyzing proteomics data. You are very welcome to suggest edits and additions to this public repository.
# Data set
The data used in this notebook series is a subset of the SpikeIn-5mix-MS2 (Five controlled mixtures with technical replicates) data set used in [Ting et al.](https://doi.org/10.1074/mcp.RA120.002105) for [MSstatsTMT](http://msstats.org/msstatstmt/).
It contains a HeLa background with 48 UPS1 proteins spiked in as a dilution series, which will also constitute the nomenclature for our biological conditions: 1, 0.667, 0.5, and 0.125 times of the highest UPS1 peptide amount (500 fmol).
As shown in the figure below, we use only Runs 1, 2, 4, 5, which represent the first two technical replicates of the first two biological replicates (mixtures). Further, we do not use the reference channels because a) we wish to demonstrate how to analyze data sets _without_ the use of internal references; and b) we wish to avoid confusion with reference conditions and/or channels in the context of ratio's and fold changes.
```{r f:design_ILS_as_subset, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 4: We use only the subset (red rectangle) of data from the original study that consists of channels 127N through 130C (omitting the Reference condition samples) in Runs 1,2,4,5 with Mixtures 1 and 2. We also applied several data filtering steps - detailed in the 'Arbitrary choices for the other components' subsection above - after which we were left with 58 028 PSMs, 26 921 unique peptides and 4083 unique proteins, containing 19 out of the 48 spiked-in proteins. Image adapted from [Ting et al.](https://doi.org/10.1074/mcp.ra120.002105)"}
knitr::include_graphics("figures/design_ILS_as_subset.png")
```
We use the following settings _whenever applicable_ for calculating values or generating figures:
- the reference condition is '0.5';
- the example ratio for sample comparisons is 'Run4_127C' / 'Run2_129N' (a.k.a. 'Mixture2_1:127C' / 'Mixture1_2:129N');
- the example ratio for condition comparisons is '0.125' / '0.5';
- in multi-color figures, the color coding is:
```{r, echo=FALSE, results='asis', align='center'}
df <- data.frame(Condition=c("0.125", "0.5", "0.667", "1"), Color=c('black', 'blue', 'green', 'red'))
knitr::kable(df) %>% kable_styling(full_width = F)
```
## Filtering and preparation
In the file `data_prep.R` you can find the details of all data preparation steps. Here, we list which PSMs were discarded or modified and why:
1. Use only data from Runs 1, 2, 4, 5.
2. Discard reference samples (TMT channels 126 and 131).
3. Discard PSMs with shared peptides (as done in [Ting et al.](https://doi.org/10.1074/mcp.RA120.002105)).
4. Add simulated variability (see next section).
5. Reconstruct missing MS1 Intensity column based on MS2 intensities (necessary for some component variants).
6. Discard duplicate PSMs due to the use of multiple search engine instances (which only change the score).
7. Discard PSMs of proteins that appear both in the background and spike-in.
8. Discard PSMs with Isolation Interference (%) > 30.
9. Discard PSMs with NA entries in the quantification channels.
10. Discard PSMs of one-hit-wonder proteins.
## Simulated biological variability
This experiment was carried out so carefully that the protein concentrations, i.e., the sample sizes, are all very close to each other. In principle this is good, but we actually want to thoroughly test different normalization methods, and specifically their ability to correct for different sample sizes.
Therefore, in `data_prep.R` we introduced additional, random variation by multiplying all quantification values in each sample by 2^r, where r is a random number drawn from a normal distribution with mean zero and standard deviation 0.2 (this value was also used by [Ting et al.](https://doi.org/10.1074/mcp.RA120.002105)).
## Explore data
This is what the data looks like (in wide format):
```{r kable, echo=FALSE, results = "asis"}
#if (!exists('params')) params <- list(input_data_p='input_data.Rds', suffix_p='msstatstmt', load_outputdata_p=FALSE, subsample_p=0)
data.list <- readRDS(params$input_data_p) # create symlink
dat.l <- data.list$dat.l
dat.w <- data.list$dat.w
display_dataframe_head(dat.w)
```
The rest of this section calculates some interesting characteristics and informative graphs:
```{r, warning=FALSE, message=FALSE, class.source = "fold-show"}
# number of detected proteins
(n.protein <- length(unique(dat.w$Protein)))
# number of detected pepides
(n.peptide <- length(unique(dat.w$Peptide)))
# number of recorded spectra
(n.spectra <- nrow(dat.w))
# median number of peptides per protein
dat.w %>% group_by(Protein) %>% summarize(ndist=n_distinct(Peptide)) %>% mutate(n.protein.mean=median(ndist)) %>% pull(n.protein.mean) %>% unique
# median number of PSMs per peptide
dat.w %>% group_by(Peptide) %>% summarize(ndist=n()) %>% mutate(n.peptide.med=median(ndist)) %>%
pull(n.peptide.med) %>% unique
# number of 'peptides within run' instances
# (i.e. one peptide across multiple runs is counted as multiple separate entitities)
(dc1 <- dat.w %>% distinct(Run, Peptide) %>% nrow)
dc2 <- dat.w %>% group_by(Run, Peptide) %>%
summarize(ndist=n()) %>%
filter(ndist>1)
dc2 %>% nrow # 'peptides within run' instances detected in >1 spectra
# distribution of PSM counts per 'peptides within run' (counts > 1)
table(dc2$ndist)
# histogram of PSM counts per 'peptides within run' (counts > 1; removed 5 outliers)
hist(dc2$ndist[dc2$ndist<12], breaks=0.5+1:11, main="PSM counts per 'peptides within run'", xlab='PSM counts per peptide')
# percentage of duplicates on PSM level
100*nrow(dc2)/dc1
# 'peptides within run' instances eluted over multiple retention times
dc3 <- dat.w %>% group_by(Run, Peptide) %>%
summarize(RT.dist=n_distinct(RT)) %>%
filter(RT.dist>1)
dc3 %>% nrow
# RT is the cause of 99% of PSM duplicates
100*(dc3 %>% nrow/dc2 %>% nrow)
```
```{r, warning=FALSE, message=FALSE, echo=FALSE, class.source = "fold-show"}
if (FALSE) { #+ already removed in data_prep.R
# number of spectra with isolation interference>30%
dat.w %>% filter(isoInterOk=='N') %>% nrow
# number of spectra with at least one missing quantification channel
dat.w %>% filter(noNAs=='N') %>% nrow
# number of proteins represented by one peptide only
dat.w %>% filter(onehit.protein=='Y') %>% distinct(Protein) %>% nrow
# number of shared peptides
dat.w %>% filter(shared.peptide=='Y') %>% distinct(Peptide) %>% nrow
# number of spectra after filtering out spectra wth iso interference >30% or with >1 missing quan channel
dat.tmp <- dat.w %>% filter(isoInterOk=='Y' & noNAs=='Y')
nrow(dat.tmp)
# reduction in number of spectra
nrow(dat.w)-nrow(dat.tmp)
} #- already removed in data_prep.R
```
```{r, warning=FALSE, message=FALSE, class.source = "fold-show"}
# Venn diagram of proteins identified across MS runs
unique.prot <- dat.w %>% group_by(Run) %>% distinct(Protein) %>% mutate(val=1)
unique.prot <- unique.prot %>% pivot_wider(Protein, names_from='Run', values_from='val', values_fill=0)
venn(unique.prot[, -1], zcolor = "style", main='Proteins')
# Venn diagram of peptides identified across MS runs
unique.pep <- dat.w %>% group_by(Run) %>% distinct(Peptide) %>% mutate(val=1)
unique.pep <- unique.pep %>% pivot_wider(Peptide, names_from='Run', values_from='val', values_fill=0)
venn(unique.pep[, -1], zcolor = "style", main='Peptides')
# names of remaining spiked-in proteins
(sp <- dat.w %>% distinct(Protein) %>% filter(stri_detect(Protein, fixed='ups')) %>% pull %>% as.character)
# number of spiked-in proteins remaining (out of 48) after filtering
length(sp)
```
## Technical details on knitting notebooks and reusing R codes
Our idea behind this series of notebooks is not only to educate but also to invite other proteomics researchers to test **new workflows methods** on perhaps **new datasets**. To make that possible, we equipped our .Rmd codes with several features that we will now briefly discuss.
### input data
The [data_prep.R](util/data_prep.R) script processes the data in a way required by our notebook files.
This file has been written such that it fits the particular dataset used in our manuscript.
With a little bit of effort, however, one can modify it such that it can become applicable to other datasets as well.
The inputs of that script are:
A CSV file located in the `data` folder called `PSMs.csv` that is the output of **ProteomeDiscoverer** (PD) and therefore it contains typical for PD variables.
Another CSV file located in the `data` folder called `msstatstmt_studydesign.csv`, which contains the following information about the study design of the experiment:
```{r}
display_dataframe_head(read.delim('data/msstatstmt_studydesign.csv', sep=','))
```
Note that columns `Run, TechRepMixture, Channel, Condition, Mixture, and BioReplicate` are required, so even if the design of your new experiment is lacking some layer, you can always enter some constant value in that missing column.
Moreover, read carefully the comments inside the R script to see what filtering steps have been incorporated. If you do not agree with some of them or you feel like some processing step is missing, adjust the code accordingly.
The output of the script is a file created in the `data` folder called `input_data.rds`, which contains a list of three elements: `dat.l` (data in longer format), `dat.w` (data in wider format), and `data.params` with some parameters used throughout individual notebook files. This is how the resulting data should look like:
```{r}
temp <- readRDS("data/input_data.rds")
cat("dat.l")
display_dataframe_head(temp$dat.l)
cat("dat.w")
display_dataframe_head(temp$dat.w)
cat("data.params")
temp$data.params
```
(you should be able to figure out the usage of the elements of `data.params` by diving into individual notebook files)
### Notebook parameters
Each notebook has been supplied with 5 parameters controlling various aspects of the notebook's inputs and outputs. These are:
- `input_data_p`: character; the path to .Rds file generated by `data/data_prep.R`; e.g., `data/input_data.rds`
- `suffix_p`: character; suffix used in notebook output file (html and rda) names
- `load_outputdata_p`: boolean, if `TRUE`, then intermediate data sets used in plotting, DEA will not be created, but rather loaded
- `save_outputdata_p`: boolean, if `TRUE`, then intermediate data sets used in plotting, DEA will be saved as .rda file for later use. The name of the file goes, e.g., as follows: `datadriven_unit_outdata<SUFFIX>.rda`
- `subsample_p`: numeric, 0 means all proteins will be analysed, 50 means that 50 proteins will be randomly sampled.
### Master program
We also prepared an R script with which it is possible to conveniently knit all notebooks from the level of one file. This program is called `data/master_program.R`, and inside you will find references to the R markdown parameters explained above.
### Install the required R packages
Copy and run the code from below (first click on the Code gray rectangle to the right to see the code) to install all packages required by our analyses.
```{r, echo = TRUE, eval=FALSE}
# packages from CRAN
cran.list.of.packages <- c("devtools", "stringi", "gridExtra", "dendextend", "kableExtra", "psych",
"tidyverse", "matrixTests", "coin", "NOMAD", "tidyverse", "caret", "lme4",
"lmerTest", "ggplot2", "ggfortify", "dtplyr")
cran.new.packages <- cran.list.of.packages[!(cran.list.of.packages %in% installed.packages()[,"Package"])]
if(length(cran.new.packages)) install.packages(cran.new.packages)
# packages from Bioconductor
bioc.list.of.packages <- c("limma", "ROTS", "CONSTANd", "preprocessCore", "MSnbase", "DEqMS")
bioc.new.packages <- bioc.list.of.packages[!(bioc.list.of.packages %in% installed.packages()[,"Package"])]
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if(length(bioc.new.packages)) BiocManager::install()(bioc.new.packages)
# packages from GitHub
if (!require("NOMAD", quietly = TRUE)) devtools::install_github("carlmurie/NOMAD")
```