refactoring the project structure: add util, data and figures directories.
Joris VAN HOUTVEN committed Mar 11, 2021
1 parent f91e7ac commit f65a413
Showing 19 changed files with 28 additions and 33 deletions.
10 changes: 3 additions & 7 deletions CONSTANd_vs_medianSweeping.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
-input_data_p: 'input_data.rds'
+input_data_p: 'data/input_data.rds'
suffix_p: ''
load_outputdata_p: FALSE
subsample_p: 0
@@ -48,14 +48,10 @@ library(psych)
library(limma)
library(tidyverse)
library(CONSTANd) # install from source: https://github.com/PDiracDelta/CONSTANd/
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

This notebook presents an isobaric labeling data analysis strategy that includes data-driven normalization.

In other notebooks in this series we have systematically varied workflow components and observed how they affect the outcome of a DEA. We have seen that median sweeping normalization does not work well for intensities on the original scale, and that CONSTANd does not work well on log2-transformed intensities. Here we compare median sweeping on the log2 scale, which we know does a good job, with CONSTANd on the original intensity scale.
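The contrast between the two approaches can be sketched as follows. This is a hypothetical illustration only, not the notebooks' actual code: `median_sweep` and `constand_sketch` are made-up names, and the real analyses use the CONSTANd package and the project's own helper functions.

```r
# Median sweeping: on log2 intensities, subtract the column (sample) medians,
# then the row (feature) medians.
median_sweep <- function(x) {
  x <- log2(x)
  x <- sweep(x, 2, apply(x, 2, median, na.rm = TRUE))  # remove sample effects
  sweep(x, 1, apply(x, 1, median, na.rm = TRUE))       # remove feature effects
}

# CONSTANd-style normalization: on raw (untransformed) intensities, alternately
# rescale rows and columns until all row and column means converge to
# 1/ncol(x) (iterative proportional fitting); see the CONSTANd package for the
# proper implementation.
constand_sketch <- function(x, maxit = 50, tol = 1e-6) {
  target <- 1 / ncol(x)
  for (i in seq_len(maxit)) {
    x <- sweep(x, 1, target / rowMeans(x, na.rm = TRUE), `*`)  # row step
    x <- sweep(x, 2, target / colMeans(x, na.rm = TRUE), `*`)  # column step
    if (max(abs(rowMeans(x, na.rm = TRUE) - target), na.rm = TRUE) < tol) break
  }
  x
}
```

Note the asymmetry this sketch makes explicit: median sweeping operates additively on the log2 scale, while CONSTANd operates multiplicatively on the original scale, which is why applying CONSTANd to already log2-transformed values compresses variance twice.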

Let's load our PSM-level data set:

```{r}
File renamed without changes.
File renamed without changes.
File renamed without changes.
1 change: 0 additions & 1 deletion data_hierarchy.png

This file was deleted.

6 changes: 3 additions & 3 deletions datadriven_DEA.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
-input_data_p: 'input_data.rds'
+input_data_p: 'data/input_data.rds'
suffix_p: 'msstatstmt'
load_outputdata_p: FALSE
subsample_p: 0
@@ -51,8 +51,8 @@ library(tidyverse)
library(matrixTests)
library(coin)
library(ROTS)
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

Let's load our PSM-level data set:
7 changes: 4 additions & 3 deletions datadriven_normalization.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
-input_data_p: 'input_data.rds'
+input_data_p: 'data/input_data.rds'
suffix_p: 'msstatstmt'
load_outputdata_p: FALSE
subsample_p: 0
@@ -50,8 +50,8 @@ library(tidyverse)
library(preprocessCore)
library(CONSTANd) # install from source: https://github.com/PDiracDelta/CONSTANd/
library(NOMAD) # devtools::install_github("carlmurie/NOMAD")
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

Let's load our PSM-level data set:
@@ -490,6 +490,7 @@ violinplot_ils(lapply(dat.dea, function(x) x[spiked.proteins, logFC.cols]), refe
For the given data set, the differences in proteomic outcomes between median sweeping and NOMAD normalization are quite small, both on the global and individual scale.
The quantile methods seem to underperform across the board, but they still produce reliable fold change estimates.
Finally, CONSTANd naturally reduces the variance in the distribution of quantification values and is only suitable for use with untransformed intensities. When used on log2-transformed values as we did here, there is a double variance-reducing effect that ends up over-compressing the fold change estimates.
+However, when applied to untransformed intensities as in [this bonus notebook](CONSTANd_vs_medianSweeping.html), CONSTANd performs at least on par with median sweeping!

# Session information

6 changes: 3 additions & 3 deletions datadriven_summarization.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
input_data_p: 'input_data.rds'
input_data_p: 'data/input_data.rds'
suffix_p: 'msstatstmt'
load_outputdata_p: FALSE
subsample_p: 0
@@ -48,8 +48,8 @@ library(limma)
library(psych)
library(MSnbase) # CAREFUL! load this BEFORE tidyverse, or you will screw up the rename function.
library(tidyverse)
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

Let's load our PSM-level data set:
6 changes: 3 additions & 3 deletions datadriven_unit.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
-input_data_p: 'input_data.rds'
+input_data_p: 'data/input_data.rds'
suffix_p: 'msstatstmt'
load_outputdata_p: FALSE
subsample_p: 0
@@ -47,8 +47,8 @@ library(kableExtra)
library(limma)
library(psych)
library(tidyverse)
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

Let's load our PSM-level data set:
Binary file added figures/data_hierarchy.png
File renamed without changes
File renamed without changes
Binary file added figures/tmt.png
14 changes: 7 additions & 7 deletions intro.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
-input_data_p: 'input_data.rds'
+input_data_p: 'data/input_data.rds'
suffix_p: 'msstatstmt'
load_outputdata_p: FALSE
subsample_p: 0
@@ -26,8 +26,8 @@ library(ggplot2)
library(stringi)
library(venn)
library(kableExtra)
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

# Introduction
@@ -42,11 +42,11 @@ There is quite some complexity and experimental variability involved due to the
<summary>Structure of isobarically labeled LC-MS/MS proteomics data.</summary>
A labeling approach enables the bottom-up analysis of multiple biological samples simultaneously within a single tandem-MS run. In isobaric labeling (e.g. TMT labels in Figure 1), there is no mass difference between signals of identical peptides, which further increases the comparability and quality of the signals, in this case represented by the reporter fragment ion intensities.
```{r f:tmtlabels, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 1: a) TMT 6-plex labels differ by isotopes (*) only. b) The reactive group binds to the amino-terminus of the peptides, the reporter breaks off during fragmentation and the mass normalizer/balancer guarantees that the intact labels are isobaric. Image source: [Rosenblatt et al.](https://assets.fishersci.com/TFS-Assets/BID/posters/D00337~.pdf)."}
-knitr::include_graphics("tmt.png")
+knitr::include_graphics("figures/tmt.png")
```
Even though the reporters allow one to assign each spectrum to the correct sample, there is still a substantial amount of complexity to be dealt with. The figure below shows how each protein may be represented by many peptides, multiple times, in multiple 'shapes and forms'. This is why summarization and normalization steps are key components in every workflow.
```{r f:data_hierarchy, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 2: Tandem-MS data is complex and hierarchical; many different signals compete in determining the (relative) abundance of a protein. The latter are not measured directly, but are represented in possibly multiple runs as multiple variations of multiple peptides in different conditions and different (usually replicate) samples. Many different combinations of such signals co-exist and some are measured in each LC-MS run, while some are not (e.g.: peptide k is not measured in the rightmost run). RT=retention time, CS=charge state, PTM=post-translational modification."}
-knitr::include_graphics("data_hierarchy.png")
+knitr::include_graphics("figures/data_hierarchy.png")
```
</details>

@@ -56,7 +56,7 @@ The table below shows an overview of the many components that make up a workflow
Note that we have made a distinction between data-driven and model-based approaches, which refers to the type of normalization method they use.
<!-- use tablesgenerator.com OR the image below... -->
```{r f:table_selected_variants, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 3: An overview of our selected components and component variants for constructing an analysis workflow. When studying a particular component (e.g. data-driven normalization), each component variant (e.g. median sweeping, CONSTANd, ...) is applied to the data, while the other workflow components are executed using 'Variant 1' methods only. The same process applies when studying the other components (e.g. summarization), overall, leading to multiple sets of outcomes."}
-knitr::include_graphics("table_selected_variants.png")
+knitr::include_graphics("figures/table_selected_variants.png")
```

Each of the notebooks in this series takes the default approach, except for one component for which different variants are explored in detail. We chose the **unit scale**, **normalization method**, **summarization method** and **DEA method** as the most interesting components to investigate: we expect them to be impactful, and they are the ones we see varied and published on most often. We have made a non-systematic, non-exhaustive, [publicly available list of publications and software packages](https://docs.google.com/document/d/14BeNQFh3KHiKESdoF5A4OUlo9QWmt9kv/edit) related to analyzing proteomics data. You are very welcome to suggest edits and additions to this public repository.
@@ -68,7 +68,7 @@ It contains a HeLa background with 48 UPS1 proteins spiked in as a dilution seri
As shown in the figure below, we use only Runs 1, 2, 4, 5, which represent the first two technical replicates of the first two biological replicates (mixtures). Further, we do not use the reference channels because a) we wish to demonstrate how to analyze data sets _without_ the use of internal references; and b) we wish to avoid confusion with reference conditions and/or channels in the context of ratios and fold changes.

```{r f:design_ILS_as_subset, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 4: We use only the subset (red rectangle) of data from the original study that consists of channels 127N through 130C (omitting the Reference condition samples) in Runs 1,2,4,5 with Mixtures 1 and 2. We also applied several data filtering steps - detailed in the 'Arbitrary choices for the other components' subsection above - after which we were left with 58 028 PSMs, 26 921 unique peptides and 4083 unique proteins, containing 19 out of the 48 spiked-in proteins. Image adapted from [Ting et al.](https://doi.org/10.1074/mcp.ra120.002105)"}
-knitr::include_graphics("design_ILS_as_subset.png")
+knitr::include_graphics("figures/design_ILS_as_subset.png")
```

We use the following settings _whenever applicable_ for calculating values or generating figures:
1 change: 0 additions & 1 deletion tmt.png

This file was deleted.

8 changes: 4 additions & 4 deletions data_prep.R → util/data_prep.R
@@ -1,13 +1,13 @@
library(tidyverse)
library(stringi)
-source('other_functions.R')
+source('util/other_functions.R')

# suffix used when saving the processed data set ('input_data_<data_name>.rds')
-dat.raw <- read.delim('PSMs.csv', sep = '\t') # create symlink
+dat.raw <- read.delim('data/PSMs.csv', sep = '\t') # create symlink
dat.raw.org <- dat.raw

# read in the study design data frame
-study.design <- read.delim('msstatstmt_studydesign.csv', sep=',') # create symlink
+study.design <- read.delim('data/msstatstmt_studydesign.csv', sep=',') # create symlink

# rename quantification columns
tmp.fun <- function(x){
@@ -149,4 +149,4 @@ params <- list(referenceCondition=referenceCondition,

# save data in wide and long format
if ('X' %in% colnames(dat.l)) { dat.l$X <- NULL }
-saveRDS(list(dat.l=dat.l, dat.w=dat.w, data.params=params), paste0('input_data', '.rds')) # make symlink
+saveRDS(list(dat.l=dat.l, dat.w=dat.w, data.params=params), paste0('data/input_data', '.rds')) # make symlink
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion plotting_functions.R → util/plotting_functions.R
@@ -222,7 +222,7 @@ cvplot_ils <- function(dat, feature.group, xaxis.group, title, rmCVquan=0.95, ..
# scatterplot_ils: wrapper function on pairs.panels from 'psych' package
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
# pairs.panels.my is a modified pairs.panels function such that the y=x identity line is plotted when lm=T
-source('pairs_panels_idline.R')
+source('util/pairs_panels_idline.R')

scatterplot_ils <- function(dat, cols, stat, spiked.proteins, refCond){
select.stat <- match.arg(stat, c('p-values', 'log2FC', 'q-values'))
