refactoring the project structure: add util, data and figures directories.
Joris VAN HOUTVEN committed Mar 11, 2021
1 parent f91e7ac commit f65a413
Showing 19 changed files with 28 additions and 33 deletions.
10 changes: 3 additions & 7 deletions CONSTANd_vs_medianSweeping.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
-input_data_p: 'input_data.rds'
+input_data_p: 'data/input_data.rds'
suffix_p: ''
load_outputdata_p: FALSE
subsample_p: 0
@@ -48,14 +48,10 @@ library(psych)
library(limma)
library(tidyverse)
library(CONSTANd) # install from source: https://github.com/PDiracDelta/CONSTANd/
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

This notebook presents an isobaric labeling data analysis strategy that includes data-driven normalization.

In other notebooks in this series we have systematically varied workflow components and observed how they affect the outcome of a DEA. We have seen that median sweeping normalization does not work well for intensities on the original scale, and that CONSTANd does not work well on log2-transformed intensities. Here we compare median sweeping on the log2 scale, which we know does a good job, with CONSTANd on the original intensity scale.
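The contrast between the two approaches can be sketched as follows. This is a hypothetical illustration only, not the notebooks' actual code: `median_sweep` and `constand_sketch` are made-up names, and the real analyses use the CONSTANd package and the project's own helper functions.

```r
# Median sweeping: on log2 intensities, subtract the column (sample) medians,
# then the row (feature) medians.
median_sweep <- function(x) {
  x <- log2(x)
  x <- sweep(x, 2, apply(x, 2, median, na.rm = TRUE))  # remove sample effects
  sweep(x, 1, apply(x, 1, median, na.rm = TRUE))       # remove feature effects
}

# CONSTANd-style normalization: on raw (untransformed) intensities, alternately
# rescale rows and columns until all row and column means converge to
# 1/ncol(x) (iterative proportional fitting); see the CONSTANd package for the
# proper implementation.
constand_sketch <- function(x, maxit = 50, tol = 1e-6) {
  target <- 1 / ncol(x)
  for (i in seq_len(maxit)) {
    x <- sweep(x, 1, target / rowMeans(x, na.rm = TRUE), `*`)  # row step
    x <- sweep(x, 2, target / colMeans(x, na.rm = TRUE), `*`)  # column step
    if (max(abs(rowMeans(x, na.rm = TRUE) - target), na.rm = TRUE) < tol) break
  }
  x
}
```

Note the asymmetry this sketch makes explicit: median sweeping operates additively on the log2 scale, while CONSTANd operates multiplicatively on the original scale, which is why applying CONSTANd to already log2-transformed values compresses variance twice.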

Let's load our PSM-level data set:

```{r}
File renamed without changes.
File renamed without changes.
File renamed without changes.
1 change: 0 additions & 1 deletion data_hierarchy.png

This file was deleted.

6 changes: 3 additions & 3 deletions datadriven_DEA.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
-input_data_p: 'input_data.rds'
+input_data_p: 'data/input_data.rds'
suffix_p: 'msstatstmt'
load_outputdata_p: FALSE
subsample_p: 0
@@ -51,8 +51,8 @@ library(tidyverse)
library(matrixTests)
library(coin)
library(ROTS)
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

Let's load our PSM-level data set:
7 changes: 4 additions & 3 deletions datadriven_normalization.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
-input_data_p: 'input_data.rds'
+input_data_p: 'data/input_data.rds'
suffix_p: 'msstatstmt'
load_outputdata_p: FALSE
subsample_p: 0
@@ -50,8 +50,8 @@ library(tidyverse)
library(preprocessCore)
library(CONSTANd) # install from source: https://github.com/PDiracDelta/CONSTANd/
library(NOMAD) # devtools::install_github("carlmurie/NOMAD")
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

Let's load our PSM-level data set:
@@ -490,6 +490,7 @@ violinplot_ils(lapply(dat.dea, function(x) x[spiked.proteins, logFC.cols]), refe
For the given data set, the differences in proteomic outcomes between median sweeping and NOMAD normalization are quite small, both on the global and individual scale.
The quantile methods seem to underperform across the board, but they still produce reliable fold change estimates.
Finally, CONSTANd naturally reduces the variance in the distribution of quantification values and is only suitable for use with untransformed intensities. When used on log2-transformed values as we did here, there is a double variance-reducing effect that ends up over-compressing the fold change estimates.
+However, when applied to untransformed intensities as in [this bonus notebook](CONSTANd_vs_medianSweeping.html), CONSTANd performs at least on par with median sweeping!

# Session information

6 changes: 3 additions & 3 deletions datadriven_summarization.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
input_data_p: 'input_data.rds'
input_data_p: 'data/input_data.rds'
suffix_p: 'msstatstmt'
load_outputdata_p: FALSE
subsample_p: 0
@@ -48,8 +48,8 @@ library(limma)
library(psych)
library(MSnbase) # CAREFUL! load this BEFORE tidyverse, or you will screw up the rename function.
library(tidyverse)
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

Let's load our PSM-level data set:
6 changes: 3 additions & 3 deletions datadriven_unit.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
-input_data_p: 'input_data.rds'
+input_data_p: 'data/input_data.rds'
suffix_p: 'msstatstmt'
load_outputdata_p: FALSE
subsample_p: 0
@@ -47,8 +47,8 @@ library(kableExtra)
library(limma)
library(psych)
library(tidyverse)
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

Let's load our PSM-level data set:
Binary file added figures/data_hierarchy.png
File renamed without changes
File renamed without changes
Binary file added figures/tmt.png
14 changes: 7 additions & 7 deletions intro.Rmd
@@ -13,7 +13,7 @@ output:
editor_options:
chunk_output_type: console
params:
-input_data_p: 'input_data.rds'
+input_data_p: 'data/input_data.rds'
suffix_p: 'msstatstmt'
load_outputdata_p: FALSE
subsample_p: 0
@@ -26,8 +26,8 @@ library(ggplot2)
library(stringi)
library(venn)
library(kableExtra)
-source('other_functions.R')
-source('plotting_functions.R')
+source('util/other_functions.R')
+source('util/plotting_functions.R')
```

# Introduction
@@ -42,11 +42,11 @@ There is quite some complexity and experimental variability involved due to the
<summary>Structure of isobarically labeled LC-MS/MS proteomics data.</summary>
A labeling approach enables the bottom-up analysis of multiple biological samples simultaneously within a single tandem-MS run. In isobaric labeling (e.g. TMT labels in Figure 1), there is no mass difference between signals of identical peptides, which further increases the comparability and quality of the signals, in this case represented by the reporter fragment ion intensities.
```{r f:tmtlabels, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 1: a) TMT 6-plex labels differ by isotopes (*) only. b) The reactive group binds to the amino-terminus of the peptides, the reporter breaks off during fragmentation and the mass normalizer/balancer guarantees that the intact labels are isobaric. Image source: [Rosenblatt et al.](https://assets.fishersci.com/TFS-Assets/BID/posters/D00337~.pdf)."}
-knitr::include_graphics("tmt.png")
+knitr::include_graphics("figures/tmt.png")
```
Even though the reporters allow one to assign each spectrum to the correct sample, there is still a substantial amount of complexity to be dealt with. The figure below shows how each protein may be represented by many peptides, multiple times, in multiple 'shapes and forms'. This is why summarization and normalization steps are key components in every workflow.
```{r f:data_hierarchy, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 2: Tandem-MS data is complex and hierarchical; many different signals compete in determining the (relative) abundance of a protein. The latter are not measured directly, but are represented in possibly multiple runs as multiple variations of multiple peptides in different conditions and different (usually replicate) samples. Many different combinations of such signals co-exist and some are measured in each LC-MS run, while some are not (e.g.: peptide k is not measured in the rightmost run). RT=retention time, CS=charge state, PTM=post-translational modification."}
-knitr::include_graphics("data_hierarchy.png")
+knitr::include_graphics("figures/data_hierarchy.png")
```
</details>

@@ -56,7 +56,7 @@ The table below shows an overview of the many components that make up a workflow
Note that we have made a distinction between data-driven and model-based approaches, which refers to the type of normalization method they use.
<!-- use tablesgenerator.com OR the image below... -->
```{r f:table_selected_variants, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 3: An overview of our selected components and component variants for constructing an analysis workflow. When studying a particular component (e.g. data-driven normalization), each component variant (e.g. median sweeping, CONSTANd, ...) is applied to the data, while the other workflow components are executed using 'Variant 1' methods only. The same process applies when studying the other components (e.g. summarization), overall, leading to multiple sets of outcomes."}
-knitr::include_graphics("table_selected_variants.png")
+knitr::include_graphics("figures/table_selected_variants.png")
```

Each of the notebooks in this series takes the default approach, except for one component for which different variants are explored in detail. We chose the **unit scale**, **normalization method**, **summarization method** and **DEA method** as the most interesting components to investigate: we expect them to be impactful, and they are the ones we see varied and published on most often. We have made a non-systematic, non-exhaustive, [publicly available list of publications and software packages](https://docs.google.com/document/d/14BeNQFh3KHiKESdoF5A4OUlo9QWmt9kv/edit) related to analyzing proteomics data. You are very welcome to suggest edits and additions to this public repository.
@@ -68,7 +68,7 @@ It contains a HeLa background with 48 UPS1 proteins spiked in as a dilution seri
As shown in the figure below, we use only Runs 1, 2, 4, 5, which represent the first two technical replicates of the first two biological replicates (mixtures). Further, we do not use the reference channels because a) we wish to demonstrate how to analyze data sets _without_ the use of internal references; and b) we wish to avoid confusion with reference conditions and/or channels in the context of ratios and fold changes.

```{r f:design_ILS_as_subset, echo=FALSE, fig.align="center", out.width = "100%", fig.cap="Figure 4: We use only the subset (red rectangle) of data from the original study that consists of channels 127N through 130C (omitting the Reference condition samples) in Runs 1,2,4,5 with Mixtures 1 and 2. We also applied several data filtering steps - detailed in the 'Arbitrary choices for the other components' subsection above - after which we were left with 58 028 PSMs, 26 921 unique peptides and 4083 unique proteins, containing 19 out of the 48 spiked-in proteins. Image adapted from [Ting et al.](https://doi.org/10.1074/mcp.ra120.002105)"}
-knitr::include_graphics("design_ILS_as_subset.png")
+knitr::include_graphics("figures/design_ILS_as_subset.png")
```

We use the following settings _whenever applicable_ for calculating values or generating figures:
1 change: 0 additions & 1 deletion tmt.png

This file was deleted.

8 changes: 4 additions & 4 deletions data_prep.R → util/data_prep.R
@@ -1,13 +1,13 @@
library(tidyverse)
library(stringi)
-source('other_functions.R')
+source('util/other_functions.R')

# suffix used when saving the processed data set ('input_data_<data_name>.rds')
-dat.raw <- read.delim('PSMs.csv', sep = '\t') # create symlink
+dat.raw <- read.delim('data/PSMs.csv', sep = '\t') # create symlink
dat.raw.org <- dat.raw

# read in the study design data frame
-study.design <- read.delim('msstatstmt_studydesign.csv', sep=',') # create symlink
+study.design <- read.delim('data/msstatstmt_studydesign.csv', sep=',') # create symlink

# rename quantification columns
tmp.fun <- function(x){
@@ -149,4 +149,4 @@ params <- list(referenceCondition=referenceCondition,

# save data in wide and long format
if ('X' %in% colnames(dat.l)) { dat.l$X <- NULL }
-saveRDS(list(dat.l=dat.l, dat.w=dat.w, data.params=params), paste0('input_data', '.rds')) # make symlink
+saveRDS(list(dat.l=dat.l, dat.w=dat.w, data.params=params), paste0('data/input_data', '.rds')) # make symlink
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion plotting_functions.R → util/plotting_functions.R
@@ -222,7 +222,7 @@ cvplot_ils <- function(dat, feature.group, xaxis.group, title, rmCVquan=0.95, ..
# scatterplot_ils: wrapper function on pairs.panels from 'psych' package
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
# pairs.panels.my is a modified pairs.panels function such that the y=x identity line is plotted when lm=T
-source('pairs_panels_idline.R')
+source('util/pairs_panels_idline.R')

scatterplot_ils <- function(dat, cols, stat, spiked.proteins, refCond){
select.stat <- match.arg(stat, c('p-values', 'log2FC', 'q-values'))
