Minor changes
Piotr Prostko committed Nov 9, 2022
1 parent 5d0b9ae commit 76a70cd
Showing 10 changed files with 80 additions and 52 deletions.
2 changes: 1 addition & 1 deletion CONSTANd_vs_medianSweeping.Rmd
@@ -265,7 +265,7 @@ for (i in 1:n.comp.variants){
pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}
```

### Using spiked proteins only (if applicable)
### Using spiked proteins only
```{r, eval=length(spiked.proteins)>0}
par(mfrow=c(1, 2))
for (i in 1:n.comp.variants){
32 changes: 15 additions & 17 deletions datadriven_DEA.Rmd
@@ -137,7 +137,7 @@ dat.norm.w <- emptyList(variant.names)
## median sweeping (1)

Median sweeping means subtracting from each PSM quantification value the spectrum median (i.e., the row median computed across samples/channels) and the sample median (i.e., the column median computed across features). If the unit scale is set to intensities or ratios, the multiplicative variant of this procedure is applied: subtraction is replaced by division.
First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich at al.](https://doi.org/10.1021/pr300624g.
First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich et al.](https://doi.org/10.1021/pr300624g).
No need to split this per Run, because each row in this semi-wide format contains only values from one Run and each median calculation is independent of the other rows.
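
The sweep operation itself can be sketched in a few lines of base R (a toy illustration of the idea, not the notebook's own `median_sweep` helper):

```r
# Toy illustration of (additive) median sweeping on a log2-scale matrix:
# subtract each row's (spectrum) median first, then each column's (sample) median.
set.seed(1)
m <- matrix(rnorm(12), nrow = 3)            # 3 PSMs x 4 channels
m1 <- sweep(m, 1, apply(m, 1, median))      # sweep row medians
m2 <- sweep(m1, 2, apply(m1, 2, median))    # sweep column medians afterwards
# On the intensity or ratio scale the multiplicative variant divides instead:
# sweep(m, 1, apply(m, 1, median), FUN = "/")
```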

```{r, eval=!params$load_outputdata_p}
@@ -320,18 +320,13 @@ display_dataframe_head(dat.dea$moderated_ttest)
The Wilcoxon test ([Wilcoxon, F.](http://doi.org/10.1007/978-1-4612-4380-9)) is a non-parametric rank-based test for comparing two groups (i.e., biological conditions).
For each protein separately, the test is applied to each condition w.r.t. the reference condition `r referenceCondition`.
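
A minimal sketch of such a per-protein comparison in base R (toy values; the notebook's `wilcoxon_test` helper handles all proteins and conditions at once):

```r
# For one protein: compare quantification values of one condition vs the reference.
ref  <- c(1.2, 0.9, 1.1, 1.0)   # reference-condition samples
cond <- c(1.8, 1.6, 2.1, 1.7)   # other-condition samples
wt <- wilcox.test(cond, ref, alternative = "two.sided", exact = TRUE)
wt$p.value  # rank-based p-value; the test itself produces no fold change
```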

_@Piotr: uncomment the explanation of Wilcoxon fold changes if desired, otherwise explain why not_
<!-- # Wilcoxon rank-sum test employs ranks of quantification values, therefore logFC must be computed manually: -->
<!-- # if logFC.method='diff': difference of arithmetic means computed on log2 scale -->
<!-- # if logFC.method='ratio': log2 ratio of arithmetic means computed on raw scale -->

```{r, eval=!params$load_outputdata_p}
otherConditions <- dat.l %>% distinct(Condition) %>% pull(Condition) %>% as.character %>% sort
otherConditions <- otherConditions[-match(referenceCondition, otherConditions)]
dat.dea$Wilcoxon <- wilcoxon_test(dat.norm.summ.w2, sample.info, referenceCondition, otherConditions, logFC.method='ratio')
```

For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below.
For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below. Note that the Wilcoxon test operates on ranks; therefore, log fold changes are not returned by this testing procedure.

```{r}
display_dataframe_head(dat.dea$Wilcoxon)
Expand All @@ -347,7 +342,7 @@ The final p-values are then computed by comparing the observed test statistic va
dat.dea$permutation_test <- permutation_test(dat.norm.summ.l, referenceCondition, otherConditions, seed=seed, distribution='exact')
```
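
The principle behind the permutation p-value can be sketched for a single protein as follows (a toy illustration with made-up values, not the `permutation_test` implementation):

```r
# Permute the condition labels and compare the observed mean difference
# with the resulting permutation distribution of the test statistic.
set.seed(42)
x <- c(1.8, 1.6, 2.1, 1.7, 1.2, 0.9, 1.1, 1.0)
labels <- rep(c("cond", "ref"), each = 4)
obs <- mean(x[labels == "cond"]) - mean(x[labels == "ref"])
perm <- replicate(2000, {
  l <- sample(labels)  # random relabeling of samples
  mean(x[l == "cond"]) - mean(x[l == "ref"])
})
p <- mean(abs(perm) >= abs(obs))  # two-sided permutation p-value
```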

For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below.
For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below. Note that the log fold changes returned by the permutation test are exactly the same as those of the standard t-test or the moderated t-test (limma).

```{r}
display_dataframe_head(dat.dea$permutation_test)
Expand All @@ -361,7 +356,7 @@ The Reproducibility-Optimized Test Statistic (ROTS) ([Elo et al.](https://doi.or
dat.dea$ROTS <- rots_test(dat.norm.summ.w2, sample.info, referenceCondition, otherConditions)
```

For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below.
For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below. Note that the log fold changes returned by the ROTS test are exactly the same as those of the standard t-test or the moderated t-test (limma).

```{r}
display_dataframe_head(dat.dea$ROTS)
@@ -401,28 +396,31 @@ For all conditions, the q-values of the different variants correlate moderately
scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
```

On the other hand, the fold changes are very well correlated for all component variants, across all conditions, though the Wilcoxon estimates to a slightly lesser extent.

```{r}
scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)
```
We do not present the fold change scatter plots here because, from a methodological point of view, all variants provide the same values (except the Wilcoxon test, which is incompatible with fold change estimation).

## Volcano plots

The volcano plot combines information on fold changes and statistical significance. The spike-in proteins are colored blue, and immediately it is clear that their fold changes dominate the region of statistical significance, which suggests the experiment and analysis were carried out successfully. The magenta, dashed line indicates the theoretical fold change of the spike-ins.

(Wilcoxon test results are not shown, as no fold changes are available.)

```{r}
# don't create volcano plot for the Wilcoxon test as there are no fold changes available for this variant
dat.dea.volcano <- dat.dea
dat.dea.volcano[["Wilcoxon"]] <- NULL
for (i in 1:n.contrasts){
volcanoplot_ils(dat.dea, i, spiked.proteins, referenceCondition)}
volcanoplot_ils(dat.dea.volcano, i, spiked.proteins, referenceCondition)}
```

## Violin plots

A good way to assess the general trend of the fold change estimates on a more 'macroscopic' scale is to make a violin plot. Ideally, there will be some spike-in proteins that attain the expected fold change (red dashed line) that corresponds to their condition, while most (background) protein log2 fold changes are situated around zero.

(Wilcoxon test results are not shown, as no fold changes are available.)

```{r}
# plot theoretical value (horizontal lines) and violin per variant
if (length(spiked.proteins)>0) violinplot_ils(lapply(dat.dea, function(x) x[spiked.proteins, logFC.cols]), referenceCondition) else violinplot_ils(lapply(dat.dea, function(x) x[,logFC.cols]), referenceCondition, show_truth = FALSE)
if (length(spiked.proteins)>0) violinplot_ils(lapply(dat.dea.volcano, function(x) x[spiked.proteins, logFC.cols]), referenceCondition) else violinplot_ils(lapply(dat.dea.volcano, function(x) x[,logFC.cols]), referenceCondition, show_truth = FALSE)
```

```{r, echo=FALSE, eval=params$save_outputdata_p}
Expand All @@ -438,7 +436,7 @@ save(dat.nonnorm.summ.l
# Conclusions

It seems that the nonparametric methods do not have enough statistical power, even though ROTS is (over-)optimistic, producing low q-values for many background proteins, even compared to the moderated t-test.
Though the correlation between their significance estimates is a bit all over the place, all variants do agree on the fold change estimates.
Though the correlation between their significance estimates is rather inconsistent, all variants do agree on the fold change estimates because the computation method is virtually the same.

# Session information

6 changes: 3 additions & 3 deletions datadriven_normalization.Rmd
@@ -134,7 +134,7 @@ dat.norm.w <- emptyList(variant.names)
## median sweeping (1)

Median sweeping means subtracting from each PSM quantification value the spectrum median (i.e., the row median computed across samples/channels) and the sample median (i.e., the column median computed across features). If the unit scale is set to intensities or ratios, the multiplicative variant of this procedure is applied: subtraction is replaced by division.
First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich at al.](https://doi.org/10.1021/pr300624g.
First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich et al.](https://doi.org/10.1021/pr300624g).
No need to split this per Run, because each row in this semi-wide format contains only values from one Run and each median calculation is independent of the other rows.

```{r, eval=!params$load_outputdata_p}
Expand All @@ -145,7 +145,7 @@ dat.norm.w$median_sweeping[,channelNames] <- median_sweep(dat.norm.w$median_swee

## CONSTANd

[CONSTANd](https://doi.org/doi:10.18129/B9.bioc.CONSTANd) ([Van Houtven et al.](https://doi.org/10.1101/2021.03.04.433870)) normalizes a data matrix by 'raking' iteratively along the rows and columns (i.e. multiplying each row or column with a particular number) such that the row means and column means equal 1.
[CONSTANd](https://doi.org/doi:10.18129/B9.bioc.CONSTANd) ([Van Houtven et al.](https://doi.org/10.1016/j.jmb.2021.166966)) normalizes a data matrix by 'raking' iteratively along the rows and columns (i.e. multiplying each row or column with a particular number) such that the row means and column means equal 1.
One can never attain these row and column constraints simultaneously, but the algorithm converges very fast to the desired precision.
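
The raking idea can be sketched as follows (a simplified toy version of the iteration, not the Bioconductor `CONSTANd` implementation):

```r
# Alternately rescale rows and columns so that their means approach 1.
rake <- function(m, tol = 1e-6, max_iter = 50) {
  for (i in seq_len(max_iter)) {
    m <- m / rowMeans(m)                 # make row means exactly 1
    m <- sweep(m, 2, colMeans(m), "/")   # push column means towards 1
    if (max(abs(colMeans(m) - 1)) < tol && max(abs(rowMeans(m) - 1)) < tol) break
  }
  m
}
m <- matrix(runif(12, 0.5, 2), nrow = 3)
m.norm <- rake(m)  # row and column means are now (approximately) 1
```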

```{r, eval=!params$load_outputdata_p}
@@ -174,7 +174,7 @@ display_dataframe_head(dat.norm.w$NOMAD[, channelNames])

## Quantile (1)

Quantile normalization (As implemented by (Bolstad et al.)[https://doi.org/10.1093/bioinformatics/19.2.185]) makes the distribution (i.e., the values of the quantiles) of quantification values in different samples identical.
Quantile normalization (as implemented by [Bolstad et al.](https://doi.org/10.1093/bioinformatics/19.2.185)) makes the distribution (i.e., the values of the quantiles) of quantification values in different samples identical.
We first apply it to each Run separately, and then re-scale the observations so that the mean observation within each run is set equal to the mean observation across all runs.
After summarization, we do a second pass on the matrix with data from all runs.
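
The core of quantile normalization can be sketched in base R (a toy version of the Bolstad et al. idea; the notebook itself relies on an existing implementation):

```r
# Replace each sample's values by the mean quantile profile across samples.
m <- matrix(c(5, 2, 3, 4, 1, 6, 8, 7, 9), nrow = 3)  # 3 features x 3 samples
ranks <- apply(m, 2, rank, ties.method = "first")    # rank within each sample
means <- rowMeans(apply(m, 2, sort))                 # mean of each quantile
m.qnorm <- apply(ranks, 2, function(r) means[r])     # identical distributions
```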

3 changes: 2 additions & 1 deletion datadriven_summarization.Rmd
@@ -125,7 +125,7 @@ dat.unit.w <- pivot_wider(data = dat.unit.l, id_cols=-one_of(c('Condition', 'Bio
display_dataframe_head(dat.unit.w)
```

First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich at al.](https://doi.org/10.1021/pr300624g.
First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich et al.](https://doi.org/10.1021/pr300624g).
No need to split this per Run, because each row in this semi-wide format contains only values from one Run and each median calculation is independent of the other rows.

```{r}
@@ -437,6 +437,7 @@ for (i in 1:n.contrasts){
A good way to assess the general trend of the fold change estimates on a more 'macroscopic' scale is to make a violin plot. Ideally, there will be some spike-in proteins that attain the expected fold change (red dashed line) that corresponds to their condition, while most (background) protein log2 fold changes are situated around zero.

Clearly, the empirical results _tend towards_ the theoretical truth, but not a single observation attained the fold change it should have attained. There is clearly a strong bias towards zero fold change, which may partly be explained by the ratio compression phenomenon in mass spectrometry, although the effect seems quite extreme here.

It seems that Median summarization and iPQF produce very similar violins, while Sum summarization is again the odd one out. Even though the Sum-associated values are closer to their theoretically expected values, in light of the rest of our analysis it seems more plausible that this is due to the entire distribution suffering increased variability, rather than that the Sum summarization would produce more reliable outcomes.

```{r}
7 changes: 3 additions & 4 deletions datadriven_unit.Rmd
@@ -150,7 +150,7 @@ dat.unit.w <- lapply(dat.unit.l, function(x){
pivot_wider(data = x, id_cols=-one_of(c('Condition', 'BioReplicate')), names_from=Channel, values_from=response)})
```

First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich at al.](https://doi.org/10.1021/pr300624g.
First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich et al.](https://doi.org/10.1021/pr300624g).
No need to split this per Run, because each row in this semi-wide format contains only values from one Run and each median calculation is independent of the other rows.

```{r, eval=!params$load_outputdata_p}
@@ -228,8 +228,7 @@ load(paste0('datadriven_unit_outdata', params$suffix_p, '.rda'))

## Boxplots

These boxplots that for all variants the distributions are very similar and symmetrical. In contrast, the Sum summarization produces very skewed distributions. The means of the distributions for multiplicative unit scales (intensity, ratio) are 1 instead of zero because there we do median sweeping using division instead of subtraction.
Although for all summarization methods the boxplots are centered after normalization, this skewness of the Sum summarized values is ominous.
These boxplots show that for all variants the distributions are similar and symmetrical. The means of the distributions for multiplicative unit scales (intensity, ratio) are 1 instead of zero because there we do median sweeping using division instead of subtraction.

```{r}
# use (half-)wide format
@@ -431,7 +430,7 @@ if (length(spiked.proteins)>0) violinplot_ils(lapply(dat.dea, function(x) x[spik

# Conclusions

For the given data set, the differences in proteomic outcomes between all unit scale variants (log2 intensity, intensity, ratio) are quite small. The QC plots suggest that they produce qualitative outcomes, although the fold changes seem to experience an unusually large amount of ratio compression (probably inherent to the data set rather than the methodology). Using normalized ratios is identical to using untransformed intensities, and if you want to work on log2 scale, it doens't seem to matter whether you take the transform in the beginning or right before the DEA step.
For the given data set, the differences in proteomic outcomes between all unit scale variants (log2 intensity, intensity, ratio) are quite small. The QC plots suggest that they produce qualitative outcomes, although the fold changes seem to experience an unusually large amount of ratio compression (probably inherent to the data set rather than the methodology). Using normalized ratios is identical to using untransformed intensities, and if you want to work on log2 scale, it doesn't seem to matter whether you take the transform in the beginning or right before the DEA step.

# Session information

37 changes: 24 additions & 13 deletions intro.Rmd
@@ -87,16 +87,26 @@ knitr::kable(df) %>% kable_styling(full_width = F)
## Filtering and preparation

In the file `data_prep.R` you can find the details of all data preparation steps. Here, we list which PSMs were discarded or modified and why:

1. Use only data from Runs 1, 2, 4, 5.
1. Discard reference samples (TMT channels 126 and 131).
1. Discard PSMs with shared peptides (as done in [Ting et al.](https://doi.org/10.1074/mcp.RA120.002105)).
1. Add simulated variability (see next section).
1. Reconstruct missing MS1 Intensity column based on MS2 intensities (necessary for some component variants).
1. Discard duplicate PSMs due to the use of multiple search engine instances (which only change the score).
1. Discard PSMs of proteins that appear both in the background and spike-in.
1. Discard PSMs with Isolation Interference (%) > 30.
1. Discard PSMs with NA entries in the quantification channels.
1. Discard PSMs of one-hit-wonder proteins.

2. Discard reference samples (TMT channels 126 and 131).

3. Discard PSMs with shared peptides (as done in [Ting et al.](https://doi.org/10.1074/mcp.RA120.002105)).

4. Add simulated variability (see next section).

5. Reconstruct missing MS1 Intensity column based on MS2 intensities (necessary for some component variants).

6. Discard duplicate PSMs due to the use of multiple search engine instances (which only change the score).

7. Discard PSMs of proteins that appear both in the background and spike-in.

8. Discard PSMs with Isolation Interference (%) > 30.

9. Discard PSMs with NA entries in the quantification channels.

10. Discard PSMs of one-hit-wonder proteins.

## Simulated biological variability

@@ -208,7 +218,7 @@ Our idea behind this series of notebooks is not only to educate but also to invi

### input data

The [data_prep.R](util/data_prep.R) script process the data in a way required by our notebook files.
The [data_prep.R](util/data_prep.R) script processes the data in a way required by our notebook files.
This file has been written such that it fits the particular dataset used in our manuscript.
With a little bit of effort, however, one can modify it such that it can become applicable to other datasets as well.

Expand All @@ -231,12 +241,13 @@ The output of the script is a file created in the `data` folder called `input_da
```{r}
temp <- readRDS("data/input_data.rds")
# dat.l
cat("dat.l")
display_dataframe_head(temp$dat.l)
# dat.w
cat("dat.w")
display_dataframe_head(temp$dat.w)
cat("data.params")
temp$data.params
```

Expand All @@ -254,7 +265,7 @@ Each notebook has been supplied with 5 parameters controlling various aspects of

### Master program

We also prepared an R script with which it is possible to conveniently knit all notebooks from the level of one file. This program is called `data/master_program.R`. This program refers to the notebook parameter described above.
We also prepared an R script with which it is possible to conveniently knit all notebooks from the level of one file. This program is called `data/master_program.R`, and inside you will find references to the R markdown parameters explained above.

### Install the required R packages

9 changes: 6 additions & 3 deletions modelbased_DEA.Rmd
@@ -34,7 +34,7 @@ knitr::opts_chunk$set(
_This notebook is one in a series of many, where we explore how different data analysis strategies affect the outcome of a proteomics experiment based on isobaric labeling and mass spectrometry. Each analysis strategy or 'workflow' can be divided up into different components; it is recommended you read more about that in the [introduction notebook](intro.html)._
</span>

In this notebook specifically, we investigate the effect of varying the Differential Expression testing component on the outcome of the differential expression results. The four component variants are: **linear mixed-effects model**, **DEqMS**, **ANOVA* applied to the **protein-level** data, and **ANOVA** applied to the **PSM-level** data.
In this notebook specifically, we investigate the effect of varying the Differential Expression testing component on the outcome of the differential expression results. The four component variants are: **linear mixed-effects model**, **DEqMS**, **ANOVA** applied to the **protein-level** data, and **ANOVA** applied to the **PSM-level** data.

<span style="color: grey;">
_The R packages and helper scripts necessary to run this notebook are listed in the next code chunk: click the 'Code' button. Each code section can be expanded in a similar fashion. You can also download the [entire notebook source code](modelbased_unit.Rmd)._
@@ -164,7 +164,7 @@ load(paste0('modelbased_DEA_outdata', params$suffix_p, '.rda'))

## Boxplot

The normalization model consistently align the reporter ion intensity values.
The normalization model consistently aligns the reporter ion intensity values.

```{r}
par(mfrow=c(1,2))
@@ -336,10 +336,13 @@ To see whether the three Unit scales produce similar results on the detailed lev

Regarding q-values, we merely notice a higher correlation between the variants working on the same type of data (PSM-level or protein-level).

```{r}
scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
```

Log fold changes originating from all variants are well correlated. We merely notice that using PSM-level data leads to slightly more scattered values around the centers of the distributions.

```{r}
scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)
```
