From 76a70cd140b3da462bf906f0fe744dfa622bc595 Mon Sep 17 00:00:00 2001
From: Piotr Prostko
Date: Wed, 9 Nov 2022 23:12:07 +0100
Subject: [PATCH] Minor changes

---
 CONSTANd_vs_medianSweeping.Rmd |  2 +-
 datadriven_DEA.Rmd             | 32 ++++++++++++++---------------
 datadriven_normalization.Rmd   |  6 +++---
 datadriven_summarization.Rmd   |  3 ++-
 datadriven_unit.Rmd            |  7 +++----
 intro.Rmd                      | 37 ++++++++++++++++++++++------------
 modelbased_DEA.Rmd             |  9 ++++++---
 modelbased_normalization.Rmd   | 23 ++++++++++++++++-----
 modelbased_summarization.Rmd   |  2 +-
 modelbased_unit.Rmd            | 11 ++++++----
 10 files changed, 80 insertions(+), 52 deletions(-)

diff --git a/CONSTANd_vs_medianSweeping.Rmd b/CONSTANd_vs_medianSweeping.Rmd
index 7d5fefc..ec49c28 100644
--- a/CONSTANd_vs_medianSweeping.Rmd
+++ b/CONSTANd_vs_medianSweeping.Rmd
@@ -265,7 +265,7 @@ for (i in 1:n.comp.variants){
pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}
```
-### Using spiked proteins only (if applicable)
+### Using spiked proteins only
```{r, eval=length(spiked.proteins)>0}
par(mfrow=c(1, 2))
for (i in 1:n.comp.variants){
diff --git a/datadriven_DEA.Rmd b/datadriven_DEA.Rmd
index 889febc..9f83d78 100644
--- a/datadriven_DEA.Rmd
+++ b/datadriven_DEA.Rmd
@@ -137,7 +137,7 @@ dat.norm.w <- emptyList(variant.names)
## median sweeping (1)
Median sweeping means subtracting from each PSM quantification value the spectrum median (i.e., the row median computed across samples/channels) and the sample median (i.e., the column median computed across features). If the unit scale is set to intensities or ratios, the multiplicative variant of this procedure is applied: subtraction is replaced by division.
-First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich at al.](https://doi.org/10.1021/pr300624g.
+First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich et al.](https://doi.org/10.1021/pr300624g).
No need to split this per Run, because each row in this semi-wide format contains only values from one Run and each median calculation is independent of the other rows.
```{r, eval=!params$load_outputdata_p}
@@ -320,18 +320,13 @@ display_dataframe_head(dat.dea$moderated_ttest)
The Wilcoxon test ([Wilcoxon, F.](http://doi.org/10.1007/978-1-4612-4380-9)) is a non-parametric rank-based test for comparing two groups (i.e., biological conditions). For each protein separately, the test is applied to each condition w.r.t. the reference condition `r referenceCondition`.
-_@Piotr: uncomment the explanation of Wilcoxon fold changes if desired, otherwise explain why not_
-
-
-
```{r, eval=!params$load_outputdata_p}
otherConditions <- dat.l %>% distinct(Condition) %>% pull(Condition) %>% as.character %>% sort
otherConditions <- otherConditions[-match(referenceCondition, otherConditions)]
dat.dea$Wilcoxon <- wilcoxon_test(dat.norm.summ.w2, sample.info, referenceCondition, otherConditions, logFC.method='ratio')
```
-For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below.
+For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below. Note that the Wilcoxon test operates on ranks; therefore, log fold changes are not returned by this testing procedure.
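As a brief aside before displaying those results: here is a minimal sketch of what such a rank-based comparison looks like for a single protein. It uses made-up numbers and base R's `wilcox.test`, not the notebook's own `wilcoxon_test` helper applied above.

```{r}
# Illustration only: rank-based two-group comparison for one protein,
# with made-up normalized quantification values.
x <- c(-0.12, 0.05, -0.03, 0.08)  # reference condition samples
y <- c(0.61, 0.48, 0.77, 0.55)    # treatment condition samples
wilcox.test(y, x)$p.value         # small p-value: all treatment values rank above the reference
# Across many proteins, the p-values would then be adjusted into q-values, e.g.:
# p.adjust(pvals, method='BH')
```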
```{r}
display_dataframe_head(dat.dea$Wilcoxon)
```
@@ -347,7 +342,7 @@ The final p-values are then computed by comparing the observed test statistic va
dat.dea$permutation_test <- permutation_test(dat.norm.summ.l, referenceCondition, otherConditions, seed=seed, distribution='exact')
```
-For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below.
+For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below. Note that the log fold changes returned by the permutation test are exactly the same as those of the standard t-test and the moderated t-test (limma).
```{r}
display_dataframe_head(dat.dea$permutation_test)
@@ -361,7 +356,7 @@ The Reproducibility-Optimized Test Statistic (ROTS) ([Elo et al.](https://doi.or
dat.dea$ROTS <- rots_test(dat.norm.summ.w2, sample.info, referenceCondition, otherConditions)
```
-For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below.
+For each condition, we now get fold change estimates, p-values and q-values (adjusted p-values): see `head` of dataframe below. Note that the log fold changes returned by the ROTS test are exactly the same as those of the standard t-test and the moderated t-test (limma).
```{r}
display_dataframe_head(dat.dea$ROTS)
@@ -401,28 +396,31 @@ For all conditions, the q-values of the different variants correlate moderately
scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
```
-On the other hand, the fold changes are very well correlated for all component variants, across all conditions, though the Wilcoxon estimates to a slightly lesser extent.
-
-```{r}
-scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)
-```
+We do not present scatter plots of the fold changes here because, from a methodological point of view, all variants provide the same values (except the Wilcoxon test, which is incompatible with fold change estimation).

## Volcano plots
The volcano plot combines information on fold changes and statistical significance. The spike-in proteins are colored blue, and immediately it is clear that their fold changes dominate the region of statistical significance, which suggests the experiment and analysis were carried out successfully. The magenta, dashed line indicates the theoretical fold change of the spike-ins.
+(Wilcoxon test results are not shown, as no fold changes are available.)
+
```{r}
+# don't create a volcano plot for the Wilcoxon test, as no fold changes are available for this variant
+dat.dea.volcano <- dat.dea
+dat.dea.volcano[["Wilcoxon"]] <- NULL
for (i in 1:n.contrasts){
-  volcanoplot_ils(dat.dea, i, spiked.proteins, referenceCondition)}
+  volcanoplot_ils(dat.dea.volcano, i, spiked.proteins, referenceCondition)}
```

## Violin plots
A good way to assess the general trend of the fold change estimates on a more 'macroscopic' scale is to make a violin plot. Ideally, there will be some spike-in proteins that attain the expected fold change (red dashed line) that corresponds to their condition, while most (background) protein log2 fold changes are situated around zero.
+(Wilcoxon test results are not shown, as no fold changes are available.)
+
```{r}
# plot theoretical value (horizontal lines) and violin per variant
-if (length(spiked.proteins)>0) violinplot_ils(lapply(dat.dea, function(x) x[spiked.proteins, logFC.cols]), referenceCondition) else violinplot_ils(lapply(dat.dea, function(x) x[,logFC.cols]), referenceCondition, show_truth = FALSE)
+if (length(spiked.proteins)>0) violinplot_ils(lapply(dat.dea.volcano, function(x) x[spiked.proteins, logFC.cols]), referenceCondition) else violinplot_ils(lapply(dat.dea.volcano, function(x) x[,logFC.cols]), referenceCondition, show_truth = FALSE)
```
```{r, echo=FALSE, eval=params$save_outputdata_p}
@@ -438,7 +436,7 @@ save(dat.nonnorm.summ.l
# Conclusions
It seems that the nonparametric methods do not have enough statistical power, even though ROTS is (over-)optimistic, producing low q-values for many background proteins, even compared to the moderated t-test.
-Though the correlation between their significance estimates is a bit all over the place, all variants do agree on the fold change estimates.
+Though the correlation between their significance estimates is rather erratic, all variants do agree on the fold change estimates, because the computation method is virtually the same.
# Session information
diff --git a/datadriven_normalization.Rmd b/datadriven_normalization.Rmd
index 202c0e1..ffc0e93 100644
--- a/datadriven_normalization.Rmd
+++ b/datadriven_normalization.Rmd
@@ -134,7 +134,7 @@ dat.norm.w <- emptyList(variant.names)
## median sweeping (1)
Median sweeping means subtracting from each PSM quantification value the spectrum median (i.e., the row median computed across samples/channels) and the sample median (i.e., the column median computed across features). If the unit scale is set to intensities or ratios, the multiplicative variant of this procedure is applied: subtraction is replaced by division.
-First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich at al.](https://doi.org/10.1021/pr300624g.
+First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich et al.](https://doi.org/10.1021/pr300624g).
No need to split this per Run, because each row in this semi-wide format contains only values from one Run and each median calculation is independent of the other rows.
```{r, eval=!params$load_outputdata_p}
@@ -145,7 +145,7 @@ dat.norm.w$median_sweeping[,channelNames] <- median_sweep(dat.norm.w$median_swee
## CONSTANd
-[CONSTANd](https://doi.org/doi:10.18129/B9.bioc.CONSTANd) ([Van Houtven et al.](https://doi.org/10.1101/2021.03.04.433870)) normalizes a data matrix by 'raking' iteratively along the rows and columns (i.e. multiplying each row or column with a particular number) such that the row means and column means equal 1.
+[CONSTANd](https://doi.org/doi:10.18129/B9.bioc.CONSTANd) ([Van Houtven et al.](https://doi.org/10.1016/j.jmb.2021.166966)) normalizes a data matrix by 'raking' iteratively along the rows and columns (i.e., multiplying each row or column by a particular number) such that the row means and column means equal 1.
One can never attain these row and column constraints simultaneously, but the algorithm converges very fast to the desired precision.
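For intuition, here is a minimal sketch of the raking idea applied to a random matrix. This is a toy illustration with a hypothetical helper `rake_unit_means`, not the CONSTANd package implementation used in the next chunk.

```{r}
# Toy raking (iterative proportional fitting) sketch, not the CONSTANd package code:
# alternately rescale rows and columns until row means and column means are (approximately) 1.
rake_unit_means <- function(mat, max.iter=50, tol=1e-6) {
  for (i in seq_len(max.iter)) {
    mat <- mat / rowMeans(mat)                # divide each row by its mean: row means become 1
    mat <- sweep(mat, 2, colMeans(mat), '/')  # divide each column by its mean: column means become 1
    if (max(abs(rowMeans(mat) - 1)) < tol) break  # stop once the row means no longer drift
  }
  mat
}
set.seed(1)
m <- matrix(rexp(24), nrow=4)  # random non-negative 'quantification matrix'
m.norm <- rake_unit_means(m)
round(rowMeans(m.norm), 4); round(colMeans(m.norm), 4)
```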
```{r, eval=!params$load_outputdata_p}
@@ -174,7 +174,7 @@ display_dataframe_head(dat.norm.w$NOMAD[, channelNames])
## Quantile (1)
-Quantile normalization (As implemented by (Bolstad et al.)[https://doi.org/10.1093/bioinformatics/19.2.185]) makes the distribution (i.e., the values of the quantiles) of quantification values in different samples identical.
+Quantile normalization (as implemented by [Bolstad et al.](https://doi.org/10.1093/bioinformatics/19.2.185)) makes the distribution (i.e., the values of the quantiles) of quantification values in different samples identical.
We first apply it to each Run separately, and then re-scale the observations so that the mean observation within each run is set equal to the mean observation across all runs. After summarization, we do a second pass on the matrix with data from all runs.
diff --git a/datadriven_summarization.Rmd b/datadriven_summarization.Rmd
index 2f19f5a..398cfea 100644
--- a/datadriven_summarization.Rmd
+++ b/datadriven_summarization.Rmd
@@ -125,7 +125,7 @@ dat.unit.w <- pivot_wider(data = dat.unit.l, id_cols=-one_of(c('Condition', 'Bio
display_dataframe_head(dat.unit.w)
```
-First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich at al.](https://doi.org/10.1021/pr300624g.
+First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich et al.](https://doi.org/10.1021/pr300624g).
No need to split this per Run, because each row in this semi-wide format contains only values from one Run and each median calculation is independent of the other rows.
```{r}
@@ -437,6 +437,7 @@ for (i in 1:n.contrasts){
A good way to assess the general trend of the fold change estimates on a more 'macroscopic' scale is to make a violin plot. Ideally, there will be some spike-in proteins that attain the expected fold change (red dashed line) that corresponds to their condition, while most (background) protein log2 fold changes are situated around zero.
Clearly, the empirical results _tend towards_ the theoretical truth, but not a single observation attained the fold change it should have attained. There is clearly a strong bias towards zero fold change, which may partly be explained by the ratio compression phenomenon in mass spectrometry, although the effect seems quite extreme here.
+
It seems that Median summarization and iPQF produce very similar violins, while Sum summarization is again the odd one out. Even though the Sum-associated values are closer to their theoretically expected values, in light of the rest of our analysis it seems more plausible that this is due to the entire distribution suffering increased variability, rather than that the Sum summarization would produce more reliable outcomes.
```{r}
diff --git a/datadriven_unit.Rmd b/datadriven_unit.Rmd
index 2aa098e..d1e1bd2 100644
--- a/datadriven_unit.Rmd
+++ b/datadriven_unit.Rmd
@@ -150,7 +150,7 @@ dat.unit.w <- lapply(dat.unit.l, function(x){
pivot_wider(data = x, id_cols=-one_of(c('Condition', 'BioReplicate')), names_from=Channel, values_from=response)})
```
-First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich at al.](https://doi.org/10.1021/pr300624g.
+First, let's sweep the medians of all the rows, and do the columns later as suggested by [Herbrich et al.](https://doi.org/10.1021/pr300624g).
No need to split this per Run, because each row in this semi-wide format contains only values from one Run and each median calculation is independent of the other rows.
```{r, eval=!params$load_outputdata_p}
@@ -228,8 +228,7 @@ load(paste0('datadriven_unit_outdata', params$suffix_p, '.rda'))
## Boxplots
-These boxplots that for all variants the distributions are very similar and symmetrical. In contrast, the Sum summarization produces very skewed distributions. The means of the distributions for multiplicative unit scales (intensity, ratio) are 1 instead of zero because there we do median sweeping using division instead of subtraction.
-Although for all summarization methods the boxplots are centered after normalization, this skewness of the Sum summarized values is ominous.
+These boxplots show that for all variants the distributions are similar and symmetrical. The means of the distributions for multiplicative unit scales (intensity, ratio) are 1 instead of zero because there we do median sweeping using division instead of subtraction.
```{r}
# use (half-)wide format
@@ -431,7 +430,7 @@ if (length(spiked.proteins)>0) violinplot_ils(lapply(dat.dea, function(x) x[spik
# Conclusions
-For the given data set, the differences in proteomic outcomes between all unit scale variants (log2 intensity, intensity, ratio) are quite small. The QC plots suggest that they produce qualitative outcomes, although the fold changes seem to experience an unusually large amount of ratio compression (probably inherent to the data set rather than the methodology). Using normalized ratios is identical to using untransformed intensities, and if you want to work on log2 scale, it doens't seem to matter whether you take the transform in the beginning or right before the DEA step.
+For the given data set, the differences in proteomic outcomes between all unit scale variants (log2 intensity, intensity, ratio) are quite small. The QC plots suggest that they produce qualitatively sound outcomes, although the fold changes seem to experience an unusually large amount of ratio compression (probably inherent to the data set rather than the methodology). Using normalized ratios is identical to using untransformed intensities, and if you want to work on the log2 scale, it doesn't seem to matter whether you take the transform at the beginning or right before the DEA step.
# Session information
diff --git a/intro.Rmd b/intro.Rmd
index 7f5e453..4d25e59 100644
--- a/intro.Rmd
+++ b/intro.Rmd
@@ -87,16 +87,26 @@ knitr::kable(df) %>% kable_styling(full_width = F)
## Filtering and preparation
In the file `data_prep.R` you can find the details of all data preparation steps. Here, we list which PSMs were discarded or modified and why:
+
1. Use only data from Runs 1, 2, 4, 5.
-1. Discard reference samples (TMT channels 126 and 131).
-1. Discard PSMs with shared peptides (as done in [Ting et al.](https://doi.org/10.1074/mcp.RA120.002105)).
-1. Add simulated variability (see next section).
-1. Reconstruct missing MS1 Intensity column based on MS2 intensities (necessary for some component variants).
-1. Discard duplicate PSMs due to the use of multiple search engine instances (which only change the score).
-1. Discard PSMs of proteins that appear both in the background and spike-in.
-1. Discard PSMs with Isolation Interference (%) > 30.
-1. Discard PSMs with NA entries in the quantification channels.
-1. Discard PSMs of one-hit-wonder proteins.
+
+2. Discard reference samples (TMT channels 126 and 131).
+
+3. Discard PSMs with shared peptides (as done in [Ting et al.](https://doi.org/10.1074/mcp.RA120.002105)).
+
+4. Add simulated variability (see next section).
+
+5. Reconstruct missing MS1 Intensity column based on MS2 intensities (necessary for some component variants).
+
+6. Discard duplicate PSMs due to the use of multiple search engine instances (which only change the score).
+
+7. Discard PSMs of proteins that appear both in the background and spike-in.
+
+8. Discard PSMs with Isolation Interference (%) > 30.
+
+9. Discard PSMs with NA entries in the quantification channels.
+
+10. Discard PSMs of one-hit-wonder proteins.
## Simulated biological variability
@@ -208,7 +218,7 @@ Our idea behind this series of notebooks is not only to educate but also to invi
### input data
-The [data_prep.R](util/data_prep.R) script process the data in a way required by our notebook files.
+The [data_prep.R](util/data_prep.R) script processes the data in a way required by our notebook files.
This file has been written such that it fits the particular dataset used in our manuscript. With a little bit of effort, however, one can modify it to make it applicable to other datasets as well.
@@ -231,12 +241,13 @@ The output of the script is a file created in the `data` folder called `input_da
```{r}
temp <- readRDS("data/input_data.rds")
-# dat.l
+cat("dat.l")
display_dataframe_head(temp$dat.l)
-# dat.w
+cat("dat.w")
display_dataframe_head(temp$dat.w)
+cat("data.params")
temp$data.params
```
@@ -254,7 +265,7 @@ Each notebook has been supplied with 5 parameters controlling various aspects of
### Master program
-We also prepared an R script with which it is possible to conveniently knit all notebooks from the level of one file. This program is called `data/master_program.R`. This program refers to the notebook parameter described above.
+We also prepared an R script with which it is possible to conveniently knit all notebooks from the level of one file. This program is called `data/master_program.R`, and inside you will find references to the R Markdown parameters explained above.
### Install the required R packages
diff --git a/modelbased_DEA.Rmd b/modelbased_DEA.Rmd
index 5c341c8..640756a 100644
--- a/modelbased_DEA.Rmd
+++ b/modelbased_DEA.Rmd
@@ -34,7 +34,7 @@ knitr::opts_chunk$set(
_This notebook is one in a series of many, where we explore how different data analysis strategies affect the outcome of a proteomics experiment based on isobaric labeling and mass spectrometry. Each analysis strategy or 'workflow' can be divided up into different components; it is recommended you read more about that in the [introduction notebook](intro.html)._
-In this notebook specifically, we investigate the effect of varying the Differential Expression testing component on the outcome of the differential expression results. The four component variants are: **linear mixed-effects model**, **DEqMS**, **ANOVA* applied to the **protein-level** data, and **ANOVA** applied to the **PSM-level** data.
+In this notebook specifically, we investigate the effect of varying the Differential Expression testing component on the outcome of the differential expression results. The four component variants are: **linear mixed-effects model**, **DEqMS**, **ANOVA** applied to the **protein-level** data, and **ANOVA** applied to the **PSM-level** data.
_The R packages and helper scripts necessary to run this notebook are listed in the next code chunk: click the 'Code' button. Each code section can be expanded in a similar fashion. You can also download the [entire notebook source code](modelbased_DEA.Rmd)._
@@ -164,7 +164,7 @@ load(paste0('modelbased_DEA_outdata', params$suffix_p, '.rda'))
## Boxplot
-The normalization model consistently align the reporter ion intensity values.
+The normalization model consistently aligns the reporter ion intensity values.
```{r}
par(mfrow=c(1,2))
@@ -336,10 +336,13 @@ To see whether the three Unit scales produce similar results on the detailed lev
Regarding q-values, we can merely notice the higher correlation between the variants working with the same type of data (PSM or protein-level data).
+```{r}
+scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
+```
+
Log fold changes originating from all variants are well correlated. We can merely notice that using PSM-level data leads to a bit more scattered values around the center of the distributions.
```{r}
-scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)
```
diff --git a/modelbased_normalization.Rmd b/modelbased_normalization.Rmd
index d28e5fc..0bca4bb 100644
--- a/modelbased_normalization.Rmd
+++ b/modelbased_normalization.Rmd
@@ -134,9 +134,19 @@ dat.norm.l <- emptyList(variant.names)
dat.norm.l <- lapply(dat.norm.l, function(x) x <- dat.summ.l)
```
+For the three normalization models, we adopt the following naming convention:
+
+- $y_{i, j(i), q, l, s}$ denotes the reporter ion intensities,
+- $u$ is the model intercept,
+- $b_q$ is the multiplexed tandem-MS run effect,
+- $v_{l(q)}$ corresponds to the quantification channel (within MS run) effect,
+- $p_i$ stands for the protein effect,
+- $f_{j(i)}$ describes the peptide (within protein) contribution,
+- $\varepsilon_{i, j(i), q, l, s}$ is the error term.
+
## LMM1 (peptide-by-run interaction)
-We start with a model that corrects the observed reporter ion intensities $y_{i, j(i), q, l, s}$ for imbalance stemming from run $b_q$ and run-channel $v_{l(q)}$ fixed effects, as well as protein $p_i$ and run-protein $b_q \times f_{j(i)}$ random effects:
+We start with a model that corrects the observed reporter ion intensities for imbalance stemming from run $b_q$ and run-channel $v_{l(q)}$ fixed effects, as well as protein $p_i$ and run-peptide $b_q \times f_{j(i)}$ random effects:
$$ \log_2y_{i, j(i), q, l, s} = u + b_q + v_{l(q)} + p_i + (b_q \times f_{j(i)}) + \varepsilon_{i, j(i), q, l, s} $$
@@ -149,7 +159,7 @@ dat.norm.l$LMM1$response <- residuals(LMM1)
## LMM2 (protein-by-run interaction)
-In the next variant, we include a random interaction between Run and Protein and keep the Peptide random effect constant across different runs:
+In the next variant, we include a random interaction between Run and Protein, and keep the Peptide random effect constant across different runs:
$$ \log_2y_{i, j(i), q, l, s} = u + b_q + v_{l(q)} + (b_q \times p_i) + f_{j(i)} + \varepsilon_{i, j(i), q, l, s} $$
@@ -241,7 +251,7 @@ Now, let's check if these multi-dimensional data contains some kind of grouping;
### Using all proteins
-After LMM2 or LMM3 normalization the samples are very closely grouped according to run instead of the dilution factor. Only LMM1, which contains the peptide-run interaction, restore the correct relation between the samples. This finding, together with conclusions from the **datadrive_normalization** notebook, cannot be overstated as it implies that correcting for peptide-run interaction is of utmost importance in obtaining valid inference from multi-run isobaric labeling datasets.
+After LMM2 or LMM3 normalization the samples are very closely grouped according to run instead of the dilution factor. Only LMM1, which contains the peptide-run interaction, restores the correct relation between the samples. This finding, together with conclusions from the [datadriven_normalization](datadriven_normalization.html) notebook, cannot be overstated, as it implies that correcting for the peptide-run interaction is of utmost importance in obtaining valid inference from multi-run isobaric labeling datasets.
```{r}
par(mfrow=c(2, 2))
@@ -254,7 +264,7 @@ There are only 19 proteins supposed to be differentially expressed in this data
### Using spiked proteins only
-Therefore, let's see what the PCA plots look like if we were to only use the spiked proteins in the PCA. This time, both LMM1 and LMM2 successfully clustered the samples, but this check has only a theoretical value as in practice one does not for sure which proteins are differentially abundant.
+Therefore, let's see what the PCA plots look like if we were to only use the spiked proteins in the PCA. This time, both LMM1 and LMM2 successfully clustered the samples, but this check has only theoretical value, as in practice one does not know for sure which proteins are differentially abundant.
```{r, eval=length(spiked.proteins)>0}
par(mfrow=c(2, 2))
@@ -365,10 +375,13 @@ To see whether the three Unit scales produce similar results on the detailed lev
q-values of the three normalization models are moderately correlated. LMM2 and LMM3 q-values are generally larger than those of LMM1, confirming the drastically altered variance structures in the observations when the normalization model is misspecified (i.e., does not include the peptide-run interaction term). All q-values of the "raw" variant are constant (approximately 1), so the correlation coefficient could not be computed (NA).
+```{r}
+scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
+```
+
When it comes to log fold changes, estimates are virtually not influenced at all by the three different normalization models. The spiked proteins' fold change estimates (in orange) based on unnormalized (raw) data are in line with estimates based on normalized expression, emphasizing the successful technical conduct of the experiment and the remarkable quality of the acquired data.
```{r}
-scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)
```
diff --git a/modelbased_summarization.Rmd b/modelbased_summarization.Rmd
index 3bc6ee1..aa4457a 100644
--- a/modelbased_summarization.Rmd
+++ b/modelbased_summarization.Rmd
@@ -372,7 +372,7 @@ A good way to assess the general trend of the fold change estimates on a more 'm
Clearly, the empirical results _tend towards_ the theoretical truth, but not a single observation attained the fold change it should have attained. There is clearly a strong bias towards zero fold change, which may partly be explained by the ratio compression phenomenon in mass spectrometry, although the effect seems quite extreme here.
-Although, shape of spike-in protein log fold changes somewhat differ between the three variants, the underlying message is simple - there are no substantial differences.
+Although the shapes of the spike-in protein log fold changes differ somewhat between the three variants, the underlying message is simple: there are no substantial differences, at least in this particular dataset.
```{r}
# plot theoretical value (horizontal lines) and violin per variant
diff --git a/modelbased_unit.Rmd b/modelbased_unit.Rmd
index d0a4e7c..8092585 100644
--- a/modelbased_unit.Rmd
+++ b/modelbased_unit.Rmd
@@ -354,7 +354,7 @@ Now, the most important part: let's find out how our component variants have aff
A confusion matrix shows how many true and false positives/negatives each variant has given rise to. Spiked proteins that are DE are true positives, background proteins that are not DE are true negatives. We calculate this matrix for all conditions and then calculate some other informative metrics based on the confusion matrices: accuracy, sensitivity, specificity, positive predictive value and negative predictive value.
-In case of the `0.125 vs 0.5` and `1 vs 0.5` contrasts, only the log of intensities provides good outcomes. The biological difference in the `0.667 vs 0.5` contrast, however, seems to be too small to be picked by the proposed modelling approach, regardless of the unit scale. Moreover, the reason for the subpar performance of untransformed intensities and ratios that we observe here can lie in the flawed log fold change estimation and/or erroneous variance structure in the data. Hopefully, this conundrum can be explained after inspecting next visualisations.
+In case of the `0.125 vs 0.5` and `1 vs 0.5` contrasts, only the log of intensities provides good outcomes. The biological difference in the `0.667 vs 0.5` contrast, however, seems to be too small to be picked up by the proposed modelling approach, regardless of the unit scale. Moreover, the reason for the subpar performance of untransformed intensities and ratios that we observe here may lie in flawed log fold change estimation and/or an erroneous variance structure in the data. Hopefully, this conundrum can be explained after inspecting the forthcoming visualisations.
```{r, results='asis'}
cm <- conf_mat(dat.dea, 'q.mod', 0.05, spiked.proteins)
@@ -367,10 +367,13 @@ To see whether the three Unit scales produce similar results on the detailed lev
First, the q-values of spike-in proteins associated with intensities on the original scale and ratios are generally larger.
+```{r}
+scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
+```
+
Second, the distributions of fold changes stemming from the analysis of untransformed intensities and ratios are unnaturally wide, both for the spike-in and background proteins. If we recall that the boxplots of normalized intensity and ratio values were centered around zero, as well as technical comment #2, we can conjecture that the terms $m$ and $m + r_c$ can also be close to zero, and therefore the entire $\log FC = \log \frac{m + r_c}{m}$ can be unstable or even undefined.
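A toy calculation with made-up numbers illustrates this instability: the closer $m$ gets to zero, the more the estimate explodes.

```{r}
# Made-up numbers: log2((m + r_c) / m) blows up as the denominator m approaches zero
r_c <- 0.5
m <- c(2, 0.5, 0.05, 0.005, 0)
round(log2((m + r_c) / m), 2)
# 0.32 1.00 3.46 6.66  Inf
```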
-```{r, fig.width=12, fig.height=10}
-scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)
+```{r}
scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)
```
@@ -400,7 +403,7 @@ if (length(spiked.proteins)>0) violinplot_ils(lapply(dat.dea, function(x) x[spik
# Conclusions
-In this notebook we demonstrated that one should not use (untransformed) intensities in a model-based normalization as the former putatively exhibit multiplicative biases, but the latter can only apply additive corrections. Mixing additive and multiplicative scales tends to give distorted results, including when using (intensity-based) ratios.
+In this notebook we demonstrated that one should not use (untransformed) intensities in model-based normalization: the intensities putatively exhibit multiplicative biases, whereas the model can only apply additive corrections. Mixing additive and multiplicative scales tends to give distorted results, and this also holds for the use of (intensity-based) ratios.
# Session information