---
bibliography: references.bib
---
# Peaks normalization
## Batch effects
Batch effects are the variances caused by factors other than the experimental design. We could describe them with a simple linear model for the intensity of one peak:
$$Intensity = Average + Condition + Batch + Error$$
Research focuses on the condition contribution, while the overall average and random error can be estimated. However, we know little about the batch contribution. Sometimes known variables such as injection order or operators can serve as the batch term, but in most cases such variables are unknown. Almost all batch correction methods try to estimate the batch effect and then balance or remove it.

In analytical chemistry, internal standards or pooled quality control (QC) samples actually stand for the batch contribution in the model. However, it is impractical to find internal standards for every compound when the data are collected in an untargeted way. Methods using internal standards or pooled QC samples usually remove the variation among those samples via medians, quantiles, means, or ratios. Other approaches, such as quantile regression or centering and scaling based on the distribution within samples, can be seen as using the stable distribution of peak intensities to remove batch effects.
## Batch effects classification
Variances among the samples across all the extracted peaks might be affected by factors other than the experimental design. Such batch effects fall into three types: monotone, block, and mixed.

- Monotone batch effects increase or decrease with the injection order or across batches.
```{r echo=FALSE,out.width='61.8%'}
batch <- seq(0,9,length.out = 10)
group <- factor(c(rep(1,5),rep(2,5)))
names(batch) <- as.character(c(1:10))
barplot(batch,col=group)
legend('topleft',legend = c('Batch1','Batch2'),col=c(1,2),pch=19,bty = 'n')
```
- Block batch effects appear as systematic shifts among different batches.
```{r echo=FALSE,out.width='61.8%'}
# Block batch
batch <- c(2,2,7,7,7,7,7,3,3,3)
group <- factor(c(rep(1,5),rep(2,5)))
names(batch) <- as.character(c(1:10))
barplot(batch,col=group)
legend('topleft',legend = c('Batch1','Batch2'),col=c(1,2),pch=19,bty = 'n')
```
- Mixed batch effects combine the monotone and block patterns.
```{r echo=FALSE,out.width='61.8%'}
# Mixed batch
batch <- c(2,2,7,7,7,7,7,3,3,3) + seq(0,9,length.out = 10)
group <- factor(c(rep(1,5),rep(2,5)))
names(batch) <- as.character(c(1:10))
barplot(batch,col=group)
legend('topleft',legend = c('Batch1','Batch2'),col=c(1,2),pch=19,bty = 'n')
```
Meanwhile, different compounds can suffer from different types of batch effects, so normalization or batch correction should be done peak by peak.
```{r bem,echo=FALSE,out.width='70%'}
# simulate n compounds, each with a random mix of increasing,
# decreasing and block batch effects across 10 injections
getsample <- function(n){
        batch <- NULL
        for (i in 1:n){
                # monotone increasing and decreasing trends
                batchin <- seq(1,10,length.out = 10) * rnorm(1)
                batchde <- seq(10,1,length.out = 10) * rnorm(1)
                # block shifts among three batches
                batchblock <- c(rep(rnorm(1),2),rep(rnorm(1),5),rep(rnorm(1),3))
                # randomly switch each component on or off
                batchtemp <- batchin*sample(c(0,1),1) + batchde*sample(c(0,1),1) + batchblock*sample(c(0,1),1)
                batch <- rbind(batch,batchtemp)
        }
        return(batch)
}
d <- getsample(30)
rownames(d) <- paste('Compound',c(1:30))
heatmap(d, Colv = NA, Rowv = NA, scale="column")
```
## Batch effects visualization
Any correction might introduce bias, so we need to make sure there are patterns different from our experimental design before correction. Pooled QC samples should cluster together on the PCA score plot.
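A quick diagnostic is a PCA score plot colored by batch. Below is a minimal sketch on simulated data (the matrix `mat` and the labels are made up for illustration); with real data, substitute your own sample-by-peak matrix and batch or QC labels.

```{r pcabatch}
set.seed(42)
# simulated peak table: 20 samples x 50 peaks, samples 11-20 shifted by a block batch effect
mat <- matrix(rnorm(20*50), nrow = 20)
mat[11:20, ] <- mat[11:20, ] + 2
batch <- factor(rep(c(1,2), each = 10))
pca <- prcomp(mat, scale. = TRUE)
plot(pca$x[,1], pca$x[,2], col = as.numeric(batch), pch = 19, xlab = 'PC1', ylab = 'PC2')
legend('topright', legend = c('Batch1','Batch2'), col = c(1,2), pch = 19, bty = 'n')
```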
## Source of batch effects
- Different operators, dates, and injection sequences
- Different instrumental conditions, such as changed instrumental parameters, poor quality control, sample contamination during the analysis, column condition (tracked by pooled QC), and sample matrix effects (ion suppression and/or enhancement)
- Unknown unknowns
## Avoid batch effects by DoE
You could reduce batch effects through experimental design: cap the sequence with pooled QC samples and randomize the sample injection order. Internal standards or instrumental QC might help to find the source of batch effects, although it is not practical to cover every compound in non-targeted analysis.
Batch effects might not change the conclusion when the effect size is relatively small. Here is a simulation:
```{r}
set.seed(30)
# real peaks without any batch effect
group <- factor(c(rep(1,5),rep(2,5)))
con <- c(rnorm(5,5),rnorm(5,8))
t.test(con~group)
# the same peaks with a monotone batch effect added
batch <- seq(0,5,length.out = 10)
ins <- batch+con
t.test(ins~group)
# randomize the injection order
index <- sample(10)
ins <- batch+con[index]
t.test(ins~group[index])
```
Randomization alone cannot guarantee correct results. Here is a simulation in which a decreasing drift masks the group difference:
```{r}
# the same design, but the drift decreases with the injection order
group <- factor(c(rep(1,5),rep(2,5)))
con <- c(rnorm(5,5),rnorm(5,8))
batch <- seq(5,0,length.out = 10)
ins <- batch+con
t.test(ins~group)
```
## *post hoc* data normalization
To make the samples comparable, normalization across the samples is always needed after the experimental part is done. A batch effect should show patterns other than the experimental design; otherwise it is just noise. Correction is possible through data analysis and/or a randomized experimental design. There are numerous normalization methods and combinations of them, which we could divide into two categories: unsupervised and supervised.

Unsupervised methods only consider the distribution of peak intensities across the samples. For example, quantile calibration tries to make the intensity distributions among the samples similar. Such methods are preferred for exploring the inner structure of the samples. Methods based on internal standards or pooled QC samples also belong to this category. However, it is hard for a few peaks to stand for all the extracted peaks.

Supervised methods use the group or batch information from the experimental design to normalize the data. A linear model is usually built to capture the unwanted variances and remove them for further analysis.

Since the real batch effects are always unknown, it is hard to validate different normalization methods. Li et al. developed NOREVA to compare 25 correction methods [@li2017a], and a recent update raised this number to 168 [@yang2020]. MetaboDrift also contains some batch correction methods implemented in Excel [@thonusin2017]. Another idea is to use spiked-in samples to validate the methods [@franceschi2012], which might suit targeted analysis better than non-targeted analysis.

Relative log abundance (RLA) plots [@delivera2012] and heatmaps are often used to show the variances among the samples.
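As a sketch, an RLA plot can be drawn with base R alone: on (already log-scaled) simulated data, subtract each peak's median across samples and draw per-sample boxplots, which should center on zero with similar spread when no batch effect remains. The data here are simulated for illustration.

```{r rlaplot}
set.seed(42)
# simulated log abundances: 50 peaks x 20 samples, samples 11-20 shifted
mat <- matrix(rnorm(50*20, mean = 10), nrow = 50)
mat[, 11:20] <- mat[, 11:20] + 1
# subtract the median of each peak across all samples (across-group RLA)
rla <- sweep(mat, 1, apply(mat, 1, median))
boxplot(rla, xlab = 'Sample', ylab = 'Relative log abundance')
abline(h = 0, lty = 2)
```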
### Unsupervised methods
#### Distribution of intensity
Intensities collected from LC/GC-MS usually show a right-skewed distribution. Log transformation is often necessary before further statistical analysis.
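A quick simulated illustration of the effect of the log transformation on right-skewed intensities:

```{r logtrans}
set.seed(42)
# log-normal values mimic right-skewed raw intensities
x <- rlnorm(1000, meanlog = 5)
par(mfrow = c(1,2))
hist(x, main = 'Raw intensity', xlab = '')
hist(log(x), main = 'Log intensity', xlab = '')
```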
#### Centering
For peak p of sample s in batch b, the corrected abundance I is:
$$\hat I_{p,s,b} = I_{p,s,b} - mean(I_{p,b}) + median(I_{p,qc})$$
If no quality control samples are used, the corrected abundance I would be:
$$\hat I_{p,s,b} = I_{p,s,b} - mean(I_{p,b})$$
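A minimal sketch of the QC-based centering correction for a single peak; the QC intensities `Iqc` here are simulated stand-ins:

```{r centering}
set.seed(42)
# one peak: two batches of 10 samples, batch 2 shifted upwards
I <- c(rnorm(10, mean = 5), rnorm(10, mean = 8))
B <- rep(c(1,2), each = 10)
# simulated QC intensities of this peak pooled over batches
Iqc <- rnorm(6, mean = 6)
# subtract the batch mean, add back the overall QC median
Icor <- I - ave(I, B) + median(Iqc)
plot(I, pch = 19, ylim = range(c(I, Icor)))
points(Icor, col = 'red', pch = 19)
```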
#### Scaling
For peak p of sample s in a certain batch b, the corrected abundance I is:

$$\hat I_{p,s,b} = \frac{I_{p,s,b} - mean(I_{p,b})}{std_{p,b}} * std_{p,qc,b} + mean(I_{p,qc,b})$$

If no quality control samples are used, the corrected abundance I would be:
$$\hat I_{p,s,b} = \frac{I_{p,s,b} - mean(I_{p,b})}{std_{p,b}}$$
#### Pareto Scaling
For peak p of sample s in a certain batch b, the corrected abundance I is:

$$\hat I_{p,s,b} = \frac{I_{p,s,b} - mean(I_{p,b})}{\sqrt{std_{p,b}}} * \sqrt{std_{p,qc,b}} + mean(I_{p,qc,b})$$

If no quality control samples are used, the corrected abundance I would be:

$$\hat I_{p,s,b} = \frac{I_{p,s,b} - mean(I_{p,b})}{\sqrt{std_{p,b}}}$$
#### Range Scaling
For peak p of sample s in a certain batch b, the corrected abundance I is:

$$\hat I_{p,s,b} = \frac{I_{p,s,b} - mean(I_{p,b})}{max(I_{p,b}) - min(I_{p,b})} * (max(I_{p,qc,b}) - min(I_{p,qc,b})) + mean(I_{p,qc,b})$$

If no quality control samples are used, the corrected abundance I would be:
$$\hat I_{p,s,b} = \frac{I_{p,s,b} - mean(I_{p,b})}{max(I_{p,b}) - min(I_{p,b})} $$
#### Level scaling
For peak p of sample s in a certain batch b, the corrected abundance I is:

$$\hat I_{p,s,b} = \frac{I_{p,s,b} - mean(I_{p,b})}{mean(I_{p,b})} * mean(I_{p,qc,b}) + mean(I_{p,qc,b})$$

If no quality control samples are used, the corrected abundance I would be:
$$\hat I_{p,s,b} = \frac{I_{p,s,b} - mean(I_{p,b})}{mean(I_{p,b})} $$
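The no-QC variants of these four scalings differ only in the denominator. Below is a compact sketch for a single peak within one batch; `scalepeak` is a hypothetical helper written just for this illustration:

```{r scalings}
set.seed(42)
# one peak within one batch
I <- rnorm(10, mean = 5, sd = 2)
scalepeak <- function(x, type = c('auto','pareto','range','level')){
        type <- match.arg(type)
        d <- x - mean(x)
        switch(type,
               auto = d/sd(x),
               pareto = d/sqrt(sd(x)),
               range = d/(max(x)-min(x)),
               level = d/mean(x))
}
round(cbind(auto = scalepeak(I,'auto'), pareto = scalepeak(I,'pareto'),
            range = scalepeak(I,'range'), level = scalepeak(I,'level')), 3)
```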
#### Quantile
The idea of quantile calibration is to align the intensities in each sample according to their quantiles. Here is a demo:
```{r quantile, cache=T}
set.seed(42)
a <- rnorm(1000)
# b suffered a batch effect with a bias of 10
b <- rnorm(1000,10)
hist(a,xlim=c(-5,15),breaks = 50)
hist(b,col = 'black', breaks = 50, add=T)
# quantile normalization: average the sorted intensities of a and b
m <- (a[order(a)]+b[order(b)])/2
# put the averaged values back into the original order of a
acor <- m[order(order(a))]
hist(acor,col = 'red', breaks = 50, add=T)
```
#### Ratio based calibration
This method calibrates samples by the ratio between the QC samples across all batches and those within a certain batch. For peak p of sample s in a certain batch b, the corrected abundance I is:

$$\hat I_{p,s,b} = \frac{I_{p,s,b} * median(I_{p,qc})}{mean(I_{p,qc,b})}$$
```{r ratio}
set.seed(42)
# raw data: one peak across two batches
I <- c(rnorm(10,mean = 0, sd = 0.3),rnorm(10,mean = 1, sd = 0.5))
# batch index
B <- c(rep(1,10),rep(2,10))
# one QC measurement of this peak in each batch
Iqc <- c(rnorm(1,mean = 0, sd = 0.3),rnorm(1,mean = 1, sd = 0.5))
# divide by the batch-wise QC mean and rescale by the overall QC median
Icor <- I * median(Iqc) / Iqc[B]
# plot the raw and corrected data
plot(I)
plot(Icor)
```
#### Linear Normalizer
This method first scales each sample so that the sum of all peak abundances equals one, and then multiplies by the median of the per-sample sums across all samples to get the corrected data.
```{r Linear}
set.seed(42)
# raw data: two peaks x 20 samples
peaksa <- c(rnorm(10,mean = 10, sd = 0.3),rnorm(10,mean = 20, sd = 0.5))
peaksb <- c(rnorm(10,mean = 10, sd = 0.3),rnorm(10,mean = 20, sd = 0.5))
df <- rbind(peaksa,peaksb)
# divide each sample (column) by its total abundance,
# then rescale by the median of the per-sample totals
dfcor <- sweep(df, 2, colSums(df), '/') * median(colSums(df))
image(df)
image(dfcor)
```
#### Internal standards
For peak p of sample s corrected by an internal standard IS:

$$\hat I_{p,s} = \frac{I_{p,s} * median(I_{IS})}{I_{IS,s}}$$

Some methods also use pooled calibration samples and a multiple internal standards strategy to correct the data [@vanderkloet2009; @sysi-aho2007]. Other methods only use QC samples to handle the data [@kuligowski2015].
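A one-peak sketch of the internal standard correction above; the IS responses are simulated for illustration:

```{r isnorm}
set.seed(42)
# one peak across 10 samples and a co-measured internal standard
I <- rnorm(10, mean = 100, sd = 5)
IS <- rnorm(10, mean = 50, sd = 5)
# divide by each sample's IS response, rescale by the median IS response
Icor <- I * median(IS) / IS
plot(I, pch = 19, ylim = range(c(I, Icor)))
points(Icor, col = 'red', pch = 19)
```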
### Supervised methods
#### Regression calibration
Considering the batch effect from injection order, this method regresses the data on the injection order with a linear model and uses the residuals as the calibrated data.
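A minimal sketch of the idea: fit a linear model of intensity on injection order and keep the residuals (plus the mean). Here the drift is fitted on the samples themselves; in practice the model is often fitted on QC samples and then applied to the study samples.

```{r regcal}
set.seed(42)
# one peak with a monotone injection-order drift
runorder <- 1:20
I <- rnorm(20, mean = 10) + 0.3 * runorder
fit <- lm(I ~ runorder)
# remove the fitted drift while keeping the overall mean
Icor <- I - fitted(fit) + mean(I)
plot(runorder, I, pch = 19, ylim = range(c(I, Icor)))
points(runorder, Icor, col = 'red', pch = 19)
abline(fit)
```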
#### Batch Normalizer
Batch Normalizer scales by the total abundance and then fits the result with a regression line [@wang2013].
#### Surrogate Variable Analysis(SVA)
We have an M\*N data matrix, where M stands for the peaks and N stands for the individual samples. For peak $i$, $X_i = (x_{i1},...,x_{in})^T$ stands for its normalized intensities across the samples, and $Y = (y_1,...,y_n)^T$ stands for the group information of the samples. Then we could build the model:

$$x_{ij} = \mu_i + f_i(y_j) + e_{ij}$$
$\mu_i$ stands for the baseline of the peak intensities in a normal state. Then we have:
$$f_i(y_j) = E(x_{ij}|y_j) - \mu_i$$
stands for the biological variation caused by our group factor, for example, whether treated by exposure or not.
However, considering the batch effects, the real model could be:
$$x_{ij} = \mu_i + f_i(y_j) + \sum_{l = 1}^L \gamma_{li}p_{lj} + e_{ij}^*$$

$\gamma_{li}$ stands for the peak-specific coefficient of potential factor $l$, and $p_{lj}$ stands for the value of potential factor $l$ in sample $j$. Actually, the error term $e_{ij}$ in a real sample can always be decomposed as $e_{ij} = \sum_{l = 1}^L \gamma_{li}p_{lj} + e_{ij}^*$, with $e_{ij}^*$ standing for the real random error of a certain peak in a certain sample.
We could not get the potential factors directly. Since we don't care about the details of the unknown factors, we could estimate orthogonal vectors $h_k$ standing for such potential factors. Thus we have:
$$
x_{ij} = \mu_i + f_i(y_j) + \sum_{l = 1}^L \gamma_{li}p_{lj} + e_{ij}^*\\
= \mu_i + f_i(y_j) + \sum_{k = 1}^K \lambda_{ki}h_{kj} + e_{ij}
$$
Here are the details of the algorithm:
> The algorithm is decomposed into two parts: detection of unmodeled factors and construction of surrogate variables
##### Detection of unmodeled factors
- Estimate $\hat\mu_i$ and $\hat f_i$ by fitting the model $x_{ij} = \mu_i + f_i(y_j) + e_{ij}$ and get the residuals $r_{ij} = x_{ij}-\hat\mu_i - \hat f_i(y_j)$. Then we have the residual matrix $R$.
- Perform the singular value decomposition (SVD) of the residual matrix $R = UDV^T$
- Let $d_l$ be the $l$th eigenvalue of the diagonal matrix $D$ for $l = 1,...,n$. Set $df$ as the degrees of freedom of the model $\hat\mu_i + \hat f_i(y_j)$. We could build a statistic $T_k$ as:
$$T_k = \frac{d_k^2}{\sum_{l=1}^{n-df}d_l^2}$$
to show the variance explained by the $k$th eigenvalue.
- Permute each row of R to remove the structure in the matrix and get $R^*$.
- Fit the model $r_{ij}^* = \mu_i^* + f_i^*(y_j) + e^*_{ij}$ and get $r_{ij}^0 = r^*_{ij}-\hat\mu^*_i - \hat f^*_i(y_j)$ as the null matrix $R_0$
- Perform the singular value decomposition (SVD) of the null matrix $R_0 = U_0D_0V_0^T$
- Compute the null statistic:
$$
T_k^0 = \frac{d_{0k}^2}{\sum_{l=1}^{n-df}d_{0l}^2}
$$
- Repeat the row permutation B times to get the null statistics $T_k^{0b}$
- Get the p-value for eigengene:
$$p_k = \frac{\#\{T_k^{0b}\geq T_k; b=1,...,B\}}{B}$$
- For a significance level $\alpha$, treat k as a significant signature of residual R if $p_k\leq\alpha$
##### Construction of surrogate variables
- Estimate $\hat\mu_i$ and $\hat f_i$ by fitting the model $x_{ij} = \mu_i + f_i(y_j) + e_{ij}$ and get the residuals $r_{ij} = x_{ij}-\hat\mu_i - \hat f_i(y_j)$. Then we have the residual matrix $R$.
- Perform the singular value decomposition (SVD) of the residual matrix $R = UDV^T$. Let $e_k = (e_{k1},...,e_{kn})^T$ be the $k$th column of $V$
- Set $\hat K$ as the number of significant eigenvalues found in the first step.
- Regress each $e_k$ on the $x_i$ and get the p-value for each association.
- Set $\hat \pi_0$ as the proportion of peak intensities $x_i$ not associated with $e_k$, find the number $\hat m_1 = [(1-\hat \pi_0) \times m]$, and record the index of the peaks associated with the eigenvalue.
- Form the $\hat m_1 \times N$ matrix $X_r$, which stands for the potential variables. As was done for $R$, get the eigengenes of $X_r$ and denote these by $e_j^r$.
- Let $j^* = argmax_{1\leq j \leq n}cor(e_k,e_j^r)$ and set $\hat h_k=e_j^r$. Set the estimate of the surrogate variable to be the eigenvalue of the reduced matrix most correlated with the corresponding residual eigenvalue. Since the reduced matrix is enriched for peaks associated with this residual eigenvalue, this is a principled choice for the estimated surrogate variable that allows for correlation with the primary variable.
- Employ $\mu_i + f_i(y_j) + \sum_{k = 1}^K \gamma_{ki}\hat h_{kj} + e_{ij}$ as the estimate of the ideal model $\mu_i + f_i(y_j) + \sum_{k = 1}^K \gamma_{ki}h_{kj} + e_{ij}$
This method could find the potential unwanted variables in the data. SVA was introduced by Jeff Leek [@leek2008; @leek2007; @leek2012], and the EigenMS package implements SVA with modifications, including the analysis of data with missing values that are typical in LC-MS experiments [@karpievitch2014a].
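The sva Bioconductor package wraps this algorithm. Below is a minimal sketch (not evaluated here), assuming `dat` is a peaks-by-samples matrix and `group` is the factor of interest:

```{r svarun, eval=FALSE}
library(sva)
# full model with the group factor and null model with intercept only
mod <- model.matrix(~group)
mod0 <- model.matrix(~1, data = data.frame(group))
# estimate surrogate variables from the data
svobj <- sva(dat, mod, mod0)
# include them as covariates when testing each peak
summary(lm(dat[1, ] ~ group + svobj$sv))
```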
#### RUV (Remove Unwanted Variation)
This method's performance is similar to SVA. Instead of finding surrogate variables from the whole dataset, RUV uses control peaks or pooled QC samples to find the unwanted variances and remove them, revealing the peaks related to the experimental design. However, we could also empirically estimate the control peaks by a linear mixed model. RUV-random [@livera2015; @delivera2012] further uses a linear mixed model to estimate the variances of the random error. A hierarchical approach to RUV was recently proposed for metabolomics data [@kim2021]. This method could be used with suitable controls, which are common in metabolomics DoE.
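A minimal illustration of the control-based idea on simulated data (a simplified RUV-2-style estimate, not the full RUV-random model): estimate the unwanted factor from negative-control peaks by SVD and regress it out of every peak.

```{r ruvsketch}
set.seed(42)
# 100 peaks x 20 samples with one hidden factor affecting all peaks
h <- rnorm(20)
dat <- matrix(rnorm(100*20), 100, 20) + outer(rnorm(100), h)
# indices of (assumed) negative-control peaks unrelated to the design
ctl <- 1:10
# estimate one unwanted factor from the control peaks by SVD
W <- svd(scale(t(dat[ctl, ]), scale = FALSE))$u[, 1, drop = FALSE]
# regress the estimated factor out of every peak
datcor <- t(apply(dat, 1, function(y) residuals(lm(y ~ W)) + mean(y)))
```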
#### RRmix
RRmix also uses a latent factor model to correct the data [@jr2017]. This method could be treated as a linear mixed model version of SVA. No control samples are required, and the unwanted variances are removed by factor analysis. This method might be the best choice for removing unwanted variables under a common experimental design.
#### Norm ISWSVR
Norm ISWSVR is a two-step approach combining best-performance internal standard correction with support vector regression normalization, comprehensively removing systematic and random errors as well as matrix effects [@ding2022a].
## Method to validate the normalization
Various methods have been used for batch correction and evaluation. Simulation ensures ground truth: after a difference analysis, we can check whether each reported peak is a true positive or a false positive against the settings of the simulation. Other approaches need statistics or lots of standards to describe the performance of the batch correction or normalization results.
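A minimal simulation sketch: spike a known group difference into some peaks, add a batch effect, and count true and false positives from a peak-wise t-test, since the simulation settings provide the ground truth.

```{r simcheck}
set.seed(42)
group <- factor(rep(c(1,2), each = 10))
# 100 peaks x 20 samples; the first 10 peaks carry a true group difference
dat <- matrix(rnorm(100*20), 100, 20)
dat[1:10, group == 2] <- dat[1:10, group == 2] + 2
# add a monotone batch effect to every peak
dat <- dat + matrix(rep(seq(0, 2, length.out = 20), each = 100), 100, 20)
p <- apply(dat, 1, function(x) t.test(x ~ group)$p.value)
# ground truth lets us count true and false positives at alpha = 0.05
table(called = p < 0.05, truth = rep(c(TRUE, FALSE), c(10, 90)))
```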
## Software
- [BatchCorrMetabolomics](https://github.com/rwehrens/BatchCorrMetabolomics) provides improved batch correction for untargeted MS-based metabolomics.
- [MetNorm](https://github.com/cran/MetNorm) provides statistical methods for normalizing metabolomics data.
- [BatchQC](https://github.com/mani2012/BatchQC) could be used to simulate batch effects.
- [Noreva](http://idrb.zju.edu.cn/noreva/) could perform online batch correction and comparison [@fu2021].