```{r, hide=TRUE, warning=FALSE, message=FALSE, echo=FALSE}
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("GGally"))
devtools::load_all()
```
sigclust2
=======================
## Contents
1. [Introduction](#intro)
2. [Installing](#install)
3. [Testing](#test)
4. [Plotting](#plot)
5. [References](#refs)
6. [Session Information](#sessioninfo)
## <a name="intro"></a> Introduction
This package may be used to assess statistical significance in hierarchical clustering.
To assess significance in high-dimensional data, the approach assumes that a cluster
may be well approximated by a single Gaussian (normal) distribution. Given the results
of hierarchical clustering, the approach sequentially tests from the root node whether
the data at each split/join correspond to one or more Gaussian distributions. The
hypothesis test performed at each node is based on a Monte Carlo simulation procedure,
and the family-wise error rate (FWER) is controlled across the dendrogram using a sequential
testing procedure.
An illustration of the basic usage of the package's testing procedure is provided in the
[Testing section](#test). Variations on the basic testing procedure are described in the
associated subsections. Basic plotting procedures are described in the [Plotting section](#plot).
## <a name="install"></a> Installing
To install `sigclust2`, which depends on packages in both [CRAN](http://cran.r-project.org/) and [Bioconductor](https://bioconductor.org/),
use the following call to [the `BiocManager` package](https://CRAN.R-project.org/package=BiocManager):
```
R> BiocManager::install("pkimes/sigclust2")
```
The package can then be loaded using the standard call to `library`.
```{r}
suppressPackageStartupMessages(library("sigclust2"))
```
**While not necessary, installing the `Rclusterpp` package for faster clustering is recommended
when using `sigclust2`.** Unfortunately, the `Rclusterpp` package is no longer available on CRAN.
However, `Rclusterpp` can still be installed from [the author's GitHub repo](https://github.com/nolanlab/Rclusterpp/)
with the following command from [the `devtools` package](https://CRAN.R-project.org/package=devtools) (or
equivalently with a call to `BiocManager::install` again).
```
R> devtools::install_github("nolanlab/Rclusterpp")
```
## <a name="test"></a> Testing
For the following examples, we will use a simple toy example with 150 samples (_n_) with
100 measurements (_p_). The data are simulated from three Gaussian (normal) distributions.
```{r}
set.seed(1508)
n1 <- 60; n2 <- 40; n3 <- 50; n <- n1 + n2 + n3
p <- 100
data <- matrix(rnorm(n*p), nrow=n, ncol=p)
data[, 1] <- data[, 1] + c(rep(2, n1), rep(-2, n2), rep(0, n3))
data[, 2] <- data[, 2] + c(rep(0, n1+n2), rep(sqrt(3)*3, n3))
```
The separation of the three underlying distributions can be observed from a PCA (principal components
analysis) scatterplot. While the separation is clear in the first 2 PCs, recall that the data
actually exist in `r p` dimensions.
```{r, fig.width=10, fig.height=4}
data_pc <- prcomp(data)
par(mfrow=c(1, 2))
plot(data_pc$x[, 2], data_pc$x[, 1], xlab="PC2", ylab="PC1")
plot(data_pc$x[, 3], data_pc$x[, 1], xlab="PC3", ylab="PC1")
```
The SHC testing procedure is performed using the `shc` function. The function requires the following
three arguments:
* `x`: the data as a `matrix` with samples in rows,
* `metric`: the dissimilarity metric, and
* `linkage`: the linkage function to be used for hierarchical clustering.
For reasons outlined in the corresponding paper [(Kimes et al. 2017)](#refs) relating to how
the method handles testing when n << p, we recommend using `"euclidean"` as the metric,
and any of `"ward.D2"`, `"single"`, `"average"`, `"complete"` as the linkage. If a custom
dissimilarity metric is desired, either of `vecmet` or `matmet` should be specified, as
described [later](#newmetric) in this section.
If metric functions which do not satisfy rotation invariance are desired,
e.g. one minus Pearson correlation (`"cor"`) or L1 (`"manhattan"`),
`null_alg = "2means"` and `ci = "2CI"` should be specified. The `null_alg` and `ci` parameters
specify the algorithm for clustering and measure of "cluster strength" used to generate the null
distribution for assessing significance. Since the K-means algorithm (`2means`) optimizes
the 2-means CI (`2CI`), the resulting p-value will be conservative. However, since the hierarchical
algorithm is not rotation invariant, using `null_alg = "hclust"` or `ci = "linkage"` produces
unreliable results. An example for testing using Pearson correlation is given [later](#pearson) in
this section.
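As a rough illustration of what rotation invariance means for a dissimilarity metric, the following base R sketch applies a random orthogonal transformation (a rotation, possibly combined with a reflection) to a small matrix and compares the pairwise distances before and after. The helper `same_dist` and all variable names are made up for this illustration and are not part of `sigclust2`.

```r
set.seed(42)
n <- 20; p <- 5
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
q <- qr.Q(qr(matrix(rnorm(p * p), nrow = p, ncol = p)))
x_rot <- x %*% q

# Hypothetical helper: are all pairwise distances unchanged by the transformation?
same_dist <- function(a, b, method) {
  isTRUE(all.equal(as.vector(dist(a, method = method)),
                   as.vector(dist(b, method = method))))
}

euclid_same <- same_dist(x, x_rot, "euclidean")     # Euclidean distances are preserved
manhattan_same <- same_dist(x, x_rot, "manhattan")  # L1 distances generally are not
c(euclidean = euclid_same, manhattan = manhattan_same)
```

Since the Euclidean distances are unchanged, a hierarchical clustering based on them is also unchanged by the transformation, which is not the case for L1.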
For now, we just use the recommended and default parameters.
```{r}
shc_result <- shc(data, metric="euclidean", linkage="ward.D2")
```
The output is an S3 object of class `shc`, and a brief description of the analysis results can be
obtained with the `summary` function.
```{r}
summary(shc_result)
```
The analysis output can be accessed using the `$` accessor. More details on the different entries
can be found in the documentation for the `shc` function.
```{r}
names(shc_result)
```
The computed p-values are probably of greatest interest. Two p-values are computed as part of the
SHC testing procedure: (1) an empirical p-value (`p_emp`), and (2) a Gaussian approximate
p-value (`p_norm`). The p-values are computed based on comparing the observed strength of
clustering in the data against the expected strength of clustering under the null hypothesis
that the data come from a single cluster. The null distribution is approximated using a
specified number of simulated datasets (`n_sim = 100` default argument). `p_emp` is the empirical
p-value computed from the collection of simulated null datasets. `p_norm` is an approximation to
the empirical p-value which provides more continuous p-values. `nd_type` stores the results of the
test and takes values in: `n_small`, `no_test`, `sig`, `not_sig`, `cutoff_skipped`. With the default
implementation of `shc` using no FWER control, all nodes are either `cutoff_skipped` or `n_small`.
The p-values are reported for each of `r n-1` (`n-1`) nodes along the hierarchical dendrogram.
The entries of `p_emp` and `p_norm` are ordered descending from the top of the dendrogram, with
the first entry corresponding to the very top (root) node of the tree.
```{r}
data.frame(result = head(shc_result$nd_type, 5),
           round(head(shc_result$p_norm, 5), 5),
           round(head(shc_result$p_emp, 5), 5))
```
In addition to values between 0 and 1, some p-values are reported as `2`. These values correspond
to nodes which were not tested, either because of the implemented family-wise error rate (FWER)
controlling procedure (`alpha`) or the minimum tree size for testing (`n_min`).
Variations on the standard testing procedure are possible by changing the default parameters of
the call to `shc(..)`.
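To make the two p-values described above more concrete, the following base R sketch computes an empirical and a Gaussian-approximate p-value for a single split using the 2-means cluster index. This is a simplified illustration and not the code used by `shc`: the null Gaussian here simply uses the sample eigenvalues of a low-dimensional dataset, whereas the package estimates the null covariance more carefully. The function `cluster_index` and all variable names are assumptions made for this example.

```r
# 2-means cluster index: within-cluster sum of squares divided by the
# total sum of squares (smaller values indicate stronger clustering).
cluster_index <- function(x) {
  km <- kmeans(x, centers = 2, nstart = 5)
  km$tot.withinss / km$totss
}

set.seed(1)
# Toy data: two well-separated Gaussian clusters in 5 dimensions.
x <- rbind(matrix(rnorm(30 * 5), nrow = 30, ncol = 5),
           matrix(rnorm(30 * 5, mean = 3), nrow = 30, ncol = 5))
ci_obs <- cluster_index(x)

# Simplified null: a single Gaussian with the sample eigenvalues as variances.
# The cluster index is invariant to rotation and translation, so simulating
# with a diagonal covariance is sufficient.
ev <- eigen(cov(x), symmetric = TRUE, only.values = TRUE)$values
n_sim <- 100
ci_null <- replicate(n_sim, {
  z <- matrix(rnorm(nrow(x) * ncol(x)), nrow = nrow(x), ncol = ncol(x))
  cluster_index(sweep(z, 2, sqrt(ev), `*`))
})

# Empirical p-value: fraction of null cluster indices at least as extreme.
p_emp <- mean(ci_null <= ci_obs)
# Gaussian-approximate p-value: fit a normal distribution to the null indices.
p_norm <- pnorm(ci_obs, mean = mean(ci_null), sd = sd(ci_null))
c(p_emp = p_emp, p_norm = p_norm)
```

With only `n_sim` simulated datasets, the empirical p-value can only take values in multiples of `1 / n_sim`, which is why the Gaussian approximation gives more continuous p-values.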
### <a name="newmetric"></a>Explicitly specifying a dissimilarity function
The method also supports specifying your own metric function through the `vecmet` and `matmet`
parameters. Only one of `vecmet` and `matmet` should be specified. If either is specified, the
`metric` parameter will be ignored. The `vecmet` parameter should be passed a function which takes
two vectors as input and returns the dissimilarity between the two vectors. The `matmet` parameter
should be passed a function which takes a matrix as input and returns a `dist` object of
dissimilarities of the matrix rows.
The `vecmet` example is not actually run in this tutorial since it is __incredibly__
computationally expensive. Internally, the function passed to `vecmet` is wrapped in the
following call to `outer` to compute dissimilarities between all rows of a matrix.
```{r, eval=FALSE}
as.dist(outer(split(x, row(x)), split(x, row(x)), Vectorize(vecmet)))
```
The following simple benchmarking example with `cor` illustrates the overhead for
using `outer` to call on a vector function rather than using an optimized matrix
dissimilarity function.
```{r}
vfun <- function(x, y) {1 - cor(x, y)}
mfun1 <- function(x) {
    as.dist(outer(split(x, row(x)), split(x, row(x)),
                  Vectorize(vfun)))
}
mfun2 <- function(x) { as.dist(1 - cor(t(x))) }
system.time(mfun1(data))
system.time(mfun2(data))
```
The first matrix correlation function, `mfun1`, is written as it
would be processed if `vfun` were passed to `shc` as `vecmet`. The second function,
`mfun2`, is a function that could be passed to `matmet`. The performance difference is
substantial.
When specifying a custom dissimilarity function for `shc`, it is important to
remember that the function must be used to compute dissimilarity matrices `n_sim` times
for __each node__. In our toy example where `n_sim = 100` and `n = 150`, this means
calling on the dissimilarity function >10,000 times.
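The count above can be checked with a quick back-of-the-envelope calculation. The sketch below, using only base R, computes the upper bound of `n_sim` calls at each of the `n - 1` nodes, and also counts how many nodes of a Ward's linkage dendrogram contain at least `n_min` observations. The variable names are chosen for this illustration, and the actual number of calls made by `shc` may be lower than the upper bound if small nodes are skipped.

```r
set.seed(1508)
n <- 150; p <- 100; n_sim <- 100; n_min <- 10
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
hc <- hclust(dist(x), method = "ward.D2")

# Number of observations below each of the n - 1 internal nodes.
# In hc$merge, negative entries are single observations and positive
# entries refer to earlier rows (previously formed clusters).
node_size <- integer(n - 1)
for (i in seq_len(n - 1)) {
  kids <- hc$merge[i, ]
  node_size[i] <- sum(ifelse(kids < 0, 1L, node_size[pmax(kids, 1)]))
}

upper_bound <- (n - 1) * n_sim          # one set of simulations per node
n_large <- sum(node_size >= n_min)      # nodes with at least n_min observations
calls_large <- n_large * n_sim
c(upper_bound = upper_bound, calls_large = calls_large)
```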
Our custom function, `mfun2`, can be passed to `shc` through the `matmet` parameter.
```{r}
shc_mfun2 <- shc(data, matmet=mfun2, linkage="average")
data.frame(result = head(shc_mfun2$nd_type),
           round(head(shc_mfun2$p_norm), 5),
           round(head(shc_mfun2$p_emp), 5))
```
Since the toy dataset is simulated with all differentiating signal lying in the
first two dimensions, Pearson correlation-based clustering does a poor job at
distinguishing the clusters, and the resulting p-values show weak significance.
### <a name="pearson"></a> Using Pearson correlation
As a shortcut, without having to specify `matmet`, if testing using `(1 - cor(x))` is desired,
the following specification can be used.
```{r, eval=FALSE}
data_pearson <- shc(data, metric="cor", linkage="average", null_alg="2means")
```
The result will be equivalent to applying the original `sigclust` hypothesis test described
in [Liu et al. 2008](#refs) at each node along the dendrogram.
### <a name="fwerstopping"></a> Testing with FWER stopping
By default, p-values are calculated at all nodes along the dendrogram with at least `n_min`
observations (default `n_min = 10`). The package includes a FWER controlling procedure which
proceeds sequentially from the top node such that daughter nodes are only tested if
FWER-corrected significance was achieved at the parent node. To reduce the total number of tests
performed, set `alpha` to some value less than `1`.
```{r}
shc_fwer <- shc(data, metric="euclidean", linkage="ward.D2", alpha=0.05)
```
The FWER is noted in the summary of the resulting `shc` object, and can be seen in the `nd_type`
attribute, where most tests are now labeled `no_test` (with `p_norm` and `p_emp` values of 2).
```{r}
data.frame(result = head(shc_fwer$nd_type, 10),
           round(head(shc_fwer$p_norm, 10), 5),
           round(head(shc_fwer$p_emp, 10), 5))
```
By default, `p_norm` p-values are used to test for significance against the FWER cutoffs,
but `p_emp` can be used by specifying `ci_emp = TRUE`.
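The sequential logic of the FWER procedure can be sketched on a small hypothetical tree. The node sizes and p-values below are invented for illustration, and the per-node cutoff `alpha * (size - 1) / (n - 1)` is an assumption based on our reading of Kimes et al. (2017), not code taken from the package. Nodes are listed from the root down, so that every parent appears before its children.

```r
# Hypothetical dendrogram: node 1 is the root, `parent` gives the parent node,
# `size` the number of observations at the node, and `p` a made-up p-value.
nodes <- data.frame(
  node   = 1:7,
  parent = c(NA, 1, 1, 2, 2, 3, 3),
  size   = c(150, 100, 50, 60, 40, 30, 20),
  p      = c(1e-6, 1e-4, 0.40, 0.20, 0.03, 0.001, 0.50)
)

alpha <- 0.05
n <- nodes$size[1]
# Assumed cutoff: alpha scaled by the relative size of each node.
nodes$cutoff <- alpha * (nodes$size - 1) / (n - 1)

# Test top-down: a node is only tested if it is the root or its parent was significant.
nodes$status <- "no_test"
for (i in seq_len(nrow(nodes))) {
  pa <- nodes$parent[i]
  if (is.na(pa) || nodes$status[pa] == "sig") {
    nodes$status[i] <- if (nodes$p[i] < nodes$cutoff[i]) "sig" else "not_sig"
  }
}
nodes
```

In this toy tree, node 5 has a p-value below `alpha` but above its own, smaller cutoff, and nodes 6 and 7 are never tested because their parent was not significant.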
### <a name="pearson"></a> Performing tests with multiple indices
The `shc` function allows for testing along the same dendrogram simultaneously using
different measures of strength of clustering.
For example, it is possible to simultaneously test the above example using both the 2-means
cluster index and the linkage value as the measure of strength of clustering.
```{r}
data_2tests <- shc(data, metric="euclidean", linkage="ward.D2",
                   ci=c("2CI", "linkage"),
                   null_alg=c("hclust", "hclust"))
round(head(data_2tests$p_norm), 5)
```
The results of clustering using `hclust_2CI` and `hclust_linkage` are reported in the columns
of the analysis results. The relative performance of a few of these different combinations is
described in the [corresponding manuscript](#refs) when using Ward's linkage clustering.
When `alpha < 1` is specified, the additional `ci_idx` parameter specifies the index of the test
that should be used when trying to control the FWER.
## <a name="plot"></a> Plotting
While looking at the p-values is nice, plots are always nicer than numbers. A nice way to
see the results of the SHC procedure is simply to call `plot` on the `shc` class object
created using the `shc(..)` constructor.
```{r, fig.width=12, fig.height=4}
plot(shc_result, hang=.1)
```
The resulting plot shows significant nodes and splits in red, as well as the corresponding p-values.
Nodes which were not tested, as described earlier, are marked in either green or teal (blue).
### <a name="diagnostics"></a> Diagnostic plots
Several types of diagnostic plots are implemented for the SHC method. These are available through the
`diagnostic` method. Since testing is performed separately at each node along the dendrogram, diagnostic
plots are also generated per-node. The set of nodes for which diagnostic plots should be generated
is specified with the `K` parameter. The default is to only generate plots for the root node, `K = 1`.
The method currently supports four types of diagnostic plots: `background`, `qq`, `covest`, `pvalue`.
The desired plot type is specified to the `pty` parameter as a vector of strings. To create all four
plots, simply specify `all`, which is also the default value.
If the length of `K` is greater than 1 or more than one plot type is specified, the method will
write the plots to a PDF file, `fname.pdf`, where `fname` is an input parameter that can be specified
by the user.
The `background` plot will return a jitter plot of the matrix entries, as well as a smooth kernel
density estimate and best-fit Gaussian approximation used in estimating the background
noise level.
```{r}
diagnostic(shc_result, K=1, pty='background')
```
The `qq` plot provides the corresponding Quantile-Quantile plot from the background noise estimating
procedure.
```{r}
diagnostic(shc_result, K=1, pty='qq')
```
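For intuition, a rough base R analogue of the `background` and `qq` plots is sketched below. It assumes that the background noise level is estimated robustly from the median absolute deviation (MAD) of all matrix entries, as we understand the original `sigclust` approach of Liu et al. (2008) to do. This is not the code behind `diagnostic`, and the variable names are chosen for this example only.

```r
set.seed(1508)
# Mostly pure noise, with a mean shift for some samples in one dimension.
dat <- matrix(rnorm(150 * 100), nrow = 150, ncol = 100)
dat[1:60, 1] <- dat[1:60, 1] + 2

entries <- as.vector(dat)
center <- median(entries)
# mad() already rescales by 1.4826, so it estimates the SD under normality.
sigma_mad <- mad(entries)

op <- par(mfrow = c(1, 2))
# Kernel density estimate of the entries with the fitted Gaussian overlaid.
plot(density(entries), main = "Matrix entries", xlab = "value")
curve(dnorm(x, mean = center, sd = sigma_mad), add = TRUE, lty = 2)
# Q-Q plot of the entries against the normal distribution.
qqnorm(entries, main = "Q-Q plot of entries")
qqline(entries)
par(op)
```

Because only a small fraction of the entries carry signal, the robust estimate stays close to the true noise standard deviation of 1.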
The `covest` plot shows the estimated eigenvalues of the null Gaussian distribution along with the sample
eigenvalues of the original data matrix.
```{r}
diagnostic(shc_result, K=1, pty='covest')
```
The `pvalue` plot shows the cluster index for the original data along with the distribution of
simulated cluster indices used to determine the reported empirical (Q) p-value. Additionally, the
best-fit Gaussian approximation to the cluster index distribution used to compute the Gaussian-approximate
(Z) p-value is overlaid in black.
```{r}
diagnostic(shc_result, K=1, pty='pvalue')
```
## <a name="refs"></a> References
* ___Kimes PK___, Liu Y, Hayes DN, and Marron JS. (2017). "Statistical significance
for hierarchical clustering." _Biometrics_.
* Huang H, Liu Y, Yuan M, and Marron JS. (2015). "Statistical significance of
clustering using soft thresholding."
_Journal of Computational and Graphical Statistics_.
* Liu Y, Hayes DN, Nobel A, and Marron JS. (2008). "Statistical significance of
clustering for high-dimension, low–sample size data."
_Journal of the American Statistical Association_.
## <a name="sessioninfo"></a> Session Information
```{r}
sessionInfo()
```