-
Notifications
You must be signed in to change notification settings - Fork 10
Home
Derek Chiu, Johnson Liu, Aline Talhouk
With the explosion of various high throughput genomic data, a general question facing clinicians, biologists and informaticians is how to organize large and complex datasets into meaningful and useful structures.
Cluster analysis is a popular statistical learning technique with the general purpose of identifying distinct groups of objects characterized by certain measurements1. This technique is considered unsupervised because the researcher does not have a priori knowledge of the structure underlying the data. As such, ascertaining the results can be challenging, since the researcher does not have knowledge of the ground truth. Furthermore, cluster analysis can be achieved by a variety of clustering algorithm with different parameter settings that may potentially yield very different solutions.
In the context of genomics, clustering can be used with data generated from various assays extending from microarray or RNA sequencing (RNA-seq) expression data to proteomics and methylation data. This technique can be applied to distinguish between different disease phenotypes and to identify potentially valuable disease biomarkers. In medical and cancer research, clustering is particularly useful to create taxonomies of disease and to identify homogenous patient sub-populations, facilitating the delivery and assessment of targeted therapy. Additionally, clustering can improve the scientific understanding of underlying biological mechanisms. It has been used to partition samples, genomic features, or both, allowing better functional understanding of co-regulated genes. Cluster analysis is data agnostic; it has been used in many other diverse applications such as image analysis such as radiomics2, in clustering lifestyle risk factors3, biological organization of species, segmentation of a business’ customers, or grouping of products for recommender systems, and so on.
The result of a cluster analysis is either a hard (crisp) cluster assignment or a soft (fuzzy) assignment, with an associated measure representing the degree of membership of an object to a cluster. Multiple clustering solutions might be suitable for the same data collection, and ultimately, the ascertainment of a final clustering is largely dependent on what it will be used for4. For new discoveries, one may be interested in considering a diverse set of solutions to the problem in order to explore potential merits of each. However, the evaluation and the qualitative comparison of different cluster assignments, including the number of clusters, is challenging in absence of ground truth. Many validation criteria have been proposed5,6. Most validity measures typically target hard cluster assignment. These are designed to evaluate desirable criteria such as the homogeneity within each cluster, how often neighbouring items are assigned to the same cluster, and how separated elements within each cluster are from elements of another cluster. In addition, we often care about the stability of the clustering, where the cluster assignment does not significantly change in the presence of new items or new features. Different indices may measure one or more of these criteria, but there isn’t a universal index that can completely characterize a clustering and the performance of these indices is dependent on the level of noise in the data and the degree of overlap between the clusters6. In addition, it has been shown that individual indices may suffer from biases5 and a combination of indices may be more suitable.
There are many clustering algorithms and they each work differently, resulting in potentially different cluster assignments. Currently, most researchers may attempt clustering with different algorithms, plotting the resulting heatmaps and selecting the algorithm that is considered to provide the “best fit”. This process is dangerously subjective, since researchers may select algorithms that reinforce their preconceptions, inhibiting the discovery of novel structures. While it is important to interpret the final clustering in light of biological understanding, cluster validation step should be unsupervised as well. Generally, if the aim is class discover, the researcher does not know a priori what the underlying structure of the data is and the interest may be in characterizing the landscape of the data. If the actual data distribution is different from the generative model assumed by the algorithm used for clustering, even with a `correct' guess at the number of clusters, its performance can be worse than a random partitioning of the data. This is particularly true when the data under consideration is complex and inherently multi-dimensional.
Cluster ensembles provide an approach to combining multiple clusterings of a set of objects into a single consolidated clustering, often referred to as the consensus solution. Ensemble approaches consist of a two step procedure the first involved generating many sets of cluster assignments of cases in a dataset and integrating them into a final clustering (ig. 1).
Since no single classifier algorithm can be superior to others under all circumstance, an ensemble clustering approach can be used to generate many possible solutions and assessing those and combining them to obtain a final more robust and stable clustering compared to a single clustering algorithm. Another advantage of this approach is that it allows distributed computing, which may useful under privacy or sharing constraints, as well as the reuse of existing knowledge by combining cluster assignments from different data sources on the same cases7. This can eventually be extended beyond genomic data to incorporate clinical, lifestyle and environmental data and can truly pave the way to personalized medicine.
There is strong evidence that ensemble approaches for clustering produce better results than haphazardly choosing a single algorithm8. Though, the real strength of an ensemble clustering method is that it allows for automation and emphases the evidence accumulation aspect of research. Cluster ensembles hold a lot of promise and are an active area of research in the data mining community. A recent paper, exposed some weaknesses in the current application of consensus clustering9 and the properties and limitations of ensemble clustering algorithms have not been thoroughly investigated in various data structures10. A desirable feature would be the ability to assess the reliability of the resulting clustering solution(s). Another obstacle currently is the lack of accessibility to clinicians and researchers without extensive machine learning and statistical expertise. The current lack of quantitative assessment and reporting of novel clustering algorithms and reliance on visual displays of the most discriminative genes can be deceptive9. Perhaps a more automatic or semi-automatic approach to clustering is needed.
In this project we propose to address the current limitations of cluster ensembles, by providing an accessible automated framework to generate ensemble solutions and assess their performance using an easy to interpret global index to evaluate and validate their fit to the data.
1 Perform a benchmark study to assess the performance of different ensemble clustering approaches under different settings and evaluate their functionality and limitations.
A benchmark study was performed to test the different ways of generating an ensemble algorithm currently proposed on different simulated data sets, where the true class labels will be known, as well as well-annotated cancer data sets.
Different data structures will be simulated with different numbers of features, several degrees of separation and including a different number of noisy variables and outliers11.
As a first step, we will generate multiple partitioning of the data using different algorithms and perturbations to encourage diversity and challenge stability. This will result in a large library of clusterings for each case. The diversity of the data can be achieved by using different clustering algorithms8; or the same algorithm with different initialization parameters. Diversity can also be obtained by selecting with or without replacement a subset of objects or cases such as in object distributed clustering (ODC) or by selecting a subset of features in features distributed clustering (FDC)12. Projections to different subspaces of the data as well as applying data transformation can be used to encourage a diverse set of solutions. Generally speaking, the more diverse the solution the more information about the data is captured. This is particularly important in a discovery/exploratory phase when very little information is known about the data a priori. While expert’s experience can be useful here, a diverse solution would ensure a lack of bias an opportunity for discovery.
The second step would consist of combining the diverse results using a consensus function. There are many methods used to obtain a consensus function10. We will consider a number of popular approaches to combine the solutions in our study8,10,13,14. We will compute the different validation indices and compare the cluster ensemble approaches to the simple clustering algorithms using methods similar to5,6.
The type of knowledge synthesis that would be provided by such a benchmark study is what is currently needed to ensure the proper application of the ensemble clustering techniques in the context of genomic data and to avoid biased solutions.
**2 Develop and test a validation strategies based on a combination of internal validation indices, and use these indices to validate the results obtained in aim 1. **
Several criteria have been proposed in the literature as a measure of validation for each cluster, however no single measure completely captures all the desirable features one is interested in in an algorithm. These indices measure the quality of the clustering, the diversity of each clustering solution in addition they assess the compactness, separability, connectivity, robustness, and stability of the clusters.
We will use the data that was generated in aim 1. For each generated partitioning, a large number of internal validation indices would have been computed. We will project the resulting multidimensional matrix to a lower dimensional subspace using parameteric methods such as PCA or non-parametric methods such XX or XX. Based on this we will obtain a single score associated with each clustering method.
Select the algorithm or a weighted combination of algorithms the score computed in (1) to optimize the pca score in (3) and assign a final clustering. come up with a confidence level with respect to each cluster assignment (not sure how this can be done yet, I have to think more about it)
We will compare these approaches to simulated data, where we would know the ground truth and to popular publicly available, well-studied data sets in addition to data available in our own center. In addition to providing comparison with existing clustering (and external criteria), we will implement a comparative framework to implement and investigate several popular clustering approaches on a variety of datasets.
3 Develop and implement a semi-automated approach to ensemble clustering based on optimization of a summary index score.
The algorithm we propose can take as input the raw data or clusterings. In other words, researchers can input several clusterings of the data that they may have generated elsewhere using a variety of different datasets and the algorithm will combine those using a consensus function, optimizing and selecting solutions that provide the most diversity and leaving out unstable clusterings and redundancies. Another nice feature of the approach we are proposing is that as new and novel methodologies immerge, they can be incorporated and compared to the existing solutions. This means that the clustering is taken in context with the existing knowledge. Finally, the potential automatic or semi-automatic nature of the proposed method allows for an automatic or a semi-automatic solution that does not require a great deal of expertise in implementation of understanding of the underlying algorithm. This means that a solid implementation of the approach, with an easy to use interface, would be well received from the clinical community. To disseminate this approach, we will
We will build an open-source bioconductor package that can be used by statisticians and informaticians who wish to use and incorporate this approach into their research. The approach will implement parallel computing feature to speed up computation. The algorithm will be available in three different ways: (i) An R package that can be downloaded from Bioconductor (ii) On a local server with a web-interface to which users may upload data and the computation will be done locally (iii) an easy-to use package with a point and click user-interface (e.g. Shiny15) that the researcher can download and run on their own computer.
- Hennig, Chritstian, Meila, Marina, Murtagh, Fionn RR. Handbook of Cluster Analysis - CRC Press Book. Available at: https://www.crcpress.com/Handbook-of-Cluster-Analysis/Hennig-Meila-Murtagh-Rocci/9781466551886. Accessed January 31, 2016.
- Aerts HJWL, Velazquez ER, Leijenaar RTH, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun. 2014;5:4006. doi:10.1038/ncomms5006.
- Schuit AJ, van Loon AJM, Tijhuis M, Ocké MC. Clustering of Lifestyle Risk Factors in a General Adult Population. Prev Med (Baltim). 2002;35(3):219–224. doi:10.1006/pmed.2002.1064.
- Song Q, Merajver SD, Li JZ. Cancer classification in the genomic era: five contemporary problems. Hum Genomics. 2015;9(1):27. doi:10.1186/s40246-015-0049-8.
- Handl J, Knowles J, Kell DB. Computational cluster validation in post-genomic data analysis. Bioinformatics. 2005;21:3201–3212. doi:10.1093/bioinformatics/bti517.
- Arbelaitz O, Gurrutxaga I, Muguerza J, Pérez JM, Perona I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013;46(1):243–256. doi:10.1016/j.patcog.2012.07.021.
- Hoadley K a., Yau C, Wolf DM, et al. Multiplatform Analysis of 12 Cancer Types Reveals Molecular Classification within and across Tissues of Origin. Cell. 2013;158(4):929–944. doi:10.1016/j.cell.2014.06.049.
- Strehl A, Ghosh J. Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions. J Mach Learn Res. 2002;3:583–617. doi:10.1162/153244303321897735.
- Șenbabaoğlu Y, Michailidis G, Li JZ. Critical limitations of consensus clustering in class discovery. Sci Rep. 2014;4:6207. doi:10.1038/srep06207.
-
VEGA-PONS S, RUIZ-SHULCLOPER J. A SURVEY OF CLUSTERING ENSEMBLE ALGORITHMS. Int J Pattern Recognit Artif Intell. 2011;25(03):337–372. Available at: http://www.worldscientific.com/doi/abs/10.1142/S0218001411008683. Accessed May 24, 2015.
-
Qiu W, Joe H. Generation of Random Clusters with Specified Degree of Separation. J Classif. 2006;23(2):315–334. doi:10.1007/s00357-006-0018-y.
-
Monti S, Tamayo P, Mesirov J, Golub T. Consensus clustering: A resampling based method for class discovery and visualization of gene expression microarray data. Mach Learn. 2003;52(i):91–118.
-
Ghosh J, Acharya A. Cluster ensembles. Wiley Interdiscip Rev Data Min Knowl Discov. 2011;1(4):305–315. doi:10.1002/widm.32.
-
Ayad HG, Kamel MS. On voting-based consensus of cluster ensembles. Pattern Recognit. 2010;43(5):1943–1953. Available at: http://www.sciencedirect.com/science/article/pii/S0031320309004312. Accessed May 24, 2015.
-
RStudio Inc. Easy web applications in R. 2013. Available at: http://www.rstudio.com/shiny/. Accessed January 27, 2016.