-
Notifications
You must be signed in to change notification settings - Fork 8
Dynamic Visualization and Fast Computation for Convex Clustering
Clustering -- the study of finding sub-populations in unlabelled data
-- is a key technique
in applied statistics and data science. While many algorithms for
clustering have been
proposed, they typically lack rigorous theoretical and algorithmic
foundations. The recently
proposed convex fusion formulation of clustering (``convex
clustering'') allows for rigorous
optimization and statistical guarantees, but at a significantly higher
computational cost
than traditional approaches. To address this, Weylandt, Nagorski, and
Allen [1] have proposed
an efficient algorithm for calculating high-quality approximations of
the convex clustering
path. These methods have been implemented in the R
package
`clustRviz` [2]. (See Section
1 and Appendices D and F of [1] for more details on convex clustering
and comparisons with
other approaches.)
This project will significantly enhance the functionality of the
clustRviz
package in the
following ways:
- Improved algorithms for Convex Bi-Clustering using a recently proposed Generalized ADMM scheme [3].
- Support for missing data using a one-step algorithmic-regularization variant of the MM scheme proposed by Chi, Allen, and Baraniuk [4]. (The math for this is unpublished, but a straightforward extension of [1].)
- Improved interactive graphics based on the
plotly
andgganimate
packages. - [Possible] Support for variable selection using the sparse convex clustering approach of Wang et al. [5]
This project will focus on either the computational or graphical
aspects of clustRviz
depending on the student's interests and abilities.
Convex clustering and bi-clustering are available using in the cvxclustr
and cvxbiclustr
packages (available on CRAN) [6,7]. These packages, however, are somewhat 'low-level' and do not provide the efficient pathwise algorithms of clustRviz
[2]. The scvxclustr
package, previously on CRAN, provided sparse convex clustering.
-
Improved Bi-Clustering (Estimated time: 2 Weeks)
Implement the improved convex bi-clustering algorithms proposed in [3].
As part of this work, the internal structure of the
clustRviz
package may be refactored to take advantage of the simpler structure of the algorithms in [3] (as compared to those in [4] and Appendix C of [1].) -
Missing Data Support (Estimated time: 2 Weeks)
Add missing data support to
clustRviz.
The algorithmic changes for this stage are quite straight-forward (requiring only a minor update to the primal update steps) but missing data support will require downstream modifications to the visualization components.
-
Dynamic Visualization and Documentation (Estimated time: 5 Weeks)
Improvements to the visualizations provided by
clustRviz.
The "dynamic" visualizations currently available in
clustRviz
are based primarily on theshiny
package, which displays static graphics sequentially. True dynamic visualization, based on thegganimate
orplotly
packages, would provide smoother and more elegant visualizations, as well as allowing convex clustering results to be shared without a runningR
process. -
Additional Work / CRAN Preparedness (Estimated time: 3 Weeks)
If time allows, additional features such as sparse convex clustering may be added, based on the interests of the student and the mentors. At the end of the summer, time will be dedicated to preparing
clustRviz
for its initial CRAN release.
These updates to clustRviz
will significantly enhance its functionality and usability, helping to cement clustRviz
's as an efficient, robust, and user-friendly package for convex clustering. The ability to cluster missing data and to embed
dynamic (movie-like) visualizations in PDF and HTML documents of data clustering will be particularly useful for applied data scientists working with messy data seeking to communicate their results to non-technical audiences.
-
Michael Weylandt ([email protected])
Department of Statistics, Rice University
-
Genevera Allen ([email protected])
Departments of Statistics, CS, and ECE, Rice University
Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children's Hospital
The mentors are authors of the clustRviz
package and several papers on convex clustering and bi-clustering.
-
R
package development - Use of the
plotly
,ggplot2
,gganimate
, andshiny
packages - Coding with
C++
andRcpp
[optional but strongly desired]
Submit a pull-request to the clustRviz
repo.
The issues tagged "good first issue" should be self-contained and
relatively easy to address:
https://github.com/DataSlingers/clustRviz/labels/good%20first%20issue
The mentors will
work with you to refine and merge your pull request.
-
Take the rough prototype of a dynamic heatmap from https://github.com/talgalili/heatmaply/issues/207 and fix the transitions / rendering issues.
-
Modify the current
CBASS
code to support the Generalized ADMM scheme for convex bi-clustering proposed in [3]. Note that, because the G-ADMM, does not require iteratively solving sub-problems, it should be possible to significantly simplify the code insrc/optim_policies.h
.
Please contact Michael Weylandt ([email protected]) with questions about the tests.
Students, please add a link to your solutions below. Links to pull requests on the clustRviz
will suffice for the "easy" tests.
[1] M. Weylandt, J. Nagorsi, and G. I. Allen. "Dynamic Visualization and Fast Computation for Convex Clustering via Algorithmic Regularization." ArXiv Pre-Print 1901.01477 available at https://arxiv.org/abs/1901.01477.
[2] https://github.com/DataSlingers/clustRviz
[3] M. Weylandt. "Splitting Methods for Convex Bi-Clustering and Co-Clustering." ArXiv Pre-Print 1901.06075 available at https://arxiv.org/abs/1901.06075.
[4] E. C. Chi, G. I. Allen, and R. G. Baraniuk. "Convex Biclustering." Biometrics 73(1), pp.10-19. 2017. DOI: https://doi.org/10.1111/biom.12540
[5] B. Wang, Y. Zhang, W. W. Sun, and Y. Fang. "Sparse Convex Clustering." Journal of Computational and Graphical Statistics 27(2), p.393-403. 2018. DOI: https://doi.org/10.1080/10618600.2017.1377081