Skip to content

Dynamic Visualization and Fast Computation for Convex Clustering

Michael Weylandt edited this page Mar 18, 2019 · 1 revision

clustRviz: Dynamic Visualization and Fast Computation for Convex Clustering in R

Background

Clustering -- the study of finding sub-populations in unlabelled data -- is a key technique in applied statistics and data science. While many algorithms for clustering have been proposed, they typically lack rigorous theoretical and algorithmic foundations. The recently proposed convex fusion formulation of clustering (``convex clustering'') allows for rigorous optimization and statistical guarantees, but at a significantly higher computational cost than traditional approaches. To address this, Weylandt, Nagorski, and Allen [1] have proposed an efficient algorithm for calculating high-quality approximations of the convex clustering path. These methods have been implemented in the R package `clustRviz` [2]. (See Section 1 and Appendices D and F of [1] for more details on convex clustering and comparisons with other approaches.)

This project will significantly enhance the functionality of the clustRviz package in the following ways:

  1. Improved algorithms for Convex Bi-Clustering using a recently proposed Generalized ADMM scheme [3].
  2. Support for missing data using a one-step algorithmic-regularization variant of the MM scheme proposed by Chi, Allen, and Baraniuk [4]. (The math for this is unpublished, but a straightforward extension of [1].)
  3. Improved interactive graphics based on the plotly and gganimate packages.
  4. [Possible] Support for variable selection using the sparse convex clustering approach of Wang et al. [5]

This project will focus on either the computational or graphical aspects of clustRviz depending on the student's interests and abilities.

Related work

Convex clustering and bi-clustering are available using in the cvxclustr and cvxbiclustr packages (available on CRAN) [6,7]. These packages, however, are somewhat 'low-level' and do not provide the efficient pathwise algorithms of clustRviz [2]. The scvxclustr package, previously on CRAN, provided sparse convex clustering.

Coding Project Details

  • Improved Bi-Clustering (Estimated time: 2 Weeks)

    Implement the improved convex bi-clustering algorithms proposed in [3].

    As part of this work, the internal structure of the clustRviz package may be refactored to take advantage of the simpler structure of the algorithms in [3] (as compared to those in [4] and Appendix C of [1].)

  • Missing Data Support (Estimated time: 2 Weeks)

    Add missing data support to clustRviz.

    The algorithmic changes for this stage are quite straight-forward (requiring only a minor update to the primal update steps) but missing data support will require downstream modifications to the visualization components.

  • Dynamic Visualization and Documentation (Estimated time: 5 Weeks)

    Improvements to the visualizations provided by clustRviz.

    The "dynamic" visualizations currently available in clustRviz are based primarily on the shiny package, which displays static graphics sequentially. True dynamic visualization, based on the gganimate or plotly packages, would provide smoother and more elegant visualizations, as well as allowing convex clustering results to be shared without a running R process.

  • Additional Work / CRAN Preparedness (Estimated time: 3 Weeks)

    If time allows, additional features such as sparse convex clustering may be added, based on the interests of the student and the mentors. At the end of the summer, time will be dedicated to preparing clustRviz for its initial CRAN release.

Expected Impact

These updates to clustRviz will significantly enhance its functionality and usability, helping to cement clustRviz's as an efficient, robust, and user-friendly package for convex clustering. The ability to cluster missing data and to embed dynamic (movie-like) visualizations in PDF and HTML documents of data clustering will be particularly useful for applied data scientists working with messy data seeking to communicate their results to non-technical audiences.

Mentors

The mentors are authors of the clustRviz package and several papers on convex clustering and bi-clustering.

Required Skills

  • R package development
  • Use of the plotly, ggplot2, gganimate, and shiny packages
  • Coding with C++ and Rcpp [optional but strongly desired]

Tests

Easy

Submit a pull-request to the clustRviz repo. The issues tagged "good first issue" should be self-contained and relatively easy to address: https://github.com/DataSlingers/clustRviz/labels/good%20first%20issue The mentors will work with you to refine and merge your pull request.

Advanced

  1. Take the rough prototype of a dynamic heatmap from https://github.com/talgalili/heatmaply/issues/207 and fix the transitions / rendering issues.

  2. Modify the current CBASS code to support the Generalized ADMM scheme for convex bi-clustering proposed in [3]. Note that, because the G-ADMM, does not require iteratively solving sub-problems, it should be possible to significantly simplify the code in src/optim_policies.h.

Please contact Michael Weylandt ([email protected]) with questions about the tests.

Solutions of Tests

Students, please add a link to your solutions below. Links to pull requests on the clustRviz will suffice for the "easy" tests.


References

[1] M. Weylandt, J. Nagorsi, and G. I. Allen. "Dynamic Visualization and Fast Computation for Convex Clustering via Algorithmic Regularization." ArXiv Pre-Print 1901.01477 available at https://arxiv.org/abs/1901.01477.

[2] https://github.com/DataSlingers/clustRviz

[3] M. Weylandt. "Splitting Methods for Convex Bi-Clustering and Co-Clustering." ArXiv Pre-Print 1901.06075 available at https://arxiv.org/abs/1901.06075.

[4] E. C. Chi, G. I. Allen, and R. G. Baraniuk. "Convex Biclustering." Biometrics 73(1), pp.10-19. 2017. DOI: https://doi.org/10.1111/biom.12540

[5] B. Wang, Y. Zhang, W. W. Sun, and Y. Fang. "Sparse Convex Clustering." Journal of Computational and Graphical Statistics 27(2), p.393-403. 2018. DOI: https://doi.org/10.1080/10618600.2017.1377081

[6] https://cran.r-project.org/package=cvxclustr

[7] https://cran.r-project.org/package=cvxbiclustr

Clone this wiki locally