Skip to content

Optimal partitioning algorithm for changepoint detection

Anuraag Srivastava edited this page Mar 9, 2019 · 10 revisions

Background

The Optimal partitioning algorithm, arXiv paper, computes the solution to the penalized changepoint problem (best model for a given non-negative real-valued penalty parameter) in quadratic time. It is relatively straightforward to code in C, but there is no R package that provides a reference implementation.

Related work

There are many other R packages that compute optimal changepoint models.

  • changepoint: provides the cpt.mean, … functions which compute the optimal solution to the penalized problem via the PELT algorithm, which is log-linear time complexity.
  • Fpop::fpop computes the optimal solution to the penalized problem via the FPOP algorithm, which is also log-linear time complexity.

Details of your coding project

Code a new R package opart which implements the optimal partitioning algorithm.

  • opart::opart_gaussian(data.vec, penalty) should compute the optimal changepoint model for a vector of real-valued data and a non-negative real-valued penalty, given the square loss (to minimize) / gaussian likelihood (to maximize).
  • it should call a C++ function int opart_gaussian(int n_data, double *data_ptr, double penalty, double *cost_ptr, int *end_ptr), where
    • return value is an error status code (0 for success).
    • n_data is the number of observations.
    • data_ptr is a pointer to those data values.
    • penalty is the non-negative penalty constant.
    • cost_ptr is a pointer to the optimal cost values of the best models from 1 to n_data.
    • end_ptr is a pointer to the optimal segment ends.
      • there are maximimally n_data segments, in which case end_ptr -> [ 0 1 2 … n_data-1 ].
      • there is minimally 1 segment, in which case end_ptr -> [ n_data-1 -2 -2 … -2 ], where -2 is a placeholder value indicating that segment is not used in the optimal model.
  • other functions that compute other optimal models if there is time.
  • documentation with informative examples.
  • tests for (1) correct outputs given valid inputs, and (2) informative error messages given invalid inputs.
  • vignette comparing speed of opart::opart_gaussian with changepoint::cpt.mean and Fpop::fpop – plot computation time as a function of data set size.

Expected impact

This R package will benefit the changepoint community by providing an efficient C++ reference implementation of the standard Optimal Partitioning algorithm that has been missing in R. Its simplicity makes it easy to modify (which is not true of other algorithms such as FPOP), which will be useful when developing other changepoint models/algorithms.

Mentors

Students, please contact mentors below after completing at least one of the tests below. Toby Hocking <[email protected]> and Guillem Rigaill <[email protected]> are the authors of several R packages for changepoint detection including Fpop.

Tests

Students, please do one or more of the following tests before contacting the mentors above.

  • Easy: run either changepoint::cpt.mean or Fpop::fpop on one of the data sets (vector of logratio values for a given profile.id/chromosome combination) in neuroblatoma$profiles from data(neuroblastoma, package=”neuroblastoma”). For one penalty parameter, plot the data as black points and the optimal segment means as horizontal green line segments.
  • Medium: for two data sets and three penalty parameters, plot the data and optimal models using a ggplot with facet_grid(segments ~ profile.id + chromosome)
  • Hard: use microbenchmark to compute the timings of changepoint::cpt.mean and Fpop::fpop of several neuroblastoma data sets of different sizes (or all of them). Plot the computation time versus data set size.

Solutions of tests

Students, please post a link to your test results here.

Name : Divyansh Chawla

Email : [email protected], [email protected]

University : Indian Institute of Technology, Kanpur

Course : Integrated B.S. & M.S. in Economic Sciences (Majorly Econometrics)

Solution to Easy Test : Easy

__________________________________________________

Name: Garrett Tetrault

Email: [email protected], [email protected]

University: Cornell University, Ithaca NY

Program: Mathematics w/ concentration in Computer Science

Solution to Tests: Easy, Medium, Hard

__________________________________________________

Name: Anuraag Srivastava

Email: [email protected]

University: Northern Arizona University, Flagstaff AZ

Program: Masters of Science in Computer Science

Solution to Tests: Easy,Medium,Hard

Clone this wiki locally