---
title: "Bayesian changepoint detection with R and Stan"
author: "Gabriele Modena"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
set.seed(42)
```
**Update 2017-03-19**: added a reference in the Conclusion about improving runtime performance by transforming parameters in linear time (see [https://github.com/gmodena/bayesian-changepoint/issues/1](https://github.com/gmodena/bayesian-changepoint/issues/1)).
In this document, I'll review a Bayesian approach to detecting a single changepoint in a time series. I'll implement a model using [stan](http://mc-stan.org) and show how to interpret its output in R using the [rstan](http://mc-stan.org/interfaces/rstan) package. The code for this article can be found at [https://github.com/gmodena/bayesian-changepoint](https://github.com/gmodena/bayesian-changepoint).
## The frequentist approach
Let $D = \{d_1, ..., d_n\}$ be a time series with $n$ *normally distributed* data points. To determine whether a change of parameters $\mu$ and $\sigma$ at point $\tau$ is significant, one possible approach is to perform a likelihood ratio test. In this setting we run a hypothesis test where the null hypothesis $H_0$ is *there is no changepoint in $D$*, and the alternative hypothesis $H_1$ is *there is at most one changepoint at $\tau \in D$*[^1].
More formally, $H_0$ says that $\forall n$ we have $d_n \sim \mathcal{N}(\mu, \sigma)$,
whereas $H_1$ says that $\forall n$
$$d_n \sim \begin{cases} \mathcal{N}(\mu_1, \sigma_1) & \mbox{if } n \lt \tau \\ \mathcal{N}(\mu_2, \sigma_2), & \mbox{if } n \geq \tau \end{cases}$$
Under $H_0$, the log-likelihood of $D$ is $\log P(D|\mu, \sigma)$. Under $H_1$, the log-likelihood of $D$ is $ll_\tau = \log P(D_{1..{\tau-1}} | \mu_1, \sigma_1) + \log P(D_{\tau..n} | \mu_2, \sigma_2)$.
We then perform maximum likelihood estimation to find the value of $\tau$ that maximizes $ll_{\tau}$, which we use to construct the test statistic $\lambda = 2 [\max_{\tau} ll_{\tau} - \log P(D|\mu, \sigma)]$.
A changepoint has happened iff $\lambda > c$, for a given threshold $c$. What is a good value of $c$? As it turns out, *the appropriate value for this parameter c is still an open research question with several authors devising p values and other information criteria under different types of changes* ([changepoint: An R Package for Changepoint Analysis](https://www.jstatsoft.org/article/view/v058i03)).
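To make this concrete, here is a toy R sketch of the likelihood ratio statistic $\lambda$. It plugs in segment-wise sample estimates of the mean and standard deviation (using `sd`'s $n-1$ denominator rather than the exact MLE) and is meant for illustration only; it is not how the `changepoint` package implements the test.
```{r}
# Toy sketch of the likelihood ratio statistic for a single changepoint.
# Each candidate tau splits D into two segments; each segment is scored
# with a normal log-likelihood at its sample mean and standard deviation.
lambda_statistic <- function(d) {
  loglik <- function(x) sum(dnorm(x, mean = mean(x), sd = sd(x), log = TRUE))
  n <- length(d)
  ll0 <- loglik(d)  # log-likelihood under H0: no changepoint
  # log-likelihood under H1, maximised over candidate changepoints tau
  ll_tau <- sapply(3:(n - 1), function(tau) loglik(d[1:(tau - 1)]) + loglik(d[tau:n]))
  2 * (max(ll_tau) - ll0)
}
```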
## Bayesian changepoint detection
In the Bayesian setting, we assume that the data $D$ is generated by some probability distribution parametrized by $\Theta$. Our goal is to model $P(\Theta|D)$.
From Bayes' rule we know that $P(\Theta|D) = \frac{P(D|\Theta)P(\Theta)}{P(D)}$. $P(\Theta|D)$ is called the *posterior* distribution, $P(D|\Theta)$ is the *likelihood* and $P(\Theta)$ is the *prior*. The core of Bayesian statistics can be summarized as calculating $posterior \propto likelihood \cdot prior$.
A Bayesian data analysis involves the following steps:
1. define a prior distribution that incorporates your beliefs about the data
2. acquire some data
3. use Bayes rule to update the prior distribution given the newly acquired data and calculate the posterior distribution
4. analyse the posterior
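As a toy illustration of these four steps, separate from the changepoint model, here is a grid approximation of the posterior over the mean of a normal distribution with known standard deviation (all numbers below are made up for the example):
```{r}
# Toy grid approximation of P(mu | D) for a normal likelihood with known sd = 1.
mu_grid <- seq(-10, 10, length.out = 1001)
prior <- dnorm(mu_grid, mean = 0, sd = 5)      # step 1: prior beliefs about mu
d_toy <- c(1.7, 2.1, 2.4, 1.9, 2.2)            # step 2: some observed data
lik <- sapply(mu_grid, function(m) exp(sum(dnorm(d_toy, m, 1, log = TRUE))))
posterior <- lik * prior / sum(lik * prior)    # step 3: Bayes' rule, normalised over the grid
mu_grid[which.max(posterior)]                  # step 4: analyse, e.g. the posterior mode
```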
A Bayesian model to detect a single changepoint $\tau$, modeled as a *uniformly distributed* discrete latent parameter, will look like this:
$$ \tau \sim Uniform(1, N) $$
$$ \mu_1 \sim \mathcal{N}(\mu_{\mu_1}, \sigma_{\mu_1}) $$
$$ \sigma_1 \sim \mathcal{N}(\mu_{\sigma_1}, \sigma_{\sigma_1})$$
$$ \mu_2 \sim \mathcal{N}(\mu_{\mu_2}, \sigma_{\mu_2}) $$
$$ \sigma_2 \sim \mathcal{N}(\mu_{\sigma_2}, \sigma_{\sigma_2})$$
$$ d_n \sim \begin{cases} \mathcal{N}(\mu_1, \sigma_1) & \mbox{if } n \lt \tau \\ \mathcal{N}(\mu_2, \sigma_2), & \mbox{if } n \geq \tau \end{cases}$$
In many practical applications, a closed form solution for the posterior is hard to derive (e.g. the normalizing integral $P(D)$ is hard or impossible to calculate), and approximation methods such as [Markov Chain Monte Carlo](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) (MCMC) are required.
In the remainder of this document we'll code this model as a `stan` program, and fit it with MCMC to calculate the posterior probability $P(\tau, \mu_1, \sigma_1, \mu_2, \sigma_2 |D)$.
### Stan
[stan](http://mc-stan.org) is a C++ library that comes with a domain specific modelling language very similar to statistical notation. Stan can help us perform the Bayesian analysis steps, including MCMC inference (and more advanced techniques). The model that I will illustrate in the remainder of this document is an adaptation of the [Coal Mine Disaster](https://pymc-devs.github.io/pymc/tutorial.html) example from [Stan's manual](https://github.com/stan-dev/stan/releases/download/v2.12.0/stan-reference-2.12.0.pdf), Section 12. The full code of this model can be found at [https://gist.github.com/gmodena/0f316232aa2e9f7b6fc76b49f14bfb31](https://gist.github.com/gmodena/0f316232aa2e9f7b6fc76b49f14bfb31).
A stan program is structured in blocks of code that define:
1. the data
2. the parameters of our model
3. the model
4. some transformations of the model output
### Data
In the `data` block we describe what the dataset looks like. We will be modelling a time series `D`, stored as an array of `N` `real` valued elements.
We require `D` to have at least one element (`int<lower=1> N`).
```{markdown}
data {
int<lower=1> N;
real D[N];
}
```
### Parameters
The `parameters` block describes the sampling space.
In our example, we want to model two Gaussians, each with its own `mu` and `sigma`. We constrain `sigma1` and `sigma2` to be positive
(`<lower=0>`), so that we can place a half-normal prior on them later on.
```{markdown}
parameters {
real mu1;
real mu2;
real<lower=0> sigma1;
real<lower=0> sigma2;
}
```
Stan does not allow sampling of discrete parameters, so we will have to reparametrize our model and marginalise out $\tau$. Let's proceed one step at a time.
We know that $P(D|\tau,\mu_1,\sigma_1,\mu_2,\sigma_2)=\frac{P(D,\tau,\mu_1,\sigma_1,\mu_2,\sigma_2)}{P(\tau,\mu_1,\sigma_1,\mu_2,\sigma_2)}$. It follows that the joint probability distribution factors as $P(D,\tau,\mu_1,\sigma_1,\mu_2,\sigma_2)=P(D|\tau,\mu_1,\sigma_1,\mu_2,\sigma_2)P(\tau,\mu_1,\sigma_1,\mu_2,\sigma_2)$.
To marginalize out $\tau$ we'll consider a factorization into likelihood and prior as $P(D,\mu_1,\sigma_1,\mu_2,\sigma_2) = P(D|\mu_1,\sigma_1,\mu_2,\sigma_2)P(\mu_1,\sigma_1,\mu_2,\sigma_2)$.
Then we calculate the likelihood $P(D|\mu_1, \sigma_1, \mu_2, \sigma_2) = \sum_{\tau=1}^N P(\tau, D | \mu_1, \sigma_1, \mu_2, \sigma_2) = \sum_{\tau=1}^N P(\tau) P(D|\tau, \mu_1, \sigma_1, \mu_2, \sigma_2)=\sum_{\tau=1}^N P(\tau) \prod_{n=1}^N P(d_n|\tau, \mu_1, \sigma_1, \mu_2, \sigma_2)$, where $P(D|\tau, \mu_1, \sigma_1, \mu_2, \sigma_2)$ is a product of Gaussians.
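In R, this marginalisation can be sketched for fixed (illustrative) parameter values as follows; it mirrors the Stan loop shown below:
```{r}
# Sketch of the marginalisation with a uniform prior P(tau) = 1/N:
# log_p[tau] = -log(N) plus the log normal densities under the two regimes.
log_p_tau <- function(d, mu1, sigma1, mu2, sigma2) {
  N <- length(d)
  sapply(1:N, function(tau) {
    -log(N) +
      sum(dnorm(d[seq_len(tau - 1)], mu1, sigma1, log = TRUE)) +  # n < tau
      sum(dnorm(d[tau:N], mu2, sigma2, log = TRUE))               # n >= tau
  })
}
```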
The `transformed parameters` block is used to process the `parameters` before calculating the posterior. In the block that follows we marginalise out `tau` and calculate `log_p(D | mu1, sd1, mu2, sd2)`. Since `stan` works in logarithmic scale, we'll have to take the sum of the log PDFs (`normal_lpdf`).
```{markdown}
// TODO: we can calculate log_p(D | mu1, sd1, mu2, sd2) in
// linear time with dynamic programming. See the Conclusion section.
transformed parameters {
  vector[N] log_p;
  real mu;
  real sigma;
  log_p = rep_vector(-log(N), N);  // log P(tau) = -log(N), a uniform prior
  for (tau in 1:N)
    for (n in 1:N) {
      mu = n < tau ? mu1 : mu2;
      sigma = n < tau ? sigma1 : sigma2;
      log_p[tau] = log_p[tau] + normal_lpdf(D[n] | mu, sigma);
    }
}
```
The `normal_lpdf` function returns the logarithm of the normal probability density, which allows us to write the model in log scale, as Stan requires.
### Model
In the `model` block we define the priors on $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and the log-likelihood $\log P(D | \mu_1, \sigma_1, \mu_2, \sigma_2)$, obtained by marginalising $\tau$ out of `log_p` with `log_sum_exp`. A reasonably good default choice is to use a *half-normal* prior on the scale parameters $\sigma_1, \sigma_2$ (a negative scale would be ill defined!). Here I'm using large values for the scale parameter $\sigma$ to denote uncertainty in the prior beliefs about the distribution. See [Prior Choice Recommendations](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations) for an overview and best practices. If we knew more about the data, we could define more appropriate, tighter priors.
```{markdown}
model {
mu1 ~ normal(0, 100);
mu2 ~ normal(0, 100);
// scale parameters need to be > 0;
// we constrained sigma1, sigma2 to be positive
// so that stan interprets the following as half-normal priors
sigma1 ~ normal(0, 100);
sigma2 ~ normal(0, 100);
target += log_sum_exp(log_p);
}
```
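For intuition, `log_sum_exp` is a Stan built-in that computes $\log \sum_{\tau} \exp(\text{log\_p}[\tau])$ in a numerically stable way. A minimal R sketch of the same function:
```{r}
# Minimal sketch of Stan's log_sum_exp: subtract the max before
# exponentiating so that exp() does not underflow to zero.
log_sum_exp <- function(lp) max(lp) + log(sum(exp(lp - max(lp))))
log_sum_exp(c(-1000, -1001))  # log(exp(-1000) + exp(-1001)), without underflow
```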
What about the posterior? This is where some of Stan's magic kicks in. At each iteration of MCMC sampling after convergence, $\mu_1, \sigma_1, \mu_2, \sigma_2$ are drawn from $P( \mu_1, \sigma_1, \mu_2, \sigma_2 | D)$, and $P(\tau| \mu_1, \sigma_1, \mu_2, \sigma_2, D)$ is calculated based on the local unnormalized value of `log_p`. As a final step $P(\tau | \mu_1, \sigma_1, \mu_2, \sigma_2, D)$ is normalized by averaging over all draws to obtain $P(\tau|D)$. More details on this step can be found in Stan's manual.
### Discrete sampling
The `generated quantities` block allows for postprocessing. We can use it to draw a discrete sample of `tau` at each iteration with `categorical_rng`. At each iteration we draw a value of `tau`; later on, we'll look at the histogram of the draws to determine the most likely changepoint as the most frequent `tau`.
```{markdown}
generated quantities {
  int<lower=1,upper=N> tau;
  // a simplex is a Stan data type: a vector of
  // non-negative entries that sum to one
  simplex[N] sp;
  sp = softmax(log_p);
  tau = categorical_rng(sp);
}
```
The `softmax` transform maps `log_p` to a [K-simplex](https://en.wikipedia.org/wiki/Simplex#The_standard_simplex), which is the parameter type expected by `categorical_rng`. The sampled label is an index into `log_p`, i.e. a candidate changepoint.
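A minimal R sketch of the transform (Stan provides `softmax` as a built-in):
```{r}
# softmax: exponentiate and normalise so the entries sum to one;
# subtracting the max first keeps the computation numerically stable.
softmax <- function(lp) { e <- exp(lp - max(lp)); e / sum(e) }
softmax(c(-3, -1, -2))  # a point on the simplex; entries sum to 1
```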
## Putting it all together
Let's generate some artificial data to test whether the model works as expected.
```{r bayesian-changepoint-data}
x1 <- rnorm(41, mean=15, sd=1.5)
x2 <- rnorm(79, mean=17, sd=1.1)
x <- c(x1, x2)
plot(x, type='l')
```
If all goes as expected, a changepoint should be identified at $\tau = 42$.
### The R interface for Stan
I'll be using the excellent `rstan` package to fit the model and analyse its output.
```{r}
library(rstan)
rstan::stan_version()
rstan_options(auto_write = TRUE) # cache a compiled Stan program
```
The `stan` function wraps the following three steps:
1. Translate a model in Stan code to C++ code
2. Compile the C++ code to a dynamic shared object (DSO) and load the DSO
3. Sample given some user-specified data and other settings
The function returns an S4 `stanfit` object. We can use methods such as `print` and `plot` (and `pairs`) to check the fitted results.
```{r}
fit <- stan(
file = "changepoint.stan",
data = list(D = x, N=length(x)),
chains = 4,
warmup = 1000,
iter = 10000,
cores = 4,
refresh = 500
)
```
The parameters are self explanatory. Note that the variable names in the `data` argument must match the `data` block in the Stan code.
`print` gives an overview of the fitted parameters and of the `log_p` vector.
```{r}
print(fit)
```
Looking at the table, we can see that `log_p` peaks at index 42. We can see this more explicitly if we look at a histogram of
the discrete draws of `tau`:
```{r bayesian-changepoint-tau-hist}
qplot(extract(fit)$tau, geom='histogram', xlab = "tau")
print(fit, pars=c("tau"))
```
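The most likely changepoint is the most frequent draw, i.e. the posterior mode; for instance:
```{r}
# Posterior mode of tau: the most frequent value among the draws.
tau_draws <- extract(fit)$tau
as.integer(names(which.max(table(tau_draws))))
```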
We also get the credible interval of $\tau$ using the `plot` function:
```{r bayesian-changepoint-ci}
plot(fit, pars=c("tau"))
```
The plot indicates the expected value of $\tau = 42$. The red bands denote the 80% credible interval: given the model and the data, $\tau$ lies in the interval $41 \leq \tau \leq 44$ with posterior probability $P=0.80$. The black bands denote the 95% credible interval.
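The same interval can be read directly off the posterior draws (rstan's plotted intervals are quantile-based central intervals):
```{r}
# 80% central credible interval of tau from the posterior draws.
quantile(extract(fit)$tau, probs = c(0.10, 0.90))
```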
This result is consistent with the dataset $D$.
## Conclusion
In this document I showed a simple yet powerful Bayesian model for detecting a single changepoint in a time series. The advantages over the frequentist approach are twofold. On the one hand, the Bayesian model gives a distribution over changepoints and not a single point estimate, which lets us reason about how credible the model output is. On the other hand, we don't need to hard-code threshold values for a test statistic; instead, we embed our prior knowledge about the data to "tune" the model.
It's not all roses, though. The flexibility given to us by the Bayesian framework comes at the expense of some math being needed to work with a discrete parameter. This is largely due to `stan`'s implementation choices with respect to discrete random variables. For a more concise changepoint model written with the PyMC framework, have a look at the work of my colleague Vincent at [Switching to Sampling in order to Switch](http://koaning.io/switching-to-sampling-in-order-to-switch.html). A further drawback of the Bayesian approach is that it is relatively computationally expensive. On my machine (a 2015 retina MacBook Pro), it takes about 7 seconds to run the MCMC simulation on four cores[^2]. For comparison, the `changepoint` package has a sub-second run time. One bottleneck of this model is the quadratic cost of marginalising out `tau` in the `transformed parameters` block. We can make this operation linear with dynamic programming (see also the discussion of the coal mine example in the Stan manual).
Many thanks to [@idontgetoutmuch](https://github.com/idontgetoutmuch) for suggesting the following implementation:
```{markdown}
transformed parameters {
  vector[N] log_p;
  {
    // log_p_e[i + 1]: cumulative log-likelihood of D[1..i] under the early regime
    // log_p_l[i + 1]: cumulative log-likelihood of D[1..i] under the late regime
    vector[N+1] log_p_e;
    vector[N+1] log_p_l;
    log_p_e[1] = 0;
    log_p_l[1] = 0;
    for (i in 1:N) {
      log_p_e[i + 1] = log_p_e[i] + normal_lpdf(D[i] | mu1, sigma1);
      log_p_l[i + 1] = log_p_l[i] + normal_lpdf(D[i] | mu2, sigma2);
    }
    // log_p[tau] = -log(N) + log_p_e[tau] + (log_p_l[N + 1] - log_p_l[tau])
    log_p = rep_vector(-log(N) + log_p_l[N + 1], N) + head(log_p_e, N) - head(log_p_l, N);
  }
}
```
This leads to a total runtime of approximately 3.5 seconds:
```{markdown}
Elapsed Time: 0.833563 seconds (Warm-up)
2.64182 seconds (Sampling)
3.47538 seconds (Total)
```
[^1]: The excellent [changepoint](https://cran.r-project.org/web/packages/changepoint/index.html) R package implements this likelihood test, as well as its natural extension to the case of multiple change points.
[^2]: The `transformed parameters` block has a quadratic time complexity. Stan's manual contains a dynamic programming alternative with linear time complexity.