| title | author | date | layout |
|---|---|---|---|
| INLA from scratch | Stefan Siegert | 22 September 2017 | default |
Last update: 22 September 2017
INLA (integrated nested Laplace approximation) is a popular framework for approximate Bayesian inference in complicated space-time models. It can provide a significant speedup over Markov chain Monte Carlo (MCMC) in high-dimensional models.
When I started reading about INLA, I found a few technical papers in stats journals that explain the theory. When I searched for INLA on the internet to get a more gentle introduction, I got the impression that the INLA methodology is identical to the R package INLA. Every blog post or forum entry I found on INLA dealt with issues related to the R package. I could not find any examples where someone implemented INLA (the mathematical framework) from scratch, without using the R package.
This report is my attempt to implement INLA from scratch, without using the R package. The motivation is to understand the method in more detail, and to make the R package less of a black box.
The report was written in R markdown, converted to markdown with knitr, converted to html with jekyll and hosted via github pages.
library(Matrix)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## expand(): tidyr, Matrix
## filter(): dplyr, stats
## lag(): dplyr, stats
knitr::opts_chunk$set(
cache.path='_knitr_cache/inla-from-scratch/',
fig.path='figure/inla-from-scratch/'
)
Consider the case where we want to learn something about a time series that comes in the form of zeros and ones, for example
\begin{equation} (y_1, ..., y_n) = (0, 1, 0, 0, 1, 1, 1, ..., 0, 1) \end{equation}
Such a binary time series could occur, for example, as a record of rain/no rain on consecutive days. As is usual in meteorological time series, consecutive values are correlated with one another. For rain data, we usually observe that a rainy day is more likely to be followed by a rainy day than by a dry day. The auto-correlation structure is a useful statistical property to infer from such a time series.
How should we model a time series of auto-correlated zeros and ones? One way is to model the data by a two-stage process.
The observations $y_t \in \{0, 1\}$ are modelled as Bernoulli variables with success probabilities $p_t$, that is
\begin{equation} p(y_t | p_t) = \begin{cases} p_t & \mathrm{if}\; y_t = 1 \newline 1 - p_t & \mathrm{if}\; y_t = 0 \end{cases} \end{equation}
or equivalently
\begin{equation} p(y_t | p_t) = p_t^{y_t} (1 - p_t)^{1-y_t} \end{equation}
The unobserved success probabilities $p_t$ are linked to latent variables $x_t$ through the logistic transformation
\begin{equation} p_t = \frac{\exp(\beta x_t)}{1 + \exp(\beta x_t)} \end{equation}
This guarantees that $p_t$ always lies between 0 and 1. The latent variables $x_t$ are modelled as a stationary first-order auto-regressive (AR1) process
\begin{equation} x_t = \alpha x_{t-1} + \sigma \epsilon_t \end{equation}
where the $\epsilon_t$ are independent standard Normal innovations, $|\alpha| < 1$ is the auto-regressive parameter, and $\sigma$ is the innovation standard deviation.
We refer to the hyperparameters of the model collectively as $\theta = (\alpha, \beta, \sigma)$.
In the following simple example we will assume that $\beta$ and $\sigma$ are known, so the only hyperparameter to be inferred is the auto-regressive parameter $\alpha$.
We simulate data from the toy model with parameter settings $n = 100$, $\alpha = 0.6$, $\sigma = 0.1$, and $\beta = 10$:
set.seed(1234)
n = 100
alpha_true = .6
sigma_true = 0.1
beta_true = 10
# simulate x_t as AR1, p_t as logit(x_t), and y_t as Bernoulli(p_t)
x_true = arima.sim(n=n, model=list(ar=alpha_true), sd=sigma_true) %>% as.numeric
p_true = 1 / (1 + exp(-beta_true * x_true))
y = rbinom(n, 1, p_true)
# create data frame for plotting
df = data_frame(i = 1:n, x = x_true, p = p_true, y = y) %>%
gather(key='variable', value='value', -i) %>%
mutate(variable = factor(variable, levels=c('x','p','y')))
# plot the time series x_t, p_t, and y_t in 3 panels
ggplot(df, aes(x=i, y=value, color=variable)) +
geom_line(na.rm=TRUE, show.legend=FALSE) +
facet_wrap(~variable, ncol=1, scales='free_y') +
xlim(0,100) + labs(x=NULL, y=NULL)
Here we derive the full joint distribution of the observations $y = (y_1, ..., y_n)$, the latent variables $x = (x_1, ..., x_n)$, and the hyperparameters $\theta$. The joint distribution factorises as
\begin{equation} p(y, x, \theta) = p(y \vert x, \theta) p(x \vert \theta) p(\theta) \end{equation}
It will be convenient to work with log probabilities instead of probabilities, so the joint distribution can equivalently be expressed as
\begin{equation} \log p(y, x, \theta) = \log p(y \vert x, \theta) + \log p(x \vert \theta) + \log p(\theta) \end{equation}
The conditional log-probability of the observations is given by
\begin{align} \log p(y \vert x,\theta) & = \sum_t \left[ y_t \log(p_t) + (1 - y_t) \log(1 - p_t) \right] \newline & = \sum_t \left[ \beta x_t y_t - \log(1 + \exp(\beta x_t)) \right] \end{align}
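The second equality uses $p_t = \exp(\beta x_t) / (1 + \exp(\beta x_t))$. As a quick sanity check (this snippet is my addition, not part of the original derivation; the `_chk` objects are arbitrary throwaway test values), we can confirm numerically that the two forms agree:
# check (illustrative): both forms of log p(y | x, theta) give the same value
x_chk = rnorm(5); y_chk = rbinom(5, 1, 0.5); beta_chk = 2
p_chk = exp(beta_chk * x_chk) / (1 + exp(beta_chk * x_chk))
sum(y_chk * log(p_chk) + (1 - y_chk) * log(1 - p_chk))          # first form
sum(beta_chk * x_chk * y_chk - log(1 + exp(beta_chk * x_chk)))  # second form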
The latent process $x$ is a first-order Markov process, so the joint distribution of $x = (x_1, ..., x_n)$ factorises into conditional distributions:
\begin{align} p(x_1, ..., x_n) & = p(x_1) p(x_2 \vert x_1) p(x_3 \vert x_2) ... p(x_n \vert x_{n-1})\newline & \propto \exp \left[ -\frac{1}{2\sigma^2} \left[ x_1^2 (1 - \alpha^2) + (x_2 - \alpha x_1)^2 + (x_3 - \alpha x_2)^2 + ... + (x_n - \alpha x_{n-1})^2 \right]\right]\newline & = \exp \left[ -\frac{1}{2\sigma^2} \left[ x_1^2 - 2\alpha x_1 x_2 + x_2^2 (1 + \alpha^2) - 2 \alpha x_2 x_3 + x_3^2 (1 + \alpha^2) - ... + x_{n-1}^2 (1 + \alpha^2) - 2 \alpha x_{n-1} x_n + x_n^2 \right] \right]\newline & = \exp \left[ -\frac12 x' Q x \right] \end{align}
which is a multivariate Normal distribution with zero mean and precision (inverse covariance) matrix given by
\begin{equation} Q = \frac{1}{\sigma^2} \left(\begin{matrix} 1 & -\alpha & & & \newline -\alpha & 1 + \alpha^2 & -\alpha & & \newline & \ddots & \ddots & \ddots& \newline & & -\alpha & 1 + \alpha^2 & -\alpha \newline & & & -\alpha & 1 \end{matrix}\right) \end{equation}
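As another quick sanity check (my addition, not needed for the derivation; the `_chk` objects are throwaway test values), we can verify for a small $n$ that the inverse of $Q$ equals the stationary AR1 covariance matrix $\frac{\sigma^2}{1-\alpha^2}\,\alpha^{|i-j|}$:
# sanity check (illustrative): solve(Q) should equal the stationary AR1 covariance
n_chk = 5; alpha_chk = 0.6; sigma_chk = 0.1
Q_chk = 1 / sigma_chk^2 * Matrix::bandSparse(n_chk, k=0:1,
  diagonals = list(c(1, rep(1 + alpha_chk^2, n_chk - 2), 1),
                   rep(-alpha_chk, n_chk - 1)),
  symmetric = TRUE)
S_chk = sigma_chk^2 / (1 - alpha_chk^2) *
        alpha_chk ^ abs(outer(1:n_chk, 1:n_chk, `-`))
max(abs(as.matrix(solve(Q_chk)) - S_chk))  # should be numerically zero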
We don't make any assumptions about the prior distribution $p(\theta)$ of the hyperparameters for now. Putting everything together, the log of the joint distribution is
\begin{align} \log p(y,x,\theta) = & \log p(y\vert x,\theta) + \log p(x\vert \theta) + \log p(\theta) \newline \propto &\sum_t \left[ \beta x_t y_t - \log(1 + \exp(\beta x_t)) \right] + \frac12 \log\vert Q\vert - \frac12 x'Qx + \log p(\theta) \end{align}
INLA (integrated nested Laplace approximation) is a statistical framework for Bayesian inference in models of the following form:
- observations: $y \vert x, \theta$, where the $y_t$ are conditionally independent given the latent variables and the hyperparameters
- latent variables: $x \vert \theta \sim N(0, Q^{-1}(\theta))$, a Gaussian process (Gaussian Markov random field) with sparse precision matrix $Q(\theta)$
- hyperparameters: $\theta \sim p(\theta)$, a parameter vector of small dimension with an (almost) arbitrary prior distribution
This hierarchical model covers a wide range of statistical models for spatial, temporal and spatio-temporal data.
The reasons for the restrictions, such as sparsity of the precision matrix $Q$ and the small number of hyperparameters, will become clear below when we look at the computations involved.
One goal of INLA is to calculate an approximation of the posterior of the hyperparameters $p(\theta \vert y)$; the other is to approximate the marginal posteriors of the latent variables $p(x_t \vert y)$.
We are going to proceed in two steps:
We will first approximate the hyperparameter posterior $p(\theta \vert y)$, and then use this approximation to calculate the posterior of the latent variables $p(x \vert y)$.
Throughout, the observations $y$ are treated as fixed at their observed values.
From the (slightly unusual) factorisation of the joint distribution
\begin{equation} \log p(y,x,\theta) = \log p(x\vert\theta, y) + \log p(\theta\vert y) + \log p(y) \end{equation}
we get the following expression for the posterior of the hyperparameters
\begin{equation} \log p(\theta \vert y) \propto \log p(y,x,\theta) - \log p(x\vert y,\theta) \end{equation}
The symbol $\propto$ is used here to indicate equality up to an additive constant that does not depend on $\theta$ (the term $\log p(y)$ has been dropped).
The above expression is valid for arbitrary values of $x$. In particular, it holds when $x$ is set equal to the mode $x_0(\theta)$ of $p(x \vert y, \theta)$:
\begin{equation} \log p(\theta \vert y) \propto \left[ \log p(y,x,\theta) - \log p(x\vert y,\theta) \right]_{x=x_0(\theta)} \end{equation}
The challenge is now to find a good approximation of the conditional distribution of the latent variables $p(x \vert y, \theta)$. For this, INLA uses the Laplace approximation, which approximates the integral over $e^{f(x)}$ by expanding $f$ to second order around its mode $x_0$:
\begin{equation} \int dx\; e^{f(x)} \approx e^{f(x_0)} (2\pi)^{n/2} \big\vert -Hf(x_0)\big\vert ^{-1/2} \end{equation}
where $x_0$ is the mode of $f(x)$, $n$ is the dimension of $x$, and $Hf(x_0)$ is the Hessian matrix of $f$ evaluated at the mode.
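To build some intuition, here is a one-dimensional illustration (my own addition, with an arbitrary integrand): for $f(x) = 5\log x - 2x$ the integral $\int_0^\infty e^{f(x)} dx = \Gamma(6)/2^6 = 1.875$ is known exactly, and the Laplace approximation comes close:
# illustration: Laplace approximation of a 1-d integral with known value
f_chk  = function(x) 5 * log(x) - 2 * x      # log of the integrand
x0_chk = 5 / 2                               # mode: f'(x) = 5/x - 2 = 0
h_chk  = -5 / x0_chk^2                       # second derivative at the mode
exp(f_chk(x0_chk)) * sqrt(2 * pi / (-h_chk))          # Laplace approximation: ~1.84
integrate(function(x) exp(f_chk(x)), 0, Inf)$value    # exact value: 1.875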
To apply the Laplace approximation, we write $p(x \vert y, \theta)$ in terms of the function $f(x) = \log p(y, x, \theta)$, viewed as a function of $x$:
\begin{equation} p(x\vert y,\theta) = \frac{\exp[f(x)]}{\int dx \exp[f(x)]} \end{equation}
The denominator is approximated by the Laplace approximation.
So the Laplace approximation of the log-density of $p(x \vert y, \theta)$, evaluated at the mode $x_0$, is
\begin{equation} \log p(x_0 | y, \theta) \approx -\frac{n}{2} \log (2\pi) + \frac12 \log \big\vert -Hf(x_0) \big\vert \end{equation}
and the log of the posterior $p(\theta \vert y)$ becomes (dropping terms that do not depend on $\theta$)
\begin{equation} \log p(\theta \vert y) \propto \log p(y, x_0, \theta) - \frac12 \log \big\vert -Hf(x_0) \big\vert \end{equation}
In our toy model, the function $f(x) = \log p(y, x, \theta)$ is, up to terms that do not depend on $x$, given by
\begin{equation} f(x) = \sum_t [\beta x_t y_t - \log ( 1 + \exp(\beta x_t))] - \frac12 x'Qx \end{equation}
The gradient vector and Hessian matrix of $f(x)$ are given by
\begin{align} \nabla f (x) & = vec\left[ \beta y_i - \frac{\beta \exp(\beta x_i)}{1+\exp(\beta x_i)}\right] - Qx\newline Hf(x) & = - Q - diag\left[ \frac{\beta^2 \exp(\beta x_i)}{(1+\exp(\beta x_i))^2} \right] \end{align}
I haven't talked about how the mode $x_0$ of $f(x)$ is found. One option is brute-force numerical optimisation, for example with R's built-in `optim` function. This works in general, but is about 10 times slower than the iterative method proposed in the original INLA paper, which I describe here:
The function $f(x)$ can be written as
\begin{equation} f(x) = -\frac12 x'Qx + \sum_i g_i(x_i) \end{equation}
where $g_i(x_i) = \beta x_i y_i - \log(1 + \exp(\beta x_i))$ depends only on the $i$th element of $x$.
We Taylor expand $f(x)$ to second order around an initial guess $x_0$ of the mode:
\begin{align} f(x) & \approx f(x_0) + \nabla f(x_0) (x-x_0) + \frac12 (x-x_0)' Hf(x_0) (x-x_0)\newline & \propto [ \nabla f (x_0) - x_0' Hf(x_0) ] x + \frac12 x' Hf(x_0) x \end{align}
where $\propto$ again means equality up to terms that do not depend on $x$.
The Taylor expansion approximates $f(x)$ by a quadratic function whose maximum is found by setting the gradient to zero, which leads to the linear system
\begin{align} [Q - diag\; g^{''}_i(x_0)]\; x = vec\; g^{'}_i(x_0) - diag\; g^{''}_i(x_0)\; x_0 \end{align}
The mode of the fitted parabola provides a new, improved estimate of the true mode of $f(x)$. We solve the linear system repeatedly, each time expanding around the most recent estimate, until the estimate of the mode converges.
I mentioned earlier that application of INLA is restricted to latent Gaussian models with a sparse precision matrix of the latent process, with sparse dependency between the latent variables $x$ and the observations $y$, and with a small number of hyperparameters. The reasons are computational:
To approximate the mode $x_0(\theta)$ we repeatedly solve the sparse linear system above, which is fast as long as $Q$ is sparse and each $y_t$ depends on only a few elements of $x$.
The Hessian of $f(x)$ is the sum of the sparse matrix $-Q$ and a diagonal matrix, so its log-determinant can be calculated efficiently from a sparse Cholesky decomposition.
Lastly, by limiting the number of hyperparameters, the posterior $p(\theta \vert y)$ can be approximated on a grid of $\theta$ values, which quickly becomes infeasible as the dimension of $\theta$ grows.
So let's implement INLA for our toy model.
We will assume that the parameters $\beta$ and $\sigma$ are known and fixed at their true values, so the only hyperparameter to be inferred is the AR1 parameter, i.e. $\theta = \alpha$.
We next define the necessary R functions.
calc_Q = function(alpha) {
1 / sigma_true^2 * Matrix::bandSparse(n, k=0:1,
diagonals = list(c(1, 1 + alpha^2, 1) %>% rep(c(1, n-2, 1)),
-alpha %>% rep(n-1)),
symmetric=TRUE)
}
The AR1 coefficient $\alpha$ is between -1 and 1. We assume a Beta($a$, $b$) prior for the transformed parameter $(\alpha + 1)/2$, which lies between 0 and 1. The log prior density of $\alpha$ is then, up to an additive constant:
calc_lprior = function(alpha, a=1, b=1) {
(a-1) * log((alpha + 1) / 2) + (b-1) * log(1 - (alpha+1)/2)
}
We calculate the log-determinant of $Q$ from its Cholesky factor $L$, using $\log\vert Q \vert = 2 \sum_i \log L_{ii}$. The following function evaluates the log of the joint distribution $\log p(y, x, \theta)$ derived above:
calc_ljoint = function(y, x, alpha, a=1, b=1) {
chol_Q = calc_Q(alpha) %>% chol
logdet_Q_half = chol_Q %>% diag %>% log %>% sum
quad_form = crossprod(chol_Q %*% x) %>% drop
res =
sum(beta_true * x * y - log1p(exp(beta_true * x))) +
logdet_Q_half - 0.5 * quad_form +
calc_lprior(alpha, a, b)
return(res)
}
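As an aside (my addition, using an arbitrary value of $\alpha$), we can confirm the Cholesky identity for the log-determinant numerically:
# check (illustrative): log-determinant via the Cholesky factor
Q_chk = calc_Q(alpha = 0.5)
2 * sum(log(diag(chol(Q_chk))))                # via the Cholesky factor
determinant(Q_chk, logarithm = TRUE)$modulus   # direct calculation, same value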
calc_ff = function(x, alpha) {
sum(beta_true * x * y - log1p(exp(beta_true * x))) -
0.5 * drop(as.matrix(x %*% calc_Q(alpha) %*% x))
}
calc_grad_ff = function(x, alpha) {
beta_true * y -
beta_true * exp(beta_true * x) / (1 + exp(beta_true * x)) -
drop(as.matrix(calc_Q(alpha) %*% x))
}
calc_neg_hess_ff = function(x, alpha) {
calc_Q(alpha) +
diag(beta_true^2 * exp(beta_true * x) / (1 + exp(beta_true * x))^2)
}
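Before moving on, a quick finite-difference check (my addition, using an arbitrary value of $\alpha$ and a random test vector `x_chk`) that the analytical gradient matches a numerical approximation of the gradient of `calc_ff`:
# check (illustrative): compare calc_grad_ff against central finite differences
x_chk = rnorm(n, sd = 0.1)
eps   = 1e-6
g_num = sapply(1:n, function(i) {
  e = replace(rep(0, n), i, eps)
  (calc_ff(x_chk + e, alpha = 0.5) - calc_ff(x_chk - e, alpha = 0.5)) / (2 * eps)
})
max(abs(g_num - calc_grad_ff(x_chk, alpha = 0.5)))  # should be close to zero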
# the function g(x) = log(p(y | x,theta)), its gradient and hessian
calc_g0 = function(x) {
sum(beta_true * x * y - log1p(exp(beta_true * x)))
}
calc_g1 = function(x) {
beta_true * y - beta_true * exp(beta_true * x) / (1 + exp(beta_true * x))
}
calc_g2 = function(x) {
(-1) * beta_true^2 * exp(beta_true * x) / (1 + exp(beta_true * x))^2
}
calc_x0 = function(alpha, tol=1e-12) {
Q = calc_Q(alpha)
x = x0 = rep(0, n)
while(1) {
g1 = calc_g1(x)
g2 = calc_g2(x)
x = drop(solve(Q - bandSparse(n=n, k=0, diagonals=list(g2))) %*%
(g1 - x0 * g2))
if (mean((x - x0)^2) < tol) {
break
} else {
x0 = x
}
}
return(x)
}
Here is a brute force optimisation using the built-in R function `optim`, which is about 10 times slower than the iterative method above, but gives the same result:
calc_x0_brute = function(alpha) {
optim(par=rep(0, n), fn=calc_ff, gr=calc_grad_ff, alpha=alpha,
control=list(fnscale=-1), method='BFGS')$par
}
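A quick comparison (my addition, for an arbitrary $\alpha$) confirms that the two mode finders agree up to optimiser tolerance, and illustrates the difference in run time:
# check (illustrative): iterative and brute-force mode finders agree
system.time(x0_iter  <- calc_x0(alpha = 0.5))
system.time(x0_brute <- calc_x0_brute(alpha = 0.5))
max(abs(x0_iter - x0_brute))  # should be small (optimiser tolerance only)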
With the above functions, approximating the unnormalised log-posterior $\log p(\alpha \vert y)$ over a grid of $\alpha$ values takes only a few lines of code:
calc_lpost = function(alpha) {
x0 = calc_x0(alpha)
chol_h = chol(calc_neg_hess_ff(x0, alpha))
calc_ljoint(y, x0, alpha) - sum(log(diag(chol_h)))
}
alpha_vec = seq(-.95, .95, len=31)
lpost = sapply(alpha_vec, calc_lpost)
Once the unnormalised log-posterior has been evaluated on a regular grid of $\alpha$ values, the normalisation constant can be approximated by numerical integration. Here we use Simpson's rule:
calc_Z = function(alpha_vec, lpost_vec) {
nn = length(alpha_vec)
hh = alpha_vec[2] - alpha_vec[1]
ww = c(1, rep(c(4,2), (nn-3)/2), c(4,1))
return(sum(ww * exp(lpost_vec)) * hh / 3)
}
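The weights 1, 4, 2, 4, ..., 2, 4, 1 are the composite Simpson's rule weights, which require an odd number of grid points. A quick check on a standard Normal density (my addition; `xx_chk` is an arbitrary grid) shows the integrator works:
# check (illustrative): Simpson's rule integrates a standard Normal density to ~1
xx_chk = seq(-5, 5, len = 101)
calc_Z(xx_chk, dnorm(xx_chk, log = TRUE))  # should be very close to 1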
At this point, approximation of the unnormalised log posterior of $\alpha$ boils down to a single `sapply` call, and the normalisation constant can be approximated subsequently:
lpost = lpost - mean(lpost) # to avoid numerical overflow
Z = calc_Z(alpha_vec, lpost)
We plot the unnormalised log-posterior and the normalised posterior:
# data frame for plotting
df_posterior =
bind_rows(
data_frame(alpha=alpha_vec, posterior=lpost,
type='unnormalised_log_posterior'),
data_frame(alpha=alpha_vec, posterior=exp(lpost)/Z,
type='normalised_posterior')) %>%
mutate(type = factor(type, levels=c('unnormalised_log_posterior',
'normalised_posterior')))
# plot unnormalised log posterior and normalised posterior in 2 panels
ggplot(df_posterior, aes(x=alpha, y=posterior)) +
geom_line() + geom_point() +
geom_vline(aes(xintercept=alpha_true), linetype='dashed') +
facet_wrap(~type, scale='free_y', ncol=1) +
theme(legend.position='none')
For comparison we use the R package `INLA` to calculate the posterior $p(\alpha \vert y)$ for the same data.
In R-INLA, the AR1 latent model is parametrised through the parameters $\theta_1 = \log[(1 - \alpha^2)/\sigma^2]$ (the log marginal precision, which we fix at its true value) and $\theta_2 = \log[(1 + \alpha)/(1 - \alpha)]$ (the logit transformation of $\alpha$).
It is worthwhile to have a look at the default prior that R-INLA places on $\theta_2$, a Normal distribution with mean 0 and precision 0.15. On the $\alpha$ scale this prior puts a lot of mass close to $\pm 1$, so below I also consider a Normal prior with mean 0 and precision 0.5, which is the one used in the R-INLA fit:
# data frame for plotting
df_prior =
bind_rows(
data_frame(logit_alpha = rnorm(1e4,0,sqrt(1/0.15)),
type='default prior N(0,0.15)'),
data_frame(logit_alpha = rnorm(1e4,0,sqrt(1/0.5)),
type='new prior N(0,0.5)')) %>%
group_by(type) %>%
mutate(alpha = (exp(logit_alpha)-1)/(exp(logit_alpha) + 1)) %>%
ungroup %>%
gather(key='transformation', value='value', -type) %>%
mutate(transformation = factor(transformation, levels=c('logit_alpha', 'alpha')))
# plot distributions of logit(alpha) and alpha for the two priors in 4 panels
ggplot(df_prior, aes(x=value)) +
geom_histogram(bins=30) +
facet_grid(type ~ transformation, scale='free') +
labs(x=NULL, y=NULL)
library(INLA)
## Loading required package: sp
## Loading required package: methods
## This is INLA_17.06.20 built 2017-09-07 09:01:27 UTC.
## See www.r-inla.org/contact-us for how to get help.
theta1_true = log((1-alpha_true^2)/sigma_true^2)
theta2_param = c(0, 0.5)
A_mat = diag(beta_true, n, n) %>% inla.as.sparse
inla_formula =
y ~ -1 +
f(i, model='ar1', hyper=list(
theta1 = list(fixed=TRUE, initial=theta1_true),
theta2 = list(param=theta2_param)))
inla_data = data_frame(i=1:n, y=y)
res = inla(
formula=inla_formula,
data=inla_data,
family='binomial',
Ntrials=rep(1,n),
control.predictor=list(A = A_mat)
)
## Note: method with signature 'Matrix#numLike' chosen for function '%*%',
## target signature 'dgTMatrix#numeric'.
## "TsparseMatrix#ANY" would also be valid
## Note: method with signature 'sparseMatrix#matrix' chosen for function '%*%',
## target signature 'dgTMatrix#matrix'.
## "TsparseMatrix#ANY" would also be valid
# data frame for plotting
df_compare = bind_rows(
res$marginals.hyperpar$`Rho for i` %>%
as_data_frame %>%
rename(alpha=x, posterior=y) %>%
mutate(type='R-INLA'),
data_frame(alpha=alpha_vec, posterior=exp(lpost) / Z) %>%
mutate(type='inla_from_scratch')
)
# plot posteriors for inla-from-scratch and R-INLA
ggplot(data=df_compare, mapping=aes(x=alpha, y=posterior, colour=type)) +
geom_line() + geom_point() +
geom_vline(aes(xintercept=alpha_true)) +
labs(color=NULL)
The posterior of the latent variables $p(x \vert y)$ can be calculated by integrating the conditional posterior over the hyperparameters:
\begin{equation} p(x \vert y) = \int d\theta\; p(x \vert y, \theta) p(\theta \vert y) \end{equation}
which can be approximated numerically by a finite sum
\begin{equation} p(x \vert y) \approx \sum_i p(x \vert y, \theta_i) p(\theta_i \vert y) \Delta_i \end{equation}
While building up the approximation of $p(\theta \vert y)$ on the grid of $\alpha$ values, we have already calculated the Gaussian approximations of $p(x \vert y, \theta_i)$ (their modes and Hessians), so all that is left to do is to combine them, weighted by $p(\theta_i \vert y) \Delta_i$.
We will probably be interested in marginal means and variances of the individual $x_t$. The marginal posterior of each $x_t$ is a mixture of Normal distributions with means $\mu_i$, variances $\sigma_i^2$ and weights $w_i = p(\theta_i \vert y) \Delta_i$, and the mean and variance of such a mixture are
\begin{align} E[X] & = \mu = \sum_i w_i \mu_i \newline E[(X-\mu)^2] & = \sigma^2 = \sum_i w_i ((\mu_i - \mu)^2 + \sigma_i^2) \end{align}
where $\mu_i$ and $\sigma_i^2$ are the mean and variance of the $i$th mixture component, and the weights $w_i$ sum to one.
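These are the standard mean and variance formulas for a mixture distribution; a small Monte-Carlo check (my addition, with arbitrary `_chk` mixture parameters) illustrates them:
# check (illustrative): mixture mean/variance formulas vs Monte-Carlo estimates
w_chk  = c(0.2, 0.5, 0.3)                    # mixture weights (sum to 1)
mu_chk = c(-1, 0, 2); sd_chk = c(1, 0.5, 2)  # component means and sds
mu_mix  = sum(w_chk * mu_chk)
var_mix = sum(w_chk * ((mu_chk - mu_mix)^2 + sd_chk^2))
k_chk = sample(1:3, 1e5, replace = TRUE, prob = w_chk)
x_sim = rnorm(1e5, mu_chk[k_chk], sd_chk[k_chk])
c(analytical = mu_mix, monte_carlo = mean(x_sim))
c(analytical = var_mix, monte_carlo = var(x_sim))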
Below we go through the whole process of approximating $p(x \vert y)$: we calculate the Gaussian approximation of $p(x \vert y, \alpha)$ on a grid of $\alpha$ values, extract the marginal means and variances, and average them with weights proportional to the posterior $p(\alpha \vert y)$.
# approximate p(x | y, alpha) for many values of alpha
alpha_vec = seq(-.95, .95, len=31)
post_x = lapply(alpha_vec, function(alpha_) {
mode_ = calc_x0(alpha_)
chol_H_ = chol(calc_neg_hess_ff(mode_, alpha_))
# save alpha, the mode, the covariance matrix, and p(theta|y) unnormalised
return(list(
alpha = alpha_,
x0 = mode_,
diag_sigma = drop(diag(chol2inv(chol_H_))),
unn_log_post = calc_ljoint(y, mode_, alpha_) - sum(log(diag(chol_H_)))
))
})
# extract marginal means and variances for each p(x|y,theta)
mu = sapply(post_x, `[[`, 'x0')
sigma2 = sapply(post_x, `[[`, 'diag_sigma')
# normalise the posterior
lpost = sapply(post_x, `[[`, 'unn_log_post')
lpost = lpost - mean(lpost) # to avoid numerical overflow
post_alpha = exp(lpost) / calc_Z(alpha_vec, lpost)
mu_post = sapply(1:n, function(ii) weighted.mean(mu[ii,], w=post_alpha))
sigma_post =
sapply(1:n, function(ii)
weighted.mean((mu[ii,] - mu_post[ii])^2 + sigma2[ii,], w=post_alpha)
) %>%
sqrt
# data frame for plotting
df_latent = bind_rows(
data_frame(
type = 'inla_from_scratch', i=1:n, mu = mu_post,
lwr = mu_post - 2 * sigma_post,
upr = mu_post + 2 * sigma_post),
data_frame(
type = 'R-INLA', i=1:n, mu = res$summary.random$i[,'mean'],
lwr = with(res$summary.random$i, mean - 2 * sd),
upr = with(res$summary.random$i, mean + 2 * sd)),
data_frame(
type = 'truth', i=1:n, mu = x_true, lwr = x_true, upr = x_true)
)
# ribbon plots for p(x|y) for inla-from-scratch and R-INLA, and true values
ggplot(df_latent, aes(x=i)) +
geom_ribbon(aes(ymin=lwr, ymax=upr, alpha=type, linetype=type, color=type),
fill=NA) +
geom_line(aes(y=mu, color=type)) + labs(alpha=NULL, linetype=NULL, color=NULL, x=NULL, y=NULL)
My results are close, but not identical to the results obtained with the R package. The best explanation for the difference is, of course, that I made a mistake somewhere along the way. But even if my implementation is correct, we should expect differences.
The prior I used was similar, but not identical, to the prior used in R-INLA.
I tried slightly larger sample sizes to suppress the influence of the prior, but the two solutions do not seem to converge.
Due to the hierarchical structure of the model, there seems to be some irreducible uncertainty about $\alpha$ that additional data cannot fully remove.
R-INLA is much cleverer than I am when selecting the points at which to evaluate the posterior $p(\theta \vert y)$, which can lead to small differences.
The posterior of the latent process $x$ is also approximated differently: I used the Gaussian approximation of $p(x \vert y, \theta)$ throughout, while R-INLA by default uses a more refined (simplified Laplace) approximation for the marginals $p(x_t \vert y, \theta)$.
Lastly, the capabilities of R-INLA are not limited to what I showed here.
R-INLA can compute posterior predictive distributions for future values of $y$, handle missing data, and work with a large number of different likelihoods and latent models, among many other things.
INLA is an efficient method for approximate Bayesian inference in latent Gaussian models with sparse dependency structure.
The crucial step is to approximate the conditional distribution of the latent process $p(x \vert y, \theta)$ by a Gaussian distribution, using a Laplace approximation around its mode.
The INLA framework is not identical to the R package `INLA`; that is, you do not have to install the R package to apply INLA.
I showed how to implement INLA to approximate the posterior distribution of the autocorrelation parameter of a simple AR1-Bernoulli process.
The results broadly agree with the results obtained with the R package.