% This file is part of the MCMC project,
% which is (in turn) part of the Data Analysis Recipes project.
% Copyright 2013, 2014, 2015, 2016, 2017 David W. Hogg (NYU) and Dan Foreman-Mackey (NYU, UW).
% style notes
% -----------
% - \MCMC not MCMC
% - [LaTeX] newline after every full stop for proper git diffing
% - [LaTeX] eqnarray not equation
% - pdf not PDF
% - Is it ``a MCMC sampling'' or ``an MCMC sampling''?
% - \note{} right after word if you are endnoting a phrase,
% but after the full stop if endnoting the sentence or paragraph.
% - Always $\pars$ never $x$; $K$ samples $\pars_k$ and $\pars$ space is $D$-dimensional
% - Should we use the word ``walker'' everywhere and do we?
\documentclass[modern]{aastex61}
\include{dar}
\usepackage{needspace}
\shorttitle{using markov chain monte carlo}
\shortauthors{hogg \& foreman-mackey}
\newcommand{\MCMC}{\acronym{MCMC}}
\newcommand{\MH}{\acronym{M--H}}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\data}{D}
\newcommand{\pars}{\theta}
\newcommand{\best}{{\mathrm{(best)}}}
\newcommand{\better}{{\mathrm{(better)}}}
\begin{document}\sloppy\sloppypar\raggedbottom\frenchspacing\thispagestyle{plain}%
%
\title{Data analysis recipes:\\
Using Markov Chain Monte Carlo%
\note{\label{note:first}%
Copyright 2013, 2014, 2015, 2016, 2017 the authors (OMG this took us a long time).
This work is licensed under a
\href{http://creativecommons.org/licenses/by-nc-nd/3.0/deed.en\_US}{%
Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported
License}.}}%
\author[0000-0003-2866-9403]{David W. Hogg}
\affil{Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Ave, New York, NY 10010, USA}
\affil{Center for Cosmology and Particle Physics, Department of Physics, New York University, 726 Broadway, New York, NY 10003, USA}
\affil{Center for Data Science, New York University, 60 Fifth Ave, New York, NY 10011, USA}
\affil{Max-Planck-Institut f\"ur Astronomie, K\"onigstuhl 17, D-69117 Heidelberg}
\author[0000-0002-9328-5652]{Daniel Foreman-Mackey}
\affil{Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Ave, New York, NY 10010, USA}
\affil{NASA Sagan Fellow; Department~of~Astronomy, University~of~Washington, Box 351580, Seattle, WA 98195, USA}
\begin{abstract}\noindent
Markov Chain Monte Carlo (\MCMC) methods for sampling probability
density functions (combined with abundant computational resources)
have transformed the sciences, especially in performing probabilistic
inferences, or fitting models to data.
In this primarily pedagogical contribution,
we give a brief overview of the most basic \MCMC\ method and some
practical advice for the use of \MCMC\ in real inference problems.
We give advice on method choice, tuning for performance,
methods for initialization, tests of convergence, troubleshooting, and use of the chain
output to produce or report parameter estimates with associated uncertainties.
We argue that autocorrelation time is the most important test for convergence,
as it directly connects to the uncertainty on the sampling estimate of any
quantity of interest.
We emphasize that sampling is a method for doing integrals; this guides our thinking
about how \MCMC\ output is best used.
\end{abstract}
\section{When do you need MCMC?}\label{sec:when}
Markov Chain Monte Carlo (\MCMC) methods are methods for sampling
probability distribution functions or probability density functions (pdfs).
These pdfs may be either probability mass functions on a discrete
space, or probability densities on a continuous space, though we will
concentrate on the latter in this \documentname.
\MCMC\ methods don't require that you have a full analytic description of the
properly normalized pdf for sampling to proceed; they only require
that you be able to compute ratios of the pdf at pairs of locations.
This makes \MCMC\ methods ideal for sampling \emph{posterior
pdfs} in probabilistic inferences:
In a probabilistic inference, the posterior pdf $p(\pars\given\data)$,
or pdf for the parameters $\pars$ given the data $\data$, is
constructed from the likelihood $p(\data\given\pars)$, or pdf for the
data given the parameters, and the prior pdf $p(\pars)$ for the
parameters by what's often known as ``Bayes rule'',
\begin{eqnarray}
p(\pars\given\data) &=& \frac{1}{Z}\,p(\data\given\pars)\,p(\pars)
\label{eq:bayes}\quad .
\end{eqnarray}
In these contexts, the constant $Z$, sometimes written as
$p(\data)$, is known by the names ``evidence'', ``marginal likelihood'',
``Bayes integral'', and ``prior predictive probability'', and is usually
\emph{extremely hard to calculate}.\note{%
The factor $Z$ is often difficult to compute, because the
likelihood (or the prior) can have extremely complex structure,
with multiple arbitrarily compact modes, arbitrarily positioned
in the (presumably high dimensional) parameter space $\pars$.
Elsewhere, we discuss the computation of this object (\citealt{fengji}),
and so have many others before us.
We also have (unpublished) philosophical arguments against calculating
this $Z$ if you can possibly avoid it, but these are
outside the scope of this \documentname.
The point is that \MCMC\ methods will not require that you know $Z$.}
That is, you often know the function $p(\pars\given\data)$ up to a
constant factor; you can compute ratios of the pdf at pairs of points,
but not the precise value at any individual point.
In addition to this normalization-insensitive property of \MCMC, in its
simplest forms it can be run without computing any derivatives or
integrals of the function, and (as we will show below in \sectionname~\ref{sec:MH}) in its simplest
forms it is \emph{extremely easy to implement}.
For all these reasons, \MCMC\ is ideal for sampling posterior pdfs in
the real situations in which scientists find themselves.
Say you are in this situation:
You have a huge blob of data $\data$ (think of this as a vector or
list or heterogeneous-but-ordered collection of observations).
You also have a model sophisticated enough---a probabilistic,
generative model, if you will\note{%
Briefly, a ``model'' for us is a likelihood function (a pdf for the data given
model parameters), and a prior pdf over the parameters.
Because, under this definition, the model can always generate (by
sampling, say) parameters and parameters can generate (again by
sampling, say) data, the model is effectively (or actually, if you
are a true subjective Bayesian) a probability distribution (pdf)
over all possible data.%
}---that, given a
setting of a huge blob of parameters (again, think of this as a vector or list or heterogeneous-but-ordered collection of values) $\pars$,
you can compute a pdf for data (or likelihood\note{%
Technically, $p(\data\given\pars)$
is only properly a likelihood function when
we are thinking of the data $\data$ as being fixed, and the parameters $\pars$
as being permitted
to vary.}) $p(\data\given\pars)$.
Furthermore, say also that you can write down some kind of informative or
vague prior pdf $p(\pars)$ for the parameter blob $\pars$.
If all these things are true, then---even if you can't compute
anything else---in principle a trivial-to-implement \MCMC\ can give you
a fair sampling of the posterior pdf.
That is, you can run \MCMC\
(for a very long time---see \sectionname~\ref{sec:convergence} for how long)
and you will be left with a set of $K$ parameter-blob settings $\pars_k$ such that the
full set $\setof{\pars_k}_{k=1}^K$ constitutes a fair sampling from
the posterior pdf $p(\pars\given\data)$. We will give some sense of what ``fair'' means
in this context, below.
All that said, and adhering to the traditions of the \project{Data
Analysis Recipes} project\note{Every entry in the \project{Data
Analysis Recipes} series begins with a rant in which we argue that
most uses of the methods in question are not appropriate!}, we are
compelled to note at the outset that \MCMC\ is in fact \emph{over-used}.
Because \MCMC\ provably (under assumptions\note{The assumptions include
things like: The algorithm is run ``long enough'', where this phrase
is undefined (since the convergence requirements are set by precision
requirements on particular integrals), and that the density that is
being sampled has some connectedness properties: There aren't distant islands of finite density
separated by regions of zero (or exceedingly low) density.}, some of which will be discussed) samples the full
posterior pdf in all of parameter space, many investigators use \MCMC\
because (they believe) it will sample \emph{all} of the parameter space
($\pars$-space).
That is, they are using \MCMC\ because they want to
\emph{search the parameter space for good models}.
This is not a good reason to use \MCMC!
Another bad use case is the following:
Because \MCMC\ samples the parameter space representatively, it spends most of
its time near very good models; models that (within the confines of
the prior pdf) do a good job of explaining the data.
For this reason, many investigators are using \MCMC\ because it
effectively \emph{optimizes the posterior pdf}, or, for certain
choices of prior pdf, \emph{optimizes the likelihood}.\note{Of course
a committed Bayesian would argue that any time you are optimizing a
posterior pdf or optimizing a likelihood, you \emph{should} be
sampling a posterior pdf. That is, for some, when someone
``wants'' to optimize, it is actually useful that they
choose the ``wrong'' tool, because that ``wrong'' tool gives them
back something far more useful than the output of any optimizer!
However it is true that when the data are extremely informative
(that is, the posterior is very narrow), there is not much difference
between optimizing the posterior and sampling it, nor between optimizing
the likelihood and optimizing the posterior. That is, all inference
methods sort-of converge as the data become very informative. That can be
an important point when you are working on a problem where the posterior
is very narrow.}
This is another bad reason!
Both of these reasons for using \MCMC---that it is a parameter-space
search algorithm, and that it is a simple-to-code effective
optimizer---are \emph{not good reasons}. \MCMC\ is a
\emph{sampler}. If you are trying to find the optimum of the
likelihood or the posterior pdf, you should use an \emph{optimizer},
not a sampler. If you want to make sure you search all of parameter
space, you should use a \emph{search algorithm}, not a sampler. \MCMC\
is good at one thing, and one thing only: Sampling ill-normalized (or
otherwise hard to sample) pdfs.
In what follows, we are going to provide a kind of ``user manual'' or
advice document or folklore capture regarding the use of \MCMC\ for
data analysis.
This will not be a detailed description of multiple \MCMC\ methods
(indeed, we will only explain one method in detail), and it will not
be about the mathematical properties or structure of the method or
methods.\note{For a description of various different sampling methods,
given from the point of view of an astrophysicist, see the review by
\citet{sharma}. We say a bit about different \MCMC\ methods in
\sectionname~\ref{sec:methods}, but our discussion is limited
relative to Sharma's review. For even more depth, there are relevant
book-length treatments of \MCMC; our favorite is \citet{mcmchandbook}.
There are also
good books on Bayesian data analysis from a computing perspective more generally that
include discussion of \MCMC\ as a part of that;
for example, \citet{kruschke} and \citet{andreon} and \citet{hilbe}.}
Think of this \documentname\ as a getting-started document for a new user
of \MCMC\ methods.
It concentrates on conceptual points, and questions we get asked
frequently by users, with a concentration on how an \MCMC\ project is
diagnosed and tested, and how its outputs are used responsibly.
Most of what we say will
be applicable to any \MCMC\ method; in that sense the discussion is general.
And all in the language---more-or-less---of an astrophysicist.
The first couple of \sectionname s will describe what a sampling is, and how
the simplest \MCMC\ method, the Metropolis--Hastings algorithm, can
provide one.
The next few \sectionname s will provide ideas about how to
initialize, tune and operate \MCMC\ methods for good performance.
The last few \sectionname s will provide advice for making decisions
among the myriad \MCMC\ methods implementations, how to implement a good
likelihood function and prior pdf function for inference, and how to
trouble-shoot standard kinds of problems that arise in operating \MCMC\
methods on real problems.
We will leave parenthetical and philosophical matters to the footnotes,
which appear after the main text.
\section{What is a sampling?}\label{sec:sampling}
\nopagebreak
Heuristically, a sampling $\setof{\pars_k}_{k=1}^K$ from some pdf $p(\pars)$
is a set of $K$ values $\pars_k$ that are draws from the pdf.
Heuristically, if an enormous (large $K$) sampling could be displayed
in a finely binned $\pars$-space histogram, the histogram would look---up
to a total normalization---just like the original function $p(\pars)$.
We don't know all the relevant mathematics (measure theory),
but for our purposes here a pdf is
any single-valued (scalar) function that is non-negative everywhere
(the entire domain of $\pars$), and obeys a normalization condition
\begin{eqnarray}
0 &\leq& p(\pars) \quad \mbox{for all $\pars$}
\\ \label{eq:normp}
1 &=& \int p(\pars)\,\dd\pars
\quad ,
\end{eqnarray}
where, implicitly, the integral is over the full domain of $\pars$.
Importantly, although $p(\pars)$ is single-valued, $\pars$ can be
a multi-element vector or list or blob; it can be arbitrarily large
and complicated.
In data-analysis contexts, $\pars$ will often be the full blob of free
parameters in the model.
Implicitly, the integral in \equationname~(\ref{eq:normp}) is high-dimensional; it
has as many dimensions as there are elements or entries or components of $\pars$.
Also, if there are elements or entries or components of $\pars$ that are discrete
(that is, take on only integer values or equivalent),
then along those dimensions the integral becomes a discrete sum.
This latter is a detail to which we return below (briefly, in
\sectionname~\ref{sec:trouble}).
Given this pdf $p(\pars)$, we can define \emph{expectation values}
$E_{p(\pars)}[\pars]$
for $\pars$ or for any quantity that can be expressed as a function $g(\pars)$
of $\pars$:
\begin{eqnarray}
E_{p(\pars)}[\pars] &\equiv& \int \pars\,p(\pars)\,\dd \pars
\\
E_{p(\pars)}[g(\pars)] &\equiv& \int g(\pars)\,p(\pars)\,\dd \pars
\quad ,
\label{eq:the-real-integral}
\end{eqnarray}
where again the integrals are implicitly definite integrals over the
entire domain
of $\pars$ (all the parts of $\pars$ space in which $p(\pars)$ is finite) and the
integrals are multi-dimensional if $\pars$ is multi-dimensional.
These expectation values are the mean values of $\pars$ and $g(\pars)$ under
the pdf. A good sampling---and really this is the definition of a good
sampling---makes the sampling approximation to these integrals
accurate.
With a good sampling $\setof{\pars_k}_{k=1}^K$ the integrals get replaced
with sums over samples $\pars_k$:
\begin{eqnarray}
E_{p(\pars)}[\pars] &\approx& \frac{1}{K}\,\sum_{k=1}^K \pars_k
\\
E_{p(\pars)}[g(\pars)] &\approx& \frac{1}{K}\,\sum_{k=1}^K g(\pars_k)
\quad .
\label{eq:the-real-samples}
\end{eqnarray}
That is, a sampling is good when any expectation value of interest is
accurately computed via the sampling approximation.
The word ``accurately'' here translates into some kinds of theorems
about limiting behavior; the general idea is that the sampling
approximation becomes exact as $K$ goes to infinity.
The size $K$ of the sampling you need \emph{in practice} will depend
on the expectations you want to compute, and the accuracies you need.
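As a toy illustration of the sampling approximation (a sketch assuming \project{numpy}; here we cheat and draw the samples directly from a standard normal pdf, for which the analytic expectations are known, rather than from an \MCMC\ run):

```python
import numpy as np

# K draws from a pdf we can sample directly: the standard normal N(0, 1).
rng = np.random.default_rng(17)
K = 100000
theta = rng.standard_normal(K)

# Sampling approximations to two expectation values: E[theta], which is
# analytically 0, and E[g(theta)] with g(theta) = theta^2, analytically 1.
mean_est = np.mean(theta)
second_moment_est = np.mean(theta**2)
```

Both estimates approach the analytic values as $K$ grows, with errors that shrink roughly as $1/\sqrt{K}$.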
As we noted above (\sectionname~\ref{sec:when}), in the context of \MCMC, we are often
using some \emph{badly normalized} function $f(\pars)$.
This function is just the pdf $p(\pars)$ multiplied by some unknown and
hard-to-compute scalar.
In this case, for our purposes, the conditions on $f(\pars)$ are that it
be non-negative everywhere and have \emph{finite} integral $Z$
\begin{eqnarray}
0 &\leq& f(\pars) \quad \mbox{for all $\pars$}
\\
Z &=& \int f(\pars)\,\dd \pars \label{eq:proper}
\quad .
\end{eqnarray}
And recall that we don't actually know the value of $Z$, but we do know
that it is finite.
When the sampling $\setof{\pars_k}_{k=1}^K$ is of one of these badly
normalized functions $f(\pars)$---as it usually will be---the
sampling-approximation expectation values are the expectation values
under the properly-normalized corresponding pdf, even though you might
\emph{never learn that normalization}.
When we run \MCMC\ sampling on $f(\pars)$, a function that differs from a
pdf $p(\pars)$ by some unknown normalization constant, then the sampling
permits computation of the following kinds of quantities:
\begin{eqnarray}
E_{p(\pars)}[g(\pars)] &\equiv& \frac{\int g(\pars)\,f(\pars)\,\dd \pars}{\int f(\pars)\,\dd \pars}
\\
E_{p(\pars)}[g(\pars)] &\approx& \frac{1}{K}\,\sum_{k=1}^K g(\pars_k)
\label{eq:expectationvalue}
\quad .
\end{eqnarray}
That is, the sampling can be constructed (as we will show below in \sectionname~\ref{sec:MH}) from
evaluations of $f(\pars)$ directly, and it permits you to compute expectation values
without ever requiring you to integrate either the numerator integral
or the denominator integral, both of which are generally
intractable.\note{As we will
see, the intractability comes from various directions, but one is
that the dimension of $\pars$ gets large in most realistic situations.
Another is that the support of $p(\pars)$ tends to be, in normal
inference situations, \emph{far smaller} than the full domain of
$\pars$.
That is, the pdf is at or very close to zero over all but some tiny
and very hard-to-find part of the space.}
The ``correctness'' of a sampling is defined (above in \sectionname~\ref{sec:sampling}) in terms of its use in
performing integrals or approximate computation of integrals.
In a deep sense, the \emph{only} thing a sampling is good for is
computing integrals.
There are many uses of the \MCMC\ sampling, some of them good and some of them bad.
Most of the good or sensible uses will somehow involve integration.
For example, one magical property of a sampling (in a $D$-dimensional
space) is that a \emph{histogram} (a $D$-dimensional histogram) of the
samples (divided, if you like, by the number of samples in each bin
and the bin width\note{We are referring here to the point that if you
want a histogram of samples to look just like the posterior pdf from
which the samples are drawn, the histogram (thought of as a step
function) must integrate to unity.})
looks very much like the pdf from which the samples were drawn.
This is a way to ``reconstruct'' the pdf from the sampling:
Make a histogram from the samples.
Even in this case, the sampling is being used to do \emph{integrals};
the (possibly odd) idea is that the approximate or effective value of the pdf in each bin
of the histogram is an average over the bin.
That average is obtained by performing an integral.
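For instance (a sketch assuming \project{numpy}, with standard-normal draws standing in for an \MCMC\ output), a histogram with counts divided by $K$ and the bin width approximates the pdf itself:

```python
import numpy as np

# Samples standing in for an MCMC output; here drawn from N(0, 1).
rng = np.random.default_rng(8)
theta = rng.standard_normal(100000)

# density=True divides the counts by K and the bin widths, so the
# histogram (as a step function) integrates to unity over the range.
density, edges = np.histogram(theta, bins=50, range=(-4, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# The true N(0, 1) pdf, for comparison at the bin centers.
pdf = np.exp(-0.5 * centers**2) / np.sqrt(2.0 * np.pi)
```

Each bin height is a sampling estimate of the bin-averaged pdf, and the bin heights track the true pdf to within the Poisson noise of the counts.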
Integrals are also involved in finding the mean, median, and any quantiles
of a pdf.
They are not involved in finding the \emph{mode} of a pdf.
For this reason (and others), in what follows, when we talk about what to report
about the outcome of your \MCMC\ sampling,
we will advise in favor of mean, median, and quantiles,
and we will advise against mode.
Finally---and perhaps most importantly---a magical property of a sampling
(in a $D$-dimensional space)
is that if some of your $D$ dimensions are totally uninteresting
(nuisance parameters, if you will)
and some of your $D$ dimensions are of great interest,
the sampling in the full $D$-space is trivially converted into a sampling in the subspace of interest:
You just drop from each $\pars_k$ vector (or blob or list) the dimensions of no interest!
That is, the projection of the sampling to the subspace of interest produces
a sampling of the marginalized pdf, marginalizing (or projecting) out the
nuisance parameters.\note{This point about marginalization is, once again,
a use of the sampling to perform an \emph{integral};
the marginalized pdf is obtained from the full pdf by an integration.
The sampling performs this integration automatically.}
That is extremely important for inference,
where there are always parameters with very different levels of importance to the scientific conclusions.
This point generalizes from a subspace to any function of the parameters;
and it will return again below (\sectionname~\ref{sec:results}).
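To make the marginalization point concrete, here is a sketch (assuming \project{numpy}, with a correlated two-dimensional Gaussian standing in for a posterior sampling): dropping a column of the sample matrix leaves a fair sampling of the marginal pdf for the remaining parameter.

```python
import numpy as np

# Samples of a two-parameter pdf: a correlated Gaussian in (theta_1, theta_2),
# standing in for an MCMC sampling of a posterior pdf.
rng = np.random.default_rng(3)
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])
samples = rng.multivariate_normal([0.0, 0.0], cov, size=100000)

# Marginalize out the nuisance parameter theta_2 by simply dropping that
# column; what remains is a fair sampling of the marginal pdf for theta_1,
# which for this Gaussian is N(0, 1).
theta1 = samples[:, 0]
```

No integral is ever written down, but the retained column has exactly the statistics of the marginalized pdf.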
Although the discussion in this \documentname\ is general, the most
common use of \MCMC\ sampling (for us, anyway) is in probabilistic
inference.
For this reason, we will often refer to the function $f(\pars)$
colloquially as ``the posterior pdf''\note{We will also sometimes
refer to expectations under $f(\pars)$ as posterior means, and
medians as median-of-posterior values, and so on.} even though it is
implicitly ill-normalized and might not be a posterior pdf in any
sense.
We will also occasionally assume---just because it is true in
inference---that the function $f(\pars)$ is the product of two
functions, one called ``the prior pdf'' and one called ``the
likelihood''.
Again, this usage is colloquial and is only strictly correct in
inference contexts with proper inputs.
That the function $f(\pars)$ can be thought of as a product of a
prior pdf and a likelihood is only necessary for what follows in
the context of advanced sampling techniques like tempering or
nested sampling, both mentioned briefly below (\sectionname~\ref{sec:methods}).
\begin{problem}\label{prob:simple-stats}
Look up (or choose) definitions for the mean, variance, skewness,
and kurtosis of a distribution.
Also look up or compute the analytic values of these four statistics
for a top-hat (uniform) distribution.
Write a computer program that uses some standard package (such as
\project{numpy}\note{There isn't a full citation for this package but there is \cite{numpy}.})
to generate $K$ random numbers
$x$ from a uniform distribution in the interval $0<x<1$.
Now use those $K$ numbers to compute a sampling estimate of the
mean, variance, skewness, and kurtosis (four estimates; look up
definitions as needed).
Make four plots of these four estimates as a function of $1/K$ or
perhaps $\log_2 K$, for $K=4^n$ for $n=1$ up to $n=10$ (that is,
$K=4$, $K=16$, and so on up to $K=1048576$).
Over-plot the analytic answers.
What can you conclude?
\end{problem}
\section{Metropolis--Hastings MCMC}\label{sec:MH}
\nopagebreak
The simplest algorithm for \MCMC\ is the Metropolis--Hastings algorithm (\MH\ \MCMC).\note{%
There are many claims that this algorithm is incorrectly named,
with claims that it should be credited to Enrico Fermi or Stan Ulam.
We don't have any opinions on this matter, but encourage the reader to follow this up.
The original paper is \citet{metropolis}, and there are a few sketchy historical notes
in \citet{geyer}.}
It is so simple that we recommend that any reader of this document
who has not previously implemented the algorithm take a break at the end
of this \sectionname\ and implement it forthwith, in a short piece of computer code,
in the context of some simple problems.\note{%
We request this in a trivial case below in
\problemname~\ref{prob:MH} and the following problems in this
\sectionname.
We make the same request in a less trivial context in our
model-fitting screed (\citealt{fittingaline},
\problemname~6 and 7), where we ask the Reader to implement \MCMC\ for
a useful mixture model.}
The \MH\ \MCMC\ algorithm requires two inputs.
The first is a handle to the function $f(\pars)$ that is the function to be sampled,
such that the algorithm can evaluate $f(\pars)$ for any value of the parameters $\pars$.
In data-analysis contexts,
this function would be the prior $p(\pars)$ times the likelihood $p(\data\given\pars)$
evaluated at the observed data $\data$.
The second input is a handle to a proposal pdf function $q(\pars'\given\pars)$ that can deliver samples,
such that the algorithm can draw a new position $\pars'$ in the parameter space
given an ``old'' position $\pars$.
This second function must meet a symmetry requirement (detailed
balance) we discuss further below.
It permits us to random-walk around the parameter space in a fair way.
The algorithm\note{Technically this is the \emph{Metropolis} algorithm
rather than the Metropolis--Hastings algorithm. The difference is
that in the Metropolis--Hastings algorithm, we permit the proposal
distribution to disobey detailed balance but correct the
accept--reject step to recover detailed balance.} is the
following: We have generated some set of samples, the most recent of
which is $\pars_k$. To generate the next sample $\pars_{k+1}$, do the
following:
\begin{itemize}
\item Draw a proposal $\pars'$ from the proposal pdf $q(\pars'\given\pars_k)$.
\item Draw a random number $0<r<1$ from the uniform distribution.
\item If $f(\pars') / f(\pars_k) > r$ then $\pars_{k+1} \leftarrow \pars'$;
otherwise $\pars_{k+1} \leftarrow \pars_k$.
\end{itemize}
That is, at each step, either a new proposed position in the parameter
space gets accepted into the list of samples or else the previous sample
in the parameter space \emph{gets repeated}.
The algorithm can be iterated a large number $K$ of times to produce $K$ samples.
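The three steps above fit in a few lines of code. The following is a minimal sketch (ours, not any published implementation), assuming \project{numpy}; we work with $\ln f$ rather than $f$ to avoid numerical underflow, use a symmetric Gaussian proposal pdf, and test the sampler on an unnormalized one-dimensional Gaussian $f(\pars) = e^{-\pars^2/2}$:

```python
import numpy as np

def metropolis(lnf, theta0, step, K, seed=42):
    """Metropolis sampler. lnf returns ln f(theta); proposals are Gaussian
    with width `step`, which is symmetric, so detailed balance holds."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lnf_theta = lnf(theta)
    chain = np.empty((K, theta.size))
    for k in range(K):
        # Draw a proposal theta' from q(theta' | theta_k).
        theta_prime = theta + step * rng.standard_normal(theta.size)
        lnf_prime = lnf(theta_prime)
        # Accept if f(theta') / f(theta_k) > r, with r uniform in (0, 1);
        # in log space: ln f(theta') - ln f(theta_k) > ln r.
        if lnf_prime - lnf_theta > np.log(rng.uniform()):
            theta, lnf_theta = theta_prime, lnf_prime
        chain[k] = theta  # on rejection, the previous sample is repeated
    return chain

# Sample the unnormalized Gaussian f(theta) = exp(-theta^2 / 2); the
# corresponding properly normalized pdf is N(0, 1).
chain = metropolis(lambda t: -0.5 * np.sum(t**2), [0.0], step=1.0, K=200000)
```

With a long enough chain, the sample mean and variance approach the values (0 and 1) of the normalized pdf, even though the sampler only ever sees the ill-normalized $f$.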
Why does this algorithm work? The answer is not absolutely
trivial\note{There is a very mathematical discussion in \citet{geyer}
that we do not fully understand. There is a more heuristic answer
on \project{Wikipedia} that we are happier with!},
but there are two components to the argument:
The first is that the Markov process delivers a unique stationary
distribution.
The second is that that stationary distribution is proportional to the
density function $f(\pars)$.
This algorithm---and indeed any \MCMC\ algorithm---produces a
\emph{biased random walk} through parameter space.
It is a random walk for the same reason that it is ``Markov'':
The step it makes to position $\pars_{k+1}$ depends only on the state
of the sampler at position $\pars_k$ (and no previous state).
It is a biased random walk, biased by the acceptance algorithm
involving the ratios of function values; this acceptance rule biases
the random walk such that the amount of time spent in the neighborhood
of location $\pars$ is proportional to $f(\pars)$.
Because of this local (Markov) property, the nearby samples are not
independent; the algorithm only produces fair samples in the limit of
arbitrary run time, and two samples are only independent when they are
sufficiently separated in the chain (more on this below in \sectionname~\ref{sec:convergence}).
The principal user-settable knob in \MH\ \MCMC\ is the proposal
pdf $q(\pars'\given\pars)$.
A typical choice is a multi-variate Gaussian distribution for $\pars'$
centered on $\pars$ with some simple (diagonal, perhaps) variance
tensor.
We will discuss the choice and tuning of this proposal distribution
below (\sectionname~\ref{sec:tuning} and \sectionname~\ref{sec:trouble}).
Importantly---for the algorithm given above to work correctly---the
proposal pdf must satisfy a ``detailed-balance'' condition\note{This
detailed-balance condition in \equationname~(\ref{eq:detailed-balance})
requires implicitly that the function $q(\pars'\given\pars)$ is also
properly normalized. That is, that it integrates (over $\pars'$) to unity
for any setting of $\pars$. That is an extremely technical point, but stands
as a reminder that you don't want to mess with detailed balance casually!};
it must have the property that
\begin{eqnarray}
q(\pars'\given\pars) &=& q(\pars\given\pars')
\label{eq:detailed-balance}\quad ;
\end{eqnarray}
that is, it must be just as easy to go one way in the parameter space
as the other.
You can break this property, if you like, and then adjust the
acceptance condition accordingly (which is truly the Metropolis--Hastings algorithm), but we \emph{do not recommend this}
except under the supervision of a trained professional.\note{Actually,
our own favorite method, \project{emcee}, has a proposal pdf that does
violate detailed balance in this way and has a compensated
acceptance probability. Read the paper \cite{emcee} for more details.}
The reason is: It is one thing to draw samples from $q(x'\given x)$.
It is another thing to correctly write down $q(x'\given x)$.
If you violate detailed balance in $q(x'\given x)$ then you have to be
able to \emph{both} draw from \emph{and} write down $q(x'\given x)$,
and your code is exposed to a whole new set of possible bugs.
If the detailed balance condition frightens you (as it frightens us),
then just stick with pdfs that are symmetric in $\pars'$ and $\pars$,
like the Gaussian, or a centered uniform distribution, and forget about
it.
In what follows, when we discuss tuning of the proposal pdf, we will
often cycle through the parameter blob components and propose changes
in only one dimension or element or component at a time.
That is, at step $k+1$ you might only propose a one-dimensional move
in the $i$th dimension, not in all $D$ dimensions of the parameter
space.
That won't change anything significant; but if you are implementing
this right now you might want to do that.
It will help us tune the proposal pdf (\sectionname~\ref{sec:tuning})
and diagnose problems (\sectionname~\ref{sec:trouble}).
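Such a component-wise proposal can be sketched as follows (a minimal
illustration in Python; the function name, the use of NumPy, and the
symmetric Gaussian step are all our choices, not a prescription):

```python
import numpy as np

def propose_one_component(pars, i, step_size, rng):
    """Propose a move in only the i-th component of the parameter
    vector, using a symmetric one-dimensional Gaussian step."""
    pars_new = np.array(pars, dtype=float)  # copy; never mutate the current state
    pars_new[i] += step_size * rng.standard_normal()
    return pars_new
```

At step $k+1$ you might cycle deterministically through the $D$
dimensions (taking $i = (k+1) \bmod D$) or draw $i$ at random; either
choice keeps the proposal symmetric.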
One note to make here
is that you must \emph{propose}
in the same coordinates (parameterization or transformation of parameters)
as that in which your priors are \emph{specified},
or else multiply in a (possibly nasty) Jacobian.
That is, if your prior is ``flat in $\pars$''
but you find it easier to run your sampler by proposing steps in $\ln \pars$,
then unless you modify your acceptance probability by the Jacobian,
your \emph{real} prior won't be flat in $\pars$.
These think-os can get subtle;
the best way to avoid them
is to write your code and your likelihood function
and your prior pdf function and your proposal distribution
all in precisely the same parameterization.
We will return to this point below in our comments on testing (\sectionname~\ref{sec:trouble}).
The point is that if you \emph{do} change coordinates between
the statement of your priors and the specific variables you
step in, you have to introduce Jacobian factors.
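For concreteness, consider a scalar parameter $x > 0$ with a prior pdf
$p(x)$ specified in $x$, but a sampler that takes its steps in
$u \equiv \ln x$.
The density that must be handed to the sampler in the $u$ coordinate is
\begin{eqnarray}
\tilde{p}(u) &=& p(x)\,\left|\frac{\mathrm{d}x}{\mathrm{d}u}\right|
= p(\mathrm{e}^{u})\,\mathrm{e}^{u}
\quad ;
\end{eqnarray}
a prior flat in $x$ is proportional to $\mathrm{e}^{u}$ in $u$, and if
you drop this Jacobian factor you have silently switched to a prior
flat in $\ln x$.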
As we discuss briefly below (\sectionname~\ref{sec:methods}),
there are many \MCMC\ methods more advanced than M--H.
However, they all share characteristics with M--H
(initialization, tuning, judging convergence),
such that it is very valuable to understand \MH\ well before using anything more advanced.
Furthermore, a scientist new to \MCMC\ benefits enormously from
building, tuning, and using her or his own \MCMC\ software.
One piece of advice we give then,
to the new user of \MCMC,
is to code up, tune, and use a \MH\ \MCMC\ sampler for a scientific project.
A huge amount is learned in doing this,
and it is very often the case that the home-built \MH\ \MCMC\ does everything needed;
you often don't need any more advanced tool.
Only after you have concluded that your home-built \MH\ \MCMC\ sampler
is not suited to your project
(or not properly developed into a versatile software package)
should you download and start to use any professionally developed alternative.
That is, even if---in the end---%
you want to leave the sampling code to the experts and run an industrial-strength code,
it is still valuable to build your own,
given the simplicity of the algorithm,
and given the intuition you gain by doing it (at least once) yourself.\note{%
We also make this point in a previous piece (\citealt{fittingaline});
the ambitious reader will find in that piece a solid example project
for using \MCMC\ in a real context.}
Because many problems involve a huge amount of dynamic range in the
density function $f(\pars)$, and we like to avoid underflows and overflows
(in, say, ratios computed for accept--reject steps), it is often advisable
to work in the (natural) logarithm of the density rather than the density.
When working in logarithmic density,
the accept--reject step would change from a comparison of a random
deviate with a ratio of probabilities to comparison of the log of a
random deviate with a difference of log probabilities.
That is, the accept--reject step becomes:
\begin{itemize}
\item If $\ln f(\pars') - \ln f(\pars_k) > \ln r$ then $\pars_{k+1} \leftarrow \pars'$;
otherwise $\pars_{k+1} \leftarrow \pars_k$.
\end{itemize}
This protects you from underflow, but exposes you to some (negative)
infinities, if you end up taking the logarithm of a zero. It is
important to write your code to be infinity-safe.\note{Most code will
be infinity-safe if written in the form in this paragraph. However,
there can be issues if \emph{both} $f(\pars')$ \emph{and}
$f(\pars_k)$ are negative infinity; in (present-day) Python this
will return a \texttt{NaN} rather than an infinity or zero. That's a
problem and is a case that should get caught (probably way before
the accept--reject step!).}
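Putting these pieces together, a complete \MH\ sampler is only a few
lines of code. The following sketch (in Python with NumPy; the function
and argument names are ours, and the symmetric Gaussian proposal is
hard-coded) works entirely in the log density and never accepts a
proposal with $\ln f(\pars') = -\infty$:

```python
import numpy as np

def metropolis(ln_f, x0, n_steps, step_size=1.0, seed=None):
    """Minimal Metropolis sampler with a symmetric Gaussian proposal,
    working entirely in the log of the (unnormalized) density."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    ln_fx = ln_f(x)
    chain = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x_prime = x + step_size * rng.standard_normal(x.size)
        ln_fx_prime = ln_f(x_prime)
        # accept if ln f(x') - ln f(x_k) > ln r; a proposal with
        # ln f(x') = -inf (zero density) can never win this comparison
        if ln_fx_prime - ln_fx > np.log(rng.uniform()):
            x, ln_fx = x_prime, ln_fx_prime
        chain[k] = x
    return chain
```

For example, \code{metropolis(lambda x: -0.5 * np.sum(x**2), [0.], 10000)}
draws (correlated) samples from a unit Gaussian. Note that if the chain
is initialized where $\ln f = -\infty$, the difference above is
\code{nan} and the sampler sticks forever, which is exactly the
pathological case flagged in the note above.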
\begin{problem}\label{prob:MH}
In your scientific programming language of choice, write a very simple
\MH\ \MCMC\ sampler.
Sample in a single parameter $x$ and give the sampler as its density
function $p(x)$ a Gaussian density with mean 2 and variance 2.
(Note that variance is the \emph{square} of the standard deviation.)
Give the sampler as its proposal distribution $q(x'\given x)$ a Gaussian
pdf for $x'$ with mean $x$ and variance 1.
Initialize the sampler with $x=0$ and run the sampler for more than $10^4$ steps.
Plot the results as a histogram, with the true density over-plotted sensibly.
The resulting plot should look something like \figurename~\ref{fig:MH}.
\end{problem}
\begin{figure}[!htbp]
\begin{center}
\includegraphics[width=0.5\textwidth]{problems/MH}
\end{center}
\caption{Solution to \problemname~\ref{prob:MH}:
\MCMC\ samples from a Gaussian (black) and the true distribution (blue).}
\label{fig:MH}
\end{figure}
\begin{problem}\label{prob:MH2}
Re-do \problemname~\ref{prob:MH} but now with an input density
that is uniform on $3<x<7$ and zero everywhere else.
The plot should look like \figurename~\ref{fig:MH2}.
What change did you have to make to the initialization, and why?
\end{problem}
\begin{figure}[!htbp]
\begin{center}
\includegraphics[width=0.5\textwidth]{problems/MH2}
\end{center}
\caption{Solution to \problemname~\ref{prob:MH2}:
\MCMC\ samples from a uniform distribution (black) and the true distribution
(blue).}
\label{fig:MH2}
\end{figure}
\begin{problem}\label{prob:twod}
Re-do \problemname~\ref{prob:MH} but now with an input density
that is a function of two variables $(x, y)$.
For the density function use two different functions.
\emph{(a)} The first density function is a covariant two-dimensional Gaussian
density with variance tensor
\begin{eqnarray}
V &=& \left[\begin{array}{cc} 2.0 & 1.2 \\ 1.2 & 2.0 \end{array}\right]
\quad.
\end{eqnarray}
\emph{(b)} The second density function is a rectangular top-hat function that
is uniform on the joint constraint $3<x<7$ and $1<y<9$ and zero everywhere else.
For the proposal distribution $q(x', y'\given x, y)$, use a two-dimensional
Gaussian density with mean at $[x, y]$ and variance tensor set to the
two-dimensional identity matrix.
Plot the two one-dimensional histograms and also a two-dimensional scatter
plot for each sampling.
\figurename~\ref{fig:twod-a} shows the expected results for the Gaussian.
Make a similar plot for the top-hat.
\end{problem}
\begin{figure}[!htbp]
\begin{center}
\includegraphics[width=0.7\textwidth]{problems/twod_a}
\end{center}
\caption{Solution to \problemname~\ref{prob:twod}a:
\MCMC\ samples from a two-dimensional Gaussian (scatter plot) and the
one-dimensional marginalized distributions (histograms).}
\label{fig:twod-a}
\end{figure}
\begin{problem}\label{prob:MH-sigma}
Re-do \problemname~\ref{prob:twod}a but with different values for
the variance of the proposal distribution $q(x'\given x)$.
What happens when you go to very extreme values (for instance
$10^{-1}$ or $10^2$)?
\end{problem}
\begin{problem}\label{prob:prop-mean}
Why, in all the previous problems, did we give the proposal
distributions $q(x'\given x)$ a mean of $x$?
What would be bad if we hadn't done that?
Re-do \problemname~\ref{prob:twod}a with a proposal $q(x'\given x)$ with a
stupidly shifted mean of $x + 2$ and see what happens.
\emph{Bonus points:} Modify the acceptance--rejection criterion to deal with
the messed-up $q(x'\given x)$ and show that everything works once
again.
\end{problem}
\section{Likelihoods and priors}\label{sec:likelihood}
\nopagebreak
\MCMC\ is used to obtain samples $\pars_k$ from a pdf $p(\pars)$,
given a badly normalized all-positive function $f(\pars)$
that is different from the pdf by an unknown factor $Z$.
In the context of data analysis,
\MCMC\ is usually being used to obtain samples $\pars_k$
that are \emph{parameter values} $\pars$ for a probabilistic model of some data.
That is, \MCMC\ is sampling a pdf \emph{for the parameters}.
Ideally, if you are using \MCMC\ for inference, your code should pass
to the \MCMC\ function or routine a probability function that is the
product of a prior pdf and a likelihood.
As we noted above---in terms of implementation---it is usually advisable to work in the
logarithm of the density function, so the function input to the \MCMC\ code
would be called something like \code{ln_f()}.
This function \code{ln_f()} internally would compute and return the
sum of a log prior pdf \code{ln_prior()} and a log likelihood function
\code{ln_likelihood()}.
If you are using ``flat'' (improper) priors, the \code{ln_prior()}
function can just return a zero no matter what the parameters.
If it is flat with bounds (that is, proper), the \code{ln_prior()}
should check the bounds and return \code{-Inf} values when the parameter
vector is out of bounds.
The pseudo-code for the \code{ln_f()} function should look something
like this:
\begin{verbatim}
def ln_f(pars, data):
    x = ln_prior(pars)
    if not is_finite(x):
        return -Inf
    return x + ln_likelihood(data, pars)
\end{verbatim}
This pseudo-code ensures that you only compute the likelihood when the
parameters are within the prior bounds; it assumes implicitly that the
prior pdf is easier to compute than the likelihood.
If you find yourself in the opposite regime, adjust accordingly.
Of course the above pseudo-code presumes that when you perform the
accept--reject step of the \MCMC\ method, the programming language handles
properly the \code{-Inf} values.\note{If you are working in a language
that doesn't have an \code{Inf} or doesn't evaluate comparisons correctly
when the \code{Inf} appears, you might have to write some case code, and
have your \code{ln_f} function return a value \emph{and} some kind of flag
which indicates ``zero probability'' or $f(\pars)=0$.}
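As a concrete illustration of this pattern, here is a runnable Python
version with a made-up proper prior (flat on $0 < x < 10$) and a
made-up Gaussian likelihood; every name and number here is our
invention, for illustration only:

```python
import numpy as np

def ln_prior(pars):
    """Proper flat prior on 0 < x < 10 (invented for this example)."""
    x, = pars
    if 0.0 < x < 10.0:
        return -np.log(10.0)  # properly normalized: density 1/10 in bounds
    return -np.inf

def ln_likelihood(data, pars):
    """Gaussian likelihood with unit variance (invented for this example)."""
    x, = pars
    return -0.5 * np.sum((np.asarray(data) - x) ** 2)

def ln_f(pars, data):
    x = ln_prior(pars)
    if not np.isfinite(x):
        return -np.inf  # out of bounds: skip the (expensive) likelihood
    return x + ln_likelihood(data, pars)
```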
\MCMC\ \emph{cannot sample a likelihood} (which is a probability \emph{for the data} given
parameters).\note{Well, technically, \MCMC\ \emph{can} be used to sample a likelihood function,
but the samples would be samples of possible \emph{data} not possible \emph{parameters}.
The purist might say that even this is not sampling a likelihood function,
because you should only call it a ``likelihood function'' in contexts in which you are
treating the data as fixed (at the true data values) and the parameters as variable.
In this context, the likelihood function is \emph{not} a pdf for anything,
so it can't be sampled.}
Despite this, in many cases,
data analysts believe they are sampling the likelihood.
This is because (we presume) they have put a likelihood function
(or log-likelihood function)
in as the input to the \MCMC\ code, where the probability function
(or log-probability function) should go.
Then is it the likelihood that is being sampled?
No, not really;
it is a posterior probability that is directly proportional to the likelihood function.
That is, it is a posterior probability for some implicit (and improper) ``flat'' priors.
It should be outside of the scope of this document to note here that
it is a good idea to have proper priors.\note{This point---that you
should be using proper priors---is just basic Bayesian good
practice, often violated but never for good reasons.}
Proper priors obey the integral constraint (\ref{eq:proper}),
with $Z$ finite.
It isn't a requirement that priors be proper for the posterior to
be proper, so many investigators and projects violate this rule.
This is outside the current scope, except for the following point:
It is a great functional test of your sampler and your data analysis
setup to take the likelihood function to a very small power (much less
than one) or multiply the log-likelihood by a very small number (much
less than one) and check that the sampler samples, properly and
correctly, the prior pdf.
This test is only possible if the prior pdf is proper.
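One way to implement this test is to introduce a temperature-like
factor $\beta$ multiplying the log likelihood (the name $\beta$ and
this code organization are our choices): at $\beta = 0$ the sampler
should reproduce the prior pdf exactly, and nothing dramatic should
happen as $\beta \rightarrow 1$.

```python
import numpy as np

def ln_f_tempered(pars, data, beta, ln_prior, ln_likelihood):
    """Prior times likelihood**beta, in the log.  beta = 1 is the real
    posterior; beta = 0 (or tiny) should reproduce the prior exactly."""
    x = ln_prior(pars)
    if not np.isfinite(x):
        return -np.inf
    return x + beta * ln_likelihood(data, pars)
```

Run your sampler on \code{ln_f_tempered} with $\beta$ set to (say)
$10^{-6}$ and check, with histograms, that you recover the prior; this
only makes sense if the prior is proper.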
\begin{problem}\label{prob:improper}
Run your \MH\ \MCMC\ sampler from \problemname~\ref{prob:MH}, but now
with a density function that is precisely unity \emph{everywhere}
(that is, at any input value of $x$ it returns unity). That is, an
improper function (as discussed in \sectionname~\ref{sec:likelihood}).
Run it for longer and longer and plot the chain value $x$ as a
function of timestep. What happens?
\end{problem}
\begin{problem}\label{prob:fittingaline}
For a real-world inference problem, read enough of
\citet{fittingaline} to understand and execute Exercise~6 in that
document.
\end{problem}
\begin{problem}\label{prob:logprior}
Modify the sampler you wrote in \problemname~\ref{prob:MH} to take
steps not in $x$ but in $\ln x$. That is, replace the Gaussian
proposal distribution $q(x'\given x)$ with a Gaussian distribution in
$\ln x$ $q(\ln x'\given \ln x)$, but make no other changes.
By doing this, you are no longer sampling the Gaussian $p(x)$ that you were in
\problemname~\ref{prob:MH}.
How do your answers change? What distribution are you sampling now?
Compute the analytic function that you have sampled from---this will no longer
be the same $p(x)$---and over-plot it on your histogram.
\end{problem}
\begin{figure}[!htbp]
\begin{center}
\includegraphics[width=0.5\textwidth]{problems/logprior}
\end{center}
\caption{Solution to \problemname~\ref{prob:logprior}:
The black histogram shows the \MCMC\ samples and the blue curve is the analytic
target density.}
\label{fig:logprior}
\end{figure}
\Needspace{7\baselineskip}
\section{Autocorrelation \& convergence}\label{sec:convergence}%
\nopagebreak%
A key question for an \MCMC\ operator---\emph{the} key question in some
sense---is \emph{how long to run} to be sure of having reliable
results.
It is disappointing and annoying to many that there is no extremely
simple and reliable answer to this question.\note{There is very good
coverage of most of the points in this \sectionname, and more, in a
set of lecture notes by Patrick Lam up at
\url{http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf}.}
The reason that there is no simple answer is that you can't really
ever know that you have sampled the full posterior pdf, and the reason
for \emph{that} is that if you \emph{could} know that, you would also
be able to solve the famously difficult discrete optimization
problem.\note{Famously, there is ``no free lunch''
\citep{nofreelunch}: You can't find the global optimum of a general
optimization problem without exhaustive search of all possibilities.
This is very closely related---somehow---to the difficulty of sampling.\label{note:nofreelunch}}
Qualitatively, you can consider the scenario where you have two modes---two
separated regions of substantial total probability---in the posterior pdf that
are both important but separated by a large region of low probability.
If you are using a simple sampler, you could sample one of these modes
very well but the sampler will take effectively infinite time to find
the other mode.
In most real problems, the situation is much worse, with an unknown number of
modes, and knowing that you have a complete and representative sampling
is effectively impossible.
In this context,
it is \emph{simply not fair} to require an \MCMC\ run to have fully and completely
sampled the posterior pdf,
at least not in any provable sense.
The question of whether a sampling is definitely converged to a representative sampling
of the posterior pdf is actually \emph{outside the domain of science},
because the domain of science is the domain of questions answerable by scientists.
That is, unless you have some bounds on the support and smoothness of
the posterior pdf,
you can \emph{never know} that you have correctly sampled the posterior pdf,
despite many statements in the literature to the contrary.\note{%
For example, you can (almost) never say that your chains are
definitely converged, or that you have the posterior pdf correct to
some given level of accuracy. The reason is related to the above-mentioned ``no free
lunch'' theorem of discrete optimization (see \notename~\ref{note:nofreelunch}): In most problems there is no
way you could have searched your parameter space finely enough to know
this. There are some exceptions of course. In one kind of exception,
your problem is convex or Gaussian, and you can analytically show that
there is exactly one mode in the density and where it is. Of course
in these cases you rarely need \MCMC\ to solve your problems! In
another kind of exception, you can say something about the finite
width or shape of the modes of the likelihood function (and prior
pdf). For example, sometimes the Cram\'er--Rao bound tells you that
modes of the density must be smoother than some finite amount. Then
an exhaustive search of parameter space followed by \MCMC\ can in
principle have provable properties. But, as we say, these situations
are rare.}
The upshot of this is that we don't and can't require any kind of absolute convergence;
indeed, it would be impossible even to test for it.
In this pragmatic situation---the Real World, as it is called---we have to rely on heuristics.
Heuristically, you have sampled long enough when you can see
that the (or each) walker has traversed the high-probability parts of
the parameter space many times in the length of the chain.
Or, equivalently, you have sampled long enough when the first half of
the chain shows very much the same posterior pdf morphology as the
second half of the chain, or indeed as any substantial subset.
Or, relatedly, you have sampled long enough when different walkers initialized
differently and run with different random number seeds create posterior inferences
that are substantially the same.
The above heuristics can be made more precise in terms of the amount
of deviation one expects between the means and variances, say, of two
disjoint subsets of the chain.
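A crude version of this subset comparison, for a single-parameter
chain, might look like the following (the choice of statistics and the
tolerance are arbitrary, and passing this check is necessary but
nowhere near sufficient for convergence):

```python
import numpy as np

def halves_look_similar(chain, tol=0.1):
    """Compare the mean and standard deviation of the first and second
    halves of a one-parameter chain, in units of the overall scatter."""
    x = np.asarray(chain, dtype=float)
    first, second = np.array_split(x, 2)
    scale = x.std() + 1e-300  # guard against a constant chain
    return bool(abs(first.mean() - second.mean()) < tol * scale
                and abs(first.std() - second.std()) < tol * scale)
```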
The premier tool for making this heuristic precise, however, is to
look at the \emph{``integrated autocorrelation time''} of the chain.
In general, when one has a sequence $(\pars_1, \pars_2, \pars_3, \cdots)$
generated by a Markov Process in the $\pars$ space, nearby points in
the sequence will be similar, but sufficiently distant points will not
``know about'' one another;
the autocorrelation function measures this.
More formally, as discussed in \sectionname~\ref{sec:sampling}, the goal of
\MCMC\ sampling is to compute integrals of the form given in
\equationname~(\ref{eq:the-real-integral}) using the Monte Carlo approximation
in \equationname~(\ref{eq:the-real-samples}).
The Monte Carlo error introduced by this approximation is proportional to
$\sqrt{\tau_\mathrm{int}/N}$ where $\tau_\mathrm{int}$ is the
integrated autocorrelation time and $N$ is the total number of
samples of $p(\pars)$.
In other words, $\tau_\mathrm{int}$ is the number of steps required for the
chain to produce an \emph{independent} sample.
A sampler with a smaller integrated autocorrelation time is better; you have
to do fewer $f(\pars)$ calls per independent sample, and you have to
run less time to get accurate sampling-based integral estimates.
A sampler that takes an independent sample every time would have an
autocorrelation time of unity, which is the best possible value; this
optimal sampling is only possible for problems where the sampling is
analytic (for example if the posterior is perfectly Gaussian with
known mean and variance).
In general, the best sampler will be different for different problems, and we
will (below; \sectionname~\ref{sec:methods}) tune the samplers we have to do
the best on the problems we have; this tuning will also be problem-specific.
There are many heuristic bases on which different samplers might be
compared or tuned, but fundamentally it is lower autocorrelation time
that separates good samplers from bad ones and is the ultimate basis
on which we compare performance.\note{%
We sometimes see \MCMC\ methods compared according to burn-in times,
which, for one, depends strongly on initialization, and, for two,
depends strongly on tuning and dynamics. Similarly we often see
comparisons in terms of acceptance ratio. In principle the only
question is how precise are one's inferences given a certain amount
of computation. This is set by the autocorrelation time and (pretty
much) nothing else, for reasonably converged chains. In the real
world, of course, one must add to the CPU time the investigator time
spent tuning the method (and thinking about initialization) but
there is never (to our knowledge) a condition in which burn-in time
is the dominant consideration in choosing an \MCMC\ method.}
We won't go into details about how to estimate $\tau_\mathrm{int}$, but it is
notoriously difficult: It is a two-point statistic, and two-point statistics
are much harder to estimate than one-point statistics.\note{See: All of
cosmology!}
If you are not a hard-core user of \MCMC, and if all you want is
heuristic indicators of convergence, then what you should take from this
section is that the autocorrelation time is involved in variance estimates,
but that it is hard to estimate.
If you want to think about autocorrelation estimation, more detailed
references can be found elsewhere.\note{%
A canonical reference is a set of lecture notes by Alan Sokal
\citep{sokal}.
Another good reference is a blog post by one of us (DFM) that can be found
at \url{http://dfm.io/posts/autocorr}.}
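To make the difficulty concrete, here is a deliberately naive
estimator for a single-parameter chain, which truncates the sum over
the autocorrelation function at a fixed, user-chosen lag; the real art
(covered in the references in the notes) is choosing that truncation
adaptively, and a bad choice biases the answer badly:

```python
import numpy as np

def naive_tau_int(chain, window):
    """Naive integrated autocorrelation time of a one-parameter chain:
    tau = 1 + 2 * sum of the normalized autocorrelation function,
    truncated (arbitrarily!) at `window` lags."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[x.size - 1:] / np.sum(x * x)
    return 1.0 + 2.0 * np.sum(acf[1:window])
```

For a chain of independent samples this returns a value near unity;
for a correlated chain it returns (roughly) the number of steps per
independent sample, provided \code{window} is a few times larger than
the answer but much smaller than the chain length.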
Besides estimating the integrated autocorrelation time, another simple and
sensible test of convergence is the Gelman--Rubin
diagnostic\note{\citet{gelmanrubin}.}, which compares the variance (in one
parameter, or your most important parameter, or all parameters) \emph{within}
a chain to the variance \emph{across} chains.
This requires running multiple chains, and looking at the empirical
variance of the parameter away from its mean within each individual
chain, and comparing it to the variance in the mean of that parameter
across chains, inflated to be a mean per-sample variance.
What Gelman and Rubin do with these variances specifically is
sensible, but the important question of convergence is whether, as the
chains get longer, these two variances asymptotically reach stable
values, and that those two values agree.
The Gelman--Rubin diagnostic is related to the autocorrelation time,
in that it will only deliver success when the chains are much longer
than an autocorrelation time.
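A minimal implementation of the statistic, for one parameter and
several equal-length chains, might look like this (a sketch following
the standard construction, not a substitute for a vetted
implementation; thresholds like $\hat{R} < 1.1$ are conventions, not
theorems):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman--Rubin statistic (R-hat) for m chains of n samples each,
    in one parameter.  Values near 1 are necessary (not sufficient!)
    for convergence."""
    c = np.asarray(chains, dtype=float)
    m, n = c.shape
    within = c.var(axis=1, ddof=1).mean()         # mean within-chain variance
    between = n * c.mean(axis=1).var(ddof=1)      # n * variance of chain means
    var_hat = (n - 1) / n * within + between / n  # pooled variance estimate
    return float(np.sqrt(var_hat / within))
```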
Finally one last point about convergence:
Since all convergence tests are fundamentally heuristic, it is useful
to just make the heuristic visualization of the samples $\pars_k$ as a
function of $k$, in the order they were generated.
The chain is likely to be converged only if the random-walk process
crossed the domain of $\pars$ fully many times during the \MCMC\ run.
Often some parameter directions are much worse than others; it is
worth looking at the chain in all parameter directions.
\begin{problem}\label{prob:convergence}
Re-do \problemname~\ref{prob:MH} but now look at convergence:
Plot the $x$ chain as a function of timestep. Also split the chain
into four contiguous segments (the first, second, third, and fourth
quarters of the chain). In each of these four, compute the empirical