\documentclass[a4paper,11pt,english]{article}
\usepackage{babel}
\usepackage[latin1]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{times}
\usepackage{amsmath}
\usepackage{microtype}
\usepackage{url}
\urlstyle{same}
\usepackage[bookmarks=false]{hyperref}
\hypersetup{%
bookmarksopen=true,
bookmarksnumbered=true,
pdftitle={Bayesian data analysis},
pdfsubject={Reading instructions},
pdfauthor={Aki Vehtari},
pdfkeywords={Bayesian probability theory, Bayesian
inference, Bayesian models, model analysis,
computational methods, Markov chain Monte Carlo},
pdfstartview={FitH -32768},
colorlinks=true,
linkcolor=black,
citecolor=black,
filecolor=black,
urlcolor=black
}
% if not draft, smaller printable area makes the paper more readable
\topmargin -4mm
\oddsidemargin 0mm
\textheight 225mm
\textwidth 160mm
%\parskip=\baselineskip
\DeclareMathOperator{\E}{E}
\DeclareMathOperator{\Var}{Var}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\Sd}{Sd}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\Bin}{Bin}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Invchi2}{Inv-\chi^2}
\DeclareMathOperator{\NInvchi2}{N-Inv-\chi^2}
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\N}{N}
\DeclareMathOperator{\U}{U}
\DeclareMathOperator{\tr}{tr}
%\DeclareMathOperator{\Pr}{Pr}
\DeclareMathOperator{\trace}{trace}
\pagestyle{empty}
\begin{document}
\thispagestyle{empty}
\section*{Bayesian data analysis -- 4}
\smallskip
{\bf Aki Vehtari}
\smallskip
\subsection*{Chapter 4}
Outline of Chapter 4
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item 4.1 Normal approximation (Laplace's method)
\item 4.2 Large-sample theory
\item 4.3 Counterexamples
\item 4.4 Frequency evaluation (not part of the course, but interesting)
\item 4.5 Other statistical methods (not part of the course, but interesting)
\end{list}
The normal approximation is often used as part of posterior
computation (more about this in Ch 13, which is not part of the
course).

Demos
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item esim4\_1: Bioassay example
\end{list}
Find all the terms and symbols listed below. When reading the chapter,
write down questions related to things that are unclear to you or
things you think might be unclear to others.
\begin{list}{$\bullet$}{\parsep=0pt\itemsep=2pt}
\item sample size
\item asymptotic theory
\item normal approximation
\item quadratic function
\item Taylor series expansion
\item observed information
\item positive definite
\item why $\log \sigma$?
\item Jacobian of the transformation
\item point estimates and standard errors
\item lower-dimensional approximations
\item large-sample theory
\item asymptotic normality
\item consistency
\item underidentified
\item nonidentified
\item number of parameters increasing with sample size
\item aliasing
\item unbounded likelihood
\item improper posterior
\item edge of parameter space
\item tails of distribution
\end{list}
\subsection*{Normal approximation}
Other Gaussian posterior approximations are discussed in Chapter 13.
For example, variational and expectation propagation methods improve
the approximation by fitting globally instead of using only the
curvature at the mode. The Gaussian approximation at the mode is often
also called the Laplace method, as Laplace was the first to use it.

Several researchers have provided partial proofs that the posterior
converges towards a Gaussian distribution. In the mid-20th century Le
Cam was the first to provide a rigorous proof.
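As a reminder of what the approximation at the mode looks like (a
brief sketch in the book's notation, with $\hat{\theta}$ denoting the
posterior mode and $I(\hat{\theta})$ the observed information of the
next section evaluated at the mode), a quadratic Taylor series
expansion of the log posterior around $\hat{\theta}$ gives
\begin{align*}
\log p(\theta|y) \approx \log p(\hat{\theta}|y)
- \frac{1}{2}(\theta-\hat{\theta})^T I(\hat{\theta})(\theta-\hat{\theta}),
\end{align*}
where the linear term vanishes because the derivative of the log
posterior is zero at the mode, and hence
\begin{align*}
p(\theta|y) \approx \N\big(\hat{\theta},\, [I(\hat{\theta})]^{-1}\big).
\end{align*}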
\subsection*{Observed information}
When $n\rightarrow\infty$, the posterior distribution approaches a
Gaussian distribution. As the log density of the Gaussian is a
quadratic function, the higher derivatives of the log posterior
approach zero. The curvature at the mode describes the information
only in the case of asymptotic normality. In the case of the Gaussian
distribution, the curvature also describes the width of the
Gaussian. The information matrix is a {\em precision matrix}, whose
inverse is a covariance matrix.
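For a single parameter the observed information is the negative
second derivative of the log posterior (a sketch; in the multivariate
case the second derivative is replaced by the Hessian matrix):
\begin{align*}
I(\theta) = -\frac{d^2}{d\theta^2}\log p(\theta|y),
\end{align*}
and evaluated at the mode $\hat{\theta}$ it is the precision of the
approximating Gaussian, so $[I(\hat{\theta})]^{-1}$ is its covariance.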
\subsection*{Aliasing}
In Finnish: valetoisto.

Aliasing is a special case of under-identifiability, where the
likelihood repeats at separate points of the parameter space. That is,
the likelihood takes exactly the same values and has the same shape,
although possibly mirrored or otherwise projected. For example, the
mixture model
\begin{equation*}
p(y_i|\mu_1,\mu_2,\sigma_1^2,\sigma_2^2,\lambda)=\lambda\N(\mu_1,\sigma_1^2)+(1-\lambda)\N(\mu_2,\sigma_2^2)
\end{equation*}
has two Gaussian components, each with its own mean and variance. With
probability $\lambda$ the observation comes from $\N(\mu_1,\sigma_1^2)$
and with probability $1-\lambda$ from $\N(\mu_2,\sigma_2^2)$. This kind
of model could be used, for example, for Newcomb's data, so that one
of the Gaussian components would model the faulty measurements. The
model does not state which of the components, 1 or 2, models the good
measurements and which models the faulty ones. Thus it is possible to
interchange the values of $(\mu_1,\mu_2)$ and $(\sigma_1^2,\sigma_2^2)$
and replace $\lambda$ with $(1-\lambda)$ to get an equivalent
model. The posterior distribution then has two modes which are mirror
images of each other. When $n\rightarrow\infty$ the modes get
narrower, but the posterior does not converge to a single point.
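Written out, the aliasing symmetry of this example is the identity
(a small illustration, not from the book's text)
\begin{equation*}
p(y_i|\mu_1,\mu_2,\sigma_1^2,\sigma_2^2,\lambda)
= p(y_i|\mu_2,\mu_1,\sigma_2^2,\sigma_1^2,1-\lambda),
\end{equation*}
which holds for every $y_i$, so the data alone cannot distinguish the
two relabelled parameter vectors.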
If we can integrate over the whole posterior, aliasing is not a
problem. However, aliasing makes approximate inference more
difficult.
\subsection*{Frequency property vs. frequentist}
Bayesians can evaluate the frequency properties of Bayesian estimates
without being frequentists. For Bayesians the starting point is Bayes'
rule and decision theory. Bayesians care more about efficiency than
unbiasedness. For frequentists the starting point is to find an
estimator with desired frequency properties, and quite often
unbiasedness is chosen as the first restriction.
% Here the frequency properties of Bayesian inference were evaluated
% according to Bayesian theory, so the interpretation of probability
% remains Bayesian. In frequentist theory, estimators and estimates
% have only frequency properties (see the earlier comment on the
% difference between confidence and posterior intervals). In Bayesian
% theory one can also study frequency properties, but in addition the
% probabilities of individual events. In the limit, the two theories
% with different starting points can give similar results, even though
% the basis and interpretation of the results remain different.
\subsection*{Transformation of variables}
See p.~21 for an explanation of how to derive densities for
transformed variables. This explains, for example, why the uniform
prior $p(\log(\sigma^2))\propto 1$ for $\log(\sigma^2)$ corresponds to
the prior $p(\sigma^2)\propto\sigma^{-2}$ for $\sigma^{2}$.
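As a short worked version of that transformation (a sketch, writing
$\phi=\log(\sigma^2)$ for the transformed parameter): with the uniform
prior $p(\phi)\propto 1$, the change-of-variables formula with the
Jacobian $|d\phi/d\sigma^2|$ gives
\begin{align*}
p(\sigma^2) = p(\phi)\left|\frac{d\phi}{d\sigma^2}\right|
\propto \left|\frac{d}{d\sigma^2}\log(\sigma^2)\right| = \frac{1}{\sigma^2},
\end{align*}
so the flat prior on the log scale corresponds to
$p(\sigma^2)\propto\sigma^{-2}$ on the variance scale.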
\subsection*{On differentiation}
Here is a reminder of how to differentiate with respect to
$g(\theta)$. For example,
\begin{align*}
\frac{d}{d\log\sigma}\sigma^{-2}=-2 \sigma^{-2}
\end{align*}
is easily obtained by setting $z=\log\sigma$, so that
$\sigma^{-2}=\exp(z)^{-2}$, and then
\begin{align*}
\frac{d}{dz}\exp(z)^{-2}=-2\exp(z)^{-3}\exp(z)=-2\exp(z)^{-2}=-2\sigma^{-2}.
\end{align*}
\end{document}
%%% Local Variables:
%%% TeX-PDF-mode: t
%%% TeX-master: t
%%% End: