---
title: "Markov chain Monte Carlo"
subtitle: "Adapted from notes by Murali Haran, Penn State University"
date: "May / June 2018, 14th Penn State Astrostatistics Summer School"
output: beamer_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Goal: Describe and implement a simple example of MCMC
Monte Carlo sampling methods, which we describe below, don't always work.
The purpose of this activity is twofold:
* Illustrate how MCMC algorithms are easy to implement,
at least in principle, in situations
where classical Monte Carlo methods do not work.
* Provide a glimpse of practical MCMC implementation issues.
## Basic Monte Carlo Methods
Suppose we want to estimate $m= E_fg(X)$, the
expectation of a function $g(x)$ with respect to the probability
distribution $f$.
If $m$ is analytically intractable, a Monte Carlo estimate of $m$ can be
obtained by
* simulating $N$ pseudo-random values, say $X_1, X_2, \ldots, X_N$, from the distribution $f$
* taking the average of $g(X_1), g(X_2),\ldots,g(X_N)$ to
estimate $m$.
Typically, the Law of Large Numbers implies that
the estimate converges to the true expectation $m$ as $N$ gets large.
## A toy example
Suppose X is a standard normal random variable and we wish to calculate
$P(-1<X<1)$ using Monte Carlo.
```{r toy1}
xs <- rnorm(10000) # Simulate 10000 standard normal draws
xcount <- sum((xs > -1) & (xs<1)) # count draws in (-1,1)
xcount/10000 # Monte Carlo estimate
pnorm(1) - pnorm(-1) # Compare to exact value
```
## Importance sampling: Another MC technique for estimating expectations
* Goal: estimate $m=E_f g(X)$.
* Strategy: Sample $Y_1, \ldots, Y_N$ from density $q$ (NOT $f$).
* Notice:
\[
E_f g(X) = E_q \left[ g(Y) \frac{f(Y)}{q(Y)} \right].
\]
* Conclusion: Estimate $m$ by
\[
\frac1N \sum_{i=1}^N W_i g(Y_i)
\mbox{\quad where \quad} W_i = \frac{f(Y_i)}{q(Y_i)}.
\]
When normalizing constants for $f$ or $q$ are unknown, or for numerical
stability, the weights $W_i$ may be scaled so that their mean is one. (NB: This
should be done only when $f$ and $q$ have the same support, for in that
case $E_q W_i = 1$.)
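As a concrete sketch (the target $f$, function $g$, and proposal $q$ here are illustrative choices, not taken from the slides), self-normalized importance sampling looks like this:
```{r isNormalized}
# Sketch: self-normalized importance sampling for m = E_f g(X),
# with illustrative choices f = N(0,1), g(x) = x^2, q = N(0, 2^2)
set.seed(1)
y <- rnorm(10000, sd=2)         # Sample Y_1, ..., Y_N from q
W <- dnorm(y) / dnorm(y, sd=2)  # Weights W_i = f(Y_i)/q(Y_i)
W <- W / mean(W)                # Rescale so the weights have mean one
mean(W * y^2)                   # Estimate of E_f X^2 = 1
```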
## Toy Example 2: Calculate $P(X>4)$ for standard normal $X$
First, ordinary Monte Carlo:
```{r toy2a}
x <- rnorm(10000) # Simulate 10000 standard normal draws
mean(x>4) # Estimate probability empirically
```
Next, importance sampling using $q(y)=\exp(-(y-4))$ for $y>4$.
```{r toy2b}
y <- rexp(10000)+4 # Sample from q
mean(dnorm(y)/exp(4-y)) # All y are >4
```
Here is the exact value:
```{r toy2c}
1-pnorm(4) # Compare it to exact answer
```
## Importance sampling is powerful in a number of situations
* If expectations with respect to several different
distributions (say $f_1, \ldots, f_p$) are of
interest, all can be estimated from a single set of samples.
* For rare event probabilities, ordinary Monte Carlo could take a
huge number of samples for accurate estimates. In such cases,
selecting $q$ wisely can give more precise estimates using fewer samples.
* In principle, "selecting $q$ wisely" means trying to make it roughly
proportional to $f\times g$.
Discussions of importance sampling in astronomical Bayesian
computation appear in papers by
Lewis & Bridle (http://xxx.lanl.gov/abs/astro-ph/0205436)
and
Trotta (http://arxiv.org/abs/0803.4089)
for cosmological parameter estimation and
Ford (http://xxx.lanl.gov/abs/astro-ph/0512634)
for extrasolar planet modeling.
## MCMC can help when $f$ is intractable
* Producing pseudo-random draws from $f$ is often infeasible.
* An alternative is to use a Metropolis-Hastings algorithm to produce a
Markov chain $X_1, \ldots, X_N$ where the $X_i$ are **dependent**
draws that are **approximately** from $f$.
* As before, the mean of
$g(X_1), \ldots, g(X_N)$ provides an estimate of $m$.
The
Metropolis-Hastings algorithm is very general and hence very
useful.
The following example shows how it can be used for
inference in a situation where it would otherwise be impossible
to compute desired expectations.
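As a minimal sketch of the idea (the target here is an unnormalized standard normal density, chosen purely for illustration), a random-walk Metropolis sampler takes only a few lines of R. Note that only the *ratio* of target densities is needed, so the normalizing constant never has to be computed:
```{r mhSketch}
# Sketch: random-walk Metropolis for an unnormalized target density
logf <- function(x) -x^2 / 2    # log of unnormalized N(0,1) density
set.seed(1)
N <- 5000
X <- numeric(N)                 # X[1] = 0 is the starting value
for (i in 2:N) {
  prop <- X[i-1] + rnorm(1)     # symmetric (random-walk) proposal
  X[i] <- if (log(runif(1)) < logf(prop) - logf(X[i-1]))
            prop else X[i-1]    # accept or keep the current value
}
mean(X^2)                       # Estimate of E(X^2) = 1
```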
## Brief review of statistical inference
* We assume that data $Y$ are distributed according to density $h_\theta(y)$,
where $\theta$ is the (unknown) object of our scientific interest.
* We observe $Y$; suppose we denote it $Y_{\rm obs}$.
* The likelihood is $L(\theta) = h_\theta(Y_{\rm obs})$.
* To find the MLE, we'd maximize $L(\theta)$ (or, in practice, $\log L(\theta)$).
**In Bayesian inference, we assume a prior density $p(\theta)$ for $\theta$.**
The prior may be based on astrophysical insights (e.g. no
source can have negative brightness), past astronomical observation
(e.g. stars have masses between 0.08 and 150 solar masses), and/or purely
statistical considerations (e.g. uniform or Jeffreys priors) when it
is difficult to obtain good prior information.
**The object of inference is the posterior density $\pi(\theta | Y_{\rm obs})$.**
## The posterior is proportional to likelihood times prior
* That is, $\pi(\theta | Y_{\rm obs}) \propto L(\theta)\,p(\theta) = h_\theta(Y_{\rm obs})\,
p(\theta)$.
* Sometimes the proportionality constant is difficult to calculate.
* MCMC algorithms avoid computation of the proportionality constant (!)
Although the MCMC methods discussed here are often associated with
Bayesian computation, this need not be the case.
Essentially, any time samples from a complicated distribution are needed,
MCMC may be useful.
## Example: X-ray emission time series
Our dataset is from the Chandra Orion Ultradeep Project (COUP),
found in the file 'COUP551_rates.dat' in the following path:
```{r mcmcpath}
MCMCpath <- paste("http://www.stat.psu.edu/~dhunter/",
"astrostatistics/mcmc/", sep="")
```
This is a time series of X-ray emission from a flaring young star in the Orion
Nebula Cluster.
More information on the Chandra Flares dataset is available at
http://astrostatistics.psu.edu/datasets/Chandra_flares.html.
* The raw data arrive approximately according to a Poisson
process
* They consist of the individual photon arrival times (in seconds) and
their energies (in keV).
* The processed data we consider here are counts of events
obtained by grouping the events into evenly-spaced time bins of
width 10,000 seconds.
## Assumption: Piecewise homogeneous Poisson process with changepoint
Our goal is to identify the changepoint and
estimate the intensities of the Poisson process before and after the
change point.
Here's a Bayesian model for the problem from
_Bayes and Empirical Bayes Methods for Data Analysis_
(Carlin and Louis, 2000, Chpt. 5):
* Let $Y_t$ be the number of occurrences of some event at
time $t$.
* The process is observed for times $1$ through $n$.
* After an unknown time $k$, the mean event count changes, becoming
either higher or lower than before.
In other words,
\[
\begin{split}
Y_t |k,\theta,\lambda \sim & \mbox{Poisson}(\theta) \mbox{ for } t=1,\dots,k;\\
Y_t |k,\theta,\lambda \sim & \mbox{Poisson}(\lambda) \mbox{ for } t=k+1,\dots,n.
\end{split}
\]
We assume that the $Y_t$ are mutually independent conditional on the parameters.
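To make the model concrete, here is a sketch that simulates one synthetic dataset from it (the values of $n$, $k$, $\theta$, and $\lambda$ are arbitrary, chosen for illustration only):
```{r simChpt}
# Sketch: simulate data from the piecewise Poisson changepoint model
set.seed(1)
n <- 40; k <- 10; theta <- 10; lambda <- 3  # illustrative values
Ysim <- c(rpois(k, theta),        # counts before the changepoint
          rpois(n - k, lambda))   # counts after the changepoint
ts.plot(Ysim, main="Simulated changepoint data")
```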
## Priors for piecewise homogeneous Poisson process model
We assume
$Y_t |k,\theta,\lambda \sim \begin{cases}
\mbox{Poisson}(\theta) &\mbox{ for } t=1,\dots,k; \\
\mbox{Poisson}(\lambda) &\mbox{ for } t=k+1,\dots,n.
\end{cases}$
We assume the following prior distributions on the parameters:
\[
\begin{split}
\theta|b_1 \sim & \mbox{Gamma}(0.5,b_1)\qquad \mbox{(pdf=} g_1(\theta|b_1))\\
\lambda|b_2 \sim & \mbox{Gamma}(0.5,b_2)\qquad \mbox{(pdf=} g_2(\lambda|b_2))\\
b_1 \sim & \mbox{IG}(1,1)\:\:\:\:\:\:\:\:\: \mbox{(pdf=} h_1(b_1))\\
b_2 \sim & \mbox{IG}(1,1)\:\:\:\:\:\:\:\:\: \mbox{(pdf=} h_2(b_2))\\
k \sim & \mbox{Uniform}(1,\dots,n)\qquad \mbox{(pmf =} u(k))
\end{split}
\]
Here, we assume $k$, $\theta$, and $\lambda$ are conditionally independent and the
gamma$(\alpha, \beta)$ and inverse gamma$(\alpha, \beta)$ densities are
\begin{eqnarray*}
\frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha-1} e^{-x/\beta} \mbox{\ and\ }
\frac{e^{-1/(\beta x)}} {\Gamma(\alpha) \beta^{\alpha} x^{\alpha + 1}},
\mbox{\ respectively.}
\end{eqnarray*}
## Read data and produce a simple time series plot
```{r timeSeries, out.width="3in"}
chptdat <- read.table(paste(MCMCpath,
"COUP551_rates.dat", sep=""), skip=1)
Y <- chptdat[,2] # store data in Y
ts.plot(Y,main="Time series plot of change point data")
```
\vspace{-10ex}
\begin{columns}
\begin{column}{3.5in}
\end{column}
\begin{column}{1.5in}
The plot suggests a changepoint around 10.
\end{column}
\end{columns}
## Setting up the MCMC algorithm
Inference is based on the
posterior distribution, obtained up to a
multiplicative constant by multiplying likelihood
times prior:
\[
\begin{split}
f(k,\theta,\lambda,b_1,b_2| {\bf Y}) \ \propto\ &\prod_{i=1}^k f_1(Y_i|\theta,\lambda,k) \prod_{i=k+1}^n f_2(Y_i|\theta,\lambda,k)\\
& \times g_1(\theta|b_1) g_2(\lambda|b_2) h_1(b_1) h_2(b_2) u(k)\\
\ =\ & \prod_{i=1}^k \frac{\theta^{Y_i} e^{-\theta}}{Y_i !} \prod_{i=k+1}^n \frac{\lambda^{Y_i} e^{-\lambda}}{Y_i !} \\
& \times \frac{1}{\Gamma(0.5)b_1^{0.5}} \theta^{-0.5} e^{-\theta/b_1} \times \frac{1}{\Gamma(0.5)b_2^{0.5}} \lambda^{-0.5} e^{-\lambda/b_2}\\
& \times \frac{1}{b_1^2} e^{-1/b_1} \frac{1}{b_2^2} e^{-1/b_2} \times \frac{1}{n}.
\end{split}
\]
Our goal is to simulate
multiple draws from this posterior distribution.
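A sketch of the corresponding log posterior in R (up to an additive constant), evaluating the displayed expression term by term; the function name is ours, and for simplicity it assumes $1 \le k < n$:
```{r logPost}
# Sketch: log posterior up to an additive constant, assuming 1 <= k < n
logpost <- function(k, theta, lambda, b1, b2, Y) {
  n <- length(Y)
  sum(dpois(Y[1:k], theta, log=TRUE)) +          # Poisson(theta) piece
    sum(dpois(Y[(k+1):n], lambda, log=TRUE)) +   # Poisson(lambda) piece
    dgamma(theta, shape=0.5, scale=b1, log=TRUE) +
    dgamma(lambda, shape=0.5, scale=b2, log=TRUE) +
    (-2 * log(b1) - 1 / b1) +                    # log IG(1,1) pdf of b1
    (-2 * log(b2) - 1 / b2) +                    # log IG(1,1) pdf of b2
    (-log(n))                                    # log uniform pmf of k
}
set.seed(1)
logpost(10, 10, 3, 1, 1, c(rpois(10, 10), rpois(30, 3)))
```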
## Full conditional distributions for $\theta$ and $\lambda$
\[
\begin{split}
f(k,\theta,\lambda,b_1,b_2| {\bf Y}) \ \propto\
& \prod_{i=1}^k \frac{\theta^{Y_i} e^{-\theta}}{Y_i !} \prod_{i=k+1}^n \frac{\lambda^{Y_i} e^{-\lambda}}{Y_i !} \\
& \times \frac{1}{\Gamma(0.5)b_1^{0.5}} \theta^{-0.5} e^{-\theta/b_1} \times \frac{1}{\Gamma(0.5)b_2^{0.5}} \lambda^{-0.5} e^{-\lambda/b_2}\\
& \times \frac{1}{b_1^2} e^{-1/b_1} \frac{1}{b_2^2} e^{-1/b_2} \times \frac{1}{n}
\end{split}
\]
gives
\[
\begin{split}
f(\theta|k,\lambda,b_1,b_2,{\bf Y})
& \propto \theta^{\sum_{i=1}^k Y_i-0.5} e^{-\theta(k+ 1/b_1)} \\
& \propto \mbox{Gamma}\left(\sum_{i=1}^k Y_i + 0.5, \frac{b_1}{k b_1+1}\right),\\
f(\lambda|k,\theta,b_1,b_2,{\bf Y})
& \propto \mbox{Gamma}\left(\sum_{i=k+1}^n Y_i + 0.5, \frac{b_2}{(n-k) b_2+1}\right).
\end{split}
\]
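Both conditionals are standard gamma distributions, so a Gibbs step draws from them directly with `rgamma`. A sketch, using synthetic data and illustrative values for the conditioning quantities (in the real sampler these come from the current chain state):
```{r gibbsThetaLambda}
# Sketch: Gibbs updates for theta and lambda from their gamma
# full conditionals; k, b1, b2, and the data are illustrative.
set.seed(1)
Yex <- c(rpois(10, 10), rpois(30, 3)); nex <- length(Yex)
k <- 10; b1 <- 1; b2 <- 1
theta  <- rgamma(1, shape=sum(Yex[1:k]) + 0.5,
                 scale=b1 / (k * b1 + 1))
lambda <- rgamma(1, shape=sum(Yex[(k+1):nex]) + 0.5,
                 scale=b2 / ((nex - k) * b2 + 1))
c(theta, lambda)
```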
## Full conditional distribution for $k$
\[
\begin{split}
f(k,\theta,\lambda,b_1,b_2| {\bf Y}) \ \propto\
& \prod_{i=1}^k \frac{\theta^{Y_i} e^{-\theta}}{Y_i !} \prod_{i=k+1}^n \frac{\lambda^{Y_i} e^{-\lambda}}{Y_i !} \\
& \times \frac{1}{\Gamma(0.5)b_1^{0.5}} \theta^{-0.5} e^{-\theta/b_1} \times \frac{1}{\Gamma(0.5)b_2^{0.5}} \lambda^{-0.5} e^{-\lambda/b_2}\\
& \times \frac{1}{b_1^2} e^{-1/b_1} \frac{1}{b_2^2} e^{-1/b_2} \times \frac{1}{n}
\end{split}
\]
gives
\[
\begin{split}
f(k|\theta,\lambda,b_1,b_2,{\bf Y}) & \propto \prod_{i=1}^k \frac{\theta^{Y_i} e^{-\theta}}{Y_i !} \prod_{i=k+1}^n \frac{\lambda^{Y_i} e^{-\lambda}}{Y_i !}\\
& \propto \theta^{\sum_{i=1}^k Y_i} \lambda ^{\sum_{i=k+1}^n Y_i} e^{-k \theta - (n-k) \lambda}.
\end{split}
\]
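Since this conditional is a discrete distribution on finitely many values, $k$ could in principle be drawn directly by evaluating the unnormalized log pmf at each candidate (the provided sampler instead uses a Metropolis-Hastings update for $k$). A sketch with illustrative values, restricted to $k < n$ for simplicity:
```{r drawK}
# Sketch: direct draw of k from its discrete full conditional
set.seed(1)
Yex <- c(rpois(10, 10), rpois(30, 3)); nex <- length(Yex)
theta <- 10; lambda <- 3                 # illustrative values
logpmf <- sapply(1:(nex - 1), function(k)
  sum(Yex[1:k]) * log(theta) + sum(Yex[(k+1):nex]) * log(lambda) -
    k * theta - (nex - k) * lambda)
p <- exp(logpmf - max(logpmf))           # subtract max for stability
knew <- sample(1:(nex - 1), 1, prob = p) # sample() normalizes prob
knew
```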
## Full conditional distributions for $b_1$ and $b_2$
\[
\begin{split}
f(k,\theta,\lambda,b_1,b_2| {\bf Y}) \ \propto\
& \prod_{i=1}^k \frac{\theta^{Y_i} e^{-\theta}}{Y_i !} \prod_{i=k+1}^n \frac{\lambda^{Y_i} e^{-\lambda}}{Y_i !} \\
& \times \frac{1}{\Gamma(0.5)b_1^{0.5}} \theta^{-0.5} e^{-\theta/b_1} \times \frac{1}{\Gamma(0.5)b_2^{0.5}} \lambda^{-0.5} e^{-\lambda/b_2}\\
& \times \frac{1}{b_1^2} e^{-1/b_1} \frac{1}{b_2^2} e^{-1/b_2} \times \frac{1}{n}
\end{split}
\]
gives
\[
f(b_1 | k,\theta,\lambda,b_2, {\bf Y})
\propto b_1^{-2.5} e^{-(1+\theta)/b_1} \propto IG(1.5,1/(\theta+1)).
\]
and
\[
f(b_2 | k,\theta,\lambda,b_1, {\bf Y})
\propto b_2^{-2.5} e^{-(1+\lambda)/b_2} \propto IG(1.5,1/(\lambda+1)).
\]
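Under the inverse gamma parameterization used here, $b \sim IG(\alpha,\beta)$ implies $1/b \sim$ Gamma$(\alpha)$ with scale $\beta$, so these updates can be drawn with `rgamma`; the $\theta$ and $\lambda$ values in this sketch are illustrative:
```{r drawB}
# Sketch: draw b1, b2 from their IG(1.5, .) full conditionals,
# using the fact that 1/b ~ Gamma(shape = 1.5, scale = beta)
set.seed(1)
theta <- 10; lambda <- 3                 # illustrative values
b1 <- 1 / rgamma(1, shape=1.5, scale=1 / (theta + 1))
b2 <- 1 / rgamma(1, shape=1.5, scale=1 / (lambda + 1))
c(b1, b2)
```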
## Setting up an MCMC algorithm
Many MCMC algorithms cycle through each univariate full conditional in
sequence, creating a Markov chain whose stationary distribution is the
desired posterior distribution.
We have provided code to do this in R:
```{r MCMCcode}
source(paste(MCMCpath, "MCMCchpt.R", sep=""))
```
Run the MCMC algorithm, with or without fixing $k=10$:
```{r MCMCcodeRun}
mchain <- mhsampler(dat=Y, MCMCiterations=5000)
mchain2 <- mhsampler(dat=Y, MCMCiterations=5000, kfixed=TRUE)
```
## MCMC output analysis
* We have a series of 5-dimensional vectors.
* This approx. posterior sample can be used in multiple ways.
* E.g., since $\theta$ is the first dimension, $E(\theta)$ is estimated by
```{r Etheta}
mean(mchain[1,])
```
* To get estimates of the means and medians for all parameters:
```{r meansAndMedians}
rbind(apply(mchain,1,mean), # compute row means
apply(mchain,1,median)) # compute row medians
```
## More MCMC output analysis
Using the mchain object,
we can obtain marginal posterior density estimates for $\theta$ or $\lambda$:
```{r density, out.width='.49\\linewidth', echo=FALSE}
plot(density(mchain[1,]),main="smoothed density plot for theta posterior")
plot(density(mchain[2,]),main="smoothed density plot for lambda posterior")
```
We can similarly analyze the mchain2 object, in which all $k$ values
(in the 3rd row) are fixed at 10.
## More MCMC output analysis
Here is a posterior histogram for $k$ using mchain:
```{r kposterior,echo=FALSE}
hist(mchain[3,],main="histogram for k posterior", nclass=20)
```
## Mean estimates as a function of iteration
The `estvssamp` function in the 'batchmeans' package depicts this.
```{r batchmeans, echo=FALSE, message=FALSE}
library(batchmeans) # Load the package (different from installing it)
```
```{r estvssamp}
estvssamp(mchain[3,])
```
## Assessing behavior of the chain
* We would like to assess whether our Markov chain is moving around
quickly enough to produce good estimates, a property often
called 'good mixing'.
* This is hard to do rigorously, but some tests are possible.
* An example: Estimates of autocorrelation for $k$ and $b_1$
```{r acfplots, out.width='.49\\linewidth', echo=FALSE}
acf(mchain[3,],main="acf plot for k")
acf(mchain[4,],main="acf plot for b1")
```
## Strong autocorrelation, MCMC mixing and efficiency
* Using 'acf' on each parameter in turn shows that only the $k$ parameter
exhibits strong autocorrelation.
* This isn't a big problem in this case; it is easy to run the sampler for a long time. This may not be true in all cases.
* The Metropolis-Hastings acceptance rate for $k$ is under 10%, so a different
proposal might help the chain mix better.
* Carefully constructed
proposals can have a major impact on the efficiency of an MCMC
algorithm.
## Choosing an MCMC starting value
* In general, any value we
believe to be reasonable under the posterior distribution will
suffice.
* Sometimes much is made about running the chain for a certain number of
"burnin" iterations before any sampling is done.
* In theory, longer burnin
periods will cause the chain to "forget" its starting value so that the
influence of this value will be lessened. While this sounds good, in
practice burnin is probably not necessary as long as the initial value is
plausible.
* See http://users.stat.umn.edu/~geyer/mcmc/burn.html for an amusing rant
entitled "Burn-in is unnecessary"
from one of the world's authorities on MCMC.
## Assessing accuracy and determining chain length
For classical i.i.d. Monte Carlo samplers,
* Calculating standard errors for sample means is easy using the usual formula
(SD estimate divided by the square root of $N$).
* Chain length is the same thing as sample size.
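For instance, a quick sketch with i.i.d. standard normal draws:
```{r iidSE}
# Sketch: standard error of an i.i.d. Monte Carlo sample mean
set.seed(1)
x <- rnorm(10000)
c(estimate = mean(x),
  se = sd(x) / sqrt(length(x)))  # roughly 1/100 here
```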
## Assessing accuracy and determining chain length
For MCMC,
* Since
Markov chains produce dependent draws,
computing precise Monte Carlo
standard errors for such samplers is a difficult problem in
general.
* The algorithm produces draws that are only
asymptotically from the correct distribution.
* All the draws we see
after a finite number of iterations are therefore only approximately
from the correct distribution.
* Determining how long we have to run
the chain before we feel sufficiently confident that the MCMC
algorithm has produced reasonably accurate draws from the
distribution is therefore a very difficult problem.
* Most rigorous
solutions are too specific or tailored towards relatively simple
situations while more general approaches tend to be heuristic.
## Standard error estimates for MCMC output
The 'batchmeans' package allows for this:
```{r bmFunction}
apply(mchain, 1, function(a) bm(a)$est)
apply(mchain, 1, function(a) bm(a)$se)
```
Are these standard errors acceptable?
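The batch means idea itself is simple; here is a hand-rolled sketch with a fixed number of batches (the real `bm` function chooses batch sizes more carefully):
```{r bmSketch}
# Sketch: batch means standard error. Split the chain into batches;
# the spread of the batch averages reflects the chain's autocorrelation.
batchSE <- function(x, nbatches = 30) {
  m <- floor(length(x) / nbatches)              # batch length
  bmeans <- sapply(1:nbatches,
                   function(j) mean(x[((j - 1) * m + 1):(j * m)]))
  sd(bmeans) / sqrt(nbatches)                   # SE of the overall mean
}
set.seed(1)
batchSE(rnorm(3000))   # close to sd/sqrt(N) for independent draws
```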
## Length of MCMC chain
Often MCMC users do not run their simulations long enough.
* There is a vast literature on different proposals for dealing with
how long to run the chain.
* They are all heuristics
at best.
* One method that is fairly simple, theoretically justified
in some cases and seems to work reasonably well in practice is as
follows: run the MCMC algorithm and periodically
compute Monte Carlo standard errors. Once the Monte Carlo standard
errors are below some (user-defined) threshold, stop the
simulation.
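A sketch of this fixed-width stopping rule, illustrated with independent normal draws standing in for MCMC output (in practice the standard error would come from something like `bm`):
```{r fixedWidth}
# Sketch: extend the simulation until the Monte Carlo standard
# error of the mean falls below a user-defined threshold.
set.seed(1)
draws <- numeric(0); threshold <- 0.02
repeat {
  draws <- c(draws, rnorm(1000))          # in practice: more MCMC draws
  se <- sd(draws) / sqrt(length(draws))   # in practice: bm(draws)$se
  if (se < threshold) break
}
c(length(draws), se)
```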
## Making changes to the model
* For some modifications, only minor changes to the code are needed.
For instance, if the inverse gamma priors on $b_1$ and $b_2$
were replaced by Gamma(0.01, 100) priors, we would only have to
change the lines in the code corresponding to the updates of
$b_1$ and $b_2$.
* An obvious modification to this model would be to allow for more than
one change point. It is possible to treat the number of changepoints as
unknown.
The posterior distribution is then a mixture over
distributions of varying dimensions, which requires an advanced
version of the Metropolis-Hastings algorithm known as
**reversible-jump Metropolis-Hastings**
due to Peter Green (_Biometrika_, 1995).
## Additional resources
* The 'coda' and 'boa' packages in R implement many well-known output
analysis techniques.
* Charlie Geyer's 'mcmc'
package in R is another free resource.
* JAGS and Stan are popular platforms for implementing MCMC samplers.
* There is a vast literature on various aspects of MCMC. Murali Haran
has provided a list of references at
http://sites.stat.psu.edu/~mharan/MCMCtut/suppl.html
as a starting point.