Gelman’s Bayesian Data Analysis (3rd ed.)
Summaries, thoughts, and comments on Prof. Andrew Gelman’s Bayesian Data Analysis (3rd ed.)
Previously in my math note-taking journey, I had trouble splitting attention between recording what I wanted from the textbook (essentially everything) and actually distilling my thoughts. A lot of my previous notes just look like I copied 80% of a book over to a markdown file, which is a somewhat accurate description, and it wasn’t helpful for learning or review. On this page, I hope to read the chapters and then summarize the key ideas. Big calculations and fine details will go on separate pages or be left to the book itself.
Ch. 1 Probability and inference
“The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical data analysis.”
A common example: Bayesians can talk directly about the probability that a parameter lies in some interval, while a frequentist confidence interval has a more convoluted interpretation: it describes the long-run behavior of the procedure under repeated sampling, not the probability that the parameter falls in the particular interval at hand. A Bayesian credible interval is a legitimate probability statement.
Another thought that I like is expressing uncertainty about model parameters in inference. If we’re trying to model something, how do we express the fact that we are uncertain about the parameters of the model itself? The frequentist viewpoint doesn’t allow us to include this information at all, since parameters are not treated as random.
The rest of the chapter consists of examples, key ideas, and notation.
Basic ideas:
- Bayes’ theorem: the posterior density is proportional to the likelihood times the prior, $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$ (the unnormalized posterior density).
- Before the data are considered, the distribution of the unknown but observable $y$ is the prior predictive distribution, $p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta$.
- After observing data $y$, we can predict an unknown observable $\tilde{y}$ from the same process. The distribution of $\tilde{y}$ is called the posterior predictive distribution, $p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta$.
Using Bayes’ rule means the data $y$ affect the posterior only through $p(y \mid \theta)$. For fixed $y$, this is a function of $\theta$ and is called the likelihood function.
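To make the predictive distributions concrete, here is a tiny beta-binomial sketch in Python (my own toy example, not from the book): conjugacy gives the posterior in closed form, and the posterior predictive distribution is simulated by drawing $\theta$ and then $\tilde{y}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta(1, 1) prior on theta; binomial likelihood: y successes in n trials.
n, y = 20, 13
a_prior, b_prior = 1.0, 1.0

# Conjugacy: the posterior is Beta(a_prior + y, b_prior + n - y).
a_post, b_post = a_prior + y, b_prior + n - y

# Posterior predictive for a future experiment with m trials:
# draw theta from the posterior, then draw y_tilde given theta.
m = 10
theta_draws = rng.beta(a_post, b_post, size=10_000)
y_tilde = rng.binomial(m, theta_draws)

print(theta_draws.mean())                                    # posterior mean of theta
print(np.bincount(y_tilde, minlength=m + 1) / y_tilde.size)  # posterior predictive pmf
```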
Ch. 2 Single-parameter models
First examine statistical models where only a single scalar parameter is to be estimated. Some intuitive notions about the posterior distribution being less variable than the prior distribution are expressed:
$$\mathrm{E}(\theta) = \mathrm{E}\big(\mathrm{E}(\theta \mid y)\big),$$
the “tower rule,” and more interestingly,
$$\operatorname{var}(\theta) = \mathrm{E}\big(\operatorname{var}(\theta \mid y)\big) + \operatorname{var}\big(\mathrm{E}(\theta \mid y)\big),$$
which implies the posterior variance is on average smaller than the prior variance. We see this behavior as a general feature in Bayesian inference: the posterior distribution is centered at a point that represents a “compromise” between the prior information and the data, where the compromise is controlled to a greater extent by the data as the sample size increases.
Reporting posterior results can include information about the mean, median, and mode. But posterior uncertainty, quantiles, and intervals are important, too. For simple models, we can directly compute posterior intervals from the CDFs. For others, intervals can be computed using computer simulations.
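A minimal sketch of the simulation route (my example, reusing the beta-binomial posterior from the sketch above): draw from the posterior and take empirical quantiles of the draws.

```python
import numpy as np

rng = np.random.default_rng(1)

# Posterior draws for theta; Beta(14, 8) is the example posterior from above.
theta_draws = rng.beta(14, 8, size=10_000)

# 95% central posterior interval from empirical quantiles of the draws.
lo, hi = np.quantile(theta_draws, [0.025, 0.975])
print(f"posterior median: {np.median(theta_draws):.3f}")
print(f"95% central interval: ({lo:.3f}, {hi:.3f})")
```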
Informative priors
Constructing prior distributions: two basic interpretations.
- population interpretation: the prior represents a population of possible parameter values, from which the current $\theta$ of interest has been drawn
- state of knowledge interpretation: we must express our knowledge and uncertainty about $\theta$ as if its value could be thought of as a random realization from the prior distribution.
People may like to use conjugate prior distributions because they make results easy to understand and they simplify computations. Nonconjugate prior distributions can make interpretations of posterior inferences less transparent and computation more difficult.
They work through the normal distribution with known variance, for which the conjugate prior on the mean $\theta$ is itself normal. Interestingly, the posterior mean is expressed as a weighted average of the prior mean and the observed value $y$, with weights proportional to the precisions.
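Writing this out for a single observation (the standard conjugate result): with prior $\theta \sim \mathrm{N}(\mu_0, \tau_0^2)$ and $y \mid \theta \sim \mathrm{N}(\theta, \sigma^2)$ with $\sigma^2$ known,

$$\theta \mid y \sim \mathrm{N}(\mu_1, \tau_1^2), \qquad \mu_1 = \frac{\frac{1}{\tau_0^2}\,\mu_0 + \frac{1}{\sigma^2}\,y}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}}, \qquad \frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{1}{\sigma^2}.$$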
The rest of the chapter is devoted to more conjugate priors for common distributions (Normal, Poisson, Gamma, Beta, etc.)
Noninformative priors
When prior distributions have no population basis, they can be difficult to construct. Maybe we want a prior density that is flat, diffuse, or noninformative so that the data can “speak for themselves.”
A related idea is the weakly informative prior distribution, which contains enough information to “regularize” the posterior distribution without attempting to fully capture knowledge about the parameter.
One of the problems that can arise with noninformative priors is impropriety: the prior density may not integrate to a finite value, so it cannot be normalized to 1.
Interestingly, improper prior distributions can lead to proper posterior distributions. If we simply define the unnormalized posterior density $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$ and continue, we can come out with a proper posterior density in the sense that
$$\int p(y \mid \theta)\, p(\theta)\, d\theta$$
is finite for all $y$.
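A standard example (writing out the usual textbook case for myself): with $y_1, \dots, y_n \sim \mathrm{N}(\theta, \sigma^2)$, $\sigma^2$ known, the improper flat prior $p(\theta) \propto 1$ still yields a proper posterior,

$$\theta \mid y \sim \mathrm{N}\!\left(\bar{y},\ \sigma^2/n\right),$$

because the unnormalized posterior has a finite integral for every $y$.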
Other issues with noninformative priors include:
- Searching for a prior that is always vague seems misguided: if the likelihood is truly dominant, then the choice among a range of relatively flat prior densities cannot matter, and establishing one of them as “the” noninformative prior only encourages its automatic, unexamined use.
- A density that is flat or uniform in one parameterization will generally not be flat in another.
- Eventual problems when averaging over a set of competing models with improper prior distributions
When the number of parameters in a problem is large, pure noninformative priors are generally discarded in favor of hierarchical models.
Jeffreys’ prior
One approach to noninformative prior distributions was introduced by Jeffreys, which emphasizes invariance to one-to-one transformations / reparametrizations of the parameter. Representing such a transformation as $\phi = h(\theta)$, we want the prior densities to be equivalent:
$$p(\phi) = p(\theta) \left| \frac{d\theta}{d\phi} \right| = p(\theta)\, |h'(\theta)|^{-1}.$$
Jeffreys’ principle leads to the noninformative prior density being
$$p(\theta) \propto [J(\theta)]^{1/2},$$
for $J(\theta)$ the Fisher information for $\theta$:
$$J(\theta) = \mathrm{E}\!\left[\left(\frac{d \log p(y \mid \theta)}{d\theta}\right)^{2} \,\middle|\, \theta\right] = -\,\mathrm{E}\!\left[\frac{d^{2} \log p(y \mid \theta)}{d\theta^{2}} \,\middle|\, \theta\right].$$
Jeffreys’ prior is invariant to parameterization.
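A worked case (the binomial model, which I believe is also the book’s example): for $y \sim \mathrm{Bin}(n, \theta)$, the Fisher information and the resulting Jeffreys’ prior are

$$J(\theta) = \frac{n}{\theta(1-\theta)}, \qquad p(\theta) \propto [J(\theta)]^{1/2} \propto \theta^{-1/2}(1-\theta)^{-1/2},$$

i.e., the $\mathrm{Beta}(\tfrac{1}{2}, \tfrac{1}{2})$ density.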
Weakly informative priors
A prior is weakly informative if it is proper but set up so that the information it provides is intentionally weaker than whatever prior knowledge is actually available.
Two principles for setting up weakly informative priors are:
- Start with some version of a noninformative prior distribution and then add enough information so that inferences are constrained to be reasonable.
- Start with a strong, highly informative prior and broaden it to account for uncertainty in one’s prior beliefs and in the applicability of any historically based prior distribution to new data.
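As a toy illustration of the first principle (my own example; the $\mathrm{N}(0, 2.5^2)$ scale on the logit is just one plausible weakly informative choice, not a prescription from this chapter): with $y = 8$ successes in $n = 8$ trials, a flat prior on $\theta = \operatorname{logit}(p)$ pushes the posterior mode off to infinity, while the weakly informative prior keeps inferences in a reasonable range.

```python
import numpy as np
from scipy.stats import norm

# Toy example: y = 8 successes in n = 8 trials, parameterized by theta = logit(p).
n, y = 8, 8
theta_grid = np.linspace(-10, 10, 2001)
p = 1 / (1 + np.exp(-theta_grid))

# Log-likelihood on the grid; with a flat prior this increases without bound in theta.
log_lik = y * np.log(p) + (n - y) * np.log1p(-p)

# Weakly informative N(0, 2.5^2) prior on theta regularizes the posterior.
log_post = log_lik + norm.logpdf(theta_grid, loc=0.0, scale=2.5)
post = np.exp(log_post - log_post.max())
post /= post.sum()

print("posterior mean of logit(p):", np.sum(theta_grid * post))
print("posterior mean of p:      ", np.sum(p * post))
```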
Ch. 3 Multiparameter models
Now we are interested in finding the marginal posterior distribution of the particular parameters of interest. This requires finding the joint posterior distribution and integrating over the remaining parameters (sometimes called nuisance parameters). We rarely evaluate these integrals explicitly.
A bunch of basic examples. The general recipe: write the likelihood and posterior densities; form a crude estimate of the parameters for use as a starting point; draw simulations of the parameters from the posterior distribution; and use the sampled draws to compute the posterior distribution of any functions of interest.
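A sketch of that recipe for the normal model with the standard noninformative prior $p(\mu, \sigma^2) \propto 1/\sigma^2$ (if I remember the chapter right, this is the factorization it works through: draw $\sigma^2 \mid y$, then $\mu \mid \sigma^2, y$; the data below are simulated placeholders).

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder data; the same recipe applies to any sample.
y = rng.normal(5.0, 2.0, size=30)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

n_draws = 10_000
# sigma^2 | y  ~  scaled-Inv-chi^2(n - 1, s^2), drawn as (n - 1) s^2 / chi^2_{n-1}.
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)
# mu | sigma^2, y  ~  N(ybar, sigma^2 / n)
mu = rng.normal(ybar, np.sqrt(sigma2 / n))

# With posterior draws in hand, any function of (mu, sigma) is easy to summarize,
# e.g. the coefficient of variation sigma / mu.
cv = np.sqrt(sigma2) / mu
print(np.quantile(mu, [0.025, 0.5, 0.975]))
print(np.quantile(cv, [0.025, 0.5, 0.975]))
```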
Ch. 4 Asymptotics and connections to non-Bayesian approaches
Asymptotic theory: as the sample size increases, the influence of the prior distribution on posterior inferences decreases. Also of interest is the extent to which Bayesian inferences based on noninformative priors agree with standard non-Bayesian approaches.
Suppose the data are modeled by a parametric family $p(y \mid \theta)$ with a prior distribution $p(\theta)$. If the true data distribution is included in the parametric family, then, in addition to asymptotic normality, the property of consistency holds: the posterior distribution converges to a point mass at the true parameter value $\theta_0$.
When the true distribution $f(y)$ is not included in the parametric family, there is no longer a true value $\theta_0$, but the posterior still converges to the value of $\theta$ that makes the model distribution closest to the true distribution in a technical sense involving Kullback–Leibler divergence.
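Spelling out the technical sense (the standard definition): the limiting value $\theta_0$ minimizes the Kullback–Leibler divergence from the true distribution $f$ to the model,

$$\mathrm{KL}\big(f \,\big\|\, p(\cdot \mid \theta)\big) = \int f(y)\, \log\frac{f(y)}{p(y \mid \theta)}\, dy = \mathrm{E}_f\!\left[\log\frac{f(y)}{p(y \mid \theta)}\right].$$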
If the posterior distribution is unimodal and roughly symmetric, it can be convenient to approximate it by a normal distribution. Hence the log of the posterior density is approximated by a quadratic function of $\theta$:
$$\log p(\theta \mid y) \approx \log p(\hat{\theta} \mid y) + \frac{1}{2}\,(\theta - \hat{\theta})^{T} \left[\frac{d^{2}}{d\theta^{2}} \log p(\theta \mid y)\right]_{\theta = \hat{\theta}} (\theta - \hat{\theta}),$$
a Taylor expansion around the posterior mode $\hat{\theta}$, where the linear term is zero because the log density has zero derivative at its mode. In many cases, convergence to normality of the posterior distribution for a parameter can be dramatically improved by a transformation.
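A small numerical sketch of this mode-plus-curvature approximation on a toy binomial example of my own (flat prior on $\theta = \operatorname{logit}(p)$ for simplicity): find the posterior mode, estimate the second derivative there by finite differences, and use its inverse as the approximate variance.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy data: y successes in n trials, with a flat prior on theta = logit(p).
n, y = 10, 7

def neg_log_post(theta):
    p = 1 / (1 + np.exp(-theta))
    return -(y * np.log(p) + (n - y) * np.log1p(-p))

# Posterior mode.
theta_hat = minimize_scalar(neg_log_post).x

# Curvature at the mode via a central finite difference of the second derivative.
h = 1e-4
second_deriv = (neg_log_post(theta_hat + h) - 2 * neg_log_post(theta_hat)
                + neg_log_post(theta_hat - h)) / h**2

# Normal approximation: theta | y is roughly N(theta_hat, 1 / second_deriv).
approx_sd = 1 / np.sqrt(second_deriv)
print(f"theta_hat = {theta_hat:.3f}, approximate sd = {approx_sd:.3f}")
```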