21  Chapter 20: Bayesian Inference

This chapter collects and organizes the Bayesian ideas used throughout point estimation, hypothesis testing, interval estimation, and decision theory. The main message is that once we know the posterior distribution, we can derive point estimates, tests, credible intervals, and optimal decisions from one common framework.

NoteTopics

Bayesian model ingredients; prior, likelihood, posterior, and marginal likelihood; Bayes estimators; posterior mean, median, mode, and MAP; conjugate priors; beta-binomial, gamma-Poisson, and normal-normal models; Bayes risk; Bayesian tests; credible intervals; highest posterior density regions; loss-function interpretation; practice problems and solutions.

22 Overview

This section collects the Bayesian material that appeared across point estimation, evaluating estimators, hypothesis testing, evaluating tests, interval estimation, and evaluating intervals.

Bayesian inference gives a unified method for learning from data. Instead of treating the parameter as an unknown but fixed constant, the Bayesian approach represents uncertainty about the parameter using a probability distribution. Data update this distribution through Bayes’ rule.

TipKey idea

Main message. Bayesian inference rests on the rule \[\text{posterior} \propto \text{likelihood} \times \text{prior}.\] Once the posterior distribution is known, point estimation, hypothesis testing, and interval estimation can all be derived from it.

The core idea is simple but powerful. Before observing data, we describe prior information by a prior distribution. After observing data, we update the prior into a posterior distribution. All Bayesian inference is then based on this posterior distribution.

23 Bayesian Model Ingredients

This section introduces the basic objects of Bayesian inference: likelihood, prior, marginal likelihood, and posterior.

Suppose the observed data are \[X=(X_1,\ldots,X_n), \qquad x=(x_1,\ldots,x_n),\] with sampling density or mass function \[f(x\mid \theta).\] The parameter \(\theta\) is unknown. In the Bayesian approach, \(\theta\) is assigned a probability distribution.

NoteDefinition

Definition 1 (Prior distribution). The prior distribution \(\pi(\theta)\) describes our uncertainty or belief about \(\theta\) before observing the data.

NoteDefinition

Definition 2 (Likelihood). After observing \(X=x\), the likelihood function is \[L(\theta\mid x)=f(x\mid \theta).\] It measures how compatible each parameter value \(\theta\) is with the observed data.

NoteDefinition

Definition 3 (Marginal distribution). The marginal distribution or prior predictive distribution of the data is \[m(x)=\int f(x\mid \theta)\pi(\theta)\,d\theta.\] It is the normalizing constant that makes the posterior integrate to one.

NoteDefinition

Definition 4 (Posterior distribution). The posterior distribution of \(\theta\) given \(X=x\) is \[\pi(\theta\mid x)=\frac{f(x\mid \theta)\pi(\theta)}{m(x)} =\frac{f(x\mid \theta)\pi(\theta)}{\int f(x\mid \theta')\pi(\theta')\,d\theta'}.\]

TipKey idea

Bayesian updating. The denominator \(m(x)\) does not depend on \(\theta\). Therefore, for many calculations, we write \[\pi(\theta\mid x)\propto f(x\mid \theta)\pi(\theta).\] That is, the posterior is proportional to likelihood times prior.

TipExample

Example 5 (Bayesian ingredients for coin tossing). Suppose a coin has unknown probability \(p\) of heads. We toss the coin \(n\) times and observe \(x\) heads. The likelihood is \[f(x\mid p)=\binom{n}{x}p^x(1-p)^{n-x}, \qquad 0<p<1.\] If the prior is \(p\sim\operatorname{Beta}(\alpha,\beta)\), then \[\pi(p)=\frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1}.\] Find the posterior distribution.

CautionSolution

Using Bayes’ rule, \[\pi(p\mid x)\propto \binom{n}{x}p^x(1-p)^{n-x}\cdot p^{\alpha-1}(1-p)^{\beta-1}.\] Ignoring constants that do not depend on \(p\), \[\pi(p\mid x)\propto p^{\alpha+x-1}(1-p)^{\beta+n-x-1}.\] This is the kernel of a beta distribution, so \[p\mid X=x\sim \operatorname{Beta}(\alpha+x,\beta+n-x).\]
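As a rough numerical check of this update, the sketch below normalizes likelihood times prior on a grid and compares the resulting posterior mean with the closed-form Beta mean. The function names, the grid size, and the sample values (a \(\operatorname{Beta}(3,5)\) prior with 5 heads in 6 tosses) are illustrative assumptions, not part of the example above.

```python
def beta_binomial_update(alpha, beta, x, n):
    """Conjugate update for a Beta(alpha, beta) prior after x heads in n tosses."""
    return alpha + x, beta + n - x


def grid_posterior_mean(alpha, beta, x, n, grid_size=20_000):
    """Posterior mean obtained by normalizing likelihood * prior on a grid."""
    step = 1.0 / grid_size
    grid = [(i + 0.5) * step for i in range(grid_size)]  # midpoints in (0, 1)
    # Unnormalized posterior; factors that do not depend on p are dropped.
    weights = [p ** (x + alpha - 1) * (1.0 - p) ** (n - x + beta - 1) for p in grid]
    return sum(p * w for p, w in zip(grid, weights)) / sum(weights)


# Illustrative values (assumed): Beta(3, 5) prior, 5 heads in 6 tosses.
a_post, b_post = beta_binomial_update(3, 5, 5, 6)
closed_form_mean = a_post / (a_post + b_post)
grid_mean = grid_posterior_mean(3, 5, 5, 6)
print(a_post, b_post, closed_form_mean, grid_mean)
```

The grid version never uses the binomial coefficient or \(B(\alpha,\beta)\), which illustrates why constants free of \(p\) can be ignored when identifying the posterior.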

24 Bayesian Point Estimation

This section explains how to produce a single-number estimate from a posterior distribution.

In classical point estimation, estimators such as the method of moments estimator and the maximum likelihood estimator are functions of the data. In Bayesian inference, once the posterior distribution is obtained, we can summarize it by its mean, median, mode, or another loss-optimal point.

24.1 Posterior mean, median, and mode

This subsection introduces three common posterior summaries used as Bayes point estimates.

NoteDefinition

Definition 6 (Posterior mean). The posterior mean estimator is \[\widehat\theta_B=\mathbb{E}(\theta\mid x)=\int \theta\pi(\theta\mid x)\,d\theta.\] It is the Bayes estimator under squared error loss.

NoteDefinition

Definition 7 (Posterior median). A posterior median is any value \(m\) satisfying \[\mathbb{P}(\theta\le m\mid x)\ge \frac12, \qquad \mathbb{P}(\theta\ge m\mid x)\ge \frac12.\] It is the Bayes estimator under absolute error loss.

NoteDefinition

Definition 8 (Posterior mode and MAP estimator). The maximum a posteriori estimator is \[\widehat\theta_{\mathrm{MAP}}=\arg\max_{\theta}\pi(\theta\mid x).\] Since \[\pi(\theta\mid x)\propto f(x\mid \theta)\pi(\theta),\] we can also write \[\widehat\theta_{\mathrm{MAP}}=\arg\max_{\theta}\{\log f(x\mid \theta)+\log \pi(\theta)\}.\]

TipKey idea

MAP versus MLE. The maximum likelihood estimator maximizes only the likelihood: \[\widehat\theta_{\mathrm{MLE}}=\arg\max_\theta \log f(x\mid \theta).\] The MAP estimator maximizes the log-likelihood plus the log-prior: \[\widehat\theta_{\mathrm{MAP}}=\arg\max_\theta \{\log f(x\mid \theta)+\log \pi(\theta)\}.\] Under a flat prior, \(\pi(\theta)\propto 1\), the term \(\log\pi(\theta)\) is constant in \(\theta\), so the MAP estimator coincides with the MLE.

24.2 Loss functions and Bayes estimators

This subsection connects Bayesian point estimates to decision theory.

A Bayesian point estimator can be chosen by minimizing posterior expected loss. Let \(a\) be an action, interpreted as an estimate of \(\theta\). A loss function \(L(\theta,a)\) measures the penalty of estimating \(\theta\) by \(a\).

NoteDefinition

Definition 9 (Posterior expected loss). Given posterior \(\pi(\theta\mid x)\), the posterior expected loss of action \(a\) is \[\rho(a\mid x)=\mathbb{E}[L(\theta,a)\mid x]=\int L(\theta,a)\pi(\theta\mid x)\,d\theta.\] A Bayes action minimizes \(\rho(a\mid x)\).

NoteTheorem

Theorem 10 (Common Bayes estimators). For a real-valued parameter \(\theta\):

  1. Under squared error loss \(L(\theta,a)=(a-\theta)^2\), the Bayes estimator is the posterior mean.

  2. Under absolute error loss \(L(\theta,a)=|a-\theta|\), the Bayes estimator is a posterior median.

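  3. Under 0–1 loss with a small tolerance, \(L(\theta,a)=\mathbb{1}\{|a-\theta|>\varepsilon\}\) with \(\varepsilon\to 0\), the Bayes estimator tends to a posterior mode, that is, the MAP estimator.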

Proof. For squared error loss, \[\rho(a\mid x)=\mathbb{E}[(a-\theta)^2\mid x] =a^2-2a\mathbb{E}(\theta\mid x)+\mathbb{E}(\theta^2\mid x).\] This is a convex quadratic in \(a\), and setting its derivative with respect to \(a\) to zero gives \[2a-2\mathbb{E}(\theta\mid x)=0,\] so \(a=\mathbb{E}(\theta\mid x)\). The absolute loss result follows because a median minimizes expected absolute deviation. The mode result follows because maximizing posterior mass in a small neighborhood is equivalent to maximizing the posterior density. ◻

TipExample

Example 11 (Beta-binomial posterior mean, variance, and mode). Suppose \(X\sim\operatorname{Binomial}(n,p)\) and \(p\sim\operatorname{Beta}(\alpha,\beta)\). If \(X=x\), find the posterior mean, variance, and mode of \(p\).

CautionSolution

The posterior is \[p\mid x\sim \operatorname{Beta}(\alpha+x,\beta+n-x).\] Let \[\alpha^*=\alpha+x, \qquad \beta^*=\beta+n-x.\] Then \[\mathbb{E}(p\mid x)=\frac{\alpha^*}{\alpha^*+\beta^*} =\frac{\alpha+x}{\alpha+\beta+n}.\] The posterior variance is \[\operatorname{Var}(p\mid x)=\frac{\alpha^*\beta^*}{(\alpha^*+\beta^*)^2(\alpha^*+\beta^*+1)} =\frac{(\alpha+x)(\beta+n-x)}{(\alpha+\beta+n)^2(\alpha+\beta+n+1)}.\] If \(\alpha^*>1\) and \(\beta^*>1\), the posterior mode is \[\frac{\alpha^*-1}{\alpha^*+\beta^*-2} =\frac{\alpha+x-1}{\alpha+\beta+n-2}.\]

TipExample

Example 12 (Pseudo-count interpretation). Suppose \(p\sim\operatorname{Beta}(3,5)\) and we observe \(x=5\) heads in \(n=6\) tosses. Find the posterior distribution and posterior mean.

CautionSolution

The posterior is \[p\mid X=5\sim\operatorname{Beta}(3+5,5+6-5)=\operatorname{Beta}(8,6).\] The posterior mean is \[\mathbb{E}(p\mid X=5)=\frac{8}{8+6}=\frac{8}{14}\approx 0.5714.\] The prior contributes \(3\) prior successes and \(5\) prior failures. The data contribute \(5\) successes and \(1\) failure. Thus the posterior has \(8\) pseudo-successes and \(6\) pseudo-failures.
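The Beta summaries used in the last two examples are easy to compute directly. The following is a minimal sketch; the helper name `beta_summaries` is hypothetical, and the numbers are those of the \(\operatorname{Beta}(3,5)\) prior with 5 heads in 6 tosses.

```python
def beta_summaries(a, b):
    """Mean, variance, and mode of a Beta(a, b) distribution."""
    total = a + b
    mean = a / total
    variance = a * b / (total ** 2 * (total + 1))
    # The interior mode formula requires a > 1 and b > 1.
    mode = (a - 1) / (total - 2) if a > 1 and b > 1 else None
    return mean, variance, mode


# Posterior Beta(8, 6) from a Beta(3, 5) prior with 5 heads in 6 tosses.
post_mean, post_var, post_mode = beta_summaries(3 + 5, 5 + 6 - 5)
prior_mean, _, _ = beta_summaries(3, 5)
sample_prop = 5 / 6
print(prior_mean, sample_prop, post_mean, post_var, post_mode)
```

The posterior mean lands between the prior mean and the sample proportion, which is the pseudo-count interpretation in numerical form.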

25 Conjugate Priors

This section studies conjugate priors, which make Bayesian updating especially simple.

A conjugate prior is useful because the posterior stays in the same distribution family as the prior. Then Bayesian updating often becomes a simple rule of adding data summaries to prior hyperparameters.

NoteDefinition

Definition 13 (Conjugate prior). For a likelihood function \(f(x\mid \theta)\), a prior family \(\pi(\theta)\) is called conjugate if the posterior \(\pi(\theta\mid x)\) belongs to the same family as the prior.

TipKey idea

Why conjugacy helps. If a prior is conjugate, then we can often skip difficult integration. The posterior distribution is known up to updated parameters, and the update rule is usually easy to compute.

25.1 Beta-binomial conjugacy

This subsection reviews the most important conjugate pair for proportions.

NoteTheorem

Theorem 14 (Beta-binomial model). If \[X\mid p\sim\operatorname{Binomial}(n,p), \qquad p\sim\operatorname{Beta}(\alpha,\beta),\] then \[p\mid X=x\sim\operatorname{Beta}(\alpha+x,\beta+n-x).\]

TipExample

Example 15 (Bayesian estimator for a Bernoulli probability). Suppose \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)\) and \(S=\sum_{i=1}^n X_i\). Let \(p\sim\operatorname{Beta}(\alpha,\beta)\). Find the Bayes estimator under squared error loss.

CautionSolution

Since \(S\mid p\sim\operatorname{Binomial}(n,p)\), the posterior is \[p\mid S\sim\operatorname{Beta}(\alpha+S,\beta+n-S).\] Under squared error loss, the Bayes estimator is the posterior mean: \[\widehat p_B=\mathbb{E}(p\mid S)=\frac{\alpha+S}{\alpha+\beta+n}.\] This shrinks the sample proportion \(S/n\) toward the prior mean \(\alpha/(\alpha+\beta)\).
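To make the shrinkage explicit, the estimator can be rewritten as a weighted average of the sample proportion and the prior mean: \[\widehat p_B=\frac{\alpha+S}{\alpha+\beta+n} =\frac{n}{\alpha+\beta+n}\cdot\frac{S}{n} +\frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta}.\] The weight on \(S/n\) is \(n/(\alpha+\beta+n)\), so the data dominate as \(n\) grows, while \(\alpha+\beta\) acts as a prior sample size.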

25.2 Gamma-Poisson conjugacy

This subsection reviews the conjugate prior for a Poisson rate.

We use the rate parametrization for the gamma distribution: \[\lambda\sim\operatorname{Gamma}(\alpha,\beta), \qquad \pi(\lambda)=\frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}, \qquad \lambda>0.\]

NoteTheorem

Theorem 16 (Gamma-Poisson model). If \[X_1,\ldots,X_n\mid \lambda \overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda), \qquad \lambda\sim\operatorname{Gamma}(\alpha,\beta),\] then \[\lambda\mid x\sim \operatorname{Gamma}\left(\alpha+\sum_{i=1}^n x_i,\,\beta+n\right).\]

Proof. The likelihood is \[f(x\mid \lambda)=\prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \propto e^{-n\lambda}\lambda^{\sum x_i}.\] Multiplying by the gamma prior gives \[\pi(\lambda\mid x)\propto e^{-n\lambda}\lambda^{\sum x_i}\cdot \lambda^{\alpha-1}e^{-\beta\lambda} =\lambda^{\alpha+\sum x_i-1}e^{-(\beta+n)\lambda}.\] This is the kernel of \(\operatorname{Gamma}(\alpha+\sum x_i,\beta+n)\). ◻

TipExample

Example 17 (Poisson posterior). Suppose \(n=10\) Poisson observations have total count \(\sum x_i=26\). Let the prior be \(\lambda\sim\operatorname{Gamma}(2,1)\). Find the posterior mean.

CautionSolution

The posterior is \[\lambda\mid x\sim \operatorname{Gamma}(2+26,1+10)=\operatorname{Gamma}(28,11).\] For a gamma distribution with rate \(\beta\), the mean is \(\alpha/\beta\). Therefore, \[\mathbb{E}(\lambda\mid x)=\frac{28}{11}\approx 2.545.\]
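The gamma-Poisson update is a two-line computation. The sketch below reproduces the numbers of this example; the function name is hypothetical, and the weighted-average check at the end is an added illustration rather than part of the original solution.

```python
def gamma_poisson_update(shape, rate, total_count, n):
    """Posterior (shape, rate) for a Gamma(shape, rate) prior and n Poisson counts."""
    return shape + total_count, rate + n


# Example values: Gamma(2, 1) prior, n = 10 observations with total count 26.
post_shape, post_rate = gamma_poisson_update(2, 1, 26, 10)
post_mean = post_shape / post_rate

# The posterior mean is a weighted average of the sample mean and the prior mean.
weight_on_data = 10 / (1 + 10)
weighted_mean = weight_on_data * (26 / 10) + (1 - weight_on_data) * (2 / 1)
print(post_shape, post_rate, post_mean, weighted_mean)
```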

25.3 Normal-normal conjugacy

This subsection reviews the conjugate model for a normal mean with known variance.

Suppose \[X_1,\ldots,X_n\mid \mu \overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2),\] where \(\sigma^2\) is known. Let the prior be \[\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2).\]

NoteTheorem

Theorem 18 (Normal-normal posterior). The posterior distribution of \(\mu\) is normal: \[\mu\mid x\sim\operatorname{Normal}(m_n,v_n),\] where \[v_n=\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)^{-1} =\frac{\sigma^2\sigma_0^2}{\sigma^2+n\sigma_0^2},\] and \[m_n=v_n\left(\frac{n\overline x}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right) =\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}.\]

TipKey idea

Weighted average interpretation. The posterior mean is a weighted average of the sample mean \(\overline x\) and the prior mean \(\mu_0\): \[m_n=\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\overline x +\frac{\sigma^2}{n\sigma_0^2+\sigma^2}\mu_0.\] If \(\sigma_0^2\to\infty\), the prior becomes diffuse and \(m_n\to\overline x\), the MLE.

TipExample

Example 19 (MAP for a normal mean). Suppose \(X_1,\ldots,X_n\mid \mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)\) with known \(\sigma^2\), and suppose \(\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2)\). Find the MAP estimator.

CautionSolution

The posterior is normal with mean \(m_n\) and variance \(v_n\). A normal density is maximized at its mean, so \[\widehat\mu_{\mathrm{MAP}}=m_n =\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}.\] Equivalently, \[\widehat\mu_{\mathrm{MAP}} =\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\overline x +\frac{\sigma^2}{n\sigma_0^2+\sigma^2}\mu_0.\] Because the posterior is normal, the posterior mean and posterior mode are the same.
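The normal-normal update can be sketched in precision form, which mirrors the formulas for \(v_n\) and \(m_n\). The function name and the sample values (\(n=16\), \(\sigma^2=4\), \(\overline x=10\), \(\mu_0=8\), \(\sigma_0^2=9\)) are assumptions chosen for illustration.

```python
def normal_normal_update(xbar, n, sigma2, mu0, sigma0_2):
    """Posterior mean and variance of a normal mean with known variance sigma2."""
    precision = n / sigma2 + 1.0 / sigma0_2      # data precision + prior precision
    v_n = 1.0 / precision
    m_n = v_n * (n * xbar / sigma2 + mu0 / sigma0_2)
    return m_n, v_n


# Illustrative values (assumed).
m_n, v_n = normal_normal_update(xbar=10.0, n=16, sigma2=4.0, mu0=8.0, sigma0_2=9.0)

# Same posterior mean written as a weighted average of xbar and mu0.
w = 16 * 9.0 / (16 * 9.0 + 4.0)
weighted = w * 10.0 + (1.0 - w) * 8.0
print(m_n, v_n, weighted)
```

Because the posterior is normal, `m_n` is at once the posterior mean, median, and MAP estimate.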

26 Bayes Risk and Evaluating Estimators

This section explains how Bayesian decision theory evaluates estimators by averaging risk over a prior distribution.

In frequentist evaluation, the risk \(R(\theta,\delta)\) is viewed as a function of the unknown parameter \(\theta\). In Bayesian evaluation, we average this risk over the prior distribution.

NoteDefinition

Definition 20 (Risk function). For decision rule \(\delta(X)\) and loss function \(L(\theta,a)\), the risk function is \[R(\theta,\delta)=\mathbb{E}_\theta[L(\theta,\delta(X))].\]

NoteDefinition

Definition 21 (Bayes risk). Given prior distribution \(\pi(\theta)\), the Bayes risk of \(\delta\) is \[r(\pi,\delta)=\int R(\theta,\delta)\pi(\theta)\,d\theta.\] Equivalently, \[r(\pi,\delta)=\int\int L(\theta,\delta(x))f(x\mid \theta)\pi(\theta)\,dx\,d\theta.\]

NoteTheorem

Theorem 22 (Posterior risk minimization). A Bayes rule can be found by minimizing posterior expected loss for each observed \(x\): \[\delta_B(x)\in \arg\min_a \int L(\theta,a)\pi(\theta\mid x)\,d\theta.\]
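One way to see why pointwise minimization suffices is to rewrite the Bayes risk using \(f(x\mid\theta)\pi(\theta)=m(x)\pi(\theta\mid x)\), assuming the order of integration can be exchanged: \[r(\pi,\delta)=\int\int L(\theta,\delta(x))f(x\mid \theta)\pi(\theta)\,d\theta\,dx =\int m(x)\left[\int L(\theta,\delta(x))\pi(\theta\mid x)\,d\theta\right]dx.\] Since \(m(x)\ge 0\), choosing \(\delta(x)\) to minimize the inner integral for each \(x\) minimizes the whole expression.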

TipExample

Example 23 (MSE of a beta-binomial Bayes estimator). Let \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)\) and \(S=\sum X_i\). Consider \[\widehat p_B=\frac{\alpha+S}{\alpha+\beta+n}.\] Find its frequentist MSE as a function of \(p\).

CautionSolution

Since \(S\sim\operatorname{Binomial}(n,p)\), \[\mathbb{E}_p(\widehat p_B)=\frac{\alpha+np}{\alpha+\beta+n}, \qquad \operatorname{Var}_p(\widehat p_B)=\frac{np(1-p)}{(\alpha+\beta+n)^2}.\] The bias is \[\operatorname{Bias}_p(\widehat p_B) =\frac{\alpha+np}{\alpha+\beta+n}-p =\frac{\alpha-(\alpha+\beta)p}{\alpha+\beta+n}.\] Therefore, \[\operatorname{MSE}_p(\widehat p_B) =\frac{np(1-p)}{(\alpha+\beta+n)^2} +\left(\frac{\alpha-(\alpha+\beta)p}{\alpha+\beta+n}\right)^2.\] This shows the bias-variance tradeoff introduced by the prior.

NoteRemark

Remark 24 (Why shrinkage can help). The estimator \(\widehat p_B\) is usually biased as a frequentist estimator, but it can have smaller MSE than the MLE \(\widehat p=S/n\) for some values of \(p\), especially when \(n\) is small and the prior information is reasonable.
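The comparison with the MLE can be checked numerically from the MSE formula in Example 23 and the standard fact that \(S/n\) has MSE \(p(1-p)/n\). This is a small sketch; the choice \(n=10\), \(\alpha=\beta=2\) and the two values of \(p\) are illustrative assumptions.

```python
def mse_bayes(p, n, alpha, beta):
    """Frequentist MSE of (alpha + S) / (alpha + beta + n) at success probability p."""
    denom = alpha + beta + n
    variance = n * p * (1.0 - p) / denom ** 2
    bias = (alpha - (alpha + beta) * p) / denom
    return variance + bias ** 2


def mse_mle(p, n):
    """MSE of the unbiased estimator S / n, which equals its variance."""
    return p * (1.0 - p) / n


# Illustrative setting (assumed): n = 10 with a Beta(2, 2) prior.
near_prior_mean = (mse_bayes(0.5, 10, 2, 2), mse_mle(0.5, 10))
far_from_prior_mean = (mse_bayes(0.05, 10, 2, 2), mse_mle(0.05, 10))
print(near_prior_mean, far_from_prior_mean)
```

In this setting the Bayes estimator has the smaller MSE at \(p=0.5\), where the prior is centered, and the larger MSE at \(p=0.05\), where the prior pulls the estimate away from the truth.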

27 Bayesian Hypothesis Testing

This section explains how Bayesian inference turns hypothesis testing into posterior probability comparison.

In classical testing, we control Type I error, study power, and often use p-values or likelihood ratio tests. In Bayesian testing, the posterior distribution directly assigns probabilities to hypotheses.

27.1 Posterior probability tests

This subsection defines Bayesian tests through posterior probabilities of the null and alternative parameter regions.

Let the hypotheses be \[H_0:\theta\in\Theta_0, \qquad H_1:\theta\in\Theta_0^c.\] Given posterior \(\pi(\theta\mid x)\), \[\mathbb{P}(H_0\mid x)=\mathbb{P}(\theta\in\Theta_0\mid x)=\int_{\Theta_0}\pi(\theta\mid x)\,d\theta,\] and \[\mathbb{P}(H_1\mid x)=\mathbb{P}(\theta\in\Theta_0^c\mid x)=\int_{\Theta_0^c}\pi(\theta\mid x)\,d\theta.\]

NoteDefinition

Definition 25 (Bayesian posterior probability test). A simple Bayesian decision rule rejects \(H_0\) when \[\mathbb{P}(H_0\mid x)<\mathbb{P}(H_1\mid x),\] equivalently, when \[\mathbb{P}(H_0\mid x)<\frac12.\] A more conservative rule rejects \(H_0\) when \[\mathbb{P}(H_0\mid x)<0.05.\]

TipExample

Example 26 (Bayesian one-sided normal mean test). Suppose \[X_1,\ldots,X_n\mid \mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2),\] where \(\sigma^2\) is known, and suppose \[\mu\sim\operatorname{Normal}(\theta,\tau^2).\] Test \[H_0:\mu\le \mu_0 \qquad\text{versus}\qquad H_1:\mu>\mu_0.\] Find a posterior-probability rejection rule.

CautionSolution

The posterior distribution is \[\mu\mid x\sim\operatorname{Normal}(m_n,v_n),\] where \[m_n=\frac{\tau^2\sum_{i=1}^n x_i+\sigma^2\theta}{n\tau^2+\sigma^2}, \qquad v_n=\frac{\sigma^2\tau^2}{\sigma^2+n\tau^2}.\] The posterior probability of the null is \[\mathbb{P}(H_0\mid x)=\mathbb{P}(\mu\le \mu_0\mid x) =\Phi\left(\frac{\mu_0-m_n}{\sqrt{v_n}}\right).\] If we reject when \(\mathbb{P}(H_0\mid x)<1/2\), then because the posterior is normal and symmetric, this is equivalent to \[m_n>\mu_0.\] That is, \[\frac{\tau^2\sum_{i=1}^n x_i+\sigma^2\theta}{n\tau^2+\sigma^2}>\mu_0.\] Equivalently, \[\overline x>\mu_0+\frac{\sigma^2(\mu_0-\theta)}{n\tau^2}.\]
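The posterior probability of the null and the equivalent threshold on \(\overline x\) can be compared numerically. The sketch below uses Python's standard `statistics.NormalDist`; the function names and the numerical values (\(n=16\), \(\sigma^2=4\), \(\theta=8\), \(\tau^2=9\), \(\mu_0=9\)) are assumptions for illustration.

```python
from statistics import NormalDist


def posterior_prob_null(xbar, n, sigma2, prior_mean, prior_var, mu0):
    """P(mu <= mu0 | x) under the normal-normal model with known sigma2."""
    m_n = (prior_var * n * xbar + sigma2 * prior_mean) / (n * prior_var + sigma2)
    v_n = sigma2 * prior_var / (sigma2 + n * prior_var)
    return NormalDist(mu=m_n, sigma=v_n ** 0.5).cdf(mu0)


def xbar_threshold(n, sigma2, prior_mean, prior_var, mu0):
    """Reject H0 at the 1/2 cutoff exactly when xbar exceeds this value."""
    return mu0 + sigma2 * (mu0 - prior_mean) / (n * prior_var)


# Illustrative values (assumed).
n, sigma2, theta, tau2, mu0 = 16, 4.0, 8.0, 9.0, 9.0
prob_h0 = posterior_prob_null(10.0, n, sigma2, theta, tau2, mu0)
threshold = xbar_threshold(n, sigma2, theta, tau2, mu0)
print(prob_h0, threshold)
```

With a prior mean below \(\mu_0\), the threshold sits slightly above \(\mu_0\): the data must overcome the prior's pull toward the null region.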

27.2 Bayesian tests with loss functions

This subsection connects Bayesian tests to error costs.

Suppose the action space is \(\{a_0,a_1\}\), where \(a_0\) means “accept or fail to reject \(H_0\)” and \(a_1\) means “reject \(H_0\)”. A generalized 0–1 loss is \[L(\theta,a_0)= \begin{cases} 0, & \theta\in\Theta_0,\\ c_{II}, & \theta\in\Theta_0^c, \end{cases}\] and \[L(\theta,a_1)= \begin{cases} c_I, & \theta\in\Theta_0,\\ 0, & \theta\in\Theta_0^c. \end{cases}\] Here \(c_I\) is the cost of a Type I error and \(c_{II}\) is the cost of a Type II error.

NoteTheorem

Theorem 27 (Bayes test under generalized 0–1 loss). Under the above loss, reject \(H_0\) if \[c_I\mathbb{P}(H_0\mid x)<c_{II}\mathbb{P}(H_1\mid x).\] Equivalently, \[\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}.\]

Proof. The posterior expected loss of accepting \(H_0\) is \[\rho(a_0\mid x)=c_{II}\mathbb{P}(H_1\mid x).\] The posterior expected loss of rejecting \(H_0\) is \[\rho(a_1\mid x)=c_I\mathbb{P}(H_0\mid x).\] We reject when \(\rho(a_1\mid x)<\rho(a_0\mid x)\), namely \[c_I\mathbb{P}(H_0\mid x)<c_{II}\mathbb{P}(H_1\mid x).\] Since \(\mathbb{P}(H_1\mid x)=1-\mathbb{P}(H_0\mid x)\), this is equivalent to \[\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}.\] ◻
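The two forms of the rejection rule can be compared in a few lines. This is a minimal sketch with hypothetical function names; the inputs \(\mathbb{P}(H_0\mid x)=0.30\), \(c_I=5\), \(c_{II}=1\) are illustrative.

```python
def posterior_expected_losses(prob_h0, cost_type1, cost_type2):
    """Posterior expected loss of accepting H0 and of rejecting H0."""
    loss_accept = cost_type2 * (1.0 - prob_h0)   # wrong only if H1 is true
    loss_reject = cost_type1 * prob_h0           # wrong only if H0 is true
    return loss_accept, loss_reject


def reject_null(prob_h0, cost_type1, cost_type2):
    """Bayes test under generalized 0-1 loss, written as a probability cutoff."""
    return prob_h0 < cost_type2 / (cost_type1 + cost_type2)


# Illustrative values (assumed).
loss_accept, loss_reject = posterior_expected_losses(0.30, 5.0, 1.0)
decision = reject_null(0.30, 5.0, 1.0)
print(loss_accept, loss_reject, decision)
```

Raising the Type I cost lowers the cutoff \(c_{II}/(c_I+c_{II})\), so stronger posterior evidence against \(H_0\) is needed before rejecting.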

NoteRemark

Remark 28 (Connection to classical risk). In classical testing, the risk under generalized 0–1 loss is \[R(\theta,\delta)= \begin{cases} c_I\beta(\theta), & \theta\in\Theta_0,\\ c_{II}[1-\beta(\theta)], & \theta\in\Theta_0^c, \end{cases}\] where \(\beta(\theta)=\mathbb{P}_\theta(\text{reject }H_0)\) is the power function. Averaging this risk over the prior gives the Bayes risk, and the Bayes test minimizes it by minimizing the posterior expected loss for each observed \(x\).

28 Bayesian Interval Estimation

This section introduces Bayesian credible intervals and contrasts them with frequentist confidence intervals.

In frequentist confidence intervals, the interval is random and the parameter is fixed. In Bayesian credible intervals, the parameter is random under the posterior distribution, so probability statements about \(\theta\) belonging to an observed interval are meaningful within the model.

NoteDefinition

Definition 29 (Credible set). A set \(C(x)\subseteq\Theta\) is a \(100(1-\alpha)\%\) credible set if \[\mathbb{P}(\theta\in C(x)\mid x)=\int_{C(x)}\pi(\theta\mid x)\,d\theta=1-\alpha.\] If \(C(x)=[a,b]\), then \([a,b]\) is called a credible interval.

WarningWarning

Interpretation warning. A frequentist \(95\%\) confidence interval does not mean that there is a \(95\%\) probability that the fixed parameter lies in the observed interval. A Bayesian \(95\%\) credible interval does allow the statement: \[\mathbb{P}(\theta\in C(x)\mid x)=0.95.\] This interpretation depends on the prior and the Bayesian model.

28.1 Equal-tail credible intervals

This subsection introduces credible intervals based on posterior quantiles.

NoteDefinition

Definition 30 (Equal-tail credible interval). A \(100(1-\alpha)\%\) equal-tail credible interval \([a,b]\) satisfies \[\mathbb{P}(\theta<a\mid x)=\frac{\alpha}{2}, \qquad \mathbb{P}(\theta>b\mid x)=\frac{\alpha}{2}.\] Equivalently, \(a\) and \(b\) are the posterior \(\alpha/2\) and \(1-\alpha/2\) quantiles.

TipExample

Example 31 (Beta-binomial credible interval). Suppose \(X\sim\operatorname{Binomial}(n,p)\), \(p\sim\operatorname{Beta}(2,2)\), \(n=20\), and \(x=12\). Find the posterior distribution and describe the \(95\%\) equal-tail credible interval.

CautionSolution

The posterior is \[p\mid X=12\sim\operatorname{Beta}(2+12,2+20-12)=\operatorname{Beta}(14,10).\] The \(95\%\) equal-tail credible interval is \[\left[q_{0.025},q_{0.975}\right],\] where \(q_c\) is the \(c\)th quantile of the \(\operatorname{Beta}(14,10)\) distribution. Numerically, this interval is approximately \[[0.385,0.768].\] This means that, under the beta-binomial Bayesian model, \[\mathbb{P}(0.385\le p\le 0.768\mid X=12)\approx 0.95.\]
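Beta quantiles have no closed form, so they are found numerically. The sketch below uses only the standard library: it integrates the Beta density with the midpoint rule and inverts the CDF by bisection. The function names, step count, and tolerance are assumptions; in practice a statistical library's Beta quantile function would normally be used instead.

```python
import math


def beta_cdf(x, a, b, steps=4000):
    """Beta(a, b) CDF at x via the midpoint rule (intended for a, b >= 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log1p(-t))
    return total * h


def beta_quantile(prob, a, b, tol=1e-8):
    """Invert the CDF by bisection on (0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Equal-tail 95% interval for the Beta(14, 10) posterior.
lower = beta_quantile(0.025, 14, 10)
upper = beta_quantile(0.975, 14, 10)
print(lower, upper)
```

The printed endpoints should land close to the interval quoted in the solution.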

TipExample

Example 32 (Normal-normal credible interval). Suppose \(X_1,\ldots,X_n\mid\mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)\) with known \(\sigma^2\), and \(\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2)\). Find a \(100(1-\alpha)\%\) credible interval for \(\mu\).

CautionSolution

The posterior is \[\mu\mid x\sim\operatorname{Normal}(m_n,v_n),\] where \[m_n=\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}, \qquad v_n=\frac{\sigma^2\sigma_0^2}{\sigma^2+n\sigma_0^2}.\] Therefore a \(100(1-\alpha)\%\) credible interval is \[\left[m_n-z_{\alpha/2}\sqrt{v_n},\;m_n+z_{\alpha/2}\sqrt{v_n}\right],\] where \(z_{\alpha/2}\) satisfies \(\mathbb{P}(Z>z_{\alpha/2})=\alpha/2\) for \(Z\sim\operatorname{Normal}(0,1)\).
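For a normal posterior the interval only needs a standard normal quantile, which the standard library provides through `statistics.NormalDist`. The following is a sketch with a hypothetical function name; the sample posterior \(\operatorname{Normal}(5,4)\) at level \(90\%\) is an illustrative assumption.

```python
from statistics import NormalDist


def normal_credible_interval(m_n, v_n, level=0.95):
    """Equal-tail (and HPD) credible interval for a Normal(m_n, v_n) posterior."""
    alpha = 1.0 - level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # z_{alpha/2}
    half_width = z * v_n ** 0.5
    return m_n - half_width, m_n + half_width


# Illustrative posterior (assumed): mean 5, variance 4, level 90%.
lower, upper = normal_credible_interval(5.0, 4.0, level=0.90)
print(lower, upper)
```

Note that the second argument is the posterior variance \(v_n\), not the standard deviation.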

TipExample

Example 33 (Gamma-Poisson credible interval). Suppose \(X_1,\ldots,X_n\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)\) and \(\lambda\sim\operatorname{Gamma}(\alpha,\beta)\) in the rate parametrization. Derive the posterior credible interval for \(\lambda\).

CautionSolution

The posterior is \[\lambda\mid x\sim \operatorname{Gamma}\left(\alpha+\sum_{i=1}^n x_i,\beta+n\right).\] A \(100(1-\alpha_0)\%\) equal-tail credible interval is \[\left[q_{\alpha_0/2},q_{1-\alpha_0/2}\right],\] where \(q_c\) is the \(c\)th quantile of the posterior gamma distribution. For example, if \(n=10\), \(\sum x_i=26\), and \(\lambda\sim\operatorname{Gamma}(2,1)\), then \[\lambda\mid x\sim\operatorname{Gamma}(28,11),\] and the interval is obtained from the quantiles of \(\operatorname{Gamma}(28,11)\).

29 Highest Posterior Density Regions

This section explains the shortest Bayesian credible regions for unimodal posterior distributions.

Equal-tail intervals are easy to compute, but they are not always the shortest credible intervals. For a unimodal posterior density, the shortest credible region is formed by keeping the parameter values with highest posterior density.

NoteDefinition

Definition 34 (Highest posterior density region). A \(100(1-\alpha)\%\) highest posterior density (HPD) credible region is a set \[C_{\mathrm{HPD}}(x)=\{\theta:\pi(\theta\mid x)\ge k\},\] where \(k\) is chosen so that \[\int_{C_{\mathrm{HPD}}(x)}\pi(\theta\mid x)\,d\theta=1-\alpha.\]

NoteTheorem

Theorem 35 (Shortest credible interval for unimodal posterior). If \(\pi(\theta\mid x)\) is unimodal, then the shortest credible interval with posterior probability \(1-\alpha\) is the HPD interval. Its endpoints have equal posterior density, except possibly at a boundary of the parameter space.

TipExample

Example 36 (Poisson HPD region). Suppose \[X_1,\ldots,X_n\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda),\] and use a conjugate gamma prior. If \[\lambda\mid x\sim\operatorname{Gamma}\left(a+\sum x_i,\, n+b\right)\] under a rate parametrization, then a \(100(1-\alpha)\%\) HPD credible region has the form \[\{\lambda:\pi(\lambda\mid x)\ge k\},\] where \(k\) is chosen to make the posterior probability equal to \(1-\alpha\).

CautionSolution

The posterior density is a gamma density. When its shape parameter exceeds \(1\), it is unimodal with an interior mode, and the HPD region contains the highest-density values around the posterior mode. The cutoff \(k\) is determined by solving \[\int_{\{\lambda:\pi(\lambda\mid x)\ge k\}}\pi(\lambda\mid x)\,d\lambda=1-\alpha.\] For example, in the lecture notes case with \(a=b=1\), \(n=10\), and \(\sum x_i=6\), the posterior is \(\operatorname{Gamma}(7,11)\) and the \(90\%\) HPD credible set is approximately \[[0.253,1.005],\] while the corresponding equal-tail interval is slightly longer.
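Solving for the cutoff \(k\) can be approximated on a grid: sort grid cells by posterior density and keep the highest ones until their mass reaches \(1-\alpha\). The sketch below applies this to a gamma posterior in the rate parametrization. The function name, the truncation point of the grid, and the grid size are assumptions, and the approach presumes a shape parameter larger than 1 so that the kept cells form a single interval.

```python
import math


def gamma_hpd_interval(shape, rate, level=0.90, grid_size=50_000):
    """Grid approximation to the HPD interval of a Gamma(shape, rate) density."""
    mean = shape / rate
    sd = math.sqrt(shape) / rate
    upper_limit = mean + 12.0 * sd          # assumed cutoff; mass beyond is negligible
    step = upper_limit / grid_size
    grid = [(i + 0.5) * step for i in range(grid_size)]
    # Unnormalized log density, shifted by its maximum for numerical stability.
    log_dens = [(shape - 1.0) * math.log(t) - rate * t for t in grid]
    peak = max(log_dens)
    dens = [math.exp(v - peak) for v in log_dens]
    target = level * sum(dens)
    # Add cells from highest to lowest density until the target mass is reached.
    order = sorted(range(grid_size), key=lambda i: dens[i], reverse=True)
    mass, first, last = 0.0, grid_size, -1
    for i in order:
        mass += dens[i]
        first, last = min(first, i), max(last, i)
        if mass >= target:
            break
    return grid[first] - step / 2.0, grid[last] + step / 2.0


# Posterior Gamma(7, 11) from a = b = 1, n = 10, sum of counts 6.
hpd_lower, hpd_upper = gamma_hpd_interval(7.0, 11.0, level=0.90)
print(hpd_lower, hpd_upper)
```

The two endpoints returned have (approximately) equal posterior density, which is the defining feature of an HPD interval in the unimodal case.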

NoteRemark

Remark 37 (Equal-tail versus HPD). Equal-tail intervals split posterior probability equally between the two tails. HPD intervals minimize length by taking the most plausible parameter values first. For symmetric unimodal posteriors, the equal-tail and HPD intervals coincide. For skewed posteriors, they are usually different.

30 Bayesian Optimality for Intervals

This section connects credible intervals to loss-function optimality.

Bayesian interval estimation can be framed as a decision problem: the action is choosing a set \(C\), and the loss penalizes long intervals while rewarding coverage of the true parameter.

NoteDefinition

Definition 38 (Interval loss function). One simple loss function for choosing a confidence or credible set \(C\) is \[L(\theta,C)=b\cdot \operatorname{Length}(C)-\mathbb{1}\{\theta\in C\},\] where \(b>0\) controls the tradeoff between short length and high coverage.

The corresponding frequentist risk is \[R(\theta,C)=b\mathbb{E}_\theta[\operatorname{Length}(C(X))]-\mathbb{P}_\theta(\theta\in C(X)).\] The Bayesian posterior expected loss is \[\rho(C\mid x)=b\cdot \operatorname{Length}(C)-\mathbb{P}(\theta\in C\mid x).\]

TipKey idea

Interpretation. Large \(b\) prioritizes shorter intervals. Small \(b\) prioritizes posterior coverage. This makes precise the tradeoff between precision and uncertainty.

TipExample

Example 39 (Normal interval risk). Suppose \(X\sim\operatorname{Normal}(\mu,\sigma^2)\) with known \(\sigma^2\), and consider symmetric intervals \[C(X)=[X-c\sigma,X+c\sigma], \qquad c\ge 0.\] Compute the risk under \[L(\mu,C)=b\cdot\operatorname{Length}(C)-\mathbb{1}\{\mu\in C\}.\]

CautionSolution

The length is \[\operatorname{Length}(C)=2c\sigma.\] The coverage probability is \[\mathbb{P}_\mu(\mu\in C(X)) =\mathbb{P}_\mu(X-c\sigma\le\mu\le X+c\sigma) =\mathbb{P}\left(-c\le \frac{X-\mu}{\sigma}\le c\right) =2\Phi(c)-1.\] Therefore the risk is \[R(c)=b(2c\sigma)-[2\Phi(c)-1].\] Differentiating, \[R'(c)=2b\sigma-2\phi(c).\] The optimum satisfies \[\phi(c)=b\sigma.\] Since \(\phi(c)=\frac{1}{\sqrt{2\pi}}e^{-c^2/2}\), if \(b\sigma\le 1/\sqrt{2\pi}\), then \[c=\sqrt{-2\log(b\sigma\sqrt{2\pi})}.\] If \(b\sigma>1/\sqrt{2\pi}\), the minimum occurs at \(c=0\), corresponding to a point estimate.
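The optimal multiplier \(c\) and the risk curve can be evaluated directly. This is a minimal sketch with hypothetical function names; it uses the identity \(2\Phi(c)-1=\operatorname{erf}(c/\sqrt2)\), and the value of \(b\) is chosen (as an assumption) so that \(b\sigma=\phi(1)\), which should give \(c=1\).

```python
import math


def optimal_multiplier(b, sigma):
    """Risk-minimizing c for the interval [X - c*sigma, X + c*sigma]."""
    if b * sigma >= 1.0 / math.sqrt(2.0 * math.pi):
        return 0.0                            # length penalty too high: report a point
    return math.sqrt(-2.0 * math.log(b * sigma * math.sqrt(2.0 * math.pi)))


def interval_risk(c, b, sigma):
    """R(c) = b * (2 c sigma) - (2 Phi(c) - 1)."""
    coverage = math.erf(c / math.sqrt(2.0))   # equals 2 * Phi(c) - 1
    return b * 2.0 * c * sigma - coverage


# Illustrative choice (assumed): sigma = 1 and b = phi(1), so that phi(c) = b * sigma at c = 1.
sigma = 1.0
b = math.exp(-0.5) / math.sqrt(2.0 * math.pi)
c_star = optimal_multiplier(b, sigma)
print(c_star, interval_risk(c_star, b, sigma))
```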

31 Bayesian Inference Workflow

This section summarizes the steps of Bayesian inference as a reusable procedure.

TipKey idea

Bayesian workflow

  1. Specify the sampling model \(f(x\mid\theta)\).

  2. Choose a prior distribution \(\pi(\theta)\).

  3. Compute the posterior distribution \(\pi(\theta\mid x)\propto f(x\mid\theta)\pi(\theta)\).

  4. For point estimation, report posterior mean, median, MAP, or another loss-optimal estimate.

  5. For testing, compute posterior probabilities of \(H_0\) and \(H_1\), then choose an action based on posterior risk.

  6. For interval estimation, report an equal-tail credible interval or HPD credible interval.

  7. Interpret the answer conditional on the chosen model and prior.

WarningWarning

Prior sensitivity. Bayesian inference depends on the prior distribution. With large samples, the likelihood often dominates the prior. With small samples, the prior can strongly influence posterior estimates, tests, and intervals.

32 Practice Problems

This section provides practice problems that connect the Bayesian methods from Sections 12–17.

ImportantPractice Problem

Practice Problem 40 (Beta-binomial posterior and estimator). Suppose \(X\sim\operatorname{Binomial}(30,p)\) and \(x=18\). Let \(p\sim\operatorname{Beta}(4,6)\).

  1. Find the posterior distribution of \(p\).

  2. Find the posterior mean.

  3. Find the MAP estimator, assuming the posterior parameters are both larger than 1.

CautionSolution
  1. The posterior is \[p\mid x\sim\operatorname{Beta}(4+18,6+30-18)=\operatorname{Beta}(22,18).\]

  2. The posterior mean is \[\mathbb{E}(p\mid x)=\frac{22}{22+18}=\frac{22}{40}=0.55.\]

  3. The MAP estimator is \[\widehat p_{\mathrm{MAP}}=\frac{22-1}{22+18-2}=\frac{21}{38}\approx 0.5526.\]

ImportantPractice Problem

Practice Problem 41 (Normal-normal posterior). Suppose \(X_1,\ldots,X_{16}\mid\mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,4)\) and \(\overline x=10\). Let \(\mu\sim\operatorname{Normal}(8,9)\).

  1. Find the posterior variance \(v_n\).

  2. Find the posterior mean \(m_n\).

  3. Give a \(95\%\) credible interval for \(\mu\).

CautionSolution

Here \(n=16\), \(\sigma^2=4\), \(\mu_0=8\), and \(\sigma_0^2=9\).

  1. \[v_n=\left(\frac{16}{4}+\frac{1}{9}\right)^{-1} =\left(4+\frac19\right)^{-1} =\frac{9}{37}.\]

  2. \[m_n=v_n\left(\frac{16\cdot 10}{4}+\frac{8}{9}\right) =\frac{9}{37}\left(40+\frac{8}{9}\right) =\frac{368}{37}\approx 9.946.\]

  3. A \(95\%\) credible interval is \[m_n\pm 1.96\sqrt{v_n} =9.946\pm 1.96\sqrt{\frac{9}{37}}.\] Since \(\sqrt{9/37}\approx 0.493\), the half-width is about \(0.967\) and the interval is approximately \[[8.979,10.913].\]

ImportantPractice Problem

Practice Problem 42 (Gamma-Poisson posterior). Suppose \(X_1,\ldots,X_8\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)\) and \(\sum x_i=20\). Let \(\lambda\sim\operatorname{Gamma}(3,2)\) using the rate parametrization.

  1. Find the posterior distribution.

  2. Find the posterior mean.

  3. Find the MAP estimator, assuming the posterior shape is larger than 1.

CautionSolution
  1. The posterior is \[\lambda\mid x\sim\operatorname{Gamma}(3+20,2+8)=\operatorname{Gamma}(23,10).\]

  2. The posterior mean is \[\mathbb{E}(\lambda\mid x)=\frac{23}{10}=2.3.\]

  3. For a gamma distribution with shape \(a\) and rate \(b\), the mode is \((a-1)/b\) when \(a>1\). Thus \[\widehat\lambda_{\mathrm{MAP}}=\frac{23-1}{10}=2.2.\]

ImportantPractice Problem

Practice Problem 43 (Bayesian test with posterior probability). Suppose the posterior distribution of a parameter is \[\theta\mid x\sim\operatorname{Normal}(1.2,0.25).\] Test \[H_0:\theta\le 0 \qquad\text{versus}\qquad H_1:\theta>0.\] Compute \(\mathbb{P}(H_0\mid x)\) and decide whether to reject \(H_0\) using the rule \(\mathbb{P}(H_0\mid x)<0.05\).

CautionSolution

The posterior standard deviation is \(\sqrt{0.25}=0.5\). Therefore \[\mathbb{P}(H_0\mid x)=\mathbb{P}(\theta\le 0\mid x) =\Phi\left(\frac{0-1.2}{0.5}\right) =\Phi(-2.4).\] Using the standard normal table, \[\Phi(-2.4)\approx 0.0082.\] Since \(0.0082<0.05\), we reject \(H_0\) using this Bayesian posterior-probability rule.

ImportantPractice Problem

Practice Problem 44 (Bayesian decision rule with unequal costs). For testing \(H_0:\theta\in\Theta_0\) versus \(H_1:\theta\in\Theta_0^c\), suppose \[\mathbb{P}(H_0\mid x)=0.30.\] The cost of Type I error is \(c_I=5\), and the cost of Type II error is \(c_{II}=1\). Should we reject \(H_0\)?

CautionSolution

Reject \(H_0\) if \[\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}=\frac{1}{5+1}=\frac16\approx 0.1667.\] Here \(\mathbb{P}(H_0\mid x)=0.30>0.1667\), so we do not reject \(H_0\). The high cost of Type I error makes the rejection rule more conservative.

ImportantPractice Problem

Practice Problem 45 (Equal-tail credible interval). Suppose the posterior distribution is \[\theta\mid x\sim\operatorname{Normal}(5,4).\] Find a \(90\%\) equal-tail credible interval.

CautionSolution

The posterior standard deviation is \(2\). For a \(90\%\) interval, \(z_{0.05}\approx 1.645\). Thus \[5\pm 1.645(2)=5\pm 3.29.\] The credible interval is \[[1.71,8.29].\]

ImportantPractice Problem

Practice Problem 46 (HPD interval concept). Suppose a posterior density is unimodal and skewed to the right. Explain why the \(95\%\) HPD interval may differ from the \(95\%\) equal-tail interval.

CautionSolution

The equal-tail interval places \(2.5\%\) posterior probability in each tail. The HPD interval instead contains the parameter values with highest posterior density and chooses the cutoff so that the total posterior probability is \(95\%\). For a skewed posterior, equal tails may include some low-density values in the long tail while excluding higher-density values on the other side. Therefore the HPD interval is usually shorter and has endpoints with equal posterior density, while the equal-tail interval is based on posterior quantiles.

33 Summary

This section summarizes the role of Bayesian inference across the main statistical tasks of the course.

TipKey idea

Section summary

  1. Bayesian inference starts with a prior \(\pi(\theta)\) and likelihood \(f(x\mid\theta)\).

  2. Bayes’ rule gives the posterior: \[\pi(\theta\mid x)\propto f(x\mid\theta)\pi(\theta).\]

  3. Point estimates can be posterior means, medians, or MAP estimates, depending on the loss function.

  4. Conjugate priors make posterior computation simple.

  5. Bayesian tests compare posterior probabilities of hypotheses, possibly weighted by error costs.

  6. Bayesian credible intervals give direct posterior probability statements.

  7. HPD regions are shortest credible regions for unimodal posterior distributions.

  8. Bayesian decisions are naturally derived by minimizing posterior expected loss.

| Task | Bayesian object | Common answer |
|---|---|---|
| Point estimation | Posterior distribution | Posterior mean, median, MAP |
| Testing | Posterior hypothesis probability | Reject if posterior risk is smaller |
| Interval estimation | Posterior credible probability | Equal-tail or HPD credible interval |
| Model updating | Prior and likelihood | Posterior distribution |
| Decision theory | Posterior expected loss | Bayes action |