21 Chapter 20: Bayesian Inference
This chapter collects and organizes the Bayesian ideas used throughout point estimation, hypothesis testing, interval estimation, and decision theory. The main message is that once we know the posterior distribution, we can derive point estimates, tests, credible intervals, and optimal decisions from one common framework.
Topics covered: Bayesian model ingredients; prior, likelihood, posterior, and marginal likelihood; Bayes estimators; posterior mean, median, mode, and MAP; conjugate priors; beta-binomial, gamma-Poisson, and normal-normal models; Bayes risk; Bayesian tests; credible intervals; highest posterior density regions; loss-function interpretation; practice problems and solutions.
22 Overview
This section collects the Bayesian material that appeared across point estimation, evaluating estimators, hypothesis testing, evaluating tests, interval estimation, and evaluating intervals.
Bayesian inference gives a unified method for learning from data. Instead of treating the parameter as an unknown but fixed constant, the Bayesian approach represents uncertainty about the parameter using a probability distribution. Data update this distribution through Bayes’ rule.
Main message. Bayesian inference is the rule \[\text{posterior} \propto \text{likelihood} \times \text{prior}.\] Once the posterior distribution is known, point estimation, hypothesis testing, and interval estimation can all be derived from it.
The core idea is simple but powerful. Before observing data, we describe prior information by a prior distribution. After observing data, we update the prior into a posterior distribution. All Bayesian inference is then based on this posterior distribution.
23 Bayesian Model Ingredients
This section introduces the basic objects of Bayesian inference: likelihood, prior, marginal likelihood, and posterior.
Suppose the observed data are \[X=(X_1,\ldots,X_n), \qquad x=(x_1,\ldots,x_n),\] with sampling density or mass function \[f(x\mid \theta).\] The parameter \(\theta\) is unknown. In the Bayesian approach, \(\theta\) is assigned a probability distribution.
Definition 1 (Prior distribution). The prior distribution \(\pi(\theta)\) describes our uncertainty or belief about \(\theta\) before observing the data.
Definition 2 (Likelihood). After observing \(X=x\), the likelihood function is \[L(\theta\mid x)=f(x\mid \theta).\] It measures how compatible each parameter value \(\theta\) is with the observed data.
Definition 3 (Marginal distribution). The marginal distribution or prior predictive distribution of the data is \[m(x)=\int f(x\mid \theta)\pi(\theta)\,d\theta.\] It is the normalizing constant that makes the posterior integrate to one.
Definition 4 (Posterior distribution). The posterior distribution of \(\theta\) given \(X=x\) is \[\pi(\theta\mid x)=\frac{f(x\mid \theta)\pi(\theta)}{m(x)} =\frac{f(x\mid \theta)\pi(\theta)}{\int f(x\mid \theta')\pi(\theta')\,d\theta'}.\]
Bayesian updating. The denominator \(m(x)\) does not depend on \(\theta\). Therefore, for many calculations, we write \[\pi(\theta\mid x)\propto f(x\mid \theta)\pi(\theta).\] That is, the posterior is proportional to the likelihood times the prior.
Example 5 (Bayesian ingredients for coin tossing). Suppose a coin has unknown probability \(p\) of heads. We toss the coin \(n\) times and observe \(x\) heads. The likelihood is \[f(x\mid p)=\binom{n}{x}p^x(1-p)^{n-x}, \qquad 0<p<1.\] If the prior is \(p\sim\operatorname{Beta}(\alpha,\beta)\), then \[\pi(p)=\frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1}.\] Find the posterior distribution.
Using Bayes’ rule, \[\pi(p\mid x)\propto \binom{n}{x}p^x(1-p)^{n-x}\cdot p^{\alpha-1}(1-p)^{\beta-1}.\] Ignoring constants that do not depend on \(p\), \[\pi(p\mid x)\propto p^{\alpha+x-1}(1-p)^{\beta+n-x-1}.\] This is the kernel of a beta distribution, so \[p\mid X=x\sim \operatorname{Beta}(\alpha+x,\beta+n-x).\]
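The conjugate update can be checked numerically. The sketch below normalizes likelihood times prior on a grid, which approximates \(m(x)\) by a midpoint rule, and compares the result with the closed-form beta density. The function names, the grid size, and the numerical values (a \(\operatorname{Beta}(2,2)\) prior with 12 heads in 20 tosses) are illustrative assumptions.

```python
import math

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p in (0, 1), computed on the log scale."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def grid_posterior(n, x, a, b, grid_size=2000):
    """Normalize likelihood * prior numerically on a midpoint grid over (0, 1)."""
    width = 1.0 / grid_size
    grid = [(i + 0.5) * width for i in range(grid_size)]
    unnormalized = [
        math.comb(n, x) * p**x * (1 - p) ** (n - x) * beta_pdf(p, a, b) for p in grid
    ]
    m_x = sum(unnormalized) * width  # midpoint-rule approximation to m(x)
    return grid, [u / m_x for u in unnormalized]

# Illustrative values: Beta(2, 2) prior, 12 heads in 20 tosses.
n, x, a, b = 20, 12, 2, 2
grid, posterior = grid_posterior(n, x, a, b)
max_gap = max(abs(d - beta_pdf(p, a + x, b + n - x)) for p, d in zip(grid, posterior))
print(f"largest gap from the Beta({a + x}, {b + n - x}) density: {max_gap:.2e}")
```

The gap should be close to zero, which is the numerical counterpart of recognizing the beta kernel instead of computing the integral.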
24 Bayesian Point Estimation
This section explains how to produce a single-number estimate from a posterior distribution.
In classical point estimation, estimators such as the method of moments estimator and the maximum likelihood estimator are functions of the data. In Bayesian inference, once the posterior distribution is obtained, we can summarize it by its mean, median, mode, or another loss-optimal point.
24.1 Posterior mean, median, and mode
This subsection introduces three common posterior summaries used as Bayes point estimates.
Definition 6 (Posterior mean). The posterior mean estimator is \[\widehat\theta_B=\mathbb{E}(\theta\mid x)=\int \theta\pi(\theta\mid x)\,d\theta.\] It is the Bayes estimator under squared error loss (Theorem 10).
Definition 7 (Posterior median). A posterior median is any value \(m\) satisfying \[\mathbb{P}(\theta\le m\mid x)\ge \frac12, \qquad \mathbb{P}(\theta\ge m\mid x)\ge \frac12.\] It is the Bayes estimator under absolute error loss.
Definition 8 (Posterior mode and MAP estimator). The maximum a posteriori estimator is \[\widehat\theta_{\mathrm{MAP}}=\arg\max_{\theta}\pi(\theta\mid x).\] Since \[\pi(\theta\mid x)\propto f(x\mid \theta)\pi(\theta),\] we can also write \[\widehat\theta_{\mathrm{MAP}}=\arg\max_{\theta}\{\log f(x\mid \theta)+\log \pi(\theta)\}.\]
MAP versus MLE. The maximum likelihood estimator maximizes only the likelihood: \[\widehat\theta_{\mathrm{MLE}}=\arg\max_\theta \log f(x\mid \theta).\] The MAP estimator maximizes the log-likelihood plus the log-prior: \[\widehat\theta_{\mathrm{MAP}}=\arg\max_\theta \{\log f(x\mid \theta)+\log \pi(\theta)\}.\] Under a flat prior, \(\log\pi(\theta)\) is constant in \(\theta\), so the MAP estimator coincides with the MLE.
24.2 Loss functions and Bayes estimators
This subsection connects Bayesian point estimates to decision theory.
A Bayesian point estimator can be chosen by minimizing posterior expected loss. Let \(a\) be an action, interpreted as an estimate of \(\theta\). A loss function \(L(\theta,a)\) measures the penalty of estimating \(\theta\) by \(a\).
Definition 9 (Posterior expected loss). Given posterior \(\pi(\theta\mid x)\), the posterior expected loss of action \(a\) is \[\rho(a\mid x)=\mathbb{E}[L(\theta,a)\mid x]=\int L(\theta,a)\pi(\theta\mid x)\,d\theta.\] A Bayes action minimizes \(\rho(a\mid x)\).
Theorem 10 (Common Bayes estimators). For a real-valued parameter \(\theta\):
1. Under squared error loss \(L(\theta,a)=(a-\theta)^2\), the Bayes estimator is the posterior mean.
2. Under absolute error loss \(L(\theta,a)=|a-\theta|\), the Bayes estimator is a posterior median.
3. Under the 0–1 loss \(L(\theta,a)=\mathbb{1}\{|a-\theta|>\varepsilon\}\), the Bayes estimator approaches a posterior mode (the MAP estimator) as \(\varepsilon\to 0\), for a continuous unimodal posterior density.
Proof. For squared error loss, \[\rho(a\mid x)=\mathbb{E}[(a-\theta)^2\mid x] =a^2-2a\mathbb{E}(\theta\mid x)+\mathbb{E}(\theta^2\mid x).\] This is a convex quadratic in \(a\), so setting its derivative to zero, \[2a-2\mathbb{E}(\theta\mid x)=0,\] gives the minimizer \(a=\mathbb{E}(\theta\mid x)\). The absolute loss result follows because a median minimizes expected absolute deviation. The mode result follows because, in the limit of a shrinking neighborhood, maximizing the posterior mass of the neighborhood is equivalent to maximizing the posterior density. ◻
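The absolute-loss step can be made explicit. As a sketch, assume the posterior has a continuous density, so that the posterior expected loss can be differentiated in \(a\): \[\rho(a\mid x)=\int_{-\infty}^{a}(a-\theta)\pi(\theta\mid x)\,d\theta+\int_{a}^{\infty}(\theta-a)\pi(\theta\mid x)\,d\theta,\] \[\frac{d}{da}\rho(a\mid x)=\mathbb{P}(\theta\le a\mid x)-\mathbb{P}(\theta>a\mid x).\] The derivative is nondecreasing in \(a\), so \(\rho\) is convex, and setting the derivative to zero gives \(\mathbb{P}(\theta\le a\mid x)=\frac12\). Under this assumption the minimizer is therefore a posterior median.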
Example 11 (Beta-binomial posterior mean, variance, and mode). Suppose \(X\sim\operatorname{Binomial}(n,p)\) and \(p\sim\operatorname{Beta}(\alpha,\beta)\). If \(X=x\), find the posterior mean, variance, and mode of \(p\).
The posterior is \[p\mid x\sim \operatorname{Beta}(\alpha+x,\beta+n-x).\] Let \[\alpha^*=\alpha+x, \qquad \beta^*=\beta+n-x.\] Then \[\mathbb{E}(p\mid x)=\frac{\alpha^*}{\alpha^*+\beta^*} =\frac{\alpha+x}{\alpha+\beta+n}.\] The posterior variance is \[\operatorname{Var}(p\mid x)=\frac{\alpha^*\beta^*}{(\alpha^*+\beta^*)^2(\alpha^*+\beta^*+1)} =\frac{(\alpha+x)(\beta+n-x)}{(\alpha+\beta+n)^2(\alpha+\beta+n+1)}.\] If \(\alpha^*>1\) and \(\beta^*>1\), the posterior mode is \[\frac{\alpha^*-1}{\alpha^*+\beta^*-2} =\frac{\alpha+x-1}{\alpha+\beta+n-2}.\]
Example 12 (Pseudo-count interpretation). Suppose \(p\sim\operatorname{Beta}(3,5)\) and we observe \(x=5\) heads in \(n=6\) tosses. Find the posterior distribution and posterior mean.
The posterior is \[p\mid X=5\sim\operatorname{Beta}(3+5,5+6-5)=\operatorname{Beta}(8,6).\] The posterior mean is \[\mathbb{E}(p\mid X=5)=\frac{8}{8+6}=\frac{8}{14}=0.5714.\] The prior contributes \(3\) prior successes and \(5\) prior failures. The data contribute \(5\) successes and \(1\) failure. Thus the posterior has \(8\) pseudo-successes and \(6\) pseudo-failures.
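The formulas of Examples 11 and 12 can be collected in a small helper. This is a minimal sketch: the function name is hypothetical, and exact fractions are used, which assumes integer hyperparameters.

```python
from fractions import Fraction

def beta_binomial_summary(alpha, beta, n, x):
    """Posterior parameters, mean, variance, and mode for a Beta(alpha, beta)
    prior after observing x successes in n trials (integer inputs assumed)."""
    a_post = alpha + x
    b_post = beta + n - x
    total = a_post + b_post
    mean = Fraction(a_post, total)
    variance = Fraction(a_post * b_post, total**2 * (total + 1))
    # The interior mode formula requires both posterior parameters to exceed 1.
    mode = Fraction(a_post - 1, total - 2) if a_post > 1 and b_post > 1 else None
    return a_post, b_post, mean, variance, mode

# Example 12: Beta(3, 5) prior, 5 heads in 6 tosses.
a_post, b_post, mean, variance, mode = beta_binomial_summary(alpha=3, beta=5, n=6, x=5)
print(a_post, b_post)  # 8 6
print(mean)            # 4/7
print(mode)            # 7/12
```

The posterior mean \(4/7\approx 0.5714\) agrees with Example 12, and the mode \(7/12\) follows from the formula in Example 11.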
25 Conjugate Priors
This section studies conjugate priors, which make Bayesian updating especially simple.
A conjugate prior is useful because the posterior stays in the same distribution family as the prior. Then Bayesian updating often becomes a simple rule of adding data summaries to prior hyperparameters.
Definition 13 (Conjugate prior). For a likelihood function \(f(x\mid \theta)\), a prior family \(\pi(\theta)\) is called conjugate if the posterior \(\pi(\theta\mid x)\) belongs to the same family as the prior.
Why conjugacy helps. If a prior is conjugate, then we can often skip difficult integration. The posterior distribution is known up to updated parameters, and the update rule is usually easy to compute.
25.1 Beta-binomial conjugacy
This subsection reviews the most important conjugate pair for proportions.
Theorem 14 (Beta-binomial model). If \[X\mid p\sim\operatorname{Binomial}(n,p), \qquad p\sim\operatorname{Beta}(\alpha,\beta),\] then \[p\mid X=x\sim\operatorname{Beta}(\alpha+x,\beta+n-x).\]
Example 15 (Bayesian estimator for a Bernoulli probability). Suppose \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)\) and \(S=\sum_{i=1}^n X_i\). Let \(p\sim\operatorname{Beta}(\alpha,\beta)\). Find the Bayes estimator under squared error loss.
Since \(S\mid p\sim\operatorname{Binomial}(n,p)\), the posterior is \[p\mid S\sim\operatorname{Beta}(\alpha+S,\beta+n-S).\] Under squared error loss, the Bayes estimator is the posterior mean: \[\widehat p_B=\mathbb{E}(p\mid S)=\frac{\alpha+S}{\alpha+\beta+n}.\] This shrinks the sample proportion \(S/n\) toward the prior mean \(\alpha/(\alpha+\beta)\).
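The shrinkage can be written out as a weighted average, obtained by splitting the numerator of \(\widehat p_B\): \[\widehat p_B=\frac{n}{\alpha+\beta+n}\cdot\frac{S}{n}+\frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta}.\] The weight on the sample proportion is \(n/(\alpha+\beta+n)\), which tends to \(1\) as \(n\to\infty\), so the prior mean matters less as data accumulate. The quantity \(\alpha+\beta\) acts as a prior sample size.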
25.2 Gamma-Poisson conjugacy
This subsection reviews the conjugate prior for a Poisson rate.
We use the rate parametrization for the gamma distribution: \[\lambda\sim\operatorname{Gamma}(\alpha,\beta), \qquad \pi(\lambda)=\frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}, \qquad \lambda>0.\]
Theorem 16 (Gamma-Poisson model). If \[X_1,\ldots,X_n\mid \lambda \overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda), \qquad \lambda\sim\operatorname{Gamma}(\alpha,\beta),\] then \[\lambda\mid x\sim \operatorname{Gamma}\left(\alpha+\sum_{i=1}^n x_i,\,\beta+n\right).\]
Proof. The likelihood is \[f(x\mid \lambda)=\prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \propto e^{-n\lambda}\lambda^{\sum x_i}.\] Multiplying by the gamma prior gives \[\pi(\lambda\mid x)\propto e^{-n\lambda}\lambda^{\sum x_i}\cdot \lambda^{\alpha-1}e^{-\beta\lambda} =\lambda^{\alpha+\sum x_i-1}e^{-(\beta+n)\lambda}.\] This is the kernel of \(\operatorname{Gamma}(\alpha+\sum x_i,\beta+n)\). ◻
Example 17 (Poisson posterior). Suppose \(n=10\) Poisson observations have total count \(\sum x_i=26\). Let the prior be \(\lambda\sim\operatorname{Gamma}(2,1)\). Find the posterior mean.
The posterior is \[\lambda\mid x\sim \operatorname{Gamma}(2+26,1+10)=\operatorname{Gamma}(28,11).\] For a gamma distribution with rate \(\beta\), the mean is \(\alpha/\beta\). Therefore, \[\mathbb{E}(\lambda\mid x)=\frac{28}{11}\approx 2.545.\]
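The gamma-Poisson update is a two-line computation. In the sketch below the function name is hypothetical, and the individual counts are made up solely so that they total 26 over 10 observations, as in Example 17. Only the sum and the sample size enter the posterior.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Update a Gamma(alpha, beta) prior (rate parametrization) with Poisson counts."""
    shape = alpha + sum(counts)  # prior shape plus total count
    rate = beta + len(counts)    # prior rate plus number of observations
    mean = shape / rate
    variance = shape / rate**2
    # The interior mode (shape - 1) / rate exists only when shape > 1.
    mode = (shape - 1) / rate if shape > 1 else None
    return shape, rate, mean, variance, mode

# Hypothetical counts with n = 10 and total 26, matching the totals in Example 17.
counts = [3, 2, 4, 1, 2, 3, 5, 2, 1, 3]
shape, rate, mean, variance, mode = gamma_poisson_update(alpha=2, beta=1, counts=counts)
print(shape, rate)     # 28 11
print(round(mean, 3))  # 2.545
```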
25.3 Normal-normal conjugacy
This subsection reviews the conjugate model for a normal mean with known variance.
Suppose \[X_1,\ldots,X_n\mid \mu \overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2),\] where \(\sigma^2\) is known. Let the prior be \[\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2).\]
Theorem 18 (Normal-normal posterior). The posterior distribution of \(\mu\) is normal: \[\mu\mid x\sim\operatorname{Normal}(m_n,v_n),\] where \[v_n=\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)^{-1} =\frac{\sigma^2\sigma_0^2}{\sigma^2+n\sigma_0^2},\] and \[m_n=v_n\left(\frac{n\overline x}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right) =\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}.\]
Weighted average interpretation. The posterior mean is a weighted average of the sample mean \(\overline x\) and the prior mean \(\mu_0\): \[m_n=\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\overline x +\frac{\sigma^2}{n\sigma_0^2+\sigma^2}\mu_0.\] If \(\sigma_0^2\to\infty\), the prior becomes diffuse and \(m_n\to\overline x\), the MLE.
Example 19 (MAP for a normal mean). Suppose \(X_1,\ldots,X_n\mid \mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)\) with known \(\sigma^2\), and suppose \(\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2)\). Find the MAP estimator.
The posterior is normal with mean \(m_n\) and variance \(v_n\). A normal density is maximized at its mean, so \[\widehat\mu_{\mathrm{MAP}}=m_n =\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}.\] Equivalently, \[\widehat\mu_{\mathrm{MAP}} =\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\overline x +\frac{\sigma^2}{n\sigma_0^2+\sigma^2}\mu_0.\] Because the posterior is normal, the posterior mean and posterior mode are the same.
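A short sketch of the normal-normal update, written in terms of precisions (inverse variances). The function name is hypothetical; the numerical values are those of Practice Problem 41.

```python
def normal_normal_update(xbar, n, sigma2, mu0, sigma0_2):
    """Posterior mean and variance of a normal mean with known variance sigma2
    and a Normal(mu0, sigma0_2) prior. Also returns the weight on the sample mean."""
    precision = n / sigma2 + 1.0 / sigma0_2  # posterior precision = data + prior
    v_n = 1.0 / precision
    m_n = v_n * (n * xbar / sigma2 + mu0 / sigma0_2)
    weight_data = n * sigma0_2 / (n * sigma0_2 + sigma2)
    return m_n, v_n, weight_data

# Values from Practice Problem 41: n = 16, sigma^2 = 4, xbar = 10, prior Normal(8, 9).
m_n, v_n, weight_data = normal_normal_update(xbar=10, n=16, sigma2=4, mu0=8, sigma0_2=9)
print(round(m_n, 3), round(v_n, 4), round(weight_data, 4))
```

Here the data weight is \(36/37\), so the posterior mean (which is also the MAP estimate) lies much closer to \(\overline x=10\) than to the prior mean \(8\).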
26 Bayes Risk and Evaluating Estimators
This section explains how Bayesian decision theory evaluates estimators by averaging risk over a prior distribution.
In frequentist evaluation, the risk \(R(\theta,\delta)\) is viewed as a function of the unknown parameter \(\theta\). In Bayesian evaluation, we average this risk over the prior distribution.
Definition 20 (Risk function). For decision rule \(\delta(X)\) and loss function \(L(\theta,a)\), the risk function is \[R(\theta,\delta)=\mathbb{E}_\theta[L(\theta,\delta(X))].\]
Definition 21 (Bayes risk). Given prior distribution \(\pi(\theta)\), the Bayes risk of \(\delta\) is \[r(\pi,\delta)=\int R(\theta,\delta)\pi(\theta)\,d\theta.\] Equivalently, \[r(\pi,\delta)=\int\int L(\theta,\delta(x))f(x\mid \theta)\pi(\theta)\,dx\,d\theta.\]
Theorem 22 (Posterior risk minimization). A Bayes rule can be found by minimizing posterior expected loss for each observed \(x\): \[\delta_B(x)\in \arg\min_a \int L(\theta,a)\pi(\theta\mid x)\,d\theta.\]
Example 23 (MSE of a beta-binomial Bayes estimator). Let \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)\) and \(S=\sum X_i\). Consider \[\widehat p_B=\frac{\alpha+S}{\alpha+\beta+n}.\] Find its frequentist MSE as a function of \(p\).
Since \(S\sim\operatorname{Binomial}(n,p)\), \[\mathbb{E}_p(\widehat p_B)=\frac{\alpha+np}{\alpha+\beta+n}, \qquad \operatorname{Var}_p(\widehat p_B)=\frac{np(1-p)}{(\alpha+\beta+n)^2}.\] The bias is \[\operatorname{Bias}_p(\widehat p_B) =\frac{\alpha+np}{\alpha+\beta+n}-p =\frac{\alpha-(\alpha+\beta)p}{\alpha+\beta+n}.\] Therefore, \[\operatorname{MSE}_p(\widehat p_B) =\frac{np(1-p)}{(\alpha+\beta+n)^2} +\left(\frac{\alpha-(\alpha+\beta)p}{\alpha+\beta+n}\right)^2.\] This shows the bias-variance tradeoff introduced by the prior.
Remark 24 (Why shrinkage can help). The estimator \(\widehat p_B\) is usually biased as a frequentist estimator, but it can have smaller MSE than the MLE \(\widehat p=S/n\) for some values of \(p\), especially when \(n\) is small and the prior information is reasonable.
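The comparison in Remark 24 can be explored numerically. The sketch below evaluates the MSE formula of Example 23 next to the MLE's MSE, \(p(1-p)/n\). The function names and the choice \(n=10\), \(\alpha=\beta=2\) are illustrative assumptions.

```python
def mse_bayes(p, n, alpha, beta):
    """Frequentist MSE of (alpha + S) / (alpha + beta + n) at a true value p."""
    total = alpha + beta + n
    variance = n * p * (1 - p) / total**2
    bias = (alpha - (alpha + beta) * p) / total
    return variance + bias**2

def mse_mle(p, n):
    """MSE of the unbiased MLE S / n, which equals its variance."""
    return p * (1 - p) / n

# Illustrative setting: small sample, Beta(2, 2) prior centered at 1/2.
n, alpha, beta = 10, 2, 2
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p={p:.1f}  Bayes MSE={mse_bayes(p, n, alpha, beta):.4f}  MLE MSE={mse_mle(p, n):.4f}")
```

In this setting the Bayes estimator has the smaller MSE when \(p\) is near the prior mean \(1/2\) and the larger MSE when \(p\) is near \(0\) or \(1\), where the bias term dominates.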
27 Bayesian Hypothesis Testing
This section explains how Bayesian inference turns hypothesis testing into posterior probability comparison.
In classical testing, we control Type I error, study power, and often use p-values or likelihood ratio tests. In Bayesian testing, the posterior distribution directly assigns probabilities to hypotheses.
27.1 Posterior probability tests
This subsection defines Bayesian tests through posterior probabilities of the null and alternative parameter regions.
Let the hypotheses be \[H_0:\theta\in\Theta_0, \qquad H_1:\theta\in\Theta_0^c.\] Given posterior \(\pi(\theta\mid x)\), \[\mathbb{P}(H_0\mid x)=\mathbb{P}(\theta\in\Theta_0\mid x)=\int_{\Theta_0}\pi(\theta\mid x)\,d\theta,\] and \[\mathbb{P}(H_1\mid x)=\mathbb{P}(\theta\in\Theta_0^c\mid x)=\int_{\Theta_0^c}\pi(\theta\mid x)\,d\theta.\]
Definition 25 (Bayesian posterior probability test). A simple Bayesian decision rule rejects \(H_0\) when \[\mathbb{P}(H_0\mid x)<\mathbb{P}(H_1\mid x),\] equivalently, when \[\mathbb{P}(H_0\mid x)<\frac12.\] A more conservative rule rejects \(H_0\) when \[\mathbb{P}(H_0\mid x)<0.05.\]
Example 26 (Bayesian one-sided normal mean test). Suppose \[X_1,\ldots,X_n\mid \mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2),\] where \(\sigma^2\) is known, and suppose \[\mu\sim\operatorname{Normal}(\theta,\tau^2).\] Test \[H_0:\mu\le \mu_0 \qquad\text{versus}\qquad H_1:\mu>\mu_0.\] Find a posterior-probability rejection rule.
The posterior distribution is \[\mu\mid x\sim\operatorname{Normal}(m_n,v_n),\] where \[m_n=\frac{\tau^2\sum_{i=1}^n x_i+\sigma^2\theta}{n\tau^2+\sigma^2}, \qquad v_n=\frac{\sigma^2\tau^2}{\sigma^2+n\tau^2}.\] The posterior probability of the null is \[\mathbb{P}(H_0\mid x)=\mathbb{P}(\mu\le \mu_0\mid x) =\Phi\left(\frac{\mu_0-m_n}{\sqrt{v_n}}\right).\] If we reject when \(\mathbb{P}(H_0\mid x)<1/2\), then because the posterior is normal and symmetric, this is equivalent to \[m_n>\mu_0.\] That is, \[\frac{\tau^2\sum_{i=1}^n x_i+\sigma^2\theta}{n\tau^2+\sigma^2}>\mu_0.\] Equivalently, \[\overline x>\mu_0+\frac{\sigma^2(\mu_0-\theta)}{n\tau^2}.\]
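The rejection rule can be checked with the standard library's `statistics.NormalDist`. This is a sketch with hypothetical function names and illustrative numbers (\(n=25\), \(\sigma^2=4\), prior \(\operatorname{Normal}(0,1)\), \(\mu_0=0.5\)); at the threshold value of \(\overline x\), the posterior probability of \(H_0\) should equal \(1/2\).

```python
from statistics import NormalDist

def posterior_null_probability(xbar, n, sigma2, prior_mean, prior_var, mu0):
    """P(mu <= mu0 | x) in the normal-normal model with known sigma2."""
    v_n = sigma2 * prior_var / (sigma2 + n * prior_var)
    m_n = (prior_var * n * xbar + sigma2 * prior_mean) / (n * prior_var + sigma2)
    return NormalDist(mu=m_n, sigma=v_n**0.5).cdf(mu0)

def xbar_threshold(n, sigma2, prior_mean, prior_var, mu0):
    """Smallest sample mean at which the rule P(H0 | x) < 1/2 starts to reject."""
    return mu0 + sigma2 * (mu0 - prior_mean) / (n * prior_var)

# Illustrative values, not taken from the text.
n, sigma2, prior_mean, prior_var, mu0 = 25, 4.0, 0.0, 1.0, 0.5
threshold = xbar_threshold(n, sigma2, prior_mean, prior_var, mu0)
prob_at_threshold = posterior_null_probability(threshold, n, sigma2, prior_mean, prior_var, mu0)
prob_above = posterior_null_probability(threshold + 0.5, n, sigma2, prior_mean, prior_var, mu0)
print(round(threshold, 3), round(prob_at_threshold, 3), round(prob_above, 4))
```

Because the prior mean lies below \(\mu_0\) in this illustration, the threshold exceeds \(\mu_0\): the data must overcome the prior before \(H_0\) is rejected.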
27.2 Bayesian tests with loss functions
This subsection connects Bayesian tests to error costs.
Suppose the action space is \(\{a_0,a_1\}\), where \(a_0\) means “accept or fail to reject \(H_0\)” and \(a_1\) means “reject \(H_0\)”. A generalized 0–1 loss is \[L(\theta,a_0)= \begin{cases} 0, & \theta\in\Theta_0,\\ c_{II}, & \theta\in\Theta_0^c, \end{cases}\] and \[L(\theta,a_1)= \begin{cases} c_I, & \theta\in\Theta_0,\\ 0, & \theta\in\Theta_0^c. \end{cases}\] Here \(c_I\) is the cost of a Type I error and \(c_{II}\) is the cost of a Type II error.
Theorem 27 (Bayes test under generalized 0–1 loss). Under the above loss, reject \(H_0\) if \[c_I\mathbb{P}(H_0\mid x)<c_{II}\mathbb{P}(H_1\mid x).\] Equivalently, \[\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}.\]
Proof. The posterior expected loss of accepting \(H_0\) is \[\rho(a_0\mid x)=c_{II}\mathbb{P}(H_1\mid x).\] The posterior expected loss of rejecting \(H_0\) is \[\rho(a_1\mid x)=c_I\mathbb{P}(H_0\mid x).\] We reject when \(\rho(a_1\mid x)<\rho(a_0\mid x)\), namely \[c_I\mathbb{P}(H_0\mid x)<c_{II}\mathbb{P}(H_1\mid x).\] Since \(\mathbb{P}(H_1\mid x)=1-\mathbb{P}(H_0\mid x)\), this is equivalent to \[\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}.\] ◻
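Theorem 27 translates directly into a decision function. The sketch below uses a hypothetical function name and compares the two posterior expected losses rather than the threshold, so the equivalence of the two forms can be seen in the output. The inputs are those of Practice Problem 44.

```python
def bayes_test(prob_h0, cost_type1, cost_type2):
    """Bayes action under generalized 0-1 loss.

    Returns (reject, threshold, loss_accept, loss_reject)."""
    loss_accept = cost_type2 * (1 - prob_h0)  # posterior expected loss of a_0
    loss_reject = cost_type1 * prob_h0        # posterior expected loss of a_1
    threshold = cost_type2 / (cost_type1 + cost_type2)
    return loss_reject < loss_accept, threshold, loss_accept, loss_reject

# Practice Problem 44: P(H0 | x) = 0.30, c_I = 5, c_II = 1.
reject, threshold, loss_accept, loss_reject = bayes_test(0.30, cost_type1=5, cost_type2=1)
print(reject, round(threshold, 4), round(loss_accept, 2), round(loss_reject, 2))

# With equal costs the rule reduces to P(H0 | x) < 1/2.
reject_equal_costs = bayes_test(0.30, cost_type1=1, cost_type2=1)[0]
print(reject_equal_costs)
```

The same posterior probability leads to different actions under the two cost settings, which is the point of letting \(c_I\) and \(c_{II}\) differ.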
Remark 28 (Connection to classical risk). In classical testing, the risk under generalized 0–1 loss is \[R(\theta,\delta)= \begin{cases} c_I\beta(\theta), & \theta\in\Theta_0,\\ c_{II}[1-\beta(\theta)], & \theta\in\Theta_0^c, \end{cases}\] where \(\beta(\theta)=\mathbb{P}_\theta(\text{reject }H_0)\) is the power function. Bayesian testing averages this risk over the prior distribution to obtain the Bayes risk. By Theorem 22, minimizing the Bayes risk is equivalent to minimizing the posterior expected loss for each observed \(x\).
28 Bayesian Interval Estimation
This section introduces Bayesian credible intervals and contrasts them with frequentist confidence intervals.
In frequentist confidence intervals, the interval is random and the parameter is fixed. In Bayesian credible intervals, the parameter is random under the posterior distribution, so probability statements about \(\theta\) belonging to an observed interval are meaningful within the model.
Definition 29 (Credible set). A set \(C(x)\subseteq\Theta\) is a \(100(1-\alpha)\%\) credible set if \[\mathbb{P}(\theta\in C(x)\mid x)=\int_{C(x)}\pi(\theta\mid x)\,d\theta=1-\alpha.\] If \(C(x)=[a,b]\), then \([a,b]\) is called a credible interval.
Interpretation warning. A frequentist \(95\%\) confidence interval does not mean that there is a \(95\%\) probability that the fixed parameter lies in the observed interval. A Bayesian \(95\%\) credible interval does allow the statement: \[\mathbb{P}(\theta\in C(x)\mid x)=0.95.\] This interpretation depends on the prior and the Bayesian model.
28.1 Equal-tail credible intervals
This subsection introduces credible intervals based on posterior quantiles.
Definition 30 (Equal-tail credible interval). A \(100(1-\alpha)\%\) equal-tail credible interval \([a,b]\) satisfies \[\mathbb{P}(\theta<a\mid x)=\frac{\alpha}{2}, \qquad \mathbb{P}(\theta>b\mid x)=\frac{\alpha}{2}.\] Equivalently, \(a\) and \(b\) are the posterior \(\alpha/2\) and \(1-\alpha/2\) quantiles.
Example 31 (Beta-binomial credible interval). Suppose \(X\sim\operatorname{Binomial}(n,p)\), \(p\sim\operatorname{Beta}(2,2)\), \(n=20\), and \(x=12\). Find the posterior distribution and describe the \(95\%\) equal-tail credible interval.
The posterior is \[p\mid X=12\sim\operatorname{Beta}(2+12,2+20-12)=\operatorname{Beta}(14,10).\] The \(95\%\) equal-tail credible interval is \[\left[q_{0.025},q_{0.975}\right],\] where \(q_c\) is the \(c\)th quantile of the \(\operatorname{Beta}(14,10)\) distribution. Numerically, this interval is approximately \[[0.385,0.768].\] This means that, under the beta-binomial Bayesian model, \[\mathbb{P}(0.385\le p\le 0.768\mid X=12)\approx 0.95.\]
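The quantiles in Example 31 have no closed form, but they can be approximated with the standard library alone. The sketch below is one possible approach, not necessarily how the quoted interval was obtained: it integrates the beta density with Simpson's rule and inverts the CDF by bisection. The function names are hypothetical, and the endpoint handling assumes both posterior parameters exceed 1, so the density vanishes at 0 and 1.

```python
import math

def beta_pdf(p, a, b):
    """Beta(a, b) density; returns 0 at the endpoints (valid when a, b > 1)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def beta_cdf(q, a, b, steps=2000):
    """Composite Simpson approximation to P(p <= q); steps must be even."""
    h = q / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(q, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3

def beta_quantile(prob, a, b, tol=1e-10):
    """Invert the CDF by bisection on (0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Posterior from Example 31: Beta(14, 10), 95% equal-tail interval.
lower = beta_quantile(0.025, 14, 10)
upper = beta_quantile(0.975, 14, 10)
coverage = beta_cdf(upper, 14, 10) - beta_cdf(lower, 14, 10)
print(round(lower, 3), round(upper, 3), round(coverage, 4))
```

The printed endpoints can be compared with the interval quoted in Example 31. A statistics library with a beta quantile function would do the same job in one call.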
Example 32 (Normal-normal credible interval). Suppose \(X_1,\ldots,X_n\mid\mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)\) with known \(\sigma^2\), and \(\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2)\). Find a \(100(1-\alpha)\%\) credible interval for \(\mu\).
The posterior is \[\mu\mid x\sim\operatorname{Normal}(m_n,v_n),\] where \[m_n=\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}, \qquad v_n=\frac{\sigma^2\sigma_0^2}{\sigma^2+n\sigma_0^2}.\] Therefore a \(100(1-\alpha)\%\) credible interval is \[\left[m_n-z_{\alpha/2}\sqrt{v_n},\;m_n+z_{\alpha/2}\sqrt{v_n}\right],\] where \(z_{\alpha/2}\) satisfies \(\mathbb{P}(Z>z_{\alpha/2})=\alpha/2\) for \(Z\sim\operatorname{Normal}(0,1)\).
Example 33 (Gamma-Poisson credible interval). Suppose \(X_1,\ldots,X_n\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)\) and \(\lambda\sim\operatorname{Gamma}(\alpha,\beta)\) in the rate parametrization. Derive the posterior credible interval for \(\lambda\).
The posterior is \[\lambda\mid x\sim \operatorname{Gamma}\left(\alpha+\sum_{i=1}^n x_i,\beta+n\right).\] A \(100(1-\alpha_0)\%\) equal-tail credible interval is \[\left[q_{\alpha_0/2},q_{1-\alpha_0/2}\right],\] where \(q_c\) is the \(c\)th quantile of the posterior gamma distribution. For example, if \(n=10\), \(\sum x_i=26\), and \(\lambda\sim\operatorname{Gamma}(2,1)\), then \[\lambda\mid x\sim\operatorname{Gamma}(28,11),\] and the interval is obtained from the quantiles of \(\operatorname{Gamma}(28,11)\).
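When the posterior shape is a positive integer, as for \(\operatorname{Gamma}(28,11)\), the gamma CDF has a finite-sum form through the identity \(\mathbb{P}(\operatorname{Gamma}(a,b)\le t)=\mathbb{P}(\operatorname{Poisson}(bt)\ge a)\). The sketch below uses that identity with bisection to compute the equal-tail interval. The function names are hypothetical, and the integer-shape restriction is an assumption of this sketch, not of the model.

```python
import math

def gamma_cdf_integer_shape(t, shape, rate):
    """P(lambda <= t) for Gamma(shape, rate) with a positive integer shape,
    using P(Gamma(shape, rate) <= t) = P(Poisson(rate * t) >= shape)."""
    if t <= 0:
        return 0.0
    mu = rate * t
    term = math.exp(-mu)  # Poisson probability of the count 0
    below = term
    for k in range(1, shape):
        term *= mu / k    # Poisson probability of the count k
        below += term
    return 1.0 - below

def gamma_quantile(prob, shape, rate, tol=1e-10):
    """Invert the CDF by bisection after expanding the bracket."""
    lo, hi = 0.0, 1.0
    while gamma_cdf_integer_shape(hi, shape, rate) < prob:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gamma_cdf_integer_shape(mid, shape, rate) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Posterior from Example 33: Gamma(28, 11), 95% equal-tail interval.
shape, rate, alpha0 = 28, 11, 0.05
lower = gamma_quantile(alpha0 / 2, shape, rate)
upper = gamma_quantile(1 - alpha0 / 2, shape, rate)
coverage = gamma_cdf_integer_shape(upper, shape, rate) - gamma_cdf_integer_shape(lower, shape, rate)
print(round(lower, 3), round(upper, 3), round(coverage, 4))
```

The interval is not symmetric about the posterior mean \(28/11\), because the gamma posterior is skewed to the right.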
29 Highest Posterior Density Regions
This section explains the shortest Bayesian credible regions for unimodal posterior distributions.
Equal-tail intervals are easy to compute, but they are not always the shortest credible intervals. For a unimodal posterior density, the shortest credible region is formed by keeping the parameter values with highest posterior density.
Definition 34 (Highest posterior density region). A \(100(1-\alpha)\%\) highest posterior density (HPD) credible region is a set \[C_{\mathrm{HPD}}(x)=\{\theta:\pi(\theta\mid x)\ge k\},\] where \(k\) is chosen so that \[\int_{C_{\mathrm{HPD}}(x)}\pi(\theta\mid x)\,d\theta=1-\alpha.\]
Theorem 35 (Shortest credible interval for unimodal posterior). If \(\pi(\theta\mid x)\) is unimodal, then the shortest credible interval with posterior probability \(1-\alpha\) is the HPD interval. Its endpoints have equal posterior density, except possibly at a boundary of the parameter space.
Example 36 (Poisson HPD region). Suppose \[X_1,\ldots,X_n\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda),\] and use a conjugate gamma prior. If \[\lambda\mid x\sim\operatorname{Gamma}\left(a+\sum x_i,\, n+b\right)\] under a rate parametrization, then a \(100(1-\alpha)\%\) HPD credible region has the form \[\{\lambda:\pi(\lambda\mid x)\ge k\},\] where \(k\) is chosen to make the posterior probability equal to \(1-\alpha\).
The posterior density is a gamma density. If it is unimodal, the HPD region contains the highest-density values around the posterior mode. The cutoff \(k\) is determined by solving \[\int_{\{\lambda:\pi(\lambda\mid x)\ge k\}}\pi(\lambda\mid x)\,d\lambda=1-\alpha.\] For example, in the lecture notes case with \(a=b=1\), \(n=10\), and \(\sum x_i=6\), the posterior is \(\operatorname{Gamma}(7,11)\) and the \(90\%\) HPD credible set is approximately \[[0.253,1.005],\] while the corresponding equal-tail interval is slightly longer.
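Solving for the cutoff \(k\) can be organized as a search over the lower endpoint. The sketch below is one possible numerical approach, with hypothetical function names, for a gamma posterior with integer shape greater than 1: for each candidate lower endpoint it finds the point above the mode with equal density, then bisects until the enclosed posterior mass equals the target level.

```python
import math

def gamma_log_kernel(t, shape, rate):
    """Log of the Gamma(shape, rate) density up to an additive constant."""
    return (shape - 1) * math.log(t) - rate * t

def gamma_cdf_integer_shape(t, shape, rate):
    """P(lambda <= t) for integer shape, via P(Poisson(rate * t) >= shape)."""
    if t <= 0:
        return 0.0
    mu = rate * t
    term = math.exp(-mu)
    below = term
    for k in range(1, shape):
        term *= mu / k
        below += term
    return 1.0 - below

def matching_upper(lower, shape, rate):
    """Find the point above the mode whose density equals the density at lower."""
    mode = (shape - 1) / rate
    target = gamma_log_kernel(lower, shape, rate)
    lo, hi = mode, mode + 1.0
    while gamma_log_kernel(hi, shape, rate) > target:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if gamma_log_kernel(mid, shape, rate) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def hpd_interval(level, shape, rate):
    """Bisect on the lower endpoint until the equal-density interval has mass level."""
    lo, hi = 0.0, (shape - 1) / rate
    for _ in range(100):
        lower = (lo + hi) / 2
        upper = matching_upper(lower, shape, rate)
        mass = gamma_cdf_integer_shape(upper, shape, rate) - gamma_cdf_integer_shape(lower, shape, rate)
        if mass > level:
            lo = lower  # too much mass: move the lower endpoint toward the mode
        else:
            hi = lower
    lower = (lo + hi) / 2
    return lower, matching_upper(lower, shape, rate)

# Posterior from the example above: Gamma(7, 11), 90% HPD interval.
lower, upper = hpd_interval(0.90, shape=7, rate=11)
print(round(lower, 3), round(upper, 3))
```

The result can be compared with the interval \([0.253,1.005]\) quoted above. The two endpoints have equal posterior density by construction.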
Remark 37 (Equal-tail versus HPD). Equal-tail intervals split posterior probability equally between the two tails. HPD intervals minimize length by taking the most plausible parameter values first. For symmetric unimodal posteriors, the equal-tail and HPD intervals coincide. For skewed posteriors, they generally differ.
30 Bayesian Optimality for Intervals
This section connects credible intervals to loss-function optimality.
Bayesian interval estimation can be framed as a decision problem: the action is choosing a set \(C\), and the loss penalizes long intervals while rewarding coverage of the true parameter.
Definition 38 (Interval loss function). One simple loss function for choosing a confidence or credible set \(C\) is \[L(\theta,C)=b\cdot \operatorname{Length}(C)-\mathbb{1}\{\theta\in C\},\] where \(b>0\) controls the tradeoff between short length and high coverage.
The corresponding frequentist risk is \[R(\theta,C)=b\mathbb{E}_\theta[\operatorname{Length}(C(X))]-\mathbb{P}_\theta(\theta\in C(X)).\] The Bayesian posterior expected loss is \[\rho(C\mid x)=b\cdot \operatorname{Length}(C)-\mathbb{P}(\theta\in C\mid x).\]
Interpretation. A large \(b\) penalizes length heavily and favors shorter intervals. A small \(b\) favors higher posterior coverage. The constant \(b\) thus makes the tradeoff between interval length and coverage explicit.
Example 39 (Normal interval risk). Suppose \(X\sim\operatorname{Normal}(\mu,\sigma^2)\) with known \(\sigma^2\), and consider symmetric intervals \[C(X)=[X-c\sigma,X+c\sigma], \qquad c\ge 0.\] Compute the risk under \[L(\mu,C)=b\cdot\operatorname{Length}(C)-\mathbb{1}\{\mu\in C\}.\]
The length is \[\operatorname{Length}(C)=2c\sigma.\] The coverage probability is \[\mathbb{P}_\mu(\mu\in C(X)) =\mathbb{P}_\mu(X-c\sigma\le\mu\le X+c\sigma) =\mathbb{P}\left(-c\le \frac{X-\mu}{\sigma}\le c\right) =2\Phi(c)-1.\] Therefore the risk is \[R(c)=b(2c\sigma)-[2\Phi(c)-1].\] Differentiating, \[R'(c)=2b\sigma-2\phi(c).\] The optimum satisfies \[\phi(c)=b\sigma.\] Since \(\phi(c)=\frac{1}{\sqrt{2\pi}}e^{-c^2/2}\), if \(b\sigma\le 1/\sqrt{2\pi}\), then \[c=\sqrt{-2\log(b\sigma\sqrt{2\pi})}.\] If \(b\sigma>1/\sqrt{2\pi}\), the minimum occurs at \(c=0\), corresponding to a point estimate.
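The optimal half-width from Example 39 can be verified numerically. In this sketch the function names and the values \(b=0.1\), \(\sigma=1\) are illustrative assumptions; the check confirms that \(\phi(c)=b\sigma\) at the computed \(c\) and that the risk is higher at nearby values.

```python
import math
from statistics import NormalDist

def optimal_half_width(b, sigma):
    """Risk-minimizing c for the interval [X - c*sigma, X + c*sigma]."""
    if b * sigma >= 1.0 / math.sqrt(2 * math.pi):
        return 0.0  # length is too costly: the interval collapses to a point
    return math.sqrt(-2.0 * math.log(b * sigma * math.sqrt(2 * math.pi)))

def risk(c, b, sigma):
    """R(c) = b * length - coverage for the symmetric normal interval."""
    return b * 2 * c * sigma - (2 * NormalDist().cdf(c) - 1)

# Illustrative values, not taken from the text.
b, sigma = 0.1, 1.0
c_star = optimal_half_width(b, sigma)
print(round(c_star, 4), round(risk(c_star, b, sigma), 4))
print(round(NormalDist().pdf(c_star), 4))  # should match b * sigma
```

A larger \(b\) gives a smaller `c_star`, and once \(b\sigma\) reaches \(1/\sqrt{2\pi}\approx 0.399\) the function returns \(0\).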
31 Bayesian Inference Workflow
This section summarizes the steps of Bayesian inference as a reusable procedure.
Bayesian workflow
1. Specify the sampling model \(f(x\mid\theta)\).
2. Choose a prior distribution \(\pi(\theta)\).
3. Compute the posterior distribution \(\pi(\theta\mid x)\propto f(x\mid\theta)\pi(\theta)\).
4. For point estimation, report posterior mean, median, MAP, or another loss-optimal estimate.
5. For testing, compute posterior probabilities of \(H_0\) and \(H_1\), then choose an action based on posterior risk.
6. For interval estimation, report an equal-tail credible interval or HPD credible interval.
7. Interpret the answer conditional on the chosen model and prior.
Prior sensitivity. Bayesian inference depends on the prior distribution. With large samples, the likelihood often dominates the prior. With small samples, the prior can strongly influence posterior estimates, tests, and intervals.
32 Practice Problems
This section provides practice problems that connect the Bayesian methods from the preceding sections.
Practice Problem 40 (Beta-binomial posterior and estimator). Suppose \(X\sim\operatorname{Binomial}(30,p)\) and \(x=18\). Let \(p\sim\operatorname{Beta}(4,6)\).
(a) Find the posterior distribution of \(p\).
(b) Find the posterior mean.
(c) Find the MAP estimator, assuming the posterior parameters are both larger than 1.
(a) The posterior is \[p\mid x\sim\operatorname{Beta}(4+18,6+30-18)=\operatorname{Beta}(22,18).\]
(b) The posterior mean is \[\mathbb{E}(p\mid x)=\frac{22}{22+18}=\frac{22}{40}=0.55.\]
(c) The MAP estimator is \[\widehat p_{\mathrm{MAP}}=\frac{22-1}{22+18-2}=\frac{21}{38}\approx 0.5526.\]
Practice Problem 41 (Normal-normal posterior). Suppose \(X_1,\ldots,X_{16}\mid\mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,4)\) and \(\overline x=10\). Let \(\mu\sim\operatorname{Normal}(8,9)\).
(a) Find the posterior variance \(v_n\).
(b) Find the posterior mean \(m_n\).
(c) Give a \(95\%\) credible interval for \(\mu\).
Here \(n=16\), \(\sigma^2=4\), \(\mu_0=8\), and \(\sigma_0^2=9\).
(a) \[v_n=\left(\frac{16}{4}+\frac{1}{9}\right)^{-1} =\left(4+\frac19\right)^{-1} =\frac{9}{37}.\]
(b) \[m_n=v_n\left(\frac{16\cdot 10}{4}+\frac{8}{9}\right) =\frac{9}{37}\left(40+\frac{8}{9}\right) =\frac{368}{37}\approx 9.946.\]
(c) A \(95\%\) credible interval is \[m_n\pm 1.96\sqrt{v_n} =9.946\pm 1.96\sqrt{\frac{9}{37}}.\] Since \(\sqrt{9/37}\approx 0.4932\), the half-width is about \(0.967\) and the interval is approximately \[[8.979,10.913].\]
Practice Problem 42 (Gamma-Poisson posterior). Suppose \(X_1,\ldots,X_8\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)\) and \(\sum x_i=20\). Let \(\lambda\sim\operatorname{Gamma}(3,2)\) using the rate parametrization.
(a) Find the posterior distribution.
(b) Find the posterior mean.
(c) Find the MAP estimator, assuming the posterior shape is larger than 1.
(a) The posterior is \[\lambda\mid x\sim\operatorname{Gamma}(3+20,2+8)=\operatorname{Gamma}(23,10).\]
(b) The posterior mean is \[\mathbb{E}(\lambda\mid x)=\frac{23}{10}=2.3.\]
(c) For a gamma distribution with shape \(a\) and rate \(b\), the mode is \((a-1)/b\) when \(a>1\). Thus \[\widehat\lambda_{\mathrm{MAP}}=\frac{23-1}{10}=2.2.\]
Practice Problem 43 (Bayesian test with posterior probability). Suppose the posterior distribution of a parameter is \[\theta\mid x\sim\operatorname{Normal}(1.2,0.25).\] Test \[H_0:\theta\le 0 \qquad\text{versus}\qquad H_1:\theta>0.\] Compute \(\mathbb{P}(H_0\mid x)\) and decide whether to reject \(H_0\) using the rule \(\mathbb{P}(H_0\mid x)<0.05\).
The posterior standard deviation is \(\sqrt{0.25}=0.5\). Therefore \[\mathbb{P}(H_0\mid x)=\mathbb{P}(\theta\le 0\mid x) =\Phi\left(\frac{0-1.2}{0.5}\right) =\Phi(-2.4).\] Using the standard normal table, \[\Phi(-2.4)\approx 0.0082.\] Since \(0.0082<0.05\), we reject \(H_0\) using this Bayesian posterior-probability rule.
Practice Problem 44 (Bayesian decision rule with unequal costs). For testing \(H_0:\theta\in\Theta_0\) versus \(H_1:\theta\in\Theta_0^c\), suppose \[\mathbb{P}(H_0\mid x)=0.30.\] The cost of Type I error is \(c_I=5\), and the cost of Type II error is \(c_{II}=1\). Should we reject \(H_0\)?
Reject \(H_0\) if \[\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}=\frac{1}{5+1}=\frac16\approx 0.1667.\] Here \(\mathbb{P}(H_0\mid x)=0.30>0.1667\), so we do not reject \(H_0\). The high cost of Type I error makes the rejection rule more conservative.
Practice Problem 45 (Equal-tail credible interval). Suppose the posterior distribution is \[\theta\mid x\sim\operatorname{Normal}(5,4).\] Find a \(90\%\) equal-tail credible interval.
The posterior standard deviation is \(2\). For a \(90\%\) interval, \(z_{0.05}\approx 1.645\). Thus \[5\pm 1.645(2)=5\pm 3.29.\] The credible interval is \[[1.71,8.29].\]
Practice Problem 46 (HPD interval concept). Suppose a posterior density is unimodal and skewed to the right. Explain why the \(95\%\) HPD interval may differ from the \(95\%\) equal-tail interval.
The equal-tail interval places \(2.5\%\) posterior probability in each tail. The HPD interval instead contains the parameter values with highest posterior density and chooses the cutoff so that the total posterior probability is \(95\%\). For a skewed posterior, equal tails may include some low-density values in the long tail while excluding higher-density values on the other side. Therefore the HPD interval is usually shorter and has endpoints with equal posterior density, while the equal-tail interval is based on posterior quantiles.
33 Summary
This section summarizes the role of Bayesian inference across the main statistical tasks of the course.
Section summary
- Bayesian inference starts with a prior \(\pi(\theta)\) and likelihood \(f(x\mid\theta)\).
- Bayes’ rule gives the posterior: \[\pi(\theta\mid x)\propto f(x\mid\theta)\pi(\theta).\]
- Point estimates can be posterior means, medians, or MAP estimates, depending on the loss function.
- Conjugate priors make posterior computation simple.
- Bayesian tests compare posterior probabilities of hypotheses, possibly weighted by error costs.
- Bayesian credible intervals give direct posterior probability statements.
- HPD regions are shortest credible regions for unimodal posterior distributions.
- Bayesian decisions are naturally derived by minimizing posterior expected loss.
| Task | Bayesian object | Common answer |
|---|---|---|
| Point estimation | Posterior distribution | Posterior mean, median, MAP |
| Testing | Posterior hypothesis probability | Reject \(H_0\) if rejecting has the smaller posterior expected loss |
| Interval estimation | Posterior credible probability | Equal-tail or HPD credible interval |
| Model updating | Prior and likelihood | Posterior distribution |
| Decision theory | Posterior expected loss | Bayes action |