21 Chapter 20: Bayesian Inference

This chapter collects and organizes the Bayesian ideas used throughout point estimation, hypothesis testing, interval estimation, and decision theory. The main message is that once we know the posterior distribution, we can derive point estimates, tests, credible intervals, and optimal decisions from one common framework.

Topics

Bayesian model ingredients; prior, likelihood, posterior, and marginal likelihood; Bayes estimators; posterior mean, median, mode, and MAP; conjugate priors; beta-binomial, gamma-Poisson, and normal-normal models; Bayes risk; Bayesian tests; credible intervals; highest posterior density regions; loss-function interpretation; practice problems and solutions.

22 Overview

This section collects the Bayesian material that appeared across point estimation, evaluating estimators, hypothesis testing, evaluating tests, interval estimation, and evaluating intervals.

Bayesian inference gives a unified method for learning from data. Instead of treating the parameter as an unknown but fixed constant, the Bayesian approach represents uncertainty about the parameter using a probability distribution. Data update this distribution through Bayes’ rule.

Key idea

Main message. Bayesian inference rests on the rule \[\text{posterior} \propto \text{likelihood} \times \text{prior}.\] Once the posterior distribution is known, point estimation, hypothesis testing, and interval estimation can all be derived from it.

The core idea is simple but powerful. Before observing data, we describe prior information by a prior distribution. After observing data, we update the prior into a posterior distribution. All Bayesian inference is then based on this posterior distribution.

23 Bayesian Model Ingredients

This section introduces the basic objects of Bayesian inference: likelihood, prior, marginal likelihood, and posterior.

Suppose the observed data are \[X=(X_1,\ldots,X_n), \qquad x=(x_1,\ldots,x_n),\] with sampling density or mass function \[f(x\mid \theta).\] The parameter $\theta$ is unknown. In the Bayesian approach, $\theta$ is assigned a probability distribution.

Definition

Definition 1 (Prior distribution). The prior distribution $\pi(\theta)$ describes our uncertainty or belief about $\theta$ before observing the data.

Definition

Definition 2 (Likelihood). After observing $X=x$, the likelihood function is \[L(\theta\mid x)=f(x\mid \theta).\] It measures how compatible each parameter value $\theta$ is with the observed data.

Definition

Definition 3 (Marginal distribution). The marginal distribution or prior predictive distribution of the data is \[m(x)=\int f(x\mid \theta)\pi(\theta)\,d\theta.\] It is the normalizing constant that makes the posterior integrate to one.

Definition

Definition 4 (Posterior distribution). The posterior distribution of $\theta$ given $X=x$ is \[\pi(\theta\mid x)=\frac{f(x\mid \theta)\pi(\theta)}{m(x)} =\frac{f(x\mid \theta)\pi(\theta)}{\int f(x\mid \theta')\pi(\theta')\,d\theta'}.\]

Key idea

Bayesian updating. The denominator $m(x)$ does not depend on $\theta$. Therefore, for many calculations, we write \[\pi(\theta\mid x)\propto f(x\mid \theta)\pi(\theta).\] That is, the posterior is proportional to likelihood times prior.

Example

Example 5 (Bayesian ingredients for coin tossing). Suppose a coin has unknown probability $p$ of heads. We toss the coin $n$ times and observe $x$ heads. The likelihood is \[f(x\mid p)=\binom{n}{x}p^x(1-p)^{n-x}, \qquad 0<p<1.\] If the prior is $p\sim\operatorname{Beta}(\alpha,\beta)$, then \[\pi(p)=\frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1}.\] Find the posterior distribution.

Solution

Using Bayes’ rule, \[\pi(p\mid x)\propto \binom{n}{x}p^x(1-p)^{n-x}\cdot p^{\alpha-1}(1-p)^{\beta-1}.\] Ignoring constants that do not depend on $p$, \[\pi(p\mid x)\propto p^{\alpha+x-1}(1-p)^{\beta+n-x-1}.\] This is the kernel of a beta distribution, so \[p\mid X=x\sim \operatorname{Beta}(\alpha+x,\beta+n-x).\]
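The conjugate result can be sanity-checked numerically. The sketch below normalizes likelihood times prior on a grid and compares it with the closed-form $\operatorname{Beta}(\alpha+x,\beta+n-x)$ density. The helper names `beta_pdf` and `grid_posterior`, the grid size, and the values $n=10$, $x=7$, $\alpha=\beta=2$ are assumptions chosen for illustration; they do not come from the notes.

```python
import math

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p in (0, 1), computed on the log scale."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def grid_posterior(n, x, a, b, num_points=2000):
    """Normalize likelihood * prior on a midpoint grid over (0, 1)."""
    grid = [(i + 0.5) / num_points for i in range(num_points)]
    unnorm = [math.comb(n, x) * p**x * (1 - p)**(n - x) * beta_pdf(p, a, b)
              for p in grid]
    m_x = sum(unnorm) / num_points      # midpoint-rule estimate of the marginal m(x)
    return grid, [u / m_x for u in unnorm]

# Illustrative values (assumed): 7 heads in 10 tosses with a Beta(2, 2) prior.
n, x, a, b = 10, 7, 2.0, 2.0
grid, post = grid_posterior(n, x, a, b)

# Largest pointwise gap between the grid posterior and the conjugate formula.
max_gap = max(abs(d - beta_pdf(p, a + x, b + n - x)) for p, d in zip(grid, post))
print(max_gap)
```

The gap should be negligible, which reflects that the binomial coefficient and $1/B(\alpha,\beta)$ cancel once the product is normalized.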

24 Bayesian Point Estimation

This section explains how to produce a single-number estimate from a posterior distribution.

In classical point estimation, estimators such as the method of moments estimator and the maximum likelihood estimator are functions of the data. In Bayesian inference, once the posterior distribution is obtained, we can summarize it by its mean, median, mode, or another loss-optimal point.

24.1 Posterior mean, median, and mode

This subsection introduces three common posterior summaries used as Bayes point estimates.

Definition

Definition 6 (Posterior mean). The posterior mean estimator is \[\widehat\theta_B=\mathbb{E}(\theta\mid x)=\int \theta\pi(\theta\mid x)\,d\theta.\] It is the Bayes estimator under squared error loss (Theorem 10).

Definition

Definition 7 (Posterior median). A posterior median is any value $m$ satisfying \[\mathbb{P}(\theta\le m\mid x)\ge \frac12, \qquad \mathbb{P}(\theta\ge m\mid x)\ge \frac12.\] It is the Bayes estimator under absolute error loss.

Definition

Definition 8 (Posterior mode and MAP estimator). The maximum a posteriori estimator is \[\widehat\theta_{\mathrm{MAP}}=\arg\max_{\theta}\pi(\theta\mid x).\] Since \[\pi(\theta\mid x)\propto f(x\mid \theta)\pi(\theta),\] we can also write \[\widehat\theta_{\mathrm{MAP}}=\arg\max_{\theta}\{\log f(x\mid \theta)+\log \pi(\theta)\}.\]

Key idea

MAP versus MLE. The maximum likelihood estimator maximizes only the likelihood: \[\widehat\theta_{\mathrm{MLE}}=\arg\max_\theta \log f(x\mid \theta).\] The MAP estimator maximizes likelihood plus prior information: \[\widehat\theta_{\mathrm{MAP}}=\arg\max_\theta \{\log f(x\mid \theta)+\log \pi(\theta)\}.\] With a flat prior, $\log\pi(\theta)$ is constant in $\theta$, so the MAP estimator coincides with the MLE.

24.2 Loss functions and Bayes estimators

This subsection connects Bayesian point estimates to decision theory.

A Bayesian point estimator can be chosen by minimizing posterior expected loss. Let $a$ be an action, interpreted as an estimate of $\theta$. A loss function $L(\theta,a)$ measures the penalty of estimating $\theta$ by $a$.

Definition

Definition 9 (Posterior expected loss). Given posterior $\pi(\theta\mid x)$, the posterior expected loss of action $a$ is \[\rho(a\mid x)=\mathbb{E}[L(\theta,a)\mid x]=\int L(\theta,a)\pi(\theta\mid x)\,d\theta.\] A Bayes action minimizes $\rho(a\mid x)$.

Theorem

Theorem 10 (Common Bayes estimators). For a real-valued parameter $\theta$:

Under squared error loss $L(\theta,a)=(a-\theta)^2$, the Bayes estimator is the posterior mean.
Under absolute error loss $L(\theta,a)=|a-\theta|$, the Bayes estimator is a posterior median.
Under 0–1 loss $L(\theta,a)=\mathbb{1}\{|a-\theta|>\varepsilon\}$, in the limit $\varepsilon\to 0$, the Bayes estimator is a posterior mode, or MAP estimator.

Proof. For squared error loss, \[\rho(a\mid x)=\mathbb{E}[(a-\theta)^2\mid x] =a^2-2a\mathbb{E}(\theta\mid x)+\mathbb{E}(\theta^2\mid x).\] Setting the derivative with respect to $a$ to zero gives \[2a-2\mathbb{E}(\theta\mid x)=0,\] so $a=\mathbb{E}(\theta\mid x)$; the second derivative is $2>0$, so this is a minimum. The absolute loss result follows because a median minimizes expected absolute deviation. The mode result follows because maximizing posterior mass in a small neighborhood is equivalent to maximizing the posterior density. ◻
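The first two claims of Theorem 10 can be illustrated by brute force. The sketch below uses a small, hypothetical discrete posterior (the support points and probabilities are invented for illustration) and searches a grid of actions for the minimizers of posterior expected loss.

```python
# Hypothetical discrete posterior: support points and their probabilities (assumed).
support = [0.0, 1.0, 2.0, 3.0, 10.0]
probs = [0.1, 0.3, 0.35, 0.15, 0.1]

def squared_loss(theta, a):
    return (a - theta) ** 2

def absolute_loss(theta, a):
    return abs(a - theta)

def posterior_expected_loss(a, loss):
    """rho(a | x) = sum over theta of L(theta, a) * posterior probability."""
    return sum(p * loss(theta, a) for theta, p in zip(support, probs))

# Candidate actions a = 0.00, 0.01, ..., 10.00.
candidates = [i / 100 for i in range(1001)]
best_squared = min(candidates, key=lambda a: posterior_expected_loss(a, squared_loss))
best_absolute = min(candidates, key=lambda a: posterior_expected_loss(a, absolute_loss))

posterior_mean = sum(theta * p for theta, p in zip(support, probs))
print(best_squared, posterior_mean)   # minimizer of squared loss vs. posterior mean
print(best_absolute)                  # minimizer of absolute loss (a posterior median)
```

The outlying support point at 10 pulls the posterior mean upward but leaves the median untouched, which is one practical reason to prefer absolute error loss for heavy-tailed posteriors.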

Example

Example 11 (Beta-binomial posterior mean, variance, and mode). Suppose $X\sim\operatorname{Binomial}(n,p)$ and $p\sim\operatorname{Beta}(\alpha,\beta)$. If $X=x$, find the posterior mean, variance, and mode of $p$.

Solution

The posterior is \[p\mid x\sim \operatorname{Beta}(\alpha+x,\beta+n-x).\] Let \[\alpha^*=\alpha+x, \qquad \beta^*=\beta+n-x.\] Then \[\mathbb{E}(p\mid x)=\frac{\alpha^*}{\alpha^*+\beta^*} =\frac{\alpha+x}{\alpha+\beta+n}.\] The posterior variance is \[\operatorname{Var}(p\mid x)=\frac{\alpha^*\beta^*}{(\alpha^*+\beta^*)^2(\alpha^*+\beta^*+1)} =\frac{(\alpha+x)(\beta+n-x)}{(\alpha+\beta+n)^2(\alpha+\beta+n+1)}.\] If $\alpha^*>1$ and $\beta^*>1$, the posterior mode is \[\frac{\alpha^*-1}{\alpha^*+\beta^*-2} =\frac{\alpha+x-1}{\alpha+\beta+n-2}.\]

Example

Example 12 (Pseudo-count interpretation). Suppose $p\sim\operatorname{Beta}(3,5)$ and we observe $x=5$ heads in $n=6$ tosses. Find the posterior distribution and posterior mean.

Solution

The posterior is \[p\mid X=5\sim\operatorname{Beta}(3+5,5+6-5)=\operatorname{Beta}(8,6).\] The posterior mean is \[\mathbb{E}(p\mid X=5)=\frac{8}{8+6}=\frac{8}{14}=0.5714.\] The prior contributes $3$ prior successes and $5$ prior failures. The data contribute $5$ successes and $1$ failure. Thus the posterior has $8$ pseudo-successes and $6$ pseudo-failures.
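The update rule and the summaries from Example 11 are mechanical enough to script. The following is a minimal sketch using exact fractions; the function names `beta_binomial_update` and `beta_summaries` are hypothetical helpers, not part of any library.

```python
from fractions import Fraction

def beta_binomial_update(alpha, beta, x, n):
    """Posterior Beta parameters after x successes in n trials."""
    return alpha + x, beta + n - x

def beta_summaries(a, b):
    """Mean, variance, and mode (None unless a, b > 1) of Beta(a, b)."""
    a, b = Fraction(a), Fraction(b)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    mode = (a - 1) / (a + b - 2) if a > 1 and b > 1 else None
    return mean, var, mode

# Example 12: Beta(3, 5) prior, 5 heads in 6 tosses.
a_post, b_post = beta_binomial_update(3, 5, x=5, n=6)
mean, var, mode = beta_summaries(a_post, b_post)
print(a_post, b_post)            # posterior pseudo-counts
print(mean, float(mean))         # posterior mean, exact and decimal
print(var, mode)                 # posterior variance and mode
```

Working with `Fraction` keeps the pseudo-count arithmetic exact, so the decimal value appears only at the final step.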

25 Conjugate Priors

This section studies conjugate priors, which make Bayesian updating especially simple.

A conjugate prior is useful because the posterior stays in the same distribution family as the prior. Then Bayesian updating often becomes a simple rule of adding data summaries to prior hyperparameters.

Definition

Definition 13 (Conjugate prior). For a likelihood function $f(x\mid \theta)$, a prior family $\pi(\theta)$ is called conjugate if the posterior $\pi(\theta\mid x)$ belongs to the same family as the prior.

Key idea

Why conjugacy helps. If a prior is conjugate, then we can often skip difficult integration. The posterior distribution is known up to updated parameters, and the update rule is usually easy to compute.

25.1 Beta-binomial conjugacy

This subsection reviews the most important conjugate pair for proportions.

Theorem

Theorem 14 (Beta-binomial model). If \[X\mid p\sim\operatorname{Binomial}(n,p), \qquad p\sim\operatorname{Beta}(\alpha,\beta),\] then \[p\mid X=x\sim\operatorname{Beta}(\alpha+x,\beta+n-x).\]

Example

Example 15 (Bayesian estimator for a Bernoulli probability). Suppose $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)$ and $S=\sum_{i=1}^n X_i$. Let $p\sim\operatorname{Beta}(\alpha,\beta)$. Find the Bayes estimator under squared error loss.

Solution

Since $S\mid p\sim\operatorname{Binomial}(n,p)$, the posterior is \[p\mid S\sim\operatorname{Beta}(\alpha+S,\beta+n-S).\] Under squared error loss, the Bayes estimator is the posterior mean: \[\widehat p_B=\mathbb{E}(p\mid S)=\frac{\alpha+S}{\alpha+\beta+n}.\] This shrinks the sample proportion $S/n$ toward the prior mean $\alpha/(\alpha+\beta)$.

25.2 Gamma-Poisson conjugacy

This subsection reviews the conjugate prior for a Poisson rate.

We use the rate parametrization for the gamma distribution: \[\lambda\sim\operatorname{Gamma}(\alpha,\beta), \qquad \pi(\lambda)=\frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}, \qquad \lambda>0.\]

Theorem

Theorem 16 (Gamma-Poisson model). If \[X_1,\ldots,X_n\mid \lambda \overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda), \qquad \lambda\sim\operatorname{Gamma}(\alpha,\beta),\] then \[\lambda\mid x\sim \operatorname{Gamma}\left(\alpha+\sum_{i=1}^n x_i,\,\beta+n\right).\]

Proof. The likelihood is \[f(x\mid \lambda)=\prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \propto e^{-n\lambda}\lambda^{\sum x_i}.\] Multiplying by the gamma prior gives \[\pi(\lambda\mid x)\propto e^{-n\lambda}\lambda^{\sum x_i}\cdot \lambda^{\alpha-1}e^{-\beta\lambda} =\lambda^{\alpha+\sum x_i-1}e^{-(\beta+n)\lambda}.\] This is the kernel of $\operatorname{Gamma}(\alpha+\sum x_i,\beta+n)$. ◻

Example

Example 17 (Poisson posterior). Suppose $n=10$ Poisson observations have total count $\sum x_i=26$. Let the prior be $\lambda\sim\operatorname{Gamma}(2,1)$. Find the posterior mean.

Solution

The posterior is \[\lambda\mid x\sim \operatorname{Gamma}(2+26,1+10)=\operatorname{Gamma}(28,11).\] For a gamma distribution with rate $\beta$, the mean is $\alpha/\beta$. Therefore, \[\mathbb{E}(\lambda\mid x)=\frac{28}{11}\approx 2.545.\]
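A minimal sketch of the gamma-Poisson update for Example 17 follows. The function name `gamma_poisson_update` is a hypothetical helper. The weighted-average decomposition of the posterior mean shown at the end is an algebraic identity, $\frac{\alpha+\sum x_i}{\beta+n}=\frac{\beta}{\beta+n}\cdot\frac{\alpha}{\beta}+\frac{n}{\beta+n}\cdot\overline x$, added here for illustration rather than taken from the notes.

```python
def gamma_poisson_update(alpha, beta, total, n):
    """Posterior (shape, rate) for a Gamma(alpha, beta) prior on a Poisson rate."""
    return alpha + total, beta + n

# Example 17: Gamma(2, 1) prior, n = 10 observations with total count 26.
alpha, beta, total, n = 2, 1, 26, 10
shape, rate = gamma_poisson_update(alpha, beta, total, n)

post_mean = shape / rate             # mean of Gamma(shape, rate)
post_mode = (shape - 1) / rate       # mode, valid because shape > 1

# The posterior mean as a weighted average of the prior mean and the sample mean.
weight_prior = beta / (beta + n)
weighted = weight_prior * (alpha / beta) + (1 - weight_prior) * (total / n)

print(shape, rate, round(post_mean, 3), round(post_mode, 3), round(weighted, 3))
```

The prior rate $\beta$ acts like a prior sample size: here the prior carries the weight of one observation against ten observed ones.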

25.3 Normal-normal conjugacy

This subsection reviews the conjugate model for a normal mean with known variance.

Suppose \[X_1,\ldots,X_n\mid \mu \overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2),\] where $\sigma^2$ is known. Let the prior be \[\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2).\]

Theorem

Theorem 18 (Normal-normal posterior). The posterior distribution of $\mu$ is normal: \[\mu\mid x\sim\operatorname{Normal}(m_n,v_n),\] where \[v_n=\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)^{-1} =\frac{\sigma^2\sigma_0^2}{\sigma^2+n\sigma_0^2},\] and \[m_n=v_n\left(\frac{n\overline x}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right) =\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}.\]
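A short derivation sketch, by completing the square, shows where these formulas come from. Keeping only the factors that depend on $\mu$, \[\pi(\mu\mid x)\propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right\} \propto \exp\left\{-\frac12\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2+\left(\frac{n\overline x}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\mu\right\}.\] A $\operatorname{Normal}(m,v)$ density in $\mu$ has exponent $-\frac{1}{2v}\mu^2+\frac{m}{v}\mu$ plus a constant. Matching coefficients gives \[\frac{1}{v_n}=\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}, \qquad \frac{m_n}{v_n}=\frac{n\overline x}{\sigma^2}+\frac{\mu_0}{\sigma_0^2},\] which are the expressions stated in the theorem.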

Key idea

Weighted average interpretation. The posterior mean is a weighted average of the sample mean $\overline x$ and the prior mean $\mu_0$: \[m_n=\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\overline x +\frac{\sigma^2}{n\sigma_0^2+\sigma^2}\mu_0.\] If $\sigma_0^2\to\infty$, the prior becomes diffuse and $m_n\to\overline x$, the MLE.
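The precision form of the update and the weighted-average form can be compared directly. In the sketch below, the function name `normal_normal_posterior` and all numerical inputs are assumptions made for illustration.

```python
def normal_normal_posterior(xbar, n, sigma2, mu0, sigma0_2):
    """Posterior mean m_n and variance v_n for a normal mean with known variance."""
    precision = n / sigma2 + 1 / sigma0_2          # posterior precision 1 / v_n
    v_n = 1 / precision
    m_n = v_n * (n * xbar / sigma2 + mu0 / sigma0_2)
    return m_n, v_n

# Illustrative inputs (assumed): xbar = 2 from n = 5 draws, sigma^2 = 1, prior Normal(0, 4).
xbar, n, sigma2, mu0, sigma0_2 = 2.0, 5, 1.0, 0.0, 4.0
m_n, v_n = normal_normal_posterior(xbar, n, sigma2, mu0, sigma0_2)

# Weight on the sample mean in the weighted-average form.
w = n * sigma0_2 / (n * sigma0_2 + sigma2)
weighted = w * xbar + (1 - w) * mu0

# A very diffuse prior should push the posterior mean toward xbar.
m_diffuse, _ = normal_normal_posterior(xbar, n, sigma2, mu0, sigma0_2=1e12)
print(m_n, weighted, v_n, m_diffuse)
```

With these inputs the data receive weight $20/21$, so the posterior mean sits close to $\overline x$ even though the prior is centered at zero.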

Example

Example 19 (MAP for a normal mean). Suppose $X_1,\ldots,X_n\mid \mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$ with known $\sigma^2$, and suppose $\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2)$. Find the MAP estimator.

Solution

The posterior is normal with mean $m_n$ and variance $v_n$. A normal density is maximized at its mean, so \[\widehat\mu_{\mathrm{MAP}}=m_n =\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}.\] Equivalently, \[\widehat\mu_{\mathrm{MAP}} =\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\overline x +\frac{\sigma^2}{n\sigma_0^2+\sigma^2}\mu_0.\] Because the posterior is normal, the posterior mean and posterior mode are the same.

26 Bayes Risk and Evaluating Estimators

This section explains how Bayesian decision theory evaluates estimators by averaging risk over a prior distribution.

In frequentist evaluation, the risk $R(\theta,\delta)$ is viewed as a function of the unknown parameter $\theta$. In Bayesian evaluation, we average this risk over the prior distribution.

Definition

Definition 20 (Risk function). For decision rule $\delta(X)$ and loss function $L(\theta,a)$, the risk function is \[R(\theta,\delta)=\mathbb{E}_\theta[L(\theta,\delta(X))].\]

Definition

Definition 21 (Bayes risk). Given prior distribution $\pi(\theta)$, the Bayes risk of $\delta$ is \[r(\pi,\delta)=\int R(\theta,\delta)\pi(\theta)\,d\theta.\] Equivalently, \[r(\pi,\delta)=\int\int L(\theta,\delta(x))f(x\mid \theta)\pi(\theta)\,dx\,d\theta.\]

Theorem

Theorem 22 (Posterior risk minimization). A Bayes rule can be found by minimizing posterior expected loss for each observed $x$: \[\delta_B(x)\in \arg\min_a \int L(\theta,a)\pi(\theta\mid x)\,d\theta.\]

Example

Example 23 (MSE of a beta-binomial Bayes estimator). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)$ and $S=\sum X_i$. Consider \[\widehat p_B=\frac{\alpha+S}{\alpha+\beta+n}.\] Find its frequentist MSE as a function of $p$.

Solution

Since $S\sim\operatorname{Binomial}(n,p)$, \[\mathbb{E}_p(\widehat p_B)=\frac{\alpha+np}{\alpha+\beta+n}, \qquad \operatorname{Var}_p(\widehat p_B)=\frac{np(1-p)}{(\alpha+\beta+n)^2}.\] The bias is \[\operatorname{Bias}_p(\widehat p_B) =\frac{\alpha+np}{\alpha+\beta+n}-p =\frac{\alpha-(\alpha+\beta)p}{\alpha+\beta+n}.\] Therefore, \[\operatorname{MSE}_p(\widehat p_B) =\frac{np(1-p)}{(\alpha+\beta+n)^2} +\left(\frac{\alpha-(\alpha+\beta)p}{\alpha+\beta+n}\right)^2.\] This shows the bias-variance tradeoff introduced by the prior.

Remark

Remark 24 (Why shrinkage can help). The estimator $\widehat p_B$ is usually biased as a frequentist estimator, but it can have smaller MSE than the MLE $\widehat p=S/n$ for some values of $p$, especially when $n$ is small and the prior information is reasonable.
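The MSE formula of Example 23 can be checked against exact enumeration over the binomial distribution, and compared with the MLE's MSE $p(1-p)/n$. This is a sketch with assumed values $n=10$ and $\alpha=\beta=2$; the function names are hypothetical.

```python
import math

def mse_bayes(p, n, alpha, beta):
    """Closed-form MSE of (alpha + S) / (alpha + beta + n): variance + bias^2."""
    denom = alpha + beta + n
    variance = n * p * (1 - p) / denom**2
    bias = (alpha - (alpha + beta) * p) / denom
    return variance + bias**2

def mse_by_enumeration(p, n, alpha, beta):
    """Exact MSE obtained by summing squared error over the Binomial(n, p) pmf."""
    total = 0.0
    for s in range(n + 1):
        pmf = math.comb(n, s) * p**s * (1 - p) ** (n - s)
        total += pmf * ((alpha + s) / (alpha + beta + n) - p) ** 2
    return total

n, alpha, beta = 10, 2, 2      # assumed illustrative values
for p in (0.1, 0.5, 0.9):
    mse_mle = p * (1 - p) / n
    print(p, round(mse_bayes(p, n, alpha, beta), 5),
          round(mse_by_enumeration(p, n, alpha, beta), 5), round(mse_mle, 5))
```

Under these assumptions the Bayes estimator has smaller MSE near the prior mean $p=0.5$ and larger MSE near the extremes, which is the pattern Remark 24 describes.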

27 Bayesian Hypothesis Testing

This section explains how Bayesian inference turns hypothesis testing into posterior probability comparison.

In classical testing, we control Type I error, study power, and often use p-values or likelihood ratio tests. In Bayesian testing, the posterior distribution directly assigns probabilities to hypotheses.

27.1 Posterior probability tests

This subsection defines Bayesian tests through posterior probabilities of the null and alternative parameter regions.

Let the hypotheses be \[H_0:\theta\in\Theta_0, \qquad H_1:\theta\in\Theta_0^c.\] Given posterior $\pi(\theta\mid x)$, \[\mathbb{P}(H_0\mid x)=\mathbb{P}(\theta\in\Theta_0\mid x)=\int_{\Theta_0}\pi(\theta\mid x)\,d\theta,\] and \[\mathbb{P}(H_1\mid x)=\mathbb{P}(\theta\in\Theta_0^c\mid x)=\int_{\Theta_0^c}\pi(\theta\mid x)\,d\theta.\]

Definition

Definition 25 (Bayesian posterior probability test). A simple Bayesian decision rule rejects $H_0$ when \[\mathbb{P}(H_0\mid x)<\mathbb{P}(H_1\mid x),\] equivalently, when \[\mathbb{P}(H_0\mid x)<\frac12.\] A more conservative rule rejects $H_0$ when \[\mathbb{P}(H_0\mid x)<0.05.\]

Example

Example 26 (Bayesian one-sided normal mean test). Suppose \[X_1,\ldots,X_n\mid \mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2),\] where $\sigma^2$ is known, and suppose \[\mu\sim\operatorname{Normal}(\theta,\tau^2).\] Test \[H_0:\mu\le \mu_0 \qquad\text{versus}\qquad H_1:\mu>\mu_0.\] Find a posterior-probability rejection rule.

Solution

The posterior distribution is \[\mu\mid x\sim\operatorname{Normal}(m_n,v_n),\] where \[m_n=\frac{\tau^2\sum_{i=1}^n x_i+\sigma^2\theta}{n\tau^2+\sigma^2}, \qquad v_n=\frac{\sigma^2\tau^2}{\sigma^2+n\tau^2}.\] The posterior probability of the null is \[\mathbb{P}(H_0\mid x)=\mathbb{P}(\mu\le \mu_0\mid x) =\Phi\left(\frac{\mu_0-m_n}{\sqrt{v_n}}\right).\] If we reject when $\mathbb{P}(H_0\mid x)<1/2$, then because the posterior is normal and symmetric, this is equivalent to \[m_n>\mu_0.\] That is, \[\frac{\tau^2\sum_{i=1}^n x_i+\sigma^2\theta}{n\tau^2+\sigma^2}>\mu_0.\] Equivalently, \[\overline x>\mu_0+\frac{\sigma^2(\mu_0-\theta)}{n\tau^2}.\]
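The equivalence between the posterior-probability rule and the sample-mean cutoff can be verified numerically. The sketch below uses `statistics.NormalDist` from the Python standard library for $\Phi$; the function names and the numerical inputs are assumptions for illustration.

```python
from statistics import NormalDist

def posterior_prob_null(xbar, n, sigma2, theta, tau2, mu0):
    """P(mu <= mu0 | x) under a Normal(theta, tau2) prior with known sigma2."""
    m_n = (tau2 * n * xbar + sigma2 * theta) / (n * tau2 + sigma2)
    v_n = sigma2 * tau2 / (sigma2 + n * tau2)
    return NormalDist().cdf((mu0 - m_n) / v_n**0.5)

def xbar_threshold(n, sigma2, theta, tau2, mu0):
    """Sample-mean cutoff equivalent to rejecting when P(H0 | x) < 1/2."""
    return mu0 + sigma2 * (mu0 - theta) / (n * tau2)

# Assumed illustrative values: n = 25, sigma^2 = 4, prior Normal(0, 1), mu0 = 0.5.
n, sigma2, theta, tau2, mu0 = 25, 4.0, 0.0, 1.0, 0.5
cut = xbar_threshold(n, sigma2, theta, tau2, mu0)

for xbar in (cut - 0.1, cut + 0.1):
    p0 = posterior_prob_null(xbar, n, sigma2, theta, tau2, mu0)
    decision = "reject H0" if p0 < 0.5 else "do not reject H0"
    print(round(xbar, 3), round(p0, 4), decision)
```

Because the prior mean lies below $\mu_0$ in this setup, the cutoff for $\overline x$ sits above $\mu_0$: the data must overcome the prior before the posterior favors $H_1$.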

27.2 Bayesian tests with loss functions

This subsection connects Bayesian tests to error costs.

Suppose the action space is $\{a_0,a_1\}$, where $a_0$ means “accept or fail to reject $H_0$” and $a_1$ means “reject $H_0$”. A generalized 0–1 loss is \[L(\theta,a_0)= \begin{cases} 0, & \theta\in\Theta_0,\\ c_{II}, & \theta\in\Theta_0^c, \end{cases}\] and \[L(\theta,a_1)= \begin{cases} c_I, & \theta\in\Theta_0,\\ 0, & \theta\in\Theta_0^c. \end{cases}\] Here $c_I$ is the cost of a Type I error and $c_{II}$ is the cost of a Type II error.

Theorem

Theorem 27 (Bayes test under generalized 0–1 loss). Under the above loss, reject $H_0$ if \[c_I\mathbb{P}(H_0\mid x)<c_{II}\mathbb{P}(H_1\mid x).\] Equivalently, \[\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}.\]

Proof. The posterior expected loss of accepting $H_0$ is \[\rho(a_0\mid x)=c_{II}\mathbb{P}(H_1\mid x).\] The posterior expected loss of rejecting $H_0$ is \[\rho(a_1\mid x)=c_I\mathbb{P}(H_0\mid x).\] We reject when $\rho(a_1\mid x)<\rho(a_0\mid x)$, namely \[c_I\mathbb{P}(H_0\mid x)<c_{II}\mathbb{P}(H_1\mid x).\] Since $\mathbb{P}(H_1\mid x)=1-\mathbb{P}(H_0\mid x)$, this is equivalent to \[\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}.\] ◻
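The rule of Theorem 27 takes only a few lines of code. In this sketch the function name `bayes_test`, its argument names, and the input values are hypothetical; the two calls use the same posterior probability with different cost ratios to show how the threshold moves.

```python
def bayes_test(prob_h0, cost_type1, cost_type2):
    """Return the Bayes action and the rejection threshold c_II / (c_I + c_II)."""
    loss_if_accept = cost_type2 * (1 - prob_h0)    # rho(a_0 | x)
    loss_if_reject = cost_type1 * prob_h0          # rho(a_1 | x)
    threshold = cost_type2 / (cost_type1 + cost_type2)
    action = "reject H0" if loss_if_reject < loss_if_accept else "do not reject H0"
    return action, threshold

# Hypothetical inputs: P(H0 | x) = 0.2 under equal and unequal error costs.
equal_costs = bayes_test(0.2, cost_type1=1, cost_type2=1)
costly_type1 = bayes_test(0.2, cost_type1=9, cost_type2=1)
print(equal_costs)
print(costly_type1)
```

With equal costs the threshold is $1/2$ and the rule reduces to Definition 25; making a Type I error nine times as costly lowers the threshold to $0.1$ and reverses the decision.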

Remark

Remark 28 (Connection to classical risk). In classical testing, the risk under generalized 0–1 loss is \[R(\theta,\delta)= \begin{cases} c_I\beta(\theta), & \theta\in\Theta_0,\\ c_{II}[1-\beta(\theta)], & \theta\in\Theta_0^c, \end{cases}\] where $\beta(\theta)=\mathbb{P}_\theta(\text{reject }H_0)$ is the power function. The Bayes risk averages $R(\theta,\delta)$ over the prior distribution; by Theorem 22, the test that minimizes this average is the one that minimizes posterior expected loss for each observed $x$.

28 Bayesian Interval Estimation

This section introduces Bayesian credible intervals and contrasts them with frequentist confidence intervals.

In frequentist confidence intervals, the interval is random and the parameter is fixed. In Bayesian credible intervals, the parameter is random under the posterior distribution, so probability statements about $\theta$ belonging to an observed interval are meaningful within the model.

Definition

Definition 29 (Credible set). A set $C(x)\subseteq\Theta$ is a $100(1-\alpha)\%$ credible set if \[\mathbb{P}(\theta\in C(x)\mid x)=\int_{C(x)}\pi(\theta\mid x)\,d\theta=1-\alpha.\] If $C(x)=[a,b]$, then $[a,b]$ is called a credible interval.

Warning

Interpretation warning. A frequentist $95\%$ confidence interval does not mean that there is a $95\%$ probability that the fixed parameter lies in the observed interval. A Bayesian $95\%$ credible interval does allow the statement: \[\mathbb{P}(\theta\in C(x)\mid x)=0.95.\] This interpretation depends on the prior and the Bayesian model.

28.1 Equal-tail credible intervals

This subsection introduces credible intervals based on posterior quantiles.

Definition

Definition 30 (Equal-tail credible interval). A $100(1-\alpha)\%$ equal-tail credible interval $[a,b]$ satisfies \[\mathbb{P}(\theta<a\mid x)=\frac{\alpha}{2}, \qquad \mathbb{P}(\theta>b\mid x)=\frac{\alpha}{2}.\] Equivalently, $a$ and $b$ are the posterior $\alpha/2$ and $1-\alpha/2$ quantiles.

Example

Example 31 (Beta-binomial credible interval). Suppose $X\sim\operatorname{Binomial}(n,p)$, $p\sim\operatorname{Beta}(2,2)$, $n=20$, and $x=12$. Find the posterior distribution and describe the $95\%$ equal-tail credible interval.

Solution

The posterior is \[p\mid X=12\sim\operatorname{Beta}(2+12,2+20-12)=\operatorname{Beta}(14,10).\] The $95\%$ equal-tail credible interval is \[\left[q_{0.025},q_{0.975}\right],\] where $q_c$ is the $c$th quantile of the $\operatorname{Beta}(14,10)$ distribution. Numerically, this interval is approximately \[[0.385,0.768].\] This means that, under the beta-binomial Bayesian model, \[\mathbb{P}(0.385\le p\le 0.768\mid X=12)\approx 0.95.\]
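The quantiles of $\operatorname{Beta}(14,10)$ have no closed form, so they are found numerically. The following standard-library sketch integrates the density with Simpson's rule and inverts the CDF by bisection; in practice a statistics library's beta quantile function would be the usual choice. The helper names, step counts, and tolerances are assumptions.

```python
import math

def beta_pdf(p, a, b):
    """Density of Beta(a, b); returns 0 at the endpoints, valid when a, b > 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def beta_cdf(q, a, b, steps=2000):
    """Composite Simpson's rule for the integral of the density over [0, q]."""
    h = q / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(q, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3

def beta_quantile(c, a, b, tol=1e-8):
    """Invert the CDF by bisection on (0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < c:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Posterior from Example 31.
lower = beta_quantile(0.025, 14, 10)
upper = beta_quantile(0.975, 14, 10)
print(round(lower, 3), round(upper, 3))
```

The printed endpoints should land close to the interval quoted in the solution.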

Example

Example 32 (Normal-normal credible interval). Suppose $X_1,\ldots,X_n\mid\mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$ with known $\sigma^2$, and $\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2)$. Find a $100(1-\alpha)\%$ credible interval for $\mu$.

Solution

The posterior is \[\mu\mid x\sim\operatorname{Normal}(m_n,v_n),\] where \[m_n=\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}, \qquad v_n=\frac{\sigma^2\sigma_0^2}{\sigma^2+n\sigma_0^2}.\] Therefore a $100(1-\alpha)\%$ credible interval is \[\left[m_n-z_{\alpha/2}\sqrt{v_n},\;m_n+z_{\alpha/2}\sqrt{v_n}\right],\] where $z_{\alpha/2}$ satisfies $\mathbb{P}(Z>z_{\alpha/2})=\alpha/2$ for $Z\sim\operatorname{Normal}(0,1)$.

Example

Example 33 (Gamma-Poisson credible interval). Suppose $X_1,\ldots,X_n\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)$ and $\lambda\sim\operatorname{Gamma}(\alpha,\beta)$ in the rate parametrization. Derive the posterior credible interval for $\lambda$.

Solution

The posterior is \[\lambda\mid x\sim \operatorname{Gamma}\left(\alpha+\sum_{i=1}^n x_i,\beta+n\right).\] A $100(1-\alpha_0)\%$ equal-tail credible interval is \[\left[q_{\alpha_0/2},q_{1-\alpha_0/2}\right],\] where $q_c$ is the $c$th quantile of the posterior gamma distribution. For example, if $n=10$, $\sum x_i=26$, and $\lambda\sim\operatorname{Gamma}(2,1)$, then \[\lambda\mid x\sim\operatorname{Gamma}(28,11),\] and the interval is obtained from the quantiles of $\operatorname{Gamma}(28,11)$.

29 Highest Posterior Density Regions

This section explains the shortest Bayesian credible regions for unimodal posterior distributions.

Equal-tail intervals are easy to compute, but they are not always the shortest credible intervals. For a unimodal posterior density, the shortest credible region is formed by keeping the parameter values with highest posterior density.

Definition

Definition 34 (Highest posterior density region). A $100(1-\alpha)\%$ highest posterior density (HPD) credible region is a set \[C_{\mathrm{HPD}}(x)=\{\theta:\pi(\theta\mid x)\ge k\},\] where $k$ is chosen so that \[\int_{C_{\mathrm{HPD}}(x)}\pi(\theta\mid x)\,d\theta=1-\alpha.\]

Theorem

Theorem 35 (Shortest credible interval for unimodal posterior). If $\pi(\theta\mid x)$ is unimodal, then the shortest credible interval with posterior probability $1-\alpha$ is the HPD interval. Its endpoints have equal posterior density, except possibly at a boundary of the parameter space.

Example

Example 36 (Poisson HPD region). Suppose \[X_1,\ldots,X_n\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda),\] and use a conjugate gamma prior. If \[\lambda\mid x\sim\operatorname{Gamma}\left(a+\sum x_i,\, n+b\right)\] under a rate parametrization, then a $100(1-\alpha)\%$ HPD credible region has the form \[\{\lambda:\pi(\lambda\mid x)\ge k\},\] where $k$ is chosen to make the posterior probability equal to $1-\alpha$.

Solution

The posterior density is a gamma density. Since it is unimodal, the HPD region is an interval containing the highest-density values around the posterior mode. The cutoff $k$ is determined by solving \[\int_{\{\lambda:\pi(\lambda\mid x)\ge k\}}\pi(\lambda\mid x)\,d\lambda=1-\alpha.\] For example, in the lecture notes case with $a=b=1$, $n=10$, and $\sum x_i=6$, the posterior is $\operatorname{Gamma}(7,11)$ and the $90\%$ HPD credible set is approximately \[[0.253,1.005],\] while the corresponding equal-tail interval, approximately $[0.299,1.077]$, is slightly longer.
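One way to compute an HPD interval numerically is to search over the lower endpoint: for each candidate, find the point above the mode with equal density, and adjust until the enclosed posterior mass reaches the target level. The sketch below does this for the $\operatorname{Gamma}(7,11)$ posterior implied by $a=b=1$, $n=10$, $\sum x_i=6$, using only the standard library. All helper names, the upper search bound of 10, and the tolerances are assumptions.

```python
import math

SHAPE, RATE = 7.0, 11.0      # posterior Gamma(a + sum x_i, n + b) in the rate form

def gamma_pdf(lam):
    if lam <= 0.0:
        return 0.0
    log_norm = SHAPE * math.log(RATE) - math.lgamma(SHAPE)
    return math.exp(log_norm + (SHAPE - 1) * math.log(lam) - RATE * lam)

def mass(lo, hi, steps=2000):
    """Composite Simpson's rule for the posterior probability of [lo, hi]."""
    h = (hi - lo) / steps
    total = gamma_pdf(lo) + gamma_pdf(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * gamma_pdf(lo + i * h)
    return total * h / 3

def bisect(f, lo, hi, tol=1e-9):
    """Root of f on [lo, hi], assuming f changes sign exactly once there."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

MODE = (SHAPE - 1) / RATE

def upper_matching(lower):
    """Point above the mode whose density equals the density at `lower`."""
    target = gamma_pdf(lower)
    return bisect(lambda lam: gamma_pdf(lam) - target, MODE, 10.0)

def hpd_interval(level):
    # Moving the lower endpoint toward the mode shrinks the interval and its mass.
    lower = bisect(lambda a: mass(a, upper_matching(a)) - level, 1e-6, MODE - 1e-6)
    return lower, upper_matching(lower)

hpd_lo, hpd_hi = hpd_interval(0.90)
eq_lo = bisect(lambda q: mass(0.0, q) - 0.05, 1e-9, 10.0)   # 5% posterior quantile
eq_hi = bisect(lambda q: mass(0.0, q) - 0.95, 1e-9, 10.0)   # 95% posterior quantile
print(round(hpd_lo, 3), round(hpd_hi, 3), round(hpd_hi - hpd_lo, 3))
print(round(eq_lo, 3), round(eq_hi, 3), round(eq_hi - eq_lo, 3))
```

Both intervals carry the same posterior probability. The HPD interval is shifted toward zero because the gamma density is right-skewed, which is what makes it shorter.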

Remark

Remark 37 (Equal-tail versus HPD). Equal-tail intervals split posterior probability equally between the two tails. HPD intervals minimize length by taking the most plausible parameter values first. For symmetric unimodal posteriors, the equal-tail and HPD intervals coincide. For skewed posteriors, they are usually different.

30 Bayesian Optimality for Intervals

This section connects credible intervals to loss-function optimality.

Bayesian interval estimation can be framed as a decision problem: the action is choosing a set $C$, and the loss penalizes long intervals while rewarding coverage of the true parameter.

Definition

Definition 38 (Interval loss function). One simple loss function for choosing a confidence or credible set $C$ is \[L(\theta,C)=b\cdot \operatorname{Length}(C)-\mathbb{1}\{\theta\in C\},\] where $b>0$ controls the tradeoff between short length and high coverage.

The corresponding frequentist risk is \[R(\theta,C)=b\mathbb{E}_\theta[\operatorname{Length}(C(X))]-\mathbb{P}_\theta(\theta\in C(X)).\] The Bayesian posterior expected loss is \[\rho(C\mid x)=b\cdot \operatorname{Length}(C)-\mathbb{P}(\theta\in C\mid x).\]

Key idea

Interpretation. Large $b$ prioritizes shorter intervals. Small $b$ prioritizes posterior coverage. This makes precise the tradeoff between precision and uncertainty.

Example

Example 39 (Normal interval risk). Suppose $X\sim\operatorname{Normal}(\mu,\sigma^2)$ with known $\sigma^2$, and consider symmetric intervals \[C(X)=[X-c\sigma,X+c\sigma], \qquad c\ge 0.\] Compute the risk under \[L(\mu,C)=b\cdot\operatorname{Length}(C)-\mathbb{1}\{\mu\in C\}.\]

Solution

The length is \[\operatorname{Length}(C)=2c\sigma.\] The coverage probability is \[\mathbb{P}_\mu(\mu\in C(X)) =\mathbb{P}_\mu(X-c\sigma\le\mu\le X+c\sigma) =\mathbb{P}\left(-c\le \frac{X-\mu}{\sigma}\le c\right) =2\Phi(c)-1.\] Therefore the risk is \[R(c)=b(2c\sigma)-[2\Phi(c)-1].\] Differentiating, \[R'(c)=2b\sigma-2\phi(c).\] The optimum satisfies \[\phi(c)=b\sigma.\] Since $\phi(c)=\frac{1}{\sqrt{2\pi}}e^{-c^2/2}$, if $b\sigma\le 1/\sqrt{2\pi}$, then \[c=\sqrt{-2\log(b\sigma\sqrt{2\pi})}.\] If $b\sigma>1/\sqrt{2\pi}$, the minimum occurs at $c=0$, corresponding to a point estimate.
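The closed-form optimum can be compared with a direct grid search over $c$. In this sketch the values $b=0.1$ and $\sigma=1$ are assumed for illustration, and the function names are hypothetical.

```python
import math

def phi(c):
    """Standard normal density."""
    return math.exp(-c * c / 2) / math.sqrt(2 * math.pi)

def Phi(c):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(c / math.sqrt(2)))

def risk(c, b, sigma):
    """R(c) = b * (2 c sigma) - [2 Phi(c) - 1]."""
    return b * 2 * c * sigma - (2 * Phi(c) - 1)

def optimal_c(b, sigma):
    """Solve phi(c) = b * sigma, falling back to c = 0 when no solution exists."""
    if b * sigma > 1 / math.sqrt(2 * math.pi):
        return 0.0
    return math.sqrt(max(0.0, -2 * math.log(b * sigma * math.sqrt(2 * math.pi))))

b, sigma = 0.1, 1.0                       # assumed illustrative values
c_star = optimal_c(b, sigma)
c_grid = min((i / 1000 for i in range(5001)), key=lambda c: risk(c, b, sigma))
print(round(c_star, 4), c_grid, round(risk(c_star, b, sigma), 4))
print(optimal_c(1.0, 1.0))                # b * sigma too large: interval collapses
```

The grid minimizer should agree with the formula up to the grid spacing, and the second call illustrates the boundary case in which the length penalty is so heavy that the best interval degenerates to a point.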

31 Bayesian Inference Workflow

This section summarizes the steps of Bayesian inference as a reusable procedure.

Key idea

Bayesian workflow

Specify the sampling model $f(x\mid\theta)$.
Choose a prior distribution $\pi(\theta)$.
Compute the posterior distribution $\pi(\theta\mid x)\propto f(x\mid\theta)\pi(\theta)$.
For point estimation, report posterior mean, median, MAP, or another loss-optimal estimate.
For testing, compute posterior probabilities of $H_0$ and $H_1$, then choose an action based on posterior risk.
For interval estimation, report an equal-tail credible interval or HPD credible interval.
Interpret the answer conditional on the chosen model and prior.
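When quantiles or tail probabilities are awkward to compute analytically, the workflow can be carried out by simulation. The sketch below draws from the $\operatorname{Beta}(14,10)$ posterior of Example 31 with `random.betavariate` from the standard library and reads every summary off the draws. The null hypothesis $H_0:p\le 0.5$, the seed, and the number of draws are assumptions chosen for illustration.

```python
import random

random.seed(0)                        # fixed seed so the sketch is reproducible
N_DRAWS = 100_000

# Steps 1-3: Binomial(20, p) model, Beta(2, 2) prior, x = 12 successes.
# Conjugacy gives the posterior Beta(14, 10), which we sample directly.
draws = sorted(random.betavariate(14, 10) for _ in range(N_DRAWS))

# Step 4: point estimates.
post_mean = sum(draws) / N_DRAWS
post_median = draws[N_DRAWS // 2]

# Step 5: posterior probability of the (hypothetical) null H0: p <= 0.5.
prob_h0 = sum(d <= 0.5 for d in draws) / N_DRAWS

# Step 6: 95% equal-tail credible interval from empirical quantiles.
lower = draws[int(0.025 * N_DRAWS)]
upper = draws[int(0.975 * N_DRAWS)]

print(round(post_mean, 3), round(post_median, 3))
print(round(prob_h0, 3), "reject H0" if prob_h0 < 0.5 else "do not reject H0")
print(round(lower, 3), round(upper, 3))
```

The same pattern carries over to non-conjugate models once posterior draws are available from some sampler; only the line that generates `draws` changes.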

Warning

Prior sensitivity. Bayesian inference depends on the prior distribution. With large samples, the likelihood often dominates the prior. With small samples, the prior can strongly influence posterior estimates, tests, and intervals.

32 Practice Problems

This section provides practice problems that connect the Bayesian methods from the preceding sections.

Practice Problem

Practice Problem 40 (Beta-binomial posterior and estimator). Suppose $X\sim\operatorname{Binomial}(30,p)$ and $x=18$. Let $p\sim\operatorname{Beta}(4,6)$.

Find the posterior distribution of $p$.
Find the posterior mean.
Find the MAP estimator, assuming the posterior parameters are both larger than 1.

Solution

The posterior is \[p\mid x\sim\operatorname{Beta}(4+18,6+30-18)=\operatorname{Beta}(22,18).\]
The posterior mean is \[\mathbb{E}(p\mid x)=\frac{22}{22+18}=\frac{22}{40}=0.55.\]
The MAP estimator is \[\widehat p_{\mathrm{MAP}}=\frac{22-1}{22+18-2}=\frac{21}{38}\approx 0.5526.\]

Practice Problem

Practice Problem 41 (Normal-normal posterior). Suppose $X_1,\ldots,X_{16}\mid\mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,4)$ and $\overline x=10$. Let $\mu\sim\operatorname{Normal}(8,9)$.

Find the posterior variance $v_n$.
Find the posterior mean $m_n$.
Give a $95\%$ credible interval for $\mu$.

Solution

Here $n=16$, $\sigma^2=4$, $\mu_0=8$, and $\sigma_0^2=9$.

\[v_n=\left(\frac{16}{4}+\frac{1}{9}\right)^{-1} =\left(4+\frac19\right)^{-1} =\frac{9}{37}.\]
\[m_n=v_n\left(\frac{16\cdot 10}{4}+\frac{8}{9}\right) =\frac{9}{37}\left(40+\frac{8}{9}\right) =\frac{368}{37}\approx 9.946.\]
A $95\%$ credible interval is \[m_n\pm 1.96\sqrt{v_n} =\frac{368}{37}\pm 1.96\sqrt{\frac{9}{37}}.\] Since $\sqrt{9/37}\approx 0.4932$, the half-width is about $0.9667$ and the interval is approximately \[[8.979,10.913].\]
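The arithmetic in this solution can be reproduced with `statistics.NormalDist` from the Python standard library, which supplies the normal quantiles directly. This is a sketch; it uses the exact $0.975$ quantile rather than the rounded value $1.96$, and carrying full precision instead of rounded intermediate values can move the third decimal of the endpoints.

```python
from statistics import NormalDist

# Practice Problem 41: n = 16 draws with variance 4, xbar = 10, prior Normal(8, 9).
n, sigma2, xbar = 16, 4.0, 10.0
mu0, sigma0_2 = 8.0, 9.0

v_n = 1 / (n / sigma2 + 1 / sigma0_2)               # posterior variance
m_n = v_n * (n * xbar / sigma2 + mu0 / sigma0_2)    # posterior mean

posterior = NormalDist(mu=m_n, sigma=v_n**0.5)
lower = posterior.inv_cdf(0.025)
upper = posterior.inv_cdf(0.975)
print(round(v_n, 4), round(m_n, 3), round(lower, 3), round(upper, 3))
```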

Practice Problem

Practice Problem 42 (Gamma-Poisson posterior). Suppose $X_1,\ldots,X_8\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)$ and $\sum x_i=20$. Let $\lambda\sim\operatorname{Gamma}(3,2)$ using the rate parametrization.

Find the posterior distribution.
Find the posterior mean.
Find the MAP estimator, assuming the posterior shape is larger than 1.

Solution

The posterior is \[\lambda\mid x\sim\operatorname{Gamma}(3+20,2+8)=\operatorname{Gamma}(23,10).\]
The posterior mean is \[\mathbb{E}(\lambda\mid x)=\frac{23}{10}=2.3.\]
For a gamma distribution with shape $a$ and rate $b$, the mode is $(a-1)/b$ when $a>1$. Thus \[\widehat\lambda_{\mathrm{MAP}}=\frac{23-1}{10}=2.2.\]

Practice Problem

Practice Problem 43 (Bayesian test with posterior probability). Suppose the posterior distribution of a parameter is \[\theta\mid x\sim\operatorname{Normal}(1.2,0.25).\] Test \[H_0:\theta\le 0 \qquad\text{versus}\qquad H_1:\theta>0.\] Compute $\mathbb{P}(H_0\mid x)$ and decide whether to reject $H_0$ using the rule $\mathbb{P}(H_0\mid x)<0.05$.

Solution

The posterior standard deviation is $\sqrt{0.25}=0.5$. Therefore \[\mathbb{P}(H_0\mid x)=\mathbb{P}(\theta\le 0\mid x) =\Phi\left(\frac{0-1.2}{0.5}\right) =\Phi(-2.4).\] Using the standard normal table, \[\Phi(-2.4)\approx 0.0082.\] Since $0.0082<0.05$, we reject $H_0$ using this Bayesian posterior-probability rule.

Practice Problem

Practice Problem 44 (Bayesian decision rule with unequal costs). For testing $H_0:\theta\in\Theta_0$ versus $H_1:\theta\in\Theta_0^c$, suppose \[\mathbb{P}(H_0\mid x)=0.30.\] The cost of Type I error is $c_I=5$, and the cost of Type II error is $c_{II}=1$. Should we reject $H_0$?

Solution

Reject $H_0$ if \[\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}=\frac{1}{5+1}=\frac16\approx 0.1667.\] Here $\mathbb{P}(H_0\mid x)=0.30>0.1667$, so we do not reject $H_0$. The high cost of Type I error makes the rejection rule more conservative.

Practice Problem

Practice Problem 45 (Equal-tail credible interval). Suppose the posterior distribution is \[\theta\mid x\sim\operatorname{Normal}(5,4).\] Find a $90\%$ equal-tail credible interval.

Solution

The posterior standard deviation is $2$. For a $90\%$ interval, $z_{0.05}\approx 1.645$. Thus \[5\pm 1.645(2)=5\pm 3.29.\] The credible interval is \[[1.71,8.29].\]

Practice Problem

Practice Problem 46 (HPD interval concept). Suppose a posterior density is unimodal and skewed to the right. Explain why the $95\%$ HPD interval may differ from the $95\%$ equal-tail interval.

Solution

The equal-tail interval places $2.5\%$ posterior probability in each tail. The HPD interval instead contains the parameter values with highest posterior density and chooses the cutoff so that the total posterior probability is $95\%$. For a right-skewed posterior, the equal-tail interval includes some low-density values in the long right tail while excluding higher-density values just below its lower endpoint. The HPD interval therefore sits closer to the mode and is shorter. Its endpoints have equal posterior density (unless an endpoint falls on a boundary of the parameter space), while the endpoints of the equal-tail interval are the posterior $2.5\%$ and $97.5\%$ quantiles.
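The difference can be seen numerically on a concrete skewed posterior. The sketch below is an illustration under stated assumptions, not a method from the notes: it uses a hypothetical $\operatorname{Gamma}(3,1)$ posterior (rate parametrization), relies on the closed-form CDF that exists only for integer shape, and finds the HPD interval by a brute-force search for the shortest $95\%$ interval, which coincides with the HPD interval for a unimodal density. All function names are illustrative.

```python
import math


def erlang_cdf(x, shape, rate):
    """CDF of Gamma(shape, rate) for INTEGER shape (the Erlang distribution)."""
    if x <= 0:
        return 0.0
    t = rate * x
    return 1.0 - math.exp(-t) * sum(t ** j / math.factorial(j) for j in range(shape))


def erlang_quantile(p, shape, rate, hi=100.0):
    """Invert the CDF by bisection on [0, hi]."""
    lo = 0.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def equal_tail(shape, rate, level=0.95):
    """Interval between the alpha/2 and 1 - alpha/2 posterior quantiles."""
    alpha = 1.0 - level
    return (erlang_quantile(alpha / 2, shape, rate),
            erlang_quantile(1 - alpha / 2, shape, rate))


def shortest_interval(shape, rate, level=0.95, grid=1000):
    """Scan lower-tail probabilities in [0, alpha) and keep the shortest interval."""
    alpha = 1.0 - level
    best = None
    for i in range(grid):
        p_lo = alpha * i / grid
        a = erlang_quantile(p_lo, shape, rate)
        b = erlang_quantile(p_lo + level, shape, rate)
        if best is None or b - a < best[1] - best[0]:
            best = (a, b)
    return best


# Hypothetical right-skewed posterior: Gamma(shape=3, rate=1).
et = equal_tail(3, 1)
hpd = shortest_interval(3, 1)
print("equal-tail:", et, "length", et[1] - et[0])
print("HPD:       ", hpd, "length", hpd[1] - hpd[0])
```

Both intervals carry $95\%$ posterior probability; the search simply shifts probability from the long right tail toward the mode until the interval cannot be shortened further.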

33 Summary

This section summarizes the role of Bayesian inference across the main statistical tasks of the course.

Key idea

Section summary

Bayesian inference starts with a prior $\pi(\theta)$ and likelihood $f(x\mid\theta)$.
Bayes’ rule gives the posterior: \[\pi(\theta\mid x)\propto f(x\mid\theta)\pi(\theta).\]
Point estimates can be posterior means, medians, or MAP estimates, depending on the loss function.
Conjugate priors make posterior computation simple.
Bayesian tests compare posterior probabilities of hypotheses, possibly weighted by error costs.
Bayesian credible intervals give direct posterior probability statements.
HPD regions are the shortest credible regions for unimodal posterior distributions.
Bayesian decisions are naturally derived by minimizing posterior expected loss.

Task	Bayesian object	Common answer
Point estimation	Posterior distribution	Posterior mean, median, MAP
Testing	Posterior hypothesis probability	Reject $H_0$ if rejecting has smaller posterior expected loss
Interval estimation	Posterior credible probability	Equal-tail or HPD credible interval
Model updating	Prior and likelihood	Posterior distribution
Decision theory	Posterior expected loss	Bayes action

--- title: "Chapter 20: Bayesian Inference" format: html: toc: true toc-depth: 3 number-sections: true pdf: toc: true number-sections: true execute: warning: false message: false --- This chapter collects and organizes the Bayesian ideas used throughout point estimation, hypothesis testing, interval estimation, and decision theory. The main message is that once we know the posterior distribution, we can derive point estimates, tests, credible intervals, and optimal decisions from one common framework. ::: {.callout-note title="Topics"} Bayesian model ingredients; prior, likelihood, posterior, and marginal likelihood; Bayes estimators; posterior mean, median, mode, and MAP; conjugate priors; beta-binomial, gamma-Poisson, and normal-normal models; Bayes risk; Bayesian tests; credible intervals; highest posterior density regions; loss-function interpretation; practice problems and solutions. ::: # Overview This section collects the Bayesian material that appeared across point estimation, evaluating estimators, hypothesis testing, evaluating tests, interval estimation, and evaluating intervals. Bayesian inference gives a unified method for learning from data. Instead of treating the parameter as an unknown but fixed constant, the Bayesian approach represents uncertainty about the parameter using a probability distribution. Data update this distribution through Bayes' rule. ::: {.callout-tip title="Key idea"} Main message Bayesian inference is the rule $$\text{posterior} \propto \text{likelihood} \times \text{prior}.$$ Once the posterior distribution is known, point estimation, hypothesis testing, and interval estimation can all be derived from it. ::: The core idea is simple but powerful. Before observing data, we describe prior information by a prior distribution. After observing data, we update the prior into a posterior distribution. All Bayesian inference is then based on this posterior distribution. 
# Bayesian Model Ingredients This section introduces the basic objects of Bayesian inference: likelihood, prior, marginal likelihood, and posterior. Suppose the observed data are $$X=(X_1,\ldots,X_n), \qquad x=(x_1,\ldots,x_n),$$ with sampling density or mass function $$f(x\mid \theta).$$ The parameter $\theta$ is unknown. In the Bayesian approach, $\theta$ is assigned a probability distribution. ::: {.callout-note title="Definition"} **Definition 1** (Prior distribution). The **prior distribution** $\pi(\theta)$ describes our uncertainty or belief about $\theta$ before observing the data. ::: ::: {.callout-note title="Definition"} **Definition 2** (Likelihood). After observing $X=x$, the **likelihood function** is $$L(\theta\mid x)=f(x\mid \theta).$$ It measures how compatible each parameter value $\theta$ is with the observed data. ::: ::: {.callout-note title="Definition"} **Definition 3** (Marginal distribution). The **marginal distribution** or **prior predictive distribution** of the data is $$m(x)=\int f(x\mid \theta)\pi(\theta)\,d\theta.$$ It is the normalizing constant that makes the posterior integrate to one. ::: ::: {.callout-note title="Definition"} **Definition 4** (Posterior distribution). The **posterior distribution** of $\theta$ given $X=x$ is $$\pi(\theta\mid x)=\frac{f(x\mid \theta)\pi(\theta)}{m(x)} =\frac{f(x\mid \theta)\pi(\theta)}{\int f(x\mid \theta')\pi(\theta')\,d\theta'}.$$ ::: ::: {.callout-tip title="Key idea"} Bayesian updating The denominator $m(x)$ does not depend on $\theta$. Therefore, for many calculations, we write $$\pi(\theta\mid x)\propto f(x\mid \theta)\pi(\theta).$$ This means that the posterior is proportional to likelihood times prior. ::: ::: {.callout-tip title="Example"} **Example 5** (Bayesian ingredients for coin tossing). Suppose a coin has unknown probability $p$ of heads. We toss the coin $n$ times and observe $x$ heads. 
The likelihood is $$f(x\mid p)=\binom{n}{x}p^x(1-p)^{n-x}, \qquad 0<p<1.$$ If the prior is $p\sim\operatorname{Beta}(\alpha,\beta)$, then $$\pi(p)=\frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1}.$$ Find the posterior distribution. ::: ::: {.callout-caution title="Solution"} Using Bayes' rule, $$\pi(p\mid x)\propto \binom{n}{x}p^x(1-p)^{n-x}\cdot p^{\alpha-1}(1-p)^{\beta-1}.$$ Ignoring constants that do not depend on $p$, $$\pi(p\mid x)\propto p^{\alpha+x-1}(1-p)^{\beta+n-x-1}.$$ This is the kernel of a beta distribution, so $$p\mid X=x\sim \operatorname{Beta}(\alpha+x,\beta+n-x).$$ ::: # Bayesian Point Estimation This section explains how to produce a single-number estimate from a posterior distribution. In classical point estimation, estimators such as the method of moments estimator and the maximum likelihood estimator are functions of the data. In Bayesian inference, once the posterior distribution is obtained, we can summarize it by its mean, median, mode, or another loss-optimal point. ## Posterior mean, median, and mode This subsection introduces three common posterior summaries used as Bayes point estimates. ::: {.callout-note title="Definition"} **Definition 6** (Posterior mean). The **posterior mean** estimator is $$\widehat\theta_B=\mathbb{E}(\theta\mid x)=\int \theta\pi(\theta\mid x)\,d\theta.$$ It is commonly called the Bayes estimator under squared error loss. ::: ::: {.callout-note title="Definition"} **Definition 7** (Posterior median). A **posterior median** is any value $m$ satisfying $$\mathbb{P}(\theta\le m\mid x)\ge \frac12, \qquad \mathbb{P}(\theta\ge m\mid x)\ge \frac12.$$ It is the Bayes estimator under absolute error loss. ::: ::: {.callout-note title="Definition"} **Definition 8** (Posterior mode and MAP estimator). 
The **maximum a posteriori** estimator is $$\widehat\theta_{\mathrm{MAP}}=\arg\max_{\theta}\pi(\theta\mid x).$$ Since $$\pi(\theta\mid x)\propto f(x\mid \theta)\pi(\theta),$$ we can also write $$\widehat\theta_{\mathrm{MAP}}=\arg\max_{\theta}\{\log f(x\mid \theta)+\log \pi(\theta)\}.$$ ::: ::: {.callout-tip title="Key idea"} MAP versus MLE The maximum likelihood estimator maximizes only the likelihood: $$\widehat\theta_{\mathrm{MLE}}=\arg\max_\theta \log f(x\mid \theta).$$ The MAP estimator maximizes likelihood plus prior information: $$\widehat\theta_{\mathrm{MAP}}=\arg\max_\theta \{\log f(x\mid \theta)+\log \pi(\theta)\}.$$ A flat prior makes MAP behave like MLE. ::: ## Loss functions and Bayes estimators This subsection connects Bayesian point estimates to decision theory. A Bayesian point estimator can be chosen by minimizing posterior expected loss. Let $a$ be an action, interpreted as an estimate of $\theta$. A loss function $L(\theta,a)$ measures the penalty of estimating $\theta$ by $a$. ::: {.callout-note title="Definition"} **Definition 9** (Posterior expected loss). Given posterior $\pi(\theta\mid x)$, the posterior expected loss of action $a$ is $$\rho(a\mid x)=\mathbb{E}[L(\theta,a)\mid x]=\int L(\theta,a)\pi(\theta\mid x)\,d\theta.$$ A Bayes action minimizes $\rho(a\mid x)$. ::: ::: {.callout-note title="Theorem"} **Theorem 10** (Common Bayes estimators). *For a real-valued parameter $\theta$:* 1. *Under squared error loss $L(\theta,a)=(a-\theta)^2$, the Bayes estimator is the posterior mean.* 2. *Under absolute error loss $L(\theta,a)=|a-\theta|$, the Bayes estimator is a posterior median.* 3. *Under 0--1 style local loss, the Bayes estimator is a posterior mode, or MAP estimator.* ::: ::: proof *Proof.* For squared error loss, $$\rho(a\mid x)=\mathbb{E}[(a-\theta)^2\mid x] =a^2-2a\mathbb{E}(\theta\mid x)+\mathbb{E}(\theta^2\mid x).$$ Differentiating with respect to $a$ gives $$2a-2\mathbb{E}(\theta\mid x)=0,$$ so $a=\mathbb{E}(\theta\mid x)$. 
The absolute loss result follows because a median minimizes expected absolute deviation. The mode result follows because maximizing posterior mass in a small neighborhood is equivalent to maximizing the posterior density. ◻ ::: ::: {.callout-tip title="Example"} **Example 11** (Beta-binomial posterior mean, variance, and mode). Suppose $X\sim\operatorname{Binomial}(n,p)$ and $p\sim\operatorname{Beta}(\alpha,\beta)$. If $X=x$, find the posterior mean, variance, and mode of $p$. ::: ::: {.callout-caution title="Solution"} The posterior is $$p\mid x\sim \operatorname{Beta}(\alpha+x,\beta+n-x).$$ Let $$\alpha^*=\alpha+x, \qquad \beta^*=\beta+n-x.$$ Then $$\mathbb{E}(p\mid x)=\frac{\alpha^*}{\alpha^*+\beta^*} =\frac{\alpha+x}{\alpha+\beta+n}.$$ The posterior variance is $$\operatorname{Var}(p\mid x)=\frac{\alpha^*\beta^*}{(\alpha^*+\beta^*)^2(\alpha^*+\beta^*+1)} =\frac{(\alpha+x)(\beta+n-x)}{(\alpha+\beta+n)^2(\alpha+\beta+n+1)}.$$ If $\alpha^*>1$ and $\beta^*>1$, the posterior mode is $$\frac{\alpha^*-1}{\alpha^*+\beta^*-2} =\frac{\alpha+x-1}{\alpha+\beta+n-2}.$$ ::: ::: {.callout-tip title="Example"} **Example 12** (Pseudo-count interpretation). Suppose $p\sim\operatorname{Beta}(3,5)$ and we observe $x=5$ heads in $n=6$ tosses. Find the posterior distribution and posterior mean. ::: ::: {.callout-caution title="Solution"} The posterior is $$p\mid X=5\sim\operatorname{Beta}(3+5,5+6-5)=\operatorname{Beta}(8,6).$$ The posterior mean is $$\mathbb{E}(p\mid X=5)=\frac{8}{8+6}=\frac{8}{14}=0.5714.$$ The prior contributes $3$ prior successes and $5$ prior failures. The data contribute $5$ successes and $1$ failure. Thus the posterior has $8$ pseudo-successes and $6$ pseudo-failures. ::: # Conjugate Priors This section studies conjugate priors, which make Bayesian updating especially simple. A conjugate prior is useful because the posterior stays in the same distribution family as the prior. 
Then Bayesian updating often becomes a simple rule of adding data summaries to prior hyperparameters. ::: {.callout-note title="Definition"} **Definition 13** (Conjugate prior). For a likelihood function $f(x\mid \theta)$, a prior family $\pi(\theta)$ is called **conjugate** if the posterior $\pi(\theta\mid x)$ belongs to the same family as the prior. ::: ::: {.callout-tip title="Key idea"} Why conjugacy helps If a prior is conjugate, then we can often skip difficult integration. The posterior distribution is known up to updated parameters, and the update rule is usually easy to compute. ::: ## Beta-binomial conjugacy This subsection reviews the most important conjugate pair for proportions. ::: {.callout-note title="Theorem"} **Theorem 14** (Beta-binomial model). *If $$X\mid p\sim\operatorname{Binomial}(n,p), \qquad p\sim\operatorname{Beta}(\alpha,\beta),$$ then $$p\mid X=x\sim\operatorname{Beta}(\alpha+x,\beta+n-x).$$* ::: ::: {.callout-tip title="Example"} **Example 15** (Bayesian estimator for a Bernoulli probability). Suppose $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)$ and $S=\sum_{i=1}^n X_i$. Let $p\sim\operatorname{Beta}(\alpha,\beta)$. Find the Bayes estimator under squared error loss. ::: ::: {.callout-caution title="Solution"} Since $S\mid p\sim\operatorname{Binomial}(n,p)$, the posterior is $$p\mid S\sim\operatorname{Beta}(\alpha+S,\beta+n-S).$$ Under squared error loss, the Bayes estimator is the posterior mean: $$\widehat p_B=\mathbb{E}(p\mid S)=\frac{\alpha+S}{\alpha+\beta+n}.$$ This shrinks the sample proportion $S/n$ toward the prior mean $\alpha/(\alpha+\beta)$. ::: ## Gamma-Poisson conjugacy This subsection reviews the conjugate prior for a Poisson rate. 
We use the rate parametrization for the gamma distribution: $$\lambda\sim\operatorname{Gamma}(\alpha,\beta), \qquad \pi(\lambda)=\frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}, \qquad \lambda>0.$$ ::: {.callout-note title="Theorem"} **Theorem 16** (Gamma-Poisson model). *If $$X_1,\ldots,X_n\mid \lambda \overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda), \qquad \lambda\sim\operatorname{Gamma}(\alpha,\beta),$$ then $$\lambda\mid x\sim \operatorname{Gamma}\left(\alpha+\sum_{i=1}^n x_i,\,\beta+n\right).$$* ::: ::: proof *Proof.* The likelihood is $$f(x\mid \lambda)=\prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \propto e^{-n\lambda}\lambda^{\sum x_i}.$$ Multiplying by the gamma prior gives $$\pi(\lambda\mid x)\propto e^{-n\lambda}\lambda^{\sum x_i}\cdot \lambda^{\alpha-1}e^{-\beta\lambda} =\lambda^{\alpha+\sum x_i-1}e^{-(\beta+n)\lambda}.$$ This is the kernel of $\operatorname{Gamma}(\alpha+\sum x_i,\beta+n)$. ◻ ::: ::: {.callout-tip title="Example"} **Example 17** (Poisson posterior). Suppose $n=10$ Poisson observations have total count $\sum x_i=26$. Let the prior be $\lambda\sim\operatorname{Gamma}(2,1)$. Find the posterior mean. ::: ::: {.callout-caution title="Solution"} The posterior is $$\lambda\mid x\sim \operatorname{Gamma}(2+26,1+10)=\operatorname{Gamma}(28,11).$$ For a gamma distribution with rate $\beta$, the mean is $\alpha/\beta$. Therefore, $$\mathbb{E}(\lambda\mid x)=\frac{28}{11}\approx 2.545.$$ ::: ## Normal-normal conjugacy This subsection reviews the conjugate model for a normal mean with known variance. Suppose $$X_1,\ldots,X_n\mid \mu \overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2),$$ where $\sigma^2$ is known. Let the prior be $$\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2).$$ ::: {.callout-note title="Theorem"} **Theorem 18** (Normal-normal posterior). 
*The posterior distribution of $\mu$ is normal: $$\mu\mid x\sim\operatorname{Normal}(m_n,v_n),$$ where $$v_n=\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)^{-1} =\frac{\sigma^2\sigma_0^2}{\sigma^2+n\sigma_0^2},$$ and $$m_n=v_n\left(\frac{n\overline x}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right) =\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}.$$* ::: ::: {.callout-tip title="Key idea"} Weighted average interpretation The posterior mean is a weighted average of the sample mean $\overline x$ and the prior mean $\mu_0$: $$m_n=\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\overline x +\frac{\sigma^2}{n\sigma_0^2+\sigma^2}\mu_0.$$ If $\sigma_0^2\to\infty$, the prior becomes diffuse and $m_n\to\overline x$, the MLE. ::: ::: {.callout-tip title="Example"} **Example 19** (MAP for a normal mean). Suppose $X_1,\ldots,X_n\mid \mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$ with known $\sigma^2$, and suppose $\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2)$. Find the MAP estimator. ::: ::: {.callout-caution title="Solution"} The posterior is normal with mean $m_n$ and variance $v_n$. A normal density is maximized at its mean, so $$\widehat\mu_{\mathrm{MAP}}=m_n =\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}.$$ Equivalently, $$\widehat\mu_{\mathrm{MAP}} =\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\overline x +\frac{\sigma^2}{n\sigma_0^2+\sigma^2}\mu_0.$$ Because the posterior is normal, the posterior mean and posterior mode are the same. ::: # Bayes Risk and Evaluating Estimators This section explains how Bayesian decision theory evaluates estimators by averaging risk over a prior distribution. In frequentist evaluation, the risk $R(\theta,\delta)$ is viewed as a function of the unknown parameter $\theta$. In Bayesian evaluation, we average this risk over the prior distribution. ::: {.callout-note title="Definition"} **Definition 20** (Risk function). 
For decision rule $\delta(X)$ and loss function $L(\theta,a)$, the risk function is $$R(\theta,\delta)=\mathbb{E}_\theta[L(\theta,\delta(X))].$$ ::: ::: {.callout-note title="Definition"} **Definition 21** (Bayes risk). Given prior distribution $\pi(\theta)$, the **Bayes risk** of $\delta$ is $$r(\pi,\delta)=\int R(\theta,\delta)\pi(\theta)\,d\theta.$$ Equivalently, $$r(\pi,\delta)=\int\int L(\theta,\delta(x))f(x\mid \theta)\pi(\theta)\,dx\,d\theta.$$ ::: ::: {.callout-note title="Theorem"} **Theorem 22** (Posterior risk minimization). *A Bayes rule can be found by minimizing posterior expected loss for each observed $x$: $$\delta_B(x)\in \arg\min_a \int L(\theta,a)\pi(\theta\mid x)\,d\theta.$$* ::: ::: {.callout-tip title="Example"} **Example 23** (MSE of a beta-binomial Bayes estimator). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)$ and $S=\sum X_i$. Consider $$\widehat p_B=\frac{\alpha+S}{\alpha+\beta+n}.$$ Find its frequentist MSE as a function of $p$. ::: ::: {.callout-caution title="Solution"} Since $S\sim\operatorname{Binomial}(n,p)$, $$\mathbb{E}_p(\widehat p_B)=\frac{\alpha+np}{\alpha+\beta+n}, \qquad \operatorname{Var}_p(\widehat p_B)=\frac{np(1-p)}{(\alpha+\beta+n)^2}.$$ The bias is $$\operatorname{Bias}_p(\widehat p_B) =\frac{\alpha+np}{\alpha+\beta+n}-p =\frac{\alpha-(\alpha+\beta)p}{\alpha+\beta+n}.$$ Therefore, $$\operatorname{MSE}_p(\widehat p_B) =\frac{np(1-p)}{(\alpha+\beta+n)^2} +\left(\frac{\alpha-(\alpha+\beta)p}{\alpha+\beta+n}\right)^2.$$ This shows the bias-variance tradeoff introduced by the prior. ::: ::: {.callout-note title="Remark"} *Remark 24* (Why shrinkage can help). The estimator $\widehat p_B$ is usually biased as a frequentist estimator, but it can have smaller MSE than the MLE $\widehat p=S/n$ for some values of $p$, especially when $n$ is small and the prior information is reasonable. 
::: # Bayesian Hypothesis Testing This section explains how Bayesian inference turns hypothesis testing into posterior probability comparison. In classical testing, we control Type I error, study power, and often use p-values or likelihood ratio tests. In Bayesian testing, the posterior distribution directly assigns probabilities to hypotheses. ## Posterior probability tests This subsection defines Bayesian tests through posterior probabilities of the null and alternative parameter regions. Let the hypotheses be $$H_0:\theta\in\Theta_0, \qquad H_1:\theta\in\Theta_0^c.$$ Given posterior $\pi(\theta\mid x)$, $$\mathbb{P}(H_0\mid x)=\mathbb{P}(\theta\in\Theta_0\mid x)=\int_{\Theta_0}\pi(\theta\mid x)\,d\theta,$$ and $$\mathbb{P}(H_1\mid x)=\mathbb{P}(\theta\in\Theta_0^c\mid x)=\int_{\Theta_0^c}\pi(\theta\mid x)\,d\theta.$$ ::: {.callout-note title="Definition"} **Definition 25** (Bayesian posterior probability test). A simple Bayesian decision rule rejects $H_0$ when $$\mathbb{P}(H_0\mid x)<\mathbb{P}(H_1\mid x),$$ equivalently, when $$\mathbb{P}(H_0\mid x)<\frac12.$$ A more conservative rule rejects $H_0$ when $$\mathbb{P}(H_0\mid x)<0.05.$$ ::: ::: {.callout-tip title="Example"} **Example 26** (Bayesian one-sided normal mean test). Suppose $$X_1,\ldots,X_n\mid \mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2),$$ where $\sigma^2$ is known, and suppose $$\mu\sim\operatorname{Normal}(\theta,\tau^2).$$ Test $$H_0:\mu\le \mu_0 \qquad\text{versus}\qquad H_1:\mu>\mu_0.$$ Find a posterior-probability rejection rule. 
::: ::: {.callout-caution title="Solution"} The posterior distribution is $$\mu\mid x\sim\operatorname{Normal}(m_n,v_n),$$ where $$m_n=\frac{\tau^2\sum_{i=1}^n x_i+\sigma^2\theta}{n\tau^2+\sigma^2}, \qquad v_n=\frac{\sigma^2\tau^2}{\sigma^2+n\tau^2}.$$ The posterior probability of the null is $$\mathbb{P}(H_0\mid x)=\mathbb{P}(\mu\le \mu_0\mid x) =\Phi\left(\frac{\mu_0-m_n}{\sqrt{v_n}}\right).$$ If we reject when $\mathbb{P}(H_0\mid x)<1/2$, then because the posterior is normal and symmetric, this is equivalent to $$m_n>\mu_0.$$ That is, $$\frac{\tau^2\sum_{i=1}^n x_i+\sigma^2\theta}{n\tau^2+\sigma^2}>\mu_0.$$ Equivalently, $$\overline x>\mu_0+\frac{\sigma^2(\mu_0-\theta)}{n\tau^2}.$$ ::: ## Bayesian tests with loss functions This subsection connects Bayesian tests to error costs. Suppose the action space is $\{a_0,a_1\}$, where $a_0$ means "accept or fail to reject $H_0$" and $a_1$ means "reject $H_0$". A generalized 0--1 loss is $$L(\theta,a_0)= \begin{cases} 0, & \theta\in\Theta_0,\\ c_{II}, & \theta\in\Theta_0^c, \end{cases}$$ and $$L(\theta,a_1)= \begin{cases} c_I, & \theta\in\Theta_0,\\ 0, & \theta\in\Theta_0^c. \end{cases}$$ Here $c_I$ is the cost of a Type I error and $c_{II}$ is the cost of a Type II error. ::: {.callout-note title="Theorem"} **Theorem 27** (Bayes test under generalized 0--1 loss). 
*Under the above loss, reject $H_0$ if $$c_I\mathbb{P}(H_0\mid x)<c_{II}\mathbb{P}(H_1\mid x).$$ Equivalently, $$\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}.$$* ::: ::: proof *Proof.* The posterior expected loss of accepting $H_0$ is $$\rho(a_0\mid x)=c_{II}\mathbb{P}(H_1\mid x).$$ The posterior expected loss of rejecting $H_0$ is $$\rho(a_1\mid x)=c_I\mathbb{P}(H_0\mid x).$$ We reject when $\rho(a_1\mid x)<\rho(a_0\mid x)$, namely $$c_I\mathbb{P}(H_0\mid x)<c_{II}\mathbb{P}(H_1\mid x).$$ Since $\mathbb{P}(H_1\mid x)=1-\mathbb{P}(H_0\mid x)$, this is equivalent to $$\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}.$$ ◻ ::: ::: {.callout-note title="Remark"} *Remark 28* (Connection to classical risk). In classical testing, the risk under generalized 0--1 loss is $$R(\theta,\delta)= \begin{cases} c_I\beta(\theta), & \theta\in\Theta_0,\\ c_{II}[1-\beta(\theta)], & \theta\in\Theta_0^c, \end{cases}$$ where $\beta(\theta)=\mathbb{P}_\theta(\text{reject }H_0)$ is the power function. Bayesian testing averages this risk using the posterior or prior distribution. ::: # Bayesian Interval Estimation This section introduces Bayesian credible intervals and contrasts them with frequentist confidence intervals. In frequentist confidence intervals, the interval is random and the parameter is fixed. In Bayesian credible intervals, the parameter is random under the posterior distribution, so probability statements about $\theta$ belonging to an observed interval are meaningful within the model. ::: {.callout-note title="Definition"} **Definition 29** (Credible set). A set $C(x)\subseteq\Theta$ is a $100(1-\alpha)\%$ **credible set** if $$\mathbb{P}(\theta\in C(x)\mid x)=\int_{C(x)}\pi(\theta\mid x)\,d\theta=1-\alpha.$$ If $C(x)=[a,b]$, then $[a,b]$ is called a credible interval. 
::: ::: {.callout-warning title="Warning"} Interpretation warning A frequentist $95\%$ confidence interval does not mean that there is a $95\%$ probability that the fixed parameter lies in the observed interval. A Bayesian $95\%$ credible interval does allow the statement: $$\mathbb{P}(\theta\in C(x)\mid x)=0.95.$$ This interpretation depends on the prior and the Bayesian model. ::: ## Equal-tail credible intervals This subsection introduces credible intervals based on posterior quantiles. ::: {.callout-note title="Definition"} **Definition 30** (Equal-tail credible interval). A $100(1-\alpha)\%$ equal-tail credible interval $[a,b]$ satisfies $$\mathbb{P}(\theta<a\mid x)=\frac{\alpha}{2}, \qquad \mathbb{P}(\theta>b\mid x)=\frac{\alpha}{2}.$$ Equivalently, $a$ and $b$ are the posterior $\alpha/2$ and $1-\alpha/2$ quantiles. ::: ::: {.callout-tip title="Example"} **Example 31** (Beta-binomial credible interval). Suppose $X\sim\operatorname{Binomial}(n,p)$, $p\sim\operatorname{Beta}(2,2)$, $n=20$, and $x=12$. Find the posterior distribution and describe the $95\%$ equal-tail credible interval. ::: ::: {.callout-caution title="Solution"} The posterior is $$p\mid X=12\sim\operatorname{Beta}(2+12,2+20-12)=\operatorname{Beta}(14,10).$$ The $95\%$ equal-tail credible interval is $$\left[q_{0.025},q_{0.975}\right],$$ where $q_c$ is the $c$th quantile of the $\operatorname{Beta}(14,10)$ distribution. Numerically, this interval is approximately $$[0.385,0.768].$$ This means that, under the beta-binomial Bayesian model, $$\mathbb{P}(0.385\le p\le 0.768\mid X=12)\approx 0.95.$$ ::: ::: {.callout-tip title="Example"} **Example 32** (Normal-normal credible interval). Suppose $X_1,\ldots,X_n\mid\mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$ with known $\sigma^2$, and $\mu\sim\operatorname{Normal}(\mu_0,\sigma_0^2)$. Find a $100(1-\alpha)\%$ credible interval for $\mu$. 
::: ::: {.callout-caution title="Solution"} The posterior is $$\mu\mid x\sim\operatorname{Normal}(m_n,v_n),$$ where $$m_n=\frac{\sigma_0^2\sum_{i=1}^n x_i+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}, \qquad v_n=\frac{\sigma^2\sigma_0^2}{\sigma^2+n\sigma_0^2}.$$ Therefore a $100(1-\alpha)\%$ credible interval is $$\left[m_n-z_{\alpha/2}\sqrt{v_n},\;m_n+z_{\alpha/2}\sqrt{v_n}\right],$$ where $z_{\alpha/2}$ satisfies $\mathbb{P}(Z>z_{\alpha/2})=\alpha/2$ for $Z\sim\operatorname{Normal}(0,1)$. ::: ::: {.callout-tip title="Example"} **Example 33** (Gamma-Poisson credible interval). Suppose $X_1,\ldots,X_n\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)$ and $\lambda\sim\operatorname{Gamma}(\alpha,\beta)$ in the rate parametrization. Derive the posterior credible interval for $\lambda$. ::: ::: {.callout-caution title="Solution"} The posterior is $$\lambda\mid x\sim \operatorname{Gamma}\left(\alpha+\sum_{i=1}^n x_i,\beta+n\right).$$ A $100(1-\alpha_0)\%$ equal-tail credible interval is $$\left[q_{\alpha_0/2},q_{1-\alpha_0/2}\right],$$ where $q_c$ is the $c$th quantile of the posterior gamma distribution. For example, if $n=10$, $\sum x_i=26$, and $\lambda\sim\operatorname{Gamma}(2,1)$, then $$\lambda\mid x\sim\operatorname{Gamma}(28,11),$$ and the interval is obtained from the quantiles of $\operatorname{Gamma}(28,11)$. ::: # Highest Posterior Density Regions This section explains the shortest Bayesian credible regions for unimodal posterior distributions. Equal-tail intervals are easy to compute, but they are not always the shortest credible intervals. For a unimodal posterior density, the shortest credible region is formed by keeping the parameter values with highest posterior density. ::: {.callout-note title="Definition"} **Definition 34** (Highest posterior density region). 
A $100(1-\alpha)\%$ **highest posterior density** (HPD) credible region is a set $$C_{\mathrm{HPD}}(x)=\{\theta:\pi(\theta\mid x)\ge k\},$$ where $k$ is chosen so that $$\int_{C_{\mathrm{HPD}}(x)}\pi(\theta\mid x)\,d\theta=1-\alpha.$$ ::: ::: {.callout-note title="Theorem"} **Theorem 35** (Shortest credible interval for unimodal posterior). *If $\pi(\theta\mid x)$ is unimodal, then the shortest credible interval with posterior probability $1-\alpha$ is the HPD interval. Its endpoints have equal posterior density, except possibly at a boundary of the parameter space.* ::: ::: {.callout-tip title="Example"} **Example 36** (Poisson HPD region). Suppose $$X_1,\ldots,X_n\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda),$$ and use a conjugate gamma prior. If $$\lambda\mid x\sim\operatorname{Gamma}\left(a+\sum x_i,\, n+b\right)$$ under a rate parametrization, then a $100(1-\alpha)\%$ HPD credible region has the form $$\{\lambda:\pi(\lambda\mid x)\ge k\},$$ where $k$ is chosen to make the posterior probability equal to $1-\alpha$. ::: ::: {.callout-caution title="Solution"} The posterior density is a gamma density. If it is unimodal, the HPD region contains the highest-density values around the posterior mode. The cutoff $k$ is determined by solving $$\int_{\{\lambda:\pi(\lambda\mid x)\ge k\}}\pi(\lambda\mid x)\,d\lambda=1-\alpha.$$ For example, in the lecture notes case with $a=b=1$, $n=10$, and $\sum x_i=6$, the posterior is gamma-shaped and the $90\%$ HPD credible set is approximately $$[0.253,1.005],$$ while a corresponding equal-tail interval is slightly longer. ::: ::: {.callout-note title="Remark"} *Remark 37* (Equal-tail versus HPD). Equal-tail intervals split posterior probability equally between the two tails. HPD intervals minimize length by taking the most plausible parameter values first. For symmetric unimodal posteriors, the equal-tail and HPD intervals often coincide. For skewed posteriors, they are usually different. 
::: # Bayesian Optimality for Intervals This section connects credible intervals to loss-function optimality. Bayesian interval estimation can be framed as a decision problem: the action is choosing a set $C$, and the loss penalizes long intervals while rewarding coverage of the true parameter. ::: {.callout-note title="Definition"} **Definition 38** (Interval loss function). One simple loss function for choosing a confidence or credible set $C$ is $$L(\theta,C)=b\cdot \operatorname{Length}(C)-\mathbb{1}\{\theta\in C\},$$ where $b>0$ controls the tradeoff between short length and high coverage. ::: The corresponding frequentist risk is $$R(\theta,C)=b\mathbb{E}_\theta[\operatorname{Length}(C(X))]-\mathbb{P}_\theta(\theta\in C(X)).$$ The Bayesian posterior expected loss is $$\rho(C\mid x)=b\cdot \operatorname{Length}(C)-\mathbb{P}(\theta\in C\mid x).$$ ::: {.callout-tip title="Key idea"} Interpretation Large $b$ prioritizes shorter intervals. Small $b$ prioritizes posterior coverage. This makes precise the tradeoff between precision and uncertainty. ::: ::: {.callout-tip title="Example"} **Example 39** (Normal interval risk). 
Suppose $X\sim\operatorname{Normal}(\mu,\sigma^2)$ with known $\sigma^2$, and consider symmetric intervals $$C(X)=[X-c\sigma,X+c\sigma], \qquad c\ge 0.$$ Compute the risk under $$L(\mu,C)=b\cdot\operatorname{Length}(C)-\mathbb{1}\{\mu\in C\}.$$ ::: ::: {.callout-caution title="Solution"} The length is $$\operatorname{Length}(C)=2c\sigma.$$ The coverage probability is $$\mathbb{P}_\mu(\mu\in C(X)) =\mathbb{P}_\mu(X-c\sigma\le\mu\le X+c\sigma) =\mathbb{P}\left(-c\le \frac{X-\mu}{\sigma}\le c\right) =2\Phi(c)-1.$$ Therefore the risk is $$R(c)=b(2c\sigma)-[2\Phi(c)-1].$$ Differentiating, $$R'(c)=2b\sigma-2\phi(c).$$ The optimum satisfies $$\phi(c)=b\sigma.$$ Since $\phi(c)=\frac{1}{\sqrt{2\pi}}e^{-c^2/2}$, if $b\sigma\le 1/\sqrt{2\pi}$, then $$c=\sqrt{-2\log(b\sigma\sqrt{2\pi})}.$$ If $b\sigma>1/\sqrt{2\pi}$, the minimum occurs at $c=0$, corresponding to a point estimate. ::: # Bayesian Inference Workflow This section summarizes the steps of Bayesian inference as a reusable procedure. ::: {.callout-tip title="Key idea"} Bayesian workflow 1. Specify the sampling model $f(x\mid\theta)$. 2. Choose a prior distribution $\pi(\theta)$. 3. Compute the posterior distribution $\pi(\theta\mid x)\propto f(x\mid\theta)\pi(\theta)$. 4. For point estimation, report posterior mean, median, MAP, or another loss-optimal estimate. 5. For testing, compute posterior probabilities of $H_0$ and $H_1$, then choose an action based on posterior risk. 6. For interval estimation, report an equal-tail credible interval or HPD credible interval. 7. Interpret the answer conditional on the chosen model and prior. ::: ::: {.callout-warning title="Warning"} Prior sensitivity Bayesian inference depends on the prior distribution. With large samples, the likelihood often dominates the prior. With small samples, the prior can strongly influence posterior estimates, tests, and intervals. ::: # Practice Problems This section provides practice problems that connect the Bayesian methods from Sections 12--17. 
::: {.callout-important title="Practice Problem"}
**Practice Problem 40** (Beta-binomial posterior and estimator).
Suppose $X\sim\operatorname{Binomial}(30,p)$ and $x=18$. Let $p\sim\operatorname{Beta}(4,6)$.

1. Find the posterior distribution of $p$.
2. Find the posterior mean.
3. Find the MAP estimator, assuming the posterior parameters are both larger than 1.
:::

::: {.callout-caution title="Solution"}
1. The posterior is
   $$p\mid x\sim\operatorname{Beta}(4+18,6+30-18)=\operatorname{Beta}(22,18).$$
2. The posterior mean is
   $$\mathbb{E}(p\mid x)=\frac{22}{22+18}=\frac{22}{40}=0.55.$$
3. The MAP estimator is
   $$\widehat p_{\mathrm{MAP}}=\frac{22-1}{22+18-2}=\frac{21}{38}\approx 0.5526.$$
:::

::: {.callout-important title="Practice Problem"}
**Practice Problem 41** (Normal-normal posterior).
Suppose $X_1,\ldots,X_{16}\mid\mu\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,4)$ and $\overline x=10$. Let $\mu\sim\operatorname{Normal}(8,9)$.

1. Find the posterior variance $v_n$.
2. Find the posterior mean $m_n$.
3. Give a $95\%$ credible interval for $\mu$.
:::

::: {.callout-caution title="Solution"}
Here $n=16$, $\sigma^2=4$, $\mu_0=8$, and $\sigma_0^2=9$.

1. $$v_n=\left(\frac{16}{4}+\frac{1}{9}\right)^{-1}
   =\left(4+\frac19\right)^{-1}
   =\frac{9}{37}.$$
2. $$m_n=v_n\left(\frac{16\cdot 10}{4}+\frac{8}{9}\right)
   =\frac{9}{37}\left(40+\frac{8}{9}\right)
   =\frac{368}{37}\approx 9.946.$$
3. A $95\%$ credible interval is
   $$m_n\pm 1.96\sqrt{v_n}
   =9.946\pm 1.96\sqrt{\frac{9}{37}}.$$
   Since $\sqrt{9/37}\approx 0.4932$, the half-width is about $0.967$ and the interval is approximately
   $$[8.979,10.913].$$
:::

::: {.callout-important title="Practice Problem"}
**Practice Problem 42** (Gamma-Poisson posterior).
Suppose $X_1,\ldots,X_8\mid\lambda\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)$ and $\sum x_i=20$. Let $\lambda\sim\operatorname{Gamma}(3,2)$ using the rate parametrization.

1. Find the posterior distribution.
2. Find the posterior mean.
3. Find the MAP estimator, assuming the posterior shape is larger than 1.
:::

::: {.callout-caution title="Solution"}
1. The posterior is
   $$\lambda\mid x\sim\operatorname{Gamma}(3+20,2+8)=\operatorname{Gamma}(23,10).$$
2. The posterior mean is
   $$\mathbb{E}(\lambda\mid x)=\frac{23}{10}=2.3.$$
3. For a gamma distribution with shape $a$ and rate $b$, the mode is $(a-1)/b$ when $a>1$. Thus
   $$\widehat\lambda_{\mathrm{MAP}}=\frac{23-1}{10}=2.2.$$
:::

::: {.callout-important title="Practice Problem"}
**Practice Problem 43** (Bayesian test with posterior probability).
Suppose the posterior distribution of a parameter is
$$\theta\mid x\sim\operatorname{Normal}(1.2,0.25).$$
Test
$$H_0:\theta\le 0 \qquad\text{versus}\qquad H_1:\theta>0.$$
Compute $\mathbb{P}(H_0\mid x)$ and decide whether to reject $H_0$ using the rule $\mathbb{P}(H_0\mid x)<0.05$.
:::

::: {.callout-caution title="Solution"}
The posterior standard deviation is $\sqrt{0.25}=0.5$. Therefore
$$\mathbb{P}(H_0\mid x)=\mathbb{P}(\theta\le 0\mid x)
=\Phi\left(\frac{0-1.2}{0.5}\right)
=\Phi(-2.4).$$
Using the standard normal table,
$$\Phi(-2.4)\approx 0.0082.$$
Since $0.0082<0.05$, we reject $H_0$ using this Bayesian posterior-probability rule.
:::

::: {.callout-important title="Practice Problem"}
**Practice Problem 44** (Bayesian decision rule with unequal costs).
For testing $H_0:\theta\in\Theta_0$ versus $H_1:\theta\in\Theta_0^c$, suppose
$$\mathbb{P}(H_0\mid x)=0.30.$$
The cost of Type I error is $c_I=5$, and the cost of Type II error is $c_{II}=1$. Should we reject $H_0$?
:::

::: {.callout-caution title="Solution"}
Reject $H_0$ if
$$\mathbb{P}(H_0\mid x)<\frac{c_{II}}{c_I+c_{II}}=\frac{1}{5+1}=\frac16\approx 0.1667.$$
Here $\mathbb{P}(H_0\mid x)=0.30>0.1667$, so we do not reject $H_0$. The high cost of Type I error makes the rejection rule more conservative.
:::

::: {.callout-important title="Practice Problem"}
**Practice Problem 45** (Equal-tail credible interval).
Suppose the posterior distribution is
$$\theta\mid x\sim\operatorname{Normal}(5,4).$$
Find a $90\%$ equal-tail credible interval.
:::

::: {.callout-caution title="Solution"}
The posterior standard deviation is $2$. For a $90\%$ interval, $z_{0.05}\approx 1.645$. Thus
$$5\pm 1.645(2)=5\pm 3.29.$$
The credible interval is
$$[1.71,8.29].$$
:::

::: {.callout-important title="Practice Problem"}
**Practice Problem 46** (HPD interval concept).
Suppose a posterior density is unimodal and skewed to the right. Explain why the $95\%$ HPD interval may differ from the $95\%$ equal-tail interval.
:::

::: {.callout-caution title="Solution"}
The equal-tail interval places $2.5\%$ posterior probability in each tail. The HPD interval instead contains the parameter values with highest posterior density and chooses the cutoff so that the total posterior probability is $95\%$. For a skewed posterior, equal tails may include some low-density values in the long tail while excluding higher-density values on the other side. Therefore the HPD interval is usually shorter and has endpoints with equal posterior density, while the equal-tail interval is based on posterior quantiles.
:::

# Summary

This section summarizes the role of Bayesian inference across the main statistical tasks of the course.

::: {.callout-tip title="Key idea"}
**Section summary**

1. Bayesian inference starts with a prior $\pi(\theta)$ and likelihood $f(x\mid\theta)$.
2. Bayes' rule gives the posterior:
   $$\pi(\theta\mid x)\propto f(x\mid\theta)\pi(\theta).$$
3. Point estimates can be posterior means, medians, or MAP estimates, depending on the loss function.
4. Conjugate priors make posterior computation simple.
5. Bayesian tests compare posterior probabilities of hypotheses, possibly weighted by error costs.
6. Bayesian credible intervals give direct posterior probability statements.
7. HPD regions are shortest credible regions for unimodal posterior distributions.
8. Bayesian decisions are naturally derived by minimizing posterior expected loss.
:::

::: center

  Task                  Bayesian object                    Common answer
  --------------------- ---------------------------------- -------------------------------------
  Point estimation      Posterior distribution             Posterior mean, median, MAP
  Testing               Posterior hypothesis probability   Reject if posterior risk is smaller
  Interval estimation   Posterior credible probability     Equal-tail or HPD credible interval
  Model updating        Prior and likelihood               Posterior distribution
  Decision theory       Posterior expected loss            Bayes action

:::
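
The testing and interval rows of the summary above can be illustrated for a normal posterior. This is a minimal sketch under stated assumptions, not a definitive implementation: the posterior is taken to be exactly $\operatorname{Normal}(m,v)$, the function names are hypothetical, and the rejection threshold $c_{II}/(c_I+c_{II})$ is the cost-weighted rule used in the practice problems. Only Python's standard library (`statistics.NormalDist`) is used.

```python
import math
from statistics import NormalDist


def posterior_prob_below(mean, var, cutoff):
    """P(theta <= cutoff | x) when theta | x ~ Normal(mean, var)."""
    return NormalDist(mean, math.sqrt(var)).cdf(cutoff)


def reject_h0(prob_h0, cost_type1, cost_type2):
    """Reject H0 when P(H0 | x) < c_II / (c_I + c_II)."""
    return prob_h0 < cost_type2 / (cost_type1 + cost_type2)


def equal_tail_interval(mean, var, level):
    """Equal-tail credible interval with posterior probability `level`."""
    post = NormalDist(mean, math.sqrt(var))
    alpha = 1 - level
    # Cut alpha/2 posterior probability from each tail.
    return post.inv_cdf(alpha / 2), post.inv_cdf(1 - alpha / 2)


if __name__ == "__main__":
    # Illustrative posterior, chosen for this sketch: theta | x ~ Normal(0.8, 0.16).
    p0 = posterior_prob_below(mean=0.8, var=0.16, cutoff=0.0)
    print("P(H0 | x):", round(p0, 4))
    print("Reject with equal costs?", reject_h0(p0, cost_type1=1, cost_type2=1))
    print("Reject with costly Type I error?", reject_h0(p0, cost_type1=99, cost_type2=1))
    print("95% equal-tail interval:", equal_tail_interval(0.8, 0.16, 0.95))
```

Because a normal posterior is symmetric and unimodal, the equal-tail interval computed here coincides with the HPD interval; for a skewed posterior the two would generally differ and the HPD interval would need a separate search over density cutoffs.
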