Chapter 6 Extra: Multinomial and Multinormal Distributions

This extra chapter extends Section 6 in two directions. First, it develops the multinomial distribution and the Dirichlet prior for categorical data. Second, it introduces the multivariate Gaussian distribution and the linear-algebra formulas behind marginalization, conditioning, affine transformations, and linear Gaussian models.

Topics. Multinomial distribution; Dirichlet distribution; Dirichlet-multinomial Bayesian model; multivariate Gaussian distribution; Mahalanobis distance; spectral geometry; marginal and conditional Gaussian distributions; affine transformations; products and convolutions of Gaussians; linear Gaussian models.

Multinomial Distribution

This section extends the categorical and binomial distributions to experiments with more than two possible outcomes.

From categorical trials to multinomial counts

A categorical random variable records one outcome from several categories, while a multinomial random vector records the counts of each outcome after many independent categorical trials.

Definition 1 (Categorical trial). A categorical trial has $m$ possible outcomes $O_1,\ldots,O_m$ with probabilities \[\phi_1,\ldots,\phi_m,\qquad \phi_i\ge 0,\qquad \sum_{i=1}^m \phi_i=1.\] If one trial is performed, then the outcome has a categorical distribution.

Definition 2 (Multinomial distribution). Suppose we perform $n$ independent trials, each with outcomes $O_1,\ldots,O_m$ and constant category probabilities $(\phi_1,\ldots,\phi_m)$. Let \[X_i=\text{the number of times outcome }O_i\text{ appears in the }n\text{ trials}.\] Then \[X=(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1, \ldots,\phi_m),\] and for nonnegative integers $n_1,\ldots,n_m$ satisfying $n_1+\cdots+n_m=n$, \[\mathbb{P}(X_1=n_1,\ldots,X_m=n_m) = \frac{n!}{n_1!\cdots n_m!}\phi_1^{n_1}\cdots \phi_m^{n_m}.\]

Course connection

The multinomial distribution is the direct generalization of the categorical distribution. When $n=1$, it reduces to a categorical trial. When $m=2$, it reduces to the binomial distribution.

Example 3 (Tossing an $m$-sided die). An $m$-sided die has probabilities $\phi_1,\ldots,\phi_m$ for sides $1,\ldots,m$. Toss the die $n$ times and let $X_i$ be the number of times side $i$ appears. Find the joint pmf of $(X_1,\ldots,X_m)$.

Solution

The vector of counts follows \[(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1,\ldots,\phi_m).\] Therefore, for $n_1+\cdots+n_m=n$, \[\mathbb{P}(X_1=n_1,\ldots,X_m=n_m) = \frac{n!}{n_1!\cdots n_m!}\prod_{i=1}^m\phi_i^{n_i}.\] The coefficient $n!/(n_1!\cdots n_m!)$ counts how many orderings of the $n$ trials produce the same vector of counts.

Example 4 (Three-category experiment). Suppose a website visit leads to one of three outcomes: purchase, sign-up only, or no action. The probabilities are $(0.1,0.2,0.7)$. Among $n=10$ visitors, find the probability of exactly $2$ purchases, $3$ sign-ups only, and $5$ no-action visits.

Solution

Let $X=(X_1,X_2,X_3)\sim \operatorname{Multinomial}(10,0.1,0.2,0.7)$. Then \[\mathbb{P}(X_1=2,X_2=3,X_3=5) = \frac{10!}{2!3!5!}(0.1)^2(0.2)^3(0.7)^5.\] Numerically, \[\frac{10!}{2!3!5!}(0.1)^2(0.2)^3(0.7)^5 =2520(0.01)(0.008)(0.16807)\approx 0.0339.\]
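The arithmetic in Example 4 can be checked with a short Python sketch. The helper name `multinomial_pmf` is an assumption introduced here for illustration; it is not part of the course material.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Multinomial pmf for a vector of counts and category probabilities.

    `multinomial_pmf` is a hypothetical helper name used for illustration.
    """
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! ... n_m!), computed in exact integers.
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)
    # Product of phi_i ** n_i over the categories.
    return coefficient * prod(p ** c for p, c in zip(probs, counts))


# Example 4: 2 purchases, 3 sign-ups only, 5 no-action visits among 10 visitors.
p = multinomial_pmf([2, 3, 5], [0.1, 0.2, 0.7])
print(round(p, 4))
```

Rounded to four decimals, the result should agree with the value $0.0339$ obtained by hand.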

Mean, variance, and covariance

The components of a multinomial vector are individually binomial, but they are dependent because the total count is fixed at $n$.

Theorem 5 (Moments of the multinomial distribution). If $X=(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1,\ldots,\phi_m)$, then \[\mathbb{E}[X_i]=n\phi_i, \qquad \operatorname{Var}(X_i)=n\phi_i(1-\phi_i),\] and for $i\ne j$, \[\operatorname{Cov}(X_i,X_j)=-n\phi_i\phi_j.\] Equivalently, \[\operatorname{Cov}(X)=n\left(\operatorname{diag}(\phi)-\phi\phi^T\right),\] where $\phi=(\phi_1,\ldots,\phi_m)^T$.

Proof

Write $X_i=\sum_{r=1}^n I_{ri}$, where $I_{ri}=1$ if trial $r$ produces outcome $O_i$, and $0$ otherwise. Then $I_{ri}\sim \operatorname{Bernoulli}(\phi_i)$, and the indicators $I_{1i},\ldots,I_{ni}$ are independent because the trials are, so \[\mathbb{E}[X_i]=\sum_{r=1}^n \mathbb{E}[I_{ri}]=n\phi_i, \qquad \operatorname{Var}(X_i)=\sum_{r=1}^n \operatorname{Var}(I_{ri})=n\phi_i(1-\phi_i).\] For $i\ne j$, a single trial cannot produce both outcomes, so $I_{ri}I_{rj}=0$ and \[\operatorname{Cov}(I_{ri},I_{rj})=\mathbb{E}[I_{ri}I_{rj}]-\mathbb{E}[I_{ri}]\mathbb{E}[I_{rj}]=0-\phi_i\phi_j=-\phi_i\phi_j.\] Indicators from different trials are independent, so $\operatorname{Cov}(I_{ri},I_{sj})=0$ for $r\ne s$ and only the same-trial terms survive: \[\operatorname{Cov}(X_i,X_j)=\sum_{r=1}^n\sum_{s=1}^n \operatorname{Cov}(I_{ri},I_{sj})=\sum_{r=1}^n \operatorname{Cov}(I_{ri},I_{rj})=-n\phi_i\phi_j.\]

Remark 6. The negative covariance is not a paradox. If the total number of trials is fixed, seeing more observations in one category leaves fewer observations available for the other categories.
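The matrix form in Theorem 5 is easy to build numerically. The following is a minimal sketch assuming NumPy is available; the function name `multinomial_cov` is hypothetical, and the probabilities are borrowed from Example 4 purely as an illustration.

```python
import numpy as np


def multinomial_cov(n, phi):
    """Covariance matrix n * (diag(phi) - phi phi^T) from Theorem 5."""
    phi = np.asarray(phi, dtype=float)
    return n * (np.diag(phi) - np.outer(phi, phi))


# Probabilities from Example 4, used here purely as an illustration.
cov = multinomial_cov(10, [0.1, 0.2, 0.7])
print(cov)
# Each row sums to zero (up to rounding) because X_1 + ... + X_m = n is constant.
print(cov.sum(axis=1))
```

The zero row sums are the matrix version of the remark above: the covariance of each $X_i$ with the constant total $n$ is zero, so the negative off-diagonal entries must cancel the variance on the diagonal.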

Practice Problem 7 (Checking the binomial special case). Show that if $m=2$, then the multinomial distribution reduces to the binomial distribution.

Solution

Let $X=(X_1,X_2)\sim \operatorname{Multinomial}(n,p,1-p)$. Because $X_1+X_2=n$, the entire vector is determined by $X_1$. For $X_1=k$, we have $X_2=n-k$. Thus \[\mathbb{P}(X_1=k)=\mathbb{P}(X_1=k,X_2=n-k) =\frac{n!}{k!(n-k)!}p^k(1-p)^{n-k},\] which is exactly the $\operatorname{Binomial}(n,p)$ pmf.

Dirichlet Distribution

This section introduces the Dirichlet distribution, the continuous distribution on probability vectors that generalizes the Beta distribution.

The simplex and the Dirichlet density

The Dirichlet distribution is used when the unknown quantity is itself a vector of probabilities.

Definition 8 (Probability simplex). The $(K-1)$-dimensional probability simplex is \[\Delta^{K-1}=\left\{\mu=(\mu_1,\ldots,\mu_K):\; \mu_k\ge 0,\; \sum_{k=1}^K \mu_k=1\right\}.\] A point in the simplex represents a probability vector over $K$ categories.

Definition 9 (Dirichlet distribution). Let $\alpha=(\alpha_1,\ldots,\alpha_K)$ with $\alpha_k>0$ and let \[\alpha_0=\sum_{k=1}^K \alpha_k.\] A random probability vector $\mu=(\mu_1,\ldots,\mu_K)$ has a Dirichlet distribution with parameter $\alpha$, written \[\mu\sim \operatorname{Dirichlet}(\alpha_1,\ldots,\alpha_K),\] if its density on the simplex is \[p(\mu\mid\alpha)= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots \Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k-1},\] where $\Gamma(\cdot)$ is the gamma function.

Interpretation of the parameters

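The parameter $\alpha_k$ acts like a prior pseudo-count for category $k$. Larger values of $\alpha_0$ concentrate the distribution more tightly around its mean $(\alpha_1/\alpha_0,\ldots,\alpha_K/\alpha_0)$. When every $\alpha_k<1$, the density grows without bound near the boundary of the simplex, so more mass sits near the corners and edges.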

Theorem 10 (Dirichlet moments). If $\mu\sim \operatorname{Dirichlet}(\alpha_1,\ldots,\alpha_K)$ and $\alpha_0=\sum_k\alpha_k$, then \[\mathbb{E}[\mu_k]=\frac{\alpha_k}{\alpha_0},\] \[\operatorname{Var}(\mu_k)=\frac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}, \qquad \operatorname{Cov}(\mu_i,\mu_j)= -\frac{\alpha_i\alpha_j}{\alpha_0^2(\alpha_0+1)}\quad (i\ne j).\]

Example 11 (The Beta distribution as a special case). Show that when $K=2$, the Dirichlet distribution becomes the Beta distribution.

Solution

When $K=2$, the simplex condition says $\mu_1+\mu_2=1$, so $\mu_2=1-\mu_1$. The Dirichlet density is proportional to \[\mu_1^{\alpha_1-1}\mu_2^{\alpha_2-1} =\mu_1^{\alpha_1-1}(1-\mu_1)^{\alpha_2-1}.\] The normalized density is \[\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} \mu_1^{\alpha_1-1}(1-\mu_1)^{\alpha_2-1},\] which is the $\operatorname{Beta}(\alpha_1,\alpha_2)$ density. Therefore, \[\operatorname{Dirichlet}(\alpha_1,\alpha_2)=\operatorname{Beta}(\alpha_1,\alpha_2).\]

Dirichlet-multinomial Bayesian model

The Dirichlet distribution is the natural conjugate prior for categorical and multinomial probability vectors.

Definition 12 (Dirichlet-multinomial model). A Bayesian model for categorical or multinomial data is \[\mu\sim \operatorname{Dirichlet}(\alpha_1, \ldots,\alpha_K), \qquad X\mid\mu\sim \operatorname{Multinomial}(n,\mu_1,\ldots,\mu_K).\] Here the likelihood is multinomial, the prior is Dirichlet, and the posterior is again Dirichlet.

Theorem 13 (Dirichlet-multinomial conjugacy). Suppose \[\mu\sim \operatorname{Dirichlet}(\alpha_1, \ldots,\alpha_K), \qquad X=(X_1,\ldots,X_K)\mid\mu\sim \operatorname{Multinomial}(n,\mu_1,\ldots,\mu_K).\] If the observed counts are $X_1=n_1,\ldots,X_K=n_K$, then \[\mu\mid X=(n_1,\ldots,n_K) \sim \operatorname{Dirichlet}(\alpha_1+n_1, \ldots, \alpha_K+n_K).\]

Proof

The prior density is \[p(\mu)\propto \prod_{k=1}^K \mu_k^{\alpha_k-1}.\] The multinomial likelihood, as a function of $\mu$, is \[p(x\mid\mu)\propto \prod_{k=1}^K \mu_k^{n_k}.\] Therefore, by Bayes’ theorem, \[p(\mu\mid x)\propto p(x\mid\mu)p(\mu) \propto \prod_{k=1}^K \mu_k^{n_k}\prod_{k=1}^K\mu_k^{\alpha_k-1} = \prod_{k=1}^K \mu_k^{\alpha_k+n_k-1}.\] This is the kernel of a Dirichlet distribution with parameters $\alpha_k+n_k$.

Example 14 (Updating category probabilities). Suppose there are three categories and the prior is \[\mu\sim \operatorname{Dirichlet}(2,2,2).\] We observe $n=10$ trials with counts $(5,3,2)$. Find the posterior distribution and posterior mean.

Solution

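By conjugacy, \[\mu\mid X=(5,3,2)\sim \operatorname{Dirichlet}(2+5,2+3,2+2)=\operatorname{Dirichlet}(7,5,4).\] The posterior total is $7+5+4=16$. Therefore, \[\mathbb{E}[\mu_1\mid X]=\frac{7}{16}, \qquad \mathbb{E}[\mu_2\mid X]=\frac{5}{16}, \qquad \mathbb{E}[\mu_3\mid X]=\frac{4}{16}=\frac14.\] The posterior mean is a weighted average of the prior mean $\alpha_k/\alpha_0$ and the observed frequency $n_k/n$: \[\frac{\alpha_k+n_k}{\alpha_0+n}=\frac{\alpha_0}{\alpha_0+n}\cdot\frac{\alpha_k}{\alpha_0}+\frac{n}{\alpha_0+n}\cdot\frac{n_k}{n}.\] Here the weights are $6/16$ on the prior mean $1/3$ and $10/16$ on the observed frequencies $(0.5,0.3,0.2)$.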
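The conjugate update amounts to adding counts to parameters, which the following sketch makes explicit. The function names `dirichlet_posterior` and `dirichlet_mean` are assumptions made for this illustration, and the exact-fraction arithmetic assumes integer parameters.

```python
from fractions import Fraction


def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters alpha_k + n_k (Theorem 13)."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + c for a, c in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean vector alpha_k / alpha_0 (Theorem 10), as exact fractions.

    Assumes integer parameters so that Fraction arithmetic is exact.
    """
    alpha_0 = sum(alpha)
    return [Fraction(a, alpha_0) for a in alpha]


# Example 14: prior Dirichlet(2, 2, 2) and observed counts (5, 3, 2).
posterior = dirichlet_posterior([2, 2, 2], [5, 3, 2])
posterior_mean = dirichlet_mean(posterior)
print(posterior)
print([str(m) for m in posterior_mean])
```

Working in exact fractions keeps the output directly comparable with the hand calculation; for non-integer pseudo-counts, ordinary floating-point division would be the natural replacement.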

Practice Problem 15 (Uniform prior on the simplex). Let $\mu\sim \operatorname{Dirichlet}(1,1,1)$ and suppose counts $(n_1,n_2,n_3)=(4,1,5)$ are observed. Find the posterior distribution.

Solution

The posterior is \[\mu\mid X\sim \operatorname{Dirichlet}(1+4,1+1,1+5)=\operatorname{Dirichlet}(5,2,6).\] Because $\operatorname{Dirichlet}(1,1,1)$ is uniform over the three-category simplex, the posterior parameters are simply one plus the observed counts.

Single Variable Gaussian Distribution

This section reviews the ordinary one-dimensional Gaussian distribution before moving to the multivariate Gaussian distribution.

Density, mean, variance, and probability

The one-dimensional Gaussian distribution is the prototype for the multivariate Gaussian distribution.

Definition 16 (Univariate Gaussian). A random variable $X$ has a Gaussian or normal distribution with mean $\mu$ and variance $\sigma^2$, written \[X\sim \operatorname{Normal}(\mu,\sigma^2),\] if its density is \[f_X(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}, \qquad -\infty<x<\infty.\] Then \[\mathbb{E}[X]=\mu, \qquad \operatorname{Var}(X)=\sigma^2.\] For any $a<b$, \[\mathbb{P}(a<X<b)=\int_a^b f_X(x)\,dx.\]

Example 17 (Standardizing a normal random variable). Let $X\sim \operatorname{Normal}(\mu,\sigma^2)$. Show that \[Z=\frac{X-\mu}{\sigma}\] has the standard normal distribution.

Solution

The transformation is $Z=(X-\mu)/\sigma$, so $X=\mu+\sigma Z$. The change-of-variables formula gives \[f_Z(z)=f_X(\mu+\sigma z)\cdot \sigma.\] Substituting the density of $X$, \[f_Z(z)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{z^2}{2}\right)\sigma =\frac{1}{\sqrt{2\pi}}e^{-z^2/2}.\] Thus $Z\sim \operatorname{Normal}(0,1)$.

Multivariate Gaussian Distribution

This section introduces the Gaussian distribution for random vectors and explains how the covariance matrix controls its geometry.

Definition and covariance matrix

A multivariate Gaussian distribution is completely determined by its mean vector and covariance matrix.

Definition 18 (Multivariate Gaussian). Let $X\in\mathbb{R}^d$ be a random vector. We say \[X\sim \mathcal{N}_d(\mu,\Sigma),\] where $\mu\in\mathbb{R}^d$ and $\Sigma$ is a $d\times d$ symmetric positive definite matrix, if \[p(x\mid\mu,\Sigma) = \frac{1}{\sqrt{(2\pi)^d|\Sigma|}} \exp\left\{-\frac12 (x-\mu)^T\Sigma^{-1}(x-\mu)\right\}.\] The covariance matrix is \[\Sigma=\operatorname{Cov}(X,X)=\mathbb{E}\left[(X-\mu)(X-\mu)^T\right].\]

Normalization constant

The normalization constant is \[Z=\sqrt{\det(2\pi\Sigma)}=\sqrt{(2\pi)^d|\Sigma|}.\] This is the multivariate analogue of $\sqrt{2\pi}\sigma$ in the one-dimensional normal density.
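In numerical work the density is usually evaluated on the log scale, and the determinant and quadratic form are obtained from a Cholesky factor rather than from an explicit inverse. The sketch below shows one common way to do this, assuming NumPy; the function name `gaussian_logpdf` is hypothetical.

```python
import numpy as np


def gaussian_logpdf(x, mu, sigma):
    """Log of the N_d(mu, sigma) density at x, for positive definite sigma."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]
    # Cholesky factor: sigma = L L^T; fails if sigma is not positive definite.
    chol = np.linalg.cholesky(sigma)
    # Solve L z = x - mu, so that z^T z = (x - mu)^T sigma^{-1} (x - mu).
    z = np.linalg.solve(chol, x - mu)
    # log|sigma| = 2 * sum(log(diag(L))).
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + z @ z)


mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
# At x = mu the exponent vanishes, leaving only -log of the normalization constant.
print(gaussian_logpdf(mu, mu, sigma))
```

At $x=\mu$ the value is $-\log\sqrt{(2\pi)^d|\Sigma|}$, which gives a quick check of the normalization constant.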

Mahalanobis distance and geometry

The shape of a multivariate Gaussian is controlled by a quadratic form that generalizes the one-dimensional $z$-score.

Definition 19 (Mahalanobis distance). For $X\sim \mathcal{N}_d(\mu,\Sigma)$, define \[\Delta^2=(x-\mu)^T\Sigma^{-1}(x-\mu).\] The value $\Delta$ is called the Mahalanobis distance from $x$ to $\mu$ with respect to $\Sigma$.

Remark 20. In one dimension, \[\Delta^2=\frac{(x-\mu)^2}{\sigma^2}, \qquad \Delta=\left|\frac{x-\mu}{\sigma}\right|.\] Thus Mahalanobis distance generalizes the absolute value of the usual $z$-score. If $\Sigma=I$, then \[\Delta^2=(x-\mu)^T(x-\mu)=\|x-\mu\|^2,\] so Mahalanobis distance reduces to Euclidean distance.
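Both special cases in the remark can be seen numerically. The following sketch assumes NumPy; the function name `mahalanobis` and the test point are illustrative choices, not taken from the text.

```python
import numpy as np


def mahalanobis(x, mu, sigma):
    """Mahalanobis distance from x to mu with respect to sigma."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solve sigma w = diff instead of forming the inverse explicitly.
    w = np.linalg.solve(np.asarray(sigma, dtype=float), diff)
    return float(np.sqrt(diff @ w))


x = np.array([3.0, 4.0])
mu = np.zeros(2)
# With sigma = I the result is the Euclidean distance.
print(mahalanobis(x, mu, np.eye(2)))
# With variances 9 and 16, each coordinate is one standard deviation from the mean.
print(mahalanobis(x, mu, np.diag([9.0, 16.0])))
```

In the second call each coordinate contributes a squared $z$-score of $1$, so the distance is $\sqrt{2}$ even though the Euclidean distance is $5$.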

Example 21 (Two-dimensional covariance matrices). Compare the covariance matrices \[\Sigma_1=\begin{pmatrix}1&0\\0&1\end{pmatrix}, \qquad \Sigma_2=\begin{pmatrix}1&0.5\\0.5&1\end{pmatrix}, \qquad \Sigma_3=\begin{pmatrix}1&0.8\\0.8&1\end{pmatrix}.\] Describe the qualitative effect on the Gaussian density.

Solution

For $\Sigma_1$, the coordinates are uncorrelated with equal variance, so the contours are circles centered at $\mu$.

For $\Sigma_2$, the positive off-diagonal covariance creates positively tilted elliptical contours. Larger $x_1$ values tend to be associated with larger $x_2$ values.

For $\Sigma_3$, the positive dependence is stronger, so the ellipse is more stretched along the positive diagonal direction. The density is more concentrated around a narrow tilted ridge than in the case $\Sigma_2$.

Spectral decomposition and precision matrix

The eigenvalues and eigenvectors of the covariance matrix reveal the principal axes of a Gaussian distribution.

Theorem 22 (Spectral form of the quadratic term). Let $\Sigma$ be symmetric positive definite. Then \[\Sigma=UDU^T=\sum_{i=1}^d \lambda_i u_i u_i^T,\] where $U=[u_1\ \cdots\ u_d]$ is orthogonal and $D=\operatorname{diag}(\lambda_1,\ldots,\lambda_d)$ with $\lambda_i>0$. The precision matrix is \[\Sigma^{-1}=UD^{-1}U^T=\sum_{i=1}^d\frac{1}{\lambda_i}u_i u_i^T.\] If \[y_i=u_i^T(x-\mu),\] then \[\Delta^2=(x-\mu)^T\Sigma^{-1}(x-\mu)=\sum_{i=1}^d \frac{y_i^2}{\lambda_i}.\]

Proof

Since $\Sigma=UDU^T$ and $U^TU=I$, the inverse is \[\Sigma^{-1}=UD^{-1}U^T.\] Thus \[(x-\mu)^T\Sigma^{-1}(x-\mu) =(x-\mu)^TUD^{-1}U^T(x-\mu).\] Let $y=U^T(x-\mu)$. Then the quadratic form becomes \[y^TD^{-1}y=\sum_{i=1}^d\frac{y_i^2}{\lambda_i}.\]
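The identity in Theorem 22 can be checked on the matrix $\Sigma_3$ from Example 21. This is a sketch assuming NumPy; the evaluation point $x=(1,-1)$ is an illustrative choice.

```python
import numpy as np

sigma = np.array([[1.0, 0.8], [0.8, 1.0]])  # Sigma_3 from Example 21
mu = np.array([0.0, 0.0])
x = np.array([1.0, -1.0])  # illustrative point, not from the text

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, U = np.linalg.eigh(sigma)

# Coordinates of x - mu along the eigenvectors: y_i = u_i^T (x - mu).
y = U.T @ (x - mu)

delta_sq_spectral = np.sum(y**2 / eigenvalues)
delta_sq_direct = (x - mu) @ np.linalg.solve(sigma, x - mu)
print(eigenvalues)
print(delta_sq_spectral, delta_sq_direct)
```

The eigenvalues are $0.2$ and $1.8$, with eigenvectors along the anti-diagonal and diagonal directions. The point $(1,-1)$ lies along the short axis, so its squared Mahalanobis distance is $2/0.2=10$, five times its squared Euclidean distance.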

Terminology

The matrix $\Sigma^{-1}$ is called the precision matrix. The covariance matrix describes spread; the precision matrix describes inverse spread and appears naturally in the exponent of the Gaussian density.

Why Gaussian distributions are central

Gaussian distributions appear frequently because of the central limit theorem and because Gaussianity is preserved by many important operations.

Once Gaussian, always Gaussian

Gaussian random vectors are completely determined by mean vector $\mu$ and covariance matrix $\Sigma$. Linear transformations, marginalization, conditioning, and sums of independent Gaussian vectors all produce Gaussian distributions again.

Remark 23 (Central limit theorem motivation). If $X_1,\ldots,X_m$ are independent and identically distributed with finite mean and variance, then the arithmetic mean \[\bar X=\frac{1}{m}(X_1+\cdots+X_m)\] is approximately normal for large $m$, after appropriate centering and scaling. This explains why Gaussian distributions occur often in real data.

Marginal and Conditional Gaussian Distributions

This section gives the most important computational formulas for partitioned multivariate Gaussian distributions.

Partitioned Gaussian vectors

Partitioning a Gaussian vector allows us to study subsets of variables and conditional distributions.

Suppose \[X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right).\] Here $X_1$ and $X_2$ may themselves be vectors, and the blocks of $\Sigma$ have compatible dimensions.

Marginalization

The marginal distribution of a subvector of a Gaussian vector is Gaussian with the corresponding mean block and covariance block.

Theorem 24 (Gaussian marginal distribution). If \[X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right),\] then \[X_1\sim \mathcal{N}(\mu_1,\Sigma_{11}), \qquad X_2\sim \mathcal{N}(\mu_2,\Sigma_{22}).\]

Example 25 (Marginal of a bivariate normal). Let \[\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}1\\2\end{pmatrix}, \begin{pmatrix}4&1\\1&9\end{pmatrix} \right).\] Find the marginal distributions of $X_1$ and $X_2$.

Solution

By the marginalization theorem, \[X_1\sim \operatorname{Normal}(1,4), \qquad X_2\sim \operatorname{Normal}(2,9).\] The off-diagonal covariance $1$ affects the dependence between $X_1$ and $X_2$, but it does not change the marginal variances.

Conditional Gaussian distributions

Conditioning on one part of a Gaussian vector gives another Gaussian distribution.

Theorem 26 (Conditional Gaussian distribution). Suppose \[X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right),\] where $\Sigma_{22}$ is invertible. Then \[X_1\mid X_2=x_2\sim \mathcal{N}(\mu_{1\mid 2},\Sigma_{11\mid 2}),\] where \[\mu_{1\mid 2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2),\] and \[\Sigma_{11\mid 2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.\]

Proof

Proof idea by completing the square. Let the precision matrix be \[\Lambda=\Sigma^{-1}=\begin{pmatrix}\Lambda_{11}&\Lambda_{12}\\\Lambda_{21}&\Lambda_{22}\end{pmatrix}.\] The exponent of the joint Gaussian density is \[-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu).\] If $x_2$ is fixed, then as a function of $x_1$ this expression is a quadratic form in $x_1$. Therefore the conditional distribution must be Gaussian. Completing the square gives \[\Sigma_{11\mid 2}=\Lambda_{11}^{-1}, \qquad \mu_{1\mid 2}=\mu_1-\Lambda_{11}^{-1}\Lambda_{12}(x_2-\mu_2).\] Using the block inverse formula and Schur complement identities, \[\Lambda_{11}^{-1}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21},\] and \[-\Lambda_{11}^{-1}\Lambda_{12}=\Sigma_{12}\Sigma_{22}^{-1}.\] This gives the stated formula.

Interpretation

The conditional mean is a linear function of the observed value $x_2$. The conditional covariance does not depend on the observed value $x_2$; it only depends on the covariance blocks.

Example 27 (Conditional distribution in two dimensions). Let \[\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1&\rho\\\rho&1\end{pmatrix} \right), \qquad -1<\rho<1.\] Find $X_1\mid X_2=x_2$.

Solution

Here \[\mu_1=0, \quad \mu_2=0, \quad \Sigma_{11}=1, \quad \Sigma_{22}=1, \quad \Sigma_{12}=\rho.\] Therefore, \[\mu_{1\mid 2}=0+\rho(1)^{-1}(x_2-0)=\rho x_2,\] and \[\Sigma_{11\mid 2}=1-\rho(1)^{-1}\rho=1-\rho^2.\] Thus \[X_1\mid X_2=x_2\sim \operatorname{Normal}(\rho x_2,1-\rho^2).\] If $\rho=0$, the conditional distribution is the same as the marginal distribution, reflecting independence.
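The formulas of Theorem 26 translate directly into code. The sketch below assumes NumPy; the function name `condition_gaussian`, its index-list interface, and the values $\rho=0.6$, $x_2=2$ are assumptions chosen for illustration.

```python
import numpy as np


def condition_gaussian(mu, sigma, idx1, idx2, x2):
    """Parameters of X_1 | X_2 = x2 for a jointly Gaussian vector (Theorem 26).

    idx1 and idx2 are lists of coordinate indices forming the two blocks.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    s11 = sigma[np.ix_(idx1, idx1)]
    s12 = sigma[np.ix_(idx1, idx2)]
    s21 = sigma[np.ix_(idx2, idx1)]
    s22 = sigma[np.ix_(idx2, idx2)]
    # gain = Sigma_12 Sigma_22^{-1}; uses the symmetry of Sigma_22 to avoid an explicit inverse.
    gain = np.linalg.solve(s22, s12.T).T
    cond_mean = mu[idx1] + gain @ (x2 - mu[idx2])
    cond_cov = s11 - gain @ s21
    return cond_mean, cond_cov


# Example 27 with the illustrative values rho = 0.6 and x_2 = 2.
rho = 0.6
mean, cov = condition_gaussian([0.0, 0.0], [[1.0, rho], [rho, 1.0]], [0], [1], [2.0])
print(mean, cov)
```

For these values the formulas of Example 27 give mean $\rho x_2=1.2$ and variance $1-\rho^2=0.64$, which the code should reproduce.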

Linear algebra review: Schur complements and block inverses

The Schur complement is the matrix identity behind the conditional covariance formula.

Definition 28 (Schur complement). Let \[M=\begin{pmatrix}A&B\\C&D\end{pmatrix},\] where $D$ is invertible. The Schur complement of $D$ in $M$ is \[M/D=A-BD^{-1}C.\]

Theorem 29 (Block inverse formula). If $D$ and $M/D$ are invertible, then \[M^{-1}=\begin{pmatrix} (M/D)^{-1} & -(M/D)^{-1}BD^{-1}\\ -D^{-1}C(M/D)^{-1} & D^{-1}+D^{-1}C(M/D)^{-1}BD^{-1} \end{pmatrix}.\]

Proof

Block Gaussian elimination gives \[\begin{pmatrix}A&B\\C&D\end{pmatrix} = \begin{pmatrix}I&BD^{-1}\\0&I\end{pmatrix} \begin{pmatrix}A-BD^{-1}C&0\\0&D\end{pmatrix} \begin{pmatrix}I&0\\D^{-1}C&I\end{pmatrix}.\] Taking inverses of the three factors and multiplying them gives the stated block inverse formula.

Practice Problem 30 (Schur complement calculation). Let \[M=\begin{pmatrix}4&2\\2&3\end{pmatrix}.\] Compute the Schur complement of $D=3$.

Solution

Here $A=4$, $B=2$, $C=2$, and $D=3$. Therefore \[M/D=A-BD^{-1}C=4-2\cdot\frac13\cdot 2=4-\frac43=\frac83.\]
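The block inverse formula of Theorem 29 can be verified numerically by assembling the four blocks and comparing with a direct inverse. The matrix below is an arbitrary symmetric positive definite example chosen for illustration, and the sketch assumes NumPy.

```python
import numpy as np

# Illustrative symmetric positive definite matrix, split into 2x2 and 1x1 diagonal blocks.
M = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.5],
              [2.0, 0.5, 5.0]])
A, B = M[:2, :2], M[:2, 2:]
C, D = M[2:, :2], M[2:, 2:]

D_inv = np.linalg.inv(D)
schur = A - B @ D_inv @ C          # M/D, the Schur complement of D in M
schur_inv = np.linalg.inv(schur)

# Assemble M^{-1} block by block following Theorem 29.
M_inv_blocks = np.block([
    [schur_inv, -schur_inv @ B @ D_inv],
    [-D_inv @ C @ schur_inv, D_inv + D_inv @ C @ schur_inv @ B @ D_inv],
])

print(np.allclose(M_inv_blocks, np.linalg.inv(M)))
```

The top-left block of $M^{-1}$ is the inverse of the Schur complement, which is the identity $\Lambda_{11}^{-1}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ used in the proof of the conditional Gaussian formula.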

Operations on Gaussian Random Variables

This section summarizes several operations that preserve Gaussian structure.

Affine transformations

A linear transformation plus a shift maps a Gaussian vector to another Gaussian vector.

Theorem 31 (Affine transformation of a Gaussian). If \[X\sim \mathcal{N}(\mu,\Sigma), \qquad Y=AX+b,\] then \[Y\sim \mathcal{N}(A\mu+b,A\Sigma A^T).\]

Proof

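The mean is \[\mathbb{E}[Y]=\mathbb{E}[AX+b]=A\mathbb{E}[X]+b=A\mu+b.\] The covariance is \[\operatorname{Var}(Y)=\operatorname{Var}(AX+b)=A\operatorname{Var}(X)A^T=A\Sigma A^T.\] That $Y$ is Gaussian follows from the characteristic function of $X$, $\mathbb{E}[e^{is^TX}]=\exp\{is^T\mu-\frac12 s^T\Sigma s\}$: taking $s=A^Tt$ gives \[\mathbb{E}[e^{it^TY}]=e^{it^Tb}\,\mathbb{E}[e^{i(A^Tt)^TX}]=\exp\left\{it^T(A\mu+b)-\frac12 t^TA\Sigma A^Tt\right\},\] which is the characteristic function of a Gaussian with mean $A\mu+b$ and covariance $A\Sigma A^T$.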

Example 32 (Linear combination of a Gaussian vector). Let $X\sim \mathcal{N}(\mu,\Sigma)$ and let $a$ be a fixed vector. Find the distribution of $Y=a^TX$.

Solution

This is the affine transformation with $A=a^T$ and $b=0$. Hence \[Y=a^TX\sim \operatorname{Normal}(a^T\mu,a^T\Sigma a).\]
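Theorem 31 needs only two matrix products. The following sketch assumes NumPy; the function name `affine_gaussian` and the numbers are illustrative assumptions, not values from the text.

```python
import numpy as np


def affine_gaussian(mu, sigma, A, b):
    """Mean and covariance of Y = A X + b when X ~ N(mu, sigma) (Theorem 31)."""
    # atleast_2d turns a plain vector a into the row matrix a^T.
    A = np.atleast_2d(np.asarray(A, dtype=float))
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return A @ mu + np.asarray(b, dtype=float), A @ sigma @ A.T


# Illustrative numbers (assumptions, not from the text): Y = X_1 + X_2.
mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mean_y, cov_y = affine_gaussian(mu, sigma, A=[1.0, 1.0], b=0.0)
print(mean_y, cov_y)
```

Here $a^T\Sigma a=2+1+2(0.5)=4$: the variance of a sum is the sum of the variances plus twice the covariance, which is the scalar content of $a^T\Sigma a$.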

Products and convolutions of Gaussian densities

Products and convolutions of Gaussian densities are central in Bayesian updating, filtering, and linear models.

Theorem 33 (Pointwise product of two Gaussian densities). Consider two Gaussian densities over the same variable $x$: \[p_1(x)=\mathcal{N}(x\mid\mu_1,\Sigma_1), \qquad p_2(x)=\mathcal{N}(x\mid\mu_2,\Sigma_2).\] Their pointwise product is proportional to another Gaussian density: \[p_1(x)p_2(x)\propto \mathcal{N}(x\mid\mu,\Sigma),\] where \[\Sigma^{-1}=\Sigma_1^{-1}+\Sigma_2^{-1}, \qquad \mu=\Sigma\left(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2\right).\]

Proof

The product has exponent \[-\frac12(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) -\frac12(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2).\] Collecting quadratic and linear terms in $x$ gives \[-\frac12 x^T(\Sigma_1^{-1}+\Sigma_2^{-1})x +x^T(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2)+\text{constant}.\] Completing the square gives the stated covariance and mean.

Theorem 34 (Convolution and sums). If $X\sim \mathcal{N}(\mu_1,\Sigma_1)$ and $Y\sim \mathcal{N}(\mu_2,\Sigma_2)$ are independent, then \[Z=X+Y\sim \mathcal{N}(\mu_1+\mu_2,\Sigma_1+\Sigma_2).\] Equivalently, the convolution \[p_Z(z)=\int p_X(x)p_Y(z-x)\,dx\] is Gaussian.

Proof

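By linearity of expectation, \[\mathbb{E}[Z]=\mathbb{E}[X]+\mathbb{E}[Y]=\mu_1+\mu_2,\] and since $X$ and $Y$ are independent, \[\operatorname{Var}(Z)=\operatorname{Var}(X)+\operatorname{Var}(Y)=\Sigma_1+\Sigma_2.\] Independence also makes the stacked vector $(X,Y)$ jointly Gaussian with block-diagonal covariance, and $Z=X+Y$ is the linear map $\begin{pmatrix}I&I\end{pmatrix}$ applied to it. Hence $Z$ is Gaussian by Theorem 31, and its distribution is determined by these two quantities.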

Example 35 (Bayesian normal update as product of Gaussians). Suppose a prior density for a scalar parameter is $\theta\sim \operatorname{Normal}(\mu_0,\sigma_0^2)$, and a Gaussian likelihood kernel is proportional to $\operatorname{Normal}(\theta\mid y,\sigma^2)$. Find the posterior variance and mean.

Solution

Using the product formula in one dimension, \[\frac{1}{\sigma_{post}^2}=\frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}.\] Thus \[\sigma_{post}^2=\left(\frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}\right)^{-1}.\] The posterior mean is \[\mu_{post}=\sigma_{post}^2\left(\frac{\mu_0}{\sigma_0^2}+\frac{y}{\sigma^2}\right).\] This is a precision-weighted average of the prior mean and the data value.
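The product rule of Theorem 33 works the same way in any dimension: precisions add, and the mean is the precision-weighted combination. The sketch below assumes NumPy; the function name `gaussian_product` and the numerical values are illustrative assumptions.

```python
import numpy as np


def gaussian_product(mu1, sigma1, mu2, sigma2):
    """Mean and covariance of the Gaussian proportional to p1(x) * p2(x) (Theorem 33)."""
    mu1 = np.atleast_1d(np.asarray(mu1, dtype=float))
    mu2 = np.atleast_1d(np.asarray(mu2, dtype=float))
    prec1 = np.linalg.inv(np.atleast_2d(np.asarray(sigma1, dtype=float)))
    prec2 = np.linalg.inv(np.atleast_2d(np.asarray(sigma2, dtype=float)))
    # Precisions add; the mean is the precision-weighted combination of the two means.
    sigma = np.linalg.inv(prec1 + prec2)
    mu = sigma @ (prec1 @ mu1 + prec2 @ mu2)
    return mu, sigma


# Scalar case of Example 35 with illustrative values: prior N(0, 4), data y = 3 with variance 1.
mu_post, var_post = gaussian_product(0.0, 4.0, 3.0, 1.0)
print(mu_post, var_post)
```

With these values the posterior precision is $1/4+1=5/4$, so $\sigma_{post}^2=0.8$ and $\mu_{post}=0.8\cdot 3=2.4$. The posterior mean sits closer to the data value because the data precision is four times the prior precision.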

Linear Gaussian Models

This section combines marginal and conditional Gaussian formulas in a common model used in Bayesian statistics and machine learning.

Marginal and conditional distributions in a linear Gaussian model

A linear Gaussian model assumes a Gaussian prior on an input vector and a Gaussian conditional distribution for an output vector.

Theorem 36 (Marginal and conditional Gaussians in a linear model). Suppose \[p(x)=\mathcal{N}(x\mid\mu,\Lambda^{-1}), \qquad p(y\mid x)=\mathcal{N}(y\mid Ax+b,L^{-1}),\] where $\Lambda$ and $L$ are precision matrices. Then \[p(y)=\mathcal{N}\left(y\mid A\mu+b, L^{-1}+A\Lambda^{-1}A^T\right).\] Moreover, \[p(x\mid y)=\mathcal{N}(x\mid m,S),\] where \[S=(\Lambda+A^TLA)^{-1},\] and \[m=S\left(A^TL(y-b)+\Lambda\mu\right).\]

Proof

Write the model as \[y=Ax+b+\varepsilon, \qquad \varepsilon\sim \mathcal{N}(0,L^{-1}), \qquad x\sim \mathcal{N}(\mu,\Lambda^{-1}),\] with $x$ independent of $\varepsilon$. Therefore \[\mathbb{E}[y]=A\mathbb{E}[x]+b=A\mu+b,\] and \[\operatorname{Var}(y)=A\operatorname{Var}(x)A^T+\operatorname{Var}(\varepsilon)=A\Lambda^{-1}A^T+L^{-1}.\] This gives the marginal distribution of $y$.

For the conditional distribution, multiply the prior and likelihood as functions of $x$: \[p(x\mid y)\propto \exp\left\{-\frac12(x-\mu)^T\Lambda(x-\mu) -\frac12(y-Ax-b)^TL(y-Ax-b)\right\}.\] Collecting quadratic terms in $x$ gives precision \[S^{-1}=\Lambda+A^TLA.\] Collecting linear terms gives \[S^{-1}m=\Lambda\mu+A^TL(y-b),\] so \[m=S\left(A^TL(y-b)+\Lambda\mu\right).\]

Example 37 (Scalar linear Gaussian model). Let \[X\sim \operatorname{Normal}(\mu,\tau^2), \qquad Y\mid X=x\sim \operatorname{Normal}(ax+b,\sigma^2).\] Find the marginal distribution of $Y$.

Solution

We can write \[Y=aX+b+\varepsilon, \qquad \varepsilon\sim \operatorname{Normal}(0,\sigma^2),\] with $\varepsilon$ independent of $X$. Hence \[\mathbb{E}[Y]=a\mu+b,\] and \[\operatorname{Var}(Y)=a^2\tau^2+\sigma^2.\] Therefore \[Y\sim \operatorname{Normal}(a\mu+b,a^2\tau^2+\sigma^2).\]

Practice Problem 38 (Posterior in the scalar linear Gaussian model). For the same model, \[X\sim \operatorname{Normal}(\mu,\tau^2), \qquad Y\mid X=x\sim \operatorname{Normal}(ax+b,\sigma^2),\] find the conditional distribution of $X\mid Y=y$.

Solution

Here the prior precision is $1/\tau^2$ and the observation precision is $1/\sigma^2$. The posterior precision is \[\frac{1}{s^2}=\frac{1}{\tau^2}+\frac{a^2}{\sigma^2}.\] Thus \[s^2=\left(\frac{1}{\tau^2}+\frac{a^2}{\sigma^2}\right)^{-1}.\] The posterior mean is \[m=s^2\left(\frac{\mu}{\tau^2}+\frac{a(y-b)}{\sigma^2}\right).\] Therefore \[X\mid Y=y\sim \operatorname{Normal}(m,s^2).\]
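The matrix formulas of Theorem 36 can be coded once and then checked against the scalar results above. This is a sketch assuming NumPy; the function name `linear_gaussian`, its argument order, and the numerical values are assumptions made for illustration.

```python
import numpy as np


def linear_gaussian(mu, Lam, A, b, L, y):
    """Marginal of y and posterior of x in the model of Theorem 36.

    Lam and L are the precision matrices of x and of y given x.
    """
    # Marginal: y ~ N(A mu + b, L^{-1} + A Lam^{-1} A^T).
    marg_mean = A @ mu + b
    marg_cov = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T
    # Posterior: S = (Lam + A^T L A)^{-1}, m = S (A^T L (y - b) + Lam mu).
    S = np.linalg.inv(Lam + A.T @ L @ A)
    m = S @ (A.T @ L @ (y - b) + Lam @ mu)
    return marg_mean, marg_cov, m, S


# Scalar check with illustrative values: mu = 1, tau^2 = 4, a = 2, b = 1, sigma^2 = 2, y = 5.
mu, tau2, a, b, sigma2, y = 1.0, 4.0, 2.0, 1.0, 2.0, 5.0
marg_mean, marg_cov, m, S = linear_gaussian(
    np.array([mu]), np.array([[1 / tau2]]), np.array([[a]]),
    np.array([b]), np.array([[1 / sigma2]]), np.array([y]),
)
print(marg_mean, marg_cov, m, S)
```

For these values the scalar formulas give $Y\sim\operatorname{Normal}(3,18)$, posterior variance $s^2=(1/4+4/2)^{-1}=4/9$, and posterior mean $m=\frac49\left(\frac14+4\right)=\frac{17}{9}$.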

Summary and Practice

This final section summarizes the main formulas and provides practice problems that reinforce the chapter.

Core formulas

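The main ideas of this chapter are that multinomial counts live on a discrete simplex (nonnegative integer vectors summing to $n$), Dirichlet distributions live on the simplex of probability vectors, and Gaussian random vectors remain Gaussian under marginalization, conditioning, affine transformations, and sums of independent vectors.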

Multinomial and Dirichlet formulas

\[\mathbb{P}(X_1=n_1,\ldots,X_K=n_K) =\frac{n!}{n_1!\cdots n_K!}\prod_{k=1}^K\phi_k^{n_k}.\] \[p(\mu\mid\alpha)= \frac{\Gamma(\alpha_0)}{\prod_{k=1}^K\Gamma(\alpha_k)} \prod_{k=1}^K \mu_k^{\alpha_k-1}, \qquad \alpha_0=\sum_{k=1}^K\alpha_k.\] \[\mu\mid X=(n_1, \ldots,n_K) \sim \operatorname{Dirichlet}(\alpha_1+n_1, \ldots, \alpha_K+n_K).\]

Multivariate Gaussian formulas

\[p(x\mid\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^d|\Sigma|}} \exp\left\{-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}.\] \[X_1\mid X_2=x_2\sim \mathcal{N}\left(\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2), \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).\] \[AX+b\sim \mathcal{N}(A\mu+b,A\Sigma A^T).\]

Practice Problem 39 (Multinomial probability). A four-sided die has probabilities $(0.1,0.2,0.3,0.4)$. It is tossed $8$ times. Find the probability that the counts are $(1,2,2,3)$.

Solution

Use the multinomial pmf: \[\mathbb{P}(X=(1,2,2,3))= \frac{8!}{1!2!2!3!}(0.1)^1(0.2)^2(0.3)^2(0.4)^3.\] The coefficient is \[\frac{8!}{1!2!2!3!}=1680.\] Thus \[\mathbb{P}(X=(1,2,2,3))=1680(0.1)(0.04)(0.09)(0.064) \approx 0.0387.\]

Practice Problem 40 (Dirichlet posterior mean). Suppose $\mu\sim \operatorname{Dirichlet}(3,1,2)$ and the observed counts are $(2,4,1)$. Find the posterior distribution and posterior mean.

Solution

The posterior is \[\mu\mid X\sim \operatorname{Dirichlet}(3+2,1+4,2+1)=\operatorname{Dirichlet}(5,5,3).\] The posterior total is $13$, so \[\mathbb{E}[\mu\mid X]=\left(\frac{5}{13},\frac{5}{13},\frac{3}{13}\right).\]

Practice Problem 41 (Conditional normal calculation). Let \[\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}2\\1\end{pmatrix}, \begin{pmatrix}9&3\\3&4\end{pmatrix} \right).\] Find $X_1\mid X_2=5$.

Solution

We have \[\mu_1=2, \quad \mu_2=1, \quad \Sigma_{11}=9, \quad \Sigma_{12}=3, \quad \Sigma_{22}=4, \quad \Sigma_{21}=3.\] The conditional mean is \[\mu_{1\mid 2}=2+3\cdot 4^{-1}(5-1)=2+3=5.\] The conditional variance is \[\Sigma_{11\mid 2}=9-3\cdot 4^{-1}\cdot 3=9-\frac94=\frac{27}{4}.\] Therefore \[X_1\mid X_2=5\sim \operatorname{Normal}\left(5,\frac{27}{4}\right).\]

Practice Problem 42 (Affine transformation). Let $X\sim \mathcal{N}_2\left(\begin{pmatrix}1\\2\end{pmatrix},\begin{pmatrix}1&0.5\\0.5&2\end{pmatrix}\right)$ and let $Y=2X_1-X_2+3$. Find the distribution of $Y$.

Solution

Write $Y=a^TX+3$, where \[a=\begin{pmatrix}2\\-1\end{pmatrix}.\] The mean is \[\mathbb{E}[Y]=a^T\mu+3=2(1)-1(2)+3=3.\] The variance is \[\operatorname{Var}(Y)=a^T\Sigma a.\] Compute \[\Sigma a= \begin{pmatrix}1&0.5\\0.5&2\end{pmatrix} \begin{pmatrix}2\\-1\end{pmatrix} =\begin{pmatrix}1.5\\-1\end{pmatrix}.\] Thus \[a^T\Sigma a=(2,-1)\begin{pmatrix}1.5\\-1\end{pmatrix}=3+1=4.\] Therefore \[Y\sim \operatorname{Normal}(3,4).\]


# Chapter 6 Extra: Multinomial and Multinormal Distributions {.unnumbered} This extra chapter extends Section 6 in two directions. First, it develops the multinomial distribution and the Dirichlet prior for categorical data. Second, it introduces the multivariate Gaussian distribution and the linear-algebra formulas behind marginalization, conditioning, affine transformations, and linear Gaussian models. **Topics.** Multinomial distribution; Dirichlet distribution; Dirichlet-multinomial Bayesian model; multivariate Gaussian distribution; Mahalanobis distance; spectral geometry; marginal and conditional Gaussian distributions; affine transformations; products and convolutions of Gaussians; linear Gaussian models. ## Multinomial Distribution This section extends the categorical and binomial distributions to experiments with more than two possible outcomes. ### From categorical trials to multinomial counts A categorical random variable records one outcome from several categories, while a multinomial random vector records the counts of each outcome after many independent categorical trials. ::: definition **Definition 1** (Categorical trial). A categorical trial has $m$ possible outcomes $O_1,\ldots,O_m$ with probabilities $$\phi_1,\ldots,\phi_m,\qquad \phi_i\ge 0,\qquad \sum_{i=1}^m \phi_i=1.$$ If one trial is performed, then the outcome has a categorical distribution. ::: ::: definition **Definition 2** (Multinomial distribution). Suppose we perform $n$ independent trials, each with outcomes $O_1,\ldots,O_m$ and constant category probabilities $(\phi_1,\ldots,\phi_m)$. 
Let $$X_i=\text{the number of times outcome }O_i\text{ appears in the }n\text{ trials}.$$ Then $$X=(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1, \ldots,\phi_m),$$ and for nonnegative integers $n_1,\ldots,n_m$ satisfying $n_1+\cdots+n_m=n$, $$\mathbb{P}(X_1=n_1,\ldots,X_m=n_m) = \frac{n!}{n_1!\cdots n_m!}\phi_1^{n_1}\cdots \phi_m^{n_m}.$$ ::: ::: {.callout-tip title="Course connection"} The multinomial distribution is the direct generalization of the categorical distribution. When $n=1$, it reduces to a categorical trial. When $m=2$, it reduces to the binomial distribution. ::: ::: example **Example 3** (Tossing an $m$-sided die). An $m$-sided die has probabilities $\phi_1,\ldots,\phi_m$ for sides $1,\ldots,m$. Toss the die $n$ times and let $X_i$ be the number of times side $i$ appears. Find the joint pmf of $(X_1,\ldots,X_m)$. ::: ::: {.callout-note title="Solution"} The vector of counts follows $$(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1,\ldots,\phi_m).$$ Therefore, for $n_1+\cdots+n_m=n$, $$\mathbb{P}(X_1=n_1,\ldots,X_m=n_m) = \frac{n!}{n_1!\cdots n_m!}\prod_{i=1}^m\phi_i^{n_i}.$$ The coefficient $n!/(n_1!\cdots n_m!)$ counts how many orderings of the $n$ trials produce the same vector of counts. ::: ::: example **Example 4** (Three-category experiment). Suppose a website visit leads to one of three outcomes: purchase, sign-up only, or no action. The probabilities are $(0.1,0.2,0.7)$. Among $n=10$ visitors, find the probability of exactly $2$ purchases, $3$ sign-ups only, and $5$ no-action visits. ::: ::: {.callout-note title="Solution"} Let $X=(X_1,X_2,X_3)\sim \operatorname{Multinomial}(10,0.1,0.2,0.7)$. 
Then
$$\mathbb{P}(X_1=2,X_2=3,X_3=5) = \frac{10!}{2!3!5!}(0.1)^2(0.2)^3(0.7)^5.$$
Numerically,
$$\frac{10!}{2!3!5!}(0.1)^2(0.2)^3(0.7)^5 =2520(0.01)(0.008)(0.16807)\approx 0.0339.$$
:::

### Mean, variance, and covariance

The components of a multinomial vector are individually binomial, but they are dependent because the total count is fixed at $n$.

::: theorem
**Theorem 5** (Moments of the multinomial distribution). *If $X=(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1,\ldots,\phi_m)$, then
$$\mathbb{E}[X_i]=n\phi_i, \qquad \operatorname{Var}(X_i)=n\phi_i(1-\phi_i),$$
and for $i\ne j$,
$$\operatorname{Cov}(X_i,X_j)=-n\phi_i\phi_j.$$
Equivalently,
$$\operatorname{Cov}(X)=n\left(\operatorname{diag}(\phi)-\phi\phi^T\right),$$
where $\phi=(\phi_1,\ldots,\phi_m)^T$.*
:::

::: {.callout-note title="Proof"}
Write $X_i=\sum_{r=1}^n I_{ri}$, where $I_{ri}=1$ if trial $r$ produces outcome $O_i$, and $0$ otherwise. Then $I_{ri}\sim \operatorname{Bernoulli}(\phi_i)$, so
$$\mathbb{E}[X_i]=\sum_{r=1}^n \mathbb{E}[I_{ri}]=n\phi_i, \qquad \operatorname{Var}(X_i)=\sum_{r=1}^n \operatorname{Var}(I_{ri})=n\phi_i(1-\phi_i).$$
For $i\ne j$, in one trial $I_{ri}I_{rj}=0$, so
$$\operatorname{Cov}(I_{ri},I_{rj})=\mathbb{E}[I_{ri}I_{rj}]-\mathbb{E}[I_{ri}]\mathbb{E}[I_{rj}]=0-\phi_i\phi_j=-\phi_i\phi_j.$$
Different trials are independent, so
$$\operatorname{Cov}(X_i,X_j)=\sum_{r=1}^n \operatorname{Cov}(I_{ri},I_{rj})=-n\phi_i\phi_j.$$
:::

::: remark
*Remark 6*. The negative covariance is not a paradox. If the total number of trials is fixed, seeing more observations in one category leaves fewer observations available for the other categories.
:::

::: exercise
**Practice Problem 7** (Checking the binomial special case). Show that if $m=2$, then the multinomial distribution reduces to the binomial distribution.
:::

::: {.callout-note title="Solution"}
Let $X=(X_1,X_2)\sim \operatorname{Multinomial}(n,p,1-p)$. Because $X_1+X_2=n$, the entire vector is determined by $X_1$.
For $X_1=k$, we have $X_2=n-k$. Thus
$$\mathbb{P}(X_1=k)=\mathbb{P}(X_1=k,X_2=n-k) =\frac{n!}{k!(n-k)!}p^k(1-p)^{n-k},$$
which is exactly the $\operatorname{Binomial}(n,p)$ pmf.
:::

## Dirichlet Distribution

This section introduces the Dirichlet distribution, the continuous distribution on probability vectors that generalizes the Beta distribution.

### The simplex and the Dirichlet density

The Dirichlet distribution is used when the unknown quantity is itself a vector of probabilities.

::: definition
**Definition 8** (Probability simplex). The $(K-1)$-dimensional probability simplex is
$$\Delta^{K-1}=\left\{\mu=(\mu_1,\ldots,\mu_K):\; \mu_k\ge 0,\; \sum_{k=1}^K \mu_k=1\right\}.$$
A point in the simplex represents a probability vector over $K$ categories.
:::

::: definition
**Definition 9** (Dirichlet distribution). Let $\alpha=(\alpha_1,\ldots,\alpha_K)$ with $\alpha_k>0$ and let
$$\alpha_0=\sum_{k=1}^K \alpha_k.$$
A random probability vector $\mu=(\mu_1,\ldots,\mu_K)$ has a Dirichlet distribution with parameter $\alpha$, written
$$\mu\sim \operatorname{Dirichlet}(\alpha_1,\ldots,\alpha_K),$$
if its density on the simplex is
$$p(\mu\mid\alpha)= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots \Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k-1},$$
where $\Gamma(\cdot)$ is the gamma function.
:::

::: {.callout-tip title="Interpretation of the parameters"}
The parameter $\alpha_k$ acts like a prior pseudo-count for category $k$. Large values of $\alpha_0$ mean the distribution is more concentrated around its mean; small values can place more mass near the corners of the simplex.
:::

::: theorem
**Theorem 10** (Dirichlet moments).
*If $\mu\sim \operatorname{Dirichlet}(\alpha_1,\ldots,\alpha_K)$ and $\alpha_0=\sum_k\alpha_k$, then
$$\mathbb{E}[\mu_k]=\frac{\alpha_k}{\alpha_0},$$
$$\operatorname{Var}(\mu_k)=\frac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}, \qquad \operatorname{Cov}(\mu_i,\mu_j)= -\frac{\alpha_i\alpha_j}{\alpha_0^2(\alpha_0+1)}\quad (i\ne j).$$*
:::

::: example
**Example 11** (The Beta distribution as a special case). Show that when $K=2$, the Dirichlet distribution becomes the Beta distribution.
:::

::: {.callout-note title="Solution"}
When $K=2$, the simplex condition says $\mu_1+\mu_2=1$, so $\mu_2=1-\mu_1$. The Dirichlet density is proportional to
$$\mu_1^{\alpha_1-1}\mu_2^{\alpha_2-1} =\mu_1^{\alpha_1-1}(1-\mu_1)^{\alpha_2-1}.$$
The normalized density is
$$\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} \mu_1^{\alpha_1-1}(1-\mu_1)^{\alpha_2-1},$$
which is the $\operatorname{Beta}(\alpha_1,\alpha_2)$ density. Therefore,
$$\operatorname{Dirichlet}(\alpha_1,\alpha_2)=\operatorname{Beta}(\alpha_1,\alpha_2).$$
:::

### Dirichlet-multinomial Bayesian model

The Dirichlet distribution is the natural conjugate prior for categorical and multinomial probability vectors.

::: definition
**Definition 12** (Dirichlet-multinomial model). A Bayesian model for categorical or multinomial data is
$$\mu\sim \operatorname{Dirichlet}(\alpha_1, \ldots,\alpha_K), \qquad X\mid\mu\sim \operatorname{Multinomial}(n,\mu_1,\ldots,\mu_K).$$
Here the likelihood is multinomial, the prior is Dirichlet, and the posterior is again Dirichlet.
:::

::: theorem
**Theorem 13** (Dirichlet-multinomial conjugacy).
*Suppose
$$\mu\sim \operatorname{Dirichlet}(\alpha_1, \ldots,\alpha_K), \qquad X=(X_1,\ldots,X_K)\mid\mu\sim \operatorname{Multinomial}(n,\mu_1,\ldots,\mu_K).$$
If the observed counts are $X_1=n_1,\ldots,X_K=n_K$, then
$$\mu\mid X=(n_1,\ldots,n_K) \sim \operatorname{Dirichlet}(\alpha_1+n_1, \ldots, \alpha_K+n_K).$$*
:::

::: {.callout-note title="Proof"}
The prior density is
$$p(\mu)\propto \prod_{k=1}^K \mu_k^{\alpha_k-1}.$$
The multinomial likelihood, as a function of $\mu$, is
$$p(x\mid\mu)\propto \prod_{k=1}^K \mu_k^{n_k}.$$
Therefore, by Bayes' theorem,
$$p(\mu\mid x)\propto p(x\mid\mu)p(\mu) \propto \prod_{k=1}^K \mu_k^{n_k}\prod_{k=1}^K\mu_k^{\alpha_k-1} = \prod_{k=1}^K \mu_k^{\alpha_k+n_k-1}.$$
This is the kernel of a Dirichlet distribution with parameters $\alpha_k+n_k$.
:::

::: example
**Example 14** (Updating category probabilities). Suppose there are three categories and the prior is
$$\mu\sim \operatorname{Dirichlet}(2,2,2).$$
We observe $n=10$ trials with counts $(5,3,2)$. Find the posterior distribution and posterior mean.
:::

::: {.callout-note title="Solution"}
By conjugacy,
$$\mu\mid X=(5,3,2)\sim \operatorname{Dirichlet}(2+5,2+3,2+2)=\operatorname{Dirichlet}(7,5,4).$$
The posterior total is $7+5+4=16$. Therefore,
$$\mathbb{E}[\mu_1\mid X]=\frac{7}{16}, \qquad \mathbb{E}[\mu_2\mid X]=\frac{5}{16}, \qquad \mathbb{E}[\mu_3\mid X]=\frac{4}{16}=\frac14.$$
Each posterior mean has the form $(\alpha_k+n_k)/(\alpha_0+n)$, which is a weighted average of the prior mean $\alpha_k/\alpha_0=1/3$ and the observed frequency $n_k/n$, with weights $\alpha_0/(\alpha_0+n)=6/16$ and $n/(\alpha_0+n)=10/16$.
:::

::: exercise
**Practice Problem 15** (Uniform prior on the simplex). Let $\mu\sim \operatorname{Dirichlet}(1,1,1)$ and suppose counts $(n_1,n_2,n_3)=(4,1,5)$ are observed. Find the posterior distribution.
:::

::: {.callout-note title="Solution"}
The posterior is
$$\mu\mid X\sim \operatorname{Dirichlet}(1+4,1+1,1+5)=\operatorname{Dirichlet}(5,2,6).$$
Because $\operatorname{Dirichlet}(1,1,1)$ is uniform over the three-category simplex, the posterior parameters are simply one plus the observed counts.
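
The conjugate update in Theorem 13 amounts to adding counts to pseudo-counts, so it can be checked with a few lines of code. The following Python sketch is only an illustration: the helper names `dirichlet_posterior` and `dirichlet_mean` are hypothetical and not part of any library, and the exact-fraction mean assumes integer pseudo-counts.

```python
from fractions import Fraction


def dirichlet_posterior(alpha, counts):
    """Return the posterior Dirichlet parameters alpha_k + n_k."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("Dirichlet parameters must be positive")
    if any(n < 0 for n in counts):
        raise ValueError("counts must be nonnegative")
    return [a + n for a, n in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Return the mean vector alpha_k / alpha_0 as exact fractions.

    Assumes integer parameters so that Fraction can represent them exactly.
    """
    alpha_0 = sum(alpha)
    return [Fraction(a, alpha_0) for a in alpha]


# Practice Problem 15: uniform prior Dirichlet(1, 1, 1), counts (4, 1, 5).
post_15 = dirichlet_posterior([1, 1, 1], [4, 1, 5])
print(post_15)  # [5, 2, 6]

# Example 14: prior Dirichlet(2, 2, 2), counts (5, 3, 2).
post_14 = dirichlet_posterior([2, 2, 2], [5, 3, 2])
print(post_14)                  # [7, 5, 4]
print(dirichlet_mean(post_14))  # [Fraction(7, 16), Fraction(5, 16), Fraction(1, 4)]
```

Using `Fraction` keeps the posterior means exact, so they can be compared directly with the hand calculations above.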
:::

## Single Variable Gaussian Distribution

This section reviews the ordinary one-dimensional Gaussian distribution before moving to the multivariate Gaussian distribution.

### Density, mean, variance, and probability

The one-dimensional Gaussian distribution is the prototype for the multivariate Gaussian distribution.

::: definition
**Definition 16** (Univariate Gaussian). A random variable $X$ has a Gaussian or normal distribution with mean $\mu$ and variance $\sigma^2$, written
$$X\sim \operatorname{Normal}(\mu,\sigma^2),$$
if its density is
$$f_X(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}, \qquad -\infty<x<\infty.$$
Then
$$\mathbb{E}[X]=\mu, \qquad \operatorname{Var}(X)=\sigma^2.$$
For any $a<b$,
$$\mathbb{P}(a<X<b)=\int_a^b f_X(x)\,dx.$$
:::

::: example
**Example 17** (Standardizing a normal random variable). Let $X\sim \operatorname{Normal}(\mu,\sigma^2)$. Show that
$$Z=\frac{X-\mu}{\sigma}$$
has the standard normal distribution.
:::

::: {.callout-note title="Solution"}
The transformation is $Z=(X-\mu)/\sigma$, so $X=\mu+\sigma Z$. The change-of-variables formula gives
$$f_Z(z)=f_X(\mu+\sigma z)\cdot \sigma.$$
Substituting the density of $X$,
$$f_Z(z)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{z^2}{2}\right)\sigma =\frac{1}{\sqrt{2\pi}}e^{-z^2/2}.$$
Thus $Z\sim \operatorname{Normal}(0,1)$.
:::

## Multivariate Gaussian Distribution

This section introduces the Gaussian distribution for random vectors and explains how the covariance matrix controls its geometry.

### Definition and covariance matrix

A multivariate Gaussian distribution is completely determined by its mean vector and covariance matrix.

::: definition
**Definition 18** (Multivariate Gaussian). Let $X\in\mathbb{R}^d$ be a random vector.
We say
$$X\sim \mathcal{N}_d(\mu,\Sigma),$$
where $\mu\in\mathbb{R}^d$ and $\Sigma$ is a $d\times d$ symmetric positive definite matrix, if
$$p(x\mid\mu,\Sigma) = \frac{1}{\sqrt{(2\pi)^d|\Sigma|}} \exp\left\{-\frac12 (x-\mu)^T\Sigma^{-1}(x-\mu)\right\}.$$
The covariance matrix is
$$\Sigma=\operatorname{Cov}(X,X)=\mathbb{E}\left[(X-\mu)(X-\mu)^T\right].$$
:::

::: {.callout-tip title="Normalization constant"}
The normalization constant is
$$Z=\sqrt{\det(2\pi\Sigma)}=\sqrt{(2\pi)^d|\Sigma|}.$$
This is the multivariate analogue of $\sqrt{2\pi}\sigma$ in the one-dimensional normal density.
:::

### Mahalanobis distance and geometry

The shape of a multivariate Gaussian is controlled by a quadratic form that generalizes the one-dimensional $z$-score.

::: definition
**Definition 19** (Mahalanobis distance). For $X\sim \mathcal{N}_d(\mu,\Sigma)$, define
$$\Delta^2=(x-\mu)^T\Sigma^{-1}(x-\mu).$$
The value $\Delta$ is called the Mahalanobis distance from $x$ to $\mu$ with respect to $\Sigma$.
:::

::: remark
*Remark 20*. In one dimension,
$$\Delta^2=\frac{(x-\mu)^2}{\sigma^2}, \qquad \Delta=\left|\frac{x-\mu}{\sigma}\right|.$$
Thus Mahalanobis distance generalizes the absolute value of the usual $z$-score. If $\Sigma=I$, then
$$\Delta^2=(x-\mu)^T(x-\mu)=\|x-\mu\|^2,$$
so Mahalanobis distance reduces to Euclidean distance.
:::

::: example
**Example 21** (Two-dimensional covariance matrices). Compare the covariance matrices
$$\Sigma_1=\begin{pmatrix}1&0\\0&1\end{pmatrix}, \qquad \Sigma_2=\begin{pmatrix}1&0.5\\0.5&1\end{pmatrix}, \qquad \Sigma_3=\begin{pmatrix}1&0.8\\0.8&1\end{pmatrix}.$$
Describe the qualitative effect on the Gaussian density.
:::

::: {.callout-note title="Solution"}
For $\Sigma_1$, the coordinates are uncorrelated with equal variance, so the contours are circles centered at $\mu$.

For $\Sigma_2$, the positive off-diagonal covariance creates positively tilted elliptical contours. Larger $x_1$ values tend to be associated with larger $x_2$ values.

For $\Sigma_3$, the positive dependence is stronger, so the ellipse is more stretched along the positive diagonal direction. The density is more concentrated around a narrow tilted ridge than in the case $\Sigma_2$.
:::

### Spectral decomposition and precision matrix

The eigenvalues and eigenvectors of the covariance matrix reveal the principal axes of a Gaussian distribution.

::: theorem
**Theorem 22** (Spectral form of the quadratic term). *Let $\Sigma$ be symmetric positive definite. Then
$$\Sigma=UDU^T=\sum_{i=1}^d \lambda_i u_i u_i^T,$$
where $U=[u_1\ \cdots\ u_d]$ is orthogonal and $D=\operatorname{diag}(\lambda_1,\ldots,\lambda_d)$ with $\lambda_i>0$. The precision matrix is
$$\Sigma^{-1}=UD^{-1}U^T=\sum_{i=1}^d\frac{1}{\lambda_i}u_i u_i^T.$$
If
$$y_i=u_i^T(x-\mu),$$
then
$$\Delta^2=(x-\mu)^T\Sigma^{-1}(x-\mu)=\sum_{i=1}^d \frac{y_i^2}{\lambda_i}.$$*
:::

::: {.callout-note title="Proof"}
Since $\Sigma=UDU^T$ and $U^TU=I$, the inverse is
$$\Sigma^{-1}=UD^{-1}U^T.$$
Thus
$$(x-\mu)^T\Sigma^{-1}(x-\mu) =(x-\mu)^TUD^{-1}U^T(x-\mu).$$
Let $y=U^T(x-\mu)$. Then the quadratic form becomes
$$y^TD^{-1}y=\sum_{i=1}^d\frac{y_i^2}{\lambda_i}.$$
:::

::: {.callout-warning title="Terminology"}
The matrix $\Sigma^{-1}$ is called the *precision matrix*. The covariance matrix describes spread; the precision matrix describes inverse spread and appears naturally in the exponent of the Gaussian density.
:::

### Why Gaussian distributions are central

Gaussian distributions appear frequently because of the central limit theorem and because Gaussianity is preserved by many important operations.

::: {.callout-tip title="Once Gaussian, always Gaussian"}
Gaussian random vectors are completely determined by mean vector $\mu$ and covariance matrix $\Sigma$. Linear transformations, marginalization, conditioning, and sums of independent Gaussian vectors all produce Gaussian distributions again.
:::

::: remark
*Remark 23* (Central limit theorem motivation).
If $X_1,\ldots,X_m$ are independent and identically distributed with finite mean and variance, then the arithmetic mean
$$\bar X=\frac{1}{m}(X_1+\cdots+X_m)$$
is approximately normal for large $m$, after appropriate centering and scaling. This explains why Gaussian distributions occur often in real data.
:::

## Marginal and Conditional Gaussian Distributions

This section gives the most important computational formulas for partitioned multivariate Gaussian distributions.

### Partitioned Gaussian vectors

Partitioning a Gaussian vector allows us to study subsets of variables and conditional distributions.

Suppose
$$X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right).$$
Here $X_1$ and $X_2$ may themselves be vectors, and the blocks of $\Sigma$ have compatible dimensions.

### Marginalization

The marginal distribution of a subvector of a Gaussian vector is Gaussian with the corresponding mean block and covariance block.

::: theorem
**Theorem 24** (Gaussian marginal distribution). *If
$$X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right),$$
then
$$X_1\sim \mathcal{N}(\mu_1,\Sigma_{11}), \qquad X_2\sim \mathcal{N}(\mu_2,\Sigma_{22}).$$*
:::

::: example
**Example 25** (Marginal of a bivariate normal). Let
$$\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}1\\2\end{pmatrix}, \begin{pmatrix}4&1\\1&9\end{pmatrix} \right).$$
Find the marginal distributions of $X_1$ and $X_2$.
:::

::: {.callout-note title="Solution"}
By the marginalization theorem,
$$X_1\sim \operatorname{Normal}(1,4), \qquad X_2\sim \operatorname{Normal}(2,9).$$
The off-diagonal covariance $1$ affects the dependence between $X_1$ and $X_2$, but it does not change the marginal variances.
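
As a rough numerical illustration of the marginalization theorem, the Python sketch below reads off the marginal blocks for this example and then compares them with sample moments. The function names `gaussian_marginal` and `sample_bivariate` are hypothetical helpers written for this illustration; the sampler assumes a $2\times 2$ positive definite covariance matrix and uses its Cholesky factor, a standard construction that is not developed in this chapter.

```python
import math
import random
import statistics


def gaussian_marginal(mu, sigma, keep):
    """Return the mean block and covariance block for the indices in `keep`."""
    mu_m = [mu[i] for i in keep]
    sigma_m = [[sigma[i][j] for j in keep] for i in keep]
    return mu_m, sigma_m


def sample_bivariate(mu, sigma, n, seed=0):
    """Draw n samples from a 2-D Gaussian using its Cholesky factor."""
    rng = random.Random(seed)
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 ** 2)
    xs1, xs2 = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs1.append(mu[0] + l11 * z1)
        xs2.append(mu[1] + l21 * z1 + l22 * z2)
    return xs1, xs2


mu = [1.0, 2.0]
sigma = [[4.0, 1.0],
         [1.0, 9.0]]

# Exact marginals: read off the matching mean and covariance blocks.
print(gaussian_marginal(mu, sigma, [0]))  # ([1.0], [[4.0]])
print(gaussian_marginal(mu, sigma, [1]))  # ([2.0], [[9.0]])

# Monte Carlo check: sample means and variances should be close to the blocks.
xs1, xs2 = sample_bivariate(mu, sigma, n=100_000)
print(statistics.fmean(xs1), statistics.variance(xs1))
print(statistics.fmean(xs2), statistics.variance(xs2))
```

The sample moments vary slightly with the seed, so they are printed without expected values; they should land near $(1,4)$ and $(2,9)$.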
:::

### Conditional Gaussian distributions

Conditioning on one part of a Gaussian vector gives another Gaussian distribution.

::: theorem
**Theorem 26** (Conditional Gaussian distribution). *Suppose
$$X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right),$$
where $\Sigma_{22}$ is invertible. Then
$$X_1\mid X_2=x_2\sim \mathcal{N}(\mu_{1\mid 2},\Sigma_{11\mid 2}),$$
where
$$\mu_{1\mid 2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2),$$
and
$$\Sigma_{11\mid 2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$*
:::

::: {.callout-note title="Proof"}
*Proof idea by completing the square.* Let the precision matrix be
$$\Lambda=\Sigma^{-1}=\begin{pmatrix}\Lambda_{11}&\Lambda_{12}\\\Lambda_{21}&\Lambda_{22}\end{pmatrix}.$$
The exponent of the joint Gaussian density is
$$-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu).$$
If $x_2$ is fixed, then as a function of $x_1$ this expression is a quadratic form in $x_1$. Therefore the conditional distribution must be Gaussian. Completing the square gives
$$\Sigma_{11\mid 2}=\Lambda_{11}^{-1}, \qquad \mu_{1\mid 2}=\mu_1-\Lambda_{11}^{-1}\Lambda_{12}(x_2-\mu_2).$$
Using the block inverse formula and Schur complement identities,
$$\Lambda_{11}^{-1}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21},$$
and
$$-\Lambda_{11}^{-1}\Lambda_{12}=\Sigma_{12}\Sigma_{22}^{-1}.$$
This gives the stated formula.
:::

::: {.callout-tip title="Interpretation"}
The conditional mean is a linear function of the observed value $x_2$. The conditional covariance does not depend on the observed value $x_2$; it only depends on the covariance blocks.
:::

::: example
**Example 27** (Conditional distribution in two dimensions). Let
$$\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1&\rho\\\rho&1\end{pmatrix} \right), \qquad -1<\rho<1.$$
Find $X_1\mid X_2=x_2$.
:::

::: {.callout-note title="Solution"}
Here
$$\mu_1=0, \quad \mu_2=0, \quad \Sigma_{11}=1, \quad \Sigma_{22}=1, \quad \Sigma_{12}=\rho.$$
Therefore,
$$\mu_{1\mid 2}=0+\rho(1)^{-1}(x_2-0)=\rho x_2,$$
and
$$\Sigma_{11\mid 2}=1-\rho(1)^{-1}\rho=1-\rho^2.$$
Thus
$$X_1\mid X_2=x_2\sim \operatorname{Normal}(\rho x_2,1-\rho^2).$$
If $\rho=0$, the conditional distribution is the same as the marginal distribution, reflecting independence.
:::

### Linear algebra review: Schur complements and block inverses

The Schur complement is the matrix identity behind the conditional covariance formula.

::: definition
**Definition 28** (Schur complement). Let
$$M=\begin{pmatrix}A&B\\C&D\end{pmatrix},$$
where $D$ is invertible. The Schur complement of $D$ in $M$ is
$$M/D=A-BD^{-1}C.$$
:::

::: theorem
**Theorem 29** (Block inverse formula). *If $D$ and $M/D$ are invertible, then
$$M^{-1}=\begin{pmatrix} (M/D)^{-1} & -(M/D)^{-1}BD^{-1}\\ -D^{-1}C(M/D)^{-1} & D^{-1}+D^{-1}C(M/D)^{-1}BD^{-1} \end{pmatrix}.$$*
:::

::: {.callout-note title="Proof"}
Block Gaussian elimination gives
$$\begin{pmatrix}A&B\\C&D\end{pmatrix} = \begin{pmatrix}I&BD^{-1}\\0&I\end{pmatrix} \begin{pmatrix}A-BD^{-1}C&0\\0&D\end{pmatrix} \begin{pmatrix}I&0\\D^{-1}C&I\end{pmatrix}.$$
Taking inverses of the three factors and multiplying them gives the stated block inverse formula.
:::

::: exercise
**Practice Problem 30** (Schur complement calculation). Let
$$M=\begin{pmatrix}4&2\\2&3\end{pmatrix}.$$
Compute the Schur complement of $D=3$.
:::

::: {.callout-note title="Solution"}
Here $A=4$, $B=2$, $C=2$, and $D=3$. Therefore
$$M/D=A-BD^{-1}C=4-2\cdot\frac13\cdot 2=4-\frac43=\frac83.$$
:::

## Operations on Gaussian Random Variables

This section summarizes several operations that preserve Gaussian structure.

### Affine transformations

A linear transformation plus a shift maps a Gaussian vector to another Gaussian vector.

::: theorem
**Theorem 31** (Affine transformation of a Gaussian).
*If
$$X\sim \mathcal{N}(\mu,\Sigma), \qquad Y=AX+b,$$
then
$$Y\sim \mathcal{N}(A\mu+b,A\Sigma A^T).$$*
:::

::: {.callout-note title="Proof"}
The mean is
$$\mathbb{E}[Y]=\mathbb{E}[AX+b]=A\mathbb{E}[X]+b=A\mu+b.$$
The covariance is
$$\operatorname{Var}(Y)=\operatorname{Var}(AX+b)=A\operatorname{Var}(X)A^T=A\Sigma A^T.$$
An affine transformation of a Gaussian vector is again Gaussian, and a Gaussian distribution is determined by its mean and covariance, so these two quantities identify the distribution of $Y$.
:::

::: example
**Example 32** (Linear combination of a Gaussian vector). Let $X\sim \mathcal{N}(\mu,\Sigma)$ and let $a$ be a fixed vector. Find the distribution of $Y=a^TX$.
:::

::: {.callout-note title="Solution"}
This is the affine transformation with $A=a^T$ and $b=0$. Hence
$$Y=a^TX\sim \operatorname{Normal}(a^T\mu,a^T\Sigma a).$$
:::

### Products and convolutions of Gaussian densities

Products and convolutions of Gaussian densities are central in Bayesian updating, filtering, and linear models.

::: theorem
**Theorem 33** (Pointwise product of two Gaussian densities). *Consider two Gaussian densities over the same variable $x$:
$$p_1(x)=\mathcal{N}(x\mid\mu_1,\Sigma_1), \qquad p_2(x)=\mathcal{N}(x\mid\mu_2,\Sigma_2).$$
Their pointwise product is proportional to another Gaussian density:
$$p_1(x)p_2(x)\propto \mathcal{N}(x\mid\mu,\Sigma),$$
where
$$\Sigma^{-1}=\Sigma_1^{-1}+\Sigma_2^{-1}, \qquad \mu=\Sigma\left(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2\right).$$*
:::

::: {.callout-note title="Proof"}
The product has exponent
$$-\frac12(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) -\frac12(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2).$$
Collecting quadratic and linear terms in $x$ gives
$$-\frac12 x^T(\Sigma_1^{-1}+\Sigma_2^{-1})x +x^T(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2)+\text{constant}.$$
Completing the square gives the stated covariance and mean.
:::

::: theorem
**Theorem 34** (Convolution and sums).
*If $X\sim \mathcal{N}(\mu_1,\Sigma_1)$ and $Y\sim \mathcal{N}(\mu_2,\Sigma_2)$ are independent, then
$$Z=X+Y\sim \mathcal{N}(\mu_1+\mu_2,\Sigma_1+\Sigma_2).$$
Equivalently, the convolution
$$p_Z(z)=\int p_X(x)p_Y(z-x)\,dx$$
is Gaussian.*
:::

::: {.callout-note title="Proof"}
Since $X$ and $Y$ are independent,
$$\mathbb{E}[Z]=\mathbb{E}[X]+\mathbb{E}[Y]=\mu_1+\mu_2,$$
and
$$\operatorname{Var}(Z)=\operatorname{Var}(X)+\operatorname{Var}(Y)=\Sigma_1+\Sigma_2.$$
The sum of independent Gaussian random vectors is Gaussian, so the distribution is determined by these two quantities.
:::

::: example
**Example 35** (Bayesian normal update as product of Gaussians). Suppose a prior density for a scalar parameter is $\theta\sim \operatorname{Normal}(\mu_0,\sigma_0^2)$, and a Gaussian likelihood kernel is proportional to $\operatorname{Normal}(\theta\mid y,\sigma^2)$. Find the posterior variance and mean.
:::

::: {.callout-note title="Solution"}
Using the product formula in one dimension,
$$\frac{1}{\sigma_{post}^2}=\frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}.$$
Thus
$$\sigma_{post}^2=\left(\frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}\right)^{-1}.$$
The posterior mean is
$$\mu_{post}=\sigma_{post}^2\left(\frac{\mu_0}{\sigma_0^2}+\frac{y}{\sigma^2}\right).$$
This is a precision-weighted average of the prior mean and the data value.
:::

## Linear Gaussian Models

This section combines marginal and conditional Gaussian formulas in a common model used in Bayesian statistics and machine learning.

### Marginal and conditional distributions in a linear Gaussian model

A linear Gaussian model assumes a Gaussian prior on an input vector and a Gaussian conditional distribution for an output vector.

::: theorem
**Theorem 36** (Marginal and conditional Gaussians in a linear model). *Suppose
$$p(x)=\mathcal{N}(x\mid\mu,\Lambda^{-1}), \qquad p(y\mid x)=\mathcal{N}(y\mid Ax+b,L^{-1}),$$
where $\Lambda$ and $L$ are precision matrices.
Then
$$p(y)=\mathcal{N}\left(y\mid A\mu+b, L^{-1}+A\Lambda^{-1}A^T\right).$$
Moreover,
$$p(x\mid y)=\mathcal{N}(x\mid m,S),$$
where
$$S=(\Lambda+A^TLA)^{-1},$$
and
$$m=S\left(A^TL(y-b)+\Lambda\mu\right).$$*
:::

::: {.callout-note title="Proof"}
Write the model as
$$y=Ax+b+\varepsilon, \qquad \varepsilon\sim \mathcal{N}(0,L^{-1}), \qquad x\sim \mathcal{N}(\mu,\Lambda^{-1}),$$
with $x$ independent of $\varepsilon$. Therefore
$$\mathbb{E}[y]=A\mathbb{E}[x]+b=A\mu+b,$$
and
$$\operatorname{Var}(y)=A\operatorname{Var}(x)A^T+\operatorname{Var}(\varepsilon)=A\Lambda^{-1}A^T+L^{-1}.$$
This gives the marginal distribution of $y$.

For the conditional distribution, multiply the prior and likelihood as functions of $x$:
$$p(x\mid y)\propto \exp\left\{-\frac12(x-\mu)^T\Lambda(x-\mu) -\frac12(y-Ax-b)^TL(y-Ax-b)\right\}.$$
Collecting quadratic terms in $x$ gives precision
$$S^{-1}=\Lambda+A^TLA.$$
Collecting linear terms gives
$$S^{-1}m=\Lambda\mu+A^TL(y-b),$$
so
$$m=S\left(A^TL(y-b)+\Lambda\mu\right).$$
:::

::: example
**Example 37** (Scalar linear Gaussian model). Let
$$X\sim \operatorname{Normal}(\mu,\tau^2), \qquad Y\mid X=x\sim \operatorname{Normal}(ax+b,\sigma^2).$$
Find the marginal distribution of $Y$.
:::

::: {.callout-note title="Solution"}
We can write
$$Y=aX+b+\varepsilon, \qquad \varepsilon\sim \operatorname{Normal}(0,\sigma^2),$$
with $\varepsilon$ independent of $X$. Hence
$$\mathbb{E}[Y]=a\mu+b,$$
and
$$\operatorname{Var}(Y)=a^2\tau^2+\sigma^2.$$
Therefore
$$Y\sim \operatorname{Normal}(a\mu+b,a^2\tau^2+\sigma^2).$$
:::

::: exercise
**Practice Problem 38** (Posterior in the scalar linear Gaussian model). For the same model,
$$X\sim \operatorname{Normal}(\mu,\tau^2), \qquad Y\mid X=x\sim \operatorname{Normal}(ax+b,\sigma^2),$$
find the conditional distribution of $X\mid Y=y$.
:::

::: {.callout-note title="Solution"}
Here the prior precision is $1/\tau^2$ and the observation precision is $1/\sigma^2$.
The posterior precision is
$$\frac{1}{s^2}=\frac{1}{\tau^2}+\frac{a^2}{\sigma^2}.$$
Thus
$$s^2=\left(\frac{1}{\tau^2}+\frac{a^2}{\sigma^2}\right)^{-1}.$$
The posterior mean is
$$m=s^2\left(\frac{\mu}{\tau^2}+\frac{a(y-b)}{\sigma^2}\right).$$
Therefore
$$X\mid Y=y\sim \operatorname{Normal}(m,s^2).$$
:::

## Summary and Practice

This final section summarizes the main formulas and provides practice problems that reinforce the chapter.

### Core formulas

The main ideas of this chapter are that multinomial counts live on a constrained simplex of counts, Dirichlet distributions live on a simplex of probabilities, and Gaussian random vectors are stable under many operations.

::: {.callout-tip title="Multinomial and Dirichlet formulas"}
$$\mathbb{P}(X_1=n_1,\ldots,X_K=n_K) =\frac{n!}{n_1!\cdots n_K!}\prod_{k=1}^K\phi_k^{n_k}.$$

$$p(\mu\mid\alpha)= \frac{\Gamma(\alpha_0)}{\prod_{k=1}^K\Gamma(\alpha_k)} \prod_{k=1}^K \mu_k^{\alpha_k-1}, \qquad \alpha_0=\sum_{k=1}^K\alpha_k.$$

$$\mu\mid X=(n_1, \ldots,n_K) \sim \operatorname{Dirichlet}(\alpha_1+n_1, \ldots, \alpha_K+n_K).$$
:::

::: {.callout-tip title="Multivariate Gaussian formulas"}
$$p(x\mid\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^d|\Sigma|}} \exp\left\{-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}.$$

$$X_1\mid X_2=x_2\sim \mathcal{N}\left(\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2), \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$

$$AX+b\sim \mathcal{N}(A\mu+b,A\Sigma A^T).$$
:::

::: exercise
**Practice Problem 39** (Multinomial probability). A four-sided die has probabilities $(0.1,0.2,0.3,0.4)$. It is tossed $8$ times. Find the probability that the counts are $(1,2,2,3)$.
:::

::: {.callout-note title="Solution"}
Use the multinomial pmf:
$$\mathbb{P}(X=(1,2,2,3))= \frac{8!}{1!2!2!3!}(0.1)^1(0.2)^2(0.3)^2(0.4)^3.$$
The coefficient is
$$\frac{8!}{1!2!2!3!}=1680.$$
Thus
$$\mathbb{P}(X=(1,2,2,3))=1680(0.1)(0.04)(0.09)(0.064) \approx 0.0387.$$
:::

::: exercise
**Practice Problem 40** (Dirichlet posterior mean).
Suppose $\mu\sim \operatorname{Dirichlet}(3,1,2)$ and the observed counts are $(2,4,1)$. Find the posterior distribution and posterior mean.
:::

::: {.callout-note title="Solution"}
The posterior is
$$\mu\mid X\sim \operatorname{Dirichlet}(3+2,1+4,2+1)=\operatorname{Dirichlet}(5,5,3).$$
The posterior total is $13$, so
$$\mathbb{E}[\mu\mid X]=\left(\frac{5}{13},\frac{5}{13},\frac{3}{13}\right).$$
:::

::: exercise
**Practice Problem 41** (Conditional normal calculation). Let
$$\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}2\\1\end{pmatrix}, \begin{pmatrix}9&3\\3&4\end{pmatrix} \right).$$
Find $X_1\mid X_2=5$.
:::

::: {.callout-note title="Solution"}
We have
$$\mu_1=2, \quad \mu_2=1, \quad \Sigma_{11}=9, \quad \Sigma_{12}=3, \quad \Sigma_{22}=4, \quad \Sigma_{21}=3.$$
The conditional mean is
$$\mu_{1\mid 2}=2+3\cdot 4^{-1}(5-1)=2+3=5.$$
The conditional variance is
$$\Sigma_{11\mid 2}=9-3\cdot 4^{-1}\cdot 3=9-\frac94=\frac{27}{4}.$$
Therefore
$$X_1\mid X_2=5\sim \operatorname{Normal}\left(5,\frac{27}{4}\right).$$
:::

::: exercise
**Practice Problem 42** (Affine transformation). Let $X\sim \mathcal{N}_2\left(\begin{pmatrix}1\\2\end{pmatrix},\begin{pmatrix}1&0.5\\0.5&2\end{pmatrix}\right)$ and let $Y=2X_1-X_2+3$. Find the distribution of $Y$.
:::

::: {.callout-note title="Solution"}
Write $Y=a^TX+3$, where
$$a=\begin{pmatrix}2\\-1\end{pmatrix}.$$
The mean is
$$\mathbb{E}[Y]=a^T\mu+3=2(1)-1(2)+3=3.$$
The variance is
$$\operatorname{Var}(Y)=a^T\Sigma a.$$
Compute
$$\Sigma a= \begin{pmatrix}1&0.5\\0.5&2\end{pmatrix} \begin{pmatrix}2\\-1\end{pmatrix} =\begin{pmatrix}1.5\\-1\end{pmatrix}.$$
Thus
$$a^T\Sigma a=(2,-1)\begin{pmatrix}1.5\\-1\end{pmatrix}=3+1=4.$$
Therefore
$$Y\sim \operatorname{Normal}(3,4).$$
:::
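
The conditional Gaussian formula (Theorem 26) and the affine transformation formula (Theorem 31) used in Practice Problems 41 and 42 can be checked numerically. The Python sketch below is a minimal illustration under two assumptions: the conditional helper handles only the bivariate case with scalar blocks, and the affine helper handles only a scalar output $Y=a^TX+b$. The names `conditional_bivariate` and `affine_scalar` are hypothetical and chosen for this example.

```python
def conditional_bivariate(mu, sigma, x2):
    """Mean and variance of X1 | X2 = x2 for a bivariate Gaussian.

    Scalar version of mu_1 + S12 S22^{-1} (x2 - mu_2)
    and S11 - S12 S22^{-1} S21.
    """
    s11, s12 = sigma[0]
    s21, s22 = sigma[1]
    if s22 <= 0:
        raise ValueError("Sigma_22 must be positive")
    mean = mu[0] + s12 / s22 * (x2 - mu[1])
    var = s11 - s12 / s22 * s21
    return mean, var


def affine_scalar(mu, sigma, a, b):
    """Mean a^T mu + b and variance a^T Sigma a of Y = a^T X + b."""
    d = len(mu)
    mean = sum(a[i] * mu[i] for i in range(d)) + b
    var = sum(a[i] * sigma[i][j] * a[j] for i in range(d) for j in range(d))
    return mean, var


# Practice Problem 41: X1 | X2 = 5.
print(conditional_bivariate([2.0, 1.0], [[9.0, 3.0], [3.0, 4.0]], 5.0))
# (5.0, 6.75)

# Practice Problem 42: Y = 2 X1 - X2 + 3.
print(affine_scalar([1.0, 2.0], [[1.0, 0.5], [0.5, 2.0]], [2.0, -1.0], 3.0))
# (3.0, 4.0)
```

The value $6.75$ matches the conditional variance $27/4$ from Practice Problem 41. Calling `conditional_bivariate([0.0, 0.0], [[1.0, rho], [rho, 1.0]], x2)` for a chosen `rho` reproduces the result $(\rho x_2,\,1-\rho^2)$ of Example 27.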