Chapter 6 Extra: Multinomial and Multinormal Distributions
This extra chapter extends Section 6 in two directions. First, it develops the multinomial distribution and the Dirichlet prior for categorical data. Second, it introduces the multivariate Gaussian distribution and the linear-algebra formulas behind marginalization, conditioning, affine transformations, and linear Gaussian models.
Topics. Multinomial distribution; Dirichlet distribution; Dirichlet-multinomial Bayesian model; multivariate Gaussian distribution; Mahalanobis distance; spectral geometry; marginal and conditional Gaussian distributions; affine transformations; products and convolutions of Gaussians; linear Gaussian models.
Multinomial Distribution
This section extends the categorical and binomial distributions to experiments with more than two possible outcomes.
From categorical trials to multinomial counts
A categorical random variable records one outcome from several categories, while a multinomial random vector records the counts of each outcome after many independent categorical trials.
Definition 1 (Categorical trial). A categorical trial has \(m\) possible outcomes \(O_1,\ldots,O_m\) with probabilities \[\phi_1,\ldots,\phi_m,\qquad \phi_i\ge 0,\qquad \sum_{i=1}^m \phi_i=1.\] If one trial is performed, then the outcome has a categorical distribution.
Definition 2 (Multinomial distribution). Suppose we perform \(n\) independent trials, each with outcomes \(O_1,\ldots,O_m\) and constant category probabilities \((\phi_1,\ldots,\phi_m)\). Let \[X_i=\text{the number of times outcome }O_i\text{ appears in the }n\text{ trials}.\] Then \[X=(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1, \ldots,\phi_m),\] and for nonnegative integers \(n_1,\ldots,n_m\) satisfying \(n_1+\cdots+n_m=n\), \[\mathbb{P}(X_1=n_1,\ldots,X_m=n_m) = \frac{n!}{n_1!\cdots n_m!}\phi_1^{n_1}\cdots \phi_m^{n_m}.\]
The multinomial distribution generalizes both the categorical and the binomial distributions. When \(n=1\), it reduces to a single categorical trial. When \(m=2\), it reduces to the binomial distribution.
Example 3 (Tossing an \(m\)-sided die). An \(m\)-sided die has probabilities \(\phi_1,\ldots,\phi_m\) for sides \(1,\ldots,m\). Toss the die \(n\) times and let \(X_i\) be the number of times side \(i\) appears. Find the joint pmf of \((X_1,\ldots,X_m)\).
The vector of counts follows \[(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1,\ldots,\phi_m).\] Therefore, for \(n_1+\cdots+n_m=n\), \[\mathbb{P}(X_1=n_1,\ldots,X_m=n_m) = \frac{n!}{n_1!\cdots n_m!}\prod_{i=1}^m\phi_i^{n_i}.\] The coefficient \(n!/(n_1!\cdots n_m!)\) counts how many orderings of the \(n\) trials produce the same vector of counts.
Example 4 (Three-category experiment). Suppose a website visit leads to one of three outcomes: purchase, sign-up only, or no action. The probabilities are \((0.1,0.2,0.7)\). Among \(n=10\) visitors, find the probability of exactly \(2\) purchases, \(3\) sign-ups only, and \(5\) no-action visits.
Let \(X=(X_1,X_2,X_3)\sim \operatorname{Multinomial}(10,0.1,0.2,0.7)\). Then \[\mathbb{P}(X_1=2,X_2=3,X_3=5) = \frac{10!}{2!3!5!}(0.1)^2(0.2)^3(0.7)^5.\] Numerically, \[\frac{10!}{2!3!5!}(0.1)^2(0.2)^3(0.7)^5 =2520(0.01)(0.008)(0.16807)\approx 0.0339.\]
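The arithmetic in Example 4 can be checked numerically. The following is a minimal Python sketch using only the standard library; the function name `multinomial_pmf` is an illustrative choice, not notation from the text.

```python
import math

def multinomial_pmf(counts, probs):
    """Multinomial pmf: n!/(n_1! ... n_m!) * prod(phi_i ** n_i)."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    n = sum(counts)
    # Multinomial coefficient, computed with exact integer arithmetic.
    coeff = math.factorial(n)
    for n_i in counts:
        coeff //= math.factorial(n_i)
    # Multiply by each category probability raised to its observed count.
    prob = float(coeff)
    for n_i, phi_i in zip(counts, probs):
        prob *= phi_i ** n_i
    return prob

# Example 4: 2 purchases, 3 sign-ups only, 5 no-action visits out of 10.
p = multinomial_pmf([2, 3, 5], [0.1, 0.2, 0.7])
print(round(p, 4))  # 0.0339
```

The coefficient is kept as an integer until the final multiplication, which avoids rounding error in the factorials for moderate \(n\).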
Mean, variance, and covariance
Each component \(X_i\) of a multinomial vector is marginally \(\operatorname{Binomial}(n,\phi_i)\), but the components are dependent because they must sum to the fixed total \(n\).
Theorem 5 (Moments of the multinomial distribution). If \(X=(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1,\ldots,\phi_m)\), then \[\mathbb{E}[X_i]=n\phi_i, \qquad \operatorname{Var}(X_i)=n\phi_i(1-\phi_i),\] and for \(i\ne j\), \[\operatorname{Cov}(X_i,X_j)=-n\phi_i\phi_j.\] Equivalently, \[\operatorname{Cov}(X)=n\left(\operatorname{diag}(\phi)-\phi\phi^T\right),\] where \(\phi=(\phi_1,\ldots,\phi_m)^T\).
Write \(X_i=\sum_{r=1}^n I_{ri}\), where \(I_{ri}=1\) if trial \(r\) produces outcome \(O_i\), and \(0\) otherwise. Then \(I_{ri}\sim \operatorname{Bernoulli}(\phi_i)\), and for fixed \(i\) the indicators \(I_{1i},\ldots,I_{ni}\) are independent across trials, so \[\mathbb{E}[X_i]=\sum_{r=1}^n \mathbb{E}[I_{ri}]=n\phi_i, \qquad \operatorname{Var}(X_i)=\sum_{r=1}^n \operatorname{Var}(I_{ri})=n\phi_i(1-\phi_i).\] For \(i\ne j\), a single trial cannot produce both outcomes, so \(I_{ri}I_{rj}=0\) and \[\operatorname{Cov}(I_{ri},I_{rj})=\mathbb{E}[I_{ri}I_{rj}]-\mathbb{E}[I_{ri}]\mathbb{E}[I_{rj}]=0-\phi_i\phi_j=-\phi_i\phi_j.\] By bilinearity of covariance, \[\operatorname{Cov}(X_i,X_j)=\sum_{r=1}^n\sum_{s=1}^n \operatorname{Cov}(I_{ri},I_{sj}).\] Different trials are independent, so every term with \(r\ne s\) vanishes, leaving \[\operatorname{Cov}(X_i,X_j)=\sum_{r=1}^n \operatorname{Cov}(I_{ri},I_{rj})=-n\phi_i\phi_j.\]
Remark 6. The negative covariance is not a paradox. If the total number of trials is fixed, seeing more observations in one category leaves fewer observations available for the other categories.
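Theorem 5 can be verified exactly for a small case by summing over every possible count vector. The sketch below is one way to do this in plain Python; the helper names and the choice \(n=4\), \(\phi=(0.2,0.3,0.5)\) are illustrative assumptions, not values from the text.

```python
import math
from itertools import product

def multinomial_pmf(counts, probs):
    """Multinomial pmf for one vector of counts."""
    coeff = math.factorial(sum(counts))
    for n_i in counts:
        coeff //= math.factorial(n_i)
    return coeff * math.prod(p ** k for p, k in zip(probs, counts))

def exact_moments(n, probs):
    """Return (means, covariance matrix) by brute-force enumeration."""
    m = len(probs)
    # All nonnegative integer vectors of length m that sum to n.
    support = [c for c in product(range(n + 1), repeat=m) if sum(c) == n]
    weights = [multinomial_pmf(c, probs) for c in support]
    means = [sum(w * c[i] for w, c in zip(weights, support)) for i in range(m)]
    cov = [[sum(w * (c[i] - means[i]) * (c[j] - means[j])
                for w, c in zip(weights, support))
            for j in range(m)] for i in range(m)]
    return means, cov

n, phi = 4, [0.2, 0.3, 0.5]
means, cov = exact_moments(n, phi)
print(means)       # close to [n*phi_i]
print(cov[0][1])   # close to -n*phi_0*phi_1
```

Enumeration is only practical for small \(n\) and \(m\), but it checks the formulas without any sampling noise. Each row of the covariance matrix sums to zero, which reflects the constraint \(X_1+\cdots+X_m=n\).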
Practice Problem 7 (Checking the binomial special case). Show that if \(m=2\), then the multinomial distribution reduces to the binomial distribution.
Let \(X=(X_1,X_2)\sim \operatorname{Multinomial}(n,p,1-p)\). Because \(X_1+X_2=n\), the entire vector is determined by \(X_1\). For \(X_1=k\), we have \(X_2=n-k\). Thus \[\mathbb{P}(X_1=k)=\mathbb{P}(X_1=k,X_2=n-k) =\frac{n!}{k!(n-k)!}p^k(1-p)^{n-k},\] which is exactly the \(\operatorname{Binomial}(n,p)\) pmf.
Dirichlet Distribution
This section introduces the Dirichlet distribution, the continuous distribution on probability vectors that generalizes the Beta distribution.
The simplex and the Dirichlet density
The Dirichlet distribution is used when the unknown quantity is itself a vector of probabilities.
Definition 8 (Probability simplex). The \((K-1)\)-dimensional probability simplex is \[\Delta^{K-1}=\left\{\mu=(\mu_1,\ldots,\mu_K):\; \mu_k\ge 0,\; \sum_{k=1}^K \mu_k=1\right\}.\] A point in the simplex represents a probability vector over \(K\) categories.
Definition 9 (Dirichlet distribution). Let \(\alpha=(\alpha_1,\ldots,\alpha_K)\) with \(\alpha_k>0\) and let \[\alpha_0=\sum_{k=1}^K \alpha_k.\] A random probability vector \(\mu=(\mu_1,\ldots,\mu_K)\) has a Dirichlet distribution with parameter \(\alpha\), written \[\mu\sim \operatorname{Dirichlet}(\alpha_1,\ldots,\alpha_K),\] if its density on the simplex is \[p(\mu\mid\alpha)= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots \Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k-1},\] where \(\Gamma(\cdot)\) is the gamma function.
The parameter \(\alpha_k\) acts like a prior pseudo-count for category \(k\). Larger values of \(\alpha_0\) concentrate the distribution more tightly around its mean \(\alpha/\alpha_0\); when every \(\alpha_k<1\), the density instead piles up near the corners and edges of the simplex.
Theorem 10 (Dirichlet moments). If \(\mu\sim \operatorname{Dirichlet}(\alpha_1,\ldots,\alpha_K)\) and \(\alpha_0=\sum_k\alpha_k\), then \[\mathbb{E}[\mu_k]=\frac{\alpha_k}{\alpha_0},\] \[\operatorname{Var}(\mu_k)=\frac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}, \qquad \operatorname{Cov}(\mu_i,\mu_j)= -\frac{\alpha_i\alpha_j}{\alpha_0^2(\alpha_0+1)}\quad (i\ne j).\]
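The formulas of Theorem 10 are easy to evaluate exactly with rational arithmetic. The following sketch assumes integer (or otherwise exactly representable) parameters; the function name `dirichlet_moments` and the test parameter \((2,2,2)\) are illustrative choices.

```python
from fractions import Fraction

def dirichlet_moments(alpha):
    """Mean vector and covariance matrix of Dirichlet(alpha), per Theorem 10."""
    a = [Fraction(x) for x in alpha]
    a0 = sum(a)
    K = len(a)
    mean = [a_k / a0 for a_k in a]
    denom = a0 ** 2 * (a0 + 1)
    # Diagonal entries are variances, off-diagonal entries are covariances.
    cov = [[(a[i] * (a0 - a[i]) if i == j else -a[i] * a[j]) / denom
            for j in range(K)] for i in range(K)]
    return mean, cov

mean, cov = dirichlet_moments([2, 2, 2])
print(mean[0], cov[0][0], cov[0][1])  # 1/3 2/63 -1/63
```

Because the components of \(\mu\) sum to one, each row of the covariance matrix sums to zero, which gives a quick consistency check on the variance and covariance formulas.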
Example 11 (The Beta distribution as a special case). Show that when \(K=2\), the Dirichlet distribution becomes the Beta distribution.
When \(K=2\), the simplex condition says \(\mu_1+\mu_2=1\), so \(\mu_2=1-\mu_1\). The Dirichlet density is proportional to \[\mu_1^{\alpha_1-1}\mu_2^{\alpha_2-1} =\mu_1^{\alpha_1-1}(1-\mu_1)^{\alpha_2-1}.\] The normalized density is \[\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} \mu_1^{\alpha_1-1}(1-\mu_1)^{\alpha_2-1},\] which is the \(\operatorname{Beta}(\alpha_1,\alpha_2)\) density. Therefore, \[\operatorname{Dirichlet}(\alpha_1,\alpha_2)=\operatorname{Beta}(\alpha_1,\alpha_2).\]
Dirichlet-multinomial Bayesian model
The Dirichlet distribution is the natural conjugate prior for categorical and multinomial probability vectors.
Definition 12 (Dirichlet-multinomial model). A Bayesian model for categorical or multinomial data is \[\mu\sim \operatorname{Dirichlet}(\alpha_1, \ldots,\alpha_K), \qquad X\mid\mu\sim \operatorname{Multinomial}(n,\mu_1,\ldots,\mu_K).\] Here the likelihood is multinomial, the prior is Dirichlet, and the posterior is again Dirichlet.
Theorem 13 (Dirichlet-multinomial conjugacy). Suppose \[\mu\sim \operatorname{Dirichlet}(\alpha_1, \ldots,\alpha_K), \qquad X=(X_1,\ldots,X_K)\mid\mu\sim \operatorname{Multinomial}(n,\mu_1,\ldots,\mu_K).\] If the observed counts are \(X_1=n_1,\ldots,X_K=n_K\), then \[\mu\mid X=(n_1,\ldots,n_K) \sim \operatorname{Dirichlet}(\alpha_1+n_1, \ldots, \alpha_K+n_K).\]
The prior density is \[p(\mu)\propto \prod_{k=1}^K \mu_k^{\alpha_k-1}.\] The multinomial likelihood, as a function of \(\mu\), is \[p(x\mid\mu)\propto \prod_{k=1}^K \mu_k^{n_k}.\] Therefore, by Bayes’ theorem, \[p(\mu\mid x)\propto p(x\mid\mu)p(\mu) \propto \prod_{k=1}^K \mu_k^{n_k}\prod_{k=1}^K\mu_k^{\alpha_k-1} = \prod_{k=1}^K \mu_k^{\alpha_k+n_k-1}.\] This is the kernel of a Dirichlet distribution with parameters \(\alpha_k+n_k\).
Example 14 (Updating category probabilities). Suppose there are three categories and the prior is \[\mu\sim \operatorname{Dirichlet}(2,2,2).\] We observe \(n=10\) trials with counts \((5,3,2)\). Find the posterior distribution and posterior mean.
By conjugacy, \[\mu\mid X=(5,3,2)\sim \operatorname{Dirichlet}(2+5,2+3,2+2)=\operatorname{Dirichlet}(7,5,4).\] The posterior total is \(7+5+4=16\). Therefore, \[\mathbb{E}[\mu_1\mid X]=\frac{7}{16}, \qquad \mathbb{E}[\mu_2\mid X]=\frac{5}{16}, \qquad \mathbb{E}[\mu_3\mid X]=\frac{4}{16}.\] The posterior mean is a weighted compromise between the prior pseudo-counts and the observed counts.
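To make the last sentence precise, write \(n=\sum_k n_k\). A short calculation in the notation of Theorem 13 (offered here as a sketch) splits the posterior mean into a prior part and a data part: \[\mathbb{E}[\mu_k\mid X]=\frac{\alpha_k+n_k}{\alpha_0+n} =\frac{\alpha_0}{\alpha_0+n}\cdot\frac{\alpha_k}{\alpha_0} +\frac{n}{\alpha_0+n}\cdot\frac{n_k}{n}.\] The first factor pair is the prior mean weighted by the prior pseudo-count total, and the second is the observed frequency weighted by the sample size. In Example 14, \(\alpha_0=6\) and \(n=10\), so for the first category \[\mathbb{E}[\mu_1\mid X]=\frac{6}{16}\cdot\frac{2}{6}+\frac{10}{16}\cdot\frac{5}{10}=\frac{2}{16}+\frac{5}{16}=\frac{7}{16}.\]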
Practice Problem 15 (Uniform prior on the simplex). Let \(\mu\sim \operatorname{Dirichlet}(1,1,1)\) and suppose counts \((n_1,n_2,n_3)=(4,1,5)\) are observed. Find the posterior distribution.
The posterior is \[\mu\mid X\sim \operatorname{Dirichlet}(1+4,1+1,1+5)=\operatorname{Dirichlet}(5,2,6).\] Because \(\operatorname{Dirichlet}(1,1,1)\) is uniform over the three-category simplex, the posterior parameters are simply one plus the observed counts.
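The conjugate update is a one-line computation. The sketch below reproduces Example 14 and Practice Problem 15; it assumes integer pseudo-counts so that the posterior mean can be reported as exact fractions, and the function names are illustrative.

```python
from fractions import Fraction

def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters alpha_k + n_k (Theorem 13)."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of Dirichlet(alpha) as exact fractions (integer alpha assumed)."""
    total = sum(alpha)
    return [Fraction(a, total) for a in alpha]

# Example 14: prior Dirichlet(2, 2, 2), counts (5, 3, 2).
post14 = dirichlet_posterior([2, 2, 2], [5, 3, 2])
print(post14, dirichlet_mean(post14))

# Practice Problem 15: uniform prior Dirichlet(1, 1, 1), counts (4, 1, 5).
post15 = dirichlet_posterior([1, 1, 1], [4, 1, 5])
print(post15)
```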
Single Variable Gaussian Distribution
This section reviews the ordinary one-dimensional Gaussian distribution before moving to the multivariate Gaussian distribution.
Density, mean, variance, and probability
The one-dimensional Gaussian distribution is the prototype for the multivariate Gaussian distribution.
Definition 16 (Univariate Gaussian). A random variable \(X\) has a Gaussian or normal distribution with mean \(\mu\) and variance \(\sigma^2\), written \[X\sim \operatorname{Normal}(\mu,\sigma^2),\] if its density is \[f_X(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}, \qquad -\infty<x<\infty.\] Then \[\mathbb{E}[X]=\mu, \qquad \operatorname{Var}(X)=\sigma^2.\] For any \(a<b\), \[\mathbb{P}(a<X<b)=\int_a^b f_X(x)\,dx.\]
Example 17 (Standardizing a normal random variable). Let \(X\sim \operatorname{Normal}(\mu,\sigma^2)\). Show that \[Z=\frac{X-\mu}{\sigma}\] has the standard normal distribution.
The transformation is \(Z=(X-\mu)/\sigma\), so \(X=\mu+\sigma Z\). The change-of-variables formula gives \[f_Z(z)=f_X(\mu+\sigma z)\cdot \sigma.\] Substituting the density of \(X\), \[f_Z(z)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{z^2}{2}\right)\sigma =\frac{1}{\sqrt{2\pi}}e^{-z^2/2}.\] Thus \(Z\sim \operatorname{Normal}(0,1)\).
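Standardization is also how interval probabilities are computed in practice: \(\mathbb{P}(a<X<b)=\Phi\big((b-\mu)/\sigma\big)-\Phi\big((a-\mu)/\sigma\big)\), where \(\Phi\) is the standard normal cdf. The sketch below uses the identity \(\Phi(z)=\tfrac12\big(1+\operatorname{erf}(z/\sqrt2)\big)\) and Python's `math.erf`; the function names and the numbers \(\mu=10\), \(\sigma=2\) are illustrative assumptions.

```python
import math

def standard_normal_cdf(z):
    """Phi(z), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_interval_prob(a, b, mu, sigma):
    """P(a < X < b) for X ~ Normal(mu, sigma^2) with sigma > 0."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Standardize both endpoints, then take the difference of cdf values.
    return standard_normal_cdf((b - mu) / sigma) - standard_normal_cdf((a - mu) / sigma)

# Probability of falling within one standard deviation of the mean.
p = normal_interval_prob(8.0, 12.0, mu=10.0, sigma=2.0)
print(round(p, 4))  # 0.6827
```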
Multivariate Gaussian Distribution
This section introduces the Gaussian distribution for random vectors and explains how the covariance matrix controls its geometry.
Definition and covariance matrix
A multivariate Gaussian distribution is completely determined by its mean vector and covariance matrix.
Definition 18 (Multivariate Gaussian). Let \(X\in\mathbb{R}^d\) be a random vector. We say \[X\sim \mathcal{N}_d(\mu,\Sigma),\] where \(\mu\in\mathbb{R}^d\) and \(\Sigma\) is a \(d\times d\) symmetric positive definite matrix, if \[p(x\mid\mu,\Sigma) = \frac{1}{\sqrt{(2\pi)^d|\Sigma|}} \exp\left\{-\frac12 (x-\mu)^T\Sigma^{-1}(x-\mu)\right\}.\] The covariance matrix is \[\Sigma=\operatorname{Cov}(X,X)=\mathbb{E}\left[(X-\mu)(X-\mu)^T\right].\]
The normalization constant is \[Z=\sqrt{\det(2\pi\Sigma)}=\sqrt{(2\pi)^d|\Sigma|}.\] This is the multivariate analogue of \(\sqrt{2\pi}\sigma\) in the one-dimensional normal density.
Mahalanobis distance and geometry
The shape of a multivariate Gaussian is controlled by a quadratic form that generalizes the one-dimensional \(z\)-score.
Definition 19 (Mahalanobis distance). For \(X\sim \mathcal{N}_d(\mu,\Sigma)\), define \[\Delta^2=(x-\mu)^T\Sigma^{-1}(x-\mu).\] The value \(\Delta\) is called the Mahalanobis distance from \(x\) to \(\mu\) with respect to \(\Sigma\).
Remark 20. In one dimension, \[\Delta^2=\frac{(x-\mu)^2}{\sigma^2}, \qquad \Delta=\left|\frac{x-\mu}{\sigma}\right|.\] Thus Mahalanobis distance generalizes the absolute value of the usual \(z\)-score. If \(\Sigma=I\), then \[\Delta^2=(x-\mu)^T(x-\mu)=\|x-\mu\|^2,\] so Mahalanobis distance reduces to Euclidean distance.
Example 21 (Two-dimensional covariance matrices). Compare the covariance matrices \[\Sigma_1=\begin{pmatrix}1&0\\0&1\end{pmatrix}, \qquad \Sigma_2=\begin{pmatrix}1&0.5\\0.5&1\end{pmatrix}, \qquad \Sigma_3=\begin{pmatrix}1&0.8\\0.8&1\end{pmatrix}.\] Describe the qualitative effect on the Gaussian density.
For \(\Sigma_1\), the coordinates are uncorrelated with equal variance, so the contours are circles centered at \(\mu\).
For \(\Sigma_2\), the positive off-diagonal covariance creates positively tilted elliptical contours. Larger \(x_1\) values tend to be associated with larger \(x_2\) values.
For \(\Sigma_3\), the positive dependence is stronger, so the ellipse is more stretched along the positive diagonal direction. The density is more concentrated around a narrow tilted ridge than in the case \(\Sigma_2\).
Spectral decomposition and precision matrix
The eigenvalues and eigenvectors of the covariance matrix reveal the principal axes of a Gaussian distribution.
Theorem 22 (Spectral form of the quadratic term). Let \(\Sigma\) be symmetric positive definite. Then \[\Sigma=UDU^T=\sum_{i=1}^d \lambda_i u_i u_i^T,\] where \(U=[u_1\ \cdots\ u_d]\) is orthogonal and \(D=\operatorname{diag}(\lambda_1,\ldots,\lambda_d)\) with \(\lambda_i>0\). The precision matrix is \[\Sigma^{-1}=UD^{-1}U^T=\sum_{i=1}^d\frac{1}{\lambda_i}u_i u_i^T.\] If \[y_i=u_i^T(x-\mu),\] then \[\Delta^2=(x-\mu)^T\Sigma^{-1}(x-\mu)=\sum_{i=1}^d \frac{y_i^2}{\lambda_i}.\]
Since \(\Sigma=UDU^T\) and \(U^TU=I\), the inverse is \[\Sigma^{-1}=UD^{-1}U^T.\] Thus \[(x-\mu)^T\Sigma^{-1}(x-\mu) =(x-\mu)^TUD^{-1}U^T(x-\mu).\] Let \(y=U^T(x-\mu)\). Then the quadratic form becomes \[y^TD^{-1}y=\sum_{i=1}^d\frac{y_i^2}{\lambda_i}.\]
The matrix \(\Sigma^{-1}\) is called the precision matrix. The covariance matrix describes spread; the precision matrix describes inverse spread and appears naturally in the exponent of the Gaussian density.
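Theorem 22 can be checked on the \(2\times 2\) matrices of Example 21. For \(\Sigma=\begin{pmatrix}1&\rho\\\rho&1\end{pmatrix}\) the eigenvalues are \(1+\rho\) and \(1-\rho\), with eigenvectors \((1,1)/\sqrt2\) and \((1,-1)/\sqrt2\). The sketch below computes \(\Delta^2\) once from the explicit \(2\times2\) inverse and once from the spectral sum; the function names and the test point are illustrative assumptions.

```python
import math

def mahalanobis_sq_direct(x, mu, sigma):
    """Delta^2 for a 2x2 covariance matrix, using the explicit inverse."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    d1, d2 = x[0] - mu[0], x[1] - mu[1]
    # Sigma^{-1} = (1/det) * [[d, -b], [-c, a]]
    return (d * d1 * d1 - (b + c) * d1 * d2 + a * d2 * d2) / det

def mahalanobis_sq_spectral(x, mu, rho):
    """Delta^2 for Sigma = [[1, rho], [rho, 1]] via sum of y_i^2 / lambda_i."""
    s = 1.0 / math.sqrt(2.0)
    eigenpairs = [(1.0 + rho, (s, s)), (1.0 - rho, (s, -s))]
    d1, d2 = x[0] - mu[0], x[1] - mu[1]
    total = 0.0
    for lam, (u1, u2) in eigenpairs:
        y = u1 * d1 + u2 * d2      # coordinate along the principal axis
        total += y * y / lam
    return total

rho = 0.8  # the matrix Sigma_3 from Example 21
x, mu = (1.0, 0.0), (0.0, 0.0)
direct = mahalanobis_sq_direct(x, mu, ((1.0, rho), (rho, 1.0)))
spectral = mahalanobis_sq_spectral(x, mu, rho)
print(direct, spectral)  # the two values agree up to rounding
```

With \(\rho=0.8\) the eigenvalues are \(1.8\) and \(0.2\), so the contour ellipses are \(\sqrt{1.8/0.2}=3\) times longer along the diagonal \((1,1)\) than across it; with \(\rho=0.5\) the ratio is \(\sqrt{1.5/0.5}=\sqrt3\).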
Why Gaussian distributions are central
Gaussian distributions appear frequently because of the central limit theorem and because Gaussianity is preserved by many important operations.
Gaussian random vectors are completely determined by mean vector \(\mu\) and covariance matrix \(\Sigma\). Linear transformations, marginalization, conditioning, and sums of independent Gaussian vectors all produce Gaussian distributions again.
Remark 23 (Central limit theorem motivation). If \(X_1,\ldots,X_m\) are independent and identically distributed with finite mean and variance, then the arithmetic mean \[\bar X=\frac{1}{m}(X_1+\cdots+X_m)\] is approximately normal for large \(m\), after appropriate centering and scaling. This helps explain why Gaussian distributions occur so often in real data: many measured quantities behave like sums or averages of many small independent effects.
Marginal and Conditional Gaussian Distributions
This section gives the most important computational formulas for partitioned multivariate Gaussian distributions.
Partitioned Gaussian vectors
Partitioning a Gaussian vector allows us to study subsets of variables and conditional distributions.
Suppose \[X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right).\] Here \(X_1\) and \(X_2\) may themselves be vectors, and the blocks of \(\Sigma\) have compatible dimensions.
Marginalization
The marginal distribution of a subvector of a Gaussian vector is Gaussian with the corresponding mean block and covariance block.
Theorem 24 (Gaussian marginal distribution). If \[X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right),\] then \[X_1\sim \mathcal{N}(\mu_1,\Sigma_{11}), \qquad X_2\sim \mathcal{N}(\mu_2,\Sigma_{22}).\]
Example 25 (Marginal of a bivariate normal). Let \[\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}1\\2\end{pmatrix}, \begin{pmatrix}4&1\\1&9\end{pmatrix} \right).\] Find the marginal distributions of \(X_1\) and \(X_2\).
By the marginalization theorem, \[X_1\sim \operatorname{Normal}(1,4), \qquad X_2\sim \operatorname{Normal}(2,9).\] The off-diagonal covariance \(1\) affects the dependence between \(X_1\) and \(X_2\), but it does not change the marginal variances.
Conditional Gaussian distributions
Conditioning on one part of a Gaussian vector gives another Gaussian distribution.
Theorem 26 (Conditional Gaussian distribution). Suppose \[X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right),\] where \(\Sigma_{22}\) is invertible. Then \[X_1\mid X_2=x_2\sim \mathcal{N}(\mu_{1\mid 2},\Sigma_{11\mid 2}),\] where \[\mu_{1\mid 2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2),\] and \[\Sigma_{11\mid 2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.\]
Proof idea by completing the square. Let the precision matrix be \[\Lambda=\Sigma^{-1}=\begin{pmatrix}\Lambda_{11}&\Lambda_{12}\\\Lambda_{21}&\Lambda_{22}\end{pmatrix}.\] The exponent of the joint Gaussian density is \[-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu).\] If \(x_2\) is fixed, then as a function of \(x_1\) this expression is a quadratic form in \(x_1\). Therefore the conditional distribution must be Gaussian. Completing the square gives \[\Sigma_{11\mid 2}=\Lambda_{11}^{-1}, \qquad \mu_{1\mid 2}=\mu_1-\Lambda_{11}^{-1}\Lambda_{12}(x_2-\mu_2).\] Using the block inverse formula and Schur complement identities, \[\Lambda_{11}^{-1}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21},\] and \[-\Lambda_{11}^{-1}\Lambda_{12}=\Sigma_{12}\Sigma_{22}^{-1}.\] This gives the stated formula.
The conditional mean is a linear function of the observed value \(x_2\). The conditional covariance does not depend on the observed value \(x_2\); it only depends on the covariance blocks.
Example 27 (Conditional distribution in two dimensions). Let \[\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1&\rho\\\rho&1\end{pmatrix} \right), \qquad -1<\rho<1.\] Find \(X_1\mid X_2=x_2\).
Here \[\mu_1=0, \quad \mu_2=0, \quad \Sigma_{11}=1, \quad \Sigma_{22}=1, \quad \Sigma_{12}=\rho.\] Therefore, \[\mu_{1\mid 2}=0+\rho(1)^{-1}(x_2-0)=\rho x_2,\] and \[\Sigma_{11\mid 2}=1-\rho(1)^{-1}\rho=1-\rho^2.\] Thus \[X_1\mid X_2=x_2\sim \operatorname{Normal}(\rho x_2,1-\rho^2).\] If \(\rho=0\), the conditional distribution is the same as the marginal distribution, reflecting independence.
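For scalar blocks, Theorem 26 reduces to two arithmetic expressions. The following sketch implements them for a bivariate normal; the function name `conditional_normal_2d` and the values \(\rho=0.5\), \(x_2=2\) are illustrative assumptions.

```python
def conditional_normal_2d(mu, sigma, x2):
    """Mean and variance of X1 | X2 = x2 for a bivariate normal.

    mu is (mu1, mu2); sigma is ((s11, s12), (s21, s22)) with s22 > 0.
    """
    mu1, mu2 = mu
    (s11, s12), (s21, s22) = sigma
    if s22 <= 0:
        raise ValueError("Sigma_22 must be positive")
    gain = s12 / s22                      # Sigma_12 Sigma_22^{-1}
    cond_mean = mu1 + gain * (x2 - mu2)   # linear in the observed x2
    cond_var = s11 - gain * s21           # Schur complement, free of x2
    return cond_mean, cond_var

rho = 0.5
mean, var = conditional_normal_2d((0.0, 0.0), ((1.0, rho), (rho, 1.0)), x2=2.0)
print(mean, var)  # 1.0 0.75
```

The output matches \(\rho x_2=1\) and \(1-\rho^2=0.75\) from Example 27. Observing \(X_2\) always shrinks the variance of \(X_1\) unless \(\rho=0\).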
Linear algebra review: Schur complements and block inverses
The Schur complement is the matrix identity behind the conditional covariance formula.
Definition 28 (Schur complement). Let \[M=\begin{pmatrix}A&B\\C&D\end{pmatrix},\] where \(D\) is invertible. The Schur complement of \(D\) in \(M\) is \[M/D=A-BD^{-1}C.\]
Theorem 29 (Block inverse formula). If \(D\) and \(M/D\) are invertible, then \[M^{-1}=\begin{pmatrix} (M/D)^{-1} & -(M/D)^{-1}BD^{-1}\\ -D^{-1}C(M/D)^{-1} & D^{-1}+D^{-1}C(M/D)^{-1}BD^{-1} \end{pmatrix}.\]
Block Gaussian elimination gives \[\begin{pmatrix}A&B\\C&D\end{pmatrix} = \begin{pmatrix}I&BD^{-1}\\0&I\end{pmatrix} \begin{pmatrix}A-BD^{-1}C&0\\0&D\end{pmatrix} \begin{pmatrix}I&0\\D^{-1}C&I\end{pmatrix}.\] Taking inverses of the three factors and multiplying them gives the stated block inverse formula.
Practice Problem 30 (Schur complement calculation). Let \[M=\begin{pmatrix}4&2\\2&3\end{pmatrix}.\] Compute the Schur complement of \(D=3\).
Here \(A=4\), \(B=2\), \(C=2\), and \(D=3\). Therefore \[M/D=A-BD^{-1}C=4-2\cdot\frac13\cdot 2=4-\frac43=\frac83.\]
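The same matrix gives a small check of the block inverse formula in Theorem 29. The sketch below treats each block as a scalar and uses exact rational arithmetic; the function name `block_inverse_scalar` is an illustrative choice.

```python
from fractions import Fraction

def block_inverse_scalar(A, B, C, D):
    """Schur complement M/D and inverse of [[A, B], [C, D]] via Theorem 29.

    All four blocks are scalars here; D and M/D must be nonzero.
    """
    A, B, C, D = (Fraction(v) for v in (A, B, C, D))
    schur = A - B / D * C            # M/D = A - B D^{-1} C
    s_inv = 1 / schur
    d_inv = 1 / D
    inverse = [[s_inv, -s_inv * B * d_inv],
               [-d_inv * C * s_inv, d_inv + d_inv * C * s_inv * B * d_inv]]
    return schur, inverse

schur, M_inv = block_inverse_scalar(4, 2, 2, 3)
print(schur)   # 8/3
print(M_inv)   # entries 3/8, -1/4, -1/4, 1/2
```

Multiplying \(M\) by the returned matrix gives the identity, and the top-left entry \(3/8\) is the reciprocal of the Schur complement \(8/3\), as the formula predicts.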
Operations on Gaussian Random Variables
This section summarizes several operations that preserve Gaussian structure.
Affine transformations
A linear transformation plus a shift maps a Gaussian vector to another Gaussian vector.
Theorem 31 (Affine transformation of a Gaussian). If \[X\sim \mathcal{N}(\mu,\Sigma), \qquad Y=AX+b,\] then \[Y\sim \mathcal{N}(A\mu+b,A\Sigma A^T).\]
The mean is \[\mathbb{E}[Y]=\mathbb{E}[AX+b]=A\mathbb{E}[X]+b=A\mu+b.\] The covariance is \[\operatorname{Var}(Y)=\operatorname{Var}(AX+b)=A\operatorname{Var}(X)A^T=A\Sigma A^T.\] Because any affine transformation of a Gaussian vector is Gaussian, these two quantities determine the distribution.
Example 32 (Linear combination of a Gaussian vector). Let \(X\sim \mathcal{N}(\mu,\Sigma)\) and let \(a\) be a fixed vector. Find the distribution of \(Y=a^TX\).
This is the affine transformation with \(A=a^T\) and \(b=0\). Hence \[Y=a^TX\sim \operatorname{Normal}(a^T\mu,a^T\Sigma a).\]
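The mean \(a^T\mu\) and variance \(a^T\Sigma a\) can be computed with two short sums. The sketch below also allows a scalar shift \(b\), with \(b=0\) corresponding to Example 32; the function name is illustrative, and the numbers are those of Practice Problem 42 in the final section.

```python
def linear_combination_moments(a, mu, sigma, b=0.0):
    """Mean and variance of Y = a^T X + b when X ~ N(mu, sigma)."""
    d = len(a)
    mean = sum(a[i] * mu[i] for i in range(d)) + b
    # First form the vector Sigma a, then the scalar a^T (Sigma a).
    sigma_a = [sum(sigma[i][j] * a[j] for j in range(d)) for i in range(d)]
    var = sum(a[i] * sigma_a[i] for i in range(d))
    return mean, var

mean, var = linear_combination_moments(
    a=[2.0, -1.0],
    mu=[1.0, 2.0],
    sigma=[[1.0, 0.5], [0.5, 2.0]],
    b=3.0,
)
print(mean, var)  # 3.0 4.0
```

The shift \(b\) changes only the mean; the variance depends on \(a\) and \(\Sigma\) alone.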
Products and convolutions of Gaussian densities
Products and convolutions of Gaussian densities are central in Bayesian updating, filtering, and linear models.
Theorem 33 (Pointwise product of two Gaussian densities). Consider two Gaussian densities over the same variable \(x\): \[p_1(x)=\mathcal{N}(x\mid\mu_1,\Sigma_1), \qquad p_2(x)=\mathcal{N}(x\mid\mu_2,\Sigma_2).\] Their pointwise product is proportional to another Gaussian density: \[p_1(x)p_2(x)\propto \mathcal{N}(x\mid\mu,\Sigma),\] where \[\Sigma^{-1}=\Sigma_1^{-1}+\Sigma_2^{-1}, \qquad \mu=\Sigma\left(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2\right).\]
The product has exponent \[-\frac12(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) -\frac12(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2).\] Collecting quadratic and linear terms in \(x\) gives \[-\frac12 x^T(\Sigma_1^{-1}+\Sigma_2^{-1})x +x^T(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2)+\text{constant}.\] Completing the square gives the stated covariance and mean.
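The completing-the-square step can be written out as follows (a sketch, using the symmetry of \(\Sigma^{-1}\)). Define \(\Sigma^{-1}=\Sigma_1^{-1}+\Sigma_2^{-1}\) and choose \(\mu\) so that \(\Sigma^{-1}\mu=\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2\). Then \[-\frac12 x^T\Sigma^{-1}x+x^T\Sigma^{-1}\mu =-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)+\frac12\mu^T\Sigma^{-1}\mu.\] The last term does not depend on \(x\), so it is absorbed into the proportionality constant, and the remaining term is the exponent of \(\mathcal{N}(x\mid\mu,\Sigma)\).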
Theorem 34 (Convolution and sums). If \(X\sim \mathcal{N}(\mu_1,\Sigma_1)\) and \(Y\sim \mathcal{N}(\mu_2,\Sigma_2)\) are independent, then \[Z=X+Y\sim \mathcal{N}(\mu_1+\mu_2,\Sigma_1+\Sigma_2).\] Equivalently, the convolution \[p_Z(z)=\int p_X(x)p_Y(z-x)\,dx\] is Gaussian.
By linearity of expectation, \[\mathbb{E}[Z]=\mathbb{E}[X]+\mathbb{E}[Y]=\mu_1+\mu_2,\] and, because \(X\) and \(Y\) are independent, the cross-covariance terms vanish, so \[\operatorname{Var}(Z)=\operatorname{Var}(X)+\operatorname{Var}(Y)=\Sigma_1+\Sigma_2.\] The sum of independent Gaussian random vectors is Gaussian, so the distribution is determined by these two quantities.
Example 35 (Bayesian normal update as product of Gaussians). Suppose the prior for a scalar parameter is \(\theta\sim \operatorname{Normal}(\mu_0,\sigma_0^2)\), and the likelihood, viewed as a function of \(\theta\), is proportional to \(\operatorname{Normal}(\theta\mid y,\sigma^2)\). Find the posterior variance and mean.
Using the product formula in one dimension, \[\frac{1}{\sigma_{post}^2}=\frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}.\] Thus \[\sigma_{post}^2=\left(\frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}\right)^{-1}.\] The posterior mean is \[\mu_{post}=\sigma_{post}^2\left(\frac{\mu_0}{\sigma_0^2}+\frac{y}{\sigma^2}\right).\] This is a precision-weighted average of the prior mean and the data value.
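The update above is straightforward to code. The sketch below implements the scalar case of Theorem 33; the function name and the numerical inputs are illustrative assumptions.

```python
def gaussian_product_1d(mu1, var1, mu2, var2):
    """Mean and variance of the Gaussian proportional to N(mu1,var1)*N(mu2,var2)."""
    if var1 <= 0 or var2 <= 0:
        raise ValueError("variances must be positive")
    precision = 1.0 / var1 + 1.0 / var2        # precisions add
    var = 1.0 / precision
    mean = var * (mu1 / var1 + mu2 / var2)     # precision-weighted average
    return mean, var

# Prior Normal(0, 4) combined with an observation y = 10 of variance 1.
post_mean, post_var = gaussian_product_1d(0.0, 4.0, 10.0, 1.0)
print(post_mean, post_var)
```

Here the data precision \(1\) is four times the prior precision \(1/4\), so the posterior mean lies four fifths of the way from the prior mean to the observation, and the posterior variance \(0.8\) is smaller than both input variances.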
Linear Gaussian Models
This section combines marginal and conditional Gaussian formulas in a common model used in Bayesian statistics and machine learning.
Marginal and conditional distributions in a linear Gaussian model
A linear Gaussian model assumes a Gaussian prior on an input vector and a Gaussian conditional distribution for an output vector.
Theorem 36 (Marginal and conditional Gaussians in a linear model). Suppose \[p(x)=\mathcal{N}(x\mid\mu,\Lambda^{-1}), \qquad p(y\mid x)=\mathcal{N}(y\mid Ax+b,L^{-1}),\] where \(\Lambda\) and \(L\) are precision matrices. Then \[p(y)=\mathcal{N}\left(y\mid A\mu+b, L^{-1}+A\Lambda^{-1}A^T\right).\] Moreover, \[p(x\mid y)=\mathcal{N}(x\mid m,S),\] where \[S=(\Lambda+A^TLA)^{-1},\] and \[m=S\left(A^TL(y-b)+\Lambda\mu\right).\]
Write the model as \[y=Ax+b+\varepsilon, \qquad \varepsilon\sim \mathcal{N}(0,L^{-1}), \qquad x\sim \mathcal{N}(\mu,\Lambda^{-1}),\] with \(x\) independent of \(\varepsilon\). Therefore \[\mathbb{E}[y]=A\mathbb{E}[x]+b=A\mu+b,\] and \[\operatorname{Var}(y)=A\operatorname{Var}(x)A^T+\operatorname{Var}(\varepsilon)=A\Lambda^{-1}A^T+L^{-1}.\] This gives the marginal distribution of \(y\).
For the conditional distribution, multiply the prior and likelihood as functions of \(x\): \[p(x\mid y)\propto \exp\left\{-\frac12(x-\mu)^T\Lambda(x-\mu) -\frac12(y-Ax-b)^TL(y-Ax-b)\right\}.\] Collecting quadratic terms in \(x\) gives precision \[S^{-1}=\Lambda+A^TLA.\] Collecting linear terms gives \[S^{-1}m=\Lambda\mu+A^TL(y-b),\] so \[m=S\left(A^TL(y-b)+\Lambda\mu\right).\]
Example 37 (Scalar linear Gaussian model). Let \[X\sim \operatorname{Normal}(\mu,\tau^2), \qquad Y\mid X=x\sim \operatorname{Normal}(ax+b,\sigma^2).\] Find the marginal distribution of \(Y\).
We can write \[Y=aX+b+\varepsilon, \qquad \varepsilon\sim \operatorname{Normal}(0,\sigma^2),\] with \(\varepsilon\) independent of \(X\). Hence \[\mathbb{E}[Y]=a\mu+b,\] and \[\operatorname{Var}(Y)=a^2\tau^2+\sigma^2.\] Therefore \[Y\sim \operatorname{Normal}(a\mu+b,a^2\tau^2+\sigma^2).\]
Practice Problem 38 (Posterior in the scalar linear Gaussian model). For the same model, \[X\sim \operatorname{Normal}(\mu,\tau^2), \qquad Y\mid X=x\sim \operatorname{Normal}(ax+b,\sigma^2),\] find the conditional distribution of \(X\mid Y=y\).
Here the prior precision is \(1/\tau^2\) and the observation precision is \(1/\sigma^2\). The posterior precision is \[\frac{1}{s^2}=\frac{1}{\tau^2}+\frac{a^2}{\sigma^2}.\] Thus \[s^2=\left(\frac{1}{\tau^2}+\frac{a^2}{\sigma^2}\right)^{-1}.\] The posterior mean is \[m=s^2\left(\frac{\mu}{\tau^2}+\frac{a(y-b)}{\sigma^2}\right).\] Therefore \[X\mid Y=y\sim \operatorname{Normal}(m,s^2).\]
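The posterior can be cross-checked against the conditioning formula of Theorem 26. From \(Y=aX+b+\varepsilon\) one gets \(\operatorname{Cov}(X,Y)=a\tau^2\) and \(\operatorname{Var}(Y)=a^2\tau^2+\sigma^2\), so \((X,Y)\) is bivariate normal and both routes should give the same answer. The sketch below compares them; the function names and the numerical values are illustrative assumptions.

```python
def posterior_precision_form(mu, tau2, a, b, sigma2, y):
    """X | Y = y via the precision formulas of Practice Problem 38."""
    s2 = 1.0 / (1.0 / tau2 + a * a / sigma2)
    m = s2 * (mu / tau2 + a * (y - b) / sigma2)
    return m, s2

def posterior_conditioning_form(mu, tau2, a, b, sigma2, y):
    """X | Y = y by conditioning the joint Gaussian of (X, Y)."""
    var_y = a * a * tau2 + sigma2      # marginal variance of Y (Example 37)
    cov_xy = a * tau2                  # Cov(X, aX + b + eps) = a * Var(X)
    m = mu + cov_xy / var_y * (y - (a * mu + b))
    s2 = tau2 - cov_xy ** 2 / var_y
    return m, s2

args = dict(mu=1.0, tau2=4.0, a=2.0, b=1.0, sigma2=1.0, y=5.0)
m1, s1 = posterior_precision_form(**args)
m2, s2 = posterior_conditioning_form(**args)
print(m1, s1)
print(m2, s2)  # agrees with the line above up to rounding
```

For these values both forms give mean \(33/17\) and variance \(4/17\). The precision form is usually more convenient when observations arrive one at a time, because precisions simply add.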
Summary and Practice
This final section summarizes the main formulas and provides practice problems that reinforce the chapter.
Core formulas
The main ideas of this chapter are that multinomial count vectors are constrained to sum to \(n\), Dirichlet distributions live on the simplex of probability vectors, and Gaussian random vectors remain Gaussian under marginalization, conditioning, affine transformations, and sums of independent vectors.
\[\mathbb{P}(X_1=n_1,\ldots,X_K=n_K) =\frac{n!}{n_1!\cdots n_K!}\prod_{k=1}^K\phi_k^{n_k}.\] \[p(\mu\mid\alpha)= \frac{\Gamma(\alpha_0)}{\prod_{k=1}^K\Gamma(\alpha_k)} \prod_{k=1}^K \mu_k^{\alpha_k-1}, \qquad \alpha_0=\sum_{k=1}^K\alpha_k.\] \[\mu\mid X=(n_1, \ldots,n_K) \sim \operatorname{Dirichlet}(\alpha_1+n_1, \ldots, \alpha_K+n_K).\]
\[p(x\mid\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^d|\Sigma|}} \exp\left\{-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}.\] \[X_1\mid X_2=x_2\sim \mathcal{N}\left(\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2), \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).\] \[AX+b\sim \mathcal{N}(A\mu+b,A\Sigma A^T).\]
Practice Problem 39 (Multinomial probability). A four-sided die has probabilities \((0.1,0.2,0.3,0.4)\). It is tossed \(8\) times. Find the probability that the counts are \((1,2,2,3)\).
Use the multinomial pmf: \[\mathbb{P}(X=(1,2,2,3))= \frac{8!}{1!2!2!3!}(0.1)^1(0.2)^2(0.3)^2(0.4)^3.\] The coefficient is \[\frac{8!}{1!2!2!3!}=1680.\] Thus \[\mathbb{P}(X=(1,2,2,3))=1680(0.1)(0.04)(0.09)(0.064) \approx 0.0387.\]
Practice Problem 40 (Dirichlet posterior mean). Suppose \(\mu\sim \operatorname{Dirichlet}(3,1,2)\) and the observed counts are \((2,4,1)\). Find the posterior distribution and posterior mean.
The posterior is \[\mu\mid X\sim \operatorname{Dirichlet}(3+2,1+4,2+1)=\operatorname{Dirichlet}(5,5,3).\] The posterior total is \(13\), so \[\mathbb{E}[\mu\mid X]=\left(\frac{5}{13},\frac{5}{13},\frac{3}{13}\right).\]
Practice Problem 41 (Conditional normal calculation). Let \[\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}2\\1\end{pmatrix}, \begin{pmatrix}9&3\\3&4\end{pmatrix} \right).\] Find \(X_1\mid X_2=5\).
We have \[\mu_1=2, \quad \mu_2=1, \quad \Sigma_{11}=9, \quad \Sigma_{12}=3, \quad \Sigma_{22}=4, \quad \Sigma_{21}=3.\] The conditional mean is \[\mu_{1\mid 2}=2+3\cdot 4^{-1}(5-1)=2+3=5.\] The conditional variance is \[\Sigma_{11\mid 2}=9-3\cdot 4^{-1}\cdot 3=9-\frac94=\frac{27}{4}.\] Therefore \[X_1\mid X_2=5\sim \operatorname{Normal}\left(5,\frac{27}{4}\right).\]
Practice Problem 42 (Affine transformation). Let \(X\sim \mathcal{N}_2\left(\begin{pmatrix}1\\2\end{pmatrix},\begin{pmatrix}1&0.5\\0.5&2\end{pmatrix}\right)\) and let \(Y=2X_1-X_2+3\). Find the distribution of \(Y\).
Write \(Y=a^TX+3\), where \[a=\begin{pmatrix}2\\-1\end{pmatrix}.\] The mean is \[\mathbb{E}[Y]=a^T\mu+3=2(1)-1(2)+3=3.\] The variance is \[\operatorname{Var}(Y)=a^T\Sigma a.\] Compute \[\Sigma a= \begin{pmatrix}1&0.5\\0.5&2\end{pmatrix} \begin{pmatrix}2\\-1\end{pmatrix} =\begin{pmatrix}1.5\\-1\end{pmatrix}.\] Thus \[a^T\Sigma a=(2,-1)\begin{pmatrix}1.5\\-1\end{pmatrix}=3+1=4.\] Therefore \[Y\sim \operatorname{Normal}(3,4).\]