Chapter 6 Extra: Multinomial and Multinormal Distributions

This extra chapter extends Section 6 in two directions. First, it develops the multinomial distribution and the Dirichlet prior for categorical data. Second, it introduces the multivariate Gaussian distribution and the linear-algebra formulas behind marginalization, conditioning, affine transformations, and linear Gaussian models.

Topics. Multinomial distribution; Dirichlet distribution; Dirichlet-multinomial Bayesian model; multivariate Gaussian distribution; Mahalanobis distance; spectral geometry; marginal and conditional Gaussian distributions; affine transformations; products and convolutions of Gaussians; linear Gaussian models.

Multinomial Distribution

This section extends the categorical and binomial distributions to experiments with more than two possible outcomes.

From categorical trials to multinomial counts

A categorical random variable records one outcome from several categories, while a multinomial random vector records the counts of each outcome after many independent categorical trials.

Definition 1 (Categorical trial). A categorical trial has \(m\) possible outcomes \(O_1,\ldots,O_m\) with probabilities \[\phi_1,\ldots,\phi_m,\qquad \phi_i\ge 0,\qquad \sum_{i=1}^m \phi_i=1.\] If one trial is performed, then the outcome has a categorical distribution.

Definition 2 (Multinomial distribution). Suppose we perform \(n\) independent trials, each with outcomes \(O_1,\ldots,O_m\) and constant category probabilities \((\phi_1,\ldots,\phi_m)\). Let \[X_i=\text{the number of times outcome }O_i\text{ appears in the }n\text{ trials}.\] Then \[X=(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1, \ldots,\phi_m),\] and for nonnegative integers \(n_1,\ldots,n_m\) satisfying \(n_1+\cdots+n_m=n\), \[\mathbb{P}(X_1=n_1,\ldots,X_m=n_m) = \frac{n!}{n_1!\cdots n_m!}\phi_1^{n_1}\cdots \phi_m^{n_m}.\]

Course connection

The multinomial distribution is the direct generalization of the categorical distribution. When \(n=1\), it reduces to a categorical trial. When \(m=2\), it reduces to the binomial distribution.

Example 3 (Tossing an \(m\)-sided die). An \(m\)-sided die has probabilities \(\phi_1,\ldots,\phi_m\) for sides \(1,\ldots,m\). Toss the die \(n\) times and let \(X_i\) be the number of times side \(i\) appears. Find the joint pmf of \((X_1,\ldots,X_m)\).

Solution

The vector of counts follows \[(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1,\ldots,\phi_m).\] Therefore, for \(n_1+\cdots+n_m=n\), \[\mathbb{P}(X_1=n_1,\ldots,X_m=n_m) = \frac{n!}{n_1!\cdots n_m!}\prod_{i=1}^m\phi_i^{n_i}.\] The coefficient \(n!/(n_1!\cdots n_m!)\) counts how many orderings of the \(n\) trials produce the same vector of counts.

Example 4 (Three-category experiment). Suppose a website visit leads to one of three outcomes: purchase, sign-up only, or no action. The probabilities are \((0.1,0.2,0.7)\). Among \(n=10\) visitors, find the probability of exactly \(2\) purchases, \(3\) sign-ups only, and \(5\) no-action visits.

Solution

Let \(X=(X_1,X_2,X_3)\sim \operatorname{Multinomial}(10,0.1,0.2,0.7)\). Then \[\mathbb{P}(X_1=2,X_2=3,X_3=5) = \frac{10!}{2!3!5!}(0.1)^2(0.2)^3(0.7)^5.\] Numerically, \[\frac{10!}{2!3!5!}(0.1)^2(0.2)^3(0.7)^5 =2520(0.01)(0.008)(0.16807)\approx 0.0339.\]
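The arithmetic in this solution can be reproduced with a few lines of Python. The following is a minimal sketch using only the standard library; the function name `multinomial_pmf` is an illustrative choice, not part of the course material.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """P(X_1 = n_1, ..., X_m = n_m) for Multinomial(n, probs), with n = sum(counts)."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! ... n_m!); the integer division is exact.
    coefficient = factorial(n) // prod(factorial(k) for k in counts)
    return coefficient * prod(p ** k for p, k in zip(probs, counts))


# Example 4: 2 purchases, 3 sign-ups only, 5 no-action visits among 10 visitors.
p_example = multinomial_pmf([2, 3, 5], [0.1, 0.2, 0.7])
print(round(p_example, 4))
```

Calling the same function with two categories, for instance `multinomial_pmf([3, 7], [0.3, 0.7])`, returns the corresponding binomial probability.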

Mean, variance, and covariance

The components of a multinomial vector are individually binomial, but they are dependent because the total count is fixed at \(n\).

Theorem 5 (Moments of the multinomial distribution). If \(X=(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n,\phi_1,\ldots,\phi_m)\), then \[\mathbb{E}[X_i]=n\phi_i, \qquad \operatorname{Var}(X_i)=n\phi_i(1-\phi_i),\] and for \(i\ne j\), \[\operatorname{Cov}(X_i,X_j)=-n\phi_i\phi_j.\] Equivalently, \[\operatorname{Cov}(X)=n\left(\operatorname{diag}(\phi)-\phi\phi^T\right),\] where \(\phi=(\phi_1,\ldots,\phi_m)^T\).

Proof

Write \(X_i=\sum_{r=1}^n I_{ri}\), where \(I_{ri}=1\) if trial \(r\) produces outcome \(O_i\), and \(0\) otherwise. Then \(I_{ri}\sim \operatorname{Bernoulli}(\phi_i)\), so \[\mathbb{E}[X_i]=\sum_{r=1}^n \mathbb{E}[I_{ri}]=n\phi_i, \qquad \operatorname{Var}(X_i)=\sum_{r=1}^n \operatorname{Var}(I_{ri})=n\phi_i(1-\phi_i).\] For \(i\ne j\), in one trial \(I_{ri}I_{rj}=0\), so \[\operatorname{Cov}(I_{ri},I_{rj})=\mathbb{E}[I_{ri}I_{rj}]-\mathbb{E}[I_{ri}]\mathbb{E}[I_{rj}]=0-\phi_i\phi_j=-\phi_i\phi_j.\] Different trials are independent, so \[\operatorname{Cov}(X_i,X_j)=\sum_{r=1}^n \operatorname{Cov}(I_{ri},I_{rj})=-n\phi_i\phi_j.\]

Remark 6. The negative covariance is not a paradox. If the total number of trials is fixed, seeing more observations in one category leaves fewer observations available for the other categories.
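One way to see the constraint numerically: because \(X_1+\cdots+X_m=n\) is constant, every row of the covariance matrix \(n(\operatorname{diag}(\phi)-\phi\phi^T)\) must sum to zero. The sketch below builds the matrix for the probabilities of Example 4; the helper name `multinomial_covariance` is an assumption made for illustration.

```python
def multinomial_covariance(n, probs):
    """Return Cov(X) = n * (diag(phi) - phi phi^T) as a list of rows."""
    m = len(probs)
    return [
        [n * (probs[i] * (i == j) - probs[i] * probs[j]) for j in range(m)]
        for i in range(m)
    ]


cov = multinomial_covariance(10, [0.1, 0.2, 0.7])

# Diagonal entries are n * phi_i * (1 - phi_i); off-diagonal entries are -n * phi_i * phi_j.
for row in cov:
    print([round(entry, 3) for entry in row])

# Each row sums to zero (up to rounding) because the total count is fixed at n.
row_sums = [sum(row) for row in cov]
```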

Practice Problem 7 (Checking the binomial special case). Show that if \(m=2\), then the multinomial distribution reduces to the binomial distribution.

Solution

Let \(X=(X_1,X_2)\sim \operatorname{Multinomial}(n,p,1-p)\). Because \(X_1+X_2=n\), the entire vector is determined by \(X_1\). For \(X_1=k\), we have \(X_2=n-k\). Thus \[\mathbb{P}(X_1=k)=\mathbb{P}(X_1=k,X_2=n-k) =\frac{n!}{k!(n-k)!}p^k(1-p)^{n-k},\] which is exactly the \(\operatorname{Binomial}(n,p)\) pmf.

Dirichlet Distribution

This section introduces the Dirichlet distribution, the continuous distribution on probability vectors that generalizes the Beta distribution.

The simplex and the Dirichlet density

The Dirichlet distribution is used when the unknown quantity is itself a vector of probabilities.

Definition 8 (Probability simplex). The \((K-1)\)-dimensional probability simplex is \[\Delta^{K-1}=\left\{\mu=(\mu_1,\ldots,\mu_K):\; \mu_k\ge 0,\; \sum_{k=1}^K \mu_k=1\right\}.\] A point in the simplex represents a probability vector over \(K\) categories.

Definition 9 (Dirichlet distribution). Let \(\alpha=(\alpha_1,\ldots,\alpha_K)\) with \(\alpha_k>0\) and let \[\alpha_0=\sum_{k=1}^K \alpha_k.\] A random probability vector \(\mu=(\mu_1,\ldots,\mu_K)\) has a Dirichlet distribution with parameter \(\alpha\), written \[\mu\sim \operatorname{Dirichlet}(\alpha_1,\ldots,\alpha_K),\] if its density on the simplex is \[p(\mu\mid\alpha)= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots \Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k-1},\] where \(\Gamma(\cdot)\) is the gamma function.

Interpretation of the parameters

The parameter \(\alpha_k\) acts like a prior pseudo-count for category \(k\). Larger values of \(\alpha_0\) concentrate the distribution more tightly around its mean \(\alpha/\alpha_0\); when all \(\alpha_k<1\), the density instead piles up near the corners and edges of the simplex.

Theorem 10 (Dirichlet moments). If \(\mu\sim \operatorname{Dirichlet}(\alpha_1,\ldots,\alpha_K)\) and \(\alpha_0=\sum_k\alpha_k\), then \[\mathbb{E}[\mu_k]=\frac{\alpha_k}{\alpha_0},\] \[\operatorname{Var}(\mu_k)=\frac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}, \qquad \operatorname{Cov}(\mu_i,\mu_j)= -\frac{\alpha_i\alpha_j}{\alpha_0^2(\alpha_0+1)}\quad (i\ne j).\]
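The moment formulas are easy to evaluate exactly with rational arithmetic. The sketch below is illustrative only: the function name `dirichlet_moments` and the choice of \(\operatorname{Dirichlet}(2,2,2)\) are assumptions, and integer parameters are used so that the fractions stay exact.

```python
from fractions import Fraction


def dirichlet_moments(alpha):
    """Mean vector and covariance matrix of Dirichlet(alpha), as exact fractions."""
    a = [Fraction(x) for x in alpha]
    a0 = sum(a)
    mean = [ak / a0 for ak in a]
    denom = a0 ** 2 * (a0 + 1)
    cov = [
        [
            # Variance on the diagonal, negative covariance off the diagonal.
            (a[i] * (a0 - a[i]) if i == j else -a[i] * a[j]) / denom
            for j in range(len(a))
        ]
        for i in range(len(a))
    ]
    return mean, cov


mean, cov = dirichlet_moments([2, 2, 2])
print(mean)       # each component has mean 1/3
print(cov[0])     # variance 2/63, covariances -1/63
```

As with the multinomial, each row of the covariance matrix sums to zero, since the components of \(\mu\) always add up to one.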

Example 11 (The Beta distribution as a special case). Show that when \(K=2\), the Dirichlet distribution becomes the Beta distribution.

Solution

When \(K=2\), the simplex condition says \(\mu_1+\mu_2=1\), so \(\mu_2=1-\mu_1\). The Dirichlet density is proportional to \[\mu_1^{\alpha_1-1}\mu_2^{\alpha_2-1} =\mu_1^{\alpha_1-1}(1-\mu_1)^{\alpha_2-1}.\] The normalized density is \[\frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} \mu_1^{\alpha_1-1}(1-\mu_1)^{\alpha_2-1},\] which is the \(\operatorname{Beta}(\alpha_1,\alpha_2)\) density. Therefore, \[\operatorname{Dirichlet}(\alpha_1,\alpha_2)=\operatorname{Beta}(\alpha_1,\alpha_2).\]

Dirichlet-multinomial Bayesian model

The Dirichlet distribution is the natural conjugate prior for categorical and multinomial probability vectors.

Definition 12 (Dirichlet-multinomial model). A Bayesian model for categorical or multinomial data is \[\mu\sim \operatorname{Dirichlet}(\alpha_1, \ldots,\alpha_K), \qquad X\mid\mu\sim \operatorname{Multinomial}(n,\mu_1,\ldots,\mu_K).\] Here the likelihood is multinomial, the prior is Dirichlet, and the posterior is again Dirichlet.

Theorem 13 (Dirichlet-multinomial conjugacy). Suppose \[\mu\sim \operatorname{Dirichlet}(\alpha_1, \ldots,\alpha_K), \qquad X=(X_1,\ldots,X_K)\mid\mu\sim \operatorname{Multinomial}(n,\mu_1,\ldots,\mu_K).\] If the observed counts are \(X_1=n_1,\ldots,X_K=n_K\), then \[\mu\mid X=(n_1,\ldots,n_K) \sim \operatorname{Dirichlet}(\alpha_1+n_1, \ldots, \alpha_K+n_K).\]

Proof

The prior density is \[p(\mu)\propto \prod_{k=1}^K \mu_k^{\alpha_k-1}.\] The multinomial likelihood, as a function of \(\mu\), is \[p(x\mid\mu)\propto \prod_{k=1}^K \mu_k^{n_k}.\] Therefore, by Bayes’ theorem, \[p(\mu\mid x)\propto p(x\mid\mu)p(\mu) \propto \prod_{k=1}^K \mu_k^{n_k}\prod_{k=1}^K\mu_k^{\alpha_k-1} = \prod_{k=1}^K \mu_k^{\alpha_k+n_k-1}.\] This is the kernel of a Dirichlet distribution with parameters \(\alpha_k+n_k\).

Example 14 (Updating category probabilities). Suppose there are three categories and the prior is \[\mu\sim \operatorname{Dirichlet}(2,2,2).\] We observe \(n=10\) trials with counts \((5,3,2)\). Find the posterior distribution and posterior mean.

Solution

By conjugacy, \[\mu\mid X=(5,3,2)\sim \operatorname{Dirichlet}(2+5,2+3,2+2)=\operatorname{Dirichlet}(7,5,4).\] The posterior total is \(7+5+4=16\). Therefore, \[\mathbb{E}[\mu_1\mid X]=\frac{7}{16}, \qquad \mathbb{E}[\mu_2\mid X]=\frac{5}{16}, \qquad \mathbb{E}[\mu_3\mid X]=\frac{4}{16}=\frac14.\] In general the posterior mean is \((\alpha_k+n_k)/(\alpha_0+n)\), a weighted average of the prior mean \(\alpha_k/\alpha_0\) and the observed frequency \(n_k/n\) with weights \(\alpha_0/(\alpha_0+n)\) and \(n/(\alpha_0+n)\).
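The conjugate update amounts to adding counts to pseudo-counts, which the following sketch makes explicit. The function names are hypothetical, and the code assumes integer parameters so that exact fractions can be used. It also checks that the posterior mean equals a blend of the prior mean and the observed frequencies with weight \(\alpha_0/(\alpha_0+n)\) on the prior.

```python
from fractions import Fraction


def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters alpha_k + n_k."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + n for a, n in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean vector alpha_k / alpha_0 (integer parameters assumed)."""
    total = sum(alpha)
    return [Fraction(a, total) for a in alpha]


alpha, counts = [2, 2, 2], [5, 3, 2]
posterior = dirichlet_posterior(alpha, counts)
post_mean = dirichlet_mean(posterior)

# Blend of prior mean and observed frequency, weighted by pseudo-count mass.
a0, n = sum(alpha), sum(counts)
weight = Fraction(a0, a0 + n)
blended = [
    weight * Fraction(a, a0) + (1 - weight) * Fraction(c, n)
    for a, c in zip(alpha, counts)
]

print(posterior)   # [7, 5, 4]
print(post_mean)   # 7/16, 5/16, 1/4
```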

Practice Problem 15 (Uniform prior on the simplex). Let \(\mu\sim \operatorname{Dirichlet}(1,1,1)\) and suppose counts \((n_1,n_2,n_3)=(4,1,5)\) are observed. Find the posterior distribution.

Solution

The posterior is \[\mu\mid X\sim \operatorname{Dirichlet}(1+4,1+1,1+5)=\operatorname{Dirichlet}(5,2,6).\] Because \(\operatorname{Dirichlet}(1,1,1)\) is uniform over the three-category simplex, the posterior parameters are simply one plus the observed counts.

Single Variable Gaussian Distribution

This section reviews the ordinary one-dimensional Gaussian distribution before moving to the multivariate Gaussian distribution.

Density, mean, variance, and probability

The one-dimensional Gaussian distribution is the prototype for the multivariate Gaussian distribution.

Definition 16 (Univariate Gaussian). A random variable \(X\) has a Gaussian or normal distribution with mean \(\mu\) and variance \(\sigma^2\), written \[X\sim \operatorname{Normal}(\mu,\sigma^2),\] if its density is \[f_X(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}, \qquad -\infty<x<\infty.\] Then \[\mathbb{E}[X]=\mu, \qquad \operatorname{Var}(X)=\sigma^2.\] For any \(a<b\), \[\mathbb{P}(a<X<b)=\int_a^b f_X(x)\,dx.\]

Example 17 (Standardizing a normal random variable). Let \(X\sim \operatorname{Normal}(\mu,\sigma^2)\). Show that \[Z=\frac{X-\mu}{\sigma}\] has the standard normal distribution.

Solution

The transformation is \(Z=(X-\mu)/\sigma\), so \(X=\mu+\sigma Z\). The change-of-variables formula gives \[f_Z(z)=f_X(\mu+\sigma z)\cdot \sigma.\] Substituting the density of \(X\), \[f_Z(z)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{z^2}{2}\right)\sigma =\frac{1}{\sqrt{2\pi}}e^{-z^2/2}.\] Thus \(Z\sim \operatorname{Normal}(0,1)\).
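Standardization is what makes interval probabilities computable from a single function, since \(\mathbb{P}(a<X<b)=\Phi\big((b-\mu)/\sigma\big)-\Phi\big((a-\mu)/\sigma\big)\), where \(\Phi\) is the standard normal cdf. The sketch below expresses \(\Phi\) through the error function in Python's `math` module; the function names and the numbers \(\mu=10\), \(\sigma=2\) are illustrative assumptions.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Phi(z) for Z ~ Normal(0, 1), written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def normal_interval_probability(a, b, mu, sigma):
    """P(a < X < b) for X ~ Normal(mu, sigma^2), computed by standardizing."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return standard_normal_cdf((b - mu) / sigma) - standard_normal_cdf((a - mu) / sigma)


# Probability of falling within one standard deviation of the mean.
p_one_sigma = normal_interval_probability(8.0, 12.0, mu=10.0, sigma=2.0)
print(round(p_one_sigma, 4))
```

The result does not depend on \(\mu\) or \(\sigma\) as long as the interval is \((\mu-\sigma,\mu+\sigma)\), which is exactly the point of standardizing.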

Multivariate Gaussian Distribution

This section introduces the Gaussian distribution for random vectors and explains how the covariance matrix controls its geometry.

Definition and covariance matrix

A multivariate Gaussian distribution is completely determined by its mean vector and covariance matrix.

Definition 18 (Multivariate Gaussian). Let \(X\in\mathbb{R}^d\) be a random vector. We say \[X\sim \mathcal{N}_d(\mu,\Sigma),\] where \(\mu\in\mathbb{R}^d\) and \(\Sigma\) is a \(d\times d\) symmetric positive definite matrix, if \[p(x\mid\mu,\Sigma) = \frac{1}{\sqrt{(2\pi)^d|\Sigma|}} \exp\left\{-\frac12 (x-\mu)^T\Sigma^{-1}(x-\mu)\right\}.\] The covariance matrix is \[\Sigma=\operatorname{Cov}(X,X)=\mathbb{E}\left[(X-\mu)(X-\mu)^T\right].\]

Normalization constant

The normalization constant is \[Z=\sqrt{\det(2\pi\Sigma)}=\sqrt{(2\pi)^d|\Sigma|}.\] This is the multivariate analogue of \(\sqrt{2\pi}\sigma\) in the one-dimensional normal density.

Mahalanobis distance and geometry

The shape of a multivariate Gaussian is controlled by a quadratic form that generalizes the one-dimensional \(z\)-score.

Definition 19 (Mahalanobis distance). For \(X\sim \mathcal{N}_d(\mu,\Sigma)\) and a point \(x\in\mathbb{R}^d\), define \[\Delta^2=(x-\mu)^T\Sigma^{-1}(x-\mu).\] The value \(\Delta=\sqrt{\Delta^2}\) is called the Mahalanobis distance from \(x\) to \(\mu\) with respect to \(\Sigma\).

Remark 20. In one dimension, \[\Delta^2=\frac{(x-\mu)^2}{\sigma^2}, \qquad \Delta=\left|\frac{x-\mu}{\sigma}\right|.\] Thus Mahalanobis distance generalizes the absolute value of the usual \(z\)-score. If \(\Sigma=I\), then \[\Delta^2=(x-\mu)^T(x-\mu)=\|x-\mu\|^2,\] so Mahalanobis distance reduces to Euclidean distance.

Example 21 (Two-dimensional covariance matrices). Compare the covariance matrices \[\Sigma_1=\begin{pmatrix}1&0\\0&1\end{pmatrix}, \qquad \Sigma_2=\begin{pmatrix}1&0.5\\0.5&1\end{pmatrix}, \qquad \Sigma_3=\begin{pmatrix}1&0.8\\0.8&1\end{pmatrix}.\] Describe the qualitative effect on the Gaussian density.

Solution

For \(\Sigma_1\), the coordinates are uncorrelated with equal variance, so the contours are circles centered at \(\mu\).

For \(\Sigma_2\), the positive off-diagonal covariance creates positively tilted elliptical contours. Larger \(x_1\) values tend to be associated with larger \(x_2\) values.

For \(\Sigma_3\), the positive dependence is stronger, so the ellipse is more stretched along the positive diagonal direction. The density is more concentrated around a narrow tilted ridge than in the case \(\Sigma_2\).

Spectral decomposition and precision matrix

The eigenvalues and eigenvectors of the covariance matrix reveal the principal axes of a Gaussian distribution.

Theorem 22 (Spectral form of the quadratic term). Let \(\Sigma\) be symmetric positive definite. Then \[\Sigma=UDU^T=\sum_{i=1}^d \lambda_i u_i u_i^T,\] where \(U=[u_1\ \cdots\ u_d]\) is orthogonal and \(D=\operatorname{diag}(\lambda_1,\ldots,\lambda_d)\) with \(\lambda_i>0\). The precision matrix is \[\Sigma^{-1}=UD^{-1}U^T=\sum_{i=1}^d\frac{1}{\lambda_i}u_i u_i^T.\] If \[y_i=u_i^T(x-\mu),\] then \[\Delta^2=(x-\mu)^T\Sigma^{-1}(x-\mu)=\sum_{i=1}^d \frac{y_i^2}{\lambda_i}.\]

Proof

Since \(\Sigma=UDU^T\) and \(U^TU=I\), the inverse is \[\Sigma^{-1}=UD^{-1}U^T.\] Thus \[(x-\mu)^T\Sigma^{-1}(x-\mu) =(x-\mu)^TUD^{-1}U^T(x-\mu).\] Let \(y=U^T(x-\mu)\). Then the quadratic form becomes \[y^TD^{-1}y=\sum_{i=1}^d\frac{y_i^2}{\lambda_i}.\]
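The identity can be checked numerically. The sketch below, which assumes NumPy is available, uses the covariance matrix with off-diagonal entry \(0.5\) and the illustrative point \(x=(1,1)\) with \(\mu=0\), and compares the direct quadratic form with the sum \(\sum_i y_i^2/\lambda_i\).

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, 1.0])
diff = x - mu

# Direct evaluation; solve() avoids forming the inverse of Sigma explicitly.
delta_sq_direct = diff @ np.linalg.solve(Sigma, diff)

# Spectral route: eigh returns eigenvalues in ascending order and
# orthonormal eigenvectors as the columns of U.
eigenvalues, U = np.linalg.eigh(Sigma)
y = U.T @ diff
delta_sq_spectral = np.sum(y ** 2 / eigenvalues)

print(eigenvalues)                         # principal variances
print(delta_sq_direct, delta_sq_spectral)  # the two routes agree
```

Here \(x-\mu\) lies along the eigenvector with the larger eigenvalue, so the Mahalanobis distance is smaller than the Euclidean distance \(\sqrt{2}\).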

Terminology

The matrix \(\Sigma^{-1}\) is called the precision matrix. The covariance matrix describes spread; the precision matrix describes inverse spread and appears naturally in the exponent of the Gaussian density.

Why Gaussian distributions are central

Gaussian distributions appear frequently because of the central limit theorem and because Gaussianity is preserved by many important operations.

Once Gaussian, always Gaussian

Gaussian random vectors are completely determined by mean vector \(\mu\) and covariance matrix \(\Sigma\). Linear transformations, marginalization, conditioning, and sums of independent Gaussian vectors all produce Gaussian distributions again.

Remark 23 (Central limit theorem motivation). If \(X_1,\ldots,X_m\) are independent and identically distributed with finite mean \(\mu\) and variance \(\sigma^2>0\), then the arithmetic mean \[\bar X=\frac{1}{m}(X_1+\cdots+X_m)\] is approximately normal for large \(m\): the standardized mean \(\sqrt{m}(\bar X-\mu)/\sigma\) converges in distribution to \(\operatorname{Normal}(0,1)\). This helps explain why quantities that aggregate many small independent effects often look Gaussian in real data.

Marginal and Conditional Gaussian Distributions

This section gives the most important computational formulas for partitioned multivariate Gaussian distributions.

Partitioned Gaussian vectors

Partitioning a Gaussian vector allows us to study subsets of variables and conditional distributions.

Suppose \[X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right).\] Here \(X_1\) and \(X_2\) may themselves be vectors, and the blocks of \(\Sigma\) have compatible dimensions.

Marginalization

The marginal distribution of a subvector of a Gaussian vector is Gaussian with the corresponding mean block and covariance block.

Theorem 24 (Gaussian marginal distribution). If \[X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right),\] then \[X_1\sim \mathcal{N}(\mu_1,\Sigma_{11}), \qquad X_2\sim \mathcal{N}(\mu_2,\Sigma_{22}).\]

Example 25 (Marginal of a bivariate normal). Let \[\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}1\\2\end{pmatrix}, \begin{pmatrix}4&1\\1&9\end{pmatrix} \right).\] Find the marginal distributions of \(X_1\) and \(X_2\).

Solution

By the marginalization theorem, \[X_1\sim \operatorname{Normal}(1,4), \qquad X_2\sim \operatorname{Normal}(2,9).\] The off-diagonal covariance \(1\) affects the dependence between \(X_1\) and \(X_2\), but it does not change the marginal variances.

Conditional Gaussian distributions

Conditioning on one part of a Gaussian vector gives another Gaussian distribution.

Theorem 26 (Conditional Gaussian distribution). Suppose \[X=\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \begin{pmatrix} \Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22} \end{pmatrix} \right),\] where \(\Sigma_{22}\) is invertible. Then \[X_1\mid X_2=x_2\sim \mathcal{N}(\mu_{1\mid 2},\Sigma_{11\mid 2}),\] where \[\mu_{1\mid 2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2),\] and \[\Sigma_{11\mid 2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.\]

Proof

Proof idea by completing the square. Let the precision matrix be \[\Lambda=\Sigma^{-1}=\begin{pmatrix}\Lambda_{11}&\Lambda_{12}\\\Lambda_{21}&\Lambda_{22}\end{pmatrix}.\] The exponent of the joint Gaussian density is \[-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu).\] If \(x_2\) is fixed, then as a function of \(x_1\) this expression is a quadratic form in \(x_1\). Therefore the conditional distribution must be Gaussian. Completing the square gives \[\Sigma_{11\mid 2}=\Lambda_{11}^{-1}, \qquad \mu_{1\mid 2}=\mu_1-\Lambda_{11}^{-1}\Lambda_{12}(x_2-\mu_2).\] Using the block inverse formula and Schur complement identities, \[\Lambda_{11}^{-1}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21},\] and \[-\Lambda_{11}^{-1}\Lambda_{12}=\Sigma_{12}\Sigma_{22}^{-1}.\] This gives the stated formula.

Interpretation

The conditional mean is a linear function of the observed value \(x_2\). The conditional covariance does not depend on the observed value \(x_2\); it only depends on the covariance blocks.

Example 27 (Conditional distribution in two dimensions). Let \[\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1&\rho\\\rho&1\end{pmatrix} \right), \qquad -1<\rho<1.\] Find \(X_1\mid X_2=x_2\).

Solution

Here \[\mu_1=0, \quad \mu_2=0, \quad \Sigma_{11}=1, \quad \Sigma_{22}=1, \quad \Sigma_{12}=\rho.\] Therefore, \[\mu_{1\mid 2}=0+\rho(1)^{-1}(x_2-0)=\rho x_2,\] and \[\Sigma_{11\mid 2}=1-\rho(1)^{-1}\rho=1-\rho^2.\] Thus \[X_1\mid X_2=x_2\sim \operatorname{Normal}(\rho x_2,1-\rho^2).\] If \(\rho=0\), the conditional distribution is the same as the marginal distribution, reflecting independence.
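The conditioning formulas translate directly into a short function. The following is a sketch assuming NumPy; the name `conditional_gaussian` and its argument order are illustrative choices. It is applied to the bivariate case above with the assumed values \(\rho=0.8\) and \(x_2=1\).

```python
import numpy as np


def conditional_gaussian(mu1, mu2, S11, S12, S22, x2):
    """Mean and covariance of X1 | X2 = x2 for a partitioned Gaussian vector."""
    mu1, mu2, x2 = (np.atleast_1d(np.asarray(v, dtype=float)) for v in (mu1, mu2, x2))
    S11, S12, S22 = (np.atleast_2d(np.asarray(v, dtype=float)) for v in (S11, S12, S22))
    # gain = S12 S22^{-1}, obtained by solving S22 G^T = S21 (S22 is symmetric).
    gain = np.linalg.solve(S22, S12.T).T
    cond_mean = mu1 + gain @ (x2 - mu2)
    cond_cov = S11 - gain @ S12.T  # Schur complement of S22
    return cond_mean, cond_cov


rho = 0.8
cond_mean, cond_cov = conditional_gaussian(
    mu1=0.0, mu2=0.0, S11=1.0, S12=rho, S22=1.0, x2=1.0
)
print(cond_mean, cond_cov)  # mean rho * x2, variance 1 - rho^2
```

Note that `cond_cov` would be unchanged for any other value of `x2`, in line with the interpretation of the theorem.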

Linear algebra review: Schur complements and block inverses

The Schur complement is the matrix identity behind the conditional covariance formula.

Definition 28 (Schur complement). Let \[M=\begin{pmatrix}A&B\\C&D\end{pmatrix},\] where \(D\) is invertible. The Schur complement of \(D\) in \(M\) is \[M/D=A-BD^{-1}C.\]

Theorem 29 (Block inverse formula). If \(D\) and \(M/D\) are invertible, then \[M^{-1}=\begin{pmatrix} (M/D)^{-1} & -(M/D)^{-1}BD^{-1}\\ -D^{-1}C(M/D)^{-1} & D^{-1}+D^{-1}C(M/D)^{-1}BD^{-1} \end{pmatrix}.\]

Proof

Block Gaussian elimination gives \[\begin{pmatrix}A&B\\C&D\end{pmatrix} = \begin{pmatrix}I&BD^{-1}\\0&I\end{pmatrix} \begin{pmatrix}A-BD^{-1}C&0\\0&D\end{pmatrix} \begin{pmatrix}I&0\\D^{-1}C&I\end{pmatrix}.\] Taking inverses of the three factors and multiplying them gives the stated block inverse formula.

Practice Problem 30 (Schur complement calculation). Let \[M=\begin{pmatrix}4&2\\2&3\end{pmatrix}.\] Compute the Schur complement of \(D=3\).

Solution

Here \(A=4\), \(B=2\), \(C=2\), and \(D=3\). Therefore \[M/D=A-BD^{-1}C=4-2\cdot\frac13\cdot 2=4-\frac43=\frac83.\]
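As a numerical sanity check of the block inverse formula, the sketch below (assuming NumPy) computes the Schur complement for this matrix and compares the top row of \(M^{-1}\) with \((M/D)^{-1}\) and \(-(M/D)^{-1}BD^{-1}\). The slicing keeps each block two-dimensional so the same code would work for larger blocks.

```python
import numpy as np

M = np.array([[4.0, 2.0], [2.0, 3.0]])

# Partition M into blocks A, B, C, D (each kept as a 2-D array).
A, B, C, D = M[:1, :1], M[:1, 1:], M[1:, :1], M[1:, 1:]

# Schur complement of D in M: A - B D^{-1} C.
schur = A - B @ np.linalg.solve(D, C)

M_inv = np.linalg.inv(M)
top_left = np.linalg.inv(schur)                 # (M/D)^{-1}
top_right = -top_left @ B @ np.linalg.inv(D)    # -(M/D)^{-1} B D^{-1}

print(schur)          # 8/3
print(M_inv[:1, :])   # top row of the inverse: 3/8 and -1/4
```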

Operations on Gaussian Random Variables

This section summarizes several operations that preserve Gaussian structure.

Affine transformations

A linear transformation plus a shift maps a Gaussian vector to another Gaussian vector.

Theorem 31 (Affine transformation of a Gaussian). If \[X\sim \mathcal{N}(\mu,\Sigma), \qquad Y=AX+b,\] then \[Y\sim \mathcal{N}(A\mu+b,A\Sigma A^T).\]

Proof

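The mean is \[\mathbb{E}[Y]=\mathbb{E}[AX+b]=A\mathbb{E}[X]+b=A\mu+b.\] The covariance is \[\operatorname{Var}(Y)=\operatorname{Var}(AX+b)=A\operatorname{Var}(X)A^T=A\Sigma A^T.\] To see that \(Y\) is Gaussian, use the moment generating function \(M_X(t)=\exp\left\{t^T\mu+\tfrac12 t^T\Sigma t\right\}\): \[M_Y(t)=\mathbb{E}\left[e^{t^T(AX+b)}\right]=e^{t^Tb}M_X(A^Tt)=\exp\left\{t^T(A\mu+b)+\tfrac12 t^TA\Sigma A^Tt\right\},\] which is the moment generating function of \(\mathcal{N}(A\mu+b,A\Sigma A^T)\). If \(A\) does not have full row rank, then \(A\Sigma A^T\) is singular and \(Y\) is a degenerate Gaussian without a density on the whole space.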

Example 32 (Linear combination of a Gaussian vector). Let \(X\sim \mathcal{N}(\mu,\Sigma)\) and let \(a\) be a fixed vector. Find the distribution of \(Y=a^TX\).

Solution

This is the affine transformation with \(A=a^T\) and \(b=0\). Hence \[Y=a^TX\sim \operatorname{Normal}(a^T\mu,a^T\Sigma a).\]
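In code, the affine rule is two matrix products. The sketch below assumes NumPy; the function name `affine_gaussian` and all numerical values are made up for illustration. It handles both a general matrix \(A\) and the single linear combination \(a^TX\) by passing \(a^T\) as a one-row matrix.

```python
import numpy as np


def affine_gaussian(mu, Sigma, A, b):
    """Mean and covariance of Y = A X + b when X ~ N(mu, Sigma)."""
    mu, Sigma, A, b = (np.asarray(v, dtype=float) for v in (mu, Sigma, A, b))
    return A @ mu + b, A @ Sigma @ A.T


mu = [0.0, 1.0]
Sigma = [[2.0, 1.0], [1.0, 3.0]]

# Y = (X1 + X2 + 1, X1 - X2): a two-dimensional affine image of X.
mean_y, cov_y = affine_gaussian(mu, Sigma, A=[[1.0, 1.0], [1.0, -1.0]], b=[1.0, 0.0])

# S = X1 + X2: the linear combination a^T X with a = (1, 1).
mean_s, var_s = affine_gaussian(mu, Sigma, A=[[1.0, 1.0]], b=[0.0])

print(mean_y, cov_y)
print(mean_s, var_s)
```

The variance of \(X_1+X_2\) comes out as \(2+3+2\cdot 1=7\), matching the familiar formula \(\operatorname{Var}(X_1)+\operatorname{Var}(X_2)+2\operatorname{Cov}(X_1,X_2)\).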

Products and convolutions of Gaussian densities

Products and convolutions of Gaussian densities are central in Bayesian updating, filtering, and linear models.

Theorem 33 (Pointwise product of two Gaussian densities). Consider two Gaussian densities over the same variable \(x\): \[p_1(x)=\mathcal{N}(x\mid\mu_1,\Sigma_1), \qquad p_2(x)=\mathcal{N}(x\mid\mu_2,\Sigma_2).\] Their pointwise product is proportional to another Gaussian density: \[p_1(x)p_2(x)\propto \mathcal{N}(x\mid\mu,\Sigma),\] where \[\Sigma^{-1}=\Sigma_1^{-1}+\Sigma_2^{-1}, \qquad \mu=\Sigma\left(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2\right).\]

Proof

The product has exponent \[-\frac12(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) -\frac12(x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2).\] Collecting quadratic and linear terms in \(x\) gives \[-\frac12 x^T(\Sigma_1^{-1}+\Sigma_2^{-1})x +x^T(\Sigma_1^{-1}\mu_1+\Sigma_2^{-1}\mu_2)+\text{constant}.\] Completing the square gives the stated covariance and mean.

Theorem 34 (Convolution and sums). If \(X\sim \mathcal{N}(\mu_1,\Sigma_1)\) and \(Y\sim \mathcal{N}(\mu_2,\Sigma_2)\) are independent, then \[Z=X+Y\sim \mathcal{N}(\mu_1+\mu_2,\Sigma_1+\Sigma_2).\] Equivalently, the convolution \[p_Z(z)=\int p_X(x)p_Y(z-x)\,dx\] is Gaussian.

Proof

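By linearity of expectation, \[\mathbb{E}[Z]=\mathbb{E}[X]+\mathbb{E}[Y]=\mu_1+\mu_2,\] and since \(X\) and \(Y\) are independent, \[\operatorname{Var}(Z)=\operatorname{Var}(X)+\operatorname{Var}(Y)=\Sigma_1+\Sigma_2.\] Because \(X\) and \(Y\) are independent Gaussian vectors, the stacked vector \((X,Y)\) is jointly Gaussian, and \(Z=\begin{pmatrix}I&I\end{pmatrix}\begin{pmatrix}X\\Y\end{pmatrix}\) is a linear transformation of it. Hence \(Z\) is Gaussian by Theorem 31, and its distribution is determined by these two quantities.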
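The convolution statement can be checked numerically in one dimension. The sketch below uses only the standard library and a midpoint rule; the function names, the integration range, and the parameter values are all illustrative assumptions.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, var):
    """Density of Normal(mu, var) at x."""
    return exp(-0.5 * (x - mu) ** 2 / var) / sqrt(2.0 * pi * var)


def convolve_at(z, mu1, var1, mu2, var2, lo=-30.0, hi=30.0, steps=6000):
    """Midpoint-rule approximation of the integral of p_X(x) p_Y(z - x) dx."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += normal_pdf(x, mu1, var1) * normal_pdf(z - x, mu2, var2)
    return total * h


z = 1.5
numeric = convolve_at(z, mu1=1.0, var1=2.0, mu2=-0.5, var2=3.0)
# Closed form: means add and variances add.
closed_form = normal_pdf(z, 1.0 + (-0.5), 2.0 + 3.0)
print(numeric, closed_form)
```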

Example 35 (Bayesian normal update as product of Gaussians). Suppose a prior density for a scalar parameter is \(\theta\sim \operatorname{Normal}(\mu_0,\sigma_0^2)\), and a Gaussian likelihood kernel is proportional to \(\operatorname{Normal}(\theta\mid y,\sigma^2)\). Find the posterior variance and mean.

Solution

Using the product formula in one dimension, \[\frac{1}{\sigma_{post}^2}=\frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}.\] Thus \[\sigma_{post}^2=\left(\frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}\right)^{-1}.\] The posterior mean is \[\mu_{post}=\sigma_{post}^2\left(\frac{\mu_0}{\sigma_0^2}+\frac{y}{\sigma^2}\right).\] This is a precision-weighted average of the prior mean and the data value.
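A small numerical instance shows the precision weighting at work. In the sketch below the function name and the values \(\mu_0=0\), \(\sigma_0^2=4\), \(y=3\), \(\sigma^2=1\) are assumptions chosen for illustration.

```python
def gaussian_product_1d(mu0, var0, y, var):
    """Posterior mean and variance from a Normal(mu0, var0) prior
    and a Gaussian likelihood kernel centred at y with variance var."""
    precision = 1.0 / var0 + 1.0 / var          # precisions add
    post_var = 1.0 / precision
    post_mean = post_var * (mu0 / var0 + y / var)
    return post_mean, post_var


post_mean, post_var = gaussian_product_1d(mu0=0.0, var0=4.0, y=3.0, var=1.0)
print(post_mean, post_var)
```

Because the data precision \(1/\sigma^2=1\) is four times the prior precision \(1/\sigma_0^2=0.25\), the posterior mean \(2.4\) sits four fifths of the way from the prior mean to the observation, and the posterior variance \(0.8\) is smaller than both input variances.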

Linear Gaussian Models

This section combines marginal and conditional Gaussian formulas in a common model used in Bayesian statistics and machine learning.

Marginal and conditional distributions in a linear Gaussian model

A linear Gaussian model assumes a Gaussian prior on an input vector and a Gaussian conditional distribution for an output vector.

Theorem 36 (Marginal and conditional Gaussians in a linear model). Suppose \[p(x)=\mathcal{N}(x\mid\mu,\Lambda^{-1}), \qquad p(y\mid x)=\mathcal{N}(y\mid Ax+b,L^{-1}),\] where \(\Lambda\) and \(L\) are precision matrices. Then \[p(y)=\mathcal{N}\left(y\mid A\mu+b, L^{-1}+A\Lambda^{-1}A^T\right).\] Moreover, \[p(x\mid y)=\mathcal{N}(x\mid m,S),\] where \[S=(\Lambda+A^TLA)^{-1},\] and \[m=S\left(A^TL(y-b)+\Lambda\mu\right).\]

Proof

Write the model as \[y=Ax+b+\varepsilon, \qquad \varepsilon\sim \mathcal{N}(0,L^{-1}), \qquad x\sim \mathcal{N}(\mu,\Lambda^{-1}),\] with \(x\) independent of \(\varepsilon\). Therefore \[\mathbb{E}[y]=A\mathbb{E}[x]+b=A\mu+b,\] and \[\operatorname{Var}(y)=A\operatorname{Var}(x)A^T+\operatorname{Var}(\varepsilon)=A\Lambda^{-1}A^T+L^{-1}.\] This gives the marginal distribution of \(y\).

For the conditional distribution, multiply the prior and likelihood as functions of \(x\): \[p(x\mid y)\propto \exp\left\{-\frac12(x-\mu)^T\Lambda(x-\mu) -\frac12(y-Ax-b)^TL(y-Ax-b)\right\}.\] Collecting quadratic terms in \(x\) gives precision \[S^{-1}=\Lambda+A^TLA.\] Collecting linear terms gives \[S^{-1}m=\Lambda\mu+A^TL(y-b),\] so \[m=S\left(A^TL(y-b)+\Lambda\mu\right).\]
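The two routes to the posterior, the precision form derived here and the partitioned conditioning formula applied to the joint distribution of \((x,y)\), should agree. The sketch below (assuming NumPy, with hypothetical function names and made-up numbers) computes both and compares them. It uses \(\operatorname{Cov}(x,y)=\Lambda^{-1}A^T\), which follows from \(y=Ax+b+\varepsilon\) with \(\varepsilon\) independent of \(x\).

```python
import numpy as np


def linear_gaussian_posterior(mu, Lam, A, b, L, y):
    """Mean m and covariance S of p(x | y) in the precision form."""
    S = np.linalg.inv(Lam + A.T @ L @ A)
    m = S @ (A.T @ L @ (y - b) + Lam @ mu)
    return m, S


def linear_gaussian_marginal(mu, Lam, A, b, L):
    """Mean and covariance of the marginal p(y)."""
    return A @ mu + b, np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T


# Illustrative values: 2-D latent x, 1-D observation y.
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.5], [0.5, 1.0]])   # prior precision (positive definite)
A = np.array([[1.0, 2.0]])
b = np.array([0.5])
L = np.array([[4.0]])                      # observation precision
y = np.array([2.0])

m, S = linear_gaussian_posterior(mu, Lam, A, b, L, y)
mean_y, cov_y = linear_gaussian_marginal(mu, Lam, A, b, L)

# Cross-check via conditioning in the joint Gaussian of (x, y).
cov_xy = np.linalg.inv(Lam) @ A.T
m_check = mu + cov_xy @ np.linalg.solve(cov_y, y - mean_y)
S_check = np.linalg.inv(Lam) - cov_xy @ np.linalg.solve(cov_y, cov_xy.T)

print(np.allclose(m, m_check), np.allclose(S, S_check))
```

The precision form inverts a matrix the size of \(x\), while the conditioning form inverts one the size of \(y\); which is cheaper depends on the relative dimensions.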

Example 37 (Scalar linear Gaussian model). Let \[X\sim \operatorname{Normal}(\mu,\tau^2), \qquad Y\mid X=x\sim \operatorname{Normal}(ax+b,\sigma^2).\] Find the marginal distribution of \(Y\).

Solution

We can write \[Y=aX+b+\varepsilon, \qquad \varepsilon\sim \operatorname{Normal}(0,\sigma^2),\] with \(\varepsilon\) independent of \(X\). Hence \[\mathbb{E}[Y]=a\mu+b,\] and \[\operatorname{Var}(Y)=a^2\tau^2+\sigma^2.\] Therefore \[Y\sim \operatorname{Normal}(a\mu+b,a^2\tau^2+\sigma^2).\]

Practice Problem 38 (Posterior in the scalar linear Gaussian model). For the same model, \[X\sim \operatorname{Normal}(\mu,\tau^2), \qquad Y\mid X=x\sim \operatorname{Normal}(ax+b,\sigma^2),\] find the conditional distribution of \(X\mid Y=y\).

Solution

Here the prior precision is \(1/\tau^2\) and the observation precision is \(1/\sigma^2\). The posterior precision is \[\frac{1}{s^2}=\frac{1}{\tau^2}+\frac{a^2}{\sigma^2}.\] Thus \[s^2=\left(\frac{1}{\tau^2}+\frac{a^2}{\sigma^2}\right)^{-1}.\] The posterior mean is \[m=s^2\left(\frac{\mu}{\tau^2}+\frac{a(y-b)}{\sigma^2}\right).\] Therefore \[X\mid Y=y\sim \operatorname{Normal}(m,s^2).\]

Summary and Practice

This final section summarizes the main formulas and provides practice problems that reinforce the chapter.

Core formulas

The main ideas of this chapter are that multinomial count vectors are constrained to sum to \(n\), Dirichlet distributions live on the simplex of probability vectors, and Gaussian random vectors remain Gaussian under marginalization, conditioning, affine transformations, and sums of independent vectors.

Multinomial and Dirichlet formulas

\[\mathbb{P}(X_1=n_1,\ldots,X_K=n_K) =\frac{n!}{n_1!\cdots n_K!}\prod_{k=1}^K\phi_k^{n_k}.\] \[p(\mu\mid\alpha)= \frac{\Gamma(\alpha_0)}{\prod_{k=1}^K\Gamma(\alpha_k)} \prod_{k=1}^K \mu_k^{\alpha_k-1}, \qquad \alpha_0=\sum_{k=1}^K\alpha_k.\] \[\mu\mid X=(n_1, \ldots,n_K) \sim \operatorname{Dirichlet}(\alpha_1+n_1, \ldots, \alpha_K+n_K).\]

Multivariate Gaussian formulas

\[p(x\mid\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^d|\Sigma|}} \exp\left\{-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}.\] \[X_1\mid X_2=x_2\sim \mathcal{N}\left(\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2), \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).\] \[AX+b\sim \mathcal{N}(A\mu+b,A\Sigma A^T).\]

Practice Problem 39 (Multinomial probability). A four-sided die has probabilities \((0.1,0.2,0.3,0.4)\). It is tossed \(8\) times. Find the probability that the counts are \((1,2,2,3)\).

Solution

Use the multinomial pmf: \[\mathbb{P}(X=(1,2,2,3))= \frac{8!}{1!2!2!3!}(0.1)^1(0.2)^2(0.3)^2(0.4)^3.\] The coefficient is \[\frac{8!}{1!2!2!3!}=1680.\] Thus \[\mathbb{P}(X=(1,2,2,3))=1680(0.1)(0.04)(0.09)(0.064) \approx 0.0387.\]

Practice Problem 40 (Dirichlet posterior mean). Suppose \(\mu\sim \operatorname{Dirichlet}(3,1,2)\) and the observed counts are \((2,4,1)\). Find the posterior distribution and posterior mean.

Solution

The posterior is \[\mu\mid X\sim \operatorname{Dirichlet}(3+2,1+4,2+1)=\operatorname{Dirichlet}(5,5,3).\] The posterior total is \(13\), so \[\mathbb{E}[\mu\mid X]=\left(\frac{5}{13},\frac{5}{13},\frac{3}{13}\right).\]

Practice Problem 41 (Conditional normal calculation). Let \[\begin{pmatrix}X_1\\X_2\end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix}2\\1\end{pmatrix}, \begin{pmatrix}9&3\\3&4\end{pmatrix} \right).\] Find \(X_1\mid X_2=5\).

Solution

We have \[\mu_1=2, \quad \mu_2=1, \quad \Sigma_{11}=9, \quad \Sigma_{12}=3, \quad \Sigma_{22}=4, \quad \Sigma_{21}=3.\] The conditional mean is \[\mu_{1\mid 2}=2+3\cdot 4^{-1}(5-1)=2+3=5.\] The conditional variance is \[\Sigma_{11\mid 2}=9-3\cdot 4^{-1}\cdot 3=9-\frac94=\frac{27}{4}.\] Therefore \[X_1\mid X_2=5\sim \operatorname{Normal}\left(5,\frac{27}{4}\right).\]

Practice Problem 42 (Affine transformation). Let \(X\sim \mathcal{N}_2\left(\begin{pmatrix}1\\2\end{pmatrix},\begin{pmatrix}1&0.5\\0.5&2\end{pmatrix}\right)\) and let \(Y=2X_1-X_2+3\). Find the distribution of \(Y\).

Solution

Write \(Y=a^TX+3\), where \[a=\begin{pmatrix}2\\-1\end{pmatrix}.\] The mean is \[\mathbb{E}[Y]=a^T\mu+3=2(1)-1(2)+3=3.\] The variance is \[\operatorname{Var}(Y)=a^T\Sigma a.\] Compute \[\Sigma a= \begin{pmatrix}1&0.5\\0.5&2\end{pmatrix} \begin{pmatrix}2\\-1\end{pmatrix} =\begin{pmatrix}1.5\\-1\end{pmatrix}.\] Thus \[a^T\Sigma a=(2,-1)\begin{pmatrix}1.5\\-1\end{pmatrix}=3+1=4.\] Therefore \[Y\sim \operatorname{Normal}(3,4).\]
