4  Chapter 3: Joint and Conditional Probability

This chapter moves from one random variable to several random variables studied together. The main objects are joint distributions, marginal distributions, conditional distributions, independence, covariance, total probability, Bayes theorem, and conditional independence.

Topics. Expected value review; variance and covariance; joint, marginal, and conditional distributions; independence; total probability; Bayes theorem; conditional independence.

4.1 Overview

This section moves from one random variable to several random variables studied together.

In Section 2, we learned how a single random variable is described by a CDF, a pmf, or a pdf. In applications, however, data usually come in groups: height and weight, treatment and outcome, first die and second die, test result and disease status, or multiple measurements from the same experiment. The correct mathematical language for this is the joint distribution. Once we have a joint distribution, we can recover marginal distributions, define conditional distributions, test independence, and update probabilities using Bayes theorem.

Tip: Main message

A joint distribution describes the complete probabilistic relationship among variables. Marginal distributions describe individual variables. Conditional distributions describe what remains after partial information is known.

4.2 Expected Values: Review

This section reviews expected value and variance because they are the basic numerical summaries used throughout joint and conditional probability.

4.2.1 Expected value

This subsection recalls the definition and interpretation of expected value for discrete and continuous random variables.

Definition (Expected value).

Let \(X\) be a random variable. If \(X\) is discrete with pmf \(p_X(k)\), then \[ \mathbb{E}[X]=\sum_{k} k\,p_X(k). \] If \(X\) is continuous with pdf \(f_X(x)\), then \[ \mathbb{E}[X]=\int_{-\infty}^{\infty} x f_X(x)\,d x. \]

Expected value is a generalization of the idea of an average. Operationally, if we measure \(X\) in many independent trials and obtain \(X_1,\ldots,X_n\), then the sample average \[ \bar X_n=\frac{1}{n}(X_1+\cdots+X_n) \] should approach \(\mathbb{E}[X]\) as \(n\) becomes large. This is formalized later by the Law of Large Numbers.
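The convergence of the sample average can be illustrated with a small simulation. The sketch below assumes \(X\) is a fair six-sided die roll (whose expected value is \(7/2\)); the function name `sample_average` and the fixed seed are illustrative choices, not part of the text.

```python
import random

def sample_average(n, seed=0):
    """Average of n simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rolls = [rng.randint(1, 6) for _ in range(n)]
    return sum(rolls) / n

# The averages should drift toward E[X] = 3.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, sample_average(n))
```

The exact printed values depend on the random number generator; only the trend toward \(3.5\) matters.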

Proposition: Linearity for one random variable

For constants \(a,b\in\mathbb{R}\), \[ \mathbb{E}[aX+b]=a\,\mathbb{E}[X]+b. \]

Example (Bernoulli expected value).

Let \(X\sim\operatorname{Bernoulli}(p)\), so \[ \mathbb{P}(X=0)=1-p,\qquad \mathbb{P}(X=1)=p. \] Find \(\mathbb{E}[X]\).

Using the discrete expectation formula, \[ \mathbb{E}[X]=\sum_k k\,p_X(k)=0\cdot(1-p)+1\cdot p=p. \] Thus the expected value of a Bernoulli random variable is its success probability: \[ \mathbb{E}[X]=p. \]

Example (Expected outcome of a fair die).

Let \(X\) be the outcome of rolling a fair six-sided die. Find \(\mathbb{E}[X]\).

The pmf is \(p_X(k)=1/6\) for \(k=1,2,3,4,5,6\). Therefore \[ \mathbb{E}[X]=\sum_{k=1}^{6} k\cdot\frac16=\frac{1+2+3+4+5+6}{6}=\frac{21}{6}=\frac72. \] So the long-run average die value is \[ \mathbb{E}[X]=3.5. \]

4.2.2 Variance and standard deviation

This subsection reviews how variance and standard deviation measure the spread of a random variable around its mean.

Definition (Variance and standard deviation).

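The variance of a random variable \(X\) is \[ \operatorname{Var}(X)=\mathbb{E}\big[(X-\mathbb{E}[X])^2\big]. \] The standard deviation is \[ \operatorname{SD}(X)=\sqrt{\operatorname{Var}(X)}. \]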

A useful computational formula is \[ \operatorname{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2. \] For constants \(a,b\in\mathbb{R}\), \[ \operatorname{Var}(aX+b)=a^2\operatorname{Var}(X). \] Adding a constant changes location but not spread; multiplying by \(a\) scales spread by \(a^2\) in variance and by \(|a|\) in standard deviation.

Example (Variance of a Bernoulli random variable).

Let \(X\sim\operatorname{Bernoulli}(p)\). Find \(\operatorname{Var}(X)\).

Since \(X\) only takes values \(0\) and \(1\), we have \(X^2=X\). Thus \[ \mathbb{E}[X^2]=\mathbb{E}[X]=p. \] Using the computational formula, \[ \operatorname{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2=p-p^2=p(1-p). \] Therefore \[ \operatorname{Var}(X)=p(1-p). \]

Practice Problem (Review: expectation and variance).

Suppose \(X\) takes values \(-1,0,2\) with probabilities \(1/4,1/2,1/4\), respectively. Find \(\mathbb{E}[X]\) and \(\operatorname{Var}(X)\).

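First compute the mean: \[ \mathbb{E}[X]=(-1)\cdot\frac14+0\cdot\frac12+2\cdot\frac14=-\frac14+\frac12=\frac14. \] Next compute the second moment: \[ \mathbb{E}[X^2]=(-1)^2\cdot\frac14+0^2\cdot\frac12+2^2\cdot\frac14=\frac14+1=\frac54. \] Therefore \[ \operatorname{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2=\frac54-\left(\frac14\right)^2=\frac54-\frac1{16}=\frac{19}{16}. \] So \[ \mathbb{E}[X]=\frac14,\qquad \operatorname{Var}(X)=\frac{19}{16}. \]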
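The same computation can be checked with exact rational arithmetic. This is a minimal sketch; the helper name `mean_and_variance` is a hypothetical choice, and it uses the computational formula \(\operatorname{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2\).

```python
from fractions import Fraction

def mean_and_variance(pmf):
    """pmf maps each value to its probability; returns (E[X], Var(X))."""
    assert sum(pmf.values()) == 1, "probabilities must sum to 1"
    mean = sum(k * p for k, p in pmf.items())
    second_moment = sum(k**2 * p for k, p in pmf.items())
    return mean, second_moment - mean**2

# The practice problem: values -1, 0, 2 with probabilities 1/4, 1/2, 1/4.
pmf = {-1: Fraction(1, 4), 0: Fraction(1, 2), 2: Fraction(1, 4)}
mean, var = mean_and_variance(pmf)
print(mean, var)  # 1/4 19/16
```

Using `Fraction` avoids floating-point rounding, so the output can be compared directly with the hand calculation.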

4.3 Joint Distributions

This section introduces the joint distribution, which is the main object for studying two or more random variables together.

4.3.1 Joint CDF, joint pmf, and joint pdf

This subsection defines the joint distribution for two random variables in both discrete and continuous cases.

Definition (Joint CDF).

For two random variables \(X\) and \(Y\), the joint cumulative distribution function is \[ F_{X,Y}(x,y)=\mathbb{P}(X\le x,\,Y\le y). \]

If \(X\) and \(Y\) are both discrete, then their joint pmf is \[ p_{X,Y}(x,y)=\mathbb{P}(X=x,Y=y), \] for all \(x\in\operatorname{Range}(X)\) and \(y\in\operatorname{Range}(Y)\). It must satisfy \[ p_{X,Y}(x,y)\ge 0,\qquad \sum_{x}\sum_{y}p_{X,Y}(x,y)=1. \]

If \(X\) and \(Y\) are absolutely continuous, then their joint pdf \(f_{X,Y}(x,y)\) satisfies \[ f_{X,Y}(x,y)\ge 0,\qquad \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_{X,Y}(x,y)\,d x\,d y=1. \] When the joint CDF is differentiable, \[ f_{X,Y}(x,y)=\frac{\partial^2}{\partial x\,\partial y}F_{X,Y}(x,y). \] For a region \(R\) in the \(xy\)-plane, \[ \mathbb{P}((X,Y)\in R)=\iint_R f_{X,Y}(x,y)\,d x\,d y. \]

4.3.2 Marginal distributions

This subsection explains how to recover the distribution of one variable from the joint distribution.

The marginal distribution of \(X\) is obtained by summing or integrating out \(Y\). Similarly, the marginal distribution of \(Y\) is obtained by summing or integrating out \(X\).

For discrete random variables, \[ p_X(x)=\sum_y p_{X,Y}(x,y),\qquad p_Y(y)=\sum_x p_{X,Y}(x,y). \] For continuous random variables, \[ f_X(x)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,d y,\qquad f_Y(y)=\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,d x. \]

Warning: Common mistake

When finding \(f_X(x)\), integrate with respect to \(y\). When finding \(f_Y(y)\), integrate with respect to \(x\).

4.3.3 Discrete joint distribution example: two dice

This subsection works out a full discrete joint distribution using two dice.

Example (Difference and maximum of two dice).

Roll two fair dice, and let \(i\) and \(j\) denote the two outcomes. Let \[ X=|i-j|,\qquad Y=\max(i,j). \] Find the ranges, marginal pmfs, and joint pmf table of \((X,Y)\).

The sample space has \(36\) equally likely outcomes \((i,j)\) with \(i,j\in\{1,2,3,4,5,6\}\). The range of \(X\) is \[ \operatorname{Range}(X)=\{0,1,2,3,4,5\}, \] and the range of \(Y\) is \[ \operatorname{Range}(Y)=\{1,2,3,4,5,6\}. \]

For \(X=0\), the two dice match. There are \(6\) outcomes. For \(X=d\ge 1\), the two dice differ by \(d\); there are \(2(6-d)\) ordered outcomes. Hence \[\begin{array}{c|cccccc} x&0&1&2&3&4&5\\ \hline p_X(x)&6/36&10/36&8/36&6/36&4/36&2/36 \end{array}\]

For \(Y=y\), at least one die equals \(y\) and both dice are at most \(y\). There are \[ y^2-(y-1)^2=2y-1 \] outcomes. Thus \[\begin{array}{c|cccccc} y&1&2&3&4&5&6\\ \hline p_Y(y)&1/36&3/36&5/36&7/36&9/36&11/36 \end{array}\]

The joint pmf table is \[\begin{array}{c|cccccc|c} & X=0&X=1&X=2&X=3&X=4&X=5&p_Y(y)\\ \hline Y=1&1/36&0&0&0&0&0&1/36\\ Y=2&1/36&2/36&0&0&0&0&3/36\\ Y=3&1/36&2/36&2/36&0&0&0&5/36\\ Y=4&1/36&2/36&2/36&2/36&0&0&7/36\\ Y=5&1/36&2/36&2/36&2/36&2/36&0&9/36\\ Y=6&1/36&2/36&2/36&2/36&2/36&2/36&11/36\\ \hline p_X(x)&6/36&10/36&8/36&6/36&4/36&2/36&1 \end{array}\] The entries sum to \(1\), and the row and column sums agree with the marginals.
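The table can be reproduced by brute-force enumeration of the 36 outcomes. The sketch below is one way to do this with exact fractions; the variable names `joint`, `p_X`, and `p_Y` are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import product

# Count the 36 equally likely ordered outcomes (i, j) by (difference, maximum).
counts = Counter((abs(i - j), max(i, j)) for i, j in product(range(1, 7), repeat=2))
joint = {xy: Fraction(n, 36) for xy, n in counts.items()}

# Marginals: sum the joint pmf over the other variable.
p_X = defaultdict(Fraction)
p_Y = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_X[x] += p
    p_Y[y] += p

print(sorted(p_X.items()))
print(sorted(p_Y.items()))
```

Pairs that never occur, such as \((X,Y)=(1,1)\), are simply absent from `joint`, which corresponds to the zero entries in the table. Note that `Fraction` prints values in lowest terms, so \(6/36\) appears as \(1/6\).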

Example (Same marginals, different joint pmf).

Show that marginal distributions alone do not determine the joint distribution.

Let \(X,Y\in\{0,1\}\), and suppose both marginals satisfy \[ \mathbb{P}(X=0)=\mathbb{P}(X=1)=\frac12,\qquad \mathbb{P}(Y=0)=\mathbb{P}(Y=1)=\frac12. \] Consider the two joint pmf tables \[\begin{array}{c|cc} &Y=0&Y=1\\ \hline X=0&1/2&0\\ X=1&0&1/2 \end{array} \qquad\text{and}\qquad \begin{array}{c|cc} &Y=0&Y=1\\ \hline X=0&1/4&1/4\\ X=1&1/4&1/4 \end{array}\] Both tables have the same marginal distributions for \(X\) and \(Y\). However, the dependence structure is different. In the first table, \(Y=X\) always. In the second table, \(X\) and \(Y\) are independent. Therefore marginals alone do not determine the joint distribution.

4.3.4 Continuous joint distribution example

This subsection gives a full continuous example with normalization, a region probability, a marginal density, and a one-dimensional probability.

Example (A joint pdf on the unit square).

A joint pdf is defined by \[ f(x,y)=6xy^2,\qquad 0<x<1,\ 0<y<1, \] and \(f(x,y)=0\) otherwise.

  • Check that it is a valid joint pdf.
  • Calculate \(\mathbb{P}(X+Y\ge 1)\).
  • Calculate the marginal pdf of \(X\), \(f_X(x)\).
  • Calculate \(\mathbb{P}\left(\frac12<X<\frac34\right)\).

(1) Validity. Clearly \(f(x,y)\ge 0\) on its support. Also, \[ \int_0^1\int_0^1 6xy^2\,d x\,d y=\int_0^1 3y^2\,d y=\left.y^3\right|_0^1=1. \] Thus \(f\) is a valid joint pdf.

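(2) Region probability. The event \(X+Y\ge 1\) corresponds to \(0<x<1\) and \(1-x\le y<1\). Hence \[ \mathbb{P}(X+Y\ge 1)=\int_0^1\int_{1-x}^{1}6xy^2\,d y\,d x. \] Compute the inner integral: \[ \int_{1-x}^{1}6xy^2\,d y=2x\left(1-(1-x)^3\right). \] Therefore \[ \mathbb{P}(X+Y\ge 1)=\int_0^1 2x\left(1-(1-x)^3\right)d x=1-2\int_0^1 x(1-x)^3\,d x=1-\frac{1}{10}=\frac{9}{10}. \]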

(3) Marginal of \(X\). \[ f_X(x)=\int_0^1 6xy^2\,d y=6x\cdot\frac13=2x,\qquad 0<x<1. \]

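(4) Probability for \(X\). \[ \mathbb{P}\left(\frac12<X<\frac34\right)=\int_{1/2}^{3/4}2x\,d x=\left.x^2\right|_{1/2}^{3/4}=\frac{9}{16}-\frac14=\frac{5}{16}. \] Thus \[ \mathbb{P}\left(\frac12<X<\frac34\right)=\frac{5}{16}. \]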
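The three integrals in this example can be sanity-checked numerically. The following is a rough sketch using a midpoint Riemann sum on a grid over the unit square; the grid size `n=400` and the helper name `midpoint_sum` are arbitrary illustrative choices, and the results are approximations rather than exact values.

```python
def f(x, y):
    """Joint pdf 6*x*y^2 on the open unit square, 0 elsewhere."""
    return 6 * x * y**2 if 0 < x < 1 and 0 < y < 1 else 0.0

def midpoint_sum(g, n=400):
    """Midpoint-rule approximation of the integral of g over the unit square."""
    h = 1.0 / n
    mids = [(i + 0.5) * h for i in range(n)]
    return sum(g(x, y) for x in mids for y in mids) * h * h

# Total mass: should be close to 1.
total = midpoint_sum(f)
# P(X + Y >= 1): should be close to 9/10.
p_region = midpoint_sum(lambda x, y: f(x, y) if x + y >= 1 else 0.0)
# P(1/2 < X < 3/4): should be close to 5/16.
p_strip = midpoint_sum(lambda x, y: f(x, y) if 0.5 < x < 0.75 else 0.0)
print(total, p_region, p_strip)
```

Restricting the integrand with an indicator, as in the two lambdas, is a simple way to integrate over a non-rectangular region without changing the limits of integration.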

Practice Problem (Marginal density practice).

Suppose \[ f(x,y)=c(x+y),\qquad 0<x<1,\ 0<y<1, \] and \(0\) otherwise. Find \(c\), \(f_X(x)\), and \(\mathbb{P}(X\le 1/2)\).

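Normalize: \[ 1=\int_0^1\int_0^1 c(x+y)\,d y\,d x=c\int_0^1\left(x+\frac12\right)d x=c\left(\frac12+\frac12\right)=c. \] So \(c=1\). Then \[ f_X(x)=\int_0^1(x+y)\,d y=x+\frac12,\qquad 0<x<1. \] Finally, \[ \mathbb{P}(X\le 1/2)=\int_0^{1/2}\left(x+\frac12\right)d x=\left.\left(\frac{x^2}{2}+\frac{x}{2}\right)\right|_0^{1/2}=\frac18+\frac14=\frac38. \]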

4.4 Multivariate Normal Distribution

This section introduces the most important multivariate continuous distribution: the multivariate normal distribution.

4.4.1 Definition and geometry

This subsection defines the multivariate normal density and explains the roles of the mean vector and covariance matrix.

Definition (Multivariate normal distribution).

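Let \[ \vec X=\begin{pmatrix}X_1\\ \vdots\\ X_d\end{pmatrix}. \] We say \[ \vec X\sim\operatorname{Normal}(\vec\mu,\Sigma) \] if \(\vec\mu\in\mathbb{R}^d\) and \(\Sigma\) is a \(d\times d\) symmetric positive definite matrix, and the joint pdf is \[ f_{\vec X}(\vec x)=\frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}}\exp\left(-\frac12(\vec x-\vec\mu)^{\top}\Sigma^{-1}(\vec x-\vec\mu)\right). \]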

The mean vector is \[ \mathbb{E}[\vec X]=\vec\mu, \] and the covariance matrix is \[ \operatorname{Cov}(\vec X)=\Sigma. \] The matrix \(\Sigma\) controls the spread and orientation of the density. A diagonal covariance matrix produces contours aligned with the coordinate axes; a covariance matrix with nonzero off-diagonal entries produces tilted elliptical contours.

Note: Visual idea

Multivariate normal contours are ellipses. Diagonal covariance gives axis-aligned contours, while nonzero covariance tilts the contours.
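To make the role of \(\Sigma\) concrete, the sketch below evaluates the multivariate normal density in the special case \(d=2\), where the inverse and determinant of \(\Sigma\) can be written out by hand. The function name `bivariate_normal_pdf` and the example covariance entries are hypothetical choices for illustration.

```python
import math

def bivariate_normal_pdf(x, mu, sigma):
    """Density of a 2-dimensional normal at the point x.

    x and mu are pairs; sigma is a 2x2 symmetric positive definite
    matrix given as ((s11, s12), (s21, s22)).
    """
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    if det <= 0 or s11 <= 0:
        raise ValueError("sigma must be positive definite")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), using the 2x2 inverse.
    quad = (s22 * dx * dx - (s12 + s21) * dx * dy + s11 * dy * dy) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

identity = ((1.0, 0.0), (0.0, 1.0))
tilted = ((1.0, 0.8), (0.8, 1.0))  # positive off-diagonal entries
print(bivariate_normal_pdf((1, 1), (0, 0), identity),
      bivariate_normal_pdf((1, -1), (0, 0), identity))
print(bivariate_normal_pdf((1, 1), (0, 0), tilted),
      bivariate_normal_pdf((1, -1), (0, 0), tilted))
```

With the identity covariance the two points have equal density, since the contours are circles. With positive covariance the point \((1,1)\) has higher density than \((1,-1)\), reflecting contours tilted along the line \(y=x\).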

4.4.2 Covariance and correlation

This subsection defines covariance and correlation as second-order summaries of the relationship between two random variables.

Definition (Covariance and correlation).

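For two random variables \(X\) and \(Y\), the covariance is \[ \operatorname{Cov}(X,Y)=\mathbb{E}\big[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])\big]=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]. \] The Pearson correlation coefficient is \[ \rho(X,Y)=\frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}. \] It satisfies \[ -1\le\rho(X,Y)\le 1. \]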

Covariance has units depending on \(X\) and \(Y\), while correlation is dimensionless. Positive correlation means that large values of \(X\) tend to occur with large values of \(Y\); negative correlation means that large values of \(X\) tend to occur with small values of \(Y\).

Example (Covariance from a small joint table).

Let \(X,Y\in\{0,1\}\) with joint pmf \[\begin{array}{c|cc} &Y=0&Y=1\\ \hline X=0&0.40&0.10\\ X=1&0.10&0.40 \end{array}\] Find \(\operatorname{Cov}(X,Y)\).

The marginals are \[ \mathbb{P}(X=1)=0.10+0.40=0.50,\qquad \mathbb{P}(Y=1)=0.10+0.40=0.50. \] Thus \(\mathbb{E}[X]=\mathbb{E}[Y]=0.5\). Also, since \(XY=1\) only when \(X=1,Y=1\), \[ \mathbb{E}[XY]=\mathbb{P}(X=1,Y=1)=0.40. \] Therefore \[ \operatorname{Cov}(X,Y)=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]=0.40-(0.50)(0.50)=0.15. \] The covariance is positive.
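The covariance in this example, and the corresponding correlation, can be computed directly from the joint table. The sketch below stores the table as a dictionary of exact fractions; the helper `expect` is a hypothetical name for the expectation of a function of \((X,Y)\).

```python
from fractions import Fraction
import math

# Joint pmf table: keys are (x, y), values are probabilities.
joint = {(0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
         (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5)}

def expect(g):
    """E[g(X, Y)] under the joint pmf above."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
cov = expect(lambda x, y: x * y) - ex * ey
var_x = expect(lambda x, y: x * x) - ex**2
var_y = expect(lambda x, y: y * y) - ey**2
rho = float(cov) / math.sqrt(var_x * var_y)
print(cov, rho)  # cov is exactly 3/20; rho is approximately 0.6
```

Here \(\operatorname{Var}(X)=\operatorname{Var}(Y)=1/4\), so the correlation is \(0.15/0.25=0.6\), a dimensionless summary of the same positive association.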

4.5 Conditional Probability and Conditional Distributions

This section formalizes the idea of updating probabilities after some information has been observed.

4.5.1 Conditional probability for events

This subsection recalls the event-level definition from Section 1.

Definition (Conditional probability).

If \(\mathbb{P}(B)>0\), then the probability of \(A\) given \(B\) is \[ \mathbb{P}(A\mid B)=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}. \] When \(B\) is fixed, \(\mathbb{P}(\cdot\mid B)\) is another probability measure.

4.5.2 Conditional distributions

This subsection extends conditional probability from events to random variables.

For two discrete random variables \(X\) and \(Y\), the conditional pmf of \(Y\) given \(X=x\) is \[ p_{Y\mid X}(y\mid x)=\frac{p_{X,Y}(x,y)}{p_X(x)}, \] provided \(p_X(x)>0\). In the continuous case the same formula is written using densities: \[ f_{Y\mid X}(y\mid x)=\frac{f_{X,Y}(x,y)}{f_X(x)}, \] provided \(f_X(x)>0\).

Example (Triangle uniform distribution).

Let \((X,Y)\) be uniformly distributed on the triangular region \[ D=\{(x,y):x\ge 0,\ y\ge 0,\ x+y\le 1\}. \] The joint pdf is \[ f(x,y)=\begin{cases} 2, & (x,y)\in D,\\ 0, & \text{otherwise}. \end{cases} \] Find \(f_X(x)\) and \(f_{Y\mid X}(y\mid x)\).

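For a fixed \(x\in[0,1]\), the variable \(y\) ranges from \(0\) to \(1-x\). Hence \[ f_X(x)=\int_0^{1-x}2\,d y=2(1-x),\qquad 0\le x\le 1. \] Therefore, for \(0\le x<1\) (so that \(f_X(x)>0\)), \[ f_{Y\mid X}(y\mid x)=\frac{f(x,y)}{f_X(x)}=\frac{2}{2(1-x)}=\frac{1}{1-x}, \] for \(0\le y\le 1-x\). Thus, conditional on \(X=x\), \(Y\) is uniform on \([0,1-x]\): \[ Y\mid X=x\sim\operatorname{Uniform}(0,1-x). \]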

Example (Beta-Bernoulli joint distribution).

The Beta-Bernoulli model is a basic Bayesian model. Suppose \[ Y\sim\operatorname{Beta}(\alpha,\beta) \] and, conditional on \(Y=y\), \[ X\mid Y=y\sim\operatorname{Bernoulli}(y). \] Find the joint distribution \(p(x,y)\) and identify the posterior distribution of \(Y\) given \(X=x\).

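The conditional pmf of \(X\) given \(Y=y\) is \[ p(x\mid y)=y^x(1-y)^{1-x},\qquad x\in\{0,1\}. \] The density of \(Y\) is \[ f_Y(y)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,y^{\alpha-1}(1-y)^{\beta-1},\qquad 0<y<1. \] Thus the joint distribution is \[ p(x,y)=p(x\mid y)f_Y(y)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,y^{\alpha+x-1}(1-y)^{\beta+1-x-1}. \] As a function of \(y\) with \(x\) fixed, this has the kernel of a beta density: \[ y^{(\alpha+x)-1}(1-y)^{(\beta+1-x)-1}. \] Therefore \[ Y\mid X=x\sim\operatorname{Beta}(\alpha+x,\ \beta+1-x). \] This is why the beta distribution is called a conjugate prior for the Bernoulli model.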

Practice Problem (Conditional density practice).

Let \[ f(x,y)=8xy,\qquad 0<y<x<1, \] and \(0\) otherwise. Find \(f_X(x)\) and \(f_{Y\mid X}(y\mid x)\).

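For a fixed \(x\in(0,1)\), \(y\) ranges from \(0\) to \(x\). Thus \[ f_X(x)=\int_0^x 8xy\,d y=8x\cdot\frac{x^2}{2}=4x^3,\qquad 0<x<1. \] Then \[ f_{Y\mid X}(y\mid x)=\frac{8xy}{4x^3}=\frac{2y}{x^2},\qquad 0<y<x. \] The conditional density integrates to one: \[ \int_0^x \frac{2y}{x^2}\,d y=1. \]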

4.6 Independence

This section explains independence for events and for random variables, and shows how independence simplifies joint distributions and expectations.

4.6.1 Independence for events

This subsection recalls the event-level definition of independence.

Definition (Independent events).

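Events \(A\) and \(B\) are independent if \[ \mathbb{P}(A\cap B)=\mathbb{P}(A)\mathbb{P}(B). \] If \(\mathbb{P}(A)>0\) and \(\mathbb{P}(B)>0\), this is equivalent to \[ \mathbb{P}(A\mid B)=\mathbb{P}(A)\quad\text{and}\quad \mathbb{P}(B\mid A)=\mathbb{P}(B). \]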

The interpretation is that knowing one event occurred does not change the probability of the other event.

4.6.2 Independent random variables

This subsection defines independence using joint and marginal distributions.

Definition (Independent random variables).

Discrete random variables \(X\) and \(Y\) are independent if \[ p_{X,Y}(x,y)=p_X(x)p_Y(y) \] for all possible \(x\) and \(y\). Continuous random variables \(X\) and \(Y\) are independent if \[ f_{X,Y}(x,y)=f_X(x)f_Y(y) \] for all \(x,y\).

Similarly, \(X_1,\ldots,X_n\) are mutually independent if their joint pmf/pdf factors into the product of their marginal pmfs/pdfs.

Example (Independent coin tosses).

Toss a biased coin twice. Let \[ X=\begin{cases}1,&\text{first toss is Heads},\\0,&\text{first toss is Tails},\end{cases} \qquad Y=\begin{cases}1,&\text{second toss is Heads},\\0,&\text{second toss is Tails}. \end{cases} \] Suppose \(\mathbb{P}(\text{Heads})=p\). Show that \(X\) and \(Y\) are independent.

Because the two tosses are independent, the probability of any pair of outcomes factors. For example, \[ \mathbb{P}(X=1,Y=0)=\mathbb{P}(\text{Heads, then Tails})=p(1-p). \] Also, \[ \mathbb{P}(X=1)=p,\qquad \mathbb{P}(Y=0)=1-p. \] Hence \[ \mathbb{P}(X=1,Y=0)=\mathbb{P}(X=1)\mathbb{P}(Y=0). \] The same check holds for the other three pairs \((0,0),(0,1),(1,1)\). Therefore \(X\) and \(Y\) are independent.

Example (Checking dependence in the two-dice example).

In the two-dice example, \(X\) is the absolute difference and \(Y\) is the maximum. Are \(X\) and \(Y\) independent?

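Use one entry of the joint table. We have \[ p_{X,Y}(0,1)=\frac{1}{36}. \] The marginals are \[ p_X(0)=\frac{6}{36}=\frac16,\qquad p_Y(1)=\frac{1}{36}. \] If \(X\) and \(Y\) were independent, then \[ p_{X,Y}(0,1)=p_X(0)p_Y(1)=\frac16\cdot\frac1{36}=\frac{1}{216}. \] But \(1/36\ne 1/216\). Therefore \[ X\text{ and }Y\text{ are not independent.} \]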
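The single-entry check generalizes to a test over every cell of a joint table. The following sketch assumes the joint pmf is stored as a dictionary keyed by \((x,y)\); the function name `is_independent` is hypothetical. For contrast, it also tests the pair formed by the two die faces themselves, which should be independent.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def is_independent(joint, xs, ys):
    """True if p(x, y) == p_X(x) * p_Y(y) for every pair in xs x ys."""
    p_x, p_y = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x, y in product(xs, ys))

diff_max = defaultdict(Fraction)   # (|i - j|, max(i, j)), as in the example
two_faces = defaultdict(Fraction)  # (i, j): the two dice themselves
for i, j in product(range(1, 7), repeat=2):
    diff_max[(abs(i - j), max(i, j))] += Fraction(1, 36)
    two_faces[(i, j)] += Fraction(1, 36)

print(is_independent(diff_max, range(6), range(1, 7)))      # False
print(is_independent(two_faces, range(1, 7), range(1, 7)))  # True
```

One failing cell is enough to rule out independence, but establishing independence requires every cell to factor, which is why the function checks the full product of ranges.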

4.6.3 Checking independence by factorization

This subsection gives a useful theorem for checking independence without explicitly computing both marginals first.

Theorem: Factorization criterion

Let \((X,Y)\) have joint pmf or pdf \(f(x,y)\). Then \(X\) and \(Y\) are independent if and only if there exist nonnegative functions \(g(x)\) and \(h(y)\) such that \[ f(x,y)=g(x)h(y) \] for every \((x,y)\) in the support, and the support itself factors as a product of a set in \(x\) and a set in \(y\).

Example (Factorized joint density).

Suppose a joint density has the form, up to a normalizing constant, \[ f(x,y)\propto x^2y^4e^{-x/2}e^{-y},\qquad x>0,\ y>0. \] Determine whether \(X\) and \(Y\) are independent.

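We can write \[ f(x,y)\propto\left(x^2e^{-x/2}\right)\left(y^4e^{-y}\right)=g(x)h(y), \] where \[ g(x)=x^2e^{-x/2},\qquad h(y)=y^4e^{-y}. \] Any normalizing constant can be absorbed into \(g\) or \(h\), so it does not affect the factorization. The support is also a product region: \(x>0\) and \(y>0\). Hence the joint density factors into a function of \(x\) times a function of \(y\), so \(X\) and \(Y\) are independent.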

Warning: Support matters

A formula may look factorized, but if the support couples \(x\) and \(y\), the variables may not be independent. For example, \(f(x,y)=2\) on \(0<x<1\), \(0<y<1-x\) does not have a product support.

4.6.4 Independence and information

This subsection explains why independence is essential in likelihood-based statistical inference.

Suppose we observe \(X_1,\ldots,X_n\in\{0,1\}\), where the observations are independent and identically distributed as \(\operatorname{Bernoulli}(\theta)\). Then the joint pmf factors as \[ p(x_1,\ldots,x_n;\theta)=p(x_1;\theta)\,p(x_2;\theta)\cdots p(x_n;\theta). \] Taking logarithms gives \[ \log p(x_1,\ldots,x_n;\theta)=\log p(x_1;\theta)+\cdots+\log p(x_n;\theta). \] Thus the log-likelihood is a sum of individual log-likelihood contributions: \[ \ell(\theta;x_1,\ldots,x_n)=\ell(\theta;x_1)+\cdots+\ell(\theta;x_n). \] This is the statistical meaning of the phrase: total information equals the sum of individual information.
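The identity between the log of the joint pmf and the sum of individual log terms can be verified numerically. This is a small sketch for the Bernoulli model; the data `xs` and the value of `theta` are made up for illustration.

```python
import math

def bernoulli_pmf(x, theta):
    """p(x; theta) for x in {0, 1}."""
    return theta if x == 1 else 1 - theta

def log_likelihood(theta, xs):
    """Sum of the individual log-likelihood contributions."""
    return sum(math.log(bernoulli_pmf(x, theta)) for x in xs)

xs = [1, 0, 1, 1, 0, 1]  # hypothetical observations
theta = 0.6              # hypothetical parameter value
joint = math.prod(bernoulli_pmf(x, theta) for x in xs)  # product form of the joint pmf
print(math.log(joint), log_likelihood(theta, xs))
```

The two printed numbers agree up to floating-point rounding. In practice the sum form is preferred because the product of many probabilities can underflow to zero.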

4.7 Consequences of Independence

This section collects important algebraic consequences of independence for expectation, variance, and moment generating functions.

4.7.1 Expectations of products

This subsection gives the key expectation rule for independent variables.

Theorem: Products of functions of independent random variables

If \(X\) and \(Y\) are independent, then for suitable functions \(g\) and \(h\), \[ \mathbb{E}[g(X)h(Y)]=\mathbb{E}[g(X)]\,\mathbb{E}[h(Y)]. \] In particular, \[ \mathbb{E}[XY]=\mathbb{E}[X]\,\mathbb{E}[Y]. \]

For the continuous case, \[ \mathbb{E}[g(X)h(Y)]=\iint g(x)h(y)f_{X,Y}(x,y)\,d x\,d y. \] Independence gives \(f_{X,Y}(x,y)=f_X(x)f_Y(y)\), so \[ \mathbb{E}[g(X)h(Y)]=\iint g(x)h(y)f_X(x)f_Y(y)\,d x\,d y=\left(\int g(x)f_X(x)\,d x\right)\left(\int h(y)f_Y(y)\,d y\right). \] Thus \[ \mathbb{E}[g(X)h(Y)]=\mathbb{E}[g(X)]\,\mathbb{E}[h(Y)]. \] The discrete proof is the same with sums replacing integrals.

4.7.2 Moment generating functions and sums

This subsection explains why moment generating functions are especially useful for sums of independent random variables.

Theorem: MGF of a sum of independent random variables

If \(X\) and \(Y\) are independent and \(Z=X+Y\), then \[ M_Z(t)=M_X(t)M_Y(t), \] where \(M_X(t)=\mathbb{E}[e^{tX}]\).

By definition, \[ M_Z(t)=\mathbb{E}[e^{tZ}]=\mathbb{E}[e^{t(X+Y)}]=\mathbb{E}[e^{tX}e^{tY}]. \] Since \(X\) and \(Y\) are independent, the product rule for expectations applies with \(g(x)=e^{tx}\) and \(h(y)=e^{ty}\), so \[ M_Z(t)=\mathbb{E}[e^{tX}]\,\mathbb{E}[e^{tY}]=M_X(t)M_Y(t). \]

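Theorem: Sum of independent normal random variables

Suppose \[ X\sim\operatorname{Normal}(\mu_1,\sigma_1^2),\qquad Y\sim\operatorname{Normal}(\mu_2,\sigma_2^2), \] and \(X\) and \(Y\) are independent. Then \[ X+Y\sim\operatorname{Normal}(\mu_1+\mu_2,\ \sigma_1^2+\sigma_2^2). \]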

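The MGF of a normal random variable \(X\sim\operatorname{Normal}(\mu,\sigma^2)\) is \[ M_X(t)=\exp\left(\mu t+\frac{\sigma^2t^2}{2}\right). \] Let \(Z=X+Y\). Since \(X\) and \(Y\) are independent, \[ M_Z(t)=M_X(t)M_Y(t). \] Therefore \[ M_Z(t)=\exp\left(\mu_1t+\frac{\sigma_1^2t^2}{2}\right)\exp\left(\mu_2t+\frac{\sigma_2^2t^2}{2}\right). \] Combining exponents, \[ M_Z(t)=\exp\left((\mu_1+\mu_2)t+\frac{(\sigma_1^2+\sigma_2^2)t^2}{2}\right), \] which is the MGF of \(\operatorname{Normal}(\mu_1+\mu_2,\sigma_1^2+\sigma_2^2)\). Because an MGF that is finite in a neighborhood of \(t=0\) determines the distribution, \(Z\) has this normal distribution.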

4.7.3 Linearity, covariance, and variance of sums

This subsection separates what is always true from what requires independence.

For any random variables \(X\) and \(Y\) on the same sample space, \[ \mathbb{E}[aX+bY]=a\,\mathbb{E}[X]+b\,\mathbb{E}[Y]. \] No independence is required for linearity of expectation.

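The variance of a linear combination is \[ \operatorname{Var}(aX+bY)=a^2\operatorname{Var}(X)+b^2\operatorname{Var}(Y)+2ab\operatorname{Cov}(X,Y). \] If \(X\) and \(Y\) are independent, then \[ \operatorname{Cov}(X,Y)=0, \] and therefore \[ \operatorname{Var}(aX+bY)=a^2\operatorname{Var}(X)+b^2\operatorname{Var}(Y). \]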

Warning: Zero covariance is not the same as independence

Independence implies zero covariance, provided the variances exist. The converse is not true: zero covariance does not necessarily imply independence.

Practice Problem (A zero covariance but not independent example).

Let \(X\) take values \(-1,0,1\) with probabilities \(1/3,1/3,1/3\), and define \(Y=X^2\). Show that \(\operatorname{Cov}(X,Y)=0\) but \(X\) and \(Y\) are not independent.

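We compute \[ \mathbb{E}[X]=\frac{-1+0+1}{3}=0,\qquad Y=X^2,\qquad XY=X^3. \] Thus \[ \mathbb{E}[XY]=\mathbb{E}[X^3]=\frac{(-1)^3+0^3+1^3}{3}=0. \] Hence \[ \operatorname{Cov}(X,Y)=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]=0-0\cdot\mathbb{E}[Y]=0. \] But \(X\) and \(Y\) are not independent. For example, \[ \mathbb{P}(Y=0\mid X=0)=1, \] while \[ \mathbb{P}(Y=0)=\mathbb{P}(X=0)=\frac13. \] Knowing \(X=0\) changes the probability of \(Y=0\), so \(X\) and \(Y\) are dependent.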
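The same counterexample can be checked by exact arithmetic. The sketch below uses only the pmf of \(X\), since \(Y=X^2\) is a function of \(X\); the variable names are illustrative.

```python
from fractions import Fraction

p_X = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

e_x = sum(x * p for x, p in p_X.items())
e_y = sum(x**2 * p for x, p in p_X.items())   # E[Y], since Y = X^2
e_xy = sum(x**3 * p for x, p in p_X.items())  # E[XY], since XY = X^3
cov = e_xy - e_x * e_y

# P(Y = 0) versus P(Y = 0 | X = 0) = P(X = 0, Y = 0) / P(X = 0).
p_y0 = sum(p for x, p in p_X.items() if x**2 == 0)
p_y0_given_x0 = sum(p for x, p in p_X.items() if x == 0 and x**2 == 0) / p_X[0]
print(cov, p_y0, p_y0_given_x0)  # 0 1/3 1
```

The covariance is exactly zero, yet conditioning on \(X=0\) changes the probability of \(Y=0\) from \(1/3\) to \(1\).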

4.8 Law of Total Probability and Bayes Theorem for Random Variables

This section extends total probability and Bayes theorem from events to random variables.

4.8.1 Law of total probability for random variables

This subsection shows how a marginal distribution can be obtained by averaging conditional distributions.

Theorem: Law of total probability for random variables

If \(X\) is discrete, then \[ p_Y(y)=\sum_{x'} p_{Y\mid X}(y\mid x')\,p_X(x'). \] If \(X\) is continuous, then \[ p_Y(y)=\int p_{Y\mid X}(y\mid x')\,p_X(x')\,d x'. \]

Example (Poisson-Binomial thinning).

Let \[ X\sim\operatorname{Poisson}(\lambda),\qquad Y\mid X=x\sim\operatorname{Binomial}(x,p). \] Find the marginal distribution of \(Y\).

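By the law of total probability, \[ \mathbb{P}(Y=y)=\sum_{x}\mathbb{P}(Y=y\mid X=x)\,\mathbb{P}(X=x). \] Since \(Y\mid X=x\sim\operatorname{Binomial}(x,p)\), the sum starts at \(x=y\): \[ \mathbb{P}(Y=y)=\sum_{x\ge y}\binom{x}{y}p^y(1-p)^{x-y}\,\frac{e^{-\lambda}\lambda^x}{x!}. \] Using \(\binom{x}{y}=x!/[y!(x-y)!]\), we get \[ \mathbb{P}(Y=y)=\sum_{x\ge y}\frac{e^{-\lambda}\lambda^x p^y(1-p)^{x-y}}{y!\,(x-y)!}. \] Let \(k=x-y\). Then \[ \mathbb{P}(Y=y)=\frac{e^{-\lambda}(\lambda p)^y}{y!}\sum_{k=0}^{\infty}\frac{(\lambda(1-p))^k}{k!}. \] The sum is \(e^{\lambda(1-p)}\), so \[ \mathbb{P}(Y=y)=\frac{e^{-\lambda}(\lambda p)^y}{y!}\,e^{\lambda(1-p)}=\frac{e^{-\lambda p}(\lambda p)^y}{y!}. \] Therefore \[ Y\sim\operatorname{Poisson}(\lambda p). \]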
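The thinning result can be checked numerically by evaluating the total-probability sum directly. The sketch below truncates the infinite sum at `x_max=100`, which is an assumption that is adequate for the small, made-up parameter values used here but not for large \(\lambda\).

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def binomial_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def marginal_pmf_Y(y, lam, p, x_max=100):
    """Law of total probability, truncating the sum over x at x_max."""
    return sum(binomial_pmf(y, x, p) * poisson_pmf(x, lam) for x in range(y, x_max + 1))

lam, p = 4.0, 0.3  # hypothetical parameter values
for y in range(5):
    # The two columns should agree: the second is the Poisson(lam * p) pmf.
    print(y, marginal_pmf_Y(y, lam, p), poisson_pmf(y, lam * p))
```

The two columns agree to within floating-point error, which matches the closed-form derivation.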

4.8.2 Bayes theorem for random variables

This subsection gives Bayes theorem in pmf/pdf notation.

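Theorem: Bayes theorem for random variables

For random variables \(X\) and \(Y\), and any \(y\) with \(p_Y(y)>0\), \[ p_{X\mid Y}(x\mid y)=\frac{p_{X,Y}(x,y)}{p_Y(y)}=\frac{p_{Y\mid X}(y\mid x)\,p_X(x)}{p_Y(y)}. \] If \(X\) is discrete, then \[ p_{X\mid Y}(x\mid y)=\frac{p_{Y\mid X}(y\mid x)\,p_X(x)}{\sum_{x'}p_{Y\mid X}(y\mid x')\,p_X(x')}. \] If \(X\) is continuous, then \[ p_{X\mid Y}(x\mid y)=\frac{p_{Y\mid X}(y\mid x)\,p_X(x)}{\int p_{Y\mid X}(y\mid x')\,p_X(x')\,d x'}. \]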

Example (Bayesian update with a beta prior).

Suppose \(Y\sim\operatorname{Beta}(\alpha,\beta)\) and \(X\mid Y=y\sim\operatorname{Bernoulli}(y)\). Use Bayes theorem to find \(Y\mid X=1\).

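By Bayes theorem, \[ f_{Y\mid X}(y\mid 1)\propto\mathbb{P}(X=1\mid Y=y)\,f_Y(y). \] Now \[ \mathbb{P}(X=1\mid Y=y)=y,\qquad f_Y(y)\propto y^{\alpha-1}(1-y)^{\beta-1}. \] Therefore \[ f_{Y\mid X}(y\mid 1)\propto y^{\alpha}(1-y)^{\beta-1}. \] This is the kernel of \(\operatorname{Beta}(\alpha+1,\beta)\), so \[ Y\mid X=1\sim\operatorname{Beta}(\alpha+1,\beta). \] Similarly, if \(X=0\), then \(Y\mid X=0\sim\operatorname{Beta}(\alpha,\beta+1)\).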
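The conjugate update can be compared against a direct application of Bayes theorem. The sketch below assumes hypothetical prior parameters \(\alpha=2\), \(\beta=3\) and uses the standard fact that the normalizer for \(X=1\) is \(\mathbb{P}(X=1)=\mathbb{E}[Y]=\alpha/(\alpha+\beta)\); the function names are illustrative.

```python
import math

def beta_pdf(y, a, b):
    """Beta(a, b) density at 0 < y < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * y**(a - 1) * (1 - y)**(b - 1)

def posterior_params(alpha, beta, x):
    """Parameters of Y | X = x for a Beta(alpha, beta) prior and x in {0, 1}."""
    return alpha + x, beta + 1 - x

alpha, beta = 2.0, 3.0  # hypothetical prior parameters
y = 0.4                 # point at which to compare the two densities
# Bayes theorem for X = 1: likelihood y times prior, divided by P(X = 1).
by_bayes = y * beta_pdf(y, alpha, beta) / (alpha / (alpha + beta))
# Conjugacy shortcut: evaluate the Beta(alpha + 1, beta) density directly.
by_conjugacy = beta_pdf(y, *posterior_params(alpha, beta, 1))
print(by_bayes, by_conjugacy)
```

Both expressions evaluate the same posterior density, so the shortcut of adding the observation to the parameters replaces the explicit normalization.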

Practice Problem (Total probability practice).

Suppose \(X\sim\operatorname{Bernoulli}(q)\) and \[ Y\mid X=0\sim\operatorname{Bernoulli}(p_0),\qquad Y\mid X=1\sim\operatorname{Bernoulli}(p_1). \] Find \(\mathbb{P}(Y=1)\).

Use total probability over the two possible values of \(X\): \[ \mathbb{P}(Y=1)=\mathbb{P}(Y=1\mid X=0)\,\mathbb{P}(X=0)+\mathbb{P}(Y=1\mid X=1)\,\mathbb{P}(X=1). \] Since \(\mathbb{P}(X=1)=q\) and \(\mathbb{P}(X=0)=1-q\), \[ \mathbb{P}(Y=1)=p_0(1-q)+p_1q. \]

4.9 Conditional Independence

This section introduces conditional independence, a central concept in Bayesian statistics, causal inference, and graphical models.

4.9.1 Definition and equivalent forms

This subsection defines conditional independence and lists several equivalent factorization properties.

Definition (Conditional independence).

Random variables \(X\) and \(Y\) are conditionally independent given \(Z\) if \[ p_{X,Y\mid Z}(x,y\mid z)=p_{X\mid Z}(x\mid z)\,p_{Y\mid Z}(y\mid z) \] for all relevant \(x,y,z\).

Theorem: Equivalent forms

The following statements are equivalent whenever the conditional densities are defined, and each one characterizes conditional independence of \(X\) and \(Y\) given \(Z\):

  • \[ p_{X,Y\mid Z}(x,y\mid z)=p_{X\mid Z}(x\mid z)\,p_{Y\mid Z}(y\mid z). \]
  • \[ p_{X\mid Y,Z}(x\mid y,z)=p_{X\mid Z}(x\mid z). \]
  • \[ p_{X,Y,Z}(x,y,z)=\frac{p_{X,Z}(x,z)\,p_{Y,Z}(y,z)}{p_Z(z)}. \]
  • There exist functions \(g\) and \(h\) such that \[ p_{X,Y,Z}(x,y,z)=g(x,z)h(y,z). \]

The key interpretation is: after \(Z\) is known, learning \(Y\) gives no additional information about \(X\).

Warning: Independence versus conditional independence

Independence and conditional independence are different concepts. Independence does not always imply conditional independence, and conditional independence does not always imply independence.

4.9.2 Unit disk example

This subsection gives a full example involving a joint density on a non-rectangular support.

Example (Uniform distribution on the unit disk).

Let \((X,Y)\) be uniformly distributed over the unit disk \[ D=\{(x,y):x^2+y^2\le 1\}. \] The joint pdf is \[ f_{X,Y}(x,y)=\begin{cases} c, & (x,y)\in D,\\ 0, & \text{otherwise}. \end{cases} \]

  • Find \(c\).
  • Find the marginal pdfs \(f_X(x)\) and \(f_Y(y)\).
  • Find the conditional pdf \(f_{X\mid Y}(x\mid y)\) for \(-1\le y\le1\).
  • Are \(X\) and \(Y\) independent?

(a) Find \(c\). Since the area of the unit disk is \(\pi\), \[ 1=\iint_D c\,d x\,d y=c\pi. \] Thus \[ c=\frac{1}{\pi}. \]

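(b) Find the marginals. For fixed \(x\in[-1,1]\), \(y\) ranges from \(-\sqrt{1-x^2}\) to \(\sqrt{1-x^2}\). Hence \[ f_X(x)=\int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}}\frac{1}{\pi}\,d y=\frac{2\sqrt{1-x^2}}{\pi},\qquad -1\le x\le 1. \] By symmetry, \[ f_Y(y)=\frac{2\sqrt{1-y^2}}{\pi},\qquad -1\le y\le 1. \]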

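(c) Find the conditional pdf. For fixed \(y\) with \(-1<y<1\) (so that \(f_Y(y)>0\)), \(x\) ranges from \(-\sqrt{1-y^2}\) to \(\sqrt{1-y^2}\). Therefore \[ f_{X\mid Y}(x\mid y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}=\frac{1/\pi}{2\sqrt{1-y^2}/\pi}=\frac{1}{2\sqrt{1-y^2}}, \] for \[ -\sqrt{1-y^2}\le x\le\sqrt{1-y^2}. \] Thus \[ X\mid Y=y\sim\operatorname{Uniform}\left(-\sqrt{1-y^2},\ \sqrt{1-y^2}\right). \]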

(d) Independence. If \(X\) and \(Y\) were independent, we would have \[ f_{X,Y}(x,y)=f_X(x)f_Y(y). \] But inside the disk, \[ f_{X,Y}(x,y)=\frac{1}{\pi}, \] whereas \[ f_X(x)f_Y(y)=\frac{4}{\pi^2}\sqrt{(1-x^2)(1-y^2)}, \] which is not constant on the disk. Also, \(f_{X\mid Y}(x\mid y)\) depends on \(y\). Therefore \[ X\text{ and }Y\text{ are not independent.} \]
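The failure of factorization can be seen by evaluating both sides at a couple of points. The sketch below hard-codes the densities derived in parts (a) and (b); the test points are arbitrary illustrative choices.

```python
import math

def f_joint(x, y):
    """Uniform density on the unit disk."""
    return 1 / math.pi if x * x + y * y <= 1 else 0.0

def f_marginal(t):
    """Marginal density of either coordinate: 2*sqrt(1 - t^2)/pi on [-1, 1]."""
    return 2 * math.sqrt(1 - t * t) / math.pi if -1 <= t <= 1 else 0.0

# At the center the product of marginals is 4/pi^2, not 1/pi.
print(f_joint(0, 0), f_marginal(0) * f_marginal(0))
# At (0.9, 0.9), outside the disk, the joint is 0 but the product is positive.
print(f_joint(0.9, 0.9), f_marginal(0.9) * f_marginal(0.9))
```

The second comparison reflects the support argument: the disk is not a product region, so points in the square \([-1,1]^2\) but outside the disk have positive marginal densities and zero joint density.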

4.10 Summary

This section summarizes the main formulas and conceptual points from joint and conditional probability.

Tip: Core formulas

\[\begin{align*} F_{X,Y}(x,y)&=\mathbb{P}(X\le x,Y\le y),\\ p_X(x)&=\sum_y p_{X,Y}(x,y),\\ f_X(x)&=\int f_{X,Y}(x,y)\,d y,\\ p_{Y\mid X}(y\mid x)&=\frac{p_{X,Y}(x,y)}{p_X(x)},\\ p_Y(y)&=\sum_x p_{Y\mid X}(y\mid x)p_X(x),\\ p_{X\mid Y}(x\mid y)&=\frac{p_{Y\mid X}(y\mid x)p_X(x)}{p_Y(y)},\\ \operatorname{Cov}(X,Y)&=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]. \end{align*}\]

Tip: Conceptual checklist
  • A joint distribution contains the full relationship between variables.
  • A marginal distribution describes one variable after summing or integrating out the others.
  • A conditional distribution describes one variable after another variable has been fixed or observed.
  • Independence means the joint distribution factors into marginals.
  • Conditional independence means the conditional joint distribution factors after conditioning on another variable.
  • Bayes theorem reverses conditioning: it turns \(p_{Y\mid X}\) and \(p_X\) into \(p_{X\mid Y}\).