7 Chapter 6: Conditional Expectations
This chapter explains how conditional distributions and conditional expectations turn difficult probability, expectation, and variance calculations into easier two-step calculations: condition first, then average.
Topics. Conditional distributions; conditioning as a problem-solving method; law of total probability; memoryless property; conditional expectation; law of total expectation; law of total variance; random sums; Bayesian updating; inverse probability weighting; importance weighting.
7.1 Overview
This section explains how conditioning turns a difficult probability or expectation into an average of easier conditional probabilities or conditional expectations.
Conditional probability was introduced earlier, for events with \(\mathbb{P}(B)>0\), as \[\mathbb{P}(A\mid B)=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}.\] In this chapter, we extend the idea from events to random variables. We will use conditional distributions such as \(p_{X\mid Y}(x\mid y)\) and conditional expectations such as \(\mathbb{E}[X\mid Y]\).
Conditioning is a way to break a problem into two steps: first solve the problem after some information is fixed, then average over the information that was fixed. \[\text{unconditional quantity}=\text{average of conditional quantities}.\]
7.2 Conditional Distributions
This section reviews conditional distributions and shows how they are used to compute probabilities by conditioning on another random variable.
7.2.1 Discrete conditional distributions
This subsection begins with the discrete case, where conditional distributions are ratios of joint probabilities and marginals.
Let \(X\) and \(Y\) be discrete random variables. If \(p_Y(y)>0\), then \[p_{X\mid Y}(x\mid y) =\mathbb{P}(X=x\mid Y=y) =\frac{p_{X,Y}(x,y)}{p_Y(y)}.\] Equivalently, \[p_{X,Y}(x,y)=p_{X\mid Y}(x\mid y)p_Y(y).\] Therefore, summing over all possible values of \(Y\) gives the total probability formula \[\mathbb{P}(X=x)=\sum_y \mathbb{P}(X=x\mid Y=y)\mathbb{P}(Y=y).\]
It is often easier to compute \(\mathbb{P}(X=x\mid Y=y)\) after \(Y\) is fixed. Then the unconditional probability is obtained by summing over all possible values of \(Y\).
Example 1 (Best prize / secretary problem). There are \(n\) distinct prizes arriving in a random order. Exactly one prize is the best. You must accept or reject each prize when it arrives, and you cannot return to earlier prizes. Consider the strategy:
Reject the first \(k\) prizes, then choose the first later prize that is better than all previous prizes.
Let \(X\) be the position of the best prize. Find the approximate probability of selecting the best prize under this strategy and find the approximate optimal value of \(k\).
The best prize is equally likely to occur in any position, so \[\mathbb{P}(X=i)=\frac1n,\qquad i=1,\ldots,n.\] If \(i\le k\), the best prize occurs during the rejection period, so the strategy cannot win.
If \(i>k\), then the strategy wins exactly when the best prize among the first \(i-1\) positions occurs among the first \(k\) positions. Since the best among the first \(i-1\) positions is equally likely to be in any of those positions, \[\mathbb{P}(\text{win}\mid X=i)=\frac{k}{i-1},\qquad i=k+1, \ldots,n.\] Hence \[\mathbb{P}_k(\text{win}) =\sum_{i=k+1}^n \mathbb{P}(\text{win}\mid X=i)\mathbb{P}(X=i) =\sum_{i=k+1}^n \frac{k}{i-1}\frac1n.\] Thus \[\mathbb{P}_k(\text{win}) =\frac{k}{n}\sum_{j=k}^{n-1}\frac1j \approx \frac{k}{n}\log\left(\frac{n}{k}\right).\] Let \(x=k/n\). Then approximately \[f(x)=x\log\left(\frac1x\right)=-x\log x.\] Differentiate: \[f'(x)=-\log x-1.\] Set \(f'(x)=0\): \[-\log x-1=0 \quad\Longrightarrow\quad x=e^{-1}.\] So the approximate optimal choice is \[\boxed{k\approx \frac{n}{e}},\] and the corresponding maximum probability is approximately \[\boxed{\frac1e\approx 0.368.}\]
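The logarithmic approximation is a large-\(n\) statement, so it is worth checking against the exact sum for a moderate \(n\). The following Python sketch is illustrative only: the function name `win_probability` and the choice \(n=100\) are assumptions made here, not part of the example.

```python
import math

def win_probability(n: int, k: int) -> float:
    """Exact P_k(win) = (k/n) * sum_{j=k}^{n-1} 1/j, valid for 1 <= k <= n-1."""
    return (k / n) * sum(1.0 / j for j in range(k, n))

n = 100  # illustrative number of prizes (an assumption)
# Search all admissible rejection lengths for the largest exact win probability.
best_k = max(range(1, n), key=lambda k: win_probability(n, k))
best_p = win_probability(n, best_k)

print(f"best k = {best_k}, n/e = {n / math.e:.2f}")
print(f"P(win) = {best_p:.4f}, 1/e = {1 / math.e:.4f}")
```

The exact maximizer should sit next to \(n/e\), and the exact maximum should be slightly above \(1/e\), with the gap shrinking as \(n\) grows.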
7.2.2 Conditioning on the first step
This subsection shows a common recursive method: condition on the first move of a stochastic process.
Example 2 (Gambler’s ruin). A gambler starts with \(k\) dollars. In each round, the gambler wins \(1\) dollar with probability \(p\) and loses \(1\) dollar with probability \(q=1-p\). The game stops when the gambler reaches \(N\) dollars or \(0\) dollars. Let \[P_k=\mathbb{P}(\text{reach }N\text{ before }0\mid\text{start at }k).\] Use conditioning on the first bet to derive the recursion for \(P_k\).
Let \(A\) be the event that the gambler eventually reaches \(N\) before going broke. Let \(X\) be the result of the first bet, where \(X=+1\) with probability \(p\) and \(X=-1\) with probability \(q\).
Conditioning on the first bet gives \[P_k=\mathbb{P}(A) =\mathbb{P}(A\mid X=+1)\mathbb{P}(X=+1)+\mathbb{P}(A\mid X=-1)\mathbb{P}(X=-1).\] If the first bet is a win, the gambler moves to \(k+1\) dollars. If it is a loss, the gambler moves to \(k-1\) dollars. Therefore \[\boxed{P_k=pP_{k+1}+qP_{k-1}},\qquad k=1,\ldots,N-1,\] with boundary conditions \[P_0=0,\qquad P_N=1.\] If desired, solving the difference equation gives \[P_k=\begin{cases} \displaystyle \frac{1-(q/p)^k}{1-(q/p)^N}, & p\ne q,\\[1.2em] \displaystyle \frac{k}{N}, & p=q=\frac12. \end{cases}\]
Practice Problem 3 (First-step recursion). In the gambler’s ruin problem, suppose \(p=q=1/2\), \(N=10\), and the gambler starts with \(k=4\) dollars. What is the probability that the gambler reaches \(10\) dollars before going broke?
For the fair game \(p=q=1/2\), the solution is \[P_k=\frac{k}{N}.\] Thus \[P_4=\frac{4}{10}=0.4.\] So the probability is \[\boxed{0.4}.\]
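The closed-form solution can be checked directly against the first-step recursion and the boundary conditions. The sketch below is a minimal illustration; the helper names `reach_probability` and `recursion_residual` and the unfair-game value \(p=0.45\) are assumptions introduced here.

```python
def reach_probability(k: int, N: int, p: float) -> float:
    """Closed-form P_k: probability of reaching N before 0 when starting at k."""
    q = 1.0 - p
    if abs(p - q) < 1e-12:          # fair game
        return k / N
    r = q / p
    return (1.0 - r**k) / (1.0 - r**N)

def recursion_residual(N: int, p: float) -> float:
    """Largest violation of P_k = p*P_{k+1} + q*P_{k-1} over k = 1, ..., N-1."""
    q = 1.0 - p
    P = [reach_probability(k, N, p) for k in range(N + 1)]
    return max(abs(P[k] - (p * P[k + 1] + q * P[k - 1])) for k in range(1, N))

print(reach_probability(4, 10, 0.5))    # fair game from Practice Problem 3
print(reach_probability(4, 10, 0.45))   # slightly unfavorable game (illustrative)
print(recursion_residual(10, 0.45))     # should be at the level of rounding error
```

Comparing the first two printed values shows how quickly a small house edge reduces the chance of reaching the target.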
7.2.3 Continuous conditioning and total probability
This subsection extends the law of total probability to conditioning on a continuous random variable.
Let \(X\) be a continuous random variable with density \(f_X(x)\). For any event \(A\), \[\mathbb{P}(A)=\int_{-\infty}^{\infty}\mathbb{P}(A\mid X=x)f_X(x)\,dx.\] A shorthand notation is \[\mathbb{P}(A)=\mathbb{E}[\mathbb{P}(A\mid X)],\] where \(\mathbb{P}(A\mid X)\) is a random variable that is a function of \(X\).
More generally, if \(X\) and \(Y\) have joint density or joint pmf, then \[p_{X\mid Y}(x\mid y)=\frac{p_{X,Y}(x,y)}{p_Y(y)}\] whenever \(p_Y(y)>0\), and the marginal density or pmf can be recovered by \[p_X(x)=\int p_{X\mid Y}(x\mid y)p_Y(y)\,dy\] in the continuous case, or \[p_X(x)=\sum_y p_{X\mid Y}(x\mid y)p_Y(y)\] in the discrete case.
Example 4 (Tail probability for a sum of exponentials). Suppose \(X\) and \(Y\) are independent exponential random variables with mean \(1\), so \(f_X(x)=e^{-x}\) for \(x\ge0\). For \(z\ge0\), compute \[\mathbb{P}(X+Y\ge z).\]
Condition on \(X=x\). Since \(X\) and \(Y\) are independent, \[\mathbb{P}(X+Y\ge z\mid X=x)=\mathbb{P}(Y\ge z-x).\] For an exponential random variable with mean \(1\), \[\mathbb{P}(Y\ge a)=e^{-a}\quad\text{for }a\ge0.\] Thus \[\mathbb{P}(Y\ge z-x)= \begin{cases} e^{-(z-x)}, & 0\le x\le z,\\ 1, & x>z. \end{cases}\] Therefore \[\begin{aligned} \mathbb{P}(X+Y\ge z) &=\int_0^\infty \mathbb{P}(X+Y\ge z\mid X=x)e^{-x}\,dx\\ &=\int_0^z e^{-(z-x)}e^{-x}\,dx+\int_z^\infty e^{-x}\,dx\\ &=\int_0^z e^{-z}\,dx+e^{-z}\\ &=ze^{-z}+e^{-z}. \end{aligned}\] Hence \[\boxed{\mathbb{P}(X+Y\ge z)=(z+1)e^{-z}},\qquad z\ge0.\] This agrees with the fact that \(X+Y\sim\operatorname{Gamma}(2,1)\) in rate parameterization.
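A quick Monte Carlo check of the boxed tail formula is sketched below. It is only a sanity check under assumed settings: the function name `simulate_tail`, the seed, the number of draws, and the test point \(z=2\) are all choices made here for illustration.

```python
import math
import random

def simulate_tail(z: float, n_draws: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(X + Y >= z) for independent Exp(1) variables."""
    rng = random.Random(seed)
    hits = sum(
        rng.expovariate(1.0) + rng.expovariate(1.0) >= z
        for _ in range(n_draws)
    )
    return hits / n_draws

z = 2.0
estimate = simulate_tail(z)
exact = (z + 1) * math.exp(-z)   # boxed formula from the example
print(f"simulated: {estimate:.4f}, formula: {exact:.4f}")
```

The two numbers should agree up to simulation noise, which is of order \(10^{-3}\) for this number of draws.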
Example 5 (A dependent example). Suppose \(X\sim\operatorname{Uniform}(0,1)\) and, conditional on \(X=x\), the random variable \(Y\) is uniform on \([0,x]\). Calculate \(\mathbb{E}[Y]\).
Conditioning on \(X=x\), we have \[Y\mid X=x\sim\operatorname{Uniform}(0,x),\] so \[\mathbb{E}[Y\mid X=x]=\frac{x}{2}.\] Therefore \[\mathbb{E}[Y\mid X]=\frac{X}{2}.\] Using the law of total expectation, \[\mathbb{E}[Y]=\mathbb{E}[\mathbb{E}(Y\mid X)]=\mathbb{E}\left[\frac{X}{2}\right] =\frac12\mathbb{E}[X]=\frac12\cdot\frac12=\frac14.\] Thus \[\boxed{\mathbb{E}[Y]=\frac14.}\]
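As a cross-check, the same answer follows from the joint density without invoking the law of total expectation. Since \(f_{X,Y}(x,y)=f_{Y\mid X}(y\mid x)f_X(x)=1/x\) on \(0<y<x<1\), \[\mathbb{E}[Y]=\int_0^1\int_0^x y\cdot\frac1x\,dy\,dx =\int_0^1\frac{x}{2}\,dx=\frac14.\] The inner integral is exactly \(\mathbb{E}[Y\mid X=x]=x/2\), so the direct calculation and the conditioning argument are the same computation organized in two ways.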
7.2.4 Memoryless property of the exponential distribution
This subsection studies an important example where conditioning does not change the remaining lifetime distribution.
Let \(X\sim\operatorname{Exp}(\lambda)\), with density \[f_X(t)=\lambda e^{-\lambda t},\qquad t\ge0.\] Then \[\mathbb{P}(X\ge t)=e^{-\lambda t},\qquad t\ge0.\] For \(s,t\ge0\), \[\begin{aligned} \mathbb{P}(X\ge t+s\mid X>s) &=\frac{\mathbb{P}(X\ge t+s, X>s)}{\mathbb{P}(X>s)}\\ &=\frac{\mathbb{P}(X\ge t+s)}{\mathbb{P}(X>s)}\\ &=\frac{e^{-\lambda(t+s)}}{e^{-\lambda s}}\\ &=e^{-\lambda t}. \end{aligned}\] Thus \[\boxed{\mathbb{P}(X\ge t+s\mid X>s)=\mathbb{P}(X\ge t).}\]
If \(X\) is the lifetime of a device and the device has survived until time \(s\), then the additional waiting time has the same distribution as a brand-new lifetime. The exponential distribution has no memory.
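The memoryless property can be seen in simulated data by comparing new lifetimes with the remaining lifetimes of devices that have already survived to time \(s\). The sketch below is illustrative; the rate \(\lambda=2\), the elapsed time \(s=0.5\), the seed, and the sample size are assumptions made here.

```python
import random
import statistics

rng = random.Random(1)
lam, s = 2.0, 0.5  # illustrative rate and elapsed time (assumptions)

lifetimes = [rng.expovariate(lam) for _ in range(200_000)]
# Keep only devices that survived past time s and record their remaining lifetime.
remaining = [x - s for x in lifetimes if x > s]

mean_new = statistics.fmean(lifetimes)
mean_remaining = statistics.fmean(remaining)
print(f"mean of a new lifetime:       {mean_new:.3f}")
print(f"mean remaining after s = {s}: {mean_remaining:.3f}")
print(f"theoretical mean 1/lambda:    {1 / lam:.3f}")
```

Both sample means should be close to \(1/\lambda\): surviving to time \(s\) does not shorten or lengthen the expected remaining lifetime.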
Example 6 (Waiting for the next car). Cars pass a point on a highway. The times between successive cars are independent exponential random variables with mean \(m\). Suppose you arrive at the point at a time chosen independently of the traffic. What is the mean time until the next car passes?
An exponential interarrival time with mean \(m\) has rate \[\lambda=\frac1m.\] Because of the memoryless property, the remaining waiting time until the next car is again exponential with rate \(\lambda\). Therefore its mean is \[\frac1\lambda=m.\] Thus the mean time until the next car is \[\boxed{m}.\]
7.2.5 Mixed conditional distributions
This subsection considers examples where one variable is discrete and the other is continuous.
Example 7 (Poisson–Exponential–Gamma example). Suppose \(X\in\{0,1,2,\ldots\}\) is discrete and \(Y\ge0\) is continuous with joint density/mass function \[p_{X,Y}(x,y)=\frac{\lambda y^x e^{-(\lambda+1)y}}{x!}, \qquad x=0,1,2,\ldots,\quad y\ge0.\] Find the marginal distribution of \(Y\), the conditional distribution of \(X\mid Y=y\), and the conditional distribution of \(Y\mid X=x\).
First compute the marginal density of \(Y\): \[\begin{aligned} p_Y(y) &=\sum_{x=0}^{\infty}p_{X,Y}(x,y)\\ &=\sum_{x=0}^{\infty}\frac{\lambda y^x e^{-(\lambda+1)y}}{x!}\\ &=\lambda e^{-(\lambda+1)y}\sum_{x=0}^{\infty}\frac{y^x}{x!}\\ &=\lambda e^{-(\lambda+1)y}e^y\\ &=\lambda e^{-\lambda y},\qquad y\ge0. \end{aligned}\] Thus \[\boxed{Y\sim\operatorname{Exp}(\lambda).}\]
Next, \[\begin{aligned} p_{X\mid Y}(x\mid y) &=\frac{p_{X,Y}(x,y)}{p_Y(y)}\\ &=\frac{\lambda y^x e^{-(\lambda+1)y}/x!}{\lambda e^{-\lambda y}}\\ &=\frac{y^x e^{-y}}{x!}. \end{aligned}\] Therefore \[\boxed{X\mid Y=y\sim\operatorname{Poisson}(y).}\]
Finally, to identify \(Y\mid X=x\), treat \(x\) as fixed and keep only factors depending on \(y\): \[p_{Y\mid X}(y\mid x)\propto p_{X,Y}(x,y) \propto y^x e^{-(\lambda+1)y},\qquad y\ge0.\] This is a Gamma density with shape \(x+1\) and rate \(\lambda+1\). Hence \[\boxed{Y\mid X=x\sim\operatorname{Gamma}(x+1,\lambda+1)}\] where the second parameter is the rate.
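The normalizing constant dropped in the proportionality argument can be recovered by integrating the joint function over \(y\), which also gives the marginal pmf of \(X\). Using the Gamma integral \(\int_0^\infty y^x e^{-(\lambda+1)y}\,dy=x!/(\lambda+1)^{x+1}\), \[p_X(x)=\int_0^\infty\frac{\lambda y^x e^{-(\lambda+1)y}}{x!}\,dy =\frac{\lambda}{(\lambda+1)^{x+1}},\qquad x=0,1,2,\ldots,\] so \(X\) is geometric on \(\{0,1,2,\ldots\}\) with success probability \(\lambda/(\lambda+1)\). Dividing the joint function by this marginal gives \[p_{Y\mid X}(y\mid x)=\frac{p_{X,Y}(x,y)}{p_X(x)} =\frac{(\lambda+1)^{x+1}}{x!}\,y^x e^{-(\lambda+1)y},\qquad y\ge0,\] which is the \(\operatorname{Gamma}(x+1,\lambda+1)\) density with its constant written out.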
7.3 Conditional Expectations
This section introduces conditional expectation as the expected value computed under a conditional distribution.
7.3.1 Definition
This subsection defines conditional expectation for both discrete and continuous random variables.
Definition 8 (Conditional expectation). Let \(X\) and \(Y\) be random variables.
If \(X\) is discrete, then \[\mathbb{E}[X\mid Y=y] =\sum_x x\,p_{X\mid Y}(x\mid y).\] If \(X\) is continuous, then \[\mathbb{E}[X\mid Y=y] =\int_{-\infty}^{\infty}x\,p_{X\mid Y}(x\mid y)\,dx.\]
For each possible value \(y\) of \(Y\), the expression \(\mathbb{E}[X\mid Y=y]\) is a number. Therefore \(\mathbb{E}[X\mid Y]\) is a random variable that is a function of \(Y\).
The conditional expectation \(\mathbb{E}[X\mid Y]\) is the best prediction of \(X\) after observing \(Y\), when prediction is measured by squared error. In this course, the most important point is that it is a random variable determined by \(Y\).
If \(X\) and \(Y\) are independent, then conditioning on \(Y\) does not change the distribution of \(X\). Hence \[\mathbb{E}[X\mid Y=y]=\mathbb{E}[X],\] and therefore \[\mathbb{E}[X\mid Y]=\mathbb{E}[X].\] Also, if \(X\) and \(Y\) are independent and the expectations exist, then \[\mathbb{E}[XY]=\mathbb{E}[X]\mathbb{E}[Y].\]
7.3.2 Law of total expectation
This subsection states the most important computational identity for conditional expectation.
Theorem 9 (Law of total expectation). If the relevant expectations exist, then \[\boxed{\mathbb{E}[\mathbb{E}[X\mid Y]]=\mathbb{E}[X].}\] More generally, \[\boxed{\mathbb{E}[g(X,Y)]=\mathbb{E}\big[\mathbb{E}[g(X,Y)\mid X]\big]}\] for any measurable function \(g\) for which the expectations exist.
Proof (discrete case). Assume \(X\) and \(Y\) are discrete. Then \[\begin{aligned} \mathbb{E}_Y[\mathbb{E}_X(X\mid Y)] &=\sum_y \mathbb{E}[X\mid Y=y]\mathbb{P}(Y=y)\\ &=\sum_y\left(\sum_x x\mathbb{P}(X=x\mid Y=y)\right)\mathbb{P}(Y=y)\\ &=\sum_x\sum_y x\mathbb{P}(X=x\mid Y=y)\mathbb{P}(Y=y)\\ &=\sum_x\sum_y x\mathbb{P}(X=x,Y=y)\\ &=\sum_x x\mathbb{P}(X=x)\\ &=\mathbb{E}[X]. \end{aligned}\] ◻
Proof (continuous case). Assume \(X\) and \(Y\) are continuous. Then \[\begin{aligned} \mathbb{E}_Y[\mathbb{E}_X(X\mid Y)] &=\int_y \mathbb{E}[X\mid Y=y]p_Y(y)\,dy\\ &=\int_y\int_x x p_{X\mid Y}(x\mid y)p_Y(y)\,dx\,dy\\ &=\int_y\int_x x p_{X,Y}(x,y)\,dx\,dy\\ &=\int_x x\left(\int_y p_{X,Y}(x,y)\,dy\right)\,dx\\ &=\int_x x p_X(x)\,dx\\ &=\mathbb{E}[X]. \end{aligned}\] ◻
Example 10 (Unit disk: uncorrelated but not independent). Let \((X,Y)\) be uniformly distributed over the unit disk \[D=\{(x,y):x^2+y^2\le1\}.\] The joint density is \[f_{X,Y}(x,y)=\frac1\pi,\qquad (x,y)\in D.\] Are \(X\) and \(Y\) uncorrelated?
From symmetry, \[\mathbb{E}[X]=0,\qquad \mathbb{E}[Y]=0.\] We need to compute \[\operatorname{Cov}(X,Y)=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]=\mathbb{E}[XY].\] Condition on \(Y\). For a fixed \(Y=y\), the possible values of \(X\) are \[-\sqrt{1-y^2}\le X\le \sqrt{1-y^2}.\] Moreover, \[X\mid Y=y\sim \operatorname{Uniform}\left(-\sqrt{1-y^2},\sqrt{1-y^2}\right).\] Thus \[\mathbb{E}[X\mid Y=y]=0.\] Therefore \[\mathbb{E}[XY]=\mathbb{E}\big[\mathbb{E}[XY\mid Y]\big] =\mathbb{E}\big[Y\mathbb{E}[X\mid Y]\big] =\mathbb{E}[Y\cdot0]=0.\] Hence \[\boxed{\operatorname{Cov}(X,Y)=0.}\] So \(X\) and \(Y\) are uncorrelated. However, they are not independent because the conditional support of \(X\mid Y=y\) depends on \(y\).
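A simulation makes the distinction between "uncorrelated" and "independent" concrete. The sketch below draws points uniformly from the disk by rejection sampling and estimates both \(\operatorname{Cov}(X,Y)\) and \(\operatorname{Cov}(X^2,Y^2)\); the helper names, seed, and sample size are assumptions made here. A polar-coordinate calculation gives \(\operatorname{Cov}(X^2,Y^2)=\frac1{24}-\frac1{16}=-\frac1{48}\), which would be zero if \(X\) and \(Y\) were independent.

```python
import random
import statistics

def sample_unit_disk(rng):
    """Draw one point uniformly from the unit disk by rejection sampling."""
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return x, y

def covariance(u, v):
    """Sample covariance computed as mean(u*v) - mean(u) * mean(v)."""
    mean_uv = statistics.fmean(a * b for a, b in zip(u, v))
    return mean_uv - statistics.fmean(u) * statistics.fmean(v)

rng = random.Random(2)
points = [sample_unit_disk(rng) for _ in range(200_000)]
xs = [p[0] for p in points]
ys = [p[1] for p in points]

cov_xy = covariance(xs, ys)                                    # theory: 0
cov_sq = covariance([x * x for x in xs], [y * y for y in ys])  # theory: -1/48
print(f"Cov(X, Y)     ~ {cov_xy:+.4f}")
print(f"Cov(X^2, Y^2) ~ {cov_sq:+.4f}  (theory {-1 / 48:+.4f})")
```

The first estimate should be near zero while the second is clearly negative: a large \(|Y|\) forces \(|X|\) to be small, even though the two are uncorrelated.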
Example 11 (Computing covariance by conditioning). Suppose \[X\sim\operatorname{Uniform}(1,2),\] and conditional on \(X=x\), \[Y\mid X=x\sim\operatorname{Exp}(x),\] where \(x\) is the rate parameter. Find \(\operatorname{Cov}(X,Y)\).
We use \[\operatorname{Cov}(X,Y)=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y].\] First, \[\mathbb{E}[X]=\frac{1+2}{2}=\frac32.\] Because \(Y\mid X=x\sim\operatorname{Exp}(x)\) with rate \(x\), \[\mathbb{E}[Y\mid X=x]=\frac1x.\] Thus \[\mathbb{E}[Y\mid X]=\frac1X.\] Using total expectation, \[\mathbb{E}[Y]=\mathbb{E}\left[\frac1X\right]=\int_1^2\frac1x\,dx=\log2.\] Next, \[\mathbb{E}[XY]=\mathbb{E}[\mathbb{E}[XY\mid X]] =\mathbb{E}[X\mathbb{E}(Y\mid X)] =\mathbb{E}\left[X\cdot\frac1X\right]=1.\] Therefore \[\boxed{\operatorname{Cov}(X,Y)=1-\frac32\log2.}\] This value is negative because large \(X\) implies a larger exponential rate and therefore a smaller conditional mean for \(Y\).
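The boxed covariance is small in magnitude (about \(-0.04\)), so a numerical check is a useful guard against sign errors. The following sketch simulates the two-stage model; the seed and the number of draws are assumptions made here for illustration.

```python
import math
import random
import statistics

rng = random.Random(3)
n_draws = 400_000
xs = [rng.uniform(1.0, 2.0) for _ in range(n_draws)]
# Given X = x, draw Y from an exponential distribution with rate x (mean 1/x).
ys = [rng.expovariate(x) for x in xs]

mean_x = statistics.fmean(xs)
mean_y = statistics.fmean(ys)
cov_hat = statistics.fmean(x * y for x, y in zip(xs, ys)) - mean_x * mean_y
cov_exact = 1.0 - 1.5 * math.log(2.0)

print(f"E[Y]     ~ {mean_y:.4f}  (theory {math.log(2.0):.4f})")
print(f"Cov(X,Y) ~ {cov_hat:+.4f}  (theory {cov_exact:+.4f})")
```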
7.3.3 Random sums of random variables
This subsection uses conditioning to evaluate expectations when the number of terms is itself random.
Example 12 (Random sum). Let \(N,X_1,X_2,\ldots\) be independent, where \(N\) takes values in \(\{0,1,2,\ldots\}\) and the \(X_i\) are IID with \[\mathbb{E}[X_i]=\mu.\] Define \[Y=\sum_{i=1}^N X_i.\] For example, \(N\) could be the number of insurance claims in a month and \(X_i\) the size of the \(i\)-th claim. Find \(\mathbb{E}[Y]\).
Condition on \(N=n\). Then the number of terms is fixed: \[Y\mid N=n=\sum_{i=1}^n X_i.\] Since \(N\) is independent of the \(X_i\)’s, \[\mathbb{E}[Y\mid N=n] =\mathbb{E}\left[\sum_{i=1}^n X_i\right] =\sum_{i=1}^n\mathbb{E}[X_i] =n\mu.\] Therefore \[\mathbb{E}[Y\mid N]=N\mu.\] Using the law of total expectation, \[\mathbb{E}[Y]=\mathbb{E}[\mathbb{E}(Y\mid N)]=\mathbb{E}[N\mu]=\mu\mathbb{E}[N].\] Thus \[\boxed{\mathbb{E}\left[\sum_{i=1}^N X_i\right]=\mathbb{E}[N]\mathbb{E}[X_1].}\] This identity is a basic form of Wald’s equation.
Practice Problem 13 (Variance of a random sum). In the random sum example, assume in addition that \[\operatorname{Var}(X_i)=\sigma^2<\infty\] and that \(N\) has finite variance, with \(N\) still independent of the \(X_i\)’s. Use the law of total variance (Theorem 14 in Section 7.3.4) to show that \[\operatorname{Var}(Y)=\sigma^2\mathbb{E}[N]+\mu^2\operatorname{Var}(N).\]
Condition on \(N=n\). Then \[\operatorname{Var}(Y\mid N=n)=\operatorname{Var}\left(\sum_{i=1}^n X_i\right)=n\sigma^2,\] so \[\operatorname{Var}(Y\mid N)=N\sigma^2.\] Also, \[\mathbb{E}[Y\mid N]=N\mu.\] By the law of total variance, \[\begin{aligned} \operatorname{Var}(Y) &=\mathbb{E}[\operatorname{Var}(Y\mid N)]+\operatorname{Var}(\mathbb{E}[Y\mid N])\\ &=\mathbb{E}[N\sigma^2]+\operatorname{Var}(N\mu)\\ &=\sigma^2\mathbb{E}[N]+\mu^2\operatorname{Var}(N). \end{aligned}\] Thus \[\boxed{\operatorname{Var}(Y)=\sigma^2\mathbb{E}[N]+\mu^2\operatorname{Var}(N).}\]
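Both random-sum formulas can be checked by simulation. The sketch below uses a deliberately simple setup chosen here for illustration (it is not from the example): \(N\) is uniform on \(\{0,\ldots,10\}\), so \(\mathbb{E}[N]=5\) and \(\operatorname{Var}(N)=10\), and \(X_i\sim\operatorname{Uniform}(0,1)\), so \(\mu=1/2\) and \(\sigma^2=1/12\).

```python
import random
import statistics

rng = random.Random(4)

def draw_random_sum() -> float:
    """Draw Y = X_1 + ... + X_N with N uniform on {0,...,10}, X_i ~ Uniform(0,1)."""
    n_terms = rng.randint(0, 10)   # N, drawn independently of the X_i
    return sum(rng.random() for _ in range(n_terms))

samples = [draw_random_sum() for _ in range(100_000)]

mu, sigma2 = 0.5, 1.0 / 12.0       # mean and variance of Uniform(0,1)
mean_n, var_n = 5.0, 10.0          # mean and variance of N: (11**2 - 1) / 12 = 10

mean_theory = mu * mean_n
var_theory = sigma2 * mean_n + mu**2 * var_n

print(f"mean: {statistics.fmean(samples):.3f}  (theory {mean_theory:.3f})")
print(f"var:  {statistics.pvariance(samples):.3f}  (theory {var_theory:.3f})")
```

In this setup most of the variance comes from the term \(\mu^2\operatorname{Var}(N)\), that is, from not knowing how many terms are in the sum.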
7.3.4 Law of total variance
This subsection decomposes total variation into average within-group variation and between-group variation.
Theorem 14 (Law of total variance). If the relevant second moments exist, then \[\boxed{\operatorname{Var}(Y)=\mathbb{E}[\operatorname{Var}(Y\mid X)]+\operatorname{Var}(\mathbb{E}[Y\mid X]).}\]
Proof. Start from \[\operatorname{Var}(Y)=\mathbb{E}[Y^2]-(\mathbb{E}[Y])^2.\] Using total expectation, \[\mathbb{E}[Y^2]=\mathbb{E}[\mathbb{E}(Y^2\mid X)]\] and \[\mathbb{E}[Y]=\mathbb{E}[\mathbb{E}(Y\mid X)].\] For fixed \(X\), \[\operatorname{Var}(Y\mid X)=\mathbb{E}(Y^2\mid X)-\big(\mathbb{E}(Y\mid X)\big)^2.\] Thus \[\mathbb{E}(Y^2\mid X)=\operatorname{Var}(Y\mid X)+\big(\mathbb{E}(Y\mid X)\big)^2.\] Taking expectations, \[\mathbb{E}[Y^2]=\mathbb{E}[\operatorname{Var}(Y\mid X)]+\mathbb{E}\left[\big(\mathbb{E}(Y\mid X)\big)^2\right].\] Therefore \[\begin{aligned} \operatorname{Var}(Y) &=\mathbb{E}[\operatorname{Var}(Y\mid X)]+\mathbb{E}\left[\big(\mathbb{E}(Y\mid X)\big)^2\right]-\left(\mathbb{E}[\mathbb{E}(Y\mid X)]\right)^2\\ &=\mathbb{E}[\operatorname{Var}(Y\mid X)]+\operatorname{Var}(\mathbb{E}[Y\mid X]). \end{aligned}\] ◻
The identity \[\operatorname{Var}(Y)=\mathbb{E}[\operatorname{Var}(Y\mid X)]+\operatorname{Var}(\mathbb{E}[Y\mid X])\] says: \[\text{total variation}=\text{average conditional variation}+\text{variation of conditional means}.\]
7.3.5 A covariance identity
This subsection records a useful projection identity for covariance.
Let \(g(X)\) be a function of \(X\) and \(h(Y)\) be a function of \(Y\). Then \[\boxed{\operatorname{Cov}(g(X),h(Y))=\operatorname{Cov}(g(X),\mathbb{E}[h(Y)\mid X]).}\] Indeed, \[\begin{aligned} \operatorname{Cov}(g(X),h(Y)) &=\mathbb{E}[g(X)h(Y)]-\mathbb{E}[g(X)]\mathbb{E}[h(Y)]\\ &=\mathbb{E}\left[\mathbb{E}[g(X)h(Y)\mid X]\right]-\mathbb{E}[g(X)]\mathbb{E}\left[\mathbb{E}[h(Y)\mid X]\right]\\ &=\mathbb{E}\left[g(X)\mathbb{E}[h(Y)\mid X]\right]-\mathbb{E}[g(X)]\mathbb{E}\left[\mathbb{E}[h(Y)\mid X]\right]\\ &=\operatorname{Cov}(g(X),\mathbb{E}[h(Y)\mid X]). \end{aligned}\]
The random variable \(\mathbb{E}[h(Y)\mid X]\) may be viewed as the part of \(h(Y)\) that can be predicted from \(X\). Therefore, when computing covariance with a function of \(X\), we can replace \(h(Y)\) by its conditional expectation given \(X\).
7.3.6 Binomial–uniform example and Bayesian updating
This subsection connects conditional expectation, total variance, and Bayesian inference.
Example 15 (Binomial–uniform). Suppose \[X\mid Y\sim\operatorname{Binomial}(n,Y), \qquad Y\sim\operatorname{Uniform}(0,1).\] Find \(\mathbb{E}[X]\) and \(\operatorname{Var}(X)\). Then determine the conditional distribution of \(Y\mid X=x\).
Given \(Y\), the conditional mean and variance of \(X\) are \[\mathbb{E}[X\mid Y]=nY, \qquad \operatorname{Var}(X\mid Y)=nY(1-Y).\] By the law of total expectation, \[\mathbb{E}[X]=\mathbb{E}[\mathbb{E}(X\mid Y)]=\mathbb{E}[nY]=n\mathbb{E}[Y]=\frac n2.\]
By the law of total variance, \[\begin{aligned} \operatorname{Var}(X) &=\mathbb{E}[\operatorname{Var}(X\mid Y)]+\operatorname{Var}(\mathbb{E}[X\mid Y])\\ &=\mathbb{E}[nY(1-Y)]+\operatorname{Var}(nY)\\ &=n\left(\mathbb{E}[Y]-\mathbb{E}[Y^2]\right)+n^2\operatorname{Var}(Y). \end{aligned}\] For \(Y\sim\operatorname{Uniform}(0,1)\), \[\mathbb{E}[Y]=\frac12, \qquad \mathbb{E}[Y^2]=\frac13, \qquad \operatorname{Var}(Y)=\frac1{12}.\] Thus \[\operatorname{Var}(X)=n\left(\frac12-\frac13\right)+n^2\cdot\frac1{12} =\frac n6+\frac{n^2}{12}.\] Therefore \[\boxed{\mathbb{E}[X]=\frac n2, \qquad \operatorname{Var}(X)=\frac n6+\frac{n^2}{12}.}\]
Now find \(Y\mid X=x\). Since \(Y\sim\operatorname{Uniform}(0,1)\), its density is constant on \([0,1]\). Therefore \[\begin{aligned} p_{Y\mid X}(y\mid x) &\propto p_{X\mid Y}(x\mid y)p_Y(y)\\ &\propto \binom{n}{x}y^x(1-y)^{n-x}\cdot 1\\ &\propto y^x(1-y)^{n-x},\qquad 0<y<1. \end{aligned}\] This is the kernel of a beta distribution with parameters \[\alpha=x+1, \qquad \beta=n-x+1.\] Thus \[\boxed{Y\mid X=x\sim\operatorname{Beta}(x+1,n-x+1).}\] This is a Bayesian update: the prior \(Y\sim\operatorname{Beta}(1,1)\) becomes the posterior \(Y\mid X=x\sim\operatorname{Beta}(x+1,n-x+1)\) after observing \(x\) successes in \(n\) trials.
Practice Problem 16 (Posterior mean). In the binomial–uniform example, compute \(\mathbb{E}[Y\mid X=x]\).
If \[Y\mid X=x\sim\operatorname{Beta}(x+1,n-x+1),\] then the mean of a \(\operatorname{Beta}(\alpha,\beta)\) random variable is \[\frac{\alpha}{\alpha+\beta}.\] Therefore \[\mathbb{E}[Y\mid X=x] =\frac{x+1}{(x+1)+(n-x+1)} =\frac{x+1}{n+2}.\] Thus \[\boxed{\mathbb{E}[Y\mid X=x]=\frac{x+1}{n+2}.}\]
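The results of the binomial–uniform example can be verified exactly with rational arithmetic, because for integer arguments the Beta function reduces to factorials. The sketch below is an illustration only; the helper names `beta_fn`, `marginal_pmf`, `posterior_mean` and the choice \(n=10\) are assumptions made here. It also shows a fact implicit in the example: marginally, \(X\) is uniform on \(\{0,1,\ldots,n\}\).

```python
from fractions import Fraction
from math import comb, factorial

def beta_fn(a: int, b: int) -> Fraction:
    """Beta function B(a, b) = (a-1)! (b-1)! / (a+b-1)! for positive integers."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def marginal_pmf(n: int, x: int) -> Fraction:
    """P(X = x) = C(n,x) * integral_0^1 y^x (1-y)^(n-x) dy = C(n,x) B(x+1, n-x+1)."""
    return comb(n, x) * beta_fn(x + 1, n - x + 1)

def posterior_mean(n: int, x: int) -> Fraction:
    """E[Y | X = x] = B(x+2, n-x+1) / B(x+1, n-x+1)."""
    return beta_fn(x + 2, n - x + 1) / beta_fn(x + 1, n - x + 1)

n = 10  # illustrative number of trials (an assumption)
pmf = [marginal_pmf(n, x) for x in range(n + 1)]
mean = sum(x * p for x, p in enumerate(pmf))
var = sum(x * x * p for x, p in enumerate(pmf)) - mean**2

print("marginal pmf values:", set(pmf))
print("E[X] =", mean, " Var(X) =", var)
print("E[Y | X = 3] =", posterior_mean(n, 3))
```

Because the arithmetic is exact, the mean and variance can be compared with \(n/2\) and \(n/6+n^2/12\) with no rounding error, and the posterior mean with \((x+1)/(n+2)\).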
7.4 Applications: Weighting and Missing Data
This section shows how conditional expectation explains important weighting methods in statistics and data analysis.
7.4.1 Inverse probability weighting for missing data
This subsection studies a missing-data problem where income is sometimes unobserved.
Example 17 (Missing data and IPW). Consider a survey with two variables: \[X=\text{age of a participant}, \qquad Y=\text{income of a participant}.\] We want \[\mu=\mathbb{E}[Y].\] However, \(Y\) may be missing because some people refuse to provide income information. Let \(R\) be the response indicator: \[R=1 \quad\text{if }Y\text{ is observed}, \qquad R=0 \quad\text{if }Y\text{ is missing}.\] Assume \[R\perp Y\mid X,\] which means the response probability depends only on \(X\): \[\mathbb{P}(R=1\mid X,Y)=\mathbb{P}(R=1\mid X)=\pi(X).\] Assume \(\pi(X)\) is known and strictly positive, so that dividing by it is well defined. Show that \[W=\frac{RY}{\pi(X)}\] satisfies \[\mathbb{E}[W]=\mathbb{E}[Y].\]
We compute using conditional expectation. First, \[\mathbb{E}[W]=\mathbb{E}\left[\frac{RY}{\pi(X)}\right] =\mathbb{E}\left[\mathbb{E}\left(\frac{RY}{\pi(X)}\mid X\right)\right].\] Because \(1/\pi(X)\) is determined by \(X\), \[\mathbb{E}\left(\frac{RY}{\pi(X)}\mid X\right) =\frac1{\pi(X)}\mathbb{E}[RY\mid X].\] Using the conditional independence assumption \(R\perp Y\mid X\), \[\mathbb{E}[RY\mid X]=\mathbb{E}[R\mid X]\mathbb{E}[Y\mid X].\] But \[\mathbb{E}[R\mid X]=\mathbb{P}(R=1\mid X)=\pi(X).\] Hence \[\mathbb{E}\left(\frac{RY}{\pi(X)}\mid X\right) =\frac1{\pi(X)}\pi(X)\mathbb{E}[Y\mid X] =\mathbb{E}[Y\mid X].\] Therefore \[\mathbb{E}[W]=\mathbb{E}[\mathbb{E}(Y\mid X)]=\mathbb{E}[Y].\] Thus \[\boxed{\mathbb{E}\left[\frac{RY}{\pi(X)}\right]=\mathbb{E}[Y].}\]
In practice, if we observe IID data, we estimate \(\mu=\mathbb{E}[Y]\) by the inverse probability weighting estimator \[\widehat\mu_{\operatorname{IPW}} =\frac1n\sum_{i=1}^n \frac{R_iY_i}{\pi(X_i)}.\] This estimator upweights observed responses that had a smaller probability of being observed.
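The effect of the weights is easiest to see in a simulation where the truth is known. The sketch below is entirely hypothetical: the age range, the income model \(Y=20+X+\text{noise}\) (in thousands, so \(\mathbb{E}[Y]=65\)), and the response probability \(\pi(x)=0.9-0.01(x-20)\) are assumptions invented here to illustrate the estimator, not data from the example.

```python
import random
import statistics

rng = random.Random(5)

def response_probability(age: float) -> float:
    """Hypothetical known pi(x): older participants respond less often."""
    return 0.9 - 0.01 * (age - 20.0)   # ranges from 0.9 (age 20) to 0.4 (age 70)

n = 200_000
ipw_terms = []          # R_i * Y_i / pi(X_i); equals 0 when Y_i is missing
observed_incomes = []   # incomes of respondents only
for _ in range(n):
    age = rng.uniform(20.0, 70.0)
    income = 20.0 + age + rng.gauss(0.0, 5.0)   # hypothetical model, E[Y] = 65
    pi = response_probability(age)
    responded = rng.random() < pi               # R depends on X only
    if responded:
        observed_incomes.append(income)
        ipw_terms.append(income / pi)
    else:
        ipw_terms.append(0.0)

naive_mean = statistics.fmean(observed_incomes)  # ignores the missingness
ipw_mean = statistics.fmean(ipw_terms)           # IPW estimator
print(f"true mean 65.0, naive {naive_mean:.2f}, IPW {ipw_mean:.2f}")
```

In this setup the naive respondent average is biased downward because high-income (older) participants respond less often, while the IPW average stays close to the true mean.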
7.4.2 Survey sampling and importance weighting
This subsection explains importance weighting when the sample distribution differs from the population distribution.
Example 18 (Survey sampling). A city has three districts \(A\), \(B\), and \(C\). The population proportions are \[\mathbb{P}_{\text{pop}}(X=A)=0.6, \qquad \mathbb{P}_{\text{pop}}(X=B)=0.3, \qquad \mathbb{P}_{\text{pop}}(X=C)=0.1.\] Let \(Y\) be income. The target average income is \[\mu=0.6\mathbb{E}[Y\mid X=A]+0.3\mathbb{E}[Y\mid X=B]+0.1\mathbb{E}[Y\mid X=C].\] However, the survey samples the same number of individuals from each district, so in the sample \[\mathbb{P}_{\text{sample}}(X=A)=\mathbb{P}_{\text{sample}}(X=B)=\mathbb{P}_{\text{sample}}(X=C)=\frac13.\] Construct a quantity \(Z=g(X,Y)\) such that \[\mathbb{E}_{\text{sample}}[Z]=\mu.\]
We weight each observation by \[\frac{\text{population probability}}{\text{sample probability}}.\] Thus define \[\begin{aligned} Z &=\frac{0.6}{1/3}\mathbbm{1}(X=A)Y +\frac{0.3}{1/3}\mathbbm{1}(X=B)Y +\frac{0.1}{1/3}\mathbbm{1}(X=C)Y\\ &=1.8\mathbbm{1}(X=A)Y+0.9\mathbbm{1}(X=B)Y+0.3\mathbbm{1}(X=C)Y. \end{aligned}\] Assume that sampling within each district is representative, so that the conditional means \(\mathbb{E}[Y\mid X=A]\), \(\mathbb{E}[Y\mid X=B]\), and \(\mathbb{E}[Y\mid X=C]\) are the same in the sample and in the population. Then \[\begin{aligned} \mathbb{E}_{\text{sample}}[Z] &=1.8\mathbb{E}_{\text{sample}}[\mathbbm{1}(X=A)Y]+0.9\mathbb{E}_{\text{sample}}[\mathbbm{1}(X=B)Y]+0.3\mathbb{E}_{\text{sample}}[\mathbbm{1}(X=C)Y]\\ &=1.8\mathbb{P}_{\text{sample}}(X=A)\mathbb{E}[Y\mid X=A]\\ &\quad+0.9\mathbb{P}_{\text{sample}}(X=B)\mathbb{E}[Y\mid X=B]\\ &\quad+0.3\mathbb{P}_{\text{sample}}(X=C)\mathbb{E}[Y\mid X=C]\\ &=1.8\cdot\frac13\mathbb{E}[Y\mid X=A]+0.9\cdot\frac13\mathbb{E}[Y\mid X=B]+0.3\cdot\frac13\mathbb{E}[Y\mid X=C]\\ &=0.6\mathbb{E}[Y\mid X=A]+0.3\mathbb{E}[Y\mid X=B]+0.1\mathbb{E}[Y\mid X=C]\\ &=\mu. \end{aligned}\] Thus \[\boxed{Z=1.8\mathbbm{1}(X=A)Y+0.9\mathbbm{1}(X=B)Y+0.3\mathbbm{1}(X=C)Y}\] has the desired property.
When the sampling design does not match the target population, conditional expectation tells us how to reweight the sample so that the weighted average targets the correct population quantity.
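The weights in the survey example can be computed mechanically as the ratio of population share to sample share. The short sketch below does this with exact fractions; the district mean incomes (50, 70, 100, in thousands) are hypothetical numbers introduced here only to make the comparison concrete.

```python
from fractions import Fraction

population_share = {"A": Fraction(6, 10), "B": Fraction(3, 10), "C": Fraction(1, 10)}
sample_share = {d: Fraction(1, 3) for d in population_share}
# Hypothetical district means E[Y | X = d], in thousands (assumed for illustration).
district_mean = {"A": 50, "B": 70, "C": 100}

# Importance weight for each district: population share / sample share.
weight = {d: population_share[d] / sample_share[d] for d in population_share}

target = sum(population_share[d] * district_mean[d] for d in population_share)
unweighted = sum(sample_share[d] * district_mean[d] for d in population_share)
weighted = sum(weight[d] * sample_share[d] * district_mean[d] for d in population_share)

print("weights:", {d: float(w) for d, w in weight.items()})
print("target mean:", float(target))
print("expected unweighted sample mean:", float(unweighted))
print("expected weighted sample mean:", float(weighted))
```

With these assumed means the unweighted sample average overstates income, because the small high-income district \(C\) is oversampled; the weighted average recovers the target exactly.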
7.5 Summary
This section summarizes the main formulas that should be remembered from conditional distributions and conditional expectations.
| Concept | Formula / message |
|---|---|
| Conditional pmf/pdf | \(\displaystyle p_{X\mid Y}(x\mid y)=\frac{p_{X,Y}(x,y)}{p_Y(y)}\) |
| Law of total probability | \(\displaystyle p_X(x)=\sum_y p_{X\mid Y}(x\mid y)p_Y(y)\) or \(\displaystyle p_X(x)=\int p_{X\mid Y}(x\mid y)p_Y(y)\,dy\) |
| Conditional expectation | \(\displaystyle \mathbb{E}[X\mid Y=y]=\sum_x x p_{X\mid Y}(x\mid y)\) or \(\displaystyle \int x p_{X\mid Y}(x\mid y)\,dx\) |
| Law of total expectation | \(\displaystyle \mathbb{E}[\mathbb{E}[X\mid Y]]=\mathbb{E}[X]\) |
| Law of total variance | \(\displaystyle \operatorname{Var}(Y)=\mathbb{E}[\operatorname{Var}(Y\mid X)]+\operatorname{Var}(\mathbb{E}[Y\mid X])\) |
| Random sum | If \(Y=\sum_{i=1}^N X_i\), with \(N\) independent of IID \(X_i\) and \(\mathbb{E}[X_i]=\mu\), then \(\displaystyle \mathbb{E}[Y]=\mu\mathbb{E}[N]\). |
| Memoryless exponential | If \(X\sim\operatorname{Exp}(\lambda)\), then \(\displaystyle \mathbb{P}(X\ge t+s\mid X>s)=\mathbb{P}(X\ge t)\). |
| IPW | If \(R\perp Y\mid X\) and \(\mathbb{P}(R=1\mid X)=\pi(X)\), then \(\displaystyle \mathbb{E}\left[\frac{RY}{\pi(X)}\right]=\mathbb{E}[Y]\). |
7.6 Additional Practice Problems
This final section gives extra practice problems with full solutions.
Practice Problem 19 (Conditional expectation from a table). Suppose \(X,Y\in\{0,1\}\) have joint pmf \[\begin{array}{c|cc} & Y=0 & Y=1\\ \hline X=0 & 0.2 & 0.3\\ X=1 & 0.1 & 0.4 \end{array}\] Find \(\mathbb{E}[X\mid Y=1]\) and \(\mathbb{E}[X]\).
First, \[\mathbb{P}(Y=1)=0.3+0.4=0.7.\] Thus \[\mathbb{P}(X=1\mid Y=1)=\frac{\mathbb{P}(X=1,Y=1)}{\mathbb{P}(Y=1)}=\frac{0.4}{0.7}=\frac47.\] Since \(X\) is Bernoulli conditional on \(Y=1\), \[\boxed{\mathbb{E}[X\mid Y=1]=\frac47.}\] Also \[\mathbb{P}(X=1)=0.1+0.4=0.5,\] so \[\boxed{\mathbb{E}[X]=0.5.}\]
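For small tables like this one, the laws of total expectation and total variance can be verified exactly. The sketch below encodes the joint pmf of the problem with fractions; the helper names `p_y`, `cond_mean_x`, and `cond_var_x` are assumptions introduced here.

```python
from fractions import Fraction

# Joint pmf from the table; keys are (x, y).
joint = {(0, 0): Fraction(2, 10), (0, 1): Fraction(3, 10),
         (1, 0): Fraction(1, 10), (1, 1): Fraction(4, 10)}

def p_y(y):
    """Marginal P(Y = y)."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def cond_mean_x(y):
    """E[X | Y = y]."""
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y(y)

def cond_var_x(y):
    """Var(X | Y = y)."""
    second = sum(x * x * p for (x, yy), p in joint.items() if yy == y) / p_y(y)
    return second - cond_mean_x(y) ** 2

mean_x = sum(x * p for (x, _), p in joint.items())
var_x = sum(x * x * p for (x, _), p in joint.items()) - mean_x ** 2

# Law of total expectation: average the conditional means over Y.
total_expectation = sum(cond_mean_x(y) * p_y(y) for y in (0, 1))
# Law of total variance: within-group part plus between-group part.
within = sum(cond_var_x(y) * p_y(y) for y in (0, 1))
between = sum(cond_mean_x(y) ** 2 * p_y(y) for y in (0, 1)) - total_expectation ** 2

print("E[X | Y=1] =", cond_mean_x(1))
print("E[X] =", mean_x, "=", total_expectation)
print("Var(X) =", var_x, "=", within, "+", between)
```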
Practice Problem 20 (Total expectation with a Poisson mixture). Suppose \[X\mid\Lambda=\lambda\sim\operatorname{Poisson}(\lambda)\] and \(\mathbb{E}[\Lambda]\) exists. Find \(\mathbb{E}[X]\).
For a Poisson random variable with rate \(\lambda\), \[\mathbb{E}[X\mid\Lambda=\lambda]=\lambda.\] Thus \[\mathbb{E}[X\mid\Lambda]=\Lambda.\] By total expectation, \[\mathbb{E}[X]=\mathbb{E}[\mathbb{E}(X\mid\Lambda)]=\mathbb{E}[\Lambda].\] So \[\boxed{\mathbb{E}[X]=\mathbb{E}[\Lambda].}\]
Practice Problem 21 (Total variance with a Poisson mixture). In the previous problem, find \(\operatorname{Var}(X)\) in terms of \(\mathbb{E}[\Lambda]\) and \(\operatorname{Var}(\Lambda)\).
For \(X\mid\Lambda=\lambda\sim\operatorname{Poisson}(\lambda)\), \[\mathbb{E}[X\mid\Lambda]=\Lambda, \qquad \operatorname{Var}(X\mid\Lambda)=\Lambda.\] Therefore, by total variance, \[\begin{aligned} \operatorname{Var}(X) &=\mathbb{E}[\operatorname{Var}(X\mid\Lambda)]+\operatorname{Var}(\mathbb{E}[X\mid\Lambda])\\ &=\mathbb{E}[\Lambda]+\operatorname{Var}(\Lambda). \end{aligned}\] Thus \[\boxed{\operatorname{Var}(X)=\mathbb{E}[\Lambda]+\operatorname{Var}(\Lambda).}\]
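As a concrete instance, chosen here for illustration, suppose \(\Lambda\sim\operatorname{Gamma}(\alpha,\beta)\) with rate \(\beta\), so that \(\mathbb{E}[\Lambda]=\alpha/\beta\) and \(\operatorname{Var}(\Lambda)=\alpha/\beta^2\). Then \[\mathbb{E}[X]=\frac{\alpha}{\beta}, \qquad \operatorname{Var}(X)=\frac{\alpha}{\beta}+\frac{\alpha}{\beta^2} =\frac{\alpha}{\beta}\left(1+\frac1\beta\right)>\mathbb{E}[X].\] A Poisson random variable has variance equal to its mean, so any non-degenerate mixing distribution makes \(X\) overdispersed: the extra term \(\operatorname{Var}(\Lambda)\) is the between-group variation in the law of total variance.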
Practice Problem 22 (Memoryless calculation). Let \(X\sim\operatorname{Exp}(2)\). Compute \[\mathbb{P}(X>7\mid X>3).\]
By the memoryless property, \[\mathbb{P}(X>7\mid X>3)=\mathbb{P}(X>4).\] Since \(X\sim\operatorname{Exp}(2)\), \[\mathbb{P}(X>4)=e^{-2\cdot 4}=e^{-8}.\] Thus \[\boxed{\mathbb{P}(X>7\mid X>3)=e^{-8}.}\]