8 Chapter 7: Inequalities and Identities
This chapter introduces the main probability inequalities used to control tail probabilities, prove convergence results, and analyze estimators. The focus is not only on remembering formulas, but also on learning when each inequality is the right tool.
Topics. Boole’s inequality; inclusion-exclusion; Bonferroni inequalities; Markov’s inequality; Chebyshev’s inequality; Chernoff bounds; Hoeffding’s inequality; Cauchy-Schwarz, Holder, Minkowski, and Jensen inequalities; equality identities for common distributions.
8.0.1 Overview
This section introduces probability inequalities and distribution identities that are used to control tail probabilities, prove limit theorems, and analyze statistical estimators.
Probability inequalities provide upper and lower bounds for probabilities of events, especially events involving deviations of random variables from their expectations. The important point is that a useful bound may require only partial information, such as the mean, variance, support, or moment generating function, rather than a full distributional calculation.
Main message Probability inequalities answer questions of the form \[\mathbb{P}\{Z \geq \mathbb{E}[Z]+t\},\qquad \mathbb{P}\{Z\leq \mathbb{E}[Z]-t\},\qquad \mathbb{P}\{|Z-\mathbb{E}[Z]|\geq t\}.\] They are useful when exact probabilities are hard to compute or when we want distribution-free guarantees.
The inequalities in this section include:
union bounds: Boole’s inequality, inclusion-exclusion, and Bonferroni inequalities;
tail bounds from moments: Markov’s inequality and Chebyshev’s inequality;
exponential tail bounds: Chernoff bounds and Hoeffding’s inequality;
functional inequalities: Cauchy-Schwarz, Holder, Minkowski, and Jensen;
equality identities for special distributions.
8.1 Union Bounds and Inclusion-Exclusion
This section begins with inequalities for the probability that at least one event in a collection occurs.
8.1.1 Boole’s inequality
This subsection states the most basic and most useful union bound.
Theorem 1 (Boole’s inequality). Suppose \((S,\mathcal B,\mathbb{P})\) is a probability space and \(E_1,E_2,\ldots\in \mathcal B\) are events. Then \[\mathbb{P}\left(\bigcup_{i=1}^{\infty}E_i\right) \leq \sum_{i=1}^{\infty}\mathbb{P}(E_i).\] For finitely many events, \[\mathbb{P}\left(\bigcup_{i=1}^{n}E_i\right) \leq \sum_{i=1}^{n}\mathbb{P}(E_i).\]
Boole’s inequality says that the probability that at least one of the events happens is no greater than the sum of the individual probabilities.
Proof. For two events, \[\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)-\mathbb{P}(A\cap B)\leq \mathbb{P}(A)+\mathbb{P}(B).\] The finite result follows by induction. The countable result follows by continuity of probability from below: \[\mathbb{P}\left(\bigcup_{i=1}^{\infty}E_i\right) = \lim_{n\to\infty}\mathbb{P}\left(\bigcup_{i=1}^{n}E_i\right) \leq \lim_{n\to\infty}\sum_{i=1}^{n}\mathbb{P}(E_i).\] ◻
Example 2 (A quick union-bound estimate). Suppose \(100\) students each independently submit an assignment late with probability at most \(0.01\). Without assuming independence, bound the probability that at least one student submits late.
Let \(E_i\) be the event that student \(i\) submits late. Then \[\mathbb{P}\left(\bigcup_{i=1}^{100}E_i\right) \leq \sum_{i=1}^{100}\mathbb{P}(E_i) \leq 100(0.01)=1.\] This bound is not informative because it only gives an upper bound of \(1\). If the late probability were \(0.001\), then Boole’s inequality would give \[\mathbb{P}(\text{at least one late})\leq 100(0.001)=0.1.\] The example shows that the union bound is most useful when individual probabilities are small.
8.1.2 Inclusion-exclusion
This subsection gives the exact formula for a finite union by adding and subtracting intersections.
Suppose \(E_1,\ldots,E_n\) are events. Define \[S_1=\sum_{i=1}^{n}\mathbb{P}(E_i),\] \[S_2=\sum_{1\leq i<j\leq n}\mathbb{P}(E_i\cap E_j),\] and, more generally, \[S_k=\sum_{1\leq i_1<\cdots<i_k\leq n} \mathbb{P}(E_{i_1}\cap\cdots\cap E_{i_k}).\]
Theorem 3 (Inclusion-exclusion). For events \(E_1,\ldots,E_n\), \[\mathbb{P}\left(\bigcup_{i=1}^{n}E_i\right) =S_1-S_2+S_3-\cdots+(-1)^{n-1}S_n.\]
Example 4 (Three events). Write the inclusion-exclusion formula for three events \(A,B,C\).
For three events, \[\mathbb{P}(A\cup B\cup C) = \mathbb{P}(A)+\mathbb{P}(B)+\mathbb{P}(C) -\mathbb{P}(A\cap B)-\mathbb{P}(A\cap C)-\mathbb{P}(B\cap C) +\mathbb{P}(A\cap B\cap C).\] The first line adds the probability of being in any one event. The second line removes double-counting. The last term adds back the triple intersection, which was removed too many times.
8.1.3 Bonferroni inequalities
This subsection turns the inclusion-exclusion expansion into useful upper and lower bounds.
Theorem 5 (Bonferroni inequalities). Let \(E_1,\ldots,E_n\) be events and let \(S_k\) be defined as above. For \(1\leq k\leq n\), \[B_k=\sum_{j=1}^{k}(-1)^{j-1}S_j.\] If \(k\) is odd, then \[\mathbb{P}\left(\bigcup_{i=1}^{n}E_i\right)\leq B_k.\] If \(k\) is even, then \[\mathbb{P}\left(\bigcup_{i=1}^{n}E_i\right)\geq B_k.\]
The case \(k=1\) is exactly Boole’s inequality. The case \(k=2\) gives \[\mathbb{P}\left(\bigcup_{i=1}^{n}E_i\right) \geq \sum_i\mathbb{P}(E_i)-\sum_{i<j}\mathbb{P}(E_i\cap E_j).\]
Example 6 (Empty boxes). Place \(n\) distinguishable balls into \(m\) distinguishable boxes uniformly at random, where \(n>m\). Let \(E\) be the event that at least one box is empty. Use inclusion-exclusion and Bonferroni inequalities to estimate \(\mathbb{P}(E)\).
The sample space is \[S=\{\omega=(\omega_1,\ldots,\omega_n):1\leq \omega_i\leq m\}, \qquad |S|=m^n.\] Let \[E_\ell=\{\omega:\omega_i\neq \ell \text{ for all } i=1,\ldots,n\}\] be the event that box \(\ell\) is empty. Then \[E=E_1\cup\cdots\cup E_m.\] For any fixed \(k\) distinct boxes \(i_1,\ldots,i_k\), \[\mathbb{P}(E_{i_1}\cap\cdots\cap E_{i_k}) =\left(\frac{m-k}{m}\right)^n =\left(1-\frac{k}{m}\right)^n.\] Therefore inclusion-exclusion gives \[\mathbb{P}(E) = \binom{m}{1}\left(1-\frac1m\right)^n - \binom{m}{2}\left(1-\frac2m\right)^n +\cdots +(-1)^{m-1}\binom{m}{m}\left(1-\frac{m}{m}\right)^n.\] The first two Bonferroni bounds give \[m\left(1-\frac1m\right)^n - \binom{m}{2}\left(1-\frac2m\right)^n \leq \mathbb{P}(E) \leq m\left(1-\frac1m\right)^n.\] For \(m=10\), \[10(0.9)^n-45(0.8)^n\leq \mathbb{P}(E)\leq 10(0.9)^n.\] For \(n=50\), the gap between the upper and lower bounds is \[45(0.8)^{50}\approx 0.00064226146,\] so the two-sided estimate is quite accurate.
Practice Problem 7 (At least one duplicated birthday). Assume each birthday is equally likely among \(365\) days. Let \(E_{ij}\) be the event that people \(i\) and \(j\) share a birthday in a class of \(r\) people. Use Boole’s inequality to bound the probability that at least two people share a birthday.
There are \(\binom r2\) pairs. For each pair, \[\mathbb{P}(E_{ij})=\frac1{365}.\] If \(E\) is the event that at least one pair shares a birthday, then \[E=\bigcup_{1\leq i<j\leq r}E_{ij}.\] By Boole’s inequality, \[\mathbb{P}(E)\leq \sum_{i<j}\mathbb{P}(E_{ij})=\binom r2\frac1{365}.\] This is a simple upper bound; it can exceed \(1\) for large \(r\), in which case the best probability upper bound is \(1\).
8.2 Markov’s Inequality
This section introduces the fundamental inequality behind many tail bounds.
8.2.1 Basic Markov inequality
This subsection bounds a nonnegative random variable using only its expectation.
Theorem 8 (Markov’s inequality). Let \(X\) be a nonnegative random variable. Then, for any \(a>0\), \[\mathbb{P}(X\geq a)\leq \frac{\mathbb{E}[X]}{a}.\]
Proof for continuous \(X\). If \(X\) has density \(f\), then \[\mathbb{E}[X]=\int_0^\infty x f(x)\,dx =\int_0^a x f(x)\,dx+\int_a^\infty x f(x)\,dx \geq \int_a^\infty x f(x)\,dx \geq \int_a^\infty a f(x)\,dx =a\mathbb{P}(X\geq a).\] Dividing by \(a\) gives the result. The same proof works in the discrete case by replacing integrals with sums. ◻
Example 9 (A binomial upper-tail bound). Suppose \(X\sim \operatorname{Binomial}(n,p)\). For \(p<q<1\), use Markov’s inequality to bound \(\mathbb{P}(X\geq qn)\).
Since \(X\geq 0\) and \(\mathbb{E}[X]=np\), Markov’s inequality gives \[\mathbb{P}(X\geq qn) \leq \frac{\mathbb{E}[X]}{qn} = \frac{np}{qn} =\frac pq.\] This bound does not improve as \(n\) increases, so it is often weak for sums of many independent variables.
8.2.2 General Markov inequality
This subsection applies Markov’s idea to a nonnegative function of a random variable.
Theorem 10 (General Markov inequality). Let \(g(X)\) be a nonnegative random variable. For any \(r>0\), \[\mathbb{P}(g(X)\geq r) \leq \frac{\mathbb{E}[g(X)]}{r}.\]
Proof. Apply Markov’s inequality to the nonnegative random variable \(g(X)\). ◻
Example 11 (Universal deviation bound). Let \(X\) be a random variable with mean \(\mu\) and variance \(\sigma^2\). Show that \[\mathbb{P}(|X-\mu|\geq t\sigma)\leq \frac1{t^2}\] for \(t>0\).
Let \[g(X)=\frac{(X-\mu)^2}{\sigma^2}.\] Then \(g(X)\geq 0\) and \[\mathbb{E}[g(X)]=\frac{\mathbb{E}[(X-\mu)^2]}{\sigma^2}=1.\] The event \(|X-\mu|\geq t\sigma\) is the same as \[\frac{(X-\mu)^2}{\sigma^2}\geq t^2.\] By the general Markov inequality, \[\mathbb{P}(|X-\mu|\geq t\sigma) = \mathbb{P}(g(X)\geq t^2) \leq \frac{\mathbb{E}[g(X)]}{t^2} =\frac1{t^2}.\] For example, if \(t=2\), then \[\mathbb{P}(|X-\mu|\geq 2\sigma)\leq \frac14,\] so there is at least a \(75\%\) chance that \(X\) lies within two standard deviations of its mean, no matter what the distribution is.
8.3 Chebyshev’s Inequality
This section improves Markov’s inequality by using variance to control deviations from the mean.
8.3.1 Statement and proof
This subsection proves Chebyshev’s inequality as a direct consequence of Markov’s inequality.
Theorem 12 (Chebyshev’s inequality). Let \(X\) be a random variable with finite variance. Then, for any \(k>0\), \[\mathbb{P}(|X-\mathbb{E}[X]|\geq k) \leq \frac{\operatorname{Var}(X)}{k^2}.\] Equivalently, for \(t>0\), \[\mathbb{P}(|X-\mathbb{E}[X]|\geq t\sqrt{\operatorname{Var}(X)})\leq \frac1{t^2}.\]
Proof. Let \[Y=(X-\mathbb{E}[X])^2.\] Then \(Y\geq 0\). By Markov’s inequality, \[\mathbb{P}(|X-\mathbb{E}[X]|\geq k) =\mathbb{P}((X-\mathbb{E}[X])^2\geq k^2) \leq \frac{\mathbb{E}[(X-\mathbb{E}[X])^2]}{k^2} =\frac{\operatorname{Var}(X)}{k^2}.\] ◻
8.3.2 Comparison with Markov for a binomial tail
This subsection shows that Chebyshev’s inequality improves Markov’s inequality for a binomial upper tail because it uses the variance.
Example 13 (Binomial tail: Markov versus Chebyshev). Suppose \(X\sim \operatorname{Binomial}(n,p)\) and \(p<q<1\). Compare Markov’s and Chebyshev’s bounds for \(\mathbb{P}(X\geq qn)\).
Markov’s inequality gives \[\mathbb{P}(X\geq qn)\leq \frac pq.\] For Chebyshev’s inequality, note that \[\mathbb{E}[X]=np, \qquad \operatorname{Var}(X)=np(1-p).\] Since \(q>p\), \[\{X\geq qn\}\subseteq \{|X-np|\geq (q-p)n\}.\] Therefore \[\mathbb{P}(X\geq qn) \leq \mathbb{P}(|X-np|\geq (q-p)n) \leq \frac{np(1-p)}{(q-p)^2n^2} = \frac{p(1-p)}{(q-p)^2n}.\] Unlike the Markov bound, this bound goes to \(0\) as \(n\to\infty\).
Practice Problem 14 (Sample mean bound). Let \(X_1,\ldots,X_n\) be iid with mean \(\mu\) and variance \(\sigma^2\). Let \(\bar X_n=n^{-1}\sum_{i=1}^n X_i\). Use Chebyshev’s inequality to bound \(\mathbb{P}(|\bar X_n-\mu|\geq \varepsilon)\).
We have \[\mathbb{E}[\bar X_n]=\mu, \qquad \operatorname{Var}(\bar X_n)=\frac{\sigma^2}{n}.\] Chebyshev’s inequality gives \[\mathbb{P}(|\bar X_n-\mu|\geq \varepsilon) \leq \frac{\operatorname{Var}(\bar X_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}.\] This is the basic second-moment concentration bound for the sample mean.
8.4 Chernoff Bounds
This section introduces exponential tail bounds based on moment generating functions.
8.4.1 The Chernoff method
This subsection derives the Chernoff bound by applying Markov’s inequality to \(e^{tX}\).
Theorem 15 (Chernoff bounds). Let \(X\) be a random variable with moment generating function \[M_X(t)=\mathbb{E}[e^{tX}].\] For any \(a\in\mathbb{R}\), \[\mathbb{P}(X\geq a) \leq \inf_{t\geq 0} e^{-ta}M_X(t),\] and \[\mathbb{P}(X\leq a) \leq \inf_{t\geq 0} e^{ta}M_X(-t).\]
Proof. For \(t\geq 0\), the function \(e^{tX}\) is nonnegative and \[\{X\geq a\}=\{e^{tX}\geq e^{ta}\}.\] By Markov’s inequality, \[\mathbb{P}(X\geq a) \leq \frac{\mathbb{E}[e^{tX}]}{e^{ta}} =e^{-ta}M_X(t).\] Taking the infimum over \(t\geq 0\) gives the upper-tail bound. For the lower tail, apply the same argument to \(-X\): \[\mathbb{P}(X\leq a)=\mathbb{P}(-X\geq -a) \leq e^{ta}M_X(-t).\] ◻
8.4.2 Binomial Chernoff bound
This subsection applies the Chernoff method to a binomial random variable.
Example 16 (Chernoff bound for a binomial tail). Let \(X\sim \operatorname{Binomial}(n,p)\). For \(p<q<1\), derive an exponential upper bound for \(\mathbb{P}(X\geq qn)\).
The moment generating function of \(X\sim \operatorname{Binomial}(n,p)\) is \[M_X(t)=(pe^t+1-p)^n.\] The Chernoff bound gives \[\mathbb{P}(X\geq qn) \leq \inf_{t\geq 0} e^{-tqn}(pe^t+1-p)^n.\] Let \[g(t)= e^{-tqn}(pe^t+1-p)^n.\] Minimizing \(\log g(t)\) gives \[\frac{\,d}{\,dt}\log g(t) =-qn+n\frac{pe^t}{pe^t+1-p}=0.\] Thus \[\frac{pe^{t^*}}{pe^{t^*}+1-p}=q,\] which implies \[e^{t^*}=\frac{q(1-p)}{(1-q)p}.\] Substituting this value into \(g(t)\) gives \[\mathbb{P}(X\geq qn) \leq \left(\frac pq\right)^{qn} \left(\frac{1-p}{1-q}\right)^{(1-q)n}.\] This is an exponential bound in \(n\).
8.4.3 Comparison: Markov, Chebyshev, and Chernoff
This subsection compares the three bounds for binomial tails.
For \(X\sim \operatorname{Binomial}(n,p)\) and \(p<q<1\): \[\text{Markov:}\qquad \mathbb{P}(X\geq qn)\leq \frac pq.\] \[\text{Chebyshev:}\qquad \mathbb{P}(X\geq qn) \leq \frac{p(1-p)}{(q-p)^2n}.\] \[\text{Chernoff:}\qquad \mathbb{P}(X\geq qn) \leq \left(\frac pq\right)^{qn} \left(\frac{1-p}{1-q}\right)^{(1-q)n}.\]
Example 17 (Comparison when \(p=1/2\) and \(q=3/4\)). Let \(X\sim \operatorname{Binomial}(n,1/2)\). Compare bounds for \(\mathbb{P}(X\geq 3n/4)\).
Markov’s inequality gives \[\mathbb{P}(X\geq 3n/4)\leq \frac{1/2}{3/4}=\frac23.\] Chebyshev’s inequality gives \[\mathbb{P}(X\geq 3n/4) \leq \frac{(1/2)(1/2)}{(1/4)^2n} =\frac4n.\] Chernoff’s bound gives \[\mathbb{P}(X\geq 3n/4) \leq \left(\frac{1/2}{3/4}\right)^{3n/4} \left(\frac{1/2}{1/4}\right)^{n/4} =\left(\frac{16}{27}\right)^{n/4}.\] Markov’s bound does not decay with \(n\), Chebyshev’s bound decays like \(1/n\), and Chernoff’s bound decays exponentially.
Example 18 (Comparison when \(p=1/3\) and \(q=2/3\)). Let \(X\sim \operatorname{Binomial}(n,1/3)\). Compare bounds for \(\mathbb{P}(X\geq 2n/3)\).
Markov’s inequality gives \[\mathbb{P}(X\geq 2n/3)\leq \frac{1/3}{2/3}=\frac12.\] Chebyshev’s inequality gives \[\mathbb{P}(X\geq 2n/3) \leq \frac{(1/3)(2/3)}{(1/3)^2n} =\frac2n.\] The Chernoff bound gives \[\mathbb{P}(X\geq 2n/3) \leq \left(\frac{1/3}{2/3}\right)^{2n/3} \left(\frac{2/3}{1/3}\right)^{n/3} =2^{-2n/3}2^{n/3}=2^{-n/3}.\] The lecture slide states the sharper form \(2^{-n/2}\) for this special comparison. The key qualitative point is the same: Chernoff gives exponential decay, while Chebyshev gives polynomial decay.
8.4.4 Chernoff bounds for sums
This subsection explains why Chernoff bounds are especially convenient for sums of independent random variables.
Suppose \(Z_1,\ldots,Z_n\) are independent. Then \[M_{Z_1+\cdots+Z_n}(t) =\prod_{i=1}^{n}M_{Z_i}(t).\] Therefore, if \(Z_i\) are iid, \[M_{\sum_{i=1}^n Z_i}(t)=M_Z(t)^n.\] This means that the Chernoff bound of a sum usually reduces to computing the MGF of a single summand.
Example 19 (Random signs). Let \(S\) be a random sign variable with \[\mathbb{P}(S=1)=\mathbb{P}(S=-1)=\frac12.\] Let \(X=S_1+\cdots+S_n\), where \(S_i\) are iid copies of \(S\). Find a Chernoff bound for \(\mathbb{P}(X\geq a)\).
The MGF of one random sign is \[M_S(t)=\mathbb{E}[e^{tS}]=\frac12e^t+\frac12e^{-t}=\cosh t.\] Hence \[M_X(t)=(\cosh t)^n.\] Chernoff’s bound gives \[\mathbb{P}(X\geq a) \leq \inf_{t\geq 0} e^{-ta}(\cosh t)^n.\] Using \(\cosh t\leq e^{t^2/2}\) for all \(t\), we get \[\mathbb{P}(X\geq a) \leq \inf_{t\geq 0}\exp\left(-ta+\frac{nt^2}{2}\right).\] The minimum occurs at \(t=a/n\), giving \[\mathbb{P}(X\geq a) \leq \exp\left(-\frac{a^2}{2n}\right).\]
8.5 Hoeffding’s Inequality
This section gives a powerful exponential concentration inequality for bounded independent random variables.
8.5.1 Concentration of the sample mean
This subsection compares Chebyshev’s polynomial-rate bound with Hoeffding’s exponential-rate bound.
Recall that if \(X_1,\ldots,X_n\) are iid with variance \(\sigma^2\), then Chebyshev’s inequality applied to the sample mean gives \[\mathbb{P}(|\bar X_n-\mathbb{E}[\bar X_n]|\geq \varepsilon) \leq \frac{\sigma^2}{n\varepsilon^2}.\] This decays like \(1/n\).
Theorem 20 (Hoeffding’s inequality, bounded case). Suppose \(X_1,\ldots,X_n\) are independent and \(0\leq X_i\leq 1\). Let \[\bar X_n=\frac1n\sum_{i=1}^n X_i.\] Then, for any \(\varepsilon>0\), \[\mathbb{P}(|\bar X_n-\mathbb{E}[\bar X_n]|\geq \varepsilon) \leq 2e^{-2n\varepsilon^2}.\] More generally, if \(a_i\leq X_i\leq b_i\), then a scaled version holds.
Why Hoeffding matters Hoeffding’s inequality gives an exponential concentration rate. This is much faster than the Chebyshev rate and is a central tool in modern statistics, statistical learning, high-dimensional problems, nonparametric inference, semiparametric inference, and empirical risk minimization.
Example 21 (Coin-flip sample proportion). Let \(X_1,\ldots,X_n\) be iid Bernoulli\((p)\) random variables and let \(\bar X_n\) be the sample proportion of successes. Bound \[\mathbb{P}(|\bar X_n-p|\geq \varepsilon).\]
Since \(0\leq X_i\leq 1\) and \(\mathbb{E}[\bar X_n]=p\), Hoeffding’s inequality gives \[\mathbb{P}(|\bar X_n-p|\geq \varepsilon) \leq 2e^{-2n\varepsilon^2}.\] For example, if \(\varepsilon=0.05\) and \(n=1000\), then \[2e^{-2(1000)(0.05)^2}=2e^{-5}\approx 0.0135.\] Thus the sample proportion is within \(0.05\) of \(p\) with probability at least \(0.9865\).
Practice Problem 22 (Sample size from Hoeffding). How large should \(n\) be to guarantee \[\mathbb{P}(|\bar X_n-p|\geq \varepsilon)\leq \delta\] for Bernoulli data?
Hoeffding gives \[\mathbb{P}(|\bar X_n-p|\geq \varepsilon) \leq 2e^{-2n\varepsilon^2}.\] We want \[2e^{-2n\varepsilon^2}\leq \delta.\] Taking logs, \[-2n\varepsilon^2\leq \log(\delta/2).\] Thus it is enough to take \[n\geq \frac{1}{2\varepsilon^2}\log\left(\frac2\delta\right).\]
8.6 Cauchy-Schwarz, Holder, and Minkowski Inequalities
This section introduces norm and inner-product inequalities for random variables.
8.6.1 Cauchy-Schwarz inequality
This subsection views random variables as vectors with inner product \(\langle X,Y\rangle=\mathbb{E}[XY]\).
Theorem 23 (Cauchy-Schwarz inequality). Suppose \(X\) and \(Y\) are random variables with finite second moments. Then \[\bigl(\mathbb{E}[XY]\bigr)^2\leq \mathbb{E}[X^2]\mathbb{E}[Y^2].\] Equality holds if and only if \(X=aY\) almost surely for some constant \(a\in\mathbb{R}\).
Proof. For any real number \(t\), \[0\leq \mathbb{E}[(X-tY)^2]=\mathbb{E}[X^2]-2t\mathbb{E}[XY]+t^2\mathbb{E}[Y^2].\] This quadratic in \(t\) is nonnegative for all \(t\), so its discriminant is nonpositive: \[4(\mathbb{E}[XY])^2-4\mathbb{E}[X^2]\mathbb{E}[Y^2]\leq 0.\] Thus \[(\mathbb{E}[XY])^2\leq \mathbb{E}[X^2]\mathbb{E}[Y^2].\] Equality occurs precisely when \(X-tY=0\) almost surely for some \(t\). ◻
The same idea generalizes to random vectors. If \(X,Y\in\mathbb{R}^n\), define \[\langle X,Y\rangle=\mathbb{E}[X^T Y]=\mathbb{E}[X_1Y_1+\cdots+X_nY_n].\] Then Cauchy-Schwarz becomes \[\left(\mathbb{E}[X^TY]\right)^2 \leq \mathbb{E}[X^TX]\,\mathbb{E}[Y^TY].\]
Example 24 (Correlation is between \(-1\) and \(1\)). Use Cauchy-Schwarz to prove \[|\operatorname{Corr}(X,Y)|\leq 1.\] Also characterize equality.
Assume \(\operatorname{Var}(X)>0\) and \(\operatorname{Var}(Y)>0\). Define standardized variables \[U=\frac{X-\mathbb{E}[X]}{\sqrt{\operatorname{Var}(X)}}, \qquad V=\frac{Y-\mathbb{E}[Y]}{\sqrt{\operatorname{Var}(Y)}}.\] Then \[\mathbb{E}[U]=0, \qquad \mathbb{E}[V]=0, \qquad \mathbb{E}[U^2]=\mathbb{E}[V^2]=1,\] and \[\operatorname{Corr}(X,Y)=\mathbb{E}[UV].\] By Cauchy-Schwarz, \[|\mathbb{E}[UV]|\leq \sqrt{\mathbb{E}[U^2]\mathbb{E}[V^2]}=1.\] Equality holds if and only if \(U=aV\) almost surely for some constant \(a\), which is equivalent to \[X=\alpha+\beta Y\] almost surely for constants \(\alpha,\beta\in\mathbb{R}\).
8.6.2 Holder’s inequality
This subsection generalizes Cauchy-Schwarz from second moments to \(p\)-th moments.
Theorem 25 (Holder’s inequality). Let \(X\) and \(Y\) be random variables and let \(p,q>1\) satisfy \[\frac1p+\frac1q=1.\] Then \[\mathbb{E}[|XY|] \leq \left(\mathbb{E}[|X|^p]\right)^{1/p} \left(\mathbb{E}[|Y|^q]\right)^{1/q}.\]
When \(p=q=2\), Holder’s inequality is the Cauchy-Schwarz inequality.
Example 26 (Cauchy-Schwarz as a special case). Show that Holder’s inequality with \(p=q=2\) gives Cauchy-Schwarz.
If \(p=q=2\), then \(1/p+1/q=1/2+1/2=1\). Holder’s inequality gives \[\mathbb{E}[|XY|] \leq \left(\mathbb{E}[|X|^2]\right)^{1/2} \left(\mathbb{E}[|Y|^2]\right)^{1/2}.\] Squaring both sides gives \[(\mathbb{E}[|XY|])^2\leq \mathbb{E}[X^2]\mathbb{E}[Y^2].\] For signed variables, applying this to \(|X|\) and \(|Y|\) yields the usual bound \[|\mathbb{E}[XY]|\leq \mathbb{E}[|XY|]\leq \sqrt{\mathbb{E}[X^2]\mathbb{E}[Y^2]}.\]
8.6.3 Minkowski’s inequality
This subsection gives the triangle inequality for \(L^p\) norms of random variables.
Theorem 27 (Minkowski’s inequality). Let \(X\) and \(Y\) be random variables. For \(1\leq p<\infty\), \[\left(\mathbb{E}[|X+Y|^p]\right)^{1/p} \leq \left(\mathbb{E}[|X|^p]\right)^{1/p} + \left(\mathbb{E}[|Y|^p]\right)^{1/p}.\]
Example 28 (Triangle inequality in \(L^2\)). State Minkowski’s inequality for \(p=2\).
For \(p=2\), Minkowski’s inequality says \[\sqrt{\mathbb{E}[(X+Y)^2]} \leq \sqrt{\mathbb{E}[X^2]}+ \sqrt{\mathbb{E}[Y^2]}.\] This is the triangle inequality for the norm \[\|X\|_2=\sqrt{\mathbb{E}[X^2]}.\]
8.7 Jensen’s Inequality
This section explains how convexity interacts with expectation.
8.7.1 Convex functions
This subsection reviews convexity before stating Jensen’s inequality.
Definition 29 (Convex and concave functions). A function \(g:\mathbb{R}\to\mathbb{R}\) is convex on an interval \([a,b]\) if for all \(x,y\in[a,b]\) and all \(\lambda\in[0,1]\), \[g(\lambda x+(1-\lambda)y) \leq \lambda g(x)+(1-\lambda)g(y).\] A function \(g\) is concave if \(-g\) is convex.
For a convex function, the same property holds for finite convex combinations: \[g\left(\sum_{i=1}^n \lambda_i x_i\right) \leq \sum_{i=1}^n \lambda_i g(x_i),\] where \(\lambda_i\geq 0\) and \(\sum_i\lambda_i=1\).
If \(g\) is twice differentiable, then \(g\) is convex on an interval if \(g''(x)\geq 0\) for all \(x\) in the interval.
Theorem 30 (Convex functions lie above tangents). Let \(g:[a,b]\to\mathbb{R}\) be convex and let \(a<c<b\). Then there exist constants \(A,B\in\mathbb{R}\) such that \[g(c)=Ac+B\] and \[g(x)\geq Ax+B\] for all \(x\in[a,b]\).
8.7.2 Jensen’s inequality
This subsection gives one of the most important inequalities involving expectation.
Theorem 31 (Jensen’s inequality). Suppose \(X\) is a random variable such that \(\mathbb{P}(a\leq X\leq b)=1\). If \(g:\mathbb{R}\to\mathbb{R}\) is convex on \([a,b]\), then \[\mathbb{E}[g(X)]\geq g(\mathbb{E}[X]).\] If \(g\) is concave on \([a,b]\), then \[\mathbb{E}[g(X)]\leq g(\mathbb{E}[X]).\]
Proof. Assume \(g\) is convex. Let \(c=\mathbb{E}[X]\). By the tangent-line property, there exist \(A,B\in\mathbb{R}\) such that \[g(x)\geq Ax+B\] for all \(x\) in the support of \(X\), and \[g(c)=Ac+B.\] Taking expectations gives \[\mathbb{E}[g(X)]\geq \mathbb{E}[AX+B]=A\mathbb{E}[X]+B=Ac+B=g(c)=g(\mathbb{E}[X]).\] The concave case follows by applying the convex result to \(-g\). ◻
Example 32 (Arithmetic-geometric mean inequality). Let \(a_1,\ldots,a_n\) be positive numbers. Use Jensen’s inequality to prove \[\frac1n\sum_{k=1}^{n}a_k \geq \left(\prod_{k=1}^{n}a_k\right)^{1/n}.\]
Let \(X\) be a discrete random variable with \[\mathbb{P}(X=a_k)=\frac1n, \qquad k=1,\ldots,n.\] The function \(g(x)=-\log x\) is convex on \((0,\infty)\). Jensen’s inequality gives \[-\log(\mathbb{E}[X])\leq \mathbb{E}[-\log X].\] Therefore \[-\log\left(\frac1n\sum_{k=1}^{n}a_k\right) \leq -\frac1n\sum_{k=1}^{n}\log a_k.\] Multiplying by \(-1\) reverses the inequality: \[\log\left(\frac1n\sum_{k=1}^{n}a_k\right) \geq \frac1n\sum_{k=1}^{n}\log a_k = \log\left(\prod_{k=1}^{n}a_k\right)^{1/n}.\] Exponentiating gives \[\frac1n\sum_{k=1}^{n}a_k \geq \left(\prod_{k=1}^{n}a_k\right)^{1/n}.\]
Example 33 (Variance is nonnegative). Suppose \(p>1\). Since \(g(x)=x^p\) is convex on appropriate domains, Jensen’s inequality implies moment inequalities. In particular, show that \(\operatorname{Var}(X)\geq 0\).
Take \(g(x)=x^2\), which is convex. Jensen’s inequality gives \[\mathbb{E}[X^2]\geq (\mathbb{E}[X])^2.\] Therefore \[\operatorname{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2\geq 0.\]
Practice Problem 34 (Log expectation). Let \(X>0\) almost surely. Use Jensen’s inequality to compare \(\mathbb{E}[\log X]\) and \(\log\mathbb{E}[X]\).
The function \(\log x\) is concave on \((0,\infty)\). Jensen’s inequality for concave functions gives \[\mathbb{E}[\log X]\leq \log\mathbb{E}[X].\] Equivalently, \[\exp(\mathbb{E}[\log X])\leq \mathbb{E}[X].\] This is the probabilistic version of the geometric mean being no larger than the arithmetic mean.
8.8 Equality Identities for Special Distributions
This section collects useful identities for Poisson, Gamma, Normal, chi-squared, and negative binomial distributions.
The lecture notes emphasize that for some specific distributions, such as normal, chi-squared, Poisson, and negative binomial, there are exact equality relations that can simplify calculations. These identities are especially useful for moments and distributional calculations.
8.8.1 Poisson recursion
This subsection begins with the basic recursion for the Poisson probability mass function.
Proposition 35 (Poisson probability recursion). If \(X\sim \operatorname{Poisson}(\lambda)\), then for \(x=0,1,2,\ldots\), \[\mathbb{P}(X=x+1)=\frac{\lambda}{x+1}\mathbb{P}(X=x).\]
Proof. Since \[\mathbb{P}(X=x)=e^{-\lambda}\frac{\lambda^x}{x!},\] we have \[\mathbb{P}(X=x+1)=e^{-\lambda}\frac{\lambda^{x+1}}{(x+1)!} =\frac{\lambda}{x+1}e^{-\lambda}\frac{\lambda^x}{x!} =\frac{\lambda}{x+1}\mathbb{P}(X=x).\] ◻
8.8.2 Gamma identity
This subsection records a useful relationship between Gamma probabilities with adjacent shape parameters.
Proposition 36 (Gamma identity). Let \(X_{\alpha,\beta}\) denote a Gamma\((\alpha,\beta)\) random variable with density \(f(x\mid\alpha,\beta)\), where \(\alpha>1\). Then for constants \(a<b\), \[\mathbb{P}(a<X_{\alpha,\beta}<b) = \beta\{f(a\mid\alpha,\beta)-f(b\mid\alpha,\beta)\} + \mathbb{P}(a<X_{\alpha-1,\beta}<b),\] under the parameterization used in Casella-Berger style identities.
Remark. Remark 37. This identity is useful because it relates Gamma probabilities with shape \(\alpha\) to Gamma probabilities with shape \(\alpha-1\) plus boundary density terms.
8.8.3 Stein’s lemma
This subsection gives a normal-distribution identity that turns moments into derivative expectations.
Theorem 38 (Stein’s lemma). Let \(X\sim \operatorname{Normal}(\theta,\sigma^2)\) and let \(g\) be differentiable with \(\mathbb{E}[|g'(X)|]<\infty\). Then \[\mathbb{E}[g(X)(X-\theta)]=\sigma^2\mathbb{E}[g'(X)].\]
Example 39 (Computing a normal third moment). Use Stein’s lemma to compute \(\mathbb{E}[X^3]\) when \(X\sim \operatorname{Normal}(\theta,\sigma^2)\).
Write \[X^3=X^2(X-\theta+\theta)=X^2(X-\theta)+\theta X^2.\] Taking expectations, \[\mathbb{E}[X^3]=\mathbb{E}[X^2(X-\theta)]+\theta\mathbb{E}[X^2].\] Apply Stein’s lemma with \(g(x)=x^2\), so \(g'(x)=2x\): \[\mathbb{E}[X^2(X-\theta)]=\sigma^2\mathbb{E}[2X]=2\sigma^2\theta.\] Also, \[\mathbb{E}[X^2]=\operatorname{Var}(X)+(\mathbb{E}[X])^2=\sigma^2+\theta^2.\] Therefore \[\mathbb{E}[X^3]=2\sigma^2\theta+\theta(\sigma^2+\theta^2) =3\theta\sigma^2+\theta^3.\]
8.8.4 Chi-squared identity
This subsection records an identity involving chi-squared random variables with different degrees of freedom.
Proposition 40 (Chi-squared identity). Let \(\chi_p^2\) denote a chi-squared random variable with \(p\) degrees of freedom. If expectations exist, then for any function \(h\), \[\mathbb{E}[h(\chi_p^2)] =p\,\mathbb{E}\left[\frac{h(\chi_{p+2}^2)}{\chi_{p+2}^2}\right].\]
Remark. Remark 41. This identity relates expectations under \(p\) degrees of freedom to expectations under \(p+2\) degrees of freedom. Such identities are often useful in distribution theory and statistical inference.
8.8.5 Hwang-type identities for Poisson and negative binomial variables
This subsection presents identities from the lecture note for Poisson and negative binomial random variables.
Proposition 42 (Poisson identity). Let \(g\) be a function such that the expectations below exist. If \(X\sim \operatorname{Poisson}(\lambda)\), then \[\mathbb{E}[\lambda g(X)]=\mathbb{E}[Xg(X-1)].\]
Proof. Using the Poisson pmf, \[\mathbb{E}[Xg(X-1)] = \sum_{x=1}^{\infty} x g(x-1)e^{-\lambda}\frac{\lambda^x}{x!} = \sum_{x=1}^{\infty} g(x-1)e^{-\lambda}\frac{\lambda^x}{(x-1)!}.\] Let \(u=x-1\). Then \[\mathbb{E}[Xg(X-1)] = \sum_{u=0}^{\infty} g(u)e^{-\lambda}\frac{\lambda^{u+1}}{u!} = \lambda\sum_{u=0}^{\infty}g(u)e^{-\lambda}\frac{\lambda^u}{u!} = \mathbb{E}[\lambda g(X)].\] ◻
Proposition 43 (Negative binomial identity). If \(X\sim \operatorname{NegBinomial}(r,p)\), then under the parameterization used in the lecture note, \[\mathbb{E}[(1-p)g(X)] = \mathbb{E}\left[\frac{X}{r+X-1}g(X-1)\right],\] provided the expectations exist.
Example 44 (Poisson moment from the Poisson identity). If \(X\sim \operatorname{Poisson}(\lambda)\), use the identity with \(g(x)=x^2\) to compute \(\mathbb{E}[X^3]\).
The Poisson identity gives \[\mathbb{E}[\lambda X^2]=\mathbb{E}[X(X-1)^2].\] Expanding the right-hand side, \[X(X-1)^2=X(X^2-2X+1)=X^3-2X^2+X.\] Therefore \[\lambda\mathbb{E}[X^2]=\mathbb{E}[X^3]-2\mathbb{E}[X^2]+\mathbb{E}[X].\] Solving for \(\mathbb{E}[X^3]\) gives \[\mathbb{E}[X^3]=\lambda\mathbb{E}[X^2]+2\mathbb{E}[X^2]-\mathbb{E}[X].\] For \(X\sim\operatorname{Poisson}(\lambda)\), \[\mathbb{E}[X]=\lambda, \qquad \mathbb{E}[X^2]=\operatorname{Var}(X)+(\mathbb{E}X)^2=\lambda+\lambda^2.\] Thus \[\mathbb{E}[X^3] =\lambda(\lambda+\lambda^2)+2(\lambda+\lambda^2)-\lambda =\lambda^3+3\lambda^2+\lambda.\]
Practice Problem 45 (Use Jensen for a moment bound). Let \(X\) be any random variable with finite fourth moment. Prove that \[(\mathbb{E}[X^2])^2\leq \mathbb{E}[X^4].\]
Apply Jensen’s inequality to the convex function \(g(u)=u^2\) and the random variable \(U=X^2\). Then \[\mathbb{E}[g(U)]\geq g(\mathbb{E}[U]).\] This means \[\mathbb{E}[(X^2)^2]\geq (\mathbb{E}[X^2])^2.\] Hence \[\mathbb{E}[X^4]\geq (\mathbb{E}[X^2])^2.\]
8.9 Summary and Study Guide
This final section summarizes the main inequalities and indicates when each one is useful.
| Tool | Statement / idea | Uses |
|---|---|---|
| Boole | \(\mathbb{P}(\cup_i E_i)\leq \sum_i\mathbb{P}(E_i)\) | At least one bad event |
| Inclusion-exclusion | Exact alternating formula for finite unions | Exact union probabilities |
| Bonferroni | Truncated inclusion-exclusion gives upper/lower bounds | Better union bounds |
| Markov | \(\mathbb{P}(X\geq a)\leq \mathbb{E}[X]/a\) for \(X\geq0\) | First moment tail bounds |
| Chebyshev | \(\mathbb{P}(|X-\mathbb{E}X|\geq k)\leq \operatorname{Var}(X)/k^2\) | Variance-based concentration |
| Chernoff | \(\mathbb{P}(X\geq a)\leq \inf_{t\geq0}e^{-ta}M_X(t)\) | Exponential tail bounds |
| Hoeffding | \(\mathbb{P}(|\bar X-\mathbb{E}\bar X|\geq\epsilon)\leq 2e^{-2n\epsilon^2}\) for \(0\leq X_i\leq1\) | Bounded sample means |
| Cauchy-Schwarz | \((\mathbb{E}[XY])^2\leq \mathbb{E}[X^2]\mathbb{E}[Y^2]\) | Correlation and inner products |
| Hölder | \(\mathbb{E}|XY|\leq (\mathbb{E}|X|^p)^{1/p}(\mathbb{E}|Y|^q)^{1/q}\) | General moment control |
| Minkowski | \(\|X+Y\|_p\leq \|X\|_p+\|Y\|_p\) | Triangle inequality for random variables |
| Jensen | \(g(\mathbb{E}X)\leq \mathbb{E}[g(X)]\) for convex \(g\) | Convexity and expectation |
How to choose a bound Use Boole/Bonferroni for unions of events. Use Markov if only a first moment is available. Use Chebyshev if a variance is available. Use Chernoff or Hoeffding for exponential concentration, especially for sums or averages of independent random variables. Use Cauchy-Schwarz/Holder/Minkowski/Jensen when the problem involves expectations of functions, products, or convexity.