3  Chapter 2: Random Variables and Distributions

This chapter turns probability spaces into numerical models. A random variable is a function from outcomes to numbers, and its distribution describes how probability is assigned to those numbers.

Topics. Random variables; cumulative distribution functions; probability mass functions; probability density functions; discrete, continuous, and mixed distributions; common distribution families; exponential families; location-scale families.

3.1 From Outcomes to Numbers

This section begins the transition from probability spaces to random variables: instead of studying raw outcomes directly, we often study numerical summaries of the outcomes.

In Section 1, an experiment was modeled by a probability space \((S,\mathcal B,\mathbb{P})\). The elements of \(S\) can be very concrete, such as coin toss strings, or very complicated, such as continuous paths in Brownian motion. For statistical analysis, we usually need numbers. A random variable is the mathematical object that turns outcomes into numbers.

Learning box

Main idea. A random variable is not “random” by itself. It is a function. Randomness enters because the input outcome \(s\in S\) is random.

3.2 Random Variables

The goal of this section is to define random variables and show how they encode the numerical information we care about in an experiment.

3.2.1 Informal motivation

This subsection explains why random variables are useful before giving the formal definition.

Suppose we flip a coin twice. The sample space is \[S=\{HH,HT,TH,TT\}.\] If our question is “how many heads occurred?”, then we do not really care whether the outcome was \(HT\) or \(TH\); both have exactly one head. We care about a function of the outcome: \[X=\text{number of heads}.\] Thus, \[X(TT)=0,\qquad X(HT)=1,\qquad X(TH)=1,\qquad X(HH)=2.\]

Example 1 (Flipping an unfair coin twice). Suppose a coin has probability \(\phi\) of landing heads and probability \(1-\phi\) of landing tails. The coin is flipped twice independently. Let \(X\) be the number of heads. Find the possible values of \(X\) and the probability of each value.

The sample space is \[S=\{HH,HT,TH,TT\}.\] The random variable \(X\) takes values \(0,1,2\): \[X(TT)=0,\qquad X(HT)=X(TH)=1,\qquad X(HH)=2.\] Using independence, \[\mathbb{P}(X=0)=\mathbb{P}(TT)=(1-\phi)^2,\] \[\mathbb{P}(X=1)=\mathbb{P}(HT)+\mathbb{P}(TH)=\phi(1-\phi)+(1-\phi)\phi=2\phi(1-\phi),\] \[\mathbb{P}(X=2)=\mathbb{P}(HH)=\phi^2.\] These probabilities add to \(1\): \[(1-\phi)^2+2\phi(1-\phi)+\phi^2=1.\]
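The calculation in Example 1 can be checked by brute-force enumeration. The following is a minimal Python sketch, not part of the original notes; the function name `heads_pmf` and the value \(\phi=0.3\) are illustrative assumptions.

```python
from itertools import product

def heads_pmf(phi):
    """pmf of X = number of heads in two independent flips with P(H) = phi."""
    pmf = {0: 0.0, 1: 0.0, 2: 0.0}
    for outcome in product("HT", repeat=2):  # HH, HT, TH, TT
        # Independence: the probability of an outcome is the product
        # of the per-flip probabilities.
        prob = 1.0
        for flip in outcome:
            prob *= phi if flip == "H" else 1 - phi
        pmf[outcome.count("H")] += prob
    return pmf

phi = 0.3  # arbitrary illustrative value, not from the text
pmf = heads_pmf(phi)
print(pmf)
print(sum(pmf.values()))  # should be 1 up to floating-point rounding
```

Each entry of `pmf` should agree with the closed forms \((1-\phi)^2\), \(2\phi(1-\phi)\), and \(\phi^2\) derived above.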

3.2.2 Formal definition

This subsection gives the formal definition: a random variable is a function from the sample space to the real line.

Definition 2 (Random variable). Let \((S,\mathcal B,\mathbb{P})\) be a probability space. A real-valued random variable is a function \[X:S\to \mathbb{R}.\] More precisely, in measure-theoretic probability, \(X\) must be measurable, meaning that events of the form \(\{s\in S:X(s)\le x\}\) are measurable. In this course, we will use this condition implicitly.

Example 3 (Sum of two dice). Toss two fair six-sided dice. Let \[X=\text{sum of the numbers}.\] Find the sample space, the range of \(X\), and \(\mathbb{P}(X=7)\).

The sample space is \[S=\{(i,j):i,j\in\{1,2,3,4,5,6\}\},\] with \(36\) equally likely outcomes. The random variable is \[X(i,j)=i+j.\] The range is \[\operatorname{Range}(X)=\{2,3,4,5,6,7,8,9,10,11,12\}.\] The event \(\{X=7\}\) consists of \[(1,6),(2,5),(3,4),(4,3),(5,2),(6,1),\] so \[\mathbb{P}(X=7)=\frac{6}{36}=\frac16.\]
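As a sketch of the same idea in code, the random variable can be written literally as a function on the sample space, and its distribution obtained by adding the probabilities of the outcomes that map to each value. The names `sample_space`, `X`, and `pmf_sum` are illustrative choices, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
sample_space = list(product(faces, repeat=2))  # 36 equally likely pairs (i, j)

def X(outcome):
    """The random variable: a plain function from outcomes to numbers."""
    i, j = outcome
    return i + j

# Push the uniform probability on S forward through X.
pmf_sum = {}
for s in sample_space:
    pmf_sum[X(s)] = pmf_sum.get(X(s), Fraction(0)) + Fraction(1, len(sample_space))

print(sorted(pmf_sum))  # the range of X
print(pmf_sum[7])       # P(X = 7)
```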

Example 4 (Fertilizer and crop yield). Suppose an agricultural experiment applies different amounts of fertilizer to corn plants. Let \[X=\text{yield per acre}.\] Explain why \(X\) is a random variable.

The sample space contains possible outcomes of the agricultural experiment: weather conditions, soil response, plant growth, measurement noise, and other uncontrolled factors. The quantity \(X\) assigns to each outcome a real number, the yield per acre. Therefore \(X:S\to\mathbb{R}\) is a random variable. In applications, \(X\) would usually be modeled as continuous because yield can vary over an interval of values.

3.2.3 Indicator random variables

This subsection introduces one of the most useful random variables in probability: the indicator of an event.

Definition 5 (Indicator random variable). For an event \(E\subseteq S\), the indicator random variable of \(E\) is \[\mathbf{1}_E(s)= \begin{cases} 1, & s\in E,\\ 0, & s\notin E. \end{cases}\] It records whether the event \(E\) occurred.

Example 6 (Battery lifetime). Suppose an experiment records how long a battery operates before wearing down. We are only interested in whether the battery lasts at least two years. Define \[E=\{\text{battery lasts at least two years}\}.\] Construct an indicator random variable for \(E\).

Define \[I=\mathbf{1}_E= \begin{cases} 1, & \text{if the lifetime of the battery is two or more years},\\ 0, & \text{otherwise}. \end{cases}\] The variable \(I\) has range \(\{0,1\}\). It reduces a potentially continuous lifetime measurement to a binary success/failure variable.

Practice Problem 7 (Indicator for a die event). Roll one fair die and let \(E=\{\text{the result is even}\}\). Define \(\mathbf{1}_E\) and find \(\mathbb{P}(\mathbf{1}_E=1)\).

The sample space is \(S=\{1,2,3,4,5,6\}\). The event is \(E=\{2,4,6\}\). The indicator is \[\mathbf{1}_E(s)= \begin{cases} 1, & s\in\{2,4,6\},\\ 0, & s\in\{1,3,5\}. \end{cases}\] Therefore \[\mathbb{P}(\mathbf{1}_E=1)=\mathbb{P}(E)=\frac{3}{6}=\frac12.\]

3.3 Probability Distributions

This section explains how the probability measure on the sample space induces probabilities for the values of a random variable.

A random variable \(X\) has two essential ingredients:

  1. the set of possible values, \(\operatorname{Range}(X)\);

  2. the probabilities associated with those values or ranges of values.

The distribution of \(X\) describes how probability is assigned to values of \(X\).

Definition 8 (Probability distribution). The probability distribution of a random variable \(X\) is the rule that assigns probabilities to events involving \(X\), such as \(\{X\le x\}\), \(\{X=x\}\), or \(\{a\le X\le b\}\).

Warning

Notation warning. Strictly speaking, for continuous random variables, \(\mathbb{P}(X=x)=0\) for every single point \(x\). Therefore a probability density \(f_X(x)\) is not equal to \(\mathbb{P}(X=x)\); it is a density whose area gives probability.

3.4 Cumulative Distribution Functions

This section introduces the cumulative distribution function, which is the most general way to describe the distribution of a real-valued random variable.

3.4.1 Definition of the CDF

This subsection defines the CDF and explains its meaning as an accumulated probability.

Definition 9 (Cumulative distribution function). The cumulative distribution function (CDF) of a random variable \(X\) is \[F_X(x)=\mathbb{P}(X\le x)=\mathbb{P}(\{s\in S:X(s)\le x\}).\] When there is no confusion, we write \(F(x)\) instead of \(F_X(x)\).

The CDF answers the question: “What is the probability that \(X\) is no larger than \(x\)?”

Example 10 (CDF for number of heads in two fair tosses). Flip a fair coin twice and let \(X\) be the number of heads. Find the CDF \(F_X(x)\).

From the sample space \(\{HH,HT,TH,TT\}\), \[\mathbb{P}(X=0)=\frac14,\qquad \mathbb{P}(X=1)=\frac24=\frac12,\qquad \mathbb{P}(X=2)=\frac14.\] Therefore \[F_X(x)= \begin{cases} 0, & x<0,\\[2mm] \frac14, & 0\le x<1,\\[2mm] \frac34, & 1\le x<2,\\[2mm] 1, & x\ge 2. \end{cases}\] The CDF is a step function because \(X\) is discrete.
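The step-function shape can be reproduced with a short Python sketch that builds the CDF by accumulating the pmf. The dictionary `pmf` and the function `cdf` are illustrative names introduced here.

```python
from fractions import Fraction

# pmf of the number of heads in two fair tosses
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x, pmf=pmf):
    """F_X(x) = P(X <= x): add up the pmf over all values t <= x."""
    return sum((p for t, p in pmf.items() if t <= x), Fraction(0))

for x in (-0.5, 0, 0.99, 1, 1.5, 2, 10):
    print(x, cdf(x))
```

Evaluating at points just below and at each integer shows the jumps at \(0\), \(1\), and \(2\) and the flat pieces in between.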

3.4.2 Properties of CDFs

This subsection lists the structural properties that every CDF must satisfy.

Theorem 11 (Basic properties of a CDF). For any random variable \(X\), its CDF \(F_X\) satisfies:

  1. \(\displaystyle \lim_{x\to -\infty}F_X(x)=0\);

  2. \(\displaystyle \lim_{x\to +\infty}F_X(x)=1\);

  3. \(F_X\) is nondecreasing;

  4. \(F_X\) is right-continuous: \(\displaystyle \lim_{x\downarrow x_0}F_X(x)=F_X(x_0)\).

We give the intuition for each property. As \(x\to-\infty\), the event \(\{X\le x\}\) shrinks toward the empty set, so its probability tends to \(0\). As \(x\to+\infty\), the event \(\{X\le x\}\) grows toward the whole sample space, so its probability tends to \(1\). If \(x_1<x_2\), then \(\{X\le x_1\}\subseteq\{X\le x_2\}\), so monotonicity of probability gives \(F_X(x_1)\le F_X(x_2)\). Right-continuity is a consequence of continuity of probability for decreasing sequences of events.

3.4.3 Discrete and continuous random variables

This subsection distinguishes the two main types of random variables encountered in the course.

Definition 12 (Discrete random variable). A random variable \(X\) is discrete if its range is finite or countably infinite. In the examples of this course, its CDF is then a step function.

Definition 13 (Continuous random variable). In this course, a random variable \(X\) is called continuous if its CDF is continuous. In most examples we study, continuous random variables also have probability density functions.

Learning box

Discrete versus continuous.

  • Discrete random variables assign positive probability to individual points.

  • Continuous random variables assign probability to intervals through area under a density curve.

Practice Problem 14 (Checking a CDF). Is the function \[F(x)= \begin{cases} 0, & x<0,\\ 0.3, & 0\le x<2,\\ 0.8, & 2\le x<5,\\ 1, & x\ge 5 \end{cases}\] a valid CDF? If yes, find the corresponding point probabilities.

The function is nondecreasing, right-continuous, tends to \(0\) as \(x\to -\infty\), and tends to \(1\) as \(x\to\infty\). Hence it is a valid CDF. The jumps give the point probabilities: \[\mathbb{P}(X=0)=0.3-0=0.3,\] \[\mathbb{P}(X=2)=0.8-0.3=0.5,\] \[\mathbb{P}(X=5)=1-0.8=0.2.\]

3.5 Identical Distributions

This section explains what it means for two random variables to have the same probability law, even if they are not the same function on the sample space.

Definition 15 (Identically distributed). Random variables \(X\) and \(Y\) are identically distributed if \[\mathbb{P}(X\in A)=\mathbb{P}(Y\in A)\] for every appropriate set \(A\subseteq\mathbb{R}\). Equivalently, \[F_X(x)=F_Y(x)\qquad \text{for all }x\in\mathbb{R}.\] We write \(X\sim Y\) to indicate that \(X\) and \(Y\) have the same distribution.

Example 16 (Heads and tails in two fair flips). Flip a fair coin twice. Let \(X\) be the number of heads and let \(Y\) be the number of tails. Show that \(X\) and \(Y\) are identically distributed.

The sample space is \(\{HH,HT,TH,TT\}\). For \(X=\) number of heads, \[\mathbb{P}(X=0)=\frac14, \qquad \mathbb{P}(X=1)=\frac12, \qquad \mathbb{P}(X=2)=\frac14.\] For \(Y=\) number of tails, \[\mathbb{P}(Y=0)=\frac14, \qquad \mathbb{P}(Y=1)=\frac12, \qquad \mathbb{P}(Y=2)=\frac14.\] Thus \(X\) and \(Y\) have the same probability mass function and the same CDF. Therefore \(X\sim Y\). Notice that \(X\) and \(Y\) are not the same random variable, because for example on outcome \(HH\), \(X=2\) and \(Y=0\).

3.6 Probability Mass Functions

This section studies the probability mass function, the standard way to describe a discrete random variable.

3.6.1 Definition of the pmf

This subsection defines the pmf and connects it with probabilities of individual values.

Definition 17 (Probability mass function). For a discrete random variable \(X\), the probability mass function (pmf) is \[p_X(x)=\mathbb{P}(X=x)=\mathbb{P}(\{s\in S:X(s)=x\}),\] where \(x\) ranges over the possible values of \(X\).

Theorem 18 (Properties of a pmf). A function \(p_X\) is a pmf if it satisfies \[p_X(x)\ge 0\] for all \(x\), and \[\sum_{x\in\operatorname{Range}(X)}p_X(x)=1.\]

Nonnegativity follows from the nonnegativity axiom of probability. The events \(\{X=x\}\) for \(x\in\operatorname{Range}(X)\) are disjoint and their union is the whole sample space, so countable additivity gives \[1=\mathbb{P}(S)=\sum_{x\in\operatorname{Range}(X)}\mathbb{P}(X=x)=\sum_{x\in\operatorname{Range}(X)}p_X(x).\]

3.6.2 Relation between pmf and CDF

This subsection explains how the jumps of the CDF recover the pmf.

For a discrete random variable, \[F_X(x)=\sum_{t\le x}p_X(t).\] If \(X\) has a possible value \(k\), then \[p_X(k)=F_X(k)-F_X(k^-),\] where \[F_X(k^-)=\lim_{\epsilon\downarrow 0}F_X(k-\epsilon).\]

Example 19 (Recovering a pmf from a CDF). Suppose \[F(x)= \begin{cases} 0, & x<1,\\ 0.2, & 1\le x<3,\\ 0.7, & 3\le x<4,\\ 1, & x\ge 4. \end{cases}\] Find the pmf.

The jumps occur at \(1,3,4\). Therefore \[p(1)=F(1)-F(1^-)=0.2-0=0.2,\] \[p(3)=F(3)-F(3^-)=0.7-0.2=0.5,\] \[p(4)=F(4)-F(4^-)=1-0.7=0.3.\] So \(X\) takes values \(1,3,4\) with probabilities \(0.2,0.5,0.3\).
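The jump formula \(p_X(k)=F_X(k)-F_X(k^-)\) can be sketched numerically by evaluating the CDF slightly to the left of each candidate point. This is an illustrative Python sketch; the helper `jump` and the offset `eps` are assumptions introduced here, and the left limit is approximated rather than computed symbolically.

```python
from fractions import Fraction

def F(x):
    """The step CDF from Example 19, with exact rational values."""
    if x < 1:
        return Fraction(0)
    if x < 3:
        return Fraction(1, 5)
    if x < 4:
        return Fraction(7, 10)
    return Fraction(1)

def jump(cdf, k, eps=Fraction(1, 10**6)):
    """cdf(k) - cdf(k - eps); this equals F(k) - F(k^-) once eps is
    smaller than the gap between neighbouring jump points."""
    return cdf(k) - cdf(k - eps)

pmf = {k: jump(F, k) for k in (1, 3, 4)}
print(pmf)
print(jump(F, 2))  # a point where the CDF is flat carries no mass
```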

3.6.3 Bernoulli and categorical variables

This subsection introduces two basic discrete models: Bernoulli for binary outcomes and categorical for finite multi-class outcomes.

Definition 20 (Bernoulli distribution). A random variable \(X\) has a Bernoulli distribution with parameter \(\phi\), written \[X\sim \operatorname{Bernoulli}(\phi),\] if \(X\in\{0,1\}\) and \[p_X(k)=\phi^k(1-\phi)^{1-k} = \begin{cases} 1-\phi, & k=0,\\ \phi, & k=1. \end{cases}\]

Example 21 (Unfair coin). Flip an unfair coin once, where \(\mathbb{P}(H)=\phi\). Define \(X=1\) if the result is heads and \(X=0\) otherwise. Find the distribution of \(X\).

By definition, \[\mathbb{P}(X=1)=\mathbb{P}(H)=\phi, \qquad \mathbb{P}(X=0)=\mathbb{P}(T)=1-\phi.\] Thus \(X\sim \operatorname{Bernoulli}(\phi)\).

Definition 22 (Categorical distribution). A random variable \(Y\) has a categorical distribution with parameters \((\phi_1,\ldots,\phi_K)\) if \[Y\in\{1,2,\ldots,K\},\qquad \phi_i\ge 0,\qquad \sum_{i=1}^K\phi_i=1,\] and \[\mathbb{P}(Y=i)=\phi_i.\] Equivalently, \[p_Y(y)=\prod_{i=1}^K \phi_i^{\mathbf{1}(y=i)}.\]

Example 23 (Unfair six-sided die). Roll an unfair six-sided die with probabilities \(\phi_1,\ldots,\phi_6\). Write the pmf.

The possible values are \(1,2,3,4,5,6\). The pmf is \[p_Y(y)=\phi_y,\qquad y=1,2,3,4,5,6,\] where \(\phi_i\ge 0\) and \(\sum_{i=1}^6\phi_i=1\). In indicator form, \[p_Y(y)=\phi_1^{\mathbf{1}(y=1)}\phi_2^{\mathbf{1}(y=2)}\cdots \phi_6^{\mathbf{1}(y=6)}.\]
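To see that the indicator-product form is just another way of writing \(p_Y(y)=\phi_y\), here is a small Python sketch. The function `categorical_pmf` and the probability vector `phi` are hypothetical illustrations, not values from the text.

```python
def categorical_pmf(y, phi):
    """Indicator-product form: prod_i phi_i ** 1(y == i), categories 1..K."""
    prob = 1.0
    for i, phi_i in enumerate(phi, start=1):
        # The exponent is 1 only for the observed category, 0 otherwise.
        prob *= phi_i ** (1 if y == i else 0)
    return prob

phi = [0.1, 0.1, 0.1, 0.2, 0.2, 0.3]  # hypothetical loaded die
for y in range(1, 7):
    print(y, categorical_pmf(y, phi), phi[y - 1])
```

Every factor except the one for the observed category equals \(1\), so the product collapses to \(\phi_y\).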

3.7 Probability Density Functions

This section studies the probability density function, the standard way to describe most continuous random variables in this course.

3.7.1 Definition of a pdf

This subsection defines density through the derivative of the CDF and through area under a curve.

Definition 24 (Probability density function). For an absolutely continuous random variable \(X\), the probability density function (pdf) is a function \(f_X\) such that \[F_X(x)=\mathbb{P}(X\le x)=\int_{-\infty}^x f_X(u)\,du.\] When \(F_X\) is differentiable, \[f_X(x)=F_X'(x)=\frac{d}{dx}F_X(x).\]

Theorem 25 (Properties of a pdf). A pdf \(f_X\) satisfies \[f_X(x)\ge 0\] for all \(x\), and \[\int_{-\infty}^{\infty} f_X(x)\,dx=1.\] For an interval \([a,b]\), \[\mathbb{P}(a\le X\le b)=\int_a^b f_X(x)\,dx.\]

The density must be nonnegative because probabilities of intervals are nonnegative. The total area is \(1\) because \[1=\mathbb{P}(-\infty<X<\infty)=\int_{-\infty}^{\infty}f_X(x)\,dx.\] The interval formula follows from \[\mathbb{P}(a\le X\le b)=F_X(b)-F_X(a)=\int_{-\infty}^b f_X(x)\,dx-\int_{-\infty}^a f_X(x)\,dx=\int_a^b f_X(x)\,dx.\] For continuous random variables, endpoint choices do not affect the probability because single points have probability zero.

3.7.2 Logistic CDF and density

This subsection uses the logistic function as an example of deriving a pdf from a CDF.

Example 26 (Logistic distribution). Suppose a continuous random variable \(X\) has CDF \[F_X(x)=\frac{1}{1+e^{-x}}.\] Find its pdf.

Differentiate the CDF: \[f_X(x)=F_X'(x)=\frac{d}{dx}(1+e^{-x})^{-1}.\] By the chain rule, \[f_X(x)=-(1+e^{-x})^{-2}(-e^{-x})=\frac{e^{-x}}{(1+e^{-x})^2}.\] This density is nonnegative and integrates to \[\int_{-\infty}^{\infty}f_X(x)\,dx=F_X(\infty)-F_X(-\infty)=1-0=1.\]
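The derivative can be sanity-checked numerically by comparing the closed-form density with a central finite difference of the CDF. This is a rough Python sketch; the step size `h` and the test points are arbitrary choices.

```python
import math

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_pdf(x):
    """Closed-form density obtained by differentiating the CDF."""
    return math.exp(-x) / (1.0 + math.exp(-x)) ** 2

def central_difference(F, x, h=1e-5):
    """Numerical approximation of F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, logistic_pdf(x), central_difference(logistic_cdf, x))
```

The two columns should agree to many decimal places, which supports the chain-rule computation above.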

3.7.3 Uniform distribution

This subsection introduces the simplest continuous density: constant density on an interval.

Definition 27 (Continuous uniform distribution). A random variable \(X\) has a uniform distribution on \([a,b]\), written \[X\sim \operatorname{Uniform}(a,b),\] if \[f_X(x)= \begin{cases} \dfrac{1}{b-a}, & a\le x\le b,\\[2mm] 0, & x<a\text{ or }x>b. \end{cases}\]

Example 28 (Uniform probability). Let \(X\sim \operatorname{Uniform}(2,8)\). Compute \(\mathbb{P}(3\le X\le 5)\).

The density is \(f_X(x)=1/(8-2)=1/6\) for \(2\le x\le 8\). Therefore \[\mathbb{P}(3\le X\le 5)=\int_3^5\frac16\,dx=\frac{5-3}{6}=\frac13.\]

Practice Problem 29 (Find the uniform CDF). Let \(X\sim \operatorname{Uniform}(a,b)\). Derive the CDF \(F_X(x)\).

For \(x<a\), no mass has accumulated, so \(F_X(x)=0\). For \(a\le x\le b\), \[F_X(x)=\int_a^x \frac{1}{b-a}\,du=\frac{x-a}{b-a}.\] For \(x>b\), all mass has accumulated, so \(F_X(x)=1\). Hence \[F_X(x)= \begin{cases} 0, & x<a,\\[1mm] \dfrac{x-a}{b-a}, & a\le x\le b,\\[2mm] 1, & x>b. \end{cases}\]
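The piecewise CDF translates directly into code. The sketch below, with the illustrative name `uniform_cdf`, also recomputes the probability from Example 28 as a difference of CDF values.

```python
def uniform_cdf(x, a, b):
    """CDF of Uniform(a, b), following the three cases derived above."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

# P(3 <= X <= 5) for X ~ Uniform(2, 8), as F(5) - F(3)
prob = uniform_cdf(5, 2, 8) - uniform_cdf(3, 2, 8)
print(prob)
```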

3.7.4 Mixed distributions

This subsection explains why the CDF is more general than either a pmf alone or a pdf alone.

Example 30 (A random variable with no single pmf/pdf description). Let \(X\) be defined as follows: with probability \(0.5\), \(X\) takes the fixed value \(2\); with probability \(0.5\), \(X\) is drawn uniformly from \([0,1]\). Find the CDF.

For \(x\le 0\), \(F_X(x)=0\). For \(0<x\le 1\), only the uniform part contributes: \[F_X(x)=0.5\cdot x=\frac{x}{2}.\] For \(1<x<2\), the uniform part is fully accumulated, but the point mass at \(2\) has not yet occurred: \[F_X(x)=\frac12.\] For \(x\ge 2\), both parts are included: \[F_X(x)=1.\] Thus \[F_X(x)= \begin{cases} 0, & x\le 0,\\[1mm] \dfrac{x}{2}, & 0<x\le 1,\\[2mm] \dfrac12, & 1<x<2,\\[2mm] 1, & x\ge 2. \end{cases}\] This distribution has a continuous component on \([0,1]\) and a point mass at \(2\), so a single ordinary pdf or pmf does not describe the whole distribution.
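One way to organize this computation is to write the mixed CDF as a weighted average of the two component CDFs. The Python sketch below does this; `mixed_cdf` is an illustrative name, and the check of the jump at \(2\) uses a small offset as an approximation of the left limit.

```python
def mixed_cdf(x):
    """CDF of the mixture: weight 0.5 on Uniform[0, 1], weight 0.5 on the constant 2."""
    uniform_part = min(max(x, 0.0), 1.0)   # CDF of Uniform[0, 1] at x
    point_part = 1.0 if x >= 2 else 0.0    # CDF of the point mass at 2
    return 0.5 * uniform_part + 0.5 * point_part

for x in (-1, 0.5, 1.5, 2, 3):
    print(x, mixed_cdf(x))

# The jump at 2 recovers the point mass; elsewhere the CDF has no jumps.
print(mixed_cdf(2) - mixed_cdf(2 - 1e-9))
```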

3.7.5 Continuous versus absolutely continuous

This subsection clarifies a technical distinction that will usually be suppressed in this course.

Definition 31 (Continuous random variable). A random variable \(X\) is continuous if \[\mathbb{P}(X=x)=0\] for all \(x\in\mathbb{R}\).

Definition 32 (Absolutely continuous random variable). A random variable \(X\) is absolutely continuous if there exists a function \(f\) such that \[\mathbb{P}(X\in A)=\int_A f(x)\,dx\] for every Borel set \(A\).

Absolutely continuous random variables are continuous, but the converse is not always true. The Cantor distribution is a standard counterexample. In this course, almost all continuous random variables we use are absolutely continuous, so we will usually not distinguish the two.

3.8 Common Discrete Distributions

This section collects the main discrete distribution families used throughout probability and statistical inference.

3.8.1 Binomial distribution

This subsection introduces the binomial distribution as the number of successes in independent Bernoulli trials.

Definition 33 (Binomial distribution). Suppose we perform \(n\) independent trials, each with two outcomes: success with probability \(p\) and failure with probability \(1-p\). Let \(X\) be the number of successes. Then \[X\sim \operatorname{Binomial}(n,p),\] and \[\mathbb{P}(X=k)=\binom{n}{k}p^k(1-p)^{n-k},\qquad k=0,1,\ldots,n.\]

Example 34 (Coin flips). Flip a coin \(n\) times, where the probability of heads is \(p\). Let \(X\) be the number of heads. Find \(\mathbb{P}(X=k)\).

To get exactly \(k\) heads, choose which \(k\) of the \(n\) trials are heads. There are \(\binom{n}{k}\) such choices. Each particular sequence with \(k\) heads and \(n-k\) tails has probability \(p^k(1-p)^{n-k}\). Therefore \[\mathbb{P}(X=k)=\binom{n}{k}p^k(1-p)^{n-k}.\]

Example 35 (Airline overbooking). An airline knows that \(5\%\) of people do not show up for a flight. It sells \(52\) tickets for a plane with \(50\) seats. Assuming passengers show up independently, what is the probability that nobody is bumped?

Let \(X\) be the number of passengers who show up. Then \[X\sim \operatorname{Binomial}(52,0.95).\] Nobody is bumped if \(X\le 50\). Thus \[\mathbb{P}(\text{nobody bumped})=\mathbb{P}(X\le 50)=1-\mathbb{P}(X=51)-\mathbb{P}(X=52).\] Compute \[\mathbb{P}(X=51)=\binom{52}{51}(0.95)^{51}(0.05),\] \[\mathbb{P}(X=52)=(0.95)^{52}.\] Therefore \[\mathbb{P}(X\le 50)=1-52(0.95)^{51}(0.05)-(0.95)^{52}\approx 0.7405.\] So the probability that nobody is bumped is about \(74.05\%\).
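The arithmetic can be reproduced with the standard library. The following Python sketch computes the probability both by summing the pmf over \(0,\ldots,50\) and by the complement used above; `binomial_pmf` is an illustrative helper name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p, seats = 52, 0.95, 50

# Direct sum over all values where nobody is bumped.
prob_nobody_bumped = sum(binomial_pmf(k, n, p) for k in range(seats + 1))

# Complement: subtract the two overbooked cases.
complement = 1 - binomial_pmf(51, n, p) - binomial_pmf(52, n, p)

print(prob_nobody_bumped, complement)
```

Both routes should agree up to floating-point rounding; the complement is simply less work by hand.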

3.8.2 Multinomial distribution

This subsection generalizes the binomial distribution from two categories to several categories.

Definition 36 (Multinomial distribution). Suppose each of \(n\) independent trials results in one of \(m\) categories \(O_1,\ldots,O_m\) with probabilities \(\phi_1,\ldots,\phi_m\), where \(\sum_{i=1}^m\phi_i=1\). Let \(X_i\) be the number of times category \(O_i\) appears. Then \[(X_1,\ldots,X_m)\sim \operatorname{Multinomial}(n;\phi_1,\ldots,\phi_m),\] and for \(n_1+\cdots+n_m=n\), \[\mathbb{P}(X_1=n_1,\ldots,X_m=n_m)=\frac{n!}{n_1!\cdots n_m!}\phi_1^{n_1}\cdots\phi_m^{n_m}.\]

Example 37 (Rolling an \(m\)-sided die). A possibly unfair \(m\)-sided die has probabilities \(\phi_1,\ldots,\phi_m\). Roll it \(n\) times. Find the probability of seeing side \(i\) exactly \(n_i\) times for all \(i\).

The vector of counts has a multinomial distribution. Thus, if \(n_1+\cdots+n_m=n\), \[\mathbb{P}(X_1=n_1,\ldots,X_m=n_m)=\frac{n!}{n_1!\cdots n_m!}\phi_1^{n_1}\cdots\phi_m^{n_m}.\] The coefficient counts how many sequences have those category counts.

3.8.3 Geometric distribution

This subsection introduces the geometric distribution as the waiting time until the first success.

Definition 38 (Geometric distribution). If independent Bernoulli trials have success probability \(p\), and \(X\) is the trial number on which the first success occurs, then \[X\sim \operatorname{Geometric}(p),\] and \[\mathbb{P}(X=n)=(1-p)^{n-1}p, \qquad n=1,2,3,\ldots.\]

Example 39 (First heads). A coin has probability \(p\) of heads. Flip until the first heads appears. Find the probability that the first heads appears on the fifth flip.

For the first heads to appear on flip \(5\), the first four flips must be tails and the fifth must be heads. Therefore \[\mathbb{P}(X=5)=(1-p)^4p.\]

Theorem 40 (CDF of a geometric random variable). If \(X\sim\operatorname{Geometric}(p)\), then for integer \(x\ge 1\), \[\mathbb{P}(X\le x)=1-(1-p)^x.\]

Using the geometric sum, \[\mathbb{P}(X\le x)=\sum_{i=1}^x\mathbb{P}(X=i)=\sum_{i=1}^x(1-p)^{i-1}p.\] Thus \[\mathbb{P}(X\le x)=p\frac{1-(1-p)^x}{1-(1-p)}=1-(1-p)^x.\]
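A quick numerical check of the closed form, as a Python sketch with an arbitrary illustrative value \(p=0.3\):

```python
def geometric_pmf(n, p):
    """P(X = n): n - 1 failures followed by one success."""
    return (1 - p) ** (n - 1) * p

def geometric_cdf(x, p):
    """Closed form from Theorem 40, for integer x >= 1."""
    return 1 - (1 - p) ** x

p = 0.3  # illustrative value, not from the text
for x in (1, 5, 20):
    partial_sum = sum(geometric_pmf(i, p) for i in range(1, x + 1))
    print(x, partial_sum, geometric_cdf(x, p))
```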

3.8.4 Poisson distribution

This subsection introduces the Poisson distribution for counts of rare or randomly occurring events in time or space.

Definition 41 (Poisson distribution). A random variable \(X\) has a Poisson distribution with rate \(\lambda>0\), written \[X\sim \operatorname{Poisson}(\lambda),\] if \[\mathbb{P}(X=k)=\frac{\lambda^k e^{-\lambda}}{k!}, \qquad k=0,1,2,\ldots.\]

Example 42 (Telephone calls). A telephone operator handles on average \(5\) calls every \(3\) minutes. Model calls by a Poisson process. What is the probability of no calls in the next minute? What is the probability of at least two calls in the next minute?

The average rate per minute is \[\lambda=\frac{5}{3}.\] Let \(X\) be the number of calls in the next minute. Then \(X\sim\operatorname{Poisson}(5/3)\). The probability of no calls is \[\mathbb{P}(X=0)=e^{-5/3}\approx 0.1889.\] The probability of at least two calls is \[\mathbb{P}(X\ge 2)=1-\mathbb{P}(X=0)-\mathbb{P}(X=1) =1-e^{-5/3}-\frac{5}{3}e^{-5/3} \approx 0.4963.\]
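These two numbers can be recomputed with a few lines of Python; `poisson_pmf` is an illustrative helper name.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

lam = 5 / 3  # calls per minute
p_no_calls = poisson_pmf(0, lam)
p_at_least_two = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
print(round(p_no_calls, 4), round(p_at_least_two, 4))
```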

Theorem 43 (Poisson approximation to the binomial). If \(n\to\infty\), \(p\to 0\), and \(np=\lambda\) stays constant, then \[\binom{n}{k}p^k(1-p)^{n-k}\to \frac{\lambda^k e^{-\lambda}}{k!}.\]

Set \(p=\lambda/n\). Then \[\binom{n}{k}p^k(1-p)^{n-k} =\frac{n(n-1)\cdots(n-k+1)}{k!}\left(\frac{\lambda}{n}\right)^k\left(1-\frac{\lambda}{n}\right)^{n-k}.\] The first product satisfies \[\frac{n(n-1)\cdots(n-k+1)}{n^k}\to 1,\] and \[\left(1-\frac{\lambda}{n}\right)^{n-k}\to e^{-\lambda}.\] Therefore the limit is \[\frac{\lambda^k}{k!}e^{-\lambda}.\]
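The convergence can also be observed numerically. The sketch below fixes \(\lambda=2\) and \(k=3\) (arbitrary illustrative choices) and lets \(n\) grow with \(p=\lambda/n\).

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

lam, k = 2.0, 3  # illustrative values
target = poisson_pmf(k, lam)

errors = []
for n in (10, 100, 1000, 10000):
    approx = binomial_pmf(k, n, lam / n)  # np = lam is held constant
    errors.append(abs(approx - target))
    print(n, approx, target)
```

The gap between the binomial and Poisson probabilities should shrink as \(n\) increases.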

3.8.5 Discrete uniform distribution

This subsection introduces equal probabilities over a finite set of integer-valued outcomes.

Definition 44 (Discrete uniform distribution). A random variable \(X\) has a discrete uniform distribution on \(\{a,a+1,\ldots,b\}\) if \[\mathbb{P}(X=x)=\frac1n, \qquad x=a,a+1,\ldots,b,\] where \[n=b-a+1.\]

Example 45 (Fair six-sided die). Let \(X\) be the result of rolling a fair six-sided die. Identify the distribution and compute \(\mathbb{P}(X\ge 5)\).

The die roll is discrete uniform on \(\{1,2,3,4,5,6\}\). Therefore \[\mathbb{P}(X=x)=\frac16,\qquad x=1,2,3,4,5,6.\] The event \(X\ge 5\) is \(\{5,6\}\), so \[\mathbb{P}(X\ge 5)=\frac26=\frac13.\]

Example 46 (German tank problem setup). Suppose enemy tanks are numbered \(1,2,\ldots,N\), and a captured tank has serial number \(X\). If the captured tank is equally likely to be any of the \(N\) tanks, write the distribution of \(X\).

The random variable \(X\) is discrete uniform on \(\{1,\ldots,N\}\). Hence \[\mathbb{P}(X=x)=\frac{1}{N},\qquad x=1,2,\ldots,N.\] In the German tank problem, observations from this distribution are used to estimate the unknown maximum \(N\).

3.8.6 Hypergeometric distribution

This subsection introduces sampling without replacement from a finite population.

Definition 47 (Hypergeometric distribution). Suppose an urn contains \(N\) balls, \(M\) of which are red and \(N-M\) of which are green. Draw \(K\) balls without replacement. Let \(X\) be the number of red balls drawn. Then \[\mathbb{P}(X=x)=\frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}},\] for feasible values of \(x\). The mean and variance are \[\mathbb{E}[X]=\frac{KM}{N},\] \[\operatorname{Var}(X)=\frac{KM}{N}\left(\frac{(N-M)(N-K)}{N(N-1)}\right).\]

Example 48 (Urn without replacement). An urn contains \(20\) red balls and \(30\) green balls. Draw \(10\) balls without replacement. What is the probability of drawing exactly \(3\) red balls?

Here \(N=50\), \(M=20\), \(K=10\), and \(x=3\). Therefore \[\mathbb{P}(X=3)=\frac{\binom{20}{3}\binom{30}{7}}{\binom{50}{10}}.\] This counts favorable samples with exactly \(3\) red and \(7\) green divided by all samples of size \(10\).
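The hypergeometric pmf is straightforward to evaluate with exact binomial coefficients. This Python sketch, with the illustrative name `hypergeometric_pmf`, evaluates the probability for this urn and checks that the pmf sums to \(1\) and has mean \(KM/N\).

```python
from math import comb

def hypergeometric_pmf(x, N, M, K):
    """P(X = x): x red balls among K draws without replacement,
    from an urn with M red balls out of N in total."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)

N, M, K = 50, 20, 10
p3 = hypergeometric_pmf(3, N, M, K)

# Here every x in 0..K is feasible because M >= K and N - M >= K.
total = sum(hypergeometric_pmf(x, N, M, K) for x in range(K + 1))
mean = sum(x * hypergeometric_pmf(x, N, M, K) for x in range(K + 1))

print(p3, total, mean)  # mean should match K * M / N
```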

Example 49 (Acceptance sampling). A retailer receives a lot of \(N=25\) machine parts, of which \(M=6\) are defective. A sample of \(K=10\) parts is selected without replacement. What is the probability that all sampled parts are acceptable?

Let \(X\) be the number of defective parts in the sample. Then \(X\) is hypergeometric with \(N=25\), \(M=6\), \(K=10\). All sampled parts are acceptable means \(X=0\). Therefore \[\mathbb{P}(X=0)=\frac{\binom{6}{0}\binom{19}{10}}{\binom{25}{10}}=\frac{92378}{3268760} \approx 0.0283.\] This event is quite unlikely if there are six defectives in the lot.

Practice Problem 50 (With replacement versus without replacement). In the urn example, what distribution would be used if the balls were drawn with replacement?

With replacement, each draw has the same probability of red: \[p=\frac{M}{N}.\] The draws are independent, so the number of red balls in \(K\) draws follows \[X\sim \operatorname{Binomial}\left(K,\frac{M}{N}\right).\] Without replacement gives the hypergeometric distribution; with replacement gives the binomial distribution.

3.8.7 Negative binomial distribution

This subsection introduces the negative binomial distribution as a waiting-count model for a fixed number of successes.

Definition 51 (Negative binomial distribution: failures before \(r\) successes). Consider independent Bernoulli trials with success probability \(p\). Let \(Y\) be the number of failures observed before the \(r\)-th success. Then \[\mathbb{P}(Y=y)=\binom{y+r-1}{y}(1-p)^y p^r, \qquad y=0,1,2,\ldots.\] The mean and variance are \[\mathbb{E}[Y]=\frac{r(1-p)}{p}, \qquad \operatorname{Var}(Y)=\frac{r(1-p)}{p^2}.\]

Definition 52 (Negative binomial distribution: total trials). If \(X\) is the total number of trials needed to obtain \(r\) successes, then \(X=r+Y\) and \[\mathbb{P}(X=n)=\binom{n-1}{r-1}p^r(1-p)^{n-r}, \qquad n=r,r+1,r+2,\ldots.\]

Example 53 (Survey completion). Jim goes door to door asking people to fill out a survey. At each house, there is a \(0.6\) probability that the survey is completed. Jim must collect \(30\) completed surveys. What is the probability that the last survey is completed at the \(n\)-th house?

Let \(X\) be the total number of houses visited to obtain \(r=30\) completed surveys. The last survey is completed at the \(n\)-th house if among the first \(n-1\) houses there were exactly \(29\) completions, and the \(n\)-th house is a completion. Therefore \[\mathbb{P}(X=n)=\binom{n-1}{29}(0.6)^{30}(0.4)^{n-30}, \qquad n=30,31,32,\ldots.\]
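As a rough numerical check, the pmf can be tabulated over a long but finite range of \(n\). In the Python sketch below, the cutoff at \(n=200\) is an assumption made for illustration (the remaining tail is negligible for these parameters), and the comparison value \(r/p\) for the mean follows from \(X=r+Y\) and \(\mathbb{E}[Y]=r(1-p)/p\).

```python
from math import comb

def negbin_total_trials_pmf(n, r, p):
    """P(X = n): the r-th success occurs exactly on trial n."""
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)

r, p = 30, 0.6
# Truncate the infinite range at n = 200 (assumed cutoff).
probs = {n: negbin_total_trials_pmf(n, r, p) for n in range(r, 201)}

total = sum(probs.values())
mean = sum(n * q for n, q in probs.items())
print(total, mean)  # mean should be close to r / p = 50
```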

Theorem 54 (Poisson limit of negative binomial). Under an appropriate limiting regime, the negative binomial distribution can converge to a Poisson distribution. In particular, when \(r\to\infty\), \(p\to 1\), and \[r(1-p)\to \lambda,\] the number of failures before \(r\) successes converges in distribution to \(\operatorname{Poisson}(\lambda)\).

The negative binomial model counts rare failures before many successes. If \(p\to 1\), failures become rare; if \(r\to\infty\) while \(r(1-p)\) stays near \(\lambda\), the expected number of failures remains finite. This is the same rare-event principle behind the Poisson approximation.

3.9 Common Continuous Distributions

This section collects the main continuous distribution families used throughout probability and statistical inference.

3.9.1 Normal distribution

This subsection introduces the normal distribution, the central continuous model in statistics.

Definition 55 (Normal distribution). A random variable \(X\) has a normal distribution with mean \(\mu\) and variance \(\sigma^2\), written \[X\sim \operatorname{Normal}(\mu,\sigma^2),\] if \[f_X(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad x\in\mathbb{R}.\]

The normal distribution is one of the most important distributions because of the Central Limit Theorem: sums and averages of many small independent effects are often approximately normal.

Example 56 (Standard normal probability notation). Let \(Z\sim\operatorname{Normal}(0,1)\). Express \(\mathbb{P}(a\le Z\le b)\) using the standard normal CDF \(\Phi\).

The standard normal CDF is \[\Phi(x)=\mathbb{P}(Z\le x).\] Therefore \[\mathbb{P}(a\le Z\le b)=\mathbb{P}(Z\le b)-\mathbb{P}(Z<a)=\Phi(b)-\Phi(a).\] For a continuous random variable, using \(Z<a\) or \(Z\le a\) gives the same probability.
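The function \(\Phi\) has no elementary closed form, but it can be written through the error function as \(\Phi(x)=\tfrac12\left[1+\operatorname{erf}(x/\sqrt2)\right]\). A minimal Python sketch using the standard library's `math.erf` follows; the names `Phi` and `normal_interval_prob` are illustrative.

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_interval_prob(a, b):
    """P(a <= Z <= b) for Z ~ Normal(0, 1)."""
    return Phi(b) - Phi(a)

print(normal_interval_prob(-1.96, 1.96))  # close to 0.95
print(normal_interval_prob(-1.0, 1.0))
```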

3.9.2 Cauchy distribution

This subsection introduces a heavy-tailed distribution whose mean does not exist.

Definition 57 (Cauchy distribution). A Cauchy random variable with location \(\mu\) and scale \(\sigma>0\) has density \[f_X(x)=\frac{1}{\pi\sigma\left[1+\left(\frac{x-\mu}{\sigma}\right)^2\right]} =\frac{\sigma}{\pi\left[\sigma^2+(x-\mu)^2\right]}.\] The Cauchy distribution has no mean; the parameter \(\mu\) is the median and location parameter.

Example 58 (Kicking a ball at a random angle). A person stands distance \(d\) from a straight line and kicks a ball toward the line. The angle \(\theta\) is uniformly distributed on \((-\pi/2,\pi/2)\). Let \(X\) be the hitting location on the line, measured from the closest point. Show why \(X\) has a Cauchy-type density.

Geometry gives \[\tan\theta=\frac{X}{d}, \qquad X=d\tan\theta.\] Since \(\theta\sim\operatorname{Uniform}(-\pi/2,\pi/2)\), \[f_\theta(\theta)=\frac1\pi.\] Use the transformation \(x=d\tan\theta\), so \[\theta=\arctan(x/d), \qquad \frac{d\theta}{dx}=\frac{d}{d^2+x^2}.\] Thus \[f_X(x)=f_\theta(\arctan(x/d))\left|\frac{d\theta}{dx}\right| =\frac1\pi\frac{d}{d^2+x^2}.\] This is a Cauchy density with location \(0\) and scale \(d\).
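The change of variables can be checked numerically. Since \(X\le x\) exactly when \(\theta\le\arctan(x/d)\), the CDF of \(X\) is \(\left(\arctan(x/d)+\pi/2\right)/\pi\), and its derivative should match the density above. The Python sketch below uses an arbitrary distance \(d=3\); the function names are illustrative.

```python
import math

def kick_cdf(x, d):
    """P(X <= x) = P(theta <= arctan(x/d)) for theta ~ Uniform(-pi/2, pi/2)."""
    theta = math.atan(x / d)
    return (theta + math.pi / 2) / math.pi

def cauchy_pdf(x, d):
    """Cauchy density with location 0 and scale d."""
    return d / (math.pi * (d**2 + x**2))

def central_difference(F, x, h=1e-5):
    return (F(x + h) - F(x - h)) / (2 * h)

d = 3.0  # illustrative distance to the line
for x in (-4.0, 0.0, 2.5):
    numeric = central_difference(lambda t: kick_cdf(t, d), x)
    print(x, numeric, cauchy_pdf(x, d))
```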

3.9.3 Exponential distribution

This subsection introduces the exponential distribution as a waiting-time model.

Definition 59 (Exponential distribution). A random variable \(X\) has an exponential distribution with rate \(\lambda>0\), written \[X\sim\operatorname{Exponential}(\lambda),\] if \[f_X(x)= \begin{cases} \lambda e^{-\lambda x}, & x\ge 0,\\ 0, & x<0. \end{cases}\] The exponential distribution is the continuous analogue of the geometric distribution.

Example 60 (Time until failure). Suppose the lifetime of a device follows \(X\sim\operatorname{Exponential}(\lambda)\). Find \(\mathbb{P}(X>t)\) and \(\mathbb{P}(X\le t)\) for \(t\ge 0\).

For \(t\ge 0\), \[\mathbb{P}(X>t)=\int_t^\infty \lambda e^{-\lambda x}\,dx=e^{-\lambda t}.\] Therefore \[\mathbb{P}(X\le t)=1-\mathbb{P}(X>t)=1-e^{-\lambda t}.\] The survival probability decays exponentially.
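
As a sanity check, the closed-form CDF can be compared with a direct numerical integral of the density. This is a sketch under assumed values \(\lambda=0.5\) and \(t=3\); `midpoint_integral` is a hypothetical helper written for this example, not a library routine.

```python
import math

def exponential_survival(t: float, lam: float) -> float:
    """P(X > t) = exp(-lam * t) for X ~ Exponential(lam), t >= 0."""
    return math.exp(-lam * t)

def exponential_cdf(t: float, lam: float) -> float:
    """P(X <= t) = 1 - exp(-lam * t) for t >= 0."""
    return 1.0 - exponential_survival(t, lam)

def midpoint_integral(f, a: float, b: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

lam, t = 0.5, 3.0
numeric_cdf = midpoint_integral(lambda x: lam * math.exp(-lam * x), 0.0, t)
print(numeric_cdf, exponential_cdf(t, lam))
```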

Learning box

Exponential versus Poisson

  • Poisson distribution: number of events in a time interval.

  • Exponential distribution: waiting time between events or time until the first event.

3.9.4 Laplace distribution

This subsection introduces the double exponential distribution, also called the Laplace distribution.

Definition 61 (Laplace distribution). A random variable \(X\) has a Laplace distribution with location \(\mu\) and scale \(b>0\), written \(X\sim\operatorname{Laplace}(\mu,b)\), if \[f_X(x)=\frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right), \qquad -\infty<x<\infty.\] Its mean and variance are \[\mathbb{E}[X]=\mu, \qquad \operatorname{Var}(X)=2b^2.\]

The Laplace distribution can be viewed as reflecting an exponential distribution around its center. The maximum likelihood estimator of the location \(\mu\) is the sample median.

Example 62 (Difference of exponential variables). Let \(U\) and \(V\) be independent \(\operatorname{Exponential}(\lambda)\) random variables. The difference \(X=U-V\) has a Laplace distribution centered at \(0\). State its scale parameter.

The difference of two independent exponential random variables with common rate \(\lambda\) follows a Laplace distribution with location \(0\) and scale \[b=\frac{1}{\lambda}.\] Thus \[f_X(x)=\frac{\lambda}{2}e^{-\lambda |x|}.\] This explains why the Laplace distribution is sometimes called the double exponential distribution.
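
The claim can be explored by simulation. The sketch below is not a proof: it draws \(U-V\) with an assumed rate \(\lambda=2\) and compares the sample mean and variance with the Laplace values \(0\) and \(2b^2\) from Definition 61. The function names and the seed are illustrative choices.

```python
import random

def simulate_difference(lam: float, n: int, seed: int = 0) -> list[float]:
    """Draw n values of U - V with U, V independent Exponential(lam)."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) - rng.expovariate(lam) for _ in range(n)]

def mean_and_variance(values: list[float]) -> tuple[float, float]:
    """Sample mean and (population-style) sample variance."""
    n = len(values)
    mean = sum(values) / n
    return mean, sum((v - mean) ** 2 for v in values) / n

lam = 2.0
b = 1.0 / lam  # Laplace scale from the example
sample_mean, sample_var = mean_and_variance(simulate_difference(lam, n=100_000))
print(sample_mean, sample_var, 2 * b**2)
```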

3.9.5 Gamma distribution

This subsection introduces the gamma distribution and its connection to waiting times in Poisson processes.

Definition 63 (Gamma distribution). A random variable \(X\) has a gamma distribution with shape \(\alpha>0\) and scale \(\theta>0\), written \[X\sim\operatorname{Gamma}(\alpha,\theta),\] if \[f_X(x;\alpha,\theta)=\frac{x^{\alpha-1}e^{-x/\theta}}{\theta^\alpha\Gamma(\alpha)}, \qquad x\ge 0,\] where the gamma function is \[\Gamma(\alpha)=\int_0^\infty t^{\alpha-1}e^{-t}\,dt.\]

Learning box

Parameterization warning. Some books use the rate parameter \(\beta=1/\theta\) instead of the scale parameter \(\theta\). Under the rate parameterization, \[f_X(x;\alpha,\beta)=\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}, \qquad x\ge 0.\]

Example 64 (Exponential as a special gamma). Show that the exponential distribution is a special case of the gamma distribution.

Take \(\alpha=1\) and \(\theta=1/\lambda\). Since \(\Gamma(1)=1\), \[f_X(x)=\frac{x^{0}e^{-x/(1/\lambda)}}{(1/\lambda)^1\Gamma(1)}=\lambda e^{-\lambda x}, \qquad x\ge 0.\] This is the exponential density with rate \(\lambda\).

Example 65 (Erlang distribution). Explain why \(\operatorname{Gamma}(k,\theta)\) with integer \(k\) is called an Erlang distribution.

If \(k\) is a positive integer, then \[\Gamma(k)=(k-1)!.\] In a Poisson process with rate \(1/\theta\), the waiting time until the \(k\)-th event is the sum of \(k\) independent exponential waiting times, each with rate \(1/\theta\). This sum has the \(\operatorname{Gamma}(k,\theta)\) distribution. A gamma distribution with integer shape \(k\) is called an Erlang distribution.

Theorem 66 (Gamma-Poisson relationship). Let \(X\sim \operatorname{Gamma}(\alpha,\beta)\) under the rate parameterization, with \(\alpha\) a positive integer. Fix \(x>0\) and let \(Y\sim \operatorname{Poisson}(\beta x)\). Then \[\mathbb{P}(X\le x)=\mathbb{P}(Y\ge \alpha).\]

For integer \(\alpha\), \(X\) can be interpreted as the waiting time until the \(\alpha\)-th event in a Poisson process of rate \(\beta\). The event \(\{X\le x\}\) means that by time \(x\), at least \(\alpha\) events have occurred. If \(Y\) is the number of events by time \(x\), then \(Y\sim\operatorname{Poisson}(\beta x)\) and \[\{X\le x\}=\{Y\ge \alpha\}.\] Thus the probabilities are equal.
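
The theorem can be checked numerically for particular values. The following sketch assumes \(\alpha=3\), \(\beta=2\), and \(x=1.5\). It integrates the rate-form gamma density with a simple midpoint rule and compares the result with the Poisson tail probability. Both helper functions are hypothetical names written for this example.

```python
import math

def gamma_cdf_rate(x: float, alpha: int, beta: float, steps: int = 100_000) -> float:
    """P(X <= x) for X ~ Gamma(alpha, rate=beta), by midpoint integration of the density."""
    norm = beta**alpha / math.gamma(alpha)
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += norm * t ** (alpha - 1) * math.exp(-beta * t)
    return total * h

def poisson_upper_tail(alpha: int, mean: float) -> float:
    """P(Y >= alpha) for Y ~ Poisson(mean), as one minus a finite sum."""
    below = sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(alpha))
    return 1.0 - below

alpha, beta, x = 3, 2.0, 1.5
lhs = gamma_cdf_rate(x, alpha, beta)        # P(X <= x)
rhs = poisson_upper_tail(alpha, beta * x)   # P(Y >= alpha)
print(lhs, rhs)
```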

3.9.6 Weibull distribution

This subsection introduces the Weibull distribution as a transformation of an exponential random variable.

Definition 67 (Weibull distribution). If \(X\) is exponential with rate \(1/\beta\) (mean \(\beta\)), written \(X\sim \operatorname{Exponential}(1/\beta)\) in the rate notation of Definition 59, and \[Y=X^{1/\gamma},\] then \(Y\) has a Weibull distribution with parameters \((\gamma,\beta)\) and density \[f_Y(y\mid\gamma,\beta)=\frac{\gamma}{\beta}y^{\gamma-1}e^{-y^\gamma/\beta}, \qquad y>0,\] where \(\gamma>0\) and \(\beta>0\).

Example 68 (Deriving the Weibull density). Let \(X\sim\operatorname{Exponential}(1/\beta)\), that is, exponential with rate \(1/\beta\), so that \(f_X(x)=\frac{1}{\beta}e^{-x/\beta}\) for \(x>0\), and let \(Y=X^{1/\gamma}\). Derive the density of \(Y\).

The transformation is \(X=Y^\gamma\). Then \[\frac{dx}{dy}=\gamma y^{\gamma-1}.\] Thus \[f_Y(y)=f_X(y^\gamma)\left|\frac{dx}{dy}\right| =\frac1\beta e^{-y^\gamma/\beta}\gamma y^{\gamma-1} =\frac{\gamma}{\beta}y^{\gamma-1}e^{-y^\gamma/\beta},\] for \(y>0\).
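
The same transformation gives a simple way to simulate Weibull variables. The sketch below uses assumed values \(\gamma=2\), \(\beta=1.5\), and \(y=1.2\). It compares the empirical CDF of simulated draws with \(\mathbb{P}(Y\le y)=\mathbb{P}(X\le y^\gamma)=1-e^{-y^\gamma/\beta}\), which follows from the exponential CDF. The function names are illustrative.

```python
import math
import random

def weibull_cdf(y: float, gamma: float, beta: float) -> float:
    """P(Y <= y) = P(X <= y**gamma) = 1 - exp(-y**gamma / beta) for y > 0."""
    return 1.0 - math.exp(-(y**gamma) / beta)

def sample_weibull(gamma: float, beta: float, n: int, seed: int = 0) -> list[float]:
    """Draw Y = X**(1/gamma) where X is exponential with mean beta."""
    rng = random.Random(seed)
    # random.expovariate takes a rate, so rate 1/beta gives mean beta.
    return [rng.expovariate(1.0 / beta) ** (1.0 / gamma) for _ in range(n)]

gamma, beta, y = 2.0, 1.5, 1.2
draws = sample_weibull(gamma, beta, n=100_000)
empirical_cdf = sum(d <= y for d in draws) / len(draws)
print(empirical_cdf, weibull_cdf(y, gamma, beta))
```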

3.9.7 Chi-squared distribution

This subsection introduces the chi-squared distribution as a gamma special case and as a sum of squared standard normals.

Definition 69 (Chi-squared distribution). A chi-squared random variable with \(k\) degrees of freedom is a gamma special case: \[\chi_k^2\sim \operatorname{Gamma}\left(\alpha=\frac{k}{2},\theta=2\right).\] Equivalently, if \(Z_1,\ldots,Z_k\) are independent standard normal random variables, then \[Z_1^2+\cdots+Z_k^2\sim \chi_k^2.\]

Example 70 (Sum of squared standard normals). If \(Z_1,Z_2,Z_3\) are independent \(\operatorname{Normal}(0,1)\) random variables, identify the distribution of \[Q=Z_1^2+Z_2^2+Z_3^2.\]

By the definition of a chi-squared distribution, \[Q\sim\chi_3^2.\] Equivalently, \[Q\sim\operatorname{Gamma}\left(\frac32,2\right).\]
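
A quick simulation illustrates the two descriptions agreeing. In scale form, a \(\operatorname{Gamma}(\alpha,\theta)\) variable has mean \(\alpha\theta\) and variance \(\alpha\theta^2\) (the rate-form values \(n/\lambda\) and \(n/\lambda^2\) with \(\lambda=1/\theta\)), so \(Q\) should have mean \(3\) and variance \(6\). The function name, seed, and sample size below are assumptions made for this sketch.

```python
import random

def sample_sum_of_squares(k: int, n: int, seed: int = 0) -> list[float]:
    """Draw n values of Z_1^2 + ... + Z_k^2 with independent standard normals."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k)) for _ in range(n)]

k = 3
alpha, theta = k / 2, 2.0  # gamma shape and scale for chi-squared with k df
draws = sample_sum_of_squares(k, n=100_000)
q_mean = sum(draws) / len(draws)
q_var = sum((q - q_mean) ** 2 for q in draws) / len(draws)
print(q_mean, alpha * theta)       # sample mean vs. gamma mean
print(q_var, alpha * theta**2)     # sample variance vs. gamma variance
```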

3.9.8 Beta distribution

This subsection introduces the beta distribution, a flexible distribution on the interval \((0,1)\).

Definition 71 (Beta distribution). A random variable \(X\) has a beta distribution with parameters \(\alpha>0\) and \(\beta>0\), written \[X\sim\operatorname{Beta}(\alpha,\beta),\] if \[f_X(x;\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}, \qquad 0<x<1.\] Equivalently, \[f_X(x;\alpha,\beta)=\frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1},\] where \[B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}.\]

The beta distribution is often used as a model for proportions and percentages. It is also a conjugate prior for the binomial model.

Example 72 (Uniform as a special beta). Show that \(\operatorname{Beta}(1,1)\) is the uniform distribution on \((0,1)\).

If \(\alpha=\beta=1\), then \[f_X(x)=\frac{\Gamma(2)}{\Gamma(1)\Gamma(1)}x^0(1-x)^0.\] Since \(\Gamma(2)=1! =1\) and \(\Gamma(1)=1\), \[f_X(x)=1, \qquad 0<x<1.\] This is the \(\operatorname{Uniform}(0,1)\) density.
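
The beta density is easy to evaluate directly from Definition 71. The sketch below, with the illustrative name `beta_pdf`, confirms that \(\operatorname{Beta}(1,1)\) is flat on a grid of points and also approximates the mean of an assumed \(\operatorname{Beta}(2,5)\) by a midpoint sum, to compare with \(\alpha/(\alpha+\beta)\).

```python
import math

def beta_pdf(x: float, alpha: float, beta: float) -> float:
    """Beta(alpha, beta) density on (0, 1); zero outside the interval."""
    if not 0.0 < x < 1.0:
        return 0.0
    coeff = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return coeff * x ** (alpha - 1) * (1.0 - x) ** (beta - 1)

# Beta(1, 1) evaluated on a grid inside (0, 1).
flat_values = [beta_pdf(i / 10, 1.0, 1.0) for i in range(1, 10)]
print(flat_values)

# Midpoint approximation of E[X] for Beta(2, 5).
steps = 10_000
approx_mean = sum(
    ((i + 0.5) / steps) * beta_pdf((i + 0.5) / steps, 2.0, 5.0) for i in range(steps)
) / steps
print(approx_mean, 2.0 / (2.0 + 5.0))
```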

3.9.9 Logistic distribution

This subsection revisits the logistic distribution as a flexible continuous distribution and as a model related to logistic regression.

Definition 73 (Logistic distribution). A logistic distribution can be described by the CDF \[F(x)=\frac{1}{1+e^{-\alpha-\beta x}},\] where \(\alpha\in\mathbb{R}\) and \(\beta>0\), with corresponding density \[f(x)=\frac{\beta e^{-\alpha-\beta x}}{(1+e^{-\alpha-\beta x})^2} =\frac{\beta e^{\alpha+\beta x}}{(1+e^{\alpha+\beta x})^2}.\] Another location-scale form is \[F(x)=\frac12+\frac12\tanh\left(\frac{x-\mu}{2s}\right),\] which agrees with the first form when \(\mu=-\alpha/\beta\) and \(s=1/\beta\).

The logistic distribution is central in logistic regression for categorical dependent variables. It has also been used in rating systems, such as chess rating models.

Example 74 (Logit link). Suppose \[\mu=\frac{1}{1+e^{-\eta}}.\] Solve for \(\eta\) in terms of \(\mu\).

Start with \[\mu=\frac{1}{1+e^{-\eta}}.\] Then \[\frac{1}{\mu}=1+e^{-\eta}, \qquad \frac{1-\mu}{\mu}=e^{-\eta}.\] Taking logs gives \[-\eta=\log\left(\frac{1-\mu}{\mu}\right),\] so \[\eta=\log\left(\frac{\mu}{1-\mu}\right).\] This is the logit transformation.
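
The two maps are inverses of each other, which a short round-trip check makes visible. This is a minimal sketch with illustrative function names. It does not guard against overflow for very large negative \(\eta\).

```python
import math

def inverse_logit(eta: float) -> float:
    """mu = 1 / (1 + exp(-eta)), mapping the real line into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(mu: float) -> float:
    """eta = log(mu / (1 - mu)), defined for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))

for eta in (-2.0, 0.0, 0.5, 3.0):
    mu = inverse_logit(eta)
    print(eta, mu, logit(mu))  # the last column recovers eta up to rounding
```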

3.9.10 Lognormal distribution

This subsection introduces a distribution for positive, right-skewed quantities.

Definition 75 (Lognormal distribution). A positive random variable \(X\) has a lognormal distribution if \[\log X\sim \operatorname{Normal}(\mu,\sigma^2).\] Its mean and variance are \[\mathbb{E}[X]=e^{\mu+\sigma^2/2},\] \[\operatorname{Var}(X)=e^{2\mu+\sigma^2}\left(e^{\sigma^2}-1\right).\]

The lognormal distribution is popular for modeling right-skewed variables such as incomes, file sizes, internet comment lengths, and biological measurements.

Example 76 (Median of a lognormal). If \(X\) is lognormal and \(\log X\sim\operatorname{Normal}(\mu,\sigma^2)\), find the median of \(X\).

The median \(m\) satisfies \[\mathbb{P}(X\le m)=\frac12.\] Since \(\log\) is increasing, \[\mathbb{P}(X\le m)=\mathbb{P}(\log X\le \log m).\] For a normal random variable \(Y\sim\operatorname{Normal}(\mu,\sigma^2)\), the median is \(\mu\), so we need \[\log m=\mu.\] Therefore \[m=e^\mu.\]
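
The result can be verified with Python's `statistics.NormalDist`. The sketch assumes \(\mu=0.7\) and \(\sigma=1.2\); the helper names are illustrative. It also shows that the mean \(e^{\mu+\sigma^2/2}\) from Definition 75 exceeds the median, as expected for a right-skewed distribution.

```python
import math
from statistics import NormalDist

def lognormal_cdf(x: float, mu: float, sigma: float) -> float:
    """P(X <= x) = P(log X <= log x) for log X ~ Normal(mu, sigma^2), x > 0."""
    return NormalDist(mu, sigma).cdf(math.log(x))

def lognormal_median(mu: float) -> float:
    """Median of the lognormal distribution: exp(mu)."""
    return math.exp(mu)

def lognormal_mean(mu: float, sigma: float) -> float:
    """Mean of the lognormal distribution: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma**2 / 2)

mu, sigma = 0.7, 1.2
m = lognormal_median(mu)
print(lognormal_cdf(m, mu, sigma))      # should be one half
print(m, lognormal_mean(mu, sigma))     # median is below the mean
```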

3.10 Summary Tables for Common Distributions

This section organizes the common distribution families by their stories, formulas, means, and variances.

3.10.1 Distribution stories

This subsection summarizes how the most common distributions arise in applications.

| Name | Useful story |
| --- | --- |
| \(\operatorname{Bernoulli}(p)\) | Toss a coin with probability \(p\) of heads. \(X=\) number of heads in one toss. |
| \(\operatorname{Binomial}(n,p)\) | Toss a coin with probability \(p\) of heads \(n\) times. \(X=\) number of heads in \(n\) tosses. It is the sum of \(n\) independent Bernoulli variables. |
| \(\operatorname{Geometric}(p)\) | Toss a coin until the first heads. \(X=\) number of tosses up to and including the first heads. |
| \(\operatorname{Poisson}(\lambda)\) | Random calls arrive with rate \(\lambda\). \(X=\) number of calls in one time unit. |
| \(\operatorname{Exponential}(\lambda)\) | Random calls arrive with rate \(\lambda\). \(X=\) time until the first arrival. |
| \(\operatorname{Gamma}(n,\lambda)\), rate form | Random calls arrive with rate \(\lambda\). \(X=\) time until the \(n\)-th arrival. |
| \(\operatorname{Uniform}(a,b)\) | Pick a random number between \(a\) and \(b\). |
| \(\operatorname{Normal}(\mu,\sigma^2)\) | Pick an individual from a large population. \(X=\) height or another aggregate measurement. |
| \(\operatorname{Beta}(\alpha,\beta)\) | Model a random probability or proportion, often after observing successes and failures. |

3.10.2 Pmf/pdf table

This subsection lists common pmfs and pdfs in one place for quick reference.

| Name | pmf/pdf |
| --- | --- |
| \(\operatorname{Bernoulli}(p)\) | \(p_X(k)=p^k(1-p)^{1-k}\), \(k\in\{0,1\}\). |
| \(\operatorname{Binomial}(n,p)\) | \(p_X(k)=\binom nk p^k(1-p)^{n-k}\), \(k=0,\ldots,n\). |
| \(\operatorname{Geometric}(p)\) | \(p_X(k)=(1-p)^{k-1}p\), \(k=1,2,\ldots\). |
| \(\operatorname{Poisson}(\lambda)\) | \(p_X(k)=\dfrac{\lambda^k e^{-\lambda}}{k!}\), \(k=0,1,2,\ldots\). |
| \(\operatorname{Exponential}(\lambda)\) | \(f_X(x)=\lambda e^{-\lambda x}\), \(x\ge 0\). |
| \(\operatorname{Gamma}(\alpha,\theta)\) | \(f_X(x)=\dfrac{x^{\alpha-1}e^{-x/\theta}}{\theta^\alpha\Gamma(\alpha)}\), \(x\ge0\). |
| \(\operatorname{Uniform}(a,b)\) | \(f_X(x)=\dfrac{1}{b-a}\), \(a\le x\le b\). |
| \(\operatorname{Normal}(\mu,\sigma^2)\) | \(f_X(x)=\dfrac{1}{\sqrt{2\pi}\sigma}\exp\left[-\dfrac{(x-\mu)^2}{2\sigma^2}\right]\). |
| \(\operatorname{Beta}(\alpha,\beta)\) | \(f_X(x)=\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}\), \(0<x<1\). |

3.10.3 Mean and variance table

This subsection lists common means and variances, which will be used later in expectation, sampling, and inference.

| Name | Mean | Variance |
| --- | --- | --- |
| \(\operatorname{Bernoulli}(p)\) | \(p\) | \(p(1-p)\) |
| \(\operatorname{Binomial}(n,p)\) | \(np\) | \(np(1-p)\) |
| \(\operatorname{Geometric}(p)\) | \(1/p\) | \((1-p)/p^2\) |
| \(\operatorname{Poisson}(\lambda)\) | \(\lambda\) | \(\lambda\) |
| \(\operatorname{Exponential}(\lambda)\) | \(1/\lambda\) | \(1/\lambda^2\) |
| \(\operatorname{Gamma}(n,\lambda)\), rate form | \(n/\lambda\) | \(n/\lambda^2\) |
| \(\operatorname{Uniform}(a,b)\) | \((a+b)/2\) | \((b-a)^2/12\) |
| \(\operatorname{Normal}(\mu,\sigma^2)\) | \(\mu\) | \(\sigma^2\) |
| \(\operatorname{Beta}(\alpha,\beta)\) | \(\dfrac{\alpha}{\alpha+\beta}\) | \(\dfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\) |
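
For discrete rows, the entries can be checked by summing directly over the pmf. The sketch below does this for an assumed \(\operatorname{Binomial}(10,0.3)\) and a \(\operatorname{Geometric}(0.25)\) truncated at \(2000\) terms (the truncation is an approximation whose neglected tail is negligible here). The helper `mean_and_variance` is a hypothetical name for this example.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Binomial(n, p) pmf at k."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def geometric_pmf(k: int, p: float) -> float:
    """Geometric(p) pmf at k = 1, 2, ... (number of tosses including the first heads)."""
    return (1.0 - p) ** (k - 1) * p

def mean_and_variance(pmf, support) -> tuple[float, float]:
    """Mean and variance computed as sums over the given support."""
    support = list(support)
    mean = sum(k * pmf(k) for k in support)
    second_moment = sum(k * k * pmf(k) for k in support)
    return mean, second_moment - mean**2

n, p = 10, 0.3
bin_mean, bin_var = mean_and_variance(lambda k: binomial_pmf(k, n, p), range(n + 1))
print(bin_mean, n * p, bin_var, n * p * (1 - p))

q = 0.25
geo_mean, geo_var = mean_and_variance(lambda k: geometric_pmf(k, q), range(1, 2001))
print(geo_mean, 1 / q, geo_var, (1 - q) / q**2)
```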

3.11 Exponential Families

This section introduces the exponential family, a unifying framework that includes many common distributions and prepares students for generalized linear models and sufficient statistics.

3.11.1 Motivation and examples

This subsection explains why exponential families are important and lists common members.

The exponential family includes many distributions used in statistics: Gaussian, Bernoulli, binomial, multinomial, Poisson, exponential, gamma, beta, von Mises, Dirichlet, Weibull, and Wishart distributions. Some of these are exponential families only when certain parameters are held fixed, such as the binomial with a fixed number of trials, the multinomial with a fixed total count, or the Weibull with a fixed shape parameter. Some common distributions are not exponential families, such as Student’s \(t\), most mixture distributions, and uniform distributions with unknown bounds.

3.11.2 Definition

This subsection gives the mathematical form of an exponential family distribution.

Definition 77 (Exponential family). A family of densities or probability mass functions is an exponential family if it can be written in the form \[p(y;\eta)=h(y)\exp\left(\eta^T T(y)-A(\eta)\right).\] Here:

  • \(\eta\in\mathbb{R}^d\) is the natural parameter;

  • \(T(y)\in\mathbb{R}^d\) is the sufficient statistic;

  • \(h(y)\) is the base measure;

  • \(A(\eta)\) is the log partition function or log normalizer.

The log partition function is defined by \[A(\eta)=\log\int h(y)\exp\left(\eta^T T(y)\right)\,dy,\] with sums replacing integrals in the discrete case. It makes the total probability equal to \(1\).

Another common form is \[p(y;\eta)=\exp\left(\eta^T T(y)-A(\eta)+C(y)\right),\] where \(C(y)=\log h(y)\).

For generalized linear models, a dispersion parameter \(\phi\) is often included: \[p(y;\eta,\phi)=\exp\left(\frac{\eta^T T(y)-A(\eta)}{\phi}+C(y,\phi)\right).\]

3.11.4 Moments of exponential families

This subsection explains why the log partition function is so useful: its derivatives give moments of the sufficient statistic.

Theorem 88 (Moments from the log partition function). For an exponential family \[p(y;\eta)=h(y)\exp(\eta^T T(y)-A(\eta)),\] we have \[\nabla_\eta A(\eta)=\mathbb{E}_\eta[T(Y)]\] and \[\nabla_\eta^2 A(\eta)=\operatorname{Cov}_\eta(T(Y)).\] In the one-dimensional case, \[A'(\eta)=\mathbb{E}[T(Y)], \qquad A''(\eta)=\operatorname{Var}(T(Y)).\]

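In one dimension, \[A(\eta)=\log\int h(y)e^{\eta T(y)}\,dy.\] Differentiate, assuming the derivative can be taken inside the integral: \[A'(\eta)=\frac{\int T(y)h(y)e^{\eta T(y)}\,dy}{\int h(y)e^{\eta T(y)}\,dy}.\] The denominator equals \(e^{A(\eta)}\). Since \[p(y;\eta)=h(y)e^{\eta T(y)-A(\eta)},\] the derivative becomes \[A'(\eta)=\int T(y)p(y;\eta)\,dy=\mathbb{E}[T(Y)].\] To differentiate again, note that \[\frac{\partial}{\partial\eta}p(y;\eta)=\left(T(y)-A'(\eta)\right)p(y;\eta),\] so \[A''(\eta)=\int T(y)\left(T(y)-A'(\eta)\right)p(y;\eta)\,dy=\mathbb{E}[T(Y)^2]-(\mathbb{E}[T(Y)])^2=\operatorname{Var}(T(Y)).\] The multidimensional version replaces derivatives by gradients and Hessians.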

Example 89 (Bernoulli moments from \(A(\eta)\)). For Bernoulli in canonical form, \[A(\eta)=\log(1+e^\eta).\] Use derivatives of \(A\) to find \(\mathbb{E}[Y]\) and \(\operatorname{Var}(Y)\).

First, \[A'(\eta)=\frac{e^\eta}{1+e^\eta}=\frac{1}{1+e^{-\eta}}=\mu.\] Therefore \[\mathbb{E}[Y]=\mu.\] Second, \[A''(\eta)=\frac{e^\eta}{(1+e^\eta)^2}=\mu(1-\mu).\] Thus \[\operatorname{Var}(Y)=\mu(1-\mu).\]
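
The two derivatives can also be approximated by finite differences, which gives a numerical check of the algebra. This is a sketch with an assumed value \(\eta=0.8\) and assumed step sizes. The helper names are illustrative, and central differences are only one of several ways to approximate derivatives.

```python
import math

def log_partition(eta: float) -> float:
    """Bernoulli log partition function A(eta) = log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x: float, h: float = 1e-5) -> float:
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x: float, h: float = 1e-4) -> float:
    """Central-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

eta = 0.8
mu = 1.0 / (1.0 + math.exp(-eta))
print(first_derivative(log_partition, eta), mu)                # approximates E[Y]
print(second_derivative(log_partition, eta), mu * (1.0 - mu))  # approximates Var(Y)
```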

3.12 Location and Scale Families

This section introduces another unifying idea: many distributions are obtained by shifting and rescaling a standard distribution.

3.12.1 Definition

This subsection defines the location-scale family through a transformation of a standard random variable.

Definition 90 (Location-scale family). Let \(X\) be a random variable with CDF \(F\) and pdf \(f\). For constants \(\mu\in\mathbb{R}\) and \(\sigma>0\), define \[Y=\mu+\sigma X.\] Then \(Y\) belongs to the location-scale family generated by \(X\). Its pdf is \[f_Y(y)=\frac1\sigma f\left(\frac{y-\mu}{\sigma}\right).\]

The formula follows from the change of variables \[x=\frac{y-\mu}{\sigma}, \qquad \frac{dx}{dy}=\frac1\sigma.\] Therefore \[f_Y(y)=f_X\left(\frac{y-\mu}{\sigma}\right)\left|\frac1\sigma\right|.\]

3.12.2 Examples of location-scale families

This subsection lists familiar distributions that fit the location-scale framework.

Examples include:

  • normal distribution;

  • Cauchy distribution;

  • continuous uniform distribution;

  • discrete uniform distribution in certain shifted/scaled forms;

  • logistic distribution;

  • Laplace distribution;

  • Student’s \(t\) distribution.

Example 91 (Normal location-scale transformation). If \(Z\sim\operatorname{Normal}(0,1)\) and \(Y=\mu+\sigma Z\), find the distribution of \(Y\).

The standard normal density is \[f_Z(z)=\frac1{\sqrt{2\pi}}e^{-z^2/2}.\] Using the location-scale formula, \[f_Y(y)=\frac1\sigma \frac1{\sqrt{2\pi}}\exp\left[-\frac12\left(\frac{y-\mu}{\sigma}\right)^2\right] =\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right].\] Thus \[Y\sim\operatorname{Normal}(\mu,\sigma^2).\]
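
The location-scale formula translates directly into a small higher-order function. The sketch below, with the hypothetical name `location_scale_pdf`, builds \(f_Y\) from any standard density \(f\) and compares the normal case with `statistics.NormalDist` at an assumed point \(y=3.2\), using \(\mu=1.5\) and \(\sigma=2\).

```python
import math
from statistics import NormalDist
from typing import Callable

def location_scale_pdf(
    f: Callable[[float], float], mu: float, sigma: float
) -> Callable[[float], float]:
    """Return the pdf of Y = mu + sigma * X, given the pdf f of X and sigma > 0."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return lambda y: f((y - mu) / sigma) / sigma

def std_normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

mu, sigma = 1.5, 2.0
pdf_y = location_scale_pdf(std_normal_pdf, mu, sigma)
print(pdf_y(3.2), NormalDist(mu, sigma).pdf(3.2))
```

The same function applies to the other families listed above once their standard densities are supplied.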

Practice Problem 92 (Uniform location-scale transformation). Let \(X\sim\operatorname{Uniform}(0,1)\) and \(Y=a+(b-a)X\) where \(a<b\). Find the distribution of \(Y\).

Here \(\mu=a\) and \(\sigma=b-a\). Since \(X\) is uniform on \([0,1]\), the transformation maps \([0,1]\) to \([a,b]\). The density is \[f_Y(y)=\frac1{b-a}, \qquad a\le y\le b.\] Therefore \[Y\sim\operatorname{Uniform}(a,b).\]

3.13 Additional Practice Problems

This section provides extra problems with solutions to reinforce the core skills of Section 2.

Practice Problem 93 (Classify random variables). Classify each random variable as discrete, continuous, or mixed.

  1. Number of emails received in one hour.

  2. Time until the next email arrives.

  3. A variable that equals \(0\) with probability \(0.3\) and otherwise is uniform on \([1,2]\).

(1) The number of emails is a count, so it is discrete. A Poisson model may be appropriate.

(2) Waiting time is measured on a continuum, so it is continuous. An exponential model may be appropriate.

(3) This variable has a point mass at \(0\) and a continuous component on \([1,2]\), so it is mixed.

Practice Problem 94 (From density to probability). Let \[f(x)= \begin{cases} cx, & 0\le x\le 2,\\ 0, & \text{otherwise}. \end{cases}\] Find \(c\) and compute \(\mathbb{P}(1\le X\le 2)\).

The total area must be \(1\): \[1=\int_0^2 cx\,dx=c\left[\frac{x^2}{2}\right]_0^2=2c.\] Thus \(c=1/2\). Then \[\mathbb{P}(1\le X\le 2)=\int_1^2 \frac{x}{2}\,dx=\left[\frac{x^2}{4}\right]_1^2=1-\frac14=\frac34.\]
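
Because the integrand is a polynomial, the arithmetic can be reproduced exactly with rational numbers. The sketch below uses the antiderivative \(x^2/2\) of \(x\). The helper `integral_of_x` is a hypothetical name for this example.

```python
from fractions import Fraction

def integral_of_x(a: int, b: int) -> Fraction:
    """Exact integral of x over [a, b], using the antiderivative x^2 / 2."""
    return (Fraction(b) ** 2 - Fraction(a) ** 2) / 2

# Normalizing constant: c times the integral over [0, 2] must equal 1.
c = 1 / integral_of_x(0, 2)

# P(1 <= X <= 2) is c times the integral of x over [1, 2].
prob = c * integral_of_x(1, 2)
print(c, prob)
```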

Practice Problem 95 (Choosing the right distribution). For each story, choose a reasonable distribution.

  1. Count the number of defective items in \(20\) independent items, each defective with probability \(0.02\).

  2. Count the number of arrivals in one minute when arrivals occur at average rate \(3\) per minute.

  3. Time until the first arrival when arrivals occur at average rate \(3\) per minute.

  4. Draw \(5\) cards from a deck without replacement and count the number of aces.

(1) Binomial: \(X\sim\operatorname{Binomial}(20,0.02)\).

(2) Poisson: \(X\sim\operatorname{Poisson}(3)\).

(3) Exponential: \(X\sim\operatorname{Exponential}(3)\), with time measured in minutes.

(4) Hypergeometric: \(N=52\) cards, \(M=4\) aces, \(K=5\) draws, so \[\mathbb{P}(X=x)=\frac{\binom4x\binom{48}{5-x}}{\binom{52}{5}}, \qquad x=0,1,\ldots,4.\]
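
The hypergeometric probabilities for the card story can be tabulated exactly. This is a minimal sketch using `math.comb` and `fractions.Fraction`. The function name and argument names mirror the symbols \(N\), \(M\), \(K\) above but are otherwise illustrative.

```python
import math
from fractions import Fraction

def hypergeometric_pmf(x: int, N: int, M: int, K: int) -> Fraction:
    """P(X = x): x marked items among K draws without replacement from N items, M marked."""
    return Fraction(math.comb(M, x) * math.comb(N - M, K - x), math.comb(N, K))

# Number of aces in a 5-card hand from a standard 52-card deck.
ace_pmf = {x: hypergeometric_pmf(x, N=52, M=4, K=5) for x in range(5)}
for x, prob in ace_pmf.items():
    print(x, float(prob))
print(sum(ace_pmf.values()))  # the probabilities add up to one
```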

Practice Problem 96 (Exponential family calculation). Let \(Y\sim\operatorname{Poisson}(\lambda)\) and use the natural parameter \(\eta=\log\lambda\). Verify that \(A'(\eta)=\lambda\) and \(A''(\eta)=\lambda\).

For the Poisson exponential family, \[A(\eta)=e^\eta.\] Therefore \[A'(\eta)=e^\eta=\lambda,\] and \[A''(\eta)=e^\eta=\lambda.\] Since \(T(Y)=Y\), this agrees with \[\mathbb{E}[Y]=\lambda, \qquad \operatorname{Var}(Y)=\lambda.\]
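
The exponential family form can be compared with the usual Poisson pmf. The sketch assumes the standard choices \(h(y)=1/y!\), \(T(y)=y\), and \(A(\eta)=e^\eta\), together with an assumed value \(\lambda=2.5\). The function names are illustrative.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """Usual Poisson(lam) pmf at y."""
    return lam**y * math.exp(-lam) / math.factorial(y)

def poisson_expfam(y: int, eta: float) -> float:
    """Exponential family form h(y) * exp(eta * T(y) - A(eta)) with A(eta) = exp(eta)."""
    base_measure = 1.0 / math.factorial(y)  # h(y) = 1 / y!
    return base_measure * math.exp(eta * y - math.exp(eta))

lam = 2.5
eta = math.log(lam)
for y in range(6):
    print(y, poisson_pmf(y, lam), poisson_expfam(y, eta))
```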