15 Chapter 14: Hypothesis Tests I — Methods of Finding Tests
This chapter introduces the foundations of hypothesis testing and several general methods for constructing tests. The central ideas are null and alternative hypotheses, test statistics, rejection regions, Type I and Type II errors, significance level, power, \(p\)-values, likelihood ratio tests, Bayesian tests, union-intersection tests, intersection-union tests, and the Neyman–Pearson lemma.
Keywords: hypothesis testing; null and alternative hypotheses; Type I and Type II errors; significance level; power; \(p\)-values; likelihood ratio tests; Bayesian tests; union-intersection tests; intersection-union tests; radar detection; coin testing; normal mean tests; sufficient statistics and LRTs.
16 Introduction to Hypothesis Testing
This section develops the mathematical theory behind hypothesis tests and explains how classical tests arise from general testing principles.
In introductory statistics, students often learn concrete tests such as the \(Z\)-test, \(t\)-test, chi-square test, two-proportion test, two-sample mean test, and \(F\)-test. Here we study the theoretical principles behind such procedures.
Main goal. A hypothesis test uses observed data to decide between two competing statements about a population parameter.
For example, a pharmaceutical company may want to know whether a new drug is effective in treating a disease. A natural pair of hypotheses is \[H_0: \text{the drug is not effective}, \qquad H_1: \text{the drug is effective}.\] The null hypothesis \(H_0\) represents the default or no-effect claim, while the alternative hypothesis \(H_1\) represents the new effect or departure from the default.
16.1 The testing decision
A hypothesis test is a rule that specifies which sample values lead us to reject \(H_0\) and which sample values lead us not to reject \(H_0\).
Definition 1 (Hypothesis test). Let \(X=(X_1,\ldots,X_n)\) be a sample from a population distribution depending on a parameter \(\theta\). A hypothesis test is a rule that divides the sample space into two regions:
a rejection region \(R\), where we reject \(H_0\);
an acceptance region \(A=R^c\), where we fail to reject \(H_0\).
Equivalently, a test may be specified by a test statistic \(W(X)\) and a threshold rule.
Remark 2 (Terminology). In modern statistical language, one often says “fail to reject \(H_0\)” instead of “accept \(H_0\).” This emphasizes that lack of evidence against \(H_0\) is not the same as proof that \(H_0\) is true.
16.2 Mathematical formulation
The mathematical formulation of testing is based on partitioning the parameter space.
Definition 3 (Null and alternative hypotheses). Let \(\Theta\) be the parameter space. A hypothesis test compares \[H_0: \theta\in \Theta_0 \qquad \text{versus} \qquad H_1: \theta\in \Theta_0^c.\] The set \(\Theta_0\) is the null parameter space, and \(\Theta_0^c\) is the alternative parameter space.
Examples include \[H_0:\mu=100 \quad \text{versus}\quad H_1:\mu\ne 100,\] or \[H_0:\mu\ge 100 \quad \text{versus}\quad H_1:\mu<100.\]
17 A First Example: Testing Whether a Coin Is Fair
This example introduces the main ingredients of hypothesis testing through the familiar problem of checking whether a coin is fair.
Suppose \(\theta\) is the probability of heads. We want to test \[H_0:\theta=\frac12 \qquad \text{versus}\qquad H_1:\theta\ne \frac12.\] Let \(X_i\sim \operatorname{Bernoulli}(\theta)\) represent the result of the \(i\)th toss, where \(X_i=1\) for heads and \(X_i=0\) for tails. Let \[X=X_1+\cdots+X_n.\] Then \[X\sim \operatorname{Binomial}(n,\theta).\]
Example 4 (Coin test with 100 tosses). Suppose the coin is tossed \(n=100\) times. Under \(H_0\), \(\theta=1/2\), so \[X\sim \operatorname{Binomial}\left(100,\frac12\right), \qquad \mathbb{E}_{H_0}[X]=50, \qquad \operatorname{Var}_{H_0}(X)=25.\] A natural test rejects \(H_0\) when \(X\) is too far from \(50\).
Let \(t>0\) be a threshold. The test is \[\text{fail to reject }H_0 \text{ if } |X-50|\le t, \qquad \text{reject }H_0 \text{ if } |X-50|>t.\] The threshold should control the probability of rejecting a fair coin: \[\mathbb{P}(\text{Type I error})= \mathbb{P}_{H_0}(|X-50|>t).\] Using the central limit theorem, under \(H_0\), \[Y=\frac{X-n\theta_0}{\sqrt{n\theta_0(1-\theta_0)}} =\frac{X-50}{5} \approx \operatorname{Normal}(0,1).\] For significance level \(\alpha=0.05\), the two-sided critical value is approximately \(1.96\). Thus we reject \(H_0\) when \[\left|\frac{X-50}{5}\right|>1.96.\] Equivalently, \[|X-50|>9.8.\] Using integer values, we fail to reject \(H_0\) approximately when \[X\in\{41,42,\ldots,59\},\] and reject \(H_0\) otherwise.
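The cutoff above comes from a normal approximation, so the actual Type I error probability of the integer rule need not be exactly \(0.05\). The following is a minimal sketch, using only the Python standard library, that sums the exact \(\operatorname{Binomial}(100,1/2)\) probabilities outside the acceptance region \(\{41,\ldots,59\}\). The names `binom_pmf` and `exact_size` are illustrative and not part of the original notes.

```python
from math import comb

def binom_pmf(k: int, n: int, theta: float) -> float:
    """P(X = k) for X ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

n, theta0 = 100, 0.5

# Acceptance region obtained from the normal approximation: 41 <= X <= 59.
accept = range(41, 60)

# Exact probability, under H0, of landing outside the acceptance region.
exact_size = 1.0 - sum(binom_pmf(k, n, theta0) for k in accept)
print(f"exact Type I error probability: {exact_size:.4f}")
```

The exact value comes out slightly above \(0.05\), which is why the text describes the integer rule as only approximately level \(0.05\).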
17.1 Type I and Type II errors
Every hypothesis test can make two kinds of mistakes.
Definition 5 (Type I and Type II errors). For a hypothesis test of \(H_0\) versus \(H_1\):
A Type I error occurs when \(H_0\) is true but we reject \(H_0\).
A Type II error occurs when \(H_1\) is true but we fail to reject \(H_0\).
The Type I error probability is usually denoted by \(\alpha\), and the Type II error probability is usually denoted by \(\beta\).
For the coin example, if \(\theta_1\ne 1/2\), then \[\beta(\theta_1)=\mathbb{P}_{\theta_1}(\text{fail to reject }H_0) =\mathbb{P}_{\theta_1}(41\le X\le 59).\] The power function is \[\operatorname{Power}(\theta)=\mathbb{P}_\theta(\text{reject }H_0)=1-\beta(\theta)\] for \(\theta\) in the alternative.
| Decision / Truth | \(H_0\) true | \(H_1\) true |
|---|---|---|
| Reject \(H_0\) | Type I error | Correct decision |
| Fail to reject \(H_0\) | Correct decision | Type II error |
Trade-off. For a fixed sample size, making \(\alpha\) smaller often makes \(\beta\) larger. Reducing both errors usually requires increasing the sample size or using a more informative statistic.
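The trade-off can be seen numerically in the coin example. The sketch below evaluates the exact power function of the rule "reject when \(|X-50|>t\)" for a few thresholds. The alternative value \(\theta=0.6\) and the thresholds \(7\) and \(12\) are arbitrary illustrative choices, not values from the notes.

```python
from math import comb

def binom_pmf(k: int, n: int, theta: float) -> float:
    """P(X = k) for X ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def power(theta: float, t: float, n: int = 100) -> float:
    """P_theta(|X - n/2| > t): probability that the test rejects H0."""
    return sum(binom_pmf(k, n, theta) for k in range(n + 1) if abs(k - n / 2) > t)

for t in (7, 9.8, 12):
    alpha = power(0.5, t)        # Type I error probability at theta = 1/2
    beta = 1.0 - power(0.6, t)   # Type II error probability at theta = 0.6
    print(f"t = {t}: alpha = {alpha:.4f}, beta(0.6) = {beta:.4f}")
```

Raising the threshold shrinks the rejection region, so \(\alpha\) falls while \(\beta(0.6)\) rises, in line with the trade-off described above.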
18 Significance Level, Critical Values, and \(p\)-Values
This section explains the practical language of hypothesis testing: significance level, rejection region, critical value, and \(p\)-value.
18.1 Significance level
The significance level controls the probability of falsely rejecting the null hypothesis.
Definition 6 (Level-\(\alpha\) test). A test has significance level \(\alpha\) if \[\sup_{\theta\in\Theta_0}\mathbb{P}_\theta(\text{reject }H_0)\le \alpha.\] For a simple null hypothesis \(H_0:\theta=\theta_0\), this reduces to \[\mathbb{P}_{\theta_0}(\text{reject }H_0)\le \alpha.\]
18.2 \(p\)-value
The \(p\)-value measures how extreme the observed statistic is under the null model.
Definition 7 (\(p\)-value). For an observed statistic \(W(x_1,\ldots,x_n)\), the \(p\)-value is the probability, computed under \(H_0\), of observing a test statistic at least as extreme as the one observed. For a right-sided test, this is often \[p\text{-value}=\mathbb{P}_{H_0}\left(W(X_1,\ldots,X_n)\ge W(x_1,\ldots,x_n)\right).\] For a left-sided test, this is often \[p\text{-value}=\mathbb{P}_{H_0}\left(W(X_1,\ldots,X_n)\le W(x_1,\ldots,x_n)\right).\]
Interpretation. The \(p\)-value is the smallest significance level at which the observed data would lead to rejection of \(H_0\).
A small \(p\)-value indicates that the observed data are unusual under \(H_0\), so the data provide evidence against \(H_0\).
19 Example: Radar Aircraft Detection
This example shows how hypothesis testing appears in signal detection and illustrates the trade-off between Type I and Type II errors.
A radar system receives a signal \(X\). If no aircraft is present, then \[X=W.\] If an aircraft is present, then \[X=1+W.\] Here \[W\sim \operatorname{Normal}\left(0,\sigma^2\right), \qquad \sigma^2=\frac19.\] Equivalently, \[X=\theta+W, \qquad \theta=\begin{cases} 0, & \text{no aircraft is present},\\ 1, & \text{an aircraft is present}. \end{cases}\] We test \[H_0:\theta=0 \qquad \text{versus}\qquad H_1:\theta=1.\]
Example 8 (Level \(0.05\) radar test). Construct a level \(\alpha=0.05\) test that rejects \(H_0\) when \(X>c\).
Under \(H_0\), \(X=W\sim \operatorname{Normal}(0,1/9)\). Thus \[\mathbb{P}_{H_0}(X>c)=\mathbb{P}(3X>3c)=1-\Phi(3c).\] To make this probability equal to \(0.05\), choose \[1-\Phi(3c)=0.05.\] Therefore \[3c=z_{0.95}\approx 1.645, \qquad c\approx \frac{1.645}{3}=0.5483.\] The level-\(0.05\) test is \[\text{reject }H_0 \quad \text{if} \quad X>0.5483.\]
Example 9 (Type II error for the radar test). For the level \(0.05\) radar test above, compute the Type II error probability.
Under \(H_1\), \(X=1+W\) with \(W\sim \operatorname{Normal}(0,1/9)\). The Type II error probability is \[\beta=\mathbb{P}_{H_1}(X\le c)=\mathbb{P}(1+W\le c)=\mathbb{P}(W\le c-1).\] Standardizing gives \[\beta=\Phi(3(c-1)).\] Using \(c=1.645/3\approx 0.5483\), \[3(c-1)\approx -1.355,\] so \[\beta\approx \Phi(-1.355)\approx 0.0877.\] Thus the probability of missing a present aircraft is about \(8.77\%\).
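The two radar calculations above can be reproduced with the standard library's `statistics.NormalDist`, which provides the standard normal CDF and its inverse. This is a small sketch; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal: cdf is Phi, inv_cdf is its inverse
sigma = 1 / 3               # noise standard deviation, since sigma^2 = 1/9
alpha = 0.05

# Critical value: solve 1 - Phi(c / sigma) = alpha for c.
c = sigma * std_normal.inv_cdf(1 - alpha)

# Type II error: P(1 + W <= c) = Phi((c - 1) / sigma).
beta = std_normal.cdf((c - 1) / sigma)

print(f"c = {c:.4f}, beta = {beta:.4f}")
```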
Example 10 (Evidence check at level \(0.01\)). Suppose the observed signal is \(X=0.6\). Determine whether there is sufficient evidence to reject \(H_0\) at significance level \(\alpha=0.01\).
For a right-sided level-\(0.01\) test, \[\mathbb{P}_{H_0}(X>c)=0.01.\] Thus \[3c=z_{0.99}\approx 2.326, \qquad c\approx \frac{2.326}{3}=0.7753.\] Since the observed value is \[0.6<0.7753,\] we do not reject \(H_0\) at the \(0.01\) level.
Example 11 (Power constraint). Find a critical value \(c\) so that the probability of missing a present aircraft is at most \(5\%\), choosing the largest such \(c\) so that the false alarm probability is as small as possible. Then compute the resulting significance level.
We want \[\beta=\mathbb{P}_{H_1}(X\le c)=0.05.\] Since under \(H_1\), \(X\sim \operatorname{Normal}(1,1/9)\), \[\mathbb{P}(3(X-1)\le 3(c-1))=0.05.\] Thus \[3(c-1)=z_{0.05}\approx -1.645,\] so \[c=1-\frac{1.645}{3}\approx 0.4517.\] The resulting significance level is \[\alpha=\mathbb{P}_{H_0}(X>c)=1-\Phi(3c).\] Since \(3c\approx 1.355\), \[\alpha\approx 1-\Phi(1.355)\approx 0.0877.\] To reduce the miss probability to \(5\%\), the false alarm probability must increase to about \(8.77\%\).
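To see how the two error probabilities move against each other as the threshold changes, one can tabulate both for several values of \(c\). The sketch below assumes the same radar model; `error_rates` is a hypothetical helper name, and the grid of thresholds is arbitrary.

```python
from statistics import NormalDist

phi = NormalDist().cdf          # standard normal CDF
z = NormalDist().inv_cdf        # standard normal quantile function
sigma = 1 / 3                   # noise standard deviation

def error_rates(c: float) -> tuple[float, float]:
    """Return (alpha, beta) for the rule 'reject H0 when X > c'."""
    alpha = 1 - phi(c / sigma)      # false alarm probability under theta = 0
    beta = phi((c - 1) / sigma)     # miss probability under theta = 1
    return alpha, beta

# Threshold that makes the miss probability exactly 0.05.
c_miss = 1 + sigma * z(0.05)
alpha_miss, beta_miss = error_rates(c_miss)

for c in (0.40, c_miss, 0.50, 0.60):
    a, b = error_rates(c)
    print(f"c = {c:.4f}: alpha = {a:.4f}, beta = {b:.4f}")
```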
Example 12 (\(p\)-value for the radar test). For the observed value \(X_0=0.6\), compute the \(p\)-value for the right-sided radar test.
Under \(H_0\), \(X\sim \operatorname{Normal}(0,1/9)\). The standardized observed statistic is \[Z_0=\frac{0.6}{1/3}=1.8.\] For a right-sided test, \[p\text{-value}=\mathbb{P}_{H_0}(X\ge 0.6)=\mathbb{P}(Z\ge 1.8)=1-\Phi(1.8).\] Numerically, \[p\text{-value}\approx 0.0359.\] Therefore:
at \(\alpha=0.05\), reject \(H_0\);
at \(\alpha=0.01\), do not reject \(H_0\).
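The same \(p\)-value and the two decisions can be checked in a few lines. This is a sketch using `statistics.NormalDist`; the function name `radar_p_value` is illustrative.

```python
from statistics import NormalDist

def radar_p_value(x_obs: float, sigma: float = 1 / 3) -> float:
    """Right-sided p-value P_{H0}(X >= x_obs) when X ~ Normal(0, sigma^2)."""
    return 1 - NormalDist(mu=0, sigma=sigma).cdf(x_obs)

p = radar_p_value(0.6)
for alpha in (0.05, 0.01):
    decision = "reject H0" if p < alpha else "do not reject H0"
    print(f"alpha = {alpha}: p-value = {p:.4f}, {decision}")
```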
20 Classical Test for a Normal Mean
This section reviews how a familiar normal mean test fits into the general framework of test statistics and rejection regions.
Suppose \[X_1,\ldots,X_n\sim \operatorname{Normal}(\mu,\sigma^2),\] where \(\sigma^2\) is known. We test \[H_0:\mu=\mu_0 \qquad \text{versus}\qquad H_1:\mu\ne \mu_0.\]
Under \(H_0\), \[\bar X\sim \operatorname{Normal}\left(\mu_0,\frac{\sigma^2}{n}\right).\] Therefore \[Z=\frac{\bar X-\mu_0}{\sigma/\sqrt n}\sim \operatorname{Normal}(0,1).\]
A two-sided level-\(\alpha\) test rejects \(H_0\) when \[|Z|>z_{1-\alpha/2}.\] Equivalently, \[\left|\bar X-\mu_0\right|> z_{1-\alpha/2}\frac{\sigma}{\sqrt n}.\]
When \(\sigma^2\) is unknown and the data are normal, the classical test statistic is \[t=\frac{\bar X-\mu_0}{S/\sqrt n}, \qquad S^2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)^2,\] and under \(H_0\), \[t\sim t_{n-1}.\] The two-sided level-\(\alpha\) test rejects when \[|t|>t_{n-1,1-\alpha/2}.\]
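Both statistics are easy to compute from summary values. The sketch below uses hypothetical helper names `z_test` and `t_statistic`. The Python standard library has no Student \(t\) quantile function, so the critical value \(t_{n-1,1-\alpha/2}\) is assumed to come from a table or from a statistics package.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar: float, mu0: float, sigma: float, n: int, alpha: float = 0.05):
    """Two-sided Z-test with known sigma: returns (z, p_value, reject)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    crit = NormalDist().inv_cdf(1 - alpha / 2)       # z_{1 - alpha/2}
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, abs(z) > crit

def t_statistic(xbar: float, mu0: float, s: float, n: int) -> float:
    """Student t statistic; compare |t| with t_{n-1, 1-alpha/2} from a table."""
    return (xbar - mu0) / (s / sqrt(n))

# Summary values taken from the practice problems later in this chapter.
z, p_value, reject = z_test(xbar=53, mu0=50, sigma=4, n=25)
t = t_statistic(xbar=10.8, mu0=10, s=2.4, n=16)
print(f"z = {z:.2f}, reject = {reject}; t = {t:.3f}")
```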
21 Methods of Finding Tests
This section introduces several general methods for constructing hypothesis tests.
The main methods discussed in this section are:
likelihood ratio tests;
Bayesian tests;
union-intersection tests;
intersection-union tests;
the Neyman-Pearson lemma for simple hypotheses.
Guiding principle. Different testing methods correspond to different statistical philosophies. Likelihood ratio tests compare best likelihoods; Bayesian tests use posterior probabilities; union-intersection and intersection-union tests build complex tests from simpler component tests.
22 Likelihood Ratio Tests
Likelihood ratio tests compare how well the null parameter space explains the data against how well the full parameter space explains the data.
Let \(X_1,\ldots,X_n\) be an i.i.d. sample from a population distribution with density or mass function \(f(x\mid\theta)\). For observed data \(x=(x_1,\ldots,x_n)\), the likelihood function is \[L(\theta\mid x)=f(x_1,\ldots,x_n\mid\theta)=\prod_{i=1}^n f(x_i\mid\theta).\]
Definition 13 (Likelihood ratio statistic). For testing \[H_0:\theta\in\Theta_0 \qquad \text{versus}\qquad H_1:\theta\in\Theta_0^c,\] the likelihood ratio statistic is \[\lambda(x)=\frac{\sup_{\theta\in\Theta_0}L(\theta\mid x)}{\sup_{\theta\in\Theta}L(\theta\mid x)}.\] A likelihood ratio test rejects \(H_0\) for small values of \(\lambda(x)\): \[R=\{x:\lambda(x)\le c\}, \qquad 0\le c\le 1.\]
Since \(\Theta_0\subseteq \Theta\), we always have \[0\le \lambda(x)\le 1.\] A small value of \(\lambda(x)\) means that the null model fits much worse than the unrestricted model.
22.1 Simple versus simple likelihood ratio test
The simplest LRT compares two point hypotheses.
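Definition 14 (Simple likelihood ratio). For \[H_0:\theta=\theta_0 \qquad \text{versus}\qquad H_1:\theta=\theta_1,\] define \[\lambda(x)=\frac{L(\theta_0\mid x)}{L(\theta_1\mid x)}.\] The likelihood ratio test rejects \(H_0\) for small values of \(\lambda(x)\). Unlike the statistic in Definition 13, this ratio can exceed \(1\), because its denominator is the likelihood at \(\theta_1\) rather than a supremum over the whole parameter space.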
23 Likelihood Ratio Test: Radar Example
This section rederives the radar test using the likelihood ratio method.
Recall that \[X=\theta+W, \qquad W\sim \operatorname{Normal}\left(0,\frac19\right),\] and we test \[H_0:\theta=0 \qquad \text{versus}\qquad H_1:\theta=1.\]
Example 15 (LRT for radar detection). Find the likelihood ratio statistic and show that the LRT rejects for large values of \(X\).
The density under \(\theta=0\) is \[L(0\mid x)=\frac{3}{\sqrt{2\pi}}\exp\left(-\frac{9x^2}{2}\right).\] The density under \(\theta=1\) is \[L(1\mid x)=\frac{3}{\sqrt{2\pi}}\exp\left(-\frac{9(x-1)^2}{2}\right).\] Thus the likelihood ratio is \[\lambda(x)=\frac{L(0\mid x)}{L(1\mid x)} =\exp\left(-\frac{9x^2}{2}+\frac{9(x-1)^2}{2}\right).\] Simplifying, \[\lambda(x)=\exp\left(\frac{9(1-2x)}{2}\right).\] Since this is a decreasing function of \(x\), rejecting for small \(\lambda(x)\) is equivalent to rejecting for large \(x\). Therefore the LRT has the form \[\text{reject }H_0 \quad \text{if}\quad x>c'.\] For a level \(0.05\) test, \[\mathbb{P}_{H_0}(X>c')=0.05,\] so \[c'=\frac{z_{0.95}}{3}\approx \frac{1.645}{3}=0.5483.\] This is the same test constructed directly from Type I error control.
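As a numerical check of the simplification, the ratio of the two normal densities can be compared with the closed form \(\exp\left(9(1-2x)/2\right)\) on a small grid. This is a sketch; the grid points are arbitrary.

```python
from math import exp, isclose
from statistics import NormalDist

sigma = 1 / 3
null_model = NormalDist(mu=0, sigma=sigma)   # distribution of X under theta = 0
alt_model = NormalDist(mu=1, sigma=sigma)    # distribution of X under theta = 1

def lam(x: float) -> float:
    """Likelihood ratio L(0 | x) / L(1 | x) computed from the densities."""
    return null_model.pdf(x) / alt_model.pdf(x)

def lam_closed_form(x: float) -> float:
    """Simplified form exp(9 (1 - 2x) / 2)."""
    return exp(9 * (1 - 2 * x) / 2)

xs = [-0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 1.5]
for x in xs:
    print(f"x = {x:5.2f}: ratio = {lam(x):.6g}, closed form = {lam_closed_form(x):.6g}")
```

The printed ratios shrink as \(x\) grows, which is the monotonicity that turns "small \(\lambda(x)\)" into "large \(x\)".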
24 LRT for a Normal Mean with Known Variance
This section shows that the classical two-sided \(Z\)-test is a likelihood ratio test.
Suppose \[X_1,\ldots,X_n\sim \operatorname{Normal}(\mu,\sigma^2),\] where \(\sigma^2\) is known. We test \[H_0:\mu=\mu_0 \qquad \text{versus}\qquad H_1:\mu\ne\mu_0.\]
Example 16 (Normal mean LRT with known variance). Derive the likelihood ratio statistic.
The likelihood is \[L(\mu\mid x)= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right).\] Under \(H_0\), the best null value is fixed at \(\mu_0\). Under the full parameter space, the MLE is \[\widehat\mu=\bar x.\] Therefore \[\lambda(x)=\frac{L(\mu_0\mid x)}{L(\bar x\mid x)}.\] Using the identity \[\sum_{i=1}^n (x_i-\mu_0)^2 = \sum_{i=1}^n (x_i-\bar x)^2+n(\bar x-\mu_0)^2,\] we obtain \[\lambda(x)=\exp\left(-\frac{n(\bar x-\mu_0)^2}{2\sigma^2}\right).\] This statistic decreases as \(|\bar x-\mu_0|\) increases. Thus rejecting for small \(\lambda(x)\) is equivalent to rejecting for large \[\left|\frac{\bar X-\mu_0}{\sigma/\sqrt n}\right|.\] Therefore the level-\(\alpha\) LRT rejects when \[\left|\frac{\bar X-\mu_0}{\sigma/\sqrt n}\right|>z_{1-\alpha/2}.\] This is the classical two-sided \(Z\)-test.
The cutoff \(c\) in the likelihood-ratio form can be written as \[c=\exp\left(-\frac{z_{1-\alpha/2}^2}{2}\right).\]
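The closed form \(\lambda(x)=\exp(-Z^2/2)\) can be checked against a direct evaluation of the two likelihoods. The sample, the null value \(\mu_0\), and the known \(\sigma\) below are made-up numbers used only for illustration.

```python
from math import exp, isclose, log, pi, sqrt

def log_lik(mu: float, xs: list[float], sigma: float) -> float:
    """Normal log-likelihood with known standard deviation sigma."""
    n = len(xs)
    return (-0.5 * n * log(2 * pi * sigma**2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma**2))

xs = [9.1, 10.4, 11.2, 8.7, 10.9, 12.3]   # hypothetical sample
mu0, sigma = 10.0, 1.5                     # hypothetical null mean and known sd
n = len(xs)
xbar = sum(xs) / n

# Likelihood ratio computed directly: L(mu0 | x) / L(xbar | x).
lam_direct = exp(log_lik(mu0, xs, sigma) - log_lik(xbar, xs, sigma))

# Same quantity through the Z statistic.
z = (xbar - mu0) / (sigma / sqrt(n))
lam_from_z = exp(-z**2 / 2)

print(f"direct = {lam_direct:.6f}, from Z = {lam_from_z:.6f}")
```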
25 LRT for a Normal Mean with Unknown Variance
This section shows that the classical two-sided Student’s \(t\)-test can also be derived as a likelihood ratio test.
Suppose \[X_1,\ldots,X_n\sim \operatorname{Normal}(\mu,\sigma^2),\] where both \(\mu\) and \(\sigma^2\) are unknown. We test \[H_0:\mu=\mu_0 \qquad \text{versus}\qquad H_1:\mu\ne\mu_0.\]
Example 17 (Normal mean LRT with unknown variance). Derive the LRT and connect it to the Student \(t\) statistic.
Under the full model, the MLEs are \[\widehat\mu=\bar x, \qquad \widehat\sigma^2=\frac1n\sum_{i=1}^n (x_i-\bar x)^2.\] Under \(H_0\), \(\mu=\mu_0\), and the MLE of \(\sigma^2\) is \[\widehat\sigma_0^2=\frac1n\sum_{i=1}^n (x_i-\mu_0)^2.\] After substituting these into the normal likelihood, the likelihood ratio is \[\lambda(x)= \left(\frac{\widehat\sigma_0^2}{\widehat\sigma^2}\right)^{-n/2}.\] Using \[\sum_{i=1}^n (x_i-\mu_0)^2 = \sum_{i=1}^n (x_i-\bar x)^2+n(\bar x-\mu_0)^2,\] we get \[\frac{\widehat\sigma_0^2}{\widehat\sigma^2} =1+\frac{n(\bar x-\mu_0)^2}{\sum_{i=1}^n (x_i-\bar x)^2}.\] Let \[S^2=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)^2\] and \[t=\frac{\bar X-\mu_0}{S/\sqrt n}.\] Then \[\frac{\widehat\sigma_0^2}{\widehat\sigma^2}=1+\frac{t^2}{n-1}.\] Therefore \(\lambda(x)\) is small exactly when \(|t|\) is large. The LRT rejects \(H_0\) when \[|t|>t_{n-1,1-\alpha/2}.\] This is the classical two-sided Student’s \(t\)-test with \(n-1\) degrees of freedom.
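The identity \(\lambda(x)=\left(1+t^2/(n-1)\right)^{-n/2}\) can be verified numerically by maximizing the normal likelihood over \(\sigma^2\) in both models. The data and \(\mu_0\) below are hypothetical, and `max_log_lik` is an illustrative helper name.

```python
from math import exp, isclose, log, pi, sqrt

def max_log_lik(xs: list[float], mu: float) -> float:
    """Normal log-likelihood at mean mu, maximized over sigma^2."""
    n = len(xs)
    sigma2_hat = sum((x - mu) ** 2 for x in xs) / n   # MLE of sigma^2 given mu
    return -0.5 * n * (log(2 * pi * sigma2_hat) + 1)

xs = [9.1, 10.4, 11.2, 8.7, 10.9, 12.3]   # hypothetical sample
mu0 = 10.0                                 # hypothetical null mean
n = len(xs)
xbar = sum(xs) / n

# Likelihood ratio: restricted maximum over unrestricted maximum.
lam_direct = exp(max_log_lik(xs, mu0) - max_log_lik(xs, xbar))

# Same quantity through the Student t statistic.
s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
t = (xbar - mu0) / sqrt(s2 / n)
lam_from_t = (1 + t**2 / (n - 1)) ** (-n / 2)

print(f"direct = {lam_direct:.6f}, from t = {lam_from_t:.6f}")
```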
Remark 18. The classical \(Z\)-, \(t\)-, chi-square, proportion, pooled two-sample \(t\)-, two-proportion, and \(F\)-tests can all be interpreted as special cases or asymptotic versions of likelihood ratio tests.
26 LRTs and Sufficient Statistics
This section explains why likelihood ratio tests can be computed using sufficient statistics without losing information.
Theorem 19 (LRT based on a sufficient statistic). Suppose \(T(X)\) is sufficient for \(\theta\). Let \(\lambda(x)\) be the likelihood ratio statistic based on the full data \(X\), and let \(\lambda^*(T(x))\) be the likelihood ratio statistic based on the statistic \(T\). Then \[\lambda^*(T(x))=\lambda(x).\]
Proof. By the factorization theorem, the likelihood can be written as \[L(\theta\mid x)=g(T(x),\theta)h(x),\] where \(h(x)\) does not depend on \(\theta\). Then \[\lambda(x)=\frac{\sup_{\theta\in\Theta_0}g(T(x),\theta)h(x)}{\sup_{\theta\in\Theta}g(T(x),\theta)h(x)}.\] The factor \(h(x)\) cancels, so \[\lambda(x)=\frac{\sup_{\theta\in\Theta_0}g(T(x),\theta)}{\sup_{\theta\in\Theta}g(T(x),\theta)},\] which is exactly the likelihood ratio based on \(T(x)\). ◻
Example 20 (Normal mean with known variance using \(\bar X\)). Suppose \(X_1,\ldots,X_n\sim \operatorname{Normal}(\mu,\sigma^2)\) with known \(\sigma^2\). Since \(\bar X\) is sufficient for \(\mu\), derive the LRT using \(T=\bar X\).
The statistic \[T=\bar X\] has distribution \[T\sim \operatorname{Normal}\left(\mu,\frac{\sigma^2}{n}\right).\] Thus the likelihood based on \(T=t\) is \[L(\mu\mid t)=\frac{1}{\sqrt{2\pi\sigma^2/n}} \exp\left(-\frac{n(t-\mu)^2}{2\sigma^2}\right).\] For testing \(H_0:\mu=\mu_0\) versus \(H_1:\mu\ne\mu_0\), the unrestricted MLE based on \(t\) is \(\widehat\mu=t\). Therefore \[\lambda^*(t)=\frac{L(\mu_0\mid t)}{L(t\mid t)} =\exp\left(-\frac{n(t-\mu_0)^2}{2\sigma^2}\right).\] Substituting \(t=\bar x\) gives \[\lambda^*(\bar x)=\exp\left(-\frac{n(\bar x-\mu_0)^2}{2\sigma^2}\right),\] which is the same likelihood ratio statistic obtained from the full sample.
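Theorem 19 can also be illustrated numerically: the ratio built from the density of \(\bar X\) should match the ratio built from the full sample. The sketch below uses a made-up sample and made-up values of \(\mu_0\) and \(\sigma\).

```python
from math import isclose, sqrt
from statistics import NormalDist

xs = [9.1, 10.4, 11.2, 8.7, 10.9, 12.3]   # hypothetical sample
mu0, sigma = 10.0, 1.5                     # hypothetical null mean and known sd
n = len(xs)
xbar = sum(xs) / n

def full_lik(mu: float) -> float:
    """Likelihood of the whole sample: product of Normal(mu, sigma^2) densities."""
    out = 1.0
    for x in xs:
        out *= NormalDist(mu, sigma).pdf(x)
    return out

def lik_from_T(mu: float) -> float:
    """Likelihood based only on T = xbar, which is Normal(mu, sigma^2 / n)."""
    return NormalDist(mu, sigma / sqrt(n)).pdf(xbar)

lam_full = full_lik(mu0) / full_lik(xbar)
lam_T = lik_from_T(mu0) / lik_from_T(xbar)
print(f"full sample: {lam_full:.6f}, sufficient statistic: {lam_T:.6f}")
```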
27 Bayesian Tests
This section presents hypothesis testing from a Bayesian viewpoint, where inference is based on posterior probabilities.
In Bayesian statistics, the parameter \(\theta\) is treated as a random quantity with prior distribution \(\pi(\theta)\). Given data \(x=(x_1,\ldots,x_n)\), the posterior distribution is \[\pi(\theta\mid x)=\frac{f(x\mid\theta)\pi(\theta)}{m(x)},\] where \[m(x)=\int f(x\mid\theta)\pi(\theta)\,d\theta\] is the marginal distribution of the data.
Definition 21 (Bayesian test by posterior probabilities). For testing \[H_0:\theta\in\Theta_0 \qquad \text{versus}\qquad H_1:\theta\in\Theta_0^c,\] compute the posterior probabilities \[\mathbb{P}(\theta\in\Theta_0\mid x) \quad \text{and}\quad \mathbb{P}(\theta\in\Theta_0^c\mid x).\] A simple Bayesian rule rejects \(H_0\) if \[\mathbb{P}(\theta\in\Theta_0\mid x)<\mathbb{P}(\theta\in\Theta_0^c\mid x).\] Equivalently, reject if \[\mathbb{P}(\theta\in\Theta_0\mid x)<\frac12.\] A more conservative rule might reject only if \[\mathbb{P}(\theta\in\Theta_0\mid x)<0.05.\]
27.1 Bayesian normal mean test
We now compute a Bayesian test for a normal mean.
Example 22 (Bayesian test for a normal mean). Suppose \[X_1,\ldots,X_n\mid\mu \sim \operatorname{Normal}(\mu,\sigma^2),\] where \(\sigma^2\) is known. Suppose the prior is \[\mu\sim \operatorname{Normal}(\theta,\tau^2).\] Test \[H_0:\mu\le \mu_0 \qquad \text{versus}\qquad H_1:\mu>\mu_0.\] Derive the posterior and the Bayesian decision rule that rejects when the posterior probability of \(H_0\) is less than \(1/2\).
The normal-normal conjugate posterior is normal: \[\mu\mid x \sim \operatorname{Normal}(m_n,v_n),\] where \[m_n=\frac{\tau^2\sum_{i=1}^n x_i+\sigma^2\theta}{n\tau^2+\sigma^2}\] and \[v_n=\frac{\sigma^2\tau^2}{\sigma^2+n\tau^2}.\] We reject \(H_0\) when \[\mathbb{P}(\mu\le \mu_0\mid x)<\frac12.\] Since the posterior distribution is normal and symmetric, this condition is equivalent to \[m_n>\mu_0.\] Thus the Bayesian test rejects when \[\frac{\tau^2\sum_{i=1}^n x_i+\sigma^2\theta}{n\tau^2+\sigma^2}>\mu_0.\] Equivalently, \[\bar x>\mu_0+\frac{\sigma^2}{n\tau^2}(\mu_0-\theta).\] This shows how the prior mean \(\theta\) shifts the rejection threshold.
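The posterior formulas and the two equivalent forms of the decision rule can be checked on a small example. All numerical inputs below (sample, \(\sigma\), prior mean, \(\tau\), \(\mu_0\)) are hypothetical, and `posterior` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist

def posterior(xs: list[float], sigma: float, prior_mean: float, tau: float):
    """Posterior mean m_n and variance v_n for the normal-normal model."""
    n = len(xs)
    m_n = (tau**2 * sum(xs) + sigma**2 * prior_mean) / (n * tau**2 + sigma**2)
    v_n = sigma**2 * tau**2 / (sigma**2 + n * tau**2)
    return m_n, v_n

xs = [9.1, 10.4, 11.2, 8.7, 10.9, 12.3]    # hypothetical sample
sigma, prior_mean, tau = 1.5, 9.0, 2.0      # hypothetical known sd and prior
mu0 = 10.0                                  # hypothetical null boundary

m_n, v_n = posterior(xs, sigma, prior_mean, tau)

# Rule 1: reject H0 when the posterior probability of {mu <= mu0} is below 1/2.
prob_h0 = NormalDist(m_n, sqrt(v_n)).cdf(mu0)
reject_by_probability = prob_h0 < 0.5

# Rule 2: the equivalent threshold on the sample mean.
xbar = sum(xs) / len(xs)
threshold = mu0 + sigma**2 / (len(xs) * tau**2) * (mu0 - prior_mean)
reject_by_threshold = xbar > threshold

print(f"P(H0 | x) = {prob_h0:.4f}, xbar = {xbar:.4f}, threshold = {threshold:.4f}")
```

With a prior mean below \(\mu_0\), the threshold on \(\bar x\) sits above \(\mu_0\): the data must overcome the prior before the test rejects.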
28 Union-Intersection Tests
This section explains how to construct a test for a complicated alternative by combining simpler component tests.
The union-intersection test is useful when the null hypothesis can be written as an intersection of simpler hypotheses.
Suppose \[H_0:\theta\in\Theta_0=\bigcap_{\gamma\in\Gamma}\Theta_\gamma.\] Then by De Morgan’s law, \[H_1:\theta\in\Theta_0^c=\bigcup_{\gamma\in\Gamma}\Theta_\gamma^c.\] For each \(\gamma\), we test \[H_{0\gamma}:\theta\in\Theta_\gamma \qquad \text{versus}\qquad H_{1\gamma}:\theta\in\Theta_\gamma^c.\] Let the rejection region for the \(\gamma\)th subtest be \[R_\gamma=\{x:T_\gamma(x)>c\}.\] The union-intersection test rejects if any component test rejects: \[R=\bigcup_{\gamma\in\Gamma}R_\gamma.\] Equivalently, \[R=\left\{x:\sup_{\gamma\in\Gamma}T_\gamma(x)>c\right\}.\] Thus the combined test statistic is \[T(x)=\sup_{\gamma\in\Gamma}T_\gamma(x).\]
Example 23 (UIT for a two-sided normal mean test). Suppose \(X_1,\ldots,X_n\) are normal and we want to test \[H_0:\mu=\mu_0 \qquad \text{versus}\qquad H_1:\mu\ne \mu_0.\] Explain how this can be constructed from one-sided tests.
The null hypothesis can be written as \[H_0:\mu=\mu_0 \quad \Longleftrightarrow \quad \{\mu\le \mu_0\}\cap \{\mu\ge \mu_0\}.\] The alternative is \[H_1:\mu\ne \mu_0 \quad \Longleftrightarrow \quad \{\mu>\mu_0\}\cup \{\mu<\mu_0\}.\] Thus we combine two one-sided tests:
reject \(H_{0L}:\mu\le \mu_0\) for large positive values of the standardized statistic;
reject \(H_{0R}:\mu\ge \mu_0\) for large negative values of the standardized statistic.
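If \(\sigma^2\) is known, the standardized statistic is \[Z=\frac{\bar X-\mu_0}{\sigma/\sqrt n}.\] If each one-sided test is run at level \(\alpha/2\), so that each uses the cutoff \(z_{1-\alpha/2}\), the UIT rejects when either tail is too extreme, which is equivalent to \[|Z|>z_{1-\alpha/2}.\] Because the two one-sided rejection regions are disjoint, the combined test has level \(\alpha\). This is the classical two-sided \(Z\)-test.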
If \(\sigma^2\) is unknown, replace \(\sigma\) by \(S\): \[t=\frac{\bar X-\mu_0}{S/\sqrt n},\] and reject when \[|t|>t_{n-1,1-\alpha/2}.\] This is the classical two-sided \(t\)-test.
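For the known-variance case, the union-intersection construction can be written out directly: the two component statistics are \(Z\) and \(-Z\), and their supremum is \(|Z|\). The sketch below assumes each one-sided component uses the cutoff \(z_{1-\alpha/2}\); the function name and inputs are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def uit_two_sided(xbar: float, mu0: float, sigma: float, n: int, alpha: float = 0.05):
    """Return the UIT decision in three equivalent forms."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    c = NormalDist().inv_cdf(1 - alpha / 2)   # common cutoff for both components
    reject_upper = z > c      # component test of mu <= mu0 rejects for large Z
    reject_lower = -z > c     # component test of mu >= mu0 rejects for large -Z
    by_union = reject_upper or reject_lower   # union of rejection regions
    by_sup = max(z, -z) > c                   # supremum of component statistics
    by_abs = abs(z) > c                       # classical two-sided form
    return by_union, by_sup, by_abs

for xbar in (48.0, 50.5, 53.0):               # hypothetical sample means
    print(xbar, uit_two_sided(xbar, mu0=50, sigma=4, n=25))
```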
29 Intersection-Union Tests
This section explains the complementary construction, where we reject only if all component tests reject.
The intersection-union test is useful when the alternative hypothesis requires several conditions to hold simultaneously.
Suppose \[H_0:\theta\in\Theta_0=\bigcup_{\gamma\in\Gamma}\Theta_\gamma.\] Then by De Morgan’s law, \[H_1:\theta\in\Theta_0^c=\bigcap_{\gamma\in\Gamma}\Theta_\gamma^c.\] For each component test, let \[R_\gamma=\{x:T_\gamma(x)>c\}.\] The IUT rejects only when all component tests reject: \[R=\bigcap_{\gamma\in\Gamma}R_\gamma.\] Equivalently, \[R=\left\{x:\inf_{\gamma\in\Gamma}T_\gamma(x)>c\right\}.\] Thus the combined test statistic is \[T(x)=\inf_{\gamma\in\Gamma}T_\gamma(x).\]
Example 24 (Acceptance sampling for upholstery fabric). A batch of upholstery fabric has two quality parameters:
\(\theta_1\): mean breaking strength, which should exceed \(50\) pounds;
\(\theta_2\): probability of passing a flammability test, which should exceed \(0.95\).
Set up an intersection-union test for determining whether a batch is acceptable.
The batch is acceptable only if both standards are met: \[H_1:\theta_1>50 \quad \text{and}\quad \theta_2>0.95.\] Thus the null hypothesis is that at least one standard fails: \[H_0:\theta_1\le 50 \quad \text{or}\quad \theta_2\le 0.95.\] Equivalently, \[H_0=\{\theta_1\le 50\}\cup \{\theta_2\le 0.95\}\] and \[H_1=\{\theta_1>50\}\cap \{\theta_2>0.95\}.\] We collect two types of data:
breaking strengths \(X_1,\ldots,X_n\), often modeled as normal with mean \(\theta_1\);
flammability indicators \(Y_1,\ldots,Y_m\), where \(Y_i=1\) if the item passes and \(0\) otherwise, often modeled as Bernoulli with probability \(\theta_2\).
We test the two component null hypotheses \[H_{01}:\theta_1\le 50 \qquad \text{and}\qquad H_{02}:\theta_2\le 0.95.\] The intersection-union rule rejects the overall null \(H_0\) only if both component null hypotheses are rejected. In words, we declare the batch acceptable only if the strength requirement and the flammability requirement both pass their respective tests.
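A rough sketch of this rule is given below. It makes several simplifying assumptions that are not in the example: the strength standard deviation `sigma` is treated as known, so a \(Z\) statistic is used; the flammability component uses a normal approximation to the binomial; and both components use the one-sided cutoff \(z_{1-\alpha}\). The data are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def iut_batch_acceptable(strengths, passes, sigma, alpha=0.05):
    """Reject H0 (declare the batch acceptable) only if both components reject."""
    c = NormalDist().inv_cdf(1 - alpha)        # one-sided cutoff z_{1 - alpha}
    n, m = len(strengths), len(passes)
    # Component 1: H01: theta1 <= 50, rejected for large standardized mean.
    z_strength = (sum(strengths) / n - 50) / (sigma / sqrt(n))
    # Component 2: H02: theta2 <= 0.95, normal approximation at the boundary.
    p_hat = sum(passes) / m
    z_flame = (p_hat - 0.95) / sqrt(0.95 * 0.05 / m)
    # Intersection of rejection regions: the smaller statistic must exceed c.
    return min(z_strength, z_flame) > c

strengths = [52.1, 53.4, 51.8, 54.0, 52.7, 53.1, 51.5, 52.9]   # hypothetical
good_flame = [1] * 198 + [0] * 2                                # 99% pass
weak_flame = [1] * 192 + [0] * 8                                # 96% pass

print(iut_batch_acceptable(strengths, good_flame, sigma=2.0))
print(iut_batch_acceptable(strengths, weak_flame, sigma=2.0))
```

In the second call the strength component rejects but the flammability component does not, so the overall null is retained and the batch is not declared acceptable.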
30 Neyman-Pearson Lemma: Simple Hypotheses
This section records the fundamental optimality result behind likelihood ratio tests for simple hypotheses.
Although the detailed proof is usually saved for a more advanced treatment, the main message is essential: for simple null and simple alternative hypotheses, the most powerful test at a fixed significance level is a likelihood ratio test.
Theorem 25 (Neyman-Pearson lemma). Consider testing two simple hypotheses \[H_0:\theta=\theta_0 \qquad \text{versus}\qquad H_1:\theta=\theta_1.\] Among all tests with Type I error probability at most \(\alpha\), the most powerful test rejects \(H_0\) for sufficiently small values of \[\frac{L(\theta_0\mid X)}{L(\theta_1\mid X)}.\] Equivalently, it rejects for sufficiently large values of \[\frac{L(\theta_1\mid X)}{L(\theta_0\mid X)}.\]
Remark 26. The Neyman-Pearson lemma explains why likelihood ratio tests are not only natural, but optimal, for simple-versus-simple testing problems.
31 Practice Problems
This section gives practice problems that reinforce the main ideas of hypothesis tests, likelihood ratios, \(p\)-values, and combined tests.
Practice Problem 27 (Coin test). A coin is tossed \(100\) times and \(62\) heads are observed. Test \[H_0:p=\frac12 \qquad \text{versus}\qquad H_1:p\ne \frac12\] using the normal approximation at level \(\alpha=0.05\).
Under \(H_0\), \[X\sim \operatorname{Binomial}\left(100,\frac12\right), \qquad \mathbb{E}[X]=50, \qquad \operatorname{sd}(X)=5.\] The standardized statistic is \[Z=\frac{62-50}{5}=2.4.\] For a two-sided level-\(0.05\) test, the critical value is \(1.96\). Since \[|2.4|>1.96,\] we reject \(H_0\). The approximate two-sided \(p\)-value is \[2(1-\Phi(2.4))\approx 0.0164.\]
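The normal-approximation answer can be compared with the exact binomial tail probability. In the sketch below, the exact two-sided \(p\)-value is taken to be the probability of a count at least as far from \(50\) as the observed \(62\); this is one common convention and is an assumption here.

```python
from math import comb, sqrt
from statistics import NormalDist

n, heads, p0 = 100, 62, 0.5

# Normal approximation, as in the solution above.
z = (heads - n * p0) / sqrt(n * p0 * (1 - p0))
p_approx = 2 * (1 - NormalDist().cdf(abs(z)))

# Exact two-sided p-value: P(|X - 50| >= |62 - 50|) under Binomial(100, 1/2).
dist = abs(heads - n * p0)
p_exact = sum(comb(n, k) * p0**k * (1 - p0) ** (n - k)
              for k in range(n + 1) if abs(k - n * p0) >= dist)

print(f"z = {z:.2f}, approximate p = {p_approx:.4f}, exact p = {p_exact:.4f}")
```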
Practice Problem 28 (Radar \(p\)-value). In the radar example, suppose \(X=0.4\). Compute the right-sided \(p\)-value under \(H_0\).
Under \(H_0\), \(X\sim \operatorname{Normal}(0,1/9)\). Standardizing gives \[Z=3X=3(0.4)=1.2.\] Thus the right-sided \(p\)-value is \[p=\mathbb{P}(Z\ge 1.2)=1-\Phi(1.2)\approx 0.1151.\] At level \(0.05\), we would not reject \(H_0\).
Practice Problem 29 (Normal mean, known variance). Suppose \(X_1,\ldots,X_{25}\sim \operatorname{Normal}(\mu,16)\) and \(\bar x=53\). Test \[H_0:\mu=50 \qquad \text{versus}\qquad H_1:\mu\ne 50\] at level \(0.05\).
Here \(\sigma=4\) and \(n=25\), so \[Z=\frac{\bar x-\mu_0}{\sigma/\sqrt n} =\frac{53-50}{4/5}=\frac{3}{0.8}=3.75.\] Since \[|3.75|>1.96,\] we reject \(H_0\) at level \(0.05\).
Practice Problem 30 (Normal mean, unknown variance). Suppose \(X_1,\ldots,X_{16}\) are normal, \(\bar x=10.8\), and \(s=2.4\). Test \[H_0:\mu=10 \qquad \text{versus}\qquad H_1:\mu\ne 10\] at level \(0.05\).
The test statistic is \[t=\frac{\bar x-\mu_0}{s/\sqrt n} =\frac{10.8-10}{2.4/4} =\frac{0.8}{0.6}=1.333.\] There are \(n-1=15\) degrees of freedom. The two-sided \(0.05\) critical value is about \[t_{15,0.975}\approx 2.131.\] Since \[|1.333|<2.131,\] we fail to reject \(H_0\).
Practice Problem 31 (Likelihood ratio for Bernoulli simple hypotheses). Let \(X_1,\ldots,X_n\sim \operatorname{Bernoulli}(p)\). Test \[H_0:p=p_0 \qquad \text{versus}\qquad H_1:p=p_1,\] where \(p_1>p_0\). Find the likelihood ratio and describe the rejection region.
Let \[S=\sum_{i=1}^n X_i.\] The likelihood is \[L(p\mid x)=p^S(1-p)^{n-S}.\] Thus \[\lambda(x)=\frac{L(p_0\mid x)}{L(p_1\mid x)} =\left(\frac{p_0}{p_1}\right)^S \left(\frac{1-p_0}{1-p_1}\right)^{n-S}.\] Taking logs, \[\log\lambda(x)=S\log\left(\frac{p_0}{p_1}\right) +(n-S)\log\left(\frac{1-p_0}{1-p_1}\right).\] Because \(p_1>p_0\), the likelihood ratio decreases as \(S\) increases. Therefore the LRT rejects \(H_0\) for large values of \(S\), i.e. \[\sum_{i=1}^n X_i \ge k\] for a threshold \(k\) chosen to control the Type I error probability.
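Because \(S\) is discrete, the threshold \(k\) is usually taken to be the smallest integer whose upper tail probability under \(p_0\) does not exceed \(\alpha\). The sketch below does this by direct search; the values \(n=20\), \(p_0=0.5\), \(p_1=0.7\), and \(\alpha=0.05\) are hypothetical.

```python
from math import comb

def upper_tail(k: int, n: int, p: float) -> float:
    """P(S >= k) for S ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

def smallest_threshold(n: int, p0: float, alpha: float) -> int:
    """Smallest k with P_{p0}(S >= k) <= alpha (k = n + 1 means never reject)."""
    for k in range(n + 2):
        if upper_tail(k, n, p0) <= alpha:
            return k
    return n + 1  # not reached: the tail at k = n + 1 is zero

n, p0, p1, alpha = 20, 0.5, 0.7, 0.05
k = smallest_threshold(n, p0, alpha)
size = upper_tail(k, n, p0)     # actual Type I error probability
power = upper_tail(k, n, p1)    # probability of rejecting when p = p1
print(f"k = {k}, size = {size:.4f}, power at p1 = {power:.4f}")
```

The attained size is typically strictly below \(\alpha\) because only integer thresholds are available.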
Practice Problem 32 (Bayesian posterior decision). Suppose \(\mu\mid x\sim \operatorname{Normal}(4,1)\). Test \[H_0:\mu\le 3 \qquad \text{versus}\qquad H_1:\mu>3.\] Using the Bayesian rule “reject \(H_0\) if \(\mathbb{P}(H_0\mid x)<1/2\),” what is the decision?
Since the posterior is normal with mean \(4\) and variance \(1\), \[\mathbb{P}(H_0\mid x)=\mathbb{P}(\mu\le 3\mid x)=\Phi\left(\frac{3-4}{1}\right)=\Phi(-1)\approx 0.1587.\] Since \[0.1587<\frac12,\] we reject \(H_0\) using this Bayesian decision rule.
Practice Problem 33 (Intersection-union logic). A device is acceptable only if its battery life is above \(10\) hours and its failure probability is below \(0.01\). Write the null and alternative hypotheses for an IUT.
Let \(\theta_1\) be the mean battery life and \(\theta_2\) be the failure probability. The device is acceptable only if \[\theta_1>10 \qquad \text{and}\qquad \theta_2<0.01.\] Thus the alternative is \[H_1:\theta_1>10 \text{ and } \theta_2<0.01.\] The null is that at least one requirement fails: \[H_0:\theta_1\le 10 \text{ or } \theta_2\ge 0.01.\] An IUT rejects \(H_0\) only if both component tests reject their corresponding component null hypotheses.
32 Summary
This chapter introduced the foundations of hypothesis testing and several general methods for constructing tests.
Key takeaways
A hypothesis test divides the sample space into an acceptance region and a rejection region.
Type I error means rejecting a true null; Type II error means failing to reject a false null.
The significance level controls Type I error probability.
The \(p\)-value is the smallest significance level that would reject \(H_0\).
Likelihood ratio tests reject when the null model fits much worse than the unrestricted model.
The classical \(Z\)-test and \(t\)-test can be derived from likelihood ratio tests.
Bayesian tests use posterior probabilities of hypotheses.
UITs reject when any component test rejects; IUTs reject only when all component tests reject.