16 Chapter 15: Hypothesis Tests II — Evaluating Tests
This chapter explains how to evaluate and compare hypothesis tests after constructing them. The main ideas are Type I and Type II errors, power functions, test size and level, unbiased tests, uniformly most powerful tests, \(p\)-values, and risk functions.
Evaluating hypothesis tests; Type I and Type II errors; power function; size and level; unbiased tests; uniformly most powerful tests; Neyman–Pearson lemma; binomial and normal UMP tests; \(p\)-values; loss functions; risk functions; decision-theoretic evaluation of tests.
17 Why We Need to Evaluate Tests
This section explains how to compare hypothesis tests after we have learned how to construct them.
In hypothesis testing, different methods may lead to different tests. A likelihood ratio test, a Bayesian test, a union-intersection test, and a classical test statistic can all be reasonable procedures. The key question is: how do we decide whether one test is better than another?
Main idea Hypothesis tests are evaluated and compared through their probabilities of making mistakes. A good test should have small Type I error probability and small Type II error probability, subject to the unavoidable trade-off between the two.
The main evaluation tools in this section are:
the power function;
the size or significance level;
unbiasedness of tests;
uniformly most powerful tests;
risk functions and expected loss.
18 Two Types of Errors
This section reviews the two fundamental errors in hypothesis testing and introduces the power function as a unified way to measure them.
Consider a hypothesis test \[H_0: \theta\in \Theta_0 \qquad \text{versus} \qquad H_1: \theta\in \Theta_0^c.\] Let \(R\) be the rejection region of the test. The test rejects \(H_0\) when \(X\in R\) and fails to reject \(H_0\) when \(X\in R^c\).
18.1 Type I and Type II errors
This subsection defines the two ways a hypothesis test can be wrong.
Definition 1 (Type I and Type II errors). For the test \(H_0:\theta\in\Theta_0\) versus \(H_1:\theta\in\Theta_0^c\):
A Type I error occurs when \(\theta\in\Theta_0\) but the test rejects \(H_0\).
A Type II error occurs when \(\theta\in\Theta_0^c\) but the test fails to reject \(H_0\).
| Decision | \(H_0\) true | \(H_1\) true |
|---|---|---|
| Reject \(H_0\) | Type I error | Correct decision |
| Fail to reject \(H_0\) | Correct decision | Type II error |
For \(\theta\in\Theta_0\), the Type I error probability is \[\mathbb{P}_\theta(X\in R).\] For \(\theta\in\Theta_0^c\), the Type II error probability is \[\mathbb{P}_\theta(X\in R^c)=1-\mathbb{P}_\theta(X\in R).\]
18.2 Power function
This subsection introduces the main function used to evaluate a test over all parameter values.
Definition 2 (Power function). The power function of a hypothesis test with rejection region \(R\) is \[\beta(\theta):=\mathbb{P}_\theta(X\in R).\] Thus, \[\beta(\theta)= \begin{cases} \mathbb{P}_\theta(\text{Type I error}), & \theta\in\Theta_0,\\[3pt] 1-\mathbb{P}_\theta(\text{Type II error}), & \theta\in\Theta_0^c. \end{cases}\]
Ideal power function The ideal test would have \[\beta(\theta)= \begin{cases} 0, & \theta\in\Theta_0,\\ 1, & \theta\in\Theta_0^c. \end{cases}\] This would mean no Type I errors and no Type II errors. In real problems, such a perfect test is usually impossible.
19 Power Function Examples
This section computes power functions for binomial and normal tests and shows the trade-off between Type I and Type II errors.
19.1 Binomial power functions
This subsection compares two tests for the same binomial hypothesis.
Example 3 (Binomial power function: two tests). Let \[X\sim \operatorname{Binomial}(n=5,\theta),\] and consider \[H_0:\theta\le \frac12 \qquad \text{versus} \qquad H_1:\theta>\frac12.\] Compare the following two tests.
Test 1: reject \(H_0\) if \(X=5\).
The rejection region is \(R_1=\{5\}\), so the power function is \[\beta_1(\theta)=\mathbb{P}_\theta(X=5)=\theta^5.\] For \(\theta\le 1/2\), \[\beta_1(\theta)\le \left(\frac12\right)^5=\frac{1}{32}\approx 0.03125.\] Thus Test 1 has a small Type I error probability, but for values of \(\theta\) only moderately larger than \(1/2\), it has a large Type II error probability.
Test 2: reject \(H_0\) if \(X=3,4,\) or \(5\).
The rejection region is \(R_2=\{3,4,5\}\), so the power function is \[\beta_2(\theta)=\mathbb{P}_\theta(X=3,4,5) =\binom53\theta^3(1-\theta)^2+\binom54\theta^4(1-\theta)+\binom55\theta^5.\]
The first test is conservative: it rejects only when all five trials are successes. Its Type I error probability is at most \(1/32\) under \(H_0\).
The second test rejects more often, so it has larger power for \(\theta>1/2\). However, it also has larger Type I error probability. At the boundary \(\theta=1/2\), \[\beta_2\left(\frac12\right) =\binom53\left(\frac12\right)^5+\binom54\left(\frac12\right)^5+\binom55\left(\frac12\right)^5 =\frac{10+5+1}{32}=\frac12.\] Thus Test 2 has much larger power but also much larger size. This illustrates the Type I–Type II error trade-off.
19.2 Normal power function
This subsection derives the power function of a one-sided normal mean test.
Example 4 (Normal power function). Let \[X_1,\ldots,X_n\sim \operatorname{Normal}(\theta,\sigma^2),\] where \(\sigma^2\) is known. Test \[H_0:\theta\le \theta_0 \qquad \text{versus} \qquad H_1:\theta>\theta_0.\] Consider the test that rejects \(H_0\) if \[\frac{\bar X-\theta_0}{\sigma/\sqrt n}>c,\] where \(c>0\) is a constant.
For any fixed \(\theta\), \[Z=\frac{\bar X-\theta}{\sigma/\sqrt n}\sim N(0,1).\] Thus the power function is \[\begin{aligned} \beta(\theta) &=\mathbb{P}_\theta\left(\frac{\bar X-\theta_0}{\sigma/\sqrt n}>c\right)\\ &=\mathbb{P}_\theta\left(\frac{\bar X-\theta}{\sigma/\sqrt n}>c+\frac{\theta_0-\theta}{\sigma/\sqrt n}\right)\\ &=1-\Phi\left(c+\frac{\theta_0-\theta}{\sigma/\sqrt n}\right). \end{aligned}\] Therefore, \[\beta(\theta_0)=1-\Phi(c).\] If \(c=1.28\), then \(\mathbb{P}(Z>1.28)\approx 0.10\), so the test has about a \(10\%\) significance level at the boundary \(\theta=\theta_0\).
The power function has the following behavior: \[\theta\to -\infty \implies \beta(\theta)\to 0, \qquad \theta\to +\infty \implies \beta(\theta)\to 1.\] For fixed sample size, decreasing Type I error usually increases Type II error, and increasing power usually increases Type I error.
19.3 Choosing sample size from power requirements
This subsection shows how power calculations determine a sample size.
Example 5 (Choosing \(c\) and \(n\)). Let \[X_1,\ldots,X_n\sim \operatorname{Normal}(\theta,\sigma^2),\] with known \(\sigma^2\). Test \[H_0:\theta\le \theta_0 \qquad \text{versus} \qquad H_1:\theta>\theta_0\] using the rule \[\frac{\bar X-\theta_0}{\sigma/\sqrt n}>c.\] Choose \(c\) and \(n\) so that \[\mathbb{P}(\text{Type I error})\le 0.10, \qquad \mathbb{P}(\text{Type II error})\le 0.20\quad \text{if } \theta\ge \theta_0+\sigma.\]
The Type I error is largest at the boundary \(\theta=\theta_0\). Hence we require \[\beta(\theta_0)=\mathbb{P}(Z>c)=0.10.\] Thus \[c=z_{0.10}\approx 1.28.\] Next, requiring Type II error at most \(0.20\) when \(\theta=\theta_0+\sigma\) is equivalent to requiring power at least \(0.80\): \[\beta(\theta_0+\sigma)=0.80.\] But \[\beta(\theta_0+\sigma) =\mathbb{P}\left(Z>c-\sqrt n\right).\] So \[\mathbb{P}\left(Z>1.28-\sqrt n\right)=0.80.\] Equivalently, \[1.28-\sqrt n=z_{0.80}^{\text{right tail}},\] where \(\mathbb{P}(Z>z_{0.80}^{\text{right tail}})=0.80\). Since \(z_{0.80}^{\text{right tail}}\approx -0.84\), \[1.28-\sqrt n\approx -0.84, \qquad \sqrt n\approx 2.12, \qquad n\approx 4.49.\] Therefore, choose \[\boxed{n=5.}\]
20 Size and Level of a Test
This section distinguishes the exact size of a test from the more flexible idea of a level-\(\alpha\) test.
20.1 Size and level
This subsection gives the formal definitions.
Definition 6 (Size-\(\alpha\) test). For \(\alpha\in[0,1]\), a test with power function \(\beta(\theta)\) is called a size \(\alpha\) test if \[\sup_{\theta\in\Theta_0}\mathbb{P}_\theta(\text{reject }H_0) =\sup_{\theta\in\Theta_0}\beta(\theta)=\alpha.\]
Definition 7 (Level-\(\alpha\) test). A test is called a level \(\alpha\) test if \[\sup_{\theta\in\Theta_0}\beta(\theta)\le \alpha.\]
Remark 8. A size-\(\alpha\) test uses the full Type I error budget exactly. A level-\(\alpha\) test may be conservative and use less than the allowed Type I error probability.
20.2 Size of likelihood ratio tests
This subsection explains how cutoffs are chosen for likelihood ratio tests.
For a likelihood ratio test with rejection region \[R=\{x:\lambda(x)\le c\},\] the size is \[\sup_{\theta\in\Theta_0}\mathbb{P}_\theta(\lambda(X)\le c).\] To obtain a size-\(\alpha\) LRT, choose \(c\) so that \[\sup_{\theta\in\Theta_0}\mathbb{P}_\theta(\lambda(X)\le c)=\alpha.\]
Example 9 (Size-\(\alpha\) LRT for a normal mean). Suppose \(X_1, \ldots,X_n\sim N(\theta,\sigma^2)\) with known \(\sigma^2\), and test \[H_0:\theta=\theta_0 \qquad \text{versus} \qquad H_1:\theta\ne\theta_0.\] The LRT rejects for large values of \[\left|\frac{\bar X-\theta_0}{\sigma/\sqrt n}\right|.\] A size-\(\alpha\) test rejects \(H_0\) if \[\left|\frac{\bar X-\theta_0}{\sigma/\sqrt n}\right|>z_{\alpha/2},\] where \(\mathbb{P}(Z>z_{\alpha/2})=\alpha/2\) for \(Z\sim N(0,1)\).
Under \(H_0\), \[Z=\frac{\bar X-\theta_0}{\sigma/\sqrt n}\sim N(0,1).\] Thus \[\mathbb{P}_{\theta_0}(|Z|>z_{\alpha/2})=\alpha.\] This is the classical two-sided \(Z\)-test. In LRT form, the cutoff is equivalent to \[c=\exp\left(-\frac{z_{\alpha/2}^2}{2}\right).\]
Common critical values The following notation is standard: \[\begin{aligned} &z_\alpha: &&\mathbb{P}(Z>z_\alpha)=\alpha, \quad Z\sim N(0,1),\\ &t_{n-1,\alpha}: &&\mathbb{P}(T_{n-1}>t_{n-1,\alpha})=\alpha,\\ &\chi^2_{p,1-\alpha}: &&\mathbb{P}(\chi_p^2>\chi^2_{p,1-\alpha})=\alpha. \end{aligned}\]
21 Unbiased Tests
This section adapts the idea of unbiasedness from estimation to hypothesis testing.
Definition 10 (Unbiased test). A test with power function \(\beta(\theta)\) is called unbiased if \[\beta(\theta')\ge \beta(\theta'')\] for every \(\theta'\in\Theta_0^c\) and every \(\theta''\in\Theta_0\).
This means that the probability of rejecting \(H_0\) under the alternative should be at least as large as the probability of rejecting \(H_0\) under the null.
Interpretation An unbiased test should not be more likely to reject \(H_0\) when \(H_0\) is true than when \(H_1\) is true. It should point in the correct direction.
Example 11 (One-sided normal test is unbiased). Let \(X_1,\ldots,X_n\sim N(\theta,\sigma^2)\) with known \(\sigma^2\). Test \[H_0:\theta\le \theta_0 \qquad \text{versus} \qquad H_1:\theta>\theta_0\] by rejecting \(H_0\) when \[\frac{\bar X-\theta_0}{\sigma/\sqrt n}>c.\]
The power function is \[\beta(\theta)=1-\Phi\left(c+\frac{\theta_0-\theta}{\sigma/\sqrt n}\right).\] This function is increasing in \(\theta\). Therefore, if \(\theta'>\theta_0\) and \(\theta''\le\theta_0\), then \[\beta(\theta')\ge \beta(\theta''),\] so the test is unbiased.
22 Uniformly Most Powerful Tests
This section introduces the strongest optimality criterion for hypothesis tests in a class of tests.
22.1 Most powerful and UMP tests
This subsection defines UMP tests.
A level-\(\alpha\) test guarantees that the Type I error probability is at most \(\alpha\) for all parameter values in the null. After controlling Type I error, we want to make Type II error small. Equivalently, we want the power function to be large under the alternative.
Definition 12 (Uniformly most powerful test). Let \(\mathcal C\) be a class of tests for \[H_0:\theta\in\Theta_0 \qquad \text{versus} \qquad H_1:\theta\in\Theta_0^c.\] A test in \(\mathcal C\) with power function \(\beta(\theta)\) is called a uniformly most powerful test in \(\mathcal C\) if \[\beta(\theta)\ge \beta'(\theta)\] for every \(\theta\in\Theta_0^c\) and for every competing power function \(\beta'(\theta)\) corresponding to another test in \(\mathcal C\).
Remark 13. UMP tests do not always exist, especially for two-sided alternatives. When they do exist, they provide a very strong optimality guarantee.
22.2 Neyman–Pearson lemma
This subsection gives the fundamental result for finding most powerful tests for simple hypotheses.
Theorem 14 (Neyman–Pearson Lemma). Consider testing two simple hypotheses \[H_0:\theta=\theta_0 \qquad \text{versus} \qquad H_1:\theta=\theta_1.\] Let \(f(x\mid\theta_i)\) be the pdf or pmf of \(X\) under \(\theta_i\). Define a rejection region \(R\) by \[x\in R \quad \text{if} \quad f(x\mid\theta_1)>k f(x\mid\theta_0),\] and \[x\in R^c \quad \text{if} \quad f(x\mid\theta_1)<k f(x\mid\theta_0),\] for some constant \(k\ge0\), chosen so that \[\mathbb{P}_{\theta_0}(X\in R)=\alpha.\] Then any test with this rejection region is a most powerful level-\(\alpha\) test for testing \(H_0\) against \(H_1\).
Proof. The Neyman–Pearson lemma says that for a fixed Type I error probability, the best way to maximize power against the simple alternative \(\theta_1\) is to reject where the likelihood under \(H_1\) is large relative to the likelihood under \(H_0\). In other words, the test rejects when \[\frac{f(x\mid\theta_1)}{f(x\mid\theta_0)}>k.\] This is exactly the likelihood-ratio principle for simple hypotheses. The full proof compares the power of any competing level-\(\alpha\) test with the power of this likelihood-ratio test and uses the sign of \[f(x\mid\theta_1)-k f(x\mid\theta_0)\] on the rejection and non-rejection regions. ◻
Likelihood-ratio form The Neyman–Pearson most powerful test rejects \(H_0\) for large values of \[\frac{f(x\mid\theta_1)}{f(x\mid\theta_0)}.\] This means that the observed data are much more likely under \(H_1\) than under \(H_0\).
23 UMP Examples
This section applies the Neyman–Pearson lemma to binomial and normal models.
23.1 UMP binomial test
This subsection works out a discrete example where boundary cases may require special attention.
Example 15 (UMP binomial test). Let \[X\sim \operatorname{Binomial}(2,\theta).\] Test \[H_0:\theta=\theta_0=\frac12 \qquad \text{versus} \qquad H_1:\theta=\theta_1=\frac34.\] Compute the likelihood ratios for \(X\in\{0,1,2\}\).
The pmf is \[f(x\mid\theta)=\binom{2}{x}\theta^x(1-\theta)^{2-x}.\] The likelihood ratio is \[\frac{f(x\mid 3/4)}{f(x\mid 1/2)}.\] For \(x=0\), \[\frac{f(0\mid 3/4)}{f(0\mid 1/2)} =\frac{(1/4)^2}{(1/2)^2}=\frac14.\] For \(x=1\), \[\frac{f(1\mid 3/4)}{f(1\mid 1/2)} =\frac{2(3/4)(1/4)}{2(1/2)(1/2)}=\frac34.\] For \(x=2\), \[\frac{f(2\mid 3/4)}{f(2\mid 1/2)} =\frac{(3/4)^2}{(1/2)^2}=\frac94.\] Thus the likelihood ratios are ordered as \[\frac14<\frac34<\frac94.\] By the Neyman–Pearson lemma, the most powerful test rejects first at \(X=2\), then at \(X=1\), then at \(X=0\), depending on the desired size.
For example:
If \(3/4<k<9/4\), reject \(H_0\) when \(X=2\). The size is \[\alpha=\mathbb{P}_{1/2}(X=2)=\frac14.\]
If \(1/4<k<3/4\), reject \(H_0\) when \(X=1\) or \(X=2\). The size is \[\alpha=\mathbb{P}_{1/2}(X=1\text{ or }2)=\frac34.\]
If \(k>9/4\), never reject \(H_0\), giving size \(0\).
If \(k<1/4\), always reject \(H_0\), giving size \(1\).
For the rejection rule \(R=\{2\}\), the power function is \[\beta(\theta)=\mathbb{P}_\theta(X=2)=\theta^2.\] For the rejection rule \(R=\{1,2\}\), the power function is \[\beta(\theta)=\mathbb{P}_\theta(X\ge1)=1-(1-\theta)^2.\]
23.2 UMP normal test
This subsection derives the one-sided most powerful normal test.
Example 16 (UMP normal test). Suppose \[X_1,\ldots,X_n\sim \operatorname{Normal}(\theta,\sigma^2),\] where \(\sigma^2\) is known. Test \[H_0:\theta=\theta_0 \qquad \text{versus} \qquad H_1:\theta=\theta_1,\] where \(\theta_1<\theta_0\).
The sample mean \(\bar X\) is sufficient for \(\theta\), and \[\bar X\sim \operatorname{Normal}\left(\theta,\frac{\sigma^2}{n}\right).\] The density of \(\bar X=t\) is \[g(t\mid \theta) =\frac{1}{\sqrt{2\pi\sigma^2/n}} \exp\left\{-\frac{n(t-\theta)^2}{2\sigma^2}\right\}.\] The Neyman–Pearson test rejects \(H_0\) when \[g(\bar X\mid\theta_1)>k g(\bar X\mid\theta_0).\] Taking logarithms, this inequality is equivalent to rejecting for small values of \(\bar X\), because \(\theta_1<\theta_0\). Hence the most powerful test has the form \[\text{Reject }H_0 \quad \text{if} \quad \bar X<c.\] To make this a level-\(\alpha\) test, choose \(c\) so that \[\alpha=\mathbb{P}_{\theta_0}(\bar X<c).\] Under \(H_0\), \[\bar X\sim \operatorname{Normal}\left(\theta_0,\frac{\sigma^2}{n}\right),\] so \[\boxed{c=\theta_0-\frac{\sigma}{\sqrt n}z_\alpha,}\] where \(\mathbb{P}(Z>z_\alpha)=\alpha\).
If instead \(\theta_1>\theta_0\), then the UMP test rejects for large values of \(\bar X\): \[\bar X>\theta_0+\frac{\sigma}{\sqrt n}z_\alpha.\]
24 \(p\)-Values
This section defines valid \(p\)-values and connects them to significance levels and rejection rules.
24.1 Definition and interpretation
This subsection explains why \(p\)-values summarize the strength of evidence against the null hypothesis.
A \(p\)-value is a test statistic that measures how surprising the observed data are if the null hypothesis is true. Small \(p\)-values provide evidence in favor of the alternative hypothesis.
Definition 17 (Valid \(p\)-value). A statistic \(p(X)\) is a valid \(p\)-value if, for every \(\theta\in\Theta_0\) and every \(0\le \alpha\le 1\), \[\mathbb{P}_\theta(p(X)\le \alpha)\le \alpha.\]
Remark 18. A \(p\)-value is not the probability that \(H_0\) is true. It is a probability computed under the assumption that \(H_0\) is true.
Equivalently, the \(p\)-value can be understood as the smallest significance level at which the observed test statistic would lead to rejection: \[p(x)=\inf\{\alpha:\text{ reject }H_0\text{ at level }\alpha\}.\]
24.2 General construction of a \(p\)-value
This subsection gives a general formula for a valid \(p\)-value.
Theorem 19 (General \(p\)-value formula). Let \(W(X)\) be a test statistic such that large values of \(W\) provide evidence in favor of \(H_1\). Then \[p(x):=\sup_{\theta\in\Theta_0}\mathbb{P}_\theta\left(W(X)\ge W(x)\right)\] defines a valid \(p\)-value.
Proof. The quantity \(p(x)\) is the largest null probability of observing a test statistic at least as extreme as the observed statistic. Therefore, for any \(\theta\in\Theta_0\), the probability that this tail probability is at most \(\alpha\) is no larger than \(\alpha\). Hence \[\mathbb{P}_\theta(p(X)\le\alpha)\le\alpha,\] which is the validity condition. ◻
Interpretation A \(p\)-value answers the question: \[\text{How surprising is my observed signal, if the null hypothesis were true?}\] The smaller the \(p\)-value, the stronger the evidence against \(H_0\).
24.3 One-sided normal \(p\)-value
This subsection computes a standard \(p\)-value for a one-sided normal test.
Example 20 (One-sided normal \(p\)-value). Let \(X_1,\ldots,X_n\sim \operatorname{Normal}(\mu,\sigma^2)\) with \(\sigma\) known. Test \[H_0:\mu\le \mu_0 \qquad \text{versus} \qquad H_1:\mu>\mu_0.\] Use the statistic \[W(X)=Z=\frac{\bar X-\mu_0}{\sigma/\sqrt n}.\] Find the \(p\)-value.
Large values of \(Z\) favor \(H_1\). For any \(\mu\le\mu_0\), \[\begin{aligned} \mathbb{P}_\mu(W\ge w) &=\mathbb{P}_\mu\left(\frac{\bar X-\mu_0}{\sigma/\sqrt n}\ge w\right)\\ &=\mathbb{P}_\mu\left(\frac{\bar X-\mu}{\sigma/\sqrt n}\ge w+\frac{\mu_0-\mu}{\sigma/\sqrt n}\right). \end{aligned}\] The right tail probability is largest when \(\mu=\mu_0\). Hence the supremum over \(\Theta_0=\{\mu:\mu\le\mu_0\}\) is attained at \(\mu=\mu_0\).
If \[z_{\mathrm{obs}}=\frac{\bar x-\mu_0}{\sigma/\sqrt n},\] then the valid \(p\)-value is \[\boxed{p(x)=\mathbb{P}_{\mu_0}(Z\ge z_{\mathrm{obs}})=1-\Phi(z_{\mathrm{obs}}).}\]
Example 21 (Numerical \(p\)-value). Suppose \(z_{\mathrm{obs}}=2.1\) in the one-sided normal test above. Compute the \(p\)-value and state the decision at levels \(\alpha=0.05\) and \(\alpha=0.01\).
The one-sided \(p\)-value is \[p=1-\Phi(2.1)\approx 0.0179.\] At \(\alpha=0.05\), since \(0.0179<0.05\), reject \(H_0\).
At \(\alpha=0.01\), since \(0.0179>0.01\), fail to reject \(H_0\).
25 Loss and Risk Functions for Tests
This section treats hypothesis testing as a decision problem with possible costs for different mistakes.
25.1 Decision-theoretic formulation
This subsection introduces actions, loss functions, and risk functions for hypothesis testing.
In hypothesis testing, there are two possible actions: \[a_0=\text{fail to reject }H_0, \qquad a_1=\text{reject }H_0.\] A decision rule \(\delta(x)\) selects an action based on the observed data \(x\).
Definition 22 (Loss function). A loss function \[L:\Theta\times \mathcal A\to \mathbb{R}\] measures the penalty for taking action \(a\in\mathcal A\) when the true parameter is \(\theta\).
Example 23 (0–1 loss). For testing \(H_0:\theta\in\Theta_0\) versus \(H_1:\theta\in\Theta_0^c\), the ordinary 0–1 loss is \[L(\theta,a)= \begin{cases} 0, & \theta\in\Theta_0\text{ and }a=a_0,\text{ or }\theta\in\Theta_0^c\text{ and }a=a_1,\\ 1, & \theta\in\Theta_0\text{ and }a=a_1,\text{ or }\theta\in\Theta_0^c\text{ and }a=a_0. \end{cases}\]
The loss is \(0\) for a correct decision and \(1\) for an incorrect decision. Thus both Type I and Type II errors are treated as equally costly.
25.2 Generalized 0–1 loss
This subsection allows Type I and Type II errors to have different costs.
Definition 24 (Generalized 0–1 loss). Let \(c_I\) be the cost of a Type I error and \(c_{II}\) be the cost of a Type II error. Define \[L(\theta,a_0)= \begin{cases} 0, & \theta\in\Theta_0,\\ c_{II}, & \theta\in\Theta_0^c, \end{cases} \qquad L(\theta,a_1)= \begin{cases} c_I, & \theta\in\Theta_0,\\ 0, & \theta\in\Theta_0^c. \end{cases}\]
If \(c_I=c_{II}=1\), this reduces to the ordinary 0–1 loss. If \(c_I\ne c_{II}\), the test should account for the relative seriousness of the two errors.
25.3 Risk function
This subsection defines risk as the expected loss of a test.
Definition 25 (Risk function). The risk function of a decision rule \(\delta\) is \[R(\theta,\delta)=\mathbb{E}_\theta[L(\theta,\delta(X))].\]
Let the power function be \[\beta(\theta)=\mathbb{P}_\theta(\delta(X)=a_1),\] the probability of rejecting \(H_0\) under parameter \(\theta\). Under generalized 0–1 loss, \[R(\theta,\delta)= \begin{cases} c_I\beta(\theta), & \theta\in\Theta_0,\\[3pt] c_{II}\{1-\beta(\theta)\}, & \theta\in\Theta_0^c. \end{cases}\]
Risk interpretation For \(\theta\in\Theta_0\), the risk is the Type I error probability weighted by the cost \(c_I\). For \(\theta\in\Theta_0^c\), the risk is the Type II error probability weighted by the cost \(c_{II}\).
26 Risk Function Example
This section computes the risk function for a one-sided normal test.
Example 26 (Normal one-sided test risk). Let \[X_1,\ldots,X_n\stackrel{iid}{\sim} N(\theta,\sigma^2), \qquad \sigma=1, \qquad n=25.\] Test \[H_0:\theta\le 0 \qquad \text{versus} \qquad H_1:\theta>0.\] Use the size \(\alpha=0.05\) test that rejects \(H_0\) when \[\bar X>z_{1-\alpha}\frac{\sigma}{\sqrt n}.\] Take \(c_I=2\) and \(c_{II}=1\). Compute the power function and risk function.
Here \[z_{1-\alpha}=z_{0.95}\approx 1.645, \qquad \frac{\sigma}{\sqrt n}=\frac15.\] So the rejection rule is \[\bar X>1.645\cdot \frac15=0.329.\] For a general \(\theta\), \[\bar X\sim N\left(\theta,\frac1{25}\right).\] Thus the power function is \[\begin{aligned} \beta(\theta) &=\mathbb{P}_\theta(\bar X>0.329)\\ &=\mathbb{P}\left(\frac{\bar X-\theta}{1/5}>\frac{0.329-\theta}{1/5}\right)\\ &=1-\Phi(1.645-5\theta). \end{aligned}\] Under the generalized 0–1 loss with \(c_I=2\) and \(c_{II}=1\), \[R(\theta)= \begin{cases} 2\beta(\theta), & \theta\le 0,\\[3pt] 1-\beta(\theta), & \theta>0. \end{cases}\]
Example 27 (Numerical risk values). Using the previous example, compute the risk at \(\theta=0,-0.2,0.2,0.5\).
The power function is \[\beta(\theta)=1-\Phi(1.645-5\theta).\] At the boundary \(\theta=0\), \[\beta(0)=0.05, \qquad R(0)=2(0.05)=0.10.\] At \(\theta=-0.2\), \[\beta(-0.2)=1-\Phi(1.645+1)=1-\Phi(2.645)\approx 0.0041,\] so \[R(-0.2)=2(0.0041)=0.0082.\] At \(\theta=0.2\), \[\beta(0.2)=1-\Phi(1.645-1)=1-\Phi(0.645)\approx 0.260.\] Since \(\theta>0\), \[R(0.2)=1-\beta(0.2)\approx 0.740.\] At \(\theta=0.5\), \[\beta(0.5)=1-\Phi(1.645-2.5)=1-\Phi(-0.855)=\Phi(0.855)\approx 0.803.\] Thus \[R(0.5)=1-0.803=0.197.\]
Important lesson The risk can be small under parts of the null and large near the alternative boundary. The shape of the risk function depends on both the power function and the relative costs \(c_I\) and \(c_{II}\).
27 Practice Problems
This section gives practice problems that reinforce the main definitions and calculations from the section.
Practice Problem 28 (Power function for a binomial test). Let \(X\sim\operatorname{Binomial}(6,\theta)\) and test \[H_0:\theta\le \frac12 \qquad \text{versus} \qquad H_1:\theta>\frac12.\] Consider the test that rejects \(H_0\) if \(X\ge5\).
Find the power function.
Find the size of the test.
Find the Type II error probability when \(\theta=0.8\).
(a) The rejection region is \(R=\{5,6\}\), so \[\beta(\theta)=\mathbb{P}_\theta(X\ge5) =\binom65\theta^5(1-\theta)+\binom66\theta^6 =6\theta^5(1-\theta)+\theta^6.\] (b) Since \(\beta(\theta)\) is increasing in \(\theta\), the size is attained at \(\theta=1/2\): \[\alpha=\beta(1/2)=6\left(\frac12\right)^5\left(\frac12\right)+\left(\frac12\right)^6 =\frac{6+1}{64}=\frac7{64}.\] (c) At \(\theta=0.8\), \[\beta(0.8)=6(0.8)^5(0.2)+(0.8)^6.\] The Type II error probability is \[1-\beta(0.8)=1-\{6(0.8)^5(0.2)+(0.8)^6\}.\] Numerically, this is approximately \(0.3446\).
Practice Problem 29 (Normal sample size). Let \(X_1,\ldots,X_n\sim N(\theta,\sigma^2)\) with known \(\sigma^2\). Test \[H_0:\theta\le\theta_0 \qquad \text{versus} \qquad H_1:\theta>\theta_0\] using a level \(0.05\) test. Find a formula for the smallest \(n\) such that the power is at least \(0.90\) when \(\theta=\theta_0+\Delta\), where \(\Delta>0\).
The level \(0.05\) test rejects when \[\frac{\bar X-\theta_0}{\sigma/\sqrt n}>z_{0.05}.\] At \(\theta=\theta_0+\Delta\), the power is \[\beta(\theta_0+\Delta) =1-\Phi\left(z_{0.05}-\frac{\Delta\sqrt n}{\sigma}\right).\] We require \[1-\Phi\left(z_{0.05}-\frac{\Delta\sqrt n}{\sigma}\right)\ge 0.90.\] Equivalently, \[\Phi\left(z_{0.05}-\frac{\Delta\sqrt n}{\sigma}\right)\le0.10.\] Thus \[z_{0.05}-\frac{\Delta\sqrt n}{\sigma}\le z_{0.90}^{\text{right tail}},\] where \(\mathbb{P}(Z>z_{0.90}^{\text{right tail}})=0.90\), so \(z_{0.90}^{\text{right tail}}\approx -1.2816\). Hence \[\sqrt n\ge \frac{\sigma}{\Delta}(z_{0.05}+1.2816).\] Therefore \[\boxed{n\ge \left(\frac{\sigma}{\Delta}(z_{0.05}+1.2816)\right)^2.}\] The smallest integer \(n\) is the ceiling of the right-hand side.
Practice Problem 30 (One-sided normal p-value). Let \(X_1,\ldots,X_{25}\sim N(\mu,4)\) and test \[H_0:\mu\le 10 \qquad \text{versus} \qquad H_1:\mu>10.\] Suppose \(\bar x=10.9\). Compute the \(p\)-value and state the decision at \(\alpha=0.05\).
Here \(\sigma=2\) and \(n=25\), so \(\sigma/\sqrt n=2/5=0.4\). The observed test statistic is \[z_{\mathrm{obs}}=\frac{10.9-10}{0.4}=2.25.\] The one-sided \(p\)-value is \[p=1-\Phi(2.25)\approx 0.0122.\] Since \(0.0122<0.05\), reject \(H_0\) at the \(5\%\) level.
Practice Problem 31 (Risk function). For the one-sided normal test in the previous problem, use generalized 0–1 loss with \(c_I=3\) and \(c_{II}=1\). Write the risk function in terms of the power function.
The decision is to reject or fail to reject \(H_0:\mu\le10\). The risk function is \[R(\mu,\delta)= \begin{cases} 3\beta(\mu), & \mu\le 10,\\[3pt] 1-\beta(\mu), & \mu>10. \end{cases}\] where \[\beta(\mu)=\mathbb{P}_\mu(\text{reject }H_0).\] For a level-\(0.05\) test, \[\beta(\mu)=1-\Phi\left(z_{0.05}+\frac{10-\mu}{2/5}\right).\]
28 Summary
This section summarizes the main concepts used to evaluate hypothesis tests.
| Concept | Definition | Meaning |
|---|---|---|
| Power function | \(\beta(\theta)=\mathbb{P}_\theta(X\in R)\) | Probability of rejecting \(H_0\) at parameter value \(\theta\) |
| Type I error | \(\mathbb{P}_\theta(X\in R)\) for \(\theta\in\Theta_0\) | Rejecting \(H_0\) when \(H_0\) is true |
| Type II error | \(1-\beta(\theta)\) for \(\theta\in\Theta_0^c\) | Failing to reject \(H_0\) when \(H_1\) is true |
| Size | \(\sup_{\theta\in\Theta_0}\beta(\theta)\) | Worst-case Type I error probability |
| Level \(\alpha\) | \(\sup_{\theta\in\Theta_0}\beta(\theta)\le\alpha\) | Type I error controlled by \(\alpha\) |
| Unbiased test | \(\beta(\theta')\ge\beta(\theta'')\) for \(\theta'\in\Theta_0^c\), \(\theta''\in\Theta_0\) | Rejects more often under the alternative than under the null |
| UMP test | Maximizes power for all \(\theta\in\Theta_0^c\) within a class | Strong optimality criterion |
| \(p\)-value | Smallest level at which \(H_0\) is rejected | Measures evidence against \(H_0\) |
| Risk function | \(R(\theta,\delta)=\mathbb{E}_\theta[L(\theta,\delta(X))]\) | Expected loss of a test |
Key message A hypothesis test is not evaluated by whether it is correct on one dataset. It is evaluated by its long-run error probabilities, its power function, and, when costs matter, its risk function.