14 Chapter 13: Point Estimation II — Evaluating Estimators
This chapter continues point estimation. Chapter 12 focused on how to construct estimators. This chapter focuses on how to evaluate and compare estimators using mean squared error, bias, variance, Fisher information, Cramér–Rao lower bounds, Rao–Blackwellization, UMVUEs, loss functions, risk functions, and Bayes risk.
Evaluating and comparing point estimators; mean squared error; bias and variance; unbiasedness; best unbiased estimators; Cramer-Rao inequality; Fisher information; Rao-Blackwell theorem; Lehmann-Scheffe theorem; loss functions; risk functions; Bayes risk; examples for normal, Bernoulli, binomial, and Poisson models.
15 Overview: Evaluating and Comparing Estimators
This section studies how to choose among different point estimators when several construction methods are available.
In Chapter 12, we discussed ways to find estimators: method of moments, maximum likelihood estimation, and Bayesian methods. In this section, the question changes from construction to evaluation.
Suppose several estimators are available for the same parameter. Which estimator should we use?
Common criteria for a good estimator include:
small mean squared error;
unbiasedness;
small variance among unbiased estimators;
efficiency relative to a lower bound;
use of sufficient statistics;
consistency as the sample size grows;
robustness under small model deviations.
The main theme is that there is usually no single universal meaning of “best.” The answer depends on the loss function, the parameter space, and whether we care about finite-sample or large-sample performance.
16 Mean Squared Error and Bias
Mean squared error is one of the most useful criteria because it combines variance and systematic bias into a single number.
16.1 Mean squared error
We begin with the most common risk measure for point estimation.
Definition 1 (Mean squared error). Let \(W=W(X_1,\ldots,X_n)\) be an estimator of a scalar parameter \(\theta\). The mean squared error of \(W\) is \[\operatorname{MSE}_\theta(W)=\mathbb{E}_\theta\bigl[(W-\theta)^2\bigr].\]
The MSE measures the average squared distance between the random estimator and the unknown true parameter. It depends on \(\theta\), so it is a function on the parameter space.
16.2 Bias and unbiasedness
Bias measures whether the estimator systematically overestimates or underestimates the target.
Definition 2 (Bias). The bias of a point estimator \(W\) of \(\theta\) is \[\operatorname{Bias}_\theta(W)=\mathbb{E}_\theta[W]-\theta.\] The estimator \(W\) is unbiased for \(\theta\) if \[\mathbb{E}_\theta[W]=\theta\] for all possible values of \(\theta\).
Remark 3. Unbiasedness is a natural criterion, but it is not the only criterion. A biased estimator can have smaller MSE than an unbiased estimator if the reduction in variance is large enough.
16.3 Bias-variance decomposition
The key reason MSE is convenient is that it decomposes into variance plus squared bias.
Theorem 4 (Bias-variance decomposition). For any estimator \(W\) of \(\theta\), \[\operatorname{MSE}_\theta(W) =\mathbb{E}_\theta[(W-\theta)^2] =\operatorname{Var}_\theta(W)+\bigl(\mathbb{E}_\theta[W]-\theta\bigr)^2.\] Equivalently, \[\operatorname{MSE}_\theta(W)=\operatorname{Var}_\theta(W)+\operatorname{Bias}_\theta(W)^2.\]
Proof. Write \[W-\theta=(W-\mathbb{E}_\theta W)+(\mathbb{E}_\theta W-\theta).\] Then \[\begin{aligned} \mathbb{E}_\theta[(W-\theta)^2] &=\mathbb{E}_\theta\left[(W-\mathbb{E}_\theta W)^2\right] +2(\mathbb{E}_\theta W-\theta)\mathbb{E}_\theta[W-\mathbb{E}_\theta W] +(\mathbb{E}_\theta W-\theta)^2\\ &=\operatorname{Var}_\theta(W)+(\mathbb{E}_\theta W-\theta)^2, \end{aligned}\] because the middle term is zero. ◻
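The decomposition can also be checked exactly on a small discrete model. The sketch below is only an illustration: the Binomial model, the estimator \((S+1)/(n+2)\), and the function names are assumptions chosen for this example, not objects from the lecture. Exact rational arithmetic is used so that the identity holds without rounding error.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Exact Binomial(n, p) probabilities for s = 0, ..., n (p is a Fraction)."""
    return [comb(n, s) * p**s * (1 - p) ** (n - s) for s in range(n + 1)]

def mse_parts(n, p, estimator):
    """Return (MSE, variance, bias) of estimator(S) where S ~ Binomial(n, p)."""
    pmf = binomial_pmf(n, p)
    mean = sum(q * estimator(s) for s, q in enumerate(pmf))
    var = sum(q * (estimator(s) - mean) ** 2 for s, q in enumerate(pmf))
    mse = sum(q * (estimator(s) - p) ** 2 for s, q in enumerate(pmf))
    return mse, var, mean - p

n, p = 5, Fraction(3, 10)
# Hypothetical biased estimator, used only for illustration.
mse, var, bias = mse_parts(n, p, lambda s: Fraction(s + 1, n + 2))
```

Comparing `mse` with `var + bias**2` reproduces Theorem 4 for this particular estimator and parameter value.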
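Remark 5. Other error measures are also reasonable, such as the mean absolute error \(\mathbb{E}_\theta|W-\theta|\). The MSE is especially common because squared error is differentiable in the estimate, which makes it analytically tractable, and because it admits the bias-variance decomposition.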
17 Examples of MSE
These examples show how MSE is computed and how bias can sometimes improve MSE.
17.1 Sample mean and sample variance under a normal model
We first recall basic facts about the sample mean and sample variance.
Example 6 (Normal distribution). Let \(X_1,\ldots,X_n\) be a random sample from a population with mean \(\mu\) and variance \(\sigma^2\). Define \[\bar X=\frac1n\sum_{i=1}^n X_i, \qquad S^2=\frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2.\] Then \[\mathbb{E}(\bar X)=\mu,\qquad \operatorname{Var}(\bar X)=\frac{\sigma^2}{n},\qquad \mathbb{E}(S^2)=\sigma^2.\] Thus \(\bar X\) is unbiased for \(\mu\), and \(S^2\) is unbiased for \(\sigma^2\).
Since \(X_1,\ldots,X_n\) are iid, \[\mathbb{E}(\bar X)=\frac1n\sum_{i=1}^n \mathbb{E}(X_i)=\mu\] and \[\operatorname{Var}(\bar X)=\frac1{n^2}\sum_{i=1}^n \operatorname{Var}(X_i)=\frac{\sigma^2}{n}.\] The identity \(\mathbb{E}(S^2)=\sigma^2\) was proved in the sampling section. Therefore \[\operatorname{MSE}(\bar X)=\operatorname{Var}(\bar X)=\frac{\sigma^2}{n}\] when estimating \(\mu\). If \(X_i\sim \operatorname{Normal}(\mu,\sigma^2)\), then \[\frac{(n-1)S^2}{\sigma^2}\sim \chi^2_{n-1},\] so \[\operatorname{Var}(S^2)=\frac{2\sigma^4}{n-1}.\] Since \(S^2\) is unbiased for \(\sigma^2\), \[\operatorname{MSE}(S^2)=\frac{2\sigma^4}{n-1}.\]
17.2 The MLE of normal variance
The MLE for the normal variance is biased, but it can have smaller MSE than the unbiased sample variance.
Example 7 (Normal variance MLE). Suppose \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)\). The MLE of \(\sigma^2\) is \[\widehat\sigma^2_{\mathrm{MLE}} =\frac1n\sum_{i=1}^n (X_i-\bar X)^2 =\frac{n-1}{n}S^2.\] Find its bias, variance, and MSE.
Since \(\mathbb{E}(S^2)=\sigma^2\), \[\mathbb{E}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =\frac{n-1}{n}\sigma^2.\] Thus \[\operatorname{Bias}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =-\frac{\sigma^2}{n}.\] Also, \[\operatorname{Var}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =\left(\frac{n-1}{n}\right)^2\operatorname{Var}(S^2) =\left(\frac{n-1}{n}\right)^2\frac{2\sigma^4}{n-1} =\frac{2(n-1)\sigma^4}{n^2}.\] Therefore \[\operatorname{MSE}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =\frac{2(n-1)\sigma^4}{n^2}+\frac{\sigma^4}{n^2} =\frac{(2n-1)\sigma^4}{n^2}.\] Compare this with \[\operatorname{MSE}(S^2)=\frac{2\sigma^4}{n-1}.\] For \(n\ge 2\), \[\frac{(2n-1)}{n^2}<\frac{2}{n-1},\] so \[\operatorname{MSE}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr)<\operatorname{MSE}(S^2).\] The MLE trades a small bias for a larger variance reduction.
Bias-variance tradeoff. An unbiased estimator is not automatically best under MSE. Sometimes a biased estimator has smaller MSE because its variance is much smaller.
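The comparison in Example 7 can be tabulated for several sample sizes. The following is a minimal sketch that works in units of \(\sigma^4\); the function names are hypothetical and only encode the two closed-form MSE expressions derived above.

```python
from fractions import Fraction

def mse_unbiased(n):
    """MSE of S^2 in units of sigma^4: 2 / (n - 1)."""
    return Fraction(2, n - 1)

def mse_mle(n):
    """MSE of the MLE in units of sigma^4: variance plus squared bias."""
    variance = Fraction(2 * (n - 1), n**2)
    bias_sq = Fraction(1, n**2)
    return variance + bias_sq

# Ratio MSE(MLE) / MSE(S^2) for a few sample sizes; values below 1 favor the MLE.
ratios = {n: mse_mle(n) / mse_unbiased(n) for n in (2, 5, 10, 100)}
```

Each ratio is below 1, and the ratios approach 1 as \(n\) grows, so the advantage of the MLE is mainly a small-sample effect.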
17.3 A Bayes estimator for Bernoulli data
This example compares the usual sample proportion with a Bayesian shrinkage estimator.
Example 8 (Binomial Bayes estimator). Suppose \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)\), where \(p\) is unknown. Let \[S=X_1+\cdots+X_n.\] The MLE is \[\widehat p_{\mathrm{MLE}}=\bar X=\frac{S}{n}.\] If the prior is \(p\sim \operatorname{Beta}(\alpha,\beta)\), then the posterior mean estimator is \[\widehat p_{\mathrm{B}} =\mathbb{E}[p\mid S] =\frac{\alpha+S}{\alpha+\beta+n}.\] Compute the MSE of \(\bar X\) and \(\widehat p_{\mathrm{B}}\) as functions of \(p\).
For the MLE, \[\mathbb{E}_p(\bar X)=p, \qquad \operatorname{Var}_p(\bar X)=\frac{p(1-p)}{n}.\] Thus \[\operatorname{MSE}_p(\bar X)=\frac{p(1-p)}{n}.\] For the Bayes estimator, \[\widehat p_{\mathrm{B}}=\frac{\alpha+S}{\alpha+\beta+n}.\] Since \(S\sim \operatorname{Binomial}(n,p)\), \[\mathbb{E}_p(S)=np, \qquad \operatorname{Var}_p(S)=np(1-p).\] Therefore \[\mathbb{E}_p(\widehat p_{\mathrm{B}}) =\frac{\alpha+np}{\alpha+\beta+n},\] \[\operatorname{Var}_p(\widehat p_{\mathrm{B}}) =\frac{np(1-p)}{(\alpha+\beta+n)^2},\] and \[\operatorname{Bias}_p(\widehat p_{\mathrm{B}}) =\frac{\alpha+np}{\alpha+\beta+n}-p =\frac{\alpha-p(\alpha+\beta)}{\alpha+\beta+n}.\] Hence \[\operatorname{MSE}_p(\widehat p_{\mathrm{B}}) =\frac{np(1-p)}{(\alpha+\beta+n)^2} +\left(\frac{\alpha-p(\alpha+\beta)}{\alpha+\beta+n}\right)^2.\] If we choose \[\alpha=\beta=\frac{\sqrt n}{4},\] then this estimator shrinks \(\bar X\) toward \(1/2\). It can have better MSE for small \(n\) or when one strongly believes \(p\) is close to \(1/2\). For large \(n\), the MLE becomes very strong because the data dominates.
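To see where each estimator wins, the two MSE formulas can be evaluated on a grid of \(p\) values. This is a sketch under the assumptions stated in the example: the helper names and the grid are illustrative choices, and \(n=4\) is picked only to make the numbers easy to read.

```python
from math import sqrt

def mse_mle(p, n):
    """MSE of the sample proportion: p(1 - p) / n."""
    return p * (1 - p) / n

def mse_bayes(p, n, alpha, beta):
    """MSE of the posterior mean (alpha + S) / (alpha + beta + n)."""
    total = alpha + beta + n
    variance = n * p * (1 - p) / total**2
    bias = (alpha - p * (alpha + beta)) / total
    return variance + bias**2

n = 4
a = b = sqrt(n) / 4          # the choice alpha = beta = sqrt(n)/4 from the text
grid = [0.05, 0.25, 0.5, 0.75, 0.95]
table = [(p, mse_mle(p, n), mse_bayes(p, n, a, b)) for p in grid]
```

With these values the Bayes estimator has the smaller MSE at \(p=1/2\), while the MLE has the smaller MSE at \(p=0.05\), which matches the shrinkage interpretation.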
Remark 9. The lecture slide emphasizes a special choice of \(\alpha\) and \(\beta\) that makes the MSE curve flatter. The main statistical idea is shrinkage: the estimator sacrifices unbiasedness to reduce risk near values favored by the prior.
18 Best Unbiased Estimators
Unbiasedness alone is not enough, so we compare unbiased estimators by their variance.
18.1 Definition and uniqueness
The best unbiased estimator has the smallest variance among all unbiased estimators.
Definition 10 (Best unbiased estimator and UMVUE). An estimator \(\widehat\theta\) of \(\theta\) is called a best unbiased estimator if:
it is unbiased: \(\mathbb{E}_\theta[\widehat\theta]=\theta\) for all \(\theta\);
it has minimum variance among all unbiased estimators of \(\theta\) for every \(\theta\).
It is also called a uniform minimum variance unbiased estimator, or UMVUE.
The definition extends naturally to estimating a function \(g(\theta)\).
Theorem 11 (Uniqueness). If \(W\) is a best unbiased estimator of \(g(\theta)\), then \(W\) is unique almost surely.
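Proof. Suppose \(W_1\) and \(W_2\) are both best unbiased estimators of \(g(\theta)\), so that \(\operatorname{Var}(W_1)=\operatorname{Var}(W_2)\) for every \(\theta\). Then \[W_3=\frac{W_1+W_2}{2}\] is also unbiased, and \[\operatorname{Var}(W_3)=\frac14\operatorname{Var}(W_1)+\frac14\operatorname{Var}(W_2)+\frac12\operatorname{Cov}(W_1,W_2)\le \operatorname{Var}(W_1),\] where the inequality uses Cauchy-Schwarz: \(\operatorname{Cov}(W_1,W_2)\le \sqrt{\operatorname{Var}(W_1)\operatorname{Var}(W_2)}=\operatorname{Var}(W_1)\). Since \(W_1\) has minimum variance among unbiased estimators, \(W_3\) cannot have strictly smaller variance, so equality must hold and \(\operatorname{Cov}(W_1,W_2)=\operatorname{Var}(W_1)\). Consequently \[\operatorname{Var}(W_1-W_2)=\operatorname{Var}(W_1)+\operatorname{Var}(W_2)-2\operatorname{Cov}(W_1,W_2)=0,\] so \(W_1-W_2\) is almost surely constant, and the constant is zero because both estimators have mean \(g(\theta)\). Therefore \(W_1=W_2\) almost surely. ◻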
18.2 Poisson unbiased estimation
This example shows that many unbiased estimators can exist, but one may be clearly better.
Example 12 (Poisson unbiased estimators). Suppose \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)\). Then both \(\bar X\) and \(S^2\) are unbiased estimators of \(\lambda\). Compare their MSEs.
For the sample mean, \[\mathbb{E}_\lambda(\bar X)=\lambda, \qquad \operatorname{Var}_\lambda(\bar X)=\frac{\lambda}{n}.\] Thus \[\operatorname{MSE}(\bar X)=\frac{\lambda}{n}.\] The sample variance \(S^2\) is also unbiased for the population variance, and for a Poisson random variable the population variance equals \(\lambda\). Hence \(\mathbb{E}(S^2)=\lambda\).
A general formula for the variance of the sample variance is \[\operatorname{Var}(S^2)=\frac1n\left(\mu_4-\frac{n-3}{n-1}\sigma^4\right),\] where \(\mu_4=\mathbb{E}[(X-\mu)^4]\) and \(\sigma^2=\operatorname{Var}(X)\). For \(X\sim \operatorname{Poisson}(\lambda)\), \[\mu=\lambda, \qquad \sigma^2=\lambda, \qquad \mu_4=\lambda+3\lambda^2.\] Therefore \[\begin{aligned} \operatorname{Var}(S^2) &=\frac1n\left(\lambda+3\lambda^2-\frac{n-3}{n-1}\lambda^2\right)\\ &=\frac{\lambda}{n}+\frac{2\lambda^2}{n-1}. \end{aligned}\] Since \(S^2\) is unbiased, \[\operatorname{MSE}(S^2)=\frac{\lambda}{n}+\frac{2\lambda^2}{n-1}.\] Thus \[\operatorname{MSE}(\bar X)<\operatorname{MSE}(S^2)\] for \(\lambda>0\). This shows \(\bar X\) is better than \(S^2\) among these two unbiased estimators, but by itself it does not prove \(\bar X\) is the best among all unbiased estimators.
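The simplification of \(\operatorname{Var}(S^2)\) in the Poisson case can be verified with exact arithmetic. The sketch below plugs the Poisson moments into the general formula; the function names and the values \(n=10\), \(\lambda=2\) are illustrative assumptions.

```python
from fractions import Fraction

def var_sample_variance(n, mu4, sigma2):
    """General formula Var(S^2) = (mu4 - (n-3)/(n-1) * sigma^4) / n."""
    return (mu4 - Fraction(n - 3, n - 1) * sigma2**2) / n

def poisson_mse_pair(n, lam):
    """Return (MSE of the sample mean, MSE of S^2) for Poisson(lam) data."""
    mu4 = lam + 3 * lam**2          # fourth central moment of Poisson(lam)
    return Fraction(lam) / n, var_sample_variance(n, mu4, lam)

mse_mean, mse_s2 = poisson_mse_pair(n=10, lam=Fraction(2))
```

For these values `mse_s2` agrees with \(\lambda/n+2\lambda^2/(n-1)\), and it exceeds `mse_mean` by exactly the extra term \(2\lambda^2/(n-1)\).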
19 Cramer-Rao Inequality and Fisher Information
The Cramer-Rao inequality gives a lower bound on the variance of unbiased estimators.
19.1 The inequality
The bound is one of the main tools for proving that an unbiased estimator is best.
Theorem 13 (Cramer-Rao inequality). Let \(X=(X_1,\ldots,X_n)\) have joint density or mass function \(f(x\mid \theta)\). Let \(W(X)\) be an estimator of \(g(\theta)\) satisfying the usual regularity conditions that allow differentiation under the integral sign. Then \[\operatorname{Var}_\theta(W) \ge \frac{\left(\dfrac{d}{d\theta}\mathbb{E}_\theta[W]\right)^2} {\mathbb{E}_\theta\left[\left(\dfrac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right]}.\] If \(W\) is unbiased for \(g(\theta)\), then \(\mathbb{E}_\theta[W]=g(\theta)\) and \[\operatorname{Var}_\theta(W) \ge \frac{\left(g'(\theta)\right)^2}{I_X(\theta)},\] where \[I_X(\theta)=\mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right]\] is the Fisher information in the sample.
Proof idea. Let \[U_\theta(X)=\frac{\partial}{\partial\theta}\log f(X\mid \theta)\] be the score function. Under regularity conditions, \[\mathbb{E}_\theta[U_\theta(X)]=0.\] Also, \[\begin{aligned} \frac{d}{d\theta}\mathbb{E}_\theta[W] &=\frac{d}{d\theta}\int W(x)f(x\mid \theta)\,dx\\ &=\int W(x)\frac{\partial}{\partial\theta}f(x\mid\theta)\,dx\\ &=\int W(x)\left(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right)f(x\mid\theta)\,dx\\ &=\mathbb{E}_\theta[W U_\theta(X)]\\ &=\operatorname{Cov}_\theta(W,U_\theta(X)). \end{aligned}\] By Cauchy-Schwarz, \[\left(\frac{d}{d\theta}\mathbb{E}_\theta[W]\right)^2 \le \operatorname{Var}_\theta(W)\operatorname{Var}_\theta(U_\theta(X)).\] Since \(\operatorname{Var}_\theta(U_\theta)=\mathbb{E}_\theta[U_\theta^2]=I_X(\theta)\), rearranging gives the result. ◻
19.2 Equality condition
The equality condition explains when the lower bound is attained.
Theorem 14 (Equality condition). Equality in the Cramer-Rao inequality holds if and only if there exists a function \(h(\theta)\) such that \[h(\theta)\{W(x)-g(\theta)\} =\frac{\partial}{\partial\theta}\log L(\theta\mid x),\] where \(L(\theta\mid x)\) is the likelihood function.
This condition is useful because if an unbiased estimator attains the Cramer-Rao lower bound, then it is a best unbiased estimator.
19.3 IID form and information number
For independent samples, the Fisher information adds.
Corollary 15 (IID sample). Suppose \(X_1,\ldots,X_n\) are iid with density or mass function \(f(x\mid\theta)\). Then \[I_X(\theta) =n I_1(\theta), \qquad I_1(\theta)=\mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right].\] Hence \[\operatorname{Var}_\theta(W) \ge \frac{\left(\dfrac{d}{d\theta}\mathbb{E}_\theta[W]\right)^2} {nI_1(\theta)}.\]
Lemma 16 (Computing Fisher information). Under suitable regularity conditions, often satisfied for exponential-family models, \[I_1(\theta) =\mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right] =-\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\log f(X\mid\theta)\right].\]
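As a quick check of Lemma 16, both expressions for \(I_1\) can be computed exactly for a single Bernoulli observation, where the expectation is a two-term sum. This is an illustrative sketch; the function name and the value \(p=1/4\) are assumptions made for the example.

```python
from fractions import Fraction

def bernoulli_information(p):
    """Return (E[score], E[score^2], -E[second derivative]) for one Bernoulli(p) draw."""
    pmf = {0: 1 - p, 1: p}
    # Score: d/dp of x*log(p) + (1 - x)*log(1 - p).
    score = {x: Fraction(x) / p - Fraction(1 - x) / (1 - p) for x in pmf}
    # Second derivative of the same log-density.
    hessian = {x: -Fraction(x) / p**2 - Fraction(1 - x) / (1 - p) ** 2 for x in pmf}
    mean_score = sum(pmf[x] * score[x] for x in pmf)
    info_sq = sum(pmf[x] * score[x] ** 2 for x in pmf)
    info_hess = -sum(pmf[x] * hessian[x] for x in pmf)
    return mean_score, info_sq, info_hess

p = Fraction(1, 4)
mean_score, info_sq, info_hess = bernoulli_information(p)
```

The score has mean zero, and both forms of the information equal \(1/\{p(1-p)\}\), consistent with the Bernoulli practice problem at the end of the chapter.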
19.4 Poisson conclusion
We now return to the Poisson example and prove that \(\bar X\) is best unbiased.
Example 17 (Poisson UMVUE). Let \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)\). Show that \(\bar X\) is the best unbiased estimator of \(\lambda\).
For one observation, \[f(x\mid\lambda)=e^{-\lambda}\frac{\lambda^x}{x!},\] so \[\log f(x\mid\lambda)=-\lambda+x\log\lambda-\log(x!).\] Then \[\frac{\partial^2}{\partial\lambda^2}\log f(x\mid\lambda) =-\frac{x}{\lambda^2}.\] Therefore \[I_1(\lambda) =-\mathbb{E}_\lambda\left[-\frac{X}{\lambda^2}\right] =\frac{\lambda}{\lambda^2} =\frac1\lambda.\] For \(n\) iid observations, \[I_X(\lambda)=\frac{n}{\lambda}.\] If \(W\) is unbiased for \(\lambda\), then \(g(\lambda)=\lambda\) and \(g'(\lambda)=1\). The Cramer-Rao lower bound gives \[\operatorname{Var}_\lambda(W)\ge \frac{1}{n/\lambda}=\frac{\lambda}{n}.\] But \[\mathbb{E}_\lambda(\bar X)=\lambda, \qquad \operatorname{Var}_\lambda(\bar X)=\frac{\lambda}{n}.\] Thus \(\bar X\) is unbiased and attains the lower bound. Hence \(\bar X\) is the best unbiased estimator of \(\lambda\).
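The Poisson information can also be approximated numerically from the squared-score definition, which gives an independent check on the second-derivative calculation. The sketch below truncates the Poisson sum at a hypothetical cutoff of 200 terms, which is more than enough for moderate \(\lambda\); the names and the values \(\lambda=3\), \(n=20\) are assumptions.

```python
from math import exp

def poisson_information(lam, terms=200):
    """Approximate I_1(lam) = E[(X/lam - 1)^2] by truncating the Poisson sum."""
    pmf = exp(-lam)                      # P(X = 0)
    total = 0.0
    for x in range(terms):
        total += pmf * (x / lam - 1.0) ** 2
        pmf *= lam / (x + 1)             # recursion P(X = x+1) = P(X = x) * lam / (x+1)
    return total

lam, n = 3.0, 20
cr_bound = 1.0 / (n * poisson_information(lam))   # bound for unbiased estimators of lam
var_mean = lam / n                                # variance of the sample mean
```

Up to floating-point error, `cr_bound` and `var_mean` coincide, which is the attainment statement of Example 17.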
19.5 Normal variance example
The Cramer-Rao bound is a lower bound, but not every unbiased estimator attains it.
Example 18 (Normal variance). Let \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)\). Consider estimating \(\sigma^2\). If we use the parameter \(\theta=\sigma^2\), then the normal log-density for one observation is \[\log f(x\mid \mu,\theta) =-\frac12\log(2\pi\theta)-\frac{(x-\mu)^2}{2\theta}.\] When \(\mu\) is treated as known in this direct one-parameter calculation, \[-\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log f(X\mid\mu,\theta)\right] =\frac{1}{2\theta^2} =\frac{1}{2\sigma^4}.\] Thus the Cramer-Rao lower bound for unbiased estimators of \(\sigma^2\) is \[\operatorname{Var}(W)\ge \frac{2\sigma^4}{n}.\]
Compute derivatives: \[\frac{\partial}{\partial\theta}\log f(x\mid\mu,\theta) =-\frac1{2\theta}+\frac{(x-\mu)^2}{2\theta^2},\] \[\frac{\partial^2}{\partial\theta^2}\log f(x\mid\mu,\theta) =\frac1{2\theta^2}-\frac{(x-\mu)^2}{\theta^3}.\] Since \(\mathbb{E}[(X-\mu)^2]=\theta\), \[-\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log f(X\mid\mu,\theta)\right] =-\left(\frac1{2\theta^2}-\frac{\theta}{\theta^3}\right) =\frac1{2\theta^2}.\] Thus \(I_X(\theta)=n/(2\theta^2)\), so \[\operatorname{Var}(W)\ge \frac{1}{n/(2\theta^2)}=\frac{2\theta^2}{n}=\frac{2\sigma^4}{n}.\] The usual unbiased sample variance satisfies \[\operatorname{Var}(S^2)=\frac{2\sigma^4}{n-1},\] which is strictly larger than \(2\sigma^4/n\). Therefore \(S^2\) does not attain this bound, and the Cramer-Rao calculation alone cannot show that \(S^2\) is best unbiased. By contrast, when \(\mu\) is known, the estimator \(\frac1n\sum_{i=1}^n (X_i-\mu)^2\) is unbiased with variance \(2\sigma^4/n\) and does attain the bound.
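The gap between the bound and \(\operatorname{Var}(S^2)\) can be summarized by an efficiency ratio. The sketch below is a small illustration in units of \(\sigma^4\); the function names and the chosen sample sizes are assumptions.

```python
from fractions import Fraction

def cr_bound_variance(n):
    """Cramer-Rao bound for unbiased estimators of sigma^2 (known mu), in units of sigma^4."""
    return Fraction(2, n)

def var_sample_variance(n):
    """Var(S^2) under normality, in units of sigma^4."""
    return Fraction(2, n - 1)

# Efficiency = bound / actual variance, which simplifies to (n - 1) / n.
efficiency = {n: cr_bound_variance(n) / var_sample_variance(n) for n in (2, 10, 100)}
```

The ratio is always below 1 but tends to 1 as \(n\) grows, so \(S^2\) misses the bound by a factor that vanishes in large samples.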
Remark 19. When nuisance parameters are present, such as unknown \(\mu\) while estimating \(\sigma^2\), one must be careful with the precise form of Fisher information. The lecture emphasizes the main message: the Cramer-Rao lower bound is a benchmark, and not every natural unbiased estimator attains it.
20 Sufficiency and Unbiasedness
Sufficient statistics can improve estimators by conditioning away irrelevant randomness.
20.1 Rao-Blackwell theorem
The Rao-Blackwell theorem is a systematic estimator-improvement principle.
Theorem 20 (Rao-Blackwell theorem). Let \(X_1,\ldots,X_n\) be a sample from a population distribution with density or mass function \(f(x\mid\theta)\). Let \(W(X)\) be an unbiased estimator of \(g(\theta)\), and let \(T(X)\) be a sufficient statistic for \(\theta\). Define \[\phi(T)=\mathbb{E}[W(X)\mid T].\] Then \[\mathbb{E}_\theta[\phi(T)]=g(\theta)\] and \[\operatorname{Var}_\theta(\phi(T))\le \operatorname{Var}_\theta(W)\] for every \(\theta\).
Proof. For unbiasedness, use the law of total expectation: \[\mathbb{E}_\theta[\phi(T)] =\mathbb{E}_\theta\bigl[\mathbb{E}_\theta(W\mid T)\bigr] =\mathbb{E}_\theta(W)=g(\theta).\] For variance, use the law of total variance: \[\operatorname{Var}_\theta(W) =\operatorname{Var}_\theta\bigl(\mathbb{E}_\theta(W\mid T)\bigr) +\mathbb{E}_\theta\bigl[\operatorname{Var}_\theta(W\mid T)\bigr].\] Since the second term is nonnegative, \[\operatorname{Var}_\theta(W)\ge \operatorname{Var}_\theta(\phi(T)).\] Thus \(\phi(T)\) is uniformly at least as good as \(W\) among unbiased estimators. ◻
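Rao-Blackwellization replaces an estimator by its conditional expectation given a sufficient statistic. Sufficiency guarantees that the conditional distribution of \(X\) given \(T\) does not depend on \(\theta\), so \(\phi(T)\) can be computed from the data alone and is a genuine estimator. The operation preserves unbiasedness and never increases variance.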
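A concrete instance makes the theorem tangible. The sketch below assumes Bernoulli data, the crude unbiased estimator \(W=X_1\), and the sufficient statistic \(T=\sum_i X_i\); these choices and the function name are illustrative, not taken from the lecture. It enumerates all \(2^n\) samples and computes \(\mathbb{E}[X_1\mid T=t]\) exactly.

```python
from fractions import Fraction
from itertools import product

def rao_blackwell_demo(n, p):
    """Compare W = X_1 with phi(T) = E[X_1 | T], T = X_1 + ... + X_n, for Bernoulli(p)."""
    prob_t = {}      # P(T = t)
    sum_w = {}       # sum of W * P(x) over samples x with total t
    for x in product((0, 1), repeat=n):
        t = sum(x)
        px = p**t * (1 - p) ** (n - t)
        prob_t[t] = prob_t.get(t, 0) + px
        sum_w[t] = sum_w.get(t, 0) + x[0] * px
    phi = {t: sum_w[t] / prob_t[t] for t in prob_t}        # conditional mean given T = t
    mean_phi = sum(prob_t[t] * phi[t] for t in phi)
    var_phi = sum(prob_t[t] * (phi[t] - mean_phi) ** 2 for t in phi)
    var_w = p * (1 - p)                                     # variance of a single Bernoulli draw
    return phi, mean_phi, var_phi, var_w

phi, mean_phi, var_phi, var_w = rao_blackwell_demo(4, Fraction(1, 3))
```

Here `phi[t]` equals \(t/n\) regardless of \(p\), so Rao-Blackwellizing \(X_1\) recovers the sample mean, and the variance drops from \(p(1-p)\) to \(p(1-p)/n\).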
20.2 Lehmann-Scheffe theorem
Completeness turns the Rao-Blackwell estimator into the unique best unbiased estimator.
Definition 21 (Complete sufficient statistic). A statistic \(T\) is complete for a family of distributions if, for every function \(h\), \[\mathbb{E}_\theta[h(T)]=0 \quad \text{for all } \theta \quad \Longrightarrow \quad h(T)=0 \text{ almost surely for all } \theta.\] If \(T\) is both sufficient and complete, it is called a complete sufficient statistic.
Theorem 22 (Lehmann-Scheffe theorem). Let \(T(X)\) be a complete sufficient statistic for \(\theta\). If \(\phi(T)\) is unbiased for \(g(\theta)\), then \(\phi(T)\) is the unique best unbiased estimator of \(g(\theta)\).
Remark 23. A related characterization says that an unbiased estimator is best unbiased if and only if it is uncorrelated with every unbiased estimator of zero.
20.3 Common UMVUEs
The following table summarizes common best unbiased estimators discussed in the course.
| Parameter | Distribution / model | UMVUE |
|---|---|---|
| Mean \(\mu\) | \(\operatorname{Normal}(\mu,\sigma^2)\), known \(\sigma^2\) | \(\bar X\) |
| Mean \(\mu\) | \(\operatorname{Normal}(\mu,\sigma^2)\), unknown \(\sigma^2\) | \(\bar X\) |
| Variance \(\sigma^2\) | \(\operatorname{Normal}(\mu,\sigma^2)\), unknown \(\mu\) | \(\displaystyle \frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2\) |
| Parameter \(p\) | Bernoulli / binomial | sample proportion \(\widehat p\) |
| Parameter \(\lambda\) | Poisson | \(\bar X\) |
| Parameter \(\theta\) | \(\operatorname{Uniform}[0,\theta]\) | \(\displaystyle \frac{n+1}{n}\max_i X_i\) |
Example 24 (Uniform UMVUE). Let \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Uniform}(0,\theta)\). Show that \[\widehat\theta=\frac{n+1}{n}X_{(n)}\] is unbiased for \(\theta\), where \(X_{(n)}=\max_i X_i\).
The CDF of \(X_{(n)}\) is \[\mathbb{P}(X_{(n)}\le x)=\left(\frac{x}{\theta}\right)^n, \qquad 0\le x\le \theta.\] Thus the density is \[f_{X_{(n)}}(x)=\frac{n x^{n-1}}{\theta^n}, \qquad 0\le x\le \theta.\] Then \[\mathbb{E}[X_{(n)}] =\int_0^\theta x\frac{n x^{n-1}}{\theta^n}\,dx =\frac{n}{\theta^n}\cdot \frac{\theta^{n+1}}{n+1} =\frac{n}{n+1}\theta.\] Therefore \[\mathbb{E}\left[\frac{n+1}{n}X_{(n)}\right]=\theta.\] Moreover, \(X_{(n)}\) is a complete sufficient statistic for \(\theta\) in this model, so by the Lehmann-Scheffe theorem \(\widehat\theta\) is the UMVUE, in agreement with the table above.
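The integral for \(\mathbb{E}[X_{(n)}]\) is easy to confirm numerically. The sketch below uses a plain midpoint rule rather than any library routine; the function name, the step count, and the values \(n=5\), \(\theta=2\) are assumptions made for illustration.

```python
def expected_max(n, theta, steps=100_000):
    """Midpoint-rule approximation of the integral of x * n x^(n-1) / theta^n over [0, theta]."""
    h = theta / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h                       # midpoint of the k-th subinterval
        total += x * n * x ** (n - 1) / theta**n * h
    return total

n, theta = 5, 2.0
mean_max = expected_max(n, theta)               # close to n / (n + 1) * theta
mean_corrected = (n + 1) / n * mean_max         # close to theta
```

The uncorrected maximum underestimates \(\theta\) on average, and multiplying by \((n+1)/n\) removes that bias.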
21 Loss Functions and Risk
Decision theory evaluates estimators by the loss incurred after making a decision.
21.1 Loss functions
A loss function measures the cost of estimating \(\theta\) by an action \(a\).
Definition 25 (Loss function). Let \(\Theta\) be the parameter space and \(\mathcal A\) be the action space. A loss function is a function \[L:\Theta\times\mathcal A\to \mathbb{R}\] where \(L(\theta,a)\) measures the cost of taking action \(a\) when the true parameter is \(\theta\).
For point estimation, the action space is often \(\mathcal A=\Theta\). Common losses include:
absolute error loss: \[L(\theta,a)=|a-\theta|;\]
squared error loss: \[L(\theta,a)=(a-\theta)^2;\]
zero-one loss for exact decisions: \[L(\theta,a)= \begin{cases} 0, & a=\theta,\\ 1, & a\ne \theta. \end{cases}\]
21.2 Risk function
The risk is the expected loss when a decision rule is used.
Definition 26 (Decision rule and risk). A decision rule is a function \[\delta:\mathcal X\to \mathcal A,\] where \(\mathcal X\) is the sample space, that selects an action based on the observed data. Its risk function is \[R(\theta,\delta)=\mathbb{E}_\theta[L(\theta,\delta(X))].\]
For squared error loss, \[R(\theta,\delta)=\mathbb{E}_\theta[(\delta(X)-\theta)^2]=\operatorname{MSE}_\theta(\delta).\] Thus MSE is a special case of risk.
Risk comparison. Given two estimators \(\delta_1\) and \(\delta_2\), if \[R(\theta,\delta_1)<R(\theta,\delta_2)\] for all \(\theta\), then \(\delta_1\) is uniformly better under the chosen loss.
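The risk function is just an expectation, so for a discrete model it can be computed by direct summation under any loss. The following sketch assumes a Binomial model and the sample proportion as the decision rule; all function names and the values \(n=10\), \(p=0.3\) are hypothetical choices for illustration.

```python
from math import comb

def risk(n, p, rule, loss):
    """Risk R(p, rule) = E_p[loss(p, rule(S))] with S ~ Binomial(n, p)."""
    return sum(
        comb(n, s) * p**s * (1 - p) ** (n - s) * loss(p, rule(s))
        for s in range(n + 1)
    )

def squared_error(theta, a):
    return (a - theta) ** 2

def absolute_error(theta, a):
    return abs(a - theta)

n, p = 10, 0.3

def sample_proportion(s):
    return s / n

risk_sq = risk(n, p, sample_proportion, squared_error)     # equals the MSE p(1 - p)/n
risk_abs = risk(n, p, sample_proportion, absolute_error)   # mean absolute error
```

Swapping the `loss` argument changes the criterion without touching the estimator, which is the point of separating the loss from the decision rule.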
22 Risk Examples
The risk function can lead to conclusions different from unbiasedness.
22.1 Bernoulli shrinkage estimator
The Bernoulli example illustrates how shrinkage changes risk.
Example 27 (Bernoulli risk comparison). Suppose \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)\). Compare \[\bar X=\frac1n\sum_{i=1}^n X_i\] with the shrinkage estimator \[\widehat p_B=\frac{\sum_{i=1}^n X_i+\sqrt n/4}{n+\sqrt n}.\] Use squared error loss.
Let \(S=\sum_{i=1}^n X_i\). Then \[\operatorname{MSE}_p(\bar X)=\frac{p(1-p)}{n}.\] For \[\widehat p_B=\frac{S+c}{n+d}, \qquad c=\frac{\sqrt n}{4},\quad d=\sqrt n,\] we have \[\mathbb{E}_p(\widehat p_B)=\frac{np+c}{n+d}, \qquad \operatorname{Var}_p(\widehat p_B)=\frac{np(1-p)}{(n+d)^2}.\] Thus \[\operatorname{MSE}_p(\widehat p_B) =\frac{np(1-p)}{(n+d)^2} +\left(\frac{np+c}{n+d}-p\right)^2.\] Since \(d=\sqrt n\) and \(c=\sqrt n/4\), \[\frac{np+c}{n+d}-p =\frac{\sqrt n(1/4-p)}{n+\sqrt n}.\] Therefore \[\operatorname{MSE}_p(\widehat p_B) =\frac{np(1-p)}{(n+\sqrt n)^2} +\frac{n(1/4-p)^2}{(n+\sqrt n)^2}.\] This estimator shrinks toward \(1/4\) in this parametrization. Its risk may be smaller than that of \(\bar X\) in some parts of the parameter space and larger in others. The lesson is that the preferred estimator depends on the loss function and the values of \(p\) considered important.
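The two risk functions can be compared on a grid of \(p\) values for a small and a large sample size. This is a sketch using the estimator exactly as defined in this example; the helper names, the grid of 99 interior points, and the sample sizes \(n=4\) and \(n=400\) are assumptions chosen for illustration.

```python
from math import sqrt

def mse_mean(p, n):
    """Risk of the sample proportion under squared error loss."""
    return p * (1 - p) / n

def mse_shrink(p, n):
    """Risk of (S + sqrt(n)/4) / (n + sqrt(n)), using the closed form derived above."""
    denom = (n + sqrt(n)) ** 2
    return n * p * (1 - p) / denom + n * (0.25 - p) ** 2 / denom

def shrinkage_wins(n, grid_size=99):
    """Grid points p in (0, 1) where the shrinkage estimator has strictly smaller risk."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return [p for p in grid if mse_shrink(p, n) < mse_mean(p, n)]

wins_small = shrinkage_wins(4)      # n = 4
wins_large = shrinkage_wins(400)    # n = 400
```

On this grid the set of \(p\) values where shrinkage wins is an interval around the shrinkage target, and it is narrower for \(n=400\) than for \(n=4\).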
Remark 28. The lecture slide displays risk curves for \(n=4\) and \(n=400\). The qualitative message is that shrinkage estimators can improve risk in small samples or near the shrinkage target, while the sample proportion becomes very strong as \(n\) grows.
22.2 Normal variance: choosing a constant multiple of \(S^2\)
This example shows how MSE can favor a biased multiple of the unbiased sample variance.
Example 29 (Normal variance risk). Let \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)\) and consider estimators of the form \[\delta_b(X)=bS^2, \qquad S^2=\frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2.\] Find the value of \(b\) that minimizes MSE for estimating \(\sigma^2\).
We know \[\mathbb{E}(S^2)=\sigma^2, \qquad \operatorname{Var}(S^2)=\frac{2\sigma^4}{n-1}.\] For \(\delta_b=bS^2\), \[\mathbb{E}(\delta_b)=b\sigma^2, \qquad \operatorname{Var}(\delta_b)=b^2\frac{2\sigma^4}{n-1}.\] Thus \[\begin{aligned} R(\sigma^2,\delta_b) &=\operatorname{MSE}(\delta_b)\\ &=b^2\frac{2\sigma^4}{n-1}+(b\sigma^2-\sigma^2)^2\\ &=\sigma^4\left(\frac{2b^2}{n-1}+(b-1)^2\right). \end{aligned}\] Differentiate with respect to \(b\): \[\frac{d}{db}\left(\frac{2b^2}{n-1}+(b-1)^2\right) =\frac{4b}{n-1}+2(b-1).\] Set equal to zero: \[\frac{4b}{n-1}+2b-2=0.\] Hence \[b\left(\frac{4}{n-1}+2\right)=2, \qquad b=\frac{n-1}{n+1}.\] Therefore the MSE-minimizing estimator in this class is \[\widetilde S^2 =\frac{n-1}{n+1}S^2 =\frac1{n+1}\sum_{i=1}^n (X_i-\bar X)^2.\]
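Correction note. The calculation above gives \(b=(n-1)/(n+1)\) under squared error loss. Because the objective is a convex quadratic in \(b\), this stationary point is the global minimizer. The resulting estimator is biased, with expectation \(\frac{n-1}{n+1}\sigma^2\), but it has the smallest MSE within the class \(bS^2\).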
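The calculus result can be cross-checked by brute force. The sketch below minimizes the \(b\)-dependent factor of the MSE over a fine grid using exact fractions; the function names, the grid resolution, and the choice \(n=9\) are illustrative assumptions.

```python
from fractions import Fraction

def scaled_mse(b, n):
    """MSE of b * S^2 in units of sigma^4: 2 b^2 / (n - 1) + (b - 1)^2."""
    return 2 * b**2 / Fraction(n - 1) + (b - 1) ** 2

def best_multiple_on_grid(n, resolution=1000):
    """Brute-force minimizer of scaled_mse over b = k / resolution, 0 <= k <= 2 * resolution."""
    grid = [Fraction(k, resolution) for k in range(2 * resolution + 1)]
    return min(grid, key=lambda b: scaled_mse(b, n))

n = 9
b_star = Fraction(n - 1, n + 1)          # closed-form minimizer, here 4/5
b_grid = best_multiple_on_grid(n)
```

For \(n=9\) the closed-form value \(4/5\) lies on the grid, so the grid search returns it exactly.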
22.3 Risk curves for variance estimators
The three common estimators of \(\sigma^2\) are \[S^2=\frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2,\] \[\widehat\sigma^2_{\mathrm{MLE}}=\frac1n\sum_{i=1}^n (X_i-\bar X)^2=\frac{n-1}{n}S^2,\] \[\widetilde S^2=\frac1{n+1}\sum_{i=1}^n (X_i-\bar X)^2=\frac{n-1}{n+1}S^2.\] Under squared error loss, their risks are proportional to \(\sigma^4\): \[R(\sigma^2,S^2)=\frac{2\sigma^4}{n-1},\] \[R(\sigma^2,\widehat\sigma^2_{\mathrm{MLE}})=\frac{(2n-1)\sigma^4}{n^2},\] \[R(\sigma^2,\widetilde S^2)=\sigma^4\left(\frac{2}{n-1}\left(\frac{n-1}{n+1}\right)^2+\frac{4}{(n+1)^2}\right)=\frac{2\sigma^4}{n+1}.\] For \(n\ge 2\) these satisfy \(R(\sigma^2,\widetilde S^2)<R(\sigma^2,\widehat\sigma^2_{\mathrm{MLE}})<R(\sigma^2,S^2)\), and \(\widetilde S^2\) has the smallest risk among all estimators of the form \(bS^2\).
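The three risk formulas can be evaluated side by side. The sketch below works in units of \(\sigma^4\) with exact fractions; the function name and the sample size \(n=10\) are assumptions for illustration.

```python
from fractions import Fraction

def risks(n):
    """Squared-error risks of the three variance estimators, in units of sigma^4."""
    unbiased = Fraction(2, n - 1)
    mle = Fraction(2 * n - 1, n**2)
    b = Fraction(n - 1, n + 1)
    shrunk = Fraction(2, n - 1) * b**2 + Fraction(4, (n + 1) ** 2)
    return unbiased, mle, shrunk

unbiased, mle, shrunk = risks(10)
```

For \(n=10\) the third risk simplifies to \(2/(n+1)\) in units of \(\sigma^4\), and the three values are ordered with \(\widetilde S^2\) smallest and \(S^2\) largest.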
22.4 Different loss functions: Stein’s loss
Changing the loss function can change which estimator is preferred.
Definition 30 (Stein’s loss for variance estimation). For estimating \(\sigma^2\) by an action \(a>0\), Stein’s loss is \[L(\sigma^2,a)=\frac{a}{\sigma^2}-1-\log\left(\frac{a}{\sigma^2}\right).\]
Example 31 (Stein’s loss for \(bS^2\)). For estimators \(\delta_b(X)=bS^2\), compute the risk under Stein’s loss up to terms independent of \(b\).
The risk is \[\begin{aligned} R(\sigma^2,\delta_b) &=\mathbb{E}\left[\frac{bS^2}{\sigma^2}-1-\log\left(\frac{bS^2}{\sigma^2}\right)\right]\\ &=b\mathbb{E}\left[\frac{S^2}{\sigma^2}\right]-1-\log b-\mathbb{E}\left[\log\left(\frac{S^2}{\sigma^2}\right)\right]. \end{aligned}\] Since \(\mathbb{E}[S^2/\sigma^2]=1\), \[R(\sigma^2,\delta_b)=b-\log b-1-C,\] where \[C=\mathbb{E}\left[\log\left(\frac{S^2}{\sigma^2}\right)\right]\] does not depend on \(b\). Differentiating \(b-\log b\) gives \[1-\frac1b=0,\] so \(b=1\). Under Stein’s loss, the best choice in this class is \(\delta_b=S^2\).
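A short numerical search confirms the minimizer under Stein's loss. The sketch below keeps only the part of the risk that depends on \(b\); the function name and the grid from 0.50 to 1.50 are hypothetical choices.

```python
from math import log

def stein_risk_in_b(b):
    """Part of the Stein-loss risk of b * S^2 that depends on b: b - log(b) - 1."""
    return b - log(b) - 1.0

grid = [k / 100 for k in range(50, 151)]        # b from 0.50 to 1.50
b_best = min(grid, key=stein_risk_in_b)
```

The search returns \(b=1\), in contrast with the value \((n-1)/(n+1)\) obtained under squared error loss for the same class of estimators.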
Important lesson. The word “best” depends on the loss function. Under squared error loss, a biased multiple of \(S^2\) can be preferred; under Stein’s loss, \(S^2\) is preferred within the class \(bS^2\).
23 Bayes Risk
Bayes risk averages the frequentist risk over a prior distribution on the parameter.
23.1 Definition
When a prior distribution is specified, we can compare estimators by their average risk under the prior.
Definition 32 (Bayes risk). Given a prior distribution \(\pi(\theta)\), the Bayes risk of a decision rule \(\delta\) is \[R_B(\delta)=\int_\Theta R(\theta,\delta)\pi(\theta)\,d\theta.\] Equivalently, \[R_B(\delta)=\int_\Theta\int_{\mathcal X} L(\theta,\delta(x))f(x\mid\theta)\pi(\theta)\,dx\,d\theta.\]
Using Bayes’ rule, this can also be written as \[R_B(\delta) =\int_{\mathcal X}\left[\int_\Theta L(\theta,\delta(x))\pi(\theta\mid x)\,d\theta\right]m(x)\,dx,\] where \(\pi(\theta\mid x)\) is the posterior distribution and \[m(x)=\int f(x\mid\theta)\pi(\theta)\,d\theta\] is the marginal distribution of the data.
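For Bernoulli data with a uniform prior on \(p\), the Bayes risk under squared error loss can be computed exactly, because every integral is a Beta integral. The sketch below is an illustration under those assumptions: the Uniform(0, 1) prior, the function names, and \(n=10\) are choices made for the example.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    """Integral of p^a (1 - p)^b over [0, 1] for nonnegative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def bayes_risk_uniform_prior(n, rule):
    """Bayes risk of rule(S) under squared error loss and a Uniform(0, 1) prior on p."""
    total = Fraction(0)
    for s in range(n + 1):
        d = rule(s)
        # Integrate (d - p)^2 * p^s (1 - p)^(n - s) term by term.
        inner = (d**2 * beta_integral(s, n - s)
                 - 2 * d * beta_integral(s + 1, n - s)
                 + beta_integral(s + 2, n - s))
        total += comb(n, s) * inner
    return total

n = 10
risk_mle = bayes_risk_uniform_prior(n, lambda s: Fraction(s, n))
risk_post_mean = bayes_risk_uniform_prior(n, lambda s: Fraction(s + 1, n + 2))
```

The uniform prior is \(\operatorname{Beta}(1,1)\), so the posterior mean is \((S+1)/(n+2)\). Its Bayes risk is \(1/\{6(n+2)\}\), smaller than the value \(1/(6n)\) for the sample proportion.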
23.2 Posterior expected loss
For a fixed observed data value \(x\), the posterior expected loss is \[\int_\Theta L(\theta,a)\pi(\theta\mid x)\,d\theta.\] A Bayes estimator chooses the action \(a\) minimizing this posterior expected loss.
Example 33 (Bayes estimator under squared error loss). Suppose \(L(\theta,a)=(a-\theta)^2\). Show that the Bayes estimator is the posterior mean.
For fixed data \(x\), minimize \[\mathbb{E}[(a-\theta)^2\mid x]\] with respect to \(a\). Expand: \[\mathbb{E}[(a-\theta)^2\mid x] =a^2-2a\mathbb{E}[\theta\mid x]+\mathbb{E}[\theta^2\mid x].\] Differentiating with respect to \(a\) gives \[2a-2\mathbb{E}[\theta\mid x]=0.\] Thus \[a=\mathbb{E}[\theta\mid x].\] So the Bayes estimator under squared error loss is the posterior mean.
Example 34 (Bayes estimator under absolute error loss). Suppose \(L(\theta,a)=|a-\theta|\). The Bayes estimator is a posterior median.
For fixed data \(x\), minimize \[\mathbb{E}[|a-\theta|\mid x] =\int |a-\theta|\pi(\theta\mid x)\,d\theta.\] A standard derivative/subgradient argument shows that a minimizer satisfies \[\mathbb{P}(\theta\le a\mid x)\ge \frac12, \qquad \mathbb{P}(\theta\ge a\mid x)\ge \frac12.\] Thus any posterior median is a Bayes estimator under absolute error loss.
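Both characterizations can be checked by minimizing the posterior expected loss over a grid of actions. The sketch below uses a small discrete posterior invented purely for illustration; the support points, weights, and function name are hypothetical.

```python
def posterior_expected_loss(action, support, weights, loss):
    """Posterior expected loss of an action under a discrete posterior."""
    return sum(w * loss(theta, action) for theta, w in zip(support, weights))

# Hypothetical skewed discrete posterior, used only for illustration.
support = [0.1, 0.2, 0.3, 0.6, 0.9]
weights = [0.30, 0.25, 0.20, 0.15, 0.10]

actions = [k / 1000 for k in range(1001)]
best_sq = min(actions, key=lambda a: posterior_expected_loss(
    a, support, weights, lambda t, x: (x - t) ** 2))
best_abs = min(actions, key=lambda a: posterior_expected_loss(
    a, support, weights, lambda t, x: abs(x - t)))

posterior_mean = sum(t * w for t, w in zip(support, weights))
```

For this posterior the squared-error minimizer matches the posterior mean 0.32, while the absolute-error minimizer is the posterior median 0.2, so the two losses lead to different Bayes estimates when the posterior is skewed.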
24 Summary of Estimator Criteria
This section closes with a comparison table for the main estimator-evaluation criteria.
| Criterion | Definition | Desirable property | Notes |
|---|---|---|---|
| Unbiasedness | \(\mathbb{E}[\widehat\theta]=\theta\) | Expected value equals the target | Does not guarantee minimum MSE |
| Bias | \(\operatorname{Bias}(\widehat\theta)=\mathbb{E}[\widehat\theta]-\theta\) | Bias close to zero | Can be positive or negative |
| Variance | \(\operatorname{Var}(\widehat\theta)\) | Smaller is better, especially among unbiased estimators | Measures sampling variability |
| MSE | \(\mathbb{E}[(\widehat\theta-\theta)^2]\) | Smaller is better | \(\operatorname{MSE}=\operatorname{Var}+\operatorname{Bias}^2\) |
| Efficiency | Variance relative to best possible lower bound | Close to Cramer-Rao lower bound | Usually for unbiased estimators |
| Consistency | \(\widehat\theta_n\xrightarrow{P}\theta\) | Converges to target as \(n\to\infty\) | Large-sample property |
| Sufficiency | Statistic captures all information about parameter | No information loss about \(\theta\) | Linked to factorization theorem and Rao-Blackwell |
| Robustness | Stability under small departures from assumptions | Less sensitive to outliers/model error | Important in applications |
25 Practice Problems
The practice problems below reinforce the main methods for evaluating estimators.
Practice Problem 35 (Bias-variance decomposition). Let \(W\) be an estimator of \(\theta\). Prove that \[\mathbb{E}_\theta[(W-\theta)^2]=\operatorname{Var}_\theta(W)+\operatorname{Bias}_\theta(W)^2.\]
Use \[W-\theta=(W-\mathbb{E}W)+(\mathbb{E}W-\theta).\] Squaring and taking expectations gives \[\mathbb{E}[(W-\theta)^2] =\mathbb{E}[(W-\mathbb{E}W)^2]+2(\mathbb{E}W-\theta)\mathbb{E}[W-\mathbb{E}W]+(\mathbb{E}W-\theta)^2.\] The middle term is zero, so \[\mathbb{E}[(W-\theta)^2]=\operatorname{Var}(W)+\operatorname{Bias}(W)^2.\]
Practice Problem 36 (MSE for a biased variance estimator). Let \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)\) and let \[\widehat\sigma^2_c=cS^2.\] Find \(\operatorname{MSE}(\widehat\sigma^2_c)\) as a function of \(c\).
Since \(\mathbb{E}(S^2)=\sigma^2\) and \(\operatorname{Var}(S^2)=2\sigma^4/(n-1)\), \[\mathbb{E}(cS^2)=c\sigma^2, \qquad \operatorname{Var}(cS^2)=c^2\frac{2\sigma^4}{n-1}.\] Therefore \[\operatorname{MSE}(cS^2) =c^2\frac{2\sigma^4}{n-1}+(c\sigma^2-\sigma^2)^2 =\sigma^4\left(\frac{2c^2}{n-1}+(c-1)^2\right).\]
Practice Problem 37 (Optimal constant multiple). Using the previous problem, find the value of \(c\) that minimizes \(\operatorname{MSE}(cS^2)\).
Minimize \[h(c)=\frac{2c^2}{n-1}+(c-1)^2.\] Differentiate: \[h'(c)=\frac{4c}{n-1}+2(c-1).\] Set \(h'(c)=0\): \[\frac{4c}{n-1}+2c-2=0.\] Thus \[c\left(\frac{4}{n-1}+2\right)=2, \qquad c=\frac{n-1}{n+1}.\]
Practice Problem 38 (Cramer-Rao for Bernoulli). Let \(X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)\). Find the Fisher information and the Cramer-Rao lower bound for unbiased estimators of \(p\).
For one observation, \[f(x\mid p)=p^x(1-p)^{1-x}, \qquad x\in\{0,1\}.\] The log-likelihood for one observation is \[\ell(p)=X\log p+(1-X)\log(1-p).\] Then \[\ell'(p)=\frac{X}{p}-\frac{1-X}{1-p} =\frac{X-p}{p(1-p)}.\] Thus \[I_1(p)=\mathbb{E}\left[\left(\frac{X-p}{p(1-p)}\right)^2\right] =\frac{\operatorname{Var}(X)}{p^2(1-p)^2} =\frac{p(1-p)}{p^2(1-p)^2} =\frac1{p(1-p)}.\] For \(n\) iid observations, \[I_n(p)=\frac{n}{p(1-p)}.\] For unbiased estimators of \(p\), \[\operatorname{Var}(W)\ge \frac{1}{I_n(p)}=\frac{p(1-p)}{n}.\] Since \(\bar X\) is unbiased and has variance \(p(1-p)/n\), it attains the lower bound.
Practice Problem 39 (Rao-Blackwell improvement). Suppose \(W\) is unbiased for \(g(\theta)\) and \(T\) is sufficient. Let \(\phi(T)=\mathbb{E}(W\mid T)\). Prove that \(\phi(T)\) is unbiased and has variance no larger than \(W\).
By the law of total expectation, \[\mathbb{E}[\phi(T)]=\mathbb{E}[\mathbb{E}(W\mid T)]=\mathbb{E}(W)=g(\theta).\] By the law of total variance, \[\operatorname{Var}(W)=\operatorname{Var}(\mathbb{E}(W\mid T))+\mathbb{E}[\operatorname{Var}(W\mid T)] =\operatorname{Var}(\phi(T))+\mathbb{E}[\operatorname{Var}(W\mid T)].\] The second term is nonnegative, so \[\operatorname{Var}(\phi(T))\le \operatorname{Var}(W).\]
Practice Problem 40 (Bayes estimator under squared error). Let \(\pi(\theta\mid x)\) be a posterior density. Show that the posterior mean minimizes posterior expected squared error loss.
For fixed \(x\), minimize \[Q(a)=\int (a-\theta)^2\pi(\theta\mid x)\,d\theta.\] Expanding, \[Q(a)=a^2-2a\mathbb{E}(\theta\mid x)+\mathbb{E}(\theta^2\mid x).\] Differentiate: \[Q'(a)=2a-2\mathbb{E}(\theta\mid x).\] Thus the minimizer is \[a=\mathbb{E}(\theta\mid x).\]