14 Chapter 13: Point Estimation II — Evaluating Estimators

This chapter continues point estimation. Chapter 12 focused on how to construct estimators. This chapter focuses on how to evaluate and compare estimators using mean squared error, bias, variance, Fisher information, Cramér–Rao lower bounds, Rao–Blackwellization, UMVUEs, loss functions, risk functions, and Bayes risk.

Topics

Evaluating and comparing point estimators; mean squared error; bias and variance; unbiasedness; best unbiased estimators; Cramer-Rao inequality; Fisher information; Rao-Blackwell theorem; Lehmann-Scheffe theorem; loss functions; risk functions; Bayes risk; examples for normal, Bernoulli, binomial, and Poisson models.

15 Overview: Evaluating and Comparing Estimators

This section studies how to choose among different point estimators when several construction methods are available.

In Chapter 12, we discussed ways to find estimators: method of moments, maximum likelihood estimation, and Bayesian methods. In this chapter, the question changes from construction to evaluation.

Key idea

Suppose several estimators are available for the same parameter. Which estimator should we use?

Common criteria for a good estimator include:

small mean squared error;
unbiasedness;
small variance among unbiased estimators;
efficiency relative to a lower bound;
use of sufficient statistics;
consistency as the sample size grows;
robustness under small model deviations.

The main theme is that there is usually no single universal meaning of “best.” The answer depends on the loss function, the parameter space, and whether we care about finite-sample or large-sample performance.

16 Mean Squared Error and Bias

Mean squared error is one of the most useful criteria because it combines variance and systematic bias into a single number.

16.1 Mean squared error

We begin with the most common risk measure for point estimation.

Definition

Definition 1 (Mean squared error). Let $W=W(X_1,\ldots,X_n)$ be an estimator of a scalar parameter $\theta$. The mean squared error of $W$ is \[\operatorname{MSE}_\theta(W)=\mathbb{E}_\theta\bigl[(W-\theta)^2\bigr].\]

The MSE measures the average squared distance between the random estimator and the unknown true parameter. It depends on $\theta$, so it is a function on the parameter space.

16.2 Bias and unbiasedness

Bias measures whether the estimator systematically overestimates or underestimates the target.

Definition

Definition 2 (Bias). The bias of a point estimator $W$ of $\theta$ is \[\operatorname{Bias}_\theta(W)=\mathbb{E}_\theta[W]-\theta.\] The estimator $W$ is unbiased for $\theta$ if \[\mathbb{E}_\theta[W]=\theta\] for all possible values of $\theta$.

Remark

Remark 3. Unbiasedness is a natural criterion, but it is not the only criterion. A biased estimator can have smaller MSE than an unbiased estimator if the reduction in variance is large enough.

16.3 Bias-variance decomposition

The key reason MSE is convenient is that it decomposes into variance plus squared bias.

Theorem

Theorem 4 (Bias-variance decomposition). For any estimator $W$ of $\theta$, \[\operatorname{MSE}_\theta(W) =\mathbb{E}_\theta[(W-\theta)^2] =\operatorname{Var}_\theta(W)+\bigl(\mathbb{E}_\theta[W]-\theta\bigr)^2.\] Equivalently, \[\operatorname{MSE}_\theta(W)=\operatorname{Var}_\theta(W)+\operatorname{Bias}_\theta(W)^2.\]

Proof

Proof. Write \[W-\theta=(W-\mathbb{E}_\theta W)+(\mathbb{E}_\theta W-\theta).\] Then \[\begin{aligned} \mathbb{E}_\theta[(W-\theta)^2] &=\mathbb{E}_\theta\left[(W-\mathbb{E}_\theta W)^2\right] +2(\mathbb{E}_\theta W-\theta)\mathbb{E}_\theta[W-\mathbb{E}_\theta W] +(\mathbb{E}_\theta W-\theta)^2\\ &=\operatorname{Var}_\theta(W)+(\mathbb{E}_\theta W-\theta)^2, \end{aligned}\] because the middle term is zero. ◻
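The decomposition can also be checked numerically. The following minimal sketch computes the MSE, variance, and bias of an estimator exactly by summing over a binomial distribution; the estimator $(S+1)/(n+2)$, the values $n=10$ and $p=0.3$, and all function names are assumptions chosen only for illustration.

```python
from math import comb

def binomial_pmf(n, p):
    """Return [P(S = s) for s = 0..n] when S ~ Binomial(n, p)."""
    return [comb(n, s) * p**s * (1 - p) ** (n - s) for s in range(n + 1)]

def mse_decomposition(n, p, estimator):
    """Exact MSE, variance, and bias of estimator(s, n) as an estimator of p."""
    pmf = binomial_pmf(n, p)
    values = [estimator(s, n) for s in range(n + 1)]
    mean = sum(w * q for w, q in zip(values, pmf))
    mse = sum((w - p) ** 2 * q for w, q in zip(values, pmf))
    var = sum((w - mean) ** 2 * q for w, q in zip(values, pmf))
    return mse, var, mean - p

def shrunk_proportion(s, n):
    """Hypothetical biased estimator, used only as an example."""
    return (s + 1) / (n + 2)

mse, var, bias = mse_decomposition(10, 0.3, shrunk_proportion)
print(mse, var + bias**2)  # the two numbers agree up to rounding error
```

Because the expectation is computed by exact enumeration rather than simulation, the two printed numbers differ only by floating-point rounding.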

Remark

Remark 5. It is reasonable to consider other error measures, such as the mean absolute error $\mathbb{E}_\theta|W-\theta|$. The MSE is especially common because squared error is differentiable everywhere, unlike absolute error, and because it admits the bias-variance decomposition.

17 Examples of MSE

These examples show how MSE is computed and how bias can sometimes improve MSE.

17.1 Sample mean and sample variance under a normal model

We first recall basic facts about the sample mean and sample variance.

Example

Example 6 (Normal distribution). Let $X_1,\ldots,X_n$ be a random sample from a population with mean $\mu$ and variance $\sigma^2$. Define \[\bar X=\frac1n\sum_{i=1}^n X_i, \qquad S^2=\frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2.\] Then \[\mathbb{E}(\bar X)=\mu,\qquad \operatorname{Var}(\bar X)=\frac{\sigma^2}{n},\qquad \mathbb{E}(S^2)=\sigma^2.\] Thus $\bar X$ is unbiased for $\mu$, and $S^2$ is unbiased for $\sigma^2$.

Solution

Since $X_1,\ldots,X_n$ are iid, \[\mathbb{E}(\bar X)=\frac1n\sum_{i=1}^n \mathbb{E}(X_i)=\mu\] and \[\operatorname{Var}(\bar X)=\frac1{n^2}\sum_{i=1}^n \operatorname{Var}(X_i)=\frac{\sigma^2}{n}.\] The identity $\mathbb{E}(S^2)=\sigma^2$ was proved in the sampling section. Therefore \[\operatorname{MSE}(\bar X)=\operatorname{Var}(\bar X)=\frac{\sigma^2}{n}\] when estimating $\mu$. If $X_i\sim \operatorname{Normal}(\mu,\sigma^2)$, then \[\frac{(n-1)S^2}{\sigma^2}\sim \chi^2_{n-1},\] so \[\operatorname{Var}(S^2)=\frac{2\sigma^4}{n-1}.\] Since $S^2$ is unbiased for $\sigma^2$, \[\operatorname{MSE}(S^2)=\frac{2\sigma^4}{n-1}.\]

17.2 The MLE of normal variance

The MLE for the normal variance is biased, but for every $n\ge 2$ it has smaller MSE than the unbiased sample variance.

Example

Example 7 (Normal variance MLE). Suppose $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$. The MLE of $\sigma^2$ is \[\widehat\sigma^2_{\mathrm{MLE}} =\frac1n\sum_{i=1}^n (X_i-\bar X)^2 =\frac{n-1}{n}S^2.\] Find its bias, variance, and MSE.

Solution

Since $\mathbb{E}(S^2)=\sigma^2$, \[\mathbb{E}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =\frac{n-1}{n}\sigma^2.\] Thus \[\operatorname{Bias}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =-\frac{\sigma^2}{n}.\] Also, \[\operatorname{Var}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =\left(\frac{n-1}{n}\right)^2\operatorname{Var}(S^2) =\left(\frac{n-1}{n}\right)^2\frac{2\sigma^4}{n-1} =\frac{2(n-1)\sigma^4}{n^2}.\] Therefore \[\operatorname{MSE}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =\frac{2(n-1)\sigma^4}{n^2}+\frac{\sigma^4}{n^2} =\frac{(2n-1)\sigma^4}{n^2}.\] Compare this with \[\operatorname{MSE}(S^2)=\frac{2\sigma^4}{n-1}.\] For $n\ge 2$, \[\frac{(2n-1)}{n^2}<\frac{2}{n-1},\] so \[\operatorname{MSE}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr)<\operatorname{MSE}(S^2).\] The MLE trades a small bias for a larger variance reduction.
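A small simulation can be used to sanity-check these formulas. The sketch below is a Monte Carlo approximation, not an exact computation; the sample size $n=5$, the value $\sigma=1$, the number of replications, the seed, and the function name are all assumptions chosen for illustration.

```python
import random

def simulate_variance_mse(n, sigma, reps, seed=0):
    """Monte Carlo MSE of S^2 and of the MLE for Normal(0, sigma^2) samples of size n."""
    rng = random.Random(seed)
    target = sigma ** 2
    sq_err_s2 = 0.0
    sq_err_mle = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)   # sum of squared deviations
        sq_err_s2 += (ss / (n - 1) - target) ** 2   # unbiased estimator S^2
        sq_err_mle += (ss / n - target) ** 2        # MLE
    return sq_err_s2 / reps, sq_err_mle / reps

n, sigma = 5, 1.0
mse_s2, mse_mle = simulate_variance_mse(n, sigma, reps=20000)
print("simulated:", mse_s2, mse_mle)
print("theory:   ", 2 * sigma**4 / (n - 1), (2 * n - 1) * sigma**4 / n**2)
```

The simulated values should land close to the theoretical values $2\sigma^4/(n-1)$ and $(2n-1)\sigma^4/n^2$, with the MLE showing the smaller MSE.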

Key idea

Bias-variance tradeoff. An unbiased estimator is not automatically best under MSE. Sometimes a biased estimator has smaller MSE because its variance is much smaller.

17.3 A Bayes estimator for Bernoulli data

This example compares the usual sample proportion with a Bayesian shrinkage estimator.

Example

Example 8 (Binomial Bayes estimator). Suppose $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)$, where $p$ is unknown. Let \[S=X_1+\cdots+X_n.\] The MLE is \[\widehat p_{\mathrm{MLE}}=\bar X=\frac{S}{n}.\] If the prior is $p\sim \operatorname{Beta}(\alpha,\beta)$, then the posterior mean estimator is \[\widehat p_{\mathrm{B}} =\mathbb{E}[p\mid S] =\frac{\alpha+S}{\alpha+\beta+n}.\] Compute the MSE of $\bar X$ and $\widehat p_{\mathrm{B}}$ as functions of $p$.

Solution

For the MLE, \[\mathbb{E}_p(\bar X)=p, \qquad \operatorname{Var}_p(\bar X)=\frac{p(1-p)}{n}.\] Thus \[\operatorname{MSE}_p(\bar X)=\frac{p(1-p)}{n}.\] For the Bayes estimator, \[\widehat p_{\mathrm{B}}=\frac{\alpha+S}{\alpha+\beta+n}.\] Since $S\sim \operatorname{Binomial}(n,p)$, \[\mathbb{E}_p(S)=np, \qquad \operatorname{Var}_p(S)=np(1-p).\] Therefore \[\mathbb{E}_p(\widehat p_{\mathrm{B}}) =\frac{\alpha+np}{\alpha+\beta+n},\] \[\operatorname{Var}_p(\widehat p_{\mathrm{B}}) =\frac{np(1-p)}{(\alpha+\beta+n)^2},\] and \[\operatorname{Bias}_p(\widehat p_{\mathrm{B}}) =\frac{\alpha+np}{\alpha+\beta+n}-p =\frac{\alpha-p(\alpha+\beta)}{\alpha+\beta+n}.\] Hence \[\operatorname{MSE}_p(\widehat p_{\mathrm{B}}) =\frac{np(1-p)}{(\alpha+\beta+n)^2} +\left(\frac{\alpha-p(\alpha+\beta)}{\alpha+\beta+n}\right)^2.\] If we choose \[\alpha=\beta=\sqrt{n/4}=\frac{\sqrt n}{2},\] then the estimator shrinks $\bar X$ toward $1/2$, and its MSE simplifies to the constant \[\operatorname{MSE}_p(\widehat p_{\mathrm{B}})=\frac{n}{4(n+\sqrt n)^2},\] which does not depend on $p$. For small $n$, this is smaller than $\operatorname{MSE}_p(\bar X)$ except when $p$ is near $0$ or $1$. For large $n$, the MLE has the smaller MSE unless $p$ is close to $1/2$, because the data dominate the prior.
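The two MSE formulas can be evaluated side by side to see where shrinkage helps. The sketch below is illustrative only: the function names, the value $n=4$, and the uniform prior $\alpha=\beta=1$ are assumptions made for the example. For $n=4$ this prior coincides with $\alpha=\beta=\sqrt n/2$, and the Bayes MSE is then the same for every $p$.

```python
from math import sqrt

def mse_mle(p, n):
    """MSE of the sample proportion S / n."""
    return p * (1 - p) / n

def mse_bayes(p, n, alpha, beta):
    """MSE of the posterior mean (alpha + S) / (alpha + beta + n)."""
    denom = alpha + beta + n
    variance = n * p * (1 - p) / denom**2
    bias = (alpha - p * (alpha + beta)) / denom
    return variance + bias**2

n = 4
alpha = beta = 1.0   # assumed prior; equals sqrt(n) / 2 when n = 4
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(mse_mle(p, n), 4), round(mse_bayes(p, n, alpha, beta), 4))
```

In this setting the Bayes column is flat, while the MLE column is smaller near $p=0.1$ and $p=0.9$ and larger near $p=1/2$.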

Remark

Remark 9. The lecture slide emphasizes a special choice of $\alpha$ and $\beta$ that makes the MSE curve flatter. The main statistical idea is shrinkage: the estimator sacrifices unbiasedness to reduce risk near values favored by the prior.

18 Best Unbiased Estimators

Unbiasedness alone is not enough, so we compare unbiased estimators by their variance.

18.1 Definition and uniqueness

The best unbiased estimator has the smallest variance among all unbiased estimators.

Definition

Definition 10 (Best unbiased estimator and UMVUE). An estimator $\widehat\theta$ of $\theta$ is called a best unbiased estimator if:

it is unbiased: $\mathbb{E}_\theta[\widehat\theta]=\theta$ for all $\theta$;
it has minimum variance among all unbiased estimators of $\theta$ for every $\theta$.

It is also called a uniform minimum variance unbiased estimator, or UMVUE.

The definition extends naturally to estimating a function $g(\theta)$.

Theorem

Theorem 11 (Uniqueness). If $W$ is a best unbiased estimator of $g(\theta)$, then $W$ is unique almost surely.

Proof

Proof. Suppose $W_1$ and $W_2$ are both best unbiased estimators of $g(\theta)$, so that $\operatorname{Var}_\theta(W_1)=\operatorname{Var}_\theta(W_2)=v(\theta)$. Then \[W_3=\frac{W_1+W_2}{2}\] is also unbiased, so $\operatorname{Var}_\theta(W_3)\ge v(\theta)$ by minimality. On the other hand, expanding the variances of $W_1+W_2$ and $W_1-W_2$ gives the identity \[\operatorname{Var}_\theta(W_3)=v(\theta)-\frac14\operatorname{Var}_\theta(W_1-W_2).\] Combining the two facts forces $\operatorname{Var}_\theta(W_1-W_2)=0$, so $W_1-W_2$ is almost surely equal to its mean, which is $g(\theta)-g(\theta)=0$. Therefore $W_1=W_2$ almost surely. ◻

18.2 Poisson unbiased estimation

This example shows that many unbiased estimators can exist, but one may be clearly better.

Example

Example 12 (Poisson unbiased estimators). Suppose $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)$. Then both $\bar X$ and $S^2$ are unbiased estimators of $\lambda$. Compare their MSEs.

Solution

For the sample mean, \[\mathbb{E}_\lambda(\bar X)=\lambda, \qquad \operatorname{Var}_\lambda(\bar X)=\frac{\lambda}{n}.\] Thus \[\operatorname{MSE}(\bar X)=\frac{\lambda}{n}.\] The sample variance $S^2$ is also unbiased for the population variance, and for a Poisson random variable the population variance equals $\lambda$. Hence $\mathbb{E}(S^2)=\lambda$.

A general formula for the variance of the sample variance is \[\operatorname{Var}(S^2)=\frac1n\left(\mu_4-\frac{n-3}{n-1}\sigma^4\right),\] where $\mu_4=\mathbb{E}[(X-\mu)^4]$ and $\sigma^2=\operatorname{Var}(X)$. For $X\sim \operatorname{Poisson}(\lambda)$, \[\mu=\lambda, \qquad \sigma^2=\lambda, \qquad \mu_4=\lambda+3\lambda^2.\] Therefore \[\begin{aligned} \operatorname{Var}(S^2) &=\frac1n\left(\lambda+3\lambda^2-\frac{n-3}{n-1}\lambda^2\right)\\ &=\frac{\lambda}{n}+\frac{2\lambda^2}{n-1}. \end{aligned}\] Since $S^2$ is unbiased, \[\operatorname{MSE}(S^2)=\frac{\lambda}{n}+\frac{2\lambda^2}{n-1}.\] Thus \[\operatorname{MSE}(\bar X)<\operatorname{MSE}(S^2)\] for $\lambda>0$. This shows $\bar X$ is better than $S^2$ among these two unbiased estimators, but by itself it does not prove $\bar X$ is the best among all unbiased estimators.
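The moment formula and the resulting comparison can be checked numerically. The sketch below approximates Poisson central moments by a truncated sum; the truncation point `kmax=200`, the values $\lambda=2$ and $n=10$, and the function names are assumptions for illustration, and the truncation is only adequate for moderate $\lambda$.

```python
from math import exp

def poisson_pmf(lam, kmax=200):
    """Poisson(lam) probabilities for k = 0..kmax, built recursively."""
    probs = [exp(-lam)]
    for k in range(1, kmax + 1):
        probs.append(probs[-1] * lam / k)
    return probs

def central_moment(lam, order):
    """Truncated-sum approximation of E[(X - lam)^order] for X ~ Poisson(lam)."""
    return sum((k - lam) ** order * q for k, q in enumerate(poisson_pmf(lam)))

def mse_sample_mean(lam, n):
    return lam / n

def mse_sample_variance(lam, n):
    """Var(S^2) from the general formula (mu4 - (n-3)/(n-1) * sigma^4) / n."""
    mu4 = central_moment(lam, 4)
    sigma4 = central_moment(lam, 2) ** 2
    return (mu4 - (n - 3) / (n - 1) * sigma4) / n

lam, n = 2.0, 10
print(central_moment(lam, 4), lam + 3 * lam**2)            # both approximate mu_4
print(mse_sample_mean(lam, n), mse_sample_variance(lam, n))
```

The second printed line shows how much larger the MSE of $S^2$ is than that of $\bar X$ for these values.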

19 Cramer-Rao Inequality and Fisher Information

The Cramer-Rao inequality gives a lower bound on the variance of unbiased estimators.

19.1 The inequality

The bound is one of the main tools for proving that an unbiased estimator is best.

Theorem

Theorem 13 (Cramer-Rao inequality). Let $X=(X_1,\ldots,X_n)$ have joint density or mass function $f(x\mid \theta)$. Let $W(X)$ be an estimator of $g(\theta)$ satisfying the usual regularity conditions that allow differentiation under the integral sign. Then \[\operatorname{Var}_\theta(W) \ge \frac{\left(\dfrac{d}{d\theta}\mathbb{E}_\theta[W]\right)^2} {\mathbb{E}_\theta\left[\left(\dfrac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right]}.\] If $W$ is unbiased for $g(\theta)$, then $\mathbb{E}_\theta[W]=g(\theta)$ and \[\operatorname{Var}_\theta(W) \ge \frac{\left(g'(\theta)\right)^2}{I_X(\theta)},\] where \[I_X(\theta)=\mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right]\] is the Fisher information in the sample.

Proof

Proof idea. Let \[U_\theta(X)=\frac{\partial}{\partial\theta}\log f(X\mid \theta)\] be the score function. Under regularity conditions, \[\mathbb{E}_\theta[U_\theta(X)]=0.\] Also, \[\begin{aligned} \frac{d}{d\theta}\mathbb{E}_\theta[W] &=\frac{d}{d\theta}\int W(x)f(x\mid \theta)\,dx\\ &=\int W(x)\frac{\partial}{\partial\theta}f(x\mid\theta)\,dx\\ &=\int W(x)\left(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right)f(x\mid\theta)\,dx\\ &=\mathbb{E}_\theta[W U_\theta(X)]\\ &=\operatorname{Cov}_\theta(W,U_\theta(X)). \end{aligned}\] By Cauchy-Schwarz, \[\left(\frac{d}{d\theta}\mathbb{E}_\theta[W]\right)^2 \le \operatorname{Var}_\theta(W)\operatorname{Var}_\theta(U_\theta(X)).\] Since $\operatorname{Var}_\theta(U_\theta)=\mathbb{E}_\theta[U_\theta^2]=I_X(\theta)$, rearranging gives the result. ◻

19.2 Equality condition

The equality condition explains when the lower bound is attained.

Theorem

Theorem 14 (Equality condition). Let $W$ be an unbiased estimator of $g(\theta)$ satisfying the conditions of Theorem 13. Then $W$ attains the Cramer-Rao lower bound if and only if there exists a function $h(\theta)$ such that \[h(\theta)\{W(x)-g(\theta)\} =\frac{\partial}{\partial\theta}\log L(\theta\mid x),\] where $L(\theta\mid x)$ is the likelihood function.

This condition is useful because if an unbiased estimator attains the Cramer-Rao lower bound, then it is a best unbiased estimator.

19.3 IID form and information number

For independent samples, the Fisher information adds.

Corollary

Corollary 15 (IID sample). Suppose $X_1,\ldots,X_n$ are iid with density or mass function $f(x\mid\theta)$. Then \[I_X(\theta) =n I_1(\theta), \qquad I_1(\theta)=\mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right].\] Hence \[\operatorname{Var}_\theta(W) \ge \frac{\left(\dfrac{d}{d\theta}\mathbb{E}_\theta[W]\right)^2} {nI_1(\theta)}.\]

Lemma

Lemma 16 (Computing Fisher information). Under suitable regularity conditions, often satisfied for exponential-family models, \[I_1(\theta) =\mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right] =-\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\log f(X\mid\theta)\right].\]

19.4 Poisson conclusion

We now return to the Poisson example and prove that $\bar X$ is best unbiased.

Example

Example 17 (Poisson UMVUE). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)$. Show that $\bar X$ is the best unbiased estimator of $\lambda$.

Solution

For one observation, \[f(x\mid\lambda)=e^{-\lambda}\frac{\lambda^x}{x!},\] so \[\log f(x\mid\lambda)=-\lambda+x\log\lambda-\log(x!).\] Then \[\frac{\partial^2}{\partial\lambda^2}\log f(x\mid\lambda) =-\frac{x}{\lambda^2}.\] Therefore \[I_1(\lambda) =-\mathbb{E}_\lambda\left[-\frac{X}{\lambda^2}\right] =\frac{\lambda}{\lambda^2} =\frac1\lambda.\] For $n$ iid observations, \[I_X(\lambda)=\frac{n}{\lambda}.\] If $W$ is unbiased for $\lambda$, then $g(\lambda)=\lambda$ and $g'(\lambda)=1$. The Cramer-Rao lower bound gives \[\operatorname{Var}_\lambda(W)\ge \frac{1}{n/\lambda}=\frac{\lambda}{n}.\] But \[\mathbb{E}_\lambda(\bar X)=\lambda, \qquad \operatorname{Var}_\lambda(\bar X)=\frac{\lambda}{n}.\] Thus $\bar X$ is unbiased and attains the lower bound. Hence $\bar X$ is the best unbiased estimator of $\lambda$.
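The Fisher information in this example can be computed numerically in both of its forms, as the expected squared score and as the negative expected second derivative. The sketch below uses a truncated sum over the Poisson distribution; the truncation point, the values $\lambda=3$ and $n=25$, and the function names are assumptions for illustration.

```python
from math import exp

def poisson_pmf(lam, kmax=200):
    """Poisson(lam) probabilities for k = 0..kmax, built recursively."""
    probs = [exp(-lam)]
    for k in range(1, kmax + 1):
        probs.append(probs[-1] * lam / k)
    return probs

def fisher_info_score_squared(lam):
    """E[(d/d lam log f)^2], using the score x / lam - 1."""
    return sum((k / lam - 1) ** 2 * q for k, q in enumerate(poisson_pmf(lam)))

def fisher_info_neg_hessian(lam):
    """-E[d^2/d lam^2 log f], using the second derivative -x / lam^2."""
    return sum((k / lam**2) * q for k, q in enumerate(poisson_pmf(lam)))

lam, n = 3.0, 25
i1 = fisher_info_score_squared(lam)
crlb = 1 / (n * i1)          # Cramer-Rao bound for unbiased estimators of lam
print(i1, fisher_info_neg_hessian(lam), 1 / lam)
print(crlb, lam / n)         # the bound next to Var(X bar)
```

Both forms should agree with $1/\lambda$, and the bound should agree with $\operatorname{Var}_\lambda(\bar X)=\lambda/n$ up to truncation and rounding error.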

19.5 Normal variance example

The Cramer-Rao bound is a lower bound, but not every unbiased estimator attains it.

Example

Example 18 (Normal variance). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$. Consider estimating $\sigma^2$. If we use the parameter $\theta=\sigma^2$, then the normal log-density for one observation is \[\log f(x\mid \mu,\theta) =-\frac12\log(2\pi\theta)-\frac{(x-\mu)^2}{2\theta}.\] When $\mu$ is treated as known in this direct one-parameter calculation, \[-\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log f(X\mid\mu,\theta)\right] =\frac{1}{2\theta^2} =\frac{1}{2\sigma^4}.\] Thus the Cramer-Rao lower bound for unbiased estimators of $\sigma^2$ is \[\operatorname{Var}(W)\ge \frac{2\sigma^4}{n}.\]

Solution

Compute derivatives: \[\frac{\partial}{\partial\theta}\log f(x\mid\mu,\theta) =-\frac1{2\theta}+\frac{(x-\mu)^2}{2\theta^2},\] \[\frac{\partial^2}{\partial\theta^2}\log f(x\mid\mu,\theta) =\frac1{2\theta^2}-\frac{(x-\mu)^2}{\theta^3}.\] Since $\mathbb{E}[(X-\mu)^2]=\theta$, \[-\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log f(X\mid\mu,\theta)\right] =-\left(\frac1{2\theta^2}-\frac{\theta}{\theta^3}\right) =\frac1{2\theta^2}.\] Thus $I_X(\theta)=n/(2\theta^2)$, so \[\operatorname{Var}(W)\ge \frac{1}{n/(2\theta^2)}=\frac{2\theta^2}{n}=\frac{2\sigma^4}{n}.\] The usual unbiased sample variance satisfies \[\operatorname{Var}(S^2)=\frac{2\sigma^4}{n-1},\] which is strictly larger than $2\sigma^4/n$. Therefore $S^2$ does not attain this lower bound, and the Cramer-Rao inequality alone cannot show that $S^2$ is best unbiased.

Remark

Remark 19. When nuisance parameters are present, such as unknown $\mu$ while estimating $\sigma^2$, one must be careful with the precise form of Fisher information. The lecture emphasizes the main message: the Cramer-Rao lower bound is a benchmark, and not every natural unbiased estimator attains it.

20 Sufficiency and Unbiasedness

Sufficient statistics can improve estimators by conditioning away irrelevant randomness.

20.1 Rao-Blackwell theorem

The Rao-Blackwell theorem is a systematic estimator-improvement principle.

Theorem

Theorem 20 (Rao-Blackwell theorem). Let $X_1,\ldots,X_n$ be a sample from a population distribution with density or mass function $f(x\mid\theta)$. Let $W(X)$ be an unbiased estimator of $g(\theta)$, and let $T(X)$ be a sufficient statistic for $\theta$. Define \[\phi(T)=\mathbb{E}[W(X)\mid T].\] Then \[\mathbb{E}_\theta[\phi(T)]=g(\theta)\] and \[\operatorname{Var}_\theta(\phi(T))\le \operatorname{Var}_\theta(W)\] for every $\theta$.

Proof

Proof. For unbiasedness, use the law of total expectation: \[\mathbb{E}_\theta[\phi(T)] =\mathbb{E}_\theta\bigl[\mathbb{E}_\theta(W\mid T)\bigr] =\mathbb{E}_\theta(W)=g(\theta).\] For variance, use the law of total variance: \[\operatorname{Var}_\theta(W) =\operatorname{Var}_\theta\bigl(\mathbb{E}_\theta(W\mid T)\bigr) +\mathbb{E}_\theta\bigl[\operatorname{Var}_\theta(W\mid T)\bigr].\] Since the second term is nonnegative, \[\operatorname{Var}_\theta(W)\ge \operatorname{Var}_\theta(\phi(T)).\] Thus $\phi(T)$ is uniformly at least as good as $W$ among unbiased estimators. ◻

Key idea

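Rao-Blackwellization replaces an estimator by its conditional expectation given a sufficient statistic. Sufficiency guarantees that this conditional expectation does not depend on $\theta$, so it is a genuine estimator. The operation preserves unbiasedness and never increases variance.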
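A small exact computation illustrates the theorem. The sketch below takes Bernoulli data with the crude unbiased estimator $W=X_1$ and the sufficient statistic $S=\sum_i X_i$, and computes $\mathbb{E}[W\mid S=s]$ by enumerating all $2^n$ samples. The choice of $W$, the values $n=5$ and $p=0.3$, and the function name are assumptions made for this example; the code requires $0<p<1$.

```python
from itertools import product

def rao_blackwell_demo(n, p):
    """Rao-Blackwellize W = X_1 given S = sum(X) by enumerating Bernoulli(p) samples."""
    prob_s = [0.0] * (n + 1)     # P(S = s)
    w_mass = [0.0] * (n + 1)     # E[W * 1{S = s}]
    for x in product((0, 1), repeat=n):
        s = sum(x)
        prob = p**s * (1 - p) ** (n - s)
        prob_s[s] += prob
        w_mass[s] += x[0] * prob
    phi = [w_mass[s] / prob_s[s] for s in range(n + 1)]   # E[W | S = s]
    mean_phi = sum(phi[s] * prob_s[s] for s in range(n + 1))
    var_phi = sum((phi[s] - mean_phi) ** 2 * prob_s[s] for s in range(n + 1))
    var_w = p * (1 - p)          # Var(X_1)
    return phi, mean_phi, var_phi, var_w

phi, mean_phi, var_phi, var_w = rao_blackwell_demo(n=5, p=0.3)
print(phi)                       # matches s / n for s = 0..n, i.e. the sample mean
print(mean_phi, var_phi, var_w)
```

The conditional expectation recovers the sample proportion $S/n$, whose variance $p(1-p)/n$ is smaller than $\operatorname{Var}(X_1)=p(1-p)$ by a factor of $n$.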

20.2 Lehmann-Scheffe theorem

Completeness turns the Rao-Blackwell estimator into the unique best unbiased estimator.

Definition

Definition 21 (Complete sufficient statistic). A statistic $T$ is complete for a family of distributions if, for every function $h$, \[\mathbb{E}_\theta[h(T)]=0 \quad \text{for all } \theta \quad \Longrightarrow \quad h(T)=0 \text{ almost surely for all } \theta.\] If $T$ is both sufficient and complete, it is called a complete sufficient statistic.

Theorem

Theorem 22 (Lehmann-Scheffe theorem). Let $T(X)$ be a complete sufficient statistic for $\theta$. If $\phi(T)$ is unbiased for $g(\theta)$, then $\phi(T)$ is the unique best unbiased estimator of $g(\theta)$.

Remark

Remark 23. A related characterization says that an unbiased estimator is best unbiased if and only if it is uncorrelated with every unbiased estimator of zero.

20.3 Common UMVUEs

The following table summarizes common best unbiased estimators discussed in the course.

Parameter	Distribution / model	UMVUE
Mean $\mu$	$\operatorname{Normal}(\mu,\sigma^2)$, known $\sigma^2$	$\bar X$
Mean $\mu$	$\operatorname{Normal}(\mu,\sigma^2)$, unknown $\sigma^2$	$\bar X$
Variance $\sigma^2$	$\operatorname{Normal}(\mu,\sigma^2)$, unknown $\mu$	$\displaystyle \frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2$
Parameter $p$	Bernoulli / binomial	sample proportion $\widehat p$
Parameter $\lambda$	Poisson	$\bar X$
Parameter $\theta$	$\operatorname{Uniform}[0,\theta]$	$\displaystyle \frac{n+1}{n}\max_i X_i$

Example

Example 24 (Uniform UMVUE). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Uniform}(0,\theta)$. Show that \[\widehat\theta=\frac{n+1}{n}X_{(n)}\] is unbiased for $\theta$, where $X_{(n)}=\max_i X_i$.

Solution

The CDF of $X_{(n)}$ is \[\mathbb{P}(X_{(n)}\le x)=\left(\frac{x}{\theta}\right)^n, \qquad 0\le x\le \theta.\] Thus the density is \[f_{X_{(n)}}(x)=\frac{n x^{n-1}}{\theta^n}, \qquad 0\le x\le \theta.\] Then \[\mathbb{E}[X_{(n)}] =\int_0^\theta x\frac{n x^{n-1}}{\theta^n}\,dx =\frac{n}{\theta^n}\cdot \frac{\theta^{n+1}}{n+1} =\frac{n}{n+1}\theta.\] Therefore \[\mathbb{E}\left[\frac{n+1}{n}X_{(n)}\right]=\theta.\]
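The unbiasedness can be checked by simulation. The sketch below is a Monte Carlo approximation with assumed values $n=5$ and $\theta=2$; the seed, replication count, and function names are illustrative. For comparison it also simulates $2\bar X$, another unbiased estimator of $\theta$ that is not discussed in this section and is included only as a reference point.

```python
import random

def sample_mean(values):
    return sum(values) / len(values)

def sample_var(values):
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def simulate_uniform_estimators(n, theta, reps, seed=0):
    """Simulate (n+1)/n * max(X) and 2 * mean(X) for Uniform(0, theta) samples."""
    rng = random.Random(seed)
    scaled_max, twice_mean = [], []
    for _ in range(reps):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        scaled_max.append((n + 1) / n * max(sample))
        twice_mean.append(2 * sum(sample) / n)   # comparison estimator (assumption)
    return scaled_max, twice_mean

scaled_max, twice_mean = simulate_uniform_estimators(n=5, theta=2.0, reps=20000)
print(sample_mean(scaled_max), sample_var(scaled_max))
print(sample_mean(twice_mean), sample_var(twice_mean))
```

Both simulated means should be close to $\theta$, while the estimator based on $X_{(n)}$ should show the smaller variance.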

21 Loss Functions and Risk

Decision theory evaluates estimators by the loss incurred after making a decision.

21.1 Loss functions

A loss function measures the cost of estimating $\theta$ by an action $a$.

Definition

Definition 25 (Loss function). Let $\Theta$ be the parameter space and $\mathcal A$ be the action space. A loss function is a function \[L:\Theta\times\mathcal A\to \mathbb{R}\] where $L(\theta,a)$ measures the cost of taking action $a$ when the true parameter is $\theta$.

For point estimation, the action space is often $\mathcal A=\Theta$. Common losses include:

absolute error loss: \[L(\theta,a)=|a-\theta|;\]
squared error loss: \[L(\theta,a)=(a-\theta)^2;\]
zero-one loss for exact decisions: \[L(\theta,a)= \begin{cases} 0, & a=\theta,\\ 1, & a\ne \theta. \end{cases}\]

21.2 Risk function

The risk is the expected loss when a decision rule is used.

Definition

Definition 26 (Decision rule and risk). A decision rule is a function \[\delta:\mathcal X\to \mathcal A\] that selects an action based on the observed data. Its risk function is \[R(\theta,\delta)=\mathbb{E}_\theta[L(\theta,\delta(X))].\]

For squared error loss, \[R(\theta,\delta)=\mathbb{E}_\theta[(\delta(X)-\theta)^2]=\operatorname{MSE}_\theta(\delta).\] Thus MSE is a special case of risk.

Key idea

Risk comparison. Given two estimators $\delta_1$ and $\delta_2$, if \[R(\theta,\delta_1)<R(\theta,\delta_2)\] for all $\theta$, then $\delta_1$ is uniformly better under the chosen loss.

22 Risk Examples

The risk function can lead to conclusions different from unbiasedness.

22.1 Bernoulli shrinkage estimator

The Bernoulli example illustrates how shrinkage changes risk.

Example

Example 27 (Bernoulli risk comparison). Suppose $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)$. Compare \[\bar X=\frac1n\sum_{i=1}^n X_i\] with the shrinkage estimator \[\widehat p_B=\frac{\sum_{i=1}^n X_i+\sqrt{n/4}}{n+\sqrt n}.\] Use squared error loss.

Solution

Let $S=\sum_{i=1}^n X_i$. Then \[\operatorname{MSE}_p(\bar X)=\frac{p(1-p)}{n}.\] For \[\widehat p_B=\frac{S+c}{n+d}, \qquad c=\sqrt{n/4}=\frac{\sqrt n}{2},\quad d=\sqrt n,\] we have \[\mathbb{E}_p(\widehat p_B)=\frac{np+c}{n+d}, \qquad \operatorname{Var}_p(\widehat p_B)=\frac{np(1-p)}{(n+d)^2}.\] Thus \[\operatorname{MSE}_p(\widehat p_B) =\frac{np(1-p)}{(n+d)^2} +\left(\frac{np+c}{n+d}-p\right)^2.\] Since $d=\sqrt n$ and $c=\sqrt n/2$, \[\frac{np+c}{n+d}-p =\frac{\sqrt n(1/2-p)}{n+\sqrt n}.\] Therefore \[\operatorname{MSE}_p(\widehat p_B) =\frac{np(1-p)}{(n+\sqrt n)^2} +\frac{n(1/2-p)^2}{(n+\sqrt n)^2} =\frac{n}{4(n+\sqrt n)^2},\] because $p(1-p)+(1/2-p)^2=1/4$. This estimator is the posterior mean under a $\operatorname{Beta}(\sqrt n/2,\sqrt n/2)$ prior, and it shrinks $\bar X$ toward $1/2$. Its risk is constant in $p$: it is smaller than the risk of $\bar X$ when $p$ is near $1/2$ and larger when $p$ is near $0$ or $1$. The lesson is that the preferred estimator depends on the loss function and the values of $p$ considered important.

Remark

Remark 28. The lecture slide displays risk curves for $n=4$ and $n=400$. The qualitative message is that shrinkage estimators can improve risk in small samples or near the shrinkage target, while the sample proportion becomes very strong as $n$ grows.

22.2 Normal variance: choosing a constant multiple of $S^2$

This example shows how MSE can favor a biased multiple of the unbiased sample variance.

Example

Example 29 (Normal variance risk). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$ and consider estimators of the form \[\delta_b(X)=bS^2, \qquad S^2=\frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2.\] Find the value of $b$ that minimizes MSE for estimating $\sigma^2$.

Solution

We know \[\mathbb{E}(S^2)=\sigma^2, \qquad \operatorname{Var}(S^2)=\frac{2\sigma^4}{n-1}.\] For $\delta_b=bS^2$, \[\mathbb{E}(\delta_b)=b\sigma^2, \qquad \operatorname{Var}(\delta_b)=b^2\frac{2\sigma^4}{n-1}.\] Thus \[\begin{aligned} R(\sigma^2,\delta_b) &=\operatorname{MSE}(\delta_b)\\ &=b^2\frac{2\sigma^4}{n-1}+(b\sigma^2-\sigma^2)^2\\ &=\sigma^4\left(\frac{2b^2}{n-1}+(b-1)^2\right). \end{aligned}\] Differentiate with respect to $b$: \[\frac{d}{db}\left(\frac{2b^2}{n-1}+(b-1)^2\right) =\frac{4b}{n-1}+2(b-1).\] Set equal to zero: \[\frac{4b}{n-1}+2b-2=0.\] Hence \[b\left(\frac{4}{n-1}+2\right)=2, \qquad b=\frac{n-1}{n+1}.\] Therefore the MSE-minimizing estimator in this class is \[\widetilde S^2 =\frac{n-1}{n+1}S^2 =\frac1{n+1}\sum_{i=1}^n (X_i-\bar X)^2.\]

Warning

Note. The minimizing constant is $b=(n-1)/(n+1)$, which differs from both $b=1$ (the unbiased estimator $S^2$) and $b=(n-1)/n$ (the MLE). The resulting estimator is biased, but it has the smallest MSE within the class $bS^2$.

22.3 Risk curves for variance estimators

The three common estimators of $\sigma^2$ are \[S^2=\frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2,\] \[\widehat\sigma^2_{\mathrm{MLE}}=\frac1n\sum_{i=1}^n (X_i-\bar X)^2=\frac{n-1}{n}S^2,\] \[\widetilde S^2=\frac1{n+1}\sum_{i=1}^n (X_i-\bar X)^2=\frac{n-1}{n+1}S^2.\] Under squared error loss, their risks are proportional to $\sigma^4$: \[R(\sigma^2,S^2)=\frac{2\sigma^4}{n-1},\] \[R(\sigma^2,\widehat\sigma^2_{\mathrm{MLE}})=\frac{(2n-1)\sigma^4}{n^2},\] \[R(\sigma^2,\widetilde S^2)=\sigma^4\left(\frac{2}{n-1}\left(\frac{n-1}{n+1}\right)^2+\frac{4}{(n+1)^2}\right)=\frac{2\sigma^4}{n+1}.\] For every $n\ge 2$, \[\frac{2}{n+1}<\frac{2n-1}{n^2}<\frac{2}{n-1},\] so the MSE-minimizing constant multiple $\widetilde S^2$ has the smallest risk, the MLE is second, and $S^2$ has the largest risk among these three estimators of the form $bS^2$.
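These risks can be tabulated exactly with rational arithmetic. The sketch below evaluates the risk of $bS^2$ divided by $\sigma^4$ and confirms the minimizer by a grid search; the value $n=10$, the grid, and the function name are assumptions chosen for illustration.

```python
from fractions import Fraction

def risk_multiple_of_s2(b, n):
    """Squared-error risk of b * S^2 divided by sigma^4: 2 b^2 / (n - 1) + (b - 1)^2."""
    b = Fraction(b)
    return 2 * b**2 / (n - 1) + (b - 1) ** 2

n = 10
risks = {
    "S^2, b = 1": risk_multiple_of_s2(1, n),
    "MLE, b = (n-1)/n": risk_multiple_of_s2(Fraction(n - 1, n), n),
    "b = (n-1)/(n+1)": risk_multiple_of_s2(Fraction(n - 1, n + 1), n),
}
for name, r in risks.items():
    print(name, r, float(r))

# Grid search over b in [0, 2]; the grid is chosen so that 9/11 is one of its points.
grid = [Fraction(k, 1100) for k in range(0, 2201)]
best_b = min(grid, key=lambda b: risk_multiple_of_s2(b, n))
print(best_b)
```

Using `Fraction` keeps the comparison exact, so the ordering of the three risks does not depend on floating-point rounding.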

22.4 Different loss functions: Stein’s loss

Changing the loss function can change which estimator is preferred.

Definition

Definition 30 (Stein’s loss for variance estimation). For estimating $\sigma^2$ by an action $a>0$, Stein’s loss is \[L(\sigma^2,a)=\frac{a}{\sigma^2}-1-\log\left(\frac{a}{\sigma^2}\right).\]

Example

Example 31 (Stein’s loss for $bS^2$). For estimators $\delta_b(X)=bS^2$, compute the risk under Stein’s loss up to terms independent of $b$.

Solution

The risk is \[\begin{aligned} R(\sigma^2,\delta_b) &=\mathbb{E}\left[\frac{bS^2}{\sigma^2}-1-\log\left(\frac{bS^2}{\sigma^2}\right)\right]\\ &=b\mathbb{E}\left[\frac{S^2}{\sigma^2}\right]-1-\log b-\mathbb{E}\left[\log\left(\frac{S^2}{\sigma^2}\right)\right]. \end{aligned}\] Since $\mathbb{E}[S^2/\sigma^2]=1$, \[R(\sigma^2,\delta_b)=b-\log b-1-C,\] where \[C=\mathbb{E}\left[\log\left(\frac{S^2}{\sigma^2}\right)\right]\] does not depend on $b$. Differentiating $b-\log b$ gives \[1-\frac1b=0,\] so $b=1$. Under Stein’s loss, the best choice in this class is $\delta_b=S^2$.
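The contrast with squared error loss can be seen numerically. The sketch below evaluates the part of the Stein risk that depends on $b$, namely $b-1-\log b$, on a grid and compares $b=1$ with the squared-error optimum $b=(n-1)/(n+1)$ from Example 29. The grid, the value $n=10$, and the function name are assumptions for illustration.

```python
from math import log

def stein_risk_up_to_constant(b):
    """b - 1 - log(b): the Stein risk of b * S^2 with the b-free constant C dropped."""
    return b - 1 - log(b)

grid = [k / 100 for k in range(1, 301)]          # b from 0.01 to 3.00
best_b = min(grid, key=stein_risk_up_to_constant)
print(best_b)                                    # grid minimizer

n = 10
b_squared_error = (n - 1) / (n + 1)              # optimum under squared error loss
print(stein_risk_up_to_constant(1.0), stein_risk_up_to_constant(b_squared_error))
```

The grid minimizer is $b=1$, and the multiple that is optimal under squared error loss has strictly larger Stein risk.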

Key idea

Important lesson. The word “best” depends on the loss function. Under squared error loss, a biased multiple of $S^2$ can be preferred; under Stein’s loss, $S^2$ is preferred within the class $bS^2$.

23 Bayes Risk

Bayes risk averages the frequentist risk over a prior distribution on the parameter.

23.1 Definition

When a prior distribution is specified, we can compare estimators by their average risk under the prior.

Definition

Definition 32 (Bayes risk). Given a prior distribution $\pi(\theta)$, the Bayes risk of a decision rule $\delta$ is \[R_B(\delta)=\int_\Theta R(\theta,\delta)\pi(\theta)\,d\theta.\] Equivalently, \[R_B(\delta)=\int_\Theta\int_{\mathcal X} L(\theta,\delta(x))f(x\mid\theta)\pi(\theta)\,dx\,d\theta.\]

Using Bayes’ rule, this can also be written as \[R_B(\delta) =\int_{\mathcal X}\left[\int_\Theta L(\theta,\delta(x))\pi(\theta\mid x)\,d\theta\right]m(x)\,dx,\] where $\pi(\theta\mid x)$ is the posterior distribution and \[m(x)=\int f(x\mid\theta)\pi(\theta)\,d\theta\] is the marginal distribution of the data.
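As a concrete illustration, the Bayes risk can be approximated by numerical integration. The sketch below assumes Bernoulli data, squared error loss, and a uniform prior on $p$ (that is, $\operatorname{Beta}(1,1)$), and compares the sample proportion with the posterior mean $(S+1)/(n+2)$. The model, the prior, the value $n=10$, the midpoint rule, and the function names are all assumptions chosen for this example.

```python
def risk_sample_proportion(p, n):
    """Squared-error risk of S / n."""
    return p * (1 - p) / n

def risk_posterior_mean(p, n, alpha=1.0, beta=1.0):
    """Squared-error risk of (alpha + S) / (alpha + beta + n)."""
    denom = alpha + beta + n
    return (n * p * (1 - p) + (alpha - p * (alpha + beta)) ** 2) / denom**2

def bayes_risk_uniform_prior(risk, n, grid_size=10000):
    """Midpoint-rule approximation of the integral of risk(p, n) over (0, 1), prior density 1."""
    h = 1.0 / grid_size
    return sum(risk((i + 0.5) * h, n) for i in range(grid_size)) * h

n = 10
print(bayes_risk_uniform_prior(risk_sample_proportion, n))   # close to 1 / (6 n)
print(bayes_risk_uniform_prior(risk_posterior_mean, n))      # close to 1 / (6 (n + 2))
```

Under this prior the posterior mean has the smaller average risk, even though its frequentist risk is larger than that of the sample proportion for $p$ near $0$ or $1$.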

23.2 Posterior expected loss

For a fixed observed data value $x$, the posterior expected loss is \[\int_\Theta L(\theta,a)\pi(\theta\mid x)\,d\theta.\] A Bayes estimator chooses the action $a$ minimizing this posterior expected loss.

Example

Example 33 (Bayes estimator under squared error loss). Suppose $L(\theta,a)=(a-\theta)^2$. Show that the Bayes estimator is the posterior mean.

Solution

For fixed data $x$, minimize \[\mathbb{E}[(a-\theta)^2\mid x]\] with respect to $a$. Expand: \[\mathbb{E}[(a-\theta)^2\mid x] =a^2-2a\mathbb{E}[\theta\mid x]+\mathbb{E}[\theta^2\mid x].\] Differentiating with respect to $a$ gives \[2a-2\mathbb{E}[\theta\mid x]=0.\] Thus \[a=\mathbb{E}[\theta\mid x].\] So the Bayes estimator under squared error loss is the posterior mean.

Example

Example 34 (Bayes estimator under absolute error loss). Suppose $L(\theta,a)=|a-\theta|$. The Bayes estimator is a posterior median.

Solution

For fixed data $x$, minimize \[\mathbb{E}[|a-\theta|\mid x] =\int |a-\theta|\pi(\theta\mid x)\,d\theta.\] A standard derivative/subgradient argument shows that a minimizer satisfies \[\mathbb{P}(\theta\le a\mid x)\ge \frac12, \qquad \mathbb{P}(\theta\ge a\mid x)\ge \frac12.\] Thus any posterior median is a Bayes estimator under absolute error loss.
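Both results can be illustrated by brute force on a discretized posterior. The sketch below approximates a $\operatorname{Beta}(4,8)$ posterior on a grid, searches the grid for the action minimizing the posterior expected loss, and compares the result with the posterior mean and median. The posterior, the grid size, and the function names are assumptions chosen for the example, and all results are accurate only up to the grid spacing.

```python
def discretized_posterior(a, b, grid_size=400):
    """Grid approximation of a Beta(a, b) posterior: points and normalized weights."""
    thetas = [(i + 0.5) / grid_size for i in range(grid_size)]
    weights = [t ** (a - 1) * (1 - t) ** (b - 1) for t in thetas]
    total = sum(weights)
    return thetas, [w / total for w in weights]

def bayes_action(thetas, probs, loss):
    """Grid point minimizing the posterior expected loss."""
    def expected_loss(action):
        return sum(loss(t, action) * q for t, q in zip(thetas, probs))
    return min(thetas, key=expected_loss)

thetas, probs = discretized_posterior(4, 8)
post_mean = sum(t * q for t, q in zip(thetas, probs))

# Posterior median on the grid: first point where the cumulative weight reaches 1/2.
cumulative, post_median = 0.0, None
for t, q in zip(thetas, probs):
    cumulative += q
    if cumulative >= 0.5:
        post_median = t
        break

a_squared = bayes_action(thetas, probs, lambda t, a: (a - t) ** 2)
a_absolute = bayes_action(thetas, probs, lambda t, a: abs(a - t))
print(post_mean, a_squared)       # squared error: action near the posterior mean
print(post_median, a_absolute)    # absolute error: action near the posterior median
```

Because this posterior is skewed, the mean and median differ, so the two loss functions lead to visibly different Bayes actions.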

24 Summary of Estimator Criteria

This section closes with a comparison table for the main estimator-evaluation criteria.

Criterion	Definition	Desirable property	Notes
Unbiasedness	$\mathbb{E}[\widehat\theta]=\theta$	Expected value equals the target	Does not guarantee minimum MSE
Bias	$\operatorname{Bias}(\widehat\theta)=\mathbb{E}[\widehat\theta]-\theta$	Bias close to zero	Can be positive or negative
Variance	$\operatorname{Var}(\widehat\theta)$	Smaller is better, especially among unbiased estimators	Measures sampling variability
MSE	$\mathbb{E}[(\widehat\theta-\theta)^2]$	Smaller is better	$\operatorname{MSE}=\operatorname{Var}+\operatorname{Bias}^2$
Efficiency	Variance relative to best possible lower bound	Close to Cramer-Rao lower bound	Usually for unbiased estimators
Consistency	$\widehat\theta_n\xrightarrow{P}\theta$	Converges to target as $n\to\infty$	Large-sample property
Sufficiency	Statistic captures all information about parameter	No information loss about $\theta$	Linked to factorization theorem and Rao-Blackwell
Robustness	Stability under small departures from assumptions	Less sensitive to outliers/model error	Important in applications

25 Practice Problems

The practice problems below reinforce the main methods for evaluating estimators.

Practice Problem

Practice Problem 35 (Bias-variance decomposition). Let $W$ be an estimator of $\theta$. Prove that \[\mathbb{E}_\theta[(W-\theta)^2]=\operatorname{Var}_\theta(W)+\operatorname{Bias}_\theta(W)^2.\]

Solution

Use \[W-\theta=(W-\mathbb{E}W)+(\mathbb{E}W-\theta).\] Squaring and taking expectations gives \[\mathbb{E}[(W-\theta)^2] =\mathbb{E}[(W-\mathbb{E}W)^2]+2(\mathbb{E}W-\theta)\mathbb{E}[W-\mathbb{E}W]+(\mathbb{E}W-\theta)^2.\] The middle term is zero, so \[\mathbb{E}[(W-\theta)^2]=\operatorname{Var}(W)+\operatorname{Bias}(W)^2.\]

Practice Problem

Practice Problem 36 (MSE for a biased variance estimator). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$ and let \[\widehat\sigma^2_c=cS^2.\] Find $\operatorname{MSE}(\widehat\sigma^2_c)$ as a function of $c$.

Solution

Since $\mathbb{E}(S^2)=\sigma^2$ and $\operatorname{Var}(S^2)=2\sigma^4/(n-1)$, \[\mathbb{E}(cS^2)=c\sigma^2, \qquad \operatorname{Var}(cS^2)=c^2\frac{2\sigma^4}{n-1}.\] Therefore \[\operatorname{MSE}(cS^2) =c^2\frac{2\sigma^4}{n-1}+(c\sigma^2-\sigma^2)^2 =\sigma^4\left(\frac{2c^2}{n-1}+(c-1)^2\right).\]
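As a quick sanity check, the closed form can be evaluated at the two familiar constants: $c=1$ (the unbiased $S^2$) and $c=(n-1)/n$ (the normal-variance MLE). This is a minimal sketch; the function name `mse_scaled_s2` and the choice $n=10$, $\sigma^2=1$ are assumptions made for illustration.

```python
from fractions import Fraction


def mse_scaled_s2(c, n, sigma2=Fraction(1)):
    """MSE of c*S^2 for normal data: sigma^4 * (2c^2/(n-1) + (c-1)^2)."""
    c = Fraction(c)
    return sigma2 ** 2 * (2 * c ** 2 / (n - 1) + (c - 1) ** 2)


n = 10
unbiased = mse_scaled_s2(1, n)               # c = 1 recovers S^2
mle = mse_scaled_s2(Fraction(n - 1, n), n)   # c = (n-1)/n recovers the MLE

print("MSE(S^2):", unbiased)   # should equal 2/(n-1) when sigma^2 = 1
print("MSE(MLE):", mle)        # should equal (2n-1)/n^2 when sigma^2 = 1
```

With $\sigma^2=1$ these values should match $2/(n-1)$ and $(2n-1)/n^2$, the expressions obtained for $S^2$ and the MLE in the normal-variance examples.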

Practice Problem

Practice Problem 37 (Optimal constant multiple). Using the previous problem, find the value of $c$ that minimizes $\operatorname{MSE}(cS^2)$.

Solution

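Since $\sigma^4>0$ does not depend on $c$, it suffices to minimize \[h(c)=\frac{2c^2}{n-1}+(c-1)^2.\] Differentiate: \[h'(c)=\frac{4c}{n-1}+2(c-1).\] Set $h'(c)=0$: \[\frac{4c}{n-1}+2c-2=0.\] Thus \[c\left(\frac{4}{n-1}+2\right)=2, \qquad c=\frac{n-1}{n+1}.\] Because $h''(c)=\frac{4}{n-1}+2>0$, this critical point is the unique global minimizer.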
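The minimizer can also be confirmed numerically by comparing $h$ at $c=(n-1)/(n+1)$ with its values on a grid. The sketch below assumes $n=10$ and a grid of step $0.01$ on $[0,2]$; both are arbitrary illustrative choices.

```python
from fractions import Fraction


def h(c, n):
    """Scaled MSE of c*S^2: h(c) = 2c^2/(n-1) + (c-1)^2."""
    return 2 * c ** 2 / (n - 1) + (c - 1) ** 2


n = 10
c_star = Fraction(n - 1, n + 1)

# Compare h at c_star against a grid of other constants in [0, 2].
grid = [Fraction(k, 100) for k in range(0, 201)]
best_on_grid = min(grid, key=lambda c: h(c, n))

print("c* =", c_star, " h(c*) =", h(c_star, n))
print("best grid point =", best_on_grid)
```

Substituting $c=(n-1)/(n+1)$ into $h$ gives the minimum value $2/(n+1)$, so the smallest attainable MSE in the class $cS^2$ is $2\sigma^4/(n+1)$.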

Practice Problem

Practice Problem 38 (Cramer-Rao for Bernoulli). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)$. Find the Fisher information and the Cramer-Rao lower bound for unbiased estimators of $p$.

Solution

For one observation, \[f(x\mid p)=p^x(1-p)^{1-x}, \qquad x\in\{0,1\}.\] The log-likelihood for one observation is \[\ell(p)=X\log p+(1-X)\log(1-p).\] Then \[\ell'(p)=\frac{X}{p}-\frac{1-X}{1-p} =\frac{X-p}{p(1-p)}.\] Thus \[I_1(p)=\mathbb{E}\left[\left(\frac{X-p}{p(1-p)}\right)^2\right] =\frac{\operatorname{Var}(X)}{p^2(1-p)^2} =\frac{p(1-p)}{p^2(1-p)^2} =\frac1{p(1-p)}.\] For $n$ iid observations, \[I_n(p)=\frac{n}{p(1-p)}.\] For unbiased estimators of $p$, \[\operatorname{Var}(W)\ge \frac{1}{I_n(p)}=\frac{p(1-p)}{n}.\] Since $\bar X$ is unbiased and has variance $p(1-p)/n$, it attains the lower bound.
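Both claims can be verified exactly for a specific case. The following sketch computes $I_1(p)$ directly from the definition as an expectation over $x\in\{0,1\}$, and computes $\operatorname{Var}(\bar X)$ by summing over the binomial distribution of $S=\sum X_i$. The values $p=3/10$ and $n=5$ and the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def fisher_info_bernoulli(p):
    """I_1(p) = E[(score)^2], summing over x in {0, 1}."""
    total = 0
    for x in (0, 1):
        prob = p if x == 1 else 1 - p
        score = (x - p) / (p * (1 - p))
        total += prob * score ** 2
    return total


def var_sample_mean(n, p):
    """Exact Var(X-bar), summing over S ~ Binomial(n, p)."""
    pmf = [comb(n, s) * p ** s * (1 - p) ** (n - s) for s in range(n + 1)]
    mean = sum(q * Fraction(s, n) for s, q in enumerate(pmf))
    return sum(q * (Fraction(s, n) - mean) ** 2 for s, q in enumerate(pmf))


p, n = Fraction(3, 10), 5
info = fisher_info_bernoulli(p)
crlb = 1 / (n * info)

print("I_1(p) =", info, " CRLB =", crlb, " Var(X-bar) =", var_sample_mean(n, p))
```

The lower bound and the variance of $\bar X$ coincide, which is the numerical counterpart of the statement that $\bar X$ attains the Cramer-Rao bound.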

Practice Problem

Practice Problem 39 (Rao-Blackwell improvement). Suppose $W$ is unbiased for $g(\theta)$ and $T$ is sufficient. Let $\phi(T)=\mathbb{E}(W\mid T)$. Prove that $\phi(T)$ is unbiased and has variance no larger than $W$.

Solution

Because $T$ is sufficient, the conditional distribution of $W$ given $T$ does not depend on $\theta$, so $\phi(T)$ is a genuine statistic that can be computed from the data. By the law of total expectation, \[\mathbb{E}[\phi(T)]=\mathbb{E}[\mathbb{E}(W\mid T)]=\mathbb{E}(W)=g(\theta).\] By the law of total variance, \[\operatorname{Var}(W)=\operatorname{Var}(\mathbb{E}(W\mid T))+\mathbb{E}[\operatorname{Var}(W\mid T)] =\operatorname{Var}(\phi(T))+\mathbb{E}[\operatorname{Var}(W\mid T)].\] The second term is nonnegative, so \[\operatorname{Var}(\phi(T))\le \operatorname{Var}(W).\]
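A concrete instance makes the improvement visible. The sketch below assumes a Bernoulli($p$) sample with the crude unbiased estimator $W=X_1$ and the sufficient statistic $T=\sum X_i$; this specific choice, the function name `rao_blackwell_demo`, and the values $n=4$, $p=1/3$ are illustrative assumptions rather than part of the problem. It enumerates all $2^n$ outcomes and computes $\phi(t)=\mathbb{E}(W\mid T=t)$ from the joint distribution.

```python
from fractions import Fraction
from itertools import product


def rao_blackwell_demo(n, p):
    """Return (Var(W), Var(phi(T)), phi) for W = X_1 and T = X_1 + ... + X_n."""
    outcomes = []
    for x in product((0, 1), repeat=n):
        t = sum(x)
        prob = p ** t * (1 - p) ** (n - t)
        outcomes.append((x, t, prob))

    # phi(t) = E[W | T = t], computed from the joint distribution.
    phi = {}
    for t in range(n + 1):
        mass = sum(pr for _, s, pr in outcomes if s == t)
        phi[t] = sum(x[0] * pr for x, s, pr in outcomes if s == t) / mass

    def variance(f):
        mean = sum(pr * f(x, t) for x, t, pr in outcomes)
        return sum(pr * (f(x, t) - mean) ** 2 for x, t, pr in outcomes)

    var_w = variance(lambda x, t: x[0])
    var_phi = variance(lambda x, t: phi[t])
    return var_w, var_phi, phi


var_w, var_phi, phi = rao_blackwell_demo(4, Fraction(1, 3))
print("Var(W) =", var_w, " Var(phi(T)) =", var_phi)
print("phi =", phi)
```

In this model the computed $\phi(t)$ equals $t/n$ whatever value of $p$ is used, which reflects sufficiency, and the variance drops from $p(1-p)$ to $p(1-p)/n$.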

Practice Problem

Practice Problem 40 (Bayes estimator under squared error). Let $\pi(\theta\mid x)$ be a posterior density. Show that the posterior mean minimizes posterior expected squared error loss.

Solution

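Assume the posterior has a finite second moment. For fixed $x$, minimize \[Q(a)=\int (a-\theta)^2\pi(\theta\mid x)\,d\theta.\] Expanding and using $\int \pi(\theta\mid x)\,d\theta=1$, \[Q(a)=a^2-2a\mathbb{E}(\theta\mid x)+\mathbb{E}(\theta^2\mid x).\] Differentiate: \[Q'(a)=2a-2\mathbb{E}(\theta\mid x).\] Since $Q''(a)=2>0$, $Q$ is strictly convex, so the unique minimizer is \[a=\mathbb{E}(\theta\mid x).\]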
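The same conclusion can be checked on a toy posterior. The sketch below replaces the posterior density by a hypothetical three-point discrete posterior (an assumption made so that the computation is exact; the integral becomes a sum) and searches a grid of candidate actions.

```python
from fractions import Fraction


def posterior_expected_sq_loss(a, support, weights):
    """Q(a) = sum over theta of (a - theta)^2 * pi(theta | x)."""
    return sum(w * (a - t) ** 2 for t, w in zip(support, weights))


# Hypothetical discrete posterior on three parameter values.
support = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]
weights = [Fraction(1, 5), Fraction(1, 2), Fraction(3, 10)]

post_mean = sum(w * t for t, w in zip(support, weights))

# Search candidate actions a = 0.00, 0.01, ..., 1.00.
grid = [Fraction(k, 100) for k in range(101)]
best = min(grid, key=lambda a: posterior_expected_sq_loss(a, support, weights))

print("posterior mean =", post_mean, " grid minimizer =", best)
```

For every action $a$ the excess loss satisfies $Q(a)-Q(\mathbb{E}(\theta\mid x))=(a-\mathbb{E}(\theta\mid x))^2$, which is another way to see that the posterior mean is the unique minimizer.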

--- title: "Chapter 13: Point Estimation II — Evaluating Estimators" format: html: toc: true toc-depth: 3 number-sections: true pdf: toc: true number-sections: true execute: warning: false message: false --- This chapter continues point estimation. Chapter 12 focused on how to construct estimators. This chapter focuses on how to evaluate and compare estimators using mean squared error, bias, variance, Fisher information, Cramér--Rao lower bounds, Rao--Blackwellization, UMVUEs, loss functions, risk functions, and Bayes risk. ::: {.callout-note title="Topics"} Evaluating and comparing point estimators; mean squared error; bias and variance; unbiasedness; best unbiased estimators; Cramer-Rao inequality; Fisher information; Rao-Blackwell theorem; Lehmann-Scheffe theorem; loss functions; risk functions; Bayes risk; examples for normal, Bernoulli, binomial, and Poisson models. ::: # Overview: Evaluating and Comparing Estimators This section studies how to choose among different point estimators when several construction methods are available. In Section 12, we discussed ways to *find* estimators: method of moments, maximum likelihood estimation, and Bayesian methods. In this section, the question changes from construction to evaluation. ::: {.callout-tip title="Key idea"} Suppose several estimators are available for the same parameter. Which estimator should we use? ::: Common criteria for a good estimator include: - small mean squared error; - unbiasedness; - small variance among unbiased estimators; - efficiency relative to a lower bound; - use of sufficient statistics; - consistency as the sample size grows; - robustness under small model deviations. The main theme is that there is usually no single universal meaning of "best." The answer depends on the loss function, the parameter space, and whether we care about finite-sample or large-sample performance. 
# Mean Squared Error and Bias Mean squared error is one of the most useful criteria because it combines variance and systematic bias into a single number. ## Mean squared error We begin with the most common risk measure for point estimation. ::: {.callout-note title="Definition"} **Definition 1** (Mean squared error). Let $W=W(X_1,\ldots,X_n)$ be an estimator of a scalar parameter $\theta$. The **mean squared error** of $W$ is $$\operatorname{MSE}_\theta(W)=\mathbb{E}_\theta\bigl[(W-\theta)^2\bigr].$$ ::: The MSE measures the average squared distance between the random estimator and the unknown true parameter. It depends on $\theta$, so it is a function on the parameter space. ## Bias and unbiasedness Bias measures whether the estimator systematically overestimates or underestimates the target. ::: {.callout-note title="Definition"} **Definition 2** (Bias). The **bias** of a point estimator $W$ of $\theta$ is $$\operatorname{Bias}_\theta(W)=\mathbb{E}_\theta[W]-\theta.$$ The estimator $W$ is **unbiased** for $\theta$ if $$\mathbb{E}_\theta[W]=\theta$$ for all possible values of $\theta$. ::: ::: {.callout-important title="Remark"} *Remark 3*. Unbiasedness is a natural criterion, but it is not the only criterion. A biased estimator can have smaller MSE than an unbiased estimator if the reduction in variance is large enough. ::: ## Bias-variance decomposition The key reason MSE is convenient is that it decomposes into variance plus squared bias. ::: {.callout-important title="Theorem"} **Theorem 4** (Bias-variance decomposition). 
*For any estimator $W$ of $\theta$, $$\operatorname{MSE}_\theta(W) =\mathbb{E}_\theta[(W-\theta)^2] =\operatorname{Var}_\theta(W)+\bigl(\mathbb{E}_\theta[W]-\theta\bigr)^2.$$ Equivalently, $$\operatorname{MSE}_\theta(W)=\operatorname{Var}_\theta(W)+\operatorname{Bias}_\theta(W)^2.$$* ::: ::: {.callout-note title="Proof"} *Proof.* Write $$W-\theta=(W-\mathbb{E}_\theta W)+(\mathbb{E}_\theta W-\theta).$$ Then $$\begin{aligned} \mathbb{E}_\theta[(W-\theta)^2] &=\mathbb{E}_\theta\left[(W-\mathbb{E}_\theta W)^2\right] +2(\mathbb{E}_\theta W-\theta)\mathbb{E}_\theta[W-\mathbb{E}_\theta W] +(\mathbb{E}_\theta W-\theta)^2\\ &=\operatorname{Var}_\theta(W)+(\mathbb{E}_\theta W-\theta)^2, \end{aligned}$$ because the middle term is zero. ◻ ::: ::: {.callout-important title="Remark"} *Remark 5*. It is reasonable to consider other errors, such as $\mathbb{E}_\theta|W-\theta|$. The MSE is especially common because it is differentiable and has the bias-variance decomposition. ::: # Examples of MSE These examples show how MSE is computed and how bias can sometimes improve MSE. ## Sample mean and sample variance under a normal model We first recall basic facts about the sample mean and sample variance. ::: {.callout-note title="Example"} **Example 6** (Normal distribution). Let $X_1,\ldots,X_n$ be a random sample from a population with mean $\mu$ and variance $\sigma^2$. Define $$\bar X=\frac1n\sum_{i=1}^n X_i, \qquad S^2=\frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2.$$ Then $$\mathbb{E}(\bar X)=\mu,\qquad \operatorname{Var}(\bar X)=\frac{\sigma^2}{n},\qquad \mathbb{E}(S^2)=\sigma^2.$$ Thus $\bar X$ is unbiased for $\mu$, and $S^2$ is unbiased for $\sigma^2$. ::: ::: {.callout-tip title="Solution"} Since $X_1,\ldots,X_n$ are iid, $$\mathbb{E}(\bar X)=\frac1n\sum_{i=1}^n \mathbb{E}(X_i)=\mu$$ and $$\operatorname{Var}(\bar X)=\frac1{n^2}\sum_{i=1}^n \operatorname{Var}(X_i)=\frac{\sigma^2}{n}.$$ The identity $\mathbb{E}(S^2)=\sigma^2$ was proved in the sampling section. 
Therefore $$\operatorname{MSE}(\bar X)=\operatorname{Var}(\bar X)=\frac{\sigma^2}{n}$$ when estimating $\mu$. If $X_i\sim \operatorname{Normal}(\mu,\sigma^2)$, then $$\frac{(n-1)S^2}{\sigma^2}\sim \chi^2_{n-1},$$ so $$\operatorname{Var}(S^2)=\frac{2\sigma^4}{n-1}.$$ Since $S^2$ is unbiased for $\sigma^2$, $$\operatorname{MSE}(S^2)=\frac{2\sigma^4}{n-1}.$$ ::: ## The MLE of normal variance The MLE for the normal variance is biased, but it can have smaller MSE than the unbiased sample variance. ::: {.callout-note title="Example"} **Example 7** (Normal variance MLE). Suppose $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$. The MLE of $\sigma^2$ is $$\widehat\sigma^2_{\mathrm{MLE}} =\frac1n\sum_{i=1}^n (X_i-\bar X)^2 =\frac{n-1}{n}S^2.$$ Find its bias, variance, and MSE. ::: ::: {.callout-tip title="Solution"} Since $\mathbb{E}(S^2)=\sigma^2$, $$\mathbb{E}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =\frac{n-1}{n}\sigma^2.$$ Thus $$\operatorname{Bias}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =-\frac{\sigma^2}{n}.$$ Also, $$\operatorname{Var}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =\left(\frac{n-1}{n}\right)^2\operatorname{Var}(S^2) =\left(\frac{n-1}{n}\right)^2\frac{2\sigma^4}{n-1} =\frac{2(n-1)\sigma^4}{n^2}.$$ Therefore $$\operatorname{MSE}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr) =\frac{2(n-1)\sigma^4}{n^2}+\frac{\sigma^4}{n^2} =\frac{(2n-1)\sigma^4}{n^2}.$$ Compare this with $$\operatorname{MSE}(S^2)=\frac{2\sigma^4}{n-1}.$$ For $n\ge 2$, $$\frac{(2n-1)}{n^2}<\frac{2}{n-1},$$ so $$\operatorname{MSE}\bigl(\widehat\sigma^2_{\mathrm{MLE}}\bigr)<\operatorname{MSE}(S^2).$$ The MLE trades a small bias for a larger variance reduction. ::: ::: {.callout-tip title="Key idea"} Bias-variance tradeoff An unbiased estimator is not automatically best under MSE. Sometimes a biased estimator has smaller MSE because its variance is much smaller. 
::: ## A Bayes estimator for Bernoulli data This example compares the usual sample proportion with a Bayesian shrinkage estimator. ::: {.callout-note title="Example"} **Example 8** (Binomial Bayes estimator). Suppose $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)$, where $p$ is unknown. Let $$S=X_1+\cdots+X_n.$$ The MLE is $$\widehat p_{\mathrm{MLE}}=\bar X=\frac{S}{n}.$$ If the prior is $p\sim \operatorname{Beta}(\alpha,\beta)$, then the posterior mean estimator is $$\widehat p_{\mathrm{B}} =\mathbb{E}[p\mid S] =\frac{\alpha+S}{\alpha+\beta+n}.$$ Compute the MSE of $\bar X$ and $\widehat p_{\mathrm{B}}$ as functions of $p$. ::: ::: {.callout-tip title="Solution"} For the MLE, $$\mathbb{E}_p(\bar X)=p, \qquad \operatorname{Var}_p(\bar X)=\frac{p(1-p)}{n}.$$ Thus $$\operatorname{MSE}_p(\bar X)=\frac{p(1-p)}{n}.$$ For the Bayes estimator, $$\widehat p_{\mathrm{B}}=\frac{\alpha+S}{\alpha+\beta+n}.$$ Since $S\sim \operatorname{Binomial}(n,p)$, $$\mathbb{E}_p(S)=np, \qquad \operatorname{Var}_p(S)=np(1-p).$$ Therefore $$\mathbb{E}_p(\widehat p_{\mathrm{B}}) =\frac{\alpha+np}{\alpha+\beta+n},$$ $$\operatorname{Var}_p(\widehat p_{\mathrm{B}}) =\frac{np(1-p)}{(\alpha+\beta+n)^2},$$ and $$\operatorname{Bias}_p(\widehat p_{\mathrm{B}}) =\frac{\alpha+np}{\alpha+\beta+n}-p =\frac{\alpha-p(\alpha+\beta)}{\alpha+\beta+n}.$$ Hence $$\operatorname{MSE}_p(\widehat p_{\mathrm{B}}) =\frac{np(1-p)}{(\alpha+\beta+n)^2} +\left(\frac{\alpha-p(\alpha+\beta)}{\alpha+\beta+n}\right)^2.$$ If we choose $$\alpha=\beta=\frac{\sqrt n}{4},$$ then this estimator shrinks $\bar X$ toward $1/2$. It can have better MSE for small $n$ or when one strongly believes $p$ is close to $1/2$. For large $n$, the MLE becomes very strong because the data dominates. ::: ::: {.callout-important title="Remark"} *Remark 9*. The lecture slide emphasizes a special choice of $\alpha$ and $\beta$ that makes the MSE curve flatter. 
The main statistical idea is shrinkage: the estimator sacrifices unbiasedness to reduce risk near values favored by the prior. ::: # Best Unbiased Estimators Unbiasedness alone is not enough, so we compare unbiased estimators by their variance. ## Definition and uniqueness The best unbiased estimator has the smallest variance among all unbiased estimators. ::: {.callout-note title="Definition"} **Definition 10** (Best unbiased estimator and UMVUE). An estimator $\widehat\theta$ of $\theta$ is called a **best unbiased estimator** if: 1. it is unbiased: $\mathbb{E}_\theta[\widehat\theta]=\theta$ for all $\theta$; 2. it has minimum variance among all unbiased estimators of $\theta$ for every $\theta$. It is also called a **uniform minimum variance unbiased estimator**, or **UMVUE**. ::: The definition extends naturally to estimating a function $g(\theta)$. ::: {.callout-important title="Theorem"} **Theorem 11** (Uniqueness). *If $W$ is a best unbiased estimator of $g(\theta)$, then $W$ is unique almost surely.* ::: ::: {.callout-note title="Proof"} *Proof.* Suppose $W_1$ and $W_2$ are both best unbiased estimators. Then $$W_3=\frac{W_1+W_2}{2}$$ is also unbiased. Since $W_1$ and $W_2$ are both minimum-variance unbiased estimators, $W_3$ cannot have smaller variance. But $$\operatorname{Var}(W_1-W_2)\ge 0$$ and the usual variance identity implies that averaging two different unbiased estimators would strictly reduce variance unless $W_1=W_2$ almost surely. Therefore $W_1=W_2$ almost surely. ◻ ::: ## Poisson unbiased estimation This example shows that many unbiased estimators can exist, but one may be clearly better. ::: {.callout-note title="Example"} **Example 12** (Poisson unbiased estimators). Suppose $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)$. Then both $\bar X$ and $S^2$ are unbiased estimators of $\lambda$. Compare their MSEs. 
::: ::: {.callout-tip title="Solution"} For the sample mean, $$\mathbb{E}_\lambda(\bar X)=\lambda, \qquad \operatorname{Var}_\lambda(\bar X)=\frac{\lambda}{n}.$$ Thus $$\operatorname{MSE}(\bar X)=\frac{\lambda}{n}.$$ The sample variance $S^2$ is also unbiased for the population variance, and for a Poisson random variable the population variance equals $\lambda$. Hence $\mathbb{E}(S^2)=\lambda$. A general formula for the variance of the sample variance is $$\operatorname{Var}(S^2)=\frac1n\left(\mu_4-\frac{n-3}{n-1}\sigma^4\right),$$ where $\mu_4=\mathbb{E}[(X-\mu)^4]$ and $\sigma^2=\operatorname{Var}(X)$. For $X\sim \operatorname{Poisson}(\lambda)$, $$\mu=\lambda, \qquad \sigma^2=\lambda, \qquad \mu_4=\lambda+3\lambda^2.$$ Therefore $$\begin{aligned} \operatorname{Var}(S^2) &=\frac1n\left(\lambda+3\lambda^2-\frac{n-3}{n-1}\lambda^2\right)\\ &=\frac{\lambda}{n}+\frac{2\lambda^2}{n-1}. \end{aligned}$$ Since $S^2$ is unbiased, $$\operatorname{MSE}(S^2)=\frac{\lambda}{n}+\frac{2\lambda^2}{n-1}.$$ Thus $$\operatorname{MSE}(\bar X)<\operatorname{MSE}(S^2)$$ for $\lambda>0$. This shows $\bar X$ is better than $S^2$ among these two unbiased estimators, but by itself it does not prove $\bar X$ is the best among *all* unbiased estimators. ::: # Cramer-Rao Inequality and Fisher Information The Cramer-Rao inequality gives a lower bound on the variance of unbiased estimators. ## The inequality The bound is one of the main tools for proving that an unbiased estimator is best. ::: {.callout-important title="Theorem"} **Theorem 13** (Cramer-Rao inequality). *Let $X=(X_1,\ldots,X_n)$ have joint density or mass function $f(x\mid \theta)$. Let $W(X)$ be an estimator of $g(\theta)$ satisfying the usual regularity conditions that allow differentiation under the integral sign. 
Then $$\operatorname{Var}_\theta(W) \ge \frac{\left(\dfrac{d}{d\theta}\mathbb{E}_\theta[W]\right)^2} {\mathbb{E}_\theta\left[\left(\dfrac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right]}.$$ If $W$ is unbiased for $g(\theta)$, then $\mathbb{E}_\theta[W]=g(\theta)$ and $$\operatorname{Var}_\theta(W) \ge \frac{\left(g'(\theta)\right)^2}{I_X(\theta)},$$ where $$I_X(\theta)=\mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right]$$ is the Fisher information in the sample.* ::: ::: {.callout-note title="Proof"} *Proof idea.* Let $$U_\theta(X)=\frac{\partial}{\partial\theta}\log f(X\mid \theta)$$ be the score function. Under regularity conditions, $$\mathbb{E}_\theta[U_\theta(X)]=0.$$ Also, $$\begin{aligned} \frac{d}{d\theta}\mathbb{E}_\theta[W] &=\frac{d}{d\theta}\int W(x)f(x\mid \theta)\,dx\\ &=\int W(x)\frac{\partial}{\partial\theta}f(x\mid\theta)\,dx\\ &=\int W(x)\left(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right)f(x\mid\theta)\,dx\\ &=\mathbb{E}_\theta[W U_\theta(X)]\\ &=\operatorname{Cov}_\theta(W,U_\theta(X)). \end{aligned}$$ By Cauchy-Schwarz, $$\left(\frac{d}{d\theta}\mathbb{E}_\theta[W]\right)^2 \le \operatorname{Var}_\theta(W)\operatorname{Var}_\theta(U_\theta(X)).$$ Since $\operatorname{Var}_\theta(U_\theta)=\mathbb{E}_\theta[U_\theta^2]=I_X(\theta)$, rearranging gives the result. ◻ ::: ## Equality condition The equality condition explains when the lower bound is attained. ::: {.callout-important title="Theorem"} **Theorem 14** (Equality condition). *Equality in the Cramer-Rao inequality holds if and only if there exists a function $h(\theta)$ such that $$h(\theta)\{W(x)-g(\theta)\} =\frac{\partial}{\partial\theta}\log L(\theta\mid x),$$ where $L(\theta\mid x)$ is the likelihood function.* ::: This condition is useful because if an unbiased estimator attains the Cramer-Rao lower bound, then it is a best unbiased estimator. 
## IID form and information number For independent samples, the Fisher information adds. ::: {.callout-important title="Corollary"} **Corollary 15** (IID sample). *Suppose $X_1,\ldots,X_n$ are iid with density or mass function $f(x\mid\theta)$. Then $$I_X(\theta) =n I_1(\theta), \qquad I_1(\theta)=\mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right].$$ Hence $$\operatorname{Var}_\theta(W) \ge \frac{\left(\dfrac{d}{d\theta}\mathbb{E}_\theta[W]\right)^2} {nI_1(\theta)}.$$* ::: ::: {.callout-important title="Lemma"} **Lemma 16** (Computing Fisher information). *Under suitable regularity conditions, often satisfied for exponential-family models, $$I_1(\theta) =\mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right] =-\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\log f(X\mid\theta)\right].$$* ::: ## Poisson conclusion We now return to the Poisson example and prove that $\bar X$ is best unbiased. ::: {.callout-note title="Example"} **Example 17** (Poisson UMVUE). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Poisson}(\lambda)$. Show that $\bar X$ is the best unbiased estimator of $\lambda$. ::: ::: {.callout-tip title="Solution"} For one observation, $$f(x\mid\lambda)=e^{-\lambda}\frac{\lambda^x}{x!},$$ so $$\log f(x\mid\lambda)=-\lambda+x\log\lambda-\log(x!).$$ Then $$\frac{\partial^2}{\partial\lambda^2}\log f(x\mid\lambda) =-\frac{x}{\lambda^2}.$$ Therefore $$I_1(\lambda) =-\mathbb{E}_\lambda\left[-\frac{X}{\lambda^2}\right] =\frac{\lambda}{\lambda^2} =\frac1\lambda.$$ For $n$ iid observations, $$I_X(\lambda)=\frac{n}{\lambda}.$$ If $W$ is unbiased for $\lambda$, then $g(\lambda)=\lambda$ and $g'(\lambda)=1$. 
The Cramer-Rao lower bound gives $$\operatorname{Var}_\lambda(W)\ge \frac{1}{n/\lambda}=\frac{\lambda}{n}.$$ But $$\mathbb{E}_\lambda(\bar X)=\lambda, \qquad \operatorname{Var}_\lambda(\bar X)=\frac{\lambda}{n}.$$ Thus $\bar X$ is unbiased and attains the lower bound. Hence $\bar X$ is the best unbiased estimator of $\lambda$. ::: ## Normal variance example The Cramer-Rao bound is a lower bound, but not every unbiased estimator attains it. ::: {.callout-note title="Example"} **Example 18** (Normal variance). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$. Consider estimating $\sigma^2$. If we use the parameter $\theta=\sigma^2$, then the normal log-density for one observation is $$\log f(x\mid \mu,\theta) =-\frac12\log(2\pi\theta)-\frac{(x-\mu)^2}{2\theta}.$$ When $\mu$ is treated as known in this direct one-parameter calculation, $$-\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log f(X\mid\mu,\theta)\right] =\frac{1}{2\theta^2} =\frac{1}{2\sigma^4}.$$ Thus the Cramer-Rao lower bound for unbiased estimators of $\sigma^2$ is $$\operatorname{Var}(W)\ge \frac{2\sigma^4}{n}.$$ ::: ::: {.callout-tip title="Solution"} Compute derivatives: $$\frac{\partial}{\partial\theta}\log f(x\mid\mu,\theta) =-\frac1{2\theta}+\frac{(x-\mu)^2}{2\theta^2},$$ $$\frac{\partial^2}{\partial\theta^2}\log f(x\mid\mu,\theta) =\frac1{2\theta^2}-\frac{(x-\mu)^2}{\theta^3}.$$ Since $\mathbb{E}[(X-\mu)^2]=\theta$, $$-\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log f(X\mid\mu,\theta)\right] =-\left(\frac1{2\theta^2}-\frac{\theta}{\theta^3}\right) =\frac1{2\theta^2}.$$ Thus $I_X(\theta)=n/(2\theta^2)$, so $$\operatorname{Var}(W)\ge \frac{1}{n/(2\theta^2)}=\frac{2\theta^2}{n}=\frac{2\sigma^4}{n}.$$ The usual unbiased sample variance satisfies $$\operatorname{Var}(S^2)=\frac{2\sigma^4}{n-1},$$ which is larger than $2\sigma^4/n$. Therefore this simple Cramer-Rao calculation does not prove that $S^2$ attains the lower bound. 
::: ::: {.callout-important title="Remark"} *Remark 19*. When nuisance parameters are present, such as unknown $\mu$ while estimating $\sigma^2$, one must be careful with the precise form of Fisher information. The lecture emphasizes the main message: the Cramer-Rao lower bound is a benchmark, and not every natural unbiased estimator attains it. ::: # Sufficiency and Unbiasedness Sufficient statistics can improve estimators by conditioning away irrelevant randomness. ## Rao-Blackwell theorem The Rao-Blackwell theorem is a systematic estimator-improvement principle. ::: {.callout-important title="Theorem"} **Theorem 20** (Rao-Blackwell theorem). *Let $X_1,\ldots,X_n$ be a sample from a population distribution with density or mass function $f(x\mid\theta)$. Let $W(X)$ be an unbiased estimator of $g(\theta)$, and let $T(X)$ be a sufficient statistic for $\theta$. Define $$\phi(T)=\mathbb{E}[W(X)\mid T].$$ Then $$\mathbb{E}_\theta[\phi(T)]=g(\theta)$$ and $$\operatorname{Var}_\theta(\phi(T))\le \operatorname{Var}_\theta(W)$$ for every $\theta$.* ::: ::: {.callout-note title="Proof"} *Proof.* For unbiasedness, use the law of total expectation: $$\mathbb{E}_\theta[\phi(T)] =\mathbb{E}_\theta\bigl[\mathbb{E}_\theta(W\mid T)\bigr] =\mathbb{E}_\theta(W)=g(\theta).$$ For variance, use the law of total variance: $$\operatorname{Var}_\theta(W) =\operatorname{Var}_\theta\bigl(\mathbb{E}_\theta(W\mid T)\bigr) +\mathbb{E}_\theta\bigl[\operatorname{Var}_\theta(W\mid T)\bigr].$$ Since the second term is nonnegative, $$\operatorname{Var}_\theta(W)\ge \operatorname{Var}_\theta(\phi(T)).$$ Thus $\phi(T)$ is uniformly at least as good as $W$ among unbiased estimators. ◻ ::: ::: {.callout-tip title="Key idea"} Rao-Blackwellization replaces an estimator by its conditional expectation given a sufficient statistic. This preserves unbiasedness and reduces variance. ::: ## Lehmann-Scheffe theorem Completeness turns the Rao-Blackwell estimator into the unique best unbiased estimator. 
::: {.callout-note title="Definition"} **Definition 21** (Complete sufficient statistic). A statistic $T$ is **complete** for a family of distributions if $$\mathbb{E}_\theta[g(T)]=0 \quad \text{for all } \theta \quad \Longrightarrow \quad g(T)=0 \text{ almost surely}.$$ If $T$ is both sufficient and complete, it is called a complete sufficient statistic. ::: ::: {.callout-important title="Theorem"} **Theorem 22** (Lehmann-Scheffe theorem). *Let $T(X)$ be a complete sufficient statistic for $\theta$. If $\phi(T)$ is unbiased for $g(\theta)$, then $\phi(T)$ is the unique best unbiased estimator of $g(\theta)$.* ::: ::: {.callout-important title="Remark"} *Remark 23*. A related characterization says that an unbiased estimator is best unbiased if and only if it is uncorrelated with every unbiased estimator of zero. ::: ## Common UMVUEs The following table summarizes common best unbiased estimators discussed in the course. ::: center Parameter Distribution / model UMVUE --------------------- ----------------------------------------------------------- -------------------------------------------------------- Mean $\mu$ $\operatorname{Normal}(\mu,\sigma^2)$, known $\sigma^2$ $\bar X$ Mean $\mu$ $\operatorname{Normal}(\mu,\sigma^2)$, unknown $\sigma^2$ $\bar X$ Variance $\sigma^2$ $\operatorname{Normal}(\mu,\sigma^2)$, unknown $\mu$ $\displaystyle \frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2$ Parameter $p$ Bernoulli / binomial sample proportion $\widehat p$ Parameter $\lambda$ Poisson $\bar X$ Parameter $\theta$ $\operatorname{Uniform}[0,\theta]$ $\displaystyle \frac{n+1}{n}\max_i X_i$ ::: ::: {.callout-note title="Example"} **Example 24** (Uniform UMVUE). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Uniform}(0,\theta)$. Show that $$\widehat\theta=\frac{n+1}{n}X_{(n)}$$ is unbiased for $\theta$, where $X_{(n)}=\max_i X_i$. 
::: ::: {.callout-tip title="Solution"} The CDF of $X_{(n)}$ is $$\mathbb{P}(X_{(n)}\le x)=\left(\frac{x}{\theta}\right)^n, \qquad 0\le x\le \theta.$$ Thus the density is $$f_{X_{(n)}}(x)=\frac{n x^{n-1}}{\theta^n}, \qquad 0\le x\le \theta.$$ Then $$\mathbb{E}[X_{(n)}] =\int_0^\theta x\frac{n x^{n-1}}{\theta^n}\,dx =\frac{n}{\theta^n}\cdot \frac{\theta^{n+1}}{n+1} =\frac{n}{n+1}\theta.$$ Therefore $$\mathbb{E}\left[\frac{n+1}{n}X_{(n)}\right]=\theta.$$ ::: # Loss Functions and Risk Decision theory evaluates estimators by the loss incurred after making a decision. ## Loss functions A loss function measures the cost of estimating $\theta$ by an action $a$. ::: {.callout-note title="Definition"} **Definition 25** (Loss function). Let $\Theta$ be the parameter space and $\mathcal A$ be the action space. A **loss function** is a function $$L:\Theta\times\mathcal A\to \mathbb{R}$$ where $L(\theta,a)$ measures the cost of taking action $a$ when the true parameter is $\theta$. ::: For point estimation, the action space is often $\mathcal A=\Theta$. Common losses include: - absolute error loss: $$L(\theta,a)=|a-\theta|;$$ - squared error loss: $$L(\theta,a)=(a-\theta)^2;$$ - zero-one loss for exact decisions: $$L(\theta,a)= \begin{cases} 0, & a=\theta,\\ 1, & a\ne \theta. \end{cases}$$ ## Risk function The risk is the expected loss when a decision rule is used. ::: {.callout-note title="Definition"} **Definition 26** (Decision rule and risk). A decision rule is a function $$\delta:\mathcal X\to \mathcal A$$ that selects an action based on the observed data. Its risk function is $$R(\theta,\delta)=\mathbb{E}_\theta[L(\theta,\delta(X))].$$ ::: For squared error loss, $$R(\theta,\delta)=\mathbb{E}_\theta[(\delta(X)-\theta)^2]=\operatorname{MSE}_\theta(\delta).$$ Thus MSE is a special case of risk. 
::: {.callout-tip title="Key idea"} Risk comparison Given two estimators $\delta_1$ and $\delta_2$, if $$R(\theta,\delta_1)<R(\theta,\delta_2)$$ for all $\theta$, then $\delta_1$ is uniformly better under the chosen loss. ::: # Risk Examples The risk function can lead to conclusions different from unbiasedness. ## Bernoulli shrinkage estimator The Bernoulli example illustrates how shrinkage changes risk. ::: {.callout-note title="Example"} **Example 27** (Bernoulli risk comparison). Suppose $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)$. Compare $$\bar X=\frac1n\sum_{i=1}^n X_i$$ with the shrinkage estimator $$\widehat p_B=\frac{\sum_{i=1}^n X_i+\sqrt n/4}{n+\sqrt n}.$$ Use squared error loss. ::: ::: {.callout-tip title="Solution"} Let $S=\sum_{i=1}^n X_i$. Then $$\operatorname{MSE}_p(\bar X)=\frac{p(1-p)}{n}.$$ For $$\widehat p_B=\frac{S+c}{n+d}, \qquad c=\frac{\sqrt n}{4},\quad d=\sqrt n,$$ we have $$\mathbb{E}_p(\widehat p_B)=\frac{np+c}{n+d}, \qquad \operatorname{Var}_p(\widehat p_B)=\frac{np(1-p)}{(n+d)^2}.$$ Thus $$\operatorname{MSE}_p(\widehat p_B) =\frac{np(1-p)}{(n+d)^2} +\left(\frac{np+c}{n+d}-p\right)^2.$$ Since $d=\sqrt n$ and $c=\sqrt n/4$, $$\frac{np+c}{n+d}-p =\frac{\sqrt n(1/4-p)}{n+\sqrt n}.$$ Therefore $$\operatorname{MSE}_p(\widehat p_B) =\frac{np(1-p)}{(n+\sqrt n)^2} +\frac{n(1/4-p)^2}{(n+\sqrt n)^2}.$$ This estimator shrinks toward $1/4$ in this parametrization. Its risk may be smaller than that of $\bar X$ in some parts of the parameter space and larger in others. The lesson is that the preferred estimator depends on the loss function and the values of $p$ considered important. ::: ::: {.callout-important title="Remark"} *Remark 28*. The lecture slide displays risk curves for $n=4$ and $n=400$. The qualitative message is that shrinkage estimators can improve risk in small samples or near the shrinkage target, while the sample proportion becomes very strong as $n$ grows. 
::: ## Normal variance: choosing a constant multiple of $S^2$ This example shows how MSE can favor a biased multiple of the unbiased sample variance. ::: {.callout-note title="Example"} **Example 29** (Normal variance risk). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$ and consider estimators of the form $$\delta_b(X)=bS^2, \qquad S^2=\frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2.$$ Find the value of $b$ that minimizes MSE for estimating $\sigma^2$. ::: ::: {.callout-tip title="Solution"} We know $$\mathbb{E}(S^2)=\sigma^2, \qquad \operatorname{Var}(S^2)=\frac{2\sigma^4}{n-1}.$$ For $\delta_b=bS^2$, $$\mathbb{E}(\delta_b)=b\sigma^2, \qquad \operatorname{Var}(\delta_b)=b^2\frac{2\sigma^4}{n-1}.$$ Thus $$\begin{aligned} R(\sigma^2,\delta_b) &=\operatorname{MSE}(\delta_b)\\ &=b^2\frac{2\sigma^4}{n-1}+(b\sigma^2-\sigma^2)^2\\ &=\sigma^4\left(\frac{2b^2}{n-1}+(b-1)^2\right). \end{aligned}$$ Differentiate with respect to $b$: $$\frac{d}{db}\left(\frac{2b^2}{n-1}+(b-1)^2\right) =\frac{4b}{n-1}+2(b-1).$$ Set equal to zero: $$\frac{4b}{n-1}+2b-2=0.$$ Hence $$b\left(\frac{4}{n-1}+2\right)=2, \qquad b=\frac{n-1}{n+1}.$$ Therefore the MSE-minimizing estimator in this class is $$\widetilde S^2 =\frac{n-1}{n+1}S^2 =\frac1{n+1}\sum_{i=1}^n (X_i-\bar X)^2.$$ ::: ::: {.callout-warning title="Warning"} Correction note The calculation above gives $b=(n-1)/(n+1)$ under squared error loss. This estimator is biased but improves MSE within the class $bS^2$. 
::: ## Risk curves for variance estimators The three common estimators of $\sigma^2$ are $$S^2=\frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2,$$ $$\widehat\sigma^2_{\mathrm{MLE}}=\frac1n\sum_{i=1}^n (X_i-\bar X)^2=\frac{n-1}{n}S^2,$$ $$\widetilde S^2=\frac1{n+1}\sum_{i=1}^n (X_i-\bar X)^2=\frac{n-1}{n+1}S^2.$$ Under squared error loss, their risks are proportional to $\sigma^4$: $$R(\sigma^2,S^2)=\frac{2\sigma^4}{n-1},$$ $$R(\sigma^2,\widehat\sigma^2_{\mathrm{MLE}})=\frac{(2n-1)\sigma^4}{n^2},$$ $$R(\sigma^2,\widetilde S^2)=\sigma^4\left(\frac{2}{n-1}\left(\frac{n-1}{n+1}\right)^2+\frac{4}{(n+1)^2}\right).$$ The MSE-minimizing constant multiple $\widetilde S^2$ has the smallest risk among estimators of the form $bS^2$. ## Different loss functions: Stein's loss Changing the loss function can change which estimator is preferred. ::: {.callout-note title="Definition"} **Definition 30** (Stein's loss for variance estimation). For estimating $\sigma^2$ by an action $a>0$, Stein's loss is $$L(\sigma^2,a)=\frac{a}{\sigma^2}-1-\log\left(\frac{a}{\sigma^2}\right).$$ ::: ::: {.callout-note title="Example"} **Example 31** (Stein's loss for $bS^2$). For estimators $\delta_b(X)=bS^2$, compute the risk under Stein's loss up to terms independent of $b$. ::: ::: {.callout-tip title="Solution"} The risk is $$\begin{aligned} R(\sigma^2,\delta_b) &=\mathbb{E}\left[\frac{bS^2}{\sigma^2}-1-\log\left(\frac{bS^2}{\sigma^2}\right)\right]\\ &=b\mathbb{E}\left[\frac{S^2}{\sigma^2}\right]-1-\log b-\mathbb{E}\left[\log\left(\frac{S^2}{\sigma^2}\right)\right]. \end{aligned}$$ Since $\mathbb{E}[S^2/\sigma^2]=1$, $$R(\sigma^2,\delta_b)=b-\log b-1-C,$$ where $$C=\mathbb{E}\left[\log\left(\frac{S^2}{\sigma^2}\right)\right]$$ does not depend on $b$. Differentiating $b-\log b$ gives $$1-\frac1b=0,$$ so $b=1$. Under Stein's loss, the best choice in this class is $\delta_b=S^2$. ::: ::: {.callout-tip title="Key idea"} Important lesson The word "best" depends on the loss function. 
Under squared error loss, a biased multiple of $S^2$ can be preferred; under Stein's loss, $S^2$ is preferred within the class $bS^2$.
:::

# Bayes Risk

Bayes risk averages the frequentist risk over a prior distribution on the parameter.

## Definition

When a prior distribution is specified, we can compare estimators by their average risk under the prior.

::: {.callout-note title="Definition"}
**Definition 32** (Bayes risk). Given a prior distribution $\pi(\theta)$, the Bayes risk of a decision rule $\delta$ is
$$R_B(\delta)=\int_\Theta R(\theta,\delta)\pi(\theta)\,d\theta.$$
Equivalently,
$$R_B(\delta)=\int_\Theta\int_{\mathcal X} L(\theta,\delta(x))f(x\mid\theta)\pi(\theta)\,dx\,d\theta.$$
:::

Using Bayes' rule, this can also be written as
$$R_B(\delta) =\int_{\mathcal X}\left[\int_\Theta L(\theta,\delta(x))\pi(\theta\mid x)\,d\theta\right]m(x)\,dx,$$
where $\pi(\theta\mid x)$ is the posterior distribution and
$$m(x)=\int f(x\mid\theta)\pi(\theta)\,d\theta$$
is the marginal distribution of the data.

## Posterior expected loss

For a fixed observed data value $x$, the posterior expected loss is
$$\int_\Theta L(\theta,a)\pi(\theta\mid x)\,d\theta.$$
A Bayes estimator chooses the action $a$ minimizing this posterior expected loss.

::: {.callout-note title="Example"}
**Example 33** (Bayes estimator under squared error loss). Suppose $L(\theta,a)=(a-\theta)^2$. Show that the Bayes estimator is the posterior mean.
:::

::: {.callout-tip title="Solution"}
For fixed data $x$, minimize
$$\mathbb{E}[(a-\theta)^2\mid x]$$
with respect to $a$. Expand:
$$\mathbb{E}[(a-\theta)^2\mid x] =a^2-2a\mathbb{E}[\theta\mid x]+\mathbb{E}[\theta^2\mid x].$$
Differentiating with respect to $a$ gives
$$2a-2\mathbb{E}[\theta\mid x]=0.$$
Thus
$$a=\mathbb{E}[\theta\mid x].$$
So the Bayes estimator under squared error loss is the posterior mean.
:::

::: {.callout-note title="Example"}
**Example 34** (Bayes estimator under absolute error loss). Suppose $L(\theta,a)=|a-\theta|$.
The Bayes estimator is a posterior median.
:::

::: {.callout-tip title="Solution"}
For fixed data $x$, minimize
$$\mathbb{E}[|a-\theta|\mid x] =\int |a-\theta|\pi(\theta\mid x)\,d\theta.$$
A standard derivative/subgradient argument shows that a minimizer satisfies
$$\mathbb{P}(\theta\le a\mid x)\ge \frac12, \qquad \mathbb{P}(\theta\ge a\mid x)\ge \frac12.$$
Thus any posterior median is a Bayes estimator under absolute error loss.
:::

# Summary of Estimator Criteria

This section closes with a comparison table for the main estimator-evaluation criteria.

::: center
| Criterion | Definition | Desirable property | Notes |
|-----------|----------------------|------------------|------------------|
| Unbiasedness | $\mathbb{E}[\widehat\theta]=\theta$ | Expected value equals the target | Does not guarantee minimum MSE |
| Bias | $\operatorname{Bias}(\widehat\theta)=\mathbb{E}[\widehat\theta]-\theta$ | Bias close to zero | Can be positive or negative |
| Variance | $\operatorname{Var}(\widehat\theta)$ | Smaller is better, especially among unbiased estimators | Measures sampling variability |
| MSE | $\mathbb{E}[(\widehat\theta-\theta)^2]$ | Smaller is better | $\operatorname{MSE}=\operatorname{Var}+\operatorname{Bias}^2$ |
| Efficiency | Variance relative to best possible lower bound | Close to Cramer-Rao lower bound | Usually for unbiased estimators |
| Consistency | $\widehat\theta_n\xrightarrow{P}\theta$ | Converges to target as $n\to\infty$ | Large-sample property |
| Sufficiency | Statistic captures all information about parameter | No information loss about $\theta$ | Linked to factorization theorem and Rao-Blackwell |
| Robustness | Stability under small departures from assumptions | Less sensitive to outliers/model error | Important in applications |
:::

::: {.callout-tip title="Key idea"}
**Concept map.** [Figure: concept map]
:::

# Practice Problems

The practice problems below reinforce the main methods for evaluating estimators.

::: {.callout-warning title="Practice Problem"}
**Practice Problem 35** (Bias-variance decomposition). Let $W$ be an estimator of $\theta$. Prove that
$$\mathbb{E}_\theta[(W-\theta)^2]=\operatorname{Var}_\theta(W)+\operatorname{Bias}_\theta(W)^2.$$
:::

::: {.callout-tip title="Solution"}
Use
$$W-\theta=(W-\mathbb{E}W)+(\mathbb{E}W-\theta).$$
Squaring and taking expectations gives
$$\mathbb{E}[(W-\theta)^2] =\mathbb{E}[(W-\mathbb{E}W)^2]+2(\mathbb{E}W-\theta)\mathbb{E}[W-\mathbb{E}W]+(\mathbb{E}W-\theta)^2.$$
The middle term is zero, so
$$\mathbb{E}[(W-\theta)^2]=\operatorname{Var}(W)+\operatorname{Bias}(W)^2.$$
:::

::: {.callout-warning title="Practice Problem"}
**Practice Problem 36** (MSE for a biased variance estimator). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Normal}(\mu,\sigma^2)$ and let
$$\widehat\sigma^2_c=cS^2.$$
Find $\operatorname{MSE}(\widehat\sigma^2_c)$ as a function of $c$.
:::

::: {.callout-tip title="Solution"}
Since $\mathbb{E}(S^2)=\sigma^2$ and $\operatorname{Var}(S^2)=2\sigma^4/(n-1)$,
$$\mathbb{E}(cS^2)=c\sigma^2, \qquad \operatorname{Var}(cS^2)=c^2\frac{2\sigma^4}{n-1}.$$
Therefore
$$\operatorname{MSE}(cS^2) =c^2\frac{2\sigma^4}{n-1}+(c\sigma^2-\sigma^2)^2 =\sigma^4\left(\frac{2c^2}{n-1}+(c-1)^2\right).$$
:::

::: {.callout-warning title="Practice Problem"}
**Practice Problem 37** (Optimal constant multiple). Using the previous problem, find the value of $c$ that minimizes $\operatorname{MSE}(cS^2)$.
:::

::: {.callout-tip title="Solution"}
Minimize
$$h(c)=\frac{2c^2}{n-1}+(c-1)^2.$$
Differentiate:
$$h'(c)=\frac{4c}{n-1}+2(c-1).$$
Set $h'(c)=0$:
$$\frac{4c}{n-1}+2c-2=0.$$
Thus
$$c\left(\frac{4}{n-1}+2\right)=2, \qquad c=\frac{n-1}{n+1}.$$
Since $h''(c)=4/(n-1)+2>0$, this value is the minimizer.
:::

::: {.callout-warning title="Practice Problem"}
**Practice Problem 38** (Cramer-Rao for Bernoulli). Let $X_1,\ldots,X_n\overset{\text{iid}}{\sim}\operatorname{Bernoulli}(p)$. Find the Fisher information and the Cramer-Rao lower bound for unbiased estimators of $p$.
:::

::: {.callout-tip title="Solution"}
For one observation,
$$f(x\mid p)=p^x(1-p)^{1-x}, \qquad x\in\{0,1\}.$$
The log-likelihood for one observation is
$$\ell(p)=X\log p+(1-X)\log(1-p).$$
Then
$$\ell'(p)=\frac{X}{p}-\frac{1-X}{1-p} =\frac{X-p}{p(1-p)}.$$
Thus
$$I_1(p)=\mathbb{E}\left[\left(\frac{X-p}{p(1-p)}\right)^2\right] =\frac{\operatorname{Var}(X)}{p^2(1-p)^2} =\frac{p(1-p)}{p^2(1-p)^2} =\frac1{p(1-p)}.$$
For $n$ iid observations,
$$I_n(p)=\frac{n}{p(1-p)}.$$
For unbiased estimators of $p$,
$$\operatorname{Var}(W)\ge \frac{1}{I_n(p)}=\frac{p(1-p)}{n}.$$
Since $\bar X$ is unbiased and has variance $p(1-p)/n$, it attains the lower bound.
:::

::: {.callout-warning title="Practice Problem"}
**Practice Problem 39** (Rao-Blackwell improvement). Suppose $W$ is unbiased for $g(\theta)$ and $T$ is sufficient. Let $\phi(T)=\mathbb{E}(W\mid T)$. Prove that $\phi(T)$ is unbiased for $g(\theta)$ and that its variance is no larger than that of $W$.
:::

::: {.callout-tip title="Solution"}
Because $T$ is sufficient, the conditional distribution of $W$ given $T$ does not depend on $\theta$, so $\phi(T)$ is a function of the data alone and is a legitimate estimator. By the law of total expectation,
$$\mathbb{E}[\phi(T)]=\mathbb{E}[\mathbb{E}(W\mid T)]=\mathbb{E}(W)=g(\theta).$$
By the law of total variance,
$$\operatorname{Var}(W)=\operatorname{Var}(\mathbb{E}(W\mid T))+\mathbb{E}[\operatorname{Var}(W\mid T)] =\operatorname{Var}(\phi(T))+\mathbb{E}[\operatorname{Var}(W\mid T)].$$
The second term is nonnegative, so
$$\operatorname{Var}(\phi(T))\le \operatorname{Var}(W).$$
:::

::: {.callout-warning title="Practice Problem"}
**Practice Problem 40** (Bayes estimator under squared error). Let $\pi(\theta\mid x)$ be a posterior density. Show that the posterior mean minimizes posterior expected squared error loss.
:::

::: {.callout-tip title="Solution"}
For fixed $x$, minimize
$$Q(a)=\int (a-\theta)^2\pi(\theta\mid x)\,d\theta.$$
Expanding,
$$Q(a)=a^2-2a\mathbb{E}(\theta\mid x)+\mathbb{E}(\theta^2\mid x).$$
Differentiate:
$$Q'(a)=2a-2\mathbb{E}(\theta\mid x).$$
Since $Q''(a)=2>0$, the minimizer is
$$a=\mathbb{E}(\theta\mid x).$$
:::
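
The closed-form answers to Practice Problems 36 and 37 can be cross-checked numerically. The following is a minimal sketch, not part of the original notes: the function names `mse_factor` and `simulated_mse`, the sample size $n=10$, the number of replications, and the seed are all illustrative assumptions. It compares the exact value of $\operatorname{MSE}(cS^2)/\sigma^4$ with a Monte Carlo estimate for $c=1$, $c=(n-1)/n$, and $c=(n-1)/(n+1)$.

```python
import random


def mse_factor(c, n):
    """Closed-form MSE(c * S^2) / sigma^4 from Practice Problem 36."""
    return 2.0 * c**2 / (n - 1) + (c - 1.0) ** 2


def simulated_mse(c, n, mu=0.0, sigma=1.0, reps=20000, seed=0):
    """Monte Carlo estimate of MSE(c * S^2) for iid Normal(mu, sigma^2) data.

    reps and seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)  # unbiased S^2
        total += (c * s2 - sigma**2) ** 2
    return total / reps


n = 10  # illustrative sample size (an assumption, not from the notes)
multiples = {
    "S^2, c = 1": 1.0,
    "MLE, c = (n-1)/n": (n - 1) / n,
    "optimal, c = (n-1)/(n+1)": (n - 1) / (n + 1),
}
for label, c in multiples.items():
    exact = mse_factor(c, n)
    approx = simulated_mse(c, n)
    print(f"{label}: exact {exact:.4f}, simulated {approx:.4f}")
```

With $\sigma=1$ the exact column should reproduce $2/(n-1)$, $(2n-1)/n^2$, and $2/(n+1)$, and the simulated column should agree up to Monte Carlo error.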

| Parameter | Distribution / model | UMVUE |
|-----------|----------------------|-------|
| Mean $\mu$ | $\operatorname{Normal}(\mu,\sigma^2)$, known $\sigma^2$ | $\bar X$ |
| Mean $\mu$ | $\operatorname{Normal}(\mu,\sigma^2)$, unknown $\sigma^2$ | $\bar X$ |
| Variance $\sigma^2$ | $\operatorname{Normal}(\mu,\sigma^2)$, unknown $\mu$ | $\displaystyle \frac1{n-1}\sum_{i=1}^n (X_i-\bar X)^2$ |
| Parameter $p$ | Bernoulli / binomial | sample proportion $\widehat p$ |
| Parameter $\lambda$ | Poisson | $\bar X$ |
| Parameter $\theta$ | $\operatorname{Uniform}[0,\theta]$ | $\displaystyle \frac{n+1}{n}\max_i X_i$ |
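
The $\operatorname{Uniform}[0,\theta]$ row of the table above can be illustrated by simulation. The sketch below is an illustration under assumed settings, not part of the original notes: the helper names, $\theta=3$, $n=5$, the number of replications, and the seed are hypothetical choices. It compares $\frac{n+1}{n}\max_i X_i$ with $2\bar X$, which is also unbiased for $\theta$; both sample means should be close to $\theta$, and the first estimator should show the smaller variance.

```python
import random


def simulate_uniform_estimators(theta=3.0, n=5, reps=50000, seed=1):
    """Draw repeated Uniform[0, theta] samples and evaluate two estimators.

    theta, n, reps, and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    umvue_vals, mom_vals = [], []
    for _ in range(reps):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        umvue_vals.append((n + 1) / n * max(sample))  # table entry for theta
        mom_vals.append(2.0 * sum(sample) / n)        # 2 * X-bar, also unbiased
    return umvue_vals, mom_vals


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def variance(values):
    """Sample variance with divisor len(values) - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


umvue_vals, mom_vals = simulate_uniform_estimators()
print("mean of (n+1)/n * max:    ", mean(umvue_vals))
print("mean of 2 * X-bar:        ", mean(mom_vals))
print("variance of (n+1)/n * max:", variance(umvue_vals))
print("variance of 2 * X-bar:    ", variance(mom_vals))
```

For comparison, the exact variances are $\theta^2/\bigl(n(n+2)\bigr)$ for the scaled maximum and $\theta^2/(3n)$ for $2\bar X$.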
