12  Chapter 12. Least Squares and Data Fitting

Story of the chapter. Earlier chapters taught us how to solve exact equations. In applications, however, exact equations often fail: data contain noise, models are simplified, and we usually have more observations than unknowns. Least squares is the linear algebra method for asking a wiser question: when an exact solution is impossible, what is the closest possible solution?

In this chapter, the answer comes from one geometric idea:

\[ \text{best approximation} = \text{orthogonal projection onto the model space}. \]

We will see that the familiar normal equations

\[ A^T A\vec{x}=A^T\vec{b} \]

are not just a formula. They are the coordinate form of the statement that the residual is perpendicular to the column space of \(A\).

12.1 Learning goals

By the end of this chapter, you should be able to:

  1. explain least squares as an orthogonal projection problem;
  2. derive and solve the normal equations;
  3. interpret residuals geometrically;
  4. understand weighted least squares;
  5. use QR factorization for least squares computation;
  6. fit lines, polynomials, and multivariable linear models to data;
  7. formulate best polynomial approximation in function spaces;
  8. use Python to compute and visualize least squares solutions.

12.2 12.1 Review: inner products and orthogonal projection

The least squares story begins with inner product geometry. An inner product gives a way to measure length and angle. Once we can measure angles, we can define perpendicularity. Once we can define perpendicularity, we can define projection.

12.2.1 Definition 12.1: Inner product space

A real vector space \(V\) is an inner product space if it has a function

\[ \langle \cdot,\cdot\rangle:V\times V\to \mathbb{R} \]

such that for all \(\vec{u},\vec{v},\vec{w}\in V\) and all \(c\in\mathbb{R}\):

  1. \(\langle \vec{u},\vec{v}\rangle=\langle \vec{v},\vec{u}\rangle\);
  2. \(\langle \vec{u}+\vec{v},\vec{w}\rangle=\langle \vec{u},\vec{w}\rangle+\langle \vec{v},\vec{w}\rangle\);
  3. \(\langle c\vec{u},\vec{v}\rangle=c\langle \vec{u},\vec{v}\rangle\);
  4. \(\langle \vec{v},\vec{v}\rangle\ge 0\);
  5. \(\langle \vec{v},\vec{v}\rangle=0\) if and only if \(\vec{v}=\vec{0}\).

The induced norm is

\[ \|\vec{v}\|=\sqrt{\langle \vec{v},\vec{v}\rangle}. \]

12.2.2 Example 12.2: Euclidean space

On \(\mathbb{R}^n\), the standard inner product is

\[ \langle \vec{u},\vec{v}\rangle=\vec{u}^{\,T}\vec{v}=u_1v_1+\cdots+u_nv_n. \]

12.2.3 Example 12.3: Function space

Let \(V=C[0,1]\). Define

\[ \langle f,g\rangle=\int_0^1 f(x)g(x)\,dx. \]

Then \(C[0,1]\) becomes an inner product space. This example is important because best polynomial approximation is a least squares problem in a function space.

12.2.4 Theorem 12.4: Orthogonal projection theorem

Let \(W\) be a finite-dimensional subspace of an inner product space \(V\). For every \(\vec{b}\in V\), there exists a unique vector \(\operatorname{proj}_W\vec{b}\in W\) such that

\[ \|\vec{b}-\operatorname{proj}_W\vec{b}\|\le \|\vec{b}-\vec{w}\| \]

for every \(\vec{w}\in W\). Moreover,

\[ \vec{b}-\operatorname{proj}_W\vec{b}\perp W. \]

Show proof

Let \(\vec{z}=\operatorname{proj}_W\vec{b}\). For any \(\vec{w}\in W\), write

\[ \vec{b}-\vec{w}=(\vec{b}-\vec{z})+(\vec{z}-\vec{w}). \]

The vector \(\vec{b}-\vec{z}\) is orthogonal to \(W\), and \(\vec{z}-\vec{w}\in W\). Hence the two terms are orthogonal. By the Pythagorean theorem,

\[ \|\vec{b}-\vec{w}\|^2 =\|\vec{b}-\vec{z}\|^2+\|\vec{z}-\vec{w}\|^2 \ge \|\vec{b}-\vec{z}\|^2. \]

Therefore \(\vec{z}\) is a closest point in \(W\) to \(\vec{b}\). If two closest points existed, their difference would have zero norm, so the closest point is unique.

12.2.5 Projection using an orthogonal basis

If \(W=\operatorname{span}\{\vec{v}_1,\ldots,\vec{v}_m\}\) and the vectors are mutually orthogonal, then

\[ \operatorname{proj}_W\vec{b} = \sum_{j=1}^m \frac{\langle \vec{b},\vec{v}_j\rangle}{\langle \vec{v}_j,\vec{v}_j\rangle}\vec{v}_j. \]

If the given basis is not orthogonal, first apply Gram–Schmidt or QR factorization.

12.3 12.2 The abstract least squares problem

Suppose \(W\subset V\) is a subspace and \(\vec{b}\in V\). If \(\vec{b}\notin W\), then the equation

\[ \vec{w}=\vec{b},\qquad \vec{w}\in W, \]

has no solution. Least squares replaces it by a closest-vector problem.

12.3.1 Definition 12.5: Least squares approximation in a subspace

A vector \(\vec{z}^{\,*}\in W\) is a least squares approximation to \(\vec{b}\) from \(W\) if

\[ \|\vec{b}-\vec{z}^{\,*}\|\le \|\vec{b}-\vec{w}\| \]

for every \(\vec{w}\in W\).

12.3.2 Theorem 12.6: Solution of the abstract least squares problem

The unique least squares approximation is

\[ \vec{z}^{\,*}=\operatorname{proj}_W\vec{b}. \]

Equivalently, the error vector satisfies

\[ \vec{b}-\vec{z}^{\,*}\perp W. \]

Show proof

This is exactly the orthogonal projection theorem applied to the finite-dimensional subspace \(W\). The closest vector in \(W\) to \(\vec{b}\) is the orthogonal projection of \(\vec{b}\) onto \(W\).

12.4 12.3 Inconsistent systems and the normal equations

Now let \(A\) be an \(n\times m\) matrix and let \(\vec{b}\in\mathbb{R}^n\). The system

\[ A\vec{x}=\vec{b} \]

may be inconsistent. But every vector of the form \(A\vec{x}\) lies in the column space

\[ \operatorname{Col}(A)=\operatorname{im}(A). \]

Thus least squares asks for the vector in \(\operatorname{Col}(A)\) closest to \(\vec{b}\).

12.4.1 Definition 12.7: Least squares solution of a linear system

A vector \(\vec{x}^{\,*}\in\mathbb{R}^m\) is a least squares solution of \(A\vec{x}=\vec{b}\) if

\[ \|A\vec{x}^{\,*}-\vec{b}\|\le \|A\vec{x}-\vec{b}\| \]

for every \(\vec{x}\in\mathbb{R}^m\).

The fitted vector is \(A\vec{x}^{\,*}\), and the residual is

\[ \vec{r}=\vec{b}-A\vec{x}^{\,*}. \]

12.4.2 Theorem 12.8: Projection interpretation

A vector \(\vec{x}^{\,*}\) is a least squares solution if and only if

\[ A\vec{x}^{\,*}=\operatorname{proj}_{\operatorname{Col}(A)}\vec{b}. \]

Equivalently,

\[ \vec{b}-A\vec{x}^{\,*}\perp \operatorname{Col}(A). \]

12.4.3 Theorem 12.9: Normal equations

For the standard Euclidean inner product, the least squares solutions of

\[ A\vec{x}=\vec{b} \]

are exactly the solutions of

\[ A^TA\vec{x}=A^T\vec{b}. \]

Show proof

The residual condition is

\[ \vec{b}-A\vec{x}^{\,*}\perp \operatorname{Col}(A). \]

The columns of \(A\) span \(\operatorname{Col}(A)\). Therefore the residual is orthogonal to every column of \(A\). Writing these dot products at once gives

\[ A^T(\vec{b}-A\vec{x}^{\,*})=\vec{0}. \]

Rearranging gives

\[ A^TA\vec{x}^{\,*}=A^T\vec{b}. \]

Conversely, if this equation holds, then \(A^T(\vec{b}-A\vec{x}^{\,*})=0\), so the residual is orthogonal to every column of \(A\). Thus \(A\vec{x}^{\,*}\) is the orthogonal projection of \(\vec{b}\) onto \(\operatorname{Col}(A)\).

12.5 12.4 A complete example: projection and normal equations

Let

\[ A= \begin{bmatrix} -1&4\\ 1&8\\ -1&4 \end{bmatrix}, \qquad \vec{b}=\begin{bmatrix}14\\-4\\0\end{bmatrix}. \]

The columns are

\[ \vec{a}_1=\begin{bmatrix}-1\\1\\-1\end{bmatrix}, \qquad \vec{a}_2=\begin{bmatrix}4\\8\\4\end{bmatrix}. \]

They are orthogonal since

\[ \vec{a}_1\cdot \vec{a}_2=-4+8-4=0. \]

Therefore

\[ \operatorname{proj}_{\operatorname{Col}(A)}\vec{b} = \frac{\vec{b}\cdot \vec{a}_1}{\vec{a}_1\cdot \vec{a}_1}\vec{a}_1 + \frac{\vec{b}\cdot \vec{a}_2}{\vec{a}_2\cdot \vec{a}_2}\vec{a}_2. \]

Now

\[ \frac{\vec{b}\cdot \vec{a}_1}{\vec{a}_1\cdot \vec{a}_1}=-6, \qquad \frac{\vec{b}\cdot \vec{a}_2}{\vec{a}_2\cdot \vec{a}_2}=\frac14. \]

Thus

\[ \operatorname{proj}_{\operatorname{Col}(A)}\vec{b} =-6\vec{a}_1+\frac14\vec{a}_2 = \begin{bmatrix}7\\-4\\7\end{bmatrix}. \]

Solving \(A\vec{x}^{\,*}=\operatorname{proj}_{\operatorname{Col}(A)}\vec{b}\) gives

\[ \vec{x}^{\,*}=\begin{bmatrix}-6\\1/4\end{bmatrix}. \]

The residual is

\[ \vec{r}=\vec{b}-A\vec{x}^{\,*} = \begin{bmatrix}7\\0\\-7\end{bmatrix}, \]

and the distance from \(\vec{b}\) to \(\operatorname{Col}(A)\) is

\[ \|\vec{r}\|=7\sqrt{2}. \]

12.5.1 Python computation

Code
import numpy as np

A = np.array([[-1, 4], [1, 8], [-1, 4]], dtype=float)
b = np.array([14, -4, 0], dtype=float)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
fitted = A @ x_star
residual = b - fitted

print("x* =", x_star)
print("fitted vector =", fitted)
print("residual =", residual)
print("residual norm =", np.linalg.norm(residual))
print("A.T @ residual =", A.T @ residual)
x* = [-6.    0.25]
fitted vector = [ 7. -4.  7.]
residual = [ 7.  0. -7.]
residual norm = 9.899494936611665
A.T @ residual = [0. 0.]

12.6 12.5 Weighted least squares

Sometimes not all observations should count equally. In statistics, some measurements may be more reliable than others. In numerical modeling, some errors may be more important than others. Weighted least squares changes the geometry by changing the inner product.

12.6.1 Definition 12.10: Weighted inner product

Let \(W\) be a symmetric positive definite matrix. Define

\[ \langle \vec{u},\vec{v}\rangle_W=\vec{u}^{\,T}W\vec{v}. \]

The weighted norm is

\[ \|\vec{u}\|_W=\sqrt{\vec{u}^{\,T}W\vec{u}}. \]

12.6.2 Theorem 12.11: Weighted normal equations

The weighted least squares problem minimizes

\[ \|A\vec{x}-\vec{b}\|_W^2=(A\vec{x}-\vec{b})^TW(A\vec{x}-\vec{b}). \]

Its solutions satisfy

\[ A^TWA\vec{x}=A^TW\vec{b}. \]

Show proof

The weighted residual condition is

\[ \vec{b}-A\vec{x}^{\,*}\perp_W \operatorname{Col}(A). \]

This means

\[ \langle \vec{b}-A\vec{x}^{\,*},A\vec{y}\rangle_W=0 \]

for every \(\vec{y}\). Using the weighted inner product,

\[ (\vec{b}-A\vec{x}^{\,*})^TWA\vec{y}=0 \]

for all \(\vec{y}\). Therefore

\[ A^TW(\vec{b}-A\vec{x}^{\,*})=0. \]

Rearranging gives

\[ A^TWA\vec{x}^{\,*}=A^TW\vec{b}. \]

12.7 12.6 Rank, uniqueness, and the matrix \(A^TA\)

The normal equations always have at least one solution, but the coefficient vector may not be unique if the columns of \(A\) are linearly dependent.

12.7.1 Proposition 12.12

For every real matrix \(A\),

\[ \ker(A)=\ker(A^TA). \]

Show proof

If \(A\vec{x}=0\), then \(A^TA\vec{x}=0\). Conversely, if \(A^TA\vec{x}=0\), then

\[ 0=\vec{x}^{\,T}A^TA\vec{x}=(A\vec{x})^T(A\vec{x})=\|A\vec{x}\|^2. \]

Hence \(A\vec{x}=0\). Therefore the kernels are equal.

12.7.2 Corollary 12.13: Unique least squares solution

If \(A\) has full column rank, then \(A^TA\) is invertible and the least squares solution is unique:

\[ \vec{x}^{\,*}=(A^TA)^{-1}A^T\vec{b}. \]

12.7.3 Example 12.14: Rank-deficient least squares

Let

\[ A= \begin{bmatrix} 1&1&0\\ 1&1&0\\ 1&0&1\\ 1&0&1 \end{bmatrix}, \qquad \vec{b}=\begin{bmatrix}1\\3\\2\\4\end{bmatrix}. \]

Then

\[ A^TA= \begin{bmatrix} 4&2&2\\ 2&2&0\\ 2&0&2 \end{bmatrix}, \qquad A^T\vec{b}=\begin{bmatrix}10\\4\\6\end{bmatrix}. \]

The normal equations have infinitely many solutions:

\[ \vec{x}^{\,*}=\begin{bmatrix}t\\2-t\\3-t\end{bmatrix}, \qquad t\in\mathbb{R}. \]

Although the coefficient vector is not unique, the fitted vector \(A\vec{x}^{\,*}\) is unique.

12.8 12.7 QR factorization and numerical computation

The normal equations are theoretically important, but in numerical computation it is often better to avoid forming \(A^TA\) directly.

12.8.1 Theorem 12.15: Least squares by QR factorization

Suppose \(A\) has full column rank and

\[ A=QR, \]

where \(Q\) has orthonormal columns and \(R\) is invertible upper triangular. Then the least squares solution is obtained by solving

\[ R\vec{x}=Q^T\vec{b}. \]

Show proof

The normal equations are

\[ A^TA\vec{x}=A^T\vec{b}. \]

Substitute \(A=QR\):

\[ R^TQ^TQR\vec{x}=R^TQ^T\vec{b}. \]

Since \(Q^TQ=I\),

\[ R^TR\vec{x}=R^TQ^T\vec{b}. \]

Because \(R\) is invertible, cancel \(R^T\) to obtain

\[ R\vec{x}=Q^T\vec{b}. \]

12.8.2 Python comparison: normal equations and QR

Code
A = np.array([[1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
b = np.array([1.0, 2.0, 2.9, 4.2])

# Normal equations
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# QR method
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

print("normal equations:", x_normal)
print("QR solution:", x_qr)
normal equations: [0.95 1.05]
QR solution: [0.95 1.05]

12.9 12.8 Data fitting

Data fitting turns observations into a linear algebra problem. The unknowns are model parameters. The design matrix records how the model depends on the inputs.

12.9.1 Example 12.16: Cubic interpolation

Find a cubic polynomial

\[ f(t)=c_0+c_1t+c_2t^2+c_3t^3 \]

passing through

\[ (0,5),\quad (1,3),\quad (-1,13),\quad (2,1). \]

The linear system is

\[ \begin{bmatrix} 1&0&0&0\\ 1&1&1&1\\ 1&-1&1&-1\\ 1&2&4&8 \end{bmatrix} \begin{bmatrix}c_0\\c_1\\c_2\\c_3\end{bmatrix} = \begin{bmatrix}5\\3\\13\\1\end{bmatrix}. \]

Solving gives

\[ \begin{bmatrix}c_0\\c_1\\c_2\\c_3\end{bmatrix} = \begin{bmatrix}5\\-4\\3\\-1\end{bmatrix}, \]

so

\[ f(t)=5-4t+3t^2-t^3. \]

12.9.2 Example 12.17: Quadratic least squares fit

Fit

\[ g(t)=c_0+c_1t+c_2t^2 \]

to the same four points. The design matrix is

\[ A= \begin{bmatrix} 1&0&0\\ 1&1&1\\ 1&-1&1\\ 1&2&4 \end{bmatrix}, \qquad \vec{b}=\begin{bmatrix}5\\3\\13\\1\end{bmatrix}. \]

The normal equations give

\[ \vec{c}=\begin{bmatrix}29/5\\-16/5\\3/5\end{bmatrix}. \]

Hence

\[ g(t)=\frac{29}{5}-\frac{16}{5}t+\frac35t^2. \]

12.9.3 Example 12.18: Linear least squares fit

Fit

\[ h(t)=c_0+c_1t \]

to the same four points. Then

\[ A= \begin{bmatrix} 1&0\\ 1&1\\ 1&-1\\ 1&2 \end{bmatrix}, \qquad \vec{b}=\begin{bmatrix}5\\3\\13\\1\end{bmatrix}. \]

Solving the normal equations gives

\[ \vec{c}=\begin{bmatrix}37/6\\-61/18\end{bmatrix}. \]

Thus

\[ h(t)=\frac{37}{6}-\frac{61}{18}t. \]

12.9.4 Python visualization

Code
import matplotlib.pyplot as plt

t = np.array([0, 1, -1, 2], dtype=float)
y = np.array([5, 3, 13, 1], dtype=float)

for degree in [1, 2, 3]:
    A = np.vander(t, N=degree+1, increasing=True)
    coeff = np.linalg.lstsq(A, y, rcond=None)[0]
    grid = np.linspace(-1.3, 2.3, 200)
    G = np.vander(grid, N=degree+1, increasing=True)
    plt.plot(grid, G @ coeff, label=f"degree {degree}")

plt.scatter(t, y)
plt.xlabel("t")
plt.ylabel("y")
plt.legend()
plt.title("Least squares polynomial fits")
plt.show()

12.10 12.9 General regression models

12.10.1 Simple linear regression

Given data

\[ (a_1,b_1),\ldots,(a_n,b_n), \]

fit

\[ h(t)=c_0+c_1t. \]

The design matrix and target vector are

\[ A= \begin{bmatrix} 1&a_1\\ 1&a_2\\ \vdots&\vdots\\ 1&a_n \end{bmatrix}, \qquad \vec{b}=\begin{bmatrix}b_1\\b_2\\\vdots\\b_n\end{bmatrix}. \]

The normal equations are

\[ \begin{bmatrix} n&\sum a_i\\ \sum a_i&\sum a_i^2 \end{bmatrix} \begin{bmatrix}c_0\\c_1\end{bmatrix} = \begin{bmatrix}\sum b_i\\\sum a_i b_i\end{bmatrix}. \]

12.10.2 Multiple linear regression

For inputs \((a_{i1},\ldots,a_{im})\) and outputs \(b_i\), fit

\[ h(t_1,\ldots,t_m)=c_0+c_1t_1+\cdots+c_mt_m. \]

Use

\[ A= \begin{bmatrix} 1&a_{11}&\cdots&a_{1m}\\ 1&a_{21}&\cdots&a_{2m}\\ \vdots&\vdots&\ddots&\vdots\\ 1&a_{n1}&\cdots&a_{nm} \end{bmatrix}. \]

Then solve

\[ A^TA\vec{c}=A^T\vec{b}. \]

12.10.3 Multiple outputs

If the outputs form a matrix \(B\), then the least squares matrix problem is

\[ AX\approx B. \]

The normal equations are

\[ A^TAX=A^TB. \]

If \(A\) has full column rank,

\[ X=(A^TA)^{-1}A^TB. \]

12.11 12.10 Best approximation of functions

Let \(V=C[0,1]\) with inner product

\[ \langle f,g\rangle=\int_0^1 f(x)g(x)\,dx. \]

Let

\[ P_n=\operatorname{span}\{1,x,x^2,\ldots,x^n\}. \]

The best degree-\(n\) polynomial approximation to \(f\) is

\[ p^*=\operatorname{proj}_{P_n}f. \]

If

\[ p^*(x)=c_0+c_1x+\cdots+c_nx^n, \]

then the residual \(f-p^*\) is orthogonal to \(P_n\):

\[ \langle f-p^*,x^k\rangle=0, \qquad k=0,1,\ldots,n. \]

Thus

\[ \sum_{j=0}^n c_j\int_0^1 x^{j+k}\,dx = \int_0^1 f(x)x^k\,dx. \]

Since

\[ \int_0^1 x^{j+k}\,dx=\frac{1}{j+k+1}, \]

the coefficient matrix is a Hilbert matrix.

12.11.1 Example 12.19: Best linear approximation of \(e^x\)

Find the best approximation to \(f(x)=e^x\) in \(P_1=\operatorname{span}\{1,x\}\).

Write

\[ p(x)=c_0+c_1x. \]

The normal equations are

\[ \begin{bmatrix} 1&1/2\\ 1/2&1/3 \end{bmatrix} \begin{bmatrix}c_0\\c_1\end{bmatrix} = \begin{bmatrix}e-1\\1\end{bmatrix}. \]

Solving gives

\[ c_0=4e-10, \qquad c_1=18-6e. \]

Therefore

\[ p(x)=(4e-10)+(18-6e)x. \]

12.12 12.11 AI companion activities

These activities are designed to help you use AI as a study partner, not as a substitute for your own reasoning.

12.12.1 Activity A: Explain the geometry

Ask an AI tool:

Explain why the least squares solution of \(A\vec{x}=\vec{b}\) is a projection onto \(\operatorname{Col}(A)\), using a picture in words.

Then check whether the explanation includes the residual condition

\[ \vec{b}-A\vec{x}^{\,*}\perp \operatorname{Col}(A). \]

12.12.2 Activity B: Debug a normal-equation computation

Give an AI tool a matrix \(A\) and vector \(\vec{b}\) and ask it to compute \(A^TA\) and \(A^T\vec{b}\). Verify the answer using Python.

12.12.3 Activity C: Compare methods

Ask:

Why is QR factorization often more stable than solving least squares by normal equations?

A good answer should mention that forming \(A^TA\) can square the condition number.

12.12.4 Activity D: Create a data-fitting example

Ask AI to generate five noisy data points from a line. Then use Python to compute the least squares fit and compare the recovered line to the original line.

12.13 12.12 Classical challenge questions

12.13.1 Challenge 1: Why the residual is unique

Suppose \(A\vec{x}=\vec{b}\) has many least squares solutions. Prove that the fitted vector \(A\vec{x}^{\,*}\) and residual \(\vec{b}-A\vec{x}^{\,*}\) are unique.

Show solution

Every least squares solution satisfies

\[ A\vec{x}^{\,*}=\operatorname{proj}_{\operatorname{Col}(A)}\vec{b}. \]

The orthogonal projection is unique. Hence the fitted vector is unique. The residual is

\[ \vec{b}-A\vec{x}^{\,*}, \]

so it is also unique.

12.13.2 Challenge 2: Normal equations as an optimization condition

Let

\[ F(\vec{x})=\|A\vec{x}-\vec{b}\|^2. \]

Show that \(\nabla F(\vec{x})=0\) gives the normal equations.

Show solution

Expand:

\[ F(\vec{x})=(A\vec{x}-\vec{b})^T(A\vec{x}-\vec{b}) =\vec{x}^{\,T}A^TA\vec{x}-2\vec{x}^{\,T}A^T\vec{b}+\vec{b}^{\,T}\vec{b}. \]

Taking the gradient gives

\[ \nabla F(\vec{x})=2A^TA\vec{x}-2A^T\vec{b}. \]

Setting this equal to zero gives

\[ A^TA\vec{x}=A^T\vec{b}. \]

12.13.3 Challenge 3: Weighted least squares by optimization

Let \(W\) be symmetric positive definite and define

\[ F_W(\vec{x})=(A\vec{x}-\vec{b})^TW(A\vec{x}-\vec{b}). \]

Show that the critical point satisfies \(A^TWA\vec{x}=A^TW\vec{b}\).

Show solution

Expanding gives

\[ F_W(\vec{x})=\vec{x}^{\,T}A^TWA\vec{x}-2\vec{x}^{\,T}A^TW\vec{b}+\vec{b}^{\,T}W\vec{b}. \]

Because \(W\) is symmetric,

\[ \nabla F_W(\vec{x})=2A^TWA\vec{x}-2A^TW\vec{b}. \]

Setting the gradient equal to zero gives

\[ A^TWA\vec{x}=A^TW\vec{b}. \]

12.14 12.13 Practice problems

12.14.1 Problem 1

Let

\[ A=\begin{bmatrix}1\\1\\1\end{bmatrix}, \qquad \vec{b}=\begin{bmatrix}2\\4\\5\end{bmatrix}. \]

Find the least squares solution and interpret it.

Show solution

The normal equation is

\[ 3x=11. \]

Thus

\[ x=\frac{11}{3}. \]

This is the average of the observations \(2,4,5\). The fitted vector is

\[ Ax=\begin{bmatrix}11/3\\11/3\\11/3\end{bmatrix}. \]

12.14.2 Problem 2

Let

\[ A=\begin{bmatrix}1&0\\1&1\\1&2\end{bmatrix}, \qquad \vec{b}=\begin{bmatrix}1\\2\\4\end{bmatrix}. \]

Find the least squares line \(h(t)=c_0+c_1t\).

Show solution

Compute

\[ A^TA=\begin{bmatrix}3&3\\3&5\end{bmatrix}, \qquad A^T\vec{b}=\begin{bmatrix}7\\10\end{bmatrix}. \]

Solving gives

\[ \vec{c}=\begin{bmatrix}5/6\\3/2\end{bmatrix}. \]

Therefore

\[ h(t)=\frac56+\frac32t. \]

12.14.3 Problem 3

For

\[ A=\begin{bmatrix}1&1\\1&1\\1&1\end{bmatrix}, \qquad \vec{b}=\begin{bmatrix}1\\2\\3\end{bmatrix}, \]

find all least squares solutions.

Show solution

The normal equations reduce to

\[ x_1+x_2=2. \]

Therefore all least squares solutions are

\[ \vec{x}^{\,*}=\begin{bmatrix}t\\2-t\end{bmatrix}, \qquad t\in\mathbb{R}. \]

The fitted vector is always

\[ \begin{bmatrix}2\\2\\2\end{bmatrix}. \]

12.14.4 Problem 4

Set up the normal equations for the best quadratic approximation to \(f(x)=\sin x\) on \([0,1]\).

Show solution

Let

\[ p(x)=c_0+c_1x+c_2x^2. \]

The normal equations are

\[ \int_0^1(\sin x-p(x))x^k\,dx=0, \qquad k=0,1,2. \]

Equivalently,

\[ \sum_{j=0}^2 c_j\int_0^1 x^{j+k}\,dx = \int_0^1 x^k\sin x\,dx, \qquad k=0,1,2. \]

12.14.5 Problem 5

Explain why \(A^TA\) is invertible if the columns of \(A\) are linearly independent.

Show solution

If the columns of \(A\) are linearly independent, then \(\ker(A)=\{0\}\). Since \(\ker(A)=\ker(A^TA)\), we also have \(\ker(A^TA)=\{0\}\). The matrix \(A^TA\) is square, so trivial kernel implies invertibility.