16  Chapter 16: SVD — The Matrix Microscope

Rank-one layers, hidden directions, compression, denoising, and the geometry of data

16.1 Opening story: a blurry image becomes a lesson

Imagine that you open a grayscale photograph on a computer. To your eyes, it is a picture. To the computer, it is a rectangular table of numbers.

A bright pixel may be close to \(1\). A dark pixel may be close to \(0\). A \(600 \times 800\) image is therefore a matrix with \(480{,}000\) entries.

At first this sounds hopeless. How can we understand half a million numbers?

But most real images are not random. They contain structure:

  • large bright and dark regions,
  • edges,
  • repeated textures,
  • smooth backgrounds,
  • important shapes,
  • small details,
  • and noise.

The singular value decomposition, or SVD, is one of the most powerful tools in linear algebra because it lets us look inside such a matrix and separate strong structure from weaker detail.

The central idea is:

SVD decomposes a matrix into ordered rank-one layers of information.

The first layer captures the strongest pattern. The second captures the strongest remaining pattern. The third captures the next strongest remaining pattern. And so on.

This is why we call SVD a matrix microscope. It lets us inspect a matrix layer by layer.

SVD appears in:

  • image compression,
  • noise reduction,
  • principal component analysis,
  • least squares,
  • recommendation systems,
  • natural language processing,
  • search engines,
  • scientific computing,
  • numerical linear algebra,
  • and modern machine learning.

In earlier chapters, we studied vectors, matrices, distances, projections, orthogonality, eigenvectors, stability, and energy landscapes. SVD brings many of those ideas together.

16.2 Learning goals

By the end of this chapter, you should be able to:

  1. Explain the SVD formula \(A = U \Sigma V^T\) in words.
  2. Interpret right singular vectors as important input directions.
  3. Interpret left singular vectors as important output directions.
  4. Interpret singular values as strengths of those directions.
  5. Write a matrix as a sum of rank-one layers.
  6. Explain why the first few SVD layers often capture most of the structure.
  7. Use truncated SVD for approximation and compression.
  8. Connect SVD to eigenvalues of \(A^T A\) and \(A A^T\).
  9. Use SVD to understand least squares, denoising, PCA, and recommendation systems.
  10. Recognize why SVD is numerically stable and broadly useful.

16.3 16.1 What problem does SVD solve?

A matrix may be large, rectangular, noisy, and hard to interpret.

SVD answers the question:

If this matrix is a machine, what are its most important input directions, output directions, and strengths?

Suppose \(A\) is an \(m \times n\) matrix. It maps vectors in \(\mathbb R^n\) to vectors in \(\mathbb R^m\):

\[ A: \mathbb R^n \to \mathbb R^m. \]

An input vector \(x\) goes into the machine, and \(Ax\) comes out.

SVD says that there are special orthonormal input directions \(v_1,\dots,v_r\) and special orthonormal output directions \(u_1,\dots,u_r\) such that

\[ A v_i = \sigma_i u_i. \]

This sentence is the heart of the chapter.

It means:

  • \(v_i\) is an important direction in the input space,
  • \(u_i\) is the corresponding direction in the output space,
  • \(\sigma_i\) measures how strongly \(A\) stretches \(v_i\) into \(u_i\).

The singular values are ordered:

\[ \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0. \]

The largest singular value describes the strongest action of the matrix. The next describes the strongest action not already explained by the first. This ordered structure is what makes SVD so useful.

16.4 16.2 The SVD formula

For any real \(m \times n\) matrix \(A\), the singular value decomposition is

\[ A = U \Sigma V^T. \]

Here:

  • \(U\) is an \(m \times m\) orthogonal matrix,
  • \(V\) is an \(n \times n\) orthogonal matrix,
  • \(\Sigma\) is an \(m \times n\) diagonal-shaped matrix whose nonnegative diagonal entries are the singular values.

The columns of \(U\) are called left singular vectors.

The columns of \(V\) are called right singular vectors.

The diagonal entries of \(\Sigma\) are the singular values.

The formula

\[ A = U \Sigma V^T \]

can be read from right to left:

  1. \(V^T\) rotates or reflects the input into the singular-vector coordinate system.
  2. \(\Sigma\) stretches or shrinks along coordinate axes.
  3. \(U\) rotates or reflects the result into the output space.

So SVD says:

Every matrix is a rotation/reflection, followed by axis-aligned stretching, followed by another rotation/reflection.

This is a remarkable geometric statement.

16.4.1 The economy SVD

Often we only keep the nonzero singular values. If \(\operatorname{rank}(A)=r\), then the economy SVD is

\[ A = U_r \Sigma_r V_r^T, \]

where:

  • \(U_r\) is \(m \times r\),
  • \(\Sigma_r\) is \(r \times r\),
  • \(V_r\) is \(n \times r\).

This form is usually easier to interpret and compute with.

16.5 16.3 A matrix as a sum of simple layers

The SVD is not only a product. It is also a sum:

\[ A = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_r u_r v_r^T. \]

Each term

\[ \sigma_i u_i v_i^T \]

is a rank-one matrix.

A rank-one matrix is the simplest nonzero matrix shape. It is built from one column direction and one row direction.

So SVD says:

A complicated matrix is a weighted sum of simple rank-one patterns.

The singular values tell us how important each pattern is.

16.5.1 Example: one layer

Let

\[ u = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \qquad v = \begin{bmatrix} 4 \\ 5 \end{bmatrix}. \]

Then

\[ u v^T = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \begin{bmatrix} 4 & 5 \end{bmatrix} = \begin{bmatrix} 4 & 5 \\ 8 & 10 \\ 12 & 15 \end{bmatrix}. \]

Every column is a multiple of \(\nu\). Every row is a multiple of \(v^T\). That is why the matrix has only one independent direction.

A rank-one matrix contains one basic pattern. SVD builds a matrix by stacking such patterns in order of importance.

16.6 16.4 The best low-rank approximation

Suppose we keep only the first \(k\) SVD layers:

\[ A_k = \sigma_1 u_1 v_1^T + \cdots + \sigma_k u_k v_k^T. \]

Then \(A_k\) has rank at most \(k\).

The central theorem is one of the most useful results in applied linear algebra.

ImportantEckart–Young theorem

Among all matrices \(B\) with rank at most \(k\), the truncated SVD matrix

\[ A_k = \sum_{i=1}^k \sigma_i u_i v_i^T \]

is the closest matrix to \(A\) in Frobenius norm:

\[ \|A - A_k\|_F = \min_{\operatorname{rank}(B) \le k} \|A - B\|_F. \]

It is also best in spectral norm:

\[ \|A - A_k\|_2 = \sigma_{k+1}. \]

This theorem explains why SVD is so important in data science.

If the first few singular values are large and the rest are small, then the matrix is mostly described by a few rank-one layers. This is the mathematical foundation of compression, denoising, and dimensionality reduction.

16.7 16.5 Energy captured by singular values

The squared Frobenius norm of a matrix is the sum of squares of all entries:

\[ \|A\|_F^2 = \sum_{i,j} a_{ij}^2. \]

SVD gives another expression:

\[ \|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2. \]

So singular values measure how much matrix energy is stored in each layer.

The fraction of energy captured by the first \(k\) layers is

\[ \frac{\sigma_1^2 + \cdots + \sigma_k^2}{\sigma_1^2 + \cdots + \sigma_r^2}. \]

This tells us how much information is preserved by a rank-\(k\) approximation.

16.7.1 Interpretation

If the first \(10\) singular values capture \(95\%\) of the energy, then a rank-\(10\) approximation preserves most of the matrix’s large-scale structure.

If the singular values decay slowly, then the matrix does not compress well by SVD.

16.8 16.6 Image compression

A grayscale image is a matrix \(A\).

The SVD writes it as

\[ A = \sum_{i=1}^r \sigma_i u_i v_i^T. \]

The first layer often captures broad brightness structure. Later layers capture edges and details. Very late layers often capture fine texture or noise.

A rank-\(k\) compressed image is

\[ A_k = \sum_{i=1}^k \sigma_i u_i v_i^T. \]

Instead of storing all \(mn\) pixel values, we store:

  • \(k\) left singular vectors of length \(m\),
  • \(k\) right singular vectors of length \(n\),
  • \(k\) singular values.

That is about

\[ k(m+n+1) \]

numbers.

Compression is useful when

\[ k(m+n+1) \ll mn. \]

16.8.1 Example

For a \(1000 \times 1000\) image, the original image stores \(1{,}000{,}000\) numbers.

A rank-\(50\) SVD approximation stores about

\[ 50(1000+1000+1)=100{,}050 \]

numbers.

That is about \(10\%\) of the original storage.

16.9 16.7 Denoising

Noise often appears as many small, scattered directions. Structure often appears as fewer strong directions.

SVD denoising uses the following idea:

  1. Compute the SVD of a noisy matrix.
  2. Keep the largest singular values.
  3. Remove small singular-value layers.
  4. Reconstruct the matrix.

This works when signal has low-rank structure and noise spreads across many directions.

It does not always work. If important information lives in small singular values, then aggressive truncation can remove meaningful detail.

The key question is not simply “small means noise.” The key question is:

Which singular directions correspond to meaningful structure in the problem?

16.10 16.8 SVD and eigenvectors

SVD is closely connected to eigenvectors.

Start from

\[ A v_i = \sigma_i u_i. \]

Multiply both sides by \(A^T\):

\[ A^T A v_i = \sigma_i A^T u_i. \]

Since \(A^T u_i = \sigma_i v_i\), we get

\[ A^T A v_i = \sigma_i^2 v_i. \]

So the right singular vectors are eigenvectors of \(A^T A\), and the squared singular values are eigenvalues of \(A^T A\).

Similarly,

\[ A A^T u_i = \sigma_i^2 u_i. \]

So the left singular vectors are eigenvectors of \(A A^T\).

This connection explains why SVD is related to Chapter 13 and Chapter 15:

  • \(A^T A\) is symmetric and positive semidefinite,
  • its eigenvectors form orthogonal directions,
  • its eigenvalues describe squared stretching,
  • its quadratic form \(x^T A^T A x = \|Ax\|^2\) measures output energy.

16.11 16.9 SVD and least squares

Suppose we want to solve

\[ Ax \approx b. \]

If \(A\) is rectangular or rank-deficient, the system may not have an exact solution or may have many solutions.

SVD gives a clean way to understand the problem.

If

\[ A = U \Sigma V^T, \]

then solving \(Ax \approx b\) becomes solving

\[ U \Sigma V^T x \approx b. \]

Because \(U\) and \(V\) are orthogonal, they preserve lengths and angles. The difficulty is concentrated in \(\Sigma\).

Small singular values cause numerical instability. Directions with tiny \(\sigma_i\) are directions where the matrix nearly loses information.

The pseudoinverse is

\[ A^+ = V \Sigma^+ U^T, \]

where \(\Sigma^+\) replaces each nonzero singular value \(\sigma_i\) by \(1/\sigma_i\).

Then a least-squares solution is

\[ x = A^+ b. \]

If some singular values are very small, \(1/\sigma_i\) becomes very large. This can amplify noise.

This is why truncated SVD can also be used as a regularization method: ignore unstable small-singular-value directions.

16.12 16.10 SVD and PCA

Principal component analysis, or PCA, finds the main directions of variation in a dataset.

Suppose \(X\) is a centered data matrix:

  • each row is a data point,
  • each column is a feature,
  • each column has mean zero.

The SVD of \(X\) is

\[ X = U \Sigma V^T. \]

The columns of \(V\) are the principal directions in feature space.

The singular values describe how much variation occurs along those directions.

The coordinates of the data in the principal-component system are

\[ X V = U \Sigma. \]

So PCA is not a separate magic method. It is SVD applied to a centered data matrix.

16.13 16.11 SVD and recommendation systems

Suppose \(A\) is a user-item rating matrix:

  • rows represent users,
  • columns represent movies,
  • entries represent ratings.

Such a matrix is usually incomplete and noisy, but it may have low-rank structure.

For example, user taste may depend on hidden factors:

  • action versus drama,
  • humor versus seriousness,
  • mainstream versus independent,
  • old versus new,
  • romance versus science fiction.

SVD tries to discover hidden directions in the rating matrix.

A low-rank approximation says:

User preferences can be approximately explained by a small number of hidden taste dimensions.

This idea is a starting point for matrix factorization in recommendation systems.

Real recommendation systems include many additional issues: missing data, bias terms, regularization, implicit feedback, fairness, cold-start problems, and temporal effects. But SVD gives the linear algebra foundation.

16.14 16.12 Why SVD is more general than eigen-decomposition

Eigen-decomposition is powerful, but not every matrix has a full set of real eigenvectors. Eigenvectors are naturally defined for square matrices.

SVD works for every real matrix:

  • square or rectangular,
  • symmetric or nonsymmetric,
  • full rank or rank deficient,
  • stable or unstable,
  • diagonalizable or not.

This makes SVD one of the most reliable tools in applied linear algebra.

Eigen-decomposition asks:

Which directions are sent to multiples of themselves?

SVD asks:

Which input directions are sent most strongly to which output directions?

That second question makes sense for every matrix.

16.15 16.13 Worked example: a small SVD by interpretation

Consider

\[ A = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}. \]

This matrix stretches the \(x\)-axis by \(3\) and the \(y\)-axis by \(1\).

The singular values are

\[ \sigma_1=3, \qquad \sigma_2=1. \]

The right singular vectors are

\[ v_1 = \begin{bmatrix}1\\0\end{bmatrix}, \qquad v_2 = \begin{bmatrix}0\\1\end{bmatrix}. \]

The left singular vectors are the same.

So

\[ A = 3 \begin{bmatrix}1\\0\end{bmatrix} \begin{bmatrix}1&0\end{bmatrix} + 1 \begin{bmatrix}0\\1\end{bmatrix} \begin{bmatrix}0&1\end{bmatrix}. \]

The first rank-one layer captures horizontal stretching. The second captures vertical stretching.

Check the rank-one layers

\[ 3 \begin{bmatrix}1\\0\end{bmatrix} \begin{bmatrix}1&0\end{bmatrix} = \begin{bmatrix}3&0\\0&0\end{bmatrix}, \]

and

\[ 1 \begin{bmatrix}0\\1\end{bmatrix} \begin{bmatrix}0&1\end{bmatrix} = \begin{bmatrix}0&0\\0&1\end{bmatrix}. \]

Adding them gives

\[ \begin{bmatrix}3&0\\0&1\end{bmatrix}. \]

16.16 16.14 Worked example: a rank-one data matrix

Suppose

\[ A = \begin{bmatrix} 2 & 4 & 6 \\ 1 & 2 & 3 \\ 3 & 6 & 9 \end{bmatrix}. \]

Every row is a multiple of \([1,2,3]\). Every column is a multiple of \([2,1,3]^T\).

So \(A\) has rank \(1\).

We can write

\[ A = \begin{bmatrix}2\\1\\3\end{bmatrix} \begin{bmatrix}1&2&3\end{bmatrix}. \]

This matrix has exactly one nonzero singular value. SVD will discover that the matrix is really one pattern, repeated with different strengths.

16.17 16.15 Python: computing SVD

Code
import numpy as np

A = np.array([[3, 1],
              [0, 2],
              [0, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print("Singular values:", s)
print("U shape:", U.shape)
print("Vt shape:", Vt.shape)

Sigma = np.diag(s)
A_reconstructed = U @ Sigma @ Vt
print(np.round(A_reconstructed, 6))
Singular values: [3.25661654 1.84240298]
U shape: (3, 2)
Vt shape: (2, 2)
[[ 3.  1.]
 [-0.  2.]
 [ 0.  0.]]

The command np.linalg.svd returns singular values as a one-dimensional array, not as a diagonal matrix. We build Sigma with np.diag(s) when needed.

16.18 16.16 Python: rank-\(k\) approximation

Code
import numpy as np

np.random.seed(7)
A = np.random.randn(8, 6)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(U, s, Vt, k):
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

for k in [1, 2, 3, 5]:
    Ak = rank_k_approx(U, s, Vt, k)
    error = np.linalg.norm(A - Ak, ord="fro")
    energy = np.sum(s[:k]**2) / np.sum(s**2)
    print(f"k={k}: Frobenius error={error:.4f}, energy captured={energy:.2%}")
k=1: Frobenius error=6.0989, energy captured=36.96%
k=2: Frobenius error=4.7148, energy captured=62.33%
k=3: Frobenius error=3.1683, energy captured=82.99%
k=5: Frobenius error=1.0322, energy captured=98.19%

16.19 16.17 Common mistakes

16.19.1 Mistake 1: Thinking SVD is only for square matrices

SVD works for every real matrix, including rectangular data tables.

16.19.2 Mistake 2: Confusing singular values with eigenvalues

Singular values are always nonnegative. Eigenvalues can be positive, negative, zero, or complex.

16.19.3 Mistake 3: Forgetting the order

Singular values are usually listed from largest to smallest. The order matters because it tells us which layers are most important.

16.19.4 Mistake 4: Assuming small singular values are always useless

Small singular values can represent noise, but they can also contain meaningful fine detail.

16.19.5 Mistake 5: Using low-rank approximation without checking the problem

SVD compression works best when the matrix has approximate low-rank structure.

16.20 16.18 Practice problems

16.20.1 Problem 1

Explain in words what each part of \(A=U\Sigma V^T\) does geometrically.

Solution

\(V^T\) changes input coordinates into the right-singular-vector basis. \(\Sigma\) stretches or shrinks along coordinate axes. \(U\) rotates or reflects the result into the output coordinate system.

16.20.2 Problem 2

Let

\[ A = \begin{bmatrix}4&0\\0&2\end{bmatrix}. \]

Find its singular values and describe its rank-one layers.

Solution

The singular values are \(4\) and \(2\). The rank-one layers are

\[ 4\begin{bmatrix}1\\0\end{bmatrix}\begin{bmatrix}1&0\end{bmatrix} + 2\begin{bmatrix}0\\1\end{bmatrix}\begin{bmatrix}0&1\end{bmatrix}. \]

16.20.3 Problem 3

Suppose the singular values of a matrix are

\[ 10, 4, 1, 0.5. \]

What fraction of Frobenius energy is captured by the first two layers?

Solution

The total energy is

\[ 10^2+4^2+1^2+0.5^2=100+16+1+0.25=117.25. \]

The first two layers capture

\[ 100+16=116. \]

So the fraction is

\[ \frac{116}{117.25} \approx 0.9893. \]

The first two layers capture about \(98.93\%\) of the energy.

16.20.4 Problem 4

Why can a rank-\(k\) SVD approximation be useful for denoising?

Solution

If the meaningful signal is concentrated in the largest singular values while noise is spread across many smaller singular directions, keeping only the largest \(k\) layers can preserve structure and remove some noise. This is a useful model when the true signal is approximately low rank.

16.20.5 Problem 5

Explain why small singular values can make least-squares problems unstable.

Solution

The pseudoinverse uses reciprocals \(1/\sigma_i\). If \(\sigma_i\) is very small, then \(1/\sigma_i\) is very large. Small noise components in those directions can be amplified into large changes in the solution.

16.21 16.19 Challenge questions

  1. Why is SVD useful even when a matrix does not have real eigenvectors?
  2. When does image compression by SVD work well? When does it work poorly?
  3. How is truncated SVD similar to projection onto a lower-dimensional subspace?
  4. What is the difference between the rank of a matrix and the number of large singular values?
  5. How might SVD reveal hidden communities or hidden preferences in a data matrix?

16.22 16.20 AI companion activities

Use an AI assistant as a study partner, not as a replacement for your thinking.

16.22.1 Activity 1: Explain the microscope metaphor

Ask:

Explain SVD as a matrix microscope for a beginning linear algebra student. Use an image example and a data table example.

Then improve the answer by adding the words rank-one layer, singular value, and energy captured.

16.22.2 Activity 2: Diagnose a matrix

Give the AI a small matrix and ask:

Compute or estimate its rank, explain whether it looks low-rank, and describe what truncated SVD would keep.

Check the result with Python.

16.22.3 Activity 3: Compare methods

Ask:

Compare eigen-decomposition, QR decomposition, and SVD in terms of what each one reveals about a matrix.

Then make your own table with three columns: method, main idea, best use.

16.22.4 Activity 4: Build a mini project

Ask:

Design a small SVD project using images, text documents, or movie ratings. Include the data matrix, the meaning of rows and columns, and what a low-rank approximation would mean.

16.23 Chapter summary

SVD is one of the central tools of applied linear algebra.

It says that every matrix can be decomposed as

\[ A = U\Sigma V^T. \]

This means that a matrix is built from orthogonal input directions, nonnegative strengths, and orthogonal output directions.

It also says that a matrix can be written as a sum of rank-one layers:

\[ A = \sum_{i=1}^r \sigma_i u_i v_i^T. \]

The largest singular values describe the strongest layers. Keeping only the first few layers gives the best low-rank approximation.

This single idea explains image compression, denoising, PCA, least squares stability, and recommendation systems.

The SVD is a microscope because it lets us see which hidden directions carry the most structure.

16.24 Looking ahead

In the next chapter, we will use SVD directly for image compression. The abstract formula \(A=U\Sigma V^T\) will become visible: a picture will be rebuilt layer by layer.