17  Chapter 17: Image Compression

How a picture becomes a few important matrix layers

17.1 Opening story: a picture made of too many numbers

A photograph looks like one object.

To a computer, it is not one object. It is a table of numbers.

A grayscale image with \(1000\) rows and \(1000\) columns is a matrix with

\[ 1000 \times 1000 = 1{,}000{,}000 \]

entries. A color image has three such matrices, one for red, one for green, and one for blue.

That sounds enormous. But pictures are not usually random. A sky changes slowly from one pixel to the next. A face has broad regions of light and shadow. A handwritten digit has a background, a stroke, and a boundary. A medical image may contain smooth tissue regions and a few important edges.

So an image may contain many numbers but fewer meaningful patterns.

This chapter asks a practical question:

Can we keep the visible structure of an image while storing many fewer numbers?

The answer is yes, and the tool is the singular value decomposition.

In Chapter 16, the SVD showed us that every matrix can be decomposed into ordered rank-one layers:

\[ A = \sigma_1 u_1v_1^T + \sigma_2u_2v_2^T + \cdots + \sigma_r u_rv_r^T. \]

For images, each rank-one matrix \(u_iv_i^T\) is a simple visual layer, and the singular value \(\sigma_i\) tells us how important that layer is.

The central message of this chapter is:

Image compression is controlled information loss. We intentionally discard small layers while keeping the strongest visible structure.

This is not only a trick for pictures. It is one of the main ideas behind modern data compression, denoising, dimensionality reduction, recommendation systems, and machine learning.

17.2 Learning goals

By the end of this chapter, you should be able to:

  1. Interpret a grayscale image as a matrix and a color image as three matrices.
  2. Explain why many images have approximate low-rank structure.
  3. Use the SVD to build the rank-\(k\) approximation \(A_k\).
  4. Interpret singular values as ordered importance levels.
  5. Compare image quality, reconstruction error, and storage cost as \(k\) changes.
  6. Explain why the truncated SVD gives the best rank-\(k\) approximation.
  7. Use Python to compress, denoise, and visualize images.
  8. Connect image compression to PCA, denoising, latent structure, and AI.

17.3 17.1 Images as matrices

A grayscale image is a matrix.

Each matrix entry stores the brightness of one pixel. For example,

\[ A = \begin{bmatrix} 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0\\ 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix} \]

represents a tiny bright square on a dark background.

In many image files, pixel brightness may be stored between \(0\) and \(255\). In mathematical and Python examples, it is often more convenient to scale brightness into the interval \([0,1]\).

NoteMatrix view of a grayscale image

If \(A \in \mathbb{R}^{m\times n}\) is a grayscale image, then:

  • \(m\) is the number of rows of pixels,
  • \(n\) is the number of columns of pixels,
  • \(A_{ij}\) is the brightness at row \(i\) and column \(j\).

A color image is usually stored as three matrices:

\[ A^{(R)}, \qquad A^{(G)}, \qquad A^{(B)}. \]

The first stores red intensity, the second stores green intensity, and the third stores blue intensity. In this chapter, we focus mostly on grayscale images because the essential linear algebra is already visible there.

17.4 17.2 Why images can be compressed

A completely random matrix is hard to compress. Every pixel is unrelated to every other pixel. If we discard information, the image quickly becomes meaningless.

Natural images are different. They contain structure:

  • neighboring pixels are often similar;
  • large regions may be smooth;
  • edges and boundaries often form coherent curves;
  • many visual patterns repeat;
  • noise is often weaker than the main signal.

This structure means the image matrix may be well approximated by a low-rank matrix.

ImportantApproximate low rank

An image matrix \(A\) may have high rank exactly, but still be approximately low rank if most of its visible structure can be captured by a small number of SVD layers.

This distinction is essential.

Exact rank asks: How many layers are needed to reproduce every pixel perfectly?

Approximate rank asks: How many layers are needed to reproduce the important visual structure?

For compression, approximate rank is often more important than exact rank.

17.5 17.3 Rank-one image layers

A rank-one matrix has the form

\[ uv^T. \]

Here \(u\in \mathbb{R}^m\) is a column pattern and \(v\in \mathbb{R}^n\) is a row pattern. The entry in row \(i\) and column \(j\) is

\[ (uv^T)_{ij} = u_i v_j. \]

This means a rank-one image is built by multiplying a vertical pattern by a horizontal pattern.

That may sound too simple to make a real image. But the SVD says that a complicated image can be built as a sum of simple rank-one image layers:

\[ A = \sigma_1u_1v_1^T + \sigma_2u_2v_2^T + \cdots + \sigma_ru_rv_r^T. \]

Each layer is simple. Together, they can create complex structure.

NoteVisual interpretation

A rank-one layer is like a transparent sheet placed over the image. The first sheet carries the strongest broad pattern. Later sheets add smaller corrections, details, edges, texture, and sometimes noise.

17.6 17.4 The rank-\(k\) approximation

The rank-\(k\) truncated SVD of \(A\) is

\[ A_k = \sigma_1u_1v_1^T + \sigma_2u_2v_2^T + \cdots + \sigma_ku_kv_k^T. \]

Instead of keeping all \(r\) layers, we keep only the first \(k\) layers.

If \(k=1\), the approximation is very simple.

If \(k\) is larger, the approximation becomes more detailed.

If \(k=r\), then \(A_k=A\) exactly.

The compression idea is:

Choose \(k\) much smaller than \(r\), but large enough that \(A_k\) still looks like the original image.

17.6.1 Worked example: a small matrix

Suppose an image matrix has singular values

\[ 10, \quad 4, \quad 1, \quad 0.2, \quad 0.05. \]

The first singular value is much larger than the later ones. The first few layers likely capture most of the image energy.

The total squared energy is

\[ 10^2 + 4^2 + 1^2 + 0.2^2 + 0.05^2. \]

If we keep only the first two layers, the captured energy ratio is

\[ \frac{10^2+4^2}{10^2+4^2+1^2+0.2^2+0.05^2}. \]

This tells us how much of the squared Frobenius norm is retained.

TipEnergy ratio

The energy captured by the first \(k\) singular values is

\[ \frac{\sigma_1^2+\cdots+\sigma_k^2}{\sigma_1^2+\cdots+\sigma_r^2}. \]

This is a useful numerical guide, but visual quality also depends on where the missing information appears.

17.7 17.5 Storage cost

Suppose \(A\in\mathbb{R}^{m\times n}\).

The full image stores

\[ mn \]

numbers.

The rank-\(k\) SVD approximation stores:

  • \(k\) left singular vectors of length \(m\): \(mk\) numbers;
  • \(k\) right singular vectors of length \(n\): \(nk\) numbers;
  • \(k\) singular values: \(k\) numbers.

So the compressed representation stores approximately

\[ k(m+n+1) \]

numbers.

Compression is worthwhile when

\[ k(m+n+1) < mn. \]

For a \(1000\times 1000\) grayscale image, the full image stores \(1{,}000{,}000\) numbers. A rank-\(50\) approximation stores

\[ 50(1000+1000+1)=100{,}050 \]

numbers, about one tenth as many.

WarningCompression is not free

Smaller \(k\) means fewer numbers, but also more information loss. The goal is not to make \(k\) as small as possible. The goal is to choose \(k\) so that the lost information is acceptable.

17.8 17.6 The best low-rank approximation theorem

The truncated SVD is not just a reasonable method. It is the best possible method for low-rank approximation under common error measures.

ImportantEckart–Young theorem

Let

\[ A = U\Sigma V^T \]

be the SVD of \(A\), with singular values

\[ \sigma_1\geq \sigma_2\geq \cdots \geq \sigma_r >0. \]

Among all matrices \(B\) with rank at most \(k\), the truncated SVD approximation

\[ A_k = \sum_{i=1}^k \sigma_i u_i v_i^T \]

minimizes both

\[ \|A-B\|_F \]

and

\[ \|A-B\|_2. \]

The theorem says:

If you are allowed to keep only \(k\) matrix layers, the SVD tells you the best \(k\) layers to keep.

The Frobenius error has a simple formula:

\[ \|A-A_k\|_F^2 = \sigma_{k+1}^2+\sigma_{k+2}^2+\cdots+\sigma_r^2. \]

So the discarded singular values measure the lost energy.

Proof (Proof idea). The full proof is deeper than what we need here, but the main intuition is clear.

The SVD writes the matrix in orthogonal rank-one directions. Because these directions are orthogonal under the Frobenius inner product, the squared error behaves like a Pythagorean sum. Keeping the largest \(k\) singular values removes the least possible squared energy.

This is the same principle behind projection: to approximate a vector using a subspace, keep the components in the most important orthogonal directions and discard the rest.

17.9 17.7 Compression as projection

There is a close connection between image compression and projection.

When we compute \(A_k\), we are projecting the matrix \(A\) onto the set of matrices built from the first \(k\) singular directions:

\[ A_k = U_kU_k^T A V_kV_k^T. \]

This formula says:

  1. keep only the important left-side image patterns;
  2. keep only the important right-side image patterns;
  3. remove everything outside those patterns.

The SVD does not merely reduce numbers. It chooses a coordinate system adapted to the image.

17.10 17.8 Denoising by discarding small layers

Suppose an observed image is

\[ A_{\text{noisy}} = A_{\text{true}} + N, \]

where \(N\) is noise.

If the true image has strong organized structure and the noise is spread across many weak directions, then the largest singular values may mostly represent the true image, while smaller singular values may represent noise.

A truncated SVD can therefore denoise the image:

\[ A_{\text{denoised}} = (A_{\text{noisy}})_k. \]

WarningDenoising requires judgment

Small singular values often contain noise, but they can also contain fine details. If \(k\) is too small, denoising becomes oversmoothing.

17.11 17.9 Color images

For color images, we can apply SVD separately to the red, green, and blue channels:

\[ A^{(R)} \approx A^{(R)}_k, \qquad A^{(G)} \approx A^{(G)}_k, \qquad A^{(B)} \approx A^{(B)}_k. \]

Then we combine the three approximated channels back into a color image.

This method is simple and instructive. More advanced image-compression methods, such as JPEG, use different bases and additional encoding ideas, but the central principle is similar:

transform the image into meaningful coordinates, keep the important coefficients, and discard or compress the smaller ones.

17.12 17.10 Python: synthetic image compression

Code
import numpy as np
import matplotlib.pyplot as plt

# Build a synthetic grayscale image
n = 120
x = np.linspace(-1, 1, n)
y = np.linspace(-1, 1, n)
X, Y = np.meshgrid(x, y)

circle = (X**2 + Y**2 < 0.45**2).astype(float)
gradient = 0.35 * (X + 1) / 2
stripe = 0.25 * np.sin(8*np.pi*X) * (np.abs(Y) < 0.25)
A = np.clip(circle + gradient + stripe, 0, 1)

plt.figure(figsize=(5,5))
plt.imshow(A, cmap="gray", vmin=0, vmax=1)
plt.axis("off")
plt.title("Synthetic image as a matrix")
plt.show()

Compute the SVD:

Code
U, s, Vt = np.linalg.svd(A, full_matrices=False)

plt.figure(figsize=(6,4))
plt.plot(s, marker="o")
plt.title("Singular values")
plt.xlabel("Index")
plt.ylabel("Singular value")
plt.grid(True)
plt.show()

Build rank-\(k\) approximations:

Code
def rank_k_approx(U, s, Vt, k):
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

ks = [1, 3, 5, 10, 20, 40]
plt.figure(figsize=(12, 8))
for i, k in enumerate(ks, 1):
    Ak = rank_k_approx(U, s, Vt, k)
    plt.subplot(2, 3, i)
    plt.imshow(Ak, cmap="gray", vmin=0, vmax=1)
    plt.axis("off")
    plt.title(f"rank {k}")
plt.tight_layout()
plt.show()

Compute storage and energy:

Code
m, n = A.shape
full_storage = m*n

for k in [1, 3, 5, 10, 20, 40]:
    compressed_storage = k*(m+n+1)
    energy = np.sum(s[:k]**2) / np.sum(s**2)
    print(f"k={k:2d}: storage ratio={compressed_storage/full_storage:.3f}, energy={energy:.3f}")
k= 1: storage ratio=0.017, energy=0.889
k= 3: storage ratio=0.050, energy=0.971
k= 5: storage ratio=0.084, energy=0.985
k=10: storage ratio=0.167, energy=0.996
k=20: storage ratio=0.335, energy=1.000
k=40: storage ratio=0.669, energy=1.000

17.13 17.11 Worked examples

17.13.1 Example 1: storage threshold

A grayscale image has size \(800\times 600\). For what values of \(k\) does rank-\(k\) SVD storage use fewer numbers than the original image?

The full image stores

\[ 800\cdot 600 = 480{,}000 \]

numbers.

The rank-\(k\) approximation stores

\[ k(800+600+1)=1401k \]

numbers.

We need

\[ 1401k < 480{,}000. \]

Thus

\[ k < \frac{480{,}000}{1401}\approx 342.6. \]

So \(k\leq 342\) gives storage savings.

17.13.2 Example 2: Frobenius error

Suppose the singular values of an image matrix are

\[ 9, \quad 6, \quad 3, \quad 1, \quad 1. \]

The squared error of the rank-\(2\) approximation is

\[ 3^2+1^2+1^2 = 11. \]

So

\[ \|A-A_2\|_F = \sqrt{11}. \]

17.13.3 Example 3: energy captured

The energy captured by the rank-\(2\) approximation is

\[ \frac{9^2+6^2}{9^2+6^2+3^2+1^2+1^2} = \frac{117}{128}\approx 0.914. \]

So the first two layers capture about \(91.4\%\) of the squared energy.

17.14 17.12 Common misunderstandings

17.14.1 Misunderstanding 1: low rank means simple-looking

A rank-one image is simple. But a low-rank image with \(k=30\) or \(k=50\) can look visually rich. Low-rank does not mean boring. It means the image can be built from a relatively small number of organized layers.

17.14.2 Misunderstanding 2: high energy always means high visual quality

Energy is useful, but human perception is not exactly Frobenius norm. A small error near an important edge may be more visible than a larger error in a smooth background.

17.14.3 Misunderstanding 3: compression and denoising are the same

They are related but not identical. Compression aims to store fewer numbers. Denoising aims to remove unwanted variation. SVD truncation can do both, but the best \(k\) for compression may differ from the best \(k\) for denoising.

17.15 17.13 Practice problems

17.15.1 Problem 1

A \(500\times 500\) grayscale image is approximated using rank \(k=20\). How many numbers are stored in the SVD representation? What fraction of the original storage is this?

Solution. The full image stores

\[ 500\cdot 500 = 250{,}000 \]

numbers.

The rank-\(20\) SVD approximation stores

\[ 20(500+500+1)=20{,}020 \]

numbers.

The storage fraction is

\[ \frac{20{,}020}{250{,}000}\approx 0.0801. \]

So it uses about \(8.0\%\) of the original storage.

17.15.2 Problem 2

Suppose the singular values of an image matrix are

\[ 12, 5, 2, 1, 0.5. \]

Find the Frobenius error of the rank-\(2\) approximation.

Solution. The discarded singular values are \(2,1,0.5\). Therefore

\[ \|A-A_2\|_F^2 = 2^2+1^2+0.5^2 = 5.25. \]

So

\[ \|A-A_2\|_F = \sqrt{5.25}\approx 2.291. \]

17.15.3 Problem 3

Explain why keeping the first \(k\) singular values is better than keeping \(k\) randomly chosen singular values.

Solution. The singular values are ordered from largest to smallest. The largest singular values carry the most Frobenius energy. By the Eckart–Young theorem, keeping the first \(k\) singular values gives the best rank-\(k\) approximation. Randomly chosen singular values may discard large important layers and keep smaller less important layers.

17.15.4 Problem 4

Why might a rank-\(20\) approximation look good for one image but poor for another image of the same size?

Solution. Different images have different singular value patterns. If the singular values decay quickly, a small \(k\) captures most structure. If they decay slowly, important information is spread across many layers, so a larger \(k\) is needed. Smooth images often compress better than highly textured or random images.

17.15.5 Problem 5

Let \(A\in\mathbb{R}^{m\times n}\). Show that the rank-\(k\) SVD storage is smaller than full storage exactly when

\[ k < \frac{mn}{m+n+1}. \]

Solution. Full storage is \(mn\). Rank-\(k\) SVD storage is \(k(m+n+1)\). Compression requires

\[ k(m+n+1)<mn. \]

Dividing by \(m+n+1>0\) gives

\[ k<\frac{mn}{m+n+1}. \]

17.16 17.14 Challenge problems

17.16.1 Challenge 1: visual quality versus error

Find or create two different images with the same Frobenius reconstruction error after rank-\(k\) approximation. Do they look equally good? Explain why or why not.

17.16.2 Challenge 2: denoising threshold

Add Gaussian noise to a simple image. Try several values of \(k\). Find a value of \(k\) that removes noise while preserving the main structure. Explain how you chose it.

17.16.3 Challenge 3: color compression

Apply SVD compression separately to the red, green, and blue channels of a color image. Compare the result with grayscale compression.

17.16.4 Challenge 4: adversarial image

Create an image that is difficult to compress with low-rank SVD. What visual features make it difficult?

17.17 17.15 AI companion activities

Use an AI assistant as a mathematical conversation partner.

  1. Ask it to explain the difference between exact rank and approximate rank using an image example.
  2. Ask it to create three examples of images whose singular values decay quickly, moderately, and slowly.
  3. Ask it to compare SVD image compression with JPEG compression at a conceptual level.
  4. Ask it to generate Python code that compresses an RGB image channel-by-channel.
  5. Ask it to explain why the Frobenius error is determined by discarded singular values.
  6. Ask it to critique your choice of \(k\) for an image-compression experiment.

17.18 Chapter summary

A digital image is a matrix of pixel values. The SVD decomposes that matrix into ordered rank-one layers:

\[ A = \sum_{i=1}^r \sigma_i u_i v_i^T. \]

The rank-\(k\) approximation keeps only the first \(k\) layers:

\[ A_k = \sum_{i=1}^k \sigma_i u_i v_i^T. \]

This creates compression because \(A_k\) can be stored using about

\[ k(m+n+1) \]

numbers instead of \(mn\) numbers.

The singular values measure importance. The discarded singular values measure error:

\[ \|A-A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2. \]

The deeper lesson is that compression is not merely about making files smaller. It is about discovering structure, choosing what matters, and understanding what is lost.

In the next chapter, we will see the same idea from the perspective of data clouds: PCA uses SVD to find the most important directions of variation in data.