18 Chapter 18: Computational Complexity and Computation of Eigenvalues

When the formula is correct, but the computation is too large

Author

He Wang

19 The story: linear algebra at scale

In the first part of this course, many computations were small enough to do by hand. We solved systems, found bases, diagonalized matrices, computed least-squares solutions, and studied decompositions such as QR, Schur, spectral decomposition, SVD, Haar transforms, and FFT.

But applied linear algebra becomes truly powerful when the matrices are large. A matrix might represent a social network, a graph Laplacian, a covariance matrix, a discretized differential operator, an image, a signal, or a machine-learning data set.

At that scale, the question is no longer only:

The old question

What is the correct formula?

It becomes:

The computational question

Can we compute the answer using a reasonable amount of time and memory?

For example, the expressions

\[ (AB)\vec x \qquad\text{and}\qquad A(B\vec x) \]

are mathematically equal, but they may have very different computational costs. The lesson of this chapter is that linear algebra is not only about identities; it is also about algorithms.

This chapter has two main goals:

understand computational complexity for basic matrix operations;
study numerical eigenvalue computation, especially the power method and Rayleigh quotient.

20 Space complexity, time complexity, and Big-O notation

20.1 Definition 18.1: Space complexity

The space complexity of an algorithm is the amount of memory needed to store the input, intermediate quantities, and output.

20.2 Definition 18.2: Time complexity

The time complexity of an algorithm is the number of elementary operations needed to complete the computation. In numerical linear algebra, the elementary operations are usually arithmetic operations such as addition, subtraction, multiplication, and division.

20.3 Definition 18.3: Big-O notation

Let $f(n)$ and $g(n)$ be positive functions. We write

\[ f(n)=O(g(n)) \]

if there exist constants $C>0$ and $n_0$ such that

\[ f(n)\le Cg(n)\qquad\text{for all }n\ge n_0. \]

Big-O notation records the dominant asymptotic growth rate and ignores fixed constants and lower-order terms.

20.4 Memory: dense versus sparse matrices

A dense matrix $A\in\mathbb R^{m\times n}$ stores $mn$ numbers. Its memory cost is therefore

\[ O(mn). \]

If a matrix is sparse, most entries are zero. A sparse storage format stores only the nonzero entries and their positions. The memory cost is closer to

\[ O(\operatorname{nnz}(A)), \]

where $\operatorname{nnz}(A)$ denotes the number of nonzero entries.

20.5 Example 18.4: A tridiagonal matrix

A tridiagonal $n\times n$ matrix has nonzero entries only on the main diagonal, the diagonal above it, and the diagonal below it. It has approximately $3n$ nonzero entries, not $n^2$ entries. For large $n$, storing it as a sparse matrix saves enormous memory.

Code

import numpy as np
from scipy import sparse

n = 10_000
main = 2*np.ones(n)
off = -1*np.ones(n-1)
A_sparse = sparse.diags([off, main, off], [-1, 0, 1], format="csr")

print("shape:", A_sparse.shape)
print("number of nonzeros:", A_sparse.nnz)
print("density:", A_sparse.nnz/(n*n))

shape: (10000, 10000)
number of nonzeros: 29998
density: 0.00029998

21 Basic time complexity in linear algebra

Let $\vec x,\vec y\in\mathbb R^n$, $A\in\mathbb R^{m\times n}$, $B\in\mathbb R^{n\times p}$, and $M\in\mathbb R^{n\times n}$.

Operation	Typical dense cost
Sum entries of $\vec x$	$O(n)$
Dot product $\vec x\cdot\vec y$	$O(n)$
Matrix-vector product $A\vec x$	$O(mn)$
Matrix-matrix product $AB$	$O(mnp)$
Gaussian elimination for $M\vec x=\vec b$	$O(n^3)$
Dense eigenvalue computation	about $O(n^3)$

Interpretation

The difference between $O(n)$, $O(n^2)$, and $O(n^3)$ is small for homework-size matrices, but enormous for real data. If $n=10^6$, even storing a dense $n\times n$ matrix is usually impossible.

Code

import pandas as pd

sizes = [10, 100, 1_000, 10_000, 1_000_000]
data = []
for n in sizes:
    data.append({
        "n": n,
        "n": n,
        "n log2 n": n*np.log2(n),
        "n^2": n**2,
        "n^3": n**3,
    })
pd.DataFrame(data)

	n	n log2 n	n^2	n^3
0	10	3.321928e+01	100	1000
1	100	6.643856e+02	10000	1000000
2	1000	9.965784e+03	1000000	1000000000
3	10000	1.328771e+05	100000000	1000000000000
4	1000000	1.993157e+07	1000000000000	1000000000000000000

22 Associativity can save work

A recurring rule in efficient matrix computation is:

Do not form a large matrix if you only need its action on a vector.

22.1 Proposition 18.5: Associativity can save work

Let $A\in\mathbb R^{m\times n}$, $B\in\mathbb R^{n\times p}$, and $\vec x\in\mathbb R^p$. Mathematically,

\[ (AB)\vec x=A(B\vec x). \]

However, the computational costs are different:

\[ (AB)\vec x:\quad O(mnp)+O(mp), \]

while

\[ A(B\vec x):\quad O(np)+O(mn). \]

Proof

To compute $(AB)\vec x$, first form the full matrix $AB$. Since $AB$ has $mp$ entries and each entry is a dot product of length $n$, forming $AB$ costs $O(mnp)$. Then multiplying the resulting $m\times p$ matrix by $\vec x$ costs $O(mp)$.

For $A(B\vec x)$, first compute $\vec y=B\vec x$, costing $O(np)$. Then compute $A\vec y$, costing $O(mn)$.

The two results are equal because matrix multiplication is associative, but the second method avoids forming the large intermediate matrix $AB$.

22.2 Example 18.6: One vector versus a full product

If $m=n=p=10^4$, then forming $AB$ costs about $10^{12}$ scalar operations. Computing $B\vec x$ and then $A(B\vec x)$ costs about $2\cdot 10^8$ scalar operations. The two expressions are mathematically equal, but the computational work differs by thousands of times.

Code

def cost_form_then_multiply(m,n,p):
    return m*n*p + m*p

def cost_multiply_vector_first(m,n,p):
    return n*p + m*n

m=n=p=10_000
print("Form AB first:", cost_form_then_multiply(m,n,p))
print("Vector first:", cost_multiply_vector_first(m,n,p))
print("ratio:", cost_form_then_multiply(m,n,p)/cost_multiply_vector_first(m,n,p))

Form AB first: 1000100000000
Vector first: 200000000
ratio: 5000.5

23 Diagonal and sparse matrices

23.1 Definition 18.7: Diagonal matrix

Given $\vec a=(a_1,\ldots,a_n)^T$, the diagonal matrix with diagonal $\vec a$ is

\[ D=\operatorname{diag}(\vec a)= \begin{bmatrix} a_1& &0\\ &\ddots&\\ 0&&a_n \end{bmatrix}. \]

23.2 Proposition 18.8: Diagonal matrices behave like vectors

Let $D=\operatorname{diag}(\vec a)\in\mathbb R^{n\times n}$ and $B\in\mathbb R^{n\times p}$. Then

\[ DB= \begin{bmatrix} a_1B_{1,:}\\ a_2B_{2,:}\\ \vdots\\ a_nB_{n,:} \end{bmatrix}. \]

Thus multiplying by a diagonal matrix rescales rows. It costs $O(np)$, not $O(n^2p)$.

Proof

The $(i,j)$ entry of $DB$ is

\[ (DB)_{ij}=\sum_{k=1}^nD_{ik}B_{kj}. \]

Since $D$ is diagonal, $D_{ik}=0$ unless $k=i$. Therefore

\[ (DB)_{ij}=D_{ii}B_{ij}=a_iB_{ij}. \]

So each row of $B$ is scaled independently.

Code

B = np.array([[1,2,3],
              [4,5,6],
              [7,8,9]], dtype=float)
a = np.array([10, 100, 1000], dtype=float)
D = np.diag(a)

print(D @ B)
print(a[:,None] * B)   # same result without forming D

[[  10.   20.   30.]
 [ 400.  500.  600.]
 [7000. 8000. 9000.]]
[[  10.   20.   30.]
 [ 400.  500.  600.]
 [7000. 8000. 9000.]]

24 Solving linear systems: direct and iterative viewpoints

Solving a linear system

\[ A\vec x=\vec b \]

is one of the central problems in numerical linear algebra. Gaussian elimination is a direct method. It gives an answer after a finite sequence of row operations. But for large sparse systems, iterative methods can be preferable.

24.1 Example 18.9: Stable configuration

A physical equilibrium problem may lead to

\[ \begin{cases} 3x-y-z=1,\\ -x+2y-z=0,\\ -2x-y+4z=-1. \end{cases} \]

There are several computational viewpoints:

solve exactly by Gaussian elimination;
simulate the physical system until it stabilizes;
use an iterative method, possibly with preconditioning.

Code

A = np.array([[3,-1,-1],[-1,2,-1],[-2,-1,4]], dtype=float)
b = np.array([1,0,-1], dtype=float)
np.linalg.solve(A,b)

array([ 4.00000000e-01,  2.00000000e-01, -4.62592927e-17])

Preconditioning

A preconditioner changes a difficult system into an equivalent or approximately equivalent system that is easier for an iterative method to solve. In large-scale problems, the difference between cubic behavior and nearly linear behavior can be the difference between impossible and practical.

25 Eigenvalue computation: why the characteristic polynomial is not enough

Given $A\in\mathbb R^{n\times n}$, an eigenvalue $\lambda$ and eigenvector $\vec v\ne\vec 0$ satisfy

\[ A\vec v=\lambda\vec v. \]

The algebraic definition comes from

\[ \det(A-\lambda I)=0. \]

This is a polynomial equation of degree $n$.

Numerical warning

For large $n$, expanding and solving the characteristic polynomial is usually not a good numerical method. Polynomial coefficients can be highly sensitive to roundoff error, and high-degree polynomial root finding can be unstable. Modern eigenvalue algorithms use matrix iterations instead.

25.1 Definition 18.10: Dominant eigenvalue

Let $\lambda_1,\ldots,\lambda_n$ be the eigenvalues of $A$. A dominant eigenvalue is an eigenvalue whose absolute value is strictly largest:

\[ |\lambda_1|>|\lambda_i|\qquad i=2,\ldots,n. \]

An eigenvector associated with $\lambda_1$ is called a dominant eigenvector.

26 The power method

The power method is one of the simplest numerical algorithms for approximating a dominant eigenvector.

26.1 Definition 18.11: Power iteration

Choose an initial vector $\vec x_0\ne\vec 0$. Define

\[ \vec x_{k+1}=A\vec x_k. \]

Equivalently,

\[ \vec x_k=A^k\vec x_0. \]

In practical computation, one usually normalizes each step:

\[ \vec y_{k+1}=A\vec x_k, \qquad \vec x_{k+1}=\frac{\vec y_{k+1}}{\|\vec y_{k+1}\|}. \]

26.2 Theorem 18.12: Power method in the diagonalizable case

Suppose $A\in\mathbb C^{n\times n}$ is diagonalizable with eigenpairs

\[ A\vec b_i=\lambda_i\vec b_i \]

and

\[ |\lambda_1|>|\lambda_2|\ge\cdots\ge|\lambda_n|. \]

\[ \vec x_0=c_1\vec b_1+c_2\vec b_2+\cdots+c_n\vec b_n \]

with $c_1\ne0$, then the directions of $A^k\vec x_0$ converge to the dominant eigendirection $\operatorname{span}\{\vec b_1\}$.

Proof

Compute

\[ A^k\vec x_0 =c_1\lambda_1^k\vec b_1+c_2\lambda_2^k\vec b_2+\cdots+c_n\lambda_n^k\vec b_n. \]

Factor out $\lambda_1^k$:

\[ A^k\vec x_0 =\lambda_1^k\left(c_1\vec b_1+ \sum_{i=2}^n c_i\left(\frac{\lambda_i}{\lambda_1}\right)^k\vec b_i\right). \]

Since $|\lambda_i/\lambda_1|<1$ for every $i\ge2$, the terms

\[ \left(\frac{\lambda_i}{\lambda_1}\right)^k \]

converge to zero. After normalization, the direction approaches the direction of $\vec b_1$.

Rate of convergence

The dominant error term is governed by

\[ \left|\frac{\lambda_2}{\lambda_1}\right|^k. \]

Code

def power_method(A, steps=25, seed=1):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=A.shape[0])
    x = x/np.linalg.norm(x)
    history = []
    for k in range(steps):
        y = A @ x
        x = y/np.linalg.norm(y)
        lam = (x @ A @ x)/(x @ x)
        history.append(lam)
    return np.array(history), x

A = np.array([[3,1],[1,2]], dtype=float)
history, x = power_method(A, steps=20)
print("approx eigenvector:", x)
print("Rayleigh quotients:", history[-5:])
print("numpy eigenvalues:", np.linalg.eigvals(A))

approx eigenvector: [0.85065081 0.52573111]
Rayleigh quotients: [3.61803399 3.61803399 3.61803399 3.61803399 3.61803399]
numpy eigenvalues: [3.61803399 1.38196601]

27 Rayleigh quotient

27.1 Definition 18.13: Rayleigh quotient

Let $A\in\mathbb R^{n\times n}$ be symmetric. The Rayleigh quotient is

\[ R_A(\vec x)=\frac{\vec x^TA\vec x}{\vec x^T\vec x}, \qquad \vec x\ne\vec 0. \]

For Hermitian $A\in\mathbb C^{n\times n}$, the complex Rayleigh quotient is

\[ R_A(\vec z)=\frac{\vec z^*A\vec z}{\vec z^*\vec z}, \qquad \vec z\ne\vec 0. \]

27.2 Proposition 18.14: Eigenvalue from eigenvector

If $A\vec v=\lambda\vec v$ and $\vec v\ne\vec 0$, then

\[ R_A(\vec v)=\lambda. \]

Proof

For the real case,

\[ R_A(\vec v)=\frac{\vec v^TA\vec v}{\vec v^T\vec v} =\frac{\vec v^T(\lambda\vec v)}{\vec v^T\vec v} =\lambda. \]

The Hermitian case is identical with transpose replaced by conjugate transpose.

27.3 Theorem 18.15: Rayleigh quotient and extreme eigenvalues

Let $A\in\mathbb R^{n\times n}$ be symmetric with eigenvalues

\[ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n. \]

Then

\[ \max_{\vec x\ne\vec 0}R_A(\vec x)=\lambda_1, \qquad \min_{\vec x\ne\vec 0}R_A(\vec x)=\lambda_n. \]

The same statement holds for Hermitian matrices over $\mathbb C$.

Proof

By the spectral theorem, choose an orthonormal eigenbasis $\vec q_1,\ldots,\vec q_n$. Write

\[ \vec x=c_1\vec q_1+\cdots+c_n\vec q_n. \]

Then

\[ \vec x^TA\vec x=\sum_{i=1}^n\lambda_i c_i^2, \qquad \vec x^T\vec x=\sum_{i=1}^n c_i^2. \]

Therefore

\[ R_A(\vec x)=\frac{\sum_{i=1}^n\lambda_i c_i^2}{\sum_{i=1}^n c_i^2}. \]

This is a weighted average of the eigenvalues, with nonnegative weights

\[ \frac{c_i^2}{\sum_{j=1}^n c_j^2}. \]

A weighted average of numbers between $\lambda_n$ and $\lambda_1$ also lies between $\lambda_n$ and $\lambda_1$. Equality is achieved at eigenvectors for the largest and smallest eigenvalues.

Code

A = np.array([[2,1,0],[1,3,1],[0,1,2]], dtype=float)
vals, vecs = np.linalg.eigh(A)
print("eigenvalues:", vals)

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=3)
    R = x @ A @ x / (x @ x)
    print(R)

eigenvalues: [1. 2. 4.]
1.5828305196818626
1.5035355354580657
2.657681818948921
2.961282669874245
2.22987010981964

28 Restricted Rayleigh quotient and PCA intuition

28.1 Theorem 18.16: Restricted Rayleigh quotient

Let $A$ be real symmetric with eigenvalues

\[ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n \]

and orthonormal eigenvectors $\vec q_1,\ldots,\vec q_n$. Then

\[ \max_{\vec x\ne\vec 0,\ \vec q_1^T\vec x=0}R_A(\vec x)=\lambda_2. \]

More generally,

\[ \max_{\vec x\ne\vec 0,\ \vec q_1^T\vec x=\cdots=\vec q_{k-1}^T\vec x=0}R_A(\vec x)=\lambda_k. \]

Proof

The constraints

\[ \vec q_1^T\vec x=\cdots=\vec q_{k-1}^T\vec x=0 \]

mean that $\vec x$ lies in

\[ \operatorname{span}\{\vec q_k,\ldots,\vec q_n\}. \]

Write

\[ \vec x=c_k\vec q_k+\cdots+c_n\vec q_n. \]

Then

\[ R_A(\vec x)=\frac{\sum_{i=k}^n\lambda_i c_i^2}{\sum_{i=k}^n c_i^2}. \]

This is a weighted average of $\lambda_k,\ldots,\lambda_n$, so it is at most $\lambda_k$. Equality occurs when $\vec x=\vec q_k$.

This theorem explains why PCA finds directions one after another: the first principal direction maximizes variance, the second maximizes variance subject to being orthogonal to the first, and so on.

29 Deflation

After finding one eigenvalue, one may try to remove it and compute the next one. This idea is called deflation.

29.1 Definition 18.17: Deflation

Deflation is the process of modifying a problem after finding one solution so that the next computation finds a new solution.

29.2 Proposition 18.18: Simple deflation for a symmetric matrix

Let $A\in\mathbb R^{n\times n}$ be symmetric with orthonormal eigenvectors $\vec v_1,\ldots,\vec v_n$ and eigenvalues $\lambda_1,\ldots,\lambda_n$. Define

\[ B=A-\lambda_1\vec v_1\vec v_1^T. \]

Then

\[ B\vec v_1=\vec 0, \qquad B\vec v_i=\lambda_i\vec v_i\quad(i=2,\ldots,n). \]

Thus $B$ has eigenvalues $0,\lambda_2,\ldots,\lambda_n$.

Proof

Since $\vec v_1^T\vec v_1=1$,

\[ B\vec v_1 =A\vec v_1-\lambda_1\vec v_1(\vec v_1^T\vec v_1) =\lambda_1\vec v_1-\lambda_1\vec v_1 =\vec 0. \]

If $i\ge2$, then $\vec v_1^T\vec v_i=0$, so

\[ B\vec v_i =A\vec v_i-\lambda_1\vec v_1(\vec v_1^T\vec v_i) =\lambda_i\vec v_i. \]

Numerical caution

Deflation is conceptually useful, but repeated deflation can be numerically unstable. Small errors in earlier eigenvectors can contaminate later eigenvalue computations.

Code

A = np.array([[3,1],[1,2]], dtype=float)
vals, vecs = np.linalg.eigh(A)
idx = np.argsort(vals)[::-1]
vals = vals[idx]
vecs = vecs[:,idx]

lam1 = vals[0]
v1 = vecs[:,0]
B = A - lam1*np.outer(v1,v1)
print("A eigenvalues:", vals)
print("deflated B:\n", B)
print("B eigenvalues:", np.linalg.eigvalsh(B))

A eigenvalues: [3.61803399 1.38196601]
deflated B:
 [[ 0.38196601 -0.61803399]
 [-0.61803399  1.        ]]
B eigenvalues: [7.64583896e-16 1.38196601e+00]

30 Generalized eigenvalues

Eigenvalue problems often appear in generalized form.

30.1 Definition 18.19: Generalized eigenvalue

Let $A,B\in\mathbb R^{n\times n}$. A scalar $\lambda$ is called a generalized eigenvalue of the pair $(A,B)$ if there exists a nonzero vector $\vec v$ such that

\[ A\vec v=\lambda B\vec v. \]

The vector $\vec v$ is called a generalized eigenvector.

If $B=I$, this reduces to the ordinary eigenvalue problem. If $B$ is invertible, then

\[ A\vec v=\lambda B\vec v \]

is equivalent to

\[ B^{-1}A\vec v=\lambda\vec v. \]

However, in numerical computation, one usually avoids explicitly forming $B^{-1}A$.

30.2 Example 18.20: A generalized eigenvalue problem

Let

\[ A=\begin{bmatrix}1&2\\2&4\end{bmatrix}, \qquad B=\begin{bmatrix}1&1\\1&1\end{bmatrix}. \]

The generalized eigenvalues satisfy

\[ \det(A-\lambda B)=0. \]

Compute

\[ \det\begin{bmatrix}1-\lambda&2-\lambda\\2-\lambda&4-\lambda\end{bmatrix} =(1-\lambda)(4-\lambda)-(2-\lambda)^2=-\lambda. \]

Thus $\lambda=0$ is a generalized eigenvalue. The generalized eigenvectors solve $A\vec v=\vec 0$, so

\[ \vec v=k\begin{bmatrix}-2\\1\end{bmatrix}. \]

31 Applications

31.1 PageRank and Markov chains

If a network is represented by a stochastic matrix $P$, then a stationary distribution satisfies

\[ P\vec x=\vec x. \]

This is an eigenvalue problem with eigenvalue $1$. Power iteration is a natural way to approximate the stationary vector.

31.2 Principal component analysis

PCA solves an eigenvalue problem for a covariance matrix $C$. The first principal component solves

\[ \max_{\|\vec x\|=1}\vec x^TC\vec x. \]

By the Rayleigh quotient theorem, the solution is the eigenvector corresponding to the largest eigenvalue of $C$.

31.3 Spectral graph methods

Graph Laplacians lead to eigenvalue and generalized eigenvalue problems. In spectral clustering, eigenvectors of a graph Laplacian are used as low-dimensional coordinates for clustering vertices.

32 Classical challenge questions with solutions

32.1 Challenge 1: When should we precompute $AB$?

Let $A\in\mathbb R^{m\times n}$, $B\in\mathbb R^{n\times p}$, and suppose we need to compute $A(B\vec x)$ for many different vectors $\vec x\in\mathbb R^p$.

When is it better to precompute $AB$?
When is it better to avoid forming $AB$?

Solution

For one vector $\vec x$, it is usually better to compute $A(B\vec x)$ with cost

\[ O(np+mn), \]

rather than form $AB$, which costs $O(mnp)$.

If the same matrices $A$ and $B$ are used for $N$ different vectors, then the tradeoff is

\[ \text{precompute }AB:\quad O(mnp)+N\cdot O(mp), \]

while

\[ \text{do not precompute}:\quad N\cdot O(np+mn). \]

Precomputing may be useful only if $N$ is large enough and if storing $AB$ is feasible. If $A$ and $B$ are sparse but $AB$ is dense, precomputing can destroy sparsity.

32.2 Challenge 2: When does the power method fail?

Give two reasons the power method may fail or converge slowly.

Solution

First, if the initial vector $\vec x_0$ has no component in the dominant eigendirection, then the dominant eigenvector never appears in the iteration. In the diagonalizable case, this means $c_1=0$ in

\[ \vec x_0=c_1\vec b_1+\cdots+c_n\vec b_n. \]

Second, if

\[ \left|\frac{\lambda_2}{\lambda_1}\right| \]

is close to $1$, then convergence is slow. Other difficulties include repeated dominant eigenvalues, complex dominant eigenvalues, and non-diagonalizable matrices.

32.3 Challenge 3: Rayleigh quotient as a weighted average

Explain why the Rayleigh quotient of a symmetric matrix cannot exceed the largest eigenvalue.

Solution

Write $A=QDQ^T$ with eigenvalues $\lambda_1\ge\cdots\ge\lambda_n$. If $\vec x=Q\vec c$, then

\[ R_A(\vec x)=\frac{\sum_i\lambda_i c_i^2}{\sum_i c_i^2}. \]

This is a weighted average of the eigenvalues. A weighted average cannot be larger than the largest value being averaged. Therefore

\[ R_A(\vec x)\le\lambda_1. \]

32.4 Challenge 4: Deflation and projection

Suppose $A$ is symmetric and $\vec v_1$ is a unit eigenvector with eigenvalue $\lambda_1$. Explain why

\[ B=A-\lambda_1\vec v_1\vec v_1^T \]

removes the component of $A$ acting along $\vec v_1$.

Solution

The matrix $\vec v_1\vec v_1^T$ is the orthogonal projection onto $\operatorname{span}\{\vec v_1\}$. Since $A\vec v_1=\lambda_1\vec v_1$, the term

\[ \lambda_1\vec v_1\vec v_1^T \]

is exactly the rank-one part of $A$ on that eigendirection. Subtracting it gives

\[ B\vec v_1=\vec 0. \]

All orthogonal eigenvectors are unchanged because $\vec v_1^T\vec v_i=0$ for $i\ne1$.

33 Practice problems

33.1 Problem 1

Let $A\in\mathbb R^{5000\times 5000}$ and $\vec x\in\mathbb R^{5000}$. Compare the cost of computing $A^2\vec x$ by forming $A^2$ first versus computing $A(A\vec x)$.

Solution

Forming $A^2$ costs about $O(n^3)$ with $n=5000$, and then multiplying by $\vec x$ costs $O(n^2)$. Computing $A(A\vec x)$ uses two matrix-vector products, costing $2O(n^2)$. The second method avoids the expensive matrix-matrix product.

33.2 Problem 2

Let

\[ A=\begin{bmatrix}4&0\\0&1\end{bmatrix}, \qquad \vec x_0=\begin{bmatrix}1\\1\end{bmatrix}. \]

Compute the direction of $A^k\vec x_0$ as $k\to\infty$.

Solution

We have

\[ A^k\vec x_0= \begin{bmatrix}4^k&0\\0&1\end{bmatrix} \begin{bmatrix}1\\1\end{bmatrix} = \begin{bmatrix}4^k\\1\end{bmatrix}. \]

After normalization, the second component becomes negligible compared with the first. The direction converges to

\[ \operatorname{span}\left\{\begin{bmatrix}1\\0\end{bmatrix}\right\}. \]

33.3 Problem 3

For

\[ A=\begin{bmatrix}2&1\\1&2\end{bmatrix}, \]

compute the Rayleigh quotient at $\vec x=(1,0)^T$ and at $\vec y=(1,1)^T$.

Solution

For $\vec x=(1,0)^T$,

\[ R_A(\vec x)=\frac{\vec x^TA\vec x}{\vec x^T\vec x}=2. \]

For $\vec y=(1,1)^T$,

\[ A\vec y=(3,3)^T, \]

\[ R_A(\vec y)=\frac{\vec y^TA\vec y}{\vec y^T\vec y} =\frac{6}{2}=3. \]

Indeed, $\vec y$ is an eigenvector with eigenvalue $3$.

33.4 Problem 4

Suppose a symmetric matrix has eigenvalues $10,9,1$. What is the approximate convergence ratio of the power method?

Solution

The convergence ratio is approximately

\[ \left|\frac{\lambda_2}{\lambda_1}\right|=\frac{9}{10}=0.9. \]

This is close to $1$, so convergence may be slow.

33.5 Problem 5

Let $P$ be a column-stochastic matrix. Why is a stationary distribution an eigenvector problem?

Solution

A stationary distribution $\vec \pi$ satisfies

\[ P\vec \pi=\vec \pi. \]

This is exactly the eigenvalue equation with eigenvalue $1$:

\[ P\vec \pi=1\cdot \vec \pi. \]

34 AI companion activities

Use an AI assistant as a study partner, but verify all mathematical claims by computation or proof.

Activity 1: Complexity explanation

Ask an AI assistant:

Explain why computing $(AB)x$ by forming $AB$ first is usually inefficient. Give a numerical example with dimensions.

Then check whether the stated operation counts match the formulas in this chapter.

Activity 2: Power method debugging

Ask an AI assistant to write a Python implementation of the power method. Test it on:

\[ A=\begin{bmatrix}3&1\\1&2\end{bmatrix} \]

and compare the output with numpy.linalg.eig.

Activity 3: Rayleigh quotient interpretation

Ask an AI assistant:

Why is the Rayleigh quotient a weighted average of eigenvalues for symmetric matrices?

Then rewrite the explanation using the spectral decomposition $A=QDQ^T$.

Activity 4: Find failure cases

Ask an AI assistant to produce examples where the power method fails or converges slowly. For each example, compute the eigenvalues and explain the behavior using $|\lambda_2/\lambda_1|$.

--- title: "Chapter 18: Computational Complexity and Computation of Eigenvalues" subtitle: "When the formula is correct, but the computation is too large" author: "He Wang" format: html: toc: true number-sections: true code-fold: true code-tools: true execute: echo: true warning: false message: false jupyter: python3 --- # The story: linear algebra at scale In the first part of this course, many computations were small enough to do by hand. We solved systems, found bases, diagonalized matrices, computed least-squares solutions, and studied decompositions such as QR, Schur, spectral decomposition, SVD, Haar transforms, and FFT. But applied linear algebra becomes truly powerful when the matrices are large. A matrix might represent a social network, a graph Laplacian, a covariance matrix, a discretized differential operator, an image, a signal, or a machine-learning data set. At that scale, the question is no longer only: ::: {.callout-note} ## The old question What is the correct formula? ::: It becomes: ::: {.callout-important} ## The computational question Can we compute the answer using a reasonable amount of time and memory? ::: For example, the expressions $$ (AB)\vec x \qquad\text{and}\qquad A(B\vec x) $$ are mathematically equal, but they may have very different computational costs. The lesson of this chapter is that **linear algebra is not only about identities; it is also about algorithms**. This chapter has two main goals: 1. understand computational complexity for basic matrix operations; 2. study numerical eigenvalue computation, especially the power method and Rayleigh quotient. # Space complexity, time complexity, and Big-O notation ::: {.callout-definition} ## Definition 18.1: Space complexity The **space complexity** of an algorithm is the amount of memory needed to store the input, intermediate quantities, and output. ::: ::: {.callout-definition} ## Definition 18.2: Time complexity The **time complexity** of an algorithm is the number of elementary operations needed to complete the computation. In numerical linear algebra, the elementary operations are usually arithmetic operations such as addition, subtraction, multiplication, and division. ::: ::: {.callout-definition} ## Definition 18.3: Big-O notation Let $f(n)$ and $g(n)$ be positive functions. We write $$ f(n)=O(g(n)) $$ if there exist constants $C>0$ and $n_0$ such that $$ f(n)\le Cg(n)\qquad\text{for all }n\ge n_0. $$ Big-O notation records the dominant asymptotic growth rate and ignores fixed constants and lower-order terms. ::: ## Memory: dense versus sparse matrices A dense matrix $A\in\mathbb R^{m\times n}$ stores $mn$ numbers. Its memory cost is therefore $$ O(mn). $$ If a matrix is sparse, most entries are zero. A sparse storage format stores only the nonzero entries and their positions. The memory cost is closer to $$ O(\operatorname{nnz}(A)), $$ where $\operatorname{nnz}(A)$ denotes the number of nonzero entries. ::: {.callout-example} ## Example 18.4: A tridiagonal matrix A tridiagonal $n\times n$ matrix has nonzero entries only on the main diagonal, the diagonal above it, and the diagonal below it. It has approximately $3n$ nonzero entries, not $n^2$ entries. For large $n$, storing it as a sparse matrix saves enormous memory. ::: ```{python} import numpy as np from scipy import sparse n = 10_000 main = 2*np.ones(n) off = -1*np.ones(n-1) A_sparse = sparse.diags([off, main, off], [-1, 0, 1], format="csr") print("shape:", A_sparse.shape) print("number of nonzeros:", A_sparse.nnz) print("density:", A_sparse.nnz/(n*n)) ``` # Basic time complexity in linear algebra Let $\vec x,\vec y\in\mathbb R^n$, $A\in\mathbb R^{m\times n}$, $B\in\mathbb R^{n\times p}$, and $M\in\mathbb R^{n\times n}$. | Operation | Typical dense cost | |---|---:| | Sum entries of $\vec x$ | $O(n)$ | | Dot product $\vec x\cdot\vec y$ | $O(n)$ | | Matrix-vector product $A\vec x$ | $O(mn)$ | | Matrix-matrix product $AB$ | $O(mnp)$ | | Gaussian elimination for $M\vec x=\vec b$ | $O(n^3)$ | | Dense eigenvalue computation | about $O(n^3)$ | ::: {.callout-note} ## Interpretation The difference between $O(n)$, $O(n^2)$, and $O(n^3)$ is small for homework-size matrices, but enormous for real data. If $n=10^6$, even storing a dense $n\times n$ matrix is usually impossible. ::: ```{python} import pandas as pd sizes = [10, 100, 1_000, 10_000, 1_000_000] data = [] for n in sizes: data.append({ "n": n, "n": n, "n log2 n": n*np.log2(n), "n^2": n**2, "n^3": n**3, }) pd.DataFrame(data) ``` # Associativity can save work A recurring rule in efficient matrix computation is: ::: {.callout-important} ## Do not form a large matrix if you only need its action on a vector. ::: ::: {.callout-theorem} ## Proposition 18.5: Associativity can save work Let $A\in\mathbb R^{m\times n}$, $B\in\mathbb R^{n\times p}$, and $\vec x\in\mathbb R^p$. Mathematically, $$ (AB)\vec x=A(B\vec x). $$ However, the computational costs are different: $$ (AB)\vec x:\quad O(mnp)+O(mp), $$ while $$ A(B\vec x):\quad O(np)+O(mn). $$ ::: <details> <summary>Proof</summary> To compute $(AB)\vec x$, first form the full matrix $AB$. Since $AB$ has $mp$ entries and each entry is a dot product of length $n$, forming $AB$ costs $O(mnp)$. Then multiplying the resulting $m\times p$ matrix by $\vec x$ costs $O(mp)$. For $A(B\vec x)$, first compute $\vec y=B\vec x$, costing $O(np)$. Then compute $A\vec y$, costing $O(mn)$. The two results are equal because matrix multiplication is associative, but the second method avoids forming the large intermediate matrix $AB$. </details> ::: {.callout-example} ## Example 18.6: One vector versus a full product If $m=n=p=10^4$, then forming $AB$ costs about $10^{12}$ scalar operations. Computing $B\vec x$ and then $A(B\vec x)$ costs about $2\cdot 10^8$ scalar operations. The two expressions are mathematically equal, but the computational work differs by thousands of times. ::: ```{python} def cost_form_then_multiply(m,n,p): return m*n*p + m*p def cost_multiply_vector_first(m,n,p): return n*p + m*n m=n=p=10_000 print("Form AB first:", cost_form_then_multiply(m,n,p)) print("Vector first:", cost_multiply_vector_first(m,n,p)) print("ratio:", cost_form_then_multiply(m,n,p)/cost_multiply_vector_first(m,n,p)) ``` # Diagonal and sparse matrices ::: {.callout-definition} ## Definition 18.7: Diagonal matrix Given $\vec a=(a_1,\ldots,a_n)^T$, the diagonal matrix with diagonal $\vec a$ is $$ D=\operatorname{diag}(\vec a)= \begin{bmatrix} a_1& &0\\ &\ddots&\\ 0&&a_n \end{bmatrix}. $$ ::: ::: {.callout-theorem} ## Proposition 18.8: Diagonal matrices behave like vectors Let $D=\operatorname{diag}(\vec a)\in\mathbb R^{n\times n}$ and $B\in\mathbb R^{n\times p}$. Then $$ DB= \begin{bmatrix} a_1B_{1,:}\\ a_2B_{2,:}\\ \vdots\\ a_nB_{n,:} \end{bmatrix}. $$ Thus multiplying by a diagonal matrix rescales rows. It costs $O(np)$, not $O(n^2p)$. ::: <details> <summary>Proof</summary> The $(i,j)$ entry of $DB$ is $$ (DB)_{ij}=\sum_{k=1}^nD_{ik}B_{kj}. $$ Since $D$ is diagonal, $D_{ik}=0$ unless $k=i$. Therefore $$ (DB)_{ij}=D_{ii}B_{ij}=a_iB_{ij}. $$ So each row of $B$ is scaled independently. </details> ```{python} B = np.array([[1,2,3], [4,5,6], [7,8,9]], dtype=float) a = np.array([10, 100, 1000], dtype=float) D = np.diag(a) print(D @ B) print(a[:,None] * B) # same result without forming D ``` # Solving linear systems: direct and iterative viewpoints Solving a linear system $$ A\vec x=\vec b $$ is one of the central problems in numerical linear algebra. Gaussian elimination is a direct method. It gives an answer after a finite sequence of row operations. But for large sparse systems, iterative methods can be preferable. ::: {.callout-example} ## Example 18.9: Stable configuration A physical equilibrium problem may lead to $$ \begin{cases} 3x-y-z=1,\\ -x+2y-z=0,\\ -2x-y+4z=-1. \end{cases} $$ There are several computational viewpoints: 1. solve exactly by Gaussian elimination; 2. simulate the physical system until it stabilizes; 3. use an iterative method, possibly with preconditioning. ::: ```{python} A = np.array([[3,-1,-1],[-1,2,-1],[-2,-1,4]], dtype=float) b = np.array([1,0,-1], dtype=float) np.linalg.solve(A,b) ``` ::: {.callout-note} ## Preconditioning A preconditioner changes a difficult system into an equivalent or approximately equivalent system that is easier for an iterative method to solve. In large-scale problems, the difference between cubic behavior and nearly linear behavior can be the difference between impossible and practical. ::: # Eigenvalue computation: why the characteristic polynomial is not enough Given $A\in\mathbb R^{n\times n}$, an eigenvalue $\lambda$ and eigenvector $\vec v\ne\vec 0$ satisfy $$ A\vec v=\lambda\vec v. $$ The algebraic definition comes from $$ \det(A-\lambda I)=0. $$ This is a polynomial equation of degree $n$. ::: {.callout-warning} ## Numerical warning For large $n$, expanding and solving the characteristic polynomial is usually not a good numerical method. Polynomial coefficients can be highly sensitive to roundoff error, and high-degree polynomial root finding can be unstable. Modern eigenvalue algorithms use matrix iterations instead. ::: ::: {.callout-definition} ## Definition 18.10: Dominant eigenvalue Let $\lambda_1,\ldots,\lambda_n$ be the eigenvalues of $A$. A **dominant eigenvalue** is an eigenvalue whose absolute value is strictly largest: $$ |\lambda_1|>|\lambda_i|\qquad i=2,\ldots,n. $$ An eigenvector associated with $\lambda_1$ is called a **dominant eigenvector**. ::: # The power method The power method is one of the simplest numerical algorithms for approximating a dominant eigenvector. ::: {.callout-definition} ## Definition 18.11: Power iteration Choose an initial vector $\vec x_0\ne\vec 0$. Define $$ \vec x_{k+1}=A\vec x_k. $$ Equivalently, $$ \vec x_k=A^k\vec x_0. $$ In practical computation, one usually normalizes each step: $$ \vec y_{k+1}=A\vec x_k, \qquad \vec x_{k+1}=\frac{\vec y_{k+1}}{\|\vec y_{k+1}\|}. $$ ::: ::: {.callout-theorem} ## Theorem 18.12: Power method in the diagonalizable case Suppose $A\in\mathbb C^{n\times n}$ is diagonalizable with eigenpairs $$ A\vec b_i=\lambda_i\vec b_i $$ and $$ |\lambda_1|>|\lambda_2|\ge\cdots\ge|\lambda_n|. $$ If $$ \vec x_0=c_1\vec b_1+c_2\vec b_2+\cdots+c_n\vec b_n $$ with $c_1\ne0$, then the directions of $A^k\vec x_0$ converge to the dominant eigendirection $\operatorname{span}\{\vec b_1\}$. ::: <details> <summary>Proof</summary> Compute $$ A^k\vec x_0 =c_1\lambda_1^k\vec b_1+c_2\lambda_2^k\vec b_2+\cdots+c_n\lambda_n^k\vec b_n. $$ Factor out $\lambda_1^k$: $$ A^k\vec x_0 =\lambda_1^k\left(c_1\vec b_1+ \sum_{i=2}^n c_i\left(\frac{\lambda_i}{\lambda_1}\right)^k\vec b_i\right). $$ Since $|\lambda_i/\lambda_1|<1$ for every $i\ge2$, the terms $$ \left(\frac{\lambda_i}{\lambda_1}\right)^k $$ converge to zero. After normalization, the direction approaches the direction of $\vec b_1$. </details> ::: {.callout-note} ## Rate of convergence The dominant error term is governed by $$ \left|\frac{\lambda_2}{\lambda_1}\right|^k. $$ The smaller $|\lambda_2/\lambda_1|$ is, the faster the method converges. If $|\lambda_2|$ is close to $|\lambda_1|$, convergence can be very slow. ::: ```{python} def power_method(A, steps=25, seed=1): rng = np.random.default_rng(seed) x = rng.normal(size=A.shape[0]) x = x/np.linalg.norm(x) history = [] for k in range(steps): y = A @ x x = y/np.linalg.norm(y) lam = (x @ A @ x)/(x @ x) history.append(lam) return np.array(history), x A = np.array([[3,1],[1,2]], dtype=float) history, x = power_method(A, steps=20) print("approx eigenvector:", x) print("Rayleigh quotients:", history[-5:]) print("numpy eigenvalues:", np.linalg.eigvals(A)) ``` # Rayleigh quotient ::: {.callout-definition} ## Definition 18.13: Rayleigh quotient Let $A\in\mathbb R^{n\times n}$ be symmetric. The **Rayleigh quotient** is $$ R_A(\vec x)=\frac{\vec x^TA\vec x}{\vec x^T\vec x}, \qquad \vec x\ne\vec 0. $$ For Hermitian $A\in\mathbb C^{n\times n}$, the complex Rayleigh quotient is $$ R_A(\vec z)=\frac{\vec z^*A\vec z}{\vec z^*\vec z}, \qquad \vec z\ne\vec 0. $$ ::: ::: {.callout-theorem} ## Proposition 18.14: Eigenvalue from eigenvector If $A\vec v=\lambda\vec v$ and $\vec v\ne\vec 0$, then $$ R_A(\vec v)=\lambda. $$ ::: <details> <summary>Proof</summary> For the real case, $$ R_A(\vec v)=\frac{\vec v^TA\vec v}{\vec v^T\vec v} =\frac{\vec v^T(\lambda\vec v)}{\vec v^T\vec v} =\lambda. $$ The Hermitian case is identical with transpose replaced by conjugate transpose. </details> ::: {.callout-theorem} ## Theorem 18.15: Rayleigh quotient and extreme eigenvalues Let $A\in\mathbb R^{n\times n}$ be symmetric with eigenvalues $$ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n. $$ Then $$ \max_{\vec x\ne\vec 0}R_A(\vec x)=\lambda_1, \qquad \min_{\vec x\ne\vec 0}R_A(\vec x)=\lambda_n. $$ The same statement holds for Hermitian matrices over $\mathbb C$. ::: <details> <summary>Proof</summary> By the spectral theorem, choose an orthonormal eigenbasis $\vec q_1,\ldots,\vec q_n$. Write $$ \vec x=c_1\vec q_1+\cdots+c_n\vec q_n. $$ Then $$ \vec x^TA\vec x=\sum_{i=1}^n\lambda_i c_i^2, \qquad \vec x^T\vec x=\sum_{i=1}^n c_i^2. $$ Therefore $$ R_A(\vec x)=\frac{\sum_{i=1}^n\lambda_i c_i^2}{\sum_{i=1}^n c_i^2}. $$ This is a weighted average of the eigenvalues, with nonnegative weights $$ \frac{c_i^2}{\sum_{j=1}^n c_j^2}. $$ A weighted average of numbers between $\lambda_n$ and $\lambda_1$ also lies between $\lambda_n$ and $\lambda_1$. Equality is achieved at eigenvectors for the largest and smallest eigenvalues. </details> ```{python} A = np.array([[2,1,0],[1,3,1],[0,1,2]], dtype=float) vals, vecs = np.linalg.eigh(A) print("eigenvalues:", vals) rng = np.random.default_rng(0) for _ in range(5): x = rng.normal(size=3) R = x @ A @ x / (x @ x) print(R) ``` # Restricted Rayleigh quotient and PCA intuition ::: {.callout-theorem} ## Theorem 18.16: Restricted Rayleigh quotient Let $A$ be real symmetric with eigenvalues $$ \lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n $$ and orthonormal eigenvectors $\vec q_1,\ldots,\vec q_n$. Then $$ \max_{\vec x\ne\vec 0,\ \vec q_1^T\vec x=0}R_A(\vec x)=\lambda_2. $$ More generally, $$ \max_{\vec x\ne\vec 0,\ \vec q_1^T\vec x=\cdots=\vec q_{k-1}^T\vec x=0}R_A(\vec x)=\lambda_k. $$ ::: <details> <summary>Proof</summary> The constraints $$ \vec q_1^T\vec x=\cdots=\vec q_{k-1}^T\vec x=0 $$ mean that $\vec x$ lies in $$ \operatorname{span}\{\vec q_k,\ldots,\vec q_n\}. $$ Write $$ \vec x=c_k\vec q_k+\cdots+c_n\vec q_n. $$ Then $$ R_A(\vec x)=\frac{\sum_{i=k}^n\lambda_i c_i^2}{\sum_{i=k}^n c_i^2}. $$ This is a weighted average of $\lambda_k,\ldots,\lambda_n$, so it is at most $\lambda_k$. Equality occurs when $\vec x=\vec q_k$. </details> This theorem explains why PCA finds directions one after another: the first principal direction maximizes variance, the second maximizes variance subject to being orthogonal to the first, and so on. # Deflation After finding one eigenvalue, one may try to remove it and compute the next one. This idea is called **deflation**. ::: {.callout-definition} ## Definition 18.17: Deflation Deflation is the process of modifying a problem after finding one solution so that the next computation finds a new solution. ::: ::: {.callout-theorem} ## Proposition 18.18: Simple deflation for a symmetric matrix Let $A\in\mathbb R^{n\times n}$ be symmetric with orthonormal eigenvectors $\vec v_1,\ldots,\vec v_n$ and eigenvalues $\lambda_1,\ldots,\lambda_n$. Define $$ B=A-\lambda_1\vec v_1\vec v_1^T. $$ Then $$ B\vec v_1=\vec 0, \qquad B\vec v_i=\lambda_i\vec v_i\quad(i=2,\ldots,n). $$ Thus $B$ has eigenvalues $0,\lambda_2,\ldots,\lambda_n$. ::: <details> <summary>Proof</summary> Since $\vec v_1^T\vec v_1=1$, $$ B\vec v_1 =A\vec v_1-\lambda_1\vec v_1(\vec v_1^T\vec v_1) =\lambda_1\vec v_1-\lambda_1\vec v_1 =\vec 0. $$ If $i\ge2$, then $\vec v_1^T\vec v_i=0$, so $$ B\vec v_i =A\vec v_i-\lambda_1\vec v_1(\vec v_1^T\vec v_i) =\lambda_i\vec v_i. $$ </details> ::: {.callout-warning} ## Numerical caution Deflation is conceptually useful, but repeated deflation can be numerically unstable. Small errors in earlier eigenvectors can contaminate later eigenvalue computations. ::: ```{python} A = np.array([[3,1],[1,2]], dtype=float) vals, vecs = np.linalg.eigh(A) idx = np.argsort(vals)[::-1] vals = vals[idx] vecs = vecs[:,idx] lam1 = vals[0] v1 = vecs[:,0] B = A - lam1*np.outer(v1,v1) print("A eigenvalues:", vals) print("deflated B:\n", B) print("B eigenvalues:", np.linalg.eigvalsh(B)) ``` # Generalized eigenvalues Eigenvalue problems often appear in generalized form. ::: {.callout-definition} ## Definition 18.19: Generalized eigenvalue Let $A,B\in\mathbb R^{n\times n}$. A scalar $\lambda$ is called a **generalized eigenvalue** of the pair $(A,B)$ if there exists a nonzero vector $\vec v$ such that $$ A\vec v=\lambda B\vec v. $$ The vector $\vec v$ is called a **generalized eigenvector**. ::: If $B=I$, this reduces to the ordinary eigenvalue problem. If $B$ is invertible, then $$ A\vec v=\lambda B\vec v $$ is equivalent to $$ B^{-1}A\vec v=\lambda\vec v. $$ However, in numerical computation, one usually avoids explicitly forming $B^{-1}A$. ::: {.callout-example} ## Example 18.20: A generalized eigenvalue problem Let $$ A=\begin{bmatrix}1&2\\2&4\end{bmatrix}, \qquad B=\begin{bmatrix}1&1\\1&1\end{bmatrix}. $$ The generalized eigenvalues satisfy $$ \det(A-\lambda B)=0. $$ Compute $$ \det\begin{bmatrix}1-\lambda&2-\lambda\\2-\lambda&4-\lambda\end{bmatrix} =(1-\lambda)(4-\lambda)-(2-\lambda)^2=-\lambda. $$ Thus $\lambda=0$ is a generalized eigenvalue. The generalized eigenvectors solve $A\vec v=\vec 0$, so $$ \vec v=k\begin{bmatrix}-2\\1\end{bmatrix}. $$ ::: # Applications ## PageRank and Markov chains If a network is represented by a stochastic matrix $P$, then a stationary distribution satisfies $$ P\vec x=\vec x. $$ This is an eigenvalue problem with eigenvalue $1$. Power iteration is a natural way to approximate the stationary vector. ## Principal component analysis PCA solves an eigenvalue problem for a covariance matrix $C$. The first principal component solves $$ \max_{\|\vec x\|=1}\vec x^TC\vec x. $$ By the Rayleigh quotient theorem, the solution is the eigenvector corresponding to the largest eigenvalue of $C$. ## Spectral graph methods Graph Laplacians lead to eigenvalue and generalized eigenvalue problems. In spectral clustering, eigenvectors of a graph Laplacian are used as low-dimensional coordinates for clustering vertices. # Classical challenge questions with solutions ## Challenge 1: When should we precompute $AB$? Let $A\in\mathbb R^{m\times n}$, $B\in\mathbb R^{n\times p}$, and suppose we need to compute $A(B\vec x)$ for many different vectors $\vec x\in\mathbb R^p$. 1. When is it better to precompute $AB$? 2. When is it better to avoid forming $AB$? <details> <summary>Solution</summary> For one vector $\vec x$, it is usually better to compute $A(B\vec x)$ with cost $$ O(np+mn), $$ rather than form $AB$, which costs $O(mnp)$. If the same matrices $A$ and $B$ are used for $N$ different vectors, then the tradeoff is $$ \text{precompute }AB:\quad O(mnp)+N\cdot O(mp), $$ while $$ \text{do not precompute}:\quad N\cdot O(np+mn). $$ Precomputing may be useful only if $N$ is large enough and if storing $AB$ is feasible. If $A$ and $B$ are sparse but $AB$ is dense, precomputing can destroy sparsity. </details> ## Challenge 2: When does the power method fail? Give two reasons the power method may fail or converge slowly. <details> <summary>Solution</summary> First, if the initial vector $\vec x_0$ has no component in the dominant eigendirection, then the dominant eigenvector never appears in the iteration. In the diagonalizable case, this means $c_1=0$ in $$ \vec x_0=c_1\vec b_1+\cdots+c_n\vec b_n. $$ Second, if $$ \left|\frac{\lambda_2}{\lambda_1}\right| $$ is close to $1$, then convergence is slow. Other difficulties include repeated dominant eigenvalues, complex dominant eigenvalues, and non-diagonalizable matrices. </details> ## Challenge 3: Rayleigh quotient as a weighted average Explain why the Rayleigh quotient of a symmetric matrix cannot exceed the largest eigenvalue. <details> <summary>Solution</summary> Write $A=QDQ^T$ with eigenvalues $\lambda_1\ge\cdots\ge\lambda_n$. If $\vec x=Q\vec c$, then $$ R_A(\vec x)=\frac{\sum_i\lambda_i c_i^2}{\sum_i c_i^2}. $$ This is a weighted average of the eigenvalues. A weighted average cannot be larger than the largest value being averaged. Therefore $$ R_A(\vec x)\le\lambda_1. $$ </details> ## Challenge 4: Deflation and projection Suppose $A$ is symmetric and $\vec v_1$ is a unit eigenvector with eigenvalue $\lambda_1$. Explain why $$ B=A-\lambda_1\vec v_1\vec v_1^T $$ removes the component of $A$ acting along $\vec v_1$. <details> <summary>Solution</summary> The matrix $\vec v_1\vec v_1^T$ is the orthogonal projection onto $\operatorname{span}\{\vec v_1\}$. Since $A\vec v_1=\lambda_1\vec v_1$, the term $$ \lambda_1\vec v_1\vec v_1^T $$ is exactly the rank-one part of $A$ on that eigendirection. Subtracting it gives $$ B\vec v_1=\vec 0. $$ All orthogonal eigenvectors are unchanged because $\vec v_1^T\vec v_i=0$ for $i\ne1$. </details> # Practice problems ## Problem 1 Let $A\in\mathbb R^{5000\times 5000}$ and $\vec x\in\mathbb R^{5000}$. Compare the cost of computing $A^2\vec x$ by forming $A^2$ first versus computing $A(A\vec x)$. <details> <summary>Solution</summary> Forming $A^2$ costs about $O(n^3)$ with $n=5000$, and then multiplying by $\vec x$ costs $O(n^2)$. Computing $A(A\vec x)$ uses two matrix-vector products, costing $2O(n^2)$. The second method avoids the expensive matrix-matrix product. </details> ## Problem 2 Let $$ A=\begin{bmatrix}4&0\\0&1\end{bmatrix}, \qquad \vec x_0=\begin{bmatrix}1\\1\end{bmatrix}. $$ Compute the direction of $A^k\vec x_0$ as $k\to\infty$. <details> <summary>Solution</summary> We have $$ A^k\vec x_0= \begin{bmatrix}4^k&0\\0&1\end{bmatrix} \begin{bmatrix}1\\1\end{bmatrix} = \begin{bmatrix}4^k\\1\end{bmatrix}. $$ After normalization, the second component becomes negligible compared with the first. The direction converges to $$ \operatorname{span}\left\{\begin{bmatrix}1\\0\end{bmatrix}\right\}. $$ </details> ## Problem 3 For $$ A=\begin{bmatrix}2&1\\1&2\end{bmatrix}, $$ compute the Rayleigh quotient at $\vec x=(1,0)^T$ and at $\vec y=(1,1)^T$. <details> <summary>Solution</summary> For $\vec x=(1,0)^T$, $$ R_A(\vec x)=\frac{\vec x^TA\vec x}{\vec x^T\vec x}=2. $$ For $\vec y=(1,1)^T$, $$ A\vec y=(3,3)^T, $$ so $$ R_A(\vec y)=\frac{\vec y^TA\vec y}{\vec y^T\vec y} =\frac{6}{2}=3. $$ Indeed, $\vec y$ is an eigenvector with eigenvalue $3$. </details> ## Problem 4 Suppose a symmetric matrix has eigenvalues $10,9,1$. What is the approximate convergence ratio of the power method? <details> <summary>Solution</summary> The convergence ratio is approximately $$ \left|\frac{\lambda_2}{\lambda_1}\right|=\frac{9}{10}=0.9. $$ This is close to $1$, so convergence may be slow. </details> ## Problem 5 Let $P$ be a column-stochastic matrix. Why is a stationary distribution an eigenvector problem? <details> <summary>Solution</summary> A stationary distribution $\vec \pi$ satisfies $$ P\vec \pi=\vec \pi. $$ This is exactly the eigenvalue equation with eigenvalue $1$: $$ P\vec \pi=1\cdot \vec \pi. $$ </details> # AI companion activities Use an AI assistant as a study partner, but verify all mathematical claims by computation or proof. ::: {.callout-tip} ## Activity 1: Complexity explanation Ask an AI assistant: > Explain why computing $(AB)x$ by forming $AB$ first is usually inefficient. Give a numerical example with dimensions. Then check whether the stated operation counts match the formulas in this chapter. ::: ::: {.callout-tip} ## Activity 2: Power method debugging Ask an AI assistant to write a Python implementation of the power method. Test it on: $$ A=\begin{bmatrix}3&1\\1&2\end{bmatrix} $$ and compare the output with `numpy.linalg.eig`. ::: ::: {.callout-tip} ## Activity 3: Rayleigh quotient interpretation Ask an AI assistant: > Why is the Rayleigh quotient a weighted average of eigenvalues for symmetric matrices? Then rewrite the explanation using the spectral decomposition $A=QDQ^T$. ::: ::: {.callout-tip} ## Activity 4: Find failure cases Ask an AI assistant to produce examples where the power method fails or converges slowly. For each example, compute the eigenvalues and explain the behavior using $|\lambda_2/\lambda_1|$. :::

Operation	Typical dense cost
Sum entries of \(\vec x\)	\(O(n)\)
Dot product \(\vec x\cdot\vec y\)	\(O(n)\)
Matrix-vector product \(A\vec x\)	\(O(mn)\)
Matrix-matrix product \(AB\)	\(O(mnp)\)
Gaussian elimination for \(M\vec x=\vec b\)	\(O(n^3)\)
Dense eigenvalue computation	about \(O(n^3)\)

19 The story: linear algebra at scale

20 Space complexity, time complexity, and Big-O notation

20.1 Definition 18.1: Space complexity

20.2 Definition 18.2: Time complexity

20.3 Definition 18.3: Big-O notation

20.4 Memory: dense versus sparse matrices

20.5 Example 18.4: A tridiagonal matrix

21 Basic time complexity in linear algebra

22 Associativity can save work

22.1 Proposition 18.5: Associativity can save work

22.2 Example 18.6: One vector versus a full product

23 Diagonal and sparse matrices

23.1 Definition 18.7: Diagonal matrix

23.2 Proposition 18.8: Diagonal matrices behave like vectors

24 Solving linear systems: direct and iterative viewpoints

24.1 Example 18.9: Stable configuration

25 Eigenvalue computation: why the characteristic polynomial is not enough

25.1 Definition 18.10: Dominant eigenvalue

26 The power method

26.1 Definition 18.11: Power iteration

26.2 Theorem 18.12: Power method in the diagonalizable case

27 Rayleigh quotient

27.1 Definition 18.13: Rayleigh quotient

27.2 Proposition 18.14: Eigenvalue from eigenvector

27.3 Theorem 18.15: Rayleigh quotient and extreme eigenvalues

28 Restricted Rayleigh quotient and PCA intuition

28.1 Theorem 18.16: Restricted Rayleigh quotient

29 Deflation

29.1 Definition 18.17: Deflation

29.2 Proposition 18.18: Simple deflation for a symmetric matrix

30 Generalized eigenvalues

30.1 Definition 18.19: Generalized eigenvalue

30.2 Example 18.20: A generalized eigenvalue problem

31 Applications

31.1 PageRank and Markov chains

31.2 Principal component analysis

31.3 Spectral graph methods

32 Classical challenge questions with solutions

32.1 Challenge 1: When should we precompute \(AB\)?

32.2 Challenge 2: When does the power method fail?

32.3 Challenge 3: Rayleigh quotient as a weighted average

32.4 Challenge 4: Deflation and projection

33 Practice problems

33.1 Problem 1

33.2 Problem 2

33.3 Problem 3

33.4 Problem 4

33.5 Problem 5

34 AI companion activities