Code
import numpy as np
Q = np.array([[4., 1.],
[1., 3.]])
b = np.array([1., 2.])
x_star = np.linalg.solve(Q, b)
f_star = 0.5 * x_star @ Q @ x_star - b @ x_star
x_star, f_star(array([0.09090909, 0.63636364]), -0.6818181818181818)
Vectors that choose the best answer
In earlier chapters, a vector often represented data, a signal, a state, a polynomial, a probability distribution, or a point in space. In optimization, a vector represents a choice.
A delivery route, a portfolio, a regression model, a classifier, a flow on a network, and a solution of a differential equation can all be described by a vector \[ x\in \mathbb R^n. \] Optimization asks:
Among all possible vectors \(x\), which one is best?
Linear algebra gives the language for the answer.
| Optimization language | Linear algebra language |
|---|---|
| decision variables | a vector \(x\in\mathbb R^n\) |
| equality constraints | an affine subspace \(Ax=b\) |
| inequality constraints | half-spaces \(Ax\le b\) |
| quadratic objective | a quadratic form \(x^TQx\) |
| least squares objective | distance to a column space |
| gradient | vector of first-order change |
| Hessian | symmetric matrix of second-order change |
| dual variables | vectors measuring constraint sensitivity |
The main message is:
Optimization is linear algebra plus the word “best.”
An optimization problem has the form \[ \min_{x\in \mathcal C} f(x) \] or \[ \max_{x\in \mathcal C} f(x), \] where \(f:\mathbb R^n\to \mathbb R\) is the objective function and \(\mathcal C\subseteq \mathbb R^n\) is the feasible set.
If \(\mathcal C=\mathbb R^n\), the problem is unconstrained. If \(\mathcal C\) is described by equations or inequalities, the problem is constrained.
Least squares \[ \min_{x\in\mathbb R^n}\|Ax-b\|_2^2. \]
Quadratic optimization \[ \min_{x\in\mathbb R^n} \left(\frac12 x^TQx-b^Tx\right), \] where \(Q=Q^T\).
Linear programming \[ \min_{x\in\mathbb R^n} c^Tx \quad\text{subject to}\quad Ax\le b. \]
Each problem is built from vectors, matrices, inner products, and geometry.
The derivative of a scalar-valued function \(f:\mathbb R^n\to\mathbb R\) is best understood as a linear approximation.
Let \(f:\mathbb R^n\to\mathbb R\) be differentiable. The gradient of \(f\) at \(x\) is \[ \nabla f(x)= \begin{bmatrix} \frac{\partial f}{\partial x_1}(x)\\ \vdots\\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix}. \] For a small vector \(h\), \[ f(x+h)\approx f(x)+\nabla f(x)^Th. \]
The gradient points in the direction of steepest increase. Therefore, \(-\nabla f(x)\) points in the direction of steepest local decrease.
If \(f:\mathbb R^n\to\mathbb R\) has continuous second partial derivatives, the Hessian matrix is \[ \nabla^2 f(x)= \begin{bmatrix} \frac{\partial^2 f}{\partial x_1\partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n\partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n\partial x_n} \end{bmatrix}. \] The second-order approximation is \[ f(x+h)\approx f(x)+\nabla f(x)^Th+\frac12 h^T\nabla^2 f(x)h. \]
Let \[ f(x)=\frac12x^TQx-b^Tx+c, \] where \(Q=Q^T\). Then \[ \nabla f(x)=Qx-b, \qquad \nabla^2 f(x)=Q. \]
Thus the Hessian of a quadratic function is the matrix defining its quadratic form.
If \(f:\mathbb R^n\to\mathbb R\) is differentiable and \(x_*\) is a local minimum or local maximum in the interior of the domain, then \[ \nabla f(x_*)=0. \]
Convexity is the reason many optimization problems are tractable. The geometric idea is simple: a convex set contains every line segment between any two of its points.
A set \(\mathcal C\subseteq\mathbb R^n\) is convex if for all \(x,y\in\mathcal C\) and all \(0\le t\le 1\), \[ tx+(1-t)y\in\mathcal C. \]
A function \(f:\mathcal C\to\mathbb R\) on a convex set \(\mathcal C\) is convex if \[ f(tx+(1-t)y)\le tf(x)+(1-t)f(y) \] for all \(x,y\in\mathcal C\) and all \(0\le t\le 1\).
Let \(\mathcal C\subseteq\mathbb R^n\) be open and convex. Suppose \(f:\mathcal C\to\mathbb R\) has continuous second partial derivatives. Then \(f\) is convex if and only if \[ \nabla^2 f(x)\succeq 0 \] for every \(x\in\mathcal C\).
For \[ f(x)=\frac12x^TQx-b^Tx+c, \] the Hessian is \(Q\). Therefore:
Quadratic optimization is one of the clearest examples of linear algebra controlling optimization.
Let \(Q=Q^T\succ 0\). The function \[ f(x)=\frac12x^TQx-b^Tx \] has a unique minimizer \(x_*\), and it satisfies \[ Qx_*=b. \] Thus \[ x_*=Q^{-1}b. \]
import numpy as np
Q = np.array([[4., 1.],
[1., 3.]])
b = np.array([1., 2.])
x_star = np.linalg.solve(Q, b)
f_star = 0.5 * x_star @ Q @ x_star - b @ x_star
x_star, f_star(array([0.09090909, 0.63636364]), -0.6818181818181818)
The optimal vector is found by solving a linear system.
Least squares has already appeared as a projection problem. Now we view it as an optimization problem.
Given \(A\in\mathbb R^{m\times n}\) and \(b\in\mathbb R^m\), the least squares problem is \[ \min_{x\in\mathbb R^n}\|Ax-b\|_2^2. \]
A vector \(x_*\) solves \[ \min_x \|Ax-b\|_2^2 \] if and only if \[ A^TAx_*=A^Tb. \] If the columns of \(A\) are linearly independent, then \(A^TA\) is positive definite and \[ x_*=(A^TA)^{-1}A^Tb. \]
import numpy as np
A = np.array([[1., 0.],
[1., 1.],
[1., 2.],
[1., 3.]])
b = np.array([1.0, 2.1, 2.9, 4.2])
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
normal_x = np.linalg.solve(A.T @ A, A.T @ b)
x_ls, normal_x(array([0.99, 1.04]), array([0.99, 1.04]))
Least squares can be unstable when columns of \(A\) are nearly dependent. Regularization adds a penalty to discourage overly large coefficients.
For \(\lambda>0\), the ridge regression problem is \[ \min_x \left(\|Ax-b\|_2^2+\lambda\|x\|_2^2\right). \]
The ridge regression solution satisfies \[ (A^TA+\lambda I)x_\lambda=A^Tb. \] Hence \[ x_\lambda=(A^TA+\lambda I)^{-1}A^Tb. \]
import numpy as np
A = np.array([[1., 1.00],
[1., 1.01],
[1., 0.99],
[1., 1.02]])
b = np.array([2.0, 2.01, 1.99, 2.02])
for lam in [0, 0.01, 0.1, 1.0]:
if lam == 0:
x = np.linalg.lstsq(A, b, rcond=None)[0]
else:
x = np.linalg.solve(A.T @ A + lam*np.eye(2), A.T @ b)
print(lam, x)0 [1. 1.]
0.01 [0.99629118 1.00121203]
0.1 [0.98522167 0.9901968 ]
1.0 [0.88713423 0.89162409]
Now suppose the vector \(x\) must satisfy linear equality constraints: \[ Cx=d. \]
A constrained quadratic problem has the form \[ \min_x \frac12x^TQx-b^Tx \quad\text{subject to}\quad Cx=d. \]
The Lagrangian for the equality-constrained problem is \[ \mathcal L(x,\lambda) = \frac12x^TQx-b^Tx+\lambda^T(Cx-d), \] where \(\lambda\) is the vector of Lagrange multipliers.
A solution \((x_*,\lambda_*)\) satisfies \[ \begin{bmatrix} Q & C^T\\ C & 0 \end{bmatrix} \begin{bmatrix} x_*\\ \lambda_* \end{bmatrix} = \begin{bmatrix} b\\ d \end{bmatrix}. \] This block linear system is called a KKT system.
import numpy as np
Q = np.array([[2., 0.],
[0., 2.]])
b = np.array([2., 0.])
C = np.array([[1., 1.]])
d = np.array([1.])
KKT = np.block([[Q, C.T],
[C, np.zeros((1,1))]])
rhs = np.concatenate([b, d])
sol = np.linalg.solve(KKT, rhs)
x_star = sol[:2]
lambda_star = sol[2:]
x_star, lambda_star(array([1., 0.]), array([-0.]))
Linear programming is optimization over a polyhedron.
A linear program is an optimization problem of the form \[ \min_x c^Tx \quad\text{subject to}\quad Ax\le b. \]
A polyhedron is a set that can be written as the intersection of finitely many half-spaces: \[ \mathcal P=\{x\in\mathbb R^n: Ax\le b\}. \]
The feasible set of a linear program is a polyhedron. The objective \(c^Tx\) is linear, so its level sets are parallel hyperplanes.
If a linear program has an optimal solution and the feasible polyhedron has vertices, then at least one optimal solution occurs at a vertex.
Duality is one of the deepest ideas in optimization. It says that every linear program has a companion problem that gives certificates for how good a solution can be.
A standard primal linear program is \[ \min_x c^Tx \quad\text{subject to}\quad Ax=b,\quad x\ge 0. \] Its dual is \[ \max_y b^Ty \quad\text{subject to}\quad A^Ty\le c. \]
If \(x\) is feasible for the primal and \(y\) is feasible for the dual, then \[ b^Ty\le c^Tx. \]
Weak duality says that every dual feasible point gives a lower bound for every primal feasible point in a minimization problem.
Gradient descent is one of the simplest and most important optimization algorithms.
Given a differentiable function \(f\), gradient descent constructs a sequence \[ x_{k+1}=x_k-\alpha \nabla f(x_k), \] where \(\alpha>0\) is called the step size or learning rate.
For a quadratic \[ f(x)=\frac12x^TQx-b^Tx, \] gradient descent becomes \[ x_{k+1}=x_k-\alpha(Qx_k-b) = (I-\alpha Q)x_k+\alpha b. \] Thus the convergence of gradient descent is controlled by eigenvalues of \(Q\).
Let \(Q=Q^T\succ 0\), with eigenvalues \[ 0<\lambda_{\min}\le \lambda_i\le \lambda_{\max}. \] For \[ f(x)=\frac12x^TQx-b^Tx, \] gradient descent converges to the minimizer if \[ 0<\alpha<\frac{2}{\lambda_{\max}}. \]
import numpy as np
Q = np.array([[5., 1.],
[1., 2.]])
b = np.array([1., 1.])
alpha = 0.25
x = np.array([3., -2.])
history = [x.copy()]
for k in range(25):
grad = Q @ x - b
x = x - alpha * grad
history.append(x.copy())
np.array(history[-5:])array([[0.11111958, 0.44441646],
[0.11111599, 0.44442833],
[0.11111392, 0.44443517],
[0.11111273, 0.4444391 ],
[0.11111204, 0.44444137]])
Gradient descent uses the gradient. Newton’s method uses both the gradient and the Hessian.
For a twice differentiable function \(f\), the Newton step \(p_k\) at \(x_k\) solves \[ \nabla^2 f(x_k)p_k=-\nabla f(x_k). \] The update is \[ x_{k+1}=x_k+p_k. \]
Newton’s method comes from minimizing the local quadratic approximation \[ f(x_k+p)\approx f(x_k)+\nabla f(x_k)^Tp+\frac12p^T\nabla^2 f(x_k)p. \]
For a positive definite quadratic function, Newton’s method reaches the minimizer in one step.
Many constrained optimization problems can be solved by repeatedly taking a gradient step and projecting back onto the feasible set.
Let \(\mathcal C\subseteq\mathbb R^n\) be closed and convex. The projection of \(y\) onto \(\mathcal C\) is \[ \operatorname{Proj}_{\mathcal C}(y) = \arg\min_{x\in\mathcal C}\|x-y\|_2. \]
For \[ \mathcal C=\mathbb R_{\ge 0}^n, \] the projection is componentwise: \[ \operatorname{Proj}_{\mathcal C}(y)_i=\max(y_i,0). \]
Projected gradient descent takes the form \[ x_{k+1} = \operatorname{Proj}_{\mathcal C} \left(x_k-\alpha\nabla f(x_k)\right). \]
Eigenvalues appear naturally in optimization through the Rayleigh quotient.
For a symmetric matrix \(A\), the Rayleigh quotient is \[ R_A(x)=\frac{x^TAx}{x^Tx}, \qquad x\ne 0. \]
If \(A=A^T\) has eigenvalues \[ \lambda_1\le \lambda_2\le \cdots\le \lambda_n, \] then \[ \lambda_1=\min_{x\ne 0} \frac{x^TAx}{x^Tx}, \qquad \lambda_n=\max_{x\ne 0} \frac{x^TAx}{x^Tx}. \]
In Markowitz portfolio optimization, one chooses a vector \(x\) of asset weights. A covariance matrix \(\Sigma\) measures risk: \[ \text{risk}=x^T\Sigma x. \] A basic problem is \[ \min_x x^T\Sigma x \quad\text{subject to}\quad \mathbf 1^Tx=1,\quad \mu^Tx=r. \] This is a quadratic optimization problem with linear equality constraints.
A linear classifier uses a hyperplane \[ w^Tx+b=0. \] The support vector machine chooses a separating hyperplane with large margin. In its simplest form, it leads to a constrained quadratic optimization problem.
For a graph Laplacian \(L\), \[ x^TLx=\sum_{(i,j)\in E}w_{ij}(x_i-x_j)^2. \] Minimizing this energy encourages adjacent vertices to have similar values. This idea appears in graph signal processing, spectral clustering, and semi-supervised learning.
Training many models means minimizing a loss: \[ \min_\theta \frac1m\sum_{i=1}^m \ell(f_\theta(x_i),y_i). \] Linear algebra enters through matrix representations of data, gradients, Hessians, least squares, regularization, eigenvalues, and iterative methods.
Show that for every matrix \(A\) and every \(\lambda>0\), the matrix \[ A^TA+\lambda I \] is positive definite.
Let \(Q=Q^T\succ 0\). Explain why a large condition number \[ \kappa(Q)=\frac{\lambda_{\max}}{\lambda_{\min}} \] can make gradient descent slow.
For \[ \min_x \frac12\|x-y\|^2 \quad\text{subject to}\quad Cx=d, \] derive the KKT system.
Let \[ Q=\begin{bmatrix}3&1\\1&2\end{bmatrix}, \qquad b=\begin{bmatrix}1\\4\end{bmatrix}. \] Find the minimizer of \[ f(x)=\frac12x^TQx-b^Tx. \]
Let \[ A=\begin{bmatrix} 1&0\\ 1&1\\ 1&2 \end{bmatrix}, \qquad b=\begin{bmatrix}1\\2\\2\end{bmatrix}. \] Write the normal equations for the least squares problem.
For \[ f(x,y)=x^2+4y^2+2xy-6x, \] find the gradient and Hessian. Is \(f\) convex?
Let \[ A=\begin{bmatrix}1&1\\1&1.01\\1&0.99\end{bmatrix}. \] Explain why ridge regression may be preferable to ordinary least squares.
Let \[ A=A^T. \] Explain why the largest eigenvalue of \(A\) solves an optimization problem.
Use an AI assistant as a study partner, not as a replacement for your own reasoning.
Concept explanation. Ask:
“Explain why least squares is both a projection problem and an optimization problem.”
Proof checker. Write your own proof of the normal equations and ask the AI to find missing assumptions.
Python debugging. Ask the AI to help debug a gradient descent implementation, then verify every line yourself.
Geometric explanation. Ask:
“Why does the condition number of a positive definite matrix affect the speed of gradient descent?”
Model comparison. Ask:
“Compare least squares, ridge regression, and constrained least squares using the language of linear algebra.”
In this chapter: