Crash Course Cheat Sheet: Numerical Analysis for Optimization
A quick reference guide for key concepts in numerical analysis relevant to optimization, covering ODE solvers and numerical linear algebra.
This cheat sheet summarizes key concepts from Numerical Analysis relevant to understanding and developing optimization algorithms in Machine Learning. It covers essential topics from numerical methods for Ordinary Differential Equations (ODEs) and Numerical Linear Algebra (NLA).
Numerical Methods for Ordinary Differential Equations (ODEs)
| Concept / Method | Key Formula / Description | Relevance to Optimization / Key Insight | | ————————————- | ————————————————————————————————————————————————– | ————————————————————————————————————————————————————————– | | ODE & Initial Value Problem (IVP) | ODE:
. IVP: with
. | Models continuous opt. paths like Gradient Flow (
) or Heavy Ball ODE. Opt. algos are discretizations. | | Discretization & Finite Diff. | Approx. solution at
with step
. Fwd Diff:
. | Foundation for turning continuous ODE models into iterative algorithms. Step size
often maps to learning rate
. | | Explicit Euler Method |
1
| Gradient Descent (
) is Explicit Euler on Gradient Flow. 1st order global error (
). | | Implicit Euler Method |
1
| Often better stability (e.g., A-stable), allowing larger steps. Requires solving for
at each step. | | Stability of Numerical Methods | Errors don’t cause divergence. Conditional (e.g., Explicit Euler:
must be small) vs. Unconditional (e.g., Implicit Euler for some problems). | Explains why GD diverges if learning rate
(
) is too large. Relates to max stable
(e.g.,
for quadratic). | | Systems & Higher-Order ODEs | Convert
-order ODE to system of
first-order ODEs:
. Solve component-wise. | Heavy Ball ODE (
) for momentum is 2nd order, converted to a system. Discretization yields momentum updates. | | Linear Multistep Methods (LMMs) | Use multiple past steps:
. | Polyak’s momentum (
) can be seen as a 2-step LMM. |
Numerical Linear Algebra (NLA) for Optimization
| Concept / Method | Key Formula / Description | Relevance to Optimization / Key Insight | | ————————————- | ——————————————————————————————————————————————————- | ————————————————————————————————————————————————————————————————————— | | **Condition Number
** |
. For SPD Hessian
,
. | Measures problem sensitivity. High
means ill-conditioned loss surface (long, narrow valleys), slows GD convergence. Factor
. | | **Solving Linear Systems
** | Direct methods (LU, Cholesky for SPD;
) for small/dense
. Iterative methods (CG, GMRES) for large/sparse
. | Core of Newton-type methods (solve
). Iterative methods are crucial for large-scale optimization. | | Conjugate Gradient (CG) Method | Iterative solver for SPD
. Generates
-conjugate search directions
. Requires matrix-vector products (
). | Solves Newton systems
efficiently without forming/inverting
. Convergence rate depends on
; much faster than GD for ill-cond. problems. | | Preconditioning | Transform
to
(or similar).
and
is easy to compute. Goal:
. | Speeds up iterative solvers like CG by improving effective condition number. Preconditioned CG (PCG) for faster Newton steps. | | Types of Preconditioners | Diagonal (Jacobi):
. Incomplete Cholesky (IC). Structured: Block-diag, Kronecker (e.g., K-FAC, Shampoo). | Diagonal preconditioning is the basis for adaptive methods (AdaGrad, RMSProp, Adam). Structured preconditioners for advanced optimizers. | | Newton’s Method (in Optimization) | Step:
. Solves Hessian system
. | Uses 2nd-order (curvature) info for potentially quadratic convergence. System usually solved with (P)CG for large problems. | | Quasi-Newton Methods | Approximate Hessian
or inverse
using gradient info (e.g., BFGS). L-BFGS for large scale (limited memory). | Avoids explicit Hessian computation/storage. L-BFGS uses past
vectors to implicitly compute
. | | Adaptive Optimizers (Adam, etc.) | E.g., Adam:
. The term
scales learning rate per parameter. |
(running avg of squared gradients) acts as a diagonal estimate of preconditioning, attempting to normalize gradient steps. Improves conditioning locally. |