Post

Crash Course: Numerical Methods for ODEs in Optimization

A primer on numerical methods for solving Ordinary Differential Equations, tailored for understanding optimization algorithms like Gradient Descent, Momentum, and their continuous-time limits.

Crash Course: Numerical Methods for ODEs in Optimization

Part I — Foundations: Accuracy, Stability, and Core Integrators

This post provides a primer on the foundational concepts of numerical integration for ordinary differential equations (ODEs). We’ll explore the core ideas of accuracy and stability and introduce the workhorse integrator families: θ-methods, Runge-Kutta methods, and linear multistep methods. The goal is to build intuition for how these methods connect to and underpin common optimization algorithms. For simplicity, we often write

\[ f(y) \]

for autonomous ODEs where the dynamics do not explicitly depend on time

\[ t \]

.

1. Motivation & ODE Models of Optimization

Many optimization algorithms can be viewed as discretizations of continuous-time dynamical systems described by ODEs. This perspective is powerful; it allows us to analyze algorithm behavior using the well-developed theory of numerical integration.

Two fundamental ODE models in optimization are:

  • Gradient Flow: The simplest model describes the path of steepest descent on a loss surface
\[ L(\theta) \]

. The trajectory follows the negative gradient, formally written as:

1
2
3
4
5
<div class="math-block" markdown="0"> \[ \dot\theta(t) = -\nabla L(\theta(t)) \]
</div>


Here, 
\[ \dot\theta \]

represents the time derivative of the parameters

\[ \theta \]

. For gradient flow,

\[ \tfrac{d}{dt}L(\theta(t))=\nabla L(\theta(t))^\top \dot\theta(t)=-\Vert\nabla L(\theta(t))\Vert^2\le 0 \]

, so the loss is non-increasing along trajectories.

  • Heavy-Ball Momentum: This model introduces a second-order momentum term, analogous to a physical object with mass
\[ m \]

moving through a viscous medium with friction coefficient

\[ \gamma \]

:

1
2
3
4
5
<div class="math-block" markdown="0"> \[ m\ddot\theta(t) + \gamma\dot\theta(t) + \nabla L(\theta(t)) = 0 \]
</div>


This system often converges faster than pure gradient flow by incorporating inertia, which accelerates convergence, particularly on strongly convex problems, when appropriately damped.

A Note on Existence and Uniqueness

For these ODEs to be useful models, we need assurance that a solution exists and is unique for a given starting point. The Picard–Lindelöf theorem guarantees this, provided the function defining the dynamics (e.g.,

\[ -\nabla L(\theta) \]

) is continuous and satisfies a Lipschitz condition. This means the function’s rate of change is bounded, preventing the solution from “blowing up” in finite time.

Example. Gradient flow on a quadratic: closed-form and monotonicity

Let

\[ L(\theta)=\tfrac12,\theta^\top H\theta \]

with

\[ H\succ 0 \]

. Gradient flow

\[ \dot\theta=-H\theta \]

has solution

\[ \theta(t)=e^{-Ht},\theta(0). \]

Then

\[ \frac{d}{dt}L(\theta(t))=\nabla L(\theta(t))^\top \dot\theta(t)=-(H\theta)^\top(H\theta)=-\Vert H\theta\Vert^2\le 0, \]

so the loss decreases strictly unless

\[ \theta=0 \]

. Mode-wise, if

\[ H q_i=\lambda_i q_i \]

, then

\[ \theta^{(i)}(t)=e^{-\lambda_i t}\theta^{(i)}(0) \]

.

Example. Heavy-ball as a first-order system and energy dissipation

Write

\[ x_1=\theta,\ x_2=\dot\theta \]

. The ODE

\[ \dot x_1=x_2,\qquad \dot x_2=-\frac{\gamma}{m}x_2-\frac{1}{m}\nabla L(x_1) \]

has Lyapunov function

\[ \mathcal{E}(x_1,x_2)=L(x_1)+\tfrac{m}{2}\Vert x_2\Vert^2 \]

with

\[ \dot{\mathcal{E}}=\nabla L^\top x_2 + m x_2^\top \dot x_2 = \nabla L^\top x_2 + m x_2^\top\Big(-\tfrac{\gamma}{m}x_2-\tfrac{1}{m}\nabla L\Big) = -\gamma \Vert x_2\Vert^2 \le 0. \]

Thus inertia + damping dissipates kinetic energy while still descending in

\[ L \]

.

2. Consistency, Stability, and Convergence

A method is consistent if its one-step local truncation error (LTE) tends to zero as

\[ h\to 0 \]

. If the LTE is

\[ O(h^{p+1}) \]

, the method has **order

\[ p \]

** and, under stability, its global error is

\[ O(h^{p}) \]

. For linear multistep methods (LMMs), the Dahlquist equivalence theorem states: consistency + zero-stability ⇒ convergence. Here zero-stability means small perturbations in starting values do not blow up as

\[ h\to 0 \]

, and it is characterized by the root condition on the characteristic polynomial (see the box below). For one-step methods (e.g., Runge-Kutta), zero-stability is automatic; stability is analyzed via the stability function on the test equation.

Root Condition for Zero-Stability

A linear multistep method is zero-stable if and only if all roots of its first characteristic polynomial

\[ \rho(\zeta) \]

lie within or on the unit circle in the complex plane, and any roots on the unit circle are simple.

Example. LTE of Explicit Euler via Taylor

Exact solution satisfies

\[ y(t_{n+1})=y(t_n)+h f(t_n,y(t_n))+\tfrac{h^2}{2},y''(\xi_n),\quad \xi_n\in(t_n,t_{n+1}). \]

Explicit Euler does

\[ y_{n+1}=y_n+h f(t_n,y_n) \]

. The local truncation error (inserting the exact

\[ y(t_n) \]

) is

\[ \tau_{n+1}=\frac{y(t_{n+1})-\big(y(t_n)+h f(t_n,y(t_n))\big)}{h}=\tfrac{h}{2},y''(\xi_n)=\mathcal{O}(h). \]

Hence LTE

\[ =\mathcal{O}(h^2) \]

and global error

\[ =\mathcal{O}(h) \]

(order 1).

Example. Zero-stability check for AB2

AB2 is

\[ y_{n+1}=y_n+\tfrac{h}{2}(3f_n-f_{n-1}) \]

. The first characteristic polynomial (apply to

\[ \dot y=0 \]

) is

\[ \rho(\zeta)=\zeta^2-\zeta. \]

Its roots are

\[ {0,1} \]

: inside/on the unit circle and the unit-modulus root is simple ⇒ zero-stable.

3. Absolute Stability & Stiffness

While zero-stability addresses error accumulation for

\[ \dot{y}=0 \]

, we often need to understand stability for more general dynamics. This leads to the concept of absolute stability.

We analyze stability using the Dahlquist test equation:

\[ \dot{y} = \lambda y, \quad \text{Re}(\lambda) < 0 \]

Applying a numerical method to this equation with step size

\[ h \]

yields a recurrence of the form

\[ y_{n+1} = R(z)y_n \]

, where

\[ z = h\lambda \]

. The function

\[ R(z) \]

is the method’s stability function. The region of absolute stability is the set of all

\[ z \in \mathbb{C} \]

for which

\[ \vert R(z) \vert \le 1 \]

. For asymptotic decay of the numerical solution of

\[ \dot y=\lambda y \]

, one needs

\[ \vert R(z)\vert<1 \]

. On the boundary

\[ \vert R(z)\vert=1 \]

the method may be neutrally stable.

  • A-stability: A method is A-stable if its stability region contains the entire left half-plane (
\[ \text{Re}(z) \le 0 \]

).

  • L-stability: A method is L-stable if it is A-stable and additionally
\[ \lim_{\text{Re}(z) \to -\infty} \vert R(z) \vert = 0 \]

.

  • Stiffness: A problem is considered stiff when the step size required for stability is much smaller than the step size needed for accuracy.

For a Runge-Kutta method with Butcher tableau given by coefficients

\[ (A, b, c) \]

, the stability function is:

\[ R(z) = 1 + z b^\top(I - zA)^{-1}e \]

where

\[ e \]

is the vector of ones. A crucial result is that no explicit Runge-Kutta method can be A-stable.

Example. Real-axis bound for Explicit Euler on

\[ \dot y=\lambda y \]

For Explicit Euler,

\[ R(z)=1+z \]

with

\[ z=h\lambda \]

. If

\[ \lambda\in\mathbb{R}_{<0} \]

, stability demands

\[ \vert 1+h\lambda\vert \le 1 \iff -2\le h\lambda\le 0 \iff 0\le h\le \frac{2}{\vert \lambda\vert }. \]

This mirrors GD’s step bound on a quadratic with Lipschitz constant

\[ L=\vert \lambda\vert \]

.

Example. A textbook stiff test:

\[ y'=-1000(y-\sin t)+\cos t \]

Exact solution is smooth, but the Jacobian scale

\[ \sim -1000 \]

forces Explicit Euler to use

\[ h\lesssim 2/1000 \]

for stability, even though accuracy might permit larger steps. Implicit Euler or trapezoid remain stable for much larger

\[ h \]

(A-stable), illustrating stability- vs accuracy-limited step sizes.

4. The θ-methods: Unifying Euler and Trapezoid

The θ-methods are a simple family of one-step integrators. The update rule is given by:

\[ y_{n+1}=y_n+h\big[(1-\theta)f(t_n,y_n)+\theta f(t_{n+1},y_{n+1})\big]. \]

Its stability function on

\[ \dot y=\lambda y \]

is:

\[ R(z)=\frac{1+(1-\theta)z}{1-\theta z},\qquad z=h\lambda. \]

This family includes three famous methods:

  1. **
\[ \theta=0 \]

(Explicit Euler):** Order 1. Not A-stable.

  1. **
\[ \theta=1 \]

(Implicit Euler):** Order 1. A-stable and L-stable.

  1. **
\[ \theta=1/2 \]

(Trapezoidal Rule):** Order 2. A-stable but not L-stable.

Connection to Optimization

The explicit Euler method is directly equivalent to the Gradient Descent algorithm. Applying explicit Euler to the gradient flow ODE

\[ \dot\theta = -\nabla L(\theta) \]

with a step size (learning rate)

\[ \alpha \]

gives the familiar update:

\[ \theta_{n+1} = \theta_n - \alpha \nabla L(\theta_n) \]

For a quadratic loss

\[ L(\theta)=\tfrac12\theta^\top H\theta \]

with eigenvalues

\[ 0<\lambda_i\le \lambda_{\max}(H)=L \]

, explicit Euler (= GD with step

\[ \alpha \]

) evolves mode-wise as

\[ \theta^{(i)}_{n+1}=(1-\alpha\lambda_i)\theta^{(i)}_n \]

. Stability requires

\[ \vert 1-\alpha\lambda_i\vert<1\ \forall i \]

, i.e.

\[ 0<\alpha<\frac{2}{L}. \]

Derivation. Stability function for the θ-method

On

\[ \dot y=\lambda y \]

,

\[ y_{n+1}=y_n+h\big((1-\theta)\lambda y_n+\theta \lambda y_{n+1}\big) \ \Rightarrow\ (1-\theta z) y_{n+1}=(1+(1-\theta)z) y_n,\quad z=h\lambda, \]

so

\[ R(z)=\frac{1+(1-\theta)z}{1-\theta z}. \]

For

\[ \theta=1 \]

(Implicit Euler),

\[ R(z)=1/(1-z) \]

is A- and L-stable; for

\[ \theta=\tfrac12 \]

,

\[ R(z)=\frac{1+\tfrac z2}{1-\tfrac z2} \]

is A-stable but not L-stable.

Example. Semi-implicit update that yields Polyak momentum

Heavy-ball as first-order system:

\[ \dot\theta=v,\qquad \dot v=-\tfrac{\gamma}{m}v-\tfrac{1}{m}\nabla L(\theta). \]

A semi-implicit Euler step (explicit in

\[ \theta \]

, implicit in

\[ v \]

):

\[ \theta_{k+1}=\theta_k+h v_k,\qquad v_{k+1}=\frac{v_k-\tfrac{h}{m}\nabla L(\theta_k)}{1+\tfrac{\gamma h}{m}}. \]

Let

\[ \mu=\frac{1}{1+\tfrac{\gamma h}{m}} \]

and define

\[ u_k=h v_k \]

. Then

\[ u_{k+1}=\mu u_k - \frac{h^2}{m}\mu \nabla L(\theta_k),\qquad \theta_{k+1}=\theta_k+u_k, \]

which is a Polyak-style momentum with coefficients linked to ((h,\gamma,m)).

5. Runge-Kutta (RK) Methods

Runge-Kutta methods are one-step methods that achieve higher orders of accuracy by evaluating

\[ f \]

at intermediate points within a step. An

\[ s \]

-stage RK method is defined by its Butcher tableau:

\[ \begin{array}{c|c} c & A \\ \hline & b^\top \end{array} \]

(Reminder:

\[ c_i=\sum_j a_{ij} \]

for explicit Runge-Kutta.)

Causality (within a step). In an explicit Runge-Kutta method, the stage vector depends only on already-computed stages of the same step: the Butcher matrix

\[ A \]

is strictly lower triangular, i.e.,

\[ a_{ij}=0 \]

for

\[ j\ge i \]

. Equivalently, the current stage

\[ k_i \]

never uses “future” stages

\[ k_j \]

with

\[ j\ge i \]

. This mirrors causality/autoregressivity in ML: the update at the current (sub)time index depends only on past information, not on future indices. By contrast, implicit RK allows

\[ a_{ii}\neq 0 \]

(and dependencies upward), coupling stages via a solve—which is non-causal within the step and requires simultaneous computation.

Derivation. Matrix form and the causal structure of explicit RK

Stack the stages

\[ k=(k_1,\dots,k_s)^\top \]

. For an RK method,

\[ k = f\Big(t_n + c,h,; y_n \mathbf{1} + h,A,k\Big),\qquad y_{n+1} = y_n + h,b^\top k, \]

where

\[ A\in\mathbb{R}^{s\times s} \]

,

\[ b,c\in\mathbb{R}^s \]

, and

\[ \mathbf{1} \]

is the vector of ones (broadcasted as needed).

  • Explicit RK:
\[ A \]

is strictly lower triangular:

\[ A=\begin{pmatrix} 0 & 0 & \cdots & 0 \\ a_{21} & 0 & \cdots & 0 \\ a_{31} & a_{32} & \ddots & \vdots \\ \vdots & \vdots & \ddots & 0 \end{pmatrix}. \]

Then

\[ k_1=f(t_n,y_n) \]

;

\[ k_2 \]

depends on

\[ k_1 \]

;

\[ k_3 \]

depends on

\[ k_1,k_2 \]

; etc. We can compute stages sequentially—a causal, feed-forward evaluation within the step.

  • Implicit RK: some
\[ a_{ii}\neq 0 \]

(or entries above the strict lower triangle), so

\[ k_i \]

appears on both sides. Stages are coupled and require solving (e.g., via Newton or fixed-point iterations). This breaks step-internal causality and is analogous to bidirectional/globally coupled updates.

ML analogy.

  • Explicit RK ≈ autoregressive updates within a step (like forward RNN without look-ahead); no dependence on future sub-indices.
  • Implicit RK ≈ globally coupled inference per step (like solving a consistency condition), trading more computation per step for stability (A-/L-stability) on stiff dynamics.

The order conditions are equations in terms of

\[ A, b, c \]

that must be satisfied.

  • Order 1:
\[ \sum_i b_i=1 \]
  • Order 2:
\[ \sum_i b_i c_i=\tfrac12 \]
  • Order 3:
\[ \sum_i b_i c_i^2=\tfrac13,\quad \sum_{i,j} b_i a_{ij} c_j=\tfrac16 \]

Embedded pairs & FSAL. A 5(4) pair computes

\[ y^{(5)} \]

and

\[ y^{(4)} \]

with shared stages; their difference estimates error for adaptivity. Dormand–Prince 5(4) (the classic ode45 choice) uses 7 stages but has FSAL, so successful steps reuse the last stage as the next step’s first—about 6 function evaluations per accepted step. Tsitouras 5(4) is a modern pair with similar accuracy and excellent efficiency in practice.

Step-size controllers, often based on PI or PID control theory, use the stream of local error estimates to adjust the step size

\[ h \]

smoothly, aiming to keep the error below specified tolerances (rtol/atol).

Example. One RK4 step on a 2D linear system

Take

\[ \dot y=A y \]

with

\[ A=\begin{pmatrix} 0 & 1\ -2 & -3 \end{pmatrix},\quad y_0=\binom{1}{0},\ h=0.1. \]

Compute stages

\[ k_1=Ay_n,\ k_2=A\big(y_n+\tfrac{h}{2}k_1\big), k_3=A\big(y_n+\tfrac{h}{2}k_2\big), k_4=A\big(y_n+h k_3\big), \]

then

\[ y_{n+1}=y_n+\tfrac{h}{6}(k_1+2k_2+2k_3+k_4). \]

This concretely shows 4 fev/step and the weighted averaging of slopes.

Example. Embedded 5(4) error estimate and accept/reject

Suppose an embedded pair returns

\[ y^{(5)}\ast {n+1},,y^{(4)}\ast {n+1} \]

with

\[ e=y^{(5)}\ast {n+1}-y^{(4)}\ast {n+1},\qquad \Vert e\Vert_{\mathrm{sc}}=\frac{\Vert e\Vert}{\texttt{atol}+\texttt{rtol},\Vert y^{(5)}\ast {n+1}\Vert}. \]

If

\[ \Vert e\Vert\ast {\mathrm{sc}}=0.8\le 1 \]

, accept and (in Part II) increase

\[ h \]

modestly; if

\[ =1.6>1 \]

, reject and retry with a smaller step. With FSAL, accepted steps reuse the final stage as the next step’s first evaluation.

6. Linear Multistep Methods (LMMs)

Unlike one-step methods, a

\[ k \]

-step LMM uses information from the previous

\[ k \]

steps to compute the next point. They are causal across time steps, as they only use past values of

\[ y \]

and

\[ f \]

.

\[ \sum_{j=0}^{k} \alpha_j y_{n+j} = h \sum_{j=0}^{k} \beta_j f_{n+j} \]
  • Adams-Bashforth (explicit) and Adams-Moulton (implicit) methods: These are often combined in predictor-corrector pairs. MATLAB’s ode113 is a variable-step, variable-order (VSVO) solver based on this family.

    Why “1 fev/step after startup”? The AB predictor uses only stored

\[ f \]

-values; a single fresh

\[ f(t_{n+1},\tilde y_{n+1}) \]

powers the AM corrector and the step’s error estimate.

  • BDF (backward differentiation formulas) are implicit, stiff-robust multistep methods. They are **A-stable only for orders
\[ q\le 2 \]

** (BDF1 = implicit Euler, BDF2), and **zero-stable only up to

\[ q\le 6 \]

**. Orders (3)–(6) are not A-stable but have large left-half-plane sectors (A(

\[ \alpha \]

)-stability with large wedges) that are effective on many stiff problems.

Example. ABM (PECE) predictor–corrector in one step

Given

\[ y_n,y_{n-1} \]

and stored

\[ f_n,f_{n-1} \]

:

  1. Predict (AB2):
\[ \tilde y_{n+1}=y_n+\tfrac{h}{2}(3 f_n - f_{n-1}). \]
  1. Evaluate
\[ f_{n+1}=f(t_{n+1},\tilde y_{n+1}). \]
  1. Correct (AM2):
\[ y_{n+1}=y_n+\tfrac{h}{2}(f_{n+1}+f_n). \]

After startup, only step 2 uses a fresh function evaluation ⇒ about 1 fev/accepted step.

Derivation. BDF2 recurrence on the test equation

BDF2 reads

\[ \frac{3y_{n+1}-4y_n+y_{n-1}}{2h}=f(t_{n+1},y_{n+1}). \]

For

\[ \dot y=\lambda y \]

with

\[ z=h\lambda \]

,

\[ (3-2z)y_{n+1}=4y_n-y_{n-1}. \]

Let

\[ y_{n}=r^n \]

; the amplification polynomial is

\[ (3-2z)r^2-4r+1=0. \]

Stability holds when both roots satisfy

\[ \vert r\vert <1 \]

, which is true for all

\[ \Re z\le 0 \]

(BDF2 is A-stable) though not L-stable.

Example. Momentum as a 2-step LMM and zero-stability

The discrete momentum update

\[ \theta_{n+1}=(1+\mu)\theta_n-\mu\theta_{n-1}-\alpha\nabla L(\theta_n) \]

has characteristic polynomial (on

\[ \dot\theta=0 \]

)

\[ \rho(\zeta)=\zeta^2-(1+\mu)\zeta+\mu. \]

Its roots are

\[ 1 \]

and

\[ \mu \]

. Zero-stability requires roots in the unit disk and simple on the unit circle ⇒ (\mu\in[0,1)) in standard uses.

7. Worked Connections to Optimization

  • GD and Euler Stability: As mentioned, Gradient Descent is explicit Euler on gradient flow. Stability limits on the learning rate
\[ \alpha \]

correspond directly to the stability region of the explicit Euler method.

  • Heavy-Ball Momentum: The heavy-ball ODE can be written as a first-order system. Discretizing this system using a semi-implicit Euler method leads directly to the Polyak momentum update.
  • Momentum as an LMM: The standard momentum update can also be viewed as a 2-step explicit LMM. The momentum recurrence
\[ (\theta_{n+1}=(1+\mu)\theta_n-\mu\theta_{n-1}-\alpha\nabla L(\theta_n)) \]

has characteristic polynomial

\[ \zeta^2-(1+\mu)\zeta+\mu \]

with roots

\[ 1 \]

and

\[ \mu \]

. Zero-stability requires all roots in the unit disk and simple on the unit circle ⇒ **

\[ \mu\in[0,1) \]

** in standard uses. —

What You Should Remember

  • Order vs. Function Evaluations: Higher-order methods are more accurate for small step sizes but typically require more function evaluations per step.
  • Stability Regions: The size and shape of the stability region determine a method’s suitability for different types of problems, especially stiff ones. A-stability is a key property for stiff solvers.
  • Adaptivity: For most problems, adaptive step-sizing using embedded pairs (like Dormand-Prince) is far more efficient than using a fixed step size.
  • Euler ↔ GD: The simplest numerical method, explicit Euler, is equivalent to the fundamental optimization algorithm, Gradient Descent. This connection provides a bridge for analyzing optimization methods using ODE theory.

References

  • E. Hairer, S. P. Nørsett, G. Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems (Springer). Authoritative for RK theory, order conditions, and stability; see Chs. II–IV. (SpringerLink)
  • E. Hairer, G. Wanner. Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems (Springer). Core reference for stiffness, BDF, Rosenbrock/linearly-implicit methods. (SpringerLink)
  • J. C. Butcher. Numerical Methods for Ordinary Differential Equations (Wiley). Deep dive on Runge-Kutta order theory and stability. (Wiley Online Library)
  • J. R. Dormand, P. J. Prince. “A family of embedded Runge–Kutta formulae.” J. Comput. Appl. Math., 1980. Classic source of the 5(4) pair behind ode45; FSAL and stage coefficients. (ScienceDirect)
  • C. Tsitouras. “Runge–Kutta pairs of order 5(4) satisfying only the first column simplifying assumption.” Appl. Numer. Math., 2011. Modern efficient 5(4) pairs (Tsit5). (ScienceDirect)
  • G. Dahlquist. “A special stability problem for linear multistep methods.” BIT, 1963. The classic paper underpinning LMM stability barriers and the equivalence theorem. (math.unipd.it)
  • P. E. Kloeden, E. Platen. Numerical Solution of Stochastic Differential Equations (Springer). Standard reference for Euler–Maruyama, Milstein, strong/weak orders (you’ll cite this in Part II). (SpringerLink)
  • MATLAB ode45 documentation. Notes that ode45 uses the Dormand–Prince explicit RK 5(4) pair with adaptive steps. Useful practical cross-reference. (mathworks.com)
This post is licensed under CC BY 4.0 by the author.