Momentum: Second-Order Gradient Flow ODE and a Multi-Step Method

Delving into the momentum optimization method, understanding its origins from both second-order 'heavy ball' dynamics and as a linear multi-step method for first-order gradient flow, and extending to adaptive methods like RMSProp and Adam.

Introduction: The “Rolling Ball” Intuition

When we optimize a function, especially in high-dimensional spaces typical of machine learning, vanilla Gradient Descent (GD) can be painstakingly slow. It often zig-zags down narrow valleys and gets stuck in flat regions or saddle points. The momentum method is a popular technique to overcome these issues by drawing an analogy to a physical system: a ball rolling down a hill. This ball not only moves in the direction of steepest descent (the gradient) but also accumulates “momentum” from its past movements, helping it to power through flat regions and dampen oscillations in ravines.

In this post, we’ll explore two fundamental ways to derive and understand momentum, particularly Polyak’s Heavy Ball (PHB) method:

  1. As a discretization of a system of first-order Ordinary Differential Equations (ODEs) derived from a second-order ODE representing a physical system (the “heavy ball” with friction).
  2. As a Linear Multi-step Method (LMM) applied to the first-order gradient flow ODE.

We will also extend this ODE perspective to understand adaptive optimization algorithms like RMSProp and Adam. These perspectives reveal that momentum and related techniques are not just ad-hoc tricks but principled approaches rooted in the mathematics of dynamical systems and numerical analysis.

Recap: Gradient Descent and the Gradient Flow ODE

Before diving into momentum, let’s briefly recall Gradient Descent. To minimize a differentiable function \(f(\mathbf x)\), GD iteratively updates the parameters \(\mathbf x\) in the direction opposite to the gradient \(\nabla f(\mathbf x)\):

\[\mathbf x_{k+1} = \mathbf x_k - \eta \nabla f(\mathbf x_k)\]

where \(\eta > 0\) is the learning rate.

This update can be seen as an explicit or forward Euler discretization of the gradient flow ODE:

\[\dot{\mathbf x}(t) = -\nabla f(\mathbf x(t))\]

where \(\dot{\mathbf x}(t)\) denotes the derivative of \(\mathbf x\) with respect to time \(t\). The solutions to this ODE, called gradient flow trajectories, continuously follow the path of steepest descent.
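To make this correspondence concrete, here is a minimal sketch (Python with NumPy; the quadratic test function and step count are purely illustrative) of GD as forward Euler applied to the gradient flow ODE:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, n_steps=100):
    """Forward Euler on dx/dt = -grad f(x) with step size h = eta,
    which is exactly the GD iteration x_{k+1} = x_k - eta * grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad_f(x)
    return x

# Illustrative quadratic f(x) = 0.5 * x^T A x, with gradient A x.
A = np.diag([1.0, 10.0])
print(gradient_descent(lambda x: A @ x, x0=[1.0, 1.0]))  # approaches the minimizer 0
```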

Perspective 1: Momentum from a Second-Order ODE (The Heavy Ball)

The physical intuition of a ball rolling down a hill can be formalized using Newton’s second law of motion. Consider a particle of mass \(m\) moving in a potential field defined by \(f(\mathbf x)\). The force exerted on the particle by the potential is \(-\nabla f(\mathbf x)\). If there’s also a friction or viscous drag force proportional to its velocity \(\dot{\mathbf x}(t)\) with a damping coefficient \(\gamma \ge 0\), the equation of motion is:

\[m \ddot{\mathbf x}(t) + \gamma \dot{\mathbf x}(t) + \nabla f(\mathbf x(t)) = 0\]

Here, \(\ddot{\mathbf x}(t)\) is the acceleration. This is a second-order ODE.

To derive an optimization algorithm, a standard approach is to first convert this second-order ODE into an equivalent system of two first-order ODEs. Let \(\mathbf v_{phys}(t) = \dot{\mathbf x}(t)\) be the physical velocity. Then \(\dot{\mathbf v}_{phys}(t) = \ddot{\mathbf x}(t)\), and substituting this into the equation of motion (after dividing by \(m\)) yields: \(\dot{\mathbf v}_{phys}(t) = -\frac{\gamma}{m} \mathbf v_{phys}(t) - \frac{1}{m} \nabla f(\mathbf x(t))\). Thus, the second-order ODE is equivalent to the following system of first-order ODEs:

\[\begin{align*} \dot{\mathbf x}(t) &= \mathbf v_{phys}(t) \\ \dot{\mathbf v}_{phys}(t) &= -\frac{\gamma}{m} \mathbf v_{phys}(t) - \frac{1}{m} \nabla f(\mathbf x(t)) \end{align*}\]

We now discretize this system to obtain an iterative optimization algorithm. The specific way we approximate the derivatives and evaluate terms determines the resulting algorithm; the choice below leads directly to Polyak’s Heavy Ball method.

Derivation: Polyak’s Heavy Ball from the System of First-Order ODEs

Let \(h > 0\) be the time discretization step size. We denote \(\mathbf x_k \approx \mathbf x(kh)\) and \(\mathbf v_{phys,k} \approx \mathbf v_{phys}(kh)\).

We discretize the system of ODEs as follows:

  1. For the position update \(\dot{\mathbf x}(t) = \mathbf v_{phys}(t)\), we use the velocity at the end of the interval, \(\mathbf v_{phys,k+1}\), to determine the change in position:

    \[\frac{\mathbf x_{k+1} - \mathbf x_k}{h} = \mathbf v_{phys,k+1} \quad \implies \quad \mathbf x_{k+1} = \mathbf x_k + h \mathbf v_{phys,k+1}\]
  2. For the velocity update \(\dot{\mathbf v}_{phys}(t) = -\frac{\gamma}{m} \mathbf v_{phys}(t) - \frac{1}{m} \nabla f(\mathbf x(t))\), we use a forward Euler discretization, evaluating terms at time \(t_k = kh\):

    \[\frac{\mathbf v_{phys,k+1} - \mathbf v_{phys,k}}{h} = -\frac{\gamma}{m} \mathbf v_{phys,k} - \frac{1}{m} \nabla f(\mathbf x_k)\] \[\implies \mathbf v_{phys,k+1} = \mathbf v_{phys,k} - \frac{h\gamma}{m} \mathbf v_{phys,k} - \frac{h}{m} \nabla f(\mathbf x_k) = \left(1 - \frac{h\gamma}{m}\right)\mathbf v_{phys,k} - \frac{h}{m} \nabla f(\mathbf x_k)\]

    This specific choice of discretization (updating the velocity first using \(\mathbf x_k\), then updating the position using the new velocity) is a form of symplectic Euler method. More precisely, for dissipative systems like this one with a friction term, it is known as the dissipative (or conformal symplectic) version of the symplectic-Euler integrator (see, e.g., Zhang et al. 2020, “Conformal Symplectic and Relativistic Optimization,” NeurIPS Proceedings). For readers familiar with classical mechanics, this clarifies how friction fits into such integrators.

Now, let’s define the “momentum” term \(\mathbf v_k\) as it appears in the standard algorithm. This term is typically a scaled version of the physical velocity, representing the accumulated change. Let \(\mathbf v_k := h \mathbf v_{phys,k}\). Then \(\mathbf v_{k+1} = h \mathbf v_{phys,k+1}\). Substituting these into our discretized equations:

  • The velocity update becomes:

    \[\frac{\mathbf v_{k+1}}{h} = \left(1 - \frac{h\gamma}{m}\right)\frac{\mathbf v_k}{h} - \frac{h}{m} \nabla f(\mathbf x_k)\]

    Multiplying by \(h\) gives:

    \[\mathbf v_{k+1} = \left(1 - \frac{h\gamma}{m}\right)\mathbf v_k - \frac{h^2}{m} \nabla f(\mathbf x_k)\]
  • The position update becomes:

    \[\mathbf x_{k+1} = \mathbf x_k + \mathbf v_{k+1}\]

If we define the learning rate \(\eta = \frac{h^2}{m}\) and the momentum parameter \(\beta = 1 - \frac{h\gamma}{m}\), the update rules are:

\[\begin{align*} \mathbf v_{k+1} &= \beta \mathbf v_k - \eta \nabla f(\mathbf x_k) \\ \mathbf x_{k+1} &= \mathbf x_k + \mathbf v_{k+1} \end{align*}\]

This is precisely the two-variable form of Polyak’s Heavy Ball (PHB) method.
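A minimal implementation sketch of this two-variable form (the quadratic objective and hyperparameter values are illustrative, not prescriptive):

```python
import numpy as np

def heavy_ball(grad_f, x0, eta=0.01, beta=0.9, n_steps=500):
    """Polyak's Heavy Ball, two-variable (velocity) form:
    v_{k+1} = beta * v_k - eta * grad f(x_k);  x_{k+1} = x_k + v_{k+1}."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)  # common initialization v_0 = 0
    for _ in range(n_steps):
        v = beta * v - eta * grad_f(x)
        x = x + v
    return x

# On an ill-conditioned quadratic, the accumulated velocity damps GD's zig-zagging.
A = np.diag([1.0, 10.0])
print(heavy_ball(lambda x: A @ x, x0=[1.0, 1.0]))  # approaches the minimizer 0
```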

Equivalence to the single-variable form: This two-variable system can be rewritten as a single second-order difference equation for \(\mathbf x_k\). From the position update, we have \(\mathbf v_{k+1} = \mathbf x_{k+1} - \mathbf x_k\). Consequently, \(\mathbf v_k = \mathbf x_k - \mathbf x_{k-1}\) (for \(k \ge 1\), assuming consistent initialization). Substituting these into the velocity update equation:

\[(\mathbf x_{k+1} - \mathbf x_k) = \beta (\mathbf x_k - \mathbf x_{k-1}) - \eta \nabla f(\mathbf x_k)\]

Rearranging for \(\mathbf x_{k+1}\), we get the single-variable form of PHB:

\[\mathbf x_{k+1} = \mathbf x_k - \eta \nabla f(\mathbf x_k) + \beta (\mathbf x_k - \mathbf x_{k-1})\]

Connection back to the original second-order ODE: This second-order difference equation can be shown to be a direct finite difference approximation of the original second-order ODE. Rearranging the single-variable form:

\[\mathbf x_{k+1} - (1+\beta)\mathbf x_k + \beta \mathbf x_{k-1} = -\eta \nabla f(\mathbf x_k)\]

Substitute back the definitions of \(\eta = \frac{h^2}{m}\) and \(\beta = 1 - \frac{h\gamma}{m}\):

\[\mathbf x_{k+1} - \left(1 + \left(1 - \frac{h\gamma}{m}\right)\right)\mathbf x_k + \left(1 - \frac{h\gamma}{m}\right)\mathbf x_{k-1} = -\frac{h^2}{m} \nabla f(\mathbf x_k)\] \[\mathbf x_{k+1} - \left(2 - \frac{h\gamma}{m}\right)\mathbf x_k + \left(1 - \frac{h\gamma}{m}\right)\mathbf x_{k-1} = -\frac{h^2}{m} \nabla f(\mathbf x_k)\]

Multiply by \(m/h^2\):

\[\frac{m}{h^2} \left( \mathbf x_{k+1} - \left(2 - \frac{h\gamma}{m}\right)\mathbf x_k + \left(1 - \frac{h\gamma}{m}\right)\mathbf x_{k-1} \right) + \nabla f(\mathbf x_k) = 0\]

Group terms to match standard finite difference formulas:

\[\frac{m}{h^2} \left( (\mathbf x_{k+1} - 2\mathbf x_k + \mathbf x_{k-1}) + \frac{h\gamma}{m}(\mathbf x_k - \mathbf x_{k-1}) \right) + \nabla f(\mathbf x_k) = 0\] \[m \left( \frac{\mathbf x_{k+1} - 2\mathbf x_k + \mathbf x_{k-1}}{h^2} \right) + \gamma \left( \frac{\mathbf x_k - \mathbf x_{k-1}}{h} \right) + \nabla f(\mathbf x_k) = 0\]

This equation corresponds to approximating the original ODE \(m \ddot{\mathbf x}(t) + \gamma \dot{\mathbf x}(t) + \nabla f(\mathbf x(t)) = 0\) at time \(t_k = kh\) using:

  • Central difference for acceleration: \(\ddot{\mathbf x}(t_k) \approx \frac{\mathbf x_{k+1} - 2\mathbf x_k + \mathbf x_{k-1}}{h^2}\)
  • Backward difference for velocity (in the damping term): \(\dot{\mathbf x}(t_k) \approx \frac{\mathbf x_k - \mathbf x_{k-1}}{h}\)

This confirms that the derived PHB algorithm is a consistent discretization of the heavy ball ODE. The system-based derivation makes the choice of these specific finite differences more systematic.

The algorithm’s parameters are related to the physical system’s parameters and the discretization step size as follows:

  • Learning rate: \(\eta = \frac{h^2}{m}\)
  • Momentum parameter: \(\beta = 1 - \frac{h\gamma}{m}\)
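For concreteness, a quick computation with illustrative physical values (these particular numbers are only an example) shows how the mapping produces a familiar hyperparameter pairing:

```python
m, gamma, h = 1.0, 1.0, 0.1   # illustrative mass, friction coefficient, step size
eta = h**2 / m                # learning rate: 0.01
beta = 1 - h * gamma / m      # momentum parameter: 0.9
print(eta, beta)              # -> approximately (0.01, 0.9)
```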

Polyak’s Heavy Ball (PHB) Method

The update rule for Polyak’s Heavy Ball method is often expressed in two equivalent forms:

  1. Two-variable (velocity) form:

    \[\begin{align*} \mathbf v_{k+1} &= \beta \mathbf v_k - \eta \nabla f(\mathbf x_k) \\ \mathbf x_{k+1} &= \mathbf x_k + \mathbf v_{k+1} \end{align*}\]

    Here, \(\mathbf v_k\) is the momentum term. Initializing \(\mathbf v_0 = \mathbf 0\) is common.

  2. Single-variable (three-term recurrence) form:

    \[\mathbf x_{k+1} = \mathbf x_k - \eta \nabla f(\mathbf x_k) + \beta (\mathbf x_k - \mathbf x_{k-1})\]

    This form requires two starting points, \(\mathbf x_0\) and \(\mathbf x_{-1}\). A common convention is to set \(\mathbf x_{-1} = \mathbf x_0\), so the momentum term vanishes at \(k=0\). Equivalently, initializing \(\mathbf v_0 = \mathbf 0\) in the two-variable form gives \(\mathbf v_1 = -\eta \nabla f(\mathbf x_0)\), so the first step is a plain gradient step \(\mathbf x_1 = \mathbf x_0 - \eta \nabla f(\mathbf x_0)\).

The parameters are the learning rate \(\eta > 0\) and the momentum coefficient \(\beta \in [0, 1)\). The two forms are equivalent if \(\mathbf v_k = \mathbf x_k - \mathbf x_{k-1}\) (for \(k \ge 1\)), or more generally, if \(\mathbf v_0\) is initialized, then \(\mathbf v_k\) represents the accumulated momentum, and \(\mathbf x_{k+1}-\mathbf x_k = \mathbf v_{k+1}\).
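As a quick sanity check, a short sketch (illustrative quadratic objective and hyperparameters) confirms that the two forms generate identical iterates when \(\mathbf v_0 = \mathbf 0\) and \(\mathbf x_{-1} = \mathbf x_0\):

```python
import numpy as np

grad_f = lambda x: np.diag([1.0, 10.0]) @ x  # illustrative quadratic gradient
eta, beta, n = 0.01, 0.9, 50
x0 = np.array([1.0, 1.0])

# Two-variable (velocity) form with v_0 = 0.
x, v = x0.copy(), np.zeros(2)
for _ in range(n):
    v = beta * v - eta * grad_f(x)
    x = x + v

# Single-variable three-term recurrence with x_{-1} = x_0.
x_prev, x_cur = x0.copy(), x0.copy()
for _ in range(n):
    x_prev, x_cur = x_cur, x_cur - eta * grad_f(x_cur) + beta * (x_cur - x_prev)

print(np.allclose(x, x_cur))  # True: the two forms coincide
```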

This derivation shows that the momentum term arises naturally from the inertia (\(m\)) and damping (\(\gamma\)) of the physical system, as captured by the discretization scheme:

  • A larger mass \(m\) (relative to \(h^2\)) implies a smaller effective learning rate \(\eta\).
  • A larger friction coefficient \(\gamma\) (relative to \(m/h\)) implies a smaller momentum parameter \(\beta\).
  • If \(\gamma h / m = 1\), then \(\beta=0\). In the two-variable form, \(\mathbf v_{k+1} = -\eta \nabla f(\mathbf x_k)\) and hence \(\mathbf x_{k+1} = \mathbf x_k - \eta \nabla f(\mathbf x_k)\); equivalently, the single-variable form reduces to \(\mathbf x_{k+1} = \mathbf x_k - \eta \nabla f(\mathbf x_k)\). Either way, standard Gradient Descent is recovered: no momentum from past steps is carried forward beyond the current gradient.
  • If \(\gamma = 0\) (no friction), then \(\beta=1\). This can lead to oscillations or instability if not paired with a small enough step size \(\eta\).

Perspective 2: Momentum as a Linear Multi-step Method

Another powerful way to understand momentum is by viewing it as a Linear Multi-step Method (LMM) for solving an ODE. LMMs are a class of numerical methods that use information from previous time steps to approximate the solution at the current step.

Definition: Linear Multi-step Method (LMM)

For an initial value problem \(\dot{\mathbf y}(t) = \mathbf F(t, \mathbf y(t))\) with \(\mathbf y(t_0) = \mathbf y_0\), an \(s\)-step LMM is defined by:

\[\sum_{j=0}^s \alpha_j \mathbf y_{n+j} = h_{LMM} \sum_{j=0}^s \beta_j \mathbf F(t_{n+j}, \mathbf y_{n+j})\]

where \(h_{LMM}\) is the step size, \(\alpha_j\) and \(\beta_j\) are method-specific coefficients, and by convention \(\alpha_s=1\).

  • If \(\beta_s = 0\), the method is explicit.
  • If \(\beta_s \neq 0\), the method is implicit.

We can show that Polyak’s Heavy Ball method can also be interpreted as a specific instance of an LMM applied to the first-order gradient flow ODE \(\dot{\mathbf x}(t) = -\nabla f(\mathbf x(t))\). This provides an alternative perspective on how it uses past information.

Derivation: Polyak’s Heavy Ball as a Linear Multi-step Method

Consider the first-order gradient flow ODE:

\[\dot{\mathbf x}(t) = -\nabla f(\mathbf x(t))\]

Here, \(\mathbf y(t) \equiv \mathbf x(t)\) and \(\mathbf F(t, \mathbf x(t)) \equiv -\nabla f(\mathbf x(t))\) in the LMM definition.

Now, let’s look at Polyak’s Heavy Ball update rule in its single-variable form:

\[\mathbf x_{k+1} = \mathbf x_k - \eta \nabla f(\mathbf x_k) + \beta (\mathbf x_k - \mathbf x_{k-1})\]

We can rearrange this to:

\[\mathbf x_{k+1} - (1+\beta)\mathbf x_k + \beta \mathbf x_{k-1} = -\eta \nabla f(\mathbf x_k)\]

This equation perfectly matches the LMM form. It’s a 2-step method (\(s=2\)). Let \(n+j\) map to our indices such that \(n+s = k+1\). If we set the current time index \(n=k-1\), then:

  • \(\mathbf y_{n+2} = \mathbf x_{k+1}\) (value at the new step)
  • \(\mathbf y_{n+1} = \mathbf x_k\) (value at the current step)
  • \(\mathbf y_{n} = \mathbf x_{k-1}\) (value at the previous step)

The coefficients \(\alpha_j\) for the left-hand side \(\sum_{j=0}^2 \alpha_j \mathbf y_{n+j}\) are:

  • \(\alpha_2 = 1\) (coefficient of \(\mathbf x_{k+1}\), since \(j=s=2\))
  • \(\alpha_1 = -(1+\beta)\) (coefficient of \(\mathbf x_k\))
  • \(\alpha_0 = \beta\) (coefficient of \(\mathbf x_{k-1}\))

For the right-hand side \(h_{LMM} \sum_{j=0}^s \beta_j \mathbf F(t_{n+j}, \mathbf y_{n+j})\), we compare it with \(-\eta \nabla f(\mathbf x_k)\). The gradient term \(-\nabla f(\mathbf x_k)\) corresponds to \(\mathbf F(t_{n+1}, \mathbf y_{n+1})\).

  • Since the PHB update for \(\mathbf x_{k+1}\) does not depend on \(\nabla f(\mathbf x_{k+1})\), the method is explicit. This means \(\beta_s = \beta_2 = 0\) in the LMM definition.
  • The gradient is evaluated only at \(\mathbf x_k \equiv \mathbf y_{n+1}\). Thus, only \(\beta_1\) can be non-zero; \(\beta_0\) must be 0.
  • This leaves the LMM right-hand side as \(h_{LMM} \beta_1 \mathbf F(t_{n+1}, \mathbf y_{n+1})\).
  • Equating this to the PHB right-hand side: \(h_{LMM} \beta_1 \mathbf F(t_{n+1}, \mathbf y_{n+1}) = -\eta \nabla f(\mathbf x_k)\).
  • Since \(\mathbf F(t_{n+1}, \mathbf y_{n+1}) = -\nabla f(\mathbf x_k)\), we have \(h_{LMM} \beta_1 (-\nabla f(\mathbf x_k)) = -\eta \nabla f(\mathbf x_k)\).
  • This implies \(h_{LMM} \beta_1 = \eta\).

There are different ways to satisfy this:

  1. Choose the LMM step size \(h_{LMM} = \eta\). Then \(\beta_1 = 1\).
  2. Choose \(h_{LMM} = 1\). Then \(\beta_1 = \eta\). This effectively absorbs the step size into the definition of \(\mathbf F\), or considers \(\eta\) as the combined \(h_{LMM}\beta_1\).

So, with \(\alpha_2=1, \alpha_1=-(1+\beta), \alpha_0=\beta\) and \(\beta_2=0, \beta_1=\eta/h_{LMM}, \beta_0=0\), Polyak’s Heavy Ball method is an explicit 2-step linear multi-step method applied to the gradient flow ODE \(\dot{\mathbf x} = -\nabla f(\mathbf x)\). The “momentum” term \((\mathbf x_k - \mathbf x_{k-1})\) is how this LMM incorporates past information (\(\mathbf x_{k-1}\)) to determine the next step \(\mathbf x_{k+1}\).

The LMM structure of PHB allows us to use tools from numerical ODE analysis, such as characteristic polynomials, to study its properties.

Analysis: Characteristic Polynomials and Consistency of PHB as an LMM

For an LMM, we define two characteristic polynomials:

  • First characteristic polynomial: \(\rho(z) = \sum_{j=0}^s \alpha_j z^j\)
  • Second characteristic polynomial: \(\sigma(z) = \sum_{j=0}^s \beta_j z^j\)

For Polyak’s method, using our derived coefficients (and choosing \(h_{LMM}=1\) for simplicity in defining \(\beta_j\), so \(\beta_1=\eta\)):

  • \[\rho(z) = \alpha_2 z^2 + \alpha_1 z + \alpha_0 = z^2 - (1+\beta)z + \beta\]
  • \(\sigma(z) = \beta_2 z^2 + \beta_1 z + \beta_0 = \eta z\) (since only \(\beta_1=\eta\) is non-zero among \(\beta_0, \beta_1, \beta_2\) for \(s=2\))

An LMM is consistent (meaning it has order of accuracy \(p \ge 1\)) if it satisfies two conditions related to the behavior at \(z=1\):

  1. \(\rho(1) = 0\). (Relatedly, zero-stability, which is necessary for convergence, requires the root condition: all roots of \(\rho(z)\) must lie within or on the unit circle, and any roots on the unit circle must be simple.)
  2. Consistency condition for order \(p \ge 1\): \(\rho'(1) = \sigma(1)\).

Let’s check these for PHB as an LMM for \(\dot{\mathbf x} = -\nabla f(\mathbf x)\):

  1. \(\rho(1) = 1^2 - (1+\beta)(1) + \beta = 1 - 1 - \beta + \beta = 0\). This condition is satisfied for any \(\beta\). The roots of \(\rho(z)=0\) are \(z^2-(1+\beta)z+\beta=0\), which are \(z=1\) and \(z=\beta\). For zero-stability, we need \(\vert\beta\vert \le 1\). Typically for PHB, \(0 \le \beta < 1\), so this is met.

  2. For the second condition: \(\rho'(z) = 2z - (1+\beta)\), so \(\rho'(1) = 2 - (1+\beta) = 1-\beta\). \(\sigma(1) = \eta \cdot 1 = \eta\). So, the consistency condition \(\rho'(1)=\sigma(1)\) (which ensures order at least 1) requires \(1-\beta = \eta\).

This particular relationship, \(1-\beta = \eta\), is known as the “critically damped” setting for PHB when analyzing its convergence on quadratic functions. It implies a specific tuning between the momentum parameter and the learning rate for the method to be a first-order accurate approximation (in the LMM local truncation error sense) of the gradient flow ODE \(\dot{\mathbf x} = -\nabla f(\mathbf x)\). However, PHB is an effective optimization algorithm even when \(1-\beta \neq \eta\). The LMM formulation primarily describes its algebraic structure as a numerical integrator. Its effectiveness as an optimizer is analyzed through other means (e.g., Lyapunov stability for the discrete updates, convergence rates on specific function classes). The fact that it is an LMM by its form is significant, regardless of whether this specific LMM consistency condition (which relates to the local truncation error) is met for arbitrary \(\eta, \beta\).
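Both conditions are easy to check numerically; a small sketch follows (the values of \(\beta\) and \(\eta\) are chosen to satisfy \(1-\beta=\eta\), purely for illustration):

```python
import numpy as np

beta, eta = 0.9, 0.1

# First characteristic polynomial: rho(z) = z^2 - (1 + beta) z + beta.
rho_coeffs = np.array([1.0, -(1 + beta), beta])
print(np.roots(rho_coeffs))  # roots 1 and beta: zero-stable since |beta| <= 1

# Consistency (order >= 1) requires rho'(1) = sigma(1), i.e. 1 - beta = eta.
rho_prime_at_1 = 2 - (1 + beta)  # rho'(z) = 2z - (1 + beta), evaluated at z = 1
sigma_at_1 = eta                 # sigma(z) = eta * z, evaluated at z = 1
print(np.isclose(rho_prime_at_1, sigma_at_1))  # True for this (beta, eta) pair
```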

Connecting the Two Perspectives

The beauty is that these two perspectives are deeply connected.

  1. We started with a physical system (heavy ball with friction) described by a second-order ODE.
  2. Converting this to a system of first-order ODEs and then discretizing it led directly to Polyak’s Heavy Ball update rule (in its two-variable and equivalent single-variable forms).
  3. This update rule, particularly in the single-variable form \(\mathbf x_{k+1} - (1+\beta)\mathbf x_k + \beta \mathbf x_{k-1} = -\eta \nabla f(\mathbf x_k)\), has the algebraic structure of a 2-step Linear Multi-step Method.
  4. This LMM can be seen as a way to solve the simpler first-order gradient flow ODE \(\dot{\mathbf x} = -\nabla f(\mathbf x)\), but using information from multiple past steps (\(\mathbf x_k, \mathbf x_{k-1}\)) instead of just one (\(\mathbf x_k\)) as in Euler’s method (standard GD).

So, the “inertia” from the physical model (which persists velocity) translates into the “memory” of the LMM (which uses previous iterates). The momentum term, crucial for navigating complex optimization landscapes, is essentially how the LMM leverages past states to make a more informed step, aiming to better approximate the trajectory of the underlying gradient flow dynamics or, from the other perspective, to emulate the behavior of a physical system with inertia.

A Glimpse at Nesterov’s Accelerated Gradient (NAG)

Nesterov’s Accelerated Gradient (NAG) is another highly successful momentum-based method, often outperforming PHB, especially in convex optimization. Its update rule is subtly different:

Nesterov’s Accelerated Gradient (NAG)

Using an explicit velocity term \(\mathbf v_k\):

\[\begin{align*} \mathbf v_{k+1} &= \beta \mathbf v_k - \eta \nabla f(\mathbf x_k + \beta \mathbf v_k) \\ \mathbf x_{k+1} &= \mathbf x_k + \mathbf v_{k+1} \end{align*}\]

The key difference from PHB is that the gradient is evaluated at a “look-ahead” point \(\mathbf x_k + \beta \mathbf v_k\), rather than at \(\mathbf x_k\). (Note: there are multiple equivalent formulations of NAG; this is a common one. Another introduces the look-ahead point explicitly, setting \(\mathbf y_k = \mathbf x_k + \beta(\mathbf x_k - \mathbf x_{k-1})\) and then taking a gradient step from it, \(\mathbf x_{k+1} = \mathbf y_k - \eta \nabla f(\mathbf y_k)\).)
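A minimal sketch of NAG in the velocity form above (illustrative quadratic and hyperparameters):

```python
import numpy as np

def nesterov(grad_f, x0, eta=0.01, beta=0.9, n_steps=500):
    """NAG: the gradient is evaluated at the look-ahead point
    x_k + beta * v_k rather than at x_k (the only change from PHB)."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_steps):
        v = beta * v - eta * grad_f(x + beta * v)  # look-ahead gradient
        x = x + v
    return x

A = np.diag([1.0, 10.0])
print(nesterov(lambda x: A @ x, x0=[1.0, 1.0]))  # approaches the minimizer 0
```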

Interestingly, NAG also has a continuous-time ODE interpretation. Su, Boyd, and Candès (2016, JMLR; arXiv:1403.4645) showed that a version of NAG can be seen as a discretization of the following second-order ODE:

\[\ddot{\mathbf x}(t) + \frac{k_d}{t} \dot{\mathbf x}(t) + \nabla f(\mathbf x(t)) = 0\]

with a constant \(k_d \ge 3\) (the canonical choice is \(k_d = 3\)). Notice the time-dependent damping term \(\frac{k_d}{t}\): as time \(t\) increases, the damping decreases. This “adaptive” damping is thought to be one reason for NAG’s superior performance in some settings. This ODE can also be derived from a Bregman-Lagrangian variational perspective, as explored by Wibisono, Wilson, and Jordan (2016, PNAS; arXiv:1603.04245). Deriving NAG from this ODE involves a more complex discretization scheme than the one used for PHB.

While NAG can also be written in a multi-step form, its “look-ahead” gradient evaluation makes its LMM interpretation for the simple gradient flow ODE \(\dot{\mathbf x} = -\nabla f(\mathbf x)\) less direct or natural than for PHB. It’s more directly a discretization of its own unique second-order ODE.

Adaptive Methods: RMSProp and Adam from an ODE Perspective

Adaptive optimization methods like RMSProp and Adam adjust the learning rate for each parameter dynamically, often based on moving averages of past gradients and squared gradients. Their continuous-time interpretations typically involve systems of coupled ODEs, reflecting this dynamic interplay. For the derivations below, we simplify by omitting the small \(\epsilon\) typically added to denominators for numerical stability, as well as Adam’s bias correction terms, to focus on the core dynamics and reveal the underlying ODE structure. A desirable property, namely that parameters do not move when the gradient is zero, holds automatically when the update step is proportional to the gradient (as in RMSProp). For methods with momentum (like Adam), a zero current gradient contributes no new “push”, but accumulated momentum can still move the parameters.

RMSProp: Adaptive Scaling of Gradients

RMSProp (Root Mean Square Propagation) scales the learning rate for each parameter by an estimate of the root mean square of recent gradients for that parameter. It is similar to Adagrad, but instead of dividing by the L2 norm of all past gradients, it divides by a weighted quadratic mean (a weighted RMS) of past gradients. (See Wikipedia’s page on power means.)

The discrete update rules for RMSProp (simplified, without \(\epsilon\) and assuming component-wise operations for vectors like \((\nabla f(\mathbf x_k))^2\) and division/sqrt) are:

  1. Update the squared gradient accumulator \(\mathbf s_k\):

    \[\mathbf s_{k+1} = \beta_2 \mathbf s_k + (1-\beta_2) (\nabla f(\mathbf x_k))^2\]
  2. Update parameters \(\mathbf x_k\):

    \[\mathbf x_{k+1} = \mathbf x_k - \frac{\eta}{\sqrt{\mathbf s_{k+1}}} \nabla f(\mathbf x_k)\]

    Here, \(\beta_2\) is the decay rate for the moving average of squared gradients (typically close to 1, e.g., 0.999), and \(\eta\) is the learning rate. If any component of \(\sqrt{\mathbf s_{k+1}}\) is zero, the update for that component of \(\mathbf x_k\) would be undefined and must be handled by a safeguard (such as adding \(\epsilon\), or skipping the update for that component). If \(\nabla f(\mathbf x_k)=\mathbf 0\), then \(\mathbf x_{k+1}=\mathbf x_k\) and \(\mathbf s_{k+1} = \beta_2 \mathbf s_k\) (the accumulator decays). A minimal implementation sketch follows below.
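This sketch implements the updates above; the objective and hyperparameters are illustrative, and the \(\epsilon\) safeguard omitted in the derivation is included for numerical safety:

```python
import numpy as np

def rmsprop(grad_f, x0, eta=0.01, beta2=0.999, eps=1e-8, n_steps=1000):
    """RMSProp as written above: EMA of squared gradients for scaling."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad_f(x)
        s = beta2 * s + (1 - beta2) * g**2    # accumulator update (element-wise)
        x = x - eta / (np.sqrt(s) + eps) * g  # adaptively scaled step
    return x

A = np.diag([1.0, 10.0])
# Settles near the minimizer, hovering within roughly eta, since the
# normalized step behaves sign-like close to the optimum.
print(rmsprop(lambda x: A @ x, x0=[1.0, 1.0]))
```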

Special Case: RMSProp with Uniform Averaging vs. Adagrad

RMSProp typically uses an exponential moving average for the squared gradients with a constant decay rate \(\beta_2\). However, a specific time-dependent choice for the weighting factor \(1-\beta_2\) can make RMSProp perform uniform averaging of all past squared gradients. Let \(\mathbf g_i = \nabla f(\mathbf x_i)\). If we set the weight for the current squared gradient \(\mathbf g_k^2\) to be \(1/(k+1)\) (where \(k\) is the iteration index, starting from \(k=0\); so \(t=k+1\) is the count of gradients observed including the current one), this means \(1-\beta_2 = 1/(k+1)\), and thus \(\beta_2 = k/(k+1)\).

The RMSProp accumulator update then becomes:

\[\mathbf s_{k+1} = \beta_2 \mathbf s_k + (1-\beta_2) \mathbf g_k^2 = \frac{k}{k+1} \mathbf s_k + \frac{1}{k+1} \mathbf g_k^2\]

Assuming \(\mathbf s_0 = \mathbf 0\), this recurrence leads to (verifiable by induction):

  • For \(k=0\) (first iteration, \(t=1\)): \(\mathbf s_1 = \frac{0}{1}\mathbf s_0 + \frac{1}{1}\mathbf g_0^2 = \mathbf g_0^2\)
  • For \(k=1\) (second iteration, \(t=2\)): \(\mathbf s_2 = \frac{1}{2}\mathbf s_1 + \frac{1}{2}\mathbf g_1^2 = \frac{1}{2}\mathbf g_0^2 + \frac{1}{2}\mathbf g_1^2 = \frac{1}{2}(\mathbf g_0^2+\mathbf g_1^2)\)

In general, this choice yields:

\[\mathbf s_{k+1} = \frac{1}{k+1} \sum_{i=0}^k \mathbf g_i^2\]

This expression shows that \(\mathbf s_{k+1}\) is the arithmetic mean (a uniform average) of all squared gradients from \(\mathbf g_0^2\) to \(\mathbf g_k^2\).

The Adagrad algorithm, on the other hand, accumulates the sum of all past squared gradients (element-wise):

\[\mathbf G_k = \sum_{i=0}^k \mathbf g_i^2\]

Thus, for this specific RMSProp variant, the accumulator \(\mathbf s_{k+1}\) is related to Adagrad’s accumulator \(\mathbf G_k\) by \(\mathbf s_{k+1} = \mathbf G_k / (k+1)\).
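This identity is easy to verify numerically; a small sketch with a synthetic gradient history (the random gradients are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=(50, 3))  # synthetic gradient history g_0, ..., g_49

s = np.zeros(3)
for k, g in enumerate(grads):
    s = k / (k + 1) * s + 1 / (k + 1) * g**2  # beta_2 = k / (k + 1)

# s now equals the arithmetic mean of all squared gradients, i.e. G_k / (k + 1).
print(np.allclose(s, (grads**2).mean(axis=0)))  # True
```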

Now, let’s compare their parameter update rules, including a small constant \(\epsilon > 0\) typically added for numerical stability:

  • RMSProp (with uniform averaging as described):

    \[\mathbf x_{k+1} = \mathbf x_k - \frac{\eta_{RMSProp}}{\sqrt{\mathbf s_{k+1}} + \epsilon_{RMSProp}} \mathbf g_k = \mathbf x_k - \frac{\eta_{RMSProp}}{\sqrt{\mathbf G_k/(k+1)} + \epsilon_{RMSProp}} \mathbf g_k\]

    This can be rewritten as:

    \[\mathbf x_{k+1} = \mathbf x_k - \frac{\eta_{RMSProp} \sqrt{k+1}}{\sqrt{\mathbf G_k} + \epsilon_{RMSProp}\sqrt{k+1}} \mathbf g_k\]
  • Adagrad:

    \[\mathbf x_{k+1} = \mathbf x_k - \frac{\eta_{Adagrad}}{\sqrt{\mathbf G_k} + \epsilon_{Adagrad}} \mathbf g_k\]

Comparing these two forms, this special variant of RMSProp is equivalent to Adagrad if Adagrad’s global learning rate \(\eta_{Adagrad}\) and stability constant \(\epsilon_{Adagrad}\) were set as \(\eta_{Adagrad}(k) = \eta_{RMSProp} \sqrt{k+1}\) and \(\epsilon_{Adagrad}(k) = \epsilon_{RMSProp}\sqrt{k+1}\).

If both methods use the same base learning rate \(\eta\) and we ignore \(\epsilon\), the effective learning rate for each parameter in this RMSProp variant is scaled by an additional factor of \(\sqrt{k+1}\) compared to Adagrad.

Adagrad’s learning rates are known to diminish, sometimes too aggressively, because \(\sqrt{\mathbf G_k}\) typically grows with \(k\). If \(\sqrt{\mathbf G_k}\) grows roughly as \(\sqrt{k+1}\) (e.g., if gradients have, on average, constant magnitudes), Adagrad’s effective learning rate decays as \(O(1/\sqrt{k+1})\). In contrast, this RMSProp variant’s effective learning rate would remain approximately constant (\(\eta \sqrt{k+1} / \sqrt{k+1} \approx \eta\)). This behavior—preventing aggressive decay of learning rates—is a primary motivation for RMSProp. Standard RMSProp, which uses a constant \(\beta_2\) (e.g., 0.9 or 0.999), “forgets” old gradients more strongly by giving exponentially less weight to older gradients, which further helps in stabilizing the effective learning rate over long training periods compared to Adagrad.

In terms of the ODE perspective, if \(1-\beta_2(t) = 1/t\), the ODE for \(\mathbf s(t)\) becomes \(\dot{\mathbf s}(t) = \frac{1}{t}((\nabla f(\mathbf x(t)))^2 - \mathbf s(t))\). This is a linear first-order ODE whose solution (with \(t_0 \mathbf s(t_0) \to \mathbf 0\) as \(t_0 \to 0\)) is \(\mathbf s(t) = \frac{1}{t}\int_{t_0}^t (\nabla f(\mathbf x(\tau)))^2 d\tau\). The parameter update ODE \(\dot{\mathbf x}(t) = -\eta_0 \frac{1}{\sqrt{\mathbf s(t)}} \nabla f(\mathbf x(t))\) then becomes \(\dot{\mathbf x}(t) = -\eta_0 \frac{\sqrt{t}}{\sqrt{\int (\nabla f(\mathbf x(\tau)))^2 d\tau}} \nabla f(\mathbf x(t))\), explicitly showing the \(\sqrt{t}\) scaling factor in the continuous-time analog.

The continuous-time behavior of RMSProp can be described by a system of ODEs. We derive this by interpreting the discrete updates as approximations of continuous dynamics.

Derivation: Continuous-Time ODE System for RMSProp

To derive the continuous-time Ordinary Differential Equation (ODE) system corresponding to RMSProp, we interpret the discrete update rules as a forward Euler discretization with an effective step size of \(h=1\). This means one iteration of the algorithm corresponds to one unit of time evolution in the ODE.

Let \(\mathbf s_k \approx \mathbf s(t)\) and \(\mathbf x_k \approx \mathbf x(t)\), where \(t\) is the continuous time variable and \(t_k = k \cdot h = k\).

  1. **ODE for the squared gradient accumulator \(\mathbf s(t)\):** The discrete update for \(\mathbf s_k\) is:

    \[\mathbf s_{k+1} = \beta_2 \mathbf s_k + (1-\beta_2) (\nabla f(\mathbf x_k))^2\]

    Rearranging this gives:

    \[\mathbf s_{k+1} - \mathbf s_k = (1-\beta_2) ((\nabla f(\mathbf x_k))^2 - \mathbf s_k)\]

    If we view \(\mathbf s_{k+1} - \mathbf s_k\) as an approximation to \(\dot{\mathbf s}(t) \cdot h\) where the step size \(h=1\), this directly suggests the ODE:

    \[\dot{\mathbf s}(t) = (1-\beta_2) ((\nabla f(\mathbf x(t)))^2 - \mathbf s(t))\]

    This equation describes \(\mathbf s(t)\) as an exponential moving average of the squared gradients \((\nabla f(\mathbf x(t)))^2\), where the term \((1-\beta_2)\) acts as the inverse of the averaging time constant (if time is measured in units of iterations).

  2. **ODE for the parameters \(\mathbf x(t)\):** The discrete update for \(\mathbf x_k\) is:

    \[\mathbf x_{k+1} = \mathbf x_k - \frac{\eta}{\sqrt{\mathbf s_{k+1}}} \nabla f(\mathbf x_k)\]

    Rearranging this gives:

    \[\mathbf x_{k+1} - \mathbf x_k = - \frac{\eta}{\sqrt{\mathbf s_{k+1}}} \nabla f(\mathbf x_k)\]

    Viewing \(\mathbf x_{k+1} - \mathbf x_k\) as an approximation to \(\dot{\mathbf x}(t) \cdot h\) with \(h=1\), and assuming \(\mathbf s_{k+1} \approx \mathbf s(t)\) in the continuous limit for the denominator (or more precisely, that the update uses \(\mathbf s(t_{k+1})\) which is consistent with an implicit-explicit step if \(\mathbf s\) is updated first), we get the ODE:

    \[\dot{\mathbf x}(t) = -\frac{\eta_0}{\sqrt{\mathbf s(t)}} \nabla f(\mathbf x(t))\]

    where \(\eta_0\) is the learning rate parameter from the algorithm. Note that the term \(\sqrt{\mathbf s_{k+1}}\) in the discrete update for \(\mathbf x_k\) means that \(\mathbf s\) is updated first, and then its new value \(\mathbf s_{k+1}\) is used for the \(\mathbf x\) update. In the ODE limit, this distinction often blurs to \(\mathbf s(t)\).

The continuous-time system for RMSProp is thus:

\[\begin{align*} \dot{\mathbf x}(t) &= -\frac{\eta_0}{\sqrt{\mathbf s(t)}} \nabla f(\mathbf x(t)) \\ \dot{\mathbf s}(t) &= (1-\beta_2) ((\nabla f(\mathbf x(t)))^2 - \mathbf s(t)) \end{align*}\]

(Operations involving \(\mathbf s(t)\) and \((\nabla f(\mathbf x(t)))^2\) are element-wise.)
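As a consistency check, integrating this system with forward Euler at step size \(h=1\), updating \(\mathbf s\) before \(\mathbf x\) as in the derivation, reproduces the discrete RMSProp iteration exactly (sketch below; the quadratic objective and parameter values are illustrative):

```python
import numpy as np

eta0, beta2, h = 0.01, 0.999, 1.0
grad_f = lambda x: np.diag([1.0, 10.0]) @ x  # illustrative quadratic gradient

x, s = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    g = grad_f(x)
    s = s + h * (1 - beta2) * (g**2 - s)  # Euler step for s-dot; with h = 1 this is
                                          # exactly s_{k+1} = beta2*s + (1-beta2)*g^2
    x = x + h * (-eta0 / np.sqrt(s) * g)  # Euler step for x-dot, using the updated s
print(x)
```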

Interpretation of the RMSProp ODE System:

  • The equation for \(\dot{\mathbf s}(t)\) shows that \(\mathbf s(t)\) tracks an exponential moving average of the squared gradients. Each component \(s_i(t)\) estimates the recent typical magnitude (squared) of the gradient component \(\nabla_i f(\mathbf x(t))\).
  • The equation for \(\dot{\mathbf x}(t)\) describes a preconditioned gradient flow. The term \(\text{diag}(1/\sqrt{s_i(t)})\) acts as a diagonal preconditioner. It adaptively scales the learning rate for each parameter: parameters with historically large gradient components (large \(s_i(t)\)) receive smaller effective learning rates, while those with small gradient components receive larger ones.
  • This system describes a particle moving in the potential \(f(\mathbf x)\), where its “mobility” or “inverse friction” in each coordinate direction is dynamically adjusted based on the history of forces experienced in that direction.
  • Connection to SignSGD: If \(\mathbf s(t)\) could track \((\nabla f(\mathbf x(t)))^2\) perfectly and instantaneously (e.g., if \(\beta_2 \to 0\), implying \((1-\beta_2) \to 1\), and assuming a steady state where \(\dot{\mathbf s}(t)=\mathbf 0\)), then \(\mathbf s(t)\) would effectively be \((\nabla f(\mathbf x(t)))^2\). In such a hypothetical scenario, the update would approach that of SignSGD: \(\dot{\mathbf x}(t) \approx -\eta_0 \frac{\nabla f(\mathbf x(t))}{\vert \nabla f(\mathbf x(t)) \vert}\) (where the division and absolute value are element-wise). Since \(\mathbf s(t)\) is a smoothed average, RMSProp provides a smoothed approximation to this normalization.

It’s natural to ask if this system, like the heavy ball ODE, can be reduced to a single second-order ODE for \(\mathbf x(t)\).

Analysis: Can RMSProp be expressed as a single second-order ODE?

Unlike the heavy ball method, reducing this system to a single, clean second-order ODE for \(\mathbf x(t)\) is not straightforward. We can formally differentiate \(\dot{\mathbf x}(t)\) with respect to time:

\[\ddot{\mathbf x}(t) = -\eta_0 \frac{d}{dt}\left( \frac{\nabla f(\mathbf x(t))}{\sqrt{\mathbf s(t)}} \right)\]

Using the quotient rule (or product rule for \(\nabla f(\mathbf x(t)) \cdot \mathbf s(t)^{-1/2}\)), and assuming element-wise operations:

\[\ddot{x}_i(t) = -\eta_0 \left( \frac{\frac{d}{dt}(\nabla_i f(\mathbf x(t))) \sqrt{s_i(t)} - \nabla_i f(\mathbf x(t)) \frac{d}{dt}(\sqrt{s_i(t)})}{s_i(t)} \right)\]

Let’s expand the derivatives:

  • \(\frac{d}{dt}(\nabla_i f(\mathbf x(t))) = \sum_j \frac{\partial^2 f}{\partial x_i \partial x_j} \dot{x}_j(t)\). This is the \(i\)-th component of \(\nabla^2 f(\mathbf x(t)) \dot{\mathbf x}(t)\).
  • \(\frac{d}{dt}(\sqrt{s_i(t)}) = \frac{1}{2\sqrt{s_i(t)}} \dot{s}_i(t) = \frac{1}{2\sqrt{s_i(t)}} (1-\beta_2)((\nabla_i f(\mathbf x(t)))^2 - s_i(t))\).

Substituting these into the expression for \(\ddot{x}_i(t)\) leads to:

\[\ddot{x}_i(t) = -\eta_0 \left( \frac{(\nabla^2 f(\mathbf x(t)) \dot{\mathbf x}(t))_i}{\sqrt{s_i(t)}} - \nabla_i f(\mathbf x(t)) \frac{(1-\beta_2)((\nabla_i f(\mathbf x(t)))^2 - s_i(t))}{2 s_i(t)^{3/2}} \right)\]

This can be rearranged, for instance, by substituting \(\dot{\mathbf x}(t) = -\eta_0 \mathbf s(t)^{-1/2} \nabla f(\mathbf x(t))\) into the Hessian term. The resulting equation for \(\ddot{\mathbf x}(t)\) would involve terms like \(\nabla f(\mathbf x(t))\), \(\nabla^2 f(\mathbf x(t))\), and highly complex coefficients that depend on \(\mathbf s(t)\) (which itself evolves according to its own ODE). It does not simplify to the canonical form \(M\ddot{\mathbf x} + C\dot{\mathbf x} + \nabla f(\mathbf x)=0\) with constant or simple state-dependent (on just \(\mathbf x, \dot{\mathbf x}\)) matrices \(M, C\). The system of first-order ODEs is a more natural and transparent representation for RMSProp.

Adam: Adaptive Moments

Adam (Adaptive Moment Estimation) combines the ideas of momentum (first moment of the gradient) and adaptive scaling like RMSProp (second moment of the gradient).

The simplified discrete update rules (omitting bias correction and \(\epsilon\)):

  1. Update biased first moment estimate \(\mathbf m_k\) (momentum):

    \[\mathbf m_{k+1} = \beta_1 \mathbf m_k + (1-\beta_1) \nabla f(\mathbf x_k)\]
  2. Update biased second moment estimate \(\mathbf s_k\) (scaling):

    \[\mathbf s_{k+1} = \beta_2 \mathbf s_k + (1-\beta_2) (\nabla f(\mathbf x_k))^2\]
  3. Update parameters \(\mathbf x_k\):

    \[\mathbf x_{k+1} = \mathbf x_k - \eta \frac{\mathbf m_{k+1}}{\sqrt{\mathbf s_{k+1}}}\]

    Here, \(\beta_1\) is the decay rate for the momentum term (e.g., 0.9), and \(\beta_2\) for the squared gradient accumulator (e.g., 0.999). If \(\nabla f(\mathbf x_k)=\mathbf 0\), then \(\mathbf m_{k+1} = \beta_1 \mathbf m_k\) and \(\mathbf s_{k+1} = \beta_2 \mathbf s_k\); the parameters still move due to \(\mathbf m_{k+1}\) unless \(\mathbf m_k\) was zero. This is typical momentum behavior. A minimal implementation sketch follows below.
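This sketch implements the simplified updates above (no bias correction; the \(\epsilon\) safeguard omitted in the derivation is included for numerical safety; objective and hyperparameters are illustrative):

```python
import numpy as np

def adam_simplified(grad_f, x0, eta=0.01, beta1=0.9, beta2=0.999,
                    eps=1e-8, n_steps=1000):
    """Simplified Adam: EMA of gradients (m) gives the direction,
    EMA of squared gradients (s) gives per-coordinate scaling."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    s = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g       # first moment (momentum)
        s = beta2 * s + (1 - beta2) * g**2    # second moment (scaling)
        x = x - eta * m / (np.sqrt(s) + eps)  # preconditioned momentum step
    return x

A = np.diag([1.0, 10.0])
print(adam_simplified(lambda x: A @ x, x0=[1.0, 1.0]))  # near the minimizer 0
```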

Summary of Adam Optimizer

In short, Adam is Adagrad with the raw sums replaced by exponential moving averages (EMAs), applied to both the gradient and the normalizing factor. Equivalently, it is a blend of momentum gradient descent and RMSProp.

Similar to RMSProp, we can derive a system of ODEs that describes the continuous-time behavior of Adam.

Derivation: Continuous-Time ODE System for Adam

Following a similar procedure as for RMSProp, we interpret the discrete updates of Adam as a forward Euler discretization with an effective step size of \(h=1\). One iteration of Adam corresponds to one unit of time evolution in its ODE counterpart.

Let \(\mathbf m_k \approx \mathbf m(t)\), \(\mathbf s_k \approx \mathbf s(t)\), and \(\mathbf x_k \approx \mathbf x(t)\), where \(t_k = k\).

  1. **ODE for the first moment estimate \(\mathbf m(t)\):** The discrete update for \(\mathbf m_k\) is:

    \[\mathbf m_{k+1} = \beta_1 \mathbf m_k + (1-\beta_1) \nabla f(\mathbf x_k)\]

    Rearranging:

    \[\mathbf m_{k+1} - \mathbf m_k = (1-\beta_1) (\nabla f(\mathbf x_k) - \mathbf m_k)\]

    Interpreting \(\mathbf m_{k+1} - \mathbf m_k\) as an approximation to \(\dot{\mathbf m}(t) \cdot h\) with \(h=1\), we get:

    \[\dot{\mathbf m}(t) = (1-\beta_1)(\nabla f(\mathbf x(t)) - \mathbf m(t))\]

    This shows \(\mathbf m(t)\) as an exponential moving average of past gradients. It acts like a velocity term that tries to follow the current gradient.

  2. **ODE for the second moment estimate \(\mathbf s(t)\):** The discrete update for \(\mathbf s_k\) is:

    \[\mathbf s_{k+1} = \beta_2 \mathbf s_k + (1-\beta_2) (\nabla f(\mathbf x_k))^2\]

    Rearranging:

    \[\mathbf s_{k+1} - \mathbf s_k = (1-\beta_2) ((\nabla f(\mathbf x_k))^2 - \mathbf s_k)\]

    This yields the ODE:

    \[\dot{\mathbf s}(t) = (1-\beta_2)((\nabla f(\mathbf x(t)))^2 - \mathbf s(t))\]

    This is identical to the RMSProp accumulator ODE.

  3. **ODE for the parameters \(\mathbf x(t)\):** The discrete update for \(\mathbf x_k\) is:

    \[\mathbf x_{k+1} = \mathbf x_k - \eta \frac{\mathbf m_{k+1}}{\sqrt{\mathbf s_{k+1}}}\]

    Rearranging:

    \[\mathbf x_{k+1} - \mathbf x_k = -\eta \frac{\mathbf m_{k+1}}{\sqrt{\mathbf s_{k+1}}}\]

    This yields the ODE (assuming \(\mathbf m_{k+1} \approx \mathbf m(t)\) and \(\mathbf s_{k+1} \approx \mathbf s(t)\), or more precisely, that they reflect values at \(t\) or \(t+h\) depending on update order, which blurs in the limit):

    \[\dot{\mathbf x}(t) = -\eta_0 \frac{\mathbf m(t)}{\sqrt{\mathbf s(t)}}\]

    where \(\eta_0\) is the Adam learning rate parameter. The parameters move in the direction of the momentum \(\mathbf m(t)\), scaled by \(1/\sqrt{\mathbf s(t)}\).

The continuous-time behavior of Adam can be described by the following system of three coupled ODEs:

\[\begin{align*} \dot{\mathbf x}(t) &= -\eta_0 \frac{\mathbf m(t)}{\sqrt{\mathbf s(t)}} \\ \dot{\mathbf m}(t) &= (1-\beta_1) (\nabla f(\mathbf x(t)) - \mathbf m(t)) \\ \dot{\mathbf s}(t) &= (1-\beta_2) ((\nabla f(\mathbf x(t)))^2 - \mathbf s(t)) \end{align*}\]

(Element-wise operations for \(\mathbf m(t), \mathbf s(t)\), and terms in their updates, and in \(\dot{\mathbf x}(t)\) involving division and square root.)

Interpretation of the Adam ODE System:

  • Adam combines a momentum-like update (via \(\mathbf m(t)\)) with adaptive coordinate-wise scaling (via \(\mathbf s(t)\)).
  • \(\dot{\mathbf m}(t)\) shows \(\mathbf m(t)\) as a smoothed version of the gradient, providing inertia.
  • \(\dot{\mathbf s}(t)\) provides the adaptive scaling based on recent gradient magnitudes.
  • \(\dot{\mathbf x}(t)\) uses the smoothed gradient \(\mathbf m(t)\) for direction and scales it adaptively.
  • The system describes a particle whose motion has inertia (\(\mathbf m(t)\)) and whose “mobility” in each direction is adaptively tuned (\(\mathbf s(t)\)). This can be thought of as a “heavy ball” moving on a dynamically changing, anisotropic landscape.
  • The bias correction terms (\(1/(1-\beta_1^k)\) and \(1/(1-\beta_2^k)\)) in practical Adam are important for early iterations to correct the initialization bias of \(\mathbf m_k\) and \(\mathbf s_k\) (which start at \(\mathbf 0\)). In an ODE context, this could be modeled by time-dependent coefficients \((1-\beta_1(t))\) and \((1-\beta_2(t))\) or by specific initial conditions or scalings of \(\mathbf m(t)\) and \(\mathbf s(t)\) for small \(t\). For large \(t\) (many iterations), these corrections become negligible.

Again, we can investigate if this system can be reduced to a single second-order ODE for \(\mathbf x(t)\).

Analysis: Can Adam be expressed as a single second-order ODE?

This is even more complex than for RMSProp due to the additional ODE for \(\mathbf m(t)\). From the first ODE, \(\dot{\mathbf x}(t) = -\eta_0 \frac{\mathbf m(t)}{\sqrt{\mathbf s(t)}}\), we can express \(\mathbf m(t)\) in terms of \(\dot{\mathbf x}(t)\) and \(\mathbf s(t)\) (assuming element-wise invertibility of \(1/\sqrt{\mathbf s(t)}\)):

\[\mathbf m(t) = -\frac{\sqrt{\mathbf s(t)}}{\eta_0} \dot{\mathbf x}(t)\]

Now, differentiate this expression for \(\mathbf m(t)\) with respect to time to get \(\dot{\mathbf m}(t)\):

\[\dot{\mathbf m}(t) = -\frac{1}{\eta_0} \frac{d}{dt} \left( \sqrt{\mathbf s(t)} \dot{\mathbf x}(t) \right)\]

Using the product rule (for each component, treating \(\sqrt{\mathbf s(t)}\) as a diagonal scaling):

\[\dot{m}_i(t) = -\frac{1}{\eta_0} \left( \frac{d\sqrt{s_i(t)}}{dt} \dot{x}_i(t) + \sqrt{s_i(t)} \ddot{x}_i(t) \right)\]

where \(\frac{d\sqrt{s_i(t)}}{dt} = \frac{1}{2\sqrt{s_i(t)}} \dot{s}_i(t)\). So,

\[\dot{m}_i(t) = -\frac{1}{\eta_0} \left( \frac{\dot{s}_i(t)}{2\sqrt{s_i(t)}} \dot{x}_i(t) + \sqrt{s_i(t)} \ddot{x}_i(t) \right)\]

Substitute this \(\dot{\mathbf m}(t)\) and the expression for \(\mathbf m(t)\) into Adam’s second ODE, \(\dot{\mathbf m}(t) = (1-\beta_1) (\nabla f(\mathbf x(t)) - \mathbf m(t))\):

\[-\frac{1}{\eta_0} \left( \frac{\dot{\mathbf s}(t)}{2\sqrt{\mathbf s(t)}} \odot \dot{\mathbf x}(t) + \sqrt{\mathbf s(t)} \odot \ddot{\mathbf x}(t) \right) = (1-\beta_1)\left(\nabla f(\mathbf x(t)) - \left(-\frac{\sqrt{\mathbf s(t)}}{\eta_0}\dot{\mathbf x}(t)\right)\right)\]

(where \(\odot\) denotes element-wise product). Rearranging to look like a second-order ODE for \(\mathbf x(t)\):

\[\frac{\sqrt{\mathbf s(t)}}{\eta_0} \odot \ddot{\mathbf x}(t) + \left[ \frac{\dot{\mathbf s}(t)}{2\eta_0\sqrt{\mathbf s(t)}} + (1-\beta_1)\frac{\sqrt{\mathbf s(t)}}{\eta_0} \right] \odot \dot{\mathbf x}(t) + (1-\beta_1)\nabla f(\mathbf x(t)) = \mathbf 0\]

This formally looks like \(M(\mathbf s)\ddot{\mathbf x}(t) + C(\mathbf s,\dot{\mathbf s})\dot{\mathbf x}(t) + K \nabla f(\mathbf x(t)) = \mathbf 0\), where the “mass” \(M(\mathbf s)\) and “damping” \(C(\mathbf s,\dot{\mathbf s})\) are diagonal matrices whose entries depend on \(\mathbf s(t)\) and \(\dot{\mathbf s}(t)\). However, \(\mathbf s(t)\) is governed by its own ODE \(\dot{\mathbf s}(t) = (1-\beta_2)((\nabla f(\mathbf x(t)))^2 - \mathbf s(t))\). Substituting \(\dot{\mathbf s}(t)\) into the expression for \(C(\mathbf s,\dot{\mathbf s})\) will make the “damping” term highly complex, depending on \(\mathbf s(t)\) and \((\nabla f(\mathbf x(t)))^2\). The coefficients of this “single” second-order ODE for \(\mathbf x(t)\) are not simple constants nor simple functions of just \(\mathbf x\) and \(\dot{\mathbf x}\). They depend on an auxiliary variable \(\mathbf s(t)\) that has its own dynamics driven by the history of gradients encountered along the path \(\mathbf x(t)\). While one can formally write it this way, its coefficients are so intricately coupled with the state and history via \(\mathbf s(t)\) that the system form is generally more transparent for analysis and interpretation.

The ODE perspective for adaptive methods like RMSProp and Adam reveals them as sophisticated dynamical systems that implement preconditioned, momentum-driven descent, where the preconditioning and momentum effects are themselves evolving based on the optimization trajectory. This provides intuition for their ability to navigate complex loss landscapes effectively.

Why is the ODE Perspective So Valuable?

Understanding optimization algorithms through the lens of ODEs offers several benefits:

  1. Deeper Intuition: It moves beyond algorithmic recipes to physical or mathematical analogies, explaining why methods like momentum work.
  2. Principled Design: New algorithms can be designed by proposing different ODEs (e.g., with different damping or inertial terms, or adaptive preconditioning) and then discretizing them. The conversion to a system of first-order ODEs and then applying various numerical integration schemes provides a rich framework for algorithm discovery.
  3. Analysis Tools: The rich theory of dynamical systems and numerical ODEs can be applied to analyze stability, convergence rates, and behavior of optimization algorithms. For instance, Lyapunov stability theory for ODEs can be adapted to prove convergence for optimization algorithms.
  4. Hyperparameter Understanding: The relationship between ODE parameters (like mass \(m\), friction \(\gamma\), decay rates for accumulators, discretization step \(h\)) and algorithm hyperparameters (\(\eta, \beta, \beta_1, \beta_2\)) can guide tuning. For example, the condition for critical damping in the ODE can inform choices for \(\beta\) relative to \(\eta\).

Conclusion

Momentum, a cornerstone of modern optimization, is more than just adding a fraction of the previous update. By viewing it through the lens of Ordinary Differential Equations, we’ve seen Polyak’s Heavy Ball method emerge from two distinct but related paths:

  • As a discretization of a second-order ODE (via a system of first-order ODEs) describing a physical “heavy ball” system with inertia and friction.
  • As a Linear Multi-step Method applied to the fundamental first-order gradient flow ODE.

Furthermore, this ODE perspective extends to adaptive methods like RMSProp and Adam, revealing them as systems of coupled first-order ODEs that describe preconditioned gradient flows, often with inertial components. These viewpoints not only provide a solid theoretical grounding for these advanced optimization techniques but also open avenues for analyzing their behavior and designing new, more effective optimization algorithms. The continuous-time viewpoint reminds us that many discrete algorithms are, at their heart, approximations of underlying continuous dynamical processes.

Summary of Key Methods and ODEs

| Method | Update Rule (one common form, simplified) | Underlying ODE (conceptual or direct, simplified) | Key Idea |
| --- | --- | --- | --- |
| Gradient Descent (GD) | \(\mathbf x_{k+1} = \mathbf x_k - \eta \nabla f(\mathbf x_k)\) | \(\dot{\mathbf x}(t) = -\nabla f(\mathbf x(t))\) (gradient flow) | Step along negative gradient |
| Polyak’s Heavy Ball (PHB) | \(\mathbf x_{k+1} = \mathbf x_k - \eta \nabla f(\mathbf x_k) + \beta (\mathbf x_k - \mathbf x_{k-1})\) | \(m \ddot{\mathbf x} + \gamma \dot{\mathbf x} + \nabla f(\mathbf x) = \mathbf 0\) | Inertia + friction; LMM for gradient flow |
| Nesterov’s Accel. (NAG) | \(\mathbf v_{k+1} = \beta \mathbf v_k - \eta \nabla f(\mathbf x_k + \beta \mathbf v_k)\); \(\mathbf x_{k+1} = \mathbf x_k + \mathbf v_{k+1}\) | \(\ddot{\mathbf x}(t) + \frac{k_d}{t} \dot{\mathbf x}(t) + \nabla f(\mathbf x(t)) = \mathbf 0\) (Su, Boyd, Candès) | “Look-ahead” gradient, time-varying damping |
| RMSProp | \(\mathbf s_{k+1} = \beta_2 \mathbf s_k + (1-\beta_2) (\nabla f_k)^2\); \(\mathbf x_{k+1} = \mathbf x_k - \frac{\eta}{\sqrt{\mathbf s_{k+1}}} \nabla f_k\) | \(\dot{\mathbf x} = -\frac{\eta_0}{\sqrt{\mathbf s}} \nabla f(\mathbf x)\); \(\dot{\mathbf s} = (1-\beta_2) ((\nabla f(\mathbf x))^2 - \mathbf s)\) | Adaptive per-parameter learning rates |
| Adam | \(\mathbf m_{k+1} = \beta_1 \mathbf m_k + (1-\beta_1) \nabla f_k\); \(\mathbf s_{k+1} = \beta_2 \mathbf s_k + (1-\beta_2) (\nabla f_k)^2\); \(\mathbf x_{k+1} = \mathbf x_k - \eta \frac{\mathbf m_{k+1}}{\sqrt{\mathbf s_{k+1}}}\) | \(\dot{\mathbf x} = -\eta_0 \frac{\mathbf m}{\sqrt{\mathbf s}}\); \(\dot{\mathbf m} = (1-\beta_1)(\nabla f(\mathbf x) - \mathbf m)\); \(\dot{\mathbf s} = (1-\beta_2)((\nabla f(\mathbf x))^2 - \mathbf s)\) | Momentum + adaptive learning rates |

Note: \(\nabla f_k = \nabla f(\mathbf x_k)\). Vector operations like square, square root, and division are typically element-wise in the adaptive methods.

Further Perspectives

Beyond the direct ODE derivations discussed, these optimization methods can be understood through several other rich mathematical lenses. For example:

  • The variational or Lagrangian perspective, particularly using Bregman Lagrangians (as in Wibisono et al., 2016), provides a unified framework for deriving momentum methods like PHB and NAG via time-rescaling and symplectic discretization of underlying mechanical systems.
  • A control-theoretic view (e.g., Lessard et al., 2016) treats the optimization process as designing a controller (the algorithm) to steer the system (the parameters) to an optimum, often by analyzing stability and performance using tools from robust control.
  • The role of conformal symplectic integrators (Zhang et al., 2020) explains how methods like PHB preserve certain geometric structures even in the presence of dissipation (friction), leading to favorable stability properties.
  • Recent analyses of continuous-time Adam-style methods (e.g., Gould & Tanaka, 2024) continue to refine our understanding of stability regions and dynamics for adaptive optimizers by studying their ODE limits.

These alternative viewpoints often provide complementary insights and can inspire new algorithmic designs.

Reflection

This exploration of momentum and adaptive optimization methods through ODEs highlights a recurring theme in mathematical optimization for machine learning: many successful discrete algorithms are shadows of underlying continuous processes. The heavy ball analogy gives an intuitive grasp for classical momentum, while the LMM perspective places it firmly within the established field of numerical ODE solvers. Extending this view to adaptive methods like RMSProp and Adam shows them as more complex dynamical systems, yet still amenable to interpretation as preconditioned flows. All these viewpoints enrich our understanding beyond mere algorithmic steps, offering insights into why these methods work and how they might be improved or generalized. This connection between discrete iteration and continuous flow is a powerful paradigm for both analysis and invention in optimization.
