
Adam Optimizer: Online Learning of Updates and Efficacy with EMA

Exploring novel theoretical understandings of the Adam optimizer through the lens of Follow-The-Regularized-Leader for its updates, and the provable benefits of combining Adam with model exponential moving average in non-convex optimization.


Introduction: Bridging Adam and Online Learning

The Adam optimizer has become ubiquitous in deep learning thanks to its robust performance across diverse architectures and datasets. Yet its theoretical foundations have remained elusive, with prior analyses failing to explain why Adam works so well in practice. Recent work by Ahn et al. (2024a) and Ahn and Cutkosky (2024b) reveals Adam’s deep connection to online learning theory and provides optimal convergence guarantees for nonconvex optimization when Adam is combined with a model exponential moving average (EMA).

In this post, we’ll explore:

  1. How Adam implements discounted Follow-the-Regularized-Leader (β-FTRL)
  2. Why momentum corresponds to loss discounting in online learning
  3. How EMA enables optimal convergence in nonconvex settings
  4. When coordinate-wise adaptivity provides provable acceleration

Definition. Standard Adam Update

For gradients \(\mathbf{g}_t\), Adam computes:

\[\begin{aligned} \mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\mathbf{g}_t \\ \mathbf{v}_t &= \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\mathbf{g}_t^2 \\ \Delta_t &= -\alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon},\quad \hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t},\quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t} \end{aligned}\]
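
As a concrete reference point, here is a minimal NumPy sketch of this update (an illustrative implementation, not code from the papers discussed below; the hyperparameter defaults follow Kingma & Ba, 2015):

```python
import numpy as np

def adam_step(x, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t is 1-indexed); returns updated x and moment buffers."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2     # second-moment estimate
    m_hat = m / (1 - beta1**t)             # bias corrections
    v_hat = v / (1 - beta2**t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```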

Online Learning of Updates Framework

The key insight from Ahn et al. (2024a) is to reframe optimization as online learning of parameter increments. Instead of directly learning parameters \(\mathbf{x}_t\), we learn updates \(\Delta_t\):

\[\mathbf{x}_t = \mathbf{x}_{t-1} + \Delta_{t-1}\]

At each step \(t\), the learner suffers a linear loss:

\[\ell_t(\Delta) = \langle \mathbf{g}_t, \Delta \rangle\]

where \(\mathbf{g}_t\) is the gradient observed at the current iterate. This Online Learning of Updates (OLU) framework transforms optimization into an online decision problem.
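
Schematically, the protocol is the loop below; `grad_fn` and `learner` are hypothetical placeholders for a gradient oracle and any online algorithm over increments:

```python
import numpy as np

def run_olu(x0, grad_fn, learner, T):
    """Online Learning of Updates: the learner picks increments, not parameters.

    grad_fn(x) returns a (possibly stochastic) gradient at x; learner.update(g)
    returns the next increment Delta after observing the linear loss <g, .>.
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(T):
        g = grad_fn(x)              # reveal the linear loss ell_t(Delta) = <g_t, Delta>
        delta = learner.update(g)   # online learner proposes the increment Delta_t
        x = x + delta               # x_{t+1} = x_t + Delta_t
    return x
```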

Derivation 1: Constructing Discounted Losses

To recover Adam, we define rescaled gradients with exponential discounting (written \(\tilde{\mathbf{g}}_s\) to avoid a clash with Adam’s second-moment buffer \(\mathbf{v}_t\)):

\[\tilde{\mathbf{g}}_s = \beta_1^{-s} \mathbf{g}_s\]

The cumulative discounted loss becomes:

\[L_t(\Delta) = \sum_{s=1}^t \beta_1^{t-s} \langle \mathbf{g}_s, \Delta \rangle = \beta_1^t \left\langle \sum_{s=1}^t \beta_1^{-s}\mathbf{g}_s, \Delta \right\rangle = \beta_1^t \left\langle \sum_{s=1}^t \tilde{\mathbf{g}}_s, \Delta \right\rangle\]

This linear loss function encodes the entire gradient history with exponential decay controlled by \(\beta_1\). The discount factor \(\beta_1^{t-s}\) assigns higher weight to recent gradients.
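
One practical remark: the discounted sum inside \(L_t\) never needs to be recomputed from scratch, since \(S_t = \sum_{s \le t} \beta_1^{t-s}\mathbf{g}_s\) obeys a one-step recursion, which is exactly the recursion behind Adam's momentum buffer. A minimal sketch (illustrative values):

```python
import numpy as np

# S_t = sum_{s<=t} beta1^(t-s) g_s satisfies S_t = beta1 * S_{t-1} + g_t,
# so the discounted loss L_t(Delta) = <S_t, Delta> costs O(d) per step to maintain.
beta1, d = 0.9, 4
rng = np.random.default_rng(0)
S = np.zeros(d)
for t in range(100):
    g = rng.normal(size=d)
    S = beta1 * S + g   # same recursion as Adam's m_t, up to the (1 - beta1) factor
```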

Adam as Discounted Follow-the-Regularized-Leader

Adam emerges naturally as the solution to a discounted FTRL problem with adaptive regularization:

Theorem. Adam is β-FTRL

The Adam update solves:

\[\Delta_t = \underset{\Delta \in \mathbb{R}^d}{\text{argmin}} \left( \eta_t \sum_{s=1}^t \beta_1^{t-s} \langle \mathbf{g}_s, \Delta \rangle + \frac{1}{2} \Vert \Delta\Vert _2^2 \right)\]

with learning rate schedule:

\[\eta_t = \frac{\alpha\,\beta_1^{-t}}{\sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2}}\]

Derivation 2: Solving the FTRL Objective

  1. Take the gradient of the FTRL objective and set to zero:

    \[\nabla_{\Delta} \left[ \eta_t \beta_1^t \sum_{s=1}^t \beta_1^{-s} \langle \mathbf{g}_s, \Delta \rangle + \frac{1}{2} \Vert \Delta\Vert _2^2 \right] = 0\]
  2. Substitute \(\tilde{\mathbf{g}}_s = \beta_1^{-s}\mathbf{g}_s\):

    \[\eta_t \beta_1^t \sum_{s=1}^t \tilde{\mathbf{g}}_s + \Delta = 0\]
  3. Solve for \(\Delta\):

    \[\Delta_t = -\eta_t \beta_1^t \sum_{s=1}^t \tilde{\mathbf{g}}_s\]
  4. Plug in the learning rate:

    \[\Delta_t = -\alpha \beta_1^{-t} \beta_1^t \frac{\sum_{s=1}^t \tilde{\mathbf{g}}_s}{\sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2}} = -\alpha \frac{\sum_{s=1}^t \tilde{\mathbf{g}}_s}{\sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2}}\]
  5. Recover the standard Adam terms. Unrolling \(\mathbf{m}_t = (1-\beta_1)\sum_{s=1}^t \beta_1^{t-s}\mathbf{g}_s\) and, under the coupling \(\beta_2 = \beta_1^2\), \(\mathbf{v}_t = (1-\beta_2)\sum_{s=1}^t \beta_1^{2(t-s)}\mathbf{g}_s^2\):

    \[\begin{aligned} \text{Numerator: } \sum_{s=1}^t \tilde{\mathbf{g}}_s &= \frac{\beta_1^{-t}}{1-\beta_1}\, \mathbf{m}_t \\ \text{Denominator: } \sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2} &= \frac{\beta_1^{-t}}{\sqrt{1-\beta_2}} \sqrt{\mathbf{v}_t} \end{aligned}\]

    Thus, absorbing the constants \((1-\beta_1)\) and \(\sqrt{1-\beta_2}\) into the step size \(\alpha\) and adding \(\epsilon\) for numerical stability:

    \[\Delta_t = -\alpha \frac{\mathbf{m}_t}{\sqrt{\mathbf{v}_t} + \epsilon}\]
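
This equivalence is easy to check numerically. The sketch below is self-contained (illustrative sizes, with the coupling \(\beta_2 = \beta_1^2\) from step 5) and compares the closed-form β-FTRL solution against the momentum-over-root-second-moment form:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, beta1, alpha = 50, 3, 0.9, 0.01
beta2 = beta1**2                                     # coupling from step 5

g = rng.normal(size=(T, d))

# Closed-form beta-FTRL solution at step T
g_tilde = g * beta1 ** (-np.arange(1, T + 1))[:, None]
ftrl = -alpha * g_tilde.sum(axis=0) / np.sqrt((g_tilde**2).sum(axis=0))

# Adam-style recursions (no bias correction; constants restored explicitly)
m = np.zeros(d)
v = np.zeros(d)
for s in range(T):
    m = beta1 * m + (1 - beta1) * g[s]
    v = beta2 * v + (1 - beta2) * g[s] ** 2
adam = -alpha * (np.sqrt(1 - beta2) / (1 - beta1)) * m / np.sqrt(v)

print(np.max(np.abs(ftrl - adam)))   # agreement up to floating-point error
```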

This derivation reveals Adam’s components as fundamental to online learning:

  • Momentum \(\beta_1\): Discount factor for past losses
  • Adaptive scaling \(\mathbf{v}_t\): Scale-free regularization
  • Learning rate \(\alpha\): Global step size scaling

Dynamic Regret Analysis

The OLU framework enables rigorous analysis through dynamic regret:

\[\mathcal{R}_T(\mathbf{u}_{1:T}) = \sum_{t=1}^T \ell_t(\Delta_t) - \sum_{t=1}^T \ell_t(\mathbf{u}_t - \mathbf{u}_{t-1})\]

where \(\{\mathbf{u}_t\}\) is a fixed comparator sequence; for linear losses an unconstrained minimum over all sequences would be vacuous, so bounds are stated for an arbitrary fixed sequence. Adam achieves strong regret bounds:

Proposition. Dynamic Regret of β-FTRL

For any comparator sequence \(\{\mathbf{u}_t\}\) with \(\Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert \leq D\):

\[\mathcal{R}_T \leq \frac{\alpha}{\sqrt{1-\beta_1}} \sum_{i=1}^d \sqrt{\sum_{t=1}^T g_{t,i}^2} + \frac{D^2}{2\alpha} \sqrt{\sum_{t=1}^T \Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert ^2}\]

Proof Sketch: Regret Decomposition

  1. Add and subtract the loss of a fixed comparator \(\mathbf{u}\) to decompose the regret into static and stability terms:

    \[\mathcal{R}_T = \underbrace{\sum_t \langle \mathbf{g}_t, \Delta_t - \mathbf{u} \rangle}_{\text{Static regret}} + \underbrace{\sum_t \langle \mathbf{g}_t, \mathbf{u} - (\mathbf{u}_t - \mathbf{u}_{t-1}) \rangle}_{\text{Stability penalty}}\]
  2. Bound the static regret via a scale-free FTRL analysis:

    \[\sum_{t=1}^T \langle \mathbf{g}_t, \Delta_t - \mathbf{u} \rangle \leq \frac{\alpha}{\sqrt{1-\beta_1}} \sum_{i=1}^d \Vert \mathbf{g}_{1:T,i}\Vert _2\]
  3. Control the stability term via the comparator’s path length:

    \[\sum_t \langle \mathbf{g}_t, \mathbf{u} - (\mathbf{u}_t - \mathbf{u}_{t-1}) \rangle \leq \frac{D^2}{2\alpha} \sqrt{\sum_{t=1}^T \Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert ^2}\]
  4. Combine using Cauchy-Schwarz:

    \[\mathcal{R}_T \leq \frac{\alpha}{\sqrt{1-\beta_1}} \sum_{i=1}^d \sqrt{\sum_{t=1}^T g_{t,i}^2} + \frac{D^2}{2\alpha} \sqrt{\sum_{t=1}^T \Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert ^2}\]
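
As a numerical sanity check on the proposition (a self-contained sketch; for simplicity the comparator is a constant sequence, so its increments incur zero loss and only the first term of the bound is active):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, alpha, beta1 = 500, 4, 0.1, 0.9
g = rng.normal(size=(T, d))

S = np.zeros(d)   # discounted gradient sum          sum_s beta1^(t-s) g_s
Q = np.zeros(d)   # discounted squared-gradient sum  sum_s beta1^(2(t-s)) g_s^2
regret = 0.0
for t in range(T):
    S = beta1 * S + g[t]
    Q = beta1**2 * Q + g[t] ** 2
    delta = -alpha * S / (np.sqrt(Q) + 1e-12)   # beta-FTRL update from Derivation 2
    regret += g[t] @ delta                      # constant comparator: increments cost 0

bound = alpha / np.sqrt(1 - beta1) * np.sum(np.sqrt((g**2).sum(axis=0)))
print(f"regret = {regret:.2f}  <=  first bound term = {bound:.2f}")
```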

EMA for Nonconvex Optimization

While Adam excels in online settings, nonconvex optimization requires additional stabilization. Ahn and Cutkosky (2024b) prove that combining clipped Adam with a model EMA achieves optimal convergence:

Theorem. Convergence of Adam+EMA

For \(L\)-smooth nonconvex \(F\) with stochastic gradients satisfying \(\mathbb{E}[\Vert \mathbf{g}_t - \nabla F(\mathbf{x}_t)\Vert ^2] \leq \sigma^2\), clipped Adam with EMA attains:

\[\mathbb{E}\left[ \Vert \nabla F(\bar{\mathbf{x}}_T)\Vert ^2 \right] \leq \mathcal{O}\left( \frac{\sigma}{\sqrt{T}} + \frac{\sigma^2}{T} \right)\]

which matches the known lower bound for this class of stochastic nonconvex problems.

The EMA update:

\[\bar{\mathbf{x}}_t = (1 - \gamma)\bar{\mathbf{x}}_{t-1} + \gamma\mathbf{x}_t\]

provides three key benefits:

  1. Smoothing: Implicitly optimizes a smoothed version of the objective
  2. Variance Reduction: Averages out gradient noise
  3. Implicit Iterate Averaging: Stabilizes convergence
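
A minimal sketch of pairing this EMA with an optimizer step (reusing the hypothetical `adam_step` from the first snippet; \(\gamma = 0.01\) is an illustrative choice, and \(\gamma\) is a tuning knob in practice):

```python
def ema_update(x_bar, x, gamma=0.01):
    """Model EMA: x_bar_t = (1 - gamma) * x_bar_{t-1} + gamma * x_t."""
    return (1 - gamma) * x_bar + gamma * x

# Inside a training loop (sketch):
#   x, m, v = adam_step(x, g, m, v, t)   # optimizer advances the "fast" weights
#   x_bar = ema_update(x_bar, x)         # evaluate and deploy x_bar, not x
```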

Why EMA Outperforms Uniform Averaging

Compared to uniform averaging \(\bar{\mathbf{x}}_T = \frac{1}{T} \sum_{t=1}^T \mathbf{x}_t\), EMA:

  • Assigns higher weight to recent iterates
  • Adapts faster to curvature changes
  • Requires only \(\mathcal{O}(d)\) memory
  • Provides better empirical performance in generative models
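
To make the first point concrete, here is the weight each scheme assigns to iterate \(\mathbf{x}_t\) when forming the final average (illustrative \(T\) and \(\gamma\)):

```python
import numpy as np

T, gamma = 1000, 0.01
t = np.arange(1, T + 1)
ema_w = gamma * (1 - gamma) ** (T - t)   # weight of x_t in x_bar_T (ignoring x_bar_0)
uni_w = np.full(T, 1.0 / T)              # uniform averaging weight on every iterate

print(ema_w[-1] / uni_w[-1])   # ~10x more weight on the latest iterate for these values
```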

The ideal weighting scheme scales the weight on iterate \(\mathbf{x}_t\) as

\[\gamma_t \propto \mathbb{E}[\Vert \nabla F(\mathbf{x}_t)\Vert ^{-1}]\]

which EMA approximates through exponential discounting.

Coordinate-Wise Adaptivity Advantage

Adam’s per-coordinate scaling provides provable acceleration under gradient scale heterogeneity:

| Condition | Isotropic Methods | Adam (β-FTRL) |
| --- | --- | --- |
| Uniform scales | \(\mathcal{O}(\sqrt{dT})\) | \(\mathcal{O}(\sqrt{dT})\) |
| Heterogeneous scales | \(\mathcal{O}\left(\sqrt{d\sum_t \Vert \mathbf{g}_t\Vert ^2}\right)\) | \(\mathcal{O}\left(\sum_{i=1}^d \sqrt{\sum_t g_{t,i}^2}\right)\) |

The coordinate-adaptive regret bound:

\[\mathcal{R}_T \leq \mathcal{O}\left( \sum_{i=1}^d \sqrt{\sum_{t=1}^T g_{t,i}^2 } \right)\]

can be \(\sqrt{d}\)-times smaller than isotropic methods when gradient norms vary significantly across coordinates.
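
The toy computation below illustrates this gap on synthetic gradients in which a single coordinate carries a much larger scale (an illustrative construction, not data from the papers):

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 1000, 100
scales = np.ones(d)
scales[0] = 100.0                                   # one dominant coordinate
g = rng.normal(size=(T, d)) * scales

isotropic = np.sqrt(d * np.sum(g**2))               # sqrt(d * sum_t ||g_t||^2)
coordwise = np.sum(np.sqrt(np.sum(g**2, axis=0)))   # sum_i sqrt(sum_t g_{t,i}^2)
print(f"isotropic: {isotropic:.0f}, coordinate-wise: {coordwise:.0f}")
# coordinate-wise is ~5x smaller here; the gap can reach sqrt(d) = 10x
```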

Summary: Online Learning Perspective

Adam Component Correspondence

| Adam Element | Online Learning Interpretation |
| --- | --- |
| Momentum \(\beta_1\) | Loss discount factor |
| Scaling \(\mathbf{v}_t\) | Adaptive regularization |
| Learning rate \(\alpha\) | Step size multiplier |
| EMA \(\gamma\) | Implicit iterate averaging |

Key Insights

  1. Adam implements discounted FTRL for update directions
  2. Momentum enables adaptation to non-stationarity
  3. EMA provides optimal convergence in nonconvex settings
  4. Per-coordinate scaling accelerates convergence under gradient heterogeneity

References

  1. Ahn, K., Zhang, Z., Kook, Y., & Dai, Y. (2024a). Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise. arXiv:2402.01567
  2. Ahn, K., & Cutkosky, A. (2024b). Adam with model exponential moving average is effective for nonconvex optimization. arXiv:2405.18199
  3. Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. ICLR
  4. McMahan, H. B. (2017). A Survey of Algorithms and Analysis for Adaptive Online Learning. JMLR