
Adam Optimizer: Online Learning of Updates and Efficacy with EMA

Exploring novel theoretical understandings of the Adam optimizer through the lens of Follow-The-Regularized-Leader for its updates, and the provable benefits of combining Adam with model exponential moving average in non-convex optimization.


Introduction: Bridging Adam and Online Learning

The Adam optimizer has become ubiquitous in deep learning thanks to its robust performance across diverse architectures and datasets. Yet its theoretical foundations have remained elusive, with prior analyses failing to explain why Adam works so well in practice. Recent work by Ahn et al. (2024a) and Ahn and Cutkosky (2024b) reveals Adam’s deep connection to online learning theory and provides optimal convergence guarantees for nonconvex optimization when Adam is combined with a model exponential moving average (EMA).

In this post, we’ll explore:

  1. How Adam implements discounted Follow-the-Regularized-Leader (β-FTRL)
  2. Why momentum corresponds to loss discounting in online learning
  3. How EMA enables optimal convergence in nonconvex settings
  4. When coordinate-wise adaptivity provides provable acceleration

Definition. Standard Adam Update

For gradients \(\mathbf{g}_t\), Adam computes:

\[\begin{aligned} \mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\mathbf{g}_t \\ \mathbf{v}_t &= \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\mathbf{g}_t^2 \\ \Delta_t &= -\alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon},\quad \hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t},\quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t} \end{aligned}\]
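
As a concrete reference point, here is a minimal NumPy sketch of this update (an illustrative implementation, not code from the papers discussed below; the hyperparameter defaults follow Kingma & Ba, 2015):

```python
import numpy as np

def adam_step(x, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t is 1-indexed); returns updated x and moment buffers."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2     # second-moment estimate
    m_hat = m / (1 - beta1**t)             # bias corrections
    v_hat = v / (1 - beta2**t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```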

Online Learning of Updates Framework

The key insight from Ahn et al. (2024a) is to reframe optimization as online learning of parameter increments. Instead of directly learning parameters \(\mathbf{x}_t\), we learn updates \(\Delta_t\):

\[\mathbf{x}_t = \mathbf{x}_{t-1} + \Delta_{t-1}\]

At each step \(t\), the learner suffers a linear loss:

\[\ell_t(\Delta) = \langle \mathbf{g}_t, \Delta \rangle\]

where \(\mathbf{g}_t\) is the gradient observed at the current iterate. This Online Learning of Updates (OLU) framework transforms optimization into an online decision problem.
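
Schematically, the protocol is the loop below; `grad_fn` and `learner` are hypothetical placeholders for a gradient oracle and any online algorithm over increments:

```python
import numpy as np

def run_olu(x0, grad_fn, learner, T):
    """Online Learning of Updates: the learner picks increments, not parameters.

    grad_fn(x) returns a (possibly stochastic) gradient at x; learner.update(g)
    returns the next increment Delta after observing the linear loss <g, .>.
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(T):
        g = grad_fn(x)              # reveal the linear loss ell_t(Delta) = <g_t, Delta>
        delta = learner.update(g)   # online learner proposes the increment Delta_t
        x = x + delta               # x_{t+1} = x_t + Delta_t
    return x
```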

Derivation 1: Constructing Discounted Losses

To recover Adam, we define rescaled gradients with exponential discounting (written \(\tilde{\mathbf{g}}_s\) to avoid a clash with Adam’s second-moment buffer \(\mathbf{v}_t\)):

\[\tilde{\mathbf{g}}_s = \beta_1^{-s} \mathbf{g}_s\]

The cumulative discounted loss becomes:

\[L_t(\Delta) = \sum_{s=1}^t \beta_1^{t-s} \langle \mathbf{g}_s, \Delta \rangle = \beta_1^t \left\langle \sum_{s=1}^t \beta_1^{-s}\mathbf{g}_s, \Delta \right\rangle = \beta_1^t \left\langle \sum_{s=1}^t \tilde{\mathbf{g}}_s, \Delta \right\rangle\]

This linear loss function encodes the entire gradient history with exponential decay controlled by \(\beta_1\). The discount factor \(\beta_1^{t-s}\) assigns higher weight to recent gradients.
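
One practical remark: the discounted sum inside \(L_t\) never needs to be recomputed from scratch, since \(S_t = \sum_{s \le t} \beta_1^{t-s}\mathbf{g}_s\) obeys a one-step recursion, which is exactly the recursion behind Adam's momentum buffer. A minimal sketch (illustrative values):

```python
import numpy as np

# S_t = sum_{s<=t} beta1^(t-s) g_s satisfies S_t = beta1 * S_{t-1} + g_t,
# so the discounted loss L_t(Delta) = <S_t, Delta> costs O(d) per step to maintain.
beta1, d = 0.9, 4
rng = np.random.default_rng(0)
S = np.zeros(d)
for t in range(100):
    g = rng.normal(size=d)
    S = beta1 * S + g   # same recursion as Adam's m_t, up to the (1 - beta1) factor
```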

Adam as Discounted Follow-the-Regularized-Leader

Adam emerges naturally as the solution to a discounted FTRL problem with adaptive regularization:

Theorem. Adam is β-FTRL

The Adam update solves:

\[\Delta_t = \underset{\Delta \in \mathbb{R}^d}{\text{argmin}} \left( \eta_t \sum_{s=1}^t \beta_1^{t-s} \langle \mathbf{g}_s, \Delta \rangle + \frac{1}{2} \Vert \Delta\Vert _2^2 \right)\]

with learning rate schedule:

\[\eta_t = \frac{\alpha\,\beta_1^{-t}}{\sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2}}\]

Derivation 2: Solving the FTRL Objective

  1. Take the gradient of the FTRL objective and set to zero:

    \[\nabla_{\Delta} \left[ \eta_t \beta_1^t \sum_{s=1}^t \beta_1^{-s} \langle \mathbf{g}_s, \Delta \rangle + \frac{1}{2} \Vert \Delta\Vert _2^2 \right] = 0\]
  2. Substitute \(\tilde{\mathbf{g}}_s = \beta_1^{-s}\mathbf{g}_s\):

    \[\eta_t \beta_1^t \sum_{s=1}^t \tilde{\mathbf{g}}_s + \Delta = 0\]
  3. Solve for \(\Delta\):

    \[\Delta_t = -\eta_t \beta_1^t \sum_{s=1}^t \tilde{\mathbf{g}}_s\]
  4. Plug in the learning rate:

    \[\Delta_t = -\alpha \beta_1^{-t} \beta_1^t \frac{\sum_{s=1}^t \tilde{\mathbf{g}}_s}{\sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2}} = -\alpha \frac{\sum_{s=1}^t \tilde{\mathbf{g}}_s}{\sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2}}\]
  5. Recover the standard Adam terms. Unrolling \(\mathbf{m}_t = (1-\beta_1)\sum_{s=1}^t \beta_1^{t-s}\mathbf{g}_s\) and, under the coupling \(\beta_2 = \beta_1^2\), \(\mathbf{v}_t = (1-\beta_2)\sum_{s=1}^t \beta_1^{2(t-s)}\mathbf{g}_s^2\):

    \[\begin{aligned} \text{Numerator: } \sum_{s=1}^t \tilde{\mathbf{g}}_s &= \frac{\beta_1^{-t}}{1-\beta_1}\, \mathbf{m}_t \\ \text{Denominator: } \sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2} &= \frac{\beta_1^{-t}}{\sqrt{1-\beta_2}} \sqrt{\mathbf{v}_t} \end{aligned}\]

    Thus, absorbing the constants \((1-\beta_1)\) and \(\sqrt{1-\beta_2}\) into the step size \(\alpha\) and adding \(\epsilon\) for numerical stability:

    \[\Delta_t = -\alpha \frac{\mathbf{m}_t}{\sqrt{\mathbf{v}_t} + \epsilon}\]
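
This equivalence is easy to check numerically. The sketch below is self-contained (illustrative sizes, with the coupling \(\beta_2 = \beta_1^2\) from step 5) and compares the closed-form β-FTRL solution against the momentum-over-root-second-moment form:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, beta1, alpha = 50, 3, 0.9, 0.01
beta2 = beta1**2                                     # coupling from step 5

g = rng.normal(size=(T, d))

# Closed-form beta-FTRL solution at step T
g_tilde = g * beta1 ** (-np.arange(1, T + 1))[:, None]
ftrl = -alpha * g_tilde.sum(axis=0) / np.sqrt((g_tilde**2).sum(axis=0))

# Adam-style recursions (no bias correction; constants restored explicitly)
m = np.zeros(d)
v = np.zeros(d)
for s in range(T):
    m = beta1 * m + (1 - beta1) * g[s]
    v = beta2 * v + (1 - beta2) * g[s] ** 2
adam = -alpha * (np.sqrt(1 - beta2) / (1 - beta1)) * m / np.sqrt(v)

print(np.max(np.abs(ftrl - adam)))   # agreement up to floating-point error
```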

This derivation reveals Adam’s components as fundamental to online learning:

  • Momentum \(\beta_1\): Discount factor for past losses
  • Adaptive scaling \(\mathbf{v}_t\): Scale-free regularization
  • Learning rate \(\alpha\): Global step size scaling

Dynamic Regret Analysis

The OLU framework enables rigorous analysis through dynamic regret:

\[\mathcal{R}_T(\mathbf{u}_{1:T}) = \sum_{t=1}^T \ell_t(\Delta_t) - \sum_{t=1}^T \ell_t(\mathbf{u}_t - \mathbf{u}_{t-1})\]

where \(\{\mathbf{u}_t\}\) is a fixed comparator sequence; for linear losses an unconstrained minimum over all sequences would be vacuous, so bounds are stated for an arbitrary fixed sequence. Adam achieves strong regret bounds:

Proposition. Dynamic Regret of β-FTRL

For any comparator sequence \(\{\mathbf{u}_t\}\) with \(\Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert \leq D\):

\[\mathcal{R}_T \leq \frac{\alpha}{\sqrt{1-\beta_1}} \sum_{i=1}^d \sqrt{\sum_{t=1}^T g_{t,i}^2} + \frac{D^2}{2\alpha} \sqrt{\sum_{t=1}^T \Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert ^2}\]

Proof Sketch: Regret Decomposition

  1. Add and subtract the loss of a fixed comparator \(\mathbf{u}\) to decompose the regret into static and stability terms:

    \[\mathcal{R}_T = \underbrace{\sum_t \langle \mathbf{g}_t, \Delta_t - \mathbf{u} \rangle}_{\text{Static regret}} + \underbrace{\sum_t \langle \mathbf{g}_t, \mathbf{u} - (\mathbf{u}_t - \mathbf{u}_{t-1}) \rangle}_{\text{Stability penalty}}\]
  2. Bound the static regret via a scale-free FTRL analysis:

    \[\sum_{t=1}^T \langle \mathbf{g}_t, \Delta_t - \mathbf{u} \rangle \leq \frac{\alpha}{\sqrt{1-\beta_1}} \sum_{i=1}^d \Vert \mathbf{g}_{1:T,i}\Vert _2\]
  3. Control the stability term via the comparator’s path length:

    \[\sum_t \langle \mathbf{g}_t, \mathbf{u} - (\mathbf{u}_t - \mathbf{u}_{t-1}) \rangle \leq \frac{D^2}{2\alpha} \sqrt{\sum_{t=1}^T \Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert ^2}\]
  4. Combine using Cauchy-Schwarz:

    \[\mathcal{R}_T \leq \frac{\alpha}{\sqrt{1-\beta_1}} \sum_{i=1}^d \sqrt{\sum_{t=1}^T g_{t,i}^2} + \frac{D^2}{2\alpha} \sqrt{\sum_{t=1}^T \Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert ^2}\]
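
As a numerical sanity check on the proposition (a self-contained sketch; for simplicity the comparator is a constant sequence, so its increments incur zero loss and only the first term of the bound is active):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, alpha, beta1 = 500, 4, 0.1, 0.9
g = rng.normal(size=(T, d))

S = np.zeros(d)   # discounted gradient sum          sum_s beta1^(t-s) g_s
Q = np.zeros(d)   # discounted squared-gradient sum  sum_s beta1^(2(t-s)) g_s^2
regret = 0.0
for t in range(T):
    S = beta1 * S + g[t]
    Q = beta1**2 * Q + g[t] ** 2
    delta = -alpha * S / (np.sqrt(Q) + 1e-12)   # beta-FTRL update from Derivation 2
    regret += g[t] @ delta                      # constant comparator: increments cost 0

bound = alpha / np.sqrt(1 - beta1) * np.sum(np.sqrt((g**2).sum(axis=0)))
print(f"regret = {regret:.2f}  <=  first bound term = {bound:.2f}")
```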

EMA for Nonconvex Optimization

While Adam excels in online settings, nonconvex optimization requires additional stabilization. Ahn and Cutkosky (2024b) prove that combining clipped Adam with a model EMA achieves optimal convergence:

Theorem. Convergence of Adam+EMA

For \(L\)-smooth nonconvex \(F\) with stochastic gradients satisfying \(\mathbb{E}[\Vert \mathbf{g}_t - \nabla F(\mathbf{x}_t)\Vert ^2] \leq \sigma^2\), clipped Adam with EMA attains:

\[\mathbb{E}\left[ \Vert \nabla F(\bar{\mathbf{x}}_T)\Vert ^2 \right] \leq \mathcal{O}\left( \frac{\sigma}{\sqrt{T}} + \frac{\sigma^2}{T} \right)\]

which matches the known lower bound for this class of stochastic nonconvex problems.

The EMA update:

\[\bar{\mathbf{x}}_t = (1 - \gamma)\bar{\mathbf{x}}_{t-1} + \gamma\mathbf{x}_t\]

provides three key benefits:

  1. Smoothing: Implicitly optimizes a smoothed version of the objective
  2. Variance Reduction: Averages out gradient noise
  3. Implicit Iterate Averaging: Stabilizes convergence
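
A minimal sketch of pairing this EMA with an optimizer step (reusing the hypothetical `adam_step` from the first snippet; \(\gamma = 0.01\) is an illustrative choice, and \(\gamma\) is a tuning knob in practice):

```python
def ema_update(x_bar, x, gamma=0.01):
    """Model EMA: x_bar_t = (1 - gamma) * x_bar_{t-1} + gamma * x_t."""
    return (1 - gamma) * x_bar + gamma * x

# Inside a training loop (sketch):
#   x, m, v = adam_step(x, g, m, v, t)   # optimizer advances the "fast" weights
#   x_bar = ema_update(x_bar, x)         # evaluate and deploy x_bar, not x
```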

Why EMA Outperforms Uniform Averaging

Compared to uniform averaging \(\bar{\mathbf{x}}_T = \frac{1}{T} \sum_{t=1}^T \mathbf{x}_t\), EMA:

  • Assigns higher weight to recent iterates
  • Adapts faster to curvature changes
  • Requires only \(\mathcal{O}(d)\) memory
  • Provides better empirical performance in generative models
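
To make the first point concrete, here is the weight each scheme assigns to iterate \(\mathbf{x}_t\) when forming the final average (illustrative \(T\) and \(\gamma\)):

```python
import numpy as np

T, gamma = 1000, 0.01
t = np.arange(1, T + 1)
ema_w = gamma * (1 - gamma) ** (T - t)   # weight of x_t in x_bar_T (ignoring x_bar_0)
uni_w = np.full(T, 1.0 / T)              # uniform averaging weight on every iterate

print(ema_w[-1] / uni_w[-1])   # ~10x more weight on the latest iterate for these values
```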

The ideal weighting scheme scales the weight on iterate \(\mathbf{x}_t\) as

\[\gamma_t \propto \mathbb{E}[\Vert \nabla F(\mathbf{x}_t)\Vert ^{-1}]\]

which EMA approximates through exponential discounting.

Coordinate-Wise Adaptivity Advantage

Adam’s per-coordinate scaling provides provable acceleration under gradient scale heterogeneity:

| Condition | Isotropic Methods | Adam (β-FTRL) |
| --- | --- | --- |
| Uniform scales | \(\mathcal{O}(\sqrt{dT})\) | \(\mathcal{O}(\sqrt{dT})\) |
| Heterogeneous scales | \(\mathcal{O}\left(\sqrt{d\sum_t \Vert \mathbf{g}_t\Vert ^2}\right)\) | \(\mathcal{O}\left(\sum_{i=1}^d \sqrt{\sum_t g_{t,i}^2}\right)\) |

The coordinate-adaptive regret bound:

\[\mathcal{R}_T \leq \mathcal{O}\left( \sum_{i=1}^d \sqrt{\sum_{t=1}^T g_{t,i}^2 } \right)\]

can be \(\sqrt{d}\)-times smaller than isotropic methods when gradient norms vary significantly across coordinates.
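
The toy computation below illustrates this gap on synthetic gradients in which a single coordinate carries a much larger scale (an illustrative construction, not data from the papers):

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 1000, 100
scales = np.ones(d)
scales[0] = 100.0                                   # one dominant coordinate
g = rng.normal(size=(T, d)) * scales

isotropic = np.sqrt(d * np.sum(g**2))               # sqrt(d * sum_t ||g_t||^2)
coordwise = np.sum(np.sqrt(np.sum(g**2, axis=0)))   # sum_i sqrt(sum_t g_{t,i}^2)
print(f"isotropic: {isotropic:.0f}, coordinate-wise: {coordwise:.0f}")
# coordinate-wise is ~5x smaller here; the gap can reach sqrt(d) = 10x
```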

Summary: Online Learning Perspective

Adam Component Correspondence

| Adam Element | Online Learning Interpretation |
| --- | --- |
| Momentum \(\beta_1\) | Loss discount factor |
| Scaling \(\mathbf{v}_t\) | Adaptive regularization |
| Learning rate \(\alpha\) | Step size multiplier |
| EMA \(\gamma\) | Implicit iterate averaging |

Key Insights

  1. Adam implements discounted FTRL for update directions
  2. Momentum enables adaptation to non-stationarity
  3. EMA provides optimal convergence in nonconvex settings
  4. Per-coordinate scaling accelerates convergence under gradient heterogeneity

References

  1. Ahn, K., Zhang, Z., Kook, Y., & Dai, Y. (2024a). Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise. arXiv:2402.01567
  2. Ahn, K., & Cutkosky, A. (2024b). Adam with model exponential moving average is effective for nonconvex optimization. arXiv:2405.18199
  3. Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. ICLR
  4. McMahan, H. B. (2017). A Survey of Algorithms and Analysis for Adaptive Online Learning. JMLR