Adam Optimizer: Online Learning of Updates and Efficacy with EMA
Exploring a novel theoretical understanding of the Adam optimizer through the lens of Follow-The-Regularized-Leader applied to its updates, and the provable benefits of combining Adam with a model exponential moving average in nonconvex optimization.
Introduction: Bridging Adam and Online Learning
The Adam optimizer has become ubiquitous in deep learning due to its robust performance across diverse architectures and datasets. Yet its theoretical foundations have remained elusive, with prior analyses failing to explain why Adam works so well in practice. Recent breakthroughs by Ahn (2024a) and Ahn et al. (2024b) reveal Adam’s profound connection to online learning theory while providing optimal convergence guarantees for nonconvex optimization when combined with model exponential moving average (EMA).
In this post, we’ll explore:
- How Adam implements discounted Follow-the-Regularized-Leader (β-FTRL)
- Why momentum corresponds to loss discounting in online learning
- How EMA enables optimal convergence in nonconvex settings
- When coordinate-wise adaptivity provides provable acceleration
Definition. Standard Adam Update
For gradients \(\mathbf{g}_t\), Adam computes:
\[\begin{aligned} \mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\mathbf{g}_t \\ \mathbf{v}_t &= \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\mathbf{g}_t^2 \\ \Delta_t &= -\alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon},\quad \hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t},\quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t} \end{aligned}\]
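For concreteness, here is a minimal NumPy sketch of one such step (our own illustration; the function and variable names are not from any particular library):

```python
import numpy as np

def adam_step(x, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters x given gradient g at step t >= 1."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA of gradients
    v = beta2 * v + (1 - beta2) * g**2       # second-moment EMA of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction for m
    v_hat = v / (1 - beta2**t)               # bias correction for v
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```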
Online Learning of Updates Framework
The key insight from Ahn (2024a) is to reframe optimization as online learning of parameter increments. Instead of directly learning parameters \(\mathbf{x}_t\), we learn updates \(\Delta_t\):
\[\mathbf{x}_t = \mathbf{x}_{t-1} + \Delta_{t-1}\]At each step \(t\), the learner suffers a linear loss:
\[\ell_t(\Delta) = \langle \mathbf{g}_t, \Delta \rangle\]where \(\mathbf{g}_t\) is the gradient observed at the current iterate, so the linear loss is a first-order surrogate for the decrease \(f(\mathbf{x}_t + \Delta) - f(\mathbf{x}_t)\). This Online Learning of Updates (OLU) framework transforms optimization into an online decision problem.
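A toy sketch of this viewpoint, with a generic propose function standing in for the online learner (all names here are illustrative):

```python
import numpy as np

def olu_loop(x0, grad_fn, propose, T):
    """Optimization recast as online learning of increments delta_t."""
    x, cumulative_loss = x0, 0.0
    for t in range(1, T + 1):
        g = grad_fn(x)                 # gradient revealed at the current iterate
        delta = propose(g, t)          # online learner picks the next update
        cumulative_loss += g @ delta   # linear loss l_t(delta) = <g_t, delta>
        x = x + delta                  # x_t = x_{t-1} + delta_{t-1}
    return x, cumulative_loss
```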
Derivation 1: Constructing Discounted Losses
To recover Adam, we define scaled gradients with exponential discounting, written \(\tilde{\mathbf{g}}_s\) to avoid a notational clash with Adam's second moment \(\mathbf{v}_t\):
\[\tilde{\mathbf{g}}_s = \beta_1^{-s} \mathbf{g}_s\]The cumulative discounted loss then factors as:
\[L_t(\Delta) = \sum_{s=1}^t \beta_1^{t-s} \langle \mathbf{g}_s, \Delta \rangle = \beta_1^t \left\langle \sum_{s=1}^t \tilde{\mathbf{g}}_s, \Delta \right\rangle\]This linear loss function encodes the entire gradient history with exponential decay controlled by \(\beta_1\): the discount factor \(\beta_1^{t-s}\) assigns higher weight to recent gradients.
Adam as Discounted Follow-the-Regularized-Leader
Adam emerges naturally as the solution to a discounted FTRL problem with adaptive regularization:
Theorem. Adam is β-FTRL
The Adam update solves:
\[\Delta_t = \underset{\Delta \in \mathbb{R}^d}{\text{argmin}} \left( \eta_t \sum_{s=1}^t \beta_1^{t-s} \langle \mathbf{g}_s, \Delta \rangle + \frac{1}{2} \Vert \Delta\Vert _2^2 \right)\]with learning rate schedule:
\[\eta_t = \frac{\alpha}{\beta_1^t \sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2}}\]
Derivation 2: Solving the FTRL Objective
Take the gradient of the FTRL objective and set to zero:
\[\nabla_{\Delta} \left[ \eta_t \sum_{s=1}^t \beta_1^{t-s} \langle \mathbf{g}_s, \Delta \rangle + \frac{1}{2} \Vert \Delta\Vert _2^2 \right] = 0\]Use \(\beta_1^{t-s}\mathbf{g}_s = \beta_1^t \tilde{\mathbf{g}}_s\) to rewrite the first-order condition:
\[\eta_t \beta_1^t \sum_{s=1}^t \tilde{\mathbf{g}}_s + \Delta = 0\]Solve for \(\Delta\):
\[\Delta_t = -\eta_t \beta_1^t \sum_{s=1}^t \tilde{\mathbf{g}}_s\]Plug in the learning rate:
\[\Delta_t = -\alpha \beta_1^t \frac{\sum_{s=1}^t \tilde{\mathbf{g}}_s}{\sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2}} \cdot \frac{1}{\beta_1^t} = -\alpha \frac{\sum_{s=1}^t \tilde{\mathbf{g}}_s}{\sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2}}\]Recover the standard Adam terms from the recursions for \(\mathbf{m}_t\) and \(\mathbf{v}_t\), taking \(\beta_2 = \beta_1^2\):
\[\begin{aligned} \text{Numerator: } \sum_{s=1}^t \tilde{\mathbf{g}}_s &= \frac{\beta_1^{-t}}{1-\beta_1}\, \mathbf{m}_t \\ \text{Denominator: } \sqrt{\sum_{s=1}^t \tilde{\mathbf{g}}_s^2} &= \frac{\beta_1^{-t}}{\sqrt{1-\beta_2}}\, \sqrt{\mathbf{v}_t} \end{aligned}\]Absorbing the constant \(\sqrt{1-\beta_2}/(1-\beta_1)\) into \(\alpha\) and reinstating the stabilizer \(\epsilon\) (the bias corrections \(1-\beta_1^t\) and \(1-\beta_2^t\) tend to \(1\) and are absorbed likewise):
\[\Delta_t = -\alpha \frac{\mathbf{m}_t}{\sqrt{\mathbf{v}_t} + \epsilon}\]
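The correspondence is easy to check numerically. The sketch below (our own verification, with \(\beta_2 = \beta_1^2\), no bias correction, and no \(\epsilon\)) confirms that the β-FTRL closed form matches Adam's direction up to the constant absorbed into \(\alpha\):

```python
import numpy as np

rng = np.random.default_rng(0)
beta1 = 0.9
beta2 = beta1**2                      # the single-discount derivation assumes beta2 = beta1^2
T, d = 50, 3
m, v = np.zeros(d), np.zeros(d)
g_tilde_sum, g_tilde_sq = np.zeros(d), np.zeros(d)

for t in range(1, T + 1):
    g = rng.normal(size=d)
    m = beta1 * m + (1 - beta1) * g            # Adam first moment
    v = beta2 * v + (1 - beta2) * g**2         # Adam second moment
    g_tilde_sum += beta1**(-t) * g             # sum_s g_tilde_s
    g_tilde_sq += beta1**(-2 * t) * g**2       # sum_s g_tilde_s^2

ftrl_dir = g_tilde_sum / np.sqrt(g_tilde_sq)                    # beta-FTRL solution
adam_dir = (np.sqrt(1 - beta2) / (1 - beta1)) * m / np.sqrt(v)  # rescaled Adam direction
print(np.allclose(ftrl_dir, adam_dir))                          # True
```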
This derivation reveals Adam’s components as fundamental to online learning:
- Momentum \(\beta_1\): Discount factor for past losses
- Adaptive scaling \(\mathbf{v}_t\): Scale-free regularization
- Learning rate \(\alpha\): Global step size scaling
Dynamic Regret Analysis
The OLU framework enables rigorous analysis through dynamic regret:
\[\mathcal{R}_T = \sum_{t=1}^T \ell_t(\Delta_t) - \min_{\{\mathbf{u}_t\}} \sum_{t=1}^T \ell_t(\mathbf{u}_t - \mathbf{u}_{t-1})\]where \(\mathbf{u}_t\) is a comparator sequence. Adam achieves strong regret bounds:
Proposition. Dynamic Regret of β-FTRL
For any comparator sequence \(\{\mathbf{u}_t\}\) with \(\Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert \leq D\):
\[\mathcal{R}_T \leq \frac{\alpha}{\sqrt{1-\beta_1}} \sum_{i=1}^d \sqrt{\sum_{t=1}^T g_{t,i}^2} + \frac{D^2}{2\alpha} \sqrt{\sum_{t=1}^T \Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert ^2}\]
Proof Sketch: Regret Decomposition
Decompose the regret, for any fixed \(\mathbf{u}\), into static and stability terms:
\[\mathcal{R}_T = \underbrace{\sum_t \ell_t(\Delta_t) - \sum_t \ell_t(\mathbf{u})}_{\text{Static regret}} + \underbrace{\sum_t \langle \mathbf{g}_t, \mathbf{u} - (\mathbf{u}_t - \mathbf{u}_{t-1}) \rangle}_{\text{Stability penalty}}\]Bound the static regret via scale-free FTRL analysis:
\[\sum_{t=1}^T \langle \mathbf{g}_t, \Delta_t - \mathbf{u} \rangle \leq \frac{\alpha}{\sqrt{1-\beta_1}} \sum_{i=1}^d \Vert \mathbf{g}_{1:T,i}\Vert _2\]Control the stability term with the comparator's path length:
\[\sum_t \langle \mathbf{g}_t, \mathbf{u} - (\mathbf{u}_t - \mathbf{u}_{t-1}) \rangle \leq \frac{D^2}{2\alpha} \sqrt{\sum_{t=1}^T \Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert ^2}\]Combine using Cauchy-Schwarz:
\[\mathcal{R}_T \leq \frac{\alpha}{\sqrt{1-\beta_1}} \sum_{i=1}^d \sqrt{\sum_{t=1}^T g_{t,i}^2} + \frac{D^2}{2\alpha} \sqrt{\sum_{t=1}^T \Vert \mathbf{u}_t - \mathbf{u}_{t-1}\Vert ^2}\]
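As a sanity check, one can compare the realized regret of the β-FTRL update against this bound on synthetic linear losses. The toy experiment below (ours, not from the papers) uses a comparator that moves by the best fixed increment \(\mathbf{u}\) with \(\Vert \mathbf{u}\Vert \leq D\) every round:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 500, 4
alpha, beta1, D = 0.1, 0.9, 1.0
beta2 = beta1**2                               # single-discount correspondence

G = rng.normal(size=(T, d))                    # synthetic gradient sequence
m, v = np.zeros(d), np.zeros(d)
learner_loss = 0.0
for g in G:
    # Per the FTRL objective above, Delta_t responds to losses up to round t.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    delta = -alpha * m / (np.sqrt(v) + 1e-12)  # beta-FTRL / Adam increment
    learner_loss += g @ delta                  # l_t(delta_t) = <g_t, delta_t>

# Best fixed increment u with ||u|| <= D for linear losses: u* = -D * S / ||S||.
u_star_loss = -D * np.linalg.norm(G.sum(axis=0))
regret = learner_loss - u_star_loss
bound = (alpha / np.sqrt(1 - beta1) * np.sqrt((G**2).sum(axis=0)).sum()
         + D**2 / (2 * alpha) * np.sqrt(T * D**2))  # path length sqrt(T * D^2)
print(f"regret = {regret:.1f} <= bound = {bound:.1f}")
```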
EMA for Nonconvex Optimization
While Adam excels in online settings, nonconvex optimization requires additional stabilization. Ahn et al. (2024b) prove that combining clipped Adam with model EMA achieves optimal convergence:
Theorem. Convergence of Adam+EMA
For \(L\)-smooth nonconvex \(F\) with stochastic gradients satisfying \(\mathbb{E}[\Vert \mathbf{g}_t - \nabla F(\mathbf{x}_t)\Vert ^2] \leq \sigma^2\), clipped Adam with EMA attains:
\[\mathbb{E}\left[ \Vert \nabla F(\bar{\mathbf{x}}_T)\Vert ^2 \right] \leq \mathcal{O}\left( \frac{\sigma}{\sqrt{T}} + \frac{\sigma^2}{T} \right)\]which matches, up to constants, the known lower bound for stochastic smooth nonconvex optimization.
The EMA update:
\[\bar{\mathbf{x}}_t = (1 - \gamma)\bar{\mathbf{x}}_{t-1} + \gamma\mathbf{x}_t\]provides three key benefits (see the sketch after this list):
- Smoothing: The averaged iterate behaves as if it were optimizing a smoothed version of the landscape
- Variance Reduction: Averages out gradient noise
- Implicit Iterate Averaging: Stabilizes convergence
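A minimal training-loop sketch combining the pieces the theorem refers to, assuming simple global-norm clipping (the exact clipping scheme of Ahn et al. (2024b) may differ; all names are ours):

```python
import numpy as np

def train_clipped_adam_ema(x, grad_fn, steps, alpha=1e-3, clip=1.0, gamma=0.01):
    """Clipped Adam-style steps on x, with an EMA copy x_bar maintained alongside."""
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    m, v = np.zeros_like(x), np.zeros_like(x)
    x_bar = x.copy()                                           # EMA of the iterates
    for t in range(1, steps + 1):
        g = grad_fn(x)
        g = g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))   # global-norm clipping
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        x = x - alpha * (m / (1 - beta1**t)) / (np.sqrt(v / (1 - beta2**t)) + eps)
        x_bar = (1 - gamma) * x_bar + gamma * x                # EMA update from the post
    return x_bar                                               # deploy the averaged iterate
```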
Why EMA Outperforms Uniform Averaging
Compared to uniform averaging \(\bar{\mathbf{x}}_T = \frac{1}{T} \sum_{t=1}^T \mathbf{x}_t\), EMA (both one-pass updates are sketched below):
- Assigns higher weight to recent iterates
- Adapts faster to curvature changes
- Forgets stale iterates while still needing only \(\mathcal{O}(d)\) memory
- Provides better empirical performance in generative models
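Both averages can be maintained in a single pass with \(\mathcal{O}(d)\) state; they differ only in how they weight past iterates. A small illustrative sketch:

```python
import numpy as np

def running_averages(iterates, gamma=0.05):
    """Maintain uniform and exponential moving averages of an iterate stream."""
    uniform, ema = None, None
    for t, x in enumerate(iterates, start=1):
        if uniform is None:
            uniform, ema = x.copy(), x.copy()
        else:
            uniform += (x - uniform) / t          # uniform: weight 1/T each
            ema = (1 - gamma) * ema + gamma * x   # EMA: geometric weights
    return uniform, ema

# Example: a drifting iterate sequence 0.01, 0.02, ..., 2.00.
xs = [np.array([0.01 * t]) for t in range(1, 201)]
print(running_averages(xs))  # uniform ends near 1.0; EMA near 1.8, closer to the final 2.0
```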
Heuristically, an optimal weighting scheme would balance
\[\gamma_t \propto \mathbb{E}[\Vert \nabla F(\mathbf{x}_t)\Vert ^{-1}]\]which EMA approximates through exponential discounting.
Coordinate-Wise Adaptivity Advantage
Adam’s per-coordinate scaling provides provable acceleration under gradient scale heterogeneity:
| Condition | Isotropic Methods | Adam (β-FTRL) |
| --- | --- | --- |
| Uniform gradient scales | \(\mathcal{O}(\sqrt{dT})\) | \(\mathcal{O}(\sqrt{dT})\) |
| Heterogeneous gradient scales | \(\mathcal{O}\big(\sqrt{d\sum_t \Vert \mathbf{g}_t\Vert ^2}\big)\) | \(\mathcal{O}\big(\sum_{i=1}^d \sqrt{\sum_t g_{t,i}^2}\big)\) |
The coordinate-adaptive regret bound:
\[\mathcal{R}_T \leq \mathcal{O}\left( \sum_{i=1}^d \sqrt{\sum_{t=1}^T g_{t,i}^2 } \right)\]can be \(\sqrt{d}\)-times smaller than isotropic methods when gradient norms vary significantly across coordinates.
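A quick numerical illustration of this gap (our own toy example): when one coordinate dominates the gradient norm, the coordinate-wise sum is roughly \(\sqrt{d}\)-times smaller than the isotropic quantity.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 1000, 100
scales = np.full(d, 1e-3)
scales[0] = 1.0                                 # one coordinate dominates the gradient norm
G = rng.normal(size=(T, d)) * scales

adaptive = np.sqrt((G**2).sum(axis=0)).sum()    # sum_i sqrt(sum_t g_{t,i}^2)
isotropic = np.sqrt(d * (G**2).sum())           # sqrt(d * sum_t ||g_t||^2)
print(adaptive / isotropic)  # ~ 1/sqrt(d) here; equals 1 when all scales match
```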
Summary: Online Learning Perspective
Adam Component Correspondence
| Adam Element | Online Learning Interpretation |
| --- | --- |
| Momentum \(\beta_1\) | Loss discount factor |
| Scaling \(\mathbf{v}_t\) | Adaptive regularization |
| Learning rate \(\alpha\) | Step size multiplier |
| EMA \(\gamma\) | Implicit iterate averaging |
Key Insights
- Adam implements discounted FTRL for update directions
- Momentum enables adaptation to non-stationarity
- EMA provides optimal convergence in nonconvex settings
- Per-coordinate scaling accelerates convergence under gradient heterogeneity
References
- Ahn, K. (2024a). Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise. arXiv:2402.01567
- Ahn, K., Lee, J. D., Sra, S., & Oh, S. (2024b). Adam with model exponential moving average is effective for nonconvex optimization. arXiv:2405.18199
- Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. ICLR
- McMahan, H. B. (2017). A Survey of Algorithms and Analysis for Adaptive Online Learning. JMLR