Online Learning Crash Course – Cheat Sheet

This cheat sheet provides a quick reference to the key concepts, definitions, algorithms, and bounds covered in the Online Learning Crash Course. Refer to the individual modules for detailed explanations and derivations.

1. Core Concepts

| Concept | Description / Definition | Module |
| --- | --- | --- |
| Online Learning Protocol | Iterative process: 1. Learner plays \(x_t \in \mathcal{X}\). 2. Environment reveals \(\ell_t(\cdot)\). 3. Learner incurs \(\ell_t(x_t)\) and observes feedback. | 0 |
| Decision Set (\(\mathcal{X}\)) | The set of allowable actions for the learner. Often assumed convex and bounded. | 0 |
| Loss Function (\(\ell_t\)) | Quantifies the penalty for action \(x_t\) at round \(t\). Often assumed convex in Online Convex Optimization (OCO). | 0 |
| Static Regret (\(R_T\)) | \(R_T = \sum_{t=1}^T \ell_t(x_t) - \min_{x^\ast \in \mathcal{X}} \sum_{t=1}^T \ell_t(x^\ast)\). Goal: sublinear regret (\(R_T = o(T)\)). | 1 |
| Online Convex Optimization (OCO) | Online learning where \(\mathcal{X}\) and all \(\ell_t\) are convex. | 0, 2 |
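
As a quick sanity check on the regret definition, here is a minimal NumPy sketch that computes empirical static regret for a toy sequence of squared losses. The helper names are hypothetical, and a finite grid stands in for the continuous \(\mathcal{X}\).

```python
import numpy as np

def static_regret(losses, plays, comparators):
    """Empirical static regret R_T: cumulative loss of the learner's plays
    minus the cumulative loss of the best fixed comparator in hindsight."""
    learner = sum(loss(x) for loss, x in zip(losses, plays))
    best_fixed = min(sum(loss(u) for loss in losses) for u in comparators)
    return learner - best_fixed

# Toy check: squared losses pulling toward random targets in [-1, 1].
rng = np.random.default_rng(0)
targets = rng.uniform(-1, 1, size=100)
losses = [lambda x, z=z: (x - z) ** 2 for z in targets]
plays = np.zeros(100)              # a trivial learner that always plays 0
grid = np.linspace(-1, 1, 201)     # finite grid approximating X = [-1, 1]
print(static_regret(losses, plays, grid))
```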

2. Key Algorithms Summary

| Algorithm | Core Update Idea | Key Regularizer / Divergence / Geometry | Typical Regret (Convex Losses) | Pros | Cons / Notes | Module |
| --- | --- | --- | --- | --- | --- | --- |
| Online Gradient Descent (OGD) | \(x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t g_t)\) | Implicit Euclidean regularizer (\(\frac{1}{2}\Vert x - x_t \Vert_2^2\)) | \(O(DG\sqrt{T})\) | Simple; optimal for general convex losses. | Sensitive to \(\eta_t\); uniform steps across coordinates. | 2 |
| Follow-The-Regularized-Leader (FTRL) | \(x_{t+1} = \arg\min_{x \in \mathcal{X}} \left( \sum_{s=1}^{t} \langle g_s, x \rangle + \frac{1}{\eta} R(x) \right)\) | User-chosen regularizer \(R(x)\) (e.g., \(\frac{1}{2}\Vert x \Vert_2^2\)) | Varies with \(R(x)\); e.g., \(O(\sqrt{T})\) for L2. | Unifying, flexible. | The \(\arg\min\) can be costly. | 3 |
| Online Mirror Descent (OMD) | \(x_{t+1} = \arg\min_{x \in \mathcal{X}} \left( \eta_t \langle g_t, x \rangle + D_\psi(x, x_t) \right)\) | Bregman divergence \(D_\psi(x, y)\) from mirror map \(\psi\) | \(O(R_{\max} G_\ast \sqrt{T/\sigma})\) (geometry-dependent constants) | Adapts to problem geometry (e.g., simplex with KL divergence). | Choice of \(\psi\) is crucial; the \(\arg\min\) needs a solver. | 4 |
| AdaGrad (Diagonal) | \(x_{t+1,j} = x_{t,j} - \frac{\eta}{\sqrt{H_{t,j}} + \epsilon} g_{t,j}\), where \(H_{t,j} = \sum_{s=1}^t g_{s,j}^2\) | Data-dependent (diagonal Mahalanobis norm) | \(O\big(D_\infty \sum_j \sqrt{\sum_{s=1}^T g_{s,j}^2}\big)\) (good for sparse gradients) | Adaptive per-coordinate learning rates; no manual \(\eta_t\) schedule. | Learning rates decrease monotonically and can become too small. | 5 |

Where:

  • \(g_t = \nabla \ell_t(x_t)\).
  • \(D\): Diameter of \(\mathcal{X}\) w.r.t. the L2 norm.
  • \(G\): Upper bound on \(\Vert g_t \Vert_2\).
  • \(R_{\max}^2 \approx D_\psi(x^\ast , x_1)\).
  • \(G_\ast\): Upper bound on \(\Vert g_t \Vert_{\psi^\ast }\) (dual norm).
  • \(\sigma\): Strong convexity parameter of \(\psi\).
  • \(D_\infty \approx \max_x \Vert x \Vert_\infty\).
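
To make the OGD row concrete, here is a minimal NumPy sketch, assuming the decision set is an L2 ball (so the projection \(\Pi_{\mathcal{X}}\) has a closed form) and a user-supplied callback `grad_fn` that returns \(g_t\); both function names are illustrative.

```python
import numpy as np

def project_l2_ball(x, radius):
    """Euclidean projection onto the L2 ball {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def ogd(grad_fn, x0, T, D, G, radius):
    """Projected OGD with the decaying step eta_t = D / (G * sqrt(t))
    (the unknown-horizon schedule from Section 4 below).

    grad_fn(t, x) should return g_t, the (sub)gradient of ell_t at x.
    """
    x = np.asarray(x0, dtype=float).copy()
    plays = []
    for t in range(1, T + 1):
        plays.append(x.copy())          # x_t is played before g_t is seen
        g = grad_fn(t, x)
        eta = D / (G * np.sqrt(t))
        x = project_l2_ball(x - eta * g, radius)
    return plays
```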

3. Important Mathematical Tools

  • Bregman Divergence: For a differentiable, strictly convex function \(\psi\):

    \[D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y), x - y \rangle\]
    • If \(\psi(x) = \frac{1}{2}\Vert x \Vert_2^2\), then \(D_\psi(x, y) = \frac{1}{2}\Vert x - y \Vert_2^2\).
    • If \(\psi(x) = \sum x_i \log x_i\) (negative entropy), then \(D_\psi(x, y) = \sum x_i \log(x_i/y_i)\) (KL-divergence).
  • Strong Convexity: A function \(f\) is \(\sigma\)-strongly convex w.r.t. a norm \(\Vert \cdot \Vert\) if for all \(x, y\) in its domain and \(\alpha \in [0,1]\): \(f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y) - \frac{1}{2}\sigma \alpha (1-\alpha) \Vert x-y \Vert^2\). Equivalently, for differentiable \(f\): \(f(x) \ge f(y) + \langle \nabla f(y), x-y \rangle + \frac{\sigma}{2}\Vert x-y \Vert^2\).
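
The two Bregman examples above are easy to check numerically. The sketch below (hypothetical helper names, NumPy assumed) verifies that negative entropy recovers the KL divergence on the simplex, where \(\sum_i x_i = \sum_i y_i = 1\), and shows the resulting closed-form OMD step (the exponentiated-gradient update).

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - grad_psi(y) @ (x - y)

neg_entropy = lambda p: float(np.sum(p * np.log(p)))
neg_entropy_grad = lambda p: np.log(p) + 1.0

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
kl = float(np.sum(p * np.log(p / q)))
# Equality holds because p and q both sum to 1 (simplex constraint).
assert np.isclose(bregman(neg_entropy, neg_entropy_grad, p, q), kl)

def eg_step(x, g, eta):
    """OMD step on the simplex with the negative-entropy mirror map; the
    argmin reduces to a multiplicative update followed by renormalization."""
    w = x * np.exp(-eta * g)
    return w / w.sum()
```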

4. Learning Rate Strategies (for OGD/OMD)

  • Known \(T\), \(D\), \(G\): \(\eta_t = \eta = \frac{D}{G\sqrt{T}}\) (or similar, depending on exact bound/constants) gives \(O(\sqrt{T})\) regret.
  • Unknown \(T\): Decaying schedule, e.g., \(\eta_t = \frac{\eta_0}{\sqrt{t}}\). Can also achieve \(O(\sqrt{T})\) regret.
  • Strongly Convex Losses (\(\lambda\)): \(\eta_t = \frac{1}{\lambda t}\) can give \(O(\log T)\) regret.
  • Adaptive Methods (e.g., AdaGrad): Learning rates are set automatically based on gradient history. Often only a base learning rate \(\eta\) needs tuning.
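
Collected as code, the three schedules look as follows; this is a sketch with the simplified constants above, which may differ from the exact constants derived in the modules.

```python
import numpy as np

# Step-size schedules from Section 4, one per regime.
eta_known_horizon = lambda D, G, T: D / (G * np.sqrt(T))  # known T: constant step
eta_anytime       = lambda eta0, t: eta0 / np.sqrt(t)     # unknown T: decaying
eta_strongly_cvx  = lambda lam, t: 1.0 / (lam * t)        # lambda-strongly convex
```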

5. Online-to-Batch Conversion

If the loss functions have the form \(\ell_t(x) = \ell(x; z_t)\) with \(z_t \sim \mathcal{D}\) drawn i.i.d., and an online algorithm achieves regret \(R_T\), then by convexity of the loss and Jensen's inequality the averaged predictor \(\bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t\) has expected excess risk

\[\mathbb{E}[L(\bar{x}_T)] - L(x^{opt}) \le \frac{\mathbb{E}[R_T(x^{opt})]}{T},\]

where \(L(x) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(x; z)]\) is the population risk and \(x^{opt} \in \arg\min_{x \in \mathcal{X}} L(x)\).

If \(R_T = O(\sqrt{T})\), then expected excess risk is \(O(1/\sqrt{T})\).
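
A minimal sketch of the conversion, assuming two illustrative callbacks: `online_step` performs one update of any online learner, and `sample` draws \(z_t \sim \mathcal{D}\).

```python
import numpy as np

def online_to_batch(online_step, sample, x0, T):
    """Run an online learner on i.i.d. samples and return the averaged
    iterate x_bar_T = (1/T) * sum_t x_t.

    online_step(t, x, z) -> x_{t+1} after seeing sample z_t.
    sample() -> a fresh draw z_t ~ D.
    """
    x = np.asarray(x0, dtype=float).copy()
    total = np.zeros_like(x)
    for t in range(1, T + 1):
        total += x                       # accumulate the iterate x_t played at round t
        x = online_step(t, x, sample())
    return total / T                     # the averaged predictor x_bar_T
```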

6. Quick Algorithm Choice Guide

Simplified Algorithm Selection

  1. General Convex Problem? Start with OGD.
    • Simple, good baseline.
    • Requires learning rate tuning.
  2. Problem has specific geometry (e.g., simplex, PSD cone)? Consider OMD.
    • Choose mirror map \(\psi\) appropriate for \(\mathcal{X}\) (e.g., negative entropy for simplex).
    • May lead to better constants or structure-exploiting updates (e.g., multiplicative updates).
  3. High-dimensional, sparse features? Consider AdaGrad (see the sketch after this list).
    • Adapts learning rates per coordinate.
    • Often less sensitive to initial global learning rate \(\eta\).
    • Be mindful of learning rates decaying too quickly. For deep learning, variants like RMSProp/Adam are preferred.
  4. Need a unifying theoretical framework? Think FTRL.
    • Many algorithms (OGD, AdaGrad variants) are instances of FTRL with different regularizers \(R(x)\).
    • Useful for deriving new algorithms or analyzing existing ones.
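
For reference, a minimal diagonal-AdaGrad sketch matching the update in the Section 2 table. It is unprojected (add a projection for a constrained \(\mathcal{X}\)), and the defaults for `eta` and `eps` are illustrative.

```python
import numpy as np

class DiagonalAdaGrad:
    """Diagonal AdaGrad: per-coordinate steps eta / (sqrt(H_{t,j}) + eps)."""

    def __init__(self, dim, eta=0.1, eps=1e-8):
        self.eta = eta            # base learning rate (the only knob to tune)
        self.eps = eps            # numerical safeguard before H accumulates
        self.H = np.zeros(dim)    # per-coordinate sum of squared gradients

    def step(self, x, g):
        self.H += g * g
        return x - self.eta * g / (np.sqrt(self.H) + self.eps)
```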

This cheat sheet is intended as a condensed summary. For a deeper understanding of the nuances, assumptions, and derivations, please revisit the detailed modules of this Online Learning Crash Course. Good luck with your learning journey into optimization!

This post is licensed under CC BY 4.0 by the author.