A Modern Introduction to Online Learning - Ch 1
These are notes on the text A Modern Introduction to Online Learning by Francesco Orabona, available on arXiv.
Chapter 1 - What is Online Learning?
The core idea of Online Learning in this context is presented through a repeated game framework where the goal is to minimize Regret.
The Basic Game (Example):
- For rounds \(t = 1, \dots, T\):
- An adversary chooses a secret number \(y_t \in [0, 1]\).
- You choose a number \(x_t \in [0, 1]\).
- The adversary reveals \(y_t\), and you pay the squared loss \((x_t - y_t)^2\).
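To make the protocol concrete, here is a minimal simulation sketch in Python (my own illustration, not from the book); the `play_game`, `learner`, and `adversary` names and the uniform adversary are assumptions for the example:

```python
import random

def play_game(T, learner, adversary):
    """Simulate the repeated guessing game and return the learner's total loss."""
    total_loss = 0.0
    history = []                        # revealed numbers y_1, ..., y_{t-1}
    for t in range(1, T + 1):
        y_t = adversary(t, history)     # adversary commits to a secret y_t in [0, 1]
        x_t = learner(t, history)       # learner predicts without seeing y_t
        total_loss += (x_t - y_t) ** 2  # squared loss is paid, then y_t is revealed
        history.append(y_t)
    return total_loss

# Example run: i.i.d. uniform adversary vs. a learner that always predicts 0.5.
loss = play_game(
    T=1000,
    learner=lambda t, hist: 0.5,
    adversary=lambda t, hist: random.random(),
)
print(loss)  # about T * sigma^2 = 1000 / 12 for the uniform distribution
```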
Goal: Minimize Regret
Initially, if we assume the \(y_t\) are i.i.d. from some distribution with variance \(\sigma^2\), the best strategy is to always predict the mean of the distribution, incurring an expected total loss of \(\sigma^2 T\). A natural goal is to minimize the excess loss compared to this optimal (but unknown) strategy.
Removing the stochastic assumption, we consider an arbitrary sequence \(y_t\). The performance metric becomes the Regret: how much worse our total loss is compared to the best single fixed action chosen in hindsight.
Definition - Regret (for the example game)
The regret after \(T\) rounds is defined as:
\[\text{Regret}_T := \sum_{t=1}^T (x_t - y_t)^2 - \min_{x \in [0, 1]} \sum_{t=1}^T (x - y_t)^2\]
An algorithm is considered successful (“wins the game”) if \(\text{Regret}_T\) grows sublinearly in \(T\) (i.e., \(\lim_{T\to\infty} \frac{\text{Regret}_T}{T} = 0\)).
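Since each \(y_t \in [0, 1]\), the minimizer of \(\sum_t (x - y_t)^2\) over \([0, 1]\) is simply the empirical mean of the \(y_t\), so the regret for this game can be computed directly. A small sketch (the function name is mine):

```python
def squared_loss_regret(xs, ys):
    """Regret of predictions xs against the sequence ys under the squared loss.

    The best fixed action in hindsight is the empirical mean of the y_t,
    which lies in [0, 1] automatically because every y_t does.
    """
    best_x = sum(ys) / len(ys)
    alg_loss = sum((x - y) ** 2 for x, y in zip(xs, ys))
    best_loss = sum((best_x - y) ** 2 for y in ys)
    return alg_loss - best_loss
```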
General Online Learning Framework:
- At each round \(t\), the algorithm chooses an action \(x_t\) from a feasible set \(V \subseteq \mathbb{R}^d\).
- It incurs a loss \(\ell_t(x_t)\), where the loss function \(\ell_t\) can be chosen arbitrarily (adversarially) at each round.
- The goal is to minimize the regret compared to the best fixed competitor \(u \in V\) in hindsight.
Definition - Regret (General)
For a sequence of loss functions \(\ell_1, \dots, \ell_T\) and algorithm predictions \(x_1, \dots, x_T\), the regret with respect to a competitor \(u \in V\) is:
\[\text{Regret}_T(u) := \sum_{t=1}^T \ell_t(x_t) - \sum_{t=1}^T \ell_t(u)\]
The online algorithm does not know \(u\) or the future losses when making its prediction \(x_t\).
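The general protocol is easy to write down in code; the sketch below (the names and the one-dimensional action type are my simplifications) runs the loop and evaluates the regret against a fixed competitor \(u\):

```python
from typing import Callable, List, Sequence

def regret_against(
    algorithm: Callable[[List[Callable[[float], float]]], float],
    losses: Sequence[Callable[[float], float]],
    u: float,
) -> float:
    """Run the online protocol on the given losses and return Regret_T(u).

    `algorithm` maps the list of past loss functions to the next action x_t;
    it never sees the current loss ell_t or the competitor u before acting.
    """
    past: List[Callable[[float], float]] = []
    total = 0.0
    for ell_t in losses:
        x_t = algorithm(past)
        total += ell_t(x_t) - ell_t(u)
        past.append(ell_t)          # ell_t is revealed only after the action
    return total
```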
Adversarial Nature: The sequence of losses \(\ell_t\) can be chosen by an adversary, potentially based on the algorithm’s past actions. This rules out standard statistical assumptions (like i.i.d. data) in the analysis.
A Strategy: Follow-the-Leader (FTL)
A natural strategy is to choose the action at time \(t\) that would have been optimal for the past rounds \(1, \dots, t-1\).
- In the example game: The best action in hindsight after \(T\) rounds is \(x^*_T = \frac{1}{T} \sum_{t=1}^T y_t\).
- FTL Strategy: \(x_t = x^*_{t-1} = \frac{1}{t-1} \sum_{i=1}^{t-1} y_i\) (for \(t > 1\)).
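In code, FTL for this game is just a running average. A minimal sketch (the choice \(x_1 = 0\) is mine; the first prediction is arbitrary since there is no past data):

```python
def ftl_predictions(ys):
    """Follow-the-Leader predictions for the guessing game.

    x_t is the mean of y_1, ..., y_{t-1}; the prediction for round t is
    computed before y_t is used, mirroring the online protocol.
    """
    xs, running_sum = [], 0.0
    for t, y in enumerate(ys, start=1):
        xs.append(running_sum / (t - 1) if t > 1 else 0.0)
        running_sum += y
    return xs
```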
Analysis of FTL for the Guessing Game:
- Lemma 1.2 (Hannan’s Lemma): Let \(x^*_t = \arg\min_{x \in V} \sum_{i=1}^t \ell_i(x)\). Then for any sequence of loss functions \(\ell_t\): \(\sum_{t=1}^T \ell_t(x^*_t) \le \sum_{t=1}^T \ell_t(x^*_T)\) (Playing adaptively based on past data is no worse than playing the single best action in hindsight).
- Theorem 1.3: For the number guessing game (\(\ell_t(x)=(x-y_t)^2, y_t \in [0,1]\)), the FTL strategy (\(x_t = x^*_{t-1}\)) achieves \(\text{Regret}_T \le 4 + 4 \ln T\). This is sublinear in \(T\), so FTL “wins” this specific game.
Proof Idea: Use Lemma 1.2 to bound the regret against the sequence \(x^*_1, \dots, x^*_T\). Then bound the difference between consecutive optimal actions \(|x^*_{t-1} - x^*_t|\). Show this difference decreases quickly enough (like \(O(1/t)\)), so that the sum over rounds is bounded by \(O(\ln T)\).
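As a quick sanity check (my own experiment, not from the book), one can run FTL on a random sequence and compare the realized regret with the \(4 + 4 \ln T\) bound of Theorem 1.3, reusing the `ftl_predictions` and `squared_loss_regret` helpers sketched above:

```python
import math
import random

T = 10_000
ys = [random.random() for _ in range(T)]    # any sequence in [0, 1] would do
xs = ftl_predictions(ys)                    # FTL sketch from above
regret_T = squared_loss_regret(xs, ys)      # regret sketch from above
print(regret_T, "<=", 4 + 4 * math.log(T))  # realized regret vs. the Theorem 1.3 bound
```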
Key Takeaways from FTL Example:
- The FTL strategy for this specific game is parameter-free (no learning rates or other parameters to tune).
- It is computationally efficient (only requires maintaining a running average).
- It does not use gradients.
- Online learning is distinct from statistical learning; concepts like overfitting don’t directly apply.
History Bits:
- Savage (1951): Introduced the concept of “loss” (regret) in statistical decision problems based on Wald (1950).
- Hannan (1957): Designed the first randomized algorithm for repeated games with vanishing average regret; proved Lemma 1.2.