A Modern Introduction to Online Learning - Ch 2
These are notes for the text A Modern Introduction to Online Learning by Francesco Orabona on arXiv.
Chapter 2 - Online Subgradient Descent
The main contribution of this chapter is presenting the online subgradient descent algorithm (which is not a pure descent method!), which generalizes the gradient descent method to non-differentiable convex losses.
The general online learning problem:
Definition. Online Learning Game
- For \(t = 1, \dots, T \in \mathbb{N}\)
- Output \(x_t \in \mathcal{V} \subseteq \mathbb{R}^d\)
- Pay \(\ell_t(x_t)\) for \(\ell_t: \mathcal{V} \to \mathbb{R}\)
- Receive some feedback on \(\ell_t\)
- End for
Goal: minimize regret.
Definition. Regret (General)
For a sequence of loss functions \(\ell_1, \dots, \ell_T\) and algorithm predictions \(x_1, \dots, x_T\), the regret with respect to a competitor \(u \in \mathcal{V}\) is:
\[\text{Regret}_T(u) := \sum_{t=1}^T \ell_t(x_t) \;-\; \sum_{t=1}^T \ell_t(u).\]The online algorithm does not know \(u\) or future losses when making its prediction \(x_t\).
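To make the protocol concrete, here is a minimal Python sketch of the game loop that also tracks the regret against a fixed competitor \(u\); the function names and the full-information feedback (observing all of \(\ell_t\)) are illustrative choices, not prescribed by the text.

```python
def play_game(algorithm, losses, competitor):
    """Run the online protocol and return Regret_T(u) against a fixed competitor u.

    `algorithm` maps the list of past loss functions to the next prediction x_t;
    `losses` is the list of per-round loss functions ell_t. Both names are
    illustrative, not from the text.
    """
    history, alg_loss, comp_loss = [], 0.0, 0.0
    for ell_t in losses:
        x_t = algorithm(history)        # output x_t using only past feedback
        alg_loss += ell_t(x_t)          # pay ell_t(x_t)
        comp_loss += ell_t(competitor)  # the fixed competitor u pays ell_t(u)
        history.append(ell_t)           # full-information feedback: observe ell_t
    return alg_loss - comp_loss

# Example: always predict 0 against losses ell_t(x) = x, with competitor u = -1.
print(play_game(lambda history: 0.0, [lambda x: x] * 10, -1.0))  # regret = 10.0
```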
Problem: this formulation is too general and we cannot make any progress. We must choose reasonable assumptions.
Plausible:
- Convex loss
- Lipschitz loss
Implausible:
- Knowing the future
Why do we need extra assumptions when Follow-The-Leader (FTL) worked in our first game? Here is an example failure case in a different game.
Example 2.10. Failure of FTL
Intuitively, we will construct two pendulums that are out of sync. Or you can imagine a player in a sport constantly juking a naive opponent who always lags one step behind in the wrong direction.
Let \(\mathcal{V}=[-1,1]\). Consider the sequence of losses \(\ell_t(x)=z_t x\), where \(z_t=-0.5,1,-1,1,-1,\dots\). On the first round, we let FTL predict \(x_1\) arbitrarily since we have no past history. Then, FTL will predict \(x_1,1,-1,1,-1,\dots\): for example, after round 1 the cumulative loss is \(-0.5x\), minimized over \([-1,1]\) at \(x_2=1\); after round 2 it is \(0.5x\), minimized at \(x_3=-1\); and so on. Overall, the cumulative loss becomes:
\[\begin{aligned} \sum_{t=1}^{T} \ell_t(x_t) &=\underbrace{-0.5x_1}_\text{1 round}+\underbrace{(1)(1)+(-1)(-1)+\dots}_{T-1\text{ rounds}} \\ &=-0.5x_1+T-1 \end{aligned}\]Meanwhile, the fixed competitor \(u=0\) yields \(0\) cumulative loss. Thus:
\[\text{Regret}_T(u=0)=T-1-0.5x_1\ge T-1.5.\]
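A quick simulation makes the linear growth of the regret visible. The sketch below (the horizon \(T\) and the choice \(x_1=0\) are arbitrary) plays FTL against the sequence above and returns its regret versus \(u=0\).

```python
import numpy as np

def ftl_regret(T=1000, x1=0.0):
    """Simulate FTL on V = [-1, 1] with losses ell_t(x) = z_t * x,
    where z = -0.5, 1, -1, 1, ...  (T and x1 are arbitrary choices)."""
    z = np.array([-0.5] + [(-1.0) ** t for t in range(T - 1)])  # 1, -1, 1, ...
    cum_z, alg_loss, x_t = 0.0, 0.0, x1
    for t in range(T):
        alg_loss += z[t] * x_t          # pay the current loss
        cum_z += z[t]
        # FTL: minimize (sum of past z) * x over [-1, 1] -> pick the endpoint
        # opposite in sign to the cumulative coefficient
        x_t = 1.0 if cum_z < 0 else -1.0
    return alg_loss                     # competitor u = 0 pays zero, so this is the regret

print(ftl_regret(10), ftl_regret(100))  # roughly 9 and 99: regret grows like T - 1
```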
Basic theory of convex analysis
Definition 2.2. Convex Set
A set \(\mathcal{V} \subseteq \mathbb{R}^d\) is convex iff it contains the line segment between any two points in the set.
Intuitively, the shape doesn’t bend inward to create a hollow area.
\[0 < \lambda < 1,\; x,y \in \mathcal{V} \;\implies\; \lambda x + (1-\lambda) y \in \mathcal{V}.\]
Definition. Extended Real Number Line
\[[-\infty,+\infty]\]
Definition. Domain of a Function
\[\mathrm{dom}\,f := \{x \in \mathbb{R}^d : f(x) < +\infty\}.\]
Definition. Indicator Function of a Set
\[i_{\mathcal{V}}(x) := \begin{cases} 0, & x \in \mathcal{V},\\ +\infty, & \text{otherwise}. \end{cases}\]
Tip. Constrained to Unconstrained Formulation
The indicator function and extended real number line allow us to convert a constrained optimization problem into an unconstrained one:
\[\arg\min_{x \in \mathcal{V}} f(x) = \arg\min_{x \in \mathbb{R}^d} \bigl(f(x) + i_{\mathcal{V}}(x)\bigr).\]
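As a small numeric illustration of this identity (the choice \(f(x)=(x-2)^2\), \(\mathcal{V}=[-1,1]\), and the crude grid search are all illustrative), minimizing \(f\) over \(\mathcal{V}\) returns the same point as minimizing \(f+i_{\mathcal{V}}\) over all of \(\mathbb{R}\):

```python
import numpy as np

f = lambda x: (x - 2.0) ** 2                            # illustrative objective
i_V = lambda x: 0.0 if -1.0 <= x <= 1.0 else np.inf     # indicator of V = [-1, 1]

grid = np.linspace(-5.0, 5.0, 10001)
constrained = min((f(x), x) for x in grid if -1.0 <= x <= 1.0)[1]
unconstrained = min((f(x) + i_V(x), x) for x in grid)[1]
print(constrained, unconstrained)  # both ~1.0: the point of V closest to 2
```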
Definition. Epigraph of a Function
The epigraph of \(f: \mathcal{V} \subseteq \mathbb{R}^d \to [-\infty,+\infty]\) is the set of all points above its graph:
\[\mathrm{epi}\,f = \{(x,y)\in \mathcal{V}\times\mathbb{R} : y \ge f(x)\}.\]
Definition 2.3. Convex Function
A function \(f: \mathcal{V}\subseteq\mathbb{R}^d \to [-\infty,+\infty]\) is convex iff its epigraph \(\mathrm{epi}\,f\) is convex. Implicitly, its domain must therefore be convex.
Tip. Sum of Convex Function and Indicator
If \(f: \mathcal{V}\subseteq\mathbb{R}^d \to [-\infty,+\infty]\) is convex, then
\[f + i_{\mathcal{V}} : \mathbb{R}^d \to [-\infty,+\infty]\]is also convex. Moreover, \(i_{\mathcal{V}}\) is convex iff \(\mathcal{V}\) is convex, so each convex set is associated with a convex function.
Theorem 2.4. ([Rockafellar, 1970, Theorem 4.1])
Let \(f: \mathbb{R}^d \to (-\infty,+\infty]\) where \(\text{dom }f\) is a convex set. Then the following are equivalent:
\[f \text{ convex} \iff f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y), \quad \forall \lambda \in (0,1),\; \forall x,y \in \text{dom }f.\]
Example 2.5. Convex functions: affine functions.
The simplest convex functions are affine functions:
\[f(x) = \langle z, x \rangle + b,\]where \(z \in \mathbb{R}^d\) and \(b \in \mathbb{R}\).
Example 2.6. Convex functions: norms.
Definition. Norm of a vector
Given a vector space \(\mathcal{V}\) over a subfield \(\mathbb{F}\) of \(\mathbb{C}\), a norm on \(\mathcal{V}\) is a real-valued function \(\Vert\cdot\Vert: \mathcal{V} \to \mathbb{R}\). Then, for all vectors \(x,y \in \mathcal{V}\), scalars \(s \in \mathbb{F}\), the norm satisfies the following:
- Subadditivity/Triangle inequality:
\[\Vert x+y\Vert \le \Vert x\Vert + \Vert y\Vert\]
- Absolute homogeneity:
\[\Vert sx \Vert = \vert s \vert \Vert x \Vert\]
- Positive definiteness/Point-separating:
\[\Vert x \Vert = 0 \implies x=0\]
Proof. Norms are convex
It is sufficient to show that for \(x,y \in \mathbb{R}^d\), \(\lambda \in (0,1)\):
\[\Vert \lambda x + (1-\lambda)y \Vert \le \lambda \Vert x \Vert + (1-\lambda) \Vert y \Vert\]First, apply the triangle inequality (sub-additivity):
\[\Vert \lambda x + (1-\lambda)y \Vert \le \Vert \lambda x \Vert + \Vert (1-\lambda) y \Vert\]Then, apply absolute homogeneity:
\[\Vert \lambda x \Vert + \Vert (1-\lambda) y \Vert \le \vert \lambda \vert \Vert x \Vert + \vert 1-\lambda \vert \Vert y \Vert\]Since \(\lambda \in (0,1)\), then \(0<\lambda\) and \(0<1-\lambda\). Hence, \(\vert \lambda \vert = \lambda\) and \(\vert 1-\lambda \vert = 1-\lambda\). Chaining the inequalities gives the desired result.
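The same chain of inequalities can be sanity-checked numerically; the dimension and the sampling scheme below are illustrative choices.

```python
import numpy as np

# Check ||lam*x + (1-lam)*y|| <= lam*||x|| + (1-lam)*||y|| on random instances.
rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform(0.0, 1.0)
    lhs = np.linalg.norm(lam * x + (1 - lam) * y)
    rhs = lam * np.linalg.norm(x) + (1 - lam) * np.linalg.norm(y)
    assert lhs <= rhs + 1e-12   # convexity inequality for the Euclidean norm
```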
Theorem 2.7. First-order condition for convexity ([Rockafellar, 1970, Theorem 25.1 and Corollary 25.1.1])
Given a convex function \(f: \mathbb{R}^d \to (-\infty,+\infty]\) with \(f\) differentiable at \(x \in \text{int dom } f\), we have
\[f(y)-f(x)-\langle \nabla f(x),y-x \rangle \ge 0 \quad \forall y \in \mathbb{R}^d.\](Remark. This is actually an iff when \(y\in \text{dom }f\).)
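For intuition, the inequality can be checked numerically on a specific convex function, say \(f(x)=\Vert x\Vert^2\) with \(\nabla f(x)=2x\) (an illustrative choice), for which the left-hand side reduces to \(\Vert y-x\Vert^2 \ge 0\).

```python
import numpy as np

f = lambda x: np.dot(x, x)       # f(x) = ||x||^2, convex
grad_f = lambda x: 2.0 * x       # its gradient

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    gap = f(y) - f(x) - np.dot(grad_f(x), y - x)
    assert gap >= -1e-12         # equals ||y - x||^2 >= 0 for this particular f
```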
Theorem 2.8. First-order optimality condition for differentiable convex functions.
Let \(\mathcal{V}\) be a convex non-empty set, \(x^\ast \in \mathcal{V}\), and \(f: \mathcal{V} \to \mathbb{R}\) a convex function, differentiable over an open set that contains \(\mathcal{V}\). Then,
\[x^\ast \in \arg\min_{x \in \mathcal{V}} f(x) \iff \langle \nabla f(x^\ast), y-x^\ast \rangle \ge 0, \quad \forall y\in \mathcal{V}.\]Moreover, if \(x^\ast \in \text{int } \mathcal{V}\), then \(x^\ast\) is a minimizer iff \(\nabla f(x^\ast)=0\); this follows by choosing \(\epsilon>0\) small enough that \(y=x^\ast-\epsilon \nabla f(x^\ast) \in \mathcal{V}\).
(Remark. Intuitively, this means that there is no feasible direction \(y-x^\ast\) along which the directional derivative \(\nabla_{y-x^\ast} f(x^\ast)\) is negative, i.e., no feasible step decreases the function to first order. Equivalently, no feasible step makes an acute angle with the negative gradient. Furthermore, if \(x^\ast \in \text{int } \mathcal{V}\), every direction is feasible, so it is only possible that no direction results in a decrease.)
Proof.
\((\Rightarrow)\):
Let \(x^\ast \in \arg\min_{x \in \mathcal{V}} f(x)\). By definition, this means \(f(x^\ast) \le f(y),\, \forall y \in \mathcal{V}\). Also, since \(\mathcal{V}\) is convex and non-empty, it contains the line segment between any two of its points.
Formally, for any \(y \in \mathcal{V}\), the points \(z(\lambda)=x^\ast+\lambda(y-x^\ast)\), where \(0 \lt \lambda \lt 1\), lie on the segment between \(x^\ast\) and \(y\) and hence in \(\mathcal{V}\). By the optimality of \(x^\ast\), we have
\[f(x^\ast) \le f(z(\lambda)).\]This implies, since \(0\lt\lambda\):
\[0 \le \frac{f(x^\ast+\lambda(y-x^\ast))-f(x^\ast)}{\lambda}\]Since \(f\) is differentiable, the limit as \(\lambda \downarrow 0\) exists and is precisely equal to the directional derivative \(\langle \nabla f(x^\ast),y-x^\ast\rangle\), which is therefore nonnegative.
\[\diamond\]\((\Leftarrow)\):
Assume \(0 \le \langle \nabla f(x^\ast),y-x^\ast\rangle\) for all \(y \in \mathcal{V}\). By the first-order condition for convexity, we have:
\[0\le f(y)-f(x^\ast)-\langle \nabla f(x^\ast),y-x^\ast \rangle \le f(y)-f(x^\ast).\]Thus \(f(x^\ast)\le f(y)\), which is the definition of a minimizer.
\[\diamond\]Now, we prove the “Moreover” part: if \(x^\ast \in \text{int }\mathcal{V}\), then
\[x^\ast \in \arg\min_{x\in\mathcal{V}} f(x) \iff \nabla f(x^\ast)=0.\]\((\Rightarrow)\):
Assume \(x^\ast \in \text{int }\mathcal{V}\) and \(x^\ast \in \arg\min_{x\in\mathcal{V}} f(x)\).
By definition of an interior point, there exists for some radius \(r>0\), an open ball \(B(x^\ast,r)=\{z \mid \Vert z-x^\ast \Vert \lt r \} \subseteq \mathcal{V}\).
Consider a gradient descent step \(y=x^\ast-\epsilon \nabla f(x^\ast)\), which we want to stay in \(\mathcal{V}\). In other words,
\[\Vert y-x^\ast \Vert = \epsilon \Vert \nabla f(x^\ast) \Vert \lt r.\]If \(\nabla f(x^\ast)=0\), then we are done. Else, choose \(\epsilon\) small enough for the statement to hold:
\[0 \lt \epsilon \lt r/\Vert \nabla f(x^\ast) \Vert.\]By the first part of the theorem, we have:
\[\begin{aligned} 0 &\le \langle \nabla f(x^\ast),y-x^\ast \rangle \\ &= \langle \nabla f(x^\ast),-\epsilon \nabla f(x^\ast) \rangle \\ &= -\epsilon \Vert \nabla f(x^\ast) \Vert^2. \end{aligned}\]But also, since \(0\lt\epsilon\) and \(0\le\Vert\nabla f(x^\ast)\Vert^2\), it follows that
\[0 \le \Vert \nabla f(x^\ast) \Vert^2 \le 0 \implies \nabla f(x^\ast) = 0.\] \[\diamond\]\((\Leftarrow)\):
Assume \(x^\ast \in \text{int }\mathcal{V}\) and \(\nabla f(x^\ast)=0\).
We must verify the optimality of \(x^\ast\):
\[x^\ast \in \arg\min_{x\in\mathcal{V}} f(x) \iff f(x^\ast)\le f(x), \; \forall x\in\mathcal{V}.\]But from the first-order condition for convexity (Theorem 2.7), we have
\[0\le f(x)-f(x^\ast)-\langle \nabla f(x^\ast),x-x^\ast\rangle=f(x)-f(x^\ast)\]thus \(f(x^\ast)\le f(x)\).
\[\blacksquare\]
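A concrete one-dimensional instance (an illustrative choice, not from the text): take \(\mathcal{V}=[-1,1]\) and \(f(x)=(x-2)^2\). The constrained minimizer is \(x^\ast=1\), where the gradient \(f'(1)=-2\) is nonzero, yet \(\langle \nabla f(x^\ast), y-x^\ast\rangle = -2(y-1)\ge 0\) for every \(y\in[-1,1]\); the gradient only vanishes at the minimizer when it lies in the interior, as for \(x=2\) over a larger set.

```python
import numpy as np

f = lambda x: (x - 2.0) ** 2          # illustrative objective
grad_f = lambda x: 2.0 * (x - 2.0)

x_star = 1.0                          # boundary minimizer of f over V = [-1, 1]
ys = np.linspace(-1.0, 1.0, 201)      # feasible points y
assert np.all(grad_f(x_star) * (ys - x_star) >= 0.0)  # optimality condition holds

assert grad_f(2.0) == 0.0             # interior minimizer (e.g. over [-5, 5]): gradient vanishes
```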
Theorem 2.9. Jensen’s Inequality
Let \(f: \mathbb{R}^d \to (-\infty,+\infty]\) be a measurable convex function and \(x\) be an \(\mathbb{R}^d\)-valued random element on some probability space such that \(\mathbb{E}[x]\) exists and \(x \in \text{dom }f\) with probability \(1\). Then
\[f(\mathbb{E}[x])\le \mathbb{E}[f(x)].\](Remark. To remember the direction, simply test with a discrete probability distribution (Bernoulli) on two points, which once again gives the equivalent definition of convexity:
\[f(\lambda x+(1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y).\]Visually, set \(\lambda=0.5\) and imagine an upward parabola (convex). The midpoint of the line segment joining two points should lie above the function evaluated at the horizontal midpoint.)
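A quick Monte Carlo check of the direction, using the convex \(f(x)=\Vert x\Vert\) and a Gaussian random vector (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.normal(loc=1.0, scale=2.0, size=(100_000, 3))   # x ~ N((1,1,1), 4*I)

lhs = np.linalg.norm(samples.mean(axis=0))     # f(E[x]) ~= ||(1,1,1)|| ~= 1.73
rhs = np.linalg.norm(samples, axis=1).mean()   # E[f(x)], strictly larger here
print(lhs, rhs, lhs <= rhs)                    # Jensen: f(E[x]) <= E[f(x)]
```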