Preface
Preliminary information for the blog post series on machine learning and optimization.
This preface to the series on mathematical optimization theory in ML is structured as follows.
- Goal
- Approach
- Prerequisites
- Series Outline
Goal
Optimization is a cornerstone of modern machine learning. When training a large-scale neural network for image recognition, a simple tweak to the optimizer can mean the difference between hours and days of training. But why does Adam sometimes converge where SGD fails? We need to understand the underlying mechanics of optimization algorithms to make informed decisions.
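To make the comparison concrete, here are the standard update rules for SGD and Adam in commonly used notation (the symbols here are only for illustration and will be introduced properly later in the series):

$$
\theta_{t+1} = \theta_t - \eta\, g_t \qquad \text{(SGD)}
$$

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \qquad
\theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} \qquad \text{(Adam)}
$$

Here $g_t$ is the (mini-batch) gradient at step $t$, $\eta$ the learning rate, and $\hat m_t$, $\hat v_t$ the bias-corrected first and second moment estimates. The structural difference is Adam's per-coordinate rescaling by $\sqrt{\hat v_t} + \epsilon$; where such rescalings come from and when they help is a recurring theme of this series.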
Whether we are fitting a model to data, tuning hyperparameters, or learning weights in a neural network, we are solving optimization problems that are often large, noisy, and non-convex. Yet, there tends to be a significant disconnect between the theory and practice of optimization in machine learning. This blog post series aims to explore the mathematical landscape of optimization theory applied to machine learning.
Our goal is not so much to investigate the implementation details of applying and deploying these algorithms, but to analyze the mathematical properties that allow them to converge efficiently and remain amenable to distributed computation. Throughout this series, we will see how many concepts in optimization theory are borrowed and cleverly adapted from physics and mathematics, and how they relate to machine learning. We will explore:
- The intuition behind optimization algorithms and how they emerged through connections to other fields.
- Why certain methods succeed (or fail).
- How theoretical guarantees relate to practical performance.
- When and how to choose (or design) the right optimizer.
Approach
This series takes a top-down approach: problem first, theory second. The presentation is centered on mathematical derivations rather than code or implementation details.
I went through several iterations and decided to cut down on the breadth and depth of the topics covered because the series was getting too long. Ultimately, I settled on the following.
This series is not a textbook. It is intended to be a self-contained introduction; that is, the topics that are covered will be explained mostly from scratch, but many topics and proofs will be omitted for brevity. My goal is to cover as broad a surface as possible by introducing the basic concepts together with their accompanying definitions and theorems.
To properly master these concepts, you need to work through exercises and examples, both theoretical and practical: solving mathematical problems on paper in addition to implementing and experimenting with the algorithms. If you simply read this series without additional effort, you may gain a rough intuition of the connections between different perspectives and concepts, but you cannot expect to be able to apply these ideas to real-world problems.
The order of presentation is intended to be:
- Real-world problem
- Intuition with a concrete example
- Investigating desired properties
- Formalizing the theory
- Leveraging the theory to make meaningful connections
At the end of each post, we will provide a summary of the main ideas, a cheat sheet in table format for quick reference, and a reflection on the post’s contributions.
Prerequisites
This series is designed for readers with:
- A working knowledge of linear algebra and calculus.
- Comfort with mathematical notation and reasoning.
- (Optional) Basic familiarity with machine learning terminology (e.g., regression, classification, neural networks). This is mostly for the sake of context and applications, but it is not strictly required to follow the math.
Many individual points may be harder to appreciate without further background, so I encourage you to study more in-depth material on the side. Crash courses will be included alongside the series, focused mostly on the background necessary to follow the posts. I recommend reading them in this order:
```mermaid
---
config:
  theme: redux
---
flowchart TD
    A(["Start"]) --> LA["Linear Algebra"]
    LA --> CAL["Calculus"]
    CAL --> EFA["Functional Analysis & Matrix Norms"]
    CAL --> NA["Numerical Analysis"]
    EFA --> TC["Tensor Calculus"]
    EFA --> VC["Variational Calculus"]
    TC --> DG["Differential Geometry"]
    TC --> SIT["Statistics and Information Theory"]
    VC --> CA["Convex Analysis"]
    SIT --> IG["Information Geometry"]
    DG --> IG
    CA --> OL["Online Learning"]
```
If the mermaid flowchart does not render, here is the same dependency tree in plain text:
```
Linear Algebra
└─ Multivariable Calculus
   ├─ Functional Analysis & Matrix Norms
   │  ├─ Tensor Calculus
   │  │  ├─ Differential Geometry
   │  │  └─ Statistics and Information Theory
   │  │     └─ Information Geometry (requires both above)
   │  └─ Variational Calculus
   │     └─ Convex Analysis
   │        └─ Online Learning
   └─ Numerical Analysis
```
The series is mostly intended to be self-contained beyond these items.
Series Outline
Below, I have interleaved the crash courses with the series posts to suggest an example reading order; each crash course appears as an indented item at the point where it fits into that order.
- Introduction to basic mathematical optimization
  - Crash courses: Multivariable Calculus, Linear Algebra
- Iterative methods: gradient-free vs. gradient-based optimization
- Desirable properties of optimizers
- Speedrun of common gradient-based ML optimizers
- Problem formalization
  - Crash course: Numerical Analysis
- Gradient descent and gradient flow
  - Crash course: Functional Analysis
- Challenges of high-dimensional non-convex optimization in deep learning
  - Crash courses: Tensor Calculus, Differential Geometry
- Stochastic Gradient Descent and effects of randomness
- Soft inductive biases (regularization)
- Adaptive methods and preconditioning
- Momentum
  - Crash courses: Statistics and Information Theory, Information Geometry
- Adam optimizer, information geometry view: diagonal Fisher information approximation
  - Crash courses: Variational Calculus, Convex Analysis, Online Learning
- Adam optimizer, online learning view: Discounted Follow-The-Regularized-Leader
  - Crash course: Matrix Norms (part of Functional Analysis)
- Metrized deep learning (Iso/IsoAdam, Shampoo, Muon)
- Parameter-free optimization