Preface
Preliminary information for the blog post series on machine learning and optimization.
This preface to the series on mathematical optimization theory in ML is structured as follows.
- Goal
- Approach
- Prerequisites
- Series Outline
Goal
Optimization is a cornerstone of modern machine learning. When training a large-scale neural network for image recognition, a simple tweak to the optimizer can mean the difference between hours and days of training. But why does Adam sometimes converge where SGD fails? We need to understand the underlying mechanics of optimization algorithms to make informed decisions.
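To make the comparison concrete, here are the standard update rules for SGD and Adam in commonly used notation (the symbols here are only for illustration and will be introduced properly later in the series):

$$
\theta_{t+1} = \theta_t - \eta\, g_t \qquad \text{(SGD)}
$$

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \qquad
\theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} \qquad \text{(Adam)}
$$

Here $g_t$ is the (mini-batch) gradient at step $t$, $\eta$ the learning rate, and $\hat m_t$, $\hat v_t$ the bias-corrected first and second moment estimates. The structural difference is Adam's per-coordinate rescaling by $\sqrt{\hat v_t} + \epsilon$; where such rescalings come from and when they help is a recurring theme of this series.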
Whether we are fitting a model to data, tuning hyperparameters, or learning weights in a neural network, we are solving optimization problems that are often large, noisy, and non-convex. Yet, there tends to be a significant disconnect between the theory and practice of optimization in machine learning. This blog post series aims to explore the mathematical landscape of optimization theory applied to machine learning.
Our goal is not so much to investigate the implementation details of applying and deploying these algorithms, but to analyze the mathematical properties that allow them to converge efficiently and remain amenable to distributed computation. Throughout this series, we will see how many concepts in optimization theory are borrowed and cleverly adapted from physics and mathematics, and how they relate to machine learning. We will explore:
- The intuition behind optimization algorithms and how they emerged through connections to other fields.
- Why certain methods succeed (or fail).
- How theoretical guarantees relate to practical performance.
- When and how to choose (or design) the right optimizer.
Approach
This series takes a top-down approach: problem first, theory second. The presentation is centered on mathematical derivations rather than code or implementation details.
I went through several iterations and decided to cut down on the breadth and depth of the topics covered because the series was getting too long. Ultimately, I settled on the following.
This series is not a textbook. It is intended to be a self-contained introduction; that is, the topics that are covered will be explained mostly from scratch, but many topics and proofs will be omitted for brevity. My goal is to cover as broad a surface as possible by introducing the basic concepts together with their accompanying definitions and theorems.
To properly master these concepts, you need to work through exercises and examples, both theoretical and practical: solving mathematical problems on paper in addition to implementing and experimenting with the algorithms. If you simply read this series without additional effort, you may gain a rough intuition of the connections between different perspectives and concepts, but you cannot expect to be able to apply these ideas to real-world problems.
The order of presentation is intended to be:
- Real-world problem
- Intuition with a concrete example
- Investigating desired properties
- Formalizing the theory
- Leveraging the theory to make meaningful connections
At the end of each post, we will provide a summary of the main ideas, a cheat sheet in table format for quick reference, and a reflection on the post’s contributions.
Prerequisites
This series is designed for readers with:
- A working knowledge of linear algebra and calculus.
- Comfort with mathematical notation and reasoning.
- (Optional) Basic familiarity with machine learning terminology (e.g., regression, classification, neural networks). This is mostly for the sake of context and applications, but it is not strictly required to follow the math.
Many individual points may be harder to appreciate without further background, so I encourage you to study more in-depth material on the side. Crash courses will be included alongside the series, focused mostly on the background necessary to follow the posts. I recommend reading them in this order:
```mermaid
---
config:
  theme: redux
---
flowchart TD
    A(["Start"]) --> LA["Linear Algebra"]
    LA --> CAL["Calculus"]
    CAL --> EFA["Functional Analysis & Matrix Norms"]
    CAL --> NA["Numerical Analysis"]
    EFA --> TC["Tensor Calculus"]
    EFA --> VC["Variational Calculus"]
    TC --> DG["Differential Geometry"]
    TC --> SIT["Statistics and Information Theory"]
    VC --> CA["Convex Analysis"]
    SIT --> IG["Information Geometry"]
    DG --> IG
    CA --> OL["Online Learning"]
```
If the mermaid flowchart does not render, here is the same dependency tree in plain text:
```
Linear Algebra
└─ Multivariable Calculus
   ├─ Functional Analysis & Matrix Norms
   │  ├─ Tensor Calculus
   │  │  ├─ Differential Geometry
   │  │  └─ Statistics and Information Theory
   │  │     └─ Information Geometry (requires both above)
   │  └─ Variational Calculus
   │     └─ Convex Analysis
   │        └─ Online Learning
   └─ Numerical Analysis
```
The series is mostly intended to be self-contained beyond these items.
Series Outline
Below, I have interleaved the crash courses with the series posts to suggest an example reading order; each crash course appears as an indented item at the point where it fits into that order.
- Introduction to basic mathematical optimization
  - Crash courses: Multivariable Calculus, Linear Algebra
- Iterative methods: gradient-free vs. gradient-based optimization
- Desirable properties of optimizers
- Speedrun of common gradient-based ML optimizers
- Problem formalization
  - Crash course: Numerical Analysis
- Gradient descent and gradient flow
  - Crash course: Functional Analysis
- Challenges of high-dimensional non-convex optimization in deep learning
  - Crash courses: Tensor Calculus, Differential Geometry
- Stochastic Gradient Descent and effects of randomness
- Soft inductive biases (regularization)
- Adaptive methods and preconditioning
- Momentum
  - Crash courses: Statistics and Information Theory, Information Geometry
- Adam optimizer, information geometry view: diagonal Fisher information approximation
  - Crash courses: Variational Calculus, Convex Analysis, Online Learning
- Adam optimizer, online learning view: Discounted Follow-The-Regularized-Leader
  - Crash course: Matrix Norms (part of Functional Analysis)
- Metrized deep learning (Iso/IsoAdam, Shampoo, Muon)
- Parameter-free optimization