
Autoregression vs Diffusion - Understanding Sampling via Optimal Transport

Autoregression and diffusion look like opposites, but both are solving the same transport problem - how to turn simple noise into structured data.


Introduction

Generative modeling is fundamentally an exercise in statistical estimation.

Generative Modeling

Let \(\mathcal{D} = \{x_1, \dots, x_N\}\) be an empirical dataset drawn i.i.d. from a data distribution \(x \sim P_{\text{data}}(x)\) defined over a high-dimensional space \(\mathcal{X}\). We want to generate new approximate samples from \(P_{\text{data}}(x)\).

The Procedural Strategy

Direct sampling from a complex, high-dimensional data distribution is generally intractable. Instead, we reduce the generative problem to a two-step procedure: noise sampling followed by procedural generation.

Think of world generation in a game like Minecraft: instead of trying to randomly generate a billion individual blocks and hoping they form a coherent landscape, we start from a single random seed—a simple source of randomness—and use it as the starting point for a procedural generation algorithm. In generative AI, we take a similar approach.

The Transport Objective

We introduce a tractable base distribution \(z \sim P_{\text{noise}}(z)\) (like standard Gaussian noise) that is easy to sample from. The goal is then to convert this simple randomness into samples that match the statistics of \(P_{\text{data}}(x)\).

Generative Reduction

Hard Samples from Easy Randomness

Think of a tiny terrain generator. The target distribution lives over coherent terrains in $x$-space, which is hard to sample from directly. The reduction is to sample a tiny easy seed, then let one deterministic generator produce the terrain.

(Diagram: sampling directly in the final \(x\)-space, versus sampling an easy seed and running it through a generator.)

Autoregressive models and diffusion models are often presented as very different generative strategies. At a high level, however, both function as this “procedural generation algorithm,” converting simple randomness into complex samples from the data distribution. The difference is in how they parameterize that conversion: autoregression does it through a sequence of conditional transports, while diffusion does it through time-dependent denoising dynamics. Optimal transport is therefore a useful geometric lens for comparing them, even when the underlying training objectives are not literally the same.

The Choice of Base Distribution

Why are Gaussian or Uniform distributions standard choices for \(P_{\text{noise}}\)? Practitioners favor them for practical reasons: they are easy to sample from, isotropic, stable under perturbation, and compatible with common noising procedures. They also have a clean maximum-entropy characterization once the underlying domain or moment constraints are fixed.

Uniform Maximum Entropy

For a strictly bounded interval \(\lbrack a, b\rbrack\), the maximum entropy distribution is the Uniform distribution \(\mathcal{U}\lbrack a, b\rbrack\).

 

We maximize entropy \(\mathbb{E}\lbrack -\ln p(x)\rbrack\) subject to \(\int_a^b p(x) dx = 1\).

Setting the functional derivative of the Lagrangian to zero gives:

\[ \frac{\delta}{\delta p} \left( -\int_a^b p(x)\ln p(x)dx + \lambda_0 \left(\int_a^b p(x) dx - 1\right) \right) = 0 \]
\[ -\ln p(x) - 1 + \lambda_0 = 0 \implies p(x) = e^{\lambda_0 - 1} \]

Since \(p(x)\) is constant, it must precisely be \(\frac{1}{b-a}\) to integrate to 1.

Gaussian Maximum Entropy

On \(\mathbb{R}\), there is no uniform probability distribution on the whole space, and differential entropy has no maximizer without an additional moment constraint. If we fix the mean \(\mu\) and variance \(\sigma^2\), then the unique maximum-entropy distribution is the Gaussian \(\mathcal{N}(\mu, \sigma^2)\).

This is why Gaussian noise is the canonical maximum-entropy base under a fixed second-moment budget (“Normal Distribution,” 2026).

 

By translation invariance of differential entropy, we may center first and assume \(\mu=0\). We then maximize \(\mathbb{E}\lbrack -\ln p(x)\rbrack\) subject to \(\int p(x) dx = 1\) and \(\int x^2 p(x) dx = \sigma^2\). The Lagrangian functional derivative yields:

\[ -\ln p(x) - 1 + \lambda_0 + \lambda_1 x^2 = 0 \implies p(x) = e^{\lambda_0 - 1 + \lambda_1 x^2} \]

Integrability forces the quadratic coefficient to be negative, which yields the Gaussian form

\[ p(x) \propto \exp\bigl(-c(x-\mu)^2\bigr). \]

Interactive 1D Example

Constrained Entropy Ascent

This uses projected entropy ascent on a discretized density. Each step increases entropy and then reapplies the active constraint, so the curve relaxes toward the actual max-entropy solution for that setting.

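A minimal sketch of that procedure, assuming only the normalization constraint is active (grid size, step size, and iteration count are illustrative choices, not from the post):

```python
# Projected entropy ascent on a discretized density: ascend the
# entropy gradient, then reapply the normalization constraint.
import math
import random

def entropy_ascent(n_bins=50, steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    # Start from an arbitrary positive density on n_bins cells.
    p = [rng.uniform(0.1, 1.0) for _ in range(n_bins)]
    total = sum(p)
    p = [v / total for v in p]
    for _ in range(steps):
        # Functional gradient of the entropy is -(ln p + 1).
        p = [max(v - lr * (math.log(v) + 1.0), 1e-12) for v in p]
        # Projection step: reapply the normalization constraint.
        total = sum(p)
        p = [v / total for v in p]
    return p

p = entropy_ascent()
# With only normalization active, the maximizer is the uniform density.
print(max(p) - min(p))  # close to 0
```

With a variance constraint added to the projection step, the same iteration relaxes toward the Gaussian instead.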

From Noise to Data: The Transport Problem

To formalize this, we turn to Optimal Transport (OT) (Peyré, 2025; Thorpe, 2018).

Optimal transport is the continuous, high-dimensional version of the same mass-moving problem: move probability mass from a source distribution to a target distribution at minimal cost.

Generative Optimal Transport

In the generative setting, we start from noise \(Z \sim P_{\text{noise}}\) and want an output with distribution \(P_{\text{data}}\). A transport map \(T\) should therefore satisfy \(T(Z) \sim P_{\text{data}}\).

Pushforward Shorthand

Later we will abbreviate “if \(Z \sim P\) then \(T(Z) \sim Q\)” by writing \(T_\sharp P = Q\).

If \(T\) is a differentiable bijection, this distribution-matching condition is equivalent to the usual change-of-variables formula:

\[ p_{\text{data}}(x) = p_{\text{noise}}(T^{-1}(x)) \left\vert \det J_{T^{-1}}(x) \right\vert = p_{\text{noise}}(z)\left\vert \det J_T(z) \right\vert ^{-1}, \qquad x=T(z). \]

The Monge Problem

Let \(c: \mathcal{Z} \times \mathcal{X} \to \mathbb{R} \cup \{+\infty\}\) be a fixed ground cost. Monge seeks a deterministic map \(T\) minimizing

\[ \min_T \underset{Z \,\sim\, P_{\text{noise}}}{\mathbb{E}}\lbrack c(Z, T(Z))\rbrack \quad \text{s.t.} \quad T(Z) \sim P_{\text{data}}. \]

Discrete Monge Example

Two Feasible Deterministic Maps $T$

In the discrete Monge problem, each source atom $z_i$ must choose exactly one target atom $x_j$. If the masses are indivisible, a feasible deterministic map $T$ exists only when the source atoms already come in the exact pieces demanded by the targets. Here the route coefficients $c_{ij}$ are shipping rates in $\$/\mathrm{kg}$, so with 1kg boxes the Monge objective is the total dollar cost.

(Diagram: sources $z_1$ and $z_2$ each hold a 1kg box; targets $x_1$ and $x_2$ each need 1kg.)

Discrete Monge Failure

No Feasible Deterministic Map $T$

(Diagram: a single source $z_1$ holds one indivisible 2kg box; targets $x_1$ and $x_2$ each need 1kg, so no deterministic map can serve both.)

The Kantorovich Problem

Monge forces each source point \(z\) to choose a single destination \(T(z)\). Kantorovich relaxes this by optimizing over any random pair \((Z,X)\) with the correct marginals:

\[ \min_{(Z,X)} \mathbb{E}\lbrack c(Z,X)\rbrack \quad \text{s.t.} \quad Z \sim P_{\text{noise}}, \quad X \sim P_{\text{data}}. \]

In probabilistic language, we are free to choose any coupling between noise and data. Monge is the special case \(X = T(Z)\) almost surely.

Discrete Kantorovich Form

In the finite case, if source point \(i\) carries mass \(a_i\), target point \(j\) needs mass \(b_j\), and moving one unit of mass costs \(c_{ij}\), then Kantorovich transport solves

\[ \min_{\pi_{ij} \ge 0} \sum_{i,j} c_{ij}\pi_{ij} \]

subject to

\[ \sum_j \pi_{ij} = a_i, \qquad \sum_i \pi_{ij} = b_j. \]

This is useful here because it allows mass splitting and turns the discrete problem into a linear optimization problem. Monge is recovered as the special case

\[ \pi_{i,T(i)} = a_i, \qquad \pi_{ij} = 0 \text{ for } j \ne T(i). \]

Why the Discrete Problem Is Linear

Kantorovich transport is linear in the plan \(\pi_{ij}\), so the discrete problem is a linear optimization problem. We will return to its dual later, once the basic OT picture is in place.

Discrete Kantorovich Example

Transport Plan $\pi$ with Mass Splitting

Source atoms $z_1$ (2kg) and $z_2$ (1kg) must fill target atoms $x_1$ (1kg) and $x_2$ (2kg). Let $\alpha = \pi_{11}$ be the mass moving from $z_1$ to $x_1$. Conservation of mass determines the entire plan $\pi$: $\pi_{11} = \alpha, \pi_{12} = 2-\alpha, \pi_{21} = 1-\alpha, \pi_{22} = \alpha$. In the Monge problem, $\alpha \in \{0, 1\}$, but Kantorovich allows any $\alpha \in [0, 1]$.

(Diagram: sources $z_1$ (2 kg) and $z_2$ (1 kg); targets $x_1$ (1 kg) and $x_2$ (2 kg).)
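The one-parameter plan family in this example can be checked in a few lines; the cost matrix below is an illustrative choice, not from the example:

```python
# The transport plan from the example above, parameterized by
# alpha = pi_11. The route costs c_ij are illustrative.
c = {(1, 1): 1.0, (1, 2): 3.0, (2, 1): 2.0, (2, 2): 1.0}

def plan(alpha):
    # Mass conservation fixes the whole plan once alpha is chosen.
    return {(1, 1): alpha, (1, 2): 2 - alpha, (2, 1): 1 - alpha, (2, 2): alpha}

def cost(alpha):
    return sum(c[k] * m for k, m in plan(alpha).items())

# The objective is affine in alpha, so the minimum over [0, 1]
# sits at an endpoint -- a vertex of the feasible polytope.
print(cost(0.0), cost(1.0), cost(0.5))  # 8.0 5.0 6.5
```

Here the optimum is at \(\alpha = 1\), an extreme point of the feasible set, which previews why the Kantorovich problem is a linear program.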

The 1D Case: Inverse Transform Sampling

To build intuition, let’s start with a simple case: Inverse Transform Sampling (“Inverse Transform Sampling,” 2025).

Suppose we operate strictly in a 1-dimensional continuous space \(\mathbb{R}\), and the target distribution \(P_{\text{data}}\) is entirely defined by its Cumulative Distribution Function (CDF) \(F(x) = P(X \le x)\).

The 1D Sampling Problem

We can sample \(U \sim \mathcal{U}(0,1)\), but we want \(X=T(U)\) to have CDF \(F\), i.e. \(P(X \le t)=F(t)\) for every \(t\). In other words, we want to turn uniform noise into samples from the target distribution.

The 1D Closed-Form Solution

Set \(X=F^{-1}(U)\), where \(F^{-1}(u)\) is the smallest \(x\) with \(F(x)\ge u\). Then \(P(X \le t)=P(U \le F(t))=F(t)\), so inverse transform sampling is already a transport map from uniform noise to the target distribution.
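A minimal sketch of this map for a target with a closed-form inverse CDF, the Exponential distribution, where \(F^{-1}(u) = -\ln(1-u)/\lambda\) (the rate and sample count are illustrative choices):

```python
# Inverse transform sampling: push uniform noise through the
# inverse CDF of an Exponential(rate) target.
import math
import random

def sample_exponential(rate, n, seed=0):
    rng = random.Random(seed)
    # F^{-1}(u) = -ln(1 - u) / rate for the Exponential distribution.
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]

xs = sample_exponential(rate=2.0, n=100_000)
print(sum(xs) / len(xs))  # close to the true mean 1/rate = 0.5
```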

We now rewrite the same idea in the discrete setting first.

The local swap behind the 1D OT solution is easiest to see in the smallest nontrivial example:

Interactive 1D Example

Distance Matching and Uncrossing

Drag the ordered source points $x_1 < x_2$ and target points $y_1 < y_2$. The same number line drives both comparisons: the top row shows the uncrossed matching (sorting order), while the bottom row shows the crossed matching. In 1D, uncrossing never increases the cost for convex costs like distance.


Discrete 1D OT: Equal-Mass Matching

Suppose source points \(x_1,\dots,x_n\) and target points \(y_1,\dots,y_n\) each carry mass \(\frac{1}{n}\), and each source point must be matched to exactly one target point. Then we choose a permutation \(\sigma\) and solve

\[ \min_{\sigma \in S_n} \frac{1}{n}\sum_{i=1}^n c(x_i, y_{\sigma(i)}). \]

Quadrangle Inequality

Assume \(x_1 < x_2\), \(y_1 < y_2\), and \(c(x,y) = h(x-y)\) with \(h\) convex. Then

\[ c(x_1, y_1) + c(x_2, y_2) \le c(x_1, y_2) + c(x_2, y_1). \]

In words: if two matching lines cross, uncrossing them never increases the cost.

 

Let \(a = x_1-y_2\) and \(d = x_2-y_1\). Since \(x_1 < x_2\) and \(y_1 < y_2\), we have \(d-a = (x_2-x_1) + (y_2-y_1) > 0\), so \(a < d\). With

\[ \lambda = \frac{x_2-x_1}{(x_2-x_1) + (y_2-y_1)} \in (0,1), \]

we can write

\[ x_1-y_1 = \lambda a + (1-\lambda)d, \qquad x_2-y_2 = (1-\lambda)a + \lambda d. \]

Since \(h\) is convex,

\[ h(x_1-y_1) + h(x_2-y_2) \le h(a) + h(d). \]

Substituting back \(a = x_1-y_2\) and \(d = x_2-y_1\) gives

\[ c(x_1, y_1) + c(x_2, y_2) \le c(x_1, y_2) + c(x_2, y_1). \]

Discrete Monotone Matching

Sort both sets of points in non-decreasing order:

\[ x^{(1)} \le x^{(2)} \le \dots \le x^{(n)} \quad \text{and} \quad y^{(1)} \le y^{(2)} \le \dots \le y^{(n)}. \]

Here \(x^{(k)}\) means “the \(k\)-th smallest source point,” and similarly for \(y^{(k)}\). Then an optimal matching is

\[ x^{(1)} \leftrightarrow y^{(1)}, \qquad x^{(2)} \leftrightarrow y^{(2)}, \qquad \dots, \qquad x^{(n)} \leftrightarrow y^{(n)}. \]

So in 1D, optimal transport is just “sort both sides and pair equal ranks.”

 

Start from any permutation \(\sigma\). If \(\sigma\) is not monotone, then some \(i<j\) satisfy \(\sigma(i)>\sigma(j)\), so the pairs \(x^{(i)} \mapsto y^{(\sigma(i))}\) and \(x^{(j)} \mapsto y^{(\sigma(j))}\) cross. By the quadrangle inequality, swapping those two targets does not increase the cost:

\[ c\bigl(x^{(i)}, y^{(\sigma(j))}\bigr) + c\bigl(x^{(j)}, y^{(\sigma(i))}\bigr) \le c\bigl(x^{(i)}, y^{(\sigma(i))}\bigr) + c\bigl(x^{(j)}, y^{(\sigma(j))}\bigr). \]

After the swap, the number of inversions of \(\sigma\) decreases by at least \(1\). Repeating this process finitely many times removes all inversions.

A permutation with no inversions is increasing, and the only increasing permutation of \(\{1,\dots,n\}\) is the identity. So eventually we reach \(\sigma(i)=i\) for every \(i\), which is exactly the sorted matching \(x^{(i)} \leftrightarrow y^{(i)}\).
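The claim is easy to verify by brute force on a small instance (illustrative random points, convex cost \(h(t)=t^2\)):

```python
# Brute-force check of discrete monotone matching: the sorted
# rank-to-rank pairing attains the minimum over all permutations.
import itertools
import random

rng = random.Random(0)
n = 6
xs = sorted(rng.uniform(-1, 1) for _ in range(n))
ys = sorted(rng.uniform(-1, 1) for _ in range(n))

def cost(perm):
    # Convex cost h(t) = t^2 applied to each matched pair.
    return sum((xs[i] - ys[perm[i]]) ** 2 for i in range(n))

best = min(cost(p) for p in itertools.permutations(range(n)))
sorted_cost = cost(tuple(range(n)))
print(abs(sorted_cost - best) < 1e-12)  # True
```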

Equal Quantiles

The continuous version says the same thing, but with percentiles instead of sorted lists.

The \(u\)-quantile of a distribution means: the smallest point whose CDF value is \(u\). Equivalently, it is the pseudoinverse-CDF value \(F^{-1}(u):=\inf\{x:F(x)\ge u\}\).

For example, if a point \(x\) is at the \(70\%\) quantile of the source distribution, then it should be sent to the \(70\%\) quantile of the target distribution.

Continuous 1D OT

Let \(P\) and \(Q\) be 1D source and target laws with CDFs \(F\) and \(G\). For the same class of 1D convex costs \(c(x,y)=h(x-y)\), we seek a map \(T\) that sends the source law to the target law and, among all such maps, has the smallest transport cost:

\[ \min_T \int c(x, T(x))\,dP(x) \]

subject to

\[ T_\sharp P = Q. \]

In words: if \(X \sim P\), then \(T(X)\) should have CDF \(G\).

Quantile Matching

For 1D convex costs, the optimal map factorizes through a uniform base \(u \sim \mathcal{U}(0,1)\):

\[ u = F(x), \qquad T(x) = G^{-1}(u) \]

  1. Pullback: Compute \(u = F(x)\). By the probability integral transform, this extracts pure uniform noise.

  2. Pushforward: Sample \(T(x) = G^{-1}(u)\). This exactly recovers inverse transform sampling.

 

First check that the map has the right output distribution. If \(X \sim P\) and \(U = F(X)\), then \(U \sim \mathcal{U}(0,1)\). Define \(Y = G^{-1}(U) = G^{-1}(F(X))\). Then for any \(t\),

\[ P(Y \le t) = P(G^{-1}(U) \le t) = P(U \le G(t)) = G(t), \]

so \(Y\) has CDF \(G\). Therefore \(T_\sharp P = Q\).

Now check optimality. For each \(m\), take the equally spaced quantile levels \(u_k = \frac{k-\frac12}{m}\) and define

\[ x_k = F^{-1}(u_k), \qquad y_k = G^{-1}(u_k), \qquad k=1,\dots,m. \]

By the discrete monotone matching result, pairing \(x_k\) with \(y_k\) minimizes

\[ \frac1m \sum_{k=1}^m h(x_k - y_{\sigma(k)}) \]

over all permutations \(\sigma\). As \(m \to \infty\), these sums converge to the quantile integral

\[ \int_0^1 h\bigl(F^{-1}(u) - G^{-1}(u)\bigr)\,du, \]

which is exactly

\[ \int h(x - T(x))\,dP(x) \qquad \text{for } T(x)=G^{-1}(F(x)). \]

So equal-quantile matching is optimal in the continuous problem as well.

If \(h\) is strictly convex, such as \(h(t)=t^2\), this optimizer is unique up to sets of measure zero. For \(h(t)=\vert t\vert\), ties can occur.

CDFs vs. Densities

The transport rule is written using CDFs, but models often predict densities or probabilities:

\[ F(x) = \int_{-\infty}^{x} p(t)\,dt, \qquad p(x) = F'(x) \]

and in the discrete case

\[ F(v_k) = \sum_{j \le k} p_j. \]

So learning \(p\) is enough: the CDF is obtained by integrating or summing, and sampling uses the inverse CDF.

Interactive Example

Inverse Transform Sampling

Drag the $u$ handle along the horizontal axis, which represents uniform noise. The solid blue curve is the Inverse CDF $x = F^{-1}(u)$. The shape along the vertical axis shows the target density $p(x)$. Notice how the cumulative mass (the filled orange area) visually grows in proportion to your chosen $u$ input!

(Axes: source noise $u \sim \mathcal{U}(0,1)$ on the horizontal axis, target density $x \sim P_{\text{data}}$ on the vertical axis, connected by the inverse CDF curve $x = F^{-1}(u)$.)

In higher dimensions, we can keep the same quantile-matching idea, but apply it one coordinate at a time.

Autoregression as Sequential Transport

This gives the Knothe-Rosenblatt rearrangement, which is the mathematical backbone of autoregression.

Knothe-Rosenblatt Rearrangement

Fix an order of the coordinates and write

\[ T(x_1,\dots,x_D) = \bigl(T_1(x_1), T_2(x_1,x_2), \dots, T_D(x_1,\dots,x_D)\bigr). \]

Here \(x_{<i}=(x_1,\dots,x_{i-1})\) means “all earlier coordinates.”

Each coordinate is defined by a 1D conditional inverse-CDF step:

\[ T_1(x_1) = F^{-1}_{Q_1}(F_{P_1}(x_1)) \]
\[ T_2(x_2 \vert x_1) = F^{-1}_{Q_2 \vert Q_1}(F_{P_2 \vert P_1}(x_2 \vert x_1)) \]
\[ \dots \]
\[ T_D(x_D \vert x_{<D}) = F^{-1}_{Q_D \vert Q_{<D}}(F_{P_D \vert P_{<D}}(x_D \vert x_{<D})) \]

So after fixing the earlier coordinates, we apply the same 1D quantile-matching rule to the next conditional distribution.

Non-Optimality (Greedy Sequence)

The Knothe-Rosenblatt map guarantees \(T_\sharp P_{\text{noise}} = P_{\text{data}}\), but it is not globally optimal for the standard squared Euclidean cost \(W_2^2 = \mathbb{E}\lbrack \Vert x - y\Vert _2^2\rbrack\) in high dimensions.

By factorizing sequentially, it implicitly solves a greedy transport problem. It is canonical only because it appears as the optimal limit of a heavily skewed quadratic cost that strictly prioritizes earlier coordinates:

\[ c(x, y) = \sum_{i=1}^D \lambda_i (x_i - y_i)^2, \qquad \lambda_1 \gg \lambda_2 \gg \dots \gg \lambda_D \]

At the population level, vanilla autoregression is this triangular transport written in density language.

Chain Rule Factorization

If \(x=(x_1,\dots,x_D)\) and \(x_{<i}=(x_1,\dots,x_{i-1})\), then

\[ p(x)=\prod_{i=1}^D p(x_i \mid x_{<i}). \]

Sampling Rule

Draw \(u_1,\dots,u_D \sim \mathcal{U}(0,1)\) and set \(x_i = F^{-1}_{X_i \mid X_{<i}=x_{<i}}(u_i)\). Each step is just 1D inverse transform sampling conditioned on the past.
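A minimal sketch of this sampling rule for a correlated 2D Gaussian target, where both conditionals are 1D Gaussians (the correlation and sample count are illustrative choices; `statistics.NormalDist` supplies the Gaussian inverse CDF):

```python
# Knothe-Rosenblatt / autoregressive sampling: each coordinate is a
# 1D conditional inverse-CDF step.
import math
import random
from statistics import NormalDist

def kr_sample(rho, n, seed=0):
    rng = random.Random(seed)
    out = []
    cond_std = math.sqrt(1.0 - rho * rho)
    for _ in range(n):
        u1, u2 = rng.random(), rng.random()
        # Step 1: x1 from its marginal N(0, 1).
        x1 = NormalDist(0.0, 1.0).inv_cdf(u1)
        # Step 2: x2 from the conditional N(rho * x1, 1 - rho^2).
        x2 = NormalDist(rho * x1, cond_std).inv_cdf(u2)
        out.append((x1, x2))
    return out

samples = kr_sample(rho=0.8, n=50_000)
mean_prod = sum(a * b for a, b in samples) / len(samples)
print(mean_prod)  # close to the target correlation 0.8
```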

Example: LLMs as Classification

Next-Token Prediction

Take a tiny vocabulary \(\mathcal{V}=\{\text{"apple"}, \text{"banana"}, \text{"cherry"}\}\). At step \(i\), the model outputs probabilities \(p_k=P(x_i=v_k \mid x_{<i})\).

Suppose it predicts \((0.6, 0.3, 0.1)\). The discrete CDF is then \((0.6, 0.9, 1.0)\). If we draw \(u=0.75\), the inverse-CDF rule picks the first token whose cumulative probability reaches \(u\), namely “banana”.

So next-token sampling is conditional inverse transform sampling, and standard cross-entropy training learns the conditional distributions that induce these 1D transports:

\[ \mathcal{L}_{\text{NLL}} = -\sum_{i=1}^D \log P(x_i^{\text{true}} \mid x_{<i}). \]
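The worked example above, written as a few lines of code (standard-library `bisect`; the probabilities are the ones from the example):

```python
# Conditional inverse transform sampling over a three-token vocabulary.
import bisect
import itertools

vocab = ["apple", "banana", "cherry"]
probs = [0.6, 0.3, 0.1]
cdf = list(itertools.accumulate(probs))  # discrete CDF: 0.6, 0.9, 1.0

def sample_token(u):
    # Pick the first token whose cumulative probability reaches u.
    return vocab[bisect.bisect_left(cdf, u)]

print(sample_token(0.75))  # banana
print(sample_token(0.50))  # apple
print(sample_token(0.95))  # cherry
```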

Interactive Discrete Example

Next-Token Inverse CDF Sampling

Drag the slider to sample a uniform random variable \( u \sim \mathcal{U}(0,1) \). The vocabulary space is partitioned according to the predicted probability of each token. The interval that catches \( u \) becomes the generated token!

"apple"
"banana"
"cherry"
0.0
0.6
0.9
1.0

Reparameterized Autoregression

Instead of autoregressing in the original coordinates, we can first change coordinates and then factorize there.

Change of Variables

If \(y=g(x)\) is invertible, then

\[ p_X(x)=p_Y(g(x))\vert \det J_g(x)\vert . \]

So any autoregressive factorization in the \(y\)-coordinates induces a density in the original \(x\)-coordinates.

Coordinate Transformation

Understanding Reparameterization Maps

A map $y = g(x)$ changes the coordinates before we factorize or sample. Some reparameterizations only change ordering, while others expose more meaningful coordinates such as principal axes or coarse-vs-fine structure.

(Diagram: $x_1$ and $x_2$ coordinate lines, shown in the original coordinates $(x_1, x_2)$ and in the transformed coordinates $(y_1, y_2)$.)

Example 1: Changing the Ordering

Permuting Coordinates

If \(y_i=x_{\sigma(i)}\) for a permutation \(\sigma\), then \(J_g\) is a permutation matrix, so \(\vert \det J_g\vert =1\). For example, \((y_1,y_2,y_3)=(x_2,x_1,x_3)\) simply changes generation order. The model is still autoregressive, just in a different ordering.

Example 2: Frequency Space

One concrete way to see frequency-space autoregression is through the 2-point Haar basis, the smallest wavelet example of coarse-to-fine generation (Yu et al., 2026).

Interactive Frequency Reparameterization

2-Pixel Haar Coordinates

For a 2-pixel patch $x=(x_1,x_2)$, the 2-point Haar basis separates a shared brightness term $y_{\text{low}}=\frac{x_1+x_2}{\sqrt{2}}$ from a signed contrast term $y_{\text{high}}=\frac{x_1-x_2}{\sqrt{2}}$. In that basis, autoregression can model coarse intensity first and local edge direction second.

(Widget: a 2-pixel patch $x = (x_1, x_2)$, its Haar coordinates $y_{\text{low}}$ and $y_{\text{high}}$, and the coarse-to-final reconstruction from $y_{\text{low}}$ plus $y_{\text{high}}$.)
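The 2-pixel Haar map and its inverse in code; because the transform is orthonormal, \(\vert\det J_g\vert = 1\) in the change-of-variables formula:

```python
# 2-point Haar reparameterization: shared brightness vs signed contrast.
import math

S = math.sqrt(2.0)

def haar_forward(x1, x2):
    # y_low = (x1 + x2)/sqrt(2), y_high = (x1 - x2)/sqrt(2)
    return ((x1 + x2) / S, (x1 - x2) / S)

def haar_inverse(y_low, y_high):
    # Exact inverse of the orthonormal transform.
    return ((y_low + y_high) / S, (y_low - y_high) / S)

y_low, y_high = haar_forward(0.6, 0.6)
print(round(y_low, 2), round(y_high, 2))  # 0.85 0.0
x1, x2 = haar_inverse(y_low, y_high)
print(round(x1, 6), round(x2, 6))  # 0.6 0.6
```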

The same idea appears in the standard practical visualization below: wavelet decomposition into approximation and detail bands, followed by progressive reconstruction.

Practical Wavelet View

Image-Space Haar Decomposition

This is the standard practical visualization: an image, its 1-level Haar subbands, and the corresponding coarse-to-fine reconstructions. The approximation band stores broad structure, while the detail bands isolate left-right edges, top-bottom edges, and diagonal texture.

(Panels: the original image; the four Haar subbands (approximation, left-right detail, top-bottom detail, diagonal); and reconstructions from the approximation only, with main edges added, and in full.)

Fourier gives the complementary practical view: the same image is represented by global frequencies, visualized as a centered spectrum together with low-pass and high-pass reconstructions.

Practical Fourier View

Image-Space Fourier Decomposition

The Fourier view is complementary to the Haar view. Instead of localized multiscale bands, it represents the same image with global sinusoidal frequencies: the center of the spectrum is low frequency, and the outer region is high frequency.

(Panels: the original image, its centered spectrum, and low-pass-only and high-pass-only reconstructions.)

Relation to Diffusion

For images, diffusion often appears to refine samples from coarse structure toward fine detail. Dieleman interprets DDPM-style denoising as an approximate low-to-high spectral ordering that is valid in expectation across many images, rather than as a hard per-sample rule (“Diffusion Is Spectral Autoregression,” 2024). Falck’s follow-up accepts that approximate DDPM picture, but argues that this spectral hierarchy is not necessary for good diffusion performance: hierarchy-free diffusion can match DDPM and even improve high-frequency generation (Falck, 2025).

Global Optimality: Brenier’s Theorem

If we abandon the greedy sequence and seek the true globally optimal map for the symmetric squared Euclidean cost, we arrive at Brenier’s theorem (Brenier, 1991).

Brenier’s Theorem

If \(P\) is absolutely continuous and \(P,Q\) have finite second moments, then the quadratic-cost problem

\[ W_2^2(P,Q)=\inf_{T_\sharp P = Q}\mathbb{E}_{X\sim P}\Vert X-T(X)\Vert _2^2 \]

has a unique optimal map up to \(P\)-null sets, and it has the form \(T=\nabla\psi\) for a convex potential \(\psi\).

Why a Convex Gradient?

In 1D, optimal transport was monotone: equal quantiles never cross. In higher dimensions, the analogue of that monotonicity is a convex gradient. Indeed, convexity gives

\[ \langle \nabla\psi(x)-\nabla\psi(y),\, x-y \rangle \ge 0, \]

so the displacement field is monotone in the ambient inner product. When \(D=1\), gradients of convex functions are exactly increasing functions, so Brenier reduces to monotone rearrangement.

What the Theorem Says Geometrically

Brenier’s theorem is stronger than existence: the global \(W_2^2\) optimizer is deterministic, unique almost everywhere, and generated by a single scalar potential. So the best quadratic-cost generator is not an arbitrary black box, but a convex-potential gradient.

Dynamic Optimal Transport

Brenier chooses the optimal endpoint map. Dynamic OT asks for the optimal interpolation in time.

Benamou-Brenier Dynamic Formulation

For quadratic cost,

\[ W_2^2(\mu_0,\mu_1) = \inf_{\rho_t,\,v_t} \int_0^1 \! \int \Vert v_t(x)\Vert _2^2\, \rho_t(x)\, dx\, dt \]

subject to

\[ \partial_t \rho_t + \operatorname{div}(\rho_t v_t) = 0, \qquad \rho_0=\mu_0,\ \rho_1=\mu_1. \]

So instead of optimizing one endpoint map directly, we optimize over all density paths and velocity fields that transport \(\mu_0\) to \(\mu_1\) with minimal kinetic action (Benamou & Brenier, 2000).

Eulerian and Lagrangian Views

The formula above is Eulerian: it works with a density field \(\rho_t\) and a velocity field \(v_t\) over space-time. When the Brenier map \(T\) exists, the same optimizer can be written in Lagrangian coordinates by following particles

\[ X_t=(1-t)X_0+t\,T(X_0), \qquad X_0\sim \mu_0, \]

and then setting \(\rho_t=\lbrack X_t\rbrack _\sharp \mu_0\). This curve of measures is the displacement interpolation: each particle moves at constant speed along the segment from its source location to its Brenier destination. So the static map view and the dynamic least-action view describe the same quadratic OT geodesic.

Path-Space / Stochastic Formulation

A different dynamic formulation optimizes not over deterministic trajectories but over laws on paths:

\[ \inf_{P:\,P_0=\mu_0,\ P_1=\mu_1}\operatorname{KL}(P \Vert R), \]

where \(R\) is a reference diffusion or more general reference process on path space. This is the Schrödinger bridge problem. At finite noise it selects the most likely stochastic dynamics compatible with the endpoint marginals. For the standard Brownian reference, the small-noise limit recovers quadratic OT. This is the right language for diffusion-style bridge methods and for later multi-time questions where the reference process is part of the model, not just a numerical regularizer (Léonard, 2013; Chen et al., 2014; De Bortoli et al., 2021; Tong et al., 2024).

Continuous normalizing flows live in the same continuity-equation formalism, but their training objectives do not minimize the Benamou-Brenier action by default. They learn an admissible transport dynamics, not necessarily the Wasserstein geodesic.

Constrained vs. Unconstrained Transport

At the level of exact endpoint transport, we still have two main high-dimensional constructions:

  1. Knothe-Rosenblatt / autoregression: tractable, sequential, and order-dependent.
  2. Brenier / quadratic OT: symmetric and globally optimal for \(W_2^2\), but not available as a simple sequential recipe.

That distinction is the main architectural split.

Autoregression vs. Diffusion

| Feature | Autoregression (Knothe-Rosenblatt) | Flow / Diffusion (Unconstrained) |
| --- | --- | --- |
| Map Form | \(T_i(x_i \vert x_{<i}) = F^{-1}_{Q_i \vert Q_{<i}}(F_{P_i \vert P_{<i}}(x_i))\) | Ideal OT map: \(T(x) = \nabla \psi(x)\); practical parameterization: learn \(v_\theta(x,t)\) |
| Tractability | Exact sequential 1D inversions | Learned via velocity regression |
| Structural Bias | Arbitrary coordinate ordering | None (isotropic) |
| Training Signal | Exact likelihood: \(\sum_i \log p(x_i \vert x_{<i})\) | CFM / score matching |
| Inference | \(D\) sequential steps | ODE integration (or one-step maps) |

Autoregression keeps an exact sequential sampler and exact likelihood factorization, but pays for it with an ordering bias. Unconstrained transport is geometrically better aligned with the symmetric quadratic cost, but it must be learned numerically.

Two 2D pictures help keep the distinction straight. First, we compare two exact continuous maps at the population-law level: same source law, same target law, different map. Then we pass to a finite-sample discretization of those same laws: one frozen source cloud, one frozen target cloud, and two couplings on the same cached point set.

Continuous Population-Level Comparison

Knothe-Rosenblatt vs Brenier in 2D

This panel is the exact continuous problem. The formulas compare two true maps between probability laws, while the plotted particles are only a visual probe of those maps. Start from one sampled source cloud $z_i \sim \mathcal{N}(0, I)$ and compare two exact maps into the same correlated Gaussian law.

The pale blue ellipses are contours of the target distribution. The sequential map is the Knothe-Rosenblatt transform $T_{\mathrm{KR}}(z)=Lz$, while the direct map is the Brenier transform $T_{\mathrm{OT}}(z)=\Sigma^{1/2}z$.

Every source-to-target path is drawn as a faint gray trace; the yellow sample cloud shows how the transport actually moves along those paths.
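For Gaussian source and target, both maps and their expected quadratic costs have closed forms, so the comparison can be made exactly. A sketch with an illustrative covariance, using the identity \(\mathbb{E}\Vert Az - z\Vert_2^2 = \operatorname{tr}(\Sigma) - 2\operatorname{tr}(A) + 2\) for any \(A\) with \(AA^\top = \Sigma\) and \(z \sim \mathcal{N}(0, I)\), plus the standard 2x2 matrix square root formula:

```python
# Compare the KR (Cholesky) and Brenier (symmetric square root)
# maps for Sigma = [[1, 0.8], [0.8, 1]] (illustrative correlation).
import math

a, b, c_ = 1.0, 0.8, 1.0          # Sigma = [[a, b], [b, c_]]

# Knothe-Rosenblatt map: lower-triangular Cholesky factor L, L L^T = Sigma.
l11 = math.sqrt(a)
l21 = b / l11
l22 = math.sqrt(c_ - l21 * l21)
tr_L = l11 + l22

# Brenier map: symmetric square root, via the 2x2 identity
# sqrt(Sigma) = (Sigma + sqrt(det) I) / sqrt(tr + 2 sqrt(det)).
s = math.sqrt(a * c_ - b * b)
t = math.sqrt(a + c_ + 2.0 * s)
tr_sqrt = (a + s) / t + (c_ + s) / t

# E||A z - z||^2 = tr(Sigma) - 2 tr(A) + 2 for any A with A A^T = Sigma.
tr_sigma = a + c_
cost_kr = tr_sigma - 2.0 * tr_L + 2.0
cost_ot = tr_sigma - 2.0 * tr_sqrt + 2.0
print(cost_kr, cost_ot)  # the Brenier cost is the smaller one
```

Both maps push \(\mathcal{N}(0, I)\) to \(\mathcal{N}(0, \Sigma)\), but the symmetric square root maximizes \(\operatorname{tr}(A)\) over such factors, which is why its quadratic cost is lower.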

Discrete Empirical Comparison

Sorting vs Sinkhorn in 2D

This panel is the empirical finite-sample problem. We freeze one source sample and one target sample, drawn from the same underlying source/target laws as the continuous panel above, and compare two couplings on that exact cached 64-point instance.

Here the target is shown through the sampled blue cloud, while every pairing is drawn as a faint gray trace and the yellow transported sample moves over the full coupling.

Unconstrained Transport: Continuous Flows

A standard parameterization of an unconstrained transport is to learn a time-dependent velocity field and integrate it (Lai et al., 2025).

Continuous Normalizing Flow (CNF)

\[ \frac{dx}{dt} = v_\theta(x, t), \qquad x(0) \sim P_{\text{noise}}, \quad x(1) \sim P_{\text{data}} \]

The field induces a flow map \(\phi_t\), and we want \(\phi_1(X)\) to follow \(P_{\text{data}}\) when \(X \sim P_{\text{noise}}\).

This avoids parameterizing \(\nabla\psi\) directly, but it does not by itself specify which dynamics should connect the endpoints. Benamou-Brenier chooses the least-action field; a generic CNF objective only chooses some field whose terminal pushforward is correct.

So the training question becomes: what conditional dynamics should \(v_\theta\) regress toward?

Flow Matching & Rectified Flows

Flow Matching (Lipman et al., 2023) and closely related Rectified Flow methods choose a coupling \(q(x_0,x_1)\) between source and target points, then define a conditional path between each paired endpoint. For the straight-line version,

\[ X_t=(1-t)X_0+tX_1, \]

the target velocity is the constant displacement \(X_1-X_0\).

Conditional Flow Matching (CFM) Objective

\[ \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\,(X_0,X_1)\sim q}\bigl\Vert v_\theta(X_t,\, t) - (X_1 - X_0) \bigr\Vert _2^2 \]

We sample a time \(t\) and an endpoint pair from a chosen coupling \(q\), then regress the model toward the straight-line conditional velocity.
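As a sanity check on this objective, restrict the model to a constant velocity field: the minimizer is then the mean displacement \(\mathbb{E}\lbrack X_1 - X_0\rbrack\). A minimal Monte Carlo sketch with illustrative 1D Gaussians and an independent coupling:

```python
# CFM loss restricted to constant velocity fields v; the minimizer
# over constants is the mean displacement E[X1 - X0].
import random

rng = random.Random(0)
n = 100_000
x0 = [rng.gauss(0.0, 1.0) for _ in range(n)]   # source samples
x1 = [rng.gauss(3.0, 1.0) for _ in range(n)]   # target samples

def cfm_loss(v):
    # E || v - (X1 - X0) ||^2 ; time t drops out for a constant field.
    return sum((v - (b - a)) ** 2 for a, b in zip(x0, x1)) / n

mean_disp = sum(b - a for a, b in zip(x0, x1)) / n   # approx. 3.0
print(cfm_loss(mean_disp) < cfm_loss(0.0))           # True
print(cfm_loss(mean_disp) < cfm_loss(5.0))           # True
```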

Conditional vs. Marginal Optimality

For a fixed pair \((X_0, X_1)\), the straight line is optimal between those two endpoints. But the global geometry is determined by the coupling \(q\) and by the family of conditional paths.

If \(q=P_{\text{noise}}\otimes P_{\text{data}}\), we get the usual independent-pairing target.

If \(q\) is chosen from an OT plan, we get an OT-informed target.

If the conditional paths are Brownian bridges weighted by an entropic OT coupling, we move to Schrödinger-bridge variants.

So flow matching is a framework for turning a chosen coupling-and-path construction into a regression loss; it is not automatically the Benamou-Brenier geodesic (Lipman et al., 2023; Tong et al., 2024; De Bortoli et al., 2021).

Practical OT: Regularized Couplings

Everything above was exact and population-level: closed-form 1D maps, exact triangular rearrangements, and Brenier’s exact quadratic-cost optimizer. In practice, training works with finite batches and regularized couplings instead.

Entropic Optimal Transport

Given source and target distributions \(\mu,\nu\) and a reference coupling \(\pi_{\mathrm{ref}}\), entropic OT solves

\[ \min_{\pi \text{ coupling } \mu,\nu} \int c(x,y)\,d\pi(x,y) + \varepsilon\, \mathrm{KL}(\pi \Vert \pi_{\mathrm{ref}}). \]

The KL term pulls the coupling toward the chosen reference plan \(\pi_{\mathrm{ref}}\) while keeping the marginals fixed. Intuitively, it discourages extremely sharp plans and makes the optimization numerically smoother.

Product Reference = Entropy Regularization up to Constants

If \(\mu,\nu\), and \(\pi\) admit densities, then

\[ \mathrm{KL}(\pi \Vert \mu \otimes \nu)=H(\mu)+H(\nu)-H(\pi). \]

So with product reference \(\mu \otimes \nu\), KL regularization is the same as negative-entropy regularization up to constants, and therefore has the same minimizer. This follows by expanding the logarithm and using the fact that \(\pi\) has marginals \(\mu\) and \(\nu\).
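One way to carry out that expansion explicitly:

\[ \mathrm{KL}(\pi \Vert \mu \otimes \nu) = \int \ln \frac{\pi(x,y)}{\mu(x)\,\nu(y)}\, d\pi(x,y) = -H(\pi) - \int \ln \mu(x)\, d\pi(x,y) - \int \ln \nu(y)\, d\pi(x,y). \]

Because \(\pi\) has marginals \(\mu\) and \(\nu\), the last two integrals reduce to \(\int \ln \mu\, d\mu = -H(\mu)\) and \(\int \ln \nu\, d\nu = -H(\nu)\), giving \(\mathrm{KL}(\pi \Vert \mu \otimes \nu) = H(\mu) + H(\nu) - H(\pi)\).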

Why the Reference Coupling Matters

With the usual product reference \(\mu \otimes \nu\), the KL term mainly favors diffuse couplings. Freulon et al. show that once the reference itself carries correlations, especially in the Gaussian case, the penalty no longer acts like generic smoothing: it favors couplings compatible with that reference structure (Freulon et al., 2026).

On a small fixed discrete instance, entropic OT is easiest to read as a sharp-vs-diffuse tradeoff. As \(\varepsilon\) increases, the plan keeps the same row and column marginals but spreads mass across more nearby source-target pairs.


How Sinkhorn Diffuses a Discrete Plan

Fix a small balanced discrete problem with source masses $a_i$ at points $z_i$, target masses $b_j$ at points $x_j$, and squared-distance costs $c_{ij}=\|z_i-x_j\|^2$. The entropically regularized plan $\pi^\varepsilon = \arg\min_{\pi \in \Pi(a,b)} \sum_{i,j} c_{ij}\pi_{ij} - \varepsilon H(\pi)$ keeps the prescribed marginals but becomes progressively more diffuse as $\varepsilon$ grows.

[Interactive figure: a coupling graph between five source atoms $z_i$ and five target atoms $x_j$, shown alongside the plan matrix $\pi_{ij}^{\varepsilon}$. Edge width and opacity scale with the transported mass between each pair.]
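A minimal Sinkhorn loop reproduces this sharp-vs-diffuse behavior. The atom locations and the alternating scaling updates below are a standard illustrative sketch, not code from any cited paper:

```python
import numpy as np

def sinkhorn(a, b, C, eps, iters=500):
    """Entropic OT plan between discrete marginals a, b for cost matrix C."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                 # scale to match column marginals
        u = a / (K @ v)                   # scale to match row marginals
    return u[:, None] * K * v[None, :]    # plan pi = diag(u) K diag(v)

# Five source and five target atoms on a line, squared-distance cost
z = np.linspace(0.0, 1.0, 5)
x = np.linspace(0.0, 1.0, 5)
C = (z[:, None] - x[None, :]) ** 2
a = b = np.full(5, 0.2)

sharp = sinkhorn(a, b, C, eps=1e-3)    # nearly the identity pairing
diffuse = sinkhorn(a, b, C, eps=1.0)   # same marginals, mass spread off-diagonal
```

Both plans satisfy the same row and column constraints; only the concentration of the coupling changes with $\varepsilon$.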

Uncrossing Paths: Mini-Batch OT

Flow matching still leaves one practical question open: inside a mini-batch, which source sample should be paired with which target sample?

Batch OT

Random pairings can send nearby interpolation points toward very different endpoints, so the conditional targets \(X_1-X_0\) vary abruptly across the batch. A mini-batch OT solve reduces that mismatch by pairing nearby endpoints:

\[ \pi^\ast = \arg\min_{\pi \text{ matching } X_0^B \text{ to } X_1^B} \sum_{i,j} \pi_{ij}\, \bigl\Vert X_0^{(i)} - X_1^{(j)}\bigr\Vert _2^2 \]

This is efficiently approximated with Sinkhorn; with a product reference, it is exactly the entropic OT relaxation above. Geometrically, the batch paths cross less, so nearby interpolation points tend to carry more compatible velocity targets.

On a small cached batch, the contrast is easiest to see directly:

Finite-Batch Pairing

Random Pairings vs Mini-Batch OT

Both panels use the same cached source batch $X_0^B$ and target batch $X_1^B$. Only the pairing changes. Faint gray traces show the full straight-line paths, aqua dots show the fixed endpoints, and the yellow dots with short arrows show the current interpolants $X_t$ and their velocity targets $X_1 - X_0$.

[Interactive figure: two panels over the same cached batches, showing the full straight-line paths, the fixed endpoints, the interpolants $X_t$, and the velocity targets $X_1 - X_0$. Random pairing produces 16 path crossings; mini-batch OT produces 0.]
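A small sketch of the mini-batch pairing step, using SciPy's exact assignment solver in place of Sinkhorn (the batch sizes and distributions below are made up for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
B, D = 64, 2
X0 = rng.standard_normal((B, D))           # source (noise) batch
X1 = rng.standard_normal((B, D)) + 4.0     # target (data) batch, shifted for contrast

# Squared pairwise costs; exact optimal one-to-one matching over permutations
cost = ((X0[:, None, :] - X1[None, :, :]) ** 2).sum(axis=-1)
rows, cols = linear_sum_assignment(cost)

random_cost = cost[np.arange(B), np.arange(B)].sum()   # arbitrary identity pairing
ot_cost = cost[rows, cols].sum()                       # never larger than random_cost
```

Because the identity pairing is itself a permutation, the optimal assignment cost can never exceed it; for large batches, a few Sinkhorn iterations give a cheaper entropic approximation to the same matching.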

Endpoint Matching Is Not Yet a Dynamical Model

Mini-batch OT solves a two-marginal problem: for one time gap, it picks a coupling between the batch at the start and the batch at the end. That is enough to draw straight-line paths across that single gap, but a trajectory model over times \(t_1,\dots,t_n\) needs one joint law, or at least mutually compatible transitions, across all times.

Freulon et al. make this precise in the Gaussian setting. They show that if each pairwise entropic OT problem is regularized by a reference coupling coming from the same continuous-time Gaussian reference process, then the local transitions remain independently solvable but still fit together into one coherent dynamical model. With a product reference, each gap is regularized toward independence instead, so solving the gaps separately does not build in temporal coherence (Freulon et al., 2026).

One-Step Maps

One-step generators try to learn the transport map directly, rather than integrating an ODE or running many denoising steps at inference. Drifting models (Deng et al., 2026) are a concrete example.

Drifting Model

A drifting model parameterizes a one-step generator

\[ x=f_\theta(z), \qquad z \sim P_{\text{noise}}, \]

and studies the training-time pushforward distribution

\[ q_\theta = \lbrack f_\theta\rbrack _\sharp P_{\text{noise}}. \]

Instead of specifying an inference-time transport field, it introduces a drifting field \(V_{p,q}(x)\) that prescribes how a current generated sample should move as the model is updated. The paper imposes the anti-symmetry condition

\[ V_{p,q}(x) = -V_{q,p}(x), \]

so that \(V_{p,p}(x)=0\): matched data and model distributions are equilibrium points.

Drifting Objective and Training Step

For generated samples \(x=f_\theta(z)\sim q_\theta\), positive samples \(y^+\sim P_{\text{data}}\), and negative samples \(y^-\sim q_\theta\), Deng et al. use a kernelized attraction-repulsion field

\[ V_{p,q}(x)=V_p^+(x)-V_q^-(x), \]

where

\[ V_p^+(x)=\frac{1}{Z_p(x)}\mathbb{E}_{y^+\sim p}\!\big\lbrack k(x,y^+)(y^+-x)\big\rbrack , \qquad V_q^-(x)=\frac{1}{Z_q(x)}\mathbb{E}_{y^-\sim q}\!\big\lbrack k(x,y^-)(y^--x)\big\rbrack , \]

with normalization factors \(Z_p(x)=\mathbb{E}_{y^+\sim p}\lbrack k(x,y^+)\rbrack\) and \(Z_q(x)=\mathbb{E}_{y^-\sim q}\lbrack k(x,y^-)\rbrack\). In their implementation, the similarity kernel is

\[ k(x,y)=\exp\!\left(-\frac{\Vert x-y\Vert _2}{\tau}\right), \]

realized with batchwise softmax normalizations. The update target is then the drifted sample

\[ x_{\mathrm{drift}}=\operatorname{stopgrad}\bigl(x+V_{P_{\text{data}},q_\theta}(x)\bigr), \]

and the training loss is \(\mathcal{L}_{\mathrm{drift}}(\theta)=\mathbb{E}\bigl\Vert x-x_{\mathrm{drift}}\bigr\Vert _2^2\). In practice, the expectations are estimated on mini-batches, the generated batch is reused as negative samples, and the same construction is often applied in a learned feature space \(\phi(x)\) rather than raw pixel space.

Unlike flow matching, this objective does not require paired endpoints. It compares generated samples against positive and negative neighborhoods, uses the resulting drift to update the one-step generator during training, and then samples with one evaluation of \(f_\theta\) at test time.
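A minimal NumPy sketch of the kernelized drift, with batchwise normalization standing in for the paper's exact softmax implementation (the function name and batch setup here are ours):

```python
import numpy as np

def drift_field(x, y_pos, y_neg, tau=1.0):
    """V_{p,q}(x) = V_p^+(x) - V_q^-(x) with kernel k(x,y) = exp(-||x-y||_2 / tau)."""
    def kernel_mean(y):
        d = y - x                                        # displacements y - x, shape (B, D)
        k = np.exp(-np.linalg.norm(d, axis=1) / tau)
        return (k[:, None] * d).sum(axis=0) / k.sum()    # E[k(x,y)(y-x)] / Z(x)
    return kernel_mean(y_pos) - kernel_mean(y_neg)

rng = np.random.default_rng(0)
x = np.zeros(2)
pos = rng.standard_normal((32, 2)) + 2.0   # positive samples near the data mode
neg = rng.standard_normal((32, 2))         # generated (negative) samples
v = drift_field(x, pos, neg)               # pulls x toward the positive cluster
```

The construction is anti-symmetric by design, so matched positive and negative batches yield a zero drift, the finite-sample version of \(V_{p,p}(x)=0\).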

Conceptually, this closes the loop:

\[ \text{Map (Brenier, intractable)} \to \text{Flow (tractable, multi-step)} \to \text{Learned one-step map}. \]

Dual Potentials

Before returning to exact-map structure, it is useful to switch from the primal viewpoint of moving mass to the dual viewpoint of certifying transport cost by potentials.

Discrete Kantorovich Dual

In the discrete case, the dual problem is

\[ \max_{f_i,\,g_j} \sum_i a_i f_i + \sum_j b_j g_j \qquad \text{s.t.} \qquad f_i + g_j \le c_{ij}\;\; \text{for all } i,j. \]

Here \(f_i\) is the price assigned to source point \(z_i\) and \(g_j\) is the price assigned to target point \(x_j\). The rule \(f_i+g_j \le c_{ij}\) says their combined price can never exceed the true cost of pairing them.

Why This Is Useful

Any feasible pair \((f,g)\) gives a guaranteed lower bound on every transport plan:

\[ \sum_i a_i f_i + \sum_j b_j g_j = \sum_{i,j}\pi_{ij}(f_i+g_j) \le \sum_{i,j} c_{ij}\pi_{ij}. \]

So if we can find large prices that still satisfy the constraints, we certify that the transport cost cannot be any smaller. Duality says the best such certificate exactly matches the true optimum.

Complementary Slackness

At optimality, transported pairs satisfy \(\pi_{ij}>0 \Rightarrow f_i+g_j=c_{ij}\). In words: mass only travels along pairs whose price constraint is tight. So the dual potentials identify the active geometry of the coupling.

The same tiny transport example makes this lower-bound picture concrete:


Prices as Lower Bounds

Move through the feasible family $f_1=t$, $f_2=-t$, $g_1=1-t$, $g_2=1+t$ and watch the dual lower bound rise until the used constraints become tight. Masses are in $\mathrm{kg}$ and costs are in $\$/\mathrm{kg}$.

$$ \begin{array}{c|cc|c} & x_1 & x_2 & a_i \,(\mathrm{kg}) \\ \hline z_1 & c_{11}=1\,\$/\mathrm{kg} & c_{12}=3\,\$/\mathrm{kg} & 2 \\ z_2 & c_{21}=4\,\$/\mathrm{kg} & c_{22}=1\,\$/\mathrm{kg} & 1 \\ \hline b_j\,(\mathrm{kg}) & 1 & 2 & \end{array} $$
[Interactive figure: sliding the price spread $t$ from $0$ to $1$ raises the dual lower bound from $\$3.00$ to the primal cost $\$5.00$ while every constraint $f_i+g_j \le c_{ij}$ stays feasible. At $t=0$ the used route $z_1 \to x_2$ is still slack, so the bound sits below the primal cost; at $t=1$ all three routes carrying mass in $\pi^*$ are tight and the bound meets the primal.]
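The same certificate logic can be checked numerically. The arrays below encode the worked 2×2 example, and `dual_value` is a hypothetical helper name for evaluating the feasible family:

```python
import numpy as np

# Worked 2x2 example: costs in $/kg, masses in kg
c = np.array([[1.0, 3.0], [4.0, 1.0]])
a = np.array([2.0, 1.0])    # source masses at z1, z2
b = np.array([1.0, 2.0])    # target masses at x1, x2

# Optimal plan: 1 kg z1->x1, 1 kg z1->x2, 1 kg z2->x2
pi_star = np.array([[1.0, 1.0], [0.0, 1.0]])
primal = float((c * pi_star).sum())    # $5.00

def dual_value(t):
    """Dual objective and feasibility along the family f=(t,-t), g=(1-t,1+t)."""
    f = np.array([t, -t])
    g = np.array([1.0 - t, 1.0 + t])
    feasible = bool(np.all(f[:, None] + g[None, :] <= c + 1e-12))
    return float(a @ f + b @ g), feasible
```

Evaluating at $t=0$ gives a feasible bound of $\$3.00$; at $t=1$ the bound reaches the primal value $\$5.00$, certifying optimality of $\pi^*$.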

Continuous Kantorovich Dual

The continuous version is the same idea, but now the prices are functions rather than finite lists of numbers. Among all \(f\) and \(g\) satisfying \(f(z)+g(x)\le c(z,x)\) for every \(z,x\), the best lower bound is

\[ \sup_{f,g} \; \mathbb{E}\lbrack f(Z)\rbrack + \mathbb{E}\lbrack g(X)\rbrack , \qquad Z \sim P_{\text{noise}},\; X \sim P_{\text{data}}. \]

Under standard assumptions, this equals the Kantorovich optimum.

Why This Matters Here

These dual potentials are the continuous analogue of the discrete prices. They reappear in regularized OT, Sinkhorn-style updates, and potential-based viewpoints on transport geometry.

Optional Refinement: Polar Factorization

Even if a generator already matches the correct density, its internal geometry need not be transport-optimal.

Polar Factorization Theorem

Under the same regularity assumptions as Brenier’s theorem, any exact generator \(F\) with \(F_\sharp P_{\text{noise}} = P_{\text{data}}\) can be factorized \(P_{\text{noise}}\)-almost everywhere as

\[ F=\nabla\psi \circ M, \]

where \(\nabla\psi\) is the Brenier map and \(M\) preserves the source law: \(M_\sharp P_{\text{noise}}=P_{\text{noise}}\).

This is the infinite-dimensional analogue of matrix polar decomposition. Every exact generator can be decomposed into a source-law-preserving latent rearrangement, followed by the optimal transport map (Vesseron & Cuturi, 2025).

Same Density, Different Pairing

If \(\widetilde{G}=G\circ M\) and \(M_\sharp P_{\text{noise}}=P_{\text{noise}}\), then

\[ \widetilde{G}_\sharp P_{\text{noise}} = G_\sharp P_{\text{noise}}. \]

So a latent rearrangement does not change the modeled density. It only changes the coupling: which latent point is paired with which output sample.

A finite-sample version makes the factorization concrete: the middle column preserves the same empirical source cloud, the right column preserves the same output cloud, and only the pairing cost changes.

Polar Factorization

Same Output Law, Different Coupling

This is a finite-sample picture of the polar factorization theorem. The middle column applies a latent rearrangement $M$ that preserves the empirical source law, and the right column applies the same Brenier map $T=\nabla\psi$. So the final output cloud stays fixed while only the coupling and its quadratic cost change.

[Interactive figure: three columns showing the source cloud $z$, the rearranged latents $M(z)$, and the shared output cloud $T(\{z_i\})$. The final outputs coincide while the pairing, and hence the quadratic cost, differs.]

Immediate Consequence of Brenier’s Theorem

If \(T=\nabla\psi\) is Brenier’s map from \(P_{\text{noise}}\) to \(P_{\text{data}}\), then for any exact generator \(\widetilde{G}\) with the same pushforward law,

\[ \mathbb{E}\Vert Z-T(Z)\Vert _2^2 \le \mathbb{E}\Vert Z-\widetilde{G}(Z)\Vert _2^2, \qquad Z\sim P_{\text{noise}}. \]

This is the generator-language restatement of Brenier’s theorem: among all maps with pushforward \(P_{\text{data}}\), the Brenier map has the smallest quadratic transport cost.
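The inequality is easy to verify numerically in the Gaussian case, where the Brenier map is available in closed form (the specific covariance root and rotation below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((200_000, 2))

# Brenier map from N(0, I) to N(m, S): T(z) = S^{1/2} z + m, the gradient of a
# convex quadratic potential, since S^{1/2} is symmetric positive definite.
m = np.array([1.0, -2.0])
sqrtS = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric PD square root of S
T = lambda z: z @ sqrtS + m

# A rotation preserves N(0, I), so G = T o R has the same pushforward law
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
G = lambda z: T(z @ R.T)

cost_T = np.mean(np.sum((Z - T(Z)) ** 2, axis=1))   # Brenier pairing cost
cost_G = np.mean(np.sum((Z - G(Z)) ** 2, axis=1))   # strictly larger pairing cost
```

Both maps push \(\mathcal{N}(0,I)\) to the same Gaussian, but composing with the latent rotation strictly increases the quadratic transport cost, exactly the gap that polar factorization isolates.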

This is exactly the lens used by Morel et al.: start from a trained normalizing flow that already matches \(P_{\text{data}}\), then search over Gaussian-preserving latent rearrangements that lower its quadratic transport cost without changing the final density. The target is not a new density model, but a coupling closer to the Monge map (Morel et al., 2023).

Gaussian-Preserving Rearrangements

For standard normal priors, Morel et al. make this concrete by moving to uniform coordinates, applying a volume-preserving shuffle there, and mapping back.

Gaussian-Preserving Rearrangements via Uniform Coordinates

Let \(\Phi:\mathbb{R}^D \to (0,1)^D\) be the coordinatewise standard normal CDF and let \(\phi:(0,1)^D \to (0,1)^D\) be a smooth volume-preserving diffeomorphism with \(\vert \det J_\phi\vert =1\). Then

\[ M=\Phi^{-1}\circ \phi \circ \Phi \]

preserves the standard Gaussian: if \(Z\sim \mathcal{N}(0,I)\), then \(M(Z)\sim \mathcal{N}(0,I)\).

 

If \(U=\Phi(Z)\), then \(U\) is uniform on \((0,1)^D\). Volume preservation implies \(\phi(U)\) is still uniform, and applying \(\Phi^{-1}\) coordinatewise sends it back to a standard Gaussian. Hence \(M(Z)\sim \mathcal{N}(0,I)\).
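A minimal sketch of this construction, assuming SciPy for the normal CDF and using a mod-1 shear as the measure-preserving shuffle \(\phi\). The shear is our illustrative choice: it preserves volume but is not smooth at the wrap, unlike the divergence-free ODE parameterization of Morel et al.:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
Z = rng.standard_normal((100_000, 2))

# M = Phi^{-1} o phi o Phi, with phi a measure-preserving shuffle of the unit square:
# phi(u1, u2) = ((u1 + u2) mod 1, u1) has |det J| = 1 and maps the square onto itself.
U = norm.cdf(Z)                                          # uniform coordinates
U_shuffled = np.stack([(U[:, 0] + U[:, 1]) % 1.0, U[:, 0]], axis=1)
M_Z = norm.ppf(U_shuffled)                               # back to Gaussian coordinates

# M(Z) is still standard normal, but the coupling (Z, M(Z)) has been rearranged
```

Empirically, the marginals and covariance of `M_Z` match a standard Gaussian, even though each individual latent point has moved.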

Why Divergence-Free ODEs and Euler Regularization Appear

Morel et al. parameterize \(\phi\) as the endpoint of an ODE \(\dot X_t=v_t(X_t)\) on the cube, with \(\nabla\cdot v_t=0\) and tangential boundary conditions. Liouville’s formula then keeps \(\det J_{X_t}\) constant, so each time-\(t\) map preserves volume. Euler regularization does not change which endpoints preserve the density; it selects a lower-energy volume-preserving path to the same endpoint (Morel et al., 2023).

Conclusion

Summary

Through the lens of optimal transport, autoregression and diffusion are two computational strategies for the same geometric task: transporting \(P_{\text{noise}}\) to \(P_{\text{data}}\).

Neither paradigm is tied to a modality by mathematical necessity. Text uses autoregression and images use diffusion mostly because of inductive bias and computational convenience. Better transport parameterizations and solvers will likely keep blurring that boundary.


References

  1. Brenier, Y. (1991). Polar Factorization and Monotone Rearrangement of Vector-valued Functions. Communications on Pure and Applied Mathematics, 44(4), 375–417. https://doi.org/10.1002/cpa.3160440402
  2. Benamou, J.-D., & Brenier, Y. (2000). A Computational Fluid Mechanics Solution to the Monge-Kantorovich Mass Transfer Problem. Numerische Mathematik, 84(3), 375–393. https://doi.org/10.1007/s002119900117
  3. Chen, Y., Georgiou, T. T., & Pavon, M. (2014). On the Relation between Optimal Transport and Schrödinger Bridges: A Stochastic Control Viewpoint (arXiv:1412.4430). arXiv. https://doi.org/10.48550/arXiv.1412.4430
  4. De Bortoli, V., Thornton, J., Heng, J., & Doucet, A. (2021). Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling (arXiv:2106.01357). arXiv. https://doi.org/10.48550/arXiv.2106.01357
  5. Deng, M., Li, H., Li, T., Du, Y., & He, K. (2026). Generative Modeling via Drifting (arXiv:2602.04770). arXiv. https://doi.org/10.48550/arXiv.2602.04770
  6. Dieleman, S. (2024). Diffusion Is Spectral Autoregression. https://sander.ai/2024/09/02/spectral-autoregression.html
  7. Falck, F. (2025). Diffusion Is Not Necessarily Spectral Autoregression. https://fabianfalck.com/posts/spectralauto
  8. Freulon, P., Georgakis, N., & Panaretos, V. (2026). Entropic Optimal Transport beyond Product Reference Couplings: The Gaussian Case on Euclidean Space (arXiv:2507.01709). arXiv. https://doi.org/10.48550/arXiv.2507.01709
  9. Guillen, N., & McCann, R. (2010). Five Lectures on Optimal Transportation: Geometry, Regularity and Applications (arXiv:1011.2911). arXiv. https://doi.org/10.48550/arXiv.1011.2911
  10. Inverse Transform Sampling. (2025). Wikipedia. https://en.wikipedia.org/w/index.php?title=Inverse_transform_sampling&oldid=1330078866
  11. Lai, C.-H., Song, Y., Kim, D., Mitsufuji, Y., & Ermon, S. (2025). The Principles of Diffusion Models (arXiv:2510.21890). arXiv. https://doi.org/10.48550/arXiv.2510.21890
  12. Léonard, C. (2013). A Survey of the Schrödinger Problem and Some of Its Connections with Optimal Transport (arXiv:1308.0215). arXiv. https://doi.org/10.48550/arXiv.1308.0215
  13. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling (arXiv:2210.02747). arXiv. https://doi.org/10.48550/arXiv.2210.02747
  14. Morel, G., Drumetz, L., Benaïchouche, S., Courty, N., & Rousseau, F. (2023). Turning Normalizing Flows into Monge Maps with Geodesic Gaussian Preserving Flows (arXiv:2209.10873). arXiv. https://doi.org/10.48550/arXiv.2209.10873
  15. Normal Distribution. (2026). Wikipedia. https://en.wikipedia.org/w/index.php?title=Normal_distribution&oldid=1349034873#Maximum_entropy
  16. Optimal Transport Wiki – OT WIKI. Retrieved April 21, 2026, from https://www.otwiki.xyz/
  17. Pass, B. An Introduction to Optimal Transport. Matching Theory.
  18. Peyré, G. (2025). Optimal Transport for Machine Learners (arXiv:2505.06589). arXiv. https://doi.org/10.48550/arXiv.2505.06589
  19. The Principles of Diffusion Models. Retrieved April 19, 2026, from https://the-principles-of-diffusion-models.github.io/
  20. Quadrangle Inequality Properties. Codeforces. Retrieved April 20, 2026, from https://codeforces.com/blog/entry/86306
  21. Tong, A. Y., Malkin, N., Fatras, K., Atanackovic, L., Zhang, Y., Huguet, G., Wolf, G., & Bengio, Y. (2024). Simulation-Free Schrödinger Bridges via Score and Flow Matching. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, 238, 1279–1287. https://proceedings.mlr.press/v238/tong24a.html
  22. Santambrogio, F. (2014). Introduction to Optimal Transport Theory. In Y. Ollivier, H. Pajot, & C. Villani (Eds.), Optimal Transport (1st ed., pp. 3–21). Cambridge University Press. https://doi.org/10.1017/CBO9781107297296.002
  23. Santambrogio, F. Optimal Transport for Applied Mathematicians – Calculus of Variations, PDEs and Modeling.
  24. Thorpe, M. (2018). Introduction to Optimal Transport.
  25. Vesseron, N., & Cuturi, M. (2025). On a Neural Implementation of Brenier’s Polar Factorization (arXiv:2403.03071). arXiv. https://doi.org/10.48550/arXiv.2403.03071
  26. Villani, C. (2009). Optimal Transport: Old and New (Vol. 338). Springer.
  27. Villani, C. (2016). Topics in Optimal Transportation (Reprinted with corrections, Vol. 58). American Mathematical Society.
  28. Yu, H., Luo, H., Yuan, H., Rong, Y., Huang, J., & Zhao, F. (2026). Frequency Autoregressive Image Generation with Continuous Tokens (arXiv:2503.05305). arXiv. https://doi.org/10.48550/arXiv.2503.05305
This post is licensed under CC BY 4.0 by the author.