Autoregression vs Diffusion - Understanding Sampling via Optimal Transport
Autoregression and diffusion look like opposites, but both are solving the same transport problem - how to turn simple noise into structured data.
Introduction
Generative modeling is fundamentally an exercise in statistical estimation.
Generative Modeling
Let \(\mathcal{D} = \{x_1, \dots, x_N\}\) be an empirical dataset drawn i.i.d. from a data distribution \(x \sim P_{\text{data}}(x)\) defined over a high-dimensional space \(\mathcal{X}\). We want to generate new approximate samples from \(P_{\text{data}}(x)\).
The Procedural Strategy
Direct sampling from a complex, high-dimensional data distribution is generally intractable. Instead, we reduce the generative problem to a two-step procedure: noise sampling followed by procedural generation.
Think of world generation in a game like Minecraft: instead of trying to randomly generate a billion individual blocks and hoping they form a coherent landscape, we start from a single random seed—a simple source of randomness—and use it as the starting point for a procedural generation algorithm. In generative AI, we take a similar approach.
The Transport Objective
We introduce a tractable base distribution \(z \sim P_{\text{noise}}(z)\) (like standard Gaussian noise) that is easy to sample from. The goal is then to convert this simple randomness into samples that match the statistics of \(P_{\text{data}}(x)\).
Hard Samples from Easy Randomness
Think of a tiny terrain generator. The target distribution lives over coherent terrains in $x$-space, which is hard to sample from directly. The reduction is to sample a tiny easy seed, then let one deterministic generator produce the terrain.
Autoregressive models and diffusion models are often presented as very different generative strategies. At a high level, however, both function as this “procedural generation algorithm,” converting simple randomness into complex samples from the data distribution. The difference is in how they parameterize that conversion: autoregression does it through a sequence of conditional transports, while diffusion does it through a time-dependent denoising dynamics. Optimal transport is therefore a useful geometric lens for comparing them, even when the underlying training objectives are not literally the same.
The Choice of Base Distribution
Why are Gaussian or Uniform distributions standard choices for \(P_{\text{noise}}\)? Practitioners favor them for practical reasons: they are easy to sample from, isotropic, stable under perturbation, and compatible with common noising procedures. They also have a clean maximum-entropy characterization once the underlying domain or moment constraints are fixed.
Uniform Maximum Entropy
For a strictly bounded interval \(\lbrack a, b\rbrack\), the maximum entropy distribution is the Uniform distribution \(\mathcal{U}\lbrack a, b\rbrack\).
We maximize entropy \(\mathbb{E}\lbrack -\ln p(x)\rbrack\) subject to \(\int_a^b p(x) dx = 1\).
Setting the functional derivative of the Lagrangian to zero gives:
\[ -\ln p(x) - 1 + \lambda = 0 \quad\Longrightarrow\quad p(x) = e^{\lambda - 1}. \]
Since \(p(x)\) is constant, it must be exactly \(\frac{1}{b-a}\) to integrate to 1.
Gaussian Maximum Entropy
On \(\mathbb{R}\), there is no uniform probability distribution on the whole space, and differential entropy has no maximizer without an additional moment constraint. If we fix the mean \(\mu\) and variance \(\sigma^2\), then the unique maximum-entropy distribution is the Gaussian \(\mathcal{N}(\mu, \sigma^2)\).
This is why Gaussian noise is the canonical maximum-entropy base under a fixed second-moment budget (“Normal Distribution,” 2026).
By translation invariance of differential entropy, we may center first and assume \(\mu=0\). We then maximize \(\mathbb{E}\lbrack -\ln p(x)\rbrack\) subject to \(\int p(x) dx = 1\) and \(\int x^2 p(x) dx = \sigma^2\). The Lagrangian functional derivative yields:
\[ -\ln p(x) - 1 + \lambda_0 + \lambda_2 x^2 = 0 \quad\Longrightarrow\quad p(x) = \exp(\lambda_0 - 1 + \lambda_2 x^2). \]
Integrability forces the quadratic coefficient \(\lambda_2\) to be negative, which yields the Gaussian form
\[ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{x^2}{2\sigma^2}\right). \]
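These maximum-entropy claims are easy to sanity-check numerically. The sketch below compares closed-form differential entropies of three common families at the same variance budget; the particular families compared are a choice made for illustration, not something the derivation requires.

```python
import math

# Differential entropies at a common variance sigma^2 = 1,
# using the standard closed forms for each family.
sigma2 = 1.0

# Gaussian: 0.5 * ln(2 * pi * e * sigma^2).
h_gaussian = 0.5 * math.log(2 * math.pi * math.e * sigma2)

# Laplace(b) has variance 2*b^2, so b = sqrt(sigma2 / 2); entropy = 1 + ln(2b).
b = math.sqrt(sigma2 / 2)
h_laplace = 1 + math.log(2 * b)

# Uniform on a width-w interval has variance w^2 / 12, so w = sqrt(12 * sigma2);
# entropy = ln(w).
w = math.sqrt(12 * sigma2)
h_uniform = math.log(w)

# The Gaussian attains the largest entropy under the shared variance budget.
assert h_gaussian > h_laplace > h_uniform
```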
Constrained Entropy Ascent
This uses projected entropy ascent on a discretized density. Each step increases entropy and then reapplies the active constraint, so the curve relaxes toward the actual max-entropy solution for that setting.
From Noise to Data: The Transport Problem
To formalize this, we turn to Optimal Transport (OT) (Peyré, 2025; Thorpe, 2018).
Optimal transport is the continuous, high-dimensional version of the classic mass-moving problem: reshape one distribution of mass into another at minimal total cost.
Generative Optimal Transport
In the generative setting, we start from noise \(Z \sim P_{\text{noise}}\) and want an output with distribution \(P_{\text{data}}\). A transport map \(T\) should therefore satisfy \(T(Z) \sim P_{\text{data}}\).
Pushforward Shorthand
Later we will abbreviate “if \(Z \sim P\) then \(T(Z) \sim Q\)” by writing \(T_\sharp P = Q\).
If \(T\) is a differentiable bijection, this distribution-matching condition is equivalent to the usual change-of-variables formula:
\[ p_{\text{data}}(x) = p_{\text{noise}}(T^{-1}(x)) \left\vert \det J_{T^{-1}}(x) \right\vert = p_{\text{noise}}(z)\left\vert \det J_T(z) \right\vert ^{-1}, \qquad x=T(z). \]
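A quick numerical check of this identity, using the hypothetical map \(T(z)=e^z\) that pushes standard Gaussian noise forward to a lognormal target:

```python
import math

# Check the change-of-variables identity for the hypothetical map T(z) = exp(z),
# which pushes standard Gaussian noise forward to a lognormal distribution.
def p_noise(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def T(z):
    return math.exp(z)

def p_data(x):
    # Lognormal density: p_noise(T^{-1}(x)) * |d/dx T^{-1}(x)|, with T^{-1}(x) = ln x.
    return p_noise(math.log(x)) / x

z = 0.7
x = T(z)
# The identity: p_data(x) = p_noise(z) * |T'(z)|^{-1}, and here T'(z) = exp(z).
lhs = p_data(x)
rhs = p_noise(z) / math.exp(z)
assert abs(lhs - rhs) < 1e-12
```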
The Monge Problem
Let \(c: \mathcal{Z} \times \mathcal{X} \to \mathbb{R} \cup \{+\infty\}\) be a fixed ground cost. Monge seeks a deterministic map \(T\) minimizing
\[ \min_T \underset{Z \,\sim\, P_{\text{noise}}}{\mathbb{E}}\lbrack c(Z, T(Z))\rbrack \quad \text{s.t.} \quad T(Z) \sim P_{\text{data}}. \]
Two Feasible Deterministic Maps $T$
In the discrete Monge problem, each source atom $z_i$ must choose exactly one target atom $x_j$. If the masses are indivisible, a feasible deterministic map $T$ exists only when the source atoms already come in the exact pieces demanded by the targets. Here the route coefficients $c_{ij}$ are shipping rates in $\$/\mathrm{kg}$, so with 1kg boxes the Monge objective is the total dollar cost.
No Feasible Deterministic Map $T$
The Kantorovich Problem
Monge forces each source point \(z\) to choose a single destination \(T(z)\). Kantorovich relaxes this by optimizing over any random pair \((Z,X)\) with the correct marginals:
\[ \min_{(Z,X)} \mathbb{E}\lbrack c(Z,X)\rbrack \quad \text{s.t.} \quad Z \sim P_{\text{noise}}, \quad X \sim P_{\text{data}}. \]In probabilistic language, we are free to choose any coupling between noise and data. Monge is the special case \(X = T(Z)\) almost surely.
Discrete Kantorovich Form
In the finite case, if source point \(i\) carries mass \(a_i\), target point \(j\) needs mass \(b_j\), and moving one unit of mass costs \(c_{ij}\), then Kantorovich transport solves
\[ \min_{\pi_{ij} \ge 0} \sum_{i,j} c_{ij}\pi_{ij} \]subject to
\[ \sum_j \pi_{ij} = a_i, \qquad \sum_i \pi_{ij} = b_j. \]This is useful here because it allows mass splitting and turns the discrete problem into a linear optimization problem. Monge is recovered as the special case
\[ \pi_{i,T(i)} = a_i, \qquad \pi_{ij} = 0 \text{ for } j \ne T(i). \]
Why the Discrete Problem Is Linear
Kantorovich transport is linear in the plan \(\pi_{ij}\), so the discrete problem is a linear optimization problem. We will return to its dual later, once the basic OT picture is in place.
Transport Plan $\pi$ with Mass Splitting
Source atoms $z_1$ (2kg) and $z_2$ (1kg) must fill target atoms $x_1$ (1kg) and $x_2$ (2kg). Let $\alpha = \pi_{11}$ be the mass moving from $z_1$ to $x_1$. Conservation of mass determines the entire plan $\pi$: $\pi_{11} = \alpha, \pi_{12} = 2-\alpha, \pi_{21} = 1-\alpha, \pi_{22} = \alpha$. In the Monge problem, $\alpha \in \{0, 1\}$, but Kantorovich allows any $\alpha \in [0, 1]$.
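The same instance can be solved as a small linear program. The sketch below uses `scipy.optimize.linprog`; the per-kg cost values \(c_{ij}\) are a hypothetical choice for illustration.

```python
import numpy as np
from scipy.optimize import linprog

a = np.array([2.0, 1.0])    # source masses (kg)
b = np.array([1.0, 2.0])    # target masses (kg)
c = np.array([[1.0, 2.0],
              [3.0, 1.0]])  # hypothetical cost per kg moved from z_i to x_j

# Flatten pi into (pi_11, pi_12, pi_21, pi_22); rows of A_eq encode the
# row-sum (source) and column-sum (target) marginal constraints.
A_eq = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
], dtype=float)
b_eq = np.concatenate([a, b])

res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
plan = res.x.reshape(2, 2)

# With the alpha-parameterization above, the total cost is 7 - 3*alpha, so the
# optimum sits at alpha = pi_11 = 1: z_1 splits its 2 kg across both targets.
assert abs(res.fun - 4.0) < 1e-6
assert np.allclose(plan, [[1.0, 1.0], [0.0, 1.0]], atol=1e-6)
```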
The 1D Case: Inverse Transform Sampling
To build intuition, let’s start with a simple case: Inverse Transform Sampling (“Inverse Transform Sampling,” 2025).
Suppose we operate strictly in a 1-dimensional continuous space \(\mathbb{R}\), and the target distribution \(P_{\text{data}}\) is entirely defined by its Cumulative Distribution Function (CDF) \(F(x) = P(X \le x)\).
The 1D Sampling Problem
We can sample \(U \sim \mathcal{U}(0,1)\), but we want \(X=T(U)\) to have CDF \(F\), i.e. \(P(X \le t)=F(t)\) for every \(t\). In other words, we want to turn uniform noise into samples from the target distribution.
The 1D Closed-Form Solution
Set \(X=F^{-1}(U)\), where \(F^{-1}(u)\) is the smallest \(x\) with \(F(x)\ge u\). Then \(P(X \le t)=P(U \le F(t))=F(t)\), so inverse transform sampling is already a transport map from uniform noise to the target distribution.
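A minimal sketch of this transport for an exponential target, where the inverse CDF is available in closed form (the rate \(\lambda = 2\) is an arbitrary choice):

```python
import math
import random

# Inverse transform sampling for an exponential target: F(x) = 1 - exp(-lam*x),
# so the inverse CDF is F^{-1}(u) = -ln(1 - u) / lam.
random.seed(0)
lam = 2.0
samples = [-math.log(1 - random.random()) / lam for _ in range(100_000)]

# The exponential mean is 1 / lam = 0.5; the empirical mean should be close.
mean = sum(samples) / len(samples)
assert abs(mean - 1 / lam) < 0.01
```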
We now develop the same idea, starting with the discrete setting.
The local swap behind the 1D OT solution is easiest to see in the smallest nontrivial example:
Distance Matching and Uncrossing
Drag the ordered source points $x_1 < x_2$ and target points $y_1 < y_2$. The same number line drives both comparisons: the top row shows the uncrossed matching (sorting order), while the bottom row shows the crossed matching. In 1D, uncrossing never increases the cost for convex costs like distance.
Discrete 1D OT: Equal-Mass Matching
Suppose source points \(x_1,\dots,x_n\) and target points \(y_1,\dots,y_n\) each carry mass \(\frac{1}{n}\), and each source point must be matched to exactly one target point. Then we choose a permutation \(\sigma\) and solve
\[ \min_{\sigma \in S_n} \frac{1}{n}\sum_{i=1}^n c(x_i, y_{\sigma(i)}). \]
Quadrangle Inequality
Assume \(x_1 < x_2\), \(y_1 < y_2\), and \(c(x,y) = h(x-y)\) with \(h\) convex. Then
\[ c(x_1, y_1) + c(x_2, y_2) \le c(x_1, y_2) + c(x_2, y_1). \]In words: if two matching lines cross, uncrossing them never increases the cost.
Let \(a = x_1-y_2\) and \(d = x_2-y_1\). Since \(x_1 < x_2\) and \(y_1 < y_2\), we have \(d-a = (x_2-x_1) + (y_2-y_1) > 0\), so \(a < d\). With
\[ t = \frac{y_2-y_1}{d-a} \in (0,1), \]
we can write
\[ x_1 - y_1 = (1-t)\,a + t\,d, \qquad x_2 - y_2 = t\,a + (1-t)\,d. \]
Since \(h\) is convex,
\[ h(x_1-y_1) + h(x_2-y_2) \le (1-t)h(a) + t\,h(d) + t\,h(a) + (1-t)h(d) = h(a) + h(d). \]
Substituting back \(a = x_1-y_2\) and \(d = x_2-y_1\) gives
\[ c(x_1,y_1) + c(x_2,y_2) \le c(x_1,y_2) + c(x_2,y_1). \]
Discrete Monotone Matching
Sort both sets of points in non-decreasing order:
\[ x^{(1)} \le x^{(2)} \le \dots \le x^{(n)} \quad \text{and} \quad y^{(1)} \le y^{(2)} \le \dots \le y^{(n)}. \]Here \(x^{(k)}\) means “the \(k\)-th smallest source point,” and similarly for \(y^{(k)}\). Then an optimal matching is
\[ x^{(1)} \leftrightarrow y^{(1)}, \qquad x^{(2)} \leftrightarrow y^{(2)}, \qquad \dots, \qquad x^{(n)} \leftrightarrow y^{(n)}. \]So in 1D, optimal transport is just “sort both sides and pair equal ranks.”
Start from any permutation \(\sigma\). If \(\sigma\) is not monotone, then some \(i<j\) satisfy \(\sigma(i)>\sigma(j)\), so the pairs \(x^{(i)} \mapsto y^{(\sigma(i))}\) and \(x^{(j)} \mapsto y^{(\sigma(j))}\) cross. By the quadrangle inequality, swapping those two targets does not increase the cost:
\[ c\bigl(x^{(i)}, y^{(\sigma(j))}\bigr) + c\bigl(x^{(j)}, y^{(\sigma(i))}\bigr) \le c\bigl(x^{(i)}, y^{(\sigma(i))}\bigr) + c\bigl(x^{(j)}, y^{(\sigma(j))}\bigr). \]
After the swap, the number of inversions of \(\sigma\) decreases by at least \(1\). Repeating this process finitely many times removes all inversions.
A permutation with no inversions is increasing, and the only increasing permutation of \(\{1,\dots,n\}\) is the identity. So eventually we reach \(\sigma(i)=i\) for every \(i\), which is exactly the sorted matching \(x^{(i)} \leftrightarrow y^{(i)}\).
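For small \(n\), the sorted matching can be checked against brute force over all permutations:

```python
import itertools
import random

# Brute-force check that the sorted (monotone) matching minimizes the total
# transport cost for the convex cost h(t) = t^2 on a small random 1D instance.
random.seed(1)
n = 6
xs = sorted(random.uniform(-1, 1) for _ in range(n))
ys = sorted(random.uniform(-1, 1) for _ in range(n))

def cost(perm):
    return sum((xs[i] - ys[perm[i]]) ** 2 for i in range(n))

# Since xs and ys are sorted, the identity permutation is the rank matching.
best = min(cost(p) for p in itertools.permutations(range(n)))
sorted_cost = cost(tuple(range(n)))
assert abs(sorted_cost - best) < 1e-12
```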
Equal Quantiles
The continuous version says the same thing, but with percentiles instead of sorted lists.
The \(u\)-quantile of a distribution means: the smallest point whose CDF value is \(u\). Equivalently, it is the pseudoinverse-CDF value \(F^{-1}(u):=\inf\{x:F(x)\ge u\}\).
For example, if a point \(x\) is at the \(70\%\) quantile of the source distribution, then it should be sent to the \(70\%\) quantile of the target distribution.
Continuous 1D OT
Let \(P\) and \(Q\) be 1D source and target laws with CDFs \(F\) and \(G\). For the same class of 1D convex costs \(c(x,y)=h(x-y)\), we seek a map \(T\) that sends the source law to the target law and, among all such maps, has the smallest transport cost:
\[ \min_T \int c(x, T(x))\,dP(x) \]subject to
\[ T_\sharp P = Q. \]In words: if \(X \sim P\), then \(T(X)\) should have CDF \(G\).
Quantile Matching
For 1D convex costs, the optimal map factorizes through a uniform base \(u \sim \mathcal{U}(0,1)\):
\[ u = F(x), \qquad T(x) = G^{-1}(u) \]
Pullback: Compute \(u = F(x)\). By the probability integral transform, this extracts pure uniform noise.
Pushforward: Sample \(T(x) = G^{-1}(u)\). This exactly recovers inverse transform sampling.
First check that the map has the right output distribution. If \(X \sim P\) and \(U = F(X)\), then \(U \sim \mathcal{U}(0,1)\). Define \(Y = G^{-1}(U) = G^{-1}(F(X))\). Then for any \(t\),
\[ P(Y \le t) = P\bigl(G^{-1}(U) \le t\bigr) = P\bigl(U \le G(t)\bigr) = G(t), \]
so \(Y\) has CDF \(G\). Therefore \(T_\sharp P = Q\).
Now check optimality. For each \(m\), take the equally spaced quantile levels \(u_k = \frac{k-\frac12}{m}\) and define
\[ x_k = F^{-1}(u_k), \qquad y_k = G^{-1}(u_k). \]
By the discrete monotone matching result, pairing \(x_k\) with \(y_k\) minimizes
\[ \frac{1}{m}\sum_{k=1}^m c\bigl(x_k, y_{\sigma(k)}\bigr) \]
over all permutations \(\sigma\). As \(m \to \infty\), these sums converge to the quantile integral
\[ \int_0^1 c\bigl(F^{-1}(u), G^{-1}(u)\bigr)\,du, \]
which is exactly
\[ \int c\bigl(x, T(x)\bigr)\,dP(x), \qquad T = G^{-1}\circ F. \]
So equal-quantile matching is optimal in the continuous problem as well.
If \(h\) is strictly convex, such as \(h(t)=t^2\), this optimizer is unique up to sets of measure zero. For \(h(t)=\vert t\vert\), ties can occur.
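A sketch of the composite map \(T = G^{-1}\circ F\) for a Gaussian source and an Exp(1) target; the particular pair of distributions is an assumption for illustration.

```python
import math
import random
from statistics import NormalDist

# Quantile matching T(x) = G^{-1}(F(x)) from a standard Gaussian source to an
# Exp(1) target: pull each sample back to uniform noise, then push forward.
F = NormalDist().cdf  # source CDF

def G_inv(u):
    return -math.log(1 - u)  # Exp(1) inverse CDF

random.seed(2)
xs = [random.gauss(0, 1) for _ in range(50_000)]
ys = [G_inv(F(x)) for x in xs]

# The transported samples should follow Exp(1), whose mean is 1.
mean = sum(ys) / len(ys)
assert abs(mean - 1.0) < 0.02
```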
CDFs vs. Densities
The transport rule is written using CDFs, but models often predict densities or probabilities:
\[ F(x) = \int_{-\infty}^{x} p(t)\,dt, \qquad p(x) = F'(x) \]and in the discrete case
\[ F(v_k) = \sum_{j \le k} p_j. \]So learning \(p\) is enough: the CDF is obtained by integrating or summing, and sampling uses the inverse CDF.
Inverse Transform Sampling
Drag the $u$ handle along the horizontal axis, which represents uniform noise. The solid blue curve is the Inverse CDF $x = F^{-1}(u)$. The shape along the vertical axis shows the target density $p(x)$. Notice how the cumulative mass (the filled orange area) visually grows in proportion to your chosen $u$ input!
In higher dimensions, we can keep the same quantile-matching idea, but apply it one coordinate at a time.
Autoregression as Sequential Transport
This gives the Knothe-Rosenblatt rearrangement, which is the mathematical backbone of autoregression.
Knothe-Rosenblatt Rearrangement
Fix an order of the coordinates and write
\[ T(x_1,\dots,x_D) = \bigl(T_1(x_1), T_2(x_1,x_2), \dots, T_D(x_1,\dots,x_D)\bigr). \]Here \(x_{<i}=(x_1,\dots,x_{i-1})\) means “all earlier coordinates.”
Each coordinate is defined by a 1D conditional inverse-CDF step:
\[ T_1(x_1) = F^{-1}_{Q_1}(F_{P_1}(x_1)) \]\[ T_2(x_2 \vert x_1) = F^{-1}_{Q_2 \vert Q_1}(F_{P_2 \vert P_1}(x_2 \vert x_1)) \]\[ \dots \]\[ T_D(x_D \vert x_{<D}) = F^{-1}_{Q_D \vert Q_{<D}}(F_{P_D \vert P_{<D}}(x_D \vert x_{<D})) \]So after fixing the earlier coordinates, we apply the same 1D quantile-matching rule to the next conditional distribution.
Non-Optimality (Greedy Sequence)
The Knothe-Rosenblatt map guarantees \(T_\sharp P_{\text{noise}} = P_{\text{data}}\), but it is not globally optimal for the standard squared Euclidean cost \(W_2^2 = \mathbb{E}\lbrack \Vert x - y\Vert _2^2\rbrack\) in high dimensions.
By factorizing sequentially, it implicitly solves a greedy transport problem. It is canonical only because it appears as the optimal limit of a heavily skewed quadratic cost that strictly prioritizes earlier coordinates:
\[ c(x, y) = \sum_{i=1}^D \lambda_i (x_i - y_i)^2, \qquad \lambda_1 \gg \lambda_2 \gg \dots \gg \lambda_D \]
At the population level, vanilla autoregression is this triangular transport written in density language.
Chain Rule Factorization
If \(x=(x_1,\dots,x_D)\) and \(x_{<i}=(x_1,\dots,x_{i-1})\), then
\[ p(x)=\prod_{i=1}^D p(x_i \mid x_{<i}). \]
Sampling Rule
Draw \(u_1,\dots,u_D \sim \mathcal{U}(0,1)\) and set \(x_i = F^{-1}_{X_i \mid X_{<i}=x_{<i}}(u_i)\). Each step is just 1D inverse transform sampling conditioned on the past.
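A minimal sketch of this sampling rule for a correlated bivariate Gaussian, where both conditional inverse CDFs are available in closed form (the correlation value \(\rho = 0.8\) is an arbitrary choice):

```python
import math
import random
from statistics import NormalDist

# Knothe-Rosenblatt / autoregressive sampling of a standard bivariate Gaussian
# with correlation rho: each coordinate is a 1D conditional inverse-CDF step.
random.seed(3)
rho = 0.8
ppf = NormalDist().inv_cdf  # standard normal inverse CDF

xs = []
for _ in range(20_000):
    u1, u2 = random.random(), random.random()
    x1 = ppf(u1)                                      # x1 ~ N(0, 1)
    x2 = rho * x1 + math.sqrt(1 - rho**2) * ppf(u2)   # x2 | x1 ~ N(rho*x1, 1 - rho^2)
    xs.append((x1, x2))

n = len(xs)
m1 = sum(x for x, _ in xs) / n
m2 = sum(y for _, y in xs) / n
# Both marginals are standard normal, so the empirical covariance
# should recover the correlation rho.
corr = sum((x - m1) * (y - m2) for x, y in xs) / n
assert abs(corr - rho) < 0.02
```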
Example: LLMs as Classification
Next-Token Prediction
Take a tiny vocabulary \(\mathcal{V}=\{\text{"apple"}, \text{"banana"}, \text{"cherry"}\}\). At step \(i\), the model outputs probabilities \(p_k=P(x_i=v_k \mid x_{<i})\).
Suppose it predicts \((0.6, 0.3, 0.1)\). The discrete CDF is then \((0.6, 0.9, 1.0)\). If we draw \(u=0.75\), the inverse-CDF rule picks the first token whose cumulative probability exceeds \(u\), namely “banana”.
So next-token sampling is conditional inverse transform sampling, and standard cross-entropy training learns the conditional distributions that induce these 1D transports:
\[ \mathcal{L}_{\text{NLL}} = -\sum_{i=1}^D \log P(x_i^{\text{true}} \mid x_{<i}). \]
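The toy example above can be written out directly, with the vocabulary and probabilities as in the text:

```python
# Conditional inverse-CDF sampling over the toy vocabulary from the example.
vocab = ["apple", "banana", "cherry"]
probs = [0.6, 0.3, 0.1]

def sample_token(u):
    """Pick the first token whose cumulative probability reaches u."""
    cum = 0.0
    for token, p in zip(vocab, probs):
        cum += p
        if u <= cum:
            return token
    return vocab[-1]  # guard against floating-point round-off

# Discrete CDF is (0.6, 0.9, 1.0): u = 0.75 lands in banana's slice.
assert sample_token(0.75) == "banana"
assert sample_token(0.30) == "apple"
```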
Next-Token Inverse CDF Sampling
Drag the slider to sample a uniform random variable \( u \sim \mathcal{U}(0,1) \). The vocabulary space is partitioned according to the predicted probability of each token. The interval that catches \( u \) becomes the generated token!
Reparameterized Autoregression
Instead of autoregressing in the original coordinates, we can first change coordinates and then factorize there.
Change of Variables
If \(y=g(x)\) is invertible, then
\[ p_X(x)=p_Y(g(x))\vert \det J_g(x)\vert . \]So any autoregressive factorization in the \(y\)-coordinates induces a density in the original \(x\)-coordinates.
Understanding Reparameterization Maps
A map $y = g(x)$ changes the coordinates before we factorize or sample. Some reparameterizations only change ordering, while others expose more meaningful coordinates such as principal axes or coarse-vs-fine structure.
Example 1: Changing the Ordering
Permuting Coordinates
If \(y_i=x_{\sigma(i)}\) for a permutation \(\sigma\), then \(J_g\) is a permutation matrix, so \(\vert \det J_g\vert =1\). For example, \((y_1,y_2,y_3)=(x_2,x_1,x_3)\) simply changes generation order. The model is still autoregressive, just in a different ordering.
Example 2: Frequency Space
One concrete way to see frequency-space autoregression is through the 2-point Haar basis, the smallest wavelet example of coarse-to-fine generation (Yu et al., 2026).
2-Pixel Haar Coordinates
For a 2-pixel patch $x=(x_1,x_2)$, the 2-point Haar basis separates a shared brightness term $y_{\text{low}}=\frac{x_1+x_2}{\sqrt{2}}$ from a signed contrast term $y_{\text{high}}=\frac{x_1-x_2}{\sqrt{2}}$. In that basis, autoregression can model coarse intensity first and local edge direction second.
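A small sketch of the 2-point Haar coordinates defined above, checking that the transform is orthonormal and invertible:

```python
import math

# 2-point Haar transform for a 2-pixel patch, as defined above.
def haar(x1, x2):
    s = math.sqrt(2)
    return (x1 + x2) / s, (x1 - x2) / s  # (y_low, y_high)

def inverse_haar(y_low, y_high):
    s = math.sqrt(2)
    return (y_low + y_high) / s, (y_low - y_high) / s

x = (3.0, 1.0)
y = haar(*x)
# Orthonormality: the transform preserves squared length, so |det J| = 1
# and the change-of-variables Jacobian factor disappears.
assert abs((y[0] ** 2 + y[1] ** 2) - (x[0] ** 2 + x[1] ** 2)) < 1e-12
assert all(abs(a - b) < 1e-12 for a, b in zip(inverse_haar(*y), x))
```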
The same idea appears in the standard practical visualization below: wavelet decomposition into approximation and detail bands, followed by progressive reconstruction.
Image-Space Haar Decomposition
This is the standard practical visualization: an image, its 1-level Haar subbands, and the corresponding coarse-to-fine reconstructions. The approximation band stores broad structure, while the detail bands isolate left-right edges, top-bottom edges, and diagonal texture.
Fourier gives the complementary practical view: the same image is represented by global frequencies, visualized as a centered spectrum together with low-pass and high-pass reconstructions.
Image-Space Fourier Decomposition
The Fourier view is complementary to the Haar view. Instead of localized multiscale bands, it represents the same image with global sinusoidal frequencies: the center of the spectrum is low frequency, and the outer region is high frequency.
Relation to Diffusion
For images, diffusion often appears to refine samples from coarse structure toward fine detail. Dieleman interprets DDPM-style denoising as an approximate low-to-high spectral ordering that is valid in expectation across many images, rather than as a hard per-sample rule (“Diffusion Is Spectral Autoregression,” 2024). Falck’s follow-up accepts that approximate DDPM picture, but argues that this spectral hierarchy is not necessary for good diffusion performance: hierarchy-free diffusion can match DDPM and even improve high-frequency generation (Falck, 2025).
Global Optimality: Brenier’s Theorem
If we abandon the greedy sequence and seek the true globally optimal map for the symmetric squared Euclidean cost, we arrive at Brenier’s theorem (Brenier, 1991).
Brenier’s Theorem
If \(P\) is absolutely continuous and \(P,Q\) have finite second moments, then the quadratic-cost problem
\[ W_2^2(P,Q)=\inf_{T_\sharp P = Q}\mathbb{E}_{X\sim P}\Vert X-T(X)\Vert _2^2 \]has a unique optimal map up to \(P\)-null sets, and it has the form \(T=\nabla\psi\) for a convex potential \(\psi\).
Why a Convex Gradient?
In 1D, optimal transport was monotone: equal quantiles never cross. In higher dimensions, the analogue of that monotonicity is a convex gradient. Indeed, convexity gives
\[ \langle \nabla\psi(x)-\nabla\psi(y),\, x-y \rangle \ge 0, \]so the displacement field is monotone in the ambient inner product. When \(D=1\), gradients of convex functions are exactly increasing functions, so Brenier reduces to monotone rearrangement.
What the Theorem Says Geometrically
Brenier’s theorem is stronger than existence: the global \(W_2^2\) optimizer is deterministic, unique almost everywhere, and generated by a single scalar potential. So the best quadratic-cost generator is not an arbitrary black box, but a convex-potential gradient.
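In the Gaussian case the convex-potential structure is explicit. A sketch for \(P=\mathcal{N}(0,I)\) and \(Q=\mathcal{N}(0,\Sigma)\), where the Brenier map is the linear map \(x \mapsto \Sigma^{1/2}x\), the gradient of the convex potential \(\psi(x)=\frac12 x^\top \Sigma^{1/2} x\) (the specific \(\Sigma\) is an arbitrary choice):

```python
import numpy as np

# Brenier map from N(0, I) to N(0, Sigma): T(x) = Sigma^{1/2} x,
# with Sigma^{1/2} the symmetric PSD square root.
rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Symmetric square root via the eigendecomposition.
w, V = np.linalg.eigh(Sigma)
A = V @ np.diag(np.sqrt(w)) @ V.T  # A = Sigma^{1/2}
assert np.allclose(A @ A, Sigma)

# The Jacobian of T is A itself: symmetric with nonnegative eigenvalues,
# the higher-dimensional analogue of a monotone increasing 1D map.
assert np.allclose(A, A.T)
assert np.all(np.linalg.eigvalsh(A) >= 0)

# Pushforward check: T maps N(0, I) samples to samples with covariance Sigma.
Z = rng.standard_normal((200_000, 2))
X = Z @ A.T
assert np.allclose(np.cov(X.T), Sigma, atol=0.05)
```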
Dynamic Optimal Transport
Brenier chooses the optimal endpoint map. Dynamic OT asks for the optimal interpolation in time.
Benamou-Brenier Dynamic Formulation
For quadratic cost,
\[ W_2^2(\mu_0,\mu_1) = \inf_{\rho_t,\,v_t} \int_0^1 \! \int \Vert v_t(x)\Vert _2^2\, \rho_t(x)\, dx\, dt \]subject to
\[ \partial_t \rho_t + \operatorname{div}(\rho_t v_t) = 0, \qquad \rho_0=\mu_0,\ \rho_1=\mu_1. \]So instead of optimizing one endpoint map directly, we optimize over all density paths and velocity fields that transport \(\mu_0\) to \(\mu_1\) with minimal kinetic action (Benamou & Brenier, 2000).
Eulerian and Lagrangian Views
The formula above is Eulerian: it works with a density field \(\rho_t\) and a velocity field \(v_t\) over space-time. When the Brenier map \(T\) exists, the same optimizer can be written in Lagrangian coordinates by following particles
\[ X_t=(1-t)X_0+t\,T(X_0), \qquad X_0\sim \mu_0, \]and then setting \(\rho_t=\lbrack X_t\rbrack _\sharp \mu_0\). This curve of measures is the displacement interpolation: each particle moves at constant speed along the segment from its source location to its Brenier destination. So the static map view and the dynamic least-action view describe the same quadratic OT geodesic.
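A 1D sketch of displacement interpolation under an assumed closed-form Brenier map \(T(x)=2x\) from \(\mathcal{N}(0,1)\) to \(\mathcal{N}(0,4)\): each particle moves at constant speed, so the standard deviation of \(\rho_t\) interpolates linearly in \(t\).

```python
import numpy as np

# Displacement interpolation in 1D: with the Brenier map T(x) = 2x from
# N(0, 1) to N(0, 4), the particle path is X_t = (1-t) X_0 + t T(X_0),
# which stays Gaussian with std(X_t) = 1 + t.
rng = np.random.default_rng(1)
x0 = rng.standard_normal(100_000)

for t in (0.0, 0.25, 0.5, 1.0):
    xt = (1 - t) * x0 + t * (2 * x0)  # = (1 + t) * x0
    assert abs(xt.std() - (1 + t)) < 0.02
```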
Path-Space / Stochastic Formulation
A different dynamic formulation optimizes not over deterministic trajectories but over laws on paths:
\[ \inf_{P:\,P_0=\mu_0,\ P_1=\mu_1}\operatorname{KL}(P \Vert R), \]where \(R\) is a reference diffusion or more general reference process on path space. This is the Schrödinger bridge problem. At finite noise it selects the most likely stochastic dynamics compatible with the endpoint marginals. For the standard Brownian reference, the small-noise limit recovers quadratic OT. This is the right language for diffusion-style bridge methods and for later multi-time questions where the reference process is part of the model, not just a numerical regularizer (Léonard, 2013; Chen et al., 2014; De Bortoli et al., 2021; Tong et al., 2024).
Continuous normalizing flows live in the same continuity-equation formalism, but their training objectives do not minimize the Benamou-Brenier action by default. They learn an admissible transport dynamics, not necessarily the Wasserstein geodesic.
Constrained vs. Unconstrained Transport
At the level of exact endpoint transport, we still have two main high-dimensional constructions:
- Knothe-Rosenblatt / autoregression: tractable, sequential, and order-dependent.
- Brenier / quadratic OT: symmetric and globally optimal for \(W_2^2\), but not available as a simple sequential recipe.
That distinction is the main architectural split.
Autoregression vs. Diffusion
| Feature | Autoregression (Knothe-Rosenblatt) | Flow / Diffusion (Unconstrained) |
| --- | --- | --- |
| Map form | \(T_i(x_i \vert x_{<i}) = F^{-1}_{Q_i \vert Q_{<i}}(F_{P_i \vert P_{<i}}(x_i))\) | Ideal OT map: \(T(x) = \nabla \psi(x)\); practical parameterization: learn \(v_\theta(x,t)\) |
| Tractability | Exact sequential 1D inversions | Learned via velocity regression |
| Structural bias | Arbitrary coordinate ordering | None (isotropic) |
| Training signal | Exact likelihood: \(\sum_i \log p(x_i \vert x_{<i})\) | CFM / score matching |
| Inference | \(D\) sequential steps | ODE integration (or one-step maps) |
Autoregression keeps an exact sequential sampler and exact likelihood factorization, but pays for it with an ordering bias. Unconstrained transport is geometrically better aligned with the symmetric quadratic cost, but it must be learned numerically.
Two 2D pictures help keep the distinction straight. First, we compare two exact continuous maps at the population-law level: same source law, same target law, different map. Then we pass to a finite-sample discretization of those same laws: one frozen source cloud, one frozen target cloud, and two couplings on the same cached point set.
Knothe-Rosenblatt vs Brenier in 2D
This panel is the exact continuous problem. The formulas compare two true maps between probability laws, while the plotted particles are only a visual probe of those maps. Start from one sampled source cloud $z_i \sim \mathcal{N}(0, I)$ and compare two exact maps into the same correlated Gaussian law.
The pale blue ellipses are contours of the target distribution. The sequential map is the Knothe-Rosenblatt transform $T_{\mathrm{KR}}(z)=Lz$, while the direct map is the Brenier transform $T_{\mathrm{OT}}(z)=\Sigma^{1/2}z$.
Every source-to-target path is drawn as a faint gray trace; the yellow sample cloud shows how the transport actually moves along those paths.
Sorting vs Sinkhorn in 2D
This panel is the empirical finite-sample problem. We freeze one source sample and one target sample, drawn from the same underlying source/target laws as the continuous panel above, and compare two couplings on that exact cached 64-point instance.
Here the target is shown through the sampled blue cloud, while every pairing is drawn as a faint gray trace and the yellow transported sample moves over the full coupling.
Unconstrained Transport: Continuous Flows
A standard parameterization of an unconstrained transport is to learn a time-dependent velocity field and integrate it (Lai et al., 2025).
Continuous Normalizing Flow (CNF)
\[ \frac{dx}{dt} = v_\theta(x, t), \qquad x(0) \sim P_{\text{noise}}, \quad x(1) \sim P_{\text{data}} \]The field induces a flow map \(\phi_t\), and we want \(\phi_1(X)\) to follow \(P_{\text{data}}\) when \(X \sim P_{\text{noise}}\).
This avoids parameterizing \(\nabla\psi\) directly, but it does not by itself specify which dynamics should connect the endpoints. Benamou-Brenier chooses the least-action field; a generic CNF objective only chooses some field whose terminal pushforward is correct.
So the training question becomes: what conditional dynamics should \(v_\theta\) regress toward?
Flow Matching & Rectified Flows
Flow Matching (Lipman et al., 2023) and closely related Rectified Flow methods choose a coupling \(q(x_0,x_1)\) between source and target points, then define a conditional path between each paired endpoint. For the straight-line version,
\[ X_t = (1-t)X_0 + t X_1, \]
the target velocity is the constant displacement \(X_1-X_0\).
Conditional Flow Matching (CFM) Objective
\[ \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\,(X_0,X_1)\sim q}\bigl\Vert v_\theta(X_t,\, t) - (X_1 - X_0) \bigr\Vert _2^2 \]We sample a time \(t\) and an endpoint pair from a chosen coupling \(q\), then regress the model toward the straight-line conditional velocity.
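A deliberately tiny sketch of this regression in 1D with an independent coupling: the "model" is a single constant velocity \(v_\theta(x,t)=\theta\), whose CFM optimum is the mean displacement. The distributions, learning rate, and batch size are arbitrary choices for illustration.

```python
import numpy as np

# Minimal conditional flow matching sketch in 1D: source N(0, 1), target
# N(3, 1), independent coupling, and a constant velocity model theta.
# The CFM loss E|theta - (X1 - X0)|^2 is minimized by theta* = E[X1 - X0] = 3.
rng = np.random.default_rng(4)
theta = 0.0
lr = 0.1
for _ in range(500):
    x0 = rng.standard_normal(256)         # noise endpoints
    x1 = 3.0 + rng.standard_normal(256)   # data endpoints (independent pairing)
    target = x1 - x0                      # straight-line conditional velocity
    grad = 2 * np.mean(theta - target)    # d/dtheta of the batch CFM loss
    theta -= lr * grad

# Gradient descent converges to the mean displacement.
assert abs(theta - 3.0) < 0.2
```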
Conditional vs. Marginal Optimality
For a fixed pair \((X_0, X_1)\), the straight line is optimal between those two endpoints. But the global geometry is determined by the coupling \(q\) and by the family of conditional paths.
If \(q=P_{\text{noise}}\otimes P_{\text{data}}\), we get the usual independent-pairing target.
If \(q\) is chosen from an OT plan, we get an OT-informed target.
If the conditional paths are Brownian bridges weighted by an entropic OT coupling, we move to Schrödinger-bridge variants.
So flow matching is a framework for turning a chosen coupling-and-path construction into a regression loss; it is not automatically the Benamou-Brenier geodesic (Lipman et al., 2023; Tong et al., 2024; De Bortoli et al., 2021).
Practical OT: Regularized Couplings
Everything above was exact and population-level: closed-form 1D maps, exact triangular rearrangements, and Brenier’s exact quadratic-cost optimizer. In practice, training works with finite batches and regularized couplings instead.
Entropic Optimal Transport
Given source and target distributions \(\mu,\nu\) and a reference coupling \(\pi_{\mathrm{ref}}\), entropic OT solves
\[ \min_{\pi \text{ coupling } \mu,\nu} \int c(x,y)\,d\pi(x,y) + \varepsilon\, \mathrm{KL}(\pi \Vert \pi_{\mathrm{ref}}). \]The KL term pulls the coupling toward the chosen reference plan \(\pi_{\mathrm{ref}}\) while keeping the marginals fixed. Intuitively, it discourages extremely sharp plans and makes the optimization numerically smoother.
Product Reference = Entropy Regularization up to Constants
If \(\mu,\nu\), and \(\pi\) admit densities, then
\[ \mathrm{KL}(\pi \Vert \mu \otimes \nu)=H(\mu)+H(\nu)-H(\pi). \]So with product reference \(\mu \otimes \nu\), KL regularization is the same as negative-entropy regularization up to constants, and therefore has the same minimizer. This follows by expanding the logarithm and using the fact that \(\pi\) has marginals \(\mu\) and \(\nu\).
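Written out, the expansion is:

```latex
\begin{aligned}
\mathrm{KL}(\pi \Vert \mu \otimes \nu)
  &= \int \ln\frac{\pi(x,y)}{\mu(x)\,\nu(y)}\, d\pi(x,y) \\
  &= \int \ln \pi \, d\pi \;-\; \int \ln \mu(x)\, d\pi(x,y) \;-\; \int \ln \nu(y)\, d\pi(x,y) \\
  &= -H(\pi) \;-\; \int \ln \mu \, d\mu \;-\; \int \ln \nu \, d\nu \\
  &= H(\mu) + H(\nu) - H(\pi),
\end{aligned}
```

where the third line uses that the marginals of \(\pi\) are \(\mu\) and \(\nu\).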
Why the Reference Coupling Matters
With the usual product reference \(\mu \otimes \nu\), the KL term mainly favors diffuse couplings. Freulon et al. show that once the reference itself carries correlations, especially in the Gaussian case, the penalty no longer acts like generic smoothing: it favors couplings compatible with that reference structure (Freulon et al., 2026).
On a small fixed discrete instance, entropic OT is easiest to read as a sharp-vs-diffuse tradeoff. As \(\varepsilon\) increases, the plan keeps the same row and column marginals but spreads mass across more nearby source-target pairs.
How Sinkhorn Diffuses a Discrete Plan
Fix a small balanced discrete problem with source masses $a_i$ at points $z_i$, target masses $b_j$ at points $x_j$, and squared-distance costs $c_{ij}=\|z_i-x_j\|^2$. The entropically regularized plan $\pi^\varepsilon = \arg\min_{\pi \in \Pi(a,b)} \sum_{i,j} c_{ij}\pi_{ij} - \varepsilon H(\pi)$ stays marginally feasible but becomes progressively more diffuse as $\varepsilon$ grows.
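A minimal Sinkhorn sketch on the earlier 3 kg instance with normalized masses; the per-kg cost values are a hypothetical choice for illustration.

```python
import numpy as np

# Minimal Sinkhorn iteration for entropic OT on a small discrete instance.
a = np.array([2.0, 1.0]) / 3.0  # normalized source masses
b = np.array([1.0, 2.0]) / 3.0  # normalized target masses
c = np.array([[1.0, 2.0],
              [3.0, 1.0]])      # hypothetical costs
eps = 0.5

# The regularized plan has the form pi = diag(u) K diag(v) with K = exp(-c/eps);
# we alternately rescale u and v to enforce the two marginal constraints.
K = np.exp(-c / eps)
u = np.ones(2)
v = np.ones(2)
for _ in range(500):
    u = a / (K @ v)      # enforce row (source) marginals
    v = b / (K.T @ u)    # enforce column (target) marginals

pi = u[:, None] * K * v[None, :]
# After convergence the plan is feasible: its marginals match a and b,
# and the entropic term spreads mass across every source-target pair.
assert np.allclose(pi.sum(axis=1), a, atol=1e-9)
assert np.allclose(pi.sum(axis=0), b, atol=1e-9)
assert np.all(pi > 0)
```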
Uncrossing Paths: Mini-Batch OT
Flow matching still leaves one practical question open: inside a mini-batch, which source sample should be paired with which target sample?
Batch OT
Random pairings can send nearby interpolation points toward very different endpoints, so the conditional targets \(X_1-X_0\) vary abruptly across the batch. A mini-batch OT solve reduces that mismatch by pairing nearby endpoints:
\[ \pi^\ast = \arg\min_{\pi \text{ matching } X_0^B \text{ to } X_1^B} \sum_{i,j} \pi_{ij}\, \bigl\Vert X_0^{(i)} - X_1^{(j)}\bigr\Vert _2^2 \]
This is efficiently approximated with Sinkhorn. With product reference, it is exactly the entropic OT relaxation above. Geometrically, the batch paths cross less, so nearby interpolation points tend to carry more compatible velocity targets.
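A sketch of the batch pairing step, using `scipy.optimize.linear_sum_assignment` as an exact solver for the equal-mass batch matching; the batch size and shifted-Gaussian toy data are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Mini-batch OT pairing: solve a small assignment problem between the noise
# batch X0 and the data batch X1 under squared Euclidean cost, and compare
# against a random (independent) pairing.
rng = np.random.default_rng(6)
B = 32
x0 = rng.standard_normal((B, 2))        # noise batch
x1 = rng.standard_normal((B, 2)) + 3.0  # data batch (shifted toy cluster)

cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
rows, cols = linear_sum_assignment(cost)  # exact matching for equal masses
ot_cost = cost[rows, cols].sum()

perm = rng.permutation(B)                 # random pairing
random_cost = cost[np.arange(B), perm].sum()

# The OT pairing never costs more than a random one, so the straight-line
# paths it induces cross less and carry more compatible velocity targets.
assert ot_cost <= random_cost + 1e-9
```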
On a small cached batch, the contrast is easiest to see directly:
Random Pairings vs Mini-Batch OT
Both panels use the same cached source batch $X_0^B$ and target batch $X_1^B$. Only the pairing changes. Faint gray traces show the full straight-line paths, aqua dots show the fixed endpoints, and the yellow dots with short arrows show the current interpolants $X_t$ and their velocity targets $X_1 - X_0$.
Endpoint Matching Is Not Yet a Dynamical Model
Mini-batch OT solves a two-marginal problem: for one time gap, it picks a coupling between the batch at the start and the batch at the end. That is enough to draw straight-line paths across that single gap, but a trajectory model over times \(t_1,\dots,t_n\) needs one joint law, or at least mutually compatible transitions, across all times.
Freulon et al. make this precise in the Gaussian setting. They show that if each pairwise entropic OT problem is regularized by a reference coupling coming from the same continuous-time Gaussian reference process, then the local transitions remain independently solvable but still fit together into one coherent dynamical model. With a product reference, each gap is regularized toward independence instead, so solving the gaps separately does not build in temporal coherence (Freulon et al., 2026).
One-Step Maps
One-step generators try to learn the transport map directly, rather than integrating an ODE or running many denoising steps at inference. Drifting Models (Deng et al., 2026) are a concrete example.
Drifting Model
A drifting model parameterizes a one-step generator
\[ x=f_\theta(z), \qquad z \sim P_{\text{noise}}, \]and studies the training-time pushforward distribution
\[ q_\theta = \lbrack f_\theta\rbrack _\sharp P_{\text{noise}}. \]Instead of specifying an inference-time transport field, it introduces a drifting field \(V_{p,q}(x)\) that prescribes how a current generated sample should move as the model is updated. The paper imposes the anti-symmetry condition
\[ V_{p,q}(x) = -V_{q,p}(x), \]so that \(V_{p,p}(x)=0\): matched data and model distributions are equilibrium points.
Drifting Objective and Training Step
For generated samples \(x=f_\theta(z)\sim q_\theta\), positive samples \(y^+\sim P_{\text{data}}\), and negative samples \(y^-\sim q_\theta\), Deng et al. use a kernelized attraction-repulsion field
\[ V_{p,q}(x)=V_p^+(x)-V_q^-(x), \]where
\[ V_p^+(x)=\frac{1}{Z_p(x)}\mathbb{E}_{y^+\sim p}\!\big\lbrack k(x,y^+)(y^+-x)\big\rbrack , \qquad V_q^-(x)=\frac{1}{Z_q(x)}\mathbb{E}_{y^-\sim q}\!\big\lbrack k(x,y^-)(y^--x)\big\rbrack , \]with normalization factors \(Z_p(x)=\mathbb{E}_{y^+\sim p}\lbrack k(x,y^+)\rbrack\) and \(Z_q(x)=\mathbb{E}_{y^-\sim q}\lbrack k(x,y^-)\rbrack\). In their implementation, the similarity kernel is
\[ k(x,y)=\exp\!\left(-\frac{\Vert x-y\Vert _2}{\tau}\right), \]realized with batchwise softmax normalizations. The update target is then the drifted sample
\[ x_{\mathrm{drift}}=\operatorname{stopgrad}\bigl(x+V_{P_{\text{data}},q_\theta}(x)\bigr), \]and the training loss is \(\mathcal{L}_{\mathrm{drift}}(\theta)=\mathbb{E}\bigl\Vert x-x_{\mathrm{drift}}\bigr\Vert _2^2\). In practice, the expectations are estimated on mini-batches, the generated batch is reused as negative samples, and the same construction is often applied in a learned feature space \(\phi(x)\) rather than raw pixel space.
Unlike flow matching, this objective does not require paired endpoints. It compares generated samples against positive and negative neighborhoods, uses the resulting drift to update the one-step generator during training, and then samples with one evaluation of \(f_\theta\) at test time.
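A mini-batch sketch makes the drifting field concrete. The function below follows the attraction-repulsion formulas above with a plain softmax normalization of the kernel weights; the names, batch sizes, and toy distributions are illustrative, not the paper's implementation:

```python
import numpy as np

def drift_field(x, y_pos, y_neg, tau=1.0):
    """Kernelized drift V_{p,q}(x) = V_p^+(x) - V_q^-(x) on a mini-batch.

    Each term is a kernel-weighted average of displacements toward a batch,
    with weights softmax(-||x - y|| / tau) normalized over that batch.
    """
    def kernel_mean(x, y):
        d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
        w = np.exp(-d / tau)
        w = w / w.sum(axis=1, keepdims=True)   # batchwise normalization Z(x)
        return (w[:, :, None] * (y[None, :, :] - x[:, None, :])).sum(axis=1)

    return kernel_mean(x, y_pos) - kernel_mean(x, y_neg)

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 2))             # generated samples x = f_theta(z)
y_pos = rng.normal(size=(128, 2)) + 2.0   # positives from p = P_data
y_neg = rng.normal(size=(128, 2))         # negatives from q_theta
                                          # (the paper reuses the generated batch)

v = drift_field(x, y_pos, y_neg)
x_drift = x + v   # stop-gradient regression target for ||x - x_drift||^2
```

Because each term depends only on its own batch, swapping the roles of positives and negatives flips the sign of the field, so the anti-symmetry condition \(V_{p,q}=-V_{q,p}\) holds by construction, and \(V_{p,p}=0\) at equilibrium.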
Conceptually, this closes the loop: the transport problem is absorbed entirely into training, and inference collapses to a single forward pass of \(f_\theta\).
Dual Potentials
Before returning to exact-map structure, it is useful to switch from the primal viewpoint of moving mass to the dual viewpoint of certifying transport cost by potentials.
Discrete Kantorovich Dual
In the discrete case, the dual problem is
\[ \max_{f_i,\,g_j} \sum_i a_i f_i + \sum_j b_j g_j \qquad \text{s.t.} \qquad f_i + g_j \le c_{ij}\;\; \text{for all } i,j. \]Here \(f_i\) is the price assigned to source point \(x_i\) and \(g_j\) is the price assigned to target point \(y_j\). The rule \(f_i+g_j \le c_{ij}\) says their combined price can never exceed the true cost of pairing them.
Why This Is Useful
Any feasible pair \((f,g)\) gives a guaranteed lower bound on every transport plan:
\[ \sum_i a_i f_i + \sum_j b_j g_j = \sum_{i,j}\pi_{ij}(f_i+g_j) \le \sum_{i,j} c_{ij}\pi_{ij}. \]So if we can find large prices that still satisfy the constraints, we certify that the transport cost cannot be any smaller. Duality says the best such certificate exactly matches the true optimum.
Complementary Slackness
At optimality, transported pairs satisfy \(\pi_{ij}>0 \Rightarrow f_i+g_j=c_{ij}\). In words: mass only travels along pairs whose price constraint is tight. So the dual potentials identify the active geometry of the coupling.
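A two-point numeric check makes weak duality and slackness tangible. The instance and the prices below are made up for illustration:

```python
import numpy as np
from itertools import product

# Tiny discrete OT instance: 2 sources, 2 targets, uniform masses.
a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
c = np.array([[1.0, 2.0],
              [3.0, 1.0]])

# With uniform masses the extreme plans are the two permutations,
# so the primal optimum can be found by enumeration.
plans = [np.diag(a), np.array([[0.0, 0.5], [0.5, 0.0]])]
primal = min((c * p).sum() for p in plans)   # identity matching wins: 1.0

# A feasible dual pair: f_i + g_j <= c_ij for all i, j.
f = np.array([1.0, 1.0])
g = np.array([0.0, 0.0])
assert all(f[i] + g[j] <= c[i, j] for i, j in product(range(2), range(2)))
dual = a @ f + b @ g                         # lower bound on every plan

# Complementary slackness at the optimum: mass flows only where tight.
pi = np.diag(a)                              # the optimal plan
tight = np.isclose(f[:, None] + g[None, :], c)
assert np.all((pi > 0) <= tight)             # pi_ij > 0 implies f_i + g_j = c_ij
```

Here the feasible prices already achieve `dual == primal`, so this particular pair is an optimal certificate, and the tight constraints are exactly the two pairs that carry mass.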
The same tiny transport example makes this lower-bound picture concrete:
Prices as Lower Bounds
Move through the feasible family $f_1=t$, $f_2=-t$, $g_1=1-t$, $g_2=1+t$ and watch the dual lower bound rise until the used constraints become tight. Masses are in $\mathrm{kg}$ and costs are in $\$/\mathrm{kg}$.
Continuous Kantorovich Dual
The continuous version is the same idea, but now the prices are functions rather than finite lists of numbers. Among all \(f\) and \(g\) satisfying \(f(z)+g(x)\le c(z,x)\) for every \(z,x\), the best lower bound is
\[ \sup_{f,g} \; \mathbb{E}\lbrack f(Z)\rbrack + \mathbb{E}\lbrack g(X)\rbrack , \qquad Z \sim P_{\text{noise}},\; X \sim P_{\text{data}}. \]Under standard assumptions, this equals the Kantorovich optimum.
Why This Matters Here
These dual potentials are the continuous analogue of the discrete prices. They reappear in regularized OT, Sinkhorn-style updates, and potential-based viewpoints on transport geometry.
Optional Refinement: Polar Factorization
Even if a generator already matches the correct density, its internal geometry need not be transport-optimal.
Polar Factorization Theorem
Under the same regularity assumptions as Brenier’s theorem, any exact generator \(F\) with \(F_\sharp P_{\text{noise}} = P_{\text{data}}\) can be factorized \(P_{\text{noise}}\)-almost everywhere as
\[ F=\nabla\psi \circ M, \]where \(\nabla\psi\) is the Brenier map and \(M\) preserves the source law: \(M_\sharp P_{\text{noise}}=P_{\text{noise}}\).
This is the infinite-dimensional analogue of matrix polar decomposition. Every exact generator can be decomposed into a source-law-preserving latent rearrangement, followed by the optimal transport map (Vesseron & Cuturi, 2025).
Same Density, Different Pairing
If \(\widetilde{G}=G\circ M\) and \(M_\sharp P_{\text{noise}}=P_{\text{noise}}\), then
\[ \widetilde{G}_\sharp P_{\text{noise}} = G_\sharp P_{\text{noise}}. \]So a latent rearrangement does not change the modeled density. It only changes the coupling: which latent point is paired with which output sample.
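A quick numerical illustration, with an affine generator standing in for \(G\) and a rotation as the Gaussian-preserving \(M\) (the specific maps are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(20000, 2))

# An exact affine generator: gradient of a convex quadratic, hence
# this G is the Brenier map to its own pushforward.
G = lambda z: z * np.array([2.0, 0.5]) + 1.0

# M: any rotation preserves the standard Gaussian law.
th = np.pi / 3
R = np.array([[np.cos(th), -np.sin(th)],
              [np.sin(th),  np.cos(th)]])
M = lambda z: z @ R

x, x_tilde = G(z), G(M(z))

# Same output law: empirical moments agree up to sampling noise...
assert np.allclose(x.mean(0), x_tilde.mean(0), atol=0.1)
assert np.allclose(np.cov(x.T), np.cov(x_tilde.T), atol=0.1)

# ...but a different coupling with a strictly larger quadratic cost,
# as Brenier's theorem predicts for the rearranged generator.
cost = ((z - x) ** 2).sum(1).mean()
cost_tilde = ((z - x_tilde) ** 2).sum(1).mean()
```

The two output clouds are statistically indistinguishable, yet `cost_tilde > cost`: the rearrangement changed only which latent point lands where, at the price of extra transport cost.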
A finite-sample version makes the factorization concrete: the middle column preserves the same empirical source cloud, the right column preserves the same output cloud, and only the pairing cost changes.
Same Output Law, Different Coupling
This is a finite-sample picture of the polar factorization theorem. The middle column applies a latent rearrangement $M$ that preserves the empirical source law, and the right column applies the same Brenier map $T=\nabla\psi$. So the final output cloud stays fixed while only the coupling and its quadratic cost change.
Immediate Consequence of Brenier’s Theorem
If \(T=\nabla\psi\) is Brenier’s map from \(P_{\text{noise}}\) to \(P_{\text{data}}\), then for any exact generator \(\widetilde{G}\) with the same pushforward law,
\[ \mathbb{E}\Vert Z-T(Z)\Vert _2^2 \le \mathbb{E}\Vert Z-\widetilde{G}(Z)\Vert _2^2, \qquad Z\sim P_{\text{noise}}. \]This is the generator-language restatement of Brenier’s theorem: among all maps with pushforward \(P_{\text{data}}\), the Brenier map has the smallest quadratic transport cost.
This is exactly the lens used by Morel et al.: start from a trained normalizing flow that already matches \(P_{\text{data}}\), then search over Gaussian-preserving latent rearrangements that lower its quadratic transport cost without changing the final density. The target is not a new density model, but a coupling closer to the Monge map (Morel et al., 2023).
Gaussian-Preserving Rearrangements
For standard normal priors, Morel et al. make this concrete by moving to uniform coordinates, applying a volume-preserving shuffle there, and mapping back.
Gaussian-Preserving Rearrangements via Uniform Coordinates
Let \(\Phi:\mathbb{R}^D \to (0,1)^D\) be the coordinatewise standard normal CDF and let \(\phi:(0,1)^D \to (0,1)^D\) be a smooth volume-preserving diffeomorphism with \(\vert \det J_\phi\vert =1\). Then
\[ M=\Phi^{-1}\circ \phi \circ \Phi \]preserves the standard Gaussian: if \(Z\sim \mathcal{N}(0,I)\), then \(M(Z)\sim \mathcal{N}(0,I)\).
If \(U=\Phi(Z)\), then \(U\) is uniform on \((0,1)^D\). Volume preservation implies \(\phi(U)\) is still uniform, and applying \(\Phi^{-1}\) coordinatewise sends it back to a standard Gaussian. Hence \(M(Z)\sim \mathcal{N}(0,I)\).
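This construction is easy to check numerically. The sketch below uses a coordinatewise shift modulo 1 as the volume-preserving \(\phi\), an illustrative stand-in for the ODE-based maps of Morel et al.:

```python
import numpy as np
from scipy.stats import norm

def M(z, shift=0.3):
    """M = Phi^{-1} o phi o Phi with a volume-preserving phi on the cube.

    phi is a coordinatewise translation modulo 1, which preserves Lebesgue
    measure on (0,1)^D, so M preserves the standard Gaussian.
    """
    u = norm.cdf(z)              # to uniform coordinates on (0,1)^D
    u = np.mod(u + shift, 1.0)   # measure-preserving shuffle of the cube
    return norm.ppf(u)           # back to standard Gaussian coordinates

rng = np.random.default_rng(0)
z = rng.normal(size=(50000, 2))
mz = M(z)

# M(Z) is again standard normal: check first and second moments.
assert np.allclose(mz.mean(0), 0.0, atol=0.05)
assert np.allclose(np.cov(mz.T), np.eye(2), atol=0.05)
```

Per sample, `M` scrambles the latent space substantially, yet the empirical law of `M(Z)` matches the prior: exactly the kind of rearrangement that changes a generator's coupling without changing its density.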
Why Divergence-Free ODEs and Euler Regularization Appear
Morel et al. parameterize \(\phi\) as the endpoint of an ODE \(\dot X_t=v_t(X_t)\) on the cube, with \(\nabla\cdot v_t=0\) and tangential boundary conditions. Liouville’s formula then keeps \(\det J_{X_t}\) constant, so each time-\(t\) map preserves volume. Euler regularization does not change which endpoints preserve the density; it selects a lower-energy volume-preserving path to the same endpoint (Morel et al., 2023).
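In 2D, such fields are easy to parameterize through a stream function: \(v=(\partial_y\psi,\,-\partial_x\psi)\) is divergence-free by construction, and \(\psi=0\) on the boundary makes \(v\) tangential there. The specific \(\psi\) below is an illustrative choice, not the paper's:

```python
import numpy as np

# Stream function vanishing on the boundary of the unit square.
# v = (d(psi)/dy, -d(psi)/dx) is divergence-free with tangential boundary.
def v(x, y):
    vx = np.pi * np.sin(np.pi * x) * np.cos(np.pi * y)    # d(psi)/dy
    vy = -np.pi * np.cos(np.pi * x) * np.sin(np.pi * y)   # -d(psi)/dx
    return vx, vy

# Numerical check: div v = d(vx)/dx + d(vy)/dy ~ 0 via central differences.
h = 1e-5
x, y = np.random.default_rng(0).uniform(0.1, 0.9, size=(2, 1000))
div = (v(x + h, y)[0] - v(x - h, y)[0]) / (2 * h) \
    + (v(x, y + h)[1] - v(x, y - h)[1]) / (2 * h)
assert np.abs(div).max() < 1e-5

# Tangential boundary: no flow through the walls x = 0 and x = 1.
assert np.allclose(v(0.0, y)[0], 0.0) and np.allclose(v(1.0, y)[0], 0.0)
```

Integrating such a field with any reasonable ODE solver yields a volume-preserving map of the cube at every time, which is exactly the ingredient \(\phi\) needs.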
Conclusion
Summary
Through the lens of optimal transport, autoregression and diffusion are two computational strategies for the same geometric task: transporting \(P_{\text{noise}}\) to \(P_{\text{data}}\).
Neither paradigm is tied to a modality by mathematical necessity. Text uses autoregression and images use diffusion mostly because of inductive bias and computational convenience. Better transport parameterizations and solvers will likely keep blurring that boundary.
References
- Brenier, Y. (1991). Polar Factorization and Monotone Rearrangement of Vector-valued Functions. Communications on Pure and Applied Mathematics, 44(4), 375–417. https://doi.org/10.1002/cpa.3160440402
- Benamou, J.-D., & Brenier, Y. (2000). A Computational Fluid Mechanics Solution to the Monge-Kantorovich Mass Transfer Problem. Numerische Mathematik, 84(3), 375–393. https://doi.org/10.1007/s002119900117
- Chen, Y., Georgiou, T. T., & Pavon, M. (2014). On the Relation between Optimal Transport and Schrödinger Bridges: A Stochastic Control Viewpoint (arXiv:1412.4430). arXiv. https://doi.org/10.48550/arXiv.1412.4430
- De Bortoli, V., Thornton, J., Heng, J., & Doucet, A. (2021). Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling (arXiv:2106.01357). arXiv. https://doi.org/10.48550/arXiv.2106.01357
- Deng, M., Li, H., Li, T., Du, Y., & He, K. (2026). Generative Modeling via Drifting (arXiv:2602.04770). arXiv. https://doi.org/10.48550/arXiv.2602.04770
- Dieleman, S. (2024). Diffusion Is Spectral Autoregression. https://sander.ai/2024/09/02/spectral-autoregression.html
- Falck, F. (2025). Diffusion Is Not Necessarily Spectral Autoregression. https://fabianfalck.com/posts/spectralauto
- Freulon, P., Georgakis, N., & Panaretos, V. (2026). Entropic Optimal Transport beyond Product Reference Couplings: The Gaussian Case on Euclidean Space (arXiv:2507.01709). arXiv. https://doi.org/10.48550/arXiv.2507.01709
- Guillen, N., & McCann, R. (2010). Five Lectures on Optimal Transportation: Geometry, Regularity and Applications (arXiv:1011.2911). arXiv. https://doi.org/10.48550/arXiv.1011.2911
- Inverse Transform Sampling. (2025). Wikipedia. https://en.wikipedia.org/w/index.php?title=Inverse_transform_sampling&oldid=1330078866
- Lai, C.-H., Song, Y., Kim, D., Mitsufuji, Y., & Ermon, S. (2025). The Principles of Diffusion Models (arXiv:2510.21890). arXiv. https://doi.org/10.48550/arXiv.2510.21890
- Léonard, C. (2013). A Survey of the Schrödinger Problem and Some of Its Connections with Optimal Transport (arXiv:1308.0215). arXiv. https://doi.org/10.48550/arXiv.1308.0215
- Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling (arXiv:2210.02747). arXiv. https://doi.org/10.48550/arXiv.2210.02747
- Morel, G., Drumetz, L., Benaïchouche, S., Courty, N., & Rousseau, F. (2023). Turning Normalizing Flows into Monge Maps with Geodesic Gaussian Preserving Flows (arXiv:2209.10873). arXiv. https://doi.org/10.48550/arXiv.2209.10873
- Normal Distribution. (2026). Wikipedia. https://en.wikipedia.org/w/index.php?title=Normal_distribution&oldid=1349034873#Maximum_entropy
- Optimal Transport Wiki – OT WIKI. Retrieved April 21, 2026, from https://www.otwiki.xyz/
- Pass, B. An Introduction to Optimal Transport: Matching Theory. University of Alberta.
- Peyré, G. (2025). Optimal Transport for Machine Learners (arXiv:2505.06589). arXiv. https://doi.org/10.48550/arXiv.2505.06589
- The Principles of Diffusion Models. Retrieved April 19, 2026, from https://the-principles-of-diffusion-models.github.io/
- Quadrangle Inequality Properties. Codeforces. Retrieved April 20, 2026, from https://codeforces.com/blog/entry/86306
- Santambrogio, F. (2014). Introduction to Optimal Transport Theory. In Y. Ollivier, H. Pajot, & C. Villani (Eds.), Optimal Transport (1st ed., pp. 3–21). Cambridge University Press. https://doi.org/10.1017/CBO9781107297296.002
- Santambrogio, F. Optimal Transport for Applied Mathematicians – Calculus of Variations, PDEs and Modeling.
- Thorpe, M. (2018). Introduction to Optimal Transport.
- Tong, A. Y., Malkin, N., Fatras, K., Atanackovic, L., Zhang, Y., Huguet, G., Wolf, G., & Bengio, Y. (2024). Simulation-Free Schrödinger Bridges via Score and Flow Matching. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, 238, 1279–1287. https://proceedings.mlr.press/v238/tong24a.html
- Vesseron, N., & Cuturi, M. (2025). On a Neural Implementation of Brenier's Polar Factorization (arXiv:2403.03071). arXiv. https://doi.org/10.48550/arXiv.2403.03071
- Villani, C. (2009). Optimal Transport: Old and New (Issue 338). Springer.
- Villani, C. (2016). Topics in Optimal Transportation (Reprinted with corrections, Issue 58). American Mathematical Society.
- Yu, H., Luo, H., Yuan, H., Rong, Y., Huang, J., & Zhao, F. (2026). Frequency Autoregressive Image Generation with Continuous Tokens (arXiv:2503.05305). arXiv. https://doi.org/10.48550/arXiv.2503.05305