Autoregression vs Diffusion - Understanding Sampling via Optimal Transport
Autoregression and diffusion look like opposites, but both are solving the same transport problem - how to turn simple noise into structured data.
Introduction
Generative modeling is fundamentally an exercise in statistical estimation.
Generative Modeling
Let \(\mathcal{D} = \{x_1, \dots, x_N\}\) be an empirical dataset drawn i.i.d. from a data distribution \(x \sim P_{\text{data}}(x)\) defined over a high-dimensional space \(\mathcal{X}\). We want to generate new approximate samples from \(P_{\text{data}}(x)\).
The Procedural Strategy
Direct sampling from a complex, high-dimensional data distribution is generally intractable. Instead, we reduce the generative problem to a two-step procedure: noise sampling followed by procedural generation.
Think of world generation in a game like Minecraft: instead of trying to randomly generate a billion individual blocks and hoping they form a coherent landscape, we start from a single random seed—a simple source of randomness—and use it as the starting point for a procedural generation algorithm. In generative AI, we take a similar approach.
The Transport Objective
We introduce a tractable base distribution \(z \sim P_{\text{noise}}(z)\) (like standard Gaussian noise) that is easy to sample from. The goal is then to convert this simple randomness into samples that match the statistics of \(P_{\text{data}}(x)\).
Hard Samples from Easy Randomness
Think of a tiny terrain generator. The target distribution lives over coherent terrains in $x$-space, which is hard to sample from directly. The reduction is to sample a tiny easy seed, then feed it through one generator that produces the terrain.
Autoregressive models and diffusion models are often presented as very different generative strategies. At a high level, however, both function as this “procedural generation algorithm,” converting simple randomness into complex samples from the data distribution. The difference is in how they parameterize that conversion: autoregression does it through a sequence of conditional transports, while diffusion does it through a time-dependent denoising dynamics. Optimal transport is therefore a useful geometric lens for comparing them, even when the underlying training objectives are not literally the same.
The route is: define the transport problem, solve it exactly in 1D, show how autoregression reuses that 1D solution conditionally, then compare it with Brenier maps, Wasserstein geodesics, and the learned flow or one-step approximations used in practice.
The Choice of Base Distribution
Why are Gaussian or Uniform distributions standard choices for \(P_{\text{noise}}\)? Practitioners favor them for practical reasons: they are easy to sample from, isotropic, stable under perturbation, and compatible with common noising procedures. They also have a clean maximum-entropy characterization once the underlying domain or moment constraints are fixed.
Uniform Maximum Entropy
For a strictly bounded interval \(\lbrack a, b\rbrack\), the maximum entropy distribution is the Uniform distribution \(\mathcal{U}\lbrack a, b\rbrack\).
We maximize entropy \(\mathbb{E}\lbrack -\ln p(x)\rbrack\) subject to \(\int_a^b p(x) dx = 1\).
Setting the functional derivative of the Lagrangian to zero gives:
Since \(p(x)\) is constant, it must precisely be \(\frac{1}{b-a}\) to integrate to 1.
Gaussian Maximum Entropy
On \(\mathbb{R}\), there is no uniform probability distribution on the whole space, and differential entropy has no maximizer without an additional moment constraint. If we fix the mean \(\mu\) and variance \(\sigma^2\), then the unique maximum-entropy distribution is the Gaussian \(\mathcal{N}(\mu, \sigma^2)\).
This is why Gaussian noise is the canonical maximum-entropy base under a fixed second-moment budget (“Normal Distribution,” 2026).
By translation invariance of differential entropy, we may center first and assume \(\mu=0\). We then maximize \(\mathbb{E}\lbrack -\ln p(x)\rbrack\) subject to \(\int p(x) dx = 1\) and \(\int x^2 p(x) dx = \sigma^2\). The Lagrangian functional derivative yields:
Integrability forces the quadratic coefficient to be negative, which yields the Gaussian form
Constrained Entropy Ascent
This uses projected entropy ascent on a discretized density. Each step increases entropy and then reapplies the active constraint, so the curve relaxes toward the actual max-entropy solution for that setting.
From Noise to Data: The Transport Problem
Once the base distribution is fixed, the next question is geometric: how do we move mass from \(P_{\text{noise}}\) to \(P_{\text{data}}\) while paying as little cost as possible? Optimal transport formalizes exactly that source-to-target conversion problem (Peyré, 2025)(Thorpe, 2018).
It studies how to rearrange one distribution into another, either by a point-to-point map or, more generally, by a transport plan on source-target pairs.
Pushforward Shorthand
We write \(T_\sharp P = Q\) to mean: if \(Z \sim P\), then \(T(Z) \sim Q\).
If \(T\) is a differentiable bijection, this distribution-matching condition is equivalent to the usual change-of-variables formula:
\[ p_{\text{data}}(x) = p_{\text{noise}}(T^{-1}(x)) \left\vert \det J_{T^{-1}}(x) \right\vert = p_{\text{noise}}(z)\left\vert \det J_T(z) \right\vert ^{-1}, \qquad x=T(z). \]
Generative Optimal Transport
In the generative setting, we start from noise \(Z \sim P_{\text{noise}}\) and want outputs with law \(P_{\text{data}}\). A transport map therefore aims to satisfy \(T_\sharp P_{\text{noise}} = P_{\text{data}}\).
The Monge Problem
Let \(c: \mathcal{Z} \times \mathcal{X} \to \mathbb{R} \cup \{+\infty\}\) be a fixed ground cost. Monge seeks a single-valued transport map \(T\) minimizing
\[ \min_T \underset{Z \,\sim\, P_{\text{noise}}}{\mathbb{E}}\lbrack c(Z, T(Z))\rbrack \quad \text{s.t.} \quad T_\sharp P_{\text{noise}} = P_{\text{data}}. \]
Two Feasible Monge Maps $T$
In the discrete Monge problem, each source atom $z_i$ must choose exactly one target atom $x_j$. If the masses are indivisible, a feasible single-valued map $T$ exists only when the source atoms already come in the exact pieces demanded by the targets. Here the route coefficients $c_{ij}$ are shipping rates in $\$/\mathrm{kg}$, so with 1kg boxes the Monge objective is the total dollar cost.
No Feasible Monge Map $T$
The Kantorovich Problem
Monge forces each source point \(z\) to choose a single destination \(T(z)\). Kantorovich relaxes this by optimizing over any coupling\(\pi\) on \(\mathcal{Z} \times \mathcal{X}\) with marginals \(P_{\text{noise}}\) and \(P_{\text{data}}\):
\[ \min_{\pi \in \Pi(P_{\text{noise}}, P_{\text{data}})} \int c(z,x)\,d\pi(z,x). \]Here \(\pi\) records how much source mass at \(z\) is assigned to target mass at \(x\). Monge is the special case where the plan is concentrated on the graph of a map:
\[ \pi = (\mathrm{Id}, T)_\sharp P_{\text{noise}}. \]
Discrete Kantorovich Form
In the finite case, if source point \(i\) carries mass \(a_i\), target point \(j\) needs mass \(b_j\), and moving one unit of mass costs \(c_{ij}\), then Kantorovich transport solves
\[ \min_{\pi_{ij} \ge 0} \sum_{i,j} c_{ij}\pi_{ij} \]subject to
\[ \sum_j \pi_{ij} = a_i, \qquad \sum_i \pi_{ij} = b_j. \]This is useful here because it allows mass splitting and turns the discrete problem into a linear optimization problem. Monge is recovered as the special case
\[ \pi_{i,T(i)} = a_i, \qquad \pi_{ij} = 0 \text{ for } j \ne T(i). \]
Why the Discrete Problem Is Linear
Kantorovich transport is linear in the plan \(\pi_{ij}\), so the discrete problem is a linear optimization problem. That linear structure gives a dual problem with its own useful interpretation.
Transport Plan $\pi$ with Mass Splitting
Source atoms $z_1$ (2kg) and $z_2$ (1kg) must fill target atoms $x_1$ (1kg) and $x_2$ (2kg). Let $\alpha = \pi_{11}$ be the mass moving from $z_1$ to $x_1$. Conservation of mass determines the entire plan $\pi$: $\pi_{11} = \alpha, \pi_{12} = 2-\alpha, \pi_{21} = 1-\alpha, \pi_{22} = \alpha$. In the Monge problem, $\alpha \in \{0, 1\}$, but Kantorovich allows any $\alpha \in [0, 1]$.
Dual Potentials
Because the discrete Kantorovich problem is linear, it has a useful dual: instead of moving mass directly, we assign prices to source and target locations and use those prices to certify lower bounds on every transport plan.
Discrete Kantorovich Dual
In the discrete case, the dual problem is
\[ \max_{f_i,\,g_j} \sum_i a_i f_i + \sum_j b_j g_j \qquad \text{s.t.} \qquad f_i + g_j \le c_{ij}\;\; \text{for all } i,j. \]Here \(f_i\) is the price assigned to source point \(x_i\) and \(g_j\) is the price assigned to target point \(y_j\). The rule \(f_i+g_j \le c_{ij}\) says their combined price can never exceed the true cost of pairing them.
Why This Is Useful
Any feasible pair \((f,g)\) gives a guaranteed lower bound on every transport plan:
\[ \sum_i a_i f_i + \sum_j b_j g_j = \sum_{i,j}\pi_{ij}(f_i+g_j) \le \sum_{i,j} c_{ij}\pi_{ij}. \]So if we can find large prices that still satisfy the constraints, we certify that the transport cost cannot be any smaller. Duality says the best such certificate exactly matches the true optimum.
Complementary Slackness
At optimality, transported pairs satisfy \(\pi_{ij}>0 \Rightarrow f_i+g_j=c_{ij}\). In words: mass only travels along pairs whose price constraint is tight. So the dual potentials identify the active geometry of the coupling.
The same tiny transport example makes this lower-bound picture concrete:
Prices as Lower Bounds
Move through the feasible family $f_1=t$, $f_2=-t$, $g_1=1-t$, $g_2=1+t$ and watch the dual lower bound rise until the used constraints become tight. Masses are in $\mathrm{kg}$ and costs are in $\$/\mathrm{kg}$.
Continuous Kantorovich Dual
The continuous version is the same idea, but now the prices are functions rather than finite lists of numbers. Among all \(f\) and \(g\) satisfying \(f(z)+g(x)\le c(z,x)\) for every \(z,x\), the best lower bound is
\[ \sup_{f,g} \; \mathbb{E}\lbrack f(Z)\rbrack + \mathbb{E}\lbrack g(X)\rbrack , \qquad Z \sim P_{\text{noise}},\; X \sim P_{\text{data}}. \]Under standard assumptions, this equals the Kantorovich optimum.
Why This Matters Here
These dual potentials are the continuous analogue of the discrete prices. They reappear in regularized OT, Sinkhorn-style updates, and potential-based viewpoints on transport geometry.
Wasserstein Distance: Earth Mover’s Distance
The Kantorovich problem does more than choose a coupling. When the cost comes from a ground metric, its optimal value becomes a metric on probability distributions themselves.
\(p\)-Wasserstein Distance
Let \(d(x,y)\) be a ground metric on the sample space and let \(p\ge 1\). For probability laws \(P\) and \(Q\) with finite \(p\)-th moments, the \(p\)-Wasserstein distance is
\[ W_p(P,Q) = \left( \inf_{\pi\in\Pi(P,Q)} \int d(x,y)^p\,d\pi(x,y) \right)^{1/p}. \]The coupling \(\pi\) tells us how to align source samples with target samples before measuring average movement.
Earth Mover’s Distance
In the finite case, \(W_1\) has the literal earth-moving interpretation:
\[ W_1(P,Q)=\min_{\pi_{ij}\ge0}\sum_{i,j} d_{ij}\pi_{ij} \]subject to the same row and column marginal constraints as before. If \(a_i\) is a pile of dirt, \(b_j\) is a hole to fill, and \(d_{ij}\) is travel distance, then \(\pi_{ij}\) says how much dirt moves from \(i\) to \(j\). The objective is total work.
For quadratic transport, the common object is \(W_2^2\), the minimum expected squared displacement. The metric is \(W_2\) itself, after taking the square root.
Kantorovich-Rubinstein Duality for \(W_1\)
For \(W_1\), the general continuous dual above collapses to a single 1-Lipschitz potential.
For the distance cost \(c(x,y)=d(x,y)\),
\[ W_1(P,Q) = \sup_{\Vert f\Vert _{\mathrm{Lip}}\le 1} \left\{ \mathbb{E}_{X\sim P}\lbrack f(X)\rbrack -\mathbb{E}_{Y\sim Q}\lbrack f(Y)\rbrack \right\}. \]Instead of minimizing over couplings, the dual maximizes over all 1-Lipschitz test functions. Such a function assigns scalar “height” values to the sample space, but its slope is capped by the ground metric (Thickstun, n.d.).
Why the \(W_1\) Dual Is Useful
The primal earth-mover view asks how to move mass. The dual view asks for the strongest 1-Lipschitz critic that separates the two distributions in expectation.
This is the reason \(W_1\) became important in adversarial generative modeling: if two distributions are close in the ground geometry, no Lipschitz critic can give them very different scores, even when their supports do not overlap exactly.
Why the Ground Geometry Matters
Pointwise divergences compare probability mass at the same location. Wasserstein distances compare distributions through the geometry of the sample space: moving mass a short distance is cheaper than moving it far away.
This is why Wasserstein geometry is natural for generative modeling. It does not merely say whether two densities agree pointwise; it asks how much work is needed to deform one distribution into the other.
Preview: The Flow View
Once \(W_2\) is treated as a metric on probability laws, it becomes meaningful to ask for shortest paths between distributions. If a curve of densities \(\rho_t\) is transported by a velocity field \(v_t\) through
\[ \partial_t \rho_t+\operatorname{div}(\rho_t v_t)=0, \]then its kinetic action is
\[ \int_0^1\!\int \Vert v_t(x)\Vert_2^2\,\rho_t(x)\,dx\,dt. \]The Benamou-Brenier formulation below says that \(W_2^2\) is exactly the smallest possible action over all such flows. So the Wasserstein metric is not just an endpoint loss; it induces the dynamical least-action picture.
The 1D Case: Inverse Transform Sampling
To build intuition, let’s start with a simple case: Inverse Transform Sampling(“Inverse Transform Sampling,” 2025).
Suppose we operate strictly in a 1-dimensional continuous space \(\mathbb{R}\), and the target distribution \(P_{\text{data}}\) is entirely defined by its Cumulative Distribution Function (CDF) \(F(x) = P(X \le x)\).
The 1D Sampling Problem
We can sample \(U \sim \mathcal{U}(0,1)\), but we want \(X=T(U)\) to have CDF \(F\), i.e. \(P(X \le t)=F(t)\) for every \(t\). In other words, we want to turn uniform noise into samples from the target distribution.
The 1D Closed-Form Solution
Set \(X=F^{-1}(U)\), where \(F^{-1}(u)\) is the smallest \(x\) with \(F(x)\ge u\). Then \(P(X \le t)=P(U \le F(t))=F(t)\), so inverse transform sampling is already a transport map from uniform noise to the target distribution.
We now rewrite the same idea in the discrete setting first.
The local swap behind the 1D OT solution is easiest to see in the smallest nontrivial example:
Distance Matching and Uncrossing
Drag the ordered source points $x_1 < x_2$ and target points $y_1 < y_2$. The same number line drives both comparisons: the top row shows the uncrossed matching (sorting order), while the bottom row shows the crossed matching. In 1D, uncrossing never increases the cost for convex costs like distance.
Discrete 1D OT: Equal-Mass Matching
Suppose source points \(x_1,\dots,x_n\) and target points \(y_1,\dots,y_n\) each carry mass \(\frac{1}{n}\), and each source point must be matched to exactly one target point.
Under this equal-mass assumption, any feasible Monge map must match each source to a distinct target: if two sources landed on the same target, that target would receive mass \(\frac{2}{n}\) instead of \(\frac{1}{n}\), and some other target would be left empty. So every feasible matching is a bijection between the two sets, hence it is completely described by a permutation \(\sigma \in S_n\).
We therefore solve
\[ \min_{\sigma \in S_n} \frac{1}{n}\sum_{i=1}^n c(x_i, y_{\sigma(i)}). \]
Quadrangle Inequality
Assume \(x_1 < x_2\), \(y_1 < y_2\), and \(c(x,y) = h(x-y)\) with \(h\) convex. Then
\[ c(x_1, y_1) + c(x_2, y_2) \le c(x_1, y_2) + c(x_2, y_1). \]In words: if two matching lines cross, uncrossing them never increases the cost.
Let \(a = x_1-y_2\) and \(d = x_2-y_1\). Since \(x_1 < x_2\) and \(y_1 < y_2\), we have \(d-a = (x_2-x_1) + (y_2-y_1) > 0\), so \(a < d\). With
we can write
Since \(h\) is convex,
Substituting back \(a = x_1-y_2\) and \(d = x_2-y_1\) gives
Discrete Monotone Matching
Sort both sets of points in non-decreasing order:
\[ x^{(1)} \le x^{(2)} \le \dots \le x^{(n)} \quad \text{and} \quad y^{(1)} \le y^{(2)} \le \dots \le y^{(n)}. \]Here \(x^{(k)}\) means “the \(k\)-th smallest source point,” and similarly for \(y^{(k)}\). Then an optimal matching is
\[ x^{(1)} \leftrightarrow y^{(1)}, \qquad x^{(2)} \leftrightarrow y^{(2)}, \qquad \dots, \qquad x^{(n)} \leftrightarrow y^{(n)}. \]So in 1D, optimal transport is just “sort both sides and pair equal ranks.”
Start from any permutation \(\sigma\). If \(\sigma\) is not monotone, then some \(i<j\) satisfy \(\sigma(i)>\sigma(j)\), so the pairs \(x^{(i)} \mapsto y^{(\sigma(i))}\) and \(x^{(j)} \mapsto y^{(\sigma(j))}\) cross. By the Monge property, swapping those two targets does not increase the cost:
After the swap, the number of inversions of \(\sigma\) decreases by at least \(1\). Repeating this process finitely many times removes all inversions.
A permutation with no inversions is increasing, and the only increasing permutation of \(\{1,\dots,n\}\) is the identity. So eventually we reach \(\sigma(i)=i\) for every \(i\), which is exactly the sorted matching \(x^{(i)} \leftrightarrow y^{(i)}\).
Equal Quantiles
The continuous version says the same thing, but with percentiles instead of sorted lists.
The \(u\)-quantile of a distribution means: the smallest point whose CDF value is \(u\). Equivalently, it is the pseudoinverse-CDF value \(F^{-1}(u):=\inf\{x:F(x)\ge u\}\).
For example, if a point \(x\) is at the \(70\%\) quantile of the source distribution, then it should be sent to the \(70\%\) quantile of the target distribution.
Continuous 1D OT
Let \(P\) and \(Q\) be 1D source and target laws with CDFs \(F\) and \(G\). For the same class of 1D convex costs \(c(x,y)=h(x-y)\), we seek a map \(T\) that sends the source law to the target law and, among all such maps, has the smallest transport cost:
\[ \min_T \int c(x, T(x))\,dP(x) \]subject to
\[ T_\sharp P = Q. \]In words: if \(X \sim P\), then \(T(X)\) should have CDF \(G\).
Quantile Matching
For 1D convex costs, the optimal map factorizes through a uniform base \(u \sim \mathcal{U}(0,1)\):
\[ u = F(x), \qquad T(x) = G^{-1}(u) \]
Pullback: Compute \(u = F(x)\). By the probability integral transform, this extracts pure uniform noise.
Pushforward: Sample \(T(x) = G^{-1}(u)\). This exactly recovers inverse transform sampling.
First check that the map has the right output distribution. If \(X \sim P\) and \(U = F(X)\), then \(U \sim \mathcal{U}(0,1)\). Define \(Y = G^{-1}(U) = G^{-1}(F(X))\). Then for any \(t\),
so \(Y\) has CDF \(G\). Therefore \(T_\sharp P = Q\).
Now check optimality. For each \(m\), take the equally spaced quantile levels \(u_k = \frac{k-\frac12}{m}\) and define
By the discrete monotone matching result, pairing \(x_k\) with \(y_k\) minimizes
over all permutations \(\sigma\). As \(m \to \infty\), these sums converge to the quantile integral
which is exactly
So equal-quantile matching is optimal in the continuous problem as well.
If \(h\) is strictly convex, such as \(h(t)=t^2\), this optimizer is unique up to sets of measure zero. For \(h(t)=\vert t\vert\), ties can occur.
CDFs vs. Densities
The transport rule is written using CDFs, but models often predict densities or probabilities:
\[ F(x) = \int_{-\infty}^{x} p(t)\,dt, \qquad p(x) = F'(x) \]and in the discrete case
\[ F(v_k) = \sum_{j \le k} p_j. \]So learning \(p\) is enough: the CDF is obtained by integrating or summing, and sampling uses the inverse CDF.
Inverse Transform Sampling
Drag the $u$ handle along the horizontal axis, which represents uniform noise. The solid blue curve is the Inverse CDF $x = F^{-1}(u)$. The shape along the vertical axis shows the target density $p(x)$. Notice how the cumulative mass (the filled orange area) visually grows in proportion to your chosen $u$ input!
Sliced Wasserstein: Reusing 1D OT
The 1D closed form also gives a practical high-dimensional approximation: project the distributions onto many lines, solve the resulting 1D OT problems, and average the costs.
Sliced Wasserstein Distance
For a direction \(\theta\) on the unit sphere \(\mathbb{S}^{D-1}\), let
\[ P_\theta=(x\mapsto \theta^\top x)_\sharp P, \qquad Q_\theta=(x\mapsto \theta^\top x)_\sharp Q \]be the 1D projected laws. The sliced Wasserstein distance is
\[ \operatorname{SW}_p^p(P,Q) = \int_{\mathbb{S}^{D-1}} W_p^p(P_\theta,Q_\theta)\,d\theta. \]In practice, the integral is estimated with random directions:
\[ \operatorname{SW}_p^p(P,Q) \approx \frac1K\sum_{k=1}^K W_p^p(P_{\theta_k},Q_{\theta_k}). \]
Why This Belongs Here
Sliced OT is useful because each projected problem is 1D, so the quantile-matching solution above applies directly. For empirical samples, this means: project both clouds onto a direction, sort the projected values, pair equal ranks, and repeat across directions.
The tradeoff is that sliced OT is a comparison metric, not a single high-dimensional transport map. Each projection induces its own 1D matching, so the averaged objective does not by itself produce one coherent Brenier coupling in the original space.
Kitagawa and Takatsu make the limitation precise: sliced Wasserstein-type metrics are topologically equivalent to classical Wasserstein metrics, but they are generally not bi-Lipschitz equivalent, and for \(p>1\) in dimension \(D\ge2\) the sliced spaces are not geodesic. So sliced OT is often a cheap geometry-aware surrogate when full high-dimensional OT is too expensive, but it should not replace \(W_p\) in arguments where the exact metric scale or Wasserstein geodesic structure matters (Montesuma et al., 2024; Kitagawa & Takatsu, 2024).
Project, Sort, Average
Sliced OT turns a high-dimensional comparison into many 1D comparisons. Rotate the projection direction, then switch between the projected cloud and the sorted one-dimensional matching induced by that direction.
Each slice uses the same quantile-matching rule as 1D OT: project both empirical clouds, sort the scalar values, pair equal ranks, and average the squared gaps.
This is useful as a geometry-aware comparison or loss, but it is not yet a sampler: every projection gives its own 1D pairing, and those pairings do not assemble into a single high-dimensional transport map.
For sampling maps in higher dimensions, a different reuse of the 1D idea is to apply quantile matching one coordinate at a time.
Autoregression as Sequential Transport
This gives the Knothe-Rosenblatt rearrangement, which is the mathematical backbone of autoregression.
Knothe-Rosenblatt Rearrangement
Fix an order of the coordinates and write
\[ T(x_1,\dots,x_D) = \bigl(T_1(x_1), T_2(x_1,x_2), \dots, T_D(x_1,\dots,x_D)\bigr). \]Here \(x_{<i}=(x_1,\dots,x_{i-1})\) means “all earlier coordinates.”
Each coordinate is defined by a 1D conditional inverse-CDF step:
\[ T_1(x_1) = F^{-1}_{Q_1}(F_{P_1}(x_1)) \]\[ T_2(x_2 \vert x_1) = F^{-1}_{Q_2 \vert Q_1}(F_{P_2 \vert P_1}(x_2 \vert x_1)) \]\[ \dots \]\[ T_D(x_D \vert x_{<D}) = F^{-1}_{Q_D \vert Q_{<D}}(F_{P_D \vert P_{<D}}(x_D \vert x_{<D})) \]So after fixing the earlier coordinates, we apply the same 1D quantile-matching rule to the next conditional distribution.
Non-Optimality (Greedy Sequence)
The Knothe-Rosenblatt map guarantees \(T_\sharp P_{\text{noise}} = P_{\text{data}}\), but it is not globally optimal for the standard squared Euclidean cost \(W_2^2 = \mathbb{E}\lbrack \Vert x - y\Vert _2^2\rbrack\) in high dimensions.
By factorizing sequentially, it implicitly solves a greedy transport problem. It is canonical only because it appears as the optimal limit of a heavily skewed quadratic cost that strictly prioritizes earlier coordinates:
\[ c(x, y) = \sum_{i=1}^D \lambda_i (x_i - y_i)^2, \qquad \lambda_1 \gg \lambda_2 \gg \dots \gg \lambda_D \]
At the population level, vanilla autoregression is this triangular transport written in density language.
Chain Rule Factorization
If \(x=(x_1,\dots,x_D)\) and \(x_{<i}=(x_1,\dots,x_{i-1})\), then
\[ p(x)=\prod_{i=1}^D p(x_i \mid x_{<i}). \]
Sampling Rule
Draw \(u_1,\dots,u_D \sim \mathcal{U}(0,1)\) and set \(x_i = F^{-1}_{X_i \mid X_{<i}=x_{<i}}(u_i)\). Each step is just 1D inverse transform sampling conditioned on the past.
Example: LLMs as Classification
Next-Token Prediction
Take a tiny vocabulary \(\mathcal{V}=\{\text{"apple"}, \text{"banana"}, \text{"cherry"}\}\). At step \(i\), the model outputs probabilities \(p_k=P(x_i=v_k \mid x_{<i})\).
Suppose it predicts \((0.6, 0.3, 0.1)\). The discrete CDF is then \((0.6, 0.9, 1.0)\). If we draw \(u=0.75\), the inverse-CDF rule picks the first token whose cumulative probability exceeds \(u\), namely “banana”.
So next-token sampling is conditional inverse transform sampling, and standard cross-entropy training learns the conditional distributions that induce these 1D transports:
\[ \mathcal{L}_{\text{NLL}} = -\sum_{i=1}^D \log P(x_i^{\text{true}} \mid x_{<i}). \]
Next-Token Inverse CDF Sampling
Drag the slider to sample a uniform random variable \( u \sim \mathcal{U}(0,1) \). The vocabulary space is partitioned according to the predicted probability of each token. The interval that catches \( u \) becomes the generated token!
Reparameterized Autoregression
Instead of autoregressing in the original coordinates, we can first change coordinates and then factorize there.
Change of Variables
If \(y=g(x)\) is invertible, then
\[ p_X(x)=p_Y(g(x))\vert \det J_g(x)\vert . \]So any autoregressive factorization in the \(y\)-coordinates induces a density in the original \(x\)-coordinates.
Understanding Reparameterization Maps
A map $y = g(x)$ changes the coordinates before we factorize or sample. Some reparameterizations only change ordering, while others expose more meaningful coordinates such as principal axes or coarse-vs-fine structure.
Example 1: Changing the Ordering
Permuting Coordinates
If \(y_i=x_{\sigma(i)}\) for a permutation \(\sigma\), then \(J_g\) is a permutation matrix, so \(\vert \det J_g\vert =1\). For example, \((y_1,y_2,y_3)=(x_2,x_1,x_3)\) simply changes generation order. The model is still autoregressive, just in a different ordering.
Example 2: Next-Scale Prediction
An important recent ML example is Visual Autoregressive Modeling (VAR), which replaces raster-scan next-token prediction with next-scale prediction: generate a coarse image first, then autoregressively refine it resolution by resolution (Tian et al., 2024). The factorization is still autoregressive, but the ordering now runs over image scales rather than individual pixel positions.
Next-Scale Prediction Instead of Raster Order
VAR-style image autoregression changes the ordering variable. Instead of predicting pixels in raster order, it predicts the next resolution conditioned on all coarser resolutions. The same sample is refined scale by scale: first a seed image, then a coarse layout, then mid-frequency structure, and finally the full-resolution detail.
Example 3: Frequency Space
Frequency-space autoregression is a different kind of reparameterization: instead of changing only the generation schedule, we first change basis and then autoregress in the transformed coordinates.
At toy scale, one concrete way to see the same coarse-to-fine logic is through the 2-point Haar basis, the smallest wavelet example of frequency-space autoregression (Yu et al., 2026).
2-Pixel Haar Coordinates
For a 2-pixel patch $x=(x_1,x_2)$, the 2-point Haar basis separates a shared brightness term $y_{\text{low}}=\frac{x_1+x_2}{\sqrt{2}}$ from a signed contrast term $y_{\text{high}}=\frac{x_1-x_2}{\sqrt{2}}$. In that basis, autoregression can model coarse intensity first and local edge direction second.
The same idea appears in the standard practical visualization below: wavelet decomposition into approximation and detail bands, followed by progressive reconstruction.
Image-Space Haar Decomposition
This is the standard practical visualization: an image, its 1-level Haar subbands, and the corresponding coarse-to-fine reconstructions. The approximation band stores broad structure, while the detail bands isolate left-right edges, top-bottom edges, and diagonal texture.
Fourier gives the complementary practical view: the same image is represented by global frequencies, visualized as a centered spectrum together with low-pass and high-pass reconstructions.
Image-Space Fourier Decomposition
The Fourier view is complementary to the Haar view. Instead of localized multiscale bands, it represents the same image with global sinusoidal frequencies: the center of the spectrum is low frequency, and the outer region is high frequency.
Relation to Diffusion
For images, diffusion often appears to refine samples from coarse structure toward fine detail. Dieleman interprets DDPM-style denoising as an approximate low-to-high spectral ordering that is valid in expectation across many images, rather than as a hard per-sample rule (“Diffusion Is Spectral Autoregression,” 2024). Falck’s follow-up accepts that approximate DDPM picture, but argues that this spectral hierarchy is not necessary for good diffusion performance: hierarchy-free diffusion can match DDPM and even improve high-frequency generation (Falck, 2025).
Global Optimality: Brenier’s Theorem
If we abandon the greedy sequence and seek the true globally optimal map for the symmetric squared Euclidean cost, we arrive at Brenier’s theorem (Brenier, 1991).
Brenier’s Theorem
If \(P\) is absolutely continuous and \(P,Q\) have finite second moments, then the quadratic-cost problem
\[ W_2^2(P,Q)=\inf_{T_\sharp P = Q}\mathbb{E}_{X\sim P}\Vert X-T(X)\Vert _2^2 \]has a unique optimal map up to \(P\)-null sets, and it has the form \(T=\nabla\psi\) for a convex potential \(\psi\).
The Right Higher-Dimensional Analogue of Monotonicity
In 1D, optimal transport is monotone: quantiles never cross. In higher dimensions, the correct replacement is cyclic monotonicity, not just the 2-point inequality.
Cyclic Monotonicity
Let \(\Gamma_T=\{(x,T(x))\}\). We say \(\Gamma_T\) is cyclically monotone if for every finite family \(x_1,\dots,x_m\),
\[ \sum_{i=1}^m \langle x_i, T(x_i)\rangle \ge \sum_{i=1}^m \langle x_i, T(x_{i+1})\rangle, \qquad x_{m+1}=x_1. \]For quadratic cost, this is equivalent to
\[ \sum_{i=1}^m \Vert x_i-T(x_i)\Vert _2^2 \le \sum_{i=1}^m \Vert x_i-T(x_{i+1})\Vert _2^2. \]So no finite cyclic reassignment of destinations can lower the total transport cost.
Pairwise Monotonicity Is Only the 2-Cycle Case
Taking \(m=2\) gives
\[ \langle T(x_1)-T(x_2),\,x_1-x_2\rangle \ge 0, \]which is the familiar monotonicity inequality. That is necessary, but Brenier’s structure is stronger: it rules out every finite cyclic swap, not just pairwise reversals.
Rockafellar’s Characterization (Rockafellar, 1966)
A set \(\Gamma \subset \mathbb{R}^d \times \mathbb{R}^d\) is cyclically monotone if and only if there exists a convex function \(\phi\) such that
\[ \Gamma \subset \partial \phi. \]In particular, if \(\Gamma\) is the graph of a single-valued map \(T\), then \(T(x)\in \partial \phi(x)\), and wherever \(\phi\) is differentiable we have \(T(x)=\nabla\phi(x)\).
The key point is that cyclic monotonicity is strictly stronger than the ordinary 2-point test. The widget below makes that distinction concrete for a draggable triple, including Rockafellar’s finite support test directly inside the figure.
Cyclic Monotonicity and Rockafellar's Witness
Drag the three source points $x_1,x_2,x_3$. The middle panel shows $T(x_i)$, and the right panel asks whether the same triple can be explained by one convex potential $\phi$.
Why Brenier Produces a Convex Potential
For quadratic cost, the support of the optimal plan is cyclically monotone. Rockafellar turns that geometric optimality into a convex potential, and absolute continuity of \(P\) upgrades the plan to an almost-everywhere-defined map. So \(T=\nabla\psi\) is not an extra modeling choice; it is the structural consequence of optimality.
Closed Form: Gaussian to Gaussian
One of the few genuinely high-dimensional cases with a clean closed form is transport between Gaussian laws. It is especially useful here because the standard ML prior is usually Gaussian, and because it gives a concrete non-1D example where the Brenier map can be written down explicitly.
Gaussian-to-Gaussian Brenier Map
Let
\[ P=\mathcal{N}(m_0,\Sigma_0), \qquad Q=\mathcal{N}(m_1,\Sigma_1), \]where \(\Sigma_0\) and \(\Sigma_1\) are positive definite covariance matrices. For squared Euclidean cost, the optimal Monge map from \(P\) to \(Q\) is affine:
\[ T(x)=m_1+A(x-m_0), \]with
\[ A=\Sigma_0^{-1/2}\left(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2}\right)^{1/2}\Sigma_0^{-1/2}. \]This is the unique symmetric positive-definite matrix satisfying
\[ A\Sigma_0 A=\Sigma_1. \]Hence \(T\) is the gradient of the convex quadratic potential
\[ \psi(x)=m_1^\top x+\frac12(x-m_0)^\top A(x-m_0). \]
Gaussian \(W_2\) Distance
The corresponding squared Wasserstein distance is
\[ W_2^2(P,Q) = \Vert m_0-m_1\Vert_2^2 + \operatorname{tr}\left( \Sigma_0+\Sigma_1 - 2\left(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2}\right)^{1/2} \right). \]This covariance term is the Bures metric on positive definite matrices (Peyré, 2025).
Useful Special Cases
If the source is standard Gaussian, then
\[ Z\sim \mathcal{N}(0,I), \qquad Q=\mathcal{N}(m,\Sigma) \]gives the especially simple Brenier map
\[ T(z)=m+\Sigma^{1/2}z. \]This is the direct map used in the 2D comparison below.
If \(\Sigma_0\) and \(\Sigma_1\) commute, then the formula reduces to
\[ A=\Sigma_1^{1/2}\Sigma_0^{-1/2}, \qquad W_2^2(P,Q)=\Vert m_0-m_1\Vert_2^2+\Vert \Sigma_0^{1/2}-\Sigma_1^{1/2}\Vert_F^2. \]In 1D, this becomes
\[ T(x)=m_1+\frac{\sigma_1}{\sigma_0}(x-m_0), \qquad W_2^2=(m_0-m_1)^2+(\sigma_0-\sigma_1)^2. \]
Brenier vs. Cholesky / KR for Gaussians
A Gaussian target has many exact affine generators. If \(\Sigma=LL^\top\) is the Cholesky factorization, then \(z\mapsto m+Lz\) also sends \(\mathcal{N}(0,I)\) to \(\mathcal{N}(m,\Sigma)\). That triangular map is the Gaussian Knothe-Rosenblatt rearrangement for the chosen coordinate order.
The Brenier map instead uses the symmetric square root \(z\mapsto m+\Sigma^{1/2}z\). Both maps produce the same target law, but they induce different couplings and different quadratic transport costs. This is the population-level distinction visualized in the 2D figure later in the post.
Dynamic Optimal Transport
With the Wasserstein distance in place, probability laws can be treated as points in a metric space. Brenier chooses the optimal endpoint map between two such points. Dynamic OT asks for the optimal interpolation in time: the Wasserstein geodesic connecting them.
Mixture Interpolation vs. Displacement Interpolation
A simple way to interpolate two densities is the linear mixture
\[ \rho_t=(1-t)\rho_0+t\rho_1. \]This does not move mass. It fades one distribution out while fading the other in at fixed spatial locations.
Wasserstein interpolation is different. If \(T\) is the Brenier map from \(\rho_0\) to \(\rho_1\), then particles move by
\[ X_t=(1-t)X_0+tT(X_0), \qquad X_0\sim \rho_0, \]and the intermediate law is \(\rho_t=\lbrack X_t\rbrack _\sharp\rho_0\). This is displacement interpolation: mass travels through the sample space instead of appearing and disappearing.
Euclidean vs. Wasserstein Geodesics
Start from one sampled source cloud and compare two paths between the same Gaussian endpoint laws. The Euclidean density path fades one law into the other, while the Wasserstein geodesic moves the same samples along the Brenier displacement.
The endpoint ellipses are density contours. In the Wasserstein mode, every source-to-target path is drawn as a faint gray trace; the yellow sample cloud shows how mass actually moves along those paths.
The point of the comparison is that the Wasserstein path is not merely a visually smoother interpolation. It is the shortest path in the \(W_2\) geometry, and it has an equivalent dynamic formulation as the least-action velocity field carrying one density into the other.
Benamou-Brenier Dynamic Formulation
For quadratic cost,
\[ W_2^2(\mu_0,\mu_1) = \inf_{\rho_t,\,v_t} \int_0^1 \! \int \Vert v_t(x)\Vert _2^2\, \rho_t(x)\, dx\, dt \]subject to
\[ \partial_t \rho_t + \operatorname{div}(\rho_t v_t) = 0, \qquad \rho_0=\mu_0,\ \rho_1=\mu_1. \]So instead of optimizing one endpoint map directly, we optimize over all density paths and velocity fields that transport \(\mu_0\) to \(\mu_1\) with minimal kinetic action (Benamou & Brenier, 2000).
Eulerian and Lagrangian Views
The formula above is Eulerian: it works with a density field \(\rho_t\) and a velocity field \(v_t\) over space-time. When the Brenier map \(T\) exists, the same optimizer can be written in Lagrangian coordinates by following particles
\[ X_t=(1-t)X_0+t\,T(X_0), \qquad X_0\sim \mu_0, \]and then setting \(\rho_t=\lbrack X_t\rbrack _\sharp \mu_0\). Thus the static map view and the dynamic least-action view describe the same quadratic OT geodesic.
Path-Space / Stochastic Formulation
A different dynamic formulation optimizes not over density/velocity fields directly but over laws on paths:
\[ \inf_{P:\,P_0=\mu_0,\ P_1=\mu_1}\operatorname{KL}(P \Vert R), \]where \(R\) is a reference diffusion or more general reference process on path space. This is the Schrödinger bridge problem. At finite noise it selects the most likely stochastic dynamics compatible with the endpoint marginals. For the standard Brownian reference, the small-noise limit recovers quadratic OT. This is the right language for diffusion-style bridge methods and for later multi-time questions where the reference process is part of the model, not just a numerical regularizer (Léonard, 2013; Tong et al., 2024).
Continuous normalizing flows live in the same continuity-equation formalism, but their training objectives do not minimize the Benamou-Brenier action by default. They learn an admissible transport dynamics, not necessarily the Wasserstein geodesic.
Constrained vs. Unconstrained Transport
At the level of exact endpoint transport, we still have two main high-dimensional constructions:
- Knothe-Rosenblatt / autoregression: tractable, sequential, and order-dependent.
- Brenier / quadratic OT: symmetric and globally optimal for \(W_2^2\), but not available as a simple sequential recipe.
That distinction is the main architectural split.
Autoregression vs. Diffusion
Feature Autoregression (Knothe-Rosenblatt) Flow / Diffusion (Unconstrained) Map Form \(T_i(x_i \vert x_{<i}) = F^{-1}_{Q_i \vert Q_{<i}}(F_{P_i \vert P_{<i}}(x_i))\) Learn an unconstrained transport map; OT ideal: \(T(x) = \nabla \psi(x)\) Tractability Exact sequential 1D inversions Learned numerically rather than by exact coordinatewise inversions Structural Bias Fixed coordinate or token ordering No required coordinate order; bias comes from architecture, coupling, and path choice Typical Training Signal Exact likelihood: \(\sum_i \log p(x_i \vert x_{<i})\) CFM, score matching, likelihood, or distillation variants Typical Inference \(D\) sequential steps ODE/SDE integration, denoising, or one-step maps
Autoregression keeps an exact sequential sampler and exact likelihood factorization, but pays for it with an ordering bias. Unconstrained transport removes that coordinate-ordering constraint and can target the symmetric quadratic geometry, but the map or flow has to be learned numerically.
Two 2D pictures help keep the distinction straight. First, we compare two exact continuous maps at the population-law level: same source law, same target law, different map. Then we pass to a finite-sample discretization of those same laws: one frozen source cloud, one frozen target cloud, and two couplings on the same cached point set.
Knothe-Rosenblatt vs Brenier in 2D
This panel is the exact continuous problem. The formulas compare two true maps between probability laws, while the plotted particles are only a visual probe of those maps. Start from one sampled source cloud $z_i \sim \mathcal{N}(0, I)$ and compare two exact maps into the same correlated Gaussian law.
The pale blue ellipses are contours of the target distribution. The sequential map is the Knothe-Rosenblatt transform $T_{\mathrm{KR}}(z)=Lz$, while the direct map is the Brenier transform $T_{\mathrm{OT}}(z)=\Sigma^{1/2}z$.
Every source-to-target path is drawn as a faint gray trace; the yellow sample cloud shows how the transport actually moves along those paths.
Sorting vs Sinkhorn in 2D
This panel is the empirical finite-sample problem. We freeze one source sample and one target sample, drawn from the same underlying source/target laws as the continuous panel above, and compare two couplings on that exact cached 64-point instance.
Here the target is shown through the sampled blue cloud, while every pairing is drawn as a faint gray trace and the yellow transported sample moves over the full coupling.
Unconstrained Transport: Continuous Flows
One standard way to parameterize an unconstrained transport is to learn a time-dependent velocity field and integrate it (Lai et al., 2025).
Continuous Normalizing Flow (CNF)
\[ \frac{dx}{dt} = v_\theta(x, t), \qquad x(0) \sim P_{\text{noise}}, \quad x(1) \sim P_{\text{data}} \]The field induces a flow map \(\phi_t\), and we want \(\phi_1(X)\) to follow \(P_{\text{data}}\) when \(X \sim P_{\text{noise}}\).
This avoids parameterizing \(\nabla\psi\) directly. It learns a transport map indirectly as the endpoint flow \(\phi_1\), but it does not by itself specify which dynamics should connect the endpoints. Benamou-Brenier chooses the least-action field; a generic CNF objective only chooses some field whose terminal pushforward is correct.
So the training question becomes: what conditional dynamics should \(v_\theta\) regress toward?
Flow Matching & Rectified Flows
Flow Matching(Lipman et al., 2023) and closely related Rectified Flow(Liu et al., 2022) methods choose a coupling \(q(x_0,x_1)\) between source and target points, then define a conditional path between each paired endpoint. For the straight-line version,
the target velocity is the constant displacement \(X_1-X_0\).
Conditional Flow Matching (CFM) Objective
\[ \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\,(X_0,X_1)\sim q}\bigl\Vert v_\theta(X_t,\, t) - (X_1 - X_0) \bigr\Vert _2^2 \]We sample a time \(t\) and an endpoint pair from a chosen coupling \(q\), then regress the model toward the straight-line conditional velocity.
Conditional vs. Marginal Optimality
For a fixed pair \((X_0, X_1)\), the straight line is optimal between those two endpoints. But the global geometry is determined by the coupling \(q\) and by the family of conditional paths.
If \(q=P_{\text{noise}}\otimes P_{\text{data}}\), we get the usual independent-pairing target.
If \(q\) is chosen from an OT plan, we get an OT-informed target.
If the conditional paths are Brownian bridges weighted by an entropic OT coupling, we move to Schrödinger-bridge variants.
So flow matching is a framework for turning a chosen coupling-and-path construction into a regression loss; it is not automatically the Benamou-Brenier geodesic (Lipman et al., 2023; Liu et al., 2022; Tong et al., 2024). However, by parameterizing the vector field via convex functions, Optimal Flow Matching(Kornilov et al., 2024) is able to recover the straight OT displacement in just one FM step.
Practical OT: Regularized Couplings
Everything above was exact and population-level: closed-form 1D maps, exact triangular rearrangements, and Brenier’s exact quadratic-cost optimizer. In practice, training works with finite batches and regularized couplings instead.
Entropic Optimal Transport
Given source and target distributions \(\mu,\nu\) and a reference coupling \(\pi_{\mathrm{ref}}\), entropic OT solves
\[ \min_{\pi \text{ coupling } \mu,\nu} \int c(x,y)\,d\pi(x,y) + \varepsilon\, \mathrm{KL}(\pi \Vert \pi_{\mathrm{ref}}). \]The KL term pulls the coupling toward the chosen reference plan \(\pi_{\mathrm{ref}}\) while keeping the marginals fixed. Intuitively, it discourages extremely sharp plans and makes the optimization numerically smoother.
Product Reference = Entropy Regularization up to Constants
If \(\mu,\nu\), and \(\pi\) admit densities, then
\[ \mathrm{KL}(\pi \Vert \mu \otimes \nu)=H(\mu)+H(\nu)-H(\pi). \]So with product reference \(\mu \otimes \nu\), KL regularization is the same as negative-entropy regularization up to constants, and therefore has the same minimizer. This follows by expanding the logarithm and using the fact that \(\pi\) has marginals \(\mu\) and \(\nu\).
Why the Reference Coupling Matters
With the usual product reference \(\mu \otimes \nu\), the KL term mainly favors diffuse couplings. Freulon et al. show that once the reference itself carries correlations, especially in the Gaussian case, the penalty no longer acts like generic smoothing: it favors couplings compatible with that reference structure (Freulon et al., 2026).
On a small fixed discrete instance, entropic OT is easiest to read as a sharp-vs-diffuse tradeoff. As \(\varepsilon\) increases, the plan keeps the same row and column marginals but spreads mass across more nearby source-target pairs.
How Sinkhorn Diffuses a Discrete Plan
Fix a small balanced discrete problem with source masses $a_i$ at points $z_i$, target masses $b_j$ at points $x_j$, and squared-distance costs $c_{ij}=\|z_i-x_j\|^2$. The entropically regularized plan $\pi^\varepsilon = \arg\min_{\pi \in \Pi(a,b)} \sum_{i,j} c_{ij}\pi_{ij} - \varepsilon H(\pi)$ stays marginally feasible but becomes progressively more diffuse as $\varepsilon$ grows.
Uncrossing Paths: Mini-Batch OT
Flow matching still leaves one practical question open: inside a mini-batch, which source sample should be paired with which target sample?
Batch OT
Random pairings can send nearby interpolation points toward very different endpoints, so the conditional targets \(X_1-X_0\) vary abruptly across the batch. A mini-batch OT solve reduces that mismatch by pairing nearby endpoints:
\[ \pi^\ast = \arg\min_{\pi \text{ matching } X_0^B \text{ to } X_1^B} \sum_{i,j} \pi_{ij}\, \bigl\Vert X_0^{(i)} - X_1^{(j)}\bigr\Vert _2^2 \]
This is efficiently approximated with Sinkhorn. With product reference, it is exactly the entropic OT relaxation above. Geometrically, the batch paths cross less, so nearby interpolation points tend to carry more compatible velocity targets.
On a small cached batch, the contrast is easiest to see directly:
Random Pairings vs Mini-Batch OT
Both panels use the same cached source batch $X_0^B$ and target batch $X_1^B$. Only the pairing changes. Faint gray traces show the full straight-line paths, aqua dots show the fixed endpoints, and the yellow dots with short arrows show the current interpolants $X_t$ and their velocity targets $X_1 - X_0$.
Endpoint Matching Is Not Yet a Dynamical Model
Mini-batch OT solves a two-marginal problem: for one time gap, it picks a coupling between the batch at the start and the batch at the end. That is enough to draw straight-line paths across that single gap, but a trajectory model over times \(t_1,\dots,t_n\) needs one joint law, or at least mutually compatible transitions, across all times.
Freulon et al. make this precise in the Gaussian setting. They show that if each pairwise entropic OT problem is regularized by a reference coupling coming from the same continuous-time Gaussian reference process, then the local transitions remain independently solvable but still fit together into one coherent dynamical model. With a product reference, each gap is regularized toward independence instead, so solving the gaps separately does not build in temporal coherence (Freulon et al., 2026).
One-Step Maps
One-step generators try to learn the transport map directly, rather than integrating an ODE or running many denoising steps at inference. Drifting Models(Deng et al., 2026) and Optimal Flow Matching(Kornilov et al., 2024) are concrete examples.
Drifting Model
A drifting model parameterizes a one-step generator
\[ x=f_\theta(z), \qquad z \sim P_{\text{noise}}, \]and studies the training-time pushforward distribution
\[ q_\theta = \lbrack f_\theta\rbrack _\sharp P_{\text{noise}}. \]Instead of specifying an inference-time transport field, it introduces a drifting field\(V_{p,q}(x)\) that prescribes how a current generated sample should move as the model is updated. The paper imposes the anti-symmetry condition
\[ V_{p,q}(x) = -V_{q,p}(x), \]so that \(V_{p,p}(x)=0\): matched data and model distributions are equilibrium points.
Drifting Objective and Training Step
For generated samples \(x=f_\theta(z)\sim q_\theta\), positive samples \(y^+\sim P_{\text{data}}\), and negative samples \(y^-\sim q_\theta\), Deng et al. use a kernelized attraction-repulsion field
\[ V_{p,q}(x)=V_p^+(x)-V_q^-(x), \]where
\[ V_p^+(x)=\frac{1}{Z_p(x)}\mathbb{E}_{y^+\sim p}\!\big\lbrack k(x,y^+)(y^+-x)\big\rbrack , \qquad V_q^-(x)=\frac{1}{Z_q(x)}\mathbb{E}_{y^-\sim q}\!\big\lbrack k(x,y^-)(y^--x)\big\rbrack , \]with normalization factors \(Z_p(x)=\mathbb{E}_{y^+\sim p}\lbrack k(x,y^+)\rbrack\) and \(Z_q(x)=\mathbb{E}_{y^-\sim q}\lbrack k(x,y^-)\rbrack\). In their implementation, the similarity kernel is
\[ k(x,y)=\exp\!\left(-\frac{\Vert x-y\Vert _2}{\tau}\right), \]realized with batchwise softmax normalizations. The update target is then the drifted sample
\[ x_{\mathrm{drift}}=\operatorname{stopgrad}\bigl(x+V_{P_{\text{data}},q_\theta}(x)\bigr), \]and the training loss is \(\mathcal{L}_{\mathrm{drift}}(\theta)=\mathbb{E}\bigl\Vert x-x_{\mathrm{drift}}\bigr\Vert _2^2\). In practice, the expectations are estimated on mini-batches, the generated batch is reused as negative samples, and the same construction is often applied in a learned feature space \(\phi(x)\) rather than raw pixel space.
Unlike flow matching, this objective does not require paired endpoints. It compares generated samples against positive and negative neighborhoods, uses the resulting drift to update the one-step generator during training, and then samples with one evaluation of \(f_\theta\) at test time.
Conceptually, this closes the loop:
Optional Refinement: Polar Factorization
Even if a generator already matches the correct density, its internal geometry need not be transport-optimal.
Polar Factorization Theorem
Under the same regularity assumptions as Brenier’s theorem, any exact generator \(F\) with \(F_\sharp P_{\text{noise}} = P_{\text{data}}\) can be factorized \(P_{\text{noise}}\)-almost everywhere as
\[ F=\nabla\psi \circ M, \]where \(\nabla\psi\) is the Brenier map and \(M\) preserves the source law: \(M_\sharp P_{\text{noise}}=P_{\text{noise}}\).
This is the infinite-dimensional analogue of matrix polar decomposition. Every exact generator can be decomposed into a source-law-preserving latent rearrangement, followed by the optimal transport (Vesseron & Cuturi, 2025).
Same Density, Different Pairing
If \(\widetilde{G}=G\circ M\) and \(M_\sharp P_{\text{noise}}=P_{\text{noise}}\), then
\[ \widetilde{G}_\sharp P_{\text{noise}} = G_\sharp P_{\text{noise}}. \]So a latent rearrangement does not change the modeled density. It only changes the coupling: which latent point is paired with which output sample.
A finite-sample version makes the factorization concrete: the middle column preserves the same empirical source cloud, the right column preserves the same output cloud, and only the pairing cost changes.
Same Output Law, Different Coupling
This is a finite-sample picture of the polar factorization theorem. The middle column applies a latent rearrangement $M$ that preserves the empirical source law, and the right column applies the same Brenier map $T=\nabla\psi$. So the final output cloud stays fixed while only the coupling and its quadratic cost change.
This is exactly the lens used by Morel et al.: start from a trained normalizing flow that already matches \(P_{\text{data}}\), then search over Gaussian-preserving latent rearrangements that lower its quadratic transport cost without changing the final density. The target is not a new density model, but a coupling closer to the Monge map (Morel et al., 2023).
Gaussian-Preserving Rearrangements
For standard normal priors, Morel et al. make this concrete by moving to uniform coordinates, applying a volume-preserving shuffle there, and mapping back.
Gaussian-Preserving Rearrangements via Uniform Coordinates
Let \(\Phi:\mathbb{R}^D \to (0,1)^D\) be the coordinatewise standard normal CDF and let \(\phi:(0,1)^D \to (0,1)^D\) be a smooth volume-preserving diffeomorphism with \(\vert \det J_\phi\vert =1\). Then
\[ M=\Phi^{-1}\circ \phi \circ \Phi \]preserves the standard Gaussian: if \(Z\sim \mathcal{N}(0,I)\), then \(M(Z)\sim \mathcal{N}(0,I)\).
If \(U=\Phi(Z)\), then \(U\) is uniform on \((0,1)^D\). Volume preservation implies \(\phi(U)\) is still uniform, and applying \(\Phi^{-1}\) coordinatewise sends it back to a standard Gaussian. Hence \(M(Z)\sim \mathcal{N}(0,I)\).
Why Divergence-Free ODEs and Euler Regularization Appear
Morel et al. parameterize \(\phi\) as the endpoint of an ODE \(\dot X_t=v_t(X_t)\) on the cube, with \(\nabla\cdot v_t=0\) and tangential boundary conditions. Liouville’s formula then keeps \(\det J_{X_t}\) constant, so each time-\(t\) map preserves volume. Euler regularization does not change which endpoints preserve the density; it selects a lower-energy volume-preserving path to the same endpoint (Morel et al., 2023).
Conclusion
Summary
Through the lens of optimal transport, autoregression and diffusion are two computational strategies for the same geometric task: transporting \(P_{\text{noise}}\) to \(P_{\text{data}}\).
Autoregression makes the transport exact by imposing a triangular, coordinate-ordered structure. Flow and diffusion methods remove that ordering constraint, but then the transport map or dynamics must be learned from a chosen objective, coupling, and path family.
The OT lens is therefore not saying that these methods are literally the same algorithm. It says they are different parameterizations and relaxations of the same source-to-target problem. Text uses autoregression and images use diffusion mostly because of inductive bias and computational convenience; better transport parameterizations and solvers will keep blurring that boundary.
References
- Benamou, J.-D., & Brenier, Y. (2000). A Computational Fluid Mechanics Solution to the Monge-Kantorovich Mass Transfer Problem. Numerische Mathematik, 84(3), 375–393. https://doi.org/10.1007/s002119900117↩︎
- Brenier, Y. (1991). Polar Factorization and Monotone Rearrangement of Vector-valued Functions. Communications on Pure and Applied Mathematics, 44(4), 375–417. https://doi.org/10.1002/cpa.3160440402↩︎
- Deng, M., Li, H., Li, T., Du, Y., & He, K. (2026). Generative Modeling via Drifting (Issue arXiv:2602.04770). arXiv. https://doi.org/10.48550/arXiv.2602.04770↩︎
- Diffusion Is Spectral Autoregression. (2024). In Sander Dieleman. https://sander.ai/2024/09/02/spectral-autoregression.html↩︎
- Falck, F. (2025). Diffusion Is Not Necessarily Spectral Autoregression. https://fabianfalck.com/posts/spectralauto↩︎
- Freulon, P., Georgakis, N., & Panaretos, V. (2026). Entropic Optimal Transport beyond Product Reference Couplings: The Gaussian Case on Euclidean Space (Issue arXiv:2507.01709). arXiv. https://doi.org/10.48550/arXiv.2507.0170912
- Guillen, N., & McCann, R. (2010). Five Lectures on Optimal Transportation: Geometry, Regularity and Applications (Issue arXiv:1011.2911). arXiv. https://doi.org/10.48550/arXiv.1011.2911
- Inverse Transform Sampling. (2025). Wikipedia. https://en.wikipedia.org/w/index.php?title=Inverse_transform_sampling&oldid=1330078866↩︎
- Kitagawa, J., & Takatsu, A. (2024). Sliced Optimal Transport: Is It a Suitable Replacement? (Issue arXiv:2311.15874). arXiv. https://doi.org/10.48550/arXiv.2311.15874↩︎
- Kornilov, N., Mokrov, P., Gasnikov, A., & Korotin, A. (2024). Optimal Flow Matching: Learning Straight Trajectories in Just One Step (Issue arXiv:2403.13117). arXiv. https://doi.org/10.48550/arXiv.2403.1311712
- Lai, C.-H., Song, Y., Kim, D., Mitsufuji, Y., & Ermon, S. (2025). The Principles of Diffusion Models (Issue arXiv:2510.21890). arXiv. https://doi.org/10.48550/arXiv.2510.21890↩︎
- Léonard, C. (2013). A Survey of the Schrödinger Problem and Some of Its Connections with Optimal Transport (Issue arXiv:1308.0215). arXiv. https://doi.org/10.48550/arXiv.1308.0215↩︎
- Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling (Issue arXiv:2210.02747). arXiv. https://doi.org/10.48550/arXiv.2210.0274712
- Liu, X., Gong, C., & Liu, Q. (2022). Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow (Issue arXiv:2209.03003). arXiv. https://doi.org/10.48550/arXiv.2209.0300312
- McNulty, Z. Kantorovich Duality and the Wasserstein Space.
- Montesuma, E. F., Mboula, F. N., & Souloumiac, A. (2024). Recent Advances in Optimal Transport for Machine Learning (Issue arXiv:2306.16156). arXiv. https://doi.org/10.48550/arXiv.2306.16156↩︎
- Morel, G., Drumetz, L., Benaïchouche, S., Courty, N., & Rousseau, F. (2023). Turning Normalizing Flows into Monge Maps with Geodesic Gaussian Preserving Flows (Issue arXiv:2209.10873). arXiv. https://doi.org/10.48550/arXiv.2209.1087312
- Normal Distribution. (2026). Wikipedia. https://en.wikipedia.org/w/index.php?title=Normal_distribution&oldid=1349034873#Maximum_entropy↩︎
- Optimal Transport Wiki – OT WIKI. Retrieved April 21, 2026, from https://www.otwiki.xyz/
- Pass, B., & Alberta, U. An Introduction to Optimal Transport. Matching Theory.
- Peyré, G. (2025). Optimal Transport for Machine Learners (Issue arXiv:2505.06589). arXiv. https://doi.org/10.48550/arXiv.2505.0658912
- The Principles of Diffusion Models. Retrieved April 19, 2026, from https://the-principles-of-diffusion-models.github.io/
- Quadrangle Inequality Properties. In Codeforces. Retrieved April 20, 2026, from https://codeforces.com/blog/entry/86306
- Rockafellar, R. (1966). Characterization of the Subdifferentials of Convex Functions. Pacific Journal of Mathematics, 17(3), 497–510. https://doi.org/10.2140/pjm.1966.17.497↩︎
- Santambrogio, F. (2014). Introduction to Optimal Transport Theory. In Y. Ollivier, H. Pajot, & C. Villani (Eds.), Optimal Transport (1st ed., pp. 3–21). Cambridge University Press. https://doi.org/10.1017/CBO9781107297296.002
- Santambrogio, F. Optimal Transport for Applied Mathematicians – Calculus of Variations, PDEs and Modeling.
- Tang, S. (2026). Foundations of Schrödinger Bridges for Generative Modeling (Issue arXiv:2603.18992). arXiv. https://doi.org/10.48550/arXiv.2603.18992
- Thickstun, J. Kantorovich-Rubinstein Duality.↩︎
- Thorpe, M. (2018). Introduction to Optimal Transport.↩︎
- Tian, K., Jiang, Y., Yuan, Z., Peng, B., & Wang, L. (2024). Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (Issue arXiv:2404.02905). arXiv. https://doi.org/10.48550/arXiv.2404.02905↩︎
- Tong, A. Y., Malkin, N., Fatras, K., Atanackovic, L., Zhang, Y., Huguet, G., Wolf, G., & Bengio, Y. (2024). Simulation-Free Schrödinger Bridges via Score and Flow Matching. Proceedings of the 27th International Conference on Artificial Intelligence and Statistics, 238, 1279–1287. https://proceedings.mlr.press/v238/tong24a.html12
- Vesseron, N., & Cuturi, M. (2025). On a Neural Implementation of Brenier’s Polar Factorization (Issue arXiv:2403.03071). arXiv. https://doi.org/10.48550/arXiv.2403.03071↩︎
- Villani, C. (2009). Optimal Transport: Old and New (Issue 338). Springer.
- Villani, C. (2016). Topics in Optimal Transportation (Reprinted with corrections, Issue 58). American Mathematical Society.
- Yu, H., Luo, H., Yuan, H., Rong, Y., Huang, J., & Zhao, F. (2026). Frequency Autoregressive Image Generation with Continuous Tokens (Issue arXiv:2503.05305). arXiv. https://doi.org/10.48550/arXiv.2503.05305↩︎