
Rational Matrix Iterations for Polar Decomposition

Newton-Schulz is the classic hardware-aware iteration for the polar decomposition, but it struggles with ill-conditioning and low precision. We derive a comprehensive, robust alternative built on rational iterations (DWH) and specific stabilization primitives, ensuring convergence without eigendecompositions or full restarts.


Overview

The Muon optimizer (Jordan, 2024) has demonstrated that projecting update directions onto the Stiefel manifold—via the matrix polar decomposition—is a powerful tool for training Transformers and other scale-sensitive architectures.

While the Newton-Schulz iteration is the standard hardware-aware choice, it is a polynomial map that can be slow or unstable for ill-conditioned matrices. This has led to the development of methods like Polar Express (Amsel et al., 2025), which optimize the polynomial coefficients for faster convergence.

In this post, we work through a different path: rational iterations. We present a “clean” end-to-end design that hits several high-bar goals: exactly 2 rectangular GEMMs, no power iterations for the spectral bounds \(\lambda_{\max}\) and \(\lambda_{\min}\), and stability that handles the spurious negative eigenvalues common in low-precision (FP16/TF32) arithmetic.


1. The Rational Path: Dynamic Weighted Halley

Iterative methods for the polar factor \(U = \operatorname{polar}(A)\) work by applying a function \(f(x)\) to the singular values \(\sigma_i\) of \(A\). While Newton-Schulz is polynomial, the Dynamic Weighted Halley (DWH) iteration (Nakatsukasa et al., 2010) uses a rational function:

\[ f(x) = x \frac{a + b x^2}{1 + c x^2} \]

By picking \(a, b, c\) dynamically based on an estimate of the lower bound of the spectrum, DWH achieves global convergence and extremely high throughput.
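To build intuition for why the rational map converges so quickly, it helps to look at the scalar picture. The sketch below uses the \(\ell = 1\) case of DWH, which reduces to the classical Halley map with \(a = 3,\ b = 1,\ c = 3\), and shows a badly scaled singular value being driven to 1 in a handful of cubically convergent steps:

```python
# Scalar view of the rational map: with l = 1 the DWH coefficients
# reduce to the classical Halley map a = 3, b = 1, c = 3.
def halley(x):
    return x * (3 + x * x) / (1 + 3 * x * x)

x = 0.05           # a badly scaled singular value
for _ in range(6):
    x = halley(x)  # cubic convergence toward 1
# after 6 steps, x agrees with 1 to well below 1e-9
```

The error obeys \(1 - f(x) = (1-x)^3/(1+3x^2)\), which is the cubic contraction that DWH then accelerates further by re-optimizing \(a, b, c\) at every step.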

1.1 Offline Coefficient Computation

Given a known lower bound estimate \(\ell \in (0, 1]\), the optimal DWH coefficients are computed once (in FP64) as follows:

\[ \gamma(\ell) = \left(\frac{4(1-\ell^2)}{\ell^4}\right)^{1/3},\qquad r = \sqrt{1+\gamma}, \]
\[ a(\ell) = r + \frac{1}{2}\sqrt{8 - 4\gamma + \frac{8(2-\ell^2)}{\ell^2 r}}, \]
\[ b(\ell) = \frac{(a-1)^2}{4},\qquad c(\ell)=a+b-1. \]

We then define the “apply-friendly” parameters used in the iteration step:

\[ \alpha = \frac{b}{c},\qquad \beta = a-\alpha. \]
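The closed-form recipe above transcribes directly into code. Below is a minimal FP64 sketch (the function name `dwh_coefficients` is ours, not from a library), with a sanity check that \(\ell = 1\) recovers the classical Halley coefficients:

```python
import math

def dwh_coefficients(l):
    """DWH coefficients for a spectrum lower bound l in (0, 1],
    evaluated once in FP64. Function name is our own."""
    g = (4.0 * (1.0 - l * l) / l**4) ** (1.0 / 3.0)   # gamma(l)
    r = math.sqrt(1.0 + g)
    a = r + 0.5 * math.sqrt(8.0 - 4.0 * g + 8.0 * (2.0 - l * l) / (l * l * r))
    b = (a - 1.0) ** 2 / 4.0
    c = a + b - 1.0
    alpha = b / c          # apply-friendly split
    beta = a - alpha
    return a, b, c, alpha, beta

# Sanity check: l = 1 must recover the classical Halley map (3, 1, 3).
a, b, c, alpha, beta = dwh_coefficients(1.0)
```

The split is convenient because \(f(x) = \alpha x + \beta x / (1 + c x^2)\): substituting \(\alpha = b/c\) and \(\beta = a - b/c\) recovers the numerator \(a + b x^2\) exactly, so the iteration only needs one resolvent-style solve per step.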

2. Stability Primitives for Low Precision

In low precision (FP16 or TF32), the Gram matrix \(G = X^\top X\) can suffer from “drift” that creates spurious negative eigenvalues. Standard methods like Gram-NS need restarts or heavy regularization to handle this (Dao AI Lab, 2026).

2.1 Symmetrization and Jacobi Scaling

We work in the “column-space view” \(X \in \mathbb{R}^{M \times N}, M \ge N\). For \(A \in \mathbb{R}^{m \times n}\), we set \(X = A\) if \(m \ge n\) and \(X = A^\top\) otherwise.

  1. Gram and symmetrize: form \(G = X^\top X\), then \(G \leftarrow \tfrac12(G+G^\top)\).
  2. Jacobi preconditioning: \(D = \operatorname{diag}\!\left(\operatorname{rsqrt}(\max(G_{ii}, d_{\min}))\right)\).
  3. Jacobi Gram: \(\tilde{A} \leftarrow DGD\).
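The three steps above can be sketched in a few lines of NumPy (the helper name `jacobi_gram` and the `d_min` default are our choices, not a reference implementation):

```python
import numpy as np

def jacobi_gram(X, d_min=1e-30):
    """Symmetrized Gram matrix with Jacobi (diagonal) preconditioning,
    following steps 1-3 above. Names and defaults are our own."""
    G = X.T @ X
    G = 0.5 * (G + G.T)                               # 1. symmetrize
    d = 1.0 / np.sqrt(np.maximum(np.diag(G), d_min))  # 2. D = rsqrt(max(G_ii, d_min))
    A_tilde = d[:, None] * G * d[None, :]             # 3. A~ = D G D
    return 0.5 * (A_tilde + A_tilde.T), d

X = np.random.default_rng(0).standard_normal((8, 4))  # column-space view, M >= N
A_tilde, d = jacobi_gram(X)
# every diagonal entry of A~ is now (numerically) 1
```

Because \(D\) is diagonal, steps 2–3 cost only \(O(N^2)\): the only GEMM in this phase is the \(X^\top X\) product itself.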

2.2 Moment-Based Upper Bounds

We avoid expensive power iterations for \(\lambda_{\max}\) by using a moment-based upper bound \(u = \min(u_M, u_F)\) computed from \(\tilde{A}\):

\[ u_M = \frac{\operatorname{tr}(\tilde{A}) + \sqrt{(N-1)\max(0, N\Vert \tilde{A}\Vert _F^2 - \operatorname{tr}(\tilde{A})^2)}}{N}, \quad u_F = \Vert \tilde{A}\Vert _F. \]

Finally, inflate the bound, \(u \leftarrow u + \eta \frac{\Vert X\Vert _F^2}{s_x^2}\), as a safety margin against matmul rounding error.
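The bound itself needs only a trace and a Frobenius norm. A NumPy sketch (the helper name `moment_upper_bound` is hypothetical), together with a check that it really dominates \(\lambda_{\max}\) on a symmetric test matrix:

```python
import numpy as np

def moment_upper_bound(A):
    """u = min(u_M, u_F): an upper bound on lambda_max of symmetric A
    from its trace and Frobenius norm alone -- no power iteration."""
    N = A.shape[0]
    t = float(np.trace(A))
    f2 = float(np.sum(A * A))   # ||A||_F^2
    u_M = (t + np.sqrt((N - 1) * max(0.0, N * f2 - t * t))) / N
    u_F = np.sqrt(f2)
    return min(u_M, u_F)

# The bound always dominates the true largest eigenvalue.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
A = A @ A.T                     # symmetric PSD test matrix
u = moment_upper_bound(A)
lam_max = np.linalg.eigvalsh(A)[-1]
```

Note that \(N \Vert \tilde{A}\Vert_F^2 \ge \operatorname{tr}(\tilde{A})^2\) holds exactly by Cauchy–Schwarz on the eigenvalues, so the \(\max(0, \cdot)\) clamp only guards against rounding in low precision.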


3. The End-to-End Algorithm (Explicit Version)

The following procedure, Rational Polar Decomposition, explicitly integrates the adaptive bound logic and the denominator-stabilized Cholesky factorization required for low-precision stability.

Algorithm 1 Rational Polar Decomposition

 1: procedure Sym(\(A\))
 2:     return \(\frac{1}{2}(A + A^\top)\)
 3: end procedure
 4: procedure FactorizeSafe(\(A\))
 5:     \(S \leftarrow A,\quad \tau \leftarrow 0\)
 6:     while true do
 7:         try \(L \leftarrow \text{Cholesky}(S + \tau I)\)
 8:         if \(L\) exists then
 9:             return \((L, \tau)\)
10:         else
11:             \(\tau \leftarrow \max(10^{-6} \cdot \operatorname{tr}(S)/N,\; 10\tau)\)
12:         end if
13:     end while
14: end procedure
15: procedure RationalPolar(\(X \in \mathbb{R}^{M \times N}\))
16:     \(X \leftarrow X / 2^{\lfloor \log_2(\max_{ij} \vert X_{ij}\vert ) \rceil}\)  // GEMM-safe power-of-two scaling
17:     \(G \leftarrow \text{Sym}(X^\top X)\)
18:     \(D \leftarrow \operatorname{diag}(\max(\operatorname{diag}(G), 10^{-30}))^{-1/2}\)
19:     \(B \leftarrow \text{Sym}(D G D),\quad \operatorname{diag}(B) \leftarrow 1\)
20:     \(\sigma \leftarrow \sqrt{\text{MomentBound}(B)} \cdot \text{safety} + \epsilon\)
21:     \(X \leftarrow X/\sigma,\quad B \leftarrow B/\sigma^2\)
22:     repeat
23:         \((\{\alpha_k, \beta_k, c_k\}_{k=0,1},\ \delta) \leftarrow \text{ComputeCoefficients}(\ell_0)\)
24:         \((L, \tau) \leftarrow \text{FactorizeSafe}(B + \frac{1}{c_0} I)\)
25:         \(\rho \leftarrow \tau / (\operatorname{tr}(B)/N)\)
26:         if \(\rho > 10^{-3}\) then
27:             \(\ell_0 \leftarrow \min(10\ell_0,\, 0.1)\)  // restart with a higher spectral floor
28:         else
29:             break
30:         end if
31:     until false
32:     \(H_0 \leftarrow \frac{1}{c_0} \text{inv}(L L^\top),\quad K_0 \leftarrow I - H_0,\quad M_0 \leftarrow \alpha_0 I + \beta_0 H_0\)
33:     \(B_1 \leftarrow \text{Sym}(I + \delta(c_0\alpha_0^2 B + 2\alpha_0\beta_0 K_0 + \beta_0^2 H_0 K_0))\)
34:     \((L_1, \_) \leftarrow \text{FactorizeSafe}(B_1)\)
35:     \(T \leftarrow M_0 \cdot \text{inv}(L_1 L_1^\top),\quad K_{\text{final}} \leftarrow \alpha_1 M_0 + \beta_1 T\)
36:     return \(Q = (X D)\, K_{\text{final}}\, D\)
37: end procedure
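The FactorizeSafe procedure maps almost line-for-line onto NumPy. Below is a sketch under the assumption that a failed factorization surfaces as `np.linalg.LinAlgError` (the function name is our own):

```python
import numpy as np

def factorize_safe(S):
    """FactorizeSafe from Algorithm 1: Cholesky with adaptive jitter.
    Assumes a failed factorization raises np.linalg.LinAlgError."""
    N = S.shape[0]
    tau = 0.0
    while True:
        try:
            L = np.linalg.cholesky(S + tau * np.eye(N))
            return L, tau
        except np.linalg.LinAlgError:
            # grow the trace-relative shift until S + tau*I factors
            tau = max(1e-6 * np.trace(S) / N, 10.0 * tau)

# A Gram matrix with a spurious negative eigenvalue still factors:
S = np.diag([1.0, 1.0, -1e-4])
L, tau = factorize_safe(S)
```

Starting \(\tau\) at zero means well-conditioned inputs pay no penalty: the shift is only introduced, and only grown geometrically, when the factorization actually fails.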

4. Why This is “Hardware-Aware SOTA”

4.1 Efficiency

  • Two GEMMs: Exactly one \(X^\top X\) and one \(X K_{\text{final}}\). All other operations are \(N \times N\) or diagonal.
  • No Eigenvalues: Power iteration is replaced by moment scaling (\(\lambda_{\max}\)) and adaptive denominator tracking (\(\lambda_{\min}\)).
  • Minimized Latency: The “small side” \(N\times N\) work is dominated by Cholesky factorizations and triangular solves (TRSM), which are fast on tensor-core hardware.

4.2 Stability

  • Bounded Coefficients: We never form huge matrices like \(I + 10^3 G\). Instead, we factor \(G + \epsilon I\), where \(\epsilon\) is the natural floor computed from DWH.
  • Robust Denominators: Minimal jitter ensures we never hit singular matrices without needing to restart the whole optimization step.
  • No Bias: We only shift denominators, preserving the update direction in the Gram matrix.

Conclusion

Rational iterations like DWH provide a more robust foundation for the polar decomposition in deep learning optimizers than polynomial Newton-Schulz. By combining them with hardware-specific preconditioning and moment-based scaling, we can build optimizers that are both faster and more stable in the low-precision regimes required by modern hardware.


References

  1. Nakatsukasa, Y., Bai, Z., & Gygi, F. (2010). Optimizing Halley’s Iteration for Computing the Matrix Polar Decomposition. SIAM Journal on Matrix Analysis and Applications, 31(5), 2700–2720. https://doi.org/10.1137/090774999
  2. Nakatsukasa, Y., & Higham, N. J. (2013). Stable and Efficient QDWH Algorithm for Computing the SVD. SIAM Journal on Scientific Computing, 35(3), A1325–A1349. https://doi.org/10.1137/120876605
  3. Dao AI Lab. (2026). A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon. https://dao-lab.ai/blog/2026/gram-newton-schulz/
  4. Jordan, K. (2024). Muon: An Optimizer for Hidden Layers in Neural Networks. https://kellerjordan.github.io/posts/muon/
  5. Amsel, N., Persson, D., Musco, C., & Gower, R. M. (2025). The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm. arXiv:2505.16932. https://arxiv.org/abs/2505.16932
This post is licensed under CC BY 4.0 by the author.