
Rational Matrix Iterations for Polar Decomposition

Newton-Schulz is the classic hardware-aware iteration for the polar decomposition, but it struggles with ill-conditioning and low precision. We derive a comprehensive, robust alternative built on rational iterations (DWH) and specific stabilization primitives, ensuring convergence without eigendecompositions or full restarts.


Overview

The Muon optimizer (Jordan, 2024) has demonstrated that projecting update directions onto the Stiefel manifold—via the matrix polar decomposition—is a powerful tool for training Transformers and other scale-sensitive architectures.

While the Newton-Schulz iteration is the standard hardware-aware choice, it is a polynomial map that can be slow or unstable for ill-conditioned matrices. This has led to the development of methods like Polar Express (Amsel et al., 2025), which optimize the polynomial coefficients for faster convergence.

In this post, we work through a different path: rational iterations. We present a “clean” end-to-end design that hits several high-bar goals: exactly 2 rectangular GEMMs, no power iterations for the spectral bounds \(\lambda_{\max}\) and \(\lambda_{\min}\), and stability that handles the spurious negative eigenvalues common in low-precision (FP16/TF32) arithmetic.


1. The Rational Path: Dynamic Weighted Halley

Iterative methods for the polar factor \(U = \operatorname{polar}(A)\) work by applying a function \(f(x)\) to the singular values \(\sigma_i\) of \(A\). While Newton-Schulz is polynomial, the Dynamic Weighted Halley (DWH) iteration (Nakatsukasa et al., 2010) uses a rational function:

\[ f(x) = x \frac{a + b x^2}{1 + c x^2} \]

By picking \(a, b, c\) dynamically based on an estimate of the lower bound of the spectrum, DWH achieves global convergence and extremely high throughput.
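To build intuition for why the rational map converges so quickly, it helps to look at the scalar picture. The sketch below uses the \(\ell = 1\) case of DWH, which reduces to the classical Halley map with \(a = 3,\ b = 1,\ c = 3\), and shows a badly scaled singular value being driven to 1 in a handful of cubically convergent steps:

```python
# Scalar view of the rational map: with l = 1 the DWH coefficients
# reduce to the classical Halley map a = 3, b = 1, c = 3.
def halley(x):
    return x * (3 + x * x) / (1 + 3 * x * x)

x = 0.05           # a badly scaled singular value
for _ in range(6):
    x = halley(x)  # cubic convergence toward 1
# after 6 steps, x agrees with 1 to well below 1e-9
```

The error obeys \(1 - f(x) = (1-x)^3/(1+3x^2)\), which is the cubic contraction that DWH then accelerates further by re-optimizing \(a, b, c\) at every step.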

1.1 Offline Coefficient Computation

Given a known lower bound estimate \(\ell \in (0, 1]\), the optimal DWH coefficients are computed once (in FP64) as follows:

\[ \gamma(\ell) = \left(\frac{4(1-\ell^2)}{\ell^4}\right)^{1/3},\qquad r = \sqrt{1+\gamma}, \]
\[ a(\ell) = r + \frac{1}{2}\sqrt{8 - 4\gamma + \frac{8(2-\ell^2)}{\ell^2 r}}, \]
\[ b(\ell) = \frac{(a-1)^2}{4},\qquad c(\ell)=a+b-1. \]

We then define the “apply-friendly” parameters used in the iteration step:

\[ \alpha = \frac{b}{c},\qquad \beta = a-\alpha. \]
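The closed-form recipe above transcribes directly into code. Below is a minimal FP64 sketch (the function name `dwh_coefficients` is ours, not from a library), with a sanity check that \(\ell = 1\) recovers the classical Halley coefficients:

```python
import math

def dwh_coefficients(l):
    """DWH coefficients for a spectrum lower bound l in (0, 1],
    evaluated once in FP64. Function name is our own."""
    g = (4.0 * (1.0 - l * l) / l**4) ** (1.0 / 3.0)   # gamma(l)
    r = math.sqrt(1.0 + g)
    a = r + 0.5 * math.sqrt(8.0 - 4.0 * g + 8.0 * (2.0 - l * l) / (l * l * r))
    b = (a - 1.0) ** 2 / 4.0
    c = a + b - 1.0
    alpha = b / c          # apply-friendly split
    beta = a - alpha
    return a, b, c, alpha, beta

# Sanity check: l = 1 must recover the classical Halley map (3, 1, 3).
a, b, c, alpha, beta = dwh_coefficients(1.0)
```

The split is convenient because \(f(x) = \alpha x + \beta x / (1 + c x^2)\): substituting \(\alpha = b/c\) and \(\beta = a - b/c\) recovers the numerator \(a + b x^2\) exactly, so the iteration only needs one resolvent-style solve per step.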

2. Stability Primitives for Low Precision

In low precision (FP16 or TF32), the Gram matrix \(G = X^\top X\) can suffer from “drift” that creates spurious negative eigenvalues. Standard methods like Gram-NS need restarts or heavy regularization to handle this (Dao AI Lab, 2026).

2.1 Symmetrization and Jacobi Scaling

We work in the “column-space view” \(X \in \mathbb{R}^{M \times N}, M \ge N\). For \(A \in \mathbb{R}^{m \times n}\), we set \(X = A\) if \(m \ge n\) and \(X = A^\top\) otherwise.

  1. Gram and symmetrize: form \(G = X^\top X\), then \(G \leftarrow \tfrac12(G+G^\top)\).
  2. Jacobi preconditioning: \(D = \operatorname{diag}\!\left(\operatorname{rsqrt}(\max(G_{ii}, d_{\min}))\right)\).
  3. Jacobi Gram: \(\tilde{A} \leftarrow DGD\).
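The three steps above can be sketched in a few lines of NumPy (the helper name `jacobi_gram` and the `d_min` default are our choices, not a reference implementation):

```python
import numpy as np

def jacobi_gram(X, d_min=1e-30):
    """Symmetrized Gram matrix with Jacobi (diagonal) preconditioning,
    following steps 1-3 above. Names and defaults are our own."""
    G = X.T @ X
    G = 0.5 * (G + G.T)                               # 1. symmetrize
    d = 1.0 / np.sqrt(np.maximum(np.diag(G), d_min))  # 2. D = rsqrt(max(G_ii, d_min))
    A_tilde = d[:, None] * G * d[None, :]             # 3. A~ = D G D
    return 0.5 * (A_tilde + A_tilde.T), d

X = np.random.default_rng(0).standard_normal((8, 4))  # column-space view, M >= N
A_tilde, d = jacobi_gram(X)
# every diagonal entry of A~ is now (numerically) 1
```

Because \(D\) is diagonal, steps 2–3 cost only \(O(N^2)\): the only GEMM in this phase is the \(X^\top X\) product itself.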

2.2 Moment-Based Upper Bounds

We avoid expensive power iterations for \(\lambda_{\max}\) by using a moment-based upper bound \(u = \min(u_M, u_F)\) computed from \(\tilde{A}\):

\[ u_M = \frac{\operatorname{tr}(\tilde{A}) + \sqrt{(N-1)\max(0, N\Vert \tilde{A}\Vert _F^2 - \operatorname{tr}(\tilde{A})^2)}}{N}, \quad u_F = \Vert \tilde{A}\Vert _F. \]

Finally, inflate the bound, \(u \leftarrow u + \eta \frac{\Vert X\Vert _F^2}{s_x^2}\), as a safety margin against matmul rounding error.
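The bound itself needs only a trace and a Frobenius norm. A NumPy sketch (the helper name `moment_upper_bound` is hypothetical), together with a check that it really dominates \(\lambda_{\max}\) on a symmetric test matrix:

```python
import numpy as np

def moment_upper_bound(A):
    """u = min(u_M, u_F): an upper bound on lambda_max of symmetric A
    from its trace and Frobenius norm alone -- no power iteration."""
    N = A.shape[0]
    t = float(np.trace(A))
    f2 = float(np.sum(A * A))   # ||A||_F^2
    u_M = (t + np.sqrt((N - 1) * max(0.0, N * f2 - t * t))) / N
    u_F = np.sqrt(f2)
    return min(u_M, u_F)

# The bound always dominates the true largest eigenvalue.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
A = A @ A.T                     # symmetric PSD test matrix
u = moment_upper_bound(A)
lam_max = np.linalg.eigvalsh(A)[-1]
```

Note that \(N \Vert \tilde{A}\Vert_F^2 \ge \operatorname{tr}(\tilde{A})^2\) holds exactly by Cauchy–Schwarz on the eigenvalues, so the \(\max(0, \cdot)\) clamp only guards against rounding in low precision.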


3. The End-to-End Algorithm (Explicit Version)

The following procedure, Rational Polar Decomposition, explicitly integrates the adaptive bound logic and the denominator-stabilized Cholesky factorization required for low-precision stability.

Algorithm 1 Rational Polar Decomposition

 1: procedure Sym(\(A\))
 2:     return \(\frac{1}{2}(A + A^\top)\)
 3: end procedure
 4: procedure FactorizeSafe(\(A\))
 5:     \(S \leftarrow A,\quad \tau \leftarrow 0\)
 6:     while true do
 7:         try \(L \leftarrow \text{Cholesky}(S + \tau I)\)
 8:         if \(L\) exists then
 9:             return \((L, \tau)\)
10:         else
11:             \(\tau \leftarrow \max(10^{-6} \cdot \operatorname{tr}(S)/N,\; 10\tau)\)
12:         end if
13:     end while
14: end procedure
15: procedure RationalPolar(\(X \in \mathbb{R}^{M \times N}\))
16:     \(X \leftarrow X / 2^{\lfloor \log_2(\max_{ij} \vert X_{ij}\vert ) \rceil}\)  // GEMM-safe power-of-two scaling
17:     \(G \leftarrow \text{Sym}(X^\top X)\)
18:     \(D \leftarrow \operatorname{diag}(\max(\operatorname{diag}(G), 10^{-30}))^{-1/2}\)
19:     \(B \leftarrow \text{Sym}(D G D),\quad \operatorname{diag}(B) \leftarrow 1\)
20:     \(\sigma \leftarrow \sqrt{\text{MomentBound}(B)} \cdot \text{safety} + \epsilon\)
21:     \(X \leftarrow X/\sigma,\quad B \leftarrow B/\sigma^2\)
22:     repeat
23:         \((\{\alpha_k, \beta_k, c_k\}_{k=0,1},\ \delta) \leftarrow \text{ComputeCoefficients}(\ell_0)\)
24:         \((L, \tau) \leftarrow \text{FactorizeSafe}(B + \frac{1}{c_0} I)\)
25:         \(\rho \leftarrow \tau / (\operatorname{tr}(B)/N)\)
26:         if \(\rho > 10^{-3}\) then
27:             \(\ell_0 \leftarrow \min(10\ell_0,\, 0.1)\)  // restart with a higher spectral floor
28:         else
29:             break
30:         end if
31:     until false
32:     \(H_0 \leftarrow \frac{1}{c_0} \text{inv}(L L^\top),\quad K_0 \leftarrow I - H_0,\quad M_0 \leftarrow \alpha_0 I + \beta_0 H_0\)
33:     \(B_1 \leftarrow \text{Sym}(I + \delta(c_0\alpha_0^2 B + 2\alpha_0\beta_0 K_0 + \beta_0^2 H_0 K_0))\)
34:     \((L_1, \_) \leftarrow \text{FactorizeSafe}(B_1)\)
35:     \(T \leftarrow M_0 \cdot \text{inv}(L_1 L_1^\top),\quad K_{\text{final}} \leftarrow \alpha_1 M_0 + \beta_1 T\)
36:     return \(Q = (X D)\, K_{\text{final}}\, D\)
37: end procedure
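The FactorizeSafe procedure maps almost line-for-line onto NumPy. Below is a sketch under the assumption that a failed factorization surfaces as `np.linalg.LinAlgError` (the function name is our own):

```python
import numpy as np

def factorize_safe(S):
    """FactorizeSafe from Algorithm 1: Cholesky with adaptive jitter.
    Assumes a failed factorization raises np.linalg.LinAlgError."""
    N = S.shape[0]
    tau = 0.0
    while True:
        try:
            L = np.linalg.cholesky(S + tau * np.eye(N))
            return L, tau
        except np.linalg.LinAlgError:
            # grow the trace-relative shift until S + tau*I factors
            tau = max(1e-6 * np.trace(S) / N, 10.0 * tau)

# A Gram matrix with a spurious negative eigenvalue still factors:
S = np.diag([1.0, 1.0, -1e-4])
L, tau = factorize_safe(S)
```

Starting \(\tau\) at zero means well-conditioned inputs pay no penalty: the shift is only introduced, and only grown geometrically, when the factorization actually fails.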

4. Why This is “Hardware-Aware SOTA”

4.1 Efficiency

  • Two GEMMs: Exactly one \(X^\top X\) and one \(X K_{\text{final}}\). All other operations are \(N \times N\) or diagonal.
  • No Eigenvalues: Power iteration is replaced by moment scaling (\(\lambda_{\max}\)) and adaptive denominator tracking (\(\lambda_{\min}\)).
  • Minimized Latency: The “small side” \(N\times N\) work is dominated by Cholesky factorizations and triangular solves (TRSM), which are fast on tensor-core hardware.

4.2 Stability

  • Bounded Coefficients: We never form huge matrices like \(I + 10^3 G\). Instead, we factor \(G + \epsilon I\), where \(\epsilon\) is the natural floor computed from DWH.
  • Robust Denominators: Minimal jitter ensures we never hit singular matrices without needing to restart the whole optimization step.
  • No Bias: We only shift denominators, preserving the update direction in the Gram matrix.

Conclusion

Rational iterations like DWH provide a more robust foundation for the polar decomposition in deep learning optimizers than polynomial Newton-Schulz. By combining them with hardware-specific preconditioning and moment-based scaling, we can build optimizers that are both faster and more stable in the low-precision regimes required by modern hardware.


References

  1. Nakatsukasa, Y., Bai, Z., & Gygi, F. (2010). Optimizing Halley’s Iteration for Computing the Matrix Polar Decomposition. SIAM Journal on Matrix Analysis and Applications, 31(5), 2700–2720. https://doi.org/10.1137/090774999
  2. Nakatsukasa, Y., & Higham, N. J. (2013). Stable and Efficient QDWH Algorithm for Computing the SVD. SIAM Journal on Scientific Computing, 35(3), A1325–A1349. https://doi.org/10.1137/120876605
  3. Dao AI Lab. (2026). A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon. https://dao-lab.ai/blog/2026/gram-newton-schulz/
  4. Jordan, K. (2024). Muon: An Optimizer for Hidden Layers in Neural Networks. https://kellerjordan.github.io/posts/muon/
  5. Amsel, N., Persson, D., Musco, C., & Gower, R. M. (2025). The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm. arXiv:2505.16932. https://arxiv.org/abs/2505.16932
This post is licensed under CC BY 4.0 by the author.