Matrix Inverse Square Root (SPD) with Fixed-Budget GEMM Kernels

A practical guide to matrix inverse square root iterations for SPD matrices, focusing on GPU-friendly GEMM kernels and fixed-budget convergence.

Link to PyTorch implementation on GitHub: https://github.com/JiHa-Kim/fast-matrix-invsqrt

1. Background and motivation

Many ML systems repeatedly need to “whiten” or precondition vectors using a covariance-like SPD matrix \(A \succ 0\) (Amari, 1998) (Davidon, 1959) (Anil et al., 2021) (rohan anil [@_arohan_], 2024). A canonical operator is the inverse square root \(A^{-1/2}\), defined as the unique SPD matrix \(X\) such that

\[ X A X = I. \]

If we can apply (or approximate) \(A^{-1/2}\) quickly, we can stabilize optimization or normalization while avoiding expensive factorizations.

In deep learning settings, two practical constraints dominate:

  1. Throughput: kernels should be dominated by GEMMs (matrix-matrix multiplies), which are highly optimized on GPUs.
  2. Low precision: bf16/fp16 arithmetic is common, so we want iterations that remain stable and deliver “good enough” accuracy in a small, fixed number of steps.

This repo is built around that viewpoint: implement a small family of matmul-only inverse-square-root iterations, normalize inputs to make them behave well under fixed budgets, and decide winners empirically using a benchmark harness.

Recent work on GPU-friendly polar/sign methods in ML makes the same point: high accuracy is often unnecessary, while GEMM-only iterations and fast early progress matter (Anil et al., 2021) (Amsel et al., 2025) (Boissin et al., 2025).


2. Problem statement and quality metrics

Given SPD \(A \in \mathbb{R}^{n \times n}\), we compute an approximation \(X \approx A^{-1/2}\). The defining identity \(XAX=I\) suggests the residual

\[ R := I - X A X. \]

The harness reports a normalized Frobenius residual

\[ \mathrm{resid}_{\mathrm{fro}} := \frac{\|R\|_F}{\sqrt{n}}, \]

and also includes symmetry diagnostics for \(X\) and \(W := XAX\), plus optional spectral and apply-to-vector proxies, making these metrics suitable for fixed-budget benchmarking.
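
For concreteness, here is a minimal PyTorch sketch of these metrics; the function names are illustrative and not the harness’s API.

```python
import torch

def resid_fro(A: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """Normalized Frobenius residual ||I - X A X||_F / sqrt(n)."""
    n = A.shape[-1]
    I = torch.eye(n, dtype=A.dtype, device=A.device)
    R = I - X @ A @ X
    return torch.linalg.norm(R, ord="fro") / n ** 0.5

def symmetry_gap(M: torch.Tensor) -> torch.Tensor:
    """Relative asymmetry ||M - M^T||_F / ||M||_F, a cheap hygiene diagnostic."""
    return torch.linalg.norm(M - M.mT, ord="fro") / torch.linalg.norm(M, ord="fro")
```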


3. Newton-Schulz inverse square root: derivation and coupled form

3.1 Scalar Newton-Schulz update

The classical Newton-Schulz inverse-square-root update for scalar \(a > 0\) is given by:

\[ x_{+} = x\left(\frac{3}{2} - \frac{1}{2} a x^2\right). \]

3.2 Matrix form and the commuting-polynomial regime

A direct matrix analogue is

\[ X_{k+1} = X_k\left(\frac{3}{2}I - \frac{1}{2} A X_k^2\right). \]

If we start with \(X_0 = \alpha I\), then each \(X_k\) is a matrix polynomial in \(A\) and therefore commutes with \(A\). In that regime, the iteration acts eigenvalue-wise like the scalar update. This “polynomial in the input matrix” structure is central in classic Newton-type analyses for related matrix functions (notably the polar decomposition).

3.3 The coupled GEMM-only update used in this repo (NS3/NS4)

To avoid explicitly forming \(A X_k^2\), the implementation tracks an auxiliary matrix \(Y_k\) and uses a coupled update that remains GEMM-dominated:

\[ B_k = 1.5I - 0.5Y_k, \qquad X_{k+1} = X_k B_k, \qquad Y_{k+1} = B_k Y_k B_k, \]

initialized with \(X_0 = I\) and \(Y_0 = A_{\mathrm{norm}}\). This is implemented directly in `inverse_sqrt_ns`, including optional symmetrization of \(Y\) for numerical hygiene.

A key throughput trick is the terminal last step: on the final iteration, the code skips the \(Y\) update and returns only \(X\).

The benchmark harness always enables this in its method calls.
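
A minimal PyTorch sketch of this coupled loop (illustrative only, not the repo’s `inverse_sqrt_ns`; it assumes the input has already been normalized so its spectrum lies inside \((0,2)\)):

```python
import torch

def ns_invsqrt(A_norm: torch.Tensor, steps: int = 3, symmetrize: bool = True) -> torch.Tensor:
    """Coupled Newton-Schulz iteration for A^{-1/2} using only GEMMs (sketch)."""
    n = A_norm.shape[-1]
    I = torch.eye(n, dtype=A_norm.dtype, device=A_norm.device)
    X = I.clone()          # X_0 = I
    Y = A_norm.clone()     # Y_0 = A_norm
    for k in range(steps):
        B = 1.5 * I - 0.5 * Y
        X = X @ B
        if k == steps - 1:     # terminal last step: skip the final Y update
            break
        Y = B @ Y @ B
        if symmetrize:         # optional numerical hygiene on Y
            Y = 0.5 * (Y + Y.mT)
    return X
```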

3.4 Why initialization (spectral placement) matters

In the commuting regime, eigenvalue evolution follows:

\[ y_{k+1} = \phi(y_k), \qquad \phi(y) := y\left(\frac{3-y}{2}\right)^2 = \frac{y(3-y)^2}{4}. \]

Define \(e := 1-y\). Expanding around \(e=0\) gives

\[ e_{k+1} = \frac{3}{4}e_k^2 + \frac{1}{4}e_k^3, \]

so once eigenvalues are near 1, the residual shrinks roughly quadratically.

However, with a fixed small iteration budget (2-4 steps), performance depends heavily on where the initial spectrum lies. This is why the repo places strong emphasis on preconditioning/normalization.
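
A standalone scalar illustration of this effect (not repo code): iterate \(\phi\) for a fixed 3-step budget from different starting eigenvalues.

```python
def phi(y: float) -> float:
    """Scalar NS3 eigenvalue map: y -> y * ((3 - y) / 2)**2."""
    return y * (3.0 - y) ** 2 / 4.0

for y0 in (1e-3, 0.05, 0.5):
    y = y0
    for _ in range(3):   # fixed 3-step budget
        y = phi(y)
    print(f"y0 = {y0:7.3f} -> after 3 steps: {y:.4f}")
# Eigenvalues starting near zero barely move in 3 steps, while eigenvalues at or
# above a modest floor (e.g. 0.05) make far more progress toward 1.
```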


4. Convergence region: a sufficient condition and an NS3-specific view

4.1 A common sufficient condition: \(\|Y_0 - I\| < 1\)

A widely used local convergence guarantee for Newton-Schulz style coupled iterations is:

\[ \|Y_0 - I\| < 1 \]

in a consistent norm, which (for SPD and spectral norm) implies eigenvalues of \(Y_0\) lie in \((0,2)\). This is the practical condition emphasized in iSQRT-COV, which uses coupled Newton-Schulz iterations to avoid GPU-unfriendly SVD/EIG while noting the method is locally convergent and requires proper normalization.

4.2 NS3 scalar basin intuition

For the specific NS3 map \(\phi(y)=y(3-y)^2/4\), fixed points are \(y \in \{0,1,5\}\). The fixed point at \(y=1\) is strongly attracting (\(\phi'(1)=0\)), while very large \(y\) can blow up since \(\phi(y) \sim y^3/4\). This motivates keeping eigenvalues away from large values and away from values too close to zero, especially for small iteration counts.

In practice, enforcing something like \(y \in [\ell,1]\) with \(\ell>0\) is a robust way to ensure both stability and rapid transient progress.
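
A quick standalone check of the fixed points and of the basin boundary at \(y=5\), reusing the same scalar map (illustration only):

```python
def phi(y: float) -> float:
    return y * (3.0 - y) ** 2 / 4.0

# Fixed points: phi(y) = y at y in {0, 1, 5}.
for y in (0.0, 1.0, 5.0):
    assert abs(phi(y) - y) < 1e-12

# Just below 5 the iteration heads toward 1; just above 5 it blows up.
for y0 in (4.9, 5.1):
    y = y0
    for _ in range(6):
        y = phi(y)
    print(f"y0 = {y0} -> after 6 steps: {y:.3g}")
```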


5. Preconditioning and normalization (making fixed budgets work)

The repo’s `precond_spd` is explicitly designed to (a) cheaply control an upper spectral bound and (b) enforce a lower “spectral floor” using a Gershgorin proxy, because those are exactly the quantities that govern early contraction of NS-style polynomials.

5.1 Optional diagonal scaling (“aol” mode)

One option is symmetric diagonal scaling:

\[ d_i = \frac{1}{\sqrt{\sum_j |A_{ij}|}}, \qquad A \leftarrow D A D, \]

implemented as `mode == "aol"`.

This is in the spirit of matrix equilibration / diagonal scaling methods (Ruiz, n.d.) used to reduce extreme row/column scaling while preserving symmetry.
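
A minimal sketch of this scaling (illustrative; the repo implements it as the `"aol"` mode of its preconditioner):

```python
import torch

def aol_scale(A: torch.Tensor) -> torch.Tensor:
    """Symmetric diagonal scaling A -> D A D with d_i = 1 / sqrt(sum_j |A_ij|)."""
    d = A.abs().sum(dim=-1).clamp_min(1e-30).rsqrt()   # (n,) vector of d_i
    return d[:, None] * A * d[None, :]                 # apply D A D without forming D
```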

5.2 Upper scaling via a row-sum bound (GPU-friendly \(\lambda_{\max}\) proxy)

By default, the preconditioner uses the max absolute row sum

\[ u := \max_i \sum_j |(A_{\mathrm{pre}})_{ij}|, \qquad A_{\mathrm{norm}} := \frac{A_{\mathrm{pre}}}{u}. \]

It is cheap (reductions only) and ensures \(\rho(A_{\mathrm{pre}}) \le \|A_{\mathrm{pre}}\|_\infty = u\), so scaling by \(u\) pushes the spectrum toward \([0,1]\) in a conservative way.

A power-iteration estimator exists but was found to be slower in the repo’s current baseline regime.

5.3 Lower floor via Gershgorin bound + diagonal shift

To avoid tiny eigenvalues, the preconditioner computes a Gershgorin-style lower bound

\[ g_{\mathrm{lo}} := \min_i \left(a_{ii} - \sum_{j\ne i} |a_{ij}|\right). \]

If \(g_{\mathrm{lo}} < \ell_{\mathrm{target}}\), it applies a diagonal shift

\[ A_{\mathrm{norm}} \leftarrow A_{\mathrm{norm}} + \delta I, \qquad \delta := \max(0, \ell_{\mathrm{target}} - g_{\mathrm{lo}}), \]

then renormalizes by a row-sum bound again.

This uses Gershgorin’s theorem: all eigenvalues lie within Gershgorin discs (intervals for symmetric matrices), so the quantity above provides a cheap, conservative lower bound; shifting by \(\delta I\) increases all eigenvalues. (Gerschgorin, 1931)
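
A combined sketch of 5.2 and 5.3: row-sum upper scaling, the Gershgorin floor check, and the shift-and-renormalize step. Names and structure are illustrative rather than a copy of the repo’s `precond_spd`.

```python
import torch

def rowsum_bound(A: torch.Tensor) -> torch.Tensor:
    """Max absolute row sum: a cheap upper bound on the spectral radius."""
    return A.abs().sum(dim=-1).max()

def normalize_with_floor(A_pre: torch.Tensor, l_target: float = 0.05) -> torch.Tensor:
    """Scale the spectrum into (0, 1], then enforce a Gershgorin-style floor."""
    A_norm = A_pre / rowsum_bound(A_pre)

    # Gershgorin lower bound: min_i (a_ii - sum_{j != i} |a_ij|).
    diag = A_norm.diagonal()
    off_diag = A_norm.abs().sum(dim=-1) - diag.abs()
    g_lo = (diag - off_diag).min()

    # If the floor is violated, shift the diagonal and renormalize once more.
    delta = torch.clamp(l_target - g_lo, min=0.0)
    if delta > 0:
        n = A_norm.shape[-1]
        A_norm = A_norm + delta * torch.eye(n, dtype=A_norm.dtype, device=A_norm.device)
        A_norm = A_norm / rowsum_bound(A_norm)
    return A_norm
```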

5.4 Why \(\ell_{\mathrm{target}} = 0.05\) is a reasonable fixed-budget choice

Even if NS3 converges asymptotically, fixed budgets care about transient progress. If the worst-case eigenvalue starts very small, it may take many iterations before the quadratic regime near 1 kicks in. Enforcing a floor like \(0.05\) reduces that transient delay while keeping the shift modest; the repo also provides tuned polynomial schedules that further accelerate early contraction (next section).

The code and CLI defaults explicitly bake in \(\ell_{\mathrm{target}}=0.05\) as the “precomputed schedule” regime.


6. Polar-Express-style polynomial schedules (PE-NS3 and PE2)

Newton-Schulz uses a fixed affine polynomial \(q(y)=1.5-0.5y\). A known drawback (also emphasized in recent ML-oriented polar/sign work) is slow initial progress when eigenvalues are not already close to 1. Polar Express addresses this by adapting the polynomial update each iteration via a minimax design, while remaining GEMM-only and stable in bf16 (Amsel et al., 2025).

This repo adapts the same principle to inverse square root.

6.1 General polynomial step

In the coupled form, each iteration chooses a polynomial \(q_k\) and applies

\[ B_k = q_k(Y_k), \qquad X_{k+1} = X_k B_k, \qquad Y_{k+1} = B_k Y_k B_k. \]

Eigenvalue-wise (in the commuting regime),

\[ y_{k+1} = y_k q_k(y_k)^2. \]

The ideal is \(q(y)=y^{-1/2}\), giving \(y_{k+1}=1\) in one step. With low-degree polynomials, the goal is to make the whitening residual \(|1-y q(y)^2|\) small over the expected interval \(y \in [\ell,u]\).
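
A sketch of this coupled polynomial loop, with the per-step coefficients passed in (lowest degree first); the tuned values themselves come from the repo’s schedules and are not reproduced here.

```python
import torch

def poly_of(Y: torch.Tensor, coeffs) -> torch.Tensor:
    """Evaluate q(Y) = c0*I + c1*Y + c2*Y^2 + ... via Horner's rule (GEMM-only)."""
    n = Y.shape[-1]
    I = torch.eye(n, dtype=Y.dtype, device=Y.device)
    B = coeffs[-1] * I
    for c in reversed(coeffs[:-1]):
        B = B @ Y + c * I
    return B

def pe_invsqrt(A_norm: torch.Tensor, schedule) -> torch.Tensor:
    """Per-step polynomials: B = q_k(Y), X <- X B, Y <- B Y B (sketch)."""
    n = A_norm.shape[-1]
    X = torch.eye(n, dtype=A_norm.dtype, device=A_norm.device)
    Y = A_norm.clone()
    for k, coeffs in enumerate(schedule):
        B = poly_of(Y, coeffs)
        X = X @ B
        if k < len(schedule) - 1:   # terminal last step: skip the final Y update
            Y = B @ Y @ B
    return X

# Usage with the plain Newton-Schulz polynomial q(y) = 1.5 - 0.5*y for 3 steps:
# X = pe_invsqrt(A_norm, schedule=[(1.5, -0.5)] * 3)
```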

6.2 The minimax objective and positivity floor used for tuning

The offline tuner solves (approximately) a minimax problem of the form

\[ \min_{q \in \mathcal{P}_d} \max_{y \in [\ell,u]} |1 - y q(y)^2| \quad \text{subject to} \quad q(y) \ge q_{\min} > 0. \]

In code, this is implemented with:

  • a smooth-max surrogate for \(\max\),
  • a positivity penalty \(\mathbb{E}[\max(0, q_{\min}-q(y))^2]\).

See the affine and quadratic fitting closures in `coeff_tuner.py`.

This “minimax polynomial per step” mechanism is exactly the conceptual link to Polar Express (Amsel et al., 2025).
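
A simplified sketch of such a tuning objective for an affine \(q(y)=a+by\), using a logsumexp smooth-max surrogate and a positivity penalty. Hyperparameters and structure are illustrative; the actual closures live in `coeff_tuner.py`.

```python
import torch

def tune_affine(lo: float, hi: float, q_min: float = 0.05,
                tau: float = 1e-3, lam: float = 10.0, iters: int = 2000):
    """Fit q(y) = a + b*y to roughly minimize max_{y in [lo, hi]} |1 - y q(y)^2|."""
    y = torch.linspace(lo, hi, 2048)
    a = torch.tensor(1.5, requires_grad=True)    # start from the Newton-Schulz values
    b = torch.tensor(-0.5, requires_grad=True)
    opt = torch.optim.Adam([a, b], lr=1e-2)
    for _ in range(iters):
        q = a + b * y
        resid = (1.0 - y * q ** 2).abs()
        smooth_max = tau * torch.logsumexp(resid / tau, dim=0)     # soft max over the grid
        penalty = torch.clamp(q_min - q, min=0.0).pow(2).mean()    # keep q(y) >= q_min
        loss = smooth_max + lam * penalty
        opt.zero_grad()
        loss.backward()
        opt.step()
    return a.item(), b.item()
```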

6.3 Interval propagation (why the tuner updates \((\ell,u]\) per step)

Given a current enclosure \(y \in [\ell,u]\) and a chosen \(q\), the next eigenvalues satisfy \(y_{k+1} = \psi(y_k)\). The mapped enclosure for the next step is

\[ [\ell',u'] \subseteq \left[\min_{y\in[\ell,u]} \phi(y),\ \max_{y\in[\ell,u]} \phi(y)\right]. \]

The tuner computes this numerically via dense gridding and min/max, for both affine and quadratic cases.
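
A sketch of that gridded propagation (assuming dense sampling is accurate enough for these smooth, low-degree maps):

```python
import torch

def propagate_interval(lo: float, hi: float, coeffs, n_grid: int = 4096):
    """Map [lo, hi] through psi(y) = y * q(y)^2 by dense gridding and min/max."""
    y = torch.linspace(lo, hi, n_grid)
    q = torch.zeros_like(y)
    for k, c in enumerate(coeffs):      # q(y) = sum_k c_k * y**k
        q = q + c * y ** k
    psi = y * q ** 2
    return psi.min().item(), psi.max().item()

# Example: the image of [0.05, 1.0] under one plain Newton-Schulz step.
print(propagate_interval(0.05, 1.0, (1.5, -0.5)))
```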

6.4 What PE-NS3 and PE2 are in this repo

The repo benchmarks:

  • NS3 / NS4: fixed Newton-Schulz polynomial for 3 or 4 iterations.
  • PE-NS3: 3-step affine schedules \(q_t(y)=a_t + b_t y\).
  • PE2: 2-step quadratic schedules \(q_t(y)=a_t + b_t y + c_t y^2\), which requires forming \(Y^2\) but offers more approximation power per step.

The schedule loader uses precomputed coefficients when \(\ell_{\mathrm{target}}=0.05\); otherwise it generates tuned schedules and applies optional "safety" scaling.
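
Structurally, the only difference for PE2 is the extra GEMM that forms \(Y^2\) inside each step. A sketch of a single quadratic step (the coefficients \(a, b, c\) stand for one row of a tuned schedule and are left symbolic):

```python
import torch

def pe2_step(X: torch.Tensor, Y: torch.Tensor, a: float, b: float, c: float,
             last: bool = False):
    """One coupled quadratic step: B = a*I + b*Y + c*Y^2, X <- X B, Y <- B Y B."""
    n = Y.shape[-1]
    I = torch.eye(n, dtype=Y.dtype, device=Y.device)
    Y2 = Y @ Y                      # the extra GEMM that affine (PE-NS3) steps avoid
    B = a * I + b * Y + c * Y2
    X = X @ B
    if not last:                    # terminal last step skips the final Y update
        Y = B @ Y @ B
    return X, Y
```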


7. AUTO selection

The baseline policy `size_rho` selects PE2 for larger matrices (size > 512); otherwise it selects PE-NS3.
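
A minimal sketch of that policy as described here (the actual selector may use additional signals):

```python
def size_rho_policy(n: int) -> str:
    """Pick the kernel family from the matrix size (sketch of the baseline policy)."""
    return "PE2" if n > 512 else "PE-NS3"
```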


8. PE2 wins today (2026-02-21)

The current baseline report uses bf16, row-sum normalization, terminal last step enabled, and `--auto-policy size_rho`. In head-to-head, winner-per-case comparisons (e.g. PE-NS3 vs PE2), PE2 wins 14 of 15 cells across sizes \(\{256,512,1024\}\) and the tested SPD case families; the exception is `near_rank_def`, where NS3 wins under the report’s winner criterion.

When AUTO is included in the “best line” summaries, it sometimes wins by choosing a favorable kernel (notably in some 512 and 1024 cases).


9. Reproducibility

The README provides the “rigorous benchmark” command used to generate baseline-style results.


References and Further Reading

  1. Amari, S.-I. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251–276. https://doi.org/10.1162/089976698300017746
  2. Amsel, N., Persson, D., Musco, C., & Gower, R. M. (2025). The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm (Issue arXiv:2505.16932). arXiv. https://doi.org/10.48550/arXiv.2505.16932
  3. Anil, R., Gupta, V., Koren, T., Regan, K., & Singer, Y. (2021). Scalable Second Order Optimization for Deep Learning (Issue arXiv:2002.09018). arXiv. https://doi.org/10.48550/arXiv.2002.09018
  4. Boissin, T., Massena, T., Mamalet, F., & Serrurier, M. (2025). Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning (Issue arXiv:2512.04632). arXiv. https://doi.org/10.48550/arXiv.2512.04632
  5. Davidon, W. C. (1959). Variable Metric Method for Minimization (No. ANL-5990). Argonne National Lab., Lemont, Ill. https://doi.org/10.2172/4252678
  6. Gerschgorin, S. (1931). Über Die Abgrenzung Der Eigenwerte Einer Matrix. Izvestiya Akademii Nauk SSSR, Seriya Matematicheskaya, 6, 749–754. http://mi.mathnet.ru/eng/im5235
  7. Guo, C.-H., & Higham, N. J. (2006). A Schur–Newton Method for the Matrix pth Root and Its Inverse. SIAM Journal on Matrix Analysis and Applications, 28(3), 788–804. https://doi.org/10.1137/050643374
  8. Gupta, V., Koren, T., & Singer, Y. (2018). Shampoo: Preconditioned Stochastic Tensor Optimization (Issue arXiv:1802.09568). arXiv. https://doi.org/10.48550/arXiv.1802.09568
  9. Higham, N. Matrix Procrustes Problems.
  10. Higham, N. (2022). What Is Gershgorin’s Theorem? In Nick Higham. https://nhigham.com/2022/11/22/what-is-gershgorins-theorem/
  11. Horn, R. A., & Johnson, C. R. (2017). Matrix Analysis (Second edition, corrected reprint). Cambridge University Press.
  12. Kovarik, Z. (1970). Some Iterative Methods for Improving Orthonormality. SIAM Journal on Numerical Analysis, 7(3), 386–389. https://doi.org/10.1137/0707031
  13. Muon: An Optimizer for Hidden Layers in Neural Networks | Keller Jordan Blog. Retrieved February 22, 2026, from https://kellerjordan.github.io/posts/muon/
  14. Newton-Schulz - Docs.Modula.Systems. Retrieved February 21, 2026, from https://docs.modula.systems/algorithms/newton-schulz/
  15. rohan anil [@_arohan_]. (2024). Just Some Fun Linear Algebra: Shampoo L(t) = L(t-1) + Gt @ Gt.T R(t) = R(t-1) + Gt.T @ Gt Update Rule: L(t)^{-1/4} @ Gt @ R(t)^{-1/4} One Sided Shampoo (for Embedding or Very Large Layers) Update: L(t)^{-1/2} @ Gt or Gt @ R(t)^{-1/2} @kellerjordan0 Modification, No [Tweet]. In Twitter. https://x.com/_arohan_/status/1843050297985466565
  16. Ruiz, D. A Scaling Algorithm to Equilibrate Both Rows and Columns Norms in Matrices.