Matrix Inverse Square Root (SPD) with Fixed-Budget GEMM Kernels

A practical guide to matrix inverse square root iterations for SPD matrices, focusing on GPU-friendly GEMM kernels and fixed-budget convergence.

Link to PyTorch implementation on GitHub: https://github.com/JiHa-Kim/fast-matrix-invsqrt

1. Background and motivation

Many ML systems repeatedly need to “whiten” or precondition vectors using a covariance-like SPD matrix \(A \succ 0\) (Amari, 1998) (Davidon, 1959) (Anil et al., 2021) (rohan anil [@_arohan_], 2024). A canonical operator is the inverse square root \(A^{-1/2}\), defined as the unique SPD matrix \(X\) such that

\[ X A X = I. \]

If we can apply (or approximate) \(A^{-1/2}\) quickly, we can stabilize optimization or normalization while avoiding expensive factorizations.

In deep learning settings, two practical constraints dominate:

  1. Throughput: kernels should be dominated by GEMMs (matrix-matrix multiplies), which are highly optimized on GPUs.
  2. Low precision: bf16/fp16 arithmetic is common, so we want iterations that remain stable and deliver “good enough” accuracy in a small, fixed number of steps.

This repo is built around that viewpoint: implement a small family of matmul-only inverse-square-root iterations, normalize inputs to make them behave well under fixed budgets, and decide winners empirically using a benchmark harness.

Recent work on GPU-friendly polar/sign methods in ML makes the same point: high accuracy is often unnecessary, while GEMM-only iterations and fast early progress matter (Anil et al., 2021) (Amsel et al., 2025) (Boissin et al., 2025).


2. Problem statement and quality metrics

Given SPD \(A \in \mathbb{R}^{n \times n}\), we compute an approximation \(X \approx A^{-1/2}\). The defining identity \(XAX=I\) suggests the residual

\[ R := I - X A X. \]

The harness reports a normalized Frobenius residual

\[ \mathrm{resid}_{\mathrm{fro}} := \frac{\|R\|_F}{\sqrt{n}}, \]

and also includes symmetry diagnostics for \(X\) and \(W := XAX\), plus optional spectral and apply-to-vector proxies, making these metrics suitable for fixed-budget benchmarking.
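
For concreteness, here is a minimal PyTorch sketch of these metrics; the function names are illustrative and not the harness’s API.

```python
import torch

def resid_fro(A: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """Normalized Frobenius residual ||I - X A X||_F / sqrt(n)."""
    n = A.shape[-1]
    I = torch.eye(n, dtype=A.dtype, device=A.device)
    R = I - X @ A @ X
    return torch.linalg.norm(R, ord="fro") / n ** 0.5

def symmetry_gap(M: torch.Tensor) -> torch.Tensor:
    """Relative asymmetry ||M - M^T||_F / ||M||_F, a cheap hygiene diagnostic."""
    return torch.linalg.norm(M - M.mT, ord="fro") / torch.linalg.norm(M, ord="fro")
```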


3. Newton-Schulz inverse square root: derivation and coupled form

3.1 Scalar Newton-Schulz update

The classical Newton-Schulz inverse-square-root update for scalar \(a > 0\) is given by:

\[ x_{+} = x\left(\frac{3}{2} - \frac{1}{2} a x^2\right). \]

3.2 Matrix form and the commuting-polynomial regime

A direct matrix analogue is

\[ X_{k+1} = X_k\left(\frac{3}{2}I - \frac{1}{2} A X_k^2\right). \]

If we start with \(X_0 = \alpha I\), then each \(X_k\) is a matrix polynomial in \(A\) and therefore commutes with \(A\). In that regime, the iteration acts eigenvalue-wise like the scalar update. This “polynomial in the input matrix” structure is central in classic Newton-type analyses for related matrix functions (notably the polar decomposition).

3.3 The coupled GEMM-only update used in this repo (NS3/NS4)

To avoid explicitly forming \(A X_k^2\), the implementation tracks an auxiliary matrix \(Y_k\) and uses a coupled update that remains GEMM-dominated:

\[ B_k = 1.5I - 0.5Y_k, \qquad X_{k+1} = X_k B_k, \qquad Y_{k+1} = B_k Y_k B_k, \]

initialized with \(X_0 = I\) and \(Y_0 = A_{\mathrm{norm}}\). This is implemented directly in `inverse_sqrt_ns`, including optional symmetrization of \(Y\) for numerical hygiene.

A key throughput trick is the terminal last step: on the final iteration, the code skips the \(Y\) update and returns only \(X\).

The benchmark harness always enables this in its method calls.
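
A minimal PyTorch sketch of this coupled loop (illustrative only, not the repo’s `inverse_sqrt_ns`; it assumes the input has already been normalized so its spectrum lies inside \((0,2)\)):

```python
import torch

def ns_invsqrt(A_norm: torch.Tensor, steps: int = 3, symmetrize: bool = True) -> torch.Tensor:
    """Coupled Newton-Schulz iteration for A^{-1/2} using only GEMMs (sketch)."""
    n = A_norm.shape[-1]
    I = torch.eye(n, dtype=A_norm.dtype, device=A_norm.device)
    X = I.clone()          # X_0 = I
    Y = A_norm.clone()     # Y_0 = A_norm
    for k in range(steps):
        B = 1.5 * I - 0.5 * Y
        X = X @ B
        if k == steps - 1:     # terminal last step: skip the final Y update
            break
        Y = B @ Y @ B
        if symmetrize:         # optional numerical hygiene on Y
            Y = 0.5 * (Y + Y.mT)
    return X
```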

3.4 Why initialization (spectral placement) matters

In the commuting regime, eigenvalue evolution follows:

\[ y_{k+1} = \phi(y_k), \qquad \phi(y) := y\left(\frac{3-y}{2}\right)^2 = \frac{y(3-y)^2}{4}. \]

Define \(e := 1-y\). Expanding around \(e=0\) gives

\[ e_{k+1} = \frac{3}{4}e_k^2 + \frac{1}{4}e_k^3, \]

so once eigenvalues are near 1, the residual shrinks roughly quadratically.

However, with a fixed small iteration budget (2-4 steps), performance depends heavily on where the initial spectrum lies. This is why the repo places strong emphasis on preconditioning/normalization.
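
A standalone scalar illustration of this effect (not repo code): iterate \(\phi\) for a fixed 3-step budget from different starting eigenvalues.

```python
def phi(y: float) -> float:
    """Scalar NS3 eigenvalue map: y -> y * ((3 - y) / 2)**2."""
    return y * (3.0 - y) ** 2 / 4.0

for y0 in (1e-3, 0.05, 0.5):
    y = y0
    for _ in range(3):   # fixed 3-step budget
        y = phi(y)
    print(f"y0 = {y0:7.3f} -> after 3 steps: {y:.4f}")
# Eigenvalues starting near zero barely move in 3 steps, while eigenvalues at or
# above a modest floor (e.g. 0.05) make far more progress toward 1.
```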


4. Convergence region: a sufficient condition and an NS3-specific view

4.1 A common sufficient condition: \(\|Y_0 - I\| < 1\)

A widely used local convergence guarantee for Newton-Schulz style coupled iterations is:

\[ \|Y_0 - I\| < 1 \]

in a consistent norm, which (for SPD and spectral norm) implies eigenvalues of \(Y_0\) lie in \((0,2)\). This is the practical condition emphasized in iSQRT-COV, which uses coupled Newton-Schulz iterations to avoid GPU-unfriendly SVD/EIG while noting the method is locally convergent and requires proper normalization.

4.2 NS3 scalar basin intuition

For the specific NS3 map \(\phi(y)=y(3-y)^2/4\), fixed points are \(y \in \{0,1,5\}\). The fixed point at \(y=1\) is strongly attracting (\(\phi'(1)=0\)), while very large \(y\) can blow up since \(\phi(y) \sim y^3/4\). This motivates keeping eigenvalues away from large values and away from values too close to zero, especially for small iteration counts.

In practice, enforcing something like \(y \in [\ell,1]\) with \(\ell>0\) is a robust way to ensure both stability and rapid transient progress.
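
A quick standalone check of the fixed points and of the basin boundary at \(y=5\), reusing the same scalar map (illustration only):

```python
def phi(y: float) -> float:
    return y * (3.0 - y) ** 2 / 4.0

# Fixed points: phi(y) = y at y in {0, 1, 5}.
for y in (0.0, 1.0, 5.0):
    assert abs(phi(y) - y) < 1e-12

# Just below 5 the iteration heads toward 1; just above 5 it blows up.
for y0 in (4.9, 5.1):
    y = y0
    for _ in range(6):
        y = phi(y)
    print(f"y0 = {y0} -> after 6 steps: {y:.3g}")
```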


5. Preconditioning and normalization (making fixed budgets work)

The repo’s `precond_spd` is explicitly designed to (a) cheaply control an upper spectral bound and (b) enforce a lower “spectral floor” using a Gershgorin proxy, because those are exactly the quantities that govern early contraction of NS-style polynomials.

5.1 Optional diagonal scaling (“aol” mode)

One option is symmetric diagonal scaling:

\[ d_i = \frac{1}{\sqrt{\sum_j |A_{ij}|}}, \qquad A \leftarrow D A D, \]

implemented as `mode == "aol"`.

This is in the spirit of matrix equilibration / diagonal scaling methods (Ruiz, n.d.) used to reduce extreme row/column scaling while preserving symmetry.
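
A minimal sketch of this scaling (illustrative; the repo implements it as the `"aol"` mode of its preconditioner):

```python
import torch

def aol_scale(A: torch.Tensor) -> torch.Tensor:
    """Symmetric diagonal scaling A -> D A D with d_i = 1 / sqrt(sum_j |A_ij|)."""
    d = A.abs().sum(dim=-1).clamp_min(1e-30).rsqrt()   # (n,) vector of d_i
    return d[:, None] * A * d[None, :]                 # apply D A D without forming D
```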

5.2 Upper scaling via a row-sum bound (GPU-friendly \(\lambda_{\max}\) proxy)

By default, the preconditioner uses the max absolute row sum

\[ u := \max_i \sum_j |(A_{\mathrm{pre}})_{ij}|, \qquad A_{\mathrm{norm}} := \frac{A_{\mathrm{pre}}}{u}. \]

It is cheap (reductions only) and ensures \(\rho(A_{\mathrm{pre}}) \le \|A_{\mathrm{pre}}\|_\infty = u\), so scaling by \(u\) pushes the spectrum toward \([0,1]\) in a conservative way.

A power-iteration estimator exists but was found to be slower in the repo’s current baseline regime.

5.3 Lower floor via Gershgorin bound + diagonal shift

To avoid tiny eigenvalues, the preconditioner computes a Gershgorin-style lower bound

\[ g_{\mathrm{lo}} := \min_i \left(a_{ii} - \sum_{j\ne i} |a_{ij}|\right). \]

If \(g_{\mathrm{lo}} < \ell_{\mathrm{target}}\), it applies a diagonal shift

\[ A_{\mathrm{norm}} \leftarrow A_{\mathrm{norm}} + \delta I, \qquad \delta := \max(0, \ell_{\mathrm{target}} - g_{\mathrm{lo}}), \]

then renormalizes by a row-sum bound again.

This uses Gershgorin’s theorem: all eigenvalues lie within Gershgorin discs (intervals for symmetric matrices), so the quantity above provides a cheap, conservative lower bound; shifting by \(\delta I\) increases all eigenvalues. (Gerschgorin, 1931)
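
A combined sketch of 5.2 and 5.3: row-sum upper scaling, the Gershgorin floor check, and the shift-and-renormalize step. Names and structure are illustrative rather than a copy of the repo’s `precond_spd`.

```python
import torch

def rowsum_bound(A: torch.Tensor) -> torch.Tensor:
    """Max absolute row sum: a cheap upper bound on the spectral radius."""
    return A.abs().sum(dim=-1).max()

def normalize_with_floor(A_pre: torch.Tensor, l_target: float = 0.05) -> torch.Tensor:
    """Scale the spectrum into (0, 1], then enforce a Gershgorin-style floor."""
    A_norm = A_pre / rowsum_bound(A_pre)

    # Gershgorin lower bound: min_i (a_ii - sum_{j != i} |a_ij|).
    diag = A_norm.diagonal()
    off_diag = A_norm.abs().sum(dim=-1) - diag.abs()
    g_lo = (diag - off_diag).min()

    # If the floor is violated, shift the diagonal and renormalize once more.
    delta = torch.clamp(l_target - g_lo, min=0.0)
    if delta > 0:
        n = A_norm.shape[-1]
        A_norm = A_norm + delta * torch.eye(n, dtype=A_norm.dtype, device=A_norm.device)
        A_norm = A_norm / rowsum_bound(A_norm)
    return A_norm
```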

5.4 Why \(\ell_{\mathrm{target}} = 0.05\) is a reasonable fixed-budget choice

Even if NS3 converges asymptotically, fixed budgets care about transient progress. If the worst-case eigenvalue starts very small, it may take many iterations before the quadratic regime near 1 kicks in. Enforcing a floor like \(0.05\) reduces that transient delay while keeping the shift modest; the repo also provides tuned polynomial schedules that further accelerate early contraction (next section).

The code and CLI defaults explicitly bake in \(\ell_{\mathrm{target}}=0.05\) as the “precomputed schedule” regime.


6. Polar-Express-style polynomial schedules (PE-NS3 and PE2)

Newton-Schulz uses a fixed affine polynomial \(q(y)=1.5-0.5y\). A known drawback (also emphasized in recent ML-oriented polar/sign work) is slow initial progress when eigenvalues are not already close to 1. Polar Express addresses this by adapting the polynomial update each iteration via a minimax design, while remaining GEMM-only and stable in bf16 (Amsel et al., 2025).

This repo adapts the same principle to inverse square root.

6.1 General polynomial step

In the coupled form, each iteration chooses a polynomial \(q_k\) and applies

\[ B_k = q_k(Y_k), \qquad X_{k+1} = X_k B_k, \qquad Y_{k+1} = B_k Y_k B_k. \]

Eigenvalue-wise (in the commuting regime),

\[ y_{k+1} = y_k q_k(y_k)^2. \]

The ideal is \(q(y)=y^{-1/2}\), giving \(y_{k+1}=1\) in one step. With low-degree polynomials, the goal is to make the whitening residual \(|1-y q(y)^2|\) small over the expected interval \(y \in [\ell,u]\).
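
A sketch of this coupled polynomial loop, with the per-step coefficients passed in (lowest degree first); the tuned values themselves come from the repo’s schedules and are not reproduced here.

```python
import torch

def poly_of(Y: torch.Tensor, coeffs) -> torch.Tensor:
    """Evaluate q(Y) = c0*I + c1*Y + c2*Y^2 + ... via Horner's rule (GEMM-only)."""
    n = Y.shape[-1]
    I = torch.eye(n, dtype=Y.dtype, device=Y.device)
    B = coeffs[-1] * I
    for c in reversed(coeffs[:-1]):
        B = B @ Y + c * I
    return B

def pe_invsqrt(A_norm: torch.Tensor, schedule) -> torch.Tensor:
    """Per-step polynomials: B = q_k(Y), X <- X B, Y <- B Y B (sketch)."""
    n = A_norm.shape[-1]
    X = torch.eye(n, dtype=A_norm.dtype, device=A_norm.device)
    Y = A_norm.clone()
    for k, coeffs in enumerate(schedule):
        B = poly_of(Y, coeffs)
        X = X @ B
        if k < len(schedule) - 1:   # terminal last step: skip the final Y update
            Y = B @ Y @ B
    return X

# Usage with the plain Newton-Schulz polynomial q(y) = 1.5 - 0.5*y for 3 steps:
# X = pe_invsqrt(A_norm, schedule=[(1.5, -0.5)] * 3)
```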

6.2 The minimax objective and positivity floor used for tuning

The offline tuner solves (approximately) a minimax problem of the form

\[ \min_{q \in \mathcal{P}_d} \max_{y \in [\ell,u]} |1 - y q(y)^2| \quad \text{subject to} \quad q(y) \ge q_{\min} > 0. \]

In code, this is implemented with:

  • a smooth-max surrogate for \(\max\),
  • a positivity penalty \(\mathbb{E}[\max(0, q_{\min}-q(y))^2]\).

See the affine and quadratic fitting closures in `coeff_tuner.py`.

This “minimax polynomial per step” mechanism is exactly the conceptual link to Polar Express (Amsel et al., 2025).
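
A simplified sketch of such a tuning objective for an affine \(q(y)=a+by\), using a logsumexp smooth-max surrogate and a positivity penalty. Hyperparameters and structure are illustrative; the actual closures live in `coeff_tuner.py`.

```python
import torch

def tune_affine(lo: float, hi: float, q_min: float = 0.05,
                tau: float = 1e-3, lam: float = 10.0, iters: int = 2000):
    """Fit q(y) = a + b*y to roughly minimize max_{y in [lo, hi]} |1 - y q(y)^2|."""
    y = torch.linspace(lo, hi, 2048)
    a = torch.tensor(1.5, requires_grad=True)    # start from the Newton-Schulz values
    b = torch.tensor(-0.5, requires_grad=True)
    opt = torch.optim.Adam([a, b], lr=1e-2)
    for _ in range(iters):
        q = a + b * y
        resid = (1.0 - y * q ** 2).abs()
        smooth_max = tau * torch.logsumexp(resid / tau, dim=0)     # soft max over the grid
        penalty = torch.clamp(q_min - q, min=0.0).pow(2).mean()    # keep q(y) >= q_min
        loss = smooth_max + lam * penalty
        opt.zero_grad()
        loss.backward()
        opt.step()
    return a.item(), b.item()
```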

6.3 Interval propagation (why the tuner updates \((\ell,u]\) per step)

Given a current enclosure \(y \in [\ell,u]\) and a chosen \(q\), the next eigenvalues satisfy \(y_{k+1} = \psi(y_k)\). The mapped enclosure for the next step is

\[ [\ell',u'] \subseteq \left[\min_{y\in[\ell,u]} \phi(y),\ \max_{y\in[\ell,u]} \phi(y)\right]. \]

The tuner computes this numerically via dense gridding and min/max, for both affine and quadratic cases.
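
A sketch of that gridded propagation (assuming dense sampling is accurate enough for these smooth, low-degree maps):

```python
import torch

def propagate_interval(lo: float, hi: float, coeffs, n_grid: int = 4096):
    """Map [lo, hi] through psi(y) = y * q(y)^2 by dense gridding and min/max."""
    y = torch.linspace(lo, hi, n_grid)
    q = torch.zeros_like(y)
    for k, c in enumerate(coeffs):      # q(y) = sum_k c_k * y**k
        q = q + c * y ** k
    psi = y * q ** 2
    return psi.min().item(), psi.max().item()

# Example: the image of [0.05, 1.0] under one plain Newton-Schulz step.
print(propagate_interval(0.05, 1.0, (1.5, -0.5)))
```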

6.4 What PE-NS3 and PE2 are in this repo

The repo benchmarks:

  • NS3 / NS4: fixed Newton-Schulz polynomial for 3 or 4 iterations.
  • PE-NS3: 3-step affine schedules \(q_t(y)=a_t + b_t y\).
  • PE2: 2-step quadratic schedules \(q_t(y)=a_t + b_t y + c_t y^2\), which requires forming \(Y^2\) but offers more approximation power per step.

The schedule loader uses precomputed coefficients when \(\ell_{\mathrm{target}}=0.05\); otherwise it generates tuned schedules and applies optional "safety" scaling.
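
Structurally, the only difference for PE2 is the extra GEMM that forms \(Y^2\) inside each step. A sketch of a single quadratic step (the coefficients \(a, b, c\) stand for one row of a tuned schedule and are left symbolic):

```python
import torch

def pe2_step(X: torch.Tensor, Y: torch.Tensor, a: float, b: float, c: float,
             last: bool = False):
    """One coupled quadratic step: B = a*I + b*Y + c*Y^2, X <- X B, Y <- B Y B."""
    n = Y.shape[-1]
    I = torch.eye(n, dtype=Y.dtype, device=Y.device)
    Y2 = Y @ Y                      # the extra GEMM that affine (PE-NS3) steps avoid
    B = a * I + b * Y + c * Y2
    X = X @ B
    if not last:                    # terminal last step skips the final Y update
        Y = B @ Y @ B
    return X, Y
```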


7. AUTO selection

The baseline policy `size_rho` selects PE2 for larger matrices (size > 512); otherwise it selects PE-NS3.
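
A minimal sketch of that policy as described here (the actual selector may use additional signals):

```python
def size_rho_policy(n: int) -> str:
    """Pick the kernel family from the matrix size (sketch of the baseline policy)."""
    return "PE2" if n > 512 else "PE-NS3"
```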


8. PE2 wins today (2026-02-21)

The current baseline report uses bf16, row-sum normalization, terminal last step enabled, and `--auto-policy size_rho`. In head-to-head, winner-per-case comparisons (e.g. PE-NS3 vs PE2), PE2 wins 14 of 15 cells across sizes \(\{256,512,1024\}\) and the tested SPD case families; the exception is `near_rank_def`, where NS3 wins under the report’s winner criterion.

When AUTO is included in the “best line” summaries, it sometimes wins by choosing a favorable kernel (notably in some 512 and 1024 cases).


9. Reproducibility

The README provides the “rigorous benchmark” command used to generate baseline-style results.


References and Further Reading

  1. Amari, S.-I. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251–276. https://doi.org/10.1162/089976698300017746
  2. Amsel, N., Persson, D., Musco, C., & Gower, R. M. (2025). The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm (Issue arXiv:2505.16932). arXiv. https://doi.org/10.48550/arXiv.2505.16932
  3. Anil, R., Gupta, V., Koren, T., Regan, K., & Singer, Y. (2021). Scalable Second Order Optimization for Deep Learning (Issue arXiv:2002.09018). arXiv. https://doi.org/10.48550/arXiv.2002.09018
  4. Boissin, T., Massena, T., Mamalet, F., & Serrurier, M. (2025). Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning (Issue arXiv:2512.04632). arXiv. https://doi.org/10.48550/arXiv.2512.04632
  5. Davidon, W. C. (1959). Variable Metric Method for Minimization (No. ANL-5990). Argonne National Lab., Lemont, Ill. https://doi.org/10.2172/4252678
  6. Gerschgorin, S. (1931). Über Die Abgrenzung Der Eigenwerte Einer Matrix. Izvestiya Akademii Nauk SSSR, Seriya Matematicheskaya, 6, 749–754. http://mi.mathnet.ru/eng/im5235
  7. Guo, C.-H., & Higham, N. J. (2006). A Schur–Newton Method for the Matrix pth Root and Its Inverse. SIAM Journal on Matrix Analysis and Applications, 28(3), 788–804. https://doi.org/10.1137/050643374
  8. Gupta, V., Koren, T., & Singer, Y. (2018). Shampoo: Preconditioned Stochastic Tensor Optimization (Issue arXiv:1802.09568). arXiv. https://doi.org/10.48550/arXiv.1802.09568
  9. Higham, N. Matrix Procrustes Problems.
  10. Higham, N. (2022). What Is Gershgorin’s Theorem? In Nick Higham. https://nhigham.com/2022/11/22/what-is-gershgorins-theorem/
  11. Horn, R. A., & Johnson, C. R. (2017). Matrix Analysis (Second edition, corrected reprint). Cambridge University Press.
  12. Kovarik, Z. (1970). Some Iterative Methods for Improving Orthonormality. SIAM Journal on Numerical Analysis, 7(3), 386–389. https://doi.org/10.1137/0707031
  13. Muon: An Optimizer for Hidden Layers in Neural Networks | Keller Jordan Blog. Retrieved February 22, 2026, from https://kellerjordan.github.io/posts/muon/
  14. Newton-Schulz - Docs.Modula.Systems. Retrieved February 21, 2026, from https://docs.modula.systems/algorithms/newton-schulz/
  15. rohan anil [@_arohan_]. (2024). Just Some Fun Linear Algebra: Shampoo L(t) = L(t-1) + Gt @ Gt.T R(t) = R(t-1) + Gt.T @ Gt Update Rule: L(t)^{-1/4} @ Gt @ R(t)^{-1/4} One Sided Shampoo (for Embedding or Very Large Layers) Update: L(t)^{-1/2} @ Gt or Gt @ R(t)^{-1/2} @kellerjordan0 Modification, No [Tweet]. In Twitter. https://x.com/_arohan_/status/1843050297985466565
  16. Ruiz, D. A Scaling Algorithm to Equilibrate Both Rows and Columns Norms in Matrices.