Matrix Inverse Roots with Fixed-Budget GEMM Kernels

A practical guide to matrix inverse p-th root iterations for SPD matrices, focusing on GPU-friendly GEMM kernels and fixed-budget convergence.

Link to PyTorch implementation on GitHub: https://github.com/JiHa-Kim/fast-matrix-inverse-roots

Matrix Inverse p-th Roots for SPD Matrices: A GPU-Oriented Mathematical Overview

1. Why inverse p-th roots matter

Given a symmetric positive definite (SPD) matrix \(A\), operators of the form

\[ A^{-1/p} \]

appear in preconditioning, whitening, second-order optimization (Davidon, 1959; Amari, 1998), and matrix-normalization layers. In practical ML systems, the bottleneck is not just asymptotic complexity; it is hardware efficiency. GPU throughput strongly favors matrix multiplication (GEMM), while full eigendecomposition and factorizations are expensive and less throughput-friendly.

This repository is built around that practical regime:

  • fixed small iteration budgets,
  • GEMM-dominant updates,
  • mixed precision robustness (especially bf16),
  • explicit benchmarking on realistic SPD test families.

2. Problem statement and the SPD structure

For \(A \in \mathbb{R}^{n \times n}\), \(A = A^T\), \(A > 0\), the principal inverse p-th root is

\[ A^{-1/p} = Q \Lambda^{-1/p} Q^T, \]

where \(A = Q \Lambda Q^T\) and \(\Lambda = \text{diag}(\lambda_i)\) with \(\lambda_i > 0\) (Horn & Johnson, 2017; Higham, 2008).

Two computational tasks are central:

  1. Explicit root construction: compute \(X \approx A^{-1/p}\).
  2. Direct apply: compute \(Z \approx A^{-1/p} B\) for \(B \in \mathbb{R}^{n \times k}\) with \(k \ll n\).

The second task can be much cheaper if solved without materializing dense \(A^{-1/p}\).
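
As a correctness yardstick, both tasks can be written directly from the eigendecomposition. The sketch below is illustrative NumPy, not the repository's API; the function names are made up for this post.

```python
import numpy as np

def inv_proot_eig(A, p):
    """Reference A^{-1/p} via eigendecomposition (hypothetical helper)."""
    evals, Q = np.linalg.eigh(A)              # A = Q diag(evals) Q^T
    return (Q * evals ** (-1.0 / p)) @ Q.T    # Q diag(evals^{-1/p}) Q^T

def apply_inv_proot_eig(A, B, p):
    """Direct apply: A^{-1/p} B without materializing the dense root."""
    evals, Q = np.linalg.eigh(A)
    return Q @ ((Q.T @ B) * (evals ** (-1.0 / p))[:, None])
```

The polynomial methods in the rest of this post aim to match this accuracy within a fixed GEMM budget, which is where GPUs actually spend their throughput.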


3. Spectral shaping via preconditioning

Polynomial fixed-point methods are sensitive to the spectral interval. The repository therefore preconditions each SPD input before iteration (Ruiz, n.d.).

Given raw \(A\), the pipeline (implemented in precond_spd) combines:

  1. Optional scaling mode (none, frob, aol).
  2. Optional ridge shift.
  3. Upper normalization using a row-sum bound:
\[ u = \max_i \sum_j \vert A_{ij}\vert,\qquad A \leftarrow A/u. \]
  4. Lower-floor enforcement using a Gershgorin-style proxy:
\[ g_{lo} = \min_i \left(a_{ii} - \sum_{j \ne i} \vert a_{ij}\vert\right), \]

then diagonal correction when needed to enforce target floor \(l_{\text{target}}\). This stage is based on Gershgorin’s circle theorem (Gerschgorin, 1931; Higham, 2022).

This stage is mathematically simple but operationally crucial: it shrinks the spectral interval into a range where short polynomial schedules are effective.
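
Steps 3 and 4 fit in a few lines. The sketch below is a simplified, hypothetical version of what precond_spd does; the scaling modes and ridge options are omitted.

```python
import numpy as np

def precond_sketch(A, l_target=0.05):
    """Simplified spectral shaping: row-sum upper normalization plus a
    Gershgorin-style diagonal correction. Illustrative only; the
    repository's precond_spd supports more modes."""
    # Upper normalization: u = max_i sum_j |A_ij| bounds the spectral radius.
    u = np.abs(A).sum(axis=1).max()
    A = A / u
    # Gershgorin lower proxy: g_lo = min_i (a_ii - sum_{j != i} |a_ij|).
    d = np.diag(A)
    off = np.abs(A).sum(axis=1) - np.abs(d)
    g_lo = (d - off).min()
    # Diagonal shift to enforce the target floor when the proxy is too low.
    if g_lo < l_target:
        A = A + (l_target - g_lo) * np.eye(A.shape[0])
    return A
```

After this stage every eigenvalue is at least \(l_{\text{target}}\) (by Gershgorin's theorem), at the cost of a bounded diagonal perturbation.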


4. Polynomial inverse-root iteration framework

Let

\[ Y_t = X_t^p A. \]

If \(Y_t \to I\), then \(X_t \to A^{-1/p}\).

The core step uses a polynomial multiplier

\[ B_t = q_t(Y_t), \]

with \(q_t\) typically quadratic:

\[ q_t(y) = a_t + b_t y + c_t y^2. \]

Then update:

\[ X_{t+1} = X_t B_t. \]

The repository implements two variants (Kovarik, 1970).

4.1 Uncoupled variant

Recompute \(Y_t\) from scratch each step:

\[ Y_t = X_t^p A,\qquad X_{t+1}=X_t q_t(Y_t). \]

Pros:

  • lower persistent state,
  • conceptually simple,
  • often strong residual quality at harder exponents.
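
One uncoupled step, written out in NumPy for clarity. This is a sketch, not the repository code, which fuses these GEMMs, reuses workspaces, and specializes small exponents.

```python
import numpy as np

def uncoupled_step(X, A, p, coeffs):
    """One uncoupled iteration: Y = X^p A, then X <- X q(Y) with the
    quadratic q(y) = a + b y + c y^2."""
    a, b, c = coeffs
    Y = np.linalg.matrix_power(X, p) @ A
    B = a * np.eye(A.shape[0]) + b * Y + c * (Y @ Y)
    return X @ B
```

With coefficients \((3/2, -1/2, 0)\) and \(p=2\), this reduces to inverse Newton-Schulz.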

4.2 Coupled variant

Carry both \(X_t\) and \(Y_t\):

\[ X_{t+1}=X_t B_t,\qquad Y_{t+1}=B_t^p Y_t \]

in the commuting polynomial model. This avoids full \(X_t^p A\) recomputation each step and is often faster in wall-clock terms.
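
A coupled step for \(p=2\) in the same sketch style. Here \(Y\) is propagated as \(B Y B\), which equals \(B^2 Y\) in the commuting model but keeps \(Y\) symmetric in floating point; the repository specializes the \(B^p Y\) update for small \(p\).

```python
import numpy as np

def coupled_step_p2(X, Y, coeffs):
    """One coupled iteration for p = 2: B = q(Y), X <- X B, Y <- B Y B.
    Sketch only, not the repository code."""
    a, b, c = coeffs
    B = a * np.eye(Y.shape[0]) + b * Y + c * (Y @ Y)
    return X @ B, B @ Y @ B
```

Initialize \(X_0 = I\), \(Y_0 = A\); when \(Y_t \to I\), \(X_t \to A^{-1/2}\), without ever recomputing \(X_t^2 A\) from scratch.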


5. Newton baseline and PE-Quad generalization

For \(p=2\), classical inverse Newton-Schulz (Newton-Schulz - Docs.Modula.Systems, n.d.) uses

\[ q(y)=\frac{3}{2}-\frac{1}{2}y. \]

In this repository that baseline is presented as Inverse-Newton in benchmark harnesses, in the spirit of the Schur-Newton framework (Guo & Higham, 2006).

The main method family is PE-Quad (quadratic polynomial schedules) (Amsel et al., 2025), where each iteration has its own tuned \((a_t,b_t,c_t)\). Conceptually this follows the same philosophy as modern polar/sign polynomial methods (Chen & Chow, 1991; Nakatsukasa & Freund, 2016): optimize finite-step contraction over a spectral interval rather than relying on a single fixed affine map.
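
The design question is visible already in the scalar picture: each eigenvalue \(y\) of \(Y_t\) evolves as \(y \mapsto y\,q_t(y)^p\), and a schedule is good if it contracts the whole preconditioned interval toward 1 in few steps. A quick check with the fixed Newton map for \(p=2\) (PE-Quad would replace the constant coefficients with per-step tuned ones; the interval endpoints here are illustrative):

```python
import numpy as np

def scalar_map(y, a, b, c, p=2):
    """Eigenvalue dynamics of the update: y <- y * q(y)^p."""
    return y * (a + b * y + c * y * y) ** p

# A preconditioned spectral interval, e.g. [0.05, 1]:
ys = np.linspace(0.05, 1.0, 1000)
for _ in range(8):
    ys = scalar_map(ys, 1.5, -0.5, 0.0)   # fixed Newton coefficients
# After 8 steps the interval has contracted close to {1}; a tuned
# per-step schedule reaches a given tolerance in fewer iterations.
```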


6. Complexity and memory tradeoffs

6.1 Explicit root paths

Both uncoupled and coupled explicit-root methods are GEMM-based and fundamentally \(O(n^3)\) per step.

  • Coupled can save time by avoiding some recomputations.
  • Uncoupled usually uses fewer persistent \(n \times n\) buffers.

The code also includes key engineering optimizations:

  • workspace reuse (ws) to avoid repeated allocations,
  • out= matmul paths,
  • fused addmm/baddbmm,
  • specialization for common small exponents (\(p=2\), \(p=3\), \(p=4\) paths).

6.2 Direct apply path (\(A^{-1/p}B\))

If only \(A^{-1/p}B\) is needed, direct apply can avoid explicit dense inverse-root construction:

  • explicit route: roughly \(O(n^3) + O(n^2 k)\),
  • direct polynomial apply: roughly \(O(n^2 k \cdot \text{degree})\) for fixed degree.

For large \(n\) and moderate \(k\), this can be a major practical win.
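
A back-of-envelope comparison makes the regime concrete. Only leading-order GEMM terms are counted, and the step and degree budgets below are illustrative, not the repository's defaults.

```python
def explicit_flops(n, k, steps):
    """Build the dense root (a few n^3 GEMMs), then one (n x n)(n x k) GEMM."""
    return steps * n**3 + n**2 * k

def direct_flops(n, k, degree):
    """Polynomial direct apply: one (n x n)(n x k) GEMM per polynomial term."""
    return degree * n**2 * k

n, k = 4096, 64
ratio = explicit_flops(n, k, 8) / direct_flops(n, k, 16)
# For n = 4096, k = 64: the explicit route costs ~32x more leading-order FLOPs.
```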


7. Chebyshev direct apply and Clenshaw recurrence

The repository’s apply_inverse_proot_chebyshev approximates

\[ f(x)=x^{-1/p} \]

on \([l_{\min}, l_{\text{max}}]\) with a Chebyshev polynomial, then evaluates \(f(A)B\) via Clenshaw recurrence (Clenshaw, 1955).

Map interval to \([-1,1]\):

\[ t(x)=\frac{2x-(l_{\text{max}}+l_{\min})}{l_{\text{max}}-l_{\min}}. \]

With coefficients \(c_k\) (degree \(N\)) and \(y_{N+1} = y_{N+2} = 0\), evaluate stably backward for \(k = N, \dots, 1\):

\[ y_k = c_k B + 2\,t(A)\,y_{k+1} - y_{k+2}, \]

then recover \(Z = f(A)B\) from the final recurrence state:

\[ Z \approx c_0 B + t(A)\,y_1 - y_2. \]

Important practical condition: choose \(l_{\min}\) safely. If the true spectrum drops below the approximation interval, error can degrade quickly.
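
The whole pipeline fits in a few lines of NumPy. This is a sketch of the idea behind apply_inverse_proot_chebyshev, not the repository's implementation; the coefficient fit leans on numpy.polynomial.chebyshev, whose convention puts the full \(c_0\) term (not \(c_0/2\)) in the recovery step.

```python
import numpy as np

def cheb_apply_inv_proot(A, B, p, lmin, lmax, degree):
    """Approximate Z = A^{-1/p} B by a Chebyshev fit of x^{-1/p} on
    [lmin, lmax] plus Clenshaw evaluation. Sketch only."""
    # Fit coefficients of f(x) = x^{-1/p} at Chebyshev nodes on [-1, 1].
    xs = np.cos(np.pi * (np.arange(degree + 1) + 0.5) / (degree + 1))
    fx = (0.5 * (lmax - lmin) * xs + 0.5 * (lmax + lmin)) ** (-1.0 / p)
    c = np.polynomial.chebyshev.chebfit(xs, fx, degree)
    # Shifted operator t(A) = (2A - (lmax + lmin) I) / (lmax - lmin).
    tA = (2.0 * A - (lmax + lmin) * np.eye(A.shape[0])) / (lmax - lmin)
    # Clenshaw recurrence, backward from k = degree down to 1.
    y1 = np.zeros_like(B)
    y2 = np.zeros_like(B)
    for k in range(degree, 0, -1):
        y1, y2 = c[k] * B + 2.0 * (tA @ y1) - y2, y1
    return c[0] * B + tA @ y1 - y2
```

Note that the safety condition on \(l_{\min}\) is visible here: if \(\lambda_{\min}(A)\) drops below lmin, the polynomial is evaluated outside its fitted interval, where \(x^{-1/p}\) blows up.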


8. Numerical stability in mixed precision

The repository uses several stability controls that are mathematically mild but practically important:

  1. Symmetrization of iterates (\(X\) or \(Y\)) to suppress antisymmetric drift.
  2. Symmetrization cadence (symmetrize_every) to trade extra work for stability.
  3. Terminal-step optimization in coupled paths:
    • skip final \(Y\) update when output needs only final \(X\) or \(Z\).
  4. SPD-oriented preconditioning to keep polynomial dynamics in a stable interval.

Together these allow short, high-throughput runs in bf16 without collapsing quality metrics.
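
The first two controls cost one transpose-add each. Sketched below with a hypothetical helper around the symmetrize_every cadence:

```python
import numpy as np

def symmetrize(X):
    """Project onto the symmetric part; suppresses antisymmetric
    round-off drift at O(n^2) cost."""
    return 0.5 * (X + X.T)

def maybe_symmetrize(X, t, symmetrize_every):
    """Apply the projection only every `symmetrize_every` iterations
    (hypothetical wrapper illustrating the cadence knob)."""
    return symmetrize(X) if (t + 1) % symmetrize_every == 0 else X
```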


9. Reading the current benchmark evidence

Latest benchmark artifacts in this repository were regenerated on 2026-02-25 and include:

  • inverse-root sweep: results/benchmark_report.md,
  • solve/apply report: reports/chebyshev_solve_benchmark.md,
  • raw solve logs in artifacts/benchmarks/.

High-level pattern from the inverse-root sweep (\(p \in \{1,2,3,4,8\}\), sizes \(256,512,1024\)):

  • \(p=1\): the Newton baseline frequently wins on both speed and residual.
  • \(p=2,3,4\): coupled PE-Quad is often fastest (Amsel et al., 2025).
  • high exponent (\(p=8\)): uncoupled PE-Quad dominates residual quality in the tested grid.

For second-order optimization contexts, these roots are essential for preconditioners like Shampoo (Gupta et al., 2018; Anil et al., 2021; rohan anil [@_arohan_], 2024) and Muon (Boissin et al., 2025; Muon: An Optimizer for Hidden Layers in Neural Networks | Keller Jordan Blog, n.d.).

For direct apply (\(p=2\), \(n=2048\) cases), Chebyshev direct apply is typically fastest and lowest-memory in current measurements.


10. Conceptual takeaway

The mathematical story of this repository is not “one algorithm wins always.” It is:

  1. Shape the spectrum first.
  2. Use polynomial maps that match hardware (GEMM-heavy, short schedules).
  3. Separate explicit-root and direct-apply use cases.
  4. Let measured finite-budget behavior guide method choice.

That combination is what makes inverse p-th-root methods practical for modern GPU training systems.


References

  1. Amari, S.-I. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251–276. https://doi.org/10.1162/089976698300017746
  2. Amsel, N., Persson, D., Musco, C., & Gower, R. M. (2025). The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm (Issue arXiv:2505.16932). arXiv. https://doi.org/10.48550/arXiv.2505.16932
  3. Anil, R., Gupta, V., Koren, T., Regan, K., & Singer, Y. (2021). Scalable Second Order Optimization for Deep Learning (Issue arXiv:2002.09018). arXiv. https://doi.org/10.48550/arXiv.2002.09018
  4. Boissin, T., Massena, T., Mamalet, F., & Serrurier, M. (2025). Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning (Issue arXiv:2512.04632). arXiv. https://doi.org/10.48550/arXiv.2512.04632
  5. Davidon, W. C. (1959). VARIABLE METRIC METHOD FOR MINIMIZATION (No. ANL-5990; Issue ANL-5990). Argonne National Lab., Lemont, Ill. https://doi.org/10.2172/4252678
  6. Gerschgorin, S. (1931). Über Die Abgrenzung Der Eigenwerte Einer Matrix. Izvestiya Akademii Nauk SSSR, Seriya Matematicheskaya, 6, 749–754. http://mi.mathnet.ru/eng/im5235
  7. Guo, C.-H., & Higham, N. J. (2006). A Schur–Newton Method for the Matrix pth Root and Its Inverse. SIAM Journal on Matrix Analysis and Applications, 28(3), 788–804. https://doi.org/10.1137/050643374
  8. Gupta, V., Koren, T., & Singer, Y. (2018). Shampoo: Preconditioned Stochastic Tensor Optimization (Issue arXiv:1802.09568). arXiv. https://doi.org/10.48550/arXiv.1802.09568
  9. Higham, N. Matrix Procrustes Problems.
  10. Higham, N. (2022). What Is Gershgorin’s Theorem? In Nick Higham. https://nhigham.com/2022/11/22/what-is-gershgorins-theorem/
  11. Horn, R. A., & Johnson, C. R. (2017). Matrix Analysis (Second edition, corrected reprint). Cambridge University Press.
  12. Kovarik, Z. (1970). Some Iterative Methods for Improving Orthonormality. SIAM Journal on Numerical Analysis, 7(3), 386–389. https://doi.org/10.1137/0707031
  13. Muon: An Optimizer for Hidden Layers in Neural Networks | Keller Jordan Blog. Retrieved February 22, 2026, from https://kellerjordan.github.io/posts/muon/
  14. Newton-Schulz - Docs.Modula.Systems. Retrieved February 21, 2026, from https://docs.modula.systems/algorithms/newton-schulz/
  15. rohan anil [@_arohan_]. (2024). Just some fun linear algebra: Shampoo L(t) = L(t-1) + Gt @ Gt.T, R(t) = R(t-1) + Gt.T @ Gt; update rule: L(t)^{-1/4} @ Gt @ R(t)^{-1/4}. One-sided Shampoo (for embedding or very large layers) update: L(t)^{-1/2} @ Gt or Gt @ R(t)^{-1/2} [Tweet]. Twitter. https://x.com/_arohan_/status/1843050297985466565
  16. Ruiz, D. A Scaling Algorithm to Equilibrate Both Rows and Columns Norms in Matrices.
  17. Chen, H. C., & Chow, S. S. (1991). On Iterative Methods for Computing the Polar Decomposition of a Matrix. Linear Algebra and Its Applications, 159, 235–245. https://doi.org/10.1016/0024-3795(91)90236-V
  18. Nakatsukasa, Y., & Freund, R. W. (2016). Computing the Polar Decomposition with the Halley Iteration. SIAM Journal on Scientific Computing, 38(3), A1412–A1444. https://doi.org/10.1137/15M1044431
  19. Higham, N. J. (2008). Functions of Matrices: Theory and Computation. SIAM.
  20. Clenshaw, C. W. (1955). A Note on the Summation of Chebyshev Series. Mathematical Tables and Other Aids to Computation, 9(51), 118–120. https://doi.org/10.2307/2002574
This post is licensed under CC BY 4.0 by the author.