Lion-K CCWD: Corrected Cautious Weight Decay
Retention and radius parametrization for Lion-K with Corrected Cautious Weight Decay.
PyTorch implementation on GitHub
https://github.com/JiHa-Kim/ScionC
Summary
Lion-\(\mathcal{K}\) with corrected cautious weight decay is cleaner when written in retention/radius coordinates. The update is
\[ M'=\beta_2M+(1-\beta_2)G, \qquad Z=\beta_1M'+(1-\beta_1)G, \qquad U=-\nabla\mathcal{K}(Z), \]\[ P_i=\mathbf{1}_{\{\operatorname{sign}(W_i)=\operatorname{sign}(U_i)\}}, \qquad W'=W-(1-\zeta)(P\odot W)+(1-\zeta)\rho U. \]Here \(\beta_2\) is the momentum state retention, \(\beta_1\) is a dimensionless readout blend, \(\zeta\) is the active-coordinate weight retention, and \(\rho\) is the radius coordinate. The equivalent additive scale is \(\gamma=(1-\zeta)\rho\).
Why This Parametrization
Raw Lion/AdamW-style knobs mix three different roles: memory timescale, additive update scale, and weight-decay strength. Retention/radius coordinates separate them. The retentions \(\beta_2\) and \(\zeta\) transfer as half-lives in the chosen training count, while \(\rho=1/\lambda\) carries the weight-decay coordinate. This matches the constrained-optimization view of Lion-\(\mathcal{K}\) (Chen et al., 2025), the cautious mask from CWD (Chen et al., 2026), and corrected decoupled decay (Chou, 2026).
1. Retention Coordinates
EMA Retention
Any update
\[ Y'=qY+(1-q)Z \]has retention \(q\). Use the halving exponent
\[ \boxed{ H_q=-\log_2q, \qquad q=2^{-H_q}. } \]
Scheduled Half-Life
For a training count \(\tau\), a scheduled halving rate \(\chi_q(\tau)\) gives
\[ \boxed{ H_q = \int_{\tau_t}^{\tau_t+\Delta\tau}\chi_q(\sigma)\,d\sigma, \qquad q=2^{-H_q}. } \]Constant half-life \(h_q\) is the special case \(\chi_q=1/h_q\), so \(q=2^{-\Delta\tau/h_q}\).
Lion-\(\mathcal{K}\) Transfer Coordinates
For Lion-\(\mathcal{K}\) with corrected cautious decay, use
\[ \boxed{ R_W, \qquad \chi_{\beta_2}(\tau), \qquad \gamma(\tau), \qquad \beta_1, \qquad \text{online energy statistics}. } \]The readout blend \(\beta_1\) is dimensionless. The momentum retention \(\beta_2\) is recomputed from its half-life or halving-rate schedule whenever the count increment changes. The additive scale \(\gamma\) may also be scheduled. The active decay fraction \(d=1-\zeta\) is best solved from measured one-step energy statistics once those statistics have warmed up.
2. Lion-\(\mathcal{K}\) Direction
Direction Map
For each weight block \(W\), let \(G=\nabla_W f(\Theta;\mathcal{B})\) be the minibatch gradient. The Lion-\(\mathcal{K}\) state/readout update is
\[ \boxed{ \begin{aligned} M' &= \beta_2M+(1-\beta_2)G,\\ Z &= \beta_1M'+(1-\beta_1)G,\\ U &= -\nabla\mathcal{K}(Z). \end{aligned} } \]Standard Lion is the sign-map case. Muon, Scion-style LMO directions, and normalized SGD fit the same bounded-direction template through different choices of \(\mathcal{K}\) or the direction map.
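A minimal PyTorch sketch of the state/readout step in the sign-map case, where \(\mathcal{K}(Z)=\Vert Z\Vert_1\) gives \(-\nabla\mathcal{K}(Z)=-\operatorname{sign}(Z)\) (standard Lion); the function name is illustrative.

```python
import torch

def lion_direction(G: torch.Tensor, M: torch.Tensor,
                   beta1: float, beta2: float) -> torch.Tensor:
    """Lion-K state/readout with the sign direction map (standard Lion)."""
    M.mul_(beta2).add_(G, alpha=1.0 - beta2)   # M' = beta2 * M + (1 - beta2) * G
    Z = beta1 * M + (1.0 - beta1) * G          # Nesterov-style readout blend
    return -torch.sign(Z)                      # U = -grad K(Z) for K = ||.||_1
```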
Momentum Retention
The current momentum retention is
\[ H_{\beta_2} = \int_{\tau_t}^{\tau_t+\Delta\tau}\chi_{\beta_2}(\sigma)\,d\sigma, \qquad \boxed{\beta_2=2^{-H_{\beta_2}}}. \]With a constant momentum half-life, \(\beta_2=2^{-\Delta\tau/h_{\beta_2}}\).
Effective Readout Coefficient
For the Nesterov readout above,
\[ Z = \beta_1\beta_2M+(1-\beta_1\beta_2)G, \]so the effective coefficient on the stored state is
\[ \boxed{\beta_{\mathrm{eff}}=\beta_1\beta_2.} \]For the non-Nesterov readout \(Z=\beta_1M+(1-\beta_1)G\), use \(\beta_{\mathrm{eff}}=\beta_1\).
3. Weight Retention and Radius
Unmasked Retention Form
Without the cautious mask, the decoupled update can be written
\[ \boxed{ W'=\zeta W+(1-\zeta)\rho U. } \]This is the same as \(W'=(1-\gamma\lambda)W+\gamma U\) with
\[ \boxed{ \gamma=(1-\zeta)\rho, \qquad \lambda=\frac{1}{\rho}. } \]Thus \(\zeta\) is a retention and \(\rho\) is the weight-decay coordinate.
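For readers migrating from raw AdamW-style knobs, the coordinate change and its inverse are two lines each; the helper names are hypothetical.

```python
def to_retention_radius(gamma: float, lam: float) -> tuple[float, float]:
    """Map raw decoupled knobs (gamma, lambda) to retention/radius (zeta, rho)."""
    rho = 1.0 / lam               # radius is the inverse weight-decay coefficient
    zeta = 1.0 - gamma * lam      # from gamma = (1 - zeta) * rho
    return zeta, rho

def to_raw_knobs(zeta: float, rho: float) -> tuple[float, float]:
    """Inverse map back to (gamma, lambda)."""
    return (1.0 - zeta) * rho, 1.0 / rho
```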
Cautious Retention Form
Cautious weight decay applies the weight-retention action only on coordinates aligned with the update direction (Chen et al., 2026). Define
\[ P_i=\mathbf{1}_{\{\operatorname{sign}(W_i)=\operatorname{sign}(U_i)\}}. \]The masked update is
\[ \boxed{ W'=W-(1-\zeta)(P\odot W)+(1-\zeta)\rho U. } \]Active coordinates have retention \(\zeta\); inactive coordinates have retention \(1\). The additive scale remains \(\gamma=(1-\zeta)\rho\).
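In PyTorch the masked retention form is a few in-place-friendly ops; a sketch with an illustrative name.

```python
import torch

@torch.no_grad()
def cautious_retention_update(W: torch.Tensor, U: torch.Tensor,
                              zeta: float, rho: float) -> None:
    """W' = W - (1 - zeta)(P ⊙ W) + (1 - zeta) * rho * U, applied in place."""
    P = (torch.sign(W) == torch.sign(U)).to(W.dtype)   # cautious mask
    W.add_(P * W, alpha=-(1.0 - zeta))                 # decay active coordinates only
    W.add_(U, alpha=(1.0 - zeta) * rho)                # additive step, gamma = (1 - zeta) * rho
```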
Weight Retention
The current active-coordinate weight retention is
\[ H_\zeta = \int_{\tau_t}^{\tau_t+\Delta\tau}\chi_\zeta(\sigma)\,d\sigma, \qquad \boxed{\zeta=2^{-H_\zeta}}. \]With a constant weight-retention half-life, \(\zeta=2^{-\Delta\tau/h_\zeta}\). This scheduled-retention form is useful as a cold-start prior or fallback. In the empirical CCWD rule below, \(\zeta=1-d\) is instead solved from the current block statistics.
4. RMS-Matched Radius
Correction Objective
The radius \(\rho\) is chosen to match a target stationary RMS weight radius:
\[ \boxed{ \mathbb{E}\Vert W\Vert ^2=R_W^2. } \]In the stationary radius view, corrected decay maps \((R_W,\zeta,\text{direction statistics})\) to the radius \(\rho\), rather than treating \(\rho\) or \(\lambda\) as arbitrary raw knobs. In the empirical one-step view, the additive scale \(\gamma\) is scheduled and the active decay fraction \(d=1-\zeta\) is solved from the measured energy balance.
Assumption Boundary
Defazio’s corrected schedule is a normalized-layer steady-state argument: the clean derivation uses \(\langle G_t,W_t\rangle=0\) and treats a learning-rate schedule as a moving steady-state target (Defazio, 2025). Chou’s corrected decoupled decay is broader, but its basic random-walk calculation assumes \(\mathbb{E}\langle W_{t-1},U_t\rangle=0\) at steady state; the output-layer caveat is exactly a case where that cross term is not zero (Chou, 2026). CCWD adds mask-dependent cross terms, so the practical rule should measure those terms online instead of assuming them away.
Empirical One-Step CWD Balance
Write \(d=1-\zeta\) and use the additive update scale \(\gamma\). The one-step cautious update is
\[ W'=W-d(P\odot W)+\gamma U. \]Track block-local EMAs of the measured current statistics
\[ p_2=\frac{\Vert P\odot W\Vert ^2}{\Vert W\Vert ^2}, \qquad h=\frac{\langle W,U\rangle}{\Vert W\Vert \Vert U\Vert }, \qquad k=\frac{\langle P\odot W,U\rangle}{\Vert W\Vert \Vert U\Vert }. \]Let \(\alpha=\gamma\Vert U\Vert /R_W\). Near the target radius, where \(\Vert W\Vert \approx R_W\), the condition \(\Vert W'\Vert ^2=R_W^2\) becomes
\[ \boxed{ p_2d^2-(2p_2+2\alpha k)d+(\alpha^2+2\alpha h)=0. } \]Use the smaller valid root in \(\lbrack 0,1\rbrack\), then set \(\zeta=1-d\) and \(\rho=\gamma/d\) when \(d>0\). If the target radius differs noticeably from the current norm, let \(r=\Vert W\Vert /R_W\) and solve the exact one-step target equation
\[ p_2r^2d^2-(2p_2r^2+2\alpha kr)d+(r^2-1+\alpha^2+2\alpha hr)=0. \]This is the preferred production rule because it adapts to real gradient persistence, sign-map persistence, mask structure, layer type, and training phase.
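A sketch of the root selection, with an illustrative name; the zero fallback when no real root exists is an assumption of this sketch, not part of the rule above.

```python
import math

def solve_active_decay(p2: float, h: float, k: float,
                       alpha: float, r: float = 1.0) -> float:
    """Solve the exact one-step target equation for d = 1 - zeta.

    Coefficients follow p2 r^2 d^2 - (2 p2 r^2 + 2 alpha k r) d
    + (r^2 - 1 + alpha^2 + 2 alpha h r) = 0; returns the smaller root
    clipped to [0, 1], or 0.0 when no real root exists.
    """
    a = p2 * r * r
    b = -(2.0 * p2 * r * r + 2.0 * alpha * k * r)
    c = r * r - 1.0 + alpha * alpha + 2.0 * alpha * h * r
    disc = b * b - 4.0 * a * c
    if a <= 0.0 or disc < 0.0:
        return 0.0
    d = (-b - math.sqrt(disc)) / (2.0 * a)   # smaller of the two roots
    return min(max(d, 0.0), 1.0)
```

Set \(\zeta=1-d\) and \(\rho=\gamma/d\) from the returned root when \(d>0\).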
Unmasked Stationary Prior
Let
\[ c_k = \frac{\mathbb{E}\langle U_t,U_{t-k}\rangle} {\mathbb{E}\Vert U_t\Vert ^2}, \qquad c_0=1, \]and define
\[ c_u^2=\mathbb{E}\Vert U_t\Vert ^2, \qquad A_\zeta=1+2\sum_{k\ge1}\zeta^kc_k. \]For the unmasked update \(W'=\zeta W+(1-\zeta)\rho U\),
\[ \mathbb{E}\Vert W_t\Vert ^2 = \rho^2 \frac{1-\zeta}{1+\zeta} c_u^2A_\zeta. \]Therefore
\[ \boxed{ \rho = R_W \sqrt{ \frac{1+\zeta}{(1-\zeta)c_u^2A_\zeta} }. } \]
Closed-Form Cautious Prior
If online energy statistics are unavailable, use a stationary closed-form prior. Let \(p_2\) be the masked squared-norm fraction,
\[ p_2 \approx \mathbb{E}\frac{\Vert P_t\odot W_t\Vert ^2}{\Vert W_t\Vert ^2}. \]Treat the CWD mask as a random diagonal retention. Define
\[ \mathcal{R}_t = I-(1-\zeta)\operatorname{Diag}(P_t), \]and approximate its first two coordinate moments by
\[ a_1 \approx 1-p_2(1-\zeta), \qquad a_2 \approx 1-p_2(1-\zeta^2). \]The first moment \(a_1\) controls cross-time direction correlations; the second moment \(a_2\) controls RMS energy retention. Define
\[ A_{a_1}=1+2\sum_{k\ge1}a_1^kc_k. \]The masked RMS balance is
\[ R_W^2 \approx \rho^2(1-\zeta)^2c_u^2 \frac{A_{a_1}}{1-a_2}. \]Hence
\[ \boxed{ \rho \approx R_W \sqrt{ \frac{p_2(1+\zeta)} {(1-\zeta)c_u^2A_{a_1}} }. } \]The corresponding additive scale is
\[ \boxed{ \gamma=(1-\zeta)\rho \approx R_W \sqrt{ \frac{p_2(1-\zeta)(1+\zeta)} {c_u^2A_{a_1}} }. } \]This costs only one scalar substitution, \(A_\zeta\leadsto A_{a_1}\), because \(p_2\) is already tracked for CCWD. It is still a prior: it closes the mask and direction process by low-order moments and a direction-correlation model. When \(p_2=1\), \(a_1=\zeta\) and this recovers the unmasked formula.
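A sketch of the prior, with a caller-supplied correlation model \(c(k)\) (for instance the cold-start filter factor of Section 5) and illustrative names; truncating the lag sum at `max_lag` is an assumption of this sketch.

```python
import math

def cautious_radius_prior(R_W: float, zeta: float, p2: float,
                          c_u2: float, c, max_lag: int = 256):
    """Closed-form cautious prior for (rho, gamma); p2 = 1 recovers the unmasked formula.

    `c` maps a lag k >= 1 to the normalized direction autocorrelation c_k.
    """
    a1 = 1.0 - p2 * (1.0 - zeta)   # first moment of the masked diagonal retention
    A = 1.0 + 2.0 * sum(a1 ** k * c(k) for k in range(1, max_lag + 1))
    rho = R_W * math.sqrt(p2 * (1.0 + zeta) / ((1.0 - zeta) * c_u2 * A))
    return rho, (1.0 - zeta) * rho   # (rho, gamma)
```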
Masked RMS Balance
Write \(d=1-\zeta\). The cautious update is
\[ W_t=\mathcal{R}_tW_{t-1}+d\rho U_t, \qquad \mathcal{R}_t=I-d\operatorname{Diag}(P_t). \]Under the scalar moment closure, each coordinate of \(\mathcal{R}_t\) is summarized by its first two moments \(a_1\) and \(a_2\), and the stationary solution unrolls to
\[ W_t = d\rho\sum_{i\ge0}\Bigl(\prod_{j=0}^{i-1}\mathcal{R}_{t-j}\Bigr)U_{t-i}. \]The stationary linearized expansion has coefficients built from products of the random retentions. For lags \(i,j\), the overlap contributes \(a_2^{\min(i,j)}\) and the non-overlap contributes \(a_1^{\vert i-j\vert }\). Therefore
\[ \mathbb{E}\Vert W_t\Vert ^2 \approx d^2\rho^2c_u^2 \sum_{i,j\ge0}a_2^{\min(i,j)}a_1^{\vert i-j\vert }c_{\vert i-j\vert } = d^2\rho^2c_u^2\,\frac{A_{a_1}}{1-a_2}. \]Solving for \(\rho\) gives the displayed cautious-radius formula.
Recovering the Old CCWD Formula
If one insists on raw additive scale \(\gamma\) and retention complement \(d=1-\zeta\), then \(\gamma=d\rho\). In the small-step regime \(\zeta\approx1\) and \(A_{a_1}\approx S\), the cautious-radius balance gives
\[ \boxed{ d \approx \frac{\gamma^2c_u^2S}{2p_2R_W^2}, \qquad \lambda=\frac{d}{\gamma} \approx \frac{\gamma c_u^2S}{2p_2R_W^2}. } \]
This is the old CCWD multiplier formula, now interpreted as the small-step form of the retention/radius parametrization under the closed-form prior.
5. Direction Correlation Factor
Correlation Priors
For any scalar retention \(a\), define
\[ A_a = 1+2\sum_{k\ge1}a^k \frac{\mathbb{E}\langle U_t,U_{t-k}\rangle} {\mathbb{E}\Vert U_t\Vert ^2}. \]Use \(a=\zeta\) for the unmasked update and \(a=a_1=1-p_2(1-\zeta)\) for the closed-form CCWD prior. These correlation factors are useful for cold start, ablation, or when one wants a stationary model. The empirical one-step balance above is the practical default because it measures the cross terms \(h\) and \(k\) directly.
Cold-Start Linear-Filter Approximation for Lion-\(\mathcal{K}\)
For a simple independent-gradient cold-start approximation, set \(b=\beta_{\mathrm{eff}}\) and define
\[ \nu_0 = (1-b)^2+\frac{b^2(1-\beta_2)}{1+\beta_2}. \]Then the retention-weighted correlation factor is
\[ \boxed{ A_a \approx 1+ \frac{ 2ab(1-\beta_2)(1+\beta_2-b) } { (1+\beta_2)(1-a\beta_2)\nu_0 }. } \]This is a null model, not a claim about real minibatch gradients. Real batches can share a persistent task and architecture component even at initialization; Lion’s EMA/readout and the sign map can preserve that persistence as stable signs. If \(\beta_2\) or the retention proxy \(a\) is scheduled over the lag window, replace powers such as \(\beta_2^k\) and \(a^k\) by the corresponding accumulated products. The displayed closed form is the constant-retention special case.
In the small-step limit \(a\to1\), this reduces to the usual unweighted factor
\[ \boxed{ S(b,\beta_2) = \frac{1}{\nu_0} = \frac{1+\beta_2} {(1-b)^2(1+\beta_2)+b^2(1-\beta_2)}. } \]
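Both factors are cheap closed forms; a sketch with an illustrative name.

```python
def cold_start_factor(a: float, b: float, beta2: float) -> tuple[float, float]:
    """Cold-start A_a and its small-step limit S = 1/nu_0, with b = beta_eff."""
    nu0 = (1.0 - b) ** 2 + b * b * (1.0 - beta2) / (1.0 + beta2)
    A = 1.0 + 2.0 * a * b * (1.0 - beta2) * (1.0 + beta2 - b) / (
        (1.0 + beta2) * (1.0 - a * beta2) * nu0
    )
    return A, 1.0 / nu0
```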
Linear-Filter Calculation
Under independent gradients, the linear readout has filter weights
| Lag \(\ell\) | Weight \(w_\ell\) |
|---|---|
| \(0\) | \(1-b\) |
| \(\ell\ge1\) | \(b(1-\beta_2)\beta_2^{\ell-1}\) |
The lag-zero unnormalized autocorrelation is
\[ \nu_0=\sum_{\ell\ge0}w_\ell^2 = (1-b)^2+b^2(1-\beta_2)^2\sum_{\ell\ge1}\beta_2^{2(\ell-1)} = (1-b)^2+\frac{b^2(1-\beta_2)}{1+\beta_2}. \]For \(k\ge1\),
\[ \nu_k=\sum_{\ell\ge0}w_\ell w_{\ell+k} = \frac{b(1-\beta_2)(1+\beta_2-b)}{1+\beta_2}\,\beta_2^{k-1}. \]Since \(c_k=\nu_k/\nu_0\),
\[ A_a = 1+2\sum_{k\ge1}a^kc_k = 1+\frac{2ab(1-\beta_2)(1+\beta_2-b)}{(1+\beta_2)(1-a\beta_2)\nu_0}, \]which is the displayed closed form.
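The closed forms are easy to spot-check by simulating the filter on i.i.d. gradients; a verification sketch (the filter is truncated at `L` lags):

```python
import numpy as np

rng = np.random.default_rng(0)
b, beta2, L, T = 0.9, 0.98, 400, 200_000
# Filter weights: w_0 = 1 - b, w_l = b (1 - beta2) beta2^(l-1) for l >= 1.
w = np.concatenate(([1.0 - b], b * (1.0 - beta2) * beta2 ** np.arange(L - 1)))
g = rng.standard_normal(T)
z = np.convolve(g, w)[:T]                        # readout as a causal linear filter
nu0_hat, nu1_hat = np.mean(z * z), np.mean(z[1:] * z[:-1])
nu0 = (1 - b) ** 2 + b**2 * (1 - beta2) / (1 + beta2)
nu1 = b * (1 - beta2) * (1 + beta2 - b) / (1 + beta2)
print(nu0_hat / nu0, nu1_hat / nu1)              # both ratios should be close to 1
```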
6. Algorithm
Block-Local Quantities
The weight block \(W\), momentum state \(M\), target RMS radius \(R_W\), halving exponent \(H_{\beta_2}\), additive scale \(\gamma\), direction map \(\nabla\mathcal{K}\), and energy statistics are block-local. A scheduled \(H_\zeta\) and the closed-form correlation prior can be used until the EMAs for \(p_2,h,k\) have warmed up.
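Pulling the pieces together, a minimal sketch of one block-local step with the sign map and the empirical one-step balance; `ccwd_block_step` and the `stats` dictionary are hypothetical organization, not the layout of the linked repository.

```python
import math
import torch

@torch.no_grad()
def ccwd_block_step(W, M, G, stats, *, beta1, beta2, gamma, R_W, ema=0.99):
    """One Lion (sign-map) step with empirical CCWD for a single weight block."""
    # Direction: Lion-K state/readout with the sign map.
    M.mul_(beta2).add_(G, alpha=1.0 - beta2)
    Z = beta1 * M + (1.0 - beta1) * G
    U = -torch.sign(Z)

    # Block-local energy statistics, tracked as EMAs.
    P = (torch.sign(W) == torch.sign(U)).to(W.dtype)
    wn, un = W.norm().item(), U.norm().item()
    cur = {
        "p2": (P * W).pow(2).sum().item() / wn**2,
        "h": torch.dot(W.flatten(), U.flatten()).item() / (wn * un),
        "k": torch.dot((P * W).flatten(), U.flatten()).item() / (wn * un),
    }
    for name, val in cur.items():
        stats[name] = ema * stats.get(name, val) + (1.0 - ema) * val

    # Exact one-step target equation for d = 1 - zeta (same quadratic as above).
    alpha, r = gamma * un / R_W, wn / R_W
    a = stats["p2"] * r * r
    bq = -(2.0 * a + 2.0 * alpha * stats["k"] * r)
    c = r * r - 1.0 + alpha * alpha + 2.0 * alpha * stats["h"] * r
    disc = bq * bq - 4.0 * a * c
    d = 0.0 if (a <= 0.0 or disc < 0.0) else \
        min(max((-bq - math.sqrt(disc)) / (2.0 * a), 0.0), 1.0)

    # Masked retention update: W' = W - d (P ⊙ W) + gamma * U.
    W.add_(P * W, alpha=-d).add_(U, alpha=gamma)
```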
Caveats for Output Layers
The steady-state independence assumption frequently breaks down for the cross-entropy output layer. Use the empirical cross terms for that layer, or exclude the output unembedding layer from corrected decay and manage it separately (Chou, 2026).
Appendix: Notation
Symbols
| Symbol | Meaning |
|---|---|
| \(W\) | Weight block |
| \(G\) | Block gradient |
| \(M\) | Momentum state |
| \(Z\) | Direction-map input |
| \(U\) | Lion-\(\mathcal{K}\) direction |
| \(P\) | CWD mask |
| \(\beta_2\) | Momentum state retention |
| \(\beta_1\) | Readout blend |
| \(\beta_{\mathrm{eff}}\) | Effective stored-state readout coefficient |
| \(\zeta\) | Active-coordinate weight retention |
| \(d\) | Active decay fraction, \(d=1-\zeta\) |
| \(\gamma\) | Additive update scale, equal to \((1-\zeta)\rho\) in radius form |
| \(\lambda\) | Equivalent decoupled weight-decay coefficient, \(\lambda=1/\rho\) |
| \(\rho\) | Radius / inverse weight-decay coordinate |
| \(R_W\) | Target stationary RMS weight radius |
| \(p_2\) | Masked squared-norm fraction |
| \(h\) | Weight-direction cosine, \(\langle W,U\rangle/(\Vert W\Vert \Vert U\Vert )\) |
| \(k\) | Masked weight-direction cosine, \(\langle P\odot W,U\rangle/(\Vert W\Vert \Vert U\Vert )\) |
| \(\alpha\) | Normalized additive step, \(\alpha=\gamma\Vert U\Vert /R_W\) |
| \(a_1\) | First moment of masked diagonal retention, \(a_1\approx1-p_2(1-\zeta)\) |
| \(a_2\) | Second moment of masked diagonal retention, \(a_2\approx1-p_2(1-\zeta^2)\) |
| \(c_u^2\) | Direction squared-norm scale, \(\mathbb{E}\Vert U_t\Vert ^2\) |
| \(c_k\) | Normalized direction autocorrelation at lag \(k\) |
| \(A_a\) | Retention-weighted direction-correlation factor for scalar retention \(a\) |
References
- Asymptotic Estimation of AdamW’s Weight RMS (Part 1). Scientific Spaces. Retrieved March 22, 2026, from https://kexue.fm/archives/11307
- Bernstein, J. (2025). Deriving Muon. https://jeremybernste.in/writing/deriving-muon
- Chen, L., Li, J., Liang, K., Su, B., Xie, C., Pierse, N. W., Liang, C., Lao, N., & Liu, Q. (2026). Cautious Weight Decay (Issue arXiv:2510.12402). arXiv. https://doi.org/10.48550/arXiv.2510.12402
- Chen, L., Liu, B., Liang, K., & Liu, Q. (2025). Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts (Issue arXiv:2310.05898). arXiv. https://doi.org/10.48550/arXiv.2310.05898
- Chou, J. C.-C. (2026). Correction of Decoupled Weight Decay (Issue arXiv:2512.08217). arXiv. https://doi.org/10.48550/arXiv.2512.08217
- Defazio, A. (2025). Why Gradients Rapidly Increase Near the End of Training (Issue arXiv:2506.02285). arXiv. https://doi.org/10.48550/arXiv.2506.02285
- Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., & Bernstein, J. (2024). Muon: An Optimizer for Hidden Layers in Neural Networks. https://kellerjordan.github.io/posts/muon/
- Marek, M., Lotfi, S., Somasundaram, A., Wilson, A. G., & Goldblum, M. (2025). Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful (Issue arXiv:2507.07101). arXiv. https://doi.org/10.48550/arXiv.2507.07101
- Mlodozeniec, B., Ablin, P., Béthune, L., Busbridge, D., Klein, M., Ramapuram, J., & Cuturi, M. (2025). Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration (Issue arXiv:2512.22382). arXiv. https://doi.org/10.48550/arXiv.2512.22382
- Muon Is a Nuclear Lion King. Retrieved March 23, 2026, from https://www.cs.utexas.edu/~lqiang/lionk/html/intro.html
- Beyond MuP: 2. Linear Layers and Steepest Descent. Scientific Spaces. Retrieved March 22, 2026, from https://kexue.fm/archives/11605
- Beyond MuP: 3. Special Cases, Special Handling. Scientific Spaces. Retrieved March 22, 2026, from https://kexue.fm/archives/11647
- Pethick, T., Xie, W., Antonakopoulos, K., Zhu, Z., Silveti-Falls, A., & Cevher, V. (2025). Training Deep Learning Models with Norm-Constrained LMOs (Issue arXiv:2502.07529). arXiv. https://doi.org/10.48550/arXiv.2502.07529
- Yang, G., Simon, J. B., & Bernstein, J. (2024). A Spectral Condition for Feature Learning (Issue arXiv:2310.17813). arXiv. https://doi.org/10.48550/arXiv.2310.17813
- Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., & Gao, J. (2022). Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (Issue arXiv:2203.03466). arXiv. https://doi.org/10.48550/arXiv.2203.03466
- Yang, G., Yu, D., Zhu, C., & Hayou, S. (2023). Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks (Issue arXiv:2310.02244). arXiv. https://doi.org/10.48550/arXiv.2310.02244