Lion-K CCWD: Corrected Cautious Weight Decay

Retention and radius parametrization for Lion-K with Corrected Cautious Weight Decay.

Posted Mar 23, 2026 Updated May 3, 2026

By Ji-Ha Kim

13 min read

PyTorch implementation on GitHub
https://github.com/JiHa-Kim/ScionC

Summary
Lion- $\mathcal{K}$ with corrected cautious weight decay is cleaner when written in retention/radius coordinates. The update is
$M'=\beta_2M+(1-\beta_2)G, \qquad Z=\beta_1M'+(1-\beta_1)G, \qquad U=-\nabla\mathcal{K}(Z),$ $P_i=\mathbf{1}_{\{\operatorname{sign}(W_i)=\operatorname{sign}(U_i)\}}, \qquad W'=W-(1-\zeta)(P\odot W)+(1-\zeta)\rho U.$
Here $\beta_2$ is the momentum state retention, $\beta_1$ is a dimensionless readout blend, $\zeta$ is the active-coordinate weight retention, and $\rho$ is the radius coordinate. The equivalent additive scale is $\gamma=(1-\zeta)\rho$ .

Why This Parametrization
Raw Lion/AdamW-style knobs mix three different roles: memory timescale, additive update scale, and weight-decay strength. Retention/radius coordinates separate them. The retentions $\beta_2$ and $\zeta$ transfer as half-lives in the chosen training count, while $\rho=1/\lambda$ carries the weight-decay coordinate. This matches the constrained-optimization view of Lion- $\mathcal{K}$ (Chen et al., 2025), the cautious mask from CWD (Chen et al., 2026), and corrected decoupled decay (Chou, 2026).

1. Retention Coordinates

EMA Retention
Any update
$\[ Y'=qY+(1-q)Z \]$
has retention $$q$$ . Use the halving exponent
$\boxed{ H_q=-\log_2q, \qquad q=2^{-H_q}. }$

Scheduled Half-Life
For a training count $\tau$ , a scheduled halving rate $\chi_q(\tau)$ gives
$\boxed{ H_q = \int_{\tau_t}^{\tau_t+\Delta\tau}\chi_q(\sigma)\,d\sigma, \qquad q=2^{-H_q}. }$
Constant half-life $$h_q$$ is the special case $\chi_q=1/h_q$ , so $q=2^{-\Delta\tau/h_q}$ .

Lion- $\mathcal{K}$ Transfer Coordinates
For Lion- $\mathcal{K}$ with corrected cautious decay, use
$\boxed{ R_W, \qquad \chi_{\beta_2}(\tau), \qquad \gamma(\tau), \qquad \beta_1, \qquad \text{online energy statistics}. }$
The readout blend $\beta_1$ is dimensionless. The momentum retention $\beta_2$ is recomputed from its half-life or halving-rate schedule whenever the count increment changes. The additive scale $\gamma$ may also be scheduled. The active decay fraction $d=1-\zeta$ is best solved from measured one-step energy statistics once those statistics have warmed up.

2. Lion- $\mathcal{K}$ Direction

Direction Map
For each weight block $$W$$ , let $G=\nabla_W f(\Theta;\mathcal{B})$ be the minibatch gradient. The Lion- $\mathcal{K}$ state/readout update is
$\boxed{ \begin{aligned} M' &= \beta_2M+(1-\beta_2)G,\\ Z &= \beta_1M'+(1-\beta_1)G,\\ U &= -\nabla\mathcal{K}(Z). \end{aligned} }$
Standard Lion is the sign-map case. Muon, Scion-style LMO directions, and normalized SGD fit the same bounded-direction template through different choices of $\mathcal{K}$ or the direction map.

Momentum Retention
The current momentum retention is
$H_{\beta_2} = \int_{\tau_t}^{\tau_t+\Delta\tau}\chi_{\beta_2}(\sigma)\,d\sigma, \qquad \boxed{\beta_2=2^{-H_{\beta_2}}}.$
With a constant momentum half-life, $\beta_2=2^{-\Delta\tau/h_{\beta_2}}$ .

Effective Readout Coefficient
For the Nesterov readout above,
$Z = \beta_1\beta_2M+(1-\beta_1\beta_2)G,$
so the effective coefficient on the stored state is
$\boxed{\beta_{\mathrm{eff}}=\beta_1\beta_2.}$
For the non-Nesterov readout $Z=\beta_1M+(1-\beta_1)G$ , use $\beta_{\mathrm{eff}}=\beta_1$ .

3. Weight Retention and Radius

Unmasked Retention Form
Without the cautious mask, the decoupled update can be written
$\boxed{ W'=\zeta W+(1-\zeta)\rho U. }$
This is the same as $W'=(1-\gamma\lambda)W+\gamma U$ with
$\boxed{ \gamma=(1-\zeta)\rho, \qquad \lambda=\frac{1}{\rho}. }$
Thus $\zeta$ is a retention and $\rho$ is the weight-decay coordinate.

Cautious Retention Form
Cautious weight decay applies the weight-retention action only on coordinates aligned with the update direction (Chen et al., 2026). Define
$P_i=\mathbf{1}_{\{\operatorname{sign}(W_i)=\operatorname{sign}(U_i)\}}.$
The masked update is
$\boxed{ W'=W-(1-\zeta)(P\odot W)+(1-\zeta)\rho U. }$
Active coordinates have retention $\zeta$ ; inactive coordinates have retention $$1$$ . The additive scale remains $\gamma=(1-\zeta)\rho$ .

Weight Retention
The current active-coordinate weight retention is
$H_\zeta = \int_{\tau_t}^{\tau_t+\Delta\tau}\chi_\zeta(\sigma)\,d\sigma, \qquad \boxed{\zeta=2^{-H_\zeta}}.$
With a constant weight-retention half-life, $\zeta=2^{-\Delta\tau/h_\zeta}$ . This scheduled-retention form is useful as a cold-start prior or fallback. In the empirical CCWD rule below, $\zeta=1-d$ is instead solved from the current block statistics.

4. RMS-Matched Radius

Correction Objective
The radius $\rho$ is chosen to match a target stationary RMS weight radius:
$\boxed{ \mathbb{E}\Vert W\Vert ^2=R_W^2. }$
In the stationary radius view, corrected decay maps $(R_W,\zeta,\text{direction statistics})$ to the radius $\rho$ , rather than treating $\rho$ or $\lambda$ as arbitrary raw knobs. In the empirical one-step view, the additive scale $\gamma$ is scheduled and the active decay fraction $d=1-\zeta$ is solved from the measured energy balance.

Assumption Boundary
Defazio’s corrected schedule is a normalized-layer steady-state argument: the clean derivation uses $\langle G_t,W_t\rangle=0$ and treats a learning-rate schedule as a moving steady-state target (Defazio, 2025). Chou’s corrected decoupled decay is broader, but its basic random-walk calculation assumes $\mathbb{E}\langle W_{t-1},U_t\rangle=0$ at steady state; the output-layer caveat is exactly a case where that cross term is not zero (Chou, 2026). CCWD adds mask-dependent cross terms, so the practical rule should measure those terms online instead of assuming them away.

Empirical One-Step CWD Balance
Write $d=1-\zeta$ and use the additive update scale $\gamma$ . The one-step cautious update is
$W'=W-d(P\odot W)+\gamma U.$
Track block-local EMAs of the actual current statistics
$p_2=\frac{\Vert P\odot W\Vert ^2}{\Vert W\Vert ^2}, \qquad h=\frac{\langle W,U\rangle}{\Vert W\Vert \Vert U\Vert }, \qquad k=\frac{\langle P\odot W,U\rangle}{\Vert W\Vert \Vert U\Vert }.$
Let $\alpha=\gamma\Vert U\Vert /R_W$ . Near the target radius, $\Vert W\Vert \approx R_W$ , the equation $\Vert W'\Vert ^2=R_W^2$ becomes
$\boxed{ p_2d^2-(2p_2+2\alpha k)d+(\alpha^2+2\alpha h)=0. }$
Use the smaller valid root in $\lbrack 0,1\rbrack$ , then set $\zeta=1-d$ and $\rho=\gamma/d$ when $$d>0$$ . If the target radius differs noticeably from the current norm, let $r=\Vert W\Vert /R_W$ and solve the exact one-step target equation
$p_2r^2d^2-(2p_2r^2+2\alpha kr)d+(r^2-1+\alpha^2+2\alpha hr)=0.$
This is the preferred production rule because it adapts to real gradient persistence, sign-map persistence, mask structure, layer type, and training phase.

Unmasked Stationary Prior
Let
$c_k = \frac{\mathbb{E}\langle U_t,U_{t-k}\rangle} {\mathbb{E}\Vert U_t\Vert ^2}, \qquad c_0=1,$
and define
$c_u^2=\mathbb{E}\Vert U_t\Vert ^2, \qquad A_\zeta=1+2\sum_{k\ge1}\zeta^kc_k.$
For the unmasked update $W'=\zeta W+(1-\zeta)\rho U$ ,
$\mathbb{E}\Vert W_t\Vert ^2 = \rho^2 \frac{1-\zeta}{1+\zeta} c_u^2A_\zeta.$
Therefore
$\boxed{ \rho = R_W \sqrt{ \frac{1+\zeta}{(1-\zeta)c_u^2A_\zeta} }. }$

Closed-Form Cautious Prior
If online energy statistics are unavailable, use a stationary closed-form prior. Let $$p_2$$ be the masked squared-norm fraction,
$p_2 \approx \mathbb{E}\frac{\Vert P_t\odot W_t\Vert ^2}{\Vert W_t\Vert ^2}.$
Treat the CWD mask as a random diagonal retention. Define
$\mathcal{R}_t = I-(1-\zeta)\operatorname{Diag}(P_t),$
and approximate its first two coordinate moments by
$a_1 \approx 1-p_2(1-\zeta), \qquad a_2 \approx 1-p_2(1-\zeta^2).$
The first moment $$a_1$$ controls cross-time direction correlations; the second moment $$a_2$$ controls RMS energy retention. Define
$A_{a_1}=1+2\sum_{k\ge1}a_1^kc_k.$
The masked RMS balance is
$R_W^2 \approx \rho^2(1-\zeta)^2c_u^2 \frac{A_{a_1}}{1-a_2}.$
Hence
$\boxed{ \rho \approx R_W \sqrt{ \frac{p_2(1+\zeta)} {(1-\zeta)c_u^2A_{a_1}} }. }$
The corresponding additive scale is
$\boxed{ \gamma=(1-\zeta)\rho \approx R_W \sqrt{ \frac{p_2(1-\zeta)(1+\zeta)} {c_u^2A_{a_1}} }. }$
This costs only one scalar substitution, $A_\zeta\leadsto A_{a_1}$ , because $$p_2$$ is already tracked for CCWD. It is still a prior: it closes the mask and direction process by low-order moments and a direction-correlation model. When $$p_2=1$$ , $a_1=\zeta$ and this recovers the unmasked formula.

Masked RMS Balance

Write $d=1-\zeta$ . The cautious update is

W'=\mathcal{R}_tW+d\rho U.

Under the scalar moment closure,

\mathbb{E}\lbrack (\mathcal{R}_t)_{ii}\rbrack \approx a_1, \qquad \mathbb{E}\lbrack (\mathcal{R}_t)_{ii}^2\rbrack \approx a_2.

The stationary linearized expansion has coefficients built from products of the random retentions. For lags $$i,j$$ , the overlap contributes $a_2^{\min(i,j)}$ and the non-overlap contributes $a_1^{\vert i-j\vert }$ . Therefore

\mathbb{E}\Vert W_t\Vert ^2 \approx d^2\rho^2c_u^2 \sum_{i,j\ge0} a_2^{\min(i,j)} a_1^{\vert i-j\vert } c_{\vert i-j\vert } = d^2\rho^2c_u^2 \frac{A_{a_1}}{1-a_2}.

Solving for $\rho$ gives the displayed cautious-radius formula.

Recovering the Old CCWD Formula

If one insists on raw additive scale $\gamma$ and retention complement $d=1-\zeta$ , then $\gamma=d\rho$ . In the small-step regime $\zeta\approx1$ and $A_{a_1}\approx S$ , the cautious-radius balance gives

d \approx \frac{\gamma^2c_u^2S}{2p_2R_W^2}.

This is the old CCWD multiplier formula, now interpreted as the small-step form of the retention/radius parametrization under the closed-form prior.

5. Direction Correlation Factor

Correlation Priors
For any scalar retention $$a$$ , define
$A_a = 1+2\sum_{k\ge1}a^k \frac{\mathbb{E}\langle U_t,U_{t-k}\rangle} {\mathbb{E}\Vert U_t\Vert ^2}.$
Use $a=\zeta$ for the unmasked update and $a=a_1=1-p_2(1-\zeta)$ for the closed-form CCWD prior. These correlation factors are useful for cold start, ablation, or when one wants a stationary model. The empirical one-step balance above is the practical default because it measures the cross terms $$h$$ and $$k$$ directly.

Cold-Start Linear-Filter Approximation for Lion- $\mathcal{K}$
For a simple independent-gradient cold-start approximation, set $b=\beta_{\mathrm{eff}}$ and define
$\nu_0 = (1-b)^2+\frac{b^2(1-\beta_2)}{1+\beta_2}.$
Then the retention-weighted correlation factor is
$\boxed{ A_a \approx 1+ \frac{ 2ab(1-\beta_2)(1+\beta_2-b) } { (1+\beta_2)(1-a\beta_2)\nu_0 }. }$
This is a null model, not a claim about real minibatch gradients. Real batches can share a persistent task and architecture component even at initialization; Lion’s EMA/readout and the sign map can preserve that persistence as stable signs. If $\beta_2$ or the retention proxy $$a$$ is scheduled over the lag window, replace powers such as $\beta_2^k$ and $$a^k$$ by the corresponding accumulated products. The displayed closed form is the constant-retention special case.
In the small-step limit $a\to1$ , this reduces to the usual unweighted factor
$\boxed{ S(b,\beta_2) = \frac{1}{\nu_0} = \frac{1+\beta_2} {(1-b)^2(1+\beta_2)+b^2(1-\beta_2)}. }$

Linear-Filter Calculation

Under independent gradients, the linear readout has filter weights

Lag $\ell$	Weight $w_\ell$
$$0$$	$$1-b$$
$\ell\ge1$	$b(1-\beta_2)\beta_2^{\ell-1}$

The lag-zero unnormalized autocorrelation is

\nu_0 = (1-b)^2+\frac{b^2(1-\beta_2)}{1+\beta_2}.

For $k\ge1$ ,

\nu_k = \frac{b(1-\beta_2)(1+\beta_2-b)}{1+\beta_2}\beta_2^{k-1}.

Since $c_k=\nu_k/\nu_0$ ,

A_a = 1+2\sum_{k\ge1}a^k\frac{\nu_k}{\nu_0} = 1+ \frac{ 2ab(1-\beta_2)(1+\beta_2-b) } { (1+\beta_2)(1-a\beta_2)\nu_0 }.

6. Algorithm

Block-Local Quantities
The weight block $$W$$ , momentum state $$M$$ , target RMS radius $$R_W$$ , halving exponent $H_{\beta_2}$ , additive scale $\gamma$ , direction map $\nabla\mathcal{K}$ , and energy statistics are block-local. A scheduled $H_\zeta$ and the closed-form correlation prior can be used until the EMAs for $$p_2,h,k$$ have warmed up.

Algorithm Lion-

\mathcal{K}

with RMS-Matched CCWD

1:def LionK_CCWD_Step

\left( \Theta,M;\mathcal{B} \right)

2: for each weight block:

(W,M,R_W,H_{\beta_2},\gamma,\beta_1,\nabla\mathcal{K}) \leftarrow

block-local current values

\beta_2 \leftarrow 2^{-H_{\beta_2}}

G \leftarrow \nabla_W f(\Theta;\mathcal{B})

M' \leftarrow \beta_2M+(1-\beta_2)G

Z \leftarrow \beta_1M'+(1-\beta_1)G

U \leftarrow -\nabla\mathcal{K}(Z)

P_i \leftarrow \mathbf{1}_{\{\operatorname{sign}(W_i)=\operatorname{sign}(U_i)\}}

10:

p_2 \leftarrow \operatorname{EMA}(\Vert P\odot W\Vert ^2/\Vert W\Vert ^2)

11:

h \leftarrow \operatorname{EMA}(\langle W,U\rangle/(\Vert W\Vert \Vert U\Vert ))

12:

k \leftarrow \operatorname{EMA}(\langle P\odot W,U\rangle/(\Vert W\Vert \Vert U\Vert ))

13:

\alpha \leftarrow \gamma\Vert U\Vert /R_W

14:

d \leftarrow

smaller valid root of

p_2d^2-(2p_2+2\alpha k)d+(\alpha^2+2\alpha h)=0

15:

W \leftarrow W-d(P\odot W)+\gamma U

16:

M \leftarrow M'

17: write back

\(W,M\)

to the block

Caveats for Output Layers
The steady-state independence assumption frequently breaks down for the cross-entropy output layer. Use the empirical cross terms for that layer, or exclude the output unembedding layer from corrected decay and manage it separately (Chou, 2026).

Appendix: Notation

Symbols
Symbol Meaning
$$W$$ Weight block
$$G$$ Block gradient
$$M$$ Momentum state
$$Z$$ Direction-map input
$$U$$ Lion- $\mathcal{K}$ direction
$$P$$ CWD mask
$\beta_2$ Momentum state retention
$\beta_1$ Readout blend
$\beta_{\mathrm{eff}}$ Effective stored-state readout coefficient
$\zeta$ Active-coordinate weight retention
$$d$$ Active decay fraction, $d=1-\zeta$
$\gamma$ Additive update scale, equal to $(1-\zeta)\rho$ in radius form
$\lambda$ Equivalent decoupled weight-decay coefficient, $\lambda=1/\rho$
$\rho$ Radius / inverse weight-decay coordinate
$$R_W$$ Target stationary RMS weight radius
$$p_2$$ Masked squared-norm fraction
$$h$$ Weight-direction cosine, $\langle W,U\rangle/(\Vert W\Vert \Vert U\Vert )$
$$k$$ Masked weight-direction cosine, $\langle P\odot W,U\rangle/(\Vert W\Vert \Vert U\Vert )$
$\alpha$ Normalized additive step, $\alpha=\gamma\Vert U\Vert /R_W$
$$a_1$$ First moment of masked diagonal retention, $a_1\approx1-p_2(1-\zeta)$
$$a_2$$ Second moment of masked diagonal retention, $a_2\approx1-p_2(1-\zeta^2)$
$$c_u^2$$ Direction squared-norm scale, $\mathbb{E}\Vert U_t\Vert ^2$
$$c_k$$ Normalized direction autocorrelation at lag $$k$$
$$A_a$$ Retention-weighted direction-correlation factor for scalar retention $$a$$

Symbol	Meaning
$\(W\)$	Weight block
$\(G\)$	Block gradient
$\(M\)$	Momentum state
$\(Z\)$	Direction-map input
$\(U\)$	Lion- $\mathcal{K}$ direction
$\(P\)$	CWD mask
$\beta_2$	Momentum state retention
$\beta_1$	Readout blend
$\beta_{\mathrm{eff}}$	Effective stored-state readout coefficient
$\zeta$	Active-coordinate weight retention
$\(d\)$	Active decay fraction, $d=1-\zeta$
$\gamma$	Additive update scale, equal to $(1-\zeta)\rho$ in radius form
$\lambda$	Equivalent decoupled weight-decay coefficient, $\lambda=1/\rho$
$\rho$	Radius / inverse weight-decay coordinate
$\(R_W\)$	Target stationary RMS weight radius
$\(p_2\)$	Masked squared-norm fraction
$\(h\)$	Weight-direction cosine, $\langle W,U\rangle/(\Vert W\Vert \Vert U\Vert )$
$\(k\)$	Masked weight-direction cosine, $\langle P\odot W,U\rangle/(\Vert W\Vert \Vert U\Vert )$
$\alpha$	Normalized additive step, $\alpha=\gamma\Vert U\Vert /R_W$
$\(a_1\)$	First moment of masked diagonal retention, $a_1\approx1-p_2(1-\zeta)$
$\(a_2\)$	Second moment of masked diagonal retention, $a_2\approx1-p_2(1-\zeta^2)$
$\(c_u^2\)$	Direction squared-norm scale, $\mathbb{E}\Vert U_t\Vert ^2$
$\(c_k\)$	Normalized direction autocorrelation at lag $\(k\)$
$\(A_a\)$	Retention-weighted direction-correlation factor for scalar retention $\(a\)$

References

Asymptotic Estimation of AdamW’s Weight RMS (Part 1) - Science Space\textbarScientific Spaces. Retrieved March 22, 2026, from https://kexue.fm/archives/11307
Bernstein, J. (2025). Deriving Muon. https://jeremybernste.in/writing/deriving-muon
Chen, L., Li, J., Liang, K., Su, B., Xie, C., Pierse, N. W., Liang, C., Lao, N., & Liu, Q. (2026). Cautious Weight Decay (Issue arXiv:2510.12402). arXiv. https://doi.org/10.48550/arXiv.2510.124021 2
Chen, L., Liu, B., Liang, K., & Liu, Q. (2025). Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts (Issue arXiv:2310.05898). arXiv. https://doi.org/10.48550/arXiv.2310.05898↩︎
Chou, J. C.-C. (2026). Correction of Decoupled Weight Decay (Issue arXiv:2512.08217). arXiv. https://doi.org/10.48550/arXiv.2512.082171 2 3
Defazio, A. (2025). Why Gradients Rapidly Increase Near the End of Training (Issue arXiv:2506.02285). arXiv. https://doi.org/10.48550/arXiv.2506.02285↩︎
Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., & Bernstein, J. (2024). Muon: An Optimizer for Hidden Layers in Neural Networks. https://kellerjordan.github.io/posts/muon/
Marek, M., Lotfi, S., Somasundaram, A., Wilson, A. G., & Goldblum, M. (2025). Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful (Issue arXiv:2507.07101). arXiv. https://doi.org/10.48550/arXiv.2507.07101
Mlodozeniec, B., Ablin, P., Béthune, L., Busbridge, D., Klein, M., Ramapuram, J., & Cuturi, M. (2025). Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration (Issue arXiv:2512.22382). arXiv. https://doi.org/10.48550/arXiv.2512.22382
Muon Is a Nuclear Lion King. Retrieved March 23, 2026, from https://www.cs.utexas.edu/ lqiang/lionk/html/intro.html
MuP之上：2. 线性层与最速下降 - 科学空间\textbarScientific Spaces. Retrieved March 22, 2026, from https://kexue.fm/archives/11605
MuP之上：3. 特殊情况特殊处理 - 科学空间\textbarScientific Spaces. Retrieved March 22, 2026, from https://kexue.fm/archives/11647
Pethick, T., Xie, W., Antonakopoulos, K., Zhu, Z., Silveti-Falls, A., & Cevher, V. (2025). Training Deep Learning Models with Norm-Constrained LMOs (Issue arXiv:2502.07529). arXiv. https://doi.org/10.48550/arXiv.2502.07529
Yang, G., Simon, J. B., & Bernstein, J. (2024). A Spectral Condition for Feature Learning (Issue arXiv:2310.17813). arXiv. https://doi.org/10.48550/arXiv.2310.17813
Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., & Gao, J. (2022). Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (Issue arXiv:2203.03466). arXiv. https://doi.org/10.48550/arXiv.2203.03466
Yang, G., Yu, D., Zhu, C., & Hayou, S. (2023). Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks (Issue arXiv:2310.02244). arXiv. https://doi.org/10.48550/arXiv.2310.02244

Machine Learning, Mathematical Optimization

This post is licensed under CC BY 4.0 by the author.