
Lion-K CCWD: Corrected Cautious Weight Decay and Hyperparameter Transfer

Derivation of Lion-K with Corrected Cautious Weight Decay (CCWD) and transformation rules for hyperparameter transfer.


Overview

This post derives Lion-\(\mathcal{K}\) with Corrected Cautious Weight Decay (CCWD) and provides hyperparameter-transfer rules for scaling across width, depth, batch size, and duration.

Key assumptions:

  1. Normalized updates accumulate like a random walk. For bounded optimizer directions (sign, LMO), the total parameter displacement after \(T\) steps scales as \(\gamma\sqrt{T}\); holding that displacement fixed as \(T \propto D/B\) varies requires \(\gamma \propto \sqrt{B/D}\).

  2. Momentum and decay are parameterized by half-lives in tokens. This yields exact formulas for betas and decay instead of linear approximations (Marek et al., 2025).


1. Lion-\(\mathcal{K}\)

Lion-\(\mathcal{K}\) Update Rule (Chen et al., 2025)

Input: Parameters \(\theta_t\), gradient \(g_t = \nabla f(\theta_t)\), momentum state \(m_t\), direction map \(\nabla \mathcal{K}\).

Step 1 — Momentum update:

\[ m_{t+1} = \beta_2 m_t + (1-\beta_2) g_t \]

Step 2 — Direction input (a common Lion-\(\mathcal{K}\) choice):

\[ z_t = \beta_1 m_{t+1} + (1-\beta_1) g_t \]

Step 3 — Direction map:

\[ u_t = -\nabla \mathcal{K}(z_t) \]

Step 4 — Parameter update (with decoupled decay):

\[ \theta_{t+1} = (1-\eta_t)\theta_t + \gamma_t u_t \]
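As a concrete sketch, the four steps can be written out for the sign map \(\nabla\mathcal{K} = \mathrm{sign}\) (plain Lion); the function name and defaults below are illustrative, not from the cited papers.

```python
import numpy as np

def lion_k_step(theta, m, grad, lr, beta1=0.9, beta2=0.99, eta=1e-4,
                grad_K=np.sign):
    """One Lion-K step; grad_K = np.sign recovers Lion's update direction."""
    m_next = beta2 * m + (1.0 - beta2) * grad      # Step 1: momentum EMA
    z = beta1 * m_next + (1.0 - beta1) * grad      # Step 2: direction input
    u = -grad_K(z)                                 # Step 3: direction map
    theta_next = (1.0 - eta) * theta + lr * u      # Step 4: decoupled decay
    return theta_next, m_next
```

Swapping `grad_K` for an LMO over a norm ball (e.g. the orthogonalized direction of Muon/Scion) changes only Step 3.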

Special Cases of Lion-\(\mathcal{K}\)

Scion sits naturally inside the Lion-\(\mathcal{K}\) frame: its direction map \(\nabla\mathcal{K}\) is a Linear Minimization Oracle (LMO) over a norm ball. Muon, Lion, and Normalized-SGD are recovered the same way, each from a different choice of \(\mathcal{K}\).


2. Cautious Weight Decay (CWD)

Cautious Weight Decay (CWD) modifies standard decoupled decay to decay only coordinates whose signs align with the optimizer update direction (Chen et al., 2026).

CWD Mask and Update

Let the CWD mask be:

\[ M_t \in \{0,1\}^{\mathrm{shape}(\theta)},\qquad (M_t)_i = \mathbf{1}_{\{\mathrm{sign}(\theta_{t,i}) = \mathrm{sign}(u_{t,i})\}} \]

Apply decay only on masked coordinates:

\[ \theta_{t+1} = \theta_t - \eta_t (M_t \odot \theta_t) + \gamma_t u_t \]
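In code, the mask and the masked decay are a two-liner; a minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def cwd_step(theta, u, lr, eta):
    """Decay only coordinates whose sign agrees with the update direction u."""
    mask = np.sign(theta) == np.sign(u)              # (M_t)_i
    return theta - eta * np.where(mask, theta, 0.0) + lr * u
```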

3. Corrected Weight Decay

In decoupled weight decay, the physically meaningful quantity is the per-step multiplicative factor \(\eta\), not the decay coefficient \(\lambda\) (often written \(\eta = \gamma \lambda\)). The steady-state analysis from AdamC (Defazio, 2025) / ScionC (Chou, 2026) shows that stability requires \(\eta \propto \gamma^2\).

Steady-State Assumptions

  1. \(u_t\) has stable RMS size: \(\mathbb{E}\vert u_t\vert ^2 \approx C_u^2\).

  2. \(\theta_t\) is uncorrelated with fresh gradient noise, so any cross-correlation \(\mathbb{E}\langle \theta_t,u_t\rangle\) arises only through momentum and is captured by the factor \(S\) below.

  3. \(u_t\) may be correlated across time due to momentum.

Momentum Correlation Factor

For the standard update \(\theta_{t+1} = (1-\eta)\theta_t + \gamma u_t\), define a normalized direction autocorrelation \(\rho_k\) and a correlation-sum factor \(S\):

\[ \rho_k := \frac{\mathbb{E}\langle u_t,u_{t-k}\rangle}{\mathbb{E}\vert u_t\vert ^2},\qquad \rho_0=1,\qquad S := 1 + 2\sum_{k\ge 1}\rho_k \]

Steady-State Parameter Norm

The steady-state parameter norm satisfies (to leading order in small \(\eta\)):

\[ \mathbb{E}\vert \theta\vert ^2 \approx \frac{\gamma^2 C_u^2}{2\eta} S \]

To target a steady-state squared norm \(C_\theta^2\), solve for \(\eta\):

\[ \eta \approx \frac{\gamma^2 C_u^2 S}{2C_\theta^2} \qquad \Longrightarrow \qquad \lambda \approx \frac{\gamma C_u^2 S}{2C_\theta^2} \]

Derivation of the Steady-State Norm

Consider the one-step energy expansion:

\[ \vert \theta_{t+1}\vert ^2 = \vert (1-\eta)\theta_t + \gamma u_t\vert ^2 \]

Expanding and taking expectations:

\[ \mathbb{E}\vert \theta_{t+1}\vert ^2 = (1-\eta)^2 \mathbb{E}\vert \theta_t\vert ^2 + 2\gamma(1-\eta)\,\mathbb{E}\langle \theta_t, u_t\rangle + \gamma^2 \mathbb{E}\vert u_t\vert ^2 \]

If the directions were independent across steps, the cross term would vanish. With momentum it does not: unrolling \(\theta_t = (1-\eta)^t\theta_0 + \gamma\sum_{k\ge 1}(1-\eta)^{k-1}u_{t-k}\) gives, to leading order in small \(\eta\),

\[ \mathbb{E}\langle \theta_t, u_t\rangle \approx \gamma \sum_{k\ge 1}\rho_k\, C_u^2 \]

so the lag-0 term \(\gamma^2 C_u^2\) plus twice the cross term combine into \(\gamma^2 C_u^2\bigl(1 + 2\sum_{k\ge 1}\rho_k\bigr) = \gamma^2 C_u^2 S\). At steady state \(\mathbb{E}\vert \theta_{t+1}\vert ^2 = \mathbb{E}\vert \theta_t\vert ^2 = C_\theta^2\):

\[ C_\theta^2 = (1-\eta)^2 C_\theta^2 + \gamma^2 C_u^2 S \]
\[ C_\theta^2 [1 - (1-\eta)^2] = \gamma^2 C_u^2 S \]

For small \(\eta\): \(1 - (1-\eta)^2 = 2\eta - \eta^2 \approx 2\eta\), giving the result.
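The fixed point is easy to check numerically. The sketch below iterates \(\theta_{t+1}=(1-\eta)\theta_t+\gamma u_t\) with i.i.d. unit-RMS directions (so \(S=1\)) and compares the time-averaged per-coordinate squared norm against \(\gamma^2 C_u^2/(2\eta)\); all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma, eta, steps = 512, 0.05, 0.01, 20000
theta = np.zeros(d)
norms = []
for t in range(steps):
    u = rng.standard_normal(d)
    u /= np.sqrt(np.mean(u**2))     # unit RMS: C_u = 1; i.i.d. so rho_k = 0, S = 1
    theta = (1.0 - eta) * theta + gamma * u
    if t > steps // 2:              # discard burn-in before averaging
        norms.append(np.mean(theta**2))

measured = np.mean(norms)           # time-averaged per-coordinate E[theta^2]
predicted = gamma**2 / (2 * eta)    # gamma^2 C_u^2 S / (2 eta)
```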

Momentum Correlation Factor for Lion-\(\mathcal{K}\)

In Lion-\(\mathcal{K}\), the direction input \(z_t\) is a convex combination of momentum and the current gradient:

\[ z_t = \beta_{\text{eff}} m_t + (1-\beta_{\text{eff}}) g_t \]

where:

  • Standard Lion (\(z_t = \beta_1 m_t + (1-\beta_1) g_t\)): \(\beta_{\text{eff}} = \beta_1\).

  • Nesterov Lion-\(\mathcal{K}\) (\(z_t = \beta_1 m_{t+1} + (1-\beta_1) g_t\)): substituting \(m_{t+1}\) yields \(\beta_{\text{eff}} = \beta_1 \beta_2\).

Under the assumption of independent gradients (\(\mathbb{E}\langle g_s, g_{s'}\rangle = C'^2\,\delta_{ss'}\)), the correlation-sum factor evaluates to:

\[ S(\beta_{\text{eff}},\beta_2) = \frac{1+\beta_2}{ (1-\beta_{\text{eff}})^2(1+\beta_2) + \beta_{\text{eff}}^2(1-\beta_2) } \]

Note that Adam/Scion (where \(u_t\) directly tracks the momentum EMA) gives \(S \approx \frac{1+\beta_2}{1-\beta_2}\), while Lion’s gradient mixture drastically alters \(S\).

Derivation of \(S(\beta_{\text{eff}},\beta_2)\)

Expressing \(z_t\) as a weighted sum of past gradients, the filter weights are:

| Lag \(\ell\) | Weight \(w_\ell\) |
| --- | --- |
| \(0\) | \(1-\beta_{\text{eff}}\) |
| \(\ell \geq 1\) | \(\beta_{\text{eff}}(1-\beta_2)\beta_2^{\ell-1}\) |

Lag-0 autocorrelation, \(A_0 = w_0^2 + \sum_{\ell\geq 1}w_\ell^2\):

\[ A_0 = (1-\beta_{\text{eff}})^2 + \frac{\beta_{\text{eff}}^2(1-\beta_2)}{1+\beta_2} = \frac{(1+\beta_2)(1-2\beta_{\text{eff}}) + 2\beta_{\text{eff}}^2}{1+\beta_2} \]

Lag-\(k\) autocorrelation (\(k \geq 1\)):

\[ A_k = \frac{\beta_{\text{eff}}(1-\beta_2)(1+\beta_2-\beta_{\text{eff}})}{1+\beta_2}\cdot\beta_2^{k-1} \]

Summing: \(\sum_{k\geq 1} A_k = \frac{\beta_{\text{eff}}(1+\beta_2-\beta_{\text{eff}})}{1+\beta_2}\). Since the filter weights sum to one, \(A_0 + 2\sum_{k\geq 1}A_k = \bigl(\sum_{\ell} w_\ell\bigr)^2 = 1\), so \(S = (A_0 + 2\sum_{k\geq 1}A_k)/A_0 = 1/A_0\), which yields the stated result.
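The closed form can be sanity-checked by computing the filter autocorrelations numerically and summing them, a sketch under the same i.i.d.-gradient assumption:

```python
import numpy as np

def S_closed_form(beta_eff, beta2):
    return (1 + beta2) / ((1 - beta_eff)**2 * (1 + beta2)
                          + beta_eff**2 * (1 - beta2))

def S_numeric(beta_eff, beta2, n_lags=4000):
    # Filter weights of z_t over past gradients:
    # w_0 = 1 - beta_eff, w_l = beta_eff (1 - beta2) beta2^(l-1) for l >= 1.
    tail = beta_eff * (1 - beta2) * beta2 ** np.arange(n_lags - 1)
    w = np.concatenate(([1 - beta_eff], tail))
    A0 = np.dot(w, w)                                  # lag-0 autocorrelation
    Ak = sum(np.dot(w[:-k], w[k:]) for k in range(1, n_lags))
    return (A0 + 2 * Ak) / A0                          # S = 1 + 2 sum_k rho_k
```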


4. Corrected Cautious Weight Decay (CCWD)

CCWD Multiplier Formula

The correct weight decay multiplier \(\eta\) for a masked decay fraction \(q\) is:

\[ \eta = \frac{\gamma^2 C_u^2 S}{2 q C_\theta^2} \]

| Variable | Meaning |
| --- | --- |
| \(\gamma\) | Learning rate |
| \(C_u^2 \approx \mathbb{E}\vert u_t\vert ^2\) | Steady-state update variance |
| \(S = S(\beta_{\text{eff}},\beta_2)\) | Momentum correlation factor (Section 3) |
| \(C_\theta^2\) | Target steady-state parameter norm |
| \(q \approx p_g\) | Masked decay fraction |

Avoiding New Hyperparameters

You don’t need to manually guess \(C_\theta^2\) and \(q\). Profile a base run and measure:

  • \(C_{\theta,g}^2\): Expected parameter norm \(\mathbb{E}\vert \theta_g\vert ^2\)

  • \(p_g\): Average mask rate \(\mathbb{E}[\text{mean}(M_{t,g})]\) for group \(g\)

  • \(S_g\): Derived empirically from \(\rho_k\) or via \(S(\beta_{\text{eff}},\beta_2)\)
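Given those profiled quantities, the per-group multiplier is a one-line computation; a hypothetical helper (the name `ccwd_eta`, its arguments, and the example numbers are all illustrative):

```python
def ccwd_eta(gamma, C_u2, S, p, C_theta2):
    """Corrected cautious decay multiplier: eta = gamma^2 C_u^2 S / (2 p C_theta^2)."""
    return gamma ** 2 * C_u2 * S / (2.0 * p * C_theta2)

# Example: unit-RMS updates, an S value in the range produced by Section 3's
# formula, and roughly half the coordinates masked on an average step.
eta_g = ccwd_eta(gamma=0.02, C_u2=1.0, S=71.0, p=0.5, C_theta2=400.0)
```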

Derivation of CCWD

CWD operates via masked decay. Because fewer coordinates are shrunk per step, a naive identical \(\eta\) no longer preserves the target steady-state norm.

Step 1: Exact One-Step Energy Change

Let \(d_t := M_t \odot \theta_t\). The update is \(\theta_{t+1} = \theta_t - \eta d_t + \gamma u_t\). Expanding the squared norm:

\[ \vert \theta_{t+1}\vert ^2 = \vert \theta_t\vert ^2 - (2\eta - \eta^2)\vert d_t\vert ^2 + 2\gamma\langle \theta_t - \eta d_t, u_t\rangle + \gamma^2\vert u_t\vert ^2 \]

Step 2: Masking Fraction \(q_t\)

Define the fraction of the squared norm being shrunk:

\[ q_t := \frac{\vert d_t\vert ^2}{\vert \theta_t\vert ^2} = \frac{\vert M_t \odot \theta_t\vert ^2}{\vert \theta_t\vert ^2} \]

Thus, \((2\eta-\eta^2)\vert d_t\vert ^2 = (2\eta-\eta^2) q_t \vert \theta_t\vert ^2 \approx 2\eta q_t \vert \theta_t\vert ^2\).

Step 3: Steady-State Assumption

Treating fresh gradient noise as independent of \(\theta_t\) and capturing the momentum-induced correlation through the factor \(S\) (exactly as in Section 3), the expected steady-state norm satisfies:

\[ \mathbb{E}\vert \theta\vert ^2 \approx \frac{\gamma^2 C_u^2}{2\eta q} S \quad \Longrightarrow \quad \eta = \frac{\gamma^2 C_u^2 S}{2 q C_\theta^2} \]

Optional \(\kappa\) Feedback Controller

Because the analytical \(\eta\) formula assumes perfect orthogonality, real-world metrics may drift slightly. A slow feedback controller \(\kappa\) can lock onto the target norm.

Calculate the observed ratio:

\[ R_t := \frac{\vert \theta_t\vert ^2}{C_\theta^2} \]

Update the scale correction \(\kappa\) multiplicatively (\(c\) is a small gain, e.g., \(0.05\)):

\[ \kappa_{t+1} = \kappa_t \cdot R_t^c \]

Scale the analytical formula by \(\kappa_t\):

\[ \eta_t = \kappa_t \cdot \eta^{(\text{formula})} \]

In practice, this is mostly optional as the analytical approximation is usually quite accurate.
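To see the controller at work without running a full optimizer, one can iterate the expected-energy recursion from Section 3 with a deliberately mis-estimated analytical \(\eta\) (here 2× too large); the controller should still pull the norm onto the target. All constants are illustrative.

```python
# Deterministic recursion on e_t ~= E|theta_t|^2 with the analytical eta
# off by 2x; kappa compensates and e_t still locks onto C_theta^2.
gamma, C_theta2, c = 0.1, 1.0, 0.05
eta_formula = 2.0 * gamma**2 / (2.0 * C_theta2)   # deliberately 2x too large
e, kappa = 1.0, 1.0
for _ in range(30000):
    eta = kappa * eta_formula
    e = (1.0 - eta)**2 * e + gamma**2             # expected one-step energy
    R = e / C_theta2                              # observed ratio R_t
    kappa *= R**c                                 # kappa_{t+1} = kappa_t * R_t^c
```

At convergence \(\kappa\) settles near \(1/2\), exactly cancelling the factor-of-two error in the formula.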


5. Hyperparameter Transfer

Scaling Ratios

Define the scaling ratios (base \(\to\) target):

| Ratio | Definition | Meaning |
| --- | --- | --- |
| \(m_N\) | \(N'/N\) | Width multiplier |
| \(m_L\) | \(L'/L\) | Depth multiplier |
| \(m_B\) | \(B'/B\) | Batch size multiplier |
| \(m_D\) | \(D'/D\) | Data/duration multiplier |

Using the Complete(d)P framework (Mlodozeniec et al., 2025), we define scaling rules for Transformer models.

5.1 Initialization and Parameterization

Initialization Scaling Rules

| Component | Scaling Rule |
| --- | --- |
| Residual branch multiplier | \(\text{residual\_multiplier}' = \text{residual\_multiplier}\cdot m_L^{-\alpha}\), with \(\alpha\in\left[\frac{1}{2},1\right]\) |
| Init variance: hidden weights | \(\mathrm{Var}(W_{\text{hid}})' = \mathrm{Var}(W_{\text{hid}})\cdot m_N^{-1}\) |
| Init variance: output weights | \(\mathrm{Var}(W_{\text{out}})' = \mathrm{Var}(W_{\text{out}})\cdot m_N^{-2}\) |

Choosing \(\alpha\): Random Walk vs. Coherent Residuals

  • \(\alpha = \frac{1}{2}\) (random walk): Layer outputs are approximately independent and isotropic. Their sum grows as \(\sqrt{L}\), so each branch scales by \(1/\sqrt{L}\).

  • \(\alpha = 1\) (coherent): Residual branches are aligned, accumulating linearly as \(L\), requiring \(1/L\) scaling.

In practice, \(\alpha = 1\) is conservative; \(\alpha = \frac{1}{2}\) often works better empirically for moderate depth ranges.

5.2 Learning Rates and Training Steps

Since steps scale as \(T \propto D/B\), the target horizon is:

\[ T' = T \cdot \frac{m_D}{m_B} \]

Let the batch/duration scale factor be \(s_{BD} := \sqrt{m_B/m_D}\).

Per-Module Learning Rate Multipliers

| Module | Scaling Rule |
| --- | --- |
| Input embeddings | \(\gamma'_{\rm emb} = \gamma_{\rm emb} \cdot s_{BD}\) |
| Hidden weights | \(\gamma'_{\rm hidW} = \gamma_{\rm hidW} \cdot m_N^{-1} \cdot m_L^{\alpha-1} \cdot s_{BD}\) |
| Hidden bias/norm | \(\gamma'_{\rm hidBN} = \gamma_{\rm hidBN} \cdot m_L^{\alpha-1} \cdot s_{BD}\) |
| Output weights | \(\gamma'_{\rm outW} = \gamma_{\rm outW} \cdot m_N^{-1} \cdot s_{BD}\) |
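These per-module rules fold into a small helper; a sketch (the dictionary keys and function name are illustrative, assuming the scaling rules as stated above):

```python
import math

def lr_multipliers(m_N, m_L, m_B, m_D, alpha=0.5):
    """Per-module learning-rate multipliers (base -> target run)."""
    s_BD = math.sqrt(m_B / m_D)                    # batch/duration factor
    return {
        "emb":   s_BD,                             # input embeddings
        "hidW":  s_BD * m_L**(alpha - 1.0) / m_N,  # hidden weights
        "hidBN": s_BD * m_L**(alpha - 1.0),        # hidden bias / norm
        "outW":  s_BD / m_N,                       # output weights
    }

# Example: 4x wider, 4x deeper, 2x batch, 8x data, random-walk residuals.
mults = lr_multipliers(m_N=4, m_L=4, m_B=2, m_D=8, alpha=0.5)
```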

5.3 Momentum Transfer via Token Half-Lives

When the batch size or duration changes, the per-step \(\beta\) must be adjusted to preserve the same forgetting rate in token space.

Beta Transfer Rule

Define the half-life \(H\) of an EMA with coefficient \(\beta\) and batch size \(B\) as the number of tokens after which the weight drops to \(\frac{1}{2}\):

\[ H = -\frac{B}{\log_2 \beta} \]

Scaling the half-life with the training duration (\(H' = m_D H\)) while the token step size changes to \(\Delta\tau' = B' = m_B B\) gives:

\[ \beta' = \beta^{m_B / m_D} \qquad \text{equivalently} \qquad \beta' = 2^{-\Delta\tau'/H'} \]

where \(\Delta\tau' = D'/T' = B'\) is the number of tokens consumed per step in the target run.

Why Not Just Keep \(\beta\) Fixed?

If you halve the batch size without adjusting \(\beta\), the EMA forgets twice as fast in token space — the momentum half-life in tokens shrinks by half. For small-batch scaling this is especially destructive (Marek et al., 2025).
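A minimal sketch of the transfer rule and the token half-life it is built on (function names are illustrative):

```python
import math

def half_life_tokens(beta, B):
    """H = -B / log2(beta): tokens until an EMA weight decays to 1/2."""
    return -B / math.log2(beta)

def beta_transfer(beta, m_B, m_D):
    """Adjust beta so the half-life scales with duration: beta' = beta^(m_B/m_D)."""
    return beta ** (m_B / m_D)
```

For example, doubling the batch at fixed data (\(m_B=2\), \(m_D=1\)) maps \(\beta=0.95\) to \(\beta'=0.9025\), leaving the half-life in tokens unchanged.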


6. Complete Algorithm

Lion-\(\mathcal{K}\) with Corrected Cautious Weight Decay

Require: Initial parameters \(\theta_0\), initial momentum \(m_0 = 0\), direction map \(\nabla \mathcal{K}\)

Require: Learning rate \(\gamma\), momentum coefficients \(\beta_1, \beta_2\)

Require: Per-group target norms \(C_{\theta,g}^2\), mask rates \(p_g\), correlation factors \(S_g\)

for \(t = 0, 1, 2, \dots\) do

\(\quad\) \(g_t \leftarrow \nabla f(\theta_t)\)

\(\quad\) // Momentum update

\(\quad\) \(m_{t+1} \leftarrow \beta_2\, m_t + (1-\beta_2)\, g_t\)

\(\quad\) // Direction

\(\quad\) \(z_t \leftarrow \beta_1\, m_{t+1} + (1-\beta_1)\, g_t\)

\(\quad\) \(u_t \leftarrow -\nabla \mathcal{K}(z_t)\)

\(\quad\) // Cautious mask

\(\quad\) \((M_t)_i \leftarrow \mathbf{1}\{\mathrm{sign}(\theta_{t,i}) = \mathrm{sign}(u_{t,i})\}\)

\(\quad\) // Corrected decay (per parameter group \(g\))

\(\quad\) \(\displaystyle\eta_g \leftarrow \frac{\gamma_g^2\, C_{u,g}^2\, S_g}{2\, p_g\, C_{\theta,g}^2}\)

\(\quad\) // Parameter update

\(\quad\) \(\theta_{t+1} \leftarrow \theta_t - \eta_g\,(M_t \odot \theta_t) + \gamma_g\, u_t\)

end for
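Putting the loop together for a single parameter group with the sign map — a sketch in which the group bookkeeping and the profiled constants \(C_{u}^2\), \(S\), \(p\), \(C_\theta^2\) are assumed given:

```python
import numpy as np

def lion_k_ccwd_step(theta, m, grad, lr, beta1, beta2,
                     C_u2, S, p, C_theta2, grad_K=np.sign):
    """One Lion-K + CCWD step for a single parameter group."""
    m = beta2 * m + (1.0 - beta2) * grad              # momentum EMA
    z = beta1 * m + (1.0 - beta1) * grad              # Nesterov-style direction input
    u = -grad_K(z)                                    # bounded direction
    mask = np.sign(theta) == np.sign(u)               # cautious mask M_t
    eta = lr**2 * C_u2 * S / (2.0 * p * C_theta2)     # corrected decay multiplier
    theta = theta - eta * np.where(mask, theta, 0.0) + lr * u
    return theta, m
```

In the toy call below the masked decay exactly offsets the update push, i.e., the parameter already sits at its steady-state scale and stays there.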

Caveats for Output Layers

The steady-state independence assumption frequently breaks down for the cross-entropy output layer. You may need to exclude the output unembedding layer from corrected decay or manage it separately (Chou, 2026).

Conclusion

Combining Complete(d)P (Mlodozeniec et al., 2025), corrected weight decay from AdamC/ScionC (Chou, 2026), and bounded direction maps from Lion-\(\mathcal{K}\) with CCWD (Chen et al., 2026) yields a theoretically grounded hyperparameter transfer mechanism for sign/LMO-based optimizers.

References

  1. Asymptotic Estimation of AdamW’s Weight RMS (Part 1). Scientific Spaces. Retrieved March 22, 2026, from https://kexue.fm/archives/11307
  2. Bernstein, J. (2025). Deriving Muon. https://jeremybernste.in/writing/deriving-muon
  3. Chen, L., Li, J., Liang, K., Su, B., Xie, C., Pierse, N. W., Liang, C., Lao, N., & Liu, Q. (2026). Cautious Weight Decay (Issue arXiv:2510.12402). arXiv. https://doi.org/10.48550/arXiv.2510.12402
  4. Chen, L., Liu, B., Liang, K., & Liu, Q. (2025). Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts (Issue arXiv:2310.05898). arXiv. https://doi.org/10.48550/arXiv.2310.05898
  5. Chou, J. C.-C. (2026). Correction of Decoupled Weight Decay (Issue arXiv:2512.08217). arXiv. https://doi.org/10.48550/arXiv.2512.08217
  6. Defazio, A. (2025). Why Gradients Rapidly Increase Near the End of Training (Issue arXiv:2506.02285). arXiv. https://doi.org/10.48550/arXiv.2506.02285
  7. Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., & Bernstein, J. (2024). Muon: An Optimizer for Hidden Layers in Neural Networks. https://kellerjordan.github.io/posts/muon/
  8. Marek, M., Lotfi, S., Somasundaram, A., Wilson, A. G., & Goldblum, M. (2025). Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful (Issue arXiv:2507.07101). arXiv. https://doi.org/10.48550/arXiv.2507.07101
  9. Mlodozeniec, B., Ablin, P., Béthune, L., Busbridge, D., Klein, M., Ramapuram, J., & Cuturi, M. (2025). Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration (Issue arXiv:2512.22382). arXiv. https://doi.org/10.48550/arXiv.2512.22382
  10. Muon Is a Nuclear Lion King. Retrieved March 23, 2026, from https://www.cs.utexas.edu/~lqiang/lionk/html/intro.html
  11. Yang, G., Simon, J. B., & Bernstein, J. (2024). A Spectral Condition for Feature Learning (Issue arXiv:2310.17813). arXiv. https://doi.org/10.48550/arXiv.2310.17813
  12. Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., & Gao, J. (2022). Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (Issue arXiv:2203.03466). arXiv. https://doi.org/10.48550/arXiv.2203.03466
  13. Yang, G., Yu, D., Zhu, C., & Hayou, S. (2023). Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks (Issue arXiv:2310.02244). arXiv. https://doi.org/10.48550/arXiv.2310.02244
This post is licensed under CC BY 4.0 by the author.