Lion-K CCWD: Corrected Cautious Weight Decay and Hyperparameter Transfer
Derivation of Lion-K with Corrected Cautious Weight Decay (CCWD) and transformation rules for hyperparameter transfer.
Overview
This post derives Lion-\(\mathcal{K}\) with Corrected Cautious Weight Decay (CCWD) and provides hyperparameter-transfer rules for scaling across width, depth, batch size, and duration.
Key assumptions:
Normalized updates accumulate like a random walk. For bounded optimizer directions (sign, LMO), total parameter displacement after \(T\) steps scales as \(\gamma\sqrt{T}\), requiring \(\gamma \propto \sqrt{B/D}\).
Momentum and decay are parameterized by half-lives in tokens. This yields exact formulas for betas and decay instead of linear approximations (Marek et al., 2025).
1. Lion-\(\mathcal{K}\)
Lion-\(\mathcal{K}\) Update Rule (Chen et al., 2025)
Input: Parameters \(\theta_t\), gradient \(g_t = \nabla f(\theta_t)\), momentum state \(m_t\), direction map \(\nabla \mathcal{K}\).
Step 1 — Momentum update:
\[ m_{t+1} = \beta_2 m_t + (1-\beta_2) g_t \]
Step 2 — Direction input (a common Lion-\(\mathcal{K}\) choice):
\[ z_t = \beta_1 m_{t+1} + (1-\beta_1) g_t \]
Step 3 — Direction map:
\[ u_t = -\nabla \mathcal{K}(z_t) \]
Step 4 — Parameter update (with decoupled decay):
\[ \theta_{t+1} = (1-\eta_t)\theta_t + \gamma_t u_t \]
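As a concrete instance, the four steps above can be sketched in a few lines of numpy using the sign direction map (classic Lion, \(\mathcal{K}(z) = \Vert z\Vert_1\)); the function name and default hyperparameters here are illustrative, not prescribed by the post:

```python
import numpy as np

def lion_k_step(theta, m, g, gamma=1e-3, eta=1e-4, beta1=0.9, beta2=0.99):
    """One Lion-K step with the sign direction map (classic Lion)."""
    m_next = beta2 * m + (1 - beta2) * g           # Step 1: momentum EMA
    z = beta1 * m_next + (1 - beta1) * g           # Step 2: direction input
    u = -np.sign(z)                                # Step 3: direction map, K = L1 norm
    theta_next = (1 - eta) * theta + gamma * u     # Step 4: decoupled decay + update
    return theta_next, m_next
```

Swapping `np.sign` for an orthogonalization routine or another LMO recovers the other members of the family.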
Special Cases of Lion-\(\mathcal{K}\)
Scion sits naturally inside the Lion-\(\mathcal{K}\) frame, where \(\partial\mathcal{K}\) is a Linear Minimization Oracle (LMO) over a norm ball. This includes Muon, Lion, and Normalized-SGD.
2. Cautious Weight Decay (CWD)
Cautious Weight Decay (CWD) modifies standard decoupled decay to decay only coordinates whose signs align with the optimizer update direction (Chen et al., 2026).
CWD Mask and Update
Let the CWD mask be:
\[ M_t \in \{0,1\}^{\mathrm{shape}(\theta)},\qquad (M_t)_i = \mathbf{1}_{\{\mathrm{sign}(\theta_{t,i}) = \mathrm{sign}(u_{t,i})\}} \]
Apply decay only on masked coordinates:
\[ \theta_{t+1} = \theta_t - \eta_t (M_t \odot \theta_t) + \gamma_t u_t \]
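The mask and masked update are simple elementwise operations; a minimal numpy sketch (function name illustrative):

```python
import numpy as np

def cwd_step(theta, u, gamma, eta):
    """CWD update: decay only coordinates whose sign agrees with the
    optimizer direction u; the rest are left undecayed."""
    mask = (np.sign(theta) == np.sign(u)).astype(theta.dtype)  # M_t
    return theta - eta * mask * theta + gamma * u
```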
3. Corrected Weight Decay
In decoupled weight decay, the physically meaningful quantity is the per-step multiplicative factor \(\eta\), not the decay coefficient \(\lambda\) (often written \(\eta = \gamma \lambda\)). The steady-state analysis from AdamC (Defazio, 2025) / ScionC (Chou, 2026) shows that stability requires \(\eta \propto \gamma^2\).
Steady-State Assumptions
\(u_t\) has stable RMS size: \(\mathbb{E}\vert u_t\vert ^2 \approx C_u^2\).
Cross-term vanishes in expectation for independent directions: \(\mathbb{E}\langle \theta_t,u_t\rangle \approx 0\) (momentum-induced correlations are accounted for separately through \(S\)).
\(u_t\) may be correlated across time due to momentum.
Momentum Correlation Factor
For the standard update \(\theta_{t+1} = (1-\eta)\theta_t + \gamma u_t\), define a normalized direction autocorrelation \(\rho_k\) and a correlation-sum factor \(S\):
\[ \rho_k := \frac{\mathbb{E}\langle u_t,u_{t-k}\rangle}{\mathbb{E}\vert u_t\vert ^2},\qquad \rho_0=1,\qquad S := 1 + 2\sum_{k\ge 1}\rho_k \]
Steady-State Parameter Norm
The steady-state parameter norm satisfies (to leading order in small \(\eta\)):
\[ \mathbb{E}\vert \theta\vert ^2 \approx \frac{\gamma^2 C_u^2}{2\eta} S \]
To target a steady-state squared norm \(C_\theta^2\), solve for \(\eta\):
\[ \eta \approx \frac{\gamma^2 C_u^2 S}{2C_\theta^2} \qquad \Longrightarrow \qquad \lambda \approx \frac{\gamma C_u^2 S}{2C_\theta^2} \]
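To make the formula concrete, here is a worked numeric example; all of the numbers are illustrative placeholders, not values from the post:

```python
# Illustrative values: learning rate, update variance, correlation factor,
# and target squared norm (none of these come from a real run).
gamma, C_u2, S, C_theta2 = 3e-4, 1.0, 2.0, 50.0

eta = gamma**2 * C_u2 * S / (2 * C_theta2)  # per-step multiplicative decay
lam = eta / gamma                           # decay coefficient, since eta = gamma * lambda
print(eta, lam)                             # 1.8e-09 6e-06
```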
Derivation of the Steady-State Norm
Consider the one-step energy expansion:
\[ \vert \theta_{t+1}\vert ^2 = (1-\eta)^2\vert \theta_t\vert ^2 + 2\gamma(1-\eta)\langle \theta_t, u_t\rangle + \gamma^2 \vert u_t\vert ^2 \]
Expanding and taking expectations under the steady-state assumptions: the last term contributes \(\gamma^2 C_u^2\); for independent directions the cross-term vanishes, but with momentum \(\theta_t\) contains past directions, so \(\mathbb{E}\langle \theta_t, u_t\rangle \approx \gamma C_u^2 \sum_{k\ge 1}\rho_k = \gamma C_u^2 (S-1)/2\). Together, the stochastic terms contribute \(\gamma^2 C_u^2 S\) to leading order:
\[ \mathbb{E}\vert \theta_{t+1}\vert ^2 \approx (1-\eta)^2\,\mathbb{E}\vert \theta_t\vert ^2 + \gamma^2 C_u^2 S \]
At steady state \(\mathbb{E}\vert \theta_{t+1}\vert ^2 = \mathbb{E}\vert \theta_t\vert ^2 = C_\theta^2\):
\[ \big(1-(1-\eta)^2\big)\,C_\theta^2 \approx \gamma^2 C_u^2 S \]
For small \(\eta\): \(1 - (1-\eta)^2 = 2\eta - \eta^2 \approx 2\eta\), giving the result.
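The leading-order result is easy to check by simulation in the uncorrelated case (\(S = 1\)), where iid standard Gaussian directions give \(C_u^2 = d\). This sketch compares the time-averaged squared norm against the prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma, eta, T = 256, 1e-2, 2e-3, 100_000

theta = np.zeros(d)
norms = []
for t in range(T):
    u = rng.standard_normal(d)            # iid directions: rho_k = 0 for k >= 1, so S = 1
    theta = (1 - eta) * theta + gamma * u
    if t > T // 2:                        # discard burn-in (mixing time ~ 1/eta steps)
        norms.append(theta @ theta)

measured = float(np.mean(norms))
predicted = gamma**2 * d / (2 * eta)      # gamma^2 C_u^2 S / (2 eta), C_u^2 = d, S = 1
print(measured / predicted)               # close to 1
```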
Momentum Correlation Factor for Lion-\(\mathcal{K}\)
In Lion-\(\mathcal{K}\), the direction input \(z_t\) is a convex combination of momentum and the current gradient:
\[ z_t = \beta_{\text{eff}} m_t + (1-\beta_{\text{eff}}) g_t \]
where:
Standard Lion (\(z_t = \beta_1 m_t + (1-\beta_1) g_t\)): \(\beta_{\text{eff}} = \beta_1\).
Nesterov Lion-\(\mathcal{K}\) (\(z_t = \beta_1 m_{t+1} + (1-\beta_1) g_t\)): substituting \(m_{t+1}\) yields \(\beta_{\text{eff}} = \beta_1 \beta_2\).
Under the assumption of temporally independent gradients with common scale (\(\mathbb{E}\langle g_s, g_{s'}\rangle = C_g^2\,\delta_{ss'}\)), the correlation-sum factor evaluates to:
\[ S(\beta_{\text{eff}},\beta_2) = \frac{1+\beta_2}{ (1-\beta_{\text{eff}})^2(1+\beta_2) + \beta_{\text{eff}}^2(1-\beta_2) } \]
Note that Adam/Scion (where \(u_t\) directly tracks the momentum EMA, i.e. \(\beta_{\text{eff}} = 1\)) gives \(S = \frac{1+\beta_2}{1-\beta_2}\), while Lion’s gradient mixture drastically alters \(S\).
Derivation of \(S(\beta_{\text{eff}},\beta_2)\)
Expressing \(z_t\) as a weighted sum of past gradients, the filter weights are:
| Lag \(\ell\) | Weight \(w_\ell\) |
|---|---|
| \(0\) | \(1-\beta_{\text{eff}}\) |
| \(\ell \geq 1\) | \(\beta_{\text{eff}}(1-\beta_2)\beta_2^{\ell-1}\) |
Lag-0 autocorrelation:
\[ A_0 = w_0^2 + \sum_{\ell\geq 1}w_\ell^2 = (1-\beta_{\text{eff}})^2 + \frac{\beta_{\text{eff}}^2(1-\beta_2)}{1+\beta_2} \]
Lag-\(k\) autocorrelation (\(k \geq 1\)):
\[ A_k = \sum_{\ell\geq 0} w_\ell w_{\ell+k} = \beta_{\text{eff}}(1-\beta_{\text{eff}})(1-\beta_2)\beta_2^{k-1} + \frac{\beta_{\text{eff}}^2(1-\beta_2)\beta_2^{k}}{1+\beta_2} \]
Summing over lags:
\[ \sum_{k\geq 1} A_k = \frac{\beta_{\text{eff}}(1+\beta_2-\beta_{\text{eff}})}{1+\beta_2} \]
Since \(S = (A_0 + 2\sum_{k\geq 1}A_k)/A_0\) and the numerator simplifies to \(1\), we get \(S = 1/A_0\), which yields the stated result.
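The closed form can be cross-checked numerically by building the filter weights, computing \(A_0\) and the lagged sums directly, and comparing; the truncation length `L` is an implementation choice:

```python
import numpy as np

def S_closed_form(beff, b2):
    """Closed-form correlation-sum factor S(beta_eff, beta_2)."""
    return (1 + b2) / ((1 - beff)**2 * (1 + b2) + beff**2 * (1 - b2))

def S_from_weights(beff, b2, L=4000):
    """Numerical S from the filter weights of z_t over past gradients:
    w_0 = 1 - beff, w_l = beff (1 - b2) b2^(l-1) for l >= 1."""
    w = np.concatenate(([1 - beff], beff * (1 - b2) * b2 ** np.arange(L - 1)))
    A0 = np.dot(w, w)
    lagged = [np.dot(w[:-k], w[k:]) for k in range(1, L)]
    return (A0 + 2 * np.sum(lagged)) / A0

print(S_closed_form(0.9 * 0.99, 0.99), S_from_weights(0.9 * 0.99, 0.99))  # the two agree
```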
4. Corrected Cautious Weight Decay (CCWD)
CCWD Multiplier Formula
The corrected weight decay multiplier \(\eta\) for a masked decay fraction \(q\) is:
\[ \eta = \frac{\gamma^2 C_u^2 S}{2 q C_\theta^2} \]
| Variable | Meaning |
|---|---|
| \(\gamma\) | Learning rate |
| \(C_u^2 \approx \mathbb{E}\vert u_t\vert ^2\) | Steady-state update variance |
| \(S = S(\beta_{\text{eff}},\beta_2)\) | Momentum correlation factor (Section 3) |
| \(C_\theta^2\) | Target steady-state parameter norm |
| \(q \approx p_g\) | Masked decay fraction |
Avoiding New Hyperparameters
You don’t need to manually guess \(C_\theta^2\) and \(q\). Profile a base run and measure:
\(C_{\theta,g}^2\): Expected parameter norm \(\mathbb{E}\vert \theta_g\vert ^2\)
\(p_g\): Average mask rate \(\mathbb{E}[\text{mean}(M_{t,g})]\) for group \(g\)
\(S_g\): Derived empirically from \(\rho_k\) or via \(S(\beta_{\text{eff}},\beta_2)\)
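A profiling helper along these lines might look as follows; the snapshot-list interface is an assumption for illustration, not an API from any of the cited works:

```python
import numpy as np

def profile_group(thetas, masks):
    """Estimate C_theta^2 and the mask rate p for one parameter group from
    logged snapshots of a base run: `thetas` is a list of parameter arrays,
    `masks` the corresponding CWD masks."""
    C_theta2 = float(np.mean([np.sum(t * t) for t in thetas]))
    p = float(np.mean([np.mean(m) for m in masks]))
    return C_theta2, p
```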
Derivation of CCWD
CWD operates via masked decay. Because fewer coordinates are shrunk per step, a naive identical \(\eta\) no longer preserves the target steady-state norm.
Step 1: Exact One-Step Energy Change
Let \(d_t := M_t \odot \theta_t\). The update is \(\theta_{t+1} = \theta_t - \eta d_t + \gamma u_t\). Since the mask is binary, \(\langle \theta_t, d_t\rangle = \vert d_t\vert ^2\), and expanding the squared norm gives:
\[ \vert \theta_{t+1}\vert ^2 = \vert \theta_t\vert ^2 - (2\eta - \eta^2)\vert d_t\vert ^2 + \gamma^2 \vert u_t\vert ^2 + 2\gamma\langle \theta_t - \eta d_t,\, u_t\rangle \]
Step 2: Masking Fraction \(q_t\)
Define the fraction of the squared norm being shrunk:
\[ q_t := \frac{\vert d_t\vert ^2}{\vert \theta_t\vert ^2} = \frac{\sum_i (M_t)_i\,\theta_{t,i}^2}{\sum_i \theta_{t,i}^2} \]
Thus, \((2\eta-\eta^2)\vert d_t\vert ^2 = (2\eta-\eta^2)\, q_t \vert \theta_t\vert ^2 \approx 2\eta q_t \vert \theta_t\vert ^2\).
Step 3: Steady-State Assumption
Assuming independence (\(\mathbb{E}\langle \theta_t, u_t \rangle \approx 0\)) and incorporating the momentum correlation factor \(S\) as in Section 3, the expected steady-state norm satisfies:
\[ 2\eta\, q\, C_\theta^2 \approx \gamma^2 C_u^2 S \qquad \Longrightarrow \qquad \eta \approx \frac{\gamma^2 C_u^2 S}{2 q C_\theta^2} \]
where \(q := \mathbb{E}[q_t]\).
Optional \(\kappa\) Feedback Controller
Because the analytical \(\eta\) formula assumes perfect orthogonality, real-world metrics may drift slightly. A slow feedback controller \(\kappa\) can lock onto the target norm.
Calculate the observed ratio of the measured parameter norm to its target:
\[ r_t = \frac{\mathbb{E}\vert \theta_t\vert ^2}{C_\theta^2} \]
Update the scale correction \(\kappa\) multiplicatively (\(c\) is a small gain, e.g., \(0.05\)); one natural choice is:
\[ \kappa_{t+1} = \kappa_t\, r_t^{\,c} \]
Scale the analytical formula by \(\kappa_t\):
\[ \eta_t = \kappa_t\,\frac{\gamma^2 C_u^2 S}{2 q C_\theta^2} \]
In practice, this is mostly optional as the analytical approximation is usually quite accurate.
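One simple realization of such a controller, consistent with the description above (the exact multiplicative update form here is a choice, not prescribed by the post):

```python
def kappa_update(kappa, theta_sq_norm, C_theta2, c=0.05):
    """Slow multiplicative feedback step: if the measured squared norm
    exceeds the target C_theta^2 (ratio r > 1), nudge the decay scale
    kappa upward; if it undershoots, nudge it downward."""
    r = theta_sq_norm / C_theta2   # observed ratio r_t
    return kappa * r ** c          # kappa_{t+1} = kappa_t * r_t^c
```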
5. Hyperparameter Transfer
Scaling Ratios
Define the scaling ratios (base \(\to\) target):
| Ratio | Definition | Meaning |
|---|---|---|
| \(m_N\) | \(N'/N\) | Width multiplier |
| \(m_L\) | \(L'/L\) | Depth multiplier |
| \(m_B\) | \(B'/B\) | Batch size multiplier |
| \(m_D\) | \(D'/D\) | Data/duration multiplier |
Using the Complete(d)P framework (Mlodozeniec et al., 2025), we define scaling rules for Transformer models.
5.1 Initialization and Parameterization
Initialization Scaling Rules
| Component | Scaling Rule |
|---|---|
| Residual branch multiplier | \(\text{residual_multiplier}' = \text{residual_multiplier}\cdot m_L^{-\alpha}\), with \(\alpha\in\left[\frac{1}{2},1\right]\) |
| Init variance: hidden weights | \(\mathrm{Var}(W_{\text{hid}})' = \mathrm{Var}(W_{\text{hid}})\cdot m_N^{-1}\) |
| Init variance: output weights | \(\mathrm{Var}(W_{\text{out}})' = \mathrm{Var}(W_{\text{out}})\cdot m_N^{-2}\) |
Choosing \(\alpha\): Random Walk vs. Coherent Residuals
\(\alpha = \frac{1}{2}\) (random walk): Layer outputs are approximately independent and isotropic. Their sum grows as \(\sqrt{L}\), so each branch scales by \(1/\sqrt{L}\).
\(\alpha = 1\) (coherent): Residual branches are aligned, accumulating linearly as \(L\), requiring \(1/L\) scaling.
In practice, \(\alpha = 1\) is conservative; \(\alpha = \frac{1}{2}\) often works better empirically for moderate depth ranges.
5.2 Learning Rates and Training Steps
Since steps scale as \(T \propto D/B\), the target horizon is:
\[ T' = T\cdot \frac{m_D}{m_B} \]
Let the batch/duration scale factor be \(s_{BD} := \sqrt{m_B/m_D}\).
Per-Module Learning Rate Multipliers
| Module | Scaling Rule |
|---|---|
| Input embeddings | \(\gamma'_{\rm emb} = \gamma_{\rm emb} \cdot s_{BD}\) |
| Hidden weights | \(\gamma'_{\rm hidW} = \gamma_{\rm hidW} \cdot m_N^{-1} \cdot m_L^{\alpha-1} \cdot s_{BD}\) |
| Hidden bias/norm | \(\gamma'_{\rm hidBN} = \gamma_{\rm hidBN} \cdot m_L^{\alpha-1} \cdot s_{BD}\) |
| Output weights | \(\gamma'_{\rm outW} = \gamma_{\rm outW} \cdot m_N^{-1} \cdot s_{BD}\) |
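These rules translate directly into a small helper; the module names and dict interface are illustrative:

```python
def scaled_lrs(base, mN, mL, mB, mD, alpha=0.5):
    """Apply the per-module learning-rate transfer rules.
    `base` maps module kind -> base-run learning rate."""
    s_bd = (mB / mD) ** 0.5                                # batch/duration factor
    return {
        "emb":   base["emb"]   * s_bd,                     # input embeddings
        "hidW":  base["hidW"]  * s_bd / mN * mL**(alpha - 1),  # hidden weights
        "hidBN": base["hidBN"] * s_bd * mL**(alpha - 1),   # hidden bias/norm
        "outW":  base["outW"]  * s_bd / mN,                # output weights
    }
```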
5.3 Momentum Transfer via Token Half-Lives
When the batch size or duration changes, the per-step \(\beta\) must be adjusted to preserve the same forgetting rate in token space.
Beta Transfer Rule
Define the half-life \(H\) of an EMA with coefficient \(\beta\) and batch size \(B\) as the number of tokens after which the weight drops to \(\frac{1}{2}\):
\[ H = -\frac{B}{\log_2 \beta} \]
Holding the half-life fixed as a fraction of the total token budget (\(H' = H\, m_D\)) while changing the batch size gives:
\[ \beta' = \beta^{m_B / m_D} \qquad \text{equivalently} \qquad \beta' = 2^{-\Delta\tau'/H'} \]
where \(\Delta\tau' = B'\) is the token step size of the target run.
Why Not Just Keep \(\beta\) Fixed?
If you halve the batch size without adjusting \(\beta\), the EMA forgets twice as fast in token space: the momentum window shrinks by half. For small-batch scaling this is especially destructive (Marek et al., 2025).
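The transfer rule and half-life formula can be packaged as two one-liners; note that with \(m_D = 1\) the rule exactly preserves the half-life in tokens:

```python
import math

def transfer_beta(beta, mB, mD):
    """Momentum transfer rule: beta' = beta ** (mB / mD)."""
    return beta ** (mB / mD)

def half_life_tokens(beta, B):
    """EMA half-life in tokens: H = -B / log2(beta)."""
    return -B / math.log2(beta)
```

For example, doubling the batch at fixed duration gives \(\beta' = \beta^2\), and the token half-life comes out unchanged.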
6. Complete Algorithm
Lion-\(\mathcal{K}\) with Corrected Cautious Weight Decay
Require: Initial parameters \(\theta_0\), initial momentum \(m_0 = 0\), direction map \(\nabla \mathcal{K}\)
Require: Learning rate \(\gamma\), momentum coefficients \(\beta_1, \beta_2\)
Require: Per-group target norms \(C_{\theta,g}^2\), mask rates \(p_g\), correlation factors \(S_g\)
for \(t = 0, 1, 2, \dots\) do
\(\quad\) \(g_t \leftarrow \nabla f(\theta_t)\)
\(\quad\) // Momentum update
\(\quad\) \(m_{t+1} \leftarrow \beta_2\, m_t + (1-\beta_2)\, g_t\)
\(\quad\) // Direction
\(\quad\) \(z_t \leftarrow \beta_1\, m_{t+1} + (1-\beta_1)\, g_t\)
\(\quad\) \(u_t \leftarrow -\nabla \mathcal{K}(z_t)\)
\(\quad\) // Cautious mask
\(\quad\) \((M_t)_i \leftarrow \mathbf{1}\{\mathrm{sign}(\theta_{t,i}) = \mathrm{sign}(u_{t,i})\}\)
\(\quad\) // Corrected decay (per parameter group \(g\))
\(\quad\) \(\displaystyle\eta_g \leftarrow \frac{\gamma_g^2\, C_{u,g}^2\, S_g}{2\, p_g\, C_{\theta,g}^2}\)
\(\quad\) // Parameter update
\(\quad\) \(\theta_{t+1} \leftarrow \theta_t - \eta_g\,(M_t \odot \theta_t) + \gamma_g\, u_t\)
end for
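Putting the pieces together, a single step of the algorithm with the sign direction map might be sketched as below; the group constants are assumed to come from profiling a base run, and the flat function interface is illustrative:

```python
import numpy as np

def lion_k_ccwd_step(theta, m, g, gamma, beta1, beta2, C_u2, S, p, C_theta2, kappa=1.0):
    """One step of Lion-K with CCWD for a single parameter group,
    using the sign direction map (classic Lion)."""
    m = beta2 * m + (1 - beta2) * g                          # momentum EMA
    z = beta1 * m + (1 - beta1) * g                          # direction input
    u = -np.sign(z)                                          # bounded direction
    mask = (np.sign(theta) == np.sign(u))                    # cautious mask
    eta = kappa * gamma**2 * C_u2 * S / (2 * p * C_theta2)   # corrected decay
    theta = theta - eta * mask * theta + gamma * u
    return theta, m
```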
Caveats for Output Layers
The steady-state independence assumption frequently breaks down for the cross-entropy output layer. You may need to exclude the output unembedding layer from corrected decay or manage it separately (Chou, 2026).
Conclusion
Combining Complete(d)P (Mlodozeniec et al., 2025), corrected weight decay from AdamC/ScionC (Chou, 2026), and bounded direction maps from Lion-\(\mathcal{K}\) with CCWD (Chen et al., 2026) yields a theoretically grounded hyperparameter transfer mechanism for sign/LMO-based optimizers.
References
- Asymptotic Estimation of AdamW’s Weight RMS (Part 1). Scientific Spaces. Retrieved March 22, 2026, from https://kexue.fm/archives/11307
- Bernstein, J. (2025). Deriving Muon. https://jeremybernste.in/writing/deriving-muon
- Chen, L., Li, J., Liang, K., Su, B., Xie, C., Pierse, N. W., Liang, C., Lao, N., & Liu, Q. (2026). Cautious Weight Decay (Issue arXiv:2510.12402). arXiv. https://doi.org/10.48550/arXiv.2510.12402
- Chen, L., Liu, B., Liang, K., & Liu, Q. (2025). Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts (Issue arXiv:2310.05898). arXiv. https://doi.org/10.48550/arXiv.2310.05898
- Chou, J. C.-C. (2026). Correction of Decoupled Weight Decay (Issue arXiv:2512.08217). arXiv. https://doi.org/10.48550/arXiv.2512.08217
- Defazio, A. (2025). Why Gradients Rapidly Increase Near the End of Training (Issue arXiv:2506.02285). arXiv. https://doi.org/10.48550/arXiv.2506.02285
- Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., & Bernstein, J. (2024). Muon: An Optimizer for Hidden Layers in Neural Networks. https://kellerjordan.github.io/posts/muon/
- Marek, M., Lotfi, S., Somasundaram, A., Wilson, A. G., & Goldblum, M. (2025). Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful (Issue arXiv:2507.07101). arXiv. https://doi.org/10.48550/arXiv.2507.07101
- Mlodozeniec, B., Ablin, P., Béthune, L., Busbridge, D., Klein, M., Ramapuram, J., & Cuturi, M. (2025). Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration (Issue arXiv:2512.22382). arXiv. https://doi.org/10.48550/arXiv.2512.22382
- Muon Is a Nuclear Lion King. Retrieved March 23, 2026, from https://www.cs.utexas.edu/~lqiang/lionk/html/intro.html
- Yang, G., Simon, J. B., & Bernstein, J. (2024). A Spectral Condition for Feature Learning (Issue arXiv:2310.17813). arXiv. https://doi.org/10.48550/arXiv.2310.17813
- Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., & Gao, J. (2022). Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (Issue arXiv:2203.03466). arXiv. https://doi.org/10.48550/arXiv.2203.03466
- Yang, G., Yu, D., Zhu, C., & Hayou, S. (2023). Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks (Issue arXiv:2310.02244). arXiv. https://doi.org/10.48550/arXiv.2310.02244