A Simpler Parametrization for Modern Optimizers
A compact math-first note on replacing raw optimizer knobs with action coordinates: state memory, additive update size, and direct shrinkage.
Summary
Two small changes of variables give a simpler and more robust parametrization of modern optimizers:
Direct Shrinkage: Replace the coupled \((1-\eta\lambda)\) multiplier from decoupled weight decay with a strictly positive shrink factor \(a \in (0,1\rbrack\) that is independent of the learning rate \(\eta\).
Half-Life Coordinates: Parametrize the per-step factors (\(\beta\) and \(a\)) via half-lives \(h\). Defining \(h\) in units of tokens or samples makes the underlying timescales invariant, allowing easier hyperparameter transfer across different batch sizes.
State-Based Optimizer
\[ \boxed{ \begin{aligned} s^+ &= F(s,g),\\ u &= U(s^+,g),\\ \theta^+ &= a\theta+\eta u. \end{aligned} } \]The variables are parameters \(\theta\), optimizer state \(s\), stochastic gradient signal \(g\), update direction \(u\), additive scale \(\eta\), and direct shrink factor \(a\). For normalized optimizers, the direction is measured in a declared layerwise norm, e.g. \(\Vert u\Vert =1\).
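A minimal sketch of one such step (the state map, direction map, and numbers below are illustrative placeholders, not part of the framework):

```python
import numpy as np

def optimizer_step(theta, s, g, eta, a, F, U):
    """One update: s+ = F(s, g), u = U(s+, g), theta+ = a*theta + eta*u."""
    s_next = F(s, g)                  # state update
    u = U(s_next, g)                  # update direction (e.g. normalized per layer)
    return a * theta + eta * u, s_next

# Illustrative choices: an EMA state and a sign direction.
F = lambda s, g, beta=0.95: beta * s + (1.0 - beta) * g
U = lambda s, g: np.sign(s)

theta, s = np.zeros(4), np.zeros(4)
g = np.array([0.3, -1.2, 0.1, 2.5])
theta, s = optimizer_step(theta, s, g, eta=0.01, a=0.999, F=F, U=U)
```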
Sources
Half-life parametrizes EMA retention as an additive \(\log_2\) coordinate (Marek et al., 2025). Direct shrinkage separates the weight-shrink action from the additive learning-rate scale (Kosson et al., 2026).
1. Multiplicative Coordinates
Natural Bases
Continuous and discrete time have different unit-rate exponentials, i.e. bases for which the derivative or forward difference reproduces the function with unit coefficient:
\[ D b^t=(\log b)b^t \quad\Rightarrow\quad b=e, \]\[ \Delta b^n=(b-1)b^n \quad\Rightarrow\quad b=2. \]Continuous rates use base \(e\); discrete half-life coordinates use base \(2\).
Halving Exponent
For \(x\in(0,1\rbrack\),
\[ \boxed{ H_x=-\log_2x, \qquad x=2^{-H_x}. } \]Multiplication becomes addition:
\[ H_{\prod_k x_k} = \sum_k H_{x_k}. \]
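A short numerical check of the additivity (the factors are arbitrary examples):

```python
import math

def H(x):
    """Halving exponent: x = 2**(-H(x)) for x in (0, 1]."""
    return -math.log2(x)

factors = [0.9, 0.99, 0.5]
assert math.isclose(H(math.prod(factors)), sum(H(x) for x in factors))  # both ≈ 1.167
```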
Half-Life
Fix a scalar training count \(\tau\): updates, samples, tokens, epochs, or another monotone count. For
\[ c(\Delta\tau)=2^{-q\Delta\tau}, \qquad \lbrack q\rbrack =\lbrack \tau\rbrack ^{-1}, \]the half-life \(h\) is defined by \(c(h)=1/2\):
\[ \boxed{ \lbrack h\rbrack =\lbrack \Delta\tau\rbrack =\lbrack \tau\rbrack , \qquad qh=1, \qquad H_{c(\Delta\tau)} = \frac{\Delta\tau}{h}. } \]
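A small sketch of the conversion between half-life and per-step factor (the token numbers are arbitrary):

```python
def factor_from_half_life(delta_tau, h):
    """Per-step factor c(delta_tau) = 2**(-delta_tau / h); delta_tau and h share units."""
    return 2.0 ** (-delta_tau / h)

h = 2**20                                     # half-life, e.g. in tokens
assert factor_from_half_life(h, h) == 0.5     # advancing one half-life halves the factor
c = factor_from_half_life(65_536, h)          # 2**(-1/16) ≈ 0.9576 per update
```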
Continuous Analogue
\[ c(t_0,t_1) = \exp\left(-\int_{t_0}^{t_1}r(t)\,dt\right). \]
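A numerical sketch of the continuous retention, approximating the integral by a Riemann sum; the (constant) rate value is arbitrary:

```python
import math

def retention(rate, t0, t1, num=10_000):
    """c(t0, t1) = exp(-∫ r(t) dt), approximated by a left Riemann sum."""
    dt = (t1 - t0) / num
    return math.exp(-sum(rate(t0 + k * dt) for k in range(num)) * dt)

r = 0.01
half_life = math.log(2) / r                    # interval over which retention drops to 1/2
print(retention(lambda t: r, 0.0, half_life))  # ≈ 0.5
```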
2. EMA Memory
EMA Retention
\[ m^+ = \beta m+(1-\beta)g, \qquad H_\beta=-\log_2\beta. \]
Memory Half-Life
If one update advances the chosen count by \(\Delta\tau\) and the memory half-life is \(h_\beta\), then
\[ \boxed{ H_\beta=\frac{\Delta\tau}{h_\beta}, \qquad \beta=2^{-\Delta\tau/h_\beta}. } \]
Token Count
For language models, processed tokens give
\[ \tau = \frac{\text{tokens}}{\text{sequence}} \cdot \frac{\text{sequences}}{\text{batch}} \cdot \text{batches}. \]
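With made-up sequence length, batch size, and half-life, one update's count increment and the resulting retention are:

```python
tokens_per_sequence = 2048
sequences_per_batch = 32
delta_tau = tokens_per_sequence * sequences_per_batch   # 65,536 tokens per update

h_beta = 10_000_000                      # memory half-life, in tokens
beta = 2.0 ** (-delta_tau / h_beta)      # ≈ 0.9955
```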
Count-Preserving Rescaling
For \(r=\Delta\tau^\ast /\Delta\tau\),
\[ \boxed{ H_\beta^\ast =rH_\beta, \qquad \beta^\ast =\beta^r. } \]The count half-life is unchanged. The update-count half-life \(n_\beta=1/H_\beta\) rescales as
\[ n_\beta^\ast =\frac{1}{r}n_\beta. \]
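A sketch of the rescaling when the per-update count changes, e.g. the batch size doubles (numbers are illustrative):

```python
import math

beta = 0.99                                   # retention tuned at 65,536 tokens per update
delta_tau, delta_tau_new = 65_536, 131_072    # batch size doubles
r = delta_tau_new / delta_tau                 # r = 2

beta_new = beta ** r                          # 0.9801; H_beta* = r * H_beta
h_tokens = -delta_tau / math.log2(beta)       # token half-life, unchanged by the rescaling
assert math.isclose(h_tokens, -delta_tau_new / math.log2(beta_new))
```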
Nesterov Readout
With stored retention \(\beta\) and readout blend \(\mu\),
\[ \tilde m=\beta m+(1-\beta)g, \qquad z=\mu\tilde m+(1-\mu)g = \mu\beta m+(1-\mu\beta)g. \]Only \(\beta\) carries memory across the training count. For \(r=\Delta\tau^\ast /\Delta\tau\),
\[ \boxed{ \beta^\ast =\beta^r, \qquad \mu^\ast =\mu. } \]In \((\beta_1,\beta_2)=(\mu,\beta)\) notation,
\[ \boxed{ \beta_2^\ast =\beta_2^r, \qquad \beta_1^\ast =\beta_1. } \]
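A sketch of the readout and the transfer rule, in the note's \((\mu,\beta)\) naming; the values are illustrative:

```python
import numpy as np

def nesterov_readout(m, g, beta, mu):
    """z = mu*beta*m + (1 - mu*beta)*g; only the EMA state m persists across steps."""
    m_next = beta * m + (1.0 - beta) * g
    z = mu * m_next + (1.0 - mu) * g
    return z, m_next

m, g = np.zeros(3), np.array([1.0, -2.0, 0.5])
z, m = nesterov_readout(m, g, beta=0.95, mu=0.9)

# Transfer to a doubled per-update count: rescale only the stored retention.
r = 2.0
beta_new, mu_new = 0.95 ** r, 0.9
```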
Nesterov Transfer
Count-preserving interpolation keeps the stored state's count half-life fixed while leaving the dimensionless readout blend untouched: matching the stored state gives \(\beta^\ast =\beta^r\); matching the readout gives \(\mu^\ast =\mu\).
This coordinate change is the fixed token-half-life rule for memory in small-batch language-model training (Marek et al., 2025).
3. Direct Shrinkage
Shrink Action
\[ \theta^+ = a\theta+\eta u, \qquad H_a=-\log_2a, \qquad a=2^{-H_a}. \]Composition is additive:
\[ H_{a_{s:t}} = H_{\prod_{k=s}^{t-1}a_k} = \sum_{k=s}^{t-1}H_{a_k}. \]
Shrink Half-Life
Use the same half-life coordinate as EMA memory:
\[ \boxed{ H_a=\frac{\Delta\tau}{h_a}, \qquad a=2^{-\Delta\tau/h_a}. } \]Here \(h_a\) is measured in the chosen count \(\tau\).
Independent Shrink
Kosson et al. motivate treating weight shrinkage as its own action, independent of the additive learning-rate scale (Kosson et al., 2026). In this parametrization, \(h_a\) controls multiplicative shrinkage and \(\eta\) controls the additive update.
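To make the decoupling concrete, a sketch contrasting the coupled \((1-\eta\lambda)\) multiplier with the direct shrink \(a=2^{-\Delta\tau/h_a}\); the learning rate, \(\lambda\), and half-life values are made up:

```python
import numpy as np

theta = np.ones(4)
u = np.array([0.1, -0.2, 0.3, -0.4])          # update direction from the optimizer

# Coupled form: the shrink multiplier moves whenever the learning rate moves.
eta, lam = 0.02, 0.1
theta_coupled = (1.0 - eta * lam) * theta + eta * u

# Direct form: the shrink is its own action, set by a half-life in the training count.
delta_tau, h_a = 65_536, 50_000_000           # tokens per update, shrink half-life in tokens
a = 2.0 ** (-delta_tau / h_a)
theta_direct = a * theta + eta * u            # halving eta leaves a untouched
```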
4. Hyperparameter Parametrization
Fix the optimizer family: state update, direction map, and layerwise norm constraints. Expose memory, readout, additive scale, and shrinkage in action coordinates.
Direct Coordinates
\[ \boxed{ \left( h_\beta,\, \mu,\, \eta_t,\, h_a \right) } \]with
\[ \beta_t=2^{-\Delta\tau_t/h_\beta}, \qquad a_t=2^{-\Delta\tau_t/h_a}, \qquad \theta_{t+1}=a_t\theta_t+\eta_tu_t. \]Dimensionless readout blends transfer as \(\mu^\ast =\mu\).
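A compact sketch of one step in these coordinates; the direction map below is a plain normalization of the Nesterov readout, chosen only to keep the example self-contained:

```python
import numpy as np

def step(theta, m, g, delta_tau, h_beta, mu, eta, h_a):
    """theta_{t+1} = a_t*theta_t + eta_t*u_t in half-life coordinates."""
    beta = 2.0 ** (-delta_tau / h_beta)       # retention for this step's count increment
    a = 2.0 ** (-delta_tau / h_a)             # direct shrink for this step's count increment
    m = beta * m + (1.0 - beta) * g           # stored state
    z = mu * m + (1.0 - mu) * g               # Nesterov readout
    u = z / (np.linalg.norm(z) + 1e-12)       # illustrative unit-norm direction
    return a * theta + eta * u, m

theta, m = np.ones(3), np.zeros(3)
g = np.array([0.5, -1.0, 0.25])
theta, m = step(theta, m, g, delta_tau=65_536,
                h_beta=10_000_000, mu=0.9, eta=0.01, h_a=50_000_000)
```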
Appendix: Notation
Symbols
| Symbol | Meaning |
| --- | --- |
| \(\theta\) | Parameters |
| \(s\) | Optimizer state |
| \(g\) | Stochastic gradient signal |
| \(\tau\) | Chosen training count |
| \(\Delta\tau\) | Count advanced by one optimizer update |
| \(h\) | Half-life in units of \(\tau\) |
| \(H_x\) | Halving exponent, \(H_x=-\log_2x\) |
| \(\beta\) | EMA retention factor |
| \(\mu\) | Nesterov readout blend |
| \(a\) | Direct shrink factor |
| \(\eta\) | Additive step scale |
| \(u\) | Update direction |
| \(\Vert \cdot\Vert\) | Layerwise norm; normalized directions satisfy \(\Vert u\Vert =1\) |

Starred quantities denote transferred values.
Appendix: ScionC Example
Scion supplies unit-norm LMO (\(\operatorname{ulmo}\)) directions (Pethick et al., 2025). The corrected-decay variant motivates treating shrinkage as its own component (Chou, 2026). In half-life coordinates, the invariant hyperparameters are \(h_\beta\) and \(h_a\); the raw factors \(\beta\) and \(a\) are computed from the current count increment \(\Delta\tau\).
Inside the layer loop, \(\theta\), \(m\), \(\eta\), \(h_\beta\), \(h_a\), \(\Vert \cdot\Vert\), and \(\operatorname{ulmo}\) are local to the current layer. The random variable \(\xi\) denotes the sampled minibatch.
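A minimal sketch of such a layer loop, assuming per-layer dictionaries for the local quantities; the `ulmo` below is a placeholder normalization, not Scion's actual norm-specific LMO:

```python
import numpy as np

def ulmo(z):
    """Placeholder unit-norm LMO; a plain Frobenius normalization stands in for
    the layer-norm-specific direction map of Scion (Pethick et al., 2025)."""
    return z / (np.linalg.norm(z) + 1e-12)

def scionc_step(layers, grads, delta_tau, mu):
    """One ScionC-style step in half-life coordinates. `layers` holds per-layer
    theta, m, eta, h_beta, h_a; `grads` are the gradients on the sampled minibatch xi."""
    for layer, g in zip(layers, grads):
        beta = 2.0 ** (-delta_tau / layer["h_beta"])    # per-layer retention
        a = 2.0 ** (-delta_tau / layer["h_a"])          # per-layer direct shrink
        layer["m"] = beta * layer["m"] + (1.0 - beta) * g
        z = mu * layer["m"] + (1.0 - mu) * g            # Nesterov readout
        layer["theta"] = a * layer["theta"] + layer["eta"] * ulmo(z)

layers = [{"theta": np.ones(4), "m": np.zeros(4), "eta": 0.01,
           "h_beta": 10_000_000, "h_a": 50_000_000}]
scionc_step(layers, [np.array([0.5, -1.0, 0.25, 2.0])], delta_tau=65_536, mu=0.9)
```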
References
- Chou, J. C.-C. (2026). Correction of Decoupled Weight Decay. arXiv:2512.08217. https://doi.org/10.48550/arXiv.2512.08217
- Kosson, A., Welborn, J., Liu, Y., Jaggi, M., & Chen, X. (2026). Weight Decay May Matter More than muP for Learning Rate Transfer in Practice. arXiv:2510.19093. https://doi.org/10.48550/arXiv.2510.19093
- Marek, M., Lotfi, S., Somasundaram, A., Wilson, A. G., & Goldblum, M. (2025). Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful. arXiv:2507.07101. https://doi.org/10.48550/arXiv.2507.07101
- Pethick, T., Xie, W., Antonakopoulos, K., Zhu, Z., Silveti-Falls, A., & Cevher, V. (2025). Training Deep Learning Models with Norm-Constrained LMOs. arXiv:2502.07529. https://doi.org/10.48550/arXiv.2502.07529