A Simpler Parametrization for Modern Optimizers
A compact math-first note on replacing raw optimizer knobs with action coordinates: state memory, additive update size, and direct shrinkage.
Summary
Two small changes of variables give a simpler and more robust parametrization of modern optimizers:
Direct Shrinkage: Replace the coupled \((1-\eta\lambda)\) multiplier from decoupled weight decay with a strictly positive shrink factor \(a \in (0,1\rbrack\) that is independent of the learning rate \(\eta\).
Half-Life Coordinates: Parametrize the per-step factors (\(\beta\) and \(a\)) via half-lives \(h\). Defining \(h\) in units of tokens or samples makes the underlying timescales invariant, allowing easier hyperparameter transfer across different batch sizes.
State-Based Optimizer
\[ \boxed{ \begin{aligned} s^+ &= F(s,g),\\ u &= U(s^+,g),\\ \theta^+ &= a\theta+\eta u. \end{aligned} } \]The variables are parameters \(\theta\), optimizer state \(s\), stochastic gradient signal \(g\), update direction \(u\), additive scale \(\eta\), and direct shrink factor \(a\). For normalized optimizers, the direction is measured in a declared layerwise norm, e.g. \(\Vert u\Vert =1\).
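A minimal sketch of one such step (the state map, direction map, and numbers below are illustrative placeholders, not part of the framework):

```python
import numpy as np

def optimizer_step(theta, s, g, eta, a, F, U):
    """One update: s+ = F(s, g), u = U(s+, g), theta+ = a*theta + eta*u."""
    s_next = F(s, g)                  # state update
    u = U(s_next, g)                  # update direction (e.g. normalized per layer)
    return a * theta + eta * u, s_next

# Illustrative choices: an EMA state and a sign direction.
F = lambda s, g, beta=0.95: beta * s + (1.0 - beta) * g
U = lambda s, g: np.sign(s)

theta, s = np.zeros(4), np.zeros(4)
g = np.array([0.3, -1.2, 0.1, 2.5])
theta, s = optimizer_step(theta, s, g, eta=0.01, a=0.999, F=F, U=U)
```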
Sources
Half-life parametrizes EMA retention as an additive \(\log_2\) coordinate (Marek et al., 2025). Direct shrinkage separates the weight-shrink action from the additive learning-rate scale (Kosson et al., 2026).
1. Multiplicative Coordinates
Natural Bases
Continuous and discrete time have different unit-rate exponentials, i.e. bases for which the derivative or forward difference reproduces the function with unit coefficient:
\[ D b^t=(\log b)b^t \quad\Rightarrow\quad b=e, \]\[ \Delta b^n=(b-1)b^n \quad\Rightarrow\quad b=2. \]Continuous rates use base \(e\); discrete half-life coordinates use base \(2\).
Halving Exponent
For \(x\in(0,1\rbrack\),
\[ \boxed{ H_x=-\log_2x, \qquad x=2^{-H_x}. } \]Multiplication becomes addition:
\[ H_{\prod_k x_k} = \sum_k H_{x_k}. \]
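A short numerical check of the additivity (the factors are arbitrary examples):

```python
import math

def H(x):
    """Halving exponent: x = 2**(-H(x)) for x in (0, 1]."""
    return -math.log2(x)

factors = [0.9, 0.99, 0.5]
assert math.isclose(H(math.prod(factors)), sum(H(x) for x in factors))  # both ≈ 1.167
```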
Half-Life
Fix a scalar training count \(\tau\): updates, samples, tokens, epochs, or another monotone count. For
\[ c(\Delta\tau)=2^{-q\Delta\tau}, \qquad \lbrack q\rbrack =\lbrack \tau\rbrack ^{-1}, \]the half-life \(h\) is defined by \(c(h)=1/2\):
\[ \boxed{ \lbrack h\rbrack =\lbrack \Delta\tau\rbrack =\lbrack \tau\rbrack , \qquad qh=1, \qquad H_{c(\Delta\tau)} = \frac{\Delta\tau}{h}. } \]
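A small sketch of the conversion between half-life and per-step factor (the token numbers are arbitrary):

```python
def factor_from_half_life(delta_tau, h):
    """Per-step factor c(delta_tau) = 2**(-delta_tau / h); delta_tau and h share units."""
    return 2.0 ** (-delta_tau / h)

h = 2**20                                     # half-life, e.g. in tokens
assert factor_from_half_life(h, h) == 0.5     # advancing one half-life halves the factor
c = factor_from_half_life(65_536, h)          # 2**(-1/16) ≈ 0.9576 per update
```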
Continuous Analogue
\[ c(t_0,t_1) = \exp\left(-\int_{t_0}^{t_1}r(t)\,dt\right). \]
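A numerical sketch of the continuous retention, approximating the integral by a Riemann sum; the (constant) rate value is arbitrary:

```python
import math

def retention(rate, t0, t1, num=10_000):
    """c(t0, t1) = exp(-∫ r(t) dt), approximated by a left Riemann sum."""
    dt = (t1 - t0) / num
    return math.exp(-sum(rate(t0 + k * dt) for k in range(num)) * dt)

r = 0.01
half_life = math.log(2) / r                    # interval over which retention drops to 1/2
print(retention(lambda t: r, 0.0, half_life))  # ≈ 0.5
```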
2. EMA Memory
EMA Retention
\[ m^+ = \beta m+(1-\beta)g, \qquad H_\beta=-\log_2\beta. \]
Memory Half-Life
If one update advances the chosen count by \(\Delta\tau\) and the memory half-life is \(h_\beta\), then
\[ \boxed{ H_\beta=\frac{\Delta\tau}{h_\beta}, \qquad \beta=2^{-\Delta\tau/h_\beta}. } \]
Token Count
For language models, processed tokens give
\[ \tau = \frac{\text{tokens}}{\text{sequence}} \cdot \frac{\text{sequences}}{\text{batch}} \cdot \text{batches}. \]
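With made-up sequence length, batch size, and half-life, one update's count increment and the resulting retention are:

```python
tokens_per_sequence = 2048
sequences_per_batch = 32
delta_tau = tokens_per_sequence * sequences_per_batch   # 65,536 tokens per update

h_beta = 10_000_000                      # memory half-life, in tokens
beta = 2.0 ** (-delta_tau / h_beta)      # ≈ 0.9955
```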
Count-Preserving Rescaling
For \(r=\Delta\tau^\ast /\Delta\tau\),
\[ \boxed{ H_\beta^\ast =rH_\beta, \qquad \beta^\ast =\beta^r. } \]The count half-life is unchanged. The update-count half-life \(n_\beta=1/H_\beta\) rescales as
\[ n_\beta^\ast =\frac{1}{r}n_\beta. \]
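A sketch of the rescaling when the per-update count changes, e.g. the batch size doubles (numbers are illustrative):

```python
import math

beta = 0.99                                   # retention tuned at 65,536 tokens per update
delta_tau, delta_tau_new = 65_536, 131_072    # batch size doubles
r = delta_tau_new / delta_tau                 # r = 2

beta_new = beta ** r                          # 0.9801; H_beta* = r * H_beta
h_tokens = -delta_tau / math.log2(beta)       # token half-life, unchanged by the rescaling
assert math.isclose(h_tokens, -delta_tau_new / math.log2(beta_new))
```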
Nesterov Readout
With stored retention \(\beta\) and readout blend \(\mu\),
\[ \tilde m=\beta m+(1-\beta)g, \qquad z=\mu\tilde m+(1-\mu)g = \mu\beta m+(1-\mu\beta)g. \]Only \(\beta\) carries memory across the training count. For \(r=\Delta\tau^\ast /\Delta\tau\),
\[ \boxed{ \beta^\ast =\beta^r, \qquad \mu^\ast =\mu. } \]In \((\beta_1,\beta_2)=(\mu,\beta)\) notation,
\[ \boxed{ \beta_2^\ast =\beta_2^r, \qquad \beta_1^\ast =\beta_1. } \]
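A sketch of the readout and the transfer rule, in the note's \((\mu,\beta)\) naming; the values are illustrative:

```python
import numpy as np

def nesterov_readout(m, g, beta, mu):
    """z = mu*beta*m + (1 - mu*beta)*g; only the EMA state m persists across steps."""
    m_next = beta * m + (1.0 - beta) * g
    z = mu * m_next + (1.0 - mu) * g
    return z, m_next

m, g = np.zeros(3), np.array([1.0, -2.0, 0.5])
z, m = nesterov_readout(m, g, beta=0.95, mu=0.9)

# Transfer to a doubled per-update count: rescale only the stored retention.
r = 2.0
beta_new, mu_new = 0.95 ** r, 0.9
```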
Nesterov Transfer
Count-preserving interpolation keeps the stored state's count half-life fixed while leaving the dimensionless readout blend untouched: matching the stored state gives \(\beta^\ast =\beta^r\); matching the readout gives \(\mu^\ast =\mu\).
This coordinate change is the fixed token-half-life rule for memory in small-batch language-model training (Marek et al., 2025).
3. Direct Shrinkage
Shrink Action
\[ \theta^+ = a\theta+\eta u, \qquad H_a=-\log_2a, \qquad a=2^{-H_a}. \]Composition is additive:
\[ H_{a_{s:t}} = H_{\prod_{k=s}^{t-1}a_k} = \sum_{k=s}^{t-1}H_{a_k}. \]
Shrink Half-Life
Use the same half-life coordinate as EMA memory:
\[ \boxed{ H_a=\frac{\Delta\tau}{h_a}, \qquad a=2^{-\Delta\tau/h_a}. } \]Here \(h_a\) is measured in the chosen count \(\tau\).
Independent Shrink
Kosson et al. motivate treating weight shrinkage as its own action, independent of the additive learning-rate scale (Kosson et al., 2026). In this parametrization, \(h_a\) controls multiplicative shrinkage and \(\eta\) controls the additive update.
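To make the decoupling concrete, a sketch contrasting the coupled \((1-\eta\lambda)\) multiplier with the direct shrink \(a=2^{-\Delta\tau/h_a}\); the learning rate, \(\lambda\), and half-life values are made up:

```python
import numpy as np

theta = np.ones(4)
u = np.array([0.1, -0.2, 0.3, -0.4])          # update direction from the optimizer

# Coupled form: the shrink multiplier moves whenever the learning rate moves.
eta, lam = 0.02, 0.1
theta_coupled = (1.0 - eta * lam) * theta + eta * u

# Direct form: the shrink is its own action, set by a half-life in the training count.
delta_tau, h_a = 65_536, 50_000_000           # tokens per update, shrink half-life in tokens
a = 2.0 ** (-delta_tau / h_a)
theta_direct = a * theta + eta * u            # halving eta leaves a untouched
```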
4. Hyperparameter Parametrization
Fix the optimizer family: state update, direction map, and layerwise norm constraints. Expose memory, readout, additive scale, and shrinkage in action coordinates.
Direct Coordinates
\[ \boxed{ \left( h_\beta,\, \mu,\, \eta_t,\, h_a \right) } \]with
\[ \beta_t=2^{-\Delta\tau_t/h_\beta}, \qquad a_t=2^{-\Delta\tau_t/h_a}, \qquad \theta_{t+1}=a_t\theta_t+\eta_tu_t. \]Dimensionless readout blends transfer as \(\mu^\ast =\mu\).
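A compact sketch of one step in these coordinates; the direction map below is a plain normalization of the Nesterov readout, chosen only to keep the example self-contained:

```python
import numpy as np

def step(theta, m, g, delta_tau, h_beta, mu, eta, h_a):
    """theta_{t+1} = a_t*theta_t + eta_t*u_t in half-life coordinates."""
    beta = 2.0 ** (-delta_tau / h_beta)       # retention for this step's count increment
    a = 2.0 ** (-delta_tau / h_a)             # direct shrink for this step's count increment
    m = beta * m + (1.0 - beta) * g           # stored state
    z = mu * m + (1.0 - mu) * g               # Nesterov readout
    u = z / (np.linalg.norm(z) + 1e-12)       # illustrative unit-norm direction
    return a * theta + eta * u, m

theta, m = np.ones(3), np.zeros(3)
g = np.array([0.5, -1.0, 0.25])
theta, m = step(theta, m, g, delta_tau=65_536,
                h_beta=10_000_000, mu=0.9, eta=0.01, h_a=50_000_000)
```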
Appendix: Notation
Symbols
| Symbol | Meaning |
| --- | --- |
| \(\theta\) | Parameters |
| \(s\) | Optimizer state |
| \(g\) | Stochastic gradient signal |
| \(\tau\) | Chosen training count |
| \(\Delta\tau\) | Count advanced by one optimizer update |
| \(h\) | Half-life in units of \(\tau\) |
| \(H_x\) | Halving exponent, \(H_x=-\log_2x\) |
| \(\beta\) | EMA retention factor |
| \(\mu\) | Nesterov readout blend |
| \(a\) | Direct shrink factor |
| \(\eta\) | Additive step scale |
| \(u\) | Update direction |
| \(\Vert \cdot\Vert\) | Layerwise norm; normalized directions satisfy \(\Vert u\Vert =1\) |

Starred quantities denote transferred values.
Appendix: ScionC Example
Scion supplies unit-norm LMO (\(\operatorname{ulmo}\)) directions (Pethick et al., 2025). The corrected-decay variant motivates treating shrinkage as its own component (Chou, 2026). In half-life coordinates, the invariant hyperparameters are \(h_\beta\) and \(h_a\); the raw factors \(\beta\) and \(a\) are computed from the current count increment \(\Delta\tau\).
Inside the layer loop, \(\theta\), \(m\), \(\eta\), \(h_\beta\), \(h_a\), \(\Vert \cdot\Vert\), and \(\operatorname{ulmo}\) are local to the current layer. The random variable \(\xi\) denotes the sampled minibatch.
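A minimal sketch of such a layer loop, assuming per-layer dictionaries for the local quantities; the `ulmo` below is a placeholder normalization, not Scion's actual norm-specific LMO:

```python
import numpy as np

def ulmo(z):
    """Placeholder unit-norm LMO; a plain Frobenius normalization stands in for
    the layer-norm-specific direction map of Scion (Pethick et al., 2025)."""
    return z / (np.linalg.norm(z) + 1e-12)

def scionc_step(layers, grads, delta_tau, mu):
    """One ScionC-style step in half-life coordinates. `layers` holds per-layer
    theta, m, eta, h_beta, h_a; `grads` are the gradients on the sampled minibatch xi."""
    for layer, g in zip(layers, grads):
        beta = 2.0 ** (-delta_tau / layer["h_beta"])    # per-layer retention
        a = 2.0 ** (-delta_tau / layer["h_a"])          # per-layer direct shrink
        layer["m"] = beta * layer["m"] + (1.0 - beta) * g
        z = mu * layer["m"] + (1.0 - mu) * g            # Nesterov readout
        layer["theta"] = a * layer["theta"] + layer["eta"] * ulmo(z)

layers = [{"theta": np.ones(4), "m": np.zeros(4), "eta": 0.01,
           "h_beta": 10_000_000, "h_a": 50_000_000}]
scionc_step(layers, [np.array([0.5, -1.0, 0.25, 2.0])], delta_tau=65_536, mu=0.9)
```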
References
- Chou, J. C.-C. (2026). Correction of Decoupled Weight Decay. arXiv:2512.08217. https://doi.org/10.48550/arXiv.2512.08217
- Kosson, A., Welborn, J., Liu, Y., Jaggi, M., & Chen, X. (2026). Weight Decay May Matter More than muP for Learning Rate Transfer in Practice. arXiv:2510.19093. https://doi.org/10.48550/arXiv.2510.19093
- Marek, M., Lotfi, S., Somasundaram, A., Wilson, A. G., & Goldblum, M. (2025). Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful. arXiv:2507.07101. https://doi.org/10.48550/arXiv.2507.07101
- Pethick, T., Xie, W., Antonakopoulos, K., Zhu, Z., Silveti-Falls, A., & Cevher, V. (2025). Training Deep Learning Models with Norm-Constrained LMOs. arXiv:2502.07529. https://doi.org/10.48550/arXiv.2502.07529