RMS Norm
Characterizing properties of the Root-Mean-Square Norm for vectors
Root-Mean-Square (RMS) Norm for Vectors
While various norms are ubiquitous in mathematics and engineering, deep learning practice often benefits from a dimension-invariant scale for vectors. The RMS norm provides this by normalizing the Euclidean norm by the square root of the dimension.
Definition. RMS Norm (for Vectors)
For
\[ x \in \mathbb{R}^n \](
\[ n \ge 1 \]), the root-mean-square norm is
\[ \Vert x \Vert_{\mathrm{RMS}} \;=\; \frac{\Vert x \Vert_2}{\sqrt{n}} \;=\; \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2} \]where
\[ \Vert x \Vert_2 = \sqrt{\sum_{i=1}^n x_i^2} \]is the standard Euclidean norm.
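The two equivalent forms in the definition can be checked with a small NumPy sketch (the helper name `rms_norm` is ours, not from the text):

```python
import numpy as np

def rms_norm(x: np.ndarray) -> float:
    """Root-mean-square norm: ||x||_2 / sqrt(n)."""
    x = np.asarray(x, dtype=float)
    return float(np.linalg.norm(x) / np.sqrt(x.size))

# The two forms in the definition agree:
x = np.array([3.0, -4.0])
assert np.isclose(rms_norm(x), np.sqrt(np.mean(x ** 2)))
print(rms_norm(x))  # 5 / sqrt(2) ≈ 3.5355
```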
Properties and Rationale for Vector RMS Norm
- Dimension neutrality. If the coordinates of \[ x \in \mathbb{R}^n \] are i.i.d. random variables with zero mean and unit variance (e.g., \[ x_i \sim \mathcal{N}(0, 1) \]), then \[ \mathbb{E}[\Vert x \Vert_2^2] = n \], and \[ \Vert x \Vert_{\mathrm{RMS}} \approx 1 \] for large \[ n \]. This property makes the notion of “unit-size” more consistent for vectors of varying dimensions, such as network activations from layers of different widths.
- Rotational and Orthogonal Invariance. The RMS norm is a positive scalar multiple of the \[ \ell_2 \]-norm. The \[ \ell_2 \]-norm is invariant under rotations (\[ R \in SO(n) \]) and, more generally, under all orthogonal transformations (\[ Q \in O(n) \]), which include reflections. The RMS norm inherits these crucial geometric symmetries.
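Both properties can be illustrated numerically. A minimal NumPy sketch (helper names are ours), using a QR factorization to produce a random orthogonal matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_norm(x):
    return np.linalg.norm(x) / np.sqrt(x.size)

# Dimension neutrality: i.i.d. N(0, 1) vectors have RMS norm ≈ 1 at every
# width, while the Euclidean norm grows like sqrt(n).
for n in (16, 256, 4096):
    x = rng.standard_normal(n)
    print(f"n={n:5d}  ||x||_2={np.linalg.norm(x):8.2f}  ||x||_RMS={rms_norm(x):.3f}")

# Orthogonal invariance: a random orthogonal Q (via QR) leaves the RMS norm
# unchanged.
n = 64
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
x = rng.standard_normal(n)
assert np.isclose(rms_norm(Q @ x), rms_norm(x))
```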
Theorem 1. Rotationally Invariant Functions
A function
\[ f: \mathbb{R}^n \to \mathbb{R} \](
\[ n \ge 2 \]) is rotationally invariant (i.e.,
\[ f(Rx) = f(x) \]for all rotation matrices
\[ R \in SO(n) \]and all
\[ x \in \mathbb{R}^n \]) if and only if there exists a function
\[ g: \mathbb{R}_{\ge 0} \to \mathbb{R} \]such that:
\[ f(x) = g(\Vert x \Vert_2) \quad \forall x \in \mathbb{R}^n \]
Note on the case
\[ n=1 \]For
\[ n=1 \], the space is
\[ \mathbb{R} \]. The special orthogonal group
\[ SO(1) \]contains only the identity matrix
\[ [1] \]. Thus, any function
\[ f: \mathbb{R} \to \mathbb{R} \]is trivially rotationally invariant (i.e.,
\[ f(1 \cdot x_1) = f(x_1) \]). Such a function can be written as
\[ f(x_1) = g(\Vert x_1 \Vert_2) = g(\vert x_1 \vert) \]if and only if
\[ f \]is an even function (i.e.,
\[ f(x_1)=f(-x_1) \]).
Proof of Theorem 1.
(\[ \Leftarrow \]) **Sufficiency (for all \[ n \ge 1 \]):** Assume \[ f(x) = g(\Vert x \Vert_2) \] for some function \[ g \]. For any rotation matrix \[ R \in SO(n) \], rotations preserve the Euclidean norm: \[ \Vert Rx \Vert_2 = \Vert x \Vert_2 \]. Then, \[ f(Rx) = g(\Vert Rx \Vert_2) = g(\Vert x \Vert_2) = f(x) \]. Thus, \[ f \] is rotationally invariant.
(\[ \Rightarrow \]) Necessity:
- **Case \[ n \ge 2 \]:** Assume \[ f \] is rotationally invariant. * If \[ x = 0 \], define \[ g(0) = f(0) \]. Then \[ f(0) = g(\Vert 0 \Vert_2) \]. * If \[ x \neq 0 \], let \[ r = \Vert x \Vert_2 > 0 \]. For any \[ y \in \mathbb{R}^n \] with \[ \Vert y \Vert_2 = r \], there exists \[ R \in SO(n) \] such that \[ y = Rx \] (since \[ SO(n) \] acts transitively on spheres for \[ n \ge 2 \]). By rotational invariance, \[ f(y) = f(Rx) = f(x) \]. Thus, \[ f(x) \] depends only on \[ \Vert x \Vert_2 \]. Define \[ g(r) = f(v_r) \] for any fixed \[ v_r \] with \[ \Vert v_r \Vert_2 = r \] (e.g., \[ v_r = r\, e_1 \]). Then \[ f(x) = g(\Vert x \Vert_2) \]. (Using \[ v_r \] to avoid confusion with function \[ g \].)
- **Case \[ n = 1 \]:** (Covered in the note within the theorem statement.) For completeness, if \[ f(x_1) = g(\vert x_1 \vert) \], then \[ f(-x_1) = g(\vert -x_1 \vert) = g(\vert x_1 \vert) = f(x_1) \], so \[ f \] must be an even function. Conversely, if \[ f \] is an even function, define \[ g(r) = f(r) \] for \[ r \ge 0 \]. Then for any \[ x_1 \in \mathbb{R} \], \[ g(\vert x_1 \vert) = f(\vert x_1 \vert) \]. Since \[ f \] is even, \[ f(\vert x_1 \vert) = f(x_1) \] if \[ x_1 \ge 0 \] and \[ f(\vert x_1 \vert) = f(-x_1) = f(x_1) \] if \[ x_1 < 0 \]. So \[ f(x_1) = g(\vert x_1 \vert) \].
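The sufficiency direction lends itself to a quick numerical spot-check. The sketch below (assuming NumPy; the choice g = cos is arbitrary) builds a rotation in SO(n) by sign-correcting a QR-derived orthogonal matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sufficiency direction of Theorem 1: any f of the form f(x) = g(||x||_2)
# is unchanged by rotations. g = cos is an arbitrary choice.
g = np.cos
f = lambda v: g(np.linalg.norm(v))

n = 5
x = rng.standard_normal(n)

# Build R in SO(n): take an orthogonal matrix from QR and flip one column
# if det = -1, so the result is a proper rotation.
R, _ = np.linalg.qr(rng.standard_normal((n, n)))
if np.linalg.det(R) < 0:
    R[:, 0] = -R[:, 0]
assert np.isclose(np.linalg.det(R), 1.0)
assert np.isclose(f(R @ x), f(x))
```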
Corollary 1.1. Rotationally Invariant Norms
If a function
\[ \Vert \cdot \Vert : \mathbb{R}^n \to \mathbb{R} \]is a norm and is rotationally invariant, then it must be a positive scalar multiple of the Euclidean norm:
\[ \Vert x \Vert = c \Vert x \Vert_2 \quad \forall x \in \mathbb{R}^n, \text{ for some constant } c > 0 \]
Proof of Corollary 1.1.
Let \[ \Vert \cdot \Vert \] be a rotationally invariant norm.
- By Theorem 1: * For \[ n \ge 2 \], since \[ \Vert \cdot \Vert \] is rotationally invariant, \[ \Vert x \Vert = g(\Vert x \Vert_2) \] for some \[ g: \mathbb{R}_{\ge 0} \to \mathbb{R} \]. * For \[ n = 1 \], a norm \[ \Vert \cdot \Vert \] is an even function (\[ \Vert -x \Vert = \vert -1 \vert \, \Vert x \Vert = \Vert x \Vert \]). By the \[ n = 1 \] case of Theorem 1, \[ \Vert x \Vert = g(\vert x \vert) = g(\Vert x \Vert_2) \]. Thus, for any \[ n \ge 1 \], \[ \Vert x \Vert = g(\Vert x \Vert_2) \].
- By absolute homogeneity of norms, \[ \Vert t x \Vert = \vert t \vert \, \Vert x \Vert \] for all \[ t \in \mathbb{R} \]. So, \[ g(\Vert t x \Vert_2) = \vert t \vert \, g(\Vert x \Vert_2) \]. Since \[ \Vert t x \Vert_2 = \vert t \vert \, \Vert x \Vert_2 \], we have \[ g(\vert t \vert \, \Vert x \Vert_2) = \vert t \vert \, g(\Vert x \Vert_2) \]. Let \[ s = \Vert x \Vert_2 \] and \[ \lambda = \vert t \vert \]. Then \[ g(\lambda s) = \lambda\, g(s) \] for all \[ \lambda, s \ge 0 \]. This is Cauchy’s functional equation for homogeneous functions on \[ \mathbb{R}_{\ge 0} \].
- If \[ s > 0 \], set \[ s = 1 \] to get \[ g(\lambda) = \lambda\, g(1) \]. Let \[ c = g(1) \]. Then \[ g(s) = c\, s \] for \[ s > 0 \]. Since \[ g(0) = \Vert 0 \Vert = 0 \], we have \[ g(0) = 0 = c \cdot 0 \]. The relation \[ g(s) = c\, s \] also gives \[ g(0) = 0 \], so it holds for all \[ s \ge 0 \]. Thus, \[ \Vert x \Vert = c\, \Vert x \Vert_2 \].
- Since \[ \Vert \cdot \Vert \] is a norm, for \[ x \neq 0 \], \[ \Vert x \Vert > 0 \]. Thus \[ c\, \Vert x \Vert_2 > 0 \], which implies \[ c > 0 \]. (For instance, \[ c = \Vert e_1 \Vert > 0 \] as \[ e_1 \neq 0 \].)
Theorem 2. Orthogonal Invariance of Euclidean-Derived Norms
The Euclidean norm (
\[ \ell_2 \]-norm) is orthogonally invariant: for any orthogonal matrix
\[ Q \in O(n) \](satisfying
\[ Q^\top Q = I \]) and any
\[ x \in \mathbb{R}^n \],
\[ \Vert Qx \Vert_2 = \Vert x \Vert_2 \]Consequently, any norm of the form
\[ \Vert x \Vert = c \Vert x \Vert_2 \]with
\[ c > 0 \]is also orthogonally invariant. This implies that any rotationally invariant norm is also orthogonally invariant.
Proof of Theorem 2.
The Euclidean norm squared is \[ \Vert x \Vert_2^2 = x^\top x \]. For \[ Q \in O(n) \], \[ \Vert Q x \Vert_2^2 = (Qx)^\top (Qx) = x^\top Q^\top Q x = x^\top x = \Vert x \Vert_2^2 \]. Since norms are non-negative, taking the square root gives \[ \Vert Q x \Vert_2 = \Vert x \Vert_2 \].
If a norm is of the form \[ \Vert x \Vert = c\, \Vert x \Vert_2 \] for some \[ c > 0 \], then \[ \Vert Q x \Vert = c\, \Vert Q x \Vert_2 = c\, \Vert x \Vert_2 = \Vert x \Vert \]. So, such norms are orthogonally invariant. By Corollary 1.1, any rotationally invariant norm (\[ SO(n) \]-invariant norm) must be of the form \[ c\, \Vert x \Vert_2 \] for some \[ c > 0 \]. Therefore, any rotationally invariant norm is also orthogonally invariant (\[ O(n) \]-invariant).
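A numerical spot-check of the theorem, assuming NumPy: the sketch constructs an orthogonal matrix with determinant −1 (a reflection, so in O(n) but not SO(n)) and verifies that the Euclidean norm is still preserved:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
x = rng.standard_normal(n)

# An orthogonal matrix with det = -1: it lies in O(n) but not SO(n),
# i.e. it is a reflection rather than a rotation.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
if np.linalg.det(Q) > 0:
    Q[:, 0] = -Q[:, 0]
assert np.isclose(np.linalg.det(Q), -1.0)

# The Euclidean norm (and hence any multiple c * ||.||_2, such as the
# RMS norm) is preserved even by reflections.
assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))
```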
Corollary 2.1. Uniqueness of RMS Norm Family
The RMS norm,
\[ \Vert x \Vert_{\mathrm{RMS}} = \frac{1}{\sqrt{n}}\Vert x \Vert_2 \], is a positive scalar multiple of the
\[ \ell_2 \]-norm. Therefore, it is rotationally and orthogonally invariant.
Furthermore, consider a family of norms
\[ \{\mathcal{N}_n(\cdot)\}_{n \ge 1} \], where each
\[ \mathcal{N}_n: \mathbb{R}^n \to \mathbb{R} \]is a norm on
\[ \mathbb{R}^n \]. If this family satisfies:
- Rotational Invariance: Each
\[ \mathcal{N}_n(\cdot) \]is rotationally invariant.
- Dimensional Normalization: For a class of random vectors
\[ X^{(n)} \in \mathbb{R}^n \](whose components
\[ X_i \]are i.i.d. with zero mean and unit variance, ensuring
\[ \mathbb{E}[\Vert X^{(n)} \Vert_{\mathrm{RMS}}] \approx 1 \]), the expected value
\[ \mathbb{E}[\mathcal{N}_n(X^{(n)})] \]is a constant
\[ K > 0 \]independent of
\[ n \]. (More precisely, assume
\[ \mathbb{E}[\Vert X^{(n)} \Vert_2] = \sqrt{n} \]for this class of vectors.)
Then, each norm
\[ \mathcal{N}_n(x) \]must be of the form
\[ K \cdot \Vert x \Vert_{\mathrm{RMS}} \]. If
\[ K=1 \], the RMS norm family is the unique family of norms satisfying these conditions.
Proof of Corollary 2.1.
The RMS norm is \[ \Vert x \Vert_{\mathrm{RMS}} = \frac{1}{\sqrt{n}} \Vert x \Vert_2 \]. Since \[ \frac{1}{\sqrt{n}} > 0 \], it’s a rotationally invariant norm by Corollary 1.1, and thus orthogonally invariant by Theorem 2.
For the second part:
- Rotational Invariance: By Corollary 1.1, each \[ \mathcal{N}_n(x) = c_n\, \Vert x \Vert_2 \] for some constant \[ c_n > 0 \].
- Dimensional Normalization: We are given \[ \mathbb{E}[\mathcal{N}_n(X^{(n)})] = K \] for all \[ n \]. Substituting the form from the previous item: \[ c_n\, \mathbb{E}[\Vert X^{(n)} \Vert_2] = K \]. The condition on the random vectors implies \[ \mathbb{E}[\Vert X^{(n)} \Vert_2] = \sqrt{n} \]. (This is derived from the motivating property \[ \mathbb{E}[\Vert X^{(n)} \Vert_{\mathrm{RMS}}] = 1 \], which means \[ \mathbb{E}[\Vert X^{(n)} \Vert_2] / \sqrt{n} = 1 \].) Plugging this into the equation for \[ c_n \]: \[ c_n \sqrt{n} = K \]. So, \[ c_n = \frac{K}{\sqrt{n}} \].
- Form of the Norm: Therefore, \[ \mathcal{N}_n(x) = \frac{K}{\sqrt{n}} \Vert x \Vert_2 = K\, \Vert x \Vert_{\mathrm{RMS}} \]. If \[ K = 1 \], then \[ \mathcal{N}_n(x) = \Vert x \Vert_{\mathrm{RMS}} \].
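A quick Monte Carlo illustration of the normalization step (not part of the proof; for Gaussian components \[ \mathbb{E}[\Vert X \Vert_2] \] equals \[ \sqrt{n} \] only approximately, whereas the corollary stipulates it exactly):

```python
import numpy as np

rng = np.random.default_rng(4)
K = 1.0  # the target constant from the corollary (K = 1 recovers RMS)

# For i.i.d. N(0, 1) components, E[||X||_2] is close to sqrt(n), so the
# forced choice c_n = K / sqrt(n) keeps E[N_n(X)] ≈ K at every width.
for n in (16, 256, 4096):
    X = rng.standard_normal((2000, n))
    mean_l2 = np.linalg.norm(X, axis=1).mean()
    c_n = K / np.sqrt(n)
    print(f"n={n:5d}  E||X||_2/sqrt(n)={mean_l2 / np.sqrt(n):.4f}  "
          f"c_n*E||X||_2={c_n * mean_l2:.4f}")
```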
Tip. When to use the vector RMS norm
Employ the vector
\[ \Vert \cdot \Vert_{\mathrm{RMS}} \]when you need a vector scale that is simultaneously rotationally symmetric (hence orthogonally symmetric) and normalized for dimension. This is useful, for example, when comparing activations from neural network layers of different widths or designing width-robust regularizers.