Matrix Norms: Foundations for Metrized Deep Learning
An introduction to matrix norms, their duals, and computational aspects essential for understanding advanced optimization in machine learning.
Welcome to this installment of our “Crash Course in Mathematical Foundations” series! As we gear up to explore the fascinating world of metrized deep learning, a solid understanding of matrix norms is indispensable. Matrix norms are fundamental tools in linear algebra, numerical analysis, and optimization. They allow us to measure the “size” or “magnitude” of matrices, analyze the behavior of linear transformations (like layers in a neural network), and define geometric structures on spaces of parameters.
In this post, we’ll review vector norms, introduce matrix norms, discuss common families like induced (operator) norms and Schatten norms, and delve into the crucial concept of norm duality. These concepts will pave the way for understanding how different choices of metrics can profoundly impact deep learning optimization and generalization.
## 1. A Quick Refresher: Vector Norms
Before we jump into matrices, let’s briefly recall vector norms. A vector norm quantifies the length or magnitude of a vector.
### 1.1 Abstract Vector Norms
Definition. Norm

Let \( V \) be a vector space over a field \( \mathbb{F} \) with absolute value \( \vert \cdot \vert \). A function \( \Vert \cdot \Vert : V \to \mathbb{R} \) is a norm if for all \( x, y \in V \) and \( \alpha \in \mathbb{F} \):

- Non-negativity: \( \Vert x \Vert \ge 0 \)
- Positive definiteness: \( \Vert x \Vert = 0 \implies x = \mathbf{0} \)
- Absolute homogeneity: \( \Vert \alpha x \Vert = \vert\alpha\vert \Vert x \Vert \)
- Triangle inequality: \( \Vert x + y \Vert \le \Vert x \Vert + \Vert y \Vert \)
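As a quick sanity check, here is a small NumPy sketch (illustrative, not from the original post) that verifies the norm axioms numerically for a few standard norms:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
alpha = -2.5

for p in (1, 2, np.inf):
    n = lambda v: np.linalg.norm(v, p)
    assert n(x) >= 0                                     # non-negativity
    assert np.isclose(n(alpha * x), abs(alpha) * n(x))   # absolute homogeneity
    assert n(x + y) <= n(x) + n(y) + 1e-12               # triangle inequality
```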
**Example.** \( \ell_p \) norms on \( \mathbb{R}^n \)

For a vector \( x \in \mathbb{R}^n \):

- **\( \ell_1 \)-norm (Manhattan norm):**

<div class="math-block" markdown="0"> \[ \Vert x \Vert_1 = \sum_{i=1}^n \vert x_i \vert \]
</div>

- **\( \ell_2 \)-norm (Euclidean norm):**

<div class="math-block" markdown="0"> \[ \Vert x \Vert_2 = \sqrt{\sum_{i=1}^n x_i^2} \]
</div>

- **\( \ell_\infty \)-norm (Maximum norm):**

<div class="math-block" markdown="0"> \[ \Vert x \Vert_\infty = \max_{1 \le i \le n} \vert x_i \vert \]
</div>

More generally, for \( p \ge 1 \), the **\( \ell_p \)-norm** is:

<div class="math-block" markdown="0"> \[ \Vert x \Vert_p = \left( \sum_{i=1}^n \vert x_i \vert^p \right)^{1/p} \]
</div>
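These formulas can be checked directly against NumPy's built-in norm routine (a minimal sketch, not part of the original post):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1   = np.sum(np.abs(x))        # 8.0
l2   = np.sqrt(np.sum(x**2))    # sqrt(26)
linf = np.max(np.abs(x))        # 4.0

assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(linf, np.linalg.norm(x, np.inf))

# General p-norm for p >= 1:
p = 3
lp = np.sum(np.abs(x)**p) ** (1 / p)
assert np.isclose(lp, np.linalg.norm(x, p))
```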
## 2. Stepping Up: Matrix Norms
Similar to vector norms on \( \mathbb{R}^n \), matrix norms measure the “size” of matrices.
Definition. Matrix Norm

A function \( \Vert \cdot \Vert : \mathbb{R}^{m \times n} \to \mathbb{R} \) is a matrix norm if for all matrices \( A, B \in \mathbb{R}^{m \times n} \) and any scalar \( \alpha \in \mathbb{R} \), it satisfies:

- Non-negativity: \( \Vert A \Vert \ge 0 \)
- Positive definiteness: \( \Vert A \Vert = 0 \) if and only if \( A = \mathbf{0} \) (the zero matrix)
- Absolute homogeneity: \( \Vert \alpha A \Vert = \vert\alpha\vert \Vert A \Vert \)
- Triangle inequality (subadditivity): \( \Vert A + B \Vert \le \Vert A \Vert + \Vert B \Vert \)

Additionally, many (but not all) matrix norms satisfy sub-multiplicativity. If \( A \in \mathbb{R}^{m \times k} \) and \( B \in \mathbb{R}^{k \times n} \), then:

- Sub-multiplicativity: \( \Vert AB \Vert \le \Vert A \Vert \Vert B \Vert \)

This property is particularly important when analyzing compositions of linear transformations, such as sequential layers in a neural network.
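A short NumPy sketch (illustrative, not from the original post) showing that spectral and Frobenius norms are sub-multiplicative while the max-entry norm is not:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

# Spectral and Frobenius norms are sub-multiplicative:
assert np.linalg.norm(A @ B, 2) <= np.linalg.norm(A, 2) * np.linalg.norm(B, 2) + 1e-12
assert np.linalg.norm(A @ B, 'fro') <= np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro') + 1e-12

# The max-entry norm is not sub-multiplicative; the all-ones 2x2 matrix is a counterexample:
def max_entry(M):
    return np.max(np.abs(M))

J = np.ones((2, 2))
assert max_entry(J) * max_entry(J) == 1.0
assert max_entry(J @ J) == 2.0   # exceeds the product of the norms
```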
## 3. Induced (Operator) Norms

Induced norms, also known as operator norms, are defined in terms of how a matrix transforms vectors. Given vector norms \( \Vert \cdot \Vert_X \) on \( \mathbb{R}^n \) (the domain) and \( \Vert \cdot \Vert_Y \) on \( \mathbb{R}^m \) (the codomain), the induced matrix norm \( \Vert \cdot \Vert_{X \to Y} \) for a matrix \( A \in \mathbb{R}^{m \times n} \) is defined as the maximum “stretching factor” \( A \) applies to any non-zero vector:

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{X \to Y} = \sup_{x \ne \mathbf{0}} \frac{\Vert Ax \Vert_Y}{\Vert x \Vert_X} = \sup_{\Vert x \Vert_X = 1} \Vert Ax \Vert_Y \]
</div>

All induced norms are sub-multiplicative.
Examples. Induced Matrix Norms

- **Maximum Column Sum Norm ( \( \ell_1 \to \ell_1 \) ):** Induced by the vector \( \ell_1 \)-norm in both domain and codomain. For \( A \in \mathbb{R}^{m \times n} \):

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{\ell_1 \to \ell_1} = \max_{1 \le j \le n} \sum_{i=1}^m \vert a_{ij} \vert \]
</div>

This measures the maximum possible output \( \ell_1 \)-norm for an input vector with \( \ell_1 \)-norm 1. Often denoted simply as \( \Vert A \Vert_1 \).
- **Spectral Norm ( \( \ell_2 \to \ell_2 \) ):** Induced by the vector \( \ell_2 \)-norm in both domain and codomain. For \( A \in \mathbb{R}^{m \times n} \):

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{\ell_2 \to \ell_2} = \sigma_{\max}(A) \]
</div>

where \( \sigma_{\max}(A) \) is the largest singular value of \( A \). This norm measures the maximum stretching in terms of Euclidean length. Often denoted simply as \( \Vert A \Vert_2 \).
- **Maximum Row Sum Norm ( \( \ell_\infty \to \ell_\infty \) ):** Induced by the vector \( \ell_\infty \)-norm in both domain and codomain. For \( A \in \mathbb{R}^{m \times n} \):

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{\ell_\infty \to \ell_\infty} = \max_{1 \le i \le m} \sum_{j=1}^n \vert a_{ij} \vert \]
</div>

This measures the maximum possible output \( \ell_\infty \)-norm for an input vector with \( \ell_\infty \)-norm 1. Often denoted simply as \( \Vert A \Vert_\infty \).
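These three induced norms have simple closed forms, which we can cross-check against NumPy (a minimal sketch, not from the original post):

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

col_sum  = np.max(np.abs(A).sum(axis=0))            # max column sum: max(4, 6) = 6
row_sum  = np.max(np.abs(A).sum(axis=1))            # max row sum:    max(3, 7) = 7
spectral = np.linalg.svd(A, compute_uv=False)[0]    # largest singular value

assert np.isclose(col_sum, np.linalg.norm(A, 1))        # l1 -> l1
assert np.isclose(row_sum, np.linalg.norm(A, np.inf))   # l_inf -> l_inf
assert np.isclose(spectral, np.linalg.norm(A, 2))       # l2 -> l2
```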
- **RMS-Induced Operator Norm ( \( \mathrm{RMS} \to \mathrm{RMS} \) ):** This norm is induced when both the domain and codomain vector spaces are equipped with the vector RMS norm. For a matrix \( A \in \mathbb{R}^{n_{out} \times n_{in}} \) (mapping from \( \mathbb{R}^{n_{in}} \) to \( \mathbb{R}^{n_{out}} \)), the RMS-induced operator norm is:

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{\mathrm{RMS}\to\mathrm{RMS}} = \sup_{\Vert x \Vert_{\mathrm{RMS},n_{in}} = 1} \Vert Ax \Vert_{\mathrm{RMS},n_{out}} \]
</div>

where \( \Vert x \Vert_{\mathrm{RMS},n_{in}} = \Vert x \Vert_2 / \sqrt{n_{in}} \) and \( \Vert y \Vert_{\mathrm{RMS},n_{out}} = \Vert y \Vert_2 / \sqrt{n_{out}} \). This simplifies to:

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{\mathrm{RMS}\to\mathrm{RMS}} = \sqrt{\frac{n_{in}}{n_{out}}}\,\sigma_{\max}(A) \]
</div>

where \( \sigma_{\max}(A) \) is the largest singular value of \( A \). This norm has several advantages in deep learning contexts:

- **Layer-wise stability:** The identity matrix (or any orthogonal matrix, assuming \( n_{in} = n_{out} \)) has an \( \mathrm{RMS}\to\mathrm{RMS} \) norm of \( 1 \), irrespective of the layer width. Coupled with initialization schemes like Xavier/He (where, for instance, entries are sampled with variance on the order of \( 1/n_{in} \)), newly initialized linear layers tend to have \( \Vert W \Vert_{\mathrm{RMS}\to\mathrm{RMS}} \approx 1 \). This helps in preventing exploding or vanishing activations during the initial phases of training.
- **Optimizer friendliness:** Optimization algorithms designed for metrized deep learning, such as Muon, can leverage this norm to control changes in layer weights (e.g., by bounding \( \Vert \Delta W \Vert_{\mathrm{RMS}\to\mathrm{RMS}} \) for each update).
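The closed form above can be checked against the sup definition by random probing (an illustrative NumPy sketch, not from the original post):

```python
import numpy as np

def rms(v):
    # Vector RMS norm: Euclidean norm scaled by 1/sqrt(dimension).
    return np.linalg.norm(v) / np.sqrt(len(v))

def rms_to_rms_norm(A):
    n_out, n_in = A.shape
    return np.sqrt(n_in / n_out) * np.linalg.svd(A, compute_uv=False)[0]

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 4))

# Random probes of the sup definition never exceed the closed form:
probes = rng.standard_normal((1000, 4))
ratios = [rms(A @ x) / rms(x) for x in probes]
assert max(ratios) <= rms_to_rms_norm(A) + 1e-9

# The identity has RMS->RMS norm exactly 1, regardless of width:
assert np.isclose(rms_to_rms_norm(np.eye(64)), 1.0)
```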
## 4. Entrywise and Schatten Norms

### 4.1. Entrywise \( L_{p,q} \) Norms

Entrywise norms treat a matrix as a vector of its entries rather than as a linear map. The \( L_{p,q} \) family first applies the vector \( \ell_p \)-norm to each column and then the \( \ell_q \)-norm to the resulting vector of column norms. For \( A \in \mathbb{R}^{m \times n} \), letting \( a_{\cdot, j} \) denote the \( j \)-th column of \( A \), the \( L_{p,q} \) norm is:

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{p,q} = \left( \sum_{j=1}^n \Vert a_{\cdot, j} \Vert_p^q \right)^{1/q} = \left( \sum_{j=1}^n \left( \sum_{i=1}^m \vert a_{ij} \vert^p \right)^{q/p} \right)^{1/q} \]
</div>

for \( p, q \ge 1 \).

**Note on Sub-multiplicativity**

A key feature of entrywise norms is that, in general, they are **not** sub-multiplicative. A notable exception is the Frobenius norm ( \( p=q=2 \) ), which does satisfy \( \Vert AB \Vert_F \le \Vert A \Vert_F \Vert B \Vert_F \). The lack of this property makes them less suitable for analyzing compositions of linear maps compared to operator norms.
**Examples.** \( L_{p,q} \) norms

- **Frobenius Norm ( \( p=q=2 \) ):** This is the most famous entrywise norm, which is also a Schatten norm ( \( \Vert A \Vert_{S_2} \) ). It is equivalent to the vector \( \ell_2 \)-norm applied to the matrix's entries reshaped into a single vector.

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{2,2} = \left( \sum_{j=1}^n \sum_{i=1}^m \vert a_{ij} \vert^2 \right)^{1/2} = \Vert A \Vert_F \]
</div>

- **Maximum Absolute Value Norm ( \( p=q=\infty \) ):** This norm finds the largest absolute value among all matrix entries, often denoted \( \Vert A \Vert_{\max} \).

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{\infty, \infty} = \max_{j} \left( \max_{i} \vert a_{ij} \vert \right) = \max_{i,j} \vert a_{ij} \vert \]
</div>

- **\( L_{2,1} \)-Norm ( \( p=2, q=1 \) ):** This norm is the sum of the Euclidean norms of the columns.

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{2,1} = \sum_{j=1}^n \Vert a_{\cdot,j} \Vert_2 = \sum_{j=1}^n \sqrt{\sum_{i=1}^m \vert a_{ij} \vert^2} \]
</div>

The \( L_{2,1} \) norm is particularly useful in machine learning for inducing **group sparsity** by encouraging entire columns of a weight matrix to become zero, which is useful for feature selection in multi-task or multi-class learning settings.

- **\( L_{1} \)-Norm ( \( p=q=1 \) ):** This is the sum of the absolute values of all entries.

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{1,1} = \sum_{j=1}^n \sum_{i=1}^m \vert a_{ij} \vert \]
</div>
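A small NumPy sketch (illustrative, not from the original post) computing these entrywise norms on a concrete matrix:

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

fro  = np.sqrt((A**2).sum())             # L_{2,2} = Frobenius norm
l21  = np.linalg.norm(A, axis=0).sum()   # sum of column Euclidean norms
l11  = np.abs(A).sum()                   # sum of all absolute entries
lmax = np.abs(A).max()                   # max absolute entry

assert np.isclose(fro, np.linalg.norm(A, 'fro'))
assert np.isclose(l21, np.sqrt(1 + 9) + np.sqrt(4 + 16))
assert np.isclose(l11, 10.0)
assert lmax == 4.0
```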
**Duality:** With respect to the Frobenius inner product, the dual of the \( L_{p,q} \) norm is the \( L_{p^\ast, q^\ast} \) norm, where \( 1/p + 1/p^\ast = 1 \) and \( 1/q + 1/q^\ast = 1 \). For instance, the Frobenius norm ( \( \Vert \cdot \Vert_{2,2} \) ) is self-dual, while the dual of the max absolute value norm ( \( \Vert \cdot \Vert_{\infty,\infty} \) ) is the sum of absolute values norm ( \( \Vert \cdot \Vert_{1,1} \) ).

### 4.2. Schatten \( p \)-Norms

Schatten norms are a family of norms defined using the singular values of a matrix \( A \in \mathbb{R}^{m \times n} \). The singular values, denoted \( \sigma_k(A) \), are typically obtained via Singular Value Decomposition (SVD). For \( p \ge 1 \), the Schatten \( p \)-norm is:

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{S_p} = \left( \sum_{k=1}^{\min(m,n)} \sigma_k(A)^p \right)^{1/p} \]
</div>
**Alternative Formulation.** Via Trace

The singular values \( \sigma_k(A) \) are the non-negative square roots of the eigenvalues of \( A^\top A \) (or \( AA^\top \)). If \( \lambda_k(A^\top A) \) are the (non-negative) eigenvalues of the positive semi-definite matrix \( A^\top A \), then \( \sigma_k(A) = \sqrt{\lambda_k(A^\top A)} \). The Schatten \( p \)-norm can then be written in terms of these eigenvalues:

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{S_p} = \left( \sum_{k=1}^{\min(m,n)} (\lambda_k(A^\top A))^{p/2} \right)^{1/p} \]
</div>

This sum corresponds to the trace of the matrix \( (A^\top A)^{p/2} \). The matrix power \( (A^\top A)^{p/2} \) is defined via functional calculus using the eigendecomposition of \( A^\top A \): if \( A^\top A = V \Lambda V^\top \) where \( \Lambda \) is the diagonal matrix of eigenvalues \( \lambda_k(A^\top A) \), then \( (A^\top A)^{p/2} = V \Lambda^{p/2} V^\top \). The trace is then \( \mathrm{Tr}((A^\top A)^{p/2}) = \sum_{k=1}^n (\lambda_k(A^\top A))^{p/2} \), where the sum runs over all \( n \) eigenvalues (if \( A \in \mathbb{R}^{m \times n} \), then \( A^\top A \) is \( n \times n \)). Thus, an alternative expression for the Schatten \( p \)-norm is:

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{S_p} = \left( \mathrm{Tr}\left( (A^\top A)^{p/2} \right) \right)^{1/p} \]
</div>

While this trace formulation is mathematically sound, computing \( (A^\top A)^{p/2} \) generally involves an eigendecomposition of \( A^\top A \), which is computationally similar to performing an SVD on \( A \) to get the singular values directly. The practical computation, especially for general \( p \), often relies on the singular values. For specific cases like \( p=2 \) (Frobenius norm), more direct methods are used, as highlighted below.
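The equivalence of the singular-value route and the trace route can be checked numerically (an illustrative NumPy sketch, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
p = 3

# Direct route: Schatten p-norm from the singular values.
sigma = np.linalg.svd(A, compute_uv=False)
schatten_svd = (sigma**p).sum() ** (1 / p)

# Trace route: Tr((A^T A)^{p/2})^{1/p} via the eigendecomposition of A^T A.
lam, V = np.linalg.eigh(A.T @ A)
lam = np.clip(lam, 0, None)            # clip tiny negative round-off
power = (V * lam**(p / 2)) @ V.T       # (A^T A)^{p/2} = V Lambda^{p/2} V^T
schatten_trace = np.trace(power) ** (1 / p)

assert np.isclose(schatten_svd, schatten_trace)
```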
**Examples.** Schatten \( p \)-norms

- **Nuclear Norm ( \( p=1 \) ):** Also denoted \( \Vert A \Vert_\ast \) or \( \Vert A \Vert_{S_1} \).
  - **Definition (primary for computation):**

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{S_1} = \sum_{k=1}^{\min(m,n)} \sigma_k(A) \]
</div>

  This is typically computed by first finding all singular values of \( A \) (e.g., via SVD) and summing them.
  - **Use:** Often used as a convex surrogate for matrix rank. Computationally intensive due to SVD.

- **Frobenius Norm ( \( p=2 \) ):** Also denoted \( \Vert A \Vert_F \) or \( \Vert A \Vert_{S_2} \).
  - **Definition (primary for computation):**

<div class="math-block" markdown="0"> \[ \Vert A \Vert_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n \vert a_{ij} \vert^2} \]
</div>

  This is the most direct and computationally efficient way: square all elements, sum them, and take the square root. It does **not** require forming \( A^\top A \) or computing singular values/eigenvalues explicitly.
  - **Singular Value Form:** \( \Vert A \Vert_{S_2} = \left( \sum_{k=1}^{\min(m,n)} \sigma_k(A)^2 \right)^{1/2} \)
  - **Use:** A common, computationally friendly matrix norm.

- **Spectral Norm ( \( p=\infty \) ):** Also denoted \( \Vert A \Vert_{\ell_2 \to \ell_2} \) or \( \Vert A \Vert_{S_\infty} \).
  - **Definition (primary for computation):**

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{S_\infty} = \max_{k} \sigma_k(A) = \sigma_{\max}(A) \]
</div>

  This requires finding the largest singular value of \( A \).
  - **Computation:** Typically computed via SVD (if all singular values are needed anyway) or iterative methods like power iteration to find the largest eigenvalue of \( A^\top A \) (since \( \sigma_{\max}(A) = \sqrt{\lambda_{\max}(A^\top A)} \)). More expensive than Frobenius, but often cheaper than a full SVD if only \( \sigma_{\max} \) is needed.
  - **Use:** Measures maximum stretching; crucial for Lipschitz constants and stability analysis.

Schatten norms are unitarily invariant, meaning \( \Vert UAV \Vert_{S_p} = \Vert A \Vert_{S_p} \) for any orthogonal/unitary matrices \( U \) and \( V \).

## 5. The Concept of Duality in Norms

Duality is a powerful concept in optimization and functional analysis. Every norm has an associated **dual norm**.

**Definition.** Dual Vector Norm
Let \( V \) be a nonzero inner product space over a field \( \mathbb{F} \) with absolute value \( \vert \cdot \vert \), equipped with a norm \( \Vert \cdot \Vert \) that is not necessarily induced by its inner product \( \langle \cdot \vert \cdot \rangle \). The corresponding dual norm \( \Vert \cdot \Vert_\ast \) is defined on the dual space as:

<div class="math-block" markdown="0"> \[ \Vert y \Vert_\ast = \sup_{x \in V, x \ne 0} \frac{\vert \langle y \vert x \rangle \vert}{\Vert x \Vert} = \sup_{x \in V, \Vert x \Vert = 1} \vert \langle y \vert x \rangle \vert \]
</div>

In other words, it returns the maximum measurement of a covector \( y \in V^\ast = \mathcal{L}(V;\mathbb{F}) \) on the unit ball defined by \( \Vert \cdot \Vert \) in \( V \).

This relationship is captured by **Hölder's Inequality**:

**Theorem.** Generalized Cauchy-Schwarz/Hölder's Inequality

Let \( V \) be a nonzero inner product space over a field \( \mathbb{F} \) with absolute value \( \vert \cdot \vert \), equipped with a norm \( \Vert \cdot \Vert \) that is not necessarily induced by its inner product \( \langle \cdot \vert \cdot \rangle \). Then the following holds:

<div class="math-block" markdown="0"> \[ \vert \langle y \vert x \rangle \vert \leq \Vert y \Vert_\ast \Vert x \Vert \]
</div>

The proof of the theorem follows immediately from the definition of the dual norm. Note that the theorem itself tells us neither how to actually compute the dual norm nor how to achieve equality. Despite its simple derivation, investigating special cases of this theorem will be extremely useful in the context of optimization. For instance, it can be applied to vector norms with the standard Euclidean inner product, or to matrix norms with the Frobenius inner product (which is the Euclidean inner product of the vectorized matrices). As we will see, von Neumann's trace inequality is a specific instance of this theorem with the Frobenius inner product, the Schatten-\( \infty \) norm (operator norm), and the Schatten-1 norm (nuclear/trace norm).

### Vector Norm Duality

**Definition.** Dual Vector Norm
For a vector norm \( \Vert \cdot \Vert \) on \( \mathbb{R}^n \), its dual norm \( \Vert \cdot \Vert_\ast \) is defined on the dual space (which is also \( \mathbb{R}^n \) via the standard dot product) as:

<div class="math-block" markdown="0"> \[ \Vert y \Vert_\ast = \sup_{x \ne \mathbf{0}} \frac{\vert y^\top x \vert}{\Vert x \Vert} = \sup_{\Vert x \Vert=1} \vert y^\top x \vert \]
</div>

This relationship is captured by **Hölder's Inequality**: \( \vert y^\top x \vert \le \Vert y \Vert_\ast \Vert x \Vert \).

Example: Dual of \( \ell_p \) norms

Important dual pairs for \( \ell_p \)-norms: \( (\Vert \cdot \Vert_{\ell_p})^\ast = \Vert \cdot \Vert_{\ell_q} \), where \( 1/p + 1/q = 1 \).

### Matrix Norm Duality

For matrix norms, duality is typically defined with respect to the **Frobenius inner product**:

<div class="math-block" markdown="0"> \[ \langle A, B \rangle_F = \mathrm{tr}(A^\top B) = \sum_{i,j} a_{ij} b_{ij} \]
</div>
**Definition.** Dual Matrix Norm

Given a matrix norm \( \Vert \cdot \Vert \) on \( \mathbb{R}^{m \times n} \), its dual norm \( \Vert \cdot \Vert_\ast \) is defined as:

<div class="math-block" markdown="0"> \[ \Vert B \Vert_\ast = \sup_{A \ne \mathbf{0}} \frac{\vert \langle B, A \rangle_F \vert}{\Vert A \Vert} = \sup_{\Vert A \Vert=1} \vert \langle B, A \rangle_F \vert \]
</div>

And we have a generalized Hölder's inequality for matrices: \( \vert \langle B, A \rangle_F \vert \le \Vert B \Vert_\ast \Vert A \Vert \).

The element \( A \) that achieves the supremum (or one such element if not unique) is called a **dualizing element** or **duality mapping**. Computing this dualizer can be a significant computational step in some optimization algorithms.
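The matrix Hölder inequality, and the element that makes it tight, can be checked for the spectral/nuclear dual pair (an illustrative NumPy sketch, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(B, full_matrices=False)
nuclear = s.sum()   # dual norm of the spectral norm

# Hölder: |<B, A>_F| <= ||B||_nuclear * ||A||_spectral for any A ...
A = rng.standard_normal((4, 3))
assert abs(np.sum(B * A)) <= nuclear * np.linalg.norm(A, 2) + 1e-9

# ... with equality at the dualizing element A = U V^T (spectral norm 1):
A_star = U @ Vt
assert np.isclose(np.linalg.norm(A_star, 2), 1.0)
assert np.isclose(np.sum(B * A_star), nuclear)
```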
**Example.** Dual Norms of General Induced Norms

#### 1. The General Dual Norm for Matrices

First, let's define the space and the dual norm concept precisely. We consider the vector space of real \( m \times n \) matrices, \( M_{m,n}(\mathbb{R}) \). This space is equipped with an inner product, the **Frobenius inner product** (or trace inner product):

<div class="math-block" markdown="0"> \[ \langle A, B \rangle = \mathrm{tr}(B^T A) = \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij} \]
</div>

This inner product allows us to identify the dual space of \( M_{m,n}(\mathbb{R}) \) with itself. Given any norm \( \Vert \cdot \Vert \) on \( M_{m,n}(\mathbb{R}) \), its **dual norm**, denoted \( \Vert \cdot \Vert_\ast \), is defined as:

<div class="math-block" markdown="0"> \[ \Vert B \Vert_\ast = \sup_{\Vert A \Vert \le 1} \langle A, B \rangle = \sup_{\Vert A \Vert \le 1} \mathrm{tr}(B^T A) \]
</div>

for any matrix \( B \in M_{m,n}(\mathbb{R}) \).

#### 2. Dual of a General Induced Norm \( \Vert \cdot \Vert_X \to \Vert \cdot \Vert_Y \)

Now, let's consider a specific type of norm: the induced norm (or operator norm). Let \( \Vert \cdot \Vert_X \) be a norm on \( \mathbb{R}^n \) and \( \Vert \cdot \Vert_Y \) be a norm on \( \mathbb{R}^m \). The induced norm of a matrix \( A \in M_{m,n}(\mathbb{R}) \) is:

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{X,Y} = \sup_{\Vert x \Vert_X = 1} \Vert Ax \Vert_Y \]
</div>

We want to compute the dual of this norm, which we'll denote by \( \Vert B \Vert_{X,Y}^\ast \). Using the definition above:

<div class="math-block" markdown="0"> \[ \Vert B \Vert_{X,Y}^\ast = \sup_{\Vert A \Vert_{X,Y} \le 1} \mathrm{tr}(B^T A) \]
</div>

Computing this supremum directly is difficult. However, there is a powerful representation theorem for this dual norm. It states that the dual norm is the infimum over all possible decompositions of the matrix \( B \) into a sum of rank-one matrices.

**Theorem:** The dual of the induced norm \( \Vert \cdot \Vert_{X,Y} \) is given by:

<div class="math-block" markdown="0"> \[ \Vert B \Vert_{X,Y}^\ast = \inf \left\{ \sum_{i=1}^k \Vert u_i \Vert_X \Vert v_i \Vert_{Y^\ast} : B = \sum_{i=1}^k v_i u_i^T,\ u_i \in \mathbb{R}^n,\ v_i \in \mathbb{R}^m \right\} \]
</div>

where \( \Vert \cdot \Vert_{Y^\ast} \) is the dual norm of \( \Vert \cdot \Vert_Y \) on \( \mathbb{R}^m \). The infimum is taken over all possible finite sums. This type of norm is a generalization of the nuclear norm (or trace norm).
**Example.** Dual Norms of \( \ell_p \to \ell_q \) norms

A natural and important question arises: is there a general formula for the dual of an induced norm, specifically the \( \ell_p \to \ell_q \) norm? The answer is nuanced and connects concepts from linear algebra, functional analysis, and convex optimization: while there is a general formula for the dual of any induced norm, it doesn't always simplify to another "nice" induced norm.

#### 3. The Dual of the Induced Matrix Norm \( \ell^p \to \ell^q \)

Now we can apply this general result to a special case. The induced matrix norm from \( \ell^p \) to \( \ell^q \) is:

<div class="math-block" markdown="0"> \[ \Vert A \Vert_{p,q} = \sup_{\Vert x \Vert_p=1} \Vert Ax \Vert_q \]
</div>

Here, the norm on the domain space is \( \Vert \cdot \Vert_X = \Vert \cdot \Vert_p \), and the norm on the codomain space is \( \Vert \cdot \Vert_Y = \Vert \cdot \Vert_q \). To use the theorem, we need the dual of the codomain norm, \( \Vert \cdot \Vert_{Y^\ast} = \Vert \cdot \Vert_{q'} \). The dual norm of the vector \( \ell^q \) norm is the \( \ell^{q'} \) norm, where \( 1/q + 1/q' = 1 \). Plugging this into the general formula, we get the dual of the \( \ell^p \to \ell^q \) induced norm:

<div class="math-block" markdown="0"> \[ \Vert B \Vert_{p,q}^\ast = \inf \left\{ \sum_{i=1}^k \Vert u_i \Vert_p \Vert v_i \Vert_{q'} : B = \sum_{i=1}^k v_i u_i^T \right\} \]
</div>

where \( u_i \in \mathbb{R}^n \), \( v_i \in \mathbb{R}^m \), and \( 1/q + 1/q' = 1 \). This variational formula is the general answer. Except for a few special cases, this expression does not simplify to another induced norm \( \Vert B \Vert_{r,s} \).

#### 4. Important Special Cases

Let's see how this general formula works for well-known special cases.

---

**Case 1: The Spectral Norm ( \( p=2, q=2 \) )**

- **Primary Norm:** \( \Vert A \Vert_{2,2} = \sigma_{\max}(A) \), the largest singular value of \( A \). This is the spectral norm.
- **Dual Norm Calculation:** Here \( p=2 \) and \( q=2 \), so \( q'=2 \). The formula becomes:
<div class="math-block" markdown="0"> \[ \Vert B \Vert_{2,2}^\ast = \inf \left\{ \sum_{i=1}^k \Vert u_i \Vert_2 \Vert v_i \Vert_2 : B = \sum_{i=1}^k v_i u_i^T \right\} \]
</div>

This is precisely the definition of the **trace norm** (or nuclear norm), which is the sum of the singular values of \( B \).

<div class="math-block" markdown="0"> \[ \Vert B \Vert_{2,2}^\ast = \sum_{i=1}^{\min(m,n)} \sigma_i(B) = \Vert B \Vert_\ast \]
</div>

The infimum is achieved by the Singular Value Decomposition (SVD) of \( B \). If \( B = \sum \sigma_i v_i u_i^T \), this is a valid decomposition with cost \( \sum \sigma_i \Vert u_i \Vert_2 \Vert v_i \Vert_2 = \sum \sigma_i \).

**Conclusion:** The dual of the spectral norm ( \( \ell^2 \to \ell^2 \) ) is the trace norm.

---

**Case 2: The Max-Entry Norm ( \( p=1, q=\infty \) )**

- **Primary Norm:** \( \Vert A \Vert_{1,\infty} = \sup_{\Vert x \Vert_1=1} \Vert Ax \Vert_\infty = \max_{i,j} \vert A_{ij} \vert \).
- **Dual Norm Calculation:** Here \( p=1 \) and \( q=\infty \), so \( q'=1 \). The formula becomes:

<div class="math-block" markdown="0"> \[ \Vert B \Vert_{1,\infty}^\ast = \inf \left\{ \sum_{i=1}^k \Vert u_i \Vert_1 \Vert v_i \Vert_1 : B = \sum_{i=1}^k v_i u_i^T \right\} \]
</div>

It can be shown that this infimum is equal to the entrywise \( \ell_1 \)-norm of the matrix \( B \).

<div class="math-block" markdown="0"> \[ \Vert B \Vert_{1,\infty}^\ast = \sum_{i=1}^m \sum_{j=1}^n \vert B_{ij} \vert \]
</div>

To see this, one can choose the decomposition \( B = \sum_{j=1}^n B_j e_j^T \), where \( B_j \) is the \( j \)-th column of \( B \) and \( e_j \) is the \( j \)-th standard basis vector. The cost is \( \sum_j \Vert B_j \Vert_1 \Vert e_j \Vert_1 = \sum_j \sum_i \vert B_{ij} \vert = \sum_{i,j} \vert B_{ij} \vert \). A more detailed proof shows this is indeed the minimum.

**Conclusion:** The dual of the \( \ell^1 \to \ell^\infty \) norm (max-entry norm) is the entrywise \( \ell_1 \)-norm (which is also the induced \( \ell^\infty \to \ell^1 \) norm).
---

#### Summary

| Primary Norm ( \( \Vert A \Vert \) ) | Formula for \( \Vert A \Vert \) | Dual Norm ( \( \Vert B \Vert_\ast \) ) | Formula for \( \Vert B \Vert_\ast \) |
| --- | --- | --- | --- |
| **General Induced Norm** \( \ell^p \to \ell^q \) | \( \sup_{\Vert x \Vert_p=1} \Vert Ax \Vert_q \) | **Variational/Tensor Norm** | \( \inf\{\sum_i \Vert u_i \Vert_p \Vert v_i \Vert_{q'} : B=\sum_i v_i u_i^T\} \) |
| **Spectral Norm** \( \ell^2 \to \ell^2 \) | \( \sigma_{\max}(A) \) | **Trace/Nuclear Norm** | \( \sum_i \sigma_i(B) \) |
| **Max Absolute Entry Norm** \( \ell^1 \to \ell^\infty \) | \( \max_{i,j} \vert A_{ij} \vert \) | **Entrywise \( \ell_1 \)-norm** ( \( \ell^\infty \to \ell^1 \) norm) | \( \sum_{i,j} \vert B_{ij} \vert \) |
| **Max Absolute Column Sum** \( \ell^1 \to \ell^1 \) | \( \max_j \sum_i \vert A_{ij} \vert \) | **Max Absolute Row Sum** ( \( \ell^\infty \to \ell^\infty \) norm) | \( \max_i \sum_j \vert B_{ij} \vert \) |

In summary, for a general induced norm \( \Vert \cdot \Vert_{p,q} \), its dual is not another induced norm but rather a norm defined via a variational problem related to rank-one decompositions. This variational form only simplifies to a more common, non-induced norm in special cases.

### Duality Mappings: Explicit Formulas (Jeremy Bernstein's Definition)

A **duality mapping** (as defined by Jeremy Bernstein) \( J \) for a norm \( \Vert \cdot \Vert \) (on a space of matrices \( \mathbb{R}^{m \times n} \)) maps a matrix \( A \) to a matrix \( J(A) \) that represents the direction of "steepest ascent" for \( A \) as measured by the dual norm. It is the element on the primal unit sphere that maximizes the inner product with \( A \). Formally, if \( A \ne \mathbf{0} \), \( J(A) \) is a matrix that satisfies two conditions:

1. \( \Vert J(A) \Vert = 1 \)
2. \( \langle A, J(A) \rangle_F = \Vert A \Vert_\ast \)

where \( \Vert \cdot \Vert_\ast \) is the dual norm of \( \Vert \cdot \Vert \). This \( J(A) \) is also known as a **dualizing element**. It is an element of the subgradient of the dual norm \( \Vert \cdot \Vert_\ast \) evaluated at \( A \), i.e., \( J(A) \in \partial \Vert A \Vert_\ast \). The mapping may be set-valued if the primal norm's unit ball is not strictly convex. For \( A = \mathbf{0} \), we define \( J(\mathbf{0}) = \mathbf{0} \).
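For the Frobenius norm, which is self-dual, the two defining conditions are easy to verify numerically (an illustrative NumPy sketch, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 3))

# For the self-dual Frobenius norm, the duality mapping is J(A) = A / ||A||_F.
J = A / np.linalg.norm(A, 'fro')

assert np.isclose(np.linalg.norm(J, 'fro'), 1.0)            # ||J(A)|| = 1
assert np.isclose(np.sum(A * J), np.linalg.norm(A, 'fro'))  # <A, J(A)>_F = ||A||_*
```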
Duality Mappings for Common Matrix Norms

All formulas below are derived by finding \( J(A) \) that achieves the supremum in the definition of the dual norm: \( \Vert A \Vert_\ast = \sup_{\Vert X \Vert=1} \langle A, X \rangle_F \).

#### 1. Vector-induced \( \ell_p \to \ell_q \) Operator Norms

The duality mapping \( J_{\ell_p \to \ell_q}(A) \) for the induced norm \( \Vert \cdot \Vert_{\ell_p \to \ell_q} \) is an element of the subgradient of its dual norm, \( \Vert \cdot \Vert_{\ell_p \to \ell_q}^\ast \). As we saw, this dual norm has a complex variational form, making a general closed-form expression for the duality mapping intractable. However, we can characterize it for specific cases.

**Rank-One Case:**

- **Norm:** For a rank-one matrix \( A = vu^\top \), where \( u \in \mathbb{R}^n, v \in \mathbb{R}^m \).
- **Dual Norm:** \( \Vert A \Vert_{\ell_p \to \ell_q}^\ast = \Vert u \Vert_p \Vert v \Vert_{q'} \), where \( 1/q + 1/q' = 1 \).
- **Duality Mapping:** \( J_{\ell_p \to \ell_q}(A) = s_v s_u^\top \), where \( s_u \) is a dual vector to \( u \) (w.r.t. the \( \ell_p \) norm) and \( s_v \) is a dual vector to \( v \) (w.r.t. the \( \ell_{q'} \) norm). Specifically:
  - \( s_u \in \mathbb{R}^n \) satisfies \( \Vert s_u \Vert_{p'} = 1 \) and \( s_u^\top u = \Vert u \Vert_p \).
  - \( s_v \in \mathbb{R}^m \) satisfies \( \Vert s_v \Vert_q = 1 \) and \( s_v^\top v = \Vert v \Vert_{q'} \).
  - For \( 1 < p, q' < \infty \), these dual vectors are unique and given by:

<div class="math-block" markdown="0"> \[ s_u = \frac{\mathrm{sign}(u) \odot \vert u\vert ^{p-1}}{\Vert u \Vert_p^{p-1}} \quad \text{and} \quad s_v = \frac{\mathrm{sign}(v) \odot \vert v\vert ^{q'-1}}{\Vert v \Vert_{q'}^{q'-1}} \]
</div>

where the operations are element-wise.

**Special Operator Norms:**

- **Max Column Sum ( \( \ell_1 \to \ell_1 \) ):**
  - Let \( i_0 \) be the index of a row of \( A \) with the maximum absolute row sum (this corresponds to the dual norm, \( \Vert A \Vert_{\infty \to \infty} \)).
  - A duality mapping for \( \Vert \cdot \Vert_{\ell_1 \to \ell_1} \) at \( A \) is the matrix \( J(A) \) which is zero everywhere except for row \( i_0 \), which is set to \( \mathrm{sign}(A_{i_0, :}) \).
  - \( J_{\ell_1 \to \ell_1}(A) = e_{i_0} (\mathrm{sign}(A_{i_0, :}))^\top \)
- **Max Row Sum ( \( \ell_\infty \to \ell_\infty \) ):**
  - Let \( j_0 \) be the index of a column of \( A \) with the maximum absolute column sum (this corresponds to the dual norm, \( \Vert A \Vert_{\ell_1 \to \ell_1} \)).
  - A duality mapping for \( \Vert \cdot \Vert_{\ell_\infty \to \ell_\infty} \) at \( A \) is the matrix \( J(A) \) which is zero everywhere except for column \( j_0 \), which is set to \( \mathrm{sign}(A_{:, j_0}) \).
  - \( J_{\ell_\infty \to \ell_\infty}(A) = (\mathrm{sign}(A_{:, j_0})) e_{j_0}^\top \)
- **Spectral Norm ( \( \ell_2 \to \ell_2 \) ):**
  - This is a special case of the Schatten \( S_\infty \) norm. As shown below, the duality mapping is \( J_{S_\infty}(A) = U_r V_r^\top \), where \( A = U_r \Sigma_r V_r^\top \) is the compact SVD of \( A \).
- **Max Entry Norm ( \( \ell_1 \to \ell_\infty \) ):**
  - The dual norm is the entrywise \( \ell_1 \) norm, \( \sum_{i,j} \vert A_{ij}\vert \).
  - The duality mapping for \( \Vert\cdot\Vert_{\ell_1\to\ell_\infty} \) at \( A \) is a matrix whose primal norm is 1 and which is a subgradient of the dual norm at \( A \). A subgradient of \( \sum_{i,j}\vert A_{ij}\vert \) is \( \mathrm{sign}(A) \).
  - The \( \ell_1 \to \ell_\infty \) norm of \( \mathrm{sign}(A) \) is \( \max_{i,j}\vert \mathrm{sign}(A_{ij})\vert = 1 \), so it is already normalized.
  - \( J_{\ell_1 \to \ell_\infty}(A) = \mathrm{sign}(A) \)
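The sign-based mappings above can be verified against the two defining conditions, \( \Vert J(A) \Vert = 1 \) and \( \langle A, J(A) \rangle_F = \Vert A \Vert_\ast \) (an illustrative NumPy sketch, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 3))

# Max-entry norm (l1 -> l_inf): J(A) = sign(A).
J = np.sign(A)
assert np.max(np.abs(J)) == 1.0                    # primal (max-entry) norm of J is 1
assert np.isclose(np.sum(A * J), np.abs(A).sum())  # <A, J>_F = entrywise l1 norm (the dual)

# l_inf -> l_inf (max row sum) norm: J is supported on the column with max column sum.
j0 = np.argmax(np.abs(A).sum(axis=0))
J2 = np.zeros_like(A)
J2[:, j0] = np.sign(A[:, j0])
assert np.isclose(np.max(np.abs(J2).sum(axis=1)), 1.0)     # ||J2||_{inf -> inf} = 1
assert np.isclose(np.sum(A * J2), np.abs(A[:, j0]).sum())  # <A, J2>_F = max column sum
```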
#### 2. Schatten \( S_p \) Norms

**Norm:** For \( A \in \mathbb{R}^{m\times n} \) with SVD \( A = U\Sigma V^{\top} \), \( \Vert A \Vert_{S_p} := \left(\textstyle\sum_{i}\sigma_i(A)^{\,p}\right)^{1/p} \).

**Dual Norm:** \( \Vert \cdot \Vert_{S_q} \), where \( 1/p + 1/q = 1 \).

**Duality Mapping (for \( 1 < p < \infty \)):** The duality mapping for \( \Vert \cdot \Vert_{S_p} \) is derived from the subgradient of its dual norm \( \Vert \cdot \Vert_{S_q} \):

<div class="math-block" markdown="0"> \[ \boxed{\,J_{S_p}(A)=\frac{U\,\operatorname{diag}(\sigma_i(A)^{\,q-1})\,V^{\top}} {\left(\sum_j \sigma_j(A)^q\right)^{1/p}}\,} \]
</div>

(If \( A = \mathbf{0} \), \( J_{S_p}(A) = \mathbf{0} \). Using \( q/p = q-1 \), the denominator is \( \Vert A \Vert_{S_q}^{q-1} \).)
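Both defining conditions of the Schatten duality mapping (primal norm 1, inner product equal to the dual norm) can be checked numerically, here using the normalization \( \sigma'_i = \sigma_i^{q-1} / \Vert A \Vert_{S_q}^{q-1} \) (an illustrative NumPy sketch, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 3))
p = 3.0
q = p / (p - 1)                          # Hölder conjugate: 1/p + 1/q = 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)
dual = (s**q).sum() ** (1 / q)           # ||A||_{S_q}

# J_{S_p}(A) = U diag(sigma^{q-1}) V^T / ||A||_{S_q}^{q-1}
J = (U * s**(q - 1)) @ Vt / dual**(q - 1)

def schatten(M, r):
    return (np.linalg.svd(M, compute_uv=False)**r).sum() ** (1 / r)

assert np.isclose(schatten(J, p), 1.0)   # ||J(A)||_{S_p} = 1
assert np.isclose(np.sum(A * J), dual)   # <A, J(A)>_F = ||A||_{S_q}
```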
**Derivation of \( J_{S_p}(A) \)**

We need \( J(A) \) such that \( \Vert J(A) \Vert_{S_p} = 1 \) and \( \langle A, J(A) \rangle_F = \Vert A \Vert_{S_q} \). This \( J(A) \) is the normalized subgradient of the \( S_q \) norm evaluated at \( A \). Let \( A = U\Sigma V^\top \). We propose \( J(A) = U \Sigma' V^\top \).

1. **Inner Product Condition:** \( \langle A, J(A) \rangle_F = \mathrm{tr}(\Sigma^\top \Sigma') = \sum_i \sigma_i \sigma'_i \). We need this to equal \( \Vert A \Vert_{S_q} = (\sum_i \sigma_i^q)^{1/q} \). By Hölder's inequality for vectors, \( \sum_i \sigma_i \sigma'_i \le (\sum_i \sigma_i^q)^{1/q} (\sum_i (\sigma'_i)^p)^{1/p} = \Vert A \Vert_{S_q} \Vert J(A) \Vert_{S_p} \).
2. **Achieving Equality:** Equality is achieved if the vector of \( \sigma'_i \) is proportional to the Hölder-dual vector of \( \sigma_i \). Specifically, \( \sigma'_i \) must be proportional to \( \sigma_i^{q-1} \). Let \( \sigma'_i = c \cdot \sigma_i^{q-1} \).
3. **Norm Condition:** We need \( \Vert J(A) \Vert_{S_p} = 1 \): \( 1 = \Vert J(A) \Vert_{S_p} = (\sum_i (\sigma'_i)^p)^{1/p} = (\sum_i (c \cdot \sigma_i^{q-1})^p)^{1/p} = c (\sum_i \sigma_i^{(q-1)p})^{1/p} \). Since \( (q-1)p = q \), this is \( c (\sum_i \sigma_i^q)^{1/p} = c \Vert A \Vert_{S_q}^{q/p} \). So \( c = 1 / \Vert A \Vert_{S_q}^{q/p} = 1 / \Vert A \Vert_{S_q}^{q-1} \).
4. **Final Formula:** \( \sigma'_i = \sigma_i^{q-1} / \Vert A \Vert_{S_q}^{q-1} \).
\[ . Assembling this into a matrix gives the formula above. </details> \ast \ast Special Cases for Schatten Norms:\ast \ast \ast \ast \ast \]p=2
- **\(p=2\) (Frobenius Norm \(\Vert \cdot \Vert_F = \Vert \cdot \Vert_{S_2}\)):** Self-dual (\(q=2, p=2\)). Then \(q-1=1\) and \(p-1=1\). The formula becomes:

  \[ \boxed{\,J_F(A) = J_{S_2}(A) = \frac{U \Sigma V^\top}{\Vert A \Vert_{S_2}} = \frac{A}{\Vert A \Vert_F}\,} \]

- **\(p=1\) (Nuclear Norm \(\Vert \cdot \Vert_{S_1}\)):** Dual is the Spectral Norm (\(\Vert \cdot \Vert_{S_\infty}\), \(q=\infty\)). The duality mapping for \(S_1\) is the subgradient of the \(S_\infty\) norm. If \(\sigma_1 > \sigma_2\) (the largest singular value is simple), let \(u_1, v_1\) be the top singular vectors:

  \[ \boxed{\,J_{S_1}(A) = u_1 v_1^\top\,} \]

  If \(\sigma_1\) is not simple, \(J(A)\) can be any convex combination of \(u_i v_i^\top\) for the \(i\) where \(\sigma_i = \sigma_1\).

- **\(p=\infty\) (Spectral Norm \(\Vert \cdot \Vert_{S_\infty}\)):** Dual is the Nuclear Norm (\(\Vert \cdot \Vert_{S_1}\), \(q=1\)). The duality mapping for \(S_\infty\) is the subgradient of the \(S_1\) norm. If \(A=U_r \Sigma_r V_r^\top\) is the compact SVD, a canonical choice is:
  \[ \boxed{\,J_{S_\infty}(A) = U_r V_r^\top\,} \]

  This is the unique minimum-Frobenius-norm subgradient of \(\Vert A \Vert_{S_1}\).

#### 3. Mahalanobis-Induced Operator Norm

Let \(M \succ 0\) be an \(n \times n\) SPD matrix.

**Norm:** For \(A \in \mathbb{R}^{n \times n}\):

\[ \Vert A \Vert_M := \max_{x^\top M x = 1} \sqrt{(Ax)^\top M (Ax)} = \Vert M^{1/2} A M^{-1/2} \Vert_{S_\infty} \]

**Dual Norm:** \(\Vert B \Vert_{M, \ast} = \Vert M^{-1/2} B M^{1/2} \Vert_{S_1}\).

**Duality Mapping:** This is the subgradient of the dual norm \(\Vert \cdot \Vert_{M, \ast}\) at \(A\), normalized in the primal norm. Let \(C := M^{-1/2} A M^{1/2}\) have compact SVD \(C = U_r \Sigma_r V_r^\top\):

\[ \boxed{\, J_M(A) = M^{-1/2} (U_r V_r^\top) M^{1/2} \,} \]

(One can check that \(\Vert J_M(A) \Vert_M = \Vert U_r V_r^\top \Vert_{S_\infty} = 1\) and \(\langle A, J_M(A) \rangle_F = \langle C, U_r V_r^\top \rangle_F = \Vert C \Vert_{S_1} = \Vert A \Vert_{M,\ast}\).)
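The two duality conditions are easy to verify numerically; the sketch below (my own numpy code, building \(M^{\pm 1/2}\) from an eigendecomposition) checks that \(J_M(A)\) has unit Mahalanobis-induced norm and that its pairing with \(A\) equals the dual norm:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
B0 = rng.standard_normal((n, n))
M = B0 @ B0.T + n * np.eye(n)        # SPD metric matrix M > 0

# M^{1/2} and M^{-1/2} via the eigendecomposition of M
w, Q = np.linalg.eigh(M)
Msqrt = (Q * np.sqrt(w)) @ Q.T
Minv_sqrt = (Q / np.sqrt(w)) @ Q.T

A = rng.standard_normal((n, n))
C = Minv_sqrt @ A @ Msqrt            # transformed matrix C
U, s, Vt = np.linalg.svd(C)
J = Minv_sqrt @ (U @ Vt) @ Msqrt     # duality mapping J_M(A)

def norm_M(X):
    """Primal Mahalanobis-induced operator norm."""
    return np.linalg.norm(Msqrt @ X @ Minv_sqrt, 2)

def dual_M(X):
    """Dual norm: transformed nuclear norm."""
    return np.linalg.svd(Minv_sqrt @ X @ Msqrt, compute_uv=False).sum()

assert np.isclose(norm_M(J), 1.0)
assert np.isclose(np.sum(A * J), dual_M(A))
```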

#### 4. RMS-Induced Operator Norm

**Norm:** For \(A \in \mathbb{R}^{n_{out} \times n_{in}}\):

\[ \Vert A \Vert_{\mathrm{RMS}\to\mathrm{RMS}} = \sqrt{\frac{n_{in}}{n_{out}}}\,\sigma_{\max}(A) \]

**Dual Norm:** \(\Vert B \Vert_R^\ast = \sqrt{\frac{n_{out}}{n_{in}}}\,\Vert B \Vert_{S_1}\).

**Duality Mapping:** This is the normalized subgradient of the dual norm \(\Vert \cdot \Vert_R^\ast\) at \(A\). Let \(A=U_r \Sigma_r V_r^\top\) be the compact SVD of \(A\):

\[ \boxed{\, J_R(A) = \sqrt{\frac{n_{out}}{n_{in}}} \, U_r V_r^\top \,} \]
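Before the derivation, here is a small numerical check (my own numpy sketch) that \(J_R(A)\) has unit RMS-induced norm and that \(\langle A, J_R(A)\rangle_F\) equals the scaled nuclear dual norm:

```python
import numpy as np

rng = np.random.default_rng(3)
n_out, n_in = 5, 3
A = rng.standard_normal((n_out, n_in))
k = np.sqrt(n_in / n_out)                          # RMS scaling factor

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # compact SVD
J = (1 / k) * (U @ Vt)                             # duality mapping J_R(A)

def rms_norm(X):
    """Primal RMS-induced operator norm: k * sigma_max."""
    return k * np.linalg.norm(X, 2)

def rms_dual(X):
    """Dual norm: (1/k) * nuclear norm."""
    return (1 / k) * np.linalg.svd(X, compute_uv=False).sum()

assert np.isclose(rms_norm(J), 1.0)
assert np.isclose(np.sum(A * J), rms_dual(A))
```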
<details markdown="1">
<summary><b>Derivation of \(J_R(A)\)</b></summary>

Let \(k = \sqrt{n_{in}/n_{out}}\). The norm is \(\Vert A \Vert_R = k \Vert A \Vert_{S_\infty}\) and its dual is \(\Vert A \Vert_R^\ast = (1/k) \Vert A \Vert_{S_1}\). We need \(J(A)\) such that \(\Vert J(A) \Vert_R = 1\) and \(\langle A, J(A) \rangle_F = \Vert A \Vert_R^\ast\). The duality mapping \(J(A)\) must be an element of \(\partial \Vert A \Vert_R^\ast\), normalized to have a primal norm of 1. First:

\[ \partial \Vert A \Vert_R^\ast = \partial \left(\frac{1}{k} \Vert A \Vert_{S_1}\right) = \frac{1}{k} \partial \Vert A \Vert_{S_1} \]

A canonical choice for a subgradient \(S \in \partial \Vert A \Vert_{S_1}\) is \(S = U_r V_r^\top\). So a candidate subgradient of the dual norm is \(J_A = (1/k) U_r V_r^\top\). We need to normalize this candidate so that its primal (\(R\)) norm is 1:

\[ \Vert J_A \Vert_R = \Vert (1/k) U_r V_r^\top \Vert_R = k \cdot (1/k) \Vert U_r V_r^\top \Vert_{S_\infty} = 1 \]

The candidate \(J_A\) already has norm 1, so it is the duality mapping:

\[ J_R(A) = \frac{1}{k} U_r V_r^\top = \sqrt{\frac{n_{out}}{n_{in}}} U_r V_r^\top \]

*Check inner product:* \(\langle A, J_R(A) \rangle = (1/k) \langle A, U_r V_r^\top \rangle = (1/k) \Vert A \Vert_{S_1} = \Vert A \Vert_R^\ast\). Both conditions hold.

</details>

## 6. Why Matrix Norms Matter for Metrized Deep Learning

Understanding matrix norms and their duals is more than just a mathematical exercise. These concepts are foundational for "metrized deep learning" for several reasons:

1. **Defining Geometry:** Norms induce metrics (\(d(W_1, W_2) = \Vert W_1 - W_2 \Vert\)
). The choice of norm for the weights and activations of a neural network defines the geometry of the parameter space and representation spaces.
2. **Informing Optimizer Design:** Many advanced optimization algorithms, like mirror descent or adaptive methods (e.g., Adam, Shampoo, **Muon**), implicitly or explicitly leverage geometric information. Dual norms and duality mappings are key to understanding and deriving these methods, especially for gradient transformation.
3. **Regularization:** Norms are extensively used in regularization techniques (e.g., spectral/nuclear norm regularization for matrices) to encourage desirable properties like low rank or sparsity.
4. **Analyzing Network Properties:** Matrix norms help analyze stability, expressivity, and robustness. For instance, the spectral norm of weight matrices controls the Lipschitz constant of network layers.
5. **Computational Costs in Optimization:** The choice of norm is not "free."
   - **Norm Computation:** Calculating some norms (e.g., Frobenius) is cheap, while others (e.g., spectral, nuclear, RMS-induced) require SVDs or iterative methods, adding computational overhead per optimization step if used directly for regularization or monitoring.
   - **Dualizer Computation:** Optimizers like **Muon** rely on "gradient dualization," which involves finding the argument \(B\) that saturates Hölder's inequality: \(\langle G, B \rangle = \Vert G \Vert \Vert B \Vert_\ast\). More practically, they often need to compute the duality mapping \(J(G)\) of the gradient \(G\) with respect to a chosen norm \(\Vert \cdot \Vert\). The update rule might then involve \(J(G)\) or a related preconditioning matrix. The explicit formulas for \(J(G)\) provided in the previous section are crucial for implementing such optimizers.
   - For common layers like Linear or Conv2D, computing these duality mappings can be expensive. For instance, if the norm involves an SVD (Spectral, Nuclear, RMS-induced) or matrix square roots and inverses (Mahalanobis), this is costly. The **Muon** optimizer and related methods therefore employ approximations, such as Newton-Schulz iterations that approximate the orthogonal factor \(U_r V_r^\top\) without computing a full SVD, to make these computations feasible for large deep learning models.
6. **Modular Duality:** As seen in recent research, concepts of duality can be applied modularly to layers (Linear, Conv2D, Embedding) and composed throughout a network. This allows for a "dual" perspective on the entire weight space, enabling optimizers that adapt to the specific geometry of each layer. Efficient GPU kernels for these layer-wise dualizations are an active area of development.

## 7. Summary of Common Matrix Norms

Here's a quick cheat sheet of common matrix norms and their duals (with respect to the Frobenius inner product). For a matrix \(A \in \mathbb{R}^{n_{out} \times n_{in}}\):

### Matrix Norm Families: Duals and Duality Mappings

| Norm Family | Norm Expression (for matrix \(A\)) | Dual Norm Expression (for matrix \(B\)) | Duality Mapping \(J(A)\) (for \(A \neq 0\)) |
| --- | --- | --- | --- |
| **Induced Operator** | \(\sup_{\Vert x\Vert_X=1} \Vert Ax\Vert_Y\) | \(\inf \left\{ \sum_i \Vert u_i\Vert_X \Vert v_i\Vert_{Y^\ast} : B = \sum_i v_i u_i^\top \right\}\) | \(\arg\max_{\Vert C\Vert_{X \to Y}=1} \langle A, C \rangle_F\) |
| **Entrywise \(L_{p,q}\)** | \(\left( \sum_j \left( \sum_i \vert a_{ij}\vert^p \right)^{q/p} \right)^{1/q}\) | \(\Vert B\Vert_{p^\ast, q^\ast}\)<br>(\(\frac{1}{p} + \frac{1}{p^\ast} = 1\), \(\frac{1}{q} + \frac{1}{q^\ast} = 1\)) | Constructed from dual vectors of columns and global scaling |
| **Schatten \(p\)-norm** | \(\left( \sum_i \sigma_i(A)^p \right)^{1/p}\) | \(\Vert B\Vert_{S_q}\)<br>(\(\frac{1}{p} + \frac{1}{q} = 1\)) | \(\frac{U \operatorname{diag}(\sigma_i^{q-1}) V^\top}{\Vert A\Vert_{S_q}^{q-1}}\) |

#### Key Properties

1. **Duality Condition**: In all cases, \(\langle A, J(A) \rangle_F = \Vert A\Vert_\ast\) and \(\Vert J(A)\Vert = 1\).
2. **Induced Norm Duality**: \(X, Y\) denote vector norms on the domain/codomain; \(Y^\ast\) is the dual norm of \(Y\). There is no closed-form duality mapping in general (it requires solving an optimization problem).
3. **Entrywise Construction**: Per-column dual vectors with global \(L_{q^\ast}\) scaling.
4. **Schatten Norm**: Requires the SVD \(A = U \operatorname{diag}(\sigma) V^\top\); \(q = p/(p-1)\) (Hölder conjugate); degenerate singular values give a non-unique \(J(A)\).
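The SVD in the Schatten-type mappings is the expensive step. As mentioned in Section 6, optimizers like Muon sidestep it with a Newton-Schulz iteration that drives all singular values toward 1, approximating the orthogonal factor \(U_r V_r^\top\). A minimal cubic-iteration sketch (my own; Muon uses tuned polynomial coefficients rather than the plain 1.5/0.5 pair):

```python
import numpy as np

def newton_schulz_orthogonalize(A, steps=30):
    """Approximate the orthogonal polar factor U_r V_r^T of A with the
    cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    Converges when all singular values of X0 lie in (0, sqrt(3));
    Frobenius normalization guarantees this for full-rank A."""
    X = A / np.linalg.norm(A)      # scale so all singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 3))

X = newton_schulz_orthogonalize(A)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(X, U @ Vt, atol=1e-4)   # matches the SVD-based mapping
```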

### Notable Special Cases of Matrix Norms: Duals and Duality Mappings

| Norm Type | Norm Expression (for matrix \(A\)) | Dual Norm Expression (for matrix \(B\)) | Duality Mapping \(J(A)\) (for \(A \neq 0\)) |
| --- | --- | --- | --- |
| **Max Column Sum**<br>(Induced \(\ell_1 \to \ell_1\)) | \(\max_j \sum_i \vert a_{ij} \vert\) | \(\sum_j \max_i \vert b_{ij} \vert\)<br>(sum of column maxima) | \(J_{ij} = \mathrm{sign}(a_{ij})\) if \(i = \arg\max_{i'} \vert a_{i'j} \vert\), else \(0\) |
| **Spectral Norm**<br>(Induced \(\ell_2 \to \ell_2\)) | \(\sigma_{\max}(A)\) | \(\sum_i \sigma_i(B)\)<br>(Nuclear norm) | \(U_r V_r^\top\)<br>(compact SVD: \(A = U_r \Sigma_r V_r^\top\)) |
| **Max Row Sum**<br>(Induced \(\ell_\infty \to \ell_\infty\)) | \(\max_i \sum_j \vert a_{ij} \vert\) | \(\sum_i \max_j \vert b_{ij} \vert\)<br>(sum of row maxima) | \(J_{ij} = \mathrm{sign}(a_{ij})\) if \(j = \arg\max_{j'} \vert a_{ij'} \vert\), else \(0\) |
| **Max Entry**<br>(Induced \(\ell_1 \to \ell_\infty\)) | \(\max_{i,j} \vert a_{ij} \vert\) | \(\sum_{i,j} \vert b_{ij} \vert\)<br>(Entrywise \(\ell_1\)) | \(\mathrm{sign}(A)\) |
| **Frobenius**<br>(Schatten \(p=2\)) | \(\sqrt{\sum_{i,j} a_{ij}^2}\) | Frobenius norm<br>(Self-dual) | \(\frac{A}{\Vert A\Vert_F}\) |
| **Nuclear Norm**<br>(Schatten \(p=1\)) | \(\sum_i \sigma_i(A)\) | Spectral norm<br>(\(\sigma_{\max}(B)\)) | \(u_1 v_1^\top\)<br>(top singular vectors) |
| **RMS-Induced**<br>(\(A \in \mathbb{R}^{n_{out} \times n_{in}}\)) | \(\sqrt{\frac{n_{in}}{n_{out}}} \sigma_{\max}(A)\) | \(\sqrt{\frac{n_{out}}{n_{in}}} \Vert B\Vert_{S_1}\)<br>(Scaled nuclear norm) | \(\sqrt{\frac{n_{out}}{n_{in}}} U_r V_r^\top\)<br>(compact SVD) |
| **Mahalanobis-Induced**<br>(\(M \succ 0\)) | \(\max_{x^\top M x=1} \Vert Ax\Vert_M\) | \(\Vert M^{-1/2} B M^{1/2}\Vert_{S_1}\)<br>(Transformed nuclear norm) | \(M^{-1/2} (U_r V_r^\top) M^{1/2}\)<br>(SVD of \(C = M^{-1/2} A M^{1/2}\)) |

#### Key Notes

1. **Compact SVD**: \(U_r, V_r\) contain the singular vectors for non-zero singular values.
2. **Degenerate Cases**: When the maximum singular value is not simple, \(J(A)\) becomes a convex combination of the corresponding singular vectors.
3. **Practical Computation**:
   - Spectral/Nuclear/RMS/Mahalanobis require an SVD (expensive for large matrices).
   - Max Row/Column Sum and Max Entry have efficient \(O(mn)\) implementations.
   - The Frobenius norm is computationally cheapest (simple element-wise operations).

## 8. Summary of Matrix Norm Inequalities

This table summarizes key inequalities relating common matrix norms for a matrix \(A \in \mathbb{R}^{m \times n}\).
Here, \(\Vert A \Vert_1\) is the max column sum (operator norm \(\ell_1 \to \ell_1\)), \(\Vert A \Vert_2\) is the spectral norm (operator norm \(\ell_2 \to \ell_2\)), \(\Vert A \Vert_\infty\) is the max row sum (operator norm \(\ell_\infty \to \ell_\infty\)), \(\Vert A \Vert_F\) is the Frobenius norm, and \(\Vert A \Vert_{\max} = \max_{i,j} \vert a_{ij} \vert\).

| Inequality | Notes / Context |
| :--- | :--- |
| \(\Vert A \Vert_2 \le \Vert A \Vert_F\) | Spectral norm is at most the Frobenius norm. |
| \(\Vert A \Vert_F \le \sqrt{\mathrm{rank}(A)}\, \Vert A \Vert_2\) | Often simplified using \(\mathrm{rank}(A) \le \min(m,n)\). |
| \(\frac{1}{\sqrt{n}} \Vert A \Vert_\infty \le \Vert A \Vert_2\) | Lower bound for the spectral norm by the max row sum. |
| \(\Vert A \Vert_2 \le \sqrt{m}\, \Vert A \Vert_\infty\) | Upper bound for the spectral norm by the max row sum. |
| \(\frac{1}{\sqrt{m}} \Vert A \Vert_1 \le \Vert A \Vert_2\) | Lower bound for the spectral norm by the max column sum. |
| \(\Vert A \Vert_2 \le \sqrt{n}\, \Vert A \Vert_1\) | Upper bound for the spectral norm by the max column sum. |
| \(\Vert A \Vert_2 \le \sqrt{\Vert A \Vert_1 \Vert A \Vert_\infty}\) | Interpolates between the operator 1-norm and \(\infty\)-norm. |
| \(\Vert A \Vert_{\max} \le \Vert A \Vert_2\) | Max absolute entry is at most the spectral norm. |
| \(\Vert A \Vert_2 \le \sqrt{mn}\, \Vert A \Vert_{\max}\) | Upper bound for the spectral norm by the max absolute entry. |
| \(\Vert A \Vert_F \le \sqrt{mn}\, \Vert A \Vert_{\max}\) | Upper bound for the Frobenius norm by the max absolute entry. |
| \(\Vert A \Vert_1 \le m \Vert A \Vert_{\max}\) | Upper bound for the operator 1-norm by the max absolute entry. |
| \(\Vert A \Vert_\infty \le n \Vert A \Vert_{\max}\) | Upper bound for the operator \(\infty\)-norm by the max absolute entry. |
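These bounds are easy to spot-check numerically; the sketch below (my own, using numpy's `linalg.norm` conventions) verifies each row of the table on a random rectangular matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 5, 3
A = rng.standard_normal((m, n))

nrm2   = np.linalg.norm(A, 2)        # spectral norm (largest singular value)
nrmF   = np.linalg.norm(A, 'fro')    # Frobenius norm
nrm1   = np.linalg.norm(A, 1)        # max column sum
nrmInf = np.linalg.norm(A, np.inf)   # max row sum
nrmMax = np.max(np.abs(A))           # max absolute entry

eps = 1e-12                          # slack for floating-point comparisons
assert nrm2 <= nrmF + eps
assert nrmF <= np.sqrt(np.linalg.matrix_rank(A)) * nrm2 + eps
assert nrmInf / np.sqrt(n) <= nrm2 + eps
assert nrm2 <= np.sqrt(m) * nrmInf + eps
assert nrm1 / np.sqrt(m) <= nrm2 + eps
assert nrm2 <= np.sqrt(n) * nrm1 + eps
assert nrm2 <= np.sqrt(nrm1 * nrmInf) + eps
assert nrmMax <= nrm2 + eps
assert nrm2 <= np.sqrt(m * n) * nrmMax + eps
assert nrmF <= np.sqrt(m * n) * nrmMax + eps
assert nrm1 <= m * nrmMax + eps
assert nrmInf <= n * nrmMax + eps
```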
In our upcoming posts on metrized deep learning, we will see how these norms and their associated geometries are not just theoretical curiosities but practical tools for building more efficient and effective deep learning models. Stay tuned!

## References
- Horn, R. A., & Johnson, C. R. (2017). Matrix Analysis (Second edition, corrected reprint). Cambridge University Press.
- Vector and Matrix Norms. Retrieved June 6, 2025, from https://wiki.ruda.city/Vector-and-Matrix-Norms
- Aziznejad, S., & Unser, M. (2020). Duality Mapping for Schatten Matrix Norms (Issue arXiv:2009.07768). arXiv. https://doi.org/10.48550/arXiv.2009.07768
- Guth, L., Maldague, D., & Urschel, J. Estimating the matrix p → q norm.
- Tyrtyshnikov, E. E. (1997). A Brief Introduction to Numerical Analysis. Birkhäuser Boston. https://doi.org/10.1007/978-0-8176-8136-4
- Matrix Norm. (2025). Wikipedia. https://en.wikipedia.org/wiki/Matrix_norm