
Elementary Functional Analysis: Why Types Matter in Optimization

Understanding the fundamental distinction between vectors and dual vectors—and why it's crucial for gradient-based optimization.


1. Introduction: The Overlooked Distinction That Matters

In machine learning and optimization, we constantly work with two types of mathematical objects:

  1. Parameter vectors (weights, biases - typically column vectors)
  2. Gradient vectors (derivatives of loss functions - typically row vectors)

In \( \mathbb{R}^n \) with the standard basis, we often casually convert between them using transposes. But this obscures a fundamental distinction that becomes critical when:

  • Working in non-standard coordinate systems
  • Using adaptive optimization algorithms
  • Moving beyond Euclidean spaces (e.g., Riemannian manifolds)

The Core Problem

Consider a loss function \( J(w) \) where \( w \in \mathbb{R}^n \). The gradient \( \nabla J(w) \) is:

  • Geometrically: a row vector (a covector)
  • Algebraically: an element of a different space than \( w \)

Treating them as interchangeable leads to subtle errors in transformation rules under reparameterization.

1.1 Physical Analogy: Pencils vs. Rulers

To build intuition, consider two physical objects:

  • **Kets as “Pencils” (\( \vert v \rangle \)):** Represent tangible quantities like displacements or velocities.
Example: A displacement vector \( \vec{d} = 3\hat{x} + 4\hat{y} \) in 2D space.
Property: Its description changes inversely to measurement units (contravariant).

  • **Bras as “Rulers” (\( \langle f \vert \)):** Represent measurement devices or gradients.
Example: A topographic map’s contour lines measuring elevation change.
Property: Its description changes with measurement units (covariant).

The invariant pairing: When you move a pencil through contour lines (a ruler), the elevation change \( \langle f \vert v \rangle \) is a physical reality that must be basis-independent.

2. Mathematical Foundation: Vector Spaces and Duality

2.1 Vector Spaces and Bases

Let \( V \) be an \( n \)-dimensional vector space (e.g., parameter space).

  • **Basis:** Choose linearly independent vectors \( \{\vert e_1 \rangle, \dots, \vert e_n \rangle\} \). (Think: coordinate axes.)
  • **Vector components:** Any \( \vert v \rangle \in V \) expands as:

\[ \vert v \rangle = \sum_{i=1}^n v^i \vert e_i \rangle \quad \text{(upper index)} \]

(Note: \( v^i \) are scalars, \( \vert e_i \rangle \) are basis vectors.)

2.2 The Dual Space: Home for “Rulers”

The dual space \( V^\ast \) contains all linear functionals (bras) \( \langle f \vert : V \to \mathbb{R} \).

Why dual space matters

In optimization:

  • \( V \) contains parameter vectors (weights)
  • \( V^\ast \) contains gradient vectors (derivatives)

They are fundamentally different mathematical objects.

2.3 Dual Basis: The Coordinate System for Rulers

For each basis \( \{\vert e_i \rangle\} \) in \( V \), there is a unique dual basis \( \{\langle \epsilon^j \vert\} \) in \( V^\ast \) satisfying:

\[ \langle \epsilon^j \vert e_i \rangle = \delta^j_i = \begin{cases} 1 & \text{if } i=j \\ 0 & \text{otherwise} \end{cases} \]

Any bra expands as:

\[ \langle f \vert = \sum_{j=1}^n f_j \langle \epsilon^j \vert \quad \text{(lower index)} \]
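In coordinates this is easy to realize: if the basis kets of \( \mathbb{R}^n \) form the columns of an invertible matrix, the dual-basis bras are the rows of its inverse. A minimal numpy sketch (the particular matrix `E` is an arbitrary illustrative choice):

```python
import numpy as np

# Basis kets as the columns of E (an arbitrary invertible choice).
E = np.array([[1.0, 1.0],
              [0.0, 1.0]])

# Dual-basis bras are the rows of E^{-1}: row j applied to column i
# gives exactly <eps^j | e_i> = delta^j_i.
E_inv = np.linalg.inv(E)

print(E_inv @ E)  # identity matrix, i.e., the duality condition holds
```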

2.4 The Fundamental Pairing

The action of bra on ket gives a basis-independent scalar:

\[ \langle f \vert v \rangle = \left( \sum_j f_j \langle \epsilon^j \vert \right) \left( \sum_i v^i \vert e_i \rangle \right) = \sum_{i,j} f_j v^i \underbrace{\langle \epsilon^j \vert e_i \rangle}_{\delta^j_i} = \sum_k f_k v^k \]

Key Insight

The invariant sum \( \sum_k f_k v^k \) requires that:

  • When basis vectors change, \( v^k \) and \( f_k \) must transform reciprocally
  • This is the origin of contravariance vs. covariance

3. Transformation Laws: Why Components Change Differently

3.1 Change of Basis: Scaling Example

Consider scaling basis vectors:

\[ \vert e'_i \rangle = \alpha_i \vert e_i \rangle \quad (\text{no sum}) \]

Question: How do the components of a fixed vector \( \vert v \rangle \) change?

Derivation:
Original:

\[ \vert v \rangle = v^i \vert e_i \rangle \]

New basis:

\[ \vert v \rangle = (v')^i \vert e'_i \rangle = (v')^i \alpha_i \vert e_i \rangle \]

Compare coefficients:

\[ v^i = (v')^i \alpha_i \]

Thus:

\[ \boxed{(v')^i = \frac{v^i}{\alpha_i}} \quad \text{(contravariant)} \]

Physical interpretation:
If you double the length of the basis vectors (\( \alpha_i = 2 \)), the component values halve to represent the same displacement.

3.2 How Dual Vectors Transform

Requirement: Dual basis must still satisfy

\[ \langle (\epsilon')^j \vert e'_i \rangle = \delta^j_i \]

Substitute basis change:

\[ \langle (\epsilon')^j \vert (\alpha_i \vert e_i \rangle) = \alpha_i \langle (\epsilon')^j \vert e_i \rangle = \delta^j_i \]

Assume \( \langle (\epsilon')^j \vert = \beta_j \langle \epsilon^j \vert \); then:

\[ \alpha_i \beta_j \langle \epsilon^j \vert e_i \rangle = \alpha_i \beta_j \delta^j_i = \delta^j_i \]

For \( i = j \):

\[ \alpha_j \beta_j = 1 \Rightarrow \beta_j = 1/\alpha_j \]

Thus:

\[ \boxed{\langle (\epsilon')^j \vert = \frac{1}{\alpha_j} \langle \epsilon^j \vert} \quad \text{(contravariant)} \]

3.3 Transformation of Bra Components

For a fixed functional \( \langle f \vert \):

Original:

\[ \langle f \vert = f_j \langle \epsilon^j \vert \]

New basis:

\[ \langle f \vert = (f')_j \langle (\epsilon')^j \vert = (f')_j \frac{1}{\alpha_j} \langle \epsilon^j \vert \]

Compare coefficients:

\[ f_j = (f')_j / \alpha_j \]

Thus:

\[ \boxed{(f')_j = f_j \alpha_j} \quad \text{(covariant)} \]

Transformation Summary Table

| Object                | Transformation Rule                                                              | Type          | Optimization Analogy     |
| --------------------- | -------------------------------------------------------------------------------- | ------------- | ------------------------ |
| Basis vectors         | \( \vert e'_i \rangle = \alpha_i \vert e_i \rangle \)                             | -             | Coordinate system change |
| Vector components     | \( (v')^i = v^i / \alpha_i \)                                                     | Contravariant | Parameter transformation |
| Dual basis            | \( \langle (\epsilon')^j \vert = \frac{1}{\alpha_j} \langle \epsilon^j \vert \)   | Contravariant | -                        |
| Functional components | \( (f')_j = f_j \alpha_j \)                                                       | Covariant     | Gradient transformation  |

3.4 The Critical Invariant

Verify scalar invariance:

\[ \langle f' \vert v' \rangle = (f')_j (v')^j = (f_j \alpha_j) \left( \frac{v^j}{\alpha_j} \right) = f_j v^j = \langle f \vert v \rangle \]
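A quick numerical sanity check of this invariance (the scaling factors below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=4)                   # ket components (contravariant)
f = rng.normal(size=4)                   # bra components (covariant)
alpha = np.array([2.0, 0.5, 3.0, 1.5])   # per-axis basis scaling factors

v_new = v / alpha   # contravariant: components divide by alpha
f_new = f * alpha   # covariant: components multiply by alpha

print(f @ v, f_new @ v_new)   # identical pairings <f|v>
```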

Why This Matters in ML

When reparameterizing a model as \( \tilde{w} = Aw \) for invertible \( A \) (so the implicit basis change is by \( A^{-1} \)):

  • Parameter components transform contravariantly: \( \tilde{w} = A w \)
  • Gradient components transform covariantly: \( \nabla_{\tilde{w}} J = A^{-\top} \nabla_w J \)

Mixing these transformations breaks optimization algorithms.
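As a sketch of why (a toy linear loss; all names here are illustrative), the chain rule reproduces exactly these rules and preserves the pairing:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3)) + 3 * np.eye(3)  # invertible reparameterization
c = rng.normal(size=3)

# Toy loss J(w) = c . w, whose gradient is c everywhere.
w = rng.normal(size=3)
w_new = A @ w                        # parameters map forward: w~ = A w

# J_new(w~) = J(A^{-1} w~), so the chain rule gives grad_new = A^{-T} c:
g_new = np.linalg.inv(A).T @ c

print(c @ w, g_new @ w_new)          # the pairing <grad|w> is unchanged
```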

4. Normed Spaces and Hilbert Spaces

4.1 Measuring Size: Norms

A norm \( \Vert \cdot \Vert_V \) satisfies:

\[ \Vert \vert x \rangle \Vert_V \ge 0 \]
\[ \Vert \vert x \rangle \Vert_V = 0 \iff \vert x \rangle = 0 \]
\[ \Vert \lambda \vert x \rangle \Vert_V = \vert \lambda \vert \Vert \vert x \rangle \Vert_V \]
\[ \Vert \vert x \rangle + \vert y \rangle \Vert_V \le \Vert \vert x \rangle \Vert_V + \Vert \vert y \rangle \Vert_V \]

Banach space: Complete normed space (all Cauchy sequences converge). Essential for:

  • Guaranteeing convergence of iterative optimization methods
  • Well-defined limits in infinite dimensions

4.2 Dual Norm: Measuring Functional Strength

For \( \langle f \vert \in V^\ast \):

\[ \Vert \langle f \vert \Vert_{V^\ast} = \sup_{\Vert \vert x \rangle \Vert_V \le 1} \vert \langle f \vert x \rangle \vert \]

Interpretation

The dual norm measures the maximum “amplification” a functional can apply. In optimization:

  • \( \Vert \langle \nabla J \vert \Vert_{V^\ast} \) quantifies sensitivity to parameter perturbations
  • \( V^\ast \) is always a Banach space under this norm
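As a concrete instance (assuming \( V = \mathbb{R}^n \) with the \( \ell_1 \) norm): the dual norm works out to be the \( \ell_\infty \) norm, since the supremum over the \( \ell_1 \) unit ball is attained at a signed basis vector. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
f = rng.normal(size=5)   # components of a functional <f|

# Dual norm w.r.t. the l1 norm: sup |<f|x>| over ||x||_1 <= 1.
# The sup is attained at a signed standard basis vector, giving max_i |f_i|.
i = np.argmax(np.abs(f))
x_star = np.zeros(5)
x_star[i] = np.sign(f[i])            # maximizing ket on the l1 unit sphere

# Random points on the l1 unit sphere never pair to anything larger:
xs = rng.normal(size=(10_000, 5))
xs /= np.abs(xs).sum(axis=1, keepdims=True)
print(np.abs(xs @ f).max(), "<=", f @ x_star, "==", np.abs(f).max())
```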

4.3 Adding Geometry: Inner Products

An inner product \( \langle \cdot \vert \cdot \rangle : V \times V \to \mathbb{R} \) adds:

  • Angles: \( \cos \theta = \frac{\langle x \vert y \rangle}{\Vert x \Vert \, \Vert y \Vert} \)
  • Orthogonality: \( \langle x \vert y \rangle = 0 \)
  • Induced norm: \( \Vert \vert x \rangle \Vert = \sqrt{\langle x \vert x \rangle} \)

Hilbert space: Complete inner product space (e.g., \( \mathbb{R}^n \) with the dot product, or \( L^2 \) function spaces).

5. The Riesz Bridge: Connecting Kets and Bras

5.1 The Fundamental Theorem

Riesz Representation Theorem

In a Hilbert space \( H \), for every continuous linear functional \( \langle \phi \vert \in H^\ast \) there exists a unique \( \vert y_\phi \rangle \in H \) such that:

\[ \langle \phi \vert x \rangle = \langle y_\phi \vert x \rangle \quad \forall \vert x \rangle \in H \]

Implications for optimization:

  1. Provides formal justification for representing gradients as vectors
  2. Shows this representation depends on the inner product
  3. Explains why we “see” gradients as vectors in \( \mathbb{R}^n \)

Critical Distinction

  • The Fréchet derivative \( \langle DJ \vert \) is intrinsically a bra (an element of \( V^\ast \))
  • The gradient \( \vert \nabla J \rangle \) is its Riesz representation in \( H \)
  • They are different mathematical objects with different transformation properties
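To make the dependence on the inner product concrete, here is a minimal sketch (the SPD matrix `G` defining the inner product is an arbitrary illustrative choice): with \( \langle u \vert v \rangle = u^\top G v \), the Riesz representative of the bra \( \langle DJ \vert \) is \( G^{-1} \) applied to its components.

```python
import numpy as np

rng = np.random.default_rng(3)

# A non-Euclidean inner product on R^3: <u|v> = u^T G v, with G SPD.
M = rng.normal(size=(3, 3))
G = M @ M.T + 3 * np.eye(3)     # symmetric positive definite by construction

df = rng.normal(size=3)         # bra components of the derivative <DJ|
grad = np.linalg.solve(G, df)   # Riesz representative: |grad J> = G^{-1} df

x = rng.normal(size=3)
print(df @ x)                   # <DJ | x>: pure duality pairing
print(grad @ G @ x)             # <grad J | x> via the inner product: equal
```

Change `G` and the gradient ket changes, while the bra \( \langle DJ \vert \) stays fixed.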

5.2 Why This Matters Practically

Consider reparameterizing a model as \( \tilde{w} = Aw \) for invertible \( A \):

| Object           | Transformation Rule                                                          | Type                                       |
| ---------------- | ---------------------------------------------------------------------------- | ------------------------------------------ |
| Parameters (ket) | \( \vert \tilde{w} \rangle = A \vert w \rangle \)                             | Contravariant                              |
| Gradient (bra)   | \( \langle \widetilde{\nabla J} \vert = \langle \nabla J \vert A^{-1} \)      | Covariant                                  |
| Gradient (ket)   | \( \vert \widetilde{\nabla J} \rangle = A^{-\top} \vert \nabla J \rangle \)   | Covariant (its components follow the bra rule) |

Common Mistake

Using \( \vert \widetilde{\nabla J} \rangle = A \vert \nabla J \rangle \), i.e., transforming the gradient ket like a parameter ket, would:

  1. Mix transformation types
  2. Break gradient descent convergence
  3. Violate the invariant pairing: \( \langle \widetilde{\nabla J} \vert \tilde{w} \rangle \neq \langle \nabla J \vert w \rangle \)
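A minimal numpy sketch of the breakage (random data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3)) + 3 * np.eye(3)    # invertible reparameterization
w, g = rng.normal(size=3), rng.normal(size=3)  # parameter ket, gradient ket

w_new = A @ w                      # parameters map forward
g_right = np.linalg.inv(A).T @ g   # correct gradient-ket rule: A^{-T}
g_wrong = A @ g                    # the common mistake: ket rule applied

print(g @ w)             # original pairing <grad|w>
print(g_right @ w_new)   # preserved by the correct rule
print(g_wrong @ w_new)   # broken: does not match
```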

6. Transforming Objects: Linear Operators and Their Dual Nature

6.1 Linear Operators: Mapping Between Spaces

A linear operator \( T: V \to W \) transforms kets while preserving linear structure:

\[ T(\alpha \vert x \rangle + \beta \vert y \rangle) = \alpha T\vert x \rangle + \beta T\vert y \rangle \]

Why this matters in ML

  • Weight matrices in neural networks
  • Feature maps in kernel methods
  • Projection operators in dimensionality reduction

6.2 The Adjoint Operator: Dualizing Transformations

When we transform kets with \( T \), how do measurements (bras) transform? The adjoint operator \( T^\dagger: W^\ast \to V^\ast \) provides the dual transformation:

Definition 6.1: Adjoint Operator (Coordinate-Free)

For Hilbert spaces \( H_1, H_2 \) and a bounded operator \( T: H_1 \to H_2 \), the adjoint \( T^\dagger: H_2 \to H_1 \) satisfies:

\[ \langle y \vert T x \rangle_{H_2} = \langle T^\dagger y \vert x \rangle_{H_1} \quad \forall \vert x \rangle \in H_1, \vert y \rangle \in H_2 \]

The Fundamental Duality Diagram

\[ \begin{array}{ccc} V & \xrightarrow{T} & W \\ \downarrow & & \uparrow \\ V^\ast & \xleftarrow{T^\dagger} & W^\ast \end{array} \]

The adjoint completes the “circuit” of transformations, preserving the scalar product \( \langle f \vert v \rangle \).

6.3 Basis Dependence: When Transposes Fail

Critical Warning

The familiar matrix transpose \( A^\top \) represents the adjoint only in orthonormal bases. In general bases:

\[ [T^\dagger] = G_1^{-1} [T]^H G_2 \]

where:

  • \( G_1, G_2 \) are the Gram matrices of the inner products on \( H_1 \) and \( H_2 \)
  • \( [T]^H \) is the conjugate transpose of \( T \)’s matrix

Example: Why Basis Matters

Consider

\[ \mathbb{R}^2 \]

with:

  • Basis:
\[ \vert e_1 \rangle = \begin{pmatrix}1\\0\end{pmatrix}, \vert e_2 \rangle = \begin{pmatrix}1\\1\end{pmatrix} \]
  • Operator:
\[ T = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix} \]

Gram matrix:

\[ G = \begin{pmatrix}1 & 1\\1 & 2\end{pmatrix} \]

True adjoint matrix:

\[ [T^\dagger] = G^{-1} T^T G = \begin{pmatrix}1 & -1\\0 & 1\end{pmatrix} \begin{pmatrix}2 & 0\\0 & 1\end{pmatrix} \begin{pmatrix}1 & 1\\1 & 2\end{pmatrix} = \begin{pmatrix}1 & 0\\1 & 1\end{pmatrix} \]

Not equal to

\[ T^T = \begin{pmatrix}2 & 0\\0 & 1\end{pmatrix} \]

! Using transpose directly would break invariance.
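A short numpy verification of this example, checking the defining identity \( \langle y \vert Tx \rangle = \langle T^\dagger y \vert x \rangle \) in components (where the inner product in components is \( u^\top G v \)):

```python
import numpy as np

# Basis kets as columns: e1 = (1,0), e2 = (1,1); standard dot product.
E = np.array([[1.0, 1.0],
              [0.0, 1.0]])
G = E.T @ E                          # Gram matrix [[1,1],[1,2]]

T = np.array([[2.0, 0.0],
              [0.0, 1.0]])           # operator matrix in the basis {e1, e2}

T_adj = np.linalg.inv(G) @ T.T @ G   # adjoint matrix: [[3,2],[-1,0]]

# Check <y|Tx> = <T_adj y|x> for random component vectors:
rng = np.random.default_rng(5)
x, y = rng.normal(size=2), rng.normal(size=2)
print(y @ G @ (T @ x), (T_adj @ y) @ G @ x)  # equal
print(T_adj)                                 # differs from T.T
```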

6.4 Special Operator Classes

| Operator Type | Definition                      | Key Properties                            | ML Applications                             |
| ------------- | ------------------------------- | ----------------------------------------- | ------------------------------------------- |
| Self-Adjoint  | \( T = T^\dagger \)             | Real eigenvalues, orthogonal eigenvectors | Covariance matrices, Hamiltonian in QML     |
| Unitary       | \( T^\dagger T = I \)           | Preserves inner products                  | Quantum circuits, orthogonal weight updates |
| Normal        | \( T T^\dagger = T^\dagger T \) | Unitarily diagonalizable                  | Stable recurrent architectures              |

6.5 Spectral Decomposition: The Power of Duality

Spectral Theorem (Compact Self-Adjoint)

For self-adjoint \( T \) on a Hilbert space \( H \):

\[ T = \sum_k \lambda_k \vert \phi_k \rangle \langle \phi_k \vert \]

where:

  • \( \lambda_k \in \mathbb{R} \) (eigenvalues)
  • \( \langle \phi_i \vert \phi_j \rangle = \delta_{ij} \) (orthonormal eigenvectors)

Why the Bra-Ket Form Matters

The projector \( \vert \phi_k \rangle \langle \phi_k \vert \):

  • Combines ket (state) and bra (measurement)
  • Represents a rank-1 operation
  • Shows why bras/kets can’t be arbitrarily interchanged

Optimization Connection: PCA/SVD are spectral decompositions:

  • Data covariance: \( C = \frac{1}{n} \sum_i \vert x_i \rangle \langle x_i \vert \)
  • Principal components: eigenvectors of \( C \)
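A minimal PCA sketch along these lines (the synthetic data is illustrative): build \( C \) as a sum of ket-bra outer products, then recover it from its spectral decomposition.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])  # toy data (n x d)
X -= X.mean(axis=0)

# Covariance as a sum of rank-1 ket-bra outer products |x_i><x_i| / n:
C = sum(np.outer(x, x) for x in X) / len(X)

# Spectral decomposition of the self-adjoint C:
lam, Phi = np.linalg.eigh(C)        # eigenvalues, orthonormal eigenvectors
C_rebuilt = sum(l * np.outer(p, p) for l, p in zip(lam, Phi.T))
print(np.allclose(C, C_rebuilt))    # True: C = sum_k lam_k |phi_k><phi_k|
```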

6.6 Singular Value Decomposition: General Case

For arbitrary \( T: H_1 \to H_2 \):

\[ T = \sum_k \sigma_k \vert u_k \rangle \langle v_k \vert \]

where:

  • \( \sigma_k \geq 0 \) (singular values)
  • \( \langle u_i \vert u_j \rangle = \delta_{ij} \), \( \langle v_i \vert v_j \rangle = \delta_{ij} \) (orthonormal singular vectors)

Duality in Action

The SVD simultaneously diagonalizes:

\[ T^\dagger T = \sum \sigma_k^2 \vert v_k \rangle \langle v_k \vert \]
\[ T T^\dagger = \sum \sigma_k^2 \vert u_k \rangle \langle u_k \vert \]

Together these show how the adjoint reveals hidden structure.
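A short numpy sketch of both facts (random matrix, illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
T = rng.normal(size=(4, 3))

U, s, Vt = np.linalg.svd(T, full_matrices=False)

# T as a sum of rank-1 terms sigma_k |u_k><v_k|:
T_rebuilt = sum(sk * np.outer(u, v) for sk, u, v in zip(s, U.T, Vt))
print(np.allclose(T, T_rebuilt))                             # True

# T^T T is diagonalized by the v_k with eigenvalues sigma_k^2:
print(np.allclose(np.linalg.eigvalsh(T.T @ T)[::-1], s**2))  # True
```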

7. Optimization in Abstract Spaces

7.1 Fréchet Derivative: The True Derivative

For \( J: V \to \mathbb{R} \), the derivative at \( \vert x \rangle \) is defined as the unique bra \( \langle DJ(\vert x \rangle) \vert \in V^\ast \) satisfying:

\[ J(\vert x + h \rangle) = J(\vert x \rangle) + \langle DJ(\vert x \rangle) \vert h \rangle + o(\Vert \vert h \rangle \Vert) \]

Why this matters

In non-Euclidean spaces (e.g., Riemannian manifolds):

  • The Fréchet derivative is always well-defined
  • The gradient requires additional structure (metric tensor)
  • Optimization algorithms use \( \langle DJ \vert \) directly in momentum terms

7.2 Gradient: The Practical Representation

In Hilbert spaces, via Riesz:

\[ \vert \nabla J(\vert x \rangle) \rangle \in H \quad \text{s.t.} \quad \langle DJ(\vert x \rangle) \vert h \rangle = \langle \nabla J(\vert x \rangle) \vert h \rangle \]

This enables gradient descent:

\[ \vert x_{k+1} \rangle = \vert x_k \rangle - \eta \vert \nabla J(\vert x_k \rangle) \rangle \]

Implementation Insight

When coding optimizers:

  • Store parameters as contravariant tensors (kets)
  • Store gradients as covariant tensors (bras)
  • Convert to gradient kets only for update steps
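A minimal sketch of such an optimizer loop (the quadratic loss, the SPD metric `M`, and the helper name `riesz` are all illustrative assumptions, not a fixed API):

```python
import numpy as np

def riesz(df, G):
    """Convert a bra (gradient covector) into a ket via the metric: G^{-1} df."""
    return np.linalg.solve(G, df)

# Illustrative setup: J(w) = 0.5 * ||w - w_star||^2, so <DJ(w)| has
# components w - w_star in the standard dual basis.
w_star = np.array([1.0, -2.0, 0.5])
dJ = lambda w: w - w_star            # stored as a covariant object (bra)

M = np.array([[2.0, 0.5, 0.0],       # SPD metric defining the inner product
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
w, eta = np.zeros(3), 0.2

for _ in range(200):
    w = w - eta * riesz(dJ(w), M)    # convert to a ket only for the update
print(w)                             # approaches w_star
```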

8. Conclusion: Why Types Prevent Errors

The ket/bra distinction resolves fundamental issues in optimization:

  1. Reparameterization invariance: Proper transformations preserve algorithm convergence
  2. Geometric consistency: Correct handling of non-Euclidean parameter spaces
  3. Algorithmic clarity: Momentum terms require covariant/contravariant consistency

Practical Cheat Sheet

| Scenario                  | Correct Approach                                                             |
| ------------------------- | ---------------------------------------------------------------------------- |
| Changing coordinates      | Transform parameters contravariantly, gradients covariantly                  |
| Implementing an optimizer | Store parameters as vectors, gradients as dual vectors                       |
| Custom gradient descent   | \( w \leftarrow w - \eta \, \text{Riesz}(\nabla J) \) (explicit conversion)   |
| Riemannian optimization   | Use \( \langle \nabla J \vert \) directly with metric-dependent transports    |

The “pencils” (parameters) and “rulers” (gradients) metaphor provides enduring intuition:
Physical measurements remain invariant only when transformation rules respect mathematical types.

This post is licensed under CC BY 4.0 by the author.