
Cheat Sheet: Linear Algebra

A quick reference guide compiling essential theorems, identities, facts, and formulas from linear algebra, designed for the crash course on mathematical foundations for machine learning and optimization.


Linear Algebra: Key Concepts and Formulas

This quick reference guide covers essential theorems, identities, and facts from linear algebra, particularly relevant for machine learning and optimization.

1. Vectors

Definitions & Operations

  • Vector: An element of a vector space, often represented as an array of numbers (components). Geometrically, a quantity with magnitude and direction.

    \[\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \in \mathbb{R}^n\]
  • Vector Addition:

    \[\mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1+v_1 \\ \vdots \\ u_n+v_n \end{bmatrix}\]
    • Properties: Commutative (\(\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}\)), Associative (\((\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})\))
  • Scalar Multiplication:

    \[c\mathbf{v} = \begin{bmatrix} cv_1 \\ \vdots \\ cv_n \end{bmatrix}\]
    • Properties: Distributive over vector/scalar addition, Associative
  • Dot Product (Standard Inner Product for \(\mathbb{R}^n\)):

    \[\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T \mathbf{v} = \sum_{i=1}^n u_i v_i\]
    • Properties:
      1. Commutative: \(\mathbf{u} \cdot \mathbf{v} = \mathbf{v} \cdot \mathbf{u}\)
      2. Distributive: \(\mathbf{u} \cdot (\mathbf{v} + \mathbf{w}) = \mathbf{u} \cdot \mathbf{v} + \mathbf{u} \cdot \mathbf{w}\)
      3. Bilinear: \((c\mathbf{u}) \cdot \mathbf{v} = c(\mathbf{u} \cdot \mathbf{v}) = \mathbf{u} \cdot (c\mathbf{v})\)
      4. Positive-definiteness: \(\mathbf{v} \cdot \mathbf{v} = \Vert \mathbf{v} \Vert_2^2 \ge 0\), and \(\mathbf{v} \cdot \mathbf{v} = 0 \iff \mathbf{v} = \mathbf{0}\)
  • Angle between two non-zero vectors:

    \[\cos \theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\Vert \mathbf{u} \Vert_2 \Vert \mathbf{v} \Vert_2}\]
  • Orthogonal Vectors: \(\mathbf{u} \cdot \mathbf{v} = 0\) (implies \(\theta = \pi/2\) or \(90^\circ\) if vectors are non-zero).
  • Vector Projection: The projection of vector \(\mathbf{a}\) onto non-zero vector \(\mathbf{b}\) is:

    \[\text{proj}_{\mathbf{b}} \mathbf{a} = \frac{\mathbf{a} \cdot \mathbf{b}}{\Vert \mathbf{b} \Vert_2^2} \mathbf{b}\]

    The scalar component of \(\mathbf{a}\) along \(\mathbf{b}\) is \(\frac{\mathbf{a} \cdot \mathbf{b}}{\Vert \mathbf{b} \Vert_2}\).
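
    The operations above map directly onto NumPy. The following minimal sketch (arbitrary example vectors, assuming NumPy is available) computes the dot product, the angle between \(\mathbf{a}\) and \(\mathbf{b}\), and the projection of \(\mathbf{a}\) onto \(\mathbf{b}\):

    ```python
    import numpy as np

    a = np.array([1.0, 2.0, 2.0])
    b = np.array([3.0, 0.0, 4.0])

    dot = a @ b                                        # a . b = sum_i a_i b_i
    cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))   # angle in radians

    proj_b_a = (a @ b) / (b @ b) * b                   # projection of a onto b
    scalar_comp = (a @ b) / np.linalg.norm(b)          # scalar component of a along b

    print(dot, theta, proj_b_a, scalar_comp)
    ```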

Vector Norms

  • Definition: A function \(\Vert \cdot \Vert : V \to \mathbb{R}\) such that for all \(c \in \mathbb{R}\), \(\mathbf{u}, \mathbf{v} \in V\):
    1. \(\Vert \mathbf{v} \Vert \ge 0\) (Non-negativity)
    2. \(\Vert \mathbf{v} \Vert = 0 \iff \mathbf{v} = \mathbf{0}\) (Definiteness)
    3. \(\Vert c\mathbf{v} \Vert = \vert c \vert \Vert \mathbf{v} \Vert\) (Absolute homogeneity)
    4. \(\Vert \mathbf{u} + \mathbf{v} \Vert \le \Vert \mathbf{u} \Vert + \Vert \mathbf{v} \Vert\) (Triangle Inequality)
  • Common \(L_p\) Norms: For \(\mathbf{v} \in \mathbb{R}^n\)
    • \(L_1\) norm (Manhattan norm): \(\Vert \mathbf{v} \Vert_1 = \sum_{i=1}^n \vert v_i \vert\)
    • \(L_2\) norm (Euclidean norm): \(\Vert \mathbf{v} \Vert_2 = \sqrt{\sum_{i=1}^n v_i^2} = \sqrt{\mathbf{v}^T \mathbf{v}}\)
    • \(L_p\) norm (\(p \ge 1\)): \(\Vert \mathbf{v} \Vert_p = \left( \sum_{i=1}^n \vert v_i \vert^p \right)^{1/p}\)
    • \(L_\infty\) norm (Max norm): \(\Vert \mathbf{v} \Vert_\infty = \max_{1 \le i \le n} \vert v_i \vert\)
  • Cauchy-Schwarz Inequality:

    \[\vert \mathbf{u} \cdot \mathbf{v} \vert \le \Vert \mathbf{u} \Vert_2 \Vert \mathbf{v} \Vert_2\]

    More generally for inner products: \(\vert \langle \mathbf{u}, \mathbf{v} \rangle \vert \le \Vert \mathbf{u} \Vert \Vert \mathbf{v} \Vert\). Equality holds if and only if \(\mathbf{u}\) and \(\mathbf{v}\) are linearly dependent.
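
These norms and the Cauchy-Schwarz inequality are easy to check numerically; a small sketch using `numpy.linalg.norm` (arbitrary example vectors):

```python
import numpy as np

v = np.array([3.0, -4.0, 1.0])
u = np.array([1.0, 2.0, -2.0])

l1   = np.linalg.norm(v, ord=1)       # sum of absolute values
l2   = np.linalg.norm(v)              # Euclidean norm (default ord=2)
linf = np.linalg.norm(v, ord=np.inf)  # max absolute value

# Cauchy-Schwarz: |u . v| <= ||u||_2 ||v||_2
assert abs(u @ v) <= np.linalg.norm(u) * np.linalg.norm(v) + 1e-12
print(l1, l2, linf)
```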

2. Matrices

Definitions & Special Matrices

  • Matrix: A rectangular array of numbers, \(A \in \mathbb{R}^{m \times n}\).

| Type | Definition / Condition | Key Notes |
|---|---|---|
| Identity (\(I_n\) or \(I\)) | Square matrix; \(A_{ii}=1\), \(A_{ij}=0\) for \(i \ne j\) | \(AI=IA=A\) |
| Zero (\(0\)) | All entries \(A_{ij}=0\) | |
| Diagonal | \(A_{ij}=0\) for \(i \ne j\) | |
| Symmetric | \(A = A^T\) (square matrix) | Real eigenvalues, orthogonally diagonalizable |
| Skew-Symmetric | \(A = -A^T\) (square matrix) | Diagonal elements \(A_{ii}=0\). Eigenvalues are 0 or purely imaginary. |
| Upper Triangular | \(A_{ij}=0\) for \(i > j\) (square matrix) | Eigenvalues are the diagonal entries |
| Lower Triangular | \(A_{ij}=0\) for \(i < j\) (square matrix) | Eigenvalues are the diagonal entries |
| Orthogonal (\(Q\)) | Square matrix; \(Q^T Q = Q Q^T = I\) | \(Q^{-1}=Q^T\), columns/rows form an orthonormal basis, \(\det(Q)=\pm 1\), preserves dot products & \(L_2\) norms |
| Normal | Square matrix; \(AA^T = A^T A\) (real) or \(AA^H = A^H A\) (complex) | Unitarily diagonalizable. Includes symmetric, skew-symmetric, orthogonal matrices. |

Matrix Operations

  • Matrix Addition/Subtraction: Element-wise. Matrices must have the same dimensions.
  • Scalar Multiplication: Multiply each element by the scalar.
  • Matrix Multiplication: If \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{n \times p}\), then \(C = AB \in \mathbb{R}^{m \times p}\), where

    \[C_{ij} = \sum_{k=1}^n A_{ik} B_{kj}\]
    • Properties:
      1. Associative: \((AB)C = A(BC)\)
      2. Distributive: \(A(B+C) = AB + AC\) and \((B+C)D = BD + CD\)
      3. Not Commutative in general: \(AB \ne BA\)
  • Transpose (\(A^T\)): Rows become columns (and vice versa). \((A^T)_{ij} = A_{ji}\).
    • Properties:
      1. \[(A^T)^T = A\]
      2. \[(A+B)^T = A^T + B^T\]
      3. \[(cA)^T = cA^T\]
      4. \((AB)^T = B^T A^T\) (Reverse order)
  • Matrix Inverse (\(A^{-1}\)): For a square matrix \(A\), if there exists \(A^{-1}\) such that \(AA^{-1} = A^{-1}A = I\).
    • \(A\) is invertible (or non-singular) if \(A^{-1}\) exists. This is true if and only if \(\det(A) \ne 0\).
    • Properties:
      1. \[(A^{-1})^{-1} = A\]
      2. \((AB)^{-1} = B^{-1}A^{-1}\) (if A, B are invertible, reverse order)
      3. \[(A^T)^{-1} = (A^{-1})^T\]
  • Trace (\(\text{tr}(A)\)): Sum of diagonal elements of a square matrix \(A\).

    \[\text{tr}(A) = \sum_{i=1}^n A_{ii}\]
    • Properties:
      1. \[\text{tr}(A+B) = \text{tr}(A) + \text{tr}(B)\]
      2. \[\text{tr}(cA) = c \cdot \text{tr}(A)\]
      3. \(\text{tr}(AB) = \text{tr}(BA)\) (Cyclic property. Valid if \(AB\) and \(BA\) are square, e.g., \(A \in \mathbb{R}^{m \times n}, B \in \mathbb{R}^{n \times m}\)).
      4. \[\text{tr}(A) = \text{tr}(A^T)\]
      5. \[\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)\]
  • Hadamard Product (Element-wise product, \(A \odot B\)): \((A \odot B)_{ij} = A_{ij} B_{ij}\). Matrices must have the same dimensions.
  • Moore-Penrose Pseudoinverse (\(A^+\)): For any matrix \(A \in \mathbb{R}^{m \times n}\), the pseudoinverse \(A^+ \in \mathbb{R}^{n \times m}\) is the unique matrix satisfying:
    1. \[AA^+A = A\]
    2. \[A^+AA^+ = A^+\]
    3. \((AA^+)^T = AA^+\) (so \(AA^+\) is symmetric)
    4. \((A^+A)^T = A^+A\) (so \(A^+A\) is symmetric)
      • If \(A\) has full column rank (\(n \le m\), \(\text{rank}(A)=n\)), then \(A^+ = (A^T A)^{-1}A^T\). In this case, \(A^+A = I_n\).
      • If \(A\) has full row rank (\(m \le n\), \(\text{rank}(A)=m\)), then \(A^+ = A^T(AA^T)^{-1}\). In this case, \(AA^+ = I_m\).
      • Computation via SVD: If \(A = U\Sigma V^T\) is the SVD of \(A\), then \(A^+ = V\Sigma^+U^T\), where \(\Sigma^+\) is formed by taking the reciprocal of the non-zero singular values in \(\Sigma\) and transposing the resulting matrix.
      • Use: The vector \(\mathbf{x}^+ = A^+\mathbf{b}\) is the minimum norm (\(\Vert \mathbf{x} \Vert_2\)) least-squares solution to \(A\mathbf{x} = \mathbf{b}\).
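
A minimal numerical sketch of the pseudoinverse (random data for illustration): `np.linalg.pinv` computes \(A^+\) via the SVD, and \(A^+\mathbf{b}\) agrees with the least-squares solution returned by `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # tall matrix, full column rank with probability 1
b = rng.standard_normal(5)

A_pinv = np.linalg.pinv(A)                      # Moore-Penrose pseudoinverse via SVD
x_plus = A_pinv @ b                             # minimum-norm least-squares solution
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_plus, x_lstsq)
assert np.allclose(A @ A_pinv @ A, A)           # defining property A A^+ A = A
```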

Matrix Norms

  • Operator Norm (Induced Norm):

    \[\Vert A \Vert_p = \sup_{\mathbf{x} \ne \mathbf{0}} \frac{\Vert A\mathbf{x} \Vert_p}{\Vert \mathbf{x} \Vert_p} = \sup_{\Vert \mathbf{x} \Vert_p = 1} \Vert A\mathbf{x} \Vert_p\]
    • Spectral Norm (\(p=2\)): \(\Vert A \Vert_2 = \sigma_{\max}(A)\) (largest singular value of \(A\))

      \[\Vert A \Vert_2 = \sqrt{\lambda_{\max}(A^T A)}\]

      where \(\lambda_{\max}(M)\) is the largest eigenvalue of \(M\).

    • General property: \(\Vert AB \Vert_p \le \Vert A \Vert_p \Vert B \Vert_p\) (submultiplicativity for induced norms).

  • Frobenius Norm:

    \[\Vert A \Vert_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n \vert A_{ij} \vert^2} = \sqrt{\text{tr}(A^T A)}\]
    • Properties:
      1. \(\Vert A \Vert_F^2 = \sum_{i=1}^{\min(m,n)} \sigma_i^2(A)\) (sum of squares of singular values of \(A\))
      2. Submultiplicative: \(\Vert AB \Vert_F \le \Vert A \Vert_F \Vert B \Vert_F\)
      3. \[\Vert A \Vert_2 \le \Vert A \Vert_F \le \sqrt{\text{rank}(A)} \Vert A \Vert_2\]
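
The spectral norm, the Frobenius norm, and the inequality relating them can be verified on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))

spectral = np.linalg.norm(A, ord=2)       # largest singular value
frobenius = np.linalg.norm(A, ord='fro')  # sqrt of sum of squared entries

sigma = np.linalg.svd(A, compute_uv=False)
r = np.linalg.matrix_rank(A)

assert np.isclose(spectral, sigma[0])
assert np.isclose(frobenius, np.sqrt(np.sum(sigma**2)))
assert spectral <= frobenius <= np.sqrt(r) * spectral + 1e-12
```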

3. Linear Systems, Vector Spaces & Subspaces

Linear Transformations

Definition. Linear Transformation

A transformation (or mapping) \(T: V \to W\) from a vector space \(V\) to a vector space \(W\) is linear if for all vectors \(\mathbf{u}, \mathbf{v} \in V\) and all scalars \(c\):

  1. \(T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})\) (Additivity)
  2. \(T(c\mathbf{u}) = cT(\mathbf{u})\) (Homogeneity)

Any linear transformation \(T: \mathbb{R}^n \to \mathbb{R}^m\) can be represented by matrix multiplication \(T(\mathbf{x}) = A\mathbf{x}\) for a unique matrix \(A \in \mathbb{R}^{m \times n}\). The columns of \(A\) are the images of the standard basis vectors of \(\mathbb{R}^n\) under \(T\).

Systems of Linear Equations

  • A system of linear equations can be written as \(A\mathbf{x} = \mathbf{b}\), where \(A \in \mathbb{R}^{m \times n}\) is the coefficient matrix, \(\mathbf{x} \in \mathbb{R}^n\) is the vector of unknowns, and \(\mathbf{b} \in \mathbb{R}^m\) is the constant vector.
  • A solution exists if and only if \(\mathbf{b} \in \text{Col}(A)\) (the column space of \(A\)). Equivalently, \(\text{rank}(A) = \text{rank}([A \mid \mathbf{b}])\).
  • If a solution exists:
    • It is unique if and only if \(\text{Null}(A) = \{\mathbf{0}\}\) (i.e., columns of \(A\) are linearly independent). This implies \(n \le m\) and \(\text{rank}(A) = n\).
    • If \(A\) is square (\(m=n\)) and invertible (\(\det(A) \ne 0\)), there is a unique solution \(\mathbf{x} = A^{-1}\mathbf{b}\).
    • If there are free variables (i.e., \(\text{nullity}(A) > 0\)), there are infinitely many solutions. The general solution is \(\mathbf{x}_p + \mathbf{x}_h\), where \(\mathbf{x}_p\) is a particular solution and \(\mathbf{x}_h\) is any solution to the homogeneous equation \(A\mathbf{x}=\mathbf{0}\).
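
A short sketch of the two common cases in NumPy (example systems chosen for illustration): a square invertible system solved exactly, and an overdetermined system solved in the least-squares sense:

```python
import numpy as np

# Square, invertible system: unique solution x = A^{-1} b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)          # preferred over forming A^{-1} explicitly
assert np.allclose(A @ x, b)

# Overdetermined system (m > n): generally no exact solution;
# np.linalg.lstsq returns the least-squares solution.
A_tall = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
b_tall = np.array([1.0, 2.0, 0.0])
x_ls, residuals, rank, _ = np.linalg.lstsq(A_tall, b_tall, rcond=None)
print(x_ls, rank)
```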

Vector Space Axioms

Definition. Vector Space

A set \(V\) equipped with two operations, vector addition (\(+\)) and scalar multiplication (\(\cdot\)), is a vector space over a field \(\mathbb{F}\) (typically \(\mathbb{R}\) or \(\mathbb{C}\)) if it satisfies the following axioms for all vectors \(\mathbf{u}, \mathbf{v}, \mathbf{w} \in V\) and scalars \(c, d \in \mathbb{F}\):

  1. \(\mathbf{u} + \mathbf{v} \in V\) (Closure under addition)
  2. \(\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}\) (Commutativity of addition)
  3. \((\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})\) (Associativity of addition)
  4. There exists a zero vector \(\mathbf{0} \in V\) such that \(\mathbf{v} + \mathbf{0} = \mathbf{v}\) for all \(\mathbf{v} \in V\) (Additive identity)
  5. For every \(\mathbf{v} \in V\), there exists an additive inverse \(-\mathbf{v} \in V\) such that \(\mathbf{v} + (-\mathbf{v}) = \mathbf{0}\)
  6. \(c\mathbf{v} \in V\) (Closure under scalar multiplication)
  7. \(c(\mathbf{u} + \mathbf{v}) = c\mathbf{u} + c\mathbf{v}\) (Distributivity of scalar multiplication over vector addition)
  8. \((c+d)\mathbf{v} = c\mathbf{v} + d\mathbf{v}\) (Distributivity of scalar multiplication over field addition)
  9. \(c(d\mathbf{v}) = (cd)\mathbf{v}\) (Associativity of scalar multiplication)
  10. \(1\mathbf{v} = \mathbf{v}\) (Scalar multiplicative identity, where \(1\) is the multiplicative identity in \(\mathbb{F}\))

Key Concepts for Vector Spaces

  • Linear Combination: A vector \(\mathbf{w} = c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \dots + c_k\mathbf{v}_k\) where \(c_i\) are scalars.
  • Span: The set of all possible linear combinations of a set of vectors \(\{\mathbf{v}_1, \dots, \mathbf{v}_k\}\). Denoted \(\text{span}\{\mathbf{v}_1, \dots, \mathbf{v}_k\}\).
  • Linear Independence: A set of vectors \(\{\mathbf{v}_1, \dots, \mathbf{v}_k\}\) is linearly independent if the only solution to \(c_1\mathbf{v}_1 + \dots + c_k\mathbf{v}_k = \mathbf{0}\) is \(c_1 = c_2 = \dots = c_k = 0\).
  • Basis: A linearly independent set of vectors that spans the entire vector space. All bases for a given vector space have the same number of vectors.
  • Dimension: The number of vectors in any basis for a vector space. Denoted \(\dim(V)\).
  • Subspace: A subset of a vector space that is itself a vector space under the same operations (i.e., it is closed under vector addition and scalar multiplication, and contains the zero vector).

Fundamental Subspaces of a Matrix \(A \in \mathbb{R}^{m \times n}\)

| Subspace Name | Definition | Space | Dimension | Orthogonal Complement |
|---|---|---|---|---|
| Column Space (Image, Range) | \(\text{Col}(A) = \{ A\mathbf{x} \mid \mathbf{x} \in \mathbb{R}^n \}\) | \(\mathbb{R}^m\) | \(\text{rank}(A)\) | Left Null Space (\(\text{Null}(A^T)\)) |
| Null Space (Kernel) | \(\text{Null}(A) = \{ \mathbf{x} \in \mathbb{R}^n \mid A\mathbf{x} = \mathbf{0} \}\) | \(\mathbb{R}^n\) | \(\text{nullity}(A) = n - \text{rank}(A)\) | Row Space (\(\text{Row}(A)\)) |
| Row Space (\(\text{Col}(A^T)\)) | \(\text{Row}(A) = \{ A^T\mathbf{y} \mid \mathbf{y} \in \mathbb{R}^m \}\) | \(\mathbb{R}^n\) | \(\text{rank}(A)\) | Null Space (\(\text{Null}(A)\)) |
| Left Null Space (\(\text{Null}(A^T)\)) | \(\text{Null}(A^T) = \{ \mathbf{y} \in \mathbb{R}^m \mid A^T\mathbf{y} = \mathbf{0} \}\) | \(\mathbb{R}^m\) | \(m - \text{rank}(A)\) | Column Space (\(\text{Col}(A)\)) |

  • Properties of Rank: For matrices \(A\) and \(B\) (dimensions allowing products/sums):
    1. \[0 \le \text{rank}(A) \le \min(m, n)\]
    2. \[\text{rank}(AB) \le \min(\text{rank}(A), \text{rank}(B))\]
    3. \[\text{rank}(A+B) \le \text{rank}(A) + \text{rank}(B)\]
    4. \[\text{rank}(A) = \text{rank}(A^T) = \text{rank}(A^T A) = \text{rank}(AA^T)\]

Theorem. Rank-Nullity Theorem

For any matrix \(A \in \mathbb{R}^{m \times n}\):

\[\text{rank}(A) + \text{nullity}(A) = n \quad (\text{number of columns of } A)\]

Orthogonal Complements

  • Two subspaces \(V\) and \(W\) of \(\mathbb{R}^k\) are orthogonal complements if every vector in \(V\) is orthogonal to every vector in \(W\), and \(V \oplus W = \mathbb{R}^k\) (their direct sum spans \(\mathbb{R}^k\) and their intersection is \(\{\mathbf{0}\}\)). Denoted \(W = V^\perp\).
  • The table above shows the orthogonal complement relationships for the fundamental subspaces. For example, \(\text{Col}(A)^\perp = \text{Null}(A^T)\) in \(\mathbb{R}^m\), and \(\text{Row}(A)^\perp = \text{Null}(A)\) in \(\mathbb{R}^n\). This implies \(\mathbb{R}^m = \text{Col}(A) \oplus \text{Null}(A^T)\) and \(\mathbb{R}^n = \text{Row}(A) \oplus \text{Null}(A)\).
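
The four fundamental subspaces and the rank-nullity theorem can be inspected numerically via the SVD; the sketch below uses a deliberately rank-deficient example matrix:

```python
import numpy as np

# A 3x3 matrix with one dependent row, so rank(A) = 2 and nullity(A) = 1.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A)
tol = 1e-10
r = int(np.sum(s > tol))                   # numerical rank

col_space_basis  = U[:, :r]                # orthonormal basis of Col(A)
row_space_basis  = Vt[:r, :].T             # orthonormal basis of Row(A)
null_space_basis = Vt[r:, :].T             # orthonormal basis of Null(A)
left_null_basis  = U[:, r:]                # orthonormal basis of Null(A^T)

assert r + null_space_basis.shape[1] == A.shape[1]   # rank-nullity theorem
assert np.allclose(A @ null_space_basis, 0.0)        # Null(A) vectors map to 0
```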

4. Determinants

  • Definition: A scalar value associated with a square matrix \(A \in \mathbb{R}^{n \times n}\), denoted \(\det(A)\) or \(\vert A \vert\).
    • For \(n=1\), \(A = [a_{11}]\), \(\det(A) = a_{11}\).
    • For \(n=2\), \(A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\), \(\det(A) = ad - bc\).
    • For \(n > 2\), often defined recursively using cofactor expansion along any row \(i\) (or column \(j\)):

      \[\det(A) = \sum_{j=1}^n (-1)^{i+j} A_{ij} M_{ij} \quad (\text{expansion along row } i)\]

      where \(M_{ij}\) is the determinant of the submatrix obtained by deleting row \(i\) and column \(j\) (the \((i,j)\)-minor).

  • Properties:
    1. \(\det(I) = 1\).
    2. If \(A\) has a row or column of zeros, \(\det(A) = 0\).
    3. If two rows or two columns of \(A\) are identical, \(\det(A) = 0\).
    4. If \(B\) is obtained from \(A\) by swapping two rows (or two columns), then \(\det(B) = -\det(A)\).
    5. If \(B\) is obtained from \(A\) by multiplying a single row (or column) by a scalar \(c\), then \(\det(B) = c \cdot \det(A)\).
    6. Consequently, for \(A \in \mathbb{R}^{n \times n}\), \(\det(cA) = c^n \det(A)\).
    7. If \(B\) is obtained from \(A\) by adding a multiple of one row to another row (or column to column), then \(\det(B) = \det(A)\).
    8. \(\det(AB) = \det(A)\det(B)\) for square matrices \(A, B\).
    9. \(\det(A^T) = \det(A)\).
    10. \(A\) is invertible (non-singular) if and only if \(\det(A) \ne 0\).
    11. If \(A\) is invertible, \(\det(A^{-1}) = 1/\det(A) = (\det(A))^{-1}\).
    12. For a triangular matrix (upper or lower), \(\det(A)\) is the product of its diagonal entries.
  • Geometric Interpretation: For a matrix \(A \in \mathbb{R}^{n \times n}\), \(\vert \det(A) \vert\) is the factor by which the linear transformation represented by \(A\) scales \(n\)-dimensional volume. The sign of \(\det(A)\) indicates whether the transformation preserves or reverses orientation.

Adjoint Matrix and Cramer’s Rule

  • Adjoint (or Adjugate) Matrix: The adjoint of \(A\), denoted \(\text{adj}(A)\) or \(\text{Adj}(A)\), is the transpose of the cofactor matrix of \(A\). That is, \((\text{adj}(A))_{ij} = C_{ji} = (-1)^{j+i} M_{ji}\).
    • Property: \(A \cdot \text{adj}(A) = \text{adj}(A) \cdot A = \det(A)I\).
    • If \(\det(A) \ne 0\), then \(A^{-1} = \frac{1}{\det(A)} \text{adj}(A)\).
  • Cramer’s Rule: For an invertible matrix \(A\), the unique solution to \(A\mathbf{x} = \mathbf{b}\) is given by:

    \[x_i = \frac{\det(A_i)}{\det(A)}\]

    where \(A_i\) is the matrix formed by replacing the \(i\)-th column of \(A\) with the vector \(\mathbf{b}\). Cramer’s rule is computationally inefficient for large systems but is theoretically important.
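
A quick numerical illustration of Cramer’s rule on a small invertible system (example values only); it agrees with `np.linalg.solve`, which is what one would actually use in practice:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

det_A = np.linalg.det(A)
x_cramer = np.empty(3)
for i in range(3):
    A_i = A.copy()
    A_i[:, i] = b            # replace the i-th column of A with b
    x_cramer[i] = np.linalg.det(A_i) / det_A

assert np.allclose(x_cramer, np.linalg.solve(A, b))
```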

5. Eigenvalues and Eigenvectors

  • Definition: For a square matrix \(A \in \mathbb{R}^{n \times n}\), a non-zero vector \(\mathbf{v} \in \mathbb{C}^n\) is an eigenvector of \(A\) if \(A\mathbf{v} = \lambda\mathbf{v}\) for some scalar \(\lambda \in \mathbb{C}\). The scalar \(\lambda\) is the corresponding eigenvalue.
  • Characteristic Equation: Eigenvalues are the roots of the characteristic polynomial \(p(\lambda) = \det(A - \lambda I) = 0\). This is a polynomial in \(\lambda\) of degree \(n\).
  • Properties:
    1. An \(n \times n\) matrix \(A\) has \(n\) eigenvalues in \(\mathbb{C}\), counting multiplicities.
    2. Sum of eigenvalues: \(\sum_{i=1}^n \lambda_i = \text{tr}(A)\)
    3. Product of eigenvalues: \(\prod_{i=1}^n \lambda_i = \det(A)\)
    4. Eigenvectors corresponding to distinct eigenvalues are linearly independent.
    5. If \(A\) is a triangular matrix, its eigenvalues are its diagonal entries.
    6. \(A\) and \(A^T\) have the same eigenvalues (but generally different eigenvectors).
    7. If \(\lambda\) is an eigenvalue of an invertible matrix \(A\), then \(1/\lambda\) is an eigenvalue of \(A^{-1}\). The corresponding eigenvector is the same.
    8. If \(\lambda\) is an eigenvalue of \(A\), then \(\lambda^k\) is an eigenvalue of \(A^k\) for any integer \(k \ge 0\). The corresponding eigenvector is the same.
    9. The set of all eigenvectors corresponding to an eigenvalue \(\lambda\), along with the zero vector, forms a subspace called the eigenspace \(E_\lambda = \text{Null}(A - \lambda I)\).
    10. Cayley-Hamilton Theorem: Every square matrix satisfies its own characteristic equation. If \(p(\lambda) = c_n \lambda^n + \dots + c_1 \lambda + c_0\) is the characteristic polynomial of \(A\), then \(p(A) = c_n A^n + \dots + c_1 A + c_0 I = 0\).
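
The trace and determinant identities, and the defining relation \(A\mathbf{v} = \lambda\mathbf{v}\), can be checked with `np.linalg.eig` on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))

eigvals, eigvecs = np.linalg.eig(A)       # eigenvalues may be complex

assert np.isclose(eigvals.sum(), np.trace(A))          # sum of eigenvalues = tr(A)
assert np.isclose(eigvals.prod(), np.linalg.det(A))    # product of eigenvalues = det(A)

# Each column of eigvecs satisfies A v = lambda v.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)
```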

For Symmetric Matrices (\(A = A^T\), \(A \in \mathbb{R}^{n \times n}\))

  1. All eigenvalues are real numbers.
  2. Eigenvectors corresponding to distinct eigenvalues are orthogonal.
  3. A real symmetric matrix is always orthogonally diagonalizable: there exists an orthogonal matrix \(Q\) whose columns are orthonormal eigenvectors of \(A\), and a diagonal matrix \(\Lambda\) whose diagonal entries are the corresponding eigenvalues, such that \(A = Q\Lambda Q^T\). This is known as the Spectral Theorem.

6. Matrix Decompositions

These are ways to factorize a matrix into a product of other matrices with useful properties.

Theorem. Spectral Theorem for Real Symmetric Matrices

If \(A \in \mathbb{R}^{n \times n}\) is a symmetric matrix (i.e., \(A = A^T\)), then there exists an orthogonal matrix \(Q \in \mathbb{R}^{n \times n}\) (whose columns are orthonormal eigenvectors of \(A\)) and a real diagonal matrix \(\Lambda \in \mathbb{R}^{n \times n}\) (whose diagonal entries are the eigenvalues of \(A\) corresponding to the columns of \(Q\)) such that:

\[A = Q \Lambda Q^T\]

This decomposition allows \(A\) to be expressed as a sum of rank-1 outer products:

\[A = \sum_{i=1}^n \lambda_i \mathbf{q}_i \mathbf{q}_i^T\]

where \(\mathbf{q}_i\) are the orthonormal eigenvectors (columns of \(Q\)) and \(\lambda_i\) are the corresponding eigenvalues (diagonal entries of \(\Lambda\)).
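
A minimal sketch of the spectral theorem in NumPy (random symmetric matrix for illustration): `np.linalg.eigh` returns real eigenvalues and orthonormal eigenvectors, from which \(A\) is rebuilt both as \(Q\Lambda Q^T\) and as a sum of rank-1 outer products:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                        # symmetrize so that A = A^T

eigvals, Q = np.linalg.eigh(A)           # real eigenvalues, orthonormal eigenvectors

assert np.allclose(Q @ np.diag(eigvals) @ Q.T, A)   # A = Q Lambda Q^T
assert np.allclose(Q.T @ Q, np.eye(4))              # Q is orthogonal

# Rank-1 expansion: A = sum_i lambda_i q_i q_i^T
A_sum = sum(lam * np.outer(q, q) for lam, q in zip(eigvals, Q.T))
assert np.allclose(A_sum, A)
```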

Theorem. Singular Value Decomposition (SVD)

For any matrix \(A \in \mathbb{R}^{m \times n}\) (not necessarily square or symmetric), there exist orthogonal matrices \(U \in \mathbb{R}^{m \times m}\) and \(V \in \mathbb{R}^{n \times n}\), and a real diagonal matrix \(\Sigma \in \mathbb{R}^{m \times n}\) (meaning only \((\Sigma)_{ii}\) can be non-zero) with non-negative real numbers \(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0\) on its “diagonal” (where \(r = \text{rank}(A)\), and \(\sigma_i = 0\) for \(i > r\)), such that:

\[A = U \Sigma V^T\]
  • The columns of \(U\) are left singular vectors (orthonormal eigenvectors of \(AA^T\)).
  • The columns of \(V\) are right singular vectors (orthonormal eigenvectors of \(A^T A\)).
  • The diagonal entries of \(\Sigma\) (\(\sigma_i\)) are singular values of \(A\). They are the square roots of the non-zero eigenvalues of both \(A^T A\) and \(AA^T\): \(\sigma_i(A) = \sqrt{\lambda_i(A^T A)} = \sqrt{\lambda_i(AA^T)}\). SVD provides a way to write \(A\) as a sum of rank-1 matrices (Full SVD sum goes up to \(\min(m,n)\), but terms with \(\sigma_i=0\) vanish):
\[A = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T\]

where \(\mathbf{u}_i\) is the \(i\)-th column of \(U\) and \(\mathbf{v}_i\) is the \(i\)-th column of \(V\).
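
The same kind of check for the SVD (random rectangular matrix for illustration), reconstructing \(A\) from \(U\Sigma V^T\) and from the rank-1 expansion:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: 5x5, s: (3,), Vt: 3x3

# Rebuild the m x n "diagonal" Sigma and check A = U Sigma V^T.
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, A)

# Rank-1 expansion: A = sum_i sigma_i u_i v_i^T
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(A_sum, A)
```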

Other Useful Decompositions

  • LU Decomposition: For many square matrices \(A\), one can find a factorization \(A = LU\) (or \(PA=LU\) including permutations \(P\) for stability/existence), where \(L\) is a lower triangular matrix (often with 1s on its diagonal, making it unit lower triangular) and \(U\) is an upper triangular matrix. This is commonly used to solve systems \(A\mathbf{x}=\mathbf{b}\) efficiently via forward and backward substitution.
  • QR Decomposition: For any \(A \in \mathbb{R}^{m \times n}\), it can be factored as \(A = QR\), where \(Q \in \mathbb{R}^{m \times m}\) is an orthogonal matrix and \(R \in \mathbb{R}^{m \times n}\) is an upper triangular matrix.
    • If \(m \ge n\) (tall or square matrix), a “thin” or “reduced” QR decomposition is often used: \(A = Q_1 R_1\), where \(Q_1 \in \mathbb{R}^{m \times n}\) has orthonormal columns and \(R_1 \in \mathbb{R}^{n \times n}\) is upper triangular. The columns of \(Q_1\) form an orthonormal basis for \(\text{Col}(A)\) if \(A\) has full column rank. This decomposition can be computed using the Gram-Schmidt process on the columns of \(A\).
  • Cholesky Decomposition: For a real, symmetric, positive definite matrix \(A\), there exists a unique lower triangular matrix \(L\) with strictly positive diagonal entries such that \(A = LL^T\). (Alternatively, \(A=R^T R\) where \(R\) is upper triangular, \(R=L^T\)). This is often used for solving linear systems with symmetric positive definite coefficient matrices and in statistics (e.g., sampling from multivariate normal distributions).
  • Diagonalization (\(A = PDP^{-1}\)): An \(n \times n\) matrix \(A\) is diagonalizable if and only if it has \(n\) linearly independent eigenvectors. In this case, \(A = PDP^{-1}\), where \(P\) is an invertible matrix whose columns are the eigenvectors of \(A\), and \(D\) is a diagonal matrix whose diagonal entries are the corresponding eigenvalues.
    • A matrix is diagonalizable if and only if for every eigenvalue \(\lambda\), its geometric multiplicity (dimension of the eigenspace \(E_\lambda\)) equals its algebraic multiplicity (multiplicity as a root of the characteristic polynomial).
    • If an \(n \times n\) matrix has \(n\) distinct eigenvalues, it is diagonalizable.
    • Symmetric matrices are a special case where \(P\) can be chosen to be orthogonal (\(Q\)), so \(A=Q\Lambda Q^T\).
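
QR and Cholesky factorizations are available directly in NumPy; the sketch below (random matrices, with \(A^T A\) used to manufacture a symmetric positive definite matrix) verifies the reduced QR factorization and the Cholesky factor:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))

# Reduced QR: Q1 has orthonormal columns, R1 is upper triangular.
Q1, R1 = np.linalg.qr(A, mode='reduced')
assert np.allclose(Q1 @ R1, A)
assert np.allclose(Q1.T @ Q1, np.eye(4))

# Cholesky of the symmetric positive definite matrix A^T A (full column rank here).
S = A.T @ A
L = np.linalg.cholesky(S)          # lower triangular with positive diagonal
assert np.allclose(L @ L.T, S)
```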

7. Positive Definite and Semi-definite Matrices

These definitions apply to symmetric matrices \(A \in \mathbb{R}^{n \times n}\). The quadratic form associated with \(A\) is \(\mathbf{x}^T A \mathbf{x}\).

  • Positive Definite (\(A \succ 0\)):

    \[\mathbf{x}^T A \mathbf{x} > 0 \quad \text{for all non-zero vectors } \mathbf{x} \in \mathbb{R}^n\]
    • Equivalent conditions:
      1. All eigenvalues of \(A\) are strictly positive (\(\lambda_i > 0\) for all \(i\)).
      2. All leading principal minors of \(A\) are strictly positive. (A leading principal minor of order \(k\) is the determinant of the top-left \(k \times k\) submatrix).
      3. Cholesky decomposition \(A=LL^T\) exists where \(L\) is a lower triangular matrix with strictly positive diagonal entries.
  • Positive Semi-definite (\(A \succeq 0\)):

    \[\mathbf{x}^T A \mathbf{x} \ge 0 \quad \text{for all vectors } \mathbf{x} \in \mathbb{R}^n\]
    • Equivalent conditions:
      1. All eigenvalues of \(A\) are non-negative (\(\lambda_i \ge 0\) for all \(i\)).
      2. All principal minors of \(A\) are non-negative. (A principal minor is the determinant of a submatrix obtained by deleting the same set of rows and columns).
  • Negative Definite (\(A \prec 0\)): \(\mathbf{x}^T A \mathbf{x} < 0\) for all non-zero \(\mathbf{x}\). (Equivalent to \(-A \succ 0\), all eigenvalues \(<0\)).
  • Negative Semi-definite (\(A \preceq 0\)): \(\mathbf{x}^T A \mathbf{x} \le 0\) for all \(\mathbf{x}\). (Equivalent to \(-A \succeq 0\), all eigenvalues \(\le 0\)).
  • Indefinite: The matrix \(A\) is neither positive semi-definite nor negative semi-definite. This means the quadratic form \(\mathbf{x}^T A \mathbf{x}\) can take both positive and negative values. (Equivalent to \(A\) having at least one positive eigenvalue and at least one negative eigenvalue).
  • Connection to Convexity: For a twice-differentiable function \(f: \mathbb{R}^n \to \mathbb{R}\), its Hessian matrix \(\nabla^2 f(\mathbf{x})\) being positive semi-definite (positive definite) over a convex domain implies \(f\) is convex (strictly convex) over that domain. This is fundamental in optimization.
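
In practice, definiteness is usually determined from the eigenvalues of the symmetric matrix; a minimal sketch (the classification helper and the example matrices are illustrative, not a library routine):

```python
import numpy as np

def definiteness(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eigvals = np.linalg.eigvalsh(A)          # real eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "positive definite"
    if np.all(eigvals >= -tol):
        return "positive semi-definite"
    if np.all(eigvals < -tol):
        return "negative definite"
    if np.all(eigvals <= tol):
        return "negative semi-definite"
    return "indefinite"

print(definiteness(np.array([[2.0, 0.0], [0.0, 3.0]])))    # positive definite
print(definiteness(np.array([[1.0, 0.0], [0.0, 0.0]])))    # positive semi-definite
print(definiteness(np.array([[1.0, 0.0], [0.0, -2.0]])))   # indefinite
```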

8. Inner Product Spaces

Definition. Inner Product

An inner product on a real vector space \(V\) is a function \(\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}\) that for all vectors \(\mathbf{u}, \mathbf{v}, \mathbf{w} \in V\) and any scalar \(c \in \mathbb{R}\) satisfies the following properties:

  1. Symmetry: \(\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle\)
  2. Linearity in the first argument:

    \[\langle c\mathbf{u} + \mathbf{v}, \mathbf{w} \rangle = c\langle \mathbf{u}, \mathbf{w} \rangle + \langle \mathbf{v}, \mathbf{w} \rangle\]

    (Linearity in the second argument follows from symmetry: \(\langle \mathbf{u}, c\mathbf{v} + \mathbf{w} \rangle = c\langle \mathbf{u}, \mathbf{v} \rangle + \langle \mathbf{u}, \mathbf{w} \rangle\)).

  3. Positive-definiteness: \(\langle \mathbf{v}, \mathbf{v} \rangle \ge 0\), and \(\langle \mathbf{v}, \mathbf{v} \rangle = 0 \iff \mathbf{v} = \mathbf{0}\)

The standard dot product in \(\mathbb{R}^n\) (\(\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T \mathbf{v}\)) is an example of an inner product. Another example is a weighted inner product \(\langle \mathbf{u}, \mathbf{v} \rangle_W = \mathbf{u}^T W \mathbf{v}\), where \(W\) is a symmetric positive definite matrix. An inner product induces a norm defined by \(\Vert \mathbf{v} \Vert = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}\).
  • Cauchy-Schwarz Inequality (general form):

    \[\vert \langle \mathbf{u}, \mathbf{v} \rangle \vert \le \Vert \mathbf{u} \Vert \Vert \mathbf{v} \Vert\]

    Equality holds if and only if \(\mathbf{u}\) and \(\mathbf{v}\) are linearly dependent.

  • Orthogonality: Two vectors \(\mathbf{u}\) and \(\mathbf{v}\) are orthogonal with respect to the inner product if \(\langle \mathbf{u}, \mathbf{v} \rangle = 0\).
  • Orthonormal Basis: A basis \(\{\mathbf{q}_1, \dots, \mathbf{q}_n\}\) for an \(n\)-dimensional inner product space is orthonormal if \(\langle \mathbf{q}_i, \mathbf{q}_j \rangle = \delta_{ij}\) (Kronecker delta: 1 if \(i=j\), 0 if \(i \ne j\)).
  • Gram-Schmidt Process: An algorithm that takes any basis for an inner product space and constructs an orthonormal basis for that space.
  • Orthonormal Expansion and Parseval’s Identity: If \(\{\mathbf{q}_1, \dots, \mathbf{q}_n\}\) is an orthonormal basis for an inner product space \(V\), then any vector \(\mathbf{v} \in V\) expands in that basis as below, and Parseval’s identity gives its squared norm:

    \[\mathbf{v} = \sum_{i=1}^n \langle \mathbf{v}, \mathbf{q}_i \rangle \mathbf{q}_i \quad \text{and} \quad \Vert \mathbf{v} \Vert^2 = \sum_{i=1}^n \vert \langle \mathbf{v}, \mathbf{q}_i \rangle \vert^2\]

9. Matrix Calculus

Common derivatives, using the denominator layout (gradient) convention: the result has the same shape as the variable being differentiated with respect to (e.g., differentiating a scalar w.r.t. a column vector \(\mathbf{x}\) yields a column vector).

| Function \(f\) | Derivative | Notes |
|---|---|---|
| \(\mathbf{a}^T \mathbf{x}\) (or \(\mathbf{x}^T \mathbf{a}\)) | \(\frac{\partial f}{\partial \mathbf{x}} = \mathbf{a}\) | \(\mathbf{x}, \mathbf{a} \in \mathbb{R}^n\). Result is \(n \times 1\). |
| \(\mathbf{x}^T A \mathbf{x}\) | \(\frac{\partial f}{\partial \mathbf{x}} = (A + A^T)\mathbf{x}\) | \(A \in \mathbb{R}^{n \times n}\). If \(A\) is symmetric, \(2A\mathbf{x}\). Result is \(n \times 1\). |
| \(\Vert A\mathbf{x} - \mathbf{b} \Vert_2^2\) | \(\frac{\partial f}{\partial \mathbf{x}} = 2A^T(A\mathbf{x} - \mathbf{b})\) | \(A \in \mathbb{R}^{m \times n}, \mathbf{x} \in \mathbb{R}^n, \mathbf{b} \in \mathbb{R}^m\). Result is \(n \times 1\). |
| \(\text{tr}(AX)\) | \(\frac{\partial f}{\partial X} = A^T\) | \(X \in \mathbb{R}^{k \times l}, A \in \mathbb{R}^{l \times k}\). Result is \(k \times l\) (shape of \(X\)). |
| \(\text{tr}(XA)\) | \(\frac{\partial f}{\partial X} = A^T\) | \(X \in \mathbb{R}^{k \times l}, A \in \mathbb{R}^{l \times k}\). Result is \(k \times l\) (shape of \(X\)). |
| \(\text{tr}(AXB)\) | \(\frac{\partial f}{\partial X} = A^T B^T\) | \(X \in \mathbb{R}^{k \times l}, A \in \mathbb{R}^{p \times k}, B \in \mathbb{R}^{l \times p}\). Result is \(k \times l\) (shape of \(X\)). |
| \(\text{tr}(X^T A X)\) | \(\frac{\partial f}{\partial X} = (A+A^T)X\) | \(X \in \mathbb{R}^{k \times l}, A \in \mathbb{R}^{k \times k}\). If \(A\) is symmetric, \(2AX\). Result is \(k \times l\) (shape of \(X\)). |
| \(\log \det(X)\) | \(\frac{\partial f}{\partial X} = (X^{-1})^T = X^{-T}\) | \(X \in \mathbb{R}^{n \times n}\) positive definite. If \(X\) is symmetric, \(X^{-1}\). Result is \(n \times n\). |
| \(\det(X)\) (Jacobi’s Formula) | \(\frac{\partial f}{\partial X} = \det(X) (X^{-1})^T = \text{adj}(X)^T\) | \(X \in \mathbb{R}^{n \times n}\) invertible. If \(X\) is symmetric, \(\det(X)X^{-1}\). Result is \(n \times n\). |
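
Gradient formulas such as \(\frac{\partial f}{\partial \mathbf{x}} = 2A^T(A\mathbf{x} - \mathbf{b})\) for \(f(\mathbf{x}) = \Vert A\mathbf{x} - \mathbf{b} \Vert_2^2\) are easy to sanity-check against central finite differences; a minimal sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

f = lambda x: np.sum((A @ x - b) ** 2)   # f(x) = ||Ax - b||_2^2

grad_analytic = 2 * A.T @ (A @ x - b)

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_numeric = np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = eps
    grad_numeric[i] = (f(x + e) - f(x - e)) / (2 * eps)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-5)
```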

10. Miscellaneous Identities and Facts

  • Woodbury Matrix Identity (Matrix Inversion Lemma): Allows efficient computation of the inverse of a rank-\(k\) corrected matrix:

    \[(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}\]

    where \(A \in \mathbb{R}^{n \times n}\), \(U \in \mathbb{R}^{n \times k}\), \(C \in \mathbb{R}^{k \times k}\), \(V \in \mathbb{R}^{k \times n}\). Assumes \(A, C\) and \((C^{-1} + VA^{-1}U)\) are invertible.

  • Sherman-Morrison Formula (rank-1 update): A special case of the Woodbury identity where \(k=1\) (i.e., \(U=\mathbf{u}\), \(V=\mathbf{v}^T\), \(C=1\)):

    \[(A + \mathbf{u}\mathbf{v}^T)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\mathbf{v}^T A^{-1}}{1 + \mathbf{v}^T A^{-1}\mathbf{u}}\]

    Assumes \(A\) is invertible and \(1 + \mathbf{v}^T A^{-1}\mathbf{u} \ne 0\).

  • Projection Matrix:
    • The matrix that orthogonally projects vectors onto the column space \(\text{Col}(A)\) of a matrix \(A \in \mathbb{R}^{m \times n}\) with linearly independent columns (i.e., \(\text{rank}(A)=n\)) is:

      \[P_A = A(A^T A)^{-1} A^T\]
    • For any \(\mathbf{b} \in \mathbb{R}^m\), \(P_A\mathbf{b}\) is the vector in \(\text{Col}(A)\) closest to \(\mathbf{b}\).
    • Properties of orthogonal projection matrices: \(P_A^2 = P_A\) (idempotent) and \(P_A^T = P_A\) (symmetric).
  • Relationship between \(A^TA\) and \(AA^T\) for \(A \in \mathbb{R}^{m \times n}\):
    • Both \(A^TA \in \mathbb{R}^{n \times n}\) and \(AA^T \in \mathbb{R}^{m \times m}\) are symmetric and positive semi-definite.
    • They have the same non-zero eigenvalues. Their non-zero eigenvalues are the squares of the non-zero singular values of \(A\).
    • \(\text{rank}(A) = \text{rank}(A^TA) = \text{rank}(AA^T)\).
    • If \(A\) has full column rank (\(\text{rank}(A)=n \le m\)), then \(A^TA\) is positive definite (and thus invertible).
    • If \(A\) has full row rank (\(\text{rank}(A)=m \le n\)), then \(AA^T\) is positive definite (and thus invertible).
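
Several identities in this section lend themselves to quick numerical checks; the sketch below (random matrices and vectors, assuming the relevant inverses exist and the Sherman-Morrison denominator is non-zero) verifies the rank-1 update formula and the defining properties of the orthogonal projection matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)    # invertible, well-conditioned
u = rng.standard_normal(n)
v = rng.standard_normal(n)

# Sherman-Morrison: inverse of a rank-1 update of A.
A_inv = np.linalg.inv(A)
denom = 1.0 + v @ A_inv @ u                        # assumed non-zero
sm_inv = A_inv - np.outer(A_inv @ u, v @ A_inv) / denom
assert np.allclose(sm_inv, np.linalg.inv(A + np.outer(u, v)))

# Orthogonal projection onto Col(B) for a tall matrix B with full column rank.
B = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
P = B @ np.linalg.inv(B.T @ B) @ B.T               # P_B = B (B^T B)^{-1} B^T
assert np.allclose(P @ P, P)                       # idempotent
assert np.allclose(P, P.T)                         # symmetric
assert np.allclose(B.T @ (b - P @ b), 0)           # residual orthogonal to Col(B)
```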

This list is intended as a compact reference. For deeper understanding, proofs, and further topics, consult standard linear algebra textbooks or Wikipedia.

This post is licensed under CC BY 4.0 by the author.