Cheat Sheet: Linear Algebra
A quick reference guide compiling essential theorems, identities, facts, and formulas from linear algebra, designed for the crash course on mathematical foundations for machine learning and optimization.
1. Vectors
Definitions & Operations
Vector: An element of a vector space, often represented as an array of numbers (components). Geometrically, a quantity with magnitude and direction.
\[\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \in \mathbb{R}^n\]
Vector Addition:
\[\mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1+v_1 \\ \vdots \\ u_n+v_n \end{bmatrix}\]
- Properties: Commutative (\(\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}\)), Associative (\((\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})\))
Scalar Multiplication:
\[c\mathbf{v} = \begin{bmatrix} cv_1 \\ \vdots \\ cv_n \end{bmatrix}\]
- Properties: Distributive over vector/scalar addition, Associative
Dot Product (Standard Inner Product for \(\mathbb{R}^n\)):
\[\mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T \mathbf{v} = \sum_{i=1}^n u_i v_i\]
- Properties:
- Commutative: \(\mathbf{u} \cdot \mathbf{v} = \mathbf{v} \cdot \mathbf{u}\)
- Distributive: \(\mathbf{u} \cdot (\mathbf{v} + \mathbf{w}) = \mathbf{u} \cdot \mathbf{v} + \mathbf{u} \cdot \mathbf{w}\)
- Bilinear: \((c\mathbf{u}) \cdot \mathbf{v} = c(\mathbf{u} \cdot \mathbf{v}) = \mathbf{u} \cdot (c\mathbf{v})\)
- Positive-definiteness: \(\mathbf{v} \cdot \mathbf{v} = \Vert \mathbf{v} \Vert_2^2 \ge 0\), and \(\mathbf{v} \cdot \mathbf{v} = 0 \iff \mathbf{v} = \mathbf{0}\)
Angle between two non-zero vectors:
\[\cos \theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\Vert \mathbf{u} \Vert_2 \Vert \mathbf{v} \Vert_2}\]
- Orthogonal Vectors: \(\mathbf{u} \cdot \mathbf{v} = 0\) (implies \(\theta = \pi/2\) or \(90^\circ\) if the vectors are non-zero).
Vector Projection: The projection of vector \(\mathbf{a}\) onto non-zero vector \(\mathbf{b}\) is:
\[\text{proj}_{\mathbf{b}} \mathbf{a} = \frac{\mathbf{a} \cdot \mathbf{b}}{\Vert \mathbf{b} \Vert_2^2} \mathbf{b}\]
The scalar component of \(\mathbf{a}\) along \(\mathbf{b}\) is \(\frac{\mathbf{a} \cdot \mathbf{b}}{\Vert \mathbf{b} \Vert_2}\).
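As a quick numerical companion to the definitions above, here is a minimal NumPy sketch (the vectors are arbitrary illustrative values, not taken from the text) that computes a dot product, the projection of \(\mathbf{a}\) onto \(\mathbf{b}\), and checks that the residual is orthogonal to \(\mathbf{b}\):

```python
import numpy as np

a = np.array([3.0, 1.0, 2.0])
b = np.array([1.0, 0.0, 1.0])

dot = a @ b                                  # u·v = u^T v
proj = (a @ b) / (b @ b) * b                 # proj_b(a) = (a·b / ||b||^2) b
scalar_comp = (a @ b) / np.linalg.norm(b)    # scalar component of a along b

# The residual a - proj_b(a) is orthogonal to b (up to floating-point error).
assert abs((a - proj) @ b) < 1e-12
print(dot, proj, scalar_comp)
```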
Vector Norms
- Definition: A function \(\Vert \cdot \Vert : V \to \mathbb{R}\) such that for all \(c \in \mathbb{R}\), \(\mathbf{u}, \mathbf{v} \in V\):
- \(\Vert \mathbf{v} \Vert \ge 0\) (Non-negativity)
- \(\Vert \mathbf{v} \Vert = 0 \iff \mathbf{v} = \mathbf{0}\) (Definiteness)
- \(\Vert c\mathbf{v} \Vert = \vert c \vert \Vert \mathbf{v} \Vert\) (Absolute homogeneity)
- \(\Vert \mathbf{u} + \mathbf{v} \Vert \le \Vert \mathbf{u} \Vert + \Vert \mathbf{v} \Vert\) (Triangle Inequality)
- Common \(L_p\) Norms: For \(\mathbf{v} \in \mathbb{R}^n\)
- \(L_1\) norm (Manhattan norm): \(\Vert \mathbf{v} \Vert_1 = \sum_{i=1}^n \vert v_i \vert\)
- \(L_2\) norm (Euclidean norm): \(\Vert \mathbf{v} \Vert_2 = \sqrt{\sum_{i=1}^n v_i^2} = \sqrt{\mathbf{v}^T \mathbf{v}}\)
- \(L_p\) norm (\(p \ge 1\)): \(\Vert \mathbf{v} \Vert_p = \left( \sum_{i=1}^n \vert v_i \vert^p \right)^{1/p}\)
- \(L_\infty\) norm (Max norm): \(\Vert \mathbf{v} \Vert_\infty = \max_{1 \le i \le n} \vert v_i \vert\)
Cauchy-Schwarz Inequality:
\[\vert \mathbf{u} \cdot \mathbf{v} \vert \le \Vert \mathbf{u} \Vert_2 \Vert \mathbf{v} \Vert_2\]
More generally, for inner products: \(\vert \langle \mathbf{u}, \mathbf{v} \rangle \vert \le \Vert \mathbf{u} \Vert \Vert \mathbf{v} \Vert\). Equality holds if and only if \(\mathbf{u}\) and \(\mathbf{v}\) are linearly dependent.
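The \(L_p\) norms and the Cauchy-Schwarz inequality are easy to check numerically; a small sketch with arbitrary example vectors (NumPy assumed purely for illustration):

```python
import numpy as np

u = np.array([1.0, 2.0, -2.0])
v = np.array([3.0, -4.0, 1.0])

l1 = np.linalg.norm(v, 1)          # L1 (Manhattan) norm: sum of |v_i|
l2 = np.linalg.norm(v)             # L2 (Euclidean) norm, the default for vectors
linf = np.linalg.norm(v, np.inf)   # L-infinity (max) norm

# Cauchy-Schwarz: |u·v| <= ||u||_2 ||v||_2
assert abs(u @ v) <= np.linalg.norm(u) * np.linalg.norm(v) + 1e-12
print(l1, l2, linf)
```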
2. Matrices
Definitions & Special Matrices
- Matrix: A rectangular array of numbers, \(A \in \mathbb{R}^{m \times n}\).
| Type | Definition / Condition | Key Notes |
| --- | --- | --- |
| Identity (\(I_n\) or \(I\)) | Square matrix; \(A_{ii}=1\), \(A_{ij}=0\) for \(i \ne j\) | \(AI=IA=A\) |
| Zero (\(0\)) | All entries \(A_{ij}=0\) | |
| Diagonal | \(A_{ij}=0\) for \(i \ne j\) | |
| Symmetric | \(A = A^T\) (square matrix) | Real eigenvalues, orthogonally diagonalizable |
| Skew-Symmetric | \(A = -A^T\) (square matrix) | Diagonal elements \(A_{ii}=0\). Eigenvalues are 0 or purely imaginary. |
| Upper Triangular | \(A_{ij}=0\) for \(i > j\) (square matrix) | Eigenvalues are the diagonal entries |
| Lower Triangular | \(A_{ij}=0\) for \(i < j\) (square matrix) | Eigenvalues are the diagonal entries |
| Orthogonal (\(Q\)) | Square matrix; \(Q^T Q = Q Q^T = I\) | \(Q^{-1}=Q^T\), columns/rows form an orthonormal basis, \(\det(Q)=\pm 1\), preserves dot products & \(L_2\) norms |
| Normal | Square matrix; \(AA^T = A^T A\) (real) or \(AA^H = A^H A\) (complex) | Unitarily diagonalizable. Includes symmetric, skew-symmetric, and orthogonal matrices. |
Matrix Operations
- Matrix Addition/Subtraction: Element-wise. Matrices must have the same dimensions.
- Scalar Multiplication: Multiply each element by the scalar.
Matrix Multiplication: If \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{n \times p}\), then \(C = AB \in \mathbb{R}^{m \times p}\), where
\[C_{ij} = \sum_{k=1}^n A_{ik} B_{kj}\]
- Properties:
- Associative: \((AB)C = A(BC)\)
- Distributive: \(A(B+C) = AB + AC\) and \((B+C)D = BD + CD\)
- Not Commutative in general: \(AB \ne BA\)
- Transpose (\(A^T\)): Rows become columns (and vice versa). \((A^T)_{ij} = A_{ji}\).
- Properties:
- \[(A^T)^T = A\]
- \[(A+B)^T = A^T + B^T\]
- \[(cA)^T = cA^T\]
- \((AB)^T = B^T A^T\) (Reverse order)
- Matrix Inverse (\(A^{-1}\)): For a square matrix \(A\), the inverse (if it exists) is the matrix \(A^{-1}\) satisfying \(AA^{-1} = A^{-1}A = I\).
- \(A\) is invertible (or non-singular) if \(A^{-1}\) exists. This holds if and only if \(\det(A) \ne 0\).
- Properties:
- \[(A^{-1})^{-1} = A\]
- \((AB)^{-1} = B^{-1}A^{-1}\) (if A, B are invertible, reverse order)
- \[(A^T)^{-1} = (A^{-1})^T\]
Trace (\(\text{tr}(A)\)): Sum of diagonal elements of a square matrix \(A\).
\[\text{tr}(A) = \sum_{i=1}^n A_{ii}\]
- Properties:
- \[\text{tr}(A+B) = \text{tr}(A) + \text{tr}(B)\]
- \[\text{tr}(cA) = c \cdot \text{tr}(A)\]
- \(\text{tr}(AB) = \text{tr}(BA)\) (Cyclic property. Valid if \(AB\) and \(BA\) are square, e.g., \(A \in \mathbb{R}^{m \times n}, B \in \mathbb{R}^{n \times m}\)).
- \[\text{tr}(A) = \text{tr}(A^T)\]
- \[\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)\]
- Hadamard Product (Element-wise product, \(A \odot B\)): \((A \odot B)_{ij} = A_{ij} B_{ij}\). Matrices must have the same dimensions.
- Moore-Penrose Pseudoinverse (\(A^+\)): For any matrix \(A \in \mathbb{R}^{m \times n}\), the pseudoinverse \(A^+ \in \mathbb{R}^{n \times m}\) is the unique matrix satisfying:
- \[AA^+A = A\]
- \[A^+AA^+ = A^+\]
- \((AA^+)^T = AA^+\) (so \(AA^+\) is symmetric)
- \((A^+A)^T = A^+A\) (so \(A^+A\) is symmetric)
- If \(A\) has full column rank (\(n \le m\), \(\text{rank}(A)=n\)), then \(A^+ = (A^T A)^{-1}A^T\). In this case, \(A^+A = I_n\).
- If \(A\) has full row rank (\(m \le n\), \(\text{rank}(A)=m\)), then \(A^+ = A^T(AA^T)^{-1}\). In this case, \(AA^+ = I_m\).
- Computation via SVD: If \(A = U\Sigma V^T\) is the SVD of \(A\), then \(A^+ = V\Sigma^+U^T\), where \(\Sigma^+\) is formed by taking the reciprocal of the non-zero singular values in \(\Sigma\) and transposing the resulting matrix.
- Use: The vector \(\mathbf{x}^+ = A^+\mathbf{b}\) is the minimum norm (\(\Vert \mathbf{x} \Vert_2\)) least-squares solution to \(A\mathbf{x} = \mathbf{b}\).
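A short NumPy sketch (random matrices used only as stand-in data) that checks the four Penrose conditions and the minimum-norm least-squares property via `np.linalg.pinv` and `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # m=3 < n=5; full row rank with probability 1
b = rng.standard_normal(3)

A_plus = np.linalg.pinv(A)        # Moore-Penrose pseudoinverse, computed via SVD

# The four Penrose conditions (up to floating-point error)
assert np.allclose(A @ A_plus @ A, A)
assert np.allclose(A_plus @ A @ A_plus, A_plus)
assert np.allclose((A @ A_plus).T, A @ A_plus)
assert np.allclose((A_plus @ A).T, A_plus @ A)

# Minimum-norm least-squares solution of Ax = b
x_plus = A_plus @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x_plus, x_lstsq)
```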
Matrix Norms
Operator Norm (Induced Norm):
\[\Vert A \Vert_p = \sup_{\mathbf{x} \ne \mathbf{0}} \frac{\Vert A\mathbf{x} \Vert_p}{\Vert \mathbf{x} \Vert_p} = \sup_{\Vert \mathbf{x} \Vert_p = 1} \Vert A\mathbf{x} \Vert_p\]
Spectral Norm (\(p=2\)): \(\Vert A \Vert_2 = \sigma_{\max}(A)\) (largest singular value of \(A\))
\[\Vert A \Vert_2 = \sqrt{\lambda_{\max}(A^T A)}\]
where \(\lambda_{\max}(M)\) is the largest eigenvalue of \(M\).
General property: \(\Vert AB \Vert_p \le \Vert A \Vert_p \Vert B \Vert_p\) (submultiplicativity for induced norms).
Frobenius Norm:
\[\Vert A \Vert_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n \vert A_{ij} \vert^2} = \sqrt{\text{tr}(A^T A)}\]
- Properties:
- \(\Vert A \Vert_F^2 = \sum_{i=1}^{\min(m,n)} \sigma_i^2(A)\) (sum of squares of singular values of \(A\))
- Submultiplicative: \(\Vert AB \Vert_F \le \Vert A \Vert_F \Vert B \Vert_F\)
- \[\Vert A \Vert_2 \le \Vert A \Vert_F \le \sqrt{\text{rank}(A)} \Vert A \Vert_2\]
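The spectral and Frobenius norms, and the inequality between them listed above, can be verified in a few lines (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

spectral = np.linalg.norm(A, 2)        # largest singular value
frobenius = np.linalg.norm(A, 'fro')   # sqrt of the sum of squared entries
sigma = np.linalg.svd(A, compute_uv=False)

assert np.isclose(spectral, sigma[0])
assert np.isclose(frobenius, np.sqrt(np.sum(sigma**2)))

# ||A||_2 <= ||A||_F <= sqrt(rank(A)) ||A||_2
r = np.linalg.matrix_rank(A)
assert spectral <= frobenius + 1e-12
assert frobenius <= np.sqrt(r) * spectral + 1e-12
```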
3. Linear Systems, Vector Spaces & Subspaces
Linear Transformations
Definition. Linear Transformation
A transformation (or mapping) \(T: V \to W\) from a vector space \(V\) to a vector space \(W\) is linear if for all vectors \(\mathbf{u}, \mathbf{v} \in V\) and all scalars \(c\):
- \(T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})\) (Additivity)
- \(T(c\mathbf{u}) = cT(\mathbf{u})\) (Homogeneity)
Any linear transformation \(T: \mathbb{R}^n \to \mathbb{R}^m\) can be represented by matrix multiplication \(T(\mathbf{x}) = A\mathbf{x}\) for a unique matrix \(A \in \mathbb{R}^{m \times n}\). The columns of \(A\) are the images of the standard basis vectors of \(\mathbb{R}^n\) under \(T\).
Systems of Linear Equations
- A system of linear equations can be written as \(A\mathbf{x} = \mathbf{b}\), where \(A \in \mathbb{R}^{m \times n}\) is the coefficient matrix, \(\mathbf{x} \in \mathbb{R}^n\) is the vector of unknowns, and \(\mathbf{b} \in \mathbb{R}^m\) is the constant vector.
- A solution exists if and only if \(\mathbf{b} \in \text{Col}(A)\) (the column space of \(A\)). Equivalently, \(\text{rank}(A) = \text{rank}([A \mid \mathbf{b}])\).
- If a solution exists:
- It is unique if and only if \(\text{Null}(A) = \{\mathbf{0}\}\) (i.e., columns of \(A\) are linearly independent). This implies \(n \le m\) and \(\text{rank}(A) = n\).
- If \(A\) is square (\(m=n\)) and invertible (\(\det(A) \ne 0\)), there is a unique solution \(\mathbf{x} = A^{-1}\mathbf{b}\).
- If there are free variables (i.e., \(\text{nullity}(A) > 0\)), there are infinitely many solutions. The general solution is \(\mathbf{x}_p + \mathbf{x}_h\), where \(\mathbf{x}_p\) is a particular solution and \(\mathbf{x}_h\) is any solution to the homogeneous equation \(A\mathbf{x}=\mathbf{0}\).
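In practice these cases map onto standard solvers; a minimal sketch (example matrices are illustrative only) using `np.linalg.solve` for the square invertible case and `np.linalg.lstsq` for an overdetermined system:

```python
import numpy as np

# Square, invertible system: unique solution x = A^{-1} b
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)           # preferred over forming A^{-1} explicitly
assert np.allclose(A @ x, b)

# Overdetermined system (m > n): least-squares solution via lstsq
A_tall = np.array([[1.0, 1.0],
                   [1.0, 2.0],
                   [1.0, 3.0]])
b_tall = np.array([1.0, 2.0, 2.0])
x_ls, residuals, rank, sv = np.linalg.lstsq(A_tall, b_tall, rcond=None)
print(x, x_ls)
```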
Vector Space Axioms
Definition. Vector Space
A set \(V\) equipped with two operations, vector addition (\(+\)) and scalar multiplication (\(\cdot\)), is a vector space over a field \(\mathbb{F}\) (typically \(\mathbb{R}\) or \(\mathbb{C}\)) if it satisfies the following axioms for all vectors \(\mathbf{u}, \mathbf{v}, \mathbf{w} \in V\) and scalars \(c, d \in \mathbb{F}\):
- \(\mathbf{u} + \mathbf{v} \in V\) (Closure under addition)
- \(\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}\) (Commutativity of addition)
- \((\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})\) (Associativity of addition)
- There exists a zero vector \(\mathbf{0} \in V\) such that \(\mathbf{v} + \mathbf{0} = \mathbf{v}\) for all \(\mathbf{v} \in V\) (Additive identity)
- For every \(\mathbf{v} \in V\), there exists an additive inverse \(-\mathbf{v} \in V\) such that \(\mathbf{v} + (-\mathbf{v}) = \mathbf{0}\)
- \(c\mathbf{v} \in V\) (Closure under scalar multiplication)
- \(c(\mathbf{u} + \mathbf{v}) = c\mathbf{u} + c\mathbf{v}\) (Distributivity of scalar multiplication over vector addition)
- \((c+d)\mathbf{v} = c\mathbf{v} + d\mathbf{v}\) (Distributivity of scalar multiplication over field addition)
- \(c(d\mathbf{v}) = (cd)\mathbf{v}\) (Associativity of scalar multiplication)
- \(1\mathbf{v} = \mathbf{v}\) (Scalar multiplicative identity, where \(1\) is the multiplicative identity in \(\mathbb{F}\))
Key Concepts for Vector Spaces
- Linear Combination: A vector \(\mathbf{w} = c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \dots + c_k\mathbf{v}_k\) where \(c_i\) are scalars.
- Span: The set of all possible linear combinations of a set of vectors \(\{\mathbf{v}_1, \dots, \mathbf{v}_k\}\). Denoted \(\text{span}\{\mathbf{v}_1, \dots, \mathbf{v}_k\}\).
- Linear Independence: A set of vectors \(\{\mathbf{v}_1, \dots, \mathbf{v}_k\}\) is linearly independent if the only solution to \(c_1\mathbf{v}_1 + \dots + c_k\mathbf{v}_k = \mathbf{0}\) is \(c_1 = c_2 = \dots = c_k = 0\).
- Basis: A linearly independent set of vectors that spans the entire vector space. All bases for a given vector space have the same number of vectors.
- Dimension: The number of vectors in any basis for a vector space. Denoted \(\dim(V)\).
- Subspace: A subset of a vector space that is itself a vector space under the same operations (i.e., it is closed under vector addition and scalar multiplication, and contains the zero vector).
Fundamental Subspaces of a Matrix \(A \in \mathbb{R}^{m \times n}\)
| Subspace Name | Definition | Space | Dimension | Orthogonal Complement |
| --- | --- | --- | --- | --- |
| Column Space (Image, Range) | \(\text{Col}(A) = \{ A\mathbf{x} \mid \mathbf{x} \in \mathbb{R}^n \}\) | \(\mathbb{R}^m\) | \(\text{rank}(A)\) | Left Null Space (\(\text{Null}(A^T)\)) |
| Null Space (Kernel) | \(\text{Null}(A) = \{ \mathbf{x} \in \mathbb{R}^n \mid A\mathbf{x} = \mathbf{0} \}\) | \(\mathbb{R}^n\) | \(\text{nullity}(A) = n - \text{rank}(A)\) | Row Space (\(\text{Row}(A)\)) |
| Row Space (\(\text{Col}(A^T)\)) | \(\text{Row}(A) = \{ A^T\mathbf{y} \mid \mathbf{y} \in \mathbb{R}^m \}\) | \(\mathbb{R}^n\) | \(\text{rank}(A)\) | Null Space (\(\text{Null}(A)\)) |
| Left Null Space (\(\text{Null}(A^T)\)) | \(\text{Null}(A^T) = \{ \mathbf{y} \in \mathbb{R}^m \mid A^T\mathbf{y} = \mathbf{0} \}\) | \(\mathbb{R}^m\) | \(m - \text{rank}(A)\) | Column Space (\(\text{Col}(A)\)) |
- Properties of Rank: For matrices \(A\) and \(B\) (dimensions allowing products/sums):
- \[0 \le \text{rank}(A) \le \min(m, n)\]
- \[\text{rank}(AB) \le \min(\text{rank}(A), \text{rank}(B))\]
- \[\text{rank}(A+B) \le \text{rank}(A) + \text{rank}(B)\]
- \[\text{rank}(A) = \text{rank}(A^T) = \text{rank}(A^T A) = \text{rank}(AA^T)\]
Theorem. Rank-Nullity Theorem
For any matrix \(A \in \mathbb{R}^{m \times n}\):
\[\text{rank}(A) + \text{nullity}(A) = n \quad (\text{number of columns of } A)\]
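A quick numerical check of the Rank-Nullity Theorem on a deliberately rank-deficient matrix (illustrative data; the nullity is read off from the numerically zero singular values):

```python
import numpy as np

# The third column is the sum of the first two, so rank(A) = 2.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])   # 4 x 3

m, n = A.shape
rank = np.linalg.matrix_rank(A)

# Nullity = dim Null(A): count the (numerically) zero singular values,
# plus any extra null directions when m < n.
_, s, Vt = np.linalg.svd(A)
tol = max(m, n) * np.finfo(float).eps * s[0]
nullity = np.sum(s < tol) + (n - len(s))

assert rank + nullity == n   # Rank-Nullity Theorem
```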
Orthogonal Complements
- Two subspaces \(V\) and \(W\) of \(\mathbb{R}^k\) are orthogonal complements if every vector in \(V\) is orthogonal to every vector in \(W\), and \(V \oplus W = \mathbb{R}^k\) (their direct sum spans \(\mathbb{R}^k\) and their intersection is \(\{\mathbf{0}\}\)). Denoted \(W = V^\perp\).
- The table above shows the orthogonal complement relationships for the fundamental subspaces. For example, \(\text{Col}(A)^\perp = \text{Null}(A^T)\) in \(\mathbb{R}^m\), and \(\text{Row}(A)^\perp = \text{Null}(A)\) in \(\mathbb{R}^n\). This implies \(\mathbb{R}^m = \text{Col}(A) \oplus \text{Null}(A^T)\) and \(\mathbb{R}^n = \text{Row}(A) \oplus \text{Null}(A)\).
4. Determinants
- Definition: A scalar value associated with a square matrix \(A \in \mathbb{R}^{n \times n}\), denoted \(\det(A)\) or \(\vert A \vert\).
- For \(n=1\), \(A = [a_{11}]\), \(\det(A) = a_{11}\).
- For \(n=2\), \(A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\), \(\det(A) = ad - bc\).
For \(n > 2\), often defined recursively using cofactor expansion along any row \(i\) (or column \(j\)):
\[\det(A) = \sum_{j=1}^n (-1)^{i+j} A_{ij} M_{ij} \quad (\text{expansion along row } i)\]
where \(M_{ij}\) is the determinant of the submatrix obtained by deleting row \(i\) and column \(j\) (the \((i,j)\)-minor).
- Properties:
- \(\det(I) = 1\).
- If \(A\) has a row or column of zeros, \(\det(A) = 0\).
- If two rows or two columns of \(A\) are identical, \(\det(A) = 0\).
- If \(B\) is obtained from \(A\) by swapping two rows (or two columns), then \(\det(B) = -\det(A)\).
- If \(B\) is obtained from \(A\) by multiplying a single row (or column) by a scalar \(c\), then \(\det(B) = c \cdot \det(A)\).
- Consequently, for \(A \in \mathbb{R}^{n \times n}\), \(\det(cA) = c^n \det(A)\).
- If \(B\) is obtained from \(A\) by adding a multiple of one row to another row (or column to column), then \(\det(B) = \det(A)\).
- \(\det(AB) = \det(A)\det(B)\) for square matrices \(A, B\).
- \(\det(A^T) = \det(A)\).
- \(A\) is invertible (non-singular) if and only if \(\det(A) \ne 0\).
- If \(A\) is invertible, \(\det(A^{-1}) = 1/\det(A) = (\det(A))^{-1}\).
- For a triangular matrix (upper or lower), \(\det(A)\) is the product of its diagonal entries.
- Geometric Interpretation: For a matrix \(A \in \mathbb{R}^{n \times n}\), \(\vert \det(A) \vert\) is the factor by which the linear transformation represented by \(A\) scales \(n\)-dimensional volume. The sign of \(\det(A)\) indicates whether the transformation preserves or reverses orientation.
Adjoint Matrix and Cramer’s Rule
- Adjoint (or Adjugate) Matrix: The adjoint of \(A\), denoted \(\text{adj}(A)\) or \(\text{Adj}(A)\), is the transpose of the cofactor matrix of \(A\). That is, \((\text{adj}(A))_{ij} = C_{ji} = (-1)^{j+i} M_{ji}\).
- Property: \(A \cdot \text{adj}(A) = \text{adj}(A) \cdot A = \det(A)I\).
- If \(\det(A) \ne 0\), then \(A^{-1} = \frac{1}{\det(A)} \text{adj}(A)\).
Cramer’s Rule: For an invertible matrix \(A\), the unique solution to \(A\mathbf{x} = \mathbf{b}\) is given by:
\[x_i = \frac{\det(A_i)}{\det(A)}\]
where \(A_i\) is the matrix formed by replacing the \(i\)-th column of \(A\) with the vector \(\mathbf{b}\). Cramer’s rule is computationally inefficient for large systems but is theoretically important.
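A small, illustrative implementation of Cramer's rule (the helper name `cramer_solve` is ours, not from the text); it is meant to mirror the formula, not to compete with `np.linalg.solve`:

```python
import numpy as np

def cramer_solve(A, b):
    """Solve Ax = b for invertible A via Cramer's rule (illustrative; far slower than np.linalg.solve)."""
    det_A = np.linalg.det(A)
    if np.isclose(det_A, 0.0):
        raise ValueError("Matrix is singular; Cramer's rule does not apply.")
    n = A.shape[0]
    x = np.empty(n)
    for i in range(n):
        A_i = A.copy()
        A_i[:, i] = b                         # replace the i-th column of A with b
        x[i] = np.linalg.det(A_i) / det_A
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
assert np.allclose(cramer_solve(A, b), np.linalg.solve(A, b))
```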
5. Eigenvalues and Eigenvectors
- Definition: For a square matrix \(A \in \mathbb{R}^{n \times n}\), a non-zero vector \(\mathbf{v} \in \mathbb{C}^n\) is an eigenvector of \(A\) if \(A\mathbf{v} = \lambda\mathbf{v}\) for some scalar \(\lambda \in \mathbb{C}\). The scalar \(\lambda\) is the corresponding eigenvalue.
- Characteristic Equation: Eigenvalues are the roots of the characteristic polynomial \(p(\lambda) = \det(A - \lambda I) = 0\). This is a polynomial in \(\lambda\) of degree \(n\).
- Properties:
- An \(n \times n\) matrix \(A\) has \(n\) eigenvalues in \(\mathbb{C}\), counting multiplicities.
- Sum of eigenvalues: \(\sum_{i=1}^n \lambda_i = \text{tr}(A)\)
- Product of eigenvalues: \(\prod_{i=1}^n \lambda_i = \det(A)\)
- Eigenvectors corresponding to distinct eigenvalues are linearly independent.
- If \(A\) is a triangular matrix, its eigenvalues are its diagonal entries.
- \(A\) and \(A^T\) have the same eigenvalues (but generally different eigenvectors).
- If \(\lambda\) is an eigenvalue of an invertible matrix \(A\), then \(1/\lambda\) is an eigenvalue of \(A^{-1}\). The corresponding eigenvector is the same.
- If \(\lambda\) is an eigenvalue of \(A\), then \(\lambda^k\) is an eigenvalue of \(A^k\) for any integer \(k \ge 0\). The corresponding eigenvector is the same.
- The set of all eigenvectors corresponding to an eigenvalue \(\lambda\), along with the zero vector, forms a subspace called the eigenspace \(E_\lambda = \text{Null}(A - \lambda I)\).
- Cayley-Hamilton Theorem: Every square matrix satisfies its own characteristic equation. If \(p(\lambda) = c_n \lambda^n + \dots + c_1 \lambda + c_0\) is the characteristic polynomial of \(A\), then \(p(A) = c_n A^n + \dots + c_1 A + c_0 I = 0\).
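The trace/determinant identities and the Cayley-Hamilton theorem are easy to verify numerically; a sketch with a random matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))

eigvals = np.linalg.eigvals(A)

# Sum of eigenvalues equals the trace; product equals the determinant.
assert np.isclose(eigvals.sum(), np.trace(A))
assert np.isclose(np.prod(eigvals), np.linalg.det(A))

# Cayley-Hamilton: A satisfies its own characteristic polynomial.
coeffs = np.poly(A)   # characteristic polynomial coefficients, leading term first
p_of_A = sum(c * np.linalg.matrix_power(A, len(coeffs) - 1 - k)
             for k, c in enumerate(coeffs))
assert np.allclose(p_of_A, np.zeros_like(A), atol=1e-8)
```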
For Symmetric Matrices (\(A = A^T\), \(A \in \mathbb{R}^{n \times n}\))
- All eigenvalues are real numbers.
- Eigenvectors corresponding to distinct eigenvalues are orthogonal.
- A real symmetric matrix is always orthogonally diagonalizable: there exists an orthogonal matrix \(Q\) whose columns are orthonormal eigenvectors of \(A\), and a diagonal matrix \(\Lambda\) whose diagonal entries are the corresponding eigenvalues, such that \(A = Q\Lambda Q^T\). This is known as the Spectral Theorem.
6. Matrix Decompositions
These are ways to factorize a matrix into a product of other matrices with useful properties.
Theorem. Spectral Theorem for Real Symmetric Matrices
If \(A \in \mathbb{R}^{n \times n}\) is a symmetric matrix (i.e., \(A = A^T\)), then there exists an orthogonal matrix \(Q \in \mathbb{R}^{n \times n}\) (whose columns are orthonormal eigenvectors of \(A\)) and a real diagonal matrix \(\Lambda \in \mathbb{R}^{n \times n}\) (whose diagonal entries are the eigenvalues of \(A\) corresponding to the columns of \(Q\)) such that:
\[A = Q \Lambda Q^T\]
This decomposition allows \(A\) to be expressed as a sum of rank-1 outer products:
\[A = \sum_{i=1}^n \lambda_i \mathbf{q}_i \mathbf{q}_i^T\]
where \(\mathbf{q}_i\) are the orthonormal eigenvectors (columns of \(Q\)) and \(\lambda_i\) are the corresponding eigenvalues (diagonal entries of \(\Lambda\)).
Theorem. Singular Value Decomposition (SVD)
For any matrix \(A \in \mathbb{R}^{m \times n}\) (not necessarily square or symmetric), there exist orthogonal matrices \(U \in \mathbb{R}^{m \times m}\) and \(V \in \mathbb{R}^{n \times n}\), and a real diagonal matrix \(\Sigma \in \mathbb{R}^{m \times n}\) (meaning only \((\Sigma)_{ii}\) can be non-zero) with non-negative real numbers \(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0\) on its “diagonal” (where \(r = \text{rank}(A)\), and \(\sigma_i = 0\) for \(i > r\)), such that:
\[A = U \Sigma V^T\]
- The columns of \(U\) are left singular vectors (orthonormal eigenvectors of \(AA^T\)).
- The columns of \(V\) are right singular vectors (orthonormal eigenvectors of \(A^T A\)).
- The diagonal entries of \(\Sigma\) (\(\sigma_i\)) are the singular values of \(A\). They are the square roots of the non-zero eigenvalues of both \(A^T A\) and \(AA^T\): \(\sigma_i(A) = \sqrt{\lambda_i(A^T A)} = \sqrt{\lambda_i(AA^T)}\).

The SVD provides a way to write \(A\) as a sum of rank-1 matrices (the full sum goes up to \(\min(m,n)\), but terms with \(\sigma_i=0\) vanish):
\[A = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T\]
where \(\mathbf{u}_i\) is the \(i\)-th column of \(U\) and \(\mathbf{v}_i\) is the \(i\)-th column of \(V\).
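A minimal NumPy sketch (random data) that reconstructs \(A\) from its rank-1 SVD terms and confirms that the singular values are the square roots of the eigenvalues of \(A^T A\):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U is 5x3, Vt is 3x3

# Rebuild A as the sum of rank-1 outer products sigma_i u_i v_i^T.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(A, A_rebuilt)

# Singular values are the square roots of the eigenvalues of A^T A (and of A A^T).
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(s, np.sqrt(np.clip(eig_AtA, 0.0, None)))
```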
Other Useful Decompositions
- LU Decomposition: For many square matrices \(A\), one can find a factorization \(A = LU\) (or \(PA=LU\) including permutations \(P\) for stability/existence), where \(L\) is a lower triangular matrix (often with 1s on its diagonal, making it unit lower triangular) and \(U\) is an upper triangular matrix. This is commonly used to solve systems \(A\mathbf{x}=\mathbf{b}\) efficiently via forward and backward substitution.
- QR Decomposition: For any \(A \in \mathbb{R}^{m \times n}\), it can be factored as \(A = QR\), where \(Q \in \mathbb{R}^{m \times m}\) is an orthogonal matrix and \(R \in \mathbb{R}^{m \times n}\) is an upper triangular matrix.
- If \(m \ge n\) (tall or square matrix), a “thin” or “reduced” QR decomposition is often used: \(A = Q_1 R_1\), where \(Q_1 \in \mathbb{R}^{m \times n}\) has orthonormal columns and \(R_1 \in \mathbb{R}^{n \times n}\) is upper triangular. The columns of \(Q_1\) form an orthonormal basis for \(\text{Col}(A)\) if \(A\) has full column rank. This decomposition can be computed by applying the Gram-Schmidt process to the columns of \(A\).
- Cholesky Decomposition: For a real, symmetric, positive definite matrix \(A\), there exists a unique lower triangular matrix \(L\) with strictly positive diagonal entries such that \(A = LL^T\). (Alternatively, \(A=R^T R\) where \(R\) is upper triangular, \(R=L^T\)). This is often used for solving linear systems with symmetric positive definite coefficient matrices and in statistics (e.g., sampling from multivariate normal distributions).
- Diagonalization (\(A = PDP^{-1}\)): An \(n \times n\) matrix \(A\) is diagonalizable if and only if it has \(n\) linearly independent eigenvectors. In this case, \(A = PDP^{-1}\), where \(P\) is an invertible matrix whose columns are the eigenvectors of \(A\), and \(D\) is a diagonal matrix whose diagonal entries are the corresponding eigenvalues.
- A matrix is diagonalizable if and only if for every eigenvalue \(\lambda\), its geometric multiplicity (dimension of the eigenspace \(E_\lambda\)) equals its algebraic multiplicity (multiplicity as a root of the characteristic polynomial).
- If an \(n \times n\) matrix has \(n\) distinct eigenvalues, it is diagonalizable.
- Symmetric matrices are a special case where \(P\) can be chosen to be orthogonal (\(Q\)), so \(A=Q\Lambda Q^T\).
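Several of the decompositions above are available directly in NumPy; a brief sketch (random input, for illustration) showing reduced QR, Cholesky of a symmetric positive definite matrix, and orthogonal diagonalization of a symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))

# Reduced QR: Q1 has orthonormal columns, R1 is upper triangular.
Q1, R1 = np.linalg.qr(A)          # 'reduced' mode is the default
assert np.allclose(Q1.T @ Q1, np.eye(3))
assert np.allclose(A, Q1 @ R1)

# Cholesky of a symmetric positive definite matrix (A^T A is SPD when A has full column rank).
S = A.T @ A
L = np.linalg.cholesky(S)         # lower triangular with positive diagonal
assert np.allclose(S, L @ L.T)

# Diagonalization of a symmetric matrix: S = Q Λ Q^T with orthogonal Q.
eigvals, Q = np.linalg.eigh(S)
assert np.allclose(S, Q @ np.diag(eigvals) @ Q.T)
```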
7. Positive Definite and Semi-definite Matrices
These definitions apply to symmetric matrices \(A \in \mathbb{R}^{n \times n}\). The quadratic form associated with \(A\) is \(\mathbf{x}^T A \mathbf{x}\).
Positive Definite (\(A \succ 0\)):
\[\mathbf{x}^T A \mathbf{x} > 0 \quad \text{for all non-zero vectors } \mathbf{x} \in \mathbb{R}^n\]
- Equivalent conditions:
- All eigenvalues of \(A\) are strictly positive (\(\lambda_i > 0\) for all \(i\)).
- All leading principal minors of \(A\) are strictly positive. (A leading principal minor of order \(k\) is the determinant of the top-left \(k \times k\) submatrix).
- Cholesky decomposition \(A=LL^T\) exists where \(L\) is a lower triangular matrix with strictly positive diagonal entries.
Positive Semi-definite (\(A \succeq 0\)):
\[\mathbf{x}^T A \mathbf{x} \ge 0 \quad \text{for all vectors } \mathbf{x} \in \mathbb{R}^n\]
- Equivalent conditions:
- All eigenvalues of \(A\) are non-negative (\(\lambda_i \ge 0\) for all \(i\)).
- All principal minors of \(A\) are non-negative. (A principal minor is the determinant of a submatrix obtained by deleting the same set of rows and columns).
- Negative Definite (\(A \prec 0\)): \(\mathbf{x}^T A \mathbf{x} < 0\) for all non-zero \(\mathbf{x}\). (Equivalent to \(-A \succ 0\), all eigenvalues \(<0\)).
- Negative Semi-definite (\(A \preceq 0\)): \(\mathbf{x}^T A \mathbf{x} \le 0\) for all \(\mathbf{x}\). (Equivalent to \(-A \succeq 0\), all eigenvalues \(\le 0\)).
- Indefinite: The matrix \(A\) is neither positive semi-definite nor negative semi-definite. This means the quadratic form \(\mathbf{x}^T A \mathbf{x}\) can take both positive and negative values. (Equivalent to \(A\) having at least one positive eigenvalue and at least one negative eigenvalue).
- Connection to Convexity: For a twice-differentiable function \(f: \mathbb{R}^n \to \mathbb{R}\), its Hessian matrix \(\nabla^2 f(\mathbf{x})\) being positive semi-definite (positive definite) over a convex domain implies \(f\) is convex (strictly convex) over that domain. This is fundamental in optimization.
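Two common numerical tests for positive definiteness of a symmetric matrix, via eigenvalues and via attempting a Cholesky factorization (the helper names below are ours, for illustration):

```python
import numpy as np

def is_positive_definite(A, tol=1e-12):
    """Check positive definiteness via eigenvalues (assumes A is symmetric)."""
    return bool(np.all(np.linalg.eigvalsh(A) > tol))

def is_positive_definite_chol(A):
    """Cheaper check: a Cholesky factorization exists iff symmetric A is positive definite."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # eigenvalues 1 and 3 -> positive definite
B = np.array([[1.0, 2.0], [2.0, 1.0]])     # eigenvalues -1 and 3 -> indefinite
assert is_positive_definite(A) and is_positive_definite_chol(A)
assert not is_positive_definite(B) and not is_positive_definite_chol(B)
```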
8. Inner Product Spaces
Definition. Inner Product
An inner product on a real vector space \(V\) is a function \(\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}\) that for all vectors \(\mathbf{u}, \mathbf{v}, \mathbf{w} \in V\) and any scalar \(c \in \mathbb{R}\) satisfies the following properties:
- Symmetry: \(\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle\)
Linearity in the first argument:
\[\langle c\mathbf{u} + \mathbf{v}, \mathbf{w} \rangle = c\langle \mathbf{u}, \mathbf{w} \rangle + \langle \mathbf{v}, \mathbf{w} \rangle\]
(Linearity in the second argument follows from symmetry: \(\langle \mathbf{u}, c\mathbf{v} + \mathbf{w} \rangle = c\langle \mathbf{u}, \mathbf{v} \rangle + \langle \mathbf{u}, \mathbf{w} \rangle\).)
- Positive-definiteness: \(\langle \mathbf{v}, \mathbf{v} \rangle \ge 0\), and \(\langle \mathbf{v}, \mathbf{v} \rangle = 0 \iff \mathbf{v} = \mathbf{0}\)

The standard dot product in \(\mathbb{R}^n\) (\(\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T \mathbf{v}\)) is an example of an inner product. Another example is the weighted inner product \(\langle \mathbf{u}, \mathbf{v} \rangle_W = \mathbf{u}^T W \mathbf{v}\), where \(W\) is a symmetric positive definite matrix. An inner product induces a norm defined by \(\Vert \mathbf{v} \Vert = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}\).
Cauchy-Schwarz Inequality (general form):
\[\vert \langle \mathbf{u}, \mathbf{v} \rangle \vert \le \Vert \mathbf{u} \Vert \Vert \mathbf{v} \Vert\]Equality holds if and only if \(\mathbf{u}\) and \(\mathbf{v}\) are linearly dependent.
- Orthogonality: Two vectors \(\mathbf{u}\) and \(\mathbf{v}\) are orthogonal with respect to the inner product if \(\langle \mathbf{u}, \mathbf{v} \rangle = 0\).
- Orthonormal Basis: A basis \(\{\mathbf{q}_1, \dots, \mathbf{q}_n\}\) for an \(n\)-dimensional inner product space is orthonormal if \(\langle \mathbf{q}_i, \mathbf{q}_j \rangle = \delta_{ij}\) (Kronecker delta: 1 if \(i=j\), 0 if \(i \ne j\)).
- Gram-Schmidt Process: An algorithm that takes any basis for an inner product space and constructs an orthonormal basis for that space.
Parseval’s Identity: If \(\{\mathbf{q}_1, \dots, \mathbf{q}_n\}\) is an orthonormal basis for an inner product space \(V\), then for any vector \(\mathbf{v} \in V\):
\[\mathbf{v} = \sum_{i=1}^n \langle \mathbf{v}, \mathbf{q}_i \rangle \mathbf{q}_i \quad \text{and} \quad \Vert \mathbf{v} \Vert^2 = \sum_{i=1}^n \vert \langle \mathbf{v}, \mathbf{q}_i \rangle \vert^2\]
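A compact sketch of the classical Gram-Schmidt process followed by a numerical check of Parseval's identity (the function name `gram_schmidt` is ours; in practice, modified Gram-Schmidt or a QR factorization is preferred for numerical stability):

```python
import numpy as np

def gram_schmidt(V):
    """Classical Gram-Schmidt: orthonormalize the columns of V (assumed linearly independent)."""
    Q = np.zeros_like(V, dtype=float)
    for j in range(V.shape[1]):
        q = V[:, j].astype(float)
        for i in range(j):
            q = q - (Q[:, i] @ V[:, j]) * Q[:, i]   # subtract projections onto earlier q_i
        Q[:, j] = q / np.linalg.norm(q)
    return Q

rng = np.random.default_rng(5)
V = rng.standard_normal((4, 4))
Q = gram_schmidt(V)
assert np.allclose(Q.T @ Q, np.eye(4))             # columns form an orthonormal basis

# Parseval: ||v||^2 equals the sum of squared coefficients in an orthonormal basis.
v = rng.standard_normal(4)
coeffs = Q.T @ v                                   # <v, q_i> for each basis vector
assert np.isclose(np.linalg.norm(v)**2, np.sum(coeffs**2))
assert np.allclose(Q @ coeffs, v)                  # v = sum_i <v, q_i> q_i
```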
9. Matrix Calculus
Common derivatives using the denominator (gradient) layout convention: the result has the same dimensions as the variable being differentiated with respect to (e.g., differentiating a scalar w.r.t. a column vector \(\mathbf{x}\) yields a column vector).
| Function \(f\) | Derivative | Notes |
| --- | --- | --- |
| \(\mathbf{a}^T \mathbf{x}\) (or \(\mathbf{x}^T \mathbf{a}\)) | \(\frac{\partial f}{\partial \mathbf{x}} = \mathbf{a}\) | \(\mathbf{x}, \mathbf{a} \in \mathbb{R}^n\). Result is \(n \times 1\). |
| \(\mathbf{x}^T A \mathbf{x}\) | \(\frac{\partial f}{\partial \mathbf{x}} = (A + A^T)\mathbf{x}\) | \(A \in \mathbb{R}^{n \times n}\). If \(A\) is symmetric, \(2A\mathbf{x}\). Result is \(n \times 1\). |
| \(\Vert A\mathbf{x} - \mathbf{b} \Vert_2^2\) | \(\frac{\partial f}{\partial \mathbf{x}} = 2A^T(A\mathbf{x} - \mathbf{b})\) | \(A \in \mathbb{R}^{m \times n}\), \(\mathbf{x} \in \mathbb{R}^n\), \(\mathbf{b} \in \mathbb{R}^m\). Result is \(n \times 1\). |
| \(\text{tr}(AX)\) | \(\frac{\partial f}{\partial X} = A^T\) | \(X \in \mathbb{R}^{k \times l}\), \(A \in \mathbb{R}^{l \times k}\). Result is \(k \times l\) (shape of \(X\)). |
| \(\text{tr}(XA)\) | \(\frac{\partial f}{\partial X} = A^T\) | \(X \in \mathbb{R}^{k \times l}\), \(A \in \mathbb{R}^{l \times k}\). Result is \(k \times l\) (shape of \(X\)). |
| \(\text{tr}(AXB)\) | \(\frac{\partial f}{\partial X} = A^T B^T\) | \(X \in \mathbb{R}^{k \times l}\), \(A \in \mathbb{R}^{p \times k}\), \(B \in \mathbb{R}^{l \times p}\). Result is \(k \times l\) (shape of \(X\)). |
| \(\text{tr}(X^T A X)\) | \(\frac{\partial f}{\partial X} = (A+A^T)X\) | \(X \in \mathbb{R}^{k \times l}\), \(A \in \mathbb{R}^{k \times k}\). If \(A\) is symmetric, \(2AX\). Result is \(k \times l\) (shape of \(X\)). |
| \(\log \det(X)\) | \(\frac{\partial f}{\partial X} = (X^{-1})^T = X^{-T}\) | \(X \in \mathbb{R}^{n \times n}\) positive definite. If \(X\) is symmetric, \(X^{-1}\). Result is \(n \times n\). |
| \(\det(X)\) (Jacobi’s Formula) | \(\frac{\partial f}{\partial X} = \det(X) (X^{-1})^T = \text{adj}(X)^T\) | \(X \in \mathbb{R}^{n \times n}\) invertible. If \(X\) is symmetric, \(\det(X)X^{-1}\). Result is \(n \times n\). |
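These gradients can be sanity-checked against central finite differences; a minimal sketch for the first two rows of the table (the helper `numerical_gradient` is ours, illustrative only):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(6)
n = 4
A = rng.standard_normal((n, n))
a = rng.standard_normal(n)
x = rng.standard_normal(n)

# d/dx (a^T x) = a
assert np.allclose(numerical_gradient(lambda z: a @ z, x), a, atol=1e-5)
# d/dx (x^T A x) = (A + A^T) x
assert np.allclose(numerical_gradient(lambda z: z @ A @ z, x), (A + A.T) @ x, atol=1e-5)
```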
10. Miscellaneous Identities and Facts
Woodbury Matrix Identity (Matrix Inversion Lemma): Allows efficient computation of the inverse of a rank-\(k\) corrected matrix:
\[(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}\]
where \(A \in \mathbb{R}^{n \times n}\), \(U \in \mathbb{R}^{n \times k}\), \(C \in \mathbb{R}^{k \times k}\), \(V \in \mathbb{R}^{k \times n}\). Assumes \(A\), \(C\), and \((C^{-1} + VA^{-1}U)\) are invertible.
Sherman-Morrison Formula (rank-1 update): A special case of the Woodbury identity where \(k=1\) (i.e., \(U=\mathbf{u}\), \(V=\mathbf{v}^T\), \(C=1\)):
\[(A + \mathbf{u}\mathbf{v}^T)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\mathbf{v}^T A^{-1}}{1 + \mathbf{v}^T A^{-1}\mathbf{u}}\]
Assumes \(A\) is invertible and \(1 + \mathbf{v}^T A^{-1}\mathbf{u} \ne 0\).
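A quick numerical verification of the Sherman-Morrison formula against a direct inverse (random, well-conditioned data for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned, invertible
u = rng.standard_normal(n)
v = rng.standard_normal(n)

A_inv = np.linalg.inv(A)
denom = 1.0 + v @ A_inv @ u
assert not np.isclose(denom, 0.0)

# Sherman-Morrison: rank-1 update of the inverse without refactorizing A.
direct = np.linalg.inv(A + np.outer(u, v))
sherman_morrison = A_inv - np.outer(A_inv @ u, v @ A_inv) / denom
assert np.allclose(direct, sherman_morrison)
```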
- Projection Matrix:
The matrix that orthogonally projects vectors onto the column space \(\text{Col}(A)\) of a matrix \(A \in \mathbb{R}^{m \times n}\) with linearly independent columns (i.e., \(\text{rank}(A)=n\)) is:
\[P_A = A(A^T A)^{-1} A^T\]
- For any \(\mathbf{b} \in \mathbb{R}^m\), \(P_A\mathbf{b}\) is the vector in \(\text{Col}(A)\) closest to \(\mathbf{b}\).
- Properties of orthogonal projection matrices: \(P_A^2 = P_A\) (idempotent) and \(P_A^T = P_A\) (symmetric).
- Relationship between \(A^TA\) and \(AA^T\) for \(A \in \mathbb{R}^{m \times n}\):
- Both \(A^TA \in \mathbb{R}^{n \times n}\) and \(AA^T \in \mathbb{R}^{m \times m}\) are symmetric and positive semi-definite.
- They have the same non-zero eigenvalues. Their non-zero eigenvalues are the squares of the non-zero singular values of \(A\).
- \(\text{rank}(A) = \text{rank}(A^TA) = \text{rank}(AA^T)\).
- If \(A\) has full column rank (\(\text{rank}(A)=n \le m\)), then \(A^TA\) is positive definite (and thus invertible).
- If \(A\) has full row rank (\(\text{rank}(A)=m \le n\)), then \(AA^T\) is positive definite (and thus invertible).
This list is intended as a compact reference. For deeper understanding, proofs, and further topics, consult standard linear algebra textbooks or Wikipedia.