Cheat Sheet: Multivariable Calculus for Optimization
A concise review of essential multivariable calculus concepts vital for understanding mathematical optimization, including partial derivatives, gradients, Hessians, and Taylor series.
Multivariable Calculus: Key Concepts for Optimization
This guide provides a concise overview of essential multivariable calculus concepts crucial for understanding optimization algorithms in machine learning. We primarily consider functions \(f: \mathbb{R}^n \to \mathbb{R}\) (scalar-valued) and \(\mathbf{F}: \mathbb{R}^n \to \mathbb{R}^m\) (vector-valued). Let \(\mathbf{x} = (x_1, x_2, \dots, x_n)^T \in \mathbb{R}^n\).
1. Partial Derivatives
Definition. Partial Derivative
For a function \(f(x_1, \dots, x_n)\), the partial derivative of \(f\) with respect to \(x_i\) at a point \(\mathbf{a} = (a_1, \dots, a_n)\) is the derivative of the single-variable function \(g(x_i) = f(a_1, \dots, a_{i-1}, x_i, a_{i+1}, \dots, a_n)\) at \(x_i = a_i\). It is denoted as:
\[\frac{\partial f}{\partial x_i}(\mathbf{a}) \quad \text{or} \quad f_{x_i}(\mathbf{a}) \quad \text{or} \quad \partial_i f(\mathbf{a})\]This is calculated by treating all variables other than \(x_i\) as constants and differentiating with respect to \(x_i\).
- Higher-Order Partial Derivatives: These are obtained by repeatedly applying partial differentiation. For example, \(\frac{\partial^2 f}{\partial x_j \partial x_i}\) means first differentiating with respect to \(x_i\), then with respect to \(x_j\).
Theorem. Clairaut’s Theorem (Symmetry of Mixed Partial Derivatives)
If the second-order partial derivatives \(\frac{\partial^2 f}{\partial x_j \partial x_i}\) and \(\frac{\partial^2 f}{\partial x_i \partial x_j}\) are continuous in a neighborhood of a point \(\mathbf{a}\), then they are equal at \(\mathbf{a}\):
\[\frac{\partial^2 f}{\partial x_j \partial x_i}(\mathbf{a}) = \frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{a})\]
This theorem is fundamental for the symmetry of the Hessian matrix.
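As a quick numerical sanity check, the following Python sketch estimates both mixed partials of an assumed test function \(f(x, y) = x^2 y + \sin(xy)\) with nested central differences; the function, evaluation point, and step size are illustrative choices, not part of the theorem.

```python
import numpy as np

# Assumed test function (illustrative only): f(x, y) = x^2 * y + sin(x * y)
def f(x, y):
    return x**2 * y + np.sin(x * y)

def d_dx(g, x, y, h=1e-4):
    """Central-difference estimate of the partial derivative of g w.r.t. x."""
    return (g(x + h, y) - g(x - h, y)) / (2 * h)

def d_dy(g, x, y, h=1e-4):
    """Central-difference estimate of the partial derivative of g w.r.t. y."""
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

x0, y0 = 1.3, -0.7
f_xy = d_dy(lambda u, v: d_dx(f, u, v), x0, y0)  # differentiate in x, then in y
f_yx = d_dx(lambda u, v: d_dy(f, u, v), x0, y0)  # differentiate in y, then in x
print(f_xy, f_yx)  # nearly equal, as Clairaut's theorem predicts
```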
2. Gradient
Definition. Gradient
For a scalar-valued function \(f: \mathbb{R}^n \to \mathbb{R}\) that is differentiable at \(\mathbf{x}\), its gradient is the vector of its partial derivatives:
\[\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\mathbf{x}) \\ \frac{\partial f}{\partial x_2}(\mathbf{x}) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\mathbf{x}) \end{bmatrix} \in \mathbb{R}^n\]It is sometimes denoted as \(\text{grad } f\).
- Geometric Interpretation:
- The gradient \(\nabla f(\mathbf{x})\) points in the direction of the steepest ascent of \(f\) at \(\mathbf{x}\).
- The magnitude \(\Vert \nabla f(\mathbf{x}) \Vert_2\) is the rate of increase in that direction.
- The gradient \(\nabla f(\mathbf{x})\) is orthogonal to the level set (or level surface) of \(f\) passing through \(\mathbf{x}\). A level set is \(\{\mathbf{y} \in \mathbb{R}^n \mid f(\mathbf{y}) = c\}\) for some constant \(c = f(\mathbf{x})\).
Tip. Gradient and Optimization
In optimization, we often seek to minimize a function \(f\). The negative gradient, \(-\nabla f(\mathbf{x})\), points in the direction of steepest descent, which is the basis for gradient descent algorithms.
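As a minimal sketch of this idea, the snippet below runs fixed-step gradient descent on an assumed quadratic \(f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}\) with gradient \(\nabla f(\mathbf{x}) = A\mathbf{x} - \mathbf{b}\); the matrix \(A\), vector \(\mathbf{b}\), step size, and iteration count are illustrative choices.

```python
import numpy as np

# Assumed quadratic objective: f(x) = 0.5 * x^T A x - b^T x, with A symmetric
# positive definite, so the unique minimizer is x* = A^{-1} b.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_f(x):
    return A @ x - b  # gradient of the quadratic above

x = np.zeros(2)
eta = 0.1                        # fixed step size (illustrative)
for _ in range(200):
    x = x - eta * grad_f(x)      # move along the steepest-descent direction

print(x, np.linalg.solve(A, b))  # the iterate should approach A^{-1} b
```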
3. Directional Derivatives
Definition. Directional Derivative
For a scalar-valued function \(f: \mathbb{R}^n \to \mathbb{R}\), the directional derivative of \(f\) at \(\mathbf{x}\) in the direction of a unit vector \(\mathbf{u} \in \mathbb{R}^n\) (\(\Vert \mathbf{u} \Vert_2 = 1\)) is:
\[D_{\mathbf{u}} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{u}) - f(\mathbf{x})}{h}\]provided the limit exists. This measures the rate of change of \(f\) at \(\mathbf{x}\) along the direction \(\mathbf{u}\).
Theorem. Directional Derivative using Gradient
If \(f\) is differentiable at \(\mathbf{x}\), then the directional derivative in the direction of a unit vector \(\mathbf{u}\) is given by the dot product of the gradient and \(\mathbf{u}\):
\[D_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u} = \nabla f(\mathbf{x})^T \mathbf{u}\]
Writing the dot product in terms of the angle \(\theta\) between \(\nabla f(\mathbf{x})\) and \(\mathbf{u}\), \(D_{\mathbf{u}} f(\mathbf{x}) = \Vert \nabla f(\mathbf{x}) \Vert_2 \Vert \mathbf{u} \Vert_2 \cos \theta = \Vert \nabla f(\mathbf{x}) \Vert_2 \cos \theta\). This is maximized when \(\mathbf{u}\) points in the same direction as \(\nabla f(\mathbf{x})\) (\(\theta = 0\)), consistent with the bound given by the Cauchy-Schwarz inequality.
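The identity is easy to check numerically. The sketch below compares the limit definition against the gradient formula for an assumed function \(f(x, y) = x^2 + 3xy\) and an assumed unit direction; all concrete values are illustrative.

```python
import numpy as np

# Assumed test function f(x, y) = x^2 + 3xy and its analytic gradient.
def f(x):
    return x[0]**2 + 3 * x[0] * x[1]

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x0 = np.array([1.0, 2.0])
u = np.array([3.0, 4.0]) / 5.0             # a unit vector: ||u||_2 = 1

h = 1e-6
limit_form = (f(x0 + h * u) - f(x0)) / h   # the limit definition, at small h
dot_form = grad_f(x0) @ u                  # the gradient formula: grad f . u
print(limit_form, dot_form)                # both ~ 7.2
```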
4. Jacobian Matrix
Definition. Jacobian Matrix
For a vector-valued function \(\mathbf{F}: \mathbb{R}^n \to \mathbb{R}^m\), where \(\mathbf{F}(\mathbf{x}) = (f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_m(\mathbf{x}))^T\) and each \(f_i: \mathbb{R}^n \to \mathbb{R}\) is differentiable, the Jacobian matrix \(J_{\mathbf{F}}(\mathbf{x})\) (or \(\frac{\partial \mathbf{F}}{\partial \mathbf{x}}\) or \(D\mathbf{F}(\mathbf{x})\)) is an \(m \times n\) matrix defined as:
\[J_{\mathbf{F}}(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{x}) & \frac{\partial f_1}{\partial x_2}(\mathbf{x}) & \dots & \frac{\partial f_1}{\partial x_n}(\mathbf{x}) \\ \frac{\partial f_2}{\partial x_1}(\mathbf{x}) & \frac{\partial f_2}{\partial x_2}(\mathbf{x}) & \dots & \frac{\partial f_2}{\partial x_n}(\mathbf{x}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(\mathbf{x}) & \frac{\partial f_m}{\partial x_2}(\mathbf{x}) & \dots & \frac{\partial f_m}{\partial x_n}(\mathbf{x}) \end{bmatrix}\]The \(i\)-th row of \(J_{\mathbf{F}}(\mathbf{x})\) is the transpose of the gradient of the \(i\)-th component function: \((\nabla f_i(\mathbf{x}))^T\).
If \(f: \mathbb{R}^n \to \mathbb{R}\) (i.e., \(m=1\)), then its Jacobian is a \(1 \times n\) row vector:
\[J_f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\mathbf{x}) & \frac{\partial f}{\partial x_2}(\mathbf{x}) & \dots & \frac{\partial f}{\partial x_n}(\mathbf{x}) \end{bmatrix} = (\nabla f(\mathbf{x}))^T\]
The Jacobian matrix gives the best linear approximation of \(\mathbf{F}\) near \(\mathbf{x}\): \(\mathbf{F}(\mathbf{x} + \Delta\mathbf{x}) \approx \mathbf{F}(\mathbf{x}) + J_{\mathbf{F}}(\mathbf{x})\, \Delta\mathbf{x}\) for small \(\Delta\mathbf{x}\).
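The sketch below approximates a Jacobian column by column with forward differences; the vector-valued function \(\mathbf{F}: \mathbb{R}^2 \to \mathbb{R}^3\), the point, and the step size are illustrative assumptions.

```python
import numpy as np

# Assumed F: R^2 -> R^3 (illustrative only).
def F(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[0] + x[1]**2])

def jacobian_fd(F, x, h=1e-6):
    """Approximate the m x n Jacobian of F at x with forward differences."""
    fx = F(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (F(x + e) - fx) / h  # column j holds the partials w.r.t. x_j
    return J

x0 = np.array([1.0, 2.0])
print(jacobian_fd(F, x0))
# Compare with the analytic Jacobian [[x2, x1], [cos(x1), 0], [1, 2*x2]]
# at (1, 2): [[2, 1], [0.5403..., 0], [1, 4]].
```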
5. Hessian Matrix
Definition. Hessian Matrix
For a scalar-valued function \(f: \mathbb{R}^n \to \mathbb{R}\) whose second-order partial derivatives exist, the Hessian matrix \(H_f(\mathbf{x})\) (or \(\nabla^2 f(\mathbf{x})\)) is an \(n \times n\) matrix of these second-order partial derivatives:
\[(H_f(\mathbf{x}))_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x})\]Explicitly:
\[H_f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(\mathbf{x}) & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(\mathbf{x}) \\ \frac{\partial^2 f}{\partial x_2 \partial x_1}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_2^2}(\mathbf{x}) & \dots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(\mathbf{x}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_n \partial x_2}(\mathbf{x}) & \dots & \frac{\partial^2 f}{\partial x_n^2}(\mathbf{x}) \end{bmatrix}\]If the second-order partial derivatives are continuous, then by Clairaut’s Theorem, the Hessian matrix is symmetric: \(H_f(\mathbf{x}) = H_f(\mathbf{x})^T\).
The Hessian describes the local curvature of \(f\).
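A Hessian can likewise be estimated numerically. The sketch below uses a standard central-difference stencil on an assumed function \(f(x, y) = x^3 + x y^2\); the function, point, and step size are illustrative.

```python
import numpy as np

# Assumed test function (illustrative only): f(x, y) = x^3 + x * y^2.
def f(x):
    return x[0]**3 + x[0] * x[1]**2

def hessian_fd(f, x, h=1e-4):
    """Approximate the n x n Hessian of f at x with central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)
    return H

x0 = np.array([1.0, 2.0])
print(hessian_fd(f, x0))
# Analytic Hessian [[6*x1, 2*x2], [2*x2, 2*x1]] at (1, 2): [[6, 4], [4, 2]],
# symmetric, as Clairaut's theorem guarantees for this smooth f.
```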
6. Multivariable Chain Rule
The chain rule allows differentiation of composite functions.
Case 1: If \(z = f(x_1, \dots, x_n)\) and each \(x_i = g_i(t)\) is a differentiable function of a single variable \(t\), then \(z\) is a differentiable function of \(t\), and:
\[\frac{dz}{dt} = \sum_{i=1}^n \frac{\partial f}{\partial x_i} \frac{dx_i}{dt} = \nabla f(\mathbf{x}(t))^T \frac{d\mathbf{x}}{dt}(t)\]where \(\mathbf{x}(t) = (g_1(t), \dots, g_n(t))^T\).
Case 2 (General Form using Jacobians): If \(\mathbf{g}: \mathbb{R}^k \to \mathbb{R}^n\) and \(\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m\) are differentiable functions, and \(\mathbf{h}(\mathbf{s}) = \mathbf{f}(\mathbf{g}(\mathbf{s}))\) where \(\mathbf{s} \in \mathbb{R}^k\), then \(\mathbf{h}: \mathbb{R}^k \to \mathbb{R}^m\) is differentiable, and its Jacobian matrix is:
\[J_{\mathbf{h}}(\mathbf{s}) = J_{\mathbf{f}}(\mathbf{g}(\mathbf{s})) J_{\mathbf{g}}(\mathbf{s})\]This is a product of an \(m \times n\) matrix and an \(n \times k\) matrix, resulting in an \(m \times k\) matrix, as expected.
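The sketch below checks this Jacobian product numerically for assumed maps \(\mathbf{g}: \mathbb{R}^1 \to \mathbb{R}^2\) and \(\mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^2\); both maps and the evaluation point are illustrative choices.

```python
import numpy as np

# Assumed maps (illustrative only): g: R^1 -> R^2 and f: R^2 -> R^2.
def g(s):
    return np.array([s[0]**2, np.sin(s[0])])

def f(x):
    return np.array([x[0] * x[1], x[0] + x[1]])

def jacobian_fd(F, x, h=1e-6):
    """Forward-difference Jacobian, as in the previous sketch."""
    fx = F(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (F(x + e) - fx) / h
    return J

s0 = np.array([0.8])
lhs = jacobian_fd(lambda s: f(g(s)), s0)          # Jacobian of the composition
rhs = jacobian_fd(f, g(s0)) @ jacobian_fd(g, s0)  # product J_f(g(s)) J_g(s)
print(lhs, rhs, sep="\n")                         # the two should agree closely
```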
7. Taylor Series (Multivariable)
Taylor series provide polynomial approximations of a function near a point. For \(f: \mathbb{R}^n \to \mathbb{R}\) sufficiently differentiable at \(\mathbf{a} \in \mathbb{R}^n\), its Taylor expansion around \(\mathbf{a}\) for a point \(\mathbf{x}\) near \(\mathbf{a}\) (let \(\mathbf{x} - \mathbf{a} = \Delta \mathbf{x}\)) is:
First-Order Taylor Approximation (Linear Approximation):
\[f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T (\mathbf{x} - \mathbf{a})\]This defines the tangent hyperplane to \(f\) at \(\mathbf{a}\).
Second-Order Taylor Approximation (Quadratic Approximation):
\[f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^T H_f(\mathbf{a}) (\mathbf{x} - \mathbf{a})\]This approximation is crucial for Newton’s method in optimization and for analyzing the nature of critical points. Each term is a scalar: the function value, an inner product, and a quadratic form.
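The sketch below compares both approximations against the true value for an assumed function \(f(x, y) = e^x \cos y\) expanded around \(\mathbf{a} = (0, 0)\), where the gradient and Hessian are easy to compute by hand; the function and displacement are illustrative.

```python
import numpy as np

# Assumed test function (illustrative only): f(x, y) = exp(x) * cos(y).
def f(x):
    return np.exp(x[0]) * np.cos(x[1])

a = np.zeros(2)
grad_a = np.array([1.0, 0.0])          # analytic gradient of f at a = (0, 0)
H_a = np.array([[1.0, 0.0],
                [0.0, -1.0]])          # analytic Hessian of f at a

dx = np.array([0.1, 0.2])              # displacement x - a
first = f(a) + grad_a @ dx             # linear (tangent-hyperplane) model
second = first + 0.5 * dx @ H_a @ dx   # quadratic model
print(f(a + dx), first, second)        # ~1.0831, 1.1, 1.085
```

As expected, the quadratic model tracks the true value more closely than the linear one.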
Notation. Higher-Order Terms
The full Taylor series can be written as:
\[f(\mathbf{x}) = \sum_{k=0}^{\infty} \frac{1}{k!} \left( ((\mathbf{x}-\mathbf{a}) \cdot \nabla)^k f \right) (\mathbf{a})\]where \(((\mathbf{x}-\mathbf{a}) \cdot \nabla)^k\) means applying the operator \(\sum_{i=1}^n (x_i-a_i) \frac{\partial}{\partial x_i}\) \(k\) times. The \(k=0\) summand is \(f(\mathbf{a})\); the \(k=1\) summand is \(\nabla f(\mathbf{a})^T (\mathbf{x}-\mathbf{a})\); the \(k=2\) summand, including its \(\frac{1}{2!}\) factor, is \(\frac{1}{2}(\mathbf{x}-\mathbf{a})^T H_f(\mathbf{a}) (\mathbf{x}-\mathbf{a})\). Truncating after order \(k\) and adding a remainder term \(R_k(\mathbf{x})\) turns the approximation into an equality.
8. Implicit Function Theorem
Theorem. Implicit Function Theorem
Consider a system of \(m\) equations in \(n+m\) variables:
\[\mathbf{F}(\mathbf{x}, \mathbf{y}) = \mathbf{0}\]where \(\mathbf{F}: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m\) (so \(\mathbf{x} \in \mathbb{R}^n, \mathbf{y} \in \mathbb{R}^m\)), and \(\mathbf{0}\) is the zero vector in \(\mathbb{R}^m\). Suppose \((\mathbf{x}_0, \mathbf{y}_0)\) is a point such that \(\mathbf{F}(\mathbf{x}_0, \mathbf{y}_0) = \mathbf{0}\). If \(\mathbf{F}\) is continuously differentiable in a neighborhood of \((\mathbf{x}_0, \mathbf{y}_0)\), and the Jacobian matrix of \(\mathbf{F}\) with respect to \(\mathbf{y}\), denoted \(J_{\mathbf{F},\mathbf{y}}\), is invertible at \((\mathbf{x}_0, \mathbf{y}_0)\):
\[\det \left( \frac{\partial \mathbf{F}}{\partial \mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) \right) \ne 0\](where \(\frac{\partial \mathbf{F}}{\partial \mathbf{y}}\) is the \(m \times m\) matrix of partial derivatives of components of \(\mathbf{F}\) with respect to components of \(\mathbf{y}\)), then there exists a neighborhood \(U\) of \(\mathbf{x}_0\) in \(\mathbb{R}^n\) and a unique continuously differentiable function \(\mathbf{g}: U \to \mathbb{R}^m\) such that \(\mathbf{y}_0 = \mathbf{g}(\mathbf{x}_0)\) and
\[\mathbf{F}(\mathbf{x}, \mathbf{g}(\mathbf{x})) = \mathbf{0} \quad \text{for all } \mathbf{x} \in U\]In other words, the system implicitly defines \(\mathbf{y}\) as a function of \(\mathbf{x}\) near \((\mathbf{x}_0, \mathbf{y}_0)\). Furthermore, the Jacobian of \(\mathbf{g}\) at \(\mathbf{x}_0\) is given by:
\[J_{\mathbf{g}}(\mathbf{x}_0) = - \left[ J_{\mathbf{F},\mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) \right]^{-1} J_{\mathbf{F},\mathbf{x}}(\mathbf{x}_0, \mathbf{y}_0)\]where \(J_{\mathbf{F},\mathbf{x}}\) is the Jacobian of \(\mathbf{F}\) with respect to \(\mathbf{x}\).
This theorem is fundamental in analyzing sensitivities, constrained optimization (e.g., deriving properties of Lagrange multipliers), and when variables are implicitly defined.
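As a concrete sketch with one equation in two variables (\(n = m = 1\)), consider \(F(x, y) = x^2 + y^2 - 1 = 0\), the unit circle; near any point with \(\partial F / \partial y \ne 0\) it defines \(y = g(x)\) with \(g'(x) = -F_x / F_y\). The code below checks this formula against the explicit branch \(g(x) = \sqrt{1 - x^2}\); the point is an illustrative choice.

```python
import numpy as np

# Implicit curve F(x, y) = x^2 + y^2 - 1 = 0 near the point (0.6, 0.8),
# where F_y = 2*y = 1.6 != 0, so the theorem applies.
x0, y0 = 0.6, 0.8
F_x, F_y = 2 * x0, 2 * y0          # partial derivatives of F at (x0, y0)
g_prime = -F_x / F_y               # the theorem's formula: g'(x0) = -F_x / F_y

# Compare with differentiating the explicit branch g(x) = sqrt(1 - x^2):
g = lambda x: np.sqrt(1 - x**2)
h = 1e-6
fd = (g(x0 + h) - g(x0 - h)) / (2 * h)
print(g_prime, fd)                 # both ~ -0.75
```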
9. Unconstrained Optimization: Critical Points and Second Derivative Test
For a differentiable function \(f: \mathbb{R}^n \to \mathbb{R}\).
Definition. Critical Point
A point \(\mathbf{x}^\ast \in \mathbb{R}^n\) is a critical point (or stationary point) of \(f\) if its gradient is zero:
\[\nabla f(\mathbf{x}^\ast) = \mathbf{0}\]Local extrema (minima or maxima) can only occur at critical points (if \(f\) is differentiable everywhere) or at boundary points of the domain.
Theorem. Second Derivative Test for Local Extrema
Let \(f: \mathbb{R}^n \to \mathbb{R}\) be twice continuously differentiable in a neighborhood of a critical point \(\mathbf{x}^\ast\) (i.e., \(\nabla f(\mathbf{x}^\ast) = \mathbf{0}\)). Let \(H_f(\mathbf{x}^\ast)\) be the Hessian matrix of \(f\) evaluated at \(\mathbf{x}^\ast\).
- If \(H_f(\mathbf{x}^\ast)\) is positive definite (all eigenvalues \(> 0\)), then \(f\) has a local minimum at \(\mathbf{x}^\ast\).
- If \(H_f(\mathbf{x}^\ast)\) is negative definite (all eigenvalues \(< 0\)), then \(f\) has a local maximum at \(\mathbf{x}^\ast\).
- If \(H_f(\mathbf{x}^\ast)\) has both positive and negative eigenvalues (i.e., it is indefinite), then \(f\) has a saddle point at \(\mathbf{x}^\ast\).
- If \(H_f(\mathbf{x}^\ast)\) is positive or negative semi-definite but not definite (at least one eigenvalue is \(0\), and the remaining eigenvalues all share one sign or are zero), the test is inconclusive. Higher-order tests or other methods are needed.
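The sketch below applies this test by inspecting Hessian eigenvalues; the example Hessian is that of the assumed saddle \(f(x, y) = x^2 - y^2\) at its critical point \((0, 0)\).

```python
import numpy as np

# Assumed example: f(x, y) = x^2 - y^2 has a critical point at the origin
# with Hessian diag(2, -2), which is indefinite.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
eig = np.linalg.eigvalsh(H)        # eigvalsh handles symmetric matrices

if np.all(eig > 0):
    print("positive definite Hessian -> local minimum")
elif np.all(eig < 0):
    print("negative definite Hessian -> local maximum")
elif np.any(eig > 0) and np.any(eig < 0):
    print("indefinite Hessian -> saddle point")   # printed for this example
else:
    print("semi-definite Hessian -> test inconclusive")
```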
10. Convexity and the Hessian
Convexity is a crucial property in optimization, often guaranteeing that local minima are global minima.
Proposition. Hessian and Convexity
Let \(f: \mathbb{R}^n \to \mathbb{R}\) be twice continuously differentiable on a convex set \(S \subseteq \mathbb{R}^n\).
- \(f\) is convex on \(S\) if and only if its Hessian matrix \(H_f(\mathbf{x})\) is positive semi-definite for all \(\mathbf{x} \in S\).
- If \(H_f(\mathbf{x})\) is positive definite for all \(\mathbf{x} \in S\), then \(f\) is strictly convex on \(S\). (The converse is not necessarily true; a strictly convex function can have a Hessian that is only positive semi-definite, e.g., \(f(x)=x^4\) at \(x=0\)).
- Analogously, \(f\) is concave (strictly concave) on \(S\) if and only if \(H_f(\mathbf{x})\) is negative semi-definite (negative definite) for all \(\mathbf{x} \in S\).
Definition. Convex Function
A function \(f: S \to \mathbb{R}\) defined on a convex set \(S \subseteq \mathbb{R}^n\) is convex if for all \(\mathbf{x}, \mathbf{y} \in S\) and for all \(\theta \in [0, 1]\):
\[f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \le \theta f(\mathbf{x}) + (1-\theta)f(\mathbf{y})\]Geometrically, the line segment connecting any two points on the graph of \(f\) lies on or above the graph. It is strictly convex if the inequality is strict for \(\mathbf{x} \ne \mathbf{y}\) and \(\theta \in (0,1)\).
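A minimal sketch of the Hessian criterion, assuming the convex test function \(f(x, y) = x^4 + y^2\): its Hessian is positive semi-definite everywhere (and only semi-definite along \(x = 0\), echoing the caveat in the proposition above), which the code checks at randomly sampled points. Sampling is evidence, not a proof of convexity.

```python
import numpy as np

# Assumed test function (illustrative only): f(x, y) = x^4 + y^2, with
# analytic Hessian diag(12 x^2, 2) -- positive semi-definite everywhere.
def hessian(p):
    x, _ = p
    return np.array([[12 * x**2, 0.0],
                     [0.0, 2.0]])

rng = np.random.default_rng(0)
points = rng.uniform(-5.0, 5.0, size=(1000, 2))
psd_at_samples = all(np.all(np.linalg.eigvalsh(hessian(p)) >= 0) for p in points)
print(psd_at_samples)  # True at every sample, consistent with convexity
```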
This guide covers foundational multivariable calculus concepts essential for optimization. Further topics like Lagrange multipliers (for constrained optimization) and vector calculus (divergence, curl, line/surface integrals) build upon these ideas but are beyond this immediate scope.