Cheat Sheet: Multivariable Calculus for Optimization
A concise review of essential multivariable calculus concepts vital for understanding mathematical optimization, including partial derivatives, gradients, Hessians, and Taylor series.
Multivariable Calculus: Key Concepts for Optimization
This guide provides a concise overview of essential multivariable calculus concepts crucial for understanding optimization algorithms in machine learning. We primarily consider functions
\[ f: \mathbb{R}^n \to \mathbb{R} \](scalar-valued) and
\[ \mathbf{F}: \mathbb{R}^n \to \mathbb{R}^m \](vector-valued). Let
\[ \mathbf{x} = (x_1, \dots, x_n)^T \in \mathbb{R}^n \].
1. Partial Derivatives
Definition. Partial Derivative
For a function
\[ f(x_1, \dots, x_n) \], the partial derivative of
\[ f \]with respect to
\[ x_i \]at a point
\[ \mathbf{a} = (a_1, \dots, a_n) \]is the derivative of the single-variable function
\[ g(x_i) = f(a_1, \dots, a_{i-1}, x_i, a_{i+1}, \dots, a_n) \]at
\[ x_i = a_i \]. It is denoted as:
\[ \frac{\partial f}{\partial x_i}(\mathbf{a}) \quad \text{or} \quad f_{x_i}(\mathbf{a}) \quad \text{or} \quad \partial_i f(\mathbf{a}) \]This is calculated by treating all variables other than
\[ x_i \]as constants and differentiating with respect to
\[ x_i \].
- Higher-Order Partial Derivatives: Can be taken by repeatedly applying partial differentiation. For example,
\[ \frac{\partial^2 f}{\partial x_j \partial x_i} = \frac{\partial}{\partial x_j} \left( \frac{\partial f}{\partial x_i} \right) \]means first differentiating with respect to
\[ x_i \], then with respect to
\[ x_j \].
Theorem. Clairaut’s Theorem (Symmetry of Mixed Partial Derivatives)
If the second-order partial derivatives
\[ \frac{\partial^2 f}{\partial x_j \partial x_i} \]and
\[ \frac{\partial^2 f}{\partial x_i \partial x_j} \]are continuous in a neighborhood of a point
\[ \mathbf{a} \], then they are equal at
\[ \mathbf{a} \]:
\[ \frac{\partial^2 f}{\partial x_j \partial x_i}(\mathbf{a}) = \frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{a}) \]
This theorem is fundamental for the symmetry of the Hessian matrix.
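As a quick numerical sanity check, partial derivatives can be approximated by central differences, and the symmetry of mixed partials can be observed for a smooth test function. This is an illustrative sketch; the helper names and the test function are assumptions, not from the text:

```python
def partial(f, x, i, h=1e-5):
    """Central-difference approximation of df/dx_i at the point x."""
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(xp) - f(xm)) / (2 * h)

def mixed_partial(f, x, i, j, h=1e-4):
    """Approximate d^2 f / (dx_j dx_i): differentiate w.r.t. x_i first, then x_j."""
    xp, xm = list(x), list(x)
    xp[j] += h
    xm[j] -= h
    return (partial(f, xp, i) - partial(f, xm, i)) / (2 * h)

# Smooth test function, so Clairaut's Theorem applies.
f = lambda x: x[0] ** 2 * x[1] + x[1] ** 3

print(partial(f, [1.0, 2.0], 0))           # df/dx0 = 2*x0*x1, approx 4
print(mixed_partial(f, [1.0, 2.0], 0, 1))  # d2f/(dx1 dx0) = 2*x0, approx 2
print(mixed_partial(f, [1.0, 2.0], 1, 0))  # equal by symmetry, approx 2
```

The two mixed-partial estimates agree up to discretization error, as the theorem predicts for this smooth function.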
2. Gradient
Definition. Gradient
For a scalar-valued function
\[ f: \mathbb{R}^n \to \mathbb{R} \]that is differentiable at
\[ \mathbf{x} \], its gradient is the vector of its partial derivatives:
\[ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\mathbf{x}) \\ \frac{\partial f}{\partial x_2}(\mathbf{x}) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\mathbf{x}) \end{bmatrix} \in \mathbb{R}^n \]It is sometimes denoted as
\[ \text{grad } f \].
- Geometric Interpretation:
- The gradient
\[ \nabla f(\mathbf{x}) \]points in the direction of the steepest ascent of
\[ f \]at
\[ \mathbf{x} \].
- The magnitude
\[ \Vert \nabla f(\mathbf{x}) \Vert_2 \]is the rate of increase in that direction.
- The gradient
\[ \nabla f(\mathbf{x}) \]is orthogonal to the level set (or level surface) of
\[ f \]passing through
\[ \mathbf{x} \]. A level set is
\[ \{ \mathbf{y} \in \mathbb{R}^n : f(\mathbf{y}) = c \} \]for some constant
\[ c \].
Tip. Gradient and Optimization
In optimization, we often seek to minimize a function
\[ f \]. The negative gradient,
\[ -\nabla f(\mathbf{x}) \], points in the direction of steepest descent, which is the basis for gradient descent algorithms.
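The descent idea above can be sketched in a few lines. This is a minimal illustration on a hypothetical quadratic; the objective, step size, and iteration count are illustrative choices, not from the text:

```python
# Minimal gradient descent sketch on f(x, y) = (x - 1)^2 + 2*(y + 3)^2,
# whose unique minimizer is (1, -3).
def grad_f(x, y):
    """Analytic gradient of the illustrative objective."""
    return 2 * (x - 1), 4 * (y + 3)

x, y, lr = 0.0, 0.0, 0.1
for _ in range(200):
    gx, gy = grad_f(x, y)
    # Step along -grad f, the direction of steepest descent.
    x, y = x - lr * gx, y - lr * gy

print(round(x, 4), round(y, 4))  # approaches the minimizer (1, -3)
```

Each coordinate contracts toward the minimizer geometrically here, since the objective is a separable quadratic and the step size is small enough.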
3. Directional Derivatives
Definition. Directional Derivative
For a scalar-valued function
\[ f: \mathbb{R}^n \to \mathbb{R} \], the directional derivative of
\[ f \]at
\[ \mathbf{x} \]in the direction of a unit vector
\[ \mathbf{u} \in \mathbb{R}^n \](
\[ \Vert \mathbf{u} \Vert_2 = 1 \]) is:
\[ D_{\mathbf{u}} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{u}) - f(\mathbf{x})}{h} \]provided the limit exists. This measures the rate of change of
\[ f \]at
\[ \mathbf{x} \]along the direction
\[ \mathbf{u} \].
Theorem. Directional Derivative using Gradient
If
\[ f \]is differentiable at
\[ \mathbf{x} \], then the directional derivative in the direction of a unit vector
\[ \mathbf{u} \]is given by the dot product of the gradient and
\[ \mathbf{u} \]:
\[ D_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u} = \nabla f(\mathbf{x})^T \mathbf{u} \]
Writing
\[ D_{\mathbf{u}} f(\mathbf{x}) = \Vert \nabla f(\mathbf{x}) \Vert_2 \cos\theta \], where
\[ \theta \]is the angle between
\[ \nabla f(\mathbf{x}) \]and
\[ \mathbf{u} \], the Cauchy-Schwarz inequality gives
\[ |D_{\mathbf{u}} f(\mathbf{x})| \le \Vert \nabla f(\mathbf{x}) \Vert_2 \]. The directional derivative is maximized when
\[ \mathbf{u} \]points in the same direction as
\[ \nabla f(\mathbf{x}) \](
\[ \theta = 0 \]).
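The gradient formula and the limit definition can be compared directly. A sketch with an illustrative function and direction (none of these choices come from the text):

```python
import math

# f(x, y) = x^2 + x*y, evaluated at the point (1, 2).
f = lambda x, y: x ** 2 + x * y
grad = (2 * 1 + 2, 1)                     # analytic grad f(1,2) = (2x + y, x) = (4, 1)
u = (1 / math.sqrt(2), 1 / math.sqrt(2))  # unit direction, ||u||_2 = 1

dot = grad[0] * u[0] + grad[1] * u[1]     # D_u f = grad f . u

h = 1e-6
limit = (f(1 + h * u[0], 2 + h * u[1]) - f(1, 2)) / h  # limit definition, small h
print(dot, limit)  # both approx 3.5355
```

The finite-difference quotient matches the dot-product value up to O(h), as the theorem guarantees for differentiable f.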
4. Jacobian Matrix
Definition. Jacobian Matrix
For a vector-valued function
\[ \mathbf{F}: \mathbb{R}^n \to \mathbb{R}^m \], where
\[ \mathbf{F}(\mathbf{x}) = (f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_m(\mathbf{x}))^T \]and each
\[ f_i: \mathbb{R}^n \to \mathbb{R} \]is differentiable, the Jacobian matrix
\[ J_{\mathbf{F}}(\mathbf{x}) \](or
\[ \frac{\partial \mathbf{F}}{\partial \mathbf{x}} \]or
\[ D\mathbf{F}(\mathbf{x}) \]) is an
\[ m \times n \]matrix defined as:
\[ J_{\mathbf{F}}(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{x}) & \frac{\partial f_1}{\partial x_2}(\mathbf{x}) & \dots & \frac{\partial f_1}{\partial x_n}(\mathbf{x}) \\ \frac{\partial f_2}{\partial x_1}(\mathbf{x}) & \frac{\partial f_2}{\partial x_2}(\mathbf{x}) & \dots & \frac{\partial f_2}{\partial x_n}(\mathbf{x}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(\mathbf{x}) & \frac{\partial f_m}{\partial x_2}(\mathbf{x}) & \dots & \frac{\partial f_m}{\partial x_n}(\mathbf{x}) \end{bmatrix} \]The
\[ i \]-th row of
\[ J_{\mathbf{F}}(\mathbf{x}) \]is the transpose of the gradient of the
\[ i \]-th component function:
\[ (\nabla f_i(\mathbf{x}))^T \].
- If
\[ m = 1 \](i.e.,
\[ \mathbf{F} = f: \mathbb{R}^n \to \mathbb{R} \]), then its Jacobian is a
\[ 1 \times n \]row vector:
\[ J_f(\mathbf{x}) = (\nabla f(\mathbf{x}))^T \].
The Jacobian matrix represents the best linear approximation of
\[ \mathbf{F} \]near
\[ \mathbf{x} \]:
\[ \mathbf{F}(\mathbf{x} + \Delta\mathbf{x}) \approx \mathbf{F}(\mathbf{x}) + J_{\mathbf{F}}(\mathbf{x}) \, \Delta\mathbf{x} \].
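A Jacobian can also be estimated column by column with central differences, which is a handy check against an analytic derivation. The function and helper names below are illustrative assumptions:

```python
import math

# Finite-difference Jacobian of F: R^2 -> R^2 with F(x, y) = (x*y, x + sin y).
# The analytic Jacobian is [[y, x], [1, cos y]].
def F(v):
    x, y = v
    return [x * y, x + math.sin(y)]

def jacobian(F, v, h=1e-6):
    """m x n Jacobian estimated one column (one input coordinate) at a time."""
    m = len(F(v))
    J = [[0.0] * len(v) for _ in range(m)]
    for j in range(len(v)):
        vp, vm = list(v), list(v)
        vp[j] += h
        vm[j] -= h
        Fp, Fm = F(vp), F(vm)
        for i in range(m):
            J[i][j] = (Fp[i] - Fm[i]) / (2 * h)
    return J

J = jacobian(F, [1.0, 0.0])
print(J)  # approx [[0, 1], [1, 1]] at (x, y) = (1, 0)
```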
5. Hessian Matrix
Definition. Hessian Matrix
For a scalar-valued function
\[ f: \mathbb{R}^n \to \mathbb{R} \]whose second-order partial derivatives exist, the Hessian matrix
\[ H_f(\mathbf{x}) \](or
\[ \nabla^2 f(\mathbf{x}) \]) is an
\[ n \times n \]matrix of these second-order partial derivatives:
\[ (H_f(\mathbf{x}))_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x}) \]Explicitly:
\[ H_f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(\mathbf{x}) & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(\mathbf{x}) \\ \frac{\partial^2 f}{\partial x_2 \partial x_1}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_2^2}(\mathbf{x}) & \dots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(\mathbf{x}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_n \partial x_2}(\mathbf{x}) & \dots & \frac{\partial^2 f}{\partial x_n^2}(\mathbf{x}) \end{bmatrix} \]If the second-order partial derivatives are continuous, then by Clairaut’s Theorem, the Hessian matrix is symmetric:
\[ H_f(\mathbf{x}) = H_f(\mathbf{x})^T \].
The Hessian describes the local curvature of
\[ f \].
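All second-order partials can be estimated at once with a double central difference, and the result should be (numerically) symmetric by Clairaut's Theorem. A sketch with an illustrative test function:

```python
# Finite-difference Hessian of a scalar f; names and test function are illustrative.
def hessian(f, v, h=1e-4):
    """Estimate all second partials; the result is symmetric up to numerical error."""
    n = len(v)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def bump(di, dj):
                w = list(v)
                w[i] += di
                w[j] += dj
                return f(w)
            # Double central difference for d^2 f / (dx_i dx_j).
            H[i][j] = (bump(h, h) - bump(h, -h) - bump(-h, h) + bump(-h, -h)) / (4 * h * h)
    return H

f = lambda v: v[0] ** 3 * v[1] + v[1] ** 2  # analytic Hessian [[6xy, 3x^2], [3x^2, 2]]
H = hessian(f, [1.0, 2.0])
print(H)  # approx [[12, 3], [3, 2]]
```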
6. Multivariable Chain Rule
The chain rule allows differentiation of composite functions.
- Case 1: If
\[ z = f(x_1, \dots, x_n) \]and each
\[ x_i \]is a differentiable function of a single variable
\[ t \], then
\[ z \]is a differentiable function of
\[ t \], and:
\[ \frac{dz}{dt} = \sum_{i=1}^n \frac{\partial f}{\partial x_i} \frac{dx_i}{dt} = \nabla f(\mathbf{x}(t))^T \frac{d\mathbf{x}}{dt}(t) \]
where
\[ \mathbf{x}(t) = (x_1(t), \dots, x_n(t))^T \].
- Case 2 (General Form using Jacobians): If
\[ \mathbf{g}: \mathbb{R}^p \to \mathbb{R}^n \]and
\[ \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m \]are differentiable functions, and
\[ \mathbf{h} = \mathbf{f} \circ \mathbf{g} \]where
\[ \mathbf{h}(\mathbf{s}) = \mathbf{f}(\mathbf{g}(\mathbf{s})) \], then
\[ \mathbf{h} \]is differentiable, and its Jacobian matrix is:
\[ J_{\mathbf{h}}(\mathbf{s}) = J_{\mathbf{f}}(\mathbf{g}(\mathbf{s})) J_{\mathbf{g}}(\mathbf{s}) \]
This is a product of an
\[ m \times n \]matrix and an
\[ n \times p \]matrix, resulting in an
\[ m \times p \]matrix, as expected.
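Case 1 can be verified numerically by comparing the chain-rule value with a direct derivative of the composition. The choices of f and x(t) below are illustrative, not from the text:

```python
import math

# Case 1 check: z = f(x1, x2) with x1 = cos t, x2 = sin t.
f = lambda x1, x2: x1 ** 2 + 3 * x2

t = 0.7
x1, x2 = math.cos(t), math.sin(t)
grad = (2 * x1, 3.0)                        # grad f at x(t)
dxdt = (-math.sin(t), math.cos(t))          # dx/dt
chain = grad[0] * dxdt[0] + grad[1] * dxdt[1]  # chain-rule value of dz/dt

# Direct central-difference derivative of the composed scalar function z(t).
h = 1e-6
z = lambda t: f(math.cos(t), math.sin(t))
direct = (z(t + h) - z(t - h)) / (2 * h)
print(chain, direct)  # the two values agree
```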
7. Taylor Series (Multivariable)
Taylor series provide polynomial approximations of a function near a point. For
\[ f: \mathbb{R}^n \to \mathbb{R} \]sufficiently differentiable at
\[ \mathbf{a} \in \mathbb{R}^n \], its Taylor expansion around
\[ \mathbf{a} \]for a point
\[ \mathbf{x} \]near
\[ \mathbf{a} \](let
\[ \Delta\mathbf{x} = \mathbf{x} - \mathbf{a} \]) is:
-
First-Order Taylor Approximation (Linear Approximation):
\[ f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T (\mathbf{x} - \mathbf{a}) \]This defines the tangent hyperplane to the graph of
\[ f \]at
\[ \mathbf{a} \].
-
Second-Order Taylor Approximation (Quadratic Approximation):
\[ f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T (\mathbf{x} - \mathbf{a}) + \frac{1}{2} (\mathbf{x} - \mathbf{a})^T H_f(\mathbf{a}) (\mathbf{x} - \mathbf{a}) \]This is crucial for Newton’s method in optimization and for analyzing the nature of critical points. The terms are: scalar + inner product (scalar) + quadratic form (scalar).
Notation. Higher-Order Terms
The full Taylor series can be written as:
\[ f(\mathbf{a} + \Delta\mathbf{x}) = \sum_{k=0}^{\infty} \frac{1}{k!} \left( (\Delta\mathbf{x} \cdot \nabla)^k f \right)(\mathbf{a}) \]where
\[ (\Delta\mathbf{x} \cdot \nabla)^k f \]is interpreted as applying the operator
\[ \Delta\mathbf{x} \cdot \nabla = \sum_{i=1}^n \Delta x_i \frac{\partial}{\partial x_i} \]
\[ k \]times to
\[ f \]. For
\[ k = 0 \], it's
\[ f(\mathbf{a}) \]. For
\[ k = 1 \], it's
\[ \nabla f(\mathbf{a})^T \Delta\mathbf{x} \]. For
\[ k = 2 \], it's
\[ \Delta\mathbf{x}^T H_f(\mathbf{a}) \Delta\mathbf{x} \]. When the series is truncated, a remainder term can be added to make the approximation an equality (Taylor's theorem with remainder).
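The quality of the first- and second-order approximations can be compared numerically. The function, expansion point, and displacement below are illustrative choices:

```python
import math

# Taylor approximations of f(x, y) = exp(x) * sin(y) around a = (0, 0).
# At a: f = 0, grad f = (0, 1), and the Hessian is [[0, 1], [1, 0]].
f = lambda x, y: math.exp(x) * math.sin(y)

dx, dy = 0.1, -0.2                       # displacement from the expansion point
linear = 0.0 + 0 * dx + 1 * dy           # f(a) + grad f(a)^T dx
quad = linear + 0.5 * (2 * 1 * dx * dy)  # + 1/2 dx^T H dx (only the xy entries are nonzero)
exact = f(dx, dy)
print(abs(linear - exact), abs(quad - exact))  # the quadratic error is smaller
```

As expected, the second-order approximation tracks the function more closely than the linear one for this small displacement.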
8. Implicit Function Theorem
Theorem. Implicit Function Theorem
Consider a system of
\[ m \]equations in
\[ n+m \]variables:
\[ \mathbf{F}(\mathbf{x}, \mathbf{y}) = \mathbf{0} \]where
\[ \mathbf{F}: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m \](so
\[ \mathbf{x} \in \mathbb{R}^n, \mathbf{y} \in \mathbb{R}^m \]), and
\[ \mathbf{0} \]is the zero vector in
\[ \mathbb{R}^m \]. Suppose
\[ (\mathbf{x}_0, \mathbf{y}_0) \]is a point such that
\[ \mathbf{F}(\mathbf{x}_0, \mathbf{y}_0) = \mathbf{0} \]. If
\[ \mathbf{F} \]is continuously differentiable in a neighborhood of
\[ (\mathbf{x}_0, \mathbf{y}_0) \], and the Jacobian matrix of
\[ \mathbf{F} \]with respect to
\[ \mathbf{y} \], denoted
\[ J_{\mathbf{F},\mathbf{y}} \], is invertible at
\[ (\mathbf{x}_0, \mathbf{y}_0) \]:
\[ \det \left( \frac{\partial \mathbf{F}}{\partial \mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) \right) \ne 0 \](where
\[ \frac{\partial \mathbf{F}}{\partial \mathbf{y}} \]is the
\[ m \times m \]matrix of partial derivatives of components of
\[ \mathbf{F} \]with respect to components of
\[ \mathbf{y} \]), then there exists a neighborhood
\[ U \]of
\[ \mathbf{x}_0 \]in
\[ \mathbb{R}^n \]and a unique continuously differentiable function
\[ \mathbf{g}: U \to \mathbb{R}^m \]such that
\[ \mathbf{y}_0 = \mathbf{g}(\mathbf{x}_0) \]and
\[ \mathbf{F}(\mathbf{x}, \mathbf{g}(\mathbf{x})) = \mathbf{0} \quad \text{for all } \mathbf{x} \in U \]In other words, the system implicitly defines
\[ \mathbf{y} \]as a function of
\[ \mathbf{x} \]near
\[ (\mathbf{x}_0, \mathbf{y}_0) \]. Furthermore, the Jacobian of
\[ \mathbf{g} \]at
\[ \mathbf{x}_0 \]is given by:
\[ J_{\mathbf{g}}(\mathbf{x}_0) = - \left[ J_{\mathbf{F},\mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) \right]^{-1} J_{\mathbf{F},\mathbf{x}}(\mathbf{x}_0, \mathbf{y}_0) \]where
\[ J_{\mathbf{F},\mathbf{x}} \]is the Jacobian of
\[ \mathbf{F} \]with respect to
\[ \mathbf{x} \].
This theorem is fundamental in analyzing sensitivities, constrained optimization (e.g., deriving properties of Lagrange multipliers), and when variables are implicitly defined.
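In the scalar case (n = m = 1) the theorem reduces to classical implicit differentiation. A sketch on the unit circle, where the point and the explicit branch used for checking are illustrative choices:

```python
import math

# F(x, y) = x^2 + y^2 - 1 = 0 implicitly defines y = g(x) near (x0, y0)
# whenever F_y = 2*y0 != 0, with g'(x0) = -F_x / F_y.
x0, y0 = 0.6, 0.8            # a point on the unit circle with y0 != 0
Fx, Fy = 2 * x0, 2 * y0      # partials of F at (x0, y0)
g_prime = -Fx / Fy           # implicit-function-theorem derivative, -x0/y0

# Check against the explicit upper branch g(x) = sqrt(1 - x^2).
h = 1e-6
g = lambda x: math.sqrt(1 - x * x)
numeric = (g(x0 + h) - g(x0 - h)) / (2 * h)
print(g_prime, numeric)  # both approx -0.75
```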
9. Unconstrained Optimization: Critical Points and Second Derivative Test
For a differentiable function
\[ f: \mathbb{R}^n \to \mathbb{R} \]:
Definition. Critical Point
A point
\[ \mathbf{x}^\ast \in \mathbb{R}^n \]is a critical point (or stationary point) of
\[ f \]if its gradient is zero:
\[ \nabla f(\mathbf{x}^\ast) = \mathbf{0} \]Local extrema (minima or maxima) can only occur at critical points (if
\[ f \]is differentiable everywhere) or at boundary points of the domain.
Theorem. Second Derivative Test for Local Extrema
Let
\[ f: \mathbb{R}^n \to \mathbb{R} \]be twice continuously differentiable in a neighborhood of a critical point
\[ \mathbf{x}^\ast \](i.e.,
\[ \nabla f(\mathbf{x}^\ast) = \mathbf{0} \]). Let
\[ H_f(\mathbf{x}^\ast) \]be the Hessian matrix of
\[ f \]evaluated at
\[ \mathbf{x}^\ast \].
- If
\[ H_f(\mathbf{x}^\ast) \]is positive definite (all eigenvalues
\[ > 0 \]), then
\[ f \]has a local minimum at
\[ \mathbf{x}^\ast \].
- If
\[ H_f(\mathbf{x}^\ast) \]is negative definite (all eigenvalues
\[ < 0 \]), then
\[ f \]has a local maximum at
\[ \mathbf{x}^\ast \].
- If
\[ H_f(\mathbf{x}^\ast) \]has both positive and negative eigenvalues (i.e., it is indefinite), then
\[ f \]has a saddle point at
\[ \mathbf{x}^\ast \].
- If
\[ H_f(\mathbf{x}^\ast) \]is semi-definite (e.g., positive semi-definite but not positive definite, meaning at least one eigenvalue is 0 and others are non-negative) and not indefinite, the test is inconclusive. Higher-order tests or other methods are needed.
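For a 2x2 symmetric Hessian the eigenvalues are available in closed form, so the test above can be applied directly. A sketch on a standard saddle example; the helper name is an illustrative assumption:

```python
import math

# Classify the critical point (0, 0) of f(x, y) = x^2 - y^2,
# whose constant Hessian is [[2, 0], [0, -2]].
def eig_2x2_sym(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    r = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return mean - r, mean + r

lo, hi = eig_2x2_sym(2.0, 0.0, -2.0)
if lo > 0:
    kind = "local minimum"      # positive definite
elif hi < 0:
    kind = "local maximum"      # negative definite
elif lo < 0 < hi:
    kind = "saddle point"       # indefinite
else:
    kind = "inconclusive"       # semi-definite with a zero eigenvalue
print(lo, hi, kind)
```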
10. Convexity and the Hessian
Convexity is a crucial property in optimization, often guaranteeing that local minima are global minima.
Proposition. Hessian and Convexity
Let
\[ f: \mathbb{R}^n \to \mathbb{R} \]be twice continuously differentiable on a convex set
\[ S \subseteq \mathbb{R}^n \].
\[ f \]is convex on
\[ S \]if and only if its Hessian matrix
\[ H_f(\mathbf{x}) \]is positive semi-definite for all
\[ \mathbf{x} \in S \].
- If
\[ H_f(\mathbf{x}) \]is positive definite for all
\[ \mathbf{x} \in S \], then
\[ f \]is strictly convex on
\[ S \]. (The converse is not necessarily true; a strictly convex function can have a Hessian that is only positive semi-definite, e.g.,
\[ f(x)=x^4 \]at
\[ x=0 \]).
- Analogously,
\[ f \]is concave (strictly concave) on
\[ S \]if and only if
\[ H_f(\mathbf{x}) \]is negative semi-definite (negative definite) for all
\[ \mathbf{x} \in S \].
Definition. Convex Function
A function
\[ f \]defined on a convex set
\[ S \subseteq \mathbb{R}^n \]is convex if for all
\[ \mathbf{x}, \mathbf{y} \in S \]and for all
\[ \lambda \in [0, 1] \]:
\[ f(\lambda \mathbf{x} + (1 - \lambda) \mathbf{y}) \le \lambda f(\mathbf{x}) + (1 - \lambda) f(\mathbf{y}) \]Geometrically, the line segment connecting any two points on the graph of
\[ f \]lies on or above the graph. It is strictly convex if the inequality is strict for all
\[ \mathbf{x} \ne \mathbf{y} \]and
\[ \lambda \in (0, 1) \].
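The defining inequality can be spot-checked by random sampling. This is only an illustration (sampling can fail to refute convexity but can never prove it); the test function and sample ranges are illustrative:

```python
import random

# Sampling check of the convexity inequality for f(x) = x^2 on R.
f = lambda x: x * x

random.seed(0)
ok = True
for _ in range(1000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    lam = random.random()
    lhs = f(lam * x + (1 - lam) * y)     # f at a point on the segment [x, y]
    rhs = lam * f(x) + (1 - lam) * f(y)  # corresponding chord value
    ok = ok and lhs <= rhs + 1e-9        # small slack for floating-point error
print(ok)  # the chord lies on or above the graph at every sample
```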
This guide covers foundational multivariable calculus concepts essential for optimization. Further topics like Lagrange multipliers (for constrained optimization) and vector calculus (divergence, curl, line/surface integrals) build upon these ideas but are beyond this immediate scope.