Multivariable Calculus for Optimization
A concise review of essential multivariable calculus concepts vital for understanding mathematical optimization, including partial derivatives, gradients, Hessians, and Taylor series.
1. Introduction
Welcome to this crash course on Multivariable Calculus! Many powerful optimization algorithms, especially those prevalent in machine learning, rely on concepts from calculus involving functions of multiple variables. This document aims to provide a concise review of these essential ideas, serving as a prerequisite or a quick refresher for our main series on Mathematical Optimization in Machine Learning.
We will cover:
- Functions of multiple variables.
- Partial derivatives: how a function changes along coordinate axes.
- The gradient: the direction of steepest ascent and its computation.
- The Hessian matrix: capturing second-order (curvature) information.
- Taylor series expansions: approximating functions locally.
This crash course assumes a basic understanding of single-variable calculus and an introductory familiarity with vectors and matrices from linear algebra. Our goal is not to be an exhaustive textbook but to equip you with the key calculus tools needed to understand the mechanics of optimization algorithms.
2. Functions of Multiple Variables
In optimization, we often deal with functions that depend on more than one input variable.
- A scalar-valued function of multiple variables maps a vector input from an \( n \)-dimensional space (\( \mathbb{R}^n \)) to a single real number (\( \mathbb{R} \)). We write this as \( f: \mathbb{R}^n \to \mathbb{R} \).
- The input is a vector \( x = (x_1, x_2, \dots, x_n) \).
- The output is a scalar \( f(x) \).
- In machine learning, \( x \) could represent the parameters of a model (weights and biases), and \( f(x) \) could be the loss function we want to minimize.
Example: Consider the function \( f(x_1, x_2) = x_1^2 + x_2^2 \). Here, \( n = 2 \), and \( f: \mathbb{R}^2 \to \mathbb{R} \). This function describes a paraboloid.
Level Sets: A useful concept for visualizing these functions is that of level sets.
- A level set of a function \( f: \mathbb{R}^n \to \mathbb{R} \) is the set of all points \( x \) in the domain for which \( f(x) \) is equal to some constant \( c \). That is, \( \{ x \in \mathbb{R}^n : f(x) = c \} \).
- For \( n = 2 \) (functions of two variables), level sets are typically level curves. For \( f(x_1, x_2) = x_1^2 + x_2^2 \), the level curves \( x_1^2 + x_2^2 = c \) (for \( c > 0 \)) are circles centered at the origin.
- For \( n = 3 \), level sets are level surfaces.
Level sets give us a way to “slice” the graph of the function and understand its topography.
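To make the level-set idea concrete, here is a small numerical sanity check (an illustrative Python sketch, not from the original text): every point on the circle \( x_1^2 + x_2^2 = c \) yields the same value of the paraboloid \( f(x_1, x_2) = x_1^2 + x_2^2 \).

```python
import math

def f(x1, x2):
    """Paraboloid f(x1, x2) = x1^2 + x2^2."""
    return x1**2 + x2**2

# Sample points on the level curve f = c, a circle of radius sqrt(c).
c = 4.0
r = math.sqrt(c)
points = [(r * math.cos(t), r * math.sin(t)) for t in [0.0, 0.7, 2.1, 5.5]]

# Every sampled point lies on the same level set: f(x1, x2) == c.
values = [f(x1, x2) for (x1, x2) in points]
print(values)  # each value is (numerically) 4.0
```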
3. Partial Derivatives
When a function depends on multiple variables, we can ask how it changes if we vary only one variable while keeping the others fixed. This leads to the concept of partial derivatives.
Definition. Partial Derivative
The partial derivative of a function \( f(x_1, x_2, \dots, x_n) \) with respect to the variable \( x_i \) at a point \( x = (a_1, \dots, a_n) \) is the derivative of the function \( g(x_i) = f(a_1, \dots, a_{i-1}, x_i, a_{i+1}, \dots, a_n) \) with respect to \( x_i \) at \( x_i = a_i \). It is denoted by:
\[ \frac{\partial f}{\partial x_i}(x) \quad \text{or} \quad f_{x_i}(x) \quad \text{or} \quad \partial_i f(x) \]
To compute \( \frac{\partial f}{\partial x_i} \), we treat all variables other than \( x_i \) as constants and differentiate \( f \) with respect to \( x_i \) using the rules of single-variable calculus.
Example: Let \( f(x_1, x_2) = x_1^2 + 3x_1 x_2^2 + 5x_2^3 \).
- To find \( \frac{\partial f}{\partial x_1} \), treat \( x_2 \) as a constant:
<div class="math-block" markdown="0"> \[ \frac{\partial f}{\partial x_1} = \frac{\partial}{\partial x_1}(x_1^2) + \frac{\partial}{\partial x_1}(3x_1 x_2^2) + \frac{\partial}{\partial x_1}(5x_2^3) = 2x_1 + 3x_2^2 + 0 = 2x_1 + 3x_2^2 \]
</div>
- To find \( \frac{\partial f}{\partial x_2} \), treat \( x_1 \) as a constant:
<div class="math-block" markdown="0"> \[ \frac{\partial f}{\partial x_2} = \frac{\partial}{\partial x_2}(x_1^2) + \frac{\partial}{\partial x_2}(3x_1 x_2^2) + \frac{\partial}{\partial x_2}(5x_2^3) = 0 + 3x_1 (2x_2) + 15x_2^2 = 6x_1 x_2 + 15x_2^2 \]
</div>
Geometrically, \( \frac{\partial f}{\partial x_1}(a_1, a_2) \) represents the slope of the tangent line to the curve formed by intersecting the surface \( z = f(x_1, x_2) \) with the plane \( x_2 = a_2 \), at the point \( (a_1, a_2, f(a_1, a_2)) \).
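Partial derivatives like these are easy to sanity-check numerically: vary one variable by a tiny amount while holding the other fixed. Below is an illustrative Python sketch (the helper names `partial_x1` and `partial_x2` are our own) for \( f(x_1, x_2) = x_1^2 + 3x_1 x_2^2 + 5x_2^3 \).

```python
def f(x1, x2):
    return x1**2 + 3*x1*x2**2 + 5*x2**3

def partial_x1(x1, x2):
    return 2*x1 + 3*x2**2        # analytic: 2*x1 + 3*x2^2

def partial_x2(x1, x2):
    return 6*x1*x2 + 15*x2**2    # analytic: 6*x1*x2 + 15*x2^2

# Central finite differences: vary one variable, hold the other fixed.
h = 1e-6
x1, x2 = 1.0, 2.0
fd_x1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2*h)
fd_x2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2*h)

print(partial_x1(x1, x2), fd_x1)  # both close to 14.0 (= 2*1 + 3*4)
print(partial_x2(x1, x2), fd_x2)  # both close to 72.0 (= 6*1*2 + 15*4)
```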
4. The Gradient
Partial derivatives tell us how a function changes along coordinate axes. But what if we want to know the rate of change in an arbitrary direction? This leads us to the directional derivative and, ultimately, the gradient.
4.1. Directional Derivatives
Definition. Directional Derivative
Let \( f: \mathbb{R}^n \to \mathbb{R} \) be a function, \( x \) a point in its domain, and \( u \) a unit vector in \( \mathbb{R}^n \) (\( \Vert u \Vert = 1 \)) representing a direction. The directional derivative of \( f \) at \( x \) in the direction \( u \), denoted \( D_u f(x) \), is defined as:
\[ D_u f(x) = \lim_{h \to 0^+} \frac{f(x+hu) - f(x)}{h} \]
provided this limit exists. It measures the instantaneous rate of change of \( f \) at \( x \) as we move from \( x \) in the direction \( u \).
If \( f \) is differentiable at \( x \) (a concept we won’t define formally here, but it essentially means \( f \) can be well-approximated by a linear function near \( x \)), there’s a more convenient way to compute the directional derivative.
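To see the limit definition at work before introducing that shortcut, we can evaluate the difference quotient for shrinking step sizes \( h \) (an illustrative Python sketch, not from the original). For \( f(x_1, x_2) = x_1^2 + x_2^2 \) at \( x = (1, 0) \) in the diagonal unit direction, the directional derivative works out to \( \sqrt{2} \).

```python
import math

def f(x):
    return x[0]**2 + x[1]**2

x = (1.0, 0.0)
u = (1/math.sqrt(2), 1/math.sqrt(2))  # unit vector: ||u|| = 1

def quotient(h):
    """One-sided difference quotient (f(x + h*u) - f(x)) / h."""
    return (f((x[0] + h*u[0], x[1] + h*u[1])) - f(x)) / h

# Shrinking h drives the quotient toward D_u f(x) = sqrt(2) ~ 1.4142.
for h in [1e-1, 1e-3, 1e-6]:
    print(h, quotient(h))
```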
4.2. Definition and Computation of the Gradient
The gradient is a vector that packages all the first-order partial derivative information about \( f \) at a point \( x \).
Definition. The Gradient \( \nabla f(x) \)
For a scalar-valued function \( f: \mathbb{R}^n \to \mathbb{R} \) that is differentiable at a point \( x \in \mathbb{R}^n \), the gradient of \( f \) at \( x \), denoted \( \nabla f(x) \) (read “del f” or “nabla f”), is the unique vector in \( \mathbb{R}^n \) such that for any unit vector \( u \in \mathbb{R}^n \), the directional derivative \( D_u f(x) \) is given by the dot product (or inner product) of \( \nabla f(x) \) with \( u \):
\[ D_u f(x) = \nabla f(x)^T u \]
(Note: We use \( \nabla f(x)^T u \) assuming column vectors for \( \nabla f(x) \) and \( u \). If they are row vectors, it would be \( \nabla f(x) \cdot u \).)
**How do we find this unique vector \( \nabla f(x) \)?** It turns out that in standard Cartesian coordinates (\( x_1, \dots, x_n \)), the gradient is simply the vector of its partial derivatives.
Theorem. Gradient as a Vector of Partial Derivatives
If \( f: \mathbb{R}^n \to \mathbb{R} \) is differentiable at \( x \), then its gradient \( \nabla f(x) \) is given by:
\[ \nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \frac{\partial f}{\partial x_2}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix} \]
Proof.
Let \( e_1, \dots, e_n \) be the standard orthonormal basis vectors for \( \mathbb{R}^n \), where \( e_i \) is a vector with a \( 1 \) in the \( i \)-th position and \( 0 \)s elsewhere.
- The directional derivative of \( f \) at \( x \) in the direction of the basis vector \( e_i \) is:
<div class="math-block" markdown="0"> \[ D_{e_i} f(x) = \lim_{h \to 0^+} \frac{f(x + h e_i) - f(x)}{h} \]
</div>
Since \( f(x + h e_i) = f(x_1, \dots, x_i + h, \dots, x_n) \), this limit is precisely the definition of the partial derivative of \( f \) with respect to \( x_i \):
<div class="math-block" markdown="0"> \[ D_{e_i} f(x) = \frac{\partial f}{\partial x_i}(x) \]
</div>
- From the definition of the gradient, we also have \( D_{e_i} f(x) = \nabla f(x)^T e_i \). If we write the gradient vector as \( \nabla f(x) = [g_1, g_2, \dots, g_n]^T \), then:
<div class="math-block" markdown="0"> \[ \nabla f(x)^T e_i = [g_1, g_2, \dots, g_n] \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} = g_i \quad (\text{where the 1 in } e_i \text{ is at the } i\text{-th position}) \]
</div>
So, \( \nabla f(x)^T e_i \) is the \( i \)-th component \( g_i \) of the gradient vector \( \nabla f(x) \).
- Equating the two expressions for \( D_{e_i} f(x) \) from step 1 and step 2, we get:
<div class="math-block" markdown="0"> \[ g_i = \frac{\partial f}{\partial x_i}(x) \]
</div>
Since this holds for all components \( i = 1, \dots, n \), the gradient vector is:
<div class="math-block" markdown="0"> \[ \nabla f(x) = \left[ \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x) \right]^T \]
</div>
This confirms that the abstractly defined gradient vector is indeed computed as the vector of partial derivatives in Cartesian coordinates.
Example: For \( f(x_1, x_2) = x_1^2 + 3x_1 x_2^2 + 5x_2^3 \), we found:
\[ \frac{\partial f}{\partial x_1} = 2x_1 + 3x_2^2, \qquad \frac{\partial f}{\partial x_2} = 6x_1 x_2 + 15x_2^2 \]
So, the gradient is:
\[ \nabla f(x) = \begin{bmatrix} 2x_1 + 3x_2^2 \\ 6x_1 x_2 + 15x_2^2 \end{bmatrix} \]
At the point \( x = (1, 1) \), \( \nabla f(1, 1) = \begin{bmatrix} 5 \\ 21 \end{bmatrix} \).
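An analytic gradient is easy to get wrong, so it is good practice to compare it against a finite-difference approximation. Here is an illustrative Python sketch using NumPy for \( f(x_1, x_2) = x_1^2 + 3x_1 x_2^2 + 5x_2^3 \); `numerical_grad` is our own helper, not a library function.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1**2 + 3*x1*x2**2 + 5*x2**3

def grad_f(x):
    x1, x2 = x
    return np.array([2*x1 + 3*x2**2, 6*x1*x2 + 15*x2**2])

def numerical_grad(f, x, h=1e-6):
    """Central-difference gradient, one coordinate at a time."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2*h)
    return g

x = np.array([1.0, 1.0])
print(grad_f(x))             # [ 5. 21.]
print(numerical_grad(f, x))  # approximately [ 5. 21.]
```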
4.3. Geometric Interpretation of the Gradient
The gradient has profound geometric significance:
- Direction of Steepest Ascent: The directional derivative is \( D_u f(x) = \nabla f(x)^T u = \Vert \nabla f(x) \Vert \cos\theta \), where \( \theta \) is the angle between \( \nabla f(x) \) and \( u \) (recall \( \Vert u \Vert = 1 \)).
  - \( D_u f(x) \) is maximized when \( \cos\theta = 1 \) (i.e., \( \theta = 0 \)). This occurs when \( u \) points in the **same direction as \( \nabla f(x) \)**.
  - Thus, \( \nabla f(x) \) points in the direction in which \( f \) increases most rapidly at \( x \).
- Magnitude of Steepest Ascent: The maximum rate of increase (the value of \( D_u f(x) \) when \( u \) is aligned with \( \nabla f(x) \)) is \( \Vert \nabla f(x) \Vert \).
- Direction of Steepest Descent: \( D_u f(x) \) is minimized (most negative) when \( \cos\theta = -1 \) (i.e., \( \theta = \pi \)). This occurs when \( u \) points in the **opposite direction to \( \nabla f(x) \)**, i.e., \( u = -\nabla f(x) / \Vert \nabla f(x) \Vert \) (if \( \nabla f(x) \neq 0 \)).
  - Thus, \( -\nabla f(x) \) points in the direction in which \( f \) decreases most rapidly at \( x \). This is fundamental for minimization algorithms like gradient descent.
- Orthogonality to Level Sets: The gradient \( \nabla f(x) \) at a point \( x \) is orthogonal (perpendicular) to the level set of \( f \) that passes through \( x \).
  - Intuition: If you move along a level set, the function value does not change, so the directional derivative in that direction is zero. If \( u \) is tangent to the level set, then \( D_u f(x) = \nabla f(x)^T u = 0 \), implying \( \nabla f(x) \) is orthogonal to \( u \).
- Zero Gradient: If \( \nabla f(x) = 0 \) (the zero vector), then \( D_u f(x) = 0 \) for all directions \( u \). This means \( f \) is locally “flat” at \( x \). Such points are called critical points or stationary points and are candidates for local minima, local maxima, or saddle points.
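The steepest-descent property is exactly what gradient descent exploits: repeatedly step against the gradient. Here is a minimal Python sketch on the paraboloid \( f(x_1, x_2) = x_1^2 + x_2^2 \); the starting point, step size 0.1, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x1, x2) = x1^2 + x2^2; the unique minimizer is the origin.
    return 2 * x

x = np.array([3.0, -2.0])   # arbitrary starting point
step = 0.1                  # illustrative fixed step size
for _ in range(100):
    x = x - step * grad_f(x)   # move opposite the gradient: steepest descent

print(x)  # very close to [0, 0]
```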
5. The Hessian Matrix
While the gradient (first-order derivatives) tells us about the slope and direction of steepest change, second-order derivatives tell us about the curvature of the function. These are collected in the Hessian matrix.
Definition. The Hessian Matrix \( \nabla^2 f(x) \)
For a function \( f: \mathbb{R}^n \to \mathbb{R} \) whose second-order partial derivatives exist, the Hessian matrix of \( f \) at a point \( x \in \mathbb{R}^n \), denoted \( \nabla^2 f(x) \) or \( H_f(x) \), is the \( n \times n \) matrix of these second partial derivatives:
\[ \nabla^2 f(x) = H_f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \frac{\partial^2 f}{\partial x_2 \partial x_1}(x) & \frac{\partial^2 f}{\partial x_2^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1}(x) & \frac{\partial^2 f}{\partial x_n \partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x) \end{bmatrix} \]
The entry in the \( i \)-th row and \( j \)-th column is \( (\nabla^2 f(x))_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x) \).
Symmetry of the Hessian: If the second partial derivatives of \( f \) are continuous in a region, then the order of differentiation does not matter (Clairaut’s Theorem or Schwarz’s Theorem on equality of mixed partials):
\[ \frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i} \]
This implies that if these conditions hold, the Hessian matrix \( \nabla^2 f(x) \) is a symmetric matrix (\( \nabla^2 f(x) = \nabla^2 f(x)^T \)). This is usually the case for functions encountered in optimization.
Example: For \( f(x_1, x_2) = x_1^2 + 3x_1 x_2^2 + 5x_2^3 \), we had:
\[ \frac{\partial f}{\partial x_1} = 2x_1 + 3x_2^2, \qquad \frac{\partial f}{\partial x_2} = 6x_1 x_2 + 15x_2^2 \]
Now the second partial derivatives:
\[ \frac{\partial^2 f}{\partial x_1^2} = 2, \qquad \frac{\partial^2 f}{\partial x_1 \partial x_2} = 6x_2, \qquad \frac{\partial^2 f}{\partial x_2 \partial x_1} = 6x_2, \qquad \frac{\partial^2 f}{\partial x_2^2} = 6x_1 + 30x_2 \]
(Note that \( \frac{\partial^2 f}{\partial x_1 \partial x_2} = \frac{\partial^2 f}{\partial x_2 \partial x_1} \), as expected.)
So, the Hessian matrix is:
\[ \nabla^2 f(x) = \begin{bmatrix} 2 & 6x_2 \\ 6x_2 & 6x_1 + 30x_2 \end{bmatrix} \]
At the point \( x = (1, 1) \), \( \nabla^2 f(1, 1) = \begin{bmatrix} 2 & 6 \\ 6 & 36 \end{bmatrix} \).
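The finite-difference idea extends to second derivatives as well. The sketch below (our own illustrative helper, not a library routine) approximates the Hessian of \( f(x_1, x_2) = x_1^2 + 3x_1 x_2^2 + 5x_2^3 \) at \( (1, 1) \) and recovers a symmetric matrix with entries \( 2, 6, 6, 36 \).

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1**2 + 3*x1*x2**2 + 5*x2**3

def numerical_hessian(f, x, h=1e-4):
    """Finite-difference approximation of the n x n Hessian."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            # Mixed central difference for d^2 f / (dx_i dx_j).
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)
    return H

H = numerical_hessian(f, np.array([1.0, 1.0]))
print(np.round(H))  # approximately [[2, 6], [6, 36]], and symmetric
```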
The Hessian matrix is crucial for:
- Second-order optimization methods (like Newton’s method).
- Characterizing critical points: The definiteness of the Hessian (positive definite, negative definite, indefinite) at a critical point helps determine if it’s a local minimum, local maximum, or saddle point.
- Understanding the local convexity of a function. (More on this in a Convex Analysis crash course.)
6. Taylor Series for Multivariable Functions
Just as in single-variable calculus, we can use Taylor series to approximate a multivariable function \( f \) around a point \( a \) using its derivatives at \( a \). This is extremely useful in optimization for building local models of the objective function.
Let \( x = a + d \), where \( d \) is a small displacement vector.
First-Order Taylor Expansion (Linear Approximation): If \( f \) is differentiable at \( a \), then for \( \Vert d \Vert \) small:
\[ f(a + d) \approx f(a) + \nabla f(a)^T d \]
This approximates \( f \) near \( a \) with a linear function (a hyperplane). The term \( \nabla f(a)^T d \) is the first-order change.
Second-Order Taylor Expansion (Quadratic Approximation): If \( f \) is twice differentiable at \( a \), then for \( \Vert d \Vert \) small:
\[ f(a + d) \approx f(a) + \nabla f(a)^T d + \frac{1}{2} d^T \nabla^2 f(a) d \]
This approximates \( f \) near \( a \) with a quadratic function. The term \( \frac{1}{2} d^T \nabla^2 f(a) d \) is the second-order (quadratic) change involving the Hessian. This quadratic approximation is the basis for Newton’s method in optimization.
Example: For \( f(x_1, x_2) = x_1^2 + x_2^2 \) around \( a = (1, 1) \):
- \( f(a) = 1^2 + 1^2 = 2 \).
- \( \nabla f(x) = \begin{bmatrix} 2x_1 \\ 2x_2 \end{bmatrix} \), so \( \nabla f(a) = \begin{bmatrix} 2 \\ 2 \end{bmatrix} \).
- \( \nabla^2 f(x) = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \), which is constant. So \( \nabla^2 f(a) = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \).
Let \( d = (d_1, d_2) \). The second-order Taylor expansion around \( a \) is:
\[ f(a + d) \approx f(a) + \nabla f(a)^T d + \frac{1}{2} d^T \nabla^2 f(a) d = 2 + 2d_1 + 2d_2 + d_1^2 + d_2^2 \]
If we let \( d_1 = x_1 - 1 \) and \( d_2 = x_2 - 1 \), then \( x_1 = 1 + d_1 \) and \( x_2 = 1 + d_2 \). Substituting these back:
\[ 2 + 2(x_1 - 1) + 2(x_2 - 1) + (x_1 - 1)^2 + (x_2 - 1)^2 = x_1^2 + x_2^2 = f(x) \]
In this case, because \( f \) is already a quadratic function, its second-order Taylor expansion is exact (the remainder term is zero). For more complex functions, it’s an approximation.
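We can confirm this exactness numerically: for a quadratic \( f \), the second-order model matches the function for any displacement, not just small ones. An illustrative Python sketch for \( f(x_1, x_2) = x_1^2 + x_2^2 \) around \( a = (1, 1) \):

```python
import numpy as np

def f(x):
    return x[0]**2 + x[1]**2

a = np.array([1.0, 1.0])
grad_a = 2 * a             # gradient of f at a: [2, 2]
hess_a = 2 * np.eye(2)     # Hessian of f (constant): [[2, 0], [0, 2]]

def taylor2(d):
    """Second-order model f(a) + grad^T d + (1/2) d^T H d."""
    return f(a) + grad_a @ d + 0.5 * d @ hess_a @ d

# Even a large displacement gives an exact match for this quadratic f.
d = np.array([10.0, -7.0])
print(taylor2(d), f(a + d))  # identical values
```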
7. A Note on Vector and Matrix Operations
Throughout this crash course and in optimization theory, certain vector and matrix operations from linear algebra are fundamental. We assume basic familiarity, but here’s a quick reminder of notation commonly used:
- Vectors (\( x, y \)) are typically column vectors in \( \mathbb{R}^n \).
- Transpose: \( x^T \) denotes the transpose of \( x \) (a row vector if \( x \) is a column vector).
- Dot Product (Inner Product): For vectors \( x, y \in \mathbb{R}^n \), their dot product is \( x^T y = \sum_{i=1}^n x_i y_i \).
- Vector Norm (Euclidean Norm): \( \Vert x \Vert = \sqrt{x^T x} = \sqrt{\sum_{i=1}^n x_i^2} \).
- Matrix-Vector Product: If \( A \) is an \( m \times n \) matrix and \( x \) is an \( n \times 1 \) vector, \( Ax \) is an \( m \times 1 \) vector.
- Quadratic Form: For an \( n \times n \) matrix \( A \) and a vector \( x \in \mathbb{R}^n \), the scalar \( x^T A x \) is a quadratic form. This appears in the second-order Taylor expansion with \( A = \nabla^2 f(a) \).
These operations are the building blocks for expressing and manipulating the calculus concepts we’ve discussed.
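For readers who want to experiment, these operations map directly onto NumPy (an illustrative sketch; NumPy 1-D arrays stand in for column vectors, and `@` performs the matrix and inner products):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 0.0],
              [0.0, 0.0, 4.0]])

dot = x @ y                # dot product x^T y = 4 + 10 + 18 = 32
norm = np.sqrt(x @ x)      # Euclidean norm ||x|| = sqrt(14)
Ax = A @ x                 # matrix-vector product, a vector in R^3
quad = x @ A @ x           # quadratic form x^T A x = 2 + 12 + 36 = 50

print(dot, norm, Ax, quad)
```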
8. Conclusion
This crash course has touched upon the key elements of multivariable calculus that are indispensable for understanding gradient-based optimization algorithms: partial derivatives, the gradient vector and its geometric significance (steepest ascent/descent), the Hessian matrix capturing curvature, and Taylor series for local function approximation.
These concepts form the language used to describe how functions behave locally and how we can iteratively seek their optima. As you proceed through the main optimization series, you’ll see these tools applied repeatedly. Don’t hesitate to revisit this crash course if you need a refresher on any of these foundational ideas!
Further Reading (Standard calculus textbooks like Stewart’s “Calculus” or Apostol’s “Calculus, Vol. 2” cover these topics in great depth. Grant “3Blue1Brown” Sanderson also has excellent animated explanations on YouTube and a course on Khan Academy.)