Gradient Descent and Gradient Flow

Exploring the mathematical foundations of gradient descent, its continuous analogue gradient flow, and their connections.

1. Introduction: The Landscape of Optimization

In machine learning, training models often involves minimizing a loss function \(f(\theta)\) with respect to its parameters \(\theta\). Gradient-based methods are the workhorses for this task. This post delves into two fundamental concepts: Gradient Descent (GD), a discrete iterative algorithm, and Gradient Flow (GF), its continuous-time analogue.

Why study both?

  • Gradient Descent is what we implement. Understanding its mechanics, convergence, and sensitivity to hyperparameters like the learning rate is crucial for practitioners.
  • Gradient Flow provides a powerful abstraction. It often simplifies analysis, revealing underlying stability properties and connections to physical systems (like an object rolling downhill on an energy landscape \(f(\theta)\)). By studying the continuous dynamics, we can gain deeper insights into the behavior of its discrete counterpart.

Roadmap:

  1. We’ll start with the discrete Gradient Descent algorithm: its formulation, conditions for convergence, and behavior under smoothness and strong convexity assumptions.
  2. Then, we’ll explore the Gradient Flow ODE, its properties like energy dissipation (Lyapunov stability), and convergence characteristics.
  3. We’ll explicitly bridge these two worlds, showing how GD can be seen as a numerical discretization (Forward Euler) of GF.
  4. A comparison of convergence rates and qualitative behaviors will highlight the similarities and differences.
  5. Illustrative examples, particularly the quadratic case, will make these concepts concrete.
  6. Finally, we’ll briefly touch upon extensions, paving the way for future discussions on more advanced optimizers.

Our focus will be on the mathematical underpinnings, providing definitions, key results, and derivations to build a solid understanding.

2. Gradient Descent: Iterating Towards a Minimum

Gradient Descent is an iterative algorithm that takes steps in the direction opposite to the gradient of the function at the current point.

Definition. Gradient Descent Algorithm

Given a differentiable function \(f: \mathbb{R}^d \to \mathbb{R}\), an initial parameter vector \(\theta_0 \in \mathbb{R}^d\), and a learning rate (step size) \(\eta > 0\), the Gradient Descent update rule is:

\[\theta_{k+1} = \theta_k - \eta \nabla f(\theta_k)\]

for \(k = 0, 1, 2, \dots\).

The choice of \(\eta\) is critical and influences both the speed and stability of convergence.
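To make the update rule concrete, here is a minimal NumPy sketch. The helper `gradient_descent` and the toy objective \(f(\theta) = \frac{1}{2}\Vert\theta\Vert^2\) (whose gradient is \(\theta\) itself) are illustrative choices, not a reference implementation.

```python
import numpy as np

def gradient_descent(grad_f, theta0, eta, num_steps):
    """Run vanilla gradient descent and return all iterates."""
    theta = np.asarray(theta0, dtype=float)
    trajectory = [theta.copy()]
    for _ in range(num_steps):
        theta = theta - eta * grad_f(theta)  # theta_{k+1} = theta_k - eta * grad f(theta_k)
        trajectory.append(theta.copy())
    return np.array(trajectory)

# Toy example: f(theta) = ||theta||^2 / 2, so grad f(theta) = theta.
iterates = gradient_descent(grad_f=lambda th: th, theta0=[3.0, -2.0], eta=0.1, num_steps=50)
print(iterates[-1])  # close to the minimizer at the origin
```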

2.1. Ensuring Progress: The Role of Smoothness

To guarantee that gradient descent makes progress (i.e., decreases the function value), we often assume the function \(f\) has a Lipschitz continuous gradient, also known as \(L\)-smoothness.

Definition. \(L\)-Smoothness (Lipschitz Gradient)

A function \(f\) is \(L\)-smooth if its gradient \(\nabla f\) is Lipschitz continuous with constant \(L \ge 0\):

\[\Vert \nabla f(x) - \nabla f(y) \Vert \le L \Vert x - y \Vert, \quad \forall x, y \in \mathbb{R}^d\]

\(L\)-smoothness implies the descent lemma below (the converse holds for convex \(f\); for twice-differentiable \(f\), the condition \(\nabla^2 f(x) \preceq L I\) is an equivalent characterization):

\[f(y) \le f(x) + \nabla f(x)^\top (y-x) + \frac{L}{2}\Vert y-x\Vert^2, \quad \forall x, y \in \mathbb{R}^d\]

Lemma. Sufficient Decrease for Gradient Descent

If \(f\) is \(L\)-smooth, then for a gradient descent step \(\theta_{k+1} = \theta_k - \eta \nabla f(\theta_k)\), we have:

\[f(\theta_{k+1}) \le f(\theta_k) - \eta \left(1 - \frac{L\eta}{2}\right) \Vert \nabla f(\theta_k) \Vert^2\]

Thus, if \(0 < \eta < \frac{2}{L}\), gradient descent ensures \(f(\theta_{k+1}) < f(\theta_k)\) whenever \(\nabla f(\theta_k) \neq 0\). A common choice, \(\eta = 1/L\), yields:

\[f(\theta_{k+1}) \le f(\theta_k) - \frac{1}{2L} \Vert \nabla f(\theta_k) \Vert^2\]

Proof Sketch. Sufficient Decrease

Using the descent lemma property of \(L\)-smoothness with \(y = \theta_{k+1}\) and \(x = \theta_k\):

\[f(\theta_{k+1}) \le f(\theta_k) + \nabla f(\theta_k)^\top (\theta_{k+1} - \theta_k) + \frac{L}{2}\Vert \theta_{k+1} - \theta_k \Vert^2\]

Substitute \(\theta_{k+1} - \theta_k = -\eta \nabla f(\theta_k)\):

\[
\begin{aligned}
f(\theta_{k+1}) &\le f(\theta_k) - \eta \nabla f(\theta_k)^\top \nabla f(\theta_k) + \frac{L}{2}\Vert -\eta \nabla f(\theta_k) \Vert^2 \\
&= f(\theta_k) - \eta \Vert \nabla f(\theta_k) \Vert^2 + \frac{L\eta^2}{2} \Vert \nabla f(\theta_k) \Vert^2 \\
&= f(\theta_k) - \eta \left(1 - \frac{L\eta}{2}\right) \Vert \nabla f(\theta_k) \Vert^2
\end{aligned}
\]

For general convex \(L\)-smooth functions, with \(\eta = 1/L\), gradient descent achieves a convergence rate for the function value:

\[f(\theta_k) - f(\theta^\ast) \le \frac{L \Vert \theta_0 - \theta^\ast \Vert^2}{2k}\]

where \(\theta^\ast\) is a minimizer. This is a sublinear rate of \(O(1/k)\).
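Both the sufficient decrease inequality and this \(O(1/k)\) bound can be checked numerically. Below is a sketch on a rank-deficient least-squares problem, which is convex and \(L\)-smooth but not strongly convex; the construction of `M`, the tolerances, and the step count are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-deficient design matrix: f is convex and L-smooth but NOT strongly convex.
M = rng.normal(size=(20, 10)) @ np.diag([1.0] * 5 + [0.0] * 5)
b = rng.normal(size=20)

f = lambda th: 0.5 * np.sum((M @ th - b) ** 2)
grad = lambda th: M.T @ (M @ th - b)
L = np.linalg.eigvalsh(M.T @ M).max()   # smoothness constant
theta_star = np.linalg.pinv(M) @ b      # a minimizer (least-norm solution)
f_star = f(theta_star)

theta = np.zeros(10)                    # theta_0 = 0
eta = 1.0 / L
for k in range(1, 201):
    f_prev, g = f(theta), grad(theta)
    theta = theta - eta * g
    # Sufficient decrease with eta = 1/L: f(theta_{k+1}) <= f(theta_k) - ||g||^2 / (2L)
    assert f(theta) <= f_prev - g @ g / (2 * L) + 1e-9
    # Sublinear rate: f(theta_k) - f* <= L ||theta_0 - theta*||^2 / (2k)
    assert f(theta) - f_star <= L * (theta_star @ theta_star) / (2 * k) + 1e-9
print("both inequalities held for 200 steps")
```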

2.2. Faster Convergence with Strong Convexity

If the function \(f\) is also strongly convex, convergence can be much faster.

Definition. \(\mu\)-Strong Convexity

A function \(f\) is \(\mu\)-strongly convex (for \(\mu > 0\)) if for all \(x, y \in \mathbb{R}^d\):

\[f(y) \ge f(x) + \nabla f(x)^\top (y-x) + \frac{\mu}{2}\Vert y-x\Vert^2\]

If \(f\) is twice differentiable, this is equivalent to \(\nabla^2 f(x) \succeq \mu I\).

Theorem. Linear Convergence for Strongly Convex Functions

If \(f\) is \(L\)-smooth and \(\mu\)-strongly convex, and the learning rate \(\eta\) is chosen appropriately (e.g., \(\eta = 1/L\) or \(\eta = 2/(L+\mu)\)), Gradient Descent converges linearly: There exists a constant \(c < 1\) (e.g., \(c = 1 - \mu/L\) for \(\eta=1/L\)) such that:

\[\Vert \theta_k - \theta^\ast \Vert^2 \le c^k \Vert \theta_0 - \theta^\ast \Vert^2\]

and

\[f(\theta_k) - f(\theta^\ast) \le c^k (f(\theta_0) - f(\theta^\ast))\]

This implies that the error decreases exponentially fast. The number of iterations to reach \(\epsilon\)-accuracy is \(O(\kappa \log(1/\epsilon))\), where \(\kappa = L/\mu\) is the condition number.

2.3. Example: Quadratic Objective

Consider the quadratic function \(f(\theta) = \frac{1}{2} \theta^\top A \theta - b^\top \theta + c\), where \(A\) is a symmetric positive definite (SPD) matrix. The gradient is \(\nabla f(\theta) = A\theta - b\). The unique minimizer \(\theta^\ast\) satisfies \(A\theta^\ast = b\). The GD update is:

\[\theta_{k+1} = \theta_k - \eta (A\theta_k - b)\]

Subtracting \(\theta^\ast\) from both sides:

\[\theta_{k+1} - \theta^\ast = (\theta_k - \theta^\ast) - \eta A(\theta_k - \theta^\ast) = (I - \eta A)(\theta_k - \theta^\ast)\]

Let \(e_k = \theta_k - \theta^\ast\). Then \(e_{k+1} = (I - \eta A)e_k\), so \(e_k = (I - \eta A)^k e_0\), and convergence requires the spectral radius of \(I - \eta A\) to be less than 1.

If \(\lambda_1 \le \dots \le \lambda_d\) are the eigenvalues of \(A\) (all positive since \(A\) is SPD), then the eigenvalues of \(I - \eta A\) are \(1 - \eta \lambda_i\). Convergence requires \(\vert 1 - \eta \lambda_i \vert < 1\) for all \(i\), i.e. \(0 < \eta \lambda_i < 2\), which gives \(0 < \eta < \frac{2}{\lambda_{\max}(A)}\). Here \(L = \lambda_{\max}(A)\) and \(\mu = \lambda_{\min}(A)\), so the condition number is \(\kappa = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}\).
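A short numerical check of this spectral analysis, using a diagonal \(A\) so the eigenvalues are visible at a glance (the matrix, starting point, and step count are illustrative):

```python
import numpy as np

A = np.diag([1.0, 10.0])                 # mu = 1, L = 10, kappa = 10
b = np.array([1.0, 1.0])
theta_star = np.linalg.solve(A, b)

eta = 1.0 / 10.0                         # eta = 1/L, inside the stable range (0, 2/L)
theta = np.array([5.0, 5.0])
e0 = np.linalg.norm(theta - theta_star)
for _ in range(100):
    theta = theta - eta * (A @ theta - b)

rho = np.max(np.abs(1 - eta * np.diag(A)))  # spectral radius of I - eta*A (= 0.9 here)
print(np.linalg.norm(theta - theta_star), "<=", rho ** 100 * e0)
```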

3. Gradient Flow: The Continuous Path of Steepest Descent

Gradient Flow describes a continuous trajectory \(\theta(t)\) that moves in the direction of the negative gradient.

Definition. Gradient Flow

The gradient flow of a differentiable function \(f: \mathbb{R}^d \to \mathbb{R}\) is the solution \(\theta(t)\) to the ordinary differential equation (ODE):

\[\frac{d\theta(t)}{dt} = -\nabla f(\theta(t))\]

with an initial condition \(\theta(0) = \theta_0\).

If \(\nabla f\) is Lipschitz continuous, existence and uniqueness of the solution \(\theta(t)\) are guaranteed.

3.1. Energy Dissipation and Lyapunov Stability

A key property of gradient flow is that the function value \(f(\theta(t))\) continuously decreases along the trajectory, unless a critical point (\(\nabla f(\theta(t))=0\)) is reached. This can be elegantly shown using \(f\) itself as a Lyapunov function.

Consider \(V(\theta(t)) = f(\theta(t))\). Its time derivative is:

\[\frac{d}{dt} f(\theta(t)) = \nabla f(\theta(t))^\top \frac{d\theta(t)}{dt}\]

Substituting the gradient flow equation \(\frac{d\theta(t)}{dt} = -\nabla f(\theta(t))\):

\[\frac{d}{dt} f(\theta(t)) = \nabla f(\theta(t))^\top (-\nabla f(\theta(t))) = - \Vert \nabla f(\theta(t)) \Vert^2\]

Since \(\Vert \nabla f(\theta(t)) \Vert^2 \ge 0\), we have \(\frac{d}{dt} f(\theta(t)) \le 0\). The function value strictly decreases as long as \(\nabla f(\theta(t)) \neq 0\). Trajectories of the gradient flow will converge to the set of critical points of \(f\). This demonstrates that \(f\) acts as a Lyapunov function for the system, ensuring stability around minima.
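This monotone decrease is easy to verify with an off-the-shelf ODE solver. The sketch below integrates the flow for a convex quartic test function using `scipy.integrate.solve_ivp`; the objective, time horizon, and tolerances are assumptions of the example.

```python
import numpy as np
from scipy.integrate import solve_ivp

f = lambda th: 0.25 * np.sum(th ** 4) + 0.5 * np.sum(th ** 2)
grad = lambda th: th ** 3 + th

# Integrate d theta/dt = -grad f(theta) from theta(0) = (2, -1.5).
sol = solve_ivp(lambda t, th: -grad(th), t_span=(0.0, 10.0), y0=[2.0, -1.5],
                dense_output=True, rtol=1e-9, atol=1e-12)

ts = np.linspace(0.0, 10.0, 50)
values = [f(sol.sol(t)) for t in ts]
# f(theta(t)) should be nonincreasing along the trajectory.
assert all(v2 <= v1 + 1e-8 for v1, v2 in zip(values, values[1:]))
print(values[0], "->", values[-1])
```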

3.2. Convergence Rates for Gradient Flow

Similar to gradient descent, convergence rates depend on the properties of \(f\).

  1. Convex Case: If \(f\) is convex and \(L\)-smooth, one can show a rate of:

    \[f(\theta(t)) - f(\theta^\ast) \le \frac{\Vert \theta_0 - \theta^\ast \Vert^2}{2t}\]

    This is an \(O(1/t)\) sublinear rate.

  2. Strongly Convex Case: If \(f\) is \(\mu\)-strongly convex, the convergence is exponential:

    \[f(\theta(t)) - f(\theta^\ast) \le (f(\theta_0) - f(\theta^\ast)) e^{-2\mu t}\]

    And for the parameters:

    \[\Vert \theta(t) - \theta^\ast \Vert^2 \le \Vert \theta_0 - \theta^\ast \Vert^2 e^{-2\mu t}\]

    Proof Sketch. Exponential Decay for Strongly Convex Flow

    For a \(\mu\)-strongly convex function, minimizing both sides of the strong convexity inequality over \(y\) yields the Polyak–Łojasiewicz inequality \(\Vert \nabla f(x) \Vert^2 \ge 2\mu (f(x) - f(x^\ast))\). Then,

    \[\frac{d}{dt} (f(\theta(t)) - f(\theta^\ast)) = - \Vert \nabla f(\theta(t)) \Vert^2 \le -2\mu (f(\theta(t)) - f(\theta^\ast))\]

    Let \(g(t) = f(\theta(t)) - f(\theta^\ast)\). We have \(g'(t) \le -2\mu g(t)\). By Grönwall’s inequality, \(g(t) \le g(0)e^{-2\mu t}\).

3.3. Example: Quadratic Objective

For \(f(\theta) = \frac{1}{2} \theta^\top A \theta\) (assuming \(b=0, c=0\) for simplicity, so \(\theta^\ast=0\)), where \(A\) is SPD. The gradient flow ODE is:

\[\frac{d\theta(t)}{dt} = -A\theta(t)\]

The solution is \(\theta(t) = e^{-At} \theta_0\). If \(A = V \Lambda V^\top\) is the eigendecomposition of \(A\) (with \(V\) orthogonal and \(\Lambda = \text{diag}(\lambda_1, \dots, \lambda_d)\)), then \(e^{-At} = V e^{-\Lambda t} V^\top\). So, \(\theta(t) = V e^{-\Lambda t} V^\top \theta_0\). In the eigenbasis \(z(t) = V^\top \theta(t)\), the ODEs decouple:

\[\frac{dz_i(t)}{dt} = -\lambda_i z_i(t) \quad \Rightarrow \quad z_i(t) = z_i(0) e^{-\lambda_i t}\]

The decay rate of each component is governed by its corresponding eigenvalue \(\lambda_i\). The slowest decaying component corresponds to \(\lambda_{\min}(A)\).
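The closed form can be checked numerically by comparing the matrix exponential against the eigendecomposition route; the SPD matrix, initial point, and time below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
theta0 = np.array([1.0, -2.0])
t = 1.5

direct = expm(-A * t) @ theta0            # theta(t) = e^{-At} theta_0

lmbda, V = np.linalg.eigh(A)              # A = V diag(lmbda) V^T
via_eig = V @ (np.exp(-lmbda * t) * (V.T @ theta0))

print(np.allclose(direct, via_eig))       # True: each mode decays like e^{-lambda_i t}
```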

4. Bridging Discrete Steps and Continuous Paths

The connection between Gradient Descent and Gradient Flow is established by viewing GD as a numerical discretization of the GF ODE.

4.1. Forward Euler Discretization

The simplest method to discretize an ODE \(\frac{dy}{dt} = G(y)\) is the Forward Euler method:

\[\frac{y(t+\eta) - y(t)}{\eta} \approx G(y(t)) \quad \Rightarrow \quad y(t+\eta) \approx y(t) + \eta G(y(t))\]

Applying this to the gradient flow ODE \(\frac{d\theta(t)}{dt} = -\nabla f(\theta(t))\): Let \(t = k\eta\) and \(\theta_k \approx \theta(k\eta)\).

\[\frac{\theta((k+1)\eta) - \theta(k\eta)}{\eta} \approx -\nabla f(\theta(k\eta))\]

\[\theta_{k+1} \approx \theta_k - \eta \nabla f(\theta_k)\]

This is precisely the Gradient Descent update rule.

4.2. Approximation Error

The local truncation error of the Forward Euler method is \(O(\eta^2)\) per step. Over a fixed time interval \(T = K\eta\), the global error \(\Vert \theta_K - \theta(T) \Vert\) is typically \(O(\eta)\). This means that for small step sizes \(\eta\), the trajectory of Gradient Descent closely follows the path of Gradient Flow.
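The \(O(\eta)\) global error is visible on the scalar problem \(f(\theta) = \frac{1}{2}\lambda \theta^2\), whose flow has the exact solution \(\theta_0 e^{-\lambda t}\). A minimal sketch (the constants are arbitrary):

```python
import numpy as np

lam, theta0, T = 2.0, 1.0, 1.0   # flow: theta(t) = theta0 * exp(-lam * t)

for eta in [0.1, 0.05, 0.025, 0.0125]:
    theta = theta0
    for _ in range(int(round(T / eta))):
        theta = (1 - eta * lam) * theta   # forward Euler = one GD step
    err = abs(theta - theta0 * np.exp(-lam * T))
    print(f"eta = {eta:<7}  global error at T: {err:.2e}")
```

Halving \(\eta\) should roughly halve the final error, consistent with first-order accuracy.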

Tip. Taylor Expansion for Local Truncation Error

Consider the Taylor expansion of \(\theta(t+\eta)\) around \(t\):

\[\theta(t+\eta) = \theta(t) + \eta \frac{d\theta(t)}{dt} + \frac{\eta^2}{2} \frac{d^2\theta(t)}{dt^2} + O(\eta^3)\]

Substitute \(\frac{d\theta(t)}{dt} = -\nabla f(\theta(t))\):

\[\theta(t+\eta) = \theta(t) - \eta \nabla f(\theta(t)) + \frac{\eta^2}{2} \frac{d^2\theta(t)}{dt^2} + O(\eta^3)\]

The Gradient Descent step is \(\theta_{k+1} = \theta_k - \eta \nabla f(\theta_k)\). The difference, representing the local error, is primarily due to the \(\frac{\eta^2}{2} \frac{d^2\theta(t)}{dt^2}\) term, where \(\frac{d^2\theta(t)}{dt^2} = -\nabla^2 f(\theta(t)) \frac{d\theta(t)}{dt} = \nabla^2 f(\theta(t)) \nabla f(\theta(t))\).

4.3. Stability Regions and Learning Rate

The Forward Euler method is not unconditionally stable. For the scalar test problem \(f(\theta) = \frac{1}{2}\lambda \theta^2\) (\(\lambda > 0\)):

  • Gradient Flow: \(\frac{d\theta}{dt} = -\lambda \theta \Rightarrow \theta(t) = \theta_0 e^{-\lambda t}\). This is always stable and decays to 0.
  • Gradient Descent: \(\theta_{k+1} = \theta_k - \eta \lambda \theta_k = (1 - \eta \lambda) \theta_k\). For stability, we need \(\vert 1 - \eta \lambda \vert < 1\), which implies \(-1 < 1 - \eta \lambda < 1\). This gives \(0 < \eta \lambda < 2\), or \(0 < \eta < \frac{2}{\lambda}\). This matches our earlier condition \(\eta < 2/L\) where \(L=\lambda\) for this 1D quadratic. If \(\eta\) is too large, GD can oscillate and diverge, while GF remains smooth.
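The threshold is easy to see by iterating the scalar recursion for step sizes on both sides of \(2/\lambda\) (the values below are illustrative):

```python
lam, theta0 = 4.0, 1.0   # stability requires eta < 2/lam = 0.5

for eta in [0.1, 0.4, 0.49, 0.51, 0.6]:
    theta = theta0
    for _ in range(50):
        theta = (1 - eta * lam) * theta
    verdict = "stable" if abs(theta) < abs(theta0) else "unstable"
    print(f"eta = {eta}: |theta_50| = {abs(theta):.3e}  ({verdict})")
```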

Brief Mention. Backward Euler (Implicit Gradient Descent)

Another discretization is Backward Euler:

\[\frac{\theta_{k+1} - \theta_k}{\eta} = -\nabla f(\theta_{k+1}) \quad \Rightarrow \quad \theta_{k+1} = \theta_k - \eta \nabla f(\theta_{k+1})\]

This defines \(\theta_{k+1}\) implicitly and requires solving an equation at each step. This is related to the proximal point algorithm:

\[\theta_{k+1} = \text{prox}_{\eta f}(\theta_k) = \arg\min_u \left( f(u) + \frac{1}{2\eta}\Vert u - \theta_k \Vert^2 \right)\]

For differentiable \(f\), the first-order optimality condition for the prox step is \(\nabla f(\theta_{k+1}) + \frac{1}{\eta}(\theta_{k+1} - \theta_k) = 0\), which is the Backward Euler update. Implicit methods are often unconditionally stable (no upper bound on \(\eta\) for stability) but computationally more expensive per iteration.
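For a quadratic \(f(\theta) = \frac{1}{2}\theta^\top A \theta\), the implicit equation can be solved in closed form, \(\theta_{k+1} = (I + \eta A)^{-1}\theta_k\), which makes the unconditional stability easy to demonstrate; the matrix and (deliberately huge) step size below are illustrative.

```python
import numpy as np

A = np.diag([1.0, 100.0])
theta = np.array([1.0, 1.0])
eta = 10.0   # far beyond the forward-Euler limit 2/L = 0.02

for _ in range(20):
    # Backward Euler step: (I + eta*A) theta_{k+1} = theta_k
    theta = np.linalg.solve(np.eye(2) + eta * A, theta)
print(theta)  # still contracts toward the minimizer at 0
```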

5. Comparing Discrete and Continuous Dynamics

While related, GD and GF exhibit some different characteristics:

| Feature | Gradient Descent (Discrete) | Gradient Flow (Continuous) |
|---|---|---|
| Path | Sequence of discrete points; can “zigzag” or overshoot, especially with large \(\eta\) or ill-conditioning. | Smooth, continuous path; always moves in the steepest-descent direction. |
| Step Size / Time | Controlled by learning rate \(\eta\). | Evolves naturally in continuous time. |
| Stability | Conditional on \(\eta\) (\(\eta < 2/L\) for \(L\)-smooth \(f\)). | Inherently stable for suitable \(f\). |
| Convergence (SC) | Linear: error \(\sim (1-\eta\mu)^k\). For small \(\eta\), \((1-\eta\mu)^{t/\eta} \approx e^{-\mu t}\). | Exponential: error \(\sim e^{-2\mu t}\). |
| Implementation | Directly implementable. | Conceptual; ODE solvers needed for simulation. |

Note the factor of 2 difference in the exponent for strongly convex rates (\(e^{-\mu t}\) vs \(e^{-2\mu t}\)). This is a common point of comparison: for optimal discrete step sizes, the continuous version often appears “faster” in its idealized rate. However, the discrete rate is what’s achievable in practice.

6. Visualizations (Conceptual)

(This section would typically include plots. For now, we describe them.)

  1. Contour Plot with Vector Field: Imagine a 2D function \(f(x,y)\) (e.g., an elliptical bowl \(f(x,y) = ax^2 + by^2\)).
    • Contours show lines of constant \(f\).
    • The vector field \(-\nabla f(x,y)\) would be plotted with arrows at various points, indicating the direction of steepest descent. These arrows would be perpendicular to the contours.
  2. Trajectories:
    • Gradient Flow Paths: Smooth curves (streamlines) that follow the vector field arrows, starting from various initial points and converging to the minimum.
    • Gradient Descent Steps: A sequence of connected line segments starting from the same initial points.
      • For small \(\eta\), the GD path would closely follow the GF path.
      • For larger (but stable) \(\eta\), the GD path might “zigzag” more, especially if the contours are highly elliptical (ill-conditioned problem).
      • For unstable \(\eta\) (\(>2/L\)), the GD path would diverge.

These visualizations powerfully illustrate how GD attempts to follow the underlying continuous flow and how the learning rate affects this approximation.
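For readers who want to produce such a figure, here is a minimal matplotlib sketch; the bowl coefficients \(a, b\), the step sizes, and the starting point are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

a, b = 1.0, 10.0                                   # elliptical bowl f(x, y) = a x^2 + b y^2
grad = lambda th: np.array([2 * a * th[0], 2 * b * th[1]])

xs, ys = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
plt.contour(xs, ys, a * xs ** 2 + b * ys ** 2, levels=15, cmap="viridis")

def gd_path(theta0, eta, steps=60):
    path = [np.array(theta0, dtype=float)]
    for _ in range(steps):
        path.append(path[-1] - eta * grad(path[-1]))
    return np.array(path)

# L = 2b = 20 here, so the stability limit is eta < 0.1.
for eta, style in [(0.01, "b.-"), (0.09, "r.-")]:  # gentle vs near-unstable step
    p = gd_path([-1.8, 1.5], eta)
    plt.plot(p[:, 0], p[:, 1], style, markersize=3, label=f"GD, eta = {eta}")

plt.legend()
plt.gca().set_aspect("equal")
plt.title("GD paths on an ill-conditioned bowl")
plt.show()
```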

7. Outlook: Beyond Vanilla Methods (A Brief Teaser)

Gradient Descent and Gradient Flow are foundational, but many advanced optimization techniques can be understood as modifications or extensions:

  • Accelerated Methods: Algorithms like Polyak’s Heavy Ball or Nesterov’s Accelerated Gradient can be related to discretizations of second-order ODEs, such as a damped oscillator equation:

    \[\ddot{\theta}(t) + \gamma \dot{\theta}(t) + \nabla f(\theta(t)) = 0\]

    These often achieve faster convergence rates.

  • Natural Gradient Flow: Instead of the standard Euclidean gradient, one can use a Riemannian gradient defined by a metric (e.g., the Fisher Information Matrix \(G(\theta)\)):

    \[\frac{d\theta(t)}{dt} = -G(\theta(t))^{-1} \nabla f(\theta(t))\]

    This adapts the descent direction to the geometry of the parameter space.

  • Stochastic Gradient Flow: When gradients are noisy (as in Stochastic Gradient Descent), the dynamics can be modeled by Stochastic Differential Equations (SDEs), like Langevin dynamics:

    \[d\theta(t) = -\nabla f(\theta(t)) dt + \sqrt{2\beta^{-1}} dW(t)\]

    where \(dW(t)\) is a Wiener process (Brownian motion).
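As a small illustration of the last point, Langevin dynamics can be simulated with the standard Euler–Maruyama scheme; the quadratic potential, inverse temperature \(\beta\), and step size below are assumptions of this sketch, not canonical values.

```python
import numpy as np

grad = lambda th: th             # potential f(theta) = ||theta||^2 / 2
beta, dt, steps = 5.0, 1e-2, 20000
rng = np.random.default_rng(0)

theta = np.array([3.0, 3.0])
samples = []
for _ in range(steps):
    noise = rng.normal(size=theta.shape)
    # Euler-Maruyama: theta += -grad f(theta) dt + sqrt(2 dt / beta) * N(0, I)
    theta = theta - dt * grad(theta) + np.sqrt(2 * dt / beta) * noise
    samples.append(theta.copy())

# For this f the stationary density ~ exp(-beta f) is Gaussian with variance 1/beta.
print(np.var(np.array(samples)[steps // 2:], axis=0), "~", 1 / beta)
```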

These connections will be explored in future posts.

8. Summary and Key Takeaways

  • Gradient Descent (GD) is a practical, iterative algorithm for minimizing functions by taking steps proportional to the negative gradient. Its convergence relies on properties like \(L\)-smoothness and strong convexity, and careful choice of the learning rate \(\eta\).
  • Gradient Flow (GF) is the continuous analogue of GD, described by an ODE. It provides a conceptual framework where the function value acts as a Lyapunov function, ensuring energy dissipation and convergence to critical points.
  • GD can be viewed as a Forward Euler discretization of GF. The learning rate \(\eta\) in GD corresponds to the time step in the discretization.
  • Stability is inherent in GF (for well-behaved functions) but conditional on \(\eta\) for GD (\(\eta < 2/L\)).
  • Convergence rates for GF often appear “cleaner” (e.g., \(e^{-2\mu t}\) for SC functions), while GD rates depend on \(\eta\) and the condition number \(\kappa = L/\mu\) (e.g., \((1 - 1/\kappa)^k \approx e^{-k/\kappa}\) with \(\eta = 1/L\)), highlighting the impact of discretization and conditioning.
  • Understanding GF can provide insights into the behavior of GD and inspire the design of more sophisticated optimization algorithms.

9. Cheat Sheet: Gradient Descent vs. Gradient Flow

| Aspect | Gradient Descent (Discrete, \(\theta_k\)) | Gradient Flow (Continuous, \(\theta(t)\)) |
|---|---|---|
| Update Rule / ODE | \(\theta_{k+1} = \theta_k - \eta \nabla f(\theta_k)\) | \(\frac{d\theta(t)}{dt} = -\nabla f(\theta(t))\) |
| Function Decrease | \(f(\theta_{k+1}) \le f(\theta_k) - \eta(1-\frac{L\eta}{2})\Vert\nabla f_k\Vert^2\) (for \(0 < \eta < 2/L\)) | \(\frac{d}{dt}f(\theta(t)) = -\Vert \nabla f(\theta(t)) \Vert^2\) |
| Stability Condition | \(0 < \eta < 2/L\) (for \(L\)-smooth \(f\)) | Generally stable if \(\nabla f\) is Lipschitz |
| Rate (\(L\)-smooth, convex) | \(f(\theta_k)-f^\ast = O(1/k)\) (e.g., \(\le \frac{L\Vert\theta_0-\theta^\ast\Vert^2}{2k}\) with \(\eta=1/L\)) | \(f(\theta(t))-f^\ast = O(1/t)\) (e.g., \(\le \frac{\Vert\theta_0-\theta^\ast\Vert^2}{2t}\)) |
| Rate (\(\mu\)-SC, \(L\)-smooth) | Linear: e.g., \(f(\theta_k)-f^\ast \le (1-\mu/L)^k (f_0-f^\ast)\) with \(\eta=1/L\) | Exponential: \(f(\theta(t))-f^\ast \le (f_0-f^\ast)e^{-2\mu t}\) |
| Quadratic \(f(\theta)=\frac{1}{2}\theta^\top A \theta\) | \(\theta_{k+1} = (I-\eta A)\theta_k\) | \(\theta(t) = e^{-At}\theta_0\) |

(Note: Specific constants in rates can vary slightly based on exact assumptions and choice of \(\eta\)).

10. Further Reading (Optional Placeholder)

  • Numerical Optimization by Nocedal & Wright.
  • Convex Optimization by Boyd & Vandenberghe.
  • Su, W., Boyd, S., & Candès, E. (2016). A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153), 1-43. (For connections to accelerated methods).

Appendix

Fun fact: the Fokker–Planck equation, which describes the evolution of a system’s probability density, can be derived as a gradient flow in the 2-Wasserstein space. This is a fascinating area of research in optimal transport, i.e. finding the cheapest way to move mass from one place and arrangement to another, also called the “earth-mover’s problem”; mathematically, one minimizes the cost of transporting one probability distribution onto another. See Jordan et al. (1996) for an in-depth treatment.

This post is licensed under CC BY 4.0 by the author.