Information Geometry Part 1: Statistical Manifolds and the Fisher Metric
An introduction to Information Geometry, exploring how statistical models form manifolds and how the Fisher Information Metric provides a natural way to measure distances and curvature.
Welcome to this crash course on Information Geometry (IG)! This field beautifully merges concepts from differential geometry with statistics and information theory, providing a powerful geometric framework for understanding statistical models, inference, and even machine learning optimization algorithms. This first part lays the groundwork by introducing statistical manifolds and the cornerstone of IG: the Fisher Information Metric.
This course assumes familiarity with concepts from our Differential Geometry and Statistics and Information Theory courses.
Introduction: Why Geometry for Statistics?
When we work with families of probability distributions, like all possible Gaussian distributions, we are working with a space of distributions. Each point in this space is a specific distribution, identified by its parameters (e.g., mean \(\mu\) and variance \(\sigma^2\) for a Gaussian).
A natural question arises: what is the “shape” of this space?
- Is it flat like Euclidean space, where the shortest path between two points is a straight line and notions of distance are straightforward?
- Or does it have a more complex, curved structure?
Standard Euclidean geometry often falls short when applied to the parameter spaces of statistical models. For instance, a step of size \(\epsilon\) in one parameter might change the distribution much more significantly than a step of the same size \(\epsilon\) in another parameter, or even in the same parameter but at a different location in the parameter space. This suggests that the naive Euclidean distance between parameter vectors is not a “natural” or “intrinsic” measure of how different the corresponding distributions are.
Information Geometry addresses this by endowing the space of probability distributions with a Riemannian manifold structure. This means we can:
- Define a “local distance” between infinitesimally close distributions that reflects their statistical distinguishability.
- Talk about lengths of paths, angles between directions of change, curvature, and geodesics (the “straightest possible” paths) on this manifold.
This geometric perspective allows us to:
- Gain deeper insights into the behavior of statistical estimators and hypothesis tests.
- Understand the properties of information measures like KL divergence.
- Develop more efficient optimization algorithms (like the natural gradient) that respect the intrinsic geometry of the problem space.
In this part, we will formalize the idea of a “statistical manifold” and introduce the Fisher Information Metric, which provides the machinery for all these geometric notions.
Statistical Manifolds
The core idea is to treat a family of probability distributions as a differentiable manifold, where the parameters of the distributions serve as local coordinates.
Definition. Statistical Model and Statistical Manifold
A statistical model \(S\) is a set of probability distributions parameterized by a vector \(\theta = (\theta^1, \dots, \theta^d) \in \Theta\), where \(\Theta\) is an open subset of \(\mathbb{R}^d\). We denote a distribution in the model as \(p(x; \theta)\), where \(x\) represents the data.
\[S = \{ p(x; \theta) \mid \theta \in \Theta \subseteq \mathbb{R}^d \}\]If the mapping \(\theta \mapsto p(x; \theta)\) satisfies certain regularity conditions (e.g., smoothness, identifiability), the statistical model \(S\) can be regarded as a \(d\)-dimensional differentiable manifold. This is called a statistical manifold. Each point on this manifold corresponds to a unique probability distribution \(p(x; \theta)\), and \(\theta\) are its coordinates.
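To make this concrete in code, here is a minimal sketch (an illustrative construction, not from the text above) of a statistical model as a parameterized family of log-densities; the class name `StatisticalModel` and the Gaussian example are assumptions made purely for illustration.

```python
import numpy as np

class StatisticalModel:
    """A family of densities p(x; theta) indexed by a parameter vector theta.

    Hypothetical minimal interface: each theta in the parameter space Theta
    picks out one distribution through its log-density.
    """

    def __init__(self, log_pdf, param_dim):
        self.log_pdf = log_pdf      # callable: (x, theta) -> log p(x; theta)
        self.param_dim = param_dim  # d, the dimension of the manifold

# A point on the 2-dimensional Gaussian manifold, with coordinates theta = (mu, sigma).
def gaussian_log_pdf(x, theta):
    mu, sigma = theta
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

gaussian_family = StatisticalModel(gaussian_log_pdf, param_dim=2)
print(gaussian_family.log_pdf(0.5, np.array([0.0, 1.0])))  # log N(0.5; 0, 1)
```

Moving along the manifold simply means varying `theta`; each value of `theta` is a different probability distribution.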
Let’s consider some common examples:
Examples of Statistical Manifolds
Bernoulli Distributions: The family of Bernoulli distributions is given by \(p(x; \pi) = \pi^x (1-\pi)^{1-x}\) for \(x \in \{0, 1\}\) and parameter \(\pi \in (0, 1)\). Here, \(\theta = (\pi)\), and \(\Theta = (0,1)\). This forms a 1-dimensional statistical manifold.
Gaussian Distributions (Univariate): The family of univariate Gaussian distributions is \(p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\), with parameters \(\theta = (\mu, \sigma)\) where \(\mu \in \mathbb{R}\) and \(\sigma > 0\). This forms a 2-dimensional statistical manifold, often called the \((\mu, \sigma)\) plane or upper-half plane.
Exponential Family: Many common distributions (Gaussian, Bernoulli, Poisson, Exponential, etc.) belong to the exponential family:
\[p(x; \theta) = h(x) \exp(\theta \cdot T(x) - A(\theta))\]where \(\theta\) are the natural parameters. Under regularity conditions, the space of natural parameters \(\theta\) forms a statistical manifold.
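As a quick worked example of this form (not spelled out above), the Bernoulli distribution can be rewritten as
\[p(x; \pi) = \exp\left( x \log\frac{\pi}{1-\pi} + \log(1-\pi) \right),\]so that \(T(x) = x\), \(h(x) = 1\), the natural parameter is \(\theta = \log\frac{\pi}{1-\pi}\), and the log-partition function is \(A(\theta) = \log(1 + e^\theta)\).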
Tangent Space of a Statistical Manifold
As discussed in the Differential Geometry crash course, at each point \(p_\theta\) on a manifold, there’s a tangent space \(T_{p_\theta}S\). This vector space consists of all possible “velocities” or directions of infinitesimal movement away from \(p_\theta\) on the manifold.
In a statistical manifold, a tangent vector represents an infinitesimal change in the probability distribution. The parameters \(\theta = (\theta^1, \dots, \theta^d)\) act as local coordinates. The partial derivatives with respect to these coordinates, \(\partial_i = \frac{\partial}{\partial \theta^i}\), form a basis for the tangent space at each point.
An important quantity related to tangent vectors is the score function:
\[\partial_i \log p(x; \theta) = \frac{\partial \log p(x; \theta)}{\partial \theta^i}\]The score vector \((\partial_1 \log p, \dots, \partial_d \log p)\) can be seen as representing the “sensitivity” of the log-likelihood to changes in the parameters. Its components are crucial for defining the Fisher Information Metric. The expectation of the score function is zero:
\[\mathbb{E}_{p(x;\theta)} \left[ \frac{\partial \log p(x; \theta)}{\partial \theta^i} \right] = \int \frac{\partial \log p(x; \theta)}{\partial \theta^i}\, p(x;\theta)\, dx = \int \frac{\partial p(x; \theta)}{\partial \theta^i} dx = \frac{\partial}{\partial \theta^i} \int p(x; \theta) dx = \frac{\partial}{\partial \theta^i} (1) = 0\](using \(p\, \partial_i \log p = \partial_i p\) and assuming we can swap integration and differentiation).
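This zero-mean property is easy to check numerically. The sketch below (an illustrative assumption: a univariate Gaussian and a plain Monte Carlo average) estimates the expectation of both score components from samples; each should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.5, 0.8
x = rng.normal(mu, np.sqrt(sigma2), size=200_000)

# Score components of the univariate Gaussian, derived by hand:
#   d/dmu      log p = (x - mu) / sigma^2
#   d/dsigma^2 log p = -1/(2 sigma^2) + (x - mu)^2 / (2 sigma^4)
score_mu = (x - mu) / sigma2
score_s2 = -0.5 / sigma2 + (x - mu) ** 2 / (2 * sigma2**2)

# Monte Carlo estimates of the expected score; both should be ~0.
print(score_mu.mean(), score_s2.mean())
```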
The Fisher Information Metric
To do geometry on our statistical manifold, we need a way to measure distances, angles, and curvature. This is provided by a Riemannian metric. For statistical manifolds, the Fisher Information Metric (FIM) emerges as the most “natural” choice, deeply connected to statistical distinguishability.
Definition. Fisher Information Metric (FIM)
The Fisher Information Metric \(g(\theta)\) is a Riemannian metric whose components \(g_{ij}(\theta)\) (forming the Fisher Information Matrix \(I(\theta)\)) are defined as the expected outer product of the score functions:
\[g_{ij}(\theta) = I_{ij}(\theta) = \mathbb{E}_{p(x;\theta)} \left[ \left( \frac{\partial \log p(x; \theta)}{\partial \theta^i} \right) \left( \frac{\partial \log p(x; \theta)}{\partial \theta^j} \right) \right]\]Under certain regularity conditions (allowing exchange of differentiation and integration), an alternative but equivalent definition is:
\[g_{ij}(\theta) = - \mathbb{E}_{p(x;\theta)} \left[ \frac{\partial^2 \log p(x; \theta)}{\partial \theta^i \partial \theta^j} \right]\]This matrix \(I(\theta)\) is symmetric and positive semi-definite. If it is strictly positive definite, it defines a Riemannian metric on the statistical manifold \(S\).
Aside. Equivalence of the two FIM definitions
To show the equivalence, we start with \(\mathbb{E}[\partial_i \log p] = 0\). Differentiating with respect to \(\theta^j\):
\[\frac{\partial}{\partial \theta^j} \mathbb{E}[\partial_i \log p] = \frac{\partial}{\partial \theta^j} \int (\partial_i \log p(x;\theta)) p(x;\theta) dx = 0\]Assuming we can swap differentiation and integration:
\[\int \left( \frac{\partial (\partial_i \log p)}{\partial \theta^j} p + (\partial_i \log p) \frac{\partial p}{\partial \theta^j} \right) dx = 0\] \[\int \left( (\partial_j \partial_i \log p) p + (\partial_i \log p) (\partial_j \log p) p \right) dx = 0\](since \(\frac{\partial p}{\partial \theta^j} = (\partial_j \log p) p\)). This gives:
\[\mathbb{E}[\partial_j \partial_i \log p] + \mathbb{E}[(\partial_i \log p)(\partial_j \log p)] = 0\]So,
\[\mathbb{E}[(\partial_i \log p)(\partial_j \log p)] = - \mathbb{E}[\partial_j \partial_i \log p]\]which demonstrates the equivalence.
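The equivalence can also be verified numerically. The following sketch (an illustrative choice, not part of the derivation) uses the exponential distribution \(p(x; \lambda) = \lambda e^{-\lambda x}\), whose score is \(1/\lambda - x\) and whose second log-density derivative is the constant \(-1/\lambda^2\); both expectations should approximate the known Fisher information \(1/\lambda^2\).

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0
x = rng.exponential(scale=1.0 / lam, size=500_000)

# Definition 1: expected squared score,    E[(d log p / d lambda)^2]
score = 1.0 / lam - x
fim_from_score = np.mean(score**2)

# Definition 2: negative expected Hessian, -E[d^2 log p / d lambda^2]
# (the second derivative is the constant -1/lambda^2 for every sample)
fim_from_hessian = -np.mean(np.full_like(x, -1.0 / lam**2))

print(fim_from_score, fim_from_hessian, 1.0 / lam**2)  # all approximately 0.25
```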
Properties of the Fisher Information Metric
The FIM is not just an arbitrary choice; it possesses several desirable properties:
- Symmetry and Positive Semi-Definiteness: By its definition as the expectation of an outer product of score vectors (equivalently, as the expected Hessian of the negative log-likelihood), \(I(\theta)\) is symmetric (\(g_{ij} = g_{ji}\)) and positive semi-definite. It is positive definite when the parameters are identifiable, i.e., when no direction of parameter change leaves the distribution unchanged.
- Cramér-Rao Bound: The FIM appears in the Cramér-Rao lower bound, which states that the variance of any unbiased estimator \(\hat{\theta}\) of \(\theta\) is bounded below by the inverse of the Fisher information: \(\text{Var}(\hat{\theta}_i) \ge (I(\theta)^{-1})_{ii}\). This links the metric to the fundamental limits of statistical estimation (a numerical illustration follows this list).
- Invariance Properties:
- Invariance under change of data variables: If we transform the data \(x \to y(x)\) through a sufficient statistic, the FIM remains unchanged; more generally, processing the data can never increase it.
- Covariant Transformation under Reparameterization: If we change coordinates from \(\theta\) to a new set of parameters \(\phi(\theta)\), the FIM transforms like a Riemannian metric tensor. This is a crucial property for it to be an intrinsic geometric quantity. (This is covered in the DG course).
- Chentsov’s Theorem (Informal): This profound theorem states that for discrete sample spaces, the Fisher Information Metric is (up to a constant factor) the unique Riemannian metric on the space of probability distributions that is invariant under congruent embeddings by statistically relevant mappings (Markov morphisms, coarse-graining). This uniqueness underscores its fundamental nature in statistics.
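To illustrate the Cramér-Rao bound from the list above, the sketch below (an illustrative assumption) repeats a Gaussian mean-estimation experiment many times and compares the empirical variance of the sample mean with the bound \((n I(\mu))^{-1} = \sigma^2 / n\), which the sample mean attains.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 0.0, 2.0, 50, 20_000

# Each row is one experiment of n i.i.d. draws; the sample mean is an
# unbiased estimator of mu.
samples = rng.normal(mu, sigma, size=(trials, n))
estimates = samples.mean(axis=1)

# Fisher information per observation is 1/sigma^2, so n observations carry
# n/sigma^2 and the Cramer-Rao bound is sigma^2 / n.
crb = sigma**2 / n
print(estimates.var(), crb)  # the two values should nearly coincide
```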
Examples of Fisher Information Matrices
Let’s compute the FIM for some of our example statistical manifolds.
Calculating FIM for Common Distributions
Bernoulli Distribution: \(p(x; \pi) = \pi^x (1-\pi)^{1-x}\). \(\log p(x; \pi) = x \log \pi + (1-x) \log(1-\pi)\). \(\frac{\partial \log p}{\partial \pi} = \frac{x}{\pi} - \frac{1-x}{1-\pi} = \frac{x - \pi}{\pi(1-\pi)}\). \(I(\pi) = \mathbb{E}\left[ \left(\frac{x - \pi}{\pi(1-\pi)}\right)^2 \right] = \frac{1}{(\pi(1-\pi))^2} \mathbb{E}[(x-\pi)^2]\). Since \(\mathbb{E}[x] = \pi\), \(\mathbb{E}[(x-\pi)^2] = \text{Var}(x) = \pi(1-\pi)\). So,
\[I(\pi) = \frac{\pi(1-\pi)}{(\pi(1-\pi))^2} = \frac{1}{\pi(1-\pi)}\]Univariate Gaussian (parameter \(\mu\), known \(\sigma^2\)): \(p(x; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\). \(\log p(x; \mu) = C - \frac{(x-\mu)^2}{2\sigma^2}\). \(\frac{\partial \log p}{\partial \mu} = \frac{x-\mu}{\sigma^2}\). \(I(\mu) = \mathbb{E}\left[ \left(\frac{x-\mu}{\sigma^2}\right)^2 \right] = \frac{1}{\sigma^4} \mathbb{E}[(x-\mu)^2]\). Since \(\mathbb{E}[(x-\mu)^2] = \sigma^2\),
\[I(\mu) = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}\]Univariate Gaussian (parameter \(\sigma^2\), known \(\mu\)): Let's use \(\tau = \sigma^2\) as the parameter for simplicity. \(\log p(x; \tau) = C - \frac{1}{2}\log \tau - \frac{(x-\mu)^2}{2\tau}\). \(\frac{\partial \log p}{\partial \tau} = -\frac{1}{2\tau} + \frac{(x-\mu)^2}{2\tau^2} = \frac{(x-\mu)^2 - \tau}{2\tau^2}\). Using \(I(\tau) = - \mathbb{E}\left[ \frac{\partial^2 \log p}{\partial \tau^2} \right]\), we differentiate once more: \(\frac{\partial^2 \log p}{\partial \tau^2} = \frac{1}{2\tau^2} - \frac{(x-\mu)^2}{\tau^3}\). Hence \(I(\tau) = - \mathbb{E}\left[ \frac{1}{2\tau^2} - \frac{(x-\mu)^2}{\tau^3} \right] = - \left( \frac{1}{2\tau^2} - \frac{\mathbb{E}[(x-\mu)^2]}{\tau^3} \right)\). Since \(\mathbb{E}[(x-\mu)^2] = \tau\) (as \(\tau = \sigma^2\)),
\[I(\tau) = - \left( \frac{1}{2\tau^2} - \frac{\tau}{\tau^3} \right) = - \left( \frac{1}{2\tau^2} - \frac{1}{\tau^2} \right) = - \left( -\frac{1}{2\tau^2} \right) = \frac{1}{2\tau^2}\]So for parameter \(\sigma^2\), the Fisher information is \(I(\sigma^2) = \frac{1}{2(\sigma^2)^2} = \frac{1}{2\sigma^4}\).
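These closed-form results can be cross-checked by Monte Carlo, replacing each expectation with a sample average of the squared score. The sketch below (implementation details are illustrative assumptions) does this for the Bernoulli and both univariate Gaussian cases.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500_000

# Bernoulli(pi): expected squared score vs 1 / (pi (1 - pi))
pi = 0.3
x = rng.binomial(1, pi, size=N)
score_pi = (x - pi) / (pi * (1 - pi))
print(np.mean(score_pi**2), 1 / (pi * (1 - pi)))        # ~4.76

# Gaussian, parameter mu (sigma^2 known): vs 1 / sigma^2
mu, sigma2 = 1.0, 2.0
x = rng.normal(mu, np.sqrt(sigma2), size=N)
print(np.mean(((x - mu) / sigma2) ** 2), 1 / sigma2)    # ~0.5

# Gaussian, parameter tau = sigma^2 (mu known): vs 1 / (2 tau^2)
score_tau = ((x - mu) ** 2 - sigma2) / (2 * sigma2**2)
print(np.mean(score_tau**2), 1 / (2 * sigma2**2))       # ~0.125
```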
Multivariate Gaussian (parameter vector \(\mu\), known covariance \(\Sigma\)): \(p(x; \mu) = \frac{1}{(2\pi)^{k/2} \vert \Sigma \vert^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)\). \(\nabla_\mu \log p(x; \mu) = \Sigma^{-1}(x-\mu)\). The Fisher Information Matrix \(I(\mu)\) has components \(I_{ab}(\mu) = \mathbb{E}[ (\nabla_\mu \log p)_a (\nabla_\mu \log p)_b ]\).
\[I(\mu) = \mathbb{E}[ (\Sigma^{-1}(x-\mu)) (\Sigma^{-1}(x-\mu))^T ] = \Sigma^{-1} \mathbb{E}[(x-\mu)(x-\mu)^T] \Sigma^{-1}\]Since \(\mathbb{E}[(x-\mu)(x-\mu)^T] = \Sigma\),
\[I(\mu) = \Sigma^{-1} \Sigma \Sigma^{-1} = \Sigma^{-1}\]
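A quick numerical check works here as well. The sketch below (with an arbitrary illustrative choice of \(\Sigma\)) estimates \(\mathbb{E}\left[ (\Sigma^{-1}(x-\mu))(\Sigma^{-1}(x-\mu))^T \right]\) from samples and compares it with \(\Sigma^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

x = rng.multivariate_normal(mu, Sigma, size=300_000)
scores = (x - mu) @ Sigma_inv            # rows are (Sigma^{-1} (x - mu))^T
fim_mc = scores.T @ scores / len(x)      # Monte Carlo estimate of E[score score^T]

print(np.round(fim_mc, 3))
print(np.round(Sigma_inv, 3))            # the two matrices should nearly coincide
```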
Geometric Interpretation of the Fisher Metric
The Fisher Information Metric \(g_{ij}(\theta)\) provides the tools to measure geometric quantities on the statistical manifold:
Infinitesimal Squared Distance: The “length” \(ds\) of an infinitesimal displacement \(d\theta = (d\theta^1, \dots, d\theta^d)\) in parameter space is given by:
\[ds^2 = \sum_{i,j=1}^d g_{ij}(\theta) d\theta^i d\theta^j = d\theta^T I(\theta) d\theta\]This \(ds^2\) represents the squared “statistical distance” between the distribution \(p(x;\theta)\) and \(p(x; \theta+d\theta)\).
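As a concrete example tying back to the earlier motivation: for the Gaussian family with known variance, the Fisher information computed above is \(I(\mu) = 1/\sigma^2\), so \(ds^2 = d\mu^2 / \sigma^2\); the same parameter step \(d\mu\) therefore corresponds to a much larger statistical distance when \(\sigma\) is small (where the two distributions are easy to distinguish) than when \(\sigma\) is large.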
Kullback-Leibler (KL) Divergence and the FIM: Recall the KL divergence from the Statistics & Information Theory course, which measures the “difference” between two distributions \(P\) and \(Q\):
\[D_{KL}(P \Vert Q) = \int p(x) \log \frac{p(x)}{q(x)} dx\]Consider two infinitesimally close distributions \(p_\theta \equiv p(x;\theta)\) and \(p_{\theta+d\theta} \equiv p(x;\theta+d\theta)\) on our statistical manifold. The KL divergence between them can be approximated by a quadratic form involving the FIM:
\[D_{KL}(p_\theta \Vert p_{\theta+d\theta}) \approx \frac{1}{2} \sum_{i,j} g_{ij}(\theta) d\theta^i d\theta^j = \frac{1}{2} ds^2\]And similarly, \(D_{KL}(p_{\theta+d\theta} \Vert p_\theta) \approx \frac{1}{2} ds^2\).
Derivation Sketch. KL Divergence and FIM
Let \(f(\theta') = D_{KL}(p_\theta \Vert p_{\theta'})\). We know \(f(\theta) = 0\), and the first derivatives vanish there, \(\frac{\partial f}{\partial \theta'^k}\Big\vert _{\theta'=\theta} = 0\), since \(f\) attains its minimum of zero at \(\theta' = \theta\). The Hessian matrix elements are:
\[\frac{\partial^2 f(\theta')}{\partial \theta'^k \partial \theta'^l}\Big\vert _{\theta'=\theta} = \mathbb{E}_{p_\theta} \left[ \left( \frac{\partial \log p(x; \theta)}{\partial \theta^k} \right) \left( \frac{\partial \log p(x; \theta)}{\partial \theta^l} \right) \right] = g_{kl}(\theta)\]A Taylor expansion of \(f(\theta+d\theta)\) around \(\theta\) gives:
\[D_{KL}(p_\theta \Vert p_{\theta+d\theta}) = f(\theta+d\theta) \approx f(\theta) + \sum_k \frac{\partial f}{\partial \theta^k}\Big\vert _{\theta} d\theta^k + \frac{1}{2} \sum_{k,l} \frac{\partial^2 f}{\partial \theta^k \partial \theta^l}\Big\vert _{\theta} d\theta^k d\theta^l\] \[D_{KL}(p_\theta \Vert p_{\theta+d\theta}) \approx 0 + 0 + \frac{1}{2} \sum_{k,l} g_{kl}(\theta) d\theta^k d\theta^l\]This profound connection shows that the Fisher Information Metric naturally arises from the local behavior of KL divergence, which is a fundamental measure of information difference. The FIM essentially defines a local quadratic approximation to (twice) the KL divergence. This gives a strong statistical motivation for using the FIM as the metric tensor.
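This local approximation is easy to check numerically. The sketch below (an illustrative assumption) compares the exact KL divergence between two nearby univariate Gaussians, via the standard closed-form expression, with \(\frac{1}{2} d\theta^T I(\theta) d\theta\) in the \((\mu, \sigma^2)\) coordinates whose Fisher information was computed earlier (the off-diagonal entry of that FIM vanishes, being an expectation of odd central moments of the Gaussian).

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """Exact KL( N(mu1, var1) || N(mu2, var2) ) for univariate Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Base point theta = (mu, tau) with tau = sigma^2, and a small displacement.
mu, tau = 0.5, 1.2
d_mu, d_tau = 0.01, 0.02

# Fisher information matrix in (mu, tau) coordinates: diag(1/tau, 1/(2 tau^2)).
half_ds2 = 0.5 * (d_mu**2 / tau + d_tau**2 / (2 * tau**2))

kl = kl_gauss(mu, tau, mu + d_mu, tau + d_tau)
print(kl, half_ds2)  # the two values should agree to leading order
```

Shrinking `d_mu` and `d_tau` further makes the agreement tighter, as expected from the second-order Taylor expansion.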
Volumes and Angles: With the FIM, we can define volume elements on the manifold (\(\sqrt{\det(I(\theta))} d\theta^1 \dots d\theta^d\)) and angles between tangent vectors (directions of change in distributions).
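For instance, on the univariate Gaussian manifold in \((\mu, \tau)\) coordinates with \(\tau = \sigma^2\), the metric computed above is \(\mathrm{diag}(1/\tau,\, 1/(2\tau^2))\), so the volume element is \(\sqrt{\det I(\theta)}\, d\mu\, d\tau = \frac{1}{\sqrt{2}\,\tau^{3/2}}\, d\mu\, d\tau\).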
Summary of Part 1 & What’s Next
In this first part of our Information Geometry crash course, we’ve established that:
- Families of probability distributions (statistical models) can be viewed as statistical manifolds, where parameters act as coordinates.
- The Fisher Information Metric (FIM) provides a natural Riemannian metric on these manifolds.
- The FIM is intrinsically linked to the statistical distinguishability of nearby distributions, as captured by its relationship to the KL divergence and the Cramér-Rao bound.
This geometric framework allows us to analyze statistical problems using the tools of differential geometry.
In Part 2, we will explore:
- Affine connections on statistical manifolds, particularly Amari’s \(\alpha\)-connections.
- The concept of duality in Information Geometry.
- Dually flat spaces and their connection to Bregman divergences (like KL divergence).
- And most importantly for our optimization series, the Natural Gradient, an optimization algorithm that leverages the Fisher metric to achieve faster and more stable convergence.
Stay tuned as we delve deeper into the fascinating geometric structures underlying statistics and machine learning!