
Rethinking Probability - Mass, Averages, and Granularity

Developing an intuition for probability using analogies from physics (mass distributions, centers of mass, resolution/granularity), exploring both the standard measure-theoretic and the expectation-first foundations.


Introduction

Probability is a cornerstone of mathematics, statistics, and countless scientific disciplines. It provides a principled way to reason about and quantify uncertainty. In a world brimming with randomness and incomplete information, probability empowers us to make informed predictions, manage risk, and make better decisions. Its applications are ubiquitous:

  1. Medicine: Assessing drug efficacy beyond chance, quantifying side-effect risks.
  2. Insurance: Calculating premiums based on the likelihood of events.
  3. Engineering: Evaluating structural reliability under stress.
  4. Weather Forecasting: Providing probabilistic forecasts.
  5. Artificial Intelligence: Enabling systems to handle uncertainty in predictions and classifications.

Despite its importance, the formal machinery of probability theory, often rooted in measure theory, can feel abstract and unintuitive. Why do we need such formalism? While simple intuitive notions of chance work well for finite scenarios like coin flips or dice, they quickly run into trouble when dealing with continuous possibilities (like measuring a height) or infinite sequences of events (like flipping a coin forever). Naive approaches can lead to inconsistencies or paradoxes. A rigorous framework, primarily developed by Andrey Kolmogorov in the 1930s, became necessary to ensure consistency and handle these complexities robustly.

As always, I find it helpful to ground abstract concepts in more tangible analogies. This post aims to build an intuition for probability by exploring its foundations through the lens of physical mass distributions. We’ll see how core concepts map onto physical ideas, clarifying their meaning and relationships, starting with familiar physical mass before moving to the normalized concept of probability. We will also explore an alternative perspective, rooted in work by P.J. Daniell and championed by Peter Whittle, which starts with the concept of averaging or expectation as fundamental. Understanding both, through the unifying lens of our physical analogy, can significantly deepen our grasp of this vital mathematical language.

The Core Analogy: Objects and Regions

Let’s establish our core analogy. Imagine the set of all possible outcomes of a random phenomenon as a physical object or system. This is the sample space (\(\Omega\)).

  • Formal Definition (Sample Space): The sample space, \(\Omega\), is the set of all possible elementary outcomes \(\omega\) of the random phenomenon.
  • Analogy: Think of \(\Omega\) as a physical entity – a metal bar, a container of particles, the surface of a table.

(Running Example: Fair Die Roll) To make things concrete, we’ll use a simple running example: rolling a single, fair six-sided die. The sample space, the set of all elementary outcomes, is our “object”:

\[\Omega = \{1, 2, 3, 4, 5, 6\}\]

This is like having six distinct locations or points within our system.

Often, we’re interested in whether the outcome falls into a certain category or satisfies a condition. These categories correspond to regions within our object.

  • Formal Definition (Event): An event is simply a subset \(E\) of the sample space (\(E \subseteq \Omega\)). It represents a collection of possible outcomes.
  • Analogy: An event \(E\) corresponds to selecting a specific region or part within our object \(\Omega\) that we want to examine.
    • Die Roll Example: The event “rolling an even number” corresponds to the region \(E = \{2, 4, 6\}\) within \(\Omega\). The event “rolling a 1 or 2” is the region \(A = \{1, 2\}\).

Measuring Regions: Physical Mass (Unnormalized Measure)

How can we quantify these regions? Let’s start with a very familiar physical concept: mass. Imagine our object \(\Omega\) has mass distributed throughout it. We can define a function, let’s call it \(\mu\) (mu for mass), that tells us the physical mass (e.g., in kilograms or pounds) contained within any given region \(E\) of our object.

  • Concrete Example 1: Metal Bar: Consider a 1-meter metal bar (\(\Omega = [0, 1]\)). It has a certain mass density \(\rho(x)\) (in kg/m) at each point \(x\). The total mass is \(M = \int_0^1 \rho(x) dx\). We can measure the mass of any segment \([a, b]\) by calculating \(\mu([a, b]) = \int_a^b \rho(x) dx\). If the density \(\rho\) is constant, e.g. because the bar is made of a single uniform material, this reduces to \(\mu([a, b])=\rho (b-a)\). If the density \(\rho(x)\) is not constant, different segments of the same length can have different masses.
  • Concrete Example 2: Discrete Objects: Imagine \(\Omega\) is a collection of six distinct small objects (like our die outcomes). Each object \(\{i\}\) has a specific mass, say \(m_i\). The mass of a region (a sub-collection of objects) \(E\) is simply the sum of the masses of the objects within it: \(\mu(E) = \sum_{i \in E} m_i\).
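
To make these two examples computable, here is a minimal Python sketch (illustrative only): a simple midpoint rule stands in for the integral \(\int_a^b \rho(x)\,dx\), and the discrete case is just a sum over some hypothetical point masses \(m_i\).

```python
from fractions import Fraction


def segment_mass(density, a, b, steps=100_000):
    """mu([a, b]): integrate the linear density over [a, b] with a midpoint rule."""
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width


# Uniform bar: rho(x) = 2.0 kg/m on [0, 1], so mu([a, b]) = 2.0 * (b - a).
print(segment_mass(lambda x: 2.0, 0.0, 0.5))         # ~1.0 kg

# Non-uniform bar: rho(x) = 3x^2, so mu([0, 0.5]) = 0.125 but mu([0.5, 1]) = 0.875.
print(segment_mass(lambda x: 3 * x ** 2, 0.0, 0.5))  # ~0.125 kg

# Discrete objects: hypothetical point masses m_i; the mass of a region is a plain sum.
m = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
print(sum(m[i] for i in {1, 3}))                     # 3/4
```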

This physical mass measure \(\mu\) has some intuitive properties:

  1. Non-negativity: The mass of any region cannot be negative: \(\mu(E) \ge 0\).
  2. Additivity: If we take two disjoint regions \(E_1\) and \(E_2\) (they don’t overlap), the mass of the combined region \(E_1 \cup E_2\) is just the sum of their individual masses: \(\mu(E_1 \cup E_2) = \mu(E_1) + \mu(E_2)\). This extends to any finite number of disjoint regions.

This concept of assigning a non-negative, additive quantity (like mass or volume) to regions of a space is the heart of measure theory.

The Need for Precision: Measurable Regions (\(\mathcal{F}\))

Now, a subtle but crucial point arises: can we consistently assign a mass (or length, area, volume) to absolutely any subset we can mathematically define, especially in complex, infinite spaces like the real numbers? It turns out the answer is no. Attempting to do so leads to mathematical contradictions and paradoxes (like the Vitali set construction or the Banach-Tarski paradox, both of which rely on non-measurable sets arising from the axiom of choice). It’s like trying to define the volume of an infinitely intricate fractal dust cloud – our standard geometric tools break down.

  • Motivation: We need a way to specify which regions are “well-behaved” enough that we can consistently assign them a measure (mass). We need to restrict our attention to a collection of subsets for which our measurement rules work without contradiction.
  • Motivation for Countable Operations: Furthermore, many concepts in probability and analysis involve limits or infinite sequences (e.g., flipping a coin infinitely many times). Our collection of measurable regions needs to be closed under countable unions and intersections to handle these limiting processes rigorously.

  • Analogy & Interpretation: We need to define the resolution or granularity of our measurement system. This is done by specifying a collection \(\mathcal{F}\) of subsets of \(\Omega\), called a sigma-algebra. Only the regions \(E\) that belong to \(\mathcal{F}\) are considered measurable – meaning, only these are the regions our function \(\mu\) can reliably assign a mass to. If a subset isn’t in \(\mathcal{F}\), it’s below the resolution of our system; we cannot meaningfully ask for its mass within this framework. It’s like defining the pixels on a screen; we can measure regions composed of whole pixels, but not sub-pixel areas. A sigma-algebra ensures our collection of measurable regions is mathematically consistent (closed under necessary operations like complements and countable unions).

  • Formal Definition:

Definition - Sigma-Algebra (\(\mathcal{F}\))

A collection \(\mathcal{F}\) of subsets of \(\Omega\) is a sigma-algebra if:

  1. \(\Omega \in \mathcal{F}\) (The entire object is measurable).
  2. If \(E \in \mathcal{F}\), then \(E^c = \Omega \setminus E \in \mathcal{F}\) (Closure under complement: if you can measure a region, you can measure what’s outside it).
  3. If \(E_1, E_2, \dots \in \mathcal{F}\) (a countable sequence), then \(\bigcup_{i=1}^\infty E_i \in \mathcal{F}\) (Closure under countable unions: ensures consistency with limits).

(Implies closure under countable intersections via De Morgan’s laws).

  • Running Example (Die Roll): For our finite die-roll space \(\Omega=\{1, \dots, 6\}\), no such paradoxes arise: we can measure any subset. The standard choice for \(\mathcal{F}\) is the power set \(\mathcal{P}(\Omega)\), containing all \(2^6 = 64\) subsets; there are two independent choices for each element (in the set or not), which is why the power set is also denoted \(2^\Omega\). Our granularity is maximal; every possible region is measurable.
  • The pair \((\Omega, \mathcal{F})\) is called a measurable space: it’s our object (\(\Omega\)) equipped with a defined set of measurable regions (\(\mathcal{F}\)), ready for a measure to be applied.
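
Since the die’s power set is small enough to enumerate, we can sanity-check the sigma-algebra properties directly. A minimal Python sketch (on a finite space, countable unions reduce to finite ones):

```python
from itertools import chain, combinations

omega = frozenset({1, 2, 3, 4, 5, 6})

# The power set of omega: all 2^6 = 64 subsets.
F = {frozenset(s) for s in chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))}
print(len(F))  # 64

assert omega in F                                  # the whole object is measurable
assert all(omega - E in F for E in F)              # closed under complement
assert all(E1 | E2 in F for E1 in F for E2 in F)   # closed under (finite) unions
```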

Formalizing Mass: The Measure \(\mu\)

Now we can formally define our mass-measuring function, incorporating the requirement that it only applies to the measurable regions in \(\mathcal{F}\).

Definition - Measure (\(\mu\))

Given a measurable space \((\Omega, \mathcal{F})\), a measure \(\mu\) is a function \(\mu: \mathcal{F} \to [0, \infty]\) (assigning a non-negative value, possibly infinity, to each measurable set) such that:

  1. Non-negativity: \(\mu(E) \ge 0\) for all \(E \in \mathcal{F}\). (Matches physical mass).
  2. Null Empty Set: \(\mu(\emptyset) = 0\). (The region with nothing in it has zero mass).
  3. Countable Additivity: For any sequence of pairwise disjoint sets \(E_1, E_2, \dots\) in \(\mathcal{F}\),

    \[\mu\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} \mu(E_i)\]

    (This formalizes the additivity of mass, extended to countably many pieces to handle limits).

This general definition of a measure \(\mu\) directly mirrors the properties we expect from physical mass (or length, area, volume). It gives us a rigorous way to quantify the “amount of stuff” in different measurable parts of our space \(\Omega\).

From Physical Mass to Probability: Normalization and the Need for Axioms

So far, our measure \(\mu\) gives us an absolute amount, like kilograms or meters. This is often called an unnormalized measure. While perfectly valid for describing a specific physical system, it has limitations when we want to talk about likelihood or chance in a general, comparable way.

  • Motivation for Normalization and Standardization:
    • Relative Comparison: Suppose we have two different metal bars, Bar A (total mass 2 kg) and Bar B (total mass 5 kg). If we pick a random point on each (weighted by mass), how do we compare the chance of landing in the first half? Comparing the absolute mass \(\mu_A([0, 0.5])\) vs \(\mu_B([0, 0.5])\) isn’t a fair comparison of likelihood because the totals differ. We need a relative scale.
    • Avoiding Arbitrary Units: Furthermore, the absolute mass depends on our choice of units (kg, lbs, etc.). Is there a way to quantify likelihood that is canonical and unitless, avoiding arbitrary conventions like the difference between SI and Imperial units? We seek a universal standard for expressing chance.
    • The Solution: Probability. Probability provides this standard by focusing on proportions or fractions of the total mass, resulting in a dimensionless quantity between 0 and 1.
  • The Simple Case: Normalization via Finite Mass: If the total mass (or measure) of our object \(\mu(\Omega)\) is finite and non-zero (i.e., \(0 < \mu(\Omega) < \infty\)), the most straightforward way to achieve this standardized, relative scale is normalization:

    \[P(E) = \frac{\mu(E)}{\mu(\Omega)} \quad \text{for all } E \in \mathcal{F}\]

    This \(P(E)\) is the unitless fraction of the total mass in region \(E\). It satisfies \(P(E) \ge 0\) and \(P(\Omega) = 1\), and inherits additivity from \(\mu\). This works perfectly for many situations.

  • The Problem: Infinite Measure: But what if the natural underlying measure \(\mu\) is infinite?
    • Example 1: Length on the Real Line. Consider \(\Omega = \mathbb{R}\). The natural measure is length (Lebesgue measure), \(\mu\). But the total length \(\mu(\mathbb{R}) = \infty\). The normalization formula \(P(E) = \mu(E)/\mu(\Omega)\) breaks down. Yet we routinely define probability distributions (like the Gaussian) on \(\mathbb{R}\).
    • Example 2: Counting Measure on Integers. Let \(\Omega = \mathbb{N} = \{1, 2, 3, ...\}\). The counting measure gives \(\mu(\mathbb{N}) = \infty\). Simple normalization fails. How do we define distributions over infinitely many integers?
  • The Problem: Infinite Sequences and Limits: Probability often deals with infinite sequences of events (e.g., infinite coin flips) requiring robust handling of limits, which simple finite additivity doesn’t guarantee.

  • Solution: The Axiomatic Definition of Probability (\(P\)): This is where the axiomatic definition of a probability measure \(P\), pioneered by Kolmogorov, becomes essential. Instead of deriving \(P\) from some \(\mu\), we define \(P\) directly by the properties it must have to consistently represent standardized likelihood:

Definition - Probability Measure (\(P\)) (Kolmogorov Axioms)

A probability measure \(P\) on a measurable space \((\Omega, \mathcal{F})\) is a function \(P: \mathcal{F} \to [0, 1]\) that satisfies:

  1. Non-negativity: \(P(E) \ge 0\) for all \(E \in \mathcal{F}\).
  2. Normalization: \(P(\Omega) = 1\).
  3. Countable Additivity: For any sequence of pairwise disjoint sets \(E_1, E_2, \dots\) in \(\mathcal{F}\),

    \[P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)\]
  • Density vs. Point Mass (PDF/PMF): The way probability \(P\) is assigned relates directly to the type of space:
    • Continuous Case: For spaces like the real line, probability is typically described by a Probability Density Function (PDF), \(p(x)\).
      • Analogy: The PDF \(p(x)\) is analogous to the mass density (\(\rho(x)\), e.g., kg/m) at a point \(x\) on our object (like the metal bar). Density itself is not mass/probability; it indicates the concentration of mass around a point. To find the actual probability (normalized mass) in an interval \([a, b]\), you must integrate the density: \(P(a \le X \le b) = \int_a^b p(x) \, dx\). This reinforces why single points have zero probability: integrating density over a zero-width interval yields zero mass.
    • Discrete Case: For finite or countable spaces like our die roll, probability is assigned by a Probability Mass Function (PMF), \(P(X=x_i)\).
      • Analogy: The PMF \(P(X=x_i)\) is analogous to having discrete point masses \(m_i\) located at specific positions \(x_i\). The PMF value is the actual probability (normalized mass) concentrated exactly at that single point \(x_i\). The total probability is found by summing these point masses: \(P(X \in E) = \sum_{x_i \in E} P(X=x_i)\).
  • Analogy: The probability measure \(P\) represents the normalized mass distribution on our object. It tells us the fraction of the total mass located within any measurable region \(E\). The normalization axiom \(P(\Omega)=1\) simply states that the total normalized mass of the entire object is 1 (or 100%).

  • Why These Axioms Solve the Problems:
    • Normalization (\(P(\Omega)=1\)): This axiom imposes the finite, unitless total measure of 1 directly onto the probability space, creating the canonical scale and bypassing issues with infinite underlying measures or arbitrary physical units.
    • Countable Additivity: This ensures mathematical consistency when dealing with infinite limits and sequences.
  • Benefits of the Axiomatic Approach:
    • Provides a universal, self-contained, canonical foundation for probability that works consistently across diverse scenarios (finite/infinite spaces, limits) and is independent of arbitrary physical units.
    • Gives a clear, unambiguous set of rules for manipulating probabilities.
  • Running Example (Die Roll Revisited): Our definition \(P(\{i\}) = 1/6\) satisfies these axioms directly. It’s the normalized version of a counting measure but stands alone as a valid, standardized probability measure. This is a PMF where \(P(X=i) = 1/6\).

  • The triple \((\Omega, \mathcal{F}, P)\) is a probability space: our object, with its defined resolution, and a specific function \(P\) satisfying the Kolmogorov axioms, representing the standardized, normalized mass distribution (likelihood).
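
Pulling the last few ideas together for the running example, the sketch below (Python, purely illustrative) starts from an unnormalized counting measure \(\mu\), normalizes it via \(P(E) = \mu(E)/\mu(\Omega)\), and spot-checks the Kolmogorov axioms on the resulting \(P\).

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def mu(E):
    """Unnormalized measure: counting measure (each outcome carries mass 1)."""
    return Fraction(len(E))

def P(E):
    """Normalized probability: the fraction of the total mass sitting in region E."""
    return mu(E) / mu(omega)

assert P(omega) == 1                          # normalization
assert all(P({i}) >= 0 for i in omega)        # non-negativity
A, B = {1, 2}, {4, 6}                         # two disjoint regions
assert P(A | B) == P(A) + P(B)                # additivity (countable reduces to finite here)
print(P({2, 4, 6}))                           # 1/2
```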

Distribution Matters: Why “Mass” > “Size”

It’s crucial to understand that even though probability \(P\) is normalized to 1, the underlying concept of distribution (how the original mass \(\mu\) was spread out) remains vital. Simply knowing the geometric “size” (length, area) of a region isn’t enough; we need to know how much mass (or probability) is concentrated there.

  • Revisiting the Rod: Consider our 1-meter rod (\(\Omega=[0,1]\)).
    • If it has uniform mass density \(\rho(x)=M\) (constant), then \(\mu([a,b]) = M(b-a)\). The total mass is \(\mu(\Omega)=M\). Normalizing gives \(P([a,b]) = \frac{M(b-a)}{M} = b-a\). In this case, probability is proportional to length (“size”), described by a uniform PDF \(p(x)=1\) for \(x \in [0,1]\).
    • If it has non-uniform density, say \(\rho(x) = 2Mx\) (total mass \(\int_0^1 2Mx dx = M\)), then \(\mu([0, 0.5]) = \int_0^{0.5} 2Mx dx = M[x^2]_0^{0.5} = 0.25M\), and \(\mu([0.5, 1]) = M - 0.25M = 0.75M\). Normalizing gives \(P([0, 0.5]) = 0.25M / M = 0.25\) and \(P([0.5, 1]) = 0.75M / M = 0.75\). This corresponds to a PDF \(p(x)=2x\) for \(x \in [0,1]\). Even though the intervals have the same length, their probabilities (relative masses) are vastly different due to the non-uniform underlying mass distribution.

The “mass” analogy inherently handles this non-uniform weighting. Likelihood isn’t always spread evenly. Failing to specify the distribution (via PDF or PMF) leads to ambiguity.
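
As a quick numerical check of the rod calculation above, we can treat the PDF \(p(x) = 2x\) as the normalized density and integrate it with a midpoint rule (a sketch, not a library routine):

```python
def prob(pdf, a, b, steps=100_000):
    """P(a <= X <= b): integrate the PDF over [a, b] with a midpoint rule."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

def pdf(x):
    return 2 * x               # normalized density of the non-uniform rod

print(prob(pdf, 0.0, 0.5))     # ~0.25: same length as [0.5, 1], far less mass
print(prob(pdf, 0.5, 1.0))     # ~0.75
print(prob(pdf, 0.0, 1.0))     # ~1.0: the total normalized mass
```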

  • Comparison: Buffon’s Needle Problem: A needle of length \(L\) dropped onto a floor ruled with parallel lines a distance \(D \ge L\) apart crosses a line with probability \(P(\text{cross}) = 2L/(\pi D)\). This works using geometric ratios because it assumes a specific uniform probability distribution (uniform mass density) over the space of possible needle positions and orientations. Here, “size” (area in configuration space) aligns with probability due to the uniformity assumption.

  • Contrast: Bertrand’s Paradox: This paradox highlights the necessity of specifying the distribution. What is the probability that a randomly chosen chord of a circle is longer than the side of the inscribed equilateral triangle? Different interpretations of “randomly chosen” imply different underlying (unnormalized) mass distributions on the space of chords, leading to different normalized probabilities:

    1. Random Endpoints Method: (Uniform mass distribution on pairs of circumference points). Leads to \(P = \mathbf{1/3}\). Calculation: Fix one point; the other must land in the opposite 1/3 of the circumference. Probability = (Favorable arc length) / (Total arc length) = (1/3) / 1.
    2. Random Radius Method: (Uniform mass distribution on distance \(d \in [0, R]\) of midpoint from center). Leads to \(P = \mathbf{1/2}\). Calculation: Chord is longer if midpoint distance \(d < R/2\). Probability = (Favorable interval length) / (Total interval length) = (R/2) / R.
    3. Random Midpoint Method: (Uniform mass distribution over the area of the circle for the midpoint). Leads to \(P = \mathbf{1/4}\). Calculation: Chord is longer if midpoint is in inner circle radius \(R/2\). Probability = (Favorable area) / (Total area) = \((\pi(R/2)^2) / (\pi R^2)\).
    • The Lesson: “Randomly” is ambiguous. Each method specifies a different way mass is distributed over the possible chords before normalization. The choice of the underlying measure \(\mu\) (how mass is spread) dictates the final probabilities \(P\). Probability requires specifying the distribution, not just the space.

In summary, thinking in terms of mass distribution (\(\mu\), then normalized to \(P\) via PMF/PDF) is more fundamental and general than thinking about geometric “size”. It correctly emphasizes that the way likelihood is spread out across the possibilities is the defining characteristic of a probability space.
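
The lesson becomes very tangible in simulation. Below is a small Monte Carlo sketch (my own straightforward implementation of the three sampling recipes, for illustration only); the empirical frequencies should approach 1/3, 1/2, and 1/4 respectively.

```python
import math
import random

random.seed(0)
R = 1.0
THRESHOLD = math.sqrt(3) * R   # side length of the inscribed equilateral triangle
N = 200_000

def chord_from_endpoints():
    """Method 1: two independent uniform points on the circumference."""
    t1, t2 = random.uniform(0, 2 * math.pi), random.uniform(0, 2 * math.pi)
    return 2 * R * abs(math.sin((t1 - t2) / 2))

def chord_from_radius():
    """Method 2: midpoint at a uniform distance d in [0, R] along a random radius."""
    d = random.uniform(0, R)
    return 2 * math.sqrt(R ** 2 - d ** 2)

def chord_from_midpoint():
    """Method 3: midpoint uniform over the disc (rejection sampling)."""
    while True:
        x, y = random.uniform(-R, R), random.uniform(-R, R)
        if x * x + y * y <= R * R:
            return 2 * math.sqrt(R ** 2 - (x * x + y * y))

for method in (chord_from_endpoints, chord_from_radius, chord_from_midpoint):
    hits = sum(method() > THRESHOLD for _ in range(N))
    print(method.__name__, hits / N)   # ~0.333, ~0.5, ~0.25
```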

A Counter-Intuitive Consequence: Points Have Zero Probability in Continuous Spaces

The rigorous definition of measure and probability, especially for continuous spaces like the real line (\(\mathbb{R}\)), yields a result that often feels quite paradoxical at first: the probability of any single, specific outcome is zero.

  • The “Crazy” Result: Consider a random variable \(X\) representing a truly continuous measurement, like the exact height of a person or the exact landing position \(x\) of a point chosen randomly on a 1-meter line segment \([0, 1]\). If the distribution is continuous (described by a PDF \(p(x)\)), what is the probability that \(X\) takes exactly one specific value, say \(X = 175.12345...\) cm or \(x = 0.5\)? The perhaps startling answer from measure theory is:

    \[P(X = x_0) = 0 \quad \text{for any single point } x_0\]
  • Why Does This Seem Wrong? Our intuition often screams: “But \(x_0\) is a possible outcome! How can its probability be zero? If all individual points have zero probability, how can intervals have positive probability? Shouldn’t the probabilities add up?”

  • The Summation Problem & Resolution: This intuition clashes with the mathematics of continuous, infinite sets. There are uncountably infinitely many points in any interval on the real line. If each individual point had some tiny positive probability \(\epsilon > 0\), then the total probability in any interval, no matter how small, would be infinite (\(\infty \times \epsilon = \infty\)), blowing past the total probability limit of \(P(\Omega)=1\). The mathematical framework of measure theory resolves this by fundamentally shifting focus: for continuous distributions, probability (mass) is not assigned to individual points, but rather to intervals or other measurable sets that have a non-zero “extent” (like length).

  • Mass Analogy Revisited: Think of the metal bar again, with a smooth mass density \(\rho(x)\). The density \(\rho(x_0)\) at a specific point \(x_0\) can be positive, indicating mass concentration around that point. However, mass itself is obtained by integrating density over a region. The mass in an interval \([a, b]\) is \(\int_a^b \rho(x) dx\). An infinitely thin slice at exactly \(x_0\) corresponds to an interval \([x_0, x_0]\). The integral over this zero-width interval is always zero: \(\int_{x_0}^{x_0} \rho(x) dx = 0\). So, while density can be positive, the mass contained at a single point is zero. Probability, as normalized mass represented by a PDF, works the same way.

  • The Shift in Focus: Intervals, not Points: Therefore, in continuous probability, we don’t typically ask “What is \(P(X = x_0)\)?”. Instead, we ask about the probability of \(X\) falling within a range: “What is \(P(a \le X \le b)\)?”. This probability is calculated by integrating the probability density function (PDF) \(p(x)\) over that interval:

    \[P(a \le X \le b) = \int_a^b p(x) \, dx\]

    This integral can be positive if the interval has non-zero length (\(b > a\)) and the density is positive within it. The probabilities of these intervals consistently add up (via the integral’s properties) to give \(P(\Omega) = \int_{-\infty}^{\infty} p(x) dx = 1\).

  • Contrast with Discrete Case: This is crucially different from discrete probability (like the die roll). There, the sample space \(\Omega\) consists of a finite or countable number of distinct points. We can and do assign a non-zero probability mass \(P(X=\omega_i)\) directly to each outcome \(\omega_i\) using a Probability Mass Function (PMF). These individual point probabilities sum to 1: \(\sum_i P(X=\omega_i) = 1\).

  • Conclusion: So, while initially seeming paradoxical, the fact that individual points have zero probability in continuous distributions is a necessary consequence of how measure and integration work over infinite sets. It forces us to correctly focus on the probability of events occurring within intervals or sets, which aligns perfectly with how we handle continuously distributed physical quantities like mass. (If we specifically need to model probability concentrated at a point within a generally continuous setting, we use tools like Dirac delta functions, creating a mixed distribution.)
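
A tiny numerical illustration, reusing the rod’s PDF \(p(x) = 2x\) from the previous section: the probability of an ever-narrower interval around \(x_0 = 0.5\) shrinks toward zero, while a fixed-width interval keeps positive mass.

```python
def prob(pdf, a, b, steps=10_000):
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

def pdf(x):
    return 2 * x                                # the rod's PDF from the previous section

x0 = 0.5
for eps in (0.1, 0.01, 0.001, 0.0001):
    print(eps, prob(pdf, x0 - eps, x0 + eps))   # roughly 2 * p(x0) * eps, tending to 0
print(prob(pdf, 0.4, 0.6))                      # a fixed interval keeps positive mass (~0.2)
```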

Adding Properties: Random Variables (\(X\))

Now that we have our object (\(\Omega\)), its measurable structure (\(\mathcal{F}\)), and its normalized mass distribution (\(P\)), we often want to measure properties associated with the outcomes.

  • Motivation: We roll the die (\(\omega\) occurs) – what number \(X(\omega)\) appears? We measure a person’s height (\(\omega\) occurs) – what is their height \(X(\omega)\) in cm?
  • Analogy: A random variable \(X\) corresponds to assigning a measurable physical property (like temperature \(T(\omega)\), position \(x(\omega)\), or the numerical value itself) to each point \(\omega\) in our object \(\Omega\).
  • Formal Definition: A random variable is a function \(X: \Omega \to \mathbb{R}\).
  • Running Example (Die Roll): Let \(X\) be the face value shown: \(X(\omega) = \omega\) for \(\omega \in \{1, ..., 6\}\).

  • The Crucial Measurability Requirement: Can any function \(X: \Omega \to \mathbb{R}\) be a random variable? No. It must be compatible with our defined resolution \(\mathcal{F}\). We need to be able to determine the probability (mass) of events defined by \(X\), like “\(X \le x\)”.
  • Motivation: To calculate \(P(X \le x)\), which is \(P(\{\omega \in \Omega \mid X(\omega) \le x\})\), the set \(\{\omega \in \Omega \mid X(\omega) \le x\}\) must be one of the regions we know how to measure – it must be in \(\mathcal{F}\). This must hold for all possible thresholds \(x\).
  • Analogy & Interpretation: The property \(X\) must respect the granularity (\(\mathcal{F}\)) of our object. It cannot require distinctions finer than \(\mathcal{F}\) allows. If it did, questions about its probability distribution would be ill-defined.

A function \(X: \Omega \to \mathbb{R}\) is measurable w.r.t. \(\mathcal{F}\) (and thus a valid random variable for the space \((\Omega, \mathcal{F}, P)\)) if for every real number \(x\), the set \(\{\omega \in \Omega \mid X(\omega) \le x\} \in \mathcal{F}\).

  • Running Example (Die Roll): Our \(X(\omega) = \omega\) is measurable w.r.t. the power set \(\mathcal{F}\). For any \(x\), the set \(\{\omega \mid \omega \le x\}\) is a subset of \(\{1, ..., 6\}\), and all subsets are in \(\mathcal{F}\).

  • Cumulative Distribution Function (CDF): For any random variable \(X\), we can define its Cumulative Distribution Function (CDF) as \(F_X(x) = P(X \le x)\). This function gives the probability that the random variable takes on a value less than or equal to \(x\).

    • Analogy: The CDF \(F_X(x)\) corresponds to the total accumulated normalized mass associated with the property \(X\) from the minimum possible value up to the value \(x\). For the 1D rod analogy where \(X\) is position, \(F_X(x)\) is the fraction of the total mass contained in the segment \((-\infty, x]\) of the rod. It’s like sweeping from the left end and measuring the fraction of mass encountered up to point \(x\). In the discrete case (like the die roll), the CDF \(F_X(x)\) is a step function, where the value jumps up by the probability mass \(P(X=x_i)\) at each possible outcome \(x_i\). It still represents the total accumulated probability up to value \(x\). The CDF always increases (or stays level) from 0 to 1 as \(x\) goes from \(-\infty\) to \(+\infty\).
    • The CDF provides a unified way to describe distributions. In the continuous case, the PDF is the derivative of the CDF (\(p(x) = F_X'(x)\) where the derivative exists), representing the rate of accumulation of probability mass. In the discrete case, the PMF gives the magnitude of the jumps in the step-function CDF at each point mass (\(P(X=x_i) = F_X(x_i) - \lim_{y \to x_i^-} F_X(y)\)).
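
For the die, the CDF is a few lines of code and shows exactly this step-function behavior. A minimal sketch:

```python
from fractions import Fraction

pmf = {i: Fraction(1, 6) for i in range(1, 7)}   # fair-die PMF

def cdf(x):
    """F_X(x) = P(X <= x): the probability mass accumulated up to x."""
    return sum(p for value, p in pmf.items() if value <= x)

for x in (0.5, 1, 2.5, 3, 6, 7.2):
    print(x, cdf(x))   # 0, 1/6, 1/3, 1/2, 1, 1: a step function climbing from 0 to 1
```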

Averaging Properties: Expectation (\(E[X]\)) (via Measure)

Given our object with its mass distribution \(P\) and a measurable property \(X\), a fundamental operation is calculating the average value of \(X\) over the object, weighted by the mass.

  • Motivation: What is the average value we expect when rolling the die? If \(X\) is temperature and \(P\) represents the normalized mass distribution on an object, what’s the average temperature? We can’t just average the temperature values; points with more mass must contribute more to the average.
  • Analogy: The expectation \(E[X]\) is the weighted average of the property \(X\) across the object, weighted by the normalized mass distribution \(P\). If \(X\) represents the position coordinate, \(E[X]\) is precisely the object’s center of mass (calculated using the normalized mass distribution).
  • Formal Definition: The expected value (or expectation) of \(X\) is formally defined as the Lebesgue integral of \(X\) with respect to the probability measure \(P\):

    \[E[X] = \int_{\Omega} X(\omega) \, dP(\omega)\]

    This abstract definition elegantly captures the weighted average. If \(P\) came from normalizing an unnormalized mass \(\mu\) (where \(dP = d\mu / \mu(\Omega)\)), then \(E[X] = \frac{1}{\mu(\Omega)} \int_{\Omega} X(\omega) \, d\mu(\omega)\), which is exactly the formula for the center of mass (average property value weighted by original mass, divided by total mass). It simplifies to familiar forms used in practice:

    • Discrete Case (using PMF \(P(X=x_i)\)): \(E[X] = \sum_{x_i} x_i P(X=x_i)\) (Sum of value times probability/mass fraction).
    • Continuous Case (using PDF \(p(x)\)): \(E[X] = \int_{-\infty}^{\infty} x p(x) \, dx\) (Integral of value times probability density/mass density).
  • Note on Calculation (LOTUS): While expectation is formally defined via an integral over the sample space \(\Omega\), in practice we often compute it directly using the distribution of the random variable \(X\) itself (its PDF \(p(x)\) or PMF \(P(X=x_i)\)) via the formulas above. This shortcut is justified by a result sometimes called the Law of the Unconscious Statistician (LOTUS), which states that these two methods of calculation yield the same result. More generally, for a function \(g(X)\), \(E[g(X)] = \int_\Omega g(X(\omega)) dP(\omega) = \int_{-\infty}^\infty g(x) p(x) dx\) (continuous) or \(\sum_i g(x_i) P(X=x_i)\) (discrete).

  • Running Example (Die Roll): \(X(\omega)=\omega\), \(P(X=i)=1/6\).

    \[E[X] = \sum_{i=1}^{6} i \cdot P(X=i) = \sum_{i=1}^{6} i \cdot \frac{1}{6} = \frac{1+2+3+4+5+6}{6} = \frac{21}{6} = 3.5\]

    Analogy: This is the center of mass if we place equal 1/6 unit masses at positions 1, 2, …, 6.

  • Other Interpretations: Long-run average (Law of Large Numbers), fair price.

  • Variance: Measuring Spread Around the Average. While expectation \(E[X]\) gives us the average value or center of mass of the distribution of \(X\), it doesn’t tell us how spread out the values are around this average. Two distributions can have the same mean but vastly different shapes. The variance quantifies this spread.

    • Definition: The variance of a random variable \(X\) is defined as the expected value of the squared deviation from the mean:

      \[Var(X) = E\left[ (X - E[X])^2 \right]\]

      The square ensures that deviations in both directions (above and below the mean) contribute positively to the measure of spread, and it heavily penalizes values far from the mean. The standard deviation, \(\sigma_X = \sqrt{Var(X)}\), brings this measure back to the original units of \(X\).

    • Analogy: Moment of Inertia: This definition has a striking parallel in physics with the moment of inertia (\(I\)).

      1. Recall Center of Mass: We established that \(E[X]\) is analogous to the center of mass (\(x_c\)) of our object, considering the distribution of the property \(X\) weighted by the normalized mass \(P\).
      2. Moment of Inertia Definition: In physics, the moment of inertia measures an object’s resistance to rotational acceleration about a given axis. For a collection of point masses \(m_i\) at positions \(x_i\), the moment of inertia about an axis at \(x_c\) is given by:

        \[I = \sum_i m_i (x_i - x_c)^2\]

        For a continuous object with mass density \(\rho(x)\), it’s:

        \[I = \int (x - x_c)^2 \rho(x) \, dx\]

        Crucially, this is the resistance to rotation about the point \(x_c\).

      3. The Connection: Now, compare the variance formula to the moment of inertia formula, specifically when calculated about the center of mass (\(x_c = E[X]\)):
        • Discrete Case: \(Var(X) = \sum_{x_i} (x_i - E[X])^2 P(X=x_i)\). This is identical in form to \(I = \sum_i (x_i - x_c)^2 m_i\), if we identify:
          • The value \(x_i\) with the position \(x_i\).
          • The probability \(P(X=x_i)\) with the (normalized) mass \(m_i\).
          • The expected value \(E[X]\) with the center of mass \(x_c\).
        • Continuous Case: \(Var(X) = \int (x - E[X])^2 p(x) \, dx\). This is identical in form to \(I = \int (x - x_c)^2 \rho(x) \, dx\), if we identify:
          • The value \(x\) with the position \(x\).
          • The probability density \(p(x)\) with the (normalized) mass density \(\rho(x)\).
          • The expected value \(E[X]\) with the center of mass \(x_c\).
      4. Intuitive Meaning: The moment of inertia is larger when more mass is distributed farther away from the axis of rotation (\(x_c\)). Similarly, the variance is larger when more probability mass is assigned to values of \(X\) farther away from the mean (\(E[X]\)). Both quantities use the squared distance term (\((X - E[X])^2\) or \((x_i - x_c)^2\)) to heavily weight these distant contributions.
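
For the die, both the center of mass and the “moment of inertia” above take only a few lines. A minimal sketch with exact fractions:

```python
from fractions import Fraction

pmf = {i: Fraction(1, 6) for i in range(1, 7)}     # point masses of 1/6 at positions 1..6

# Expectation: the center of mass of the normalized point masses.
mean = sum(x * p for x, p in pmf.items())
print(mean)                                        # 7/2, i.e. 3.5

# Variance: the moment of inertia about the center of mass (with normalized masses).
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
print(variance)                                    # 35/12, roughly 2.92
```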

Updating Beliefs: Conditional Probability (\(P(A \mid B)\))

Often, we receive partial information about the outcome of a random phenomenon, and we need to update our probabilities accordingly. For instance, if we rolled the die, and someone tells us the result was an even number, how does this change the probability that the result was a 2? This leads to the concept of conditional probability.

  • Motivation: We want to formalize how knowledge of one event (\(B\) occurring) influences the likelihood of another event (\(A\)).
  • Analogy: Zooming In on Mass: Receiving information that event \(B\) occurred is like zooming in on our object (\(\Omega\)) and focusing only on the region \(B\). We discard everything outside \(B\) and examine the mass distribution within this new, smaller world. Conditional probability \(P(A \mid B)\) represents the fraction of the remaining mass (within \(B\)) that corresponds to event \(A\). It’s the re-normalized mass distribution when our view is restricted to region \(B\).
  • Formal Definition: The conditional probability of event \(A\) occurring given that event \(B\) has occurred (where \(P(B) > 0\)) is defined as:

Definition - Conditional Probability

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}\]
  • Interpretation:
    • \(P(A \cap B)\) is the probability (original normalized mass) contained in the region where both A and B occur (the overlap).
    • \(P(B)\) is the total probability (original normalized mass) contained in the region \(B\).
    • The ratio \(P(A \mid B)\) represents the fraction of the mass within region B that also belongs to region A. It’s essentially re-normalizing the probability measure to the subspace defined by event \(B\). The universe shrinks to \(B\), and \(P(A \mid B)\) is the measure of \(A\) in this new universe.
  • Running Example (Die Roll): Let \(A = \{2\}\) (rolling a 2) and \(B = \{2, 4, 6\}\) (rolling an even number). We know \(P(\{i\}) = 1/6\) for all \(i\).
    • \(P(B) = P(\{2\}) + P(\{4\}) + P(\{6\}) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2\).
    • \(A \cap B = \{2\}\). So, \(P(A \cap B) = P(\{2\}) = 1/6\).
    • Therefore, the conditional probability is:

      \[P(A \mid B) = P(\text{Roll = 2} \mid \text{Roll is Even}) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{3/6} = \frac{1}{3}\]
    • Analogy: We focus only on the region \(B = \{2, 4, 6\}\). The total mass in this region is \(3/6\). Within this region, the mass corresponding to event \(A = \{2\}\) is \(1/6\). The fraction of the mass in \(B\) that is also in \(A\) is \((1/6) / (3/6) = 1/3\). Knowing the roll is even triples the probability that it’s a 2 (from 1/6 to 1/3).
  • Connection to Independence: Two events \(A\) and \(B\) are independent if knowing \(B\) occurred doesn’t change the probability of \(A\), i.e., \(P(A \mid B) = P(A)\). Plugging this into the definition gives \(P(A) = P(A \cap B) / P(B)\), which rearranges to the standard definition of independence: \(P(A \cap B) = P(A) P(B)\).
    • Analogy: Independence means the relative concentration of mass for region \(A\) is the same within region \(B\) as it is within the whole object \(\Omega\). Knowing we are in \(B\) provides no information about whether we are also in \(A\).
  • (Foundation for Bayes’ Theorem): The definition of conditional probability \(P(A \mid B) = P(A \cap B) / P(B)\) is also the starting point for Bayes’ Theorem, a fundamental rule for inverting conditional probabilities (relating \(P(A \mid B)\) to \(P(B \mid A)\)) which allows updating beliefs in light of new evidence. We will explore this theorem in a separate discussion.
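
Here is a small sketch for the die that computes \(P(A \mid B)\) by re-normalizing the mass inside \(B\), and spot-checks the independence criterion on a pair of events I chose for illustration:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(E):
    return Fraction(len(E), len(omega))            # fair die: normalized counting measure

def cond(A, B):
    """P(A | B) = P(A & B) / P(B): the fraction of B's mass that is also in A."""
    return P(A & B) / P(B)

A, B = {2}, {2, 4, 6}
print(cond(A, B))                                  # 1/3, versus the unconditional P(A) = 1/6

# Independence check: "even" and "at most 4" happen to be independent for a fair die.
E1, E2 = {2, 4, 6}, {1, 2, 3, 4}
print(P(E1 & E2) == P(E1) * P(E2))                 # True: 1/3 == 1/2 * 2/3
```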

Updating Averages: Conditional Expectation (\(E[X \mid \dots]\))

Just as probabilities can be updated with new information, expected values can also be updated. If we know event \(B\) occurred, what is the new expected value of our random variable \(X\)?

1. Conditional Expectation Given an Event (\(E[X \mid B]\))

This is the simpler case, directly paralleling conditional probability.

  • Motivation: Given that the die roll was even (event \(B\)), what is the average value we now expect?
  • Analogy: Center of Mass within a Sub-Region: This is like calculating the center of mass of the property \(X\), but considering only the mass distribution within the sub-region B. We ignore all mass outside \(B\) and find the average value of \(X\) weighted by the re-normalized mass inside \(B\).
  • Formal Definition: The conditional expectation of \(X\) given event \(B\) (where \(P(B) > 0\)) is the expected value of \(X\) calculated using the conditional probability measure \(P(\cdot \mid B)\). It can be calculated as:

Definition - Conditional Expectation Given an Event

\[E[X \mid B] = \int_{\Omega} X(\omega) \, dP(\omega \mid B) = \frac{1}{P(B)} \int_{B} X(\omega) \, dP(\omega) = \frac{E[X \cdot I_B]}{P(B)}\]

where \(I_B\) is the indicator function of \(B\).

  • Interpretation: We integrate (or sum, in the discrete case) the value \(X(\omega)\) against the conditional probability distribution. The formula \(E[X \cdot I_B] / P(B)\) makes the analogy clear:
    • \(X \cdot I_B\) is a new random variable that equals \(X\) inside region \(B\) and 0 outside.
    • \(E[X \cdot I_B]\) is the average value of this new variable over the whole space \(\Omega\) (equivalent to \(\int_B X dP\)). It’s like the “total moment” (value * mass) contributed by region \(B\).
    • Dividing by \(P(B)\) (the total mass in region \(B\)) gives the average value per unit of mass within region \(B\), which is the center of mass of \(X\) within that region.
  • Running Example (Die Roll): Let \(X(\omega) = \omega\) be the face value. Let \(B = \{2, 4, 6\}\) (even roll). We want \(E[X \mid B]\). We know \(P(B) = 1/2\).
    • Using the summation form for the discrete case:

      \[E[X \mid B] = \sum_{\omega \in \Omega} X(\omega) P(\{\omega\} \mid B) = \sum_{i \in B} X(i) \frac{P(\{i\})}{P(B)}\]
      \[E[X \mid B] = 2 \cdot \frac{1/6}{1/2} + 4 \cdot \frac{1/6}{1/2} + 6 \cdot \frac{1/6}{1/2}\]
      \[E[X \mid B] = 2 \cdot \frac{1}{3} + 4 \cdot \frac{1}{3} + 6 \cdot \frac{1}{3} = \frac{2 + 4 + 6}{3} = \frac{12}{3} = 4\]
    • Analogy: We focus only on the masses at positions 2, 4, 6, each originally 1/6. Re-normalizing within this region, each now has a relative mass of \((1/6)/(1/2) = 1/3\). The center of mass for equal (1/3 unit) masses at 2, 4, 6 is \((2+4+6)/3 = 4\). The original center of mass was 3.5, but knowing the roll is even shifts the expected average value up to 4.
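
The same re-normalization in code, as a quick check of the \(E[X \mid B] = 4\) result (a sketch using exact fractions):

```python
from fractions import Fraction

pmf = {i: Fraction(1, 6) for i in range(1, 7)}     # fair-die PMF
B = {2, 4, 6}                                      # the conditioning event "even roll"

P_B = sum(pmf[w] for w in B)                       # total mass in region B: 1/2
cond_mean = sum(w * pmf[w] for w in B) / P_B       # center of mass of the re-normalized region
print(cond_mean)                                   # 4, up from the unconditional 7/2
```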

2. Conditional Expectation Given Partial Information (\(E[X \mid \mathcal{G}]\))

This is a more general and powerful concept, crucial in areas like stochastic processes and statistical modeling. Instead of conditioning on a single, specific event happening (like “the roll was even”), we condition on the information available to us, which might be less precise than knowing the exact outcome. This available information is represented mathematically by a sub-sigma-algebra \(\mathcal{G}\) of \(\mathcal{F}\).

  • Motivation: Imperfect Measurement: Imagine our random experiment involves selecting a person (\(\omega\)) from a population (\(\Omega\)). We are interested in their exact height (\(X(\omega)\)). This is our random variable \(X\). Our full probability space \((\Omega, \mathcal{F}, P)\) allows us, in principle, to consider events related to exact heights. However, suppose our measuring tool isn’t perfectly precise. Instead of the exact height, we only get to know the person’s shoe size category (e.g., Small, Medium, Large). Let \(Y(\omega)\) be this measurement. This measurement gives us some information about their height (people with large shoe sizes tend to be taller), but it’s not the complete picture.

  • Defining the Information Sigma-Algebra (\(\mathcal{G}\)): The information we get from the shoe size measurement corresponds to a coarser partition of the sample space.
    • Let \(B_S = \{\omega \in \Omega \mid Y(\omega) = \text{Small}\}\).
    • Let \(B_M = \{\omega \in \Omega \mid Y(\omega) = \text{Medium}\}\).
    • Let \(B_L = \{\omega \in \Omega \mid Y(\omega) = \text{Large}\}\). These three sets form a partition of \(\Omega\) (assuming everyone falls into one category). The sigma-algebra \(\mathcal{G}\) generated by this partition represents the information we have:
    \[\mathcal{G} = \{\emptyset, B_S, B_M, B_L, B_S \cup B_M, B_S \cup B_L, B_M \cup B_L, \Omega\}\]

    Knowing the information in \(\mathcal{G}\) means, for any person \(\omega\), we only know whether they belong to \(B_S\), \(B_M\), or \(B_L\). We cannot distinguish between two people who are both in, say, \(B_M\). This \(\mathcal{G}\) is a sub-sigma-algebra of the full \(\mathcal{F}\) (which could potentially distinguish individuals or exact heights).

  • The Goal: Best Estimate of Height Given Shoe Size Category: Now, we ask: Given only the shoe size category (i.e., given the information in \(\mathcal{G}\)), what is the best estimate or prediction for the person’s actual height \(X\)? This “best estimate” is the conditional expectation \(E[X \mid \mathcal{G}]\).

  • Analogy: Averaging Height within Coarse Pixels: Think of the full space \(\Omega\) with its fine details accessible via \(\mathcal{F}\). The shoe size measurement imposes a coarser view, like looking at the population through glasses that only resolve into three blurry “pixels”: Small, Medium, Large (these are the sets \(B_S, B_M, B_L\) in \(\mathcal{G}\)). The conditional expectation \(E[X \mid \mathcal{G}]\) is a new random variable, let’s call it \(Z\).
    • Because its value must depend only on the information in \(\mathcal{G}\), \(Z\) must be constant within each coarse pixel. If a person \(\omega\) is in the “Medium” category (\(\omega \in B_M\)), \(Z(\omega)\) must have the same value as for any other person in \(B_M\).
    • What should that constant value be? It should be the average height of all people within that specific shoe size category.
      • If \(\omega \in B_S\), then \(Z(\omega) = \text{Average height of people with Small shoe size} = E[X \mid B_S]\).
      • If \(\omega \in B_M\), then \(Z(\omega) = \text{Average height of people with Medium shoe size} = E[X \mid B_M]\).
      • If \(\omega \in B_L\), then \(Z(\omega) = \text{Average height of people with Large shoe size} = E[X \mid B_L]\).
  • Formal Definition (via Characterizing Properties): Abstractly, \(Z = E[X \mid \mathcal{G}]\) is defined as the unique (up to sets of measure zero) random variable \(Z\) that satisfies:
    1. \(\mathcal{G}\)-Measurability: \(Z\) is measurable with respect to \(\mathcal{G}\).
      • Interpretation (Height Example): The value \(Z(\omega)\) only depends on which set (\(B_S, B_M\), or \(B_L\)) the person \(\omega\) belongs to. It’s constant across each shoe size category.
    2. Partial Averaging: For every set \(A \in \mathcal{G}\),

      \[\int_A Z \, dP = \int_A X \, dP\]
      • Interpretation (Height Example): If we take any region \(A\) definable by shoe size categories (e.g., \(A = B_S\) or \(A = B_M \cup B_L\)), the average value of our estimate \(Z\) over that region must equal the true average height \(X\) over that same region. For example, \(\int_{B_M} Z \, dP\) (which is just \(E[X \mid B_M] \cdot P(B_M)\)) must equal \(\int_{B_M} X \, dP\) (the sum of heights of all people in \(B_M\), weighted by probability). This ensures \(Z\) correctly reflects the average of \(X\) at the resolution level of \(\mathcal{G}\).
  • Summary of the Example: \(E[X \mid \mathcal{G}]\) is a random variable representing the best prediction of a person’s height (\(X\)) if you only know their shoe size category (the information in \(\mathcal{G}\)). Its value for any person is the average height of all people sharing the same shoe size category. It effectively smooths out the original \(X\) by averaging it over the regions defined by the available information \(\mathcal{G}\).

  • Interpretation: Best Estimate / Projection: \(Z = E[X \mid \mathcal{G}]\) is mathematically the orthogonal projection of \(X\) onto the space of \(\mathcal{G}\)-measurable functions. It’s the “closest” \(\mathcal{G}\)-measurable function to \(X\) in the least-squares sense, making it the optimal prediction based on the limited information.

  • Law of Total Expectation (Tower Property): A fundamental property linking conditional and unconditional expectation is:

    \[E[E[X \mid \mathcal{G}]] = E[X]\]
    • Analogy (Height Example): If you take the average height within each shoe size category (\(E[X \mid \mathcal{G}]\)), and then compute the overall average of these category averages (weighting each category average by the proportion of people \(P\) in that category), you recover the original overall average height \(E[X]\) across the entire population. It’s like finding the center of mass of the whole population by averaging the centers of mass of the Small, Medium, and Large groups, weighted by the size of each group.

This concrete example hopefully illustrates how conditioning on a sigma-algebra corresponds to finding the average value of a quantity (\(X\)) given only partial information, represented by the coarser “pixels” or categories defined by \(\mathcal{G}\).
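
To ground this further, here is a sketch with a small made-up population (the shoe-size categories and heights are purely hypothetical). It computes \(E[X \mid \mathcal{G}]\) as the per-category average height and then verifies the tower property.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical, equally likely outcomes: (shoe-size category, height in cm).
people = [("S", 158), ("S", 162), ("M", 170), ("M", 174), ("L", 182), ("L", 190)]
p = Fraction(1, len(people))                       # each person carries probability 1/6

# E[X | G]: within each "coarse pixel" (category), replace height by the category average.
totals, weights = defaultdict(Fraction), defaultdict(Fraction)
for category, height in people:
    totals[category] += height * p
    weights[category] += p
cond_exp = {c: totals[c] / weights[c] for c in totals}
print(cond_exp)                                    # category averages: S = 160, M = 172, L = 186

# Tower property: averaging the category averages, weighted by category mass, recovers E[X].
overall = sum(height * p for _, height in people)
tower = sum(cond_exp[c] * weights[c] for c in cond_exp)
print(overall, tower, overall == tower)            # 518/3 (about 172.7) both times, and True
```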

Extending the Analogy: Multiple Properties and Dimensions

Our mass analogy extends naturally when considering multiple random variables (properties) simultaneously.

  • Joint Distributions: Mass on Higher-Dimensional Objects:
    • Motivation: Suppose we measure two properties of our outcome, say height (\(X\)) and weight (\(Y\)) of a person (\(\omega\)). We’re interested in the probability of combinations of events, like \(P(X \le x, Y \le y)\). This is described by the joint distribution.
    • Analogy: The joint distribution of two variables (\(X, Y\)) corresponds to a normalized mass distribution on a 2D object (like a metal plate in the xy-plane, where the axes represent height and weight). The total mass of the plate is 1. The joint probability \(P(X \in A, Y \in B)\) is the amount of normalized mass contained within the 2D region defined by \(A \times B\) on the plate. For continuous variables with a joint PDF \(p(x, y)\), this is analogous to a surface mass density (\(\mathrm{kg/m^2}\)), and \(P(X \in A, Y \in B) = \iint_{A \times B} p(x, y) \, dx \, dy\).
  • Marginal Distributions: Compressing the Mass:
    • Motivation: Often, from a joint distribution of \(X\) and \(Y\), we want to recover the distribution of just \(X\) alone, irrespective of \(Y\). This is the marginal distribution of \(X\).
    • Analogy: Obtaining the marginal distribution of \(X\) from the joint distribution is like taking our 2D metal plate and compressing all its mass onto the x-axis. The resulting 1D mass distribution along the x-axis is the marginal distribution of \(X\). Mathematically, this corresponds to integrating (or summing) out the other variable: \(p_X(x) = \int_{-\infty}^{\infty} p(x, y) \, dy\) (integrating the 2D density over all possible y-values for a fixed x). The total mass remains 1.
  • Conditional Distributions: Slicing the Mass:
    • Motivation: We want to know the distribution of one variable (\(Y\)) given a specific value of the other (\(X=x\)). This is the conditional distribution \(P(Y \in B \mid X=x)\) (or its density \(p(y \mid x)\)).
    • Analogy: Finding the conditional distribution \(p(y \mid x)\) corresponds to taking an infinitesimally thin vertical slice through our 2D mass distribution at the specific location \(X=x\). We look at the 1D mass distribution profile along this slice (in the y-direction). Since the total mass along this thin slice might be very small (or zero if conditioning on a single point in a continuous space requires careful limits), we re-normalize the mass distribution along the slice so that its total mass becomes 1. This re-normalized 1D distribution along the slice represents \(p(y \mid x)\). Mathematically, \(p(y \mid x) = p(x, y) / p_X(x)\), which is analogous to \(P(A \mid B) = P(A \cap B) / P(B)\) – dividing the joint density (intersection mass) by the marginal density (mass in the conditioning “region”).

This extension shows how the physical intuition of mass distribution, projection, and slicing can clarify the relationships between joint, marginal, and conditional probabilities in higher dimensions.
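
A small sketch with a made-up joint PMF on a grid shows all three operations: total mass 1 on the 2D “plate”, compressing onto an axis for the marginal, and slicing plus re-normalizing for the conditional.

```python
from fractions import Fraction

Fr = Fraction
# Hypothetical joint PMF: keys are (x, y) pairs, values are probability masses.
joint = {
    (0, 0): Fr(1, 12), (0, 1): Fr(2, 12), (0, 2): Fr(3, 12),
    (1, 0): Fr(3, 12), (1, 1): Fr(2, 12), (1, 2): Fr(1, 12),
}
assert sum(joint.values()) == 1                    # total mass on the 2D "plate" is 1

# Marginal of X: compress all the mass onto the x-axis by summing out y.
marginal_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
print(marginal_x)                                  # 1/2 for each value of x

# Conditional of Y given X = 0: take the slice x = 0 and re-normalize it to total mass 1.
slice_mass = marginal_x[0]
cond_y_given_x0 = {y: joint[(0, y)] / slice_mass for y in (0, 1, 2)}
print(cond_y_given_x0)                             # 1/6, 1/3, 1/2: the slice now sums to 1
```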

Perspective 2: Expectation First (Averaging is Fundamental)

Instead of starting with regions and their masses, let’s explore an alternative viewpoint, championed by figures like P.J. Daniell and Peter Whittle. What if the most fundamental concept isn’t the distribution of mass itself, but rather the process of averaging properties across that distribution?

  • Motivation: The Primacy of Averages?
    • Operational Intuition: In many physical or statistical scenarios, we might directly observe or estimate average quantities (average temperature, average profit, expected outcome of a bet) even before we have a complete picture of the underlying probability distribution. The average feels like a very tangible, operational concept.
    • Physical Analogy - Center of Mass: Think about finding the center of mass of an object. You can often determine this balance point experimentally or conceptually without needing to know the precise mass density \(\rho(\omega)\) at every single point \(\omega\). The center of mass itself feels like a primary characteristic. Could we define our system starting from this averaging principle?
    • Conceptual Shift: Instead of defining mass \(P(A)\) for regions \(A\) and then deriving the average \(E[X]\) via integration, could we define the “averaging operator” \(E\) first, based on its essential properties, and then recover the concept of probability \(P(A)\) from it?
  • The Expectation Operator (\(E\)) as the Fundamental Object:
    • Analogy: Imagine we possess a “black box” or a “machine” \(E\). This machine takes any valid physical property \(X\) defined on our object \(\Omega\) (like temperature, position, etc.) and outputs a single number: the average value of that property, correctly weighted by the object’s (perhaps unknown) normalized mass distribution. Our goal is to define the essential characteristics this “averaging machine” must have.
    • Formalizing the Averaging Process: We postulate the existence of an expectation operator \(E\). This operator takes a function (random variable) \(X\) from a suitable class \(\mathcal{H}\) of “measurable” or “expectable” properties and maps it to its average real value \(E[X]\). We define \(E\) not by how it calculates the average (e.g., via a specific integral), but by the fundamental algebraic and analytic properties that any reasonable averaging process should satisfy:

Axioms of Expectation (\(E\))

Let \(\mathcal{H}\) be a suitable class of “expectable” functions \(X: \Omega \to \mathbb{R}\). The expectation operator \(E: \mathcal{H} \to \mathbb{R}\) satisfies:

  1. Linearity: \(E[aX + bY] = aE[X] + bE[Y]\) for constants \(a, b\) and \(X, Y \in \mathcal{H}\). (Averages combine linearly).
  2. Positivity (Monotonicity): If \(X(\omega) \ge 0\) for all \(\omega\) and \(X \in \mathcal{H}\), then \(E[X] \ge 0\). (If a property is always non-negative, its average must be non-negative. Essential for probability/mass).
  3. Normalization: Let \(1\) be the function \(1(\omega) = 1\) for all \(\omega\). If \(1 \in \mathcal{H}\), then \(E[1] = 1\). (The average value of a property that is always ‘1’ must be 1. This connects to the total normalized mass being 1).
  4. Monotone Convergence: If \(0 \le X_n(\omega) \uparrow X(\omega)\) for \(X_n, X \in \mathcal{H}\), then \(E[X_n] \uparrow E[X]\). (A technical condition ensuring consistency with limits, crucial for connecting to integration theory).
  • Interpretation: These axioms define what it means to be a consistent averaging operator over some underlying (normalized) distribution. Linearity and positivity are core algebraic properties of averaging. Normalization sets the scale (equivalent to \(P(\Omega)=1\)). Monotone convergence ensures good behavior with limits.

Recovering Probability (\(P\) from \(E\))

Now, the crucial step: if our fundamental object is the averaging operator \(E\), how do we get back the concept of the probability (normalized mass) of a specific region (event) \(A\)?

  • Motivation: We want to find the total normalized mass concentrated within region \(A\). How can we use our “averaging machine” \(E\) to measure this?
  • The Key Insight: The Indicator Function: We need to define a specific “property” \(X\) such that its average value \(E[X]\) is precisely the probability \(P(A)\). Consider the indicator function \(I_A\) for the event \(A\):

    \[I_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A \end{cases}\]

    This function \(I_A\) represents the property of “being inside region A”. Its value is 1 exactly where we want to measure the mass, and 0 elsewhere.

  • Analogy: We feed this specific “in-A-ness” property \(I_A\) into our averaging machine \(E\). The machine calculates the average value of this property across the entire object, weighted by the underlying normalized mass. Since the property is 1 only within A (where the mass we care about is) and 0 outside, the weighted average must yield exactly the proportion of the total mass that resides within A.

  • Formal Definition:

Definition - Probability via Expectation

For an event \(A\) (such that its indicator function \(I_A\) is in the class \(\mathcal{H}\) of expectable functions), its probability is defined as the expected value of its indicator function:

\[P(A) \equiv E[I_A]\]
  • Intuition/Analogy Recap: The probability of a region is simply the average value of the property “being in that region”. The \(E\) operator, embodying the averaging process (like finding a center of mass), directly gives us this value when applied to the indicator function.

  • Consistency: Remarkably, if the operator \(E\) satisfies the axioms listed above, the function \(P(A)\) defined via \(P(A) = E[I_A]\) can be shown to satisfy the Kolmogorov axioms for a probability measure (non-negativity, normalization, countable additivity). The starting assumptions about the averaging process automatically imply the standard rules of probability measures. This establishes the equivalence of the two approaches.

  • Running Example (Die Roll): Let’s assume we have an expectation operator \(E\) that reflects the fair die’s equal weighting (i.e., it computes averages assuming \(P(\{i\})=1/6\)). We want to find \(P(A)\) where \(A = \{1, 2\}\).
    1. Define the indicator: \(I_A(\omega) = 1\) if \(\omega \in \{1, 2\}\), and 0 otherwise.
    2. Apply the definition: \(P(A) = E[I_A]\).
    3. Calculate the average using the (assumed) underlying fair distribution:

      \[E[I_A] = \sum_{i=1}^{6} I_A(i) P(\{i\}) = I_A(1)\frac{1}{6} + I_A(2)\frac{1}{6} + I_A(3)\frac{1}{6} + \dots + I_A(6)\frac{1}{6}\]
      \[E[I_A] = 1 \cdot \frac{1}{6} + 1 \cdot \frac{1}{6} + 0 \cdot \frac{1}{6} + 0 \cdot \frac{1}{6} + 0 \cdot \frac{1}{6} + 0 \cdot \frac{1}{6} = \frac{1}{6} + \frac{1}{6} = \frac{2}{6} = \frac{1}{3}\]

    So, \(P(\{1, 2\}) = 1/3\). This matches the result from the measure-first approach, demonstrating consistency: the expectation-first definition recovers the same probability value. A short code sketch of this calculation is given just below this list.

  • For more applications of this representation, such as a purely algebraic derivation of the Principle of Inclusion-Exclusion, see my other blog post.
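Here is the running example expressed in code, a small sketch that assumes the same fair-die weighting and simply feeds the indicator of \(A = \{1, 2\}\) into the averaging operator:

```python
# Recovering P(A) = E[I_A] for the fair die, with A = {1, 2}.
omega = [1, 2, 3, 4, 5, 6]
weights = {w: 1 / 6 for w in omega}   # assumed fair weighting

def E(X):
    """Weighted average of the property X over the sample space."""
    return sum(X(w) * weights[w] for w in omega)

A = {1, 2}
I_A = lambda w: 1 if w in A else 0    # the property "being inside region A"

print(E(I_A))   # ≈ 0.333..., i.e. P({1, 2}) = 1/3
```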

Conditioning in the Expectation-First View

One area where the expectation-first perspective can offer a particularly appealing intuition is in conditioning. Instead of defining conditional probability via ratios of measures and then deriving conditional expectation, we can try to define conditional expectation directly based on the idea of updating our average given new information.

1. Conditioning on an Event (\(E[X \mid B]\))

Suppose we have our averaging operator \(E\) representing the overall average across \(\Omega\). Now, we learn that a specific event \(B\) (where \(E[I_B] = P(B) > 0\)) has occurred. How should we define the new average value of \(X\), denoted \(E[X \mid B]\), given this information?

  • Intuition: We are now interested in the average value of \(X\) only considering the part of the world where \(B\) is true. We want to restrict our averaging process to the “sub-universe” defined by \(B\) and re-normalize.
  • Using the Operator \(E\): How can we achieve this using the original operator \(E\) which averages over the whole space?
    • Consider the random variable \(X \cdot I_B\). This variable is equal to \(X\) inside \(B\) and zero outside. Its expectation, \(E[X \cdot I_B]\), represents the total contribution to the average of \(X\) that comes from within region \(B\).
    • The “size” or total probability weight of region \(B\) is \(P(B) = E[I_B]\).
    • To find the average value of \(X\) per unit of probability mass within \(B\), we should take the total contribution from \(B\) and divide by the total mass within \(B\).
  • Formal Definition: This leads directly to defining the conditional expectation given an event \(B\) as:

Definition - Conditional Expectation Given an Event (via E)

For an event \(B\) with \(E[I_B] > 0\), the conditional expectation of \(X\) given \(B\) is:

\[E[X \mid B] = \frac{E[X I_B]}{E[I_B]}\]
  • Connection to Conditional Probability: What about conditional probability \(P(A \mid B)\)? Since probability is the expectation of the indicator function, we can define it consistently:

    \[P(A \mid B) \equiv E[I_A \mid B] = \frac{E[I_A I_B]}{E[I_B]}\]

    Since \(I_A I_B = I_{A \cap B}\), this gives:

    \[P(A \mid B) = \frac{E[I_{A \cap B}]}{E[I_B]} = \frac{P(A \cap B)}{P(B)}\]

    This perfectly recovers the standard definition of conditional probability, showing the consistency of defining conditional expectation first via the averaging principle.
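As a numeric check (my own sketch, again assuming the fair-die weights), the ratio-of-expectations definitions of \(E[X \mid B]\) and \(P(A \mid B)\) can be evaluated directly, here with \(X\) the face value, \(B\) the event “the roll is even”, and \(A = \{1, 2\}\):

```python
# Conditional expectation and conditional probability via the averaging operator E.
omega = [1, 2, 3, 4, 5, 6]
weights = {w: 1 / 6 for w in omega}      # assumed fair weighting

def E(X):
    return sum(X(w) * weights[w] for w in omega)

X = lambda w: w                          # face value
I_B = lambda w: 1 if w % 2 == 0 else 0   # indicator of B = {2, 4, 6}
I_A = lambda w: 1 if w in {1, 2} else 0  # indicator of A = {1, 2}

# E[X | B] = E[X * I_B] / E[I_B]
print(E(lambda w: X(w) * I_B(w)) / E(I_B))    # ≈ 4.0, i.e. (2 + 4 + 6) / 3

# P(A | B) = E[I_A * I_B] / E[I_B] = P(A ∩ B) / P(B)
print(E(lambda w: I_A(w) * I_B(w)) / E(I_B))  # ≈ 1/3, since only 2 lies in A ∩ B
```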

2. Conditioning on Partial Information (\(E[X \mid \mathcal{G}]\))

The more general and powerful concept involves conditioning not on a single event occurring, but on the available information, represented by a sub-sigma-algebra \(\mathcal{G}\). As discussed previously, \(\mathcal{G}\) represents a coarser view of the outcome space (like knowing only the shoe size category, not the exact person).

  • Intuition: We want to find the “best estimate” or “updated average” of \(X\) given only the information permitted by \(\mathcal{G}\). This estimate cannot depend on details finer than \(\mathcal{G}\) allows; it must be a \(\mathcal{G}\)-measurable random variable. Let’s call this estimate \(Z\). What property should uniquely define \(Z\)?
  • The Averaging Principle: The core idea is that the estimate \(Z\) should behave like \(X\) on average, specifically when averaged over any region definable by the available information \(\mathcal{G}\). More generally, if we take any property \(W\) that depends only on the coarse information (i.e., \(W\) is \(\mathcal{G}\)-measurable), then calculating the average of \(W \cdot X\) across the whole space should give the same result as calculating the average of \(W \cdot Z\). The estimate \(Z\) must preserve the average value of \(X\) when viewed through the lens of \(\mathcal{G}\).

  • Formal Definition (via Characterizing Property): The conditional expectation \(Z = E[X \mid \mathcal{G}]\) is defined as the unique (up to P-null sets) random variable \(Z\) such that:
    1. \(Z\) is \(\mathcal{G}\)-measurable (its value only depends on the information in \(\mathcal{G}\)).
    2. For any bounded \(\mathcal{G}\)-measurable random variable \(W\),

      \[E[W X] = E[W Z]\]
  • Interpretation: This defining property \(E[W X] = E[W E[X \mid \mathcal{G}]]\) captures the essence of \(E[X \mid \mathcal{G}]\) being the best approximation of \(X\) based on \(\mathcal{G}\). It states that for any calculation involving averaging against a function \(W\) that respects the information constraint \(\mathcal{G}\), replacing \(X\) with its conditional expectation \(E[X \mid \mathcal{G}]\) yields the same average.
  • Operational Meaning: As seen in the height/shoe size example, if \(\mathcal{G}\) is generated by a partition (\(B_1, B_2, \dots\)), then for any outcome \(\omega\) falling into partition element \(B_i\), the value of \(E[X \mid \mathcal{G}](\omega)\) is simply \(E[X \mid B_i]\) (calculated using the formula from the previous section). It’s the average of \(X\) within the specific information category \(\omega\) belongs to.
  • Conditional Probability: Conditional probability given \(\mathcal{G}\) is then naturally defined as \(P(A \mid \mathcal{G}) = E[I_A \mid \mathcal{G}]\). This gives the probability of event \(A\) occurring, based on the partial information available in \(\mathcal{G}\).
  • Tower Property: The fundamental property \(E[E[X \mid \mathcal{G}]] = E[X]\) also emerges naturally: taking \(W \equiv 1\) (which is trivially \(\mathcal{G}\)-measurable) in the defining property gives \(E[X] = E[E[X \mid \mathcal{G}]]\). It states that averaging the conditional averages (weighted appropriately by the probability of each information state) recovers the original overall average.

In this expectation-centric view, conditioning is fundamentally about refining the averaging process based on available information. The definitions for \(E[X \mid B]\) and \(E[X \mid \mathcal{G}]\) flow directly from asking “What should the average be now?” and ensuring consistency with the overall averaging operator \(E\). This can feel more direct and operationally motivated than the measure-theoretic approach of defining conditional probability first.
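To make the partition case tangible, here is a short sketch (my own illustration, with the fair-die weights assumed) that constructs \(E[X \mid \mathcal{G}]\) for the partition \(\{1,2,3\}, \{4,5,6\}\) and checks both the defining property \(E[WX] = E[WZ]\) and the tower property:

```python
# E[X | G] when G is generated by a partition of the die outcomes.
omega = [1, 2, 3, 4, 5, 6]
weights = {w: 1 / 6 for w in omega}      # assumed fair weighting

def E(X):
    return sum(X(w) * weights[w] for w in omega)

X = lambda w: w                          # face value
partition = [{1, 2, 3}, {4, 5, 6}]       # the coarse information: "low" vs "high"

def cond_exp(X, partition):
    """Return Z = E[X | G]: within each block B_i, Z is the constant E[X | B_i]."""
    values = {}
    for B in partition:
        I_B = lambda w, B=B: 1 if w in B else 0
        avg_in_B = E(lambda w: X(w) * I_B(w)) / E(I_B)   # E[X | B_i]
        for w in B:
            values[w] = avg_in_B
    return lambda w: values[w]

Z = cond_exp(X, partition)
print([Z(w) for w in omega])             # ≈ [2, 2, 2, 5, 5, 5]

# Defining property: E[W X] = E[W Z] for any G-measurable W
# (W is G-measurable here because it is constant on each partition block).
W = lambda w: 10 if w in {1, 2, 3} else -1
assert abs(E(lambda w: W(w) * X(w)) - E(lambda w: W(w) * Z(w))) < 1e-9

# Tower property: E[E[X | G]] = E[X]
assert abs(E(Z) - E(X)) < 1e-9
```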

Synthesis and Conclusion

We’ve explored two foundational paths to modern probability theory:

  1. Measure First (Kolmogorov): Object (\(\Omega\)) \(\to\) Regions (\(E\)) \(\to\) Physical Mass (\(\mu\)) \(\to\) Measurable Resolution (\(\mathcal{F}\)) \(\to\) Normalized Mass (\(P = \mu/\mu(\Omega)\), described by PMF/PDF) \(\to\) Properties (\(X\)) \(\to\) Average/Center of Mass (\(E[X]\)) \(\to\) Conditioning (Zooming/Slicing/Re-normalizing \(P\) and \(E\)). Emphasizes measuring regions and the crucial role of distribution.
  2. Expectation First (Whittle/Daniell): Averaging Operator (\(E\) defined by axioms) \(\to\) Probability (\(P(A) = E[I_A]\)) \(\to\) Conditioning (via axioms on conditional operator \(E_{\mathcal{G}}\) or derived properties) \(\to\) (Implies consistent \(\Omega, \mathcal{F}, P, X\) structure). Emphasizes the operational meaning of averaging.

Both lead to the same rich framework. The physical analogy of mass distributions provides a unifying intuition:

Analogy Summary

  • Sample Space (\(\Omega\)): The physical object/system (e.g., a population, a metal plate).
  • Event (\(E \in \mathcal{F}\)): A measurable region within the object.
  • Unnormalized Measure (\(\mu\)): The physical mass (or count) within a region.
  • Sigma-Algebra (\(\mathcal{F}\)): Defines the object’s finest resolution/granularity.
  • Probability Measure (\(P\)): The normalized mass distribution (total mass = 1).
    • PDF (\(p(x)\)) (Continuous): Analogous to mass density (\(\rho(x)\)). Requires integration.
    • PMF (\(P(x_i)\)) (Discrete): Analogous to point masses (\(m_i\)). Value is probability/mass.
  • Random Variable (\(X\)): A measurable physical property (e.g., position, temperature, height).
  • CDF (\(F_X(x)\)): Accumulated normalized mass for property \(X\) up to value \(x\).
  • Expectation (\(E[X]\)): The center of mass or overall weighted average of property \(X\).
  • Variance (\(Var(X)\)): The moment of inertia measuring the spread of \(X\)’s mass around \(E[X]\).
  • Joint Distribution (\(P(X,Y)\)): Mass distribution on a higher-dimensional object (e.g., 2D plate).
  • Marginal Distribution (\(P(X)\)): Mass distribution obtained by compressing/projecting the joint mass onto one axis/subspace.
  • Conditional Probability (\(P(A \mid B)\)): Re-normalized mass distribution focused within region \(B\).
  • Conditional Distribution (\(p(y \mid x)\)): Re-normalized mass distribution along a slice taken through the joint distribution at \(X=x\).
  • Conditional Expectation (\(E[X \mid B]\)): Center of mass calculated within region \(B\).
  • Sub-Sigma-Algebra (\(\mathcal{G}\)): A coarser resolution based on partial information (e.g., knowing only shoe size category).
  • Conditional Expectation (\(E[X \mid \mathcal{G}]\)): Average value (center of mass) calculated within each coarser “pixel” defined by \(\mathcal{G}\).
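To connect a couple of these entries back to numbers, here is a tiny sketch (my own illustration, for the fair die) computing the expectation as a center of mass and the variance as the moment of inertia about it:

```python
# Center of mass (expectation) and moment of inertia about it (variance), fair die.
omega = [1, 2, 3, 4, 5, 6]
weights = {w: 1 / 6 for w in omega}   # normalized point masses

mean = sum(w * weights[w] for w in omega)               # E[X]: center of mass
var = sum((w - mean) ** 2 * weights[w] for w in omega)  # Var(X): moment of inertia

print(mean)   # ≈ 3.5
print(var)    # ≈ 35/12 ≈ 2.917
```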

Thinking in terms of objects, their measurable structure, how physical mass is distributed, how this leads to normalized probability, the properties defined on them, how to average those properties, and how to update these quantities when focusing on sub-regions provides a tangible path to probability’s core concepts. Visualizing these concepts, for instance by drawing the sample space as a region, mass distributions as shading or point heights, and conditional probabilities as zoomed-in views, can further enhance intuition. Remember the analogy’s limits – probability is ultimately about information and uncertainty – but the shared mathematical structure of distribution and averaging makes the physical intuition powerful.

This robust framework serves as the common language for different interpretations (Frequentism, Bayesianism). The mathematics, illuminated by physical analogy starting with concrete mass, provides the consistent foundation.

This post is licensed under CC BY 4.0 by the author.