
Statistics & Info Theory Cheat Sheet: Key Formulas & Definitions

A quick reference guide with key formulas and definitions from the Statistics and Information Theory crash course for machine learning.

Introduction

This cheat sheet provides a quick reference to the key definitions, formulas, and concepts covered in the crash course posts *Statistics & Info Theory Part 1: Statistical Foundations for ML* and *Statistics & Info Theory Part 2: Information Theory Essentials for ML*. Use it as a quick reminder or for reviewing core ideas.

Part 1: Statistical Foundations Recap

Basic Probability

| Concept | Formula / Definition |
|---|---|
| Conditional Prob. | \(P(A \vert B) = \frac{P(A \cap B)}{P(B)}\), for \(P(B) > 0\) |
| Independence | \(A, B\) indep. if \(P(A \cap B) = P(A)P(B)\), or \(P(A \vert B) = P(A)\) |
| Bayes' Theorem | \(P(A \vert B) = \frac{P(B \vert A)P(A)}{P(B)}\) |
| Law of Total Prob. | \(P(B) = \sum_i P(B \vert A_i)P(A_i)\) for partition \(\{A_i\}\) |
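
As a quick sanity check of Bayes' theorem and the law of total probability, here is a minimal Python sketch; the disease-testing numbers are made up purely for illustration:

```python
# Bayes' theorem with made-up numbers: P(disease | positive test).
p_disease = 0.01                      # prior P(A)
p_pos_given_disease = 0.95            # P(B | A)
p_pos_given_healthy = 0.05            # P(B | not A)

# Law of total probability: P(B) = sum_i P(B | A_i) P(A_i)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ~0.161
```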

Random Variables (RVs)

| Concept | Discrete RV | Continuous RV |
|---|---|---|
| Description | Probability Mass Function (PMF): \(p_X(x)\) | Probability Density Function (PDF): \(f_X(x)\) |
| PMF/PDF Properties | \(\sum_x p_X(x) = 1\), \(p_X(x) \ge 0\) | \(\int_{-\infty}^{\infty} f_X(x) dx = 1\), \(f_X(x) \ge 0\) |
| CDF \(F_X(x)=P(X \le x)\) | \(F_X(x) = \sum_{k \le x} p_X(k)\) | \(F_X(x) = \int_{-\infty}^x f_X(t) dt\) |
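
A minimal sketch of evaluating PMFs, PDFs, and CDFs, assuming SciPy's `scipy.stats` is available; the chosen distributions and parameters are arbitrary examples:

```python
from scipy import stats

# Discrete RV: Binomial(n=10, p=0.3)
X = stats.binom(n=10, p=0.3)
print(X.pmf(3))    # PMF p_X(3)
print(X.cdf(3))    # CDF F_X(3) = sum_{k <= 3} p_X(k)

# Continuous RV: Normal(mu=0, sigma=1)
Y = stats.norm(loc=0, scale=1)
print(Y.pdf(0.0))  # PDF f_Y(0) (a density, not a probability)
print(Y.cdf(0.0))  # CDF F_Y(0) = integral of f_Y up to 0 = 0.5
```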

Expectation, Variance, Covariance

| Concept | Formula / Definition |
|---|---|
| Expected Value \(E[X]\) | Discrete: \(\sum_x x\, p_X(x)\). Continuous: \(\int x f_X(x) dx\) |
| Variance \(Var(X)\) | \(E[(X - E[X])^2] = E[X^2] - (E[X])^2\) |
| Standard Deviation \(\sigma_X\) | \(\sqrt{Var(X)}\) |
| Covariance \(Cov(X,Y)\) | \(E[(X-E[X])(Y-E[Y])] = E[XY] - E[X]E[Y]\) |
| Correlation \(\rho_{X,Y}\) | \(\frac{Cov(X,Y)}{\sigma_X \sigma_Y}\), where \(-1 \le \rho_{X,Y} \le 1\) |
| Covariance Matrix \(\Sigma\) | For RV vector \(\mathbf{X}\), \(\Sigma = E[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T]\); \(\Sigma_{ij} = Cov(X_i, X_j)\) |
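
A short NumPy sketch of estimating these quantities from samples; the linear relationship between `x` and `y` is an arbitrary choice made to produce nonzero covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)   # y is correlated with x by construction

print(x.mean(), x.var())                # sample estimates of E[X], Var(X)
print(np.cov(x, y))                     # 2x2 sample covariance matrix
print(np.corrcoef(x, y)[0, 1])          # sample correlation, close to 2/sqrt(5) ~ 0.894
```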

Common Probability Distributions

| Distribution | Type | Parameters | PMF / PDF \(p(x;\cdot)\) or \(f(x;\cdot)\) | Mean \(E[X]\) | Variance \(Var(X)\) |
|---|---|---|---|---|---|
| Bernoulli | Discrete | \(p\) | \(p^x (1-p)^{1-x}\) for \(x \in \{0,1\}\) | \(p\) | \(p(1-p)\) |
| Binomial | Discrete | \(n, p\) | \(\binom{n}{x} p^x (1-p)^{n-x}\) | \(np\) | \(np(1-p)\) |
| Categorical | Discrete | \(\mathbf{p}=(p_1,\dots,p_K)\) | \(P(X=k) = p_k\) | (Vector) | (Cov Matrix) |
| Poisson | Discrete | \(\lambda\) | \(\frac{\lambda^x e^{-\lambda}}{x!}\) | \(\lambda\) | \(\lambda\) |
| Uniform | Continuous | \(a, b\) | \(\frac{1}{b-a}\) for \(x \in [a,b]\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)^2}{12}\) |
| Normal (Gaussian) | Continuous | \(\mu, \sigma^2\) | \(\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\) | \(\mu\) | \(\sigma^2\) |
| Multivariate Normal | Continuous | \(\boldsymbol{\mu}, \Sigma\) | \(\frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})}\) | \(\boldsymbol{\mu}\) | \(\Sigma\) |
| Exponential | Continuous | \(\lambda\) | \(\lambda e^{-\lambda x}\) for \(x \ge 0\) | \(1/\lambda\) | \(1/\lambda^2\) |
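
A small sketch that cross-checks the table's mean and variance formulas against `scipy.stats` frozen distributions; the parameter values here are arbitrary:

```python
from scipy import stats

n, p, lam, a, b, mu, sigma = 10, 0.3, 4.0, 0.0, 2.0, 1.0, 0.5

# (frozen distribution, mean formula from the table, variance formula from the table)
checks = {
    "Bernoulli":   (stats.bernoulli(p),          p,            p * (1 - p)),
    "Binomial":    (stats.binom(n, p),           n * p,        n * p * (1 - p)),
    "Poisson":     (stats.poisson(lam),          lam,          lam),
    "Uniform":     (stats.uniform(a, b - a),     (a + b) / 2,  (b - a) ** 2 / 12),
    "Normal":      (stats.norm(mu, sigma),       mu,           sigma ** 2),
    "Exponential": (stats.expon(scale=1 / lam),  1 / lam,      1 / lam ** 2),
}
for name, (dist, mean, var) in checks.items():
    print(name, dist.mean(), mean, dist.var(), var)   # each pair should agree
```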

Important Theorems

| Theorem | Summary |
|---|---|
| Law of Large Numbers (LLN) | Sample mean \(\bar{X}_n\) converges to the true mean \(E[X]\) as \(n \to \infty\). |
| Central Limit Theorem (CLT) | The sum/average of many i.i.d. RVs (with finite mean/variance) tends towards a normal distribution, regardless of the original distribution: \(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)\) |
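
A minimal simulation sketch of the CLT, assuming NumPy; the exponential distribution and the sample sizes are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 5_000
lam = 4.0                                   # Exponential rate; mean 1/lam, std 1/lam

# Draw many sample means of n i.i.d. Exponential(lam) variables.
samples = rng.exponential(scale=1 / lam, size=(trials, n))
z = (samples.mean(axis=1) - 1 / lam) / ((1 / lam) / np.sqrt(n))

# By the CLT, z should be approximately N(0, 1) even though the data are skewed.
print(z.mean(), z.std())                    # ~0 and ~1
```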

Statistical Estimation

| Concept | Definition / Formula |
|---|---|
| Estimator Bias | \(Bias(\hat{\theta}) = E[\hat{\theta}] - \theta\) |
| Estimator MSE | \(MSE(\hat{\theta}) = Var(\hat{\theta}) + (Bias(\hat{\theta}))^2\) |
| Likelihood \(L(\theta; D)\) | \(L(\theta; D) = \prod_{i=1}^n p(x_i; \theta)\) (for i.i.d. data) |
| Log-Likelihood \(\ell(\theta; D)\) | \(\ell(\theta; D) = \sum_{i=1}^n \log p(x_i; \theta)\) |
| MLE \(\hat{\theta}_{MLE}\) | \(\hat{\theta}_{MLE} = \arg\max_{\theta} \ell(\theta; D)\) |
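
A minimal sketch of MLE by numerically maximizing the log-likelihood, assuming NumPy and SciPy; the model here is a Gaussian with known unit variance, chosen so that the numerical MLE can be compared against its closed form, the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=500)   # known sigma=1, unknown mu

def neg_log_likelihood(mu):
    # -ell(mu; D) = -sum_i log N(x_i; mu, 1), dropping the additive constant
    return 0.5 * np.sum((data - mu) ** 2)

result = minimize_scalar(neg_log_likelihood)
print(result.x, data.mean())   # numerical MLE matches the sample mean
```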

Part 2: Information Theory Essentials Recap

Entropy Measures

| Concept | Formula / Definition | Notes |
|---|---|---|
| Shannon Entropy \(H(X)\) (Discrete) | \(-\sum_{x \in \mathcal{X}} p(x) \log_b p(x)\) | Units: bits (\(b=2\)), nats (\(b=e\)). \(H(X) \ge 0\). |
| Differential Entropy \(h(X)\) (Continuous) | \(-\int_{\mathcal{X}} f(x) \log f(x) dx\) | Can be negative. Units usually nats. |
| Joint Entropy \(H(X,Y)\) | \(-\sum_{x,y} p(x,y) \log p(x,y)\) (discrete) | Total uncertainty of the pair \((X,Y)\). |
| Conditional Entropy \(H(Y \vert X)\) | \(-\sum_{x,y} p(x,y) \log p(y \vert x) = H(X,Y) - H(X)\) | Remaining uncertainty in \(Y\) given \(X\). \(H(Y \vert X) \le H(Y)\). |
| Chain Rule for Entropy | \(H(X,Y) = H(X) + H(Y \vert X)\) | Generalizes to multiple variables. |
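
A small sketch computing Shannon, joint, and conditional entropy from a toy joint PMF; the `entropy` helper and the example probabilities are illustrative, not taken from the original posts:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy -sum p(x) log_b p(x); zero-probability terms contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])          # toy joint PMF p(x, y)
p_x = p_xy.sum(axis=1)                   # marginal p(x)

H_X  = entropy(p_x)
H_XY = entropy(p_xy.ravel())
H_Y_given_X = H_XY - H_X                 # chain rule: H(X,Y) = H(X) + H(Y|X)
print(H_X, H_XY, H_Y_given_X)            # all in bits
```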

Information Measures and Divergences

| Concept | Formula / Definition | Key Properties |
|---|---|---|
| Mutual Information \(I(X;Y)\) | \(H(X) - H(X \vert Y) = H(X) + H(Y) - H(X,Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}\) | \(I(X;Y) \ge 0\). \(I(X;Y)=0 \iff X,Y\) independent. Symmetric. |
| KL Divergence \(D_{KL}(P \Vert Q)\) | Discrete: \(\sum_x p(x) \log \frac{p(x)}{q(x)}\). Continuous: \(\int p(x) \log \frac{p(x)}{q(x)} dx\) | \(D_{KL}(P \Vert Q) \ge 0\). \(D_{KL}(P \Vert Q)=0 \iff P=Q\). Asymmetric. |
| Cross-Entropy \(H(P,Q)\) | Discrete: \(-\sum_x p(x) \log q(x)\). Continuous: \(-\int p(x) \log q(x) dx\) | \(H(P,Q) = H(P) + D_{KL}(P \Vert Q)\). Common loss function. |
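
A minimal sketch of KL divergence and cross-entropy for two toy discrete distributions, using `scipy.stats.entropy` (which returns the KL divergence in nats when given two distributions); the probability vectors are arbitrary:

```python
import numpy as np
from scipy.stats import entropy as scipy_entropy

p = np.array([0.7, 0.2, 0.1])   # "true" distribution P
q = np.array([0.5, 0.3, 0.2])   # model distribution Q

kl_pq = scipy_entropy(p, q)                 # D_KL(P || Q), in nats
kl_qp = scipy_entropy(q, p)                 # D_KL(Q || P) differs: KL is asymmetric
cross_entropy = -np.sum(p * np.log(q))      # H(P, Q), in nats

print(kl_pq, kl_qp)
print(cross_entropy, scipy_entropy(p) + kl_pq)   # H(P,Q) = H(P) + D_KL(P||Q)
```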

Fisher Information

| Concept | Definition / Formula |
|---|---|
| Score Function \(\mathbf{s}(\boldsymbol{\theta}; x)\) | \(\nabla_{\boldsymbol{\theta}} \log p(x; \boldsymbol{\theta})\) |
| Fisher Information Matrix \(I(\boldsymbol{\theta})\) | \(E_{X \sim p(x;\boldsymbol{\theta})}\left[ (\nabla_{\boldsymbol{\theta}} \log p(x;\boldsymbol{\theta})) (\nabla_{\boldsymbol{\theta}} \log p(x;\boldsymbol{\theta}))^T \right]\), or equivalently \(-E_{X \sim p(x;\boldsymbol{\theta})}\left[ \nabla^2_{\boldsymbol{\theta}} \log p(x; \boldsymbol{\theta}) \right]\) (under regularity conditions) |
| Cramér-Rao Lower Bound (CRLB) | For an unbiased estimator \(\hat{\boldsymbol{\theta}}\), \(Cov(\hat{\boldsymbol{\theta}}) \succeq I(\boldsymbol{\theta})^{-1}\) |
| Local KL-FIM Relation | \(D_{KL}(p(\cdot;\boldsymbol{\theta}) \Vert p(\cdot;\boldsymbol{\theta} + d\boldsymbol{\theta})) \approx \frac{1}{2} d\boldsymbol{\theta}^T I(\boldsymbol{\theta}) d\boldsymbol{\theta}\) |
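
A small sketch checking the Fisher information of a Bernoulli model, whose closed form is \(I(p) = \frac{1}{p(1-p)}\), against a Monte Carlo estimate of the expected squared score; assumes NumPy, and the parameter value is illustrative:

```python
import numpy as np

p = 0.3
fisher_closed_form = 1.0 / (p * (1 - p))    # I(p) = 1 / (p (1 - p))

# Monte Carlo estimate of E[(d/dp log p(x; p))^2], the expected squared score.
rng = np.random.default_rng(0)
x = rng.binomial(1, p, size=1_000_000)
score = x / p - (1 - x) / (1 - p)           # d/dp [x log p + (1 - x) log(1 - p)]
fisher_mc = np.mean(score ** 2)

print(fisher_closed_form, fisher_mc)        # both ~4.76
```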

This cheat sheet is intended as a condensed summary. For detailed explanations, derivations, and examples, please refer to the full posts in the crash course.

This post is licensed under CC BY 4.0 by the author.