Continuous Probability

Leaving the Finite World

Discrete Probability Spaces Beyond the Finite Case

Countable and uncountable sets

A set \(\Omega\) is countable if it is finite, or if there exists a sequence \((a_n)_{n \ge 0}\) such that

\[ \Omega = \{a_0, a_1, a_2, a_3, \dots\}. \]

Equally, \(\Omega\) is countable if and only if there exists a injection from \(\Omega\) to \(\mathbb{N}\).

A set that is not countable is uncountable.

Properties about countable and uncountable sets

(Tuples are countable). For every fixed integer \(k \ge 1\), the set of all \(k\)-tuples of natural numbers,

\[ \mathbb{N}^k := \{(x_1, x_2, \dots, x_k) : x_1, \dots, x_k \in \mathbb{N}\} \]

is countable.

Rational numbers are countable.
Real numbers are uncountable.

Discrete probability space

A discrete probability space is a pair \(\mathcal{P} = (\Omega, p)\) consisting of:

A non-empty countable set \(\Omega\), called the sample space, whose elements represent all possible outcomes of an experiment.
A function \(p : \Omega \to [0, 1]\), called the probability mass function, which assigns probabilities to outcomes and satisfies \(\sum_{\omega \in \Omega} p(\omega) = 1\).

Event

An event \(A\) is a subset of \(\Omega\) (\(A \subseteq \Omega\)), and its probability is defined as

\[ \Pr[A] := \sum_{\omega \in A} p(\omega). \]

Expectation

Let \(X\) be a random variable on a discrete probability space \((\Omega, p)\). The expectation of \(X\) is defined as

\[ \mathbb{E}[X] := \sum_{\omega \in \Omega} [X(\omega)]_+ p(\omega) - \sum_{\omega \in \Omega} [X(\omega)]_- p(\omega), \]

provided the two sums are not both \(+\infty\). If both sums are finite, then \(\mathbb{E}[X]\) is finite. If exactly one of them is \(+\infty\), then \(\mathbb{E}[X]\) is \(+\infty\) or \(-\infty\). If both are \(+\infty\), then the expectation is left undefined.

This definition guarantees that limits exist and are consistent regardless of the order of summation. This need some knowledge about infinite series, which is not covered in this note.

Continuous Probability Space

Probability density function

A function \(f : \mathbb{R}^d \to [0, \infty)\) is called a probability density function (PDF) on \(\mathbb{R}^d\) if

\[ \int_{\mathbb{R}^d} f(\boldsymbol{x}) \, \mathrm{d}\boldsymbol{x} = 1. \]

Probability Distributions

Discrete random variables

A random variable \(X\) is discrete if it takes countably many values \(\{x_1, x_2, \dots\}\). We define the probability mass function (PMF) of \(X\) as

\[ p_X(x) := \Pr[X = x]. \]

Continuous random variables

A random variable \(X\) is continuous if there exists a function \(f_X : \mathbb{R} \to [0, +\infty)\) such that

\[ \int_{\mathbb{R}} f_X(x) \,\mathrm{d}x = 1, \]

and for every \(a < b\),

\[ \Pr[a < X < b] = \int_{a}^{b} f_X(x) \,\mathrm{d}x. \]

The function \(f_X\) is called the probability density function (PDF) of \(X\).

Probability distribution

The probability distribution, or law, of \(X\) is defined as a function \(P_X\) that maps sets to probabilities:

\[ P_X(A) := \Pr[X \in A], \]

In general, a probability distribution refers to any function \(\mathcal{D}\) that maps sets to probabilities, and we write \(X \sim \mathcal{D}\) if \(P_X = \mathcal{D}\). A probability distribution \(\mathcal{D}\) is discrete if \(X \sim \mathcal{D}\) for some discrete random variable \(X\), and continuous if \(X \sim \mathcal{D}\) for some continuous random variable \(X\).

Theorem 4.1 Let \(P_X\) and \(P_Y\) be the probability distributions of two random variables \(X\) and \(Y\). Then the following conditions are equivalent to \(P_X = P_Y\):

For every \(a < b\), \(P_X((a, b)) = P_Y((a, b))\).
For every \(a < b\), \(P_X([a, b]) = P_Y([a, b])\).
For every \(a \in \mathbb{R}\), \(P_X((-\infty, a)) = P_Y((-\infty, a))\).
For every \(a \in \mathbb{R}\), \(P_X((-\infty, a]) = P_Y((-\infty, a])\).

Theorem 4.2 Let \(P_X\) and \(P_Y\) be the probability distributions of two random variables \(X\) and \(Y\). Suppose that the MGFs of \(X\) and \(Y\), denoted as \(M_X\) and \(M_Y\), are defined on some open interval \((-\epsilon, \epsilon)\). Then \(P_X = P_Y\) if and only if \(M_X(t) = M_Y(t)\) for all \(t \in (-\epsilon, \epsilon)\).

Gaussian Distribution

The Standard Gaussian Distributions

Standard Gaussian distribution

The standard Gaussian distribution is the continuous probability distribution with PDF

\[ \varphi(x) := \frac{1}{\sqrt{2\pi}} e^{-x^2/2}. \]

We denote the standard Gaussian distribution by \(\mathcal{N}(0, 1)\).

General Gaussian Distributions

General Gaussian distribution

Let \(\mu \in \mathbb{R}\) and \(\sigma \ge 0\). The Gaussian distribution with mean \(\mu\) and variance \(\sigma^2\) is defined as the distribution of random variable

\[ X = \mu + \sigma Z, \]

where \(Z \sim \mathcal{N}(0, 1)\). We denote this distribution by \(\mathcal{N}(\mu, \sigma^2)\).

Theorem 4.3 Let \(X\) be a continuous random variable with PDF \(f_X(x)\). Let \(\alpha \in \mathbb{R} \setminus \{0\}\) and \(\beta \in \mathbb{R}\) be constants. Then:

Scaling: \(\alpha X\) has PDF \(\frac{1}{|\alpha|} f_X\left(\frac{x}{\alpha}\right)\).
Shift: \(X + \beta\) has PDF \(f_X(x - \beta)\).
Scaling and Shift: \(\alpha X + \beta\) has PDF \(\frac{1}{|\alpha|} f_X\left(\frac{x - \beta}{\alpha}\right)\).

If \(X \sim \mathcal{N}(\mu, \sigma^2)\) for some \(\mu \in \mathbb{R}\) and \(\sigma > 0\), then

\[ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

is a PDF of \(X\).

Moment Generating Functions

Theorem 4.4 If \(X \sim \mathcal{N}(\mu, \sigma^2)\) is a Gaussian random variable with mean \(\mu\) and variance \(\sigma^2\), then

\[ M_X(t) = e^{\mu t + \sigma^2 t^2 / 2}, \quad K_X(t) = \mu t + \frac{1}{2}\sigma^2 t^2. \]

Sum of Independent Gaussians

Theorem 4.5 Let \(X_1\) and \(X_2\) be independent Gaussian random variables. Suppose

\[ X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2), \quad X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2). \]

Then

\[ X_1 + X_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2). \]

Theorem 4.6 Let \(X_1, \dots, X_n\) be independent Gaussian random variables with

\[ X_i \sim \mathcal{N}(\mu_i, \sigma_i^2), \quad \text{for all } i \in [n]. \]

For any constants \(c_1, \dots, c_n \in \mathbb{R}\),

\[ \sum_{i=1}^n c_i X_i \sim \mathcal{N}\left(\sum_{i=1}^n c_i \mu_i, \sum_{i=1}^n c_i^2 \sigma_i^2\right). \]

Gaussian Tail Bounds

Theorem 4.7 Let \(X := \frac{1}{n} \sum_{i=1}^n X_i\) be the sample average of \(n\) i.i.d. Gaussian random variables \(X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)\). Then for every \(\epsilon \ge 0\), as \(n \to \infty\),

\[ \begin{aligned} \Pr[X \ge \mu + \epsilon] = \Pr[X \le \mu - \epsilon] &= \bar{\Phi}\left(\frac{\epsilon}{\sigma}\sqrt{n}\right) \\ &\sim \frac{\sigma}{\epsilon\sqrt{2\pi n}}e^{-n\epsilon^2/(2\sigma^2)}. \end{aligned} \]

where \(\bar{\Phi}\) is conventionally defined as the tail probability of the standard Gaussian distribution: \[ \bar{\Phi}(x) := \int_x^{+\infty} \frac{1}{\sqrt{2\pi}} e^{-u^2/2} \,\mathrm{d}u. \]

However, based on some tricks with integration by parts, one can show that \[ \bar{\Phi}(x) \sim \frac{1}{\sqrt{2\pi}} \frac{e^{-x^2/2}}{x} \quad \text{as } x \to +\infty. \]

Theorem 4.8 Let \(X := \frac{1}{n} \sum_{i=1}^n X_i\) be the sample average of \(n\) i.i.d. Gaussian random variables \(X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)\). Then, for every \(\epsilon \ge 0\),

\[ \Pr[X \ge \mu + \epsilon] \le e^{-n\epsilon^2/(2\sigma^2)}, \quad \Pr[X \le \mu - \epsilon] \le e^{-n\epsilon^2/(2\sigma^2)}. \]

Gaussian Vectors

Random vector

A random vector in \(\mathbb{R}^d\) is a vector

\[ \boldsymbol{x} = (X_1, \dots, X_d) \]

whose coordinates \(X_1, \dots, X_d\) are random variables. The probability distribution of a random vector \(\boldsymbol{x} \in \mathbb{R}^d\), or equivalently the joint distribution of \(X_1, \dots, X_d\), is a function that assigns probabilities to regions in \(\mathbb{R}^d\):

\[ P_{\boldsymbol{x}}(A) := \Pr[\boldsymbol{x} \in A]. \]

When referring to the distribution of a single coordinate \(X_i\), we use the term marginal distribution of \(X_i\) to emphasize that we are considering the probability distribution of this coordinate alone.

Mean vector

The mean of a random vector \(\boldsymbol{x} = (X_1, \dots, X_d)\) is the vector

\[ \mathbb{E}[\boldsymbol{x}] := \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_d] \end{bmatrix}. \]

Covariance matrix

The covariance matrix of a random vector \(\boldsymbol{x} = (X_1, \dots, X_d)\) is the matrix

\[ \text{Cov}(\boldsymbol{x}) := \mathbb{E}[(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathbb{E}[\boldsymbol{x}])^\top], \]

where the expectation is taken entrywise. More explicitly,

\[ \text{Cov}(\boldsymbol{x}) := \begin{bmatrix} \text{Cov}(X_1, X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_d) \\ \text{Cov}(X_2, X_1) & \text{Cov}(X_2, X_2) & \cdots & \text{Cov}(X_2, X_d) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_d, X_1) & \text{Cov}(X_d, X_2) & \cdots & \text{Cov}(X_d, X_d) \end{bmatrix}, \]

where \(\text{Cov}(X, Y) := \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]\) is defined as the covariance between \(X\) and \(Y\).

Standard Gaussian vector

A random vector \(\boldsymbol{x} = (X_1, \dots, X_d) \in \mathbb{R}^d\) is called a standard Gaussian vector if \(X_1, \dots, X_d\) are independent and

\[ X_i \sim \mathcal{N}(0, 1), \quad \text{for all } i \in [d]. \]

The distribution of this random vector is denoted by \(\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\).

Gaussian vector

A random vector \(\boldsymbol{x} \in \mathbb{R}^d\) is called a Gaussian vector if there exists a vector \(\boldsymbol{\mu} \in \mathbb{R}^d\), a matrix \(\boldsymbol{A} \in \mathbb{R}^{d \times r}\), and a \(r\)-dimensional standard Gaussian vector \(\boldsymbol{z} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\) such that

\[ \boldsymbol{x} = \boldsymbol{\mu} + \boldsymbol{A}\boldsymbol{z}. \]

The distribution of this random vector is denoted by \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), where \(\boldsymbol{\Sigma} = \boldsymbol{A}\boldsymbol{A}^\top\).

Theorem 4.9 Let \(P_{\boldsymbol{x}}\) and \(P_{\boldsymbol{y}}\) be the probability distributions of two random vectors \(\boldsymbol{x}\) and \(\boldsymbol{y}\). Suppose that the MGFs of \(\boldsymbol{x}\) and \(\boldsymbol{y}\), denoted as \(M_{\boldsymbol{x}}\) and \(M_{\boldsymbol{y}}\), are defined on some open neighborhood of the origin. Then \(P_{\boldsymbol{x}} = P_{\boldsymbol{y}}\) if and only if \(M_{\boldsymbol{x}}(\boldsymbol{t}) = M_{\boldsymbol{y}}(\boldsymbol{t})\) for all \(\boldsymbol{t}\) in this neighborhood.

Theorem 4.10 Let \(\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\). Then the MGF and CGF of \(\boldsymbol{x}\) are given by

\[ M_{\boldsymbol{x}}(\boldsymbol{t}) = e^{\boldsymbol{t}^\top \boldsymbol{\mu} + \boldsymbol{t}^\top \boldsymbol{\Sigma} \boldsymbol{t} / 2}, \quad K_{\boldsymbol{x}}(\boldsymbol{t}) = \boldsymbol{t}^\top \boldsymbol{\mu} + \frac{1}{2}\boldsymbol{t}^\top \boldsymbol{\Sigma} \boldsymbol{t}. \]

Theorem 4.11 Let \(\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) with \(\det(\boldsymbol{\Sigma}) > 0\). Then

\[ f_{\boldsymbol{x}}(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^d \det(\boldsymbol{\Sigma})}} e^{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})}. \]

is a PDF of \(\boldsymbol{x}\).

Theorem 4.12 Let \(\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\). Then for any fixed matrix \(\boldsymbol{S} \in \mathbb{R}^{r \times d}\),

\[ \boldsymbol{S}\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{S}\boldsymbol{\mu}, \boldsymbol{S}\boldsymbol{\Sigma}\boldsymbol{S}^\top). \]

Theorem 4.13 Let \(\boldsymbol{x}_1\) and \(\boldsymbol{x}_2\) be independent Gaussian vectors in \(\mathbb{R}^d\). Suppose

\[ \boldsymbol{x}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), \quad \boldsymbol{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2). \]

Then

\[ \boldsymbol{x}_1 + \boldsymbol{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2). \]

The Strange Geometry of High Dimensions

High-dimensional Gaussians, Spheres, and Ball

Three fundamental distributions

Gaussian distribution \(\mathcal{G} = \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}/d)\):

The Gaussian distribution with mean \(\boldsymbol{0} \in \mathbb{R}^d\) and covariance matrix \(\boldsymbol{I}/d \in \mathbb{R}^{d \times d}\). The scaling factor \(1/d\) here is intentional. For \(\boldsymbol{x} = (X_1, \dots, X_d) \sim \mathcal{G}\), we have \(\mathbb{E}[\|\boldsymbol{x}\|_2^2] = \sum_{i=1}^d \mathbb{E}[X_i^2] = d \cdot (1/d) = 1\). Therefore, this scaling keeps the expected squared norm equal to \(1\), independent of \(d\).
Uniform sphere distribution \(\mathcal{U}_{\mathbb{S}}\):

The uniform distribution on the unit sphere \(\mathbb{S}^{d-1} = \{\boldsymbol{x} \in \mathbb{R}^d : \|\boldsymbol{x}\|_2 = 1\}\). The norm of a point on the unit sphere is always \(1\).
Uniform ball distribution \(\mathcal{U}_B\):

The uniform distribution on the unit ball \(B^d = \{\boldsymbol{x} \in \mathbb{R}^d : \|\boldsymbol{x}\|_2 \le 1\}\). The norm of a point in the unit ball is always upper bounded by \(1\).

Theorem 4.14 Let \(\boldsymbol{x} \sim \mathcal{U}_B\). Then, for every \(\epsilon \in (0, 1)\), \[ \Pr[1 - \epsilon < \|\boldsymbol{x}\|_2 \le 1] \ge 1 - e^{-\epsilon d}. \]

Theorem 4.15 Let \(\boldsymbol{x} \sim \mathcal{G}\). Then, for every \(\epsilon \in (0, 1)\), \[ \Pr[|\|\boldsymbol{x}\|_2^2 - 1| \ge \epsilon] \le 2e^{-d\epsilon^2/8}. \]

In particular, \[ \Pr[1 - \epsilon < \|\boldsymbol{x}\|_2 < 1 + \epsilon] \ge 1 - 2e^{-d\epsilon^2/8}. \]

and from Theorem 4.8 we can also derive that for every \(\epsilon \in (0, 1)\), \[ \operatorname{Pr}\left[ |\langle \boldsymbol{u}, \boldsymbol{x} \rangle| < \epsilon \right] \geq 1 - 2e^{-d\epsilon^2/2}. \]

Theorem 4.16 Let \(d \ge 3\). Fix a unit vector \(\boldsymbol{u} \in \mathbb{R}^d\), and let \(\boldsymbol{x} \sim \mathcal{U}_B\). Then, for every \(\epsilon \in (0, 1)\),

\[ \Pr[|\langle \boldsymbol{u}, \boldsymbol{x} \rangle| < \epsilon] \ge 1 - 6e^{-(d-1)\epsilon^2/2}. \]

Corollary

Let \(d \ge 3\). Fix a unit vector \(\boldsymbol{u} \in \mathbb{R}^d\), and let \(\boldsymbol{x} \sim \mathcal{U}_B\). Then, for every \(\epsilon \in (0, 1)\), \[ \Pr\left[1 - \frac{1}{2}\epsilon^2 < \|\boldsymbol{x}\|_2 \le 1 \text{ and } |\langle \boldsymbol{u}, \boldsymbol{x} \rangle| < \epsilon\right] \ge 1 - 7e^{-(d-1)\epsilon^2/2}. \]
Fix a unit vector \(\boldsymbol{u} \in \mathbb{R}^d\), and let \(\boldsymbol{x} \sim \mathcal{G}\). Then, for every \(\epsilon \in (0, 1)\), \[ \Pr[1 - 2\epsilon < \|\boldsymbol{x}\|_2 < 1 + 2\epsilon \text{ and } |\langle \boldsymbol{u}, \boldsymbol{x} \rangle| < \epsilon] \ge 1 - 4e^{-d\epsilon^2/2}. \]
Let \(\mathbb{S}^{d-1} = \{\boldsymbol{x} \in \mathbb{R}^d : \|\boldsymbol{x}\|_2 = 1\}\) denote the unit sphere in \(\mathbb{R}^d\). Let \(\boldsymbol{x} \sim \mathcal{U}_S\) be a random vector uniformly distributed on \(\mathbb{S}^{d-1}\). Fix any deterministic unit vector \(\boldsymbol{u} \in \mathbb{R}^d\) (\(\|\boldsymbol{u}\|_2 = 1\)). Then, for every \(\epsilon \in (0, 1)\), the random inner product satisfies the following concentration inequality:

\[ \operatorname{Pr}\left[ |\langle\boldsymbol{u}, \boldsymbol{x}\rangle| < \epsilon \right] \geq 1 - 7e^{-(d-1)\epsilon^2 / 8}. \]

Random Projection

Theorem 4.17 Let \(\boldsymbol{A} \in \mathbb{R}^{k \times d}\) be a Gaussian random matrix, where every entry of \(\boldsymbol{A}\) is sampled independently from \(\mathcal{N}(0,1/k)\). For every fixed vector \(\boldsymbol{x} \in \mathbb{R}^d\),

\[ \boldsymbol{A}\boldsymbol{x} \sim \mathcal{N}\left(\mathbf{0}, \frac{\|\boldsymbol{x}\|_2^2}{k} \boldsymbol{I}\right). \]

Moreover, if \(\boldsymbol{x} \neq \mathbf{0}\), then for every \(\epsilon \in (0,1)\),

\[ \operatorname{Pr}\left[ \left| \frac{\|\boldsymbol{A}\boldsymbol{x}\|_2^2}{\|\boldsymbol{x}\|_2^2} - 1 \right| \geq \epsilon \right] \leq 2e^{-k\epsilon^2/8}. \]

Theorem 4.18 Let \(\boldsymbol{x}_1, \dots, \boldsymbol{x}_n \in \mathbb{R}^d\) be \(n\) points in \(\mathbb{R}^d\), where \(n \ge 2\). For every \(\epsilon \in (0, 1)\) and any integer \(k\) with \[ k \ge \frac{16 \log n}{\epsilon^2}, \]

there exists a linear map \(F : \mathbb{R}^d \to \mathbb{R}^k\) such that for all \(1 \le i, j \le n\) \[ (1 - \epsilon)\|\boldsymbol{x}_i - \boldsymbol{x}_j\|_2^2 \le \|F(\boldsymbol{x}_i) - F(\boldsymbol{x}_j)\|_2^2 \le (1 + \epsilon)\|\boldsymbol{x}_i - \boldsymbol{x}_j\|_2^2. \]

Probability Distances

Notation for probability and expectation under specific distributions

For a probability event \(A\), we may write \(P(A)\) as

\[ \operatorname{Pr}_{\boldsymbol{x} \sim P}[A] \]

to explicitly indicate that \(\boldsymbol{x}\) is the random variable sampled from the distribution \(P\).
For an expectation of a function \(\phi(\boldsymbol{x})\) where \(\boldsymbol{x} \sim P\), we may write \(P(\phi)\) or

\[ \mathbb{E}_{\boldsymbol{x} \sim P}[\phi(\boldsymbol{x})] \]

Total Variation Distance

Total variation distance

The total variation distance between \(P\) and \(Q\) is

\[ d_{\mathrm{TV}}(P, Q) := \sup_{A \subseteq \mathbb{R}^d} |P(A) - Q(A)|. \]

Theorem 4.19 If \(P\) and \(Q\) are discrete probability distributions with PMFs \(p : \mathcal{X} \to [0, 1]\) and \(q : \mathcal{X} \to [0, 1]\), where \(\mathcal{X}\) is a finite set of values, then

\[ d_{\mathrm{TV}}(P, Q) = \frac{1}{2} \sum_{x \in \mathcal{X}} |p(x) - q(x)|. \]

Theorem 4.20 If \(P\) and \(Q\) are continuous probability distributions with PDFs \(f\) and \(g\) on \(\mathbb{R}^d\), then

\[ d_{\mathrm{TV}}(P, Q) = \frac{1}{2} \int_{\mathbb{R}^d} |f(\boldsymbol{x}) - g(\boldsymbol{x})| \, \mathrm{d}\boldsymbol{x}. \]

Theorem 4.21 Let \(\ell < u\). For any two probability distributions \(P\) and \(Q\) on \(\mathbb{R}^d\),

\[ d_{\mathrm{TV}}(P,Q) = \frac{1}{u - \ell} \sup_{\phi \in \mathcal{F}_{\ell,u}} |P(\phi) - Q(\phi)|, \]

where \(\mathcal{F}_{\ell,u}\) denotes the family of all functions \(\phi : \mathbb{R}^d \to [\ell, u]\).

Wasserstein Distance

Lipschitz function

For \(L \geq 0\), a function \(\phi : \mathbb{R}^d \to \mathbb{R}\) is \(L\)-Lipschitz if for all \(\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^d\),

\[ |\phi(\boldsymbol{x}) - \phi(\boldsymbol{y})| \leq L\|\boldsymbol{x} - \boldsymbol{y}\|_2. \]

In particular, \(\phi\) is 1-Lipschitz if it is \(L\)-Lipschitz with \(L = 1\).

Wasserstein-1 distance

The Wasserstein-1 distance between \(P\) and \(Q\) is

\[ W_1(P,Q) := \sup_{\phi \in \operatorname{Lip}_1} |P(\phi) - Q(\phi)|, \]

where \(\operatorname{Lip}_1\) denotes the family of all 1-Lipschitz functions \(\phi : \mathbb{R}^d \to \mathbb{R}\).

The 1-Wasserstein distance can be written \[ W_1(P, Q) := \inf_{\gamma \in \Gamma(P, Q)} \mathbb{E}_{(\boldsymbol{x}, \boldsymbol{y}) \sim \gamma}\left[ \|\boldsymbol{x} - \boldsymbol{y}\|_2 \right]. \]

Appendix

Theorem 4.22 (Asymptotic expansion of the Gaussian tail probability) Let \(\bar{\Phi}(x) := \int_{x}^{+\infty} \frac{1}{\sqrt{2\pi}} e^{-u^2/2} \, \mathrm{d}u\) denote the tail probability (or complementary cumulative distribution function) of the standard Gaussian distribution \(\mathcal{N}(0, 1)\). Then for any \(x > 0\), \(\bar{\Phi}(x)\) admits the following exact algebraic representation obtained via successive integration by parts:

\[ \bar{\Phi}(x) = \frac{1}{\sqrt{2\pi}} \left( \frac{1}{x} e^{-x^2/2} - \frac{1}{x^3} e^{-x^2/2} + \int_{x}^{+\infty} \frac{3}{u^4} e^{-u^2/2} \, \mathrm{d}u \right). \]

Theorem 4.23 (Moments of a zero-mean Gaussian distribution) Let \(X \sim \mathcal{N}(0, \sigma^2)\) be a centered Gaussian random variable with mean \(0\) and variance \(\sigma^2\). Let \(M_X(t) := \mathbb{E}[e^{tX}] = e^{\sigma^2 t^2 / 2}\) denote its moment-generating function (MGF). Then for any positive integer \(k \in \mathbb{Z}^+\), the \(k\)-th raw moment of \(X\) is given by the \(k\)-th derivative of \(M_X(t)\) evaluated at \(t = 0\):

\[ \mathbb{E}[X^k] = M_X^{(k)}(0) = \begin{cases} 0, & \text{if } k \text{ is odd} \\ \frac{k!}{(k/2)!} \left( \frac{\sigma^2}{2} \right)^{k/2}, & \text{if } k \text{ is even} \end{cases}. \]

Theorem 4.24 (Fourier analysis of the Gaussian distribution and inversion) Let \(f : \mathbb{R} \to \mathbb{C}\) be a function in \(L^1(\mathbb{R})\), meaning \(\int_{-\infty}^{+\infty} |f(x)| \, \mathrm{d}x < +\infty\). The Fourier transform \(\hat{f} = \mathcal{F}[f]\) and the inverse Fourier transform \(\check{g} = \mathcal{F}^{-1}[g]\) are defined respectively as:

\[ \mathcal{F}[f](\omega) := \int_{-\infty}^{+\infty} e^{i\omega x} f(x) \, \mathrm{d}x, \quad \mathcal{F}^{-1}[g](x) := \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-i\omega x} g(\omega) \, \mathrm{d}\omega. \]

Then the following structural properties hold across probability and signal spaces:

Eigenfunction of the Fourier Transform: Let \(\varphi_\sigma(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-x^2/(2\sigma^2)}\) be the PDF of the Gaussian distribution \(\mathcal{N}(0, \sigma^2)\). Its Fourier transform is given by another Gaussian kernel:

\[ \mathcal{F}[\varphi_\sigma](\omega) = e^{-\sigma^2 \omega^2 / 2}. \]

Furthermore, the inversion identity holds perfectly everywhere: \(\mathcal{F}^{-1}\left[\mathcal{F}[\varphi_\sigma]\right] = \varphi_\sigma\).
Gaussian Smoothing and Integrability: For any \(f \in L^1(\mathbb{R})\), the Gaussian smoothing (or convolution kernel) of \(f\) with bandwidth \(\sigma > 0\) is defined as \[f_\sigma(x) := \mathbb{E}[f(x + Z)] = \int_{-\infty}^{+\infty} f(x+z)\varphi_\sigma(z)\,\mathrm{d}z, \] where \(Z \sim \mathcal{N}(0, \sigma^2)\). This smoothed function preserves integrability, i.e., \(f_\sigma \in L^1(\mathbb{R})\).
Spectral Representation of Smoothing: The Fourier transform of the Gaussian smoothing is the pointwise product of individual transforms:

\[ \mathcal{F}[f_\sigma](\omega) = \mathcal{F}[f](\omega) \cdot e^{-\sigma^2 \omega^2 / 2}. \]

This regularized spectrum satisfies the recovery condition: \(\mathcal{F}^{-1}\left[\mathcal{F}[f_\sigma]\right] = f_\sigma\).
The Fourier Inversion Theorem: If \(f \in L^1(\mathbb{R})\) and both the left and right limits (\(f_-(x)\) and \(f_+(x)\)) exist at every point in \(\mathbb{R}\), then taking the limit of bandwidth \(\sigma \to 0^+\) yields the pointwise reconstruction:

\[ \mathcal{F}^{-1}\left[\mathcal{F}[f]\right](x) = \frac{f_-(x) + f_+(x)}{2}. \]
Uniqueness of Characteristic Functions: Let \(X\) and \(Y\) be two continuous random variables with PDFs \(f_X\) and \(f_Y\). If their characteristic functions (or MGFs on the imaginary axis) coincide everywhere, meaning \(M_X(it) = M_Y(it)\) for all \(t \in \mathbb{R}\), then their probability distributions are identical:

\[ f_X(x) = f_Y(x) \quad \text{for all } x \in \mathbb{R}. \]