
1 Discrete Probability

1.1 Probability Spaces and Events

Discrete Probability Space

A discrete probability space is a pair $\mathcal{P} = (\Omega, p)$ consisting of:

  • A non-empty finite set $\Omega$, called the sample space, whose elements represent all possible outcomes of an experiment.
  • A function $p : \Omega \to [0, 1]$, called the probability mass function, which assigns probabilities to outcomes and satisfies $\sum_{\omega \in \Omega} p(\omega) = 1$. That is, the total probability of all outcomes is 1.

Event

An event $A$ is a subset of $\Omega$ ($A \subseteq \Omega$), and its probability is defined as

\begin{equation*} \Pr[A] := \sum_{\omega \in A} p(\omega). \end{equation*}
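As a quick illustration (the fair six-sided die is an assumed toy example, not from the text), both defining conditions can be checked directly in code:

```python
from fractions import Fraction

# Sample space and probability mass function for a fair six-sided die.
omega = [1, 2, 3, 4, 5, 6]
p = {w: Fraction(1, 6) for w in omega}

# The pmf must sum to 1 over the whole sample space.
assert sum(p.values()) == 1

# Event A = "the roll is even"; Pr[A] = sum of p(w) over w in A.
A = {w for w in omega if w % 2 == 0}
pr_A = sum(p[w] for w in A)
print(pr_A)  # 1/2
```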

1.2 Asymptotic Notations

Big-$\mathcal{O}$ notation

Let $f,g : \mathbb{N} \to \mathbb{R}$ be functions defined on $\mathbb{N}$. We write $f(n) = \mathcal{O}(g(n))$ if there exist constants $C > 0$ and $N \in \mathbb{N}$ such that

\begin{equation*} |f(n)| \le Cg(n), \quad \forall n \ge N. \end{equation*}

Equivalently, viewed as a class of functions,

\begin{equation*} \mathcal{O}(g) := \{f : \exists C > 0, \exists N \in \mathbb{N} \text{ s.t. } |f(n)| \le Cg(n) \text{ for all } n \ge N\}. \end{equation*}
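The definition can be verified concretely by exhibiting witness constants. Here the polynomial and the witnesses $C = 4$, $N = 6$ are my own illustrative choices (checked on a finite range, which is all code can do):

```python
# f(n) = 3n^2 + 5n + 2 is O(n^2): the witnesses C = 4, N = 6 satisfy
# |f(n)| <= C * n^2 for all n >= N (checked here up to 10^4).
def f(n):
    return 3 * n * n + 5 * n + 2

C, N = 4, 6
ok = all(abs(f(n)) <= C * n * n for n in range(N, 10_000))
print(ok)  # True
```

Note that $N = 6$ matters: at $n = 5$ we have $f(5) = 102 > 4 \cdot 25$, so the bound only kicks in beyond the threshold.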

Theorem 1.1. Let $f, g : \mathbb{N} \to \mathbb{R}$ and $c \in \mathbb{R}$. If $f(n), g(n) \ge 0$ for all $n$, then

\begin{align*} f(n) &= \mathcal{O}(f(n)) \\ c \cdot f(n) &= \mathcal{O}(f(n)) \\ \mathcal{O}(\mathcal{O}(f(n))) &= \mathcal{O}(f(n)) \\ \mathcal{O}(f(n)) + \mathcal{O}(g(n)) &= \mathcal{O}(f(n) + g(n)) \\ \mathcal{O}(f(n)) + \mathcal{O}(g(n)) &= \mathcal{O}(\max\{f(n), g(n)\}) \\ \mathcal{O}(f(n)) \cdot \mathcal{O}(g(n)) &= \mathcal{O}(f(n) \cdot g(n)) \\ \mathcal{O}(f(n) \cdot g(n)) &= f(n) \cdot \mathcal{O}(g(n)). \end{align*}

Big-$\mathcal{O}$ notation over a domain

Let $f,g : D \to \mathbb{R}$ be functions defined on a domain $D$. We say that $f(n) = \mathcal{O}(g(n))$ for all $n \in D$ if there exists a constant $C > 0$ such that

\begin{equation*} |f(n)| \le Cg(n), \quad \forall n \in D. \end{equation*}

Big-$\mathcal{O}$ notation for multivariate functions

Let $f,g : \mathbb{N}^d \to \mathbb{R}$ be functions defined on $\mathbb{N}^d$. We write

\begin{equation*} f(n_1,\dots,n_d) = \mathcal{O}(g(n_1,\dots,n_d)) \end{equation*}

if there exist constants $C > 0$ and $N \in \mathbb{N}$ such that

\begin{equation*} |f(n_1,\dots,n_d)| \le Cg(n_1,\dots,n_d), \quad \forall n_1,\dots,n_d \ge N. \end{equation*}

Big-$\mathcal{O}$ notation for $x \to 0$

Let $f,g : \mathbb{R} \to \mathbb{R}$ be functions defined on $\mathbb{R}$. We say that $f(x) = \mathcal{O}(g(x))$ as $x \to 0$ if there exist constants $C > 0$ and $\delta > 0$ such that

\begin{equation*} |f(x)| \le Cg(x), \quad \forall x \in \mathbb{R} \text{ with } 0 < |x| < \delta. \end{equation*}

Big-$\mathcal{O}$ notation defined using $\limsup$

Let $f, g : \mathbb{N} \to \mathbb{R}$, where $g(n) > 0$ for all sufficiently large $n$. We write $f(n) = \mathcal{O}(g(n))$ if

\begin{equation*} \limsup_{n \to \infty} \frac{|f(n)|}{g(n)} < +\infty. \end{equation*}

$\Omega, \Theta, \sim$ Notations

Let $f,g : \mathbb{N} \to \mathbb{R}$, where $g(n) > 0$ for all sufficiently large $n$.

  • We write $f(n) = \Omega(g(n))$ if $\liminf\limits_{n \to \infty} \frac{|f(n)|}{g(n)} > 0$.
  • We write $f(n) = \Theta(g(n))$ if $f(n) = \mathcal{O}(g(n))$ and $f(n) = \Omega(g(n))$.
  • We write $f(n) \sim g(n)$ if $\lim\limits_{n \to \infty} \frac{f(n)}{g(n)} = 1$.

Little-$o$ and Little-$\omega$ Notations

Let $f,g : \mathbb{N} \to \mathbb{R}$, where $g(n) > 0$ for all sufficiently large $n$.

  • We write $f(n) = o(g(n))$ if $\lim\limits_{n \to \infty} \frac{|f(n)|}{g(n)} = 0$.
  • We write $f(n) = \omega(g(n))$ if $\lim\limits_{n \to \infty} \frac{|f(n)|}{g(n)} = +\infty$.

Theorem 1.2. Let $f,g : \mathbb{N} \to \mathbb{R}$ with $g(n) > 0$ for all $n \ge 1$. If

\begin{equation*} f(k) = \mathcal{O}(g(k)) \quad \text{as } k \to \infty, \end{equation*}

then

\begin{equation*} \sum_{k=1}^n f(k) = \mathcal{O}\left(\sum_{k=1}^n g(k)\right) \quad \text{as } n \to \infty. \end{equation*}

Theorem 1.3. Let $f,g : \mathbb{N}^2 \to \mathbb{R}$ with $g(m,n) > 0$ for all $m,n \ge 1$. If for some function $U(m) \ge 1$ we have

\begin{equation*} f(m,k) = \mathcal{O}(g(m,k)), \quad \text{for } 1 \le k \le U(m), \text{ as } m \to \infty, \end{equation*}

then

\begin{equation*} \sum_{k=1}^n f(m,k) = \mathcal{O}\left(\sum_{k=1}^n g(m,k)\right), \quad \text{for } 1 \le n \le U(m), \text{ as } m \to \infty. \end{equation*}

1.3 Uniform Probability Means Counting

Theorem 1.4 (Addition Principle). If a set $A$ is the union of $m$ disjoint sets $S_1, \dots, S_m$, i.e., $A = S_1 \sqcup \dots \sqcup S_m$, then

\begin{align*} |A| = |S_1 \sqcup \dots \sqcup S_m| = \sum_{i=1}^m |S_i|, \end{align*}

where $\sqcup$ denotes the union of two disjoint sets.

Theorem 1.5 (Multiplication Principle). If a set $A$ is the Cartesian product of $m$ sets $S_1, \dots, S_m$, i.e., $A = S_1 \times S_2 \times \dots \times S_m$, then

\begin{align*} |A| = |S_1 \times S_2 \times \dots \times S_m| = \prod_{i=1}^m |S_i|, \end{align*}

where $\times$ denotes the Cartesian product of two sets, i.e., $A \times B = \{(a, b) : a \in A, b \in B\}$.
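A minimal sketch of both principles on an assumed toy menu (the sets `soups`, `mains`, `desserts` are illustrative, not from the text):

```python
from itertools import product

# Addition principle: choosing exactly one item overall from three
# disjoint courses gives 3 + 4 + 2 options.
soups = {"s1", "s2", "s3"}
mains = {"m1", "m2", "m3", "m4"}
desserts = {"d1", "d2"}
one_item = soups | mains | desserts  # disjoint union
assert len(one_item) == len(soups) + len(mains) + len(desserts)

# Multiplication principle: choosing one of each course gives 3 * 4 * 2 meals.
meals = set(product(soups, mains, desserts))
print(len(meals))  # 24
```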

The Secretary Problem

  • There are exactly nn options, each associated with a distinct value viRv_i \in \mathbb{R}. The options are presented in a random order σ\sigma, where σ\sigma is a permutation of [n][n] sampled uniformly at random.

  • At step tt, you observe the value of the current option vσtv_{\sigma_t} and only know the values of the first tt observed options vσ1,,vσtv_{\sigma_1}, \dots, v_{\sigma_t}. You must make an immediate decision: accept or reject the current option.

  • If an option is rejected, it cannot be recalled. If an option is accepted, the process terminates immediately and no subsequent options can be inspected or accepted.

  • The final goal is to maximize the probability of accepting the global best option, namely the option with the highest value
vmax:=maxi[n]{vi}\begin{align*} v_{\max} := \max_{i \in [n]} \{v_i\} \end{align*}

Definition (Threshold Strategy). For a fixed integer $k$ satisfying $1 \le k \le n - 1$, the $k$-strategy is an online decision strategy defined as follows:

  • Reject the first $k$ options and record the maximum value observed;
  • Starting from step $k + 1$, accept the first option whose value exceeds the recorded maximum.
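The $k$-strategy is easy to simulate. The following Monte Carlo sketch estimates its success probability at $k \approx n/e$; the parameters $n = 100$, 20000 trials, and the fixed seed are my own illustrative choices:

```python
import random

def k_strategy_succeeds(values, k):
    """Run the k-strategy on one uniformly random arrival order.

    Returns True iff the strategy accepts the globally best option."""
    order = values[:]
    random.shuffle(order)
    threshold = max(order[:k])          # best value seen in the rejection phase
    for v in order[k:]:
        if v > threshold:               # first option beating the record
            return v == max(values)
    return False                        # never accepted anything

random.seed(0)
n, trials = 100, 20000
k = round(n / 2.718281828)              # k ~ n/e, as Theorem 1.6 suggests
values = list(range(n))                 # distinct values; only ranks matter
p_hat = sum(k_strategy_succeeds(values, k) for _ in range(trials)) / trials
print(p_hat)  # close to 1/e ~ 0.368
```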

Theorem 1.6. For $n \ge 2$, the following two statements hold:

  1. the optimal success probability satisfies
\begin{align*} p_n = 1/e + \mathcal{O}\left(\frac{1}{n}\right); \end{align*}
  2. the optimal threshold satisfies
\begin{align*} k^* = n/e + \mathcal{O}(1). \end{align*}

The Locker Puzzle

  • Suppose nn students attend U.S. visa interviews together, and all students have distinct names. Each student has one bag, and the student's own name tag is attached to that bag. They are not allowed to bring personal belongings into the embassy, so everyone stores one bag in one of nn lockers. After they come out, everyone forgets which locker contains their own bag. So we have a random one-to-one assignment between name tags and lockers.

  • Students search one by one (not simultaneously): when one student finishes, all opened lockers are closed again and the locker state is restored before the next student starts. Each student may open at most n/2n/2 lockers (assume nn is an even number), and no communication is allowed during the search. The goal is to design a strategy that maximizes the probability that all students succeed in finding their own bags.

Definition (Pointer Chasing Strategy). Before opening any locker, students discuss and agree on a bijection between student names and $[n]$,

\begin{align*} \phi : \{\text{student names}\} \to [n], \end{align*}

which assigns each student a unique ID in $[n]$.

For each student $s$, let $x \leftarrow \phi(s)$. The search rule is:

  • open locker $x$;
  • if the revealed name tag is $s$, stop (success);
  • otherwise, let $x \leftarrow \phi(t)$, where $t$ is the revealed student name, and continue.

If no success occurs, stop after $n/2$ opens (failure).
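Every student succeeds exactly when the random assignment, viewed as a permutation, has no cycle longer than $n/2$: each student's pointer chase walks the cycle containing their own ID. For small $n$ this probability can be computed exactly by brute force (here $n = 8$, an assumed toy size):

```python
from itertools import permutations
from fractions import Fraction

def max_cycle_length(perm):
    """Length of the longest cycle of a permutation given as a tuple (0-indexed)."""
    seen, longest = set(), 0
    for start in range(len(perm)):
        if start in seen:
            continue
        length, x = 0, start
        while x not in seen:
            seen.add(x)
            x = perm[x]
            length += 1
        longest = max(longest, length)
    return longest

# Pointer chasing succeeds for everyone iff no cycle is longer than n/2.
n = 8
wins = sum(max_cycle_length(p) <= n // 2 for p in permutations(range(n)))
exact = Fraction(wins, 40320)                                  # 8! = 40320
formula = 1 - sum(Fraction(1, j) for j in range(n // 2 + 1, n + 1))
print(wins, exact == formula)  # 14736 True
```

The exact count matches the closed form in Theorem 1.7 because a permutation can contain at most one cycle of length $j > n/2$, and such a cycle occurs with probability exactly $1/j$.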

Theorem 1.7. For all even $n \ge 2$, the success probability of the pointer chasing strategy is

\begin{align*} \Pr[\text{success}] = 1 - \sum_{j=n/2+1}^n \frac{1}{j} = 1 - \ln 2 + \mathcal{O}\left(\frac{1}{n}\right) \approx 0.307. \end{align*}

1.4 Intersection and Union of Events

Intersection and Union of Events

For events $A, B \subseteq \Omega$:

  • The intersection of two events $A$ and $B$, denoted by $A \cap B$, is the event that both $A$ and $B$ happen.

  • The union of two events $A$ and $B$, denoted by $A \cup B$, is the event that at least one of $A, B$ happens.

As events are subsets of the sample space $\Omega$, the operations $\cap$ and $\cup$ are the same as set operations.

Lemma. For any events $A, B \subseteq \Omega$,

\begin{align*} \Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B]. \end{align*}

Theorem 1.8. For any two events $A, B \subseteq \Omega$,

\begin{align*} \Pr[A \cup B] \le \Pr[A] + \Pr[B]. \end{align*}

Theorem 1.9. For any sequence of events $A_1, \dots, A_m \subseteq \Omega$ with $m \ge 2$,

\begin{align*} \Pr\left[\bigcup_{i=1}^m A_i\right] \le \sum_{i=1}^m \Pr[A_i]. \end{align*}

Hashing Collision

Given a set $S = \{(k_i, v_i)\}_{i=1}^n$ of $n$ key-value pairs, a dictionary is a data structure that stores these pairs and supports queries of the form: given a key $k$, return its corresponding value $v$ (if it exists). We assume that all the given keys are distinct and are taken from a key space $K$. Each key is represented as an integer in the range $\{0, 1, \dots, |K| - 1\}$.

Definition (Hash Table). Choose a hash function $h : K \to \{0, 1, \dots, m - 1\}$ and allocate an array of $m$ buckets to store the key-value pairs, where the $i$-th key-value pair is stored in the $h(k_i)$-th bucket. To answer a query for a key $k$, compute $h(k)$ and look up the value in the $h(k)$-th bucket. This yields $\mathcal{O}(1)$ query time while using only $\Theta(m)$ space for storing the dictionary.

Definition (Linear Hash Function). A linear hash function is a map

\begin{align*} h_{a,b,p,m} : K \to \mathbb{Z}_m, \quad x \mapsto ((ax + b) \bmod p) \bmod m, \end{align*}

where $a, b$ are integers, $p$ is a large prime number such that $K \subseteq \mathbb{Z}_p$, and $m$ is a positive integer with $m \le p$. Here $\mathbb{Z}_m$ denotes the integers modulo $m$, i.e., $\mathbb{Z}_m = \{0, 1, \dots, m - 1\}$, and $z \bmod p$ denotes the remainder of $z$ modulo $p$, which takes values in $\mathbb{Z}_p$. Since the final result is taken modulo $m$, the hash function always produces a value in $\mathbb{Z}_m$.

The corresponding linear hash family is defined as follows:

\begin{align*} \mathcal{H}_{p,m} := \{h_{a,b,p,m} : a \in \mathbb{Z}_p \setminus \{0\}, b \in \mathbb{Z}_p\}. \end{align*}

That is, $\mathcal{H}_{p,m}$ is the family of linear hash functions with fixed prime $p$ and bucket count $m$.

Theorem 1.10. For any two distinct keys $k, k' \in K$, if a hash function $h$ is sampled uniformly at random from $\mathcal{H}_{p,m}$, then

\begin{align*} \Pr[h(k) = h(k')] \le \frac{1}{m}. \end{align*}

Definition. A hash family $\mathcal{H}$ is a set of hash functions $h : K \to \{0, 1, \dots, m - 1\}$. We say that $\mathcal{H}$ is universal if for any two distinct keys $k, k' \in K$, when $h$ is sampled uniformly at random from $\mathcal{H}$,

\begin{align*} \Pr[h(k) = h(k')] \le \frac{1}{m}. \end{align*}

Theorem 1.11. For any universal hash family $\mathcal{H}$ and any $n$ distinct keys $k_1, \dots, k_n \in K$, the probability of the event $C$ that some two of the keys collide admits the upper bound

\begin{align*} \Pr[C] \le \frac{1}{m} \binom{n}{2}. \end{align*}

Ramsey Numbers via the Probabilistic Method

Theorem 1.12. Among any 6 people, there always exist either 3 mutual friends or 3 mutual strangers.

Definition. $K_n$ denotes the complete graph on $n$ vertices, and we say that a complete graph $K_n$ is monochromatic if all its edges have the same color.

Theorem 1.13 (Ramsey's theorem). For every integer $k \ge 2$, there exists an integer $N > 0$ such that every red/blue coloring of the edges of $K_N$ contains a monochromatic $K_k$.

Definition. The Ramsey number $R(k)$ is the smallest integer $N$ such that every red/blue coloring of the edges of $K_N$ contains a monochromatic $K_k$.

Theorem 1.14 (Ramsey's theorem for hypergraphs). For any integers $r \ge 1$ and $k \ge m \ge 1$, there exists an integer $N$ such that for every coloring of the $m$-element subsets of an $N$-element set with $r$ colors, there exists a $k$-element subset whose $m$-element subsets are all assigned the same color.

Theorem 1.15. For all $k \ge 3$,

\begin{align*} 2^{k/2} < R(k) < \binom{2k - 2}{k - 1} < 4^k. \end{align*}

Thus $R(k) = 2^{\Theta(k)}$.

1.5 Conditional Probability and Independence

Conditional Probability

Let $A, B$ be events with $\Pr[B] > 0$. The conditional probability of $A$ given $B$ is defined as

\begin{align*} \Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}. \end{align*}

We leave $\Pr[A \mid B]$ undefined if $\Pr[B] = 0$.

Independence of Events

Two events $A, B$ are said to be independent if

\begin{align*} \Pr[A \cap B] &= \Pr[A] \Pr[B]. \end{align*}

If $\Pr[B] > 0$, this is equivalent to

\begin{align*} \Pr[A \mid B] &= \Pr[A]. \end{align*}

Mutual Independence

Events $A_1, \ldots, A_m$ are mutually independent if for every non-empty subset $I \subseteq [m]$,

\begin{align*} \Pr\left[\bigcap_{i \in I} A_i\right] &= \prod_{i \in I} \Pr[A_i]. \end{align*}

Theorem 1.16. For $m \ge 2$, let $A_1, \ldots, A_m$ be events with all relevant conditional probabilities well-defined. Then

\begin{align*} \Pr\left[\bigcap_{i=1}^m A_i\right] &= \Pr[A_1] \prod_{i=2}^m \Pr\left[A_i \mid \bigcap_{j=1}^{i-1} A_j\right]. \end{align*}

Theorem 1.17. Let $B_1, \dots, B_m$ be mutually exclusive events with $\bigcup_{i=1}^m B_i = \Omega$ and $\Pr[B_i] > 0$ for all $i \in [m]$. Then for any event $A$,

\begin{align*} \Pr[A] &= \sum_{i=1}^m \Pr[A \mid B_i]\Pr[B_i]. \end{align*}

1.6 Random Variables

Random Variable

Let $\mathcal{P} = (\Omega, p)$ be a discrete probability space. A random variable is a function

\begin{align*} X : \Omega \to \mathbb{R}. \end{align*}

Probability Distribution

A random variable defined on a probability space induces a probability distribution on the real numbers. Here, the probability distribution of $X$ is defined as a function that assigns a probability to each subset of possible values of $X$. That is, for any subset $A \subseteq \mathbb{R}$,

\begin{align*} \Pr[X \in A] &= \sum_{\omega \in \Omega} \mathbb{1}_{[X(\omega) \in A]}\, p(\omega). \end{align*}

Indicator Function

The indicator function of a condition $C$ is

\begin{align*} \mathbb{1}_{[C]} &= \begin{cases} 1 & \text{if the condition } C \text{ holds}, \\ 0 & \text{otherwise}. \end{cases} \end{align*}

For a set $A \subseteq U$, where $U$ is some universe, the indicator function of $A$ is

\begin{align*} \mathbb{1}_A : U \to \{0, 1\}, \quad \mathbb{1}_A(x) &= \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{otherwise}. \end{cases} \end{align*}

Probability Mass Function

For a random variable $X$, its probability mass function $p_X : \mathbb{R} \to [0, 1]$ is defined as

\begin{align*} p_X(x) &:= \Pr[X = x], \quad \forall x \in \mathbb{R}. \end{align*}

Expectation

Let $X$ be a random variable on a discrete probability space $(\Omega, p)$. Its expectation is

\begin{align*} \mathbb{E}[X] &= \sum_{\omega \in \Omega} X(\omega) p(\omega). \end{align*}

Theorem 1.18. For any random variables $X_1, \dots, X_n$ and constants $c_1, \dots, c_n \in \mathbb{R}$,

\begin{align*} \mathbb{E}\left[\sum_{i=1}^n c_i X_i\right] &= \sum_{i=1}^n c_i \mathbb{E}[X_i]. \end{align*}

Theorem 1.19. If $X$ is a random variable that takes values in the non-negative integers, then

\begin{align*} \mathbb{E}[X] &= \sum_{t \ge 0} \Pr[X > t]. \end{align*}
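The tail-sum formula of Theorem 1.19 can be verified exactly on a small example (the fair six-sided die is an assumed toy case):

```python
from fractions import Fraction

# E[X] computed directly, and again via E[X] = sum_{t >= 0} Pr[X > t].
omega = range(1, 7)
p = Fraction(1, 6)

expectation = sum(w * p for w in omega)
tail_sum = sum(
    sum(p for w in omega if w > t)   # Pr[X > t]
    for t in range(0, 6)             # all terms with t >= 6 vanish
)
print(expectation, tail_sum)  # 7/2 7/2
```

Concretely, the tail probabilities are $1, \frac{5}{6}, \frac{4}{6}, \frac{3}{6}, \frac{2}{6}, \frac{1}{6}$, which sum to $\frac{21}{6} = \frac{7}{2}$.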

Conditional Expectation

Let $X$ be a random variable and $B$ be an event with $\Pr[B] > 0$. The conditional expectation of $X$ given $B$ is defined as

\begin{align*} \mathbb{E}[X \mid B] &= \frac{\mathbb{E}[\mathbb{1}_B \cdot X]}{\Pr[B]}. \end{align*}

Conditional Expectation Given a Random Variable

Let $X$ and $Y$ be random variables. The conditional expectation of $X$ given $Y$ is defined as the following random variable:

\begin{align*} \mathbb{E}[X \mid Y] : \Omega \to \mathbb{R}, \quad \omega \mapsto \mathbb{E}[X \mid Y = Y(\omega)]. \end{align*}

Theorem 1.20. Let $X$ be a random variable and $B_1, \dots, B_m$ be mutually exclusive events with $\bigcup_{i=1}^m B_i = \Omega$ and $\Pr[B_i] > 0$ for all $i \in [m]$. Then

\begin{align*} \mathbb{E}[X] &= \sum_{i=1}^m \mathbb{E}[X \mid B_i] \Pr[B_i]. \end{align*}

Theorem 1.21. Let $X$ and $Y$ be random variables. Then

\begin{align*} \mathbb{E}[X] &= \mathbb{E}[\mathbb{E}[X \mid Y]]. \end{align*}
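The tower property of Theorem 1.21 can be checked exactly on a small product space. The setup below (two fair dice, $Y$ the first die, $X$ the sum) is an assumed toy example:

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of two fair dice, each outcome has mass 1/36.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)
X = {w: w[0] + w[1] for w in omega}   # X = sum of the two dice
Y = {w: w[0] for w in omega}          # Y = first die

def cond_exp_X_given(y):
    """E[X | Y = y] = E[1_{Y=y} * X] / Pr[Y = y]."""
    num = sum(X[w] * p for w in omega if Y[w] == y)
    den = sum(p for w in omega if Y[w] == y)
    return num / den

lhs = sum(X[w] * p for w in omega)                    # E[X]
rhs = sum(cond_exp_X_given(Y[w]) * p for w in omega)  # E[E[X | Y]]
print(lhs, rhs)  # 7 7
```

Here $\mathbb{E}[X \mid Y = y] = y + 7/2$, and averaging over $Y$ recovers $\mathbb{E}[X] = 7$.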

Independence of Random Variables

We say that random variables $X$ and $Y$ are independent if for all sets $A, B \subseteq \mathbb{R}$,

\begin{align*} \Pr[X \in A, Y \in B] &= \Pr[X \in A]\Pr[Y \in B]. \end{align*}

Equivalently, $X$ and $Y$ are independent if for all $a, b$,

\begin{align*} \Pr[X = a, Y = b] &= \Pr[X = a]\Pr[Y = b]. \end{align*}

Theorem 1.22. If $X, Y$ are independent random variables, then

\begin{align*} \mathbb{E}[XY] &= \mathbb{E}[X]\mathbb{E}[Y]. \end{align*}

Mutual Independence of Random Variables

We say that random variables $X_1, \dots, X_n$ are mutually independent if for all sets $A_1, \dots, A_n \subseteq \mathbb{R}$,

\begin{align*} \Pr[X_1 \in A_1, \dots, X_n \in A_n] &= \prod_{i=1}^n \Pr[X_i \in A_i]. \end{align*}

Equivalently, $X_1, \dots, X_n$ are mutually independent if for all $a_1, \dots, a_n$,

\begin{align*} \Pr[X_1 = a_1, \dots, X_n = a_n] &= \prod_{i=1}^n \Pr[X_i = a_i]. \end{align*}

Data Deduplication via Hashing

Definition. Let $S(D)$ be the set of 5-grams of a document $D$. Define the Jaccard similarity between two documents $D_1$ and $D_2$ as

\begin{align*} J(D_1, D_2) &= \frac{|S(D_1) \cap S(D_2)|}{|S(D_1) \cup S(D_2)|}. \end{align*}

Fix a threshold $\tau \in (0, 1)$ and a tolerance $\epsilon \in (0, 1)$. Given a corpus of $n$ documents $D_1, \dots, D_n$, we build a graph where each document is a node. For any two documents $D_1$ and $D_2$, we compute their Jaccard similarity $J(D_1, D_2)$ and distinguish three cases:

  • If $J(D_1, D_2) \ge \tau + \epsilon$, we add an edge between $D_1$ and $D_2$.
  • If $J(D_1, D_2) \le \tau - \epsilon$, we do nothing.
  • If $\tau - \epsilon < J(D_1, D_2) < \tau + \epsilon$, it is acceptable either to add an edge between $D_1$ and $D_2$ or not.

Definition (MinHash). First, we collect all 5-grams appearing in the $n$ documents of the corpus, and denote this set by $\mathcal{G}$. Next, we sample a random permutation of $\mathcal{G}$ and construct a hash function $h$ by mapping each 5-gram to its rank in the permutation. Finally, for each document $D$, the MinHash value of $D$ is defined as

\begin{align*} \text{MinHash}(D; h) &= \min\{h(x) : x \in S(D)\}. \end{align*}

Theorem 1.23. For any two documents $D, D'$ in the corpus,

\begin{align*} \Pr[\text{MinHash}(D; h) = \text{MinHash}(D'; h)] &= J(D, D'). \end{align*}
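Because the probability in Theorem 1.23 is over the random permutation, it can be verified exactly by enumerating all permutations of a tiny universe. The universe of size 6 and the two "documents" below are assumed toy data:

```python
from itertools import permutations

# Two toy 5-gram sets over a 6-element universe; J = |{1,2}| / |{0,1,2,3}| = 0.5.
universe = list(range(6))
S1, S2 = {0, 1, 2}, {1, 2, 3}
jaccard = len(S1 & S2) / len(S1 | S2)

matches = 0
for perm in permutations(universe):
    rank = {x: i for i, x in enumerate(perm)}          # the hash function h
    if min(rank[x] for x in S1) == min(rank[x] for x in S2):
        matches += 1

print(matches / 720, jaccard)  # 0.5 0.5
```

The MinHash values agree exactly when the first element of $S_1 \cup S_2$ in the permutation lies in $S_1 \cap S_2$, which happens for $|S_1 \cap S_2| / |S_1 \cup S_2|$ of the permutations.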

Definition (MinHash Signature). Fix integers $b$ (the number of bands) and $r$ (the number of rows per band), and sample $b \cdot r$ independent random hash functions $h_j^{(i)}$. Define

\begin{align*} H_j^{(i)}(D) &= \text{MinHash}(D; h_j^{(i)}) \in \mathbb{Z} \quad \text{for } i \in [b] \text{ and } j \in [r]. \end{align*}

We call this vector $H(D)$ the MinHash signature of $D$.

For two documents $D$ and $D'$, we say that their MinHash signatures collide if there exists an index $i \in [b]$ such that

\begin{align*} H^{(i)}(D) &= H^{(i)}(D'). \end{align*}

We write $H(D) \sim H(D')$ when the two signatures collide.

Theorem 1.24. For any two documents $D, D'$ in the corpus,

\begin{align*} \Pr[H(D) \sim H(D')] &= 1 - (1 - J(D, D')^r)^b. \end{align*}

Definition (The Banding Method).

The following algorithm clusters documents with colliding MinHash signatures into the same group.

  • Compute the MinHash signature $H(D)$ for each document $D$.
  • Build an empty graph $G$ with $n$ vertices.
  • For each band $i \in [b]$:
    • Build an empty hash table $\mathcal{T}_i$ with $m = \Theta(n)$ buckets.
    • For each document $D$:
      • If the key $H^{(i)}(D)$ is already present in $\mathcal{T}_i$, select one stored document $D'$ and add an edge between $D$ and $D'$ in $G$.
      • Otherwise, insert $D$ into $\mathcal{T}_i$ with key $H^{(i)}(D)$.
  • Identify all connected components of $G$ using breadth-first search (BFS).
  • Return the documents grouped by these connected components.
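The whole pipeline above can be sketched compactly. The toy corpus, the band parameters $b = 4$, $r = 2$, and the use of Python dicts in place of the hash tables $\mathcal{T}_i$ are illustrative assumptions of mine, not choices from the text:

```python
import random
from collections import defaultdict

def five_grams(text):
    return {text[i:i + 5] for i in range(len(text) - 4)}

def signature(grams, bands):
    """MinHash signature: one min-rank per hash function, grouped into bands."""
    return [tuple(min(rank[g] for g in grams) for rank in band) for band in bands]

random.seed(1)
docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over the lazy dog",  # exact duplicate
    "d3": "AAAABBBBCCCCDDDDAAAABBBB",                     # shares no 5-gram
}
grams = {name: five_grams(t) for name, t in docs.items()}
universe = sorted(set().union(*grams.values()))

# Sample b * r independent random permutations (rank dicts) of the universe.
b, r = 4, 2
bands = []
for _ in range(b):
    band = []
    for _ in range(r):
        order = random.sample(universe, len(universe))
        band.append({g: i for i, g in enumerate(order)})
    bands.append(band)

sigs = {name: signature(g, bands) for name, g in grams.items()}

# Band-wise bucketing: documents sharing a band key get an edge.
adj = defaultdict(set)
for i in range(b):
    buckets = {}
    for name in docs:
        key = sigs[name][i]
        if key in buckets:
            adj[name].add(buckets[key])
            adj[buckets[key]].add(name)
        else:
            buckets[key] = name

# Connected components of the collision graph.
groups, seen = [], set()
for name in docs:
    if name in seen:
        continue
    comp, queue = set(), [name]
    while queue:
        u = queue.pop()
        if u in seen:
            continue
        seen.add(u)
        comp.add(u)
        queue.extend(adj[u])
    groups.append(comp)

print(sorted(map(sorted, groups)))  # d1 and d2 grouped together; d3 alone
```

The exact duplicates collide in every band, while `d3`, whose 5-gram set is disjoint from the others, can never share a MinHash value, so the grouping is deterministic here despite the random permutations.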