1.1 Probability Spaces and Events
Discrete Probability Space
A discrete probability space is a pair P=(Ω,p) consisting of:
- A non-empty finite set Ω, called the sample space, whose elements represent all possible outcomes of an experiment.
- A function p:Ω→[0,1], called the probability mass function, which assigns probabilities to outcomes and satisfies ∑ω∈Ωp(ω)=1. That is, the total probability of all outcomes is 1.
Event
An event A is a subset of Ω (A⊆Ω), and its probability is defined as
Pr[A] := ∑_{ω∈A} p(ω).
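The two definitions above translate directly into code. A minimal sketch for a fair six-sided die (the die itself is an illustrative assumption, not part of the notes):

```python
from fractions import Fraction

# The discrete probability space of one fair six-sided die:
# Omega is the sample space; p assigns each outcome probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
p = {w: Fraction(1, 6) for w in omega}
assert sum(p.values()) == 1  # total probability is 1

def pr(event):
    """Pr[A] = sum of p(w) over outcomes w in the event A ⊆ Omega."""
    return sum(p[w] for w in event)

even = {w for w in omega if w % 2 == 0}  # the event "the roll is even"
print(pr(even))  # 1/2
```

Using exact `Fraction` arithmetic makes the normalization check `∑p(ω) = 1` exact rather than approximate.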
1.2 Asymptotic Notations
Big-O notation
Let f,g:N→R be functions defined on N. We write f(n)=O(g(n)) if there exist constants C>0 and N∈N such that
∣f(n)∣≤Cg(n),∀n≥N.
O(g):={f:∃C>0,∃N∈N s.t. ∣f(n)∣≤Cg(n), for all n≥N}.
Theorem 1.1. Let f,g:N→R and c∈R. If f(n),g(n)≥0 for all n, then
f(n) = O(f(n)),
c·f(n) = O(f(n)),
O(O(f(n))) = O(f(n)),
O(f(n)) + O(g(n)) = O(f(n) + g(n)),
O(f(n)) + O(g(n)) = O(max{f(n), g(n)}),
O(f(n))·O(g(n)) = O(f(n)·g(n)),
O(f(n)g(n)) = f(n)·O(g(n)).
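The definition of O can be sanity-checked numerically (not proved) by exhibiting witnesses C and N for a concrete pair of functions; the functions and constants below are illustrative:

```python
# Sanity check, not a proof: 3n^2 + 5n = O(n^2) with witnesses C = 4, N = 5,
# since 3n^2 + 5n <= 4n^2 iff 5n <= n^2 iff n >= 5.
f = lambda n: 3 * n**2 + 5 * n
g = lambda n: n**2
C, N = 4, 5
assert all(abs(f(n)) <= C * g(n) for n in range(N, 10_000))
```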
Big-O notation over a domain
Let f,g:D→R be functions defined on a domain D. We say that f(n)=O(g(n)) for all n∈D if there exists a constant C>0 such that
∣f(n)∣≤Cg(n),∀n∈D.
Big-O notation for multivariate functions
Let f,g:Nd→R be functions defined on Nd. We write
f(n1,…,nd)=O(g(n1,…,nd))
if there exist constants C>0 and N∈N such that
∣f(n1,…,nd)∣≤Cg(n1,…,nd),∀n1,…,nd≥N.
Big-O notation for x→0
Let f,g:R→R be functions defined on R. We say that f(x)=O(g(x)) as x→0 if there exist constants C>0 and δ>0 such that
∣f(x)∣≤Cg(x),∀x∈R with 0<∣x∣<δ.
Big-O notation defined using limsup
Let f,g:N→R, where g(n)>0 for all sufficiently large n. We write f(n)=O(g(n)) if
limsup_{n→∞} |f(n)|/g(n) < +∞.
Ω,Θ,∼ Notations
Let f,g:N→R, where g(n)>0 for all sufficiently large n.
- We write f(n)=Ω(g(n)) if liminf_{n→∞} |f(n)|/g(n) > 0.
- We write f(n)=Θ(g(n)) if f(n)=O(g(n)) and f(n)=Ω(g(n)).
- We write f(n)∼g(n) if lim_{n→∞} |f(n)|/g(n) = 1.
Little-o and Little-ω Notations
Let f,g:N→R, where g(n)>0 for all sufficiently large n.
- We write f(n)=o(g(n)) if lim_{n→∞} |f(n)|/g(n) = 0.
- We write f(n)=ω(g(n)) if lim_{n→∞} |f(n)|/g(n) = +∞.
Theorem 1.2. Let f,g:N→R with g(n)>0 for all n≥1. If
f(k) = O(g(k)) as k → ∞,
then
∑_{k=1}^n f(k) = O(∑_{k=1}^n g(k)) as n → ∞.
Theorem 1.3. Let f,g:N2→R with g(m,n)>0 for all m,n≥1. If for some function U(m)≥1, we have
f(m,k) = O(g(m,k)) for 1 ≤ k ≤ U(m), as m → ∞,
then
∑_{k=1}^n f(m,k) = O(∑_{k=1}^n g(m,k)) for 1 ≤ n ≤ U(m), as m → ∞.
Theorem 1.4 (Addition Principle). If a set A is the union of m disjoint sets S1,…,Sm, i.e., A=S1⊔⋯⊔Sm, then
|A| = |S_1 ⊔ ⋯ ⊔ S_m| = ∑_{i=1}^m |S_i|,
where ⊔ denotes the union of two disjoint sets.
Theorem 1.5 (Multiplication Principle). If a set A is the Cartesian product of m sets S1,…,Sm, i.e., A=S1×S2×⋯×Sm, then
|A| = |S_1 × S_2 × ⋯ × S_m| = ∏_{i=1}^m |S_i|,
where × denotes the Cartesian product of two sets, i.e., A×B={(a,b):a∈A,b∈B}.
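Both counting principles can be checked on small concrete sets (the sets below are illustrative):

```python
from itertools import product

S1, S2, S3 = {"a", "b"}, {0, 1, 2}, {"x"}

# Multiplication principle: |S1 x S2 x S3| = |S1|*|S2|*|S3|.
A = set(product(S1, S2, S3))
assert len(A) == len(S1) * len(S2) * len(S3) == 6

# Addition principle: for disjoint sets, |S1 ⊔ S2| = |S1| + |S2|.
D1, D2 = {1, 2, 3}, {4, 5}
assert D1.isdisjoint(D2)
assert len(D1 | D2) == len(D1) + len(D2) == 5
```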
The Secretary Problem
- There are exactly n options, each associated with a distinct value v_i ∈ R. The options are presented in a random order σ, where σ is a permutation of [n] sampled uniformly at random.
- At step t, you observe the value v_{σ(t)} of the current option and only know the values v_{σ(1)}, …, v_{σ(t)} of the first t observed options. You must make an immediate decision: accept or reject the current option.
- If an option is rejected, it cannot be recalled. If an option is accepted, the process terminates immediately and no subsequent options can be inspected or accepted.
- The final goal is to maximize the probability of accepting the global best option, namely the option with the highest value v_max := max_{i∈[n]} v_i.
Definition (Threshold Strategy). For a fixed integer k satisfying 1≤k≤n−1, the k-strategy is an online decision strategy defined as follows:
- Reject the first k options and record the maximum value observed;
- Starting from step k+1, accept the first option whose value exceeds the recorded maximum.
Theorem 1.6. For n≥2, the following two statements hold:
- the optimal success probability satisfies
p_n = 1/e + O(1/n);
- the optimal threshold satisfies
k* = n/e + O(1).
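The k-strategy is easy to simulate; the sketch below (function name, trial count, and seed are ours) estimates its success probability at the near-optimal threshold k ≈ n/e from Theorem 1.6:

```python
import math
import random

def k_strategy_success(n, k, trials=20_000, seed=0):
    """Estimate by simulation the probability that the k-strategy
    accepts the global best option: reject the first k options, then
    accept the first option exceeding the maximum seen so far."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        order = rng.sample(range(n), n)  # distinct values in uniform random order
        threshold = max(order[:k])
        for v in order[k:]:
            if v > threshold:
                wins += (v == n - 1)  # success iff we accepted the global best
                break
    return wins / trials

n = 100
k = round(n / math.e)  # near-optimal threshold k* ≈ n/e (Theorem 1.6)
print(k_strategy_success(n, k))  # Theorem 1.6 predicts ≈ 1/e ≈ 0.368
```

If the global best lands among the first k options, no later option ever exceeds the threshold and the strategy fails, which the simulation counts correctly.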
The Locker Puzzle
- Suppose n students attend U.S. visa interviews together, and all students have distinct names. Each student has one bag, with the student's own name tag attached to it. They are not allowed to bring personal belongings into the embassy, so everyone stores one bag in one of n lockers. After they come out, everyone forgets which locker contains their own bag, so we have a random one-to-one assignment between name tags and lockers.
- Students search one by one (not simultaneously): when one student finishes, all opened lockers are closed again and the locker state is restored before the next student starts. Each student may open at most n/2 lockers (assume n is even), and no communication is allowed during the search. The goal is to design a strategy that maximizes the probability that all students succeed in finding their own bags.
Definition (Pointer Chasing Strategy). Before opening any locker, students discuss and agree on a bijection between student names and [n],
ϕ:{student names}→[n],
which assigns each student a unique ID in [n].
For each student s, let x←ϕ(s). The search rule is:
- open locker x;
- if the revealed name tag is s, stop (success);
- otherwise, let x←ϕ(t), where t is the revealed student name, and continue.
If no success occurs, stop after n/2 opens (failure).
Theorem 1.7. For all even n≥2, the success probability of the pointer chasing strategy is
Pr[success] = 1 − ∑_{j=n/2+1}^{n} 1/j = 1 − ln 2 + O(1/n) ≈ 0.307.
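The analysis behind Theorem 1.7 rests on the observation that pointer chasing succeeds for everyone exactly when every cycle of the random assignment has length at most n/2. A simulation sketch using this fact (function name, trial count, and seed are ours):

```python
import random

def pointer_chasing_success(n, trials=50_000, seed=0):
    """Estimate the probability that ALL n students succeed. Pointer
    chasing succeeds for everyone exactly when every cycle of the
    random bag-to-locker assignment has length at most n/2."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        perm = rng.sample(range(n), n)  # locker x holds the bag of student perm[x]
        seen = [False] * n
        ok = True
        for s in range(n):
            if seen[s]:
                continue
            length, x = 0, s
            while not seen[x]:  # walk the cycle containing s once
                seen[x] = True
                length += 1
                x = perm[x]
            if length > n // 2:
                ok = False
                break
        wins += ok
    return wins / trials

print(pointer_chasing_success(50))  # ≈ 1 - ln 2 + O(1/n) by Theorem 1.7
```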
1.4 Intersection and Union of Events
Intersection and Union of Events
For events A,B⊆Ω:
- The intersection of two events A and B, denoted by A∩B, is the event that both A and B happen.
- The union of two events A and B, denoted by A∪B, is the event that at least one of A,B happens.
As events are subsets of the sample space Ω, the operations ∩ and ∪ are the same as set operations.
Lemma. For any events A,B⊆Ω,
Pr[A∪B]=Pr[A]+Pr[B]−Pr[A∩B].
Theorem 1.8. For any two events A,B⊆Ω,
Pr[A∪B]≤Pr[A]+Pr[B].
Theorem 1.9. For any sequence of events A1,…,Am⊆Ω with m≥2,
Pr[⋃_{i=1}^m A_i] ≤ ∑_{i=1}^m Pr[A_i].
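The lemma and the union bound can be checked exhaustively on a small space; the die and events below are illustrative:

```python
from fractions import Fraction

# Check Pr[A ∪ B] = Pr[A] + Pr[B] - Pr[A ∩ B] and the union bound
# (Theorem 1.8) on the space of one fair die.
omega = set(range(1, 7))
p = {w: Fraction(1, 6) for w in omega}
pr = lambda A: sum(p[w] for w in A)

A = {w for w in omega if w % 2 == 0}  # the roll is even
B = {w for w in omega if w >= 4}      # the roll is at least 4

assert pr(A | B) == pr(A) + pr(B) - pr(A & B)  # inclusion-exclusion
assert pr(A | B) <= pr(A) + pr(B)              # union bound
```

Here A and B overlap (both contain 4 and 6), so the union bound is strict while inclusion-exclusion is exact.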
Hashing Collision
Given a set S = {(k_i, v_i)}_{i=1}^n of n key-value pairs, a dictionary is a data structure that stores these pairs and supports queries of the form: given a key k, return its corresponding value v (if it exists). We assume that all the given keys are distinct and are taken from a key space K. Each key is represented as an integer in the range {0,1,…,|K|−1}.
Definition (Hash Table). Choose a hash function h:K→{0,1,…,m−1} and allocate an array of m buckets to store the key-value pairs, where the i-th key-value pair is stored at the h(k_i)-th bucket. To answer a query for a key k, compute h(k) and look up the value at the h(k)-th bucket. This yields O(1) query time while using only Θ(m) space for storing the dictionary.
Definition (Linear Hash Function). Here
h_{a,b,p,m} : K → Z_m, x ↦ ((ax+b) mod p) mod m,
where a,b are integers, p is a large prime number such that K⊆Z_p, and m is a positive integer with m≤p. Here Z_m is the set of integers modulo m, i.e., Z_m = {0,1,…,m−1}. The operation z mod p denotes the remainder of z modulo p, which takes values in Z_p. Since the final result is taken modulo m, the hash function always produces a value in Z_m.
The corresponding linear hash family is defined as follows:
H_{p,m} := {h_{a,b,p,m} : a ∈ Z_p∖{0}, b ∈ Z_p}.
That is, H_{p,m} is a family of linear hash functions with fixed prime p and bucket number m.
Theorem 1.10. For any two distinct integers k,k′∈K, if a hash function h is sampled uniformly at random from Hp,m, then
Pr[h(k) = h(k′)] ≤ 1/m.
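For small parameters, Theorem 1.10 can be checked exhaustively over the whole family (p = 101, m = 10, and the two keys are illustrative choices):

```python
# Exhaustively check Theorem 1.10 over the whole family H_{p,m}.
p, m = 101, 10
k1, k2 = 3, 77  # two fixed distinct keys in Z_p

collisions = total = 0
for a in range(1, p):        # a in Z_p \ {0}
    for b in range(p):       # b in Z_p
        h = lambda x: ((a * x + b) % p) % m  # h_{a,b,p,m}
        collisions += h(k1) == h(k2)
        total += 1
print(collisions / total)    # Theorem 1.10 guarantees this is <= 1/m = 0.1
```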
Definition. A hash family H is a set of hash functions h:K→{0,1,…,m−1}. We say that H is universal if for any two distinct keys k,k′∈K, when h is sampled uniformly at random from H,
Pr[h(k) = h(k′)] ≤ 1/m.
Theorem 1.11. For any universal hash family H and any n distinct keys k_1,…,k_n ∈ K, let C be the event that some two keys collide, i.e., h(k_i) = h(k_j) for some i ≠ j. Then, when h is sampled uniformly at random from H,
Pr[C] ≤ \binom{n}{2} · 1/m.
Ramsey Numbers via the Probabilistic Method
Theorem 1.12. Among any 6 people, there always exist either 3 mutual friends or 3 mutual strangers.
Definition. K_n denotes the complete graph on n vertices, and we say that a complete graph K_n is monochromatic if all its edges have the same color.
Theorem 1.13 (Ramsey's theorem). For every integer k≥2, there exists an integer N>0 such that every red/blue coloring of the edges of KN contains a monochromatic Kk.
Definition. The Ramsey number R(k) is the smallest integer N such that every red/blue coloring of the edges of KN contains a monochromatic Kk.
Theorem 1.14 (Ramsey's theorem for hypergraphs). For any integers r≥1 and k≥m≥1, there exists an integer N such that for every coloring of the m-element subsets of an N-element set with r colors, there exists a k-element subset whose m-element subsets are all assigned the same color.
Theorem 1.15. For all k≥3,
2^{k/2} < R(k) < \binom{2k−2}{k−1} < 4^k.
Thus R(k) = 2^{Θ(k)}.
1.5 Conditional Probability and Independence
Conditional Probability
Let A,B be events with Pr[B]>0. The conditional probability of A given B is defined as
Pr[A∣B] = Pr[A∩B] / Pr[B].
We leave Pr[A∣B] undefined if Pr[B]=0.
Independence of Events
Two events A,B are said to be independent if
Pr[A∩B]=Pr[A]Pr[B].
If Pr[B]>0, this is equivalent to
Pr[A∣B]=Pr[A].
Mutual Independence
Events A1,…,Am are mutually independent if for every non-empty subset I⊆[m],
Pr[⋂_{i∈I} A_i] = ∏_{i∈I} Pr[A_i].
Theorem 1.16. For m≥2, let A1,…,Am be events with all relevant conditional probabilities well-defined. Then
Pr[⋂_{i=1}^m A_i] = Pr[A_1] · ∏_{i=2}^m Pr[A_i | ⋂_{j=1}^{i−1} A_j].
Theorem 1.17. Let B1,…,Bm be mutually exclusive events with ⋃i=1mBi=Ω and Pr[Bi]>0 for all i∈[m]. Then for any event A,
Pr[A] = ∑_{i=1}^m Pr[A∣B_i] · Pr[B_i].
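Theorem 1.17 can be verified exactly on a small space; the two-dice example and event names below are illustrative:

```python
from fractions import Fraction

# Check the law of total probability (Theorem 1.17) on two fair dice,
# partitioning Omega by the events B_i = {the first die shows i}.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pr = lambda event: Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] + w[1] == 7  # the sum of the two dice is 7
total = sum(
    (pr(lambda w: A(w) and w[0] == i) / pr(lambda w: w[0] == i))  # Pr[A|B_i]
    * pr(lambda w: w[0] == i)                                     # Pr[B_i]
    for i in range(1, 7)
)
assert total == pr(A) == Fraction(1, 6)
```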
1.6 Random Variables
Random Variable
Let P=(Ω,p) be a discrete probability space. A random variable is a function
X:Ω→R.
Probability Distribution
A random variable defined on a probability space induces a probability distribution on the real numbers. Here, the probability distribution of X is defined as a function that assigns a probability to each subset of possible values of X. That is, for any subset A⊆R,
Pr[X∈A] = ∑_{ω∈Ω} 1[X(ω)∈A] · p(ω).
Indicator Function
The indicator function of a condition C is:
1[C] = 1 if the condition C holds, and 1[C] = 0 otherwise.
For a set A⊆U, where U is some universe, the indicator function of A is
1_A : U → {0,1}, with 1_A(x) = 1 if x∈A, and 1_A(x) = 0 otherwise.
Probability Mass Function
For a random variable X, its probability mass function p_X : R → [0,1] is defined as
p_X(x) := Pr[X = x], ∀x ∈ R.
Expectation
Let X be a random variable on a discrete probability space (Ω,p). Its expectation is
E[X] = ∑_{ω∈Ω} X(ω) p(ω).
Theorem 1.18. For any random variables X1,…,Xn and constants c1,…,cn∈R,
E[∑_{i=1}^n c_i X_i] = ∑_{i=1}^n c_i E[X_i].
Theorem 1.19. If X is a random variable that takes values in non-negative integers, then
E[X] = ∑_{t≥0} Pr[X > t].
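The tail-sum formula can be verified exactly for a small random variable; X = one fair die roll is an illustrative choice:

```python
from fractions import Fraction

# Check Theorem 1.19, E[X] = sum over t >= 0 of Pr[X > t], for X = one
# fair die roll (values 1..6, so all terms with t >= 6 vanish).
omega = range(1, 7)
p = Fraction(1, 6)

expectation = sum(w * p for w in omega)
tail_sum = sum(sum(p for w in omega if w > t) for t in range(6))
assert expectation == tail_sum == Fraction(7, 2)
```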
Conditional Expectation
Let X be a random variable and B be an event with Pr[B]>0. The conditional expectation of X given B is defined as
E[X∣B] = E[1_B · X] / Pr[B].
Conditional Expectation Given a Random Variable
Let X and Y be random variables. The conditional expectation of X given Y is defined as the following random variable:
E[X∣Y]:Ω→R,ω↦E[X∣Y=Y(ω)].
Theorem 1.20. Let X be a random variable and B1,…,Bm be mutually exclusive events with ⋃i=1mBi=Ω and Pr[Bi]>0 for all i∈[m]. Then
E[X] = ∑_{i=1}^m E[X∣B_i] · Pr[B_i].
Theorem 1.21. Let X and Y be random variables. Then
E[X]=E[E[X∣Y]].
Independence of Random Variables
We say that random variables X and Y are independent if for all sets A,B⊆R,
Pr[X∈A,Y∈B]=Pr[X∈A]Pr[Y∈B].
Equivalently, X and Y are independent if for all a,b,
Pr[X=a,Y=b]=Pr[X=a]Pr[Y=b].
Theorem 1.22. If X,Y are independent random variables, then
E[XY]=E[X]E[Y].
Mutual Independence of Random Variables
We say that random variables X1,…,Xn are mutually independent if for all sets A1,…,An⊆R,
Pr[X_1∈A_1, …, X_n∈A_n] = ∏_{i=1}^n Pr[X_i∈A_i].
Equivalently, X1,…,Xn are mutually independent if for all a1,…,an,
Pr[X_1=a_1, …, X_n=a_n] = ∏_{i=1}^n Pr[X_i=a_i].
Data Deduplication via Hashing
Definition. Let S(D) be the set of 5-grams of document D. Define Jaccard similarity between two documents D1 and D2 as
J(D_1, D_2) = |S(D_1) ∩ S(D_2)| / |S(D_1) ∪ S(D_2)|.
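A minimal sketch of 5-gram Jaccard similarity, assuming the 5-grams of a document are its contiguous character substrings of length 5 (the notes do not fix the tokenization; the two sentences are ours):

```python
def five_grams(doc):
    """S(D): the set of contiguous length-5 substrings of the document."""
    return {doc[i:i + 5] for i in range(len(doc) - 4)}

def jaccard(d1, d2):
    s1, s2 = five_grams(d1), five_grams(d2)
    return len(s1 & s2) / len(s1 | s2)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
print(jaccard(a, b))  # close to 1 for near-duplicate documents
```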
Fix a threshold τ∈(0,1) and a tolerance ϵ∈(0,1). Given a corpus of n documents D1,…,Dn, we build a graph where each document is a node. For any two documents D1 and D2, we compute their Jaccard similarity J(D1,D2) and discuss the following three cases:
- If J(D1,D2)≥τ+ϵ, we add an edge between D1 and D2.
- If J(D1,D2)≤τ−ϵ, we do nothing.
- If τ−ϵ<J(D1,D2)<τ+ϵ, either decision is acceptable: we may add an edge between D1 and D2 or not.
Definition (MinHash). First, we collect all 5-grams appearing in the n documents in the corpus, and denote this set by G. Next, we sample a random permutation on G, and construct a hash function h by mapping each 5-gram to its rank in the permutation. Finally, for each document D, the MinHash value of D is defined as
MinHash(D;h) = min{h(x) : x∈S(D)}.
Theorem 1.23. For any two documents D,D′ in the corpus,
Pr[MinHash(D;h)=MinHash(D′;h)]=J(D,D′).
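Theorem 1.23 can be checked empirically by sampling many random permutations of the 5-gram universe G; the two toy documents, trial count, and seed below are ours:

```python
import random

# Empirical check of Theorem 1.23: with h built from a random
# permutation of all 5-grams, Pr[MinHash collision] = Jaccard similarity.
def five_grams(doc):
    return {doc[i:i + 5] for i in range(len(doc) - 4)}

d1 = "minhash estimates jaccard similarity"
d2 = "minhash approximates jaccard similarity"
s1, s2 = five_grams(d1), five_grams(d2)
G = sorted(s1 | s2)  # all 5-grams in this (tiny) corpus
true_j = len(s1 & s2) / len(s1 | s2)

rng = random.Random(0)
trials, hits = 10_000, 0
for _ in range(trials):
    perm = rng.sample(G, len(G))                  # a random permutation of G
    h = {g: rank for rank, g in enumerate(perm)}  # 5-gram -> its rank
    hits += min(h[x] for x in s1) == min(h[x] for x in s2)
print(hits / trials, true_j)  # the two numbers should be close
```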
Definition (MinHash Signature). Fix integers b (the number of bands) and r (the number of rows per band), and sample b·r independent hash functions h_j^{(i)} as above. Define
H_j^{(i)}(D) = MinHash(D; h_j^{(i)}) ∈ Z for i∈[b] and j∈[r],
and collect these values into a vector H(D), whose i-th band is H^{(i)}(D) := (H_1^{(i)}(D), …, H_r^{(i)}(D)).
We call this vector H(D) the MinHash signature of D.
For two documents D and D′, we say that their MinHash signatures collide if there exists an index i∈[b] such that
H^{(i)}(D) = H^{(i)}(D′).
We write H(D)∼H(D′) when the two signatures collide.
Theorem 1.24. For any two documents D,D′ in the corpus,
Pr[H(D) ∼ H(D′)] = 1 − (1 − J(D,D′)^r)^b.
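The collision probability of Theorem 1.24 is an S-shaped function of J; a quick sketch with illustrative parameters r = 5 and b = 20 shows how sharply it switches from near 0 to near 1:

```python
# The signature-collision probability 1 - (1 - J^r)^b of Theorem 1.24,
# with illustrative parameters r = 5 rows per band and b = 20 bands.
r, b = 5, 20

def collision_prob(J):
    return 1 - (1 - J**r) ** b

# The curve is S-shaped: near 0 for small J, near 1 for large J, with a
# sharp transition around J ~ (1/b)**(1/r), about 0.55 here.
for J in (0.2, 0.5, 0.8):
    print(J, round(collision_prob(J), 4))
```

In practice r and b are tuned so that this transition point sits near the similarity threshold τ.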
Definition (The Banding Method).
The following algorithm clusters documents with colliding MinHash signatures into the same group.
- Compute the MinHash signature H(D) for each document D.
- Build an empty graph G with n vertices.
- For each band i∈[b]:
- Build an empty hash table Ti with m=Θ(n) buckets.
- For each document D:
- If the key H(i)(D) is already present in Ti, select one stored document D′ and add an edge between D and D′ in G.
- Otherwise, insert D into Ti with key H(i)(D).
- Identify all connected components of G using breadth-first search (BFS).
- Return the documents grouped by these connected components.
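The banding method above can be sketched as follows, using Python dicts as the hash tables T_i and BFS for the connected components (signatures are assumed precomputed; the toy signatures at the end are ours):

```python
from collections import defaultdict, deque

def cluster(signatures):
    """Group documents with colliding MinHash signatures.
    signatures[d] is a list of b band tuples for document d."""
    n = len(signatures)
    b = len(signatures[0])
    adj = defaultdict(list)            # the graph G on document ids 0..n-1
    for i in range(b):                 # for each band i
        table = {}                     # hash table T_i keyed by band value
        for d in range(n):
            key = signatures[d][i]
            if key in table:
                d2 = table[key]        # one stored document with the same key
                adj[d].append(d2)
                adj[d2].append(d)
            else:
                table[key] = d         # otherwise insert d with this key
    # connected components of G via BFS
    seen, groups = set(), []
    for s in range(n):
        if s in seen:
            continue
        comp, queue = [], deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        groups.append(comp)
    return groups

# toy signatures with b = 2 bands: docs 0 and 1 collide in band 0
sigs = [[(1, 2), (9, 9)], [(1, 2), (8, 8)], [(5, 5), (7, 7)]]
print(cluster(sigs))  # docs 0 and 1 grouped together; doc 2 alone
```

With m = Θ(n) buckets per table and b bands, the whole pass runs in expected O(b·n) time before the BFS.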