Leveraging Big Data: Lecture 2
Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo
http://www.cohenwang.com/edith/bigdataclass2013
Counting Distinct Elements
Elements occur multiple times; we want to count the number of distinct elements.
The number of distinct elements is n (n = 6 in the example below); the total number of elements is 11 in this example.
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
Exact counting of distinct elements requires a structure of size linear in n: we must remember every distinct element seen so far. We are happy with an approximate count that uses a small-size working memory.
Distinct Elements: Approximate Counting
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
We want to be able to compute and maintain a small sketch s(N) of the set N of distinct items seen so far.
Distinct Elements: Approximate Counting
Size of sketch: |s(N)| ≪ n.
Can query s(N) to get a good estimate n̂ of n = |N| (small relative error).
For a new element x, s(N ∪ {x}) is easy to compute from s(N) and x: suitable for data stream computation.
If N1 and N2 are (possibly overlapping) sets then we can compute the union sketch from their sketches: s(N1 ∪ N2) from s(N1) and s(N2): suitable for distributed computation.
Distinct Elements: Approximate Counting
Size-estimation/Minimum value technique: [Flajolet-Martin 85, C 94]
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
h is a random hash function from element IDs to uniform random numbers in [0,1].
Maintain the Min-Hash value y:
Initialize y ← 1.
Processing an element x: y ← min(y, h(x)).
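A minimal Python sketch of this update rule (my illustration; the slides do not prescribe an implementation). The memoized uniform draws simulate a truly random hash function; a real implementation would use a fixed hash function so memory stays small:

import random

class MinHashSketch:
    """Single min-hash value over a stream (the k = 1 case)."""
    def __init__(self, seed=42):
        self.rng = random.Random(seed)
        self.memo = {}      # memoized draws; simulates a random hash h
        self.y = 1.0        # current minimum hash value

    def h(self, x):
        # One uniform draw per id, memoized. The memo is only for the
        # demo; it defeats the small-memory goal, which real
        # implementations avoid by using a concrete hash function.
        if x not in self.memo:
            self.memo[x] = self.rng.random()
        return self.memo[x]

    def process(self, x):
        self.y = min(self.y, self.h(x))

s = MinHashSketch()
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    s.process(x)
print(s.y)  # unaffected by repeats, non-increasing in n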
Distinct Elements: Approximate Counting
Worked example (stream processed left to right):

x (stream):   32    12    14    32    7     12    32    7     6     12    4
h(x):         0.45  0.35  0.74  0.45  0.21  0.35  0.45  0.21  0.92  0.35  0.14
y (minimum):  0.45  0.35  0.35  0.35  0.21  0.21  0.21  0.21  0.21  0.21  0.14
n (distinct): 1     2     3     3     4     4     4     4     5     5     6
The minimum hash value y = min over x in N of h(x) is:
Unaffected by repeated elements.
Non-increasing with the number of distinct elements n.
Distinct Elements: Approximate Counting
How does the minimum hash value give information on the number of distinct elements n?
[Figure: n random points on the interval [0,1]; the minimum is marked.]
The expectation of the minimum is $\mathbb{E}[y] = \frac{1}{n+1}$.
A single value gives only limited information. To boost information, we maintain k ≥ 1 values.
Why is the expectation 1/(n+1)?
Take a circle of length 1. Throw a random red point to "mark" the start of a segment (the circle points map to [0,1)). Throw another n points independently at random. The circle is cut into n+1 segments by these points. The expected length of each segment is 1/(n+1). The same holds for the segment clockwise from the red point, whose length is distributed like the minimum hash value.
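A quick simulation (my addition, not from the slides) checking that the expected minimum of n uniform draws is 1/(n+1):

import random

def mean_min(n, trials=100_000, seed=0):
    """Average of min(u_1, ..., u_n), u_i ~ U[0,1], over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() for _ in range(n))
    return total / trials

n = 6
print(mean_min(n), 1 / (n + 1))  # both approximately 0.1429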
Min-Hash Sketches
k-mins sketch: Use k "independent" hash functions h_1, h_2, …, h_k. Track the respective minimum y_1, …, y_k for each function.
Bottom-k sketch: Use a single hash function h. Track the k smallest values y_1 < y_2 < ⋯ < y_k.
k-partition sketch: Use a single hash function. Use the first ≈ log2(k) bits of the hash to map each element uniformly to one of k parts; call the remaining bits h(x). For i = 1, …, k: track the minimum hash value y_i of the elements in part i.
These sketches maintain k values from the range of the hash function (distribution).
All sketches are the same for k = 1.
Min-Hash Sketches: k-mins, bottom-k, k-partition
Why study all 3 variants? Different tradeoffs between update cost, accuracy, usage…
Beyond distinct counting: Min-Hash sketches correspond to sampling schemes of large data sets: similarity queries between datasets, selectivity/subset queries.
These patterns generally apply as methods to gather increased confidence from a random "projection"/sample.
Min-Hash Sketches: Examples (k-mins, k-partition, bottom-k)
The min-hash value and sketches depend only on the random hash function(s) and the set N of distinct elements.
They do not depend on the order elements appear or on their multiplicity.
N = {32, 12, 14, 7, 6, 4}
Min-Hash Sketches: Example (k-mins)
x:      32    12    14    7     6     4
h1(x):  0.92  0.45  0.74  0.35  0.21  0.14
h2(x):  0.20  0.19  0.07  0.51  0.70  0.55
h3(x):  0.18  0.10  0.93  0.71  0.50  0.89

(y1, y2, y3) = (0.14, 0.07, 0.10)
Min-Hash Sketches: k-mins
k-mins sketch: Use k "independent" hash functions; track the respective minimum y_i for each h_i.
Processing a new element x: For i = 1, …, k: y_i ← min(y_i, h_i(x)).
Computation: O(k), whether the sketch is actually updated or not.
Example: h_1(x) = 0.35, h_2(x) = 0.51, h_3(x) = 0.71 (no coordinate of (0.14, 0.07, 0.10) changes).
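A hedged Python sketch of the k-mins update; salted SHA-256 digests stand in for the k "independent" hash functions, which is an implementation choice of mine, not something the slides specify:

import hashlib

def h(x, i):
    """i-th 'independent' hash: element -> [0,1), via a salted digest."""
    d = hashlib.sha256(f"{i}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

k = 3
y = [1.0] * k                              # the k-mins sketch
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    for i in range(k):                     # O(k) work per element,
        y[i] = min(y[i], h(x, i))          # updated or not
print(y)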
Min-Hash Sketches: Example (k-partition)
x:                 32    12    14    7     6     4
part-hash i(x):    3     2     1     3     1     2
value-hash h(x):   0.20  0.19  0.07  0.51  0.70  0.55

(y1, y2, y3) = (0.07, 0.19, 0.20)
Min-Hash Sketches: k-partition
Processing a new element x: compute the part i(x) and value h(x); set y_{i(x)} ← min(y_{i(x)}, h(x)).
Computation: O(1) to test or update.
k-partition sketch: Use a single hash function. Use the first ≈ log2(k) bits of the hash to map x uniformly to one of k parts i(x); call the remaining bits h(x). For each part i: track the minimum value-hash y_i of the elements in part i.
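A similar sketch of the k-partition update. For simplicity the part index is taken mod k from the first digest byte rather than literally from the leading bits; this is my shortcut and is only exactly uniform when k divides 256:

import hashlib

def part_and_value(x, k):
    """Split one digest into a part index (the 'first bits') and a
    value hash (the 'remaining bits')."""
    d = hashlib.sha256(str(x).encode()).digest()
    part = d[0] % k                                # mod-k shortcut
    value = int.from_bytes(d[1:9], "big") / 2**64
    return part, value

k = 3
y = [1.0] * k
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    i, v = part_and_value(x, k)                    # O(1) per element
    y[i] = min(y[i], v)
print(y)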
Min-Hash Sketches: Example (bottom-k)

x:     32    12    14    7     6     4
h(x):  0.20  0.19  0.07  0.51  0.70  0.55

(y1, y2, y3) = (0.07, 0.19, 0.20)
Min-Hash Sketches: bottom-k
Processing a new element x: If h(x) < y_k, insert h(x) into the sketch and discard y_k.
Computation: The sketch is maintained as a sorted list or as a priority queue: O(1) to test if an update is needed; O(k) to update a sorted list; O(log k) to update a priority queue.
Bottom-k sketch: Use a single hash function h; track the k smallest values.
We will see that #changes ≪ #distinct elements. A heap-based sketch of the update follows.
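A sketch of the bottom-k update with Python's heapq as the priority queue (my illustration); a max-heap over the k smallest values is simulated with negated keys:

import hashlib, heapq

def h(x):
    """Single hash function: element -> [0,1)."""
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

k = 3
neg = []                                   # max-heap via negated values
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    v = h(x)
    if -v in neg:                          # repeat already in the sketch
        continue
    if len(neg) < k:
        heapq.heappush(neg, -v)
    elif v < -neg[0]:                      # O(1) test, O(log k) update
        heapq.heapreplace(neg, -v)
print(sorted(-u for u in neg))             # the bottom-k sketch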
Min-Hash Sketches: Number of updates
Claim: The expected number of actual updates (changes) of the min-hash sketch is O(k log n).
Proof: First consider k = 1. Look at the distinct elements in the order they first occur. The i-th distinct element has a lower hash value than the current minimum with probability 1/i: this is the probability of being first in a random permutation of i elements. The total expected number of updates is therefore $\sum_{i=1}^{n} 1/i = O(\log n)$.
Element:      32   12   14   32   7    12   32   7    6    12   4
Update prob.: 1    1/2  1/3  0    1/4  0    0    0    1/5  0    1/6
Min-Hash Sketches: Number of updates
Claim: The expected number of actual updates (changes) of the min-hash sketch is O(k log n).
Proof (continued): Recap for k = 1 (single min-hash value): the i-th distinct element causes an update with probability 1/i; the expected total is O(log n).
k-mins: k min-hash values (apply the argument k times).
Bottom-k: We keep the k smallest values, so the update probability of the i-th distinct element is min{1, k/i} (the probability of being among the first k in a random permutation of i elements).
k-partition: k min-hash values, each over ≈ n/k distinct values.
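Spelling the claim out (standard harmonic-sum bounds; the exact constants are my reconstruction):

\[
k=1:\qquad \mathbb{E}[\#\text{updates}] \;=\; \sum_{i=1}^{n}\frac{1}{i} \;=\; H_n \;\le\; \ln n + 1 ,
\]
\[
\text{k-mins:}\;\; k\,H_n = O(k\log n),\qquad
\text{bottom-}k:\;\; \sum_{i=1}^{n}\min\!\Bigl\{1,\tfrac{k}{i}\Bigr\} \;\le\; k\Bigl(1+\ln\tfrac{n}{k}\Bigr).
\]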
Merging Min-Hash Sketches
The union sketch s(N′ ∪ N″) from the sketches of two sets N′, N″:
k-mins: take the minimum per hash function.
k-partition: take the minimum per part.
Bottom-k: the k smallest values in the union of the data must be among the k smallest of their own set: take the k smallest values in s(N′) ∪ s(N″).
!! We must apply the same set of hash functions to all elements/data sets/streams.
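The three merge rules in Python (assuming, as the slide insists, that both inputs were built with the same hash functions):

import heapq

def merge_kmins(y1, y2):
    """k-mins: coordinate-wise minimum, one entry per hash function."""
    return [min(a, b) for a, b in zip(y1, y2)]

def merge_kpartition(y1, y2):
    """k-partition: coordinate-wise minimum, one entry per part."""
    return [min(a, b) for a, b in zip(y1, y2)]

def merge_bottomk(y1, y2, k):
    """Bottom-k: the k smallest distinct values of the union must
    already appear among the two sketches."""
    return heapq.nsmallest(k, set(y1) | set(y2))

print(merge_kmins([0.14, 0.07, 0.10], [0.35, 0.51, 0.71]))
print(merge_bottomk([0.07, 0.19, 0.20], [0.07, 0.35, 0.51], 3))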
Using Min-Hash Sketches
Recap: We defined Min-Hash sketches (3 types); adding elements; merging Min-Hash sketches; some properties of these sketches.
Next: We put Min-Hash sketches to work: estimating the distinct count from a Min-Hash sketch; tools from estimation theory.
The Exponential Distribution Exp(n): PDF $f(y) = n e^{-ny}$ for $y \ge 0$; CDF $F(y) = 1 - e^{-ny}$. Very useful properties:
Memorylessness: Pr[y > s + t | y > s] = Pr[y > t].
Min-to-Sum conversion: the minimum of independent Exp(n_1), …, Exp(n_k) random variables is Exp(n_1 + ⋯ + n_k).
Relation with uniform: if u ∼ U[0,1] then −ln(u)/n ∼ Exp(n).
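A small simulation (my addition, not from the slides) checking two of these properties, the relation with uniform and Min-to-Sum:

import math, random, statistics

rng = random.Random(1)
n = 6

# Relation with uniform: if u ~ U[0,1] then -ln(u)/n ~ Exp(n), mean 1/n.
via_uniform = [-math.log(rng.random()) / n for _ in range(200_000)]
print(statistics.mean(via_uniform), 1 / n)

# Min-to-Sum: the min of n i.i.d. Exp(1) variables is Exp(n), mean 1/n.
via_min = [min(rng.expovariate(1.0) for _ in range(n))
           for _ in range(50_000)]
print(statistics.mean(via_min), 1 / n)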
Estimating Distinct Count from a Min-Hash Sketch: k-mins
• Change to an exponential distribution: take hash values h(x) = −ln u(x) ∼ Exp(1), where u(x) ∼ U[0,1].
• Using the Min-to-Sum property, the minimum hash value over n distinct elements is Exp(n). In fact, we can keep uniform hashes and use −ln(1 − y) when estimating.
• The number of distinct elements becomes a parameter estimation problem:
Given k independent samples y_1, …, y_k from Exp(n), estimate n.
Estimating Distinct Count from a Min-Hash Sketch: k-mins
Each y_i has expectation 1/n and variance 1/n². The average $\bar y = \frac{1}{k}\sum_{i=1}^k y_i$ has expectation 1/n and variance 1/(k n²). The cv (coefficient of variation, σ/μ) is $1/\sqrt{k}$.
$\bar y$ is a good unbiased estimator for 1/n. But 1/n is the inverse of what we want. What about estimating n?
Estimating Distinct Count from a Min-Hash Sketch: k-mins
What about estimating n?
1) We can use the biased estimator $\hat n = 1/\bar y = k / \sum_{i=1}^k y_i$.
To say something useful about the estimate quality, we apply Chebyshev's inequality to bound the probability that $\bar y$ is far from its expectation 1/n, and thus that $1/\bar y$ is far from n.
2) Maximum Likelihood Estimation (a general and powerful technique).
Chebyshev's Inequality
For any random variable X with expectation μ and standard deviation σ, and for any c > 0:
\[ \Pr[\,|X - \mu| \ge c\,\sigma\,] \;\le\; \frac{1}{c^2} \]
For $\bar y$: μ = 1/n and $\sigma = 1/(\sqrt{k}\,n)$.
Using Chebyshev's Inequality
For $\bar y$ with μ = 1/n and $\sigma = 1/(\sqrt{k}\,n)$:
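One way to spell out the resulting bound (a reconstruction; the slide's exact constants were lost in extraction):

\[
\Pr\Bigl[\,\bigl|\bar y - \tfrac{1}{n}\bigr| \ge \tfrac{c}{\sqrt{k}\,n}\Bigr] \;\le\; \frac{1}{c^2}
\quad\Longrightarrow\quad
\Pr\Bigl[\tfrac{1}{\bar y} \notin \Bigl[\tfrac{n}{1+c/\sqrt{k}},\; \tfrac{n}{1-c/\sqrt{k}}\Bigr]\Bigr] \;\le\; \frac{1}{c^2}
\qquad (0 < c < \sqrt{k}).
\]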
Maximum Likelihood Estimation
Given a set of independent samples y_1, …, y_k from a distribution with a parameter θ that we do not know, the MLE $\hat\theta$ is the value of θ that maximizes the likelihood (joint density) function f(y; θ): the maximum over θ of the probability (density) of observing y.
Properties:
Principled way of deriving estimators.
Converges in probability to the true value (with enough i.i.d. samples) … but generally biased.
(Asymptotically!) optimal: minimizes the MSE (mean square error) and meets the Cramér-Rao lower bound.
Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE
Given k independent samples from Exp(n), estimate n.
Likelihood function for n (joint density function):
\[ f(y; n) = \prod_{i=1}^{k} n e^{-n y_i} = n^k e^{-n \sum_{i=1}^{k} y_i} \]
Take a logarithm (does not change the maximum):
\[ \ell(y; n) = k \ln n - n \sum_{i=1}^{k} y_i \]
Differentiate to find the maximum:
\[ \frac{\partial \ell}{\partial n} = \frac{k}{n} - \sum_{i=1}^{k} y_i = 0 \]
MLE estimate: $\hat n = k / \sum_{i=1}^{k} y_i$.
We get the same estimator as before; it depends only on the sum!
We can think of several ways to combine and use these k samples and decrease the variance: average (sum); median; remove outliers and average the remaining; …
Given k independent samples from Exp(n), estimate n.
We want to get the most value (best estimate) from the information we have (the sketch). What combinations should we consider?
Sufficient Statistic
A function T(y) is a sufficient statistic for estimating some function of the parameter θ if the likelihood function has the factored form $f(y; \theta) = g(T(y); \theta)\, h(y)$.
Likelihood function (joint density) for k exponential i.i.d. random variables from Exp(n):
\[ f(y; n) = n^k e^{-n \sum_{i=1}^{k} y_i} \]
The sum $T(y) = \sum_{i=1}^{k} y_i$ is a sufficient statistic for n.
Sufficient Statistic
A function T(y) is a sufficient statistic for estimating some function of the parameter θ if the likelihood function has the factored form $f(y; \theta) = g(T(y); \theta)\, h(y)$.
In particular:
The MLE depends on y only through T(y).
The maximum with respect to θ does not depend on the factor h(y).
The maximum of f, computed by differentiating with respect to θ, is a function of T.
Sufficient Statistic
T(y) is a sufficient statistic for θ if the likelihood function has the form $f(y; \theta) = g(T(y); \theta)\, h(y)$.
Lemma: The conditional distribution of y given T(y) = t does not depend on θ.
If we fix T(y) = t, the density function is proportional to h(y). If we know the density up to a fixed factor, it is determined completely by normalizing to 1.
Rao-Blackwell Theorem
Recap: T(y) is a sufficient statistic for θ; the conditional distribution of y given T(y) does not depend on θ.
Rao-Blackwell Theorem: Given an estimator n̂(y) of θ that is not a function of the sufficient statistic, we can get an estimator with at most the same MSE that depends only on T(y):
\[ \hat n'(t) = \mathbb{E}\bigl[\hat n(y) \mid T(y) = t\bigr] \]
The conditional expectation does not depend on θ (critical). The process is called Rao-Blackwellization of n̂.
Rao-Blackwell Theorem
[Figure sequence over several slides: sample points (y1, y2), namely (1,3), (4,0), (3,1), (2,2), (1,2), (2,1), (3,0), (1,4), (3,2), shown with the density function f(y1, y2; θ) of y given the parameter θ. The sufficient statistic T = y1 + y2 partitions the points into level sets; within a level set, f(y1, y2; θ) is determined up to normalization, independently of θ. An estimator n̂(y1, y2) assigns a value to each point (e.g. 3, 0, 2, 1, 0, 4, 2, 1, 2). Rao-Blackwellization replaces each point's estimate by the average of n̂ over the point's level set of T (e.g. 1.5, 1, 2, 3).]
Rao-Blackwell Theorem
Rao-Blackwell: $\hat n'(t) = \mathbb{E}[\hat n(y) \mid T(y) = t]$.
Law of total expectation: $\mathbb{E}[\hat n'] = \mathbb{E}\bigl[\mathbb{E}[\hat n \mid T]\bigr] = \mathbb{E}[\hat n]$.
The expectation (bias) remains the same; the MSE (Mean Square Error) can only decrease.
Why does the MSE decrease?
Suppose we have two points with equal probabilities, and an estimator of θ that gives estimates a and b on these points. We replace it by an estimator that instead returns the average (a + b)/2 on both points.
The (scaled) contribution of these two points to the square error changes from
\[ (a-\theta)^2 + (b-\theta)^2 \quad\text{to}\quad 2\Bigl(\frac{a+b}{2}-\theta\Bigr)^{2}. \]
Why does the MSE decrease?
Exercise: Show that $(a-\theta)^2 + (b-\theta)^2 \ge 2\bigl(\frac{a+b}{2}-\theta\bigr)^2$.
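Completing the exercise (elementary algebra):

\[
(a-\theta)^2 + (b-\theta)^2 - 2\Bigl(\frac{a+b}{2}-\theta\Bigr)^{2}
\;=\; a^2 + b^2 - \frac{(a+b)^2}{2}
\;=\; \frac{(a-b)^2}{2} \;\ge\; 0 .
\]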
Sufficient Statistic for estimating n from k-mins sketches
The sum $T(y) = \sum_{i=1}^{k} y_i$ is a sufficient statistic for estimating any function of n (including n and 1/n).
Rao-Blackwell ⇒ we cannot gain by using estimators with a different dependence on y (e.g. functions of the median, or of a sum over fewer values).
Given k independent samples from Exp(n), estimate n.
Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE
• $T = \sum_{i=1}^{k} y_i$ (the sum of k i.i.d. Exp(n) random variables) has the Erlang PDF
\[ f_T(t) = \frac{n^k t^{k-1} e^{-nt}}{(k-1)!} \]
The expectation of the MLE estimate $\hat n = k/T$ is $\mathbb{E}[k/T] = \frac{k}{k-1}\, n > n$, so the MLE is biased.
MLE estimate: $\hat n = k / \sum_{i=1}^{k} y_i$.
Estimating Distinct Count from a Min-Hash Sketch: k-mins
The variance of the unbiased estimate is $\mathrm{Var}[\hat n] = \frac{n^2}{k-2}$.
The CV is $1/\sqrt{k-2}$. Is this the best we can do?
Unbiased Estimator (for n): $\hat n = (k-1)/\sum_{i=1}^{k} y_i$.
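A short end-to-end simulation (my illustration): build a k-mins sketch over n distinct elements with uniform hashes, convert via −ln(1 − y), and compare the biased MLE with the unbiased estimator:

import math, random

rng = random.Random(7)
n, k = 10_000, 100

# Simulate a k-mins sketch: each y_i is the minimum of n uniform
# hashes, converted to an Exp(n) sample via -ln(1 - y).
y = [-math.log(1 - min(rng.random() for _ in range(n)))
     for _ in range(k)]

s = sum(y)
print("MLE      :", k / s)        # biased upward by a factor k/(k-1)
print("unbiased :", (k - 1) / s)  # expectation n, cv 1/sqrt(k-2)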
Cramér-Rao lower bound (CRLB)
Are we using the information in the sketch in the best possible way?
Cramér-Rao lower bound (CRLB)
An information-theoretic lower bound on the variance of any unbiased estimator of θ.
Likelihood function: f(y; θ). Log likelihood: ℓ(y; θ) = ln f(y; θ).
Fisher Information: $I(\theta) = -\mathbb{E}\Bigl[\frac{\partial^2 \ell}{\partial \theta^2}\Bigr]$.
CRLB: Any unbiased estimator has $\mathrm{Var} \ge \frac{1}{I(\theta)}$.
CRLB for estimating n
Likelihood function for n: $f(y; n) = n^k e^{-n \sum_i y_i}$. Log likelihood: $\ell = k \ln n - n \sum_i y_i$. Negated second derivative: $-\frac{\partial^2 \ell}{\partial n^2} = \frac{k}{n^2}$. Fisher information: $I(n) = k/n^2$. CRLB: $\mathrm{Var} \ge n^2/k$.
Our estimator has CV $1/\sqrt{k-2}$.
The Cramér-Rao lower bound on the CV is $1/\sqrt{k}$ ⇒ we are using the information in the sketch nearly optimally!
Estimating Distinct Count from a Min-Hash Sketch: k-mins
Unbiased Estimator (for n): $\hat n = (k-1)/\sum_{i=1}^{k} y_i$.
Estimating Distinct Count from a Min-Hash Sketch: Bottom-k
Bottom-k sketch: can we specify the joint distribution of (y_1, …, y_k)? Use the Exponential distribution.
What is the relation with k-mins sketches?
y_1 is the same as in k-mins: y_1 ∼ Exp(n). The minimum of the remaining n − 1 elements is Exp(n − 1).
Since Exp is memoryless, y_2 − y_1 ∼ Exp(n − 1). More generally, y_{i+1} − y_i ∼ Exp(n − i).
Bottom-k versus k-mins sketches
Bottom-k sketch: samples y_1 < ⋯ < y_k with independent spacings y_{i+1} − y_i ∼ Exp(n − i).
Bottom-k sketches carry strictly more information than k-mins sketches!
k-mins sketch: k independent samples from Exp(n).
To obtain a k-mins sketch from a bottom-k sketch (without knowing n), the bottom-k values can be resampled appropriately.
We can therefore use k-mins estimators with bottom-k sketches. We can do even better by taking the expectation over the random choices of the resampling.
Estimating Distinct Count from a Min-Hash Sketch: Bottom-k
Likelihood function of y = (y_1 < ⋯ < y_k):
\[ f(y; n) = \Bigl[\prod_{i=0}^{k-1} (n-i)\Bigr]\, e^{-\sum_{i=1}^{k-1} y_i}\, e^{-(n-k+1)\, y_k} \]
The middle factor does not depend on n; the first and last factors depend on n. What does estimation theory tell us?
Estimating Distinct Count from a Min-Hash Sketch: Bottom-k
Likelihood function
\[ f(y; n) = \Bigl[\prod_{i=0}^{k-1} (n-i)\Bigr]\, e^{-\sum_{i=1}^{k-1} y_i}\, e^{-(n-k+1)\, y_k} \]
y_k (the maximum value in the sketch) is a sufficient statistic for estimating n (or any function of n): it captures everything we can glean from the bottom-k sketch about n.
What does estimation theory tell us?
Bottom-k: MLE for Distinct Count
Likelihood function (probability density):
\[ f(y; n) = \Bigl[\prod_{i=0}^{k-1} (n-i)\Bigr]\, e^{-\sum_{i=1}^{k-1} y_i - (n-k+1)\, y_k} \]
Find the value of n which maximizes f: look only at the part that depends on n, and take the logarithm (same maximum):
\[ \ell(y; n) = \sum_{i=0}^{k-1} \ln(n-i) - (n-k+1)\, y_k \]
Bottom-k: MLE for Distinct Count
\[ \ell(y; n) = \sum_{i=0}^{k-1} \ln(n-i) - (n-k+1)\, y_k \]
We look for the n which maximizes ℓ:
\[ \frac{\partial \ell(y; n)}{\partial n} = \sum_{i=0}^{k-1} \frac{1}{n-i} - y_k \]
The MLE is the solution of:
\[ \sum_{i=0}^{k-1} \frac{1}{n-i} = y_k \]
Need to solve numerically.
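The left side is decreasing in n, so bisection solves the equation; the solver below is my sketch (the slides only say to solve numerically):

def bottomk_mle(y_k, k):
    """Solve sum_{i=0}^{k-1} 1/(n - i) = y_k for n by bisection.
    The left side decreases in n, from +inf as n -> k-1 down to 0."""
    f = lambda n: sum(1.0 / (n - i) for i in range(k)) - y_k
    lo, hi = k - 1 + 1e-12, 1e15
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) > 0:      # left side still too large: n too small
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Example: k = 3 and y_3 = 0.0003 give n-hat close to 3/0.0003 = 10,000.
print(round(bottomk_mle(0.0003, 3)))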
Summary: k-mins count estimators
k-mins sketch with U[0,1] dist: each y_i is the minimum of n uniform hash values. With Exp(1) dist: y_i ∼ Exp(n) (convert via −ln(1 − y_i)).
Sufficient statistic for (any function of) n: the sum $T = \sum_{i=1}^{k} y_i$.
MLE for n: k/T. Unbiased estimator for n: (k − 1)/T; cv: $1/\sqrt{k-2}$; CRLB: $1/\sqrt{k}$.
MLE/unbiased estimator for 1/n: T/k; cv: $1/\sqrt{k}$; CRLB: $1/\sqrt{k}$.
Summary: bottom-k count estimators
Bottom-k sketch with Exp(1) hashes: spacings y_{i+1} − y_i ∼ Exp(n − i). With U[0,1] dist: convert via −ln(1 − y).
Sufficient statistic for (any function of) n: y_k.
Contains strictly more information than k-mins. When k ≪ n, approximately the same as k-mins.
MLE for n is the solution of:
\[ \sum_{i=0}^{k-1} \frac{1}{n-i} = y_k \]
Bibliography
• See lecture 3.
• We will continue with Min-Hash sketches: use as random samples, applications to similarity, inverse-probability based distinct count estimators.