Small Summaries for Big Data
Ke Yi, HKUST
[email protected]


TRANSCRIPT

Page 1: Ke yi small summaries for big data

Small Summaries for Big Data

Ke Yi, HKUST

[email protected]

Page 2: Ke yi small summaries for big data

The case for “Big Data” in one slide

“Big” data arises in many forms:
– Medical data: genetic sequences, time series
– Activity data: GPS location, social network activity
– Business data: customer behavior tracking at fine detail
– Physical measurements: from science (physics, astronomy)

The 3 V’s:
– Volume
– Velocity
– Variety

focus of this talk

Page 3: Ke yi small summaries for big data


Computational scalability

The first (prevailing) approach: scale up the computation. Many great technical ideas:
– Use many cheap commodity devices
– Accept and tolerate failure
– Move code to data
– MapReduce: BSP for programmers
– Break problem into many small pieces
– Decide which constraints to drop: noSQL, ACID, CAP

Scaling up comes with its disadvantages:
– Expensive (hardware, equipment, energy), still not always fast

This talk is not about this approach!

Page 4: Ke yi small summaries for big data


Downsizing data

A second approach to computational scalability: scale down the data!
– A compact representation of a large data set
– Too much redundancy in big data anyway
– What we finally want is small: human-readable analysis / decisions
– Necessarily gives up some accuracy: approximate answers
– Often randomized (small constant probability of error)
– Examples: samples, sketches, histograms, wavelet transforms

Complementary to the first approach: not a case of either-or. Some drawbacks:
– Not a general-purpose approach: need to fit the problem
– Some computations don’t allow any useful summary

Page 5: Ke yi small summaries for big data


Outline for the talk

Some of the most important data summaries:
– Sketches: Bloom filter, Count-Min, Count-Sketch (AMS)
– Counters: heavy hitters, quantiles
– Sampling: simple samples, distinct sampling
– Summaries for more complex objects: graphs and matrices

Current trends and future challenges for data summarization. Many topics abbreviated or omitted (histograms, wavelets, ...).

A lot of work is relevant to compact summaries:
– in both the database and the algorithms literature

Page 6: Ke yi small summaries for big data

Summary Construction

There are several different models for summary construction:
– Offline computation: e.g. sort data, take percentiles
– Streaming: summary merged with one new item each step
– Full mergeability: allow arbitrary merges of partial summaries (the most general and widely applicable category)

Key methods for summaries:
– Create an empty summary
– Update with one new tuple: streaming processing (insert and delete)
– Merge summaries together: distributed processing (e.g. MapReduce)
– Query: may tolerate some approximation (parameterized by ε)

Several important cost metrics (as a function of ε):
– size of the summary, time cost of each operation
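To make this create / update / merge / query model concrete, here is a minimal, hypothetical Python interface that the summaries in the rest of the talk can be seen as implementing; the method names are illustrative, not from the slides.

```python
from abc import ABC, abstractmethod

class Summary(ABC):
    """Generic small-summary interface: create empty, update, merge, query."""

    @abstractmethod
    def update(self, item):
        """Fold one new tuple into the summary (streaming processing)."""

    @abstractmethod
    def merge(self, other):
        """Combine with another partial summary (distributed processing)."""

    @abstractmethod
    def query(self, *args):
        """Return an approximate answer, with error parameterized by epsilon."""
```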

Page 7: Ke yi small summaries for big data


Sketches

Page 8: Ke yi small summaries for big data


Bloom Filters [Bloom 1970]

Bloom filters compactly encode set membership:
– e.g. store a list of many long URLs compactly
– k random hash functions map each item into an m-bit vector
– Update: set the k bits that item x maps to, to indicate x is present
– Query: is x in the set? Return “yes” if all k bits that x maps to are 1

Duplicate insertions do not change the Bloom filter. Two filters (of the same size, using the same hash functions) can be merged by OR-ing their bit vectors.

[Figure: an item is hashed by k functions into the m-bit vector; the corresponding bits are set to 1]
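A minimal Python sketch of the filter described above; the parameters m, k and the blake2b-derived hash functions are illustrative stand-ins for the k random hash functions, not part of the original slides.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k, seed=0):
        self.m, self.k, self.seed = m, k, seed
        self.bits = bytearray(m)          # one byte per bit, for simplicity

    def _positions(self, item):
        # k hash functions derived from one keyed hash
        for i in range(self.k):
            h = hashlib.blake2b(f"{self.seed}|{i}|{item}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # no false negatives; false positives possible
        return all(self.bits[p] for p in self._positions(item))

    def merge(self, other):
        # OR the bit vectors (requires same m, k and seed)
        for i in range(self.m):
            self.bits[i] |= other.bits[i]
```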

Page 9: Ke yi small summaries for big data


Bloom Filters Analysis

If the item is in the set, the filter always returns “yes”:
– no false negatives

If the item is not in the set, it may return “yes” with some probability (a false positive). With n items, m bits and k hash functions:
– Prob. that a given bit is 0: (1 − 1/m)^{kn} ≈ e^{−kn/m}
– Prob. that all k probed bits are 1: (1 − e^{−kn/m})^k
– Setting k = (m/n) ln 2 minimizes this probability, to about 2^{−k} ≈ 0.6185^{m/n}

For false positive probability δ, the Bloom filter needs about 1.44 · n · log₂(1/δ) bits.

Page 10: Ke yi small summaries for big data


Bloom Filter Applications

Bloom filters are widely used in “big data” applications:
– Many problems require storing a large set of items

Can generalize to allow deletions:
– Swap bits for counters: increment on insert, decrement on delete
– If there are no duplicates, small counters suffice: 4 bits per counter

Bloom filters are still an active research area:
– often appear in networking conferences
– also known as “signature files” in the DB literature

[Figure: counting Bloom filter; each inserted item increments k small counters, e.g. 1 2 2 2 2]

Page 11: Ke yi small summaries for big data


Invertible Bloom Lookup Table [GM11]

A summary that:
– supports insertions and deletions
– when the number of items is at most k, can retrieve all of them
– also known as sparse recovery

Easy case when k = 1:
– assume all items are integers
– just keep the bitwise XOR of all items
– to insert x: XOR in x; to delete x: XOR in x again
– assumption: no duplicates; can’t delete a non-existing item

A problem: how do we know when # items = 1?
– just keep another counter of the number of items

Page 12: Ke yi small summaries for big data


Invertible Bloom Lookup Table

The structure is the same as a Bloom filter, but each cell:
– stores the XOR of all items mapped to it
– also keeps a counter of those items

How to recover?
– First check whether some cell holds exactly 1 item
– If so, recover that item
– Delete the item from all of its cells, and repeat (“peeling”)

[Figure: items hashed into cells holding (count, XOR) pairs]
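A minimal sketch of such a table for distinct positive-integer items, including the peeling recovery loop. The cell layout (count + XOR) follows the slide; the number of hash functions, the hashing scheme and the names are illustrative assumptions.

```python
import hashlib

class IBLT:
    """Each cell keeps a count and the XOR of the items mapped to it."""

    def __init__(self, m, k=3, seed=0):
        self.m, self.k, self.seed = m, k, seed
        self.count = [0] * m
        self.xor = [0] * m

    def _cells(self, x):
        for i in range(self.k):
            h = hashlib.blake2b(f"{self.seed}|{i}|{x}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.m

    def insert(self, x):
        for c in self._cells(x):
            self.count[c] += 1
            self.xor[c] ^= x

    def delete(self, x):
        for c in self._cells(x):
            self.count[c] -= 1
            self.xor[c] ^= x

    def list_items(self):
        """Peeling: repeatedly find a cell with exactly one item, output and remove it."""
        recovered, progress = [], True
        while progress:
            progress = False
            for c in range(self.m):
                if self.count[c] == 1:
                    x = self.xor[c]
                    recovered.append(x)
                    self.delete(x)
                    progress = True
        return recovered
```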

Page 13: Ke yi small summaries for big data


Invertible Bloom Lookup Table

Can show that:
– using O(k) cells, the table can recover up to k items with high probability

Page 14: Ke yi small summaries for big data


Count-Min Sketch [CM 04]

The Count-Min sketch encodes item counts:
– allows estimation of frequencies (e.g. for selectivity estimation)
– some similarities in appearance to Bloom filters

Model the input data as a frequency vector of dimension U:
– e.g., U is the space of all IP addresses, and one entry is the frequency of IP address i in the data
– create a small summary as a d × w array of counters, with d · w ≪ U
– use one hash function per row to map vector entries to [1..w]

Page 15: Ke yi small summaries for big data


Count-Min Sketch Structure

Update: each update (i, +c) is mapped to one bucket per row, and c is added to that bucket.

Merge two sketches by entry-wise summation.

Query: estimate the frequency of i by taking the minimum over rows, min_r C[r][h_r(i)]
– never underestimates (for non-negative counts)
– we will bound the error of overestimation

[Figure: d rows of w = 2/ε counters; item i is hashed by h_1, ..., h_d to one bucket per row, and c is added to each]
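A small Python sketch of the structure just described; the width/depth parameters and the blake2b-derived row hashes are illustrative assumptions. With width w = 2/ε and depth d = log₂(1/δ) it matches the error bound analyzed on the next slide.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width, depth, seed=0):
        self.w, self.d, self.seed = width, depth, seed
        self.table = [[0] * width for _ in range(depth)]

    def _h(self, row, x):
        h = hashlib.blake2b(f"{self.seed}|{row}|{x}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.w

    def update(self, x, c=1):
        # add c to one bucket per row
        for r in range(self.d):
            self.table[r][self._h(r, x)] += c

    def query(self, x):
        # min over rows: never underestimates when all updates are non-negative
        return min(self.table[r][self._h(r, x)] for r in range(self.d))

    def merge(self, other):
        # entry-wise summation (sketches must share width, depth and seed)
        for r in range(self.d):
            for j in range(self.w):
                self.table[r][j] += other.table[r][j]
```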

Page 16: Ke yi small summaries for big data


Count-Min Sketch Analysis

Focusing on the first row:
– the true count f_i is always added to bucket C[h(i)]
– the count f_j of any other item j is added to that bucket with prob. 1/w
– so the expected error contributed by item j is f_j / w
– the total expected error is at most N / w = εN / 2 (for w = 2/ε)
– by the Markov inequality, Pr[error ≥ εN] ≤ 1/2

Taking the minimum over d rows, this probability drops to 2^{−d}. To give error at most εN with probability 1 − δ, the sketch needs size O((1/ε) · log(1/δ)).

Page 17: Ke yi small summaries for big data


Generalization: Sketch Structures

A sketch is a class of summary that is a linear transform of the input:
– Sketch(x) = Sx for some (implicit) matrix S
– hence Sketch(x + y) = Sketch(x) + Sketch(y)
– trivial to update and merge

Sketches are often described in terms of hash functions, and the analysis relies on properties of those hash functions:
– some analyses assume truly random hash functions (e.g., the Bloom filter); these don’t exist, but ad hoc hash functions work well for much real-world data
– others need only hash functions with limited independence

Page 18: Ke yi small summaries for big data


Count Sketch / AMS Sketch [CCF-C02, AMS96]

A second hash function g_r maps each item to +1 or −1.

Update: each update (i, +c) is mapped to one bucket per row, where g_r(i) · c is added.

Merge two sketches by entry-wise summation.

Query: estimate the frequency of i by taking the median over rows of g_r(i) · C[r][h_r(i)]
– may either overestimate or underestimate

[Figure: d rows of w counters; item i is hashed by h_1, ..., h_d to one bucket per row, and g_r(i) · c is added]
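The same skeleton as the Count-Min sketch above, with the extra ±1 hash g and a median query; again the hash construction and names are illustrative stand-ins.

```python
import hashlib
import statistics

class CountSketch:
    def __init__(self, width, depth, seed=0):
        self.w, self.d, self.seed = width, depth, seed
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, row, x, salt):
        h = hashlib.blake2b(f"{self.seed}|{salt}|{row}|{x}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big")

    def _bucket(self, row, x):
        return self._hash(row, x, "h") % self.w

    def _sign(self, row, x):
        return 1 if self._hash(row, x, "g") & 1 else -1

    def update(self, x, c=1):
        for r in range(self.d):
            self.table[r][self._bucket(r, x)] += self._sign(r, x) * c

    def query(self, x):
        # median of per-row (unbiased) estimates; may over- or under-estimate
        return statistics.median(
            self._sign(r, x) * self.table[r][self._bucket(r, x)]
            for r in range(self.d)
        )
```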

Page 19: Ke yi small summaries for big data


Count Sketch Analysis

Focusing on the first row, we return g(i) · C[h(i)]:
– the true count f_i is always included (since g(i) · g(i) = 1)
– any other item j contributes g(i) · g(j) · f_j with prob. 1/w, with a random sign
– the expected error is 0 (the estimate is unbiased)
– E[(total error)²] ≤ ‖f‖₂² / w, so E[|total error|] ≤ ‖f‖₂ / √w
– by the Chebyshev inequality, Pr[|total error| ≥ ε‖f‖₂] is a small constant when w = O(1/ε²)

Taking the median of d rows, this probability drops to 2^{−Ω(d)}. To give error ε‖f‖₂ with probability 1 − δ, the sketch needs size O((1/ε²) · log(1/δ)).

Page 20: Ke yi small summaries for big data


Count-Min Sketch vs Count Sketch

Count-Min:
– size: O((1/ε) · log(1/δ))
– error: ε‖f‖₁ = εN (one-sided)

Count Sketch:
– size: O((1/ε²) · log(1/δ))
– error: ε‖f‖₂ (two-sided)

Other benefits of Count Sketch:
– unbiased estimator
– can also be used to estimate inner products (join sizes)

Count Sketch is better when the L2-norm error beats the L1-norm error, since ‖f‖₂ can be much smaller than ‖f‖₁.

Page 21: Ke yi small summaries for big data


Application to Large Scale Machine Learning

In machine learning, we often have a very large feature space:
– many objects, each with a huge, sparse feature vector
– slow and costly to work in the full feature space

“Hash kernels”: work with a sketch of the features
– effective in practice! [Weinberger, Dasgupta, Langford, Smola, Attenberg ’09]

A similar analysis explains why:
– essentially, not too much noise lands on the important features

Page 22: Ke yi small summaries for big data


Counter Based Summaries


Page 23: Ke yi small summaries for big data


Heavy hitters [MG82]

The Misra-Gries (MG) algorithm finds up to k items that occur more than a 1/k fraction of the time in a stream of length N:
– estimates their frequencies with additive error N/(k+1)
– equivalently, achieves error εN with O(1/ε) space
– better than Count-Min, but doesn’t support deletions

Keep up to k different candidates in hand. To add a new item:
– if the item is already monitored, increase its counter
– else, if fewer than k items are monitored, add the new item with count 1
– else, decrease all counters by 1 (dropping any that reach 0)

[Figure: k = 5 counters over items 1-9]
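A minimal sketch of the update rule above, using a plain dict of counters (representation and names are illustrative).

```python
def mg_update(counters, x, k):
    """Misra-Gries update: maintain at most k candidate counters."""
    if x in counters:
        counters[x] += 1
    elif len(counters) < k:
        counters[x] = 1
    else:
        # decrement every counter; drop counters that reach zero
        for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]

# Example: item 1 dominates the stream, so it keeps a large counter.
counts = {}
for item in [1, 2, 1, 3, 1, 4, 1, 5, 6, 1]:
    mg_update(counts, item, k=5)
print(counts)   # each stored count is a lower bound on the true count
```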


Page 26: Ke yi small summaries for big data


MG analysis

Let N = total input size and M = sum of the counters in the data structure. The error in any estimated count is at most (N − M)/(k+1):
– each estimated count is a lower bound on the true count
– each decrement is spread over k+1 items: the 1 new item and the k monitored in MG
– so a decrement step is equivalent to deleting k+1 distinct items from the stream
– there can be at most (N − M)/(k+1) decrement operations
– hence at most (N − M)/(k+1) “deleted” copies of any item
– so estimated counts have at most this much error

Page 27: Ke yi small summaries for big data


Merging two MG summaries [ACHPZY12]

Merging algorithm:
– merge the two sets of counters in the obvious way (adding counters for the same item)
– take the (k+1)-th largest counter value, C_{k+1}, and subtract it from all counters
– delete non-positive counters

[Figure: two sets of k = 5 counters being merged]
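A sketch of this merge, continuing the dict-of-counters representation used above; the threshold step follows the slide's rule, the names are illustrative.

```python
def mg_merge(c1, c2, k):
    """Merge two Misra-Gries summaries, keeping at most k counters."""
    merged = dict(c1)
    for x, c in c2.items():
        merged[x] = merged.get(x, 0) + c
    if len(merged) > k:
        # subtract the (k+1)-th largest counter value from all, drop non-positives
        threshold = sorted(merged.values(), reverse=True)[k]
        merged = {x: c - threshold for x, c in merged.items() if c > threshold}
    return merged
```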

Page 28: Ke yi small summaries for big data

Merging two MG summaries

This algorithm gives mergeability:
– the merge subtracts at least (k+1) · C_{k+1} from the counter sums
– so C_{k+1} ≤ (M₁ + M₂ − M₁₂)/(k+1), where M₁, M₂ are the counter sums before the merge and M₁₂ after
– the sum of the remaining (at most k) counters is M₁₂
– by induction, the total error is (N₁ + N₂ − M₁₂)/(k+1), combining the prior error with the error from the merge


Page 29: Ke yi small summaries for big data


Quantiles (order statistics)

Exact φ-quantile: for 0 < φ < 1, return the element of rank φN (the inverse of the CDF).

Approximate version: tolerate any answer whose rank is between (φ − ε)N and (φ + ε)N.

Dual problem, rank estimation: given x, estimate rank(x), the number of elements less than x
– can use binary search over rank estimates to find quantiles

Page 30: Ke yi small summaries for big data

Quantiles give an equi-height histogram
– automatically adapts to skewed data distributions

Equi-width histograms (fixed binning) are easy to construct but do not adapt to the data distribution.

Page 31: Ke yi small summaries for big data

The GK Quantile Summary [GK01]

Incoming: 3
Summary: 3 (rank 0)
(node: value & rank)
N = 1

Page 32: Ke yi small summaries for big data

GK

Incoming: 14
Summary: 3 (rank 0), 14 (rank 1)
N = 2

Page 33: Ke yi small summaries for big data

GK

Incoming: 50
Summary: 3 (rank 0), 14 (rank 1), 50 (rank 2)
We know the exact rank of any element. Space: one node per element (so far).
N = 3

Page 34: Ke yi small summaries for big data

GK

Incoming: 9 (rank 1)
Summary before adjusting: 3 (rank 0), 9 (rank 1), 14 (rank 1), 50 (rank 2)
Insertion in the middle: affects other nodes’ ranks
N = 4

Page 35: Ke yi small summaries for big data

GK

Summary: 3 (rank 0), 9 (rank 1), 14 (rank 2), 50 (rank 3)
Deletion: introduces errors
N = 4

Page 36: Ke yi small summaries for big data

GK

Summary after deleting 14: 3 (rank 0), 9 (rank 1), 50 (rank 3)
Deletion: introduces errors
Incoming: 13. rank(13) = 2 or 3?
N = 5

Page 37: Ke yi small summaries for big data

GK

Summary: 3 (rank 0), 9 (rank 1), 13 (rank 2-3), 50 (rank 3)
(node: value & rank bounds)
Incoming: 14
N = 5

Page 38: Ke yi small summaries for big data

GK

Incoming: 14, inserted with rank bounds 3-4
Summary: 3 (rank 0), 9 (rank 1), 13 (rank 2-3), 14 (rank 3-4), 50 (rank 4)
Insertion does not introduce errors: the error stays within the interval
N = 6

Page 39: Ke yi small summaries for big data

GK

Summary: 3 (rank 0), 9 (rank 1), 13 (rank 2-3), 14 (rank 3-4), 50 (rank 5)
One comes, one goes: new elements are inserted immediately; try to delete one node after each insertion
N = 6

Page 40: Ke yi small summaries for big data

GK

Summary: 3 (rank 0), 9 (rank 1), 13 (rank 2-3), 14 (rank 3-4), 50 (rank 5)
Removing a node causes errors (node errors here: 1, 2, 2, 2, 2)
Delete the node with the smallest error, if that error is smaller than the allowed error budget
N = 6

Page 41: Ke yi small summaries for big data

GK

Summary: 3 (rank 0), 9 (rank 1), 13 (rank 2-3), 14 (rank 3-4), 50 (rank 5)
Query: rank(20)
Return rank(20) = 4; the error is at most 1

Page 42: Ke yi small summaries for big data

The GK Quantile Summary

The practical version:
– space: open problem!
– small in practice

A more complicated version achieves O((1/ε) log(εN)) space.

Doesn’t support deletions. Doesn’t (we don’t know how to) support merging. There is, however, a randomized mergeable quantile summary.

Page 43: Ke yi small summaries for big data


A Quantile Summary Supporting Deletions [WLYC13]

Assumption: all elements are integers from [0..U]. Use a dyadic decomposition of the universe.

[Figure: dyadic intervals over [0, U-1]; e.g. one top-level node counts the # of elements between U/2 and U-1]

Page 44: Ke yi small summaries for big data

A Quantile Summary Supporting Deletions

[Figure: rank(x) is obtained by summing the counts of the O(log U) dyadic intervals that cover [0, x)]

Page 45: Ke yi small summaries for big data

A Quantile Summary Supporting Deletions

Consider each layer (level of the dyadic decomposition) as a frequency vector, and build a sketch of it
– Count Sketch is better than Count-Min here, due to its unbiasedness

Space: one sketch per level, i.e. O(log U) sketches.

[Figure: a Count Sketch (CS) per level of the decomposition]
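To make the dyadic decomposition concrete, here is a simplified sketch that keeps an exact counter per dyadic node (a dict per level); a real implementation would replace each level's counters by a Count Sketch as the slide suggests. The class name, parameters and greedy decomposition are illustrative.

```python
class DyadicRank:
    """Rank queries over integers in [0, 2**levels) with inserts and deletes."""

    def __init__(self, levels):
        self.levels = levels                                  # universe size U = 2**levels
        self.counts = [dict() for _ in range(levels + 1)]     # one "sketch" per level

    def _change(self, v, delta):
        for level in range(self.levels + 1):
            block = v >> level
            self.counts[level][block] = self.counts[level].get(block, 0) + delta

    def insert(self, v):
        self._change(v, +1)

    def delete(self, v):
        self._change(v, -1)

    def rank(self, x):
        """Number of inserted elements strictly less than x (O(log U) lookups)."""
        r, lo = 0, 0
        while lo < x:
            # largest aligned dyadic block [lo, lo + 2**level) fitting inside [lo, x)
            level = 0
            while (level < self.levels and lo % (1 << (level + 1)) == 0
                   and lo + (1 << (level + 1)) <= x):
                level += 1
            r += self.counts[level].get(lo >> level, 0)
            lo += 1 << level
        return r
```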

Page 46: Ke yi small summaries for big data


Random Sampling

Page 47: Ke yi small summaries for big data


Random Sample as a Summary

Cost:
– a random sample is cheaper to build if you have random access to the data

Error-size tradeoff:
– specialized summaries often have size O(1/ε)
– a random sample needs size roughly O(1/ε²)

Generality:
– specialized summaries can only answer one type of query
– a random sample can be used to answer many different queries
– in particular, ad hoc queries that are unknown beforehand

Page 48: Ke yi small summaries for big data


Sampling Definitions

Sampling without replacement:
– randomly draw an element, don’t put it back; repeat s times

Sampling with replacement:
– randomly draw an element, put it back; repeat s times

Coin-flip sampling:
– sample each element independently with probability p
– the sample size changes for dynamic data

The statistical difference between these schemes is very small when the sample is much smaller than the data.

Page 49: Ke yi small summaries for big data


Reservoir Sampling

Given a sample of size s drawn (without replacement) from a data set of size N.

A new element is added; how do we update the sample? Algorithm:
– with probability s/(N+1), use the new element to replace an item in the current sample, chosen uniformly at random
– with probability 1 − s/(N+1), throw the new element away
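A minimal sketch of the algorithm above; the stream can be any iterable, and the function name is illustrative.

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform without-replacement sample of size s over a stream."""
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)                      # fill the reservoir first
        elif rng.random() < s / n:
            sample[rng.randrange(s)] = x          # replace a uniformly random slot
    return sample

print(reservoir_sample(range(1_000_000), 10))
```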

Page 50: Ke yi small summaries for big data


Reservoir Sampling Correctness Proof

Many “proofs” found online are actually wrong:
– they only show that each item is sampled with probability s/N
– we need to show that every subset of size s has the same probability of being the sample

A correct proof relates the algorithm to the Fisher-Yates shuffle.

[Figure: Fisher-Yates shuffles of a, b, c, d; the first s = 2 positions of the shuffle form the sample]

Page 51: Ke yi small summaries for big data

Merging random samples

[Figure: samples S₁ and S₂ of two data sets, N₁ = 10, N₂ = 15]
With prob. N₁/(N₁+N₂), take a random item from S₁ (otherwise from S₂).

Page 52: Ke yi small summaries for big data

Merging random samples

[Figure: N₁ = 9, N₂ = 15 remaining]
With prob. N₁/(N₁+N₂), take a random item from S₁ (otherwise from S₂).

Page 53: Ke yi small summaries for big data

Merging random samples

[Figure: N₁ = 9, N₂ = 15 remaining]
With prob. N₁/(N₁+N₂), take a random item from S₁ (otherwise from S₂).

Page 54: Ke yi small summaries for big data

Merging random samples

[Figure: N₁ = 9, N₂ = 14 remaining]
With prob. N₁/(N₁+N₂), take a random item from S₁ (otherwise from S₂).

Page 55: Ke yi small summaries for big data

Merging random samples

[Figure: N₁ = 8, N₂ = 12 remaining; the merged sample represents N₃ = 25 items]
With prob. N₁/(N₁+N₂), take a random item from S₁ (otherwise from S₂); repeat until s items have been drawn.
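A sketch of the merge procedure illustrated on the last few slides, assuming both inputs are size-s without-replacement samples of data sets of sizes n1 and n2; the names and list-based representation are illustrative.

```python
import random

def merge_samples(sample1, n1, sample2, n2, s, rng=random):
    """Produce a size-s sample of the union of two data sets of sizes n1 and n2,
    given size-s (without-replacement) samples of each."""
    s1, s2 = list(sample1), list(sample2)
    merged = []
    for _ in range(s):
        if rng.random() < n1 / (n1 + n2):
            merged.append(s1.pop(rng.randrange(len(s1))))
            n1 -= 1            # one fewer unrepresented item on side 1
        else:
            merged.append(s2.pop(rng.randrange(len(s2))))
            n2 -= 1
    return merged
```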

Page 56: Ke yi small summaries for big data


Random Sampling with Deletions

Naïve method:
– maintain a sample of size s
– handle insertions as in reservoir sampling
– if a deleted item is not in the sample, ignore the deletion
– if a deleted item is in the sample, delete it from the sample
– problem: the sample size may drop below s

There are some other algorithms in the DB literature, but the sample size still drops when there are many deletions.

Page 57: Ke yi small summaries for big data


Min-wise Hashing / Sampling

For each item x, compute h(x). Return the s items with the smallest values of h(x) as the sample.
– assumption: h is a truly random hash function
– a lot of research on how to remove this assumption
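A minimal sketch of min-wise sampling; the blake2b-derived hash is an illustrative stand-in for the truly random hash function assumed on the slide.

```python
import hashlib

def _hash(x, seed=0):
    h = hashlib.blake2b(f"{seed}|{x}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "big")

def minwise_sample(items, s, seed=0):
    """Return the s distinct items with the smallest hash values."""
    return sorted(set(items), key=lambda x: _hash(x, seed))[:s]
```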

Page 58: Ke yi small summaries for big data


ℓ₀-Sampling [FIS05, CMR05]

Suppose the hash value h(x) always has log U bits (pad with zeroes when necessary).
– map all items to level 0
– map items whose h(x) has one leading zero to level 1
– map items whose h(x) has two leading zeroes to level 2, ...
– use a k-sparse recovery summary (e.g. an invertible Bloom lookup table) for each level

[Figure: levels sampled with probability p = 1 down to p = 1/U, each with a k-sparse recovery structure]

Page 59: Ke yi small summaries for big data


Estimating Set Similarity with MinHash

Assumption: the sample from A and the sample from B are obtained with the same hash function h (so each is the set of items with the smallest hash values).

To estimate J(A, B) = |A ∩ B| / |A ∪ B|:
– merge Sample(A) and Sample(B) to get Sample(A ∪ B) of size s (the s items with the smallest hash values overall)
– count how many items in Sample(A ∪ B) are in A ∩ B
– divide by s
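A sketch of this estimator, reusing _hash and minwise_sample from the previous block; it uses the fact that an item of Sample(A ∪ B) lies in A ∩ B exactly when it appears in both input samples. The toy data at the end is illustrative.

```python
def jaccard_estimate(sample_a, sample_b, s, seed=0):
    """Estimate |A ∩ B| / |A ∪ B| from size-s min-wise samples of A and B."""
    sa, sb = set(sample_a), set(sample_b)
    # the s smallest-hash items among both samples form Sample(A ∪ B)
    union_sample = sorted(sa | sb, key=lambda x: _hash(x, seed))[:s]
    hits = sum(1 for x in union_sample if x in sa and x in sb)
    return hits / s

A, B = set(range(0, 80)), set(range(40, 120))
sa, sb = minwise_sample(A, 20), minwise_sample(B, 20)
print(jaccard_estimate(sa, sb, 20))   # true Jaccard similarity is 40/120 ≈ 0.33
```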

Page 60: Ke yi small summaries for big data


Advanced Summaries

Page 61: Ke yi small summaries for big data

Geometric data: ε-approximations

A sample that preserves the “density” of point sets:
– for any range (e.g., a circle), |fraction of sample points in the range − fraction of all points in the range| ≤ ε
– can simply draw a random sample of size roughly O(1/ε²)
– more careful constructions yield smaller size

Page 62: Ke yi small summaries for big data

Geometric data: ε-kernels

ε-kernels approximately preserve the convex hull of a point set:
– an ε-kernel has size O(1/ε^{(d−1)/2})
– a streaming ε-kernel has size O((1/ε^{(d−1)/2}) · log(1/ε))

Page 63: Ke yi small summaries for big data


Graph Sketching [AGM12]

Goal: build a sketch for each vertex and use it to answer queries.

Connectivity: want to test whether there is a path between any two nodes
– trivial if edges can only be inserted; we want to support deletions

Basic idea: repeatedly contract edges between components
– use ℓ₀ sampling to get edges from each node’s vector of adjacencies
– the ℓ₀ sampling sketch supports deletions and merges

Problem: as components grow, sampling edges from a component is most likely to produce internal links.

Page 64: Ke yi small summaries for big data


Graph Sketching

Idea: use a clever encoding of edges. For an edge (i, j) with i < j, encode it as ((i, j), +1) in node i’s vector and as ((i, j), −1) in node j’s vector.

When node i and node j get merged, sum their ℓ₀ sketches:
– the contribution of edge (i, j) exactly cancels out
– only non-internal edges remain in the ℓ₀ sketches

Use independent sketches for each iteration of the algorithm:
– only O(log n) rounds are needed with high probability

Result: O(poly-log n) space per node for connectivity.

Page 65: Ke yi small summaries for big data


Other Graph Results via sketching

Recent flurry of activity in summaries for graph problems:
– k-connectivity, via connectivity
– bipartiteness, via connectivity
– (weight of the) minimum spanning tree
– sparsification: find G′ with few edges so that cut(G, C) ≈ cut(G′, C)
– matching: find a maximal matching (assuming it is small)

Total cost is typically Õ(|V|), rather than O(|E|)
– the semi-streaming / semi-external model

Page 66: Ke yi small summaries for big data


Matrix Sketching

Given matrices A and B, want to approximate the matrix product AB
– measure the normed error of an approximation C: ‖AB − C‖

Main results are for the Frobenius (entrywise) norm ‖·‖_F
– ‖C‖_F = (Σ_{i,j} C_{i,j}²)^{1/2}
– the results rely on sketches, so this entrywise norm is most natural

Page 67: Ke yi small summaries for big data


Direct Application of Sketches

Build an AMS sketch of each row of A (Aᵢ) and of each column of B (Bⱼ). Estimate Cᵢ,ⱼ by estimating the inner product of Aᵢ with Bⱼ
– the absolute error in each estimate is ε‖Aᵢ‖₂ ‖Bⱼ‖₂ (whp)
– summing the squared errors over all entries gives ‖AB − C‖_F ≤ ε‖A‖_F ‖B‖_F

This outline was formalized & improved by Clarkson & Woodruff [09, 13]
– improved running time to linear in the number of non-zeros of A and B
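A minimal sketch of the underlying AMS/Count-Sketch inner-product estimator: two vectors are sketched with the same hash functions, each row's bucket-wise dot product is an unbiased estimate of the inner product, and the median over rows is returned. The class and hashing scheme are illustrative, not the Clarkson-Woodruff construction itself.

```python
import hashlib
import statistics

class AMSSketch:
    def __init__(self, width, depth, seed=0):
        self.w, self.d, self.seed = width, depth, seed
        self.table = [[0.0] * width for _ in range(depth)]

    def _hash(self, row, i, salt):
        h = hashlib.blake2b(f"{self.seed}|{salt}|{row}|{i}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big")

    def update(self, i, value):
        # fold coordinate i (with the given value) into every row
        for r in range(self.d):
            sign = 1 if self._hash(r, i, "g") & 1 else -1
            self.table[r][self._hash(r, i, "h") % self.w] += sign * value

def inner_product_estimate(sketch_a, sketch_b):
    """Estimate <a, b> from two AMS sketches built with the same width/depth/seed."""
    row_estimates = [
        sum(x * y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(sketch_a.table, sketch_b.table)
    ]
    return statistics.median(row_estimates)
```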

Page 68: Ke yi small summaries for big data


Compressed Matrix Multiplication

What if we are just interested in the large entries of AB?
– or in the ability to estimate any entry of AB
– arises in recommender systems and other ML applications
– if we had a sketch of AB, we could find these approximately

Compressed Matrix Multiplication [Pagh 12]:
– can we compute sketch(AB) from sketch(A) and sketch(B)?
– to do this, need to dive into the structure of the Count (AMS) sketch

Several insights needed to build the method:
– express the matrix product as a summation of outer products
– take the convolution of sketches to get a sketch of each outer product
– a new hash function enables this to go through
– use the FFT to speed up from O(w²) to O(w log w)

Page 69: Ke yi small summaries for big data


More Linear Algebra

Matrix multiplication improvement: use more powerful hash functions
– obtain a single accurate estimate with high probability

Linear regression: given a matrix A and a vector b, find x ∈ R^d to (approximately) solve min_x ‖Ax − b‖
– approach: solve the minimization in “sketch space”
– from a summary of size O(d²/ε) [independent of the number of rows of A]

Frequent directions: approximate matrix-vector products [Ghashami, Liberty, Phillips, Woodruff 15]
– use the SVD to (incrementally) summarize matrices

The relevant sketches can be built quickly: in time proportional to the number of non-zeros in the matrices (input sparsity)
– survey: Sketching as a Tool for Numerical Linear Algebra [Woodruff 14]

Page 70: Ke yi small summaries for big data


Current Directions in Data Summarization

Sparse representations of high-dimensional objects
– compressed sensing, sparse fast Fourier transform

General-purpose numerical linear algebra for (large) matrices
– k-rank approximation, linear regression, PCA, SVD, eigenvalues

Summaries to verify a full calculation: a “checksum for computation”

Geometric (big) data: coresets, clustering, machine learning

Use of summaries in large-scale, distributed computation
– build them in MapReduce, continuous distributed models

Communication-efficient maintenance of summaries
– as the (distributed) input is modified

Page 71: Ke yi small summaries for big data

Small Summary of Small Summaries

Two complementary approaches in response to growing data sizes:
– scale the computation up; scale the data down

The theory and practice of data summarization has many guises:
– sampling theory (since the start of statistics)
– streaming algorithms in computer science
– compressive sampling, dimensionality reduction, ... (maths, stats, CS)

Continuing interest in applying and developing new theory.

Page 72: Ke yi small summaries for big data


Thank you!