Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics
TRANSCRIPT
![Page 1: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/1.jpg)
Sketching Big Data with Spark
Reynold Xin @rxin Sep 29, 2015 @ Strata NY
![Page 2: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/2.jpg)
About Databricks
Founded by creators of Spark in 2013
Cloud service for end-to-end data processing
• Interactive notebooks, dashboards, and production jobs
We are hiring!
![Page 3: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/3.jpg)
Spark
![Page 4: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/4.jpg)
Count-min sketch
![Page 5: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/5.jpg)
Approximate frequent items
![Page 6: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/6.jpg)
Taylor Swift
![Page 7: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/7.jpg)
![Page 8: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/8.jpg)
“Spark is the Taylor Swift of big data software.” - Derrick Harris, Fortune
![Page 9: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/9.jpg)
Who is this guy?
Co-founder & architect for Spark at Databricks
Former PhD student at UC Berkeley AMPLab
A “systems” guy, which means I won’t be showing equations and this talk might be the easiest to consume in HDS
![Page 10: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/10.jpg)
This talk
1. Develop intuitions about these sketches so you know when to use them
2. Understand how certain parts in distributed data processing (e.g. Spark) work
![Page 11: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/11.jpg)
![Page 12: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/12.jpg)
Sketch: Reynold’s not-so-scientific definition
1. Use a small amount of space to summarize a large dataset.
2. Go over each data point once, a.k.a. “streaming algorithm”, or “online algorithm”
3. Parallelizable, but only a small amount of communication
![Page 13: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/13.jpg)
What for?
• Exploratory analysis
• Feature engineering
• Combine sketch and exact to speed up processing
![Page 14: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/14.jpg)
Sketches in Spark
• Set membership (Bloom filter)
• Cardinality (HyperLogLog)
• Histogram (count-min sketch)
• Frequent pattern mining
• Frequent items
• Stratified Sampling
• …
![Page 15: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/15.jpg)
This Talk
• Set membership (Bloom filter)
• Cardinality (HyperLogLog)
• Histogram (count-min sketch)
• Frequent pattern mining
• Frequent items
• Stratified Sampling
• …
![Page 16: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/16.jpg)
Set membership
![Page 17: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/17.jpg)
Set membership
Identify whether an item is in a set e.g. “You have bought this item before”
![Page 18: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/18.jpg)
Exact set membership
Track every member of the set
• Space: size of data
• One pass: yes
• Parallelizable & communication: size of data
![Page 19: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/19.jpg)
Approximate set membership
Take 1. Use a 32-bit integer hash map to track
• ~4 bytes per record
• Max 4 billion items
Take 2. Hash items to 256 buckets
• Memory usage only 256 bits
• Good if num records is small
• Bad if num records is large (256+ items, collision rate 100%!)
![Page 20: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/20.jpg)
Bloom filter
Bloom filter algorithm
• k hash functions
• To add an item: hash it into k separate positions and set those bits
• To look up an item: if any of its k positions is not set, then the item is not in the set
Properties
• ~600MB (about 4.8 billion bits) needed to have 10% error rate on 1 billion items
• See http://hur.st/bloomfilter?n=1000000000&p=0.1
• False positives possible; false negatives are not
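To make the algorithm and the sizing figure concrete, here is a minimal Bloom filter sketch in plain Python. It is not Spark's implementation. The sizing uses the standard textbook formulas, and the names `BloomFilter`, `optimal_num_bits`, and `might_contain`, as well as the SHA-256 double-hashing scheme, are assumptions made for illustration.

```python
import hashlib
import math


def optimal_num_bits(expected_items: int, error_rate: float) -> int:
    # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits.
    return math.ceil(-expected_items * math.log(error_rate) / math.log(2) ** 2)


class BloomFilter:
    """Illustrative Bloom filter; names and hashing scheme are assumptions."""

    def __init__(self, expected_items: int, error_rate: float):
        self.num_bits = max(1, optimal_num_bits(expected_items, error_rate))
        # Textbook choice of k = (m / n) * ln 2, rounded, at least 1.
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions from two base hashes (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not in the set"; True means "probably in the set".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


purchases = BloomFilter(expected_items=1000, error_rate=0.1)
for i in range(1000):
    purchases.add(f"item-{i}")
print(purchases.might_contain("item-42"))               # an inserted item is always reported
print(optimal_num_bits(10**9, 0.1) / 8 / 1e6, "MB")     # sizing for 1 billion items at 10% error
```

An item that was added is never reported as absent, so the only error is a false positive, and its rate is set by how many bits are allocated per item.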
![Page 21: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/21.jpg)
Use case beyond exploration
SELECT * FROM A JOIN B ON A.key = B.key
1. Assume A and B are both large, i.e. “shuffle join”
2. Some rows in A might not have matched rows in B
3. Wouldn’t it be nice if we only need to shuffle rows that match?
Answer: use a Bloom filter to filter out the rows that don’t match
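The join idea can be sketched without Spark: build a small Bloom filter over B's keys, then drop rows of A whose key is definitely absent before the expensive shuffle. This is only an illustration under assumptions. `NUM_BITS`, `NUM_HASHES`, `prefilter`, and the toy rows are hypothetical, and a real engine would build the filter in a distributed way.

```python
import hashlib

NUM_BITS = 1 << 16   # hypothetical filter size in bits
NUM_HASHES = 3       # hypothetical number of hash functions


def positions(key):
    # Slice one SHA-256 digest into NUM_HASHES 4-byte chunks, one position each.
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % NUM_BITS
            for i in range(NUM_HASHES)]


def build_filter(keys):
    bits = 0  # a Python int used as a bit array
    for key in keys:
        for pos in positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits, key):
    return all((bits >> pos) & 1 for pos in positions(key))


def prefilter(rows_a, keys_b):
    """Keep only rows of A whose key might appear in B."""
    bits = build_filter(keys_b)
    return [row for row in rows_a if might_contain(bits, row[0])]


rows_a = [(key, f"a{key}") for key in range(1000)]   # toy table A: (key, payload)
keys_b = range(0, 1000, 10)                          # toy table B keys: every 10th key
survivors = prefilter(rows_a, keys_b)
print(len(rows_a), "rows before filtering,", len(survivors), "rows left to shuffle")
```

Because the filter has no false negatives, every row that truly matches survives; a few non-matching rows may slip through and are discarded by the join itself.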
![Page 22: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/22.jpg)
Frequent items
![Page 23: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/23.jpg)
Frequent Items
Find items more frequent than 1/k
![Page 24: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/24.jpg)
Source: http://www.macfreek.nl/memory/Letter_Distribution
![Page 25: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/25.jpg)
[Bar chart: Twitter Followers of NBA teams (in 1,000s), September 2015. Y-axis: Twitter followers in thousands, 0 to 5,000.]
Bar values, largest to smallest: 4,474; 3,146; 2,352; 1,749; 1,293; 1,248; 1,107; 1,094; 1,065; 907; 835; 793; 789; 737; 598; 582; 517; 482; 447; 444; 420; 409; 409; 405; 400; 381; 378; 369; 367; 366
Source: http://www.statista.com/statistics/240386/twitter-followers-of-national-basketball-association-teams/
![Page 26: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/26.jpg)
Frequent Items
Exploration
• Identify important members in a network
• E.g. “the”, LA Lakers, Taylor Swift
Feature Engineering
• Identify outliers
• Ignore low frequency items
![Page 27: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/27.jpg)
Frequent Items: Exact Algorithm
SELECT item, count(*) cnt FROM corpus GROUP BY item HAVING count(*) > (SELECT count(*) FROM corpus) / k
• Space: linear in |item|
• One pass: no (two passes)
• Parallelizable & communication: linear in |item|
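For comparison with the sketch that follows in the talk, the exact approach can be written in a few lines of plain Python. This is an illustrative sketch, not Spark code; `exact_frequent_items` is a hypothetical name, and the table of counts holds one entry per distinct item, which is the cost the sketch avoids.

```python
from collections import Counter


def exact_frequent_items(items, k):
    """Return items whose relative frequency is strictly greater than 1/k."""
    counts = Counter(items)            # one counter per distinct item
    total = sum(counts.values())       # total number of records
    # count / total > 1 / k, rewritten without division
    return {item: count for item, count in counts.items() if count * k > total}


print(exact_frequent_items(list("aabac"), k=2))
```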
![Page 28: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/28.jpg)
![Page 29: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/29.jpg)
Example 1: Find Items Frequency > ½ (k=2)
![Page 30: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/30.jpg)
draw
Put the balls back if any two of them are the same color
![Page 31: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/31.jpg)
![Page 32: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/32.jpg)
draw
Remove the balls if they are all different colors
![Page 33: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/33.jpg)
Example 1: Find Items Frequency > 1/2
Blue ball left (frequent item)
![Page 34: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/34.jpg)
Example 2: Find Items Frequency > ½ (k=2)
![Page 35: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/35.jpg)
draw
![Page 36: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/36.jpg)
![Page 37: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/37.jpg)
draw
![Page 38: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/38.jpg)
draw
![Page 39: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/39.jpg)
1 ball left (frequent item)
![Page 40: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/40.jpg)
How do we implement this?
Maintain a hash table of counts
![Page 41: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/41.jpg)
Increment for every ball we see
0 => 1
![Page 42: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/42.jpg)
Increment for every ball we see
1 => 2
![Page 43: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/43.jpg)
Increment for every ball we see
0 => 4
![Page 44: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/44.jpg)
Increment for every ball we see
0 => 4
![Page 45: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/45.jpg)
Increment for every ball we see
4
0 => 1
![Page 46: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/46.jpg)
When the hash table has k items, remove 1 from each item and remove the item if count = 0
4 => 3
1 => 0
![Page 47: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/47.jpg)
3
![Page 48: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/48.jpg)
3
0 => 1
![Page 49: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/49.jpg)
2
![Page 50: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/50.jpg)
2
0 => 1
![Page 51: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/51.jpg)
1
![Page 52: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/52.jpg)
Implementation
Maintains a hash table of counts
• For each item, increment its count
• If hash table size == k:
– decrement 1 from each item; and
– remove items whose count == 0
Parallelization: merge hash tables of max size k
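The steps above can be sketched in plain Python as follows. This is an illustration, not Spark's source code. The function names are hypothetical, and the merge rule (add the counts, then subtract the k-th largest count) is one common way to combine such tables, assumed here rather than taken from the slides.

```python
def add_item(counts, item, k):
    """Process one item, following the slide: increment, then shrink at size k."""
    counts[item] = counts.get(item, 0) + 1
    if len(counts) == k:
        for key in list(counts):
            counts[key] -= 1
            if counts[key] == 0:
                del counts[key]


def sketch(items, k):
    """One pass over the data; the table never ends a step with more than k - 1 entries."""
    counts = {}
    for item in items:
        add_item(counts, item, k)
    return counts


def merge(left, right, k):
    """Combine two partial tables (for example, one per partition)."""
    merged = dict(left)
    for key, count in right.items():
        merged[key] = merged.get(key, 0) + count
    if len(merged) >= k:
        # Subtract the k-th largest count so that at most k - 1 entries survive.
        cutoff = sorted(merged.values(), reverse=True)[k - 1]
        merged = {key: count - cutoff for key, count in merged.items() if count > cutoff}
    return merged


balls = ["blue", "red", "blue", "blue", "green", "blue"]
print(sketch(balls, k=2))   # "blue" appears in more than half of the draws
```

Every item more frequent than 1/k is guaranteed to remain in the table, but the table can also hold items that are not frequent, so a second exact pass over the candidates is needed when false positives matter.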
![Page 53: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/53.jpg)
Comparing Exact vs Approximate
| | Naïve Exact | Sketch |
|---|---|---|
| # Passes | 2 | 1 |
| Memory | \|item\| | k |
| Communication | \|item\| | k |
![Page 54: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/54.jpg)
Comparing Exact vs Approximate
| | Naïve Exact | Sketch | Smart Exact |
|---|---|---|---|
| # Passes | 2 | 1 | 2 (1st pass using sketch) |
| Memory | \|item\| | k | k |
| Communication | \|item\| | k | k |
![Page 55: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/55.jpg)
Quiz: an example with false positive?
K = 3
![Page 56: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/56.jpg)
How to use it in Spark?
Frequent items for multiple columns independently
• df.stat.freqItems(["columnA", "columnB", …])
Frequent items for composite keys
• df.withColumn("ab", struct("columnA", "columnB")).stat.freqItems(["ab"])
![Page 57: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/57.jpg)
Stratified sampling
![Page 58: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/58.jpg)
Bernoulli sampling & Variance
Sample US population (300m) using rate 0.000002 (~600)
• Wyoming (0.5m) should have 1
• Bernoulli sampling leaves Wyoming with 0 samples about 37% of the time
Intuition: uniform sampling leads to ~ 600 samples.
• i.e. it might be 600, or 601, or 599, or …
• Impact on WY when going from 600 to 601 is much larger than the impact on CA
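The Wyoming numbers can be checked with a few lines of arithmetic. The sketch below treats each person as an independent coin flip at the stated rate. The California population of 39 million is an assumption added for illustration, and `relative_std` is a hypothetical helper name.

```python
import math

RATE = 0.000002
WYOMING = 500_000
CALIFORNIA = 39_000_000   # assumed population, not from the slides


def relative_std(population, rate):
    """Standard deviation of the sample count divided by its mean (binomial model)."""
    return math.sqrt((1 - rate) / (population * rate))


expected_wy = WYOMING * RATE             # expected number of Wyoming samples
p_zero_wy = (1 - RATE) ** WYOMING        # probability that Wyoming gets no sample at all

print("expected WY samples:", expected_wy)
print("P(WY has 0 samples):", round(p_zero_wy, 3))
print("relative std, WY:", round(relative_std(WYOMING, RATE), 3))
print("relative std, CA:", round(relative_std(CALIFORNIA, RATE), 3))
```

The relative spread of a stratum's sample count shrinks roughly as one over the square root of its expected count, which is why small strata are hit hardest by plain Bernoulli sampling.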
![Page 59: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/59.jpg)
Stratified sampling
Existing “exact” algorithms
• Draw-by-draw
• Selection-rejection
• Reservoir
• Random sort
Either sequential or expensive (full global sort)
![Page 60: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/60.jpg)
Random sort
Example: sampling probability p = 0.1 on 100 items.
1. Generate random keys
• (0.644, t1), (0.378, t2), … (0.500, t99), (0.471, t100)
2. Sort and select the smallest 10 items
• (0.028, t94), (0.029, t44), …, (0.137, t69), …, (0.980, t26), (0.988, t60)
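A minimal single-machine sketch of random sort, assuming Python's `random` module as the source of keys; `random_sort_sample` is a hypothetical name. The full sort is what makes this approach expensive at scale.

```python
import random


def random_sort_sample(items, p, seed=None):
    """Attach a uniform random key to every item, sort, keep the smallest p * n."""
    rng = random.Random(seed)
    keyed = [(rng.random(), item) for item in items]
    keyed.sort(key=lambda pair: pair[0])        # the expensive global sort
    sample_size = round(p * len(items))
    return [item for _, item in keyed[:sample_size]]


items = [f"t{i}" for i in range(1, 101)]
sample = random_sort_sample(items, p=0.1, seed=42)
print(len(sample), "items sampled out of", len(items))
```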
![Page 61: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/61.jpg)
Heuristics
Qualitatively speaking, for an item t with random key u
• If u is “much larger” than p, then t is “unlikely” to be selected
• If u is “much smaller” than p, then t is “likely” to be selected
Set two thresholds q1 and q2, such that:
• If u < q1, accept t directly
• If u > q2, reject t directly
• Otherwise, put t in a buffer to be sorted
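The two-threshold idea can be sketched as below. The thresholds are passed in as plain parameters with hypothetical values; how to choose q1 and q2 so that failures are rare is the subject of the paper cited later in the talk and is not reproduced here. `threshold_sample` is a hypothetical name.

```python
import random


def threshold_sample(items, p, q1, q2, seed=None):
    """Select exactly round(p * n) items while sorting only a small buffer."""
    rng = random.Random(seed)
    target = round(p * len(items))
    accepted, waitlist = [], []
    for item in items:
        u = rng.random()
        if u < q1:
            accepted.append(item)          # key is so small that the item is surely selected
        elif u <= q2:
            waitlist.append((u, item))     # undecided: keep the key for sorting later
        # u > q2: rejected immediately, nothing is stored
    needed = target - len(accepted)
    if needed < 0 or needed > len(waitlist):
        raise ValueError("thresholds did not bracket the sample size; widen q1/q2")
    waitlist.sort(key=lambda pair: pair[0])    # sort only the buffer
    chosen = accepted + [item for _, item in waitlist[:needed]]
    return chosen, len(waitlist)


sample, buffered = threshold_sample(range(1000), p=0.1, q1=0.05, q2=0.2, seed=7)
print(len(sample), "items sampled;", buffered, "of 1000 items had to be sorted")
```

When the thresholds bracket the target size, the result is the same set a full random sort would pick for the same keys, yet only the buffered items are sorted.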
![Page 62: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/62.jpg)
Spark’s stratified sampling algorithm
Combines “exact” and “sketch” to achieve parallelization & low memory overhead
df.stat.sampleByKeyExact(col, fractions, seed)
Xiangrui Meng. Scalable Simple Random Sampling and Stratified Sampling. ICML 2013
![Page 63: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/63.jpg)
This Talk
• Set membership (Bloom filter)
• Cardinality (HyperLogLog)
• Histogram (count-min sketch)
• Frequent pattern mining
• Frequent items
• Stratified Sampling
• …
![Page 64: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/64.jpg)
Conclusion
Sketches can be useful in exploration, feature engineering, as well as building faster exact algorithms. We are building a lot of these into Spark so you don’t need to reinvent the wheel!
![Page 65: Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics](https://reader031.vdocument.in/reader031/viewer/2022022203/586fdc0c1a28ab18428b6363/html5/thumbnails/65.jpg)
Thank you. Meetup tonight @ Civic Hall, 6:30pm 156 5th Avenue, 2nd floor, New York, NY