Large-scale real-time analytics for everyone: fast, cheap and 98% correct

Upload: pavel-kalaidin

Post on 16-Jul-2015

Category:

Data & Analytics


TRANSCRIPT

Page 1: Large-scale real-time analytics for everyone

Large-scale real-time analytics for everyone:

fast, cheap and 98% correct

Page 3: Large-scale real-time analytics for everyone

we have a lot of data

memory is limited

one pass would be great

constant update time

Page 4: Large-scale real-time analytics for everyone

max, min and mean are trivial

Page 5: Large-scale real-time analytics for everyone

median, anyone?

Page 6: Large-scale real-time analytics for everyone

Sampling?

Page 7: Large-scale real-time analytics for everyone

Probabilistic algorithms

Page 8: Large-scale real-time analytics for everyone

An estimate is OK, but it is nice to know how the error is distributed

Page 9: Large-scale real-time analytics for everyone

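def frugal(stream):
    # m is the running median estimate and the only state we keep
    m = 0
    for val in stream:
        if val > m:
            m += 1    # nudge the estimate up
        elif val < m:
            m -= 1    # nudge the estimate down
    return m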

Page 10: Large-scale real-time analytics for everyone

Memory used: 1 int!

It really works
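
A quick way to check that claim, as a sketch rather than a benchmark: the snippet below feeds the one-integer estimator a stream of uniformly distributed integers and compares the result with the exact median. The data, the seed and the stream length are assumptions chosen for illustration.

```python
import random
import statistics

def frugal(stream):
    m = 0
    for val in stream:
        if val > m:
            m += 1
        elif val < m:
            m -= 1
    return m

# Hypothetical test data: 100,000 integers drawn uniformly from 0..1000.
rng = random.Random(42)
stream = [rng.randint(0, 1000) for _ in range(100_000)]

estimate = frugal(stream)            # one pass, one integer of state
exact = statistics.median(stream)    # needs the whole stream in memory
print("frugal:", estimate, "exact:", exact)
```

The estimate keeps wandering around the true median instead of settling on it, so expect it to be close, not equal.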

Page 11: Large-scale real-time analytics for everyone
Page 12: Large-scale real-time analytics for everyone

Percentiles?

Page 13: Large-scale real-time analytics for everyone

Demo: bit.ly/frugalsketch

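import numpy as np

def frugal_1u(stream, m=0, q=0.5):
    # q is the target quantile: 0.5 is the median, 0.9 the 90th percentile
    for val in stream:
        r = np.random.random()
        if val > m and r > 1 - q:    # step up with probability q
            m += 1
        elif val < m and r > q:      # step down with probability 1 - q
            m -= 1
    return m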
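
A usage sketch for a percentile other than the median. To stay dependency-free it swaps NumPy's generator for the standard library's `random`, passed in through a hypothetical `rng` parameter. The uniform test stream and the seed are assumptions for illustration.

```python
import random

def frugal_1u(stream, m=0, q=0.5, rng=random):
    for val in stream:
        r = rng.random()
        if val > m and r > 1 - q:    # step up with probability q
            m += 1
        elif val < m and r > q:      # step down with probability 1 - q
            m -= 1
    return m

# Hypothetical test data: 200,000 integers drawn uniformly from 0..1000.
rng = random.Random(7)
stream = [rng.randint(0, 1000) for _ in range(200_000)]

p90_estimate = frugal_1u(stream, q=0.9, rng=rng)
p90_exact = sorted(stream)[int(0.9 * len(stream))]
print("frugal p90:", p90_estimate, "exact p90:", p90_exact)
```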

Page 14: Large-scale real-time analytics for everyone

Streaming + probabilistic = sketch

Page 15: Large-scale real-time analytics for everyone

What do we want?

Get the number of unique users,

aka the cardinality

Page 16: Large-scale real-time analytics for everyone

What do we want?

Get the number of unique users

grouped by host, date, segment

Page 17: Large-scale real-time analytics for everyone

When do we want it?

Well, right now

Page 18: Large-scale real-time analytics for everyone

Data: 10^10 elements,

10^9 unique

int32

40 GB

Page 19: Large-scale real-time analytics for everyone

Straightforward approach: hash table

Page 20: Large-scale real-time analytics for everyone

Hash table: 4 GB
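
The sizes quoted here follow from simple arithmetic. The sketch below assumes 4 bytes per int32 and decimal gigabytes, and counts only the keys: a real hash table adds overhead on top, so 4 GB is a lower bound.

```python
BYTES_PER_INT32 = 4
total_elements = 10**10
unique_elements = 10**9

raw_data_bytes = total_elements * BYTES_PER_INT32
hash_table_bytes = unique_elements * BYTES_PER_INT32   # keys only, no overhead

print(raw_data_bytes / 10**9, "GB of raw data")        # 40.0
print(hash_table_bytes / 10**9, "GB of unique keys")   # 4.0
```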

Page 21: Large-scale real-time analytics for everyone

HyperLogLog: 1.5 KB, 2% error

Page 22: Large-scale real-time analytics for everyone

It all starts with an algorithm called LogLog

Page 23: Large-scale real-time analytics for everyone

Imagine I tell you I spent this morning flipping a coin

Page 24: Large-scale real-time analytics for everyone

and then tell you the longest uninterrupted run of heads

Page 25: Large-scale real-time analytics for everyone

2 times

or

100 times

Page 26: Large-scale real-time analytics for everyone

In which case did I flip the coin for longer?

Page 27: Large-scale real-time analytics for everyone

We are interested in patterns in hashes

(namely the longest runs of leading zeros = heads)

Page 28: Large-scale real-time analytics for everyone

Hash, don’t sample!*

* need a good hash function

Page 29: Large-scale real-time analytics for everyone

Expecting:

0xxxxxx hashes - ~50%

1xxxxxx hashes - ~50%

00xxxxx hashes - ~25%

Page 30: Large-scale real-time analytics for everyone

Estimate: 2^R, where R is the longest run of leading zeros in the hashes
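
A minimal sketch of this single-estimator idea. The hash function (the top 32 bits of SHA-1) and the helper names are assumptions made for illustration, not something the slides specify.

```python
import hashlib

HASH_BITS = 32  # assumption: use the first 32 bits of a SHA-1 digest

def hash32(item):
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def leading_zeros(h):
    # Number of zero bits before the first 1 bit in a 32-bit value.
    return HASH_BITS - h.bit_length()

def crude_estimate(items):
    longest_run = 0
    for item in items:
        longest_run = max(longest_run, leading_zeros(hash32(item)))
    return 2 ** longest_run

users = [f"user-{i}" for i in range(10_000)]
print(crude_estimate(users))
```

The result is always a power of two and is very noisy, but repeated items cannot change it, because a duplicate hashes to a value that has already been seen.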

Page 31: Large-scale real-time analytics for everyone

I can perform several flipping experiments

Page 32: Large-scale real-time analytics for everyone

and average the number of zeros

Page 33: Large-scale real-time analytics for everyone

This is called stochastic averaging

Page 34: Large-scale real-time analytics for everyone

So far the estimate is 2^R, where R is the longest run of leading zeros in the hashes

Page 35: Large-scale real-time analytics for everyone

We will be using M buckets

Page 36: Large-scale real-time analytics for everyone

where ɑ is a normalization constant
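
The formula from this slide did not survive in the transcript. The sketch below assumes the usual LogLog estimator from the literature, alpha * M * 2^(mean of R over the buckets), with the asymptotic constant alpha of about 0.39701 and R counted as the position of the first 1 bit (leading zeros plus one). The bucket count, the hash function and the names are illustrative assumptions.

```python
import hashlib

BUCKET_BITS = 10                  # assumption: M = 2**10 = 1024 buckets
M = 1 << BUCKET_BITS
VALUE_BITS = 32 - BUCKET_BITS     # bits left over after picking a bucket
ALPHA = 0.39701                   # asymptotic LogLog normalization constant

def hash32(item):
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def loglog_estimate(items):
    ranks = [0] * M
    for item in items:
        h = hash32(item)
        bucket = h >> VALUE_BITS                    # first bits choose the bucket
        rest = h & ((1 << VALUE_BITS) - 1)          # remaining bits are the "coin flips"
        rank = VALUE_BITS - rest.bit_length() + 1   # leading zeros + 1
        ranks[bucket] = max(ranks[bucket], rank)
    mean_rank = sum(ranks) / M                      # stochastic averaging
    return ALPHA * M * 2 ** mean_rank

n = 100_000
estimate = loglog_estimate(f"user-{i}" for i in range(n))
print(round(estimate), "estimated,", n, "actual")
```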

Page 37: Large-scale real-time analytics for everyone

LogLog

SuperLogLog

Page 38: Large-scale real-time analytics for everyone

LogLog

SuperLogLog

HyperLogLog

arithmetic mean -> harmonic mean

plus a couple of tweaks
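
A compact sketch of the harmonic-mean estimator, assuming the standard HyperLogLog formula alpha * M^2 / sum(2^-R) with the small-range (linear counting) correction. It is a simplification, not a production implementation: it uses a 64-bit hash so the large-range correction can be skipped, and the bucket count, the hash function and the names are assumptions.

```python
import hashlib
import math

P = 11                              # assumption: M = 2**11 = 2048 buckets
M = 1 << P
VALUE_BITS = 64 - P
ALPHA = 0.7213 / (1 + 1.079 / M)    # bias-correction constant, valid for M >= 128

def hash64(item):
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hll_registers(items):
    registers = [0] * M
    for item in items:
        h = hash64(item)
        bucket = h >> VALUE_BITS
        rest = h & ((1 << VALUE_BITS) - 1)
        rank = VALUE_BITS - rest.bit_length() + 1   # leading zeros + 1
        registers[bucket] = max(registers[bucket], rank)
    return registers

def hll_estimate(registers):
    # Harmonic mean of 2**R across buckets, scaled by ALPHA * M.
    raw = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)   # small-range correction
    return raw

n = 200_000
estimate = hll_estimate(hll_registers(f"user-{i}" for i in range(n)))
print(round(estimate), "estimated,", n, "actual")
```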

Page 39: Large-scale real-time analytics for everyone

Standard error is 1.04/sqrt(M),

where M is the number of buckets
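
To get a feel for the trade-off, the snippet below tabulates the standard error and the sketch size for a few bucket counts. The 6 bits per bucket is an assumption (the slides do not state the register width). Under that assumption, 2048 buckets give 1536 bytes and an error of about 2.3%, which is one configuration consistent with the "1.5 KB, 2% error" figure quoted earlier.

```python
import math

def standard_error(num_buckets):
    return 1.04 / math.sqrt(num_buckets)

def sketch_bytes(num_buckets, bits_per_bucket=6):
    # bits_per_bucket=6 is an assumed register width, not taken from the slides.
    return num_buckets * bits_per_bucket / 8

for m in (1024, 2048, 4096):
    print(m, "buckets:", f"{standard_error(m):.2%} error,", sketch_bytes(m), "bytes")
```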

Page 40: Large-scale real-time analytics for everyone

LogLog

SuperLogLog

HyperLogLog

HyperLogLog++

Google, 2013

32 bit -> 64 bit + fixes for low cardinality

bit.ly/HLLGoogle

Page 41: Large-scale real-time analytics for everyone

LogLog

SuperLogLog

HyperLogLog

HyperLogLog++

Discrete Max-Count

Facebook, 2014

bit.ly/DiscreteMaxCount

Page 42: Large-scale real-time analytics for everyone

Large scale?

Page 43: Large-scale real-time analytics for everyone

Suppose we have two HLL sketches: take the maximum value from each pair of corresponding buckets

Page 44: Large-scale real-time analytics for everyone

The resulting sketch has no loss in accuracy: it is exactly the sketch of the combined data!
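
A small sketch of why merging is lossless: the bucket-wise maximum of two sketches equals the sketch built directly from the combined stream. The register layout, the hash function and the sample data below are assumptions for illustration.

```python
import hashlib

P = 8                      # assumption: 2**8 = 256 buckets
M = 1 << P
VALUE_BITS = 32 - P

def registers_for(items):
    registers = [0] * M
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")
        bucket = h >> VALUE_BITS
        rest = h & ((1 << VALUE_BITS) - 1)
        rank = VALUE_BITS - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    return registers

def merge(left, right):
    # Bucket-wise maximum of two sketches with the same number of buckets.
    return [max(a, b) for a, b in zip(left, right)]

# Hypothetical data: two overlapping days of user ids.
monday = [f"user-{i}" for i in range(0, 6000)]
tuesday = [f"user-{i}" for i in range(4000, 10000)]

merged = merge(registers_for(monday), registers_for(tuesday))
print(merged == registers_for(monday + tuesday))
```

This is what makes the approach scale out: each machine can sketch its own shard, and only the small sketches need to be shipped and combined.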

Page 45: Large-scale real-time analytics for everyone

What do we want?

How many unique users belong to two segments?

Page 46: Large-scale real-time analytics for everyone

HLL intersection

Page 47: Large-scale real-time analytics for everyone

Inclusion-exclusion principle
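
For two segments the principle reads |A and B| = |A| + |B| - |A or B|, where the union comes from a merged sketch. In the sketch below exact set sizes stand in for HLL estimates, and the function name and data are illustrative assumptions.

```python
def intersection_size(card_a, card_b, card_union):
    # Inclusion-exclusion for two sets.
    return card_a + card_b - card_union

# Exact counts stand in for sketch estimates here.
segment_a = set(range(0, 6000))
segment_b = set(range(4000, 10000))

overlap = intersection_size(
    len(segment_a), len(segment_b), len(segment_a | segment_b)
)
print(overlap)  # 2000
```

With real sketches each of the three terms carries its own error, so a small intersection of two large segments can be estimated poorly.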

Page 49: Large-scale real-time analytics for everyone
Page 50: Large-scale real-time analytics for everyone

Python code: bit.ly/hloglog

Page 51: Large-scale real-time analytics for everyone

What do we want?

Get the churn rate

Page 52: Large-scale real-time analytics for everyone

Straightforward: feed new data to a new sketch

Page 53: Large-scale real-time analytics for everyone

Sliding-window HyperLogLog

Page 54: Large-scale real-time analytics for everyone

We maintain a list of tuples (timestamp, R), where R is a possible maximum over some future timeframe

Page 55: Large-scale real-time analytics for everyone

Values that no longer make sense are automatically discarded from the list

Page 56: Large-scale real-time analytics for everyone
Page 57: Large-scale real-time analytics for everyone

One list per bucket

Page 58: Large-scale real-time analytics for everyone

Take a maximum R over the given timeframe from the

past, then estimate as we do in a regular HLL
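
A minimal sketch of one such per-bucket list, assuming timestamps arrive in non-decreasing order and that queries never ask for more than the configured window. The class and method names are hypothetical.

```python
class SlidingBucket:
    """One bucket of a sliding-window sketch: a list of (timestamp, R) pairs."""

    def __init__(self, window):
        self.window = window    # longest timeframe we can be asked about
        self.entries = []       # timestamps increase, R strictly decreases

    def add(self, timestamp, r):
        # Keep only entries that are still inside the window and that could
        # still be the maximum for some future query.
        self.entries = [
            (t, old_r) for t, old_r in self.entries
            if t >= timestamp - self.window and old_r > r
        ]
        self.entries.append((timestamp, r))

    def max_r(self, now, timeframe):
        # Maximum R among entries no older than `timeframe`.
        candidates = [r for t, r in self.entries if t >= now - timeframe]
        return max(candidates, default=0)

bucket = SlidingBucket(window=10)
for timestamp, r in [(1, 3), (2, 5), (3, 2), (4, 1), (10, 2)]:
    bucket.add(timestamp, r)

print(bucket.entries)         # [(2, 5), (10, 2)]
print(bucket.max_r(10, 8))    # 5
print(bucket.max_r(10, 5))    # 2
```

An older entry with a smaller or equal R can be dropped safely: any timeframe that still contains it also contains the newer entry, which is at least as large.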

Page 59: Large-scale real-time analytics for everyone

Extra memory is required

Page 60: Large-scale real-time analytics for everyone

All the details: bit.ly/SlidingHLL

Page 61: Large-scale real-time analytics for everyone

hash, don’t sample

estimate, not precise

save memory

streaming

this slide is the sketch of the talk

Page 62: Large-scale real-time analytics for everyone
Page 63: Large-scale real-time analytics for everyone

Lots of sketches for various purposes:

percentiles,

heavy hitters,

similarity,

other stream statistics

Page 64: Large-scale real-time analytics for everyone

Have we seen this user before?

Page 65: Large-scale real-time analytics for everyone

Bloom filter

Page 66: Large-scale real-time analytics for everyone

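[Figure: an item i is passed through k hash functions h1, h2, ..., hk; each one selects a position in a bit array, and those bits are set to 1]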
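
A minimal Bloom filter sketch along those lines. Simulating the k hash functions by salting SHA-1 with the function index is an assumption made for brevity, as are the default sizes and the names.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Assumption: k hash functions are simulated by salting one hash.
        for k in range(self.num_hashes):
            digest = hashlib.sha1(f"{k}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not seen"; True means "probably seen".
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
for user in ("alice", "bob", "carol"):
    seen.add(user)

print(seen.might_contain("alice"))
```

The filter never gives a false negative, but it can give a false positive when other items happen to have set all of an unseen item's bits.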

Page 67: Large-scale real-time analytics for everyone

How many times did we see a user?

Page 68: Large-scale real-time analytics for everyone

Count-Min sketch is the answer:

bit.ly/CountMinSketch

Page 69: Large-scale real-time analytics for everyone

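[Figure: a table of counters with d rows and w columns; an item i is hashed by one hash function per row (h1, ..., hd), and the selected counter in each row is incremented by 1]

Estimate: take the minimum of the d values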
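
A minimal Count-Min sketch following that picture. The default width and depth, the salted SHA-1 hashing and the names are assumptions for illustration.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=1000, depth=4):
        self.width = width    # w columns
        self.depth = depth    # d rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        # Assumption: per-row hash functions are simulated by salting SHA-1.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(
            self.table[row][self._column(row, item)] for row in range(self.depth)
        )

visits = CountMinSketch()
for user in ["alice", "bob", "alice", "carol", "alice"]:
    visits.add(user)

print(visits.estimate("alice"))
```

The estimate can overcount but never undercount, which is why the minimum across rows is taken.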

Page 70: Large-scale real-time analytics for everyone

Percentiles

Page 71: Large-scale real-time analytics for everyone

Frugal sketching is not precise enough

Page 72: Large-scale real-time analytics for everyone

Sorting is a pain

Page 73: Large-scale real-time analytics for everyone

Distribute incoming values to buckets?

Page 74: Large-scale real-time analytics for everyone

Some sort of clustering, maybe

Page 75: Large-scale real-time analytics for everyone

T-Digest

Page 76: Large-scale real-time analytics for everyone
Page 77: Large-scale real-time analytics for everyone

Size is log(n), error is relative to q(1-q)

Page 79: Large-scale real-time analytics for everyone

This is a growing field of computer science:

stay tuned!

Page 80: Large-scale real-time analytics for everyone
Page 81: Large-scale real-time analytics for everyone

Thanks and

happy sketching!

Page 82: Large-scale real-time analytics for everyone

Reading list:

Neustar Research blog: bit.ly/NRsketches

Sketches overview: bit.ly/SketchesOverview

Lecture notes on streaming algorithms: bit.ly/streaming-lectures

Page 83: Large-scale real-time analytics for everyone

Bonus:

HyperLogLog in SQL: bit.ly/HLLinSQL