large-scale real-time analytics for everyone

Post on 16-Jul-2015


Category:

Data & Analytics


TRANSCRIPT

Large-scale real-time analytics for everyone:

fast, cheap and 98% correct

we have a lot of data

memory is limited

one pass would be great

constant update time

max, min, mean is trivial

median, anyone?

Sampling?

Probabilistic algorithms

Estimate is OK, but nice to know how the error is distributed

def frugal(stream):
    m = 0
    for val in stream:
        if val > m:
            m += 1
        elif val < m:
            m -= 1
    return m

Memory used - 1 int!


It really works
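A quick way to check the claim, as a sketch: run the same update rule over a synthetic stream and compare the result with the exact median. The stream here (uniform integers in 0..1000, fixed seed) is an assumption made up for illustration, not data from the talk.

```python
import random
import statistics

def frugal(stream):
    """One-int median estimate: move one step towards every incoming value."""
    m = 0
    for val in stream:
        if val > m:
            m += 1
        elif val < m:
            m -= 1
    return m

rng = random.Random(42)                       # fixed seed, illustrative only
data = [rng.randint(0, 1000) for _ in range(100_000)]

estimate = frugal(data)                       # uses a single integer of state
exact = statistics.median(data)               # needs the whole data set
print(estimate, exact)
```

The estimate keeps wandering around the true median rather than converging exactly, which is the price of keeping only one integer.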

Percentiles?

Demo: bit.ly/frugalsketch

import numpy as np

def frugal_1u(stream, m=0, q=0.5):
    # q is the target quantile: 0.5 is the median, 0.9 the 90th percentile
    for val in stream:
        r = np.random.random()
        if val > m and r > 1 - q:
            m += 1
        elif val < m and r > q:
            m -= 1
    return m
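To see the percentile version in action, here is a minimal sketch that estimates a 90th percentile. It swaps `np.random` for the standard-library `random` module so it runs without NumPy; the extra `rng` parameter and the synthetic stream are assumptions added for reproducibility.

```python
import random

def frugal_1u(stream, m=0, q=0.5, rng=random):
    """Frugal-1U update rule; `rng` is a hypothetical parameter used for seeding."""
    for val in stream:
        r = rng.random()
        if val > m and r > 1 - q:      # step up with probability q
            m += 1
        elif val < m and r > q:        # step down with probability 1 - q
            m -= 1
    return m

data_rng = random.Random(1)            # fixed seeds, illustrative only
data = [data_rng.randint(0, 1000) for _ in range(200_000)]

p90 = frugal_1u(data, q=0.9, rng=random.Random(2))
exact = sorted(data)[int(0.9 * len(data))]
print(p90, exact)
```

The walk is balanced when the fraction of values below `m` equals `q`, which is why the same one-integer state tracks any quantile.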

Streaming + probabilistic = sketch

What do we want?

Get the number of unique users

aka cardinality

What do we want?

Get the number of unique users grouped by host, date, segment

When do we want it?

Well, right now

Data: 10^10 elements, 10^9 unique, int32, 40 GB

Straightforward approach: hash table

Hash table: 4 GB

HyperLogLog: 1.5 KB, 2% error

It all starts with an algorithm called LogLog

Imagine I tell you I spent this morning flipping a coin

and then tell you the longest uninterrupted run of heads:

2 times or 100 times

In which case did I flip the coin for longer?

We are interested in patterns in hashes

(namely the longest runs of leading zeros = heads)

Hash, don’t sample!*

* need a good hash function

Expecting:

0xxxxxx hashes - ~50%

1xxxxxx hashes - ~50%

00xxxxx hashes - ~25%

Estimate: 2^R, where R is the longest run of leading zeros in the hashes
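The single-counter idea can be sketched in a few lines. The hash function (the first 32 bits of MD5) and the helper names are assumptions chosen for illustration; any well-mixed hash would do.

```python
import hashlib

def hash32(item):
    """32-bit hash taken from MD5 (an illustrative choice of hash function)."""
    digest = hashlib.md5(str(item).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def leading_zeros(x, bits=32):
    """Number of leading zero bits when x is written with `bits` binary digits."""
    return bits - x.bit_length()

def crude_estimate(items):
    """2**R, where R is the longest run of leading zeros seen in any hash."""
    R = max(leading_zeros(hash32(it)) for it in items)
    return 2 ** R

print(crude_estimate(range(10_000)))
```

Because only the maximum is kept, feeding the same items again does not change the result. The estimate is always a power of two and very noisy, which is what the averaging over several experiments is meant to fix.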

I can perform several flipping experiments

and average the number of zeros

This is called stochastic averaging

So far the estimate is 2^R, where R is the longest run of leading zeros in the hashes

We will be using M buckets:

estimate = ɑ * M * 2^(average of R over the M buckets),

where ɑ is a normalization constant
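A minimal sketch of the bucketed estimate, in the form given in the LogLog paper: the first bits of each hash select a bucket, each bucket keeps its own maximum, and the estimate is ɑ * M * 2^(mean of the bucket values). Two assumptions to flag: the bucket value here is the position of the first 1-bit (leading zeros plus one), which is the convention the asymptotic constant ɑ ≈ 0.39701 goes with, and MD5 is just an illustrative hash.

```python
import hashlib

def loglog_estimate(items, b=10):
    """LogLog-style estimate with M = 2**b buckets (stochastic averaging)."""
    m = 1 << b
    width = 64 - b                              # bits left after the bucket index
    regs = [0] * m
    for it in items:
        h = int.from_bytes(hashlib.md5(str(it).encode()).digest()[:8], "big")
        j = h >> width                          # first b bits pick the bucket
        rest = h & ((1 << width) - 1)           # remaining bits give the run
        rho = width - rest.bit_length() + 1     # leading zeros + 1
        regs[j] = max(regs[j], rho)
    alpha = 0.39701                             # asymptotic LogLog constant
    return alpha * m * 2 ** (sum(regs) / m)

print(round(loglog_estimate(range(100_000))))
```

One pass over the data fills all M buckets at once, so this costs no more than the single-counter version.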

LogLog

SuperLogLog

HyperLogLog: arithmetic mean -> harmonic mean

plus a couple of tweaks

Standard error is 1.04/sqrt(M),

where M is the number of buckets
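As a rough sketch of the harmonic-mean variant: the constant `0.7213 / (1 + 1.079 / m)` and the small-range correction follow the published HyperLogLog description, while the MD5 hash, the function names and the bucket counts are assumptions made for this example.

```python
import hashlib
import math

def hll_registers(items, b=10):
    """Fill M = 2**b registers with the max (leading zeros + 1) per bucket."""
    m = 1 << b
    width = 64 - b
    regs = [0] * m
    for it in items:
        h = int.from_bytes(hashlib.md5(str(it).encode()).digest()[:8], "big")
        j = h >> width
        rest = h & ((1 << width) - 1)
        regs[j] = max(regs[j], width - rest.bit_length() + 1)
    return regs

def hll_estimate(regs):
    """Harmonic-mean estimate with the small-range (linear counting) fix."""
    m = len(regs)
    alpha = 0.7213 / (1 + 1.079 / m)            # valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw

est = hll_estimate(hll_registers(range(100_000)))
std_error = 1.04 / math.sqrt(2048)              # about 2.3% for 2048 buckets
size_bytes = 2048 * 6 // 8                      # 6-bit registers: 1536 bytes
print(round(est), round(std_error, 4), size_bytes)
```

2048 six-bit registers is one configuration consistent with the "1.5 KB, 2% error" figure quoted earlier; whether the talk used exactly this setup is an assumption.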

HyperLogLog++ (Google, 2013)

32 bit -> 64 bit + fixes for low cardinality

bit.ly/HLLGoogle

Discrete Max-Count (Facebook, 2014)

bit.ly/DiscreteMaxCount

Large scale?

Suppose we have two HLL sketches: take the maximum value from each pair of corresponding buckets

Resulting sketch has no loss in accuracy!
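The merge can be checked directly. In this sketch (register layout, MD5 hash and function names are illustrative assumptions) the bucket-wise maximum of two sketches is compared with a sketch built from the combined data.

```python
import hashlib

B = 10                      # 2**10 buckets, an illustrative choice
M = 1 << B
WIDTH = 64 - B

def registers(items):
    """Build HLL registers: per bucket, the max of (leading zeros + 1)."""
    regs = [0] * M
    for it in items:
        h = int.from_bytes(hashlib.md5(str(it).encode()).digest()[:8], "big")
        j = h >> WIDTH
        rest = h & ((1 << WIDTH) - 1)
        regs[j] = max(regs[j], WIDTH - rest.bit_length() + 1)
    return regs

def merge(a, b):
    """Union of two sketches: element-wise maximum of the registers."""
    return [max(x, y) for x, y in zip(a, b)]

node_1 = registers(range(0, 6_000))          # data seen on one machine
node_2 = registers(range(4_000, 10_000))     # overlapping data on another
merged = merge(node_1, node_2)
print(merged == registers(range(10_000)))
```

The merged registers are identical to those of a single sketch fed with all the data, so sketches can be built independently per machine, per hour or per segment and combined later.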

What do we want?

How many unique users belong to two segments?

HLL intersection

Inclusion-exclusion principle
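Inclusion-exclusion gives |A ∩ B| = |A| + |B| - |A ∪ B|, and the union comes from merging the two sketches. A minimal sketch, assuming an MD5-based HLL with 2**13 buckets and two made-up segments:

```python
import hashlib
import math

B = 13                      # 2**13 buckets, an illustrative choice
M = 1 << B
WIDTH = 64 - B

def sketch(items):
    regs = [0] * M
    for it in items:
        h = int.from_bytes(hashlib.md5(str(it).encode()).digest()[:8], "big")
        j = h >> WIDTH
        rest = h & ((1 << WIDTH) - 1)
        regs[j] = max(regs[j], WIDTH - rest.bit_length() + 1)
    return regs

def estimate(regs):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:                 # small-range correction
        return M * math.log(M / zeros)
    return raw

def union(a, b):
    return [max(x, y) for x, y in zip(a, b)]

seg_a = sketch(range(0, 60_000))                 # hypothetical segment A
seg_b = sketch(range(40_000, 100_000))           # hypothetical segment B
both = estimate(seg_a) + estimate(seg_b) - estimate(union(seg_a, seg_b))
print(round(both))                               # true overlap is 20,000
```

The errors of the three estimates add up, so the relative error of the intersection can be large when the overlap is small compared with the segments themselves.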

Python code: bit.ly/hloglog

What do we want?

Get the churn rate

Straightforward: feed new data to a new sketch

Sliding-window HyperLogLog

We maintain a list of tuples (timestamp, R), where R is a

possible maximum over future time

Values that no longer make sense are

automatically discarded from the list

One list per bucket

Take a maximum R over the given timeframe from the

past, then estimate as we do in a regular HLL
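The per-bucket list can be sketched as follows. The class name `BucketWindow` and its methods are hypothetical, and timestamps are assumed to arrive in non-decreasing order. An entry is dropped when it falls out of the window or when a newer entry has an R at least as large, because it can then never be the maximum of any future query.

```python
class BucketWindow:
    """(timestamp, R) list for one HLL bucket; illustrative sketch only."""

    def __init__(self, window):
        self.window = window      # longest time span we will ever query
        self.entries = []         # timestamps increase, R strictly decreases

    def add(self, timestamp, r):
        # Keep only entries still inside the window and not dominated by r.
        self.entries = [
            (t, x) for t, x in self.entries
            if t > timestamp - self.window and x > r
        ]
        self.entries.append((timestamp, r))

    def max_r(self, now, span):
        """Largest R observed in the last `span` time units (0 if none)."""
        return max((x for t, x in self.entries if t > now - span), default=0)

bucket = BucketWindow(window=10)
for t, r in [(1, 5), (2, 3), (3, 4)]:
    bucket.add(t, r)

print(bucket.entries)                 # (2, 3) was dominated by (3, 4)
print(bucket.max_r(now=3, span=10))   # whole window
print(bucket.max_r(now=3, span=2))    # only the most recent values
```

With one such list per bucket, the values returned by `max_r` form the register array that goes into the usual HLL estimate.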

Extra memory is required

All the details: bit.ly/SlidingHLL

hash, don’t sample

estimate, not precise

save memory

streaming

this slide is the sketch of the talk

Lots of sketches for various purposes:

percentiles, heavy hitters, similarity, other stream statistics

Have we seen this user before?

Bloom filter

[Figure: item i is hashed by h1, h2, ..., hk; each hash sets one bit in a bit array]
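A minimal Bloom filter sketch: k hash functions each pick one position in a bit array; adding an item sets those bits, and a lookup checks that all of them are set. The class name, the default sizes and the salted SHA-256 hashing are assumptions for illustration.

```python
import hashlib

class BloomFilter:
    """Set-membership sketch: no false negatives, some false positives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)     # one byte per bit, for clarity

    def _positions(self, item):
        # k hash functions simulated by salting one hash with the index i.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
for user in ["alice", "bob"]:
    seen.add(user)

print("alice" in seen)    # an added item is always reported as seen
print("carol" in seen)    # almost certainly not seen, but may be a false positive
```

A "no" answer is always correct; a "yes" answer may be wrong with a probability that depends on the array size, the number of hashes and the number of items added.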

How many times did we see a user?

Count-Min sketch is the answer:

bit.ly/CountMinSketch

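[Figure: a table of d rows and w columns; item i is hashed by h1, ..., hd, one hash per row, and each selected counter gets +1]

Estimate: take the minimum of the d values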
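The counting scheme can be sketched like this: a d-by-w table of counters, one hash function per row, +1 in each row on every occurrence, and the minimum across rows as the answer. The class name, the table size and the salted SHA-256 hashing are illustrative assumptions.

```python
import hashlib

class CountMinSketch:
    """Frequency sketch: estimates never undercount, may overcount."""

    def __init__(self, width=256, depth=4):
        self.width = width                               # w columns
        self.depth = depth                               # d rows
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One hash per row, simulated by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._columns(item))

visits = CountMinSketch()
for user in ["alice", "bob", "alice", "alice"]:
    visits.add(user)

print(visits.estimate("alice"), visits.estimate("bob"))
```

Collisions can only add to a counter, so every row overestimates or is exact, and taking the minimum picks the row that suffered least.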

Percentiles

Frugal sketching is not precise enough

Sorting is a pain

Distribute incoming values to buckets?

Some sort of clustering, maybe

T-Digest

Size is log(n), error is relative to q(1-q)

This is a growing field of computer science:

stay tuned!

Thanks and happy sketching!

Reading list:

Neustar Research blog: bit.ly/NRsketches

Sketches overview: bit.ly/SketchesOverview

Lecture notes on streaming algorithms: bit.ly/streaming-lectures

Bonus:

HyperLogLog in SQL: bit.ly/HLLinSQL
