large-scale real-time analytics for everyone
TRANSCRIPT
Large-scale real-time analytics for everyone: fast, cheap and 98% correct
Pavel Kalaidin (@facultyofwonder)
We have a lot of data; memory is limited.
One pass would be great; constant update time.
Max, min, mean are trivial.
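All three fit the one-pass, constant-memory model; a minimal illustration (my example, not from the talk):

def running_stats(stream):
    # One pass, O(1) memory: update min, max and mean per element.
    n, mn, mx, mean = 0, float("inf"), float("-inf"), 0.0
    for val in stream:
        n += 1
        mn = min(mn, val)
        mx = max(mx, val)
        mean += (val - mean) / n  # incremental (Welford-style) mean
    return mn, mx, mean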
Median, anyone?
Sampling?
Probabilistic algorithms
An estimate is OK, but it's nice to know how the error is distributed.
def frugal(stream):
    # Streaming median: nudge the estimate toward each incoming value.
    m = 0
    for val in stream:
        if val > m:
            m += 1
        elif val < m:
            m -= 1
    return m
Memory used: 1 int!
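A quick sanity check (my example): feed it a uniform stream and compare with the exact median.

import random

stream = [random.randint(0, 100) for _ in range(100_000)]
print(frugal(stream))                    # estimate, drifts toward ~50
print(sorted(stream)[len(stream) // 2])  # exact median for comparison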
It really works
Percentiles?
Demo: bit.ly/frugalsketch
import numpy as np

def frugal_1u(stream, m=0, q=0.5):
    # Frugal-1U: step up with probability q and down with probability
    # 1 - q, so m converges toward the q-th quantile of the stream.
    for val in stream:
        r = np.random.random()
        if val > m and r > 1 - q:
            m += 1
        elif val < m and r > q:
            m -= 1
    return m
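Usage (my example): estimate the 90th percentile and compare with numpy's exact answer.

import numpy as np

stream = np.random.randint(0, 1000, size=500_000)
print(frugal_1u(stream, q=0.9))   # streaming estimate
print(np.percentile(stream, 90))  # exact value for comparison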
Streaming + probabilistic = sketch
What do we want? Get the number of unique users, aka the cardinality.
What do we want? Get the number of unique users, grouped by host, date, segment.
When do we want it? Well, right now.
Data: 10^10 elements with 10^9 unique int32 values = 40 GB
Straightforward approach: a hash table
Hash table: 4 GB
HyperLogLog: 1.5 KB, 2% error
It all starts with an algorithm called LogLog
Imagine I tell you I spent this morning flipping a coin,
and now I tell you the longest uninterrupted run of heads.
2 times, or 100 times?
In which case did I flip the coin for longer?
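The intuition: the longest run of heads grows roughly like log2 of the number of flips, so it leaks the length of the experiment. A quick simulation (my illustration):

import random

def longest_heads_run(n_flips):
    best = run = 0
    for _ in range(n_flips):
        run = run + 1 if random.random() < 0.5 else 0  # tails resets the run
        best = max(best, run)
    return best

for n in (100, 10_000, 1_000_000):
    print(n, longest_heads_run(n))  # grows roughly like log2(n)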
We are interested in patterns in hashes
(namely the longest runs of leading zeros = heads)
Hash, don’t sample!*
* need a good hash function
Expecting:
0xxxxxx hashes: ~50%
1xxxxxx hashes: ~50%
00xxxxx hashes: ~25%
Estimate: 2^R, where R is the longest run of leading zeros in the hashes.
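A minimal single-estimator version of that idea (my sketch; hash32 is a hypothetical helper built on md5, since the slides stress you need a good hash function):

import hashlib

def hash32(item):
    # A well-mixed 32-bit hash of the item.
    return int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")

def leading_zeros(h, bits=32):
    return bits - h.bit_length()  # zeros before the first 1-bit

def estimate_cardinality(stream):
    R = 0
    for item in stream:
        R = max(R, leading_zeros(hash32(item)))
    return 2 ** R  # rough estimate of the number of distinct items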
I can perform several flipping experiments
and average the number of zeros
This is called stochastic averaging
So far the estimate is 2^R, where R is the longest run of leading zeros in the hashes.
We will be using M buckets; the estimate becomes α · M · 2^((1/M) · Σ Rj),
where α is a normalization constant.
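A minimal bucketed LogLog along these lines (my sketch, reusing the hypothetical hash32 helper above; it uses the classic rank = leading zeros + 1 and the standard LogLog constant α ≈ 0.39701):

def loglog(stream, b=10):
    M = 1 << b            # M = 2^b buckets, one small int each
    buckets = [0] * M
    for item in stream:
        h = hash32(item)
        j = h >> (32 - b)                 # first b bits pick the bucket
        rest = h & ((1 << (32 - b)) - 1)  # remaining 32 - b bits
        rank = (32 - b) - rest.bit_length() + 1  # first 1-bit position
        buckets[j] = max(buckets[j], rank)
    alpha = 0.39701  # LogLog normalization constant (large-M limit)
    return alpha * M * 2 ** (sum(buckets) / M)  # stochastic averaging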
LogLog
SuperLogLog
HyperLogLog: arithmetic mean -> harmonic mean,
plus a couple of tweaks
Standard error is 1.04/sqrt(M), where M is the number of buckets
(e.g., M = 2048 gives ~2.3%, matching the 1.5 KB figure above).
HyperLogLog++ (Google, 2013)
32 bit -> 64 bit + fixes for low cardinality
bit.ly/HLLGoogle
Discrete Max-Count (Facebook, 2014)
bit.ly/DiscreteMaxCount
Large scale?
Suppose we have two HLL sketches; let's take the maximum value from the
corresponding buckets.
The resulting sketch has no loss in accuracy!
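Merging is just a bucket-wise max (my sketch, representing a sketch by its list of bucket values):

def merge(sketch_a, sketch_b):
    # Bucket-wise maximum: the result is exactly the sketch you would
    # get by feeding both streams into one HLL, hence lossless.
    return [max(a, b) for a, b in zip(sketch_a, sketch_b)]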
What do we want? How many unique users belong to both segments?
HLL intersection
Inclusion-exclusion principle: |A ∩ B| = |A| + |B| - |A ∪ B|
credits: http://research.neustar.biz/2012/12/17/hll-intersections-2/
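Since a merged sketch estimates the union, inclusion-exclusion gives the intersection (my sketch; estimate stands in for whatever HLL estimator you use):

def intersect_estimate(sketch_a, sketch_b, estimate):
    # |A ∩ B| = |A| + |B| - |A ∪ B|, with the union from merge() above.
    union = merge(sketch_a, sketch_b)
    return estimate(sketch_a) + estimate(sketch_b) - estimate(union)

Note that the relative error grows when the intersection is small compared to the union (see the Neustar post above).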
What do we want? Get the churn rate.
Straightforward: feed new data to a new sketch.
Sliding-window HyperLogLog
We maintain a list of tuples (timestamp, R), where R is a possible
maximum over some future window.
Values that no longer make sense are automatically discarded from the list.
One list per bucket
Take the maximum R over the given timeframe in the past, then estimate
as we do in a regular HLL.
Extra memory is required
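One way to maintain such a per-bucket list (my sketch of the idea above; an entry survives only while it could still be the maximum of some future window):

def add(lfpm, timestamp, R):
    # The new pair is newer and at least as large as any entry with
    # R' <= R, so those can never be a future maximum again: drop them.
    lfpm[:] = [(t, r) for t, r in lfpm if r > R]
    lfpm.append((timestamp, R))

def max_in_window(lfpm, now, window):
    # Largest R observed during the last `window` time units.
    return max((r for t, r in lfpm if t >= now - window), default=0)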
hash, don’t sampleestimate, not precise
save memorystreamingthis slide is the sketch of the talk
Lots of sketches for various purposes:
percentiles, heavy hitters,
similarity, other stream statistics
Have we seen this user before?
Bloom filter
[Diagram: element i is hashed by h1, h2, ..., hk; each hash sets one bit in a bit array]
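A minimal Bloom filter along the lines of that diagram (my sketch; the k hash functions are derived by salting one md5 digest):

import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, k=4):
        self.n_bits, self.k = n_bits, k
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # k hash functions h1..hk via k different salts.
        for seed in range(self.k):
            h = hashlib.md5(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))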
How many times did we see a user?
Count-Min sketch is the answer:
bit.ly/CountMinSketch
[Diagram: a d x w grid of counters; element i is hashed by h1 ... hd, one hash per row, and each selected counter gets +1]
Estimate: take the minimum of the d values
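A minimal Count-Min sketch matching the diagram (my sketch; d rows of w counters, one salted hash per row):

import hashlib

class CountMin:
    def __init__(self, w=2048, d=5):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _col(self, row, item):
        h = hashlib.md5(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def add(self, item):
        for row in range(self.d):
            self.table[row][self._col(row, item)] += 1  # +1 per row

    def count(self, item):
        # Collisions only inflate counters, so the minimum over the
        # d rows is the tightest (over-)estimate available.
        return min(self.table[row][self._col(row, item)]
                   for row in range(self.d))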
Percentiles
Frugal sketching is not precise enough.
Sorting is a pain.
Distribute incoming values to buckets?
Some sort of clustering, maybe
T-Digest
Size is log(n); error is relative to q(1-q)
Code: bit.ly/T-Digest-Java
bit.ly/T-Digest-Python
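For a quick taste (assuming the second link points to the tdigest package on PyPI):

from tdigest import TDigest  # pip install tdigest
import numpy as np

digest = TDigest()
digest.batch_update(np.random.normal(size=100_000))
print(digest.percentile(50))  # ~0.0, the streaming median estimate
print(digest.percentile(99))  # tail quantiles are where T-Digest shines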
This is a growing field of computer science:
stay tuned!
Thanks, and happy sketching!
Reading list:
Neustar Research blog: bit.ly/NRsketches
Sketches overview: bit.ly/SketchesOverview
Lecture notes on streaming algorithms: bit.ly/streaming-lectures