large-scale real-time analytics for everyone
TRANSCRIPT
Large-scale real-time analytics for everyone: fast, cheap and 98% correct
Pavel Kalaidin (@facultyofwonder)
We have a lot of data; memory is limited.
One pass would be great; constant update time.
Max, min, mean are trivial.
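All three fit the one-pass, constant-memory model; a minimal illustration (my example, not from the talk):

def running_stats(stream):
    # One pass, O(1) memory: update min, max and mean per element.
    n, mn, mx, mean = 0, float("inf"), float("-inf"), 0.0
    for val in stream:
        n += 1
        mn = min(mn, val)
        mx = max(mx, val)
        mean += (val - mean) / n  # incremental (Welford-style) mean
    return mn, mx, mean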
Median, anyone?
Sampling?
Probabilistic algorithms
An estimate is OK, but it's nice to know how the error is distributed.
def frugal(stream):
    # Streaming median: nudge the estimate toward each incoming value.
    m = 0
    for val in stream:
        if val > m:
            m += 1
        elif val < m:
            m -= 1
    return m
Memory used: 1 int!
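A quick sanity check (my example): feed it a uniform stream and compare with the exact median.

import random

stream = [random.randint(0, 100) for _ in range(100_000)]
print(frugal(stream))                    # estimate, drifts toward ~50
print(sorted(stream)[len(stream) // 2])  # exact median for comparison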
It really works
Percentiles?
Demo: bit.ly/frugalsketch
import numpy as np

def frugal_1u(stream, m=0, q=0.5):
    # Frugal-1U: step up with probability q and down with probability
    # 1 - q, so m converges toward the q-th quantile of the stream.
    for val in stream:
        r = np.random.random()
        if val > m and r > 1 - q:
            m += 1
        elif val < m and r > q:
            m -= 1
    return m
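Usage (my example): estimate the 90th percentile and compare with numpy's exact answer.

import numpy as np

stream = np.random.randint(0, 1000, size=500_000)
print(frugal_1u(stream, q=0.9))   # streaming estimate
print(np.percentile(stream, 90))  # exact value for comparison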
Streaming + probabilistic = sketch
What do we want? Get the number of unique users, aka the cardinality.
What do we want? Get the number of unique users, grouped by host, date, segment.
When do we want it? Well, right now.
Data: 10^10 elements with 10^9 unique int32 values = 40 GB
Straightforward approach: a hash table
Hash table: 4 GB
HyperLogLog: 1.5 KB, 2% error
It all starts with an algorithm called LogLog
Imagine I tell you I spent this morning flipping a coin,
and now I tell you the longest uninterrupted run of heads.
2 times, or 100 times?
In which case did I flip the coin for longer?
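The intuition: the longest run of heads grows roughly like log2 of the number of flips, so it leaks the length of the experiment. A quick simulation (my illustration):

import random

def longest_heads_run(n_flips):
    best = run = 0
    for _ in range(n_flips):
        run = run + 1 if random.random() < 0.5 else 0  # tails resets the run
        best = max(best, run)
    return best

for n in (100, 10_000, 1_000_000):
    print(n, longest_heads_run(n))  # grows roughly like log2(n)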
We are interested in patterns in hashes
(namely the longest runs of leading zeros = heads)
Hash, don’t sample!*
* need a good hash function
Expecting:
0xxxxxx hashes: ~50%
1xxxxxx hashes: ~50%
00xxxxx hashes: ~25%
Estimate: 2^R, where R is the longest run of leading zeros in the hashes.
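A minimal single-estimator version of that idea (my sketch; hash32 is a hypothetical helper built on md5, since the slides stress you need a good hash function):

import hashlib

def hash32(item):
    # A well-mixed 32-bit hash of the item.
    return int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")

def leading_zeros(h, bits=32):
    return bits - h.bit_length()  # zeros before the first 1-bit

def estimate_cardinality(stream):
    R = 0
    for item in stream:
        R = max(R, leading_zeros(hash32(item)))
    return 2 ** R  # rough estimate of the number of distinct items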
I can perform several flipping experiments
and average the number of zeros
This is called stochastic averaging
So far the estimate is 2^R, where R is the longest run of leading zeros in the hashes.
We will be using M buckets; the estimate becomes α · M · 2^((1/M) · Σ Rj),
where α is a normalization constant.
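A minimal bucketed LogLog along these lines (my sketch, reusing the hypothetical hash32 helper above; it uses the classic rank = leading zeros + 1 and the standard LogLog constant α ≈ 0.39701):

def loglog(stream, b=10):
    M = 1 << b            # M = 2^b buckets, one small int each
    buckets = [0] * M
    for item in stream:
        h = hash32(item)
        j = h >> (32 - b)                 # first b bits pick the bucket
        rest = h & ((1 << (32 - b)) - 1)  # remaining 32 - b bits
        rank = (32 - b) - rest.bit_length() + 1  # first 1-bit position
        buckets[j] = max(buckets[j], rank)
    alpha = 0.39701  # LogLog normalization constant (large-M limit)
    return alpha * M * 2 ** (sum(buckets) / M)  # stochastic averaging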
LogLog
SuperLogLog
HyperLogLog: arithmetic mean -> harmonic mean,
plus a couple of tweaks
Standard error is 1.04/sqrt(M), where M is the number of buckets
(e.g., M = 2048 gives ~2.3%, matching the 1.5 KB figure above).
HyperLogLog++ (Google, 2013)
32 bit -> 64 bit + fixes for low cardinality
bit.ly/HLLGoogle
Discrete Max-Count (Facebook, 2014)
bit.ly/DiscreteMaxCount
Large scale?
Suppose we have two HLL sketches; let's take the maximum value from the
corresponding buckets.
The resulting sketch has no loss in accuracy!
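Merging is just a bucket-wise max (my sketch, representing a sketch by its list of bucket values):

def merge(sketch_a, sketch_b):
    # Bucket-wise maximum: the result is exactly the sketch you would
    # get by feeding both streams into one HLL, hence lossless.
    return [max(a, b) for a, b in zip(sketch_a, sketch_b)]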
What do we want? How many unique users belong to both segments?
HLL intersection
Inclusion-exclusion principle: |A ∩ B| = |A| + |B| - |A ∪ B|
credits: http://research.neustar.biz/2012/12/17/hll-intersections-2/
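Since a merged sketch estimates the union, inclusion-exclusion gives the intersection (my sketch; estimate stands in for whatever HLL estimator you use):

def intersect_estimate(sketch_a, sketch_b, estimate):
    # |A ∩ B| = |A| + |B| - |A ∪ B|, with the union from merge() above.
    union = merge(sketch_a, sketch_b)
    return estimate(sketch_a) + estimate(sketch_b) - estimate(union)

Note that the relative error grows when the intersection is small compared to the union (see the Neustar post above).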
What do we want? Get the churn rate.
Straightforward: feed new data to a new sketch.
Sliding-window HyperLogLog
We maintain a list of tuples (timestamp, R), where R is a possible
maximum over some future window.
Values that no longer make sense are automatically discarded from the list.
One list per bucket
Take the maximum R over the given timeframe in the past, then estimate
as we do in a regular HLL.
Extra memory is required
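One way to maintain such a per-bucket list (my sketch of the idea above; an entry survives only while it could still be the maximum of some future window):

def add(lfpm, timestamp, R):
    # The new pair is newer and at least as large as any entry with
    # R' <= R, so those can never be a future maximum again: drop them.
    lfpm[:] = [(t, r) for t, r in lfpm if r > R]
    lfpm.append((timestamp, R))

def max_in_window(lfpm, now, window):
    # Largest R observed during the last `window` time units.
    return max((r for t, r in lfpm if t >= now - window), default=0)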
hash, don’t sampleestimate, not precise
save memorystreamingthis slide is the sketch of the talk
Lots of sketches for various purposes:
percentiles, heavy hitters,
similarity, other stream statistics
Have we seen this user before?
Bloom filter
[Diagram: element i is hashed by h1, h2, ..., hk; each hash sets one bit in a bit array]
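A minimal Bloom filter along the lines of that diagram (my sketch; the k hash functions are derived by salting one md5 digest):

import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, k=4):
        self.n_bits, self.k = n_bits, k
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # k hash functions h1..hk via k different salts.
        for seed in range(self.k):
            h = hashlib.md5(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))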
How many times did we see a user?
Count-Min sketch is the answer:
bit.ly/CountMinSketch
[Diagram: a d x w grid of counters; element i is hashed by h1 ... hd, one hash per row, and each selected counter gets +1]
Estimate: take the minimum of the d values
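A minimal Count-Min sketch matching the diagram (my sketch; d rows of w counters, one salted hash per row):

import hashlib

class CountMin:
    def __init__(self, w=2048, d=5):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _col(self, row, item):
        h = hashlib.md5(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def add(self, item):
        for row in range(self.d):
            self.table[row][self._col(row, item)] += 1  # +1 per row

    def count(self, item):
        # Collisions only inflate counters, so the minimum over the
        # d rows is the tightest (over-)estimate available.
        return min(self.table[row][self._col(row, item)]
                   for row in range(self.d))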
Percentiles
Frugal sketching is not precise enough.
Sorting is a pain.
Distribute incoming values to buckets?
Some sort of clustering, maybe
T-Digest
Size is log(n); error is relative to q(1-q)
Code: bit.ly/T-Digest-Java
bit.ly/T-Digest-Python
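For a quick taste (assuming the second link points to the tdigest package on PyPI):

from tdigest import TDigest  # pip install tdigest
import numpy as np

digest = TDigest()
digest.batch_update(np.random.normal(size=100_000))
print(digest.percentile(50))  # ~0.0, the streaming median estimate
print(digest.percentile(99))  # tail quantiles are where T-Digest shines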
This is a growing field of computer science:
stay tuned!
Thanks, and happy sketching!
Reading list:
Neustar Research blog: bit.ly/NRsketches
Sketches overview: bit.ly/SketchesOverview
Lecture notes on streaming algorithms: bit.ly/streaming-lectures