grokking techtalk #11 - an introduction to probabilistic data-structures (and algorithms)


An introduction to probabilistic data-structures (and algorithms)

== Grokking Engineering, April 2016 ==

Võ Việt Hùng – vvhung@gmail.com


Who am I?

● A technical guy who has been working in IT for 15+ years

✔ In many roles: developer, sys-admin, dba, big data analyst

✔ Large systems: billions of requests per month

● Current: one of the biggest ad networks in Vietnam

● Past

✔ VNG: Zing Ads, Zing Me, ...

✔ Vietnamworks/Navigos Group...


Agenda (1)

● A real-world problem

● Probabilistic data-structures (PDS), what?

● PDS, why?

● Some characteristics of PDS

● Some common PDS

● Membership Query – BloomFilter

● Cardinality Estimation – HyperLogLog

● Frequency Estimation – Count-Min Sketch

● Percentile and Quantile Estimation – t-digest


Agenda (2)

● Some case studies

● What else is in the jungle?

● References

● Q&A


A real-world problem (1)

When processing data sets, we often want to do some simple checks (queries) like:

● Does the data set contain a particular element (membership query)?

● How many distinct elements are in the data set (i.e. what is the cardinality of the data set)?

● What are the most frequent elements (i.e. top-k elements)?

● What are the frequencies of the most frequent elements?

● What are the mean/median values of some quantity in the data set?


A real-world problem (2)

The common approach is to use some kind of deterministic data structure, like a HashSet or a Hashtable, for such purposes.

Another approach is to load the data into a database and then run SQL queries.

But as the data grows and the demand for fast responses rises, these approaches run into problems: memory and CPU limitations, slow queries.


Probabilistic data-structures (PDS)

● PDS are a group of data structures that are extremely useful for big data and streaming/realtime applications.

● These data structures use hash functions to randomize and compactly represent a set of items.

● Collisions are ignored, but the resulting errors can be kept below a controlled threshold.


Why?

to deal with

● the need for fast responses

● (very) large data that cannot fit in memory

● data that can (only) be processed in one pass

● incremental updates of results

● no need for 100% correctness: a controllable approximation is enough


PDS characteristics

(as compared with error-free approaches)

● trade accuracy for space and performance

● use less memory

● have constant (and short) query time

● (usually) support union and intersection operations

● can be merged => map-reduce friendly

● can be parallelized and distributed


Some common PDS

● Membership Query

✔ Bloom Filter (BF)

✔ Bloom Filter extensions: counting-BF, scalable-BF, stable-BF, layered-BF, inverse-BF

✔ Cuckoo hashing

● Cardinality Estimation – HyperLogLog (HLL), KMV, LC

● Frequency Estimation – Count-Min Sketch (CMS)

● Percentile and Quantile Estimation – t-digest

● Skip-list

● ….


Membership Query – Bloom Filter

● conceived by Burton Howard Bloom in 1970

● is used to test whether an element is a member of a set

● False-positive matches are possible, but false negatives are not. In other words, a query returns either "possibly in set" or "definitely not in set"

● Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter).

● The more elements that are added to the set, the larger the probability of false positives.


BloomFilter – algorithm behind

● effectively a hash table where collisions are ignored and each element added to the table is hashed by some number k of hash functions.

● There is one major difference: a bloom filter does NOT store the hashed keys.

● Instead, it has a bit array as its underlying data structure; each key is remembered by flipping on all of the bits the k hash functions map it to.


BloomFilter – Simple implementation
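The code on this slide isn't preserved in the transcript. Below is a minimal Python sketch of the structure just described; the class name and the trick of salting one base hash to get k hash functions are illustrative choices, not the speaker's code.

import hashlib

class BloomFilter:
    """A minimal Bloom filter: an m-bit array plus k hash functions."""

    def __init__(self, m, k):
        self.m = m                      # number of bits
        self.k = k                      # number of hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Derive k hash values by salting one base hash with the index i.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # "Possibly in set" only if all k bits are set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))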


BloomFilter – Properties

● Unlike a standard hash table, a BF of a fixed size can represent a set with an arbitrarily large number of elements

● adding an element never fails due to the data structure "filling up"

● Union and intersection of BFs with the same size and set of hash functions can be implemented with bitwise OR (union) and AND (intersection)

● The union operation on BFs is lossless in the sense that the resulting BF is the same as the BF created from scratch using the union of the two sets.

● The intersect operation satisfies a weaker property: the false-positive probability in the resulting BF is at most the false-positive probability in one of the constituent BFs, but may be larger than the false-positive probability in the BF created from scratch using the intersection of the two sets.
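To make the last two properties concrete, here is a small sketch on top of the BloomFilter class from the earlier implementation sketch; it assumes both filters share the same m, k, and hash functions (these helper names are mine, not the speaker's):

def bf_union(a, b):
    # Lossless: identical to the filter built from the union of the two sets.
    out = BloomFilter(a.m, a.k)
    out.bits = bytearray(x | y for x, y in zip(a.bits, b.bits))
    return out

def bf_intersection(a, b):
    # May have a higher false-positive rate than a filter built from scratch.
    out = BloomFilter(a.m, a.k)
    out.bits = bytearray(x & y for x, y in zip(a.bits, b.bits))
    return out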


BloomFilter – simple usage
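This slide's example is also missing from the transcript; a plausible usage sketch with the BloomFilter class from the implementation slide (sizes and test values are illustrative):

bf = BloomFilter(m=10_000, k=7)
for user_id in ("alice", "bob", "carol"):
    bf.add(user_id)

print("bob" in bf)      # True: "possibly in set"
print("mallory" in bf)  # False: "definitely not in set" (barring a false positive)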


BloomFilter – Math behind
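The formulas from this slide aren't in the transcript; the standard Bloom filter math, for m bits, k hash functions, and n inserted elements, is:

% False-positive probability:
p \approx \left(1 - e^{-kn/m}\right)^{k}

% Optimal number of hash functions for given m and n:
k_{\mathrm{opt}} = \frac{m}{n}\ln 2

% Bits required for a target false-positive rate p (at optimal k):
m = -\frac{n \ln p}{(\ln 2)^2}

The rules of thumb on the next slide follow from the last formula: bits per element m/n ≈ 1.44 · log2(1/p).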


BloomFilter – rules-of-thumb

Formulas and rules of thumb

(http://corte.si/posts/code/bloom-filter-rules-of-thumb/)

fp rate      bits per element
50%          1.44
10%          4.79
2%           8.14
1%           9.58
0.1%         14.38
0.01%        19.17


BloomFilter – size over probability


BloomFilter extension – Counting

● Counting BFs provide a way to implement a delete operation on a BF without recreating the filter afresh.

● In a counting filter the array positions (buckets) are extended from being a single bit to being an n-bit counter.

● When an item is added, the corresponding counters are incremented, and when it’s removed, the counters are decremented.

● A counting BF takes n times more space than a regular BF, and it also has a scalability limit: because the counting BF table cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must be known in advance. Once the designed capacity of the table is exceeded, the false-positive rate grows rapidly as more keys are inserted. (A minimal sketch follows.)
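A minimal sketch of the counting variant described above. Python's built-in hash() stands in for real hash functions here for brevity (it is not stable across processes); everything else mirrors the earlier BloomFilter sketch.

class CountingBloomFilter:
    """Each bucket is a small counter instead of a single bit."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        for i in range(self.k):
            yield hash((i, item)) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only safe for items that were actually added.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))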


BloomFilter extension – Scalable

● Standard BFs require knowing the size of the data set ahead of time in order to keep the false-positive probability controllable

● Scalable BFs are useful for cases where the size of the data set isn’t known a priori and memory constraints aren’t of particular concern.

● A scalable BF is essentially an array of BFs. New elements are added to the last filter. When this filter becomes “full” – when it reaches a target fill ratio – a new filter is added with a tightened error probability.


BloomFilter extension – Scalable
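This slide's content (likely a diagram or code) is missing from the transcript. A rough Python sketch of the idea under stated assumptions: it reuses the earlier BloomFilter class, uses a simple item count as a stand-in for the fill-ratio check, and treats the capacity, error, and tightening parameters as illustrative.

import math

class ScalableBloomFilter:
    """A growing chain of BFs with geometrically tightened error."""

    def __init__(self, capacity=1000, error=0.01, tightening=0.5):
        self.capacity, self.error, self.tightening = capacity, error, tightening
        self.count = 0                  # items in the current (last) filter
        self.filters = [self._new_filter()]

    def _new_filter(self):
        # Size a filter for the current capacity and error target.
        m = int(-self.capacity * math.log(self.error) / math.log(2) ** 2)
        k = max(1, round(m / self.capacity * math.log(2)))
        return BloomFilter(m, k)

    def add(self, item):
        if self.count >= self.capacity:        # crude stand-in for a fill-ratio check
            self.capacity *= 2
            self.error *= self.tightening      # tightened error probability
            self.filters.append(self._new_filter())
            self.count = 0
        self.filters[-1].add(item)
        self.count += 1

    def __contains__(self, item):
        return any(item in f for f in self.filters)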


BloomFilter extension – Stable

● A stable BF is a variant of BFs for detecting duplicates in unbounded data streams with limited space (memory). In particular, if the stream is not uniformly distributed, meaning duplicates are likely to be grouped closer together, the rate of false positives becomes immaterial.

● Since there is no way to store the entire history of a stream (which can be infinite), Stable BFs continuously evict stale information to make room for more recent elements.

● Since stale information is evicted, the Stable BF introduces false negatives, which do not appear in traditional Bloom filters. But a tight upper bound of false positive rates is guaranteed.


BloomFilter extension – Layered

● A layered BF consists of multiple BF layers.

● Layered BFs allow keeping track of how many times an item was added to the BF by checking how many layers contain the item.

● With a layered BF a check operation will normally return the deepest layer number the item was found in.


BloomFilter extension – Inverse

● Inverse BF is an “opposite” of BF. It may report a false negative but can never report a false positive. That is, it may indicate that an item has not been seen when it actually has, but it will never report an item as seen which it hasn’t come across.

● An inverse BF behaves in a similar manner to a fixed-size hash map of m buckets which doesn’t handle conflicts, but it provides lock-free concurrency using an underlying compare-and-swap (CAS) primitive.

● Inverse BF is a nice option for dealing with unbounded streams or large data sets due to its limited memory usage. If duplicates are close together, the rate of false negatives becomes vanishingly small with an adequately sized filter.


BloomFilter – Applications

● Akamai's web servers use Bloom filters to prevent "one-hit-wonders" from being stored in its disk caches

● Google BigTable, Apache HBase and Apache Cassandra use Bloom filters to reduce the disk lookups for non-existent rows or columns

● Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed

● The Squid Web Proxy Cache uses Bloom filters for cache digests

● Bitcoin uses Bloom filters to speed up wallet synchronization

● The Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature


BloomFilter – Alternatives

● Cuckoo hashing: https://en.wikipedia.org/wiki/Cuckoo_hashing

● Roaring bitmaps: http://roaringbitmap.org/


Cardinality Estimation – HyperLogLog

● a streaming algorithm used for estimating the number of distinct elements (cardinality) of very large data sets.

● HyperLogLog counter can count one billion distinct items with an accuracy of 2% using only 1.5 KB of memory.

● It is based on the bit-pattern observation that, for a stream of randomly distributed numbers, if the maximum number of leading 0 bits seen in any number is k, the cardinality of the stream is very likely about 2^k.


HyperLogLog – simple explanation

● For example, given four bits there exist only 16 possible values. If the highest number of leading zeroes seen in our stream were three (binary 000x), the probability of seeing that pattern is 2 in 16 (or 1 in 8), so we conclude that the cardinality of our streaming set is about 8.


HyperLogLog – more details

● In the HLL algorithm, a hash function is applied to each element in the original multiset (a set which allows multiple occurrences of its elements), to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly distributed set can then be estimated using the algorithm above.

● The simple estimate of cardinality obtained using the algorithm above has the disadvantage of a large variance. In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set.


HyperLogLog – an implementation
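The implementation shown on this slide is not in the transcript. A simplified Python sketch of the algorithm just described, assuming 64-bit hashes and 2^b registers; the small-range and large-range corrections of the full HLL are omitted.

import hashlib

class HyperLogLog:
    """Registers keep the max leading-zero rank; estimate via harmonic mean."""

    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b                              # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction (m >= 128)

    def add(self, item):
        x = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        j = x & (self.m - 1)                         # low b bits pick a register
        w = x >> self.b                              # remaining 64 - b bits
        rank = (64 - self.b) - w.bit_length() + 1    # position of leftmost 1-bit
        self.registers[j] = max(self.registers[j], rank)

    def estimate(self):
        # Harmonic mean of 2^register across all registers.
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m * z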


Frequency Est. – Count-Min Sketch

● Count-Min sketch is a family of memory-efficient data structures that allow one to estimate frequency-related properties of the data set, e.g. estimate frequencies of particular elements, find top-k frequent elements, perform range queries (where the goal is to find the sum of frequencies of elements within a range), and estimate percentiles.

● It is somewhat similar to a Bloom filter. The main difference is that a Bloom filter represents a set as a bitmap, while a Count-Min sketch represents a multi-set, keeping a summary of the frequency distribution.


Frequency Est. – Count-Min Sketch

● A Count-Min sketch is a two-dimensional array (d × w) of integer counters. When a value arrives, it is mapped to one position in each of the d rows using d different and preferably independent hash functions, and the counter at each of these positions is incremented.


Frequency Est. – Count-Min Sketch

● The estimate of the counts for an item is the minimum value of the counts at the array positions determined by the d hash functions.

● The space used by a Count-Min sketch is the array of w × d counters. By choosing appropriate values for d and w, very small error can be achieved with high probability.


Count-Min Sketch – implementation
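The slide's code is missing from the transcript; below is a minimal Python sketch of the d × w counter array described above. Salted MD5 stands in for the d independent hash functions, and the default sizes are illustrative.

import hashlib

class CountMinSketch:
    """d rows of w counters; each item maps to one counter per row."""

    def __init__(self, d=5, w=2000):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _positions(self, item):
        for i in range(self.d):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield i, int(digest, 16) % self.w

    def add(self, item, count=1):
        for row, col in self._positions(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum over the d rows
        # is the least-distorted estimate (it can over-count, never under-count).
        return min(self.table[row][col] for row, col in self._positions(item))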


Count-Min Sketch – Properties

● Union can be performed by cell-wise ADD operation

● O(d) query time (one counter lookup per hash function)

● Better accuracy for higher frequency items (heavy-hitters)

● Can only cause over-counting but not under-counting


Count-Min Sketch – Notes

● Accuracy of the Count-Min sketch depends on the ratio between the sketch size and the total number of registered events. This means that the Count-Min technique provides significant memory gains only for skewed data, i.e. data where items have very different probabilities.

● Applicability of Count-Min sketches is not a straightforward question and the best thing that can be recommended is experimental evaluation of each particular case.

● Count-Min sketch performs well on highly skewed data, but on low or moderately skewed data it is not so efficient because of poor protection from the high number of hash collisions – the Count-Min sketch simply selects the minimal (least distorted) estimator => Count-Mean-Min sketch


Count-Mean-Min Sketch – implementation

● CMM estimates the noise for each hash function as the average value of all counters in the row that corresponds to this function (excluding the counter that corresponds to the query itself), subtracts it from the estimate for this hash function, and, finally, computes the median of the estimates over all hash functions.
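A sketch of that estimator on top of the CountMinSketch class above. It assumes `total` (the overall number of registered events, which equals the sum of each row) is tracked separately and passed in; the function name is mine.

import statistics

def cmm_estimate(cms, item, total):
    """Count-Mean-Min: subtract each row's average noise, take the median."""
    estimates = []
    for row, col in cms._positions(item):
        counter = cms.table[row][col]
        noise = (total - counter) / (cms.w - 1)   # mean of the row's other counters
        estimates.append(counter - noise)
    return statistics.median(estimates)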


Count-Min Sketch – Top-k problem

Find all elements in the data set with the frequencies greater than k percent of the total number of elements in the data set.

● Maintain a standard Count-Min sketch during the scan of the data set and put all elements into it.

● Maintain a heap of top elements, initially empty, and a counter N of the total number of already processed elements.

● For each element in the data set:

✔ Put the element to the sketch

✔ Estimate the frequency of the element using the sketch. If the frequency is greater than a threshold (k*N), put the element into the heap. The heap should be periodically or continuously cleaned up to remove elements that no longer meet the threshold (see the sketch after this list).

● In general, the top-k problem makes sense only for skewed data, so usage of Count-Min sketches is reasonable in this context.
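A compact sketch of the procedure above, reusing the CountMinSketch class from earlier. A dict of candidates stands in for the heap, k is the threshold fraction, and the cleanup period and default sizes are illustrative.

def heavy_hitters(stream, k=0.01, d=5, w=2000):
    """Items whose estimated frequency exceeds k * N."""
    cms = CountMinSketch(d, w)
    candidates = {}
    n = 0
    for item in stream:
        cms.add(item)
        n += 1
        f = cms.estimate(item)
        if f >= k * n:
            candidates[item] = f
        if n % 1000 == 0:               # periodic cleanup of stale candidates
            candidates = {x: c for x, c in candidates.items() if c >= k * n}
    return sorted(candidates.items(), key=lambda kv: -kv[1])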


Percentile & Quantile Est. – t-digest

● Consider the problem of calculating the median of a data set in a distributed environment (the median of medians is not, in general, equal to the median) => what's needed is an algorithm that can approximate the median while still being space-efficient.

● The t-digest is a probabilistic data structure for estimating the median (and, more generally, any percentile) from either distributed data or streaming data.

● Internally, the data structure is a sparse representation of the cumulative distribution function. After ingesting data, the data structure has learned the "interesting" points of the CDF, called centroids.
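To make the centroid idea concrete, here is a toy sketch. It is not the real t-digest: the actual algorithm bounds each centroid's size with a quantile-dependent scale function, while this sketch just caps every centroid at a fixed count. The class name and parameters are mine.

import bisect

class TinyDigest:
    """Toy centroid digest: bounded, sorted list of [mean, count] pairs."""

    def __init__(self, max_centroid_size=100):
        self.max_size = max_centroid_size
        self.centroids = []                      # kept sorted by mean

    def add(self, x):
        i = bisect.bisect(self.centroids, [x])
        for j in (i - 1, i):                     # try the two nearest centroids
            if 0 <= j < len(self.centroids) and self.centroids[j][1] < self.max_size:
                mean, count = self.centroids[j]
                self.centroids[j] = [mean + (x - mean) / (count + 1), count + 1]
                return
        self.centroids.insert(i, [x, 1])

    def quantile(self, q):
        # Assumes at least one point has been added.
        total = sum(count for _, count in self.centroids)
        target, running = q * total, 0
        for mean, count in self.centroids:
            running += count
            if running >= target:
                return mean
        return self.centroids[-1][0]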


Percentile & Quantile Est. – t-digest

● A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.

● The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics.

● The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean.

● The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.


t-digest – characteristics

● has smaller summaries than Q-digest

● works on doubles as well as integers.

● provides part per million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles

● is fast

● is very simple

● can be used with map-reduce very easily because digests can be merged


Some remarks

● For some structures like HyperLogLog or Bloom filter, there are simple and practical formulas to determine the parameters of the structure on the basis of the expected data volume and the required error probability.

● Other structures like Count-(Mean-)Min Sketch have complex dependency on statistical properties of data and experiments are the only reasonable way to understand their applicability to real use cases.

● Data-structures populated by different data sets can often be combined to process complex queries.

● Some types of queries can be supported by using customized versions of the described data-structures/ algorithms.


Case Study 1

● There is a system that tracks a huge number of web events, and each event is marked by a number of tags including a user ID this event corresponds to. It is required to report the number of unique users that meet a specified combination of tags (like users from city C that visited site A or site B).


Case Study 1: solution

● Solution 1:

✔ maintain a BF that tracks user IDs for each tag value and a BF that contains user IDs that correspond to the final result.

✔ A user ID from each incoming event is tested against the per-tag filters – does it satisfy the required combination of tags or not.

✔ If the user ID passes this test, it is additionally tested against the additional BF that corresponds to the report itself and, if passed, the final report counter is increased.

● Solution 2: using HLL for each tag value


Case Study 2

● There is a system that receives events on user visits from different internet sites.

● This system enables analysts to query the number of unique visitors for a specified date range and site.


Case Study 2: solution

● HLL can be used to aggregate information about visitor IDs for each day and site; the daily HLL sketches are saved, and a query can be processed by merging the daily sketches (for HLL registers this means taking the register-wise maximum rather than a plain bitwise OR).
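Merging two of the HyperLogLog sketches from the earlier sketch is short enough to spell out (the function name is mine):

def merge_hll(a, b):
    """Merge two HLL sketches with the same b: register-wise maximum."""
    assert a.b == b.b
    out = HyperLogLog(a.b)
    out.registers = [max(x, y) for x, y in zip(a.registers, b.registers)]
    return out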


Case Study 3

● There is a system that tracks traffic by IP address, and it is required to detect the most traffic-intensive addresses.


Case Study 3: solution

● CMS?!!

● the problem is not trivial because we need to track the total traffic for each address, not just the frequency of items.

● counters in the CMS implementation can be incremented not by 1, but by the absolute amount of traffic for each observation (i.e., the size of the IP packet, if the sketch is updated for each packet)

● In this case, the sketch will track the amount of traffic for each address, and a heap with the most traffic-intensive addresses can be maintained (top-k / heavy hitters).
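As a tiny illustration, this is just the CountMinSketch from the earlier sketch with a weighted increment; the `packets` sample and the queried address are hypothetical.

traffic = CountMinSketch(d=5, w=50_000)
packets = [("203.0.113.7", 1500), ("198.51.100.2", 64), ("203.0.113.7", 1500)]
for src_ip, packet_size in packets:
    traffic.add(src_ip, count=packet_size)   # increment by traffic, not by 1

print(traffic.estimate("203.0.113.7"))       # estimated total bytes for one address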


Case Study 4

● There is a system that monitors traffic and counts unique visitors for different criteria (visited site, geography, etc.).

● It is required to compute the 100 most popular sites, using the number of unique visitors as the metric of popularity.

● Popularity should be computed every day on the basis of data for the last 30 days, i.e. every day a one-day partition is added and another one is removed from the scope.


Case Study 4: solution

● create a fresh set of per-site HLL counters every day and maintain each set for 30 days, i.e. 30 sets of counters are active at any moment in time.


Case Study 5

● Number of users performing an action (view, click, ...) on site objects (banner, button, …) 1 time, 2 times, …, 10+ times

● Report looks like below

Filter: Object=X
1-times: 98765
2-times: 76543
3-times: 54321
…
9-times: 1234
10+-times: 343


Case Study 5: solution

● Should we use CMS???

● … and why/why NOT???


Case Study 5: solution

● Use scalable layered-BF to track k-times user actions on objects

● Use HLL to count users on each k-times action


What else?

● Libs

✔ Redis: HLL already supported, BF expected in the upcoming 3.2

✔ https://github.com/twitter/algebird

✔ https://github.com/addthis/stream-lib

✔ https://github.com/tylertreat/BoomFilters

✔ https://github.com/tdunning/t-digest

● More

✔ Linear Counting

✔ MinHash

✔ Top-K


References

● https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

● https://dzone.com/articles/introduction-probabilistic-0

● http://bravenewgeek.com/stream-processing-and-probabilistic-methods/

● https://www.somethingsimilar.com/2012/05/21/the-opposite-of-a-bloom-filter/

● https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest
