Advanced Data Structures for Information Search
Pre-diploma internship
Asylbek Otarov
Agenda
• Introduction
• A Membership Query Problem
• What Is a Bloom Filter
• Windows Application in C# (showing how the Bloom filter algorithm works)
• Conclusion
Introduction
• Place of practice: Almaty, AO “KBTU”
• Dates of practice: January 13, 2014 to February 7, 2014
• Instructor: Eliusizov Damir
• Supervisor: Eliusizov Damir
Problem Description
Given an element E, determine whether it belongs to a large set of elements S.
• As fast as possible
• As small as possible
Some Solutions
• Hash table: fast, but a big data structure
• Bitmap index: a smaller data structure than a hash table
Tradeoff Solutions
• To obtain speed and size improvements, allow some probability of error.
Bloom Filter
• A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970.
An empty Bloom filter is a bit array of m bits, all set to 0.
Index: 0 1 2 3 4 … m-1
Bits:  0 0 0 0 0 … 0
There must also be k different hash functions defined, each of which maps (hashes) a set element to one of the m array positions with a uniform random distribution.
To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.
HF1(x) = index, HF2(x) = index, …, HFk(x) = index
To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set; if it were, all of those bits would have been set to 1 when it was added. If all of the bits are 1, the element is reported as being in the set.
A Bloom filter has two operations: Add and Test.
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
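The Add and Test operations described above could be sketched in C# roughly as follows. This is a minimal illustration under stated assumptions, not the code of the internship application: the class name BloomFilter, the Position helper, and the way the k hash functions are derived (32-bit FNV-1a with the function index mixed in first) are all choices made here for the example.

```csharp
using System;
using System.Collections;

// Minimal Bloom filter sketch: a bit array of m bits and k hash functions.
public class BloomFilter
{
    private readonly BitArray bits; // m bits, all initially 0 (false)
    private readonly int k;         // number of hash functions

    public BloomFilter(int m, int k)
    {
        if (m <= 0) throw new ArgumentOutOfRangeException(nameof(m));
        if (k <= 0) throw new ArgumentOutOfRangeException(nameof(k));
        bits = new BitArray(m);
        this.k = k;
    }

    // Assumption: the i-th hash function is 32-bit FNV-1a that first mixes in
    // the index i and then the characters of the element. Illustrative only.
    private int Position(string element, int i)
    {
        uint hash = 2166136261u;                        // FNV offset basis
        hash = unchecked((hash ^ (uint)i) * 16777619u); // mix in the function index
        foreach (char c in element)
        {
            hash = unchecked((hash ^ c) * 16777619u);   // XOR, then multiply by the FNV prime
        }
        return (int)(hash % (uint)bits.Length);         // map to one of the m positions
    }

    // Add: set the bits at all k positions to 1.
    public void Add(string element)
    {
        for (int i = 0; i < k; i++)
        {
            bits[Position(element, i)] = true;
        }
    }

    // Test: false means "definitely not in the set";
    // true means "in the set, or a false positive".
    public bool Test(string element)
    {
        for (int i = 0; i < k; i++)
        {
            if (!bits[Position(element, i)])
            {
                return false;
            }
        }
        return true;
    }
}
```

A caller would create the filter with, for example, `new BloomFilter(1024, 3)`, call `Add` for each element of S, and then call `Test` for each query. An element that was added is always reported as present.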
Hash Table: Chance of Collision
• A collision is a situation that occurs when two distinct pieces of data have the same hash value, checksum, fingerprint, or cryptographic digest.
x is an element that has been added to the hash table
y is an element that has been added to the hash table
F and G are hash functions.
Index: 0 1 2 3 4 … M-1
Bits:  0 0 1 0 1 0 …
F(x) = 2, G(x) = 1, F(y) = 2, G(y) = 1
False positives are possible; false negatives are not possible.
If all the bits are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive.
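How often such a false positive occurs can be estimated with the standard approximation below. It assumes that the k hash functions are independent and uniformly distributed, and it introduces one symbol that does not appear in the slides: n, the number of elements already inserted. The symbols m and k are the bit array size and the number of hash functions, as above.

```latex
\begin{align*}
P(\text{a given bit is still } 0) &= \left(1 - \frac{1}{m}\right)^{kn} \\
P(\text{false positive}) &\approx \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}
  \approx \left(1 - e^{-kn/m}\right)^{k}
\end{align*}
```

Under the same assumptions, this estimate is smallest when k is close to (m/n) ln 2, which is why larger bit arrays allow more hash functions and fewer false positives.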
Bloom Filter: Properties
• Not a Key-Value store
• Array of bits indicating the presence of a key in the filter.
• Removing an element from the filter is not possible
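The reason removal is not possible can be seen in a small sketch: two elements may share a bit, so clearing the bits of one element can make the filter forget the other. The class name RemovalDemo and the bit positions used for x and y are hypothetical values chosen for illustration; they are not taken from the application.

```csharp
using System;

public static class RemovalDemo
{
    // Returns true when every listed position is set in the bit array.
    public static bool Test(bool[] bits, int[] positions)
    {
        foreach (int p in positions)
        {
            if (!bits[p]) return false;
        }
        return true;
    }

    // Sets (value = true) or clears (value = false) all listed positions.
    public static void SetAll(bool[] bits, int[] positions, bool value)
    {
        foreach (int p in positions)
        {
            bits[p] = value;
        }
    }

    // Hypothetical positions: x hashes to {1, 2} and y hashes to {2, 4},
    // so the two elements share bit 2.
    public static bool YSurvivesNaiveRemovalOfX()
    {
        var bits = new bool[8];
        int[] x = { 1, 2 };
        int[] y = { 2, 4 };
        SetAll(bits, x, true);  // add x
        SetAll(bits, y, true);  // add y
        SetAll(bits, x, false); // naive "remove x": clear its bits
        return Test(bits, y);   // false: y is now wrongly reported as absent
    }
}
```

Because clearing bit 2 also erases part of y, the naive removal introduces a false negative, which a Bloom filter must never produce.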
Bloom Filter: Usage
• Google Bigtable and Apache Cassandra use Bloom filters to reduce disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
• The Google Chrome web browser uses a Bloom filter to identify malicious URLs. Any URL is first checked against a local Bloom filter, and only on a hit is a full check of the URL performed.
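Both usages follow the same pattern: ask the filter first, and run the expensive check only when the filter reports a possible hit. The sketch below illustrates that pattern in C#. The names GuardedStore, mightContain, slowLookup, and SlowLookups are hypothetical, and the code does not reflect the actual implementation of Bigtable, Cassandra, or Chrome.

```csharp
using System;
using System.Collections.Generic;

// Sketch of a store that consults a membership filter before a costly lookup.
public class GuardedStore
{
    private readonly Func<string, bool> mightContain; // e.g. a Bloom filter's Test operation
    private readonly Func<string, string> slowLookup; // stands in for a disk read or remote check

    // Counts how many times the costly path was actually taken.
    public int SlowLookups { get; private set; }

    public GuardedStore(Func<string, bool> mightContain, Func<string, string> slowLookup)
    {
        this.mightContain = mightContain;
        this.slowLookup = slowLookup;
    }

    // Returns the stored value, or null when the key is absent.
    public string Get(string key)
    {
        if (!mightContain(key))
        {
            return null;        // filter says "definitely absent": skip the slow path
        }
        SlowLookups++;          // filter says "possibly present": pay for the full check
        return slowLookup(key); // may still return null if the hit was a false positive
    }
}
```

A false positive costs only one unnecessary slow lookup, while every key that the filter rejects is answered without touching the slow path at all.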
Windows Application in C#
[Screenshots: Add element; Check element (two examples)]
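Hash function in C#
public static UInt32 FNV1(string offset_str)
{
    // Standard 32-bit FNV constants.
    const UInt32 FNV_prime = 16777619;
    UInt32 offset_basis = 2166136261;

    // XOR each character into the hash, then multiply by the prime.
    // (This XOR-then-multiply order is the FNV-1a variant.)
    foreach (char c in offset_str)
    {
        offset_basis = unchecked((offset_basis ^ c) * FNV_prime);
    }

    // Reduce to an array position once, at the end; taking the remainder
    // inside the loop would throw away most of the hash state.
    // 23 is presumably the size of the demo bit array.
    return offset_basis % 23;
}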
Conclusion
This internship gave me a lot of experience that I can use in the future. As a student in the Department of Computer Engineering majoring in Information Systems, I can definitely say that this internship helped me understand the concepts of information systems more deeply, because I participated in a project where information systems play a big role in connecting two sides: people and organizations. I also learned a lot of techniques related to object-oriented programming.