Download - Retrieval and Perfect Hashing
-
8/12/2019 Retrieval and Perfect Hashing
1/16
0 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
Department of Informatics of KIT1 and HANA Core of SAP AG2
Retrieval and Perfect Hashingusing FingerprintingIngo Mller12, Peter Sanders1, Robert Schulze2, and Wei Zhou12
KIT University of the State of Baden-Wuerttemberg and
National Laboratory of the Helmholtz Association www.kit.edu
-
8/12/2019 Retrieval and Perfect Hashing
2/16
The Retrieval Problem
1 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
Perfect Hash Function(PHF)Map each keys Sto unique integeri ID
Retrieval data structure
Associate valuev Vto eachs S
Classical implementation: hash table
Store key/ID or key/value pairs
Optimization: do not storeS
Undefined behavior fors
S
Applications
Look-up in dictionaries of in-memory DBMSs (like the SAP HANA database [1])
Many more. . . (see Botelho et al. [2])
S V . . .flag
noun, verb,
flare
n, v,flask n,flash n, v
,. . .
V optimization
-
8/12/2019 Retrieval and Perfect Hashing
3/16
Known Methods
2 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
Perfect Hash Functions
Practical implementations exist: BPZ [3], CHD [4], etc.
Store only constant, sometimes optimal number of extra bits
Retrieval: use a PHF to index an array of values
Direct Retrieval Data Structures
CHM [5], etc.
Prior work Our solution
Construction complicated simplerinherently sequential easily parallelizable
slow
fasterDynamic operations no (rebuild) yesCache misses per query
2 1
-
8/12/2019 Retrieval and Perfect Hashing
4/16
Fingerprint Retrieval (FiRe)Overview
3 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
...
Level1
...
Level2
...
LevelL
PHF-based
retrievaldata
structure
LevelL 1
m n
bbuckets
fingerprints v1 v2 va
acells
A bucket
bucket hash1 key 1 , . . . ,m ,fingerprint hash2 key 1 , . . . , k
Recursively overflowto next level on fingerprint collision/full bucket
Fingerprints implemented as bit vector for simplicity and speed
-
8/12/2019 Retrieval and Perfect Hashing
5/16
Fingerprint Retrieval (FiRe)Asymptotics
4 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
Llevels
PHF
fallback
m
buckets
fingerprints acells
n elements
m
nb buckets
a cells per bucket
r-bit values
Expectedlinear construction time
LFiRe levels,O n for each level
Even forL : geometric series, as only a constant fraction of theelements overflow
Constant worst-case query time, sinceL is constant
-
8/12/2019 Retrieval and Perfect Hashing
6/16
Fingerprint Retrieval (FiRe)Formulae
5 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
Llevels
PHF
fallback
m
buckets
fingerprints acells
n elements
m nb buckets
a cells per bucket
r-bit values
Leta1 be the expected number of elements in a bucket
Space overheadper elements
r
a
a1 size fingerprints
a1bits
Cache missesper queryl
ba1
Calculation ofa1: see our paper
-
8/12/2019 Retrieval and Perfect Hashing
7/16
Fingerprint Retrieval (FiRe)Space/Time Trade-Off
6 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
Space overheads and expected number ofcache missesl depend ona: #cells per bucket
b: average #elements per bucket (
nm )
k: #possible fingerprint values
r: size of each value
How to choose parameters?
aand ksuch that a bucket
fits into acache lineDepending on desiredtrade-off
0
5
10
15
20
25
30
35
40
1 1.5 2 2.5 3
s
--spac
e
overhead
[bits]
l -- cache misses
r = 32 bits
opt. encoding a=14, k=149bit vector a=15, k= 32bit vector a=14, k= 64bit vector a=13, k= 96bit vector a=12, k=128
-
8/12/2019 Retrieval and Perfect Hashing
8/16
Fingerprint Retrieval (FiRe)Dynamization
7 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
Updatesand deletions(easy)
Update associatedvalue in place
Ignore deletions
static part
S V . . .flag
noun, verb,flare
n, v,flask
n,flash
n, v,. . .
dynamic part
queries updates
Insertions
Needs adynamic partwith keys + some book-keeping information
Answer queries with thestatic part(FiRe)
Idea:Overflow new and old element if fingerprints collideBlock fingerprint for future insertsRebuild when some stability criterion is violated
-
8/12/2019 Retrieval and Perfect Hashing
9/16
Fingerprint Perfect Hashing (FiPHa)
8 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
Llevels
PHF
fallback
m
buckets
fingerprints acells
Fingerprint-BasedPerfect Hashing(FiPHa)Special case with large buckets of empty values
Associated ID is calculated asrankbucket v a rankfingerprint v
Veryspace efficient(2.79bits overhead with2.78cache misses)
-
8/12/2019 Retrieval and Perfect Hashing
10/16
Experimental ResultsSettings
9 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
Configurationsof a,b,ksuch thatl
1.05cache misses: FiRe5
l 1.25cache misses: FiRe25
l 1.50cache misses: FiRe50
Retrieval data structure with FiPHaas PHF (3.78cache misses)
Base lines
BPZ, CHD-0.5/0.99 from theC Minimal Perfect Hashing Library [6]
CHM-2/3 from our implementation
Datasets
Keys: 100million unique random32-bit integers
Values: integers of sizer 8bits
-
8/12/2019 Retrieval and Perfect Hashing
11/16
Experimental ResultsBuild Times
10 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
0
500
1000
1500
2000
2500
3000
FiRe5FiRe25
FiRe50FiPH
aCHM
-2CHM
-3BPZ CHD
-0.5
CHD-0.99
buildtime[n
s/element]
r 8bits
417 timesfaster sequential constructionfor FiRe
FiPHa slower, but faster than competitors
-
8/12/2019 Retrieval and Perfect Hashing
12/16
Experimental ResultsSpace Overhead
11 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
0
5
10
15
20
25
30
35
FiRe5
FiRe25
FiRe50
FiPHa
CHM-2
CHM-3
BPZCH
D-0.5
CHD-0.99
spaceoverhead
[bits/element]
r 8bits
FiRe50 hascomparable overhead to most competitors
FiPHa almost on par with best competitor (CHD-0.99)
-
8/12/2019 Retrieval and Perfect Hashing
13/16
Experimental ResultsQuery Times
12 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
0
50
100
150200
250
300
350
400
FiRe5
FiRe25
FiRe50
FiPHa
CHM-2
CHM-3
BPZ CHD-0.5
CHD-0.99
querytime[ns/element]
r 8bits
FiRe has thebest query timesdue to low number of cache misses
FiPHa comparable to CHD-0.99, but has much faster construction
-
8/12/2019 Retrieval and Perfect Hashing
14/16
Summary and Future Work
13 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
Fingerprint Retrieval(FiRe) andPerfect Hashing(FiPHa)
Simple concept, easy implementation
Fast evaluation due tolow number of cache-misses
Extremelyfast construction, even with sequential implementation
Small space overhead
Highlyconfigurable trade-offSupport forupdates,insertions, anddeletions
Future Work
Find more compact, yet practicalrepresentation of fingerprints
Adapt idea ofcuckoo-hashingto fingerprinting
Improve trade-off withdifferent settingsfor each level
Adapt fingerprinting idea toother data structures
-
8/12/2019 Retrieval and Perfect Hashing
15/16
14 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
Thank you
-
8/12/2019 Retrieval and Perfect Hashing
16/16
References I
15 Mller, Sanders, Schulze, Zhou:Retrieval and Perfect Hashing using Fingerprinting
Institute for Theoretical InformaticsAlgorithms II
[1] F. Frberet al., SAP HANA Database: Data management for modern business applications, SIGMOD
Rec., vol. 40, no. 4, pp. 4551, 2012. [Online]. Available: http://doi.acm.org/10.1145/2094114.2094126
[2] F. C. Botelho and N. Ziviani, External perfect hashing for very large key sets, in Proceedings of the
sixteenth ACM conference on Conference on information and knowledge management. ACM, 2007, pp.
653662.
[3] F. C. Botelho, R. Pagh, and N. Ziviani, Simple and space-efficient minimal perfect hash functions, in
Algorithms and Data Structures. Springer, 2007, pp. 139150.
[4] D. Belazzougui, F. C. Botelho, and M. Dietzfelbinger, Hash, displace, and compress, in Algorithms-ESA
2009. Springer, 2009, pp. 682693.
[5] Z. J. Czech, G. Havas, and B. S. Majewski, An optimal algorithm for generating minimal perfect hash
functions,Information Processing Letters, vol. 43, no. 5, pp. 257264, 1992.
[6] D. de Castro Reis, D. Belazzougui, F. C. Botelho, and N. Ziviani, CMPH C Minimal Perfect Hashing
Library. [Online]. Available:http://cmph.sourceforge.net
http://doi.acm.org/10.1145/2094114.2094126http://cmph.sourceforge.net/http://cmph.sourceforge.net/http://doi.acm.org/10.1145/2094114.2094126