TRANSCRIPT
© Neeraj Suri, EU-NSF ICT, March 2006
Dependable Embedded Systems & SW Group, www.deeds.informatik.tu-darmstadt.de
Introduction to Computer Science 2
Hash Tables (2)
Prof. Neeraj Suri, Dan Dobre
ICS-II - 2008, Hash Tables (2)
Overview
So far:
- Direct hashing
- Hash functions (folding, modulo, etc.)
- Collision resolution (linear & quadratic probing)

What's next?
- Collision resolution, continued
- Cost analysis of hashing
- Hashing on external memory
- Extendible (dynamic) hashing
- Excursus: (pseudo-)random numbers and their applications
Double/repeated Hashing
If a collision occurs, the key is hashed a second time using another hash function.
This can be generalized: if a collision occurs again, the key is hashed once more using the next hash function in the sequence.
If the collision still persists after k hash functions, another technique has to be applied.
Avoids collision accumulation; delete remains complex; reaching the entire memory space is problematic.
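A minimal sketch of the common open-addressing variant of this idea, where the probe step is taken from a second hash function. The table size, the concrete h2, and the give-up rule after m probes are illustrative assumptions, not prescribed by the slides:

```python
# Double hashing sketch: on a collision, advance by a key-dependent step.

M = 11  # prime table size, so every h2 step size eventually visits all slots

def h1(k):
    return k % M

def h2(k):
    return 1 + (k % (M - 1))    # in 1..M-1, never 0: probing always advances

def insert(table, k):
    for i in range(M):          # probe h1(k) + i*h2(k) for a free slot
        pos = (h1(k) + i * h2(k)) % M
        if table[pos] is None:
            table[pos] = k
            return pos
    raise RuntimeError("give up: fall back to another technique")

table = [None] * M
for key in (15, 4, 26, 37):     # all four collide at h1(k) = 4 ...
    insert(table, key)          # ... but h2 spreads them over the table
```

Because each key gets its own step size, synonyms of one home address do not pile up into a single cluster, which is exactly the "avoids collision accumulation" property above.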
Chaining of synonyms in the same HT
Members of a collision class are chained; each memory slot in the HT must carry an additional pointer.
Because there is no separate overflow area, collisions continue to occur due to foreign occupation.
Chaining doesn't prevent the collisions, but it speeds up the search.
Delete becomes considerably easier, because only one pointer has to be reset.
Insert requires following the pointer list until a free place is found.
If the home address is occupied by another key (one that does not belong there), that key is moved.
Chaining: Example
h0 (K) = K mod 7; hi (K) = (h0 (K) + i) mod m
Insert: 11, 32, 8, 25
Position:  0    1    2    3    4    5    6
Key:       –    8    –    –    11   32   25
Pointer:   –    –    –    –    5    6    –

(11, 32, and 25 form the collision class of address 4; 32 and 25 were probed to 5 and 6 and are linked by pointers.)
Chaining: Example
h0 (K) = K mod 7; hi (K) = (h0 (K) + i) mod m
Now insert 12. Its home address 5 is occupied by 32, which does not belong there: search the chain for the pointer to 32, then move 32 onward (here to position 0).
Position:  0    1    2    3    4    5    6
Key:       –    8    –    –    11   32   25
Pointer:   –    –    –    –    5    6    –

(State before the move; 32 is then moved to position 0.)
Chaining: Example
h0 (K) = K mod 7; hi (K) = (h0 (K) + i) mod m
Now 12 is inserted at its home address 5.
Position:  0    1    2    3    4    5    6
Key:       32   8    –    –    11   12   25
Pointer:   6    –    –    –    0    –    –
Chaining: Example
h0(K) = K mod 7; hi(K) = (h0(K) + i) mod m
Delete 11:
- Follow the chain until 25 is reached (4 - 0 - 6)
- Move 25 to its home address 4
- Delete the pointer "6" in address 0
Position:  0    1    2    3    4    5    6
Key:       32   8    –    –    11   12   25
Pointer:   6    –    –    –    0    –    –
Chaining: Example
h0 (K) = K mod 7; hi (K) = (h0 (K) + i) mod m
The collision chain up to 32 is now broken (address 6 is empty), but this is not a problem, since pointers are used for the chaining.
Position:  0    1    2    3    4    5    6
Key:       32   8    –    –    25   12   –
Pointer:   –    –    –    –    0    –    –
Chaining with separate overflow
All records that cannot be stored at their home address are transferred to an overflow area.
The overflow area can be:
A single overflow area for all synonyms, with only one entry point
• simple; avoids pointers in the hash table
• possibly long synonym chains, therefore only suitable for a small collision frequency
A single overflow area with more than one entry point
• efficient, since only the members of one collision class are traversed
• requires a pointer for each entry in the hash table
• the reference to the synonym chain can be implemented using double hashing; in the case of collisions, the synonyms (mostly few) of two collision classes are affected
Chaining with separate overflow
The separate overflow area can be assigned dynamically.
The HT can be restricted to the keys at their home addresses; all other data can be stored in the dynamic overflow area.
Since pointers can refer to any address, this corresponds to a partition of the overflow area.
Chaining of synonyms is a preferred method.
Position   Key         Pointer (overflow chain)
0          HAYDN       HAENDEL, VIVALDI
1          BEETHOVEN   BACH, BRAHMS
2          CORELLI     –
3          –           –
4          SCHUBERT    LISZT
5          MOZART      –
6          –           –
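A sketch of chaining with a separate, dynamically assigned overflow area, modelled here simply as one Python list per home address (the first element of each list plays the role of the home-address key, the rest the overflow chain). The representation is our own simplification:

```python
# Chaining with separate overflow: one synonym chain per home address.

M = 7

def h(k):
    return k % M

table = [[] for _ in range(M)]

def insert(k):
    table[h(k)].append(k)       # synonyms are chained behind the home key

def search(k):
    return k in table[h(k)]     # only one synonym chain is traversed

for key in (11, 32, 8, 25):     # 11, 32, 25 are synonyms of address 4
    insert(key)
```

In contrast to chaining inside the table, foreign occupation cannot occur: a chain contains only members of one collision class.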
Hashing: analysis of the costs
Cost measure: number of steps (addressing attempts)
Assumptions:
- The same time effort for every evaluation of h(Kp) and every search step
- The hash table is filled with n keys
Search costs Sn = delete costs without rearrangement
Insert costs = costs of an unsuccessful search, Un
Delete costs = Sn + rearrangement costs Rn
Costs can be expressed as a function of the allocation factor β = n/m
Hashing: analysis of the costs – extreme cases
Worst case: Sn = n, Un = n + 1 (one single collision class; access as in a linear list)
Best case: Sn = 1, Un = 1 (no collisions)
Hashing: analysis of the costs – average cases
The average case depends on the overflow handling.
Assumption: h(Kp) distributes the keys uniformly
→ the probability that a key has hash value i, for 0 ≤ i ≤ m-1, is 1/m
Costs using linear probing
Example: hi(k) = (h0(k) + i) mod m
- With a small allocation of the HT: no problem
- With a higher allocation: drastic degradation
In the configuration below, the probability p that slot 7 becomes allocated is 1/m, because 6 is free; the probability that slot 14 becomes allocated is 5/m (the p for 14 as home address plus the sum of the p's for 10, 11, 12, 13, each of which can produce an overflow onto 14).
Long chains grow even longer, and separate chains can grow together (e.g., by an insert at 3 or 14).
[Figure: a hash table with slots 0-16; occupied runs end just before the free slots 3, 6, 7, and 14, e.g., the run 10-13 in front of slot 14.]
Costs using linear probing
According to Knuth:
Sn = 0.5 · (1 + 1/(1-β))
Un = 0.5 · (1 + 1/(1-β)²)
with β = n/m, 0 ≤ β < 1
[Figure: Sn and Un plotted against the allocation factor β from 0.1 to 0.9 (y-axis: 1 to 8 steps). The number of search steps increases drastically with a higher allocation factor.]
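The Knuth formulas can be evaluated directly; note the blow-up of the unsuccessful-search cost as β approaches 1:

```python
# Knuth's cost formulas for linear probing, for a few allocation factors.

def S_linear(beta):                        # successful search
    return 0.5 * (1 + 1 / (1 - beta))

def U_linear(beta):                        # unsuccessful search / insert
    return 0.5 * (1 + 1 / (1 - beta) ** 2)

for beta in (0.1, 0.5, 0.9):
    print(f"beta={beta}: Sn={S_linear(beta):.2f}, Un={U_linear(beta):.2f}")
```

At β = 0.5 this gives Sn = 1.5 and Un = 2.5; at β = 0.9 already Sn = 5.5 and Un = 50.5 steps per unsuccessful search.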
Costs using optimal collision resolution
With optimal methods for collision resolution (e.g., rehashing, pseudo-random probing, etc.), an approximately uniform distribution can be assumed despite collisions.
The probability that a place is occupied/free depends on the number of already allocated places (n) and on those still available (m-n); e.g., Pfree = (m-n)/m.
See the script for the details of the derivation.
Costs using optimal collision resolution (2)
Approximately:
Sn ≈ (1/β) · |ln(1-β)| = (1/β) · ln(1/(1-β))
Un ≈ 1/(1-β)
with β = n/m, 0 ≤ β < 1
[Figure: Sn and Un plotted against β from 0.1 to 0.9. The number of search steps improves drastically with independent allocation after collision resolution.]
Costs using separate overflow
Assumption: uniform distribution of the keys over all chains, i.e., β = n/m keys per chain; furthermore, linear chaining. (Q: how big is Sn?)
When key i is inserted into the HT, i-1 keys are already in the table, i.e., (i-1)/m keys in each chain.
The cost to find a free place is 1 step for the home address plus (i-1)/m steps to reach the end of the chain (one must first check whether the key already exists in the table).
Averaged over all n keys:
Sn = (1/n) · Σ_{i=1..n} (1 + (i-1)/m) = 1 + (n-1)/(2m) ≈ 1 + β/2
Costs using separate overflow
For a successful search, half of the chain is traversed on average.
For an unsuccessful search, the entire chain has to be traversed.
Chaining is superior to the other methods; even with heavy overflow (β > 1) the efficiency remains good:

β     0.5    0.75   1      1.5    2      3      4      5
Sn    1.25   1.37   1.5    1.75   2      2.5    3      3.5
Un    1.11   1.22   1.37   1.72   2.14   3.05   4.02   5.01
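The cost table can be reproduced numerically. Sn = 1 + β/2 is the derivation from the previous slide; the Un row matches β + e^(-β), the usual approximation for an unsuccessful search with separate chaining — that formula is our reading, since the slide does not state it explicitly:

```python
import math

# Reproducing the cost table for chaining with separate overflow.

def S_chain(beta):
    return 1 + beta / 2            # from the derivation above

def U_chain(beta):
    return beta + math.exp(-beta)  # assumed formula matching the Un row

for beta in (0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0):
    print(beta, round(S_chain(beta), 2), round(U_chain(beta), 2))
```
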
Hashing on external memory (b>1)
With a bucket factor b > 1, b records can be stored at one address.
Suitable for both main and external memory; particularly attractive for external memory.
On a collision, the new record is simply stored in the same bucket.
Only with the (b+1)-th entry does the bucket overflow.
On overflow, the known methods for collision resolution can be applied:
- Overflow in the primary area
- Separate overflow area
Hashing on external memory
Overflow buckets can be assigned dynamically and linked via an overflow address.
One overflow bucket can serve several home addresses as overflow area.
Recommended: one chain per collision class.
With b > 1, the allocation factor is β = n/(b·m).
Sequence for storing the records inside a bucket:
- According to the insert sequence (sequential)
- According to the sort order (linked list)
Hashing on external memory
Typical bucket sizes: sector, track, page — generally the transfer unit (1 I/O per bucket).
As with B-trees, the I/O dominates (approx. 6-10 ms) → a more complex hash function is justified; the relative search costs inside one bucket are low.
Insert always at the first free space in the chain.
On deletion, there is no need to bridge gaps (or only inside a page).
Empty overflow buckets are removed from the chain.
Example: b=2
b=2; h(k) = k mod 7
Insert: 11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27
0 1 2 3 4 5 6
Example: b=2 (2)
Now: delete 25
Primary buckets (b = 2):  0: 21 | 1: 8, 15 | 2: 2 | 3: – | 4: 11, 32 | 5: – | 6: 13, 20
Overflow chains: address 4 → [25, 18] → [4]; address 6 → [27]
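The example can be reproduced with a small simulation. Keeping one flat record chain per home address and slicing it into a primary part (first b records) and overflow part is our own illustrative representation:

```python
# Bucket hashing with b = 2 and h(k) = k mod 7, as in the example.

M, B = 7, 2

chains = [[] for _ in range(M)]     # full record chain per home address

def insert(k):
    chains[k % M].append(k)         # records beyond B land in overflow

for key in (11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27):
    insert(key)

primary  = [c[:B] for c in chains]  # contents of the primary buckets
overflow = [c[B:] for c in chains]  # contents of the overflow chains
```

Address 4 collects five synonyms (11, 32, 25, 18, 4), so three of them spill into chained overflow buckets, matching the table above.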
Example: b=2 (3)
Gaps in a chain are not closed across buckets; only the contents of a page are rearranged if needed.
Primary buckets (b = 2):  0: 21 | 1: 8, 15 | 2: 2 | 3: – | 4: 11, 32 | 5: – | 6: 13, 20
Overflow chains: address 4 → [18] → [4]; address 6 → [27]
Summary: Hashing on external memory
Primary buckets always remain allocated, because of the relative addressing.
Overflow buckets are assigned dynamically (appended); empty overflow buckets are deleted.
If the data shrinks strongly, the buckets may become under-occupied → reorganization of the file (e.g., by rehashing all entries stored in the hash table).
Approximate values for Hashing
Selected values for Sn(b) and Un(b) as a function of b and β.
Rule of thumb: b is typically determined by the data transfer unit; select β such that Sn ≈ 1.05 to 1.08 holds.
Hashing vs. B+-Tree
Access costs with a well-designed hash method are better than with a B+-tree (≈1.05 accesses vs. the path length).
Disadvantages:
- No sorted order over all keys (sequential output incurs a considerably higher cost)
- Hashing is static:
• not extendable; long chains lead to degeneration
• consumes the complete designated memory space even with a small number of keys
• (this can also be an advantage: the required memory space is largely defined from the beginning)
Extendible Hashing
Disadvantages of static hash methods with strongly growing data volumes:
- The primary area must be dimensioned large from the beginning (→ bad initial allocation)
- If the capacity of the primary area is exceeded, the overflow chains grow fast → the run-time behavior degrades
- Reorganization requires unloading the entire data volume and loading it again → interruption of operation (often not possible, e.g., in 24x7 operation)
Extendible Hashing
Therefore we need a hash method that
- permits dynamic growing and shrinking of the hash area,
- guarantees constant run-time behavior independent of the size of the data,
- requires no more than 2 page accesses for finding a record,
- avoids overflow mechanisms and total reorganization,
- guarantees a high memory allocation independent of the growth of the key set.
Extendible Hashing
We must avoid overflow buckets.
We would like stability and are ready to pay for it, i.e., a constant 2 accesses.
Available (known to us) techniques:
- Balancing, as in B-trees (constant path length)
- Addressing via a coding of the key, as in digital trees
Extendible hashing uses these techniques in order to guarantee stable access with exactly 2 I/O operations.
Extendible Hashing
The hash function transforms keys into binary strings (a coding).
Only the first n bits are used, as many as necessary (addressing as in a digital tree).
Additional indirection via the container board:
- With few keys, few bits are sufficient
- With many keys, additional bits are used
Containers are added or removed as necessary (balancing).
The container board is "doubled" if necessary → memory space costs, but no computation-intensive reorganization.
Example: Extendible Hashing
Insertion sequence: 11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27
11 = 001011    2 = 000010
32 = 100000   18 = 010010
 8 = 001000   13 = 001101
25 = 011001   20 = 010100
21 = 010101    4 = 000100
15 = 001111   27 = 011011
Extendible Hashing, b=2
Initial situation: the container board contains only one reference, pointing to an empty container.
Insert 11 (001011) and 32 (100000): works without problems.
Extendible Hashing, b=2
Next key: 8 (001000).
It doesn't fit anymore.
Thus, the capacity is doubled by duplicating the container board (still no extra containers!).
Extendible Hashing, b=2
Blue numbers in the figure are implicit, given by the addresses of the container board.
Now the next key, 8 (001000), fits through partitioning of the board.
Extendible Hashing, b=2
Next key: 25 (011001).
It doesn't fit into the first container, and no other address is available (for partitioning the container) → the container board has to be doubled.
Extendible Hashing, b=2
Again: doubling the container board creates no extra container.
Next key is (still) 25 (011001).
Extendible Hashing, b=2
Additional container
Extendible Hashing, b=2
Next key: 21 (010101) — no problems.
Extendible Hashing, b=2
Next key: 15 (001111).
Simple doubling of the container board.
Extendible Hashing, b=2
Next key: 15 (001111).
Still doesn't fit → doubling again.
Extendible Hashing, b=2
Next key: 15 (001111).
Now the selectivity is sufficiently large → the container is doubled (split).
Extendible Hashing, b=2
Next keys (straightforward): 2 (000010), 18 (010010), 13 (001101), 20 (010100), 4 (000100), 27 (011011)
Extendible Hashing, b=2
Finish
Extendible Hashing
The prefix of the key code need not always be used; one can also use the postfix.
For keys that are not uniformly distributed, an internal hash function can be used to produce the bit string utilized in extendible hashing.
Summary, extendible Hashing
Key fragment with n bits → direct hashing (via the container board).
Containers have a bucket factor b > 1 (typically b > 20).
Search:
- Look up the container address in the container board
- Search inside the container (e.g., binary search)
Summary, extendible Hashing
Insert:
- Look up the container address in the container board
- Search inside the container
- If found: good, no further action
- If not found:
• if there is a free slot in the container → insert
• if there is no free slot:
  - double the container board until the key fragment is selective enough to establish more containers (note: sometimes the container board doesn't need to be doubled)
  - add new containers and, if needed, redistribute the keys from the old container among the new containers
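The search and insert procedures above can be sketched compactly. This is a minimal model, not the lecture's implementation: the bit order (leading bits of a 6-bit code index the board), the local-depth bookkeeping, and all names are our own choices, and containers here are plain in-memory lists rather than pages:

```python
# Extendible hashing sketch: container board of 2^d pointers, b = 2.

W = 6  # bits per key code, as in the example (11 -> 001011)

def prefix(k, bits):
    """Leading `bits` bits of the W-bit code of k."""
    return k >> (W - bits) if bits else 0

class Bucket:
    def __init__(self, depth):
        self.depth = depth          # local depth: leading bits shared by keys
        self.keys = []

class ExtHash:
    def __init__(self, b=2):
        self.b = b                  # bucket factor (container capacity)
        self.d = 0                  # global depth: board has 2^d entries
        self.board = [Bucket(0)]

    def search(self, k):
        return k in self.board[prefix(k, self.d)].keys

    def insert(self, k):
        while True:
            bucket = self.board[prefix(k, self.d)]
            if len(bucket.keys) < self.b:
                bucket.keys.append(k)
                return
            if bucket.depth == self.d:      # board must be doubled first
                self.d += 1
                self.board = [x for x in self.board for _ in (0, 1)]
            self._split(bucket)             # then split the full container

    def _split(self, bucket):
        bucket.depth += 1
        new = Bucket(bucket.depth)
        # keys whose next code bit is 1 move into the new container
        new.keys = [k for k in bucket.keys if prefix(k, bucket.depth) & 1]
        bucket.keys = [k for k in bucket.keys
                       if not prefix(k, bucket.depth) & 1]
        # repoint board entries whose distinguishing bit is 1
        for i in range(len(self.board)):
            if self.board[i] is bucket and (i >> (self.d - bucket.depth)) & 1:
                self.board[i] = new

table = ExtHash(b=2)
for key in (11, 32, 8, 25, 21, 15, 2, 18, 13, 20, 4, 27):
    table.insert(key)
```

Running the slide's insertion sequence ends with a global depth of 4, i.e., a board of 16 pointers — including the two consecutive doublings triggered by key 15. A search is always one board lookup plus one container access, the promised 2 operations.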
Summary, extendible Hashing
Delete:
- Look up the container address in the container board
- Search inside the container
- If found → delete
- If the container becomes empty → delete the container and set the pointer in the container board to the neighbor container
Extendible Hashing
In principle, this is very similar to direct hashing using the first bits of the key (h(k) = k / 2^x).
BUT: with direct hashing, doubling the table on an overflow is much more expensive. With extendible hashing, each pointer only has to be duplicated onto two successive addresses; with direct hashing, each address has to be split.
Example
[Figure: extendible hashing vs. direct hashing. There is no container board in direct hashing; it is added here only for the sake of understanding.]
Analysis, extendible Hashing
Search has constant cost: two I/O operations.
Delete is combined, if needed, with the deletion of a container, but still has constant cost.
Insert "usually" needs at most 5 operations (search, write to the container, if needed write to other containers, write to the container board).
BUT in addition: if needed, reorganization of the container board (duplicating all pointers).
Analysis, extendible Hashing
Doubling of the container board happens mainly in main memory → low cost in comparison to the I/O operations.
A very successful and widely used method
Excursus: Pseudo-random numbers
A topic closely related to hashing.
Why "pseudo"-random numbers?
- The computer is a "good" computational servant: algorithms are always executed reliably, in the same way
- Consequence: generating random numbers is not a strength of computers!
Applications:
- Games
- Simulation
- Generating keys for cryptography
- But especially also numerical solutions of problems
Example of an application
Computation of Pi
The area of the unit circle is Pi.
Compute the area of a quarter of the circle (Pi/4) numerically and then multiply by 4 → Pi.
[Figure: a quarter of the unit circle inscribed in the unit square (side length 1).]
Compute Pi
Counting: 36 × 36 = 1296 small boxes. Or roll the dice!
[Figure: the 36 × 36 grid over the quarter circle; each small box is addressed by a pair of dice values (11 ... 66).]
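The dice-rolling idea, done as a Monte Carlo sketch: draw random points in the unit square and count those falling inside the quarter circle. The fixed seed is only an illustrative choice to make the run reproducible:

```python
import random

# Monte Carlo estimate of Pi via the quarter-circle area.
random.seed(1)

N = 100_000
inside = sum(1 for _ in range(N)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)
pi_estimate = 4 * inside / N
print(pi_estimate)   # typically close to 3.14
```

The accuracy grows only with the square root of N, but the method generalizes directly to the high-dimensional integrals mentioned next.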
Compute Pi
Particularly for high-dimensional computations (e.g., physical systems with many degrees of freedom, physics simulations, crash tests, ...), it isn't possible to go through all parameter combinations systematically.
The use of (good) multi-dimensional random numbers can lead to better results while using fewer values.
Pseudo-random numbers
For this type of application, pseudo-random numbers are even better than "real" random numbers.
How does a normal pseudo-random generator work?
- It needs an initialization (seed) z0
- A random function computes the next random number from the last one: zn = Z(zn-1)
The requirements are similar to those for hash/collision-resolution functions:
- Uniform distribution of the random numbers
- All random numbers (from a specific interval) should eventually appear once in the sequence
Example: Mid-square-generator
Was implemented, e.g., in the Apple II.
zn = middle_digits(zn-1²)
Example: z0 = 42
42 x 42 = 1764; 76 x 76 = 5776 etc.
Sequence: 42 – 76 – 77 – 92 – 46 – 11 – 12 – 14 – 19 – 36 – 29 – 84 – 5 – 2 – 0 – 0 – 0 - …
Many sequences either end in "0" or cycle forever (24 - 57 - 24 - 57 - ...)
→ a very bad generator.
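The mid-square generator is short enough to try out directly; the formula below extracts the middle two digits of the (zero-padded, four-digit) square:

```python
# Mid-square generator: keep the middle two digits of the previous square.

def mid_square(z):
    return (z * z // 10) % 100   # middle two digits of z^2 as 4 digits

seq, z = [], 42
for _ in range(16):
    z = mid_square(z)
    seq.append(z)
# seq: 76, 77, 92, 46, 11, 12, 14, 19, 36, 29, 84, 5, 2, 0, 0, 0
```

Starting from 42, the sequence reaches 0 after 14 steps and is stuck there — a concrete demonstration of why this generator is so bad.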
Linear congruence-generator
Better: the linear congruence generator.
It should look familiar by now (cf. our hash functions).
zn = (zn-1 * a + b) mod m
Example: zn = (zn-1 · 21 + 17) mod 40 generates an optimal (full-period) sequence:
1 - 38 - 15 - 12 - 29 - 26 - 3 - 0 - 17 - 14 - 31 - 28 - 5 - 2 - 19 - 16 - 33 - 30 - 7 - 4 - 21 - 18 - 35 - 32 - 9 - 6 - 23 - 20 - 37 - 34 - 11 - 8 - 25 - 22 - 39 - 36 - 13 - 10 - 27 - 24 - 1 - ...
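The full-period claim is easy to verify: the generator visits every value 0..39 exactly once before returning to the seed:

```python
# Linear congruence generator from the slide: full period modulo 40.

def lcg(z, a=21, b=17, m=40):
    return (z * a + b) % m

seq, z = [], 1
for _ in range(40):
    seq.append(z)
    z = lcg(z)

assert sorted(seq) == list(range(40))   # every value appears exactly once
assert z == 1                           # back at the seed: full period
```
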
Linear congruence-generator
zn = (zn-1 * a + b) mod m
The parameters a, b, m determine the quality.
As in hashing, it is reasonably easy to state minimal requirements for good quality, e.g., a and m coprime.
But: a uniform distribution in multiple dimensions is hard to achieve.
Example sequence: 2, 7, 4, 9, 6, 1, 8, 3, 0, 5, ...
One dimension: uniformly distributed.
Two dimensions: the pairs (2, 7), (4, 9), (6, 1), (8, 3), (0, 5) are located on two "lines" — not uniformly distributed.
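The structure in the example sequence can be made explicit: every consecutive pair satisfies y = (x + 5) mod 10, so in the plane the points fall onto just two parallel line segments:

```python
# Pairing up the example sequence exposes its lattice structure.

seq = [2, 7, 4, 9, 6, 1, 8, 3, 0, 5]
pairs = list(zip(seq[0::2], seq[1::2]))
print(pairs)   # [(2, 7), (4, 9), (6, 1), (8, 3), (0, 5)]
assert all(y == (x + 5) % 10 for x, y in pairs)
```
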
Linear congruence-generator
There is a separate research area in computer science and mathematics focused on finding good pseudo-random generators.
For numerical applications, pseudo-random numbers are often better than real random numbers.
For cryptography this no longer applies — there are plug-in cards that generate real random numbers based on quantum physics ...
Thoughts: Hash / Random
Often, the computer apparently produces chaos.
In reality, the computer cannot do this: if you look closely, it is always just another form of order.
The "chaotic" arrangement of data in hash tables and in pseudo-random generators are good examples of this.