ECE750-TXB Lecture 8: Treaps, Tries, and Hash Tables
Todd L. Veldhuizen
Electrical & Computer Engineering, University of Waterloo, Canada
February 1, 2007

Outline:
Review: Treaps
Tries
Hash Tables
Bibliography
Review: Treaps
- Recall that a binary search tree has keys drawn from a totally ordered structure ⟨K, ≤⟩.
- An inorder traversal of the tree recovers the keys in ascending order.

          d
        /   \
       b     h
      / \   / \
     a   c f   i
Review: Treaps
- Recall that a heap has priorities drawn from a totally ordered structure ⟨P, ≤⟩.
- The priority of a parent is ≥ that of its children (for a max heap).
- The largest priority is at the root.

          23
        /    \
      11      14
     /  \    /  \
    7    1  6    13
Review: Treaps
- In a treap, nodes contain a pair (k, p) where k ∈ K is a key and p ∈ P is a priority.
- A treap is a mixture of a binary search tree and a heap:
  - a binary search tree with respect to keys;
  - a heap with respect to priorities.

                (d,23)
              /        \
        (b,11)          (h,14)
        /    \          /    \
    (a,7)   (c,1)    (f,6)   (i,13)
Review: Unique Representation
- If the keys and priorities are unique, then treaps have the unique representation property: given a set of (k, p) pairs, there is only one way to build the tree.
- For the heap property to be satisfied, there is only one (k, p) pair that can be the root: the one with the highest priority.
- The left subtree of the root will contain all keys < k, and the right subtree of the root will contain all keys > k.
- Of the keys < k, the one with the highest priority must occupy the left child of the root. This then splits constructing the left subtree into two subproblems.
- And so on, recursively.
Review: Unique Representation
- Example: to build a treap from {(i, 13), (c, 1), (d, 23), (b, 11), (h, 14), (a, 7), (f, 6)}, the unique choice of root is (d, 23):

                        (d, 23)
                       /       \
    {(c, 1), (b, 11), (a, 7)}   {(i, 13), (h, 14), (f, 6)}

- To build the left subtree, pick out the highest priority element: (b, 11). And so forth.

                 (d, 23)
                /       \
          (b, 11)        {(i, 13), (h, 14), (f, 6)}
          /     \
     (a, 7)    (c, 1)
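The recursive construction above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Node` and `build_treap` are made up here, and keys and priorities are assumed to be unique, as the slide requires.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    # Hypothetical node type for this sketch; tuples compare structurally.
    key: str
    priority: int
    left: Optional["Node"]
    right: Optional["Node"]


def build_treap(pairs):
    """Build the unique treap for a collection of (key, priority) pairs."""
    pairs = list(pairs)
    if not pairs:
        return None
    # Heap property: the pair with the highest priority must be the root.
    k, p = max(pairs, key=lambda kp: kp[1])
    # BST property: smaller keys go to the left subtree, larger to the right.
    left = [kp for kp in pairs if kp[0] < k]
    right = [kp for kp in pairs if kp[0] > k]
    return Node(k, p, build_treap(left), build_treap(right))


pairs = [("i", 13), ("c", 1), ("d", 23), ("b", 11),
         ("h", 14), ("a", 7), ("f", 6)]
treap = build_treap(pairs)
```

Because each step has exactly one valid choice, the order in which the pairs are listed does not affect the result.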
Review: Unique Representation
- Data structures with the unique representation property can be checked for equality in O(1) time by using caching (also known as memoization):
  - Implement the data structure in a purely functional style (a node's fields are never altered after construction; any change requires creating a new node).
  - Maintain a map from (key, priority, lchild, rchild) tuples to already constructed nodes.
  - Before constructing a node, check the cache to see if it already exists; if so, return the pointer to that node. Otherwise, construct the node and add it to the cache.
- If two treaps contain the same (key, priority) pairs, their root pointers will be equal: this can be checked in O(1) time.
- Checking and maintaining the cache requires additional time overhead.
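A minimal sketch of this caching scheme, assuming a global dictionary as the cache and using Python object identity in place of pointers. The names `make_node`, `build`, and `_cache` are hypothetical, and nodes are immutable only by convention here.

```python
_cache = {}  # (key, priority, id(left), id(right)) -> already constructed node


class Node:
    """Treap node; never modified after construction (by convention)."""
    __slots__ = ("key", "priority", "left", "right")

    def __init__(self, key, priority, left, right):
        self.key, self.priority = key, priority
        self.left, self.right = left, right


def make_node(key, priority, left=None, right=None):
    # Children are already canonical, so their identities identify them.
    signature = (key, priority, id(left), id(right))
    node = _cache.get(signature)
    if node is None:
        node = Node(key, priority, left, right)
        # The cache keeps every node alive, so ids are never reused.
        _cache[signature] = node
    return node


def build(pairs):
    """Build a treap from (key, priority) pairs through the cache."""
    pairs = list(pairs)
    if not pairs:
        return None
    k, p = max(pairs, key=lambda kp: kp[1])
    left = [kp for kp in pairs if kp[0] < k]
    right = [kp for kp in pairs if kp[0] > k]
    return make_node(k, p, build(left), build(right))


pairs = [("i", 13), ("c", 1), ("d", 23), ("b", 11),
         ("h", 14), ("a", 7), ("f", 6)]
t1 = build(pairs)
t2 = build(reversed(pairs))
same = t1 is t2  # O(1) equality check by comparing root identities
```

The cache lookup on every construction is the time overhead mentioned above; a production version would also need a policy for evicting unused nodes, which this sketch ignores.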
Review: Balance of treaps
- Treaps are balanced (in expectation) if the priorities are chosen randomly.
- Recall that building a binary search tree with a random insertion order results in a tree of expected height c log n, with c ≈ 4.311.
- A treap with random priorities assigned to keys has exactly the same structure as a binary search tree created by inserting the keys in descending order of priority.
  - Descending order of priority is a random order;
  - therefore treaps have expected height c log n with c ≈ 4.311.
Insertion into treaps
- Insertion for treaps is much simpler than that for red-black trees.
  1. Insert the (k, p) pair as for a binary search tree, by key alone: the new node will be placed somewhere at the bottom of the tree.
  2. Perform rotations along the path from the new leaf to the root to restore invariants:
     - If there is a node x whose right child has a higher priority, rotate left at x.
     - If there is a node x whose left child has a higher priority, rotate right at x.
Insertion into treaps
- Example: the treap below has just had (e, 19) inserted as a new leaf. Rotations have not yet been performed.

                (d,23)
              /        \
        (b,11)          (h,14)
        /    \          /    \
    (a,7)   (c,1)    (f,6)   (i,13)
                     /
                (e,19)

- f has a left child with greater priority: rotate right at f.
Insertion into treaps
- After rotating right at f:

                (d,23)
              /        \
        (b,11)          (h,14)
        /    \          /    \
    (a,7)   (c,1)   (e,19)   (i,13)
                        \
                       (f,6)

- h has a left child with greater priority: rotate right at h.
Insertion into treaps
- After rotating right at h:

                (d,23)
              /        \
        (b,11)          (e,19)
        /    \               \
    (a,7)   (c,1)           (h,14)
                            /    \
                        (f,6)   (i,13)

- Heap invariant is satisfied: all done.
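The insertion procedure and the worked example can be sketched as follows. This is an in-place, recursive version under the assumption that keys are unique; the names `Node`, `insert`, `rotate_left`, `rotate_right`, and `shape` are chosen for this sketch only.

```python
class Node:
    def __init__(self, key, priority, left=None, right=None):
        self.key, self.priority = key, priority
        self.left, self.right = left, right


def rotate_right(x):
    """Lift x's left child above x; returns the new subtree root."""
    y = x.left
    x.left = y.right
    y.right = x
    return y


def rotate_left(x):
    """Lift x's right child above x; returns the new subtree root."""
    y = x.right
    x.right = y.left
    y.left = x
    return y


def insert(root, key, priority):
    """Insert by key as in a BST, then rotate on the way back up."""
    if root is None:
        return Node(key, priority)
    if key < root.key:
        root.left = insert(root.left, key, priority)
        if root.left.priority > root.priority:
            root = rotate_right(root)
    elif key > root.key:
        root.right = insert(root.right, key, priority)
        if root.right.priority > root.priority:
            root = rotate_left(root)
    return root  # a duplicate key leaves the tree unchanged


def shape(node):
    """Nested-tuple view of the tree: (key, left, right)."""
    if node is None:
        return None
    return (node.key, shape(node.left), shape(node.right))


root = None
for k, p in [("i", 13), ("c", 1), ("d", 23), ("b", 11),
             ("h", 14), ("a", 7), ("f", 6)]:
    root = insert(root, k, p)
root = insert(root, "e", 19)
```

After the final call, `shape(root)` describes the last tree of the example: e sits below d, with h as its right child and f, i below h.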
- Treaps are easily made persistent (retaining previous versions) by implementing them in a purely functional style. Insertion requires duplicating at most a sequence of nodes from the root to a leaf: an O(log n) space overhead. The remaining parts of the tree are shared.
- E.g., the previous insert done in a purely functional style:

[Figure: Version 1 and Version 2 of the treap, each with its own root node (d,23). Version 2 contains the new node (e,19); subtrees untouched by the insertion, such as the one rooted at (b,11), are shared between the two versions.]
Strings
- A string is a sequence of characters drawn from some alphabet Σ. We will often use Σ = {0, 1}: binary strings.
- We write Σ* to mean all finite strings¹ composed of characters from Σ. (* is the Kleene closure.)
- Σ* contains the empty string ε.
- If w, v ∈ Σ* are strings, we write w · v or just wv to mean the concatenation of w and v.
- Example: given w = 010 and v = 11, w · v = 01011.

⟨Σ*, ·, ε⟩ is an example of a monoid: a set (Σ*) together with an associative binary operator (·) and an identity element (ε). For any strings u, v, w ∈ Σ*,

    u · (v · w) = (u · v) · w
    v · ε = ε · v = v

¹ Infinite strings are also very useful: if we write a real number x ∈ [0, 1] as a binary number, e.g. 0.101100101000···, this is a representation of x by an infinite string from Σ^ω.
Tries
- Recall that we may label the left and right links of a binary tree with 0 (for left) and 1 (for right):

          ○
       0 / \ 1
        x   ○
         0 / \ 1
          y   z

- To describe a path in the tree, one can list the sequence of left/right branches to take from the root. E.g., 10 gives y, 11 gives z.
- The set of all paths from the root to leaves is P◦ = {0, 10, 11}.
- The set of all paths from the root to leaves or internal nodes is P• = {ε, 0, 1, 10, 11}, where ε is the empty string indicating the path starting and ending at the root.
Tries
- The set P◦ is prefix-free: no string is an initial segment of any other string. Otherwise, there would be a path to a leaf passing through another leaf!
- The set P• is prefix-closed: if wv ∈ P•, then w ∈ P• also, i.e., P• contains all prefixes of all strings in P•.²

² We can define • as an operator by A• ≡ {w : wv ∈ A}. • is a closure operator. A useful fact: every closure operator has as its range a complete lattice, where meet and join are given by (X ⊓ Y)• = X• ∩ Y• and (X ⊔ Y)• = (X• ∪ Y•)•. Applying this fact to the representation of binary trees by strings, • induces a lattice of binary trees.
Tries
- Given a binary tree, we can produce a set of strings P• or P◦ that describe all paths (resp. all paths to leaves).
- The converse is also true: given a set P• or P◦, we can reproduce the tree.³
- Example: the set {100, 11, 001, 01} is prefix-free, and the corresponding tree can be built by simply adding the paths one by one to an initially empty tree (leaves are shown as ●):

                 ○
            0 /     \ 1
             ○       ○
          0 / \ 1 0 / \ 1
           ○   ●   ○   ●
            \ 1   / 0
             ●   ●

³ Formally, we can say there is a bijection (a 1-1 correspondence) between binary trees and prefix-closed (resp. prefix-free) sets.
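The correspondence between trees and path sets can be sketched in Python, assuming a tree is represented as nested dictionaries keyed by the branch label. The function names are hypothetical.

```python
def build_trie(paths):
    """Add each path, one by one, to an initially empty tree."""
    root = {}
    for path in paths:
        node = root
        for bit in path:
            node = node.setdefault(bit, {})
    return root


def all_paths(node, prefix=""):
    """Paths to every node, internal or leaf (the prefix-closed set)."""
    paths = {prefix}
    for bit, child in node.items():
        paths |= all_paths(child, prefix + bit)
    return paths


def leaf_paths(node, prefix=""):
    """Paths to leaves only (the prefix-free set)."""
    if not node:
        return {prefix}
    paths = set()
    for bit, child in node.items():
        paths |= leaf_paths(child, prefix + bit)
    return paths


tree = build_trie({"100", "11", "001", "01"})
```

Here `leaf_paths(tree)` recovers the original set, and `all_paths(tree)` adds every prefix, including the empty string for the root.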
Tries
- A tree constructed in this way, by interpreting a set of strings as paths of the tree, is called a trie. (The term comes from reTRIEval; pronounced either "tree" or "try" depending on taste. Tries were invented by de la Briandais, and independently by Fredkin [5].)
- The most common use of a trie is to implement a Dictionary⟨K, V⟩, i.e., maintaining a map f : K ⇀ V by associating each k ∈ K with a path through the trie to a node where f(k) is stored.⁴
- Tries find applications in bioinformatics, coding and compression, sorting, SAT solving, routing, natural language processing, very large databases (VLDBs), data mining, etc.
- Binary Decision Diagrams (BDDs) are essentially tries with caching and sharing of subtrees.
- Recent survey by Flajolet [4].

⁴ The notation K ⇀ V indicates a partial function from K to V: a function that might not be defined for some keys.
Trie example: word list
- Example: build a trie to store English words: trie, larch, saxophone, tried, saxifrage, squeak, try, squeaky, squeakily, squeakier.
- Common implementation variants of a trie:
  - Associate internal nodes with entries also, if one occurs there. (Can use 1 bit on internal nodes to indicate whether a key terminates there.)
  - When a node has only one descendant, end the trie there, rather than including a possibly long chain of nodes with single children.
- Use the trie to store keys only; implicitly the values we are storing are V = {0, 1}. The function the trie represents is a map χ : K → {0, 1}, where χ is the characteristic function of the set: χ(k) = 1 if and only if k is in the set.
- Use the alphabet {a, b, ..., z}.
- Instead of having a 26-way branch in each node, put a little BST at each node with up to 26 elements in it (a "ternary search trie" [1]).
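Trie example: word list

    (root)
    ├─ l: larch
    ├─ s
    │  ├─ a ─ x
    │  │      ├─ i: saxifrage
    │  │      └─ o: saxophone
    │  └─ q ─ u ─ e ─ a ─ k: squeak
    │                     ├─ y: squeaky
    │                     └─ i
    │                        ├─ e: squeakier
    │                        └─ l: squeakily
    └─ t ─ r
           ├─ i ─ e: trie
           │      └─ d: tried
           └─ y: try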
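A minimal sketch of a trie storing keys only, so that lookup acts as the characteristic function of the word set. It uses nested dictionaries and a hypothetical end-of-word marker `END` in place of the 1-bit flag; it does not implement the chain-shortening variant shown in the figure.

```python
END = "$"  # hypothetical marker: a key terminates at this node

WORDS = ["trie", "larch", "saxophone", "tried", "saxifrage", "squeak",
         "try", "squeaky", "squeakily", "squeakier"]


def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True  # internal nodes may also carry an entry


def trie_contains(root, word):
    """Characteristic function: True iff word was inserted."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


root = {}
for w in WORDS:
    trie_insert(root, w)
```

Note that `trie_contains(root, "sque")` is false even though the path exists, because no key terminates at that node.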
Trie example: coding
- Suppose we want to transmit (or compress) data.
- At the receiving (or decoding) end, we will have a long string of bits to decode.
- A simple but effective strategy is to build a codebook that maps binary codewords to plaintext. The incoming transmission is then just a sequence of codewords that we will replace, one by one, with their corresponding plaintext.⁵
- A code that can be described by a trie, with outputs only at the leaves, is an example of a uniquely decodable code: there is only one way an encoded message can be decoded. Specifically, such codes are called prefix codes or instantaneous codes.

⁵ This strategy is asymptotically optimal (achieves a bitrate ≤ H + ε for any ε > 0) for stationary ergodic random processes, with an appropriate choice of codebook.
Trie example: coding
- Example: to encode English, we might assign codewords to sequences of three letters, giving the most frequent combinations shorter codes:

    Three-letter combination   Codeword
    the                        000
    and                        001
    for                        010
    are                        011
    but                        100
    not                        1010
    you                        1011
    all                        1100
    ...                        ...
    etc                        11101101
    ...                        ...
    qxw                        1111011001101001

- These codewords are chosen to be a prefix-free set.
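Trie example: coding
- For decoding messages we build a trie:

    (root)
    ├─ 0
    │  ├─ 0
    │  │  ├─ 0: the
    │  │  └─ 1: and
    │  └─ 1
    │     ├─ 0: for
    │     └─ 1: are
    └─ 1
       ├─ 0
       │  ├─ 0: but
       │  └─ 1
       │     ├─ 0: not
       │     └─ 1: you
       └─ 1
          ├─ 0
          │  ├─ 0: all
          │  └─ 1: ...
          └─ 1: ...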
Trie example: decoding
- Incoming message: 100101001010111100
- To decode: start at the root of the trie and follow the path given by the bits. When a leaf is reached, output the word there and return to the root.

    100 → but, 1010 → not, 010 → for, 1011 → you, 1100 → all

- This requires substantially fewer bits than transmitting as ASCII text (24 bits per 3-letter sequence).
- A good code assigns short codewords to frequently occurring strings; if a string occurs with probability p_i, one wants the codeword to have length about −log₂ p_i.
- Later in the course we shall see how such codes can be constructed optimally using a greedy algorithm.
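The decoding loop can be sketched as below, using only the eight codewords listed explicitly in the table (the elided entries are left out). The trie is a nested dictionary whose leaves are the plaintext strings; `build_decoding_trie` and `decode` are names invented for this sketch.

```python
CODEBOOK = {
    "000": "the", "001": "and", "010": "for", "011": "are",
    "100": "but", "1010": "not", "1011": "you", "1100": "all",
}


def build_decoding_trie(codebook):
    """Internal nodes are dicts keyed by '0'/'1'; leaves are plaintext."""
    root = {}
    for codeword, text in codebook.items():
        node = root
        for bit in codeword[:-1]:
            node = node.setdefault(bit, {})
        node[codeword[-1]] = text
    return root


def decode(bits, root):
    """Follow bits from the root; emit at each leaf and restart."""
    output = []
    node = root
    for bit in bits:
        node = node[bit]  # KeyError if the bits match no codeword
        if isinstance(node, str):
            output.append(node)
            node = root
    if node is not root:
        raise ValueError("message ends in the middle of a codeword")
    return output


trie = build_decoding_trie(CODEBOOK)
words = decode("100101001010111100", trie)
```

For the message above this uses 18 bits for five three-letter sequences, against 120 bits at 24 bits per sequence.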
Tries: Kraft's inequality
- Kraft's inequality is a simple constraint on the lengths of codewords in a prefix code (equivalently, leaf depths in a binary tree).

Theorem (Kraft)
Let (d_1, d_2, ...) be a sequence of code lengths of a code. There is a prefix code with code lengths d_1, d_2, ... (equivalently, a binary tree with leaves at depths d_1, d_2, ...) if and only if

    $\sum_{i=1}^{n} 2^{-d_i} \le 1$    (1)
Tries: Kraft's inequality I
- Positive example: the codeword lengths 3, 3, 2, 2 satisfy Kraft's inequality: 1/8 + 1/8 + 1/4 + 1/4 = 3/4. Possible trie realization (leaves are shown as ●):

               ○
           0 /   \ 1
            ○     ○
         0 / \ 1 / 0
          ○   ● ●
       0 / \ 1
        ●   ●

- Negative example: the codeword lengths 3, 3, 3, 2, 2, 2 violate Kraft's inequality: the sum is 9/8.
- Kraft's inequality becomes an equality for trees in which every internal node has two children.
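Both examples can be checked with exact rational arithmetic. The sketch below also builds one prefix code for a valid length sequence by assigning codewords in order of increasing length; this particular construction (often called a canonical code) is an assumption of the sketch, not something taken from the slides, and it produces a different but equally valid realization from the trie drawn above.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Exact value of the sum of 2^(-d) over the code lengths."""
    return sum(Fraction(1, 2 ** d) for d in lengths)


def prefix_code(lengths):
    """Return codewords with the given lengths (each >= 1), or None."""
    if kraft_sum(lengths) > 1:
        return None  # Kraft's inequality is violated: no prefix code exists
    codewords = []
    value, prev = 0, 0
    for d in sorted(lengths):
        value <<= d - prev  # extend the running codeword to length d
        codewords.append(format(value, "b").zfill(d))
        value += 1  # next unused codeword of this length
        prev = d
    return codewords


good = prefix_code([3, 3, 2, 2])
bad = prefix_code([3, 3, 3, 2, 2, 2])
```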
Tries: Kraft's inequality
Two ways to prove Kraft's inequality:
- Put each node of a binary tree in correspondence with a subinterval of [0, 1] on the real line: the root is [0, 1], and its children get [0, 1/2] and [1/2, 1]. Each node at depth d receives an interval of length $2^{-d}$ and splits it in half for its children. The union of the intervals at the leaves is ⊆ [0, 1], and the intervals at the leaves are pairwise disjoint (apart from shared endpoints), so the sum of their interval lengths is ≤ 1.
- Kraft's inequality can also be proved with a simple induction argument. The list of valid codeword length sequences can be generated from the initial sequence ⟨1, 1⟩ (codewords {0, 1}) by the rewrite rules k → k + 1, k + 1 (expand a node into two children) and k → k + 1 (expand a node to have a single child). Base case: with ⟨1, 1⟩, obviously $2^{-1} + 2^{-1} = 1$. Induction step: if the sum is ≤ 1, consider expanding a single element of the sequence: we have either the rewrite k → k + 1, k + 1, and $2^{-k} \ge 2^{-(k+1)} + 2^{-(k+1)}$; or the rewrite k → k + 1, and $2^{-k} \ge 2^{-(k+1)}$. So rewrites never increase the "weight" of a node.
Tries: Kraft's inequality I
It is occasionally useful to have an infinite set of codewords handy, in case we do not know in advance how many different objects we might need to code. For an infinite set of codewords (or infinite binary tree), Kraft's inequality implies

    $d_k \ge c + \log^+ k + \log \log^+ \log^* k$ infinitely often    (2)

where

    $\log^+ x \equiv \log x + \log\log x + \log\log\log x + \cdots$

with the sum taken only over the positive terms, and $\log^* x$ is the "iterated logarithm":

    $\log^* x = \begin{cases} 0 & \text{if } x \le 1 \\ 1 + \log^*(\log x) & \text{otherwise} \end{cases}$
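The two auxiliary functions are easy to compute directly. The sketch below assumes base-2 logarithms, which the slide does not specify, and positive arguments; `log_star` and `log_plus` are names chosen here.

```python
import math


def log_star(x):
    """Iterated logarithm: how many times log2 is applied until x <= 1."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count


def log_plus(x):
    """log x + log log x + ..., summing only the positive terms (x > 0)."""
    total = 0.0
    term = math.log2(x)
    while term > 0:
        total += term
        term = math.log2(term)
    return total
```

For example, with x = 16 the positive terms are 4, 2, and 1, so `log_plus(16)` is 7, while `log_star(16)` is 3.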
Tries: Kraft's inequality II
See e.g. [2, 9]. Where does this bound come from? Well, a necessary condition for

    $\sum_{k=0}^{\infty} 2^{-d_k} \le 1$

to hold is that the series $\sum_{k=0}^{\infty} 2^{-d_k}$ converges. For example, if $d_k = \log k$, then $2^{-d_k} = \frac{1}{k}$, the Harmonic series. The Harmonic series diverges, so Kraft's inequality can't hold.

We can parlay this into an inequality by remembering the "comparison test" for convergence of series: if $a_k$, $b_k$ are two positive series, and $a_k \le b_k$ for all k, then $\sum a_k \le \sum b_k$. If we stick the Harmonic series in for $a_k$ and $2^{-d_k}$ for $b_k$, we get:

    If $\frac{1}{k} \le 2^{-d_k}$ for all k, then $\infty \le \sum 2^{-d_k}$.

The premise of this test must be false if $\sum 2^{-d_k}$ does not diverge to infinity. Therefore $2^{-d_k}$ must be $< \frac{1}{k}$ for at least some k. If $2^{-d_k} < \frac{1}{k}$ for only some finite number of choices of k, the series would still diverge. So, a necessary condition for $\sum 2^{-d_k}$ to converge is that $2^{-d_k} < \frac{1}{k}$ for infinitely many terms. Taking logarithms and multiplying through by −1, we get $d_k > \log k$ for infinitely many k.

We can generalize this by saying that if g ∈ ω(1) is any diverging function, then $d_k > -\log g'(k)$ for infinitely many k. (The Harmonic series bound follows from choosing g(x) = log x.) Unfortunately there is no "slowest growing function" g(x) from which we could obtain a tightest possible bound.

Eqn. (2) is from [2]; Bentley credits the result to Ronald Graham and Fan Chung, apparently unpublished.
Tries: Variations on a theme I
There are many useful variants of tries [4]:
- Multiway branching: instead of choosing Σ = {0, 1}, one can choose any finite alphabet, and allow each node to have |Σ| children.
- Paged trie: each node is required to have a minimal number of leaves descended from it; when this threshold is not met, the subtree is converted into a compact form (e.g., an array of keys and values) suitable for secondary storage. This technique can also be used to increase performance in main memory [6].
- Patricia tries [7] ("Practical Algorithm To Retrieve Information Coded in Alphanumeric"⁶): introduce skip pointers to avoid long sequences of single-branch nodes like

    ○ ─0→ ○ ─1→ ○ ─1→ ○ ─0→ ○
Tries: Variations on a theme II
- LC-trie: the first few levels of a big trie tend to be almost a complete binary tree of some depth, which can be collapsed into an array of pointers to tries [8].
- Ternary Search Tries (TSTs): a blend of a trie and a BST; can require substantially less space than a trie. For a large |Σ|, replace a |Σ|-way branch at each internal node with a BST of depth ≤ log |Σ|.

⁶ Almost better than my all-time favourite strained CS acronym, PERIDOT: "Programming by Example for Real-time Interface Design Obviating Typing." Great project, despite the acronym.
Hash Tables
- Suppose we wanted to represent the following set:

    M = {35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657}

  Given some x ∈ ℕ, we want to quickly test whether x ∈ M.
- Binary search trees: require following a path through a tree, which is perhaps not fast enough for our problem.
- Super fast way: allocate an array A of 4657 bytes. Set

    A[i] = 0 if i ∉ M
    A[i] = 1 if i ∈ M

  Then, on a RAM, we can test whether x ∈ M with a single memory access to A[x] (a constant amount of time). However, the space required by this strategy is O(sup M).
Hash Tables
- Obviously the array A would contain mostly empty space. Can we somehow "compress" the array but still support fast access?
- Yes: allocate a much smaller table B of length k. Define a function h : [1, 4657] → [1, k] that maps indices of A to indices of B, can be computed quickly, and ensures that if x, y ∈ M and x ≠ y, then h(x) ≠ h(y), i.e., no two elements of M have the same index in B.
- Then x ∈ M if and only if B[h(x)] = x.
Hash Tables
- For our example, h(x) = x mod 17 does the trick. Here is the array B:

     j  B[j]  |   j  B[j]  |   j  B[j]
     0     0  |   6     0  |  12     0
     1    35  |   7     0  |  13     0
     2     0  |   8  1691  |  14     0
     3   139  |   9  1760  |  15  3789
     4   395  |  10  1795  |  16  4657
     5     0  |  11  3632  |

- E.g., x = 1691: h(x) = 8, and B[8] = 1691, so x ∈ M.
- E.g., x = 1692: h(x) = 9, and B[9] = 1760 ≠ 1692, so x ∉ M.
- This is a hash table. h(x) = x mod 17 is called a hash function.
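The table and the membership test can be reproduced with a short sketch. As on the slide, 0 marks an empty slot, which is only safe because keys are assumed to lie in [1, 4657]; the helper name `member` is hypothetical.

```python
M = [35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657]
TABLE_SIZE = 17


def h(x):
    return x % TABLE_SIZE


# Build B; for this particular M no two keys land in the same slot.
B = [0] * TABLE_SIZE
for x in M:
    if B[h(x)] != 0:
        raise ValueError("collision: h is not injective on M")
    B[h(x)] = x


def member(x):
    """x is in M iff the slot it hashes to holds x itself."""
    return B[h(x)] == x
```

A general-purpose hash table cannot rely on such a collision-free function being available and needs a collision-resolution strategy, which this sketch omits.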
Hash Functions
- A hash function is a map h : K → H from some (usually large) key space K to some (usually small) set of hash values H. In our example, we were mapping from K = [1, 4657] to H = [0, 16] (the possible values of x mod 17).
- If the set M ⊆ K is chosen uniformly at random, keys are uniformly distributed (i.e., each k ∈ K has the same probability of appearing in a set to represent). In this case the hash function should distribute the keys evenly amongst the elements of H, i.e., we want |h⁻¹(y)| ≈ |h⁻¹(z)| for y, z ∈ H.⁷
- For a nonuniform distribution on keys, one just wants to choose h so that the distribution induced on H is close to uniform.

⁷ Recall that for a function f : R → S, f⁻¹(s) ≡ {r : f(r) = s}.
Hash Functions
- We will describe some hash functions where K = ℕ (keys are nonnegative integers). These are easily adapted to other kinds of keys (e.g., strings) by interpreting the binary representation of the key as an integer.

Some commonly used hash functions are the following:
1. Division: use h(k) = k mod m, where m = |H| is usually chosen to be a prime number far away from any power of 2. (Note.⁸)
   - For long bit strings, use Horner's rule for evaluating polynomials in ℤ/mℤ (will explain).
2. Multiplication: use h(k) = ⌊m{kφ}⌋, where 0 < φ < 1 is an irrational number and {x} ≡ x − ⌊x⌋. A popular choice of φ is φ = (√5 − 1)/2.

⁸ A particularly terrible choice would be m = 256, which would hash objects based only on their lowest 8 bits; e.g., the hash of a string would depend only on its last character.
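One way to read the Horner's rule remark is sketched below: treat a byte string as the base-256 digits of a large integer and reduce modulo m after every step, so intermediate values stay small. The function name `horner_hash` and the choice m = 1009 in the usage are assumptions made for illustration.

```python
def horner_hash(data: bytes, m: int) -> int:
    """Division hash of a byte string, evaluated by Horner's rule mod m."""
    h = 0
    for byte in data:
        # Same result as int.from_bytes(data, "big") % m, without big integers.
        h = (h * 256 + byte) % m
    return h


good = horner_hash(b"saxophone", 1009)  # prime modulus, far from a power of 2
bad1 = horner_hash(b"trie", 256)        # only the last byte survives
bad2 = horner_hash(b"larche", 256)
```

With m = 256, both `bad1` and `bad2` equal the code of the final letter "e", which is the failure described in the footnote.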
Multiplication hash functions: Example
Example of a multiplication hash function using φ = (√5 − 1)/2 and a hash table with m = 100 slots:

    key   {kφ}       ⌊m{kφ}⌋
     1    0.618034   61
     2    0.236068   23
     3    0.854102   85
     4    0.472136   47
     5    0.090170    9
     6    0.708204   70
     7    0.326238   32
     8    0.944272   94
     9    0.562306   56
    10    0.180340   18
    11    0.798374   79
    12    0.416408   41
    13    0.034442    3
    14    0.652476   65
    15    0.270510   27
    16    0.888544   88
    17    0.506578   50

The idea is that the third column (the hash slots) 'looks like' a random sequence.
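The table can be regenerated with a few lines of floating-point Python. This is a sketch only: double precision is adequate for small keys like these, but large keys would call for integer arithmetic, which is not shown. The name `mult_hash` is hypothetical.

```python
import math

PHI = (math.sqrt(5) - 1) / 2  # 0.618033..., the choice of phi from the slide


def mult_hash(k, m):
    """Multiplication hash: floor(m * frac(k * phi))."""
    frac = (k * PHI) % 1.0  # fractional part {k * phi}
    return math.floor(m * frac)


slots = [mult_hash(k, 100) for k in range(1, 18)]
```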
Multiplication hash functions
- The reason why h(k) = ⌊m{kφ}⌋ is a reasonable hash function is interesting.
- The short answer is that the sequence {kφ} for k = 1, 2, 3, ... 'kind of behaves like' a random real drawn from (0, 1). So, h(k) = ⌊m{kφ}⌋ 'looks like' a randomly chosen hash function.

A less sketchy explanation:
1. {kφ} is uniformly distributed on (0, 1): asymptotically, the proportion of {kφ} falling in an interval (α, β), where (α, β) ⊆ (0, 1), is (β − α). Just like a uniform distribution on (0, 1)!
2. {kφ} satisfies an ergodic theorem: if we sample a suitably well-behaved⁹ function f at the points {kφ} and average, this converges to the integral:

    $\frac{1}{m} \sum_{k=1}^{m} f(\{k\phi\}) \to \int_0^1 f(x)\,dx$

   Just like a uniform distribution on (0, 1)!

See [3]. Variously called Weyl's ergodic principle or Weyl's equidistribution theorem.

However, {kφ} is emphatically not a random sequence.

⁹ Continuously differentiable and periodic with period 1.
Hash Functions
- To evaluate whether a hash function is a good choice for a set of data S ⊆ K, one can see how the observed distribution of keys into hash table slots compares to a uniform distribution.
- Suppose there are n keys and m hash slots. Compute the observed distribution of the keys:

    $\hat{p}_i = \frac{|\{k \in S : h(k) = i\}|}{n}$

- To measure how far from uniform, compute

    $D(\hat{P} \| U) = \log_2 m + \sum_{i=1}^{m} \hat{p}_i \log_2 \hat{p}_i$

  Convention: $0 \log_2 0 = 0$.
- This is the Kullback-Leibler divergence of the observed distribution P̂ from the uniform distribution U. It may be thought of as the "distance" from P̂ to U.
- The smaller D(P̂‖U), the better the hash function.
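A direct translation of this measure into Python, as a sketch: `kl_from_uniform` is a hypothetical name, `h` is any function mapping keys to slot indices, and empty slots are skipped, which implements the convention that 0 log 0 = 0.

```python
import math
from collections import Counter


def kl_from_uniform(keys, h, m):
    """D(P_hat || U) in bits for keys hashed by h into m slots."""
    keys = list(keys)
    n = len(keys)
    counts = Counter(h(k) for k in keys)  # only nonempty slots appear
    divergence = math.log2(m)
    for count in counts.values():
        p = count / n
        divergence += p * math.log2(p)
    return divergence


even = kl_from_uniform(range(100), lambda k: k % 10, 10)   # perfectly even
worst = kl_from_uniform(range(100), lambda k: 0, 10)       # all in one slot
```

The two extremes bracket the possible values: an even spread gives 0, and sending every key to a single slot gives log₂ m.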
Bibliography I
[1] Jon L. Bentley and Robert Sedgewick. Fast algorithms for sorting and searching strings. In SODA '97: Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 360–369, Philadelphia, PA, USA, 1997. Society for Industrial and Applied Mathematics.
[2] Jon Louis Bentley and Andrew Chi-Chih Yao. An almost optimal algorithm for unbounded searching. Information Processing Lett., 5(3):82–87, 1976.
[3] Bernard Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, Cambridge, 2000.
Bibliography II
[4] Philippe Flajolet. The ubiquitous digital tree. In Bruno Durand and Wolfgang Thomas, editors, STACS, volume 3884 of Lecture Notes in Computer Science, pages 1–22. Springer, 2006.
[5] Edward Fredkin. Trie memory. Commun. ACM, 3(9):490–499, 1960.
[6] Steffen Heinz, Justin Zobel, and Hugh E. Williams. Burst tries: a fast, efficient data structure for string keys. ACM Trans. Inf. Syst., 20(2):192–223, 2002.
Bibliography III
[7] Donald R. Morrison. PATRICIA: practical algorithm to retrieve information coded in alphanumeric. J. ACM, 15(4):514–534, 1968.
[8] Stefan Nilsson and Gunnar Karlsson. IP-address lookup using LC-tries. IEEE Journal on Selected Areas in Communications, 17:1083–1092, June 1999.
[9] Jorma Rissanen. Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific, 1989.