
Copyright © 2002-2010 Curt Hill

Hashing
Key Transformation


What is a hash?
• Hashing is another name for key transformation
• The original key is usually a character string or other sparse key
• The result is usually a dense integer key

Example
• Suppose we have a three-digit integer key
– Not every key is used
• Fewer (by definition) than 1000 items
• What would be a good structure for storing and searching these items?
• Clearly an array
– Dimension should be 0..999
– Mark empty slots in some way


Complication
• Now suppose we still have fewer than 1000 keys, but the key is a name
• Suppose the key is a 10-character name
• Then there are 26^10 ≈ 1.4 x 10^14 (about 141 trillion) possibilities
• A little bit large for memory
• Sparse coverage
– Fewer than 1000 are used, a ratio of about 1 : 1.4 x 10^11


What are the alternatives?
• Must we resort to a tree or linked list?
• That is a dynamic data structure
• Or perhaps an array that is sorted on key name
• Or may we somehow transform that name into an integer key?
• Key transformation aka hashing aka scatter-storage does just that
• That is, into an integer in the range 0..999


Hashing Components
• An array (or vector) that holds the data
• A hash function that transforms the key into an integer in the correct range
• A set of functions that add to, remove from, and search the array using the hash function
• A collision strategy


One Example Function
• If the key is numeric, such as a product number, use just the bottom 3 digits
• If the bottom portion has a digit that does not span all the possibilities there may be a problem
– Such as 0 meaning original, 1 first replacement
– However any three digits may be used


Another Example Function
• Multiply the ordinal values of the key
• Divide by 1000, keeping the remainder
• Should work on any character key
• See next screen for code


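Example
• Assume the following
  char key[10];
• Use this code
  unsigned index = 1;
  for (int i = 0; i < 10; i++)
    index *= (unsigned char)key[i];   // product of the ordinal values; unsigned wraps on overflow
  index = index % 1000;               // reduce the index, not the key, to 0..999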



Another hash function

• Input is a character string
• Output is an integer in range 0 .. N-1
• Sum the ordinal value of each character of the string
• Divide the result by N and take the remainder
– This guarantees the right range
• Number theory tells us to make N prime


Example values with N = 256

• “Abcdef” returns 53
• “Hi there” returns 233
• “ABCDEF” returns 149
• “FEDCBA” returns 149
• “A character string” returns 197
• “A big character string” returns 23
• “Zoology” returns 243
• See the pattern yet?
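
The listed values can be reproduced with a short sketch of the additive hash described on the previous slide. The function name additive_hash is illustrative, not from the slides, and ASCII character codes are assumed.

```cpp
#include <cassert>
#include <string>

// Sum the ordinal values of the characters, then keep the remainder mod n.
// additive_hash is a hypothetical name chosen for this sketch.
unsigned additive_hash(const std::string& key, unsigned n) {
    assert(n > 0);                  // a zero divisor is meaningless
    unsigned sum = 0;
    for (unsigned char c : key)     // unsigned char avoids negative ordinals
        sum += c;
    return sum % n;                 // result is always in 0 .. n-1
}
```

Calling additive_hash("ABCDEF", 256) and additive_hash("FEDCBA", 256) gives the same index, because a sum does not depend on the order of the characters.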

Alternatively
• Maul the string into integers
• Operate on them
• Divide and keep remainder
  union { char key[12]; int ints[3]; } u;
  …
  strcpy(u.key, key);   // key must fit in 12 bytes, null included
  ndx = u.ints[0] * u.ints[1] / u.ints[2] % 1000;   // undefined if u.ints[2] is 0


Commentary
• With strings, beware of things past the null character
• For reasons observable from number theory the best strategy is to mod by a prime number
• Empty space is usually left in the table to ease the computation of the hash function


Variations
• Adding or multiplying the ordinal value of characters will make transpositions give the same result
– ABC and BAC will map to same integer
– This is known as a collision
• One approach is to weight each ordinal value by its position before summing:
  unsigned index = 0;
  for (int i = 0; i < keylen; i++)
    index += key[i] * (i + 1);   // 1-based position; a product of weighted terms would still ignore order
  index = index % 1000;
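
A minimal sketch of a position-sensitive hash follows. It assumes a weighted sum, where each character code is multiplied by its 1-based position, since a pure product of weighted terms would still be unaffected by reordering. The name weighted_hash is illustrative.

```cpp
#include <cassert>
#include <string>

// Weight each character by its 1-based position, sum, then reduce mod n.
// weighted_hash is a hypothetical name chosen for this sketch.
unsigned weighted_hash(const std::string& key, unsigned n) {
    assert(n > 0);
    unsigned sum = 0;
    for (std::size_t i = 0; i < key.size(); ++i)
        sum += static_cast<unsigned char>(key[i]) * static_cast<unsigned>(i + 1);
    return sum % n;
}
```

With n = 1000, "ABC" gives 65*1 + 66*2 + 67*3 = 398 while "BAC" gives 66*1 + 65*2 + 67*3 = 397, so this particular transposition no longer collides.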


Distributions
• The distribution of the keys will appear to be random compared to original key value
• What we would like is the keys to be uniformly scattered throughout the index space
• Often the hash function is tailored for the data at hand
– With much experimentation to get a good distribution


Illustration
[Figure: the key Aaron passes through the hash function and yields index 2. Table entries, shown as character key (hash value): Otter (0), Aaron (2), Smith (5), Butler (6), Lawson (8)]

What problems exist?
• Data skew
• Collisions: two keys map into one array index
• A collision strategy is how to handle this
• A hash function that does not map two keys into one index is called a perfect hash function
• Depending on the collision strategy deletions may be problematic



Data Skew
• A good hash function spreads the keys uniformly among the integer range
• How well does this work when the data is not uniformly distributed?
• For example consider
– Names - there are many more Smiths than Garnjobsts
– Numbers - there are many more 101 courses than most other numbers
• Can the hash function still give a good distribution?

Observations
• If the hash function is good, then the larger (more sparse) the table, the better the likelihood of avoiding a collision
• We must be able to tell the difference between a used and an unused slot
• Since a hash table will necessarily have empty space in it, it is almost always undesirable to store the whole record there


What are we storing?
• What we do is make it an array of pointers, where the pointers point to the actual records
– Thus each array item wastes only a pointer's worth of bytes (4 on Win32), and all the items are in heap storage
• The initialization then becomes setting all slots to NULL
– An unused slot is NULL
– Alternative is a boolean in the table specifying used

Collision Strategies
• Only limited by your creativity
• Here are four that are commonly used:
• Linear probing
• Quadratic probing
• Rehashing
• Overflow areas


Linear probing
• Add one (modulo size) to index until empty slot is found
• Tends to cluster (problem with most strategies)
• These become long strings of adjacent indexes that are filled up
• These need to be searched sequentially until empty cell is found
• This sequential search ruins the inherent quickness of hashing if groups get too long


Clustering
• A long series of filled slots
• Clustering is often a sign of a poor hash function or a table too close to full
• When the overflow of one key overlaps another the problem becomes compounded


Linear Probing
[Figure: table entries, shown as character key (hash value): Otter (0), Aaron (2), Smith (5), Butler (6), Lawson (8); Matthew (2) and Taylor (3) are then added]
Taylor could have gone in slot 3 but Matthew was already there.
Once this exists any key between 2 and 6 will end in slot 7.
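
The situation in the figure can be replayed with a small sketch of linear-probing insertion. The hash values are passed in directly, as given on the slide. The table size of 10, the use of an empty string to mark an unused slot, and the names ProbeTable and slot_after_figure are assumptions of this sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Table of names; an empty string marks an unused slot (assumed convention).
struct ProbeTable {
    std::vector<std::string> slots;
    explicit ProbeTable(std::size_t size) : slots(size) { assert(size > 0); }

    // Start at the home index and step forward by one (modulo the table size)
    // until a free slot is found. Returns the slot used, or -1 if the table is full.
    int insert(const std::string& key, std::size_t home) {
        for (std::size_t step = 0; step < slots.size(); ++step) {
            std::size_t i = (home + step) % slots.size();
            if (slots[i].empty()) { slots[i] = key; return static_cast<int>(i); }
        }
        return -1;
    }
};

// Rebuild the figure, then report where a new key with the given home lands.
int slot_after_figure(std::size_t home) {
    ProbeTable t(10);            // size 10 is assumed; the slide does not state it
    t.insert("Otter", 0);  t.insert("Aaron", 2);  t.insert("Smith", 5);
    t.insert("Butler", 6); t.insert("Lawson", 8);
    t.insert("Matthew", 2);      // collides with Aaron, lands in slot 3
    t.insert("Taylor", 3);       // slot 3 is taken, lands in slot 4
    return t.insert("NewKey", home);
}
```

Under these assumptions slots 2 through 6 form one cluster, so slot_after_figure returns 7 for every home index from 2 to 6.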

Quadratic probing
• Instead of adding 1 to the index, add the square of the probe number (1, 4, 9, ...)
– Modulo size of table
• Keeps x and x+1 from mapping to same area
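
As a rough illustration, the sketch below lists the slots visited by quadratic probing. It assumes the k-th probe looks at (home + k*k) modulo the table size, which is one common reading of "add the square". The name quadratic_probes is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the first `count` slots visited for a key whose home index is `home`.
// quadratic_probes is a hypothetical name chosen for this sketch.
std::vector<std::size_t> quadratic_probes(std::size_t home, std::size_t size,
                                          std::size_t count) {
    assert(size > 0);
    std::vector<std::size_t> seq;
    for (std::size_t k = 0; k < count; ++k)
        seq.push_back((home + k * k) % size);   // offsets 0, 1, 4, 9, 16, ...
    return seq;
}
```

For home 2 in a table of 11 slots the first five probes are 2, 3, 6, 0, 7, so a colliding key quickly leaves the neighbourhood of its home slot. Unlike linear probing, this sequence is not guaranteed to visit every slot.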


Rehashing
• When a collision occurs with the first hash function use a different one
– AKA Secondary hashing
• This doubles the difficulty since we now have to come up with two hash functions which do not conflict with each other
• If the first hash function is good, there will be comparatively few collisions


Overflow areas
• AKA Chaining
• When a collision is detected on an add, remove both items from the table
• Mark the slot with a special entry that redirects to another data structure
– List, tree or another array or vector


Chain to Overflow
[Figure: hash table with entries Otter (0), Smith (5), Butler (6), Lawson (8) and slots marked OVER; the overflow area holds Taylor (3), Aaron (2), Matthew (2) and Jones (5)]

Chain to List
[Figure: hash table with entries Otter (0), Smith (5), Butler (6), Lawson (8) and slots marked OVER; the overflow lists hold Taylor (3), Aaron (2), Matthew (2) and Jones (5)]

More on chaining
• Keeps the collisions from further degrading the hash table
• If the hash table is large then we get a good split so the list is short
• If duplicate keys are allowed, also good
• There is no systematic order of the items in any hash table
– Thus sorting sometimes is needed after the fact
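
A minimal sketch of chaining follows. It uses a simpler variant than the slides describe: every slot holds a list from the start, rather than switching a slot to a special OVER entry after the first collision. The class name ChainedTable and the additive hash inside it are assumptions of this sketch.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <vector>

// Each slot is a list of the keys that hash to it (separate chaining).
class ChainedTable {
public:
    explicit ChainedTable(std::size_t size) : buckets_(size) { assert(size > 0); }

    void add(const std::string& key) { buckets_[index(key)].push_back(key); }

    bool contains(const std::string& key) const {
        for (const std::string& k : buckets_[index(key)])
            if (k == key) return true;
        return false;
    }

    // Length of the chain that a search for `key` would have to walk.
    std::size_t chain_length(const std::string& key) const {
        return buckets_[index(key)].size();
    }

private:
    // Additive hash, used here only to keep the sketch short.
    std::size_t index(const std::string& key) const {
        std::size_t sum = 0;
        for (unsigned char c : key) sum += c;
        return sum % buckets_.size();
    }
    std::vector<std::list<std::string>> buckets_;
};
```

Adding "ABC" and "BAC" puts both keys in the same chain, and neither insertion disturbs any other slot, which is the sense in which chaining keeps collisions from degrading the rest of the table.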

Deletion
• There are challenges to deleting in a hash table
– Making a slot empty may prevent finding an item that actually exists
• Depends on collision strategy
• Linear probing
– Rehash everything in the run of filled slots around the deletion, from the nearest empty slot before it to the nearest empty slot after it
• Similar things may be needed with others
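
The sketch below shows one way deletion under linear probing might look. It reinserts only the keys that follow the deleted slot in its cluster, since keys before it are unaffected; this is a narrower variant of the rehash described above. Integer keys, -1 as the empty marker, key % size as the hash, and the name IntProbeTable are all assumptions made for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Open-addressing table of non-negative int keys; -1 marks an empty slot.
class IntProbeTable {
public:
    explicit IntProbeTable(std::size_t size) : slots_(size, -1) { assert(size > 0); }

    bool insert(int key) {
        for (std::size_t step = 0; step < slots_.size(); ++step) {
            std::size_t i = (home(key) + step) % slots_.size();
            if (slots_[i] == -1) { slots_[i] = key; return true; }
        }
        return false;                            // table is full
    }

    bool contains(int key) const { return find(key) != slots_.size(); }

    // Empty the slot, then reinsert every key that follows it in the same
    // cluster (up to the next empty slot) so none of them becomes unreachable.
    bool remove(int key) {
        std::size_t i = find(key);
        if (i == slots_.size()) return false;
        slots_[i] = -1;
        for (std::size_t j = (i + 1) % slots_.size(); slots_[j] != -1;
             j = (j + 1) % slots_.size()) {
            int moved = slots_[j];
            slots_[j] = -1;
            insert(moved);
        }
        return true;
    }

private:
    std::size_t home(int key) const {
        return static_cast<std::size_t>(key) % slots_.size();
    }
    // Probe from the home slot; returns slots_.size() when the key is absent.
    std::size_t find(int key) const {
        for (std::size_t step = 0; step < slots_.size(); ++step) {
            std::size_t i = (home(key) + step) % slots_.size();
            if (slots_[i] == -1) break;          // an empty slot ends the search
            if (slots_[i] == key) return i;
        }
        return slots_.size();
    }
    std::vector<int> slots_;
};
```

In a table of 10 slots, inserting 2, 12, 22 and 3 fills slots 2 through 5. Simply blanking the slot that holds 12 would make 22 unreachable; after remove(12), both 22 and 3 have moved up one slot and can still be found.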


Other Considerations
• Size of table is fixed
• We must have a good prediction of numbers or waste much space
• Otherwise have catastrophic table overflow
• Even if we have a good prediction it should be larger than required


Analysis
• Disregarding collisions, hashing is clearly O(1)
• How likely are collisions?
• Birthday collisions example
• Statistics from the analysis in Algorithms + Data Structures = Programs follow


The Birthday Paradox
• Most people think that you need a big room with many people before two will have the same birthday
• If 23 people are in the same room there is a better than 50% chance that two will have the same birthday
• Why is this number so low?
• The probability is the sum of earlier probabilities

Consider
• If there is one person the probability is zero
• With one additional person it is 1 in 365
• The third person has 2 chances in 365 of matching (if the first two differ), plus the chance of the first 2 having the same birthday
• The twenty-third person has 22 chances in 365, in addition to the previous probability of two of the twenty-two having the same birthday
• This is about 0.507
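
The figure can be checked with a short calculation. The sketch below computes the chance that everyone has a distinct birthday and subtracts it from one. It assumes 365 equally likely birthdays and ignores leap years; the function name is illustrative.

```cpp
#include <cassert>

// Probability that at least two of `people` share a birthday, assuming
// `days` equally likely birthdays (365 by default, leap years ignored).
double shared_birthday_probability(int people, int days = 365) {
    assert(people >= 0 && days > 0);
    if (people > days) return 1.0;           // pigeonhole: a match is certain
    double all_distinct = 1.0;
    for (int k = 0; k < people; ++k)         // person k must avoid k earlier birthdays
        all_distinct *= static_cast<double>(days - k) / days;
    return 1.0 - all_distinct;
}
```

With 22 people the result is still below one half; with 23 it is just above. The same calculation applies to keys landing in table slots, with `days` standing in for the table size.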

What do we learn from this?
• The likelihood of a collision is much greater than we intuitively believe
• We will always have collisions unless we go to the large amount of work of finding the perfect hash
• Let's consider some empirical data


Using Optimal Rehashing
– Load factor = number of keys / number of slots

Load Factor   Probes
0.1           1.05
0.25          1.15
0.5           1.39
0.75          1.85
0.9           2.56
0.95          3.15
0.99          4.66

Using Linear Probing

Load Factor   Probes
0.1           1.06
0.25          1.17
0.5           1.50
0.75          2.50
0.9           5.50
0.95          10.50
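
The two tables appear consistent with the closed forms from the standard textbook analysis of open addressing, for a successful search at load factor a: -ln(1 - a) / a for uniform rehashing and (1 - a/2) / (1 - a) for linear probing. That the slide values were produced from exactly these formulas is an assumption of this sketch, and the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Expected probes for a successful search at load factor a, 0 < a < 1.
// Uniform ("optimal") rehashing: -ln(1 - a) / a
double probes_optimal(double a) {
    assert(a > 0.0 && a < 1.0);
    return -std::log(1.0 - a) / a;
}

// Linear probing: (1 - a/2) / (1 - a)
double probes_linear(double a) {
    assert(a > 0.0 && a < 1.0);
    return (1.0 - a / 2.0) / (1.0 - a);
}
```

At a load factor of 0.9 these give roughly 2.56 and 5.5 probes, which is why the advice later in the slides is to avoid linear probing when the table is close to full.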

Types of Hash Techniques
• Mapping
• Folding
• Shifting
• Pseudo Random Numbers
• Casts


Mapping
• Convert items (usually characters or integers) into values
• The character value itself will often not be acceptable
• Preferred method: create a vector that is subscripted by the character values
– Each character value returns another value
– By modifying these we can change the hash function
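
A minimal sketch of mapping follows. It builds a 256-entry table indexed by character value and sums the mapped values instead of the raw character codes. The particular map used here (letters to 1..26 regardless of case, everything else to 0) is an arbitrary placeholder, ASCII is assumed, and the names are illustrative.

```cpp
#include <array>
#include <cassert>
#include <string>

// Build a map indexed by character value. Changing the stored values
// changes the hash function without touching the hashing code.
std::array<unsigned, 256> make_letter_map() {
    std::array<unsigned, 256> map{};             // all entries start at zero
    for (int c = 'a'; c <= 'z'; ++c) {
        map[c] = c - 'a' + 1;                    // 'a' -> 1 ... 'z' -> 26
        map[c - 'a' + 'A'] = c - 'a' + 1;        // upper case maps the same way
    }
    return map;
}

unsigned mapped_hash(const std::string& key,
                     const std::array<unsigned, 256>& map, unsigned n) {
    assert(n > 0);
    unsigned sum = 0;
    for (unsigned char c : key)
        sum += map[c];                           // mapped value, not the raw code
    return sum % n;
}
```

With this map "abc" and "ABC" both hash to 6 (for n larger than 6), and punctuation contributes nothing.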

Folding
• Treat pieces of the key as if they were integers and compute the key
• Example:
– Suppose an 8 character key that is always there
– An int is 4 characters long
– Form a union that has the 8 characters for one value and an array of 2 ints or 4 short ints as the other:


Folding Example
• Consider:
  union { char c[8]; int i[2]; short s[4]; } smash;
• Move the 8 characters into the c part and then read back the 2 ints or 4 short ints
• Do some computation on the ints or short ints
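
The folding idea can be sketched as below. The sketch swaps the union for std::memcpy, because reading a union member other than the one last written is undefined behaviour in C++ (C permits it). Adding the two words and reducing modulo n is just one possible computation, and the name fold_hash is illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Fold an 8-character key into two 32-bit words and combine them.
// The caller must supply at least 8 readable bytes.
unsigned fold_hash(const char* key, unsigned n) {
    assert(n > 0);
    std::uint32_t words[2];
    std::memcpy(words, key, sizeof words);                   // copy exactly 8 bytes
    std::uint64_t sum = std::uint64_t{words[0]} + words[1];  // 64-bit sum cannot overflow
    return static_cast<unsigned>(sum % n);
}
```

The numeric result depends on the byte order of the machine, so the same key can hash differently on different architectures. That is harmless for an in-memory table but matters if hash values are stored or shared.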


Shifting
• Using the shift instructions, shift right the values to make the high order bits more prominent
• For instance 'a'..'z' are from 97 .. 122
• Shifting can bring this range down some


Pseudo Random Numbers
• Computer random number generators are not random
– Rather they produce a sequence of numbers based on an algorithm, usually one that relies on overflow
• The seed of the sequence determines where it starts
• Use your value as the seed and then call the random number generator
• This mechanism will often generate integers in a particular range as well
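
One way to sketch this in C++ is with std::minstd_rand from the standard library. The choice of generator, the reduction by remainder, and the name prng_hash are assumptions of this sketch; any deterministic generator could stand in.

```cpp
#include <cassert>
#include <random>

// Use the (already numeric) key as the seed and take the first value
// the generator produces, reduced to the table range.
unsigned prng_hash(unsigned key, unsigned n) {
    assert(n > 0);
    std::minstd_rand gen(key);                   // the key seeds the generator
    return static_cast<unsigned>(gen() % n);     // same seed, same first value
}
```

Because the generator is deterministic, the same key always produces the same index, which is the property a hash function needs. A character key would first have to be reduced to a number, for example by folding.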


Casts
• You may use a cast to turn anything into a character
• Often the same as folding
• Beware of short character strings!
– Suppose you have "hi" in char x[8]
– Do not use anything past x[2] in casts or folding
– The debris past the null character may change from time to time
– Using it makes your hash function unreliable

Minimal and Perfect Hashes

• A perfect hash function has no collisions
– Life is simpler when there is no need for a collision strategy
• A hash function that computes indexes where the number of keys is the same as the number of entries is called minimal
– No space is wasted in the hash table


Perfect Hash Requirements

• Know the keys in advance
• Key is of a constant size and makeup
• The table size must be greater than or equal to the number of keys
• The closer the table size is to the number of keys, the harder the function is to find
– Although not necessarily to compute


Constructing a Hash
• Choose the hash method
• Bring down to range
• Choose a collision strategy


Choose the hash method
• Recall the methods:
– Mapping, folding, shifting, random numbers, casting
– Any combination or something you make up
• Decision is strongly influenced by the form and type of the key


Range
• Second step - bring down to the right range
• If the computation was a fold and the values were originally characters then the resulting range is limited mostly by the variable type
• Could be positive or possibly negative
• Usually we do integer division and keep the remainder
– The divisor is best if prime


Range
• We often choose the largest prime number smaller than the table
• Shifts
– Table size is a power of two:
– res = num >> 20;
– Leaves twelve bits (of a 32-bit unsigned value), which is 0 - 4095
• Bitwise logical operations
– And can mask out high order bits
– Table size will be a power of two
– res = num & 0x000000ff;
– Forces range 0 - 255
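
The three range reductions can be put side by side in a short sketch. It assumes the raw hash is a 32-bit unsigned value, which is what makes a shift by 20 leave twelve bits. The function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Remainder: works for any table size; a prime divisor is preferred.
std::uint32_t by_remainder(std::uint32_t num, std::uint32_t prime) {
    assert(prime > 0);
    return num % prime;                 // 0 .. prime-1
}

// Shift: keeps the top 12 of 32 bits, so the table has 4096 slots.
std::uint32_t by_shift(std::uint32_t num) {
    return num >> 20;                   // 0 .. 4095
}

// Mask: keeps the low 8 bits, so the table has 256 slots.
std::uint32_t by_mask(std::uint32_t num) {
    return num & 0x000000ffu;           // 0 .. 255
}
```

The shift keeps the high order bits and the mask keeps the low order bits, so they can behave very differently on the same raw values. With a signed type the right shift of a negative number may not clear the upper bits, which is one reason to keep the raw hash unsigned.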


Collision Strategy
• The simplest and worst is linear probing
– Do not use this if the hash table will be close to full
– May be acceptable if table is less than 75% full
• Chaining is the best when the number of entries is least well known
– Also can degenerate to O(N) easily


In search of the minimal perfect hash

• Holy Grail for searches
– What is better than O(1) search?
• Actually we will be happy if we find any good hash
• This task is never really easy
• Always requires programmer intervention
• A perfect or minimal perfect hash requires some luck to find, besides the requirements mentioned previously


Searching for Perfect Hash
• Generate a hash function that uses one or more characters from the items to be hashed as well as perhaps the length of the item
• Have it use a lookup table so that we can assign values to them
• Letters not in any of the used characters are usually zeroed
• Using some systematic procedure set the map values


Searching continued
• Attempt to hash all the items
– Keep track of all the character maps
• At the first collision
– If there is a map that was only used by one or both of the collidees, alter it to prevent the collision and continue from there
– Otherwise alter one of the maps present in the two items and go back to attempt the hash again
• If every collision is resolved, the hash is successful


Case study
• I have such a searcher: perfhash
• Finds a perfect hash
• Uses several algorithms
• Written in C++


PerfHash algorithms
• 0 – Multiply characters and divide by length
• 1 – Add characters times square of position then multiply by length
• 2 – Fold using short ints, take product
• 3 – Fold using ints, alternately add then subtract
• 4 – Fold then use as seed for random number generator


Results

          Keys     0     1     2     3     4
C           32   151   293   139   337   227
C++         35   281   307   389   487   577
Pascal      40   251   251   271   151   277
Modula2     48   301   367   317   397   193
Java        51   577   431   337   331   463


Finally
• When hashing works well, it is hard to beat
– Stable index
– Good distribution from the hash function
• It is much harder to make dynamic than trees
• Fiddling with the hash function can make it better or worse in unpredictable ways
