Copyright © 2002-2010 Curt Hill Hashing Key Transformation

Upload: clifton-reed

Post on 19-Jan-2016


Page 1:

Hashing
Key Transformation

Page 2:

What is a hash?

• Hashing is another name for key transformation
• The original key is usually a character string or other sparse key
• The result is usually a dense integer key

Page 3:

Example

• Suppose we have a three digit integer key
– Not every key is used
• Fewer (by definition) than 1000 items
• What would be a good structure for storing and searching these items?
• Clearly an array
– Dimension should be 0..999
– Mark empty slots in some way

Page 4:

Complication

• Now suppose we still have fewer than 1000 keys, but the key is a name
• Suppose the key is a 10 character name
• Then there are 26^10 ≈ 1.4 x 10^14 (about 141 trillion) possibilities
• A little bit large for memory
• Sparse coverage
– Fewer than 1000 are used, a ratio of 1 : 1.4 x 10^11

Page 5:

What are the alternatives?

• Must we resort to a tree or linked list?
• That is a dynamic data structure
• Or perhaps an array that is sorted on key name
• Or may we somehow transform that name into an integer key?
• Key transformation, aka hashing, aka scatter storage, does just that
• That is, into an integer in the range 0..999

Page 6:

Hashing Components

• An array (or vector) that holds the data
• A hash function that transforms the key into an integer in the correct range
• A set of functions that add to, remove from, and search the array using the hash function
• A collision strategy

Page 7:

One Example Function

• If the key is numeric, such as a product number, use just the bottom 3 digits
• If the bottom portion has a digit that does not span all the possibilities there may be a problem
– Such as 0 meaning original, 1 first replacement
– However any three digits may be used

Page 8:

Another Example Function

• Multiply the ordinal values of the characters of the key
• Divide by 1000, keeping the remainder
• Should work on any character key
• See next screen for code

Page 9:

Example

• Assume the following
    char key[10];
• Use this code
    int index = key[0]*key[1]*…*key[9];
    index = index % 1000;
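A minimal sketch of the product hash as a function. The name product_hash and the table_size parameter are assumptions made for illustration. It takes the remainder after every multiplication, a substitution for the single final % on the slide, because the raw product of ten character codes is usually too large for a 32 bit int.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: multiply the character codes of key[0..len-1],
// reducing modulo table_size after each step so the running value stays small.
std::uint32_t product_hash(const char* key, std::size_t len, std::uint32_t table_size) {
    assert(table_size > 0);
    std::uint64_t index = 1 % table_size;
    for (std::size_t i = 0; i < len; ++i) {
        // Go through unsigned char so characters above 127 are not negative.
        std::uint64_t c = static_cast<unsigned char>(key[i]);
        index = (index * c) % table_size;
    }
    return static_cast<std::uint32_t>(index);
}
```

Reducing at each step gives the same answer as reducing the exact product once. Note that "AB" and "BA" still hash to the same index, since multiplication ignores order.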

Page 10:

Another hash function

• Input is a character string
• Output is an integer in range 0 to N-1
• Sum the ordinal value of each character of the string
• Divide the result by N and take the remainder
– This guarantees the right range
• Number theory tells us to make N prime

Page 11:

Example values with N = 256

• “Abcdef” returns 53
• “Hi there” returns 233
• “ABCDEF” returns 149
• “FEDCBA” returns 149
• “A character string” returns 197
• “A big character string” returns 23
• “Zoology” returns 243
• See the pattern yet?
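The additive hash can be sketched in a few lines. The function name sum_hash is an assumption for illustration, and the values assume ASCII character codes. With N = 256 it reproduces the numbers listed on this slide.

```cpp
#include <cassert>
#include <string>

// Sum the character codes of key and keep the remainder modulo n.
// The result is always in the range 0 .. n-1.
unsigned sum_hash(const std::string& key, unsigned n) {
    assert(n > 0);
    unsigned sum = 0;
    for (unsigned char c : key) {
        sum += c;
    }
    return sum % n;
}
```

For example, sum_hash("ABCDEF", 256) and sum_hash("FEDCBA", 256) both give 149: a plain sum cannot tell a string from any rearrangement of its characters.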

Page 12:

Alternatively

• Maul the string into integers
• Operate on them
• Divide and keep remainder
    // 12 characters viewed as 3 ints (assumes a 4 byte int)
    union { char key[12]; int ints[3]; } u;
    …
    strcpy(u.key, key);
    ndx = u.ints[0] * u.ints[1] / u.ints[2] % 1000;

Page 13:

Commentary

• With strings, beware of things past the null character
• For reasons observable from number theory the best strategy is to mod by a prime number
• Empty space is usually left in the table to ease the computation of the hash function

Page 14:

Variations

• Adding or multiplying the ordinal values of characters will make transpositions give the same result
– ABC and BAC will map to the same integer
– This is known as a collision
• One approach is to multiply each ordinal value by its position and sum the results:
    int index = 0;
    for (int i = 0; i < keylen; i++)
        index += key[i] * (i + 1);   // position counted from 1
    index = index % 1000;
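A minimal sketch of position weighting, assuming positions are counted from 1 and the weighted values are summed. Both choices are assumptions: a product of weighted terms is unchanged by reordering the characters, and a position of 0 would zero the whole product. The name weighted_hash is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Weight each character code by its 1-based position, sum, and reduce.
unsigned weighted_hash(const std::string& key, unsigned table_size) {
    assert(table_size > 0);
    unsigned long long sum = 0;
    for (std::size_t i = 0; i < key.size(); ++i) {
        sum += static_cast<unsigned char>(key[i]) * (i + 1);
    }
    return static_cast<unsigned>(sum % table_size);
}
```

With a table size of 1000, "ABC" gives 398 and "BAC" gives 397, so this transposition no longer collides. Other pairs of keys can still collide.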

Page 15:

Distributions

• The distribution of the hashed keys will appear to be random compared to the original key values
• What we would like is for the keys to be uniformly scattered throughout the index space
• Often the hash function is tailored for the data at hand
– With much experimentation to get a good distribution

Page 16:

Illustration

[Figure: the key Aaron passes through the hash function and yields index 2. Table entries, shown as character key (hash value): Otter (0), Aaron (2), Smith (5), Butler (6), Lawson (8)]

Page 17:

What problems exist?

• Data skew
• Collisions: two keys map into one array index
• A collision strategy is how to handle this
• A hash function that does not map two keys into one index is called a perfect hash function
• Depending on the collision strategy deletions may be problematic

Page 18:

Data Skew

• A good hash function spreads the keys uniformly among the integer range
• How well does this work when the data is not uniformly distributed?
• For example consider
– Names: there are many more Smiths than Garnjobsts
– Numbers: there are many more 101 courses than most other numbers
• Can the hash function still give a good distribution?

Page 19:

Observations

• If the hash function is good then the larger the table (the more sparse the table) the better the likelihood of avoiding a collision
• We must be able to tell the difference between a used and an unused slot
• Since a hash table will necessarily have empty space in it, it is almost always undesirable to store the whole record there

Page 20:

What are we storing?

• What we do is make it an array of pointers, where the pointers point to the actual records
– Each array item then wastes only a pointer's worth of bytes (4 on Win32) and all the items are in heap storage
• The initialization then becomes setting all slots to NULL
– An unused slot is NULL
– Alternative is a boolean in the table specifying used

Page 21:

Collision Strategies

• Only limited by your creativity
• Here are four that are commonly used:
• Linear probing
• Quadratic probing
• Rehashing
• Overflow areas

Page 22:

Linear probing

• Add one (modulo size) to the index until an empty slot is found
• Tends to cluster (problem with most strategies)
• These become long strings of adjacent indexes that are filled up
• These need to be searched sequentially until an empty cell is found
• This sequential search ruins the inherent quickness of hashing if groups get too long

Page 23:

Clustering

• A long series of filled slots
• Clustering is often a sign of a poor hash function or a table too close to full
• When the overflow of one key overlaps another the problem becomes compounded

Page 24:

Linear Probing

[Figure: table entries, shown as character key (hash value): Otter (0), Aaron (2), Matthew (2), Taylor (3), Smith (5), Butler (6), Lawson (8)]

Taylor could have gone in slot 3 but Matthew was already there.

Once this exists any new key that hashes to 2 through 6 will end in slot 7.
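The probing in the illustration can be sketched as follows. This is a sketch under stated assumptions: the ProbeTable type is hypothetical, the hash value is passed in as home rather than computed, an empty string stands in for the NULL pointer that marks an unused slot, and the table size of 10 used in the usage note is a guess since the figure does not give one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Open-addressing table of strings; an empty string marks an unused slot.
struct ProbeTable {
    std::vector<std::string> slots;

    explicit ProbeTable(std::size_t size) : slots(size) { assert(size > 0); }

    // Insert key whose hash value is home. Returns the slot actually used,
    // or -1 if every slot is occupied.
    int insert(const std::string& key, std::size_t home) {
        assert(!key.empty());
        for (std::size_t step = 0; step < slots.size(); ++step) {
            std::size_t i = (home + step) % slots.size();  // add one, modulo size
            if (slots[i].empty()) {
                slots[i] = key;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Probe from home until the key or an empty slot is found; -1 if absent.
    int find(const std::string& key, std::size_t home) const {
        for (std::size_t step = 0; step < slots.size(); ++step) {
            std::size_t i = (home + step) % slots.size();
            if (slots[i].empty()) return -1;
            if (slots[i] == key) return static_cast<int>(i);
        }
        return -1;
    }
};
```

Inserting Otter (0), Aaron (2), Smith (5), Butler (6), Lawson (8), Matthew (2) and Taylor (3) in that order into a ProbeTable of size 10 puts Matthew in slot 3 and Taylor in slot 4. A further key with hash value 5 then lands in slot 7.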

Page 25:

Quadratic probing

• Instead of adding 1 to the index add the square of the probe number
– Modulo size of table
• Keeps x and x+1 from mapping to the same area
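A sketch of the probe sequence, assuming the offset on the k-th attempt is k squared measured from the original hash value. The slide does not spell out this detail, and the name quadratic_probe is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Slot examined on the given attempt (0 = the home slot itself).
std::size_t quadratic_probe(std::size_t home, std::size_t attempt, std::size_t table_size) {
    assert(table_size > 0);
    return (home + attempt * attempt) % table_size;
}
```

With home 2 and a table of 11 slots the first probes visit 2, 3, 6, 0. Unlike linear probing, this sequence can revisit slots and need not reach every slot, so an insert loop needs a limit on the number of attempts.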

Page 26:

Rehashing

• When a collision occurs with the first hash function use a different one
– AKA Secondary hashing
• This doubles the difficulty since we now have to come up with two hash functions which do not conflict with each other
• If the first hash function is good, there will be comparatively few collisions

Page 27:

Overflow areas

• AKA Chaining
• When a collision is detected upon an add, remove both keys from the table
• Mark the slot with a special entry that redirects to another data structure
– List, tree or another array or vector
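A simplified sketch of chaining. It differs from the slide in one respect, stated plainly: every slot holds its own small vector of keys from the start, instead of switching a slot to a special redirect entry on the first collision. The ChainTable type is hypothetical and the hash value is passed in as home.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Each slot is a chain (here a vector) of all keys that hash to it.
struct ChainTable {
    std::vector<std::vector<std::string>> buckets;

    explicit ChainTable(std::size_t size) : buckets(size) { assert(size > 0); }

    // Colliding keys simply share a chain; other slots are never disturbed.
    void insert(const std::string& key, std::size_t home) {
        buckets[home % buckets.size()].push_back(key);
    }

    // Only the chain for the key's own slot is searched.
    bool contains(const std::string& key, std::size_t home) const {
        const std::vector<std::string>& chain = buckets[home % buckets.size()];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }
};
```

Because a collision never spills into a neighboring slot, clusters cannot form. The cost is a sequential search within a chain.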

Page 28:

Chain to Overflow

[Figure: a hash table in which the slots that had collisions hold an OVER marker redirecting to a separate overflow area. Entries, shown as character key (hash value): Otter (0), Aaron (2), Matthew (2), Taylor (3), Smith (5), Jones (5), Butler (6), Lawson (8)]

Page 29:

Chain to List

[Figure: a hash table in which the slots that had collisions hold an OVER marker redirecting to a list of the colliding keys. Entries, shown as character key (hash value): Otter (0), Aaron (2), Matthew (2), Taylor (3), Smith (5), Jones (5), Butler (6), Lawson (8)]

Page 30:

More on chaining

• Keeps the collisions from further degrading the hash table
• If the hash table is large then we get a good split so the lists are short
• Also good if duplicate keys are allowed
• There is no systematic order of the items in any hash table
– Thus sorting sometimes is needed after the fact

Page 31:

Deletion

• There are challenges to deleting in a hash table
– Making a slot empty may prevent finding an item that actually exists
• Depends on collision strategy
• Linear probing
– Rehash everything from the first empty slot before the deletion to the next empty slot after
• Similar things may be needed with others
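One way to repair a linear probing table after a deletion, as a sketch. All names are hypothetical, the hash is a simple character sum chosen only so the example is self-contained, and an empty string marks an unused slot. This variant re-inserts only the entries between the freed slot and the next empty slot, on the assumption that entries earlier in the cluster never probed through the freed slot.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical hash for this sketch: sum of character codes modulo size.
std::size_t home_slot(const std::string& key, std::size_t size) {
    std::size_t sum = 0;
    for (unsigned char c : key) sum += c;
    return sum % size;
}

// Put key in the first free slot at or after its home slot.
bool probe_insert(std::vector<std::string>& slots, const std::string& key) {
    assert(!slots.empty() && !key.empty());
    std::size_t home = home_slot(key, slots.size());
    for (std::size_t step = 0; step < slots.size(); ++step) {
        std::size_t i = (home + step) % slots.size();
        if (slots[i].empty()) { slots[i] = key; return true; }
    }
    return false;  // table full
}

// Index of key, or -1 when an empty slot ends the probe sequence.
int probe_find(const std::vector<std::string>& slots, const std::string& key) {
    assert(!slots.empty());
    std::size_t home = home_slot(key, slots.size());
    for (std::size_t step = 0; step < slots.size(); ++step) {
        std::size_t i = (home + step) % slots.size();
        if (slots[i].empty()) return -1;
        if (slots[i] == key) return static_cast<int>(i);
    }
    return -1;
}

// Remove key, then lift out and re-insert each entry that follows it
// in the cluster, stopping at the next empty slot.
bool probe_erase(std::vector<std::string>& slots, const std::string& key) {
    int found = probe_find(slots, key);
    if (found < 0) return false;
    std::size_t freed = static_cast<std::size_t>(found);
    slots[freed].clear();
    for (std::size_t j = (freed + 1) % slots.size(); !slots[j].empty();
         j = (j + 1) % slots.size()) {
        std::string moved = slots[j];
        slots[j].clear();
        probe_insert(slots, moved);  // lands in its old slot or an earlier one
    }
    return true;
}
```

In a 10 slot table the keys "a", "k" and "u" all hash to 7 and occupy slots 7, 8 and 9. After erasing "a", the other two move up to slots 7 and 8 and remain findable.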

Page 32:

Other Considerations

• Size of table is fixed
• We must have a good prediction of the number of keys or waste much space
• Otherwise we may have catastrophic table overflow
• Even if we have a good prediction the table should be larger than required

Page 33:

Analysis

• Disregarding collisions hashing is clearly O(1)
• How likely are collisions?
• Birthday collisions example
• Statistics from the analysis in Algorithms + Data Structures = Programs follow

Page 34:

The Birthday Paradox

• Most people think that you need a big room with many people before two will have the same birthday
• If 23 people are in the same room there is a better than 50% chance that two will have the same birthday
• Why is this number so low?
• The probability accumulates: each new person can match anyone already present, on top of the earlier probabilities

Page 35:

Consider

• If there is one person the probability is zero
• With one additional person it is 1 in 365
• The third person has 2 chances in 365, in addition to the chance of the first 2 having the same birthday
• The twenty-third person has 22 chances in 365, in addition to the previous probability of two of the first twenty-two having the same birthday
• This is about 0.51
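The exact figure can be computed by multiplying the chances that each new person misses everyone before them. This is a sketch assuming 365 equally likely birthdays with leap years ignored. The function name is hypothetical.

```cpp
#include <cassert>

// Probability that at least two of `people` share a birthday,
// assuming `days` equally likely and independent birthdays.
double shared_birthday_probability(int people, int days) {
    assert(people >= 0 && days > 0);
    double all_distinct = 1.0;
    for (int k = 0; k < people; ++k) {
        if (k >= days) return 1.0;  // more people than days forces a match
        // Person k+1 must avoid the k birthdays already taken.
        all_distinct *= static_cast<double>(days - k) / days;
    }
    return 1.0 - all_distinct;
}
```

Under these assumptions 22 people give a probability a little under one half, and 23 people give a little over one half. The same calculation applies to a hash table with `days` slots and `people` keys.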

Page 36:

What do we learn from this?

• The likelihood of a collision is much greater than we intuitively believe
• We will always have collisions unless we go to the large amount of work of finding a perfect hash
• Let's consider some empirical data

Page 37:

Using Optimal Rehashing

– Load factor = number of keys / number of slots

Load Factor    Probes
0.1            1.05
0.25           1.15
0.5            1.39
0.75           1.85
0.9            2.56
0.95           3.15
0.99           4.66

Page 38:

Using Linear Probing

Load Factor    Probes
0.1            1.06
0.25           1.17
0.5            1.50
0.75           2.50
0.9            5.50
0.95           10.50
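The tabulated probe counts agree, to within rounding, with two closed-form estimates of the expected number of probes at load factor a: -ln(1 - a) / a for the optimal case and (1 - a/2) / (1 - a) for linear probing. The slides do not state these formulas, so treating them as the source of the tables is an assumption. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Expected probes when every probe is equally likely to hit any slot.
double probes_optimal(double load) {
    assert(load > 0.0 && load < 1.0);
    return -std::log(1.0 - load) / load;
}

// Expected probes under linear probing.
double probes_linear(double load) {
    assert(load >= 0.0 && load < 1.0);
    return (1.0 - load / 2.0) / (1.0 - load);
}
```

Both estimates depend only on the load factor, not on the table size, which is why a table kept well below full stays fast however large it is.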

Page 39:

Types of Hash Techniques

• Mapping
• Folding
• Shifting
• Pseudo Random Numbers
• Casts

Page 40:

Mapping

• Convert items (usually characters or integers) into values
• The character value itself will often not be acceptable
• Preferred method: Create a vector that is subscripted by the character values
– Each character value returns another value
– By modifying these we can change the hash function
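A sketch of mapping through a lookup table. The CharMap alias, the function name and the choice to sum the mapped values are assumptions for illustration. The slide only says that each character is looked up in a vector.

```cpp
#include <array>
#include <cassert>
#include <string>

// One adjustable value per possible character code.
using CharMap = std::array<unsigned, 256>;

// Replace each character by its mapped value, sum, and reduce to the table.
unsigned mapped_hash(const std::string& key, const CharMap& map, unsigned table_size) {
    assert(table_size > 0);
    unsigned sum = 0;
    for (unsigned char c : key) {
        sum += map[c];
    }
    return sum % table_size;
}
```

Changing a single entry of the map changes the hash of every key containing that character, which makes the map a convenient knob when tuning a hash function for a known set of keys.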

Page 41:

Folding

• Treat pieces of the key as if they were integers and compute the hash from them
• Example:
– Suppose an 8 character key that is always there
– An int is 4 characters long
– Form a union that has the 8 characters for one value and an array of 2 ints or 4 short ints as the other:

Page 42:

Folding Example

• Consider:
    union { char c[8]; int i[2]; short s[4]; } smash;
• Move the 8 characters into the c part and then remove the 2 ints or 4 short ints
• Do some computation on the ints or short ints
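A sketch of folding an 8 character key into two 32 bit integers. It swaps in std::memcpy for the union, since reading a union member other than the one last written is not portable C++, and it zero-fills the buffer so short keys leave no debris. The function name and the choice of exclusive-or as the computation are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Fold the first 8 characters of key into two 32-bit pieces and combine them.
std::uint32_t fold_hash(const std::string& key, std::uint32_t table_size) {
    assert(table_size > 0);
    unsigned char buf[8] = {0};  // zero padding past the end of short keys
    std::size_t n = key.size() < sizeof buf ? key.size() : sizeof buf;
    std::memcpy(buf, key.data(), n);

    std::uint32_t parts[2];
    std::memcpy(parts, buf, sizeof parts);  // view the 8 bytes as 2 integers

    return (parts[0] ^ parts[1]) % table_size;
}
```

The integer values depend on the byte order of the machine, so the same key can hash differently on different hardware. Characters past the eighth are ignored in this sketch.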

Page 43:

Shifting

• Using the shift instructions, shift the values right to make the high order bits more prominent
• For instance 'a'..'z' are from 97 .. 122
• Shifting can bring this range down some

Page 44:

Pseudo Random Numbers

• Computer random number generators are not random
– Rather they are a sequence of numbers based on an algorithm, usually based on overflow
• The seed of the sequence determines where it starts
• Use your value as the seed and then call the random number generator
• This mechanism will often generate integers in a particular range as well
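A sketch of the seeding idea using std::minstd_rand from the standard library, chosen here because its sequence is fixed by the language standard. The slides do not name a generator, so this choice and the function name are assumptions. The key is assumed to have been turned into an integer already, for instance by folding.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Seed a generator with the (already folded) key and use its first output.
std::uint32_t seeded_hash(std::uint32_t folded_key, std::uint32_t table_size) {
    assert(table_size > 0);
    std::minstd_rand gen(folded_key);  // same seed always gives the same sequence
    return static_cast<std::uint32_t>(gen() % table_size);
}
```

The first output of this generator for seed 1 is 48271, so seeded_hash(1, 1000) is 271. Taking the remainder is used here to bring the value into range, in place of a range-limited generator call.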

Page 45:

Casts

• You may use a cast to turn anything into a character
• Often the same as folding
• Beware of short character strings!
– Suppose you have "hi" in char x[8]
– Do not use anything past x[2] in casts or folding
– The debris past the null character may change from time to time
– Using it makes your hash function unreliable

Page 46:

Minimal and Perfect Hashes

• A perfect hash function has no collisions
– Life is simpler when there is no need for a collision strategy
• A hash function that computes indexes where the number of keys is the same as the number of entries is called minimal
– No space is wasted in the hash table
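Both properties can be checked mechanically for a fixed set of keys. In this sketch candidate_hash is a hypothetical stand-in (a character sum) for whatever hash function is being evaluated, and the other names are also assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical candidate: sum of character codes modulo table_size.
std::size_t candidate_hash(const std::string& key, std::size_t table_size) {
    std::size_t sum = 0;
    for (unsigned char c : key) sum += c;
    return sum % table_size;
}

// Perfect: no two keys land on the same index.
bool is_perfect(const std::vector<std::string>& keys, std::size_t table_size) {
    assert(table_size > 0);
    std::vector<bool> used(table_size, false);
    for (const std::string& key : keys) {
        std::size_t i = candidate_hash(key, table_size);
        if (used[i]) return false;  // collision found
        used[i] = true;
    }
    return true;
}

// Minimal perfect: perfect, and the table has exactly one slot per key.
bool is_minimal_perfect(const std::vector<std::string>& keys, std::size_t table_size) {
    return keys.size() == table_size && is_perfect(keys, table_size);
}
```

For the keys "a", "b", "c" and a table of 3 slots the character sum happens to be minimal and perfect. For "ab" and "ba" it is not perfect at any table size, since the two sums are equal.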

Page 47:

Perfect Hash Requirements

• Know the keys in advance
• Key is of a constant size and makeup
• The table must be of size greater than or equal to the number of keys
• The closer the table size is to the number of keys, the harder the function is to find
– Although not necessarily to compute

Page 48:

Constructing a Hash

• Choose the hash method
• Bring down to range
• Choose a collision strategy

Page 49:

Choose the hash method

• Recall the methods:
– Mapping, folding, shifting, random numbers, casting
– Any combination or something you make up
• Decision is strongly influenced by the form and type of the key

Page 50:

Range

• Second step: bring down to the right range
• If the computation was a fold and the values were originally characters then the resulting range is limited mostly by the variable type
• Could be positive or possibly negative
• Usually we do integer division and keep the remainder
– The divisor is best if prime

Page 51:

Range

• We often choose the largest prime number smaller than the table
• Shifts
– Table size is a power of two:
– res = num >> 20;
– Leaves twelve bits of a 32 bit unsigned value, which is 0 - 4095
• Bitwise logical operations
– And can mask out high order bits
– Table size will be a power of two
– res = num & 0x000000ff;
– Forces range 0 – 255
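The three ways of bringing a value into range can be sketched side by side. The function names are hypothetical, and 32 bit values are assumed so that a shift by 20 leaves twelve bits. The remainder version also handles the negative case, since in C++ the % operator can return a negative result for a negative left operand.

```cpp
#include <cassert>
#include <cstdint>

// Remainder that is never negative, even when value is.
std::uint32_t reduce_mod(std::int32_t value, std::int32_t table_size) {
    assert(table_size > 0);
    std::int32_t r = value % table_size;  // may be negative when value is
    if (r < 0) r += table_size;
    return static_cast<std::uint32_t>(r);
}

// Keep the top 12 bits of a 32-bit value: result is 0 .. 4095.
std::uint32_t reduce_shift(std::uint32_t value) {
    return value >> 20;
}

// Keep the bottom 8 bits: result is 0 .. 255.
std::uint32_t reduce_mask(std::uint32_t value) {
    return value & 0x000000ffu;
}
```

For example, reduce_mod(-7, 997) is 990 and reduce_mask(0x12345678) is 0x78. The shift keeps the high order bits and the mask keeps the low order bits, so they suit hash values whose variation sits in different places.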

Page 52:

Collision Strategy

• The simplest and worst is linear probing
– Do not use this if the hash table will be close to full
– May be acceptable if table is less than 75% full
• Chaining is the best when the number of entries is least well known
– It can also degenerate to O(N) easily

Page 53:

In search of the minimal perfect hash

• Holy Grail for searches
– What is better than O(1) search?
• Actually we will be happy if we find any good hash
• This task is never really easy
• Always requires programmer intervention
• A perfect or minimal perfect hash requires some luck to find, besides the requirements mentioned previously

Page 54:

Searching for Perfect Hash

• Generate a hash function that uses one or more characters from the items to be hashed as well as perhaps the length of the item
• Have it use a lookup table so that we can assign values to the characters
• Letters not in any of the used characters are usually zeroed
• Using some systematic procedure set the map values

Page 55:

Searching continued

• Attempt to hash all the items
– Keep track of all the character maps
• At the first collision
– If there is a map that was only used by one or both of the collidees alter it to prevent the collision and continue from there
– Otherwise alter one of the maps present in the two items and go back to the attempt to hash
• If there are no collisions that are not resolved the hash is successful

Page 56:

Case study

• I have such a searcher: perfhash
• Finds a perfect hash
• Uses several algorithms
• Written in C++

Page 57:

PerfHash algorithms

• 0 – Multiply characters and divide by length
• 1 – Add characters times square of position then multiply by length
• 2 – Fold using short ints, take product
• 3 – Fold using ints, alternately add then subtract
• 4 – Fold then use as seed for random number generator

Page 58:

Results

Language   Keys   0     1     2     3     4
C          32     151   293   139   337   227
C++        35     281   307   389   487   577
Pascal     40     251   251   271   151   277
Modula2    48     301   367   317   397   193
Java       51     577   431   337   331   463

Page 59:

Finally

• When hashing works well, it is hard to beat
– Stable index
– Good distribution from the hash function
• It is much harder to make dynamic than trees
• Fiddling with the hash function can make it better or worse in unpredictable ways