can’t provide fast insertion/removal and fast lookup at the same time vectors, linked lists,...

Post on 04-Jan-2016


Data Structures - CSCI 102

Copyright © William C. Cheng

Data Structure Limitations

Vectors, Linked Lists, Stacks, Queues, Deques: can’t provide fast insertion/removal and fast lookup at the same time

Binary Search Trees, Heaps: provide consistently fast operations, but must maintain an internal ordering

What if we didn’t care about the ordering of the elements at all?

How can we further improve the performance of lookup, add & removal?

Lookup Tables

For operations where we only care about fast add/remove/search, not fast traversal, we create a table structure to optimize for fast lookup

Each value in the table has a unique key

The key is used as a short identifier to look up an entire value in the table

Example: your student ID is used to look up your student record (e.g. name, GPA, etc.)

Lookup Tables

What kind of operations do we need to perform on a lookup table?

Insert(key, value): insert a new value identified by key into the table

Remove(key): remove the value identified by key from the table

Search(key): see if a particular value identified by key is in the table

We don’t care as much about traversal (visiting all elements) in this scenario

Sample Object

We want to keep a directory of all the students at USC and be able to look them up by their student ID

Let’s assume ID is a unique integer

struct Student {
    string name;
    double gpa;
    int id;
};

Direct Address Table

If we can guarantee that student IDs will always range from 0 to N (e.g. 0 to 4999), we could just store them in an array with one slot per possible ID (N + 1 slots):

Student data[5000];

Then when we want to grab a particular student, we know Student N is at index N:

int id = 3285;
Student s = data[id];

Direct Address Table

[Figure: student IDs 0, 2, and 4 map directly to indexes 0, 2, and 4 of the Data array (indexes 0 to 4999). Each used slot holds a Student object with Name, GPA, and ID fields, e.g. John Doe (3.20), Jane Doe (2.62), Some Guy (3.7).]

Direct Address Table

Direct Addressing:

Maps keys directly to the indexes in an array

Unused array indexes need to be marked (generally use NULL)

Operations are fast: O(1) worst case
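The points above can be sketched as a small class. This is only an illustration, not code from the slides: the class name DirectAddressTable, the constant MAX_ID, and the choice to store pointers (so that nullptr can mark an unused slot) are all assumptions.

```cpp
#include <array>
#include <string>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Assumed upper bound on valid IDs (0 to 4999, as in the slides' example).
const int MAX_ID = 4999;

// Direct address table: the student ID is used directly as the array index.
class DirectAddressTable {
public:
    DirectAddressTable() { slots_.fill(nullptr); }  // every slot starts unused

    // Insert: O(1). Returns false if the ID is outside the supported range.
    bool insert(const Student* s) {
        if (s == nullptr || !inRange(s->id)) return false;
        slots_[s->id] = s;
        return true;
    }

    // Search: O(1). Returns nullptr if no student is stored under this ID.
    const Student* search(int id) const {
        return inRange(id) ? slots_[id] : nullptr;
    }

    // Remove: O(1). Marks the slot as unused again.
    void remove(int id) {
        if (inRange(id)) slots_[id] = nullptr;
    }

private:
    static bool inRange(int id) { return id >= 0 && id <= MAX_ID; }
    std::array<const Student*, MAX_ID + 1> slots_;  // one slot per possible ID
};
```

Every operation is a single array access, which is where the O(1) worst case comes from. The cost is that the array has MAX_ID + 1 slots no matter how few students are stored.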

Direct Address Table

Direct Addressing Issues

Key Restrictions:

Keys must be numeric

Keys must fall into a nice, uniform range

Array Size:

If there are N possible keys, then data[] must be of size N

Our array could get HUGE

What if we’re only using a small number of keys? Tons of space is wasted

How can we get around these limitations?

Hash Functions

A hash function is a function that maps key values to array indexes

Input records all have a unique key

The hash function maps key to an array index

Records are stored at data[hash(key)]

Ideally every unique key also has a unique hash(key)

Direct Addressing essentially uses a hash function that does nothing

int directAddressHash(int studentId) {
    return studentId;
}

Hash Tables

[Figure: student IDs (keys) 0, 2, and 4 pass through the Hash Function; hash(0), hash(2), and hash(4) select the slots of the Data array that hold the Student objects (Name, GPA, ID): John Doe (3.2, ID 0), Jane Doe (2.6, ID 2), Some Guy (3.7, ID 4).]

Hash Tables

Hash Functions

Recall direct addressing:

int directAddressHash(int studentId) {
    return studentId;
}

How can we avoid having to make our array gigantic to hold all possible keys?

Simple solution: use modular arithmetic

int modularHash(int studentId) {
    return studentId % ARRAY_SIZE;
}

Size of the backing array is no longer dependent on the number of unique keys
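To make the modular hash concrete, here is a compilable version. The value ARRAY_SIZE = 5 is an assumption chosen for illustration (the slides do not give one), and the function assumes non-negative IDs, since in C++ the % operator returns a negative result for a negative left operand.

```cpp
// Assumed table size for illustration; the slides leave ARRAY_SIZE unspecified.
const int ARRAY_SIZE = 5;

// Maps any non-negative student ID into the range [0, ARRAY_SIZE).
int modularHash(int studentId) {
    return studentId % ARRAY_SIZE;
}
```

With this table size, modularHash(123) is 3 and modularHash(202) is 2, while modularHash(401) and modularHash(406) are both 1. The array stays small, but distinct keys can now land on the same index.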

Hash Functions

What makes a good hash function?

Fast: hashing is supposed to be faster than a binary search tree, so hash(key) needs to be O(1)

Deterministic: if we have a key K, then hash(K) must always give the same result

Uniform distribution: the hash function should uniformly distribute keys across all of the available indexes in the storage array

Hash Functions

Making a good hash function is hard

Making a hash function:

Map your data into the set of natural numbers, N = {0, 1, 2, ...}

For strings, use things like ASCII letter codes

Prime numbers are your friend: prime table sizes tend to yield better results

Handle variants of the same pattern, e.g. make sure "get" and "gets" hash differently

Try to be independent of any patterns that may exist in the data

You won’t usually have to write your own, but you should know what the default hash function does
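One common way to follow this advice for strings is a polynomial hash over the character codes. The sketch below is an assumption for illustration, not the hash function used in the course: the name stringHash, the multiplier 31, and the prime table size 101 are all arbitrary choices.

```cpp
#include <cstddef>
#include <string>

// Polynomial string hash: treats the character codes as digits of a base-31
// number and reduces modulo a prime table size after every step.
// Both 31 and the default table size 101 are illustrative choices.
std::size_t stringHash(const std::string& key, std::size_t tableSize = 101) {
    std::size_t h = 0;
    for (unsigned char c : key) {
        h = (h * 31 + c) % tableSize;
    }
    return h;
}
```

Because every character and its position feed into the result, "get" and "gets" end up at different indexes here (18 and 67 with the default table size), and the same string always hashes to the same index.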

Hash Tables

Hashing Issues

Hash Tables do not maintain any ordering of their internal elements

Creating a perfect hash function is almost impossible

Collisions:

When two distinct keys generate the same hash value, it’s called a collision

hash(K1) == hash(K2)

Collision Handling

Open Addressing:

If we try to insert a new element and there’s a collision, keep probing the hash table until we find a vacant space

All data is stored directly in the hash table. No extra data structures are needed.

Probing:

If a collision occurs, use a deterministic algorithm to calculate the next array index to check (based on the initial hash result)

Open Addressing (Linear Probing)

Start with an empty Hash Table (Data slots 0 to 4)

Insert "John Doe" (GPA 2.8) with ID = 123

hash(123) = 1

data[1] is empty, no collision

store it there

Hash Table contains one item: data[1] = John Doe

Insert "Jane Doe" (GPA 3.4) with ID = 202

hash(202) = 3

data[3] is empty, no collision

store it there

Hash Table contains two items: data[1] = John Doe, data[3] = Jane Doe

Insert "Some Guy" (GPA 3.5) with ID = 401

hash(401) = 1

data[1] is non-empty, collision!

hash(401)+1 = 2

data[2] is empty, no collision

store it there

Hash Table contains three items: data[1] = John Doe, data[2] = Some Guy, data[3] = Jane Doe
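The insert walkthrough can be sketched in code. The slides do not say which hash function yields hash(123) = 1, so this sketch assumes key % size and uses the made-up keys 121, 203, and 401, which land in the same slots (1, 3, then 2 after a collision) as the walkthrough. The class name LinearProbingTable and the sentinel EMPTY_SLOT are also assumptions, and removal is deliberately left out.

```cpp
#include <cstddef>
#include <vector>

// Assumed sentinel for a vacant slot, so keys must be non-negative.
const int EMPTY_SLOT = -1;

// Open addressing with linear probing over integer keys.
// The table size is assumed to be greater than zero.
class LinearProbingTable {
public:
    explicit LinearProbingTable(std::size_t size) : slots_(size, EMPTY_SLOT) {}

    // Returns the index where key was stored, or -1 if the table is full.
    int insert(int key) {
        std::size_t start = hash(key);
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            std::size_t idx = (start + i) % slots_.size();  // wrap around
            if (slots_[idx] == EMPTY_SLOT || slots_[idx] == key) {
                slots_[idx] = key;
                return static_cast<int>(idx);
            }
        }
        return -1;
    }

    // Returns the index holding key, or -1 if key is absent.
    int search(int key) const {
        std::size_t start = hash(key);
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            std::size_t idx = (start + i) % slots_.size();
            if (slots_[idx] == EMPTY_SLOT) return -1;  // vacant slot ends the probe
            if (slots_[idx] == key) return static_cast<int>(idx);
        }
        return -1;
    }

private:
    std::size_t hash(int key) const {
        return static_cast<std::size_t>(key) % slots_.size();
    }
    std::vector<int> slots_;
};
```

With a table of size 5, inserting 121 returns 1, inserting 203 returns 3, and inserting 401 first hits the occupied slot 1 and is stored at 2. Search follows the same probe sequence, which is why a long run of occupied slots slows down every operation.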

Open Addressing (Linear Probing)

What is the Big O of each of these operations?

Search(key): Average: O(1), Worst Case: O(N)

Insert(key, value): Average: O(1), Worst Case: O(N)

Remove(key): Average: O(1), Worst Case: O(N)

Operations depend on the table’s load factor ("Utilization")

How big is the table?

How many slots are taken already?

load factor = (# of elements) / (size of array)

Collision Handling

Chaining:

Each slot in the Hash Table can now contain a list of elements instead of a single element

When multiple items hash to the same slot, they are placed in the list at that slot

This requires the overhead of an extra list for each slot that contains one or more elements

Chaining

Hash Table contains two items: data[1] = John Doe (GPA 2.8, ID 123), data[3] = Jane Doe (GPA 3.4, ID 202)

Insert "Some Guy" (GPA 3.5) with ID = 401

hash(401) = 1

data[1] is non-empty, collision!

Chaining says to add the new entry to the list at data[1]

Insert Some Guy in the list at data[1]

Hash Table contains three items: the list at data[1] holds John Doe and Some Guy, and data[3] holds Jane Doe
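A minimal sketch of chaining, under the same assumptions as before: the hash is key % size (the slides do not specify one), and the made-up keys 121, 203, and 401 stand in for the three students so that two of them collide at slot 1. The class name ChainedTable and the helper chainLength are hypothetical.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Separate chaining over integer keys: each slot holds a list of the keys
// that hash to it. Keys are assumed non-negative and the size greater than zero.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t size) : slots_(size) {}

    // O(1): push onto the front of the chain (no duplicate check).
    void insert(int key) { slots_[hash(key)].push_front(key); }

    // Walks one chain, so the cost grows with that chain's length.
    bool search(int key) const {
        for (int k : slots_[hash(key)]) {
            if (k == key) return true;
        }
        return false;
    }

    // Removes every copy of key from its chain.
    void remove(int key) { slots_[hash(key)].remove(key); }

    // Number of keys currently stored in the given slot.
    std::size_t chainLength(std::size_t slot) const { return slots_[slot].size(); }

private:
    std::size_t hash(int key) const {
        return static_cast<std::size_t>(key) % slots_.size();
    }
    std::vector<std::list<int>> slots_;
};
```

After inserting 121, 203, and 401 into a table of size 5, slot 1 holds a chain of two keys and slot 3 a chain of one. Inserting at the front of the list never walks the chain, while search and remove have to scan it.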

Chaining

What is the Big O of each of these operations?

Search(key): Average: O(1), Worst Case: O(N)

Insert(key, value): Average: O(1), Worst Case: O(1)

Remove(key): Average: O(1), Worst Case: O(N)

Operations depend on the average length of a chain (except for insert)

Collision Handling

The Problem:

If a malicious user knows what hash function you’re using, they can intentionally cause your worst-case behavior

Universal Hashing:

When the Hash Table is created, randomly choose a hash function independent of the keys that are going to be stored

No single input always gives worst-case behavior (just like randomized Quicksort)
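The slides do not name a specific family of hash functions to choose from. One standard choice, used here purely as an illustration, is the Carter-Wegman family h(k) = ((a*k + b) mod p) mod m, where p is a prime larger than any key and a, b are picked at random when the table is created. The class name UniversalHash and the prime 2^31 - 1 are assumptions of this sketch.

```cpp
#include <cstdint>
#include <random>

// Assumed prime modulus (2^31 - 1); keys must be smaller than this.
const std::uint64_t HASH_PRIME = 2147483647ULL;

// One randomly chosen member of the family h(k) = ((a*k + b) mod p) mod m.
class UniversalHash {
public:
    // tableSize is m; a and b are drawn from rng at construction time.
    UniversalHash(std::uint64_t tableSize, std::mt19937_64& rng)
        : m_(tableSize) {
        std::uniform_int_distribution<std::uint64_t> pickA(1, HASH_PRIME - 1);
        std::uniform_int_distribution<std::uint64_t> pickB(0, HASH_PRIME - 1);
        a_ = pickA(rng);
        b_ = pickB(rng);
    }

    // Deterministic for a given object: a and b never change after construction.
    std::uint64_t operator()(std::uint64_t key) const {
        return ((a_ * key + b_) % HASH_PRIME) % m_;
    }

private:
    std::uint64_t m_, a_, b_;
};
```

Each table instance gets its own a and b, so an attacker who knows only the family cannot precompute a set of keys that is guaranteed to collide. Within one table the function is still deterministic, as a hash function must be.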

Collision Handling

Multi-Level Hashing:

Like chaining, but each element in the hash table holds another hash table with a different hash function

By hashing multiple times, we can greatly decrease the odds of a collision

Perfect Hashing:

If the set of possible keys is static (never changes), we can develop a perfect multi-level hash to give O(1) worst case performance

e.g. The reserved keywords in a programming language are a static set of keys

Hash Tables

Other Notes

Hash Tables generally do provide a way for you to retrieve a list of the known keys

Just keep in mind there is no guaranteed ordering of the keys

Before C++11, C++ had no built-in hash table; the proposal to add unordered_map was adopted in C++11 as std::unordered_map

Google Sparse Hash provides C++ hash tables

Boost C++ Libraries provides hash tables: http://www.boost.org/
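Assuming a C++11 or later compiler, where std::unordered_map is part of the standard library, the student directory from the earlier slides can be sketched as follows. The helper names makeDirectory and findStudent are hypothetical; the student data is the sample data from the slides.

```cpp
#include <string>
#include <unordered_map>

struct Student {
    std::string name;
    double gpa;
    int id;
};

// Builds a small directory keyed by student ID.
std::unordered_map<int, Student> makeDirectory() {
    std::unordered_map<int, Student> dir;
    dir[123] = Student{"John Doe", 2.8, 123};  // Insert(key, value)
    dir[202] = Student{"Jane Doe", 3.4, 202};
    dir[401] = Student{"Some Guy", 3.5, 401};
    dir.erase(202);                            // Remove(key)
    return dir;
}

// Search(key): returns true and fills `out` if the ID is present.
bool findStudent(const std::unordered_map<int, Student>& dir, int id, Student& out) {
    auto it = dir.find(id);
    if (it == dir.end()) return false;
    out = it->second;
    return true;
}
```

Iterating over the map visits every stored key, but in no guaranteed order, which matches the note above about key ordering.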
