09 searching[1]

90-723: Data Structures and Algorithms for Information Processing Copyright © 1999, Carnegie Mellon. All Rights Reserved. 1 Lecture 9: Searching Data Structures and Algorithms for Information Processing Lecture 9: Searching

Upload: kamal-shrish

Post on 15-Apr-2017


Page 1: 09 searching[1]

Data Structures and Algorithms for Information Processing

Lecture 9: Searching

Page 2: 09 searching[1]

Outline

• The simplest method: serial search
• Binary search
• Open-address hashing
• Chained hashing

Page 3: 09 searching[1]

Search Algorithms

Whenever large amounts of data need to be accessed quickly, search algorithms are crucially involved.

Page 4: 09 searching[1]

Search Algorithms

Search algorithms lie at the heart of many computer technologies. To name a few:
– Databases
– Information retrieval applications
– Web infrastructure (file systems, domain name servers, etc.)
– String searching for patterns

Page 5: 09 searching[1]

Search Algorithms: Two Broad Categories

• Searching a static database
  – Accessing indexed Web pages
  – Finding a file on disk
• Evaluating a dynamically changing set of hypotheses
  – Computer chess (search for a move)
  – Speech recognition (search for text given speech)

We’ll be concerned with the first.

Page 6: 09 searching[1]

The Simplest Search: Serial Lookup

• Items are stored in an array or list.
• To search for an item x:
  – Start at the beginning of the list
  – Compare the current item to x
  – If unequal, proceed to next item

Page 7: 09 searching[1]

Pseudocode for Serial Search

// Find x in an array a of length n
int i = 0;
boolean found = false;
while ((i < n) && !found) {
    if (a[i] == x) found = true;
    else i++;
}
if (found) ...
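The loop above can be wrapped in a method so that it compiles and runs. This is only a minimal sketch: the class name `SerialSearchSketch`, the method name `indexOf`, and the convention of returning -1 for a missing item are assumptions, not part of the slides.

```java
// Minimal sketch of serial search; names and the -1 convention are illustrative.
final class SerialSearchSketch {
    /** Returns the index of the first occurrence of x in a, or -1 if x is absent. */
    static int indexOf(int[] a, int x) {
        int i = 0;
        boolean found = false;
        // Same loop as the slide, with n taken from a.length.
        while ((i < a.length) && !found) {
            if (a[i] == x) found = true;
            else i++;
        }
        return found ? i : -1;
    }

    public static void main(String[] args) {
        int[] a = {7, 3, 9, 1, 5};
        System.out.println(indexOf(a, 9)); // index of a present item
        System.out.println(indexOf(a, 4)); // an absent item
    }
}
```

The array does not need to be sorted, which is the one advantage serial search keeps over the methods that follow.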

Page 8: 09 searching[1]

Analysis for Serial Search

• Best case: Requires one array access: Θ(1)
• Worst case: Requires n array accesses: Θ(n)
• Average case: To access an item, assuming position is random (uniform):
  (1+2+3+...+n)/n = n(n+1)/(2n) = (n+1)/2 = Θ(n)
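The average-case figure can be checked empirically. The sketch below counts array accesses for each stored item and averages them. It assumes the items are distinct and that every stored item is equally likely to be the target; the class and method names are hypothetical.

```java
// Sketch: measure the average number of array accesses made by serial search.
final class SerialSearchCost {
    /** Number of array accesses serial search makes while looking for x. */
    static int accesses(int[] a, int x) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            count++;                 // one access to a[i]
            if (a[i] == x) break;    // stop at the first match
        }
        return count;
    }

    /** Average accesses over all stored items (assumes distinct items, non-empty array). */
    static double averageAccesses(int[] a) {
        long total = 0;
        for (int x : a) total += accesses(a, x);
        return (double) total / a.length;
    }

    public static void main(String[] args) {
        int[] a = {7, 3, 9, 1, 5};
        System.out.println(averageAccesses(a));   // measured average
        System.out.println((a.length + 1) / 2.0); // (n+1)/2 from the analysis
    }
}
```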

Page 9: 09 searching[1]

A Useful Combinatorial Identity

1+2+3+…+n = n(n+1)/2

Why?
• Algebraic proof in Main
• Visual counting
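One standard algebraic argument, which is not necessarily the one given in Main, writes the sum S forwards and backwards and adds the two rows term by term:

```latex
\begin{align*}
S  &= 1 + 2 + \cdots + (n-1) + n \\
S  &= n + (n-1) + \cdots + 2 + 1 \\
2S &= \underbrace{(n+1) + (n+1) + \cdots + (n+1)}_{n \text{ terms}} = n(n+1) \\
S  &= \frac{n(n+1)}{2}
\end{align*}
```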

Page 10: 09 searching[1]

Visual Counting

n*n

Page 11: 09 searching[1]

Visual Counting

n

Page 12: 09 searching[1]

Visual Counting

n*n - n

Page 13: 09 searching[1]

Visual Counting

(n*n - n)/2 + n = n(n+1)/2

Page 14: 09 searching[1]

Binary Search

• Can be used whenever the data are totally ordered -- e.g., the integers. All elements are comparable.
• Requires sorting in advance, and storing in an array
• One of the simplest to implement, often “fast enough”
• Can be tricky to handle “boundary cases”
• This is a classic divide-and-conquer algorithm.

Page 15: 09 searching[1]

Idea of Binary Search

• Closely related to the natural algorithm we use to look up a word in a dictionary
  – Open to the middle
  – If target comes before all words on the page, search in left half of book
  – Otherwise, search in right half.

Page 16: 09 searching[1]

Interface for Binary Search

int search(int [] a, int first, int size, int target)

• Parameters:
  – int [] a: array to be searched over
  – Search over a[first,first+1,...,first+size-1]
• Precondition:
  – array is sorted in increasing order
  – first >= 0

Page 17: 09 searching[1]

Implementation

int search (int [] a, int start, int size, int target) {
    if (size <= 0)
        return -1;
    else {
        int middle = start + size/2;
        if (a[middle] == target)
            return middle;
        else if (target < a[middle])
            return search(a, start, size/2, target);
        else
            return search(a, middle+1, size/2, target);
    }
}

Page 18: 09 searching[1]

Implementation

Where’s the error?
• Suppose size is odd. Are new sizes correct?
• Suppose size is even. Are new sizes correct?

Page 19: 09 searching[1]

Implementation

int search (int [] a, int first, int size, int target) {
    if (size <= 0)
        return -1;
    else {
        int middle = first + size/2;
        if (a[middle] == target)
            return middle;
        else if (target < a[middle])
            return search(a, first, size/2, target);
        else
            return search(a, middle+1, (size-1)/2, target);
    }
}
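The size arithmetic in this version can be checked directly. With middle = first + size/2, the left part a[first..middle-1] holds size/2 items. The right part a[middle+1..first+size-1] holds size - size/2 - 1 items, which equals (size-1)/2 under integer division for both odd and even sizes. The sketch below wraps the same code in a class so it can be run; the class name and the sample array are assumptions.

```java
// Sketch: the recursive binary search above, made runnable.
final class BinarySearchSketch {
    /** Searches a[first .. first+size-1] (sorted ascending); returns an index of target or -1. */
    static int search(int[] a, int first, int size, int target) {
        if (size <= 0) return -1;
        int middle = first + size / 2;
        if (a[middle] == target) return middle;
        else if (target < a[middle])
            return search(a, first, size / 2, target);            // left part: size/2 items
        else
            return search(a, middle + 1, (size - 1) / 2, target); // right part: (size-1)/2 items
    }

    public static void main(String[] args) {
        int[] a = {2, 4, 6, 8};                        // even size exercises the boundary case
        System.out.println(search(a, 0, a.length, 8)); // largest item
        System.out.println(search(a, 0, a.length, 9)); // larger than every item
    }
}
```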

Page 20: 09 searching[1]

Boundary Cases

• Binary search is sometimes tricky to get right.
• A common source of bugs.
• Test cases are not always helpful for checking correctness of code.
• How many test cases would our first implementation solve?
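One way to explore the last question is to run the first implementation against a plain serial scan on many small inputs and count agreements. The harness below is a sketch under assumptions: the class name, the helper `agreesWithSerialScan`, and the choice of test arrays (odd numbers 1, 3, 5, ...) are all hypothetical. An out-of-range array access is counted as a failure.

```java
// Sketch: compare the first (uncorrected) version against a serial scan.
final class FirstVersionCheck {
    /** The first implementation, copied as given (right part uses size/2). */
    static int firstVersion(int[] a, int start, int size, int target) {
        if (size <= 0) return -1;
        int middle = start + size / 2;
        if (a[middle] == target) return middle;
        else if (target < a[middle]) return firstVersion(a, start, size / 2, target);
        else return firstVersion(a, middle + 1, size / 2, target);
    }

    /** True if firstVersion returns what a serial scan returns, without an index error. */
    static boolean agreesWithSerialScan(int[] a, int target) {
        int expected = -1;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == target) { expected = i; break; }
        }
        try {
            return firstVersion(a, 0, a.length, target) == expected;
        } catch (ArrayIndexOutOfBoundsException e) {
            return false; // the search stepped outside the array
        }
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 8; n++) {
            int[] a = new int[n];
            for (int i = 0; i < n; i++) a[i] = 2 * i + 1; // 1, 3, 5, ...
            int passed = 0, total = 0;
            for (int target = 0; target <= 2 * n; target++) { // present and absent targets
                total++;
                if (agreesWithSerialScan(a, target)) passed++;
            }
            System.out.println("n=" + n + ": " + passed + "/" + total + " agree");
        }
    }
}
```

A test suite that happens to use only odd-sized arrays can pass entirely, which is one reason test cases alone are a weak check here.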

Page 21: 09 searching[1]

Binary Search with Other Data Structures

• Can binary search be implemented using linked lists rather than arrays?

• Are there any other data structures that could be used?

Page 22: 09 searching[1]

Analysis of Binary Search

• Recursively dividing up array in half represents data as a full binary tree.
• Consider the simplest case -- array of size n = 2^k - 1, complete binary tree.
• Take away one and divide by 2.
• New size = 2^(k-1) - 1.
• We can only do that k times and k = lg(n+1).
• Thus, worst case involves Θ(log n) operations.
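The halving argument can be traced with a few lines of code. The sketch below counts how many elements are examined when the search always continues into the larger remaining part. For sizes of the form 2^k - 1 every intermediate size is odd, so size/2 and (size-1)/2 coincide and the count is exactly k. The class and method names are hypothetical.

```java
// Sketch: count worst-case probes of binary search by repeated halving.
final class BinarySearchProbes {
    /** Worst-case number of elements examined when searching an array of n items. */
    static int worstCaseProbes(int n) {
        int probes = 0;
        int size = n;
        while (size > 0) {
            probes++;        // examine the middle element
            size = size / 2; // continue in the larger part (the left one)
        }
        return probes;
    }

    public static void main(String[] args) {
        for (int k = 1; k <= 10; k++) {
            int n = (1 << k) - 1; // n = 2^k - 1
            System.out.println("n=" + n + " probes=" + worstCaseProbes(n));
        }
    }
}
```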

Page 23: 09 searching[1]

Average Case

• A complete binary tree with k leaves has k-1 internal nodes.
• So, about half of the n data elements require Θ(log n) operations to find.
• Thus, assuming uniform distribution on target elements, average cost is also Θ(log n).

Page 24: 09 searching[1]

Binary Search is Limited

When we have a large number of items that will be accessed in part of the program, where efficiency is crucial, binary search may be too slow.

Page 25: 09 searching[1]

Improving Binary Search

Try to guess more precisely where the key is located in the interval.

Generalize middle = first + size/2 to:

middle = first + (key – a[first]) / (a[first+size-1] – a[first]) * size
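A runnable version of this idea needs a few guards that the formula leaves implicit. The sketch below is one possible iterative form, not the lecture's code. It scales by the number of gaps (hi - lo, that is size - 1) so the guess never leaves the interval, it uses long arithmetic to avoid overflow, and it treats equal endpoints separately to avoid dividing by zero. All names are assumptions.

```java
// Sketch of interpolation search over a sorted int array.
final class InterpolationSearchSketch {
    /** Returns an index of target in a (sorted ascending), or -1 if absent. */
    static int search(int[] a, int target) {
        int lo = 0;
        int hi = a.length - 1;
        // Continue only while target can lie inside a[lo..hi].
        while (lo <= hi && target >= a[lo] && target <= a[hi]) {
            if (a[hi] == a[lo]) {
                return lo; // all values in range are equal, and target is among them
            }
            // Estimate the position from where target falls between the endpoint values.
            long offset = ((long) target - a[lo]) * (hi - lo) / ((long) a[hi] - a[lo]);
            int middle = lo + (int) offset;
            if (a[middle] == target) return middle;
            else if (a[middle] < target) lo = middle + 1;
            else hi = middle - 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30, 40, 50};
        System.out.println(search(a, 30)); // present
        System.out.println(search(a, 35)); // absent
    }
}
```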


Page 26: 09 searching[1]

Interpolation Search

• This modified method is called interpolation search.
• Uses fewer than log(log(N)) comparisons in the average case.
• But uses Θ(N) in the worst case.
• For analysis, see Perl, Itai, Avni, “Interpolation Search – A Log Log N search”, CACM 21 (1978), pages 550–553.
• Is log(log(N)) better than log(N)?


Page 27: 09 searching[1]

Comparing Log N to Log(Log N)

• Suppose N = 2^100
  Log N = 100
  Log(Log N) = Log(100) ≈ 6.64
• Suppose N = 2^(2^100)
  Log N = 2^100
  Log(Log N) = Log 2^100 = 100


Page 28: 09 searching[1]

Comparing Log(N) to Log(Log N)

• Or, by taking limits…
• lim (n→∞) Log(Log(n)) / Log(n) is of the form ∞/∞.
• Apply L’Hôpital and take derivatives.
• lim (n→∞) [1/(Log n) * 1/n] / [1/n] = lim (n→∞) 1/(Log n) = 0


Page 29: 09 searching[1]

Hashing

• Fortunately, we can often do better
• Hashing is a technique where the access time can be O(1) rather than O(log n)

Page 30: 09 searching[1]

Open Address Hashing

The basic technique:
• Items are stored in an array of size N
• The preferred position in the array is computed using a hash function of the item’s key
• When adding an item, if the preferred position is occupied, the next open position in the array is used instead.

Page 31: 09 searching[1]

Open Address Hashing

Main’s presentation for Chapter 11

Page 32: 09 searching[1]

A Basic Hash Table

• We keep arrays for the keys and data, and a bit indicating whether a given position has been occupied

private class Table {
    private int numItems;
    private Object[] keys;
    private Object[] data;
    private boolean[] hasBeenUsed;
    ....
}

Page 33: 09 searching[1]

The Hash Function

• We can use the built in hashCode() method that Java provides

private int hash (Object key) {
    return Math.abs(key.hashCode()) % data.length;
}
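One edge case is worth knowing about with this hash function. `Math.abs(Integer.MIN_VALUE)` is still negative in Java, so a key whose `hashCode()` is `Integer.MIN_VALUE` can yield a negative index. The sketch below contrasts the slide's formula with `Math.floorMod`, which is offered here as an alternative and is not part of the lecture. The class and method names are hypothetical.

```java
// Sketch: two ways to turn a hash code into an array index.
final class HashIndexSketch {
    /** The slide's formula; can be negative when hashCode() is Integer.MIN_VALUE. */
    static int absIndex(Object key, int length) {
        return Math.abs(key.hashCode()) % length;
    }

    /** Alternative using Math.floorMod; always in [0, length) for length > 0. */
    static int floorModIndex(Object key, int length) {
        return Math.floorMod(key.hashCode(), length);
    }

    public static void main(String[] args) {
        Object key = Integer.MIN_VALUE; // an Integer's hash code is its own value
        System.out.println(absIndex(key, 7));
        System.out.println(floorModIndex(key, 7));
    }
}
```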

Page 34: 09 searching[1]

Calculating the Index

// If found return value is index of key
private int findIndex(Object key) {
    int count = 0;
    int i = hash(key);
    while ((count < data.length) && (hasBeenUsed[i])) {
        if (key.equals(keys[i])) return i;
        i = nextIndex(i);
        count++;
    }
    return -1;
}

Page 35: 09 searching[1]

Inserting an Item

public Object put (Object key, Object element) {
    int index = findIndex(key);
    if (index != -1) {
        Object answer = data[index];
        data[index] = element;
        return answer;
    }
    else if (numItems < data.length) {
        ....

Page 36: 09 searching[1]

Inserting an Item

public Object put (Object key, Object element) {
    ...
    else if (numItems < data.length) {
        index = hash(key);
        while (keys[index] != null)
            index = nextIndex(index);
        keys[index] = key;
        data[index] = element;
        hasBeenUsed[index] = true;
        numItems++;
        return null;
    }
    else throw new IllegalStateException("Table full");
    ....
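The fragments from the last few slides can be assembled into one small class. This is a sketch under stated assumptions rather than the textbook's class. The slides do not show `nextIndex`, so linear probing (step to the next cell, wrapping around) is assumed. `Math.floorMod` replaces `Math.abs(...) %` in `hash`, the `get` method is added for demonstration, and the capacity is assumed to be positive.

```java
// Sketch of an open-address hash table with linear probing (no removal).
final class OpenAddressTableSketch {
    private int numItems;
    private final Object[] keys;
    private final Object[] data;
    private final boolean[] hasBeenUsed;

    OpenAddressTableSketch(int capacity) { // assumes capacity > 0
        keys = new Object[capacity];
        data = new Object[capacity];
        hasBeenUsed = new boolean[capacity];
    }

    private int hash(Object key) {
        return Math.floorMod(key.hashCode(), data.length);
    }

    // Assumption: linear probing; the slides leave nextIndex unspecified.
    private int nextIndex(int i) {
        return (i + 1) % data.length;
    }

    /** Index of key, or -1 if the key is not in the table. */
    private int findIndex(Object key) {
        int count = 0;
        int i = hash(key);
        while (count < data.length && hasBeenUsed[i]) {
            if (key.equals(keys[i])) return i;
            i = nextIndex(i);
            count++;
        }
        return -1;
    }

    /** Element stored under key, or null if absent (hypothetical helper). */
    Object get(Object key) {
        int index = findIndex(key);
        return index == -1 ? null : data[index];
    }

    /** Stores element under key; returns the previous element or null. */
    Object put(Object key, Object element) {
        int index = findIndex(key);
        if (index != -1) {                 // key already present: replace
            Object answer = data[index];
            data[index] = element;
            return answer;
        }
        if (numItems == data.length) throw new IllegalStateException("Table full");
        index = hash(key);
        while (keys[index] != null) index = nextIndex(index); // next open position
        keys[index] = key;
        data[index] = element;
        hasBeenUsed[index] = true;
        numItems++;
        return null;
    }

    public static void main(String[] args) {
        OpenAddressTableSketch t = new OpenAddressTableSketch(4);
        t.put(1, "a");
        t.put(5, "b"); // 1 and 5 share the same preferred position when capacity is 4
        System.out.println(t.get(1) + " " + t.get(5));
    }
}
```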

Page 37: 09 searching[1]

Two Hashes are Better than One

• Collisions can result in long stretches of positions with keys not in their “preferred” position
• This is called clustering
• To address this problem, when a collision results we jump a “random” number of positions, using a second hash function

Page 38: 09 searching[1]

Double Hashing

• Find the first position using hash1(key)
• If there’s a collision, step through the array in steps of size hash2(key):
  i = (i + hash2(key)) % data.length
• To avoid cycles, hash2(key) and the length of the array must be relatively prime (no common factors)

Page 39: 09 searching[1]

Double Hashing

• Knuth’s technique to avoid cycles:
• Choose the length of the array so that both data.length and data.length-2 are prime

hash1(key) = Math.abs(key.hashCode()) % length
hash2(key) = 1 + (Math.abs(key.hashCode()) % (length-1))
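The relatively-prime requirement can be seen by walking the probe sequence. The sketch below follows the two formulas above, with `Math.floorMod` substituted for `Math.abs(...) %` as an assumption, and reports whether a key's probe sequence reaches every cell. The class name and the helper `visitsEverySlot` are hypothetical, and the table length is assumed to be at least 2.

```java
// Sketch: does a double-hashing probe sequence reach every cell?
final class DoubleHashingSketch {
    static int hash1(Object key, int length) {
        return Math.floorMod(key.hashCode(), length);
    }

    /** Step size in the range 1 .. length-1 (requires length >= 2). */
    static int hash2(Object key, int length) {
        return 1 + Math.floorMod(key.hashCode(), length - 1);
    }

    /** True if length probes starting at hash1 and stepping by hash2 hit every index. */
    static boolean visitsEverySlot(Object key, int length) {
        boolean[] seen = new boolean[length];
        int i = hash1(key, length);
        int step = hash2(key, length);
        int distinct = 0;
        for (int probes = 0; probes < length; probes++) {
            if (!seen[i]) {
                seen[i] = true;
                distinct++;
            }
            i = (i + step) % length;
        }
        return distinct == length;
    }

    public static void main(String[] args) {
        // 13 and 11 are both prime, as the slide recommends.
        System.out.println(visitsEverySlot(27, 13));
        // With length 12 the step for key 27 shares a factor with the length.
        System.out.println(visitsEverySlot(27, 12));
    }
}
```

With a prime length, every step size from 1 to length-1 is relatively prime to the length, so no key can get stuck in a short cycle.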

Page 40: 09 searching[1]

Issues with O-A Hashing

• Each array cell holds only one element
• Collisions and clustering can degrade performance
• Once the array is full, no more elements can be added, unless we:
  – create a new array with the right size and hash functions
  – re-hash the original elements

Page 41: 09 searching[1]

Chained Hashing

• Each array cell can hold more than one element of the hash table
• Hash the key of each element to obtain the array index
• When a collision happens, the element is still placed at the original hash index
• How is this handled?

Page 42: 09 searching[1]

Answer

• Each array location must be implemented with a data structure that can hold a group of elements with the same hash index
• Most common approach
  – each array location stores the head of a linked list
  – items in the list all have the same hash index

Page 43: 09 searching[1]

Chained Hashing

[Figure: a table array with cells [0], [1], [2], [3], …; each cell points to a chain of nodes with fields element, key, link.]

Any number of elements can be added to the table without a need to rehash
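The picture above translates fairly directly into code. The sketch below is one possible chained table, not the textbook's class. The node fields `element`, `key`, and `link` follow the figure labels. The class name, the `get` and `put` signatures, and the use of `Math.floorMod` for the index are assumptions, and the table size is assumed to be positive.

```java
// Sketch of a chained hash table: each cell heads a linked list of nodes.
final class ChainedTableSketch {
    private static final class Node {
        Object element;
        final Object key;
        final Node link;

        Node(Object key, Object element, Node link) {
            this.key = key;
            this.element = element;
            this.link = link;
        }
    }

    private final Node[] table;

    ChainedTableSketch(int size) { // assumes size > 0
        table = new Node[size];
    }

    private int hash(Object key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    /** Element stored under key, or null if absent. */
    Object get(Object key) {
        for (Node n = table[hash(key)]; n != null; n = n.link) {
            if (key.equals(n.key)) return n.element;
        }
        return null;
    }

    /** Stores element under key; returns the previous element or null. */
    Object put(Object key, Object element) {
        int i = hash(key);
        for (Node n = table[i]; n != null; n = n.link) {
            if (key.equals(n.key)) {       // key already present: replace
                Object answer = n.element;
                n.element = element;
                return answer;
            }
        }
        table[i] = new Node(key, element, table[i]); // new node becomes the head
        return null;
    }

    public static void main(String[] args) {
        ChainedTableSketch t = new ChainedTableSketch(2); // only two cells
        for (int k = 0; k < 10; k++) t.put(k, "value" + k); // more items than cells
        System.out.println(t.get(7));
    }
}
```

The chains grow as items are added, so lookups slow down once the lists get long even though the table never fills up.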