Advanced Algorithms, Piyush Kumar (Lecture 12: String Matching/Searching). Welcome to COT5405. Source: S. Šaltenis

Upload: brett-hicks

Post on 19-Dec-2015




Computing Edit Distance

• Two text strings are given: X and Y

• We want to quantify how similar they are:

– Comparing DNA sequences in studies of evolution of different species

– Spell checkers

• One of the measures of similarity is the edit distance between X and Y

(small distance <---> high similarity)


Edit Distance: Definition

• We want to convert X into Y by repeatedly performing one of three operations:

• Delete a letter, insert a letter, or substitute one letter for another.

• E.g. X = “ACGGTTA” can be converted to Y = “CGTAT” by deleting the 1st A and the 2nd G, and substituting T→A and A→T in the last two positions.

ACGGTTA
_CG_TAT


Edit Distance: Definition

• We want to convert X into Y by repeatedly performing one of three operations:

• Delete a letter, insert a letter, or substitute one letter for another.

• The minimum number of these operations that convert X into Y is called the edit distance between X and Y.


Edit Distance: Optimal Substructure

• Denote by E(i,j) the edit distance between the i-th prefix of X (x1 x2 …xi) and the j-th prefix of Y (y1 y2 …yj)

– If xi=yj, then E(i,j)=E(i-1,j-1)

– If xi≠yj,

• Either substitute xi→yj (cost is 1+ E(i-1,j-1) )

• or delete xi (cost is 1+ E(i-1,j) )

• or insert yj (cost is 1+ E(i,j-1) )

• Decide which decision to do by comparing the three values, taking the minimum one.

– “Cut-and-paste” argument
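The case analysis above can be collected into a single recurrence. This is only a restatement of the slide; the two base cases are the distances to an empty prefix (i deletions or j insertions).

```latex
E(i,j) =
\begin{cases}
i & \text{if } j = 0,\\
j & \text{if } i = 0,\\
E(i-1,j-1) & \text{if } x_i = y_j,\\
1 + \min\{\, E(i-1,j-1),\; E(i-1,j),\; E(i,j-1) \,\} & \text{if } x_i \neq y_j.
\end{cases}
```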


Edit Distance: Computing

• Let n be the length of the word X, and let m be the length of Y.

• To compute E[i,j] (the edit distance of (Xi, Yj)) we construct a 2-dimensional array (in Scheme – vector of vectors) of size (n+1)x(m+1).

• We initialize the array at the leftmost column and topmost row: E(i,0)=i, E(0,j)=j (the edit distance to an empty word).


• To fill entry E(i,j), we need the three “former values” E(i-1,j-1), E(i,j-1), E(i-1,j). Having these, we use the recurrence we saw to fill E(i,j).

• Desired value is E(n,m).

• Observe: conditions in the problem restrict sub-problems (What is the total number of sub-problems?)
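The table-filling procedure described above can be sketched in Python as follows. The function name `edit_distance_table` is illustrative, and a list of lists stands in for the Scheme vector of vectors mentioned in the slides.

```python
def edit_distance_table(x, y):
    """Return the (n+1) x (m+1) table E, where E[i][j] is the edit
    distance between the first i letters of x and the first j of y."""
    n, m = len(x), len(y)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = i                      # leftmost column: delete i letters
    for j in range(m + 1):
        E[0][j] = j                      # topmost row: insert j letters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:     # x_i = y_j (strings are 0-indexed)
                E[i][j] = E[i - 1][j - 1]
            else:
                E[i][j] = 1 + min(E[i - 1][j - 1],   # substitute
                                  E[i - 1][j],       # delete x_i
                                  E[i][j - 1])       # insert y_j
    return E


E = edit_distance_table("ACGGTTA", "CGTAT")
print(E[-1][-1])   # the desired value E(n, m)
```

Each of the (n+1)(m+1) entries is filled once from three earlier entries, so the sketch runs in O(nm) time.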


X = ACGGTTA (rows i), Y = CGTAT (columns j)

       j=0   1   2   3   4   5
i=0      0   1   2   3   4   5
i=1      1   1   2   3   3   4
i=2      2
i=3      3
i=4      4
i=5      5
i=6      6
i=7      7                 ???


Edit Distance: Example

• Let's do a dry run with X = “ACGGTTA”, Y = “CGTAT”
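A dry run can also be checked mechanically. The sketch below fills the table with the recurrence from the earlier slides and then walks back from E(n,m) to recover one optimal sequence of operations. The function name `edit_script` and the wording of the operation strings are assumptions made for illustration; other optimal sequences may exist.

```python
def edit_script(x, y):
    """Return (distance, operations) for converting x into y."""
    n, m = len(x), len(y)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = i
    for j in range(m + 1):
        E[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                E[i][j] = E[i - 1][j - 1]
            else:
                E[i][j] = 1 + min(E[i - 1][j - 1], E[i - 1][j], E[i][j - 1])

    # Walk back from E(n, m), recording which case produced each entry.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
            i, j = i - 1, j - 1                      # letters match: no cost
        elif i > 0 and j > 0 and E[i][j] == E[i - 1][j - 1] + 1:
            ops.append(f"substitute {x[i - 1]}->{y[j - 1]} at position {i}")
            i, j = i - 1, j - 1
        elif i > 0 and E[i][j] == E[i - 1][j] + 1:
            ops.append(f"delete {x[i - 1]} at position {i}")
            i -= 1
        else:
            ops.append(f"insert {y[j - 1]} after position {i}")
            j -= 1
    return E[n][m], ops[::-1]


distance, ops = edit_script("ACGGTTA", "CGTAT")
for op in ops:
    print(op)
```

Positions in the printed operations are 1-based positions in X, matching the x1 x2 … xi notation of the slides.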


Short Introduction to Search Engines


Applications

• ?


Typical Web Search Engine Architecture

[Figure: architecture diagram. Labeled components: crawl the web; check for duplicates, store the documents; create an inverted index; inverted index; search engine servers; user query; DocIds; show results to user.]

Courtesy R. Ramakrishnan


Goals

• Speed

• Space Efficiency

• Accuracy: “The first item should be what I want to see?”

• Updates: Periodic? Dynamic?


Typical Methods

• Full Text scanning (egrep?)

• Inverted File Indexing (Most common)

• Signature Files

• Vector Space Model


Types of queries

• Boolean

• Proximity? (Edit Distance?)

• In relation to other documents.

• FileType + Keywords

Allow for:

• Prefix matches?

• Wildcards?

• Edit distance bounds. (egrep)


Common Tricks

• Case folding: Tallahassee = tallahassee.

• Stemming: compress = compressed = compression (off-the-shelf stemmers available for English)

• Ignore words: a, the, it, be, …

• Thesaurus: fast = rapid (typically use available clustering)
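A minimal sketch of how these tricks might be chained when turning raw text into index terms. Everything here is a toy assumption: the stop-word set contains only the slide's four examples, the suffix list is a crude stand-in for a real stemmer, and the one-entry synonym table stands in for a thesaurus.

```python
import re

STOP_WORDS = {"a", "the", "it", "be"}   # only the slide's examples
SUFFIXES = ("ion", "ing", "ed")         # crude stand-in for a real stemmer
SYNONYMS = {"rapid": "fast"}            # toy thesaurus


def crude_stem(word):
    """Strip one known suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Return the list of index terms extracted from text."""
    terms = []
    for token in re.findall(r"[A-Za-z]+", text):
        token = token.lower()                     # ignore case
        if token in STOP_WORDS:                   # ignore common words
            continue
        token = crude_stem(token)                 # stemming
        terms.append(SYNONYMS.get(token, token))  # thesaurus lookup
    return terms


print(normalize("Compress compressed compression"))
print(normalize("The rapid Tallahassee"))
```

A production system would replace `crude_stem` with an off-the-shelf stemmer, as the slide suggests.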


Inverted File Index

• Periodically rebuilt, static otherwise.

• Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term       Doc #
now        1
is         1
the        1
time       1
for        1
all        1
good       1
men        1
to         1
come       1
to         1
the        1
aid        1
of         1
their      1
country    1
it         2
was        2
a          2
dark       2
and        2
stormy     2
night      2
in         2
the        2
country    2
manor      2
the        2
time       2
was        2
past       2
midnight   2
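The parsing step can be sketched as follows, using the two example documents. The helper name `tokenize` and the choice to lowercase and split on non-letters are assumptions; the slides do not specify a tokenizer.

```python
import re

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def tokenize(doc_id, text):
    """Yield one (term, doc_id) pair per word occurrence, in text order."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield (word, doc_id)


pairs = [pair for doc_id, text in docs.items() for pair in tokenize(doc_id, text)]
print(len(pairs), pairs[:3])
```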


How Inverted Files are Created

• After all documents have been parsed the inverted file is sorted alphabetically.

Term       Doc #
a          2
aid        1
all        1
and        2
come       1
country    1
country    2
dark       2
for        1
good       1
in         2
is         1
it         2
manor      2
men        1
midnight   2
night      2
now        1
of         1
past       2
stormy     2
the        1
the        1
the        2
the        2
their      1
time       1
time       2
to         1
to         1
was        2
was        2


How Inverted Files are Created

• Multiple term entries for a single document are merged.

• Within-document term frequency information is compiled.

Term       Doc #   Freq
a          2       1
aid        1       1
all        1       1
and        2       1
come       1       1
country    1       1
country    2       1
dark       2       1
for        1       1
good       1       1
in         2       1
is         1       1
it         2       1
manor      2       1
men        1       1
midnight   2       1
night      2       1
now        1       1
of         1       1
past       2       1
stormy     2       1
the        1       2
the        2       2
their      1       1
time       1       1
time       2       1
to         1       2
was        2       2
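The sort and merge steps can be sketched on an excerpt of the (term, document) pairs from the example. Using `collections.Counter` to do the merging is a convenience of this sketch, not something the slides prescribe.

```python
from collections import Counter

# Excerpt of the (term, doc) pairs from the two example documents.
pairs = [("now", 1), ("the", 1), ("time", 1), ("to", 1), ("to", 1), ("the", 1),
         ("was", 2), ("the", 2), ("the", 2), ("time", 2), ("was", 2)]

# Step 1: sort alphabetically by term, then by document number.
sorted_pairs = sorted(pairs)

# Step 2: merge duplicate (term, doc) entries, counting within-document frequency.
merged = sorted(Counter(sorted_pairs).items())
postings = [(term, doc, freq) for (term, doc), freq in merged]

for row in postings:
    print(row)
```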


How Inverted Files are Created

• Finally, the file can be split into

– A Dictionary or Lexicon file and

– A Postings file


How Inverted Files are Created

Dictionary/Lexicon:

Term       N docs   Tot Freq
a          1        1
aid        1        1
all        1        1
and        1        1
come       1        1
country    2        2
dark       1        1
for        1        1
good       1        1
in         1        1
is         1        1
it         1        1
manor      1        1
men        1        1
midnight   1        1
night      1        1
now        1        1
of         1        1
past       1        1
stormy     1        1
the        2        4
their      1        1
time       2        2
to         1        2
was        1        2

Postings (one row per term/document pair, in dictionary order):

Doc #   Freq
2       1
1       1
1       1
2       1
1       1
1       1
2       1
2       1
1       1
1       1
2       1
1       1
2       1
2       1
1       1
2       1
2       1
1       1
1       1
2       1
2       1
1       2
2       2
1       1
1       1
2       1
1       2
2       2
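The split into a dictionary and a postings structure can be sketched as below, on an excerpt of the merged (term, doc, freq) rows. Holding both parts in in-memory Python dicts is an assumption for illustration; the slides describe them as separate files.

```python
from collections import defaultdict

# Excerpt of the merged (term, doc, freq) rows from the example.
merged = [("country", 1, 1), ("country", 2, 1), ("the", 1, 2), ("the", 2, 2),
          ("time", 1, 1), ("time", 2, 1), ("to", 1, 2), ("was", 2, 2)]

# Postings: term -> list of (doc, freq) entries.
postings = defaultdict(list)
for term, doc, freq in merged:
    postings[term].append((doc, freq))

# Dictionary: term -> (number of documents, total frequency).
dictionary = {term: (len(plist), sum(freq for _, freq in plist))
              for term, plist in postings.items()}

print(dictionary["the"], postings["the"])
```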


Why use Inverted Files?

• Permits fast search for individual terms

• For Boolean queries.

• For statistical ranking algorithms.


Issues with Inverted files?

• How to minimize the space taken by the postings list?

• Access to the lexicon?

• How to do union and intersection of postings.
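One common answer to the last question, sketched under the assumption that each postings list is a sorted list of document IDs: a linear merge, as in merge sort, gives both the intersection (Boolean AND) and the union (Boolean OR) in time proportional to the combined list lengths.

```python
def intersect(a, b):
    """Document IDs present in both sorted lists (Boolean AND)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Document IDs present in either sorted list (Boolean OR)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]   # at most one of the two tails is non-empty


print(intersect([3, 5, 20, 21, 23], [5, 21, 40]))
print(union([3, 5, 20, 21, 23], [5, 21, 40]))
```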


Minimizing Space

• Store postings with deltas

– Original posting list: 3, 5, 20, 21, 23

– Delta Encoding: 3, 2, 15, 1, 2

• Use compression on delta encoding

– Huffman, Arithmetic
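The delta step itself is easy to sketch; the function names are illustrative. The entropy coding stage (Huffman or arithmetic) is not shown.

```python
from itertools import accumulate


def delta_encode(postings):
    """Replace each document ID by its gap from the previous one."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def delta_decode(gaps):
    """Recover the original IDs as running sums of the gaps."""
    return list(accumulate(gaps))


gaps = delta_encode([3, 5, 20, 21, 23])
print(gaps)
print(delta_decode(gaps))
```

The gaps are small positive integers for dense lists, which is what makes the subsequent compression effective.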


Access to Lexicon?

• Static:

– Sorted arrays.

– Perfect Hashing

• Dynamic

– Tries

– B-Trees

Prefix Matching?
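For the static sorted-array option, a sketch with Python's `bisect` module shows that binary search answers both exact lookups and prefix queries, since all terms sharing a prefix are contiguous in sorted order. The word list is borrowed from the trie example later in the lecture; the function names are illustrative.

```python
from bisect import bisect_left

lexicon = sorted(["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"])


def contains(term):
    """Exact lookup by binary search."""
    i = bisect_left(lexicon, term)
    return i < len(lexicon) and lexicon[i] == term


def with_prefix(prefix):
    """All terms starting with prefix, found as one contiguous slice."""
    lo = bisect_left(lexicon, prefix)
    hi = lo
    while hi < len(lexicon) and lexicon[hi].startswith(prefix):
        hi += 1
    return lexicon[lo:hi]


print(contains("bid"), with_prefix("be"), with_prefix("st"))
```

A hash table, by contrast, supports exact lookups but gives no help with prefix queries.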


Tries

[Figure: trie with edge labels e, i, mi, mize, nimize, ze. Courtesy Tamassia & Goodrich.]

Useful for ReTrieval

First appearance: 1959

Radix Search?


Preprocessing Strings

• Preprocessing the pattern speeds up pattern matching queries

– After preprocessing the pattern, KMP’s algorithm performs pattern matching in time proportional to the text size

• If the text is large, immutable and searched for often (e.g., works by Shakespeare), we may want to preprocess the text instead of the pattern

• A trie is a compact data structure for representing a set of strings, such as all the words in a text

– A trie supports pattern matching queries in time proportional to the pattern size


Standard Tries (§ 11.3.1)

• The standard trie for a set of strings S is an ordered tree such that:

– Each node but the root is labeled with a character

– The children of a node are alphabetically ordered

– The paths from the root to the external nodes yield the strings of S

• Example: standard trie for the set of strings

S = { bear, bell, bid, bull, buy, sell, stock, stop }

[Figure: standard trie for S]
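A minimal sketch of a standard trie for the example set S. The class and function names are illustrative. Children are held in a Python dict, so they are not kept in alphabetical order and each child lookup takes expected constant time; the ordered-children definition on the slide would use a sorted structure instead.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # character -> TrieNode
        self.is_word = False    # True if a string of S ends at this node


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


def count_nodes(node):
    """Number of nodes in the subtree rooted at node, including node."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


root = TrieNode()
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    insert(root, w)

print(contains(root, "bell"), contains(root, "be"), count_nodes(root))
```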


Analysis of Standard Tries

• A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where:

– n: total size of the strings in S

– m: size of the string parameter of the operation

– d: size of the alphabet

[Figure: standard trie for S = { bear, bell, bid, bull, buy, sell, stock, stop }]


Applications of Tries

• A standard trie supports the following operations on a pre-processed text in time O(m), where m is the size of word X:

– Word Matching: find the first occurrence of the word X in the text.

– Prefix Matching: find the first occurrence of the longest prefix of word X in the text.


Word Matching with a Trie

• We insert the words of the text into a trie

• Each leaf stores the occurrences of the associated word in the text

Text (89 characters, indexed 0 to 88), shown with the starting index of each row:

 0: see a bear? sell stock!
24: see a bull? buy stock!
47: bid stock!
58: bid stock!
69: hear the bell? stop!

Occurrences stored at the leaves of the trie:

Word    Occurrences
bear    6
bell    78
bid     47, 58
bull    30
buy     36
hear    69
see     0, 24
sell    12
stock   17, 40, 51, 62
stop    84
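Word matching with occurrence lists can be sketched as follows. The trie is a nested dict, the key `"$"` marks a node where a word ends and holds its occurrence list, and the stop-word set {a, the} is inferred from the words missing in the figure; all three choices are assumptions of this sketch.

```python
import re

text = ("see a bear? sell stock! see a bull? buy stock! "
        "bid stock! bid stock! hear the bell? stop!")
STOP_WORDS = {"a", "the"}


def build_trie(text):
    """Insert every non-stop word; store its start indices under '$'."""
    root = {}
    for match in re.finditer(r"[a-z]+", text):
        if match.group() in STOP_WORDS:
            continue
        node = root
        for ch in match.group():
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(match.start())
    return root


def occurrences(root, word):
    """Word matching: start indices of word in the text, or []."""
    node = root
    for ch in word:
        if ch not in node:
            return []
        node = node[ch]
    return node.get("$", [])


def longest_prefix(root, word):
    """Prefix matching: longest prefix of word that labels a path in the trie."""
    node, length = root, 0
    for ch in word:
        if ch not in node:
            break
        node = node[ch]
        length += 1
    return word[:length]


root = build_trie(text)
print(occurrences(root, "stock"), longest_prefix(root, "stomp"))
```

Both queries touch at most one trie node per character of the query word, independent of the text length.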


Compressed Tries

• Solves the following problems in the standard trie.

– Creation of extra nodes in the trie (Path Compression)

• Just a different representation of the standard trie.

First appearance: 1968


Compressed Tries

• A compressed trie has internal nodes of degree at least two

• It is obtained from standard trie by compressing chains of “redundant” nodes

[Figure: compressed trie with edge labels b (e (ar, ll), id, u (ll, y)) and s (ell, to (ck, p)), shown next to the standard trie for the same set of strings]
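Path compression can be sketched by first building a standard trie and then merging every chain of single-child nodes into one edge with a multi-character label. The nested-dict representation and the `"$"` end marker are assumptions of this sketch.

```python
END = "$"   # marker key: a string of the set ends at this node


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def compress(node):
    """Return a copy of the trie with chains of redundant nodes merged."""
    out = {}
    for label, child in node.items():
        if label == END:
            out[END] = True
            continue
        # Extend the edge while the child has exactly one outgoing edge
        # and no string ends at it.
        while len(child) == 1 and END not in child:
            next_label, next_child = next(iter(child.items()))
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out


words = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
compressed = compress(build_trie(words))
print(sorted(compressed["b"]), sorted(compressed["s"]))
```

After compression every internal node has at least two children, so the number of nodes is proportional to the number of strings rather than to their total length.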


Compact Representation

• Compact representation of a compressed trie for an array of strings:

– Stores at the nodes ranges of indices instead of substrings

– Uses O(s) space, where s is the number of strings in the array

– Serves as an auxiliary index structure

S[0] = see
S[1] = bear
S[2] = sell
S[3] = stock
S[4] = bull
S[5] = buy
S[6] = bid
S[7] = hear
S[8] = bell
S[9] = stop

[Figure: compressed trie whose nodes carry index triples i, j, k, each standing for characters j through k of S[i]. Triples shown: (1, 0, 0), (7, 0, 3), (0, 0, 0); (1, 1, 1), (6, 1, 2), (4, 1, 1), (0, 1, 1), (3, 1, 2); (1, 2, 3), (8, 2, 3), (4, 2, 3), (5, 2, 2), (0, 2, 2), (2, 2, 3), (3, 3, 4), (9, 3, 3).]
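Reading a triple back into its edge label is a one-line slice. The sketch assumes the convention that a triple i, j, k denotes characters j through k (inclusive) of S[i], which is consistent with the triples and strings in the figure; the function name `label` is illustrative.

```python
S = ["see", "bear", "sell", "stock", "bull", "buy", "bid", "hear", "bell", "stop"]


def label(i, j, k):
    """Edge label for the triple (i, j, k): S[i][j..k], both ends inclusive."""
    return S[i][j:k + 1]


for triple in [(1, 0, 0), (1, 2, 3), (3, 1, 2), (7, 0, 3), (9, 3, 3)]:
    print(triple, label(*triple))
```

Each node stores three integers regardless of label length, which is where the O(s) space bound comes from.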


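Suffix Trie (§ 11.3.3)

• The suffix trie of a string X is the compressed trie of all the suffixes of X

X = m i n i m i z e
    0 1 2 3 4 5 6 7

[Figure: suffix trie of “minimize”. Edges from the root: e, i, mi, nimize, ze. Edges below i: mize, nimize, ze. Edges below mi: nimize, ze.]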
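A naive sketch of the idea for X = “minimize”: insert every suffix into an uncompressed trie, after which any substring of X labels a path from the root. This is deliberately not the compressed structure from the slide and takes quadratic time and space in the worst case; it only illustrates why pattern queries cost time proportional to the pattern length. Function names are illustrative.

```python
def build_suffix_trie(x):
    """Uncompressed trie (nested dicts) of all suffixes of x."""
    root = {}
    for start in range(len(x)):
        node = root
        for ch in x[start:]:
            node = node.setdefault(ch, {})
    return root


def is_substring(root, pattern):
    """True if pattern occurs in x: follow one edge per pattern character."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


root = build_suffix_trie("minimize")
print(sorted(root), is_substring(root, "nimi"), is_substring(root, "zem"))
```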


Analysis of Suffix Tries

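• Compact representation of the suffix trie for a string X of size n from an alphabet of size d

– Uses O(n) space

– Supports arbitrary pattern matching queries in X in O(dm) time, where m is the size of the pattern

– Can be constructed in O(n) time

X = m i n i m i z e
    0 1 2 3 4 5 6 7

[Figure: suffix trie of “minimize” with each edge label replaced by an index range into X. Edges from the root: (7, 7), (1, 1), (0, 1), (2, 7), (6, 7). Edges below (1, 1): (4, 7), (2, 7), (6, 7). Edges below (0, 1): (2, 7), (6, 7).]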


Tries and Web Search Engines

• The index of a search engine (collection of all searchable words) is stored in a compressed trie.

• Each leaf of the trie is associated with a word and a list of pages (URLs) containing that word (called the occurrence list).

• The trie is kept in internal memory.

• The occurrence lists are kept in external memory and are ranked by relevance.


Tries and Web Search Engines

• Boolean queries for sets of words (e.g. Java and coffee) correspond to sets of operations (e.g. intersection) on the occurrence lists.

• Additional information retrieval techniques are used, such as:

– Stopword Elimination (as done in the standard tries example).

– Stemming (e.g. identify “add”, “adding” and “added” as the same word).

– Link Analysis (recognise authoritative pages).