lecture 27. string matching algorithms 1. floyd algorithm help to find the shortest path between...

Lecture 27. String Matching Algorithms 1

Upload: stephanie-flynn

Post on 18-Dec-2015


Page 1: Lecture 27. String Matching Algorithms 1. Floyd algorithm help to find the shortest path between every pair of vertices of a graph. Floyd graph may contain

Lecture 27.

String Matching Algorithms



Recap

The Floyd algorithm finds the shortest path between every pair of vertices of a graph.

The graph may contain negative edges but no negative cycles.

The weight matrix W is represented as follows: W(i,j) = 0 if i = j; W(i,j) = ∞ if there is no edge between i and j; otherwise W(i,j) = weight of the edge from i to j.


Definitions

• Formal Definition of String Matching Problem

• Assume the text is an array T[1..n] of length n and the pattern is an array P[1..m] of length m ≤ n

• This basically means that there is a string array T that contains at least as many characters as string array P. P is said to be the pattern array because it contains a pattern of characters to be searched for in the larger array T.


Definitions

- Alphabet

It is assumed that the elements in P and T are drawn from a finite alphabet Σ.

- Example

Σ = {a, b, …, z}

Σ = {0, 1}

Sigma simply defines what characters are allowed in both the character array to be searched and the character array that contains the pattern to be searched for.


Definitions

- Strings

- Σ* denotes the set of all finite length strings formed by using characters from the alphabet

- The zero length empty string is denoted by ε and is a member of Σ*

- The length of a string x is denoted by |x|

- The concatenation of two strings x and y, denoted xy, has length |x| + |y| and consists of the characters in x followed by the characters in y


Definitions

- Shift

- P occurs with shift s in T if 0 ≤ s ≤ n − m and T[s+1..s+m] = P[1..m]

- If P occurs with shift s in T, then we call s a valid shift

- If P does not occur with shift s in T, we call s an invalid shift


Definitions

- String Concatenation Example

The Concatenator

Σ = {A, B, C, D, E, H, 1, 2, 5, 6, 9}

String X = A125, |X| = 4

String Y = HE69D, |Y| = 5

String Z = XY = A125HE69D, |Z| = |X| + |Y| = 9
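The concatenation example can be checked with a short Python sketch. The helper `over_alphabet` and the constant `SIGMA` are hypothetical names introduced here, and the alphabet is assumed to include the character 5, which the string A125 requires.

```python
# Alphabet from the slide, assumed to also contain '5' so that X = A125 is valid.
SIGMA = set("ABCDEH12569")

def over_alphabet(s, sigma=SIGMA):
    """Return True if every character of s belongs to the alphabet sigma."""
    return all(ch in sigma for ch in s)

x = "A125"
y = "HE69D"
z = x + y  # concatenation: the characters of x followed by those of y

print(z, len(z))                  # the string Z and its length
print(len(z) == len(x) + len(y))  # |xy| = |x| + |y|
print(over_alphabet(z))           # Z is a string over the same alphabet
```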


Definitions

- Prefix

- String w is a prefix of a string x if x = wy for some string y ∈ Σ*

- w ⊏ x means that string w is a prefix of string x

If a string w is a prefix of string x, this means that there exists some string y that, when added onto the back of string w, gives wy = x


Definitions

- Prefix Examples

To Prefix Or Not To Prefix

Σ = {A, B}

Σ* = {ε, A, B, AA, AB, BA, BB, …}

Examples:

String x = AABBAABBABAB

String w = AABBAA

Is w ⊏ x? Why?
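One way to answer the prefix question is to look for the string y from the definition x = wy. The sketch below is only an illustration; `prefix_witness` is a hypothetical helper name, not something from the lecture.

```python
def prefix_witness(w, x):
    """Return the string y with x == w + y, or None if w is not a prefix of x."""
    if len(w) <= len(x) and x[:len(w)] == w:
        return x[len(w):]
    return None

x = "AABBAABBABAB"
w = "AABBAA"

y = prefix_witness(w, x)
print(y)           # the remainder y, if w is a prefix of x
print(w + y == x)  # the defining equation x = wy
```

Here w is a prefix of x because a suitable y exists (y = BBABAB).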


Definitions

- Suffix

- String w is a suffix of a string x if x = yw for some y ∈ Σ*

- w ⊐ x means that string w is a suffix of string x

If a string w is a suffix of string x, this means that there exists some string y that, when added onto the front of string w, gives yw = x


Definitions

- Suffix Examples

Et Tu Suffix?

Σ = {A, B}

Σ* = {ε, A, B, AA, AB, BA, BB, …}

Examples:

String x = AABBAABBABAB

String w = BABBA

Is w ⊐ x? Why?
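The suffix question can be checked the same way, by searching for a string y with x = yw. This is a minimal sketch; `suffix_witness` is a hypothetical helper name, and Python's built-in `str.endswith` is used only as a cross-check.

```python
def suffix_witness(w, x):
    """Return the string y with x == y + w, or None if w is not a suffix of x."""
    if len(w) <= len(x) and x[len(x) - len(w):] == w:
        return x[:len(x) - len(w)]
    return None

x = "AABBAABBABAB"
w = "BABBA"

print(suffix_witness(w, x))        # None: x ends in BABAB, not BABBA
print(x.endswith(w))               # the built-in check agrees
print(suffix_witness("BABAB", x))  # a string that really is a suffix of x
```

Here w is not a suffix of x, because the last five characters of x are BABAB.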


Naïve String Matching Algorithm

- Formal Definition of String Matching Problem

- Assume the text is an array T[1..n] of length n and the pattern is an array P[1..m] of length m ≤ n

This basically means that there is a string array T that contains at least as many characters as string array P. P is said to be the pattern array because it contains a pattern of characters to be searched for in the larger array T.


Basic Explanation

- The Naïve String Matching Algorithm takes the pattern that is being searched for and slides it across the "base" string, looking for a match at each position. It keeps track of how many positions the pattern has been shifted in the variable s, and when a match is found it prints the statement "Pattern occurs with shift s".

- This algorithm is also sometimes known as the Brute Force algorithm.


Algorithm Pseudo Code

NAÏVE-STRING-MATCHER(T, P)
    n ← length[T]
    m ← length[P]
    for s ← 0 to n − m
        do if P[1..m] = T[s+1..s+m]
            then print "Pattern occurs with shift" s
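The pseudocode translates almost line for line into Python. The sketch below uses 0-based indexing, so T[s+1..s+m] becomes the slice `T[s:s + m]`, and it returns the valid shifts in a list rather than printing them; the example text and pattern are made up for illustration.

```python
def naive_string_matcher(T, P):
    """Return every valid shift s at which pattern P occurs in text T."""
    n = len(T)
    m = len(P)
    shifts = []
    for s in range(n - m + 1):   # s = 0, 1, ..., n - m
        if T[s:s + m] == P:      # compare P with the window T[s+1..s+m]
            shifts.append(s)
    return shifts

for s in naive_string_matcher("abcabcab", "abc"):
    print("Pattern occurs with shift", s)
```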


Algorithm Time Analysis

- The worst case for NAÏVE-STRING-MATCHER(T, P) occurs when the pattern is repeated throughout the whole text, so that every shift requires all m character comparisons. An example is searching for the pattern a^m (m copies of a) in the text a^n (n copies of a).


Algorithm Time Analysis

The algorithm is O((n − m + 1) · m)

n = length of string being searched

m = length of substring being compared

The "+ 1" comes from inclusive subtraction: the shifts s = 0, 1, …, n − m number n − m + 1 in total.

Comments:

- The Naïve String Matcher is not an optimal solution

- It is inefficient because information gained about the text for one value of s is entirely ignored in considering other values of s.
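The bound can be made concrete by counting character comparisons. The following sketch assumes the comparison at each shift stops at the first mismatch; `count_comparisons` is a hypothetical helper, and n = 25, m = 6 are arbitrary illustrative values.

```python
def count_comparisons(T, P):
    """Count character comparisons made by the naive matcher on T and P."""
    n, m = len(T), len(P)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if T[s + j] != P[j]:
                break  # stop this shift at the first mismatch
    return comparisons

n, m = 25, 6
worst = count_comparisons("a" * n, "a" * m)  # pattern a^m in text a^n
best = count_comparisons("b" * n, "a" * m)   # first character never matches

print(worst, (n - m + 1) * m)  # worst case reaches the (n - m + 1) * m bound
print(best, n - m + 1)         # one comparison per shift
```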


Boyer Moore Algorithm

A String Matching Algorithm

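Preprocess a pattern P (|P| = n)

For a text T (|T| = m), find all of the occurrences of P in T

Note: in the Boyer Moore slides n is the pattern length and m is the text length, the reverse of the notation used for the naïve matcher.

Time complexity: often quoted as O(n + m), and usually sub-linear in practice; the worst case when finding all matches is Θ(m n), as shown in the worst case analysis.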


Right to Left

Matching the pattern from right to left

For a pattern abc:

     ↓
T: bbacdcbaabcddcdaddaaabcbcb
P: abc

Worst case is still O(n m)


The Bad Character Rule (BCR)

On a mismatch between the pattern and the text, we can shift the pattern by more than one place.

Sublinearity!

T: ddbbacdcbaabcddcdaddaaabcbcb
P: acabc


BCR Preprocessing

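Example pattern (n = 5):

position:  1 2 3 4 5
character: a b a c b

Option 1: a table giving, for each position in the pattern and each character, the size of the shift. O(n |Σ|) space. O(1) access time. In the table below, the entry for a character and a position i is the rightmost position ≤ i at which that character occurs in the pattern (blank if there is none); the shift is the distance from i back to that position.

     1 2 3 4 5
a    1 1 3 3 3
b      2 2 2 5
c          4 4

Option 2: a list of positions for each character. O(n + |Σ|) space. O(n) access time, but O(m) in total.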
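The table for the pattern abacb can be rebuilt with a few lines of Python. This is a sketch under the assumption that each entry records the rightmost 1-based position up to the current column at which the character occurs, with 0 standing in for the blank cells; `rightmost_positions` is a hypothetical name.

```python
def rightmost_positions(P, alphabet):
    """For each character, list the rightmost position <= i (1-based) where it
    occurs in P, for i = 1..len(P); 0 means the character has not occurred yet."""
    table = {}
    for c in alphabet:
        row, last = [], 0
        for pos, ch in enumerate(P, start=1):
            if ch == c:
                last = pos  # remember the most recent occurrence of c
            row.append(last)
        table[c] = row
    return table

table = rightmost_positions("abacb", "abc")
for c in "abc":
    print(c, table[c])
```

The table takes space proportional to n |Σ| and each lookup is a single index operation, which is the trade-off stated for the first option.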


BCR - Summary

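On a mismatch, shift the pattern to the right until the mismatched text character is aligned with its nearest occurrence in P to the left of the mismatch position (or until P has moved past that character if there is no such occurrence).

Still O(n m) worst case running time:

T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: abaaaa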
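A search that uses only the bad character rule might look like the sketch below. It is an illustration rather than the lecture's own code: it scans right to left, looks up the nearest occurrence with `str.rfind` (the O(n) access variant, not the O(1) table), and always shifts by 1 after a full match. The function name and variable names are assumptions.

```python
def bad_character_search(T, P):
    """Find all occurrences of P in T using right-to-left comparison and
    the bad character rule only."""
    plen, tlen = len(P), len(T)
    matches = []
    k = 0  # current 0-based shift of P relative to T
    while k <= tlen - plen:
        j = plen - 1
        while j >= 0 and P[j] == T[k + j]:
            j -= 1                   # compare from right to left
        if j < 0:
            matches.append(k)        # full match at shift k
            k += 1
        else:
            c = T[k + j]             # the mismatched text character
            r = P.rfind(c, 0, j)     # nearest occurrence left of j, or -1
            k += j - r               # always at least 1
    return matches

print(bad_character_search("bbacdcbaabcddcdaddaaabcbcb", "abc"))
print(bad_character_search("a" * 25, "abaaaa"))  # the worst case example above
```

On the worst case example every alignment matches four characters and then shifts by only one position, which is why the rule alone does not improve the O(n m) bound.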


The Good Suffix Rule (GSR)

We want to use the knowledge of the matched characters in the pattern’s suffix.

If we matched S characters of T against the suffix of P, what is the smallest shift of P (if one exists) that aligns another sub-string of P made of the same S characters with them?


GSR (Cont…)

Example 1 – how much to move:

T: bbacdcbaabcddcdaddaaabcbcb
P: cabbabdbab
P after the shift: cabbabdbab


GSR (Cont…)

Example 2 – what if there is no alignment:

T: bbacdcbaabcbbabdbabcaabcbcb
P: bcbbabdbabc
P after the shift: bcbbabdbabc


GSR - Detailed

We mark the matched sub-string in T with t and the mismatched char with x

1. In case of a mismatch: shift P right to the nearest other occurrence of t in P whose preceding char y in P satisfies y ≠ x

2. If there is no such occurrence, shift P right until the largest prefix of P that equals a suffix of t is aligned with that suffix.
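Both cases of the rule can be computed by brute force directly from the definition. The sketch below is an illustration under two assumptions: x is taken to be the pattern character at the mismatch position (the usual "strong" form of the rule), and in the two GSR examples P is assumed to start aligned with the beginning of T, which gives 2 and 3 matched characters respectively. `good_suffix_shift` is a hypothetical name, and real implementations precompute these shifts far more efficiently.

```python
def good_suffix_shift(P, k):
    """Smallest shift s >= 1 that is consistent with k matched characters.

    k is the length of the matched suffix t = P[len(P) - k:].
    """
    m = len(P)
    j = m - k - 1  # index of the mismatched pattern character x (-1 if none)
    for s in range(1, m + 1):
        # After shifting by s, every matched text position must face the same
        # character; positions that fall off the left end of P are unconstrained.
        fits = all(i - s < 0 or P[i - s] == P[i] for i in range(m - k, m))
        # The character that now faces the mismatch must differ from x.
        differs = j < 0 or j - s < 0 or P[j - s] != P[j]
        if fits and differs:
            return s
    return m  # not reached: s = m always satisfies both conditions

print(good_suffix_shift("cabbabdbab", 2))   # Example 1: t = "ab"
print(good_suffix_shift("bcbbabdbabc", 3))  # Example 2: t = "abc"
```

Under these assumptions the first example shifts by 7 (case 1) and the second by 9 (case 2), after which P in the second example lines up with a full occurrence in T.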


Boyer Moore Algorithm

Preprocess(P)

k := n

while (k ≤ m) do

– Match P and T from right to left starting at k

– If a mismatch occurs: shift P right (advance k) by max(good suffix rule, bad char rule).

– else, print the occurrence and shift P right (advance k) by the good suffix rule.
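The loop above can be sketched in Python as follows. This is a minimal illustration, not a production implementation: the preprocessing step is a brute-force computation of the good suffix shifts (real implementations do this in linear time), the bad character lookup uses `str.rfind`, x is assumed to be the pattern character at the mismatch, and all names are hypothetical. As in the slides, k is the 1-based text position aligned with the last character of P.

```python
def boyer_moore(T, P):
    """Return the 0-based shifts of all occurrences of P in T."""
    plen, tlen = len(P), len(T)
    if plen == 0 or plen > tlen:
        return []

    def gs_shift(matched):
        # Brute-force good suffix shift after `matched` characters matched.
        j = plen - matched - 1
        for s in range(1, plen + 1):
            fits = all(i - s < 0 or P[i - s] == P[i]
                       for i in range(plen - matched, plen))
            differs = j < 0 or j - s < 0 or P[j - s] != P[j]
            if fits and differs:
                return s
        return plen

    good = [gs_shift(c) for c in range(plen + 1)]  # Preprocess(P)

    matches = []
    k = plen
    while k <= tlen:
        start = k - plen
        j = plen - 1
        while j >= 0 and P[j] == T[start + j]:     # match right to left
            j -= 1
        if j < 0:
            matches.append(start)                  # occurrence found
            k += good[plen]                        # shift by the good suffix rule
        else:
            bad = j - P.rfind(T[start + j], 0, j)  # bad character shift
            k += max(good[plen - 1 - j], bad)
    return matches

print(boyer_moore("bbacdcbaabcddcdaddaaabcbcb", "abc"))
```

Taking the maximum of the two shifts is safe because neither rule ever skips a valid occurrence on its own.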


Algorithm Correctness

The bad character rule shift never misses a match

The good suffix rule shift never misses a match


Boyer Moore Worst Case Analysis

Assume P consists of n copies of a single char and T consists of m copies of the same char:

T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaaaa

Boyer Moore Algorithm runs in Θ(m n) when finding all the matches


Summary

A string is a sequence of characters; in computer languages such as C/C++ it ends with a special character known as the null character.

A string has prefixes and suffixes. A single character or a whole string can be matched against a given string.

Two important string matching algorithms are the Naïve String Matcher and the Boyer Moore Algorithm, which find the occurrences of a pattern in a given string.


In Next Lecture

In the next lecture we will discuss amortized analysis of different algorithms.