Page 1: String Matching Using the Rabin-Karp Algorithm Katey Cruz CSC 252: Algorithms Smith College 12.12.2000

String Matching Using the Rabin-Karp Algorithm

Katey Cruz

CSC 252: Algorithms

Smith College

12.12.2000

Outline

– String matching problem
– Definition of the Rabin-Karp algorithm
– How Rabin-Karp works
– A Rabin-Karp example
– Complexity
– Real-life applications
– Acknowledgements

String Matching Problem

We assume that the text is an array T[1..n] of length n and that the pattern is an array P[1..m] of length m, where m << n. We also assume that the elements of P and T are characters in a finite alphabet Σ.

(E.g., Σ = {a, b}. We want to find P = 'aab' in T = 'abbaabaaaab'.)

String Matching Problem (continued)

The idea of the string matching problem is that we want to find all occurrences of the pattern P in the given text T.

We could use the brute-force method for string matching, which iterates over T. At each position, we compare the letters of T that start there against P, one by one, until either every letter of P has matched or a mismatch is found.

The worst-case running time of this method is O(nm).
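The brute-force method described above can be sketched as follows. This is a minimal illustration, not code from the original slides: the function name `naive_match` is hypothetical, and shifts are reported 0-based (Python convention) rather than with the 1-based array indexing used on the slides.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        # Compare the window text[s:s+m] with the pattern letter by letter,
        # stopping at the first mismatch.
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:  # every letter matched
            shifts.append(s)
    return shifts


# The example from the problem statement: P = 'aab' in T = 'abbaabaaaab'.
print(naive_match("abbaabaaaab", "aab"))  # [3, 8]
```

The inner loop can run up to m times for each of the n-m+1 shifts, which is where the O(nm) worst case comes from.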

Definition of Rabin-Karp

A string search algorithm which compares a string's hash values, rather than the strings themselves. For efficiency, the hash value of the next position in the text is easily computed from the hash value of the current position.

How Rabin-Karp works

Let the characters in both arrays T and P be digits in radix-d notation, where d is the size of the alphabet.

Let p be the numeric value of the characters in P.

Choose a prime number q such that d·q fits within a computer word, to speed computations.

Compute (p mod q).
– The value of p mod q is what we will be using to find all matches of the pattern P in T.
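As a rough sketch of the "compute p mod q" step: the value can be built up digit by digit (Horner's rule), reducing modulo q at every step so that intermediate results stay small. The helper name `value_mod_q` and the default radix of 10 (decimal digit strings) are assumptions made for illustration only.

```python
def value_mod_q(digits: str, q: int, d: int = 10) -> int:
    """Value of a digit string in radix d, reduced modulo q (Horner's rule)."""
    h = 0
    for ch in digits:
        # Shift the running value one digit to the left, add the new digit,
        # and reduce modulo q so the number never grows beyond d*q.
        h = (d * h + int(ch)) % q
    return h


print(value_mod_q("26", 11))  # 4
```

Because each intermediate value is below q before it is multiplied by d, keeping d·q within a machine word is what keeps every step a single-word operation in languages with fixed-width integers.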

How Rabin-Karp works (continued)

Compute (T[s+1..s+m] mod q) for s = 0..n-m.

Test against P only those sequences in T that have the same (mod q) value as P.

(T[s+1..s+m] mod q) can be computed incrementally from the previous value by subtracting the high-order digit, shifting, and adding the new low-order digit, all in modulo-q arithmetic.
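The incremental update can be sketched like this, assuming decimal digit strings (d = 10). The function name `rolling_hashes` and the variable names are illustrative choices, not taken from the slides.

```python
def rolling_hashes(text: str, m: int, q: int, d: int = 10) -> list[int]:
    """Hash (mod q) of every length-m window of a digit string, left to right."""
    n = len(text)
    if m < 1 or m > n:
        return []
    h = pow(d, m - 1, q)  # weight of the high-order digit, reduced mod q

    # Hash of the first window, computed digit by digit.
    t = 0
    for ch in text[:m]:
        t = (d * t + int(ch)) % q
    hashes = [t]

    for s in range(n - m):
        high = int(text[s])      # digit leaving the window
        low = int(text[s + m])   # digit entering the window
        # Subtract the high-order digit, shift by one place, add the new
        # low-order digit. Python's % always returns a non-negative result here.
        t = (d * (t - high * h) + low) % q
        hashes.append(t)
    return hashes


print(rolling_hashes("31415926535", 2, 11))
# [9, 3, 8, 4, 4, 4, 4, 10, 9, 2]
```

Each update takes constant time, so all n-m+1 window hashes cost O(n) after the O(m) setup.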

A Rabin-Karp example

Given T = 31415926535 and P = 26.
We choose q = 11.
P mod q = 26 mod 11 = 4.

A two-digit window slides across T one position at a time:

31 mod 11 = 9, not equal to 4

14 mod 11 = 3, not equal to 4

41 mod 11 = 8, not equal to 4

Rabin-Karp example (continued)

15 mod 11 = 4, equal to 4 -> spurious hit

59 mod 11 = 4, equal to 4 -> spurious hit

92 mod 11 = 4, equal to 4 -> spurious hit

26 mod 11 = 4, equal to 4 -> an exact match!

65 mod 11 = 10, not equal to 4

Rabin-Karp example (continued)

53 mod 11 = 9, not equal to 4

35 mod 11 = 2, not equal to 4

As we can see, when the hash values match, further testing is done to ensure that a true match has indeed been found.
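Putting the pieces together, the example can be reproduced with a short sketch. It assumes decimal digit strings and 0-based shifts; the name `rabin_karp` and the choice to return spurious hits alongside true matches are illustrative, made so the output can be compared with the walkthrough above.

```python
def rabin_karp(text: str, pattern: str, q: int, d: int = 10):
    """Return (matches, spurious): 0-based shifts of true matches and of
    hash hits that failed the character-by-character check."""
    n, m = len(text), len(pattern)
    matches, spurious = [], []
    if m == 0 or m > n:
        return matches, spurious

    h = pow(d, m - 1, q)  # weight of the high-order digit, mod q
    p = t = 0
    for i in range(m):    # hash of the pattern and of the first window
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q

    for s in range(n - m + 1):
        if p == t:
            # Equal hashes only suggest a match; verify the actual digits.
            if text[s:s + m] == pattern:
                matches.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            # Roll the window one position to the right.
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return matches, spurious


matches, spurious = rabin_karp("31415926535", "26", q=11)
print(matches)   # [6]
print(spurious)  # [3, 4, 5]
```

The three spurious shifts correspond to the windows 15, 59, and 92, and the single true match is the window 26.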

Complexity

The running time of the Rabin-Karp algorithm in the worst case is O((n-m+1)m), but it has a good average-case running time.

If the expected number of valid shifts is small, O(1), and the prime q is chosen to be quite large, then the Rabin-Karp algorithm can be expected to run in time O(n+m) plus the time required to process spurious hits.
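To make the role of q concrete, the sketch below counts how many shifts produce a hash hit, since each hit triggers a verification costing up to m character comparisons. It assumes decimal digit strings; the function name `count_hash_hits` and the test inputs (other than the example from the slides) are hypothetical.

```python
def count_hash_hits(text: str, pattern: str, q: int, d: int = 10) -> int:
    """Number of shifts whose window hash equals the pattern hash (mod q)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    h = pow(d, m - 1, q)
    p = t = 0
    for i in range(m):
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    hits = 0
    for s in range(n - m + 1):
        if p == t:
            hits += 1  # a verification of up to m comparisons happens here
        if s < n - m:
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return hits


# Small prime: one true match plus three spurious hits.
print(count_hash_hits("31415926535", "26", q=11))   # 4
# Larger prime: only the true match remains.
print(count_hash_hits("31415926535", "26", q=101))  # 1
# Worst case: every shift is a valid match, so all n-m+1 shifts are verified.
print(count_hash_hits("1" * 20, "1" * 5, q=101))    # 16
```

In the last call no choice of q helps, because every hit is genuine; this is the situation behind the O((n-m+1)m) worst case.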

Applications

Bioinformatics

– Used in looking for similarities between two or more proteins; i.e., high sequence similarity usually implies significant structural or functional similarity.

Example:

Hb A_human  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
            G+ +VK+HGKKV A++++++AH+ D++ ++ +++LS+LH KL
Hb B_human  GNPKVKAHGKKVLGAFSDGLAH LDNLKGTF ATLSELH CDKL

(+ marks similar amino acids)

Applications (continued)

Alpha hemoglobin and beta hemoglobin are subunits that make up a protein called hemoglobin in red blood cells. Notice the similarities between the two sequences, which probably signify functional similarity.

Many distantly related proteins have domains that are similar to each other, such as the DNA binding domain or cation binding domain. To find regions of high similarity within multiple protein sequences, local alignment must be performed. The local alignment of sequences may provide information about similar functional domains present among distantly related proteins.

Acknowledgements

– Cormen, Thomas H., et al. Introduction to Algorithms. Cambridge: MIT Press, 1997.

– Go2Net website for string matching algorithms: www.go2net.com/internet/deep/1997/05/14/body.html

– Yummy Yummy Animations site, for an animation of the Rabin-Karp algorithm at work: www.mills.edu/ACAD_INFO/MCS/CS/S00MCS125/String.Matching.Algorithms/animations.html

– National Institute of Standards and Technology, Dictionary of Algorithms, Data Structures, and Problems: hissa.nist.gov/dads/HTML/rabinKarpAlgo.html