Faster Algorithm for String Matching with k Mismatches

Amihood Amir, Moshe Lewenstein, Ely Porat
Journal of Algorithms, Vol. 50, 2004, pp. 257-275
Date: Nov. 26, 2004
Created by: Hsing-Yen Ann




2004/11/22 Hsing-Yen Ann

Abstract

The string matching with mismatches problem is that of finding the number of mismatches between a pattern P of length m and every length m substring of the text T. Currently, the fastest algorithms for this problem are the following. The Galil–Giancarlo algorithm finds all locations where the pattern has at most k errors (where k is part of the input) in time O(nk).


Abstract (cont’d)

The Abrahamson algorithm finds the number of mismatches at every location in time $O(n\sqrt{m \log m})$. We present an algorithm that is faster than both. Our algorithm finds all locations where the pattern has at most k errors in time $O(n\sqrt{k \log k})$. We also show an algorithm that solves the above problem in time $O((n + nk^3/m) \log k)$.


Problem Definition

String matching with k mismatches:

Input:
  Text T = t1 t2 ... tn
  Pattern P = p1 p2 ... pm
  A natural number k

Output:
  All pairs <i, ham(P, T[i, i+m-1])>,
  where 1 ≤ i ≤ n-m+1 and ham(P, T[i, i+m-1]) ≤ k

ham(): Hamming distance (number of errors)
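In symbols, the Hamming distance used in the output can be written as follows. This is only a restatement of the definition above; the index j over pattern positions is introduced here for illustration.

```latex
\[
\mathrm{ham}\bigl(P,\, T[i, i+m-1]\bigr) \;=\; \bigl|\{\, j : 1 \le j \le m,\ p_j \ne t_{i+j-1} \,\}\bigr|
\]
```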


Two Types of Solving Strategies

1. Finding all Hamming distances + linear scan.
   Previous: $O(n\sqrt{m \log m})$

2. Finding the locations with at most k errors directly.
   Previous: O(nk)
   Choose strategy 1 when $k > \sqrt{m \log m}$.
   Improved to $O(n\sqrt{k \log k})$ in this paper by using strategy 2.


Two Types of Solving Strategies (cont’d)

Example:

T = a b b b c a a b c, P = a b c, k = 2

position i                      1    2    3    4    5    6    7    8    9
T                               a    b    b    b    c    a    a    b    c
all Hamming distances + scan    1    2    1    3    3    2    0  N/A  N/A

matched location i              1    2    3    6    7
Hamming distance (≤ k)          1    2    1    2    0
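The difference between the two strategies can be sketched in a few lines of Python. Both functions below are naive O(nm) baselines written for illustration only (the function names are hypothetical and neither is the algorithm of the paper); the first computes every Hamming distance and then scans, the second looks only for locations with at most k errors and abandons an alignment as soon as it exceeds k.

```python
def hamming(p, window):
    """Number of positions where p and an equally long window differ."""
    return sum(a != b for a, b in zip(p, window))


def all_distances_then_scan(t, p, k):
    """Strategy 1: compute every distance, then keep those that are <= k."""
    m = len(p)
    dist = [hamming(p, t[i:i + m]) for i in range(len(t) - m + 1)]
    # Report 1-based locations, as on the slide.
    return [(i + 1, d) for i, d in enumerate(dist) if d <= k]


def direct_k_mismatch(t, p, k):
    """Strategy 2: report only locations with <= k errors, stopping early."""
    m = len(p)
    out = []
    for i in range(len(t) - m + 1):
        d = 0
        for a, b in zip(p, t[i:i + m]):
            d += a != b
            if d > k:
                break            # more than k errors: give up on this location
        else:
            out.append((i + 1, d))
    return out


if __name__ == "__main__":
    print(all_distances_then_scan("abbbcaabc", "abc", 2))
    print(direct_k_mismatch("abbbcaabc", "abc", 2))
```

Both calls return the same list of (location, distance) pairs; they differ only in how much work is spent on locations that end up rejected.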


Algorithm for Solving this Problem

Two-stage algorithm:

Marking stage
  Identifying the potential starts of the pattern.
  Reducing the number of locations to be verified.
  The focus of this paper.

Verification stage
  Verifying which of the potential candidates is indeed a pattern occurrence.
  Using the Kangaroo method for speed-up.


Kangaroo Method

Introduced by Landau and Vishkin.
Using suffix trees + Lowest Common Ancestor.
Constant-time "jumps" over equal substrings in the text and pattern.
O(1) for jumping to the next mismatch.
O(k) for verifying a candidate location with k mismatches.

[Figure: pattern P aligned against text T, with three O(1) jumps over the matching substrings between mismatches.]
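The verification loop can be sketched as below. This is a minimal illustration under one loud assumption: the helper `lce` (a hypothetical name) computes the longest common extension by plain character comparison, whereas the Kangaroo method obtains the same answer in O(1) from a suffix tree with lowest-common-ancestor queries. Only the jump-and-count structure of the loop is the point here.

```python
def lce(p, i, t, j):
    """Longest common extension of p[i:] and t[j:], computed naively.

    Stand-in only: the Kangaroo method answers this query in O(1) using a
    suffix tree plus lowest-common-ancestor preprocessing.
    """
    length = 0
    while (i + length < len(p) and j + length < len(t)
           and p[i + length] == t[j + length]):
        length += 1
    return length


def verify(t, p, start, k):
    """Return the Hamming distance at 0-based `start`, or None if it exceeds k."""
    m = len(p)
    if start < 0 or start + m > len(t):
        return None                      # pattern would not fit in the text
    pos = 0                              # next pattern position to compare
    mismatches = 0
    while True:
        pos += lce(p, pos, t, start + pos)   # jump over the matching run
        if pos >= m:
            return mismatches
        mismatches += 1                  # p[pos] != t[start + pos]
        if mismatches > k:
            return None                  # at most k + 1 jumps were needed
        pos += 1                         # step past the mismatch


if __name__ == "__main__":
    for s in range(7):
        print(s + 1, verify("abbbcaabc", "abc", s, 2))
```

With an O(1) `lce`, each call performs at most k + 1 jumps, which is where the O(k) verification cost per candidate comes from.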


Algorithms for Four Different Cases

Large alphabet
  At least 2k different symbols in pattern P.
  Time: O(n)

Small alphabet
  At most $2\sqrt{k}$ different symbols in pattern P.
  Time: $O(n\sqrt{k} \log m)$

General alphabet - many frequent symbols
  At least $\sqrt{k}$ frequent symbols.
  Time: $O(n\sqrt{k} \log m)$

General alphabet - few frequent symbols
  Fewer than $\sqrt{k}$ frequent symbols.
  Time: $O(n\sqrt{k} \log m)$


Large alphabet

Example: k = 3, |Σ| = 6 = 2k

position in P                  1   2   3   4   5   6   7   8   9
P                              a   c   b   d   c   a   b   e   f

symbol                         a   b   c   d   e   f
smallest index in P (s_i)      1   3   2   4   8   9

position i in T                1   2   3   4   5   6   7   8   9  10  11  12  13
T                              b   a   c   b   d   c   b   b   a   d   b   d   f
location to mark (i - s_i)    -2   1   1   1   1   4   4   5   8   6   8   8   4

marked location               -2   1   4   5   6   8
number of marks                1   4   3   1   1   3

Candidates: the locations with at least k = 3 marks, i.e. 1, 4 and 8.

Time: O(n / k) candidates x O(k) verification each = O(n)
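The marking step for the large-alphabet case can be sketched as follows. The function names are hypothetical and the code is an illustration of the counting idea on the slide, not the authors' implementation. Indices are 0-based, which leaves the quantity i - s_i unchanged; `candidates` additionally drops locations where the pattern would not fit inside the text.

```python
from collections import Counter


def count_marks(t, p, k):
    """Count, for every location i - s_i, how many text symbols vote for it."""
    first = {}                            # symbol -> smallest index in p
    for idx, ch in enumerate(p):
        if ch not in first:
            first[ch] = idx
            if len(first) == 2 * k:       # 2k distinct symbols are enough
                break
    marks = Counter()
    for i, ch in enumerate(t):
        if ch in first:
            marks[i - first[ch]] += 1     # one mark per text symbol
    return marks


def candidates(t, p, k):
    """Locations with at least k marks where the pattern fits in the text."""
    marks = count_marks(t, p, k)
    last_start = len(t) - len(p)
    return sorted(loc for loc, c in marks.items()
                  if c >= k and 0 <= loc <= last_start)


if __name__ == "__main__":
    t, p, k = "bacbdcbbadbdf", "acbdcabef", 3
    print(sorted(count_marks(t, p, k).items()))
    print(candidates(t, p, k))
```

Each text symbol contributes at most one mark, so at most n / k locations can collect k marks; that bound is what keeps the verification stage at O(n) overall.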


Small alphabet

Example: k=5 , Σ={a, b} , |Σ|=2

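position       1   2   3   4   5   6   7
P              a   a   b   a   a   b   a
$P_a$          1   1   0   1   1   0   1
$P_b$          0   0   1   0   0   1   0
$P_a^R$        1   0   1   1   0   1   1
$P_b^R$        0   1   0   0   1   0   0

position       1   2   3   4   5   6   7   8   9  10  11  12  13
T              b   b   b   a   a   b   b   b   a   a   b   b   a
$\bar{T}_a$    1   1   1   0   0   1   1   1   0   0   1   1   0
$\bar{T}_b$    0   0   0   1   1   0   0   0   1   1   0   0   1

$P_c$ marks the positions where P equals c, $P_c^R$ is $P_c$ reversed, and $\bar{T}_c$ marks the positions where T differs from c. Here $|\Sigma| = 2 \le 2\sqrt{k}$.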


Small alphabet (cont’d)

Use FFT for polynomial multiplication.

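Time: $O(n \log m)$ per symbol, $O(n\sqrt{k} \log m)$ in total.

The two products count, for each alignment, the positions where P has a (resp. b) and T does not.

coefficient                   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
$P_a^R \times \bar{T}_a$      1   1   2   2   2   3   3   4   3   2   3   3   3   3   1   1   2   1   0
$P_b^R \times \bar{T}_b$      0   0   0   0   1   1   0   1   1   1   1   0   1   2   0   0   1   0   0
sum                           1   1   2   2   3   4   3   5   4   3   4   3   4   5   1   1   3   1   0
position in T                                         1   2   3   4   5   6   7   8   9  10  11  12  13
candidates                  N/A N/A N/A N/A N/A N/A   3   5   4   3   4   3   4 N/A N/A N/A N/A N/A N/A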
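The per-symbol products can be sketched with a small pure-Python FFT. Everything below is illustrative: the function names are hypothetical, the recursive FFT is a textbook radix-2 version chosen to keep the example dependency-free, and a production implementation would normally use a tuned FFT library instead.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for i in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * i / n) * odd[i]
        out[i] = even[i] + w
        out[i + n // 2] = even[i] - w
    return out


def multiply(x, y):
    """Coefficients of the product of two integer polynomials, via FFT."""
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    prod = fft([u * v for u, v in zip(fx, fy)], invert=True)
    # The inverse transform above is unscaled, so divide by size here.
    return [round(v.real / size) for v in prod[:len(x) + len(y) - 1]]


def mismatch_counts(t, p):
    """Hamming distance of p against every full alignment in t."""
    n, m = len(t), len(p)
    total = [0] * (n + m - 1)
    for c in set(p):                                   # one product per symbol
        p_rev = [1 if ch == c else 0 for ch in reversed(p)]
        t_not = [0 if ch == c else 1 for ch in t]
        for j, v in enumerate(multiply(p_rev, t_not)):
            total[j] += v
    return total[m - 1:n]                              # full overlaps only


if __name__ == "__main__":
    print(mismatch_counts("bbbaabbbaabba", "aabaaba"))
```

One product per pattern symbol is what makes the cost proportional to the alphabet size, which is why this approach is reserved for the small-alphabet case.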


General alphabet – many frequent symbols

Frequent symbol: appears at least $2\sqrt{k}$ times in P.

Many frequent symbols: at least $\sqrt{k}$ frequent symbols.

T' and P': replace all non-frequent symbols in T and P with "don't care" symbols.

The mismatch problem with "don't cares" can be solved in time $O(n\sqrt{k} \log m)$.

After the last step, at most $2n/\sqrt{k}$ candidates are left.

Time: $O(n\sqrt{k} \log m) + O(2n/\sqrt{k}) \times O(k) = O(n\sqrt{k} \log m) + O(n\sqrt{k}) = O(n\sqrt{k} \log m)$


General alphabet – few frequent symbols

Few frequent symbols: fewer than $\sqrt{k}$ frequent symbols.

T' and P': replace all frequent symbols in T and P with "don't care" symbols.

The mismatch problem with "don't cares" can be solved in time $O(n\sqrt{k} \log m)$.

After the last step, at most $2n/\sqrt{k}$ candidates are left.

Time: $O(n\sqrt{k} \log m) + O(2n/\sqrt{k}) \times O(k) = O(n\sqrt{k} \log m) + O(n\sqrt{k}) = O(n\sqrt{k} \log m)$


General alphabet (cont’d)

Example:

position                       1   2   3   4   5   6   7   8
P                              a   b   c   b   a   d   b   a
P' (many frequent symbols)     a   b   φ   b   a   φ   b   a
P' (few frequent symbols)      φ   φ   c   φ   φ   d   φ   φ

Frequent symbols: a, b
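The classification into frequent and non-frequent symbols, and the two masked patterns, can be sketched as below. The slide does not state k for this example, so k = 2 is an assumption made here (it gives a threshold of 2√2 ≈ 2.83, under which a and b, with three occurrences each, are frequent). All function names are hypothetical.

```python
import math
from collections import Counter

WILDCARD = "φ"   # the "don't care" symbol used on the slides


def frequent_symbols(p, k):
    """Symbols that appear at least 2*sqrt(k) times in p."""
    threshold = 2 * math.sqrt(k)
    return {c for c, cnt in Counter(p).items() if cnt >= threshold}


def mask(s, keep):
    """Replace every symbol of s that is not in `keep` by the wildcard."""
    return "".join(ch if ch in keep else WILDCARD for ch in s)


def split_pattern(p, k):
    """Return the case name and both masked patterns for p."""
    freq = frequent_symbols(p, k)
    rare = set(p) - freq
    case = "many" if len(freq) >= math.sqrt(k) else "few"
    # Many frequent symbols: keep only the frequent ones.
    # Few frequent symbols: keep only the non-frequent ones.
    return case, mask(p, freq), mask(p, rare)


if __name__ == "__main__":
    print(split_pattern("abcbadba", 2))   # k = 2 is an assumed value
```

The same `mask` call would be applied to the text T with the same `keep` set, so that T' and P' use a consistent alphabet.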


Mismatch with Don’t Cares Problem

Example: k=3 , Σ={a, b}∪{φ}

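position       1   2   3   4   5   6   7   8
P              a   b   φ   b   a   φ   b   a
$P_a$          1   0   0   0   1   0   0   1
$P_b$          0   1   0   1   0   0   1   0
$P_a^R$        1   0   0   1   0   0   0   1
$P_b^R$        0   1   0   0   1   0   1   0

position       1   2   3   4   5   6   7   8   9  10  11  12
T              b   a   φ   b   a   b   φ   b   a   a   φ   a
$\bar{T}_a$    1   0   0   1   0   1   0   1   0   0   0   0
$\bar{T}_b$    0   1   0   0   1   0   0   0   1   1   0   1

Don't-care positions are 0 in every vector, so they never count as mismatches.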


Mismatch with Don’t Cares Problem (cont’d)

Use FFT for polynomial multiplication

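Time: $O(n \log m)$ per symbol, $O(n\sqrt{k} \log m)$ in total.

coefficient                   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
$P_a^R \times \bar{T}_a$      1   0   0   2   0   1   1   2   1   0   2   0   1   0   1   0   0   0   0
$P_b^R \times \bar{T}_b$      0   0   1   0   0   2   0   1   1   1   2   0   2   1   1   2   0   1   0
sum                           1   0   1   2   0   3   1   3   2   1   4   0   3   1   2   2   0   1   0
added bias                    7   6   5   4   3   2   1   0   0   0   0   0   1   2   3   4   5   6   7
position in T                                             1   2   3   4   5   6   7   8   9  10  11  12
candidate                     8   6   6   6   3   5   2   3   2   1   4   0   4   3   5   6   5   7   7

The bias adds one mismatch for every pattern position that falls outside the text.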
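The don't-care variant and the bias row can be sketched as follows. To keep the example short, `convolve` below is a direct O(nm) product used as a stand-in for the FFT-based multiplication; the function names are hypothetical and the code is an illustration, not the authors' implementation.

```python
WILDCARD = "φ"   # the "don't care" symbol used on the slides


def convolve(x, y):
    """Direct polynomial product (stand-in for an FFT-based multiplication)."""
    out = [0] * (len(x) + len(y) - 1)
    for i, xv in enumerate(x):
        if xv:
            for j, yv in enumerate(y):
                out[i + j] += xv * yv
    return out


def mismatches_with_dont_cares(t, p):
    """Sum of the per-symbol products; wildcards never count as mismatches."""
    total = [0] * (len(t) + len(p) - 1)
    for c in set(p) - {WILDCARD}:
        p_rev = [int(ch == c) for ch in reversed(p)]
        t_bar = [int(ch != c and ch != WILDCARD) for ch in t]
        for j, v in enumerate(convolve(p_rev, t_bar)):
            total[j] += v
    return total


def add_bias(total, n, m):
    """Add one mismatch per pattern position that hangs outside the text."""
    return [v + max(0, m - 1 - j) + max(0, j - (n - 1))
            for j, v in enumerate(total)]


if __name__ == "__main__":
    t, p = "baφbabφbaaφa", "abφbaφba"
    total = mismatches_with_dont_cares(t, p)
    print(total)
    print(add_bias(total, len(t), len(p)))
```

Coefficient j (0-based) of the result corresponds to the alignment whose last pattern symbol sits on text position j, so coefficients m - 1 through n - 1 are the full overlaps and receive no bias.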


Conclusion

This problem can be solved by the above algorithms in $O(n\sqrt{k} \log m)$.

When $k \ge m^{1/3}$: $\log m = O(\log k)$, so $O(n\sqrt{k} \log m) = O(n\sqrt{k} \log k)$.

When $k < m^{1/3}$: use another algorithm, which runs in $O((n + nk^3/m) \log k) = O(n \log k)$.

Finally, this problem can be solved in $O(n\sqrt{k} \log k)$.
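The two cases can each be checked with one line of arithmetic. The following is a sketch of that reasoning, not a quotation from the paper:

```latex
\begin{align*}
k \ge m^{1/3} &\;\Rightarrow\; m \le k^{3} \;\Rightarrow\; \log m \le 3\log k = O(\log k),\\
k < m^{1/3} &\;\Rightarrow\; \frac{k^{3}}{m} < 1 \;\Rightarrow\; \Bigl(n + \frac{n k^{3}}{m}\Bigr)\log k < 2n\log k = O(n\log k).
\end{align*}
```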