pattern matching part two: k-mismatches
TRANSCRIPT
Advanced Algorithms – COMS31900
Pattern matching part four
Pattern matching with at most k mismatches
Benjamin Sach
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c
a b d
a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
m
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c
a b d
a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
m
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Ham(4) = 1
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(5) = 4
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(6) = 1
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(7) = 3
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(7) = 3
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
this is alignment 7
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(8) = 3
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
this is alignment 8
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
Last lecture we saw two algorithms for this problem:
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(8) = 3
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
this is alignment 8
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
Last lecture we saw two algorithms for this problem:
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(8) = 3
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
this is alignment 8
One algorithm takesO(n|Σ| logm) time (where |Σ| is the alphabet size)
The other algorithm takesO(n√m logm) time (regardless of the alphabet size)
Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (lengthm) and a positive integer k
P
ba b c
a b d
a a d ad a
Goal: For all i, output,
a a a
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Output the number of mismatches. . . unless it’s more than k
(we interpret the outputX to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) 6 k
X if Ham(i) > k
Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (lengthm) and a positive integer k
P
ba b c
a b d
a a d ad a
Goal: For all i, output,
a a a
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Output the number of mismatches. . . unless it’s more than k
(we interpret the outputX to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) 6 k
X if Ham(i) > k
k = 2
Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (lengthm) and a positive integer k
P
ba b c
a b d
a a d ad a
Goal: For all i, output,
a a a
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Hamk(4) = 1
Output the number of mismatches. . . unless it’s more than k
(we interpret the outputX to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) 6 k
X if Ham(i) > k
k = 2
Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (lengthm) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(5) = X
Output the number of mismatches. . . unless it’s more than k
(we interpret the outputX to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) 6 k
X if Ham(i) > k
k = 2
Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (lengthm) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(6) = 1
Output the number of mismatches. . . unless it’s more than k
(we interpret the outputX to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) 6 k
X if Ham(i) > k
k = 2
Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (lengthm) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(7) = 2
Output the number of mismatches. . . unless it’s more than k
(we interpret the outputX to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) 6 k
X if Ham(i) > k
k = 2
Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (lengthm) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(8) = X
Output the number of mismatches. . . unless it’s more than k
(we interpret the outputX to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) 6 k
X if Ham(i) > k
k = 2
Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (lengthm) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
• We could use theO(n√m logm) time algorithm for Hamming distance. . .
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(8) = X
Output the number of mismatches. . . unless it’s more than k
(we interpret the outputX to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) 6 k
X if Ham(i) > k
k = 2
Pattern matching with few mismatches (k-mismatch)
T
Input A text string T (length n), a pattern string P (lengthm) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
• We could use theO(n√m logm) time algorithm for Hamming distance. . .
but when k is small we can do much better
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(8) = X
Output the number of mismatches. . . unless it’s more than k
(we interpret the outputX to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) 6 k
X if Ham(i) > k
k = 2
LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1]
it’s the furthest you can go before hitting a mismatch
a b c b a bc bP
nn nm
LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1]
it’s the furthest you can go before hitting a mismatch
a b c b a bc bP
nn nm
LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1]
it’s the furthest you can go before hitting a mismatch
i j
a b c b a bc bP
nn nm
LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1]
it’s the furthest you can go before hitting a mismatch
i j
a b c b a bc bP
nn nm
LCP(i, j) returns 3
LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1]
it’s the furthest you can go before hitting a mismatch
i j
a b c b a bc bP
nn nm
LCP(i, j) returns 4
LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1]
it’s the furthest you can go before hitting a mismatch
a b c b a bc bP
nn nm
i jLCP(i, j) returns 0
LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1]
it’s the furthest you can go before hitting a mismatch
i j
a b c b a bc bP
nn nm
LCP(i, j) returns 4
LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1]
it’s the furthest you can go before hitting a mismatch
i j
a
We can preprocess P and T for LCP queries inO(n) time andO(n) space
b c b a bc bP
nn nm
LCP(i, j) returns 4
LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1]
it’s the furthest you can go before hitting a mismatch
i j
a
We can preprocess P and T for LCP queries inO(n) time andO(n) space
Each query then takesO(1) time
b c b a bc bP
nn nm
LCP(i, j) returns 4
LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1]
it’s the furthest you can go before hitting a mismatch
i j
a
We can preprocess P and T for LCP queries inO(n) time andO(n) space
Each query then takesO(1) time
we’ll see how to do this later in this lecture
b c b a bc bP
nn nm
LCP(i, j) returns 4
LCP - the Longest Common Prefix
T ba b c a b aa b cb a b
For any pair of locations i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1]
it’s the furthest you can go before hitting a mismatch
i j
a
We can preprocess P and T for LCP queries inO(n) time andO(n) space
Each query then takesO(1) time
we’ll see how to do this later in this lecture
we can use LCP queries to solve the k-mismatch problem. . .
b c b a bc bP
nn nm
First let’s see how
LCP(i, j) returns 4
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
perform LCP(i, j)
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
perform LCP(i, j)
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
perform LCP(i, j)
a a b a which returns 3
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
perform LCP(i, j)
a a b a which returns 3
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
perform LCP(i, j)
a a b a which returns 3
mismatch!
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b
i
j
a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
a a b a
skip over this!
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
i′
j
a a b a
skip over this!
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
i′
j
perform LCP(i′, j)
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
i′
j
perform LCP(i′, j)
a a b a which returns 0
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
i′
j
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
i′
j
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
perform LCP(i′, j)i′
j
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
perform LCP(i′, j)i′
j
a a b a which returns 2
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
i′
j
a a b a
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
perform LCP(i′, j)
a a b a
i′
j
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
perform LCP(i′, j)
a a b a
i′
j
which returns 3
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
(or we reach the end of P )
perform LCP(i′, j)
a a b a
i′
j
which returns 3
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
(or we reach the end of P )
a a b a
i′
j
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
(or we reach the end of P )
We can therefore calculate Hamk(i) for a single alignment i inO(k) time
a a b a
i′
j
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
(or we reach the end of P )
We can therefore calculate Hamk(i) for a single alignment i inO(k) time
Overall this takesO(nk) time (including the ‘preprocessing’ for LCP queries)
a a b a
i′
j
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
(or we reach the end of P )
We can therefore calculate Hamk(i) for a single alignment i inO(k) time
Overall this takesO(nk) time (including the ‘preprocessing’ for LCP queries)
this is pretty good but we can do better
a a b a
i′
j
k-mismatch using LCP queries
a
b
c
aT a b c a b aa b cb b a
b c b a bP
nn
Find the leftmost (at most) k + 1 mismatches between T [i . . . i+m− 1] and P
(we do this for each i seperately)
We can do this using (at most) k + 1 LCP queries
each query takesO(1) time and finds a new mismatch
(or we reach the end of P )
We can therefore calculate Hamk(i) for a single alignment i inO(k) time
Overall this takesO(nk) time (including the ‘preprocessing’ for LCP queries)
this is pretty good but we can do better but first. . . how do we answer those LCP queries?
a a b a
i′
j
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
inO(n) prep. time and space
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
inO(n) prep. time and space
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
i j3
inO(n) prep. time and space
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
i j
LCA(i, j)
3
inO(n) prep. time and space
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
i j
LCA(i, j)
3
inO(n) prep. time and space
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
i j
LCA(i, j)
3
it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]
inO(n) prep. time and space
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
i j
LCA(i, j)
3
it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]
inO(n) prep. time and space
For any pair of locations i, j in T , LCPT (i, j) is the largest ` such that
T [i . . . i+ `− 1] = T [j . . . j + `− 1]
Single string LCP:
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]
LCA(i, j)
j
1
5
inO(n) prep. time and space
i
For any pair of locations i, j in T , LCPT (i, j) is the largest ` such that
T [i . . . i+ `− 1] = T [j . . . j + `− 1]
Single string LCP:
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]
LCA(i, j)
j
4
i
2
inO(n) prep. time and space
For any pair of locations i, j in T , LCPT (i, j) is the largest ` such that
T [i . . . i+ `− 1] = T [j . . . j + `− 1]
Single string LCP:
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]
LCA(i, j)
j
4
i
2
inO(n) prep. time and space
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]
LCA(i, j)
j
4
i
2
We store the root-to-node length at each internal nodeso we can recover the length, LCPT (i, j) inO(1) time
inO(n) prep. time and space
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]
LCA(i, j)
j
4
i
2
We store the root-to-node length at each internal node
1
3
0
2
so we can recover the length, LCPT (i, j) inO(1) time
inO(n) prep. time and space
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
What is the LCA of the leaves representing suffixes i and j?
Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
1
5
it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]
LCA(i, j)
j
4
i
2
We store the root-to-node length at each internal node
1
3
0
2
so we can recover the length, LCPT (i, j) inO(1) time
inO(n) prep. time and space
So we haveO(n) space,O(n) prep. time andO(1) query time for the LCP problem on a single string
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
1
5
LCA(i, j)
j
4
i
2
1
3
0
2
So we haveO(n) space,O(n) prep. time andO(1) query time for the LCP problem on a single string
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
1
5
LCA(i, j)
j
4
i
2
1
3
0
2
So we haveO(n) space,O(n) prep. time andO(1) query time for the LCP problem on a single string
We can extend this two strings (P and T ) by first concatenating them together. . .(and proceeding as for a single string)
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
1
5
LCA(i, j)
j
4
i
2
1
3
0
2
So we haveO(n) space,O(n) prep. time andO(1) query time for the LCP problem on a single string
We can extend this two strings (P and T ) by first concatenating them together. . .
So we also haveO(n) space,O(n) prep. time andO(1) query time
(and proceeding as for a single string)
for the LCP problem on two strings
LCPs in Suffix Trees
TT b n aaa sn
n$
a
s$
nas$
nas$
s$
na
s$
bananas$
7$
b n aaa sn
n aaa sn
n aa sn
aa sn
a sn
a s
s
suffixes
$
$
$
$
$
$
$
0
1
2
3
4
5
6
$ 7 3
0
2 4
6nas$
This is the suffix tree
of this text
2
1
5
LCA(i, j)
j
4
i
2
1
3
0
2
So we haveO(n) space,O(n) prep. time andO(1) query time for the LCP problem on a single string
We can extend this two strings (P and T ) by first concatenating them together. . .
So we also haveO(n) space,O(n) prep. time andO(1) query time
(and proceeding as for a single string)
for the LCP problem on two strings
I.e. for any i in T and j in P , LCP(i, j) is the largest ` such that
T [i . . . i+ `− 1] = P [j . . . j + `− 1] (as we originally defined it)
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
and infrequent otherwise
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
and infrequent otherwise
k = 4
(√k = 2)
m = 9
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
and infrequent otherwise
k = 4
(√k = 2)
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent
and infrequent otherwise
k = 4
(√k = 2)
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
and infrequent otherwise
k = 4
(√k = 2)
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequent
and infrequent otherwise
k = 4
(√k = 2)
, d is frequent
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequentc is infrequent
and infrequent otherwise
k = 4
(√k = 2)
, d is frequent
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequentc is infrequent
and infrequent otherwise
How many frequent symbols can there be?
k = 4
(√k = 2)
, d is frequent
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequentc is infrequent
and infrequent otherwise
How many frequent symbols can there be? Lots!
k = 4
(√k = 2)
, d is frequent
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequentc is infrequent
and infrequent otherwise
How many frequent symbols can there be? Lots! there could be m√k
frequent symbols
k = 4
(√k = 2)
, d is frequent
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequentc is infrequent
and infrequent otherwise
How many frequent symbols can there be? Lots! there could be m√k
frequent symbols
Case 1: There are fewer than 2√k frequent symbols in P .
k = 4
(√k = 2)
, d is frequent
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequentc is infrequent
and infrequent otherwise
Algorithm summary
How many frequent symbols can there be? Lots! there could be m√k
frequent symbols
Case 1: There are fewer than 2√k frequent symbols in P .
k = 4
(√k = 2)
, d is frequent
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequentc is infrequent
and infrequent otherwise
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture)
Stage 2: Count all matches involving infrequent symbols (as in last lecture)
How many frequent symbols can there be? Lots! there could be m√k
frequent symbols
Case 1: There are fewer than 2√k frequent symbols in P .
k = 4
(√k = 2)
, d is frequent
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequentc is infrequent
and infrequent otherwise
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture)
Stage 2: Count all matches involving infrequent symbols (as in last lecture)
How many frequent symbols can there be? Lots! there could be m√k
frequent symbols
Case 1: There are fewer than 2√k frequent symbols in P .
-O(m logm) time
k = 4
(√k = 2)
, d is frequent
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequentc is infrequent
and infrequent otherwise
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture)
Stage 2: Count all matches involving infrequent symbols (as in last lecture)
How many frequent symbols can there be? Lots! there could be m√k
frequent symbols
Case 1: There are fewer than 2√k frequent symbols in P .
-O(m logm) time
-O(n√k logm) time
k = 4
(√k = 2)
, d is frequent
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequentc is infrequent
and infrequent otherwise
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture)
Stage 2: Count all matches involving infrequent symbols (as in last lecture)
How many frequent symbols can there be? Lots! there could be m√k
frequent symbols
Case 1: There are fewer than 2√k frequent symbols in P .
-O(m logm) time
-O(n√k logm) time
-O(n√k) time
k = 4
(√k = 2)
, d is frequent
k-mismatch using frequent/infrequent symbols
Definition: A symbol is frequent if it occurs at least√k times in P ,
P a d bc ab b da
0 1 2 3 4 5 6 7 8
a is frequent , b is frequentc is infrequent
and infrequent otherwise
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent
Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture)
Stage 2: Count all matches involving infrequent symbols (as in last lecture)
How many frequent symbols can there be? Lots! there could be m√k
frequent symbols
Case 1: There are fewer than 2√k frequent symbols in P .
-O(m logm) time
-O(n√k logm) time
-O(n√k) time
-O(n√k logm) total time
k = 4
(√k = 2)
, d is frequent
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a d bc ab b da c bbfe0 1 2 3 4 5 6 7 8 9 10 11 12 13
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a d bc ab b da c bbfe0 1 2 3 4 5 6 7 8 9 10 11 12 13
a is frequent , b is frequente and f are infrequent
, d is frequent, c is frequent
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe0 1 2 3 4 5 6 7 8 9 10 11 12 13
a is frequent , b is frequente and f are infrequent
, d is frequent, c is frequent
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe0 1 2 3 4 5 6 7 8 9 10 11 12 13
a is frequent , b is frequente and f are infrequent
, d is frequent, c is frequent
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
0 1 2 3 4 5 6 7 8 9 10 11 12 13
a is frequent , b is frequente and f are infrequent
, d is frequent, c is frequent
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
because there are 2k interesting positions. . . and fewer than k of them match
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
because there are 2k interesting positions. . . and fewer than k of them match
Fact There are at most n/√k values of i with dk(i) > k
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
because there are 2k interesting positions. . . and fewer than k of them match
Fact There are at most n/√k values of i with dk(i) > k
this follows from a counting argument
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/√k values of i with dk(i) > k
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/√k values of i with dk(i) > k
For any location i′, T [i′] = P [j] for either 0 or√k distinct j ∈ J
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/√k values of i with dk(i) > k
For any location i′, T [i′] = P [j] for either 0 or√k distinct j ∈ J
This implies that∑
i dk(i) 6∑
i′∑
j∈J Eq(T [i′], P [j]) 6 n√k
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/√k values of i with dk(i) > k
For any location i′, T [i′] = P [j] for either 0 or√k distinct j ∈ J
This implies that∑
i dk(i) 6∑
i′∑
j∈J Eq(T [i′], P [j]) 6 n√k
Eq = 1 if T [i′] = P [j] (and
0 otherwise)
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/√k values of i with dk(i) > k
For any location i′, T [i′] = P [j] for either 0 or√k distinct j ∈ J
This implies that∑
i dk(i) 6∑
i′∑
j∈J Eq(T [i′], P [j]) 6 n√k
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/√k values of i with dk(i) > k
Assume that more than n/√k values of i have dk(i) > k
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/√k values of i with dk(i) > k
Assume that more than n/√k values of i have dk(i) > k
So∑
i dk(i) >(
n√k
+ 1)· k
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/√k values of i with dk(i) > k
Assume that more than n/√k values of i have dk(i) > k
So∑
i dk(i) >(
n√k
+ 1)· k > n
√k
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact There are at most n/√k values of i with dk(i) > k
Assume that more than n/√k values of i have dk(i) > k
So∑
i dk(i) >(
n√k
+ 1)· k > n
√k
Contradiction!
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4 dk(i) = 3
Fact if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
because there are 2k interesting positions. . . and fewer than k of them match
Fact There are at most n/√k values of i with dk(i) > k
this follows from a counting argument
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
We can filter the text, using the dk(i) values leaving only n/√k alignments to check
every other alignment has more than k mismatches
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
We can filter the text, using the dk(i) values leaving only n/√k alignments to check
every other alignment has more than k mismatches
Check each of the remaining alignments using LCP queries inO(k) timeper alignment
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
We can filter the text, using the dk(i) values leaving only n/√k alignments to check
every other alignment has more than k mismatches
Check each of the remaining alignments using LCP queries inO(k) time
This takes n/√k ·O(k) = O(n
√k) total time.
per alignment
Case 2: There are at least 2√k frequent symbols
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
We can filter the text, using the dk(i) values leaving only n/√k alignments to check
every other alignment has more than k mismatches
Check each of the remaining alignments using LCP queries inO(k) time
How do we compute all the dk(i) values?
This takes n/√k ·O(k) = O(n
√k) total time.
per alignment
Computing all the dk(i) values
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
For each text character T [i′],
For each j ∈ J such that P [j] = T [i′]
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
Let dk(i) = 0 for all i.
increase dk(i′ − j) by one
Computing all the dk(i) values
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
For each text character T [i′],
For each j ∈ J such that P [j] = T [i′]
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
Let dk(i) = 0 for all i.
increase dk(i′ − j) by one
T [i′] = P [j] for either 0 or√k distinct j ∈ J
(store a list of j values for each symbol)
Computing all the dk(i) values
Pick any 2√k frequent symbols and for each symbol pick
√k occurences in P .
This gives us 2k interesting pattern locations, denoted J
k = 4d bc ab b daP a dcb b da c bbfe
J = {0, 2, 3, 4, 5, 7, 9, 10}
b da b d bfT b b c caa f e f cc c aa ab e e
i = 4
For each text character T [i′],
For each j ∈ J such that P [j] = T [i′]
Let dk(i) be the number of j ∈ J such that P [j] = T [i+ j]
i.e. the number of (single character) matches involving interesting pattern locations
Let dk(i) = 0 for all i.
increase dk(i′ − j) by one
T [i′] = P [j] for either 0 or√k distinct j ∈ J
(store a list of j values for each symbol)
This takesO(n√k) total time.
Pattern matching with k-mismatches: putting it all together
Algorithm summary
Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries -O(n) time
Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries -O(n) time
Count the number of frequent symbols in P -O(m logm) time
Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries -O(n) time
Count the number of frequent symbols in P -O(m logm) time
Case 1: P has at most 2√k frequent symbols
Case 2: P has more than 2√k frequent symbols
Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries -O(n) time
Count the number of frequent symbols in P -O(m logm) time
Case 1: P has at most 2√k frequent symbols
Count matches with frequent symbols using cross-correlations -O(n√k logm) time
Case 2: P has more than 2√k frequent symbols
Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries -O(n) time
Count the number of frequent symbols in P -O(m logm) time
Case 1: P has at most 2√k frequent symbols
Count matches with frequent symbols using cross-correlations -O(n√k logm) time
Count matches with infrequent symbols directly -O(n√k) time
Case 2: P has more than 2√k frequent symbols
Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries -O(n) time
Count the number of frequent symbols in P -O(m logm) time
Case 1: P has at most 2√k frequent symbols
Count matches with frequent symbols using cross-correlations -O(n√k logm) time
Count matches with infrequent symbols directly -O(n√k) time
Case 2: P has more than 2√k frequent symbols
Filter the text, leaving n/√k alignments -O(n
√k) time
Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries -O(n) time
Count the number of frequent symbols in P -O(m logm) time
Case 1: P has at most 2√k frequent symbols
Count matches with frequent symbols using cross-correlations -O(n√k logm) time
Count matches with infrequent symbols directly -O(n√k) time
Case 2: P has more than 2√k frequent symbols
Filter the text, leaving n/√k alignments -O(n
√k) time
Count mismatches at these alignments using LCP queries -O(n√k) time
Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries -O(n) time
Overall, we obtain a time complexity of O(n√k logm).
Count the number of frequent symbols in P -O(m logm) time
Case 1: P has at most 2√k frequent symbols
Count matches with frequent symbols using cross-correlations -O(n√k logm) time
Count matches with infrequent symbols directly -O(n√k) time
Case 2: P has more than 2√k frequent symbols
Filter the text, leaving n/√k alignments -O(n
√k) time
Count mismatches at these alignments using LCP queries -O(n√k) time
Pattern matching with k-mismatches: putting it all together
Algorithm summary
Preprocess P, T for LCP queries -O(n) time
Overall, we obtain a time complexity of O(n√k logm).
Count the number of frequent symbols in P -O(m logm) time
Case 1: P has at most 2√k frequent symbols
Count matches with frequent symbols using cross-correlations -O(n√k logm) time
Count matches with infrequent symbols directly -O(n√k) time
Case 2: P has more than 2√k frequent symbols
Filter the text, leaving n/√k alignments -O(n
√k) time
Count mismatches at these alignments using LCP queries -O(n√k) time
- this can be improved toO(n√k log k)
Conclusion
T
Input A text string T (length n), a pattern string P (lengthm) and a positive integer k
P
ba b c a a d ad a
Goal: For all i, output,
a a a
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Hamk(6) = 1
Output the number of mismatches. . . unless it’s more than k(we interpret the outputX to mean “too many mismatches”)
Hamk(i) =
Ham(i) if Ham(i) 6 k
X if Ham(i) > k
k = 2
One algorithm takesO(nk) time
The other algorithm takesO(n√k logm) time (improvable toO(n
√k log k) time)
We saw two algorithms for this problem: