pattern matching part three: hamming distance
TRANSCRIPT
Exact pattern matching
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c
a b a
a b a cb a
Goal: Find all the locations where P matches in TP matches at location i iff
a b a
m
for all 0 6 j < m we have that P [j] = T [i+ j]
(our strings are zero-indexed)
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Exact pattern matching
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c
a b a
a b a cb a
Goal: Find all the locations where P matches in TP matches at location i iff
a b a
m
4
for all 0 6 j < m we have that P [j] = T [i+ j]
(our strings are zero-indexed)
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Exact pattern matching
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a b a cb a
Goal: Find all the locations where P matches in TP matches at location i iff
a b a
a b a
m
6
for all 0 6 j < m we have that P [j] = T [i+ j]
(our strings are zero-indexed)
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Exact pattern matching
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a b a cb a
Goal: Find all the locations where P matches in TP matches at location i iff
a b a
a b a
m
10
for all 0 6 j < m we have that P [j] = T [i+ j]
(our strings are zero-indexed)
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Exact pattern matching
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a b a cb a
Goal: Find all the locations where P matches in TP matches at location i iff
a b a
a b a
m
6
for all 0 6 j < m we have that P [j] = T [i+ j]
(our strings are zero-indexed)
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Exact pattern matching
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a b a cb a
Goal: Find all the locations where P matches in TP matches at location i iff
a b a
a b a
m
6
for all 0 6 j < m we have that P [j] = T [i+ j]
(our strings are zero-indexed)
j-th character of P
(i+ j)-th char. of T
T [2] = c
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Exact pattern matching
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a b a cb a
Goal: Find all the locations where P matches in TP matches at location i iff
a b a
a b a
m
6
for all 0 6 j < m we have that P [j] = T [i+ j]
(our strings are zero-indexed)
• A naive algorithm takesO(nm) time
j-th character of P
(i+ j)-th char. of T
T [2] = c
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Exact pattern matching
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a b a cb a
Goal: Find all the locations where P matches in TP matches at location i iff
a b a
a b a
m
6
for all 0 6 j < m we have that P [j] = T [i+ j]
(our strings are zero-indexed)
• A naive algorithm takesO(nm) time
• ManyO(n) time algorithms are known (for example the KMP algorithm)
j-th character of P
(i+ j)-th char. of T
T [2] = c
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c
a b d
a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
m
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c
a b d
a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
m
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a
Ham(4) = 1
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(5) = 4
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(6) = 1
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(7) = 3
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(7) = 3
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
this is alignment 7
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(8) = 3
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
this is alignment 8
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
A naive algorithm for this problem takesO(nm) time
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(8) = 3
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
this is alignment 8
Pattern matching with mismatches
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
The Hamming distance is the number of mismatches. . .
c a a
i.e. the number of distinct j such that P [j] 6= T [i+ j]
A naive algorithm for this problem takesO(nm) time. . . but we can do better
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
a
Ham(8) = 3
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
this is alignment 8
It’s a small alphabet after all
T
P
d
d
d d dd d d
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
d
Imagine that the alphabet contains only a small number of different symbols,
aa c
a b
bb c
which we will consider individually. . .
It’s a small alphabet after all
T
P
d
d
d d dd d d
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
d
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
aa c
a b
bb c
which we will consider individually. . .
It’s a small alphabet after all
T
P
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
d
d
d d dd d d
d
aa c
a b
bb c
which we will consider individually. . .
It’s a small alphabet after all
T
P
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
aa c
a b
bb c
which we will consider individually. . .
1
1
1 1 11 1 1
1
It’s a small alphabet after all
T
P
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
aa c
a b
bb c
which we will consider individually. . .
1
1
1 1 11 1 1
1
It’s a small alphabet after all
T
P
aa c
a b
bb c
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
which we will consider individually. . .
1
1
1 1 11 1 1
1
It’s a small alphabet after all
T
P
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
which we will consider individually. . .
00 0
0 0
00 01
1
1 1 11 1 1
1
It’s a small alphabet after all
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
Td
Pd
We denote these new strings Td and Pd and consider. . .
which we will consider individually. . .
00 0
0 0
00 01
1
1 1 11 1 1
1
It’s a small alphabet after all
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
Td
Pd
We denote these new strings Td and Pd and consider. . .
(Td ⊗ Pd)[i] =
m−1∑j=0
Pd[j]× Td[i+ j]
which we will consider individually. . .
00 0
0 0
00 01
1
1 1 11 1 1
1
It’s a small alphabet after all
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
Td
Pd
We denote these new strings Td and Pd and consider. . .
(Td ⊗ Pd)[i] =
m−1∑j=0
Pd[j]× Td[i+ j]
which we will consider individually. . .
00 0
0 0
00 01
1
1 1 11 1 1
1
It’s a small alphabet after all
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
Td
Pd
We denote these new strings Td and Pd and consider. . .
(Td ⊗ Pd)[i] =
m−1∑j=0
Pd[j]× Td[i+ j]
which we will consider individually. . .
00 0
0 0
00 01
1
1 1 11 1 1
1(Td ⊗ Pd)[4] =
(1× 1)+ (0× 0)+
(1× 0)+ (1× 1) = 2
It’s a small alphabet after all
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
Td
Pd
We denote these new strings Td and Pd and consider. . .
(Td ⊗ Pd)[i] =
m−1∑j=0
Pd[j]× Td[i+ j]
1 iffP [j]=T [i+j]=d
which we will consider individually. . .
00 0
0 0
00 01
1
1 1 11 1 1
1(Td ⊗ Pd)[4] =
(1× 1)+ (0× 0)+
(1× 0)+ (1× 1) = 2
It’s a small alphabet after all
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
Td
Pd
We denote these new strings Td and Pd and consider. . .
(Td ⊗ Pd)[i] =
m−1∑j=0
Pd[j]× Td[i+ j]
This is the exactly number of matching ds at the i-th alignment.
1 iffP [j]=T [i+j]=d
which we will consider individually. . .
00 0
0 0
00 01
1
1 1 11 1 1
1(Td ⊗ Pd)[4] =
(1× 1)+ (0× 0)+
(1× 0)+ (1× 1) = 2
It’s a small alphabet after all
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
Td
Pd
We denote these new strings Td and Pd and consider. . .
(Td ⊗ Pd)[i] =
m−1∑j=0
Pd[j]× Td[i+ j]
This is the exactly number of matching ds at the i-th alignment.
1 iffP [j]=T [i+j]=d
which we will consider individually. . .
How can we work out (Td ⊗ Pd) quickly?
00 0
0 0
00 01
1
1 1 11 1 1
1(Td ⊗ Pd)[4] =
(1× 1)+ (0× 0)+
(1× 0)+ (1× 1) = 2
It’s a small alphabet after all
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Imagine that the alphabet contains only a small number of different symbols,
Replace all d symbols with 1 and everything else with 0
Td
Pd
We denote these new strings Td and Pd and consider. . .
(Td ⊗ Pd)[i] =
m−1∑j=0
Pd[j]× Td[i+ j]
This is the exactly number of matching ds at the i-th alignment.
1 iffP [j]=T [i+j]=d
which we will consider individually. . .
How can we work out (Td ⊗ Pd) quickly?
00 0
0 0
00 01
1
1 1 11 1 1
1
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
A
A[i] = ai
B
B[i] = bi(or be seen as arrays of length n)
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
The polynomial C = A×B can be expressed as. . .
C(x) =
2n−1∑i=0
cixi where ci =
i∑j=0
aj×b(i−j)
A
A[i] = ai
B
B[i] = bi(or be seen as arrays of length n)
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
The polynomial C = A×B can be expressed as. . .
C(x) =
2n−1∑i=0
cixi where ci =
i∑j=0
aj×b(i−j)
A
A[i] = ai
B
B[i] = bi(or be seen as arrays of length n)
CC[i] = ci
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
The polynomial C = A×B can be expressed as. . .
C(x) =
2n−1∑i=0
cixi where ci =
i∑j=0
aj×b(i−j)
By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.
A
A[i] = ai
B
B[i] = bi(or be seen as arrays of length n)
CC[i] = ci
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
The polynomial C = A×B can be expressed as. . .
C(x) =
2n−1∑i=0
cixi where ci =
i∑j=0
aj×b(i−j)
By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.
A
A[i] = ai
B
B[i] = bi(or be seen as arrays of length n)
CC[i] = ci
m−1∑j=0
Pd[j]Td[i+ j]
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
The polynomial C = A×B can be expressed as. . .
C(x) =
2n−1∑i=0
cixi where ci =
i∑j=0
aj×b(i−j)
By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.
A
A[i] = ai
B
B[i] = bi(or be seen as arrays of length n)
CC[i] = ci
m−1∑j=0
Pd[j]Td[i+ j]
these look similar!
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
The polynomial C = A×B can be expressed as. . .
C(x) =
2n−1∑i=0
cixi where ci =
i∑j=0
aj×b(i−j)
By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.
A
A[i] = ai
B
B[i] = bi(or be seen as arrays of length n)
CC[i] = ci
m−1∑j=0
Pd[j]Td[i+ j]
these look similar!
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
The polynomial C = A×B can be expressed as. . .
C(x) =
2n−1∑i=0
cixi where ci =
i∑j=0
aj×b(i−j)
By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.
A
A[i] = ai
B
B[i] = bi(or be seen as arrays of length n)
CC[i] = ci
m−1∑j=0
Pd[j]Td[i+ j]
these look similar!
Hint 1 LetA = Pd andB = Td
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.
CC[i] = ci
m−1∑j=0
Pd[j]Td[i+ j]
these look similar!
Hint 1 LetA = Pd andB = Td
A
A[i] = ai = Pd[i]
B
B[i] = bi = Td[i](or be seen as arrays of length n)
The polynomial C = A×B can be expressed as. . .
C(x) =
2n−1∑i=0
cixi where ci =
i∑j=0
Pd[j]Td[i−j]
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.
CC[i] = ci
m−1∑j=0
Pd[j]Td[i+ j]
these look similar!
Hint 2 LetA = Pd (padded with zeros) andB = Td
A
A[i] = ai = Pd[i]
B
B[i] = bi = Td[i](or be seen as arrays of length n)
The polynomial C = A×B can be expressed as. . .
C(x) =
2n−1∑i=0
cixi where ci =
i∑j=0
Pd[j]Td[i−j]
m0 0 0 00
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.
CC[i] = ci
m−1∑j=0
Pd[j]Td[i+ j]
these look similar!
Hint 3 LetA = Pd (padded with zeros) andB = Td (reversed). . . now C contains (Td ⊗ Pd)
A
A[i] = ai = Pd[i]
B
B[i] = bi = Td[n− i](or be seen as arrays of length n)
The polynomial C = A×B can be expressed as. . .
C(x) =
2n−1∑i=0
cixi where cn−i =
n−i∑j=0
Pd[j]Td[i+ j]
m0 0 0 00
Last year on COMS21103. . .
LetA andB be (n− 1) degree polynomials which can be expressed as. . .
A(x) =
n−1∑i=0
aixi and B(x) =
n−1∑i=0
bixi
By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.
CC[i] = ci
m−1∑j=0
Pd[j]Td[i+ j]
these look similar!
Hint 3 LetA = Pd (padded with zeros) andB = Td (reversed). . . now C contains (Td ⊗ Pd)
A
A[i] = ai = Pd[i]
B
B[i] = bi = Td[n− i](or be seen as arrays of length n)
The polynomial C = A×B can be expressed as. . .
C(x) =
2n−1∑i=0
cixi where cn−i =
n−i∑j=0
Pd[j]Td[i+ j]
m0 0 0 00
Computing cross-correlations via the FFT
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s
Tσ
Pσ
(Tσ ⊗ Pσ)[i] =
m−1∑j=0
Pσ [j]× Tσ [i+ j]
is exactly number of matching ds at the i-th alignment.
00 0
0 0
00 01
1
1 1 11 1 1
1
(Pσ is defined analogously)alignment 4
Computing cross-correlations via the FFT
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s
Tσ
Pσ
(Tσ ⊗ Pσ)[i] =
m−1∑j=0
Pσ [j]× Tσ [i+ j]
is exactly number of matching ds at the i-th alignment.
We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT
00 0
0 0
00 01
1
1 1 11 1 1
1
(Pσ is defined analogously)alignment 4
Computing cross-correlations via the FFT
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s
Tσ
Pσ
(Tσ ⊗ Pσ)[i] =
m−1∑j=0
Pσ [j]× Tσ [i+ j]
is exactly number of matching ds at the i-th alignment.
We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT
00 0
0 0
00 01
1
1 1 11 1 1
1
i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i
(Pσ is defined analogously)alignment 4
Computing cross-correlations via the FFT
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s
Tσ
Pσ
(Tσ ⊗ Pσ)[i] =
m−1∑j=0
Pσ [j]× Tσ [i+ j]
is exactly number of matching ds at the i-th alignment.
We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT
00 0
0 0
00 01
1
1 1 11 1 1
1
i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i
(Pσ is defined analogously)alignment 4
(Tσ ⊗ Pσ) is called the cross-correlation of Tσ and Pσ
Computing cross-correlations via the FFT
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s
Tσ
Pσ
(Tσ ⊗ Pσ)[i] =
m−1∑j=0
Pσ [j]× Tσ [i+ j]
is exactly number of matching ds at the i-th alignment.
We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT
00 0
0 0
00 01
1
1 1 11 1 1
1
i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i
(Pσ is defined analogously)alignment 4
(Tσ ⊗ Pσ) is called the cross-correlation of Tσ and Pσ
it is also very often (but technically incorrectly) called the convolution
Computing cross-correlations via the FFT
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s
Tσ
Pσ
(Tσ ⊗ Pσ)[i] =
m−1∑j=0
Pσ [j]× Tσ [i+ j]
is exactly number of matching ds at the i-th alignment.
We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT
00 0
0 0
00 01
1
1 1 11 1 1
1
i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i
(Pσ is defined analogously)alignment 4
(Tσ ⊗ Pσ) is called the cross-correlation of Tσ and Pσ
it is also very often (but technically incorrectly) called the convolution
cross-correlations are used a lotin the pattern matching literature
Computing cross-correlations via the FFT
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s
Tσ
Pσ
(Tσ ⊗ Pσ)[i] =
m−1∑j=0
Pσ [j]× Tσ [i+ j]
is exactly number of matching ds at the i-th alignment.
We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT
00 0
0 0
00 01
1
1 1 11 1 1
1
i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i
(Pσ is defined analogously)alignment 4
(Tσ ⊗ Pσ) is called the cross-correlation of Tσ and Pσ
it is also very often (but technically incorrectly) called the convolution
cross-correlations are used a lotin the pattern matching literature
(but they mostly call them convolutions)
Computing cross-correlations via the FFT
m
0 1 2 3 4 5 6 7 8 9 10 11 12
n
Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s
Tσ
Pσ
(Tσ ⊗ Pσ)[i] =
m−1∑j=0
Pσ [j]× Tσ [i+ j]
is exactly number of matching ds at the i-th alignment.
We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT
00 0
0 0
00 01
1
1 1 11 1 1
1
i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i
(Pσ is defined analogously)alignment 4
(Tσ ⊗ Pσ) is called the cross-correlation of Tσ and Pσ
it is also very often (but technically incorrectly) called the convolution
cross-correlations are used a lotin the pattern matching literature
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size
We have seen how to find all matches with a single symbol inO(n logn) time
Algorithm Summary
Construct Tσ and Pσ for each symbol σ in Σ
Compute (Tσ ⊗ Pσ) for each symbol σ in Σ
For every i, compute,
Ham(i) = m−∑σ∈Σ
(Tσ ⊗ Pσ)[i] .
(in the example Σ = {a, b, c, d} so |Σ| = 4)
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size
We have seen how to find all matches with a single symbol inO(n logn) time
Algorithm Summary
Construct Tσ and Pσ for each symbol σ in Σ
Compute (Tσ ⊗ Pσ) for each symbol σ in Σ
For every i, compute,
Ham(i) = m−∑σ∈Σ
(Tσ ⊗ Pσ)[i] .
matches involving σ
(in the example Σ = {a, b, c, d} so |Σ| = 4)
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size
We have seen how to find all matches with a single symbol inO(n logn) time
Algorithm Summary
Construct Tσ and Pσ for each symbol σ in Σ
Compute (Tσ ⊗ Pσ) for each symbol σ in Σ
For every i, compute,
Ham(i) = m−∑σ∈Σ
(Tσ ⊗ Pσ)[i] .
all matches
(in the example Σ = {a, b, c, d} so |Σ| = 4)
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size
We have seen how to find all matches with a single symbol inO(n logn) time
Algorithm Summary
Construct Tσ and Pσ for each symbol σ in Σ
Compute (Tσ ⊗ Pσ) for each symbol σ in Σ
For every i, compute,
Ham(i) = m−∑σ∈Σ
(Tσ ⊗ Pσ)[i] .
mismatches = m− matches
(in the example Σ = {a, b, c, d} so |Σ| = 4)
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size
We have seen how to find all matches with a single symbol inO(n logn) time
Algorithm Summary
Construct Tσ and Pσ for each symbol σ in Σ
Compute (Tσ ⊗ Pσ) for each symbol σ in Σ
For every i, compute,
Ham(i) = m−∑σ∈Σ
(Tσ ⊗ Pσ)[i] .
(in the example Σ = {a, b, c, d} so |Σ| = 4)
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size
We have seen how to find all matches with a single symbol inO(n logn) time
Algorithm Summary
Construct Tσ and Pσ for each symbol σ in Σ
Compute (Tσ ⊗ Pσ) for each symbol σ in Σ
For every i, compute,
Ham(i) = m−∑σ∈Σ
(Tσ ⊗ Pσ)[i] .
(O(n|Σ|) time)
(in the example Σ = {a, b, c, d} so |Σ| = 4)
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size
We have seen how to find all matches with a single symbol inO(n logn) time
Algorithm Summary
Construct Tσ and Pσ for each symbol σ in Σ
Compute (Tσ ⊗ Pσ) for each symbol σ in Σ
For every i, compute,
Ham(i) = m−∑σ∈Σ
(Tσ ⊗ Pσ)[i] .
(O(n|Σ|) time)
(O(n|Σ| logn) time)
(in the example Σ = {a, b, c, d} so |Σ| = 4)
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size
We have seen how to find all matches with a single symbol inO(n logn) time
Algorithm Summary
Construct Tσ and Pσ for each symbol σ in Σ
Compute (Tσ ⊗ Pσ) for each symbol σ in Σ
For every i, compute,
Ham(i) = m−∑σ∈Σ
(Tσ ⊗ Pσ)[i] .
(O(n|Σ|) time)
(O(n|Σ| logn) time)
(O(n|Σ|) time)
(in the example Σ = {a, b, c, d} so |Σ| = 4)
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size
We have seen how to find all matches with a single symbol inO(n logn) time
Algorithm Summary
Construct Tσ and Pσ for each symbol σ in Σ
Compute (Tσ ⊗ Pσ) for each symbol σ in Σ
For every i, compute,
Ham(i) = m−∑σ∈Σ
(Tσ ⊗ Pσ)[i] .
(O(n|Σ|) time)
(O(n|Σ| logn) time)
(O(n|Σ|) time)
This takesO(n|Σ| logn) total time (and usesO(n) space)
(in the example Σ = {a, b, c, d} so |Σ| = 4)
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size
We have seen how to find all matches with a single symbol inO(n logn) time
Algorithm Summary
Construct Tσ and Pσ for each symbol σ in Σ
Compute (Tσ ⊗ Pσ) for each symbol σ in Σ
For every i, compute,
Ham(i) = m−∑σ∈Σ
(Tσ ⊗ Pσ)[i] .
(O(n|Σ|) time)
(O(n|Σ| logn) time)
(O(n|Σ|) time)
This takesO(n|Σ| logn) total time (and usesO(n) space)
However, |Σ| could be as big asm...
(in the example Σ = {a, b, c, d} so |Σ| = 4)
It’s a small alphabet after all
Let Σ denote the set of alphabet symbols and |Σ| be its size
We have seen how to find all matches with a single symbol inO(n logn) time
Algorithm Summary
Construct Tσ and Pσ for each symbol σ in Σ
Compute (Tσ ⊗ Pσ) for each symbol σ in Σ
For every i, compute,
Ham(i) = m−∑σ∈Σ
(Tσ ⊗ Pσ)[i] .
(O(n|Σ|) time)
(O(n|Σ| logn) time)
(O(n|Σ|) time)
This takesO(n|Σ| logn) total time (and usesO(n) space)
However, |Σ| could be as big asm...
in which case, this is worse than the naive method!
(in the example Σ = {a, b, c, d} so |Σ| = 4)
Coping with a large alphabet
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
Coping with a large alphabet
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
Coping with a large alphabet
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
Coping with a large alphabet
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
(√m = 3
)
Coping with a large alphabet
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
(√m = 3
)
Coping with a large alphabet
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequent
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
(√m = 3
)
Coping with a large alphabet
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
(√m = 3
)
Coping with a large alphabet
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
Key idea: Our algorithm will have two main stages:
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
(√m = 3
)
Coping with a large alphabet
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
Key idea: Our algorithm will have two main stages:
Stage 1 will count all the matches involving frequent symbols(at each alignment of P and T )
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
(√m = 3
)
Coping with a large alphabet
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
Key idea: Our algorithm will have two main stages:
Stage 1 will count all the matches involving frequent symbols
Stage 2 will count all the matches involving infrequent symbols
(at each alignment of P and T )
(at each alignment of P and T )
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
(√m = 3
)
Coping with a large alphabet
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
Key idea: Our algorithm will have two main stages:
Stage 1 will count all the matches involving frequent symbols
Stage 2 will count all the matches involving infrequent symbols
The total number of matches is the sum of the matches from Stage 1 and Stage 2
(at each alignment of P and T )
(at each alignment of P and T )
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
(√m = 3
)
Coping with a large alphabet
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
Key idea: Our algorithm will have two main stages:
Stage 1 will count all the matches involving frequent symbols
Stage 2 will count all the matches involving infrequent symbols
The total number of matches is the sum of the matches from Stage 1 and Stage 2
(at each alignment of P and T )
(at each alignment of P and T )
We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
inO(n logn) time (per symbol σ) using cross-correlations
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
How many frequent symbols can there be?
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
inO(n logn) time (per symbol σ) using cross-correlations
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
How many frequent symbols can there be?
Assume that there at least (√m+ 1) freq. symbols
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
inO(n logn) time (per symbol σ) using cross-correlations
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
How many frequent symbols can there be?
Assume that there at least (√m+ 1) freq. symbols
each occurs at least√m times. . .
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
inO(n logn) time (per symbol σ) using cross-correlations
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
How many frequent symbols can there be?
Assume that there at least (√m+ 1) freq. symbols
each occurs at least√m times. . . (
√m+ 1)
√m > m
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
inO(n logn) time (per symbol σ) using cross-correlations
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
How many frequent symbols can there be?
Assume that there at least (√m+ 1) freq. symbols
each occurs at least√m times. . . (
√m+ 1)
√m > m Contradiction!
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
inO(n logn) time (per symbol σ) using cross-correlations
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
How many frequent symbols can there be?
Assume that there at least (√m+ 1) freq. symbols
so there are at most√m frequent symbols
each occurs at least√m times. . . (
√m+ 1)
√m > m Contradiction!
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
inO(n logn) time (per symbol σ) using cross-correlations
The frequent/infrequent symbols trick
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
Stage 1: For each alignment i, count the number of matches involving frequent symbols:
Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)
How many frequent symbols can there be?
Assume that there at least (√m+ 1) freq. symbols
So Stage 1 takesO(n√m logn) time.
so there are at most√m frequent symbols
each occurs at least√m times. . . (
√m+ 1)
√m > m Contradiction!
P a d bc ab b da
0 1 2 3 4 5 6 7 8
m = 9
a is frequent , b is frequentc and d are infrequent
inO(n logn) time (per symbol σ) using cross-correlations
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
m = 9
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
m = 9
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 00
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 00
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 00
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 00
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
If T [k] is infrequent. . .
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 00
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
If T [k] is infrequent. . .
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 00
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
If T [k] is infrequent. . .
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 00
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
If T [k] is infrequent. . .
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 00
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 00
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 0
(except when (k − j) < 0)
0
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 0
(except when (k − j) < 0)
0
(k − j) < 0P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 0
(except when (k − j) < 0)
0
(k − j) < 0P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 0
(except when (k − j) < 0)
0
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 0
(except when (k − j) < 0)
0
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 0
(except when (k − j) < 0)
0
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 0
(except when (k − j) < 0)
0
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 0
(except when (k − j) < 0)
0
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 0
(except when (k − j) < 0)
0
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 0
(except when (k − j) < 0)
0
P a d bc ab b da
j = 4
k = 4
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 00 0 0 0 0 0
(except when (k − j) < 0)
k − j = 0
0
P a d bc ab b da
j = 4
k = 4
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0 0
(except when (k − j) < 0)
k − j = 0
P a d bc ab b da
j = 4
k = 4
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0 0
(except when (k − j) < 0)
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0 0
(except when (k − j) < 0)
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0 0
(except when (k − j) < 0)
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0 0
(except when (k − j) < 0)
P a d bc ab b da
k = 5
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0 0
(except when (k − j) < 0)
k = 5
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0 0
(except when (k − j) < 0)
k = 5
j = 4
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0 0
(except when (k − j) < 0)
k − j = 1
k = 5
j = 4
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0
(except when (k − j) < 0)
k − j = 1
1
k = 5
j = 4
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0
(except when (k − j) < 0)
1
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0
(except when (k − j) < 0)
1
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0
(except when (k − j) < 0)
1
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0
(except when (k − j) < 0)
1
k = 6
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0
(except when (k − j) < 0)
1
k = 6
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0
(except when (k − j) < 0)
1
k = 6
j = 4
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0 0
(except when (k − j) < 0)
1
k = 6
j = 4k − j = 2
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
k = 6
j = 4k − j = 2
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
k = 7
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
k = 7
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
k = 7
j = 8
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
(k − j) < 0
k = 7
j = 8
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
k = 7
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
k = 7
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
k = 7
j = 6
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
1 1
k = 7
j = 6k − j = 1
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
12
k = 7
j = 6k − j = 1
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
12
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
12
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
12
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
12
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
12
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
12
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
13
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
13
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
13
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 00 0 0 0
(except when (k − j) < 0)
13
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 0 0 0 0
(except when (k − j) < 0)
13 1
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 0 0 0 0
(except when (k − j) < 0)
13 1
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
What isA[i]?
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
What isA[i]?
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
What isA[i]?FactA[i] is the number of matches at
alignment i involving an infrequent symbol
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
How quick is Stage 2?
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
How quick is Stage 2?
O(n) time
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
store a list for each infrequent symbol
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
How quick is Stage 2?
O(n) time
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
store a list for each infrequent symbol
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
How quick is Stage 2?
(each list has length less than√m)
O(n) time
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
store a list for each infrequent symbol
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
How quick is Stage 2?
(each list has length less than√m)
O(n) time
P a d bc ab b da
O(n√m)
time
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
P a d bc ab b da
The infrequent/frequent symbols trick
aaa
Definition: A symbol is infrequent if it occurs fewer than√m times in P .
a is frequent , b is frequentc and d are infrequent
Every symbol is either
frequent or infrequent
T d b c c c d d c d c d ca
Stage 2: Count all matches involving infrequent symbols.
Make a single pass through T . . .
For each character T [k], (where 0 6 k < n)
For all j such that T [k] = P [j],
If T [k] is infrequent. . .
IncreaseA[k − j] by one
Construct an arrayA of length (n−m+ 1) - which initially contains all zeros
A 1 1 2 1 13 2 0
(except when (k − j) < 0)
O(n√m) total time
P a d bc ab b da
Pattern matching with mismatches: putting it all together
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Pattern matching with mismatches: putting it all together
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
Pattern matching with mismatches: putting it all together
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
(by alphabetically sorting the characters from P )
Pattern matching with mismatches: putting it all together
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Stage 1: Count all matches involving frequent symbols -O(n√m logn) time
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
(by alphabetically sorting the characters from P )
Pattern matching with mismatches: putting it all together
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Stage 1: Count all matches involving frequent symbols -O(n√m logn) time
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
Pattern matching with mismatches: putting it all together
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Stage 1: Count all matches involving frequent symbols -O(n√m logn) time
Stage 2: Count all matches involving infrequent symbols. -O(n√m) time
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
Pattern matching with mismatches: putting it all together
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Stage 1: Count all matches involving frequent symbols -O(n√m logn) time
Stage 2: Count all matches involving infrequent symbols. -O(n√m) time
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Pattern matching with mismatches: putting it all together
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Stage 1: Count all matches involving frequent symbols -O(n√m logn) time
Stage 2: Count all matches involving infrequent symbols. -O(n√m) time
at any alignment i the number of mismatches is justm minus the total number of matches
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Pattern matching with mismatches: putting it all together
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Overall, we obtain a time complexity of O(n√m logn).
Stage 1: Count all matches involving frequent symbols -O(n√m logn) time
Stage 2: Count all matches involving infrequent symbols. -O(n√m) time
at any alignment i the number of mismatches is justm minus the total number of matches
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Pattern matching with mismatches: putting it all together
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Overall, we obtain a time complexity of O(n√m logn).
Stage 1: Count all matches involving frequent symbols -O(n√m logn) time
Stage 2: Count all matches involving infrequent symbols. -O(n√m) time
at any alignment i the number of mismatches is justm minus the total number of matches
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Pattern matching with mismatches: putting it all together
Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Overall, we obtain a time complexity of O(n√m logn).
Stage 1: Count all matches involving frequent symbols -O(n√m logn) time
Stage 2: Count all matches involving infrequent symbols. -O(n√m) time
at any alignment i the number of mismatches is justm minus the total number of matches
Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct countingNotice that Stage 1 takes
longer than Stage 2...
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
How long does each stage take now?
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time(this stage is unaffected - the time complexity doesn’t depend on f ))
How long does each stage take now?
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
How long does each stage take now?
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
Stage 1: Count all matches involving frequent symbols
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
How long does each stage take now?
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
Stage 1: Count all matches involving frequent symbols
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
How long does each stage take now?
As each frequent symbol occurs at least f times, there are at most mf frequent symbols
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
Stage 1: Count all matches involving frequent symbols
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
How long does each stage take now?
As each frequent symbol occurs at least f times, there are at most mf frequent symbols
and we do one cross-correlation for each frequent symbol. . .
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
How long does each stage take now?
As each frequent symbol occurs at least f times, there are at most mf frequent symbols
and we do one cross-correlation for each frequent symbol. . .
Stage 1: Count all matches involving frequent symbols -O(mf · n logn) time
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
How long does each stage take now?
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
How long does each stage take now?
Stage 2: Count all matches involving infrequent symbols.
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
How long does each stage take now?
Stage 2: Count all matches involving infrequent symbols.
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
We make a single pass through T . . .
and for each T [i] we update at most (f − 1) locations inA
Improving the Time Complexity 1 - balance the stages
Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .
and infrequent otherwise
What happens if we generalise this definition?
New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
How long does each stage take now?
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Stage 2: Count all matches involving infrequent symbols. -O(nf) time
We make a single pass through T . . .
and for each T [i] we update at most (f − 1) locations inA
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Stage 1: Count all matches involving frequent symbols -O(nmf logn) time
Stage 2: Count all matches involving infrequent symbols. -O(nf) time
at any alignment i the number of mismatches is justm minus the total number of matches
Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
What should we set f to?
Stage 1: Count all matches involving frequent symbols -O(nmf logn) time
Stage 2: Count all matches involving infrequent symbols. -O(nf) time
at any alignment i the number of mismatches is justm minus the total number of matches
Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
What should we set f to?
Stage 1: Count all matches involving frequent symbols -O(nmf logn) time
Stage 2: Count all matches involving infrequent symbols. -O(nf) time
at any alignment i the number of mismatches is justm minus the total number of matches
Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Let f =√m logn. . .
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
What should we set f to?
Stage 1: Count all matches involving frequent symbols -O(nmf logn) time
Stage 2: Count all matches involving infrequent symbols. -O(nf) time
at any alignment i the number of mismatches is justm minus the total number of matches
Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Let f =√m logn. . .
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
What should we set f to?
at any alignment i the number of mismatches is justm minus the total number of matches
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Let f =√m logn. . .
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Stage 1: Count all matches involving frequent symbols -O(n m√m logn
logn) time
Stage 2: Count all matches involving infrequent symbols. -O(n√m logn) time
Definition: An alphabet symbol is frequent if it occurs at least√m logn times in P .
and infrequent otherwise
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
What should we set f to?
at any alignment i the number of mismatches is justm minus the total number of matches
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Let f =√m logn. . .
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Stage 2: Count all matches involving infrequent symbols. -O(n√m logn) time
Definition: An alphabet symbol is frequent if it occurs at least√m logn times in P .
and infrequent otherwise
Stage 1: Count all matches involving frequent symbols -O(n√m logn) time
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
at any alignment i the number of mismatches is justm minus the total number of matches
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Stage 2: Count all matches involving infrequent symbols. -O(n√m logn) time
Definition: An alphabet symbol is frequent if it occurs at least√m logn times in P .
and infrequent otherwise
Stage 1: Count all matches involving frequent symbols -O(n√m logn) time
Pattern matching with mismatches: putting it all together
(Generalised) Algorithm summary
at any alignment i the number of mismatches is justm minus the total number of matches
(by alphabetically sorting the characters from P )
Tσ
Pσ
00 0
0 0
00 01
1
1 1 11 1 1
1
Matches with a single symbolcan be found using a cross-correlation
aaaT d b c c c d d c d c d ca
A 1 1 2 1 13 2 0
P a d bc ab b daMatches with an infrequent symbol
can be found by direct counting
Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time
Stage 2: Count all matches involving infrequent symbols. -O(n√m logn) time
Definition: An alphabet symbol is frequent if it occurs at least√m logn times in P .
and infrequent otherwise
Stage 1: Count all matches involving frequent symbols -O(n√m logn) time
This improves the overall time complexity from O(n√m logn) toO(n
√m logn).
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
T
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
P
m
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
P
m
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
P
m
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
2m
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
P
m
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
the final substringmight be shorter
2m
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
P
m
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
2m
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
P
m
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
P
m
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m P
m
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m P
m
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m P
m
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m P
m
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m P
m
How long does running the previous algorithm take?
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m P
m
How long does running the previous algorithm take?
O(|Tk|√m log |Tk|) time.
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m P
m
How long does running the previous algorithm take?
O(|Tk|√m log |Tk|) time.
= O(m√m logm) time.
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m P
m
How long does running the previous algorithm take?
O(|Tk|√m log |Tk|) time.
= O(m√m logm) time.
We run the previous algorithmO( nm
)times so this process takesO(n
√m logm) time in total
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m
How long does running the previous algorithm take?
O(|Tk|√m log |Tk|) time.
= O(m√m logm) time.
We run the previous algorithmO( nm
)times so this process takesO(n
√m logm) time in total
P
m
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Split T intoO( nm
)contiguous 2m length substrings, T1, T2, T3 . . .
Run the previous algorithm once for with P and each Tk
2m
How long does running the previous algorithm take?
O(|Tk|√m log |Tk|) time.
= O(m√m logm) time.
We run the previous algorithmO( nm
)times so this process takesO(n
√m logm) time in total
P
m
what about this alignment?
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Run the previous algorithm once for with P and each Tk
2m
How long does running the previous algorithm take?
O(|Tk|√m log |Tk|) time.
= O(m√m logm) time.
We run the previous algorithmO( nm
)times so this process takesO(n
√m logm) time in total
P
m
what about this alignment?
Split T intoO( nm
)overlapping 2m length substrings, T1, T2, T3 . . .
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Run the previous algorithm once for with P and each Tk
2m
How long does running the previous algorithm take?
O(|Tk|√m log |Tk|) time.
= O(m√m logm) time.
We run the previous algorithmO( nm
)times so this process takesO(n
√m logm) time in total
P
m
what about this alignment?
Split T intoO( nm
)overlapping 2m length substrings, T1, T2, T3 . . .
T1 T3 T5 T7 T9 T11 T13 T15 T17 T19 T21
m 2mT2 T4 T6 T8 T10 T12 T14 T16 T18 T20
Improving the time complexity 2 - split the text
We have just seen an algorithm which takesO(n√m logn) time.
Imagine that n is a lot bigger thanm. . .
T
Run the previous algorithm once for with P and each Tk
2m
How long does running the previous algorithm take?
O(|Tk|√m log |Tk|) time.
= O(m√m logm) time.
We run the previous algorithmO( nm
)times so this process takesO(n
√m logm) time in total
P
m
Split T intoO( nm
)overlapping 2m length substrings, T1, T2, T3 . . .
T1 T3 T5 T7 T9 T11 T13 T15 T17 T19 T21
m 2mT2 T4 T6 T8 T10 T12 T14 T16 T18 T20
Conclusion
T
Input: A text string T (length n) and a pattern string P (lengthm)
P
ba b c a a d ad a
Goal: For every alignment i, output
(the Hamming distance is the number of mismatches)
c a a
A naive algorithm for this problem takesO(nm) time
0 1 2 3 4 5 6 7 8 9 10 11 12
n
a b d
m
aHam(8) = 3
Ham(i), the Hamming distance between P and T [i . . . i+m− 1]
We have seen two alternative algorithms:
One algorithm takesO(n|Σ| logn) time (where |Σ| is the alphabet size)
The other algorithm takesO(n√m logn) time (regardless of the alphabet size)
and can be improved toO(n√m logm)
(by changing the freq./infreq. cut off and splitting the text)
and can be improved toO(n|Σ| logm) (by splitting the text)