pattern matching part three: hamming distance

221
Advanced Algorithms – COMS31900 Pattern matching part three Hamming distance Benjamin Sach

Upload: benjamin-sach

Post on 13-Apr-2017

51 views

Category:

Education


0 download

TRANSCRIPT

Advanced Algorithms – COMS31900

Pattern matching part three

Hamming distance

Benjamin Sach

Exact pattern matching

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c

a b a

a b a cb a

Goal: Find all the locations where P matches in TP matches at location i iff

a b a

m

for all 0 6 j < m we have that P [j] = T [i+ j]

(our strings are zero-indexed)

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Exact pattern matching

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c

a b a

a b a cb a

Goal: Find all the locations where P matches in TP matches at location i iff

a b a

m

4

for all 0 6 j < m we have that P [j] = T [i+ j]

(our strings are zero-indexed)

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Exact pattern matching

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a b a cb a

Goal: Find all the locations where P matches in TP matches at location i iff

a b a

a b a

m

6

for all 0 6 j < m we have that P [j] = T [i+ j]

(our strings are zero-indexed)

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Exact pattern matching

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a b a cb a

Goal: Find all the locations where P matches in TP matches at location i iff

a b a

a b a

m

10

for all 0 6 j < m we have that P [j] = T [i+ j]

(our strings are zero-indexed)

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Exact pattern matching

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a b a cb a

Goal: Find all the locations where P matches in TP matches at location i iff

a b a

a b a

m

6

for all 0 6 j < m we have that P [j] = T [i+ j]

(our strings are zero-indexed)

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Exact pattern matching

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a b a cb a

Goal: Find all the locations where P matches in TP matches at location i iff

a b a

a b a

m

6

for all 0 6 j < m we have that P [j] = T [i+ j]

(our strings are zero-indexed)

j-th character of P

(i+ j)-th char. of T

T [2] = c

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Exact pattern matching

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a b a cb a

Goal: Find all the locations where P matches in TP matches at location i iff

a b a

a b a

m

6

for all 0 6 j < m we have that P [j] = T [i+ j]

(our strings are zero-indexed)

• A naive algorithm takesO(nm) time

j-th character of P

(i+ j)-th char. of T

T [2] = c

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Exact pattern matching

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a b a cb a

Goal: Find all the locations where P matches in TP matches at location i iff

a b a

a b a

m

6

for all 0 6 j < m we have that P [j] = T [i+ j]

(our strings are zero-indexed)

• A naive algorithm takesO(nm) time

• ManyO(n) time algorithms are known (for example the KMP algorithm)

j-th character of P

(i+ j)-th char. of T

T [2] = c

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Pattern matching with mismatches

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c

a b d

a a d ad a

Goal: For every alignment i, output

The Hamming distance is the number of mismatches. . .

c a a

m

i.e. the number of distinct j such that P [j] 6= T [i+ j]

0 1 2 3 4 5 6 7 8 9 10 11 12

n

a

Ham(i), the Hamming distance between P and T [i . . . i+m− 1]

Pattern matching with mismatches

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c

a b d

a a d ad a

Goal: For every alignment i, output

The Hamming distance is the number of mismatches. . .

c a a

m

i.e. the number of distinct j such that P [j] 6= T [i+ j]

0 1 2 3 4 5 6 7 8 9 10 11 12

n

a

Ham(4) = 1

Ham(i), the Hamming distance between P and T [i . . . i+m− 1]

Pattern matching with mismatches

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a a d ad a

Goal: For every alignment i, output

The Hamming distance is the number of mismatches. . .

c a a

i.e. the number of distinct j such that P [j] 6= T [i+ j]

0 1 2 3 4 5 6 7 8 9 10 11 12

n

a b d

m

a

Ham(5) = 4

Ham(i), the Hamming distance between P and T [i . . . i+m− 1]

Pattern matching with mismatches

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a a d ad a

Goal: For every alignment i, output

The Hamming distance is the number of mismatches. . .

c a a

i.e. the number of distinct j such that P [j] 6= T [i+ j]

0 1 2 3 4 5 6 7 8 9 10 11 12

n

a b d

m

a

Ham(6) = 1

Ham(i), the Hamming distance between P and T [i . . . i+m− 1]

Pattern matching with mismatches

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a a d ad a

Goal: For every alignment i, output

The Hamming distance is the number of mismatches. . .

c a a

i.e. the number of distinct j such that P [j] 6= T [i+ j]

0 1 2 3 4 5 6 7 8 9 10 11 12

n

a b d

m

a

Ham(7) = 3

Ham(i), the Hamming distance between P and T [i . . . i+m− 1]

Pattern matching with mismatches

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a a d ad a

Goal: For every alignment i, output

The Hamming distance is the number of mismatches. . .

c a a

i.e. the number of distinct j such that P [j] 6= T [i+ j]

0 1 2 3 4 5 6 7 8 9 10 11 12

n

a b d

m

a

Ham(7) = 3

Ham(i), the Hamming distance between P and T [i . . . i+m− 1]

this is alignment 7

Pattern matching with mismatches

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a a d ad a

Goal: For every alignment i, output

The Hamming distance is the number of mismatches. . .

c a a

i.e. the number of distinct j such that P [j] 6= T [i+ j]

0 1 2 3 4 5 6 7 8 9 10 11 12

n

a b d

m

a

Ham(8) = 3

Ham(i), the Hamming distance between P and T [i . . . i+m− 1]

this is alignment 8

Pattern matching with mismatches

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a a d ad a

Goal: For every alignment i, output

The Hamming distance is the number of mismatches. . .

c a a

i.e. the number of distinct j such that P [j] 6= T [i+ j]

A naive algorithm for this problem takesO(nm) time

0 1 2 3 4 5 6 7 8 9 10 11 12

n

a b d

m

a

Ham(8) = 3

Ham(i), the Hamming distance between P and T [i . . . i+m− 1]

this is alignment 8

Pattern matching with mismatches

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a a d ad a

Goal: For every alignment i, output

The Hamming distance is the number of mismatches. . .

c a a

i.e. the number of distinct j such that P [j] 6= T [i+ j]

A naive algorithm for this problem takesO(nm) time. . . but we can do better

0 1 2 3 4 5 6 7 8 9 10 11 12

n

a b d

m

a

Ham(8) = 3

Ham(i), the Hamming distance between P and T [i . . . i+m− 1]

this is alignment 8

It’s a small alphabet after all

T

P

d

d

d d dd d d

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

d

Imagine that the alphabet contains only a small number of different symbols,

aa c

a b

bb c

which we will consider individually. . .

It’s a small alphabet after all

T

P

d

d

d d dd d d

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

d

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

aa c

a b

bb c

which we will consider individually. . .

It’s a small alphabet after all

T

P

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

d

d

d d dd d d

d

aa c

a b

bb c

which we will consider individually. . .

It’s a small alphabet after all

T

P

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

aa c

a b

bb c

which we will consider individually. . .

1

1

1 1 11 1 1

1

It’s a small alphabet after all

T

P

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

aa c

a b

bb c

which we will consider individually. . .

1

1

1 1 11 1 1

1

It’s a small alphabet after all

T

P

aa c

a b

bb c

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

which we will consider individually. . .

1

1

1 1 11 1 1

1

It’s a small alphabet after all

T

P

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

which we will consider individually. . .

00 0

0 0

00 01

1

1 1 11 1 1

1

It’s a small alphabet after all

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

Td

Pd

We denote these new strings Td and Pd and consider. . .

which we will consider individually. . .

00 0

0 0

00 01

1

1 1 11 1 1

1

It’s a small alphabet after all

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

Td

Pd

We denote these new strings Td and Pd and consider. . .

(Td ⊗ Pd)[i] =

m−1∑j=0

Pd[j]× Td[i+ j]

which we will consider individually. . .

00 0

0 0

00 01

1

1 1 11 1 1

1

It’s a small alphabet after all

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

Td

Pd

We denote these new strings Td and Pd and consider. . .

(Td ⊗ Pd)[i] =

m−1∑j=0

Pd[j]× Td[i+ j]

which we will consider individually. . .

00 0

0 0

00 01

1

1 1 11 1 1

1

It’s a small alphabet after all

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

Td

Pd

We denote these new strings Td and Pd and consider. . .

(Td ⊗ Pd)[i] =

m−1∑j=0

Pd[j]× Td[i+ j]

which we will consider individually. . .

00 0

0 0

00 01

1

1 1 11 1 1

1(Td ⊗ Pd)[4] =

(1× 1)+ (0× 0)+

(1× 0)+ (1× 1) = 2

It’s a small alphabet after all

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

Td

Pd

We denote these new strings Td and Pd and consider. . .

(Td ⊗ Pd)[i] =

m−1∑j=0

Pd[j]× Td[i+ j]

1 iffP [j]=T [i+j]=d

which we will consider individually. . .

00 0

0 0

00 01

1

1 1 11 1 1

1(Td ⊗ Pd)[4] =

(1× 1)+ (0× 0)+

(1× 0)+ (1× 1) = 2

It’s a small alphabet after all

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

Td

Pd

We denote these new strings Td and Pd and consider. . .

(Td ⊗ Pd)[i] =

m−1∑j=0

Pd[j]× Td[i+ j]

This is the exactly number of matching ds at the i-th alignment.

1 iffP [j]=T [i+j]=d

which we will consider individually. . .

00 0

0 0

00 01

1

1 1 11 1 1

1(Td ⊗ Pd)[4] =

(1× 1)+ (0× 0)+

(1× 0)+ (1× 1) = 2

It’s a small alphabet after all

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

Td

Pd

We denote these new strings Td and Pd and consider. . .

(Td ⊗ Pd)[i] =

m−1∑j=0

Pd[j]× Td[i+ j]

This is the exactly number of matching ds at the i-th alignment.

1 iffP [j]=T [i+j]=d

which we will consider individually. . .

How can we work out (Td ⊗ Pd) quickly?

00 0

0 0

00 01

1

1 1 11 1 1

1(Td ⊗ Pd)[4] =

(1× 1)+ (0× 0)+

(1× 0)+ (1× 1) = 2

It’s a small alphabet after all

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Imagine that the alphabet contains only a small number of different symbols,

Replace all d symbols with 1 and everything else with 0

Td

Pd

We denote these new strings Td and Pd and consider. . .

(Td ⊗ Pd)[i] =

m−1∑j=0

Pd[j]× Td[i+ j]

This is the exactly number of matching ds at the i-th alignment.

1 iffP [j]=T [i+j]=d

which we will consider individually. . .

How can we work out (Td ⊗ Pd) quickly?

00 0

0 0

00 01

1

1 1 11 1 1

1

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

A

A[i] = ai

B

B[i] = bi(or be seen as arrays of length n)

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

The polynomial C = A×B can be expressed as. . .

C(x) =

2n−1∑i=0

cixi where ci =

i∑j=0

aj×b(i−j)

A

A[i] = ai

B

B[i] = bi(or be seen as arrays of length n)

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

The polynomial C = A×B can be expressed as. . .

C(x) =

2n−1∑i=0

cixi where ci =

i∑j=0

aj×b(i−j)

A

A[i] = ai

B

B[i] = bi(or be seen as arrays of length n)

CC[i] = ci

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

The polynomial C = A×B can be expressed as. . .

C(x) =

2n−1∑i=0

cixi where ci =

i∑j=0

aj×b(i−j)

By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.

A

A[i] = ai

B

B[i] = bi(or be seen as arrays of length n)

CC[i] = ci

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

The polynomial C = A×B can be expressed as. . .

C(x) =

2n−1∑i=0

cixi where ci =

i∑j=0

aj×b(i−j)

By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.

A

A[i] = ai

B

B[i] = bi(or be seen as arrays of length n)

CC[i] = ci

m−1∑j=0

Pd[j]Td[i+ j]

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

The polynomial C = A×B can be expressed as. . .

C(x) =

2n−1∑i=0

cixi where ci =

i∑j=0

aj×b(i−j)

By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.

A

A[i] = ai

B

B[i] = bi(or be seen as arrays of length n)

CC[i] = ci

m−1∑j=0

Pd[j]Td[i+ j]

these look similar!

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

The polynomial C = A×B can be expressed as. . .

C(x) =

2n−1∑i=0

cixi where ci =

i∑j=0

aj×b(i−j)

By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.

A

A[i] = ai

B

B[i] = bi(or be seen as arrays of length n)

CC[i] = ci

m−1∑j=0

Pd[j]Td[i+ j]

these look similar!

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

The polynomial C = A×B can be expressed as. . .

C(x) =

2n−1∑i=0

cixi where ci =

i∑j=0

aj×b(i−j)

By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.

A

A[i] = ai

B

B[i] = bi(or be seen as arrays of length n)

CC[i] = ci

m−1∑j=0

Pd[j]Td[i+ j]

these look similar!

Hint 1 LetA = Pd andB = Td

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.

CC[i] = ci

m−1∑j=0

Pd[j]Td[i+ j]

these look similar!

Hint 1 LetA = Pd andB = Td

A

A[i] = ai = Pd[i]

B

B[i] = bi = Td[i](or be seen as arrays of length n)

The polynomial C = A×B can be expressed as. . .

C(x) =

2n−1∑i=0

cixi where ci =

i∑j=0

Pd[j]Td[i−j]

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.

CC[i] = ci

m−1∑j=0

Pd[j]Td[i+ j]

these look similar!

Hint 2 LetA = Pd (padded with zeros) andB = Td

A

A[i] = ai = Pd[i]

B

B[i] = bi = Td[i](or be seen as arrays of length n)

The polynomial C = A×B can be expressed as. . .

C(x) =

2n−1∑i=0

cixi where ci =

i∑j=0

Pd[j]Td[i−j]

m0 0 0 00

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.

CC[i] = ci

m−1∑j=0

Pd[j]Td[i+ j]

these look similar!

Hint 3 LetA = Pd (padded with zeros) andB = Td (reversed). . . now C contains (Td ⊗ Pd)

A

A[i] = ai = Pd[i]

B

B[i] = bi = Td[n− i](or be seen as arrays of length n)

The polynomial C = A×B can be expressed as. . .

C(x) =

2n−1∑i=0

cixi where cn−i =

n−i∑j=0

Pd[j]Td[i+ j]

m0 0 0 00

Last year on COMS21103. . .

LetA andB be (n− 1) degree polynomials which can be expressed as. . .

A(x) =

n−1∑i=0

aixi and B(x) =

n−1∑i=0

bixi

By the magic of the FFT we can compute C (i.e. every ci) inO(n logn) time.

CC[i] = ci

m−1∑j=0

Pd[j]Td[i+ j]

these look similar!

Hint 3 LetA = Pd (padded with zeros) andB = Td (reversed). . . now C contains (Td ⊗ Pd)

A

A[i] = ai = Pd[i]

B

B[i] = bi = Td[n− i](or be seen as arrays of length n)

The polynomial C = A×B can be expressed as. . .

C(x) =

2n−1∑i=0

cixi where cn−i =

n−i∑j=0

Pd[j]Td[i+ j]

m0 0 0 00

Computing cross-correlations via the FFT

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s

(Tσ ⊗ Pσ)[i] =

m−1∑j=0

Pσ [j]× Tσ [i+ j]

is exactly number of matching ds at the i-th alignment.

00 0

0 0

00 01

1

1 1 11 1 1

1

(Pσ is defined analogously)alignment 4

Computing cross-correlations via the FFT

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s

(Tσ ⊗ Pσ)[i] =

m−1∑j=0

Pσ [j]× Tσ [i+ j]

is exactly number of matching ds at the i-th alignment.

We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT

00 0

0 0

00 01

1

1 1 11 1 1

1

(Pσ is defined analogously)alignment 4

Computing cross-correlations via the FFT

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s

(Tσ ⊗ Pσ)[i] =

m−1∑j=0

Pσ [j]× Tσ [i+ j]

is exactly number of matching ds at the i-th alignment.

We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT

00 0

0 0

00 01

1

1 1 11 1 1

1

i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i

(Pσ is defined analogously)alignment 4

Computing cross-correlations via the FFT

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s

(Tσ ⊗ Pσ)[i] =

m−1∑j=0

Pσ [j]× Tσ [i+ j]

is exactly number of matching ds at the i-th alignment.

We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT

00 0

0 0

00 01

1

1 1 11 1 1

1

i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i

(Pσ is defined analogously)alignment 4

(Tσ ⊗ Pσ) is called the cross-correlation of Tσ and Pσ

Computing cross-correlations via the FFT

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s

(Tσ ⊗ Pσ)[i] =

m−1∑j=0

Pσ [j]× Tσ [i+ j]

is exactly number of matching ds at the i-th alignment.

We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT

00 0

0 0

00 01

1

1 1 11 1 1

1

i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i

(Pσ is defined analogously)alignment 4

(Tσ ⊗ Pσ) is called the cross-correlation of Tσ and Pσ

it is also very often (but technically incorrectly) called the convolution

Computing cross-correlations via the FFT

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s

(Tσ ⊗ Pσ)[i] =

m−1∑j=0

Pσ [j]× Tσ [i+ j]

is exactly number of matching ds at the i-th alignment.

We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT

00 0

0 0

00 01

1

1 1 11 1 1

1

i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i

(Pσ is defined analogously)alignment 4

(Tσ ⊗ Pσ) is called the cross-correlation of Tσ and Pσ

it is also very often (but technically incorrectly) called the convolution

cross-correlations are used a lotin the pattern matching literature

Computing cross-correlations via the FFT

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s

(Tσ ⊗ Pσ)[i] =

m−1∑j=0

Pσ [j]× Tσ [i+ j]

is exactly number of matching ds at the i-th alignment.

We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT

00 0

0 0

00 01

1

1 1 11 1 1

1

i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i

(Pσ is defined analogously)alignment 4

(Tσ ⊗ Pσ) is called the cross-correlation of Tσ and Pσ

it is also very often (but technically incorrectly) called the convolution

cross-correlations are used a lotin the pattern matching literature

(but they mostly call them convolutions)

Computing cross-correlations via the FFT

m

0 1 2 3 4 5 6 7 8 9 10 11 12

n

Let Tσ be T with all σs replaced with 1s and everything else replaced with a 0s

(Tσ ⊗ Pσ)[i] =

m−1∑j=0

Pσ [j]× Tσ [i+ j]

is exactly number of matching ds at the i-th alignment.

We can compute (Tσ ⊗ Pσ) inO(n logn) time via the FFT

00 0

0 0

00 01

1

1 1 11 1 1

1

i.e afterO(n logn) time we have (Td ⊗ Pd)[i] for every i

(Pσ is defined analogously)alignment 4

(Tσ ⊗ Pσ) is called the cross-correlation of Tσ and Pσ

it is also very often (but technically incorrectly) called the convolution

cross-correlations are used a lotin the pattern matching literature

It’s a small alphabet after all

Let Σ denote the set of alphabet symbols and |Σ| be its size

We have seen how to find all matches with a single symbol inO(n logn) time

Algorithm Summary

Construct Tσ and Pσ for each symbol σ in Σ

Compute (Tσ ⊗ Pσ) for each symbol σ in Σ

For every i, compute,

Ham(i) = m−∑σ∈Σ

(Tσ ⊗ Pσ)[i] .

(in the example Σ = {a, b, c, d} so |Σ| = 4)

It’s a small alphabet after all

Let Σ denote the set of alphabet symbols and |Σ| be its size

We have seen how to find all matches with a single symbol inO(n logn) time

Algorithm Summary

Construct Tσ and Pσ for each symbol σ in Σ

Compute (Tσ ⊗ Pσ) for each symbol σ in Σ

For every i, compute,

Ham(i) = m−∑σ∈Σ

(Tσ ⊗ Pσ)[i] .

matches involving σ

(in the example Σ = {a, b, c, d} so |Σ| = 4)

It’s a small alphabet after all

Let Σ denote the set of alphabet symbols and |Σ| be its size

We have seen how to find all matches with a single symbol inO(n logn) time

Algorithm Summary

Construct Tσ and Pσ for each symbol σ in Σ

Compute (Tσ ⊗ Pσ) for each symbol σ in Σ

For every i, compute,

Ham(i) = m−∑σ∈Σ

(Tσ ⊗ Pσ)[i] .

all matches

(in the example Σ = {a, b, c, d} so |Σ| = 4)

It’s a small alphabet after all

Let Σ denote the set of alphabet symbols and |Σ| be its size

We have seen how to find all matches with a single symbol inO(n logn) time

Algorithm Summary

Construct Tσ and Pσ for each symbol σ in Σ

Compute (Tσ ⊗ Pσ) for each symbol σ in Σ

For every i, compute,

Ham(i) = m−∑σ∈Σ

(Tσ ⊗ Pσ)[i] .

mismatches = m− matches

(in the example Σ = {a, b, c, d} so |Σ| = 4)

It’s a small alphabet after all

Let Σ denote the set of alphabet symbols and |Σ| be its size

We have seen how to find all matches with a single symbol inO(n logn) time

Algorithm Summary

Construct Tσ and Pσ for each symbol σ in Σ

Compute (Tσ ⊗ Pσ) for each symbol σ in Σ

For every i, compute,

Ham(i) = m−∑σ∈Σ

(Tσ ⊗ Pσ)[i] .

(in the example Σ = {a, b, c, d} so |Σ| = 4)

It’s a small alphabet after all

Let Σ denote the set of alphabet symbols and |Σ| be its size

We have seen how to find all matches with a single symbol inO(n logn) time

Algorithm Summary

Construct Tσ and Pσ for each symbol σ in Σ

Compute (Tσ ⊗ Pσ) for each symbol σ in Σ

For every i, compute,

Ham(i) = m−∑σ∈Σ

(Tσ ⊗ Pσ)[i] .

(O(n|Σ|) time)

(in the example Σ = {a, b, c, d} so |Σ| = 4)

It’s a small alphabet after all

Let Σ denote the set of alphabet symbols and |Σ| be its size

We have seen how to find all matches with a single symbol inO(n logn) time

Algorithm Summary

Construct Tσ and Pσ for each symbol σ in Σ

Compute (Tσ ⊗ Pσ) for each symbol σ in Σ

For every i, compute,

Ham(i) = m−∑σ∈Σ

(Tσ ⊗ Pσ)[i] .

(O(n|Σ|) time)

(O(n|Σ| logn) time)

(in the example Σ = {a, b, c, d} so |Σ| = 4)

It’s a small alphabet after all

Let Σ denote the set of alphabet symbols and |Σ| be its size

We have seen how to find all matches with a single symbol inO(n logn) time

Algorithm Summary

Construct Tσ and Pσ for each symbol σ in Σ

Compute (Tσ ⊗ Pσ) for each symbol σ in Σ

For every i, compute,

Ham(i) = m−∑σ∈Σ

(Tσ ⊗ Pσ)[i] .

(O(n|Σ|) time)

(O(n|Σ| logn) time)

(O(n|Σ|) time)

(in the example Σ = {a, b, c, d} so |Σ| = 4)

It’s a small alphabet after all

Let Σ denote the set of alphabet symbols and |Σ| be its size

We have seen how to find all matches with a single symbol inO(n logn) time

Algorithm Summary

Construct Tσ and Pσ for each symbol σ in Σ

Compute (Tσ ⊗ Pσ) for each symbol σ in Σ

For every i, compute,

Ham(i) = m−∑σ∈Σ

(Tσ ⊗ Pσ)[i] .

(O(n|Σ|) time)

(O(n|Σ| logn) time)

(O(n|Σ|) time)

This takesO(n|Σ| logn) total time (and usesO(n) space)

(in the example Σ = {a, b, c, d} so |Σ| = 4)

It’s a small alphabet after all

Let Σ denote the set of alphabet symbols and |Σ| be its size

We have seen how to find all matches with a single symbol inO(n logn) time

Algorithm Summary

Construct Tσ and Pσ for each symbol σ in Σ

Compute (Tσ ⊗ Pσ) for each symbol σ in Σ

For every i, compute,

Ham(i) = m−∑σ∈Σ

(Tσ ⊗ Pσ)[i] .

(O(n|Σ|) time)

(O(n|Σ| logn) time)

(O(n|Σ|) time)

This takesO(n|Σ| logn) total time (and usesO(n) space)

However, |Σ| could be as big asm...

(in the example Σ = {a, b, c, d} so |Σ| = 4)

It’s a small alphabet after all

Let Σ denote the set of alphabet symbols and |Σ| be its size

We have seen how to find all matches with a single symbol inO(n logn) time

Algorithm Summary

Construct Tσ and Pσ for each symbol σ in Σ

Compute (Tσ ⊗ Pσ) for each symbol σ in Σ

For every i, compute,

Ham(i) = m−∑σ∈Σ

(Tσ ⊗ Pσ)[i] .

(O(n|Σ|) time)

(O(n|Σ| logn) time)

(O(n|Σ|) time)

This takesO(n|Σ| logn) total time (and usesO(n) space)

However, |Σ| could be as big asm...

in which case, this is worse than the naive method!

(in the example Σ = {a, b, c, d} so |Σ| = 4)

Coping with a large alphabet

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

Coping with a large alphabet

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

Coping with a large alphabet

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

Coping with a large alphabet

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

(√m = 3

)

Coping with a large alphabet

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

(√m = 3

)

Coping with a large alphabet

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequent

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

(√m = 3

)

Coping with a large alphabet

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

(√m = 3

)

Coping with a large alphabet

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

Key idea: Our algorithm will have two main stages:

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

(√m = 3

)

Coping with a large alphabet

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

Key idea: Our algorithm will have two main stages:

Stage 1 will count all the matches involving frequent symbols(at each alignment of P and T )

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

(√m = 3

)

Coping with a large alphabet

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

Key idea: Our algorithm will have two main stages:

Stage 1 will count all the matches involving frequent symbols

Stage 2 will count all the matches involving infrequent symbols

(at each alignment of P and T )

(at each alignment of P and T )

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

(√m = 3

)

Coping with a large alphabet

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

Key idea: Our algorithm will have two main stages:

Stage 1 will count all the matches involving frequent symbols

Stage 2 will count all the matches involving infrequent symbols

The total number of matches is the sum of the matches from Stage 1 and Stage 2

(at each alignment of P and T )

(at each alignment of P and T )

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

(√m = 3

)

Coping with a large alphabet

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

Key idea: Our algorithm will have two main stages:

Stage 1 will count all the matches involving frequent symbols

Stage 2 will count all the matches involving infrequent symbols

The total number of matches is the sum of the matches from Stage 1 and Stage 2

(at each alignment of P and T )

(at each alignment of P and T )

We will now see an algorithm which runs inO(n√m logn) timeregardless of the alphabet size

The frequent/infrequent symbols trick

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

The frequent/infrequent symbols trick

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

Stage 1: For each alignment i, count the number of matches involving frequent symbols:

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

The frequent/infrequent symbols trick

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

Stage 1: For each alignment i, count the number of matches involving frequent symbols:

Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

The frequent/infrequent symbols trick

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

Stage 1: For each alignment i, count the number of matches involving frequent symbols:

Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

inO(n logn) time (per symbol σ) using cross-correlations

The frequent/infrequent symbols trick

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

Stage 1: For each alignment i, count the number of matches involving frequent symbols:

Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)

How many frequent symbols can there be?

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

inO(n logn) time (per symbol σ) using cross-correlations

The frequent/infrequent symbols trick

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

Stage 1: For each alignment i, count the number of matches involving frequent symbols:

Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)

How many frequent symbols can there be?

Assume that there at least (√m+ 1) freq. symbols

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

inO(n logn) time (per symbol σ) using cross-correlations

The frequent/infrequent symbols trick

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

Stage 1: For each alignment i, count the number of matches involving frequent symbols:

Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)

How many frequent symbols can there be?

Assume that there at least (√m+ 1) freq. symbols

each occurs at least√m times. . .

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

inO(n logn) time (per symbol σ) using cross-correlations

The frequent/infrequent symbols trick

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

Stage 1: For each alignment i, count the number of matches involving frequent symbols:

Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)

How many frequent symbols can there be?

Assume that there at least (√m+ 1) freq. symbols

each occurs at least√m times. . . (

√m+ 1)

√m > m

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

inO(n logn) time (per symbol σ) using cross-correlations

The frequent/infrequent symbols trick

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

Stage 1: For each alignment i, count the number of matches involving frequent symbols:

Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)

How many frequent symbols can there be?

Assume that there at least (√m+ 1) freq. symbols

each occurs at least√m times. . . (

√m+ 1)

√m > m Contradiction!

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

inO(n logn) time (per symbol σ) using cross-correlations

The frequent/infrequent symbols trick

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

Stage 1: For each alignment i, count the number of matches involving frequent symbols:

Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)

How many frequent symbols can there be?

Assume that there at least (√m+ 1) freq. symbols

so there are at most√m frequent symbols

each occurs at least√m times. . . (

√m+ 1)

√m > m Contradiction!

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

inO(n logn) time (per symbol σ) using cross-correlations

The frequent/infrequent symbols trick

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

Stage 1: For each alignment i, count the number of matches involving frequent symbols:

Consider each frequent symbol σ ∈ Σ separately and compute (Tσ ⊗ Pσ)

How many frequent symbols can there be?

Assume that there at least (√m+ 1) freq. symbols

So Stage 1 takesO(n√m logn) time.

so there are at most√m frequent symbols

each occurs at least√m times. . . (

√m+ 1)

√m > m Contradiction!

P a d bc ab b da

0 1 2 3 4 5 6 7 8

m = 9

a is frequent , b is frequentc and d are infrequent

inO(n logn) time (per symbol σ) using cross-correlations

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

m = 9

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

m = 9

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 00

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 00

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 00

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 00

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

If T [k] is infrequent. . .

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 00

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

If T [k] is infrequent. . .

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 00

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

If T [k] is infrequent. . .

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 00

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

If T [k] is infrequent. . .

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 00

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 00

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 0

(except when (k − j) < 0)

0

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 0

(except when (k − j) < 0)

0

(k − j) < 0P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 0

(except when (k − j) < 0)

0

(k − j) < 0P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 0

(except when (k − j) < 0)

0

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 0

(except when (k − j) < 0)

0

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 0

(except when (k − j) < 0)

0

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 0

(except when (k − j) < 0)

0

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 0

(except when (k − j) < 0)

0

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 0

(except when (k − j) < 0)

0

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 0

(except when (k − j) < 0)

0

P a d bc ab b da

j = 4

k = 4

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 00 0 0 0 0 0

(except when (k − j) < 0)

k − j = 0

0

P a d bc ab b da

j = 4

k = 4

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0 0

(except when (k − j) < 0)

k − j = 0

P a d bc ab b da

j = 4

k = 4

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0 0

(except when (k − j) < 0)

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0 0

(except when (k − j) < 0)

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0 0

(except when (k − j) < 0)

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0 0

(except when (k − j) < 0)

P a d bc ab b da

k = 5

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0 0

(except when (k − j) < 0)

k = 5

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0 0

(except when (k − j) < 0)

k = 5

j = 4

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0 0

(except when (k − j) < 0)

k − j = 1

k = 5

j = 4

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0

(except when (k − j) < 0)

k − j = 1

1

k = 5

j = 4

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0

(except when (k − j) < 0)

1

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0

(except when (k − j) < 0)

1

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0

(except when (k − j) < 0)

1

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0

(except when (k − j) < 0)

1

k = 6

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0

(except when (k − j) < 0)

1

k = 6

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0

(except when (k − j) < 0)

1

k = 6

j = 4

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0 0

(except when (k − j) < 0)

1

k = 6

j = 4k − j = 2

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

k = 6

j = 4k − j = 2

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

k = 7

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

k = 7

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

k = 7

j = 8

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

(k − j) < 0

k = 7

j = 8

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

k = 7

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

k = 7

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

k = 7

j = 6

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

1 1

k = 7

j = 6k − j = 1

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

12

k = 7

j = 6k − j = 1

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

12

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

12

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

12

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

12

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

12

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

12

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

13

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

13

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

13

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 00 0 0 0

(except when (k − j) < 0)

13

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 0 0 0 0

(except when (k − j) < 0)

13 1

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 0 0 0 0

(except when (k − j) < 0)

13 1

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

What isA[i]?

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

What isA[i]?

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

What isA[i]?FactA[i] is the number of matches at

alignment i involving an infrequent symbol

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

How quick is Stage 2?

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

How quick is Stage 2?

O(n) time

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

store a list for each infrequent symbol

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

How quick is Stage 2?

O(n) time

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

store a list for each infrequent symbol

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

How quick is Stage 2?

(each list has length less than√m)

O(n) time

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

store a list for each infrequent symbol

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

How quick is Stage 2?

(each list has length less than√m)

O(n) time

P a d bc ab b da

O(n√m)

time

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

P a d bc ab b da

The infrequent/frequent symbols trick

aaa

Definition: A symbol is infrequent if it occurs fewer than√m times in P .

a is frequent , b is frequentc and d are infrequent

Every symbol is either

frequent or infrequent

T d b c c c d d c d c d ca

Stage 2: Count all matches involving infrequent symbols.

Make a single pass through T . . .

For each character T [k], (where 0 6 k < n)

For all j such that T [k] = P [j],

If T [k] is infrequent. . .

IncreaseA[k − j] by one

Construct an arrayA of length (n−m+ 1) - which initially contains all zeros

A 1 1 2 1 13 2 0

(except when (k − j) < 0)

O(n√m) total time

P a d bc ab b da

Pattern matching with mismatches: putting it all together

Algorithm summary

Pattern matching with mismatches: putting it all together

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Pattern matching with mismatches: putting it all together

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

Pattern matching with mismatches: putting it all together

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

(by alphabetically sorting the characters from P )

Pattern matching with mismatches: putting it all together

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Stage 1: Count all matches involving frequent symbols -O(n√m logn) time

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

(by alphabetically sorting the characters from P )

Pattern matching with mismatches: putting it all together

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Stage 1: Count all matches involving frequent symbols -O(n√m logn) time

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

Pattern matching with mismatches: putting it all together

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Stage 1: Count all matches involving frequent symbols -O(n√m logn) time

Stage 2: Count all matches involving infrequent symbols. -O(n√m) time

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

Pattern matching with mismatches: putting it all together

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Stage 1: Count all matches involving frequent symbols -O(n√m logn) time

Stage 2: Count all matches involving infrequent symbols. -O(n√m) time

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Pattern matching with mismatches: putting it all together

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Stage 1: Count all matches involving frequent symbols -O(n√m logn) time

Stage 2: Count all matches involving infrequent symbols. -O(n√m) time

at any alignment i the number of mismatches is justm minus the total number of matches

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Pattern matching with mismatches: putting it all together

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Overall, we obtain a time complexity of O(n√m logn).

Stage 1: Count all matches involving frequent symbols -O(n√m logn) time

Stage 2: Count all matches involving infrequent symbols. -O(n√m) time

at any alignment i the number of mismatches is justm minus the total number of matches

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Pattern matching with mismatches: putting it all together

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Overall, we obtain a time complexity of O(n√m logn).

Stage 1: Count all matches involving frequent symbols -O(n√m logn) time

Stage 2: Count all matches involving infrequent symbols. -O(n√m) time

at any alignment i the number of mismatches is justm minus the total number of matches

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Pattern matching with mismatches: putting it all together

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Overall, we obtain a time complexity of O(n√m logn).

Stage 1: Count all matches involving frequent symbols -O(n√m logn) time

Stage 2: Count all matches involving infrequent symbols. -O(n√m) time

at any alignment i the number of mismatches is justm minus the total number of matches

Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct countingNotice that Stage 1 takes

longer than Stage 2...

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

How long does each stage take now?

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time(this stage is unaffected - the time complexity doesn’t depend on f ))

How long does each stage take now?

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

How long does each stage take now?

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

Stage 1: Count all matches involving frequent symbols

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

How long does each stage take now?

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

Stage 1: Count all matches involving frequent symbols

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

How long does each stage take now?

As each frequent symbol occurs at least f times, there are at most mf frequent symbols

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

Stage 1: Count all matches involving frequent symbols

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

How long does each stage take now?

As each frequent symbol occurs at least f times, there are at most mf frequent symbols

and we do one cross-correlation for each frequent symbol. . .

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

How long does each stage take now?

As each frequent symbol occurs at least f times, there are at most mf frequent symbols

and we do one cross-correlation for each frequent symbol. . .

Stage 1: Count all matches involving frequent symbols -O(mf · n logn) time

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

How long does each stage take now?

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

How long does each stage take now?

Stage 2: Count all matches involving infrequent symbols.

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

How long does each stage take now?

Stage 2: Count all matches involving infrequent symbols.

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

We make a single pass through T . . .

and for each T [i] we update at most (f − 1) locations inA

Improving the Time Complexity 1 - balance the stages

Current Definition: An alphabet symbol is frequent if it occurs at least√m times in P .

and infrequent otherwise

What happens if we generalise this definition?

New Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

How long does each stage take now?

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Stage 2: Count all matches involving infrequent symbols. -O(nf) time

We make a single pass through T . . .

and for each T [i] we update at most (f − 1) locations inA

Pattern matching with mismatches: putting it all together

(Generalised) Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Stage 1: Count all matches involving frequent symbols -O(nmf logn) time

Stage 2: Count all matches involving infrequent symbols. -O(nf) time

at any alignment i the number of mismatches is justm minus the total number of matches

Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Pattern matching with mismatches: putting it all together

(Generalised) Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

What should we set f to?

Stage 1: Count all matches involving frequent symbols -O(nmf logn) time

Stage 2: Count all matches involving infrequent symbols. -O(nf) time

at any alignment i the number of mismatches is justm minus the total number of matches

Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Pattern matching with mismatches: putting it all together

(Generalised) Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

What should we set f to?

Stage 1: Count all matches involving frequent symbols -O(nmf logn) time

Stage 2: Count all matches involving infrequent symbols. -O(nf) time

at any alignment i the number of mismatches is justm minus the total number of matches

Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Let f =√m logn. . .

Pattern matching with mismatches: putting it all together

(Generalised) Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

What should we set f to?

Stage 1: Count all matches involving frequent symbols -O(nmf logn) time

Stage 2: Count all matches involving infrequent symbols. -O(nf) time

at any alignment i the number of mismatches is justm minus the total number of matches

Definition: An alphabet symbol is frequent if it occurs at least f times in P .and infrequent otherwise

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Let f =√m logn. . .

Pattern matching with mismatches: putting it all together

(Generalised) Algorithm summary

What should we set f to?

at any alignment i the number of mismatches is justm minus the total number of matches

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Let f =√m logn. . .

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Stage 1: Count all matches involving frequent symbols -O(n m√m logn

logn) time

Stage 2: Count all matches involving infrequent symbols. -O(n√m logn) time

Definition: An alphabet symbol is frequent if it occurs at least√m logn times in P .

and infrequent otherwise

Pattern matching with mismatches: putting it all together

(Generalised) Algorithm summary

What should we set f to?

at any alignment i the number of mismatches is justm minus the total number of matches

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Let f =√m logn. . .

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Stage 2: Count all matches involving infrequent symbols. -O(n√m logn) time

Definition: An alphabet symbol is frequent if it occurs at least√m logn times in P .

and infrequent otherwise

Stage 1: Count all matches involving frequent symbols -O(n√m logn) time

Pattern matching with mismatches: putting it all together

(Generalised) Algorithm summary

at any alignment i the number of mismatches is justm minus the total number of matches

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Stage 2: Count all matches involving infrequent symbols. -O(n√m logn) time

Definition: An alphabet symbol is frequent if it occurs at least√m logn times in P .

and infrequent otherwise

Stage 1: Count all matches involving frequent symbols -O(n√m logn) time

Pattern matching with mismatches: putting it all together

(Generalised) Algorithm summary

at any alignment i the number of mismatches is justm minus the total number of matches

(by alphabetically sorting the characters from P )

00 0

0 0

00 01

1

1 1 11 1 1

1

Matches with a single symbolcan be found using a cross-correlation

aaaT d b c c c d d c d c d ca

A 1 1 2 1 13 2 0

P a d bc ab b daMatches with an infrequent symbol

can be found by direct counting

Stage 0: Classify each symbol as frequent or infrequent -O(m logn) time

Stage 2: Count all matches involving infrequent symbols. -O(n√m logn) time

Definition: An alphabet symbol is frequent if it occurs at least√m logn times in P .

and infrequent otherwise

Stage 1: Count all matches involving frequent symbols -O(n√m logn) time

This improves the overall time complexity from O(n√m logn) toO(n

√m logn).

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

T

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

P

m

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

P

m

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

P

m

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

2m

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

P

m

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

the final substringmight be shorter

2m

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

P

m

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

2m

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

P

m

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

P

m

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m P

m

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m P

m

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m P

m

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m P

m

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m P

m

How long does running the previous algorithm take?

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m P

m

How long does running the previous algorithm take?

O(|Tk|√m log |Tk|) time.

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m P

m

How long does running the previous algorithm take?

O(|Tk|√m log |Tk|) time.

= O(m√m logm) time.

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m P

m

How long does running the previous algorithm take?

O(|Tk|√m log |Tk|) time.

= O(m√m logm) time.

We run the previous algorithmO( nm

)times so this process takesO(n

√m logm) time in total

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m

How long does running the previous algorithm take?

O(|Tk|√m log |Tk|) time.

= O(m√m logm) time.

We run the previous algorithmO( nm

)times so this process takesO(n

√m logm) time in total

P

m

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Split T intoO( nm

)contiguous 2m length substrings, T1, T2, T3 . . .

Run the previous algorithm once for with P and each Tk

2m

How long does running the previous algorithm take?

O(|Tk|√m log |Tk|) time.

= O(m√m logm) time.

We run the previous algorithmO( nm

)times so this process takesO(n

√m logm) time in total

P

m

what about this alignment?

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Run the previous algorithm once for with P and each Tk

2m

How long does running the previous algorithm take?

O(|Tk|√m log |Tk|) time.

= O(m√m logm) time.

We run the previous algorithmO( nm

)times so this process takesO(n

√m logm) time in total

P

m

what about this alignment?

Split T intoO( nm

)overlapping 2m length substrings, T1, T2, T3 . . .

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Run the previous algorithm once for with P and each Tk

2m

How long does running the previous algorithm take?

O(|Tk|√m log |Tk|) time.

= O(m√m logm) time.

We run the previous algorithmO( nm

)times so this process takesO(n

√m logm) time in total

P

m

what about this alignment?

Split T intoO( nm

)overlapping 2m length substrings, T1, T2, T3 . . .

T1 T3 T5 T7 T9 T11 T13 T15 T17 T19 T21

m 2mT2 T4 T6 T8 T10 T12 T14 T16 T18 T20

Improving the time complexity 2 - split the text

We have just seen an algorithm which takesO(n√m logn) time.

Imagine that n is a lot bigger thanm. . .

T

Run the previous algorithm once for with P and each Tk

2m

How long does running the previous algorithm take?

O(|Tk|√m log |Tk|) time.

= O(m√m logm) time.

We run the previous algorithmO( nm

)times so this process takesO(n

√m logm) time in total

P

m

Split T intoO( nm

)overlapping 2m length substrings, T1, T2, T3 . . .

T1 T3 T5 T7 T9 T11 T13 T15 T17 T19 T21

m 2mT2 T4 T6 T8 T10 T12 T14 T16 T18 T20

Conclusion

T

Input: A text string T (length n) and a pattern string P (lengthm)

P

ba b c a a d ad a

Goal: For every alignment i, output

(the Hamming distance is the number of mismatches)

c a a

A naive algorithm for this problem takesO(nm) time

0 1 2 3 4 5 6 7 8 9 10 11 12

n

a b d

m

aHam(8) = 3

Ham(i), the Hamming distance between P and T [i . . . i+m− 1]

We have seen two alternative algorithms:

One algorithm takesO(n|Σ| logn) time (where |Σ| is the alphabet size)

The other algorithm takesO(n√m logn) time (regardless of the alphabet size)

and can be improved toO(n√m logm)

(by changing the freq./infreq. cut off and splitting the text)

and can be improved toO(n|Σ| logm) (by splitting the text)