suffix trees

113
Suffix trees

Upload: savea

Post on 24-Jan-2016

22 views

Category:

Documents


0 download

DESCRIPTION

Suffix trees. Trie. A tree representing a set of strings. c. a. { aeef ad bbfe bbfg c }. b. e. b. d. e. f. c. f. e. g. Trie (Cont). Assume no string is a prefix of another. c. Each edge is labeled by a letter, - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Suffix trees

Suffix trees

Page 2: Suffix trees

Trie

• A tree representing a set of strings.

a

c

b

c

e

e

f

d b

f

e g

{ aeef ad bbfe bbfg c }

Page 3: Suffix trees

Trie (Cont)

• Assume no string is a prefix of another

a

c

b

c

e

e

f

d b

f

e g

Each edge is labeled by a letter,no two edges outgoing from the same node are labeled the same.

Each string corresponds to a leaf.

Page 4: Suffix trees

Compressed Trie

• Compress unary nodes, label edges by strings

a

c

b

c

e

e

f

d b

f

e g

a

c

bbf

c

eefd

e g

Page 5: Suffix trees

Suffix tree

Given a string s a suffix tree of s is a compressed trie of all suffixes of s

To make these suffixes prefix-free we add a special character, say $, at the end of s

Page 6: Suffix trees

Suffix tree (Example)

Let s=abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$

{ $ b$ ab$ bab$ abab$ }

ab

ab

$

ab

$

b

$

$

$

Page 7: Suffix trees

Trivial algorithm to build a Suffix tree

Put the largest suffix in

Put the suffix bab$ in

abab$

abab

$

ab$

b

Page 8: Suffix trees

Put the suffix ab$ in

ab

ab

$

ab$

b

ab

ab

$

ab$

b

$

Page 9: Suffix trees

Put the suffix b$ in

ab

ab

$

ab$

b

$

ab

ab

$

ab$

b

$

$

Page 10: Suffix trees

Put the suffix $ in

ab

ab

$

ab$

b

$

$

ab

ab

$

ab$

b

$

$

$

Page 11: Suffix trees

We will also label each leaf with the starting point of the corres. suffix.

ab

ab

$

ab$

b

$

$

$

12

ab

ab

$

ab

$

b

3

$ 4

$

5

$

Page 12: Suffix trees

Analysis

Takes O(n2) time to build.

We will see how to do it in O(n) time

Page 13: Suffix trees

What can we do with it ?

Exact string matching:Given a Text T, |T| = n, preprocess it

such that when a pattern P, |P|=m, arrives you can quickly decide when it occurs in T.

We may also want to find all occurrences of P in T

Page 14: Suffix trees

Exact string matchingIn preprocessing we just build a suffix tree in O(n) time

12

ab

ab

$

ab$

b

3

$ 4

$

5

$

Given a pattern P = ab we traverse the tree according to the pattern.

Page 15: Suffix trees

12

ab

ab

$

ab$

b

3

$ 4

$

5

$

If we did not get stuck traversing the pattern then the pattern occurs in the text.

Each leaf in the subtree below the node we reach corresponds to an occurrence.

By traversing this subtree we get all k occurrences in O(n+k) time

Page 16: Suffix trees

Generalized suffix tree

Given a set of strings S a generalized suffix tree of S is a compressed trie of all suffixes of s S

To associate each suffix with a unique string in S add a different special char to each s

Page 17: Suffix trees

Generalized suffix tree (Example)

Let s1=abab and s2=aab here is a generalized suffix tree for s1 and s2

{ $ # b$ b# ab$ ab# bab$ aab# abab$ }

1

2

a

b

ab

$

ab$

b

3

$

4

$

5

$

1

b#

a

2

#

3

#

4

#

Page 18: Suffix trees

So what can we do with it ?

Matching a pattern against a database of strings

Page 19: Suffix trees

Longest common substring (of two strings)Every node with a leaf

descendant from string s1

and a leaf descendant

from string s2

represents a maximal common substring and vice versa.

1

2

a

b

ab

$

ab$

b

3

$

4

$

5

$

1

b#

a

2

#

3

#

4

#

Find such node with largest “string depth”

Page 20: Suffix trees

Lowest common ancetorsA lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

Page 21: Suffix trees

Why?The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes

1

2

a

b

ab

$

ab$

b

3

$

4

$

5

$

1

b#

a

2

#

3

#

4

#

Page 22: Suffix trees

Finding maximal palindromes

• A palindrome: caabaac, cbaabc• Want to find all maximal palindromes in a

string s

Let s = cbaaba

The maximal palindrome with center between i-1 and i is the LCP of the suffix at position i of s and the suffix at position m-i+1 of sr

Page 23: Suffix trees

Maximal palindromes algorithm

Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc#

For every i find the LCA of suffix i of s and suffix m-i+1 of sr

Page 24: Suffix trees

3

a

a b

a

baaba$

b

3

$

7

$

b

7

#

c

1

6a b

c #

5

2 2

a $

c #

a

5

6

$

4

4

1

c #

a $

$abc #

c #

Let s = cbaaba$ then sr = abaabc#

Page 25: Suffix trees

Analysis

O(n) time to identify all palindromes

Page 26: Suffix trees

Can we construct a suffix tree in linear time ?

Page 27: Suffix trees

Ukkonen’s linear time construction

ACTAATC

A

1

A

Page 28: Suffix trees

ACTAATC

A

1

AC

Page 29: Suffix trees

ACTAATC

AC

1

AC

Page 30: Suffix trees

ACTAATC

AC

1

AC

2

C

Page 31: Suffix trees

ACTAATC

AC

1

ACT

2

C

Page 32: Suffix trees

ACTAATC

ACT

1

ACT

2

C

Page 33: Suffix trees

ACTAATC

ACT

1

ACT

2

CT

Page 34: Suffix trees

ACTAATC

ACT

1

ACT

2

CT

3

T

Page 35: Suffix trees

ACTAATC

ACT

1

ACTA

2

CT

3

T

Page 36: Suffix trees

ACTAATC

ACTA

1

ACTA

2

CT

3

T

Page 37: Suffix trees

ACTAATC

ACTA

1

ACTA

2

CTA

3

T

Page 38: Suffix trees

ACTAATC

ACTA

1

ACTA

2

CTA

3

TA

Page 39: Suffix trees

ACTAATC

ACTA

1

ACTAA

2

CTA

3

TA

Page 40: Suffix trees

ACTAATC

1

ACTAA

2

CTA

3

TAACTAA

Page 41: Suffix trees

ACTAATC

ACTAA

1

ACTAA

2

CTAA

3

TA

Page 42: Suffix trees

ACTAATC

ACTAA

1

ACTAA

2

CTAA

3

TAA

Page 43: Suffix trees

ACTAATC

CTAA

1

ACTAA

2

CTAA

3

TAAA

4

A

Page 44: Suffix trees

Phases & extensions

• Phase i is when we add character i

i

• In phase i we have i extensions of suffixes

Page 45: Suffix trees

ACTAATC

CTAA

1

ACTAAT

2

CTAA

3

TAAA

4

A

Page 46: Suffix trees

ACTAATC

CTAAT

1

ACTAAT

2

CTAA

3

TAAA

4

A

Page 47: Suffix trees

ACTAATC

CTAAT

1

ACTAAT

2

CTAAT

3

TAAA

4

A

Page 48: Suffix trees

ACTAATC

CTAAT

1

ACTAAT

2

CTAAT

3

TAATA

4

A

Page 49: Suffix trees

ACTAATC

CTAAT

1

ACTAAT

2

CTAAT

3

TAATA

4

AT

Page 50: Suffix trees

ACTAATC

CTAAT

1

ACTAAT

2

CTAAT

3

TAATA

4

AT

5

T

Page 51: Suffix trees

Extension rules

• Rule 1: The suffix ends at a leaf, you add a character on the edge entering the leaf

• Rule 2: The suffix ended internally and the extended suffix does not exist, you add a leaf and possibly an internal node

• Rule 3: The suffix exists and the extended suffix exists, you do nothing

Page 52: Suffix trees

ACTAATC

1

ACTAATC

2

CTAAT

3

TAATA

4

AT

5

TCTAAT

Page 53: Suffix trees

ACTAATC

CTAATC

1

ACTAATC

2

CTAAT

3

TAATA

4

AT

5

T

Page 54: Suffix trees

ACTAATC

CTAATC

1

ACTAATC

2

CTAATC

3

TAATA

4

AT

5

T

Page 55: Suffix trees

ACTAATC

CTAATC

1

ACTAATC

2

CTAATC

3

TAATCA

4

AT

5

T

Page 56: Suffix trees

ACTAATC

CTAATC

1

ACTAATC

2

CTAATC

3

TAATCA

4

ATC

5

T

Page 57: Suffix trees

ACTAATC

CTAATC

1

ACTAATC

2

CTAATC

3

TAATCA

4

ATC

5

TC

Page 58: Suffix trees

ACTAATC

CTAATC

1

ACTAATC

2

CTAATC

3

TA

4

ATC

5

TCAATC

6

C

Page 59: Suffix trees

1 2

C

3

TA

4

ATCAC

5

TCAC

AATCAC

6

CAC

ACTAATCAC

TAATCAC

7

AC

CTAATCAC

Skip forward..

Page 60: Suffix trees

1 2

C

3

TA

4

ATCAC

5

TCAC

AATCAC

6

CAC

ACTAATCACT

TAATCAC

7

AC

CTAATCAC

Page 61: Suffix trees

1 2

C

3

TA

4

ATCAC

5

TCAC

AATCAC

6

CAC

ACTAATCACT

TAATCAC

7

AC

CTAATCACT

Page 62: Suffix trees

1 2

C

3

TA

4

ATCAC

5

TCAC

AATCAC

6

CAC

ACTAATCACT

TAATCACT

7

AC

CTAATCACT

Page 63: Suffix trees

1 2

C

3

TA

4

ATCAC

5

TCAC

AATCACT

6

CAC

ACTAATCACT

TAATCACT

7

AC

CTAATCACT

Page 64: Suffix trees

1 2

C

3

TA

4

ATCACT

5

TCAC

AATCACT

6

CAC

ACTAATCACT

TAATCACT

7

AC

CTAATCACT

Page 65: Suffix trees

1 2

C

3

TA

4

ATCACT

5

TCACT

AATCACT

6

CAC

ACTAATCACT

TAATCACT

7

AC

CTAATCACT

Page 66: Suffix trees

1 2

C

3

TA

4

ATCACT

5

TCACT

AATCACT

6

CACT

ACTAATCACT

TAATCACT

7

AC

CTAATCACT

Page 67: Suffix trees

1 2

C

3

TA

4

ATCACT

5

TCACT

AATCACT

6

CACT

ACTAATCACT

TAATCACT

7

ACT

CTAATCACT

Page 68: Suffix trees

1 2

C

3

TA

4

ATCACT

5

TCACT

AATCACT

6

CACT

ACTAATCACTG

TAATCACT

7

ACT

CTAATCACT

Page 69: Suffix trees

1 2

C

3

TA

4

ATCACT

5

TCACT

AATCACT

6

CACT

ACTAATCACTG

TAATCACT

7

ACT

CTAATCACTG

Page 70: Suffix trees

1 2

C

3

TA

4

ATCACT

5

TCACT

AATCACT

6

CACT

ACTAATCACTG

TAATCACTG

7

ACT

CTAATCACTG

Page 71: Suffix trees

1 2

C

3

TA

4

ATCACT

5

TCACT

AATCACTG

6

CACT

ACTAATCACTG

TAATCACTG

7

ACT

CTAATCACTG

Page 72: Suffix trees

1 2

C

3

TA

4

ATCACTG

5

TCACT

AATCACTG

6

CACT

ACTAATCACTG

TAATCACTG

7

ACT

CTAATCACTG

Page 73: Suffix trees

1 2

C

3

TA

4

ATCACTG

5

TCACTG

AATCACTG

6

CACT

ACTAATCACTG

TAATCACTG

7

ACT

CTAATCACTG

Page 74: Suffix trees

1 2

C

3

TA

4

ATCACTG

5

TCACTG

AATCACTG

6

CACTG

ACTAATCACTG

TAATCACTG

7

ACT

CTAATCACTG

Page 75: Suffix trees

1 2

C

3

TA

4

ATCACTG

5

TCACTG

AATCACTG

6

CACTG

ACTAATCACTG

TAATCACTG

7

ACTG

CTAATCACTG

Page 76: Suffix trees

CT

1

C

3

TA

4

ATCACTG

5

TCACTG

AATCACTG

6

CACTG

ACTAATCACTG

7

ACTG

AATCACTG

8

G

2

TAATCACTG

Page 77: Suffix trees

CT

1 2

C

3

TA

4

ATCACTG

5

TCACTG

AATCACTG

6

CACTG

ACTAATCACTG

T

7

ACTG

AATCACTG

8

GAATCACTG

9

G

Page 78: Suffix trees

CT

1 2

C

3

TA

4

ATCACTG

5

TCACTG

AATCACTG

6

CACTG

ACTAATCACTG

T

7

ACTG

AATCACTG

8

GAATCACTG

9

G

10

G

11

G

Page 79: Suffix trees

Observations

At the first extension we must end at a leaf because no longer suffix exists (rule 1)

i

At the second extension we still most likely to end at a leaf.

i

We will not end at a leaf only if the second suffix is a prefix of the first

Page 80: Suffix trees

Say at some extension we do not end at a leaf

i

Is there a way to continue using ith character ?

(Is it a prefix of a suffix where the next character is the ith character ?)

Rule 3 Rule 2

Then this suffix is a prefix of some other suffix (suffixes)We will not end at a leaf in subsequent extensions

Page 81: Suffix trees

Rule 3 Rule 2

If we apply rule 3 then in all subsequent extensions we will apply rule 3

Otherwise we keep applying rule 2 until in some subsequent extensions we will apply rule 3

Rule 3

Page 82: Suffix trees

In terms of the rules that we apply a phase looks like:

1 1 1 1 1 1 1 2 2 2 2 3 3 3 3

We have nothing to do when applying rule 3, so once rule 3 happens we can stop

We don’t really do anything significant when we apply rule 1 (the structure of the tree does not change)

Page 83: Suffix trees

Representation

• We do not really store a substring with each edge, but rather pointers into the starting position and ending position of the substring in the text

• With this representaion we do not really have to do anything when rule 1 applies

Page 84: Suffix trees

How do phases relate to each other

1 1 1 1 1 1 1 2 2 2 2 3 3 3 3i

The next phase we must have:

1 1 1 1 1 1 1 1 1 1 1 2/3

So we start the phase with the extension that was the first where we applied rule 3 in the previous phase

Page 85: Suffix trees

Suffix Links

CT

1 2

C

3

TA

4

ATCACTG

5

TCACTG

AATCACTG

6

CACTG

ACTAATCACTG

T

7

ACTG

AATCACTG

8

GAATCACTG

9

G

10

G

11

G

Page 86: Suffix trees

CT

1 2

C

3

TA

4

ATCACTG

5

TCACTG

AATCACTG

6

CACTG

ACTAATCACTG

T

7

ACTG

AATCACTG

8

GAATCACTG

9

G

10

G

11

G

Page 87: Suffix trees

CT

1 2

C

3

TA

4

ATCACTG

5

TCACTG

AATCACTG

6

CACTG

ACTAATCACTG

T

7

ACTG

AATCACTG

8

GAATCACTG

9

G

10

G

11

G

Page 88: Suffix trees

CT

1 2

C

3

TA

4

ATCACTG

5

TCACTG

AATCACTG

6

CACTG

ACTAATCACTG

T

7

ACTG

AATCACTG

8

GAATCACTG

9

G

10

G

11

G

Page 89: Suffix trees

CT

1 2

C

3

TA

4

ATCACTG

5

TCACTG

AATCACTG

6

CACTG

ACTAATCACTG

T

7

ACTG

AATCACTG

8

GAATCACTG

9

G

10

G

11

G

Page 90: Suffix trees

Suffix Links

• From an internal node that corresponds to the string aβ to the internal node that corresponds to β (if there is such node)

aββ

Page 91: Suffix trees

• Is there such a node ?

v

aβSuppose we create v applying rule 2. Then there was a suffix aβx… and now we add aβy

So there was a suffix βx… x..

β

y

Page 92: Suffix trees

• Is there such a node ?

v

aβSuppose we create v applying rule 2. Then there was a suffix aβx… and now we add aβy

So there was a suffix βx… x..

β

If there was also a suffix βz…

Then a node corresponding to β is there

y x.. z..

Page 93: Suffix trees

• Is there such a node ?

v

aβSuppose we create v applying rule 2. Then there was a suffix aβx… and now we add aβy

So there was a suffix βx… x..

β

If there was also a suffix βz…

Then a node corresponding to β is there

y x.. y

Otherwise it will be created in the next extension when we add βy

Page 94: Suffix trees

Inv: All suffix links are there except (possibly) of the last internal node added

You are at the (internal) node corresponding to the last extension

You start a phase at the last internal node of the first extension in which you applied rule 3 in the previous iteration

i

Remember: we apply rule 2

i

Page 95: Suffix trees

1) Go up one node (if needed) to find a suffix link2) Traverse the suffix link

3) If you went up in step 1 along an edge that was labeled δ then go down consuming a string δ

Page 96: Suffix trees

v

δ

x..

δ

y

β

Create the new internal node if necessary

Page 97: Suffix trees

v

δ

x..

δ

y

β

Create the new internal node if necessary

Page 98: Suffix trees

v

δ

x..

δ

y

β

Create the new internal node if necessary, add the suffix

y

Page 99: Suffix trees

v

δ

x..

δ

y

β

Create the new internal node if necessary, add the suffix and install a suffix link if necessary

y

Page 100: Suffix trees

AnalysisHandling all extensions of rule 1 and all extensions of rule 3 per phase take O(1) time O(n) total

How many times do we carry out rule 2 in all phases ?O(n)

Does each application of rule 2 takes constant time ?No ! (going up and traversing the suffix link takes constant time, but then we go down possibly on many edges..)

Page 101: Suffix trees

v

δ

x..

δ

y

β

y

Page 102: Suffix trees

So why is it a linear time algorithm ?

How much can the depth change when we traverse a suffix link ?

It can decrease by at most 1

Page 103: Suffix trees

Punch lineEach time we go up or traverse a suffix link the depth decreases by at most 1

When starting the depth is 0, final depth is at most n

So during all applications of rule 2 together we cannot go down more than 3n times

THM: The running time of Ukkonen’s algorithm is O(n)

Page 104: Suffix trees

Another application

• Suppose we have a pattern P and a text T and we want to find for each position of T the length of the longest substring of P that matches there.

• How would you do that in O(n) time ?

Page 105: Suffix trees
Page 106: Suffix trees

Drawbacks of suffix trees

• Suffix trees consume a lot of space

• It is O(n) but the constant is quite big

• Notice that if we indeed want to traverse an edge in O(1) time then we need an array of ptrs. of size |Σ| in each node

Page 107: Suffix trees

Suffix array

• We loose some of the functionality but we save space.

Let s = ababSort the suffixes lexicographically: ab, abab, b, bab

The suffix array gives the indices of the suffixes in sorted order

3 1 4 2

Page 108: Suffix trees

How do we build it ?

• Build a suffix tree• Traverse the tree in DFS,

lexicographically picking edges outgoing from each node and fill the suffix array.

• O(n) time

Page 109: Suffix trees

How do we search for a pattern ?

• If P occurs in T then all its occurrences are consecutive in the suffix array.

• Do a binary search on the suffix array

• Takes O(mlogn) time

Page 110: Suffix trees

ExampleLet S = mississippi

iippiissippiississippimississippipi

8

5

2

1

10

9

7

4

11

6

3

ppisippisisippissippississippi

L

R

Let P = issa

M

Page 111: Suffix trees

How do we accelerate the search ?

L

R

Maintain l = LCP(P,L)Maintain r = LCP(P,R)

M

If l = r then start comparing M to P at l + 1

r

Page 112: Suffix trees

How do we accelerate the search ?

L

R

Suppose we know LCP(L,M)

If LCP(L,M) < l we go left

If LCP(L,M) > l we go right

If LCP(L,M) = l we start comparing at l + 1

M

If l > r then

r

Page 113: Suffix trees

Analysis of the acceleration

If we do more than a single comparison in an iteration then max(l, r ) grows by 1 for each comparison O(logn + m) time