Suffix trees

Upload: arlo

Post on 22-Feb-2016


Page 1: Suffix trees

Suffix trees

Page 2: Suffix trees

Trie

• A tree representing a set of strings.

{ aeef, ad, bbfe, bbfg, c }

[Figure: trie for this set. The root has edges a, b and c; each root-to-leaf path spells one string of the set.]

Page 3: Suffix trees

Trie (Cont)

• Assume no string is a prefix of another.

[Figure: the same trie for { aeef, ad, bbfe, bbfg, c }.]

1) Each edge is labeled by a letter,

2) No two edges outgoing from the same node are labeled the same.

3) Each string corresponds to a leaf.
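The three properties above can be illustrated with a minimal sketch. It assumes a nested-dictionary representation (one dictionary per node, one key per outgoing edge letter); the names `build_trie` and `count_leaves` are illustrative and do not come from the slides.

```python
def build_trie(words):
    """Build a trie as nested dicts: each key is the letter on an outgoing edge."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            # setdefault keeps at most one edge per letter at each node
            node = node.setdefault(letter, {})
    return root


def count_leaves(node):
    """Count leaves; with a prefix-free set this equals the number of strings."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(sorted(trie))        # letters on the edges leaving the root
print(count_leaves(trie))  # one leaf per string
```

Because dictionary keys are unique, property 2 holds by construction; property 3 relies on the prefix-free assumption.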

Page 4: Suffix trees

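Compressed Trie

• Compress unary nodes, label edges by strings.

[Figure: the trie for { aeef, ad, bbfe, bbfg, c } before and after compression. After compression the root has edges a, bbf and c; below a the edges are eef and d; below bbf the edges are e and g.]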

Page 5: Suffix trees

Suffix tree

Given a string s, a suffix tree of s is a compressed trie of all suffixes of s.

To make these suffixes prefix-free we add a special character, say $, at the end of s.

Page 6: Suffix trees

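Suffix tree (Example)

Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:

{ $, b$, ab$, bab$, abab$ }

[Figure: suffix tree of abab$. The root has edges ab, b and $. Below ab the edges are ab$ and $; below b the edges are ab$ and $.]

Note that a suffix tree has O(n) nodes, where n = |s| (why? Hint: there is one leaf per suffix, and every internal node has at least two children).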

Page 7: Suffix trees

Trivial algorithm to build a suffix tree

Put the longest suffix in: the root has a single edge labeled abab$.

Put the suffix bab$ in: the root now has two edges, abab$ and bab$.

Page 8: Suffix trees

Put the suffix ab$ in

[Figure: the edge abab$ is split after ab. The new internal node has edges ab$ and $; the root edges are now ab and bab$.]

Page 9: Suffix trees

Put the suffix b$ in

[Figure: the edge bab$ is split after b. The new internal node has edges ab$ and $; the root edges are now ab and b.]

Page 10: Suffix trees

Put the suffix $ in

[Figure: a new edge labeled $ is added at the root. The root edges are now ab, b and $.]

Page 11: Suffix trees

We will also label each leaf with the starting point of the corresponding suffix.

[Figure: the suffix tree of abab$ with labeled leaves: abab$ is leaf 1, bab$ is leaf 2, ab$ is leaf 3, b$ is leaf 4, $ is leaf 5.]

Page 12: Suffix trees

Analysis

The trivial algorithm takes O(n^2) time to build the tree.

It is possible to construct it in O(n) time.

But how come? Does it even take O(n) space?
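The trivial insertion algorithm from the preceding slides can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, the function names and the choice to store each edge label as an explicit string are hypothetical, and the input is assumed not to contain the terminator `$`. It runs in O(n^2) time, not the O(n) time mentioned above.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character of edge label -> (label, child)
        self.leaf = leaf    # 1-based start of the suffix (leaves only)


def build_suffix_tree(s, end="$"):
    """Insert the suffixes of s + end one by one, longest first."""
    text = s + end
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:
                # no edge starts with this character: hang a new leaf here
                node.children[rest[0]] = (rest, Node(i + 1))
                break
            label, child = entry
            k = 0
            while k < len(label) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # the whole edge matches: continue below it
                node, rest = child, rest[k:]
                continue
            # mismatch inside the edge: split it after k characters
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            mid.children[rest[k]] = (rest[k:], Node(i + 1))
            node.children[rest[0]] = (label[:k], mid)
            break
    return root


def leaves(node):
    """Sorted leaf labels in the subtree of node."""
    if node.leaf is not None:
        return [node.leaf]
    return sorted(x for _, child in node.children.values() for x in leaves(child))


root = build_suffix_tree("abab")
print(sorted(root.children))   # first characters of the root edges
print(leaves(root))            # one leaf per suffix of abab$
```

The unique terminator guarantees that an inserted suffix always leaves the existing tree at some point, which is why the inner comparison never runs past the end of `rest`.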

Page 13: Suffix trees

Linear space?

• Consider the string aaaaaabbbbbb

[Figure: suffix tree of aaaaaabbbbbb$ with the edge labels written out in full. The label bbbbbb$ is repeated on many edges, so writing the labels out can take more than O(n) space.]

Page 14: Suffix trees

To use only O(n) space encode the edge-labels

• Consider the string aaaaaabbbbbb

[Figure: the same tree; one edge label bbbbbb$ is replaced by the pair of positions (7,13).]

Page 15: Suffix trees

To use only O(n) space encode the edge-labels

• Consider the string aaaaaabbbbbb

[Figure: the same tree with every edge label replaced by a pair of positions in aaaaaabbbbbb$: (1,1) for a, (7,7) for b, (13,13) for $, (12,13) for b$, (7,13) for bbbbbb$ and (6,13) for abbbbbb$.]
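The encoding of edge labels as position pairs can be sketched in a few lines. The helper names `encode` and `decode` are hypothetical; positions are 1-based and inclusive, as in the figure, and `encode` simply picks the first occurrence of the label.

```python
TEXT = "aaaaaabbbbbb$"


def decode(text, edge):
    """Recover the label string from a (start, end) pair, 1-based inclusive."""
    start, end = edge
    return text[start - 1:end]


def encode(text, label):
    """Return one (start, end) pair describing label; uses its first occurrence."""
    start = text.index(label) + 1
    return (start, start + len(label) - 1)


print(encode(TEXT, "bbbbbb$"))
print(decode(TEXT, (6, 13)))
```

Each edge now costs two integers regardless of how long its label is, so a tree with O(n) edges takes O(n) space on top of the string itself.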

Page 16: Suffix trees

What can we do with it?

Exact string matching: Given a text T, |T| = n, preprocess it such that when a pattern P, |P| = m, arrives you can quickly decide if it occurs in T.

We may also want to find all occurrences of P in T.

Page 17: Suffix trees

Exact string matching

In preprocessing we just build a suffix tree in O(n) time.

[Figure: suffix tree of abab$ with leaves labeled 1 to 5.]

Given a pattern P = ab, we traverse the tree according to the pattern.

Page 18: Suffix trees

[Figure: the same suffix tree of abab$. The traversal of P = ab ends at the node below the root edge ab, whose subtree contains leaves 1 and 3.]

If we did not get stuck traversing the pattern, then the pattern occurs in the text.

Each leaf in the subtree below the node we reach corresponds to an occurrence.

By traversing this subtree we get all k occurrences in O(m+k) time.
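The traversal described above can be sketched as follows. To stay short, this sketch uses an uncompressed suffix trie stored as nested dictionaries instead of a real suffix tree, so it takes O(n^2) space and the subtree walk is not O(k); the names `build_suffix_trie` and `occurrences` are illustrative, and the text is assumed not to contain `$`.

```python
def build_suffix_trie(text):
    """Uncompressed trie of all suffixes of text + '$'.

    A leaf stores the 1-based start of its suffix under the key "leaf".
    """
    text += "$"
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1
    return root


def occurrences(root, pattern):
    """Walk down along pattern, then collect every leaf below that point."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []          # got stuck: the pattern does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abab")
print(occurrences(trie, "ab"))
```

For T = abab and P = ab the walk ends above leaves 1 and 3, matching the figure.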

Page 19: Suffix trees

Generalized suffix tree

Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.

To associate each suffix with a unique string in S, add a different special character to the end of each s.

Page 20: Suffix trees

Generalized suffix tree (Example)

Let s1 = abab and s2 = aab. Here is a generalized suffix tree for s1 and s2:

{ $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ }

[Figure: generalized suffix tree of abab$ and aab#. Each leaf is labeled with the starting position of its suffix in s1 or s2.]

Page 21: Suffix trees

So what can we do with it?

Matching a pattern against a database of strings.

Page 22: Suffix trees

Longest common substring (of two strings)

Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a common substring.

A maximal common substring corresponds to such a node.

[Figure: the generalized suffix tree of abab$ and aab#.]

Find such a node with the largest “string depth”.
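A rough sketch of this idea, under simplifying assumptions: it builds an uncompressed generalized suffix trie (O(n^2) nodes) rather than a compressed tree, marks every node with the set of strings that have a suffix passing through it, and returns the deepest node marked by both. The function name and the node layout are hypothetical, and the inputs are assumed not to contain `$` or `#`.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node reached by suffixes of both s1$ and s2#."""
    root = {"kids": {}, "owners": set()}
    for owner, text in ((1, s1 + "$"), (2, s2 + "#")):
        for i in range(len(text)):
            node = root
            for ch in text[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["kids"].items():
            # only descend while both strings have a leaf below this node
            if child["owners"] == {1, 2}:
                stack.append((child, path + ch))
                if len(path) + 1 > len(best):
                    best = path + ch
    return best


print(longest_common_substring("abab", "aab"))
```

Because the two terminators differ, no node reached through `$` or `#` is ever shared, so the result contains only characters of the original strings.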

Page 23: Suffix trees

Lowest common ancestors

A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.

Page 24: Suffix trees

Why?

The LCA of two leaves represents the longest common prefix (LCP) of these two suffixes.

[Figure: the generalized suffix tree of abab$ and aab#.]

Page 25: Suffix trees

Lowest common ancestors

Page 26: Suffix trees

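Write an Euler tour of the tree

[Figure: a tree with root 3. The children of 3 are 2, 1 and 5; the children of 1 are 4 and 7; the child of 5 is 6.]

Position:    0  1  2  3  4  5  6  7  8  9 10 11 12
Euler tour:  3  2  3  1  4  1  7  1  3  5  6  5  3

LCA(1,5) = 3: the shallowest node on the tour between 1 and 5.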

Page 27: Suffix trees

Position:  0  1  2  3  4  5  6  7  8  9 10 11 12
Node id:   3  2  3  1  4  1  7  1  3  5  6  5  3
Level:     0  1  0  1  2  1  2  1  0  1  2  1  0

The LCA is the node at the minimum level between the two occurrences.

Page 28: Suffix trees

Range minimum

Position:  0  1  2  3  4  5  6  7  8  9 10 11 12
Node id:   3  2  3  1  4  1  7  1  3  5  6  5  3
Level:     0  1  0  1  2  1  2  1  0  1  2  1  0

Preprocess an array such that, given i and j, you can find the minimum in [i,j] fast.

The reduction from LCA to range minimum takes linear time.
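The reduction from LCA to range minimum might look like the sketch below. The tree shape used in the example (root 3 with children 2, 1 and 5, and so on) is read off the garbled figure and should be treated as an assumption; the function names are illustrative. The range minimum here is found by a plain scan, so a query costs O(n); any faster RMQ structure could be swapped in.

```python
def euler_tour(children, root):
    """Return (tour, level, first): node ids, their depths, first tour index per node."""
    tour, level, first = [], [], {}

    def visit(node, depth):
        first.setdefault(node, len(tour))
        tour.append(node)
        level.append(depth)
        for child in children.get(node, []):
            visit(child, depth + 1)
            # come back to the parent after each child
            tour.append(node)
            level.append(depth)

    visit(root, 0)
    return tour, level, first


def lca(tour, level, first, u, v):
    """Shallowest node on the tour between the first visits of u and v."""
    i, j = sorted((first[u], first[v]))
    k = min(range(i, j + 1), key=level.__getitem__)
    return tour[k]


children = {3: [2, 1, 5], 1: [4, 7], 5: [6]}   # assumed tree shape
tour, level, first = euler_tour(children, 3)
print(tour)
print(lca(tour, level, first, 1, 5))
```

The tour has 2n-1 entries for a tree with n nodes, and consecutive levels differ by exactly one, which is the property the faster RMQ algorithms exploit.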

Page 29: Suffix trees

Trivial algorithms for RMQ

• O(n) space, O(n) query time

• O(n^2) space, O(1) query time

Page 30: Suffix trees

Less trivial algorithms for RMQ

• Try to use O(n log n) space to answer a query in O(1) time.

• Simpler: O(n + sqrt(n)) space to answer a query in O(sqrt(n)) time.
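One common way to reach O(n log n) space with O(1) queries is a sparse table, sketched below. The slides do not say which structure they have in mind, so this is an assumption; the function names are illustrative. Row k stores, for every start i, the index of a minimum of the block of length 2^k starting at i, and a query is covered by two such blocks that may overlap.

```python
def build_sparse_table(values):
    """table[k][i] = index of a minimum of values[i : i + 2**k]."""
    n = len(values)
    table = [list(range(n))]
    k = 1
    while (1 << k) <= n:
        prev, half = table[k - 1], 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if values[a] <= values[b] else b)
        table.append(row)
        k += 1
    return table


def range_min_index(values, table, i, j):
    """Index of a minimum of values[i..j] (inclusive, i <= j)."""
    k = (j - i + 1).bit_length() - 1       # largest power of two fitting in the range
    a, b = table[k][i], table[k][j - (1 << k) + 1]
    return a if values[a] <= values[b] else b


levels = [0, 1, 0, 1, 2, 1, 2, 1, 0, 1, 2, 1, 0]
table = build_sparse_table(levels)
print(range_min_index(levels, table, 3, 9))
```

Overlap between the two blocks is harmless because taking a minimum twice over the same element does not change the answer.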


Page 33: Suffix trees

Finding maximal palindromes

• A palindrome: caabaac, cbaabc

• Want to find all maximal palindromes in a string s

Let s = cbaaba and let s^r be the reverse of s.

The maximal palindrome with center between i-1 and i is given by the LCP of the suffix at position i of s and the suffix at position n-i+2 of s^r: the LCP is the right half of the palindrome.

Page 34: Suffix trees

Maximal palindromes algorithm

Prepare a generalized suffix tree for s = cbaaba$ and s^r = abaabc#.

For every i, find the LCA of suffix i of s and suffix n-i+2 of s^r.
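The index arithmetic above can be checked with a small sketch. It replaces the generalized suffix tree and the LCA query by a direct character-by-character LCP computation, so it is quadratic in the worst case rather than O(n); the names `lcp_length` and `even_palindromes` are hypothetical. Like the slide, it only handles palindromes centered between two positions.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def even_palindromes(s):
    """Map each centre i (between positions i-1 and i, 1-based) to its maximal palindrome."""
    n = len(s)
    sr = s[::-1]
    result = {}
    for i in range(2, n + 1):
        right = s[i - 1:]          # suffix i of s
        left = sr[n - i + 1:]      # suffix n-i+2 of the reversed string
        k = lcp_length(right, left)
        if k:
            result[i] = s[i - 1 - k:i - 1 + k]
    return result


print(even_palindromes("cbaaba"))
```

For s = cbaaba the only non-empty result is at i = 4, where the LCP of aba and abc is ab and the palindrome is baab.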

Page 35: Suffix trees

Let s = cbaaba$; then s^r = abaabc#.

[Figure: generalized suffix tree of cbaaba$ and abaabc#, with each leaf labeled by the starting position of its suffix.]

Page 36: Suffix trees

Analysis

O(n) time to identify all maximal palindromes.

Page 37: Suffix trees

Suffix trees for compression

Page 38: Suffix trees

Compression

S[1..n] = aabcbbabbsbabcbbbbabbabcbbabbabbsb…

Compressed: aabcbbabbsb (2,5) bbabbabcbbabbabbsb…

Page 39: Suffix trees

Compression

S[1..n] = aabcbbabbsbabcbbbbabbabcbbabbabbsb…

Compressed: aabcbbabbsb (2,5) bbabbabcbbabbabbsb…

Let L_i be the length of the longest prefix of S[i..n] that is a substring of S[1..i-1].

Let s_i be the position of a previous occurrence of this prefix.

Let prior_i = S[i..i+L_i-1] = S[s_i..s_i+L_i-1] be this prefix.

Page 40: Suffix trees

Compression

S[1..n] = aabcbbabbsbabcbbbbabbabcbbabbabbsb…

For i = 12: prior_12 = abcbb, encoded as (s_12, L_12) = (2,5).

Compressed: aabcbbabbsb (2,5) bbabbabcbbabbabbsb…

Page 41: Suffix trees

“Ziv-Lempel” compression

    for (i = 1; i <= n; ) {
        compute (s_i, L_i)
        if (L_i > 1) { output (s_i, L_i); i = i + L_i }
        else         { output S[i];       i = i + 1 }
    }

S = aabcbbabbsbabbbbbabbabcbbabbabbsb…

Output: aabcbb (2,2) bs (6,4) …

How do we compute (s_i, L_i)?
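The loop above can be made runnable as a sketch in which (s_i, L_i) is found by brute-force search with `str.find` instead of a suffix tree. The function names are illustrative; positions are 1-based as on the slides, and `s_i` is the leftmost previous occurrence. A `decompress` helper is included only to check the round trip.

```python
def longest_previous_match(S, i):
    """Return (s_i, L_i): longest prefix of S[i..n] occurring inside S[1..i-1]."""
    prefix, rest = S[:i - 1], S[i - 1:]
    pos, length = 0, 0
    while length < len(rest):
        found = prefix.find(rest[:length + 1])
        if found < 0:
            break
        pos, length = found + 1, length + 1
    return pos, length


def compress(S):
    """Emit single characters and (s_i, L_i) pairs, following the loop on the slide."""
    out, i, n = [], 1, len(S)
    while i <= n:
        s_i, L_i = longest_previous_match(S, i)
        if L_i > 1:
            out.append((s_i, L_i))
            i += L_i
        else:
            out.append(S[i - 1])
            i += 1
    return out


def decompress(tokens):
    """Inverse of compress: copy L characters starting at position s for each pair."""
    text = []
    for token in tokens:
        if isinstance(token, tuple):
            s, L = token
            text.extend(text[s - 1:s - 1 + L])
        else:
            text.append(token)
    return "".join(text)


print(compress("aabcbbabbsbabb"))
```

Because every copied block lies entirely before position i, `decompress` can copy it with a plain slice.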

Page 42: Suffix trees

Implementation using suffix tree

Before compression:

• Build a suffix tree T for S.

• For each node v, compute c_v:
  – the smallest leaf index in v’s subtree.
  – the starting position of the leftmost copy of the substring that labels the path from the root to v.

• O(n) time.

Page 43: Suffix trees

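Computing (s_i, L_i):

[Figure: the path from the root to leaf i, which spells S[i...n], with a node v on it; c_v and string-len(v) are marked relative to position i.]

Let v be the “first” node on this path such that:

string-len(v) + c_v ≥ i

set s_i ← c_v

set L_i ← i - c_v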

Page 44: Suffix trees

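Example

S = a b a b a b a b
    1 2 3 4 5 6 7 8

[Figure: suffix tree of abababab$. Internal nodes are labeled with c_v (1 or 2), leaves with the starting positions of their suffixes, and two nodes are marked v1 and v2.]

i=1  L_i=0  output a

i=2  L_i=0  output b

i=3  L_i=2  s_i=1  output (1,2)

i=5  L_i=4  s_i=1  output (1,4)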

Page 45: Suffix trees

Suffix array

• We lose some of the functionality but we save space.

Let s = abab.

Sort the suffixes lexicographically: ab, abab, b, bab.

The suffix array gives the indices of the suffixes in sorted order:

2 0 3 1
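The definition translates directly into a short sketch that sorts the suffix start positions by the suffixes themselves. This is the naive construction (up to O(n^2 log n) time), not one of the efficient methods; the function name is illustrative and indices are 0-based, as in the example above.

```python
def suffix_array(s):
    """Start positions of the suffixes of s in lexicographic order (0-based)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array("abab"))
```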

Page 46: Suffix trees

How do we build it?

• Build a suffix tree.

• Traverse the tree in order, lexicographically picking edges outgoing from each node, and fill the suffix array.

• O(n) time

• Can also build it directly

Page 47: Suffix trees

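Example

Let S = mississippi

10  i
 7  ippi
 4  issippi
 1  ississippi
 0  mississippi
 9  pi
 8  ppi
 6  sippi
 3  sissippi
 5  ssippi
 2  ssissippi

Let P = issa

[Figure: L, M and R mark the left end, the midpoint and the right end of the binary search range in the array.]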

Page 48: Suffix trees

How do we search for a pattern ?

• If P occurs in T then all its occurrences are consecutive in the suffix array.

• Do a binary search on the suffix array

• Takes O(m log n) time

• Can also do it in O(m + log n) with an additional array
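The O(m log n) binary search can be sketched as two boundary searches over the suffix array, each comparing the pattern against the first m characters of a suffix. The function names are illustrative, and the suffix array is built naively here just to keep the example self-contained.

```python
def suffix_array(text):
    """Naive suffix array (0-based start positions in sorted order)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All start positions of pattern in text, via binary search on sa."""
    m = len(pattern)
    # first suffix whose m-character prefix is >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # first suffix whose m-character prefix is > pattern
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "mississippi"
sa = suffix_array(text)
print(find_occurrences(text, sa, "issi"))
print(find_occurrences(text, sa, "issa"))
```

The occurrences form the contiguous block `sa[first:lo]`, which is the consecutiveness property stated in the first bullet.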

Page 49: Suffix trees

Computing the suffix array

• Can we do it in linear time without constructing the suffix tree?

Page 50: Suffix trees

Divide into triples

y a b b a d a b b a d o $   (positions 0 to 12)

Triples starting at positions 1 (mod 3): abb ada bba do$

Page 51: Suffix trees

Divide into triples

y a b b a d a b b a d o $   (positions 0 to 12)

Triples starting at positions 1 (mod 3): abb ada bba do$

Triples starting at positions 2 (mod 3): bba dab bad o$$

Page 52: Suffix trees

Radix sort triplets

y a b b a d a b b a d o $   (positions 0 to 12)

Triples: abb ada bba do$ bba dab bad o$$

Sorted, without duplicates: abb, ada, bad, bba, dab, do$, o$$

Page 53: Suffix trees

Change the alphabet

Triples: abb ada bba do$ bba dab bad o$$

Rename each triple by its rank in sorted order:

abb = 1, ada = 2, bad = 3, bba = 4, dab = 5, do$ = 6, o$$ = 7

New string: 1 2 4 6 4 5 3 7
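The triple extraction and renaming steps can be sketched as below. Python's built-in `sorted` stands in for the radix sort, so this part is not linear time; the function name is hypothetical, `$` is used as padding and is assumed to be smaller than every character of the text.

```python
def rename_triples(text, pad="$"):
    """Return the new string of triple ranks: positions 1 mod 3 first, then 2 mod 3."""
    padded = text + pad * 2
    starts = list(range(1, len(text), 3)) + list(range(2, len(text), 3))
    triples = [padded[i:i + 3] for i in starts]
    # rank of each distinct triple in sorted order, starting from 1
    names = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    return triples, [names[t] for t in triples]


triples, new_string = rename_triples("yabbadabbado")
print(triples)
print(new_string)
```

When all names are distinct the order of the sampled suffixes is already known; here bba appears twice, so the new string has to be sorted recursively.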

Page 54: Suffix trees

Sort the suffixes

New string: 1 2 4 6 4 5 3 7

Its suffixes: 12464537, 2464537, 464537, 64537, 4537, 537, 37, 7

Page 55: Suffix trees

Recursively (problem size decreases to 2/3 the original)

abb ada bba do$ bba dab bad o$$

1 2 4 6 4 5 3 7   (indices 0 to 7)

Suffixes of the new string by index:

0 - 12464537
1 - 2464537
2 - 464537
3 - 64537
4 - 4537
5 - 537
6 - 37
7 - 7

Sorted:

0 - 12464537
1 - 2464537
6 - 37
4 - 4537
2 - 464537
5 - 537
3 - 64537
7 - 7

Page 56: Suffix trees

Go back to original problem

y a b b a d a b b a d o $   (positions 0 to 12)

Each index of the new string 1 2 4 6 4 5 3 7 corresponds to a position in the original string (index/position):

0/1 1/4 2/7 3/10 4/2 5/5 6/8 7/11

Page 57: Suffix trees

Go back to original problem

Ranks of the sorted suffixes, written at their positions in the original string:

Position:  1  2  4  5  7  8 10 11
Rank:      1  4  2  6  5  3  7  8

Page 58: Suffix trees

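Sort the remaining third

Represent each suffix starting at a position 0 (mod 3) by its first character and the rank of the suffix that follows it:

Position 0: (y, 1)   Position 3: (b, 2)   Position 6: (a, 5)   Position 9: (a, 7)

Sorted: (a, 5) (a, 7) (b, 2) (y, 1), that is, positions 6 9 3 0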

Page 59: Suffix trees

Merge

y a b b a d a b b a d o $   (positions 0 to 12)

Sorted suffixes at positions 1 or 2 (mod 3): 1 4 8 2 7 5 10 11

Sorted suffixes at positions 0 (mod 3): 6 9 3 0

Repeatedly compare the first remaining suffix of each list and move the smaller one to the output:

Output: 1
Output: 1 6
Output: 1 6 4
Output: 1 6 4 9
Output: 1 6 4 9 3
Output: 1 6 4 9 3 8
Output: 1 6 4 9 3 8 2
Output: 1 6 4 9 3 8 2 7
Output: 1 6 4 9 3 8 2 7 5
Output: 1 6 4 9 3 8 2 7 5 10
Output: 1 6 4 9 3 8 2 7 5 10 11
Output: 1 6 4 9 3 8 2 7 5 10 11 0

Page 71: Suffix trees

Summary

y a b b a d a b b a d o $   (positions 0 to 12)

Suffix array: 1 6 4 9 3 8 2 7 5 10 11 0

When comparing a suffix at a position 0 (mod 3) to a suffix with index 1 (mod 3), we compare the first characters and break ties by the ranks of the following suffixes.

When comparing to a suffix with index 2 (mod 3), we compare the first characters, then the next characters if there is a tie, and finally the ranks of the following suffixes.
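The last two phases (sorting the remaining third and merging with the two comparison rules) can be sketched as follows. This is a sketch under a loud assumption: the ranks of the suffixes at positions 1 or 2 (mod 3) are obtained by plain sorting, standing in for the recursive call, so the code is not linear time. The function name `skew_merge` and its helpers are hypothetical.

```python
def skew_merge(text):
    """Suffix array of text using the 'sort remaining third, then merge' steps."""
    n = len(text)
    # Assumption: plain sorting replaces the recursion on the renamed string.
    sample = sorted((i for i in range(n) if i % 3 != 0), key=lambda i: text[i:])
    rank = [0] * (n + 2)                 # rank 0 means "past the end of the text"
    for r, i in enumerate(sample, start=1):
        rank[i] = r

    def ch(i):
        return text[i] if i < n else ""  # "" sorts before every real character

    # Positions 0 (mod 3): sort by (first character, rank of the following suffix).
    rest = sorted((i for i in range(n) if i % 3 == 0),
                  key=lambda i: (ch(i), rank[i + 1]))

    def sample_first(i, j):
        """True if suffix i (1 or 2 mod 3) sorts before suffix j (0 mod 3)."""
        if i % 3 == 1:
            return (ch(i), rank[i + 1]) < (ch(j), rank[j + 1])
        return (ch(i), ch(i + 1), rank[i + 2]) < (ch(j), ch(j + 1), rank[j + 2])

    out, a, b = [], 0, 0
    while a < len(sample) and b < len(rest):
        if sample_first(sample[a], rest[b]):
            out.append(sample[a])
            a += 1
        else:
            out.append(rest[b])
            b += 1
    return out + sample[a:] + rest[b:]


print(skew_merge("yabbadabbado"))
```

Each comparison in the merge looks at no more than two characters and one pair of ranks, which is why the merge itself takes linear time once the ranks are known.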