suffix trees, suffix arrays and suffix trays richard cole tsvi kopelowitz moshe lewenstein

Post on 17-Dec-2015

Suffix Trees, Suffix Arrays and Suffix Trays

Richard Cole

Tsvi Kopelowitz

Moshe Lewenstein

Indexing problem

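Input: Text T = t1,…,tn (preprocessed into a data structure DS)

Queries: Pattern P = p1,…,pm (answered using DS): report the locations where P appears in T

[Figure: text T with occurrences of P at locations 5, 14 and 30]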

Suffix Property

P appears at location i of T iff P is a prefix of the suffix Ti = ti,…,tn.

[Figure: text T with occurrences of P at locations 5, 14 and 30; P is a prefix of the suffix T14]
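The suffix property can be checked directly with a brute-force scan. The sketch below is only an illustration: the function name `occurrences` and the test string are assumptions, and locations are 1-indexed as on the slides.

```python
def occurrences(text, pattern):
    """Return the 1-indexed locations i where pattern is a prefix of the suffix T_i."""
    # str.startswith(pattern, i) tests whether text[i:] begins with pattern.
    return [i + 1 for i in range(len(text)) if text.startswith(pattern, i)]


if __name__ == "__main__":
    print(occurrences("mississippi", "ssi"))
```

This scan takes O(|T|·|P|) time per query; the index structures in the following slides exist to avoid rescanning T for every pattern.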

Suffix Tree

A suffix tree for string S is a compressed trie of all suffixes of S.

Example: s=abab$, with suffixes { $, b$, ab$, bab$, abab$ }

[Figure: suffix tree of abab$ with edges labeled by the substrings ab, b, ab$ and $]

Suffix Tree

The size of the suffix tree of S is O(|S|).

Example: s=abab$, with suffixes { $, b$, ab$, bab$, abab$ }

[Figure: the suffix tree of abab$ with its leaves numbered 0 to 4]

Suffix Tree

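The size of the suffix tree of S is O(|S|): each edge label is stored as a pair of indices [i,j] into S rather than as a substring.

Example: s=abab$, with suffixes { $, b$, ab$, bab$, abab$ }

[Figure: the suffix tree of abab$ with leaves numbered 0 to 4 and edge labels given as index pairs such as [2,3], [2,4], [4,4] and [1,1]]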
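The linear size bound follows from a short count. This is a sketch of the standard argument, writing n = |S| and assuming S ends with a unique terminator $, so that every suffix ends at its own leaf:

```latex
\[
\#\text{leaves} = n, \qquad
\#\text{internal nodes} \le n, \qquad
\#\text{nodes} \le 2n, \qquad
\text{space} = O(1) \cdot \#\text{edges} = O(n).
\]
```

The bound on internal nodes holds because the trie is compressed: every internal node other than the root has at least two children. For s=abab$ there are 5 leaves and 3 internal nodes (the root and the nodes reached by ab and by b).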

Indexing and Suffix Trees

Navigate from root. (Use suffix property).

P = ssi

Time: O(|P| + occ) for a constant-size alphabet, where occ is the number of occurrences of P.

For a general alphabet Σ, choosing the outgoing edge at each node takes O(log|Σ|):

Time: O(|P| log|Σ| + occ)
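Navigation from the root can be sketched as follows. This is a minimal illustration, not one of the linear-time constructions: the tree is built naively in O(n^2) time, edge labels are index pairs into the text, and the names `Node`, `build_suffix_tree` and `find` are assumptions. A Python dict stands in for the child lookup at each node; a sorted child list with binary search would give the deterministic O(log|Σ|) per node.

```python
class Node:
    def __init__(self, start=0, end=-1, leaf=None):
        self.start = start    # edge into this node is s[start..end], inclusive
        self.end = end
        self.leaf = leaf      # suffix start position for leaves, None otherwise
        self.children = {}    # first character of an edge -> child Node


def build_suffix_tree(s):
    """Naive O(n^2) construction; s must end with a character used nowhere else."""
    assert s and s.count(s[-1]) == 1, "text needs a unique terminator"
    n = len(s)
    root = Node()
    for i in range(n):                      # insert the suffix starting at i
        node, pos = root, i
        while True:
            child = node.children.get(s[pos])
            if child is None:               # no edge starts with s[pos]: new leaf
                node.children[s[pos]] = Node(pos, n - 1, leaf=i)
                break
            k = child.start
            while k <= child.end and pos < n and s[k] == s[pos]:
                k += 1
                pos += 1
            if k > child.end:               # whole edge matched: go deeper
                node = child
                continue
            # Mismatch inside the edge: split it at k and hang a new leaf.
            mid = Node(child.start, k - 1)
            node.children[s[mid.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[pos]] = Node(pos, n - 1, leaf=i)
            break
    return root


def find(root, s, p):
    """Return the 0-indexed start positions of p in s (in no particular order)."""
    node, j = root, 0
    while j < len(p):
        child = node.children.get(p[j])
        if child is None:
            return []
        k = child.start
        while k <= child.end and j < len(p):
            if s[k] != p[j]:
                return []
            k += 1
            j += 1
        node = child
    # Every leaf below the reached node is an occurrence (suffix property).
    found, stack = [], [node]
    while stack:
        v = stack.pop()
        if v.leaf is not None:
            found.append(v.leaf)
        stack.extend(v.children.values())
    return found


if __name__ == "__main__":
    text = "mississippi$"
    print(sorted(find(build_suffix_tree(text), text, "ssi")))
```

Collecting the leaves costs O(occ) because every internal node below the match point has at least two children.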

Suffix Trees

Weiner 1973 (linear time construction!)

McCreight 1976 (space efficient)

Ukkonen 1995 (online)

Farach 1997 (linear time for polynomial-range integer alphabets)

Suffix Array

S = mississippi

All suffixes:

S1   mississippi
S2   ississippi
S3   ssissippi
S4   sissippi
S5   issippi
S6   ssippi
S7   sippi
S8   ippi
S9   ppi
S10  pi
S11  i

Sorted suffixes (POS is the start position of each):

POS  Suffix
11   S11  i
8    S8   ippi
5    S5   issippi
2    S2   ississippi
1    S1   mississippi
10   S10  pi
9    S9   ppi
7    S7   sippi
4    S4   sissippi
6    S6   ssippi
3    S3   ssissippi

Suffix Array

S = m i s s i s s i p p i

SA(S) = 11 8 5 2 1 10 9 7 4 6 3

P = pi

Binary search for P over SA(S): O(log |S|) steps, each comparing P with one suffix in O(|P|) time.

Time: O(|P|*log |S|)
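The binary search can be sketched as below. The suffix array is built by naively sorting the suffixes, which is for illustration only; the function names are assumptions, and positions are 0-indexed, so each is one less than on the slides.

```python
def build_suffix_array(s):
    """Naive construction: sort the suffix start positions by the suffixes themselves."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, p):
    """Return the sorted 0-indexed positions of p in s using two binary searches."""
    m = len(p)

    # First rank whose suffix, truncated to m characters, is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose truncated suffix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes that start with p occupy the ranks first..lo-1.
    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "mississippi"
    sa = build_suffix_array(text)
    print([i + 1 for i in sa])              # 1-indexed, as on the slides
    print(find_occurrences(text, sa, "pi"))
```

Each step compares at most |P| characters and there are O(log |S|) steps, which gives the O(|P|*log |S|) bound.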

Suffix Array

Introduced:

Manber and Myers (1993).

Gonnet, Baeza-Yates, Snider (1992) (PAT arrays).

Manber and Myers (1993), using additional longest-common-prefix information:

Time - O(|P| + log |S|)

Suffix Array Construction

Manber and Myers (1993) - O(n log n).

Kärkkäinen-Sanders (2003) - O(n) (polynomial-range integer alphabets)

Two other papers as well.
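To give a feel for construction faster than sorting whole suffixes, here is a sketch of prefix doubling, the idea behind the Manber and Myers construction. It is a simplification: each round uses a comparison sort, so it runs in O(n log^2 n) rather than O(n log n), and it is unrelated to the linear-time algorithms. The function name is an assumption and positions are 0-indexed.

```python
def suffix_array_doubling(s):
    """Build the suffix array by prefix doubling with a comparison sort per round."""
    n = len(s)
    if n == 0:
        return []
    rank = [ord(c) for c in s]      # rank by the first character
    sa = list(range(n))
    k = 1
    while True:
        # Order suffixes by their first 2k characters, using ranks of the
        # first k characters; -1 marks "past the end" and sorts first.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            bump = 1 if key(sa[j]) != key(sa[j - 1]) else 0
            new_rank[sa[j]] = new_rank[sa[j - 1]] + bump
        rank = new_rank
        if rank[sa[-1]] == n - 1 or k >= n:   # all ranks distinct: done
            return sa
        k *= 2


if __name__ == "__main__":
    print([i + 1 for i in suffix_array_doubling("mississippi")])
```

After the round with parameter k the ranks reflect the order of the first 2k characters of every suffix, so at most about log n rounds are needed.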

End of Story?

No. Lots of questions.

1. Construction Time of Suffix Trees.

2. Query Time.

3. Compressed Indexing Structures.

4. Indexing with Errors.

5. Real-Time S.T. construction.

Query Time for Large Alphabets

Suffix Trees: O(|P|*log|Σ|) (deterministic)

Suffix Arrays: O(|P| + log |T|)

Suffix Trays: O(|P|+log|Σ|) for alphabets {1,…,|Σ|}

Query Time for Large Alphabets

Actually, it is easy to answer queries in O(|P|) time.

Create a |Σ|-length array at every node of the suffix tree, indexed by the next character.

Then navigation at every node is O(1).

However, the time and space of suffix tree construction become O(n|Σ|).

Suffix Tree – Suffix Array connection

When the children of every node are ordered by the first character of their edge labels, the left-to-right order of the suffixes (leaves) in the suffix tree is exactly the suffix array.

Example: S = mississippi$ (with $ ordered after every letter)

All suffixes:

S1   mississippi$
S2   ississippi$
S3   ssissippi$
S4   sissippi$
S5   issippi$
S6   ssippi$
S7   sippi$
S8   ippi$
S9   ppi$
S10  pi$
S11  i$
S12  $

Sorted suffixes (POS is the start position of each):

POS  Suffix
8    S8   ippi$
5    S5   issippi$
2    S2   ississippi$
11   S11  i$
1    S1   mississippi$
10   S10  pi$
9    S9   ppi$
7    S7   sippi$
4    S4   sissippi$
6    S6   ssippi$
3    S3   ssissippi$
12   S12  $

SA(mississippi$) = 8 5 2 11 1 10 9 7 4 6 3 12

Suffix Tree – Suffix Array connection

We utilize this connection as follows:

Every node in the suffix tree corresponds to an interval in the suffix array: the leaves in its subtree are consecutive suffix array entries.

Example: SA(mississippi$) = 8 5 2 11 1 10 9 7 4 6 3 12 (with $ ordered after every letter)
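The node-to-interval correspondence can be checked by brute force. In this sketch a suffix tree node is represented only by its path label (the string spelled from the root), `$` is ordered after every other character, and ranks are 0-indexed positions within the suffix array; the function names are assumptions.

```python
def suffix_array_end_last(s, end="$"):
    """Naive suffix array in which the terminator sorts after every other character."""
    def key(i):
        # (False, c) for ordinary characters sorts before (True, '$').
        return [(c == end, c) for c in s[i:]]
    return sorted(range(len(s)), key=key)


def node_interval(s, sa, label):
    """Return (first_rank, last_rank) of the suffixes that start with label, or None."""
    ranks = [r for r, i in enumerate(sa) if s.startswith(label, i)]
    if not ranks:
        return None
    # The matching suffixes are consecutive in sorted order.
    assert ranks == list(range(ranks[0], ranks[-1] + 1))
    return ranks[0], ranks[-1]


if __name__ == "__main__":
    text = "mississippi$"
    sa = suffix_array_end_last(text)
    for label in ["i", "issi", "p", "s", "ssi"]:
        print(label, node_interval(text, sa, label))
```

The linear scan is only there to make the contiguity visible; a real index would store the two interval endpoints at each node.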

Suffix Tree – Suffix Array connection

Moreover, the time to search in the suffix array restricted to an interval I is:

O(|P| + log |I|).

Suffix Tree – Suffix Array connection

Definition: a |Σ|-leaf is a node of the suffix tree such that

(1) its subtree contains at least |Σ| leaves, and

(2) the subtree of each of its children contains fewer than |Σ| leaves.

The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|^2).

Why?

It has at most |Σ| children, each with fewer than |Σ| leaves in its subtree.

Suffix Tree – Suffix Array connection

The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|^2).

So the time to search in the suffix array interval of a |Σ|-leaf is:

O(|P| + log |Σ|).
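The step from the interval bound to the search time is a one-line calculation. Writing I for the suffix array interval of a |Σ|-leaf:

```latex
\[
|I| \le |\Sigma| \cdot |\Sigma| = |\Sigma|^2
\quad\Longrightarrow\quad
O(|P| + \log |I|) = O(|P| + 2\log|\Sigma|) = O(|P| + \log|\Sigma|).
\]
```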

Suffix Tray

Idea Outline:

Navigate in the suffix tree until a |Σ|-leaf is hit, then move to the suffix array (time in SA: O(|P| + log |Σ|)).

Problem:

Navigation in the suffix tree takes O(|P| log |Σ|) time.

We promised O(|P| + log |Σ|).

Suffix Tray

Recall idea:

Create a |Σ|-length array at every node of the suffix tree.

Then navigation at every node is O(1).

Too expensive overall: O(n|Σ|).

But OK for O(n/|Σ|) nodes: O(n/|Σ|) arrays of length |Σ| take O(n) space.

Suffix Tray

Idea: Truncate the suffix tree at its |Σ|-leaves to obtain the Σ-tree.

Would be nice: size of Σ-tree = O(n/|Σ|).

However, this is not the case.

[Figures: example suffix trees, including that of S=ababababa$, with nodes marked as |Σ|-leaves, nodes with fewer than |Σ| leaves in their subtree, and the rest]

Suffix Tray

Alternative idea: Extend the definition of the Σ-tree: it is the suffix tree with every node that has fewer than |Σ| leaves in its subtree removed.

Nodes in Σ-tree:

1. Σ-leaf

2. Branching-Σ-node: node with at least 2 children in the Σ-tree

3. Others: nodes with only one child in the Σ-tree.
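The three node types can be computed from subtree leaf counts. The sketch below works on an abstract rooted tree given as a dictionary of child lists rather than on a real suffix tree; the names `leaf_counts` and `classify`, the parameter `sigma` (standing for |Σ|) and the toy tree are all assumptions made for illustration.

```python
def leaf_counts(children, root):
    """Number of leaves in the subtree of every node (iterative post-order)."""
    counts = {}
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        kids = children.get(node, [])
        if visited:
            counts[node] = sum(counts[k] for k in kids) if kids else 1
        else:
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return counts


def classify(children, root, sigma):
    """Map each node kept in the Sigma-tree to 'sigma-leaf', 'branching' or 'other'."""
    counts = leaf_counts(children, root)
    kinds = {}
    for node, count in counts.items():
        if count < sigma:
            continue                      # removed: fewer than sigma leaves below
        kept = [k for k in children.get(node, []) if counts[k] >= sigma]
        if not kept:
            kinds[node] = "sigma-leaf"    # no child survives
        elif len(kept) >= 2:
            kinds[node] = "branching"
        else:
            kinds[node] = "other"         # exactly one child survives
    return kinds


if __name__ == "__main__":
    toy = {0: [1, 2, 3], 1: [4, 5], 2: [6, 7], 7: [8, 9]}
    print(classify(toy, 0, sigma=2))
```

In the toy tree with sigma = 2, node 2 keeps only its child 7, so it falls in the third category.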

Suffix Tray - Example

[Figure: example suffix tree with nodes marked as |Σ|-leaves, branching |Σ|-nodes, others, and nodes with fewer than |Σ| leaves in their subtree]

Suffix Tray

Observation:

# of Σ-leaves = O(n/|Σ|)

Hence, # of branching-Σ-nodes = O(n/|Σ|)

So we can afford to store a |Σ|-length table for navigation at each branching-Σ-node.
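The counting behind this observation can be written out as follows. It is a sketch of the standard argument, where n is the number of leaves of the suffix tree; the subtrees of distinct Σ-leaves are disjoint, and each contains at least |Σ| leaves.

```latex
\[
\#\{\Sigma\text{-leaves}\} \cdot |\Sigma| \le n
\;\Longrightarrow\;
\#\{\Sigma\text{-leaves}\} \le \frac{n}{|\Sigma|},
\qquad
\#\{\text{branching-}\Sigma\text{-nodes}\} \le \#\{\Sigma\text{-leaves}\},
\]
\[
\text{table space} = |\Sigma| \cdot O\!\left(\frac{n}{|\Sigma|}\right) = O(n).
\]
```

The second inequality is the usual fact that a tree has no more nodes with at least two children than it has leaves, applied to the Σ-tree, whose leaves are the Σ-leaves.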

Suffix Tray – What is Left?

Nodes in Σ-tree with only one child.

[Figure: a Σ-tree node with a single Σ-tree child; its remaining children (edges labeled b, c, d, e in the drawing) cover a suffix array interval of fewer than |Σ|^2 entries]

Suffix Tray

Size of suffix tray: O(n)

Navigation:

1. Σ-leaf: jump to the suffix array.

2. Branching-Σ-node: look at the Σ-array.

3. Others: look at the one character leading to the Σ-tree child.

Time: O(|P| + log|Σ|)
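The total can be read as the sum of two parts. This accounting is a sketch that assumes a query leaves the Σ-tree at most once and then finishes with a single suffix array search:

```latex
\[
\underbrace{O(|P|)}_{\Sigma\text{-tree navigation, } O(1) \text{ per node}}
\;+\;
\underbrace{O(|P| + \log|\Sigma|)}_{\text{suffix array search on an interval of size } O(|\Sigma|^2)}
\;=\; O(|P| + \log|\Sigma|).
\]
```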

End of Story?

No. Lots of questions.

1. Construction Time of Suffix Trees.

2. Query Time.

3. Compressed Indexing Structures.

4. Indexing with Errors.

5. Real-Time S.T. construction.
