Suffix Arrays: A new method for on-line string searches. Udi Manber, Gene Myers. May 1989. Presented by: Oren Weimann


1

Suffix Arrays: A new method for on-line string searches

Udi Manber, Gene Myers

May 1989

Presented by: Oren Weimann

2

Introduction - Problem definition

“Is W a substring of A?”

|A| = N and |W| = P, where A = a_0 a_1 … a_{N-1}

A_i = the suffix of A beginning at index i = a_i a_{i+1} … a_{N-1}

Example: A = abccbbadgfbbcahgjf, W = badgfbb

W occurs in A starting at index 5: abccb[badgfbb]cahgjf

3

Introduction – what is a suffix array? Example:

A = assassin (indices 0 1 2 3 4 5 6 7)

k   Pos[k]   suffix A_{Pos[k]}
0   0        assassin
1   3        assin
2   6        in
3   7        n
4   2        sassin
5   5        sin
6   1        ssassin
7   4        ssin

Pos = 0 3 6 7 2 5 1 4

For example, Pos[2] = 6 (A_6 = in).

4

Introduction – what is a suffix array?

A lexicographically sorted array, Pos[N], of all the suffixes of A:

Pos[k] = i ⇔ A_i is the kth smallest suffix in the set {A_0, A_1, A_2, …, A_{N-1}}
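
As a minimal sketch of this definition (not the construction method presented later), Pos can be built naively by sorting the suffix start positions. The name `naive_suffix_array` is made up for illustration, and this sort can cost far more than O(N log N) symbol comparisons.

```python
def naive_suffix_array(text):
    """Return Pos: suffix start indices, ordered by the suffixes they begin."""
    # Each key text[i:] is a copy of the suffix, so this is for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


pos = naive_suffix_array("assassin")
for k, i in enumerate(pos):
    print(k, i, "assassin"[i:])  # rank k, start index Pos[k], suffix
```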

5

Introduction – what is a suffix tree? Example:

A trie that contains all suffixes of A:

[Figure: suffix tree of A = assassin (indices 0 1 2 3 4 5 6 7); edges are labelled with substrings of A and each leaf with the start index of its suffix.]

6

The Article Overview

1. A search algorithm in O(P + log N) (assuming we have already computed Pos[ ] and the longest common prefix (lcp) information).

2. How to construct Pos[ ] in O(N log N) time and O(N) space (assuming lcp info is known).

3. An algorithm for computing the lcp information in O(N log N).

4. Algorithms for expected-time improvement.

7

The Search algorithm - Definitions

For any string u, let u^p = u_1 u_2 … u_p be its prefix of length p (or u itself if |u| ≤ p).

Let “≤” denote lexicographical order. We say u ≤_p v ⇔ u^p ≤ v^p (and likewise for <_p, =_p, ≥_p and >_p).

Note that for any choice of p:

A_{Pos[0]} ≤_p A_{Pos[1]} ≤_p A_{Pos[2]} ≤_p … ≤_p A_{Pos[N-1]}

Note that W is a substring of A ⇔ there is an i such that W =_P A_{Pos[i]}.

8

The Search algorithm – how does the array help us know if W is a substring of A?

We define a search interval:

L_W = min {k | W ≤_P A_{Pos[k]} or k = N}

R_W = max {k | W ≥_P A_{Pos[k]} or k = -1}

W matches a_i a_{i+1} … a_{i+P-1} ⇔ i = Pos[k] for some k ∈ [L_W, R_W]

9

Example:

A = assassin (indices 0 1 2 3 4 5 6 7)

k   A_{Pos[k]}
0   assassin
1   assin
2   in
3   n
4   sassin
5   sin
6   ssassin
7   ssin

W      L_W   R_W   # matches
s      4     7     4
as     0     1     2
assa   0     0     1
ast    2     1     0

10

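Why finding L_W and R_W is the same as finding the matches:

If L_W > R_W ⇒ W is not a substring of A.

Else: there are (R_W - L_W + 1) matches: A_{Pos[L_W]}, …, A_{Pos[R_W]}.

[Figure: the Pos array, with W >_P A_{Pos[k]} to the left of L_W and W <_P A_{Pos[k]} to the right of R_W.]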

11

The Search algorithm – The easy way – O(P log N)

Pos: … abcde… (L) … abcdf… (M) … abd… (R), with W = “abcx”

There are log N iterations. Each iteration sets new L, R bounds (initially L = 0, R = N-1) according to a comparison of W with A_{Pos[M]}, where M = (L+R)/2. Each such comparison costs at most P symbol comparisons.

In the end L_W = R.
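
A minimal sketch of this easy O(P log N) search, assuming a naively built Pos and plain Python string comparison on P-symbol prefixes. The names `suffix_array` and `search_interval` are made up for illustration, and the loop bounds differ slightly from the L, M, R scheme above.

```python
def suffix_array(text):
    """Naive Pos construction, used only to feed the search below."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def search_interval(text, pos, w):
    """Return (L_W, R_W) with two binary searches; each step compares P symbols."""
    n, p = len(pos), len(w)

    def prefix(k):
        # First P symbols of the k-th smallest suffix (shorter if the suffix ends).
        return text[pos[k]:pos[k] + p]

    # L_W: smallest k with W <=_P A_Pos[k], or N if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if w <= prefix(mid):
            hi = mid
        else:
            lo = mid + 1
    left = lo

    # R_W: largest k with A_Pos[k] <=_P W, or -1 if there is none.
    lo, hi = -1, n - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if prefix(mid) <= w:
            lo = mid
        else:
            hi = mid - 1
    return left, lo


text = "assassin"
pos = suffix_array(text)
left, right = search_interval(text, pos, "ss")
matches = sorted(pos[left:right + 1])  # start indices of all occurrences of "ss"
```

When L_W > R_W the slice is empty, which corresponds to "W is not a substring of A".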

12

The Search algorithm using lcp values in O(P + log N) – Definitions:

Speedup using precomputed lcp values; for now we assume lcp is known.

In each iteration we define:
– l = lcp(A_{Pos[L]}, W)
– r = lcp(W, A_{Pos[R]})
– Llcp[M] = lcp(A_{Pos[L]}, A_{Pos[M]})
– Rlcp[M] = lcp(A_{Pos[M]}, A_{Pos[R]})

13

The Search algorithm using lcp values in O(P + log N) – Example: W = “abcx”

Pos: … abcde… (L) … abcdf… (M) … abd… (R)

l = 3, r = 2

Llcp[M] = 4, Rlcp[M] = 2

Note that Llcp[M] and Rlcp[M] are well defined because every midpoint M is reached from exactly one interval (L_M, R_M).

14

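So how do we use l, r and Llcp[M]? Example: W = abcx

Pos: abcde… (L)  abc…  abc…  abcdf… (M)  abd… (R)

l = 3, r = 2, Llcp[M] = 4

Case 1: Llcp[M] > l (Llcp[M] = 4 and l = 3)

W > A_{Pos[L]} ⇒ W > A_{Pos[M]}, because A_{Pos[M]} agrees with A_{Pos[L]} at the position where W first differs from A_{Pos[L]}.

Go right; l is unchanged (= 3).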

15

Example: W=abcx (cont.)

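Pos: abcde… (L)  abdf… (M)  abd… (R)

l = 3, r = 2, Llcp[M] = 2

Case 2: Llcp[M] < l (Llcp[M] = 2 and l = 3)

A_{Pos[L]} < A_{Pos[M]} ⇒ W < A_{Pos[M]}, because W still agrees with A_{Pos[L]} at the position where A_{Pos[M]} first differs from it.

Go left; r = Llcp[M] = 2.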

16

Example: W=abcx (cont.)

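Pos: abcde… (L)  abc…  abc…  abcp… (M)  abd… (R)

l = 3, r = 2, Llcp[M] = 3

Case 3: Llcp[M] = l (Llcp[M] = 3 and l = 3)

Compare the symbols of W and A_{Pos[M]} from position l onward, until the first position l+j at which they differ.

Go right or left according to the symbols of W and A_{Pos[M]} at position l+j.

The new l or r = l+j. Number of comparisons = j+1.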

17

The Search algorithm using lcp values – complexity

In iteration i there are at most j_i + 1 comparisons, where in total Σ_{i=1}^{#iterations} j_i ≤ P.

Total comparisons ≤ P + #iterations ⇒ O(P + log N) running time.

Requires only three N-sized arrays (Pos, Llcp, Rlcp).
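
The three cases can be folded into one loop. The sketch below computes only L_W, uses max(l, r) to pick between Llcp and Rlcp (a detail the slides do not spell out), and builds Llcp/Rlcp naively from neighbour lcps, so the preprocessing here is not the O(N log N) method. All function names are made up for illustration.

```python
def common_prefix(text, i, w, j):
    """Number of equal symbols of text[i:] and w[j:] (index based, no copies)."""
    k = 0
    while i + k < len(text) and j + k < len(w) and text[i + k] == w[j + k]:
        k += 1
    return k


def build_lcp_tables(text, pos):
    """Llcp[M] and Rlcp[M] for every midpoint M visited by the binary search."""
    n = len(pos)
    # hgt[k] = lcp of the neighbours A_Pos[k-1] and A_Pos[k] (naive computation).
    hgt = [0] + [common_prefix(text, pos[k - 1], text, pos[k]) for k in range(1, n)]
    llcp, rlcp = [0] * n, [0] * n

    def fill(lo, hi):
        # Returns lcp(A_Pos[lo], A_Pos[hi]) = min(hgt[lo+1 .. hi]).
        if hi - lo == 1:
            return hgt[hi]
        mid = (lo + hi) // 2
        llcp[mid] = fill(lo, mid)
        rlcp[mid] = fill(mid, hi)
        return min(llcp[mid], rlcp[mid])

    if n > 1:
        fill(0, n - 1)
    return llcp, rlcp


def left_bound(text, pos, llcp, rlcp, w):
    """L_W = min{k : W <=_P A_Pos[k]}, or N if no such k exists."""
    n, p = len(pos), len(w)
    if n == 0:
        return 0

    def w_leq(start, m):
        # W and the suffix at `start` agree on m symbols; decide W <=_P suffix.
        return m == p or (start + m < len(text) and w[m] <= text[start + m])

    l = common_prefix(text, pos[0], w, 0)
    if w_leq(pos[0], l):
        return 0
    r = common_prefix(text, pos[-1], w, 0)
    if not w_leq(pos[-1], r):
        return n
    lo, hi = 0, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            # Cases 1 and 3 rescan from l; case 2 reads the answer from Llcp.
            m = llcp[mid] if llcp[mid] < l else l + common_prefix(text, pos[mid] + l, w, l)
        else:
            m = rlcp[mid] if rlcp[mid] < r else r + common_prefix(text, pos[mid] + r, w, r)
        if w_leq(pos[mid], m):
            hi, r = mid, m
        else:
            lo, l = mid, m
    return hi
```

R_W can be found by the mirror-image search; it is omitted to keep the sketch short.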

18

The Article Overview

1. A search algorithm in O(P + log N) (assuming we have already computed Pos[ ] and the longest common prefix (lcp) information).

2. How to construct Pos[ ] in O(N log N) time and O(N) space (assuming lcp info is known).

3. An algorithm for computing the lcp information in O(N log N).

4. Algorithms for expected-time improvement.

19

Construction of suffix array in O(N log N)

Sorting the suffixes in a unique radix sort: we will have O(log N) stages (numbered 1, 2, 4, 8, 16, …).

In stage H the suffixes are sorted in buckets, called H-buckets, according to their first H characters. (The next stage is 2H; thus, in stage H the suffixes are sorted by ≤_H.)

20

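Construction of suffix array – The general idea

If A_i and A_j are in the same H-bucket, we sort them by their next H symbols. But their next H symbols are the first H symbols of A_{i+H} and A_{j+H}, which are already sorted in stage H.

H = 2:  abef…  abcd…  ab… | bb…  bb… | cd…  cd… | ef…

(first bucket | second bucket | third bucket | fourth bucket)

Here A_i = abef… and A_j = abcd… are in the first bucket, A_{j+H} = cd… is in the third bucket and A_{i+H} = ef… is in the fourth bucket.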

21

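Construction of suffix array – The general idea (cont.)

Let A_i be in the first H-bucket after stage H

⇒ A_i starts with the smallest H-symbol string

⇒ A_{i-H} should be first in its H-bucket

H = 2:  abef…  abcd…  ab… | bb…  bb… | cdef…  cdab… | ef…

Here A_i starts with ab (first bucket), so A_{i-H} = cdab… should come before cdef… in its bucket.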

22

Construction of suffix array – The algorithm

– Go over the suffix array. For each A_i: move A_{i-H} to the next available place in its H-bucket.
– The suffixes are now sorted according to ≤_{2H}-order.
– Go over the array again and decide which suffixes open a new 2H-bucket, using lcp knowledge (described later).
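
The two passes of a stage can be sketched as follows. This is a simplified variant written for readability, not the paper's exact bookkeeping: it writes into a separate `new_pos` array instead of moving suffixes in place, places the suffixes that have no symbols beyond position H first, and decides new bucket boundaries by comparing bucket numbers rather than lcp values. The name `build_suffix_array` is made up for illustration.

```python
def build_suffix_array(text):
    """Doubling construction: O(log N) stages, each a linear scan over buckets."""
    n = len(text)
    # Stage H = 1: order by first symbol (any bucket sort would do).
    pos = sorted(range(n), key=lambda i: text[i])
    bucket = [0] * n  # bucket[i] = index in pos where suffix i's H-bucket starts
    for k, i in enumerate(pos):
        same = k > 0 and text[i] == text[pos[k - 1]]
        bucket[i] = bucket[pos[k - 1]] if same else k

    h = 1
    while h < n:
        next_free = list(range(n))  # next_free[b] = next empty slot of bucket b
        new_pos = [0] * n

        def place(i):
            b = bucket[i]
            new_pos[next_free[b]] = i
            next_free[b] += 1

        # Suffixes with at most H symbols have an empty "next H symbols" part,
        # so they are the smallest members of their H-buckets.
        for i in range(max(n - h, 0), n):
            place(i)
        # First pass: each A_i, in <=_H order, sets A_{i-H}.
        for i in pos:
            if i >= h:
                place(i - h)

        def tail(i):
            # Bucket of A_{i+H}, or -1 if A_i has no symbols beyond position H.
            return bucket[i + h] if i + h < n else -1

        # Second pass: decide which suffixes open a new 2H-bucket.
        new_bucket = [0] * n
        for k, i in enumerate(new_pos):
            j = new_pos[k - 1] if k > 0 else None
            if j is not None and bucket[i] == bucket[j] and tail(i) == tail(j):
                new_bucket[i] = new_bucket[j]
            else:
                new_bucket[i] = k
        pos, bucket = new_pos, new_bucket
        h *= 2
    return pos
```

Each stage touches every suffix a constant number of times, which is where the O(N) per stage comes from.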

23

Construction of suffix array – The algorithm. Example:

A = assassin (indices 0 1 2 3 4 5 6 7)

Stage H=1: each A_i, in array order, sets A_{i-1}. Buckets are separated by "|".

assin assassin | in | n | sin ssin sassin ssassin

A_3 (assin) sets A_2 (sassin): sassin moves to the front of the s-bucket.

24

assin assassin | in | n | sassin ssin sin ssassin

A_0 (assassin): there is no A_{-1}, so nothing moves.

25

assin assassin | in | n | sassin ssin sin ssassin

A_6 (in) sets A_5 (sin): sin moves to the second place of the s-bucket.

26

assin assassin | in | n | sassin sin ssin ssassin

A_7 (n) sets A_6 (in): in is alone in its bucket.

27

assin assassin | in | n | sassin sin ssin ssassin

A_2 (sassin) sets A_1 (ssassin): ssassin moves to the third place of the s-bucket.

28

assin assassin | in | n | sassin sin ssassin ssin

A_5 (sin) sets A_4 (ssin): ssin takes the last place of the s-bucket.

29

assin assassin | in | n | sassin sin ssassin ssin

A_1 (ssassin) sets A_0 (assassin): assassin moves to the front of the a-bucket.

30

assassin assin | in | n | sassin sin ssassin ssin

A_4 (ssin) sets A_3 (assin): assin takes the second place of the a-bucket.

31

assassin assin | in | n | sassin sin ssassin ssin

Go over the array to get the new 2-buckets. For example, lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1, so “sin” opens a new 2-bucket.

32

Construction of suffix array – The algorithm. Example (cont.):

A = assassin (indices 0 1 2 3 4 5 6 7)

Stage H=2: each A_i, in array order, sets A_{i-2}. 2-buckets are separated by "|". The array order does not change during this stage.

assassin assin | in | n | sassin | sin | ssassin ssin

A_0 (assassin): there is no A_{-2}, so nothing moves.

33

A_3 (assin) sets A_1 (ssassin): ssassin takes the first place of its 2-bucket.

34

A_6 (in) sets A_4 (ssin): ssin takes the second place of its 2-bucket.

35

A_7 (n) sets A_5 (sin): sin is alone in its 2-bucket.

36

A_2 (sassin) sets A_0 (assassin): assassin takes the first place of its 2-bucket.

37

A_5 (sin) sets A_3 (assin): assin takes the second place of its 2-bucket.

38

A_1 (ssassin): there is no A_{-1}, so nothing moves.

39

A_4 (ssin) sets A_2 (sassin): sassin is alone in its 2-bucket.

40

assassin assin | in | n | sassin | sin | ssassin ssin

Go over the array to get the new 4-buckets.

41

Construction of suffix array – The algorithm. Example (cont.):

A = assassin (indices 0 1 2 3 4 5 6 7)

H=4: assassin | assin | in | n | sassin | sin | ssassin | ssin

That’s it, we are sorted!

42

Construction of suffix array – Complexity summary

Sorting by first character: O(N).

O(log N) stages of O(N) operations = O(N log N).

Total: time O(N log N); space: 2 integer arrays of size N.

43

The Article Overview

1. A search algorithm in O(P + log N) (assuming we have already computed Pos[ ] and the longest common prefix (lcp) information).

2. How to construct Pos[ ] in O(N log N) time and O(N) space.

3. An algorithm for computing the lcp information in O(N log N).

4. Algorithms for expected-time improvement.

44

How to find Longest Common Prefixes – the general idea

We don’t care what the lcp is between suffixes in the same H-bucket.

For A_p, A_q in the same H-bucket but different 2H-buckets:
– H ≤ lcp(A_p, A_q) < 2H
– lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})
– lcp(A_{p+H}, A_{q+H}) < H, which is why A_{p+H} and A_{q+H} are in different H-buckets. But which ones?

45

How to find Longest Common Prefixes – the general idea

If A_{p+H} and A_{q+H} were in adjacent H-buckets then the lcp is known. How?

If not, then for i < j:

lcp(A_{Pos[i]}, A_{Pos[j]}) = min_{k ∈ [i, j-1]} { lcp(A_{Pos[k]}, A_{Pos[k+1]}) }

46

How to find Longest Common Prefixes – the general idea

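H = 2

assassin  assin  in  n  sassin  sin  ssassin  ssin

Neighbour lcps of the last four suffixes: lcp(sassin, sin) = 1, lcp(sin, ssassin) = 1, lcp(ssassin, ssin) = 2

lcp(A_{p+H}, A_{q+H}) = min{1, 1, 2} = 1

Notice that if two neighbours are in the same H-bucket we can consider their lcp to be H, since lcp(A_{p+H}, A_{q+H}) < H and the minimum is therefore unaffected.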
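
The minimum rule can be checked with a short sketch. It assumes a naively built Pos and neighbour-lcp array; the names `lcp`, `neighbour_lcps` and `lcp_between` are made up for illustration.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def neighbour_lcps(text, pos):
    """hgt[k] = lcp(A_Pos[k-1], A_Pos[k]) for k >= 1; hgt[0] is unused."""
    return [0] + [lcp(text[pos[k - 1]:], text[pos[k]:]) for k in range(1, len(pos))]


def lcp_between(hgt, i, j):
    """lcp(A_Pos[i], A_Pos[j]) for i < j, as the minimum over the neighbours between."""
    return min(hgt[i + 1:j + 1])


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])
hgt = neighbour_lcps(text, pos)
# sassin is at rank 4 and ssin at rank 7, so this is min{1, 1, 2}.
value = lcp_between(hgt, 4, 7)
```

Taking the minimum over a slice costs O(N) per query, which is why a tree over these values is introduced.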

47

How to find lcp – algorithm and data structures – Hgt[]

During the construction stage, we build an array called Hgt[N], where Hgt[i] = lcp(A_{Pos[i-1]}, A_{Pos[i]}), initialized so that Hgt[i] = N+1 for every i.

In stage H=1: Hgt[i] = 0 for every A_{Pos[i]} that is first in its bucket.

In stage 2H: we update every Hgt[i] for which A_{Pos[i]} is the first in a newly created 2H-bucket.

48

How to find lcp – Hgt[] example:

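A = assassin, N = 8, so Hgt[i] is initialized to N+1 = 9. Hgt[0] has no left neighbour and is not shown.

H=1:

i        0      1         2   3   4    5     6       7
Pos[i]   3      0         6   7   5    4     2       1
suffix   assin  assassin  in  n   sin  ssin  sassin  ssassin
Hgt[i]   -      9         0   0   0    9     9       9

H=2:

i        0      1         2   3   4       5    6     7
Pos[i]   3      0         6   7   2       5    4     1
suffix   assin  assassin  in  n   sassin  sin  ssin  ssassin
Hgt[i]   -      9         0   0   0       1    1     9

lcp(ssin, sin) = 1 + lcp(sin, in) = 1 + min{lcp(in, n), lcp(n, sin)} = 1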

49

How to find lcp – Hgt[] example (cont.)

H=4:

i        0         1      2   3   4       5    6        7
Pos[i]   0         3      6   7   2       5    1        4
suffix   assassin  assin  in  n   sassin  sin  ssassin  ssin
Hgt[i]   -         3      0   0   0       1    1        2

lcp(assassin, assin) = 2 + lcp(sassin, sin) = 2 + 1 = 3

lcp(ssin, ssassin) = 2 + lcp(in, assin) = 2 + 0 = 2

50

How to find lcp –data structures

We need a data structure that will contain lcp(A_{Pos[j]}, A_{Pos[i]}) between any i and j (not just i and i+1, which Hgt[] supplies).

Hgt[] will become the leaves of a balanced binary tree called the interval tree.

51

How to find lcp –example of Interval tree

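[Figure: interval tree for A = assassin. The leaves are the neighbouring pairs (0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7), each holding its Hgt value; node values start at 9 (= N+1) and are replaced by 0, 1, 2 and 3 as the stages proceed.]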

52

How to find lcp – Complexity

Each time a leaf opens a new bucket we change Hgt[i] for that leaf.

That change requires O(log N) changes in the interval tree.

There are O(N) leaves that open a new bucket.

In total we get O(N log N) to get all lcp values.
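
A minimal sketch of such a tree, assuming each internal node stores the minimum of its two children (which is what the min formula for non-adjacent suffixes needs). The array layout and the class name `MinTree` are illustrative choices, not the paper's data structure.

```python
class MinTree:
    """Balanced binary tree over Hgt values with O(log N) update and range minimum."""

    def __init__(self, values):
        self.n = len(values)
        # Leaves live at indices n .. 2n-1; node i has children 2i and 2i+1.
        self.tree = [float("inf")] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def update(self, leaf, value):
        """Set one leaf and repair the O(log N) nodes on the path to the root."""
        i = leaf + self.n
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def query(self, lo, hi):
        """Minimum of the leaves lo .. hi-1."""
        best = float("inf")
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                best = min(best, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = min(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best


# Leaf k stands for the neighbouring pair (k, k+1); every leaf starts at N+1 = 9.
tree = MinTree([9] * 7)
for leaf, value in enumerate([3, 0, 0, 0, 1, 1, 2]):  # final Hgt[1..7] for assassin
    tree.update(leaf, value)
sassin_vs_ssin = tree.query(4, 7)  # lcp(A_Pos[4], A_Pos[7])
```

With this layout, lcp(A_{Pos[i]}, A_{Pos[j]}) for i < j is `tree.query(i, j)`.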

53

The Article Overview

1. A search algorithm in O(P + log N) (assuming we have already computed Pos[ ] and the longest common prefix (lcp) information).

2. How to construct Pos[ ] in O(N log N) time and O(N) space.

3. An algorithm for computing the lcp information in O(N log N).

4. Algorithms for expected-time improvement.

54

Expected-case time improvement of the construction of Pos[]

Assumption: all N-symbol strings are equally likely.

– Under this assumption, the expected length of the longest repeated substring is O(log_{|Σ|} N).

This immediately implies that the expected construction time of Pos[] is reduced to O(N log log N). Why? The sort is finished once H exceeds the length of the longest repeated substring, so only O(log log N) stages are needed.

Next is a way to reduce it to O(N).

55

Expected-case time improvement of the construction of Pos[]

Let T = ⌊log_{|Σ|} N⌋.

We encode each possible T-length string as an integer with the isomorphism Int_T(u).

Map each A_p to Int_T(A_p) ∈ [0, |Σ|^T - 1]:

– Int_T(A_p) = a_p · |Σ|^{T-1} + ⌊Int_T(A_{p+1}) / |Σ|⌋

56

Example of the mapping

Int_T(A_p) = a_p · |Σ|^{T-1} + ⌊Int_T(A_{p+1}) / |Σ|⌋

|Σ| = 4 with a = 0, i = 1, n = 2, s = 3

N = 8, T = ⌊log_4 8⌋ = 1

p   suffix     Int_T(A_p)
0   assassin   0·4^0 + 0 = 0
1   ssassin    3·4^0 + 0 = 3
2   sassin     3·4^0 + 0 = 3
3   assin      0·4^0 + 0 = 0
4   ssin       3·4^0 + 0 = 3
5   sin        3·4^0 + 0 = 3
6   in         1·4^0 + 0 = 1
7   n          2·4^0 + 0 = 2
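
The recurrence can be evaluated right to left in one pass. The sketch below assumes that Int_T of the empty suffix is 0 and takes T as a parameter; the function name `int_t_values` and the alphabet string "ains" (ordering the symbols as a=0, i=1, n=2, s=3) are illustrative.

```python
def int_t_values(text, alphabet, t):
    """Int_T(A_p) for every p, computed from p = N-1 down to p = 0."""
    sigma = len(alphabet)
    code = {symbol: k for k, symbol in enumerate(alphabet)}
    # values[N] is Int_T of the empty suffix, taken to be 0 (an assumption).
    values = [0] * (len(text) + 1)
    for p in range(len(text) - 1, -1, -1):
        # Leading symbol shifted to the top digit, rest taken from A_{p+1}.
        values[p] = code[text[p]] * sigma ** (t - 1) + values[p + 1] // sigma
    return values[:-1]


keys = int_t_values("assassin", "ains", 1)  # the T = 1 example
```

Each value takes constant arithmetic given the next one, which is the reason all N values cost O(N) in total.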

57

Expected-case time improvement of the construction of Pos[]

By the definition of Int_T(A_p), it takes O(N) to compute the Int_T(A_p) values of all suffixes.

So now, instead of starting with H = 1, we start with H = ⌊log_{|Σ|} N⌋.

But since the expected length of the longest repeated substring is O(log_{|Σ|} N), we will have O(1) stages of the radix sort.

Thus, the total expected time for constructing Pos[] is O(N).

58

So is a suffix array better than a suffix tree?

Construction time
– Suffix array: O(N log N); for small |Σ|, O(N), but needs additional space
– Suffix tree: O(N)

Time complexity (search)
– Suffix array: O(P + log N), good for large alphabets
– Suffix tree: O(P log |Σ|)

Space complexity
– Suffix array: requires 2N integers; this is the main advantage
– Suffix tree: O(N)

Dependent on |Σ|?
– Suffix array: No
– Suffix tree: Yes

59