Compressed Suffix Arrays and Suffix Trees Roberto Grossi, Jeffery Scott Vitter

Post on 17-Jan-2016


Outline

Reminders
Motivation
Compression results
  Time & space bounds
  Compressed Suffix Tree
  Compressed Suffix Array
Proof of bounds

Reminder - Symbols

$T = t_1 t_2 \dots t_{n-1}$: text of length $n-1$
End-of-file symbol # at the $n$th position
$T[i,n]$ is suffix $i$ of text $T$, for $i = 1,\dots,n$

Reminder - Symbols

$P = p_1 p_2 \dots p_m$: pattern of length $m$
$0 < \varepsilon \le 1$

Reminder - Main Goal

Search for string pattern P within text T
Support fast queries
Text T is fully scanned only once

Reminder – Suffix Trees

Leaf with value $i$ represents suffix $T[i,n]$
Build time: $O(n)$
Search time: $O(m)$
Structure space: $O(n)$

Reminder – Suffix Arrays

Suffixes are kept in lexicographic order
SA[i] = the starting position in T of the i-th suffix in that order

Σ = {a,b}, with a < # < b
T = bbba#

i        1    2    3     4      5
SA[i]    4    5    3     2      1
suffix   a#   #    ba#   bba#   bbba#

Reminder – Suffix Arrays

Build time: $O(n \log n)$
Search time: $O(m + \log n)$
Structure space: $O(n)$
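
To make the suffix array reminder concrete, here is a minimal Python sketch that builds SA for the slide's text T = bbba# and searches it by binary search. The names `ORDER`, `key`, `suffix_array` and `occurrences` are illustrative assumptions, not from the slides. The construction is a naive sort rather than an $O(n \log n)$ algorithm, and the search costs $O(m \log n)$ rather than the $O(m + \log n)$ bound quoted above, which needs extra longest-common-prefix information.

```python
ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b, the order used on the slide


def key(s):
    """Translate a string into a list of ranks so Python compares it in slide order."""
    return [ORDER[c] for c in s]


def suffix_array(text):
    """Naive construction: sort the 1-based start positions of all suffixes."""
    return sorted(range(1, len(text) + 1), key=lambda i: key(text[i - 1:]))


def occurrences(text, sa, pattern):
    """Binary-search SA for the block of suffixes that start with pattern."""
    m, p = len(pattern), key(pattern)

    def prefix(idx):
        # First m symbols of the suffix stored at (0-based) SA position idx.
        start = sa[idx] - 1
        return key(text[start:start + m])

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "bbba#"
sa = suffix_array(text)
print(sa)                           # [4, 5, 3, 2, 1]
print(occurrences(text, sa, "bb"))  # [1, 2]
```

The returned positions are 1-based, matching the slides' indexing.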

Motivation

So far
  Greedy in space
  Fast searching
Need for space-efficient text indexing
  Reduce both space and query time

Compressed Suffix Tree

Build time: $O(n)$
Search time: $O(m/\log n + (\log n)^{\varepsilon})$
Structure space: $(\varepsilon^{-1} + O(1))\,n$ bits

Compressed Suffix Tree

Build suffix array
Build compressed suffix tree
  Patricia tries
  Compress suffix array

CSA Basic Operations

Compress(T, SA)
  Return succinct representation of SA
  Retain T
  Discard SA
Lookup(i)
  Return SA[i]
  Use compressed SA

CSA Primary measures

Compress
  Preprocessing time for the compressed SA
  Space of the compressed SA
Lookup
  Query time

Compressed Suffix Array

Build time: $O(n)$
Structure space: $\tfrac12 n \log\log n + O(n)$ bits
Lookup time: $O(\log\log n)$

Suffix Arrays Optimization

Main idea
  Decomposition scheme
  Recursive structure of permutations

Decomposition Scheme

Levels $k = 0,\dots,l$
$SA_0 = SA$ (original SA), $n_0 = n$
$n = |T|$; assumption: $n$ is a power of 2
$n_k = n/2^k$
$SA_k$ is a permutation of $\{1,2,\dots,n_k\}$

$SA_k$ Succinct Representation

4 main steps:
1. Produce bit vector $B_k$
2. Map $B_k$ 0's to 1's
3. Compute the number of 1's for each prefix of $B_k$, using function $rank_k(j)$
4. 'Pack' $SA_k$

Step #1: Produce bit vector $B_k$

$|B_k| = n_k$
$B_k[i] = 1$ if $SA_k[i]$ is even
$B_k[i] = 0$ if $SA_k[i]$ is odd

Example: T = bba# (with a < # < b)

i      1  2  3  4
SA_0   3  4  2  1
B_0    0  1  1  0

Step #2: Map $B_k$ 0's to 1's

New function $\Psi_k(i)$, $i = 1,\dots,n_k$:
$\Psi_k(i) = j$ if $SA_k[i]$ is odd and $SA_k[j] = SA_k[i] + 1$
$\Psi_k(i) = i$ otherwise ($SA_k[i]$ is even)

Example: T = bba# (with a < # < b)

i      1  2  3  4
SA_0   3  4  2  1
B_0    0  1  1  0
Ψ_0    2  2  3  3

Step #3: Compute 1's for $B_k$

Recall function $rank_k(j)$, $j = 1,\dots,n_k$:
$rank_k(j)$ = number of 1's in the first $j$ bits of $B_k$

Example: T = bba# (with a < # < b)

i        1  2  3  4
SA_0     3  4  2  1
B_0      0  1  1  0
rank_0   0  1  2  2

Step #4: 'Pack' $SA_k$

Pack the even values of $SA_k$
  Divide by 2
  New permutation of $\{1,2,\dots,n_{k+1}\}$
  $n_{k+1} = n_k/2 = n/2^{k+1}$
Store the new permutation in $SA_{k+1}$
Remove $SA_k$
  $|SA_{k+1}| = |SA_k|/2$
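
The four steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `decompose_level` is hypothetical, all arrays are kept as plain (uncompressed) Python lists, and element `t` of each list holds the value for the slides' 1-based index `t + 1`. The input below is SA_0 for T = bba# under the order a < # < b.

```python
def decompose_level(sa_k):
    """Apply steps 1-4 to SA_k and return (B_k, Psi_k, rank_k, SA_{k+1})."""
    # position[v] = 1-based index i with SA_k[i] = v
    position = {v: i for i, v in enumerate(sa_k, start=1)}

    # Step 1: B_k[i] = 1 if SA_k[i] is even, 0 if it is odd.
    b = [1 if v % 2 == 0 else 0 for v in sa_k]

    # Step 2: an even entry maps to itself; an odd entry maps to the
    # index j that holds the next value, SA_k[j] = SA_k[i] + 1.
    psi = [i if v % 2 == 0 else position[v + 1]
           for i, v in enumerate(sa_k, start=1)]

    # Step 3: rank_k(j) = number of 1s among the first j bits of B_k.
    rank, ones = [], 0
    for bit in b:
        ones += bit
        rank.append(ones)

    # Step 4: keep the even values in order and halve them.
    sa_next = [v // 2 for v in sa_k if v % 2 == 0]
    return b, psi, rank, sa_next


b0, psi0, rank0, sa1 = decompose_level([3, 4, 2, 1])
print(b0)     # [0, 1, 1, 0]
print(psi0)   # [2, 2, 3, 3]
print(rank0)  # [0, 1, 2, 2]
print(sa1)    # [2, 1]
```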

Example: level 0, steps 1-3

Example: level 0, step 4

Lemma: Reconstruct $SA_k$

Results of phase $k$: $B_k$, $\Psi_k$, $rank_k$, $SA_{k+1}$
Reconstruct $SA_k$:
$SA_k[i] = 2 \cdot SA_{k+1}[rank_k(\Psi_k(i))] + (B_k[i] - 1)$, for $i = 1,\dots,n_k$
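
As a quick check of the lemma, the sketch below evaluates the formula on the level-0 data for T = bba# (order a < # < b), for which SA_0 = 3 4 2 1 and SA_1 = 2 1. The function name `reconstruct` and the use of plain Python lists are assumptions made for illustration; the lists are 0-based, so the slides' index `i` is stored at position `i - 1`.

```python
def reconstruct(i, b, psi, rank, sa_next):
    """SA_k[i] = 2 * SA_{k+1}[rank_k(Psi_k(i))] + (B_k[i] - 1), with 1-based i."""
    j = psi[i - 1]            # Psi_k(i)
    r = rank[j - 1]           # rank_k(Psi_k(i))
    return 2 * sa_next[r - 1] + (b[i - 1] - 1)


# Level-0 structures for T = bba#
b0 = [0, 1, 1, 0]
psi0 = [2, 2, 3, 3]
rank0 = [0, 1, 2, 2]
sa1 = [2, 1]

print([reconstruct(i, b0, psi0, rank0, sa1) for i in range(1, 5)])  # [3, 4, 2, 1]
```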

Proof, case 1: $B_k[i] = 1$

Claim: $SA_k[i] = 2 \cdot SA_{k+1}[rank_k(\Psi_k(i))] + (B_k[i] - 1)$
Step #4: $SA_k[i]/2$ is stored in the $rank_k(i)$th entry of $SA_{k+1}$, so $SA_k[i] = 2 \cdot SA_{k+1}[rank_k(i)]$
Step #2: $\Psi_k(i) = i$
Since $B_k[i] - 1 = 0$, the claim follows

Proof, case 2: $B_k[i] = 0$

Claim: $SA_k[i] = 2 \cdot SA_{k+1}[rank_k(\Psi_k(i))] + (B_k[i] - 1)$
Let $\Psi_k(i) = j$
Step #2: $SA_k[i] = SA_k[j] - 1$ and $B_k[j] = 1$
Apply case 1 to $j$: $SA_k[j] = 2 \cdot SA_{k+1}[rank_k(j)]$
Since $B_k[i] - 1 = -1$, the claim follows

Example, case 1: $B_k[i] = 1$

$SA_0[2] = ?$
$B_0[2] = 1$, $\Psi_0(2) = 2$, $rank_0(2) = 1$
$SA_0[2]/2$ is stored in the 1st entry of $SA_1$
$SA_0[2] = 2 \cdot SA_1[1] = 2 \cdot 8 = 16$

Example, case 2: $B_k[i] = 0$

$SA_0[3] = ?$
$B_0[3] = 0$, $\Psi_0(3) = 14$, $rank_0(14) = 6$
$SA_0[14] = 2 \cdot SA_1[6] = 2 \cdot 16 = 32$
$SA_0[3] = SA_0[14] - 1 = 32 - 1 = 31$

Example - Decomposition

Determining l

$n_0 = n = 32$
$n_3 = 4 \approx n/\log n$, so $SA_3$ ($n_3$ entries of $\log n$ bits each) can be stored in $\le n$ bits
Conclusion: $l = \log\log n$
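
The choice of l can be checked numerically. The sketch below assumes that l is rounded up to an integer (the slide writes only log log n, so the ceiling is an assumption) and that logarithms are base 2; `num_levels` is a hypothetical helper name.

```python
import math


def num_levels(n):
    """Smallest integer l with 2**l >= log2(n), i.e. ceil(log2(log2(n))), for n >= 2."""
    return math.ceil(math.log2(math.log2(n)))


n = 32
l = num_levels(n)
n_l = n // 2 ** l                     # number of entries left in SA_l
bits_for_sa_l = n_l * int(math.log2(n))

print(l)              # 3
print(n_l)            # 4
print(bits_for_sa_l)  # 20, which is at most n = 32
```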

CSA Structure

Levels $k = 0,1,\dots,l-1$: store $B_k$, $\Psi_k$, $rank_k$
Final level $k = l$: store only $SA_l$

CSA Structure & Build

$B_k$
  $n_k$ bits per vector
  $O(n_k)$ build
$rank_k$
  $O(n_k (\log\log n_k)/\log n_k)$ bits (as shown before)
  $O(n_k)$ build
$SA_l$
  $(n/2^l) \log n$ bits

CSA Structure space - $\Psi_k$

List method
$2^{2^k}$ lists, one for each possible 'prefix' (of length $2^k$) of the suffixes
  The number of lists increases with $k$
$L_k$ = concatenation of all $2^{2^k}$ lists
  $|L_k| = n_k/2$
  $|L_k|$ decreases with $k$

CSA Structure space - $\Psi_k$

For $i = 1,\dots,n_k/2$:
  $j$ = index of the $i$th 1 in $B_k$
  The pattern at positions $2^k(SA_k[j]-1),\dots,2^k \cdot SA_k[j]-1$ of $T$ is matched to a list, and $j$ is appended to that list

Level 0

a list = {2,14,15,18,23,28,30,31}
b list = {7,8,10,13,16,17,21,27}

Levels 1, 2

Level 1
  aa = {} (empty list)
  ab = {9}
  ba = {1,6,12,14}
  bb = {2,4,5}
Level 2
  abba = {5,8}
  baba = {1}
  babb = {4}

Reconstruct $\Psi_k$

If $B_k[i] = 1$:
  $\Psi_k(i) = i$
If $B_k[i] = 0$:
  $h$ = number of 0's in the first $i$ bits of $B_k$, i.e. $h = i - rank_k(i)$
  $\Psi_k(i) = L_k[h]$
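
The list method and the reconstruction rule can be sketched together in Python. The helper names (`pattern_lists`, `concatenate`, `psi`) are assumptions for illustration, the lists are stored uncompressed, and rank is recomputed by summing bits instead of using a constant-time structure. The sketch relies on the patterns containing only a and b, so Python's string order agrees with a < b. The demo uses T = bba# with SA_0 = 3 4 2 1.

```python
def pattern_lists(text, sa_k, k):
    """Group each index j with even SA_k[j] by the 2**k symbols of T
    at (1-based) positions 2**k * (SA_k[j] - 1) ... 2**k * SA_k[j] - 1."""
    width = 2 ** k
    lists = {}
    for j, value in enumerate(sa_k, start=1):
        if value % 2 == 0:
            first = width * (value - 1)                  # 1-based position in T
            pattern = text[first - 1:first - 1 + width]  # 0-based slice
            lists.setdefault(pattern, []).append(j)
    return lists


def concatenate(lists):
    """L_k: all lists joined in lexicographic order of their patterns."""
    return [j for pattern in sorted(lists) for j in lists[pattern]]


def psi(i, b_k, L_k):
    """Reconstruct Psi_k(i) from B_k and L_k (1-based i)."""
    if b_k[i - 1] == 1:
        return i
    h = i - sum(b_k[:i])      # zeros among the first i bits: i - rank_k(i)
    return L_k[h - 1]


text = "bba#"
sa_0 = [3, 4, 2, 1]
b_0 = [0, 1, 1, 0]
lists = pattern_lists(text, sa_0, 0)
L_0 = concatenate(lists)
print(lists)                                    # {'a': [2], 'b': [3]}
print([psi(i, b_0, L_0) for i in range(1, 5)])  # [2, 2, 3, 3]
```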

Example: Reconstruct $\Psi_k$

$\Psi_0(25) = ?$
$B_0[25] = 0$
$h = 25 - rank_0(25) = 25 - 12 = 13$
$\Psi_0(25) = L_0[13] = 16$

Example: Reconstruct $\Psi_k$

$rank_0(16) = 8$, so the next value needed is $SA_1[8] = ?$
This requires $\Psi_1(8) = ?$
$B_1[8] = 0$
$h = 8 - rank_1(8) = 8 - 5 = 3$
$\Psi_1(8) = L_1[3] = 6$

Lemma

$s$ sorted integers, $w$ bits per number, $s < 2^w$
Storage: $s(2 + w - \log s) + O(s/\log\log s)$ bits
Retrieve the $h$th integer: $O(1)$ time
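
One way to see where the $s(2 + w - \log s)$ term can come from is a split encoding in the style of Elias-Fano coding: the top $\lfloor \log s \rfloor$ bits of each number are written in unary, and the remaining bits are stored verbatim. Whether this matches the lemma's construction in every detail is an assumption, and the sketch below finds the $h$th 1 by a linear scan instead of a constant-time select structure, so it illustrates only the space bound. The names `encode` and `retrieve` are hypothetical.

```python
def encode(values, w):
    """Split each w-bit value into a high part (unary-coded) and a low part."""
    s = len(values)
    z = s.bit_length() - 1              # floor(log2 s)
    low_bits = w - z                    # bits kept verbatim per value
    low = [v & ((1 << low_bits) - 1) for v in values]
    high = []                           # one 1 per value, one 0 per bucket step
    bucket = 0
    for v in values:                    # values must be sorted
        while bucket < (v >> low_bits):
            high.append(0)
            bucket += 1
        high.append(1)
    return high, low, low_bits


def retrieve(h, high, low, low_bits):
    """Return the h-th value (1-based); linear scan stands in for select."""
    ones = 0
    for pos, bit in enumerate(high):
        ones += bit
        if bit == 1 and ones == h:
            zeros_before = pos + 1 - h  # equals the high part of the value
            return (zeros_before << low_bits) | low[h - 1]
    raise IndexError(h)


values = [2, 14, 15, 18, 23, 28, 30, 31]   # the level-0 'a list', w = 5 bits
high, low, low_bits = encode(values, 5)
total_bits = len(high) + len(low) * low_bits
print([retrieve(h, high, low, low_bits) for h in range(1, 9)] == values)  # True
print(total_bits)                                                         # 31
```

Here $s = 8$ and $w = 5$, so the leading term $s(2 + w - \lfloor \log s \rfloor)$ is 32 bits, and the encoding above fits within it.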

Store $L_k$

Storage: $n(1/2 + 3/2^{k+1}) + O(n/(2^k \log\log n))$ bits
Retrieve the $h$th integer: $O(1)$ time
Preprocessing time: $O(n/2^k + 2^{2^k})$

CSA Structure - Summary

Structure   Space (bits)
$B_k$       $n_k$
$rank_k$    $O(n_k (\log\log n_k)/\log n_k)$
$SA_l$      $(n/2^l) \log n$
$\Psi_k$    $n(1/2 + 3/2^{k+1}) + O(n/(2^k \log\log n))$

Summing it up…

$n \log n / 2^l + \tfrac12 l n + 5n + O(n/\log\log n)$
With $l = \log\log n$: $n \log n / 2^l \le n$ and $\tfrac12 l n \le \tfrac12 n \log\log n + n$
$\tfrac12 n \log\log n + O(n)$ bits of storage
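
The total can be traced term by term from the per-structure bounds. The following derivation is a sketch under the assumption that the sums run over levels $k = 0,\dots,l-1$ and that the rank structures and the $O(\cdot)$ parts of $\Psi_k$ are collected into the lower-order term.

```latex
\begin{align*}
\underbrace{\frac{n \log n}{2^{l}}}_{SA_l}
 + \underbrace{\sum_{k=0}^{l-1} n_k}_{B_k}
 + \underbrace{\sum_{k=0}^{l-1} n\left(\tfrac12 + \tfrac{3}{2^{k+1}}\right)}_{\Psi_k}
 + O\!\left(\frac{n}{\log\log n}\right)
 &< \frac{n \log n}{2^{l}} + 2n + \tfrac12\, l\, n + 3n
    + O\!\left(\frac{n}{\log\log n}\right) \\
 &= \frac{n \log n}{2^{l}} + \tfrac12\, l\, n + 5n
    + O\!\left(\frac{n}{\log\log n}\right)
\end{align*}
```

The two geometric sums give $\sum_k n/2^k < 2n$ and $\sum_k 3n/2^{k+1} < 3n$, which is where the $5n$ comes from.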

Preprocess - summary

Structure   Preprocessing time
$B_k$       $O(n_k)$
$rank_k$    $O(n_k)$
$\Psi_k$    $O(n/2^k + 2^{2^k})$

Summing over levels $0,\dots,l-1$: preprocessing time $O(n)$

lookup(i)

lookup(i) refers to $SA_0[i]$
Need to reconstruct $SA_0[i]$
New procedure: rlookup(i,k)
  Recursive
  Based on the lemma for reconstructing $SA_k$

rlookup(i,k)

rlookup(i,k)
  if k = l
    return SA_l[i]
  else
    return 2*rlookup(rank_k(Ψ_k(i)), k+1) + (B_k[i]-1)
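
The recursion can be exercised end to end with a short Python sketch. Several things here are assumptions made for illustration: the function names, the naive suffix sort, and the fact that $B_k$, $\Psi_k$ and $rank_k$ are kept as plain lists, so the sketch shows the logic of lookup and none of the space savings. The 32-character text is not given on the slides; it is assumed here because it reproduces the values the slides quote (for example $SA_0[2] = 16$ and $SA_0[3] = 31$).

```python
import math

ORDER = {"a": 0, "#": 1, "b": 2}  # a < # < b


def suffix_array(text):
    """Naive construction: sort the 1-based start positions of all suffixes."""
    return sorted(range(1, len(text) + 1),
                  key=lambda i: [ORDER[c] for c in text[i - 1:]])


def compress(sa, levels):
    """Return ([(B_k, Psi_k, rank_k) for k < levels], SA_l) as 0-based lists."""
    structure = []
    for _ in range(levels):
        position = {v: i for i, v in enumerate(sa, start=1)}
        b = [1 - v % 2 for v in sa]                       # 1 for even values
        psi = [i if v % 2 == 0 else position[v + 1]
               for i, v in enumerate(sa, start=1)]
        rank, ones = [], 0
        for bit in b:
            ones += bit
            rank.append(ones)
        structure.append((b, psi, rank))
        sa = [v // 2 for v in sa if v % 2 == 0]           # pack even values
    return structure, sa


def rlookup(i, k, structure, sa_last):
    """Recursive lookup following the lemma for reconstructing SA_k."""
    if k == len(structure):
        return sa_last[i - 1]
    b, psi, rank = structure[k]
    j = psi[i - 1]
    return 2 * rlookup(rank[j - 1], k + 1, structure, sa_last) + (b[i - 1] - 1)


def lookup(i, structure, sa_last):
    return rlookup(i, 0, structure, sa_last)


text = "abbabbabbabbabaaabababbabbbabba#"                 # assumed example text, n = 32
sa = suffix_array(text)
levels = math.ceil(math.log2(math.log2(len(text))))       # 3
structure, sa_last = compress(sa, levels)
print(sa_last)                         # [2, 3, 4, 1]
print(lookup(5, structure, sa_last))   # 17
```

Each call does a constant amount of work per level, so the depth of the recursion, `levels + 1`, determines the lookup cost.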

Reconstruct $SA_k$

Lemma: $SA_k[i] = 2 \cdot SA_{k+1}[rank_k(\Psi_k(i))] + (B_k[i] - 1)$
lookup(i) = rlookup(i,0)

Example - lookup(i)

lookup(5) = rlookup(5,0), with l = 3
rlookup(5,0) = 2*rlookup(rank_0(Ψ_0(5)),1) + (B_0[5]-1)
             = 2*rlookup(10,1) + (-1)

Example – cont.

rlookup(10,1) = 2*rlookup(rank_1(Ψ_1(10)),2) + (B_1[10]-1)
              = 2*rlookup(7,2) + (-1)

Example - cont.

rlookup(7,2) = 2*rlookup(rank_2(Ψ_2(7)),3) + (B_2[7]-1)
             = 2*rlookup(2,3) + (-1)

Example - cont.

rlookup(2,3) = SA_3[2] = 3
lookup(5) = 2*(2*(2*3+(-1))+(-1))+(-1) = 2*(2*5+(-1))+(-1) = 2*9+(-1) = 17

lookup(i)

lookup(i) = rlookup(i,0)
l+1 levels, O(1) per level
$O(\log\log n)$ lookup time

Compressed Suffix Array

Build time: $O(n)$
Structure space: $\tfrac12 n \log\log n + O(n)$ bits
Lookup time: $O(\log\log n)$

Compressed Suffix Tree

Build time: $O(n)$
Search time: $O(m/\log n + (\log n)^{\varepsilon})$
Structure space: $(\varepsilon^{-1} + O(1))\,n$ bits