compressed suffix arrays and suffix trees with applications to text indexing and string matching
TRANSCRIPT
Jeffrey Scott Vitter, Roberto Grossi

Agenda
A (very) short review of suffix arrays
Introduction
Problem definition
Information theory reasoning
– Simple solution, round 2
Compressed suffix arrays in ½ n loglog n + O(n) bits and O(loglog n) access time
Rank and select problem definitions
– Rank data structure
Compressed suffix arrays in ε^-1 n + O(n) bits and O(log^ε n) access time
Select data structure (if time permits)

Short review on suffix arrays
A suffix array is a sorted array of the suffixes of a string S, represented by an array of pointers to the suffixes of S.
For example, the string "telaviv" and its corresponding suffix array:
S0 = telaviv
S1 = elaviv
S2 = laviv
S3 = aviv
S4 = viv
S5 = iv
S6 = v
Suffix array: 3 1 5 2 0 6 4
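
The definition above can be sketched directly in Python. This is a minimal, quadratic-space illustration (it sorts the suffixes themselves), not a construction anyone would use on large texts; the function name `suffix_array` is chosen here for illustration.

```python
def suffix_array(s):
    """Return the start positions (0-based) of the suffixes of s in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array("telaviv"))  # [3, 1, 5, 2, 0, 6, 4]
```

The printed array matches the one on the slide: aviv, elaviv, iv, laviv, telaviv, v, viv.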

Introduction
Belongs to the succinct data structures branch
DNA genome strings (small alphabet, large strings)
Mainly a theoretical article

Problem definition
The algorithm is composed of two phases:
– compression
– lookup
Compress:
– given a suffix array SA, compress it to get its succinct representation
lookup(i):
– given the compressed representation, return SA[i]

Some definitions
We will deal (at first) with a binary alphabet
– Σ = {a,b}
We will add a special end-of-string symbol #
and set the order between the characters to be
– a < # < b (*)
Basic RAM model
– log(n) word size
– Word lookup and arithmetic in constant time

Information theory reasoning
All 16 binary strings of length 4, terminated by #, and their suffix arrays (order a < # < b):
String   Suffix array
aaaa#    1 2 3 4 5
aaab#    1 2 3 5 4
aaba#    1 4 2 5 3
aabb#    1 2 5 4 3
abaa#    3 4 1 5 2
abab#    1 3 5 2 4
abba#    4 1 5 3 2
abbb#    1 5 4 3 2
baaa#    2 3 4 5 1
baab#    2 3 5 1 4
baba#    4 2 5 3 1
babb#    2 5 1 4 3
bbaa#    3 4 5 2 1
bbab#    3 5 2 4 1
bbba#    4 5 3 2 1
bbbb#    5 4 3 2 1
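
The table can be reproduced with a short enumeration. The sketch below assumes the order a < # < b from the definitions slide and uses 1-based positions; `ORDER`, `suffix_array` and `distinct_suffix_arrays` are names introduced here for illustration.

```python
from itertools import product

ORDER = {"a": 0, "#": 1, "b": 2}  # the slides' convention: a < # < b


def suffix_array(text):
    """1-based suffix array of text under the order a < # < b."""
    key = lambda i: [ORDER[c] for c in text[i:]]
    return [i + 1 for i in sorted(range(len(text)), key=key)]


def distinct_suffix_arrays(m):
    """Set of suffix arrays over all binary strings of length m followed by '#'."""
    return {tuple(suffix_array("".join(p) + "#")) for p in product("ab", repeat=m)}


print(len(distinct_suffix_arrays(4)))  # 16, out of 5! = 120 permutations
print(suffix_array("abab#"))           # [1, 3, 5, 2, 4]
```

Only 16 of the 120 permutations of {1..5} occur, which is the counting argument behind the n-bit lower bound on the next slide.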

Information theory reasoning (2)
The suffix array takes n log(n) bits.
There is a one-to-one correspondence between the suffix array and the string.
– Construction details
The number of possible suffix arrays is 2^(n-1).
– Perfect compression: n bits (the string itself)
– The cost of a lookup is then Ω(n); see the previous lecture

"Simple" solution, round 2: a different approach
Let's pack each log n bits together to create a new alphabet.
So the text length will be n/log n and the pattern length will be m/log n.
The suffix array will take O(n) bits.
Searching becomes hard (alignment):
– the text is aligned but the pattern isn't: log n cases
– the text isn't aligned: the pattern occurs k bits to the right of a word boundary
We need to pad the pattern with k bits and check it, so we need to check 2^k cases.
k ~ log n => n different cases to check, assuming we know how much to pad!

General framework
Abstract data type optimization [Jacobson '89]
Number of distinct data structures = C(n) => each data structure occupies O(log C(n)) bits.
This doesn't guarantee the time complexity of the supported operations.

Compressed suffix arrays in ½ n loglog n + O(n) bits and O(loglog n) access time
Recursive method in nature
– Takes advantage of the structure of the suffixes
Let SA_0 be the uncompressed suffix array and N_0 its size (assume a power of 2).
In phase k of the compression we start with SA_k, of size N_k = N_0 / 2^k, and create SA_{k+1}, of size N_{k+1} = N_0 / 2^(k+1).
SA_{k+1} holds a permutation of {1..N_{k+1}}.

SA_{k+1} construction
Create the bit vector B_k:
– B_k[i] = 1 iff SA_k[i] is even
Create the rank vector:
– rank_k(j) counts the number of 1 bits in the first j bits of B_k
Create the Ψ_k vector (stores the 0-to-1 companion relation):
– Ψ_k(i) = j if SA_k[i] is odd and SA_k[j] = SA_k[i] + 1
– Ψ_k(i) = i otherwise
Store the even values from SA_k, divided by 2, in SA_{k+1}.
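
One phase of the construction can be sketched as follows. Everything is stored as plain Python lists rather than succinctly, positions are 1-based as on the slides (index 0 is padding), and the function name `compress_phase` is an assumption of this sketch.

```python
def compress_phase(sa):
    """From SA_k (a permutation of 1..N_k, N_k even) build B_k, rank_k, Psi_k, SA_{k+1}."""
    n = len(sa)
    pos = {v: i for i, v in enumerate(sa, start=1)}    # value -> position in SA_k
    B = [0] + [1 if v % 2 == 0 else 0 for v in sa]     # B[i] = 1 iff SA_k[i] is even
    rank = [0] * (n + 1)
    for i in range(1, n + 1):
        rank[i] = rank[i - 1] + B[i]                   # ones among the first i bits
    # Even entries map to themselves, odd entries to the position holding SA_k[i] + 1.
    psi = [0] + [i if B[i] else pos[sa[i - 1] + 1] for i in range(1, n + 1)]
    sa_next = [v // 2 for v in sa if v % 2 == 0]       # even values, halved, in order
    return B, rank, psi, sa_next


B, rank, psi, sa_next = compress_phase([4, 7, 1, 6, 8, 3, 5, 2])
print(psi[1:])   # [1, 5, 8, 4, 5, 1, 4, 8]
print(sa_next)   # [2, 3, 4, 1]
```

The input used here is the array SA_2 of the running example, and the output agrees with the Ψ_2 and SA_3 rows shown on the example slides.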

An example
The 32-character string T = abbabbabbabbabaaabababbabbbabba#

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T        a  b  b  a  b  b  a  b  b  a  b  b  a  b  a  a
SA_0    15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B_0      0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank_0   0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8
Ψ_0      2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16

i       17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T        a  b  a  b  a  b  b  a  b  b  b  a  b  b  a  #
SA_0    12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B_0      1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank_0   9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
Ψ_0     17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27

How to compute SA_k from SA_{k+1}
Lemma 1
– Given the suffix array SA_k, let B_k, rank_k, Ψ_k and SA_{k+1} be the result of the transformation performed by phase k. We can reconstruct SA_k from SA_{k+1} by the following formula:
SA_k[i] = 2 * SA_{k+1}[rank_k(Ψ_k(i))] + (B_k[i] - 1)
– Let's split into 2 cases:
SA_k[i] is even (B_k[i] = 1)
SA_k[i] is odd (B_k[i] = 0)

Example, continued
Level 1
i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1      1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank_1   1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8
Ψ_1      1  2  9  4  5  6  1  6  9 12 14 12  2 14  4  5
Level 2
i        1  2  3  4  5  6  7  8
SA_2     4  7  1  6  8  3  5  2
B_2      1  0  0  1  1  0  0  1
rank_2   1  1  1  2  3  3  3  4
Ψ_2      1  5  8  4  5  1  4  8
Level 3
SA_3     2  3  4  1

Compress
– We keep l = O(loglog n) levels
– All levels except SA_l are saved implicitly
– For each level j = 0..l-1 we save B_j, rank_j and Ψ_j
– rank_j and Ψ_j are stored implicitly
– The size of SA_l, with l = loglog n, is
N_l log N_l = (n / 2^l) log(n / 2^l) = (n / log n) log(n / log n) = O(n) bits

Lookup
Just compute SA_k[i] recursively from SA_{k+1}, using Lemma 1.
Recursion depth: loglog n
All the data structures used have O(1) access time
=> O(loglog n) lookup cost
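
The compress and lookup phases can be put together in a short sketch. It keeps B_k, rank_k and Ψ_k as explicit lists (so it shows the recursion of Lemma 1, not the space bound); the names `phase`, `compress` and `lookup` are assumptions of this sketch, and `SA0` is the suffix array of the 32-character example.

```python
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]


def phase(sa):
    """One compression phase on SA_k; returned lists are 1-based (index 0 is padding)."""
    pos = {v: i for i, v in enumerate(sa, start=1)}
    B = [0] + [1 - v % 2 for v in sa]
    rank = [0] * len(B)
    for i in range(1, len(B)):
        rank[i] = rank[i - 1] + B[i]
    psi = [0] + [i if v % 2 == 0 else pos[v + 1] for i, v in enumerate(sa, start=1)]
    return (B, rank, psi), [v // 2 for v in sa if v % 2 == 0]


def compress(sa0, levels):
    """Keep (B_k, rank_k, Psi_k) for k < levels plus the explicit last-level array."""
    store, sa = [], list(sa0)
    for _ in range(levels):
        level, sa = phase(sa)
        store.append(level)
    return store, sa


def lookup(i, store, sa_last, k=0):
    """SA_k[i] via Lemma 1: 2 * SA_{k+1}[rank_k(Psi_k(i))] + (B_k[i] - 1)."""
    if k == len(store):
        return sa_last[i - 1]
    B, rank, psi = store[k]
    return 2 * lookup(rank[psi[i]], store, sa_last, k + 1) + (B[i] - 1)


store, sa3 = compress(SA0, 3)
print(sa3)                    # [2, 3, 4, 1]
print(lookup(3, store, sa3))  # 31
```

Each recursive call descends one level, so the number of calls equals the number of stored levels.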

How the data is stored
The B_k bit vector is stored explicitly
– O(N_k) space
– O(1) lookup
– O(N_k) preprocessing time
The rank_k vector is stored implicitly using Jacobson's rank data structure
– O(N_k (loglog N_k) / log N_k) space
– O(1) lookup
– O(N_k) preprocessing time
The Ψ_k vector is stored implicitly (using rank and select)
– O(1) access time
– n(1/2 + 3/2^(k+1)) + O(n / (2^k loglog n)) bits
– O(N_k + 2^(2^k)) preprocessing time

Ψ_k vector representation
For each 1 ≤ i ≤ N_k/2:
– let j be the index of the i-th 1 in B_k
– consider the 2^k symbols of T in positions 2^k * (SA_k[j] - 1) .. 2^k * SA_k[j] - 1
– these symbols precede the (2^k * SA_k[j])-th suffix in T
For each 2^k-bit pattern we keep an ordered list of the indices j in [1, N_k] corresponding to it.

Let's take a look
Level 0
– list a: {2,14,15,18,23,28,30,31} (8 entries)
– list b: {7,8,10,13,16,17,21,27} (8 entries)
Level 1
– list aa: {} (0 entries)
– list ab: {9} (1 entry)
– list ba: {1,6,12,14} (4 entries)
– list bb: {2,4,5} (3 entries)
Level 2
– lists aaaa, aaab, aaba, aabb, abaa, abab: {} (0 entries each)
– list abba: {5,8} (2 entries)
– list abbb: {} (0 entries)
– lists baaa, baab: {} (0 entries each)
– list baba: {1} (1 entry)
– list babb: {4} (1 entry)
– lists bbaa, bbab, bbba, bbbb: {} (0 entries each)

So what can we do with all the lists?
Concatenate them in lexicographic order to form the list L_k.
L_1 = {9,1,6,12,14,2,4,5}
Let's see how we can compute Ψ_k(i):
– If B_k[i] = 1 (SA_k[i] is even), it's simply i
– Otherwise:
because all the prefix patterns saved are in sorted order, the L_k list holds, up to the entry for i, one entry for each odd suffix up to and including i: h = i - rank_k(i)
So we can look up the h-th entry in L_k, and it will give us the answer
Simple example
L_2 = {5,8,1,4}
rank_2 = {1,1,1,2,3,3,3,4}
B_2 = {1,0,0,1,1,0,0,1}
Ψ_2 = {1,5,8,4,5,1,4,8}
Ψ_2(3) = ?
rank_2(3) = 1, h = 3 - 1 = 2, L_2[2] = 8, so Ψ_2(3) = 8
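
The rule h = i - rank_k(i) can be checked against the level-2 data of the example. In this sketch rank is computed by summing a prefix of B_k, standing in for the constant-time rank structure; `psi_from_list` is a name introduced here.

```python
def psi_from_list(i, B, L):
    """Psi_k(i) for 1-based i, given the bit vector B_k and the concatenated list L_k."""
    if B[i - 1] == 1:          # SA_k[i] is even: the entry maps to itself
        return i
    rank_i = sum(B[:i])        # number of 1s among the first i bits of B_k
    h = i - rank_i             # number of 0s (odd entries) up to and including i
    return L[h - 1]            # the h-th entry of L_k


B2 = [1, 0, 0, 1, 1, 0, 0, 1]
L2 = [5, 8, 1, 4]
print([psi_from_list(i, B2, L2) for i in range(1, 9)])  # [1, 5, 8, 4, 5, 1, 4, 8]
```

The result is the Ψ_2 row of the example; the same function applied to B_1 and L_1 = {9,1,6,12,14,2,4,5} gives the Ψ_1 row.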

Ψ_k vector representation
Lemma 2
Given s integers in sorted order, each containing w bits, where s < 2^w, we can store them in at most
s(2 + w - floor(log s)) + O(s / loglog s) bits
so that retrieving the h-th integer takes constant time.
Take the first z = floor(log s) bits of each integer, creating the integers q_1..q_s.
It's easy to see that q_1 ≤ ... ≤ q_i ≤ q_{i+1} ≤ ... ≤ q_s < 2^z ≤ s (we take the most significant bits, after all).
The remaining w - z bits of each integer form r_i.
[Figure: an integer S_i split into its z most significant bits q_i and its w - z least significant bits r_i.]

Ψ_k vector representation
Store the r_i in a simple array: (w - z) * s bits.
Store q_1..q_s in a table Q supporting select and rank in constant time.
The table Q is implemented in the following way: instead of saving the numbers themselves, we store the gaps
q_1, q_2 - q_1, q_3 - q_2, ..., q_s - q_{s-1}
in unary representation (a gap g is written as 0^g 1), and add a select data structure.
To get q_i we simply do select(i) and count the number of zeros before the i-th 1:
q_i = select(i) - rank(select(i)) = select(i) - i
The size of the Q table: the unary string has at most s + 2^z ≤ 2s bits, plus the select overhead of O(s / loglog s) bits.
So we can output S_i easily:
S_i = q_i * 2^(w-z) + r_i
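
The encoding of Lemma 2 can be sketched as below. The select operation is a linear scan here, standing in for the constant-time structure, and the bit vector is a Python list rather than packed words; `encode`, `select1` and `access` are names assumed for this sketch.

```python
def encode(values, w):
    """Split sorted w-bit integers into unary-coded high-part gaps and low parts."""
    s = len(values)
    z = s.bit_length() - 1                 # floor(log2 s), for s >= 1
    low_bits = w - z
    lows = [v & ((1 << low_bits) - 1) for v in values]   # r_i: the w - z low bits
    bits, prev = [], 0
    for v in values:
        q = v >> low_bits                  # q_i: the z high bits
        bits.extend([0] * (q - prev) + [1])   # gap written in unary as 0^gap 1
        prev = q
    return bits, lows, low_bits


def select1(bits, i):
    """1-based position of the i-th 1 bit (linear scan in this sketch)."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        seen += b
        if b and seen == i:
            return pos
    raise IndexError(i)


def access(h, bits, lows, low_bits):
    """Return the h-th stored integer (1-based): S_h = q_h * 2^(w-z) + r_h."""
    q = select1(bits, h) - h               # zeros before the h-th 1
    return (q << low_bits) | lows[h - 1]


values = [3, 9, 10, 27, 30, 31, 58, 63]
bits, lows, low_bits = encode(values, w=6)
print(access(4, bits, lows, low_bits))     # 27
print(len(bits))                           # 15, within the 2s = 16 bound
```

With s = 8 and w = 6 the high parts use z = 3 bits, so the unary string holds 8 ones and at most 7 zeros.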

Ψ_k vector representation
Lemma 3
We can store the concatenated list L_k used for Ψ_k in n(1/2 + 3/2^(k+1)) + O(n / (2^k loglog n)) bits, so that accessing the h-th element takes constant time, with preprocessing time O(n/2^k + 2^(2^k)).
– There are 2^(2^k) lists; number them (even the empty ones).
– Each integer x in the lists, 1 ≤ x ≤ N_k, is transformed into a new integer x' by prepending the binary representation of its list's number, so that the x' stay sorted across the concatenated lists.
– The bit size of x' is 2^k + log N_k.
– After concatenating all the lists, we have N_k/2 sorted numbers of 2^k + log N_k bits each.
– Using Lemma 2 we get O(1) access time and a space bound of n(1/2 + 3/2^(k+1)) + O(n / (2^k loglog n)) bits.

Sum it up (space complexity)
The number of levels l is set to loglog n.
– SA_l size is O(n) bits
– B_k size is O(N_k) bits, O(N_k) preprocessing time
– rank_k size is O(N_k (loglog N_k) / log N_k) bits, O(N_k) preprocessing time
– Ψ_k size is n(1/2 + 3/2^(k+1)) + O(n / (2^k loglog n)) bits, O(N_k + 2^(2^k)) preprocessing time
The total space is
sum over k = 0..l-1 of [ n(1/2 + 3/2^(k+1)) + O(n / (2^k loglog n)) + O(N_k) ] + O(n)
≤ ½ n loglog n + 6n + O(n / loglog n) bits
The preprocessing time is O(n).

Rank data structure
Due to Jacobson.
Given a bit vector of length n, rank[i] is the number of 1 bits up to position i.
Multilevel approach:
– Slice the bit string into chunks of log^2 n bits; at each chunk boundary keep a rank counter.
– Divide each chunk into sub-chunks of ½ log n bits; keep a counter at each sub-chunk boundary.
– At the bottom level a simple lookup table is used.
[Figure: a rank query over log^2 n-bit chunks and ½ log n-bit sub-chunks with a lookup table; in the pictured query the output is 14 + 3 + 1.]

Rank analysis
– First level: n / log^2 n chunks, each having a log n-bit counter, for a total of O(n / log n) bits
– Second level: 2n / log n sub-chunks, each having an O(loglog n)-bit counter, for a total of O(n loglog n / log n) bits
– The lookup table takes 2^(½ log n) * log n * loglog n = O(√n log n loglog n) bits
– o(n) total space
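
The two-level directory can be sketched as follows. This is an illustration of the counter layout only: counters are ordinary Python integers, and the bottom-level table lookup is replaced by summing at most one sub-chunk, so neither the o(n) space bound nor the strict O(1) time is reproduced. The class name `RankIndex` is an assumption of this sketch.

```python
import math


class RankIndex:
    """Two-level rank directory in the spirit of Jacobson's scheme (not bit-packed)."""

    def __init__(self, bits):
        n = max(len(bits), 4)
        log_n = max(2, math.ceil(math.log2(n)))
        self.sub = max(1, log_n // 2)          # sub-chunk length, about 1/2 log n
        self.chunk = self.sub * 2 * log_n      # chunk length, about log^2 n
        self.bits = list(bits)
        self.chunk_count = []                  # ones before each chunk
        self.sub_count = []                    # ones from the chunk start to each sub-chunk
        total = in_chunk = 0
        for start in range(0, len(self.bits), self.sub):
            if start % self.chunk == 0:
                self.chunk_count.append(total)
                in_chunk = 0
            self.sub_count.append(in_chunk)
            ones = sum(self.bits[start:start + self.sub])
            total += ones
            in_chunk += ones

    def rank(self, i):
        """Number of 1 bits among the first i bits (0 <= i <= n)."""
        if i == 0:
            return 0
        c, s = (i - 1) // self.chunk, (i - 1) // self.sub
        tail = sum(self.bits[s * self.sub:i])  # table lookup in the real structure
        return self.chunk_count[c] + self.sub_count[s] + tail


B0 = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
index = RankIndex(B0)
print([index.rank(i) for i in range(1, 17)])
```

Applied to the first 16 bits of B_0, the query results reproduce the rank_0 row of the example. A query adds three numbers, mirroring the 14 + 3 + 1 of the figure.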

Compressed suffix arrays in ε^-1 n + O(n) bits and O(log^ε n) access time
To break the space barrier we need to save fewer levels => longer lookups.
Let's save only 3 levels: SA_0, SA_l' and SA_l,
with l = ceil(loglog n) and l' = ceil(½ loglog n).
We use a dictionary data structure, which can tell whether an element is a member of the dictionary and supports a rank query, in O(1) time for both queries.
We keep in 2 dictionaries, D_0 and D_l', which items are passed on to the next level (SA_0 -> SA_l' and SA_l' -> SA_l).
The space complexity of the dictionary is [formula not recoverable from the transcript].

The Ψ'_k function
We define the Ψ'_k function, which maps each 1 to its companion 0:
– Ψ'_k(i) = j if SA_k[i] is even, SA_k[i] ≠ N_k and SA_k[j] = SA_k[i] + 1
– Ψ'_k(i) = i otherwise
Let's define the φ_k function to be
– φ_k(i) = j if SA_k[i] ≠ N_k and SA_k[j] = SA_k[i] + 1
– φ_k(i) = i otherwise
We just need to merge the indexes in L_k and L'_k.
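
Example
Level 1, odd lists (used for Ψ_1)
– list aa: {} (0 entries)
– list ab: {9} (1 entry)
– list ba: {1,6,12,14} (4 entries)
– list bb: {2,4,5} (3 entries)
Level 1, even lists (used for Ψ'_1)
– list aa: {10} (1 entry)
– list ab: {8,11,13} (3 entries)
– list ba: {7,16} (2 entries)
– list bb: {3} (1 entry)
The merging gives
{10,8,9,11,13,6*,1,6,7,12,14,16,2,3,4,5}
(6* is position 6, where SA_1[6] = N_1 = 16, so φ_1 maps it to itself)

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1      1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
Ψ_1      1  2  9  4  5  6  1  6  9 12 14 12  2 14  4  5
Ψ'_1    10  8  3 11 13  6  7  8  7 10 11 16 13  3 15 16
φ_1     10  8  9 11 13  6  1  6  7 12 14 16  2  3  4  5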
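
The merge of Ψ_1 and Ψ'_1 into φ_1 can be verified on the table's data. The sketch computes φ_k once directly from its definition and once by picking Ψ_k on odd entries and Ψ'_k on even entries; `phi` and `merge` are names introduced here, and lists are 1-based with index 0 as padding.

```python
def phi(sa):
    """phi_k from its definition: phi[i] = j with SA_k[j] = SA_k[i] + 1, or i if SA_k[i] = N_k."""
    n = len(sa)
    pos = {v: i for i, v in enumerate(sa, start=1)}
    return [0] + [i if v == n else pos[v + 1] for i, v in enumerate(sa, start=1)]


def merge(sa, psi, psi_prime):
    """Take Psi_k where SA_k[i] is odd and Psi'_k where SA_k[i] is even."""
    return [0] + [psi[i] if sa[i - 1] % 2 else psi_prime[i]
                  for i in range(1, len(sa) + 1)]


SA1 = [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
PSI1 = [0, 1, 2, 9, 4, 5, 6, 1, 6, 9, 12, 14, 12, 2, 14, 4, 5]
PSI1_PRIME = [0, 10, 8, 3, 11, 13, 6, 7, 8, 7, 10, 11, 16, 13, 3, 15, 16]

print(merge(SA1, PSI1, PSI1_PRIME)[1:])
print(phi(SA1) == merge(SA1, PSI1, PSI1_PRIME))  # True
```

Both routes give the φ_1 row of the table, including the fixed point at position 6.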

The φ_k function implementation
Lemma 4: we can store the concatenated list used for φ_k
– for k = 0, in n + O(n / loglog n) bits
– for k > 0, in n(1 + 1/2^(k-1)) + O(n / (2^k loglog n)) bits, with preprocessing time O(n/2^k + 2^(2^k))
– If k > 0, simply use Lemma 3
– If k = 0:
Encode a and # as 0, and b as 1.
Create an n-bit vector L, with L[f] = 0 iff the list for φ_0 is a or # at position f.
Add select_1 and select_0 data structures on top of it: O(n / loglog n) bits.
Also keep the number of 0s in L as c_0.
A query φ_0(j) is answered in the following way:
if j = c_0, return select_0(c_0)
if j < c_0, return select_0(j)
if j > c_0, return select_1(j - c_0)

The lookup algorithm
To find SA[i], we start walking the φ_0 function: i, i', i'', i''', ... with SA_0[i] + 1 = SA_0[i'], and so on,
until reaching an entry found in the dictionary D_0.
– Let s be the walk length
– and r the entry's rank in the dictionary (how many items have already passed to the next level?)
Using r we start walking the next level.
– Let s' be the walk length
– and r' the entry's rank in the dictionary
We return the following result:
SA[i] = SA_l[r'] * 2^l - s' * 2^l' - s
The walk length is max(s, s') < 2^l' = O(√(log n)),
so the query time is O(√(log n)).
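
The three-level lookup can be sketched on the 32-character example, where l = ceil(loglog 32) = 3 and l' = 2. In this sketch the dictionaries are Python dicts mapping a position to its rank, φ is a plain list, and positions are 0-based (values stay 1-based); `phi`, `sample`, `build` and `lookup` are names assumed here.

```python
SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]


def phi(sa):
    """phi[i] = position holding SA[i] + 1, or i itself when SA[i] is the maximum."""
    n = len(sa)
    pos = {v: i for i, v in enumerate(sa)}
    return [i if v == n else pos[v + 1] for i, v in enumerate(sa)]


def sample(sa, step):
    """Dictionary {position: rank} of the entries divisible by step, plus the next level."""
    kept = [(i, v // step) for i, v in enumerate(sa) if v % step == 0]
    return {i: r for r, (i, _) in enumerate(kept)}, [v for _, v in kept]


def build(sa0, l, lp):
    d0, sa_lp = sample(sa0, 2 ** lp)            # level l' keeps multiples of 2^l'
    d1, sa_l = sample(sa_lp, 2 ** (l - lp))     # level l keeps multiples of 2^l
    return phi(sa0), d0, phi(sa_lp), d1, sa_l


def lookup(i, index, l, lp):
    phi0, d0, phi1, d1, sa_l = index
    s = 0
    while i not in d0:                          # walk level 0 until a sampled entry
        i, s = phi0[i], s + 1
    r, s2 = d0[i], 0
    while r not in d1:                          # walk level l' the same way
        r, s2 = phi1[r], s2 + 1
    return sa_l[d1[r]] * 2 ** l - s2 * 2 ** lp - s


index = build(SA0, l=3, lp=2)
print([lookup(i, index, 3, 2) for i in range(4)])  # [15, 16, 31, 13]
```

Each walk stops within fewer than 2^l' steps because every step increases the suffix position by one, and the array length is a multiple of 2^l.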

The general multilevel build
For every 0 < ε < 1:
– Assume εl is an integer, so 2^(εl) < 2 log^ε n
– Create all the levels 0, εl, 2εl, .., l
– The number of levels is ε^-1 + 1 => lookup in O(log^ε n)
Space: summing the φ_k lists of Lemma 4 over the stored levels and adding the space of the dictionaries D_k gives the ε^-1 n + O(n) bits of the section title. [The slide's derivation is garbled in the transcript.]

Select data structure
select(i) returns the position of the i-th 1 bit in the string.
Same idea as rank, a bit more complicated.
Multilevel approach:
At the first level we record the position of every (log n loglog n)-th 1 bit.
– Total space O(n / loglog n)
Between each two recorded bits we keep the following data:
– If the distance between them is r > (log n loglog n)^2, we keep the absolute positions of all the 1 bits between them: log^2 n loglog n bits
– Otherwise we keep the relative position of each (log r loglog n)-th 1 bit: total space log r * loglog n < log^2 n loglog n = r / loglog n (r < n)
Then we keep one more level (the same notions).
– The block size comes down to (log n)^4
After that, we keep a lookup table. For every pattern of (log n)/d bits (d ≥ 2) we save:
– the number of 1 bits
– the location of the i-th 1 bit in the pattern
Same as before, the space is O(n^(1/d) log n loglog n).
The lookup is then very simple: walk the levels, get a block, and query it using the lookup table.
Space complexity: O(n / loglog n)
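
A much simplified, single-level version of the idea is sketched below: record the position of every `step`-th 1 bit and scan forward from the nearest sample. The further levels and the lookup table from the slides are replaced by the scan, so this shows the sampling idea only, not the O(1) bound; `SampledSelect` and its `step` parameter are assumptions of this sketch.

```python
class SampledSelect:
    """Select with one level of sampled positions and a linear scan inside a block."""

    def __init__(self, bits, step=4):
        self.bits = list(bits)
        self.step = step
        self.samples = []        # 1-based positions of the 1st, (step+1)-th, ... 1 bit
        seen = 0
        for pos, b in enumerate(self.bits, start=1):
            if b:
                if seen % step == 0:
                    self.samples.append(pos)
                seen += 1
        self.ones = seen

    def select(self, i):
        """1-based position of the i-th 1 bit, for 1 <= i <= number of ones."""
        if not 1 <= i <= self.ones:
            raise IndexError(i)
        block, offset = divmod(i - 1, self.step)
        pos = self.samples[block]
        while offset:            # the slides use further levels and a table here
            pos += 1
            if self.bits[pos - 1]:
                offset -= 1
        return pos


B0 = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
sel = SampledSelect(B0)
print([sel.select(i) for i in range(1, 9)])  # [2, 7, 8, 10, 13, 14, 15, 16]
```

The input is the first half of B_0 from the example, whose 1 bits sit at the positions printed.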