Dynamic Rank-Select Structures with Applications to Run-Length
Encoded Texts
Sunho Lee and Kunsoo Park
Seoul National Univ.
Contents
Introduction
– Rank/select problem
– Relations to compressed full-text indices
Dynamic rank-select structure
Extensions of the structure
– For a large alphabet text
– For a run-length encoded text
Rank-select problem
For a given text T over an alphabet of size σ, our structures support:
– rankT(c, i): gives the number of occurrences of character c up to position i in T
– selectT(c, k): gives the position of the k-th c in T
E.g. T = acabbc
– rankT(‘a’, 5) = 2
– selectT(‘a’, 2) = 3
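
As a point of reference, the two queries can be written as a naive linear scan. This is only a sketch of the semantics (1-based positions, as on the slide), not the structure proposed in the talk; the function names are illustrative.

```python
def rank(text, c, i):
    """Number of occurrences of c in text[1..i] (1-based, inclusive)."""
    return text[:i].count(c)


def select(text, c, k):
    """1-based position of the k-th occurrence of c, or None if there is none."""
    seen = 0
    for pos, ch in enumerate(text, start=1):
        if ch == c:
            seen += 1
            if seen == k:
                return pos
    return None


T = "acabbc"
print(rank(T, "a", 5))    # 2
print(select(T, "a", 2))  # 3
```

Both calls take O(n) time here; the point of the talk is to answer them in O(log n) time within compressed space.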
Rank-select problem
Our structures support additional update operations:
– insertT(c, i): inserts character c between T[i] and T[i+1]
– deleteT(i): deletes T[i] from T
E.g. T = acabbc → T = aababc
– rankT(‘a’, 5) = 2 → rankT(‘a’, 5) = 3
– selectT(‘a’, 2) = 3 → selectT(‘a’, 2) = 2
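
The update operations can be sketched on a plain Python string. The slide does not say which updates relate the two texts in the example; the sequence below (one delete, then one insert) is an assumption that happens to produce the second text from the first.

```python
def insert(text, c, i):
    """Insert character c between T[i] and T[i+1] (1-based); i = 0 prepends."""
    return text[:i] + c + text[i:]


def delete(text, i):
    """Delete T[i] (1-based)."""
    return text[:i - 1] + text[i:]


def rank(text, c, i):
    """Naive rank, used only to check the example."""
    return text[:i].count(c)


T = "acabbc"
T = delete(T, 2)       # "aabbc"
T = insert(T, "a", 3)  # "aababc"
print(T, rank(T, "a", 5))
```

After the two updates, rank of ‘a’ at position 5 changes from 2 to 3, matching the slide.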
Why rank-select problem?
In compressed full-text indices
– Rank-select structures are built on the Burrows-Wheeler Transform (BWT)
– Rank: backward search (Ferragina & Manzini)
– Select: Psi-function in CSA (Grossi & Vitter)
Dynamic BWT
– Index for a collection of texts (Chan, Hon & Lam)
– Add or remove a text from the collection
Example of select on BWT
T = mississippi$
i Psi SA Suffix
1 6 12 $
2 1 11 i$
3 8 8 ippi$
4 11 5 issippi$
5 12 2 ississippi$
6 5 1 mississippi$
7 2 10 pi$
8 7 9 ppi$
9 3 7 sippi$
10 4 4 sissippi$
11 9 6 ssippi$
12 10 3 ssissippi$
Psi function
– Order of the suffix at the next position
– E.g. Psi[4] = 11, the order of ‘ssippi$’
Example of select on BWT
T = mississippi$
i BWT Psi SA Suffix
1 i 6 12 $
2 p 1 11 i$
3 s 8 8 ippi$
4 s 11 5 issippi$
5 m 12 2 ississippi$
6 $ 5 1 mississippi$
7 p 2 10 pi$
8 i 7 9 ppi$
9 s 3 7 sippi$
10 s 4 4 sissippi$
11 i 9 6 ssippi$
12 i 10 3 ssissippi$
Psi function
– Order of the suffix at the next position
– E.g. Psi[4] = 11, the order of ‘ssippi$’
Duality between Psi-function and BWT (Hon, Sadakane & Sung)
– BWT[i] = T[SA[i] – 1]
– Psi[i] = selectBWT(C[i], i – F[C[i]])
  – C[i]: T[SA[i]], the first character of the i-th suffix
  – F[c]: the number of characters x in T with x < c
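
The duality can be checked on the mississippi$ example with a short script. This is a toy sketch: the suffix array is built by plain sorting, select is a linear scan, and all helper names are illustrative rather than taken from the talk.

```python
def suffix_array(t):
    """1-based suffix array by plain sorting (fine for a toy example)."""
    return [p + 1 for p in sorted(range(len(t)), key=lambda p: t[p:])]


def select(s, c, k):
    """1-based position of the k-th occurrence of c in s."""
    pos = -1
    for _ in range(k):
        pos = s.index(c, pos + 1)
    return pos + 1


T = "mississippi$"
n = len(T)
SA = suffix_array(T)

# BWT[i] = T[SA[i] - 1], wrapping to the last character when SA[i] = 1.
BWT = "".join(T[(sa - 2) % n] for sa in SA)

# Psi computed directly from the definition: order of the suffix at SA[i] + 1.
ISA = {sa: i for i, sa in enumerate(SA, start=1)}
psi_direct = [ISA[sa % n + 1] for sa in SA]


def F(c):
    """Number of characters in T that are smaller than c."""
    return sum(1 for x in T if x < c)


# Psi computed through select on the BWT: Psi[i] = select_BWT(C[i], i - F[C[i]]).
psi_select = []
for i in range(1, n + 1):
    c = T[SA[i - 1] - 1]  # C[i] = T[SA[i]] in 1-based notation
    psi_select.append(select(BWT, c, i - F(c)))

print(BWT)
print(psi_direct == psi_select)
```

Both computations give the Psi column of the table, so a select structure on the BWT is enough to evaluate Psi.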
Our results
Dynamic rank-select on texts over a small alphabet (σ < log n)
– Improve the binary-alphabet version by Makinen & Navarro
– O(log n) time and n log σ + o(n log σ) bits
Dynamic rank-select for a large alphabet (σ < n)
– Use wavelet trees to extend our small-alphabet structure
– O(log n log σ / log log n) time and n log σ + o(n log σ) bits
Application to RLE texts
Static rank-select
Dynamic rank-select
Dynamic rank-select preliminary
We assume the RAM model with:
– Word size w = Θ(log n) bits
– +, -, *, / and bitwise operations in O(1) time
We process a word-size text of Θ(log n / log σ) characters in O(1) time
Dynamic rank-select preliminary
Partition of text
– Blocks of sizes from ½ log n words to 2 log n words
– Bit vector representation, I
  – Gives block number b and offset r for position i
  – Employs binary rank-select by Makinen & Navarro: O(log n) time & O(n) bits
E.g. for position i = 10
– T = babc abab abca
– I = 1000 1000 1000
– b = rankI(‘1’, 10) = 3
– r = 10 - selectI(‘1’, 3) + 1 = 2
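
The mapping from a text position to a block number and offset can be sketched as follows. The bit vector is held as a Python string and queried by linear scans, which stands in for the dynamic binary rank-select structure; the helper names are illustrative.

```python
def rank1(bits, i):
    """Number of '1's in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count("1")


def select1(bits, k):
    """1-based position of the k-th '1'."""
    pos = -1
    for _ in range(k):
        pos = bits.index("1", pos + 1)
    return pos + 1


def locate(I, i):
    """Block number b and in-block offset r of text position i."""
    b = rank1(I, i)
    r = i - select1(I, b) + 1
    return b, r


I = "100010001000"   # a '1' marks the first position of each block
print(locate(I, 10))  # (3, 2)
```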
Dynamic rank-select preliminary
Over-block/in-block operation
– rankT(c, i) = rank-overT(c, b) + rankTb(c, r)
  – rank-overT(c, b): the number of c’s before the b-th block
  – rankTb(c, r): the number of c’s up to position r in Tb
– E.g. T = babc abab abca, I = 1000 1000 1000:
  rankT(‘a’, 10) = rank-overT(‘a’, 3) + rankT3(‘a’, 2) = 3 + 1 = 4
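
The rank decomposition can be sketched directly on a list of blocks. Here the over-block part simply counts inside the earlier blocks, which is an assumption made for brevity; in the talk it is answered from a separate over-block structure.

```python
def split_blocks(T, I):
    """Return 0-based block start positions and the blocks themselves."""
    starts = [p for p, bit in enumerate(I) if bit == "1"]
    ends = starts[1:] + [len(T)]
    return starts, [T[s:e] for s, e in zip(starts, ends)]


def rank(T, I, c, i):
    """rank_T(c, i) as an over-block count plus an in-block count."""
    starts, blocks = split_blocks(T, I)
    b = I[:i].count("1")   # block number of position i
    r = i - starts[b - 1]  # equals i - select_I('1', b) + 1
    rank_over = sum(blk.count(c) for blk in blocks[:b - 1])
    rank_in_block = blocks[b - 1][:r].count(c)
    return rank_over + rank_in_block


T = "babcabababca"
I = "100010001000"
print(rank(T, I, "a", 10))  # 3 from blocks 1 and 2, plus 1 inside block 3
```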
Dynamic rank-select preliminary
Over-block/in-block operation
– selectT(c, k):
  – select-overT(c, k): the block number containing the k-th c
  – selectTb(c, k’): the offset of the k’-th c in Tb
– Update operation
  – In-block update: changes the text itself
  – Over-block update: changes the statistics of the text
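
The select decomposition can be sketched the same way. The block list and the linear scans are assumptions for illustration; k’ is the rank of the wanted occurrence inside its block.

```python
def select_over(blocks, c, k):
    """Block number holding the k-th c, and k' = its rank inside that block."""
    seen = 0
    for b, blk in enumerate(blocks, start=1):
        count = blk.count(c)
        if seen + count >= k:
            return b, k - seen
        seen += count
    raise ValueError("fewer than k occurrences of c")


def select_in_block(block, c, k2):
    """1-based offset of the k2-th c inside one block."""
    pos = -1
    for _ in range(k2):
        pos = block.index(c, pos + 1)
    return pos + 1


def select(blocks, c, k):
    """select_T(c, k): find the block first, then the offset inside it."""
    b, k2 = select_over(blocks, c, k)
    before = sum(len(blk) for blk in blocks[:b - 1])
    return before + select_in_block(blocks[b - 1], c, k2)


blocks = ["babc", "abab", "abca"]
print(select(blocks, "a", 4))  # 9
```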
Over-block structures
Sorted character-block pair
– Character-block pair (T[i], b): T[i] in the b-th block
E.g. T = babc abab abca
(b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3)
Over-block structures
Sorted character-block pair
– Character-block pair (T[i], b): T[i] in the b-th block
– Sorted pairs: partially non-decreasing, i.e. block numbers are non-decreasing within each character’s group (Hon, Sadakane & Sung)
E.g. T = babc abab abca
– Pairs in text order: (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3)
– Sorted pairs: (a,1)(a,2)(a,2)(a,3)(a,3) (b,1)(b,1)(b,2)(b,2)(b,3) (c,1)(c,3)
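
The sorted pairs of the example can be reproduced in a few lines. The list of blocks is given directly here, which is a simplification for illustration.

```python
def sorted_pairs(blocks):
    """Character-block pairs, sorted by character and then by block number."""
    pairs = [(ch, b) for b, blk in enumerate(blocks, start=1) for ch in blk]
    return sorted(pairs)


pairs = sorted_pairs(["babc", "abab", "abca"])
print("".join(f"({ch},{b})" for ch, b in pairs))
```

Within each character group the block numbers never decrease, which is what makes the differential encoding compact.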
Over-block structures
Differential encoding of sorted pairs
– A bit vector B of O(n) bits
– For each distinct pair:
  – 1s: the difference of block number
  – 0s: the number of the same pairs
E.g.
– T = ... babc abab bbbb abcc …
– … (c,5)(c,8)(c,8) … → … 11111011100 …
Over-block structures
Differential encoding of sorted pairs
– A bit vector B of O(n) bits
– For each distinct pair:
  – 1s: the difference of block number
  – 0s: the number of the same pairs
E.g.
– T = babc abab abca
– B = 10100100 10010010 10110 (the middle group, 10010010, is the ‘b’ group)
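
The encoding can be sketched per character group. One assumption is made explicit here: every block contributes one ‘1’ to every group, even when the character is absent from the block, which agrees with both examples on the slides. The function name is illustrative.

```python
def encode_groups(blocks, alphabet):
    """One bit string per character: '1' per block, then one '0' per occurrence."""
    groups = []
    for c in alphabet:
        bits = "".join("1" + "0" * blk.count(c) for blk in blocks)
        groups.append(bits)
    return groups


print(" ".join(encode_groups(["babc", "abab", "abca"], "abc")))
# 10100100 10010010 10110
```

Concatenating the groups in alphabet order gives the bit vector B.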
Over-block rank-select
rank-overT(c, b):
– Find the position of the b-th ‘1’ in the group of c
– Count the ‘0’s representing c up to that position
E.g.
– T = babc abab abca
– B = 10100100 10010010 10110
– rank-overT(‘b’, 3): count the ‘0’s up to the 3rd ‘1’ in the ‘b’ group (10010010), which gives 4
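
A minimal sketch of rank-over on the concatenated bit vector follows. It assumes every group holds exactly one ‘1’ per block, so the group of the g-th character starts at the (g · blocks + 1)-th ‘1’ of B; select on B is a linear scan here, and the parameter names are illustrative.

```python
def select1(bits, k):
    """1-based position of the k-th '1'."""
    pos = -1
    for _ in range(k):
        pos = bits.index("1", pos + 1)
    return pos + 1


def rank_over(B, alphabet, num_blocks, c, b):
    """Number of c's before block b, read off the differential encoding B."""
    g = alphabet.index(c)                    # groups that precede c's group
    first = select1(B, g * num_blocks + 1)   # first '1' of c's group
    target = select1(B, g * num_blocks + b)  # b-th '1' of c's group
    # The span first..target holds exactly b ones; the rest are zeros for c.
    return (target - first + 1) - b


B = "10100100" + "10010010" + "10110"
print(rank_over(B, "abc", 3, "b", 3))  # 4
```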
Over-block updates
If the number of blocks is fixed
– Insert or delete 0s at the b-th block in I and B
– Rank-select remains correct
E.g. block 2 changes from abab to aabaaabb
– T = babc abab abca → babc aabaaabb abca
– I = 1000 1000 1000 → 1000 10000000 1000
– B = 10100100 10010010 10110 → 10100000100 100100010 10110
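
The bookkeeping for an insertion that keeps the number of blocks fixed can be sketched as below. The groups of B are kept in a dictionary for readability, which is an assumption of this sketch; the example grows block 2 from abab to aabaaabb, i.e. three new a’s and one new b.

```python
def select1(bits, k):
    """1-based position of the k-th '1'."""
    pos = -1
    for _ in range(k):
        pos = bits.index("1", pos + 1)
    return pos + 1


def record_insert(I, groups, c, b):
    """Account for one new character c in block b (block count unchanged)."""
    p = select1(I, b)
    new_I = I[:p] + "0" + I[p:]              # block b becomes one position longer
    q = select1(groups[c], b)
    new_groups = dict(groups)
    new_groups[c] = groups[c][:q] + "0" + groups[c][q:]  # one more c in block b
    return new_I, new_groups


I = "100010001000"
groups = {"a": "10100100", "b": "10010010", "c": "10110"}
for ch in "aaab":
    I, groups = record_insert(I, groups, ch, 2)

print(I)
print(groups["a"], groups["b"], groups["c"])
```

Only zeros are added, so the positions of the ‘1’s relative to each other, and hence block numbers, stay valid.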
Over-block updates
If the number of blocks is changing
– Split or merge the b-th block in I and B
– Call O(σ) queries on B, amortized (σ < log n)
E.g. block 2 is split into two blocks
– T = babc aabaaabb abca → babc aaba aabb abca
– I = 1000 10000000 1000 → 1000 1000 1000 1000
– B = 10100000100 100100010 10110 → 101000100100 1001010010 101110
In-block structures
We use the same hierarchy as Makinen & Navarro: word, sub-block and block
Rank/select on word-size texts w
– Convert w to a bit vector representing occurrences of c
– E.g. c = ‘b’, w = abaacbab, mask = bbbbbbbb (log σ bits per character)
  w XOR mask = x0xxx0x0 (log σ bits per character; x is a nonzero field) → 01000101 (binary)
– O(1) time rank-select by tables of o(n) bits
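
The XOR step can be sketched with Python integers standing in for machine words. The character codes and the field-by-field loop are assumptions for illustration; the talk replaces the loop by table lookups to reach O(1) time.

```python
def occurrence_bits(word, c, alphabet):
    """Bit string with '1' where word holds c, computed through w XOR mask."""
    bits_per_char = max(1, (len(alphabet) - 1).bit_length())
    code = {ch: k for k, ch in enumerate(alphabet)}

    w = 0
    mask = 0
    for ch in word:
        w = (w << bits_per_char) | code[ch]     # pack the text into one word
        mask = (mask << bits_per_char) | code[c]  # c repeated in every field

    x = w ^ mask  # a field is all-zero exactly where the character equals c
    field = (1 << bits_per_char) - 1
    out = []
    for k in range(len(word) - 1, -1, -1):  # leftmost character first
        is_zero = (x >> (k * bits_per_char)) & field == 0
        out.append("1" if is_zero else "0")
    return "".join(out)


print(occurrence_bits("abaacbab", "b", "abc"))  # 01000101
```

Once the word is reduced to such a bit vector, rank and select inside the word are binary rank and select.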
In-block structures
Linked list over sub-blocks
– A block contains ½ log n to 2 log n words
– A sub-block contains √log n words
– One extra sub-block is a buffer for updates
Red-black tree over blocks
– Leaf node: pointer to block, list of sub-blocks
– Internal node: the number of blocks in its subtree
In-block rank-select
RankTb(c, r) in O(log n) time
– Traverse the tree to find the b-th block
– Scan the b-th block of Θ(log n) words
[Figure: tree over the blocks ab, ba, bc; node labels 2, 2, 3, 5]
In-block updates
Update words in the list in O(log n) time
Process carry characters using the extra space in a block
[Figure: tree over the blocks ab, bc, ab, c; node labels 2, 2, 3, 5]
In-block updates
Split or merge a block whose size is out of range
Update tree nodes from leaf to root
[Figure: tree over the blocks ab, bc, ac, ba, bc; node labels 2, 2, 3, 5]
In-block updates
Split or merge a block whose size is out of range
Update tree nodes from leaf to root
[Figure: tree over the blocks ab, bc, acba, bc; node labels 2, 2, 2, 4, 6]
Extension of our structure
Dynamic rank-select on plain texts over a large alphabet, σ < n
– Use k-ary wavelet trees
– O(log n log σ / log log n) time & n log σ + O(n log σ / log log n) bits
Application to run-length encoded texts
– Start from RLFM (Makinen & Navarro)
– Support dynamic BWT
Application to RLE
Run-Length Encoding (RLE) of T
– Character of runs: text T’
– Length of runs: bit vector L
– E.g. T = aaabbaacccc → T’ = abac, L = 10010101000
RLE of BWT (Makinen & Navarro)
– Run-Length based FM-index
– The number of runs in BWT(T) ≤ min(n, nH_k) + σ^k
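
The run-length encoding into T’ and L can be sketched with `itertools.groupby`. The function name is illustrative; each run contributes its character to T’ and a ‘1’ followed by (length - 1) ‘0’s to L.

```python
from itertools import groupby


def run_length_encode(T):
    """Return (T', L): run heads and the bit vector marking run starts."""
    heads = []
    lengths = []
    for ch, run in groupby(T):
        n = len(list(run))
        heads.append(ch)
        lengths.append("1" + "0" * (n - 1))
    return "".join(heads), "".join(lengths)


print(run_length_encode("aaabbaacccc"))  # ('abac', '10010101000')
```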
Application to RLE
Assume rank/select on L and T’
– Total size of structure: O(n + n’ log σ)
– Operation time: O(log n + log n log σ / log log n)
Some additional vectors
– Sorted length vector: L’
– Frequency table F’: counts characters in T’
– E.g.
  T = bb aa bbbb cc aaa      runs sorted by character: aa aaa bb bbbb cc
  L = 10 10 1000 10 100      L’ = 10 100 10 1000 10
  T’ = babca                 F’ = 001 001 01
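
The four vectors of the example can be derived as follows. The unary layout of F’ (one ‘0’ per run of the character, then a ‘1’) is read off the example rather than stated on the slide, so treat it as an assumption; the function name is illustrative.

```python
from itertools import groupby


def rle_vectors(T):
    """Return (T', L, L', F') for a text T."""
    runs = [(ch, len(list(g))) for ch, g in groupby(T)]
    t_prime = "".join(ch for ch, _ in runs)
    L = "".join("1" + "0" * (n - 1) for _, n in runs)

    # Stable sort by character: runs of one character keep their text order.
    by_char = sorted(runs, key=lambda run: run[0])
    L_sorted = "".join("1" + "0" * (n - 1) for _, n in by_char)

    # One '0' per run of each character, closed by a '1'.
    F = "".join("0" * t_prime.count(c) + "1" for c in sorted(set(t_prime)))
    return t_prime, L, L_sorted, F


print(rle_vectors("bbaabbbbccaaa"))
```

For the example text this prints T’ = babca, L = 1010100010100, L’ = 1010010100010 and F’ = 00100101, the same vectors as above without the spacing.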
Conclusion
Rank-select structures are an essential ingredient of compressed full-text indices
We propose a dynamic rank-select structure for a small alphabet and its large-alphabet version
We can apply our structures to indices that use the BWT, such as RLFM and indices for text collections