Interplay between Stringology and Data Structure Design
(limited view: my own experience)
Roberto Grossi <[email protected]>
4
Interaction between stringology and data structures
Case studies:
  Compressed text indexing [G., Gupta, Vitter]
  Multi-key data structures [Crescenzi, G., Italiano] [Franceschini, G.] [G., Italiano]
  Order vs. disorder in searching [Franceschini, G.]
  In-place vector sorting [Franceschini, G.]
5
Compressed text indexing
Replace text $T \in \Sigma^n$ ⇒ self-indexing binary string [Ferragina, Manzini] [Sadakane] [G., Gupta, Vitter]
$n \log |\Sigma|$ bits ⇒ $n H_h + \ldots$ bits (where $H_h$ = h-order empirical entropy)
Unique algorithmic framework: wavelet tree + finite set model + succinct dictionaries + …
Text indexing: new implementation of CSA (compressed suffix array)
Compression: new analysis of BWT (Burrows-Wheeler transform)
7
Suffix arrays, BWT, and $H_h$ (high-order empirical entropy)
Equivalently, use contexts x of order h for cx instead of xc
T = mississippi#
  SA   BWT  sorted suffix
  12   i    #
  11   p    i#
   8   s    ippi#
   5   s    issippi#
   2   m    ississippi#
   1   #    mississippi#
  10   p    pi#
   9   i    ppi#
   7   s    sippi#
   4   s    sissippi#
   6   i    ssippi#
   3   i    ssissippi#
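The suffix array and BWT of the example text can be reproduced with a few lines of Python. This is only a minimal sketch: it uses a naive quadratic-time sort rather than any construction from the talk, and the function names `suffix_array` and `bwt_from_sa` are made up for illustration. It relies on `#` sorting before the letters in ASCII.

```python
def suffix_array(text):
    """Return the 1-based starting positions of the suffixes in sorted order."""
    # Naive construction by sorting the suffixes directly; fine for a tiny example.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def bwt_from_sa(text, sa):
    """Character preceding each sorted suffix (cyclically for the suffix at position 1)."""
    # text[i - 2] is the character before 1-based position i; for i = 1 it wraps to the end.
    return "".join(text[i - 2] for i in sa)


T = "mississippi#"  # '#' is smaller than every letter in ASCII
sa = suffix_array(T)
bwt = bwt_from_sa(T, sa)
print(sa)   # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(bwt)  # ipssm#pissii
```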
8
Suffix arrays, BWT, and $H_h$ (high-order empirical entropy)
Context x = i, h = 1
Chars c = p, s, m
Store “pssm” using just $\log \binom{4}{1,\,2,\,1}$ bits (one p, two s, one m).
Over all contexts, this gets $n H_h$ bits! Add bits to encode the partition.
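To make the context argument concrete, the sketch below groups every character c by the h characters that follow it (the "cx" convention on the slide) and sums the log multinomial coefficients of the groups. It also computes n H_h from the same groups for comparison. The helper names and the cyclic wrap-around at the end of the text are assumptions made for this illustration, not details taken from the talk.

```python
from collections import Counter, defaultdict
from math import factorial, log2


def multinomial(counts):
    """n! / (t_1! t_2! ... t_r!) for counts t_1..t_r summing to n."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total


def context_groups(text, h):
    """Group each character by the h characters that follow it (cyclically)."""
    n = len(text)
    groups = defaultdict(list)
    for i in range(n):
        context = "".join(text[(i + 1 + j) % n] for j in range(h))
        groups[context].append(text[i])
    return groups


def finite_set_bits(text, h):
    """Sum over contexts of log2 of the multinomial coefficient of the char counts."""
    return sum(log2(multinomial(Counter(chars).values()))
               for chars in context_groups(text, h).values())


def n_times_Hh(text, h):
    """n * H_h: each context contributes its length times its 0-order entropy."""
    total = 0.0
    for chars in context_groups(text, h).values():
        m = len(chars)
        for cnt in Counter(chars).values():
            total += cnt * log2(m / cnt)
    return total


T = "mississippi#"
print(sorted(context_groups(T, 1)["i"]))  # the characters preceding each 'i'
print(finite_set_bits(T, 1), n_times_Hh(T, 1))
```

For context i the group is the multiset {p, s, s, m}, matching the slide, and the multinomial sum stays below n H_1 on this text.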
9
Incremental representation
Example, for “pssm”:
  mark p in pssm → 1000
  remove p; mark m in ssm → 001
  remove m; mark s in ss → 11
We obtain 3 subsets. Encode each subset, containing t items out of n, using $\log \binom{n}{t}$ bits.
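The mark-and-remove steps can be sketched as follows. The function name `incremental_bitvectors` and the explicit `order` argument are illustrative choices; the slide fixes the order p, m, s for its example.

```python
def incremental_bitvectors(chars, order):
    """Mark one symbol at a time in the remaining list, then remove it."""
    remaining = list(chars)
    vectors = []
    for symbol in order:
        # One bit per remaining character: 1 where it equals the current symbol.
        vectors.append("".join("1" if c == symbol else "0" for c in remaining))
        remaining = [c for c in remaining if c != symbol]
    return vectors


print(incremental_bitvectors("pssm", "pms"))  # ['1000', '001', '11']
```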
10
Getting the multinomial coefficient
Summing the log binomial coefficients of the subsets gives the log of the multinomial coefficient of the subsets’ sizes.
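As a worked check on the “pssm” example (subset sizes 1, 1, 2 taken in the order p, m, s), the binomial terms telescope into the multinomial coefficient. The general identity is standard; the indexing $t_1, \ldots, t_r$ with $\sum_j t_j = n$ is chosen here only for illustration.

```latex
\log\binom{4}{1} + \log\binom{3}{1} + \log\binom{2}{2}
  = \log(4 \cdot 3 \cdot 1) = \log 12 = \log\binom{4}{1,\,1,\,2}

\sum_{j=1}^{r} \log\binom{n - t_1 - \cdots - t_{j-1}}{t_j}
  = \log\frac{n!}{t_1!\, t_2! \cdots t_r!}
  = \log\binom{n}{t_1, \ldots, t_r}
```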
11
Wavelet trees
Generalize the idea from the linear list to any tree shape
Cost is independent of the shape (e.g. assign access frequencies)
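A minimal wavelet tree sketch follows, assuming a balanced split of the sorted alphabet at every node. It is not the compressed structure from the talk: bitvectors are plain Python lists and rank is a linear scan, where a real implementation would plug in succinct dictionaries. The class and method names are illustrative.

```python
class WaveletTree:
    """Pointer-based wavelet tree over a sequence; one bitvector per internal node."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        self.bits = []
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            left_syms = set(self.alphabet[:mid])
            # Bit 0 sends a character to the left child, bit 1 to the right child.
            self.bits = [0 if c in left_syms else 1 for c in seq]
            self.left = WaveletTree([c for c in seq if c in left_syms],
                                    self.alphabet[:mid])
            self.right = WaveletTree([c for c in seq if c not in left_syms],
                                     self.alphabet[mid:])

    def _rank_bit(self, b, i):
        # Number of bits equal to b among the first i bits (linear scan for simplicity).
        return sum(1 for x in self.bits[:i] if x == b)

    def access(self, i):
        """Return the character at 0-based position i."""
        if self.left is None:
            return self.alphabet[0]
        b = self.bits[i]
        child = self.right if b else self.left
        return child.access(self._rank_bit(b, i))

    def rank(self, c, i):
        """Occurrences of c among the first i characters; c must be in the alphabet."""
        if self.left is None:
            return i
        mid = len(self.alphabet) // 2
        b = 0 if c in self.alphabet[:mid] else 1
        child = self.right if b else self.left
        return child.rank(c, self._rank_bit(b, i))


s = "ipssm#pissii"
wt = WaveletTree(s)
print("".join(wt.access(i) for i in range(len(s))))  # ipssm#pissii
print(wt.rank("s", len(s)))                          # 4
```

Changing how the alphabet is split at each node changes the tree shape but not which characters are distinguished overall, which is the sense in which the slide generalizes the linear list.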
12
Bound on bits of space
Crucial encoding for the BWT partition: r (positive) subsets’ sizes $t_1, \ldots, t_r$, with $\sum_i t_i = n$.
Let $\mathrm{enc}(t_1, \ldots, t_r)$ be the number of bits for encoding the sequence of these r sizes.
Let $g' = |\Sigma|^{h+1}$ and $g = |\Sigma|^{h+1} \log |\Sigma|$, both independent of $n \to \infty$.
Then $r \le g'$, and storing the BWT takes
$n H_h + \left[ \mathrm{enc}(t_1, \ldots, t_r) - \tfrac{1}{2} \sum_i \log t_i \right] + O(r \log |\Sigma|) \;\le\; n H_h + g' \log(n/g') + O(g)$ bits.
14
Interaction between stringology and data structures
Case studies:
  Compressed text indexing
  Multi-key data structures [Crescenzi, G., Italiano] [Franceschini, G.] [G., Italiano]
  Order vs. disorder in searching
  In-place vector sorting
15
Why multi-key data? Strings are everywhere…
Keys are arbitrarily long:
  Multi-dimensional points
  Multiple precision numbers
  Textual data
  XML paths
  URLs and IP addresses
  …
Modeled as strings in $\Sigma^k$, for unbounded alphabets $\Sigma$.
Q: How to avoid the O(k) slowdown factor in the cost of the operations supported by known data structures?
16
I. Ad hoc data structures
Some examples:
  ternary search trees [Clampett] [Bentley, Sedgewick]
  tries […]
  lexicographic D-trees [Mehlhorn]
  multi-dimensional B-trees [Gueting, Kriegel]
  multi-dimensional AVL trees [Vaishnavi]
  lexicographic splay trees [Sleator, Tarjan]
  multi-dimensional BST [Gonzalez] [Roura]
  multi-BB-trees [Vaishnavi]
  …
Search, insert, delete in O(k + log n) time
Split and concatenate in O(k + log n) time
17
II. Augmenting access paths
Reuse many data structures for 1-dim keys:
  AVL trees, red-black trees
  skip lists
  (a,b)-trees
  BB[α]-trees
  self-adjusting trees
  random search trees (treaps, …)
  …
Inherit their combinatorial properties
Traversing is driven by comparisons
18
III. Using an oracle for strings
Data structure D = black box performing comparisons on pairs of 1-dim keys.
General theorem for transforming D into a data structure D’ for strings (no efficiency loss).
Oracle DS_lcp for maintaining order in a linked list of strings, along with their lcps (extends the Dietz-Sleator list).
19
The general technique
New data structure D’ = old data structure D + oracle DS_lcp
Method: a comparison takes O(1) time if we know
  $\mathrm{lcp}(x, y) = \min \{\, j \mid x[j+1] \neq y[j+1] \,\}$
  (x < y iff x[lcp+1] < y[lcp+1])
Use DS_lcp for storing and comparing pairs of strings in D’ in constant time.
Use the predecessors and lcps computed so far to insert a new string y into D’ (and DS_lcp).
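The constant-time comparison can be sketched as below. It only illustrates the first step of the method (comparing two strings once their lcp is known) and says nothing about how DS_lcp maintains the lcps. The function names are hypothetical, indices are 0-based rather than 1-based as on the slide, and the prefix cases are handled explicitly because Python strings have no end marker.

```python
def lcp(x, y):
    """Length of the longest common prefix of x and y (takes O(lcp) time)."""
    j = 0
    while j < len(x) and j < len(y) and x[j] == y[j]:
        j += 1
    return j


def less_than_given_lcp(x, y, l):
    """Decide x < y with O(1) character comparisons, given l = lcp(x, y)."""
    if l == len(x):       # x is a prefix of y (or equal to it)
        return l < len(y)
    if l == len(y):       # y is a proper prefix of x
        return False
    return x[l] < y[l]    # the first mismatching character decides


x, y = "abcd", "abef"
l = lcp(x, y)
print(l, less_than_given_lcp(x, y, l))  # 2 True
```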
20
Theorem for general transformation
Comparison-driven data structure D for n keys      ⇒  String data structure D’ for n strings
(ins identifies pred or succ)
Space S(n)                                         ⇒  Space S(n) + O(n)
Operation op on O(1) keys in D, in T(n) time       ⇒  Operation op on O(1) strings in D’, in O(T(n)) time
Operation op involving y not in D, in T(n) time    ⇒  Operation op involving y not in D’, in O(T(n) + k) time
Note: “ins identifies pred or succ” does not necessarily imply Ω(log n) time per ins in a sequence of operations; e.g., finger search trees.
25
Some features
No need to reinvent the wheel for data structure designers
Better than using compact trie + Dietz-Sleator list + dynamic LCA when T(n) = o(log n), e.g.:
  weighted search: $O(\log(\sum_i w_i / w))$
  finger search: $O(\log d)$
  set manipulation: $O(n \log(m/n))$
26
Interaction between stringology and data structures
Case studies:
  Compressed text indexing
  Multi-key data structures
  Order vs. disorder in searching [Franceschini, G.]
  In-place vector sorting
27
Searching In-Place a Sorted(?) Array of Strings
“Imagine how hard it would be to use a dictionary if its
words were not alphabetized!”
-- D.E. Knuth, The Art of Comp. Prog., vol. 3, 1998
28
Order vs. Disorder: An experiment
Think of your desk…
1. Are the papers on your desk in sorted order?
2. Probably not!
3. Unsorted data seems to provide more informative content than sorted data…
4. Can we formalize this intuition in the comparison model?
29
Preprocessing by sorting
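In-place search of the lexicographically sorted array [Andersson, Hagerup, Håstad, Petersson, ’94, ’95, ’01] takes
$\Theta\!\left( \frac{k \log\log n}{\log\log\left(4 + \frac{k \log\log n}{\log n}\right)} + k + \log n \right)$ time.
Upper/lower bounds. The classical $\Theta(\log n)$ when k = 1.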
30
Permuting is better!
For any key length k, there exists an “unsorted” permutation attaining simultaneously
  $\Theta(k + \log n)$ time
  O(1) extra space
Optimal among all possible permutations, better than those resulting from sorting.
Warning: suffix array search is not in-place (since LCP takes more than O(1) extra cells).
31
Basic tool: Bit stealing
Simple, yet effective, idea on pairwise sorted keys: implicit bits are encoded by pairwise exchanging keys!
  bits:  0     1     0     1
  keys:  4 7   5 2   1 6   8 3
(a pair left in increasing order encodes 0, an exchanged pair encodes 1)
For keys of length k ⇒ O(k) slowdown factor.
Q: Can we get O(1) decoding time?
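Bit stealing on 1-dim keys can be sketched as follows, assuming distinct keys and the convention read off the slide's example (a pair in increasing order encodes 0, an exchanged pair encodes 1). The function names are made up for this illustration.

```python
def encode_bits(keys, bits):
    """Permute distinct keys so that consecutive pairs spell out the given bits."""
    assert len(keys) >= 2 * len(bits)
    out = list(keys)
    for t, bit in enumerate(bits):
        a, b = out[2 * t], out[2 * t + 1]
        lo, hi = min(a, b), max(a, b)
        # Increasing order encodes 0; exchanging the two keys encodes 1.
        out[2 * t], out[2 * t + 1] = (hi, lo) if bit else (lo, hi)
    return out


def decode_bits(keys, count):
    """Read back the first `count` implicit bits, one comparison per bit."""
    return [1 if keys[2 * t] > keys[2 * t + 1] else 0 for t in range(count)]


print(decode_bits([4, 7, 5, 2, 1, 6, 8, 3], 4))                    # [0, 1, 0, 1]
print(encode_bits([4, 7, 2, 5, 1, 6, 3, 8], [0, 1, 0, 1]))          # [4, 7, 5, 2, 1, 6, 8, 3]
```

Each decoded bit costs one key comparison, which is where the O(k) slowdown for keys of length k comes from.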
32
K-dimensional bit stealing: Digging a ditch!
Using d = lcp(x_i, x_j) + 1, decode a bit in O(1) time by checking the mismatching characters x_i[d] and x_j[d].
Idea exploited for digging a ditch, in O(k + r) time:
DIGGING(x_1 … x_r)
  d ← 1, i ← 1, j ← r
  while i < j do    // twin positions i and j
    while d ≤ k and x_i[d] = x_j[d] do
      d ← d + 1
    i ← i + 1, j ← j − 1
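A runnable sketch of the digging loop is given below, under stated assumptions: keys are distinct strings of the same length k, the input is lexicographically sorted apart from exchanges between twin positions, and indices are 0-based. Recording the depth for every twin pair and the helper `stolen_bit` are additions made here for illustration; the slide shows only the loop.

```python
def dig(xs, k):
    """Return (i, j, d) per twin pair, where d is the 0-based first mismatch index."""
    d, i, j = 0, 0, len(xs) - 1
    depths = []
    while i < j:  # twin positions i and j
        # d never decreases: inner pairs share at least the prefix of outer pairs.
        while d < k and xs[i][d] == xs[j][d]:
            d += 1
        depths.append((i, j, d))
        i, j = i + 1, j - 1
    return depths


def stolen_bit(xs, i, j, d):
    """Decode one bit with a single character comparison at depth d."""
    return 1 if xs[i][d] > xs[j][d] else 0


keys = ["aaa", "aab", "abb", "abc", "bbb", "bbc"]
print(dig(keys, 3))  # [(0, 5, 0), (1, 4, 0), (2, 3, 2)]

exchanged = ["aaa", "aab", "abc", "abb", "bbb", "bbc"]  # twins 2 and 3 swapped
print(stolen_bit(keys, 2, 3, 2), stolen_bit(exchanged, 2, 3, 2))  # 0 1
```

Because d only moves forward and i, j only move inward, the loop does O(k + r) character comparisons in total.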
33
Ditch: twin positions and twin intervals
Create twin intervals with same digging depth; bit stealing is O(1) time with keys in twin positions.
34
Large DITCH
Encode information for the twin intervals in O(k log n) distinct keys (which are still searchable).
These twin positions can encode 3 bits
35
Inside each twin interval T
Searching A reduces to searching in a specific twin interval T.
Use a modified Manber-Myers search that accesses just O(log n) stolen bits in T for lcp information (instead of O(log n) × O(log n) bits).
It is provably more efficient to keep data “unsorted” rather than “sorted” for in-place searching.
36
Interaction between stringology and data structures
Case studies:
  Compressed text indexing
  Multi-key data structures
  Order vs. disorder in searching
  In-place vector sorting [Franceschini, G.]
37
Logical order ≡ physical layout
Knuth’s indirect addressing:
1. permute the records’ pointers to find their ranks
2. permute the records according to the ranks
What if records are scrambled during merging?
Irregular access pattern to records
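The two steps of indirect addressing can be sketched as follows. This is a generic illustration, not code from the talk: it keeps an index array and a `done` array of size n, which is exactly the kind of extra space an in-place model rules out. The function name and the `key` parameter are assumptions.

```python
def sort_by_indirect_addressing(records, key=lambda r: r):
    """Sort records by first sorting pointers, then permuting the records."""
    n = len(records)
    # Step 1: sort pointers (indices); order[r] is the index of the record of rank r.
    order = sorted(range(n), key=lambda i: key(records[i]))
    # Step 2: permute the records cycle by cycle, so each record is moved once.
    done = [False] * n
    for start in range(n):
        if done[start] or order[start] == start:
            done[start] = True
            continue
        tmp = records[start]          # hold one record while the cycle is rotated
        dst = start
        while order[dst] != start:
            records[dst] = records[order[dst]]
            done[dst] = True
            dst = order[dst]
        records[dst] = tmp
        done[dst] = True
    return records


print(sort_by_indirect_addressing(["d", "a", "c", "b"]))  # ['a', 'b', 'c', 'd']
```

Step 2 jumps around the array following the cycles of the permutation, which is the irregular access pattern the slide points out.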
38
In-place model for vector sorting: GVSP(m, p, h)
Comparison model extended to keys of length k, using O(1) extra memory cells:
  m vectors of length k to be sorted
  p vectors for internal buffering
  h stolen bits with 2h vectors
Initially ⇒ m = n and p = h = 0
39
Optimal time-space bounds
Recursively reduce GVSP(m, p, h) to simpler instances
Use internal implicit data structures for strings in some of the instances
Sorting cost is time-space optimal:
  O(nk + n log n) time/comparisons
  O(n) vector moves
  O(1) words of memory for extra space
40
Conclusions
Joint work on the “reverse” contribution, from stringology to data structure/algorithm design.
Fruitful interplay between the two areas:
  Compressed text indexing
  Multi-key data structures
  Order vs. disorder in searching
  In-place vector sorting