
Interplay between Stringology and Data Structure Design

Roberto Grossi


(limited view: my own experience)






Interaction between stringology and data structures

Case studies:
  Compressed text indexing [G., Gupta, Vitter]
  Multi-key data structures [Crescenzi, G., Italiano] [Franceschini, G.] [G., Italiano]
  Order vs. disorder in searching [Franceschini, G.]
  In-place vector sorting [Franceschini, G.]

Compressed text indexing

Replace text $\in \Sigma^n$ $\Rightarrow$ self-indexing binary string [Ferragina, Manzini] [Sadakane] [G., Gupta, Vitter]

$n \log |\Sigma|$ bits $\Rightarrow$ $n H_h + \ldots$ bits (where $H_h$ = $h$-order empirical entropy)

Unique algorithmic framework: wavelet tree + finite set model + succinct dictionaries + …

Text indexing: new implementation of CSA (compressed suffix array)




Compression: new analysis of BWT (Burrows-Wheeler transform)


Suffix arrays, BWT, and $H_h$ (high-order empirical entropy)

Equivalently, use contexts $x$ of order $h$ for $cx$ instead of $xc$

T = mississippi#

 SA   BWT   sorted suffix
 12    i    #
 11    p    i#
  8    s    ippi#
  5    s    issippi#
  2    m    ississippi#
  1    #    mississippi#
 10    p    pi#
  9    i    ppi#
  7    s    sippi#
  4    s    sissippi#
  6    i    ssippi#
  3    i    ssissippi#
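The suffix array and BWT of T = mississippi# shown above can be reproduced with a few lines of Python. This is only a naive sketch (it sorts whole suffixes, so it is far from the efficient constructions used in practice), and the function name `suffix_array_and_bwt` is made up for illustration.

```python
def suffix_array_and_bwt(text):
    """Return (1-based suffix array, BWT string) for a text ending in a unique smallest sentinel."""
    n = len(text)
    # Naive construction: sort the starting positions by the suffix they begin.
    order = sorted(range(n), key=lambda i: text[i:])
    suffix_array = [i + 1 for i in order]  # 1-based positions, as on the slide
    # The BWT symbol is the one preceding each sorted suffix (cyclically for position 0).
    bwt = "".join(text[i - 1] for i in order)
    return suffix_array, bwt


sa, bwt = suffix_array_and_bwt("mississippi#")
print(sa)
print(bwt)
```

The sentinel `#` sorts before the lowercase letters in ASCII, which is what makes the plain string comparison agree with the order used on the slide.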


Suffix arrays, BWT, and $H_h$ (high-order empirical entropy)

Context $x$ = i, $h = 1$

Chars $c$ = p, s, m

Store “pssm” using just $\log \binom{4}{1,\,2,\,1}$ bits

Get $n H_h$ bits! Add bits to encode the partition.
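A small Python sketch of the context partition may help here. It groups each symbol c by the h symbols that follow it and sums the zero-order cost of every group, which gives n H_h. The helper names are hypothetical, and wrapping around the end of the text (as the cyclic BWT does) is an assumption of this sketch.

```python
from collections import Counter, defaultdict
from math import log2


def context_groups(text, h):
    """Group each symbol c by the h symbols that follow it (cyclically)."""
    n = len(text)
    groups = defaultdict(list)
    for i, c in enumerate(text):
        # Context x of order h for the pair cx; wrapping around is an assumption.
        context = "".join(text[(i + 1 + j) % n] for j in range(h))
        groups[context].append(c)
    return groups


def zero_order_bits(symbols):
    """Return |symbols| * H_0(symbols), in bits."""
    total = len(symbols)
    return sum(cnt * log2(total / cnt) for cnt in Counter(symbols).values())


def n_times_hh(text, h):
    """Return n * H_h(text) as the sum of the zero-order costs of the context groups."""
    return sum(zero_order_bits(g) for g in context_groups(text, h).values())


groups = context_groups("mississippi#", 1)
print(sorted(groups["i"]))          # the symbols p, s, s, m seen before context i
print(zero_order_bits(groups["i"]))
print(n_times_hh("mississippi#", 1))
```

For context i the group is the multiset {p, s, s, m}, the same symbols that appear as “pssm” in the BWT column.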


Incremental representation

Example: mark p in pssm $\rightarrow$ 1000

remove p; mark m in ssm $\rightarrow$ 001

remove m; mark s in ss $\rightarrow$ 11

We obtain 3 subsets. Encode each subset, containing $t$ items out of $n$, using $\log \binom{n}{t}$ bits.
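The mark-and-remove steps of the example can be sketched as follows. The function names are illustrative, and the cost function only evaluates the information-theoretic bound for a subset of t items out of n; it does not build an actual succinct dictionary.

```python
from math import comb, log2


def incremental_marks(symbols, order):
    """Mark one symbol at a time, then remove it; return the list of bitvectors."""
    remaining = list(symbols)
    bitvectors = []
    for c in order:
        bitvectors.append("".join("1" if s == c else "0" for s in remaining))
        remaining = [s for s in remaining if s != c]
    return bitvectors


def subset_bits(bitvector):
    """Information-theoretic cost log2 C(n, t) of a subset of t items out of n."""
    return log2(comb(len(bitvector), bitvector.count("1")))


marks = incremental_marks("pssm", order="pms")
print(marks)
print(sum(subset_bits(b) for b in marks))
```

On “pssm” with the order p, m, s this yields the three bitvectors 1000, 001 and 11, and the costs add up to log2(4 · 3 · 1) = log2 12 bits.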


Getting the multinomial coefficient

Sum of the log binomial coefficients of the subsets’ sizes
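As a worked instance of this telescoping sum, here is the general identity followed by the running example pssm. The symbols $n_1, \ldots, n_r$ for the symbol counts are notation introduced here for illustration.

```latex
\[
\log\binom{n}{n_1} + \log\binom{n-n_1}{n_2} + \cdots + \log\binom{n_r}{n_r}
  = \log\frac{n!}{n_1!\,n_2!\cdots n_r!}
  = \log\binom{n}{n_1,\ldots,n_r},
\]
\[
\log\binom{4}{1} + \log\binom{3}{1} + \log\binom{2}{2}
  = \log(4\cdot 3\cdot 1) = \log 12 = \log\binom{4}{1,\,1,\,2}.
\]
```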


Wavelet trees

Generalize the idea from the linear list to any tree shape

Cost is independent of the shape (e.g. assign access frequencies)
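A minimal wavelet tree sketch in Python is given below. It fixes a balanced shape for simplicity, whereas the slide's point is that any shape gives the same space cost. The bitvectors are plain lists with prefix counts rather than compressed subset encodings, so the sketch shows the navigation only, not the space bound. The class and its `rank` method are illustrative names, and `rank` assumes the queried symbol occurs in the alphabet.

```python
class WaveletTree:
    """Balanced wavelet tree over a sequence; bitvectors are plain lists (not succinct)."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return  # leaf: every symbol reaching this node is the same
        mid = len(self.alphabet) // 2
        left_set = set(self.alphabet[:mid])
        # Bit 0 sends a symbol to the left child, bit 1 to the right child.
        bits = [0 if c in left_set else 1 for c in seq]
        # prefix_ones[i] = number of 1s among the first i bits (naive rank support).
        self.prefix_ones = [0]
        for b in bits:
            self.prefix_ones.append(self.prefix_ones[-1] + b)
        self.left = WaveletTree([c for c in seq if c in left_set], self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in left_set], self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols (c must be in the alphabet)."""
        if self.left is None:
            return i
        mid = len(self.alphabet) // 2
        ones = self.prefix_ones[i]
        if c in self.alphabet[:mid]:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)


wt = WaveletTree("ipssm#pissii")
print(wt.rank("s", 12))
```

Each query walks one root-to-leaf path and performs one binary rank per level, so a shape that places frequent symbols near the root shortens their paths.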


Bound on bits of space

Crucial encoding for the BWT partition: $r$ (positive) subsets’ sizes, $t_1, \ldots, t_r$, with $\sum_i t_i = n$.

Let $\mathrm{enc}(t_1, \ldots, t_r)$ be the number of bits for encoding the sequence of these $r$ sizes.

Let $g' = |\Sigma|^{h+1}$ and $g = |\Sigma|^{h+1} \log |\Sigma|$, both independent of $n \to \infty$.

Then $r \le g'$, and storing the BWT takes
$n H_h + \left[\mathrm{enc}(t_1, \ldots, t_r) - \tfrac{1}{2} \sum_i \log t_i\right] + O(r \log |\Sigma|) \;\le\; n H_h + g' \log(n/g') + O(g)$ bits.


Case studies:
  Compressed text indexing
  Multi-key data structures [Crescenzi, G., Italiano] [Franceschini, G.] [G., Italiano]
  Order vs. disorder in searching
  In-place vector sorting

Why multi-key data? Strings are everywhere…

Keys are arbitrarily long
  Multi-dimensional points
  Multiple precision numbers
  Textual data
  XML paths
  URLs and IP addresses
  …

Modeled as strings in $\Sigma^k$, for unbounded alphabets $\Sigma$.

Q: How to avoid the O(k) slowdown factor in the cost of the operations supported by known data structures?

I. Ad hoc data structures

Some examples:
  ternary search trees [Clampett] [Bentley, Sedgewick]
  tries […]
  lexicographic D-trees [Mehlhorn]
  multi-dimensional B-trees [Gueting, Kriegel]
  multi-dimensional AVL trees [Vaishnavi]
  lexicographic splay trees [Sleator, Tarjan]
  multi-dimensional BST [Gonzalez] [Roura]
  multi-BB-trees [Vaishnavi]
  …

Search, insert, delete in O(k + log n) time
Split and concatenate in O(k + log n) time

II. Augmenting access paths

Reuse many data structures for 1-dim keys:
  AVL trees, red-black trees
  skip lists
  (a,b)-trees
  BB[α]-trees
  self-adjusting trees
  random search trees (treaps, …)
  …

Inherit their combinatorial properties
Traversing is driven by comparisons

III. Using an oracle for strings

Data structure D = black box performing comparisons on pairs of 1-dim keys.

General theorem for transforming D into a data structure D’ for strings (no efficiency loss).

Oracle DSlcp for maintaining order in a linked list of strings, along with their lcps (extends Dietz-Sleator list).

The general technique

New data structure D’ = old data structure D + oracle DSlcp

Method: comparison is O(1)-time if we know

• $\mathrm{lcp}(x, y) = \min \{\, j \mid x[j+1] \neq y[j+1] \,\}$
  ($x < y$ iff $x[\mathrm{lcp}+1] < y[\mathrm{lcp}+1]$)

• use DSlcp for storing and comparing pairs of strings in D’ in constant time

• use predecessors and lcps computed so far to insert a new string y into D’ (and DSlcp)
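The O(1)-time comparison can be sketched in Python as below. This shows only the comparison step once the lcp is known; it does not implement the DSlcp oracle or the insertion procedure, and both function names are hypothetical. Indices are 0-based here, so the mismatch sits at index `known_lcp` rather than lcp + 1.

```python
def lcp(x, y):
    """Length of the longest common prefix of x and y (O(k) time when computed from scratch)."""
    n = min(len(x), len(y))
    j = 0
    while j < n and x[j] == y[j]:
        j += 1
    return j


def less_than(x, y, known_lcp):
    """Decide x < y in O(1) time, given known_lcp = lcp(x, y)."""
    if known_lcp == len(x):      # x is a prefix of y (or x == y)
        return len(x) < len(y)
    if known_lcp == len(y):      # y is a proper prefix of x
        return False
    # Single character comparison at the first mismatch (0-based index known_lcp).
    return x[known_lcp] < y[known_lcp]


print(less_than("mississippi", "missouri", lcp("mississippi", "missouri")))
```

The point of the technique is that `lcp` is never recomputed from scratch for strings already stored: the oracle supplies it, so only `less_than` is paid for each comparison.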


Theorem for general transformation

Comparison-driven data structure D for n keys: ins identifies pred or succ (it does not necessarily imply $\Omega(\log n)$ per ins in a sequence of operations; e.g., finger search trees)

  D for n keys                                       String data structure D’ for n strings
  Space S(n)                                         Space S(n) + O(n)
  Operation op on O(1) keys in D, in T(n) time       Operation op on O(1) strings in D’, in O(T(n)) time
  Operation op involving y not in D, in T(n) time    Operation op involving y not in D’, in O(T(n) + k) time

Some features

No need to reinvent the wheel for data structure designers

Better than using compact trie + Dietz-Sleator list + dynamic LCA when T(n) = o(log n), e.g.:
  weighted search: $O(\log(\sum_i w_i / w))$
  finger search: $O(\log d)$
  set manipulation: $O(n \log(m / n))$


Case studies:
  Compressed text indexing
  Multi-key data structures
  Order vs. disorder in searching [Franceschini, G.]
  In-place vector sorting

Searching In-Place a Sorted(?) Array of Strings

“Imagine how hard it would be to use a dictionary if its words were not alphabetized!”
-- D.E. Knuth, The Art of Comp. Prog., vol. 3, 1998

Order vs. Disorder:An experiment

Think of your desk…

1. Are the papers on your desk in sorted order?

2. Probably not!

3. Unsorted data seems to provide more informative content than sorted data…

4. Can we formalize this intuition in the comparison model?


Preprocessing by sorting

In-place search of the lexicographically sorted array: upper/lower bounds on the time are given in [Andersson, Hagerup, Håstad, Petersson, ’94, ’95, ’01].

They give the classical $\Theta(\log n)$ when k = 1.

Permuting is better!

For any key length k, there exists an “unsorted” permutation attaining simultaneously
  $\Theta(k + \log n)$ time
  O(1) extra space

Optimal among all possible permutations, better than those resulting from sorting.

Warning: suffix array search is not in-place (since LCP takes more than O(1) extra cells).

Basic tool: Bit stealing

Simple, yet effective, idea on pairwise sorted keys:

  bits:  0     1     0     1
  keys:  4 7   5 2   1 6   8 3

Implicit bits encoded by pairwise exchanging keys!

For keys of length k $\Rightarrow$ O(k) slowdown factor.

Q: Can we get O(1) decoding time?
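Bit stealing on pairs of distinct keys can be sketched as follows: a pair left in increasing order stores a 0, an exchanged pair stores a 1. The function names are illustrative, and `encode_bits` copies the array for clarity, whereas the actual technique exchanges keys in place.

```python
def encode_bits(keys, bits):
    """Encode bits in an array of distinct keys by ordering each consecutive pair."""
    out = list(keys)
    for b, i in zip(bits, range(0, len(out) - 1, 2)):
        lo, hi = sorted(out[i:i + 2])
        # Bit 0: pair in increasing order; bit 1: pair exchanged.
        out[i], out[i + 1] = (lo, hi) if b == 0 else (hi, lo)
    return out


def decode_bits(keys):
    """Read back one bit per pair with a single key comparison."""
    return [1 if keys[i] > keys[i + 1] else 0 for i in range(0, len(keys) - 1, 2)]


print(decode_bits([4, 7, 5, 2, 1, 6, 8, 3]))
print(encode_bits([4, 7, 2, 5, 1, 6, 3, 8], [0, 1, 0, 1]))
```

Decoding costs one key comparison per bit, which is where the O(k) slowdown comes from when the keys are strings of length k.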


K-dimensional bit stealing: Digging a ditch!

Using $d = \mathrm{lcp}(x_i, x_j) + 1$, decode a bit in O(1) time, by checking the mismatching characters $x_i[d]$ and $x_j[d]$.

Idea exploited for digging a ditch, in O(k + r) time:

DIGGING(x1 … xr)
  d ← 1, i ← 1, j ← r
  while i < j do    // twin positions i and j
    while d ≤ k and xi[d] = xj[d] do d ← d + 1
    i ← i + 1, j ← j − 1
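A runnable Python version of the DIGGING loop is sketched below. It assumes the keys are lexicographically sorted strings of the same length k, and it additionally records the depth d reached for each twin pair, which the pseudocode leaves implicit. Because the keys are sorted, the lcp of a twin pair cannot decrease when moving inward, so d never has to restart and the total work is O(k + r).

```python
def digging(keys, k):
    """Return the digging depth d (1-based mismatch position) for each twin pair.

    keys: lexicographically sorted list of strings, each of length k.
    """
    depths = []
    d, i, j = 1, 1, len(keys)            # 1-based indices, as in the pseudocode
    while i < j:                          # twin positions i and j
        while d <= k and keys[i - 1][d - 1] == keys[j - 1][d - 1]:
            d += 1
        depths.append(d)                  # here d = lcp(x_i, x_j) + 1
        i, j = i + 1, j - 1
    return depths


print(digging(["aab", "abc", "abd", "abe", "acd", "bcd"], 3))
```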


Ditch: twin positions and twin intervals

Create twin intervals with same digging depth; bit stealing is O(1) time with keys in twin positions.


Large DITCH

Encode information for the twin intervals in O(k log n) distinct keys (which are still searchable).

These twin positions can encode 3 bits


Inside each twin interval T

Searching A reduces to searching in a specific twin interval T.

Use modified Manber-Myers search for accessing just O(log n) stolen bits in T for lcp information (instead of O(log n) × O(log n) bits).

It is provably more efficient to keep data “unsorted” rather than “sorted” for in-place searching.



Case studies:
  Compressed text indexing
  Multi-key data structures
  Order vs. disorder in searching
  In-place vector sorting [Franceschini, G.]

Logical order $\equiv$ physical layout

Knuth’s indirect addressing:
1. permute the records’ pointers to find their ranks
2. permute the records according to the ranks

What if records are scrambled during merging?

Irregular access pattern to records
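The two steps of indirect addressing can be sketched in Python as follows. The function name is made up for illustration. Note that the sketch keeps an array of n pointers, which is exactly the extra space an in-place algorithm has to avoid.

```python
def sort_by_indirect_addressing(records):
    """Sort records: compute ranks via pointers, then permute the records along cycles."""
    n = len(records)
    # Step 1: permute pointers (indices) instead of records to find the ranks.
    # pointers[r] = index of the record that belongs at position r.
    pointers = sorted(range(n), key=lambda i: records[i])
    # Step 2: move each record once by following the cycles of the permutation.
    for start in range(n):
        if pointers[start] == start:
            continue
        saved = records[start]
        r = start
        while pointers[r] != start:
            src = pointers[r]
            records[r] = records[src]
            pointers[r] = r               # position r is now final
            r = src
        records[r] = saved
        pointers[r] = r
    return records


print(sort_by_indirect_addressing(["c", "a", "b"]))
```

Every record is moved once, plus one temporary copy per cycle, but the records are visited in the order dictated by the permutation, which is the irregular access pattern mentioned above.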


In-place model for vector sorting: GVSP(m, p, h)

Comparison model extended to keys of length k, using O(1) extra memory cells:
  m vectors of length k to be sorted
  p vectors for internal buffering
  h stolen bits with 2h vectors

initially $\Rightarrow$ m = n and p = h = 0

Optimal time-space bounds

Reduce recursively GVSP( ) to simpler instances

Use internal implicit data structures for strings in some of the instances

Sorting cost is time-space optimal:
  O(nk + n log n) time/comparisons
  O(n) vector moves
  O(1) words of memory for extra space

Conclusions

Joint work on the “reverse” contribution, from stringology to data structure/algorithm design.

Fruitful interplay between the two areas:
  Compressed text indexing
  Multi-key data structures
  Order vs. disorder in searching
  In-place vector sorting
