Interplay between Stringology and Data Structure Design Roberto Grossi <[email protected]>


Post on 18-Dec-2015


Page 2: Interplay between Stringology and Data Structure Design Roberto Grossi

Interplay between Stringology and Data Structure Design

(limited view: my own experience)

Roberto Grossi <[email protected]>

Page 4

Interaction between stringology and data structures

Case studies:
• Compressed text indexing [G., Gupta, Vitter]
• Multi-key data structures [Crescenzi, G., Italiano] [Franceschini, G.] [G., Italiano]
• Order vs. disorder in searching [Franceschini, G.]
• In-place vector sorting [Franceschini, G.]

Page 6

Compressed text indexing

Replace text $\in \Sigma^n$ $\Rightarrow$ self-indexing binary string [Ferragina, Manzini] [Sadakane] [G., Gupta, Vitter]

$n \log |\Sigma|$ bits $\Rightarrow$ $n H_h + \ldots$ bits (where $H_h$ = h-order empirical entropy)

Unique algorithmic framework: wavelet tree + finite set model + succinct dictionaries + …

Compression: new analysis of BWT (Burrows-Wheeler transform)

Text indexing: new implementation of CSA (compressed suffix array)

Page 7

Suffix arrays, BWT, and $H_h$ (high-order empirical entropy)

Equivalently, use contexts x of order h for cx instead of xc.

T = mississippi#

BWT   SA   Sorted suffix
i     12   #
p     11   i#
s      8   ippi#
s      5   issippi#
m      2   ississippi#
#      1   mississippi#
p     10   pi#
i      9   ppi#
s      7   sippi#
s      4   sissippi#
i      6   ssippi#
i      3   ssissippi#

Page 8

Suffix arrays, BWT, and $H_h$ (high-order empirical entropy)

Context x = i, h = 1

Chars c = p, s, m

Store “pssm” using just $\log \binom{4}{1,2,1}$ bits.

Get $n H_h$ bits! Add bits to encode the partition.
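The per-context accounting on this slide can be checked numerically. The following is a minimal sketch, assuming the "cx" convention (each character c is grouped by the h characters x that follow it) and a cyclic reading of the text so that the end marker # wraps around; the function names are illustrative and do not come from the talk.

```python
from collections import Counter, defaultdict
from math import factorial, log2

def chars_by_context(text, h):
    """Group each character c by the h characters x that follow it (cx convention).
    The text is read cyclically, as the BWT does with the end marker '#'."""
    n = len(text)
    groups = defaultdict(list)
    for i in range(n):
        context = "".join(text[(i + 1 + j) % n] for j in range(h))
        groups[context].append(text[i])
    return groups

def log2_multinomial(counts):
    """log2 of the multinomial coefficient (sum of counts)! / product of count!."""
    counts = list(counts)
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return log2(total)

def n_times_entropy(groups):
    """n * H_h, computed as the sum over contexts x of n_x * H_0(chars in x)."""
    bits = 0.0
    for chars in groups.values():
        m = len(chars)
        for count in Counter(chars).values():
            bits += count * log2(m / count)
    return bits

groups = chars_by_context("mississippi#", 1)
multinomial_bits = sum(log2_multinomial(Counter(g).values()) for g in groups.values())
entropy_bits = n_times_entropy(groups)
```

For context i the group is the multiset {p, s, s, m}, matching the slide, and the sum of the log-multinomials never exceeds $n H_h$, which is why encoding each context separately stays within the entropy bound (before paying for the partition itself).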

Page 9

Incremental representation

Example:
mark p: pssm → 1000
remove p; mark m: ssm → 001
remove m; mark s: ss → 11

We obtain 3 subsets. Encode each subset, containing t items out of n, using $\lceil \log \binom{n}{t} \rceil$ bits.
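The mark-and-remove steps of the example can be sketched as follows. This is an illustration only: the helper names are hypothetical, the bitvectors are kept as plain strings, and the cost function reports the information-theoretic $\log_2 \binom{n}{t}$ without rounding.

```python
from math import comb, log2

def incremental_bitvectors(chars, order):
    """For each symbol in `order`, mark its positions among the remaining
    characters with 1s, then remove it before handling the next symbol."""
    remaining = list(chars)
    result = []
    for symbol in order:
        bits = "".join("1" if c == symbol else "0" for c in remaining)
        result.append((symbol, bits))
        remaining = [c for c in remaining if c != symbol]
    return result

def subset_cost_bits(bits):
    """Cost log2(C(n, t)) of a subset of t marked items out of n."""
    return log2(comb(len(bits), bits.count("1")))

vectors = incremental_bitvectors("pssm", ["p", "m", "s"])
total_bits = sum(subset_cost_bits(bits) for _, bits in vectors)
```

Here `vectors` reproduces the three bitvectors of the example (1000, 001, 11), and `total_bits` equals $\log_2 12$, the log of the multinomial coefficient for one p, one m and two s.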

Page 10

Getting the multinomial coefficient

Sum of the log binomial coefficients of the subsets’ sizes
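The identity behind this statement can be written out. Assuming subset sizes $t_1, \ldots, t_r$ with $\sum_i t_i = n$, where the $i$-th subset is chosen among the items left after removing the first $i-1$ subsets, the factorials telescope:

$$
\sum_{i=1}^{r} \log \binom{n - \sum_{j<i} t_j}{t_i}
= \log \prod_{i=1}^{r} \frac{\left(n - \sum_{j<i} t_j\right)!}{t_i!\,\left(n - \sum_{j \le i} t_j\right)!}
= \log \frac{n!}{t_1!\, t_2! \cdots t_r!}
$$

For “pssm” this gives $\log \binom{4}{1} + \log \binom{3}{1} + \log \binom{2}{2} = \log 12 = \log \frac{4!}{1!\,1!\,2!}$.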

Page 11

Wavelet trees

Generalize the idea from the linear list to any tree shape

Cost is independent of the shape (e.g. assign access frequencies)
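A wavelet tree can be sketched in a few lines. The version below is a simplified illustration, not the structure from the talk: it assumes a balanced shape (the slide notes that other shapes work as well), stores each bitvector as a plain Python list, and answers rank by linear counting where a real implementation would use succinct rank dictionaries.

```python
class WaveletTree:
    """Balanced wavelet tree over a string; each internal node stores one bitvector."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.bits = []
        self.left = self.right = None
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            left_symbols = set(self.alphabet[:mid])
            # 0 = symbol goes to the left child, 1 = symbol goes to the right child.
            self.bits = [0 if c in left_symbols else 1 for c in text]
            self.left = WaveletTree([c for c in text if c in left_symbols],
                                    self.alphabet[:mid])
            self.right = WaveletTree([c for c in text if c not in left_symbols],
                                     self.alphabet[mid:])

    def rank(self, symbol, i):
        """Number of occurrences of `symbol` among the first i characters."""
        if self.left is None:  # leaf: at most one symbol left
            return i if self.alphabet and symbol == self.alphabet[0] else 0
        mid = len(self.alphabet) // 2
        if symbol in self.alphabet[:mid]:
            return self.left.rank(symbol, self.bits[:i].count(0))
        return self.right.rank(symbol, self.bits[:i].count(1))

    def access(self, i):
        """Character at position i (0-based), recovered from the bitvectors alone."""
        if self.left is None:
            return self.alphabet[0]
        if self.bits[i] == 0:
            return self.left.access(self.bits[:i].count(0))
        return self.right.access(self.bits[:i].count(1))

bwt = "ipssm#pissii"
tree = WaveletTree(bwt)
```

With the BWT of mississippi# as input, `tree.rank("s", 4)` counts the two occurrences of s in the prefix "ipss", and `tree.access(i)` reconstructs every character without storing the text itself.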

Page 13

Bound on bits of space

Crucial encoding for the BWT partition: r (positive) subsets’ sizes, $t_1, \ldots, t_r$, with $\sum_i t_i = n$.

Let $\mathrm{enc}(t_1, \ldots, t_r)$ be the number of bits for encoding the sequence of these r sizes.

Let $g' = |\Sigma|^{h+1}$ and $g = |\Sigma|^{h+1} \log |\Sigma|$, both independent of $n \to \infty$.

Then $r \le g'$ and storing BWT takes
$n H_h + \left[\mathrm{enc}(t_1, \ldots, t_r) - \tfrac{1}{2} \sum_i \log t_i\right] + O(r \log |\Sigma|) \le n H_h + g' \log(n/g') + O(g)$ bits.

Page 14

Interaction between stringology and data structures

Case studies:
• Compressed text indexing
• Multi-key data structures [Crescenzi, G., Italiano] [Franceschini, G.] [G., Italiano]
• Order vs. disorder in searching
• In-place vector sorting

Page 15

Why multi-key data? Strings are everywhere…

Keys are arbitrarily long:
• Multi-dimensional points
• Multiple precision numbers
• Textual data
• XML paths
• URLs and IP addresses
• …

Modeled as strings in $\Sigma^k$, for unbounded alphabets $\Sigma$.

Q: How to avoid the O(k) slowdown factor in the cost of the operations supported by known data structures?

Page 16

I. Ad hoc data structures

Some examples:
• ternary search trees [Clampett] [Bentley, Sedgewick]
• tries […]
• lexicographic D-trees [Mehlhorn]
• multi-dimensional B-trees [Gueting, Kriegel]
• multi-dimensional AVL trees [Vaishnavi]
• lexicographic splay trees [Sleator, Tarjan]
• multi-dimensional BST [Gonzalez] [Roura]
• multi-BB-trees [Vaishnavi]
• …

Search, insert, delete in O(k + log n) time
Split and concatenate in O(k + log n) time

Page 17

II. Augmenting access paths

Reuse many data structures for 1-dim keys:
• AVL trees, red-black trees
• skip lists
• (a,b)-trees
• BB[α]-trees
• self-adjusting trees
• random search trees (treaps, …)
• …

Inherit their combinatorial properties
Traversing is driven by comparisons

Page 18

III. Using an oracle for strings

Data structure D = black box performing comparisons on pairs of 1-dim keys.

General theorem for transforming D into a data structure D’ for strings (no efficiency loss).

Oracle $DS_{lcp}$ for maintaining order in a linked list of strings, along with their lcps (extends Dietz-Sleator list).

Page 19

The general technique

New data structure D’ = old data structure D + oracle $DS_{lcp}$

Method: comparison is O(1)-time if we know

• $\mathrm{lcp}(x, y) = \min \{\, j \mid x[j+1] \ne y[j+1] \,\}$

(x < y iff x[lcp+1] < y[lcp+1])

use $DS_{lcp}$ for storing and comparing pairs of strings in D’ in constant time

use predecessors and lcps computed so far to insert a new string y into D’ (and $DS_{lcp}$)
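The idea of reusing lcps computed so far can be illustrated on a static sorted array. The sketch below is not the dynamic $DS_{lcp}$ oracle of the talk; it swaps in the simpler binary search that remembers the lcp of the query with the two current boundaries, so that characters already matched are not compared again. Indices are 0-based, so the slide's x[lcp+1] becomes `x[lcp]`, and all names are illustrative.

```python
def lcp_from(x, y, start):
    """Length of the longest common prefix of x and y, given that the
    first `start` characters are already known to match."""
    d = start
    while d < len(x) and d < len(y) and x[d] == y[d]:
        d += 1
    return d

def less_than(x, y, lcp):
    """O(1) comparison once lcp(x, y) is known: look at the first mismatch."""
    if lcp == len(x) or lcp == len(y):
        return len(x) < len(y)  # one string is a prefix of the other
    return x[lcp] < y[lcp]

def insertion_rank(sorted_strings, y):
    """Number of strings smaller than y, reusing lcps computed so far."""
    lo, hi = 0, len(sorted_strings)  # y belongs somewhere in sorted_strings[lo:hi]
    lcp_lo = lcp_hi = 0              # lcps of y with the current boundary strings
    while lo < hi:
        mid = (lo + hi) // 2
        # Every string between the boundaries shares at least min(lcp_lo, lcp_hi)
        # characters with y, so the scan can start from that offset.
        d = lcp_from(sorted_strings[mid], y, min(lcp_lo, lcp_hi))
        if less_than(sorted_strings[mid], y, d):
            lo, lcp_lo = mid + 1, d
        else:
            hi, lcp_hi = mid, d
    return lo

words = sorted(["ippi", "issippi", "ississippi", "mississippi", "pi", "ppi",
                "sippi", "sissippi", "ssippi", "ssissippi"])
```

For example, `insertion_rank(words, "miss")` returns the position at which "miss" would be inserted, and each comparison between the query and a stored string costs O(1) beyond the characters newly matched.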

Pages 20-24

Theorem for general transformation

Comparison-driven data structure D for n keys (ins identifies pred or succ) ⇒ string data structure D’ for n strings:

• Space S(n) ⇒ space S(n) + O(n)
• Operation op on O(1) keys in D, in T(n) time ⇒ operation op on O(1) strings in D’, in O(T(n)) time
• Operation op involving y not in D, in T(n) time ⇒ operation op involving y not in D’, in O(T(n) + k) time

Note: “ins identifies pred or succ” does not necessarily imply $\Omega(\log n)$ per ins in a sequence of operations; e.g., finger search trees.

Page 25

Some features

No need to reinvent the wheel for data structure designers

Better than using compact trie + Dietz-Sleator list + dynamic LCA when T(n) = o(log n), e.g.:
• weighted search: $O(\log(\sum_i w_i / w))$
• finger search: $O(\log d)$
• set manipulation: $O(n \log(m/n))$

Page 26

Interaction between stringology and data structures

Case studies:
• Compressed text indexing
• Multi-key data structures
• Order vs. disorder in searching [Franceschini, G.]
• In-place vector sorting

Page 27

Searching In-Place a Sorted(?) Array of Strings

“Imagine how hard it would be to use a dictionary if its words were not alphabetized!”

-- D.E. Knuth, The Art of Computer Programming, vol. 3, 1998

Page 28

Order vs. Disorder: An experiment

Think of your desk…

1. Are the papers on your desk in sorted order?

2. Probably not!

3. Unsorted data seems to provide more informative content than sorted data…

4. Can we formalize this intuition in the comparison model?

Page 29

Preprocessing by sorting

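In-place search of the lexicographically sorted array, in [Andersson, Hagerup, Håstad, Petersson, ’94, ’95, ’01]:

$\Theta\!\left(\frac{k \log\log n}{\log\log\left(4 + \frac{k \log\log n}{\log n}\right)} + k + \log n\right)$ time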

Upper/lower bounds. The classical $\Theta(\log n)$ when k = 1.

Page 30

Permuting is better!

For any key length k, there exists an “unsorted” permutation attaining simultaneously
• $\Theta(k + \log n)$ time
• O(1) extra space

Optimal among all possible permutations, better than those resulting from sorting.

Warning: suffix array search is not in-place (since LCP takes more than O(1) extra cells).

Page 31

Basic tool: Bit stealing

Simple, yet effective, idea on pairwise sorted keys:

For keys of length k ⇒ O(k) slowdown factor.

Q: Can we get O(1) decoding time?

bits:  0     1     0     1
keys:  4 7 | 5 2 | 1 6 | 8 3

Implicit bits encoded by pairwise exchanging keys!
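Bit stealing on single-word keys can be sketched as follows. The function names are illustrative, and the sketch assumes distinct keys so that every pair has a well-defined order: a pair left in increasing order encodes 0, a swapped pair encodes 1.

```python
def encode_bits(keys, bits):
    """Store bits[i] in the pair (keys[2i], keys[2i+1]): 0 keeps the pair in
    increasing order, 1 swaps it. Keys must be distinct."""
    keys = list(keys)
    for i, bit in enumerate(bits):
        a, b = sorted(keys[2 * i:2 * i + 2])
        keys[2 * i], keys[2 * i + 1] = (b, a) if bit else (a, b)
    return keys

def decode_bit(keys, i):
    """Read the i-th implicit bit with a single comparison."""
    return 1 if keys[2 * i] > keys[2 * i + 1] else 0

slide_keys = [4, 7, 5, 2, 1, 6, 8, 3]
decoded = [decode_bit(slide_keys, i) for i in range(4)]
```

On the keys shown on the slide, `decoded` is the bit sequence 0 1 0 1. With keys of length k, the single comparison in `decode_bit` costs O(k) in the worst case, which is the slowdown the slide asks about.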

Page 32

K-dimensional bit stealing: Digging a ditch!

Using d = lcp(x_i, x_j) + 1, decode a bit in O(1) time, by checking mismatches, x_i[d] and x_j[d].

Idea exploited for digging a ditch, in O(k + r) time:

DIGGING(x_1 … x_r)
    d ← 1, i ← 1, j ← r
    while i < j do    // twin positions i and j
        while d ≤ k and x_i[d] = x_j[d] do
            d ← d + 1
        i ← i + 1, j ← j − 1
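The DIGGING procedure translates almost line by line into runnable code. The sketch below assumes a lexicographically sorted list of strings, all of length k, and records the depth reached at each twin pair, which the pseudocode leaves implicit; positions are 0-based while the depth d stays 1-based as on the slide.

```python
def digging(strings, k):
    """Scan twin positions (i, j) from the outside in over a lexicographically
    sorted list of length-k strings. Returns (i, j, d) triples, where d is the
    1-based digging depth reached for that pair."""
    depths = []
    d, i, j = 1, 0, len(strings) - 1
    while i < j:
        # d never decreases: in sorted order, an inner pair shares at least
        # the common prefix of every pair that encloses it.
        while d <= k and strings[i][d - 1] == strings[j][d - 1]:
            d += 1
        depths.append((i, j, d))
        i, j = i + 1, j - 1
    return depths

ditch = digging(["aab", "abc", "abd", "abe", "acb", "baa"], 3)
```

Since d only grows up to k + 1 and the pair (i, j) advances r/2 times, the total work is O(k + r), as stated on the slide. For distinct strings, each recorded d equals lcp(x_i, x_j) + 1, the position of the mismatch used for O(1) bit decoding.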

Page 33

Ditch: twin positions and twin intervals

Create twin intervals with same digging depth; bit stealing is O(1) time with keys in twin positions.

Page 34

Large DITCH

Encode information for the twin intervals in O(k log n) distinct keys (which are still searchable).

These twin positions can encode 3 bits

Page 35

Inside each twin interval T

Searching A reduces to searching in a specific twin interval T.

Use modified Manber-Myers search for accessing just O(log n) stolen bits in T for lcp information (instead of O(log n) × O(log n) bits).

It is provably more efficient to keep data “unsorted” rather than “sorted” for in-place searching.

Page 36

Interaction between stringology and data structures

Case studies:
• Compressed text indexing
• Multi-key data structures
• Order vs. disorder in searching
• In-place vector sorting [Franceschini, G.]

Page 37

Logical order ≡ physical layout

Knuth’s indirect addressing:
1. permute the records’ pointers to find their ranks
2. permute the records according to the ranks

What if records are scrambled during merging?

Irregular access pattern to records
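The two steps of indirect addressing can be sketched as below. This is an illustration with hypothetical function names; note that the pointer and rank arrays take O(n) extra cells, so the sketch is not in-place in the sense required later in the talk.

```python
def ranks_by_indirection(records):
    """Step 1: sort pointers (indices) instead of records, then derive ranks."""
    pointers = sorted(range(len(records)), key=lambda p: records[p])
    ranks = [0] * len(records)
    for rank, p in enumerate(pointers):
        ranks[p] = rank  # the record at position p belongs at position rank
    return ranks

def permute_in_place(records, ranks):
    """Step 2: move each record to its rank by following permutation cycles.
    Each swap puts one record in its final place, so at most n - 1 swaps occur."""
    for i in range(len(records)):
        while ranks[i] != i:
            target = ranks[i]
            records[i], records[target] = records[target], records[i]
            ranks[i], ranks[target] = ranks[target], ranks[i]

records = ["pear", "apple", "fig", "kiwi", "date"]
ranks = ranks_by_indirection(records)
permute_in_place(records, list(ranks))
```

Step 2 touches records in the order dictated by the cycles of the permutation, which is the irregular access pattern the slide points out.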

Page 38

In-place model for vector sorting: GVSP( )

Comparison model extended to keys of length k, using O(1) extra memory cells:
• m vectors of length k to be sorted
• p vectors for internal buffering
• h stolen bits with 2h vectors

Initially, m = n and p = h = 0.

Page 39

Optimal time-space bounds

Reduce recursively GVSP( ) to simpler instances

Use internal implicit data structures for strings in some of the instances

Sorting cost is time-space optimal:
• O(nk + n log n) time/comparisons
• O(n) vector moves
• O(1) words of memory for extra space

Page 40

Conclusions

Joint work on the “reverse” contribution, from stringology to data structure/algorithm design.

Fruitful interplay between the two areas:
• Compressed text indexing
• Multi-key data structures
• Order vs. disorder in searching
• In-place vector sorting