compressing and searching xml data via two zips

16
Paolo Ferragina, Università di P isa Compressing and Searching XML Data Via Two Zips Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan]

Upload: huey

Post on 12-Jan-2016

30 views

Category:

Documents


0 download

DESCRIPTION

Compressing and Searching XML Data Via Two Zips. Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan]. Opportunistic Data Structures with Applications P. Ferragina, G. Manzini. Survey by Navarro-Makinen cites more - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

Compressing and Searching XML Data Via

Two Zips

Paolo FerraginaDipartimento di Informatica, Università di Pisa

[Joint with F. Luccio, G. Manzini, S. Muthukrishnan]

Page 2: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

Six years ago... [now, J. ACM 05]

Opportunistic Data Structures with Applications

P. Ferragina, G. Manzini

Survey by Navarro-Makinen cites more than 50 papers on the subject !!

Page 3: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

An XML excerpt

<dblp> <book>

<author> Donald E. Knuth </author><title> The TeXbook </title><publisher> Addison-Wesley </publisher><year> 1986 </year>

</book> <article>

<author> Donald E. Knuth </author><author> Ronald W. Moore </author><title> An Analysis of Alpha-Beta Pruning </title><pages> 293-326 </pages><year> 1975 </year><volume> 6 </volume><journal> Artificial Intelligence </journal>

</article>

...</dblp>

It is verbose !

Page 4: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

A tree interpretation...

XML document exploration Tree navigation XML document search Labeled subpath

searches

Subset of XPath [W3C]

Page 5: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

The Problem

Summary indexes (like Dataguide, 1-index or 2-index) large space and do not support “content” searches

XML-aware compressors (like XMill, XmlPpm, ScmPpm,...) need the whole decompression

We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations:

Navigational operations: parent(u), child(u, i), child(u, i, c) Subpath searches: given a sequence of k labels Content searches: subpath + substring search Visualization operation: given a node, visualize its descending subtree

XML-queriable compressors (like XPress, XGrind, XQzip,...) poor compression and scan of the whole (compressed) file

XML-native search engines

might exploit this tool as a core block for

query optimization and (compressed) storage

Theoretically do exist many solutions, starting from [Jacobson, IEEE

Focs ’89] no subpath/content searches, and poor performance on labeled

trees

Page 6: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

A transform for “labeled trees” [Ferragina et al, IEEE Focs ’05]

We proposed the XBW-transform that mimics on trees the nice structural properties of the Burrows-and-Wheeler Trasform on strings (do you know bzip !?).

The XBW linearizes the tree T in 2 arrays s.t.:

the compression of T reduces to use any k-th order entropy compressor (gzip, bzip,...) over these two arrays

the indexing of T reduces to implement simple rank/select query operations over these two arrays

Page 7: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

The XBW-TransformC

B A B

D c

c a

b a D

c

D a

b

CBDcacAb aDcBDba

S

CB CD B CD B CB CCA CA CA CD A CCB CD B CB C

S

upward labeled paths

Permutationof tree nodes

Step 1.Visit the tree in pre-order. For each node, write down its label and the labels on its upward path

Page 8: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

The XBW-TransformC

B A B

D c

c a

b a D

c

D a

b

CbaDDc DaBABccab

S

A CA CA CB CB CB CB C CCCD A CD B CD B CD B C

S

upward labeled paths

Step 2.Stably sort according to S

Page 9: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

XBW takes optimal t log || + 2t bits

1001010 10011011

The XBW-TransformC

B A B

D c

c a

b a D

c

D a

b

CbaDDc DaBABccab

S

A CA CA CB CB CB CB C CCCD A CD B CD B CD B C

S

Step 3.Add a binary array Slast marking the

rows corresponding to last children

Slast

XBW

XBW can be built and inverted

in optimal O(t) time

Key fact

Nodes correspond to items in <Slast,S>

Page 10: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

XBzip – a simple XML compressor

Pcdata

Tags, Attributes and symbol =

XBW is compressible:

S and Spcdata are locally homogeneous

Slast has some structure

Page 11: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

XBzip = XBW + PPMd

String compressors are not so bad: within 5%

0%

5%

10%

15%

20%

25%

DBLP Pathways News

gzip bzip2 ppmdi xmill + ppmdi scmppm XBzip

Page 12: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

1001010 10011011

Some structural properties

CbaDDc DaBABccab

S

A CA CA CB CB CB CB C CCCD A CD B CD B CD B C

SSlast

XBW

C

B A B

D c

c a

b a D

c

D a

b

C

B A B

D c

c a

b a D

c

D a

b

Two useful properties:

• Children are contiguous and delimited by 1s

• Children reflect the order of their parents

B

Page 13: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

1001010 10011011

XBW is navigational

CbaDDc DaBABccab

S

A CA CA CB CB CB CB C CCCD A CD B CD B CD B C

SSlast

XBW

C

B A B

D c

c a

b a D

c

D a

b

C

A B

D c

c a

b a D

c

D a

b

XBW is navigational:

• Rank-Select data structures on Slast and S

• The array C of || integers

B

Get_children

Rank(B,S)=2

Select in Slast the 2° item 1from here...

A 2B 5C 9D 12

C

Page 14: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

1001010 10011011

XBW is searchable (count subpaths)

C

B A B

D c

c a

b a D

c

D a

b

CbaDDc DaBABccab

S

A CA CA CB CB CB CB C CCCD A CD B CD B CD B C

SSlast

XBW-index

Inductive step:

Pick the next char in [i+1]i.e. ‘D’

Search for the first and last ‘D’ in S[fr,lr]

Jump to their children

fr

lr

= B D

[i+1]

Rows whoseS starts with ‘B’

Their childrenhave upwardpath = ‘D B’

A 2B 5C 9D 12

lr

fr

XBW is searchable:

• Rank-Select data structures on Slast and S

• Array C of || integers

C

2 occurrences of

because of two 1s

Page 15: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

XBzipIndex: XBW + FM-index

Upto 36% improvement in compression ratioQuery (counting) time 8 ms, Navigation time 3 ms

0%

10%

20%

30%

40%

50%

60%

DBLP Pathways News

Huffword XPress XQzip XBzipIndex XBzip

DBLP: 1.75 bytes/node, Pathways: 0.31 bytes/node, News: 3.91 bytes/node

Page 16: Compressing and Searching XML Data Via Two Zips

Paolo Ferragina, Università di Pisa

Indexing[Kosaraju, Focs ‘89]

[IEEE Focs ’05][WWW ’06]

The overall picture on Compressed Indexing...

Data type

CompressedIndexing Strong connection

[IEEE Focs ’00][J. ACM ’05]

This is a powerful paradigm to design compressed indexes:

1. Transform the input in few arrays (via BWT or XBW)

2. Index (+ Compress) the arrays to support rank/select ops

Theory: Soda ’06 (2), Cpm ’06 (2), Icalp ’06 (2), DCC ’06 (1)Experimental: Wea ’06 (2)

http://pizzachili.di.unipi.it or http://pizzachili.dcc.uchile.cl