
Dan Suciu, Univ. of Washington

Querying XML Streams

From Searching Text to Querying XML Streams

Dan Suciu

www.cs.washington.edu/homes/suciu



About Me
• Born 1957, Romania
• BS: Bucharest, PhD: University of Pennsylvania
• Now: University of Washington (Seattle)

My work is on semistructured data
• Book: Data on the Web: From Relations to Semistructured Data and XML

Past/present projects:
• XML-QL = precursor of XQuery
• XMill = the XML compressor
• XML toolkit



Motivation

• Text databases
– Studied over the past 15 years
– Traditional client/server model
– Struggled with lack of standard text syntax

• Recently, new standard: XML
– Traditional client/server: in today’s DBMS
– New applications: stream processing

• This talk: processing stream XML data
– My motivation: work on the XML Toolkit project



Outline

• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions



Background: Relational Databases

• Structured, stored in tables
• Schema separate from data
• Queries: precise, refer to schema and data (SQL)

BOOKS:

ISBN        Title                     Year  Publisher
0201537710  Foundations of Databases  1995  AW
155860622X  Data on the Web           1999  MK

AUTHOR:

AID  Name       Country
44   Abiteboul  FR
06   Buneman    UK
62   Hull       USA
12   Suciu      USA
29   Vianu      USA

WROTE:

ISBN        AID
0201537710  44
0201537710  62
0201537710  29
155860622X  44
155860622X  06
155860622X  12

Hard to publish, easy to query precisely



Background: Text Databases

• Unstructured, stored in documents

• No schema, only data

• Queries: imprecise, refer to data only (keywords)

Foundations of Databases,
Abiteboul (FR), Hull (USA), Vianu (USA)
Addison Wesley,
1995

Data on the Web
Abiteboul (FR), Buneman (UK), Suciu (USA)
Morgan Kaufmann,
1999

Easy to publish, hard to query precisely



Background: XML Data

• Semistructured
• Schema and data are together: self-describing
• Queries: precise, refer to schema and data (SQL)

<bib>
  <book>
    <title> Foundations… </title>
    <author> <name> Abiteboul </name> <country> FR </country> </author>
    <author> <name> Hull </name> <country> USA </country> </author>
    <author> <name> Vianu </name> <country> USA </country> </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bib>

XML: Easier to publish, easy to query precisely



Background: XML Data

[Figure: the XML data drawn as a tree rooted at bib, with book and paper nodes, their title, author, publisher and journal children, and name/country leaves such as Abiteboul/FR and Buneman/UK.]

Data model = tree



Background:XML Data

• Querying with XPath (and XQuery)
• This talk: XPath queries restricted to:

tag   /   //   *   [ ]   path="constant"



Background: XPath in One Slide

/bib/book/author/name
– tag, /
– This is precisely the “region algebra”, e.g. use proximal nodes [Navarro&Baeza-Yates’97]

/bib/book//name/*/zip
– //, *
– Navigate partially known structure

/bib/book[author/name="Abiteboul"]
/bib/book[year="1995" and author[name="Abiteboul" and country="FR"]]
– [ ]
– Conjunctive queries à la SQL



Outline

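• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions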



Main Application:XML Packet Routing

• Selective Dissemination of Information [Altinel&Franklin’00, Chan et al.’02]

• XML content routing [Snoeren et al.’01]

• SOAP Message routing in Application Servers



XML Packet Routing

<doc>
<tag> value </tag>
</doc>

[Figure: the same small XML packet repeated many times, illustrating packets being routed.]



XPath expressions:

/bib/book/publisher="MK"
/bib/book[category="recent"]/title="Web"
/bib/book//address//*/zip="123"
/bib/book//address//*="Galaxy"
/bib/book/category="recent"
/bib/book/address="123"
/bib/book/address/field="567"

Input XML stream: <bib> <book>...</bib>

Output XML streams



The XML Stream Processing Problem

Given:
• A set of XPath expressions
• An incoming stream of XML documents

Decide:
• For each document, which expressions it matches

Hard:
• Large number of XPath expressions, e.g. 10^3 to 10^6
• Streaming XML data, high throughput, e.g. 5MB/s

Easy:
• Shallow XML data, e.g. depth=20
• Short XPath expressions



The Approaches

Basic techniques
• NFA plus optimizations:
– XFilter/YFilter [Altinel&Franklin’00]
– XTrie [Chan et al.’02]
• DFA:
– XML Toolkit

Beyond the obvious
• Stream indexes (XML Toolkit)
• Stream views



Outline

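• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions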



From XPath to NFA

/catalog/product[category="tools"][*/price = 200]/quantity//price

[Figure: the NFA for this expression, with states labelled catalog, product, category, "tools", *, price, 200, quantity, * and price.]

Extra processing needed to combine branches (not in this talk)



Basic NFA Evaluation

XPath:

/bib/book/publisher="MK"
/bib/book[category="recent"]/title
/bib/book//address//*/zip="123"
/bib/book//address//*="Galaxy"
/bib/book/category="recent"
/bib/book/address="123"
/bib/book/address/field="567"
/bib/book/tag="some"
/bib/book[category="recent"]/title
/bib/book//address//*="Seattle"
/bib/book//address//*="Galaxy"
/bib/book/category="recent"
/bib/book/address="Lisbon"
/bib/book/address/field="some"
. . .
/bib/book/publisher="AW"
/bib/book[category="recent"]/title
/bib/book//address//*="123"
/bib/book//address//*="Galaxy"
/bib/book/category="new"
/bib/book/address="London"
/bib/book/address/field="some"
/bib/book/category="old"

[Figure: the XPath expressions are compiled into one NFA. SAX events drive the NFA, and a STACK holds the set of current states for each open element, e.g. {1,55,99,...}, {2,3,543,43,254}, {3,66,102,4534,...}.]

Basic NFA Evaluation

Properties:
• Space = linear in the number of XPath expressions
• Throughput = decreases linearly as the number of expressions grows

Systems:

• XFilter [Altinel&Franklin’00], YFilter

• XTrie [Chan et al.’02]
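
The evaluation scheme on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the XFilter or XTrie implementation: it handles linear expressions with /, // and * only (no predicates or value tests), and the names parse_xpath, NFAFilter and match_nfa are hypothetical. Each NFA state is a pair (query index, number of steps matched); the stack holds one set of current states per open element.

```python
import re
import xml.sax


def parse_xpath(path):
    """Split a linear XPath such as '/bib/book//name' into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", path)


class NFAFilter(xml.sax.ContentHandler):
    """Runs all queries as one NFA over SAX events (illustrative sketch)."""

    def __init__(self, queries):
        super().__init__()
        self.queries = [parse_xpath(q) for q in queries]
        start = frozenset((q, 0) for q in range(len(self.queries)))
        self.stack = [start]      # one set of current states per open element
        self.matched = set()      # indexes of queries that reached their last step

    def startElement(self, name, attrs):
        nxt = set()
        for q, i in self.stack[-1]:
            steps = self.queries[q]
            if i == len(steps):
                continue          # accepting state: no outgoing transitions
            axis, tag = steps[i]
            if axis == "//":
                nxt.add((q, i))   # descendant axis: keep waiting at deeper levels
            if tag in ("*", name):
                nxt.add((q, i + 1))
                if i + 1 == len(steps):
                    self.matched.add(q)
        self.stack.append(frozenset(nxt))

    def endElement(self, name):
        self.stack.pop()          # return to the states of the parent element


def match_nfa(queries, document):
    """Return the queries (as strings) matched by one XML document."""
    handler = NFAFilter(queries)
    xml.sax.parseString(document.encode("utf-8"), handler)
    return sorted(queries[q] for q in handler.matched)
```

The work per start tag grows with the size of the current state set, which is one way to see why throughput drops as expressions are added.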



Basic DFA Evaluation

[Same XPath workload as on the Basic NFA Evaluation slide.]

[Figure: the XPath expressions are compiled into DFAs. SAX events drive the automaton, and the STACK holds a single current state for each open element, e.g. 1, 552, 399.]



Basic DFA Evaluation

Properties:
• Throughput = constant!
• Space = GOOD QUESTION

System:

• XML Toolkit [University of Washington], http://xmltk.sourceforge.net



Outline

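• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions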



The Size of the DFA

//a/b/a/a/b

DFA for //P has 1+|P| states [KMP]

[Figure: the NFA for //a/b/a/a/b has states 0 to 5, a * self-loop on state 0, and transitions a, b, a, a, b. The DFA has the six states 0, 01, 02, 013, 014, 025, with transitions on a, b and [other].]
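
The state count can be checked with a small eager subset construction. The sketch below is an illustration only, assuming linear expressions with /, // and *; the helper names (parse_xpath, nfa_step, eager_dfa) are hypothetical, and the symbol "[other]" stands for any tag outside the chosen alphabet. NFA state i means "the first i steps have been matched".

```python
import re
from collections import deque


def parse_xpath(path):
    """Split a linear XPath such as '//a/*/b' into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", path)


def nfa_step(steps, states, tag):
    """One NFA move: the set of states reached from `states` on a start tag."""
    nxt = set()
    for i in states:
        if i == len(steps):
            continue                  # accepting state: no outgoing transitions
        axis, name = steps[i]
        if axis == "//":
            nxt.add(i)                # self-loop contributed by the // axis
        if name in ("*", tag):
            nxt.add(i + 1)
    return frozenset(nxt)


def eager_dfa(path, alphabet):
    """Build the full transition table {state: {tag: state}} up front."""
    steps = parse_xpath(path)
    symbols = sorted(set(alphabet) | {"[other]"})
    start = frozenset({0})
    table, queue = {}, deque([start])
    while queue:
        state = queue.popleft()
        if state in table:
            continue
        table[state] = {tag: nfa_step(steps, state, tag) for tag in symbols}
        queue.extend(t for t in table[state].values() if t not in table)
    return table


if __name__ == "__main__":
    for path in ("//a/b/a/a/b", "//a/*/*/*/b"):
        print(path, len(eager_dfa(path, "ab")))
```

Under these assumptions the construction produces 6 states for //a/b/a/a/b over the alphabet {a, b}, matching 1+|P|, and 24 states for //a/*/*/*/b, because the automaton has to remember which of the recent tags were a.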



The Size of the DFA

//a/*/*/*/b

Size of DFA = exponential in *’s (not a real concern)

[Figure: the NFA for //a/*/*/*/b has states 0 to 5, a * self-loop on state 0, and transitions a, *, *, *, b. The DFA (fragment, and without back edges) includes the states 0, 01, 02, 03, 012, 013, 023, 034, 045, 0123, 0134, 0234, 0245, 0345, 01234 and 012345, with transitions on a, b and [other].]



The Size of the DFA

Theorem [GMOS’02] The number of states in the DFA for one linear XPath expression P is at most:

$k + |P| \, k \, s^{m}$

where:

k = number of //

s = size of the alphabet (number of tags)

m = max number of * between two consecutive //



Size of DFA: Multiple Expressions

//section/table/footnote
//table/footnote
//section/figure/footnote
. . . . .
//abstract/footnote/table

DFA = Trie

has linear number of states [Aho&Corasick]



Size of DFA: Multiple Expressions

//section//footnote
//table//footnote
//figure//footnote
. . . . .
//abstract//footnote

100 expressions

2^100 states !!

There is a theorem here too, but it’s not useful…



Solution:Compute the DFA Lazily

• Also used in text searching
• But will it work for 10^6 XPath expressions?
• YES!
• For XPath it is provably effective, for two reasons:
– XML data is not very deep
– The nesting structure in XML data tends to be predictable
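
A lazy DFA can be sketched as a transition table that starts empty and is filled only for the (state, tag) pairs the data actually produces. The code below is an illustration under stated assumptions, not the XML Toolkit implementation: it covers linear expressions with /, // and * only, and the class name LazyDFA and its methods are hypothetical.

```python
import re
import xml.sax


def parse_xpath(path):
    """Split a linear XPath such as '//section//footnote' into (axis, tag) steps."""
    return re.findall(r"(//|/)([^/]+)", path)


class LazyDFA(xml.sax.ContentHandler):
    """DFA whose states and transitions are created on first use (sketch)."""

    def __init__(self, queries):
        super().__init__()
        self.queries = list(queries)
        self.steps = [parse_xpath(q) for q in self.queries]
        # A DFA state is a frozenset of NFA states (query index, steps matched).
        self.start = frozenset((q, 0) for q in range(len(self.steps)))
        self.table = {self.start: {}}             # state -> {tag: next state}
        self.accepts = {self.start: frozenset()}  # state -> queries matched there
        self.stack = [self.start]
        self.matched = set()

    def _nfa_step(self, state, tag):
        nxt = set()
        for q, i in state:
            steps = self.steps[q]
            if i == len(steps):
                continue
            axis, name = steps[i]
            if axis == "//":
                nxt.add((q, i))
            if name in ("*", tag):
                nxt.add((q, i + 1))
        return frozenset(nxt)

    def startElement(self, name, attrs):
        row = self.table[self.stack[-1]]
        if name not in row:                       # warm-up: expand this entry once
            target = self._nfa_step(self.stack[-1], name)
            if target not in self.table:
                self.table[target] = {}
                self.accepts[target] = frozenset(
                    self.queries[q] for q, i in target
                    if i == len(self.steps[q]))
            row[name] = target
        self.stack.append(row[name])              # afterwards: a plain table lookup
        self.matched |= self.accepts[row[name]]

    def endElement(self, name):
        self.stack.pop()

    def run(self, document):
        """Match one document; the transition table is kept between calls."""
        self.stack = [self.start]
        self.matched = set()
        xml.sax.parseString(document.encode("utf-8"), self)
        return self.matched
```

Only element nestings that occur in the data create states, so a second document with the same structure adds nothing to the table and is processed with lookups alone.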



Lazy DFA and “Simple” DTDs

• Document Type Definition (DTD)
– Part of the XML standard
– Will be replaced by XML Schema

• Example DTD:

<!ELEMENT document (section*)>
<!ELEMENT section ((section|abstract|table|figure)*)>
<!ELEMENT figure (table?,footnote*)>
. . . . .

Definition A DTD is simple if all cycles are loops




Simple DTD:

[Figure: the DTD graph with nodes document, section, abstract, table, figure and footnote.]

XPath expressions:

//section//footnote
//table//footnote
//figure//footnote
//abstract//footnote

Eager DFA “remembers” 2^4 sets

Lazy DFA “remembers” only 4 sets




Theorem [GMOS’02] If the XML data has a “simple” DTD, then the lazy DFA has at most:

$1 + D(1+n)^{d}$

states, where:

n = max depth of XPath expressions

D = size of the “unfolded” DTD

d = max depth of self-loops in the DTD

Fact of life: “Data-like” XML has simple DTDs




Definition [Goldman&Widom’97]

The data guide for an XML data instance is the Trie of all its root-to-leaf paths




[Figure: an XML data instance (left) and its data guide (right). The instance has a document root with nested section, table and figure elements, many of them repeated; the data guide keeps each distinct path of document, section, table and figure tags once.]

Fact of life: real XML data has “small” data guide [Liefke&S.’00]
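
The definition above can be made concrete with a one-pass construction of the trie from SAX events. This is a minimal sketch; the names DataGuideBuilder, data_guide and count_nodes are hypothetical, and the trie is represented as nested dictionaries mapping a tag to its children.

```python
import xml.sax


class DataGuideBuilder(xml.sax.ContentHandler):
    """Builds the trie of all root-to-node tag paths of one document."""

    def __init__(self):
        super().__init__()
        self.root = {}                  # trie node: {tag: child trie node}
        self.stack = [self.root]

    def startElement(self, name, attrs):
        # Reuse the existing child for this tag, or create it on first sight.
        self.stack.append(self.stack[-1].setdefault(name, {}))

    def endElement(self, name):
        self.stack.pop()


def data_guide(document):
    builder = DataGuideBuilder()
    xml.sax.parseString(document.encode("utf-8"), builder)
    return builder.root


def count_nodes(trie):
    """Number of nodes in the data guide (the G of the size bound)."""
    return sum(1 + count_nodes(child) for child in trie.values())
```

Repeated siblings and repeated subtrees collapse into a single trie node, which is why the data guide can be much smaller than the document.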




Theorem [GMOS’02] If the XML data has a data guide with G nodes, then the number of states in the lazy DFA is at most:

$1 + G$

where G = number of nodes in the data guide



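Number of Lazy DFA States - SYNTHETIC Data

[Chart: number of lazy DFA states (log scale, 1 to 100000) for the datasets simple, prov, ebBPSS, protein, nasa and treebank, with one series each for 10^3, 10^4 and 10^5 XPath expressions. Annotation: 4000 states.]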



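Number of Lazy DFA States - REAL Data

[Chart: number of lazy DFA states (log scale, 1 to 100000) for the datasets protein, nasa and treebank, with one series each for 10^3, 10^4 and 10^5 XPath expressions. Annotations: 95 states; 40000 states; G = 350000.]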



Number of States in the lazy DFA

                    Real XML data               Synthetic XML data
Data-style DTD      Theorem: lazy DFA is small  Theorem: lazy DFA is small
Document-style DTD  Theorem: lazy DFA is small  Fact: lazy DFA is HUGE



Lazy DFA in the XML Toolkit

• The XML toolkit uses a lazy DFA to process XML streams

• “warm-up” phase, followed by very high throughput



Throughput for 10^3, 10^4, 10^5, 10^6 XPath expressions

[ prob(*)=10%, prob(//)=10% ]

[Chart: throughput (log scale, 0.0001MB/s to 100MB/s) against total input size (5MB to 25MB). Series: parser; lazyDFA with 10^3, 10^4, 10^5 and 10^6 XPath expressions; xfilter with 10^3, 10^4, 10^5 and 10^6 XPath expressions.]

Parser: 10MB/s

Lazy DFA: 5.4MB/s



Summary of Lazy DFA and XML

• Linear XPath expressions:
– Process with one lazy DFA

• XPath expressions with branches:
– Process with Deterministic Pushdown Automata (ongoing work at the University of Washington)



Outline

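• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions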



Stream IndeX (SIX)

Main observation:
• Parsing is major bottleneck

Definition The SIX of an XML document is a binary table of (begin, end) offsets

Idea:
• Use SIX to reduce amount of parsing
• Works well with (lazy) DFA
• Implemented in the XML toolkit



Stream IndeX (SIX)

XML:

<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>

SIX:

           beginOffset  endOffset
bib        0            1490124
book       3            409023
publisher  12           423
author     426          879
author     978          . . .
. . .
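
To illustrate how such a table reduces parsing, the sketch below builds a SIX for a small document and then uses it to jump over subtrees that cannot match a child-only path. It is a toy under stated assumptions, not the XML Toolkit format: offsets are character positions rather than binary byte offsets, tags carry no attributes, and the names TAG, build_six and select are hypothetical.

```python
import re

# Toy tokenizer: start and end tags without attributes, comments or CDATA.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*>")


def build_six(xml_text):
    """Return [(begin, end)] offsets, one row per element, in document order."""
    six, open_rows = [], []
    for m in TAG.finditer(xml_text):
        if not m.group(1):                    # start tag: reserve a row
            open_rows.append(len(six))
            six.append([m.start(), None])
        else:                                 # end tag: close the innermost row
            six[open_rows.pop()][1] = m.end()
    return [tuple(row) for row in six]


def select(xml_text, six, path):
    """Collect elements on a child-only path such as /bib/book/title.

    Returns (matches, number of tags tokenized). Subtrees that cannot match
    are skipped by jumping to their end offset instead of tokenizing them.
    """
    want = path.strip("/").split("/")
    results, depth, pos, row, tokens = [], 0, 0, 0, 0
    while True:
        m = TAG.search(xml_text, pos)
        if m is None:
            return results, tokens
        tokens += 1
        if m.group(1):                        # end tag of an element we entered
            depth -= 1
            pos = m.end()
            continue
        begin, end = six[row]                 # rows arrive in start-tag order
        row += 1
        on_path = m.group(2) == want[depth]
        if on_path and depth + 1 < len(want):
            depth += 1                        # descend: children may still match
            pos = m.end()
        else:
            if on_path:
                results.append(xml_text[begin:end])
            pos = end                         # skip the whole element
            while row < len(six) and six[row][0] < end:
                row += 1                      # drop the rows of skipped children
```

In this sketch the author element and its first-name and last-name children are never tokenized when the query is /bib/book/title, which is the kind of saving the SIX is meant to provide.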



Stream IndeX (SIX)

[Figure: SIX entries such as (0, 205), (30, 66), (72, 188), (90, 110) and (95, 98) travel alongside the XML stream (e.g. DIME).]

The SIX stream is about 6% of the data stream

And can be made MUCH smaller



Throughput improvements from SIX (stable)

[Chart: throughput in MB/s (0 to 35) against XML stream size (55MB to 105MB). Series: Theta=3%, Theta=8% and Theta=14%, each with and without SIX.]



Stream Views

Idea:
• Given a workload of XPath expressions with branches
• Precompute some views for each document to speed up the entire workload
• The views header has to be small



Stream Views

Queries (one group per server):

/a[b=11][c=22][e=23]

/a[b=33][d=44][e=55]
/a[c=66][f=77]
/a[f=34][g=56]

/a[b=88][c=99]
/a[c=99][e=00]

3 Views:

/a/c
/a/e
/a/f

Short circuit evaluation!
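
One way to read the short-circuit idea is sketched below. It is a simplification with stated assumptions: the header here stores the values found by the three views, whereas the slides describe a header of binary offsets; each query is reduced to a conjunction of child=value tests; and the names VIEW_TAGS, compute_header, may_match and candidates are hypothetical.

```python
VIEW_TAGS = ("c", "e", "f")      # stand-ins for the views /a/c, /a/e, /a/f

# The six queries from the slide, as {child tag: required value}.
QUERIES = {
    "/a[b=11][c=22][e=23]": {"b": "11", "c": "22", "e": "23"},
    "/a[b=33][d=44][e=55]": {"b": "33", "d": "44", "e": "55"},
    "/a[c=66][f=77]": {"c": "66", "f": "77"},
    "/a[f=34][g=56]": {"f": "34", "g": "56"},
    "/a[b=88][c=99]": {"b": "88", "c": "99"},
    "/a[c=99][e=00]": {"c": "99", "e": "00"},
}


def compute_header(children):
    """Precompute the views for one document.

    `children` lists the (tag, value) pairs found directly under the root <a>.
    """
    return {tag: {value for t, value in children if t == tag}
            for tag in VIEW_TAGS}


def may_match(predicates, header):
    """False: the query certainly fails. True: full evaluation is still needed."""
    return all(value in header[tag]
               for tag, value in predicates.items() if tag in header)


def candidates(header):
    """Queries that survive the header check, in workload order."""
    return [text for text, preds in QUERIES.items() if may_match(preds, header)]
```

Every query in this workload tests at least one of c, e or f, so three views are enough to let any of them be rejected from the header alone; only the surviving candidates need the document body.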



Stream Views

• Views header (binary offsets)

[Figure: each XML document in the stream carries a header with the view offsets, e.g. 0, 30, 72.]

100x speedup on a hit

Choosing the views: difficult problem



Outline

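• Background

• The XML stream processing problem

• Basic XML processing with automata

• Adapting automata to XML

• Stream indexes

• Conclusions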



Summary

• XML stream processing problem:
– Fixed XPath queries, transient XML data
– Large number of queries
– High data throughput

• Relationship to text processing techniques:
– Still regular expressions
– Still automata and lazy DFAs
– Different scale

• Techniques:
– Lazy DFAs work for reasons specific to XML
– Stream indexes and views: ongoing research



Future Work

• Handle branches in XPath expressions

• View selection for a given workload

• Network configuration



Thank you !

Links:

www.cs.washington.edu/homes/suciu

www.cs.washington.edu/homes/suciu/XMLTK

xmltk.sourceforge.net
