Post on 21-Dec-2015
Dan Suciu, Univ. of Washington
Querying XML Streams 1
From Searching Text to Querying XML Streams
Dan Suciu
www.cs.washington.edu/homes/suciu
About Me
• Born 1957, Romania
• BS: Bucharest, PhD: University of Pennsylvania
• Now: University of Washington (Seattle)
My work is on semistructured data
• Book: Data on the Web: From Relations to Semistructured Data and XML
Past/present projects:
• XML-QL = precursor of XQuery
• XMill = the XML compressor
• XML toolkit
Motivation
• Text databases
  – Studied over the past 15 years
  – Traditional client/server model
  – Struggled with lack of standard text syntax
• Recently, new standard: XML
  – Traditional client/server: in today’s DBMS
  – New applications: stream processing
• This talk: processing streaming XML data
  – My motivation: work on the XML Toolkit project
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
Background: Relational Databases
• Structured, stored in tables
• Schema separate from data
• Queries: precise, refer to schema and data (SQL)

BOOKS:
ISBN        Title                     Year  Publisher
0201537710  Foundations of Databases  1995  AW
155860622X  Data on the Web           1999  MK

AUTHOR:
AID  Name       Country
44   Abiteboul  FR
06   Buneman    UK
62   Hull       USA
12   Suciu      USA
29   Vianu      USA

WROTE:
ISBN        AID
0201537710  44
0201537710  62
0201537710  29
155860622X  44
155860622X  06
155860622X  12

Hard to publish, easy to query precisely
Background: Text Databases
• Unstructured, stored in documents
• No schema, only data
• Queries: imprecise, refer to data only (keywords)

Foundations of Databases,
Abiteboul (FR), Hull (USA), Vianu (USA)
Addison Wesley,
1995

Data on the Web
Abiteboul (FR), Buneman (UK), Suciu (USA)
Morgan Kaufmann,
1999

Easy to publish, hard to query precisely
Background: XML Data
• Semistructured
• Schema and data are together: self-describing
• Queries: precise, refer to schema and data (XPath, XQuery)
<bib>
  <book>
    <title> Foundations… </title>
    <author> <name> Abiteboul </name> <country> FR </country> </author>
    <author> <name> Hull </name> <country> USA </country> </author>
    <author> <name> Vianu </name> <country> USA </country> </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bib>
XML: Easier to publish, easy to query precisely
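Background: XML Data
[Figure: the bibliography drawn as a tree. The root bib has children book, paper and book; below them are title, author, publisher and journal nodes; each author has name and country children (e.g. Abiteboul, FR; Buneman, UK); leaf values include "Data on the Web" and "Addison Wesley".]
Data model = tree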
Background: XML Data
• Querying with XPath (and XQuery)
• This talk: XPath queries restricted to:
  tag    /    //    *    [ ]    path=“constant”
Background: XPath in One Slide
tag, /:
  /bib/book/author/name
//, *: navigate partially known structure
  /bib/book//name/*/zip
[ ]: conjunctive queries à la SQL
  /bib/book[author/name=“Abiteboul”]
  /bib/book[year=“1995” and author[name=“Abiteboul” and country=“FR”]]
This is precisely the “region algebra”
E.g. use proximal nodes [Navarro&Baeza-Yates’97]
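A minimal sketch of the three kinds of expressions above, using Python's standard `xml.etree.ElementTree`. The sample document and the helper `has_author` are assumptions made for illustration; ElementTree only supports a subset of XPath, so the filter `[author/name=“Abiteboul”]` is emulated in plain Python rather than passed to the library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, modeled on the bibliography used in the talk.
DOC = """
<bib>
  <book>
    <title>Foundations of Databases</title>
    <author><name>Abiteboul</name><country>FR</country></author>
    <author><name>Hull</name><country>USA</country></author>
    <year>1995</year>
  </book>
  <book>
    <title>Data on the Web</title>
    <author><name>Buneman</name><country>UK</country></author>
    <year>1999</year>
  </book>
</bib>
"""

root = ET.fromstring(DOC)  # root is the <bib> element

# /bib/book/author/name : child steps only (path is relative to <bib>)
names = [n.text for n in root.findall("book/author/name")]

# /bib//name : descendant step, finds <name> at any depth below <bib>
deep_names = [n.text for n in root.findall(".//name")]

# /bib/book[author/name="Abiteboul"] : filter, emulated in Python
def has_author(book, wanted):
    """True if the book has an author/name child path with the given text."""
    return any(n.text == wanted for n in book.findall("author/name"))

hits = [b.findtext("title") for b in root.findall("book")
        if has_author(b, "Abiteboul")]

print(names)
print(deep_names)
print(hits)
```

This evaluates the queries over a fully parsed tree, which is the traditional setting; the rest of the talk is about answering the same kind of query without materializing the tree.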
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
Main Application: XML Packet Routing
• Selective Dissemination of Information [Altinel&Franklin’00, Chan et al.’02]
• XML content routing [Snoeren et al.’01]
• SOAP Message routing in Application Servers
XML Packet Routing
[Figure: a network carrying many XML packets, each of the form <doc> <tag> value </tag> </doc>.]
XPath expressions:
/bib/book/publisher=“MK”
/bib/book[category=“recent”]/title=“Web”
/bib/book//address//*/zip=“123”
/bib/book//address//*="Galaxy"
/bib/book/category=“recent”
/bib/book/address=“123”
/bib/book/address/field=“567”
[Figure: an input XML stream (<bib> <book> ... </bib>) is matched against the XPath expressions and forwarded to the output XML streams.]
The XML Stream Processing Problem
Given:
• A set of XPath expressions
• An incoming stream of XML documents
Decide:
• For each document, which expressions it matches
Hard:
• Large number of XPath expressions, e.g. 10^3 - 10^6
• Streaming XML data, high throughput, e.g. 5MB/s
Easy:
• Shallow XML data, e.g. depth=20
• Short XPath expressions
The Approaches
Basic techniques
• NFA plus optimizations:
  – XFilter/YFilter [Altinel&Franklin’00]
  – XTrie [Chan et al.’02]
• DFA:
  – XML Toolkit
Beyond the obvious
• Stream indexes (XML Toolkit)
• Stream views
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
From XPath to NFA
/catalog/product[category="tools"][*/price = 200]/quantity//price
Extra processing needed to combine branches (not in this talk)
[Figure: the NFA for this expression, with transitions labeled catalog, product, category, "tools", *, price, 200 and quantity.]
Basic NFA Evaluation
XPath expressions:
/bib/book/publisher=“MK”
/bib/book[category=“recent”]/title
/bib/book//address//*/zip=“123”
/bib/book//address//*="Galaxy"
/bib/book/category=“recent”
/bib/book/address=“123”
/bib/book/address/field=“567”
/bib/book/tag=“some”
/bib/book[category=“recent”]/title
/bib/book//address//*=“Seattle”
/bib/book//address//*="Galaxy"
/bib/book/category=“recent”
/bib/book/address=“Lisbon”
/bib/book/address/field=“some”
. . .
/bib/book/publisher=“AW”
/bib/book[category=“recent”]/title
/bib/book//address//*=“123”
/bib/book//address//*="Galaxy"
/bib/book/category=“new”
/bib/book/address=“London”
/bib/book/address/field=“some”
/bib/book/category=“old”
[Figure: SAX events from the input stream (<bib> <book> ... </bib>) drive one NFA built from all the XPath expressions. A STACK holds one set of current states per open element, e.g. {1, 55, 99, ...}, {2, 3, 543, 43, 254}, {3, 66, 102, 4534, ...}.]
Basic NFA Evaluation
Properties:
• Space = linear in the total size of the XPath expressions
• Throughput = decreases linearly with the number of XPath expressions
Systems:
• XFilter [Altinel&Franklin’00], YFilter.
• XTrie [Chan et al.’02]
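A minimal sketch of the NFA evaluation described above, restricted to linear expressions built from tag, /, // and * (no filters). It is not the XFilter, YFilter or XTrie implementation; the names `parse_xpath` and `match_stream` are hypothetical, and `xml.etree.ElementTree.iterparse` stands in for a SAX parser. The set of current NFA states is pushed on a stack at every start tag and popped at every end tag.

```python
import io
import xml.etree.ElementTree as ET

def parse_xpath(expr):
    """Split a linear XPath such as '/bib/book//name' into (axis, tag) steps."""
    steps, i = [], 0
    while i < len(expr):
        axis = "//" if expr.startswith("//", i) else "/"
        i += len(axis)
        j = expr.find("/", i)
        j = len(expr) if j == -1 else j
        steps.append((axis, expr[i:j]))
        i = j
    return steps

def match_stream(xml_bytes, queries):
    """Return the subset of queries matched by at least one element."""
    nfas = [parse_xpath(q) for q in queries]
    # An NFA state is (query index, number of steps matched so far).
    current = {(q, 0) for q in range(len(nfas))}
    stack, matched = [], set()
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            stack.append(current)          # remember the parent's states
            nxt = set()
            for q, i in current:
                steps = nfas[q]
                if i == len(steps):        # accepting state: no transitions
                    continue
                axis, tag = steps[i]
                if axis == "//":           # '//' gives state i a self-loop
                    nxt.add((q, i))
                if tag in ("*", elem.tag): # advance on a matching tag
                    nxt.add((q, i + 1))
                    if i + 1 == len(steps):
                        matched.add(queries[q])
            current = nxt
        else:
            current = stack.pop()          # back to the parent's states
    return matched

DOC = b"<bib><book><title>t</title><author><name>A</name></author></book></bib>"
QUERIES = ["/bib/book/author/name", "/bib/book//name", "/bib//title",
           "/bib/paper", "/bib/*/title", "//name/x"]
result = match_stream(DOC, QUERIES)
print(sorted(result))
```

The inner loop visits every active state on every start tag, which is why throughput drops as the number of expressions grows.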
Basic DFA Evaluation
XPath expressions: the same workload as for the basic NFA evaluation
[Figure: SAX events from the input stream (<bib> <book> ... </bib>) drive a DFA built from all the XPath expressions. The STACK now holds a single current state per open element, e.g. 1, 552, 399.]
Basic DFA Evaluation
Properties:
• Throughput = constant !
• Space = GOOD QUESTION
System:
• XML Toolkit [University of Washington] http://xmltk.sourceforge.net
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
The Size of the DFA
//a/b/a/a/b
DFA for //P has 1+|P| states [KMP]
[Figure: the NFA has states 0 to 5, transitions a, b, a, a, b and a * self-loop on state 0. The DFA has states 0, 01, 02, 013, 014, 025, with the same forward transitions a, b, a, a, b and back edges labeled a and [other].]
The Size of the DFA
//a/*/*/*/b
Size of DFA = exponential in *’s (not a real concern)
[Figure: the NFA has states 0 to 5, transitions a, *, *, *, b and a * self-loop on state 0. The DFA (fragment, and without back edges) has states 0, 01, 012, 0123, 01234, 012345, 02, 03, 013, 023, 034, 0134, 0234, 045, 0245, 0345, with transitions labeled a, b and [other].]
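The two DFA sizes above can be checked with a small subset construction. This is a sketch under stated assumptions: linear expressions only, a three-symbol alphabet (a, b, and one symbol standing for every other tag), and hypothetical helper names `parse_xpath` and `eager_dfa_size`. It builds the eager DFA, i.e. every reachable state, regardless of the data.

```python
def parse_xpath(expr):
    """Split a linear XPath such as '//a/*/b' into (axis, tag) steps."""
    steps, i = [], 0
    while i < len(expr):
        axis = "//" if expr.startswith("//", i) else "/"
        i += len(axis)
        j = expr.find("/", i)
        j = len(expr) if j == -1 else j
        steps.append((axis, expr[i:j]))
        i = j
    return steps

def eager_dfa_size(expr, alphabet):
    """Count the non-empty DFA states reachable by the subset construction."""
    steps = parse_xpath(expr)
    n = len(steps)

    def move(state_set, symbol):
        out = set()
        for i in state_set:
            if i == n:                 # accepting NFA state has no successors
                continue
            axis, tag = steps[i]
            if axis == "//":           # self-loop contributed by '//'
                out.add(i)
            if tag in ("*", symbol):   # forward transition
                out.add(i + 1)
        return frozenset(out)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        s = todo.pop()
        for sym in alphabet:
            t = move(s, sym)
            if t and t not in seen:    # skip the empty (dead) state
                seen.add(t)
                todo.append(t)
    return len(seen)

ALPHABET = ["a", "b", "other"]
print(eager_dfa_size("//a/b/a/a/b", ALPHABET))   # 6 = 1 + |P|
print(eager_dfa_size("//a/*/*/*/b", ALPHABET))   # 24
```

With the wildcards, the DFA has to remember which of the last four tags were `a`, which is where the exponential factor comes from.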
The Size of the DFA
Theorem [GMOS’02] The number of states in the DFA for one linear XPath expression P is at most:
$k + |P| \cdot k \cdot s^{m}$
where
k = number of //
s = size of the alphabet (number of tags)
m = max number of * between two consecutive //
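A worked check of the bound k + |P| · k · s^m, as a sketch. Two readings are assumptions here, not taken from the slide: |P| is counted as the number of steps in P, and m is counted per //-delimited segment. The function name `dfa_state_bound` is hypothetical, and the formula is only meaningful for expressions with at least one //.

```python
def dfa_state_bound(expr, alphabet_size):
    """Evaluate k + |P| * k * s**m for a linear XPath expression."""
    segments = expr.split("//")[1:]               # text following each '//'
    k = len(segments)                             # number of '//'
    size = len([t for t in expr.split("/") if t]) # |P|: number of steps (assumed)
    m = max((seg.split("/").count("*") for seg in segments), default=0)
    return k + size * k * alphabet_size ** m

# //a/b/a/a/b: k = 1, |P| = 5, m = 0, so the bound is 1 + 5 = 6
print(dfa_state_bound("//a/b/a/a/b", 3))
# //a/*/*/*/b with s = 3: k = 1, |P| = 5, m = 3, so 1 + 5 * 27 = 136
print(dfa_state_bound("//a/*/*/*/b", 3))
```

For //a/b/a/a/b the bound coincides with the 1+|P| count from the KMP-style DFA; for the wildcard expression it is an upper bound only.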
Size of DFA: Multiple Expressions
//section/table/footnote
//table/footnote
//section/figure/footnote
. . . . .
//abstract/footnote/table
DFA = Trie, has linear number of states [Aho&Corasick]
Size of DFA: Multiple Expressions
//section//footnote
//table//footnote
//figure//footnote
. . . . .
//abstract//footnote
100 expressions: 2^100 states !!
There is a theorem here too, but it’s not useful…
Solution: Compute the DFA Lazily
• Also used in text searching
• But will it work for 10^6 XPath expressions ?
• YES !
• For XPath it is provably effective, for two reasons:
  – XML data is not very deep
  – The nesting structure in XML data tends to be predictable
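A minimal sketch of the lazy construction, assuming linear expressions only. It is not the XML Toolkit code; the class `LazyDFA` and its methods are hypothetical names. A DFA state is a frozen set of NFA states, and a transition is computed from the NFA only the first time the data actually requires it; afterwards it is a single dictionary lookup.

```python
import io
import xml.etree.ElementTree as ET

def parse_xpath(expr):
    """Split a linear XPath such as '//table//footnote' into (axis, tag) steps."""
    steps, i = [], 0
    while i < len(expr):
        axis = "//" if expr.startswith("//", i) else "/"
        i += len(axis)
        j = expr.find("/", i)
        j = len(expr) if j == -1 else j
        steps.append((axis, expr[i:j]))
        i = j
    return steps

class LazyDFA:
    def __init__(self, queries):
        self.queries = queries
        self.nfas = [parse_xpath(q) for q in queries]
        self.start = frozenset((q, 0) for q in range(len(queries)))
        self.table = {}                    # (state, tag) -> state, filled lazily

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:          # first visit: expand from the NFA
            nxt = set()
            for q, i in state:
                steps = self.nfas[q]
                if i == len(steps):
                    continue
                axis, name = steps[i]
                if axis == "//":
                    nxt.add((q, i))
                if name in ("*", tag):
                    nxt.add((q, i + 1))
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def run(self, xml_bytes):
        """Return the queries matched by the document."""
        state, stack, matched = self.start, [], set()
        events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
        for event, elem in events:
            if event == "start":
                stack.append(state)
                state = self.step(state, elem.tag)
                matched |= {self.queries[q] for q, i in state
                            if i == len(self.nfas[q])}
            else:
                state = stack.pop()
        return matched

    def num_states(self):
        return len({self.start} | set(self.table.values()))

QUERIES = ["//section//footnote", "//table//footnote",
           "//figure//footnote", "//abstract//footnote"]
DOC = (b"<document><section><table><footnote/></table>"
       b"<figure><footnote/></figure></section></document>")
dfa = LazyDFA(QUERIES)
first = dfa.run(DOC)
states_after_first = dfa.num_states()
second = dfa.run(DOC)                      # warm: no new states are created
print(sorted(first), states_after_first, dfa.num_states())
```

Only the states reached by paths that occur in the data are materialized, so a second document with the same structure runs entirely on table lookups.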
Lazy DFA and “Simple” DTDs
• Document Type Definition (DTD)
  – Part of the XML standard
  – Will be replaced by XML Schema
• Example DTD:
<!ELEMENT document (section*)>
<!ELEMENT section ((section|abstract|table|figure)*)>
<!ELEMENT figure (table?,footnote*)>
. . . . .
Definition: A DTD is simple if all cycles are self-loops
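The definition can be checked mechanically. In this sketch a DTD is represented as a dictionary from an element name to the set of element names that may appear as its children; this representation and the function name `is_simple` are assumptions for illustration. The test ignores self-loops and then looks for any remaining cycle with a depth-first search.

```python
def is_simple(dtd):
    """True if every cycle in the DTD graph is a self-loop."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY                  # node is on the current DFS path
        for child in dtd.get(node, ()):
            if child == node:               # self-loops are allowed
                continue
            c = color.get(child, WHITE)
            if c == GRAY:                   # back edge: a cycle of length > 1
                return False
            if c == WHITE and not visit(child):
                return False
        color[node] = BLACK
        return True

    return all(visit(n) for n in dtd if color.get(n, WHITE) == WHITE)

# Only the three declarations shown in the example DTD; the rest is elided there.
simple_dtd = {
    "document": {"section"},
    "section": {"section", "abstract", "table", "figure"},
    "figure": {"table", "footnote"},
}
# A DTD in which section and table may contain each other.
non_simple_dtd = {
    "document": {"section"},
    "section": {"section", "table"},
    "table": {"section", "table"},
}
print(is_simple(simple_dtd), is_simple(non_simple_dtd))
```

In the first DTD the only recursion is section inside section, so it is simple; in the second, section and table form a cycle of length two.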
Lazy DFA and “Simple” DTDs
[Figure: graph of a simple DTD over the elements document, section, table, figure, abstract and footnote.]
XPath expressions:
//section//footnote
//table//footnote
//figure//footnote
//abstract//footnote
Eager DFA “remembers” 2^4 sets
Lazy DFA “remembers” only 4 sets
Lazy DFA and “Simple” DTDs
Theorem [GMOS’02] If the XML data has a “simple” DTD, then the lazy DFA has at most
$1 + D(1+n)^{d}$
states, where
n = max depth of the XPath expressions
D = size of the “unfolded” DTD
d = max depth of self-loops in the DTD
Fact of life: “Data-like” XML has simple DTDs
Lazy DFA and Data Guides
• “Non-simple” DTDs are useless for the lazy DFA
• “Everything may contain everything”
<!ELEMENT document (section*)>
<!ELEMENT section ((section|table|figure|abstract|footnote)*)>
<!ELEMENT table ((section|table|figure|abstract|footnote)*)>
<!ELEMENT figure ((section|table|figure|abstract|footnote)*)>
<!ELEMENT abstract ((section|table|figure|abstract|footnote)*)>
Fact of life: “Text”-like XML has non-simple DTDs
Lazy DFA and Data Guides
Definition [Goldman&Widom’97]
The data guide for an XML data instance is the Trie of all its root-to-leaf paths
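A minimal sketch of this definition: the data guide is built as a trie of nested dictionaries while streaming over the document, so each distinct root-to-node tag path is stored once. The function names `data_guide` and `count_nodes` and the sample document are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

def data_guide(xml_bytes):
    """Return the trie of tag paths as nested dicts: tag -> sub-trie."""
    root = {}
    stack = [root]                          # stack[-1] is the trie node of the open element
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            # Reuse the child trie node if this path was seen before.
            stack.append(stack[-1].setdefault(elem.tag, {}))
        else:
            stack.pop()
    return root

def count_nodes(trie):
    """Number of nodes G in the data guide."""
    return sum(1 + count_nodes(sub) for sub in trie.values())

DOC = (b"<document><section><table/><table/></section>"
       b"<section><section><figure/></section><table/></section></document>")
guide = data_guide(DOC)
print(guide, count_nodes(guide))
```

The document has eight elements but only five distinct paths, so the guide has five nodes; repeated structure in the data does not enlarge it.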
Lazy DFA and Data Guides
[Figure: left, an XML data tree rooted at document, with nested section, table and figure elements; right, its data guide, in which each distinct path from the root appears once.]
Fact of life: real XML data has “small” data guide [Liefke&S.’00]
Lazy DFA and Data Guides
Theorem [GMOS’02] If the XML data has a data guide with G nodes, then the number of states in the lazy DFA is at most:
$1 + G$
where G = number of nodes in the data guide
Number of Lazy DFA States - SYNTHETIC Data
[Chart: number of lazy DFA states (log scale, 1 to 100000) for the DTDs simple, prov, ebBPSS, protein, nasa and treebank, with 10^3, 10^4 and 10^5 XPath expressions. Annotation: 4000 states.]
Number of Lazy DFA States - REAL Data
[Chart: number of lazy DFA states (log scale, 1 to 100000) for the protein, nasa and treebank data sets, with 10^3, 10^4 and 10^5 XPath expressions. Annotations: 95 states; 40000 states; G = 350000.]
Number of States in the lazy DFA
                     Real XML data                 Synthetic XML data
Data-style DTD       Theorem: Lazy DFA is small    Theorem: Lazy DFA is small
Document-style DTD   Theorem: Lazy DFA is small    Fact: Lazy DFA is HUGE
Lazy DFA in the XML Toolkit
• The XML toolkit uses a lazy DFA to process XML streams
• “warm-up” phase, followed by very high throughput
Throughput for 10^3, 10^4, 10^5, 10^6 XPath expressions
[ prob(*)=10%, prob(//)=10% ]
[Chart: throughput (log scale, 0.0001MB/s to 100MB/s) against total input size (5MB to 25MB) for the parser alone, the lazy DFA and xfilter, each with 10^3, 10^4, 10^5 and 10^6 XPath expressions. Annotations: Parser: 10MB/s; Lazy DFA: 5.4MB/s.]
Summary of Lazy DFA and XML
• Linear XPath expressions:
  – Process with one lazy DFA
• XPath expressions with branches:
  – Process with Deterministic Pushdown Automata (ongoing work at the University of Washington)
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
Stream IndeX (SIX)
Main observation:
• Parsing is major bottleneck
Definition: The SIX of an XML document is a binary table of (begin, end) offsets
Idea:
• Use SIX to reduce amount of parsing
• Works well with (lazy) DFA
• Implemented in the XML toolkit
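A minimal sketch of building such a table, not the XML Toolkit's format. It assumes a simplified scanner (a regular expression over tags, ignoring comments, CDATA and processing instructions) and character offsets in a Python string, which equal byte offsets only for ASCII data. The names `TAG` and `build_six` are hypothetical. Each row records where an element begins and where it ends, so a consumer that is not interested in an element can jump straight to its end offset without parsing the subtree.

```python
import re

# Matches start tags, end tags and empty-element tags (simplified).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")

def build_six(data):
    """Return rows (tag, beginOffset, endOffset) in document order."""
    six, open_rows = [], []
    for m in TAG.finditer(data):
        closing, name, empty = m.group(1), m.group(2), m.group(3)
        if closing:
            row = open_rows.pop()           # innermost open element ends here
            six[row] = (six[row][0], six[row][1], m.end())
        else:
            six.append((name, m.start(), m.end() if empty else None))
            if not empty:
                open_rows.append(len(six) - 1)
    return six

DATA = "<bib><book><title>T</title></book><book/></bib>"
six = build_six(DATA)
for tag, begin, end in six:
    print(tag, begin, end)

# Skipping the first <book>: resume reading at its end offset.
_, begin, end = six[1]
skipped = DATA[begin:end]
```

When the automaton reaches a state from which no expression can match, the remaining subtree can be skipped this way instead of being tokenized.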
Stream IndeX (SIX)
XML:
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>

SIX:
           beginOffset  endOffset
bib        0            1490124
book       3            409023
publisher  12           423
author     426          879
author     978          . . .
. . .
Stream IndeX (SIX)
[Figure: the SIX stream of (begin, end) offset pairs, e.g. (0, 205), (30, 66), (72, 188), (90, 110), (95, 98), travels together with the XML stream of documents (<bib> <book> ... </bib>), e.g. packaged with DIME.]
The SIX stream is about 6% of the data stream
And can be made MUCH smaller
Throughput improvements from SIX (stable)
[Chart: throughput (MB/s, 0 to 35) against XML stream size (55 to 105 MB), with and without SIX, for Theta = 3%, 8% and 14%.]
Stream Views
Idea:
• Given a workload of XPath expressions with branches
• Precompute some views for each document to speed up the entire workload
• Views header has to be small
Stream Views
Queries (grouped by server):
/a[b=11][c=22][e=23]

/a[b=33][d=44][e=55]
/a[c=66][f=77]
/a[f=34][g=56]

/a[b=88][c=99]
/a[c=99][e=00]

3 Views:
/a/c
/a/e
/a/f
Short circuit evaluation !
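One possible reading of short-circuit evaluation with views, as a sketch only. Everything in the representation is an assumption: a query is a dictionary from a child tag of /a to the required constant, and the header is a dictionary from a view path to the value precomputed for the current document (`None` if the document has no such child). The function name `short_circuit` is hypothetical. A query is rejected from the header alone when some view contradicts it, accepted when views cover all of its branches, and otherwise left for full evaluation.

```python
def short_circuit(query, header):
    """Return False (rejected), True (accepted) or None (needs full evaluation)."""
    undecided = False
    for tag, wanted in query.items():
        view = "/a/" + tag
        if view not in header:
            undecided = True            # no view covers this branch
        elif header[view] != wanted:
            return False                # a view contradicts the query: stop here
    return None if undecided else True

# The six queries from the slide, in the assumed dictionary form.
QUERIES = {
    "/a[b=11][c=22][e=23]": {"b": "11", "c": "22", "e": "23"},
    "/a[b=33][d=44][e=55]": {"b": "33", "d": "44", "e": "55"},
    "/a[c=66][f=77]": {"c": "66", "f": "77"},
    "/a[f=34][g=56]": {"f": "34", "g": "56"},
    "/a[b=88][c=99]": {"b": "88", "c": "99"},
    "/a[c=99][e=00]": {"c": "99", "e": "00"},
}
# Hypothetical header for one document: c=99, e=00, and no f child.
HEADER = {"/a/c": "99", "/a/e": "00", "/a/f": None}

decisions = {q: short_circuit(branches, HEADER) for q, branches in QUERIES.items()}
for q, d in decisions.items():
    print(q, d)
```

With this header, five of the six queries are decided without parsing the document body; only /a[b=88][c=99] still needs the value of b.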
Stream Views
• Views header (binary offsets)
[Figure: each XML document (<bib> <book> ... </bib>) in the stream is preceded by a header of view offsets, e.g. 0, 30, 72.]
100x speedup on a hit
Choosing the views: difficult problem
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
Summary
• XML stream processing problem:
  – Fixed XPath queries, transient XML data
  – Large number of queries
  – High data throughput
• Relationship to text processing techniques:
  – Still regular expressions
  – Still automata and lazy DFAs
  – Different scale
• Techniques:
  – Lazy DFAs work for reasons specific to XML
  – Stream indexes and views: ongoing research
Future Work
• Handle branches in XPath expressions
• View selection for a given workload
• Network configuration
Thank you !
Links:
www.cs.washington.edu/homes/suciu
www.cs.washington.edu/homes/suciu/XMLTK
xmltk.sourceforge.net