BNCOD07 Indexing & Searching XML Documents based on Content and Structure Synopses 1
Indexing and Searching XML Documents based on Content and Structure Synopses
Weimin He, Leonidas Fegaras, David Levine
University of Texas at Arlington
http://lambda.uta.edu
Outline
• Motivation
• Key Contributions
• Related Work
• Data Synopses Indexing
• Query Processing
• Experimental Results
• Conclusion
Why not Google?
• Need to query both structure and content
– an opportunity for more precise search
• Keyword queries are NOT adequate for XML search
– An example query beyond Google: Find the price of the book whose author’s lastname is “Smith” and whose title contains “XML” and “SAX”
– Semantic search using an XPath query: //book[author/lastname ~ “Smith”][title ~ “XML” and “SAX”]/price
• Simpler query formats cannot express complex containment relationships: [ (lastname, Smith), (title, XML & SAX), price ]
• Fully indexing XML data is neither efficient nor scalable
Key Contributions
• A framework for indexing and searching schema-less XML documents based on data synopses extracted from documents
• Two novel data synopsis structures that can achieve higher query precision and scalability
• A hash-based processing algorithm to speed up searching
• A prototype implementation to evaluate the performance of the indexing scheme and to validate the data synopsis precision
Related Work
• Extend keyword queries to XML
– XRank
– XKSearch
• Integrate IR constructs and scoring into XQuery
– TIX
– TeXQuery
• XML Summarization Techniques
– XSketch
– XCluster
System Architecture
[Figure: system architecture diagram. Components: Query Client, Server, XML Document Repository, Meta-Data Indexer, Query Footprint Extractor, Structural Summary Matcher, Query Optimizer, Query Processor. Data stores and flows: Structural Summaries, Data Synopses, Document Synopses, Query Footprint, Matching Structural Summaries & Label Paths. Input: Full-Text XPath Query. Output: A List of Document Locations.]
Specification of Search Queries
• XPath is extended with a simple IR syntax:
– Queries may contain predicates of the form: e ~ S
– e is an XPath expression
– S is a search predicate that takes the form: “term” | S1 and S2 | S1 or S2 | (S)
• A running query example: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price
• Query result: A list of document locations (path names) that satisfy the query
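The search predicate grammar above is small enough to sketch as a recursive-descent parser. The following Python is only an illustration, not the authors' implementation: the function names (`tokenize`, `parse`, `evaluate`, `matches`) are hypothetical, the choice that `and` binds tighter than `or` is an assumption the slides do not state, and predicates are tested against a plain set of words rather than against data synopses.

```python
import re

# Token pattern: a quoted term, the keywords and/or, or a parenthesis.
TOKEN = re.compile(r'\s*(?:"([^"]*)"|(and|or)\b|([()]))')

def tokenize(predicate):
    # Normalize typographic quotes so slide-style text also parses.
    text = predicate.replace("“", '"').replace("”", '"').strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input: {text[pos:]!r}")
        term, op, paren = m.groups()
        if term is not None:
            tokens.append(("term", term.strip().lower()))
        elif op is not None:
            tokens.append((op, None))
        else:
            tokens.append((paren, None))
        pos = m.end()
    return tokens

def parse(tokens):
    """Build a nested tuple: ('term', w) | ('and', l, r) | ('or', l, r)."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        kind = peek()
        if kind == "term":
            return ("term", advance()[1])
        if kind == "(":
            advance()
            node = disjunction()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return node
        raise ValueError("expected a term or '('")

    def conjunction():
        # Assumption: 'and' binds tighter than 'or'.
        node = atom()
        while peek() == "and":
            advance()
            node = ("and", node, atom())
        return node

    def disjunction():
        node = conjunction()
        while peek() == "or":
            advance()
            node = ("or", node, conjunction())
        return node

    tree = disjunction()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return tree

def evaluate(tree, words):
    """Evaluate a parsed predicate against a set of lowercase words."""
    if tree[0] == "term":
        return tree[1] in words
    left, right = evaluate(tree[1], words), evaluate(tree[2], words)
    return (left and right) if tree[0] == "and" else (left or right)

def matches(predicate, text):
    words = set(re.findall(r"\w+", text.lower()))
    return evaluate(parse(tokenize(predicate)), words)
```

For instance, `matches('"mountain" and "bicycle"', "Red mountain bicycle")` tests the description predicate of the running query against one element's text.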
Data Indexing
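• Structural Summary (SS)
– A tree that captures all unique paths in an XML document
– It is constructed from XML data incrementally
– Each SSnode# corresponds to a unique full label path, e.g. 9: /auction/sponsor/address
[Figure: example structural summary tree with nodes numbered 1 to 10. Root auction (1) has children item (2) and sponsor (8); item has children description, location, name, payment, and price (nodes 3 to 7); sponsor has children address (9) and name (10).]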
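As a rough illustration of how a structural summary could be built incrementally, the sketch below records every unique full label path seen across documents and gives each one a node number. The class and method names are hypothetical, and numbering paths in order of first encounter is a simplifying assumption; the slides do not say how numbers are assigned when new paths arrive later.

```python
import xml.etree.ElementTree as ET

class StructuralSummary:
    """Maps each unique full label path to an SS node number."""

    def __init__(self):
        self.node_of_path = {}  # e.g. "/auction/item" -> 2

    def add_document(self, xml_text):
        # Merge the label paths of one more document into the summary.
        self._visit(ET.fromstring(xml_text), "")

    def _visit(self, elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        if path not in self.node_of_path:
            # Assumption: number paths in order of first encounter.
            self.node_of_path[path] = len(self.node_of_path) + 1
        for child in elem:
            self._visit(child, path)


summary = StructuralSummary()
summary.add_document(
    "<auction><item><location>Dallas</location><price>5</price></item>"
    "<sponsor><address>1 Main St</address></sponsor></auction>"
)
# A second document only adds the one path not seen before.
summary.add_document(
    "<auction><item><location>Austin</location></item>"
    "<sponsor><name>ACME</name></sponsor></auction>"
)
```

Repeated paths such as /auction/item/location are stored once, so the summary grows with the number of distinct paths rather than with the number of documents.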
Data Indexing (cont.)
• Content Synopsis (CS)
– Summarizes the text associated with an SS node in an XML document
– Approximated as a bit matrix of size W×L
• L is fixed but W may depend on the document size
– Stored as a B+-tree that implements the mapping (SSnode#, doc#) → bit-matrix
– Used in evaluating search predicates in the query
• Positional Filter (PF)
– Captures the position spans of all XML elements associated with an SS node in an XML document
– Represented as a bit matrix of size M×L, where M ≥ 2
– Stored as a B+-tree that implements the mapping (SSnode#, doc#) → bit-matrix
– Used in enforcing containment constraints among query predicates
• Do we need the positional dimension?
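To make the content synopsis idea concrete, here is a minimal sketch under stated assumptions: rows are term hash buckets (W of them), columns are position ranges of the document (L of them), and each row is stored as an integer bitmask. The helper names, the use of `zlib.crc32` as the hash, and the way token positions are scaled onto L columns are all illustrative choices, not details taken from the slides.

```python
import zlib

def term_bucket(term, width):
    # Deterministic hash of a term into one of `width` rows (assumed hash function).
    return zlib.crc32(term.lower().encode("utf-8")) % width

def position_mask(begin, end, doc_len, length):
    """Bitmask over `length` columns covering token span [begin, end), end > begin."""
    first = begin * length // doc_len
    last = (end - 1) * length // doc_len
    mask = 0
    for col in range(first, last + 1):
        mask |= 1 << col
    return mask

def build_content_synopsis(elements, doc_len, width=32, length=30):
    """elements: (begin, end, text) triples for one SS node in one document."""
    rows = [0] * width
    for begin, end, text in elements:
        mask = position_mask(begin, end, doc_len, length)
        for term in text.split():
            rows[term_bucket(term, width)] |= mask
    return rows

def lookup(rows, term):
    """Positions where `term` may occur (hash collisions can add false positives)."""
    return rows[term_bucket(term, len(rows))]


# Two location elements in a 100-token document.
cs_location = build_content_synopsis(
    [(0, 10, "Dallas"), (50, 60, "Austin Texas")], doc_len=100
)
dallas_positions = lookup(cs_location, "Dallas")
```

Keeping the positional dimension is what allows a later step to check that two matching terms fall inside the same parent element rather than merely inside the same document.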
Data Synopsis Example
Query: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price
[Figure: three bit matrices for the running query. (1) Content Synopsis for /auction/item/location, axes Term × Document Position, with hash(Dallas) = 13. (2) Positional Filter for /auction/item, over Document Position. (3) Content Synopsis for /auction/item/description, axes Term × Document Position, with hash(mountain) = 2 and hash(bicycle) = 11.]
Containment Filtering
Query: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price
[Figure: Testing Running Query Using Data Synopses. Query tree rooted at item (F2) with branches location (“Dallas”) and description (“mountain”, “bicycle”). A = CF(F2, H4[Dallas]); B = CF(A, and(H3[mountain], H3[bicycle])).]
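One plausible reading of the CF steps in this slide can be sketched with integer bitmasks. The assumptions here are mine, not the slides': each row of the positional filter holds runs of 1 bits, one run per item element, with neighbouring elements placed in different rows so their runs do not merge; a content synopsis lookup yields one bitmask over the same position columns; and CF keeps only the element runs that overlap that mask. The function names are hypothetical.

```python
def runs(mask):
    """Yield each maximal run of consecutive 1 bits in `mask` as its own bitmask."""
    while mask:
        low = mask & -mask          # lowest set bit starts the next run
        run = 0
        while mask & low:
            run |= low
            mask &= ~low
            low <<= 1
        yield run

def containment_filter(pf_rows, content_mask):
    """Keep only the element spans in pf_rows that overlap content_mask."""
    result = []
    for row in pf_rows:
        kept = 0
        for run in runs(row):
            if run & content_mask:
                kept |= run
        result.append(kept)
    return result

def qualifies(pf_rows):
    # An all-zero matrix means the document is unqualified.
    return any(pf_rows)


# Two item elements: columns 0-3 (row 0) and columns 4-7 (row 1).
F2 = [0b00001111, 0b11110000]
h4_dallas = 1 << 1                    # "Dallas" seen inside the first item
h3_mountain = (1 << 2) | (1 << 6)     # "mountain" seen in both items
h3_bicycle = (1 << 5) | (1 << 6)      # "bicycle" seen only in the second item

A = containment_filter(F2, h4_dallas)
B = containment_filter(A, h3_mountain & h3_bicycle)
```

In this made-up data the first item mentions Dallas but not both description terms, and the second has both terms but not Dallas, so `B` ends up empty and the document is rejected even though all three terms occur somewhere in it.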
Query Processing Overview
• Query Footprint (QF) Extraction
– Query: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price
– QF: //auction//item:0[location: 1][description: 2]/price
• Structural Summary Matching
– Retrieve all structural summaries that match the QF
– We use the standard preorder numbering scheme to represent an SS
– An SS is stored as a B+-tree that implements the mapping: tag → {(SS#, SSnode#, begin, end, level)}
– We use containment joins to retrieve the qualified full label paths that match the entry points in the QF: [ /auction/item, /auction/item/location, /auction/item/description ]
• Containment Filtering
• Qualified document locations are collected and returned
– The unit of query processing is a mapping from a doc# to a bit matrix of size M×L (positions)
– An empty bit matrix means an unqualified document
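The structural summary matching step can be illustrated with a small containment join over (begin, end, level) triples. This is a sketch only: it assumes begin is a node's preorder number and end is the largest preorder number in its subtree, it uses a simple nested-loop join instead of a B+-tree-backed merge join, and the `Node` tuple, `match_steps` function, and the exact numbering of item's children are illustrative.

```python
from collections import namedtuple

Node = namedtuple("Node", "path begin end level")

def is_descendant(anc, desc):
    # With preorder intervals, a descendant's number falls inside its ancestor's range.
    return anc.begin < desc.begin <= anc.end

def is_child(parent, child):
    return is_descendant(parent, child) and child.level == parent.level + 1

def match_steps(index, steps):
    """steps: list of (axis, tag) with axis '/' or '//'; returns Nodes matching the last step."""
    axis, tag = steps[0]
    current = [n for n in index.get(tag, []) if axis == "//" or n.level == 1]
    for axis, tag in steps[1:]:
        test = is_child if axis == "/" else is_descendant
        current = [d for d in index.get(tag, [])
                   if any(test(a, d) for a in current)]
    return current


# One structural summary, modelled on the ten-node auction example.
nodes = [
    Node("/auction", 1, 10, 1),
    Node("/auction/item", 2, 7, 2),
    Node("/auction/item/description", 3, 3, 3),
    Node("/auction/item/location", 4, 4, 3),
    Node("/auction/item/name", 5, 5, 3),
    Node("/auction/item/payment", 6, 6, 3),
    Node("/auction/item/price", 7, 7, 3),
    Node("/auction/sponsor", 8, 10, 2),
    Node("/auction/sponsor/address", 9, 9, 3),
    Node("/auction/sponsor/name", 10, 10, 3),
]
index = {}
for n in nodes:
    index.setdefault(n.path.rsplit("/", 1)[1], []).append(n)

entry = match_steps(index, [("//", "auction"), ("//", "item")])
loc = match_steps(index, [("//", "auction"), ("//", "item"), ("/", "location")])
```

Here `entry` resolves the generic step //auction//item to the full label path /auction/item, which is the kind of path the later filtering steps use as a lookup key.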
Two-Phase Containment Filtering
• Many sources of inefficiency:
– A large number of full label paths may match a single generic XPath query
– A long list of data synopses has to be retrieved for each label path in a QF
– The retrieved lists of data synopses have to be correlated at each step during containment filtering
• Solution:
– Aggregate data synopses lists from multiple documents into a single bit matrix of size W×D, called Document Synopsis, stored as the mapping path → bit-matrix, so that, given a term t and a full label path p, the document doc# is a candidate if the document synopsis for p is set at [hash(t), hash(doc#)]
– Need a two-phase containment filtering algorithm to prune unqualified document locations before the actual containment filtering
Document Synopsis
[Figure: the document synopsis for /biblio/book/paragraph, a bit matrix with axes Term × Document ID (document buckets 0 to 15). Term hashes: hash(“XML”) = 2, hash(“science”) = 11, hash(“computer”) = 11. Document buckets: doc 12 mod 16 = 12; doc 105 mod 16 = 9; doc 121 mod 16 = 9; doc 137 mod 16 = 9.]
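The document synopsis behaves like a two-dimensional Bloom-style filter, which the following sketch tries to show. It assumes one synopsis object per full label path, D = 16 document buckets with hash(doc#) = doc# mod D as in the example, and `zlib.crc32` as a stand-in term hash, so the term rows will not match the hash values in the figure. Class and method names are hypothetical.

```python
import zlib

class DocumentSynopsis:
    """W x D bit matrix for one full label path: term buckets by document buckets."""

    def __init__(self, width=32, docs=16):
        self.width, self.docs = width, docs
        self.rows = [0] * width  # rows[term bucket] is a D-bit mask of document buckets

    def _bucket(self, term):
        # Assumed term hash; the slides do not name the hash function.
        return zlib.crc32(term.lower().encode("utf-8")) % self.width

    def add(self, term, doc_id):
        self.rows[self._bucket(term)] |= 1 << (doc_id % self.docs)

    def may_contain(self, term, doc_id):
        """False means the document can be pruned; True means it is only a candidate."""
        return bool((self.rows[self._bucket(term)] >> (doc_id % self.docs)) & 1)

    def candidates(self, term, doc_ids):
        return [d for d in doc_ids if self.may_contain(term, d)]


synopsis = DocumentSynopsis()
synopsis.add("XML", 12)
synopsis.add("XML", 105)
survivors = synopsis.candidates("XML", [12, 13, 105, 121])
```

Document 121 was never added, yet it survives pruning because it shares bucket 9 with document 105. That is why this first phase only removes documents that cannot match and still needs the actual containment filtering afterwards.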
Experimental Setup
Data Set   Data Size (MB)   Files   Avg. File Size (KB)   Avg. SS Size (Byte)   Avg. CS Size (Byte)   Avg. PF Size (Byte)
XBench     1050             2666    394                   432                   20564                 178
XMark      55.8             11500   5                     417                   306                   16
• A prototype system is implemented in Java
• Employed Berkeley DB Java Edition 3.2.13 as a storage manager
• Datasets
– XMark
– XBench
Query Workload
Dataset Query Query Expression
XMark Q1 /site//item[location ~ "United"][payment ~ "Creditcard" and "Check"]/description
XMark Q2 //regions//item[location ~ "States"][payment ~ "Creditcard" or "Cash"]/name
XMark Q3 /site//item[location ~ "United"][payment ~ "Creditcard"]/description
XMark Q4 //regions//item[location ~ "States"][payment ~ "Check"]/quantity
XMark Q5 /site//item[description//text ~ "gold"]/name
XMark Q6 /regions//item[description//text ~ "character "]/payment
XMark Q7 //closed_auction[type ~ "Regular"][annotation//text ~ "heat"]/date
XMark Q8 //closed_auction[annotation//text ~ "heat" or "country"]/seller
XMark Q9 //closed_auction[annotation//text ~ "heat" and "country"]/buyer
XMark Q10 //closed_auction[annotation//text~ "country"]/type
XBench Q11 /article//body[abstract/p ~ "hockey"][section/p ~ "hockey" and "patterns"]/section
XBench Q12 //article//body[section/p ~ "regular"][abstract/p ~ "hockey" or "patterns"]/abstract
XBench Q13 /article//body[section/subsec/p ~ "hockey"][abstract/p ~ "hockey"]/abstract
XBench Q14 /article//body[section/subsec/p ~ "regular"][abstract/p ~ "patterns"]/section
XBench Q15 /article//body[section/p ~ "patterns"][abstract/p ~ "patterns"]/abstract
XBench Q16 /article//body[section/p ~ "hockey"][abstract/p ~ "patterns"]/abstract
XBench Q17 //prolog[keywords/keyword ~ "bold" or "regular"][title ~ "regular"]/authors
XBench Q18 //prolog[keywords/keyword ~ "bold"][title ~ "bold"]/title
XBench Q19 //prolog[genre ~ "Travel"][keywords/keyword ~ "bold" or "stealth"]//author/name
XBench Q20 //prolog[genre ~ "Travel"][keywords/keyword ~ "bold"]/title
Indexing Scheme Comparison
[Figure: three bar charts comparing ILI and DSI on the XBench dataset: Index Build Time (s), Index Size (MB), and Avg. Query Response Time (s).]
ILI: using a standard XML indexing scheme based on full Inverted Lists
DSI: using our indexing scheme based on Data Synopses
Query Precision Measurement
[Figure: bar chart of False Positive Rate (y-axis, 0 to 0.7) for queries Q1 to Q10, comparing ODBF and TDBF.]
ODBF: using one-dimensional Bloom Filters
TDBF: using two-dimensional Bloom Filters
Efficiency of Optimization Algorithm
[Figure: bar chart of Query Response Time (s) (y-axis, 0 to 200) for queries Q1 to Q10, comparing OPCF and TPCF.]
OPCF: using one-phase containment filtering
TPCF: using two-phase containment filtering
Future Research Directions
• Develop an effective ranking function
• Adopt top-k algorithms to improve search efficiency
• Apply our framework to structured P2P networks
• Evaluate our framework over INEX data