
BNCOD07

Indexing and Searching XML Documents based on Content and Structure Synopses

Weimin He, Leonidas Fegaras, David Levine

University of Texas at Arlington

http://lambda.uta.edu


Outline

• Motivation
• Key Contributions
• Related Work
• Data Synopses Indexing
• Query Processing
• Experimental Results
• Conclusion


Why not Google?

• Need to query both structure and content
  – An opportunity for more precise search
• Keyword queries are NOT adequate for XML search
  – An example query beyond Google: find the price of the book whose author’s lastname is “Smith” and whose title contains “XML” and “SAX”
  – Semantic search using an XPath query: //book[author/lastname ~ “Smith”][title ~ “XML” and “SAX”]/price
• Simpler query formats cannot express complex containment relationships: [ (lastname, Smith), (title, XML & SAX), price ]
• Fully indexing XML data is neither efficient nor scalable


Key Contributions

• A framework for indexing and searching schema-less XML documents based on data synopses extracted from documents

• Two novel data synopsis structures that can achieve higher query precision and scalability

• A hash-based processing algorithm to speed up searching

• A prototype implementation to evaluate the performance of the indexing scheme and to validate the data synopsis precision


Related Work

• Extend keyword queries to XML
  – XRank
  – XKSearch
• Integrate IR constructs and scoring into XQuery
  – TIX
  – TeXQuery
• XML Summarization Techniques
  – XSketch
  – XCluster


System Architecture

[Figure: system architecture diagram. Components and data shown: Query Client, Server, XML Document Repository, Meta-Data Indexer, Structural Summaries, Data Synopses, Document Synopses, Query Footprint Extractor, Query Footprint, Structural Summary Matcher, Matching Structural Summaries & Label Paths, Query Optimizer, Query Processor. Input: Full-Text XPath Query. Output: A List of Document Locations]


Specification of Search Queries

• XPath is extended with a simple IR syntax:
  – Queries may contain predicates of the form: e ~ S
    – e is an XPath expression
    – S is a search predicate that takes the form: “term” | S1 and S2 | S1 or S2 | (S)
• A running query example: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price
• Query result: a list of document locations (path names) that satisfy the query
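The search predicate grammar above can be read as a small Boolean expression language. The following is a minimal sketch of evaluating such a predicate against the text of one element, assuming exact, case-insensitive term matching. The tuple representation and the names `matches` and `tokenize` are illustrative and not part of the system described in the slides.

```python
# Hypothetical representation of a search predicate S:
#   "term"      -> a plain string
#   S1 and S2   -> ("and", S1, S2)
#   S1 or S2    -> ("or", S1, S2)
# Parentheses are expressed by nesting tuples.

def matches(pred, tokens):
    """Return True if the set of tokens satisfies the search predicate."""
    if isinstance(pred, str):
        return pred.lower() in tokens
    op, left, right = pred
    if op == "and":
        return matches(left, tokens) and matches(right, tokens)
    if op == "or":
        return matches(left, tokens) or matches(right, tokens)
    raise ValueError(f"unknown operator: {op}")

def tokenize(text):
    """Naive whitespace tokenizer (an assumption of this sketch)."""
    return {word.lower() for word in text.split()}

# description ~ "mountain" and "bicycle"
description_pred = ("and", "mountain", "bicycle")
description_ok = matches(description_pred, tokenize("Red mountain bicycle for sale"))
description_fail = matches(description_pred, tokenize("Road bicycle for sale"))
```

This evaluates a predicate exactly, over the real text; the synopses introduced later answer the same question approximately, without access to the text.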


Data Indexing

• Structural Summary (SS)
  – A tree that captures all unique paths in an XML document
  – It is constructed from XML data incrementally
  – Each SSnode# corresponds to a unique full label path, e.g. 9: /auction/sponsor/address

[Figure: example structural summary tree with nodes numbered 1 to 10. Element labels: auction, item, sponsor, description, address, name, location, name, payment, price]
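The incremental construction of a structural summary can be sketched as follows. This is a minimal illustration, assuming that node numbers are assigned in order of first appearance, which need not match the numbering used in the slides; `build_structural_summary` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_structural_summary(xml_text, summary=None):
    """Add the unique full label paths of one document to the summary.

    `summary` maps a full label path to its SS node number. Numbers are
    assigned in order of first appearance (an assumption of this sketch).
    """
    if summary is None:
        summary = {}

    def visit(element, parent_path):
        path = f"{parent_path}/{element.tag}"
        if path not in summary:
            summary[path] = len(summary) + 1
        for child in element:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return summary

doc = """<auction>
  <item><location>Dallas</location><price>120</price></item>
  <item><location>Austin</location><description>bike</description></item>
  <sponsor><address>1 Main St</address></sponsor>
</auction>"""

summary = build_structural_summary(doc)
```

Repeated elements such as the second `item` add no new nodes, so the summary stays small even when the document is large. Passing an existing `summary` back in extends it with the paths of a further document.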


Data Indexing (cont.)

• Content Synopsis (CS)
  – Summarizes the text associated with an SS node in an XML document
  – Approximated as a bit matrix of size W×L
    • L is fixed but W may depend on the document size
  – Stored as a B+-tree that implements the mapping (SSnode#, doc#) → bit-matrix
  – Used in evaluating search predicates in the query
• Positional Filter (PF)
  – Captures the position spans of all XML elements associated with an SS node in an XML document
  – Represented as a bit matrix of size M×L, where M ≥ 2
  – Stored as a B+-tree that implements the mapping (SSnode#, doc#) → bit-matrix
  – Used in enforcing containment constraints among query predicates
• Do we need the positional dimension?
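A content synopsis of this kind can be sketched as a two-dimensional Bloom-filter-like bit matrix, with one axis for hashed terms and one for document positions. The sketch below is an illustration under stated assumptions: `zlib.crc32` stands in for the unspecified term hash, positions are scaled linearly onto L columns, and the sizes and function names are hypothetical.

```python
import zlib

W, L = 16, 8  # illustrative sizes; the slides fix L and let W vary

def term_hash(term, width=W):
    """Deterministic hash of a term into [0, width); crc32 is a stand-in."""
    return zlib.crc32(term.lower().encode("utf-8")) % width

def position_bucket(position, doc_length, length=L):
    """Scale a position in the document to one of `length` columns."""
    return min(position * length // doc_length, length - 1)

def build_content_synopsis(occurrences, doc_length):
    """occurrences: iterable of (term, position) pairs for one SS node."""
    matrix = [[0] * L for _ in range(W)]
    for term, position in occurrences:
        matrix[term_hash(term)][position_bucket(position, doc_length)] = 1
    return matrix

def may_contain(matrix, term):
    """Columns where the term may occur; false positives are possible."""
    return matrix[term_hash(term)]

# Text under /auction/item/location in a hypothetical 100-position document
cs_location = build_content_synopsis([("Dallas", 12), ("Houston", 70)], doc_length=100)
dallas_columns = may_contain(cs_location, "Dallas")
```

A set bit only says that some term with that hash value occurs somewhere in that position range, so a lookup can report a term that is absent but never misses one that is present.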


Data Synopsis Example

Query: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price

[Figure: three bit matrices for the running query. Content Synopsis for /auction/item/location (axes: Term, Document Position), with hash(Dallas) = 13. Positional Filter for /auction/item (axis: Document Position). Content Synopsis for /auction/item/description (axes: Term, Document Position), with hash(mountain) = 2 and hash(bicycle) = 11]


Containment Filtering

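Query: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price

[Figure: testing the running query using data synopses. Query tree: item (F2) with branches location ~ “Dallas” and description ~ “mountain”, “bicycle”; intermediate results labeled A and B; filtering steps shown: CF(F2, H4[Dallas]) and CF(A, and(H3[mountain], H3[bicycle]))]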
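The slides do not define the CF operation precisely, so the following is only a sketch of one plausible reading: a positional filter stores each element span as a run of 1 bits, consecutive elements sit on alternating rows so that adjacent spans do not merge, and CF keeps a span only if the term bit vector has a set bit inside it. All names and the sample bit vectors are hypothetical.

```python
def runs(row):
    """Yield (start, end) index pairs of maximal runs of 1 bits in a row."""
    start = None
    for i, bit in enumerate(row):
        if bit and start is None:
            start = i
        elif not bit and start is not None:
            yield start, i - 1
            start = None
    if start is not None:
        yield start, len(row) - 1

def containment_filter(pf, bits):
    """Keep only the element spans in pf that cover a set bit in `bits`."""
    result = [[0] * len(row) for row in pf]
    for r, row in enumerate(pf):
        for start, end in runs(row):
            if any(bits[start:end + 1]):
                for i in range(start, end + 1):
                    result[r][i] = 1
    return result

def bit_and(a, b):
    """Column-wise AND of two bit vectors of equal length."""
    return [x & y for x, y in zip(a, b)]

def is_empty(pf):
    """An empty filter means the document does not qualify."""
    return not any(any(row) for row in pf)

# Two item elements placed on alternating rows (M = 2, L = 10)
pf_item = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # first item spans columns 0-3
    [0, 0, 0, 0, 1, 1, 1, 1, 0, 0],  # second item spans columns 4-7
]
dallas   = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]  # synopsis row for "Dallas"
mountain = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # synopsis row for "mountain"
bicycle  = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # synopsis row for "bicycle"

a = containment_filter(pf_item, dallas)
b = containment_filter(a, bit_and(mountain, bicycle))
```

Here both items survive the location test, but only the first also covers a column where "mountain" and "bicycle" may co-occur. Combining the two terms with a column-wise AND assumes they fall into the same position column, which is a simplification of this sketch.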


Query Processing Overview

• Query Footprint (QF) Extraction
  – Query: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price
  – QF: //auction//item:0[location:1][description:2]/price
• Structural Summary Matching
  – Retrieve all structural summaries that match the QF
    • We use the standard preorder numbering scheme to represent an SS
    • An SS is stored as a B+-tree that implements the mapping: tag → {(SS#, SSnode#, begin, end, level)}
    • We use containment joins to retrieve the qualified full label paths that match the entry points in the QF: [ /auction/item, /auction/item/location, /auction/item/description ]
• Containment Filtering
• Qualified document locations are collected and returned
  – The unit of query processing is a mapping from a doc# to a bit matrix of size M×L (positions)
  – An empty bit matrix means an unqualified document
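The structural summary matching step can be illustrated with a containment join over (begin, end, level) intervals. The sketch below assumes a region encoding in which an ancestor's interval strictly encloses those of its descendants, and uses a plain dictionary in place of the B+-tree; the index contents and `containment_join` are hypothetical.

```python
from collections import namedtuple

Node = namedtuple("Node", "ss node begin end level")

# Hypothetical index for one structural summary (SS# 1). Each node gets
# begin/end numbers from a depth-first traversal, so an ancestor's interval
# strictly encloses the intervals of its descendants.
index = {
    "auction":     [Node(1, 1, 1, 12, 0)],
    "item":        [Node(1, 2, 2, 9, 1)],
    "location":    [Node(1, 3, 3, 4, 2)],
    "description": [Node(1, 4, 5, 6, 2)],
    "price":       [Node(1, 5, 7, 8, 2)],
    "sponsor":     [Node(1, 6, 10, 11, 1)],
}

def containment_join(ancestors, tag, child_only=False):
    """Return nodes labeled `tag` that lie below one of `ancestors`."""
    matches = []
    for d in index.get(tag, []):
        for a in ancestors:
            inside = a.ss == d.ss and a.begin < d.begin and d.end < a.end
            if inside and (not child_only or d.level == a.level + 1):
                matches.append(d)
                break
    return matches

# //auction//item, then the child steps location and description
items = containment_join(index["auction"], "item")
locations = containment_join(items, "location", child_only=True)
descriptions = containment_join(items, "description", child_only=True)
sponsor_locations = containment_join(index["sponsor"], "location")
```

The nested loop keeps the sketch short; a real implementation would more likely merge the two lists in begin order rather than compare every pair.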


Two-Phase Containment Filtering

• Many sources of inefficiency:
  – A large number of full label paths may match a single generic XPath query
  – A long list of data synopses has to be retrieved for each label path in a QF
  – The retrieved lists of data synopses have to be correlated at each step during containment filtering
• Solution:
  – Aggregate data synopsis lists from multiple documents into a single bit matrix of size W×D, called a Document Synopsis, which implements the mapping path → bit-matrix, so that, given a term t and a full label path p, the document doc# is a candidate if the document synopsis for p is set at [hash(t), hash(doc#)]
  – Need a two-phase containment filtering algorithm to prune unqualified document locations before the actual containment filtering


Document Synopsis

[Figure: document synopsis bit matrix (axes: Term, Document ID). hash(“XML”) = 2; hash(“science”) = 11; hash(“computer”) = 11. doc 12 mod 16 = 12; doc 105 mod 16 = 9; doc 121 mod 16 = 9; doc 137 mod 16 = 9]

The document synopsis for /biblio/book/paragraph
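The candidate test on a document synopsis can be sketched with the hash values shown in the example. The term hashes and the `mod 16` document hash come from the slide; the matrix width, the postings, and the function names are assumptions made for illustration.

```python
W, D = 21, 16  # illustrative sizes; D = 16 matches the "mod 16" in the example

# Term hash values taken from the example; other terms are not supported
# in this sketch.
TERM_HASH = {"XML": 2, "science": 11, "computer": 11}

def doc_hash(doc_id, buckets=D):
    """Hash a document number to a column of the synopsis."""
    return doc_id % buckets

def build_document_synopsis(postings):
    """postings: iterable of (term, doc_id) pairs for one full label path."""
    matrix = [[0] * D for _ in range(W)]
    for term, doc_id in postings:
        matrix[TERM_HASH[term]][doc_hash(doc_id)] = 1
    return matrix

def is_candidate(matrix, term, doc_id):
    """True if the document may contain the term under this label path."""
    return matrix[TERM_HASH[term]][doc_hash(doc_id)] == 1

# Hypothetical postings for /biblio/book/paragraph
synopsis = build_document_synopsis([("XML", 12), ("science", 105)])
```

With these postings, `is_candidate(synopsis, "computer", 121)` is true even though nothing was indexed for that pair, because "computer" collides with "science" and document 121 collides with document 105. Such false positives are why the first phase only prunes documents and the actual containment filtering still runs on the survivors.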


Experimental Setup

Data Set   Data Size (MB)   Files   Avg. File Size (KB)   Avg. SS Size (Byte)   Avg. CS Size (Byte)   Avg. PF Size (Byte)
XBench     1050             2666    394                   432                   20564                 178
XMark      55.8             11500   5                     417                   306                   16

• A prototype system is implemented in Java
• Employed Berkeley DB Java Edition 3.2.13 as a storage manager
• Datasets
  – XMark
  – XBench


Query Workload

Dataset Query Query Expression

XMark Q1 /site//item[location ~ "United"][payment ~ "Creditcard" and "Check"]/description

XMark Q2 //regions//item[location ~ "States"][payment ~ "Creditcard" or "Cash"]/name

XMark Q3 /site//item[location ~ "United"][payment ~ "Creditcard"]/description

XMark Q4 //regions//item[location ~ "States"][payment ~ "Check"]/quantity

XMark Q5 /site//item[description//text ~ "gold"]/name

XMark Q6 /regions//item[description//text ~ "character "]/payment

XMark Q7 //closed_auction[type ~ "Regular"][annotation//text~ "heat"]/date

XMark Q8 //closed_auction[annotation//text~ "heat" or "country"]/seller

XMark Q9 //closed_auction[annotation//text~ "heat" and "country"]/buyer

XMark Q10 //closed_auction[annotation//text~ "country"]/type

XBench Q11 /article//body[abstract/p ~ "hockey"][section/p ~ "hockey" and "patterns"]/section

XBench Q12 //article//body[section/p ~ "regular"][abstract/p ~ "hockey" or "patterns"]/abstract

XBench Q13 /article//body[section/subsec/p ~ "hockey"][abstract/p ~ "hockey"]/abstract

XBench Q14 /article//body[section/subsec/p ~ "regular"][abstract/p ~ "patterns"]/section

XBench Q15 /article//body[section/p ~ "patterns"][abstract/p ~ "patterns"]/abstract

XBench Q16 /article//body[section/p ~ "hockey"][abstract/p ~ "patterns"]/abstract

XBench Q17 //prolog[keywords/keyword ~ "bold" or "regular"][title~ "regular"]/authors

XBench Q18 //prolog[keywords/keyword ~ "bold"][title~ "bold"]/title

XBench Q19 //prolog[genre ~ "Travel"] [keywords/keyword ~ "bold" or "stealth" ]//author/name

XBench Q20 //prolog[genre ~ "Travel"] [keywords/keyword ~ "bold"]/title


Indexing Scheme Comparison

[Figure: three bar charts comparing ILI and DSI on the XBench dataset: Index Build Time (s), Index Size (MB), and Avg. Query Response Time (s)]

ILI: using a standard XML indexing scheme based on full Inverted Lists

DSI: using our indexing scheme based on Data Synopses


Query Precision Measurement

[Figure: bar chart of False Positive Rate per query (Q1 to Q10), comparing ODBF and TDBF]

ODBF: using one-dimensional Bloom Filters

TDBF: using two-dimensional Bloom Filters


Efficiency of Optimization Algorithm

[Figure: bar chart of Query Response Time (s) per query (Q1 to Q10), comparing OPCF and TPCF]

OPCF: using one-phase containment filtering

TPCF: using two-phase containment filtering


Future Research Directions

• Develop an effective ranking function
• Adopt top-k algorithms to improve search efficiency
• Apply our framework to structured P2P networks
• Evaluate our framework over INEX data