Flexible and Efficient XML Search with Complex Full-Text Predicates

Sihem Amer-Yahia - AT&T Labs Research → Yahoo! Research

Emiran Curtmola - University of California San Diego

Alin Deutsch - University of California San Diego

SIGMOD, June 2006 2

Introduction

Need for complex full-text predicates beyond simple keyword search

Library of Congress (LoC)
Biomedical data
ACM, IEEE publications
INEX data collection
Wikipedia XML data set


Real XML fragment from LoC: http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml

[Figure: tree view of the XML fragment. Text content, in extraction order:]

Congress on education and workforce, comments to appropriate services.

109th Mr Column and co-sponsors Mrs Miller and Mrs Jones. Others include Jefferson

on May 2, 2004 Joe Jefferson

introduced the following bill. The bill was reintroduced later and was referred to the committee

on education and workforce sponsored by Joe Jefferson

House of Representatives Current chamber on workforce and services. Committees on education are headed by Jefferson

Jefferson and services …

[Node labels in the figure: HR2739, committee-name, action-desc, bill, congress-info, nbr, sponsors, action, legis-session, legis, legis-body, legis-desc]


Query with complex FT predicates

Document fragments (nodes) that

contain the keywords “Jefferson” and “education”

and satisfy the predicates: within a window of 10 words, with “Jefferson” ordered before “education”
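The two predicates in this query can be illustrated with a minimal sketch. It assumes that a match is a tuple of word positions, one per keyword, and that "a window of n words" means the span from the first to the last matched position covers at most n words. The function names and the sample positions are illustrative and do not come from the talk.

```python
def ordered(match):
    """True if the positions appear in the order the keywords were listed."""
    return all(a < b for a, b in zip(match, match[1:]))


def window(match, size):
    """True if all positions fit in a span of at most `size` words.

    Assumed definition: span = last position - first position + 1.
    """
    return max(match) - min(match) + 1 <= size


# (position of "Jefferson", position of "education"); hypothetical positions
print(ordered((12, 18)), window((12, 18), 10))  # True True
print(ordered((22, 3)), window((22, 3), 10))    # False False
```

A node would qualify for the query only if at least one of its matches passes both checks.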

Example: LoC document

Return document fragments

Naive solution: test the query at each node → redundant

Need for efficient evaluation of full-text predicates

use structural relationships between nodes

avoid redundant computation


Existing languages

Many XML full-text search languages

expressive power, semantics, scores [BAS-06]

XQFT-class: W3C’s XQuery Full-Text (XQFT), NEXI, XIRQL, JuruXML, XSearch, XRank, XKSearch, Schema-Free XQuery

Efficient query evaluation limited to

  Conjunctive keyword search (no predicates)

  Full-text predicates in isolation

Need for a universal optimization framework

  Guarantee the universality of the solution


Contributions

Formal semantics for XQFT-class

  Unified framework

  Capture family of tf*idf scoring methods

Structure-aware algorithms to efficiently evaluate XQFT-class languages

  XFT full-text algebra

  Enable new optimizations inspired by relational rewritings


Talk Outline

Motivation & Contributions
Formalization of XML full-text search
Efficient evaluation
Experiments
Conclusion


Formalization: design goals

Capture existing full-text languages

Language semantics in terms of

  keyword patterns

  pattern matches

  predicates evaluated through matches

Manipulate tuples

  enable relational query evaluation and rewritings


Formalization: patterns

Pattern = tuple of simultaneously matching keywords

Query expression:

“Jefferson” and “education” within a window of 10 words, with “Jefferson” ordered before “education”

Pattern

(“Jefferson”, “education”)


Formalization: patterns

Formalization specifies

  patterns ← conjunction of keywords

  set of patterns ← disjunction of keywords

  exclusion patterns ← negation of keywords

No matches in the document

Formalization: matches

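Pattern: (“Jefferson”, “education”)

Each match is a tuple of word positions in the document, one position per keyword of the pattern.

Matches shown on the document: (22, 3), (22, 45), (22, 67), (51, 3)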
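The way matches arise from keyword occurrences can be sketched as a Cartesian product of position lists. The positions below are the ones listed for the root node in the matching tables later in the talk; the `matches` helper is an illustrative name, not part of the formalization.

```python
from itertools import product

# Word positions of each keyword in the example document.
positions = {
    "Jefferson": [22, 28, 51, 54, 72],
    "education": [3, 45, 67],
}


def matches(pattern, positions):
    """All tuples that take one position per keyword of the pattern."""
    return list(product(*(positions[k] for k in pattern)))


all_matches = matches(("Jefferson", "education"), positions)
print(len(all_matches))   # 15
print(all_matches[:3])    # [(22, 3), (22, 45), (22, 67)]
```

The four matches shown on the slide, including (51, 3), are among these 15 combinations.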


Formalization: matching tables

Matching table represents

  Nested relation

  Each node in the document

  Each pattern in the query

  Set of matches

Node      Pattern                     Matches
action    “Jefferson”, “education”    (28, 45), (51, 45)
…         …                           …


XFT Algebra

Similar to relational algebra

  Manipulate matching tables

  Leverage relational query evaluation + optimization techniques

XFT operators

  construct matching table R_k for each keyword k: get(k)

  manipulate matching tables:

    R1 or R2

    R1 and R2

    R1 minus R2

    σ_{times}(R), σ_{ordered}(R), σ_{window}(R), σ_{distance}(R)
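As a rough illustration of the operators above, the sketch below builds a matching table for one keyword and applies a generic selection. It assumes a hypothetical inverted index that maps a keyword to (Dewey identifier of the smallest containing node, word position) pairs, and it omits the Pattern column. The `or` and `minus` operators are left out because the slides do not spell out their semantics.

```python
from collections import defaultdict

# Hypothetical inverted index; values follow the example used in the talk.
INDEX = {
    "Jefferson": [("1.1.3", 22), ("1.2", 28), ("1.2.2.2", 51),
                  ("1.3.1.2", 54), ("1.3.2", 72)],
}


def ancestors_or_self(dewey):
    """'1.2.2' -> ['1', '1.2', '1.2.2']"""
    parts = dewey.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def get(keyword):
    """Matching table for one keyword: node -> set of 1-tuples of positions."""
    table = defaultdict(set)
    for node, pos in INDEX.get(keyword, []):
        for n in ancestors_or_self(node):
            table[n].add((pos,))
    return dict(table)


def select(table, predicate):
    """Generic sigma: keep matches that satisfy the predicate, drop empty nodes."""
    out = {}
    for node, ms in table.items():
        kept = {m for m in ms if predicate(m)}
        if kept:
            out[node] = kept
    return out


r = get("Jefferson")
print(sorted(r["1.2"]))  # [(28,), (51,)]
```

Each occurrence is recorded at its containing node and at every ancestor, which is what makes every tuple self-contained.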


XFT Algebra

Query: Nodes that contain the keywords “Jefferson” and “education” within a window of 10 words, with “Jefferson” ordered before “education”

Query plan:

σ_{window 10}(“Jefferson”, “education”) ( σ_{ordered}(“Jefferson”, “education”) ( get(“Jefferson”) × get(“education”) ) )

Benefit: equivalent query rewritings

Query evaluation: AllNodes

Straightforward implementation of the XFT algebra

Each node is considered separately

Each tuple is self-contained

Relational-style evaluation

  Joins → equi-joins

  Predicates → selections on set of matches
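A minimal sketch of this relational-style evaluation, using the two matching tables of the running example. The dictionary representation, the helper names, and the window definition (span from first to last position, inclusive) are assumptions made for illustration.

```python
# Matching tables from the example: node -> list of keyword positions.
jefferson = {"1": [22, 28, 51, 54, 72], "1.1": [22], "1.1.3": [22],
             "1.2": [28, 51], "1.2.2": [51], "1.2.2.2": [51],
             "1.3": [54, 72], "1.3.1": [54], "1.3.1.2": [54], "1.3.2": [72]}
education = {"1": [3, 45, 67], "1.1": [3], "1.1.1": [3],
             "1.2": [45], "1.2.2": [45], "1.2.2.2": [45],
             "1.3": [67], "1.3.2": [67]}


def equi_join(r1, r2):
    """Join on node id; a joined node holds all pairs of positions."""
    return {n: [(a, b) for a in r1[n] for b in r2[n]]
            for n in r1 if n in r2}


def select(table, predicate):
    """Selection over the set of matches, one tuple (node) at a time."""
    out = {n: [m for m in ms if predicate(m)] for n, ms in table.items()}
    return {n: ms for n, ms in out.items() if ms}


def within(size):
    # Assumed window definition: last position - first position + 1 <= size.
    return lambda m: max(m) - min(m) + 1 <= size


joined = equi_join(jefferson, education)
print(joined["1.2"])                          # [(28, 45), (51, 45)]

in_order = select(joined, lambda m: m[0] < m[1])
print(in_order["1.3"])                        # [(54, 67)]

print(select(in_order, within(10)))           # {}
print(sorted(select(in_order, within(20))))   # ['1', '1.2', '1.3']
```

Under the assumed window definition, no ordered match of this fragment fits in 10 words, so the last two lines contrast the original window with a wider, hypothetical one of 20 words.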

Example: LoC document

[Figure: the LoC document tree with nodes labeled by Dewey identifiers: 1; 1.1, 1.2, 1.3; 1.1.1, 1.1.2, 1.1.3; 1.2.2; 1.2.2.2; 1.3.1, 1.3.2; 1.3.1.2]

Matching table for get(“Jefferson”):

Node      Pattern        Matches
1         “Jefferson”    22, 28, 51, 54, 72
1.1       “Jefferson”    22
1.1.3     “Jefferson”    22
1.2       “Jefferson”    28, 51
1.2.2     “Jefferson”    51
1.2.2.2   “Jefferson”    51
1.3       “Jefferson”    54, 72
1.3.1     “Jefferson”    54
1.3.1.2   “Jefferson”    54
1.3.2     “Jefferson”    72

Matching table for get(“education”):

Node      Pattern        Matches
1         “education”    3, 45, 67
1.1       “education”    3
1.1.1     “education”    3
1.2       “education”    45
1.2.2     “education”    45
1.2.2.2   “education”    45
1.3       “education”    67
1.3.2     “education”    67

Result of ×, the input to σ_{ordered}(“Jefferson”, “education”) and σ_{window 10}(“Jefferson”, “education”):

Node      Pattern                     Matches
1         “Jefferson”, “education”    (22, 45), (72, 67) …
1.1       “Jefferson”, “education”    (22, 3)
1.2       “Jefferson”, “education”    (28, 45), (51, 45)
1.2.2     “Jefferson”, “education”    (51, 45)
1.2.2.2   “Jefferson”, “education”    (51, 45)
1.3       “Jefferson”, “education”    (54, 67), (72, 67)
1.3.2     “Jefferson”, “education”    (72, 67)

Predicate operates one tuple at a time

Query evaluation: SCU

AllNodes = straightforward algorithm

Reduce size of intermediate results

  use structural relationships between nodes

  avoid redundant match representation

SCU = Smallest Containing Unit


Matching tables → SCU tables

captures same information

Matching table for get(“Jefferson”):

Node      Pattern        Matches
1         “Jefferson”    22, 28, 51, 54, 72
1.1       “Jefferson”    22
1.1.3     “Jefferson”    22
1.2       “Jefferson”    28, 51
1.2.2     “Jefferson”    51
1.2.2.2   “Jefferson”    51
1.3       “Jefferson”    54, 72
1.3.1     “Jefferson”    54
1.3.1.2   “Jefferson”    54
1.3.2     “Jefferson”    72

SCU table for get(“Jefferson”):

Node      Pattern        Matches
1.1.3     “Jefferson”    22
1.2.2.2   “Jefferson”    51
1.2       “Jefferson”    28
1.3.1.2   “Jefferson”    54
1.3.2     “Jefferson”    72
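The claim that an SCU table captures the same information as a matching table can be checked with a small round-trip sketch. It assumes Dewey identifiers as node ids and a matching table that lists every ancestor of each occurrence; `to_scu` and `from_scu` are illustrative names.

```python
def to_scu(table):
    """Keep each position only at the deepest node that contains it."""
    deepest = {}
    for node, positions in table.items():
        depth = node.count(".")
        for p in positions:
            if p not in deepest or depth > deepest[p][0]:
                deepest[p] = (depth, node)
    scu = {}
    for p, (_, node) in deepest.items():
        scu.setdefault(node, []).append(p)
    return scu


def from_scu(scu):
    """Inverse direction: copy each position to its node and all ancestors."""
    table = {}
    for node, positions in scu.items():
        parts = node.split(".")
        for i in range(1, len(parts) + 1):
            table.setdefault(".".join(parts[:i]), set()).update(positions)
    return {n: sorted(ps) for n, ps in table.items()}


jefferson = {"1": [22, 28, 51, 54, 72], "1.1": [22], "1.1.3": [22],
             "1.2": [28, 51], "1.2.2": [51], "1.2.2.2": [51],
             "1.3": [54, 72], "1.3.1": [54], "1.3.1.2": [54], "1.3.2": [72]}

scu = to_scu(jefferson)
print(scu["1.2"])                    # [28]
print(from_scu(scu) == jefferson)    # True
```

The SCU table has 5 rows instead of 10, and the full matching table can be rebuilt from it by walking up the Dewey identifiers.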

SCU table for get(“Jefferson”):

Node      Pattern        Matches
1.1.3     “Jefferson”    22
1.2.2.2   “Jefferson”    51
1.2       “Jefferson”    28
1.3.1.2   “Jefferson”    54
1.3.2     “Jefferson”    72

SCU table for get(“education”):

Node      Pattern        Matches
1.1.1     “education”    3
1.2.2.2   “education”    45
1.3.2     “education”    67

Result of × as an equi-join on Node, the input to σ_{ordered}(“Jefferson”, “education”) and σ_{window 10}(“Jefferson”, “education”):

Node      Pattern                     Matches
1.2.2.2   “Jefferson”, “education”    (51, 45)
1.3.2     “Jefferson”, “education”    (72, 67)

Equi-join does not work
• Need to compute LCA

Result of × computed with LCAs, the input to σ_{ordered}(“Jefferson”, “education”) and σ_{window 10}(“Jefferson”, “education”):

Node      Pattern                     Matches
1.1       “Jefferson”, “education”    (22, 3)
1.2.2.2   “Jefferson”, “education”    (51, 45)
1.2       “Jefferson”, “education”    (28, 45)
1.3.2     “Jefferson”, “education”    (72, 67)
1.3       “Jefferson”, “education”    (54, 67)
1         “Jefferson”, “education”    (22, 45) …

1.1 is the LCA of 1.1.3 and 1.1.1
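With Dewey identifiers, the LCA of two nodes is their longest common prefix, so the join over SCU tables can be sketched as follows. This is a naive quadratic version written for clarity; it does not reproduce the stack-based single-scan technique of the talk, and the function names are illustrative.

```python
def lca(a, b):
    """Lowest common ancestor of two Dewey ids = longest common prefix."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)


def scu_join(r1, r2):
    """Pair every tuple of r1 with every tuple of r2, filed under their LCA."""
    out = {}
    for n1, ps1 in r1.items():
        for n2, ps2 in r2.items():
            node = lca(n1, n2)
            out.setdefault(node, []).extend((a, b) for a in ps1 for b in ps2)
    return out


# SCU tables from the example: node -> positions.
jefferson = {"1.1.3": [22], "1.2.2.2": [51], "1.2": [28],
             "1.3.1.2": [54], "1.3.2": [72]}
education = {"1.1.1": [3], "1.2.2.2": [45], "1.3.2": [67]}

joined = scu_join(jefferson, education)
print(lca("1.1.3", "1.1.1"))   # 1.1
print(joined["1.2"])           # [(28, 45)]
print(joined["1.3"])           # [(54, 67)]
```

Comparing identifiers component by component, rather than character by character, keeps ids such as 1.1 and 1.10 from being confused.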

After σ_{ordered}(“Jefferson”, “education”):

Node      Pattern                     Matches
1.2       “Jefferson”, “education”    (28, 45)
1.3       “Jefferson”, “education”    (54, 67)
1         “Jefferson”, “education”    (22, 45) …

After σ_{window 10}(“Jefferson”, “education”):

Node      Pattern                     Matches
EMPTY !!!

Node      Pattern                     Matches
1.3       “Jefferson”, “education”    (54, 67), (72, 67)
1         “Jefferson”, “education”    (22, 45) …

• Postorder
• Stack supports single scan


SCU summary

Equivalent to AllNodes

Structure-awareness reduces size of intermediate results

Increases computation cost

  Compute LCAs of nodes

  Match propagation

Stack-based techniques


Related work on LCA for XML

LCA for conjunctive keyword search

  XRank [GSBS-03]

  Schema-free XQuery [LYJ-04]

  XKSearch [XP-05]

Shortcomings

  No postprocessing, not compositional

    Input in document order

    Output postorder traversal

  Support for complex predicates is not straightforward

Experimental goals

AllNodes vs. SCU

  AllNodes: redundant representation

  SCU: smaller sizes, more computation

SCU Overhead

  Stack

  Match propagation

Benefit of Rewritings

  Relational-style rewritings


Experimental setup

Centrino 1.8GHz with 1GB of RAM

XMark-generated datasets

Size ranges from 50 MB to 300 MB


Experiments: AllNodes vs. SCU

Varying document size (q1 - query without predicates)

q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)


Experiments: SCU Overhead

Queries

q4 = σ_{window>1}(“See”, “internationally”, “description”, “charges”, “ship”) (q1)

q5 = σ_{window>90000000}(“See”, “internationally”, “description”, “charges”, “ship”) (q1)

Recall that q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)

Varying query predicates (not pushed)

q4 always true → no match propagation, just the stack overhead

q5 always false → propagate all matches


Experiments: Benefit of Rewritings

Queries

q2 = σ_{orderedE}(“See”, “internationally”, “description”, “charges”, “ship”) (q1)

q3 = push selections in q2

Recall that q1 = get(“See”) and get(“internationally”) and get(“description”) and get(“charges”) and get(“ship”)

Varying document size (query with predicates)

40% improvement for relational-like query rewritings
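The slides do not spell out the plan used for q3, so the following is only one plausible reading of "push selections": apply the ordered check after each binary join instead of once over the full combination. The position lists are hypothetical, and the counts of intermediate tuples are toy numbers unrelated to the measured 40%.

```python
from itertools import product


def ordered(m):
    """True if positions are strictly increasing."""
    return all(a < b for a, b in zip(m, m[1:]))


def join_then_select(lists):
    """Unpushed plan: build every combination, then apply the selection."""
    combos = list(product(*lists))
    return [m for m in combos if ordered(m)], len(combos)


def select_while_joining(lists):
    """Pushed plan: apply the selection after each binary join."""
    partial, produced = [(p,) for p in lists[0]], 0
    for nxt in lists[1:]:
        step = [m + (p,) for m in partial for p in nxt]
        produced += len(step)
        partial = [m for m in step if ordered(m)]
    return partial, produced


# Hypothetical position lists for three keywords inside one node.
lists = [[5, 40, 90], [10, 50], [7, 60, 95]]
a, n_a = join_then_select(lists)
b, n_b = select_while_joining(lists)
print(sorted(a) == sorted(b), n_a, n_b)   # True 18 15
```

Both plans return the same matches because an ordered tuple must already be ordered on every prefix, which is what makes the early filtering safe.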


Conclusion

A unified logical framework for XML full-text search languages

Algebra admits

  Efficient algorithms for operator evaluation

  Rewritings of queries into more efficient forms

  Facilitate XML joint optimizations of queries on both structure and text search

Future work

  Score-aware logical framework

Thank you!