XML Full-Text Search: Challenges and Opportunities


Page 1: XML Full-Text Search:  Challenges and Opportunities

2 September 2005 VLDB Tutorial on XML Full-Text Search

XML Full-Text Search: Challenges and Opportunities

Jayavel Shanmugasundaram

Cornell University

Sihem Amer-Yahia

AT&T Labs – Research

Page 2: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation

• Full-Text Search Languages

• Scoring

• Query Processing

• Open Issues

Page 3: XML Full-Text Search:  Challenges and Opportunities

Motivation

• XML is able to represent a mix of structured and text information.

• XML applications: digital libraries, content management.

• XML repositories: IEEE INEX collection, LexisNexis, the Library of Congress collection.

Page 4: XML Full-Text Search:  Challenges and Opportunities

XML in Library of Congress
http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml

<bill bill-stage="Introduced-in-House">
  <congress>109th CONGRESS</congress>
  <session>1st Session</session>
  <legis-num>H. R. 2739</legis-num>
  <current-chamber>IN THE HOUSE OF REPRESENTATIVES</current-chamber>
  <action>
    <action-date date="20050526">May 26, 2005</action-date>
    <action-desc><sponsor name-id="T000266">Mr. Tierney</sponsor> (for himself, <cosponsor name-id="M001143">Ms. McCollum of Minnesota</cosponsor>, <cosponsor name-id="M000725">Mr. George Miller of California</cosponsor>) introduced the following bill; which was referred to the <committee-name committee-id="HED00">Committee on Education and the Workforce</committee-name></action-desc>
  </action>
  …

Page 5: XML Full-Text Search:  Challenges and Opportunities

THOMAS: Library of Congress

Page 6: XML Full-Text Search:  Challenges and Opportunities

INEX Data

<article>
  <fno>K0271</fno>
  <doi>10.1041/K0271s-2004</doi>
  <fm>
    <hdr>
      <hdr1><ti>IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING</ti>
        <crt> <issn>1041-4347</issn>/04/$20.00 &copy; 2004 IEEE Published by the IEEE Computer Society</crt></hdr1>
      <hdr2><obi><volno>Vol. 16</volno>, <issno>No. 2</issno></obi>
        <pdt><mo>FEBRUARY</mo><yr>2004</yr></pdt>
        <pp>pp. 271-288</pp></hdr2>
    </hdr>
    <tig><atl>A Graph-Based Approach for Timing Analysis and Refinement of OPS5 Knowledge-Based Systems</atl><pn>pp. 271-288</pn><ref rid="K02711aff" type="aff">*</ref></tig>
    <au sequence="first"><fnm>Albert Mo Kim</fnm><snm> <ref aid="K0271a1" type="prb">Cheng</ref></snm><role>Senior Member</role><aff><onm>IEEE</onm></aff></au>
    <au sequence="additional"><fnm>Hsiu-yen</fnm><snm> Tsai</snm></au>
    <abs><p><b>Abstract</b>&mdash;This paper examines the problem of predicting the timing behavior of knowledge-based systems for real-…

Page 7: XML Full-Text Search:  Challenges and Opportunities

Example INEX Query

<inex_topic topic_id="275" query_type="CAS">
  <castitle>//article[about(.//abs, "data mining")]//sec[about(., "frequent itemsets")]</castitle>
  <description>sections about frequent itemsets from articles with abstract about data mining</description>
  <narrative>To be relevant, a component has to be a section about "frequent itemsets". For example, it could be about algorithms for finding frequent itemsets, or uses of frequent itemsets to generate rules. Also, the article must have an abstract about "data mining". I need this information for a paper that I am writing. It is a survey of different algorithms for finding frequent itemsets. The paper will also have a section on why we would want to find frequent itemsets.</narrative>
</inex_topic>

Page 8: XML Full-Text Search:  Challenges and Opportunities

Challenges in XML FT Search

• Searching over Semi-Structured Data
  – Users may specify a search context and return context.

• Expressive Power and Extensibility
  – Users should be able to express complex full-text searches and combine them with structural searches.

• Scores and Ranking
  – Users may specify a scoring condition, possibly over both full-text and structured predicates, and obtain top-k results based on query relevance scores.
  – The language should allow for an efficient implementation.

Page 9: XML Full-Text Search:  Challenges and Opportunities

XML FT Search Definition

• Context expression: XML elements searched:
  – pre-defined XML nodes.
  – XPath/XQuery queries.

• Return expression: XML fragments returned:
  – pre-defined meaningful XML fragments.
  – XPath/XQuery to build answers.

• Search expression: FT search conditions:
  – Boolean keyword search.
  – proximity distance, scoping, thesaurus, stop words, stemming.

• Score expression:
  – system-defined scoring function.
  – user-defined scoring function.
  – query-dependent keyword weights.

Page 10: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation

• Full-Text Search Languages

• Scoring

• Query Processing

• Open Issues

Page 11: XML Full-Text Search:  Challenges and Opportunities

Four Classes of Languages

• Keyword search (INEX Content-Only Queries)
  – “book xml”

• Tag + Keyword search
  – book: xml

• Path Expression + Keyword search
  – /book[./title about “xml db”]

• XQuery + Complex full-text search
  – for $b in /book
    let score $s := $b ftcontains “xml” && “db” distance 5

Page 12: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues

Page 13: XML Full-Text Search:  Challenges and Opportunities

XRank [Guo et al., SIGMOD 2003]

<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work"> The XQL language … </subsection>
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …

Page 15: XML Full-Text Search:  Challenges and Opportunities

XIRQL [Fuhr & Großjohann, SIGIR 2001]

<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <em> The XQL language </em>
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …

Index Node

Page 16: XML Full-Text Search:  Challenges and Opportunities

Similar Notion of Results

• Nearest Concept Queries – [Schmidt et al., ICDE 2002]

• XKSearch – [Xu & Papakonstantinou, SIGMOD 2005]

Page 17: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues

Page 18: XML Full-Text Search:  Challenges and Opportunities

XSearch [Cohen et al., VLDB 2003]

<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
      …
    </paper>
    <paper id="2">
      <title> XML Indexing </title>
      …
    </paper>

Not a “meaningful” result

Page 19: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues

Page 20: XML Full-Text Search:  Challenges and Opportunities

XPath [W3C 2005]

• fn:contains($e, string) returns true iff $e contains string

//section[fn:contains(./title, “XML Indexing”)]

Page 21: XML Full-Text Search:  Challenges and Opportunities

XIRQL [Fuhr & Großjohann, SIGIR 2001]

• Weighted extension to XQL (precursor to XPath)

//section[0.6 · .//* $cw$ “XQL” + 0.4 · .//section $cw$ “syntax”]

Page 22: XML Full-Text Search:  Challenges and Opportunities

XXL [Theobald & Weikum, EDBT 2002]

• Introduces similarity operator ~

Select Z
From http://www.myzoos.edu/zoos.html
Where zoos.#.zoo As Z
  and Z.animals.(animal)?.specimen as A
  and A.species ~ “lion”
  and A.birthplace.#.country as B
  and A.region ~ B.content

Page 23: XML Full-Text Search:  Challenges and Opportunities

NEXI [Trotman & Sigurbjörnsson, INEX 2004]

• Narrowed Extended XPath I

• INEX Content-and-Structure (CAS) Queries

//article[about(.//title, apple) and about(.//sec, computer)]

Page 24: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues

Page 25: XML Full-Text Search:  Challenges and Opportunities

Schema-Free XQuery [Li, Yu, Jagadish, VLDB 2003]

• Meaningful least common ancestor (mlcas)

for $a in doc(“bib.xml”)//author,
    $b in doc(“bib.xml”)//title,
    $c in doc(“bib.xml”)//year
where $a/text() = “Mary” and exists mlcas($a,$b,$c)
return <result> {$b,$c} </result>

Page 26: XML Full-Text Search:  Challenges and Opportunities

XQuery Full-Text [W3C 2005]

• Two new XQuery constructs

1) FTContainsExpr
  • Expresses “Boolean” full-text search predicates
  • Seamlessly composes with other XQuery expressions

2) FTScoreClause
  • Extension to FLWOR expression
  • Can score FTContainsExpr and other expressions

Page 27: XML Full-Text Search:  Challenges and Opportunities

FTContainsExpr

//book ftcontains “Usability” && “testing” distance 5

//book[./content ftcontains “Usability” with stems]/title

//book ftcontains /article[author=“Dawkins”]/title

Page 28: XML Full-Text Search:  Challenges and Opportunities

FTScore Clause

FOR $v [SCORE $s]? IN [FUZZY] Expr
LET …
WHERE …
ORDER BY …
RETURN

Example

FOR $b SCORE $s in /pub/book[. ftcontains “Usability” && “testing”]
ORDER BY $s
RETURN <result score={$s}> $b </result>

In any order

Page 29: XML Full-Text Search:  Challenges and Opportunities

FTScore Clause

FOR $v [SCORE $s]? IN [FUZZY] Expr
LET …
WHERE …
ORDER BY …
RETURN

Example

FOR $b SCORE $s in /pub/book[. ftcontains “Usability” && “testing” and ./price < 10.00]
ORDER BY $s
RETURN $b

In any order

Page 30: XML Full-Text Search:  Challenges and Opportunities

FTScore Clause

FOR $v [SCORE $s]? IN [FUZZY] Expr
LET …
WHERE …
ORDER BY …
RETURN

Example

FOR $b SCORE $s in FUZZY /pub/book[. ftcontains “Usability” && “testing”]
ORDER BY $s
RETURN $b

In any order

Page 31: XML Full-Text Search:  Challenges and Opportunities

XQuery Full-Text Evolution

[Timeline figure, 2002 to 2005, in the order shown on the slide:]

• Quark Full-Text Language (Cornell)

• TeXQuery (Cornell, AT&T Labs)

• IBM, Microsoft, Oracle proposals

• XQuery Full-Text

• XQuery Full-Text (Second Draft)

Page 32: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation

• Full-Text Search Languages

• Scoring

• Query Processing

• Open Issues

Page 33: XML Full-Text Search:  Challenges and Opportunities

Full-Text Scoring

• Score value should reflect relevance of answer to user query. Higher scores imply a higher degree of relevance.

• Queries return document fragments. Granularity of returned results affects scoring.

• For queries containing conditions on structure, structural conditions may affect scoring.

• Existing proposals extend common scoring methods: probabilistic or vector-based similarity.

Page 34: XML Full-Text Search:  Challenges and Opportunities

Granularity of Results

• Keyword queries
  – compute possibly different scores for LCAs.

• Tag + Keyword queries
  – compute scores based on tags and keywords.

• Path Expression + Keyword queries
  – compute scores based on paths and keywords.

• XQuery + Complex full-text queries
  – compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).

Page 35: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
• Scoring
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Query Processing
• Open Issues

Page 36: XML Full-Text Search:  Challenges and Opportunities

Granularity of Results

• Document as hierarchical structure of elements as opposed to flat document.
  – XXL [Theobald & Weikum, EDBT 2002]
  – XIRQL [Fuhr & Großjohann, SIGIR 2001]
  – XRANK [Guo et al., SIGMOD 2003]

• Propagate keyword weights along document structure.

Page 37: XML Full-Text Search:  Challenges and Opportunities

XML Data Model

[Figure: the workshop document drawn as a tree.]

<workshop>
  date: 28 July …
  <title>: XML and …
  <editors>: David Carmel …
  <proceedings>
    <paper>
      <title>: XQL and …
      <author>: Ricardo …
      …
    <paper> …

Legend: containment edge, hyperlink edge

Page 38: XML Full-Text Search:  Challenges and Opportunities

XXL [Theobald & Weikum, EDBT 2002]

• Compute similar terms with relevance score r1 using an ontology.

• Compute tf*idf of each term for a given element content with relevance score r2.

• Relevance of an element content for a term is r1*r2.

• r1 and r2 are computed as a weighted distance in an ontology graph.

• Probabilities of conjunctions multiplied (independence assumption) along elements of same path to compute path score.

Page 39: XML Full-Text Search:  Challenges and Opportunities

Probabilistic Scoring
XIRQL [Fuhr & Großjohann, SIGIR 2001]

• Extension of XPath.

• Weighting and ranking:
  – weighting of query terms:
    • P(wsum((0.6,a), (0.4,b))) = 0.6 · P(a) + 0.4 · P(b)
  – probabilistic interpretation of Boolean connectors:
    • P(a && b) = P(a) · P(b)

Page 40: XML Full-Text Search:  Challenges and Opportunities

XIRQL Example

• Query:
  – “Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago”

• Data:
  – “Ernst Olbrich, Darmstadt, 1899”

• Weights and ranking:
  – P(Olbrich p Ulbrich)=0.8 (phonetic similarity)
  – P(1899 n 1903)=0.9 (numeric similarity)
  – P(Darmstadt g Frankfurt)=0.7 (geographic distance)
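
The two XIRQL combination rules can be tried on the three match probabilities of this example. The following is a minimal Python sketch, not XIRQL itself: the function names `p_and` and `p_wsum` and the dictionary keys are made up for illustration, and treating the whole query as a conjunction of the three vague predicates is an assumption.

```python
from math import prod

def p_and(*probs):
    """Conjunction under the independence assumption: P(a && b) = P(a) * P(b)."""
    return prod(probs)

def p_wsum(weighted):
    """Weighted sum over (weight, probability) pairs, as in wsum((0.6,a), (0.4,b))."""
    return sum(w * p for w, p in weighted)

# Match probabilities from the example: phonetic, numeric, geographic.
match = {"name": 0.8, "year": 0.9, "place": 0.7}

# Assumed reading: the query is the conjunction of the three predicates.
conjunction_score = p_and(*match.values())

# Alternative: weight the name match at 0.6 and the place match at 0.4.
weighted_score = p_wsum([(0.6, match["name"]), (0.4, match["place"])])
```

Under these assumptions the conjunction evaluates to roughly 0.504 and the weighted sum to roughly 0.76, so the imperfect match "Ernst Olbrich, Darmstadt, 1899" is still returned, only with a reduced score.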

Page 41: XML Full-Text Search:  Challenges and Opportunities

PageRank [Brin & Page 1998]

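$$p(v) = \frac{1-d}{N_d} + d \times \sum_{(u,v) \in HE} \frac{p(u)}{N_h(u)}$$

HE: set of hyperlink edges; N_d: number of documents; N_h(u): number of hyperlinks leaving u

d: Probability of following hyperlink

1-d: Probability of random jump

[Figure: a node w with three outgoing hyperlink edges, each labeled d/3.]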
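
The random-surfer model above can be evaluated by repeated substitution. Below is a small Python sketch under stated assumptions: the three-document graph, the damping value d = 0.85, and the fixed iteration count are illustrative choices, and every document is assumed to have at least one outgoing hyperlink and to appear as a key of `links`.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each document to the list of documents it links to."""
    docs = list(links)
    n = len(docs)
    p = {v: 1.0 / n for v in docs}
    for _ in range(iterations):
        # Random jump: probability (1 - d), spread evenly over all documents.
        nxt = {v: (1.0 - d) / n for v in docs}
        # Following a hyperlink: probability d, split over the source's out-links.
        for u, outs in links.items():
            for v in outs:
                nxt[v] += d * p[u] / len(outs)
        p = nxt
    return p

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this toy graph, document c is linked from both a and b while b is linked only from a, so c ends up with the higher score.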

Page 42: XML Full-Text Search:  Challenges and Opportunities

ElemRank [Guo et al. SIGMOD 2003]

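$$e(v) = \frac{1 - d_1 - d_2 - d_3}{N_d \times N_{de}(v)} + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)} + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)} + d_3 \sum_{(u,v) \in CE^{-1}} e(u)$$

HE: hyperlink edges; CE: containment edges; N_h(u): number of hyperlinks leaving u; N_c(u): number of subelements of u; N_d: number of documents; N_de(v): number of elements in the document containing v

d1: Probability of following hyperlink

d2: Probability of visiting a subelement

d3: Probability of visiting parent

1-d1-d2-d3: Probability of random jump

[Figure: a node w with three outgoing hyperlink edges labeled d1/3, two containment edges to subelements labeled d2/2, and one edge to its parent labeled d3.]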
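
A rough Python sketch of this element-level random walk follows. It assumes the ElemRank recurrence as published for XRank, e(v) = (1-d1-d2-d3)/(N_d × N_de(v)) + d1 Σ e(u)/N_h(u) over hyperlinks into v + d2 Σ e(u)/N_c(u) over v's parent + d3 Σ e(u) over v's children. The element names, the parameter values, and the fixed iteration count are illustrative assumptions.

```python
def elemrank(doc_of, children, hyperlinks, d1=0.35, d2=0.25, d3=0.25, iterations=50):
    """doc_of: element -> document id; children: parent -> list of subelements;
    hyperlinks: element -> list of hyperlinked elements."""
    elems = list(doc_of)
    n_docs = len(set(doc_of.values()))
    elems_in_doc = {}
    for v in elems:
        elems_in_doc[doc_of[v]] = elems_in_doc.get(doc_of[v], 0) + 1
    e = {v: 1.0 / len(elems) for v in elems}
    for _ in range(iterations):
        # Random jump to v: pick a document, then an element inside it.
        nxt = {v: (1 - d1 - d2 - d3) / (n_docs * elems_in_doc[doc_of[v]]) for v in elems}
        for u, outs in hyperlinks.items():
            for v in outs:
                nxt[v] += d1 * e[u] / len(outs)      # follow a hyperlink
        for u, kids in children.items():
            for v in kids:
                nxt[v] += d2 * e[u] / len(kids)      # parent visits a subelement
                nxt[u] += d3 * e[v]                  # subelement visits its parent
        e = nxt
    return e

# One document: a root with two sections, where sec2 hyperlinks to sec1.
doc_of = {"root": "d1", "sec1": "d1", "sec2": "d1"}
scores = elemrank(doc_of, {"root": ["sec1", "sec2"]}, {"sec2": ["sec1"]})
```

The two sections are symmetric except for the hyperlink, so sec1 scores higher than sec2, and the root collects score from both of its subelements through the d3 term.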

Page 43: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
• Scoring
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Query Processing
• Open Issues

Page 44: XML Full-Text Search:  Challenges and Opportunities

XSearch [Cohen et al., VLDB 2003]

• tf*ilf to compute weight of keyword for a leaf element.

• A vector is associated with each non-leaf element.

• sim(Q,N): sum of the cosine distances between the vectors associated with nodes in N and vectors associated with terms matched in Q.

$$\frac{sim(Q,N)}{tsize(N)} \times \bigl(1 + anc\_des(N)\bigr)$$

Page 45: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
• Scoring
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Query Processing
• Open Issues

Page 46: XML Full-Text Search:  Challenges and Opportunities

Vector-based Scoring
JuruXML [Mass et al INEX 2002]

• Transform query into (term,path) conditions:
  article/bm/bib/bibl/bb[about(., hypercube mesh torus nonnumerical database)]

• (term,path)-pairs:
  hypercube, article/bm/bib/bibl/bb
  mesh, article/bm/bib/bibl/bb
  torus, article/bm/bib/bibl/bb
  nonnumerical, article/bm/bib/bibl/bb
  database, article/bm/bib/bibl/bb

• Modified cosine similarity as retrieval function for vague matching of path conditions.

Page 47: XML Full-Text Search:  Challenges and Opportunities

JuruXML Vague Path Matching

• Modified vector-based cosine similarity

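$$\rho(Q, D) = \frac{1}{|Q| \, |D|} \sum_{(t_i, c_i^Q) \in Q} \; \sum_{(t_i, c_j^D) \in D} w_Q(t_i, c_i^Q) \, w_D(t_i, c_j^D) \, cr(c_i^Q, c_j^D)$$

$$cr(c_i^Q, c_j^D) = \begin{cases} \dfrac{1 + |c_i^Q|}{1 + |c_j^D|} & \text{if } c_i^Q \text{ is a subsequence of } c_j^D \\ 0 & \text{otherwise} \end{cases}$$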

Example of length normalization: cr (article/bibl, article/bm/bib/bibl/bb) = 3/6 = 0.5
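
The length-normalization example can be checked with a few lines of Python. This is a sketch that assumes cr(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) when the query path is a subsequence of the document path and 0 otherwise, with path length counted in tags. The helper name `is_subsequence` is made up for illustration.

```python
def is_subsequence(short, long):
    """True if the tags of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(tag in remaining for tag in short)

def cr(query_path, doc_path):
    """Context resemblance between a query path and a document path."""
    q = query_path.split("/")
    d = doc_path.split("/")
    if is_subsequence(q, d):
        return (1 + len(q)) / (1 + len(d))
    return 0.0

value = cr("article/bibl", "article/bm/bib/bibl/bb")
no_match = cr("article/sec", "article/bm/bib/bibl/bb")
```

Here `value` is (1 + 2) / (1 + 5) = 0.5, matching the slide, while `no_match` is 0 because `sec` does not occur in the document path.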

Page 48: XML Full-Text Search:  Challenges and Opportunities

Query Relaxation on Structure

• Schlieder, EDBT 2002

• Delobel & Rousset, 2002

• Amer-Yahia et al, VLDB 2005

Page 49: XML Full-Text Search:  Challenges and Opportunities

XML Query Relaxation [Amer-Yahia et al EDBT 2002]

FlexPath [Amer-Yahia et al SIGMOD 2004]

• Tree pattern relaxations:
  – Leaf node deletion
  – Edge generalization
  – Subtree promotion

[Figure: a query tree pattern over book, info, author (Dickens), and edition (paperback), shown with relaxed variants (for example with edition marked optional) and with data trees whose authors are “C. Dickens” and “Charles Dickens”.]

Page 50: XML Full-Text Search:  Challenges and Opportunities

Adaptation of tf.idf to XML
Whirlpool [Marian et al ICDE 2005]

Each information retrieval notion and its XML counterpart:

• Document collection (information retrieval) → XML document

• Document → XML node (result is a subtree rooted at a returned node with a given tag and satisfying structural predicates in the query)

• Keyword(s) → Tree pattern

• idf (inverse document frequency) is a function of the fraction of documents that contain the keyword(s) → idf is a function of the fraction of returned nodes that match the query tree pattern

• tf (term frequency) is a function of the number of occurrences of the keyword in the document → tf is a function of the number of ways the query tree pattern matches the returned node

Page 51: XML Full-Text Search:  Challenges and Opportunities

A Family of XML Scoring Methods [Amer-Yahia et al VLDB 2005]

• Twig scoring
  – High quality
  – Expensive computation

• Path scoring

• Binary scoring
  – Low quality
  – Fast computation

[Figure: a query tree pattern over book, info, author (Dickens), and edition (paperback), scored either as one whole twig, as a sum (+) of its paths, or as a sum (+) of its binary predicates.]

Page 52: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
• Scoring
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Query Processing
• Open Issues

Page 53: XML Full-Text Search:  Challenges and Opportunities

XIRQL + Relaxation

• XIRQL proposes vague predicates, but it is not clear how to combine them with all of XQuery.

• How to relax all of XQuery, including structured and scalar predicates, remains an open issue.

Page 54: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation

• Full-Text Search Languages

• Scoring

• Query Processing

• Open Issues

Page 55: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Open Issues

Page 56: XML Full-Text Search:  Challenges and Opportunities

Main Issue

• Given: Query keywords

• Compute: Least Common Ancestors (LCAs) that contain query keywords, in ranked order

Page 57: XML Full-Text Search:  Challenges and Opportunities

Naïve Method

Naïve inverted lists:
Ricardo: 1 ; 5 ; 6 ; 8
XQL: 1 ; 5 ; 6 ; 7

Problems:
1. Space Overhead
2. Spurious Results

Main issue: Decouples representation of ancestors and descendants

[Figure: the workshop document tree with numbered elements.]

1 <workshop>
  2 date: 28 July …
  3 <title>: XML and …
  4 <editors>: David Carmel …
  5 <proceedings>
    6 <paper>
      7 <title>: XQL and …
      8 <author>: Ricardo …
      …
    <paper> …

Page 58: XML Full-Text Search:  Challenges and Opportunities

Dewey Encoding of IDs [1850s]

0 <workshop>
  0.0 date: 28 July …
  0.1 <title>: XML and …
  0.2 <editors>: David Carmel …
  0.3 <proceedings>
    0.3.0 <paper>
      0.3.0.0 <title>: XQL and …
      0.3.0.1 <author>: Ricardo …
      …
    0.3.1 <paper> …
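
Because a Dewey ID spells out the path from the root, the ID of the lowest common ancestor of two elements is the longest common prefix of their IDs, taken component by component. A minimal Python sketch (the function name `common_prefix` is chosen here for illustration):

```python
def common_prefix(a, b):
    """Longest common prefix of two Dewey IDs, compared per component."""
    shared = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        shared.append(x)
    return ".".join(shared)

# <title> and <author> of the first paper share the <paper> element 0.3.0.
paper = common_prefix("0.3.0.0", "0.3.0.1")

# The workshop <title> and the paper <title> only share the root.
root = common_prefix("0.1", "0.3.0.0")
```

Comparing whole components rather than characters matters once sibling numbers reach two digits: `0.1` is not an ancestor of `0.10`.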

Page 59: XML Full-Text Search:  Challenges and Opportunities

XRank: Dewey Inverted List (DIL)

Inverted list entries: Dewey Id, Score, Position List

XQL (sorted by Dewey Id):
  5.0.3.0.0   85   32
  8.0.3.8.3   38   89, 91
  …

Ricardo (sorted by Dewey Id):
  5.0.3.0.1   82   38
  8.2.1.4.2   99   52
  …

Store IDs of elements that directly contain keyword
- Avoids space overhead

Page 60: XML Full-Text Search:  Challenges and Opportunities

DIL: Query Processing

• Merge query keyword inverted lists in Dewey ID order
  – Entries with common prefixes are processed together

• Compute longest common prefix of Dewey IDs during the merge
  – Longest common prefix ensures most specific results
  – Also suppresses spurious results

• Keep top-m results seen so far in output heap
  – Calculate rank using two-dimensional proximity metric
  – Output contents of output heap after scanning inverted lists

• Algorithm works in a single scan over inverted lists
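
To make the notion of "most specific results" concrete, here is a brute-force Python reference for the result set, not the single-scan merge itself and without ranking. It assumes a result is an element whose subtree contains every query keyword and that has no descendant with the same property. The Dewey IDs are taken from the DIL example, and the function names are illustrative.

```python
from collections import defaultdict

def prefixes(dewey):
    """All ancestors-or-self of a Dewey ID, from the root downward."""
    parts = dewey.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def most_specific_results(inverted_lists):
    """Deepest elements whose subtree contains every keyword."""
    seen = defaultdict(set)  # element ID -> keywords found at or below it
    for keyword, ids in inverted_lists.items():
        for dewey in ids:
            for p in prefixes(dewey):
                seen[p].add(keyword)
    full = {p for p, kws in seen.items() if len(kws) == len(inverted_lists)}
    # Drop every element that has a descendant which also contains all keywords.
    return sorted(p for p in full
                  if not any(q.startswith(p + ".") for q in full))

lists = {
    "XQL": ["5.0.3.0.0", "8.0.3.8.3"],
    "Ricardo": ["5.0.3.0.1", "8.2.1.4.2"],
}
results = most_specific_results(lists)
```

For these lists the results are `5.0.3.0` (both keywords sit directly under it) and `8` (the keywords occur in different branches of document 8). The merge described above reaches the same elements by computing longest common prefixes during one pass over the Dewey-sorted lists, without materializing the ancestors of every entry.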

Page 61: XML Full-Text Search:  Challenges and Opportunities

XRank: Ranked Dewey Inverted List (RDIL)

[Figure: for each keyword (XQL, other keywords …), an inverted list sorted by score, with a B+-tree on Dewey Id over the entries of that list.]

Page 62: XML Full-Text Search:  Challenges and Opportunities

RDIL: Algorithm

• An element may be ranked highly in one list and low in another list
  – B+-tree helps search for low ranked element

• When to stop scanning inverted lists?
  – Based on Threshold Algorithm [Fagin et al., 2002], which periodically calculates a threshold
  – Can stop if we have sufficient results above the threshold
  – Extension to most specific results
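
The stopping rule can be illustrated with a plain version of the Threshold Algorithm over score-sorted lists. The Python sketch below is a simplification under stated assumptions: the aggregate score is the sum of per-keyword scores, a missing entry counts as 0, a dictionary lookup stands in for the B+-tree on Dewey Id, and the extension to most specific results is left out. The element names and scores are invented.

```python
import heapq

def threshold_topk(lists, k):
    """lists: one list per keyword of (element, score), sorted by score descending."""
    lookup = [dict(lst) for lst in lists]  # random access, standing in for the B+-tree

    def total(elem):
        return sum(table.get(elem, 0.0) for table in lookup)

    best = {}  # element -> aggregate score, for every element seen so far
    for depth in range(max(len(lst) for lst in lists)):
        last_seen = []
        for lst in lists:
            if depth < len(lst):
                elem, score = lst[depth]       # sorted access
                last_seen.append(score)
                best[elem] = total(elem)       # random access into the other lists
            else:
                last_seen.append(0.0)          # list exhausted
        # No unseen element can score higher than the sum of the last seen scores.
        threshold = sum(last_seen)
        top = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, best.items(), key=lambda kv: kv[1])

xql = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
ricardo = [("b", 0.7), ("c", 0.6), ("a", 0.2)]
top = threshold_topk([xql, ricardo], 1)
```

With these lists the scan stops after two rounds: element b has aggregate score 1.5, and the threshold has dropped to 0.8 + 0.6 = 1.4, so no unseen element can beat it. The early stop is only sound because the sum is monotone in the per-list scores.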

Page 63: XML Full-Text Search:  Challenges and Opportunities

RDIL: Query Processing

[Figure: the Ricardo and XQL inverted lists, each with a B+-tree on Dewey Id, plus an output heap and a temp heap. Labels shown: entry P: 9.0.4.2.0; B+-tree lookup of 9.0.4.1.2; Rank(9.0.4); threshold = Score(P) + Max-Score; entry R; threshold = Score(P) + Score(R). Dewey IDs shown in the lists: 8.2.1.4.2, 9.0.4.1.2, 9.0.4.2.0, 9.0.5.6, 10.8.3.]

Page 64: XML Full-Text Search:  Challenges and Opportunities

ID Order vs. Rank Order

• Approaches that combine benefits

• Long ID inverted list, short score inverted list
  – HDIL (Guo et al., SIGMOD 2003)

• Chunk inverted list based on score, organize by ID within chunk
  – FlexPath (Amer-Yahia et al., SIGMOD 2004)
  – SVR (Guo et al., ICDE 2005)

Page 65: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Open Issues

Page 66: XML Full-Text Search:  Challenges and Opportunities

XSearch Technique

• Given: An interconnection relationship R between nodes (semantic relationship)
  – R is reflexive and symmetric

• Node interconnection index
  – Given two nodes n and n’ in a document d, find if (n,n’) are in R*

• Use dynamic programming to compute closure
  – Online vs. offline

Page 67: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Open Issues

Page 68: XML Full-Text Search:  Challenges and Opportunities

XXL Indexing

• Element Path Index (EPI)
  – Evaluates simple path expressions

• Element Content Index (ECI)
  – Traditional inverted list (but replicates nested elements)

• Ontology Index (OI)
  – Lookup similar concepts (for evaluating ~e)
  – Returned in ranked order

Page 69: XML Full-Text Search:  Challenges and Opportunities

Myaeng et al. [SIGIR 1994]

Inverted list entry for keyword XQL:
  Document ID: 5
  Element ID: 85
  (Element Tag, Probability) pairs: (act, 0.3), (play, 0.2), (plays, 0.1)

Page 70: XML Full-Text Search:  Challenges and Opportunities

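Integrating Structure and IL [Kaushik et al., SIGMOD 2004]

[Figure: a structure index over the tags book, title, info, edition, and author, with index node IDs 1 to 5, next to a B+ tree over inverted list entries.]

Inverted list entry for keyword XQL:
  Document ID: 5
  Start ID: 85
  End ID: 99
  Depth: 3
  Index ID: 5
  Score: 0.9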

Page 71: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
  – Simple Keyword Search
  – Tags + Keyword Search
  – Path Expressions + Keyword Search
  – XQuery + Complex Full-Text Search
• Open Issues

Page 72: XML Full-Text Search:  Challenges and Opportunities

Scoring Functions Critical for Top-k Query Processing

• Top-k answer quality depends on scoring function.

• Efficient top-k query processing requires scoring function to be:
  – Monotone.
  – Fast to compute.

Page 73: XML Full-Text Search:  Challenges and Opportunities

Structural Join Relaxation

//book[./info[./author ftcontains “Dickens”][./edition ftcontains “paperback”]]

Query predicates (pc: parent-child, ad: ancestor-descendant):
  pc(book,info)
  pc(info,author)
  pc(info,edition)
  contains(author,“Dickens”)
  contains(edition,“paperback”)

Relaxed predicates:
  pc(book,info) or ad(book,info)
  pc(info,author)
  pc(info,edition) or ad(book,edition)
  contains(author,“Dickens”)
  contains(edition,“paperback”)

[Figure: the query tree pattern (book, info, author “Dickens”, edition “paperback”) before and after relaxation.]
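
The effect of the relaxed predicates can be seen on a small data tree. The Python sketch below assumes elements are identified by Dewey IDs, so that pc and ad reduce to prefix tests. The IDs are invented, and the data tree deliberately places edition directly under book rather than under info.

```python
def pc(parent_id, child_id):
    """Parent-child: child_id is parent_id plus exactly one more component."""
    head, sep, _ = child_id.rpartition(".")
    return sep == "." and head == parent_id

def ad(anc_id, desc_id):
    """Ancestor-descendant: anc_id is a proper prefix of desc_id."""
    return desc_id.startswith(anc_id + ".")

# Hypothetical data: book(0) has children info(0.0) and edition(0.1); author is 0.0.0.
book, info, author, edition = "0", "0.0", "0.0.0", "0.1"

# Original query predicates: edition must be a child of info.
strict = pc(book, info) and pc(info, author) and pc(info, edition)

# Relaxed predicates from the slide.
relaxed = ((pc(book, info) or ad(book, info))
           and pc(info, author)
           and (pc(info, edition) or ad(book, edition)))
```

The strict pattern rejects this book because edition is not under info, while the relaxed pattern accepts it through ad(book,edition). A scoring function would then rank such a relaxed match below an exact one.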

Page 74: XML Full-Text Search:  Challenges and Opportunities

Quark/GalaTex Architecture

[Figure: architecture diagram with these components: .xml documents; Preprocessing & Inverted Lists Generation; inverted lists; Full-Text Query; XQFT Parser; Equivalent XQuery Query; Quark/Galax XQuery Engine; Full-Text Primitives (FTWord, FTWindow, FTTimes etc.); API on positions; evaluation.]

Page 75: XML Full-Text Search:  Challenges and Opportunities

Outline

• Motivation

• Full-Text Search Languages

• Scoring

• Query Processing

• Open Issues

Page 76: XML Full-Text Search:  Challenges and Opportunities

System Architecture

XQuery Engine IR Engine

Integration Layer

Page 77: XML Full-Text Search:  Challenges and Opportunities

System Architecture

XQuery + IR Engine

Quark/GalaTex use this architecture

Page 78: XML Full-Text Search:  Challenges and Opportunities

Structural Relaxation

FOR $b SCORE $s in FUZZY

/pub/book[. ftcontains “Usability” with stems]

ORDER BY $s

RETURN $b

Page 79: XML Full-Text Search:  Challenges and Opportunities

Search Over Views

Data Source 1:
<books> <book> … </book> <book> … </book> … </books>

Data Source 2:
<reviews> <review> … </review> <review> … </review> … </reviews>

Integrated View:
<book> <reviews> … </reviews> </book> …

Page 80: XML Full-Text Search:  Challenges and Opportunities

Other Open Issues

• Experimental evaluation of scoring functions and ranking algorithms for XML (INEX).

• Search over a mix of HTML and XML.

• Joint scoring on full-text and scalar predicates.

• Score-aware algebra for XML for the joint optimization of queries on both structure and text.