XSEarch XML Search Engine Jonathan MAMOU October 2002


XSEarch
XML Search Engine

Jonathan MAMOU

October 2002


Motivation

XML
• Getting popular
• Allows meta-data to be embedded into documents
• Data-centric view: exchange format for structured data – meta-data
• Document-centric view: content – text, meta-data
• Querying data and meta-data

amazing.com
Buy our Classic Children’s books.
• One Fish Two Fish by John Meyer & Peter Smith. Costs Only: $7.95
• Goodnight Moon by Margaret Brown. Costs Only: $10.55
• Brown Bear by Bill Martin Jr. Costs Only: $6.00

<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
  ....
</bookinfo>

A query
Find titles and prices of books by ‘Meyer’ or ‘Smith’

IR Approach
• How to deal with tags?
  • Discard all tags
    • Simplicity
    • Loss of information (structure): lower retrieval performance
  • Keep tags as keywords
• How to write the query?
  • “Title price book author Meyer Smith”

IR Approach (cont’d)
• Can’t specify that Meyer and Smith are the authors
• Can’t specify that title, price and author belong to the same book
• Can’t specify the desired output (i.e., titles, prices)

Database approach
FOR $b IN document(“bib.xml”)//book
WHERE $b/author contains ‘Meyer’ OR $b/author contains ‘Smith’
RETURN <result>
         <title> $b/title </title>
         <price> $b/price </price>
       </result>

• Difficult for naive user
• Requires knowledge of document structure
• Dependent on document structure

Our Goal
• Combine IR and database techniques: tags + text
• Simple language
• Logical structure, not physical
• Require knowledge of tag names, not structure
• Queries should work even if structure changes
• Rank results


Framework

Tree Representation
bookinfo
  book
    title: Just Lost
    author: Mercy Meyer
    author: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
We need to find tuples of related title and price nodes.

Another Tree Representation
[Figure: tree rooted at bookinfo. Each author node has a name child (Dr. Meyer, M. Brown) and book children; the books are Goodnight Moon (title only), One Fish Two Fish (title, price $12.50) and Cat in the Hat (title, price $14.95).]
Similar document, but with a different hierarchical structure from the previous one.
We need to find tuples of related title, author and price nodes.

Interconnection
Consider a title node and a price node.
bookinfo
  book
    title: Just Lost
    name: Mercy Meyer
    name: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
[Figure: two nodes are circled and their lowest common ancestor is marked.]
Intuition: the nodes belong to different book entities.

Interconnection (cont’d)
bookinfo
  book
    title: Just Lost
    name: Mercy Meyer
    name: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
[Figure: two nodes are circled and their lowest common ancestor is marked.]
Intuition: the nodes belong to the same book entity.

Interconnection (cont’d)
bookinfo
  book
    title: Just Lost
    name: Mercy Meyer
    name: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
Intuition: the nodes belong to the same book entity.

Relationship tree
• Nodes n1, n2
• n: their lowest common ancestor
• Tn: the subtree rooted at n
• The relationship tree of n1, n2 is the tree obtained by pruning from Tn all nodes other than n1, n2 that are not ancestors of n1 or n2
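The definition above amounts to keeping the path from n1 up to the lowest common ancestor and down to n2. A minimal Python sketch, assuming the tree is stored as a parent array (None for the root); the array layout, the node numbering and the function names are illustrative assumptions, not part of XSEarch:

```python
def ancestors_or_self(parent, v):
    """Return [v, parent(v), ..., root]."""
    chain = []
    while v is not None:
        chain.append(v)
        v = parent[v]
    return chain


def relationship_tree(parent, n1, n2):
    """Return (lca, nodes) where nodes is the node set of the relationship tree."""
    up1 = ancestors_or_self(parent, n1)
    above_n2 = set(ancestors_or_self(parent, n2))
    # Lowest common ancestor: first node above n1 (or n1 itself) that is also above n2.
    lca = next(v for v in up1 if v in above_n2)
    nodes = set(up1[:up1.index(lca) + 1])
    v = n2
    while v != lca:          # add the branch from n2 up to the lca
        nodes.add(v)
        v = parent[v]
    return lca, nodes


# Hypothetical numbering of a bookinfo tree, in document order:
# 0 bookinfo, 1 book, 2 title, 3 author, 4 name, 5 author, 6 name,
# 7 price, 8 book, 9 title, 10 price
parent = [None, 0, 1, 1, 3, 1, 5, 1, 0, 8, 8]
same_book = relationship_tree(parent, 2, 7)     # title and price of book 1
other_book = relationship_tree(parent, 2, 10)   # title of book 1, price of book 8
```

For a title and price inside the same book the lowest common ancestor is that book node; across books it is bookinfo.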

Interconnection
We say that n1, n2 are interconnected if
• the relationship tree does not contain 2 distinct nodes with the same label, or
• the relationship tree contains exactly one pair of distinct nodes with the same label, and this pair consists of n1, n2
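The two conditions can be checked directly by counting labels on the relationship tree. A minimal Python sketch, assuming a parent array (None for the root) and a parallel label array; both arrays and all names are illustrative assumptions:

```python
from collections import Counter


def path_nodes(parent, n1, n2):
    """Node set of the relationship tree: the path n1 .. lca .. n2."""
    chain = []
    v = n1
    while v is not None:             # n1 and all its ancestors
        chain.append(v)
        v = parent[v]
    above_n1 = set(chain)
    branch = []
    v = n2
    while v not in above_n1:         # climb from n2 until the lca is reached
        branch.append(v)
        v = parent[v]
    return set(branch) | set(chain[:chain.index(v) + 1])


def interconnected(parent, labels, n1, n2):
    counts = Counter(labels[v] for v in path_nodes(parent, n1, n2))
    repeated = [label for label, c in counts.items() if c > 1]
    if not repeated:
        return True                  # no two distinct nodes share a label
    # Otherwise: exactly one same-label pair, and that pair is {n1, n2}.
    return (len(repeated) == 1 and counts[repeated[0]] == 2
            and n1 != n2 and labels[n1] == labels[n2] == repeated[0])


# Hypothetical tree: 0 bookinfo, 1 book, 2 title, 3 author, 4 name,
# 5 author, 6 name, 7 price, 8 book, 9 title, 10 price
parent = [None, 0, 1, 1, 3, 1, 5, 1, 0, 8, 8]
labels = ["bookinfo", "book", "title", "author", "name",
          "author", "name", "price", "book", "title", "price"]
```

With this data, a title and a price under the same book are interconnected, while a title and the price of a different book are not, because two book nodes lie on the path between them.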

All-Pairs Interconnection
A set of nodes is all-pairs interconnected if every pair of nodes in the set is interconnected

Star interconnection
bookinfo
  book
    title: Just Lost
    author
      name: Mercy Meyer
    author
      name: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
The 2 names are not interconnected


Star Interconnection (cont’d)

A set of nodes is star interconnected if all the nodes in the set are interconnected to the same node
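The difference between the two notions can be sketched in Python. The predicate `connected` is supplied by the caller, and the sketch assumes the common node (the centre of the star) is itself a member of the set; that reading and all names here are assumptions:

```python
from itertools import combinations


def all_pairs_interconnected(nodes, connected):
    """Every pair of nodes in the set is interconnected."""
    return all(connected(a, b) for a, b in combinations(nodes, 2))


def star_interconnected(nodes, connected):
    """Some node of the set is interconnected with every other node of the set."""
    return any(all(connected(centre, v) for v in nodes if v != centre)
               for centre in nodes)


# Hypothetical node ids: 2 = title, 4 and 6 = the two names, 7 = price.
# Every pair is interconnected except the two names.
ok_pairs = {frozenset(p) for p in [(2, 4), (2, 6), (2, 7), (4, 7), (6, 7)]}


def connected(a, b):
    return frozenset((a, b)) in ok_pairs


answer = [2, 4, 6, 7]
is_all_pairs = all_pairs_interconnected(answer, connected)
is_star = star_interconnected(answer, connected)
```

Here the set is a star around the title node but is not all-pairs interconnected, since the two names fail the pairwise test.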

Search terms, Search query
• Search Term (l,k)
  • l: label (context)
  • k: keyword
• Search Query AND:L1 OR:L2
  • L1, L2: lists of search terms
• Example: AND:(title,)(price,) OR:(author,Meyer)(author,Smith)
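As an illustration only, a query in this notation could be read into two lists of (label, keyword) terms. The sketch below assumes terms are written as (label,keyword) with a comma and that both the AND: and OR: parts are present; the parser and its names are hypothetical and not taken from XSEarch:

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class SearchTerm(NamedTuple):
    label: Optional[str]      # l: the context (tag name), or None if empty
    keyword: Optional[str]    # k: the keyword, or None if empty


# One "(label,keyword)" group; either side may be empty.
TERM = re.compile(r"\(\s*([^,()]*?)\s*,\s*([^,()]*?)\s*\)")


def parse_terms(text: str) -> List[SearchTerm]:
    return [SearchTerm(l or None, k or None) for l, k in TERM.findall(text)]


def parse_query(query: str) -> Tuple[List[SearchTerm], List[SearchTerm]]:
    """Split 'AND:L1 OR:L2' into the two term lists L1 and L2."""
    m = re.fullmatch(r"\s*AND:(.*?)\s*OR:(.*)", query)
    if m is None:
        raise ValueError("expected a query of the form AND:... OR:...")
    return parse_terms(m.group(1)), parse_terms(m.group(2))


and_terms, or_terms = parse_query(
    "AND:(title,)(price,) OR:(author,Meyer)(author,Smith)")
```

Terms with an empty keyword, such as (title,), constrain only the label of the matched node.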

Answer AND:N1 OR:N2
• N1, N2 are lists of nodes
• Matching between N1, N2 and L1, L2
• N1 and N2 are interconnected
• All all-pairs answers are star answers
• Maximal answer

Example
bookinfo
  book
    title: Just Lost
    author: Mercy Meyer
    author: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
(title,) (price,) (author,Meyer)
Find matchings of title, author and price to the nodes in the tree
[Figure: answer table with columns title, author, price, including null]

Computing answers
• All-pairs
  • Determining whether the set of answers is empty is NP-complete
  • If L1 is empty, computing the set of answers is polynomial in the size of input and output
• Star
  • Computing the set of answers is polynomial in the size of input and output

Ranking results
• Unstructured
  • Keyword weight (tfilf)
  • Tags weight
  • Result size
• Structured
  • Nodes distance
  • Ancestor-descendant

Keyword Weight
• Compute the weight of a keyword k within a given node n
• Variation of tfidf, one of the metrics of the Vector Space Model (classical model in IR)

Keyword Weight (cont’d)
• Term Frequency (tf): number of appearances of k within n, relative to the most frequent keyword in n
  tf(k,n) = occ(k,n) / max over k’ of occ(k’,n)
• Inverse Leaf Frequency (ilf): inverse frequency of k among all the leaves in the corpus
  ilf(k) = log(1 + N/Nk), where N is the number of leaves and Nk the number of leaves containing k
• W(k,n) = tf(k,n) * ilf(k)
• Normalized per leaf
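The weight can be sketched in a few lines of Python. The sketch assumes each leaf is a list of already tokenized words, uses the natural logarithm (the base is not stated here), and leaves out the per-leaf normalization; the function names are illustrative:

```python
import math
from collections import Counter


def tf(keyword, leaf_words):
    """occ(k, n) divided by the count of the most frequent word in leaf n."""
    counts = Counter(leaf_words)
    return counts[keyword] / max(counts.values()) if counts else 0.0


def ilf(keyword, leaves):
    """log(1 + N / Nk): N leaves in the corpus, Nk of them contain the keyword."""
    n_k = sum(1 for words in leaves if keyword in words)
    return math.log(1 + len(leaves) / n_k) if n_k else 0.0


def weight(keyword, leaf_words, leaves):
    return tf(keyword, leaf_words) * ilf(keyword, leaves)


# Toy corpus of three leaves (hypothetical tokenization).
leaves = [["one", "fish", "two", "fish"],
          ["goodnight", "moon"],
          ["brown", "bear"]]
w_fish = weight("fish", leaves[0], leaves)   # tf = 2/2, ilf = log(1 + 3/1)
w_one = weight("one", leaves[0], leaves)     # tf = 1/2, ilf = log(1 + 3/1)
```

A keyword that occurs in few leaves gets a larger ilf, so rare keywords contribute more to the ranking than common ones.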

Tag Weight
• Give weight to tags according to their importance
• E.g. give more weight to <title> than to <abstract>

Result Size
• Number of search terms appearing in the result (OR part)

Ranking-Structured
• Nodes distance
  • size of the relationship tree
• Ancestor-descendant relationship
  • “more” interconnected


System overview

XSEarch overview
[Figure: offline, the Indexer processes the XML corpus with logical hierarchy; online, a query is sent to Search, which returns results.]

Document Location array
• Generate a unique id, did, for each document
• Associate each did with the physical location of the corresponding document
• Logical structure of the corpus

Node Encoding Array
• Generate for each interior node an id, nid
• Node encoding
  • Defined recursively
    • Node encoding of its parent
    • Index of the node among its siblings
  • E.g.: 13.8.1.9
• Associate each nid with its node encoding
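A recursive encoding of this kind can be sketched in Python as a tuple of integers. The tuple representation, the function names, and the use of the encoding for ancestor and lowest-common-ancestor tests are assumptions made for illustration:

```python
def encode(parent_encoding, sibling_index):
    """Encoding of a node = encoding of its parent + its index among its siblings."""
    return parent_encoding + (sibling_index,)


def to_text(encoding):
    return ".".join(str(part) for part in encoding)


def is_ancestor(a, b):
    """a is a proper ancestor of b exactly when a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


def lowest_common_ancestor(a, b):
    """Longest common prefix of the two encodings."""
    common = []
    for x, y in zip(a, b):
        if x != y:
            break
        common.append(x)
    return tuple(common)


# Rebuild the example 13.8.1.9 step by step from an assumed top-level encoding (13,).
node = encode(encode(encode((13,), 8), 1), 9)
text = to_text(node)
```

With prefix-based encodings, structural questions about two nodes can be answered from the array alone, without walking the documents.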

Node Label Array
• Associate each nid with its label

Inverted Tag Index
• For each tag, keep
  • posting list: list of nodes labeled with this tag
  • weight
[Figure: tag -> Nid1, Nid2, Nid3]

Inverted Keyword Index
• For each kw, keep
  • posting list: list of leaves containing this keyword
  • weight of the kw within the leaf (tfilf)
[Figure: kw -> (Nid1,w1), (Nid2,w2), (Nid3,w3)]
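Both inverted indexes can be sketched as Python dictionaries of posting lists. The input shapes (a label array indexed by nid, and a mapping from leaf nid to keyword weights) and the function names are assumptions; per-tag weights are omitted:

```python
from collections import defaultdict


def build_tag_index(labels):
    """tag -> posting list of nids labeled with that tag (labels[nid] is the label)."""
    index = defaultdict(list)
    for nid, label in enumerate(labels):
        index[label].append(nid)
    return dict(index)


def build_keyword_index(leaf_weights):
    """keyword -> posting list of (nid, weight); leaf_weights maps nid -> {keyword: weight}."""
    index = defaultdict(list)
    for nid in sorted(leaf_weights):
        for keyword, w in leaf_weights[nid].items():
            index[keyword].append((nid, w))
    return dict(index)


# Hypothetical data: five nodes and the keyword weights of two leaves.
tag_index = build_tag_index(["bookinfo", "book", "title", "book", "title"])
keyword_index = build_keyword_index({
    2: {"just": 0.7, "lost": 0.7},
    4: {"brown": 0.5, "bear": 0.5, "lost": 0.1},
})
```

Posting lists are kept in increasing nid order here, which is a design choice of the sketch rather than something stated above.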

Node Interconnection Matrix
• Element ij contains:
  • 1, if ni and nj are interconnected
  • 0, otherwise
• n*n symmetric sparse matrix
• Dynamic programming

Alternative
• Hash set: keep only interconnected nodes
• Key: pair (ni, nj)
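A minimal Python sketch of the hash-set alternative. Since interconnection is symmetric, the sketch stores each pair once with the smaller id first; that normalization and the class name are assumptions:

```python
def pair_key(ni, nj):
    """Order-independent key for the pair (ni, nj)."""
    return (ni, nj) if ni <= nj else (nj, ni)


class InterconnectionSet:
    """Stores only the interconnected pairs; absent pairs are not interconnected."""

    def __init__(self):
        self._pairs = set()

    def add(self, ni, nj):
        self._pairs.add(pair_key(ni, nj))

    def __contains__(self, pair):
        return pair_key(*pair) in self._pairs

    def __len__(self):
        return len(self._pairs)


index = InterconnectionSet()
index.add(7, 2)      # price and title of the same book (hypothetical ids)
index.add(2, 7)      # same pair in the other order, stored only once
```

Compared with an n*n matrix, memory grows with the number of interconnected pairs rather than with the square of the number of nodes.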

Interconnection
• Let n be the number of nodes
• It is possible to determine whether n1 and n2 are interconnected in O(n) time
• It is possible to determine interconnection of all pairs in O(n^2)
• Offline/Online computation

Interconnection
// iChild: child of i on the path to j; iFather, jFather: parents of i and j
for (i=size-1; i>=0; i--)
  for (j=i+1; j<size; j++)
    if i ancestor of j
      connected(i,j) = connected(iChild,j) AND connected(i,jFather) AND
                       labelIChild != labelJ AND labelI != labelJFather
  for (j=i+1; j<size; j++)
    if i not ancestor of j
      connected(i,j) = connected(i,jFather) AND connected(iFather,j) AND
                       labelI != labelJFather AND labelIFather != labelJ
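The recurrence above can be tried out in Python. This sketch replaces the loop order with memoized recursion, so sub-results are computed on demand, and it adds explicit base cases for a node paired with itself or with its parent. The parent and label arrays, the base cases and all names are assumptions, and no claim is made about matching the O(n^2) bound:

```python
from functools import lru_cache


def all_pairs_interconnection(parent, labels):
    """Return the set of interconnected pairs (i, j) with i < j."""

    def is_ancestor(a, d):
        v = parent[d]
        while v is not None:
            if v == a:
                return True
            v = parent[v]
        return False

    def child_towards(a, d):
        while parent[d] != a:            # climb from d to the child of a
            d = parent[d]
        return d

    @lru_cache(maxsize=None)
    def connected(i, j):
        if is_ancestor(j, i):
            i, j = j, i                  # make i the ancestor when there is one
        if i == j or parent[j] == i:
            return True                  # one- or two-node relationship tree
        j_father = parent[j]
        if is_ancestor(i, j):
            i_child = child_towards(i, j)
            return (connected(i_child, j) and connected(i, j_father)
                    and labels[i_child] != labels[j]
                    and labels[i] != labels[j_father])
        i_father = parent[i]
        return (connected(i, j_father) and connected(i_father, j)
                and labels[i] != labels[j_father]
                and labels[i_father] != labels[j])

    size = len(parent)
    return {(i, j) for i in range(size) for j in range(i + 1, size)
            if connected(i, j)}


# Hypothetical tree: 0 bookinfo, 1 book, 2 title, 3 author, 4 name,
# 5 author, 6 name, 7 price, 8 book, 9 title, 10 price
parent = [None, 0, 1, 1, 3, 1, 5, 1, 0, 8, 8]
labels = ["bookinfo", "book", "title", "author", "name",
          "author", "name", "price", "book", "title", "price"]
pairs = all_pairs_interconnection(parent, labels)
```

The returned set has the same shape as the hash-set alternative: only interconnected pairs are kept, each stored once.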


Demo