XSEarch: XML Search Engine
Jonathan MAMOU
October 2002
Motivation
• XML is getting popular
• Allows meta-data to be embedded into documents
• Data-centric view: exchange format for structured data (meta-data)
• Document-centric view: content (text, meta-data)
• Querying data and meta-data
One Fish Two Fish by John Meyer & Peter Smith
Costs Only: $7.95
Goodnight Moon by Margaret Brown
Costs Only: $10.55
Brown Bear by Bill Martin Jr.
Costs Only: $6.00
Buy our Classic Children’s books.
amazing.com
<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
  ....
</bookinfo>
A query
Find titles and prices of books by 'Meyer' or 'Smith'
IR Approach
• How to deal with tags?
  • Discard all tags: simplicity, but loss of information (structure) and lower retrieval performance
  • Keep tags as keywords
• How to write the query? "Title price book author Meyer Smith"
IR Approach (cont’d)
• Can’t specify that Meyer and Smith are the authors
• Can’t specify that title, price and author belong to the same book
• Can’t specify the desired output (i.e., titles, prices)
Database approach
FOR $b IN document("bib.xml")//book
WHERE $b/author contains 'Meyer' OR $b/author contains 'Smith'
RETURN
  <result>
    <title> $b/title </title>
    <price> $b/price </price>
  </result>
• Difficult for naive users
• Requires knowledge of document structure
• Dependent on document structure
Our Goal
• Combine IR and database techniques: tags + text
• Simple language
• Logical structure, not physical
• Requires knowledge of tag names, not structure
• Queries should work even if structure changes
• Rank results
Framework
Tree Representation
bookinfo
  book
    title: Just Lost
    author: Mercy Meyer
    author: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
We need to find tuples of related title and price nodes.
Another Tree Representation
[Figure: tree rooted at bookinfo in which author nodes have name and book children. Names: Dr. Meyer, M. Brown. Books: Goodnight Moon (title only), One Fish Two Fish (price $12.50), Cat in the Hat (price $14.95).]
Similar document, but with a different hierarchical structure from the previous one.
We need to find tuples of related title, author and price nodes.
Interconnection
Consider a title node and a price node.
bookinfo
  book
    title: Just Lost
    name: Mercy Meyer
    name: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
[Figure annotation: the lowest common ancestor of the circled nodes]
Intuition: the nodes belong to different book entities.
Interconnection (cont’d)
[Figure: the same bookinfo tree, with a different pair of nodes circled and their lowest common ancestor marked]
Intuition: the nodes belong to the same book entity.
Interconnection (cont’d)
[Figure: the same bookinfo tree, with another pair of nodes circled]
Intuition: the nodes belong to the same book entity.
Relationship tree
• Nodes n1, n2
• n: their lowest common ancestor
• Tn: the subtree rooted at n
• The relationship tree of n1, n2 is the tree obtained by pruning from Tn all nodes, other than n1 and n2, that are not ancestors of n1 or n2
Interconnection
We say that n1, n2 are interconnected if
• the relationship tree does not contain 2 distinct nodes with the same label, or
• the relationship tree contains exactly one pair of distinct nodes with the same label, and this pair consists of n1 and n2
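The two definitions above can be tried out on the example tree with a short Python sketch. The dictionary encoding of the tree (node id mapped to label and parent id) and the function names are assumptions made for illustration, not part of XSEarch itself. Each check walks the two paths up to the lowest common ancestor, so it takes time linear in the number of nodes.

```python
from collections import Counter

# Hypothetical encoding of the example tree: node id -> (label, parent id).
TREE = {
    0: ("bookinfo", None),
    1: ("book", 0),
    2: ("title", 1),    # Just Lost
    3: ("author", 1),
    4: ("name", 3),     # Mercy Meyer
    5: ("author", 1),
    6: ("name", 5),     # Gina Meyer
    7: ("price", 1),    # $5.75
    8: ("book", 0),
    9: ("title", 8),    # Brown Bear
    10: ("price", 8),   # $13.95
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = TREE[node][1]
    return path


def relationship_tree(n1, n2):
    """Node set of the relationship tree: both paths up to the lowest common ancestor."""
    up1, up2 = path_to_root(n1), path_to_root(n2)
    ancestors_of_n2 = set(up2)
    lca = next(n for n in up1 if n in ancestors_of_n2)
    return set(up1[: up1.index(lca) + 1]) | set(up2[: up2.index(lca) + 1])


def interconnected(n1, n2):
    """Apply the two conditions of the interconnection definition."""
    counts = Counter(TREE[n][0] for n in relationship_tree(n1, n2))
    repeated = [label for label, count in counts.items() if count > 1]
    if not repeated:
        return True  # no two distinct nodes share a label
    # Exactly one same-label pair is allowed, and it must be n1 and n2 themselves.
    return (
        len(repeated) == 1
        and counts[repeated[0]] == 2
        and n1 != n2
        and TREE[n1][0] == TREE[n2][0] == repeated[0]
    )
```

With this encoding, `interconnected(2, 7)` (title and price of the same book) holds, while `interconnected(2, 10)` (title of one book, price of the other) does not, because the relationship tree then contains two book nodes.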
All-Pairs Interconnection
A set of nodes is all-pairs interconnected if every pair of nodes in the set is interconnected
Star interconnection
bookinfo
  book
    title: Just Lost
    author
      name: Mercy Meyer
    author
      name: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
The 2 names are not interconnected.
Star Interconnection (cont’d)
A set of nodes is star interconnected if all the nodes in the set are interconnected to the same node
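The difference between the all-pairs and the star definitions can be illustrated with a small Python sketch. It assumes that the shared node (the star center) is itself a member of the set, and it takes the pairwise relation as a function argument. The node names and the hard-coded `PAIRS` relation are hypothetical and mirror the first book of the example, where only the two names fail to be interconnected.

```python
from itertools import combinations


def all_pairs_interconnected(nodes, connected):
    """True if every pair of nodes in the set is interconnected."""
    return all(connected(a, b) for a, b in combinations(nodes, 2))


def star_interconnected(nodes, connected):
    """True if some node of the set (assumed star center) is interconnected with all others."""
    return any(
        all(connected(center, n) for n in nodes if n != center)
        for center in nodes
    )


# Hypothetical pairwise relation for title, two names and price of one book.
PAIRS = {
    frozenset(p)
    for p in [
        ("title", "name1"), ("title", "name2"), ("title", "price"),
        ("name1", "price"), ("name2", "price"),
    ]
}


def connected(a, b):
    return a == b or frozenset((a, b)) in PAIRS


book_nodes = ["title", "name1", "name2", "price"]
```

Here `book_nodes` is star interconnected (title or price can serve as the center) but not all-pairs interconnected, since the two names are not interconnected with each other.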
Search terms, Search query
• Search term (l,k): l is a label (context), k is a keyword
• Search query AND:L1 OR:L2, where L1, L2 are lists of search terms
  AND:(title,)(price,) OR:(author,Meyer)(author,Smith)
• Answer AND:N1 OR:N2
  • N1, N2 are lists of nodes
  • Matching between N1, N2 and L1, L2
  • N1 and N2 are interconnected
• All all-pairs answers are star answers
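As a rough illustration of the query syntax, the following Python sketch parses a query string into its AND and OR lists and tests a single term against a node. The class and function names are hypothetical. The matching rule (label equality plus case-insensitive word containment, with an empty label or keyword matching anything) is a simplifying assumption, not the system's documented semantics.

```python
import re
from typing import NamedTuple


class SearchTerm(NamedTuple):
    label: str    # l: tag name, "" if left empty
    keyword: str  # k: keyword, "" if left empty


class SearchQuery(NamedTuple):
    required: list  # L1, the AND part
    optional: list  # L2, the OR part


# One "(label,keyword)" group; either side may be empty.
TERM = re.compile(r"\(\s*([^,()]*?)\s*,\s*([^,()]*?)\s*\)")


def parse_terms(text):
    return [SearchTerm(label, keyword) for label, keyword in TERM.findall(text)]


def parse_query(text):
    """Split 'AND:... OR:...' into two lists of search terms."""
    and_part, _, or_part = text.partition("OR:")
    return SearchQuery(parse_terms(and_part), parse_terms(or_part))


def term_matches(term, node_label, node_text):
    """Assumed rule: label must be equal, keyword must occur as a word in the text."""
    label_ok = not term.label or term.label == node_label
    keyword_ok = (
        not term.keyword or term.keyword.lower() in node_text.lower().split()
    )
    return label_ok and keyword_ok


query = parse_query("AND:(title,)(price,) OR:(author,Meyer)(author,Smith)")
```

For the example query, `query.required` holds the title and price terms and `query.optional` holds the two author terms.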
Maximal answer
bookinfo
  book
    title: Just Lost
    author: Mercy Meyer
    author: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95
Example
(title,) (price,) (author,Meyer)
Find matchings of title, author and price to the nodes in the tree
[Figure: answer table with columns title, author, price; one entry is null]
Computing answers
• All-pairs
  • Determining whether the set of answers is empty is NP-complete
  • If L1 is empty, computing the set of answers is polynomial in the size of input and output
• Star
  • Computing the set of answers is polynomial in the size of input and output
Ranking results
• Unstructured
  • Keyword weight (tfilf)
  • Tag weight
  • Result size
• Structured
  • Node distance
  • Ancestor-descendant relationship
Keyword Weight
• Compute the weight of a keyword k within a given node n
• Variation of tfidf, one of the metrics of the Vector Space Model (a classical model in IR)
Keyword Weight (cont’d)
• Term frequency (tf): number of appearances of k within n, relative to the most frequent keyword k' in n
  tf(k,n) = occ(k,n) / max_k' occ(k',n)
• Inverse leaf frequency (ilf): inverse frequency of k among all the leaves in the corpus, where N is the number of leaves and Nk the number of leaves containing k
  ilf(k) = log(1 + N/Nk)
• W(k,n) = tf(k,n) * ilf(k)
• Normalized per leaf
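The weight formulas can be checked numerically with a few lines of Python. This is a minimal sketch: each leaf is modeled as a plain list of lowercase words, the natural logarithm is assumed for the log, and the final per-leaf normalization step is not shown. The sample leaves are taken from the example tree.

```python
import math
from collections import Counter


def tf(keyword, leaf_words):
    """occ(k, n) divided by the count of the most frequent word in the leaf."""
    counts = Counter(leaf_words)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword, leaves):
    """log(1 + N / Nk), with N leaves in total and Nk leaves containing the keyword."""
    n_total = len(leaves)
    n_with_keyword = sum(1 for words in leaves if keyword in words)
    if n_with_keyword == 0:
        return 0.0  # assumption: a keyword absent from the corpus gets weight 0
    return math.log(1 + n_total / n_with_keyword)


def weight(keyword, leaf_words, leaves):
    """W(k, n) = tf(k, n) * ilf(k), before any normalization."""
    return tf(keyword, leaf_words) * ilf(keyword, leaves)


# Leaves of the example tree, one word list per text node.
leaves = [
    ["just", "lost"],
    ["mercy", "meyer"],
    ["gina", "meyer"],
    ["brown", "bear"],
]
```

In this tiny corpus "meyer" occurs in 2 of 4 leaves, so its ilf is log(1 + 4/2) = log 3, whereas "lost" occurs in only one leaf and gets the larger value log 5.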
Tag Weight
• Give weight to tags according to their importance
• E.g., give more weight to <title> than to <abstract>
Result Size
• Number of search terms appearing in the result (OR part)
Ranking: Structured
• Node distance: size of the relationship tree
• Ancestor-descendant relationship: "more" interconnected
System overview
XSEarch overview
[Figure: offline, the indexer processes the XML corpus with its logical hierarchy; online, a search query is submitted to the search component, which returns results]
Document Location Array
• Generate a unique id, did, for each document
• Associate each did with the physical location of the corresponding document
• Logical structure of the corpus
Node Encoding Array
• Generate for each interior node an id, nid
• Node encoding, defined recursively:
  • the node encoding of its parent
  • followed by the index of the node among its siblings
  • e.g.: 13.8.1.9
• Associate each nid with its node encoding
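A recursive encoding of this kind can be sketched in Python with tuples of integers. The helper names are hypothetical, and the prefix-based ancestor and lowest-common-ancestor tests are shown as one reason such an encoding is convenient, not as a description of how XSEarch uses it.

```python
def encode(parent_encoding, sibling_index):
    """Encoding of a node: its parent's encoding followed by its sibling index."""
    return parent_encoding + (sibling_index,)


def to_string(encoding):
    """Dotted notation as on the slide, e.g. (13, 8, 1, 9) -> '13.8.1.9'."""
    return ".".join(str(part) for part in encoding)


def is_ancestor(a, b):
    """a is a proper ancestor of b if a's encoding is a strict prefix of b's."""
    return len(a) < len(b) and b[: len(a)] == a


def lowest_common_ancestor(a, b):
    """Encoding of the lowest common ancestor: the longest common prefix."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


node = encode((13, 8, 1), 9)
```

For example, `to_string(node)` gives `13.8.1.9`, and the lowest common ancestor of `13.8.1.9` and `13.8.2` is the node encoded `13.8`.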
Node Label Array
• Associate each nid with its label
Inverted Tag Index
• For each tag, keep
  • posting list: list of nodes labeled with this tag
  • weight
[Figure: tag → Nid1, Nid2, Nid3]
Inverted Keyword Index
• For each keyword (kw), keep
  • posting list: list of leaves containing this keyword
  • weight of the kw within each leaf (tfilf)
[Figure: kw → (Nid1,w1), (Nid2,w2), (Nid3,w3)]
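The two inverted indexes can be sketched with ordinary Python dictionaries. The input shapes (a map from nid to tag, and a map from leaf nid to per-keyword weights) are assumptions, and the weights in the sample data are made-up placeholders, not computed tfilf values.

```python
from collections import defaultdict


def build_tag_index(labels):
    """labels: nid -> tag. Returns tag -> posting list of nids, in nid order."""
    index = defaultdict(list)
    for nid in sorted(labels):
        index[labels[nid]].append(nid)
    return dict(index)


def build_keyword_index(leaf_weights):
    """leaf_weights: nid -> {keyword: weight}. Returns keyword -> [(nid, weight), ...]."""
    index = defaultdict(list)
    for nid in sorted(leaf_weights):
        for keyword, w in leaf_weights[nid].items():
            index[keyword].append((nid, w))
    return dict(index)


# Hypothetical sample data; the weights are placeholders.
labels = {1: "book", 2: "title", 3: "author", 5: "author", 8: "book"}
leaf_weights = {
    3: {"mercy": 0.9, "meyer": 0.6},
    5: {"gina": 0.9, "meyer": 0.6},
}

tag_index = build_tag_index(labels)
keyword_index = build_keyword_index(leaf_weights)
```

Looking up `tag_index["author"]` returns the nodes labeled author, and `keyword_index["meyer"]` returns each leaf containing the keyword together with its weight.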
Node Interconnection Matrix
• Element ij contains 1 if ni and nj are interconnected, 0 otherwise
• n*n symmetric sparse matrix
• Dynamic programming
Alternative
• Hash set: keep only interconnected nodes
• Key: pair (ni, nj)
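A hash set keyed by node pairs might look as follows in Python. The class name is hypothetical. Storing each pair with the smaller id first is one way to exploit the symmetry of the relation, so that only half of the sparse matrix is kept.

```python
class InterconnectionSet:
    """Stores only the interconnected pairs, each as (smaller nid, larger nid)."""

    def __init__(self):
        self._pairs = set()

    @staticmethod
    def _key(ni, nj):
        return (ni, nj) if ni <= nj else (nj, ni)

    def add(self, ni, nj):
        self._pairs.add(self._key(ni, nj))

    def interconnected(self, ni, nj):
        # A node is trivially interconnected with itself.
        return ni == nj or self._key(ni, nj) in self._pairs

    def __len__(self):
        return len(self._pairs)


pairs = InterconnectionSet()
pairs.add(7, 2)
pairs.add(2, 7)  # same pair in the other order, stored once
```

Lookups then work in either order: `pairs.interconnected(2, 7)` and `pairs.interconnected(7, 2)` give the same answer.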
Interconnection
• Let n be the number of nodes
• It is possible to determine whether n1 and n2 are interconnected in O(n) time
• It is possible to determine interconnection of all pairs in O(n^2)
• Offline/Online computation
Interconnection
// iFather, jFather: parents of i and j; iChild: read here as the child of i on the path to j
for (i = size-1; i >= 0; i--)
    for (j = i+1; j < size; j++)
        if i is an ancestor of j
            connected(i,j) = connected(iChild,j) AND connected(i,jFather)
                             AND labelIChild != labelJ AND labelI != labelJFather
    for (j = i+1; j < size; j++)
        if i is not an ancestor of j
            connected(i,j) = connected(i,jFather) AND connected(iFather,j)
                             AND labelI != labelJFather AND labelIFather != labelJ
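The recurrence can be turned into a runnable Python sketch. Several points are assumptions rather than part of the slide: nodes are numbered in preorder, a node is connected to itself, a parent-child pair is always connected (its relationship tree has only two nodes), and memoized recursion replaces the explicit loops so that no particular evaluation order has to be assumed. The example tree and helper names are hypothetical.

```python
from functools import lru_cache

# Hypothetical example tree in preorder: index -> (label, parent index).
NODES = [
    ("bookinfo", None),  # 0
    ("book", 0),         # 1
    ("title", 1),        # 2
    ("author", 1),       # 3
    ("name", 3),         # 4
    ("author", 1),       # 5
    ("name", 5),         # 6
    ("price", 1),        # 7
    ("book", 0),         # 8
    ("title", 8),        # 9
    ("price", 8),        # 10
]


def label(i):
    return NODES[i][0]


def father(i):
    return NODES[i][1]


def is_ancestor(i, j):
    """True if i is a proper ancestor of j."""
    j = father(j)
    while j is not None:
        if j == i:
            return True
        j = father(j)
    return False


def child_towards(i, j):
    """The child of i that lies on the path to its descendant j."""
    while father(j) != i:
        j = father(j)
    return j


@lru_cache(maxsize=None)
def connected(i, j):
    if i == j:
        return True
    if i > j:
        i, j = j, i  # in preorder, the smaller index cannot be the descendant
    if is_ancestor(i, j):
        if father(j) == i:
            return True  # base case: parent and child
        i_child, j_father = child_towards(i, j), father(j)
        return (connected(i_child, j) and connected(i, j_father)
                and label(i_child) != label(j) and label(i) != label(j_father))
    i_father, j_father = father(i), father(j)
    return (connected(i, j_father) and connected(i_father, j)
            and label(i) != label(j_father) and label(i_father) != label(j))
```

On the example tree this reproduces the earlier observations: `connected(2, 7)` holds for the title and price of the same book, while `connected(4, 6)` fails for the two names.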
Demo