adaptive xml search

1

Adaptive XML Search

Dr Wilfred Ng

Department of Computer Science

The Hong Kong University of Science and Technology

2

Outline

Motivation

Key-Tag Search

Multi-Ranker Model

Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)

Experiments

Conclusions and Ongoing Work

3

Motivation

4

Why do we need an XML search engine?

Different nature of HTML and XML data

HTML data
  Hyperlink-intensive
  Declarative language
  Tags have no semantic meaning

XML data
  Self-describing tags
  Extra structural information

XML search engines retrieve more accurate fragments

5

Why do we need an XML search engine?

Web searching
  Document paradigm
  Matching keywords vs. documents
  Returns links to whole documents (web pages)

XML searching
  Query keywords may be tags or data values
  Structure of XML documents is diverse, e.g. DBLP and Shakespeare
  Does not return the whole document, which may be 100 MB or larger
  Returns fragments

6

DBLP

<dblp>

<incollection mdate="2002-01-03" key="books/acm/kim95/AnnevelinkACFHK95">

<author>Jurgen Annevelink</author>

<title>Object SQL - A Language for the Design and Implementation of Object Databases.</title>

<pages>42-68</pages>

<year>1995</year>

<booktitle>Modern Database Systems</booktitle>

<url>db/books/collections/kim95.html</url>

</incollection>

….

7

Shakespeare

<SPEECH>
  <SPEAKER>OCTAVIUS CAESAR</SPEAKER>
  <LINE>No, my most wronged sister; Cleopatra</LINE>
  <LINE>Hath nodded him to her. He hath given his empire</LINE>
  <LINE>Up to a whore; who now are levying</LINE>
  <LINE>The kings o' the earth for war; he hath assembled</LINE>
  <LINE>Bocchus, the king of Libya; Archelaus,</LINE>
  <LINE>Of Cappadocia; Philadelphos, king</LINE>
  <LINE>Of Paphlagonia; the Thracian king, Adallas;</LINE>
  <LINE>King Malchus of Arabia; King of Pont;</LINE>
  <LINE>Herod of Jewry; Mithridates, king</LINE>
  <LINE>Of Comagene; Polemon and Amyntas,</LINE>
  <LINE>The kings of Mede and Lycaonia,</LINE>
  <LINE>With a more larger list of sceptres.</LINE>
</SPEECH>

8

Research Ideas

In the Information Retrieval community, many ranking techniques have been developed
  Weighted keywords
  Vector space

Searching and ranking XML as plain text using IR techniques is possible, but
  Too simple
  Does not take advantage of XML data

Better accuracy can be achieved by using features of XML data:
  Structures
  Tag semantics

9

Outline

Motivation

Key-Tag Search

Multi-Ranker Model


Experiments


10

Key-Tag Search

11

Key-Tag Query vs. XQuery

Analogous to keywords in a Web search engine vs. SQL

The goals of key-tag query and XQuery are different

Key-Tag Query
  Simple
  Easy to understand
  Flexible

XQuery:
  for $x in doc("some.xml")
  where $x/author[. ftcontains "Mary"]
  return $x/title

Too complicated for ordinary users! Will users input such complex XQuery in search engines?

Key-Tag Query:
  <author>Mary</author>

12

Key-Tag Search Query

For example, in <author>Mary</author>, "author" is the tag and "Mary" is the key.

Tag      Key
author   Mary
title    XML
year     2007

Tag      Key
*        Mary
*        XML
*        2007

Tag      Key
author   *
title    *
year     *

Tag      Key
*        Mary
*        XML
year     *

13

Key-Tag Query Semantics

A fragment is considered a result candidate if at least one key-tag is found in it.

If F1 and F2 both contain the same instance of a key-tag and F1 is a subtree of F2, F1 is chosen to be the only answer.

For example, for a query b, if there is a fragment:

  <c>b</c>

then
  F1: b
  F2: <c>b</c>

F1 will be the answer.

If there is a fragment:

  F1: <a>
        b          (B1)
      </a>

  F2: <a>
        <c>b       (B2)
        </c>
      </a>
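One possible reading of these semantics can be sketched in Python. This is a minimal sketch, not the system's implementation: it assumes a key-tag is a (tag, key) pair in which `*` is a wildcard, that a key matches when it occurs in an element's own text, and that elements stand in for fragments. The helper names `matches` and `minimal_fragments` are hypothetical.

```python
import xml.etree.ElementTree as ET

def matches(elem, tag, key):
    """True if elem itself is an instance of the key-tag (tag, key).

    '*' is treated as a wildcard for either part (an assumption).
    Only the element's own leading text is inspected, not its descendants.
    """
    tag_ok = tag == "*" or elem.tag == tag
    text = (elem.text or "").strip()
    key_ok = key == "*" or key.lower() in text.lower()
    return tag_ok and key_ok

def minimal_fragments(root, tag, key):
    """Return, in document order, the elements that directly hold a match.

    Ancestors of a returned element also contain the same instance, but
    they are larger fragments, so they are not reported for that instance.
    """
    return [e for e in root.iter() if matches(e, tag, key)]

# Hypothetical DBLP-like document used only for illustration.
doc = ET.fromstring(
    "<dblp><article><author>Mary</author><title>XML</title></article></dblp>"
)
for fragment in minimal_fragments(doc, "author", "Mary"):
    print(ET.tostring(fragment, encoding="unicode"))
```

Here `<dblp>` and `<article>` also contain the instance of `<author>Mary</author>`, but only the innermost `<author>` element is reported, which mirrors the rule that the subtree is chosen over its enclosing fragment.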

14

Outline

Motivation

Key-Tag Search

Multi-Ranker Model


Experiments


15

Multi-Ranker Model

16

Introduction to MRM

Handle diversified XML documents and user preferences

17

Multi-Ranker Model

[Figure: architecture of the Multi-Ranker Model, drawn as three levels]

Adaptive Ranking Level (AR): AR1, AR2, …, ARn, shown with User Profiles 1, 2, …, n and RSSF; weights w11, w12, w13, w14 form W1

Standard Ranking Level (XR): STR, DAT, DFT, CUS

Feature Ranking Level:
  Similarity: Keyword, Access, Path, Element, Order, Category
  Granularity: Sibling, Children, Distance+, Distance-, Tag, Attribute
  NEW: Feature1, Feature2, Feature3, …

18

Adaptive Ranking Level (AR)

AR maintains a feature vector that adapts to the four XRs; the vector is weighted and trained by RSSF:

  feature vector = (STR, DAT, DFT, CUS, STR, DAT, DFT, CUS)

The adaptive ranking of a fragment is calculated as W * (feature vector), where W is the weight vector generated by RSSF, which is introduced later.
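The weighted combination can be illustrated with a short Python sketch. It assumes, for simplicity, one score per XR (STR, DAT, DFT, CUS) and one weight per score, following the w11 to w14 labels in the model diagram; the function name `adaptive_rank` and all numeric values are hypothetical.

```python
def adaptive_rank(weights, xr_scores):
    """Dot product of a weight vector W and a vector of XR scores."""
    if len(weights) != len(xr_scores):
        raise ValueError("weights and scores must have the same length")
    return sum(w * s for w, s in zip(weights, xr_scores))

# Hypothetical weights for one user profile, in the order STR, DAT, DFT, CUS.
w1 = [0.4, 0.3, 0.2, 0.1]

# Hypothetical XR scores for two fragments.
fragment_scores = {
    "F1": [0.5, 1.0, 0.0, 0.5],
    "F2": [0.9, 0.1, 0.4, 0.2],
}

# Rank fragments by their adaptive score, highest first.
ranking = sorted(
    fragment_scores,
    key=lambda name: adaptive_rank(w1, fragment_scores[name]),
    reverse=True,
)
print(ranking)
```

Changing the weight vector changes the ranking without touching the XRs themselves, which is the point of keeping the adaptive level separate.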

19

Standard Ranking Level (XR)

Four XRs
  Structure ranker (STR): focuses on ranking XML fragments based on their structure
  Data ranker (DAT): ignores the structure and ranks XML fragments by their textual data
  System default ranker (DFT): a balance of the structure and data rankers
  Customized ranker (CUS): the system administrator selects low-level features for tuning; in our experiment, the low-level features are randomly picked

20


Similarity Features: Keyword, Access, Path, Element, Order, Category

For example, Q = {<author>Mary</author>, <title>XML</title>}

Keyword similarity: [formula garbled in extraction; surviving terms: (1 0.5) 0.5, log, 2, 1]

Path similarity = 3/4
Access similarity = 3/7
Element similarity = 2/7

Order in Q: author > title
Sibling order in F: author > title, author > year, title > year, first > last
Ancestor order similarity = 0
Sibling order similarity = 1/4

Predefined categories:
  Academic category: {article, title, author}
  Sport category: {team, player, match, year}
  …

Category vector for Q: <2/3, 0>
Category vector for F: <1, 1/4>
Category similarity = distance = sqrt((1/3)^2 + (1/4)^2) = 0.4167
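The category arithmetic can be checked with a small Python sketch. It assumes each vector component is the fraction of a category's tags that occur in the query or fragment, which reproduces the <2/3, 0> and <1, 1/4> vectors above. The tag set used for F is an assumption chosen to be consistent with those vectors, and the function names are hypothetical.

```python
import math

# Categories as predefined on the slide.
CATEGORIES = {
    "academic": {"article", "title", "author"},
    "sport": {"team", "player", "match", "year"},
}

def category_vector(tags):
    """Fraction of each category's tags that appear in the given tag set."""
    tags = set(tags)
    return [len(tags & members) / len(members) for members in CATEGORIES.values()]

def category_distance(query_tags, fragment_tags):
    """Euclidean distance between the two category vectors."""
    return math.dist(category_vector(query_tags), category_vector(fragment_tags))

q_tags = {"author", "title"}
# Assumed tags of fragment F (not listed explicitly on the slide).
f_tags = {"dblp", "article", "author", "first", "last", "title", "year"}

print(category_vector(q_tags))
print(category_vector(f_tags))
print(round(category_distance(q_tags, f_tags), 4))
```

The distance works out to sqrt(1/9 + 1/16) = 5/12, which rounds to the 0.4167 shown on the slide.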

21


Granularity Features: Sibling, Children, Distance+, Distance-, Tag, Attribute

Involves statistical data in the database

For example, Q = {<author>Mary</author>, <title>XML</title>}

Sibling: number of fragments whose roots are dblp
Children: number of tags whose parent is dblp
Distance+: length of the path from the root to the farthest leaf, e.g. dblp/article/author/first: length = 4
Distance-: length of the path from the root to the nearest leaf, e.g. dblp/article/title: length = 3
Tag: number of tags in F: 7
Attribute: number of attributes in F: 0
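Four of these features depend only on the fragment itself and can be sketched in Python with the standard library. The fragment below is an assumption: its shape is inferred from the paths and counts quoted on the slide, and its text values are placeholders. Sibling and Children need database-wide statistics and are left out. The names `leaf_depths` and `granularity` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed fragment F, shaped to match the paths and counts on the slide.
FRAGMENT = """
<dblp><article>
  <author><first>Mary</first><last>Placeholder</last></author>
  <title>XML</title><year>2007</year>
</article></dblp>
"""

def leaf_depths(elem, depth=1):
    """Path lengths (counted in elements) from the root to every leaf."""
    children = list(elem)
    if not children:
        return [depth]
    depths = []
    for child in children:
        depths.extend(leaf_depths(child, depth + 1))
    return depths

def granularity(root):
    """Fragment-local granularity features."""
    depths = leaf_depths(root)
    elems = list(root.iter())
    return {
        "distance_plus": max(depths),   # root to farthest leaf
        "distance_minus": min(depths),  # root to nearest leaf
        "tag": len(elems),              # number of tags in the fragment
        "attribute": sum(len(e.attrib) for e in elems),
    }

print(granularity(ET.fromstring(FRAGMENT)))
```

For this assumed fragment the farthest leaf is dblp/article/author/first and the nearest is dblp/article/title, so the sketch agrees with the lengths 4 and 3 and the counts 7 and 0 given above.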

22

Highlights of MRM

Highly flexible
  Adding or removing features or XRs is straightforward
  Only the feature vector needs to be updated: "Ranking Level Independence"
  Analogous to data independence in the relational model

23

Outline

Motivation

Key-Tag Search

Multi-Ranker Model


Experiments


24

Features of RSSF

Input: a set of labeled fragments
Output: a trained ranker

Naïve Bayes is a successful algorithm for learning to classify text documents
  Requires only a small amount of training data, but both positive and negative samples

In our setting, we only have labeled and unlabeled data, so we extend Naïve Bayes with a spying technique to obtain the negative training samples

25

The RSSF

26

Ranking SVM Techniques

Find a weight vector that makes the inequality hold: F1 < F2 < F3
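The idea of finding such a vector can be sketched with a simplified pairwise trainer in pure Python. This is not a full SVM solver and not the RSSF code: it runs subgradient descent on a pairwise hinge loss with a small L2 penalty, and it reads F1 < F2 < F3 as an ordering of scores. The feature values, hyperparameters, and function names are hypothetical.

```python
def score(w, x):
    """Linear score w . x of a fragment's feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise(pairs, dim, epochs=200, lr=0.1, reg=0.01):
    """Learn w so that score(w, better) exceeds score(w, worse) for each pair.

    pairs: list of (better, worse) feature vectors.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - a for b, a in zip(better, worse)]
            margin = score(w, diff)
            # Shrink w slightly (L2 regularisation).
            w = [wi * (1 - lr * reg) for wi in w]
            # Hinge loss: update only when the pair is not separated by margin 1.
            if margin < 1:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

# Hypothetical two-dimensional feature vectors for three fragments.
F1, F2, F3 = [0.1, 0.9], [0.5, 0.5], [0.9, 0.2]

# Desired order F1 < F2 < F3, expressed as (better, worse) pairs.
pairs = [(F2, F1), (F3, F2), (F3, F1)]
w = train_pairwise(pairs, dim=2)
print(score(w, F1) < score(w, F2) < score(w, F3))
```

Each preference becomes a constraint on the difference of two feature vectors, so ranking is reduced to classifying pairs.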

27

Voting Spy Naïve Bayes

28

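[Figure: Spy Naïve Bayes training. Fragments are grouped as Positive, Unclassified and Negative; P1, P2 and P3 are marked; after Naïve Bayes training completes, a set of Estimated Negative fragments is produced]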

29


[Figure: voting step. Fragments are grouped as Positive, Unclassified and Negative; P1, P2 and P3 are shown with candidate negatives drawn from F11, F12, F13 and F14]

The final estimated negative is F11
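The voting step can be sketched in Python as counting how many runs nominate each fragment. The grouping of F11 to F14 under P1, P2 and P3 below is an assumption that only matches how often each label appears on the slide, and treating the threshold as a raw vote count is also an assumption. The function name `vote_negatives` is hypothetical.

```python
from collections import Counter

def vote_negatives(estimated_negative_sets, threshold):
    """Keep fragments nominated as negative by at least `threshold` runs."""
    votes = Counter()
    for negatives in estimated_negative_sets:
        votes.update(set(negatives))  # one vote per run per fragment
    return {frag for frag, count in votes.items() if count >= threshold}

# Assumed estimated-negative sets produced with P1, P2 and P3.
runs = [
    {"F11", "F12", "F14"},
    {"F11", "F12"},
    {"F11", "F13"},
]

print(sorted(vote_negatives(runs, threshold=3)))
print(sorted(vote_negatives(runs, threshold=2)))
```

With a threshold of three votes only F11 survives; lowering the threshold admits more fragments as negatives, which is the trade-off the voting threshold controls.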

30

Outline

Motivation

Key-Tag Search

Multi-Ranker Model


Experiments


31

Effect of Varying Voting Threshold

X: voting threshold
Y: relative average rank of labeled fragments (new average rank / original average rank)

32

Effectiveness of Low-Level Features on XR

• In this experiment, we remove individual low-level features from the STR and DAT rankers and measure the new precision
• The queries we use can be found in the appendix of the proposal

33

Processing Time

34

Comparison with TopX

TopX is a search engine for XML data that is available online
  State-of-the-art XML search engine

We measure MAP and precision@k
  MAP (mean average precision): compute the average precision over 100 recall points for each query, then take the average over all queries
  precision@k (top-k precision): number of relevant results in the top k, divided by k
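Both measures can be sketched in a few lines of Python. Note the assumption: `average_precision` below uses the common non-interpolated definition (precision averaged at each relevant hit), not the 100-recall-point averaging described on the slide, so its values may differ from the reported ones. The function names and sample data are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Number of relevant results among the top k, divided by k."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Non-interpolated average precision for one query."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)

def mean_average_precision(queries):
    """Mean of the per-query average precision; queries is a list of
    (ranked_results, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Hypothetical ranked result list and relevance judgments for one query.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
```

precision@k looks only at the head of the list, while MAP rewards placing every relevant fragment early, so the two can disagree on which system is better.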

35

Outline

Motivation

Key-Tag Search

Multi-Ranker Model


Experiments


36

Further remarks

Searching and ranking XML data are important, since existing Web search engines cannot handle them well

We present an effective approach to adaptive XML searching and ranking that extends traditional IR techniques by considering different features of XML data

37

Ongoing Work – INEX 2007

The Initiative for the Evaluation of XML Retrieval (INEX)
  A community that aims to provide large test data and scoring methods for researchers to evaluate their retrieval systems
  It has been getting attention recently
  We participated in INEX in 2006 and 2007
  The INEX 2007 collection is a Wikipedia XML corpus with a set of 659,388 XML documents
  We are running experiments using their data and queries

38

Ongoing Work – INEX 2007

39

Ongoing Work – Merging

Displaying a list of fragments one by one to the user may not be adequate in the XML setting
  Fragments may be scattered across the list
  Duplicated fragments in different structures
  Refining a search query to obtain more and better results

Ideas: make use of the schema information (DTD), consider the fragments as entities, and merge them in a concise way

40

My Publications

Ho-Lam LAU and Wilfred NG. A Multi-Ranker Model for Adaptive XML Searching. Accepted and to appear: VLDB Journal. (2007).

Ho-Lam LAU and Wilfred NG. Towards an Adaptive Information Merging Using Selected XML Fragments. International Conference of Database Systems for Advanced Applications. DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 1013-1019, (2007).

James CHENG and Wilfred NG. A Development of Hash-Lookup Trees to Support Querying Streaming XML. International Conference of Database Systems for Advanced Applications. DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 768-780, (2007).

Wilfred NG and James CHENG. An Efficient Index Lattice for XML Query Evaluation. International Conference of Database Systems for Advanced Applications. DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 753-767, (2007).

Wilfred NG and Ho-Lam LAU. A Co-Training Framework for Searching XML Documents. Information Systems, 32(3), pp. 477-503, (2007).

Yin YANG, Wilfred NG, Ho-Lam LAU and James CHENG. An Efficient Approach to Support Querying Secure Outsourced XML Information. Conference on Advanced Information Systems Engineering. CAiSE 2006, Lecture Notes in Computer Science Vol. 4007, Luxembourg, pp. 157-171, (2006).

Wilfred NG and Ho-Lam LAU. Effective Approaches for Watermarking XML Data. 10th International Conference on Database Systems for Advanced Applications. DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 68-80, (2005).

Ho-Lam LAU and Wilfred NG. A Unifying Framework for Merging and Evaluating XML Information. 10th International Conference on Database Systems for Advanced Applications. DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 81-94, (2005).
