Adaptive XML Search

Post on 08-Jan-2016

Page 1: Adaptive XML Search

Adaptive XML Search

Dr Wilfred Ng
Department of Computer Science
The Hong Kong University of Science and Technology

Page 2: Adaptive XML Search

Outline

Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work

Page 3: Adaptive XML Search

Motivation

Page 4: Adaptive XML Search

Why Do We Need an XML Search Engine?

HTML and XML data differ in nature

HTML data
  Hyperlink-intensive
  Declarative language
  Tags have no semantic meaning

XML data
  Self-describing tags
  Extra structural information
  XML search engines can retrieve more accurate fragments

Page 5: Adaptive XML Search

Why Do We Need an XML Search Engine?

Web searching
  Document paradigm
  Matches keywords against documents
  Returns links to whole documents (web pages)

XML searching
  Query keywords may be tags or data values
  The structure of XML documents is diverse, e.g. DBLP and Shakespeare
  Does not return the whole document, which may be 100 MB or larger
  Returns fragments

Page 6: Adaptive XML Search

DBLP

<dblp>
<incollection mdate="2002-01-03" key="books/acm/kim95/AnnevelinkACFHK95">
  <author>Jurgen Annevelink</author>
  <title>Object SQL - A Language for the Design and Implementation of Object Databases.</title>
  <pages>42-68</pages>
  <year>1995</year>
  <booktitle>Modern Database Systems</booktitle>
  <url>db/books/collections/kim95.html</url>
</incollection>
….

Page 7: Adaptive XML Search

Shakespeare

<SPEECH>
  <SPEAKER>OCTAVIUS CAESAR</SPEAKER>
  <LINE>No, my most wronged sister; Cleopatra</LINE>
  <LINE>Hath nodded him to her. He hath given his empire</LINE>
  <LINE>Up to a whore; who now are levying</LINE>
  <LINE>The kings o' the earth for war; he hath assembled</LINE>
  <LINE>Bocchus, the king of Libya; Archelaus,</LINE>
  <LINE>Of Cappadocia; Philadelphos, king</LINE>
  <LINE>Of Paphlagonia; the Thracian king, Adallas;</LINE>
  <LINE>King Malchus of Arabia; King of Pont;</LINE>
  <LINE>Herod of Jewry; Mithridates, king</LINE>
  <LINE>Of Comagene; Polemon and Amyntas,</LINE>
  <LINE>The kings of Mede and Lycaonia,</LINE>
  <LINE>With a more larger list of sceptres.</LINE>
</SPEECH>

Page 8: Adaptive XML Search

Research Ideas

The Information Retrieval community has developed many ranking techniques
  Weighted keywords
  Vector space

Searching and ranking XML as plain text using IR techniques is possible, but
  It is too simple
  It does not take advantage of XML data

Better accuracy can be achieved by using features of XML data:
  Structures
  Tag semantics

Page 10: Adaptive XML Search

Key-Tag Search

Page 11: Adaptive XML Search

Key-Tag Query vs. XQuery

The comparison is like keywords in a Web search engine vs. SQL
The goals of key-tag query and XQuery are different

Key-Tag Query
  Simple
  Easy to understand
  Flexible

XQuery:
  for $x in doc("some.xml")
  where $x/author[. ftcontains 'Mary']
  return $x/title

Key-Tag Query:
  <author>Mary</author>

XQuery is too complicated for ordinary users!
Will users input such complex XQuery in search engines?

Page 12: Adaptive XML Search

Key-Tag Search Query

<author>Mary</author>
  Tag: author
  Key: Mary

For example, where * stands for an unspecified tag or key:

Tag     Key
author  Mary
title   XML
year    2007

Tag     Key
*       Mary
*       XML
*       2007

Tag     Key
author  *
title   *
year    *

Tag     Key
*       Mary
*       XML
year    *
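
The tag/key pairs above can be sketched as a small matcher. This is only an illustration, not the system described in the talk: the function names, the sample document, the case-insensitive substring test for keys, and the reading of * as "match anything" are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, loosely modeled on the DBLP example.
DOC = """
<dblp>
  <article>
    <author>Mary Smith</author>
    <title>XML Search</title>
    <year>2007</year>
  </article>
</dblp>
"""

def matches(elem, tag, key):
    """True if the element satisfies one (tag, key) pair; '*' matches anything (assumed)."""
    tag_ok = tag == "*" or elem.tag == tag
    text = elem.text or ""
    key_ok = key == "*" or key.lower() in text.lower()  # substring match is an assumption
    return tag_ok and key_ok

def find_key_tag(root, tag, key):
    """Return every element in the tree that matches the (tag, key) pair."""
    return [elem for elem in root.iter() if matches(elem, tag, key)]

root = ET.fromstring(DOC.strip())
print([e.tag for e in find_key_tag(root, "author", "Mary")])
print([e.tag for e in find_key_tag(root, "*", "XML")])
print([e.tag for e in find_key_tag(root, "year", "*")])
```

Each call corresponds to one row of the tables: a full key-tag, a key with an unspecified tag, and a tag with an unspecified key.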

Page 13: Adaptive XML Search

Key-Tag Query Semantics

A fragment is considered a result candidate if at least one key-tag is found in it.
If F1 and F2 both contain the same instance of a key-tag and F1 is a subtree of F2, F1 is chosen as the only answer.

For example, given the query <b>b</b> and the fragment:

<b>
  <c>
    <b>b</b>
  </c>
</b>

the candidates are:

F1: <b>b</b>
F2: <b><c><b>b</b></c></b>

F1 will be the answer.

If there are two fragments:

F1:
<a>
  <b>b</b>   (B1)
</a>

F2:
<a>
  <c>
    <b>b</b>   (B2)
  </c>
</a>
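
A minimal sketch of the subtree rule, assuming an exact match between the key and the element text: for each key-tag instance, every enclosing subtree is a candidate, and the innermost one is kept. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def candidates_per_instance(root, tag, key):
    """For each key-tag instance, list the fragments that contain it, outermost first."""
    results = []

    def walk(elem, ancestors):
        chain = ancestors + [elem]
        if elem.tag == tag and (elem.text or "").strip() == key:
            results.append(chain)
        for child in elem:
            walk(child, chain)

    walk(root, [])
    return results

def answers(root, tag, key):
    """Keep only the innermost fragment for each key-tag instance."""
    return [chain[-1] for chain in candidates_per_instance(root, tag, key)]

root = ET.fromstring("<b><c><b>b</b></c></b>")
for fragment in answers(root, "b", "b"):
    print(ET.tostring(fragment, encoding="unicode"))  # <b>b</b>
```

The outer <b> has no text of its own, so only the inner <b>b</b> is an instance of the key-tag, and it is returned in place of the larger fragments that enclose it.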

Page 15: Adaptive XML Search

Multi-Ranker Model

Page 16: Adaptive XML Search

Introduction to MRM

The Multi-Ranker Model (MRM) handles diverse XML documents and user preferences

Page 17: Adaptive XML Search

Multi-Ranker Model

[Figure: the three levels of the Multi-Ranker Model]

Adaptive Ranking Level (AR): AR1, AR2, …, ARn
  User Profiles: 1, 2, …, n
  RSSF
  Weights: W1 = (w11, w12, w13, w14)

Standard Ranking Level (XR): STR, DAT, DFT, CUS

Feature Ranking Level
  Similarity: Keyword, Access, Path, Element, Order, Category
  Granularity: Sibling, Children, Distance+, Distance-, Tag, Attribute
  NEW: Feature1, Feature2, Feature3, …

Page 18: Adaptive XML Search

Adaptive Ranking Level (AR)

AR maintains a feature vector that adapts to the four XRs. The vector is weighted, and the weights are trained by RSSF:
  feature vector = (STR, DAT, DFT, CUS, STR, DAT, DFT, CUS)

The adaptive ranking of a fragment is calculated as W * (feature vector), where W is generated by RSSF, which is introduced later.
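
As a rough illustration of the weighted combination, the sketch below scores fragments by the dot product of a weight vector and a per-fragment feature vector. The numbers, the fragment names, and the use of one score per XR (four entries) are assumptions made for the example; in the model, W comes from RSSF.

```python
def adaptive_score(weights, features):
    """Dot product of the weight vector W and a fragment's feature vector."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical weights for (STR, DAT, DFT, CUS) and hypothetical XR scores.
weights = [0.4, 0.3, 0.2, 0.1]
fragments = {
    "F1": [0.9, 0.2, 0.5, 0.1],
    "F2": [0.3, 0.8, 0.5, 0.4],
}

# Rank fragments from highest to lowest adaptive score.
ranked = sorted(fragments, key=lambda name: adaptive_score(weights, fragments[name]), reverse=True)
print(ranked)
```

With these made-up values, F1 scores 0.53 and F2 scores 0.50, so F1 is ranked first.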

Page 19: Adaptive XML Search

Standard Ranking Level (XR)

Four XRs
  Structure ranker (STR): ranks XML fragments based on their structure
  Data ranker (DAT): ignores the structure and ranks XML fragments by their textual data
  System default ranker (DFT): a balance of the structure and data rankers
  Customized ranker (CUS): the system administrator selects low-level features for tuning; in our experiment, the low-level features are randomly picked

Page 20: Adaptive XML Search

Feature Ranking Level

Similarity Features: Keyword, Access, Path, Element, Order, Category

For example, Q = {<author>Mary</author>, <title>XML</title>}

Keyword similarity = [equation garbled in extraction; surviving terms: log, (1 0.5), 0.5, 2, 1]
Path similarity = 3/4
Access similarity = 3/7
Element similarity = 2/7

Order
  Order in Q: author > title
  Sibling order in F: author > title, author > year, title > year, first > last
  Ancestor order similarity = 0
  Sibling order similarity = 1/4

Category
  Predefined categories:
    Academic category: {article, title, author}
    Sport category: {team, player, match, year}
    …
  Category vector for Q: <2/3, 0>
  Category vector for F: <1, 1/4>
  Category similarity = distance between the two vectors = sqrt((1/3)^2 + (1/4)^2) = 0.4167
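
The category arithmetic can be checked with a short sketch. The two category definitions come from the slide; the tag set assumed for F (dblp, article, author, first, last, title, year) is inferred from the other examples in the talk and is an assumption, as are the function names.

```python
import math

# Categories as predefined on the slide (only the two that are listed).
CATEGORIES = {
    "academic": {"article", "title", "author"},
    "sport": {"team", "player", "match", "year"},
}

def category_vector(tags):
    """Fraction of each category's tags that appear in the given tag set."""
    present = set(tags)
    return [len(present & members) / len(members) for members in CATEGORIES.values()]

def category_distance(query_tags, fragment_tags):
    """Euclidean distance between the two category vectors."""
    q = category_vector(query_tags)
    f = category_vector(fragment_tags)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, f)))

query_tags = ["author", "title"]
fragment_tags = ["dblp", "article", "author", "first", "last", "title", "year"]  # assumed tags of F

print(category_vector(query_tags))      # academic 2/3, sport 0
print(category_vector(fragment_tags))   # academic 1, sport 1/4
print(round(category_distance(query_tags, fragment_tags), 4))  # 0.4167
```

The result is sqrt(1/9 + 1/16) = 5/12, which matches the 0.4167 on the slide.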

Page 21: Adaptive XML Search

Feature Ranking Level

Granularity Features: Sibling, Children, Distance+, Distance-, Tag, Attribute
These involve statistical data in the database

For example, Q = {<author>Mary</author>, <title>XML</title>}

Sibling: number of fragments whose roots are dblp
Children: number of tags whose parent is dblp
Distance+: the length of the path from the root to the farthest leaf, e.g. dblp/article/author/first: length = 4
Distance-: the length of the path from the root to the nearest leaf, e.g. dblp/article/title: length = 3
Tag: number of tags in F: 7
Attribute: number of attributes in F: 0
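
A sketch of four of these counts on a single fragment, using Python's standard ElementTree. The fragment below is a hypothetical reconstruction chosen to agree with the numbers on the slide (path lengths 4 and 3, 7 tags, 0 attributes); the Sibling and Children features need database-wide statistics and are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment F, consistent with the paths quoted on the slide.
FRAGMENT = """
<dblp>
  <article>
    <author><first>Mary</first><last>Smith</last></author>
    <title>XML</title>
    <year>2007</year>
  </article>
</dblp>
"""

def leaf_depths(elem, depth=1):
    """Yield the path length (counted in tags) from the root to each leaf element."""
    children = list(elem)
    if not children:
        yield depth
    for child in children:
        yield from leaf_depths(child, depth + 1)

def granularity_features(root):
    depths = list(leaf_depths(root))
    elements = list(root.iter())
    return {
        "distance_plus": max(depths),    # root to farthest leaf
        "distance_minus": min(depths),   # root to nearest leaf
        "tag": len(elements),            # number of tags in the fragment
        "attribute": sum(len(e.attrib) for e in elements),
    }

features = granularity_features(ET.fromstring(FRAGMENT.strip()))
print(features)
```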

Page 22: Adaptive XML Search

Highlights of MRM

Highly flexible
  Adding or removing features or XRs is straightforward
  Only the feature vector needs to be updated: "Ranking Level Independence"
  Analogous to data independence in the relational model

Page 24: Adaptive XML Search

Features of RSSF

Input: a set of labeled fragments
Output: a trained ranker

Naïve Bayes is a successful algorithm for learning to classify text documents
It requires only a small amount of training data, but needs both positive and negative samples
In our setting, we only have labeled and unlabeled data, so we extend Naïve Bayes with a spying technique to obtain the negative training samples

Page 25: Adaptive XML Search

The RSSF

Page 26: Adaptive XML Search

Ranking SVM Techniques

Find a vector that makes the inequality hold: F1 < F2 < F3
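
One way to picture this step is a linear pairwise ranker trained by subgradient descent on the hinge loss, which is the usual linear Ranking SVM objective. This is a simplified stand-in, not the RSSF implementation: the feature values are invented, and reading F1 < F2 < F3 as the desired order of the fragments' scores is an assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_pairwise_ranker(features, preferences, epochs=200, lr=0.1, reg=0.01):
    """Learn w so that dot(w, features[hi]) > dot(w, features[lo]) for each (lo, hi) pair."""
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for _ in range(epochs):
        for lo, hi in preferences:
            diff = [h - l for h, l in zip(features[hi], features[lo])]
            violated = dot(w, diff) < 1.0            # hinge loss is active inside the margin
            w = [wi * (1.0 - lr * reg) for wi in w]  # L2 regularization step
            if violated:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

# Hypothetical feature vectors for three fragments.
features = {"F1": [0.1, 0.2], "F2": [0.4, 0.3], "F3": [0.9, 0.5]}
preferences = [("F1", "F2"), ("F2", "F3"), ("F1", "F3")]  # (lower, higher)

w = train_pairwise_ranker(features, preferences)
scores = {name: dot(w, vec) for name, vec in features.items()}
print(sorted(scores, key=scores.get))  # fragments from lowest to highest score
```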

Page 27: Adaptive XML Search

Voting Spy Naïve Bayes

Page 28: Adaptive XML Search

Voting Spy Naïve Bayes

[Figure: animation with the columns Positive, Unclassified and Negative; positive examples P1, P2 and P3; steps labeled "Training Naïve Bayes…", "Estimated Negative" and "Training Completed"]

Page 29: Adaptive XML Search

Voting Spy Naïve Bayes

[Figure: columns Positive, Unclassified and Negative; estimated negatives listed for P1, P2 and P3: F11 three times, F12 twice, F13 and F14 once each]

The final estimated negative is F11
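
The voting step alone might look like the sketch below. It does not train Naïve Bayes; it only combines per-spy results. The grouping of fragments under P1, P2 and P3 is an assumption that agrees with how often each fragment appears on the slide, and expressing the threshold as a fraction of spies is also an assumption.

```python
from collections import Counter

def vote_negatives(estimates, threshold):
    """Keep fragments that at least `threshold` (a fraction of the spies) estimated as negative."""
    votes = Counter(frag for negatives in estimates.values() for frag in set(negatives))
    return {frag for frag, count in votes.items() if count / len(estimates) >= threshold}

# Assumed estimated-negative sets, one per spy.
estimates = {
    "P1": {"F11", "F12"},
    "P2": {"F11", "F12", "F14"},
    "P3": {"F11", "F13"},
}

final = vote_negatives(estimates, threshold=1.0)
print(final)  # {'F11'}
```

With a unanimous threshold only F11 survives; lowering the threshold to 0.5 would also admit F12.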

Page 31: Adaptive XML Search

Effect of Varying Voting Threshold

[Figure] X: voting threshold. Y: relative average rank of labeled fragments (new average rank / original average rank).

Page 32: Adaptive XML Search

Effectiveness of Low-Level Features on XR

• In this experiment, we remove individual low-level features from the STR and DAT rankers and measure the new precision
• The queries we use can be found in the appendix of the proposal

Page 33: Adaptive XML Search

Processing Time

Page 34: Adaptive XML Search

Comparison with TopX

TopX is a search engine for XML data that is available online
  State-of-the-art XML search engine

We measure MAP and precision@k
  MAP (mean average precision): compute the average precision over 100 recall points for each query, then take the average over the queries
  precision@k (top-k precision): number of relevant results in the top k, divided by k
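
Both measures can be sketched in a few lines. The interpolation used here (at each recall level, take the best precision reached at that recall or higher) is a common convention and an assumption about how the 100 recall points were evaluated; the function names and sample ranking are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Number of relevant results in the top k, divided by k."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def interpolated_average_precision(ranked, relevant, points=100):
    """Average of interpolated precision at recall levels 1/points, 2/points, ..., 1."""
    if not relevant:
        return 0.0
    hits = 0
    curve = []  # (recall, precision) after each rank
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
        curve.append((hits / len(relevant), hits / rank))
    total = 0.0
    for i in range(1, points + 1):
        level = i / points
        total += max((p for r, p in curve if r >= level), default=0.0)
    return total / points

def mean_average_precision(runs):
    """Average the per-query average precision over all (ranked, relevant) pairs."""
    return sum(interpolated_average_precision(r, rel) for r, rel in runs) / len(runs)

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(interpolated_average_precision(ranked, relevant))
```

For this ranking, recall levels up to 0.5 have interpolated precision 1 and the rest have 2/3, so the average precision is 5/6.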

Page 36: Adaptive XML Search

Further Remarks

Searching and ranking XML data are important, since existing Web search engines cannot handle them well

We present an effective approach to adaptive XML searching and ranking that extends traditional IR techniques by considering different features of XML data

Page 37: Adaptive XML Search

Ongoing Work – INEX 2007

The Initiative for the Evaluation of XML Retrieval (INEX)
  A community that aims to provide large test data sets and scoring methods for researchers to evaluate their retrieval systems
  It has been attracting attention recently
  We participated in INEX in 2006 and 2007
  The INEX 2007 collection is a Wikipedia XML corpus with a set of 659,388 XML documents
  We are running experiments using their data and queries

Page 38: Adaptive XML Search

Ongoing Work – INEX 2007

Page 39: Adaptive XML Search

Ongoing Work – Merging

Displaying a list of fragments one by one to the user may not be adequate in the XML setting
  Fragments may be scattered across the list
  Duplicated fragments appear in different structures
  Refine a search query to obtain more and better results

Idea: make use of the schema information (DTD), treat the fragments as entities, and merge them in a concise way

Page 40: Adaptive XML Search

My Publications

Ho-Lam LAU and Wilfred NG. A Multi-Ranker Model for Adaptive XML Searching. Accepted and to appear: VLDB Journal. (2007).

Ho-Lam LAU and Wilfred NG. Towards an Adaptive Information Merging Using Selected XML Fragments. International Conference on Database Systems for Advanced Applications. DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 1013-1019, (2007).

James CHENG and Wilfred NG. A Development of Hash-Lookup Trees to Support Querying Streaming XML. International Conference on Database Systems for Advanced Applications. DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 768-780, (2007).

Wilfred NG and James CHENG. An Efficient Index Lattice for XML Query Evaluation. International Conference on Database Systems for Advanced Applications. DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 753-767, (2007).

Wilfred NG and Ho-Lam LAU. A Co-Training Framework for Searching XML Documents. Information Systems, 32(3), pp. 477-503, (2007).

Yin YANG, Wilfred NG, Ho-Lam LAU and James CHENG. An Efficient Approach to Support Querying Secure Outsourced XML Information. Conference on Advanced Information Systems Engineering. CAiSE 2006, Lecture Notes in Computer Science Vol. 4007, Luxembourg, pp. 157-171, (2006).

Wilfred NG and Ho-Lam LAU. Effective Approaches for Watermarking XML Data. 10th International Conference on Database Systems for Advanced Applications. DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 68-80, (2005).

Ho-Lam LAU and Wilfred NG. A Unifying Framework for Merging and Evaluating XML Information. 10th International Conference on Database Systems for Advanced Applications. DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 81-94, (2005).