
Post on 08-Jan-2016


Page 1: Extending PRIX for  Similarity-based XML Query


Extending PRIX for Similarity-based XML Query

Group Members: Yan Qi, Jicheng Zhao, Dan Situ, Ning Liao

Page 2: Extending PRIX for  Similarity-based XML Query

Agenda

- System Architecture Introduction
- Semantic-based Similarity Search
  - Query Expansion
  - Semantic Similarity Computation
- Structural-based Similarity Search
  - Adapting PRIX algorithm
    - Indexing
    - Query Processing
  - Structural Similarity Computation
- Similarity Computation and Ranking
- Discussion & Conclusion

Page 3: Extending PRIX for  Similarity-based XML Query

System Architecture Introduction

[Figure: system architecture. Components: Data Parser, Data Storage Manager, Index Manager, Metadata Manager, Berkeley DB, Query Parser, Query Extending, Query Processing, Results Ranking. Inputs and outputs: XML data, XQuery, Query Extensions, Query Result. Two flows are drawn: the Loading XML Flow and the XML Query Flow; labels 1 - 8 mark the XML query steps.]


Page 5: Extending PRIX for  Similarity-based XML Query

Query Expansion (I)

Procedure:

1. Extract all the tags from the XML query to get the tag sequence.
2. Extract the meaningful keywords for each tag in the tag sequence (ignoring stop words) to get the keyword sequence.
3. Get keyword extensions for each keyword in the keyword sequence based upon WordNet.
4. Remove the keywords that do not exist in the database to get the valid keyword extensions.
5. Take the full combination of the keyword extensions in each tag to get the tag extensions.
6. Remove a tag extension if not all of its keywords appear at least once in a tag of the database; replace each remaining tag extension with the tag whose keyword set is a superset of the keyword set of the tag extension. This gives the valid tag extensions.
7. Take the full combination of all the tags in the original query to get the query extensions.
8. Remove the query extensions whose tags do not appear in the same XML document of the database to get the valid query extensions.

An Example:

Tags in a sample query

{title, Praveen Rao, information retrieval}

Keywords

{title, Praveen, Rao, information, retrieval}

Keyword Extensions

{{title, status title, deed, claim, entity, style}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}

Valid Keyword Extensions

{{title, claim, entity}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}

(Continued on next page)

Page 6: Extending PRIX for  Similarity-based XML Query

Query Expansion (II)

Tag Extensions

{{title}, {claim}, {entity}, {Praveen}, {Rao}, {data, retrieval}, {data, recovery}, {information, retrieval}, {information, recovery}, {entropy, retrieval}, {entropy, recovery}}

Valid Tag Extensions

{{title}, {A claim on theory of computation}, {entity}, {Praveen Rao}, {modern information retrieval}, {A survey on information retrieval}, {information recovery}}

Query Expansions

1. { {title}, {Praveen Rao}, {modern information retrieval} }

2. { {A claim on theory of computation}, {Praveen Rao}, {modern information retrieval} } ……

Valid Queries

{ {title}, {Praveen Rao}, {modern information retrieval} }
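The keyword-level part of the expansion (extend each keyword, drop extensions missing from the database, then combine per tag) can be sketched as follows. This is only an illustration: `SYNONYMS` is a hand-written stand-in for the WordNet lookup, `DB_KEYWORDS` is a hypothetical database vocabulary chosen to match the example, and the function names are not from the slides.

```python
from itertools import product

# Hypothetical stand-in for a WordNet lookup; entries copied from the slide example.
SYNONYMS = {
    "title": ["title", "status title", "deed", "claim", "entity", "style"],
    "information": ["data", "entropy", "information"],
    "retrieval": ["retrieval", "recovery"],
}

# Hypothetical set of keywords that occur somewhere in the database.
DB_KEYWORDS = {
    "title", "claim", "entity", "Praveen", "Rao",
    "data", "entropy", "information", "retrieval", "recovery",
}


def keyword_extensions(keyword):
    """Return the keyword together with its synonyms (the keyword alone if none are known)."""
    return SYNONYMS.get(keyword, [keyword])


def valid_keyword_extensions(keyword):
    """Keep only the extensions that exist in the database."""
    return [k for k in keyword_extensions(keyword) if k in DB_KEYWORDS]


def tag_extensions(tag):
    """Full combination of the valid keyword extensions of every keyword in one tag."""
    choices = [valid_keyword_extensions(k) for k in tag.split()]
    return [list(combo) for combo in product(*choices)]


for extension in tag_extensions("information retrieval"):
    print(extension)
```

For the tag "information retrieval" this yields the six keyword pairs listed under Tag Extensions; checking the pairs against actual database tags and documents would be the next steps and is not covered here.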


Page 7: Extending PRIX for  Similarity-based XML Query

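Semantic Similarity Computation

Similarity between query q and one of its extensions q':

$$sim_{query}(q, q') = \frac{1}{n} \sum_{t \in q,\ t' \in q'} sim(t, t')$$

$$sim(t, t') = \frac{1}{m} \sum_{i=1}^{m} x_i$$

$$x_i = \begin{cases} 1, & \text{if } k_i = k_i' \\ \alpha \ (0 \le \alpha < 1), & \text{if } k_i \ne k_i' \end{cases}$$

- t: tag in query q
- t': tag in query q'
- n: number of tags in q
- m: number of keywords in tag t
- k_i, k_i': i-th keyword of t and of t'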
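A minimal sketch of this computation, assuming each tag is a list of keywords, that a query and its extension have the same number of tags and keywords, and that tags are compared position by position. The value `ALPHA = 0.5` and the function names are illustrative choices; the slide only requires 0 <= α < 1.

```python
ALPHA = 0.5  # assumed value for alpha; any value with 0 <= alpha < 1 fits the definition


def tag_similarity(tag, ext_tag, alpha=ALPHA):
    """sim(t, t'): average of x_i, where x_i is 1 for an unchanged keyword and alpha otherwise."""
    if len(tag) != len(ext_tag):
        raise ValueError("a tag and its extension must have the same number of keywords")
    scores = [1.0 if k == k_ext else alpha for k, k_ext in zip(tag, ext_tag)]
    return sum(scores) / len(scores)


def query_similarity(query, ext_query, alpha=ALPHA):
    """sim_query(q, q'): average tag similarity over the n tags of q."""
    if len(query) != len(ext_query):
        raise ValueError("a query and its extension must have the same number of tags")
    total = sum(tag_similarity(t, t_ext, alpha) for t, t_ext in zip(query, ext_query))
    return total / len(query)


q = [["title"], ["Praveen", "Rao"], ["information", "retrieval"]]
q_ext = [["title"], ["Praveen", "Rao"], ["information", "recovery"]]
print(query_similarity(q, q_ext))
```

Here only one of the two keywords in the last tag was replaced, so that tag scores (1 + 0.5) / 2 and the other two tags score 1.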


Page 9: Extending PRIX for  Similarity-based XML Query

Indexing: PRIX (PRüfer sequences for Indexing XML)

[Figure: a document tree and a query tree pattern, each converted to Prüfer sequences by the node removal method.]

Document Tree, nodes as (label, number): (a,6), (b,3), (d,5), (c,2), (f,4), (b,1)

LPS: b, c, b, a, f, d, a

NPS: 1, 2, 3, 6, 4, 5, 6

Query Tree Pattern, nodes as (label, number): (a,3), (b,1), (d,2)

LPS: b, a, d, a

NPS: 1, 3, 2, 3
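The two sequence pairs above can be reproduced with a short sketch. It assumes the nodes are numbered in postorder and that every leaf carries an unnumbered dummy child, which is what the sequences on this slide appear to reflect. With postorder numbering the smallest-numbered leaf is always the next node in postorder, so the removals can be generated in a single pass. The tree encoding `(label, [children])` and the function name are illustrative.

```python
def extended_prufer(tree):
    """Return (LPS, NPS) for a tree given as (label, [children]).

    Assumes postorder numbering and one unnumbered dummy child under each leaf.
    """
    labels, parent, is_leaf = {}, {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        child_numbers = [visit(child) for child in children]
        counter += 1  # postorder: a node is numbered after all of its children
        labels[counter] = label
        is_leaf[counter] = not children
        for number in child_numbers:
            parent[number] = counter
        return counter

    root = visit(tree)
    lps, nps = [], []
    for i in range(1, root + 1):
        if is_leaf[i]:
            # Removing the dummy child records the leaf itself.
            lps.append(labels[i])
            nps.append(i)
        if i != root:
            # Removing node i records its parent.
            lps.append(labels[parent[i]])
            nps.append(parent[i])
    return lps, nps


document = ("a", [("b", [("c", [("b", [])])]), ("d", [("f", [])])])
query = ("a", [("b", []), ("d", [])])
print(extended_prufer(document))  # (['b', 'c', 'b', 'a', 'f', 'd', 'a'], [1, 2, 3, 6, 4, 5, 6])
print(extended_prufer(query))     # (['b', 'a', 'd', 'a'], [1, 3, 2, 3])
```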

Page 10: Extending PRIX for  Similarity-based XML Query

Indexing: PRIX (PRüfer sequences for Indexing XML)

AD-Label (Ancestor-Descendant)

[Figure: tree whose nodes carry AD-labels: a (0,7), b (1,4), c (2,3), d (5,6).]

Indexing structure in DB

- "name" tag B+-Tree
- node B+-Tree
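One common way to obtain labels like these is a depth-first traversal with a single counter that ticks on entering and on leaving a node. The sketch below assumes that scheme and a tree shape (a with children b and d, b with child c) inferred from how the intervals nest. Neither is stated on the slide.

```python
def ad_labels(tree):
    """Assign (start, end) labels from one DFS counter; tree is (label, [children])."""
    result = []
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        start = counter
        counter += 1
        for child in children:
            visit(child)
        end = counter
        counter += 1
        result.append((label, start, end))

    visit(tree)
    return sorted(result, key=lambda entry: entry[1])


def is_ancestor(anc, desc):
    """anc is an ancestor of desc when its interval strictly contains desc's interval."""
    return anc[1] < desc[1] and desc[2] < anc[2]


tree = ("a", [("b", [("c", [])]), ("d", [])])
labels = ad_labels(tree)
print(labels)  # [('a', 0, 7), ('b', 1, 4), ('c', 2, 3), ('d', 5, 6)]
```

With interval labels the ancestor-descendant test is a pair of integer comparisons, which is why such labels are convenient to keep in an index.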

Page 11: Extending PRIX for  Similarity-based XML Query

Query Processing

Procedure

- Filtering
  - Based on subsequence matching
  - O(n*n*m): n is the number of nodes in the document; m is the number of nodes in the query.
- Refinement
  - Connectivity
  - Gap Consistency
  - Frequency Consistency

Page 12: Extending PRIX for  Similarity-based XML Query

Subsequence Matching

Definition

- Example:

* Good results: media, mult, mm, ted, tia, etc…

- Why does it work?
- It is not enough on its own; more refinements are needed…
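As a sketch of the filtering idea, the snippet below enumerates every way a query LPS occurs as a subsequence of a document LPS. It uses the sequences from the indexing slide (document LPS b, c, b, a, f, d, a and query LPS b, a, d, a). The enumeration is deliberately naive and is not the indexed filtering algorithm of PRIX; the function name is illustrative.

```python
def subsequence_matches(doc, query, start=0, prefix=()):
    """Yield every tuple of increasing positions in doc whose labels spell out query."""
    if len(prefix) == len(query):
        yield prefix
        return
    wanted = query[len(prefix)]
    for pos in range(start, len(doc)):
        if doc[pos] == wanted:
            yield from subsequence_matches(doc, query, pos + 1, prefix + (pos,))


doc_lps = ["b", "c", "b", "a", "f", "d", "a"]
doc_nps = [1, 2, 3, 6, 4, 5, 6]
query_lps = ["b", "a", "d", "a"]

matches = list(subsequence_matches(doc_lps, query_lps))
for positions in matches:
    # Show the document node numbers picked out by each match.
    print(positions, [doc_nps[p] for p in positions])
```

Two matches come back: one picks the b numbered 1 and one picks the b numbered 3. Label matching alone cannot tell them apart, which is why the refinement steps are applied afterwards.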

Page 13: Extending PRIX for  Similarity-based XML Query

Refinement #1

Concept of Dummy Nodes

- PRIX offers only a partial match

- Solution: extend PRIX to the leaf level

- Example:

Page 14: Extending PRIX for  Similarity-based XML Query


Refinement #2

Connection vs Connectionless

- Definition

- How to check it?

- If not connected, then what?

- Solution: apply penalty

- Example (Disconnected By Gap):

- Example (Disconnected By Unknown):

Page 15: Extending PRIX for  Similarity-based XML Query

15

Refinement #3

Checking for Gap Consistency

- Gap Consistency depends on gaps of prüfer sequence

- How to check it?

- Determines if query tree is subset of searching domain
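A possible check, sketched under an assumed rule: for every pair of adjacent positions, the gap in the matched subsequence's NPS and the gap in the query's NPS must have the same sign (or both be zero), and the query gap must not be larger in magnitude. The slide does not spell the rule out, so both the rule and the function name are assumptions.

```python
def gap_consistent(nps_s, nps_q):
    """Check the assumed gap-consistency rule between a subsequence NPS and the query NPS."""
    if len(nps_s) != len(nps_q):
        raise ValueError("both sequences must have the same length")
    for i in range(len(nps_q) - 1):
        gap_s = nps_s[i] - nps_s[i + 1]
        gap_q = nps_q[i] - nps_q[i + 1]
        same_sign = (gap_s > 0) == (gap_q > 0) and (gap_s < 0) == (gap_q < 0)
        if not same_sign or abs(gap_q) > abs(gap_s):
            return False
    return True


query_nps = [1, 3, 2, 3]
print(gap_consistent([3, 6, 5, 6], query_nps))  # True
print(gap_consistent([2, 3, 6, 6], query_nps))  # False: first gap is smaller than the query's
```

The first subsequence is taken from the document NPS on the indexing slide; the second is a made-up sequence used only to show a failing case.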

Page 16: Extending PRIX for  Similarity-based XML Query

16

Refinement #4

Checking for Frequency Consistency

- Frequency consistency depends on Gap Consistency and occurrences of NPS

- How to check it?

- Determines if query tree is exact match in searching domain

- If not frequency consistent, then what?

- Solution: apply penalty

Page 17: Extending PRIX for  Similarity-based XML Query

Structure Similarity

- Calculations are based on edit distances, which are transformed into penalty values
- Each mismatched node in the structure has a penalty equal to the size of its subtree + 1
- The overall penalty is the dot product of all mismatches
- All results are normalized with respect to the worst-case penalty

Page 18: Extending PRIX for  Similarity-based XML Query

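Structural Similarity #1: Connectivity

$$NPS(S_i) = \{s_{i,1}, s_{i,2}, \ldots, s_{i,n}\}$$

Helper terms: $Size_{children}(x)$, $Last(s_j)$, and $Re_{parent\&child}(s_i, s_j)$.

[The slide's definitions of these helper terms are not legible in this transcript.]

$$sim_{connection}(S_k) = \sum_{i=1}^{n-1} Last(s_{k,i}) \, Re_{parent\&child}(s_{k,i}, s_{k,i+1}) \, Size_{children}(s_{k,i})$$

Here $S_k$ ranges over the subsequences returned by the filter, where m is the number of the subsequences from the filter.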

Page 19: Extending PRIX for  Similarity-based XML Query

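Structural Similarity #2: Gap Similarity

$$NPS(Q) = \{q_1, q_2, \ldots, q_n\}$$

$$NPS(S_i) = \{s_{i,1}, s_{i,2}, \ldots, s_{i,n}\}$$

$sgn_{gap}(x)$ is a piecewise function of x; one of its two cases evaluates to 0.

$$sim_{gap}(Q, S_t) = \sum_{i=1}^{n-1} sgn_{gap}(\ldots)$$

where the argument of $sgn_{gap}$ is built from the adjacent entries $s_{t,i}, s_{t,i+1}$ of the subsequence and $q_i, q_{i+1}$ of the query.

[The cases of $sgn_{gap}$ and the operators inside its argument are not legible in this transcript.]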

Page 20: Extending PRIX for  Similarity-based XML Query

Structural Similarity #3: Frequency Similarity

$Pos(Q, i) = (b_{i,1}, b_{i,2}, \ldots, b_{i,n})$, with $b_{i,j} \in \{0, 1\}$ and $j = 1, \ldots, n$, describes the positional information of the i-th element in the NPS of Q. When $b_{i,j} = 1$, it represents that $q_i$ equals $q_j$.

$num(Pos(Q, i))$ represents the number of '1's in $Pos(Q, i)$.

$sim_{frequency}(Q, S_t)$ is a sum over the positions i of $num(\cdot)$ applied to a combination of $Pos(S_t, i)$ and $Pos(Q, i)$.

[The operator combining the two position vectors is not legible in this transcript.]
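The position vectors can be sketched directly from the definition of Pos. The mismatch count below combines the two vectors with XOR, which is an assumption made for illustration because the combining operator is not legible on the slide. Indices are 0-based in the code and 1-based on the slide, and all function names are illustrative.

```python
def pos(nps, i):
    """Pos vector of element i: entry j is 1 when nps[j] equals nps[i], else 0."""
    return [1 if value == nps[i] else 0 for value in nps]


def num(bits):
    """Number of 1s in a position vector."""
    return sum(bits)


def frequency_consistent(nps_s, nps_q):
    """True when equal numbers occur at exactly the same positions in both sequences."""
    return all(pos(nps_s, i) == pos(nps_q, i) for i in range(len(nps_q)))


def frequency_mismatch(nps_s, nps_q):
    """Count differing Pos entries over all positions (XOR is an assumed choice)."""
    total = 0
    for i in range(len(nps_q)):
        differing = [a ^ b for a, b in zip(pos(nps_s, i), pos(nps_q, i))]
        total += num(differing)
    return total


query_nps = [1, 3, 2, 3]
print(pos(query_nps, 1))                             # [0, 1, 0, 1]
print(frequency_consistent([3, 6, 5, 6], query_nps)) # True
print(frequency_mismatch([3, 6, 5, 5], query_nps))   # 4
```

A mismatch count of 0 corresponds to a frequency-consistent subsequence; larger counts could feed the penalty mentioned in Refinement #4.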


Page 22: Extending PRIX for  Similarity-based XML Query

Rank returned XML patterns

Similarity(q, q'') = Semantic_sim(q, q') * Structure_sim(q', q'')
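A minimal ranking sketch based on this product. The candidate tuples, their scores, and the function name are made up for illustration; in the system the two factors would come from the semantic and structural similarity computations.

```python
def rank_patterns(candidates):
    """Sort candidates by semantic_sim * structure_sim, highest first.

    candidates: list of (pattern_id, semantic_sim, structure_sim) tuples.
    """
    scored = [(pattern_id, semantic * structure)
              for pattern_id, semantic, structure in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: (pattern id, Semantic_sim(q, q'), Structure_sim(q', q''))
candidates = [("p1", 1.0, 0.5), ("p2", 0.75, 1.0), ("p3", 0.5, 0.5)]
for pattern_id, score in rank_patterns(candidates):
    print(pattern_id, score)
```

Because the score is a product, a pattern needs to do reasonably well on both factors to rank highly; a perfect semantic match with a poor structural match (p1) falls below a slightly weaker semantic match with a perfect structure (p2).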

Page 23: Extending PRIX for  Similarity-based XML Query

Advantages of the approach

- PRIX indexing
  - Faster
  - Captures all structural information
- Similarity based
  - Structure similarity
  - Semantic similarity

Page 24: Extending PRIX for  Similarity-based XML Query

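Limitations and Extensions

Limitation of PRIX: ordering of nodes

- We need to handle it in query extension

[Figure: two trees with the same root a whose children appear in different orders, (b, c) and (c, b), map to different sequences: baca and caba.]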

Page 25: Extending PRIX for  Similarity-based XML Query

Limitations and Extensions

More limitations of PRIX:

- It is difficult to map intuitive structural similarities in trees to sequence similarities in PRIX sequences, and thus difficult to give accurate definitions of the similarity

However:

- Translating tree structures into equivalent sequences, and then doing data mining or similarity matching on the sequences, is a promising direction

Page 26: Extending PRIX for  Similarity-based XML Query

Limitations and Extensions

Limitations of semantic similarity:

- Too many similar results

However:

- We consider semantic similarity together with structure information

In a broad sense:

- Structure similarity
- Semantic similarity
- Syntax similarity
- Similarity information from co-occurrences of keywords
- Similarity information from user feedback
- Similarity information from metadata (DTD, data source, region, language, link structure of XML files, etc.)