
Extending PRIX for Similarity-based XML Query

Group Members: Yan Qi, Jicheng Zhao, Dan Situ, Ning Liao

Agenda

- System Architecture Introduction
- Semantic-based Similarity Search
  - Query Expansion
  - Semantic Similarity Computation
- Structural-based Similarity Search
  - Adapting the PRIX algorithm
    - Indexing
    - Query Processing
  - Structural Similarity Computation
- Similarity Computation and Ranking
- Discussion & Conclusion

System Architecture Introduction

[Architecture diagram. Components: Data Parser, Data Storage Manager, Index Manager, Metadata Manager, Query Parser, Query Extending, Query Processing, Results Ranking, Berkeley DB. Inputs: XML data, XQuery. Output: Query Result. Intermediate label: Query Extensions. Two flows are drawn, the Loading XML Flow and the XML Query Flow; labels 1 - 8 mark the XML query steps.]


Query Expansion (I)

An Example:

Tags in a sample query
{title, Praveen Rao, information retrieval}

Keywords
{title, Praveen, Rao, information, retrieval}

Keyword Extensions
{{title, status title, deed, claim, entity, style}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}

Valid Keyword Extensions
{{title, claim, entity}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}

(Continued on the next slide)

Query expansion pipeline:

1. XML Query → Tag sequence: extract all the tags from the query.
2. Tag sequence → Keyword sequence: extract the meaningful keywords for each tag in the tag sequence (ignore stop words).
3. Keyword sequence → Keyword extensions: get keyword extensions for each keyword in the keyword sequence, based on WordNet.
4. Keyword extensions → Valid keyword extensions: remove the keywords that do not exist in the database.
5. Valid keyword extensions → Tag extensions: take the full combination of the keyword extensions in each tag.
6. Tag extensions → Valid tag extensions: remove a tag extension if not all of its keywords appear at least once in a tag of the database; replace each remaining tag extension with the database tag whose keyword set is a superset of its keyword set.
7. Valid tag extensions → Query extensions: take the full combination of all the tags in the original query.
8. Query extensions → Valid query extensions: remove the query extensions whose tags do not appear in the same XML document of the database.

Query Expansion (II)

Tag Extensions
{{title}, {claim}, {entity}, {Praveen}, {Rao}, {data, retrieval}, {data, recovery}, {information, retrieval}, {information, recovery}, {entropy, retrieval}, {entropy, recovery}}

Valid Tag Extensions
{{title}, {A claim on theory of computation}, {entity}, {Praveen Rao}, {modern information retrieval}, {A survey on information retrieval}, {information recovery}}

Query Expansions
1. { {title}, {Praveen Rao}, {modern information retrieval} }
2. { {A claim on theory of computation}, {Praveen Rao}, {modern information retrieval} }
……

Valid Queries
{ {title}, {Praveen Rao}, {modern information retrieval} }
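The keyword-level part of the expansion (getting keyword extensions, filtering them against the database, and taking the full combination per tag) can be sketched as follows. This is only an illustrative sketch: `SYNONYMS` is a hypothetical hard-coded table standing in for a WordNet lookup, and `DB_KEYWORDS` is a hypothetical database vocabulary chosen so that the filter reproduces the example's valid keyword extensions. Neither name comes from the slides.

```python
from itertools import product

# Hypothetical synonym table standing in for a WordNet lookup; the entries
# are copied from the "Keyword Extensions" example.
SYNONYMS = {
    "title": ["title", "status title", "deed", "claim", "entity", "style"],
    "information": ["data", "entropy", "information"],
    "retrieval": ["retrieval", "recovery"],
}

# Hypothetical set of keywords that occur in the database (an assumption).
DB_KEYWORDS = {
    "title", "claim", "entity", "Praveen", "Rao",
    "data", "entropy", "information", "retrieval", "recovery",
}


def keyword_extensions(keyword):
    """Return the extensions of a keyword, or the keyword itself if none are known."""
    return SYNONYMS.get(keyword, [keyword])


def valid_keyword_extensions(keyword):
    """Keep only the extensions that occur in the database."""
    return [k for k in keyword_extensions(keyword) if k in DB_KEYWORDS]


def tag_extensions(tag_keywords):
    """Full combination of the valid keyword extensions of one tag."""
    choices = [valid_keyword_extensions(k) for k in tag_keywords]
    return [tuple(combo) for combo in product(*choices)]
```

For the tag "information retrieval", `tag_extensions(["information", "retrieval"])` yields the six keyword pairs listed under Tag Extensions. Validating tag extensions and query extensions against the database would follow the same filter-then-combine pattern.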


Semantic Similarity Computation

Similarity between query q and one of its extensions q’:

$$ sim_{query}(q, q') = \frac{1}{n} \sum_{t \in q,\ t' \in q'} sim(t, t') $$

$$ sim(t, t') = \frac{1}{m} \sum_{i=1}^{m} x_i $$

$$ x_i = \begin{cases} 1, & \text{if } k_i = k_i' \\ \alpha \ (0 \le \alpha < 1), & \text{if } k_i \ne k_i' \end{cases} $$

- t: tag in query q
- t’: tag in query q’
- n: number of tags in q
- m: number of keywords in tag t
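A minimal sketch of this computation, assuming that each tag of q is paired positionally with a tag of q’ and that their keywords are paired positionally as well (the slides do not spell out the pairing). The function names and the default value of `alpha` are illustrative choices, not taken from the slides.

```python
def tag_similarity(keywords, ext_keywords, alpha=0.5):
    """sim(t, t'): mean of x_i, where x_i is 1 for equal keywords and alpha otherwise."""
    if not keywords or len(keywords) != len(ext_keywords):
        raise ValueError("tags must have the same, non-zero number of keywords")
    scores = [1.0 if k == k_ext else alpha
              for k, k_ext in zip(keywords, ext_keywords)]
    return sum(scores) / len(scores)


def query_similarity(query, extension, alpha=0.5):
    """sim_query(q, q'): mean tag similarity over the n tags of q."""
    if not query or len(query) != len(extension):
        raise ValueError("queries must have the same, non-zero number of tags")
    total = sum(tag_similarity(t, t_ext, alpha)
                for t, t_ext in zip(query, extension))
    return total / len(query)
```

With q = [[title], [Praveen, Rao], [information, retrieval]] and an extension that replaces "retrieval" with "recovery", the three tag similarities are 1, 1 and (1 + α)/2.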


Indexing: PRIX (PRüfer sequences for Indexing Xml)

[Figure: Document Tree with nodes (a,6), (b,3), (d,5), (c,2), (f,4), (b,1) and dummy leaves "-". Node removal method:]
LPS: b, c, b, a, f, d, a
NPS: 1, 2, 3, 6, 4, 5, 6

[Figure: Query Tree Pattern with nodes (a,3), (b,1), (d,2). Node removal method:]
LPS: b, a, d, a
NPS: 1, 3, 2, 3
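The two sequences can be reproduced with a short sketch of the node removal method. It assumes that nodes are numbered in postorder, that every leaf gets one dummy child, and that the document tree has the shape a(b(c(b)), d(f)); that shape is inferred from the LPS/NPS pair above, not stated in the figure residue. `Node` and `extended_prufer` are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    number: int = 0  # postorder number, filled in by assign_postorder


def assign_postorder(node, counter=None):
    """Number the nodes 1..n in postorder."""
    counter = counter if counter is not None else [0]
    for child in node.children:
        assign_postorder(child, counter)
    counter[0] += 1
    node.number = counter[0]


def extended_prufer(root):
    """Return (LPS, NPS) for a tree whose leaves each carry one dummy child."""
    assign_postorder(root)
    lps, nps = [], []

    def visit(node):
        if not node.children:
            # Removing the dummy child of a leaf records the leaf itself.
            lps.append(node.label)
            nps.append(node.number)
        for child in node.children:
            visit(child)
            # Removing a child records its parent.
            lps.append(node.label)
            nps.append(node.number)

    visit(root)
    return lps, nps


# Assumed shapes of the two trees in the figure.
document_tree = Node("a", [Node("b", [Node("c", [Node("b")])]),
                           Node("d", [Node("f")])])
query_tree = Node("a", [Node("b"), Node("d")])
```

Under these assumptions, `extended_prufer(document_tree)` and `extended_prufer(query_tree)` give the LPS and NPS listed for the document tree and the query tree pattern.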

Indexing: PRIX (PRüfer sequences for Indexing Xml)

- AD-Label (Ancestor-Descendant)
- Indexing structure in DB

[Figure: "name" tag B+-Tree and node B+-Tree; tree nodes with AD-labels a (0,7), b (1,4), d (5,6), c (2,3)]
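One common way to use labels of this kind is interval containment: a node is an ancestor of another when its interval strictly encloses the other's. The sketch below assumes that the AD-labels in the figure are such (start, end) intervals, which the slide does not state explicitly; `ADLabel` and `is_ancestor` are illustrative names.

```python
from typing import NamedTuple


class ADLabel(NamedTuple):
    start: int
    end: int


def is_ancestor(ancestor: ADLabel, descendant: ADLabel) -> bool:
    """True when the ancestor's interval strictly contains the descendant's."""
    return ancestor.start < descendant.start and descendant.end < ancestor.end


# Labels read off the figure.
labels = {
    "a": ADLabel(0, 7),
    "b": ADLabel(1, 4),
    "c": ADLabel(2, 3),
    "d": ADLabel(5, 6),
}
```

Under this reading, a encloses b, c and d, b encloses c, and b and d are unrelated, so an ancestor-descendant test needs only two integer comparisons.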

Query Processing

Procedure

- Filtering
  - Based on subsequence matching
  - O(n*n*m): n is the number of nodes in the document; m is the number of nodes in the query.
- Refinement
  - Connectivity
  - Gap Consistency
  - Frequency Consistency


Subsequence Matching

Definition

- Example:

* Good results: media, mult, mm, ted, tia, etc…

- Why does it work?

- It is not enough on its own; more refinements are needed…
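A subsequence test of the kind the filter relies on can be sketched in a few lines. The slide's example sequence did not survive extraction; the string "multimedia" used in the usage note is an assumption, chosen because all the listed good results are subsequences of it. The function name is illustrative.

```python
def is_subsequence(pattern, sequence):
    """True when pattern's items occur in sequence in the same order, gaps allowed."""
    remaining = iter(sequence)
    # `symbol in remaining` advances the iterator past the first match,
    # so later symbols can only match later positions.
    return all(symbol in remaining for symbol in pattern)
```

For example, `is_subsequence("ted", "multimedia")` holds while `is_subsequence("dem", "multimedia")` does not. The same function works on label lists, so the query LPS b, a, d, a can be checked against the document LPS b, c, b, a, f, d, a. This single check is a linear scan and only decides whether a match exists; enumerating all matching subsequences, as a filter must, costs more.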


Refinement #1

Concept of Dummy Nodes

- PRIX offers only partial match

- Solution: extend PRIX to the leaf level

- Example:


Refinement #2

Connection vs Connectionless

- Definition

- How to check it?

- If not connected, then what?

- Solution: apply penalty

- Example (Disconnected By Gap):

- Example (Disconnected By Unknown):

Refinement #3

Checking for Gap Consistency

- Gap consistency depends on the gaps of the Prüfer sequence

- How to check it?

- Determines whether the query tree is a subset of the search domain

Refinement #4

Checking for Frequency Consistency

- Frequency consistency depends on gap consistency and on the occurrences in the NPS

- How to check it?

- Determines whether the query tree is an exact match in the search domain

- If not frequency consistent, then what?

- Solution: apply penalty

Structure Similarity

- Calculations are based on edit distances, which are transformed into penalty values
- Each mismatched node in the structure has a penalty equal to the size of its subtree + 1
- The overall penalty is the dot product of all mismatches
- All results are normalized with respect to the worst-case penalty

Structural Similarity #1: Connectivity

$NPS(S_i) = \{ s_{i,1}, s_{i,2}, \dots, s_{i,n} \}$

[Equation: $sim_{connection}(S_k)$, a sum over $i$ combining $Last(s_{k,i})$, $Rel_{parent\&child}(s_{k,i}, s_{k,i+1})$ and $Size_{children}(s_{k,i})$]

[Expression in m and n, where m is the number of the subsequences from the filter.]

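Structural Similarity #2: Gap Similarity

$NPS(Q) = \{ q_1, q_2, \dots, q_n \}$

$NPS(S_i) = \{ s_{i,1}, s_{i,2}, \dots, s_{i,n} \}$

[Equation: $sgn_{gap}(x)$, a piecewise function of $x$ with one case equal to 0]

[Equation: $sim_{gap}(Q, S_t)$, a sum over $i$ of $sgn_{gap}$ applied to $s_{t,i+1}$, $s_{t,i}$, $q_{i+1}$ and $q_i$]

[Expression in n and m]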

Structural Similarity #3: Frequency Similarity

$Pos(Q, i) = (b_{i,1}, b_{i,2}, \dots, b_{i,n})$, with $b_{i,j} \in \{0, 1\}$ and $j = 1, \dots, n$, describes the positional information of the i-th element in the NPS of Q. When $b_{i,j} = 1$, it represents that $q_i$ equals $q_j$.

$num(Pos(Q, i))$ represents the number of 1s in $Pos(Q, i)$.

[Equation: $sim_{frequency}(Q, S_t)$, a sum over $i$ of $num$ applied to a combination of $Pos(S_t, i)$ and $Pos(Q, i)$]

[Expression in n and m]
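The positional vectors are easy to compute directly from an NPS. In the sketch below, `pos` and `num` follow the definitions on this slide, while `pattern_mismatches` is an assumption: it compares the two vectors bit by bit (XOR) and counts the differences, which is one plausible reading of how Pos(S, i) and Pos(Q, i) are combined. The exact operator did not survive in the slide text.

```python
def pos(nps, i):
    """Pos(Q, i) for a 1-based index i: bit j is 1 when q_i equals q_j."""
    target = nps[i - 1]
    return tuple(1 if value == target else 0 for value in nps)


def num(bits):
    """Number of 1s in a positional vector."""
    return sum(bits)


def pattern_mismatches(query_nps, candidate_nps):
    """Assumed combination: count the bits where the two Pos vectors differ."""
    if len(query_nps) != len(candidate_nps):
        raise ValueError("sequences must have the same length")
    total = 0
    for i in range(1, len(query_nps) + 1):
        differing = tuple(a ^ b for a, b in
                          zip(pos(candidate_nps, i), pos(query_nps, i)))
        total += num(differing)
    return total
```

For the query NPS 1, 3, 2, 3, `pos` at i = 2 is (0, 1, 0, 1). The candidate 1, 6, 5, 6 repeats values at the same positions and gives zero mismatches, whereas 1, 6, 5, 4 does not.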


Rank returned XML patterns

Similarity (q, q’’)= Semantic_sim(q, q’) * Structure_sim (q’, q’’)
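A minimal sketch of the ranking step, assuming each returned pattern q’’ already carries the semantic similarity of the query extension q’ it was matched with and its own structure similarity. The tuple layout and the name `rank_results` are illustrative, not taken from the slides.

```python
def rank_results(candidates):
    """Rank patterns by Semantic_sim(q, q') * Structure_sim(q', q''), best first.

    candidates: iterable of (pattern_id, semantic_sim, structure_sim) tuples.
    """
    scored = [(pattern_id, semantic * structure)
              for pattern_id, semantic, structure in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because the two factors are multiplied, a pattern needs a high score on both to rank well: a perfect structural match of a weak semantic extension can still fall below a slightly imperfect match of the original query.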

Advantages of the approach

- PRIX indexing
  - Faster
  - Captures all structural information
- Similarity-based
  - Structure similarity
  - Semantic similarity

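Limitations and Extensions

- Limitation of PRIX: ordering of nodes
- We need to handle it in query extension

[Figure: two trees with root a whose children b and c appear in opposite orders; the corresponding sequences are baca and caba]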

Limitations and Extensions

- More limitations of PRIX: it is difficult to map intuitive structural similarities between trees to similarities between PRIX sequences, and thus difficult to define the similarity accurately
- However: translating tree structures into equivalent sequences, and then doing data mining or similarity matching on those sequences, is a promising direction

Limitations and Extensions

- Limitations of semantic similarity
  - Too many similar results
  - However: we consider semantic similarity together with structure information
- In a broad sense:
  - Structure similarity
  - Semantic similarity
  - Syntax similarity
  - Similarity information from co-occurrences of keywords
  - Similarity information from user feedback
  - Similarity information from metadata (DTD, data source, region, language, link structure of XML files, etc.)
