1
Extending PRIX for Similarity-based XML Query
Group Members: Yan Qi, Jicheng Zhao, Dan Situ, Ning Liao
2
Agenda
- System Architecture Introduction
- Semantic-based Similarity Search
  - Query Expansion
  - Semantic Similarity Computation
- Structural-based Similarity Search
  - Adapting PRIX algorithm
    - Indexing
    - Query Processing
  - Structural Similarity Computation
- Similarity Computation and Ranking
- Discussion & Conclusion
3
System Architecture Introduction
[Figure: system architecture. Components: Data Parser, Data Storage Manager, Index Manager, Metadata Manager, Berkeley DB, Query Parser, Query Extending (produces Query Extensions), Query Processing, Results Ranking. Inputs and outputs: XML data, XQuery, Query Result. Two flows are drawn: the Loading XML Flow and the XML Query Flow, whose steps are numbered 1 - 8.]
5
Query Expansion (I)
An Example:
Tags in a sample query
{title, Praveen Rao, information retrieval}
Keywords
{title, Praveen, Rao, information, retrieval}
Keyword Extensions
{{title, status title, deed, claim, entity, style}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}
Valid Keyword Extensions
{{title, claim, entity}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}
(Continued on the next slide)
Query expansion pipeline (each step takes the output of the previous one):
1. XML Query -> Tag sequence: extract all the tags from the query.
2. Tag sequence -> Keyword sequence: extract the meaningful keywords for each tag in the tag sequence (ignore stop words).
3. Keyword sequence -> Keyword extensions: get keyword extensions for each keyword in the keyword sequence based upon WordNet.
4. Keyword extensions -> Valid Keyword extensions: remove the keywords that do not exist in the database.
5. Valid Keyword extensions -> Tag extensions: take the full combination of the keyword extensions in each tag.
6. Tag extensions -> Valid Tag extensions: remove a tag extension if not all of its keywords appear at least once in a tag of the database; replace each remaining tag extension with the database tag whose keyword set is a superset of the tag extension's keyword set.
7. Valid Tag extensions -> Query Extensions: take the full combination over all the tags in the original query to get query extensions.
8. Query Extensions -> Valid Query Extensions: remove the query extensions whose tags do not appear in the same XML document of the database.
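The expansion steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SYNONYMS` is a hypothetical hand-written table standing in for WordNet, `DOCS` is a hypothetical toy database of tag contents, and the function names are invented for this sketch rather than taken from the system described in the slides.

```python
from itertools import product

# Hypothetical toy data: SYNONYMS stands in for WordNet, DOCS for the database.
SYNONYMS = {
    "title": ["title", "deed", "claim", "entity", "style"],
    "information": ["data", "entropy", "information"],
    "retrieval": ["retrieval", "recovery"],
}
STOP_WORDS = {"a", "an", "the", "on", "of"}
DOCS = [
    ["title", "Praveen Rao", "modern information retrieval"],
    ["A claim on theory of computation", "entity"],
]
DB_TAGS = [tag for doc in DOCS for tag in doc]


def keywords(tag):
    """Split a tag into lowercase keywords, ignoring stop words."""
    return [w for w in tag.lower().split() if w not in STOP_WORDS]


DB_KEYWORDS = {w for tag in DB_TAGS for w in keywords(tag)}


def expand_tag(tag):
    """Return the database tags that can stand in for one query tag."""
    # Keyword extensions, keeping only keywords that exist in the database.
    per_keyword = [
        [s for s in SYNONYMS.get(kw, [kw]) if s in DB_KEYWORDS]
        for kw in keywords(tag)
    ]
    valid = []
    # Full combination of the keyword extensions gives candidate tag extensions.
    for combo in product(*per_keyword):
        # Keep database tags whose keyword set is a superset of the candidate.
        for db_tag in DB_TAGS:
            if set(combo) <= set(keywords(db_tag)) and db_tag not in valid:
                valid.append(db_tag)
    return valid


def expand_query(tags):
    """Combine tag extensions; keep queries whose tags share one document."""
    candidates = product(*(expand_tag(t) for t in tags))
    return [list(q) for q in candidates
            if any(set(q) <= set(doc) for doc in DOCS)]


valid_queries = expand_query(["title", "Praveen Rao", "information retrieval"])
```

With this toy data, `expand_tag("title")` yields three candidate tags, and only the query whose tags all occur in the first document survives the final filter.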
6
Query Expansion (II)
Tag Extensions
{{title}, {claim}, {entity}, {Praveen}, {Rao}, {data, retrieval}, {data, recovery}, {information, retrieval}, {information, recovery}, {entropy, retrieval}, {entropy, recovery}}
Valid Tag Extensions
{{title}, {A claim on theory of computation}, {entity}, {Praveen Rao}, {modern information retrieval}, {A survey on information retrieval}, {information recovery}}
Query Expansions
1. { {title}, {Praveen Rao}, {modern information retrieval} }
2. { {A claim on theory of computation}, {Praveen Rao}, {modern information retrieval} } ……
Valid Queries
{ {title}, {Praveen Rao}, {modern information retrieval} }
7
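Semantic Similarity Computation
Similarity between query q and one of its extensions q’:
$$sim_{query}(q, q') = \frac{1}{n} \sum_{t \in q,\; t' \in q'} sim(t, t')$$
$$sim(t, t') = \frac{1}{m} \sum_{i=1}^{m} x_i$$
$$x_i = \begin{cases} 1, & \text{if } k_i = k_i' \\ \alpha \;(0 \le \alpha < 1), & \text{if } k_i \ne k_i' \end{cases}$$
t: tag in query q
t’: tag in query q’
n: number of tags in q
m: number of keywords in tag t
k_i, k_i’: the i-th keywords of t and t’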
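As a worked example of the semantic similarity, the sketch below averages x_i over the keywords of each tag and then averages the tag similarities over the n tags of the query. It assumes that tags and keywords of q and q’ are aligned pairwise by position, and the value `ALPHA = 0.5` is an arbitrary choice (the slides only require 0 ≤ α < 1); the function names are illustrative.

```python
ALPHA = 0.5  # assumed value; the slides only state 0 <= alpha < 1


def tag_similarity(keywords, keywords_ext, alpha=ALPHA):
    """sim(t, t') = (1/m) * sum(x_i), with x_i = 1 if k_i == k_i' else alpha."""
    m = len(keywords)
    total = sum(1.0 if k == k2 else alpha
                for k, k2 in zip(keywords, keywords_ext))
    return total / m


def query_similarity(query, query_ext, alpha=ALPHA):
    """sim_query(q, q') = (1/n) * sum of sim(t, t') over aligned tags."""
    n = len(query)
    total = sum(tag_similarity(t, t2, alpha)
                for t, t2 in zip(query, query_ext))
    return total / n


# Each query is a list of tags; each tag is a list of keywords.
q = [["title"], ["praveen", "rao"], ["information", "retrieval"]]
q_ext = [["claim"], ["praveen", "rao"], ["data", "retrieval"]]
score = query_similarity(q, q_ext)  # (0.5 + 1.0 + 0.75) / 3
```

An unchanged query scores 1, and every substituted keyword pulls the score toward α.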
9
Indexing: Prix (PRüfer sequences for Indexing Xml)
[Figure: the node removal method applied to a document tree and to a query tree pattern. Nodes are labeled "tag, number"; "-" marks an unlabeled leaf.]
Document Tree: nodes (a,6), (b,3), (d,5), (c,2), (f,4), (b,1)
- LPS: b, c, b, a, f, d, a
- NPS: 1, 2, 3, 6, 4, 5, 6
Query Tree Pattern: nodes (a,3), (b,1), (d,2)
- LPS: b, a, d, a
- NPS: 1, 3, 2, 3
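The sequences in the figure can be reproduced with a short sketch. It assumes the tree shapes implied by the LPS/NPS pairs (a chain a-b-c-b and a chain a-d-f in the document tree, each ending in an unnumbered "-" leaf), and it relies on the fact that, in a postorder-numbered tree, repeatedly removing the smallest leaf visits the nodes in postorder. The `Node` class and function names are hypothetical.

```python
class Node:
    def __init__(self, label, number=None, children=()):
        self.label = label            # tag name
        self.number = number          # number from the figure; None for "-"
        self.children = list(children)


def prufer_sequences(root):
    """Return (LPS, NPS): the parent's label and number for each removed node."""
    lps, nps = [], []

    def visit(node, parent):
        for child in node.children:
            visit(child, node)
        # Removing `node` records its parent; the root is never removed.
        if parent is not None:
            lps.append(parent.label)
            nps.append(parent.number)

    visit(root, None)
    return lps, nps


def dummy():
    """Unnumbered leaf shown as "-" in the figure."""
    return Node("-")


# Tree shapes inferred from the sequences on the slide (an assumption).
doc = Node("a", 6, [
    Node("b", 3, [Node("c", 2, [Node("b", 1, [dummy()])])]),
    Node("d", 5, [Node("f", 4, [dummy()])]),
])
query = Node("a", 3, [Node("b", 1, [dummy()]), Node("d", 2, [dummy()])])

doc_lps, doc_nps = prufer_sequences(doc)
query_lps, query_nps = prufer_sequences(query)
```

Under these assumptions the document tree gives LPS b, c, b, a, f, d, a with NPS 1, 2, 3, 6, 4, 5, 6, and the query tree gives LPS b, a, d, a with NPS 1, 3, 2, 3.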
10
Indexing: Prix (PRüfer sequences for Indexing Xml)
AD-Label (Ancestor-Descendant)
[Figure: tree whose nodes carry AD-labels: a (0,7), b (1,4), c (2,3), d (5,6)]
Indexing structure in DB
- "name" tag B+-Tree
- node B+-Tree
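A small sketch of how such labels could answer ancestor-descendant questions. It assumes that each AD-label is a (start, end) interval and that an ancestor's interval strictly contains the intervals of its descendants, which is consistent with the four labels in the figure but is not spelled out in the slides.

```python
# AD-labels copied from the figure: tag -> (start, end)
AD_LABELS = {"a": (0, 7), "b": (1, 4), "c": (2, 3), "d": (5, 6)}


def is_ancestor(anc, desc, labels=AD_LABELS):
    """True if the interval of `anc` strictly contains the interval of `desc`."""
    a_start, a_end = labels[anc]
    d_start, d_end = labels[desc]
    return a_start < d_start and d_end < a_end


a_over_c = is_ancestor("a", "c")   # (0,7) contains (2,3)
b_over_d = is_ancestor("b", "d")   # (1,4) does not contain (5,6)
```

The check needs only two comparisons per pair, so no tree traversal is required once the labels are stored.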
11
Query Processing
Procedure
- Filtering
  - Based on subsequence matching
  - O(n*n*m): n is the number of nodes in the document; m is the number of nodes in the query.
- Refinement
  - Connectivity
  - Gap Consistency
  - Frequency Consistency
12
Subsequence Matching
Definition
- Example:
* Good results: media, mult, mm, ted, tia, etc…
Why does it work? It is not enough; more refinements are needed…
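To make subsequence matching concrete, the sketch below enumerates every way a query LPS occurs as a (not necessarily contiguous) subsequence of a document LPS, using the two sequences from the indexing example. It is a naive enumeration written for clarity, not the indexed filtering procedure of the slides, and the function name is hypothetical.

```python
def subsequence_matches(query, doc):
    """Yield every tuple of positions where `query` is a subsequence of `doc`."""
    def extend(qi, start, positions):
        if qi == len(query):
            yield tuple(positions)
            return
        for pos in range(start, len(doc)):
            if doc[pos] == query[qi]:
                # Match query[qi] here, then look for the rest further right.
                yield from extend(qi + 1, pos + 1, positions + [pos])

    yield from extend(0, 0, [])


DOC_LPS = ["b", "c", "b", "a", "f", "d", "a"]
QUERY_LPS = ["b", "a", "d", "a"]
matches = list(subsequence_matches(QUERY_LPS, DOC_LPS))
```

Here the query LPS matches at positions (0, 3, 5, 6) and (2, 3, 5, 6) (0-based). Each match is only a candidate, which is why the refinement steps are still needed.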
13
Refinement #1
Concept of Dummy Nodes
- PRIX offers only a partial match
- Solution: extend PRIX to the leaf level
- Example:
14
Refinement #2
Connection vs Connectionless
- Definition
- How to check it?
- If not connected, then what?
- Solution: apply penalty
- Example (Disconnected By Gap):
- Example (Disconnected By Unknown):
15
Refinement #3
Checking for Gap Consistency
- Gap consistency depends on the gaps of the Prüfer sequence
- How to check it?
- Determines whether the query tree is a subset of the search domain
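One possible way to check gap consistency is sketched below. It assumes that a "gap" is the difference between adjacent NPS numbers and that a matched subsequence is gap consistent when each of its gaps has the same sign as the corresponding gap in the query. The slides do not give the exact criterion, so this is an illustrative reading and not necessarily the authors' definition.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def gaps(nps):
    """Differences between adjacent numbers of an NPS."""
    return [b - a for a, b in zip(nps, nps[1:])]


def is_gap_consistent(query_nps, sub_nps):
    """Assumed criterion: corresponding gaps agree in sign."""
    return all(sign(g) == sign(h)
               for g, h in zip(gaps(query_nps), gaps(sub_nps)))


QUERY_NPS = [1, 3, 2, 3]
consistent = is_gap_consistent(QUERY_NPS, [3, 6, 5, 6])     # gaps 3, -1, 1
inconsistent = is_gap_consistent(QUERY_NPS, [1, 2, 3, 6])   # gaps 1, 1, 3
```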
16
Refinement #4
Checking for Frequency Consistency
- Frequency consistency depends on Gap Consistency and occurrences of NPS
- How to check it?
- Determines whether the query tree is an exact match in the search domain
- If not frequency consistent, then what?
- Solution: apply penalty
17
Structure Similarity
- Calculations are based on edit distances, which are converted into penalty values
- Each mismatched node in the structure has a penalty equal to the size of its subtree + 1
- The overall penalty is the dot product of all mismatches
- All results are normalized with respect to the worst-case penalty
18
Structural Similarity #1: Connectivity
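$NPS(S_i) = \{s_{i,1}, s_{i,2}, \ldots, s_{i,n}\}$
The measure is built from three terms: $Size_{children}(x)$, $Last(s_{i,j})$ and $Re_{parent\&child}(s_i, s_j)$ [their definitions did not survive extraction].
$$sim_{connection}(S_k) = \sum_{i} Last(s_{k,i}) \cdot Re_{parent\&child}(s_{k,i}, s_{k,i+1}) \cdot Size_{children}(s_{k,i})$$
[The summation bounds and any normalization constant are not recoverable from the transcript.]
Complexity: $O(m \cdot n)$, where m is the number of the subsequences from the filter.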
19
Structural Similarity #2: Gap Similarity
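$NPS(Q) = \{q_1, q_2, \ldots, q_n\}$
$NPS(S_i) = \{s_{i,1}, s_{i,2}, \ldots, s_{i,n}\}$
$sgn_{gap}(x)$ is a piecewise function with one case for $x = 0$ (value 0) and one case for $x \ne 0$ [the value of the second case did not survive extraction].
$sim_{gap}(Q, S_t)$ is a sum over i of $sgn_{gap}$ applied to a combination of $s_{t,i+1}$, $s_{t,i}$, $q_{i+1}$ and $q_i$ [the operators joining these terms, the summation bounds and any normalization are not recoverable from the transcript].
Complexity: $O(n \cdot m)$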
20
Structural Similarity #3: Frequency Similarity
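$Pos(Q, i) = (b_{i,1}, b_{i,2}, \ldots, b_{i,n})$, $b_{i,j} \in \{0, 1\}$, $j = 1, \ldots, n$, describes the positional information of the i-th element in the NPS of Q. When $b_{i,j} = 1$, $q_i$ equals $q_j$.
$num(Pos(Q, i))$ is the number of 1s in $Pos(Q, i)$.
$sim_{frequency}(Q, S_t)$ is a sum over i of $num(\cdot)$ applied to a combination of $Pos(S_t, i)$ and $Pos(Q, i)$ [the operator joining the two vectors, the summation bounds and any normalization are not recoverable from the transcript].
Complexity: $O(n \cdot m)$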
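The positional vectors are easy to compute directly. In the sketch below, `pos` and `num` follow the definitions on the slide (with 0-based indices), while `frequency_mismatch` is an assumption: it compares the two Pos vectors with XOR and counts the differing bits, which is only one plausible reading of the garbled formula.

```python
def pos(nps, i):
    """Pos(nps, i): bit j is 1 when element j equals element i (0-based i)."""
    return [1 if x == nps[i] else 0 for x in nps]


def num(bits):
    """Number of 1s in a positional vector."""
    return sum(bits)


def frequency_mismatch(query_nps, sub_nps):
    """Assumed measure: total number of bits where Pos(S, i) and Pos(Q, i) differ."""
    return sum(
        num([a ^ b for a, b in zip(pos(sub_nps, i), pos(query_nps, i))])
        for i in range(len(query_nps))
    )


QUERY_NPS = [1, 3, 2, 3]
same_pattern = frequency_mismatch(QUERY_NPS, [3, 6, 5, 6])    # 6 repeats like 3
broken_pattern = frequency_mismatch(QUERY_NPS, [1, 6, 5, 4])  # repeat is lost
```

A result of 0 means the matched subsequence repeats numbers at exactly the same positions as the query; larger values could feed the penalty mentioned under Refinement #4.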
22
Rank returned XML patterns
Similarity (q, q’’)= Semantic_sim(q, q’) * Structure_sim (q’, q’’)
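The ranking rule can be sketched as a product followed by a sort. The result tuples and pattern identifiers below are made up for illustration; only the multiplication of the semantic and structural scores comes from the slide.

```python
def rank(results):
    """Sort (pattern_id, semantic_sim, structure_sim) by the product of scores."""
    scored = [(pid, sem * struct) for pid, sem, struct in results]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical scores for three returned patterns.
ranked = rank([
    ("p1", 1.0, 0.5),
    ("p2", 0.75, 1.0),
    ("p3", 0.5, 0.5),
])
```

Because the scores are multiplied, a pattern must do reasonably well on both measures to rank highly; here `p2` comes first even though `p1` is a perfect semantic match.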
23
Advantages of the approach
- PRIX indexing
  - Faster
  - Captures all structural information
- Similarity based
  - Structure similarity
  - Semantic similarity
24
Limitations and Extensions
Limitation of Prix: Ordering of nodes
We need to handle it in query extension
[Figure: two trees with root a whose children are ordered (b, c) and (c, b); their sequences differ: baca and caba.]
25
Limitations and Extensions
More Limitations of PRIX:
- It is difficult to map intuitive structural similarities between trees to similarities between PRIX sequences, so it is difficult to define the similarity accurately
However:
- Translating tree structures into equivalent sequences, and then doing data mining or similarity matching on the sequences, is a promising direction
26
Limitations and Extensions
Limitations of Semantic similarity
- Too many similar results
However:
- We consider semantic similarity together with structure information
In broad sense:
- Structure similarity
- Semantic similarity
- Syntax similarity
- Similarity information from co-occurrences of keywords
- Similarity information from user feedback
- Similarity information from metadata (DTD, data source, region, language, link structure of XML files, etc.)