Query Biased Snippet Generation in XML Search
Yu Huang, Ziyang Liu, Yi Chen
Arizona State University
SIGMOD 2008 2
Snippets in Text Search
Snippets are widely used in text search engines to help users quickly identify relevant query results.
Fragment of an XML Search Result
Query: find the apparel retailers in Texas
Keyword search: Texas, apparel, retailer
[Figure: XML tree of one query result. The retailer (name: Brook Brothers, product: apparel) has store entities, including store1 (state: Texas, city: Houston, name: Galleria) and store2 (state: Texas, city: Austin, name: West Village). Each store's merchandises hold clothes entities with fitting (men, women), situation (casual, formal), and category (suit, outwear, sweater) attributes. Further stores and clothes are elided.]
There can be many large search results.
Good snippets help users judge relevance quickly and easily.
A Sample Snippet
[Figure: the snippet, a small tree: retailer (name: Brook Brothers, product: apparel) with one store (state: Texas, city: Houston) whose merchandises include clothes (fitting: men) and clothes (situation: casual, category: outwear).]
From the snippet, we know:
• The corresponding query result contains matches to all keywords.
• The retailer is “Brook Brothers”.
• This retailer has many stores in Houston.
• The clothes featured by this retailer.
It helps us to differentiate this query result from other apparel retailers (e.g. Carter’s).
How to generate good snippets for XML search?
No existing work on XML snippet generation yet.
Challenges and Our Contributions
What are desirable properties of a good snippet?
Identified three properties: self-contained, distinguishable, representative
What information in the query result is significant in order to achieve the properties?
Designed an algorithm to generate a ranked list of significant information - IList
How to generate a snippet to maximally cover the significant information within a size bound?
Proved the NP-hardness of this problem.
Designed an efficient and effective algorithm for snippet generation
eXtract: The first system on snippet generation for XML search
Roadmap
Identifying desirable properties of a good snippet
• Self-contained
• Distinguishable
• Representative
Constructing an information list – IList
• IList is a ranked list of significant information in the query result, chosen in order to achieve the properties.
Building snippets based on IList within a snippet size bound
Experimental evaluation
Conclusions
Self-contained Snippet
Snippets should be self-contained in order to be understandable.
Text search: snippets usually preserve self-contained semantic units: phrases / sentences surrounding keyword matches.
XML search: semantic units should be preserved.
Challenge: What is a semantic unit?
Query Result Fragment (revisited)
Data contain:
• Entities
• Attributes
A self-contained snippet should contain the names of the entities whose attributes appear in the snippet.
Adding keywords and their corresponding entity names to IList.
IList: Texas, apparel, retailer, store
[Figure: the XML query result tree for retailer Brook Brothers, repeated.]
Distinguishable Snippet
Snippets should be distinguishable, so that users can easily differentiate query results.
Text search: the title of the document is included.
XML search: the “key” of the result should be included.
Challenge: What is the key of an XML search result?
Query Result Fragment
We identify two types of entities in a query result:
• Return entities
• Support entities
Inferring return entities:
• An entity whose name or attribute name matches a keyword; otherwise, the highest entity.
Key of a query result: the keys of its return entities.
• We can mine keys of entities.
Adding the key of the query result to IList.
IList: Texas, apparel, retailer, store, Brook Brothers
[Figure: the XML query result tree, repeated, with a return entity and a support entity labeled.]
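The inference rule above can be sketched in a few lines of Python. This is a minimal sketch, not the eXtract implementation: the `Entity` class, the `depth` field, and the exact, case-insensitive matching of keywords against names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                      # entity name, e.g. "store"
    depth: int                     # distance from the root of the query result
    attributes: dict = field(default_factory=dict)  # attribute name -> value

def infer_return_entities(entities, keywords):
    """Names of entities whose name or an attribute name matches a keyword.

    Falls back to the highest (shallowest) entity when nothing matches.
    Matching is exact and case-insensitive, which is an assumption.
    """
    wanted = {k.lower() for k in keywords}
    matched = [e.name for e in entities
               if e.name.lower() in wanted
               or any(a.lower() in wanted for a in e.attributes)]
    if matched:
        return matched
    top = min(entities, key=lambda e: e.depth)
    return [top.name]

# Hypothetical, simplified view of the entities in the sample query result.
entities = [
    Entity("retailer", 0, {"name": "Brook Brothers", "product": "apparel"}),
    Entity("store", 1, {"state": "Texas", "city": "Houston", "name": "Galleria"}),
    Entity("clothes", 3, {"fitting": "men", "situation": "casual"}),
]

print(infer_return_entities(entities, ["Texas", "apparel", "retailer"]))
print(infer_return_entities(entities, ["city"]))
print(infer_return_entities(entities, ["Houston"]))
```

For the query on the slides, only the keyword "retailer" matches an entity name, so the sketch returns the retailer. That agrees with the slides adding the retailer's key, Brook Brothers, to the IList.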
Representative Snippet
Snippets should provide summaries of the query results, so that users can quickly grasp the essence of the results.
Text search: active research area;
sometimes the first and/or last sentence of a paragraph is used as a summary.
XML search: include “dominant features” of query results
Challenges:
• What are features?
• What are dominant features?
Features of Query Result
We define a feature as (entity, attribute, value).
[Figure: the XML query result tree, repeated.]
Some feature statistics, listed as entity: attribute: value: # of occurrences (the entity: attribute pair is the feature type)
store: city: Houston: 6, Austin: 1, Other values (3): 3
clothes: fitting: Men: 600, Women: 360, Children: 40
clothes: situation: Casual: 700, Formal: 300
clothes: category: Outwear: 220, Suit: 120, Skirt: 80, Sweaters: 70, Other values (7): 510
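Statistics of this kind can be gathered with a simple counter over (entity, attribute, value) triples. The sketch below uses a small hypothetical list of occurrences, not the data behind the slides, and groups the counts by (entity, attribute).

```python
from collections import Counter

# Hypothetical (entity, attribute, value) occurrences found in a query result.
occurrences = [
    ("store", "city", "Houston"),
    ("store", "city", "Houston"),
    ("store", "city", "Austin"),
    ("clothes", "fitting", "men"),
    ("clothes", "fitting", "men"),
    ("clothes", "fitting", "women"),
    ("clothes", "situation", "casual"),
]

# Number of occurrences of each feature.
feature_counts = Counter(occurrences)

# Group the counts by (entity, attribute) so values of the same kind sit together.
by_type = {}
for (entity, attribute, value), n in feature_counts.items():
    by_type.setdefault((entity, attribute), {})[value] = n

for (entity, attribute), values in by_type.items():
    print(f"{entity}: {attribute}: {values}")
```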
Dominant Features of Query Result
A feature that occurs often is likely to be dominant.
But this is not always reliable.
Dominance score = (# of occurrences of a feature) / (average # of occurrences of features of the same type)
Dominant features: features with dominance score ≥ 1
[Figure: the XML query result tree and the feature statistics, repeated.]
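As a worked example, the dominance scores can be computed from the feature statistics on the slides. The sketch assumes that an entry such as "Other values (3): 3" means three further distinct values with three occurrences in total. That reading, the `stats` layout, and the function name are all assumptions.

```python
# Feature statistics copied from the slides. Each feature type maps to:
# (counts of the named values, number of other distinct values, their total count).
stats = {
    ("store", "city"): ({"Houston": 6, "Austin": 1}, 3, 3),
    ("clothes", "fitting"): ({"Men": 600, "Women": 360, "Children": 40}, 0, 0),
    ("clothes", "situation"): ({"Casual": 700, "Formal": 300}, 0, 0),
    ("clothes", "category"): (
        {"Outwear": 220, "Suit": 120, "Skirt": 80, "Sweaters": 70}, 7, 510),
}

def dominance_scores(stats):
    """Score = occurrences of a feature / average occurrences within its type."""
    scores = {}
    for ftype, (named, other_distinct, other_total) in stats.items():
        total = sum(named.values()) + other_total
        distinct = len(named) + other_distinct
        average = total / distinct
        for value, count in named.items():
            scores[ftype + (value,)] = count / average
    return scores

scores = dominance_scores(stats)

# Dominant features (score >= 1), highest score first.
dominant = sorted((f for f, s in scores.items() if s >= 1),
                  key=lambda f: -scores[f])
for feature in dominant:
    print(feature, round(scores[feature], 2))
```

Houston occurs only 6 times, but its type averages 2 occurrences per value, so its score of 3.0 is the highest. Under this reading the ranking starts Houston, Outwear, Men, Casual, which matches the order in which these values appear in the IList on the slides. Suit and Women also score above 1 here, although the IList shown on the slides ends at casual.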
Representative Snippet
Adding dominant features to IList in the order of dominance scores.
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
[Figure: the XML query result tree and the feature statistics, repeated.]
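Putting the three steps together, the IList can be assembled by concatenating the keywords, the related entity names, the key of the query result, and the dominant feature values, skipping anything already listed. The function name and its four inputs are hypothetical, and the inputs are taken as given rather than computed.

```python
def build_ilist(keywords, entity_names, result_keys, dominant_values):
    """Concatenate the four groups in order, keeping the first copy of each item."""
    ilist = []
    for item in [*keywords, *entity_names, *result_keys, *dominant_values]:
        if item not in ilist:
            ilist.append(item)
    return ilist

ilist = build_ilist(
    keywords=["Texas", "apparel", "retailer"],
    entity_names=["store", "retailer"],       # entities that the keyword matches belong to
    result_keys=["Brook Brothers"],           # key of the return entity
    dominant_values=["Houston", "outwear", "men", "casual"],
)
print(ilist)
```

With these inputs the sketch reproduces the IList shown on the slides.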
Roadmap
Identifying desirable properties of a good snippet
• Self-contained: related entity names
• Distinguishable: key of query result (return entities)
• Representative: dominant features
Together, these items form the IList.
Constructing an information list – IList
• IList is a ranked list of significant information in the query result, chosen in order to achieve the properties.
Building snippets based on IList within a snippet size bound
Experimental evaluation
Conclusions
Instance Selection Problem
[Figure: the XML query result tree, repeated.]
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
Input: query result R, IList, a snippet size bound B
Output: snippet S
Instance Selection Problem: how do we select node instances in R to form S so that, within bound B, S covers as many IList items as possible in ranked order?
[Figure: the same query result tree, with a good and a bad choice of instances marked.]
Instance Selection Problem
Challenges:
• The cost of covering an IList item is dynamic.
• The number of IList items that can be covered is unknown until the very end.
The Instance Selection Problem is NP-hard.
We designed an efficient and effective greedy algorithm to tackle this problem.
Instance Selection Algorithm
[Figure: the XML query result tree, repeated.]
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
weight: 1, 1, 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64
Path-based instance selection
• Coverage: the entities on the path and their attributes
• Benefit: the total weight of IList items covered
• Cost: the path length
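The weights and the benefit/cost ratio of a single path can be sketched as follows. The weighting rule used here (keywords weigh 1, each later IList item weighs half of the previous one) is read off the numbers on the slide and is an assumption. The function names and the example path are hypothetical.

```python
from fractions import Fraction

def ilist_weights(ilist, num_keywords):
    """Weight 1 for the keywords, then 1/2, 1/4, ... for the items that follow."""
    weights = {}
    for rank, item in enumerate(ilist):
        exponent = max(0, rank - num_keywords + 1)
        weights[item] = Fraction(1, 2 ** exponent)
    return weights

def benefit_cost_ratio(path_items, path_length, weights, covered=frozenset()):
    """Benefit = total weight of IList items newly covered; cost = path length."""
    benefit = sum((weights[i] for i in path_items
                   if i in weights and i not in covered), Fraction(0))
    return benefit / path_length

ilist = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
weights = ilist_weights(ilist, num_keywords=3)

# Hypothetical path of length 3 to a store in Houston, Texas.
ratio = benefit_cost_ratio({"retailer", "store", "Texas", "Houston"}, 3, weights)
print(ratio)
```

The example path covers retailer (1), store (1/2), Texas (1) and Houston (1/8), giving a benefit of 21/8 and a ratio of 7/8.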
Instance Selection Algorithm
[Figure: the XML query result tree, repeated.]
1. For the next uncovered item in IList, choose the path with the highest benefit/cost that covers it.
2. Update the benefits and costs of the other paths.
3. Go to step 1 until the size bound is reached or the whole IList is covered.
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
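The three steps can be sketched as a greedy loop. This is a simplified illustration, not the eXtract implementation. Candidate paths are given as hypothetical sets of edges plus the IList items they cover, the snippet size is counted in edges, and an item is skipped when no path covering it still fits within the bound. All of these are assumptions.

```python
import math

ILIST = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
WEIGHTS = dict(zip(ILIST, [1, 1, 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64]))

# Hypothetical candidate paths: name -> (edges on the path, IList items covered).
PATHS = {
    "retailer": ({("retailer", "name"), ("retailer", "product")},
                 {"retailer", "apparel", "Brook Brothers"}),
    "store1": ({("retailer", "store1"), ("store1", "state"), ("store1", "city")},
               {"retailer", "store", "Texas", "Houston"}),
    "store2": ({("retailer", "store2"), ("store2", "state"), ("store2", "city")},
               {"retailer", "store", "Texas"}),
    "clothes_a": ({("retailer", "store1"), ("store1", "merchandises1"),
                   ("merchandises1", "clothes_a"), ("clothes_a", "fitting")},
                  {"retailer", "store", "men"}),
    "clothes_b": ({("retailer", "store1"), ("store1", "merchandises1"),
                   ("merchandises1", "clothes_b"), ("clothes_b", "situation"),
                   ("clothes_b", "category")},
                  {"retailer", "store", "outwear", "casual"}),
}

def greedy_select(ilist, weights, paths, bound):
    snippet_edges, covered, chosen = set(), set(), []
    for item in ilist:                        # step 1: next uncovered IList item
        if item in covered:
            continue
        best_name, best_ratio = None, -1.0
        for name, (edges, items) in paths.items():
            if item not in items:
                continue
            # Cost is dynamic: only edges not yet in the snippet count.
            cost = len(edges - snippet_edges)
            if len(snippet_edges) + cost > bound:
                continue
            # Benefit is dynamic too: only items not yet covered count.
            benefit = sum(weights[i] for i in items - covered)
            ratio = benefit / cost if cost else math.inf
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            continue                          # assumption: skip items that do not fit
        edges, items = paths[best_name]
        # Step 2 is implicit: growing these sets changes later costs and benefits.
        snippet_edges |= edges
        covered |= items
        chosen.append(best_name)
    return chosen, covered, snippet_edges     # step 3: loop ends with the IList

chosen, covered, edges = greedy_select(ILIST, WEIGHTS, PATHS, bound=12)
print(chosen, len(edges), sorted(covered))
```

In this toy run, store1 is preferred over store2 for Texas because it also covers Houston. The path to clothes_a costs only 2 new edges once clothes_b has been added, since the two paths share a prefix. This illustrates why the cost of covering an item is dynamic.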
Final Snippet
[Figure: the XML query result tree, with the instances selected for the final snippet marked.]
IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual
Experimental Setup
Comparing the performance of:
• Greedy algorithm for instance selection (eXtract)
• Optimal (but exponential) algorithm for instance selection
• Google Desktop
Measurements:
• Search quality
• Speed
• Scalability
Data sets: Films, Retailer
Query sets: eight queries for each data set
Search Quality: User Study
Ten users were asked to score the snippets generated by the three approaches on the same query results.
The two approaches designed specifically for XML snippet generation score much higher than Google Desktop.
The Greedy algorithm (eXtract) scores close to the Optimal algorithm.
Search Quality: Precision & Recall
The ground truth for the snippets was obtained through another user study.
The snippets generated by the Greedy algorithm in eXtract achieve precision and recall close to those of the Optimal algorithm.
[Figure: two charts, Precision and Recall.]
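The slides do not spell out how precision and recall are computed. A minimal sketch, assuming the usual set-based definitions over the items in a generated snippet and in the ground-truth snippet, with hypothetical example data:

```python
def precision_recall(snippet_items, ground_truth_items):
    """Set-based precision and recall of a generated snippet against ground truth."""
    snippet, truth = set(snippet_items), set(ground_truth_items)
    hits = len(snippet & truth)
    precision = hits / len(snippet) if snippet else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical generated snippet and user-provided ground truth.
generated = {"retailer", "Brook Brothers", "store", "Texas", "Houston", "suit"}
ground_truth = {"retailer", "Brook Brothers", "store", "Texas", "Houston",
                "men", "casual", "outwear"}

precision, recall = precision_recall(generated, ground_truth)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Five of the six generated items appear in the ground truth, and five of the eight ground-truth items are recovered.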
Speed
[Figure: two speed charts, one for the Film data set and one for the Retailer data set.]
The Greedy algorithm is much faster than the Optimal algorithm.
Scalability
[Figure: two charts, scalability on snippet size (number of edges) and scalability on query result size (KB).]
The scalability of the Greedy algorithm is much better than that of the Optimal algorithm.
Conclusions
The first work that generates result snippets for keyword search on XML data
Identified the desirable properties for snippets:
• Self-contained
• Distinguishable
• Representative
Designed an algorithm to generate IList as a ranked list of significant items to be included in snippets
Proved that the instance selection problem is NP-hard
Designed an efficient algorithm to cover IList in building a snippet within a size bound
Experiments verified the effectiveness and efficiency
Thank You!
Questions?
Welcome to visit the eXtract demo at VLDB 2008
http://eXtract.asu.edu/
Architecture of eXtract
[Figure: architecture diagram. Components: Index Builder, XML Index, Data Analyzer, Return Entity Identifier, Query Result Key Identifier, Dominant Feature Identifier, Instance Selector. Data flow labels: Query & Result; IList, Query Result; Result Snippet.]
Snippets Comparison
[Figure: the eXtract snippet shown next to the Google Desktop snippet. The eXtract snippet is a tree: retailer (name: Brook Brothers, product: apparel) with one store (state: Texas, city: Houston) whose merchandises include clothes (fitting: men) and clothes (situation: casual, category: outwear).]