
Query Biased Snippet Generation in XML Search

Yu Huang, Ziyang Liu, Yi Chen
Arizona State University

SIGMOD 2008 2

Snippets in Text Search

Snippets are widely used in text search engines to help users quickly identify relevant query results.

Fragment of an XML Search Result

Information need: find the apparel retailers in Texas.
Keyword search: Texas, apparel, retailer

[Figure: tree view of one query result. A retailer (name: Brook Brothers, product: apparel) has store entities, each with a state and a city (Texas, Houston; Texas, Austin). Each store's merchandises contain clothes entities with fitting, situation, and category attributes (e.g. women, casual, formal, outwear, sweater).]

There can be many large search results.

Good snippets can help users to quickly and easily judge the relevance.

A Sample Snippet

From the snippet, we know:
• The corresponding query result contains matches to all keywords.
• The retailer is “Brook Brothers”.
• This retailer has many stores in Houston.
• The clothes featured by this retailer.

It helps us to differentiate this query result from other apparel retailers (e.g. Carter’s)

[Figure: the sample snippet as a tree. A retailer (name: Brook Brothers, product: apparel) with a store (state: Texas, city: Houston) and, under merchandises, clothes with fitting, situation (casual), and category (outwear).]

How to generate good snippets for XML search?

No existing work on XML snippet generation yet.

Challenges and Our Contributions

What are desirable properties of a good snippet?

Identified three properties: self-contained, distinguishable, representative

What information in the query result is significant in order to achieve the properties?

Designed an algorithm to generate a ranked list of significant information - IList

How to generate a snippet to maximally cover the significant information within a size bound?

Proved the NP-hardness of this problem.

Designed an efficient and effective algorithm for snippet generation

eXtract: The first system on snippet generation for XML search

Roadmap

• Identifying desirable properties of a good snippet: self-contained, distinguishable, representative
• Constructing an information list (IList): a ranked list of the significant information in the query result, chosen to achieve these properties
• Building snippets based on IList within a snippet size bound
• Experimental evaluation
• Conclusions

Self-contained Snippet

Snippets should be self-contained in order to be understandable.

Text search: snippets usually preserve self-contained semantic units: phrases / sentences surrounding keyword matches.

XML search: semantic units should be preserved.

Challenge: What is a semantic unit?

Query Result Fragment (revisited)

Adding keywords and their corresponding entity names to IList.

IList: Texas, apparel, retailer, store

Data contain entities and attributes.

A self-contained snippet should contain the names of the entities whose attributes appear in the snippet.

[Figure: fragment of the query result tree.]

Distinguishable Snippet

Snippets should be distinguishable, so that users can easily differentiate query results

Text search: the title of the document is included.

XML search: the “key” of the result should be included.

Challenge: What is the key of an XML search result?

Query Result Fragment

Adding the key of the query result to IList.

IList: Texas, apparel, retailer, store, Brook Brothers

We can mine keys of entities.

We identify two types of entities in a query result: return entities and support entities.

[Figure: fragment of the query result tree with a return entity and a support entity marked.]

Inferring return entities: an entity whose name or attribute name matches a keyword; otherwise, the highest entity.

Key of a query result: the keys of its return entities.
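The inference rule for return entities can be illustrated with a minimal sketch. The dictionary fields (`name`, `attributes`, `depth`) and the reading of "highest entity" as the entity closest to the root of the query result are assumptions made for illustration; they are not taken from eXtract itself.

```python
def infer_return_entities(entities, keywords):
    """Return the entities inferred to be return entities.

    `entities` is a list of dicts with hypothetical fields:
    'name', 'attributes' (attribute names), and 'depth' (0 = root of the result).
    """
    if not entities:
        return []
    kws = {k.lower() for k in keywords}
    matched = [
        e for e in entities
        if e["name"].lower() in kws
        or kws & {a.lower() for a in e["attributes"]}
    ]
    if matched:
        return matched
    # Fallback (assumption): "highest" means closest to the root of the result.
    top = min(e["depth"] for e in entities)
    return [e for e in entities if e["depth"] == top]


entities = [
    {"name": "retailer", "attributes": ["name", "product"], "depth": 0},
    {"name": "store", "attributes": ["state", "city"], "depth": 1},
    {"name": "clothes", "attributes": ["fitting", "situation", "category"], "depth": 2},
]
by_name = infer_return_entities(entities, ["Texas", "apparel", "retailer"])
fallback = infer_return_entities(entities, ["Texas", "Houston"])
by_attribute = infer_return_entities(entities, ["city"])
print([e["name"] for e in by_name])
print([e["name"] for e in fallback])
print([e["name"] for e in by_attribute])
```

For the running query (Texas, apparel, retailer), the keyword "retailer" matches an entity name, so retailer is the return entity and its key (Brook Brothers) is what gets added to IList.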

Representative Snippet

Snippets should provide summaries of the query results, so that users can quickly grasp the essence of the results.

Text search: an active research area; sometimes the first and/or last sentence of a paragraph is used as a summary.

XML search: include “dominant features” of query results.

Challenges:

• What are features?

• What are dominant features?

Features of Query Result

We define a feature as (entity, attribute, value).

[Figure: fragment of the query result tree with a feature type marked.]

Some feature statistics (entity: attribute: value: # of occurrences)

store: city:
  Houston: 6
  Austin: 1
  Other values (3): 3
clothes: fitting:
  Men: 600
  Women: 360
  Children: 40
clothes: situation:
  Casual: 700
  Formal: 300
clothes: category:
  Outwear: 220
  Suit: 120
  Skirt: 80
  Sweaters: 70
  Other values (7): 510

Dominant Features of Query Result

A feature that occurs often is likely to be dominant, but raw frequency is not always reliable.

Dominance score of a feature = (# of occurrences of the feature) / (average # of occurrences of features of the same type)

Dominant features: features with dominance score ≥ 1.
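As a worked illustration of the dominance score, the sketch below applies the definition to the feature statistics shown earlier in the talk. Two things are assumptions: Houston and Austin are treated as values of the store's city attribute, and an entry such as "Other values (3): 3" is read as three further distinct values with three occurrences in total. The function name `dominance_scores` is hypothetical.

```python
from collections import defaultdict

# (entity, attribute, value) -> number of occurrences, from the statistics slide.
counts = {
    ("store", "city", "Houston"): 6,
    ("store", "city", "Austin"): 1,
    ("clothes", "fitting", "men"): 600,
    ("clothes", "fitting", "women"): 360,
    ("clothes", "fitting", "children"): 40,
    ("clothes", "situation", "casual"): 700,
    ("clothes", "situation", "formal"): 300,
    ("clothes", "category", "outwear"): 220,
    ("clothes", "category", "suit"): 120,
    ("clothes", "category", "skirt"): 80,
    ("clothes", "category", "sweaters"): 70,
}
# Assumption: (number of other distinct values, their total occurrences) per type.
others = {("store", "city"): (3, 3), ("clothes", "category"): (7, 510)}


def dominance_scores(counts, others):
    """Score = occurrences of a feature / average occurrences within its type."""
    total = defaultdict(int)
    distinct = defaultdict(int)
    for (entity, attribute, _value), n in counts.items():
        total[(entity, attribute)] += n
        distinct[(entity, attribute)] += 1
    for ftype, (n_values, n_occurrences) in others.items():
        total[ftype] += n_occurrences
        distinct[ftype] += n_values
    return {f: n / (total[f[:2]] / distinct[f[:2]]) for f, n in counts.items()}


scores = dominance_scores(counts, others)
dominant = sorted((f for f, s in scores.items() if s >= 1), key=lambda f: -scores[f])
for f in dominant:
    print(f, round(scores[f], 2))
```

Under these assumptions the four highest-scoring features are Houston, outwear, men, and casual, in that order.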

Representative Snippet

Adding dominant features to IList in the order of dominance scores.

IList: Texas, apparel, retailer, store, Brook Brothers, Houston, outwear, men, casual


Roadmap

• Identifying desirable properties of a good snippet
  • Self-contained: related entity names
  • Distinguishable: key of query result (return entities)
  • Representative: dominant features
• Conclusions


Instance Selection Problem

[Figure: query result tree with several instances of store and clothes.]

Input: query result R, IList, a snippet size bound B
Output: snippet S

Instance Selection Problem: how to select node instances in R to form S so that, within bound B, S covers as many items in IList as possible, in ranked order?

[Figure: two candidate instance selections on the query result tree, one labeled Good and one labeled Bad.]

Instance Selection Problem

Challenges:
• The cost of covering an IList item is dynamic.
• The number of IList items that can be covered is unknown until the very end.

The Instance Selection Problem is NP-hard.

We designed an efficient and effective greedy algorithm to tackle this problem.

Instance Selection Algorithm

[Figure: query result tree.]

Weights of the IList items, in rank order: 1, 1, 1, ½, ¼, 1/8, 1/16, 1/32, 1/64

Path-based instance selection:
• Coverage: the entities on the path and their attributes
• Benefit: the total weight of the IList items covered
• Cost: the path length
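To make benefit and cost concrete, here is a small sketch that reproduces the weight sequence from the slide and scores a single path. The weighting rule (keywords get weight 1, every later IList item gets half the weight of the item before it) is inferred from the sequence shown, and the example path with its length of 3 is hypothetical.

```python
from fractions import Fraction


def ilist_weights(ilist, num_keywords):
    """Assumed rule: keywords weigh 1, then each later item halves the weight."""
    weights = {}
    for rank, item in enumerate(ilist):
        exponent = max(0, rank - num_keywords + 1)
        weights[item] = Fraction(1, 2 ** exponent)
    return weights


ilist = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
weights = ilist_weights(ilist, num_keywords=3)

# Hypothetical path retailer -> store -> city, also covering the store's state.
path_items = {"retailer", "store", "Texas", "Houston"}
path_length = 3

benefit = sum(weights[item] for item in path_items)
ratio = benefit / path_length
print([str(w) for w in weights.values()])
print(benefit, ratio)
```

Exact fractions are used only to keep the arithmetic readable; floating-point weights would serve equally well.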

Instance Selection Algorithm

1. For the next uncovered item in IList, choose the path with the highest benefit/cost that covers it.

2. Update the benefits and costs of the other paths.

3. Go to step 1 until the size bound is reached or the whole IList is covered.
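The three steps can be sketched as a short greedy loop. This is an illustrative simplification, not the eXtract implementation: each candidate path is modeled as a set of edges plus the set of IList items it covers, cost is the number of edges not yet in the snippet, and items that cannot be covered within the bound are skipped. The path names, edge labels, and the bound of 6 edges are hypothetical.

```python
def greedy_select(ilist, weights, paths, bound):
    """Greedy instance selection over candidate paths.

    `paths` maps a path name to (edges, items): the edges the path adds to the
    snippet and the IList items it covers. Returns (picked, edges, covered).
    """
    snippet_edges, covered, picked = set(), set(), []
    for item in ilist:  # step 1: next uncovered item, in ranked order
        if item in covered:
            continue
        best, best_ratio = None, -1.0
        for name, (edges, items) in paths.items():
            if item not in items:
                continue
            # Step 2 happens implicitly: costs and benefits are recomputed
            # against what the snippet already contains.
            new_edges = edges - snippet_edges
            if len(snippet_edges) + len(new_edges) > bound:
                continue  # would exceed the size bound
            benefit = sum(weights[i] for i in items - covered)
            ratio = benefit / len(new_edges) if new_edges else float("inf")
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            continue  # assumption: skip items that do not fit within the bound
        edges, items = paths[best]
        snippet_edges |= edges
        covered |= items
        picked.append(best)
    return picked, snippet_edges, covered


ilist = ["Texas", "apparel", "retailer", "store", "Brook Brothers",
         "Houston", "outwear", "men", "casual"]
weights = dict(zip(ilist, [1, 1, 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64]))
paths = {
    "store1": ({"store1", "state1", "city1"}, {"retailer", "store", "Texas", "Houston"}),
    "store2": ({"store2", "state2", "city2"}, {"retailer", "store", "Texas"}),
    "product": ({"product"}, {"retailer", "apparel"}),
    "name": ({"name"}, {"retailer", "Brook Brothers"}),
    "clothes1": ({"store1", "merch1", "clothes1", "situation1"}, {"retailer", "store", "casual"}),
    "clothes5": ({"store2", "merch2", "clothes5", "category5"}, {"retailer", "store", "outwear"}),
}
picked, snippet_edges, covered = greedy_select(ilist, weights, paths, bound=6)
print(picked, len(snippet_edges), sorted(covered))
```

In this toy input the Houston store wins over the Austin store for the keyword Texas because it also covers the dominant feature Houston at the same cost, which is the kind of preference the benefit/cost ratio is meant to capture.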


Final Snippet

[Figure: the final snippet, shown on the query result tree.]


Experimental Setup

Comparing the performance of:
• Greedy algorithm for instance selection (eXtract)
• Optimal (but exponential) algorithm for instance selection
• Google Desktop

Measurements: search quality, speed, scalability

Data sets: Films, Retailer
Query sets: eight queries for each data set

Search Quality: User Study

Ten users were asked to score the snippets generated by the three approaches on the same query results.

The two approaches designed specifically for XML snippet generation have much higher scores than Google Desktop.

The Greedy algorithm (eXtract) has scores close to those of the Optimal algorithm.

Search Quality: Precision & Recall

Through another user study, the ground truth of snippets was obtained.

The snippets generated by the Greedy algorithm in eXtract have precision and recall close to those of the Optimal algorithm.

[Figure: two charts, Precision and Recall.]

Speed

[Figure: two charts, Film Data Set and Retailer Data Set.]

The Greedy algorithm runs much faster than the Optimal algorithm.

Scalability

[Figure: two charts, Scalability on Snippet Size (number of edges) and Scalability on Query Result Size (KB).]

The Greedy algorithm scales much better than the Optimal algorithm.

Conclusions

The first work that generates result snippets for keyword search on XML data

Identified the desirable properties for snippets: self-contained, distinguishable, representative

Designed an algorithm to generate IList as a ranked list of significant items to be included in snippets

Proved that the instance selection problem is NP-hard

Designed an efficient algorithm to cover IList in building a snippet within a size bound

Experiments verified the effectiveness and efficiency


Thank You!

Questions?

Welcome to visit the eXtract demo at VLDB 2008: http://eXtract.asu.edu/

Architecture of eXtract

[Figure: architecture diagram. Components: Index Builder, XML Index, Data Analyzer, Return Entity Identifier, Query Result Key Identifier, Dominant Feature Identifier, Instance Selector. The query result and IList feed the Instance Selector, which outputs the result snippet.]

Snippets Comparison

[Figure: the eXtract snippet (retailer Brook Brothers, product apparel; store in Texas, Houston; clothes with situation casual and category outwear) shown next to the Google Desktop snippet.]
