Diversified Retrieval as Structured Prediction
Redundancy, Diversity, and Interdependent Document Relevance (IDR ’09), SIGIR 2009 Workshop
Yisong Yue, Cornell University. Joint work with Thorsten Joachims.



TRANSCRIPT

Page 1: Diversified Retrieval as  Structured Prediction

Diversified Retrieval as Structured Prediction

Redundancy, Diversity, and

Interdependent Document Relevance (IDR ’09)

SIGIR 2009 Workshop

Yisong Yue, Cornell University

Joint work with Thorsten Joachims

Page 2: Diversified Retrieval as  Structured Prediction

Need for Diversity (in IR)

• Ambiguous Queries
  – Different information needs using the same query, e.g., “Jaguar”
  – Want at least one relevant result for each information need

• Learning Queries
  – User interested in “a specific detail or entire breadth of knowledge available” [Swaminathan et al., 2008]
  – Want results with high information diversity

Page 3: Diversified Retrieval as  Structured Prediction

Optimizing Diversity

• Interest in information retrieval
  – [Carbonell & Goldstein, 1998; Zhai et al., 2003; Zhang et al., 2005; Chen & Karger, 2006; Zhu et al., 2007; Swaminathan et al., 2008]

• Requires modeling inter-document dependencies
  – Impossible under standard independence assumptions
  – E.g., the probability ranking principle

• No consensus on how to measure diversity.

Page 4: Diversified Retrieval as  Structured Prediction

This Talk

• A method for representing and optimizing information coverage

• Discriminative training algorithm
  – Based on structural SVMs

• Appropriate forms of training data
  – Requires sufficient granularity (subtopic labels)

• Empirical evaluation

Page 5: Diversified Retrieval as  Structured Prediction

• Choose the top 3 documents
• Individual Relevance: D3 D4 D1
• Pairwise Similarity (MMR): D3 D1 D2
• Best Solution: D3 D1 D5

Page 6: Diversified Retrieval as  Structured Prediction

How to Represent Information?

• Discrete feature space to represent information
  – Decomposed into “nuggets”

• For query q and its candidate documents:
  – All the words (title words, anchor text, etc.)
  – Cluster memberships (topic models / dimensionality reduction)
  – Taxonomy memberships (ODP)

• We will focus on words and title words.

Page 7: Diversified Retrieval as  Structured Prediction

Weighted Word Coverage

• More distinct words = more information

• Weight words by importance

• Works automatically, without human labels

• Goal: select K documents which collectively cover as many distinct (weighted) words as possible
  – Budgeted max coverage problem (Khuller et al., 1997)
  – Greedy selection yields a (1 − 1/e) approximation bound
  – Need to find a good weighting function (the learning problem)
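The greedy selection step can be sketched as plain weighted set coverage (a minimal sketch with hypothetical inputs: `docs` maps document ids to word sets, and `benefit` stands in for the learned weighting function):

```python
def greedy_select(docs, benefit, k):
    """Greedily pick k documents maximizing the total benefit of
    distinct covered words (budgeted max coverage)."""
    covered, chosen = set(), []
    for _ in range(k):
        best_doc, best_gain = None, 0.0
        for doc_id, words in docs.items():
            if doc_id in chosen:
                continue
            # Marginal benefit: only words not yet covered count.
            gain = sum(benefit.get(w, 0.0) for w in words - covered)
            if gain > best_gain:
                best_doc, best_gain = doc_id, gain
        if best_doc is None:  # no remaining document adds positive benefit
            break
        chosen.append(best_doc)
        covered |= docs[best_doc]
    return chosen
```

Because the covered-word benefit only counts each word once, the objective is submodular, which is what gives this greedy loop its (1 − 1/e) guarantee.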

Page 8: Diversified Retrieval as  Structured Prediction

Example

Marginal Benefit:

          D1   D2   D3   Best
  Iter 1  12   11   10   D1
  Iter 2

Document Word Counts:

        V1   V2   V3   V4   V5
  D1              X    X    X
  D2         X         X    X
  D3    X    X    X    X

Word Benefit:

  V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5

Page 9: Diversified Retrieval as  Structured Prediction

Example

Marginal Benefit:

          D1   D2   D3   Best
  Iter 1  12   11   10   D1
  Iter 2  --    2    3   D3

Document Word Counts:

        V1   V2   V3   V4   V5
  D1              X    X    X
  D2         X         X    X
  D3    X    X    X    X

Word Benefit:

  V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5
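The marginal-benefit numbers in the example can be reproduced directly (document word sets inferred from the counts table above):

```python
# Word benefits and document word sets from the worked example.
benefit = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}
docs = {"D1": {"V3", "V4", "V5"},
        "D2": {"V2", "V4", "V5"},
        "D3": {"V1", "V2", "V3", "V4"}}

def marginal(words, covered):
    """Benefit contributed by words not already covered."""
    return sum(benefit[w] for w in words - covered)

# Iteration 1: nothing covered yet; D1 wins with 12.
iter1 = {d: marginal(ws, set()) for d, ws in docs.items()}
# Iteration 2: with D1's words covered, D3's gain (3) beats D2's (2).
covered = docs["D1"]
iter2 = {d: marginal(ws, covered) for d, ws in docs.items() if d != "D1"}
```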

Page 10: Diversified Retrieval as  Structured Prediction

How to Weight Words?

• Not all words are created equal
  – e.g., “the”

• Importance is conditional on the query
  – “computer” is normally fairly informative…
  – …but not for the query “ACM”

• Learn weights based on the candidate set (for a query)

Page 11: Diversified Retrieval as  Structured Prediction

Prior Work

• Essential Pages [Swaminathan et al., 2008]
  – Uses a fixed function for word benefit
  – Depends on word frequency in the candidate set (a local version of TF-IDF)
  – Frequent words get low weight (not important for diversity)
  – Rare words get low weight (not representative)

Page 12: Diversified Retrieval as  Structured Prediction

Linear Discriminant

• x = (x1,x2,…,xn) - candidate documents

• v – an individual word

• We will use thousands of such features

  φ(v, x) = [ 1[v appears in ≥ 10% of x],
              1[v appears in ≥ 20% of x],
              …,
              1[v appears in ≥ 10% of the titles in x],
              … ]
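Such indicator features can be sketched as follows (the three thresholds here are just the ones shown above; the actual feature inventory runs to thousands of such features):

```python
def word_features(v, doc_words, title_words):
    """Indicator features for word v over a candidate set.
    doc_words / title_words: lists of word sets, one per document."""
    frac_docs = sum(v in d for d in doc_words) / len(doc_words)
    frac_titles = sum(v in t for t in title_words) / len(title_words)
    return [
        1.0 if frac_docs >= 0.10 else 0.0,    # v appears in >=10% of documents
        1.0 if frac_docs >= 0.20 else 0.0,    # v appears in >=20% of documents
        1.0 if frac_titles >= 0.10 else 0.0,  # v appears in >=10% of titles
    ]
```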

Page 13: Diversified Retrieval as  Structured Prediction

Linear Discriminant

• x = (x1,x2,…,xn) - candidate documents

• y – subset of x (the prediction)
• V(y) – union of words from the documents in y

• Discriminant Function:

  w^T Ψ(x, y) = Σ_{v ∈ V(y)} w^T φ(v, x)

  ŷ = argmax_y  w^T Ψ(x, y)

• Benefit of covering word v is then w^T φ(v, x)

Page 14: Diversified Retrieval as  Structured Prediction

Linear Discriminant

• Does NOT reward redundancy
  – Benefit of each word is only counted once

• Greedy selection has a (1 − 1/e)-approximation bound

• Linear (joint feature space)
  – Suitable for SVM optimization

  w^T Ψ(x, y) = Σ_{v ∈ V(y)} w^T φ(v, x)

Page 15: Diversified Retrieval as  Structured Prediction

More Sophisticated Discriminant

• Documents “cover” words to different degrees
  – A document with 5 copies of “Thorsten” might cover it better than another document with only 2 copies.

Page 16: Diversified Retrieval as  Structured Prediction

More Sophisticated Discriminant

• Documents “cover” words to different degrees
  – A document with 5 copies of “Thorsten” might cover it better than another document with only 2 copies.

• Use multiple word sets, V1(y), V2(y), …, VL(y)

• Each Vi(y) contains only words satisfying certain importance criteria.

• Requires a more sophisticated joint feature map.

Page 17: Diversified Retrieval as  Structured Prediction

Conventional SVMs

• Input: x (high dimensional point)

• Target: y (either +1 or -1)

• Prediction: sign(wTx)

• Training:

  argmin_{w, ξ ≥ 0}  (1/2)‖w‖² + (C/N) Σ_{i=1}^{N} ξ_i

  subject to:  ∀i:  y_i (w^T x_i) ≥ 1 − ξ_i

• The sum of slacks Σ_i ξ_i upper bounds the accuracy loss

Page 18: Diversified Retrieval as  Structured Prediction

Structural SVM Formulation

• Input: x (candidate set of documents)
• Target: y (subset of x of size K)

• Same objective function:

  min_{w, ξ ≥ 0}  (1/2)‖w‖² + (C/N) Σ_i ξ_i

• Constraints for each incorrect labeling y′:

  ∀y′:  w^T Ψ(x, y) ≥ w^T Ψ(x, y′) + Δ(y′) − ξ

• Score of the best y must be at least as large as that of any incorrect y′, plus its loss Δ(y′)

• Requires a new training algorithm [Tsochantaridis et al., 2005]

Page 19: Diversified Retrieval as  Structured Prediction

Weighted Subtopic Loss

• Example:
  – x1 covers t1
  – x2 covers t1, t2, t3
  – x3 covers t1, t3

• Motivation:
  – Higher penalty for not covering popular subtopics
  – Mitigates effects of label noise in tail subtopics

  Subtopic   # Docs   Loss
  t1         3        1/2
  t2         1        1/6
  t3         2        1/3
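The loss column can be computed directly, taking each subtopic's weight proportional to the number of documents covering it (a sketch; the weighting scheme matches the table above):

```python
from collections import Counter

def subtopic_loss(predicted, doc_subtopics):
    """Weighted subtopic loss: each subtopic left uncovered by the
    predicted set costs a weight proportional to its document count."""
    counts = Counter(t for ts in doc_subtopics.values() for t in ts)
    total = sum(counts.values())
    covered = set()
    for d in predicted:
        covered |= doc_subtopics[d]
    return sum(c / total for t, c in counts.items() if t not in covered)

# Example from the slide: weights come out to t1 -> 1/2, t2 -> 1/6, t3 -> 1/3.
example = {"x1": {"t1"}, "x2": {"t1", "t2", "t3"}, "x3": {"t1", "t3"}}
```

Predicting x2 alone covers every subtopic (zero loss), while predicting x1 alone leaves t2 and t3 uncovered (loss 1/6 + 1/3 = 1/2).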

Page 20: Diversified Retrieval as  Structured Prediction

Diversity Training Data

• TREC 6-8 Interactive Track
  – Queries with explicitly labeled subtopics
  – E.g., “Use of robots in the world today”
    • Nanorobots
    • Space mission robots
    • Underwater robots
  – Manual partitioning of the total information regarding a query

Page 21: Diversified Retrieval as  Structured Prediction

Experiments

• TREC 6-8 Interactive Track queries
• Documents labeled into subtopics

• 17 queries used
  – Considered only relevant docs
  – Decouples the relevance problem from the diversity problem

• 45 docs/query, 20 subtopics/query, 300 words/doc

• Trained using leave-one-out (LOO) cross validation

Page 22: Diversified Retrieval as  Structured Prediction

• TREC 6-8 Interactive Track
• Retrieving 5 documents

  [Bar chart of loss per method:]
  Random 0.469, Okapi 0.472, Unweighted 0.471, Essential Pages 0.434, SVM-div 0.349

Page 23: Diversified Retrieval as  Structured Prediction

Can expect further benefit from having more training data.

Page 24: Diversified Retrieval as  Structured Prediction

Moving Forward

• Larger datasets
  – Evaluate relevance & diversity jointly

• Different types of training data
  – Our framework can define loss in different ways
  – Can we leverage clickthrough data?

• Different feature representations
  – Build on top of topic modeling approaches?
  – Can we incorporate hierarchical retrieval?

Page 25: Diversified Retrieval as  Structured Prediction

References & Code/Data

• “Predicting Diverse Subsets Using Structural SVMs”– [Yue & Joachims, ICML 2008]

• Source code and dataset available online – http://projects.yisongyue.com/svmdiv/

• Work supported by NSF IIS-0713483, Microsoft Fellowship, and Yahoo! KTC Grant.

Page 26: Diversified Retrieval as  Structured Prediction

Extra Slides

Page 27: Diversified Retrieval as  Structured Prediction

More Sophisticated Discriminant

  Ψ(x, y) = [ Σ_{v ∈ V1(y)} φ1(v, x) ; … ; Σ_{v ∈ VL(y)} φL(v, x) ]

• Separate φi for each importance level i
• The joint feature map is the vector composition of all the φi

  ŷ = argmax_y  w^T Ψ(x, y)

• Greedy selection has a (1 − 1/e)-approximation bound
• Still uses a linear feature space

Page 28: Diversified Retrieval as  Structured Prediction

Maximizing Subtopic Coverage

• Goal: select K documents which collectively cover as many subtopics as possible.

• Perfect selection takes (n choose K) time
  – Basically a set cover problem

• Greedy gives a (1 − 1/e)-approximation bound
  – Special case of Max Coverage (Khuller et al., 1997)

Page 29: Diversified Retrieval as  Structured Prediction

Learning Set Cover Representations

• Given:
  – Manual partitioning of a space
    • Subtopics
  – Weighting for how items cover the manual partitions
    • Subtopic labels + subtopic loss
  – Automatic partitioning of the space
    • Words

• Goal:
  – Weighting for how items cover the automatic partitions
  – Such that the (greedy) optimal covering solutions agree

Page 30: Diversified Retrieval as  Structured Prediction

Essential Pages

Page 31: Diversified Retrieval as  Structured Prediction

Essential Pages

x = (x1, x2, …, xn) – set of candidate documents for a query
y – a subset of x of size K (our prediction)

Benefit of covering word v with document xi, weighted by the importance of covering v:

  C_i(v, x) = TF(v, x_i) · r(v, x) · log2(1 / r(v, x))

where r(v, x) is the frequency of word v in the candidate set.  [Swaminathan et al., 2008]

Intuition:
  – Frequent words cannot encode information diversity
  – Infrequent words do not provide significant information

  ŷ = argmax_y Σ_v C(v, y, x)
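The r·log2(1/r) importance term can be sketched as follows (a sketch under the assumption that r(v, x) is the fraction of candidate documents containing v; check the exact form against Swaminathan et al., 2008):

```python
import math

def word_importance(v, doc_words):
    """Essential Pages-style importance r * log2(1/r), where r is the
    fraction of candidate documents containing v. It peaks for
    mid-frequency words: both very frequent and very rare words score low."""
    r = sum(v in d for d in doc_words) / len(doc_words)
    if r == 0.0 or r == 1.0:
        return 0.0  # no diversity signal at either extreme
    return r * math.log2(1.0 / r)
```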

Page 32: Diversified Retrieval as  Structured Prediction

Structural SVMs

Page 33: Diversified Retrieval as  Structured Prediction

Minimizing Hinge Loss

  min_{w, ξ ≥ 0}  (1/2)‖w‖² + (C/N) Σ_i ξ_i

  ∀i, ∀y′:  w^T Ψ(x_i, y_i) ≥ w^T Ψ(x_i, y′) + Δ(y′) − ξ_i

Suppose that for an incorrect y′ the score difference w^T Ψ(x_i, y_i) − w^T Ψ(x_i, y′) covers only a quarter of the required margin Δ(y′). Then the slack absorbs the rest:

  ξ_i = 0.75 · Δ(y′)

[Tsochantaridis et al., 2005]

Page 34: Diversified Retrieval as  Structured Prediction

Finding Most Violated Constraint

• A constraint is violated when

  w^T Ψ(x, y) − w^T Ψ(x, y′) − Δ(y′) + ξ_i < 0

• Finding the most violated constraint reduces to

  ŷ′ = argmax_{y′}  w^T Ψ(x, y′) + Δ(y′)
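Both conditions are direct to transcribe over precomputed scores and losses (a sketch: `candidates` is a hypothetical explicit enumeration, whereas in practice the argmax is found by greedy inference rather than enumeration):

```python
def is_violated(score_true, score_alt, loss_alt, slack):
    """Constraint: score(y) >= score(y') + loss(y') - slack.
    Violated when the left side falls short of the required margin."""
    return score_true - score_alt - loss_alt + slack < 0

def most_violated(candidates):
    """candidates: iterable of (label, score, loss) triples.
    The most violated constraint maximizes score + loss."""
    return max(candidates, key=lambda c: c[1] + c[2])[0]
```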

Page 35: Diversified Retrieval as  Structured Prediction

Finding Most Violated Constraint

• Encode each subtopic as an additional “word” to be covered.

• Use greedy prediction to find approximate most violated constraint.

  Ψ′(x, y) = [ Σ_{v ∈ V1(y)} φ1(v, x) ; … ; Σ_{v ∈ VL(y)} φL(v, x) ; … ]

  (Ψ′ augments Ψ with one additional coverage term per subtopic.)

  ŷ′ = argmax_{y′}  w^T Ψ′(x, y′)

Page 36: Diversified Retrieval as  Structured Prediction

Illustrative Example

Original SVM Problem
• Exponential number of constraints
• Most are dominated by a small set of “important” constraints

Structural SVM Approach
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation.


Page 40: Diversified Retrieval as  Structured Prediction

Approximate Constraint Generation

• Theoretical guarantees no longer hold.
  – Might not find an epsilon-close approximation to the feasible region boundary.

• Performs well in practice.

Page 41: Diversified Retrieval as  Structured Prediction

Approximate constraint generation appears to perform well.

Page 42: Diversified Retrieval as  Structured Prediction

Experiments

Page 43: Diversified Retrieval as  Structured Prediction

TREC Experiments

• 12/4/1 train/valid/test split
  – Approx. 500 documents in the training set

• Permuted until all 17 queries were tested once

• Set K=5 (some queries have very few documents)

• SVM-div – uses term frequency thresholds to define importance levels

• SVM-div2 – in addition uses TFIDF thresholds

Page 44: Diversified Retrieval as  Structured Prediction

TREC Results

  Method             Loss
  Random             0.469
  Okapi              0.472
  Unweighted Model   0.471
  Essential Pages    0.434
  SVM-div            0.349
  SVM-div2           0.382

  Methods                    W / T / L
  SVM-div vs Ess. Pages      14 / 0 / 3 **
  SVM-div2 vs Ess. Pages     13 / 0 / 4
  SVM-div vs SVM-div2         9 / 6 / 2

Page 45: Diversified Retrieval as  Structured Prediction

Synthetic Dataset

• TREC dataset is very small
• Synthetic dataset lets us vary the retrieval size K

• 100 queries
• 100 docs/query, 25 subtopics/query, 300 words/doc

• 15/10/75 train/valid/test split

Page 46: Diversified Retrieval as  Structured Prediction

Consistently outperforms Essential Pages