Transcript
Page 1

Diversified Retrieval as Structured Prediction

Redundancy, Diversity, and Interdependent Document Relevance (IDR '09)

SIGIR 2009 Workshop

Yisong Yue, Cornell University

Joint work with Thorsten Joachims

Page 2

Need for Diversity (in IR)

• Ambiguous Queries
  – Different information needs using the same query (e.g., "Jaguar")
  – Want at least one relevant result for each information need

• Learning Queries
  – User interested in "a specific detail or entire breadth of knowledge available" [Swaminathan et al., 2008]
  – Want results with high information diversity

Page 3

Optimizing Diversity

• Interest in information retrieval
  – [Carbonell & Goldstein, 1998; Zhai et al., 2003; Zhang et al., 2005; Chen & Karger, 2006; Zhu et al., 2007; Swaminathan et al., 2008]

• Requires modeling inter-document dependencies
  – Impossible under standard independence assumptions
  – E.g., the probability ranking principle

• No consensus on how to measure diversity.

Page 4

This Talk

• A method for representing and optimizing information coverage

• Discriminative training algorithm
  – Based on structural SVMs

• Appropriate forms of training data
  – Requires sufficient granularity (subtopic labels)

• Empirical evaluation

Page 5

• Choose top 3 documents
• Individual Relevance: D3 D4 D1
• Pairwise Sim (MMR): D3 D1 D2
• Best Solution: D3 D1 D5

Page 6

How to Represent Information?

• Discrete feature space to represent information
  – Decomposed into "nuggets"

• For query q and its candidate documents:
  – All the words (title words, anchor text, etc.)
  – Cluster memberships (topic models / dimensionality reduction)
  – Taxonomy memberships (ODP)

• We will focus on words and title words.

Page 7

Weighted Word Coverage

• More distinct words = more information

• Weight word importance

• Will work automatically w/o human labels

• Goal: select K documents which collectively cover as many distinct (weighted) words as possible
  – Budgeted max coverage problem [Khuller et al., 1997]
  – Greedy selection yields a (1 − 1/e) approximation bound (see the sketch below)
  – Need to find a good weighting function (the learning problem)
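A minimal sketch of the greedy selection step, assuming the word weights are already given (names like `greedy_select` are illustrative, not from the talk):

```python
def greedy_select(doc_words, word_weight, k):
    """Greedily pick k documents maximizing weighted coverage of distinct words.

    doc_words:   dict mapping doc id -> set of words in that document
    word_weight: dict mapping word -> nonnegative benefit
    """
    selected, covered = [], set()
    for _ in range(k):
        best_doc, best_gain = None, 0.0
        for doc, words in doc_words.items():
            if doc in selected:
                continue
            # Marginal benefit: total weight of words not yet covered
            gain = sum(word_weight.get(w, 0.0) for w in words - covered)
            if gain > best_gain:
                best_doc, best_gain = doc, gain
        if best_doc is None:
            break  # no remaining document adds any weighted coverage
        selected.append(best_doc)
        covered |= doc_words[best_doc]
    return selected
```

Because weighted word coverage is monotone submodular, this greedy loop is what inherits the (1 − 1/e) guarantee cited above.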

Page 8

Example

Word Benefit
  V1: 1   V2: 2   V3: 3   V4: 4   V5: 5

Document Word Counts (X = word appears in document)
        V1   V2   V3   V4   V5
  D1               X    X    X
  D2          X         X    X
  D3     X    X    X    X

Marginal Benefit
          D1   D2   D3   Best
  Iter 1  12   11   10   D1
  Iter 2

Page 9

Example

Word Benefit
  V1: 1   V2: 2   V3: 3   V4: 4   V5: 5

Document Word Counts (X = word appears in document)
        V1   V2   V3   V4   V5
  D1               X    X    X
  D2          X         X    X
  D3     X    X    X    X

Marginal Benefit
          D1   D2   D3   Best
  Iter 1  12   11   10   D1
  Iter 2  --    2    3   D3
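The example can be checked mechanically with the `greedy_select` sketch from Page 7 (the data structures are illustrative):

```python
doc_words = {
    "D1": {"V3", "V4", "V5"},
    "D2": {"V2", "V4", "V5"},
    "D3": {"V1", "V2", "V3", "V4"},
}
word_weight = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}

# Iter 1: D1 = 3+4+5 = 12, D2 = 2+4+5 = 11, D3 = 1+2+3+4 = 10 -> pick D1
# Iter 2: with V3, V4, V5 covered, D2 adds only V2 = 2, D3 adds V1+V2 = 3 -> pick D3
print(greedy_select(doc_words, word_weight, k=2))  # ['D1', 'D3']
```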

Page 10

How to Weight Words?

• Not all words are created equal
  – E.g., "the"

• Conditional on the query
  – "computer" is normally fairly informative…
  – …but not for the query "ACM"

• Learn weights based on the candidate set (for a query)

Page 11

Prior Work

• Essential Pages [Swaminathan et al., 2008]
  – Uses a fixed function of word benefit
  – Depends on word frequency in the candidate set (a local version of TF-IDF)
  – Frequent words get low weight (not important for diversity)
  – Rare words get low weight (not representative)

Page 12

Linear Discriminant

• x = (x1,x2,…,xn) - candidate documents

• v – an individual word

• Feature vector describing how word v is covered by the candidate set x:

  \phi(v, x) = \big[\, \mathbf{1}\{v \text{ appears in} \ge 10\% \text{ of } x\},\; \mathbf{1}\{v \text{ appears in} \ge 20\% \text{ of } x\},\; \ldots,\; \mathbf{1}\{v \text{ appears in} \ge 10\% \text{ of titles in } x\},\; \ldots \,\big]^{\top}

• We will use thousands of such features

Page 13

Linear Discriminant

• x = (x1,x2,…,xn) - candidate documents

• y – subset of x (the prediction)
• V(y) – union of words from documents in y

• Discriminant function (evaluated in the sketch below):

  w^{\top} \Psi(x, y) = \sum_{v \in V(y)} w^{\top} \phi(v, x)

• Prediction:

  \hat{y} = \operatorname{argmax}_{y} \; w^{\top} \Psi(x, y)

• The benefit of covering word v is then w^{\top} \phi(v, x)
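A small sketch of how this discriminant could be evaluated; the toy `phi` below, with three threshold features, stands in for the thousands of features mentioned on Page 12:

```python
import numpy as np

def phi(v, docs, thresholds=(0.1, 0.2, 0.3)):
    """Toy feature map: indicators that word v appears in >= t fraction of docs."""
    frac = sum(v in d for d in docs) / len(docs)
    return np.array([1.0 if frac >= t else 0.0 for t in thresholds])

def discriminant_score(w, docs, subset):
    """w^T Psi(x, y) = sum of w^T phi(v, x) over the distinct words V(y).

    docs:   list of sets of words (the candidate set x)
    subset: indices of the chosen documents y
    """
    covered = set().union(*(docs[i] for i in subset))  # each word counted once
    return sum(float(w @ phi(v, docs)) for v in covered)
```

Greedy prediction then repeatedly adds the document that increases this score the most, which is exactly where the no-redundancy property on the next slide comes from.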

Page 14

Linear Discriminant

• Does NOT reward redundancy
  – Benefit of each word only counted once

• Greedy has a (1 − 1/e) approximation bound

• Linear (joint feature space)
  – Suitable for SVM optimization

  w^{\top} \Psi(x, y) = \sum_{v \in V(y)} w^{\top} \phi(v, x)

Page 15

More Sophisticated Discriminant

• Documents "cover" words to different degrees
  – A document with 5 copies of "Thorsten" might cover it better than another document with only 2 copies

Page 16

More Sophisticated Discriminant

• Documents "cover" words to different degrees
  – A document with 5 copies of "Thorsten" might cover it better than another document with only 2 copies

• Use multiple word sets, V1(y), V2(y), … , VL(y)

• Each Vi(y) contains only words satisfying certain importance criteria.

• Requires a more sophisticated joint feature map (one possible construction is sketched below)
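One plausible construction of the importance-level word sets, assuming term-frequency thresholds as the importance criteria (the threshold values here are made up; Page 43 notes that SVM-div does use term-frequency thresholds):

```python
def word_sets_by_importance(doc_tfs, subset, tf_thresholds=(1, 2, 5)):
    """Build V_1(y), ..., V_L(y): V_i holds the words some chosen document
    covers with term frequency >= tf_thresholds[i - 1].

    doc_tfs: list of dicts mapping word -> term frequency (candidate set x)
    subset:  indices of the chosen documents y
    """
    return [
        {w for i in subset for w, tf in doc_tfs[i].items() if tf >= t}
        for t in tf_thresholds
    ]
```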

Page 17

Conventional SVMs

• Input: x (high dimensional point)

• Target: y (either +1 or -1)

• Prediction: sign(w^{\top} x)

• Training:

  \min_{w, \xi \ge 0} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i

  subject to:

  \forall i: \; y_i (w^{\top} x_i) \ge 1 - \xi_i

• The sum of slacks upper bounds the accuracy loss

Page 18

Structural SVM Formulation

• Input: x (candidate set of documents)
• Target: y (subset of x of size K)

• Same objective function:

  \min_{w, \xi \ge 0} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i} \xi_i

• Constraints for each incorrect labeling y′ (their number is worked out below):

  \forall y': \; w^{\top} \Psi(x, y) \ge w^{\top} \Psi(x, y') + \Delta(y') - \xi

• Score of the correct y must be at least as large as that of every incorrect y′, plus the loss of y′
• Requires a new training algorithm [Tsochantaridis et al., 2005]
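To see why these constraints cannot simply be enumerated: with roughly 45 candidate documents per query and K = 5 (the TREC setup described later in the talk), every query already induces over a million constraints, one per candidate subset:

```python
from math import comb

# One constraint per size-K subset of the n candidate documents
print(comb(45, 5))  # 1221759
```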

Page 19

Weighted Subtopic Loss

• Example:
  – x1 covers t1
  – x2 covers t1, t2, t3
  – x3 covers t1, t3

• Motivation (a computational sketch follows the table below)
  – Higher penalty for not covering popular subtopics
  – Mitigates effects of label noise in tail subtopics

Subtopic   # Docs   Loss
t1         3        1/2
t2         1        1/6
t3         2        1/3
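A sketch of this loss, with subtopic weights proportional to how many candidate documents cover each subtopic, as in the table (the representation is illustrative):

```python
def weighted_subtopic_loss(doc_subtopics, subset):
    """Delta(y): weighted fraction of subtopics NOT covered by the chosen docs.

    doc_subtopics: list of sets of subtopic labels, one per candidate document
    subset:        indices of the chosen documents y
    """
    counts = {}
    for subs in doc_subtopics:
        for t in subs:
            counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values())
    covered = set().union(*(doc_subtopics[i] for i in subset))
    return sum(c for t, c in counts.items() if t not in covered) / total

docs = [{"t1"}, {"t1", "t2", "t3"}, {"t1", "t3"}]
print(weighted_subtopic_loss(docs, [0]))  # only t1 covered: 1/6 + 2/6 = 0.5
```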

Page 20

Diversity Training Data

• TREC 6-8 Interactive Track
  – Queries with explicitly labeled subtopics
  – E.g., "Use of robots in the world today"
    • Nanorobots
    • Space mission robots
    • Underwater robots
  – Manual partitioning of the total information regarding a query

Page 21

Experiments

• TREC 6-8 Interactive Track queries
• Documents labeled into subtopics

• 17 queries used
  – Considered only relevant documents
  – Decouples the relevance problem from the diversity problem

• 45 docs/query, 20 subtopics/query, 300 words/doc

• Trained using leave-one-out (LOO) cross validation

Page 22

• TREC 6-8 Interactive Track
• Retrieving 5 documents

[Bar chart of loss per method: Random 0.469, Okapi 0.472, Unweighted 0.471, Essential Pages 0.434, SVM-div 0.349]

Page 23

Can expect further benefit from having more training data.

Page 24

Moving Forward

• Larger datasets
  – Evaluate relevance & diversity jointly

• Different types of training data
  – Our framework can define loss in different ways
  – Can we leverage clickthrough data?

• Different feature representations
  – Build on top of topic modeling approaches?
  – Can we incorporate hierarchical retrieval?

Page 25

References & Code/Data

• "Predicting Diverse Subsets Using Structural SVMs"
  – [Yue & Joachims, ICML 2008]

• Source code and dataset available online
  – http://projects.yisongyue.com/svmdiv/

• Work supported by NSF IIS-0713483, Microsoft Fellowship, and Yahoo! KTC Grant.

Page 26

Extra Slides

Page 27

More Sophisticated Discriminant

  \Psi(x, y) = \begin{bmatrix} \sum_{v \in V_1(y)} \phi_1(v, x) \\ \vdots \\ \sum_{v \in V_L(y)} \phi_L(v, x) \end{bmatrix}

• Separate \phi_i for each importance level i
• Joint feature map is the vector concatenation of all the \phi_i

  \hat{y} = \operatorname{argmax}_{y} \; w^{\top} \Psi(x, y)

• Greedy has a (1 − 1/e) approximation bound
• Still uses a linear feature space

Page 28

Maximizing Subtopic Coverage

• Goal: select K documents which collectively cover as many subtopics as possible.

• Perfect selection takes "n choose K" time
  – Basically a set cover problem

• Greedy gives a (1 − 1/e) approximation bound
  – Special case of Max Coverage [Khuller et al., 1997]

Page 29

Learning Set Cover Representations

• Given:
  – Manual partitioning of a space
    • Subtopics
  – Weighting for how items cover the manual partitions
    • Subtopic labels + subtopic loss
  – Automatic partitioning of the space
    • Words

• Goal:
  – Weighting for how items cover the automatic partitions
  – Such that the (greedy) optimal covering solutions agree

Page 30

Essential Pages

Page 31

Essential Pages

[Swaminathan et al., 2008]

x = (x1, x2, …, xn) – set of candidate documents for a query
y – a subset of x of size K (our prediction)

Benefit of covering word v with document x_i:

  C(v, x_i, x) = TF(v, x_i) \cdot \gamma(v, x)

Importance of covering word v:

  \gamma(v, x) = r(v, x) \log_2 \frac{1}{r(v, x)}

where r(v, x) is the frequency of word v across the candidate set x (a sketch follows below).

Intuition:
  – Frequent words cannot encode information diversity
  – Infrequent words do not provide significant information

  \hat{y} = \operatorname{argmax}_{y} \; \sum_{v} C(v, y, x)
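A sketch of this importance weighting, assuming r(v, x) is the fraction of candidate documents containing v (consistent with the "word frequency in the candidate set" description on Page 11):

```python
import math

def essential_pages_gamma(word, docs):
    """gamma(v, x) = r * log2(1/r): small for very frequent AND very rare words."""
    r = sum(word in d for d in docs) / len(docs)
    if r in (0.0, 1.0):
        return 0.0  # ubiquitous or absent words carry no diversity signal
    return r * math.log2(1.0 / r)
```

The weight peaks for words of intermediate frequency, matching the intuition above; the learned approach in this talk replaces this fixed curve with per-query weights.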

Page 32

Structural SVMs

Page 33

Minimizing Hinge Loss

Objective and constraints [Tsochantaridis et al., 2005]:

  \min_{w, \xi \ge 0} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i} \xi_i

  \forall i, \forall y': \; w^{\top} \Psi(x^{(i)}, y^{(i)}) \ge w^{\top} \Psi(x^{(i)}, y') + \Delta(y') - \xi_i

Suppose for an incorrect y′ the achieved score gap covers only part of the required margin, say 0.25 \cdot \Delta(y'). Then the slack must absorb the rest:

  \xi_i \ge 0.75 \cdot \Delta(y')

so each \xi_i upper bounds the hinge loss on example i.

Page 34

Finding Most Violated Constraint

• A constraint is violated when:

  w^{\top} \Psi(x, y') + \Delta(y') - w^{\top} \Psi(x, y) - \xi_i > 0

• Finding the most violated constraint therefore reduces to:

  \hat{y}' = \operatorname{argmax}_{y'} \; w^{\top} \Psi(x, y') + \Delta(y')

Page 35

Finding Most Violated Constraint

• Encode each subtopic as an additional “word” to be covered.

• Use greedy prediction to find the approximate most violated constraint (sketched below)

  \Psi'(x, y) = \begin{bmatrix} \sum_{v \in V_1(y)} \phi_1(v, x) \\ \vdots \\ \sum_{v \in V_L(y)} \phi_L(v, x) \end{bmatrix}

where the word sets V_i(y) now also contain the encoded subtopic "words", so the loss \Delta(y') is absorbed into the augmented score:

  \hat{y}' = \operatorname{argmax}_{y'} \; w^{\top} \Psi'(x, y')

Page 36

Illustrative Example

Original SVM Problem
• Exponentially many constraints
• Most are dominated by a small set of "important" constraints

Structural SVM Approach (a training-loop sketch follows)
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation
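A runnable skeleton of that loop, under stated simplifications: the real method re-solves a quadratic program over all constraints found so far [Tsochantaridis et al., 2005]; here a perceptron-style update stands in for the QP step, and the helper names are illustrative:

```python
import numpy as np

def psi(docs, subset):
    """Psi(x, y): sum of phi(v, x) over the distinct words covered by y."""
    covered = set().union(*(docs[i] for i in subset))
    return sum((phi(v, docs) for v in covered), np.zeros(3))  # 3 = toy phi dim

def train(examples, k, epochs=20):
    """examples: list of (docs, y_true, doc_subtopics, subtopic_weight) tuples."""
    w = np.zeros(3)
    for _ in range(epochs):
        for docs, y_true, topics, t_wt in examples:
            # Constraint generation: the most violated y' under the current w
            y_hat = most_violated_constraint(w, docs, topics, t_wt, k)
            if set(y_hat) != set(y_true):
                # Stand-in update; a structural SVM would re-solve the QP here
                w += psi(docs, y_true) - psi(docs, y_hat)
    return w
```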


Page 40

Approximate Constraint Generation

• Theoretical guarantees no longer hold
  – Might not find an epsilon-close approximation to the feasible region boundary

• Performs well in practice.

Page 41

Approximate constraint generation seems to perform well in practice.

Page 42

Experiments

Page 43

TREC Experiments

• 12/4/1 train/valid/test split
  – Approx. 500 documents in the training set

• Permuted until all 17 queries were tested once

• Set K=5 (some queries have very few documents)

• SVM-div – uses term frequency thresholds to define importance levels

• SVM-div2 – in addition uses TF-IDF thresholds

Page 44

TREC Results

Method              Loss
Random              0.469
Okapi               0.472
Unweighted Model    0.471
Essential Pages     0.434
SVM-div             0.349
SVM-div2            0.382

Methods                    W / T / L
SVM-div vs Ess. Pages      14 / 0 / 3 **
SVM-div2 vs Ess. Pages     13 / 0 / 4
SVM-div vs SVM-div2         9 / 6 / 2

Page 45

Synthetic Dataset

• TREC dataset very small
• Synthetic dataset so we can vary the retrieval size K

• 100 queries
• 100 docs/query, 25 subtopics/query, 300 words/doc

• 15/10/75 train/valid/test split

Page 46

Consistently outperforms Essential Pages

