Diversified Retrieval as Structured Prediction
Redundancy, Diversity, and
Interdependent Document Relevance (IDR ’09)
SIGIR 2009 Workshop
Yisong Yue, Cornell University
Joint work with Thorsten Joachims
Need for Diversity (in IR)
• Ambiguous Queries
– Different information needs using the same query
– E.g., “Jaguar”
– At least one relevant result for each information need
• Learning Queries
– User interested in “a specific detail or entire breadth of knowledge available” [Swaminathan et al., 2008]
– Results with high information diversity
Optimizing Diversity
• Interest in information retrieval– [Carbonell & Goldstein, 1998; Zhai et al., 2003; Zhang et al., 2005;
Chen & Karger, 2006; Zhu et al., 2007; Swaminathan et al., 2008]
• Requires inter-document dependencies
– Impossible with standard independence assumptions
– E.g., the probability ranking principle
• No consensus on how to measure diversity.
This Talk
• A method for representing and optimizing information coverage
• Discriminative training algorithm– Based on structural SVMs
• Appropriate forms of training data– Requires sufficient granularity (subtopic labels)
• Empirical evaluation
• Choose top 3 documents
• Individual Relevance: D3 D4 D1
• Pairwise Sim MMR: D3 D1 D2
• Best Solution: D3 D1 D5
How to Represent Information?
• Discrete feature space to represent information– Decomposed into “nuggets”
• For query q and its candidate documents:
– All the words (title words, anchor text, etc.)
– Cluster memberships (topic models / dimensionality reduction)
– Taxonomy memberships (e.g., ODP)
• We will focus on words and title words.
Weighted Word Coverage
• More distinct words = more information
• Weight word importance
• Will work automatically w/o human labels
• Goal: select K documents that collectively cover as many distinct (weighted) words as possible
– Budgeted max coverage problem (Khuller et al., 1997)
– Greedy selection yields a (1 − 1/e)-approximation bound
– Need to find a good weighting function (the learning problem)
Example

Document Word Counts (X = word appears in document):

         V1   V2   V3   V4   V5
   D1              X    X    X
   D2         X         X    X
   D3    X    X    X    X

Word Benefit:

   V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5

Marginal Benefit at each greedy iteration:

            D1   D2   D3   Best
   Iter 1   12   11   10    D1
   Iter 2   --    2    3    D3
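The greedy selection traced in the example can be sketched in a few lines of Python. The incidence table and word benefits below are read off the example; `greedy_select` is a hypothetical helper name, not code from the talk:

```python
def greedy_select(docs, benefit, k):
    """Greedily pick k documents maximizing total benefit of distinct covered words."""
    covered, chosen = set(), []
    for _ in range(k):
        best_doc, best_gain = None, -1.0
        for name, words in docs.items():
            if name in chosen:
                continue
            # Marginal benefit: only words not yet covered count.
            gain = sum(benefit[v] for v in words - covered)
            if gain > best_gain:
                best_doc, best_gain = name, gain
        chosen.append(best_doc)
        covered |= docs[best_doc]
    return chosen

benefit = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}
docs = {"D1": {"V3", "V4", "V5"},
        "D2": {"V2", "V4", "V5"},
        "D3": {"V1", "V2", "V3", "V4"}}

# Iteration 1 gains: D1=12, D2=11, D3=10 -> pick D1.
# Iteration 2 marginal gains: D2=2, D3=3 -> pick D3.
print(greedy_select(docs, benefit, 2))  # ['D1', 'D3']
```

Running it reproduces the marginal-benefit table: D1 first, then D3 once V3, V4, V5 no longer add value.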
How to Weight Words?
• Not all words are created equal
– E.g., “the”
• Conditional on the query
– “computer” is normally fairly informative…
– …but not for the query “ACM”
• Learn weights based on the candidate set (for a given query)
Prior Work
• Essential Pages [Swaminathan et al., 2008]
– Uses a fixed function of word benefit
– Depends on word frequency in the candidate set (a local version of TF-IDF)
– Frequent words get low weight (not important for diversity)
– Rare words get low weight (not representative)
Linear Discriminant
• x = (x1,x2,…,xn) - candidate documents
• v – an individual word
• Feature vector:

   φ(v, x) = ( …,
               1[v appears in ≥10% of x],
               1[v appears in ≥20% of x],
               …,
               1[v appears in ≥10% of titles in x],
               … )

• We will use thousands of such features
Linear Discriminant
• x = (x1,x2,…,xn) - candidate documents
• y – subset of x (the prediction)
• V(y) – union of words from documents in y
• Discriminant function:

   wᵀΨ(x, y) = Σ_{v ∈ V(y)} wᵀφ(v, x)

   ŷ = argmax_y wᵀΨ(x, y)

• Benefit of covering word v is then wᵀφ(v, x)
Linear Discriminant
• Does NOT reward redundancy – Benefit of each word only counted once
• Greedy has (1-1/e)-approximation bound
• Linear (joint feature space) – Suitable for SVM optimization
   wᵀΨ(x, y) = Σ_{v ∈ V(y)} wᵀφ(v, x)
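A minimal sketch of this joint feature map, assuming a toy two-entry φ(v, x) (the two frequency thresholds here are illustrative stand-ins for the thousands of features in the talk, not the actual feature set):

```python
def phi(v, docs, titles):
    """Toy per-word feature vector: [appears in >=10% of docs, >=10% of titles]."""
    n = len(docs)
    in_docs = sum(v in d for d in docs) / n
    in_titles = sum(v in t for t in titles) / n
    return [1.0 if in_docs >= 0.1 else 0.0,
            1.0 if in_titles >= 0.1 else 0.0]

def joint_feature_map(y, docs, titles):
    """Psi(x, y): sum of phi(v, x) over distinct words covered by subset y."""
    covered = set().union(*(docs[i] for i in y))  # V(y): each word counted once
    psi = [0.0, 0.0]
    for v in covered:
        psi = [a + b for a, b in zip(psi, phi(v, docs, titles))]
    return psi

def score(w, y, docs, titles):
    """Linear discriminant w^T Psi(x, y)."""
    return sum(wi * pi for wi, pi in zip(w, joint_feature_map(y, docs, titles)))

docs = [{"a", "b"}, {"b", "c"}]       # word sets per candidate document
titles = [{"a"}, {"c"}]               # title word sets per candidate document
print(score([1.0, 1.0], [0, 1], docs, titles))
```

Because V(y) is a set, selecting a second document containing “b” adds nothing for that word: redundancy is not rewarded.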
More Sophisticated Discriminant
• Documents “cover” words to different degrees– A document with 5 copies of “Thorsten” might cover it
better than another document with only 2 copies.
• Use multiple word sets, V1(y), V2(y), … , VL(y)
• Each Vi(y) contains only words satisfying certain importance criteria.
• Requires more sophisticated joint feature map.
Conventional SVMs
• Input: x (high dimensional point)
• Target: y (either +1 or -1)
• Prediction: sign(wTx)
• Training:

   argmin_{w,ξ} (1/2)‖w‖² + (C/N) Σ_i ξ_i

   subject to: ∀i: y_i (wᵀx_i) ≥ 1 − ξ_i,  ξ_i ≥ 0

• The sum of slacks upper-bounds the accuracy loss
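The slack/loss relationship can be checked numerically: at the optimum each slack equals the hinge loss max(0, 1 − y·wᵀx), which upper-bounds the 0/1 error. The weight vector and points below are toy assumptions:

```python
def hinge(w, x, y):
    """Hinge loss max(0, 1 - y * <w, x>) for a single example."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)

w = [1.0, -0.5]
data = [([2.0, 0.0], +1),   # correct with margin: slack 0, error 0
        ([0.5, 0.0], +1),   # correct but inside margin: slack 0.5, error 0
        ([1.0, 0.0], -1)]   # misclassified: slack 2.0, error 1

for x, y in data:
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    zero_one = 1.0 if margin <= 0 else 0.0
    assert hinge(w, x, y) >= zero_one  # slack upper-bounds 0/1 loss
```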
Structural SVM Formulation
• Input: x (candidate set of documents)
• Target: y (subset of x of size K)
• Same objective function:

   argmin_{w,ξ} (1/2)‖w‖² + (C/N) Σ_i ξ_i

• Constraints for each incorrect labeling y′:

   ∀i, ∀y′ ≠ y_i: wᵀΨ(x_i, y_i) ≥ wᵀΨ(x_i, y′) + Δ(y′) − ξ_i

• Score of the best y must be at least as large as that of any incorrect y′ plus its loss
• Requires a new training algorithm [Tsochantaridis et al., 2005]
Weighted Subtopic Loss
• Example:
– x1 covers t1
– x2 covers t1, t2, t3
– x3 covers t1, t3
• Motivation:
– Higher penalty for not covering popular subtopics
– Mitigates effects of label noise in tail subtopics

        # Docs   Loss
   t1     3      1/2
   t2     1      1/6
   t3     2      1/3
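The loss weights in the table can be computed directly: each subtopic is weighted by how many candidate documents cover it, normalized over all (document, subtopic) pairs. The subtopic assignments are from the example above; the function names are illustrative:

```python
def subtopic_weights(doc_subtopics):
    """Weight of subtopic t = (# docs covering t) / (total coverage count)."""
    counts = {}
    for subs in doc_subtopics.values():
        for t in subs:
            counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def weighted_subtopic_loss(y, doc_subtopics, weights):
    """Total weight of subtopics NOT covered by the selected documents y."""
    covered = set().union(*(doc_subtopics[d] for d in y))
    return sum(wt for t, wt in weights.items() if t not in covered)

docs = {"x1": {"t1"}, "x2": {"t1", "t2", "t3"}, "x3": {"t1", "t3"}}
w = subtopic_weights(docs)
# t1 covered by 3 docs -> 3/6 = 1/2; t2 -> 1/6; t3 -> 2/6 = 1/3
```

Selecting only x1 covers t1 and incurs loss 1/6 + 1/3 = 1/2, so missing the popular subtopic t1 would have cost the most.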
Diversity Training Data
• TREC 6-8 Interactive Track
– Queries with explicitly labeled subtopics
– E.g., “Use of robots in the world today”
   • Nanorobots
   • Space mission robots
   • Underwater robots
– Manual partitioning of the total information regarding a query
Experiments
• TREC 6-8 Interactive Track queries
• Documents labeled into subtopics
• 17 queries used
– Considered only relevant docs
– Decouples the relevance problem from the diversity problem
• 45 docs/query, 20 subtopics/query, 300 words/doc
• Trained using leave-one-out (LOO) cross-validation
• TREC 6-8 Interactive Track, retrieving 5 documents
• Subtopic loss (lower is better):

   Random             0.469
   Okapi              0.472
   Unweighted Model   0.471
   Essential Pages    0.434
   SVM-div            0.349
Can expect further benefit from having more training data.
Moving Forward
• Larger datasets– Evaluate relevance & diversity jointly
• Different types of training data– Our framework can define loss in different ways– Can we leverage clickthrough data?
• Different feature representations
– Build on top of topic modeling approaches?
– Can we incorporate hierarchical retrieval?
References & Code/Data
• “Predicting Diverse Subsets Using Structural SVMs”– [Yue & Joachims, ICML 2008]
• Source code and dataset available online – http://projects.yisongyue.com/svmdiv/
• Work supported by NSF IIS-0713483, Microsoft Fellowship, and Yahoo! KTC Grant.
Extra Slides
More Sophisticated Discriminant
   Ψ(x, y) = ( Σ_{v ∈ V1(y)} φ1(v, x), …, Σ_{v ∈ VL(y)} φL(v, x) )

   ŷ = argmax_y wᵀΨ(x, y)

• Separate φ_i for each importance level i
• Joint feature map is the vector composition of all φ_i
• Greedy still has a (1 − 1/e)-approximation bound
• Still uses a linear feature space
Maximizing Subtopic Coverage
• Goal: select K documents which collectively cover as many subtopics as possible.
• Perfect selection requires examining (n choose K) subsets
– Basically a set cover problem
• Greedy gives a (1 − 1/e)-approximation bound
– Special case of Max Coverage (Khuller et al., 1997)
Learning Set Cover Representations
• Given:
– A manual partitioning of a space (subtopics)
– A weighting for how items cover the manual partitions (subtopic labels + subtopic loss)
– An automatic partitioning of the space (words)
• Goal:
– A weighting for how items cover the automatic partitions
– Such that the (greedy) optimal covering solutions agree
Essential Pages
[Swaminathan et al., 2008]

• x = (x1, x2, …, xn) – set of candidate documents for a query
• y – a subset of x of size K (our prediction)

Benefit of covering word v with document x_i:

   C(v, x_i, x) = TF(v, x_i) · γ(v, x)

Importance of covering word v, with r(v, x) the fraction of documents in x containing v:

   γ(v, x) = r(v, x) · log₂(1 / r(v, x))

Prediction:

   ŷ = argmax_y Σ_v C(v, y, x)

Intuition:
– Frequent words cannot encode information diversity
– Infrequent words do not provide significant information
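A quick sketch of a word-importance curve of the form r·log₂(1/r), a reconstruction consistent with the intuition on this slide (the sample values of r are illustrative): both very frequent and very rare words receive low weight, while mid-frequency words score highest.

```python
import math

def importance(r):
    """Word importance r * log2(1/r), with r the fraction of docs containing the word."""
    if r <= 0.0 or r >= 1.0:
        return 0.0   # a word in no document, or in every document, carries no diversity signal
    return r * math.log2(1.0 / r)

# Mid-frequency words beat both frequent and rare ones:
for r in (0.01, 0.3, 0.9):
    print(f"r={r}: importance={importance(r):.3f}")
```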
Structural SVMs
Minimizing Hinge Loss
• Objective:

   argmin_{w,ξ} (1/2)‖w‖² + (C/N) Σ_i ξ_i

• Constraints:

   ∀i, ∀y′ ≠ y_i: wᵀΨ(x_i, y_i) ≥ wᵀΨ(x_i, y′) + Δ(y′) − ξ_i

• Suppose for an incorrect y′:

   wᵀΨ(x_i, y_i) − wᵀΨ(x_i, y′) = 0.25 · Δ(y′)

• Then: ξ_i ≥ 0.75 · Δ(y′)

[Tsochantaridis et al., 2005]
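This can be checked numerically: the optimal slack ξ_i equals the structured hinge loss, i.e. the largest violation Δ(y′) + wᵀΨ(x, y′) − wᵀΨ(x, y_i) over incorrect labelings, floored at zero, and it upper-bounds the loss of the current prediction. The documents, weights, and Δ values below are toy assumptions:

```python
def score(y, docs, w):
    """w^T Psi(x, y) with unit phi: total weight of distinct words covered by y."""
    covered = set().union(*(docs[d] for d in y))
    return sum(w[v] for v in covered)

def min_slack(y_true, loss, docs, w):
    """Smallest xi satisfying all constraints: the structured hinge loss."""
    s_true = score(y_true, docs, w)
    return max(0.0, max(delta + score(yp, docs, w) - s_true
                        for yp, delta in loss.items()))

docs = {"D1": {"a", "b"}, "D2": {"b", "c"}, "D3": {"c"}}
w = {"a": 1.0, "b": 2.0, "c": 2.0}
y_true = ("D1",)
loss = {("D2",): 0.4, ("D3",): 0.8}   # hypothetical Delta values per incorrect labeling

xi = min_slack(y_true, loss, docs, w)
prediction = max(loss, key=lambda yp: score(yp, docs, w))
assert xi >= loss[prediction]          # slack upper-bounds the prediction's loss
```

Here score(D2) = 4 exceeds score(D1) = 3, so the model mispredicts and the slack (1.4) indeed bounds the incurred loss (0.4) from above.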
Finding Most Violated Constraint
• A constraint is violated when

   wᵀΨ(x, y′) + Δ(y′) − wᵀΨ(x, y) − ξ_i > 0

• Finding the most violated constraint reduces to

   ŷ′ = argmax_{y′} wᵀΨ(x, y′) + Δ(y′)
Finding Most Violated Constraint
• Encode each subtopic as an additional “word” to be covered.
• Use greedy prediction to find approximate most violated constraint.
   Ψ′(x, y) = ( Σ_{v ∈ V1(y)} φ1(v, x), …, Σ_{v ∈ VL(y)} φL(v, x), Σ_{t ∈ T(y)} φT(t, x) )

   ŷ′ = argmax_{y′} wᵀΨ′(x, y′)
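One way to realize the “subtopic as extra word” encoding in the greedy search: since Δ(y′) equals the total subtopic weight minus the weight of subtopics covered by y′, maximizing wᵀΨ(x, y′) + Δ(y′) is, up to a constant, a coverage problem in which each subtopic token carries a negative benefit. The data below are toy assumptions, and the helper name is hypothetical:

```python
def loss_augmented_greedy(docs, word_benefit, subtopic_weight, k):
    """Greedy approximation of argmax_y' [ w^T Psi(x, y') + Delta(y') ].

    Delta(y') = const - (weight of subtopics covered by y'), so each
    subtopic behaves like an extra "word" with negative benefit."""
    benefit = dict(word_benefit)
    benefit.update({t: -wt for t, wt in subtopic_weight.items()})
    covered, chosen = set(), []
    for _ in range(k):
        best, best_gain = None, None
        for name, items in docs.items():
            if name in chosen:
                continue
            gain = sum(benefit[i] for i in items - covered)
            if best_gain is None or gain > best_gain:
                best, best_gain = name, gain
        chosen.append(best)
        covered |= docs[best]
    return chosen

# Each document's item set mixes real words (w*) and subtopic tokens (t*):
docs = {"D1": {"w1", "w2", "t1"}, "D2": {"w3", "t1", "t2"}}
word_benefit = {"w1": 1.0, "w2": 1.0, "w3": 3.0}
subtopic_weight = {"t1": 0.5, "t2": 0.5}
print(loss_augmented_greedy(docs, word_benefit, subtopic_weight, 1))
```

With negative gains the (1 − 1/e) guarantee no longer applies, which is why the slide calls this an approximate most violated constraint.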
Illustrative Example
Original SVM Problem:
• Exponentially many constraints
• Most are dominated by a small set of “important” constraints

Structural SVM Approach:
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation
Approximate Constraint Generation
• Theoretical guarantees no longer hold
– Might not find an ε-close approximation to the feasible region boundary
• Nonetheless, approximate constraint generation performs well in practice
Experiments
TREC Experiments
• 12/4/1 train/valid/test split– Approx 500 documents in training set
• Permuted until all 17 queries were tested once
• Set K=5 (some queries have very few documents)
• SVM-div – uses term frequency thresholds to define importance levels
• SVM-div2 – in addition uses TFIDF thresholds
TREC Results
   Method             Loss
   Random             0.469
   Okapi              0.472
   Unweighted Model   0.471
   Essential Pages    0.434
   SVM-div            0.349
   SVM-div2           0.382

   Comparison                  W / T / L
   SVM-div vs Ess. Pages       14 / 0 / 3 **
   SVM-div2 vs Ess. Pages      13 / 0 / 4
   SVM-div vs SVM-div2          9 / 6 / 2
Synthetic Dataset
• TREC dataset is very small
• Synthetic dataset lets us vary the retrieval size K
• 100 queries
• 100 docs/query, 25 subtopics/query, 300 words/doc
• 15/10/75 train/valid/test split
Consistently outperforms Essential Pages