An Introduction to Structural SVMs and its Application to Information Retrieval
Yisong Yue, Carnegie Mellon University


Page 2:

Supervised Learning

• Find a function from input space X to output space Y such that the prediction error is low.

x: "Microsoft announced today that they acquired Apple for the amount equal to the gross national product of Switzerland. Microsoft officials stated that they first wanted to buy Switzerland, but eventually were turned off by the mountains and the snowy winters…"
y: +1

x: "GATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAGATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCACATTTA"
y: -1

x: (another input)
y: 7.3 (real-valued output, i.e., regression)

Page 3:

Examples of Complex Output Spaces

• Part-of-Speech Tagging
  – Given a sequence of words x, predict the sequence of tags y.
  – Dependencies come from tag-tag transitions in the Markov model.

Similarly for other sequence labeling problems, e.g., RNA Intron/Exon Tagging.

x: The rain wet the cat
y: Det  N    V   Det N

Page 4:

Examples of Complex Output Spaces

• Natural Language Parsing
  – Given a sequence of words x, predict the parse tree y.
  – Dependencies come from structural constraints, since y has to be a tree.

x: The dog chased the cat
y: (S (NP (Det The) (N dog)) (VP (V chased) (NP (Det the) (N cat))))

Page 5:

Examples of Complex Output Spaces

• Information Retrieval
  – Given a query x, predict a ranking y.
  – Dependencies between results (e.g., avoid redundant hits).
  – Loss function over rankings (e.g., Average Precision).

x: SVM
y: 1. Kernel-Machines
   2. SVM-Light
   3. Learning with Kernels
   4. SV Meppen Fan Club
   5. Service Master & Co.
   6. School of Volunteer Management
   7. SV Mattersburg Online
   …

Page 6:

Examples of Complex Output Spaces

• Multi-label Prediction

• Protein Sequence Alignment

• Noun Phrase Co-reference Clustering

• Rankings in Information Retrieval

• Inference in Graphical Models

• …and many more.

Page 7:

1st Order Sequence Labeling

• Given:
  – scoring function S(x, y_{t-1}, y_t)
  – input example x = (x_1, …, x_n)

• Find the sequence y = (y_1, …, y_n) that maximizes the total score:

  h(x) = \argmax_y \sum_t S(x, y_{t-1}, y_t)        ("Hypothesis Function")

• Solved with dynamic programming (Viterbi).
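As a concrete illustration (my own sketch, not from the slides), here is a minimal Viterbi routine in Python. The score table S is an assumption: it plays the role of S(x, y_{t-1}, y_t) with the dependence on x already folded in.

```python
import numpy as np

# A minimal Viterbi sketch (my own illustration, not from the slides).
# S[t, p, c] holds the score S(x, y_{t-1}=p, y_t=c) for position t; the
# dependence on x is assumed to be precomputed into this table, and
# p = K is a dummy "start" tag used only at t = 0.
def viterbi(S):
    n, K = S.shape[0], S.shape[2]
    delta = S[0, K, :].copy()                # best prefix score ending in each tag
    backptr = np.zeros((n, K), dtype=int)
    for t in range(1, n):
        cand = delta[:, None] + S[t, :K, :]  # cand[p, c] = delta[p] + S[t, p, c]
        backptr[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    y = [int(delta.argmax())]                # backtrace the argmax sequence
    for t in range(n - 1, 0, -1):
        y.append(int(backptr[t, y[-1]]))
    return y[::-1], float(delta.max())

# Toy usage: n = 3 positions, K = 2 tags (index 2 is the dummy start tag).
S = np.random.default_rng(0).normal(size=(3, 3, 2))
print(viterbi(S))
```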

Page 8:

Some Formulation Restrictions

• Assume S is parameterized linearly by some weight vector w ∈ R^D.

• This means that

  S(x, y_{t-1}, y_t) = w^T \phi(x, y_{t-1}, y_t)        ("Hypothesis Function")

Page 9:

Joint Feature Map

• From the last slide:

  h(x) = \argmax_y \sum_t w^T \phi(x, y_{t-1}, y_t)

• Joint feature map:

  \Psi(x, y) = \sum_t \phi(x, y_{t-1}, y_t)

• Our hypothesis function:

  h(x; w) = \argmax_y w^T \Psi(x, y)        ("Linear Discriminant Function")

Page 10:

Structured Prediction Learning Problem

• Efficient Inference/Prediction
  – Viterbi in sequence labeling
  – CKY parser for parse trees
  – Belief propagation for Markov random fields
  – Sorting for ranking

• Efficient Learning/Training
  – Learn parameters w from training data {(x_i, y_i)}_{i=1..N}
  – Solution: use the Structural SVM framework
  – Can also use Perceptrons, CRFs, MEMMs, M3Ns, etc.

Page 11:

Conventional SVMs

• Input: x (high-dimensional point)
• Target: y (either +1 or -1)
• Prediction: sign(w^T x)

• Training:

  \argmin_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i

  subject to:

  \forall i: \; y_i (w^T x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0

• The sum of slacks upper bounds the 0/1 loss!
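To make the objective concrete, here is a hedged sketch (mine, not the slides' solver) that minimizes the equivalent hinge-loss form of this objective by subgradient descent, on assumed toy data:

```python
import numpy as np

# Minimal subgradient descent on (1/2)||w||^2 + (C/N) * sum_i hinge(y_i, w.x_i),
# the unconstrained form of the SVM objective above (bias term omitted).
def train_svm(X, y, C=1.0, lr=0.01, epochs=200):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        margins = y * (X @ w)
        viol = margins < 1                 # examples with nonzero hinge loss
        grad = w - (C / N) * (y[viol, None] * X[viol]).sum(axis=0)
        w -= lr * grad
    return w

# Toy usage: two linearly separable point clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w = train_svm(X, y)
print((np.sign(X @ w) == y).mean())        # training accuracy
```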

Page 12:

Structural SVM

• Let x denote a structured input (e.g., a sentence)
• Let y denote a structured output (e.g., POS tags)

• Standard objective function:

  \min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_i \xi_i

• Constraints are defined for each incorrect labeling y' over each x^{(i)}:

  \forall i, \forall y' \ne y^{(i)}: \;
  w^T \Psi(x^{(i)}, y^{(i)}) \ge w^T \Psi(x^{(i)}, y') + \Delta(y') - \xi_i

  (Score(y^{(i)}) ≥ Score(y') + Loss(y') - Slack)

[Tsochantaridis et al., 2005]

Page 13:

Interpreting Constraints

Suppose for incorrect y’:

Then:

i

iN

Cw 2

2

1

)'(75.0 yi

Score(y(i)) Score(y’) Loss(y’) Slack

[Tsochantaridis et al., 2005]

Page 14:

Adapting to Sequence Labeling

• Minimize

  \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_i \xi_i

  subject to

  \forall i, \forall y': \;
  w^T \Psi(x^{(i)}, y^{(i)}) \ge w^T \Psi(x^{(i)}, y') + \Delta(y^{(i)}, y') - \xi_i

  where

  \Psi(x, y) = \sum_t \phi(x, y_{t-1}, y_t)

  and (Hamming loss)

  \Delta(y, y') = \sum_{t=1}^{n} \mathbf{1}[y_t \ne y'_t]

• Sum of slacks upper bounds the loss.

  Too many constraints!

Page 15:

Structural SVM Training

• The trick is to not enumerate all constraints.

• Suppose we only solve the SVM objective over a small subset of constraints (the working set).
  – This is efficient.
  – Equivalent to solving a standard SVM.

• But some constraints from the global set might be violated.

Page 16:

Structural SVM Training

• STEP 1: Solve the SVM objective function using only the working set of constraints W.

• STEP 2: Using the model learned in STEP 1, find the most violated constraint from the global set of constraints.

• STEP 3: If the constraint returned in STEP 2 is violated by more than ε, add it to W.

• Repeat STEPs 1-3 until no additional constraints are added. Return the most recent model trained in STEP 1.

STEPs 1-3 are guaranteed to loop for at most O(1/ε) iterations. [Joachims et al., 2009]

*This is known as a "cutting plane" method.
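In outline, the loop might look like the following sketch (mine, not the slides' exact algorithm); solve_qp, most_violated, and violation are assumed, problem-specific components:

```python
# Skeletal cutting-plane loop for STEPs 1-3 (a sketch, not a full solver;
# solve_qp, most_violated, and violation are assumed components supplied
# by the user for their particular structured prediction problem).
def cutting_plane(data, solve_qp, most_violated, violation, eps=1e-3):
    W = []                                    # working set of constraints
    while True:
        w, slacks = solve_qp(W)               # STEP 1: QP over working set
        added = False
        for i, (x, y) in enumerate(data):
            ybar = most_violated(w, x, y)     # STEP 2: loss-augmented inference
            if violation(w, x, y, ybar) > slacks[i] + eps:
                W.append((i, ybar))           # STEP 3: grow working set
                added = True
        if not added:                         # nothing violated by more than eps
            return w
```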

Page 17:

Illustrative Example

Original SVM Problem
• Exponentially many constraints
• Most are dominated by a small set of "important" constraints

Structural SVM Approach
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation.

*This is known as a "cutting plane" method.


Page 21:

Finding Most Violated Constraint

• A constraint is violated when

  w^T \Psi(x^{(i)}, y^{(i)}) < w^T \Psi(x^{(i)}, y') + \Delta(y') - \xi_i

• Finding the most violated constraint reduces to

  \argmax_{y'} \; w^T \Psi(x^{(i)}, y') + \Delta(y')

• Highly related to inference, \argmax_y w^T \Psi(x, y):

  "Loss-augmented inference"

Page 22:

Sequence Labeling Revisited

• Finding the most violated constraint,

  \argmax_{y'} \; w^T \Psi(x, y') + \Delta(y, y'),

  … can be solved using Viterbi! (The Hamming loss decomposes over positions, so it simply adds \mathbf{1}[y'_t \ne y_t] to each position's score.)
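For instance (my own sketch, reusing the score-table layout and the viterbi() function from the earlier example), loss augmentation only changes the scores fed to Viterbi:

```python
import numpy as np

# Loss-augmented inference for Hamming loss (a sketch; assumes the
# viterbi() function and S[t, p, c] score layout from the earlier block).
# Hamming loss decomposes over positions, so it just adds 1 to the score
# of every candidate tag that disagrees with the true tag y_true[t].
def most_violated(S, y_true):
    S_aug = S.copy()
    n, _, K = S_aug.shape
    for t in range(n):
        for c in range(K):
            if c != y_true[t]:
                S_aug[t, :, c] += 1.0   # the 1[y'_t != y_t] term of Delta
    return S_aug                        # then: y_bar, _ = viterbi(S_aug)
```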

Page 23:

Structural SVM Recipe

• Joint feature map

  \Psi(x, y)

• Inference method

  \argmax_y \; w^T \Psi(x, y)

• Loss function

  \Delta(y)

• Loss-augmented inference (most violated constraint)

  \argmax_{y'} \; w^T \Psi(x, y') + \Delta(y')

Page 24:

Structural SVMs for Rankings

• Predicting rankings is important in IR.

• Must provide the four ingredients:
  – How to represent the joint feature map Ψ?
  – How to perform inference?
  – What loss function to optimize for?
  – How to find the most violated constraint?

Page 25:

Joint Feature Map for Ranking

• Let x = (x_1, …, x_n) denote the candidate documents.

• Let y_{jk} ∈ {+1, -1} encode pairwise rank orders (+1 iff j is ranked above k).

• The feature map is the pairwise feature difference of documents:

  \Psi(x, y) = \sum_{j: \mathrm{rel}} \sum_{k: \neg\mathrm{rel}} y_{jk} \, (x_j - x_k)

• Inference is made by sorting on document scores w^T x_i.
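A toy sketch (mine) of this feature map and its sort-based inference; the boolean array rel marking relevant candidates is an assumption for the example:

```python
import numpy as np

# Pairwise-difference feature map and sort-based inference (illustration).
def psi(X, rel, ranking):
    """X: (n, D) doc features; ranking: doc indices, best first."""
    pos = {d: r for r, d in enumerate(ranking)}
    out = np.zeros(X.shape[1])
    for j in np.where(rel)[0]:
        for k in np.where(~rel)[0]:
            y_jk = 1.0 if pos[j] < pos[k] else -1.0
            out += y_jk * (X[j] - X[k])
    return out

def infer(X, w):
    """argmax_y of w.psi(x, y): sort docs by score, highest first."""
    return list(np.argsort(-(X @ w)))

# Usage on random data:
rng = np.random.default_rng(0)
X, w = rng.normal(size=(5, 3)), rng.normal(size=3)
rel = np.array([True, False, True, False, False])
print(infer(X, w), psi(X, rel, infer(X, w)))
```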

Page 26:

Multivariate Loss Functions

• Information retrieval focuses on ranking-centric performance measures:
  – Normalized Discounted Cumulative Gain
  – Precision @ K
  – Mean Average Precision
  – Expected Reciprocal Rank

• These measures all depend on the entire ranking.


Page 28:

Mean Average Precision

• Consider the rank position of each relevant doc: K_1, K_2, …, K_R.

• Compute Precision@K for each of K_1, K_2, …, K_R.

• Average Precision = average of those P@K values.

• Ex: relevant docs at ranks 1, 3, and 5 give

  \mathrm{AvgPrec} = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{5}\right) \approx 0.76

• MAP is Average Precision averaged across multiple queries/rankings.
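A small helper (my own sketch) computing Average Precision from a 0/1 relevance list ordered by rank, reproducing the example above:

```python
# Average Precision from a rank-ordered 0/1 relevance list (sketch).
def average_precision(rels):
    hits, total = 0, 0.0
    for k, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / k          # Precision@K at each relevant doc
    return total / max(hits, 1)

print(average_precision([1, 0, 1, 0, 1]))   # relevant at ranks 1, 3, 5 -> ~0.76
```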

Page 29:

[Yue & Burges, 2007]

Page 30:

Structural SVM for MAP

• Minimize

  \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_i \xi_i

  subject to

  \forall i, \forall y' \ne y^{(i)}: \;
  w^T \Psi(x^{(i)}, y^{(i)}) \ge w^T \Psi(x^{(i)}, y') + \Delta(y') - \xi_i

  where ( y_{jk} ∈ {-1, +1} )

  \Psi(x^{(i)}, y) = \sum_{j: \mathrm{rel}} \sum_{k: \neg\mathrm{rel}} y_{jk} \, (x_j^{(i)} - x_k^{(i)})

  and

  \Delta(y') = 1 - \mathrm{AvgPrec}(y')

• The sum of slacks is a smooth upper bound on MAP loss.

[Yue et al., SIGIR 2007]

Page 31:

Too Many Constraints!

• For Average Precision, the true labeling is a ranking where the relevant documents are all ranked at the front, e.g., [rel, rel, rel, non, non].

• An incorrect labeling would be any other ranking, e.g., [rel, non, rel, rel, non].

• This ranking has an Average Precision of about 0.8, with Δ(y') ≈ 0.2.

• There is an intractable number of rankings, and thus an intractable number of constraints!

Page 32:

Finding Most Violated Constraint

Observations

• MAP is invariant to the order of documents within a relevance class.
  – Swapping two relevant (or two non-relevant) documents does not change MAP.

• The joint SVM score is optimized by sorting by document score, w^T x_j.

• So finding

  \argmax_{y'} \; \Delta(y') + \sum_{j: \mathrm{rel}} \sum_{k: \neg\mathrm{rel}} y'_{jk} \, (w^T x_j - w^T x_k)

  reduces to finding an interleaving between two sorted lists of documents (relevant and non-relevant).

[Yue et al., SIGIR 2007]


Page 37:

Finding Most Violated Constraint

• Start with the perfect ranking.
• Consider swapping adjacent relevant/non-relevant documents.
• Find the best feasible rank for the non-relevant document.
• Repeat for the next non-relevant document.
• Never want to swap past the previous non-relevant document.
• Repeat until all non-relevant documents have been considered.

  \argmax_{y'} \; \Delta(y') + \sum_{j: \mathrm{rel}} \sum_{k: \neg\mathrm{rel}} y'_{jk} \, (w^T x_j - w^T x_k)

[Yue et al., SIGIR 2007]
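For intuition only, here is a brute-force sketch (mine; feasible only for tiny document sets, whereas the interleaving procedure above is what actually scales):

```python
from itertools import combinations

# Brute-force loss-augmented inference for MAP (illustration only; this
# enumerates all interleavings, so use tiny inputs). scores are w.x per
# document; rel is a 0/1 relevance list. Within each relevance class the
# docs stay sorted by score, which the slides note is optimal.
def most_violated_ranking(scores, rel):
    docs = sorted(range(len(scores)), key=lambda d: -scores[d])
    R = [d for d in docs if rel[d]]
    NR = [d for d in docs if not rel[d]]
    n, best, best_val = len(docs), None, float("-inf")
    for rel_pos in combinations(range(n), len(R)):  # slots for relevant docs
        ranking, ri, ni = [None] * n, 0, 0
        for p in range(n):
            if p in rel_pos:
                ranking[p] = R[ri]; ri += 1
            else:
                ranking[p] = NR[ni]; ni += 1
        hits, ap = 0, 0.0                           # Delta = 1 - AvgPrec
        for k, d in enumerate(ranking, 1):
            if rel[d]:
                hits += 1; ap += hits / k
        delta = 1.0 - ap / max(hits, 1)
        pair = sum((1 if ranking.index(j) < ranking.index(k) else -1)
                   * (scores[j] - scores[k]) for j in R for k in NR)
        if delta + pair > best_val:
            best, best_val = ranking, delta + pair
    return best

print(most_violated_ranking([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```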

Page 38:

Comparison with other SVM methods

[Figure: bar chart of Mean Average Precision (y-axis, roughly 0.1 to 0.3) on six datasets (TREC 9 Indri, TREC 10 Indri, TREC 9 Submissions, TREC 10 Submissions, and both Submissions sets without the best run), comparing SVM-MAP, SVM-ROC, SVM-ACC, SVM-ACC2, SVM-ACC3, and SVM-ACC4.]

Page 39:

Need for Diversity (in IR)

• Ambiguous Queries
  – Different information needs using the same query, e.g., "Jaguar".
  – Want at least one relevant result for each information need.

• Learning Queries
  – User interested in "a specific detail or entire breadth of knowledge available" [Swaminathan et al., 2008].
  – Want results with high information diversity.

Page 40:

• Choose the top 3 documents.
• Individual Relevance: D3, D4, D1
• Greedy Coverage Solution: D3, D1, D5


Page 48:

Example

Document Word Counts:

        V1   V2   V3   V4   V5
  D1              X    X    X
  D2         X         X    X
  D3    X    X    X    X

Word Benefit:  V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5

Marginal Benefit:

          D1   D2   D3   Best
  Iter 1  12   11   10   D1
  Iter 2  --    2    3   D3
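A tiny sketch (mine) of the greedy coverage computation that produces the Marginal Benefit table above:

```python
# Greedy coverage on the toy example above, reproducing the marginal
# benefits in the table (my own sketch of the computation).
benefit = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}
docs = {
    "D1": {"V3", "V4", "V5"},
    "D2": {"V2", "V4", "V5"},
    "D3": {"V1", "V2", "V3", "V4"},
}

covered, chosen = set(), []
while len(chosen) < 2:
    marginal = {d: sum(benefit[v] for v in words - covered)
                for d, words in docs.items() if d not in chosen}
    best = max(marginal, key=marginal.get)
    print(marginal, "->", best)   # Iter 1: D1=12, D2=11, D3=10 -> D1
    covered |= docs[best]         # Iter 2: D2=2,  D3=3         -> D3
    chosen.append(best)
```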

Page 49:

Prior Work

• Essential Pages [Swaminathan et al., 2008]
  – Uses a fixed function of word benefit.
  – Depends on word frequency in the candidate set (a local version of TF-IDF):
    - Frequent words get low weight (not important for diversity).
    - Rare words get low weight (not representative).

Page 50:

Joint Feature Map

• x = (x_1, x_2, …, x_n) - candidate documents
• y - a subset of x
• V(y) - the union of words from the documents in y

• Joint feature map:

  w^T \Psi(x, y) = \sum_{v \in V(y)} w^T \phi(v, x)

• \phi(v, x) - frequency features (e.g., appears in >10% of docs, >20%, etc.)

• The benefit of covering word v is then w^T \phi(v, x).

[Yue & Joachims, ICML 2008]

Page 51:

Joint Feature Map

• Does NOT reward redundancy
  – The benefit of each word is only counted once.

• Greedy inference has a (1-1/e)-approximation bound
  – Because h(x) is monotone submodular.

• Linear (joint feature space)
  – Allows for SVM optimization.
  – (A more sophisticated discriminant was used in the experiments.)

[Yue & Joachims, ICML 2008]

Page 52:

Weighted Subtopic Loss

• Example:
  – x1 covers t1
  – x2 covers t1, t2, t3
  – x3 covers t1, t3

• Motivation: higher penalty for not covering popular subtopics.

        # Docs   Loss
  t1    3        1/2
  t2    1        1/6
  t3    2        1/3

[Yue & Joachims, ICML 2008]
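A small sketch (mine) of computing these loss weights, where each subtopic's weight is its covering-document count divided by the total count:

```python
# Weighted subtopic loss for the example above (my own sketch):
# weight(t) = (# docs covering t) / (total covering count).
coverage = {"x1": {"t1"}, "x2": {"t1", "t2", "t3"}, "x3": {"t1", "t3"}}
counts = {}
for topics in coverage.values():
    for t in topics:
        counts[t] = counts.get(t, 0) + 1
total = sum(counts.values())            # 6
weights = {t: c / total for t, c in counts.items()}
print(weights)                          # t1: 1/2, t2: 1/6, t3: 1/3

# Loss of a predicted subset y = sum of weights of subtopics left uncovered.
def subtopic_loss(y):
    covered = set().union(*(coverage[d] for d in y))
    return sum(w for t, w in weights.items() if t not in covered)
```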

Page 53:

Finding Most Violated Constraint

• Encode each subtopic as an additional "word" to be covered.

• Then run the same greedy algorithm used for inference, now on the loss-augmented objective.

Page 54:

• TREC 6-8 Interactive Track
• Retrieving 5 documents

  Weighted subtopic loss (lower is better):

  Random            0.469
  Okapi             0.472
  Unweighted        0.471
  Essential Pages   0.434
  SVM-div           0.349

Page 55:

Sentiment Classification

• Product reviews
• Movie reviews
• Political speeches
• Discussion forums

• What is the sentiment, and why?
  – What are the supporting sentences?

Page 56:

Identifying the Supporting Sentences

• Suppose we could extract the supporting sentences.

• How can we do this automatically?

[Figure: example reviews annotated with 87% accuracy and 98% accuracy.]

Page 57:

Joint Feature Map

• x = (x_1, x_2, …, x_n) - sentences of an article
• s - subset of supporting sentences
• y - sentiment label {+1, -1}

• The joint feature map Ψ(x, y, s) scores label y using only the sentences in s.

[Yessenalina, Yue & Cardie, EMNLP 2010]

Page 58:

Inference

h(x | w) = \argmax_y \max_s \; w^T \Psi(x, y, s)

[Yessenalina, Yue & Cardie, EMNLP 2010]

Page 59:

Latent Variable SVMs

• Input: x = (x_1, x_2, …, x_n) - sentences of an article
• Output: y (either +1 or -1), s (supporting sentences)

• Training:

  \argmin_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i

  subject to:

  \forall i, \forall y', \forall s': \;
  \max_{s} w^T \Psi(x^{(i)}, y^{(i)}, s) \ge w^T \Psi(x^{(i)}, y', s') + \Delta(y^{(i)}, y') - \xi_i

  Not convex! (The max over s on the left-hand side makes the constraints non-convex.)

[Yu & Joachims, 2009] [Yessenalina, Yue & Cardie, 2010]

Page 60:

Training Using CCCP

• First initialize s^{(i)} for each training instance.

• (1) Train:

  \argmin_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \xi_i

  s.t.: \forall i, \forall y', \forall s': \;
  w^T \Psi(x^{(i)}, y^{(i)}, s^{(i)}) \ge w^T \Psi(x^{(i)}, y', s') + \Delta(y^{(i)}, y') - \xi_i

  (Requires finding the most violated constraint.)

• (2) Infer s^{(i)} from w:

  s^{(i)} = \argmax_s \; w^T \Psi(x^{(i)}, y^{(i)}, s)

• Repeat (1) and (2) until convergence.

[Yu & Joachims, 2009] [Yessenalina, Yue & Cardie, 2010]
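A skeletal CCCP loop (my own sketch; train_structural_svm and infer_latent are assumed, problem-specific callbacks, and data holds (x, y, initial_s) triples):

```python
# Skeletal CCCP training loop for a latent-variable structural SVM.
def cccp(data, train_structural_svm, infer_latent, n_rounds=10):
    S = [s0 for (_, _, s0) in data]           # initialize each s^(i)
    w = None
    for _ in range(n_rounds):
        # (1) Fix the latent s^(i); the problem is now a convex structural SVM.
        w = train_structural_svm([(x, y, s) for (x, y, _), s in zip(data, S)])
        # (2) Fix w; re-infer the best supporting set for each example.
        S = [infer_latent(w, x, y) for (x, y, _) in data]
    return w, S
```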

Page 61:

Movie Reviews Corpus:

  Initialization         Method                     Accuracy
  Baseline               Standard SVM               88.6%
  Annotated Rationales   Zaidan et al., 2007        92.2%
                         SVM-sle                    93.2%
  Opinion Finder         Yessenalina et al., 2009   91.8%
                         SVM-sle                    92.5%

Congressional Debates Corpus:

  Initialization         Method                     Accuracy
  Baseline               Standard SVM               70.0%
  Opinion Finder         Thomas et al., 2006        71.3%
                         Bansal et al., 2008        75.0%
                         SVM-sle                    77.7%

Page 62:

Green sentences denote the most positive supporting sentences. Underlined sentences denote the least subjective sentences.

Page 63:

Structural SVMs

• Train parameterized structured prediction models.

• Requires four ingredients:
  – Joint feature map (linear)
  – Inference
  – Loss function
  – Finding the most violated constraint

• Applications in information retrieval:
  – Optimizing Mean Average Precision in rankings
  – Optimizing diversity
  – Predicting sentiment with latent explanations

Work supported by NSF IIS-0713483, a Microsoft Fellowship, and a Yahoo! KSC Award.