Lecture 5: Structured Prediction

DESCRIPTION

Structured prediction or structured learning refers to supervised machine learning techniques that involve predicting structured objects, rather than single labels or real values. For example, the problem of translating a natural language sentence into a syntactic representation such as a parse tree can be seen as a structured prediction problem in which the structured output domain is the set of all possible parse trees.

TRANSCRIPT

Page 1: Lecture 5: Structured Prediction

Machine Learning for Language Technology

Uppsala University, Department of Linguistics and Philology

Structured Prediction, October 2013

Slides borrowed from previous courses. Thanks to Ryan McDonald (Google Research) and Prof. Joakim Nivre

Page 2: Lecture 5: Structured Prediction

Outline

Last time:
- Preliminaries: input/output, features, etc.
- Linear classifiers
  - Perceptron
  - Large-margin classifiers (SVMs, MIRA)
  - Logistic regression

Today:
- Structured prediction with linear classifiers
  - Structured perceptron
  - Structured large-margin classifiers (SVMs, MIRA)
  - Conditional random fields
- Case study: Dependency parsing

Page 3: Lecture 5: Structured Prediction

Structured Prediction (i)

- Sometimes our output space Y does not consist of simple atomic classes
- Examples:
  - Parsing: for a sentence x, Y is the set of possible parse trees
  - Sequence tagging: for a sentence x, Y is the set of possible tag sequences, e.g., part-of-speech tags or named-entity tags
  - Machine translation: for a source sentence x, Y is the set of possible target-language sentences

Page 4: Lecture 5: Structured Prediction

Hidden Markov Models

- Generative model: maximizes the joint likelihood P(x, y)
- We are looking at discriminative versions of this
- Not just sequences, though that will be the running example

Page 5: Lecture 5: Structured Prediction

Structured Prediction (ii)

- Can’t we just use our multiclass learning algorithms?
- In all these cases, the size of the set Y is exponential in the length of the input x
- It is non-trivial to apply our learning algorithms in such cases

Page 6: Lecture 5: Structured Prediction

Perceptron

Training data: T = {(x_t, y_t)}, t = 1..|T|

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     let y' = argmax_y w^(i) · f(x_t, y)   (**)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w^(i)

(**) Solving the argmax requires a search over an exponential space of outputs!
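A minimal Python sketch of this training loop, assuming a caller-supplied feature function feat(x, y) that returns a sparse dict and an argmax(x, w) routine that searches the output space; both interfaces are illustrative, not from the slides.

```python
from collections import defaultdict

def perceptron_train(data, feat, argmax, epochs=10):
    """Mistake-driven perceptron. `feat(x, y)` returns a sparse feature dict,
    `argmax(x, w)` returns the highest-scoring output under weights `w`
    (both are assumed, caller-supplied interfaces)."""
    w = defaultdict(float)
    for _ in range(epochs):                       # line 2: for n : 1..N
        for x, y_gold in data:                    # line 3: for t : 1..|T|
            y_pred = argmax(x, w)                 # line 4: the (**) search step
            if y_pred != y_gold:                  # line 5: update only on mistakes
                for k, v in feat(x, y_gold).items():
                    w[k] += v                     # line 6: + f(x_t, y_t)
                for k, v in feat(x, y_pred).items():
                    w[k] -= v                     # line 6: − f(x_t, y')
    return dict(w)
```

For atomic multiclass outputs the argmax is just a loop over the classes; the rest of the lecture is about making the same loop work when Y is exponentially large.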

Page 7: Lecture 5: Structured Prediction

Large-Margin Classifiers

Batch (SVMs):

  min (1/2) ||w||^2

  such that:
  w · f(x_t, y_t) − w · f(x_t, y') ≥ 1
  ∀(x_t, y_t) ∈ T and y' ∈ Y_t   (**)

Online (MIRA):

  Training data: T = {(x_t, y_t)}, t = 1..|T|

  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
           such that: w* · f(x_t, y_t) − w* · f(x_t, y') ≥ 1  ∀y' ∈ Y_t   (**)
  5.     i = i + 1
  6. return w^(i)

(**) There are exponentially many constraints in the size of each input!

Page 8: Lecture 5: Structured Prediction

Factor the Feature Representations

- We can make the assumption that our feature representations factor relative to the output
- Examples:
  - Context-free parsing:

      f(x, y) = ∑_{A→BC ∈ y} f(x, A→BC)

  - Sequence analysis (Markov assumptions):

      f(x, y) = ∑_{i=1}^{|y|} f(x, y_{i−1}, y_i)

- These kinds of factorizations allow us to run algorithms like CKY and Viterbi to compute the argmax function

Page 9: Lecture 5: Structured Prediction

Example – Sequence Labeling

- Many NLP problems can be cast in this light:
  - part-of-speech tagging
  - named-entity extraction
  - semantic role labeling
  - ...
- Input: x = x_0 x_1 ... x_n
- Output: y = y_0 y_1 ... y_n
- Each y_i ∈ Y_atom, which is small
- Each y ∈ Y = Y_atom^n, which is large
- Example: part-of-speech tagging, where Y_atom is the set of tags

    x = John saw Mary with the telescope
    y = noun verb noun preposition article noun

Page 10: Lecture 5: Structured Prediction

Sequence Labeling – Output Interaction

x = John saw Mary with the telescope
y = noun verb noun preposition article noun

- Why not just break the sequence up into a set of multiclass predictions?
- Because there are interactions between neighbouring tags:
  - What tag does “saw” have?
  - What if I told you the previous tag was article?
  - What if it was noun?

Page 11: Lecture 5: Structured Prediction

Sequence Labeling – Markov Factorization

x = John saw Mary with the telescope
y = noun verb noun preposition article noun

- Markov factorization: factor by adjacent labels
- First-order (like HMMs):

    f(x, y) = ∑_{i=1}^{|y|} f(x, y_{i−1}, y_i)

- kth-order:

    f(x, y) = ∑_{i=k}^{|y|} f(x, y_{i−k}, ..., y_{i−1}, y_i)

Page 12: Lecture 5: Structured Prediction

Sequence Labeling – Features

x = John saw Mary with the telescope
y = noun verb noun preposition article noun

- First-order:

    f(x, y) = ∑_{i=1}^{|y|} f(x, y_{i−1}, y_i)

- f(x, y_{i−1}, y_i) is any feature of the input and two adjacent labels, e.g.:

    f_j(x, y_{i−1}, y_i)  = 1 if x_i = “saw” and y_{i−1} = noun and y_i = verb; 0 otherwise
    f_j'(x, y_{i−1}, y_i) = 1 if x_i = “saw” and y_{i−1} = article and y_i = verb; 0 otherwise

- w_j should get a high weight and w_j' should get a low weight (a small sketch of these features follows below)
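A small Python sketch of these two indicator features and of the first-order factorization that sums them over a sentence; the function names and the "<start>" tag standing in for y_{-1} are conventions assumed for the sketch.

```python
def f_j(x, i, prev_tag, tag):
    # fires when "saw" is tagged verb right after a noun (a likely pattern)
    return 1 if x[i] == "saw" and prev_tag == "noun" and tag == "verb" else 0

def f_j_prime(x, i, prev_tag, tag):
    # fires when "saw" is tagged verb right after an article (an unlikely pattern)
    return 1 if x[i] == "saw" and prev_tag == "article" and tag == "verb" else 0

def global_features(x, y, local_feats):
    """First-order factorization: the global feature vector is the sum of the
    local feature vectors over positions (the '<start>' tag is an assumption)."""
    total = [0] * len(local_feats)
    prev = "<start>"
    for i, tag in enumerate(y):
        for j, f in enumerate(local_feats):
            total[j] += f(x, i, prev, tag)
        prev = tag
    return total

x = "John saw Mary with the telescope".split()
y = ["noun", "verb", "noun", "preposition", "article", "noun"]
print(global_features(x, y, [f_j, f_j_prime]))    # -> [1, 0]
```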

Page 13: Lecture 5: Structured Prediction

Sequence Labeling - Inference

- How does factorization affect inference?

    y = argmax_y w · f(x, y)
      = argmax_y w · ∑_{i=1}^{|y|} f(x, y_{i−1}, y_i)
      = argmax_y ∑_{i=1}^{|y|} w · f(x, y_{i−1}, y_i)
      = argmax_y ∑_{i=1}^{|y|} ∑_{j=1}^{m} w_j · f_j(x, y_{i−1}, y_i)

- We can use the Viterbi algorithm

Page 14: Lecture 5: Structured Prediction

Sequence Labeling – Viterbi Algorithm

- Let α_{y,i} be the score of the best labeling
  - of the sequence x_0 x_1 ... x_i
  - where y_i = y
- Let’s say we know α; then max_y α_{y,n} is the score of the best labeling of the sequence
- α_{y,i} can be calculated with the following recursion:

    α_{y,0} = 0.0  ∀y ∈ Y_atom
    α_{y,i} = max_{y*} [ α_{y*,i−1} + w · f(x, y*, y) ]

Page 15: Lecture 5: Structured Prediction

Sequence Labeling - Back-Pointers

- But that only tells us what the best score is
- Let β_{y,i} be the (i−1)th label in the best labeling
  - of the sequence x_0 x_1 ... x_i
  - where y_i = y
- β_{y,i} can be calculated with the following recursion:

    β_{y,0} = nil  ∀y ∈ Y_atom
    β_{y,i} = argmax_{y*} [ α_{y*,i−1} + w · f(x, y*, y) ]

- Thus:
  - The last label in the best sequence is y_n = argmax_y α_{y,n}
  - The second-to-last label is y_{n−1} = β_{y_n,n}
  - ...
  - The first label is y_0 = β_{y_1,1}
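Putting the two recursions together, a compact Python sketch of Viterbi with back-pointers; score(x, i, prev, cur) is assumed to return w · f(x, y_{i−1}=prev, y_i=cur) and is not defined on the slides.

```python
def viterbi(x, labels, score):
    """Viterbi decoding with back-pointers, following the slide recursions.
    `score(x, i, prev, cur)` is an assumed interface returning w . f(x, prev, cur)."""
    n = len(x)
    alpha = [{y: 0.0 for y in labels}]            # alpha_{y,0} = 0.0 for all y
    back = [{y: None for y in labels}]            # beta_{y,0} = nil (back-pointers)
    for i in range(1, n):
        alpha.append({})
        back.append({})
        for y in labels:
            best_prev = max(labels, key=lambda p: alpha[i - 1][p] + score(x, i, p, y))
            alpha[i][y] = alpha[i - 1][best_prev] + score(x, i, best_prev, y)
            back[i][y] = best_prev
    y_last = max(labels, key=lambda y: alpha[n - 1][y])   # argmax_y alpha_{y,n}
    seq = [y_last]
    for i in range(n - 1, 0, -1):                 # follow back-pointers right to left
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))
```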

Page 16: Lecture 5: Structured Prediction

Structured Learning

- We know we can solve the inference problem
  - at least for sequence labeling
  - but also for many other problems where one can factor the features appropriately (context-free parsing, dependency parsing, semantic role labeling, ...)
- How does this change learning?
  - for the perceptron algorithm?
  - for SVMs?
  - for logistic regression?

Page 17: Lecture 5: Structured Prediction

Structured Perceptron

- Exactly like the original perceptron
- Except that the argmax function now uses factored features, which we can solve with algorithms like Viterbi
- All of the original analysis carries over!

  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     let y' = argmax_y w^(i) · f(x_t, y)   (**)
  5.     if y' ≠ y_t
  6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
  7.       i = i + 1
  8. return w^(i)

(**) Solve the argmax with Viterbi for sequence problems!
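Concretely, the structured perceptron is the earlier perceptron_train sketch with Viterbi plugged in as the argmax; everything below reuses the hypothetical helpers introduced above.

```python
def make_viterbi_argmax(labels, local_score):
    """Builds an argmax(x, w) for perceptron_train that runs Viterbi over
    factored scores. `local_score(w, x, i, prev, cur)` = w . f(x, y_{i-1}, y_i)
    is an assumed interface."""
    def argmax(x, w):
        return viterbi(x, labels, lambda xs, i, p, c: local_score(w, xs, i, p, c))
    return argmax

# Hypothetical usage:
# w = perceptron_train(tagged_corpus, feat=sentence_features,
#                      argmax=make_viterbi_argmax(TAGS, local_score))
```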

Page 18: Lecture 5: Structured Prediction

Online Structured SVMs (or Online MIRA)

  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
           such that: w* · f(x_t, y_t) − w* · f(x_t, y') ≥ L(y_t, y')
           ∀y' ∈ Y_t with y' ∈ k-best(x_t, w^(i))   (**)
  5.     i = i + 1
  6. return w^(i)

- k-best(x_t, w^(i)) is the set of k outputs with the highest scores under w^(i)
- Simple solution: only consider the single highest-scoring output y' ∈ Y_t (see the sketch below)
- Note: the old fixed margin of 1 is now a loss L(y_t, y') between two structured outputs
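A hedged sketch of the "simple solution" above: the one-constraint MIRA step has a closed-form solution, shown here with sparse feature dicts (the dict representation and argument names are assumptions, not the slides' notation).

```python
def mira_update(w, feat_gold, feat_pred, loss):
    """1-best MIRA step: enforce a margin of loss(y_t, y') against the single
    highest-scoring wrong output while moving w as little as possible.
    Closed-form solution of the one-constraint quadratic program."""
    delta = {k: feat_gold.get(k, 0.0) - feat_pred.get(k, 0.0)
             for k in set(feat_gold) | set(feat_pred)}
    margin = sum(w.get(k, 0.0) * v for k, v in delta.items())   # current score gap
    norm_sq = sum(v * v for v in delta.values())
    if norm_sq == 0.0:
        return w                                  # identical feature vectors: no update
    tau = max(0.0, loss - margin) / norm_sq       # step size from the QP
    for k, v in delta.items():
        w[k] = w.get(k, 0.0) + tau * v
    return w
```

In practice y' here is the 1-best output under the current weights (e.g., the Viterbi output for sequences), and loss is L(y_t, y').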

Page 19: Lecture 5: Structured Prediction

Structured SVMs

  min (1/2) ||w||^2

  such that:
  w · f(x_t, y_t) − w · f(x_t, y') ≥ L(y_t, y')
  ∀(x_t, y_t) ∈ T and y' ∈ Y_t

- We still have an exponential number of constraints
- Feature factorizations permit solutions (max-margin Markov networks, structured SVMs)
- Note: the old fixed margin of 1 is now a loss L(y_t, y') between two structured outputs

Page 20: Lecture 5: Structured Prediction

Conditional Random Fields (i)

- What about structured logistic regression?
- Such a thing exists: Conditional Random Fields (CRFs)
- Consider again the sequential case with a first-order factorization
- Inference is identical to the structured perceptron: use Viterbi

    argmax_y P(y|x) = argmax_y e^{w · f(x,y)} / Z_x
                    = argmax_y e^{w · f(x,y)}
                    = argmax_y w · f(x, y)
                    = argmax_y ∑_{i=1}^{|y|} w · f(x, y_{i−1}, y_i)

Page 21: Lecture 5: Structured Prediction

Conditional Random Fields (ii)

- However, learning does change
- Reminder: pick w to maximize the log-likelihood of the training data:

    w = argmax_w ∑_t log P(y_t|x_t)

- Take the gradient and use gradient ascent:

    ∂F(w)/∂w_i = ∑_t f_i(x_t, y_t) − ∑_t ∑_{y'∈Y} P(y'|x_t) f_i(x_t, y')

- And the gradient is:

    ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)

Page 22: Lecture 5: Structured Prediction

Conditional Random Fields (iii)

- Problem: the second term sums over the output space Y

    ∂F(w)/∂w_i = ∑_t f_i(x_t, y_t) − ∑_t ∑_{y'∈Y} P(y'|x_t) f_i(x_t, y')
               = ∑_t ∑_{j=1}^{|y_t|} f_i(x_t, y_{t,j−1}, y_{t,j}) − ∑_t ∑_{y'∈Y} ∑_{j=1}^{|y'|} P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)

- We can easily calculate the first term: it is just empirical counts
- What about the second term?

Page 23: Lecture 5: Structured Prediction

Conditional Random Fields (iv)

- Problem: sum over the output space Y

    ∑_t ∑_{y'∈Y} ∑_{j=1}^{|y'|} P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)

- We need to show that we can compute it for an arbitrary x_t:

    ∑_{y'∈Y} ∑_{j=1}^{|y'|} P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)

- Solution: the forward-backward algorithm

Page 24: Lecture 5: Structured Prediction

Forward Algorithm (i)

- Let α^m_u be the forward scores, and let |x_t| = n
- α^m_u is the sum over all labelings of x_0 ... x_m such that y'_m = u:

    α^m_u = ∑_{|y'|=m, y'_m=u} e^{w · f(x_t, y')}
          = ∑_{|y'|=m, y'_m=u} e^{∑_{j=1}^{m} w · f(x_t, y'_{j−1}, y'_j)}

  i.e., the sum over all labelings of length m, ending at position m with label u
- Note then that

    Z_{x_t} = ∑_{y'} e^{w · f(x_t, y')} = ∑_u α^n_u

Page 25: Lecture 5: Structured Prediction

Forward Algorithm (ii)

- We can fill in α as follows:

    α^0_u = 1.0  ∀u
    α^m_u = ∑_v α^{m−1}_v × e^{w · f(x_t, v, u)}

Page 26: Lecture 5: Structured Prediction

Backward Algorithm

- Let β^m_u be the symmetric backward scores, i.e., the sum over all labelings of x_m ... x_n such that y'_m = u
- We can fill in β as follows:

    β^n_u = 1.0  ∀u
    β^m_u = ∑_v β^{m+1}_v × e^{w · f(x_t, u, v)}

- Note: β is overloaded here; it is different from the back-pointers used for Viterbi
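A direct Python transcription of the two recursions, assuming the same score(x, i, prev, cur) = w · f(x, y_{i−1}, y_i) interface as the Viterbi sketch; real implementations work in log space to avoid overflow, which this sketch ignores for clarity.

```python
import math

def forward_backward(x, labels, score):
    """Forward and backward sums exactly as in the slide recursions.
    `score(x, i, prev, cur)` is an assumed interface returning w . f(x, prev, cur)."""
    n = len(x)
    alpha = [{u: 1.0 for u in labels}]            # alpha^0_u = 1.0 for all u
    for m in range(1, n):
        alpha.append({u: sum(alpha[m - 1][v] * math.exp(score(x, m, v, u))
                             for v in labels)
                      for u in labels})
    beta = [None] * n
    beta[n - 1] = {u: 1.0 for u in labels}        # beta^n_u = 1.0 for all u
    for m in range(n - 2, -1, -1):
        beta[m] = {u: sum(beta[m + 1][v] * math.exp(score(x, m + 1, u, v))
                          for v in labels)
                   for u in labels}
    Z = sum(alpha[n - 1][u] for u in labels)      # partition function Z_x
    return alpha, beta, Z
```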

Page 27: Lecture 5: Structured Prediction

Conditional Random Fields - Final

- Let’s show that we can compute it for an arbitrary x_t:

    ∑_{y'∈Y} ∑_{j=1}^{|y'|} P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)

- Using the forward and backward scores, we can rewrite it as:

    ∑_{j=1}^{n} ∑_{y'_{j−1}, y'_j} ( α^{j−1}_{y'_{j−1}} × e^{w · f(x_t, y'_{j−1}, y'_j)} × β^j_{y'_j} / Z_{x_t} ) × f_i(x_t, y'_{j−1}, y'_j)

- Forward-backward therefore lets us calculate the partial derivatives efficiently
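A sketch of how the forward/backward sums turn into the expected feature counts in the gradient; it reuses the forward_backward sketch above, and local_feats(x, j, prev, cur) (a sparse dict of f_i values for one position) is an assumed interface.

```python
import math

def expected_counts(x, labels, score, local_feats):
    """Second term of the CRF gradient: expected feature counts under the model,
    computed from the forward/backward sums."""
    alpha, beta, Z = forward_backward(x, labels, score)
    expected = {}
    for j in range(1, len(x)):
        for u in labels:
            for v in labels:
                # edge marginal P(y_{j-1}=u, y_j=v | x) from the alpha/beta sums
                p = alpha[j - 1][u] * math.exp(score(x, j, u, v)) * beta[j][v] / Z
                for k, fv in local_feats(x, j, u, v).items():
                    expected[k] = expected.get(k, 0.0) + p * fv
    return expected
```

Subtracting these expected counts from the empirical counts gives the partial derivatives, which can then be fed to gradient ascent as on the previous slides.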

Page 28: Lecture 5: Structured Prediction

Conditional Random Fields Summary

- Inference: Viterbi
- Learning: use the forward-backward algorithm
- What about non-sequential problems?
  - Context-free parsing: can use the inside-outside algorithm
  - General problems: message passing and belief propagation

Page 29: Lecture 5: Structured Prediction

Case Study: Dependency Parsing

- Given an input sentence x, predict the syntactic dependencies y

Page 30: Lecture 5: Structured Prediction

Model 1: Arc-Factored Graph-Based Parsing

    y = argmax_y w · f(x, y)
      = argmax_y ∑_{(i,j)∈y} w · f(i, j)

- (i, j) ∈ y means x_i → x_j, i.e., a dependency from head x_i to dependent x_j
- Solving the argmax:
  - w · f(i, j) is the weight of an arc
  - a dependency tree is a spanning tree of a dense graph over x
  - use maximum spanning tree algorithms for inference (see the sketch below)
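A hedged sketch of arc-factored decoding using networkx's implementation of the Chu-Liu-Edmonds algorithm; the networkx dependency and the arc_score(words, i, j) = w · f(i, j) interface are assumptions, not part of the slides.

```python
import networkx as nx   # assumed external dependency (provides Chu-Liu-Edmonds)

def mst_parse(words, arc_score):
    """Arc-factored decoding: build a dense directed graph over the sentence,
    with node 0 as the artificial root, and return the maximum spanning
    arborescence. `arc_score(words, i, j)` is an assumed scoring interface."""
    n = len(words)                                # nodes 1..n are words, 0 is the root
    G = nx.DiGraph()
    for i in range(n + 1):
        for j in range(1, n + 1):                 # no arcs into the root
            if i != j:
                G.add_edge(i, j, weight=arc_score(words, i, j))
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return sorted(tree.edges())                   # (head, dependent) arcs of the best tree
```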

Page 31: Lecture 5: Structured Prediction

Defining f(i, j)

- Can contain any feature over the arc or the input sentence
- Some example features (sketched in code below):
  - the identities of x_i and x_j
  - their part-of-speech tags
  - the part-of-speech tags of surrounding words
  - the distance between x_i and x_j
  - ...
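For illustration, a few such arc features written as a Python function; the templates and naming scheme are invented for this sketch and are not McDonald's actual feature set.

```python
def arc_features(words, tags, i, j):
    """Illustrative arc features f(i, j) for a candidate dependency
    words[i] -> words[j]; assumes index 0 holds an artificial <root> token."""
    dist = min(abs(i - j), 5)                     # binned linear distance
    return {
        f"hw={words[i]}|dw={words[j]}": 1.0,      # word identities of head and dependent
        f"hp={tags[i]}|dp={tags[j]}": 1.0,        # their part-of-speech tags
        f"hp={tags[i]}|dp={tags[j]}|dist={dist}": 1.0,   # tag pair plus distance
    }
```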

Page 32: Lecture 5: Structured Prediction

Empirical Results

- Spanning tree dependency parsing results (McDonald 2006)
- Trained using MIRA (online SVMs)

    Language   Accuracy   Complete
    English    90.7       36.7
    Czech      84.1       32.2
    Chinese    79.7       27.2

- Simple structured linear classifier
- Near state-of-the-art performance for many languages
- Higher-order models give higher accuracy

Page 33: Lecture 5: Structured Prediction

Model 2: Transition-Based Parsing

    y = argmax_y w · f(x, y)
      = argmax_y ∑_{t(s)∈T(y)} w · f(s, t)

- t(s) ∈ T(y) means that the derivation of y includes the application of transition t to state s
- Solving the argmax:
  - w · f(s, t) is the score of transition t in state s
  - use beam search to find the best derivation from the start state s_0 (see the sketch below)
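A generic beam-search decoder over an abstract transition system, as a hedged sketch; the state/transition interfaces (start, legal, apply_t, score, is_final) are assumptions, not the Zhang and Nivre (2011) implementation.

```python
def beam_parse(w, x, start, legal, apply_t, score, is_final, beam_size=8):
    """Beam-search decoding for a transition system. `start(x)` builds the
    initial state s_0, `legal(s)` lists the transitions allowed in s,
    `apply_t(s, t)` returns the successor state, `score(w, s, t)` = w . f(s, t),
    and `is_final(s)` tests for a terminal state; every non-final state is
    assumed to allow at least one transition."""
    beam = [(0.0, start(x))]                      # (cumulative score, state)
    while not all(is_final(s) for _, s in beam):
        candidates = []
        for total, s in beam:
            if is_final(s):
                candidates.append((total, s))     # finished derivations survive as-is
                continue
            for t in legal(s):
                candidates.append((total + score(w, s, t), apply_t(s, t)))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]             # keep only the k best partial derivations
    return max(beam, key=lambda c: c[0])[1]       # best final state (holds the parse)
```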

Page 34: Lecture 5: Structured Prediction

Defining f(s, t)

- Can contain any feature over parser states
- Some example features:
  - the identities of words in s (e.g., top of the stack, head of the queue)
  - their part-of-speech tags
  - their heads and dependents (and their part-of-speech tags)
  - the number of dependents of words in s
  - ...

Page 35: Lecture 5: Structured Prediction

Empirical Results

- Transition-based dependency parsing with beam search (**) (Zhang and Nivre 2011)
- Trained using the perceptron

    Language   Accuracy   Complete
    English    92.9       48.0
    Chinese    86.0       36.9

- Simple structured linear classifier
- State-of-the-art performance with rich non-local features

(**) Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is an optimization of best-first search that reduces the memory requirements, and it only finds an approximate solution.

Page 36: Lecture 5: Structured Prediction

Structured Prediction Summary

- We can’t use plain multiclass algorithms: the search space is too large
- Solution: factor the representations
  - This can allow for efficient inference and learning
  - Shown here for sequence learning: Viterbi + forward-backward
  - But it is also true for other structures:
    - CFG parsing: CKY + inside-outside
    - Dependency parsing: spanning tree algorithms or beam search
    - General graphs: message passing and belief propagation
