Lecture 5: Structured Prediction
Machine Learning for Language Technology
Uppsala University, Department of Linguistics and Philology
October 2013
Slides borrowed from previous courses. Thanks to Ryan McDonald (Google Research) and Prof. Joakim Nivre.
Outline
- Last time:
  - Preliminaries: input/output, features, etc.
  - Linear classifiers
  - Perceptron
  - Large-margin classifiers (SVMs, MIRA)
  - Logistic regression
- Today:
  - Structured prediction with linear classifiers
  - Structured perceptron
  - Structured large-margin classifiers (SVMs, MIRA)
  - Conditional random fields
  - Case study: Dependency parsing
Structured Prediction (i)
- Sometimes our output space Y does not consist of simple atomic classes
- Examples:
  - Parsing: for a sentence x, Y is the set of possible parse trees
  - Sequence tagging: for a sentence x, Y is the set of possible tag sequences, e.g., part-of-speech tags or named-entity tags
  - Machine translation: for a source sentence x, Y is the set of possible target-language sentences
Hidden Markov Models
- Generative model – maximizes the joint likelihood P(x, y)
- We are looking at discriminative versions of this
- Not just for sequences, though sequences will be the running example
Structured Prediction (ii)
- Can’t we just use our multiclass learning algorithms?
- In all of these cases, the size of the set Y is exponential in the length of the input x
- It is non-trivial to apply our learning algorithms in such cases
Perceptron
Training data: T = {(x_t, y_t)}, t = 1 ... |T|

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     let y' = argmax_y w^(i) · f(x_t, y)   (**)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w^(i)

(**) Solving the argmax requires a search over an exponential space of outputs!
Large-Margin Classifiers
Batch (SVMs):

  $$\min \frac{1}{2}\|w\|^2$$

  such that:

  $$w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq 1 \quad \forall (x_t, y_t) \in T \text{ and } y' \in Y_t \;\; (**)$$

Online (MIRA):

Training data: T = {(x_t, y_t)}, t = 1 ... |T|

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
         such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ 1   ∀ y' ∈ Y_t   (**)
5.     i = i + 1
6. return w^(i)

(**) There are exponentially many constraints in the size of each input!
Factor the Feature Representations
- We can make the assumption that our feature representations factor relative to the output
- Examples:
  - Context-free parsing:

    $$f(x, y) = \sum_{A \rightarrow BC \,\in\, y} f(x, A \rightarrow BC)$$

  - Sequence analysis – Markov assumptions:

    $$f(x, y) = \sum_{i=1}^{|y|} f(x, y_{i-1}, y_i)$$

- These kinds of factorizations allow us to run algorithms like CKY and Viterbi to compute the argmax function
Example – Sequence Labeling
- Many NLP problems can be cast in this light
  - Part-of-speech tagging
  - Named-entity extraction
  - Semantic role labeling
  - ...
- Input: x = x_0 x_1 ... x_n
- Output: y = y_0 y_1 ... y_n
- Each y_i ∈ Y_atom – which is small
- Each y ∈ Y = (Y_atom)^n – which is large
- Example: part-of-speech tagging – Y_atom is the set of tags

  x = John  saw   Mary  with         the      telescope
  y = noun  verb  noun  preposition  article  noun
Sequence Labeling – Output Interaction
  x = John  saw   Mary  with         the      telescope
  y = noun  verb  noun  preposition  article  noun

- Why not just break the sequence up into a set of multi-class predictions?
- Because there are interactions between neighbouring tags
  - What tag does “saw” have?
  - What if I told you the previous tag was article?
  - What if it was noun?
Sequence Labeling – Markov Factorization
  x = John  saw   Mary  with         the      telescope
  y = noun  verb  noun  preposition  article  noun

- Markov factorization – factor by adjacent labels
- First-order (like HMMs):

  $$f(x, y) = \sum_{i=1}^{|y|} f(x, y_{i-1}, y_i)$$

- kth-order:

  $$f(x, y) = \sum_{i=k}^{|y|} f(x, y_{i-k}, \ldots, y_{i-1}, y_i)$$
Sequence Labeling – Features
  x = John  saw   Mary  with         the      telescope
  y = noun  verb  noun  preposition  article  noun

- First-order:

  $$f(x, y) = \sum_{i=1}^{|y|} f(x, y_{i-1}, y_i)$$

- f(x, y_{i-1}, y_i) is any feature of the input & two adjacent labels, e.g.:

  $$f_j(x, y_{i-1}, y_i) = \begin{cases} 1 & \text{if } x_i = \text{“saw”}, \; y_{i-1} = \text{noun}, \; y_i = \text{verb} \\ 0 & \text{otherwise} \end{cases}$$

  $$f_{j'}(x, y_{i-1}, y_i) = \begin{cases} 1 & \text{if } x_i = \text{“saw”}, \; y_{i-1} = \text{article}, \; y_i = \text{verb} \\ 0 & \text{otherwise} \end{cases}$$

- w_j should get a high weight and w_{j'} a low weight (see the sketch below)
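To make these indicator features concrete, here is a minimal sketch of a first-order feature map in Python. The sparse dictionary representation and the particular feature names (word=...|prev=...|cur=...) are illustrative assumptions, not something specified on the slides.

```python
from collections import defaultdict

def local_features(x, y_prev, y_cur, i):
    """Sparse first-order features f(x, y_{i-1}, y_i) at position i,
    as a dict from feature name to value."""
    feats = defaultdict(float)
    feats[f"word={x[i]}|prev={y_prev}|cur={y_cur}"] += 1.0
    feats[f"prev={y_prev}|cur={y_cur}"] += 1.0  # label-bigram feature
    return feats

def global_features(x, y):
    """f(x, y) = sum_i f(x, y_{i-1}, y_i); '<s>' is an assumed start symbol."""
    feats = defaultdict(float)
    prev = "<s>"
    for i, tag in enumerate(y):
        for name, value in local_features(x, prev, tag, i).items():
            feats[name] += value
        prev = tag
    return feats

# The indicator f_j above corresponds to the feature that fires here:
x = "John saw Mary with the telescope".split()
y = ["noun", "verb", "noun", "preposition", "article", "noun"]
print(global_features(x, y)["word=saw|prev=noun|cur=verb"])  # 1.0
```

A linear model then scores a labeling as the dot product of a weight dictionary with this sparse feature dictionary, which is exactly w · f(x, y) in the slides' notation.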
Sequence Labeling - Inference
- How does factorization affect inference?

  $$\begin{aligned}
  y &= \arg\max_{y} \; w \cdot f(x, y) \\
    &= \arg\max_{y} \; w \cdot \sum_{i=1}^{|y|} f(x, y_{i-1}, y_i) \\
    &= \arg\max_{y} \; \sum_{i=1}^{|y|} w \cdot f(x, y_{i-1}, y_i) \\
    &= \arg\max_{y} \; \sum_{i=1}^{|y|} \sum_{j=1}^{m} w_j \cdot f_j(x, y_{i-1}, y_i)
  \end{aligned}$$

- Can use the Viterbi algorithm
Sequence Labeling – Viterbi Algorithm
- Let α_{y,i} be the score of the best labeling
  - of the sequence x_0 x_1 ... x_i
  - where y_i = y
- Let's say we know α; then
  - max_y α_{y,n} is the score of the best labeling of the whole sequence
- α_{y,i} can be calculated with the following recursion:

  $$\alpha_{y,0} = 0.0 \quad \forall y \in Y_{atom}$$
  $$\alpha_{y,i} = \max_{y^*} \; \alpha_{y^*,i-1} + w \cdot f(x, y^*, y)$$
Sequence Labeling - Back-Pointers
- But that only tells us what the best score is
- Let β_{y,i} be the (i−1)th label in the best labeling
  - of the sequence x_0 x_1 ... x_i
  - where y_i = y
- β_{y,i} can be calculated with the following recursion:

  $$\beta_{y,0} = \text{nil} \quad \forall y \in Y_{atom}$$
  $$\beta_{y,i} = \arg\max_{y^*} \; \alpha_{y^*,i-1} + w \cdot f(x, y^*, y)$$

- Thus:
  - The last label in the best sequence is y_n = argmax_y α_{y,n}
  - The second-to-last label is y_{n−1} = β_{y_n,n}
  - ...
  - The first label is y_0 = β_{y_1,1}
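A minimal Viterbi decoder in the spirit of these recursions is sketched below in Python. The score callback, the tag set, and the "<s>" start symbol are assumptions of the sketch; score(x, i, y_prev, y_cur) plays the role of w · f(x, y_{i−1}, y_i).

```python
def viterbi(x, tags, score):
    """Return argmax_y of sum_i score(x, i, y_prev, y_cur)."""
    n = len(x)
    alpha = [{t: float("-inf") for t in tags} for _ in range(n)]  # best scores
    back = [{t: None for t in tags} for _ in range(n)]            # back-pointers
    for t in tags:
        alpha[0][t] = score(x, 0, "<s>", t)        # position 0 starts from "<s>"
    for i in range(1, n):
        for t in tags:
            for prev in tags:
                s = alpha[i - 1][prev] + score(x, i, prev, t)
                if s > alpha[i][t]:
                    alpha[i][t] = s
                    back[i][t] = prev
    best = max(tags, key=lambda t: alpha[n - 1][t])  # best final label
    y = [best]
    for i in range(n - 1, 0, -1):                    # follow back-pointers
        y.append(back[i][y[-1]])
    return list(reversed(y))
```

The runtime is O(n · |Y_atom|^2) for a first-order factorization, instead of the |Y_atom|^n labelings it implicitly searches over.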
Structured Learning
- We know we can solve the inference problem
  - At least for sequence labeling
  - But also for many other problems where one can factor the features appropriately (context-free parsing, dependency parsing, semantic role labeling, ...)
- How does this change learning?
  - For the perceptron algorithm?
  - For SVMs?
  - For logistic regression?
Structured Perceptron
- Exactly like the original perceptron
  - Except now the argmax function uses factored features
  - Which we can solve with algorithms like the Viterbi algorithm
- All of the original analysis carries over!

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     let y' = argmax_y w^(i) · f(x_t, y)   (**)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w^(i)

(**) Solve the argmax with Viterbi for sequence problems!
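Putting the pieces together, the sketch below is one way to write this training loop in Python. It reuses the hypothetical local_features, global_features, and viterbi helpers from the earlier sketches (assumptions, not code from the slides).

```python
from collections import defaultdict

def structured_perceptron(data, tags, n_epochs=5):
    """data: list of (x, y_gold) pairs; returns a sparse weight dict."""
    w = defaultdict(float)

    def score(x, i, y_prev, y_cur):
        # w . f(x, y_{i-1}, y_i) under the current weights
        return sum(w[name] * v
                   for name, v in local_features(x, y_prev, y_cur, i).items())

    for _ in range(n_epochs):
        for x, y_gold in data:
            y_pred = viterbi(x, tags, score)        # argmax via Viterbi
            if y_pred != y_gold:
                for name, v in global_features(x, y_gold).items():
                    w[name] += v                    # add gold features
                for name, v in global_features(x, y_pred).items():
                    w[name] -= v                    # subtract predicted features
    return w
```

Averaging the weights over all updates (the averaged perceptron) is a standard refinement, but it is not shown here.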
Online Structured SVMs (or Online MIRA)
1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
         such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ L(y_t, y')
         ∀ y' ∈ Y_t with y' ∈ k-best(x_t, w^(i))
5.     i = i + 1
6. return w^(i)

- k-best(x_t, w^(i)) is the set of k outputs with the highest scores under w^(i)
- Simple solution – only consider the single highest-scoring output y' ∈ Y_t
- Note: the old fixed margin of 1 is now a loss L(y_t, y') between the two structured outputs
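For the "simple solution" with only the single highest-scoring output, the quadratic program has a closed-form solution. The sketch below is one common way to write that single-constraint update; representing the feature vectors as dense NumPy arrays is an assumption of the sketch.

```python
import numpy as np

def mira_1best_update(w, f_gold, f_pred, loss):
    """Smallest change to w such that w . f_gold - w . f_pred >= loss."""
    delta = f_gold - f_pred                 # feature difference vector
    violation = loss - w.dot(delta)         # how much the constraint is violated
    if violation <= 0 or not delta.any():   # already satisfied (or no difference)
        return w
    tau = violation / delta.dot(delta)      # step size from the KKT conditions
    return w + tau * delta
```

The structured perceptron update is the special case of always stepping by 1 along delta whenever the prediction is wrong.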
Structured SVMs
  $$\min \frac{1}{2}\|w\|^2$$

  such that:

  $$w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq L(y_t, y') \quad \forall (x_t, y_t) \in T \text{ and } y' \in Y_t$$

- Still have an exponential number of constraints
- Feature factorizations permit solutions (max-margin Markov networks, structured SVMs)
- Note: the old fixed margin of 1 is now a loss L(y_t, y') between the two structured outputs
Conditional Random Fields (i)
- What about structured logistic regression?
- Such a thing exists – Conditional Random Fields (CRFs)
- Consider again the sequential case with a first-order factorization
- Inference is identical to the structured perceptron – use Viterbi:

  $$\begin{aligned}
  \arg\max_{y} P(y|x) &= \arg\max_{y} \frac{e^{w \cdot f(x,y)}}{Z_x} \\
    &= \arg\max_{y} \; e^{w \cdot f(x,y)} \\
    &= \arg\max_{y} \; w \cdot f(x,y) \\
    &= \arg\max_{y} \; \sum_{i=1}^{|y|} w \cdot f(x, y_{i-1}, y_i)
  \end{aligned}$$
Conditional Random Fields (ii)
- However, learning does change
- Reminder: pick w to maximize the log-likelihood of the training data:

  $$w = \arg\max_{w} \; \sum_{t} \log P(y_t|x_t)$$

- Take the gradient and use gradient ascent:

  $$\frac{\partial}{\partial w_i} F(w) = \sum_{t} f_i(x_t, y_t) - \sum_{t} \sum_{y' \in Y} P(y'|x_t) \, f_i(x_t, y')$$

- And the gradient is:

  $$\nabla F(w) = \left( \frac{\partial}{\partial w_0} F(w), \frac{\partial}{\partial w_1} F(w), \ldots, \frac{\partial}{\partial w_m} F(w) \right)$$
Conditional Random Fields (iii)
- Problem: the sum over the output space Y

  $$\begin{aligned}
  \frac{\partial}{\partial w_i} F(w) &= \sum_{t} f_i(x_t, y_t) - \sum_{t} \sum_{y' \in Y} P(y'|x_t) \, f_i(x_t, y') \\
    &= \sum_{t} \sum_{j=1}^{|y_t|} f_i(x_t, y_{t,j-1}, y_{t,j}) - \sum_{t} \sum_{y' \in Y} \sum_{j=1}^{|y'|} P(y'|x_t) \, f_i(x_t, y'_{j-1}, y'_j)
  \end{aligned}$$

- Can easily calculate the first term – just empirical counts
- What about the second term?
Conditional Random Fields (iv)
- Problem: the sum over the output space Y

  $$\sum_{t} \sum_{y' \in Y} \sum_{j=1}^{|y'|} P(y'|x_t) \, f_i(x_t, y'_{j-1}, y'_j)$$

- We need to show we can compute it for an arbitrary x_t:

  $$\sum_{y' \in Y} \sum_{j=1}^{|y'|} P(y'|x_t) \, f_i(x_t, y'_{j-1}, y'_j)$$

- Solution: the forward-backward algorithm
Forward Algorithm (i)
- Let α^m_u be the forward scores, and let |x_t| = n
- α^m_u is the sum over all labelings of x_0 ... x_m such that y'_m = u:

  $$\alpha^m_u = \sum_{|y'| = m, \; y'_m = u} e^{w \cdot f(x_t, y')} = \sum_{|y'| = m, \; y'_m = u} e^{\sum_{j=1}^{m} w \cdot f(x_t, y'_{j-1}, y'_j)}$$

  i.e., the sum over all labelings of length m that end at position m with label u

- Note then that

  $$Z_{x_t} = \sum_{y'} e^{w \cdot f(x_t, y')} = \sum_{u} \alpha^n_u$$
Forward Algorithm (ii)
- We can fill in α as follows:

  $$\alpha^0_u = 1.0 \quad \forall u$$
  $$\alpha^m_u = \sum_{v} \alpha^{m-1}_v \times e^{w \cdot f(x_t, v, u)}$$
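A direct transcription of this recursion into Python is sketched below; a real implementation would work in log space to avoid overflow. The edge_score(x, m, v, u) callback is an assumption that stands in for w · f(x_t, v, u) at position m, and positions run 0..n−1 here.

```python
import math

def forward(x, tags, edge_score):
    """alpha[m][u]: sum of exp-scores over labelings of x_0..x_m ending in u."""
    n = len(x)
    alpha = [{u: 0.0 for u in tags} for _ in range(n)]
    for u in tags:
        alpha[0][u] = 1.0                          # base case: alpha^0_u = 1.0
    for m in range(1, n):
        for u in tags:
            alpha[m][u] = sum(alpha[m - 1][v] * math.exp(edge_score(x, m, v, u))
                              for v in tags)
    Z = sum(alpha[n - 1][u] for u in tags)         # partition function Z_x
    return alpha, Z
```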
Backward Algorithm
- Let β^m_u be the symmetric backward scores
  - i.e., the sum over all labelings of x_m ... x_n such that y'_m = u
- We can fill in β as follows:

  $$\beta^n_u = 1.0 \quad \forall u$$
  $$\beta^m_u = \sum_{v} \beta^{m+1}_v \times e^{w \cdot f(x_t, u, v)}$$

- Note: β is overloaded here – this is different from the back-pointers earlier
Conditional Random Fields - Final
- Let's show we can compute it for an arbitrary x_t:

  $$\sum_{y' \in Y} \sum_{j=1}^{|y'|} P(y'|x_t) \, f_i(x_t, y'_{j-1}, y'_j)$$

- Using the forward and backward scores, we can re-write it as a sum over positions and adjacent label pairs:

  $$\sum_{j=1}^{|x_t|} \sum_{y'_{j-1}, \, y'_j} \frac{\alpha^{j-1}_{y'_{j-1}} \times e^{w \cdot f(x_t, y'_{j-1}, y'_j)} \times \beta^{j}_{y'_j}}{Z_{x_t}} \; f_i(x_t, y'_{j-1}, y'_j)$$

- Forward-backward can therefore calculate the partial derivatives efficiently
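The fraction inside the sum is the edge marginal P(y_{j−1} = v, y_j = u | x_t). A sketch of how α, β, and Z combine into these marginals is below; it builds on the forward sketch above and the same hypothetical edge_score callback.

```python
import math

def backward(x, tags, edge_score):
    """beta[m][u]: sum of exp-scores over labelings of x_m..x_{n-1} starting in u."""
    n = len(x)
    beta = [{u: 0.0 for u in tags} for _ in range(n)]
    for u in tags:
        beta[n - 1][u] = 1.0                        # base case at the last position
    for m in range(n - 2, -1, -1):
        for u in tags:
            beta[m][u] = sum(math.exp(edge_score(x, m + 1, u, v)) * beta[m + 1][v]
                             for v in tags)
    return beta

def edge_marginals(x, tags, edge_score):
    """P(y_{j-1} = v, y_j = u | x) for every position j and label pair (v, u)."""
    alpha, Z = forward(x, tags, edge_score)         # forward() from the sketch above
    beta = backward(x, tags, edge_score)
    marg = {}
    for j in range(1, len(x)):
        for v in tags:
            for u in tags:
                marg[(j, v, u)] = (alpha[j - 1][v]
                                   * math.exp(edge_score(x, j, v, u))
                                   * beta[j][u]) / Z
    return marg
```

The expected feature counts in the gradient are then sums of these marginals weighted by the corresponding feature values f_i(x_t, v, u).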
Conditional Random Fields Summary
- Inference: Viterbi
- Learning: use the forward-backward algorithm
- What about non-sequential problems?
  - Context-free parsing – can use the inside-outside algorithm
  - General problems – message passing & belief propagation
Case Study: Dependency Parsing
- Given an input sentence x, predict syntactic dependencies y
Model 1: Arc-Factored Graph-Based Parsing
  $$y = \arg\max_{y} \; w \cdot f(x, y) = \arg\max_{y} \sum_{(i,j) \in y} w \cdot f(i, j)$$

- (i, j) ∈ y means x_i → x_j, i.e., a dependency from x_i to x_j
- Solving the argmax:
  - w · f(i, j) is the weight of an arc
  - A dependency tree is a spanning tree of a dense graph over x
  - Use maximum spanning tree algorithms for inference
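A minimal sketch of the arc-factored pieces is below. The arc_score callback stands in for w · f(i, j), and treating position 0 as an artificial ROOT token is an assumption of the sketch. The greedy head selection ignores the tree constraint; an exact parser would run a maximum spanning tree algorithm (e.g., Chu-Liu/Edmonds) over the same score matrix.

```python
def score_matrix(x, arc_score):
    """scores[i][j] = w . f(i, j): score of a dependency from head x_i to x_j."""
    n = len(x)
    return [[arc_score(x, i, j) if i != j else float("-inf") for j in range(n)]
            for i in range(n)]

def tree_score(scores, heads):
    """Score of a tree, given heads[j] = index of the head of word x_j."""
    return sum(scores[heads[j]][j] for j in range(1, len(heads)))

def greedy_heads(scores):
    """Best head per word, chosen independently (not guaranteed to be a tree)."""
    n = len(scores)
    return [0] + [max(range(n), key=lambda i: scores[i][j]) for j in range(1, n)]
```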
Defining f(i , j)
- Can contain any feature over the arc or the input sentence
- Some example features:
  - Identities of x_i and x_j
  - Their part-of-speech tags
  - The part-of-speech tags of surrounding words
  - The distance between x_i and x_j
  - ...
Empirical Results
- Spanning tree dependency parsing results (McDonald 2006)
- Trained using MIRA (online SVMs)

  Language   Accuracy (%)   Complete (%)
  English    90.7           36.7
  Czech      84.1           32.2
  Chinese    79.7           27.2

- Simple structured linear classifier
- Near state-of-the-art performance for many languages
- Higher-order models give higher accuracy
Model 2: Transition-Based Parsing
  $$y = \arg\max_{y} \; w \cdot f(x, y) = \arg\max_{y} \sum_{t(s) \in T(y)} w \cdot f(s, t)$$

- t(s) ∈ T(y) means that the derivation of y includes the application of transition t to state s
- Solving the argmax:
  - w · f(s, t) is the score of transition t in state s
  - Use beam search to find the best derivation from the start state s_0
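A generic beam search over parser states is sketched below. The state interface (is_final, transitions, apply_fn) and the score callback standing in for w · f(s, t) are assumptions; the slides only specify that the argmax is approximated by keeping the k best partial derivations.

```python
def beam_search(s0, transitions, apply_fn, score, is_final, beam_size=8):
    """Approximate argmax over derivations by keeping the beam_size best."""
    beam = [(0.0, s0)]                                   # (total score, state)
    while not all(is_final(s) for _, s in beam):
        candidates = []
        for total, s in beam:
            if is_final(s):
                candidates.append((total, s))            # finished derivations stay
                continue
            for t in transitions(s):                     # legal transitions of s
                candidates.append((total + score(s, t), apply_fn(s, t)))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_size]                    # prune to the k best
    return max(beam, key=lambda item: item[0])[1]        # best final state
```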
Defining f(s, t)
- Can contain any feature over parser states
- Some example features:
  - Identities of words in s (e.g., top of the stack, head of the queue)
  - Their part-of-speech tags
  - Their heads and dependents (and their part-of-speech tags)
  - The number of dependents of words in s
  - ...
Empirical Results
- Transition-based dependency parsing with beam search (**) (Zhang and Nivre 2011)
- Trained using the perceptron

  Language   Accuracy (%)   Complete (%)
  English    92.9           48.0
  Chinese    86.0           36.9

- Simple structured linear classifier
- State-of-the-art performance with rich non-local features

(**) Beam search is a heuristic search algorithm that explores a graph by expanding only the most promising nodes in a limited set. It is an optimization of best-first search that reduces its memory requirements, and it only finds an approximate solution.
Structured Prediction Summary
- Can’t use the multiclass algorithms directly – the search space is too large
- Solution: factor the representations
  - This can allow for efficient inference and learning
- Showed for sequence learning: Viterbi + forward-backward
- But also true for other structures:
  - CFG parsing: CKY + inside-outside
  - Dependency parsing: spanning tree algorithms or beam search
  - General graphs: message passing and belief propagation