Lecture 5: Structured Prediction

DESCRIPTION

Structured prediction or structured learning refers to supervised machine learning techniques that involve predicting structured objects, rather than single labels or real values. For example, the problem of translating a natural language sentence into a syntactic representation such as a parse tree can be seen as a structured prediction problem in which the structured output domain is the set of all possible parse trees.

TRANSCRIPT

Page 1: Lecture 5: Structured Prediction

Machine Learning for Language Technology

Uppsala University, Department of Linguistics and Philology

Structured Prediction, October 2013

Slides borrowed from previous courses. Thanks to Ryan McDonald (Google Research) and Prof. Joakim Nivre

Page 2: Lecture 5: Structured Prediction

Outline

Last time:
- Preliminaries: input/output, features, etc.
- Linear classifiers
  - Perceptron
  - Large-margin classifiers (SVMs, MIRA)
  - Logistic regression

Today:
- Structured prediction with linear classifiers
  - Structured perceptron
  - Structured large-margin classifiers (SVMs, MIRA)
  - Conditional random fields
- Case study: Dependency parsing

Page 3: Lecture 5: Structured Prediction

Structured Prediction (i)

- Sometimes our output space Y does not consist of simple atomic classes
- Examples:
  - Parsing: for a sentence x, Y is the set of possible parse trees
  - Sequence tagging: for a sentence x, Y is the set of possible tag sequences, e.g., part-of-speech tags or named-entity tags
  - Machine translation: for a source sentence x, Y is the set of possible target-language sentences

Page 4: Lecture 5: Structured Prediction

Hidden Markov Models

- Generative model: maximizes the joint likelihood P(x, y)
- We are looking at discriminative versions of this
- Not just sequences, though that will be the running example

Page 5: Lecture 5: Structured Prediction

Structured Prediction (ii)

- Can’t we just use our multiclass learning algorithms?
- In all these cases, the size of the set Y is exponential in the length of the input x
- It is non-trivial to apply our learning algorithms in such cases

Page 6: Lecture 5: Structured Prediction

Perceptron

Training data: T = {(x_t, y_t)}, t = 1..|T|

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     let y' = argmax_y w^(i) · f(x_t, y)   (**)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w^(i)

(**) Solving the argmax requires a search over an exponential space of outputs!
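A minimal Python sketch of this training loop, assuming a caller-supplied feature function feat(x, y) that returns a sparse dict and an argmax(x, w) routine that searches the output space; both interfaces are illustrative, not from the slides.

```python
from collections import defaultdict

def perceptron_train(data, feat, argmax, epochs=10):
    """Mistake-driven perceptron. `feat(x, y)` returns a sparse feature dict,
    `argmax(x, w)` returns the highest-scoring output under weights `w`
    (both are assumed, caller-supplied interfaces)."""
    w = defaultdict(float)
    for _ in range(epochs):                       # line 2: for n : 1..N
        for x, y_gold in data:                    # line 3: for t : 1..|T|
            y_pred = argmax(x, w)                 # line 4: the (**) search step
            if y_pred != y_gold:                  # line 5: update only on mistakes
                for k, v in feat(x, y_gold).items():
                    w[k] += v                     # line 6: + f(x_t, y_t)
                for k, v in feat(x, y_pred).items():
                    w[k] -= v                     # line 6: − f(x_t, y')
    return dict(w)
```

For atomic multiclass outputs the argmax is just a loop over the classes; the rest of the lecture is about making the same loop work when Y is exponentially large.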

Page 7: Lecture 5: Structured Prediction

Large-Margin Classifiers

Batch (SVMs):

  min (1/2) ||w||^2

  such that:
  w · f(x_t, y_t) − w · f(x_t, y') ≥ 1
  ∀(x_t, y_t) ∈ T and y' ∈ Y_t   (**)

Online (MIRA):

  Training data: T = {(x_t, y_t)}, t = 1..|T|

  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
           such that: w* · f(x_t, y_t) − w* · f(x_t, y') ≥ 1  ∀y' ∈ Y_t   (**)
  5.     i = i + 1
  6. return w^(i)

(**) There are exponentially many constraints in the size of each input!

Page 8: Lecture 5: Structured Prediction

Factor the Feature Representations

- We can make the assumption that our feature representations factor relative to the output
- Examples:
  - Context-free parsing:

      f(x, y) = ∑_{A→BC ∈ y} f(x, A→BC)

  - Sequence analysis (Markov assumptions):

      f(x, y) = ∑_{i=1}^{|y|} f(x, y_{i−1}, y_i)

- These kinds of factorizations allow us to run algorithms like CKY and Viterbi to compute the argmax function

Page 9: Lecture 5: Structured Prediction

Example – Sequence Labeling

- Many NLP problems can be cast in this light:
  - part-of-speech tagging
  - named-entity extraction
  - semantic role labeling
  - ...
- Input: x = x_0 x_1 ... x_n
- Output: y = y_0 y_1 ... y_n
- Each y_i ∈ Y_atom, which is small
- Each y ∈ Y = Y_atom^n, which is large
- Example: part-of-speech tagging, where Y_atom is the set of tags

    x = John saw Mary with the telescope
    y = noun verb noun preposition article noun

Page 10: Lecture 5: Structured Prediction

Sequence Labeling – Output Interaction

x = John saw Mary with the telescope
y = noun verb noun preposition article noun

- Why not just break the sequence up into a set of multiclass predictions?
- Because there are interactions between neighbouring tags:
  - What tag does “saw” have?
  - What if I told you the previous tag was article?
  - What if it was noun?

Page 11: Lecture 5: Structured Prediction

Sequence Labeling – Markov Factorization

x = John saw Mary with the telescope
y = noun verb noun preposition article noun

- Markov factorization: factor by adjacent labels
- First-order (like HMMs):

    f(x, y) = ∑_{i=1}^{|y|} f(x, y_{i−1}, y_i)

- kth-order:

    f(x, y) = ∑_{i=k}^{|y|} f(x, y_{i−k}, ..., y_{i−1}, y_i)

Page 12: Lecture 5: Structured Prediction

Sequence Labeling – Features

x = John saw Mary with the telescope
y = noun verb noun preposition article noun

- First-order:

    f(x, y) = ∑_{i=1}^{|y|} f(x, y_{i−1}, y_i)

- f(x, y_{i−1}, y_i) is any feature of the input and two adjacent labels, e.g.:

    f_j(x, y_{i−1}, y_i)  = 1 if x_i = “saw” and y_{i−1} = noun and y_i = verb; 0 otherwise
    f_j'(x, y_{i−1}, y_i) = 1 if x_i = “saw” and y_{i−1} = article and y_i = verb; 0 otherwise

- w_j should get a high weight and w_j' should get a low weight (a small sketch of these features follows below)
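A small Python sketch of these two indicator features and of the first-order factorization that sums them over a sentence; the function names and the "<start>" tag standing in for y_{-1} are conventions assumed for the sketch.

```python
def f_j(x, i, prev_tag, tag):
    # fires when "saw" is tagged verb right after a noun (a likely pattern)
    return 1 if x[i] == "saw" and prev_tag == "noun" and tag == "verb" else 0

def f_j_prime(x, i, prev_tag, tag):
    # fires when "saw" is tagged verb right after an article (an unlikely pattern)
    return 1 if x[i] == "saw" and prev_tag == "article" and tag == "verb" else 0

def global_features(x, y, local_feats):
    """First-order factorization: the global feature vector is the sum of the
    local feature vectors over positions (the '<start>' tag is an assumption)."""
    total = [0] * len(local_feats)
    prev = "<start>"
    for i, tag in enumerate(y):
        for j, f in enumerate(local_feats):
            total[j] += f(x, i, prev, tag)
        prev = tag
    return total

x = "John saw Mary with the telescope".split()
y = ["noun", "verb", "noun", "preposition", "article", "noun"]
print(global_features(x, y, [f_j, f_j_prime]))    # -> [1, 0]
```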

Page 13: Lecture 5: Structured Prediction

Sequence Labeling - Inference

- How does factorization affect inference?

    y = argmax_y w · f(x, y)
      = argmax_y w · ∑_{i=1}^{|y|} f(x, y_{i−1}, y_i)
      = argmax_y ∑_{i=1}^{|y|} w · f(x, y_{i−1}, y_i)
      = argmax_y ∑_{i=1}^{|y|} ∑_{j=1}^{m} w_j · f_j(x, y_{i−1}, y_i)

- We can use the Viterbi algorithm

Page 14: Lecture 5: Structured Prediction

Sequence Labeling – Viterbi Algorithm

- Let α_{y,i} be the score of the best labeling
  - of the sequence x_0 x_1 ... x_i
  - where y_i = y
- Let’s say we know α; then max_y α_{y,n} is the score of the best labeling of the sequence
- α_{y,i} can be calculated with the following recursion:

    α_{y,0} = 0.0  ∀y ∈ Y_atom
    α_{y,i} = max_{y*} [ α_{y*,i−1} + w · f(x, y*, y) ]

Page 15: Lecture 5: Structured Prediction

Sequence Labeling - Back-Pointers

- But that only tells us what the best score is
- Let β_{y,i} be the (i−1)th label in the best labeling
  - of the sequence x_0 x_1 ... x_i
  - where y_i = y
- β_{y,i} can be calculated with the following recursion:

    β_{y,0} = nil  ∀y ∈ Y_atom
    β_{y,i} = argmax_{y*} [ α_{y*,i−1} + w · f(x, y*, y) ]

- Thus:
  - The last label in the best sequence is y_n = argmax_y α_{y,n}
  - The second-to-last label is y_{n−1} = β_{y_n,n}
  - ...
  - The first label is y_0 = β_{y_1,1}
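Putting the two recursions together, a compact Python sketch of Viterbi with back-pointers; score(x, i, prev, cur) is assumed to return w · f(x, y_{i−1}=prev, y_i=cur) and is not defined on the slides.

```python
def viterbi(x, labels, score):
    """Viterbi decoding with back-pointers, following the slide recursions.
    `score(x, i, prev, cur)` is an assumed interface returning w . f(x, prev, cur)."""
    n = len(x)
    alpha = [{y: 0.0 for y in labels}]            # alpha_{y,0} = 0.0 for all y
    back = [{y: None for y in labels}]            # beta_{y,0} = nil (back-pointers)
    for i in range(1, n):
        alpha.append({})
        back.append({})
        for y in labels:
            best_prev = max(labels, key=lambda p: alpha[i - 1][p] + score(x, i, p, y))
            alpha[i][y] = alpha[i - 1][best_prev] + score(x, i, best_prev, y)
            back[i][y] = best_prev
    y_last = max(labels, key=lambda y: alpha[n - 1][y])   # argmax_y alpha_{y,n}
    seq = [y_last]
    for i in range(n - 1, 0, -1):                 # follow back-pointers right to left
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))
```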

Page 16: Lecture 5: Structured Prediction

Structured Learning

- We know we can solve the inference problem
  - at least for sequence labeling
  - but also for many other problems where one can factor the features appropriately (context-free parsing, dependency parsing, semantic role labeling, ...)
- How does this change learning?
  - for the perceptron algorithm?
  - for SVMs?
  - for logistic regression?

Page 17: Lecture 5: Structured Prediction

Structured Perceptron

- Exactly like the original perceptron
- Except that the argmax function now uses factored features, which we can solve with algorithms like Viterbi
- All of the original analysis carries over!

  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     let y' = argmax_y w^(i) · f(x_t, y)   (**)
  5.     if y' ≠ y_t
  6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
  7.       i = i + 1
  8. return w^(i)

(**) Solve the argmax with Viterbi for sequence problems!
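Concretely, the structured perceptron is the earlier perceptron_train sketch with Viterbi plugged in as the argmax; everything below reuses the hypothetical helpers introduced above.

```python
def make_viterbi_argmax(labels, local_score):
    """Builds an argmax(x, w) for perceptron_train that runs Viterbi over
    factored scores. `local_score(w, x, i, prev, cur)` = w . f(x, y_{i-1}, y_i)
    is an assumed interface."""
    def argmax(x, w):
        return viterbi(x, labels, lambda xs, i, p, c: local_score(w, xs, i, p, c))
    return argmax

# Hypothetical usage:
# w = perceptron_train(tagged_corpus, feat=sentence_features,
#                      argmax=make_viterbi_argmax(TAGS, local_score))
```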

Page 18: Lecture 5: Structured Prediction

Online Structured SVMs (or Online MIRA)

  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
           such that: w* · f(x_t, y_t) − w* · f(x_t, y') ≥ L(y_t, y')
           ∀y' ∈ Y_t with y' ∈ k-best(x_t, w^(i))   (**)
  5.     i = i + 1
  6. return w^(i)

- k-best(x_t, w^(i)) is the set of k outputs with the highest scores under w^(i)
- Simple solution: only consider the single highest-scoring output y' ∈ Y_t (see the sketch below)
- Note: the old fixed margin of 1 is now a loss L(y_t, y') between two structured outputs
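A hedged sketch of the "simple solution" above: the one-constraint MIRA step has a closed-form solution, shown here with sparse feature dicts (the dict representation and argument names are assumptions, not the slides' notation).

```python
def mira_update(w, feat_gold, feat_pred, loss):
    """1-best MIRA step: enforce a margin of loss(y_t, y') against the single
    highest-scoring wrong output while moving w as little as possible.
    Closed-form solution of the one-constraint quadratic program."""
    delta = {k: feat_gold.get(k, 0.0) - feat_pred.get(k, 0.0)
             for k in set(feat_gold) | set(feat_pred)}
    margin = sum(w.get(k, 0.0) * v for k, v in delta.items())   # current score gap
    norm_sq = sum(v * v for v in delta.values())
    if norm_sq == 0.0:
        return w                                  # identical feature vectors: no update
    tau = max(0.0, loss - margin) / norm_sq       # step size from the QP
    for k, v in delta.items():
        w[k] = w.get(k, 0.0) + tau * v
    return w
```

In practice y' here is the 1-best output under the current weights (e.g., the Viterbi output for sequences), and loss is L(y_t, y').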

Page 19: Lecture 5: Structured Prediction

Structured SVMs

  min (1/2) ||w||^2

  such that:
  w · f(x_t, y_t) − w · f(x_t, y') ≥ L(y_t, y')
  ∀(x_t, y_t) ∈ T and y' ∈ Y_t

- We still have an exponential number of constraints
- Feature factorizations permit solutions (max-margin Markov networks, structured SVMs)
- Note: the old fixed margin of 1 is now a loss L(y_t, y') between two structured outputs

Page 20: Lecture 5: Structured Prediction

Conditional Random Fields (i)

- What about structured logistic regression?
- Such a thing exists: Conditional Random Fields (CRFs)
- Consider again the sequential case with a first-order factorization
- Inference is identical to the structured perceptron: use Viterbi

    argmax_y P(y|x) = argmax_y e^{w · f(x,y)} / Z_x
                    = argmax_y e^{w · f(x,y)}
                    = argmax_y w · f(x, y)
                    = argmax_y ∑_{i=1}^{|y|} w · f(x, y_{i−1}, y_i)

Page 21: Lecture 5: Structured Prediction

Conditional Random Fields (ii)

- However, learning does change
- Reminder: pick w to maximize the log-likelihood of the training data:

    w = argmax_w ∑_t log P(y_t|x_t)

- Take the gradient and use gradient ascent:

    ∂F(w)/∂w_i = ∑_t f_i(x_t, y_t) − ∑_t ∑_{y'∈Y} P(y'|x_t) f_i(x_t, y')

- And the gradient is:

    ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)

Page 22: Lecture 5: Structured Prediction

Conditional Random Fields (iii)

- Problem: the second term sums over the output space Y

    ∂F(w)/∂w_i = ∑_t f_i(x_t, y_t) − ∑_t ∑_{y'∈Y} P(y'|x_t) f_i(x_t, y')
               = ∑_t ∑_{j=1}^{|y_t|} f_i(x_t, y_{t,j−1}, y_{t,j}) − ∑_t ∑_{y'∈Y} ∑_{j=1}^{|y'|} P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)

- We can easily calculate the first term: it is just empirical counts
- What about the second term?

Page 23: Lecture 5: Structured Prediction

Conditional Random Fields (iv)

- Problem: sum over the output space Y

    ∑_t ∑_{y'∈Y} ∑_{j=1}^{|y'|} P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)

- We need to show that we can compute it for an arbitrary x_t:

    ∑_{y'∈Y} ∑_{j=1}^{|y'|} P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)

- Solution: the forward-backward algorithm

Page 24: Lecture 5: Structured Prediction

Forward Algorithm (i)

- Let α^m_u be the forward scores, and let |x_t| = n
- α^m_u is the sum over all labelings of x_0 ... x_m such that y'_m = u:

    α^m_u = ∑_{|y'|=m, y'_m=u} e^{w · f(x_t, y')}
          = ∑_{|y'|=m, y'_m=u} e^{∑_{j=1}^{m} w · f(x_t, y'_{j−1}, y'_j)}

  i.e., the sum over all labelings of length m, ending at position m with label u
- Note then that

    Z_{x_t} = ∑_{y'} e^{w · f(x_t, y')} = ∑_u α^n_u

Page 25: Lecture 5: Structured Prediction

Forward Algorithm (ii)

- We can fill in α as follows:

    α^0_u = 1.0  ∀u
    α^m_u = ∑_v α^{m−1}_v × e^{w · f(x_t, v, u)}

Page 26: Lecture 5: Structured Prediction

Backward Algorithm

- Let β^m_u be the symmetric backward scores, i.e., the sum over all labelings of x_m ... x_n such that y'_m = u
- We can fill in β as follows:

    β^n_u = 1.0  ∀u
    β^m_u = ∑_v β^{m+1}_v × e^{w · f(x_t, u, v)}

- Note: β is overloaded here; it is different from the back-pointers used for Viterbi
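A direct Python transcription of the two recursions, assuming the same score(x, i, prev, cur) = w · f(x, y_{i−1}, y_i) interface as the Viterbi sketch; real implementations work in log space to avoid overflow, which this sketch ignores for clarity.

```python
import math

def forward_backward(x, labels, score):
    """Forward and backward sums exactly as in the slide recursions.
    `score(x, i, prev, cur)` is an assumed interface returning w . f(x, prev, cur)."""
    n = len(x)
    alpha = [{u: 1.0 for u in labels}]            # alpha^0_u = 1.0 for all u
    for m in range(1, n):
        alpha.append({u: sum(alpha[m - 1][v] * math.exp(score(x, m, v, u))
                             for v in labels)
                      for u in labels})
    beta = [None] * n
    beta[n - 1] = {u: 1.0 for u in labels}        # beta^n_u = 1.0 for all u
    for m in range(n - 2, -1, -1):
        beta[m] = {u: sum(beta[m + 1][v] * math.exp(score(x, m + 1, u, v))
                          for v in labels)
                   for u in labels}
    Z = sum(alpha[n - 1][u] for u in labels)      # partition function Z_x
    return alpha, beta, Z
```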

Page 27: Lecture 5: Structured Prediction

Conditional Random Fields - Final

- Let’s show that we can compute it for an arbitrary x_t:

    ∑_{y'∈Y} ∑_{j=1}^{|y'|} P(y'|x_t) f_i(x_t, y'_{j−1}, y'_j)

- Using the forward and backward scores, we can rewrite it as:

    ∑_{j=1}^{n} ∑_{y'_{j−1}, y'_j} ( α^{j−1}_{y'_{j−1}} × e^{w · f(x_t, y'_{j−1}, y'_j)} × β^j_{y'_j} / Z_{x_t} ) × f_i(x_t, y'_{j−1}, y'_j)

- Forward-backward therefore lets us calculate the partial derivatives efficiently
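A sketch of how the forward/backward sums turn into the expected feature counts in the gradient; it reuses the forward_backward sketch above, and local_feats(x, j, prev, cur) (a sparse dict of f_i values for one position) is an assumed interface.

```python
import math

def expected_counts(x, labels, score, local_feats):
    """Second term of the CRF gradient: expected feature counts under the model,
    computed from the forward/backward sums."""
    alpha, beta, Z = forward_backward(x, labels, score)
    expected = {}
    for j in range(1, len(x)):
        for u in labels:
            for v in labels:
                # edge marginal P(y_{j-1}=u, y_j=v | x) from the alpha/beta sums
                p = alpha[j - 1][u] * math.exp(score(x, j, u, v)) * beta[j][v] / Z
                for k, fv in local_feats(x, j, u, v).items():
                    expected[k] = expected.get(k, 0.0) + p * fv
    return expected
```

Subtracting these expected counts from the empirical counts gives the partial derivatives, which can then be fed to gradient ascent as on the previous slides.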

Page 28: Lecture 5: Structured Prediction

Conditional Random Fields Summary

- Inference: Viterbi
- Learning: use the forward-backward algorithm
- What about non-sequential problems?
  - Context-free parsing: can use the inside-outside algorithm
  - General problems: message passing and belief propagation

Page 29: Lecture 5: Structured Prediction

Case Study: Dependency Parsing

- Given an input sentence x, predict the syntactic dependencies y

Page 30: Lecture 5: Structured Prediction

Model 1: Arc-Factored Graph-Based Parsing

    y = argmax_y w · f(x, y)
      = argmax_y ∑_{(i,j)∈y} w · f(i, j)

- (i, j) ∈ y means x_i → x_j, i.e., a dependency from head x_i to dependent x_j
- Solving the argmax:
  - w · f(i, j) is the weight of an arc
  - a dependency tree is a spanning tree of a dense graph over x
  - use maximum spanning tree algorithms for inference (see the sketch below)
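A hedged sketch of arc-factored decoding using networkx's implementation of the Chu-Liu-Edmonds algorithm; the networkx dependency and the arc_score(words, i, j) = w · f(i, j) interface are assumptions, not part of the slides.

```python
import networkx as nx   # assumed external dependency (provides Chu-Liu-Edmonds)

def mst_parse(words, arc_score):
    """Arc-factored decoding: build a dense directed graph over the sentence,
    with node 0 as the artificial root, and return the maximum spanning
    arborescence. `arc_score(words, i, j)` is an assumed scoring interface."""
    n = len(words)                                # nodes 1..n are words, 0 is the root
    G = nx.DiGraph()
    for i in range(n + 1):
        for j in range(1, n + 1):                 # no arcs into the root
            if i != j:
                G.add_edge(i, j, weight=arc_score(words, i, j))
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return sorted(tree.edges())                   # (head, dependent) arcs of the best tree
```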

Page 31: Lecture 5: Structured Prediction

Defining f(i, j)

- Can contain any feature over the arc or the input sentence
- Some example features (sketched in code below):
  - the identities of x_i and x_j
  - their part-of-speech tags
  - the part-of-speech tags of surrounding words
  - the distance between x_i and x_j
  - ...
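For illustration, a few such arc features written as a Python function; the templates and naming scheme are invented for this sketch and are not McDonald's actual feature set.

```python
def arc_features(words, tags, i, j):
    """Illustrative arc features f(i, j) for a candidate dependency
    words[i] -> words[j]; assumes index 0 holds an artificial <root> token."""
    dist = min(abs(i - j), 5)                     # binned linear distance
    return {
        f"hw={words[i]}|dw={words[j]}": 1.0,      # word identities of head and dependent
        f"hp={tags[i]}|dp={tags[j]}": 1.0,        # their part-of-speech tags
        f"hp={tags[i]}|dp={tags[j]}|dist={dist}": 1.0,   # tag pair plus distance
    }
```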

Page 32: Lecture 5: Structured Prediction

Empirical Results

- Spanning tree dependency parsing results (McDonald 2006)
- Trained using MIRA (online SVMs)

    Language   Accuracy   Complete
    English    90.7       36.7
    Czech      84.1       32.2
    Chinese    79.7       27.2

- Simple structured linear classifier
- Near state-of-the-art performance for many languages
- Higher-order models give higher accuracy

Page 33: Lecture 5: Structured Prediction

Model 2: Transition-Based Parsing

    y = argmax_y w · f(x, y)
      = argmax_y ∑_{t(s)∈T(y)} w · f(s, t)

- t(s) ∈ T(y) means that the derivation of y includes the application of transition t to state s
- Solving the argmax:
  - w · f(s, t) is the score of transition t in state s
  - use beam search to find the best derivation from the start state s_0 (see the sketch below)
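A generic beam-search decoder over an abstract transition system, as a hedged sketch; the state/transition interfaces (start, legal, apply_t, score, is_final) are assumptions, not the Zhang and Nivre (2011) implementation.

```python
def beam_parse(w, x, start, legal, apply_t, score, is_final, beam_size=8):
    """Beam-search decoding for a transition system. `start(x)` builds the
    initial state s_0, `legal(s)` lists the transitions allowed in s,
    `apply_t(s, t)` returns the successor state, `score(w, s, t)` = w . f(s, t),
    and `is_final(s)` tests for a terminal state; every non-final state is
    assumed to allow at least one transition."""
    beam = [(0.0, start(x))]                      # (cumulative score, state)
    while not all(is_final(s) for _, s in beam):
        candidates = []
        for total, s in beam:
            if is_final(s):
                candidates.append((total, s))     # finished derivations survive as-is
                continue
            for t in legal(s):
                candidates.append((total + score(w, s, t), apply_t(s, t)))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]             # keep only the k best partial derivations
    return max(beam, key=lambda c: c[0])[1]       # best final state (holds the parse)
```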

Page 34: Lecture 5: Structured Prediction

Defining f(s, t)

- Can contain any feature over parser states
- Some example features:
  - the identities of words in s (e.g., top of the stack, head of the queue)
  - their part-of-speech tags
  - their heads and dependents (and their part-of-speech tags)
  - the number of dependents of words in s
  - ...

Page 35: Lecture 5: Structured Prediction

Empirical Results

- Transition-based dependency parsing with beam search (**) (Zhang and Nivre 2011)
- Trained using the perceptron

    Language   Accuracy   Complete
    English    92.9       48.0
    Chinese    86.0       36.9

- Simple structured linear classifier
- State-of-the-art performance with rich non-local features

(**) Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is an optimization of best-first search that reduces the memory requirements, and it only finds an approximate solution.

Page 36: Lecture 5: Structured Prediction

Structured Prediction Summary

- We can’t use plain multiclass algorithms: the search space is too large
- Solution: factor the representations
  - This can allow for efficient inference and learning
  - Shown here for sequence learning: Viterbi + forward-backward
  - But it is also true for other structures:
    - CFG parsing: CKY + inside-outside
    - Dependency parsing: spanning tree algorithms or beam search
    - General graphs: message passing and belief propagation
