

Machine Learning for Structured Prediction

Grzegorz Chrupała

National Centre for Language Technology
School of Computing
Dublin City University

NCLT Seminar, 2006


Structured vs Non-structured prediction

In supervised learning we try to learn a function h : X → Y, where x ∈ X are inputs and y ∈ Y are outputs.

Binary classification: Y = {−1, +1}

Multiclass classification: Y = {1, . . . ,K} (finite set of labels)

Regression: Y = R

In contrast, in structured prediction the elements of Y are complex objects

The prediction is based on the feature function Φ : X → F, where usually F = R^D (a D-dimensional vector space)
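As a toy illustration, a feature function Φ over word inputs might look like the following sketch (the three features and the helper name phi are illustrative assumptions, not from the slides):

```python
import numpy as np

def phi(word: str) -> np.ndarray:
    """Map an input word to a D-dimensional real feature vector (here D = 3)."""
    return np.array([
        float(word.endswith("ing")),           # suffix feature
        float(word[:1].isupper()),             # capitalization feature
        float(any(c.isdigit() for c in word))  # digit feature
    ])

print(phi("Running"))  # [1. 1. 0.]
```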


Structured prediction tasks

Sequence labeling: e.g. POS tagging

Parsing: given an input sequence, build a tree whose yield (leaves) are the elements in the sequence and whose structure obeys some grammar.

Collective classification: e.g. relation learning problems, such as labeling web pages given link information.

Bipartite matching: given a bipartite graph, find the best possible matching, e.g. word alignment in NLP and protein structure prediction in computational biology.


Linear classifiers

In binary classification, linear classifiers separate classes in a multidimensional input space with a hyperplane

The classification function giving the output score is

$$ y = f(\vec{w} \cdot \vec{x} + b) = f\Big(\sum_{d=1}^{D} w_d x_d + b\Big) $$

where
◮ x is the input feature vector, Φ(x),
◮ w is a real vector of feature weights,
◮ b is the bias term, and
◮ f maps the dot product of the two vectors to the output (e.g. f(x) = −1 if x < 0, +1 otherwise)

The use of kernels makes it possible to apply linear classifiers to non-linearly separable problems such as the XOR function
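A minimal sketch of this decision rule (assuming NumPy; the weights below are illustrative, not learned):

```python
import numpy as np

def predict(x: np.ndarray, w: np.ndarray, b: float) -> int:
    """Apply f to the score w.x + b: -1 if negative, +1 otherwise."""
    return -1 if np.dot(w, x) + b < 0 else 1

w = np.array([0.5, -1.0, 2.0])
print(predict(np.array([1.0, 1.0, 0.0]), w, b=0.1))  # -1 (score = -0.4)
```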


Linear classifier – hyperplanes

Figure: Three separating hyperplanes in 2-dimensional space (Wikipedia)


Perceptron

The Perceptron iterates I times through the n = 1…N training examples, updating the parameters (weights and bias) whenever the current example n is classified incorrectly:

◮ w ← w + y_n x_n
◮ b ← b + y_n

where y_n is the true output (label) for the nth training example

The Averaged Perceptron, which generalizes better to unseen examples, performs weight averaging: the final weights returned are the average of all weight vectors used during the run of the algorithm
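In code, the update rule amounts to the following (a minimal sketch assuming NumPy, labels in {−1, +1}, and rows of X already mapped through Φ):

```python
import numpy as np

def perceptron(X: np.ndarray, y: np.ndarray, I: int = 10):
    """I passes over the N training examples; update w and b on each mistake."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(I):
        for x_n, y_n in zip(X, y):
            if y_n * (np.dot(w, x_n) + b) <= 0:  # example n misclassified
                w += y_n * x_n                   # w <- w + y_n x_n
                b += y_n                         # b <- b + y_n
    return w, b
```

The averaged variant would additionally accumulate a running sum of w and b after every example and return the averages.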


Perceptron Algorithm

Figure: the Perceptron algorithm, from Daumé III (2006)


Structured Perceptron

For structured prediction the feature function is Φ : X × Y → F (the second argument is the hypothesized output)

Structured prediction algorithms need Φ to admit tractable search

The arg max problem:

$$ \hat{y} = \operatorname*{arg\,max}_{y \in Y} \; \vec{w} \cdot \Phi(x, y) $$

Structured Perceptron: for each example n, whenever the predicted output ŷ_n for x_n differs from the true y_n, the weights are updated:
◮ w ← w + Φ(x_n, y_n) − Φ(x_n, ŷ_n)

Similar to the plain Perceptron; the key difference is that ŷ is calculated using the arg max.

The Incremental Perceptron is a variant used in cases where the exact arg max isn't available; a beam search algorithm is used instead.
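A minimal sketch of the structured update (the feature function phi and the enumerable candidate set Y are assumptions for illustration; in practice the arg max is computed with dynamic programming such as Viterbi):

```python
import numpy as np

def argmax_y(w, x, Y, phi):
    """Brute-force arg max over a small candidate set Y."""
    return max(Y, key=lambda y: np.dot(w, phi(x, y)))

def structured_perceptron(data, Y, phi, D, I=10):
    """data: (x, y) pairs; D: feature dimension; I: passes over the data."""
    w = np.zeros(D)
    for _ in range(I):
        for x, y in data:
            y_hat = argmax_y(w, x, Y, phi)
            if y_hat != y:
                # w <- w + Phi(x, y) - Phi(x, y_hat)
                w += phi(x, y) - phi(x, y_hat)
    return w
```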


Logistic regression / Maximum Entropy

The conditional probability of class y is proportional to exp f(x):

$$ p(y \mid x; \vec{w}, b) = \frac{1}{Z_{x;\vec{w},b}} \exp\left[ y\,(\vec{w} \cdot \Phi(x) + b) \right] = \frac{1}{1 + \exp\left[ -2y\,(\vec{w} \cdot \Phi(x) + b) \right]} $$

For training we try to find w and b which maximize the likelihood of the training data

A Gaussian prior is used to penalize large weights

The Generalized Iterative Scaling algorithm is used to solve the maximization problem
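The second form of the equation translates directly into code (a minimal sketch assuming NumPy; phi_x stands for Φ(x)):

```python
import numpy as np

def p_y_given_x(y: int, phi_x: np.ndarray, w: np.ndarray, b: float) -> float:
    """p(y | x; w, b) for y in {-1, +1}, via the closed form above."""
    return 1.0 / (1.0 + np.exp(-2.0 * y * (np.dot(w, phi_x) + b)))

w = np.array([0.5, -1.0])
print(p_y_given_x(+1, np.array([1.0, 0.0]), w, b=0.0))  # ~0.73
```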


Maximum Entropy Markov Models

MEMMs are an extension of MaxEnt to sequence labeling

Replace p(observation|state) with p(state|observation)

For a first-order MEMM, the conditional distribution over the label y_n, given the full input x, the previous label y_{n−1}, the feature function Φ, and the weights w, is:

$$ p(y_n \mid x, y_{n-1}; \vec{w}) = \frac{1}{Z_{x; y_{n-1}; \vec{w}}} \exp\left[ \vec{w} \cdot \Phi(x, y_n, y_{n-1}) \right] $$

When the MEMM is trained, the true label y_{n−1} is used

At prediction time, the Viterbi algorithm is applied to solve the arg max

The predicted label y_{n−1} is used when classifying the nth element of the sequence
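A minimal sketch of Viterbi decoding for a first-order MEMM (the local model p_local is a hypothetical callable returning p(y_n | x, y_{n−1}), assumed strictly positive; it would wrap the trained MaxEnt classifier):

```python
import math

def viterbi(x, labels, p_local, start="<s>"):
    """Return the label sequence maximizing the product of local probabilities."""
    N = len(x)
    # delta[n][y]: best log-probability of any label sequence ending in y at n
    delta = [{y: -math.inf for y in labels} for _ in range(N)]
    back = [{} for _ in range(N)]
    for y in labels:
        delta[0][y] = math.log(p_local(y, start, x, 0))
    for n in range(1, N):
        for y in labels:
            for y_prev in labels:
                s = delta[n - 1][y_prev] + math.log(p_local(y, y_prev, x, n))
                if s > delta[n][y]:
                    delta[n][y], back[n][y] = s, y_prev
    # trace back-pointers from the best final label
    path = [max(labels, key=lambda y: delta[N - 1][y])]
    for n in range(N - 1, 0, -1):
        path.append(back[n][path[-1]])
    return list(reversed(path))
```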


Conditional Random Fields

An alternative generalization of MaxEnt to structured prediction

Main difference:
◮ while the MEMM uses per-state exponential models for the conditional probabilities of next states given the current state,
◮ the CRF has a single exponential model for the joint probability of the whole sequence of labels

The formulation is:

$$ p(y \mid x; \vec{w}) = \frac{1}{Z_{x;\vec{w}}} \exp\left[ \vec{w} \cdot \Phi(x, y) \right] $$

$$ Z_{x;\vec{w}} = \sum_{y' \in Y} \exp\left[ \vec{w} \cdot \Phi(x, y') \right] $$

where y′ ranges over all candidate outputs in Y (including the correct one)
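For a tiny label set, Z can be computed by brute-force enumeration, which makes the definition concrete (a sketch only; real CRFs compute Z with dynamic programming, and phi is again a hypothetical feature function over whole label sequences):

```python
import itertools
import numpy as np

def crf_prob(y, x, w, labels, phi):
    """p(y | x; w) = exp(w . Phi(x, y)) / Z, with Z summed over all outputs."""
    score = lambda y_seq: np.exp(np.dot(w, phi(x, y_seq)))
    Z = sum(score(y_prime)
            for y_prime in itertools.product(labels, repeat=len(x)))
    return score(y) / Z
```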


CRF: experimental comparison to HMMs and MEMMs

Figure from J. Lafferty, A. McCallum, and F. Pereira, Proc. 18th International Conf. on Machine Learning, 2001


Large Margin Models

Choose parameter settings which maximize the distance between thehyperplane separating classes and the nearest data points on eitherside.

Support Vector Machines. For data separable with margin 1:

$$ \operatorname*{minimize}_{\vec{w},\, b} \;\; \frac{1}{2} \lVert \vec{w} \rVert^2 $$

subject to

$$ \forall n,\quad y_n \left[ \vec{w} \cdot \Phi(x_n) + b \right] \ge 1 $$


SVM

Soft-margin formulation: don't enforce perfect classification of the training examples

The slack variable ξ_n measures how far example n is from satisfying the hard-margin constraint

$$ \operatorname*{minimize}_{\vec{w},\, b} \;\; \frac{1}{2} \lVert \vec{w} \rVert^2 + C \sum_{n=1}^{N} \xi_n $$

subject to

$$ \forall n,\quad y_n \left[ \vec{w} \cdot \Phi(x_n) + b \right] \ge 1 - \xi_n, \qquad \xi_n \ge 0 $$

The parameter C trades off fitting the training data against keeping the weight vector small
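This objective is usually solved as a quadratic program; as one concrete alternative (an assumption of this sketch, not the method on the slide), stochastic subgradient descent in the style of Pegasos minimizes the same regularized hinge loss, with lam playing the role of 1/(CN):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=20, seed=0):
    """Stochastic subgradient descent on the regularized hinge loss."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1]); b = 0.0; t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size
            margin = y[i] * (np.dot(w, X[i]) + b)
            w *= (1.0 - eta * lam)                # shrink: gradient of (lam/2)||w||^2
            if margin < 1:                        # hinge loss is active
                w += eta * y[i] * X[i]
                b += eta * y[i]                   # bias update: a common heuristic
    return w, b
```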


Margins and Support Vectors

Figure: SVM maximum-margin hyperplanes for binary classification. Examples along the hyperplanes are support vectors (from Wikipedia)


Maximum Margin models for Structured Prediction

Maximum Margin Markov Networks (M3N)

The difference in score (under loss function l) between the true output y and any incorrect output ŷ must be at least l(x, y, ŷ); M3N scales the margin to be proportional to the loss.

Support Vector Machines for Interdependent and Structured Outputs: SVMstruct

Instead of scaling the margin, SVMstruct scales the slack variables by the loss

The two approaches differ in the optimization techniques used (SVMstruct constrains the loss function less but is more difficult to optimize).
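For concreteness, the two constraint styles can be sketched as follows (a reconstruction following Taskar et al. 2003 and Tsochantaridis et al. 2005, not formulas from the slide; ξ_n is the slack for example n):

```latex
% Margin rescaling (M3N): the required margin grows with the loss.
\forall n,\ \forall y \ne y_n:\quad
\vec{w} \cdot \Phi(x_n, y_n) - \vec{w} \cdot \Phi(x_n, y)
\;\ge\; l(x_n, y_n, y) - \xi_n

% Slack rescaling (SVMstruct): a unit margin, with the slack divided by the loss.
\forall n,\ \forall y \ne y_n:\quad
\vec{w} \cdot \Phi(x_n, y_n) - \vec{w} \cdot \Phi(x_n, y)
\;\ge\; 1 - \frac{\xi_n}{l(x_n, y_n, y)}
```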


Summary of structured prediction algorithms

Figure from Daumé III (2006)


Other approaches to structured outputs

Searn (Search + Learn)

Analogical learning

Miscellaneous hacks:
◮ Reranking
◮ Stacking
◮ Class n-grams


References

M. Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP.

A. McCallum, D. Freitag, and F. Pereira. 2000. Maximum Entropy Markov Models for Information Extraction and Segmentation. ICML.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML.

B. Taskar, C. Guestrin, and D. Koller. 2003. Max-Margin Markov Networks. In Advances in Neural Information Processing Systems (NIPS).

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. 2005. Large Margin Methods for Structured and Interdependent Output Variables. JMLR.

Hal Daumé III. 2006. Practical Structured Learning Techniques for Natural Language Processing. PhD thesis.
