CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems


DESCRIPTION

CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems. Slides from Taskar and Klein are used in this lecture.

TRANSCRIPT

Page 1: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

1

CS546: Machine Learning and Natural Language

Multi-Class and Structured Prediction Problems

Slides from Taskar and Klein are used in this lecture

Page 2: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

2

Outline

– Multi-Class classification
– Structured Prediction
– Models for Structured Prediction and Classification

• Example: POS tagging

Page 3: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

3

Multiclass problems

– Most of the machinery we have talked about before was focused on binary classification problems, e.g., the SVMs we discussed so far
– However, most problems we encounter in NLP are either:
  • MultiClass: e.g., text categorization
  • Structured Prediction: e.g., predicting the syntactic structure of a sentence
– How do we deal with them?

Page 4: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

4

Binary linear classification


Page 5: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

5

Multiclass classification


Page 6: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

6

Perceptron


Page 7: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

Structured Perceptron

• Joint feature representation:
• Algorithm:
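The joint feature map and the update rule on this slide are images in the transcript; below is a minimal sketch of the structured perceptron as it is usually stated, assuming user-supplied `features` (a joint feature map over input/output pairs) and `decode` (argmax inference) functions. Both names are placeholders, not from the slides.

```python
import numpy as np

def structured_perceptron(data, features, decode, n_features, epochs=10):
    """data: list of (x, y_gold) pairs; features(x, y) -> np.ndarray;
    decode(x, w) -> argmax_y  w . features(x, y)."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = decode(x, w)                     # inference step
            if y_pred != y_gold:
                # additive update: promote the gold structure,
                # demote the current prediction
                w += features(x, y_gold) - features(x, y_pred)
    return w
```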

Page 8: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

8

Perceptron


Page 9: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

9

Binary Classification Margin


Page 10: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

10

Generalize to MultiClass


Page 11: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

11

Converting to MultiClass SVM


Page 12: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

12

Max margin = Min Norm


• As before, these are equivalent formulations:
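The two formulations themselves are images in the transcript; a standard multiclass pair of this shape (assuming joint features f(x, y) and the separable case; this is a reconstruction, not the slide's own notation) is:

```latex
% Max margin:
\max_{\gamma,\ \mathbf{w}:\ \|\mathbf{w}\| = 1} \gamma
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y) \ge \gamma
\qquad \forall i,\ \forall y \ne y_i

% Equivalent min norm:
\min_{\mathbf{w}} \tfrac{1}{2}\|\mathbf{w}\|^{2}
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y) \ge 1
\qquad \forall i,\ \forall y \ne y_i
```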

Page 13: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

13

Problems:

• Requires separability
• What if we have noise in the data?
• What if we have only a simple (limited) feature space?

Page 14: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

14

Non-separable case


Page 15: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

15

Non-separable case


Page 16: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

16

Compare with MaxEnt


Page 17: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

17

Loss Comparison


Page 18: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

18

Multiclass -> Structured

• So far, we considered multiclass classification
• 0-1 losses l(y, y')
• What if we want to predict:
  • sequences of POS tags
  • syntactic trees
  • translations

Page 19: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

19


Predicting word alignments

Page 20: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

20


Predicting Syntactic Trees

Page 21: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

21


Structured Models

Page 22: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

22


Parsing

Page 23: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

23


Max Margin Markov Networks (M3Ns)

Taskar et al., 2003; similar: Tsochantaridis et al., 2004
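The objective on this slide is an image; in the cited papers the soft-margin program has roughly this shape, with a structured loss l(y_i, y) rescaling the margin (a reconstruction from the papers, not the slide itself):

```latex
\min_{\mathbf{w},\ \boldsymbol{\xi}} \ \tfrac{1}{2}\|\mathbf{w}\|^{2} + C \sum_i \xi_i
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(x_i, y_i) - \mathbf{w}^{\top}\mathbf{f}(x_i, y) \ge \ell(y_i, y) - \xi_i
\qquad \forall i,\ \forall y
```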

Page 24: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

24


Max Margin Markov Networks (M3Ns)

Page 25: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

25 MultiClass Classification

Solving MultiClass with binary learning

• MultiClass classifier: a function f : R^d → {1, 2, 3, ..., k}

• Decompose into binary problems

• Not always possible to learn
• Different scale
• No theoretical justification

Real Problem

Page 26: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

26 MultiClass Classification

Learning via One-Versus-All (OvA) Assumption

• Find v_r, v_b, v_g, v_y ∈ R^n such that:
  – v_r·x > 0 iff y = red
  – v_b·x > 0 iff y = blue
  – v_g·x > 0 iff y = green
  – v_y·x > 0 iff y = yellow

• Classifier: f(x) = argmax_i v_i·x

(figures: individual classifiers and their decision regions)

H = R^{kn}
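A minimal sketch of OvA prediction, assuming the k per-class weight vectors have already been learned (by any binary learner) and are stacked as rows of a matrix:

```python
import numpy as np

def ova_predict(x, V):
    """V: (k, n) matrix whose rows are the per-class weight vectors
    (e.g., v_red, v_blue, v_green, v_yellow); x: (n,) feature vector.
    Returns the index of the highest-scoring class: argmax_i v_i . x."""
    return int(np.argmax(V @ x))
```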

Page 27: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

27 MultiClass Classification

Learning via All-Versus-All (AvA) Assumption

• Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^d such that:
  – v_rb·x > 0 if y = red, < 0 if y = blue
  – v_rg·x > 0 if y = red, < 0 if y = green
  – ... (for all pairs)

(figures: individual classifiers and their decision regions)

H = R^{kkn}

How to classify?

Page 28: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

28

Classifying with AvA

• Tree / Tournament of pairwise matches
• Majority Vote (e.g., 1 red, 2 yellow, 2 green ?)

All are post-learning and might cause weird stuff.
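A minimal sketch of the majority-vote option, assuming the learned pairwise weight vectors are stored in a dict keyed by class pairs (the data layout is an assumption, not from the slides):

```python
import numpy as np
from collections import Counter

def ava_predict(x, pairwise):
    """pairwise: dict mapping (a, b) -> weight vector v_ab, with
    v_ab . x > 0 read as a vote for class a and < 0 as a vote for b."""
    votes = Counter()
    for (a, b), v in pairwise.items():
        votes[a if v @ x > 0 else b] += 1
    # ties (as in "1 red, 2 yellow, 2 green ?") are broken arbitrarily here
    return votes.most_common(1)[0][0]
```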

Page 29: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

29


POS Tagging

• English tags

Page 30: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

30


POS Tagging, examples from WSJ

From McCallum

Page 31: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

31


POS Tagging

• Ambiguity: not a trivial task

• Useful for other tasks:
  • important features for other steps are based on POS
  • e.g., use POS as input to a parser

Page 32: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

32

But still, why so popular?

– Historically the first statistical NLP problem
– Easy to apply arbitrary classifiers: both for sequence models and just independent classifiers
– Can be regarded as a finite-state problem
– Easy to evaluate
– Annotation is cheaper to obtain than TreeBanks (especially for other languages)

Page 33: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

33


HMM (reminder)

Page 34: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

34


HMM (reminder) - transitions

Page 35: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

35


Transition Estimates

Page 36: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

36


Emission Estimates
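The estimation formulas on these two slides (transition and emission estimates) are images; a sketch of the standard count-based MLE estimates they refer to, with add-alpha smoothing as an assumed detail:

```python
from collections import Counter

def estimate_hmm(tagged_corpus, tagset, vocab, alpha=1.0):
    """tagged_corpus: sentences as lists of (word, tag) pairs.
    Returns smoothed tables p_trans[(prev_tag, tag)] = P(tag | prev_tag)
    and p_emit[(tag, word)] = P(word | tag)."""
    trans, emit = Counter(), Counter()
    prev_counts, tag_counts = Counter(), Counter()
    for sent in tagged_corpus:
        prev = "<s>"                       # sentence-start symbol
        for word, tag in sent:
            trans[(prev, tag)] += 1
            prev_counts[prev] += 1
            emit[(tag, word)] += 1
            tag_counts[tag] += 1
            prev = tag
    prevs = set(tag_counts) | {"<s>"}
    p_trans = {(p, t): (trans[(p, t)] + alpha) / (prev_counts[p] + alpha * len(tagset))
               for p in prevs for t in tagset}
    p_emit = {(t, w): (emit[(t, w)] + alpha) / (tag_counts[t] + alpha * len(vocab))
              for t in tagset for w in vocab}
    return p_trans, p_emit
```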

Page 37: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

37


MaxEnt (reminder)

Page 38: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

38


Decoding: HMM vs MaxEnt
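The comparison on this slide is an image; on the HMM side, decoding means exact Viterbi search over the tag chain. A minimal log-space sketch, reusing the (assumed) tables from the estimation sketch above:

```python
import math

def viterbi(words, tagset, p_trans, p_emit):
    """argmax over tag sequences of prod_i P(t_i | t_{i-1}) * P(w_i | t_i).
    best[t] = (log-score of the best path ending in tag t, that path)."""
    floor = 1e-12   # guard against missing table entries
    best = {t: (math.log(p_trans.get(("<s>", t), floor))
                + math.log(p_emit.get((t, words[0]), floor)), [t])
            for t in tagset}
    for w in words[1:]:
        new = {}
        for t in tagset:
            # pick the best predecessor for tag t at this position
            score, path = max((s + math.log(p_trans.get((p, t), floor)), path)
                              for p, (s, path) in best.items())
            new[t] = (score + math.log(p_emit.get((t, w), floor)), path + [t])
        best = new
    return max(best.values(), key=lambda sp: sp[0])[1]
```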

Page 39: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

39


Accuracies overview

Page 40: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

40


Accuracies overview

Page 41: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

41

SVMs for tagging

– We can use SVMs in a similar way to MaxEnt (or other classifiers)
– We can use a window around the word
– 97.16% accuracy on WSJ

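The exact feature templates are not in the transcript; a sketch of the kind of windowed feature extraction the bullet refers to (a +/-2-word window with word-identity features is an assumption):

```python
def window_features(words, i, size=2):
    """Features for tagging position i: the words in a +/-`size`
    window, keyed by relative offset, with padding at the edges."""
    feats = {}
    for off in range(-size, size + 1):
        j = i + off
        w = words[j] if 0 <= j < len(words) else "<pad>"
        feats[f"w[{off}]={w}"] = 1.0
    return feats
```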

Page 42: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

42

SVMs for tagging


from Giménez & Màrquez

Page 43: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

43

No sequence modeling


Page 44: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

44

CRFs and other global models


Page 45: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

45

CRFs and other global models


Page 46: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

Compare


CRFs - no local normalization

MEMMs - Note: after each step t, the remaining probability mass cannot be reduced; it can only be distributed among the possible state transitions

HMMs

(figures: the three model structures over the word sequence W and tag sequence T)
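The three diagrams are images in the original; the contrast in the MEMM note can be written out explicitly (my reconstruction, using theta for the weights, w for the word sequence, t for the tag sequence):

```latex
% MEMM: locally normalized at each step, so the probability mass leaving
% state t_{i-1} always sums to 1, whatever the observation w_i is
P(\mathbf{t} \mid \mathbf{w}) = \prod_i
  \frac{\exp\big(\boldsymbol{\theta}^{\top}\mathbf{f}(t_i, t_{i-1}, w_i)\big)}
       {Z(t_{i-1}, w_i)}

% CRF: a single global normalization over all tag sequences
P(\mathbf{t} \mid \mathbf{w}) =
  \frac{\exp\big(\sum_i \boldsymbol{\theta}^{\top}\mathbf{f}(t_i, t_{i-1}, w_i)\big)}
       {Z(\mathbf{w})}
```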

Page 47: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

47

Label Bias


based on a slide from Joe Drish

Page 48: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

48

Label Bias

• Recall transition-based parsing -- Nivre's algorithm (with beam search)

• At each step we can observe only local features (limited look-ahead)

• If we later see that the following word is impossible, we can only distribute the probability uniformly across all (im-)possible decisions

• If there is only a small number of such decisions, we cannot decrease the probability dramatically

• So, label bias is likely to be a serious problem if:
  • there are non-local dependencies
  • states have a small number of possible outgoing transitions
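A tiny worked version of the point above (numbers are illustrative, not from the slides): if a state s has exactly two outgoing transitions a and b, local normalization forces their probabilities to sum to one, so no observation, however incompatible, can push both down:

```latex
P(a \mid s, w_t) + P(b \mid s, w_t) = 1
\quad\Rightarrow\quad
\max\{P(a \mid s, w_t),\ P(b \mid s, w_t)\} \ge \tfrac{1}{2},
\qquad
\text{and with a single outgoing transition } P(a \mid s, w_t) = 1.
```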

Page 49: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

49

POS Tagging Experiments

– “+” is an extended feature set (hard to integrate into a generative model)
– oov = out-of-vocabulary

Page 50: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

50

Supervision

– So far we considered the supervised case: the training set is labeled
– However, we can try to induce word classes without supervision:
  – unsupervised tagging
  – we will later discuss the EM algorithm
– It can also be done in a partly supervised way:
  – seed tags
  – a small labeled dataset
  – a parallel corpus
  – ...

Page 51: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

51

Why not predict POS tags and parse trees simultaneously?

– It is possible and often done this way
– Doing tagging internally often benefits parsing accuracy
– Unfortunately, parsing models are less robust than taggers, e.g., on non-grammatical sentences or in different domains
– It is more expensive and does not help...

Page 52: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

52

Questions

• Why is there no label-bias problem for a generative model (e.g., HMM)?

• How would you integrate word features into a generative model (e.g., HMMs for POS tagging)?
  • e.g., whether the word has:
    • suffixes: -ing, -s, -ed, -d, -ment, ...
    • prefixes: post-, de-, ...

Page 53: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

53

“CRFs” for more complex structured output problems

• We considered sequence labeling problems
• Here, the structure of dependencies is fixed
• What if we do not know the structure but would like to have interactions respecting the structure?

Page 54: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

54

“CRFs” for more complex structured output problems


• Recall, we had the MST algorithm (McDonald and Pereira, 05)

Page 55: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

55

“CRFs” for more complex structured output problems

• Complex inference
  • E.g., arbitrary 2nd-order dependency parsing models are not tractable in the non-projective case; NP-complete (McDonald & Pereira, EACL 06)

• Recently, conditional models for constituent parsing:
  • (Finkel et al., ACL 08)
  • (Carreras et al., CoNLL 08)
  • ...

Page 56: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

56

Back to MultiClass

– Let us review how to decompose a multiclass problem into binary classification problems

Page 57: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

57

Summary

• Margin-based methods for multiclass classification and structured prediction

• CRFs vs HMMs vs MEMMs for POS tagging

Page 58: CS546: Machine Learning and Natural Language Multi-Class and Structured Prediction Problems

58

Conclusions

• All approaches use a linear representation
• The differences are:
  – Features
  – How to learn weights
  – Training paradigms:
    • Global training (CRF, Global Perceptron)
    • Modular training (PMM, MEMM, ...)
      – These approaches are easier to train, but may require additional mechanisms to enforce global constraints.