1
CS546: Machine Learning and Natural Language
Multi-Class and Structured Prediction Problems
Slides from Taskar and Klein are used in this lecture
2
Outline
– Multi-Class classification
– Structured Prediction
– Models for Structured Prediction and Classification
  • Example: POS tagging
3
Multiclass problems
– Most of the machinery we talked about before was focused on binary classification problems, e.g., the SVMs we discussed so far
– However, most problems we encounter in NLP are either:
  • MultiClass: e.g., text categorization
  • Structured Prediction: e.g., predicting the syntactic structure of a sentence
– How do we deal with them?
4
Binary linear classification
5
Multiclass classification
6
Perceptron
Structured Perceptron
• Joint feature representation Φ(x, y)
• Algorithm: predict ŷ = argmax_y w·Φ(x, y), update on mistakes
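Since the slide body did not survive extraction, here is a minimal sketch of the algorithm it names: decode with the joint feature representation, then apply a perceptron update on mistakes. The names `phi` and `decode` are placeholders for whatever feature map and argmax decoder are used:

```python
from collections import defaultdict

def structured_perceptron(data, phi, decode, epochs=10):
    """Collins-style structured perceptron (a sketch).

    data:   list of (x, y) pairs with structured outputs y
    phi:    joint feature map phi(x, y) -> dict {feature: value}
    decode: argmax decoder, decode(x, w) ~ argmax_y w . phi(x, y)
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            y_hat = decode(x, w)          # current best structure
            if y_hat != y:
                # promote features of the gold structure,
                # demote features of the predicted one
                for f, v in phi(x, y).items():
                    w[f] += v
                for f, v in phi(x, y_hat).items():
                    w[f] -= v
    return w
```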
8
Perceptron
9
Binary Classification Margin
10
Generalize to MultiClass
11
Converting to MultiClass SVM
12
Max margin = Min Norm
• As before, these are equivalent formulations:
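For reference, the two formulations in standard multiclass notation (reconstructed here, since the slide's formulas did not survive extraction; Φ is the joint feature map):

```latex
% max-margin form: fix the norm, maximize the margin
\max_{\gamma,\ \|w\| = 1} \ \gamma
\quad \text{s.t.} \quad
w \cdot \Phi(x_i, y_i) - w \cdot \Phi(x_i, y) \ge \gamma
\qquad \forall i,\ \forall y \ne y_i

% min-norm form: fix the margin to 1, minimize the norm
\min_{w} \ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad
w \cdot \Phi(x_i, y_i) - w \cdot \Phi(x_i, y) \ge 1
\qquad \forall i,\ \forall y \ne y_i
```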
13
Problems:
• Requires separability
• What if we have noise in the data?
• What if we have only a simple, limited feature space?
14
Non-separable case
15
Non-separable case
16
Compare with MaxEnt
17
Loss Comparison
18
Multiclass -> Structured
• So far, we considered multiclass classification with 0-1 losses l(y, y’)
• What if we want instead to predict:
  • sequences of POS tags
  • syntactic trees
  • translations
19
Predicting word alignments
20
Predicting Syntactic Trees
21
Structured Models
22
Parsing
23
Max Margin Markov Networks (M3Ns)
Taskar et al., 2003; similar: Tsochantaridis et al., 2004
24
Max Margin Markov Networks (M3Ns)
25
MultiClass Classification
Solving MultiClass with binary learning
• MultiClass classifier
  – a function f : R^d → {1, 2, 3, ..., k}
• Decompose into binary problems
  • Not always possible to learn
  • Different scales
  • No theoretical justification
Real Problem
26
MultiClass Classification
Learning via One-Versus-All (OvA) Assumption
• Find v_r, v_b, v_g, v_y ∈ R^n such that:
  – v_r·x > 0 iff y = red
  – v_b·x > 0 iff y = blue
  – v_g·x > 0 iff y = green
  – v_y·x > 0 iff y = yellow
• Classifier: f(x) = argmax_i v_i·x
[Figures: individual classifiers vs. the resulting decision regions]
• Hypothesis space: H = R^{kn}
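A minimal sketch of the OvA recipe above; `fit_binary` is a placeholder standing in for any binary learner (perceptron, linear SVM, ...), not part of the slides:

```python
import numpy as np

def train_ova(X, y, classes, fit_binary):
    """One-vs-all: learn one separator per class.

    X: (m, n) feature matrix; y: length-m array of class labels.
    fit_binary(X, signs) -> weight vector v aiming at v . x > 0
    for the +1 examples.
    """
    return {c: fit_binary(X, np.where(y == c, 1, -1)) for c in classes}

def predict_ova(V, x):
    # winner-take-all: f(x) = argmax_i v_i . x
    return max(V, key=lambda c: V[c] @ x)
```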
27
MultiClass Classification
Learning via All-Versus-All (AvA) Assumption
• Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^d such that:
  – v_rb·x > 0 if y = red, < 0 if y = blue
  – v_rg·x > 0 if y = red, < 0 if y = green
  – ... (for all pairs)
[Figures: individual classifiers vs. the resulting decision regions]
• Hypothesis space: H = R^{kkn}
How to classify?
28
Classifying with AvA
• Majority vote: e.g., 1 red, 2 yellow, 2 green → ?
• Tournament (a tree of pairwise eliminations)
• All of these are applied post-learning and can produce inconsistent decisions (a voting sketch follows)
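A sketch of the majority-vote variant (hypothetical interface; note that the slide's example, 2 yellow vs. 2 green, is exactly the kind of tie that makes these post-learning schemes behave inconsistently):

```python
from collections import Counter

def predict_ava_majority(V, x):
    """Majority-vote decoding for all-versus-all classifiers.

    V: dict mapping an ordered class pair (a, b) to a weight
       vector v_ab, where v_ab . x > 0 votes for a, < 0 for b.
    """
    votes = Counter()
    for (a, b), v in V.items():
        votes[a if v @ x > 0 else b] += 1
    # ties must be broken somehow; most_common picks one arbitrarily
    return votes.most_common(1)[0][0]
```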
29
POS Tagging
• English tags
30
POS Tagging, examples from WSJ
From McCallum
31
POS Tagging
• Ambiguity: not a trivial task
• Useful for other tasks:
  • important features for later processing steps are based on POS
  • e.g., use POS as input to a parser
32
But still, why is it so popular?
– Historically, the first statistical NLP problem
– Easy to apply arbitrary classifiers, both in sequence models and as independent classifiers
– Can be regarded as a finite-state problem
– Easy to evaluate
– Annotation is cheaper to obtain than treebanks (especially for other languages)
33
HMM (reminder)
34
HMM (reminder) - transitions
35
Transition Estimates
36
Emission Estimates
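The estimation formulas on these two slides did not survive extraction. As a reminder, the standard relative-frequency estimates, here with simple add-alpha smoothing (one choice among many), look roughly like this:

```python
from collections import Counter

def estimate_hmm(tagged_sents, alpha=1.0):
    """MLE transition/emission estimates with add-alpha smoothing.

    tagged_sents: list of sentences, each a list of (word, tag) pairs.
    Returns two functions: p_trans(prev, tag) and p_emit(tag, word).
    """
    trans, emit = Counter(), Counter()
    prev_counts, tag_counts = Counter(), Counter()
    tags, vocab = set(), set()

    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            trans[(prev, tag)] += 1
            prev_counts[prev] += 1
            emit[(tag, word)] += 1
            tag_counts[tag] += 1
            tags.add(tag)
            vocab.add(word)
            prev = tag
        trans[(prev, "</s>")] += 1
        prev_counts[prev] += 1

    def p_trans(prev, tag):
        # P(tag | prev), smoothed over the tag set (+1 for </s>)
        return (trans[(prev, tag)] + alpha) / (
            prev_counts[prev] + alpha * (len(tags) + 1))

    def p_emit(tag, word):
        # P(word | tag); unseen words get the smoothed floor (+1 for <unk>)
        return (emit[(tag, word)] + alpha) / (
            tag_counts[tag] + alpha * (len(vocab) + 1))

    return p_trans, p_emit
```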
37
MaxEnt (reminder)
38
Decoding: HMM vs MaxEnt
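Whatever the local model, decoding is the same dynamic program. A minimal log-space Viterbi sketch; `log_score` is a placeholder interface, not the lecture's code:

```python
import math

def viterbi(words, tags, log_score):
    """Generic Viterbi decoding.

    log_score(prev_tag, tag, words, i) is the local log score:
    for an HMM,  log P(tag | prev_tag) + log P(words[i] | tag);
    for a MaxEnt/MEMM tagger, log P(tag | prev_tag, words, i).
    """
    n = len(words)
    # best[i][t]: best log score of any tag sequence ending in t at i
    best = [{t: -math.inf for t in tags} for _ in range(n)]
    back = [{} for _ in range(n)]
    for t in tags:
        best[0][t] = log_score("<s>", t, words, 0)
    for i in range(1, n):
        for t in tags:
            for prev in tags:
                s = best[i - 1][prev] + log_score(prev, t, words, i)
                if s > best[i][t]:
                    best[i][t] = s
                    back[i][t] = prev
    # recover the best sequence from the back-pointers
    last = max(tags, key=lambda t: best[n - 1][t])
    seq = [last]
    for i in range(n - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))
```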
39
Accuracies overview
40
Accuracies overview
41
SVMs for tagging
– We can use SVMs in a similar way to MaxEnt (or other classifiers)
– We can use a window around the word
– 97.16% on WSJ
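To make "a window around the word" concrete, here is a hypothetical feature extractor of the kind such per-word taggers use (the feature sets of real systems are richer):

```python
def window_features(words, i, k=2):
    """Features for classifying words[i] from a +/-k token window."""
    w = words[i]
    feats = {}
    for d in range(-k, k + 1):
        j = i + d
        tok = words[j] if 0 <= j < len(words) else "<pad>"
        feats["w[%+d]=%s" % (d, tok.lower())] = 1.0
    feats["suffix3=" + w[-3:]] = 1.0                     # e.g., -ing, -ed
    feats["is_cap=%s" % w[0].isupper()] = 1.0
    feats["has_digit=%s" % any(c.isdigit() for c in w)] = 1.0
    return feats
```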
42
SVMs for tagging
from Jimenez & Marquez
43
No sequence modeling
44
CRFs and other global models
45
CRFs and other global models
Compare
CRFs - no local normalization
MEMMs - note: after each step t, the remaining probability mass cannot be reduced; it can only be distributed across the possible state transitions
HMMs
[Figure: graphical models over the word sequence W and tag sequence T]
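In equations (standard definitions, supplied here because the slide's formulas did not survive extraction; f is a local feature function over adjacent tags t and the word sequence w):

```latex
% MEMM: a locally normalized distribution at every step
P(t_{1:n} \mid w_{1:n})
  = \prod_{i=1}^{n}
    \frac{\exp\big(\theta \cdot f(t_i, t_{i-1}, w_{1:n}, i)\big)}
         {\sum_{t'} \exp\big(\theta \cdot f(t', t_{i-1}, w_{1:n}, i)\big)}

% CRF: one global normalization over all tag sequences
P(t_{1:n} \mid w_{1:n})
  = \frac{\exp\big(\sum_{i=1}^{n} \theta \cdot f(t_i, t_{i-1}, w_{1:n}, i)\big)}
         {\sum_{t'_{1:n}} \exp\big(\sum_{i=1}^{n} \theta \cdot f(t'_i, t'_{i-1}, w_{1:n}, i)\big)}
```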
47
Label Bias
based on a slide from Joe Drish
48
Label Bias
• Recall transition-based parsing: Nivre's algorithm (with beam search)
• At each step we can observe only local features (limited look-ahead)
• If we later see that the following word is impossible, we can only distribute probability across the (im)possible decisions already available
• If there is only a small number of such decisions, we cannot decrease the probability dramatically
• So, label bias is likely to be a serious problem if:
  • there are non-local dependencies
  • states have a small number of possible outgoing transitions (see the example below)
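A worked micro-example (illustrative, not from the slides): suppose state s has a single outgoing transition. Local normalization then forces

```latex
P(t' \mid s, w_{1:n}, i)
  = \frac{\exp\big(\theta \cdot f(t', s, w_{1:n}, i)\big)}
         {\exp\big(\theta \cdot f(t', s, w_{1:n}, i)\big)} = 1
```

so the observation at step i is ignored entirely: no evidence can down-weight paths passing through s, which is exactly the effect described above.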
49
POS Tagging Experiments
– “+” is an extended feature set (hard to integrate in a generative model)
– oov – out-of-vocabulary
50
Supervision
– So far we considered the supervised case: the training set is labeled
– However, we can try to induce word classes without supervision: unsupervised tagging
  – We will later discuss the EM algorithm
– It can also be done in a partly supervised way:
  – seed tags
  – a small labeled dataset
  – a parallel corpus
  – ...
51
Why not predict POS and parse trees simultaneously?
– It is possible and often done this way
– Doing tagging internally often benefits parsing accuracy
– Unfortunately, parsing models are less robust than taggers
  – e.g., on non-grammatical sentences or in different domains
– It is more expensive and does not help...
52
Questions
• Why is there no label-bias problem for a generative model (e.g., an HMM)?
• How would you integrate word features in a generative model (e.g., HMMs for POS tagging)?
• e.g., if the word has:
  • suffixes: -ing, -s, -ed, -d, -ment, ...
  • prefixes: post-, de-, ...
53
“CRFs” for more complex structured output problems
• We considered sequence labeling problems
• Here, the structure of dependencies is fixed
• What if we do not know the structure but would like to have interactions respecting the structure?
54
“CRFs” for more complex structured output problems
• Recall: we saw the MST algorithm (McDonald and Pereira, 05)
55
“CRFs” for more complex structured output problems
• Complex inference
  • e.g., arbitrary second-order dependency parsing models are not tractable in the non-projective case; NP-complete (McDonald & Pereira, EACL 06)
• Recently, conditional models for constituent parsing:
  • (Finkel et al., ACL 08)
  • (Carreras et al., CoNLL 08)
  • ...
56
Back to MultiClass
– Let us review how to decompose a multiclass problem into binary classification problems
57
Summary
• Margin-based methods for multiclass classification and structured prediction
• CRFs vs HMMs vs MEMMs for POS tagging
58
Conclusions
• All approaches use a linear representation
• The differences are:
  – Features
  – How to learn weights
  – Training paradigms:
    • Global training (CRF, global perceptron)
    • Modular training (PMM, MEMM, ...)
      – These approaches are easier to train, but may require additional mechanisms to enforce global constraints.