TRANSCRIPT
1
CS546: Machine Learning and Natural Language
Latent-Variable Models for Structured Prediction Problems: Syntactic Parsing
Slides / Figures from Slav Petrov’s talk at COLING-ACL 06 are used in this lecture
2-4
Parsing Problem
• Annotation refines base treebank symbols to improve the statistical fit of the grammar
• Parent annotation [Johnson 98] (illustrated below)
• Head lexicalization [Collins 99, ...]
• Automatic annotation [Matsuzaki et al, 05; ...]
• Manual annotation [Klein and Manning 03]
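A tiny illustration of parent annotation [Johnson 98] (example mine, not from the slides): the same base category NP is refined according to its parent, so subject and object NPs receive distinct symbols:

```
original tree                 parent-annotated tree
S                             S^ROOT
├── NP   (subject)            ├── NP^S    (NP whose parent is S)
└── VP                        └── VP^S
    └── NP   (object)             └── NP^VP  (NP whose parent is VP)
```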
5
Manual Annotation
• Manually split categories:
– NP: subject vs object
– DT: determiners vs demonstratives
– IN: sentential vs prepositional
• Advantages:
– Fairly compact grammar
– Linguistic motivations
• Disadvantages:
– Performance leveled out
– Manually annotated
Model                  F1
Naïve Treebank PCFG    72.6
Klein & Manning '03    86.3
6
Automatic Annotation
• Use Latent Variable Models:
– Split ("annotate") each node, e.g., NP -> (NP[1], NP[2], ..., NP[T])
– Each node in the tree is annotated with a latent sub-category
– Latent-Annotated Probabilistic CFG: to obtain the probability of a tree you need to sum over all the latent variables (see the formula below)
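For reference, the summation mentioned above can be written as follows (notation is mine, not from the slides): the probability of an unannotated tree T is a sum over all assignments z of subcategories to its nodes of the product of annotated rule probabilities.

```latex
P(T) \;=\; \sum_{\mathbf{z}}\;
  \prod_{(A \to B\,C)\,\in\,T} P\big(A[z_A] \to B[z_B]\,C[z_C]\big)
  \prod_{(A \to w)\,\in\,T} P\big(A[z_A] \to w\big)
```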
7
How to perform this clustering?
• Estimating model parameters (and model structure):
– Decide how to split each nonterminal (what is T in, e.g., NP -> (NP[1], NP[2], ..., NP[T]))
– Estimate probabilities for all annotated rules
• Parsing:
– Do you need the most likely 'annotated' parse tree (1), or the most likely tree with non-annotated nodes (2)?
– Usually (2), but the inferred latent variables can be useful for other tasks
8
Estimating the model
• Estimating parameters:
– If we decide on the structure of the model (how we split), we can use EM (Matsuzaki et al, 05; Petrov and Klein, 06; ...):
• E-Step: estimate the posterior over latent annotations given each observed tree, obtaining fractional counts of annotated rules
• M-Step: re-estimate rule probabilities by normalizing these fractional counts
(a code sketch of one such E-step follows this slide)
– Variational (mean-field) methods can also be used [Titov and Henderson, 07; Liang et al, 07]
• Recall: we considered variational methods in the context of LDA
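Below is a minimal sketch of the E-step referenced above, for the case where the treebank trees are observed and only the subcategories are latent. Everything here (the Node class, the rule_p / lex_p dictionaries, the choice T = 2) is my own illustration, not code from the cited papers: inside/outside scores over the latent subcategories of each node of the fixed tree yield the fractional rule counts, and the M-step simply renormalizes them.

```python
from dataclasses import dataclass
from typing import Dict, Optional
import numpy as np

T = 2  # number of latent subcategories per nonterminal (illustrative)

@dataclass
class Node:
    label: str
    word: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    def is_leaf(self) -> bool:
        return self.word is not None

def inside(n: Node, rule_p: Dict, lex_p: Dict) -> np.ndarray:
    """alpha[i] = P(yield below n | n carries subcategory i)."""
    if n.is_leaf():
        return lex_p[(n.label, n.word)]                       # shape (T,)
    aL = inside(n.left, rule_p, lex_p)
    aR = inside(n.right, rule_p, lex_p)
    R = rule_p[(n.label, n.left.label, n.right.label)]        # shape (T,T,T)
    return np.einsum('ijk,j,k->i', R, aL, aR)

def accumulate(n: Node, beta: np.ndarray, rule_p, lex_p, counts, z: float) -> None:
    """Add fractional counts of annotated rules; beta[i] is the outside score of n."""
    if n.is_leaf():
        key = (n.label, n.word)
        counts[key] = counts.get(key, 0) + beta * lex_p[key] / z
        return
    aL = inside(n.left, rule_p, lex_p)
    aR = inside(n.right, rule_p, lex_p)
    R = rule_p[(n.label, n.left.label, n.right.label)]
    key = (n.label, n.left.label, n.right.label)
    counts[key] = counts.get(key, 0) + np.einsum('i,ijk,j,k->ijk', beta, R, aL, aR) / z
    accumulate(n.left,  np.einsum('i,ijk,k->j', beta, R, aR), rule_p, lex_p, counts, z)
    accumulate(n.right, np.einsum('i,ijk,j->k', beta, R, aL), rule_p, lex_p, counts, z)

# E-step for one tree: z = inside(root, rule_p, lex_p).sum();
#                      accumulate(root, np.ones(T), rule_p, lex_p, counts, z).
# M-step: renormalize the counts so that all rules sharing the same annotated
#         left-hand side A[i] sum to one.
```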
9
Estimating the model
• How do we decide how many subcategories to split each node into?
– Early models split all the nodes equally [Kurihara and Sato, 04; Matsuzaki et al, 05; Prescher, 05; ...], with T selected by hand
– The resulting models are sparse (parameter estimates are not reliable) and parsing time is large
10
Estimating the model
• How do we decide how many subcategories to split each node into?
– Later, different approaches were considered:
• (Petrov and Klein, 06): Split-and-merge approach: recursively split each node in 2; if the likelihood is (significantly) improved, keep the split, otherwise merge it back; continue until there is no improvement (see the sketch after this slide)
• (Liang et al, 07): Use Dirichlet Processes to automatically infer the appropriate size of the grammar
– The larger the training set, the more fine-grained the annotation
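A high-level sketch of the split-and-merge loop described above. All helper callables here (run_em, split_all, split_pairs, merge_loss, merge) are hypothetical placeholders supplied by the caller; this illustrates the control flow, not Petrov & Klein's actual implementation.

```python
def split_merge_training(grammar, treebank, run_em, split_all, split_pairs,
                         merge_loss, merge, rounds=6, merge_fraction=0.5):
    """Alternate splitting every subcategory, retraining, and merging back splits."""
    for _ in range(rounds):
        grammar = split_all(grammar)             # split every subcategory in two
        grammar = run_em(grammar, treebank)      # re-estimate parameters with EM
        # rank each new split by how much likelihood is lost if it is merged back
        ranked = sorted(split_pairs(grammar),
                        key=lambda s: merge_loss(grammar, treebank, s))
        # merge back the splits that help the likelihood least
        for s in ranked[:int(merge_fraction * len(ranked))]:
            grammar = merge(grammar, s)
        grammar = run_em(grammar, treebank)      # retrain the merged grammar
    return grammar
```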
11
Estimating the model
• How do we decide how many subcategories to split each node into?
• (Titov and Henderson 07; current work):
– Instead of annotating each node with a single label, annotate it with a binary vector
– Log-linear models for the rule probabilities instead of counts of productions
– The vector can be large: standard Gaussian regularization to avoid overtraining
– Efficient approximate parsing algorithms
(a generic formula sketch follows this slide)
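To make "log-linear models instead of counts of productions" concrete, here is a generic sketch (the parameterization is mine; the exact form used by Titov and Henderson 07 differs in details): each node carries a binary latent vector, rule probabilities are log-linear in features of the annotated rule, and a Gaussian prior on the weights gives the regularization mentioned above.

```latex
P\big(A[\mathbf{z}] \to B[\mathbf{z}']\,C[\mathbf{z}'']\big)
  \;=\;
  \frac{\exp\!\big(\theta^{\top}\phi(A,\mathbf{z},\,B,\mathbf{z}',\,C,\mathbf{z}'')\big)}
       {\sum_{B',C'}\sum_{\mathbf{u},\mathbf{v}}
        \exp\!\big(\theta^{\top}\phi(A,\mathbf{z},\,B',\mathbf{u},\,C',\mathbf{v})\big)},
  \qquad
  \mathbf{z},\mathbf{z}',\mathbf{z}'' \in \{0,1\}^{d},
  \qquad
  \theta \sim \mathcal{N}(0,\sigma^{2} I)
```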
12
How to parse?
• Do you need the most likely 'annotated' parse tree (1), or the most likely tree with non-annotated nodes (2)?
• How to parse:
– (1) is easy: just usual parsing with the extended grammar (if all nodes are split into T subcategories)
– (2) is not tractable (NP-complete, [Matsuzaki et al, 2005]); instead you can do Minimum Bayes Risk decoding, i.e., output the minimum-loss tree [Goodman 96; Titov and Henderson, 06; Petrov and Klein 07]: instead of predicting the best tree you output the tree with the minimal expected error (see the sketch after this slide)
(Not always a great idea because we often do not know good loss measures: e.g., optimizing the Hamming loss for sequence labeling can lead to linguistically implausible structures)
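A sketch of the Minimum Bayes Risk idea referenced above (notation mine; the last line is the max-constituent instantiation in the spirit of Goodman 96 and Petrov & Klein 07): rather than the most probable unannotated tree, output the tree with minimum expected loss, which for constituent recall reduces to maximizing the sum of posterior probabilities of the tree's constituents (computed with inside-outside after summing out the latent annotations).

```latex
\hat{T} \;=\; \arg\min_{T}\; \mathbb{E}_{T' \sim P(\cdot \mid w)}\big[\ell(T, T')\big]
        \;=\; \arg\max_{T}\; \sum_{T'} P(T' \mid w)\,\mathrm{gain}(T, T')
        \;\approx\; \arg\max_{T} \sum_{(A,\,i,\,j)\,\in\,T} P\big(A \text{ spans } (i,j) \mid w\big)
```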
13
Adaptive splitting
• (Petrov and Klein, 06) Split and Merge: number of induced subcategories for each constituent label:
[Bar chart: number of induced subcategories per constituent label; frequent categories such as NP, VP, and PP receive the most subcategories, while rare labels such as X, ROOT, and LST receive almost none.]
14
Adaptive splitting
• (Petrov and Klein, 06) Split and Merge: number of induced subcategories for each POS tag:
[Bar chart: number of induced subcategories per Penn Treebank POS tag; open-class tags such as NNP, JJ, NNS, NN and the verb tags receive the most subcategories, while punctuation and other rare tags receive almost none.]
15
Adaptive splitting
• (Petrov and Klein, 06) Split and Merge: number of induced subcategories for each POS tag:
[Same bar chart as the previous slide, with the noun and adjective tags NN, NNS, NNP, and JJ called out.]
16
Induced POS tags
• Proper nouns (NNP):
NNP-14: Oct. Nov. Sept.
NNP-12: John Robert James
NNP-2: J. E. L.
NNP-1: Bush Noriega Peters
NNP-15: New San Wall
NNP-3: York Francisco Street
• Personal pronouns (PRP):
PRP-0: It He I
PRP-1: it he they
PRP-2: it them him
17
Induced POS tags
• Relative adverbs (RBR):
RBR-0: further lower higher
RBR-1: more less More
RBR-2: earlier Earlier later
• Cardinal numbers (CD):
CD-7: one two Three
CD-4: 1989 1990 1988
CD-11: million billion trillion
CD-0: 1 50 100
CD-3: 1 30 31
CD-9: 78 58 34
18
Results for this model

Parser                     F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03        86.3              85.7
Matsuzaki et al. '05       86.7              86.1
Collins '99                88.6              88.2
Charniak & Johnson '05     90.1              89.6
Petrov & Klein '06         90.2              89.7
19
LVs in Parsing
• In standard models for parsing (and other structured prediction problems) you need to decide how the structure decomposes into parts (e.g., weighted CFGs / PCFGs)
• In latent variable models you relax this assumption: you only assume how the structure annotated with latent variables decomposes
• In other words, you learn to construct composite features from the elementary features (parts) -> reduces feature engineering effort
• Latent variable models have become popular in many applications:
– syntactic dependency parsing [Titov and Henderson, 07]: the best single-model system in the parsing competition (overall 3rd result out of 22 systems) (CoNLL-2007)
– joint semantic role labeling and parsing [Henderson et al, 09]: again the best single model (1st result in parsing, 3rd result in SRL) (CoNLL-2009)
– hidden (dynamic) CRFs [Quattoni, 09]
– ...
20
Hidden CRFs
• CRF (Lafferty et al, 2001): no long-distance statistical dependencies between the labels y
• Latent Dynamic CRF: long-distance dependencies can be encoded using latent vectors
(generic formulas sketched below)
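A hedged sketch of the two models contrasted above (generic forms, not the exact equations from the slide): a linear-chain CRF scores the label sequence directly through local features, while a latent-dynamic / hidden CRF introduces a sequence of latent variables h and marginalizes over it, so longer-range information can be propagated along the chain through h.

```latex
P(\mathbf{y} \mid \mathbf{x}) \;=\;
  \frac{1}{Z(\mathbf{x})}
  \exp\Big(\sum_{t} \theta^{\top} f(y_{t-1}, y_t, \mathbf{x}, t)\Big)
\qquad\text{vs.}\qquad
P(\mathbf{y} \mid \mathbf{x}) \;=\;
  \sum_{\mathbf{h}} P(\mathbf{y} \mid \mathbf{h}, \mathbf{x})\, P(\mathbf{h} \mid \mathbf{x})
```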
21
Latent Variables
• Drawbacks:
– Learning LV models usually involves slower iterative algorithms (EM, variational methods, sampling, ...)
– The optimization problem is often non-convex: many local minima
– Inference (decoding) can be more expensive
• Advantages:
– Reduces feature engineering effort
– Especially preferable if little domain knowledge is available and complex features are needed
– The induced representation can be used for other tasks (e.g., the fine-grained grammar induced by LA-PCFGs can be useful, e.g., for SRL)
– Latent variables (= hidden representations) can be useful in multi-task learning: a hidden representation is induced simultaneously for several tasks [Collobert and Weston, 2008; Titov et al, 2009]
22
Conclusions
• We considered latent variable models in different contexts:
– Topic modeling
– Structured prediction models
• We demonstrated where and why they are useful
• Reviewed basic inference/learning techniques:
– EM-type algorithms
– Variational approximations
– Sampling
• Only a very basic review
• Next time: a guest lecture by Ming-Wei Chang on Domain Adaptation (a really hot and important topic in NLP!)