
1

CS546: Machine Learning and Natural Language

Latent-Variable Models for Structured Prediction Problems: Syntactic Parsing

Slides / Figures from Slav Petrov’s talk at COLING-ACL 06 are used in this lecture

2

Parsing Problem

• Annotation refines base treebank symbols to improve statistical fit of the grammar
• Parent annotation [Johnson 98]

3

Parsing Problem

• Annotation refines base treebank symbols to improve statistical fit of the grammar
• Parent annotation [Johnson 98]
• Head lexicalization [Collins 99, ...]

4

Parsing Problem

• Annotation refines base treebank symbols to improve statistical fit of the grammar
• Parent annotation [Johnson 98]
• Head lexicalization [Collins 99, ...]
• Automatic annotation [Matsuzaki et al, 05; ...]
• Manual annotation [Klein and Manning 03]

5

Manual Annotation

• Manually split categories
  – NP: subject vs object
  – DT: determiners vs demonstratives
  – IN: sentential vs prepositional

• Advantages:
  – Fairly compact grammar
  – Linguistic motivations

• Disadvantages:
  – Performance leveled out
  – Manually annotated

Model                  F1
Naïve Treebank PCFG    72.6
Klein & Manning ’03    86.3

6

Automatic Annotation

• Use Latent Variable Models
  – Split (“annotate”) each node, e.g., NP -> (NP[1], NP[2], ..., NP[T])
  – Each node in the tree is annotated with a latent sub-category
  – Latent-Annotated Probabilistic CFG: to obtain the probability of a tree you need to sum over all the latent variables (see the sketch below)
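Below is a minimal sketch (with a made-up toy grammar and tree, not material from the lecture) of the computation just described: each symbol is split into T latent sub-categories, and the probability of a fixed parse tree is obtained by summing over all latent assignments with a dynamic program over the tree rather than by brute-force enumeration.

```python
from collections import defaultdict

T = 2  # number of latent sub-categories per symbol (toy choice)

# Toy annotated rule probabilities P(A[i] -> B[j] C[k]) and P(A[i] -> word).
# In a real LA-PCFG these come from training (e.g. EM); here they are arbitrary.
binary_rules = defaultdict(float)
lexical_rules = defaultdict(float)
for i in range(T):
    for j in range(T):
        for k in range(T):
            binary_rules[("S", i), ("NP", j), ("VP", k)] = 1.0 / (T * T)
    lexical_rules[("NP", i), "he"] = 1.0
    lexical_rules[("VP", i), "runs"] = 1.0

# A parse tree as nested tuples: (label, left_child, right_child) or (label, word).
tree = ("S", ("NP", "he"), ("VP", "runs"))

def inside(node):
    """Return scores[i] = P(yield of node | node carries sub-category i)."""
    label = node[0]
    if isinstance(node[1], str):                      # pre-terminal over a word
        return [lexical_rules[(label, i), node[1]] for i in range(T)]
    left, right = inside(node[1]), inside(node[2])    # binary internal node
    return [
        sum(binary_rules[(label, i), (node[1][0], j), (node[2][0], k)]
            * left[j] * right[k]
            for j in range(T) for k in range(T))
        for i in range(T)
    ]

# P(tree) sums over the root's latent sub-categories
# (a uniform distribution over root sub-categories is assumed here).
print("P(tree) =", sum(score / T for score in inside(tree)))
```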

7

How to perform this clustering?
• Estimating model parameters (and model structure):
  – Decide how to split each non-terminal (what is T in NP -> (NP[1], NP[2], ..., NP[T]))
  – Estimate probabilities for all annotated rules
• Parsing:
  – Do you need the most likely ‘annotated’ parse tree (1) or the most likely tree with non-annotated nodes (2)?
  – Usually (2), but the inferred latent variables can be useful for other tasks


8

Estimating the model

• Estimating parameters:
  – If we decide on the structure of the model (how we split), we can use EM (Matsuzaki et al, 05; Petrov and Klein, 06; ...):
    • E-Step: estimate posteriors over the latent annotations to obtain fractional counts of rules
    • M-Step: re-estimate rule probabilities from these fractional counts (see the sketch below)
  – One can also use variational methods (mean-field) [Titov and Henderson, 07; Liang et al, 07]
• Recall: we considered variational methods in the context of LDA
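As a concrete illustration of the M-step just mentioned, here is a minimal sketch (with made-up fractional counts): given expected counts of annotated rules gathered in the E-step, the new probability of each rule is its fractional count normalised over all rules with the same annotated left-hand side.

```python
from collections import defaultdict

# Fractional (expected) rule counts from the E-step: toy, made-up numbers.
# Keys are (annotated LHS, annotated RHS) pairs.
fractional_counts = {
    (("NP", 0), (("DT", 0), ("NN", 1))): 12.3,
    (("NP", 0), (("PRP", 0),)):           4.1,
    (("NP", 1), (("DT", 1), ("NN", 0))):  7.8,
}

def m_step(counts):
    """P(A[i] -> beta) = count(A[i] -> beta) / sum over beta' of count(A[i] -> beta')."""
    totals = defaultdict(float)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

for rule, prob in m_step(fractional_counts).items():
    print(rule, round(prob, 3))
```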

9

Estimating the model
• How to decide how many ways to split each node?
  – Early models split all the nodes equally [Kurihara and Sato, 04; Matsuzaki et al, 05; Prescher, 05; ...], with T selected by hand
  – The resulting models are sparse (parameter estimates are not reliable) and parsing time is large

10

Estimating the model
• How to decide how many ways to split each node?
  – Later, different approaches were considered:
    • (Petrov and Klein, 06): Split-and-merge approach – recursively split each node in 2; if the likelihood improves (significantly), keep the split, otherwise merge back; continue until no improvement (see the sketch below)
    • (Liang et al, 07): Use Dirichlet Processes to automatically infer the appropriate size of the grammar
  – The larger the training set, the finer-grained the annotation
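The control loop of the split-and-merge idea above can be sketched as follows. This is only a schematic version: the grammar is reduced to a map from symbols to their number of sub-categories, and the splitting and likelihood functions are toy stand-ins (in the actual method each step involves retraining with EM and measuring the likelihood loss from merging).

```python
def split_merge(grammar, symbols, split, likelihood, min_gain=1e-3, max_rounds=5):
    """Repeatedly try to split each symbol in two; keep a split only if the
    (approximate) likelihood gain is significant, otherwise merge it back."""
    for _ in range(max_rounds):
        improved = False
        for sym in symbols:
            candidate = split(grammar, sym)
            if likelihood(candidate) - likelihood(grammar) > min_gain:
                grammar, improved = candidate, True   # keep the split
            # else: the candidate is discarded, i.e. merged back
        if not improved:                              # stop when nothing helps
            break
    return grammar

# Toy stand-ins: a "grammar" is just {symbol: number of sub-categories}.
def toy_split(grammar, sym):
    return {**grammar, sym: grammar[sym] * 2}

def toy_likelihood(grammar):
    # pretend that only NP and VP benefit from finer splits, with a small
    # penalty for overall grammar size
    return min(grammar["NP"], 4) + min(grammar["VP"], 2) - 0.01 * sum(grammar.values())

print(split_merge({"NP": 1, "VP": 1, "X": 1}, ["NP", "VP", "X"],
                  toy_split, toy_likelihood))
```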

11

Estimating the model
• How to decide how many ways to split each node?
• (Titov and Henderson, 07; current work):
  – Instead of annotating each node with a single label, annotate it with a binary vector
  – Log-linear models for production probabilities instead of counts of productions
  – The latent vector can be large: standard Gaussian regularization is used to avoid overtraining
  – Efficient approximate parsing algorithms

12

How to parse?
• Do you need the most likely ‘annotated’ parse tree (1) or the most likely tree with non-annotated nodes (2)?
• How to parse:
  – (1) is easy: just the usual parsing with the extended grammar (if all nodes are split in T)
  – (2) is not tractable (NP-complete [Matsuzaki et al, 2005]); instead you can do Minimum Bayes Risk decoding, i.e., output the minimum-loss tree [Goodman 96; Titov and Henderson, 06; Petrov and Klein 07] => instead of predicting the best tree you output the tree with the minimal expected error (see the sketch below)
  – (Not always a great idea, because we often do not know good loss measures: e.g., optimizing the Hamming loss for sequence labeling can lead to linguistically implausible structures)
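A minimal sketch of the Minimum Bayes Risk idea, in its simple n-best / reranking approximation (the cited papers use exact dynamic-programming variants, e.g. maximising the expected number of correct constituents). The candidate trees, their posterior probabilities and the bracket sets below are made up for illustration.

```python
def expected_loss(candidate, candidates, posteriors, loss):
    """Expected loss of a candidate under the (approximate) posterior."""
    return sum(p * loss(candidate, other) for other, p in zip(candidates, posteriors))

def mbr_decode(candidates, posteriors, loss):
    """Return the candidate tree with minimal expected loss (not the MAP tree)."""
    return min(candidates,
               key=lambda c: expected_loss(c, candidates, posteriors, loss))

# Toy candidate trees represented by their sets of labelled brackets (label, start, end).
candidates = [
    frozenset({("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)}),
    frozenset({("NP", 0, 3), ("VP", 3, 5), ("S", 0, 5)}),
    frozenset({("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)}),
]
posteriors = [0.40, 0.35, 0.25]

def bracket_loss(a, b):
    """Number of brackets on which two trees disagree (symmetric difference)."""
    return len(a ^ b)

# Note: here the MBR tree differs from the most probable single candidate.
print(mbr_decode(candidates, posteriors, bracket_loss))
```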


13

Adaptive splitting
• (Petrov and Klein, 06): Split and Merge: number of induced constituent labels:

[Figure: bar chart of the number of induced subcategories per constituent label (y-axis 0–40); frequent categories such as NP, VP and PP are split the most, while rare ones such as X, ROOT and LST are barely split.]

14

Adaptive splitting
• (Petrov and Klein, 06): Split and Merge: number of induced POS tags:

[Figure: bar chart of the number of induced subcategories per POS tag (y-axis 0–70); open-class tags such as NNP, JJ, NNS and NN receive the most subcategories, while punctuation and rare tags are barely split.]

15

Adaptive splitting
• (Petrov and Klein, 06): Split and Merge: number of induced POS tags:

[Figure: the same POS-tag chart, highlighting the most heavily split tags: NN, NNS, NNP and JJ.]

16

Induced POS tags
• Proper nouns (NNP):

  NNP-14   Oct.    Nov.        Sept.
  NNP-12   John    Robert      James
  NNP-2    J.      E.          L.
  NNP-1    Bush    Noriega     Peters
  NNP-15   New     San         Wall
  NNP-3    York    Francisco   Street

• Personal pronouns (PRP):

  PRP-0    It      He          I
  PRP-1    it      he          they
  PRP-2    it      them        him

17

Induced POS tags
• Relative adverbs (RBR):

  RBR-0   further   lower     higher
  RBR-1   more      less      More
  RBR-2   earlier   Earlier   later

• Cardinal numbers (CD):

  CD-7    one       two       Three
  CD-4    1989      1990      1988
  CD-11   million   billion   trillion
  CD-0    1         50        100
  CD-3    1         30        31
  CD-9    78        58        34

18

Results for this model

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning ’03       86.3              85.7
Matsuzaki et al. ’05      86.7              86.1
Collins ’99               88.6              88.2
Charniak & Johnson ’05    90.1              89.6
Petrov & Klein ’06        90.2              89.7

19

LVs in Parsing
• In standard models for parsing (and other structured prediction problems) you need to decide how the structure decomposes into parts (e.g., weighted CFGs / PCFGs)
• In latent variable models you relax this assumption: you only assume how the structure annotated with latent variables decomposes
• In other words, you learn to construct composite features from the elementary features (parts) -> reduces feature engineering effort
• Latent variable models have become popular in many applications:
  – syntactic dependency parsing [Titov and Henderson, 07] – best single-model system in the parsing competition (overall 3rd result out of 22 systems) (CoNLL-2007)
  – joint semantic role labeling and parsing [Henderson et al, 09] – again the best single model (1st result in parsing, 3rd result in SRL) (CoNLL-2009)
  – hidden (dynamics) CRFs [Quattoni, 09]
  – ...

20

Hidden CRFs
• CRF (Lafferty et al, 2001): no long-distance statistical dependencies between the labels y
• Latent Dynamic CRF: long-distance dependencies can be encoded using latent vectors (see the sketch below)
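To make the contrast concrete, here is a small sketch in the spirit of hidden / latent-dynamic CRFs (not a faithful re-implementation of the cited models): each label owns a few hidden sub-states, the model scores hidden-state sequences, and P(y | x) marginalises over all hidden sequences compatible with y using the forward algorithm. The label set, feature scores and sentence are toy assumptions.

```python
import math

LABELS = ["B", "I", "O"]   # toy label set (an assumption for illustration)
K = 2                      # hidden sub-states per label

def states(label):
    """Hidden states owned by a label, e.g. 'B' -> ['B.0', 'B.1']."""
    return [f"{label}.{k}" for k in range(K)]

ALL_STATES = [s for y in LABELS for s in states(y)]

def emit(state, token):    # toy emission score (hand-set, not learned)
    return 1.0 if state.startswith("B") and token.istitle() else 0.1

def trans(prev, cur):      # toy transition score favouring staying in one label
    return 0.5 if prev.split(".")[0] == cur.split(".")[0] else 0.1

def log_score_sum(tokens, allowed):
    """Forward pass over hidden states; allowed[t] lists the states permitted
    at position t (all states for the partition function Z, only the states
    of the gold label when scoring a particular label sequence)."""
    alpha = {s: math.exp(emit(s, tokens[0])) for s in allowed[0]}
    for t in range(1, len(tokens)):
        alpha = {s: sum(a * math.exp(trans(p, s) + emit(s, tokens[t]))
                        for p, a in alpha.items())
                 for s in allowed[t]}
    return math.log(sum(alpha.values()))

tokens, labels = ["John", "runs", "fast"], ["B", "O", "O"]
log_numerator = log_score_sum(tokens, [states(y) for y in labels])
log_Z = log_score_sum(tokens, [ALL_STATES] * len(tokens))
print("P(y | x) =", math.exp(log_numerator - log_Z))
```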

21

Latent Variables
• Drawbacks:
  – Learning LV models usually involves slower iterative algorithms (EM, variational methods, sampling, ...)
  – The optimization problem is often non-convex – many local minima
  – Inference (decoding) can be more expensive
• Advantages:
  – Reduces feature engineering effort
  – Especially preferable if little domain knowledge is available and complex features are needed
  – The induced representation can be used for other tasks (e.g., the fine-grained grammar induced by LA-PCFGs can be useful, e.g., for SRL)
  – Latent variables (= hidden representations) can be useful in multi-task learning: the hidden representation is induced simultaneously for several tasks [Collobert and Weston, 2008; Titov et al, 2009]


22

Conclusions
• We considered latent variable models in different contexts:
  – Topic modeling
  – Structured prediction models
• We demonstrated where and why they are useful
• Reviewed basic inference/learning techniques:
  – EM-type algorithms
  – Variational approximations
  – Sampling
• Only a very basic review
• Next time: a guest lecture by Ming-Wei Chang on domain adaptation (a really hot and important topic in NLP!)
