
Page 1:

CS546: Machine Learning and Natural Language

Latent-Variable Models for Structured Prediction Problems: Syntactic Parsing

Slides / Figures from Slav Petrov’s talk at COLING-ACL 06 are used in this lecture

Page 2:

Parsing Problem

• Annotation refines base treebank symbols to improve statistical fit of the grammar
• Parent annotation [Johnson 98]

Page 3:

Parsing Problem

• Annotation refines base treebank symbols to improve statistical fit of the grammar
• Parent annotation [Johnson 98]
• Head lexicalization [Collins 99, ...]

Page 4:

Parsing Problem

• Annotation refines base treebank symbols to improve statistical fit of the grammar
• Parent annotation [Johnson 98] (a small sketch follows below)
• Head lexicalization [Collins 99, ...]
• Automatic Annotation [Matsuzaki et al, 05; ...]
• Manual Annotation [Klein and Manning 03]
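To make parent annotation concrete, here is a minimal sketch (my illustration, not code from the lecture) that relabels every nonterminal with its parent's label, assuming trees are nested (label, children...) tuples with plain word strings at the leaves:

    # Minimal sketch of parent annotation [Johnson 98]; the tree encoding
    # (nested tuples, words as strings) is an assumption for illustration.
    def parent_annotate(tree, parent=None):
        if isinstance(tree, str):          # leaf: a word, left unchanged
            return tree
        label, children = tree[0], tree[1:]
        new_label = f"{label}^{parent}" if parent else label
        # children are annotated with the *original* label of this node
        return (new_label,) + tuple(parent_annotate(c, label) for c in children)

    tree = ("S", ("NP", ("PRP", "She")), ("VP", ("VBD", "slept")))
    print(parent_annotate(tree))
    # ('S', ('NP^S', ('PRP^NP', 'She')), ('VP^S', ('VBD^VP', 'slept')))

Splitting NP into NP^S (typically subjects) and NP^VP (typically objects) in this way lets the grammar give the two refined symbols different rule distributions.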

Page 5:

Manual Annotation

• Manually split categories:
  – NP: subject vs. object
  – DT: determiners vs. demonstratives
  – IN: sentential vs. prepositional

• Advantages:
  – Fairly compact grammar
  – Linguistic motivations

• Disadvantages:
  – Performance leveled out
  – Manually annotated

Model                 F1
Naïve Treebank PCFG   72.6
Klein & Manning '03   86.3

Page 6:

Automatic Annotation

• Use latent variable models
  – Split ("annotate") each node, e.g., NP -> (NP[1], NP[2], ..., NP[T])
  – Each node in the tree is annotated with a latent sub-category
  – Latent-Annotated Probabilistic CFG (LA-PCFG): to obtain the probability of a tree you need to sum over all the latent variables, as illustrated below
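As a concrete illustration of this sum, here is a toy sketch of computing the probability of a fixed unannotated tree under a split PCFG; the grammar, the T = 2 split, and all probabilities are made up for illustration:

    T = 2  # toy number of latent subcategories per symbol

    # Made-up annotated rule probabilities: P(X[i] -> Y[j] Z[k]) and P(X[i] -> word)
    binary = {("S", i, "NP", j, "VP", k): 1.0 / (T * T)
              for i in range(T) for j in range(T) for k in range(T)}
    lexical = {("NP", 0, "She"): 0.6, ("NP", 1, "She"): 0.4,
               ("VP", 0, "slept"): 0.7, ("VP", 1, "slept"): 0.3}

    def inside(tree):
        """Return [P(words of this subtree | its root is annotated as label[i])]."""
        label, kids = tree[0], tree[1:]
        if isinstance(kids[0], str):                   # preterminal -> word
            return [lexical.get((label, i, kids[0]), 0.0) for i in range(T)]
        left, right = kids
        lin, rin = inside(left), inside(right)
        return [sum(binary.get((label, i, left[0], j, right[0], k), 0.0) * lin[j] * rin[k]
                    for j in range(T) for k in range(T))
                for i in range(T)]

    tree = ("S", ("NP", "She"), ("VP", "slept"))
    # Probability of the unannotated tree: sum over the root's subcategories
    # (assuming, for this sketch, a uniform distribution over them).
    print(sum(inside(tree)) / T)

This is just the inside recursion restricted to one fixed tree: summing over the root subcategories marginalizes out all the latent annotations.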

Page 7:

How to perform this clustering?

• Estimating model parameters (and model structure):
  – Decide how to split each treebank symbol (i.e., what T is in NP -> (NP[1], NP[2], ..., NP[T]))
  – Estimate probabilities for all the annotated rules

• Parsing:
  – Do you need the most likely 'annotated' parse tree (1) or the most likely tree with non-annotated nodes (2)?
  – Usually (2), but the inferred latent variables can be useful for other tasks


Page 8:

Estimating the model

• Estimating parameters:
  – If we decide on the structure of the model (how we split), we can use EM (Matsuzaki et al, 05; Petrov and Klein, 06; ...):
    • E-step: estimate the posterior over latent annotations and obtain fractional counts of the annotated rules
    • M-step: re-estimate the rule probabilities from these fractional counts (a rough sketch of one iteration follows below)
  – We can also use variational methods (mean-field): [Titov and Henderson, 07; Liang et al, 07]

• Recall: we considered variational methods in the context of LDA
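Here is a rough sketch of one EM iteration for such a split PCFG on a toy corpus of a single binary tree. All symbols, words, and starting values are made up, and the E-step enumerates latent assignments explicitly, which is only feasible for tiny trees (real systems use the inside-outside algorithm to collect the fractional counts):

    import random
    from collections import defaultdict
    from itertools import product

    T = 2
    tree = ("S", ("NP", "She"), ("VP", "slept"))

    def nodes(t):
        """Yield (subtree, is_preterminal) for every internal node of a binary tree."""
        kids = t[1:]
        if isinstance(kids[0], str):
            yield t, True
        else:
            yield t, False
            for k in kids:
                yield from nodes(k)

    def joint_prob(assign, theta_bin, theta_lex):
        """P(tree, latent assignment); assign maps id(node) -> subcategory index."""
        p = 1.0
        for node, is_pre in nodes(tree):
            i = assign[id(node)]
            if is_pre:
                p *= theta_lex[(node[0], i, node[1])]
            else:
                l, r = node[1], node[2]
                p *= theta_bin[(node[0], i, l[0], assign[id(l)], r[0], assign[id(r)])]
        return p

    def em_step(theta_bin, theta_lex):
        node_list = [n for n, _ in nodes(tree)]
        # E-step: posterior over latent assignments -> fractional counts of annotated rules
        joints = []
        for combo in product(range(T), repeat=len(node_list)):
            assign = {id(n): c for n, c in zip(node_list, combo)}
            joints.append((assign, joint_prob(assign, theta_bin, theta_lex)))
        total = sum(p for _, p in joints)
        cb, cl = defaultdict(float), defaultdict(float)
        for assign, p in joints:
            w = p / total
            for node, is_pre in nodes(tree):
                i = assign[id(node)]
                if is_pre:
                    cl[(node[0], i, node[1])] += w
                else:
                    l, r = node[1], node[2]
                    cb[(node[0], i, l[0], assign[id(l)], r[0], assign[id(r)])] += w
        # M-step: relative-frequency re-estimation per annotated parent symbol
        # (binary and lexical rules are normalized separately here, which is fine
        # only because no symbol has both kinds of rules in this toy grammar)
        def normalize(counts):
            z = defaultdict(float)
            for key, c in counts.items():
                z[key[:2]] += c              # key[:2] = (symbol, subcategory)
            return {k: c / z[k[:2]] for k, c in counts.items()}
        return normalize(cb), normalize(cl)

    # Slightly perturbed uniform initialization (the symmetry between
    # subcategories must be broken, otherwise EM never differentiates them)
    random.seed(0)
    theta_bin = {("S", i, "NP", j, "VP", k): 0.25 + 0.01 * random.random()
                 for i in range(T) for j in range(T) for k in range(T)}
    theta_lex = {(sym, i, w): 1.0 for sym, w in [("NP", "She"), ("VP", "slept")]
                 for i in range(T)}
    theta_bin, theta_lex = em_step(theta_bin, theta_lex)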

Page 9:

Estimating the model

• How do we decide how much to split each node?
  – Early models split all the nodes equally [Kurihara and Sato, 04; Matsuzaki et al, 05; Prescher 05, ...], with T selected by hand
  – The resulting models are sparse (parameter estimates are not reliable) and parsing time is large

Page 10:

Estimating the model

• How do we decide how much to split each node?
  – Later, different approaches were considered:
    • (Petrov and Klein 06): Split-and-merge approach – recursively split each node in two; if the likelihood is (significantly) improved, keep the split, otherwise merge back; continue until no improvement (a toy sketch of this loop follows after this list)
    • (Liang et al 07): Use Dirichlet Processes to automatically infer the appropriate size of the grammar
  – The larger the training set, the more fine-grained the annotation
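Below is a toy sketch of the split-and-merge control flow only; the per-symbol likelihood gains and the threshold are invented numbers standing in for the change in data likelihood that the real method measures after retraining with EM:

    # Toy sketch of the split-and-merge loop (control flow only; numbers made up).
    likelihood_gain = {"NP": 2.5, "VP": 1.8, "PP": 0.9, "DT": 0.4, "X": 0.01}
    GAIN_THRESHOLD = 0.5          # stand-in for "likelihood significantly improved"

    splits = {sym: 1 for sym in likelihood_gain}   # start with one subcategory each
    improved = True
    while improved:
        improved = False
        for sym in splits:
            # tentatively split this symbol's subcategories in two
            if likelihood_gain[sym] > GAIN_THRESHOLD:
                splits[sym] *= 2                   # keep the split
                likelihood_gain[sym] /= 2          # toy assumption: diminishing returns
                improved = True
            # otherwise: merge back, i.e., leave splits[sym] unchanged

    print(splits)   # {'NP': 8, 'VP': 4, 'PP': 2, 'DT': 1, 'X': 1}

The qualitative outcome is the point: frequent, ambiguous symbols end up with many subcategories while rare ones stay unsplit, which is what the charts on the next slides show.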

Page 11:

Estimating the model

• How do we decide how much to split each node?

• (Titov and Henderson 07; current work):
  – Instead of annotating each node with a single label, annotate it with a binary latent vector
  – Log-linear models for the production probabilities instead of counts of productions (a schematic sketch follows below)
  – The latent vectors can be large: standard Gaussian regularization is used to avoid overtraining
  – Efficient approximate parsing algorithms
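Here is a schematic sketch of the idea of scoring productions with a log-linear model conditioned on a binary latent vector, instead of storing a separate count for every annotated rule. This is not Titov and Henderson's actual parameterization: the rules, features, weights, and the place where the L2 (Gaussian) penalty enters are all illustrative assumptions:

    import math
    import random

    random.seed(0)
    D = 8                                    # dimensionality of the binary latent vector
    RULES = ["NP -> DT NN", "NP -> NP PP", "NP -> PRP"]   # toy candidate productions

    # One weight vector per candidate rule (made-up random initialization)
    weights = {r: [random.gauss(0.0, 0.1) for _ in range(D)] for r in RULES}

    def production_probs(h):
        """P(rule | parent's latent vector h) via a softmax over linear scores."""
        scores = {r: sum(w_i * h_i for w_i, h_i in zip(weights[r], h)) for r in RULES}
        z = sum(math.exp(s) for s in scores.values())
        return {r: math.exp(s) / z for r, s in scores.items()}

    def l2_penalty(lmbda=0.1):
        """Gaussian (L2) regularizer on the weights, added to the training objective."""
        return lmbda * sum(w_i ** 2 for w in weights.values() for w_i in w)

    h = [1, 0, 1, 1, 0, 0, 1, 0]             # a binary latent annotation of an NP node
    print(production_probs(h), l2_penalty())

Because the latent vector has 2^D possible values, the annotation space is effectively very large, but the number of parameters grows only with D, and the Gaussian prior on the weights keeps them from overfitting.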

Page 12:

How to parse?

• Do you need the most likely 'annotated' parse tree (1) or the most likely tree with non-annotated nodes (2)?
• How to parse:
  – (1) is easy: just the usual parsing with the extended grammar (if all nodes are split into T subcategories)
  – (2) is not tractable (NP-complete [Matsuzaki et al, 2005]); instead you can do Minimum Bayes Risk decoding, i.e., output the minimum-loss tree [Goodman 96; Titov and Henderson, 06; Petrov and Klein 07] => instead of predicting the best tree you output the tree with the minimal expected error (a toy sketch follows below)
  – (This is not always a great idea because we often do not know good loss measures: e.g., optimizing the Hamming loss for sequence labeling can lead to linguistically non-plausible structures)
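A toy sketch of Minimum Bayes Risk decoding over an explicit candidate list; real parsers work with packed forests, and the candidates, posteriors, and bracket loss below are made up:

    # Each candidate tree is represented by its set of labeled brackets
    # (label, start, end); the numbers are made-up posteriors p(tree | sentence).
    candidates = {
        "t1": ({("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)}, 0.33),
        "t2": ({("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}, 0.33),
        "t3": ({("S", 0, 5), ("NP", 0, 3), ("VP", 3, 5)},               0.34),
    }

    def bracket_loss(pred, gold):
        """Symmetric difference of labeled brackets (a stand-in for 1 - F1)."""
        return len(pred ^ gold)

    def mbr_decode(cands):
        """Return the candidate minimizing expected loss under the posterior."""
        def expected_loss(name):
            pred = cands[name][0]
            return sum(p * bracket_loss(pred, gold) for gold, p in cands.values())
        return min(cands, key=expected_loss)

    print(mbr_decode(candidates))   # 't1', even though 't3' has the highest posterior

The two similar candidates share most of their brackets, so the minimum-expected-loss tree differs from the single most probable one, which is exactly the behavior MBR decoding is after.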


Page 13:

Adaptive splitting

• (Petrov and Klein, 06): Split and Merge: number of induced constituent labels:

[Figure: bar chart of the number of induced subcategories for each constituent label (NP, VP, PP, ADVP, S, ADJP, SBAR, QP, WHNP, PRN, NX, SINV, PRT, WHPP, SQ, CONJP, FRAG, NAC, UCP, WHADVP, INTJ, SBARQ, RRC, WHADJP, X, ROOT, LST); y-axis 0-40]

Page 14:

Adaptive splitting

• (Petrov and Klein, 06): Split and Merge: number of induced POS tags:

[Figure: bar chart of the number of induced subcategories for each POS tag (the most heavily split tags include NNP, JJ, NNS, NN, VBN, RB, VBG, VB, VBD, CD, IN, VBZ, VBP, DT, NNPS, CC, JJR, JJS, followed by the remaining Penn Treebank tags); y-axis 0-70]

Page 15:

Adaptive splitting

• (Petrov and Klein, 06): Split and Merge: number of induced POS tags:

[Figure: the same bar chart of induced subcategories per POS tag (y-axis 0-70), repeated, with NN, NNS, NNP, and JJ called out]

Page 16:

Induced POS tags

• Proper nouns (NNP):

NNP-14   Oct.    Nov.        Sept.
NNP-12   John    Robert      James
NNP-2    J.      E.          L.
NNP-1    Bush    Noriega     Peters
NNP-15   New     San         Wall
NNP-3    York    Francisco   Street

• Personal pronouns (PRP):

PRP-0    It      He          I
PRP-1    it      he          they
PRP-2    it      them        him

Page 17:

Induced POS tags

• Relative adverbs (RBR):

RBR-0    further   lower     higher
RBR-1    more      less      More
RBR-2    earlier   Earlier   later

• Cardinal numbers (CD):

CD-7     one       two       Three
CD-4     1989      1990      1988
CD-11    million   billion   trillion
CD-0     1         50        100
CD-3     1         30        31
CD-9     78        58        34

Page 18:

Results for this model

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al. '05     86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov & Klein '06       90.2              89.7

Page 19:

LVs in Parsing

• In standard models for parsing (and other structured prediction problems) you need to decide how the structure decomposes into parts (e.g., weighted CFGs / PCFGs)
• In latent variable models you relax this assumption: you only assume how the structure annotated with latent variables decomposes
• In other words, you learn to construct composite features from the elementary features (parts) -> this reduces the feature engineering effort
• Latent variable models have become popular in many applications:
  – syntactic dependency parsing [Titov and Henderson, 07] – best single-model system in the parsing competition (overall 3rd result out of 22 systems) (CoNLL-2007)
  – joint semantic role labeling and parsing [Henderson et al, 09] – again the best single model (1st result in parsing, 3rd result in SRL) (CoNLL-2009)
  – hidden (dynamics) CRFs [Quattoni, 09]
  – ...

Page 20:

Hidden CRFs

• CRF (Lafferty et al, 2001): no long-distance statistical dependencies between the y variables
• Latent-Dynamic CRF: long-distance dependencies can be encoded using latent vectors
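For reference, the two models can be written as follows; this is a standard-notation reconstruction, since the slide's formulas did not survive extraction, and the latent-dynamic factorization shown is the usual one rather than necessarily the lecture's exact form:

    % Linear-chain CRF [Lafferty et al, 2001]: only local dependencies between adjacent labels
    P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t} w^\top f(y_{t-1}, y_t, x, t) \Big)

    % Latent-dynamic CRF: a chain of latent vectors h mediates between x and y,
    % so longer-range statistical dependencies among the y's can be captured
    P(y \mid x) = \sum_{h} P(y \mid h, x) \, P(h \mid x)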

Page 21:

Latent Variables

• Drawbacks:
  – Learning LV models usually involves slower iterative algorithms (EM, variational methods, sampling, ...)
  – The optimization problem is often non-convex – many local minima
  – Inference (decoding) can be more expensive

• Advantages:
  – Reduces the feature engineering effort
  – Especially preferable if little domain knowledge is available and complex features are needed
  – The induced representation can be used for other tasks (e.g., the fine-grained grammar induced by an LA-PCFG can be useful, e.g., for SRL)
  – Latent variables (= hidden representations) can be useful in multi-task learning: the hidden representation is induced simultaneously for several tasks [Collobert and Weston, 2008; Titov et al, 2009]


Page 22:

Conclusions

• We considered latent variable models in different contexts:
  – Topic modeling
  – Structured prediction models

• We demonstrated where and why they are useful
• We reviewed basic inference/learning techniques:
  – EM-type algorithms
  – Variational approximations
  – Sampling

• This was only a very basic review

• Next time: a guest lecture by Ming-Wei Chang on Domain Adaptation (a really hot and important topic in NLP!)