
1

CS546: Machine Learning and Natural Language

Latent-Variable Models for Structured Prediction Problems: Syntactic Parsing

Slides / Figures from Slav Petrov’s talk at COLING-ACL 06 are used in this lecture

2

Parsing Problem

• Annotation refines base treebank symbols to improve statistical fit of the grammar
• Parent annotation [Johnson 98]

3

Parsing Problem

• Annotation refines base treebank symbols to improve statistical fit of the grammar
• Parent annotation [Johnson 98]
• Head lexicalization [Collins 99, ...]

4

Parsing Problem

• Annotation refines base treebank symbols to improve statistical fit of the grammar
• Parent annotation [Johnson 98]
• Head lexicalization [Collins 99, ...]
• Automatic annotation [Matsuzaki et al, 05; ...]
• Manual annotation [Klein and Manning 03]

5

Manual Annotation

• Manually split categories
  – NP: subject vs object
  – DT: determiners vs demonstratives
  – IN: sentential vs prepositional

• Advantages:
  – Fairly compact grammar
  – Linguistic motivations

• Disadvantages:
  – Performance leveled out
  – Manually annotated

Model                  F1
Naïve Treebank PCFG    72.6
Klein & Manning ’03    86.3

6

Automatic Annotation

• Use Latent Variable Models
  – Split (“annotate”) each node, e.g., NP -> (NP[1], NP[2], ..., NP[T])
  – Each node in the tree is annotated with a latent sub-category
  – Latent-Annotated Probabilistic CFG: to obtain the probability of a tree you need to sum over all the latent variables (see the sketch below)
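Below is a minimal sketch (with a made-up toy grammar and tree, not material from the lecture) of the computation just described: each symbol is split into T latent sub-categories, and the probability of a fixed parse tree is obtained by summing over all latent assignments with a dynamic program over the tree rather than by brute-force enumeration.

```python
from collections import defaultdict

T = 2  # number of latent sub-categories per symbol (toy choice)

# Toy annotated rule probabilities P(A[i] -> B[j] C[k]) and P(A[i] -> word).
# In a real LA-PCFG these come from training (e.g. EM); here they are arbitrary.
binary_rules = defaultdict(float)
lexical_rules = defaultdict(float)
for i in range(T):
    for j in range(T):
        for k in range(T):
            binary_rules[("S", i), ("NP", j), ("VP", k)] = 1.0 / (T * T)
    lexical_rules[("NP", i), "he"] = 1.0
    lexical_rules[("VP", i), "runs"] = 1.0

# A parse tree as nested tuples: (label, left_child, right_child) or (label, word).
tree = ("S", ("NP", "he"), ("VP", "runs"))

def inside(node):
    """Return scores[i] = P(yield of node | node carries sub-category i)."""
    label = node[0]
    if isinstance(node[1], str):                      # pre-terminal over a word
        return [lexical_rules[(label, i), node[1]] for i in range(T)]
    left, right = inside(node[1]), inside(node[2])    # binary internal node
    return [
        sum(binary_rules[(label, i), (node[1][0], j), (node[2][0], k)]
            * left[j] * right[k]
            for j in range(T) for k in range(T))
        for i in range(T)
    ]

# P(tree) sums over the root's latent sub-categories
# (a uniform distribution over root sub-categories is assumed here).
print("P(tree) =", sum(score / T for score in inside(tree)))
```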

7

How to perform this clustering?
• Estimating model parameters (and model structure):
  – Decide how to split each non-terminal (what is T in NP -> (NP[1], NP[2], ..., NP[T]))
  – Estimate probabilities for all annotated rules
• Parsing:
  – Do you need the most likely ‘annotated’ parse tree (1) or the most likely tree with non-annotated nodes (2)?
  – Usually (2), but the inferred latent variables can be useful for other tasks


8

Estimating the model

• Estimating parameters:
  – If we decide on the structure of the model (how we split), we can use EM (Matsuzaki et al, 05; Petrov and Klein, 06; ...):
    • E-Step: estimate posteriors over the latent annotations to obtain fractional counts of rules
    • M-Step: re-estimate rule probabilities from these fractional counts (see the sketch below)
  – One can also use variational methods (mean-field) [Titov and Henderson, 07; Liang et al, 07]
• Recall: we considered variational methods in the context of LDA
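As a concrete illustration of the M-step just mentioned, here is a minimal sketch (with made-up fractional counts): given expected counts of annotated rules gathered in the E-step, the new probability of each rule is its fractional count normalised over all rules with the same annotated left-hand side.

```python
from collections import defaultdict

# Fractional (expected) rule counts from the E-step: toy, made-up numbers.
# Keys are (annotated LHS, annotated RHS) pairs.
fractional_counts = {
    (("NP", 0), (("DT", 0), ("NN", 1))): 12.3,
    (("NP", 0), (("PRP", 0),)):           4.1,
    (("NP", 1), (("DT", 1), ("NN", 0))):  7.8,
}

def m_step(counts):
    """P(A[i] -> beta) = count(A[i] -> beta) / sum over beta' of count(A[i] -> beta')."""
    totals = defaultdict(float)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

for rule, prob in m_step(fractional_counts).items():
    print(rule, round(prob, 3))
```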

9

Estimating the model
• How to decide how many ways to split each node?
  – Early models split all the nodes equally [Kurihara and Sato, 04; Matsuzaki et al, 05; Prescher, 05; ...], with T selected by hand
  – The resulting models are sparse (parameter estimates are not reliable) and parsing time is large

10

Estimating the model
• How to decide how many ways to split each node?
  – Later, different approaches were considered:
    • (Petrov and Klein, 06): Split-and-merge approach – recursively split each node in 2; if the likelihood improves (significantly), keep the split, otherwise merge back; continue until no improvement (see the sketch below)
    • (Liang et al, 07): Use Dirichlet Processes to automatically infer the appropriate size of the grammar
  – The larger the training set, the finer-grained the annotation
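The control loop of the split-and-merge idea above can be sketched as follows. This is only a schematic version: the grammar is reduced to a map from symbols to their number of sub-categories, and the splitting and likelihood functions are toy stand-ins (in the actual method each step involves retraining with EM and measuring the likelihood loss from merging).

```python
def split_merge(grammar, symbols, split, likelihood, min_gain=1e-3, max_rounds=5):
    """Repeatedly try to split each symbol in two; keep a split only if the
    (approximate) likelihood gain is significant, otherwise merge it back."""
    for _ in range(max_rounds):
        improved = False
        for sym in symbols:
            candidate = split(grammar, sym)
            if likelihood(candidate) - likelihood(grammar) > min_gain:
                grammar, improved = candidate, True   # keep the split
            # else: the candidate is discarded, i.e. merged back
        if not improved:                              # stop when nothing helps
            break
    return grammar

# Toy stand-ins: a "grammar" is just {symbol: number of sub-categories}.
def toy_split(grammar, sym):
    return {**grammar, sym: grammar[sym] * 2}

def toy_likelihood(grammar):
    # pretend that only NP and VP benefit from finer splits, with a small
    # penalty for overall grammar size
    return min(grammar["NP"], 4) + min(grammar["VP"], 2) - 0.01 * sum(grammar.values())

print(split_merge({"NP": 1, "VP": 1, "X": 1}, ["NP", "VP", "X"],
                  toy_split, toy_likelihood))
```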

11

Estimating the model
• How to decide how many ways to split each node?
• (Titov and Henderson, 07; current work):
  – Instead of annotating each node with a single label, annotate it with a binary vector
  – Log-linear models for production probabilities instead of counts of productions
  – The latent vector can be large: standard Gaussian regularization is used to avoid overtraining
  – Efficient approximate parsing algorithms

12

How to parse?
• Do you need the most likely ‘annotated’ parse tree (1) or the most likely tree with non-annotated nodes (2)?
• How to parse:
  – (1) is easy: just the usual parsing with the extended grammar (if all nodes are split in T)
  – (2) is not tractable (NP-complete [Matsuzaki et al, 2005]); instead you can do Minimum Bayes Risk decoding, i.e., output the minimum-loss tree [Goodman 96; Titov and Henderson, 06; Petrov and Klein 07] => instead of predicting the best tree you output the tree with the minimal expected error (see the sketch below)
  – (Not always a great idea, because we often do not know good loss measures: e.g., optimizing the Hamming loss for sequence labeling can lead to linguistically implausible structures)
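A minimal sketch of the Minimum Bayes Risk idea, in its simple n-best / reranking approximation (the cited papers use exact dynamic-programming variants, e.g. maximising the expected number of correct constituents). The candidate trees, their posterior probabilities and the bracket sets below are made up for illustration.

```python
def expected_loss(candidate, candidates, posteriors, loss):
    """Expected loss of a candidate under the (approximate) posterior."""
    return sum(p * loss(candidate, other) for other, p in zip(candidates, posteriors))

def mbr_decode(candidates, posteriors, loss):
    """Return the candidate tree with minimal expected loss (not the MAP tree)."""
    return min(candidates,
               key=lambda c: expected_loss(c, candidates, posteriors, loss))

# Toy candidate trees represented by their sets of labelled brackets (label, start, end).
candidates = [
    frozenset({("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)}),
    frozenset({("NP", 0, 3), ("VP", 3, 5), ("S", 0, 5)}),
    frozenset({("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)}),
]
posteriors = [0.40, 0.35, 0.25]

def bracket_loss(a, b):
    """Number of brackets on which two trees disagree (symmetric difference)."""
    return len(a ^ b)

# Note: here the MBR tree differs from the most probable single candidate.
print(mbr_decode(candidates, posteriors, bracket_loss))
```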


13

Adaptive splitting
• (Petrov and Klein, 06): Split and Merge: number of induced constituent labels:

[Figure: bar chart of the number of induced subcategories per constituent label (y-axis 0–40); frequent categories such as NP, VP and PP are split the most, while rare ones such as X, ROOT and LST are barely split.]

14

Adaptive splitting
• (Petrov and Klein, 06): Split and Merge: number of induced POS tags:

[Figure: bar chart of the number of induced subcategories per POS tag (y-axis 0–70); open-class tags such as NNP, JJ, NNS and NN receive the most subcategories, while punctuation and rare tags are barely split.]

15

Adaptive splitting
• (Petrov and Klein, 06): Split and Merge: number of induced POS tags:

[Figure: the same POS-tag chart, highlighting the most heavily split tags: NN, NNS, NNP and JJ.]

16

Induced POS tags
• Proper nouns (NNP):

  NNP-14   Oct.    Nov.        Sept.
  NNP-12   John    Robert      James
  NNP-2    J.      E.          L.
  NNP-1    Bush    Noriega     Peters
  NNP-15   New     San         Wall
  NNP-3    York    Francisco   Street

• Personal pronouns (PRP):

  PRP-0    It      He          I
  PRP-1    it      he          they
  PRP-2    it      them        him

17

Induced POS tags
• Relative adverbs (RBR):

  RBR-0   further   lower     higher
  RBR-1   more      less      More
  RBR-2   earlier   Earlier   later

• Cardinal numbers (CD):

  CD-7    one       two       Three
  CD-4    1989      1990      1988
  CD-11   million   billion   trillion
  CD-0    1         50        100
  CD-3    1         30        31
  CD-9    78        58        34

18

Results for this model

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning ’03       86.3              85.7
Matsuzaki et al. ’05      86.7              86.1
Collins ’99               88.6              88.2
Charniak & Johnson ’05    90.1              89.6
Petrov & Klein ’06        90.2              89.7

19

LVs in Parsing
• In standard models for parsing (and other structured prediction problems) you need to decide how the structure decomposes into parts (e.g., weighted CFGs / PCFGs)
• In latent variable models you relax this assumption: you only assume how the structure annotated with latent variables decomposes
• In other words, you learn to construct composite features from the elementary features (parts) -> reduces feature engineering effort
• Latent variable models have become popular in many applications:
  – syntactic dependency parsing [Titov and Henderson, 07] – best single-model system in the parsing competition (overall 3rd result out of 22 systems) (CoNLL-2007)
  – joint semantic role labeling and parsing [Henderson et al, 09] – again the best single model (1st result in parsing, 3rd result in SRL) (CoNLL-2009)
  – hidden (dynamics) CRFs [Quattoni, 09]
  – ...

20

Hidden CRFs
• CRF (Lafferty et al, 2001): no long-distance statistical dependencies between the labels y
• Latent Dynamic CRF: long-distance dependencies can be encoded using latent vectors (see the sketch below)
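To make the contrast concrete, here is a small sketch in the spirit of hidden / latent-dynamic CRFs (not a faithful re-implementation of the cited models): each label owns a few hidden sub-states, the model scores hidden-state sequences, and P(y | x) marginalises over all hidden sequences compatible with y using the forward algorithm. The label set, feature scores and sentence are toy assumptions.

```python
import math

LABELS = ["B", "I", "O"]   # toy label set (an assumption for illustration)
K = 2                      # hidden sub-states per label

def states(label):
    """Hidden states owned by a label, e.g. 'B' -> ['B.0', 'B.1']."""
    return [f"{label}.{k}" for k in range(K)]

ALL_STATES = [s for y in LABELS for s in states(y)]

def emit(state, token):    # toy emission score (hand-set, not learned)
    return 1.0 if state.startswith("B") and token.istitle() else 0.1

def trans(prev, cur):      # toy transition score favouring staying in one label
    return 0.5 if prev.split(".")[0] == cur.split(".")[0] else 0.1

def log_score_sum(tokens, allowed):
    """Forward pass over hidden states; allowed[t] lists the states permitted
    at position t (all states for the partition function Z, only the states
    of the gold label when scoring a particular label sequence)."""
    alpha = {s: math.exp(emit(s, tokens[0])) for s in allowed[0]}
    for t in range(1, len(tokens)):
        alpha = {s: sum(a * math.exp(trans(p, s) + emit(s, tokens[t]))
                        for p, a in alpha.items())
                 for s in allowed[t]}
    return math.log(sum(alpha.values()))

tokens, labels = ["John", "runs", "fast"], ["B", "O", "O"]
log_numerator = log_score_sum(tokens, [states(y) for y in labels])
log_Z = log_score_sum(tokens, [ALL_STATES] * len(tokens))
print("P(y | x) =", math.exp(log_numerator - log_Z))
```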

21

Latent Variables
• Drawbacks:
  – Learning LV models usually involves slower iterative algorithms (EM, variational methods, sampling, ...)
  – The optimization problem is often non-convex – many local minima
  – Inference (decoding) can be more expensive
• Advantages:
  – Reduces feature engineering effort
  – Especially preferable if little domain knowledge is available and complex features are needed
  – The induced representation can be used for other tasks (e.g., the fine-grained grammar induced by LA-PCFGs can be useful, e.g., for SRL)
  – Latent variables (= hidden representations) can be useful in multi-task learning: the hidden representation is induced simultaneously for several tasks [Collobert and Weston, 2008; Titov et al, 2009]


22

Conclusions
• We considered latent variable models in different contexts:
  – Topic modeling
  – Structured prediction models
• We demonstrated where and why they are useful
• Reviewed basic inference/learning techniques:
  – EM-type algorithms
  – Variational approximations
  – Sampling
• Only a very basic review
• Next time: a guest lecture by Ming-Wei Chang on domain adaptation (a really hot and important topic in NLP!)
