Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models: ACL 2011 talk

Post on 26-Jan-2015

DESCRIPTION

Slides from my 2011 Association for Computational Linguistics paper & talk (joint work with Jason Baldridge and Katrin Erk). The talk presents Unsupervised Partial Parsing, a simple but very effective method for discovering grammatical phrases such as noun phrases.

TRANSCRIPT

Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

Elias Ponvert, Jason Baldridge, Katrin Erk

Department of Linguistics
The University of Texas at Austin

Association for Computational Linguistics
19–24 June, 2011

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 1 / 34

Why unsupervised parsing?

1 Less reliance on annotated training

Hello!

2 Apply to new languages and domains

Særær manannær man

mæþæn


Assumptions made in parser learning

[Figure: labeled parse tree for "on Sunday , the brown bear sleeps": (S (PP (P on) (NP (N Sunday))) , (NP (Det the) (A brown) (N bear)) (VP (V sleeps)))]

Getting these labels right AS WELL AS the structure of the tree is hard


[Figure: the same sentence with part-of-speech tags only: on/P Sunday/N ,/, the/Det brown/A bear/N sleeps/V]

So the task is to identify the structure alone


Learning operates from gold-standard parts-of-speech (POS) rather than raw text

From gold-standard POS tags (P N , Det A N V):

Klein & Manning 2003 CCM
Bod 2006a, 2006b
Klein & Manning 2005 DMV
Successors to DMV: Smith 2006, Smith & Cohen 2009, Headden et al 2009, Spitkovsky et al 2010ab, &c

From raw text (on Sunday , the brown bear sleeps):

J. Gao et al 2003, 2004
Seginer 2007
this work

Unsupervised parsing: desiderata

Raw text

Standard NLP / extensible

Scalable and fast


A new approach: start from the bottom

Unsupervised Partial Parsing = segmentation of (non-overlapping) multiword constituents

Unsupervised segmentation of constituents leaves some room for interpretation

Possible segmentations

( the cat ) in ( the hat ) knows ( a lot ) about that

( the cat ) ( in the hat ) knows ( a lot ) ( about that )

( the cat in the hat ) knows ( a lot about that )

( the cat in the hat ) ( knows a lot about that )

( the cat in the hat ) ( knows a lot ) ( about that )


Defining UPP by evaluation

1. Constituent chunks: non-hierarchical multiword constituents

[Figure: parse tree for "The Cat in the hat knows a lot about that" (preterminals: D N P D N V D N P N)]

Defining UPP by evaluation

2. Base NPs: non-recursive noun phrases

[Figure: parse tree for "The Cat in the hat knows a lot about that" (preterminals: D N P D N V D N P N)]

Multilingual data for direct evaluation

English: WSJ
German: Negra
Chinese: CTB

Corpus                         Sentences  Types  Tokens
WSJ    Penn Treebank           49K        44K    1M
Negra  Negra German Corpus     21K        49K    300K
CTB    Penn Chinese Treebank   19K        37K    430K

Constituent chunks and NPs in the data

Corpus  Chunks  NPs   Chunks ∩ NPs
WSJ     203K    172K  161K
Negra   59K     33K   23K
CTB     92K     56K   43K

The benchmark: CCL parser

[Figure: the sentence "the cat saw the red dog run" shown in the Common Cover Links representation and as a constituency tree]

Seginer (2007 ACL; 2007 PhD UvA)


Hypothesis

Segmentation can be learned by generalizing on phrasal boundaries

UPP as a tagging problem

the/B cat/I in/O the/B hat/I

B  Beginning of a constituent
I  Inside a constituent
O  Not inside a constituent
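As a minimal sketch of how a B/I/O tag sequence maps back to a segmentation, the Python below groups tagged tokens into chunk spans. The function name `bio_to_chunks` and the choice to tolerate an I with no preceding B are illustrative assumptions, not part of the talk.

```python
def bio_to_chunks(tags):
    """Return chunk spans (start, end), end exclusive, from a list of B/I/O tags."""
    spans = []
    start = None  # index where the currently open chunk began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))  # a new B closes the open chunk
            start = i
        elif tag == "I":
            if start is None:  # assumption: an I without a B opens a chunk
                start = i
        else:  # "O" (or any other tag) closes the open chunk
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


tokens = "the cat in the hat".split()
tags = ["B", "I", "O", "B", "I"]
spans = bio_to_chunks(tags)
chunks = [" ".join(tokens[s:e]) for s, e in spans]
print(chunks)
```

For the slide's example this recovers the two chunks "the cat" and "the hat", leaving "in" outside any constituent.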


Learning from boundaries

#/STOP the/B cat/I in/O the/B hat/I #/STOP

(# marks a sentence boundary)

Learning from punctuation

#/STOP on/B sunday/I ,/STOP the/B brown/I bear/I sleeps/O #/STOP
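A minimal sketch of this preprocessing idea: sentence boundaries and punctuation are the only positions whose tag is known in advance, and they are fixed to STOP. The punctuation set `PUNCT` and the helper name `observed_tags` are assumptions made for illustration; the talk does not say which tokens count as phrasal punctuation.

```python
# Assumed set of punctuation tokens treated as boundaries (hypothetical).
PUNCT = {",", ".", ";", ":", "!", "?"}


def observed_tags(tokens):
    """Pad a sentence with # boundaries and mark the positions whose tag is known.

    Returns (token, tag) pairs where tag is "STOP" for boundaries and
    punctuation, and None for words whose B/I/O tag must be inferred.
    """
    padded = ["#"] + list(tokens) + ["#"]
    return [(tok, "STOP" if tok == "#" or tok in PUNCT else None) for tok in padded]


seq = observed_tags("on sunday , the brown bear sleeps".split())
for tok, tag in seq:
    print(tok, tag)
```

Every other position is left open, so the sequence model only has to choose among B, I and O between consecutive STOP positions.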


UPP: Models

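Hidden Markov Model

\[ P(\text{the}, I \mid B) \approx P(\text{the} \mid B)\, P(I \mid B) \]

Probabilistic right linear grammar

\[ P(\text{the}, I \mid B) = P(I \mid B)\, P(\text{the} \mid B, I) \]

[Figure: the tagged sequence the/B cat/I in/O the/B hat/I drawn once as a hidden Markov model and once as a right linear derivation]

Learning: expectation maximization (EM) via forward-backward (run to convergence)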
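The difference between the two models is only in what the emission is conditioned on: in the HMM a word depends on the current tag, while in the PRLG it depends on the current tag and the next one. The sketch below scores one tagged sequence under each factorization. All probabilities are made-up toy values, the function names are hypothetical, and the emission of the final boundary token is ignored for simplicity.

```python
def steps(words, tags):
    """Yield (tag, word, next_tag) triples for every position that has a successor."""
    return [(tags[i], words[i], tags[i + 1]) for i in range(len(tags) - 1)]


def hmm_score(words, tags, trans, emit):
    """HMM: each step contributes P(next_tag | tag) * P(word | tag)."""
    p = 1.0
    for tag, word, nxt in steps(words, tags):
        p *= trans[(tag, nxt)] * emit[(tag, word)]
    return p


def prlg_score(words, tags, trans, emit):
    """PRLG: each step contributes P(next_tag | tag) * P(word | tag, next_tag)."""
    p = 1.0
    for tag, word, nxt in steps(words, tags):
        p *= trans[(tag, nxt)] * emit[(tag, nxt, word)]
    return p


# Toy parameters (illustrative values only, not estimates from any corpus).
trans = {("STOP", "B"): 0.5, ("B", "I"): 1.0, ("I", "STOP"): 0.4}
hmm_emit = {("STOP", "#"): 1.0, ("B", "the"): 0.2, ("I", "cat"): 0.1}
prlg_emit = {("STOP", "B", "#"): 1.0, ("B", "I", "the"): 0.2, ("I", "STOP", "cat"): 0.3}

words = ["#", "the", "cat", "#"]
tags = ["STOP", "B", "I", "STOP"]
hmm_p = hmm_score(words, tags, trans, hmm_emit)
prlg_p = prlg_score(words, tags, trans, prlg_emit)
print(hmm_p, prlg_p)
```

In this toy setting the PRLG can give "cat" a higher probability when it is the last word of a chunk (tag I followed by STOP), which is information the HMM emission cannot use.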


Decoding: Viterbi
Smoothing: additive smoothing on emissions

UPP: Constraints on sequences

#/STOP the/B cat/I in/O the/B hat/I #/STOP

[Figure: state diagram of the allowed transitions among the tags STOP, B, I and O; one edge is labeled 1]
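One way to picture constrained decoding is Viterbi search over a transition table in which disallowed tag pairs simply have no entry. The sketch below assumes the constraints that follow from chunks being multiword: B must be followed by I, and I may only follow B or I. That reading of the diagram, the toy probabilities, and the function name `viterbi` are all assumptions for illustration.

```python
import math


def viterbi(words, states, trans, emit):
    """Most probable tag sequence; missing dict entries mean probability 0."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][s]: log-probability of the best path ending in state s at position i
    best = [{s: lg(emit.get((s, words[0]), 0.0)) for s in states}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + lg(trans.get((r, s), 0.0)))
            scores[s] = (best[-1][prev] + lg(trans.get((prev, s), 0.0))
                         + lg(emit.get((s, w), 0.0)))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


states = ["STOP", "B", "I", "O"]
# Assumed constraints: no B->O, B->B, B->STOP; no O->I, STOP->I (toy values).
trans = {
    ("STOP", "B"): 0.5, ("STOP", "O"): 0.5,
    ("B", "I"): 1.0,
    ("I", "I"): 0.2, ("I", "O"): 0.4, ("I", "B"): 0.1, ("I", "STOP"): 0.3,
    ("O", "B"): 0.4, ("O", "O"): 0.3, ("O", "STOP"): 0.3,
}
emit = {
    ("STOP", "#"): 1.0,
    ("B", "the"): 0.8, ("B", "cat"): 0.05, ("B", "in"): 0.05, ("B", "hat"): 0.05,
    ("I", "cat"): 0.4, ("I", "hat"): 0.4, ("I", "the"): 0.1, ("I", "in"): 0.1,
    ("O", "in"): 0.7, ("O", "the"): 0.1, ("O", "cat"): 0.1, ("O", "hat"): 0.1,
}

words = ["#", "the", "cat", "in", "the", "hat", "#"]
path = viterbi(words, states, trans, emit)
print(list(zip(words, path)))
```

Encoding the constraints as absent transitions means no single-word chunk can ever be decoded, whatever the emission probabilities are.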


UPP evaluation: Setup

Evaluation by comparison to treebank data
Standard train / development / test splits
Precision and recall on matched constituents
Benchmark: CCL
Both get tokenization, punctuation, sentence boundaries
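A minimal sketch of precision, recall and F-score over matched constituents, assuming a constituent counts as matched only when its span is identical in the prediction and the treebank. The spans below are hypothetical, and `prf` is an illustrative name; a corpus-level version would also key each span by sentence.

```python
def prf(predicted, gold):
    """Precision, recall and F1 over sets of (start, end) constituent spans."""
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold_spans = {(0, 2), (3, 5), (6, 8)}          # hypothetical treebank chunks
pred_spans = {(0, 2), (3, 5), (5, 8), (9, 11)}  # hypothetical system output
p, r, f = prf(pred_spans, gold_spans)
print(round(p, 3), round(r, 3), round(f, 3))
```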


UPP evaluation: Chunking (F-score)

[Figure: bar chart of chunking F-score (axis 0 to 80) on CTB, Negra and WSJ for CCL∗, HMM Chunker and PRLG Chunker]

CCL non-hierarchical constituents
First-level parsing output

UPP evaluation: Base NPs (F-score)

[Figure: bar chart of base NP F-score (axis 0 to 80) on CTB, Negra and WSJ for CCL∗, HMM Chunker and PRLG Chunker]

CCL non-hierarchical constituents
First-level parsing output

UPP: Review

Sequence models can generalize on indicators for phrasal boundaries
Leads to improved unsupervised segmentation

Question

Are we limited to segmentation?


Hypothesis

Identification of higher level constituents can also be learned by generalizing on phrasal boundaries

Cascaded UPP: 1 Segment raw text

there is no asbestos in our products now

there ( is no asbestos ) in ( our products ) now

Cascaded UPP: 2 Choose stand-ins for phrases

there ( is no asbestos ) in ( our products ) now

( is no asbestos ) → is
( our products ) → our

there is in our now

Cascaded UPP: 3 Segment text + phrasal stand-ins

there is in our now

( there is ) ( in our ) now

Cascaded UPP: 4 Choose stand-ins and repeat steps 3–4

( there is ) ( in our ) now

( there is ) → is, where "is" already stands for ( is no asbestos )
( in our ) → in, where "our" already stands for ( our products )

is in now

Cascaded UPP: 5 Unwind to output tree

[Figure: the stand-ins in "is in now" are expanded back into the phrases they replaced, giving a tree over "there is no asbestos in our products now"]

Cascaded UPP: Review

Separate models learned at each cascade level
Models share hyper-parameters (smoothing etc)
Choice of pseudowords as phrasal stand-ins
Pseudoword-identification: corpus frequency
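A rough sketch of one cascade step, assuming that "corpus frequency" means each chunk is replaced by its most frequent word in the corpus. The frequency counts are invented for the example, `replace_chunks` is a hypothetical helper, and a real system would presumably mark stand-ins as distinct from ordinary words, which this sketch skips.

```python
from collections import Counter


def replace_chunks(tokens, spans, freq):
    """Replace each chunk span (start, end) with its highest-frequency word."""
    out, i = [], 0
    for start, end in sorted(spans):
        out.extend(tokens[i:start])  # copy words outside any chunk
        out.append(max(tokens[start:end], key=lambda w: freq[w]))
        i = end
    out.extend(tokens[i:])
    return out


# Hypothetical corpus frequencies, chosen only to reproduce the slide's example.
freq = Counter({"in": 60, "is": 50, "there": 30, "our": 20,
                "no": 10, "now": 8, "products": 5, "asbestos": 1})

level0 = "there is no asbestos in our products now".split()
# Level 1 chunks: ( is no asbestos ) and ( our products )
level1 = replace_chunks(level0, [(1, 4), (5, 7)], freq)
# Level 2 chunks: ( there is ) and ( in our )
level2 = replace_chunks(level1, [(0, 2), (2, 4)], freq)
print(level1)
print(level2)
```

Keeping the spans chosen at each level is enough to unwind the cascade later, since each stand-in can be expanded back into the span it replaced.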


Cascaded UPP: Evaluation

[Figure: bar chart of all-constituent F-score (axis 0 to 60) on CTB, Negra and WSJ for CCL, Cascaded HMM and Cascaded PRLG]

All constituent F-score
Cascade run to convergence

More example parses

die csu tut das in bayern doch auch sehr erfolgreich
the CSU does this in Bavaria nevertheless also very successfully

Nevertheless, the CSU does this in Bavaria very successfully as well

[Figure: gold-standard bracketing compared with the Cascaded PRLG parse (Negra); predicted constituents are marked correct or incorrect]

More example parses

bei den windsors bleibt alles in der familie
with the Windsors stays everything in the family

With the Windsors everything stays in the family.

[Figure: gold-standard bracketing compared with the Cascaded PRLG parse (Negra); predicted constituents are marked correct or incorrect]

More example parses

immer mehr anlagenteile uberaltern
ever more machine parts over-age

(with) more and more machine parts over-age

[Figure: Cascaded PRLG parse (Negra); predicted constituents are marked correct or incorrect]

What we’ve learned

Unsupervised identification of base NPs and local constituents is possible
A cascade of chunking models for raw text parsing has state-of-the-art results

Future directions

Improvements to the sequence models
Better phrasal stand-in (pseudoword) construction
Learning joint models rather than a cascade

What’s in the paper

Comparison to Klein & Manning’s CCM
Discussion of phrasal punctuation
    the chunkers still do well w/out punctuation
Analysis of chunking and parsing Chinese
Error analysis

Thanks!

Contact: eponvert@utexas.edu
Code: elias.ponvert.net/upparse

This work is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by Mike Hogg Endowment Fellowship, the Office of Graduate Studies at The University of Texas at Austin.

Appendices


More example parses

two share a house almost devoid of furniture

[Figure: gold-standard bracketing compared with the Cascaded PRLG parse (WSJ); predicted constituents are marked correct or incorrect]

More example parses

what is one to think of all this

[Figure: gold-standard bracketing compared with the Cascaded PRLG parse (WSJ); predicted constituents are marked correct or incorrect]

Learning curves: Base NPs

[Figure: base NP F-score for the PRLG chunking model on WSJ, plotted against number of training sentences (up to 40K) and EM iterations (up to 100)]

PRLG chunking model: WSJ

Learning curves: Base NPs

[Figure: base NP F-score for the PRLG chunking model on Negra, plotted against number of training sentences (up to 15K) and EM iterations (up to 150)]

PRLG chunking model: Negra

Learning curves: Base NPs

[Figure: base NP F-score for the PRLG chunking model on CTB, plotted against number of training sentences (up to 15K) and EM iterations (up to 100)]

PRLG chunking model: CTB

What are the models learning?

B, P(w|B): the 21.0; a 8.7; to 6.5; ’s 2.8; in 1.9; mr. 1.8; its 1.6; of 1.4; an 1.4; and 1.4

I, P(w|I): % 1.8; million 1.6; be 1.3; company 0.9; year 0.8; market 0.7; billion 0.6; share 0.5; new 0.5; than 0.5

O, P(w|O): of 5.8; and 4.0; in 3.7; that 2.2; to 2.1; for 2.0; is 2.0; it 1.7; said 1.7; on 1.5

HMM Emissions: WSJ

What are the models learning?

B, P(w|B): der (the) 13.0; die (the) 12.2; den (the) 4.4; und (and) 3.3; im (in) 3.2; das (the) 2.9; des (the) 2.7; dem (the) 2.4; eine (a) 2.1; ein (a) 2.0

I, P(w|I): uhr (o’clock) 0.8; juni (June) 0.6; jahren (years) 0.4; prozent (percent) 0.4; mark (currency) 0.3; stadt (city) 0.3; 000 0.3; millionen (millions) 0.3; jahre (year) 0.3; frankfurter (Frankfurt) 0.3

O, P(w|O): in (in) 3.4; und (and) 2.7; mit (with) 1.7; fur (for) 1.6; auf (on) 1.5; zu (to) 1.4; von (of) 1.3; sich (such) 1.3; ist (is) 1.3; nicht (not) 1.2

HMM Emissions: Negra

What are the models learning?

B, P(w|B): 的 (de, of) 14.3; 一 (one) 3.1; 和 (and) 1.1; 两 (two) 0.9; 这 (this) 0.8; 有 (have) 0.8; 经济 (economy) 0.7; 各 (each) 0.7; 全 (all) 0.7; 不 (no) 0.6

I, P(w|I): 的 (de) 3.9; 了 (perf. asp.) 2.2; 个 (ge, measure) 1.5; 年 (year) 1.3; 说 (say) 1.0; 中 (middle) 0.9; 上 (on, above) 0.9; 人 (person) 0.7; 大 (big) 0.7; 国 (country) 0.6

O, P(w|O): 在 (at, in) 3.4; 是 (is) 2.4; 中国 (China) 1.4; 也 (also) 1.2; 不 (no) 1.2; 对 (pair) 1.1; 和 (and) 1.0; 的 (de) 1.0; 将 (fut. tns.) 1.0; 有 (have) 1.0

HMM Emissions: CTB