simple unsupervised grammar induction from raw text with cascaded finite state models: acl 2011 talk

Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

Elias Ponvert, Jason Baldridge, Katrin Erk

Department of Linguistics

The University of Texas at Austin

Association for Computational Linguistics

19–24 June, 2011

Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 1 / 34

Why unsupervised parsing?

1 Less reliance on annotated training

Hello!

2 Apply to new languages and domains

Særær manannær man

mæþæn

Assumptions made in parser learning

[Tree figure; surviving category labels: PP, NP, VP; surviving words: Sunday, sleeps]

Getting these labels right AS WELL AS the structure of the tree is hard

[Tree figure without labels; surviving words: Sunday, sleeps]

So the task is to identify the structure alone

on Sunday , the brown bear sleeps

Learning operates from gold-standard parts-of-speech (POS) rather than raw text

on Sunday , the brown bear sleeps
P  N      , Det A     N    V

Klein & Manning 2003 CCM

Bod 2006a, 2006b

Klein & Manning 2005 DMV

Successors to DMV: Smith 2006, Smith & Cohen 2009, Headden et al 2009, Spitkovsky et al 2010ab, &c

J. Gao et al 2003, 2004

Seginer 2007

this work

Unsupervised parsing: desiderata

Raw text

Standard NLP / extensible

Scalable and fast

A new approach: start from the bottom

Unsupervised Partial Parsing = segmentation of (non-overlapping) multiword constituents

Unsupervised segmentation of constituents leaves some room for interpretation

Possible segmentations

( the cat ) in ( the hat ) knows ( a lot ) about that

( the cat ) ( in the hat ) knows ( a lot ) ( about that )

( the cat in the hat ) knows ( a lot about that )

( the cat in the hat ) ( knows a lot about that )

( the cat in the hat ) ( knows a lot ) ( about that )

Defining UPP by evaluation

1. Constituent chunks:

non-hierarchical multiword constituents

Defining UPP by evaluation

2. Base NPs:

non-recursive noun phrases

Multilingual data for direct evaluation

English WSJ

German Negra

Chinese CTB

Corpus   Treebank                Sentences   Types   Tokens
WSJ      Penn Treebank           49K         44K     1M
Negra    Negra German Corpus     21K         49K     300K
CTB      Penn Chinese Treebank   19K         37K     430K

Constituent chunks and NPs in the data

               WSJ    Negra   CTB
Chunks         203K   59K     92K
NPs            172K   33K     56K
Chunks ∩ NPs   161K   23K     43K
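The overlap between the two evaluation targets can be read off these counts. A minimal sketch of that arithmetic, using the rounded counts from the slide (in thousands); the function name `overlap_shares` is hypothetical and only for illustration:

```python
# Counts in thousands, copied from the slide (rounded values).
COUNTS = {
    "WSJ": {"chunks": 203, "nps": 172, "both": 161},
    "Negra": {"chunks": 59, "nps": 33, "both": 23},
    "CTB": {"chunks": 92, "nps": 56, "both": 43},
}


def overlap_shares(chunks, nps, both):
    """Return (share of chunks that are also NPs, share of NPs that are also chunks)."""
    return both / chunks, both / nps


for corpus, c in COUNTS.items():
    of_chunks, of_nps = overlap_shares(c["chunks"], c["nps"], c["both"])
    print(f"{corpus}: {of_chunks:.1%} of chunks are NPs, {of_nps:.1%} of NPs are chunks")
```

Because the inputs are rounded to the nearest thousand, the resulting percentages are approximate.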

The benchmark: CCL parser

[Figure: an example containing the phrases "the cat" and "the red dog", shown in two panels; link depths and tree structure not recoverable]

Common Cover Links representation

Constituency tree

Seginer (2007 ACL; 2007 PhD UvA)

Hypothesis

Segmentation can be learned by generalizing on phrasal boundaries

UPP as a tagging problem

the cat in the hat

B Beginning of a constituent

I Inside a constituent

O Not inside a constituent
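To make the tagging view concrete, here is a minimal sketch of how a B/I/O tag sequence maps back to bracketed chunks. The function names and the tag sequence used for "the cat in the hat" are illustrative assumptions, not taken from the talk's code.

```python
from typing import List, Tuple


def bio_to_chunks(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, one per B I* run."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # a new chunk closes the previous one
                spans.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:      # leaving a chunk
                spans.append((start, i))
            start = None
        # tag == "I": stay inside the open chunk (ignored if none is open)
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def bracket(words: List[str], tags: List[str]) -> str:
    """Render chunks in the "( the cat )" notation used on the slides."""
    spans = bio_to_chunks(tags)
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    out = []
    for i, word in enumerate(words):
        if i in starts:
            out.append("(")
        out.append(word)
        if i + 1 in ends:
            out.append(")")
    return " ".join(out)


words = "the cat in the hat".split()
tags = ["B", "I", "O", "B", "I"]       # one possible tagging, chosen by hand
print(bracket(words, tags))
```

With this hand-chosen tagging the sentence is rendered as two chunks around "the cat" and "the hat", with "in" left outside.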

Learning from boundaries

the cat in the hat

Learning from punctuation

on sunday , the brown bear sleeps

UPP: Models

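Hidden Markov Model

$P(t_{i+1}, w_{i+1} \mid t_i) \approx P(t_{i+1} \mid t_i)\, P(w_{i+1} \mid t_{i+1})$

Probabilistic right linear grammar

$P(w_i, t_{i+1} \mid t_i) = P(t_{i+1} \mid t_i)\, P(w_i \mid t_i, t_{i+1})$

Example: the/B cat/I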

Learning: expectation maximization (EM) via forward-backward (run to convergence)
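The difference between the two models is easiest to see when scoring one tagged sequence. The sketch below assumes the usual factorizations: for the HMM, each word is emitted from its own tag, P(t_i | t_{i-1}) P(w_i | t_i); for the PRLG, each word is emitted jointly with the next tag, P(t_{i+1} | t_i) P(w_i | t_i, t_{i+1}). The parameter tables are invented toy numbers and the function names are hypothetical.

```python
import math

STOP = "STOP"  # sentence boundary symbol


def hmm_logprob(words, tags, trans, emit):
    """HMM: sum of log P(t_i | t_{i-1}) + log P(w_i | t_i), plus the final transition to STOP."""
    lp = 0.0
    prev = STOP
    for word, tag in zip(words, tags):
        lp += math.log(trans[(prev, tag)]) + math.log(emit[(tag, word)])
        prev = tag
    return lp + math.log(trans[(prev, STOP)])


def prlg_logprob(words, tags, trans, emit):
    """PRLG: the word is conditioned on the current tag and the next tag."""
    lp = math.log(trans[(STOP, tags[0])])
    next_tags = list(tags[1:]) + [STOP]
    for word, tag, nxt in zip(words, tags, next_tags):
        lp += math.log(trans[(tag, nxt)]) + math.log(emit[(tag, nxt, word)])
    return lp


# Toy parameters (assumed values, for illustration only).
trans = {(STOP, "B"): 0.5, ("B", "I"): 1.0, ("I", STOP): 0.5}
hmm_emit = {("B", "the"): 0.2, ("I", "cat"): 0.1}
prlg_emit = {("B", "I", "the"): 0.2, ("I", STOP, "cat"): 0.1}

words, tags = ["the", "cat"], ["B", "I"]
print(math.exp(hmm_logprob(words, tags, trans, hmm_emit)))
print(math.exp(prlg_logprob(words, tags, trans, prlg_emit)))
```

The only structural difference is the key of the emission table: the PRLG table is larger because it is indexed by a tag pair, which lets the emitted word depend on whether the constituent continues or ends.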


Decoding: Viterbi

Smoothing: additive smoothing on emissions
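A minimal sketch of these two pieces for the HMM variant: add-alpha smoothing of the emission estimates, followed by Viterbi decoding in log space. The counts, transition probabilities, vocabulary size, and the value of `alpha` are all invented for illustration; they are not the settings used in the talk.

```python
import math
from collections import Counter

NEG_INF = float("-inf")


def log(p):
    """Log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def smoothed_emission(counts, tag, word, vocab_size, alpha=0.1):
    """Additive smoothing: (count + alpha) / (total + alpha * |V|)."""
    total = sum(counts[tag].values())
    return (counts[tag][word] + alpha) / (total + alpha * vocab_size)


def viterbi(words, tags, trans, emit, boundary="STOP"):
    """Most probable tag sequence under transition table `trans` and emission function `emit`."""
    def lt(prev, cur):
        return log(trans.get((prev, cur), 0.0))

    best = {t: lt(boundary, t) + log(emit(t, words[0])) for t in tags}
    backpointers = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + lt(p, t))
            scores[t] = best[prev] + lt(prev, t) + log(emit(t, word))
            pointers[t] = prev
        best = scores
        backpointers.append(pointers)
    last = max(tags, key=lambda t: best[t] + lt(t, boundary))
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model (assumed values).
counts = {
    "B": Counter({"the": 8, "a": 2}),
    "I": Counter({"cat": 5, "hat": 5}),
    "O": Counter({"sleeps": 6, "in": 4}),
}
trans = {
    ("STOP", "B"): 0.6, ("STOP", "O"): 0.4,
    ("B", "I"): 1.0,
    ("I", "I"): 0.2, ("I", "B"): 0.2, ("I", "O"): 0.4, ("I", "STOP"): 0.2,
    ("O", "B"): 0.4, ("O", "O"): 0.3, ("O", "STOP"): 0.3,
}


def emit(tag, word):
    return smoothed_emission(counts, tag, word, vocab_size=6)


decoded = viterbi("the cat sleeps".split(), ["B", "I", "O"], trans, emit)
print(decoded)
```

Smoothing only the emissions keeps unseen words from zeroing out a path, while transitions absent from `trans` remain impossible.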

UPP: Constraints on sequences

the cat in the hat

STOP B
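One way to express sequence constraints is as a table of allowed tag transitions. The table below is an assumption based on the definitions earlier in the talk (chunks are multiword, so B must be followed by I, and I cannot open a chunk after O or a sentence boundary); the slide itself only preserves the symbols STOP and B, so the exact constraint set here is a guess.

```python
# Assumed constraint table: ALLOWED[prev] is the set of tags that may follow prev.
ALLOWED = {
    "STOP": {"B", "O"},
    "B": {"I"},
    "I": {"B", "I", "O", "STOP"},
    "O": {"B", "O", "STOP"},
}


def is_valid(tags):
    """Check a tag sequence, including the STOP boundaries on both sides."""
    path = ["STOP"] + list(tags) + ["STOP"]
    return all(cur in ALLOWED[prev] for prev, cur in zip(path, path[1:]))


def constrain(trans):
    """Drop disallowed transitions from {prev: {cur: prob}} and renormalize each row."""
    out = {}
    for prev, row in trans.items():
        kept = {cur: p for cur, p in row.items() if cur in ALLOWED[prev]}
        total = sum(kept.values())
        out[prev] = {cur: p / total for cur, p in kept.items()} if total > 0 else {}
    return out


print(is_valid(["B", "I", "O", "B", "I"]))   # the cat in the hat
print(is_valid(["B", "O", "O", "B", "I"]))   # single-word chunk, rejected here
```

Applying such a mask before or during EM means the learner never has to spend probability mass on tag sequences that cannot correspond to a segmentation.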

UPP evaluation: Setup

Evaluation by comparison to treebank data

Standard train / development / test splits

Precision and recall on matched constituents

Benchmark: CCL

Both get tokenization, punctuation, sentence boundaries
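Precision and recall on matched constituents can be sketched as set overlap between predicted and gold spans. This is a generic illustration, not the evaluation script used for the talk; representing a constituent as a `(sentence_id, start, end)` triple is an assumption.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and F-score over exactly matching spans."""
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Sentence 0: "the cat in the hat"
gold = {(0, 0, 2), (0, 3, 5)}        # ( the cat ) in ( the hat )
predicted = {(0, 0, 2), (0, 2, 5)}   # ( the cat ) ( in the hat )
print(precision_recall_f(predicted, gold))
```

Only exact boundary matches count, so the predicted chunk "in the hat" earns no credit for overlapping the gold chunk "the hat".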

UPP evaluation: Chunking (F-score)

[Bar chart, axis 0 to 80: CCL∗, HMM Chunker, PRLG Chunker; bar values not recoverable]

CCL∗: CCL non-hierarchical constituents (first-level parsing output)

UPP evaluation: Base NPs (F-score)

[Bar chart, axis 0 to 80: CCL∗, HMM Chunker, PRLG Chunker; bar values not recoverable]

CCL∗: CCL non-hierarchical constituents (first-level parsing output)

UPP: Review

Sequence models can generalize on indicators for phrasal boundaries

Leads to improved unsupervised segmentation

Question

Are we limited to segmentation?

Hypothesis

Identification of higher level constituents can also be learned by generalizing on phrasal boundaries

Cascaded UPP: 1 Segment raw text

there is no asbestos in our products now

Cascaded UPP: 2 Choose stand-ins for phrases

our products

is no asbestos

there is no asbestos in our products now

there is in our now
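A minimal sketch of the stand-in step, assuming each chunk is replaced by its most frequent word in the corpus. The frequency counts below are made up, the chunk spans are given by hand rather than produced by a learned chunker, and the function names are hypothetical.

```python
from collections import Counter


def stand_in(chunk, freq):
    """Pick the chunk's word with the highest corpus frequency (leftmost on ties)."""
    return max(chunk, key=lambda w: freq[w])


def replace_chunks(words, spans, freq):
    """Replace each (start, end) chunk, end exclusive, by its stand-in word."""
    out, i = [], 0
    for start, end in sorted(spans):
        out.extend(words[i:start])
        out.append(stand_in(words[start:end], freq))
        i = end
    out.extend(words[i:])
    return out


# Toy corpus frequencies (assumed values).
freq = Counter({"is": 50, "in": 40, "our": 20, "no": 10, "there": 8,
                "now": 5, "products": 3, "asbestos": 1})

words = "there is no asbestos in our products now".split()
spans = [(1, 4), (5, 7)]   # ( is no asbestos ) and ( our products )
print(" ".join(replace_chunks(words, spans, freq)))
```

Using a frequent word as the stand-in gives the next cascade level a token it has seen often, which is the point of replacing a phrase with a single pseudoword.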

Cascaded UPP: 3 Segment text + phrasal stand-ins

there is in our now

Cascaded UPP: 4 Choose stand-ins and repeat steps 3–4

our products

is no asbestos

there is in our now

is in now

Cascaded UPP: 5 Unwind to output tree

our products

is no asbestos

is in now

there is no asbestos in our products
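The whole cascade, including the final unwinding into a tree, can be sketched as a loop that keeps a subtree behind every stand-in. In this sketch the per-level chunkers are hand-written stubs standing in for the separately learned models, the frequency counts are invented, and the resulting bracketing is only an illustration, not the output shown in the talk.

```python
from collections import Counter


def cascade(words, chunk_fns, freq):
    """Apply one chunker per level; each item pairs a stand-in word with its subtree."""
    items = [(w, w) for w in words]
    for chunk in chunk_fns:
        spans = sorted(chunk([head for head, _ in items]))
        if not spans:              # nothing new to group: the cascade has converged
            break
        new_items, i = [], 0
        for start, end in spans:
            new_items.extend(items[i:start])
            group = items[start:end]
            head = max((h for h, _ in group), key=lambda w: freq[w])
            new_items.append((head, [tree for _, tree in group]))
            i = end
        new_items.extend(items[i:])
        items = new_items
    return [tree for _, tree in items]


def show(tree):
    """Unwind a nested list into bracket notation."""
    if isinstance(tree, str):
        return tree
    return "( " + " ".join(show(t) for t in tree) + " )"


freq = Counter({"is": 50, "in": 40, "our": 20, "no": 10, "there": 8,
                "now": 5, "products": 3, "asbestos": 1})   # assumed counts

level_1 = lambda seq: [(1, 4), (5, 7)]   # stub chunker over raw text
level_2 = lambda seq: [(2, 4)]           # stub chunker over text + stand-ins
level_3 = lambda seq: []                 # no further chunks

words = "there is no asbestos in our products now".split()
trees = cascade(words, [level_1, level_2, level_3], freq)
print(" ".join(show(t) for t in trees))
```

Because each stand-in carries its subtree along, no separate bookkeeping is needed at the end: printing the nested lists is the unwinding step.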

Cascaded UPP: Review

Separate models learned at each cascade level

Models share hyper-parameters (smoothing etc)

Choice of pseudowords as phrasal stand-ins

Pseudoword-identification: corpus frequency

Cascaded UPP: Evaluation

[Bar chart, axis 0 to 60: CCL, Cascaded HMM, Cascaded PRLG; bar values not recoverable]

All constituent F-score

Cascade run to convergence

More example parses

die  csu  tut   das   in  bayern   doch          auch  sehr  erfolgreich
the  CSU  does  this  in  Bavaria  nevertheless  also  very  successfully

Nevertheless, the CSU does this in Bavaria very successfully as well

Gold standard

die csu

tut das

in bayern

doch auch

sehr erfolgreich

Cascaded PRLG – Negra

Legend: correct, incorrect

More example parses

bei   den  windsors  bleibt  alles       in  der  familie
with  the  Windsors  stays   everything  in  the  family

With the Windsors everything stays in the family.

Gold standard

bei den windsors

bleibt alles

in der familie

More example parses

immer  mehr  anlagenteile   uberaltern
ever   more  machine parts  over-age

(with) more and more machine parts over-age

What we’ve learned

Unsupervised identification of base NPs and local constituents is possible

A cascade of chunking models for raw text parsing has state-of-the-art results

Future directions

Improvements to the sequence models

Better phrasal stand-in (pseudoword) construction

Learning joint models rather than a cascade

What’s in the paper

Comparison to Klein & Manning’s CCM

Discussion of phrasal punctuation

  - the chunkers still do well without punctuation

Analysis of chunking and parsing Chinese

Error analysis

Thanks!

Contact: eponvert@utexas.edu

Code: elias.ponvert.net/upparse

This work is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by Mike Hogg Endowment Fellowship, the Office of Graduate Studies at The University of Texas at Austin.

Appendices

More example parses

two share a house almost devoid of furniture

Gold standard

two

a house

almost devoid

of

furniture

Cascaded PRLG – WSJ

Legend: correct, incorrect

More example parses

is one to think of all this

Gold standard

what

all this

Cascaded PRLG – WSJ

Legend: correct, incorrect

Learning curves: Base NPs

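[Figure: PRLG chunking model, WSJ. Learning curves against number of training sentences (10K to 40K) and against EM iterations (0 to 100); plotted values not recoverable]

[Figure: PRLG chunking model, Negra. Learning curves against number of training sentences (5K to 15K) and against EM iterations (0 to 150); plotted values not recoverable]

[Figure: PRLG chunking model, CTB. Learning curves against number of training sentences (5K to 15K) and against EM iterations (0 to 100); plotted values not recoverable]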

What are the models learning?

B     P(w|B)    I         P(w|I)    O      P(w|O)
the   21.0      %         1.8       of     5.8
a     8.7       million   1.6       and    4.0
to    6.5       be        1.3       in     3.7
’s    2.8       company   0.9       that   2.2
in    1.9       year      0.8       to     2.1
mr.   1.8       market    0.7       for    2.0
its   1.6       billion   0.6       is     2.0
of    1.4       share     0.5       it     1.7
an    1.4       new       0.5       said   1.7
and   1.4       than      0.5       on     1.5

HMM Emissions: WSJ

B     gloss  P(w|B)    I            gloss      P(w|I)    O      gloss  P(w|O)
der   the    13.0      uhr          o’clock    0.8       in     in     3.4
die   the    12.2      juni         June       0.6       und    and    2.7
den   the    4.4       jahren       years      0.4       mit    with   1.7
und   and    3.3       prozent      percent    0.4       fur    for    1.6
im    in     3.2       mark         currency   0.3       auf    on     1.5
das   the    2.9       stadt        city       0.3       zu     to     1.4
des   the    2.7       000                     0.3       von    of     1.3
dem   the    2.4       millionen    millions   0.3       sich   such   1.3
eine  a      2.1       jahre        year       0.3       ist    is     1.3
ein   a      2.0       frankfurter  Frankfurt  0.3       nicht  not    1.2

HMM Emissions: Negra

B     gloss    P(w|B)    I    gloss          P(w|I)    O     gloss      P(w|O)
的    de, of   14.3      的   de             3.9       在    at, in     3.4
一    one      3.1       了   (perf. asp.)   2.2       是    is         2.4
和    and      1.1       个   ge (measure)   1.5       中国  China      1.4
两    two      0.9       年   year           1.3       也    also       1.2
这    this     0.8       说   say            1.0       不    no         1.2
有    have     0.8       中   middle         0.9       对    pair       1.1
经济  economy  0.7       上   on, above      0.9       和    and        1.0
各    each     0.7       人   person         0.7       的    de         1.0
全    all      0.7       大   big            0.7       将    fut. tns.  1.0
不    no       0.6       国   country        0.6       有    have       1.0

HMM Emissions: CTB
