Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
Elias Ponvert, Jason Baldridge, Katrin Erk
Department of Linguistics
The University of Texas at Austin
Association for Computational Linguistics
19–24 June, 2011
Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 1 / 34
Why unsupervised parsing?
1 Less reliance on annotated training
Hello!
2 Apply to new languages and domains
Særær manannær man
mæþæn
Assumptions made in parser learning
[Figure: labeled constituency tree for "on Sunday , the brown bear sleeps", with phrase labels S, PP, NP, VP over the POS tags P, N, Det, A, N, V]
Getting these labels right AS WELL AS the structure of the tree is hard
[Figure: the same tree with the phrase labels removed; only the POS tags P, N, Det, A, N, V remain over the words]
So the task is to identify the structure alone
[Figure: unlabeled bracketing of "on Sunday , the brown bear sleeps"]
Learning operates from gold-standard parts of speech (POS) rather than raw text
on Sunday , the brown bear sleeps
P N , Det A N V
Klein & Manning 2003 (CCM)
Bod 2006a, 2006b
Klein & Manning 2005 (DMV)
Successors to DMV: Smith 2006, Smith & Cohen 2009, Headden et al 2009, Spitkovsky et al 2010ab, &c
J. Gao et al 2003, 2004
Seginer 2007
this work
Unsupervised parsing: desiderata
Raw text
Standard NLP / extensible
Scalable and fast
A new approach: start from the bottom
Unsupervised Partial Parsing = segmentation of (non-overlapping) multiword constituents
Unsupervised segmentation of constituents leaves some room for interpretation
Possible segmentations
( the cat ) in ( the hat ) knows ( a lot ) about that
( the cat ) ( in the hat ) knows ( a lot ) ( about that )
( the cat in the hat ) knows ( a lot about that )
( the cat in the hat ) ( knows a lot about that )
( the cat in the hat ) ( knows a lot ) ( about that )
Defining UPP by evaluation
1. Constituent chunks: non-hierarchical multiword constituents
[Figure: constituency tree for "The Cat in the hat knows a lot about that", with node labels S, NP, PP, VP over the POS tags D, N, P, V]
2. Base NPs: non-recursive noun phrases
[Figure: the same constituency tree for "The Cat in the hat knows a lot about that", used to illustrate base NPs]
Multilingual data for direct evaluation
Corpus  Language  Source                 Sentences  Types  Tokens
WSJ     English   Penn Treebank          49K        44K    1M
Negra   German    Negra German Corpus    21K        49K    300K
CTB     Chinese   Penn Chinese Treebank  19K        37K    430K
Constituent chunks and NPs in the data
Corpus  Chunks  NPs   Chunks ∩ NPs
WSJ     203K    172K  161K
Negra   59K     33K   23K
CTB     92K     56K   43K
The benchmark: CCL parser
[Figure: the sentence "the cat saw the red dog run" shown in the Common Cover Links representation and as a constituency tree]
Seginer (2007 ACL; 2007 PhD UvA)
Hypothesis
Segmentation can be learned by generalizing on phrasal boundaries
UPP as a tagging problem
the/B cat/I in/O the/B hat/I
B  Beginning of a constituent
I  Inside a constituent
O  Not inside a constituent
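The B/I/O encoding can be made concrete with a small sketch. The function names and the (start, end) span convention, with an exclusive end, are assumptions made for illustration and do not come from the talk or its released code.

```python
def chunks_to_bio(tokens, chunks):
    """Tag tokens with B/I/O given non-overlapping (start, end) spans (end exclusive)."""
    tags = ["O"] * len(tokens)
    for start, end in chunks:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def bio_to_chunks(tags):
    """Recover the (start, end) spans from a B/I/O tag sequence."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        # Anything other than I closes the chunk that is currently open.
        if tag != "I" and start is not None:
            chunks.append((start, i))
            start = None
        if tag == "B":
            start = i
    if start is not None:
        chunks.append((start, len(tags)))
    return chunks


tokens = "the cat in the hat".split()
tags = chunks_to_bio(tokens, [(0, 2), (3, 5)])
print(list(zip(tokens, tags)))
```

Because the two functions are inverses on well-formed input, a tagger that predicts B/I/O tags is in effect predicting a segmentation.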
Learning from boundaries
#/STOP the/B cat/I in/O the/B hat/I #/STOP
Learning from punctuation
#/STOP on/B sunday/I ,/STOP the/B brown/I bear/I sleeps/O #/STOP
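A minimal sketch of how sentence edges and punctuation could be turned into observed STOP positions before training. The punctuation set and the function name are illustrative assumptions; the exact set of phrasal punctuation used in the talk is not given on the slide.

```python
# Assumed, illustrative set of phrasal punctuation marks.
PHRASAL_PUNCT = {",", ".", ";", ":", "!", "?"}


def mark_boundaries(tokens):
    """Return (symbol, fixed_tag) pairs.

    fixed_tag is "STOP" for sentence edges and punctuation, and None for
    ordinary words, whose B/I/O tag is left for the model to infer.
    """
    out = [("#", "STOP")]
    for tok in tokens:
        out.append((tok, "STOP" if tok in PHRASAL_PUNCT else None))
    out.append(("#", "STOP"))
    return out


seq = mark_boundaries("on sunday , the brown bear sleeps".split())
for symbol, fixed in seq:
    print(symbol, fixed or "?")
```

The words next to a STOP are then the only places where the model sees direct evidence about where constituents begin and end.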
UPP: Models
Hidden Markov Model
P(the, I | B) ≈ P(the | B) P(I | B)
Probabilistic right linear grammar
P(the, I | B) = P(I | B) P(the | B, I)
[Figure: the tag sequence B I O B I over "the cat in the hat" drawn under each model]
Learning: expectation maximization (EM) via forward-backward (run to convergence)
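The difference between the two models is whether the emitted word is conditioned on the current tag alone (HMM) or on the current tag together with the next tag (PRLG). The sketch below shows only that factorization. It estimates the probabilities by relative frequency from a tiny hand-tagged list, which is an assumption made to keep the example short; the talk estimates them with EM on untagged text.

```python
from collections import Counter

# Toy (tag, word, next_tag) steps; invented for illustration only.
steps = [
    ("O", "in", "B"),
    ("O", "in", "B"),
    ("O", "sleeps", "STOP"),
    ("O", "sleeps", "STOP"),
]

step_count = Counter(steps)
tag_count = Counter(t for t, _, _ in steps)
trans_count = Counter((t, n) for t, _, n in steps)
emit_count = Counter((t, w) for t, w, _ in steps)


def hmm(t, w, n):
    """P(w, n | t) approximated as P(w | t) * P(n | t)."""
    if not tag_count[t]:
        return 0.0
    return (emit_count[t, w] / tag_count[t]) * (trans_count[t, n] / tag_count[t])


def prlg(t, w, n):
    """P(w, n | t) computed as P(n | t) * P(w | t, n)."""
    if not trans_count[t, n]:
        return 0.0
    return (trans_count[t, n] / tag_count[t]) * (step_count[t, w, n] / trans_count[t, n])


print(hmm("O", "in", "B"), prlg("O", "in", "B"))
```

In this toy data "in" is always followed by B, so the PRLG gives the step a higher probability than the HMM does, since the HMM treats the word and the next tag as independent given the current tag.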
Decoding: Viterbi
Smoothing: additive smoothing on emissions
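A compact sketch of Viterbi decoding with additively smoothed emissions. All counts, probabilities, the smoothing constant `alpha` and the function names are invented for illustration; STOP states are left out to keep the example short.

```python
import math


def smoothed_emission(counts, tag, word, vocab_size, alpha=0.1):
    """Additive (add-alpha) estimate of P(word | tag) from raw counts."""
    tag_counts = counts.get(tag, {})
    total = sum(tag_counts.values())
    return (tag_counts.get(word, 0) + alpha) / (total + alpha * vocab_size)


def viterbi(words, tags, start, trans, emit):
    """Most probable tag sequence; start and trans are dicts, emit a function."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Each column maps tag -> (best log score, best previous tag).
    best = [{t: (log(start.get(t, 0)) + log(emit(t, words[0])), None) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log(trans.get((p, t), 0)), p) for p in tags
            )
            col[t] = (score + log(emit(t, w)), prev)
        best.append(col)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for col in reversed(best[1:]):
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]


TAGS = ["B", "I", "O"]
COUNTS = {"B": {"the": 5}, "I": {"cat": 2, "hat": 2}, "O": {"in": 3}}
START = {"B": 0.6, "O": 0.4}
TRANS = {("B", "I"): 1.0, ("I", "I"): 0.3, ("I", "O"): 0.4, ("I", "B"): 0.3,
         ("O", "B"): 0.6, ("O", "O"): 0.4}


def emit(tag, word):
    return smoothed_emission(COUNTS, tag, word, vocab_size=4)


print(viterbi("the cat in the hat".split(), TAGS, START, TRANS, emit))
```

Smoothing only the emissions keeps unseen words from getting zero probability, while transitions that are set to zero stay impossible.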
UPP: Constraints on sequences
#/STOP the/B cat/I in/O the/B hat/I #/STOP
[Figure: transition diagram over the tags STOP, B, I and O, with one arc labeled 1]
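One way to read the constraints is as a set of forbidden tag transitions. The set below is an assumption derived from the tag definitions rather than copied from the slide: I can only continue a chunk, so it cannot follow STOP or O, and since chunks are multiword, B can only be followed by I. That last constraint would account for the arc labeled 1.

```python
TAGS = ["STOP", "B", "I", "O"]

# Assumed forbidden transitions, derived from the tag definitions.
FORBIDDEN = {
    ("STOP", "I"), ("O", "I"),               # I must continue an open chunk
    ("B", "B"), ("B", "O"), ("B", "STOP"),   # chunks are multiword: B -> I only
}


def allowed(prev, nxt):
    return (prev, nxt) not in FORBIDDEN


def is_valid(tag_seq):
    """True if no adjacent pair of tags is a forbidden transition."""
    return all(allowed(a, b) for a, b in zip(tag_seq, tag_seq[1:]))


def constrain(trans):
    """Zero out forbidden transitions and renormalize each row."""
    out = {}
    for p in TAGS:
        row = {n: (trans.get((p, n), 0.0) if allowed(p, n) else 0.0) for n in TAGS}
        z = sum(row.values())
        for n, v in row.items():
            out[p, n] = v / z if z else 0.0
    return out


uniform = {(p, n): 0.25 for p in TAGS for n in TAGS}
constrained = constrain(uniform)
print(constrained["B", "I"], constrained["O", "I"])
```

Starting EM from a transition table like `constrained` keeps the forbidden transitions at zero, because EM never assigns expected counts to events that have zero probability.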
UPP evaluation: Setup
Evaluation by comparison to treebank data
Standard train / development / test splits
Precision and recall on matched constituents
Benchmark: CCL
Both get tokenization, punctuation, sentence boundaries
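Precision and recall on matched constituents can be sketched as set overlap between predicted and gold spans. The (sentence id, start, end) representation and the function name are assumptions for illustration; the talk's own evaluation script may differ in details such as how punctuation is handled.

```python
def prf(predicted, gold):
    """Precision, recall and F-score over exactly matching spans."""
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    p = matched / len(predicted) if predicted else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Spans are (sentence_id, start, end) with an exclusive end.
gold = [(0, 0, 2), (0, 3, 5)]   # ( the cat ) in ( the hat )
pred = [(0, 0, 2), (0, 2, 5)]   # ( the cat ) ( in the hat )
print(prf(pred, gold))
```

Including the sentence id in each span lets the same function score a whole corpus in one call.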
UPP evaluation: Chunking (F-score)
[Figure: bar chart of chunking F-score (axis 0 to 80) on CTB, Negra and WSJ for CCL∗, HMM Chunker and PRLG Chunker]
CCL∗: CCL non-hierarchical constituents, first-level parsing output
UPP evaluation: Base NPs (F-score)
[Figure: bar chart of base NP F-score (axis 0 to 80) on CTB, Negra and WSJ for CCL∗, HMM Chunker and PRLG Chunker]
CCL∗: CCL non-hierarchical constituents, first-level parsing output
UPP: Review
Sequence models can generalize on indicators for phrasal boundaries
Leads to improved unsupervised segmentation
Question
Are we limited to segmentation?
Hypothesis
Identification of higher level constituents can also be learned by generalizing on phrasal boundaries
Cascaded UPP: 1 Segment raw text
there is no asbestos in our products now
[Figure: the same sentence after first-level segmentation into chunks]
Cascaded UPP: 2 Choose stand-ins for phrases
there is no asbestos in our products now
Chunks: ( is no asbestos ) ( our products )
Stand-ins: "is" for ( is no asbestos ), "our" for ( our products )
Result: there is in our now
Cascaded UPP: 3 Segment text + phrasal stand-ins
there is in our now
[Figure: the stand-in sequence after a second round of segmentation]
Cascaded UPP: 4 Choose stand-ins and repeat steps 3–4
[Figure: stand-ins are chosen for the new chunks, reducing "there is in our now" to "is in now"; the replaced phrases are kept beneath their stand-ins]
Cascaded UPP: 5 Unwind to output tree
[Figure: the stand-ins are expanded back into the phrases they replaced, giving a tree over "there is no asbestos in our products now"]
Cascaded UPP: Review
Separate models learned at each cascade level
Models share hyper-parameters (smoothing etc)
Choice of pseudowords as phrasal stand-ins
Pseudoword-identification: corpus frequency
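A minimal sketch of one cascade step, assuming that the stand-in for a chunk is the chunk's word with the highest corpus frequency. That reading is consistent with the "no asbestos" example but is an interpretation of the slide, and the frequency counts and function names below are invented for illustration.

```python
from collections import Counter


def choose_stand_in(chunk, freq):
    """Pick the chunk's most frequent word as its pseudoword (assumed rule)."""
    return max(chunk, key=lambda w: freq[w])


def collapse(tokens, chunks, freq):
    """Replace each (start, end) chunk by its stand-in; other tokens pass through."""
    out, i = [], 0
    for start, end in sorted(chunks):
        out.extend(tokens[i:start])
        out.append(choose_stand_in(tokens[start:end], freq))
        i = end
    out.extend(tokens[i:])
    return out


# Toy corpus frequencies, not taken from any treebank.
freq = Counter({"is": 50, "our": 30, "no": 20, "products": 5, "asbestos": 1})
tokens = "there is no asbestos in our products now".split()
level1 = collapse(tokens, [(1, 4), (5, 7)], freq)
print(" ".join(level1))
```

The collapsed sequence is then chunked again by a separately trained model, and the process repeats until no new chunks are found.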
Cascaded UPP: Evaluation
[Figure: bar chart of all-constituent F-score (axis 0 to 60) on CTB, Negra and WSJ for CCL, Cascaded HMM and Cascaded PRLG]
Cascade run to convergence
More example parses
die  csu  tut   das   in  bayern   doch          auch  sehr  erfolgreich
the  CSU  does  this  in  Bavaria  nevertheless  also  very  successfully
Nevertheless, the CSU does this in Bavaria very successfully as well
[Figure: gold-standard bracketing compared with the Cascaded PRLG (Negra) output, with predicted constituents marked correct or incorrect]
More example parses
bei   den  windsors  bleibt  alles       in  der  familie
with  the  Windsors  stays   everything  in  the  family
With the Windsors everything stays in the family.
[Figure: gold-standard bracketing compared with the Cascaded PRLG (Negra) output, with predicted constituents marked correct or incorrect]
More example parses
immer  mehr  anlagenteile   uberaltern
ever   more  machine parts  over-age
(with) more and more machine parts over-age
[Figure: Cascaded PRLG (Negra) output, with predicted constituents marked correct or incorrect]
What we’ve learned
Unsupervised identification of base NPs and local constituents is possible
A cascade of chunking models for raw text parsing has state-of-the-art results
Future directions
Improvements to the sequence models
Better phrasal stand-in (pseudoword) construction
Learning joint models rather than a cascade
What’s in the paper
Comparison to Klein & Manning’s CCM
Discussion of phrasal punctuation
  - the chunkers still do well w/out punctuation
Analysis of chunking and parsing Chinese
Error analysis
Thanks!
Contact: eponvert@utexas.edu
Code: elias.ponvert.net/upparse
This work is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by Mike Hogg Endowment Fellowship, the Office of Graduate Studies at The University of Texas at Austin.
Appendices
More example parses
two share a house almost devoid of furniture
[Figure: gold-standard bracketing compared with the Cascaded PRLG (WSJ) output, with predicted constituents marked correct or incorrect]
More example parses
what is one to think of all this
[Figure: gold-standard bracketing compared with the Cascaded PRLG (WSJ) output, with predicted constituents marked correct or incorrect]
Learning curves: Base NPs
[Figure: base NP F-score of the PRLG chunking model on WSJ, plotted against number of sentences (10K to 40K) and EM iterations (up to 100)]
Learning curves: Base NPs
[Figure: base NP F-score of the PRLG chunking model on Negra, plotted against number of sentences (5K to 15K) and EM iterations (up to 150)]
Learning curves: Base NPs
[Figure: base NP F-score of the PRLG chunking model on CTB, plotted against number of sentences (5K to 15K) and EM iterations (up to 100)]
What are the models learning?
HMM Emissions: WSJ
B     P(w|B)   I        P(w|I)   O     P(w|O)
the   21.0     %        1.8      of    5.8
a     8.7      million  1.6      and   4.0
to    6.5      be       1.3      in    3.7
’s    2.8      company  0.9      that  2.2
in    1.9      year     0.8      to    2.1
mr.   1.8      market   0.7      for   2.0
its   1.6      billion  0.6      is    2.0
of    1.4      share    0.5      it    1.7
an    1.4      new      0.5      said  1.7
and   1.4      than     0.5      on    1.5
What are the models learning?
HMM Emissions: Negra
B           P(w|B)   I                        P(w|I)   O               P(w|O)
der (the)   13.0     uhr (o’clock)            0.8      in (in)         3.4
die (the)   12.2     juni (June)              0.6      und (and)       2.7
den (the)   4.4      jahren (years)           0.4      mit (with)      1.7
und (and)   3.3      prozent (percent)        0.4      fur (for)       1.6
im (in)     3.2      mark (currency)          0.3      auf (on)        1.5
das (the)   2.9      stadt (city)             0.3      zu (to)         1.4
des (the)   2.7      000                      0.3      von (of)        1.3
dem (the)   2.4      millionen (millions)     0.3      sich (oneself)  1.3
eine (a)    2.1      jahre (year)             0.3      ist (is)        1.3
ein (a)     2.0      frankfurter (Frankfurt)  0.3      nicht (not)     1.2
What are the models learning?
HMM Emissions: CTB
B               P(w|B)   I                 P(w|I)   O               P(w|O)
的 (de, of)     14.3     的 (de)           3.9      在 (at, in)     3.4
一 (one)        3.1      了 (perf. asp.)   2.2      是 (is)         2.4
和 (and)        1.1      个 (ge, measure)  1.5      中国 (China)    1.4
两 (two)        0.9      年 (year)         1.3      也 (also)       1.2
这 (this)       0.8      说 (say)          1.0      不 (no)         1.2
有 (have)       0.8      中 (middle)       0.9      对 (pair)       1.1
经济 (economy)  0.7      上 (on, above)    0.9      和 (and)        1.0
各 (each)       0.7      人 (person)       0.7      的 (de)         1.0
全 (all)        0.7      大 (big)          0.7      将 (fut. tns.)  1.0
不 (no)         0.6      国 (country)      0.6      有 (have)       1.0