![Page 1: Breaking the Resource Bottleneck for Multilingual Parsing Rebecca Hwa, Philip Resnik and Amy Weinberg University of Maryland](https://reader030.vdocument.in/reader030/viewer/2022032522/56649d6d5503460f94a4cccf/html5/thumbnails/1.jpg)
Breaking the Resource Bottleneck for Multilingual Parsing
Rebecca Hwa, Philip Resnik and Amy Weinberg
University of Maryland
The Treebank Bottleneck
• High-quality parsers need training examples with hand-annotated syntactic information
• Annotation is labor intensive and time consuming
• There is no sizable treebank for most languages other than English
[[S [NP-SBJ Ford Motor Co. ] [VP acquired [NP [NP 5 % ] [PP of [NP [NP the shares] [PP-LOC in [NP Jaguar PLC]]]]]]] . ]
State of the Art Parsing

| Language | Treebank | Size | Parser Performance |
| --- | --- | --- | --- |
| English | Penn Treebank | 1M words, 40k sentences | ~90% |
| Chinese | Chinese Treebank | 100K words, 4k sentences | ~75% |
| Others (e.g., Hindi, Arabic) | ? | ? | ? |
Research Questions
• How can we induce a non-English language treebank quickly and automatically?
– Bootstrap from available English resources
– Project syntactic dependency relationships across bilingual sentences
• How good is the resulting treebank?
– Can we use it to train a new parser?
– How can we improve its quality?
Roadmap
• Overview of the framework
– Direct projection algorithm
• Problematic cases
– Post projection transformation
• Remaining challenges
– Filtering
• Experiment
– Direct evaluation of the projected trees
– Evaluation of a Chinese parser trained on the induced treebank
• Future Work
Overview of Our Framework
[Figure: pipeline diagram. A bilingual English-Chinese corpus is processed by an English dependency parser and a word alignment model; Projection, Transformation, and Filtering produce a projected Chinese dependency treebank, which is used to train a dependency parser that outputs dependency trees for unseen Chinese sentences.]
Necessary Resources: 1. Bilingual Sentences
English: The Chinese side expressed satisfaction regarding this subject
Chinese: 中国 方面 对 此 表示 满意
Necessary Resources: 2. English (Dependency) Parser
English: The Chinese side expressed satisfaction regarding this subject
Chinese: 中国 方面 对 此 表示 满意
[Figure: English dependency tree with arcs labeled subj, obj, adj, det, det, mod, mod.]
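Necessary Resources: 3. Word Alignment
English: The Chinese side expressed satisfaction regarding this subject
Chinese: 中国 方面 对 此 表示 满意
[Figure: the English dependency tree (arcs subj, obj, adj, det, det, mod, mod) with word-alignment links to the Chinese words.]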
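Projected Chinese Dependency Tree
English: The Chinese side expressed satisfaction regarding this subject
Chinese: 中国 方面 对 此 表示 满意
[Figure: the English dependency tree (arcs subj, obj, adj, det, det, mod, mod) and the Chinese tree projected through the word alignment (arcs mod, obj, subj, adj, mod).]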
Direct Projection Algorithm
• If there is a syntactic relationship between two English words, then the same syntactic relationship also exists between their corresponding Chinese words
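A minimal sketch of this projection rule for the simplest (one-to-one) case. The edge and alignment representations, the function name `project_dependencies`, and the toy indices are assumptions made for illustration; they are not taken from the slides. Edges that touch an unaligned English word are simply skipped here.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int, str]  # (head index, dependent index, relation label)


def project_dependencies(
    english_edges: List[Edge],
    alignment: Dict[int, int],
) -> Set[Edge]:
    """Copy each English dependency edge onto the aligned Chinese words.

    Only the one-to-one case is handled: `alignment` maps an English word
    index to exactly one Chinese word index. An edge is dropped when either
    of its English endpoints has no aligned Chinese word.
    """
    projected = set()
    for head, dep, label in english_edges:
        if head in alignment and dep in alignment:
            projected.add((alignment[head], alignment[dep], label))
    return projected


# Toy example with hypothetical indices and alignment:
# English: 0=Chinese 1=expressed 2=satisfaction
# Chinese: 0=中国    1=表示      2=满意
english_edges = [(1, 0, "subj"), (1, 2, "obj")]
alignment = {0: 0, 1: 1, 2: 2}
chinese_edges = project_dependencies(english_edges, alignment)
```

The many-to-one, one-to-many, and unaligned cases need extra handling beyond this sketch.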
Problematic Case: Unaligned English
[Figure: English "regarding this subject" (arcs det, mod) above Chinese "对 此".]
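Problematic Case: Unaligned English
[Figure: English "regarding this subject" (arcs det, mod) above Chinese "对 此 *e*", where *e* is an added empty node; projected det and mod arcs are drawn on the Chinese side.]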
Problematic Case: many-to-1
[Figure: English "regarding this subject" (arcs det, mod) above Chinese "对 此".]
Problematic Case: many-to-1
[Figure: English "regarding this subject" (arcs det, mod) above Chinese "对 此", with a projected mod arc on the Chinese side.]
Problematic Case: Unaligned Chinese
[Figure: English "The Chinese expressed" (arcs subj, det) above Chinese "中国 方面 表示", with two empty nodes *e*.]
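Problematic Case: Unaligned Chinese
[Figure: English "The Chinese expressed" (arcs subj, det) above Chinese "中国 方面 表示", with two empty nodes *e* and projected subj and det arcs on the Chinese side.]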
Problematic Case: 1-to-many
[Figure: English "The Chinese expressed" (arcs subj, det) above Chinese "中国 方面 表示", with one empty node *e*.]
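Problematic Case: 1-to-many
[Figure: English "The Chinese expressed" (arcs subj, det) above Chinese "中国 方面 表示", with an added node *M* (two arcs labeled mac), one empty node *e*, and projected subj and det arcs on the Chinese side.]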
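Output of the Direct Projection Algorithm
English: The Chinese expressed satisfaction regarding this subject
Chinese: 中国 方面 对 此 表示 满意
[Figure: English dependency tree (arcs subj, obj, det, det, mod, mod) and the projected Chinese tree, which contains the added nodes *M* and *e* and arcs labeled obj, subj, mod, mod, det, mac, mac.]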
Post Projection Transformation
• Handles One-to-Many mapping
– Select head based on (projected) part-of-speech categories
• Handles some Unaligned-Chinese cases
– Only addressing closed-class words
• Functional words (e.g., aspectual, measure words)
• Easily enumerable lexical categories (e.g., $, RMB, yen)
• Removes empty nodes introduced by the Unaligned-English cases by promoting each empty node's head child
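The empty-node removal step might look roughly like the sketch below. The slides do not say how the head child is chosen, so this version assumes it is the first child in dictionary insertion order. The function name `remove_empty_node`, the head-map representation, and the toy tree are all hypothetical.

```python
from typing import Dict, Optional


def remove_empty_node(
    heads: Dict[str, Optional[str]],
    empty: str,
) -> Dict[str, Optional[str]]:
    """Delete `empty` from a head map and promote one of its children.

    `heads` maps each node to its head (None for the root). The promoted
    child takes over the empty node's head; the remaining children are
    re-attached to the promoted child.
    """
    children = [node for node, head in heads.items() if head == empty]
    result = {node: head for node, head in heads.items() if node != empty}
    if not children:
        return result
    promoted = children[0]  # assumption: first child stands in for the head child
    result[promoted] = heads[empty]
    for child in children[1:]:
        result[child] = promoted
    return result


# Hypothetical projected fragment with an empty node "*e*"
heads = {"表示": None, "*e*": "表示", "对": "*e*", "此": "*e*"}
cleaned = remove_empty_node(heads, "*e*")
```

In this toy tree, 对 is promoted into the place of `*e*` and 此 is re-attached under it.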
Remaining Challenges
• Handling divergences
• Incorporating unaligned foreign words into the projected tree
• Removing crossing dependencies
[Figure: schematic word pairs A B over a b, and C D over d c.]
Filtering
• Projected treebank is noisy
– Mistakes introduced by the projection algorithm
– Mistakes introduced by component errors
• Use aggressive filtering techniques to remove the worst projected trees
– Filter out a sentence pair if many English words were unaligned
– Filter out a sentence pair if many Chinese words are aligned to the same English word
– Filter out a sentence pair if many of the projected links caused crossing dependencies
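The three filtering criteria above can be sketched as a single predicate. The slides give no thresholds, so `max_unaligned`, `max_fanout`, and `max_crossing` are placeholder values chosen for illustration. The function names and data layout are likewise assumptions.

```python
from collections import Counter
from typing import List, Tuple

Link = Tuple[int, int]  # (English word index, Chinese word index)
Edge = Tuple[int, int]  # (head index, dependent index) on the Chinese side


def crossing_fraction(edges: List[Edge]) -> float:
    """Fraction of edges whose span crosses the span of some other edge."""
    spans = [tuple(sorted(edge)) for edge in edges]

    def crosses(a, b):
        return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

    crossing = sum(
        any(crosses(a, b) for j, b in enumerate(spans) if j != i)
        for i, a in enumerate(spans)
    )
    return crossing / len(spans) if spans else 0.0


def keep_sentence_pair(
    n_english: int,
    alignment: List[Link],
    chinese_edges: List[Edge],
    max_unaligned: float = 0.3,  # hypothetical threshold
    max_fanout: int = 3,         # hypothetical threshold
    max_crossing: float = 0.2,   # hypothetical threshold
) -> bool:
    """Return True if the sentence pair passes all three filters."""
    if n_english == 0:
        return False
    aligned_english = {e for e, _ in alignment}
    unaligned_ratio = 1 - len(aligned_english) / n_english
    # Number of distinct Chinese words aligned to each English word
    fanout = Counter(e for e, _ in set(alignment))
    worst_fanout = max(fanout.values(), default=0)
    return (
        unaligned_ratio <= max_unaligned
        and worst_fanout <= max_fanout
        and crossing_fraction(chinese_edges) <= max_crossing
    )
```

For example, `keep_sentence_pair(4, [(0, 0)], [])` rejects the pair, because three of the four English words are unaligned.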
Experiments
• Direct evaluation of the projection framework
– Compare the (pre-filtered) projected trees against human annotated gold standard
• Evaluation of the projected treebank
– Use the (post-filtered) treebank to train a Chinese parser
– Test the parser on unseen sentences and compare the output to human annotated gold standard
Direct Evaluation
• Bilingual data: 88 Chinese Treebank sentences with their English translations
• Apply projection and transformation under idealized conditions
– Given human-corrected English parse trees and hand-drawn word-alignments
• Apply projection and transformation under realistic conditions
– English parse trees generated from Collins parser (trained on Penn Treebank)
– Word-alignments generated from IBM MT Model (trained on ~56K Hong Kong News bilingual sentences)
Direct Evaluation Results

| Condition | Accuracy* |
| --- | --- |
| Ideal | 67% |
| English parses from the Collins parser | 62% |
| Word-alignments from the IBM MT Model | 39% |

*Accuracy = f-score based on unlabeled precision & recall
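The accuracy measure in the footnote can be sketched as follows. It assumes each tree is given as a set of unlabeled (head, dependent) index pairs, and the function name `unlabeled_f_score` is illustrative. The toy trees are made up and do not reproduce the figures in the table.

```python
from typing import Iterable, Tuple

Edge = Tuple[int, int]  # (head index, dependent index), relation label ignored


def unlabeled_f_score(projected: Iterable[Edge], gold: Iterable[Edge]) -> float:
    """Harmonic mean of unlabeled precision and recall over dependency edges."""
    projected, gold = set(projected), set(gold)
    if not projected or not gold:
        return 0.0
    correct = len(projected & gold)
    precision = correct / len(projected)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy trees: the projection recovers two of the three gold edges
gold_edges = {(1, 0), (1, 2), (2, 3)}
projected_edges = {(1, 0), (1, 2)}
score = unlabeled_f_score(projected_edges, gold_edges)
```

Here precision is 1.0 and recall is 2/3, so the f-score is 0.8.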
Evaluating Trained Parser
• Bilingual data: 56K sentence pairs from the Hong Kong News parallel corpus
• Apply the DPA (using the Collins Parser and IBM MT Model) to create a projected Chinese treebank
• Filter out badly-aligned sentence pairs to reduce noise
• Train a Chinese parser with the (filtered) projected treebank
• Test the Chinese parser on unseen test set (88 Chinese Treebank sentences)
Parser Evaluation Results

| Method | Training Corpus | Corpus Size | Parser Accuracy |
| --- | --- | --- | --- |
| Modify Prev (baseline) | - | - | 13.5 |
| Modify Next (baseline) | - | - | 35.7 |
| Stat. Parser | HKNews (Filtered) | 5284 | 42.3 |
| Stat. Parser (upper bound) | Chinese Treebank | 3870 | 75.6 |
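The slides do not define the two baselines. One plausible reading, assumed here, is that "Modify Prev" attaches every word to the word before it and "Modify Next" attaches every word to the word after it. Under that assumption they could be sketched as below; the function names are illustrative.

```python
from typing import Set, Tuple

Edge = Tuple[int, int]  # (head index, dependent index)


def modify_prev(n_words: int) -> Set[Edge]:
    """Assumed baseline: each word (except the first) modifies the previous word."""
    return {(i - 1, i) for i in range(1, n_words)}


def modify_next(n_words: int) -> Set[Edge]:
    """Assumed baseline: each word (except the last) modifies the next word."""
    return {(i + 1, i) for i in range(n_words - 1)}


prev_edges = modify_prev(3)
next_edges = modify_next(3)
```

Either edge set can then be scored against a gold tree in the same way as a parser's output.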
Conclusion
• We have presented a framework for acquiring Chinese dependency treebanks by bootstrapping from existing linguistic resources
• Although the projected trees may have an accuracy rate of nearly 70% in principle, reducing noise caused by word-alignment errors is still a major challenge
• A parser trained on the induced treebank can outperform some baselines
Future Work
• Obtain larger parallel corpus
• Reduce error rates of the word-alignment models
• Develop more sophisticated techniques to filter out noise in the induced treebank
• Improve the projection algorithm to handle unaligned words and inconsistent trees
Reserve slides
DPA Case 1: One-to-One
[Figure: schematic with uppercase nodes A, B and lowercase nodes a, b.]
DPA Case 2: Many-to-One
[Figure: schematic with uppercase nodes A1, A2, A3, B, C and lowercase nodes a, b, c.]
DPA Case 3: One-to-Many
[Figure: schematic with uppercase nodes A, B and lowercase nodes a1, a2, a3, b, plus an added node *a*.]
DPA Case 4: Many-to-Many
[Figure: schematic with uppercase nodes A1, A2, A3, B, C and lowercase nodes a1, a2, b, c, plus an added node *a*.]
DPA Case 5: Unaligned English Word
[Figure: schematic with uppercase nodes A, B, C and lowercase nodes a, c.]
DPA Case 6: Unaligned Foreign Word
[Figure: schematic with uppercase nodes A, C and lowercase nodes a, b, c.]