![Page 1: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/1.jpg)
Real-World Semi-Supervised Learning of POS-Taggers for
Low-Resource Languages
Dan Garrette, Jason Mielens, and Jason Baldridge
Proceedings of ACL 2013
![Page 2: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/2.jpg)
Semi-Supervised Training
HMM with Expectation-Maximization (EM)
Need:
Large raw corpus
Tag dictionary
[Kupiec, 1992][Merialdo, 1994]
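To make the two requirements concrete, here is a minimal sketch of EM (Baum-Welch) for a bigram HMM in which the tag dictionary restricts which tags each word may emit. It is an illustration under stated assumptions, not the authors' implementation: the function name `em_hmm`, the toy corpus, the toy dictionary, and the smoothing constant are all hypothetical, and words missing from the dictionary are assumed to allow every tag.

```python
import numpy as np

def em_hmm(sentences, tag_dict, tags, iterations=10):
    """Baum-Welch for a bigram HMM whose emissions are masked by a tag dictionary."""
    words = sorted({w for s in sentences for w in s})
    w_id = {w: i for i, w in enumerate(words)}
    T, V = len(tags), len(words)
    # mask[t, w] = 1 if tag t is allowed for word w (assumption: unknown words allow all tags)
    mask = np.zeros((T, V))
    for w, j in w_id.items():
        for t in tag_dict.get(w, tags):
            mask[tags.index(t), j] = 1.0
    emit = mask / np.maximum(mask.sum(axis=1, keepdims=True), 1e-12)
    trans = np.full((T, T), 1.0 / T)
    init = np.full(T, 1.0 / T)
    for _ in range(iterations):
        init_c = np.zeros(T)
        trans_c = np.full((T, T), 1e-6)  # tiny smoothing keeps every row normalizable
        emit_c = np.zeros((T, V))
        for sent in sentences:
            obs = [w_id[w] for w in sent]
            n = len(obs)
            alpha = np.zeros((n, T))
            beta = np.ones((n, T))
            scale = np.zeros(n)
            # Scaled forward pass
            alpha[0] = init * emit[:, obs[0]]
            scale[0] = alpha[0].sum()
            alpha[0] /= scale[0]
            for i in range(1, n):
                alpha[i] = (alpha[i - 1] @ trans) * emit[:, obs[i]]
                scale[i] = alpha[i].sum()
                alpha[i] /= scale[i]
            # Scaled backward pass
            for i in range(n - 2, -1, -1):
                beta[i] = trans @ (emit[:, obs[i + 1]] * beta[i + 1]) / scale[i + 1]
            gamma = alpha * beta  # per-position tag posteriors
            init_c += gamma[0]
            for i in range(n):
                emit_c[:, obs[i]] += gamma[i]
            for i in range(n - 1):
                trans_c += (alpha[i][:, None] * trans
                            * (emit[:, obs[i + 1]] * beta[i + 1])[None, :] / scale[i + 1])
        # M-step: renormalize expected counts, keeping dictionary constraints
        init = init_c / init_c.sum()
        trans = trans_c / trans_c.sum(axis=1, keepdims=True)
        emit = emit_c * mask
        emit /= np.maximum(emit.sum(axis=1, keepdims=True), 1e-12)
    return init, trans, emit, words

# Toy raw corpus and tag dictionary (hypothetical data)
tags = ["DT", "NN", "VBZ", "NNS"]
raw = [["the", "dog", "walks"], ["the", "thug", "walks"], ["a", "dog", "barks"]]
tag_dict = {"the": ["DT"], "a": ["DT"], "dog": ["NN"], "walks": ["VBZ", "NNS"]}
init, trans, emit, words = em_hmm(raw, tag_dict, tags)
```

The dictionary only removes tag options; EM still has to decide among the remaining tags from the raw text alone, which is why both a raw corpus and a dictionary are needed.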
![Page 3: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/3.jpg)
Previous Work
Supervised learning
• Provides high accuracy for POS tagging (Manning, 2011).
• Performs poorly when little supervision is available.
Semi-supervised learning
• Done by training sequence models such as HMMs using the EM algorithm.
• Work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).
![Page 4: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/4.jpg)
Previous Work
Goldberg et al. (2008)
• Manually constructed a lexicon for Hebrew to train an HMM tagger.
• The lexicon was developed over a long period of time by expert lexicographers.
Täckström et al. (2013)
• Evaluated the use of mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages.
• Large parallel corpora are required.
![Page 5: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/5.jpg)
Low-Resource Languages
6,900 languages in the world
~30 have non-negligible quantities of data
No million-word corpus for any endangered language
[Maxwell and Hughes, 2006][Abney and Bird, 2010]
![Page 6: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/6.jpg)
Low-Resource Languages
Kinyarwanda (KIN): Niger-Congo. Morphologically rich.
Malagasy (MLG): Austronesian. Spoken in Madagascar.
Also, English
![Page 7: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/7.jpg)
Collecting Annotations
• Supervised training is not an option.
• Semi-supervised training:
  • Annotate some data by hand in 4 hours (in 30-minute intervals) for two tasks:
    • Type supervision.
    • Token supervision.
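The difference between the two kinds of supervision can be shown as data structures. The sketch below is illustrative only: the entries reuse the toy words from the slides' example ("the dog walks"), and the helper names `tag_dict_from_tokens` and `merge_tag_dicts` are hypothetical.

```python
from collections import defaultdict

# Type supervision: the annotator lists allowed tags for word types.
type_annotations = {"the": {"DT"}, "dog": {"NN"}}

# Token supervision: the annotator tags every token of running sentences.
token_annotations = [[("the", "DT"), ("dog", "NN"), ("walks", "VBZ")]]

def tag_dict_from_tokens(tagged_sentences):
    """Collect the tags observed for each word in token-annotated sentences."""
    observed = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            observed[word].add(tag)
    return dict(observed)

def merge_tag_dicts(a, b):
    """Union two tag dictionaries word by word."""
    merged = {word: set(tag_set) for word, tag_set in a.items()}
    for word, tag_set in b.items():
        merged.setdefault(word, set()).update(tag_set)
    return merged

tag_dict = merge_tag_dicts(type_annotations, tag_dict_from_tokens(token_annotations))
```

Token annotations also carry tag-sequence information (for example, that DT is followed by NN), which a type-level dictionary does not record.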
![Page 8: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/8.jpg)
Tag Dict Generalization
These annotations are too sparse!
Generalize to the entire vocabulary
![Page 9: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/9.jpg)
Tag Dict Generalization
Haghighi and Klein (2006) do this with a vector space.
We don’t have enough raw data.
Das and Petrov (2011) do this with a parallel corpus.
We don’t have a parallel corpus.
![Page 10: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/10.jpg)
Tag Dict Generalization
Strategy: Label Propagation
• Connect annotations to raw corpus tokens
• Push tag labels to entire corpus
[Talukdar and Crammer, 2009]
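The idea of pushing labels from annotated nodes to the rest of the graph can be sketched with a very simple propagation scheme: seed nodes stay fixed, and every other node repeatedly takes the normalized sum of its neighbors' tag distributions. This is a simplified stand-in, not the algorithm of the cited work, and the edge list below is an assumed toy graph using node names in the style of the slides.

```python
from collections import defaultdict

def propagate(edges, seeds, tags, iterations=10):
    """Naive label propagation: seeds are clamped, other nodes average their neighbors."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    for node in seeds:
        neighbors[node]  # make sure every seed is a graph node
    dist = {node: {t: 0.0 for t in tags} for node in neighbors}
    for node, tag in seeds.items():
        dist[node][tag] = 1.0
    for _ in range(iterations):
        updated = {}
        for node in neighbors:
            if node in seeds:
                updated[node] = dist[node]  # annotated nodes keep their label
                continue
            total = {t: sum(dist[m][t] for m in neighbors[node]) for t in tags}
            z = sum(total.values())
            updated[node] = {t: (v / z if z else 0.0) for t, v in total.items()}
        dist = updated
    return dist

# Assumed toy graph: tokens link to their type and context; types link to affixes.
edges = [
    ("TOK_the_1", "TYPE_the"), ("TOK_the_4", "TYPE_the"),
    ("TOK_dog_2", "TYPE_dog"), ("TOK_thug_5", "TYPE_thug"),
    ("TOK_dog_2", "PREV_the"), ("TOK_thug_5", "PREV_the"),
    ("TYPE_the", "PRE2_th"), ("TYPE_thug", "PRE2_th"),
    ("TYPE_dog", "SUF1_g"), ("TYPE_thug", "SUF1_g"),
]
seeds = {"TYPE_the": "DT", "TYPE_dog": "NN"}  # type annotations as seed labels
dist = propagate(edges, seeds, tags=["DT", "NN"])
```

In this toy graph the unannotated type `thug` ends up with a mixed distribution, because it shares a prefix node with `the` and a suffix and context node with `dog`.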
![Page 11: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/11.jpg)
Morphological Transducers
• Finite-state transducers (FSTs) are used for morphological analysis.
• An FST accepts a word type and produces a set of morphological features.
• Power of FSTs: they can analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word.
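The affix-based guessing behavior can be imitated with plain string matching. The sketch below is not a finite-state transducer and is not the analyzer used in the work; the affix tables, feature names, and the `analyze` function are hypothetical English examples chosen only to show the idea of stripping known affixes and treating the remainder as a guessed stem.

```python
# Hypothetical affix inventories mapping an affix to a morphological feature.
PREFIXES = {"un": "NEG"}
SUFFIXES = {"ed": "PAST", "s": "3SG", "ing": "PROG"}

def analyze(word, min_stem=2):
    """Return every (guessed_stem, features) pair consistent with the known affixes."""
    analyses = set()
    for prefix in [""] + list(PREFIXES):
        for suffix in [""] + list(SUFFIXES):
            if not (word.startswith(prefix) and word.endswith(suffix)):
                continue
            stem = word[len(prefix):len(word) - len(suffix)]
            if len(stem) < min_stem:
                continue  # reject analyses that leave almost no stem
            features = tuple(
                f for f in (PREFIXES.get(prefix), SUFFIXES.get(suffix)) if f
            )
            analyses.add((stem, features))
    return analyses

# The stem "walk" never has to appear in a lexicon for this analysis to be proposed.
guesses = analyze("unwalked")
```

Each analysis can then serve as a feature shared between word types, so that unseen words with familiar affixes are linked to annotated words.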
![Page 12: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/12.jpg)
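Tag Dict Generalization
[Figure: label propagation graph. Token nodes: TOK_the_1, TOK_the_4, TOK_the_9, TOK_thug_5, TOK_dog_2. Type nodes: TYPE_the, TYPE_thug, TYPE_dog. Context nodes: PREV_<b>, PREV_the, NEXT_thug, NEXT_walks. Affix nodes: PRE1_t, PRE2_th, PRE1_d, PRE2_do, SUF1_e, SUF1_g.]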
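The node names in this graph suggest how it can be generated from a tokenized raw corpus. The following is a minimal sketch under assumptions: the function `build_edges`, the running 1-based token index, the use of `<b>` for both sentence boundaries, and the choice to attach context features to tokens and affix features to types are illustrative guesses based on the node names, not a confirmed description of the original graph.

```python
def build_edges(sentences, max_affix=2):
    """Create graph edges linking tokens, types, context features and affix features."""
    edges = set()
    position = 0  # running token index across the corpus
    for sentence in sentences:
        padded = ["<b>"] + sentence + ["<b>"]  # assumed boundary marker on both sides
        for i, word in enumerate(sentence, start=1):
            position += 1
            token = f"TOK_{word}_{position}"
            word_type = f"TYPE_{word}"
            edges.add((token, word_type))
            edges.add((token, f"PREV_{padded[i - 1]}"))
            edges.add((token, f"NEXT_{padded[i + 1]}"))
            for k in range(1, min(max_affix, len(word)) + 1):
                edges.add((word_type, f"PRE{k}_{word[:k]}"))
                edges.add((word_type, f"SUF{k}_{word[-k:]}"))
    return edges

edges = build_edges([["the", "dog", "walks"], ["the", "thug", "walks"]])
```

Because `TYPE_the` and `TYPE_thug` both connect to `PRE2_th`, a label placed on one type can reach the other through the shared affix node.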
![Page 13: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/13.jpg)
Tag Dict Generalization
Type Annotations: the → DT, dog → NN
[Figure: label propagation graph with nodes TYPE_the, TYPE_thug, TYPE_dog; PREV_<b>, PREV_the, NEXT_walks; PRE1_t, PRE2_th, SUF1_g; TOK_the_1, TOK_the_4, TOK_thug_5, TOK_dog_2.]
![Page 14: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/14.jpg)
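Tag Dict Generalization
Type Annotations: the, dog
[Figure: label propagation graph in which the type annotations have been moved onto the graph as seed labels: DT on TYPE_the and NN on TYPE_dog. Other nodes: PREV_<b>, PREV_the, NEXT_walks; PRE1_t, PRE2_th, SUF1_g; TYPE_thug; TOK_the_1, TOK_the_4, TOK_thug_5, TOK_dog_2.]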
![Page 15: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/15.jpg)
Tag Dict Generalization
Type Annotations: the, dog
Token Annotations: the/DT dog/NN walks/VBZ
[Figure: label propagation graph with nodes TYPE_the, TYPE_thug, TYPE_dog; PREV_<b>, PREV_the, NEXT_walks; PRE1_t, PRE2_th, SUF1_g; TOK_the_1, TOK_the_4, TOK_thug_5, TOK_dog_2.]
![Page 16: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/16.jpg)
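Tag Dict Generalization
Type Annotations: the, dog
Token Annotations: the dog walks
[Figure: label propagation graph in which the token annotations have been moved onto token nodes as seed labels: DT appears on TOK_the_4 and NN on TOK_dog_2. Other nodes: TYPE_the, TYPE_thug, TYPE_dog; PREV_<b>, PREV_the, NEXT_walks; PRE1_t, PRE2_th, SUF1_g; TOK_the_1, TOK_thug_5.]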
![Page 17: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/17.jpg)
Model Minimization
[Ravi et al., 2010; Garrette and Baldridge, 2012]
• The LP graph has a node for each corpus token.
• Each node is labelled with a distribution over POS tags.
• The graph provides a corpus of sentences labelled with noisy tag distributions.
• Greedily seek the minimal set of tag bigrams that describe the raw corpus.
• Then use an HMM trained by EM.
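The greedy search for a small set of tag bigrams can be sketched as a set-cover loop. This is a simplified illustration, not the procedure of the cited papers: it ignores the weights from the noisy tag distributions and treats each token as a plain set of candidate tags, and the function name `greedy_tag_bigrams` and the toy input are hypothetical.

```python
from collections import Counter

def greedy_tag_bigrams(sentences):
    """Pick tag bigrams greedily until every adjacent token pair is covered.

    sentences: list of sentences, each a list of candidate-tag sets (one per token).
    """
    uncovered = {}
    for s, sentence in enumerate(sentences):
        for i in range(len(sentence) - 1):
            candidates = {(a, b) for a in sentence[i] for b in sentence[i + 1]}
            if candidates:  # pairs with no candidates cannot be covered
                uncovered[(s, i)] = candidates
    chosen = []
    while uncovered:
        counts = Counter(bg for cands in uncovered.values() for bg in cands)
        # Take the bigram covering the most uncovered pairs; sort for stable ties.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        uncovered = {k: v for k, v in uncovered.items() if best not in v}
    return chosen

# Toy input: candidate tags per token for "the dog walks" and "the thug walks".
corpus = [
    [{"DT"}, {"NN"}, {"VBZ"}],
    [{"DT"}, {"NN", "VBZ"}, {"VBZ", "NNS"}],
]
chosen = greedy_tag_bigrams(corpus)
```

Here two bigrams suffice to explain both sentences, which implicitly resolves the ambiguous tokens in the second sentence to NN and VBZ; a tagging chosen this way can then be used to start EM training of the HMM.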
![Page 18: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/18.jpg)
Overall Accuracy
[Bar chart: accuracy (0% to 100%) for three settings: KIN using all types; MLG using half types and half tokens; ENG using all types and the maximal amount of data.]
All of these values were achieved using both FST and affix LP features.
![Page 19: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/19.jpg)
Results
![Page 20: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/20.jpg)
Types versus Tokens
![Page 21: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/21.jpg)
Mixing Type and Token Annotations
![Page 22: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/22.jpg)
Morphological Analysis
![Page 23: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/23.jpg)
Annotator Experience
![Page 24: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013](https://reader030.vdocument.in/reader030/viewer/2022032805/56649efb5503460f94c0d70c/html5/thumbnails/24.jpg)
Conclusion
• Type annotations are the most useful input from a linguist.
• We can train effective POS-taggers for low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a non-native linguist.