the$world$inside$words:$informaon$ …the$world$inside$words:$informaon$...
TRANSCRIPT
![Page 1: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/1.jpg)
The world inside words: informa1on extrac1on and labeling in low resource languages through
subword models
Robert Munro CEO, Idibon
(research while at Stanford) MicrosoB Research, October 2012
![Page 2: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/2.jpg)
About me
• CEO of Idibon – Language technology startup
• Former CTO of epidemicIQ – Tracking outbreaks with NLP and crowdsourcing
• PhD from Stanford – Computa1onal linguis1cs
• Coordinator of Mission 4636 – Emergency response in Hai1
• Power Infrastructure in West Africa – Energy for Opportunity / UN
• Traveler – 20 countries by bicycle
![Page 3: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/3.jpg)
Acknowledgments
• Chris Manning • Dan Jurafsky • Tapan Parikh
• Stanford NLP • Stanford Linguis1cs
![Page 4: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/4.jpg)
Technology for low resource languages
• Microso' Translator Hub • One of the most important recent advances! • c/o Will Lewis, Kris1n Tolle (MSR Redmond) • I am interested to hear more about MSR’s work using language technologies to augmented textbooks!
![Page 5: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/5.jpg)
Daily poten1al language exposure
5 5 5 5 5 5 4.5 4
50
1500
5000
2000
1400
720540 500
Year
# of languages
![Page 6: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/6.jpg)
5 5 5 5 5 5 4.5 4
50
1500
5000
2000
1400
720540 500
Daily poten1al language exposure
Year
# of languages
![Page 7: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/7.jpg)
5 5 5 5 5 5 4.5 4
50
1500
5000
2000
1400
720540 500
Daily poten1al language exposure
Year
# of languages
![Page 8: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/8.jpg)
5 5 5 5 5 5 4.5 4
50
1500
5000
2000
1400
720540 500
Daily poten1al language exposure
Year
# of languages
We will never be so under-‐resourced as right now
![Page 9: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/9.jpg)
Mo1va1on
2000: 1 Trillion
2007: 5 Trillion
2012 (es1mate): 9 Trillion
1 Interna3onal Telecommunica3on Union (ITU), 2012. hcp://www.itu.int/ITU-‐D/ict/sta1s1cs/
• Text messaging – Most popular form of remote communica1on in much of the world 1
– Especially in areas of linguis1c diversity
– Licle research
![Page 10: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/10.jpg)
ACM, IEEE and ACL publica1ons
Actual Usage Recent Research
(90% coverage)
(35% coverage)
![Page 11: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/11.jpg)
Outline
• What do short message communica1ons look like in most languages?
• How can we model the inherent varia1on? • Can we create accurate classifica1on systems despite the varia1on?
• Can we leverage loosely aligned transla1ons for informa1on extrac1on?
![Page 12: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/12.jpg)
Data – short messages used here
• 600 text messages sent between health workers in Malawi, in Chichewa
• 40,000 text messages sent from the Hai1an popula1on, in Hai1an Kreyol
• 500 text messages sent from the Pakistani popula1on, in Urdu
• Twicer messages from Hai1 and Pakistan • English transla1ons
![Page 13: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/13.jpg)
Chichewa, Malawi
• 600 text messages sent between health workers, with transla1ons and 0-‐9 labels
Bantu
1. Pa1ent-‐related 2. Clinic-‐admin 3. Technological 4. Response 5. Request for doctor 6. Medical advice 7. TB: tuberculosis 8. HIV 9. Death of pa1ent
![Page 14: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/14.jpg)
Hai1an Kreyol
• 40,000 text messages sent from the Hai1an popula1on to interna1onal relief efforts (Mission 4636) – ~40 labels (request for food, emergency, logis1cs, etc)
– Transla1ons – Named-‐en11es
• 60,000 tweets
![Page 15: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/15.jpg)
Urdu, Pakistan
• 500 text messages sent from the Pakistan popula1on to interna1onal relief efforts – ~40 labels – Transla1ons
• 1,000 tweets
![Page 16: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/16.jpg)
Outline
• What do short message communica1ons look like in most languages?
• How can we model the inherent varia1on? • Can we create accurate classifica1on systems despite the varia1on?
• Can we leverage loosely aligned transla1ons for informa1on extrac1on?
![Page 17: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/17.jpg)
Most NLP research to date assumes the standardiza1on found in wricen English
![Page 18: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/18.jpg)
English
• Genera1ons of standardiza1on in spelling and simple morphology – Whole words suitable as features for NLP systems
• Most other languages – Rela1vely complex morphology – Less (observed) standardized spellings – More dialectal varia1on
• ‘Subword varia3on’ used to refer to any difference in forms resul1ng from the above
![Page 19: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/19.jpg)
The extent of the subword varia1on
• >30 spellings of odwala (‘pa1ent’) in Chichewa • >50% variants of ‘odwala’ occur only once in the data used here: – Affixes and incorpora1on
• ‘kwaodwala’ -‐> ‘kwa + odwala’ • ‘ndiodwala’ -‐> ‘ndi odwala’ (official ‘ngodwala’ not present)
– Phonological/Orthographic • ‘odwara’ -‐> ‘odwala’ • ‘ndiwodwala’ -‐> ‘ndi (w) odwala’
![Page 20: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/20.jpg)
Chichewa
The word odwala (‘pa1ent’) in 600 text-‐messages in Chichewa and the English transla1ons
![Page 21: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/21.jpg)
Chichewa
• Morphology: affixes and incorpora1on ndi-‐ta-‐ma-‐mu-‐fun-‐a-‐nso 1PS-‐IMPLORE-‐PRESENT-‐2PS-‐want-‐VERB-‐also “I am also currently wan1ng you very much”
a-‐ta-‐ma-‐ka-‐fun-‐a-‐nso class2.PL-‐IMPLORE-‐PRESENT-‐class12.SG-‐want-‐VERB-‐also “They are also currently wan1ng it very much”
• More than 30 forms for fun (‘want’), 80% novel
![Page 22: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/22.jpg)
Hai1an Krèyol
• More or less French spellings • More or less phone1c spellings • Frequent words (esp pronouns) are shortened and compounded
• Regional slang / abbrevia1ons
![Page 23: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/23.jpg)
Hai1an Krèyol
mèsi, mesi, mèci, merci
C a p -‐ H a ï t i e n K a p a y i s y e n
![Page 24: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/24.jpg)
Urdu
• The least variant of the three languages here – Deriva1onal morphology
• Zaroori / zaroorath – Vowels and nonphonemic characters
• Zaruri / zaroorat
zaroori (‘need’)
![Page 25: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/25.jpg)
If it follows pacerns, we can model it
![Page 26: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/26.jpg)
Outline
• What do short message communica1ons look like in most languages?
• How can we model the inherent varia1on? • Can we create accurate classifica1on systems despite the varia1on?
• Can we leverage loosely aligned transla1ons for informa1on extrac1on?
![Page 27: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/27.jpg)
Subword models
• Segmenta1on – Separate into cons1tuent morphemes:
nditamamufunanso -‐> ndi-‐ta-‐ma-‐mu-‐fun-‐a-‐nso
• Normaliza1on – Model phonological, orthographic, more or less phone1c spellings: odwela, edwala, odwara -‐> odwala
![Page 28: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/28.jpg)
Language Specific
• Segmenta1on – Hand-‐coded morphological parser (Mchombo, 2004; Paas, 2005) 1
• Normaliza1on – Rule-‐based
ph -‐> f, etc.
1 robertmunro.com/research/chichewa.php
Table 4.1: Morphological paradigms for Chichewa verbs
![Page 29: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/29.jpg)
Language Independent
• Segmenta1on (Goldwater et al., 2009) – Context Sensi1ve Hierarchical Dirichlet Process, with morphemes, mi drawn from distribu1on G generated from Dirichlet Process DP(α0, P0), with Hm = DP for a specific morpheme:
• Extension to morphology: – Enforce exis1ng spaces as morpheme boundaries – Iden1fy free morphemes as min P0, per word
ndi mafuna -‐> ndi-‐ma-‐funa manthwala
![Page 30: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/30.jpg)
Language Independent
Table 4.3: Phone1cally, phonologically & orthographically mo1vated alterna1on candidates.
• Normaliza1on – Mo1vated from minimal pairs in the corpus, C
– Subs1tu1on, H, applied to a word, w, producing w' iff w' ∈ C
ndiwodwala -‐> ndiodwala
![Page 31: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/31.jpg)
Evalua1on – downstream accuracy
• Most morphological parsers are evaluated on gold data and limited to prefixes or suffixes only: – Linguis3ca (Goldsmith, 2001), Morphessor (Creutz, 2006)
• Classifica1on accuracy (macro-‐f, all labels):
Chichewa Language Specific independent
Segmenta1on: 0.476 0.425 Normaliza1on: 0.396 0.443 Combined: 0.484 0.459
![Page 32: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/32.jpg)
Other subword modeling results
• Stemming vs Segmenta1on – Stemming can harm Chichewa 1
– Segmenta1on most accurate when modeling discon1nuous morphemes 1
• Hand-‐craBed parser – Over-‐segments non-‐verbs (cf Porter Stemmer for English) – Under-‐segments compounds
• Acronym iden1fica1on – Improves accuracy & can be broadly implemented 1
1 Munro and Manning, (2010)
![Page 33: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/33.jpg)
Are subword models needed for classifica1on?
![Page 34: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/34.jpg)
Outline
• What do short message communica1ons look like in most languages?
• How can we model the inherent varia1on? • Can we create accurate classifica1on systems despite the varia1on?
• Can we leverage loosely aligned transla1ons for informa1on extrac1on?
![Page 35: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/35.jpg)
Classifica1on
• Stanford Classifier – Maximum Entropy Classifier (Klein and Manning, 2003)
• Binary predic1on of the labels associated with each message – Leave-‐one-‐out cross-‐valida1on – Micro-‐f
• Comparison of methods with and without subword models
![Page 36: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/36.jpg)
Strategy ndimmafunamanthwala
(‘I currently need medicine’)
ndimafunamantwala
ndi-‐ma-‐fun-‐a man-‐twala
ndi-‐ma-‐fun-‐a man-‐twala
ndi -‐fun man-‐twala(“I need medicine”)
Category = “Request for aid”
ndi kufuni mantwara(‘my want of medicine’)
ndi kufuni mantwala
ndi-‐ku-‐fun-‐i man-‐twala
ndi-‐ku-‐fun-‐iman-‐twala
ndi -‐fun man-‐twala(“I need medicine”)
Category = “Request for aid”
1) Normalize spellings
2) Segment
3) Identify predictors
1 in 5 classification errors with raw messages
1 in 20 classification error post-‐processing. Improveswith scale.
![Page 37: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/37.jpg)
Comparison with English
• bcb
Accuracy: M
icro-‐f
Percent of training data
![Page 38: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/38.jpg)
Streaming architecture
• Poten1al accuracy in a live, constantly upda1ng system – Time sensi1ve and 1me-‐changing
• Kreyol ‘is ac1onable’ category – Any message that could be responded to
(request for water, medical assistance, clustered requests for food, etc )
![Page 39: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/39.jpg)
Streaming architecture
• Build from ini1al items
1me Model
![Page 40: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/40.jpg)
Streaming architecture
• Predict (and evaluate) on incoming items – (penalty for training)
1me Model
![Page 41: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/41.jpg)
Streaming architecture
• Repeat / retrain
1me Model
![Page 42: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/42.jpg)
Streaming architecture
• Repeat / retrain
1me Model
![Page 43: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/43.jpg)
Streaming architecture
• Repeat / retrain
1me Model
![Page 44: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/44.jpg)
Streaming architecture
• Repeat / retrain
1me Model
![Page 45: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/45.jpg)
Streaming architecture
• Repeat / retrain
1me Model
![Page 46: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/46.jpg)
Features
• G : Words and ngrams • W : Subword pacerns • P : Source of the message • T : Time received • C : Categories (c0,...,47) • L : Loca1on (longitude and la1tude) • L∃ : Has-‐loca1on (a loca1on is wricen in the message)
![Page 47: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/47.jpg)
Hierarchical predic1on for ‘is ac1onable’
1me
Model
1me Model
1me Model
1me Model
…
Combines features with predic1ons from Category and Has-‐Loca1on models
predic3ng ‘is ac3onable’
predic3ng ‘has loca3on’
predic3ng ‘category 1’
predic3ng ‘category n’
![Page 48: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/48.jpg)
Results – subword models
• Also a gain in streaming models
Precision Recall F-‐value Baseline 0.622 0.124 0.207 W Subword 0.548 0.233 0.326
![Page 49: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/49.jpg)
Results – overall
• Gain of F > 0.6 for full hierarchical system, over baseline of words/phrases only
Precision Recall F-‐value Baseline 0.622 0.124 0.207 Final 0.872 0.840 0.855
![Page 50: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/50.jpg)
Other classifica1on results
• Urdu and English – Subword models improve Urdu & English tweets 1
• Domain dependence – Modeling the source improves accuracy 1
• Semi-‐supervised streaming models – Lower F-‐value but consistent priori1za1on 2
• Hierarchical streaming predic1ons – Outperforms oracle for ‘has loca1on’ 2
• Extension with topic models – Improves non-‐con1guous morphemes 3
1 Munro and Manning, (2012); 2 Munro, (2011); 3 Munro and Manning, (2010)
![Page 51: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/51.jpg)
Can we move beyond classifica1on to informa1on extrac1on?
![Page 52: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/52.jpg)
Outline
• What do short message communica1ons look like in most languages?
• How can we model the inherent varia1on? • Can we create accurate classifica1on systems despite the varia1on?
• Can we leverage loosely aligned transla1ons for informa1on extrac1on?
![Page 53: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/53.jpg)
Named En1ty Recogni1on
• Iden1fying men1ons of People, Loca1ons, and Organiza1ons – Informa1on extrac1on / parsing / Q+A
• Typically a high-‐resource task – Tagged corpus (Finkel and Manning, 2010) – Extensive hand-‐craBed rules (Chi1carui, 2010)
• How far can we get with loosely aligned text? – One of the only resources for most languages
![Page 54: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/54.jpg)
Example
Lopital Sacre-‐Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-‐Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.
![Page 55: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/55.jpg)
The intui1on
Lopital Sacre-‐Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-‐Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital. Do named en11es have the least edit distance?
![Page 56: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/56.jpg)
The intui1on
Lopital Sacre-‐Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-‐Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.
![Page 57: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/57.jpg)
The intui1on
Lopital Sacre-‐Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-‐Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.
![Page 58: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/58.jpg)
The intui1on
Lopital Sacre-‐Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-‐Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.
![Page 59: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/59.jpg)
The intui1on
Lopital Sacre-‐Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-‐Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital.
![Page 60: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/60.jpg)
The complica1ons
Lopital Sacre-‐Coeur ki nan vil Milot, 14 km nan sid vil Okap, pre pou li resevwa moun malad e lap mande pou moun ki malad yo ale la. Sacre-‐Coeur Hospital which located in this village Milot 14 km south of Oakp is ready to receive those who are injured. Therefore, we are asking those who are sick to report to that hospital. Capitaliza1on of en11es
was not always consistent
Slang/abbrevia1ons/alternate spellings for ‘Okap’ are frequent: ‘Cap-‐Hai1en’, ‘Cap Hai1en’, ‘Kap’, ‘Kapayisyen’
![Page 61: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/61.jpg)
3 Steps for Named En1ty Recogni1on
1. Generate seeds by calcula1ng the edit likelihood devia1on.
2. Learn context, word-‐shape and alignment models. 3. Learn weighted models incorpora1ng supervised
predic1ons.
![Page 62: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/62.jpg)
Step 1: Edit distance (Levenshtein)
• Number of subs1tu1ons, dele1ons or addi1ons to convert one string to another – Minimum Edit Distance: min between parallel text – String Similarity Es3mate: normalized by length – Edit Likelihood Devia3on: similarity, rela1ve to average similarity in parallel text (z-‐score)
– Weighted Devia3on Es3mate: combina1on of Edit Likelihood Devia1on and String Similarity Es1mate
![Page 63: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/63.jpg)
Example
• Edit distance: 6 • String Similarity: ~0.45 “Voye manje medikaman pou moun kie nan lopital Kapayisyen” “Send food and medicine for people in the Cap Hai1an hospitals” – Average & standard dev similarity: μ=0.12, σ=0.05 – Edit Likelihood Devia3on: 6.5 (good candidate)
“Voye manje medikaman pou moun kie nan lopital Kapayisyen” “They said to send manje medikaman for lopital Cap Hai1an” – Average & standard dev similarity: μ=0.21, σ=0.11 – Edit Likelihood Devia3on: 2.2 (doub�ul candidate)
C a p -‐ H a ï t i e n K a p a y i s y e n
![Page 64: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/64.jpg)
Equa1ons for edit-‐distance based metrics
• Given a string in a message and transla1on MS, M'S' Levenshtein distance LEV() String Similarity Es3mate SSE() Average AV() Standard Devia3on SD() Edit Likelihood Devia3on ELD() Normalizing Func3on N() Weighted Devia3on Es3mate WDE()
![Page 65: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/65.jpg)
Comparison of edit-‐distance based metrics
Past research used global edit-‐distance metrics (Song and Strassel, 2008) This line of research not pursued aBer REFLEX workshop.
Novel to this research: local devia1on in edit-‐distance.
En1ty candidates, ordered by confidence
Precision
![Page 66: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/66.jpg)
Step 2: Seeding a model
• Take the top 5% matches by WDE() – Assign an ‘en1ty’ label
• Take the bocom 5% matches by WDE() – Assign a ‘not-‐en1ty’ label
• Learn a model • Note: the bocom 5% were s1ll the best match for the given message/transla1on – Targe1ng the boundary condi1ons
![Page 67: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/67.jpg)
Features
… ki nan vil Milot, 14 km nan sid … … located in this village Milot 14 km south of …
• Context: BEF_vil, AFT_14 / BEF_village, AFT_14 • Word Shape: SHP_Ccp / SHP_Cc • Subword: SUB_<b>Mi, SUB_<b>Mil, SUB_il, … • Alignment: ALN_8_words, ALN_4_perc • Combina1ons: SHP_Cc_ALN_4_perc, …
![Page 68: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/68.jpg)
Strong results
• Joint-‐learning across both languages Precision Recall F-‐value Kreyol 0.904 0.794 0.846 English 0.915 0.813 0.861
• Language-‐specific:
Precision Recall F-‐value Kreyol 0.907 0.687 0.781 English 0.932 0.766 0.840
![Page 69: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/69.jpg)
Effec1ve extension over edit-‐distance Joint-‐pred
ic1o
n
String Similarity Es1mate
![Page 70: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/70.jpg)
Domain adap1on
• Joint-‐learning across both languages Precision Recall F-‐value Kreyol 0.904 0.794 0.846 English 0.915 0.813 0.861
• Supervised (MUC/CoNLL-‐trained Stanford NER):
Precision Recall F-‐value English 0.915 0.206 0.336
Completely unsupervised, using ~3,000 sentences loosely aligned with Kreyol
Fully supervised, trained over 10,000s of manually tagged sentences in English
![Page 71: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/71.jpg)
Step 3: Combined supervised model
… ki nan vil Milot, 14 km nan sid … … located in this village Milot 14 km south of …
Step 3a: Tag English sequences from a model trained on English corpora (Sang, 2002; Sang and De Meulder, 2003; Finkel and Manning, 2010)
Step 3b: Propagate across the candidate alignments, in combina1on with features (context, word-‐shape, etc)
![Page 72: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/72.jpg)
Combined supervised model
• Joint-‐learning across both languages Precision Recall F-‐value Kreyol 0.904 0.794 0.846 English 0.915 0.813 0.861
• Combined Supervised and Unsupervised
Precision Recall F-‐value Kreyol 0.838 0.902 0.869 English 0.846 0.916 0.880
![Page 73: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/73.jpg)
Other informa1on extrac1on results
• Other edit-‐distance func1ons (eg: Jaro-‐Winkler) – Make licle difference in the seed step -‐ the devia1on measure is the key feature 1
• Named en1ty discrimina1on – Dis1nguishing People, Loca1ons and Organiza1ons is reasonably accurate with licle data 1
• Clustering contexts – No clear gain – probably due to sparse data
1 Munro and Manning, (under review)
![Page 74: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/74.jpg)
Outline
• What do short message communica1ons look like in most languages?
• How can we model the inherent varia1on? • Can we create accurate classifica1on systems despite the varia1on?
• Can we leverage loosely aligned transla1ons for informa1on extrac1on?
![Page 75: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/75.jpg)
Conclusions
• It is necessary to model the subword varia1on found in many of the world’s short-‐message communica1ons
• Subword models can significantly improve classifica1on tasks in these languages
• The same subword varia1on, cross-‐linguis1cally, can be leveraged for accurate named en1ty recogni1on
![Page 76: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/76.jpg)
Conclusions
• More research is needed
2000: 1 Trillion
2007 (start of PhD): 5 Trillion
2012 (es1mate): 9 Trillion
![Page 77: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/77.jpg)
Thank you
![Page 78: The$world$inside$words:$informaon$ …The$world$inside$words:$informaon$ extrac1on$and$labeling$in$low$ resource$languages$through$ subwordmodels RobertMunro$ CEO,$Idibon$ (research$while$atStanford](https://reader036.vdocument.in/reader036/viewer/2022063005/5fb39f086ab47113592e2594/html5/thumbnails/78.jpg)
References
Munro, R. and Manning, C. D. (2010). Subword varia1on in text message classifica1on. Proceedings of the Annual Conference of the North American Chapter of the Associa3on for Computa3onal Linguis3cs (NAACL 2010), Los Angeles, CA.
Munro, R. (2011). Subword and spa1otemporal models for iden1fying ac1onable informa1on in Hai1an Kreyol. Proceedings of the Fi'eenth Conference on Natural Language Learning (CoNLL 2011), Portland, OR.
Munro, R. and Manning, C. D. (2012). Short message communica1ons: users, topics, and in-‐language processing. Proceedings of the Second Annual Symposium on Compu3ng for Development (ACM DEV 2012), Atlanta, GA.
Munro, R. and Manning, C. D. (2012). Accurate Unsupervised Joint Named-‐En1ty Extrac1on from Unaligned Parallel Text. Proceedings of the Named En33es Workshop (NEWS 2012), Jeju, Korea.