Subword and spatiotemporal models
for identifying actionable information
in Haitian Kreyol
Robert Munro
Stanford University
CoNLL 2011
Munro, Robert. "Subword and spatiotemporal models for identifying actionable
information in Haitian Kreyol." Proceedings of the Fifteenth Conference on
Computational Natural Language Learning. Association for Computational Linguistics,
2011.
http://www.robertmunro.com/research/munro11kreyol.pdf
Feedback
US Marines
◦ “Saving lives every day.”
FEMA
◦ “The most comprehensive and up-to-date
map available to the humanitarian
community.”
World Food Program
◦ “We delivered food to an informal camp of
2500 people that you identified for us.”
Prioritization
Only 2% of messages were
‘actionable’
◦ An identifiable location
◦ Medical, search and rescue (S+R), water,
clustered food requests, security,
unaccompanied children.
How can we prioritize the
actionable items in the original
Haitian Kreyol?
Can we leverage the models for
more-sparse information sources?
Evaluation data
Mission 4636. 40,811 text-messages
sent to a free number, ‘4636’, in Haiti.
(predominantly in Haitian Kreyol, with
translations, UN-defined categories,
and geolocation)
Radio Station. 7,528 text-messages
sent to a Haitian radio station.
Twitter. 63,195 Haiti-related tweets.
Variation is the norm
Kreyol → French: mesi, mèsi, mèci, meci, merci
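One simple way to collapse part of this variation is to strip accent marks before counting words. The sketch below is only an illustration under that assumption; it is not the subword model from the talk, and the function name `strip_diacritics` is hypothetical.

```python
import unicodedata

def strip_diacritics(token):
    """Lowercase a token and drop combining accent marks (e.g. 'è' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The five spellings from the slide.
variants = ["mesi", "mèsi", "mèci", "meci", "merci"]
normalized = sorted({strip_diacritics(v) for v in variants})
print(normalized)  # ['meci', 'merci', 'mesi']
```

Accent stripping merges five spellings into three; the remaining s/c and r alternations are the kind of variation that needs a subword model rather than a character-level cleanup.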
Abbrv.   Full Form    Pattern   Meaning
s’on     se yon       sVn       is a
avèn     avèk nou     VvVn      with us
relem    rele mwen    relem     call me
wap      ouap         uVp       you are
map      mwen ap      map       I will be
zanmim   zanmi mwen   zanmim    my friend
lavel    lave li      lavel     to wash (it)
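To make the abbreviation/full-form relationship concrete, here is a minimal sketch that emits features for both the surface token and its full form. It assumes a hand-written lookup holding a few rows of the table above; the models in the talk learn such alternations rather than listing them, and `FULL_FORMS` and `subword_features` are hypothetical names.

```python
# A few rows copied from the table above (illustrative only, not a learned model).
FULL_FORMS = {
    "s'on": "se yon",
    "relem": "rele mwen",
    "map": "mwen ap",
    "zanmim": "zanmi mwen",
    "lavel": "lave li",
}

def subword_features(message):
    """Return word features plus full-form features for known abbreviations."""
    feats = []
    for token in message.lower().replace("\u2019", "'").split():
        feats.append("word=" + token)
        full = FULL_FORMS.get(token)
        if full is not None:
            # Emit each word of the full form so 'relem' and 'rele mwen' share features.
            feats.extend("full=" + part for part in full.split())
    return feats

print(subword_features("relem"))  # ['word=relem', 'full=rele', 'full=mwen']
```

With features like these, a classifier that has seen "rele mwen" in training can still generalize to the contracted form "relem".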
Features
G : Words and ngrams
W : Subword patterns
P : Source of the message
T : Time received
C : Categories (c0,...,47)
L : Location (longitude and latitude)
L∃ : Has-location (a location is written
in the message)
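As a rough illustration of how some of these feature types might be combined into a single sparse vector, the sketch below builds G (words and ngrams), P (source) and T (time received) features. The encoding choices are assumptions: the day-level time bucket, the feature-name prefixes, and the example source tag and date are all hypothetical, and the W, C and location features are left out for brevity.

```python
from datetime import datetime

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text, source, received, max_n=2):
    """Build a sparse feature dict with G, P and T features."""
    tokens = text.lower().split()
    feats = {}
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            feats["G:" + gram] = 1.0
    feats["P:" + source] = 1.0
    # Assumption: time is bucketed by calendar day of receipt.
    feats["T:" + received.strftime("%Y-%m-%d")] = 1.0
    return feats

example = extract_features("rele mwen", "4636", datetime(2010, 1, 20, 14, 30))
print(sorted(example))
```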
Features
Subword models
◦ Full-forms and normalizations (Munro and
Manning 2010)
Time / Space / Source
Timestamp (in place of discounting)
Phone-number of sender
Spatial tile-membership
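Spatial tile-membership can be sketched as mapping a coordinate pair to the grid cell that contains it, so that nearby messages share a feature. The grid resolution below (10 tiles per degree) and the name `tile_id` are assumptions for illustration; the slide does not state the tile size used.

```python
import math

def tile_id(lat, lon, tiles_per_degree=10):
    """Return an identifier for the grid tile containing (lat, lon)."""
    row = math.floor(lat * tiles_per_degree)
    col = math.floor(lon * tiles_per_degree)
    return "tile:%d:%d" % (row, col)

# Two points in the Port-au-Prince area fall in the same tile.
print(tile_id(18.59, -72.31))  # tile:185:-724
print(tile_id(18.55, -72.35))  # tile:185:-724
```

Using the tile identifier as a categorical feature lets a model learn that messages from a given area are more or less likely to be actionable without needing to interpret raw coordinates.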
Additional streaming models
Message contains an identifiable
location
Prediction for 47 categories
[Figure: a sequence of models along a time axis]
Final streaming model
Prediction for ‘is actionable’
[Figure: a sequence of models along a time axis]
Combines features with predictions
from Category and Has-Location
models
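The two-tier streaming idea can be sketched as follows: each message is scored before its label is seen, and the first tier's prediction becomes an extra feature for the final 'is actionable' model. This is a minimal sketch under several assumptions: it uses a plain online logistic regression (a two-class MaxEnt model) trained by stochastic gradient steps, it includes only a Has-Location first tier and omits the category models, and the class and function names are hypothetical.

```python
import math

class OnlineLogistic:
    """Binary logistic regression updated one example at a time."""

    def __init__(self, learning_rate=0.5):
        self.weights = {}
        self.learning_rate = learning_rate

    def predict(self, feats):
        z = sum(self.weights.get(f, 0.0) * v for f, v in feats.items())
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, feats, label):
        error = label - self.predict(feats)
        for f, v in feats.items():
            self.weights[f] = self.weights.get(f, 0.0) + self.learning_rate * error * v

def stream_two_tier(messages):
    """messages: iterable of (feats, has_location, is_actionable) in arrival order."""
    location_model = OnlineLogistic()
    final_model = OnlineLogistic()
    scores = []
    for feats, has_location, is_actionable in messages:
        # Tier 1: predict has-location before the true label is revealed.
        p_location = location_model.predict(feats)
        # Tier 2: original features plus the tier-1 prediction.
        final_feats = dict(feats)
        final_feats["pred:has_location"] = p_location
        scores.append(final_model.predict(final_feats))
        # Only now train both models on this message.
        location_model.update(feats, has_location)
        final_model.update(final_feats, is_actionable)
    return scores

# Toy stream: one message type is actionable with a location, the other is neither.
toy = [({"G:dlo": 1.0}, 1, 1), ({"G:mesi": 1.0}, 0, 0)] * 20
scores = stream_two_tier(toy)
```

Predicting before updating keeps the evaluation honest in a streaming setting, since every score comes from a model that has not yet seen that message's label.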
Evaluation
100 training epochs
◦ Calculated on predictions over epochs 2-100
Comparison of two-tier architecture
with Oracle ‘has-location’ and
‘Categories’
Identification of actionable messages
in Radio Station and Twitter messages
(full results in paper)
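The results that follow are reported as F values. As a reminder of how such a score is computed, here is a small sketch assuming the balanced F-measure (the harmonic mean of precision and recall); that the reported figures are balanced F1 is an assumption, and `f_score` is a hypothetical helper.

```python
def f_score(gold, predicted):
    """Balanced F-measure for binary labels (1 = actionable, 0 = not)."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)
    false_pos = sum(1 for g, p in pairs if not g and p)
    false_neg = sum(1 for g, p in pairs if g and not p)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

print(f_score([1, 0, 0, 1, 0], [1, 1, 0, 0, 0]))  # 0.5

# With 2% actionable messages, always predicting 'not actionable' is
# 98% accurate but scores F = 0.
gold = [1] * 2 + [0] * 98
always_negative = [0] * 100
print(f_score(gold, always_negative))  # 0.0
```

This is why F is a more informative measure than raw accuracy when only a small fraction of messages is actionable.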
Feature-based improvements
Features                            F
Subwords and Source (G, T, W, P)    0.326
Temporal feature (G, T)             0.252
Words/Ngrams (G)                    0.207
Outperforming the oracle
Model                               F
Location (two-tier prediction)      0.310
Location (oracle)                   0.274
Words/Ngrams (G)                    0.207
Negative results from filtering
Model                               F
Oracle true-neg filtering           0.428
All Features/Models                 0.855
Words/Ngrams (G)                    0.207
Conclusions (usability)
Subword and spatio-temporal models
can give a 10-fold increase in
prioritization
Adding multi-tiered streaming models
can give a 50-fold increase in
prioritization
Cross-domain adaptation is possible
for need(le)-in-haystack information
extraction from social media
Appendix: abstract
Crisis-affected populations are often able to maintain digital
communications but in a sudden-onset crisis any aid organizations
will have the least free resources to process such communications.
Information that aid agencies can actually act on, ‘actionable’
information, will be sparse so there is great potential to
(semi)automatically identify actionable communications. However,
there are hurdles as the languages spoken will often be
underresourced, have orthographic variation, and the precise
definition of ‘actionable’ will be response-specific and evolving.
We present a novel system that addresses this, drawing on 40,000
emergency text messages sent in Haiti following the January 12,
2010 earthquake, predominantly in Haitian Kreyol. We show that
keyword/ngram-based models using streaming MaxEnt achieve up to
F=0.21 accuracy. Further, we find current state-of-the-art subword
models increase this substantially to F=0.33 accuracy, while
modeling the spatial, temporal, topic and source contexts of the
messages can increase this to a very accurate F=0.86 over direct
text messages and F=0.90-0.97 over social media, making it a viable
strategy for message prioritization.