Data Science for Social Good and Ushahidi - Final Presentation

Ushine Plug-In: Using machine learning and natural language processing to improve the human review process of crisis reports

Upload: ushahidi

Post on 17-May-2015

DESCRIPTION

The Data Science for Social Good Fellows (dssg.io) collaborated with Ushahidi (Ushahidi.com).
Presented: August 20, 2013
Video: https://www.youtube.com/watch?v=4eK8HjVG2m0
Tool: http://dssg.ushahididev.com/

TRANSCRIPT

Page 1: Data Science for Social Good and Ushahidi - Final Presentation

Ushine Plug-In

Using machine learning and natural language processing to improve the human review process of crisis reports

Topics

● Intro to project

● Project contents

● Data sets

● Evaluation

● Data ethics

● Future work

How to Follow Up...

● GitHub repository (open-source project code + wiki documentation): http://github.com/dssg/ushine-learning
Collaborators welcome! (Both within and outside of Ushahidi.)

● DSSG team e-mail: [email protected]

● Main Ushahidi contacts: Emmanuel Kala + Heather Leson

● Data Science for Social Good fellowship: http://dssg.io

Thanks!

Thanks to our partners at Ushahidi and the many individuals and organizations who generously gave us their advice and feedback. Alphabetically:

Chris Albon, Rob Baker, George Chamales, Jennifer Chan, Crisis Mappers, Schuyler Erle, Sara-Jayne Farmer, Rayid Ghani, Eric Goodwin, Catherine Graham, Neil Horning, Humanity Road, Anahi Ayala Iacucci, Rob Mitchum, Emmanuel Kala, David Kobia, Heather Leson, Rob Munro, Chris Thompson, Syria Tracker, Juan-Pablo Velez.

Project Contents [August 20]

1) Detect language of report text

2) Identify private information in report text

3) Identify locations in report text

4) Identify URLs in report text

5) Suggest categories of report

6) Detect (near-)duplicate reports

Ushahidi Process

DSSG helps here

Report Review w/o Ushine

Report Review with Ushine (Exact user interface still under development)

Scope

● Ushine DOES:
○ Improve the human review process of reports

● Ushine DOESN’T:
○ Verify reports
○ “Really” understand the report
○ Achieve 100% accuracy in anything

1) Detect report language

Useful for:
● In multilingual situations, automatically routing reports to speakers of that language
● Flagging reports that need / don’t need translation
○ (if the deployment specifies a certain set of acceptable languages)

Caveats:
● Not 100% accurate
● Performs less well on “imperfect” writing
○ e.g. SMS-speak, mixed languages

1) Detect report language

Technical details:
● Tested 4 plug-in language detectors on 850 reports, for agreement with human language identification:

2) Identify Private Info

Identify people’s names, organizations’ names, locations, e-mail addresses, URLs, phone/ID numbers, Twitter usernames

Useful for:
● Flagging private info in a report that the reviewer might want to remove, to protect sensitive people/situations
● As an extra check before exporting reports to others.

Technical details:
● Use NLTK’s pre-trained Named Entity Recognizer (NER) to identify people’s names, organizations’ names, and locations.
● Use regular expressions to identify e-mail addresses, URLs, phone/ID numbers, and Twitter usernames.
● Better to be overly careful: false negatives are more dangerous than false positives
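As a rough illustration of the regular-expression step, the sketch below flags e-mail addresses, URLs, phone/ID-like numbers, and Twitter usernames in a report. The patterns, the `PATTERNS` table, and the `find_private_info` helper are assumptions made for this example, not the plug-in's actual expressions. They are kept deliberately loose, in line with the preference for false positives over false negatives.

```python
import re

# Illustrative patterns only (assumptions, not the plug-in's real regexes).
# They err on the side of over-matching so that a human reviewer sees more
# candidates rather than fewer.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE),
    # "@" must not follow a word character, so e-mail addresses are skipped.
    "twitter": re.compile(r"(?<![\w@])@\w{1,15}"),
    # A digit run of 7+ characters, allowing spaces, dots, dashes, parentheses.
    "phone_or_id": re.compile(r"\+?\d[\d\s().-]{5,}\d"),
}


def find_private_info(text):
    """Return {kind: [matched strings]} for every pattern that matches."""
    found = {}
    for kind, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[kind] = matches
    return found


if __name__ == "__main__":
    report = "Call +254 700 123456 or mail jane.doe@example.org, via @ushahidi"
    for kind, matches in find_private_info(report).items():
        print(kind, matches)
```

The output is only a list of candidate strings for a reviewer to inspect; deciding whether something is actually private remains a human judgment.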

2) Identify Private Info

Caveats:
● Not 100% accurate.
○ Use to support, not replace, humans. (Though humans are not 100% accurate by themselves either!)
○ Always be aware of the responsibility to protect sensitive information.
○ Non-sensitive deployments (non-wars/disasters) may still have sensitive information.
○ (More on data ethics @ end)
● Definition of “private” can be very subjective and nuanced.
● Does not re-word the sentence; only identifies problematic words for editing.
● Currently only useful for English text (though extendable to other languages given a suitable NER)

3) Identify Locations

Useful for:
● Identifying text within report that may refer to a location

Caveats:
● Imperfect accuracy, especially on imperfect English
● Currently only useful for English text (though extendable to other languages given a suitable NER)
● Does not geo-locate the location for mapping; it just makes it easier to figure out what text to then geo-locate.

Technical details:
● Use NLTK’s pre-trained Named Entity Recognizer (NER)

4) Identify URLs (links)

Useful for:
● Identifying text within report that refers to a URL (photo/video/article/etc.)

Technical details:
● Use regular expressions

A Detour on Data Sets

● So far none of the tasks have required “training data” from past Ushahidi deployments
○ (NLTK’s named entity recognizer uses its own training data, not from Ushahidi)
● Next task, category rankings, DOES require Ushahidi training data

● Data cleanliness: Often lacking
○ We wrote scripts to automate cleaning
○ Useful for other Ushahidi work too!

Data Sets - Examples

Additional datasets were unusable for various reasons (e.g. overly formulaic language)

Many additional CrowdMap datasets (not used by Ushine because of time constraints)

Sensitive data was removed before being shared with us

Data Set Differences

Afghanistan election (peaceful)

Kenyan election (less peaceful)

5) Category Suggestions

For each category (e.g. “Bribery” or “Violence”), give a 0-100% rating of how likely the report is to belong

Useful for:
● Increasing speed and accuracy of the category assignment process

Caveats:
● Not 100% accurate
● “Cold start” problem

5) Category Suggestions

● Global classifier:
○ Classifier trained on previous deployments (e.g. previous Indian and Venezuelan election reports), then used for a new deployment (e.g. new Kenyan election)
● Local classifier:
○ Train a classifier on-the-fly on reports annotated in a new deployment. Cold-start problem.
● Adaptive classifier:
○ Retrain global classifier on the current deployment

5) Category Suggestions

● Learning curve plot from Mexico election (higher F1 score means better performance)

5) Category Suggestions

Technical details:
● Binary classifier for each category.
● Local classifier: bag-of-words unigram frequency features (with frequency cut-off = 5)
○ In general, bigrams & TF-IDF normalization did not help.
● Global classifier for election deployment
○ Trained using 7 election deployments
○ For each category label, cross-deployment validation was used to select feature sets (unigram, TF-IDF, bigram) and the C parameter.
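To make the local classifier's features concrete, here is a minimal standard-library sketch of bag-of-words unigram counts with a frequency cut-off. The slide does not say whether the cut-off of 5 is inclusive, nor how text was tokenized, so the `min_count` semantics (keep words seen at least that many times), the simple lowercase tokenizer, and the helper names are all assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    return TOKEN.findall(text.lower())


def build_vocabulary(reports, min_count=5):
    """Keep unigrams seen at least `min_count` times across all reports."""
    totals = Counter()
    for report in reports:
        totals.update(tokenize(report))
    kept = sorted(word for word, n in totals.items() if n >= min_count)
    return {word: index for index, word in enumerate(kept)}


def featurize(report, vocabulary):
    """Map one report to a vector of unigram counts over the vocabulary."""
    vector = [0] * len(vocabulary)
    for word in tokenize(report):
        index = vocabulary.get(word)
        if index is not None:
            vector[index] += 1
    return vector
```

Vectors like these would then feed one binary classifier per category; the choice of classifier is left open here.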

5) Category Suggestions

Technical details:
● Adaptive classifier
○ Interpolates between local classifier f and global classifier g using (1-α)*g(x) + α*f(x), where x is a report.
○ α is tuned on-the-fly by grid search to maximize F1 score.
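The interpolation and the grid search over α can be sketched as follows. This is an illustration under stated assumptions, not the project's code: both classifiers are assumed to output scores in [0, 1], a score of at least 0.5 counts as a positive prediction, the grid has 11 evenly spaced values, and the function names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def interpolate(global_score, local_score, alpha):
    """(1 - alpha) * g(x) + alpha * f(x), as on the slide."""
    return (1 - alpha) * global_score + alpha * local_score


def tune_alpha(global_scores, local_scores, y_true, threshold=0.5, steps=11):
    """Grid-search alpha in [0, 1]; return (best_alpha, best_f1)."""
    best_alpha, best_f1 = 0.0, -1.0
    for i in range(steps):
        alpha = i / (steps - 1)
        preds = [
            interpolate(g, f, alpha) >= threshold
            for g, f in zip(global_scores, local_scores)
        ]
        score = f1_score(y_true, preds)
        if score > best_f1:  # ties keep the smaller alpha
            best_alpha, best_f1 = alpha, score
    return best_alpha, best_f1
```

With few labeled reports in the new deployment, a small α leans on the global classifier; as local annotations accumulate, the search can move α toward the local classifier.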

6) Detect (near-)duplicates

Has the report already been submitted, or retweeted?

Useful for:
● Identifying (near-)duplicate reports to prevent copies and redundant work

Caveats:
● Not 100% accurate
● Not looking at “similar/related content”, but rather (near-)duplicates

Technical details:
● SimHash efficiently hashes each report text to a 64-bit representation.
● (Near-)duplicates have short distances between their hashes.
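A compact sketch of the SimHash idea, using only the standard library, is given below. SimHash fingerprints are usually compared by Hamming distance (the number of differing bits). The slide does not specify how reports were tokenized or which per-token hash was used, so the unweighted word tokens and the MD5-based 64-bit token hash here are assumptions chosen for illustration.

```python
import hashlib
import re


def _hash64(token):
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text, bits=64):
    """Each token votes +1/-1 per bit position; positive totals set the bit."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = _hash64(token)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two reports would be treated as (near-)duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; that threshold would need to be tuned on real report data.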

Evaluation

Currently analyzing the results of an evaluation experiment that simulates an election crisis.

Assess the impact on users’ speed and accuracy of
● identifying private info, locations, URLs
● choosing categories

3 comparison groups:
1) “Regular” process w/o computer suggestions
2) Our computer’s suggestions
3) “Perfect” suggestions

Evaluation

Ushahidi Plugin integration

● Configurable URL for the Ushine web service
● Extract location names and other entities from report text. These are displayed as report metadata
● Detect and display the report language
● Suggest reports that are similar to the current one

Data Ethics

This isn’t today’s focus, but it is very important as part of an ongoing Ushahidi discussion:

1) The private information tool especially should be used wisely: it is not 100% accurate and does not replace, but rather supports, thoughtful human decision-making.

2) To improve category classification, we need access to training data. How to store data? Who has access?

Carelessness about sensitive data can have real and bad consequences!

Non-sensitive deployments (non-wars/disasters) may still have sensitive information.

Automated vs. Suggestions

● In theory, everything could be automated

○ Ex: Automatically select top-ranked categories instead of giving humans the rankings

● Ushahidi reports need high quality data, so we recommend using our package’s output as suggestions to guide human decisions

● Especially important for sensitive tasks like private information detection!

Future Ideas

1. Urgency assessment

2. Filter irrelevant reports (not strictly spam)

3. Automatically propose new [sub-]categories

4. Cluster similar (non-identical) reports

5. Hierarchical topic modelling / visualization

6. …?

How to Follow Up...

● GitHub repository (project code + wiki documentation): http://github.com/dssg/ushine-learning
Collaborators welcome! (Both within and outside of Ushahidi.)

● DSSG team e-mail: [email protected]

● Main Ushahidi contacts: Emmanuel Kala + Heather Leson

● Data Science for Social Good fellowship: http://dssg.io