
Post on 29-Jan-2016


Page 1: ChatterGrabber.py Methods and Development

ChatterGrabber.py Methods and Development

A System for High Throughput Social Media Data Collection

By James Schlitt, in collaboration with Elizabeth Musser, Dr. Bryan Lewis, and Dr. Stephen Eubank

Page 2: ChatterGrabber.py Methods and Development

Social media surveillance is a valuable tool for epidemiological research:

– Pros: Cheap, consistent, and easy-to-parse data source.

– Cons: Volume & specificity vary; content cannot be easily verified.

– Tweepy provides easy Twitter API access via Python.

Introduction

Page 3: ChatterGrabber.py Methods and Development

• The Virginia Department of Health requested a tool, developed under MIDAS funding, to track Norovirus and gastrointestinal (GI) illness outbreaks within Montgomery County, VA, with the following capabilities:

– Automated surveillance of social media.

– No special skills required to use.

– Forward compatible for GIS applications.

• Twitter was well suited to GI outbreak surveillance due to the short duration of infection and Tweet-worthy symptoms. For example:

@ATweeter20 My hubs did some vomiting w/ his flu. Had stuff messing w/ his tummy - high fever & snot was bad. Get better!

• The study was challenged by low population density and a high degree of linguistic confounders.

The Twitter Norovirus Study

Page 4: ChatterGrabber.py Methods and Development

• Tweepy.py: Python wrapper for Twitter RESTful APIs.

• Gnip: Twitter commercial partner.

Methods Considered

Page 5: ChatterGrabber.py Methods and Development

Search:

• Up to 100% of all tweets matching query, may use multiple queries.

• Search by location and/or keywords.

• ~35 mile search radius limit.

• All tweets within the last week.

• Query rate limited per Twitter OAuth key to 180 searches every 15 minutes.

• Narrower geographic coverage, but very flexible.

Streaming:

• Up to 1% of total stream volume by location or keywords from Twitter.

• About 10 keywords per query, 1 query per stream, and 1 stream per OAuth key.

• Tweets come in real-time.

• Tweet pull rate limited by Twitter.

• Most commonly used, great for whole country studies with simple queries.

Twitter Method Comparison
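The search limit quoted above (180 searches every 15 minutes per OAuth key) implies a minimum pacing between requests. A minimal sketch of that arithmetic follows, assuming requests are spread evenly over the window and that each key carries its own allowance; the function name and parameters are illustrative and are not taken from ChatterGrabber.

```python
def min_seconds_between_searches(searches_per_window=180, window_minutes=15, n_keys=1):
    """Smallest even spacing (in seconds) between searches that stays within the limit.

    Assumes each OAuth key has its own allowance, so n_keys keys multiply the budget.
    """
    if n_keys < 1:
        raise ValueError("need at least one key")
    window_seconds = window_minutes * 60
    return window_seconds / (searches_per_window * n_keys)


print(min_seconds_between_searches())           # 5.0
print(min_seconds_between_searches(n_keys=3))   # about 1.67
```

With one key, a search can be issued every 5 seconds; adding keys shortens the interval proportionally, which is why a long query list over many locations benefits from several logins.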

Page 6: ChatterGrabber.py Methods and Development

• Official Twitter data partner.

• Historic or real-time, variety of services.

• Large volume, representative sample. Excellent choice when affordable!

• Prices not public, quoted in a 2010 interview as:

– 5% stream for $60k/year.

– 50% stream for $360k/year.

Gnip Method Comparison

Page 7: ChatterGrabber.py Methods and Development

Given a partial data sample, how can we accurately track tweets in an area with low engagement?

– 12 potential Norovirus (NRV)/GI tweets per day.

– 4 suspected hits after human confirmation.

– Long keyword list requires multiple queries.

Twitter 1% streaming is limited by query length and volume, and Gnip was not affordable, which leaves the search method...

Challenges

Page 8: ChatterGrabber.py Methods and Development

ChatterGrabber: A search-method-based social media data miner developed in Python.

– Google Docs interface (GDI) included for simplified partner access.

– Specialized hunters pull from GDI spreadsheets to set run parameters.

– Multiple logins may be used to increase search frequency during collaborative experiments.

– No limits on query length.

– Data sent nightly to subscribers as CSV.

– Summary of history presented in dashboard (under development).

ChatterGrabber Introduction
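The nightly CSV delivery mentioned above can be illustrated with the standard library alone. This is a sketch under stated assumptions: the column names are hypothetical (the slides do not list the actual fields), and emailing the file is left out.

```python
import csv
import io

# Hypothetical column names; the slides do not list the real fields.
FIELDS = ["tweet_id", "created_at", "text", "lat", "lon"]


def tweets_to_csv(tweets):
    """Serialize a list of tweet dicts to CSV text, ignoring keys not in FIELDS."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet)
    return buffer.getvalue()


sample = [
    {
        "tweet_id": 1,
        "created_at": "2014-01-14",
        "text": "stomach bug, ugh",
        "lat": 37.23,
        "lon": -80.41,
        "extra": "dropped",
    }
]
print(tweets_to_csv(sample))
```

Using `csv.DictWriter` means tweet text containing commas or quotes is escaped automatically, which matters for free-form tweet content.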

Page 9: ChatterGrabber.py Methods and Development

High redundancy & error tolerance for long term experiments:

– If multiple API keys used, functional keys take up the work of failed keys until they may be reconnected.

– Daemon automatically executes & resumes experiments on start up and after an interruption.

– Any hunter may be resumed up to 1 week after termination without loss of incoming data.

ChatterGrabber Reliability
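The key failover behaviour described above could be sketched as a round-robin pool that skips keys marked as failed until they are restored. The class and method names here are hypothetical and only illustrate the idea; they are not ChatterGrabber's own code.

```python
from itertools import cycle


class KeyPool:
    """Round-robin over API keys, skipping any currently marked as failed."""

    def __init__(self, keys):
        self.keys = list(keys)
        self.failed = set()
        self._cycle = cycle(self.keys)

    def mark_failed(self, key):
        self.failed.add(key)

    def mark_restored(self, key):
        self.failed.discard(key)

    def next_key(self):
        # Try each key at most once per call so the loop always terminates.
        for _ in range(len(self.keys)):
            key = next(self._cycle)
            if key not in self.failed:
                return key
        raise RuntimeError("no working keys available")


pool = KeyPool(["key_a", "key_b", "key_c"])
pool.mark_failed("key_b")
print([pool.next_key() for _ in range(4)])  # ['key_a', 'key_c', 'key_a', 'key_c']
```

While `key_b` is down, the remaining keys absorb its share of the searches; calling `mark_restored("key_b")` puts it back into rotation.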

Page 10: ChatterGrabber.py Methods and Development

General Execution

[Flowchart; steps recovered from the slide:]

1. Pull list of condition phrases & config from Google spreadsheet.

2. Partition conditions into {x} queries.

3. Search radius > 35 miles?

– Yes: Generate {y} coordinate sets via covering algorithm; prepare search with |x|*|y| queries.

– No: Prepare search with |x| queries.

4. Run Twitter search, from last tweet ID recorded for location and query pair.

5. Filter results by phrases, classifiers, and location; sleep.

6. Has a new day begun?

– Yes: Store data, send subscribers CSV and config link.
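The partitioning and search-planning steps of the execution flow can be sketched as follows. This is an illustration, not ChatterGrabber's implementation: the OR-joined quoted-phrase query format, the `max_chars` limit, and the function names are assumptions made for the example.

```python
from itertools import product


def partition_conditions(conditions, max_chars=400):
    """Greedily pack phrases into OR-joined queries no longer than max_chars.

    max_chars is an assumed limit for illustration, not a documented value.
    A single phrase longer than max_chars still gets a query of its own.
    """
    queries, current = [], []
    for phrase in conditions:
        term = f'"{phrase}"'
        candidate = " OR ".join(current + [term])
        if current and len(candidate) > max_chars:
            queries.append(" OR ".join(current))
            current = [term]
        else:
            current.append(term)
    if current:
        queries.append(" OR ".join(current))
    return queries


def plan_searches(queries, coordinate_sets):
    """Pair every query with every search location: |x| * |y| searches."""
    return list(product(queries, coordinate_sets))


conditions = ["stomach flu", "norovirus", "food poisoning", "throwing up"]
queries = partition_conditions(conditions, max_chars=30)
locations = [(37.23, -80.41, 35), (37.13, -80.41, 35)]  # (lat, lon, radius in miles)
plan = plan_searches(queries, locations)

# Remember the newest tweet ID seen for each (query, location) pair so the
# next pass can resume from there.
last_seen_id = {pair: None for pair in plan}
print(len(queries), len(plan))  # 3 6
```

Keeping one "last seen" ID per query and location pair is what lets each search resume where it stopped instead of re-downloading the past week on every pass.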

Page 11: ChatterGrabber.py Methods and Development

ChatterGrabber GDI Interface Example

Page 12: ChatterGrabber.py Methods and Development

Pure Query Based:

• Conditions, qualifiers, & exclusions.

• Searches by conditions, keeps if qualifier and no exclusions present.

• Simple, easy to setup, but vulnerable to complexities of wording.

NLTK* Based:

• Take output from conditions search, manually classify.

• Train NLTK maxEnt or Naïve Bayesian classifier via content n-grams.

• Classifier discards tweets that don’t fit desired categories.

• Powerful, but requires longer setup, representative tweet sample.

ChatterGrabber Search Methods

*NLTK: Natural Language Toolkit
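The pure query based rule above (keep a tweet if a qualifier is present and no exclusion is present) can be sketched in a few lines. The word lists are hypothetical, and case-insensitive substring matching is a simplification chosen for this example.

```python
def keep_tweet(text, qualifiers, exclusions):
    """Keep a tweet if it has at least one qualifier and no exclusions.

    Matching is a case-insensitive substring test, a simplification
    chosen for this sketch.
    """
    lowered = text.lower()
    if any(term.lower() in lowered for term in exclusions):
        return False
    return any(term.lower() in lowered for term in qualifiers)


qualifiers = ["vomiting", "threw up", "stomach"]  # hypothetical lists
exclusions = ["hangover", "drunk"]

print(keep_tweet("Been vomiting all night, stomach bug?", qualifiers, exclusions))  # True
print(keep_tweet("So drunk I threw up lol", qualifiers, exclusions))                # False
```

The same simplicity is the weakness noted on the slide: a tweet such as "my stomach hurts from laughing" would pass this filter, which is the kind of wording problem the classifier-based mode is meant to handle.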

Page 13: ChatterGrabber.py Methods and Development

Tweet Linguistic Classification

[Flowchart; nodes recovered from the slide:]

• Start: Tweet passed for classification.

• Decisions: Using NLTK mode? Is Tweet classification sought? Does Tweet contain an exclusion? Does Tweet contain a qualifier? Keeping non-hits?

• Actions: Extract features from Tweet. Classify Tweet by features. Store Tweet data and derived data. Discard Tweet.
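The "extract features from Tweet" step in the classification flow can be illustrated with word n-grams. This sketch uses plain Python; the tokenizer and the `has(...)` feature naming are assumptions. The resulting dictionaries have the shape NLTK classifiers accept as feature sets, paired with a manually assigned label for training.

```python
import re


def ngram_features(text, max_n=2):
    """Return a {feature_name: True} dict of word n-grams up to max_n."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    features = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            features[f"has({gram})"] = True
    return features


print(sorted(ngram_features("Threw up twice")))
# ['has(threw up)', 'has(threw)', 'has(twice)', 'has(up twice)', 'has(up)']
```

Including bigrams lets a classifier tell "threw up" apart from "threw" or "up" on their own, which single-word features cannot do.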

Page 14: ChatterGrabber.py Methods and Development

NLTK Classifier Example

Page 15: ChatterGrabber.py Methods and Development

• Large lat/lon boxes filled via covering algorithm.

• Fine and coarse geolocations obtained via GoogleMapsV3 API:

– If coordinates to tweet are present, finds street address.

– If common name present, finds coordinates, then searches by coordinates for proper name/street address of position.

– If location is outside of lat/lon box, discards tweet.

• All geo queries cached locally, shared between experiments, and pulled on demand to reduce API utilization.

ChatterGrabber Geographic Methods
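The slides do not say which covering algorithm is used, so the following is one simple scheme rather than ChatterGrabber's own: cut the lat/lon box into cells no wider than radius × √2 and place a search circle at each cell center, so every cell fits inside its circle. The miles-per-degree constant, the `margin` parameter, and the example bounding box are assumptions.

```python
import math

MILES_PER_DEG_LAT = 69.0  # rough average; an assumption for this sketch


def cover_box(south, west, north, east, radius_miles=35.0, margin=0.95):
    """Return circle centers (lat, lon) whose radius_miles discs cover the box.

    The box is cut into cells no wider than radius * sqrt(2), so each cell fits
    inside the disc drawn around its center. margin < 1 shrinks the cells a bit
    to absorb the flat-earth approximation.
    """
    side = radius_miles * margin * math.sqrt(2)
    # A degree of longitude is widest (in miles) at the latitude nearest the equator.
    widest_lat = min(abs(south), abs(north)) if south * north > 0 else 0.0
    miles_per_deg_lon = MILES_PER_DEG_LAT * math.cos(math.radians(widest_lat))
    rows = max(1, math.ceil((north - south) * MILES_PER_DEG_LAT / side))
    cols = max(1, math.ceil((east - west) * miles_per_deg_lon / side))
    dlat = (north - south) / rows
    dlon = (east - west) / cols
    return [
        (south + (r + 0.5) * dlat, west + (c + 0.5) * dlon)
        for r in range(rows)
        for c in range(cols)
    ]


# Approximate bounding box for Virginia (illustrative values).
centers = cover_box(36.5, -83.7, 39.5, -75.2)
print(len(centers), "search circles")
```

A square grid is easy to reason about but uses more circles than a hexagonal layout would; since every circle multiplies the number of searches, a tighter packing may be worth the extra code when the rate limit is the bottleneck.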

Page 16: ChatterGrabber.py Methods and Development

Basic Execution:

1. Create GDI sheet, run initial experiment.

2. Check first results for confounders, update keyword lists.

3. Rerun experiment with new keywords, monitor periodically for new keywords & memes.

If Greater Specificity Desired:

1. Run whole country experiment with desired query list.

2. Score output manually & enable NLTK classification.

3. Expand area as desired.

Work Flow

Page 17: ChatterGrabber.py Methods and Development

Results

• Found and geolocated 4,000-8,000 suspected Norovirus tweets per day across the US during peak Norovirus season.

• Preliminary estimates of 70-80% accuracy with a 2,000 tweet training set.

Page 18: ChatterGrabber.py Methods and Development

Results Continued

Page 19: ChatterGrabber.py Methods and Development

• Results exceed the geographic and temporal resolution of existing surveillance systems, complicating verification.

• No true denominator; ChatterGrabber only collects queried hits.

• Not all desired information is available in social media; some may be incomplete or falsified.

• ChatterGrabber is just an information gathering method; external analysis and review are needed for validity.

• Twitter users will differ from population at large.

Limitations

Page 20: ChatterGrabber.py Methods and Development

● ChatterGrabber provides an easy to use social media surveillance tool.

– Natural Language Processing speeds illness identification.

– Geographic region directed searching allows complete coverage of a user defined jurisdiction.

● ChatterGrabber can successfully identify GI illness related tweets in a population.

– 220 Million USA Tweets per day.

– 6,000 matches per day by Nationwide NLTK search.

– 353 matches per day by Virginia keyword search.

– 136 matches per day by Virginia NLTK search.

Conclusions

Page 21: ChatterGrabber.py Methods and Development

• Streamlined web interface needed for NDSSL long term studies.

• Real-time bioterrorism surveillance methods under evaluation using gun violence as a proof of concept.

• Norovirus visualization & dashboard under development by Elizabeth Musser.

• Tick bite zoonosis and unlicensed tattoo hepatitis risk tracking underway by Pyrros Telionis.

• Vaccine sentiment tracking underway by Meredith Wilson.

Next Steps

Page 23: ChatterGrabber.py Methods and Development

Firearm violence related tweets by time of day

Next Steps

Page 26: ChatterGrabber.py Methods and Development

• Design and execution of real-world use by state and local public health offices.

• Dashboard deployed and customized for users across Virginia.

• Evaluation of pre and post deployment practice.

• Assessment of utility and iterative refinement.

• If interested contact: [email protected]

Next Steps: Public Health Outreach

Page 27: ChatterGrabber.py Methods and Development

Python Resources

I. Roesslein, J. (2009). Tweepy (Version 1.8) [Computer program]. Available at https://github.com/tweepy/tweepy (Accessed 1 November 2013)

II. Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O’Reilly Media Inc. (Accessed 14 January 2014)

III. Google Developers (2012). gdata-python-client (Version 3.0) [Computer program]. Available at http://code.google.com/p/gdata-python-client/ (Accessed 6 January 2014)

IV. McKinney, W. (2010). Data structures for statistical computing in Python. In Proc. 9th Python Sci. Conf (pp. 51-56)

V. Tigas, M. (2014). GeoPy (Version 0.99) [Computer program]. Available at https://github.com/geopy/geopy (Accessed 21 December 2013)

VI. KilleBrew, K. (2013). query_places.py [Computer program]. Available at https://gist.github.com/flibbertigibbet/7956133 (Accessed 27 January 2014)

VII. Coutinho, R. (2007, August 22nd) Sending emails via Gmail with Python [Web log Post]. Retrieved January 5th from http://kutuma.blogspot.com/2007/08/sending-emails-via-gmail-with-python.html

Relevant Papers

I. Rivers, C. M., & Lewis, B. L. (2014). Ethical research standards in a world of big data. F1000Research, 3.

II. Young, S. D., Rivers, C., & Lewis, B. (2014). Methods of using real-time social media technologies for detection and remote monitoring of HIV outcomes. Preventive medicine.

III. Chakraborty, P., Khadivi, P., Lewis, B., Mahendiran, A., Chen, J., Butler, P., ... & Ramakrishnan, N. Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions. SDM14

References

Page 28: ChatterGrabber.py Methods and Development

Questions?