surveillance of social media: big data analytics
TRANSCRIPT
Presented by: Thomas Otto (Manager Business Intelligence)
Dr. Mehnaz Adnan (Senior Scientist Health Intelligence) Institute of Environmental Science & Research Ltd.
Credits:
• ESR - Dr. Mehnaz Adnan: Health Intelligence Analytics on Tweets
• ESR - Franco Andrews: SAP Data Integration, Modelling, Analytics and Visualisation
• ESR IT: Infrastructure / Firewall
• Soltius NZ - Erik Roelofs: Connection Module and SAP Data Services
Syndromic Surveillance of Social Media - Big Data Analytics
© ESR 2015
Problem statement and hypothesis
• Individuals disclose a lot of personal information on Social Media channels (e.g. Facebook, Twitter)
• There is a lot of Social Media Data (SMD) out there, and:
  • It is very noisy
  • It is not verified
  • It needs to be curated (checked by a clinician)
• Personal information contains location, names and self-diagnosed syndromes
• SMD could be used to feed an early warning surveillance system
1. How to exploit twitter for public health monitoring (http://goo.gl/sOx9xo)
2. Digital disease detection—harnessing the Web for public health surveillance. (http://goo.gl/fxwoJT)
3. Influenza forecasting with Google flu trends. (http://goo.gl/z7GZco)
Related work
1) Denecke, K., Krieck, M., Otrusina, L., Smrz, P., Dolog, P., Nejdl, W., & Velasco, E. (2013). How to exploit twitter for public health monitoring. Methods Inf Med, 52(4), 326-339.
2) Brownstein, J. S., Freifeld, C. C., & Madoff, L. C. (2009). Digital disease detection—harnessing the Web for public health surveillance. New England Journal of Medicine, 360(21), 2153-2157.
3) Dugas, A. F., Jalalpour, M., Gel, Y., Levin, S., Torcaso, F., Igusa, T., & Rothman, R. E. (2013). Influenza forecasting with Google flu trends. PloS one, 8(2), e56176.
What is Social Media?
Social media refers to the means of interactions among people in which they create, share, and/or exchange information and ideas in virtual communities and networks¹.
1 Tufts University, Boston, U.S.A.
2 Social Media Examiner: 2014 Social Media Marketing Industry Report
What is Twitter?
Some Twitter statistics
• 1 billion users registered
• 255 million users/month
• 100 million users per day ³
1 Wikipedia
2 PEW Research Centre: January 2014
3 DMR: March 2014
4 Twitter Terms of Service as of 24/7/14
Overview
• This is a proof of concept (POC)
• The POC is not yet used for surveillance or to monitor actual diseases
• This POC is an experimental application at ESR to understand the validity of the approach
Method of data collection
[Diagram of the data collection method; recoverable labels: Commercial / Government clients; Future work: Machine Learning (ML), Artificial Intelligence (AI) etc.]
Study Design
• There was a measles outbreak in 2014
• We extracted a subset of tweets for the period of Jan 2014 to Dec. 2015 containing the keyword ‘measles’ from our Twitter data mart
• We extracted the number of confirmed measles cases for the period of Jan 2014 to Dec. 2015 from a national New Zealand surveillance system (EpiSurv)
• We performed quantitative data analysis on both data sets
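The comparison described in the study design can be sketched roughly as follows. This is a minimal illustration, not the analysis performed at ESR: the tweet records, the case counts, the data layout and the function names are all hypothetical, and Pearson correlation is only one plausible choice for the "quantitative data analysis" step.

```python
from collections import Counter
from datetime import date
from math import sqrt


def monthly_keyword_counts(tweets, keyword):
    """Count tweets per (year, month) whose text contains the keyword."""
    counts = Counter()
    for posted_on, text in tweets:
        if keyword.lower() in text.lower():
            counts[(posted_on.year, posted_on.month)] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up example data (NOT real tweets and NOT real EpiSurv figures).
tweets = [
    (date(2014, 1, 5), "Kids off school, measles going round"),
    (date(2014, 1, 9), "Lovely day at the beach"),
    (date(2014, 2, 2), "Measles jab booked"),
    (date(2014, 2, 11), "Think I have measles :("),
    (date(2014, 2, 20), "MEASLES alert at our school"),
    (date(2014, 3, 1), "measles again?"),
]
confirmed_cases = {(2014, 1): 2, (2014, 2): 6, (2014, 3): 2}

tweet_counts = monthly_keyword_counts(tweets, "measles")
months = sorted(confirmed_cases)
x = [tweet_counts.get(m, 0) for m in months]  # tweets per month
y = [confirmed_cases[m] for m in months]      # confirmed cases per month
r = pearson(x, y)
print(months, x, y, round(r, 3))
```

Aggregating both series to the same time unit (months here) before comparing them is the key design choice; weekly buckets would follow the same pattern.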
Results
• Number of tweets collected for measles: 1408
Limitations
• Single keyword-based data curation
• Usage of the free Twitter API 1.1 (limits volume and timeliness)
Social Media (Twitter) – Visualisation / Front-End
Select keyword: Measles
Zoom into WLG
Basic stats by location
Measles Tweets
Current, active keywords
Conclusion
• We believe that Social Media Data (SMD) is a relevant source of information
• Storage is potentially challenging (it has aspects of Big Data)
• Cleansing (it needs to be curated)
  • A mixed approach between machine automation and human verification (e.g. by a clinician)
• Curated SMD will be the source for down-stream Analytics and early warning systems (syndromic surveillance)
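The mixed approach of machine automation followed by human verification could look something like the sketch below. It is an assumption-laden illustration only: the function name, the keyword tuple and the noise markers are hypothetical and are not part of the POC. The automated step only narrows the stream; anything it keeps still goes to a clinician.

```python
def triage(tweets, keywords, noise_markers=("rt @", "http")):
    """Split tweets into a clinician review queue and an automated-discard pile.

    noise_markers is a hypothetical rule of thumb (retweets and link posts),
    not a rule used by the POC.
    """
    review_queue, discarded = [], []
    for text in tweets:
        lowered = text.lower()
        has_keyword = any(k in lowered for k in keywords)
        looks_noisy = any(m in lowered for m in noise_markers)
        if has_keyword and not looks_noisy:
            review_queue.append(text)  # needs human verification
        else:
            discarded.append(text)     # filtered out automatically
    return review_queue, discarded


# Made-up example tweets.
KEYWORDS = ("measles",)
sample = [
    "I think I have measles, covered in spots",
    "RT @news: measles outbreak update http://example.com",
    "Great coffee this morning",
]
review_queue, discarded = triage(sample, KEYWORDS)
print(len(review_queue), "for clinician review;", len(discarded), "discarded")
```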
Future work
• Potentially use a Twitter data aggregator, or a paid Twitter API connection (higher volumes, better timeliness)
• Add to the Linguistic Analytical Module by applying:
  • Machine Learning (sentiment analysis, linear regression analysis etc.)
  • Prediction
• Evaluate the DeepDive engine from Stanford University (http://deepdive.stanford.edu/)
• Develop an ontology for syndromic keywords related to specific diseases (e.g. spots, rash, itching for measles)
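As a rough idea of what a syndromic keyword ontology might enable, the sketch below maps a disease to its keywords and matches them against tweet text. Only the measles terms (spots, rash, itching) come from the slide; the dictionary layout and the function name are hypothetical, and a real ontology would be curated by clinicians.

```python
import re

# Hypothetical ontology: disease -> syndromic keywords.
# Only the measles terms are taken from the slide.
SYNDROME_ONTOLOGY = {
    "measles": {"measles", "spots", "rash", "itching"},
}


def match_syndromes(text, ontology=SYNDROME_ONTOLOGY):
    """Return {disease: sorted matching keywords} using whole-word matches."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = {}
    for disease, keywords in ontology.items():
        found = sorted(keywords & words)
        if found:
            hits[disease] = found
    return hits


example = match_syndromes("Woke up with a rash and red spots everywhere")
print(example)
```

Compared with the single-keyword curation used in the POC, this would also catch tweets that describe symptoms without naming the disease, at the cost of more noise to curate.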
VISION - Big Data complements traditional methods
Calibrate social media data with verified and trusted data to identify valid tweets.
[Figure: confidence that a tweet is a real event (axis from 0 % to 100 %) against time. Recoverable labels: unverified Twitter data (? %); verify with Health Line data; verify with Health Stats data; verify with Lab Information data; verify with Sentinel data; verify with National Surveillance Database; Google Flu Tracker (??); NZ; 50 %]
• Enrich Twitter data set with other, verified data (counts, location, time).
PRO: Nature scientific journal: There is a close correlation between the rates of doctor visits for flu symptoms, and the use of flu-like search terms. (NZ Herald 23/7/14)
CON: Researchers from Harvard University state: Google Flu Tracker has overestimated for 100 of the 108 weeks starting from August 2011. (Source: motherboard.vice.com)
Risk / Opportunity
• Social Media Data is validated and a trusted source of information
• May be used for indicative, early warnings of potential outbreaks?
Some Social Media exploration tools
• Try the tools below first and see what benefit they offer (this is a random selection and does not rate or recommend any of the tools in particular)
• PlusOne Social (http://plusonesocial.com)
• Microsoft Excel Twitter add-in (http://goo.gl/WBaXt5)
• Others
  • http://www.razorsocial.com/free-twitter-analytics/
  • http://www.socialmediaexaminer.com/6-twitter-analytics-tools/