predicting spread of disease from social interactions kau… · 10/11/2012  · predicting spread...

23
Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday, October 11, 12

Upload: others

Post on 05-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Predicting Spread of Disease from Social Interactions

Adam SadilekHenry Kautz

Vincent SilenzioThursday, October 11, 12

Page 2: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Thursday, October 11, 12

Page 3: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Thursday, October 11, 12

Page 4: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Organic Sensor Networks

• 52% of adults use online social networks ➚

• Smartphone access ➚• Real time

• “In the moment”

• Location aware

Thursday, October 11, 12

Page 5: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

• Detailed measurements at a population scale

• No active user participation

• Fine granularity

• Timely

• Inference

• Prediction

Organic Sensor Networks

Thursday, October 11, 12

Page 6: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Thursday, October 11, 12

Page 7: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

? ??

?

Thursday, October 11, 12

Page 8: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Thursday, October 11, 12

Page 9: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

03:0004:0005:0006:0007:0008:0009:0010:0011:0012:0013:0014:0015:0016:0017:0018:0019:0020:00

Thursday, October 11, 12

Page 10: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

The Data

• New York City

• 6K geo-active users

• 2.5M tweets by geo-active users

• 103K “follows” relationships

• 32K “friends” relationships

Thursday, October 11, 12

Page 11: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Thursday, October 11, 12

Page 12: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Thursday, October 11, 12

Page 13: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Spread of Disease

• Identify people with symptoms

• Quantify the impact of:

• Co-location

• Social ties

• Predict contagion with fine granularity

Thursday, October 11, 12

Page 14: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Tweet: “feeling sick”

feeling

house

sick

0

1

1

1

Thursday, October 11, 12

Page 15: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Cascade SVM

Corpus of 5,128 tweets

labeled by human workers

Co

Cs

Corpus of 1.6 million machine-labeled tweets

Training

Training

Labeling

Corpus of "other" tweets

Corpus of "sick" tweets

+ Final corpus

Labeling

Random sample of 200 milion tweets

Cf

Training

Thursday, October 11, 12

Page 16: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

SVM Features

Corpus of 5,128 tweets

labeled by human workers

Co

Cs

Corpus of 1.6 million machine-labeled tweets

Training

Training

Labeling

Corpus of "other" tweets

Corpus of "sick" tweets

+ Final corpus

Labeling

Random sample of 200 milion tweets

Cf

Training

Figure 2: A diagram of our cascade learning of our SVM classifierCf that we use to detect tweets indicating an infectious sicknessof the author. The and symbols denote thresholding of theclassification score, where we select the bottom 10% of the scorespredicted by Co (i.e., tweets that are normal with high probabil-ity), and the top 10% of scores predicted by Cs (i.e., likely “sick”tweets).

Positive Features Negative FeaturesFeature Weight Feature Weightsick 0.9579 sick of ´0.4005headache 0.5249 you ´0.3662flu 0.5051 lol ´0.3017fever 0.3879 love ´0.1753feel 0.3451 i feel your ´0.1416coughing 0.2917 so sick of ´0.0887being sick 0.1919 bieber fever ´0.1026better 0.1988 smoking ´0.0980being 0.1943 i’m sick of ´0.0894stomach 0.1703 pressure ´0.0837and my 0.1687 massage ´0.0726infection 0.1686 i love ´0.0719morning 0.1647 pregnant ´0.0639

Table 2: Example positively and negatively weighted significantfeatures of our SVM model Cf .

for further progress, as false negatives and false positivescannot be traded-off against each other in this domain—theyboth carry equal importance. In this work, we use Cf to dis-tinguish between tweets indicating the author is afflicted byan infectious ailment (we call such tweets “sick”), and allother tweets (called “other” or “normal”).

We need to obtain sufficient amount of labeled trainingdata in order to learn Cf . We do this by first training two“helper” SVMs, Cs and Co, on a dataset of 5,128 tweets,each labeled as either “sick” or “other” by multiple Ama-zon Mechanical Turk workers and carefully checked by theauthors. Cs is highly penalized for inducing false positives(mistakenly labeling a normal tweet as “sick”), whereas Co

is heavily penalized for creating false negatives (labelingsymptomatic tweets as normal). After training, we used Cs

and Co to label a set of 1.6 million tweets that are likelyhealth-related, but contain some noise. We obtained bothdatasets from Paul and Dredze (2011a), and they are com-pletely disjoint from our NYC data.

The intuition behind this cascading process, illustrated inFig. 2, is to extract tweets that are with high confidenceabout sickness with Cs, and tweets that are almost certainly

about other topics with Co from the corpus of 1.6 milliontweets. We further supplement the final corpus with mes-sages from a sample of 200 million tweets (also disjointfrom all other corpora considered here) that Co classifiedas “other” with high probability. We apply thresholding onthe classification score to reduce the noise in the cascade, asshown in Fig. 2.

As SVM features, we use all unigram, bigram, and tri-gram word tokens that appear in the training data. For ex-ample, a tweet “I feel sick.” is represented by the followingfeature vector:

`i, feel, sick, i feel, feel sick, i feel sick

˘.

Before tokenization, we convert all text to lower case, strippunctuation and special characters, and remove mentionsof user names (the “@” tag) and re-tweets (analogous toemail forwarding). However, we do keep hashtags (such as“#sick”), as those are often relevant to the author’s healthstate, and are particularly useful for disambiguation of shortor ill-formed messages. Table 2 lists examples of significantfeatures found in the process of learning Cf .

Evaluation of Cf on a held-out set shows 0.98 preci-sion and 0.97 recall. Furthermore, the correlation betweenthe prevalence of infectious diseases predicted by Cf andthe predictions made by Google Flu Trends specifically forNew York City is 0.73. The official Center for DiseaseControl and Prevention data for NYC is not available withsufficiently fine granularity, but previous work has shownthat Google’s predictions closely correspond to the offi-cial statistics for larger geographical areas (Ginsberg et al.2008). Google Flu Trends may have greater specificity to“influenza-like illness”, whereas our approach may be lessspecific, but more sensitive to detect other, related infec-tious processes exhibiting these nonspecific features in Twit-ter content.

Predicting the Spread of DiseaseHuman contact is the single most important factor in thetransmission of infectious diseases (Clayton, Hills, andPickles 1993). Since the contact is often indirect, such asvia a doorknob, we focus on a more general notion of co-location. We consider two individuals co-located if they visitthe same 100 by 100 meter cell within a time window (slack)of length T . For clarity, we show results for T “ 12 hours,but we obtained virtually identical prediction performancefor T P t1, . . . , 24u hours. We use the 100m threshold, asthat is the typical lower bound on the accuracy of a GPS sen-sor in obstructed areas, such as Manhattan. Since we focuson geo-active individuals, we can calculate co-location withhigh accuracy. The results below are for a condition, wherewe consider a person ill up to four days after they write a“sick” tweet. As with the parameter T , it is important tonote that the results are consistent over a wide range of dura-tion of contagiousness (from 1 to 7 days). Few diseases withinfluenza-like symptoms are contagious for periods of timebeyond these bounds.

Statistical analysis of the data shows that avoiding en-counters with infected people generally decreases your

Thursday, October 11, 12

Page 17: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Thursday, October 11, 12

Page 18: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Twitter Health

• Aggregate accuracy comparable with:

• Google Flu Trends (R = 0.73)

• CDC statistics

+ we can model fine-grained interactions between specific individuals

Thursday, October 11, 12

Page 19: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

ht +1htht −1

Ot +1OtOt −1

. . . . . .

Health Prediction

Thursday, October 11, 12

Page 20: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Health Insights!

[Snow,1855]!

Thursday, October 11, 12

Page 21: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Thursday, October 11, 12

Page 22: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Psychological Health• Effect of connectedness on mental health

• Twitter data exhibits patterns from animal studies: higher status -> better health

• Can we estimate emotional state from tweets?

• Tweeters may try to disguise certain feelings

• Interaction of social network and depression

• Is depression contagious?

• How does network reduce or reinforce?

Thursday, October 11, 12

Page 23: Predicting Spread of Disease from Social Interactions Kau… · 10/11/2012  · Predicting Spread of Disease from Social Interactions Adam Sadilek Henry Kautz Vincent Silenzio Thursday,

Summary

• Quantify impact of physical encounters and social ties on public health

• Predict future health

• Real world observed via Twitter

• Fine granularity

• Real-time

• No active user participation

Thursday, October 11, 12