Making Diffusion Work for You: From Social Media to Epidemiology
B. Aditya Prakash
Computer Science, Virginia Tech
BSEC Conference, ORNL, Aug 26, 2015
Prakash 2015 2
Networks are everywhere!
Human Disease Network [Barabasi 2007]
Gene Regulatory Network [Decourty 2008]
Facebook Network [2010]
The Internet [2005]
Dynamical Processes over networks are also everywhere!
Why do we care?
• Social collaboration
• Information Diffusion
• Viral Marketing
• Epidemiology and Public Health
• Cyber Security
• Human mobility
• Games and Virtual Worlds
• Ecology
• ...
Why do we care? (1: Epidemiology)
• Dynamical Processes over networks [AJPH 2007]
CDC data: Visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts
Diseases over contact networks
SI Model
Why do we care? (1: Epidemiology)
• Dynamical Processes over networks
• Each circle is a hospital
• ~3000 hospitals
• More than 30,000 patients transferred
[US-MEDICARE NETWORK 2005]
Problem: Given k units of disinfectant, whom to immunize?
Why do we care? (1: Epidemiology)
[Figure panels: current practice vs. our method]
~6x fewer!
[US-MEDICARE NETWORK 2005]
Hospital-acquired infections took 99K+ lives and cost $5B+ (all per year)
Why do we care? (2: Online Diffusion)
> 800m users, ~$1B revenue [WSJ 2010]
~100m active users
> 50m users
Why do we care? (2: Online Diffusion)
• Dynamical Processes over networks
Celebrity
Buy Versace™!
Followers
Social Media Marketing
Why do we care? (3: To change the world?)
• Dynamical Processes over networks
Social networks and Collaborative Action
High Impact – Multiple Settings
Q. How to squash rumors faster?
Q. How do opinions spread?
Q. How to market better?
epidemic out-breaks
products/viruses
transmit s/w patches
Research Theme
DATA: Large real-world networks & processes
ANALYSIS: Understanding
POLICY/ACTION: Managing/Utilizing
Research Theme – Public Health
DATA: Modeling # patient transfers
ANALYSIS: Will an epidemic happen?
POLICY/ACTION: How to control out-breaks?
Research Theme – Social Media
DATA: Modeling Tweets spreading
ANALYSIS: # cascades in future?
POLICY/ACTION: How to market better?
In this talk
DATA: Large real-world networks & processes
Q1: How to predict Flu-trends better?
Q2: How does ‘activity’ evolve over time?
In this talk
Q3: How to control out-breaks?
POLICY/ACTION: Utilizing
Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
• Part 2: Policy and Action (Algorithms)
• Conclusion
Part 1: Empirical Studies
• Q1: How to predict Flu-trends better?
• Q2: How does activity evolve over time?
Surveillance
• How to estimate and predict flu trends?
Population survey
Hospital record
Lab survey
Surveillance Report
GFT & Twitter
• Estimate flu trends using online electronic sources
So cold today, I’m catching cold.
I have headache, sore throat, I can’t go to school today.
My nose is totally congested, I have a hard time understanding what I’m saying.
Observation 1: States
• There are different states in an infection cycle.
• SEIR model:
1. Susceptible
2. Exposed
3. Infected
4. Recovered
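The four-state cycle can be illustrated with a minimal discrete-time compartmental sketch. The rates `beta`, `sigma`, `gamma`, the population size, and the function name are illustrative assumptions, not values or code from the talk.

```python
def simulate_seir(n, e0, beta, sigma, gamma, steps):
    """Discrete-time SEIR: S -> E -> I -> R. Returns a list of (S, E, I, R) tuples."""
    s, e, i, r = float(n - e0), float(e0), 0.0, 0.0
    history = [(s, e, i, r)]
    for _ in range(steps):
        new_exposed = beta * s * i / n   # S -> E: contact with infected people
        new_infected = sigma * e         # E -> I: incubation ends
        new_recovered = gamma * i        # I -> R: recovery
        s -= new_exposed
        e += new_exposed - new_infected
        i += new_infected - new_recovered
        r += new_recovered
        history.append((s, e, i, r))
    return history


# Assumed toy parameters: 1000 people, 10 initially exposed
history = simulate_seir(n=1000, e0=10, beta=0.5, sigma=0.2, gamma=0.1, steps=100)
peak_infected = max(row[2] for row in history)
print(f"peak infected: {peak_infected:.1f}")
```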
Observation 2: Gap between Epidemiology and Social Media
• Infection cases drop exponentially in epidemiology (Hethcote 2000)
• Keyword mentions drop in a power-law pattern in social media (Matsubara 2012)
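A quick numeric sketch of why the two decay shapes differ in the tail. The rate 0.5 and exponent 1.5 are assumed values chosen only for illustration; the function names are hypothetical.

```python
import math


def exponential_decay(t, rate=0.5):
    """Exponential drop, the shape reported for infection cases."""
    return math.exp(-rate * t)


def power_law_decay(t, exponent=1.5):
    """Power-law drop, the shape reported for keyword mentions."""
    return t ** (-exponent)


# The power law keeps a much heavier tail as t grows
for t in (1, 10, 100):
    print(t, exponential_decay(t), power_law_decay(t))
```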
HFSTM Model
• Hidden Flu-State from Tweet Model (HFSTM)
– Each word (w) in a tweet (Oi) can be generated by:
• A background topic
• Non-flu related topics
• State related topics
[Model diagram labels: binary background switch, binary non-flu related switch, word distribution, latent state, initial prob., transition prob., transition switch]
HFSTM Model
• Generating tweets
– Generate the state for a tweet
– Generate the topic for a word
State: [S, E, I]  Topic: [Background, Non-flu, State]
S: This restaurant is really good
E: The movie was good but it was freezing
I: I think I have flu
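The two generation steps can be sketched as a small sampler. This is a simplified illustration, not the HFSTM implementation: the transition switch from the model diagram is omitted, and every probability, topic, word list, and name below is a made-up assumption.

```python
import random


def generate_tweet(prev_state, length, params, rng):
    """Sample a hidden state for the tweet, then a topic and a word per position."""
    if prev_state is None:
        weights = params["initial"]
    else:
        weights = params["transition"][prev_state]
    state = rng.choices(params["states"], weights=weights)[0]
    words = []
    for _ in range(length):
        if rng.random() < params["p_background"]:   # binary background switch
            topic = params["background"]
        elif rng.random() < params["p_non_flu"]:    # binary non-flu switch
            topic = params["non_flu"]
        else:                                       # state-related topic
            topic = params["state_topics"][state]
        vocab, probs = zip(*topic.items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return state, words


# Hypothetical toy parameters
params = {
    "states": ["S", "E", "I"],
    "initial": [0.8, 0.15, 0.05],
    "transition": {"S": [0.8, 0.15, 0.05], "E": [0.2, 0.5, 0.3], "I": [0.3, 0.1, 0.6]},
    "p_background": 0.4,
    "p_non_flu": 0.5,
    "background": {"the": 0.5, "is": 0.3, "it": 0.2},
    "non_flu": {"movie": 0.5, "restaurant": 0.5},
    "state_topics": {
        "S": {"good": 1.0},
        "E": {"freezing": 0.6, "cold": 0.4},
        "I": {"flu": 0.7, "fever": 0.3},
    },
}

rng = random.Random(0)
state = None
for _ in range(3):
    state, words = generate_tweet(state, 5, params, rng)
    print(state, " ".join(words))
```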
Inference
• EM-based algorithm: HFSTM-FIT
– E-step:
• $A_t(i) = P(O_1, O_2, \ldots, O_t, S_t = i)$
• $B_t(i) = P(O_{t+1}, \ldots, O_{T_u} \mid S_t = i)$
• $\gamma_t(i) = P(S_t = i \mid O_u)$
– M-step:
• Other parameters such as state transition probabilities, topic distributions, etc.
– Parameters learned:
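The E-step quantities have the shape of the standard forward-backward recursions for a hidden Markov model. Below is a minimal sketch of those recursions for a plain discrete HMM, assuming per-tweet likelihoods P(O_t | S_t = i) are already available. It is not HFSTM-FIT itself, which also has to handle the topic switches, and it skips the rescaling needed for long sequences.

```python
def forward_backward(initial, transition, likelihoods):
    """Return gamma[t][i] = P(S_t = i | O_1..O_T) for a discrete HMM.

    likelihoods[t][i] is P(O_t | S_t = i).
    """
    n, T = len(initial), len(likelihoods)
    # Forward pass: A[t][i] = P(O_1..O_t, S_t = i)
    A = [[initial[i] * likelihoods[0][i] for i in range(n)]]
    for t in range(1, T):
        A.append([
            likelihoods[t][j] * sum(A[t - 1][i] * transition[i][j] for i in range(n))
            for j in range(n)
        ])
    # Backward pass: B[t][i] = P(O_{t+1}..O_T | S_t = i)
    B = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        B[t] = [
            sum(transition[i][j] * likelihoods[t + 1][j] * B[t + 1][j] for j in range(n))
            for i in range(n)
        ]
    evidence = sum(A[T - 1])  # P(O_1..O_T)
    return [[A[t][i] * B[t][i] / evidence for i in range(n)] for t in range(T)]


# Assumed two-state toy example
gamma = forward_backward(
    initial=[0.6, 0.4],
    transition=[[0.7, 0.3], [0.4, 0.6]],
    likelihoods=[[0.9, 0.2], [0.1, 0.8], [0.1, 0.8]],
)
for row in gamma:
    print([round(p, 3) for p in row])
```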
A possible issue with HFSTM
• Suffers from large, noisy vocabulary.
• Semi-supervision for improvement
– Introduce weak supervision into HFSTM.
HFSTM-A
• HFSTM-A(spect)
– Introduce an aspect variable y, expressing our belief on whether a word is flu-related or not.
– The value of y biases the switch variables s.t. flu-related words are more likely to be explained by state topics.
When the aspect value (y) is introduced, the switching probabilities are updated accordingly.
Vocabulary & Dataset
• Vocabulary (230 words):
– Flu-related keyword list by Chakraborty SDM 2014
– Extra state-related keyword list
• Dataset (34,000 tweets):
– Identify infected users and collect their tweets
– Train on data from Jun 20, 2013-Aug 06, 2013
– Test on two time periods:
• Dec 01, 2012-July 08, 2013
• Nov 10, 2013-Jan 26, 2014
Learned word distributions
• The most probable words learned in each state
Probably healthy: S  Having symptoms: E  Definitely sick: I
Learned state transition
[Figure panels: transition probabilities learned by HFSTM; transitions in real tweets]
Not directly flu-related, yet correctly identified
Flu trend fitting
• Ground-truth:
– The Pan American Health Organization (PAHO)
• Algorithms:
– Baseline:
• Count the number of keywords weekly as features, and regress to the ground-truth curve.
– Google Flu Trends:
• Take the Google Flu Trends data as input, regress to the PAHO curve.
– HFSTM:
• Distinguish different states of keywords, and only use the number of keywords in the I state. Again regress to PAHO.
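The regression step shared by all three algorithms can be sketched with a one-feature ordinary least squares fit. The weekly counts below are synthetic placeholders (constructed as cases = 10 × count + 10), not PAHO or Twitter data, and `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic placeholder data: weekly I-state keyword counts and case counts
i_state_counts = [12, 30, 55, 41, 20]
case_counts = [130, 310, 560, 420, 210]

slope, intercept = fit_line(i_state_counts, case_counts)
fitted = [slope * c + intercept for c in i_state_counts]
print(slope, intercept)
```

Swapping `i_state_counts` for raw keyword counts or Google Flu Trends values gives the other two setups; only the input feature changes.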
Flu trend fitting
• Linear regression to the case count reported by PAHO (the ground-truth)
HFSTM-A
• Results are qualitatively similar to HFSTM, even when the vocabulary is 10 times larger.
See Poster!
Part 1: Empirical Studies
• Q1: How to predict Flu-trends better?
• Q2: How does activity evolve over time?
Google Search Volume
e.g., given (1) first spike, (2) release date of two sequel movies (3) access volume before the release date
[Figure: search volume over time, marked with (1) first spike, (2) release date, (3) two weeks before release; the upcoming spikes are unknown]
Patterns
[Figure sequence: plot of Y against X, then with more data; patterns enable anomaly detection, imputation, extrapolation, and compression]
Rise and fall patterns in social media
• Meme (# of mentions in blogs)
– short phrases sourced from U.S. politics in 2008
“you can put lipstick on a pig”
“yes we can”
Rise and fall patterns in social media
• Can we find a unifying model, which includes these patterns?
• four classes on YouTube [Crane et al. ’08]
• six classes on Meme [Yang et al. ’11]
Rise and fall patterns in social media
• Answer: YES!
• We can represent all patterns by a single model
In Matsubara, Sakurai, Prakash+ SIGKDD 2012
Main idea - SpikeM
- 1. Un-informed bloggers (uninformed about rumor)
- 2. External shock at time nb (e.g., breaking news)
- 3. Infection (word-of-mouth)
[Figure: snapshots at time n=0, n=nb, n=nb+1]
Infectiveness of a blog-post at age n: $f(n) = \beta \cdot n^{-1.5}$
- β: strength of infection (quality of news)
- $n^{-1.5}$: power-law decay function (how infective a blog posting is)
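A minimal simulation sketch of the three ingredients above (un-informed bloggers, one external shock, word-of-mouth infection with power-law decay). It assumes the base recurrence as commonly stated for the cited SIGKDD 2012 model, ΔB(n+1) = U(n) · Σ_{t=nb..n} (ΔB(t) + S(t)) · f(n+1−t) + ε with f(n) = β · n^(−1.5), and leaves out periodicity. The clamp to the remaining un-informed count and all parameter values are assumptions added here.

```python
def spikem(n_total, beta, n_b, s_b, eps, steps):
    """Base SpikeM-style recurrence without periodicity.

    Returns delta_b, where delta_b[n] is the number of newly informed bloggers at time n.
    """
    def f(age):
        # Power-law decay of infectiveness with slope -1.5
        return beta * age ** -1.5

    uninformed = float(n_total)
    delta_b = [0.0] * (steps + 1)
    for n in range(steps):
        pressure = 0.0
        for t in range(n_b, n + 1):
            shock = s_b if t == n_b else 0.0  # external shock only at time n_b
            pressure += (delta_b[t] + shock) * f(n + 1 - t)
        new_informed = uninformed * pressure + eps
        new_informed = min(new_informed, uninformed)  # assumed clamp: cannot exceed who is left
        delta_b[n + 1] = new_informed
        uninformed -= new_informed
    return delta_b


# Assumed toy parameters
series = spikem(n_total=1000, beta=0.001, n_b=5, s_b=10, eps=0.0, steps=50)
peak_time = series.index(max(series))
print(peak_time, round(max(series), 1))
```

With these values nothing happens before the shock, activity rises by word-of-mouth, then falls as the pool of un-informed bloggers empties.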
-1.5 slope
J. G. Oliveira et al. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005).
(also in Leskovec, McGlohon+, SDM 2007)
SpikeM - with periodicity
• Full equation of SpikeM
• Periodicity: bloggers change their activity over time (e.g., daily, weekly, yearly)
[Figure: activity vs. time n; peak activity at 12pm, low activity at 3am]
Tail-part forecasts
• SpikeM can capture tail part
“What-if” forecasting
e.g., given (1) first spike, (2) release date of two sequel movies (3) access volume before the release date
[Figure: search volume over time, marked with (1) first spike, (2) release date, (3) two weeks before release; the upcoming spikes are unknown]
“What-if” forecasting
• SpikeM can forecast upcoming spikes
– SpikeM can forecast not only tail-part, but also rise-part!
[Figure markers: (1) first spike, (2) release date, (3) two weeks before release]
Modeling Malware Penetration
• Worldwide Intelligence Network
– Which machine got which malware (or legitimate files)
– 1 Billion nodes
– 37 Billion edges
• Q: Temporal patterns?
[Papalexakis et al. 2013]
Q: Temporal Patterns
Looks familiar?
SpikeM again (or SharkFin)
7 parameters only!
~ 400 points ~ 400 points
Latent Propagation Patterns
Bonus: Protest Predictions
• Can Twitter provide a lead time?
• South American Twitter dataset
– Language: Spanish/Portuguese
– Idea
1. Look for trending keywords.
2. Predict event type for protest using SpikeM parameters!
[Figure: a political tweet, with events labeled Violent Protest (VP) or Non Violent Protest (P)]
[Sundereisan et al. ASONAM 2014][Jin et al. SIGKDD 2014]
Part 2: Algorithms
• Q3: How to control out-breaks?
(Broad theme: Network Topology Manipulation)
Immunization (= Interventions)
• Different Flavors:
– Pre-emptive
– Data-aware
Pre-emptive: Vulnerability
• First eigenvalue λ1 (of adjacency matrix) is sufficient for most diffusion models. [Prakash et al. ICDM’12 selected for best papers]
λ1 is the epidemic threshold
“Safe” “Vulnerable” “Deadly”
Increasing λ1, increasing vulnerability
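A small sketch of how λ1 separates "safe" from "vulnerable" graphs. It assumes the commonly stated SIS form of the threshold, where an epidemic dies out when λ1 · β/δ < 1; the infection rate β, recovery rate δ, and the function names are assumptions here, not symbols from the slide. The eigenvalue is computed by plain power iteration on A + I, which is only practical for small dense matrices.

```python
def largest_eigenvalue(adj, iterations=1000):
    """Estimate lambda_1 of a symmetric non-negative adjacency matrix.

    Iterates with (A + I) so that bipartite graphs also converge, then
    returns the Rayleigh quotient of A at the resulting vector.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iterations):
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = max(abs(v) for v in y)
        x = [v / norm for v in y]
    ax = [sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(ax, x)) / sum(b * b for b in x)


def below_threshold(adj, beta, delta):
    """Assumed SIS-style check: effective strength lambda_1 * beta / delta < 1."""
    return largest_eigenvalue(adj) * beta / delta < 1.0


# Star with 4 leaves (lambda_1 = 2) and 5-clique (lambda_1 = 4)
star = [[0, 1, 1, 1, 1]] + [[1, 0, 0, 0, 0] for _ in range(4)]
clique = [[0 if i == j else 1 for j in range(5)] for i in range(5)]
print(largest_eigenvalue(star), largest_eigenvalue(clique))
```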
Goal
• Decrease λ1 as much as possible
• Node-based [Tong, Prakash, + ICDM 2010]
• Edge-based [Tong, Prakash, Eliassi-Rad+ CIKM 2012, Best Paper Award]
• Edge-Manipulation (see next)
Fractional Asymmetric Immunization
Hospital Another Hospital
Drug-resistant Bacteria (like XDR-TB)
[Prakash, Adamic, Iwashyna (M.D.) SDM 2013]
Fractional Asymmetric Immunization
Hospital Another Hospital
Drug-resistant Bacteria (like XDR-TB)
= f
Fractional Asymmetric Immunization
Hospital Another Hospital
Problem: Given k units of disinfectant, how to distribute them to maximize hospitals saved?
Our Solution
• Part 1: Value
– Approximate Eigen-drop (Δλ)
– Matrix perturbation theory
• Part 2: Algorithm
– Greedily pick best node at each step
– Near-optimal due to submodularity
• SmartAlloc (linear complexity)
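The "approximate eigen-drop, then greedy" idea can be illustrated on the simpler whole-node case. This sketch is not SmartAlloc: it removes entire nodes rather than allocating fractional units, and it recomputes the leading eigenpair by power iteration at every step. It uses the first-order perturbation estimate that removing node i from a symmetric graph lowers λ1 by about 2 · λ1 · u_i², where u is the unit leading eigenvector. All names are hypothetical.

```python
def leading_eigenpair(adj, iterations=500):
    """Power iteration on (A + I); returns (lambda_1, unit eigenvector)."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(iterations):
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    lam = sum(x[i] * adj[i][j] * x[j] for i in range(n) for j in range(n))
    return lam, x


def greedy_immunize(adj, k):
    """Pick k nodes one at a time by the largest estimated eigen-drop."""
    adj = [[float(v) for v in row] for row in adj]  # work on a copy
    chosen = []
    for _ in range(k):
        lam, u = leading_eigenpair(adj)
        candidates = [i for i in range(len(adj)) if i not in chosen]
        best = max(candidates, key=lambda i: 2.0 * lam * u[i] ** 2)
        chosen.append(best)
        for j in range(len(adj)):  # immunizing a node removes its edges
            adj[best][j] = adj[j][best] = 0.0
    return chosen


# Star with hub 0: the hub should be immunized first
star = [[0, 1, 1, 1, 1]] + [[1, 0, 0, 0, 0] for _ in range(4)]
print(greedy_immunize(star, 1))
```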
Our Algorithm “SMART-ALLOC”
[Figure panels: current practice vs. SMART-ALLOC]
[US-MEDICARE NETWORK 2005]
• Each circle is a hospital, ~3000 hospitals
• More than 30,000 patients transferred
~6x fewer!
Running Time
[Bar chart of wall-clock time, lower is better]
Simulations (best competitor): > 1 week
SMART-ALLOC: 14 secs
> 30,000x speed-up!
Experiments
[Two plots, lower is better]
PENN-NETWORK, K = 200: ~5x
SECOND-LIFE, K = 2000: ~2.5x
Latest results
• First (provable) approximation algorithms for edge-based problem [Saha, Adiga, Prakash, Vullikanti SDM 2015]
– O(log^2 n)-factor (can be improved to O(log n))
• Based on the idea of removing closed walks
– Semi-Definite Programming rounding-based O(1) factor
Data-aware Immunization
[Figure: graph with infected nodes, and its dominator tree]
Given: Graph and infected nodes
Find: ‘best’ nodes for immunization
• Complexity
– NP-hard
– Hard to approximate within an absolute error
• DAVA-tree
– Optimal solution on the tree
• DAVA and DAVA-fast
– Merging infected nodes
– Build a “dominator tree”, and run DAVA-tree
• Running time: subquadratic
– DAVA: O(k(|E| + |V| log |V|))
– DAVA-fast: O(|E| + |V| log |V|)
[Zhang and Prakash, SDM 2014]
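The first two DAVA steps (merge infected nodes, build a dominator tree) can be sketched as follows. This covers only the tree construction, using a simple iterative set-based dominator computation rather than a fast dominator algorithm; the node weighting and the DAVA-tree selection step are not shown. The function name, the `INFECTED` root label, and the toy graph are assumptions.

```python
def dominator_tree(edges, infected, root="INFECTED"):
    """Merge infected nodes into one root; return {node: immediate dominator}."""
    def merge(v):
        return root if v in infected else v

    succ, pred = {root: set()}, {root: set()}
    for a, b in edges:
        a, b = merge(a), merge(b)
        if a == b:
            continue
        for u, v in ((a, b), (b, a)):  # undirected edge -> two directed arcs
            succ.setdefault(u, set()).add(v)
            pred.setdefault(v, set()).add(u)

    # Nodes reachable from the merged root
    reachable, stack = {root}, [root]
    while stack:
        for v in succ[stack.pop()]:
            if v not in reachable:
                reachable.add(v)
                stack.append(v)

    # Iterate dom(v) = {v} | intersection of dom(p) over predecessors p
    dom = {v: set(reachable) for v in reachable}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for v in reachable - {root}:
            preds = [dom[p] for p in pred[v] if p in reachable]
            new = {v} | set.intersection(*preds)
            if new != dom[v]:
                dom[v], changed = new, True

    # Immediate dominator: the strict dominator with the largest dominator set
    return {v: max(dom[v] - {v}, key=lambda d: len(dom[d])) for v in reachable - {root}}


# Hypothetical graph: A is infected; every path to C or D passes through B
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D"), ("A", "E")]
print(dominator_tree(edges, infected={"A"}))
```

In the resulting tree, a node's subtree holds the nodes that can only be reached from the infection through it, which is what makes immunizing that node valuable.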
Extensions
• Can be extended to uncertain and noisy initial data as well!
[Zhang and Prakash, CIKM 2014]
Twitter Firehose API, 1% sample
Group-based Immunization
How to select groups to minimize the epidemic?
[Figure: network with nodes A-F partitioned into groups]
• Epidemiology
– Contact networks
– People are grouped by ages, demographics, occupations …
• Social Media
– Friendship networks
– Friends are grouped by the same interests
– E.g., Facebook pages
[Zhang, Adiga, Vullikanti, Prakash, ICDM 2015]
See Poster!
Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
• Part 2: Policy and Action (Algorithms)
• Conclusion and Future Plans
Future Plans
DATA: Large real-world networks & processes
ANALYSIS: Understanding
POLICY/ACTION: Managing
Scalability – Big Data
• Datasets of unprecedented scale
– High dimensionality and sample size!
• Need scalable algorithms for
– Learning Models
– Developing Policy
• Leverage parallel systems
– Map-Reduce clusters (like Hadoop) for data-intensive jobs (more than 6000 machines)
– Parallelized compute-intensive simulations (like Condor)
Uncertain Data in Cascade analysis (more implementable policies)
Correcting for missing data
[Figure panels: original, nodes sampled off; culprits, and missing nodes filled in]
Sundereisan, Vreeken, Prakash. 2014
Designing More Robust Immunization Policies
Zhang and Prakash. CIKM 2014
References
1. Scalable Vaccine Distribution in Large Graphs given Uncertain Data (Yao Zhang and B. Aditya Prakash) -- In CIKM 2014.
2. Fast Influence-based Coarsening for Large Networks (Manish Purohit, B. Aditya Prakash, Chanhyun Kang, Yao Zhang and V. S. Subrahmanian) -- In SIGKDD 2014.
3. DAVA: Distributing Vaccines over Large Networks under Prior Information (Yao Zhang and B. Aditya Prakash) -- In SDM 2014.
4. Fractional Immunization on Networks (B. Aditya Prakash, Lada Adamic, Jack Iwashyna, Hanghang Tong, Christos Faloutsos) -- In SDM 2013.
5. Spotting Culprits in Epidemics: Who and How many? (B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos) -- In ICDM 2012, Brussels (Invited to KAIS Journal Best Papers of ICDM.)
6. Gelling, and Melting, Large Graphs through Edge Manipulation (Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad, Michalis Faloutsos, Christos Faloutsos) -- In ACM CIKM 2012, Hawaii (Best Paper Award)
7. Rise and Fall Patterns of Information Diffusion: Model and Implications (Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, Christos Faloutsos) -- In SIGKDD 2012, Beijing
8. Interacting Viruses on a Network: Can both survive? (Alex Beutel, B. Aditya Prakash, Roni Rosenfeld, Christos Faloutsos) -- In SIGKDD 2012, Beijing
9. Winner-takes-all: Competing Viruses or Ideas on fair-play networks (B. Aditya Prakash, Alex Beutel, Roni Rosenfeld, Christos Faloutsos) -- In WWW 2012, Lyon
10. Threshold Conditions for Arbitrary Cascade Models on Arbitrary Networks (B. Aditya Prakash, Deepayan Chakrabarti, Michalis Faloutsos, Nicholas Valler, Christos Faloutsos) -- In IEEE ICDM 2011, Vancouver (Invited to KAIS Journal Best Papers of ICDM.)
11. Time Series Clustering: Complex is Simpler! (Lei Li, B. Aditya Prakash) -- In ICML 2011, Bellevue
12. Epidemic Spreading on Mobile Ad Hoc Networks: Determining the Tipping Point (Nicholas Valler, B. Aditya Prakash, Hanghang Tong, Michalis Faloutsos and Christos Faloutsos) -- In IEEE NETWORKING 2011, Valencia, Spain
13. Formalizing the BGP stability problem: patterns and a chaotic model (B. Aditya Prakash, Michalis Faloutsos and Christos Faloutsos) -- In IEEE INFOCOM NetSciCom Workshop, 2011.
14. On the Vulnerability of Large Graphs (Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad and Christos Faloutsos) -- In IEEE ICDM 2010, Sydney, Australia
15. Virus Propagation on Time-Varying Networks: Theory and Immunization Algorithms (B. Aditya Prakash, Hanghang Tong, Nicholas Valler, Michalis Faloutsos and Christos Faloutsos) -- In ECML-PKDD 2010, Barcelona, Spain
16. MetricForensics: A Multi-Level Approach for Mining Volatile Graphs (Keith Henderson, Tina Eliassi-Rad, Christos Faloutsos, Leman Akoglu, Lei Li, Koji Maruhashi, B. Aditya Prakash and Hanghang Tong) -- In SIGKDD 2010, Washington D.C.
Acknowledgements
• Collaborators: Christos Faloutsos, Roni Rosenfeld, Michalis Faloutsos, Lada Adamic, Theodore Iwashyna (M.D.), Dave Andersen, Tina Eliassi-Rad, Iulian Neamtiu, Varun Gupta, Jilles Vreeken, V. S. Subrahmanian, John Brownstein (M.D.), Deepayan Chakrabarti, Hanghang Tong, Kunal Punera, Ashwin Sridharan, Sridhar Machiraju, Mukund Seshadri, Alice Zheng, Lei Li, Polo Chau, Nicholas Valler, Alex Beutel, Xuetao Wei
• Students: Liangzhe Chen, Shashidhar Sundereisan, Benjamin Wang, Yao Zhang, Sorour Amiri
• Funding
Analysis Policy/Action Data
Making Diffusion Work for You
B. Aditya Prakash http://www.cs.vt.edu/~badityap