Xintao Wu, Jan 18, 2013, Retweeting Behavior and Spectral Graph Analysis in Social Media

TRANSCRIPT

  • Slide 1

Xintao Wu
Jan 18, 2013
Retweeting Behavior and Spectral Graph Analysis in Social Media

  • Slide 2

Social Media Customer Analytics

Network topology

Structured profile
    name  sex  age  disease  salary
    Ada   F    18   cancer   25k
    Bob   M    25   heart    110k

    id  sex  age  address  income
    5   F    Y    NC       25k
    3   M    Y    SC       110k

Retweet sequence
Unstructured text (e.g., blog, tweet)
Customer profile, customer transaction, inventory, product desc and review
Entity resolution, patterns, temporal/spatial, scalability, visualization, sentiment, privacy

  • Slide 3

Outline

Examining retweeting behavior to understand information propagation
    Multi-factor interaction analysis
    Coverage prediction
    Burst detection
Spectral graph analysis
    Community partition
    Fraud detection

  • Slide 4

Multi-factor interaction analysis

For each following relationship, what factors affect user A's decision on whether to forward messages from B to A's followers?

We examine users' retweet behaviors using the following features:
    Power ratio (A)
    Link structure (B)
    Location factor (C)
    Gender factor (D)

We apply a fitted log-linear model to capture and interpret interaction patterns among features A-D and retweet (E).

  • Slide 5

Interpreting interaction effect

  • Slide 6

Interpretation example

Neither gender nor location alone has a significant effect on retweeting. However, once link structure is considered:
    Females are more conservative and have a lower tendency to retweet messages from non-friend (especially female) users, but have a higher tendency to retweet messages from friends or superstars.
    Males are more open-minded and have a higher tendency to retweet messages from non-friend (especially female) users.

  • Slide 7

Outline (same as Slide 3)

  • Slide 8

Retweet Sequence

Information dynamically flows through the network.
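One way to picture a retweet sequence is as a list of records that each point back to the message they forward, which yields a tree rooted at the original tweet. The sketch below is illustrative only: the `Retweet` class, its field names, and the sample records are assumptions made for this example, not the data format used in the talk. It computes each message's path length, meaning its distance in hops from the original tweet.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Retweet:
    time: int                 # posting time (t1, t2, ... as plain integers)
    msg_id: str               # id of this message
    user: str                 # user who posted it
    parent_id: Optional[str]  # message being forwarded; None for the original


def path_lengths(records: List[Retweet]) -> Dict[str, int]:
    """Return each message's distance (in hops) from the original tweet.

    Assumes every parent_id refers to a message present in `records`.
    """
    parent = {r.msg_id: r.parent_id for r in records}
    depth: Dict[str, int] = {}

    def hops(msg_id: str) -> int:
        # Memoised walk up the parent chain until the original is reached.
        if msg_id not in depth:
            p = parent[msg_id]
            depth[msg_id] = 0 if p is None else hops(p) + 1
        return depth[msg_id]

    return {r.msg_id: hops(r.msg_id) for r in records}


# Hypothetical sequence: A posts m1, B and C forward it, D forwards B's copy.
sequence = [
    Retweet(1, "m1", "A", None),
    Retweet(2, "m2", "B", "m1"),
    Retweet(3, "m3", "D", "m2"),
    Retweet(4, "m4", "C", "m1"),
]
print(path_lengths(sequence))  # {'m1': 0, 'm2': 1, 'm3': 2, 'm4': 1}
```

Storing only a parent pointer per record keeps the sequence append-only, and the tree can be rebuilt whenever path lengths or subtree sizes are needed.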
[Figure: users Alice, Bob, Cathy, David, Ellen, Fred; labels D1, D2, D3; retweet sequence so far: t1 m1 A]

  • Slide 9

Retweet Sequence

Information dynamically flows through a social network.

[Figure: same network; retweet sequence so far:
    t1 m1 A
    t2 m2 B; t1 m1 A]

  • Slide 10

Flow Through Tree Structure

Information dynamically flows through a social network.

[Figure: same network; retweet sequence so far:
    t1 m1 A
    t2 m2 B; t1 m1 A
    t3 m3 D; B; t1 m1 A]

  • Slide 11

Flow Through Tree Structure

Information dynamically flows through a social network.

[Figure: same network; retweet sequence so far:
    t1 m1 A
    t2 m2 B; t1 m1 A
    t3 m3 D; B; t1 m1 A
    t4 m4 C; t1 m1 A]

  • Slide 12

WISE12 Challenge

Sina Weibo
    # of users: 5,636,858
    # of tweets: 46,584,914
    # of retweets: 190,920,026

33 test messages, each with 100 initial retweets, composed by 27 users from 6 events

For each message, predict
    M1: the number of retweets in 30 days
    M2: the number of possible views in 30 days

  • Slide 13

Idea

We treat the retweeting activity of each original message in the training data as a time series. Each value corresponds to the number of times the original message is retweeted during time period t.

For each message in the test data:
    the start of the series is known from the 100 initial retweets
    the rest is predicted using ARMA

  • Slide 14

Prediction Result

Runner-up award (2nd place) in the WISE 2012 Challenge Mining Track.

[Figure panels: Death of Steve Jobs; Xiaomi Release; Yao Jiaxin Murder Case; Xiaomi Release]

  • Slide 15

Outline (same as Slide 3)

  • Slide 16

Bursts

[Figure labels: Peak Time, Duration Time]

  • Slide 17

Topic

  • Slide 18

Retweet vs. Time

  • Slide 19

Retweet vs. Time

  • Slide 20

Burst Analysis: Users

Top 100 users tend to have shorter path length, shorter peak time, and shorter duration time.
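The time-series idea from Slide 13 can be sketched in a few lines. The talk uses ARMA; to stay dependency-free, the sketch below swaps in the simplest special case, an AR(1) model (that is, ARMA(1, 0)) fitted by least squares, so it is a stand-in rather than the model behind the reported results. The function names, the period width, and the sample numbers are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def retweet_series(timestamps: Sequence[float], period: float, n_periods: int) -> List[int]:
    """Value t is the number of retweets that fall in time period t."""
    counts = Counter(int(ts // period) for ts in timestamps)
    return [counts.get(t, 0) for t in range(n_periods)]


def fit_ar1(series: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    if var_x == 0:  # constant history: fall back to predicting the mean
        return mean_y, 0.0
    phi = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / var_x
    return mean_y - phi * mean_x, phi


def forecast(series: Sequence[float], steps: int) -> List[float]:
    """Roll the fitted AR(1) model forward, clipping negative counts to zero."""
    c, phi = fit_ar1(series)
    out, last = [], float(series[-1])
    for _ in range(steps):
        last = max(0.0, c + phi * last)
        out.append(last)
    return out


# Hypothetical retweet times (in minutes), binned into 10-minute periods.
print(retweet_series([0, 5, 12, 14, 31], period=10, n_periods=4))  # [2, 2, 0, 1]

# Hypothetical decaying series observed so far, then a two-period forecast.
observed = [16, 8, 4, 2]
print(forecast(observed, steps=2))  # roughly [1.0, 0.5]
```

Under these assumptions, an estimate of the total retweet count would be the observed counts plus the sum of the forecast values over the remaining periods.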
  • Slide 21

Burst Prediction

Extract features
    User-related, including profile and history information
    Tweet-related, including time series and retweet tree
Run classifiers
    Logistic regression
    Random forest
    Decision tree
    Naive Bayes
    SVM
    KNN
Achieve 83.2% accuracy

  • Slide 22

Outline (same as Slide 3)

  • Slide 23

Spectral graph analysis

Spectral coordinate: Polbook Network

  • Slide 24

Accuracy of AdjCluster

    Lap [Miller and Teng 1998]: Laplacian based
    Ncut [Shi and Malik, 2000]: Normalized cut
    HE [Wakita and Tsurumi, 2007]: Modularity based agglomerative clustering
    SpokEn [Prakash et al., 2010]: EigenSpoke

Accuracy: [formula not preserved in the transcript], where the missing symbol denotes the i-th community produced by different algorithms

Refer to IJCAI 11 for details.

  • Slide 25

Evaluation on Web spam challenge data

SPCTRA fraud detection
GREEDY: based on outer-triangles [Shrivastava, ICDE, 2008]
100-1000 times faster

Refer to ICDE 11 for details.

  • Slide 26

Acknowledgments

This work was supported in part by U.S. National Science Foundation CNS-0831204 and CCF-1047621, and UNC Charlotte Chancellor's Special Fund.

Thank You! Questions?