Detecting Social Spam Campaigns on Twitter
Zi Chu & Haining Wang, The College of William & Mary
Indra Widjaja, Bell Laboratories, Alcatel-Lucent, USA
Presented by Yingjiu Li, Singapore Management University, Singapore
Background
Popularity brings spam
- Spam definition: malicious / phishing / scam content or URL
- Social spamming is more successful because it exploits social relationships
Spam tweet = <text, URL>
Background
Spam campaign
- A spammer runs multiple accounts to spread spam tweets for a specific purpose (e.g., promoting a spam site, selling goods)
Real case: an adult pill campaign run through multiple accounts
Background
Detecting spam is the first step in fighting spam
- Tweet level: spam words, spam URLs
- Account level: spam tweets, aggressive automation
- Campaign level: cluster related tweets/accounts into campaigns, observe collective features (similar content, posting behavior …)
Efficiency
- Capture multiple spam accounts at one time
Robustness
- Some spamming methods can’t be detected at the individual level
Related Work
Existing work relies on the URL as its only feature
- Group tweets into a campaign based on the shared URL. If the URL is blacklisted, the campaign is classified as spam.
Disadvantages
- Blacklists lag behind (90% of clicks occur before the URL is blacklisted)
- Blacklists cover only part of all spam URLs
- False positives (the whole domain bit.ly is blacklisted, including the benign webpage http://bit.ly/fg7Uy)
- False negatives: the URL/website is benign, but the campaign’s collective behavior is spamming
Background
A real spam campaign example of aggressive duplication
Twitter Spamming Rule: “posts duplicate content over multiple accounts”
Account EldoYPISILONE: Nutz this music video, SO COOL ;) http://on.fb.me/ht2wXJ?=mti0
Account MatthewVankomen: Amazing this music footage, you'll like ^^ http://on.fb.me/ht2wXJ?=nzky
Account KristaBauske2r: Amazing this music vid, Maybe u'll like it :^ http://on.fb.me/ht2wXJ?=mtcz
Contribution
Improve on existing work that relies on the URL as its only feature
Introduce new features
Design an automatic detection system using machine learning
Data Collection
Twitter Streaming API
- Spritzer
- Uniform sampling, 1% of real-time global tweets
Dataset: 50 million tweets
- Feb. – Apr. 2011
- Only tweets with URLs are checked, 8 million (tweets without URLs are assumed to be non-spam)
Clustering Algorithm
URL redirection, tweet = <content, URL>
- original URL => final landing URL
  http://ow.ly/5UbUS ==> ... ==> http://www.people.com/people/.../020515101,00.html
Cluster tweets with the same final URL into a campaign
Campaign = <shared URL, tweet set, account set>
Campaign_1 <shared_URL_1, {tweet_1, tweet_2, tweet_3}, {account_1, account_2}>
Campaign_2 <shared_URL_2, {tweet_4, tweet_5}, {account_1, account_3}>
Campaign_3 <shared_URL_3, {tweet_1, tweet_6}, {account_1, account_4}>
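The clustering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `REDIRECTS` table, the example URLs, and the function names are hypothetical, and a lookup table stands in for real HTTP redirect resolution.

```python
from collections import defaultdict

# Hypothetical redirect table standing in for real HTTP redirect resolution.
REDIRECTS = {
    "http://short.example/a": "http://landing.example/page",
    "http://short.example/b": "http://landing.example/page",
}

def resolve_final_url(url, redirects=REDIRECTS, max_hops=10):
    """Follow the redirect chain until a URL with no further redirect is reached."""
    for _ in range(max_hops):
        if url not in redirects:
            break
        url = redirects[url]
    return url

def cluster_campaigns(tweets):
    """Group (tweet_id, account, url) records by their final landing URL."""
    campaigns = defaultdict(lambda: {"tweets": set(), "accounts": set()})
    for tweet_id, account, url in tweets:
        final_url = resolve_final_url(url)
        campaigns[final_url]["tweets"].add(tweet_id)
        campaigns[final_url]["accounts"].add(account)
    return dict(campaigns)

tweets = [
    ("tweet_1", "account_1", "http://short.example/a"),
    ("tweet_2", "account_2", "http://short.example/b"),
    ("tweet_3", "account_1", "http://other.example/x"),
]
campaigns = cluster_campaigns(tweets)
```

Here the two shortened URLs resolve to the same landing page, so `tweet_1` and `tweet_2` fall into one campaign with two accounts, while `tweet_3` forms its own campaign.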
Ground Truth
Creation
- Blacklists: Google’s SafeBrowsing, PhishTank, URIBL, SURBL, Spamhaus (if the URL is blacklisted, the campaign is labeled as spam)
- Manual check: content of tweets, accounts (e.g., tweets posted by them outside the campaign)…
  Do they violate the official Twitter Rules on Spam and Abuse?
Ground truth set
- 580 legitimate campaigns
- 744 spam campaigns
Data Analysis
Master URL: http://biy.ly/5As4k3
Affiliate URL <==> spam account
- Account_1: http://biy.ly/5As4k3?=xd56
- Account_2: http://biy.ly/5As4k3?=f2kk
Master URL Diversity Ratio = unique_Master_URL_# / tweet_no
- High ratio ==> account independence
- Low ratio ==> account dependence
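A small sketch of the Master URL Diversity Ratio. It assumes the master URL is whatever precedes the `?` affiliate suffix, which matches the two example URLs but is not stated as a general rule in the slides; the function names are hypothetical.

```python
def master_url(url):
    """Assumption: the master URL is the part before the '?' affiliate suffix."""
    return url.split("?", 1)[0]

def master_url_diversity_ratio(tweet_urls):
    """Number of unique master URLs divided by the number of tweets."""
    if not tweet_urls:
        raise ValueError("campaign has no tweets")
    unique_masters = {master_url(u) for u in tweet_urls}
    return len(unique_masters) / len(tweet_urls)

# Four tweets that all point at affiliate variants of one master URL.
campaign_urls = [
    "http://biy.ly/5As4k3?=xd56",
    "http://biy.ly/5As4k3?=f2kk",
    "http://biy.ly/5As4k3?=xd56",
    "http://biy.ly/5As4k3?=f2kk",
]
ratio = master_url_diversity_ratio(campaign_urls)
```

With one master URL across four tweets the ratio is 0.25, the low end that the slide associates with account dependence.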
Feature Extractor
Tweet-level Features
- Tweet = <textual content, URL>
- Text contains spam words?
- URL is redirected?
- URL is blacklisted?
Feature Extractor
Account-level Features
Account = <tweets, friends/followers, account properties>
- Lifetime tweet count
- Account registration date
- Account protected? Verified?
- Friend_count, follower_count, ratio
- Account reputation = follower_count / (follower_count + friend_count)
- Account taste = avg(account reputation of each of the account's friends)
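The two ratio features can be computed directly from follower and friend counts. The sketch below follows the definitions on this slide; the zero-count handling and the example numbers are assumptions added for illustration.

```python
def reputation(follower_count, friend_count):
    """Account reputation = follower_count / (follower_count + friend_count)."""
    total = follower_count + friend_count
    # Assumption: an account with no followers and no friends gets reputation 0.
    return follower_count / total if total else 0.0

def taste(friends):
    """Account taste = average reputation over the account's friends.

    friends is a list of (follower_count, friend_count) pairs.
    """
    if not friends:
        return 0.0  # assumption: no friends means taste 0
    return sum(reputation(f, g) for f, g in friends) / len(friends)

popular = reputation(follower_count=9000, friend_count=1000)
follows_many = reputation(follower_count=10, friend_count=990)
account_taste = taste([(9000, 1000), (10, 990)])
```

An account followed by many but following few scores close to 1, while an account that follows many and is followed by few scores close to 0.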
Feature Extractor
Campaign-level Features
- Campaign = ({tweets}, {accounts}, shared_URL)
- Account Diversity Ratio = account_no / tweet_no
- Entropy of inter-arrival timing
  Lower: regular behavior ==> coordination
  Higher: irregular behavior ==> independent participation
  Corrected Conditional Entropy (CCE)
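To illustrate why regular posting yields low entropy, here is a simplified sketch that uses plain Shannon entropy over binned inter-arrival times. It is not the Corrected Conditional Entropy named on the slide; the bin width and the timestamps are hypothetical.

```python
import math
from collections import Counter

def inter_arrival_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (in bits) of the binned gaps between consecutive tweets."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return 0.0
    # Put each gap into a fixed-width bin, then measure how spread out the bins are.
    bins = Counter(int(g // bin_seconds) for g in gaps)
    n = len(gaps)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())

regular = [0, 600, 1200, 1800, 2400]   # one tweet every 10 minutes
irregular = [0, 45, 900, 1000, 5000]   # gaps of 45, 855, 100 and 4000 seconds

regular_entropy = inter_arrival_entropy(regular)
irregular_entropy = inter_arrival_entropy(irregular)
```

The evenly spaced timestamps put every gap in the same bin, so their entropy is zero, while the irregular ones spread over four bins.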
Feature Extractor
- Content self-similarity
{Tweets} => sense clusters
Cluster_1) this music video so cool,
amazing this music footage you'll like,
this music video hope u like
Cluster_2) How to Consolidate Credit Card Debt
Consolidate Credit Cards Now to Become Debt Free Later
Three Effective Ways to Consolidate Credit Card Debt Without Using Intermediaries
Feature Extractor
- SenseClusters - cluster messages based on contextual similarity
- Vector space model: text ==> vector
Msg_1, He visited Russia in 1996.
Msg_2, In 1996 he went to Russia.
…
Vocabulary = {in, he, Russia, to, visited, went, 1996, …}
Occurrence Matrix
Weight, TF-IDF (Term Frequency – Inverse Document Frequency)
word_1 word_2 word_3 … word_N
Msg_1 weight 0 weight 0 0
Msg_2 0 weight 0 0 0
… 0 0 0 0
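A minimal sketch of building a TF-IDF occurrence matrix for the two example messages. It is not SenseClusters itself: the tokenizer and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_matrix(messages):
    """Return (vocabulary, rows), where rows[i][j] is the weight of word j in message i."""
    docs = [Counter(tokenize(m)) for m in messages]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    # Assumption: smoothed IDF = ln((1 + n) / (1 + df)) + 1, so shared words keep a weight.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = [[d[w] * idf[w] for w in vocab] for d in docs]
    return vocab, rows

messages = ["He visited Russia in 1996.", "In 1996 he went to Russia."]
vocab, rows = tfidf_matrix(messages)
```

Each row is one message and each column one vocabulary word; a word that does not occur in a message gets weight 0, as in the matrix on the slide.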
Feature Extractor
- Latent Semantic Analysis, rank lowering
- 2nd-order similarity (1st-order similarity)
  “Score” => a number that expresses the accomplishment of a team in a game
  “Goal” => a successful attempt at scoring
- Cosine similarity measure
  - cos 0° = 1, same
  - cos 90° = 0, orthogonal
  - cos_sim > threshold, the same sense cluster
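The cosine measure and the threshold test can be sketched in a few lines. The threshold value of 0.8 and the function names are hypothetical; the slides do not state the threshold used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def same_sense_cluster(u, v, threshold=0.8):
    """Hypothetical threshold: vectors above it are placed in the same sense cluster."""
    return cosine_similarity(u, v) > threshold

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # angle of 0 degrees
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # angle of 90 degrees
```

Vectors pointing in the same direction score 1 regardless of length, which is why message length does not dominate the comparison.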
Feature Extractor
- {Tweets} => K sense clusters (on the fly)
Cluster Size % Similarity
1 10% 1
2 30% 0.9
3 60% 0.1
\[
\text{self\_similarity\_score} = \frac{\sum_{i=1}^{K} \text{cluster\_size}_i^{\,w_1} \cdot \text{cluster\_similarity}_i^{\,w_2}}{K^{\,w_3}}
\]
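A worked sketch of the self-similarity score for the three clusters in the table, reading the score as the sum over the K clusters of cluster_size^w1 * cluster_similarity^w2, divided by K^w3. The slides do not give the weight values, so all three weights are set to 1 here as a placeholder assumption.

```python
def self_similarity_score(clusters, w1=1.0, w2=1.0, w3=1.0):
    """clusters is a list of (size_fraction, similarity) pairs.

    The weights w1, w2 and w3 are placeholders; the slides do not state their values.
    """
    k = len(clusters)
    if k == 0:
        raise ValueError("campaign has no sense clusters")
    total = sum((size ** w1) * (sim ** w2) for size, sim in clusters)
    return total / (k ** w3)

# Cluster sizes (as fractions of the campaign's tweets) and similarities from the table.
clusters = [(0.10, 1.0), (0.30, 0.9), (0.60, 0.1)]
score = self_similarity_score(clusters)
```

With unit weights the numerator is 0.10 * 1 + 0.30 * 0.9 + 0.60 * 0.1 = 0.43, and dividing by K = 3 gives roughly 0.143.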
Decision Maker
Random Forest
- Ensemble classifier that consists of many decision trees
- Construction of each tree: at each node, calculate the best split among m (<< M) randomly selected features of the training set
- Prediction: a new sample is pushed down each tree and is assigned the label of the leaf node it ends up in
- Final decision – majority voting of all trees
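The voting step can be illustrated with a toy forest. The three "trees" below are hand-written single-rule stand-ins, not trained trees, and their features, thresholds, and directions are hypothetical.

```python
from collections import Counter

# Hypothetical single-rule "trees"; a real forest learns its splits from training data.
def tree_a(campaign):
    return "spam" if campaign["account_diversity_ratio"] < 0.2 else "legitimate"

def tree_b(campaign):
    return "spam" if campaign["timing_entropy"] < 1.0 else "legitimate"

def tree_c(campaign):
    return "spam" if campaign["url_blacklisted"] else "legitimate"

def forest_predict(trees, sample):
    """Push the sample down every tree and return the label with the most votes."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

campaign = {
    "account_diversity_ratio": 0.05,
    "timing_entropy": 0.3,
    "url_blacklisted": False,
}
label = forest_predict([tree_a, tree_b, tree_c], campaign)
```

Two of the three trees vote spam even though the URL is not blacklisted, so the majority label is spam.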
Evaluation
Classifier Accuracy % FPR % FNR %
Random Forest 94.5 4.1 6.6
DecisionTable 92.1 6.7 8.8
RandomTree 91.4 9.1 8.2
KStar 90.2 7.9 11.3
Bayes Net 88.8 9.6 12.4
SMO 85.2 11.2 17.6
SimpleLogistic 84.0 10.4 20.4
J48 82.8 15.2 18.8
Each classifier is run in Weka on the ground truth set with 10-fold cross-validation
High Accuracy, Low FPR (legitimate => spam), Medium FNR (spam => legitimate)
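The three reported rates are tied together by the class sizes of the ground truth set (580 legitimate, 744 spam). The sketch below checks this for the Random Forest row. The error counts of 24 false positives and 49 false negatives are assumptions inferred from the reported percentages, not numbers given in the slides.

```python
def rates(tp, fn, fp, tn):
    """Accuracy, FPR and FNR with spam as the positive class."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    fpr = fp / (fp + tn)   # legitimate campaigns classified as spam
    fnr = fn / (fn + tp)   # spam campaigns classified as legitimate
    return accuracy, fpr, fnr

legit_total, spam_total = 580, 744   # ground truth set sizes from the slides
fp = 24   # assumed count, about 4.1% of 580
fn = 49   # assumed count, about 6.6% of 744

accuracy, fpr, fnr = rates(
    tp=spam_total - fn, fn=fn, fp=fp, tn=legit_total - fp
)
```

Under these assumed counts, 1251 of the 1324 campaigns are classified correctly, which rounds to the reported 94.5% accuracy.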
Evaluation
- Evaluate the importance of every feature with a Decision Tree: only one feature is used for classification each time
Feature Accuracy % FPR % FNR %
Account Diversity Ratio 85.6 16.2 13.0
Timing Entropy 83.0 9.5 22.8
URL Blacklist (Our Result) 82.3 (94.5) 3.2 (4.1) 29.0 (6.6)
Avg Account Reputation 78.5 25.6 18.3
Active Time 77.0 16.2 28.3
Affiliate URL No 76.7 9.6 34.0
Manual Device % 74.8 10.3 36.8
Tweet Total No 74.32 32.4 20.4
Content Self Similarity 72.3 33.7 23.0
Spam Word Ratio 70.5 25.8 32.4
Conclusion
Large-scale measurement on Twitter
Formulation of new features
Automatic classification system
Overall accuracy 94.5%