Page 1: Spam Detection

Spam Detection

Jingrui He

10/08/2007

Page 2: Spam Detection

Spam Types

Email Spam: unsolicited commercial email

Blog Spam: unwanted comments in blogs

Splogs: fake blogs to boost PageRank

Page 3: Spam Detection

From a Learning Point of View

Spam Detection: a classification problem (ham vs. spam)

Feature Extraction: A Learning Approach to Spam Detection based on Social Networks. H.Y. Lam and D.Y. Yeung

Fast Classifier: Relaxed Online SVMs for Spam Filtering. D. Sculley and G.M. Wachman

Page 4: Spam Detection

A Learning Approach to Spam Detection based on Social Networks

H.Y. Lam and D.Y. Yeung

CEAS 2007

Page 5: Spam Detection

Problem Statement

n email accounts

Sender set; receiver set

Labeled sender set: a subset of the sender set

Goal: assign a score to each of the remaining (unlabeled) sender accounts

Page 6: Spam Detection

System Flow Chart

Page 7: Spam Detection

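Social Network from Logs

Directed graph: each node is an email account

Directed edge: an email was sent from one account to another

Edge weight: computed from the number of emails sent from the source account to the destination account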
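As a rough illustration of turning logs into a directed graph, the sketch below counts emails per (sender, receiver) pair. The log format (a list of address pairs) and the name `build_email_graph` are assumptions made for this example. The exact edge-weight formula is not recoverable from the slide text, so only the raw counts are computed.

```python
from collections import Counter

def build_email_graph(log):
    """Count emails per directed edge.

    `log` is assumed to be an iterable of (sender, receiver) pairs,
    one pair per email. Returns {(sender, receiver): email count}.
    """
    return Counter((sender, receiver) for sender, receiver in log)

# Hypothetical log with four emails among three accounts.
log = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c")]
counts = build_email_graph(log)
```

Each key of `counts` is one directed edge, and any edge weight derived from the email count can be computed from its value.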

Page 8: Spam Detection

System Flow Chart

Page 9: Spam Detection

Features from Email Social Networks

In-count / Out-count: the sum of incoming / outgoing edge weights

In-degree / Out-degree: the number of email accounts that a node receives emails from / sends emails to
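A minimal sketch of these four features follows. It assumes the graph is given as a dictionary from (sender, receiver) to an edge weight, and it uses the raw email count as that weight purely for illustration. The function name `degree_features` is hypothetical.

```python
from collections import defaultdict

def degree_features(weights):
    """Compute in/out-count and in/out-degree for every node.

    `weights` maps (sender, receiver) -> edge weight. Using the email
    count as the weight is an assumption made for this example.
    """
    feats = defaultdict(lambda: {"in_count": 0, "out_count": 0,
                                 "in_degree": 0, "out_degree": 0})
    for (sender, receiver), w in weights.items():
        feats[sender]["out_count"] += w    # sum of outgoing edge weights
        feats[sender]["out_degree"] += 1   # one more distinct recipient
        feats[receiver]["in_count"] += w   # sum of incoming edge weights
        feats[receiver]["in_degree"] += 1  # one more distinct sender
    return dict(feats)

weights = {("a", "b"): 2, ("b", "a"): 1, ("a", "c"): 1}
feats = degree_features(weights)
```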

Page 10: Spam Detection

Features from Email Social Networks

Communication Reciprocity (CR): the percentage of interactive neighbors that a node has

Defined using two sets: the set of accounts that received emails from the node, and the set of accounts that sent emails to the node
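The CR formula itself did not survive in the slide text. The sketch below shows one plausible reading: the fraction of the accounts a node sent to that also sent email back. The choice of denominator and the name `communication_reciprocity` are assumptions, not the authors' definition.

```python
def communication_reciprocity(node, edges):
    """Fraction of a node's recipients that also emailed the node.

    `edges` is a set of directed (sender, receiver) pairs. Dividing by
    the size of the recipient set is an assumed normalization.
    """
    out_set = {r for s, r in edges if s == node}  # received emails from node
    in_set = {s for s, r in edges if r == node}   # sent emails to node
    if not out_set:
        return 0.0
    return len(out_set & in_set) / len(out_set)

edges = {("a", "b"), ("b", "a"), ("a", "c"), ("spam", "a"), ("spam", "b")}
cr_a = communication_reciprocity("a", edges)
cr_spam = communication_reciprocity("spam", edges)
```

Under this reading, an account that sends many emails and never gets a reply scores 0, which is the behaviour one would expect from a spam sender.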

Page 11: Spam Detection

Features from Email Social Networks

Communication Interaction Average (CIA): the level of interaction between a sender and each of the corresponding recipients

Page 12: Spam Detection

Features from Email Social Networks

Clustering Coefficient (CC): the friends-of-friends relationship between email accounts

Defined using two quantities: the number of neighbors of a node, and the number of connections between those neighbors
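The two quantities above are the ingredients of the standard clustering coefficient, 2E / (k(k-1)) for a node with k neighbors and E connections among them. The sketch below computes that standard form, treating edges as undirected. Both the undirected treatment and the normalization are assumptions, since the slide's formula is missing.

```python
from itertools import combinations

def clustering_coefficient(node, edges):
    """Standard undirected clustering coefficient of `node`.

    `edges` is a set of (u, v) pairs; direction is ignored here,
    which is an assumption made for this example.
    """
    undirected = {frozenset(e) for e in edges if e[0] != e[1]}
    neighbors = {v for e in undirected if node in e for v in e if v != node}
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to connect
    links = sum(1 for u, v in combinations(neighbors, 2)
                if frozenset((u, v)) in undirected)
    return 2 * links / (k * (k - 1))

edges = {("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")}
cc_a = clustering_coefficient("a", edges)
```

Here node `a` has three neighbors and only one of the three possible neighbor pairs is connected, so the coefficient is 1/3.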

Page 13: Spam Detection

System Flow Chart

Page 14: Spam Detection

Preprocessing

Sender Feature Vector

Weighted Features

Problematic?

Page 15: Spam Detection

System Flow Chart

Page 16: Spam Detection

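Assigning Spam Score

Similarity-weighted k-NN method

Gaussian similarity

Similarity-weighted mean of k-NN scores, where $N_k(x_i)$ is the set of k nearest neighbors of $x_i$:

$$y_i = \frac{\sum_{j:\, x_j \in N_k(x_i)} w_{ij}\, y_j}{\sum_{j:\, x_j \in N_k(x_i)} w_{ij}}$$

Score scaling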
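A small sketch of the similarity-weighted k-NN score is given below. It assumes the common Gaussian similarity exp(-d^2 / (2 sigma^2)) on Euclidean distance and a 1/0 label encoding. The bandwidth `sigma`, the encoding, and the name `knn_score` are illustrative choices, and the score-scaling step is not shown because its form is not given in the slide text.

```python
import math

def knn_score(x, labeled, k=3, sigma=1.0):
    """Similarity-weighted mean of the labels of the k nearest neighbors.

    `labeled` is a list of (feature_vector, label) pairs. The Gaussian
    form of the similarity and the value of `sigma` are assumptions.
    """
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # Keep the k labeled points closest to x.
    nearest = sorted(labeled, key=lambda item: sqdist(x, item[0]))[:k]
    weights = [math.exp(-sqdist(x, v) / (2 * sigma ** 2)) for v, _ in nearest]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # all neighbors too far away for the chosen sigma
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / total

labeled = [((0.0,), 1.0), ((1.0,), 1.0), ((10.0,), 0.0)]
score = knn_score((0.5,), labeled, k=2)
```

With `k=2` the distant point at 10.0 is excluded, so the score is the weighted mean of two labels that are both 1.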

Page 17: Spam Detection

Experiments

Enron dataset: 9150 senders

To get legitimate Enron senders: email transactions within the Enron email domain

5000 generated spam accounts

120 senders from each class

Results averaged over 100 runs

Page 18: Spam Detection

Number of Nearest Neighbors

Page 19: Spam Detection

Feature Weights (CC)

Page 20: Spam Detection

Feature Weights (CIA)

Page 21: Spam Detection

Feature Weights (CR)

Page 22: Spam Detection

Feature Weights

In/Out-Count & In/Out-Degree: the smaller the better

Final weights: In/Out-count & In/Out-degree: 1; CR: 1; CIA: 10; CC: 15

Page 23: Spam Detection

Conclusion

Legitimacy score: no content needed

Can be combined with content-based filters

More sophisticated classifiers: SVM, boosting, etc.

Classifiers using combined features

Page 24: Spam Detection

Relaxed Online SVMs for Spam Filtering

D. Sculley and G.M. Wachman

SIGIR 2007

Page 25: Spam Detection

Anti-Spam Controversy: Support Vector Machines (SVMs)

Academic researchers: statistically robust; state-of-the-art performance

Practitioners: quadratic in the number of training examples; impractical!

Solution: Relaxed Online SVMs

Page 26: Spam Detection

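Background: SVMs

Data set: $\{(x_i, y_i)\}_{i=1}^{n}$

Class label $y_i$: 1 for spam; -1 for ham

Classifier: $f(x) = w \cdot x + b$; to find $w$ and $b$

Minimize: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$

Constraints: $y_i f(x_i) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$

$\xi_i$: slack variable; $C$: tradeoff parameter

The first term maximizes the margin; the second term minimizes the loss function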
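To make the tradeoff concrete, the sketch below evaluates the standard soft-margin SVM objective for a given linear classifier: half the squared norm of the weight vector plus C times the total hinge loss, where each example's hinge loss equals its slack at the optimum. This is the textbook formulation, assumed here to be the one the slide refers to. The function name and the toy data are illustrative.

```python
def svm_objective(w, b, data, C):
    """Soft-margin SVM objective: 0.5 * ||w||^2 + C * sum of slacks.

    `data` is a list of (x, y) pairs with y in {+1, -1}
    (+1 for spam, -1 for ham, as on the slide).
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    total_slack = 0.0
    for x, y in data:
        f = sum(wi * xi for wi, xi in zip(w, x)) + b
        total_slack += max(0.0, 1.0 - y * f)  # hinge loss = required slack
    return margin_term + C * total_slack

data = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-1.0, 0.0], -1)]
value = svm_objective([1.0, 0.0], 0.0, data, C=10.0)
```

Only the second example falls inside the margin (slack 0.5), so the value is 0.5 + 10 * 0.5 = 5.5. Raising C makes that single violation dominate the objective.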

Page 27: Spam Detection

Online SVMs

Page 28: Spam Detection

Tuning the Tradeoff Parameter C

SpamAssassin data set: 6034 examples

Large C preferred

Page 29: Spam Detection

Email Spam and SVMs

TREC05P-1: 92189 messages

TREC06P: 37822 messages

Page 30: Spam Detection

Blog Comment Spam and SVMs

Leave-one-out cross-validation

50 blog posts; 1024 comments

Page 31: Spam Detection

Splogs and SVMs

Leave-one-out cross-validation

1380 examples

Page 32: Spam Detection

Computational Cost

Online SVMs: quadratic training time

Page 33: Spam Detection

Relaxed Online SVMs (ROSVM)

Recall the objective function of SVMs

Large C preferred: minimizing training error is more important than maximizing the margin

ROSVM: full margin maximization is not necessary; relax this requirement

Page 34: Spam Detection

Three Ways to Relax SVMs (1)

Only optimize over the recent p examples

Dual form of SVMs

Constraints

The last value found for when
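One simple way to restrict optimization to the most recent p examples is a fixed-size window, sketched below. The class name `RecentWindow` and its fields are hypothetical. Examples that leave the window are only set aside here, because how their dual variables are treated is not recoverable from the slide text.

```python
from collections import deque

class RecentWindow:
    """Keep the p most recent examples available for optimization."""

    def __init__(self, p):
        self.active = deque(maxlen=p)  # examples the optimizer may touch
        self.frozen = []               # older examples, no longer optimized

    def add(self, example):
        # When the window is full, the oldest example is about to be
        # evicted by append(); record it before that happens.
        if len(self.active) == self.active.maxlen:
            self.frozen.append(self.active[0])
        self.active.append(example)

window = RecentWindow(p=2)
for example in ["e1", "e2", "e3"]:
    window.add(example)
```

After three insertions with `p=2`, only `e2` and `e3` remain in the active window, so the cost of each optimization step is bounded by p rather than by the number of examples seen so far.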

Page 35: Spam Detection

Three Ways to Relax SVMs (2)

Only update on actual errors

Original online SVMs: update when

ROSVM: update when

m=0: mistake-driven online SVMs

No significant degradation in performance

Significantly reduces cost
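The update conditions themselves are missing from the slide text. The sketch below assumes the usual margin-based rule: update whenever y * f(x) falls below a threshold m, with m = 1 standing in for the original online SVM and m = 0 for the mistake-driven variant. The strict inequality and the function name `count_updates` are assumptions made for illustration.

```python
def count_updates(margins, m):
    """Count how many examples would trigger a model update.

    `margins` holds the values y * f(x) for a stream of examples.
    The rule `margin < m` is an assumed form of the update condition.
    """
    return sum(1 for margin in margins if margin < m)

# Hypothetical margins for four incoming messages.
margins = [1.5, 0.4, -0.2, 0.9]
updates_original = count_updates(margins, m=1.0)  # inside-margin or wrong
updates_relaxed = count_updates(margins, m=0.0)   # misclassified only
```

On this toy stream the threshold m = 1 triggers three updates while m = 0 triggers one, which illustrates where the cost reduction comes from.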

Page 36: Spam Detection

Three Ways to Relax SVMs (3)

Reduce the number of iterations in iterative SVMs

SMO: repeated passes over the training set to minimize the objective function

Parameter T: the maximum number of iterations

T=1: little impact on performance

Page 37: Spam Detection

Testing Reduced Size

Page 38: Spam Detection

Testing Reduced Iterations

Page 39: Spam Detection

Testing Reduced Updates

Page 40: Spam Detection

Online SVMs and ROSVM

ROSVM:

Email Spam

Blog Comment Spam

Splog Data Set

