Spam Email Detection Ethan Grefe December 13, 2013


Motivation
• Spam email is constantly cluttering inboxes
• Commonly removed using rule-based filters
• Spam often has very similar characteristics
• This allows spam to be detected using machine learning
  • Naïve Bayes Classifiers
  • Support Vector Machines

SVM Solution
• Used training data from CSDMC2010 SPAM corpus
  • 4327 labeled emails
  • 2949 non-spam messages (HAM)
  • 1378 spam messages (SPAM)
• Extracted features from the subject and body of emails
• Used resulting feature vectors to train an SVM classifier in Matlab

Email Features
• Features were determined by research and observation
• Best results were obtained with the following features
  • Percentage of letters that are capitalized
  • Types of punctuation used
  • Average length of a word
  • Amount of HTML in the email
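The four features listed above could be computed roughly as follows. This is a minimal Python sketch, not the Matlab code behind the slides. The function name `extract_features`, the choice of punctuation characters in `PUNCTUATION`, and the use of a tag count as the measure of "amount of HTML" are all assumptions made for illustration.

```python
import re

# Hypothetical choices: the slides do not say which punctuation marks were
# tracked or how the amount of HTML was measured.
PUNCTUATION = {"exclamation": "!", "question": "?", "dollar": "$"}
TAG_RE = re.compile(r"<[^>]+>")    # anything that looks like an HTML tag
WORD_RE = re.compile(r"[A-Za-z]+")  # runs of ASCII letters count as words


def extract_features(subject: str, body: str) -> dict:
    """Build a feature dictionary from the subject and body of one email."""
    raw = subject + "\n" + body

    # Amount of HTML: number of tag-like substrings in the raw text.
    html_tag_count = len(TAG_RE.findall(raw))

    # Compute the remaining features on the text with tags removed,
    # so that tag names do not count as words or letters.
    visible = TAG_RE.sub(" ", raw)

    # Percentage of letters that are capitalized.
    letters = [c for c in visible if c.isalpha()]
    pct_capital = 100.0 * sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

    # Average length of a word.
    words = WORD_RE.findall(visible)
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0

    features = {
        "pct_capital_letters": pct_capital,
        "avg_word_length": avg_word_length,
        "html_tag_count": float(html_tag_count),
    }
    # Types of punctuation used: one count per tracked character.
    for name, char in PUNCTUATION.items():
        features["count_" + name] = float(visible.count(char))
    return features


example = extract_features("WIN CASH NOW!!!", "<b>Click</b> here")
print(example)
```

Each email then becomes a fixed-length numeric vector (the dictionary values in a fixed key order), which is the form an SVM trainer expects.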

Classifier Results
• Trained on a random 35% of emails
• Tested SVM classifier on remaining 65%
• Trained SVM using three different kernel functions

Kernel Function   Spam Classification Rate   Ham Classification Rate   Total Classification Rate
RBF               80.06%                     92.33%                    86.20%
Linear            78.69%                     80.66%                    79.67%
Quadratic         82.75%                     84.85%                    83.80%
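The evaluation setup can be sketched in Python as a random 35%/65% split followed by per-class classification rates. The function names `random_split` and `classification_rates` are hypothetical, and the fixed seed is only there to make the sketch repeatable. The slides do not say how the total rate was computed. The totals in the table do match the unweighted mean of the spam and ham rates to within rounding, so the sketch uses that definition as an assumption.

```python
import random


def random_split(n_items: int, train_fraction: float = 0.35, seed: int = 0):
    """Shuffle the indices 0..n_items-1 and split them into train and test."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


def classification_rates(true_labels, predicted_labels):
    """Return (spam rate, ham rate, mean of the two) as percentages."""

    def rate(label):
        # Fraction of emails with this true label that were predicted correctly.
        idx = [i for i, t in enumerate(true_labels) if t == label]
        if not idx:
            return 0.0
        correct = sum(predicted_labels[i] == label for i in idx)
        return 100.0 * correct / len(idx)

    spam_rate = rate("spam")
    ham_rate = rate("ham")
    # Assumption: the total is the unweighted mean of the two class rates.
    return spam_rate, ham_rate, (spam_rate + ham_rate) / 2.0


# With the 4327 emails of the corpus, a 35% split gives 1514 training
# emails and 2813 test emails.
train_idx, test_idx = random_split(4327)
print(len(train_idx), len(test_idx))

# Check the assumed definition of the total against the reported rates.
for kernel, spam, ham, total in [
    ("RBF", 80.06, 92.33, 86.20),
    ("Linear", 78.69, 80.66, 79.67),
    ("Quadratic", 82.75, 84.85, 83.80),
]:
    print(kernel, abs((spam + ham) / 2.0 - total) < 0.01)
```

Because ham outnumbers spam in the corpus by roughly two to one, a mean of the two class rates weights spam errors more heavily than a plain fraction of correctly classified emails would.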

Possible Improvements
• Use Naïve Bayes to classify emails using word frequency
• Obtain a wider variety of input features
• Test other types of learning algorithms
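The first improvement, Naïve Bayes on word frequency, was not implemented in the slides. A minimal Python sketch of one common form of it (multinomial Naïve Bayes with add-one smoothing) is given here to show what it would involve. The function names and the four toy messages are invented for illustration and are not taken from the CSDMC2010 corpus.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(messages, labels):
    """Count words per class and messages per class."""
    word_counts = {}
    for text, label in zip(messages, labels):
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set().union(*word_counts.values())
    return {"word_counts": word_counts, "doc_counts": Counter(labels), "vocab": vocab}


def classify(model, text: str) -> str:
    """Return the label with the highest log-probability score."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        # Class prior: share of training messages with this label.
        score = math.log(model["doc_counts"][label] / total_docs)
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing keeps unseen words from
            # driving the probability to zero.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, made up for this sketch.
messages = [
    "win free money now",
    "free prize click now",
    "meeting agenda for monday",
    "lunch on monday with the team",
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(messages, labels)
print(classify(model, "free money prize"))      # spam
print(classify(model, "monday team meeting"))   # ham
```

Unlike the hand-picked features used for the SVM, this approach learns directly from the words in each message, so it needs no manual feature design but depends on a representative training vocabulary.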