Spam Email Detection
Ethan Grefe
December 13, 2013

Motivation
• Spam email constantly clutters inboxes
• It is commonly removed using rule-based filters
• Spam messages often share very similar characteristics
• This allows them to be detected using machine learning
  • Naïve Bayes classifiers
  • Support vector machines (SVMs)

SVM Solution
• Used training data from the CSDMC2010 SPAM corpus
  • 4327 labeled emails
  • 2949 non-spam messages (HAM)
  • 1378 spam messages (SPAM)
• Extracted features from the subject and body of each email
• Used the resulting feature vectors to train an SVM classifier in MATLAB

Email Features
• Features were determined by research and observation
• Best results were obtained with the following features
  • Percentage of letters that are capitalized
  • Types of punctuation used
  • Average length of a word
  • Amount of HTML in the email
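
The four features listed above could be computed along the following lines. This is a minimal Python sketch, not the MATLAB code used in the project: the function name `extract_features`, the choice of punctuation marks (`!`, `?`, `$`), and the definition of "amount of HTML" as the fraction of characters inside `<...>` tags are all assumptions made for illustration.

```python
import re


def extract_features(subject: str, body: str) -> dict:
    """Compute four simple features from an email's subject and body.

    The exact feature definitions are assumptions; the slides only name them.
    """
    text = subject + "\n" + body

    # Amount of HTML: fraction of all characters that sit inside <...> tags.
    tags = re.findall(r"<[^>]+>", text)
    html_fraction = sum(len(t) for t in tags) / len(text) if text else 0.0

    # Strip tags so markup does not skew the text-based features below.
    plain = re.sub(r"<[^>]+>", " ", text)

    # Percentage of letters that are capitalized.
    letters = [ch for ch in plain if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    pct_capital = 100.0 * upper / len(letters) if letters else 0.0

    # Types of punctuation used: one count per mark of interest.
    punct_counts = {mark: plain.count(mark) for mark in "!?$"}

    # Average word length, where a word is a run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", plain)
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0

    return {
        "pct_capital": pct_capital,
        "punct_counts": punct_counts,
        "avg_word_len": avg_word_len,
        "html_fraction": html_fraction,
    }


if __name__ == "__main__":
    print(extract_features("WIN a FREE prize!!!", "<b>Click</b> here now $$$"))
```

Flattening the returned dictionary into a fixed-order list of numbers would give one feature vector per email, which is the form an SVM trainer expects.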

Classifier Results
• Trained on a random 35% of the emails
• Tested the SVM classifier on the remaining 65%
• Trained the SVM using three different kernel functions

| Kernel Function | Spam Classification Rate | Ham Classification Rate | Total Classification Rate |
| --- | --- | --- | --- |
| RBF | 80.06% | 92.33% | 86.20% |
| Linear | 78.69% | 80.66% | 79.67% |
| Quadratic | 82.75% | 84.85% | 83.80% |
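
The evaluation procedure can be sketched in Python as a random 35/65 split followed by per-class classification rates. The helper names `split_indices` and `classification_rates`, the fixed seed, and the 1/0 label encoding are assumptions for illustration; the original work was done in MATLAB. One observation from the reported numbers: each total rate appears consistent, to within rounding, with the unweighted mean of the spam and ham rates (for RBF, (80.06 + 92.33) / 2 = 86.195), so the sketch reports that mean alongside the overall accuracy. Whether the totals were actually computed that way is not stated in the slides.

```python
import random


def split_indices(n, train_fraction=0.35, seed=0):
    """Shuffle the indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


def _rate(pairs):
    """Fraction of (true, predicted) pairs that agree; 0.0 if there are none."""
    pairs = list(pairs)
    return sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0


def classification_rates(y_true, y_pred, spam_label=1, ham_label=0):
    """Return per-class rates, their unweighted mean, and overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    spam_rate = _rate((t, p) for t, p in pairs if t == spam_label)
    ham_rate = _rate((t, p) for t, p in pairs if t == ham_label)
    return {
        "spam_rate": spam_rate,
        "ham_rate": ham_rate,
        "mean_rate": (spam_rate + ham_rate) / 2,
        "overall_accuracy": _rate(pairs),
    }


if __name__ == "__main__":
    train, test = split_indices(4327)  # corpus size from the slides
    print(len(train), len(test))

    # Mean of the two per-class rates for each row of the results table.
    for kernel, spam, ham in [("RBF", 80.06, 92.33),
                              ("Linear", 78.69, 80.66),
                              ("Quadratic", 82.75, 84.85)]:
        print(kernel, (spam + ham) / 2)
```

Because the corpus contains roughly twice as many ham messages as spam messages, the unweighted mean and the overall accuracy can differ noticeably, which is why the sketch keeps them separate.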

Possible Improvements
• Use Naïve Bayes to classify emails using word frequency
• Obtain a wider variety of input features
• Test other types of learning algorithms
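
As a rough idea of what the first improvement might look like, here is a minimal multinomial Naïve Bayes classifier over word frequencies, written in plain Python. It is a sketch under stated assumptions and was not part of the presented work: the class name `NaiveBayes`, the lowercase letter-run tokenizer, add-one (Laplace) smoothing, and the toy training messages are all illustrative choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)   # label -> number of emails
        self.word_counts = {}               # label -> Counter of words
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods per word.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Toy messages invented for this example, not taken from the corpus.
    texts = ["win free money now", "free prize click now",
             "meeting agenda for monday", "lunch on monday with the team"]
    labels = ["spam", "spam", "ham", "ham"]
    model = NaiveBayes().fit(texts, labels)
    print(model.predict("free money prize"))
    print(model.predict("monday meeting lunch"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing keeps a single word absent from one class from zeroing out that class's score.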