Countering Spam Using Classification Techniques
Steve Webb ([email protected])
Data Mining Guest Lecture
February 21, 2008
Overview
- Introduction
- Countering Email Spam
  - Problem Description
  - Classification History
  - Ongoing Research
- Countering Web Spam
  - Problem Description
  - Classification History
  - Ongoing Research
- Conclusions
Introduction
The Internet has spawned numerous information-rich environments:
- Email Systems
- World Wide Web
- Social Networking Communities
This openness facilitates information sharing, but it also makes these environments vulnerable…
Denial of Information (DoI) Attacks
Deliberate insertion of low quality information (or noise) into information-rich environments
The information analog of Denial of Service (DoS) attacks
Two goals:
- Promotion of ideals by means of deception
- Denial of access to high quality information
Spam is currently the most prominent example of a DoI attack
Overview
- Introduction
- Countering Email Spam
  - Problem Description
  - Classification History
  - Ongoing Research
- Countering Web Spam
  - Problem Description
  - Classification History
  - Ongoing Research
- Conclusions
Countering Email Spam
Close to 200 billion (yes, billion) emails are sent each day
Spam accounts for around 90% of that email traffic
~2 million spam messages every second
Problem Description
Email spam detection can be modeled as a binary text classification problem
- Two classes: spam and legitimate (non-spam)
An example of supervised learning
- Build a model (classifier) based on training data to approximate the target function
- Construct a function f : M → {spam, legitimate} such that it overlaps the target function t : M → {spam, legitimate} as much as possible
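The supervised setup above can be sketched in a few lines. This is a minimal multinomial Naïve Bayes toy (the training data, function names, and smoothing choice are illustrative assumptions, not the lecture's implementation):

```python
import math
from collections import Counter

def tokenize(message):
    return message.lower().split()

def train(labeled_messages):
    """Count word occurrences and messages per class."""
    word_counts = {"spam": Counter(), "legitimate": Counter()}
    class_counts = Counter()
    for message, label in labeled_messages:
        class_counts[label] += 1
        word_counts[label].update(tokenize(message))
    return word_counts, class_counts

def classify(message, word_counts, class_counts):
    """Pick the class with the highest log-probability (Laplace smoothing)."""
    vocab = set(word_counts["spam"]) | set(word_counts["legitimate"])
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in ("spam", "legitimate"):
        score = math.log(class_counts[label] / total)          # class prior
        denom = sum(word_counts[label].values()) + len(vocab)  # smoothing
        for word in tokenize(message):
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data:
training = [
    ("buy cheap meds now", "spam"),
    ("cheap rolex offer now", "spam"),
    ("meeting notes attached", "legitimate"),
    ("lunch meeting tomorrow", "legitimate"),
]
wc, cc = train(training)
print(classify("cheap meds offer", wc, cc))  # -> spam
```

The learned `classify` plays the role of f above: it approximates the true labeling t on unseen messages.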
Problem Description (cont.)
How do we represent a message?
How do we generate features?
How do we process features?
How do we evaluate performance?
How do we represent a message?
Classification algorithms require a consistent format
Salton’s vector space model (“bag of words”) is the most popular representation
Each message m is represented as a feature vector f of n features: <f1, f2, …, fn>
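A minimal sketch of the bag-of-words mapping: fix a vocabulary (the feature set) and map each message m to a vector <f1, f2, …, fn> of term counts (the vocabulary here is invented):

```python
def bag_of_words(message, vocabulary):
    """Return the feature vector of term counts for one message."""
    tokens = message.lower().split()
    return [tokens.count(term) for term in vocabulary]

vocabulary = ["cheap", "meds", "meeting", "notes"]
print(bag_of_words("Cheap cheap meds today", vocabulary))  # -> [2, 1, 0, 0]
```

Note that word order is discarded: only how often each vocabulary term occurs survives, which is what makes the representation a "bag" of words.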
How do we generate features?
Sources of information:
- SMTP connections: network properties
- Email headers: social networks
- Email body: textual parts, URLs, attachments
How do we process features?
Feature Tokenization: alphanumeric tokens, n-grams, phrases
Feature Scrubbing: stemming, stop word removal
Feature Selection: simple feature removal, information-theoretic algorithms
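A sketch of this pipeline: tokenize into alphanumeric unigrams plus word bigrams, scrub stop words, then select features by a simple document-frequency cutoff (the stop list, the `min_df` threshold, and the use of frequency rather than an information-theoretic criterion are illustrative simplifications):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "to", "of"}  # illustrative stop list

def tokenize(text):
    """Alphanumeric unigrams plus word bigrams (two-word 'phrases')."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    words = [w for w in words if w not in STOP_WORDS]  # stop word removal
    bigrams = [f"{w1}_{w2}" for w1, w2 in zip(words, words[1:])]
    return words + bigrams

def select_features(documents, min_df=2):
    """Keep features appearing in at least min_df documents."""
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return {feat for feat, count in df.items() if count >= min_df}

docs = ["buy the cheap meds", "cheap meds online", "meeting notes"]
print(sorted(select_features(docs)))  # -> ['cheap', 'cheap_meds', 'meds']
```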
How do we evaluate performance?
Traditional IR metrics: precision vs. recall
- With confusion-matrix cells a (legitimate as legitimate), b (false positives), c (false negatives), and d (spam as spam):
- Precision P = d / (b + d)
- Recall R = d / (c + d)
False positives vs. false negatives: imbalanced error costs
ROC curves
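Precision and recall translate directly to code. Here b counts false positives (legitimate flagged as spam), c counts false negatives (spam that got through), and d counts correctly caught spam; the counts below are made up for illustration:

```python
def precision(b, d):
    """Fraction of messages flagged as spam that really are spam."""
    return d / (b + d)

def recall(c, d):
    """Fraction of actual spam that was caught."""
    return d / (c + d)

# Hypothetical evaluation counts:
a, b, c, d = 95, 5, 10, 90
print(precision(b, d))  # high precision -> few legitimate messages lost
print(recall(c, d))     # high recall -> little spam slips through
```

Because losing a legitimate message (a false positive) usually costs far more than letting one spam through, spam filters are typically tuned to favor precision over recall.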
Classification History
Sahami et al. (1998)
- Used a Naïve Bayes classifier
- Were the first to apply text classification research to the spam problem
Pantel and Lin (1998)
- Also used a Naïve Bayes classifier
- Found that Naïve Bayes outperforms RIPPER
Classification History (cont.)
Drucker et al. (1999)
- Evaluated Support Vector Machines as a solution to spam
- Found that SVM is more effective than RIPPER and Rocchio
Hidalgo and Lopez (2000)
- Found that decision trees (C4.5) outperform Naïve Bayes and k-NN
Classification History (cont.)
Up to this point, private corpora were used exclusively in email spam research
Androutsopoulos et al. (2000a)
- Created the first publicly available email spam corpus (Ling-spam)
- Performed various feature set size, training set size, stemming, and stop-list experiments with a Naïve Bayes classifier
Classification History (cont.)
Androutsopoulos et al. (2000b)
- Created another publicly available email spam corpus (PU1)
- Confirmed previous research that Naïve Bayes outperforms a keyword-based filter
Carreras and Marquez (2001)
- Used PU1 to show that AdaBoost is more effective than decision trees and Naïve Bayes
Classification History (cont.)
Androutsopoulos et al. (2004)
- Created 3 more publicly available corpora (PU2, PU3, and PUA)
- Compared Naïve Bayes, Flexible Bayes, Support Vector Machines, and LogitBoost: FB, SVM, and LB outperform NB
Zhang et al. (2004)
- Used Ling-spam, PU1, and the SpamAssassin corpora
- Compared Naïve Bayes, Support Vector Machines, and AdaBoost: SVM and AB outperform NB
Classification History (cont.)
CEAS (2004 – present)
- Focuses solely on email and anti-spam research
- Generates a significant amount of academic and industry anti-spam research
Klimt and Yang (2004)
- Published the Enron Corpus – the first large-scale corpus of legitimate email messages
TREC Spam Track (2005 – present)
- Produces new corpora every year
- Provides a standardized platform to evaluate classification algorithms
Concept Drift
Spam content is extremely dynamic
- Topic drift (e.g., specific scams)
- Technique drift (e.g., obfuscations)
How do we keep up with the Joneses?
- Batch vs. Online Learning
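The online-learning alternative to periodic batch retraining can be sketched as a model whose counts are updated one labeled message at a time, so it can track drift as it happens (the class name and smoothed score below are invented for illustration):

```python
from collections import Counter

class OnlineCounts:
    """Incrementally updated per-class token counts."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "legitimate": Counter()}
        self.class_counts = Counter()

    def update(self, message, label):
        """Fold one newly labeled message into the model."""
        self.class_counts[label] += 1
        self.word_counts[label].update(message.lower().split())

    def spamminess(self, word):
        """Smoothed fraction of a token's occurrences that are spam."""
        s = self.word_counts["spam"][word]
        l = self.word_counts["legitimate"][word]
        return (s + 1) / (s + l + 2)

model = OnlineCounts()
model.update("cheap meds now", "spam")          # messages arrive one
model.update("meeting at noon", "legitimate")   # at a time

print(model.spamminess("cheap"))  # leans spam
```

Each `update` is cheap, so the model adapts after every labeled message instead of waiting for the next batch retrain.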
New Classification Approaches
Filter Fusion
Compression-based Filtering
Network behavioral clustering
Adversarial Classification
Classifiers assume a clear distinction between spam and legitimate features
Camouflaged messages:
- Mask spam content with legitimate content
- Disrupt decision boundaries for classifiers
Camouflage Attacks
Baseline performance: accuracies consistently higher than 98%
Classifiers under attack: accuracies degrade to between 50% and 70%
Retrained classifiers: accuracies climb back to between 91% and 99%
Camouflage Attacks (cont.)
Retraining postpones the problem, but it doesn’t solve it
We can identify features that are less susceptible to attack, but that’s simply another stalling technique
Image Spam
What happens when an email does not contain textual features?
OCR is easily defeated
Classification using image properties
Overview
- Introduction
- Countering Email Spam
  - Problem Description
  - Classification History
  - Ongoing Research
- Countering Web Spam
  - Problem Description
  - Classification History
  - Ongoing Research
- Conclusions
Countering Web Spam
What is web spam?
- Traditional definition
- Our definition
Web spam accounts for between 13.8% and 22.1% of all web pages
Ad Farms
Only contain advertising links (usually ad listings)
Elaborate entry pages used to deceive visitors
Ad Farms (cont.)
Clicking on an entry page link leads to an ad listing
Ad syndicators provide the content
Web spammers create the HTML structures
Parked Domains
Domain parking services:
- Provide place holders for newly registered domains
- Allow ad listings to be used as place holders to monetize a domain
Inevitably, web spammers abused these services
Parked Domains (cont.)
Functionally equivalent to Ad Farms:
- Both rely on ad syndicators for content
- Both provide little to no value to their visitors
Unique characteristics:
- Reliance on domain parking services (e.g., apps5.oingo.com, searchportal.information.com, etc.)
- Typically for sale by owner (“Offer To Buy This Domain”)
Advertisements
Pages advertising specific products or services
Examples of the kinds of pages being advertised in Ad Farms and Parked Domains
Problem Description
Web spam detection can also be modeled as a binary text classification problem
Salton’s vector space model is quite common
Feature processing and performance evaluation are also quite similar
But what about feature generation…
How do we generate features?
Sources of information:
- HTTP connections: hosting IP addresses, session headers
- HTML content: textual properties, structural properties
- URL linkage structure: PageRank scores, neighbor properties
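As a sketch of content-based feature generation, the following computes a few illustrative textual and structural HTML properties (the feature names and the regex-based tag stripping are simplifying assumptions; the papers cited below use richer feature sets and real HTML parsers):

```python
import re

def html_features(html):
    """Extract a few toy content-based features from a page."""
    visible = re.sub(r"<[^>]+>", " ", html)  # crudely strip tags
    words = visible.split()
    return {
        "num_links": len(re.findall(r"<a\s", html, re.IGNORECASE)),
        "num_words": len(words),
        # fraction of the raw page that is visible text rather than markup
        "visible_fraction": len(visible.strip()) / max(len(html), 1),
    }

page = '<html><body><a href="http://example.com">cheap meds</a> buy now</body></html>'
feats = html_features(page)
print(feats["num_links"], feats["num_words"])  # -> 1 4
```

Feature vectors like this one feed the same classifiers used for email spam; only the generation step differs.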
Classification History
Davison (2000)
- Was the first to investigate link-based web spam
- Built decision trees to successfully identify “nepotistic links”
Becchetti et al. (2005)
- Revisited the use of decision trees to identify link-based web spam
- Used link-based features such as PageRank and TrustRank scores
Classification History (cont.)
Drost and Scheffer (2005)
- Used Support Vector Machines to classify web spam pages
- Relied on content-based features as well as link-based features
Ntoulas et al. (2006)
- Built decision trees to classify web spam
- Used content-based features (e.g., fraction of visible content, compressibility, etc.)
Classification History (cont.)
Up to this point, web spam research was limited to small (on the order of a few thousand pages), private data sets
Webb et al. (2006)
- Presented the Webb Spam Corpus – a first-of-its-kind large-scale, publicly available web spam corpus (almost 350K web spam pages)
- http://www.webbspamcorpus.org
Castillo et al. (2006)
- Presented the WEBSPAM-UK2006 corpus – a publicly available web spam corpus (only contains 1,924 web spam pages)
Classification History (cont.)
Castillo et al. (2007)
- Created a cost-sensitive decision tree to identify web spam in the WEBSPAM-UK2006 data set
- Used link-based features from [Becchetti et al. (2005)] and content-based features from [Ntoulas et al. (2006)]
Webb et al. (2008)
- Compared various classifiers (e.g., SVM, decision trees, etc.) using HTTP session information exclusively
- Used the Webb Spam Corpus, WebBase data, and the WEBSPAM-UK2006 data set
- Found that these classifiers are comparable to (and in many cases, better than) existing approaches
Redirection
144,801 unique redirect chains (1.54 HTTP redirects on average)
43.9% of web spam pages use some form of HTML or JavaScript redirection
Breakdown by redirect type:
- 49% 302 HTTP redirect
- 14% frame redirect
- 11% 301 HTTP redirect
- 8% iframe redirect
- 7% meta refresh and location.replace()
- 5% meta refresh
- 3% meta refresh and location
- 2% location*
- 1% Other
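The HTML and JavaScript redirection techniques above can be spotted with simple pattern matching. This is a simplified illustration, not the study's actual detection rules; the pattern names and regexes are invented:

```python
import re

# Crude signatures for the redirect types discussed above.
REDIRECT_PATTERNS = {
    "meta_refresh": re.compile(r'<meta[^>]+http-equiv=["\']?refresh', re.I),
    "iframe": re.compile(r"<iframe\s", re.I),
    "frame": re.compile(r"<frame\s", re.I),
    "location": re.compile(r"location(\.replace\(|\.href\s*=|\s*=)", re.I),
}

def redirect_types(html):
    """Return the sorted names of redirect techniques found in a page."""
    return sorted(name for name, pat in REDIRECT_PATTERNS.items()
                  if pat.search(html))

page = '<meta http-equiv="refresh" content="0;url=http://example.com">'
print(redirect_types(page))  # -> ['meta_refresh']
```

Real pages often obfuscate these constructs (split strings, encoded JavaScript), which is why redirection detection is harder in practice than this sketch suggests.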
Phishing
Interesting form of deception that affects email and web users
Another form of adversarial classification
Conclusions
Email and web spam are currently two of the largest information security problems
Classification techniques offer an effective way to filter this low quality information
Spammers are extremely dynamic, generating various areas of important future research…