Analyzing Behavioral Features for Email Classification



Steve Martin, Anil Sewani, Blaine Nelson, Karl Chen, and Anthony Joseph

{steve0, anil, nelsonb, quarl, adj}@cs.berkeley.edu

University of California at Berkeley

The Problem: Email Abuse

• Email has become globally ubiquitous.
  – By 2006, email traffic is expected to surge to 60 billion messages daily.
• However, spam accounts for roughly half of all email sent worldwide each day.
• Nearly all of the most virulent worms of 2004 spread by email.
• Email system abuse results in enormous damage costs.

Current Email Analysis

• Many current methods for detecting email abuse examine characteristics of incoming email.
• Example: spam detection
  – Calculate statistical features on received mail and classify each message separately.
• Example: virus scanning
  – Generate a hash value for each incoming message and compare it against a stored database of values.
  – Signatures must be predetermined by a human analyst.
• These methods can be effective, but there is room for improvement.

Our Approach

• A huge corpus of data is ignored: outgoing email!
  – Incoming email cannot be used to profile a user's email behavior.
  – Outgoing email contains this information.
• Calculate features on outgoing email.
  – Observe a wide variety of statistics.
• Build a statistical understanding of user behavior.
  – Use it to classify email sent by individual users.
  – Sudden changes in behavior, such as worm or spam activity, can be detected.

Ex. Outgoing Email Features

Per-Email Features:
• Email contains HTML?
• Email contains scripts?
• Email contains images?
• Email contains links?
• MIME types of attachments
• Number of attachments
• Number of words in body
• Number of words in subject
• Number of characters in subject
• . . .

Per-User Features (calculated over a window of email):
• Frequency of email sending
• Number of unique 'To' addresses
• Number of unique 'From' addresses
• Ratio of emails with attachments
• Average word length
• Average number of words per body
• Average number of words per subject
• Variance in word length
• Variance in number of words per body
• . . .
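A minimal sketch (not the authors' code) of how the per-email features above might be computed with Python's standard email module; the function name, feature keys, and the simple substring checks for HTML, scripts, images, and links are illustrative assumptions. The per-user features would be obtained by aggregating these values over a sliding window of a user's outgoing mail.

import email
import re

def per_email_features(raw_message: str) -> dict:
    """Compute simple per-email features from a raw RFC 822 message."""
    msg = email.message_from_string(raw_message)

    body_parts, attachment_types = [], []
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue
        if part.get_filename():
            # Any part carrying a filename is treated as an attachment.
            attachment_types.append(part.get_content_type())
        elif part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            body_parts.append(payload.decode("utf-8", errors="replace"))

    body = "\n".join(body_parts)
    subject = msg.get("Subject", "")

    return {
        "HtmlInEmail": "<html" in body.lower(),
        "ScriptsInEmail": "<script" in body.lower(),
        "ImagesInEmail": "<img" in body.lower(),
        "LinksInEmail": bool(re.search(r"https?://", body, re.IGNORECASE)),
        "AttachmentTypes": attachment_types,
        "NumAttachments": len(attachment_types),
        "WordsInBody": len(body.split()),
        "WordsInSubj": len(subject.split()),
        "CharsInSubj": len(subject),
    }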

1. Histogram Analysis

• Histograms of separate users over specific features allow similarity estimation.
• Example below: on the left, two users over the same feature; on the right, the difference between their values.
  – Shows how these users differ over this feature.
  – Can be used to detect differences in behavior between these two users.

[Figure: Per-Feature Histograms]
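The comparison above can be sketched as follows: build normalized histograms of one feature for two users over shared bins, then take the per-bin difference (the right-hand plot on the slide). The data and the feature choice below are synthetic; only the mechanics are meant to be illustrative.

import numpy as np

def histogram_difference(values_a, values_b, num_bins=20):
    """Return shared bins, the two normalized histograms, and their per-bin difference."""
    all_values = np.concatenate([values_a, values_b])
    bins = np.linspace(all_values.min(), all_values.max(), num_bins + 1)
    hist_a, _ = np.histogram(values_a, bins=bins, density=True)
    hist_b, _ = np.histogram(values_b, bins=bins, density=True)
    return bins, hist_a, hist_b, hist_a - hist_b

# Example: 'words in body' for two hypothetical users with different habits.
rng = np.random.default_rng(0)
user_a = rng.poisson(lam=80, size=500)    # tends to write shorter emails
user_b = rng.poisson(lam=200, size=500)   # tends to write longer emails
bins, hist_a, hist_b, diff = histogram_difference(user_a, user_b)
print("total absolute difference:", np.abs(diff).sum())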

2. Covariance Analysis

• Goal: identify features that vary most significantly with the labels.
• Method 1: Principal Component Analysis (PCA)
  – Determines a linear combination of relevant features that maximizes variance.
  – Does not take labels or redundancy into account.
• Method 2: Directions of Maximum Covariance
  – Determines directions in feature space that maximize the covariance between data and labels.
  – Modified to take potential feature redundancy into account.
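The contrast between the two methods can be sketched as below, under the assumption that the labels form a single vector y: PCA looks only at the covariance of the features themselves, while the max-covariance direction is driven by the cross-covariance cov[data, labels]. The redundancy handling (deflation) is sketched with the ranking algorithm further down.

import numpy as np

def pca_direction(X):
    """First principal component of the feature covariance matrix (ignores labels)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]            # eigenvector of the largest eigenvalue

def max_covariance_direction(X, y):
    """Direction in feature space maximizing covariance with the labels."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    c = Xc.T @ yc / (len(y) - 1)     # cross-covariance vector cov[data, labels]
    return c / np.linalg.norm(c)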

Greedy Feature Ranking

• Rank features with a simple greedy approach using Directions of Maximum Covariance:
  – Rank features by their contribution to the first principal component of the covariance matrix cov[data, labels].

Feature Ranking Algorithm

Set F = all features
While F is not empty:
    CovMat = empirical covariance matrix cov[data, labels]
    V = principal component vector of CovMat via SVD
    Select the feature f with the largest contribution to V
    Modify (deflate) CovMat to eliminate redundancy
    F = F - f
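A runnable sketch of this loop is given below, with one stated assumption: the slide does not spell out the deflation step, so redundancy is removed here by projecting the remaining feature columns onto the orthogonal complement of the selected feature, a common deflation choice rather than necessarily the authors' exact one.

import numpy as np

def greedy_feature_ranking(X, y):
    """Rank feature indices by contribution to cov[data, labels], greedily with deflation."""
    Xc = X - X.mean(axis=0)
    yc = (y - y.mean()).reshape(-1, 1)
    remaining = list(range(X.shape[1]))
    ranking = []

    while remaining:
        # Empirical cross-covariance between the remaining (deflated) features and the labels.
        cov_mat = Xc[:, remaining].T @ yc / (len(y) - 1)
        # Principal component of the cross-covariance matrix via SVD.
        v = np.linalg.svd(cov_mat, full_matrices=False)[0][:, 0]
        # Select the remaining feature with the largest contribution to that component.
        best = remaining[int(np.argmax(np.abs(v)))]
        ranking.append(best)

        # Deflate: project the selected feature's component out of the data matrix.
        f = Xc[:, [best]]
        norm_sq = float(f.T @ f)
        if norm_sq > 0:
            Xc = Xc - f @ (f.T @ Xc) / norm_sq
        remaining.remove(best)

    return ranking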

Feature Ranking Results

[Figure: Relative Relevance of Features per User. Per-user relevance scores for the features WordsInSubj, WordsInBody, VarWordsInBody, VarCharInSubj, ScriptsInEmail, NumToAddrInWindow, NumFromAddrInWindow, MeanWordsInBody, MeanCharInSubj, LinksInEmail, ImagesInEmail, HtmlInEmail, FreqEmailSent, CharsInSubj, and AvgWordLength.]

Application: Worm Detection

• Statistical learning on outgoing email can be applied to detect and prevent novel worm propagation.
  – Success depends on the ability of the features to identify anomalous behavior.
• Constructed training/test sets of real email traffic artificially 'infected' with viruses.
• Applied feature selection techniques, then tested with different models.

Example Results

• Features were added greedily using the selection algorithm.
• The graphs show that there exists an optimal set of features, beyond which performance decreases.

[Figures: detection performance for Support Vector Machines and the Naïve Bayes Classifier]
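The evaluation behind such curves can be sketched as follows: features are added one at a time in ranked order, and a classifier is retrained and scored on each prefix. scikit-learn's GaussianNB stands in for the Naïve Bayes model here as an assumption; the authors' actual models, data sets, and scoring are not reproduced.

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def accuracy_vs_feature_count(X, y, ranking):
    """Cross-validated accuracy as ranked features are added one by one."""
    scores = []
    for k in range(1, len(ranking) + 1):
        subset = X[:, ranking[:k]]
        accuracy = cross_val_score(GaussianNB(), subset, y, cv=5).mean()
        scores.append((k, accuracy))
    return scores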

Conclusions and Future Work

• Conclusion: analysis of email behavior could have many applications.
  – Feature selection is extremely important to model performance.
• Future work: study the effects of feature selection on classification accuracy for other statistical models.
• Try a similar analysis on existing anti-spam solutions.
• Cluster user behavior into sets of common models describing general behavior patterns.