Spam and the Ongoing Battle for the Inbox

By Joshua Goodman, Gordon V. Cormack, and David Heckerman
Communications of the ACM, February 2007/Vol. 50, No. 2

Even as spammers and phishers try ever-more sophisticated techniques to get past filters and into users’ mailboxes, anti-spam researchers have managed to stay several steps ahead, so far.

Since August 1998, when Communications published the article “Spam!” by Lorrie Faith Cranor and Brian A. LaMacchia describing the then rapidly growing onslaught of unwanted email, the amount of all email sent has grown exponentially, but the volume of spam has grown even more. Spam has increased from approximately 10% of overall mail volume in 1998, constituting an annoyance, to as much as 80% today [8], creating an onerous burden on both tens of thousands of email service providers (ESPs) and tens of millions of end users worldwide.

Illustration by Robert Neubecker




Large email services (such as Microsoft’s Hotmail) may be sent more than a billion spam messages per day. Fortunately, however, since 1998, considerable progress has also been made in stopping spam. Only a small fraction of that 80% actually reaches end users. Today, essentially all ESPs and most email programs include spam filters. From the point of view of the end user, the problem is stabilizing, and spam is for most users today an annoyance rather than a threat to their use of email.

Meanwhile, an ongoing escalation of technology is taking place behind the scenes, with both spammers and spam-filter providers using increasingly sophisticated solutions. Whenever researchers and developers improve spam-filtering software, spammers devise more sophisticated techniques to defeat the filters. Without constant innovation from spam researchers, the deluge of spam being filtered would overwhelm users’ inboxes.

Spam research is a fascinating topic not only in its own right but also in how it relates to other fields. For instance, most spam-filtering programs use at least one machine-learning component. Spam research has exposed shortcomings in current machine-learning technology and driven new areas of research. Similarly, spam and phishing have made the need to verify senders’ identities on the Internet more important than ever and led to new, more practical verification methods. Spam filtering is an example of adversarial information processing; related methods may apply not only to email spam but to many situations in which an active opponent attempts to thwart any new defensive approach.

Around the time spam was becoming a major problem in 1997, one of us (Heckerman), along with other colleagues at Microsoft Research, began work on machine-learning approaches to spam filtering [11]. In these approaches, computer programs are provided examples of both spam and good (non-spam) email (see Figure 1). A learning algorithm is then used to find the characteristics of the spam mail versus those of the good mail. Future messages can be automatically categorized as highly likely to be spam, highly likely to be good, or somewhere in between. The earliest learning approaches were fairly simple, using, say, the Naive Bayes algorithm to count how often each word or other feature occurs in spam messages and in good messages.
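As a rough illustration of this early word-counting approach (not the authors’ actual implementation; the tokenizer, smoothing, and tiny corpus here are invented for illustration), a Naive-Bayes-style scorer can be sketched as:

```python
import math
from collections import Counter

# Toy Naive-Bayes-style spam scorer: count how often each word occurs in
# known spam and known good mail, then score new messages by summed log-odds.

def tokenize(text):
    return text.lower().split()

def train(spam_msgs, good_msgs):
    spam_counts = Counter(w for m in spam_msgs for w in tokenize(m))
    good_counts = Counter(w for m in good_msgs for w in tokenize(m))
    return spam_counts, good_counts

def spam_score(msg, spam_counts, good_counts):
    # Add-one smoothing so unseen words do not zero out the score.
    spam_total = sum(spam_counts.values()) + len(spam_counts) + 1
    good_total = sum(good_counts.values()) + len(good_counts) + 1
    score = 0.0
    for w in tokenize(msg):
        p_spam = (spam_counts[w] + 1) / spam_total
        p_good = (good_counts[w] + 1) / good_total
        score += math.log(p_spam / p_good)
    return score  # positive suggests spam, negative suggests good

spam_counts, good_counts = train(
    ["free money now", "free offer win money"],
    ["meeting at noon", "project status report"],
)
print(spam_score("free money", spam_counts, good_counts) > 0)       # True
print(spam_score("project meeting", spam_counts, good_counts) < 0)  # True
```

In practice the score is compared against a conservatively chosen threshold rather than zero, for the reasons discussed below.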

To be effective, Naive Bayes and other methods need training data—known spam and known good mail—to train the system. When we first shipped spam filters, spam was relatively static. We had 20 users manually collect and hand-label their email. We then used this collection to train a filter that was not updated for many months. Words like “sex,” “free,” and “money” were all good indicators of spam that worked for an extended period. As spam filtering became more widely deployed, spammers adapted, quickly learning the most obvious words to avoid and the most innocuous words to add to trick the filter. It became necessary to gather ever-larger amounts of email (as spammers began using a wider variety of terms), as well as to update the filters frequently to keep up with spammers. Today, Hotmail uses a feedback-loop system in which more than 100,000 volunteers each day are asked to label a message that was sent to them as either “spam” or “good” email. This regularly provides us new messages to train our filters, allowing us to react quickly to new spammer attacks and ploys.

Figure 1. Spam filters separate email into two folders: the inbox, which is read regularly, and quarantine, which is searched occasionally. Mistakes—spam in the inbox or good email in quarantine—may be reported to the filter if noticed.


Besides getting more data, faster, we also now use much more sophisticated learning algorithms. For instance, algorithms based on logistic regression and support vector machines can reduce by half the amount of spam that evades filtering, compared to Naive Bayes. These algorithms “learn” a weight for each word in a message. The weights are carefully adjusted so results derived from the training examples of both spam and good email are as accurate as possible. The learning process may require repeatedly adjusting tens of thousands or even hundreds of thousands of weights, a potentially time-consuming process. Fortunately, progress in machine learning over the past few years has made such computation possible. Sophisticated algorithms (such as Sequential Conditional Generalized Iterative Scaling) allow us to learn a new filter from scratch in about an hour, even when training on more than a million messages.
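A minimal sketch of this one-weight-per-word idea, using plain stochastic gradient descent on an invented four-message corpus (the large-scale systems described here use far faster training algorithms and vastly more data):

```python
import math

# Toy logistic-regression spam filter: learn one weight per word so that
# predicted spam probabilities match the training labels.

def features(msg):
    return set(msg.lower().split())

train_set = [
    ("free money now", 1),          # 1 = spam
    ("win free offer", 1),
    ("meeting notes attached", 0),  # 0 = good
    ("lunch at noon", 0),
]

weights = {}
bias = 0.0
for _ in range(200):                # epochs of stochastic gradient descent
    for msg, label in train_set:
        z = bias + sum(weights.get(w, 0.0) for w in features(msg))
        p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of spam
        grad = label - p                  # push prediction toward the label
        bias += 0.1 * grad
        for w in features(msg):
            weights[w] = weights.get(w, 0.0) + 0.1 * grad

def p_spam(msg):
    z = bias + sum(weights.get(w, 0.0) for w in features(msg))
    return 1.0 / (1.0 + math.exp(-z))

print(p_spam("free money") > 0.5)      # True: "free"/"money" learned as spammy
print(p_spam("meeting at noon") < 0.5) # True
```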

Spam filtering is not the only beneficiary of advances in machine learning; it also drives exciting new research. For instance, machine-learning algorithms are typically trained to maximize accuracy (how often their predictions are correct). But in practice, for spam filtering, fraud detection, and many other problems, these systems are configured very conservatively. Only if an algorithm is nearly certain that a message is spam is the message filtered. This issue has driven recent research in how to learn specifically for these cases. One clever technique that can reduce spam by 20% or more—developed by Scott Yih of Microsoft Research—involves training two filters. The first identifies the difficult cases; the second is trained on only those cases. Focusing on them improves overall results [12].

Meanwhile, spammers have not been idle as machine learning has progressed. Traditional machine learning for spam filtering has many weaknesses. One of the most glaring is that the first step in almost any system is to break apart a message into its individual words, then perform an analysis at the word level. Initially, spammers sought to overcome these filters by making sure that words with large (spammy) weights, like “free,” did not appear verbatim in their messages. For instance, they might break the word into multiple pieces using an HTML comment (fr<!--><-->ee) or encode it with HTML character codes (fr&#101;e). When displayed to a user, both these examples look like “free,” but for spam-filtering software, especially on servers, any sort of complex HTML processing is too computationally expensive, so the systems do not detect the word “free.”
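Both tricks can be reproduced directly. The sketch below shows the normalization a filter would need to undo them; the regex-based cleanup is an illustrative approximation, not a full HTML parser, and as noted above such processing can be too expensive to run on every message at server scale:

```python
import html
import re

token_broken = "fr<!--><-->ee"   # word split apart with HTML comments
entity_coded = "fr&#101;e"       # the letter "e" written as a character code

def normalize(s):
    s = re.sub(r"<!--.*?-->", "", s)  # strip HTML comments
    s = re.sub(r"<[^>]*>", "", s)     # strip any remaining tags
    return html.unescape(s)           # decode &#101; back to "e"

print(normalize(token_broken))   # free
print(normalize(entity_coded))   # free
```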

Spammers do not use these techniques randomly; they carefully monitor what works and what doesn’t. For instance, in 2003, we saw that the HTML character-encoding trick was being used in 5% of spam sent to Hotmail. The trick is easy to detect, however, since it is rare for a normal letter like “e” to be encoded in ASCII in legitimate email. Shortly after we and others began detecting it, spammers stopped using it; by 2004, it was down to 0% [6]. On the other hand, the token-breaking trick using comments or other tags can be done in many different ways, some difficult to detect; from 2003 to 2004, this exploit went from 7% to 15% of all spam at Hotmail. When we attack the spammers, they actively adapt, dropping the techniques that don’t work and increasing their use of the ones that do.

Scientific evaluation is an essential component of research; researchers must be able to compare methods using standard data and measurements. This kind of evaluation is particularly difficult for spam filtering. Due to the sensitivity of email (few of us would allow our own to be publicly distributed, and those who would are hardly typical), building a standard benchmark for use by researchers is difficult. Within the context of the larger Text REtrieval Conference (TREC) effort begun in 1991, a U.S.-government-supported program (trec.nist.gov) that facilitates evaluations, one of us (Cormack) founded and coordinates a special spam track to evaluate participants’ filters on real email streams; more important, it defines standard measures and corpora for future tests. It relies on two types of email corpora:

• Synthetic, consisting of a rare public corpus of good email, combined with a carefully modified set of recent spam. It can be freely shared, and researchers run their filters on it; and

• Private, whereby researchers submit their code to testers, who run it on the private corpora and return summary results only, ensuring privacy.

Results in terms of distinguishing spam from good email on the two types are similar, suggesting that, for the first time, relatively realistic comparisons of different spam-filtering techniques may be carried out by different groups. Future TREC tracks aim to develop even more realistic evaluation strategies. Meanwhile, the European Conference on Machine Learning (www.ecmlpkdd2006.org) addressed the issue of spam-filter evaluation through its 2006 Discovery Challenge, testing the efficacy of spam filtering without user feedback.
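These evaluations report results along two axes: the false-positive rate (good mail judged spam) and the spam-misclassification rate (spam judged good). A minimal scorer over (gold, predicted) label pairs might look like this (a sketch; function and variable names are invented for illustration):

```python
# Compute the two TREC-style error rates from gold/predicted label pairs.

def error_rates(pairs):
    good = [(g, p) for g, p in pairs if g == "good"]
    spam = [(g, p) for g, p in pairs if g == "spam"]
    fpr = sum(p == "spam" for _, p in good) / len(good)   # good judged spam
    miss = sum(p == "good" for _, p in spam) / len(spam)  # spam judged good
    return fpr, miss

pairs = [("good", "good"), ("good", "spam"), ("spam", "spam"), ("spam", "spam")]
print(error_rates(pairs))   # (0.5, 0.0)
```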

One surprising result is that a compression-based technique is more effective for spam filtering than traditional machine-learning systems [1].



Compression-based systems build a model of spam and a model of good email. A new message is compressed using both the spam model and the good-email model. If the message compresses better with the spam model, the message is likely spam; if it compresses better with the good-email model, the message is more likely legitimate. While compression-based filtering techniques have (in theory) been well understood for years, this is the first instance we know of in which they beat traditional machine-learning systems. The best compression-oriented results have used Dynamic Markov Coding; however, better-known techniques (such as Prediction by Partial Matching, or PPM) work nearly as well.
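A crude sketch of this idea, using zlib as a stand-in for real DMC or PPM models (the corpora here are invented for illustration): a message is scored by how many extra bytes it costs to compress after each model’s training text, and the model that accommodates it more cheaply wins.

```python
import zlib

# Tiny invented "training" corpora standing in for learned compression models.
spam_corpus = b"buy cheap pills free money win prize click here " * 20
good_corpus = b"meeting agenda project report schedule lunch notes " * 20

def extra_bytes(model_text, msg):
    # Bytes added when the message is compressed after the model text;
    # messages resembling the model compress into fewer extra bytes.
    return len(zlib.compress(model_text + msg, 9)) - len(zlib.compress(model_text, 9))

def classify(msg):
    if extra_bytes(spam_corpus, msg) < extra_bytes(good_corpus, msg):
        return "spam"
    return "good"

print(classify(b"win free money now"))      # spam
print(classify(b"project meeting notes"))   # good
```

Because the comparison operates on raw bytes, word-obfuscation tricks that fool tokenizers leave this kind of scorer largely unaffected, which matches the robustness property discussed below.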

These compression-oriented results open a variety of avenues for ongoing research. However, we have yet to understand fully why they work so well for spam filtering. Can they be adapted to work well for other text-classification problems, or is there something unique about spam, and if so, what? One clear advantage of these techniques is they work even against sophisticated obfuscations. Because they apply to a stream of bits, they are inherently insensitive to character encoding and to the makeup of words; hence they are robust to many spammer tricks, like the HTML obfuscations outlined in Figure 2.

Compression models also pose deployment challenges. First, they can be large, especially when trained on large amounts of data. Second, they may implicitly contain pieces of real email, causing privacy issues. For some kinds of filters (such as personal ones users build for themselves on their own data), they are extremely promising. For a filter built by, say, a large company using many users’ data and widely deployed to other users, these issues remain to be solved, while more conventional learning techniques are still state of the art (see Figure 3).

One notable challenge for spam-filter research is that the techniques spammers use are different for different recipients; their ability to adapt also differs. For instance, Hotmail and other large ESPs are subjected to targeted spammer attacks. Spammers easily obtain accounts on these systems, and before sending in bulk, they might keep sending test messages over and over until they understand how to beat a particular filter. Our filters adapt over time. Some spammers, as they send large quantities of spam, monitor whether or not these messages are being received on their test accounts and may dynamically adapt their techniques. On the other hand, large ESPs also give some advantage to spam fighters. Hotmail and other large ESPs are able to quickly aggregate information about hundreds of millions of users, rapidly updating their filters as new attacks are detected.

It is difficult to measure how successful spammers’ adaptations are in defeating spam filters. In the course of the TREC spam track, we have evaluated new and old machine-learning filters on new and old email, finding no material difference in filtering performance related to the age of the email. Yet we have also seen that filters trained on recent spam perform much better than those trained on email even a few weeks old. We have also found through certain techniques deployed at Hotmail that over a broad range of historical data, the techniques worked well, but within a week of deployment, spammers had already adapted.


Figure 2. Comparison of compression-based techniques DMC and PPM with common filters (dbacl, Bogofilter, SpamAssassin, SpamProbe, CRM114, SpamBayes) using the TREC methodology [1]. The graph plots % spam misclassification against false-positive rate (%), both on logit scales.

Figure 3. DMC colors fragments of the message: red if they are likely to be found in spam, green if they are likely to be found in good email. The sample shown is a pharmaceutical spam message whose drug names are spelled out with inserted characters and which is padded with passages of ordinary prose.



Overall, it is clear that spam changes quickly, and spammers react to changes in filtering techniques. Less clear is whether spam is getting more difficult over time or whether spammers are simply rotating from one technique to another, without making absolute progress.

IP-ADDRESS-BASED TECHNIQUES

It may be that techniques based on the content of the message are defeated too easily; there may simply be too many ways to obfuscate content. Many spam-filter researchers have thus focused on aspects of spam that cannot be hidden. The sender of a message—its IP address—is the most important of them.

The most common method for IP-address filtering is to simply blacklist certain IP addresses. When an address is known to send spam, it can just be barred from sending any email for a period of time. This approach can be effective, and several groups produce and share lists of bad addresses. This approach also involves limitations, however. For instance, spammers have become adept at switching IP addresses. Most blacklists are updated hourly, prompting some spammers to acquire huge amounts of bandwidth to allow them to send tens of millions of messages per IP address in the hour or so before the email is blocked; they then switch to another one. Blacklists can also result in false positives (lost good mail) when a good sender inherits a blacklisted IP address or a single IP address is used to send both spam and good email. Blacklists are a powerful tool but no panacea.
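The hour-scale dynamics described above can be sketched with a time-limited blacklist (a toy in-memory version; real deployments typically query shared DNS-based blocklists, and the class name and one-hour duration here are invented):

```python
import time

# Minimal time-limited IP blacklist: an address is barred for ttl_seconds
# after being listed, then falls off automatically.

class Blacklist:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.blocked = {}            # ip -> expiry timestamp

    def block(self, ip, now=None):
        now = time.time() if now is None else now
        self.blocked[ip] = now + self.ttl

    def is_blocked(self, ip, now=None):
        now = time.time() if now is None else now
        expiry = self.blocked.get(ip)
        return expiry is not None and now < expiry

bl = Blacklist(ttl_seconds=3600)
bl.block("203.0.113.7", now=0)
print(bl.is_blocked("203.0.113.7", now=1800))   # True: within the hour
print(bl.is_blocked("203.0.113.7", now=7200))   # False: listing expired
```

The expiry behavior is exactly what spammers exploit: send as much as possible from one address before the listing lands, then move on to a fresh address.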

Some spammers are extremely clever at trying to circumvent IP-blocking systems. One common technique is to enlist so-called zombie machines, or botnets: computers, typically owned by consumers, that have been infected with viruses or Trojans that give spammers full control of the machine. The spammers then use them to send spam. Zombies provide interesting insight into the spam ecosystem. Spammers themselves rarely take over machines. Instead, specialists infect machines, then rent them out to spammers. Estimates of the price charged by the specialists for these machines vary, but at least one botnet operator rented them for $3/computer/month.

Some methods spammers use to obtain IP addresses are amazingly sophisticated. For instance, in one complex attack, where an ISP blocked outbound traffic on port 25 (the email port) but not inbound traffic, spammers were able to perform low-level TCP/IP protocol hacking to route outbound traffic through unblocked machines and inbound packets to the blocked machines; the result was that the email appeared to have been sent by the blocked machines.

SECURE IDENTITY

Numerous attempts have sought to introduce cryptographically secure identities to email, including such standards as PGP and S/MIME, but none has been widely adopted. Identity is particularly important in spam filtering. Almost all spam filters have some form of safe list, allowing users and administrators to identify senders whose email they trust. But without a working identity solution, spammers might abuse these safe lists by, for instance, sending email claiming to be from someone (such as [email protected]) who is commonly safelisted. In addition, in phishing spam—a particularly insidious type of spam—spammers impersonate a legitimate business in order to steal passwords, credit card numbers, Social Security numbers, and other sensitive personal information. A working identity solution can have a substantial effect on these spammers.

Traditional cryptographic approaches to identitysecurity have been robust to most attacks but too dif-ficult to deploy for practical reasons. They typicallyfocus on the identity of a person rather than the iden-tity of an email address, thus requiring a certifyingagency of some sort. Some proposals would require allInternet users to go to their local Post Office and paya fee to get a certificate. In addition, these proposals

COMMUNICATIONS OF THE ACM February 2007/Vol. 50, No. 2 29

technique is more effectivefor spam filtering than tra-ditional machine learningsystems [1]. Compression-based systems build amodel of spam and amodel of good email. Anew message is com-pressed using both thespam model and thegood-email model. If themessage compresses betterwith the spam model, themessage is likely spam; if itcompresses better with thegood-email model, the message is more likely legiti-mate. While compression-based filtering techniqueshave (in theory) been well understood for years, thisis the first instance we know of in which they beat tra-ditional machine-learning systems. The best compres-sion-oriented results have used Dynamic MarkovCoding; however, better known techniques (such asPrediction by Partial Matching, or PPM) work nearlyas well.

These compression-oriented results open a varietyof avenues for ongoing research. However, we haveyet to understand fully why they work so well forspam filtering. Can they be adapted to work well forother text-classification problems, or is there some-thing unique about spam, and if so, what? One clearadvantage of these techniques is they work evenagainst sophisticated obfuscations. Because they applyto a stream of bits they are inherently insensitive tocharacter encoding and to the makeup of words;hence they are robust to many spammer tricks, likethe HTML obfuscations outlined in Figure 2.

Compression models also pose deployment chal-lenges. First, they can be large, especially whentrained on large amounts of data. Second, they mayimplicitly contain pieces of real email, causing privacyissues. For some kinds of filters (such as personal onesusers build for themselves on their own data), they areextremely promising. For a filter built by, say, a largecompany using many users’ data widely deployed to

other users, these issuesremain to be solved, whilemore conventional learn-ing techniques are stillstate of the art (see Figure3).

One notable challengefor spam-filter research isthat the techniques spam-mers use are different fordifferent recipients; theirability to adapt also dif-fers. For instance, Hot-mail and other large ESPsare subjected to targetedspammer attacks. Spam-mers easily obtainaccounts on these sys-tems, and before sendingin bulk, they might keep

sending test messages over and over until they under-stand how to beat a particular filter. Our filters adaptover time. Some spammers, as they send large quan-tities of spam, monitor whether or not these messagesare being received on their test accounts and maydynamically adapt their techniques. On the otherhand, large ESPs also give some advantage to spamfighters. Hotmail and other large ESPs are able toquickly aggregate information about hundreds of mil-lions of users, rapidly updating their filters as newattacks are detected.

It is difficult to measure how successful spam-mers’ adaptations are in defeating spam fil-ters. In the course of the TREC spam track,we have evaluated new and old machine-learning filters on new and old email, finding

no material difference in filtering performance relatedto the age of the email. Yet we have also seen that fil-ters trained on recent spam perform much better thanthose trained on email even a few weeks old. We havealso found through certain techniques deployed atHotmail that over a broad range of historical data, the

28 February 2007/Vol. 50, No. 2 COMMUNICATIONS OF THE ACM

Goodman graph (2/07) - 19.5 picas width

DMC

0.01

0.10

1.00

10.00

50.000.01 0.10 1.00

False positive rate (%) (logit scale)

% S

pam

Mis

clas

sific

atio

n (lo

git

scal

e)

10.00 50.00

PPMdbacl

BogofilterSpamAssassin

SpamProbeCRM114

SpamBayes

Figure 2. Comparison of compression-based techniquesDMC and PPM in common filtersusing the TREC methodology [1].

Goodman spam (2/07)

From: “Iris Gist” <[email protected]>To: [email protected]: postnata 4344Date: Fri, 26 May 2006 11:21:26 -0700

Hi,=20M e R / D / AV / a G R AP R O z & CA m o x / c i I l / nC i A L / SV A L / u MT r & m a d o IA m B / E NX & n a xL e V / T R AS O m &=20http://www.prosebutis.com <http://www.prosebutis.com>=20

have no more argument. I have chosen Mr. Baggins and that ought to !6te=20enough for all of you. If I say he is a Burglar, a Burglar he is, or=20will be when the time comes. There is a lot more in him than you guess,=20and a deal more than he has any idea of himself. You may (possibly) all=20live to thank me yet. Now Bilbo, my boy, fetch the lamp, and lets have=20little light on this!=20

Figure 3. DMC colorsfragments of the

message: red if they arelikely to be found in

spam, green if they arelikely to be found in

good email.

Spammers themselves rarely take over machines. Instead, SPECIALISTS INFECT MACHINES, then rentthem out to spammers.

Page 6: Spam and the Ongoing Battle for the Inbox

COMMUNICATIONS OF THE ACM February 2007/Vol. 50, No. 2 31

usually require some form of attachment or inclusionin the email message itself, confusing some users.

In contrast, identity solutions driven by spam havebeen more pragmatic. In particular, Domain KeysIdentified Mail and SenderID have both focused onidentity at the domain level; they make it possible foremail servers to determine whether this email reallycame from this domain. In addition, both DKIM andSenderID have used the existing Domain Name Sys-tem infrastructure to distribute key information.While the DNS infrastructure is far less secure thanare commonly proposed cryptographic solutions, ithas only rarely been compromised in practice. Thispragmatic approach to identity has allowed surpris-ingly quick adoption of these new techniques.Although SenderID was released in 2004, almost40% of non-spam email today is SenderID-compli-ant, thus reducing the opportunities for email spoofing.

Spammers adapted their attack techniques to thistechnology as well. When first released, SenderID wasused more by spammers than by legitimate senders;spammers would create a new domain name, thencreate the relevant records, proving that the email wasnot spoofed. It is important to understand that theseidentity solutions are not aimed at stopping spamdirectly. Rather, they are a key part of a more complexstrategy, aiming to prevent safe-list abuse and phish-ing while allowing the spam filtering component tolearn “good” reputations for legitimate senders.They’ve shown early success in moving toward allthree goals.
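The compression-based classification idea described earlier can be sketched with an off-the-shelf compressor standing in for DMC or PPM. In this toy example (the corpora and function names are invented for illustration), a message is scored by how many extra bytes it costs to compress after each corpus, a rough proxy for its cross-entropy under that corpus's model:

```python
import zlib

def extra_bytes(corpus: bytes, msg: bytes) -> int:
    # Bytes needed to encode msg given the corpus as context: a rough
    # stand-in for the message's cross-entropy under that corpus's model.
    return len(zlib.compress(corpus + msg, 9)) - len(zlib.compress(corpus, 9))

def classify(msg: bytes, spam_corpus: bytes, good_corpus: bytes) -> str:
    # Label the message with whichever model "compresses it better."
    spam_cost = extra_bytes(spam_corpus, msg)
    good_cost = extra_bytes(good_corpus, msg)
    return "spam" if spam_cost < good_cost else "good"

spam_corpus = b"buy cheap pills now free offer click here act today " * 40
good_corpus = b"the meeting agenda and project report are attached below " * 40
print(classify(b"cheap pills free offer click now", spam_corpus, good_corpus))
```

Because the scoring sees only a byte stream, exotic character encodings and word-boundary tricks do not break it, the robustness property noted above; real DMC/PPM filters use adaptive bit-level models rather than zlib's dictionary matching.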

OTHER FILTERING TECHNOLOGY

One of the most widely deployed spam-filtering techniques is similarity matching. Similarity-matching systems collect examples of known spam, for example, email sent to a special trap account that should receive no legitimate mail, or email users have complained about. They then try to match new messages against this known spam. Spammers actively randomize their email in an attempt to defeat these matching systems. In some cases (such as spam where the primary content is an image meant to defeat both matching-based and machine-learning-based text-oriented filters), spammers even randomize the image to defeat image-matching technologies. Promising recent research has focused on text-oriented matching systems; for instance, work at AOL [7] has used multiple different hashes to make the matching systems more robust to randomization, and work at IBM [10], inspired by bioinformatics, has sought to find characteristic subsequences of words that occur in spam but not in good email.
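The multiple-hash idea can be illustrated with word-level shingles: each message yields many overlapping signatures, so randomizing a few words changes only a few of them. This is a simplified sketch, not the actual AOL [7] scheme; the 5-word window and 0.5 threshold are arbitrary choices:

```python
import hashlib

def shingle_hashes(text: str, k: int = 5) -> set:
    # Hash every k-word window; randomizing a word changes only the few
    # windows that overlap it, leaving the other signatures intact.
    words = text.lower().split()
    return {hashlib.sha1(" ".join(words[i:i + k]).encode()).hexdigest()
            for i in range(max(1, len(words) - k + 1))}

def resembles(msg: str, known_spam: str, threshold: float = 0.5) -> bool:
    a, b = shingle_hashes(msg), shingle_hashes(known_spam)
    return len(a & b) / len(a) >= threshold

known = "click here now to buy cheap meds from our trusted online pharmacy today"
variant = known.replace("online", "web")  # spammer randomizes one word
print(resembles(variant, known))  # True: most 5-word windows survive the edit
print(resembles("please send the quarterly report before the meeting", known))  # False
```

A single hash of the whole message would miss the variant entirely, which is precisely the brittleness the multiple-hash work addresses.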

Similarity-matching systems are generally a good complement to machine-learning-based systems. They help prevent spammers from finding a single message that beats a learning-based filter, then sending it to hundreds of millions of users. To defeat a combined system, spammers must find email that beats the machine-learning system while randomizing the message in such a way that it simultaneously defeats the matching-based system.

Image-based spam is one way to attack both machine-learning systems and matching systems. In this form of spam, the text of a message is random (defeating matching systems) and innocuous (defeating machine-learning systems). Prominently displayed, perhaps before the text, is an image consisting entirely of a picture of text. Optical character recognition software is too slow to run on email servers and probably not accurate enough in an adversarial setting. These images were initially stored mostly on the Web and referenced with image-source links rather than embedded in the message. Because most messages are never opened, using links reduces bandwidth costs to spammers, allowing them to send even more spam.

Many email providers have responded by blocking most Web-based images while still allowing embedded images stored in the message itself. Spammers have responded by embedding their images in messages. These images were initially identical across millions of messages, so many spam-filter providers responded with image-matching algorithms. Spammers countered by randomizing the content of the images.
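This arms race (identical images, then image matching, then randomized images) hinges on exact matching being brittle. A toy sketch over an invented 8-"pixel" image shows why matching systems move toward perceptual-style hashes that tolerate small random changes; real systems operate on actual images with far more robust features:

```python
import hashlib

def exact_hash(pixels):
    # Cryptographic hash: any one-pixel tweak produces a different digest.
    return hashlib.sha256(bytes(pixels)).hexdigest()

def average_hash(pixels):
    # 1 bit per pixel: brighter than the image mean? Small perturbations
    # rarely flip bits, so near-duplicate images collide on purpose.
    mean = sum(pixels) / len(pixels)
    return "".join("1" if p > mean else "0" for p in pixels)

original = [200, 198, 30, 25, 210, 20, 199, 28]
tweaked  = [201, 198, 30, 25, 210, 20, 199, 28]  # spammer nudges one pixel

print(exact_hash(original) == exact_hash(tweaked))      # False
print(average_hash(original) == average_hash(tweaked))  # True
```

The spammer's randomization defeats the first comparison but not the second, which is why splitting an image into reassembled pieces, as described next, was their further escalation.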


HUMAN INTERACTION PROOFS

HIPs (also known as "completely automated public Turing tests to tell computers and humans apart," or CAPTCHAs, or just plain Turing tests) are a key component in preventing abuse. The most common type of HIP is an image of a sequence of letters and digits that has been automatically distorted. Among their many uses, before signing up for most free email accounts, users are required to solve one by correctly entering the sequence of letters and numbers in the image. Without HIPs, spammers would use these services to produce a torrent of spam (see the sidebar "Outbound Spam"). They are also used to prevent automated password attacks. Several products (such as MailBlocks and Matador) have used HIP challenges for suspected spam as a kind of economic approach. HIPs also prevent, for instance, the automated harvesting of Web site data and automated attempts to steal passwords.

As HIPs have been used more widely and become more critical in preventing abuse, it has become more important to understand just how robust and effective they are. Kumar Chellapilla, a scientist at Microsoft Live Labs, and his colleagues there and at Microsoft Research have studied HIPs in detail, finding them surprisingly vulnerable to attack. In [2], they reported that nearly all commercially deployed HIPs could be broken with high accuracy. Because the goal of a HIP is to prevent automation, adversaries can accept relatively low solution rates; if they fail, they just try again over and over. The most effective HIP tested in [2] could be broken only 5% of the time, while the worst could be solved 67% of the time.

Chellapilla and his colleagues then set out to design better HIPs. Since the goal of a HIP is to be too difficult for a computer while remaining readily solvable by a human, they conducted both computer studies and human studies. For each of seven different distortion techniques, they tried various levels of distortion, measuring the rate at which the problem became more difficult for humans to solve compared with the rate at which it became more difficult for computers.

An example of their experiments is shown in parts (a) and (b) of the figure here for the local-warp distortion type. For this distortion, at levels of 60 and 80, humans found the task of decoding and transcribing the HIP extremely difficult, while computer performance was barely affected. In all seven of their experiments, computers did as well as or better than humans at decoding the HIP as the distortion level increased.

This is a very disturbing outcome. If HIPs are automatically solvable by computers, the key barrier against mass automation of many abuses will be gone. There is hope, however. In other research, Chellapilla and his colleagues focused on building segmentation-based HIPs in which the key distortion makes it difficult to find word boundaries (see part (c) of the figure). This kind of segmentation distortion appears to be the only problem where computers are still inferior to humans. The human brain does this task effortlessly, but it remains an algorithmic and computational challenge for computer vision and handwriting recognition. Hopefully, this finger-in-the-dike approach is sufficient for stopping a flood of abuse.

[Sidebar figure: (a) local-warp distortion at four arbitrary parameter settings; (b) human versus computer accuracy, P(C), at four settings of local warp, where human accuracy falls more quickly than computer accuracy as the distortion increases; (c) three samples of a segmentation-based HIP.]
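HIPs charge senders in human attention; the computational puzzles of [4], discussed with the other payment-based systems in this article, charge in CPU time instead. A hedged, hashcash-style sketch (the message format and difficulty values are invented; real proposals differ in detail): the sender searches for a nonce whose hash meets a target, while the recipient verifies it with a single hash.

```python
import hashlib
from itertools import count

def solve(message: str, bits: int = 12) -> int:
    """Find a nonce whose SHA-256 with the message has `bits` leading zero
    bits. Costs about 2**bits hashes per message: negligible for one email,
    ruinous at spam volumes."""
    target = 1 << (256 - bits)
    for nonce in count():
        digest = hashlib.sha256(f"{message}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(message: str, nonce: int, bits: int = 12) -> bool:
    # One hash to check: verification stays cheap even though solving is not.
    digest = hashlib.sha256(f"{message}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))

nonce = solve("example message")
print(verify("example message", nonce))  # True
```

Each additional difficulty bit doubles the sender's expected work while leaving verification at one hash, the asymmetry these economic proposals rely on.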



Some spammers even broke the images into multiple pieces, to be reassembled only when the HTML email is rendered. These randomized image-based messages with innocuous-looking text are especially difficult to identify through automated means.

Many payment-based systems have also been proposed over the years for spam filtering. Examples include those that require a time-consuming computation, first suggested in [4]; those that require solving a human interaction proof (see the sidebar "Human Interaction Proofs"), first suggested in [9]; and those that require making some form of cash micropayment, possibly refundable. Unfortunately, these economic approaches are difficult to deploy in practice. For computational puzzles and cash micropayments to succeed, these systems must be widely deployed, and in order to be widely deployed, there must be some expectation of success: a catch-22. In one exciting development, Microsoft Outlook 2007 includes computational puzzles, the first wide-scale deployment of a computational system to help stop spam. It will be interesting to observe the effectiveness of this approach in practice. For the related problem of outbound spam, stopping people from spamming through a public service like Hotmail, economic approaches have been surprisingly successful (see the sidebar "Outbound Spam").

Also worth mentioning are legislative attempts to stop spam (such as the 2003 CAN-SPAM Act, the Controlling the Assault of Non-Solicited Pornography and Marketing Act). Unfortunately, these legislative approaches have had only a limited effect (see Grimes's article on page 56). Much spam is sent internationally, and clever spammers are good at making themselves difficult to trace. Many forms of spam are already fraudulent or illegal (such as phishing scams and pump-and-dump stock schemes), so additional laws are likely to offer only incremental disincentives. Technology will continue to be the most important mechanism for stopping spam.

CONCLUSION

From the end-user point of view, spam appears to be roughly under control: an annoyance that has stabilized at a tolerable level. From the point of view of spam researchers and developers, it is an ongoing battle, with both spammers and spam fighters becoming ever more sophisticated.

As spam filtering has evolved, so has the community of spam fighters. One active part of that community is the Conference on Email and Anti-Spam (www.ceas.cc), begun in 2004. Many of the methods and results described here were first presented at CEAS conferences, which focus not just on spam research but on positive ways to improve email as well. Studies from the Pew Foundation show email to be the number-one application on the Internet, clearly deserving of its own research community. CEAS brings together academic researchers, industrial researchers, developers of email and spam products, and operators of email systems. The practical nature of spam fighting and email development encourages and requires a degree of collaboration and interaction across disciplines that is relatively rare in other areas of computer science.

Email spam is by no means the only type of abuse on the Internet. Almost any communication method involves a corresponding form of spam. For example, instant messaging systems are subject to IM spam (SPIM), and chat rooms are subject to chat spam (SPAT). A key problem for Internet search engines is Web spam, perpetrated by people who try to artificially boost the scores of Web pages to generate traffic. Other forms of abuse include click fraud: people clicking on advertisements to steal money or hurt competitors. It turns out that the same techniques can be used across these different types of spam. For instance, IP-address-based analyses can be very helpful for filtering spam; spammers respond by attempting to acquire a variety of IP addresses cheaply (such as by using zombies and open proxies); countermeasures for detecting zombies and open proxies can then be used to identify and stop the spam. Machine learning can also be applied to these forms of abuse, and work originally developed for email spam (such as learning optimized for low false-positive rates [12]) may be applied to these other areas as well.

We hope to solve the problem of email spam, removing the need for endless escalations and tit-for-tat countermeasures. When anti-spoofing technology is widely deployed, we'll be able to learn a positive reputation for all good senders. Economic approaches may be applied to the smallest senders, even to those who are unknown. Even more sophisticated machine-learning systems may be able to respond to spammers more quickly and robustly than spammers can adapt. Even when that day comes, plenty of interesting and important problems will remain. Spammers won't go away but will move to other applications, keeping us busy for a long time to come.

OUTBOUND SPAM

Even though most research on spam focuses on stopping inbound spam, outbound spam is an important issue as well. Spammers love to send email from free email providers; no one wants to block all the mail from a major service (such as Hotmail or Yahoo! Mail). In many ways, outbound spam is a more interesting research topic than its inbound counterpart. Imposing economic costs on senders to prevent inbound spam has proved difficult in practice. For outbound spam, however, the ESP controls the environment and tools and can more readily impose economic costs (such as computational resources, money, and HIPs). For instance, an ESP might provide all its users with the tools to solve computational puzzles. ESPs that charge a fee have access to credit card information, making it possible for them to impose monetary costs. ESPs typically control the user interface, making it easier for them to impose HIP challenges.

Our 2004 theoretical analysis of economic approaches to stopping outbound spam produced some interesting, somewhat surprising results [5]. The most interesting was that a technique that imposes costs initially, but then stops charging for additional messages, can be as effective as one that charges for every single message. This means users may at first be annoyed with HIPs, computation, or monetary costs, but after a while, these costs stop. Asymptotically, legitimate users of the system pay zero cost per message, but the costs to spammers can be kept high, ideally higher than the benefit they get from spamming in the first place. While most inbound spam research is empirical in nature, these are realistic, provable bounds on spammer costs.

One disappointing result was that rate limiting is surprisingly ineffective. When ESPs are notified that they are a major source of spam, a natural reaction is to impose rate limits on senders. We found that it is important for ESPs to include some sort of rate limiting; with no limits, spamming is very cheap. But past a certain point, rate limiting has almost no effect. Intuitively, if the rate limit is cut in half, it takes about twice as long to receive enough complaints to terminate an account; the same total spam is sent, and the spammer's cost per message is unchanged. If a spammer wishes to maintain his sending rate, he can purchase twice as many accounts. His up-front costs double, but the asymptotic costs per message stay the same.

Fortunately, we also found good ways to increase spammer costs, hopefully above the point at which spamming is cost-effective. Spammer costs are roughly inversely proportional to the number of messages a spammer can send before a complaint is received. A consequence is that any method that allows ESPs to learn about abusive accounts more quickly and terminate them will substantially raise the cost to spammers. One such system is the Windows Live Mail Smart Network Data Services (https://postmaster.live.com/snds/index.aspx), a public service provided by Microsoft that allows ISPs to quickly identify IP addresses they own that are major sources of spam, helping them take action quickly.

References
1. Bratko, A., Cormack, G., Filipic, B., Lynam, T., and Zupan, B. Spam filtering using statistical data compression models. Journal of Machine Learning Research 7 (Dec. 2006).
2. Chellapilla, K. and Simard, P. Using machine learning to break visual human interaction proofs. In Proceedings of the Advances in Neural Information Processing Systems (NIPS) Conference (Vancouver, Canada). MIT Press, 2005, 265–272.
3. Chellapilla, K., Simard, P., and Czerwinski, M. Computers beat humans at single character recognition in reading-based human interaction proofs (HIPs). In Proceedings of the Second Conference on Email and Anti-Spam (CEAS) (Palo Alto, CA, July 21–22, 2005).
4. Dwork, C. and Naor, M. Pricing via processing or combatting junk mail. In Proceedings of the 12th Annual International Cryptology Conference (Lecture Notes in Computer Science) (Santa Barbara, CA, Aug. 16–20). Springer, 1992, 137–147.
5. Goodman, J. and Rounthwaite, R. Stopping outgoing spam. In Proceedings of the ACM Conference on Electronic Commerce (EC'04) (New York, May 17–20). ACM Press, New York, 2004, 30–39.
6. Hulten, G., Penta, A., Seshadrinathan, G., and Mishra, M. Trends in spam products and methods. In Proceedings of the First Conference on Email and Anti-Spam (CEAS) (Mountain View, CA, July 30–31, 2004).
7. Kolcz, A., Chowdhury, A., and Alspector, J. The impact of feature selection on signature-driven spam detection. In Proceedings of the First Conference on Email and Anti-Spam (CEAS) (Mountain View, CA, July 30–31, 2004).
8. Messaging Anti-Abuse Working Group. MAAWG Email Metrics Program, First Quarter 2006 Report. June 2006; www.maawg.org/about/FINAL_1Q2006_Metrics_Report.pdf.
9. Naor, M. Verification of a Human in the Loop or Identification via the Turing Test; www.wisdom.weizmann.ac.il/~naor/.
10. Rigoutsos, I. and Huynh, T. Chung-Kwei: A pattern-discovery-based system for the automatic identification of unsolicited e-mail messages. In Proceedings of the First Conference on Email and Anti-Spam (CEAS) (Mountain View, CA, July 30–31, 2004).
11. Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the AAAI Workshop. AAAI Technical Report WS-98-05 (Madison, WI, 1998).
12. Yih, W., Goodman, J., and Hulten, G. Learning at low false positive rates. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS) (Mountain View, CA, July 27–28, 2006).

Joshua Goodman ([email protected]) is a senior researcher at Microsoft Research, Redmond, WA.
Gordon V. Cormack ([email protected]) is a professor in the David R. Cheriton School of Computer Science at the University of Waterloo, Waterloo, ON, Canada.
David Heckerman ([email protected]) is a senior researcher at Microsoft Research, Redmond, WA.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

© 2007 ACM 0001-0782/07/0200 $5.00

Page 8: Spam and the Ongoing Battle for the Inbox

bilized at a tolerable level. From the point of view ofspam researchers and developers, it is an ongoingbattle, with both spammers and spam fightersbecoming ever more sophisticated.

A s spam filtering has evolved, so has thecommunity of spam fighters. Oneactive part of that community is theConference on Email and Anti-Spam(www.ceas.cc) begun in 2004. Many

of the methods and results described here were firstpresented at CEAS conferences, focusing not just onspam research but on positive ways to improve emailas well. Studies from the Pew Foundation show emailto be the number-one application on the Internet andclearly deserving of its own research community.CEAS brings together academic researchers, industrialresearchers, developers of email and spam products,and operators of email systems. The practical natureof spam fighting and email development encouragesand requires a degree of collaboration and interactionacross disciplines that is relatively rare in other areas ofcomputer science.

Email spam is by no means the only type of abuseon the Internet. Almost any communication methodinvolves a corresponding form of spam. For example,instant messaging systems are subject to IM Spam(SPIM), and chat rooms are subject to chat spam(SPAT). A key problem for Internet search engines isWeb spam perpetrated by people who try to artifi-cially boost the scores of Web pages to generate traf-fic. Other forms of abuse include click fraud, orpeople clicking on advertisements to steal money orhurt competitors. It turns out that the same techniquecan be used across these different types of spam. Forinstance, IP-address-based analyses can be very help-ful for filtering spam; spammers respond by attempt-ing to acquire a variety of IP addresses cheaply (suchas using zombies and open proxies); countermeasuresfor detecting zombies and open proxies can then beused to identify and stop the spam. Machine learningcan also be applied to these forms of abuse, and work(such as learning optimized for low-false positive rates,originally developed for email spam) may be appliedto these other areas as well.

We hope to solve the problem of email spam,removing the need for endless escalations and tit-for-tat countermeasures. When anti-spoofing technologyis widely deployed, we’ll be able to learn a positive rep-utation for all good senders. Economic approachesmay be applied to the smallest senders, even to thosewho are unknown. Even more sophisticated machine-learning systems may be able to respond to spammers

more quickly and robustly than they can adapt to.Even when that day comes, plenty of interesting andimportant problems will still have to be solved. Spam-mers won’t go away but will move to other applica-tions, keeping us busy for a long time to come.

References1. Bratko, A., Cormack, G., Filipic, B., Lynam, T., and Zupan, B. Spam

filtering using statistical data compression models. Journal of MachineLearning Research 7 (Dec. 2006).

2. Chellapilla, K. and Simard, P. Using machine learning to break visualhuman interaction proofs. In Proceedings of the Advances in NeuralInformation Processing Systems (NIPS) Conference (Vancouver, Canada).MIT Press, 2005, 265–272.

3. Chellapilla, K., Simard, P., and Czerwinski, M. Computers beathumans at single character recognition in reading-based human inter-action proofs (HIPs). In Proceedings of the Second Conference on Emailand Anti-Spam (CEAS) (Palo Alto, CA, July 21–22, 2005).

4. Dwork, C. and Naor, M. Pricing via processing or combatting junkmail. In Proceedings of the 12th Annual International Cryptology Confer-ence (Lecture Notes in Computer Science) (Santa Barbara, CA, Aug.16–20). Springer, 1992, 137–147.

5. Goodman, J. and Rounthwaite, R. Stopping outgoing spam. In Pro-ceedings of the ACM Conference on Electronic Commerce (EC’04) (NewYork, May 17–20). ACM Press, New York, 2004, 30–39.

6. Hulten, G., Penta, A., Seshadrinathan, G., and Mishra, M. Trends inspam products and methods. In Proceedings of the First Conference onEmail and Anti-Spam (CEAS) (Mountain View, CA, July 30–31,2004).

7. Kolcz, A., Chowdhury, A., and Alspector, J. The impact of featureselection on signature-driven spam detection. In Proceedings of the FirstConference on Email and Anti-Spam (CEAS) (Mountain View, CA, July30–31, 2004).

8. Messaging Anti-Abuse Working Group. MAAWG Email Metrics Pro-gram, First Quarter 2006 Report. June 2006;www.maawg.org/about/FINAL_1Q2006_Metrics_Report.pdf.

9. Naor, M. Verification of a Human in the Loop or Identification via theTuring Test; www.wisdom.weizmann.ac.il/~naor/.

10. Rigoutsos, I. and Huynh, T. Chung-Kwei: A pattern-discovery-basedsystem for the automatic identification of unsolicited e-mail messages.In Proceedings of the First Conference on Email and Anti-Spam (CEAS)(Mountain View, CA, July 30–31, 2004).

11. Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. A Bayesianapproach to filtering junk e-mail. In Learning for Text Categorization—Papers from the AAAI Workshop. AAAI Technical Report WS-98-05(Madison, WI, 1998).

12. Yih, W., Goodman, J., and Hulten, G. Learning at low false positiverates. In Proceedings of the Third Conference on Email and Anti-Spam(CEAS) (Mountain View, CA, July 27–28, 2006).

Joshua Goodman ([email protected]) is a senior researcherat Microsoft Research, Redmond, WA. Gordon V. Cormack ([email protected]) is a professor inthe David R. Cheriton School of Computer Science at the Universityof Waterloo, Waterloo, ON, Canada. David Heckerman ([email protected]) is a seniorresearcher at Microsoft Research, Redmond, WA.

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. To copy otherwise, to republish, to post on servers or to redistributeto lists, requires prior specific permission and/or a fee.

© 2007 ACM 0001-0782/07/0200 $5.00

c

COMMUNICATIONS OF THE ACM February 2007/Vol. 50, No. 2 33

the images; some even broke the images into multiplepieces to be reassembled only when HTML-basedemail is rendered. These randomized image-basedmessages with innocuous-looking text are especiallydifficult to identify through automated means.

Many payment-based systems have also been pro-posed over the years for spam filtering. Examplesinclude: those that require a time-consuming compu-tation, first suggested in [4]; those that require solvinga human interaction proof (see the sidebar “HumanInteraction Proofs”), first suggested in [9]; and thosethat require making some form of cash micropay-ment, possibly refundable. Unfortunately, these eco-nomic approaches are difficult to deploy in practice.For computational puzzles and cash micropaymentsto succeed, these systems must be widely deployed,and in order to be widely deployed, there must besome expectation of success—a catch-22. In oneexciting development, Microsoft Outlook 2007includes computational puzzles—the first wide-scaledeployment of a computational system to help stopspam. Observing the effectiveness of this approach in

practice will be interesting. For the related problem ofoutbound spam—stopping people spamming from apublic service like Hotmail—economic approacheshave been surprisingly successful (see the sidebar“Outbound Spam”).

Also worth mentioning are legislative attempts to stop spam (such as the 2003 CAN-SPAM Act, also known as the Controlling the Assault of Non-Solicited Pornography and Marketing Act). Unfortunately, these legislative approaches have had only a limited effect (see Grimes's article on page 56). Many forms of spam can be sent internationally, and clever spammers are good at making themselves difficult to trace. Many forms of spam are fraudulent or illegal (such as phishing scams and pump-and-dump stock schemes), so additional laws are likely to offer only incremental disincentives. Technology will continue to be the most important mechanism for stopping spam.

CONCLUSION

From the end-user point of view, spam appears to be roughly under control—an annoyance that has sta-


OUTBOUND SPAM

Even though most research on spam focuses on stopping inbound spam, outbound spam is an important issue as well. Spammers love to send email from free email providers; no one wants to block all the mail from a major service (such as Hotmail or Yahoo! Mail). In many ways, outbound spam is a more interesting research topic than its inbound counterpart. Imposing economic costs on senders to prevent inbound spam has proved difficult in practice. For outbound spam, however, the ESP controls the environment and tools and can more readily impose economic costs (such as computational resources, money, and HIPs). For instance, an ESP might provide tools for solving computational puzzles to all its users. ESPs that charge a fee have access to credit-card information, making it possible for them to impose monetary costs. ESPs typically control the user interface, making it easier for them to impose HIP challenges.

Our 2004 theoretical analysis of economic approaches to stopping outbound spam produced some interesting, somewhat surprising results [5]. The most interesting was that a technique that imposes costs initially (but then stops charging for additional messages) can be as effective as one that charges for every single message. This means that users may at first be annoyed with HIPs, computation, or monetary costs, but after a while, these costs stop. Asymptotically, legitimate users of the system pay zero cost per message, but the costs to spammers can be kept high, ideally higher than the benefit they get from spamming in the first place. While most inbound spam research is empirical in nature, these are realistic, provable bounds on spammer costs.
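The asymptotic argument can be made concrete with a toy model (our illustrative construction, not the formal analysis in [5]; the numbers and function name are assumptions): if only an account's first K messages incur a cost, the amortized cost vanishes for a long-lived legitimate account, while a spammer whose accounts are terminated quickly keeps paying the entry cost on fresh accounts.

```python
def cost_per_message(entry_cost: float, paid_messages: int,
                     total_sent: int) -> float:
    """Average cost per message when only the first `paid_messages`
    messages from an account cost `entry_cost` each."""
    charged = min(total_sent, paid_messages)
    return entry_cost * charged / total_sent

# A legitimate user keeps one account, so the amortized cost -> 0:
# 100 paid messages at $0.05 each, spread over a growing history.
for n in (100, 10_000, 1_000_000):
    print(n, cost_per_message(0.05, 100, n))   # 0.05, 0.0005, 0.000005

# A spammer's account is terminated after ~200 messages, so he must
# keep buying fresh accounts; his cost per message stays flat.
print(cost_per_message(0.05, 100, 200))        # 0.025
```

The gap between the two asymptotes is exactly what makes the scheme effective: honest senders pay a one-time nuisance, spammers pay forever.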

One disappointing result was that rate limiting is surprisingly ineffective. When ESPs are notified that they are a major source of spam, a natural reaction is for them to impose rate limits on senders. We found that it is important for ESPs to include some sort of rate limiting; with no limits, spamming is very cheap. But past a certain point, rate limiting has almost no effect. Intuitively, if the rate limit is cut in half, it takes about twice as long to receive enough complaints to terminate the account; the same total spam is sent, and the spammer's cost per message is unchanged. If a spammer wishes to maintain his sending rate, he can purchase twice as many accounts. His up-front costs double, but the asymptotic costs per message stay the same.
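The rate-limit intuition can be checked with a back-of-the-envelope model (again our own illustrative sketch, not the paper's derivation; all parameter values are made up): complaints accrue per message sent, so an account survives a roughly fixed number of messages no matter how fast it sends them.

```python
def spammer_economics(rate_limit: float, account_cost: float,
                      msgs_until_termination: int, target_rate: float):
    """Toy model: an account is shut down after a fixed number of
    messages, so the rate limit changes account lifetime and account
    count, but not the spammer's cost per message."""
    lifetime = msgs_until_termination / rate_limit       # time until shutdown
    accounts_needed = target_rate / rate_limit           # to sustain target_rate
    cost_per_msg = account_cost / msgs_until_termination # rate-independent
    return lifetime, accounts_needed, cost_per_msg

fast = spammer_economics(1000, 1.0, 10_000, 10_000)
slow = spammer_economics(500, 1.0, 10_000, 10_000)   # rate limit halved
# Halving the limit doubles lifetime and the accounts needed...
assert slow[0] == 2 * fast[0] and slow[1] == 2 * fast[1]
# ...but leaves the asymptotic cost per message unchanged.
assert slow[2] == fast[2]
```

By contrast, shrinking `msgs_until_termination` (catching abusive accounts sooner) raises `cost_per_msg` directly, which matches the paragraph that follows.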

Fortunately, we also found good ways to increase spammer costs, hopefully above the point at which spamming is cost-effective. Spammer costs are roughly inversely proportional to the number of messages a spammer can send before a complaint is received. A consequence is that any method that allows ESPs to more quickly learn about abusive accounts and terminate them will substantially raise the cost to spammers. One such system is the Windows Live Mail Smart Network Data Services (https://postmaster.live.com/snds/index.aspx), a public service provided by Microsoft that allows ISPs to quickly identify IP addresses they own that are major sources of spam, helping them quickly take action.
