adaptive filtering: one year on

www.sophos.com

Adaptive Filtering: One Year On

John Graham-Cumming

Research Director, Sophos’s Anti-Spam Task Force

Author, POPFile

Adaptive Filtering

Definition: An email filter that can be taught to recognize different types of mail without writing rules.

Most use some machine learning technique:

  Naïve Bayesian classification [1]

  k-nearest neighbors (kNN) [2]

  Support Vector Machines [3]

All provide some measure of “spamminess”

Machine Learning & Anti-spam

A little more than one year

Papers

  Mar 1998: SpamCop: A Spam Classification & Organization Program [1]

  Jul 1998: A Bayesian Approach to Filtering Junk E-mail [2]

  2000: An evaluation of Naive Bayesian anti-spam filtering [3]

  Aug 2002: A Plan for Spam [4]

Patents

  Jun 1998: 6,161,130: Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set

  Jun 1999: 6,592,627: System and method for organizing repositories of semi-structured documents such as email

Why now?

The “Grandma Problem”

Confluence of events:

  Spam getting close to 50% of all mail [1]

  Email reaching 1/3 of adults in US [2]

  Fast processors can handle the processing load

  No other good alternatives: Laws? Migrate from SMTP? [3]

Two Routes

Open Source

  Lots of open source anti-spam solutions

  Many are “wannabe” solutions that simply implemented Paul Graham’s ideas

  Some are interesting tools (bogofilter, POPFile, SpamBayes)

Commercial

  Vendors now incorporating Adaptive Filtering into their anti-spam products

Classic tradeoff:

  Free, open source, community supported

  Fee, “productized”, vendor supported

Practical Open Source Filters

General mail filters [1]

  Aug 1996: ifile

  Aug 2002: POPFile

  Oct 2002: dbacl

Spam filters [2]

  Bogofilter, SpamBayes, Bayesian Spam Filter, SpamProbe, SpamWizard, BSpam, The Spam Secretary, Expaminator, SqueakyMail, Bayespam, spaminator, Quick Spam Filter, Annoyance Filter, DSPAM, PASP, Spam Blocker, CRM114

  SpamAssassin (added Bayesian in 2.5)

Mainstream Adaptive Filtering

General

  SwiftFile (for Lotus Notes) [1]

  Ella Pro (for Microsoft Outlook) [2]

Anti-spam Desktop

  Mozilla 1.3, Eudora 6.0

  Microsoft MSN 8, Microsoft Outlook 2003

  AOL 9.0, Apple Mail.app (Jaguar)

Anti-spam Gateway

  Sophos PureMessage 4.x

Prediction: By end of 2004 every major email client includes adaptive filtering

The Problems

Man-in-the-street Usability

False Positives

Over training

One man’s spam is another man’s ham

Internationalization

Usability

Proxy, plug-in and external filters are too complex

General user needs:

  No need to understand the underlying mechanism

  Complete integration with mail client

  Obvious operation (e.g. spam is moved into a folder called Spam)

  Automatic whitelisting (if I send to Mom, Mom is ok)
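The automatic whitelisting idea can be sketched in a few lines of Python. This is only an illustration: the `AutoWhitelist` class, its method names and the example addresses are hypothetical and do not come from any particular mail client. It assumes the client can hand over the To/Cc headers of every message the user sends.

```python
from email.utils import getaddresses, parseaddr


class AutoWhitelist:
    """Hypothetical helper: trusts anyone the user has written to."""

    def __init__(self):
        self.trusted = set()

    def record_outgoing(self, recipient_headers):
        # recipient_headers: raw To/Cc header values from a message the user sent
        for _name, address in getaddresses(recipient_headers):
            if address:
                self.trusted.add(address.lower())

    def is_trusted(self, from_header):
        # Compare addresses case-insensitively, ignoring the display name
        _name, address = parseaddr(from_header)
        return address.lower() in self.trusted


whitelist = AutoWhitelist()
whitelist.record_outgoing(["Mom <mom@example.com>, dad@example.com"])
print(whitelist.is_trusted("Mom <MOM@example.com>"))
print(whitelist.is_trusted("stranger@example.net"))
```

A real client would also have to decide whether a forged From header should be enough to bypass the filter; the sketch ignores that question.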

False Positives

False Positive == Good mail identified as bad

False Negative == Spam identified as good

People tolerate false negatives, but hate false positives

Spam filters must guard against false positives:

  Bias towards False Negatives (“A Plan for Spam”)

  Cross check results (SpamBayes)

  High spam threshold
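Two of these guards, the bias and the high threshold, can be sketched as follows. The per-word formula loosely follows the one published in “A Plan for Spam”, where occurrences in good mail are counted double. The function names, the 0.9 threshold and the 0.4 default for unseen words are assumptions made for this sketch, not any specific filter's settings.

```python
def word_spam_probability(bad_count, good_count, nbad, ngood,
                          good_bias=2, unseen_default=0.4):
    """Per-word spam probability with good-mail counts weighted by good_bias.

    nbad and ngood are the numbers of spam and ham messages trained so far.
    unseen_default is an assumed value for words with no history.
    """
    if bad_count == 0 and good_count == 0:
        return unseen_default
    bad_ratio = min(1.0, bad_count / nbad)
    good_ratio = min(1.0, good_bias * good_count / ngood)
    probability = bad_ratio / (good_ratio + bad_ratio)
    # Clamp so that no single word can make the result certain
    return max(0.01, min(0.99, probability))


def is_spam(message_score, threshold=0.9):
    # A high threshold trades extra false negatives for fewer false positives
    return message_score > threshold


# A word seen 10 times in 100 spams and 10 times in 100 hams
print(word_spam_probability(10, 10, 100, 100, good_bias=1))
print(word_spam_probability(10, 10, 100, 100))
```

With no bias the word is neutral; doubling the good count pulls it toward ham, which is the false-negative bias the slide refers to.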

Over Training

Occurs when user loads up adaptive filter with lots more spam than ham, e.g. feeds entire spam archive into filter

Some adaptive filters then think everything is spam

For Naïve Bayes classifiers the “train on errors” methodology works well in practice:

  User teaches filter only on mails it incorrectly classified

  “No, that’s spam” or “No, that’s ham” button
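A minimal sketch of train on errors, assuming a toy Naïve Bayes classifier with whitespace tokenization and add-one smoothing. The `NaiveBayes` class and `train_on_errors` function are illustrative names only and are much simpler than what POPFile or any other real filter does.

```python
import math
from collections import Counter


class NaiveBayes:
    """Toy two-class Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = {"ham": Counter(), "spam": Counter()}
        self.message_counts = {"ham": 0, "spam": 0}

    def train(self, text, label):
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def classify(self, text):
        vocabulary = len(set(self.word_counts["ham"]) | set(self.word_counts["spam"])) or 1
        total_messages = sum(self.message_counts.values())
        scores = {}
        for label in ("ham", "spam"):
            total_words = sum(self.word_counts[label].values())
            # Add-one smoothing on both the class prior and the word likelihoods
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / (total_words + vocabulary))
            scores[label] = score
        # On an exact tie max() returns "ham", the first key inserted
        return max(scores, key=scores.get)


def train_on_errors(classifier, text, true_label):
    """Train only when the classifier's guess disagrees with the user."""
    if classifier.classify(text) != true_label:
        classifier.train(text, true_label)
        return True
    return False


mail_filter = NaiveBayes()
first = train_on_errors(mail_filter, "cheap viagra offer", "spam")   # wrong guess, so it trains
second = train_on_errors(mail_filter, "cheap viagra offer", "spam")  # now correct, no training
third = train_on_errors(mail_filter, "cheap offer", "ham")           # wrong guess, so it trains
print(first, second, third, mail_filter.message_counts)
```

Because correctly classified mail is never fed back in, dumping a large spam archive into the filter adds only the messages it actually gets wrong, which keeps the spam and ham counts from drifting far apart.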

One man’s spam…

Can be hard to unsubscribe from legitimate bulk mail

Users tell spam filter that legitimate mail is spam

  Creates false positives for other users in shared systems

  e.g. I say CNET News email is spam, you want it

Ideal system has two parts:

  Gateway spam filter run by IT group

  Individual preferences on each client
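One way to picture the two-part system is a shared gateway score combined with per-user overrides. Everything in this sketch is hypothetical: the function name, the preference dictionary layout, the 0.9 gateway threshold and the example sender address are assumptions chosen for illustration.

```python
def goes_to_spam_folder(sender, gateway_score, user_prefs, gateway_threshold=0.9):
    """Combine the shared gateway verdict with one user's own preferences."""
    if sender in user_prefs.get("wanted", set()):
        return False  # this user asked for the mail, whatever others think
    if sender in user_prefs.get("unwanted", set()):
        return True   # this user calls it spam, without affecting anyone else
    return gateway_score >= gateway_threshold


newsletter = "news@cnet.example"  # hypothetical sender address
my_prefs = {"unwanted": {newsletter}}
your_prefs = {"wanted": {newsletter}}

print(goes_to_spam_folder(newsletter, 0.2, my_prefs))
print(goes_to_spam_folder(newsletter, 0.2, your_prefs))
```

Keeping the individual preferences on the client means one user's "that's spam" never becomes another user's false positive.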

Internationalization

Tokenization non-trivial for some languages

  In English words are “space separated”

  Thisisnotthecaseinsomeotherlanguages:

  Japanese (POPFile の特別な使い方)

Different punctuation: ¿Español? «Français»

Character encodings (UTF-8, Unicode): أخبار و تقارير looks like ÃÎÈÇÑ æ ÊÞÇÑíÑ
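Both problems are easy to reproduce in Python. The sketch below assumes the garbled Arabic on the slide came from Windows-1256 bytes being displayed as Latin-1; that is an inference from the characters shown, not something the slide states. The `whitespace_tokens` helper is a hypothetical stand-in for a naive tokenizer.

```python
def whitespace_tokens(text):
    # Naive tokenizer: adequate for English, useless without spaces
    return text.split()


english = whitespace_tokens("In English words are space separated")
joined = whitespace_tokens("Thisisnotthecaseinsomeotherlanguages")

# The Arabic phrase from the slide, written with escapes so the source file's
# own encoding cannot interfere with the demonstration
arabic = "\u0623\u062e\u0628\u0627\u0631 \u0648 \u062a\u0642\u0627\u0631\u064a\u0631"

# Encode as Windows-1256, then misread the bytes as Latin-1
mojibake = arabic.encode("cp1256").decode("latin-1")

# Reversing the mistake recovers the original text
recovered = mojibake.encode("latin-1").decode("cp1256")

print(len(english), len(joined))
print(mojibake)
print(recovered == arabic)
```

A filter that tokenizes before it knows the character set will learn the garbled form as if it were a set of ordinary words, so charset handling has to come first.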

Spammer’s Response

Overwhelm filter with “good words”

Hide those good words from people

Use HTML as trickery toolbox

Three techniques:

  And the Kitchen Sink

  Invisible Ink

  Camouflage

More in Sophos’s Field Guide to Spam [1]

And the Kitchen Sink

Throw in innocent words before or after the HTML

<html><body>Viagra</body></html>

Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom

And the Kitchen Sink

Spammer hopes reader concentrates on the spam message part

Ineffective because user gets to see the innocent words

Spammers need ways to hide the innocent words

So they’ve taken inspiration from search engine trickery…

Invisible Ink

Use HTML font colors to write white on white

<body bgcolor=white>Viagra<font color=white>Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom</font></body>

Invisible Ink

Easily spotted if filter groks HTML

Can confuse filters that just drop HTML tags

Spammers have noticed that Invisible Ink is being targeted

They’ve adapted…
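What it means for a filter to grok HTML can be sketched with Python's standard `html.parser`. The `InvisibleInkDetector` class is a hypothetical illustration, and the sample message assumes the hidden words are wrapped in a `font` tag whose color equals the `body` background. It compares color strings literally, so `white` and `#FFFFFF` would not match; a real filter would normalize colors first.

```python
from html.parser import HTMLParser


class InvisibleInkDetector(HTMLParser):
    """Separates text a reader can see from text drawn in the background color."""

    def __init__(self):
        super().__init__()
        self.background = None
        self.color_stack = []
        self.visible_text = []
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "body":
            self.background = (attributes.get("bgcolor") or "").lower() or None
        elif tag == "font":
            self.color_stack.append((attributes.get("color") or "").lower() or None)

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # The innermost font tag that sets a color decides the foreground
        foreground = next((c for c in reversed(self.color_stack) if c), None)
        if foreground is not None and foreground == self.background:
            self.hidden_text.append(text)
        else:
            self.visible_text.append(text)


sample = "<body bgcolor=white>Viagra<font color=white>Hi, Johnny!</font></body>"
detector = InvisibleInkDetector()
detector.feed(sample)
detector.close()
print(detector.visible_text, detector.hidden_text)
```

A filter that merely drops the tags would see all the words as one innocent-looking message; this one can discard the hidden words or treat their presence as evidence of spam.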

Camouflage

Use very similar HTML colors

<body bgcolor=#113333>

<font color=#FFFF00>Viagra</font>

<font color=#123939>some innocent words</font>

</body>

Camouflage

Hard to see, but “some innocent words” do appear

Pythagoras Spots Spam

Foreground and background colors are coordinates in 3D

Imagine a Red axis, a Green axis and a Blue axis

• Similar colors are close

• Dissimilar colors are far apart

• Pythagoras’ Theorem (3D) gives the color distance (see “Pythagoras in 3D”)

[Figure: color space with Red, Green and Blue axes, showing the points (00,00,00), (11,33,33), (12,39,39) and (FF,FF,00); Pythagoras pictured saying “Sweet, I rule in 2003”. Image: http://www.ida.liu.se/~rosgr/image/pythagoras.jpg]
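The color distance is a short calculation. The sketch below uses the hex colors shown on the slide; the function names and the cut-off of 32 are assumptions chosen for illustration, not values from any shipping filter.

```python
import math

CAMOUFLAGE_THRESHOLD = 32  # assumed cut-off; a real filter would tune this


def parse_hex_color(value):
    """Turn '#113333' into the (red, green, blue) point (17, 51, 51)."""
    digits = value.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(first, second):
    """Euclidean distance between two colors in RGB space."""
    a = parse_hex_color(first)
    b = parse_hex_color(second)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def is_camouflaged(foreground, background):
    return color_distance(foreground, background) < CAMOUFLAGE_THRESHOLD


print(color_distance("#123939", "#113333"))
print(color_distance("#FFFF00", "#113333"))
print(is_camouflaged("#123939", "#113333"), is_camouflaged("#FFFF00", "#113333"))
```

The two dark colors are under 9 units apart while the yellow is more than 300 units from the background, so a single threshold separates camouflaged text from readable text in this example.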

Spammers love HTML

Spams using HTML (% of messages):

  Dec-02: 84%

  Jan-03: 83%

  Feb-03: 85%

  Mar-03: 84%

  Apr-03: 84%

Trick Trends - Two Increasing

[Chart: Two Tricks Showing Gains. % of messages (scale 0% to 30%) by month, Dec-02 to Apr-03. Series: HTML Comments, Invisible Ink]

Tricks Make Spam Spotting Easier

Bad news for spammers:

  The harder you try to obscure your messages the easier they are to filter

  Spam trickery becomes the spam fingerprint

Bad news for end users:

  Spammers will react by making spam more innocent

  Hi, I saw your profile and wanted to get in touch, please check out my site at www.some-viagra-site.com

The Filter Paradox

Do filters make spam more effective?

One spammer claimed on Slashdot (/.):

“Your filters help cut down on the complaints to ISPs […] you no longer complain to [email protected], my access providers, or anyone else who might cause me problems”

Time will tell

The End

Following slides are for reference purposes

References

Slide 2

1. http://www.wikipedia.org/wiki/Naive_Bayesian_classification

2. http://www.usenix.org/events/sec02/full_papers/liao/liao_html/node4.html

3. http://citeseer.nj.nec.com/tong00support.html

Slide 3

1. http://citeseer.nj.nec.com/pantel98spamcop.html

2. http://citeseer.nj.nec.com/sahami98bayesian.html

3. http://citeseer.nj.nec.com/androutsopoulos00evaluation.html

4. http://www.paulgraham.com/spam.html

Slide 4

1. Wired, p50, September 2003 predicts 50% of all mail will be spam by September 2004

2. US Census Bureau, 2000

3. One proposal is AMTP: http://www.ietf.org/internet-drafts/draft-weinman-amtp-00.txt

Slide 5

1. POPFile: http://popfile.sourceforge.net

   ifile: http://www.nongnu.org/ifile/

2. Search SourceForge and Freshmeat

Slide 6

1. http://www.research.ibm.com/swiftfile/

2. http://www.openfieldsoftware.com/Ella.asp

Slide 17

1. http://www.activestate.com/Products/PureMessage/Field_Guide_to_Spam/

Pythagoras in 3D

Distance between two points in space

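Pythagoras: $\delta^2 = \alpha^2 + \beta^2$

Pythagoras: $\alpha^2 = (x-a)^2 + (z-c)^2$

$\beta^2 = (y-b)^2$

$\delta = \sqrt{(x-a)^2 + (y-b)^2 + (z-c)^2}$

[Figure: points $(a, b, c)$ and $(x, y, z)$ joined by $\delta$, the hypotenuse of a right triangle with legs $\alpha$ and $\beta$]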
