adaptive filtering: one year on

www.sophos.com Adaptive Filtering: One Year On John Graham-Cumming Research Director, Sophos’s Anti-Spam Task Force Author, POPFile


Post on 25-Feb-2016



Page 1: Adaptive Filtering:  One Year On

www.sophos.com

Adaptive Filtering: One Year On

John Graham-Cumming

Research Director, Sophos’s Anti-Spam Task Force

Author, POPFile

Page 2: Adaptive Filtering:  One Year On

Adaptive Filtering

Definition: An email filter that can be taught to recognize different types of mail without writing rules.

Most use some machine learning technique:
• Naïve Bayesian Classification [1]
• k-nearest neighbors (kNN) [2]
• Support Vector Machines [3]

All provide some measure of “spamminess”
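To make the idea of a trainable "spamminess" score concrete, here is a minimal sketch of a two-class naive Bayes filter. It is not taken from POPFile or any product named in these slides; the class name `NaiveBayesFilter`, the whitespace tokenizer and the add-one smoothing are all assumptions chosen for brevity, and both classes must be trained at least once before scoring.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Tiny two-class naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Teaching by example: no hand-written rules, only labeled messages.
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spamminess(self, text):
        """Return an estimate of P(spam | text) between 0 and 1."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior: the share of training messages with this label.
            score = math.log(self.message_counts[label] / total_messages)
            for word in text.lower().split():
                # Add-one smoothing so unseen words never produce log(0).
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary) + 1))
            log_scores[label] = score
        # Normalize the two log scores into a probability.
        top = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - top)
        ham = math.exp(log_scores["ham"] - top)
        return spam / (spam + ham)


f = NaiveBayesFilter()
f.train("cheap viagra buy now", "spam")
f.train("dinner with mom last night", "ham")
print(f.spamminess("buy viagra"))         # above 0.5
print(f.spamminess("dinner last night"))  # below 0.5
```

The same structure works for any set of labels, which is why general mail sorters and spam filters can share one engine.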

Page 3: Adaptive Filtering:  One Year On

Machine Learning & Anti-spam

A little more than one year

Papers
• Mar 1998: SpamCop: A Spam Classification & Organization Program [1]
• Jul 1998: A Bayesian Approach to Filtering Junk E-mail [2]
• 2000: An evaluation of Naive Bayesian anti-spam filtering [3]
• Aug 2002: A Plan for Spam [4]

Patents
• Jun 1998: 6,161,130: Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set
• Jun 1999: 6,592,627: System and method for organizing repositories of semi-structured documents such as email

Page 4: Adaptive Filtering:  One Year On

Why now?

The “Grandma Problem”

Confluence of events:
• Spam getting close to 50% of all mail [1]
• Email reaching 1/3 of adults in US [2]
• Fast processors can handle the processing load

No other good alternatives
• Laws?
• Migrate from SMTP? [3]

Page 5: Adaptive Filtering:  One Year On

Two Routes

Open Source
• Lots of open source anti-spam solutions
• Many are “wannabe” solutions that simply implemented Paul Graham’s ideas
• Some are interesting tools (bogofilter, POPFile, SpamBayes)

Commercial
• Vendors now incorporating Adaptive Filtering into their anti-spam products

Classic tradeoff:
• Free, open source, community supported
• Fee, “productized”, vendor supported

Page 6: Adaptive Filtering:  One Year On

Practical Open Source Filters

General mail filters [1]
• Aug 1996: ifile
• Aug 2002: POPFile
• Oct 2002: dbacl

Spam Filters [2]
• Bogofilter, SpamBayes, Bayesian Spam Filter, SpamProbe, SpamWizard, BSpam, The Spam Secretary, Expaminator, SqueakyMail, Bayespam, spaminator, Quick Spam Filter, Annoyance Filter, DSPAM, PASP, Spam Blocker, CRM114
• SpamAssassin (added Bayesian in 2.5)

Page 7: Adaptive Filtering:  One Year On

Mainstream Adaptive Filtering

General
• SwiftFile (for Lotus Notes) [1]
• Ella Pro (for Microsoft Outlook) [2]

Anti-spam Desktop
• Mozilla 1.3, Eudora 6.0
• Microsoft MSN 8, Microsoft Outlook 2003
• AOL 9.0, Apple Mail.app (Jaguar)

Anti-spam Gateway
• Sophos PureMessage 4.x

Prediction: By the end of 2004 every major email client will include adaptive filtering

Page 8: Adaptive Filtering:  One Year On

The Problems

Man-in-the-street Usability

False Positives

Over Training

One man’s spam is another man’s ham

Internationalization

Page 9: Adaptive Filtering:  One Year On

Usability

Proxy, plug-in and external filters are too complex

General user needs:
• To not understand the underlying mechanism
• Complete integration with mail client
• Obvious operation (e.g. spam is moved into a folder called Spam)
• Automatic whitelisting (if I send to Mom, Mom is ok)
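The automatic whitelisting requirement ("if I send to Mom, Mom is ok") can be sketched in a few lines. This is a hypothetical helper, not the mechanism of any mail client named in this talk: the `AutoWhitelist` class and its method names are invented for illustration, and it trusts the From header, which a real filter could not do blindly because that header is easy to forge.

```python
from email.utils import getaddresses


class AutoWhitelist:
    """Remembers every address the user has written to (hypothetical helper)."""

    def __init__(self):
        self.known = set()

    def record_outgoing(self, to_header):
        # getaddresses parses "Name <addr>" lists into (name, address) pairs.
        for _name, address in getaddresses([to_header]):
            if address:
                self.known.add(address.lower())

    def is_whitelisted(self, from_header):
        senders = [addr.lower() for _name, addr in getaddresses([from_header]) if addr]
        # Every sender address must already be known to the user.
        return bool(senders) and all(addr in self.known for addr in senders)


whitelist = AutoWhitelist()
whitelist.record_outgoing("Mom <mom@example.com>, dad@example.com")
print(whitelist.is_whitelisted("Mom <MOM@example.com>"))      # True
print(whitelist.is_whitelisted("Stranger <x@spam.example>"))  # False
```

The point of the design is that the user never manages a list: sending mail is the only training signal needed.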

Page 10: Adaptive Filtering:  One Year On

False Positives

False Positive == Good mail identified as bad

False Negative == Spam identified as good

People tolerate false negatives, but hate false positives

Spam filters must guard against false positives:
• Bias towards False Negatives (“A Plan for Spam”)
• Cross check results (SpamBayes)
• High spam threshold
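Two of these guards, the bias towards false negatives and the high threshold, can be sketched as follows. The doubling of ham counts, the clamping to 0.01..0.99, the 15 most interesting tokens and the 0.9 cut-off follow my reading of “A Plan for Spam”; returning 0.4 for rarely seen tokens is a simplification made here (the essay gives that value to never-seen words), and all function names are illustrative.

```python
def token_spam_probability(good_count, bad_count, n_good, n_bad):
    """Per-token spam probability with ham counts doubled to favor ham."""
    g = 2 * good_count  # doubling ham evidence biases against false positives
    b = bad_count
    if g + b < 5:
        return 0.4      # too little evidence: lean slightly innocent (simplification)
    spam_rate = min(1.0, b / n_bad)
    ham_rate = min(1.0, g / n_good)
    return max(0.01, min(0.99, spam_rate / (ham_rate + spam_rate)))


def combined_probability(probabilities, most_interesting=15):
    """Combine the tokens whose probabilities are farthest from neutral 0.5."""
    chosen = sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)
    chosen = chosen[:most_interesting]
    product, inverse = 1.0, 1.0
    for p in chosen:
        product *= p
        inverse *= 1.0 - p
    return product / (product + inverse)


SPAM_THRESHOLD = 0.9  # high bar: borderline mail is delivered, not blocked


def is_spam(probabilities, threshold=SPAM_THRESHOLD):
    return combined_probability(probabilities) > threshold


# A token seen equally often in spam and ham no longer scores a neutral 0.5.
print(token_spam_probability(10, 10, 100, 100))
print(is_spam([0.99, 0.99, 0.2]))  # True: overwhelming spam evidence
print(is_spam([0.7, 0.6]))         # False: spammy-looking, but under the threshold
```

The last call is the interesting one: the message scores above 0.5 yet is still delivered, which is exactly the trade of a false negative for a false positive avoided.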

Page 11: Adaptive Filtering:  One Year On

Over Training

Occurs when user loads up adaptive filter with lots more spam than ham
• e.g. feeds entire spam archive into filter

Some adaptive filters then think everything is spam

For Naïve Bayes classifiers the “train on errors” methodology works well in practice.
• User teaches filter only on mails it incorrectly classified
• “No, that’s spam” or “No, that’s ham” button
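The “train on errors” loop described above can be sketched as follows. `WordCountClassifier` is a deliberately crude stand-in for a real naive Bayes filter, and both it and `train_on_errors` are hypothetical names; only the control flow (train when, and only when, the filter was wrong) is the point.

```python
from collections import Counter


class WordCountClassifier:
    """Deliberately simple stand-in for a real naive Bayes filter."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        self.counts[label].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        spam_votes = sum(self.counts["spam"][w] for w in words)
        ham_votes = sum(self.counts["ham"][w] for w in words)
        # Ties go to ham, so an untrained filter never blocks mail.
        return "spam" if spam_votes > ham_votes else "ham"


def train_on_errors(classifier, labeled_messages):
    """Train only on messages the filter got wrong; return the number of corrections."""
    corrections = 0
    for text, true_label in labeled_messages:
        if classifier.classify(text) != true_label:
            classifier.train(text, true_label)  # the "No, that's spam/ham" button
            corrections += 1
    return corrections


mail = [
    ("cheap pills buy now", "spam"),
    ("cheap pills online", "spam"),
    ("lunch tomorrow?", "ham"),
    ("buy cheap pills today", "spam"),
]
clf = WordCountClassifier()
corrections = train_on_errors(clf, mail)
print(corrections)  # 1: only the first spam needed a correction
```

Three of the four messages never enter the training set, so a large spam archive cannot swamp the ham statistics.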

Page 12: Adaptive Filtering:  One Year On

One man’s spam…

Can be hard to unsubscribe from legitimate bulk mail

Users tell spam filter that legitimate mail is spam
• Creates false positives for other users in shared systems
• e.g. I say CNET News email is spam, you want it

Ideal system has two parts
• Gateway spam filter run by IT group
• Individual preferences on each client

Page 13: Adaptive Filtering:  One Year On

Internationalization

Tokenization non-trivial for some languages
• In English words are “space separated”
• Thisisnotthecaseinsomeotherlanguages: Japanese (POPFile の特別な使い方)

Different punctuation
• ¿Español?
• «Français»

UTF-8, Unicode
• أخبار و تقارير looks like ÃÎÈÇÑ æ ÊÞÇÑíÑ
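Both problems on this slide are easy to reproduce. The sketch below first shows that whitespace splitting yields a single token for unsegmented Japanese, then uses overlapping character bigrams as one generic fallback (an assumption for illustration, not a description of how POPFile handles Japanese). It then reproduces the garbled Arabic: the Latin characters on the slide match the Windows-1256 bytes of the phrase read as Latin-1, which is an inference from the characters themselves rather than something the slide states.

```python
def whitespace_tokens(text):
    """Tokenize the way a filter built for English might."""
    return text.split()


def character_bigrams(text):
    """Overlapping two-character tokens, usable when words are not space separated."""
    compact = "".join(text.split())
    return [compact[i:i + 2] for i in range(len(compact) - 1)]


japanese = "特別な使い方"
print(whitespace_tokens(japanese))  # one token: there are no spaces to split on
print(character_bigrams(japanese))  # five overlapping two-character tokens

# The Arabic phrase from the slide, written with escapes to avoid ambiguity.
arabic = "\u0623\u062e\u0628\u0627\u0631 \u0648 \u062a\u0642\u0627\u0631\u064a\u0631"
# Encode as Windows-1256 (assumed), then misread the same bytes as Latin-1.
garbled = arabic.encode("cp1256").decode("latin-1")
print(garbled)
```

A filter that tokenizes the garbled form still works after a fashion, but it learns different tokens for the same words depending on how each message happened to be decoded.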

Page 14: Adaptive Filtering:  One Year On

Spammer’s Response

Overwhelm filter with “good words”

Hide those good words from people

Use HTML as trickery toolbox

Three techniques:
• And the Kitchen Sink
• Invisible Ink
• Camouflage

More in Sophos’s Field Guide to Spam [1]

Page 15: Adaptive Filtering:  One Year On

And the Kitchen Sink

Throw in innocent words before or after the HTML

<html><body>Viagra</body></html>

Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom

Page 16: Adaptive Filtering:  One Year On

And the Kitchen Sink

Spammer hopes reader concentrates on the spam message part

Ineffective because user gets to see the innocent words

Spammers need ways to hide the innocent words

So they’ve taken inspiration from search engine trickery…

Page 17: Adaptive Filtering:  One Year On

Invisible Ink

Use HTML font colors to write white on white

<body bgcolor=white>Viagra<font color=white>Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom</font></body>

Page 18: Adaptive Filtering:  One Year On

Invisible Ink

Easily spotted if filter groks HTML

Can confuse filters that just drop HTML tags

Spammers have noticed that Invisible Ink is being targeted

They’ve adapted…
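A filter that “groks HTML” can catch Invisible Ink by tracking colors while it parses. The following is a minimal sketch using Python's standard `html.parser`; the class name `InvisibleInkDetector` is hypothetical, it assumes a white page when no `bgcolor` is given, it compares color strings literally (so `white` versus `#ffffff` would be missed), and it ignores CSS entirely.

```python
from html.parser import HTMLParser


class InvisibleInkDetector(HTMLParser):
    """Collects text whose <font color> equals the <body bgcolor>."""

    def __init__(self):
        super().__init__()
        self.background = "white"  # assumed default when no bgcolor is given
        self.color_stack = []      # current font color, innermost last
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "body" and attributes.get("bgcolor"):
            self.background = attributes["bgcolor"].lower()
        elif tag == "font":
            # A <font> without a color inherits the enclosing one.
            inherited = self.color_stack[-1] if self.color_stack else None
            color = attributes.get("color")
            self.color_stack.append(color.lower() if color else inherited)

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        if self.color_stack and self.color_stack[-1] == self.background and data.strip():
            self.hidden_text.append(data.strip())


detector = InvisibleInkDetector()
detector.feed(
    "<body bgcolor=white>Viagra<font color=white>Hi, Johnny! It was really "
    "nice to have dinner with you last night.</font></body>"
)
detector.close()
print(detector.hidden_text)
```

A filter that merely strips tags would see the hidden sentence as ordinary words; this one can discard it, or better, treat its presence as evidence of spam.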

Page 19: Adaptive Filtering:  One Year On

Camouflage

Use very similar HTML colors

<body bgcolor=#113333>

<font color=yellow>Viagra</font>

<font color=#123939>some innocent words</font>

</body>

Page 20: Adaptive Filtering:  One Year On

Camouflage

Hard to see, but “some innocent words” do appear

Page 21: Adaptive Filtering:  One Year On

Pythagoras Spots Spam

Foreground and background colors are coordinates in 3D

Imagine a Red axis, a Green axis and a Blue axis

• Similar colors are close
• Dissimilar colors are far apart
• Pythagoras’ Theorem (3D) [1] gives the color distance

[Figure: Red, Green and Blue axes from the origin (00,00,00), with the points (11,33,33), (12,39,39) and (FF,FF,00) plotted; caption: “Sweet, I rule in 2003”]
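The color-distance idea can be sketched directly with the Camouflage colors used earlier in this talk. The `NAMED_COLORS` table, the function names and the threshold of 30 are assumptions for illustration; the talk does not state what cut-off a real filter uses.

```python
import math

# Minimal name table; a real filter would need the full HTML color list.
NAMED_COLORS = {"yellow": (0xFF, 0xFF, 0x00), "white": (0xFF, 0xFF, 0xFF)}


def parse_color(value):
    """Turn '#113333' or a known color name into an (R, G, B) tuple."""
    value = value.strip().lower()
    if value in NAMED_COLORS:
        return NAMED_COLORS[value]
    digits = value.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(foreground, background):
    """Euclidean (3D Pythagoras) distance between two colors in RGB space."""
    return math.dist(parse_color(foreground), parse_color(background))


CAMOUFLAGE_THRESHOLD = 30.0  # hypothetical cut-off, not from the talk


def is_camouflaged(foreground, background, threshold=CAMOUFLAGE_THRESHOLD):
    return color_distance(foreground, background) < threshold


print(round(color_distance("#123939", "#113333"), 2))  # 8.54: nearly invisible
print(round(color_distance("yellow", "#113333"), 2))   # 317.59: clearly readable
```

Because the check is a distance rather than an equality test, nudging a color by a few units no longer hides text from the filter.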

Page 22: Adaptive Filtering:  One Year On

Spammers love HTML

Spams using HTML (% of messages)

Month    Dec-02   Jan-03   Feb-03   Mar-03   Apr-03
HTML     84%      83%      85%      84%      84%

Page 23: Adaptive Filtering:  One Year On

Trick Trends - Two Increasing

[Chart: Two Tricks Showing Gains. Percentage of messages (axis 0% to 30%) using HTML Comments and Invisible Ink, Dec-02 to Apr-03]

Page 24: Adaptive Filtering:  One Year On

Tricks Make Spam Spotting Easier

Bad news for spammers:
• The harder you try to obscure your messages the easier they are to filter
• Spam trickery becomes the spam fingerprint

Bad news for end users:
• Spammers will react by making spam more innocent
• e.g. “Hi, I saw your profile and wanted to get in touch, please check out my site at www.some-viagra-site.com”

Page 25: Adaptive Filtering:  One Year On

The Filter Paradox

Do filters make spam more effective?

One spammer claimed on Slashdot (/.):

“Your filters help cut down on the complaints to ISPs […] you no longer complain to [email protected], my access providers, or anyone else who might cause me problems”

Time will tell

Page 26: Adaptive Filtering:  One Year On

The End

Following slides are for reference purposes

Page 27: Adaptive Filtering:  One Year On

References

Slide 2
1. http://www.wikipedia.org/wiki/Naive_Bayesian_classification
2. http://www.usenix.org/events/sec02/full_papers/liao/liao_html/node4.html
3. http://citeseer.nj.nec.com/tong00support.html

Page 28: Adaptive Filtering:  One Year On

References

Slide 3
1. http://citeseer.nj.nec.com/pantel98spamcop.html
2. http://citeseer.nj.nec.com/sahami98bayesian.html
3. http://citeseer.nj.nec.com/androutsopoulos00evaluation.html
4. http://www.paulgraham.com/spam.html

Slide 4
1. Wired, p50, September 2003 predicts 50% of all mail will be spam by September 2004
2. US Census Bureau, 2000
3. One proposal is AMTP: http://www.ietf.org/internet-drafts/draft-weinman-amtp-00.txt

Page 29: Adaptive Filtering:  One Year On

References

Slide 5
1. POPFile: http://popfile.sourceforge.net
   ifile: http://www.nongnu.org/ifile/
2. Search SourceForge and Freshmeat

Slide 6
1. http://www.research.ibm.com/swiftfile/
2. http://www.openfieldsoftware.com/Ella.asp

Page 30: Adaptive Filtering:  One Year On

References

Slide 17

1. http://www.activestate.com/Products/PureMessage/Field_Guide_to_Spam/

Page 31: Adaptive Filtering:  One Year On

Pythagoras in 3D

Distance between two points in space

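Pythagoras: $\delta^2 = \alpha^2 + \beta^2$

Pythagoras: $\alpha^2 = (x-a)^2 + (z-c)^2$ and $\beta^2 = (y-b)^2$

$$\delta = \sqrt{(x-a)^2 + (y-b)^2 + (z-c)^2}$$

[Figure: the points (a, b, c) and (x, y, z) joined by a line of length δ, with α and β the other two sides of the right triangle]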