Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST

Post on 03-Jan-2016


Adapting Statistical Email Filtering

David Kohlbrenner IT.com

TJHSST

What is a statistical email filter?

Filters email for spam.

Supervised learning

Not heuristics

Bayesian filtering and Bayes

Why is a statistical email filter better?

Not based on pre-set values by a human

Can use concepts not easily understood by people

Learns over time, and therefore adapts

Real-world tests put accuracy better than 99.9%

How does a statistical email filter work?

Three parts

Tokenization / feature extraction

Training

Analysis

Also, it must store a persistent state

Tokenization / Feature extraction

Emails are made of words

Tokens are words, phrases, HTML, timestamps, senders, etc.

The goal is to get as many 'features' of the email as possible; the good ones rise to the top

“the orange ball” becomes: “the”, “orange”, “ball”, “the orange”, “orange ball”, “the orange ball”, “*Font: Albany”, etc.
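The word and phrase part of the example above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tokenizer: the function name `tokenize` and the `max_n` parameter are assumptions, and metadata features such as “*Font: Albany” are left out.

```python
def tokenize(text, max_n=3):
    """Return every run of 1..max_n consecutive words as a token.

    Hypothetical helper for illustration; a real tokenizer would also
    emit HTML, header, and timestamp features.
    """
    words = text.lower().split()
    tokens = []
    for n in range(1, max_n + 1):
        # Slide a window of n words across the message.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


tokens = tokenize("the orange ball")
# tokens == ["the", "orange", "ball", "the orange", "orange ball", "the orange ball"]
```

Emitting phrases as well as single words lets the later stages decide which features are informative, rather than deciding that up front.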

Training

All filters begin blank

Trained with a corpus of spam / nonspam

Methods for training as email is seen

TEFT (Train Everything)

TUNE (Train Until No Errors)

TOE (Train On Error)
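The difference between two of these training modes can be sketched as follows. TEFT (Train Everything) updates the counts for every message seen, while TOE (Train On Error) updates them only when the filter got the message wrong. The class `TokenCounts` and the functions `teft` and `toe` are hypothetical names for illustration, and the sketch assumes the true label of each message is available. TUNE (Train Until No Errors) would re-run training over the corpus until the filter stops making mistakes, and is not shown.

```python
from collections import Counter


class TokenCounts:
    """Hypothetical persistent state: per-class token and message counts."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, tokens, is_spam):
        # Count each distinct token once per message.
        counts = self.spam if is_spam else self.ham
        counts.update(set(tokens))
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1


def teft(store, tokens, true_is_spam, predicted_is_spam):
    """Train Everything: learn from every message, right or wrong."""
    store.train(tokens, true_is_spam)


def toe(store, tokens, true_is_spam, predicted_is_spam):
    """Train On Error: learn only when the prediction was wrong."""
    if predicted_is_spam != true_is_spam:
        store.train(tokens, true_is_spam)
```

TOE keeps the stored state smaller and changes it less often; TEFT adapts faster because every message moves the counts.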

Analysis

Email's tokens are compared to training data

Some aggregated percentage is created for email

Categorized based on that.

Bayesian filtering gets its name from Bayes' theorem here.
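One common way to build that aggregated percentage is a Graham-style naive Bayes combination, sketched below. The function names, the clamping bounds of 0.01 and 0.99, and the neutral value of 0.5 for unseen tokens are assumptions for illustration and are not taken from the slides.

```python
import math


def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate P(spam | token) from training counts.

    Assumes n_spam and n_ham are both non-zero.
    """
    spam_freq = spam_count / n_spam
    ham_freq = ham_count / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: treat as neutral
    p = spam_freq / (spam_freq + ham_freq)
    # Clamp so a single token can never force a verdict on its own.
    return min(0.99, max(0.01, p))


def combine(probabilities):
    """Combine per-token probabilities as prod(p) / (prod(p) + prod(1 - p)).

    Computed in log space so long messages do not underflow.
    """
    log_spam = sum(math.log(p) for p in probabilities)
    log_ham = sum(math.log(1.0 - p) for p in probabilities)
    diff = log_spam - log_ham
    # Numerically stable logistic function of diff.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)
```

A message would then be categorized by comparing the combined score with a threshold, for example treating anything above 0.9 as spam; the threshold itself is a tuning choice.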

So how does this one work?

Designed to be highly modular.

Currently has modules for:

TEFT

Chi squared

Robinson's

Graham's

Corpus is non-changing; just classifications change.
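To illustrate what an interchangeable analysis module might look like, here is a sketch of Robinson's geometric-mean combination written as a plain function that takes per-token probabilities and returns a score between 0 and 1. The function name `robinson_combine` and the idea of swapping combiners by passing a different function are assumptions; the slides do not show how the project's modules are wired together.

```python
import math


def robinson_combine(probabilities):
    """Robinson-style score from per-token spam probabilities.

    P = 1 - (prod(1 - p)) ** (1/n)   # "spamminess"
    Q = 1 - (prod(p)) ** (1/n)       # "hamminess"
    S = (P - Q) / (P + Q), rescaled to the range 0..1.
    Assumes every probability lies strictly between 0 and 1.
    """
    n = len(probabilities)
    if n == 0:
        return 0.5  # no evidence: neutral score
    spamminess = 1.0 - math.prod(1.0 - p for p in probabilities) ** (1.0 / n)
    hamminess = 1.0 - math.prod(probabilities) ** (1.0 / n)
    s = (spamminess - hamminess) / (spamminess + hamminess)
    return (1.0 + s) / 2.0


# Any function with the same signature could be dropped in here.
analysis_module = robinson_combine
score = analysis_module([0.9, 0.9])
```

Because the geometric mean normalizes by the number of tokens, this score is less extreme on long messages than a plain product of probabilities.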

Object Diagram

[Object diagram. Components: External User, Analysis Package, Training Package, Message Database, Token Database, External Database (optional), Email Corpus. Labelled flows: marked messages, un-marked messages, token counts, suggestions, all database information.]

Does this one work?

To an extent

Test data had very limited feature set

Test data was based on personal writing style

Little time to test/tune

56%-57% accuracy at best

Measured by interesting predicted/interesting actual

Also mistakes/interesting marked

More testing will be done

Other projects are more critical
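The two ratios named above are not defined precisely on the slide. One plausible reading, sketched below, treats "interesting predicted / interesting actual" as the share of truly interesting messages the filter found, and "mistakes / interesting marked" as the share of messages marked interesting that were not. The function name `evaluate` and both interpretations are assumptions.

```python
def evaluate(predicted, actual):
    """Return (hit_rate, mistake_rate) for parallel lists of booleans.

    hit_rate     = correctly predicted interesting / actually interesting
    mistake_rate = wrongly marked interesting / all marked interesting
    """
    marked = sum(1 for p in predicted if p)
    truly_interesting = sum(1 for a in actual if a)
    hits = sum(1 for p, a in zip(predicted, actual) if p and a)
    mistakes = sum(1 for p, a in zip(predicted, actual) if p and not a)
    hit_rate = hits / truly_interesting if truly_interesting else 0.0
    mistake_rate = mistakes / marked if marked else 0.0
    return hit_rate, mistake_rate


hit_rate, mistake_rate = evaluate(
    predicted=[True, True, False, False],
    actual=[True, False, True, False],
)
# One of two interesting messages was found, and one of two marks was wrong.
```

Reporting both numbers matters because a filter that marks everything as interesting would score perfectly on the first ratio alone.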