adapting statistical email filtering david kohlbrenner it.com tjhsst

Adapting Statistical Email Filtering

David KohlbrennerIT.com

TJHSST

What is a statistical email filter?

Filters email for spam.

Supervised learning

Not heuristics

Bayesian filtering and Bayes

Why is a statistical email filter better?

Not based on pre-set values by a human

Can use concepts not easily understood by

people Learns over time, and therefore adapts Real world tests put accuracy better than

99.9%

How does a statistical email filter work?

Three parts

Tokenization / feature extraction

Training

Analysis

Also, it must store a persistent state

Tokenization / Feature extraction

Emails are made of words Tokens are words, phrases, HTML,

timestamps, senders, etc. The goal is to get as many 'features' of the

email as is possible, the good ones rise to the top

“the orange ball” Becomes: “the”, “orange”, “ball”, “the orange”, “orange ball”, “the orange ball”, “*Font: Albany”, etc.

Training

All filters begin blank

Trained with a corpus of spam / nonspam

Methods for training as email is seen

TEFT

TUNE

TOE

Analysis

Email's tokens are compared to training data

Some aggregated percentage is created for

email Categorized based on that. Bayesian filtering gets its name from Bayes

theorem here.

So how does this one work?

Designed to be highly modular. Currently has modules for:

TEFT Chi squared Robinson's Graham's

Corpus is non changing, just classifications change.

Object Diagram

ExternalUser

AnalysisPackage

TrainingPackage

Message Database

Token Database

Marked Messages

Un-marked messages

Token Counts

Suggestions

SuggestionsUn-marked messages

ExternalDatabase(Optional)

All database information

EmailCorpus

Does this one work?

To an extent Test data had very limited feature set Test data was based on personal writing style Little time to test/tune

56%-57% accuracy at best Measured by interesting predicted/interesting

actual Also mistakes/interesting marked

More testing will be done Other projects are more critical

adapting statistical email filtering david kohlbrenner it.com tjhsst

Documents