adapting statistical email filtering david kohlbrenner it.com tjhsst
TRANSCRIPT
![Page 1: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST](https://reader036.vdocument.in/reader036/viewer/2022072017/56649f055503460f94c1ab76/html5/thumbnails/1.jpg)
Adapting Statistical Email Filtering
David KohlbrennerIT.com
TJHSST
![Page 2: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST](https://reader036.vdocument.in/reader036/viewer/2022072017/56649f055503460f94c1ab76/html5/thumbnails/2.jpg)
What is a statistical email filter?
Filters email for spam.
Supervised learning
Not heuristics
Bayesian filtering and Bayes
![Page 3: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST](https://reader036.vdocument.in/reader036/viewer/2022072017/56649f055503460f94c1ab76/html5/thumbnails/3.jpg)
Why is a statistical email filter better?
Not based on pre-set values by a human
Can use concepts not easily understood by
people Learns over time, and therefore adapts Real world tests put accuracy better than
99.9%
![Page 4: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST](https://reader036.vdocument.in/reader036/viewer/2022072017/56649f055503460f94c1ab76/html5/thumbnails/4.jpg)
How does a statistical email filter work?
Three parts
Tokenization / feature extraction
Training
Analysis
Also, it must store a persistent state
![Page 5: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST](https://reader036.vdocument.in/reader036/viewer/2022072017/56649f055503460f94c1ab76/html5/thumbnails/5.jpg)
Tokenization / Feature extraction
Emails are made of words Tokens are words, phrases, HTML,
timestamps, senders, etc. The goal is to get as many 'features' of the
email as is possible, the good ones rise to the top
“the orange ball” Becomes: “the”, “orange”, “ball”, “the orange”, “orange ball”, “the orange ball”, “*Font: Albany”, etc.
![Page 6: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST](https://reader036.vdocument.in/reader036/viewer/2022072017/56649f055503460f94c1ab76/html5/thumbnails/6.jpg)
Training
All filters begin blank
Trained with a corpus of spam / nonspam
Methods for training as email is seen
TEFT
TUNE
TOE
![Page 7: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST](https://reader036.vdocument.in/reader036/viewer/2022072017/56649f055503460f94c1ab76/html5/thumbnails/7.jpg)
Analysis
Email's tokens are compared to training data
Some aggregated percentage is created for
email Categorized based on that. Bayesian filtering gets its name from Bayes
theorem here.
![Page 8: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST](https://reader036.vdocument.in/reader036/viewer/2022072017/56649f055503460f94c1ab76/html5/thumbnails/8.jpg)
So how does this one work?
Designed to be highly modular. Currently has modules for:
TEFT Chi squared Robinson's Graham's
Corpus is non changing, just classifications change.
![Page 9: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST](https://reader036.vdocument.in/reader036/viewer/2022072017/56649f055503460f94c1ab76/html5/thumbnails/9.jpg)
Object Diagram
ExternalUser
AnalysisPackage
TrainingPackage
Message Database
Token Database
Marked Messages
Un-marked messages
Token Counts
Suggestions
SuggestionsUn-marked messages
ExternalDatabase(Optional)
All database information
EmailCorpus
![Page 10: Adapting Statistical Email Filtering David Kohlbrenner IT.com TJHSST](https://reader036.vdocument.in/reader036/viewer/2022072017/56649f055503460f94c1ab76/html5/thumbnails/10.jpg)
Does this one work?
To an extent Test data had very limited feature set Test data was based on personal writing style Little time to test/tune
56%-57% accuracy at best Measured by interesting predicted/interesting
actual Also mistakes/interesting marked
More testing will be done Other projects are more critical