PPM-based Spam Filtering in SEWM2008
Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong
[email protected], [email protected], [email protected], [email protected]
College of Computer Science, Zhejiang University
April 10, 2008
Outline
  PPM (Prediction by Partial Matching)
  Email Pre-processing
  Train PPM Model
  Model Classification
PPM
  Data compression
  PPM framework
Email Pre-processing
  Source alphabet
  Merge consecutive spaces
  Truncate long messages
Email Pre-processing
  Raw data: Abcd_= - Af?/[]=+ safj =ab fe addfe
  Sample:
    Alphabet: {a,b,c,d,e,f,_,=, }
    Replace char: ?
    Truncate length: 20
  After replace: abcd_= ? Af????=? ?af? =ab fe addfe
  After merging blanks: abcd_= ? Af????=? ?af? =ab fe addfe
  After truncation: abcd_= ? Af????=? ?a
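The three pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `preprocess` and its parameters are hypothetical, only the space character is treated as a blank, and membership in the alphabet is checked strictly (so a capital letter is replaced unless the alphabet contains it).

```python
import re

def preprocess(text, alphabet, replace_char="?", max_len=20):
    """Hypothetical sketch of the replace / merge / truncate steps."""
    # 1. Replace every character outside the source alphabet.
    replaced = "".join(ch if ch in alphabet else replace_char for ch in text)
    # 2. Merge runs of consecutive spaces into a single space.
    merged = re.sub(r" {2,}", " ", replaced)
    # 3. Truncate long messages to at most max_len characters.
    return merged[:max_len]

alphabet = set("abcdef_= ")
print(preprocess("abcd_=  -  af?/[]=+  safj", alphabet))
# prints: abcd_= ? af????=? ?a
```

Truncating after the merge means that padding with long runs of spaces cannot push informative characters out of the retained prefix.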
Train PPM Model
  Use an order-6 PPM* model
  Use Method D escape estimation
  Train two PPM models
    HAM model
    SPAM model
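To make the training step concrete, the sketch below builds a much-simplified bounded-order context model with Method D escape estimates and trains one instance per class. It is an assumption-laden illustration rather than the filter presented here: the class `SimplePPM`, its methods, and the toy training strings are hypothetical, and it omits the exclusion mechanism and the unbounded contexts of PPM*, so its code lengths are slightly pessimistic. Under Method D, a symbol seen c times in a context with n total occurrences and q distinct symbols gets probability (2c - 1) / 2n, and the escape gets q / 2n.

```python
import math
from collections import defaultdict

class SimplePPM:
    """Simplified order-k PPM with Method D escapes (no exclusions)."""

    def __init__(self, order=6, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        # counts[context][symbol] = occurrences of symbol after context
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, history, symbol):
        """Record symbol after every suffix of history of length 0..order."""
        for k in range(min(self.order, len(history)) + 1):
            self.counts[history[len(history) - k:]][symbol] += 1

    def train(self, text):
        for i, symbol in enumerate(text):
            self.update(text[max(0, i - self.order):i], symbol)

    def prob(self, history, symbol):
        """P(symbol | history), backing off from the longest context."""
        p = 1.0
        for k in range(min(self.order, len(history)), -1, -1):
            seen = self.counts.get(history[len(history) - k:])
            if not seen:
                continue  # context never observed: back off at no cost
            n = sum(seen.values())
            c = seen.get(symbol, 0)
            if c > 0:
                return p * (2 * c - 1) / (2 * n)  # Method D symbol estimate
            p *= len(seen) / (2 * n)              # Method D escape estimate
        return p / self.alphabet_size             # order -1: uniform fallback

    def code_length(self, text):
        """Bits needed to encode text with the model held fixed."""
        return -sum(
            math.log2(self.prob(text[max(0, i - self.order):i], symbol))
            for i, symbol in enumerate(text)
        )

# One model per class, trained on toy (made-up) data.
ham_model, spam_model = SimplePPM(), SimplePPM()
ham_model.train("meeting at noon, agenda attached")
spam_model.train("buy cheap pills now, buy cheap pills")
```

With these toy models, `spam_model.code_length("buy cheap")` comes out smaller than `ham_model.code_length("buy cheap")`, which is the signal the classification step relies on.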
Model Classification
  MCE (Minimum Cross-Entropy)
  MDL (Minimum Description Length)
  Spam score
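The two decision rules can be written out as follows. The notation is an assumption introduced here for illustration: X = x_1 ... x_n is the pre-processed message, M_c is the trained model for class c in {ham, spam}, k is the model order, and M_c^{(i-1)} is M_c after it has additionally been updated with x_1 ... x_{i-1}. MCE scores the message with the models held fixed, while MDL lets each model adapt as it reads the message. The slides do not give the spam score formula, so the last line is only one plausible normalization, in which values closer to 1 are more spam-like.

```latex
\hat{c}_{\mathrm{MCE}} = \arg\min_{c}\; H(X, M_c),
\qquad
H(X, M_c) = -\frac{1}{n} \sum_{i=1}^{n} \log_2 p_{M_c}\!\left(x_i \mid x_{i-k}^{i-1}\right)

\hat{c}_{\mathrm{MDL}} = \arg\min_{c}\; -\sum_{i=1}^{n} \log_2 p_{M_c^{(i-1)}}\!\left(x_i \mid x_{i-k}^{i-1}\right)

% Assumed normalization, not taken from the slides:
\mathrm{score}(X) = \frac{H(X, M_{\mathrm{ham}})}{H(X, M_{\mathrm{ham}}) + H(X, M_{\mathrm{spam}})}
```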
Advantages
  Simple pre-processing
  No decoding needed (resists obfuscation)
  Highly self-adaptive
  Low false-positive rate
References
  "Spam Filtering Using Statistical Data Compression Models"
  "Unbounded Length Contexts for PPM"
Questions
  Delay
  Index
  ham, Ham and HAM
  Active learning 10000
  Deliver the filter
Thanks for your attention!
Q&A