email data cleaning
Email Data Cleaning
KDD’05, August 21-24, 2005, Chicago, Illinois, USA.
Jie Tang
Department of Computer Science, Tsinghua University
12#109, Tsinghua University, Beijing, China, [email protected]
Hang Li, Yunbo Cao
Microsoft Research Asia
5F Sigma Center, No.49 Zhichun Road, Haidian, Beijing, China, 100080.
{hangli, yucao}@microsoft.com
Zhaohui Tang
Microsoft Corporation
One Microsoft Way, Redmond, WA, USA, 98052
Agenda
- Introduction
- Related work
- Cleaning as filtering and normalization
  - Formalize the problem of email data cleaning
- Cascaded approach
  - Describe the approach to the problem
- Implementation
- Experimental result
- Conclusion
Introduction
Email data can be very noisy:
- It may contain headers, signatures, quotations, and program code.
- It may contain extra line breaks, extra space, and special character tokens.
- It may have spaces and periods mistakenly removed.
- It may contain words badly cased or non-cased and words misspelled.
Introduction (Continued)
Three questions for email data cleaning:
- How to formalize the problem?
  Sol: Non-text filtering and text normalization.
  - Non-text filtering: filter out header, signature, quotation, and program code.
  - Text normalization: transform the relevant text data into a canonical form, like that of a newspaper (paragraph, sentence, and word normalization).
- How to solve the problem in a principled approach?
  Sol: “Cascaded” fashion.
  - Clean up an email in several passes: email body level, paragraph level, sentence level, and word level.
- How to make an implementation?
  Sol: A unified statistical learning approach.
  - Based on SVM (Support Vector Machines).
  - Define features for the models.
Related work – Data Cleaning
Data Cleaning
- Email Data Cleaning
  - eClean 2000 (tool)
  - WinPure ListCleaner Pro (product)
- Web Page Data Cleaning
  - The weight calculation
  - Partition a page into blocks based on HTML tags. Then calculate the entropy value of each block.
- Tabular Data Cleaning
  - SQL Server 2005 – Fuzzy Grouping.
  - ETL tool.
Related work – Language Processing
Language Processing
- Sentence Boundary Detection
  - Neural network model
- Case Restoration
  - Language model approach – four classes:
    - All lower case
    - First letter upper case
    - All letters upper case
    - Mixed case
- Spelling Error Correction
  - A word sense disambiguation problem.
  - Select the correct word from a set of confusing words, e.g., {to, too, two}, in a specific context.
  - Statistical learning methods.
- Word Normalization
  - A taxonomy of non-standard words
  - N-gram language model
  - Decision tree
  - Weighted finite-state transducers
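The four case classes listed under Case Restoration can be illustrated with a small helper. This is only a sketch: the function name `case_class` and the class labels are illustrative assumptions, and it merely names the class of a token as written. The language model approach cited on the slide predicts the correct class from context, which is not attempted here.

```python
def case_class(token: str) -> str:
    """Return one of four case classes for a token (labels are illustrative)."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters or all(ch.islower() for ch in letters):
        return "all_lower"      # tokens without letters are lumped in here
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "first_upper"    # also covers single capitals such as "I"
    if all(ch.isupper() for ch in letters):
        return "all_upper"
    return "mixed"


if __name__ == "__main__":
    for tok in "this lets you run NETSVC.EXE from the NTReskit".split():
        print(tok, case_class(tok))
```

A case restoration step would then rewrite each token into the class chosen by the model.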
Cleaning as filtering and normalization
Example of email message:
1. On Mon, 23 Dec 2002 13:39:42 -0500, "Brendon"
2. <[email protected]> wrote:
3. NETSVC.EXE from the NTReskit. Or use the
4. psexec from
5. sysinternals.com. this lets you run
6. commands remotely - for example net stop 'service'.
7. --
8. --------------------------------------
9. Best Regards
10. Brendon
11.
12. Delighting our customers is our top priority. We welcome your comments and
13. suggestions about how we can improve the support we provide to you.
14. --------------------------------------
15. >>-----Original Message-----
16. >>"Jack" <[email protected]> wrote in message
17. >>news:00a201c2aab2$12154680$d5f82ecf@TK2MSFTNGXA12...
18. >> Is there a command line util that would allow me to
19. >> shutdown services on a remote machine via a batch file?
20. >>Best Regards
21. >>Jack
Cleaned email message:
1. NETSVC.EXE from the NTReskit. Or use the psexec from sysinternals.com.
2. This lets you run commands remotely – for example net stop 'service'.
Cascaded Approach
1. Non-text filtering
   • Identify header, signature, quotation, and program code.
2. Paragraph normalization
   • Identify extra line breaks.
3. Sentence normalization
   • Figure out whether a period, a question mark, or an exclamation mark is a real sentence ending (sentence boundary).
   • Remove non-words: non-ASCII words and tokens containing many special symbols.
4. Word normalization
   • Conduct case restoration on badly cased words.
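The four passes above can be chained as a simple pipeline. The sketch below shows only the cascade structure. Every pass body is a hand-written stand-in rule chosen for illustration (for example, treating lines that start with ">" as quotations), not the SVM-based detectors described in the talk, and all function names are assumptions.

```python
import re
from typing import Callable, List

CleaningPass = Callable[[str], str]


def filter_non_text(body: str) -> str:
    """Pass 1 stand-in: drop quoted lines, assumed here to start with '>'."""
    kept = [ln for ln in body.splitlines() if not ln.lstrip().startswith(">")]
    return "\n".join(kept)


def normalize_paragraphs(body: str) -> str:
    """Pass 2 stand-in: treat blank lines as paragraph ends, join other lines."""
    paragraphs = re.split(r"\n\s*\n", body)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def normalize_sentences(body: str) -> str:
    """Pass 3 stand-in: restore a space missing after a sentence-ending mark."""
    return re.sub(r"(?<=[a-z][.?!])(?=[A-Z])", " ", body)


def normalize_words(body: str) -> str:
    """Pass 4 stand-in: capitalize the first letter of each sentence."""
    return re.sub(r"(^|[.?!]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), body)


CASCADE: List[CleaningPass] = [
    filter_non_text, normalize_paragraphs, normalize_sentences, normalize_words,
]


def clean_email(body: str, passes: List[CleaningPass] = CASCADE) -> str:
    """Run the passes in order; each pass sees the output of the previous one."""
    for cleaning_pass in passes:
        body = cleaning_pass(body)
    return body
```

Because each pass consumes the previous pass's output, an individual stand-in can be swapped for a trained model without touching the others.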
Implementation
Steps:
- Preprocessing
  - Use patterns to recognize “special words”.
- Non-text filtering
  - Use a classification model to detect the header, signature, and program code.
  - Use hard-coded rules to filter out quotations.
  - This step relies on header detection, signature detection, and program code detection.
- Paragraph normalization
  - Use a classification model to identify whether each line break is a paragraph ending.
  - This step is based on paragraph ending detection.
- Sentence normalization
- Word normalization
Classification model – SVM (Support Vector Machines)
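The preprocessing step recognizes “special words” by pattern. A minimal sketch of that idea follows, using regular expressions for a few types (email address, URL, date, money, percentage). The patterns and the name `special_word_types` are illustrative assumptions, not the patterns used by the authors, and they are deliberately loose.

```python
import re
from typing import Set

# Illustrative patterns only; real email text needs more robust ones.
SPECIAL_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "url": re.compile(r"\bhttps?://\S+"),
    "date": re.compile(
        r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "percentage": re.compile(r"\b\d+(?:\.\d+)?%"),
}


def special_word_types(line: str) -> Set[str]:
    """Return the names of all special-word patterns found in the line."""
    return {name for name, pattern in SPECIAL_PATTERNS.items()
            if pattern.search(line)}


if __name__ == "__main__":
    print(special_word_types("On Mon, 23 Dec 2002 13:39:42 -0500"))
```

The detected types can later be reused as the special pattern features of the detection models.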
Implementation - Header and Signature Detection
Header detection and signature detection are similar problems.
Two stages:
- Training
  - Train two SVM models that detect the start line and the end line.
  - Define a set of features and assign each line a label – start line, end line or either.
  - Use the labeled data to train the SVM models.
- Detection
  - Use the two SVM models to identify the start line and the end line of the header.
  - The lines between the identified start line and end line are the header.
The key issue is how to define features for effectively performing the cleaning task.
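The detection stage can be sketched as follows: one scorer flags the start line, another flags the end line, and the lines between them form the block. In the talk both scorers are SVM models. Here they are passed in as plain callables, and the two toy scorers (`toy_start`, `toy_end`) are hypothetical keyword rules used only to make the example runnable.

```python
from typing import Callable, List, Optional, Tuple

# A scorer maps (all lines, line index) to a score; positive means "yes".
LineScorer = Callable[[List[str], int], float]


def detect_block(lines: List[str], start_score: LineScorer,
                 end_score: LineScorer) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) indexes of the first detected block."""
    start = next((i for i in range(len(lines))
                  if start_score(lines, i) > 0), None)
    if start is None:
        return None
    end = next((j for j in range(start, len(lines))
                if end_score(lines, j) > 0), None)
    return None if end is None else (start, end)


def strip_block(lines: List[str], span: Tuple[int, int]) -> List[str]:
    """Remove the detected block from the list of lines."""
    start, end = span
    return lines[:start] + lines[end + 1:]


# Hypothetical stand-ins for the two trained models.
def toy_start(lines: List[str], i: int) -> float:
    return 1.0 if lines[i].startswith(("On ", "From:")) else -1.0


def toy_end(lines: List[str], i: int) -> float:
    return 1.0 if lines[i].rstrip().endswith(("wrote:", "said:")) else -1.0
```

With trained models, the callables would wrap the models' decision values over the feature vector of each line.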
Implementation - Header and Signature Detection (Features in header detection model.)
- Position feature
  - The current line is the first line in the email.
- Positive word features
  - The current line begins with words like “From:”, “Re:”, “In article”, and “In message”, contains words such as “original message” and “Fwd:”, or ends with words like “wrote:” and “said:”.
- Negative word features
  - Words that are usually used in greetings and should not be included in a header: “Hi”, “dear”, “thank you”, and “best regards”.
- Number of words feature
  - The number of words in the current line.
- Person name feature
- Ending character features
  - The current line ends with a colon, semicolon, quotation mark, question mark, exclamation mark, or suspension points.
- Special pattern features
  - Positive types include email address and date. Negative types include money and percentage.
- Number of line breaks feature
  - How many line breaks exist before the current line.
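The header features above could be computed per line roughly as below. This is a sketch under assumptions: the feature names, the regular expressions, and the reading of “line breaks before the current line” as the number of directly preceding blank lines are all illustrative. The person name feature is left out because it needs a name lexicon.

```python
import re
from typing import Dict, List

BEGIN_WORDS = ("from:", "re:", "in article", "in message")
CONTAIN_WORDS = ("original message", "fwd:")
END_WORDS = ("wrote:", "said:")
ENDING_CHARS = (":", ";", '"', "?", "!", "...")
NEGATIVE_RE = re.compile(r"\b(?:hi|dear|thank you|best regards)\b", re.I)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE_RE = re.compile(
    r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b")
MONEY_OR_PERCENT_RE = re.compile(r"\$\d|\d%")


def blank_lines_before(lines: List[str], i: int) -> int:
    """Count the blank lines directly above line i."""
    count, j = 0, i - 1
    while j >= 0 and not lines[j].strip():
        count, j = count + 1, j - 1
    return count


def header_features(lines: List[str], i: int) -> Dict[str, object]:
    """Feature dictionary for line i (conversion to a vector is omitted)."""
    line = lines[i].strip()
    low = line.lower()
    return {
        "is_first_line": i == 0,
        "positive_begin": low.startswith(BEGIN_WORDS),
        "positive_contain": any(w in low for w in CONTAIN_WORDS),
        "positive_end": low.endswith(END_WORDS),
        "negative_word": bool(NEGATIVE_RE.search(low)),
        "num_words": len(line.split()),
        "ends_with_mark": line.endswith(ENDING_CHARS),
        "has_email_or_date": bool(EMAIL_RE.search(line) or DATE_RE.search(line)),
        "has_money_or_percent": bool(MONEY_OR_PERCENT_RE.search(line)),
        "line_breaks_before": blank_lines_before(lines, i),
    }
```

Each dictionary would be turned into a numeric vector before being fed to the start-line and end-line models.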
Implementation - Header and Signature Detection (Features in signature detection model.)
- Position features
  - The current line is the first line or last line in the mail.
- Positive word features
  - “Best Regards”, “Thanks”, “Sincerely”, and “Good luck”.
- Number of words features
- Person name feature
- Ending character features
- Special symbol pattern features
  - Special symbols such as “--------”, “======”, “******”.
- Case features
  - The tokens are all in upper case, all in lower case, all capitalized, or only the first token is capitalized.
- Number of line breaks feature
  - How many line breaks exist before the current line.
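The signature-specific features (position, positive words, special symbol patterns, and case pattern) might be extracted as in this sketch. The names, the minimum run length of four symbols, and the case labels are assumptions made for illustration.

```python
import re
from typing import Dict, List

POSITIVE_WORDS = ("best regards", "thanks", "sincerely", "good luck")
SEPARATOR_RE = re.compile(r"-{4,}|={4,}|\*{4,}")  # runs of 4+ symbols (assumed)


def case_pattern(line: str) -> str:
    """Classify the casing of the alphabetic tokens in a line."""
    tokens = [t for t in line.split() if any(c.isalpha() for c in t)]
    if not tokens:
        return "none"
    if all(t.isupper() for t in tokens):
        return "all_upper"
    if all(t.islower() for t in tokens):
        return "all_lower"
    if all(t[0].isupper() for t in tokens):
        return "all_capitalized"
    if tokens[0][0].isupper() and not any(t[0].isupper() for t in tokens[1:]):
        return "first_capitalized"
    return "other"


def signature_features(lines: List[str], i: int) -> Dict[str, object]:
    """Feature dictionary for line i of an email body."""
    line = lines[i].strip()
    low = line.lower()
    return {
        "is_first_or_last": i == 0 or i == len(lines) - 1,
        "positive_word": any(w in low for w in POSITIVE_WORDS),
        "has_separator": bool(SEPARATOR_RE.search(line)),
        "case_pattern": case_pattern(line),
        "num_words": len(line.split()),
    }
```

On the example email, the separator line and the “Best Regards” line would both produce strong signature cues.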
Implementation - Program Code Detection
Features in program code detection model:
- Position feature
- Declaration keyword feature
  - “string”, “char”, “double”, “dim”, “typedef struct”, “#include”, “import”, “#define”, “#undef”, “#ifdef”, and “#endif”.
- Statement keyword features
  - “i++”, “if”, “else if”, “switch”, and “case”, “while”, “do{”, “for”, and “for each”, “goto”, “continue;”, “next;”, “break;”, “last;”, and “return”.
- Equation pattern features
  - “=”, “<=”, and “<<=”; “a=b+/*-c;”, “a=B(bb,cc)”, “a=b;”.
- Function definition feature
  - “sub” or “function”, “end function” or “end sub”.
- Bracket features
  - “{”, “}”
- Percentage of real words feature
- Ending character features
- Number of line breaks feature
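A rough sketch of how several of the program code features could be computed for a single line is shown below. The regular expressions are simplified assumptions. The “percentage of real words” is approximated by the share of purely alphabetic tokens, whereas a real system would presumably consult a dictionary.

```python
import re
from typing import Dict

DECLARATION_RE = re.compile(
    r"^\s*(?:string|char|double|dim|typedef struct|#include|import|"
    r"#define|#undef|#ifdef|#endif)(?!\w)", re.I)
STATEMENT_RE = re.compile(
    r"\b(?:if|else|switch|case|while|for|goto|return)\b|\+\+|\bdo\{|"
    r"\b(?:continue|next|break|last);")
EQUATION_RE = re.compile(r"\b\w+\s*(?:<<=|<=|=)\s*[^=;\s][^;]*;")
FUNCTION_RE = re.compile(r"^\s*(?:end\s+)?(?:sub|function)\b", re.I)


def code_features(line: str) -> Dict[str, object]:
    """Feature dictionary for one line; position features are omitted."""
    tokens = line.split()
    real = sum(t.isalpha() for t in tokens)
    return {
        "declaration_keyword": bool(DECLARATION_RE.search(line)),
        "statement_keyword": bool(STATEMENT_RE.search(line)),
        "equation_pattern": bool(EQUATION_RE.search(line)),
        "function_definition": bool(FUNCTION_RE.search(line)),
        "has_bracket": "{" in line or "}" in line,
        "real_word_ratio": real / len(tokens) if tokens else 0.0,
    }


if __name__ == "__main__":
    print(code_features("for (i = 0; i < n; i++) { total = total + x[i]; }"))
```

Keywords such as “if” and “for” also occur in ordinary prose, which is one reason to treat them as features for a classifier rather than as hard rules.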
Implementation - Paragraph Ending Detection
Features in paragraph ending detection model:
- Position features
- Greeting word features
  - Greeting words like “Hi” and “Dear”. (In such cases, the line break should not be removed.)
- Ending character features
- Case features
  - Lower case letters, and whether or not the next line starts with a word in lower case letters.
- Bullet features
  - “1.” and “a)”. (In such cases, the line break should be retained.)
- Number of line breaks feature
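For each line break, the features above describe the current line and the line that follows it. The sketch below computes a few of them. The names and regular expressions are illustrative assumptions, and the position and number-of-line-breaks features are omitted.

```python
import re
from typing import Dict, List

GREETING_RE = re.compile(r"^(?:hi|dear)\b", re.I)
BULLET_RE = re.compile(r"^(?:\d+\.|[a-zA-Z]\))(?:\s|$)")
SENTENCE_END = (".", "?", "!")


def paragraph_end_features(lines: List[str], i: int) -> Dict[str, bool]:
    """Features for the line break that follows line i."""
    cur = lines[i].strip()
    nxt = lines[i + 1].strip() if i + 1 < len(lines) else ""
    return {
        "is_last_line": i == len(lines) - 1,
        "starts_with_greeting": bool(GREETING_RE.match(cur)),
        "ends_with_sentence_mark": cur.endswith(SENTENCE_END),
        "next_starts_lowercase": nxt[:1].islower(),
        "next_is_bullet": bool(BULLET_RE.match(nxt)),
    }


if __name__ == "__main__":
    body = ["Hi Jack,", "use the", "psexec from sysinternals.com.", "1. First"]
    for idx in range(len(body)):
        print(idx, paragraph_end_features(body, idx))
```

A line break after “use the” followed by a lowercase word is a likely candidate for removal, while one after a greeting or before a bullet would be kept.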
Experimental result – Data sets
- 73.2% of the emails need paragraph normalization.
- 85.4% of the emails need sentence normalization.
- 47.1% of the emails need case restoration.
- Only 7.4% of the emails contain at least one spelling error.
- Only 1.6% of the emails are absolutely clean.
Experimental result – Evaluation measures
Experimental result
- The baseline method is eClean.
Conclusion – future work
Improvement on the accuracy of each step.
Apply the cleaning method to other text mining applications.
Q & A