Email Data Cleaning

KDD’05, August 21-24, 2005, Chicago, Illinois, USA.

Jie Tang
Department of Computer Science, Tsinghua University
12#109, Tsinghua University, Beijing, China, [email protected]

Hang Li, Yunbo Cao
Microsoft Research Asia
5F Sigma Center, No.49 Zhichun Road, Haidian, Beijing, China, 100080
{hangli, yucao}@microsoft.com

Zhaohui Tang
Microsoft Corporation
One Microsoft Way, Redmond, WA, USA, 98052
[email protected]

Agenda

- Introduction
- Related work
- Cleaning as filtering and normalization
  - Formalize the problem of email data cleaning
- Cascaded approach
  - Describe the approach to the problem
- Implementation
- Experimental result
- Conclusion

Introduction

Email data can be very noisy:
- It may contain headers, signatures, quotations, and program code.
- It may contain extra line breaks, extra spaces, and special character tokens.
- It may have spaces and periods mistakenly removed.
- It may contain words that are badly cased or non-cased, and words that are misspelled.

Introduction (Continued)

Three questions for email data cleaning:
- How to formalize the problem?
  - Sol: Non-text filtering and text normalization.
  - Non-text filtering: header, signature, quotation and program code filtering.
  - Text normalization: transforming relevant text data into canonical form, like a newspaper (paragraph, sentence, and word normalization).
- How to solve the problem with a principled approach?
  - Sol: “Cascaded” fashion.
  - Clean up an email in several passes: email body level, paragraph, sentence, and word level.
- How to implement it?
  - Sol: A unified statistical learning approach.
  - Based on SVM (Support Vector Machines).
  - Define features for the models.

Related work – Data Cleaning

Data Cleaning
- Email Data Cleaning
  - eClean 2000 (tool)
  - WinPure ListCleaner Pro (product)
- Web Page Data Cleaning
  - The weight calculation
  - Partition a page into blocks based on HTML tags. Then calculate the entropy value of each block.
- Tabular Data Cleaning
  - SQL Server 2005 – Fuzzy Grouping.
  - ETL tool.

Related work – Language Processing

Language Processing
- Sentence Boundary Detection
  - Neural network model
- Case Restoration
  - Language model approach – four classes:
    - All lower case
    - First letter upper case
    - All letters upper case
    - Mixed case
- Spelling Error Correction
  - A word sense disambiguation problem.
  - Select the correct word from a set of confusing words, e.g., {to, too, two}, in a specific context.
  - Statistical learning methods.
- Word Normalization
  - A taxonomy of non-standard words
  - N-gram language model
  - Decision tree
  - Weighted finite-state transducers
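To make the four case classes listed under Case Restoration concrete, here is a minimal sketch that assigns a token to one of them. The function name `case_class`, the class labels, and the extra `no-letters` bucket are assumptions made for illustration; the slides only name the four classes and do not say how a token is assigned to one.

```python
def case_class(token: str) -> str:
    """Assign a token to one of the four case classes named in the slides."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        # Extra bucket (assumption) for tokens such as "2005" that have no letters.
        return "no-letters"
    if all(ch.islower() for ch in letters):
        return "all-lower"
    if all(ch.isupper() for ch in letters):
        # Single capital letters such as "I" also land here.
        return "all-upper"
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "first-upper"
    return "mixed"


if __name__ == "__main__":
    for tok in ["email", "Email", "SVM", "eClean", "2005"]:
        print(tok, "->", case_class(tok))
```

A case restoration model would predict one of these classes for each word from its context; the sketch only covers the class definitions.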

Cleaning as filtering and normalization

Example of email message

1. On Mon, 23 Dec 2002 13:39:42 -0500, "Brendon"
2. <[email protected]> wrote:
3. NETSVC.EXE from the NTReskit. Or use the
4. psexec from
5. sysinternals.com. this lets you run
6. commands remotely - for example net stop 'service'.
7. --
8. --------------------------------------
9. Best Regards
10. Brendon
11.
12. Delighting our customers is our top priority. We welcome your comments and
13. suggestions about how we can improve the support we provide to you.
14. --------------------------------------
15. >>-----Original Message-----
16. >>"Jack" <[email protected]> wrote in message
17. >>news:00a201c2aab2$12154680$d5f82ecf@TK2MSFTNGXA12...
18. >> Is there a command line util that would allow me to
19. >> shutdown services on a remote machine via a batch file?
20. >>Best Regards
21. >>Jack

Cleaned email message

1. NETSVC.EXE from the NTReskit. Or use the psexec from sysinternals.com.
2. This lets you run commands remotely – for example net stop 'service'.

Cascaded Approach

1. Non-text filtering
• Identify header, signature, quotation and program code.

2. Paragraph normalization
• Identify extra line breaks.

3. Sentence normalization
• Figure out whether a period, a question mark, or an exclamation mark is a real sentence ending (sentence boundary).
• Remove non-words: non-ASCII words, tokens containing many special symbols.

4. Word normalization
• Conduct case restoration on badly cased words.
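The cascade above can be read as a chain of passes, each taking the output of the previous one. The sketch below shows only that control flow. All four passes are toy rule-based stand-ins written for illustration (the slides use SVM classifiers for most steps), and every function name here is an assumption.

```python
import re
from typing import Callable, List

Pass = Callable[[str], str]


def drop_quoted_lines(text: str) -> str:
    """Toy non-text filter: drop lines that start with '>'."""
    kept = [l for l in text.split("\n") if not l.lstrip().startswith(">")]
    return "\n".join(kept)


def join_broken_lines(text: str) -> str:
    """Toy paragraph normalizer: blank lines separate paragraphs,
    every other line break is treated as an extra one and removed."""
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def drop_non_ascii_tokens(text: str) -> str:
    """Toy sentence-level pass: remove space-separated tokens with non-ASCII characters."""
    return " ".join(t for t in text.split(" ") if t.isascii())


def capitalize_sentence_starts(text: str) -> str:
    """Toy word normalizer: uppercase a lowercase letter at the start of the
    text or after '.', '?' or '!' followed by whitespace."""
    return re.sub(r"(^|[.?!]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)


def clean(text: str, passes: List[Pass]) -> str:
    """Apply the passes in order, feeding each one the previous output."""
    for p in passes:
        text = p(text)
    return text


CASCADE = [drop_quoted_lines, join_broken_lines,
           drop_non_ascii_tokens, capitalize_sentence_starts]

if __name__ == "__main__":
    raw = ("NETSVC.EXE from the NTReskit. Or use the\n"
           "psexec from\n"
           "sysinternals.com. this lets you run\n"
           "commands remotely.\n"
           ">> Is there a command line util?\n")
    print(clean(raw, CASCADE))
```

The ordering matters: quotation removal comes first so that later passes never see the quoted lines, and case restoration runs last on text that has already been joined into paragraphs.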

Implementation

Steps:
- Preprocessing
  - Use patterns to recognize “special words”.
- Non-text filtering
  - Use a classification model to detect the header, signature, and program code.
  - Use hard-coded rules to filter out quotations.
  - This step relies on header detection, signature detection, and program code detection.
- Paragraph normalization
  - Use a classification model to identify whether each line break is a paragraph ending.
  - This step is based on paragraph ending detection.
- Sentence normalization
- Word normalization

Classification model – SVM (Support Vector Machines)
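As an illustration of the preprocessing step, the sketch below uses regular expressions to recognize a few kinds of “special words”. The four types are the ones the header feature slide mentions (email address, date, money, percentage). The patterns themselves, the type names, and the function `tag_special_words` are assumptions for illustration, not the patterns used in the original system.

```python
import re
from typing import List, Tuple

# Hypothetical patterns; each one is deliberately simple and far from exhaustive.
SPECIAL_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("DATE", re.compile(
        r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b")),
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?%")),
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
]


def tag_special_words(line: str) -> List[Tuple[str, str]]:
    """Return (type, matched_text) pairs, grouped in the order of SPECIAL_PATTERNS."""
    found = []
    for name, pattern in SPECIAL_PATTERNS:
        for match in pattern.finditer(line):
            found.append((name, match.group(0)))
    return found


if __name__ == "__main__":
    print(tag_special_words("On Mon, 23 Dec 2002 13:39:42 -0500"))
    print(tag_special_words("Mail jack@example.com about the 7.4% fee of $1,200.50"))
```

Later steps can then use the recognized types as features instead of the raw strings.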

Implementation - Header and Signature Detection

Header detection and signature detection are similar problems.

Two stages:
- Training
  - Two SVM models that detect the start line and the end line.
  - Define a set of features and assign each line a label: start line, end line, or neither.
  - Use the labeled data to train the SVM models.
- Detection
  - Use the two SVM models to identify the start line and the end line of the header.
  - The lines between the identified start line and end line are the header.

The key issue is how to define features for effectively performing the cleaning task.
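The detection stage can be sketched as follows, assuming the two models have already produced one boolean per line (start line or not, end line or not). The pairing rule used here, taking the first predicted start line and the nearest predicted end line at or after it, is an assumption; the slides do not say how conflicting predictions are resolved. The names `find_span` and `remove_span` are hypothetical.

```python
from typing import List, Optional, Tuple


def find_span(is_start: List[bool], is_end: List[bool]) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) line indices of the first detected block,
    or None if no start line has an end line at or after it."""
    for s, start_flag in enumerate(is_start):
        if not start_flag:
            continue
        for e in range(s, len(is_end)):
            if is_end[e]:
                return (s, e)
    return None


def remove_span(lines: List[str], span: Optional[Tuple[int, int]]) -> List[str]:
    """Drop the lines inside the span; return the lines unchanged if span is None."""
    if span is None:
        return list(lines)
    start, end = span
    return lines[:start] + lines[end + 1:]


if __name__ == "__main__":
    lines = ['On Mon, 23 Dec 2002 13:39:42 -0500, "Brendon"',
             "<sender address> wrote:",
             "NETSVC.EXE from the NTReskit. Or use the"]
    # Hand-written predictions standing in for the two SVM outputs.
    span = find_span([True, False, False], [False, True, False])
    print(span, remove_span(lines, span))
```

The same two functions would apply to signature detection, with models trained on signature start and end lines.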

Implementation - Header and Signature Detection (Features in header detection model.)

- Position feature: the current line is the first line in the email.
- Positive word features: the current line begins with words like “From:”, “Re:”, “In article”, and “In message”; contains words such as “original message” and “Fwd:”; or ends with words like “wrote:” and “said:”.
- Negative word features: words that are usually used in greetings and should not be included in a header, such as “Hi”, “dear”, “thank you”, and “best regards”.
- Number of words feature: the number of words in the current line.
- Person name feature
- Ending character features: the current line ends with a colon, semicolon, quotation mark, question mark, exclamation mark, or suspension points.
- Special pattern features: positive types include email address and date; negative types include money and percentage.
- Number of line breaks feature: how many line breaks exist before the current line.
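A minimal sketch of how some of the header features above could be computed for one line. The word lists come from the slide; the feature names, the email regex, and the reading of “line breaks before the current line” as the number of blank lines immediately preceding it are assumptions. The person name feature is left out because it would need a name list, and the address `brendon@example.com` in the usage example is a made-up placeholder.

```python
import re
from typing import Dict, List

BEGIN_WORDS = ("from:", "re:", "in article", "in message")
CONTAIN_WORDS = ("original message", "fwd:")
END_WORDS = ("wrote:", "said:")
NEGATIVE_WORDS = ("hi", "dear", "thank you", "best regards")
ENDING_CHARS = (":", ";", '"', "?", "!", "...")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # hypothetical pattern


def header_features(lines: List[str], i: int) -> Dict[str, object]:
    """Compute a feature dictionary for line i of an email."""
    line = lines[i].strip()
    low = line.lower()
    # Count blank lines directly above the current line (assumed interpretation).
    breaks, j = 0, i - 1
    while j >= 0 and not lines[j].strip():
        breaks += 1
        j -= 1
    return {
        "is_first_line": i == 0,
        "positive_begin": low.startswith(BEGIN_WORDS),
        "positive_contain": any(w in low for w in CONTAIN_WORDS),
        "positive_end": low.endswith(END_WORDS),
        "negative_word": any(
            re.search(r"\b" + re.escape(w) + r"\b", low) for w in NEGATIVE_WORDS),
        "num_words": len(line.split()),
        "ends_with_special_char": line.endswith(ENDING_CHARS),
        "has_email": EMAIL_RE.search(line) is not None,
        "line_breaks_before": breaks,
    }


if __name__ == "__main__":
    email = ['On Mon, 23 Dec 2002 13:39:42 -0500, "Brendon"',
             "<brendon@example.com> wrote:",
             "",
             "Hi Jack, use the psexec tool"]
    for idx in range(len(email)):
        print(idx, header_features(email, idx))
```

Each dictionary would be turned into a numeric vector before being passed to the start-line and end-line classifiers.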

Implementation - Header and Signature Detection (Features in Signature detection model.)

- Position features: the current line is the first line or last line in the mail.
- Positive word features: “Best Regards”, “Thanks”, “Sincerely” and “Good luck”.
- Number of words features
- Person name feature
- Ending character features
- Special symbol pattern features: special symbols such as “--------”, “======”, “******”.
- Case features: the tokens are all in upper case, all in lower case, all capitalized, or only the first token is capitalized.
- Number of line breaks feature: how many line breaks exist before the current line.
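Two of the signature features above, the special symbol pattern and the case pattern, are easy to illustrate. In this sketch the minimum run length of four symbols and the function names `is_separator` and `case_pattern` are assumptions; the slide only gives example separators and the four case categories.

```python
import re

# A line made only of a repeated '-', '=' or '*' (at least 4 times; the
# threshold is an assumption).
SEPARATOR_RE = re.compile(r"^\s*([-=*])\1{3,}\s*$")


def is_separator(line: str) -> bool:
    """True for lines such as '--------', '======' or '******'."""
    return SEPARATOR_RE.match(line) is not None


def case_pattern(line: str) -> str:
    """Classify the casing of the tokens in a line into the slide's four
    categories, plus 'none' (no letters) and 'other'."""
    tokens = [t for t in line.split() if any(c.isalpha() for c in t)]
    if not tokens:
        return "none"
    if all(t.isupper() for t in tokens):
        return "all-upper"
    if all(t.islower() for t in tokens):
        return "all-lower"
    if all(t[0].isupper() for t in tokens):
        return "all-capitalized"
    if tokens[0][0].isupper() and not any(t[0].isupper() for t in tokens[1:]):
        return "first-capitalized"
    return "other"


if __name__ == "__main__":
    for text in ["--------------------------------------", "--",
                 "Best Regards", "Best regards", "BEST REGARDS"]:
        print(repr(text), is_separator(text), case_pattern(text))
```

Short closings such as “Best Regards” tend to be all capitalized, which is one reason a case feature can help separate signature lines from body text.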

Implementation - Program Code Detection

Features in the program code detection model:
- Position feature
- Declaration keyword feature: “string”, “char”, “double”, “dim”, “typedef struct”, “#include”, “import”, “#define”, “#undef”, “#ifdef”, and “#endif”.
- Statement keyword features: “i++”, “if”, “else if”, “switch”, and “case”, “while”, “do{”, “for”, and “for each”, “goto”, “continue;”, “next;”, “break;”, “last;” and “return”.
- Equation pattern features: “=”, “<=” and “<<=”, “a=b+/*-c;”, “a=B(bb,cc)”, “a=b;”.
- Function definition feature: “sub” or “function”, “end function” or “end sub”.
- Bracket features: “{”, “}”.
- Percentage of real words feature
- Ending character features
- Number of line breaks feature
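A rough sketch of some of the program code features above, computed for a single line. The keyword lists are copied from the slide; the regular expressions, the feature names, and the `vocabulary` argument (a stand-in for whatever dictionary decides what counts as a “real word”) are assumptions made for illustration.

```python
import re
from typing import Dict, Set

DECLARATION_KEYWORDS = ("string", "char", "double", "dim", "typedef struct",
                        "#include", "import", "#define", "#undef", "#ifdef", "#endif")
DECL_RE = re.compile(
    r"(?<![\w#])(?:" + "|".join(re.escape(k) for k in DECLARATION_KEYWORDS) + r")(?!\w)")
STMT_RE = re.compile(
    r"i\+\+|do\{|\b(?:continue|next|break|last);"
    r"|\b(?:else if|if|switch|case|while|for each|for|goto|return)\b")
# A word followed by '=', '<=' or '<<=' and some operand (also fires on '==').
ASSIGN_RE = re.compile(r"\b\w+\s*(?:<<=|<=|=)\s*\S")


def code_features(line: str, vocabulary: Set[str]) -> Dict[str, object]:
    """Compute a few program-code features for one line."""
    low = line.strip().lower()
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", line)]
    real = sum(1 for w in words if w in vocabulary)
    return {
        "declaration_keyword": DECL_RE.search(low) is not None,
        "statement_keyword": STMT_RE.search(low) is not None,
        "equation_pattern": ASSIGN_RE.search(line) is not None,
        "has_bracket": "{" in line or "}" in line,
        "real_word_ratio": real / len(words) if words else 0.0,
    }


if __name__ == "__main__":
    vocab = {"this", "lets", "you", "run", "commands", "remotely"}
    for text in ["for (i = 0; i < n; i++) {", "#include <stdio.h>",
                 "This lets you run commands remotely."]:
        print(text, code_features(text, vocab))
```

Keywords such as “if”, “for” and “case” also occur in ordinary English, so no single feature decides the label; in the approach described in the slides it is the SVM that weighs them together.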

Implementation - Paragraph Ending Detection

Features in the paragraph ending detection model:
- Position features
- Greeting word features: greeting words like “Hi” and “Dear”. (In such cases, the line break should not be removed.)
- Ending character features
- Case features: lower case letters and whether or not the next line starts with a word in lower case letters.
- Bullet features: “1.” and “a)”. (In such cases, the line break should be retained.)
- Number of line breaks feature
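Once each line break has been classified, paragraph normalization reduces to joining lines. The sketch below assumes the classifier output is available as one boolean per line (True when the break after that line is a real paragraph ending); the function name `merge_lines` and the choice to skip empty lines are assumptions.

```python
from typing import List


def merge_lines(lines: List[str], is_paragraph_end: List[bool]) -> List[str]:
    """Join consecutive lines into paragraphs.

    is_paragraph_end[i] is True when the line break after lines[i] should be
    kept; all other line breaks are treated as extra and replaced by a space.
    """
    paragraphs: List[str] = []
    current: List[str] = []
    for line, ends in zip(lines, is_paragraph_end):
        text = line.strip()
        if text:
            current.append(text)
        if ends and current:
            paragraphs.append(" ".join(current))
            current = []
    if current:  # flush a trailing paragraph that was never closed
        paragraphs.append(" ".join(current))
    return paragraphs


if __name__ == "__main__":
    body = ["NETSVC.EXE from the NTReskit. Or use the",
            "psexec from",
            "sysinternals.com. this lets you run",
            "commands remotely - for example net stop 'service'."]
    # Hand-written labels standing in for the paragraph ending model.
    print(merge_lines(body, [False, False, False, True]))
```

With the labels shown, the four broken lines of the example email collapse into the single paragraph that the sentence and word normalization steps then work on.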

Experimental result – Data sets

- 73.2% of the emails need paragraph normalization.
- 85.4% of the emails need sentence normalization.
- 47.1% of the emails need case restoration.
- Only 7.4% of the emails contain at least one spelling error.
- Only 1.6% of the emails are absolutely clean.

Experimental result – Evaluation measures

Experimental result

- The baseline method is eClean.

Conclusion – future work

Improve the accuracy of each step.

Apply the cleaning method to other text mining applications.

Q & A
