automated content labeling using context in email
DESCRIPTION
Through a recent survey, we observe that a significantpercentage of people still share photos through email.When an user composes an email with an attachment,he/she most likely talks about the attachment in thebody of the email. In this paper, we develop a supervisedmachine learning framework to extract keywordsrelevant to the attachments from the body ofthe email. The extracted keywords can then be storedas an extended attribute to the file on the local filesystem.Both desktop indexing software and web-basedimage search portals can leverage the context-enrichedkeywords to enable a richer search experience. Ourresults on the public Enron email dataset shows thatthe proposed extraction framework provides both highprecision and recall. As a proof of concept, we havealso implemented a version of the proposed algorithmsas a Mozilla Thunderbird Add-on.1TRANSCRIPT
![Page 1: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/1.jpg)
Automated Content Labeling using Context in Email
Aravindan RaghuveerYahoo! Inc, Bangalore.
![Page 2: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/2.jpg)
Yahoo! Confidential
2
Show of hands!
at-least one email with an attachment last 2 weeks?
Image Source: http://www.crcna.org/site_uploads/uploads/osjha/mdgs/mdghands.jpg
![Page 3: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/3.jpg)
Yahoo! Confidential
3
Introduction: The “What ?”
Email attachments are a very popular mechanism to exchange content.- Both in our personal / office worlds.
The one-liner:- The email usually contains a crisp description of the
attachment.- “Can we auto generate tags for the attachment from
the email ?”
![Page 4: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/4.jpg)
Yahoo! Confidential
4
Introduction: The “Why?”
Tags can be stored as extended attributes of files. Applications like desktop search can use these
tags for building indexes. Tags generated even without parsing content:- Useful for Images- In some cases, the tags have more context than the
attachment itself.
![Page 5: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/5.jpg)
Yahoo! Confidential
5
Outline
Problem Statement Challenges Overview of solution The Dataset : Quirks and observations Feature Design Experiments and overview of results Conclusion
![Page 6: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/6.jpg)
Yahoo! Confidential
6
Problem Statement
“Given an email E that has a set of attachments A, find the set of words KEA that appear in E and are relevant to the attachments A.”
Subproblems:• Sentence Selection • Keyword Selection
![Page 7: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/7.jpg)
Yahoo! Confidential
7
Overview of Solution
Solve a binary classification problem : - Is a given sentence relevant to the attachments or not?
Two part solution:- What are the features to use?
• Most insightful and interesting aspect of this work.• Focus of this talk
- What classification algorithm to use?
![Page 8: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/8.jpg)
Yahoo! Confidential
8
Challenges : Why is the classification hard?
Only a small part of the email is relevant to the attachment. Which part?
While writing emails: users tend to rely on context that is easily understood by humans. - usage of pronouns- nick names - cross referencing information from a different
conversation in the same email thread.
![Page 9: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/9.jpg)
Yahoo! Confidential
9
The Enron Email Dataset
Made public by Federal Energy Regulatory Commission during its investigation [1].
Curated version [2] consists of 157,510 emails belonging to 150 users.
A total of 30,968 emails have at least one attachment
[1] http://www.cs.cmu.edu/~enron/[2] Thanks to Mark Dredze for providing the curated dataset.
![Page 10: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/10.jpg)
Yahoo! Confidential
10
Observation-1: User behavior for attachments
• Barring a few outliers, almost all users sent/received emails with attachments.• A rich / well-suited corpus for studying attachment behavior
![Page 11: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/11.jpg)
Yahoo! Confidential
11
• Roughly 80% of the emails have less than 8 sentences. • In another analysis: even in emails that have less than 3 sentences, not every sentence is related to the attachment!!
Observation-2: Email length
![Page 12: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/12.jpg)
Yahoo! Confidential
12
Feature Design : Levels of Granularity
Email Level Conversation Level
S1
S2
S3
S4
S5
S6
Sentence Level
![Page 13: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/13.jpg)
Yahoo! Confidential
13
Features
SentenceConversationEmail
Anchor Noisy Anaphora
Strong Phrase
Extension
Attachment Name
Weak Phrase
Length
Lexicon
Level
Feature Taxonomy:
(Short email: less than 4 sentences? )
(high dimension, sparse, bag ofwords feature )
(Level: Older in the email thread, higher the conversation level ) Noun
Verb
![Page 14: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/14.jpg)
Yahoo! Confidential
Feature Design : Sentence level
14
S1
S2
S3
S4
S5
S6
Anaphora Sentences: have linguistic relationships to anchor sentences
Noisy Sentences: Most likely negative matches.
Anchor Sentences: Most likely positive matches.
![Page 15: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/15.jpg)
Yahoo! Confidential
15
Feature Design: Sentence Level Anchor
Strong Phrase Anchor: Feature value set to 1 if sentence has any of the words:- attach - here is- Enclosed
of the 30968 emails that have an attachment, 52% of them had a strong anchor phrase.
![Page 16: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/16.jpg)
Yahoo! Confidential
16
Feature Design: Sentence Level Anchor
Behavioral Observation: Users tend to refer to an attachment by its file type
Extension Anchor: Feature value set to 1 if sentence has any of the extension keywords:- xls spreadsheet, report, excel file- jpg image, photo, picture
Example:“Please refer to the attached spreadsheet for a list of Associates and
Analysts who will be ranked in these meetings and their PRC Reps.”
![Page 17: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/17.jpg)
Yahoo! Confidential
17
Feature Design: Sentence Level Anchor
Behavioral Observation: Users tend to use file name tokens of the attachment to refer to the attachment
Attachment Name Anchor: Feature value set to 1 if sentence has any of the file name tokens.- Tokenization done on case and type transitions,
Example: - attachment name “Book Request Form East.xls”- “These are book requests for the Netco books for all
regions.”
![Page 18: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/18.jpg)
Yahoo! Confidential
Feature Design : Sentence level
18
S1
S2
S3
S4
S5
S6
Anaphora Sentences: have linguistic relationships to anchor sentences
Noisy Sentences: Most likely negative matches.
Anchor Sentences: Most likely positive matches.
![Page 19: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/19.jpg)
Yahoo! Confidential
19
Feature Design : Sentence level Noisy
Noisy sentences are usually salutations, signature sections and email headers of conversations.
Two features to capture noisy sentences- Noisy Noun- Noisy Verb
• Noisy Verb: marked true if no verbs in the sentence
Noisy Noun: marked true if more than 85% of the words in the sentence are nouns.
![Page 20: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/20.jpg)
Yahoo! Confidential
Feature Design : Sentence level
20
S1
S2
S3
S4
S5
S6
Anaphora Sentences: have linguistic relationships to anchor sentences
Noisy Sentences: Most likely negative matches.
Anchor Sentences: Most likely positive matches.
![Page 21: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/21.jpg)
Yahoo! Confidential
21
Feature Design : Sentence level Anaphora
Once anchors have been identified:- NLP technique called anaphora detection can be
employed- Detects other sentences that are linguistically
dependent on anchor sentence.- Tracks the hidden context in email.
Example:- “Thought you might me interested in the report. It gives a nice
snapshot of our activity with our major counterparties.”
![Page 22: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/22.jpg)
Yahoo! Confidential
22
Correlation Analysis
Best positive correlation:• Strong phrase anchor • Anaphora feature
Short email low correlation coefficient.
noisy verb feature good negative correlation
conversation level <=2 feature has lower negative correlation when comparedto the conversation level > 2 feature.
![Page 23: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/23.jpg)
Yahoo! Confidential
23
Experiments
Ground Truth Data- randomly sampled 1800 sent emails.- Two independent editors for producing class labels for
every sentence in the above sample.- Reconciliation:
• Discarded emails that had at least one sentence with conflicting labels.
ML Algorithms studied : Naïve Bayes, SVM, CRF.
![Page 24: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/24.jpg)
Yahoo! Confidential
24
Summary of Results: F1 measure
With all features used, F1 scores read:- CRF : 0.87- SVM: 0.79- Naïve Bayes: 0.74
CRF consistently beats the other two methods across all feature sub-sets- Sequential nature of the data works in CRF’s favor.
The Phrase Anchor provides best increase in precision The Anaphora feature provides best increase in recall.
![Page 25: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/25.jpg)
Yahoo! Confidential
25
Summary of Results: User Specific Performance
In majority of the cases, CRF outperforms
With same set of features:- the CRF can learn a more
generic model applicable to a variety of users
![Page 26: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/26.jpg)
Yahoo! Confidential
26
In the paper . . .
A specific use-case for attachment based tagging: - Images sent over email- A user study based on survey
Heuristics for sentences keywords
More detailed evaluation studies / comparison for Naïve Bayes, SVM, CRF
TagSeeker : A prototype for proposed algorithms implemented as a Thunderbird plugin.
![Page 27: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/27.jpg)
Yahoo! Confidential
27
Closing Remarks
Improvement of retrieval effectiveness due to keywords mined:- Could not be performed because the attachment content
is not available. Working on an experiment to do this on a different
dataset.
Thanks to the:- Reviewers for the great feedback!- Organizers for the effort putting together this conference!
![Page 28: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/28.jpg)
Yahoo! Confidential
28
Conclusion
Presented a technique to extract information from noisy data.
The F1 measure of the proposed methodology - In the high eighties. Good! - Generalized well across different users.
For more information on this work / Information Extraction @ Yahoo! - [email protected]
![Page 29: Automated Content Labeling using Context in Email](https://reader036.vdocument.in/reader036/viewer/2022062511/54bd4e524a7959537a8b462c/html5/thumbnails/29.jpg)
Yahoo! Confidential
29