Scalable Discovery of Hidden Emails from Large Folders
Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou
Department of Computer Science
University of British Columbia, Canada
KDD’05, August 21–24, 2005, Chicago, Illinois, USA.
Copyright 2005
Agenda
Introduction and motivation
Related work
Basic framework for reconstructing hidden emails
Optimization for large folders and long emails
The Enron case study
Conclusion
Prepared by Joyce Chen
Introduction and motivation
What is a hidden email?
A hidden email is an email that is quoted by at least one email in the folder but does not itself exist in that folder.
Deleted emails (intentionally or accidentally)
Forwarded messages
Previous discussions held before the user joined
The problem this paper attempts to solve: discovering hidden emails.
Using embedded quotations to reconstruct hidden emails in a robust and efficient way.
Related work
Derek Lam, Steven L. Rohall, … “Exploiting e-mail structure to improve summarization”
Summarizes a set of emails based on their threading hierarchy, but does not study how to regenerate deleted emails.
Vitor R. Carvalho and William W. Cohen, “Learning to extract signature and reply lines from email”
Helps this paper to identify quotations.
Giuseppe Carenini, Raymond Ng, … “Discovery and regeneration of hidden email”
A preliminary report on hidden email discovery and regeneration.
A basic framework for reconstructing hidden emails
Methodology – skeleton
Step 1. Discovery of hidden emails
1.1 Identify hidden fragments
1.2 Find overlaps among hidden fragments
Step 2. Regeneration of hidden emails
2.1 Build the precedence graph
2.2 Generate bulletized emails
A basic framework (cont.)
Example:
Subject: Midterm Details
a) I need to meet with a faculty recruit at lunch tomorrow. … …
b) Don, can you go directly to SOWK 124 ... ……
c) Warren and Qiang, can you go directly to LSK 201. ……
d) I will bring the exams with me to Sage, … …
e) Students whose last name begin with M-Q will … …
f) I will bring classlists with me. … …
Thanks.
-Ed
Subject: Re: Midterm Details
> a) I need to meet with a
> faculty recruit …
> b) Don, can you go directly
> to SOWK 124 …..
Sure.
> d) I will bring the exams with
> me …
I can help you carry it.
> f) I will bring classlists with
> me. … …
Is there a seating plan?
Don
Subject: Re: Midterm Details
> a) I need to meet with a
> faculty recruit …
I will go there as well.
> c) Warren and Qiang, can
> you go directly to LSK 201? ……
No problem.
> f) I will bring classlists with
> me. … …
Do they need to sign on the list?
- Warren
Subject: Re: Midterm Details
> a) I need to meet with a
> faculty recruit …
> b) Don, can you go
> directly to SOWK 124 ...
Don, I’ll go with you too.
> e) Students whose last
> names begin with …
> f) I will bring classlists
> with me. ……
Do we have a seating plan as last term?
Cheers,
Kevin
A basic framework (cont.)
Step 1.1: Identify hidden fragments
1. Separate quoted and new fragments.
2. Compare each quoted fragment (F) with all the new fragments in the folder.
If there is no sufficiently long overlap, F is considered a hidden fragment.
Otherwise, a sufficiently long overlap exists, and the overlapping part is not hidden.
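The two steps above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: quoted lines are assumed to start with ">", the overlap is measured as the longest contiguous run of shared words, and `min_words` is a hypothetical threshold for "sufficiently long". For simplicity the sketch marks a whole fragment as hidden or not, rather than carving out the overlapping part.

```python
def split_fragments(body):
    """Split an email body into quoted and new fragments (runs of consecutive lines)."""
    quoted, new = [], []
    current, current_is_quoted = [], None
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        is_quoted = stripped.startswith(">")  # assumption: ">" marks a quotation
        text = stripped.lstrip("> ").strip() if is_quoted else stripped
        if current and is_quoted != current_is_quoted:
            (quoted if current_is_quoted else new).append(" ".join(current))
            current = []
        current.append(text)
        current_is_quoted = is_quoted
    if current:
        (quoted if current_is_quoted else new).append(" ".join(current))
    return quoted, new


def longest_common_run(a, b):
    """Length of the longest contiguous run of words shared by lists a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def hidden_fragments(emails, min_words=3):
    """emails: dict of id -> body. Returns (email id, fragment) pairs judged hidden."""
    parts = {eid: split_fragments(body) for eid, body in emails.items()}
    all_new = [frag.split() for _, new in parts.values() for frag in new]
    hidden = []
    for eid, (quoted, _) in parts.items():
        for frag in quoted:
            words = frag.split()
            best = max((longest_common_run(words, n) for n in all_new), default=0)
            if best < min_words:  # no sufficiently long overlap anywhere in the folder
                hidden.append((eid, frag))
    return hidden
```

With a folder that holds only a reply, the quoted text has no source and is reported as hidden; adding the original email to the folder makes the same fragment non-hidden.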
[Figure: Warren's reply, with its quoted fragments (F) a, c and f separated from the new text]
A basic framework (cont.)
Step 1.2: Overlapping of hidden fragments
Subject: Re: Midterm Details
> a) … … > b) … …
Sure.
> d) … …
I can help you carry it.
> f) … …
Is there a seating plan?
Don
Subject: Re: Midterm Details
> a) … …
I will go there as well.
> c) ……
No problem.
> f) … …
Do they need to sign on the list?
- Warren
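[Figure: hidden fragment lists before and after overlap detection. Warren's reply: a, c, f → a, c, f. Don's reply: ab, d, f → a, b, d, f.]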
A basic framework (cont.)
Step 2.1: Build the precedence graph
Three emails in the current folder quote the hidden fragments in these orders:
Email 1: a, b, …, d, …, f
Email 2: a, b, …, e, f
Email 3: a, c, …, f
[Figure: the precedence graph, with nodes arranged in levels a; b, c; d, e; f]
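A minimal sketch of how such a graph could be built from the quote orders of the three emails. The edge rule is an assumption made for illustration: an edge u → v is added whenever fragment u immediately precedes fragment v in some email. The function name `build_precedence_graph` is hypothetical.

```python
from collections import defaultdict


def build_precedence_graph(quote_sequences):
    """Map each fragment to the sorted list of fragments that directly follow it."""
    graph = defaultdict(set)
    for seq in quote_sequences:
        for node in seq:
            graph[node]  # touch the key so every fragment appears as a node
        for u, v in zip(seq, seq[1:]):
            graph[u].add(v)  # u is quoted immediately before v
    return {node: sorted(succ) for node, succ in graph.items()}


# Quote orders of the three emails from the slide.
sequences = [
    ["a", "b", "d", "f"],
    ["a", "b", "e", "f"],
    ["a", "c", "f"],
]
graph = build_precedence_graph(sequences)
for node in sorted(graph):
    print(node, "->", graph[node])
```

Under this edge rule the result has the same layered shape as the slide's graph: a at the top, then b and c, then d and e, and f at the bottom.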
A basic framework (cont.)
Precedence graph: complications
The ideal case:
A chain of nodes, i.e., totally ordered hidden fragments.
Complicated cases:
Incompatible nodes, e.g., b & c, d & e
A partial order is necessary.
Complications in quoted fragments
E.g., deletion, insertion and forwarded messages
Use Longest Common Substring (LCS) to identify a match.
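One way to make "incompatible nodes" concrete is sketched below, assuming that two nodes are incompatible when neither can be reached from the other in the precedence graph. That definition, the edge set, and the function names are assumptions for illustration; the slides only give b & c and d & e as examples.

```python
from itertools import combinations


def reachable(graph, start):
    """All nodes reachable from start by following edges (iterative DFS)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def incompatible_pairs(graph):
    """Pairs of nodes with no path between them in either direction."""
    reach = {n: reachable(graph, n) for n in graph}
    return [
        (u, v)
        for u, v in combinations(sorted(graph), 2)
        if v not in reach[u] and u not in reach[v]
    ]


# Hypothetical edge set matching the layered graph a; b, c; d, e; f.
graph = {
    "a": ["b", "c"],
    "b": ["d", "e"],
    "c": ["f"],
    "d": ["f"],
    "e": ["f"],
    "f": [],
}
pairs = incompatible_pairs(graph)
```

Because such pairs exist, no single chain can order all fragments, which is why a partial order is needed.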
A basic framework (cont.)
People read documents sequentially, so a graphical representation is not acceptable.
Solution: the bulletized email model
Text devices:
bullets mark incompatible nodes;
offsets mark nested relations among bulletized fragments.
One bulletized hidden email suffices.
[Figure: the precedence graph (a; b, c; d, e; f) and the corresponding bulletized email]
a
• c
• b
> d
> e
f
Optimization for large folders and long emails
Two bottlenecks in hidden fragment identification:
Dealing with large folders
A large number of matches need to be performed between quoted fragments and other emails.
Dealing with long emails
How efficiently LCS matching is performed.
Two optimizations to overcome these bottlenecks:
Email filtering by indexing
LCS-anchoring by indexing
Optimizations – Email filtering by indexing
A quoted fragment F has to be matched against every single email M
in the primary folder MF and in the reference folders RF1, …, RFk.
The first optimization is to use a word index.
Index entry form: <ω, Lω>
ω is a word in the email corpus.
Lω is a list of ids of the emails containing at least one occurrence of ω.
For example: <available, <id = 17, id = 287>>
The index does not contain high-frequency closed-class terms (i.e., stop-words such as “the”).
Optimizations – Email filtering by indexing (cont.)
Algorithm EmailFiltering
Input: a word index, a frequent word list FW, a quoted fragment F
Output: a list of email ids possibly matching F
1. Tokenize F into a set of words w, and remove all the stop-words.
2. For each w not in the list FW, use the word index to identify Lw.
3. Return the unioned list, i.e., the union of Lw over all such w.
The length of FW is defined as the “frequent word threshold” ft.
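The three steps of EmailFiltering can be sketched as below. This is an illustration, not the authors' code: the stop-word list and tokenizer are placeholders, and choosing FW as the ft words that occur in the most emails is an assumption (the slides only say that FW has length ft).

```python
import re
from collections import defaultdict

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}


def tokenize(text):
    """Lower-case word tokens with stop-words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def build_word_index(emails):
    """Word index: word -> set of ids of emails containing the word."""
    index = defaultdict(set)
    for eid, body in emails.items():
        for w in tokenize(body):
            index[w].add(eid)
    return index


def frequent_words(index, ft):
    """Assumed definition of FW: the ft words that occur in the most emails."""
    ranked = sorted(index, key=lambda w: (-len(index[w]), w))
    return set(ranked[:ft])


def email_filtering(index, fw, fragment):
    """Return ids of emails that share at least one non-frequent word with the fragment."""
    candidates = set()
    for w in set(tokenize(fragment)):      # step 1: tokenize, drop stop-words
        if w not in fw:                    # step 2: skip frequent words
            candidates |= index.get(w, set())
    return sorted(candidates)              # step 3: the unioned list
```

Only the returned candidates need the expensive matching step. A larger ft prunes more candidates, at the risk of missing an email that shares only frequent words with the fragment.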
LCS – Anchoring by indexing
The reason for using LCS
Suppose the original email is a sequence of fragments OM = <F1, F2, …, F5>.
A user may edit the quotation in the following ways:
Delete the beginning and/or the end parts:
QF1 = <F2, F3, F4>
In a more sophisticated setting, to reduce the length:
QF2 = <F2, F4>
Furthermore, copy another fragment F6 from another email:
QF3 = <F2, F6, F4>
Substring searching is not able to handle QF2 and QF3. LCS matching can correctly handle QF1, QF2 and QF3.
The complexity of LCS is quadratic in the lengths of the fragment and the email.
LCS is not scalable for long emails and/or quotations.
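The contrast can be illustrated with a small sketch, reading LCS as the longest common substring named earlier in the slides and treating each fragment label as one token. Both choices are simplifications for illustration; real matching would run over words or characters. The nested loops show where the quadratic cost comes from.

```python
def longest_common_substring(a, b):
    """Longest contiguous run shared by sequences a and b (O(len(a) * len(b)) DP)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i, x in enumerate(a, 1):
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def contains(seq, sub):
    """Plain substring search: is sub a contiguous part of seq?"""
    k = len(sub)
    return any(seq[i:i + k] == sub for i in range(len(seq) - k + 1))


OM = ["F1", "F2", "F3", "F4", "F5"]
QF1 = ["F2", "F3", "F4"]
QF2 = ["F2", "F4"]
QF3 = ["F2", "F6", "F4"]

for name, qf in [("QF1", QF1), ("QF2", QF2), ("QF3", QF3)]:
    print(name, "substring search:", contains(OM, qf),
          "| shared run:", longest_common_substring(qf, OM))
```

Substring search succeeds only for QF1, while the LCS computation still finds shared material for the edited quotations QF2 and QF3.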
LCS – Anchoring by indexing (cont.)
The proposal is to extend the word index from the email filtering step to tackle this problem.
For each email in the list Lω, record the positions at which the word ω occurs in the corresponding email,
i.e., each entry in Lω is of the form <id, {pos1, …, posk}>.
For example, the word “available” may have the following index entry:
<available, <<id = 17, pos = {89, 3475}>, <id = 278, pos = {190, 345, 3805}>>>
Use the list {pos1, …, posk} as “anchors” to facilitate the matching between F and M.
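A minimal sketch of the extended index, assuming positions are character offsets (the slides do not say whether offsets count characters or words). The function names are hypothetical. The slides do not spell out how the anchors are consumed; one plausible use is to run the LCS computation only in windows around the anchor positions instead of over the whole email.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


def build_positional_index(emails):
    """word -> {email id: [character offsets at which the word starts]}"""
    index = defaultdict(lambda: defaultdict(list))
    for eid, body in emails.items():
        for match in WORD.finditer(body.lower()):
            index[match.group()][eid].append(match.start())
    return index


def anchors(index, fragment, email_id):
    """Sorted positions in one email where any word of the fragment occurs."""
    positions = set()
    for word in WORD.findall(fragment.lower()):
        positions.update(index.get(word, {}).get(email_id, []))
    return sorted(positions)
```

The same structure still serves the email filtering step, since the keys of each inner dictionary are exactly the email ids in Lω.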
The Enron case study
The data and the setup
The Enron email dataset
Word indexes over the dataset are used for the experiments.
The word index contains 160,203 unique words.
Focus on the inbox folders of the users.
Of the 150 users, 137 have an inbox folder.
The number of emails in those folders:
Range: 3 to 1466
The average number of emails is 327.
The median number of emails is 223.
The Enron case study (cont.)
[Figure: the number of emails that contain at least one hidden fragment]
The Enron case study (cont.)
[Figure: the percentage of emails containing at least one hidden fragment]
The Enron case study (cont.)
[Figure: histogram of the recollection rates for all users]
The recollection rate is defined as the ratio nl / (nl + ng).
Effectiveness of optimizations
Conclusion
This paper studies the problem of reconstructing hidden emails
using embedded quotations found in messages further down the thread hierarchy.
Optimizing the basic HiddenEmailFinder algorithm
Word indexing
Reduces the number of emails that need to be matched.
Reduces the amount of effort needed to find the LCS between the fragment and the email under consideration.
The Enron case study
Shows that the framework is robust in dealing with real folders.
Provides techniques that scale to large folders and long emails: EmailFiltering and LCS-Anchoring.
Further work
Includes applying natural language understanding techniques.