
Scalable Discovery of Hidden Emails from Large Folders

Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou

Department of Computer Science

University of British Columbia, Canada

KDD’05, August 21–24, 2005, Chicago, Illinois, USA.

Copyright 2005

Agenda

Introduction and motivation

Related work

Basic framework for reconstructing hidden emails

Optimization for large folders and long emails

The Enron case study

Conclusion

Prepared by Joyce Chen

Introduction and motivation

What is a hidden email?

A hidden email is an email that is quoted by at least one email in the folder but does not itself exist in that folder.

Deleted emails (intentionally or accidentally)

Forwarded messages.

Previous discussions before users join

The problem this paper attempts to solve: discovering hidden emails.

Use embedded quotations to reconstruct the hidden emails in a robust and efficient way.


Related work

Derek Lam, Steven L. Rohall, …, "Exploiting e-mail structure to improve summarization": summarizes a set of emails based on their threading hierarchy, but does not study how to regenerate deleted emails.

Carvalho and Cohen, "Learning to extract signature and reply lines from email": helps this paper to identify quotations.

Giuseppe Carenini, Raymond Ng, …, "Discovery and regeneration of hidden email": a preliminary report on hidden email discovery and regeneration.


A basic framework for reconstructing hidden emails

Methodology – skeleton

Step 1. Discovery of hidden emails

1.1 Identify hidden fragments

1.2 Find overlapping of hidden fragments

Step 2. Regeneration of hidden emails

2.1 Build the precedence graph

2.2 Generate bulletized emails


A basic framework (cont.)

Example:

Subject: Midterm Details

a) I need to meet with a faculty recruit at lunch tomorrow. … …

b) Don, can you go directly to SOWK 124 ... ……

c) Warren and Qiang, can you go directly to LSK 201. ……

d) I will bring the exams with me to Sage, … …

e) Students whose last names begin with M-Q will … …

f) I will bring classlists with me. … …

Thanks.

-Ed

Subject: Re: Midterm Details

> a) I need to meet with a
> faculty recruit …
> b) Don, can you go directly
> to SOWK 124 …..

Sure.

> d) I will bring the exams with
> me …

I can help you carry it.

> f) I will bring classlists with
> me. … …

Is there a seating plan?

Don


> a) I need to meet with a
> faculty recruit …

I will go there as well.

> c) Warren and Qiang, can
> you go directly to LSK 201? ……

No problem.

> f) I will bring classlists with
> me. … …

Do they need to sign on the list?

- Warren


> a) I need to meet with a
> faculty recruit …
> b) Don, can you go
> directly to SOWK 124 ...

Don, I’ll go with you too.

> e) Students whose last
> names begin with …
> f) I will bring classlists
> with me. ……

Do we have a seating plan as last term?

Cheers,
Kevin


A basic framework (cont.)

Step 1.1: Identify hidden fragments

1. Separate quoted and new fragments.

2. Compare each quoted fragment (F) with all other new fragments in the folder.

If there is no sufficiently long overlap, F is considered a hidden fragment.

Otherwise, a sufficiently long overlap exists, and the overlapped part is not hidden.
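Step 1.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: quoted lines are assumed to start with ">", the function names (`split_fragments`, `longest_overlap`, `hidden_fragments`) and the `min_overlap` threshold are hypothetical, and a contiguous word match from Python's `difflib` stands in for the matching procedure. A fragment is also treated as either fully hidden or not hidden, whereas the slides allow only the overlapped part to be non-hidden.

```python
from difflib import SequenceMatcher


def split_fragments(body):
    """Split an email body into (quoted, new) fragments.

    Assumption: a quoted line starts with '>'; consecutive lines of the
    same kind are joined into one fragment.
    """
    quoted, new = [], []
    current, current_is_quoted = [], None
    for line in body.splitlines():
        if not line.strip():
            continue
        is_quoted = line.lstrip().startswith(">")
        if current and is_quoted != current_is_quoted:
            (quoted if current_is_quoted else new).append(" ".join(current))
            current = []
        current_is_quoted = is_quoted
        current.append(line.lstrip("> ").strip())
    if current:
        (quoted if current_is_quoted else new).append(" ".join(current))
    return quoted, new


def longest_overlap(a, b):
    """Length in words of the longest contiguous run shared by a and b."""
    wa, wb = a.split(), b.split()
    matcher = SequenceMatcher(None, wa, wb, autojunk=False)
    return matcher.find_longest_match(0, len(wa), 0, len(wb)).size


def hidden_fragments(folder, min_overlap=3):
    """Return quoted fragments with no sufficiently long overlap elsewhere.

    folder: list of email bodies. min_overlap is a hypothetical threshold.
    """
    parts = [split_fragments(body) for body in folder]
    hidden = []
    for i, (quoted, _) in enumerate(parts):
        for fragment in quoted:
            best = max(
                (longest_overlap(fragment, other_new)
                 for j, (_, new) in enumerate(parts) if j != i
                 for other_new in new),
                default=0,
            )
            if best < min_overlap:
                hidden.append(fragment)
    return hidden
```

For instance, a reply that quotes "I will bring the exams with me" yields one hidden fragment when it is the only email in the folder, and none once the original message is added to the folder.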

[Figure: Warren's reply, with its quoted fragments (F) a, c and f separated from the new text ("I will go there as well.", "No problem.").]



Step 1.2: Overlapping of hidden fragments

[Figure: Don's reply (quoting a, b, d and f) and Warren's reply (quoting a, c and f), shown with their hidden fragments. Before overlaps are found the fragments are a, c, f (Warren) and ab, d, f (Don); afterwards Don's fragment ab is split, giving a, c, f and a, b, d, f.]



Step 2.1: Build the precedence graph

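Three emails in the current folder, with their quoted fragments in order:

Email 1: > a, > b, > d, > f

Email 2: > a, > b, > e, > f

Email 3: > a, > c, > f

The precedence graph:

[Figure: precedence graph with a at the top, b and c on the second level, d and e on the third level, and f at the bottom.]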
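As a rough illustration of Step 2.1, the sketch below builds a precedence graph from the order in which the three emails quote the fragments (a b d f, a b e f, a c f). The function names and the adjacency-set representation are assumptions made for this example. Two nodes are treated as incompatible when neither can be reached from the other, which is one plausible reading of the term.

```python
def build_precedence_graph(quote_sequences):
    """Return a dict mapping each fragment to the fragments quoted right after it."""
    graph = {}
    for seq in quote_sequences:
        for frag in seq:
            graph.setdefault(frag, set())
        # Consecutive fragments in one email give a "precedes" edge.
        for before, after in zip(seq, seq[1:]):
            graph[before].add(after)
    return graph


def incompatible_pairs(graph):
    """Pairs of nodes with no path between them in either direction."""
    def reachable(src):
        seen, stack = set(), [src]
        while stack:
            for nxt in graph[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {node: reachable(node) for node in graph}
    nodes = sorted(graph)
    return [
        (u, v)
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if v not in reach[u] and u not in reach[v]
    ]


graph = build_precedence_graph([list("abdf"), list("abef"), list("acf")])
pairs = incompatible_pairs(graph)
```

Under this definition `pairs` contains b & c and d & e, and also c & d and c & e, since c lies on a separate branch from d and e.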



Precedence graph: complications

The ideal case:

A chain of nodes, i.e., totally ordered hidden fragments.

Complicated cases:

Incompatible nodes, e.g., b & c, d & e.

A partial order is necessary.

Complications in quoted fragments, e.g., deletion, insertion and forwarded messages.

Use the Longest Common Subsequence (LCS) to identify a match.

[Figure: the precedence graph with a at the top, b and c on the second level, d and e on the third level, and f at the bottom.]


A basic framework (cont.)

People read documents sequentially, so a graphical representation isn't acceptable.

Solution: Bulletized email model

Text devices:

Bullets mark incompatible nodes.

Offsets mark nested relations among bulletized fragments.

One bulletized hidden email suffices.

[Figure: the precedence graph (a; b, c; d, e; f) and the corresponding bulletized email:]

a

• c

• b

    > d

    > e

f


Optimization for large folders and long emails

Two bottlenecks in hidden fragment identification:

Dealing with large folders: a large number of matches need to be performed between quoted fragments and other emails.

Dealing with long emails: how efficiently LCS matching is performed.

Two optimizations to overcome these bottlenecks:

Email filtering by indexing

LCS-anchoring by indexing


Optimizations – Email filtering by indexing

A quoted fragment F has to be matched against every single email M in the primary folder MF and in the reference folders RF1, …, RFk.

The first optimization is to use a word index.

Index entry form: <ω, Lω>

ω is a word in the email corpus.

Lω is a list of ids of emails containing at least one occurrence of ω.

For example: <available, <id = 17, id = 287>>

The index does not contain high-frequency closed-class terms (i.e., stop-words such as “the”).
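A word index of this form can be sketched as a dictionary from each word ω to its list Lω of email ids. The tokenizer, the tiny stop-word set and the name `build_word_index` are assumptions for illustration only. The slides do not say how the index was actually built.

```python
import re

# Illustrative stop-word list only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}


def tokenize(text):
    """Lower-case the text, split it into words and drop stop-words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_word_index(emails):
    """emails: dict of email id -> text. Returns dict of word -> sorted ids."""
    index = {}
    for email_id, text in emails.items():
        for word in set(tokenize(text)):
            index.setdefault(word, []).append(email_id)
    return {word: sorted(ids) for word, ids in index.items()}


emails = {
    17: "I am available on Friday",
    287: "Is the room available?",
    5: "See you then",
}
index = build_word_index(emails)
```

Here `index["available"]` lists the two emails that contain the word, mirroring the entry <available, <id = 17, id = 287>>, and no entry exists for the stop-word "the".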


Optimizations – Email filtering by indexing (cont.)

Algorithm EmailFiltering

Input: a word index, a frequent word list FW, a quoted fragment F

Output: a list of email ids possibly matching F

1. Tokenize F into a set of words ω, and remove all the stop-words.

2. For each ω not in the list FW, use the word index to identify Lω.

3. Return the unioned list, i.e., the union of Lω over all such words ω.

The length of FW is defined as the “frequent word threshold” ft.
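The three steps of EmailFiltering translate almost directly into code. The sketch below is a minimal reading of the algorithm, not the authors' implementation. The tokenizer and stop-word set are illustrative, and `frequent_word_list` assumes that FW holds the ft words with the longest lists Lω, which the slides do not state explicitly.

```python
import re

# Illustrative stop-word list only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}


def frequent_word_list(word_index, ft):
    """Assumed construction of FW: the ft words occurring in the most emails."""
    ranked = sorted(word_index, key=lambda w: (-len(word_index[w]), w))
    return set(ranked[:ft])


def email_filtering(word_index, frequent_words, fragment):
    """Return the ids of emails that possibly match the quoted fragment."""
    # Step 1: tokenize F and remove stop-words.
    words = set(re.findall(r"[a-z0-9']+", fragment.lower())) - STOP_WORDS
    candidates = set()
    for word in words:
        # Step 2: skip frequent words, look up L_w for the rest.
        if word in frequent_words:
            continue
        candidates.update(word_index.get(word, []))
    # Step 3: return the union of the lists.
    return sorted(candidates)


index = {
    "bring": [1, 2, 3, 4, 9],
    "exams": [1, 2, 3, 4],
    "classlists": [3, 9],
    "sage": [2],
}
fw = frequent_word_list(index, ft=2)
candidates = email_filtering(index, fw, "I will bring the exams to Sage")
```

With ft = 2 the words "bring" and "exams" are ignored, so only email 2 (which contains "sage") remains a candidate. A smaller ft keeps more words and therefore returns more candidate emails.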


LCS – Anchoring by indexing

The reason for using LCS

Suppose the original email is a sequence of fragments OM = <F1, F2, …, F5>.

A user may edit the quotation in the following ways:

Delete the beginning and/or the end parts: QF1 = <F2, F3, F4>

In a more sophisticated setting, shorten the quotation further: QF2 = <F2, F4>

Furthermore, copy another fragment F6 from another email: QF3 = <F2, F6, F4>

Substring searching is not able to handle QF2 and QF3.

LCS matching can correctly handle QF1, QF2 and QF3.

The complexity of LCS is quadratic in the lengths of the fragment and the email.

LCS is not scalable for long emails and/or quotations.
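The difference between substring search and LCS matching on QF1, QF2 and QF3 can be checked with a small sketch. It treats each fragment label as one token and uses the textbook dynamic-programming computation of the longest common subsequence. The function names are chosen for this example. The nested loop over both sequences is where the quadratic cost comes from.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of sequences x and y."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def is_substring(q, m):
    """True if q occurs as a contiguous run inside m."""
    return any(m[i:i + len(q)] == q for i in range(len(m) - len(q) + 1))


OM = ["F1", "F2", "F3", "F4", "F5"]
QF1 = ["F2", "F3", "F4"]
QF2 = ["F2", "F4"]
QF3 = ["F2", "F6", "F4"]

for name, qf in (("QF1", QF1), ("QF2", QF2), ("QF3", QF3)):
    print(name, "substring:", is_substring(qf, OM), "LCS length:", lcs_length(qf, OM))
```

Only QF1 is found by the substring test, while the LCS still recovers the shared fragments F2 and F4 from QF2 and QF3.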


LCS – Anchoring by indexing (cont.)

Propose to extend the word index from the email filtering step to tackle this problem.

For each email in the list Lω, record the positions at which the word ω occurs in the corresponding email, i.e., each entry in Lω is of the form <id, {pos1, …, posk}>.

For example, the word “available” may have the following index entry:

<available, <<id = 17, pos = {89, 3475}>, <id = 278, pos = {190, 345, 3805}>>>.

Use the list {pos1, …, posk} as “anchors” to facilitate the matching between F and M.
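The extended index can be sketched as a nested dictionary: word → email id → positions. Everything below is an assumption for illustration. Positions are counted in words (the slides do not say whether they are word or character offsets), stop-word removal is omitted for brevity, and the helper `anchors` simply collects the positions of a fragment's words in one email. How the authors actually use the anchors to speed up LCS is not detailed in the slides.

```python
import re


def build_positional_index(emails):
    """emails: dict of id -> text. Returns word -> {email id -> [positions]}."""
    index = {}
    for email_id, text in emails.items():
        words = re.findall(r"[a-z0-9']+", text.lower())
        for pos, word in enumerate(words):
            index.setdefault(word, {}).setdefault(email_id, []).append(pos)
    return index


def anchors(index, fragment, email_id):
    """Sorted positions in one email where any word of the fragment occurs."""
    positions = set()
    for word in re.findall(r"[a-z0-9']+", fragment.lower()):
        positions.update(index.get(word, {}).get(email_id, []))
    return sorted(positions)


emails = {
    17: "I am available Monday and available Friday",
    278: "Not available",
}
index = build_positional_index(emails)
anchor_positions = anchors(index, "available Friday", 17)
```

One possible use, again an assumption, is to run the LCS computation only on the part of the email around the anchor positions instead of on the whole message.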


The Enron case study

The data and the setup

The Enron email dataset

Using their word indexes for experiments.

The word index contains 160,203 unique words.

Focus on the inbox folders of the users.

Of the 150 users, 137 have an inbox folder.

The number of emails in those folders:

Range: 3 to 1466

The average number of emails is 327.

The median number of emails is 223.


The Enron case study (cont.)

[Figure: the number of emails that contain at least one hidden fragment.]

The Enron case study (cont.)

[Figure: the percentage of emails containing at least one hidden fragment.]

The Enron case study (cont.)

[Figure: a histogram of the recollection rates for all users.]

The recollection rate is defined as the ratio nl / (nl + ng).


Effectiveness of optimizations


Effectiveness of optimizations (cont.)


Conclusion

This paper studies the problem of reconstructing hidden emails using embedded quotations found in messages further down the thread hierarchy.

Optimize the basic HiddenEmailFinder algorithm with word indexing:

Reduce the number of emails that need to be matched.

Reduce the amount of effort needed to find the LCS between the fragment and the email under consideration.

The Enron case study: we show that our framework is robust in dealing with real folders.

Provide techniques that scale to large folders and long emails: EmailFiltering and LCS-Anchoring.

Future work includes applying natural language understanding techniques.
