Scalable Discovery of Hidden Emails from Large Folders

Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou. Department of Computer Science, University of British Columbia, Canada. KDD’05, August 21–24, 2005, Chicago, Illinois, USA. Copyright 2005


TRANSCRIPT

Page 1: Scalable Discovery Of Hidden Emails From Large Folders

Scalable Discovery of Hidden Emails from Large Folders

Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou

Department of Computer Science

University of British Columbia, Canada

KDD’05, August 21–24, 2005, Chicago, Illinois, USA.

Copyright 2005

Page 2: Scalable Discovery Of Hidden Emails From Large Folders

Agenda

Introduction and motivation

Related work

Basic framework for reconstructing hidden emails

Optimization for large folders and long emails

The Enron case study

Conclusion

Page 3: Scalable Discovery Of Hidden Emails From Large Folders

Prepared by Joyce Chen

Introduction and motivation

What is a hidden email?

A hidden email is an email that is quoted by at least one email in the folder but is not itself present in that folder.

Sources of hidden emails:

Deleted emails (intentionally or accidentally)

Forwarded messages

Previous discussions that took place before a user joined

The problem this paper attempts to solve: discovering hidden emails.

It uses embedded quotations to reconstruct the hidden emails in a robust and efficient way.

Page 4: Scalable Discovery Of Hidden Emails From Large Folders

Related work

Derek Lam, Steven L. Rohall, … “Exploiting e-mail structure to improve summarization”

Summarizes a set of emails based on their threading hierarchy, but does not study how to regenerate deleted emails.

Carvalho and Cohen, “Learning to extract signature and reply lines from email”

Helps this paper to identify quotations.

Giuseppe Carenini, Raymond Ng, … “Discovery and regeneration of hidden email”

A preliminary report on hidden email discovery and regeneration.

Page 5: Scalable Discovery Of Hidden Emails From Large Folders

A basic framework for reconstructing hidden emails

Methodology: skeleton

Step 1. Discovery of hidden emails

1.1 Identify hidden fragments

1.2 Find overlaps among hidden fragments

Step 2. Regeneration of hidden emails

2.1 Build the precedence graph

2.2 Generate bulletized emails

Page 6: Scalable Discovery Of Hidden Emails From Large Folders

A basic framework (cont.)

Example:

Subject: Midterm Details

a) I need to meet with a faculty recruit at lunch tomorrow. … …

b) Don, can you go directly to SOWK 124 ... ……

c) Warren and Qiang, can you go directly to LSK 201. ……

d) I will bring the exams with me to Sage, … …

e) Students whose last name begin with M-Q will … …

f) I will bring classlists with me. … …

Thanks.

-Ed

Subject: Re: Midterm Details

> a) I need to meet with a
> faculty recruit …
> b) Don, can you go directly
> to SOWK 124 …..

Sure.

> d) I will bring the exams with
> me …

I can help you carry it.

> f) I will bring classlists with
> me. … …

Is there a seating plan?

Don

Subject: Re: Midterm Details

> a) I need to meet with a
> faculty recruit …

I will go there as well.

> c) Warren and Qiang, can
> you go directly to LSK 201? ……

No problem.

> f) I will bring classlists with
> me. … …

Do they need to sign on the list?

- Warren

Subject: Re: Midterm Details

> a) I need to meet with a
> faculty recruit …
> b) Don, can you go
> directly to SOWK 124 ...

Don, I’ll go with you too.

> e) Students whose last
> names begin with …
> f) I will bring classlists
> with me. ……

Do we have a seating plan as last term?

Cheers,

Kevin

Page 7: Scalable Discovery Of Hidden Emails From Large Folders

A basic framework (cont.)

Step 1.1: Identify hidden fragments

1. Separate quoted and new fragments.

2. Compare each quoted fragment (F) with the new fragments of all other emails in the folder.

If there is no sufficiently long overlap, F is considered a hidden fragment.

Otherwise, a sufficiently long overlap exists, and the overlapped part is not hidden.
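The two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the paper's HiddenEmailFinder: the function names, the convention that quoted lines start with `>`, the use of `difflib.SequenceMatcher` to measure the longest contiguous run of shared words, and the `min_overlap` threshold of 5 words are all hypothetical choices. For brevity the sketch keeps or drops a quoted fragment as a whole, whereas the slide says only the overlapped part stops being hidden.

```python
from difflib import SequenceMatcher


def split_fragments(body):
    """Split an email body into (quoted, new) fragment lists.

    Assumption: a quoted line starts with '>'; consecutive lines of the
    same kind are joined into one fragment.
    """
    quoted, new = [], []
    current, current_is_quote = [], None
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        is_quote = stripped.startswith(">")
        text = stripped.lstrip("> ").strip() if is_quote else stripped
        if current and is_quote != current_is_quote:
            (quoted if current_is_quote else new).append(" ".join(current))
            current = []
        current.append(text)
        current_is_quote = is_quote
    if current:
        (quoted if current_is_quote else new).append(" ".join(current))
    return quoted, new


def longest_overlap(fragment, text):
    """Number of words in the longest contiguous run shared by both texts."""
    a, b = fragment.split(), text.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def hidden_fragments(folder, min_overlap=5):
    """Quoted fragments with no sufficiently long overlap in any other email."""
    split = [split_fragments(body) for body in folder]
    hidden = []
    for i, (quoted, _) in enumerate(split):
        for frag in quoted:
            others = (n for j, (_, new) in enumerate(split) if j != i for n in new)
            if all(longest_overlap(frag, n) < min_overlap for n in others):
                hidden.append(frag)
    return hidden


reply = "> I will bring the exams with me to Sage\nI can help you carry it.\nDon"
original = "I will bring the exams with me to Sage\nThanks.\n-Ed"
print(hidden_fragments([reply]))            # the quote has no source in the folder
print(hidden_fragments([reply, original]))  # the source email is present
```

With only the reply in the folder, the quoted line is reported as hidden; once the original email is added, nothing is reported.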

[Figure: Warren's reply, with its quoted fragments (F) extracted: > a, > c, > f.]

Page 8: Scalable Discovery Of Hidden Emails From Large Folders

A basic framework (cont.)

Step 1.2: Overlapping of hidden fragments

Subject: Re: Midterm Details

> a) … …
> b) … …

Sure.

> d) … …

I can help you carry it.

> f) … …

Is there a seating plan?

Don

Subject: Re: Midterm Details

> a) … …

I will go there as well.

> c) ……

No problem.

> f) … …

Do they need to sign on the list?

- Warren

[Figure: hidden fragments from the two emails. Before overlap detection: a, c, f and ab, d, f. After overlap detection: a, c, f and a, b, d, f.]

Page 9: Scalable Discovery Of Hidden Emails From Large Folders

A basic framework (cont.)

Step 2.1: Build the precedence graph

Three emails in the current folder (quoted fragments in order):

> a, > b, …, > d, …, > f

> a, > b, …, > e, > f

> a, > c, …, > f

The precedence graph (node levels as drawn on the slide):

a

b c

d e

f
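A minimal sketch of how such a graph could be built from the three emails is given below. It assumes that an edge runs from each quoted fragment to the next quoted fragment in the same email; the slide does not spell out the edge rule, so that rule and the name `build_precedence_graph` are assumptions made for illustration.

```python
from collections import defaultdict


def build_precedence_graph(quote_sequences):
    """Map each fragment to the set of fragments that directly follow it."""
    graph = defaultdict(set)
    for seq in quote_sequences:
        for frag in seq:
            graph[frag]  # touching the key registers every fragment as a node
        for before, after in zip(seq, seq[1:]):
            graph[before].add(after)
    return dict(graph)


# Quoted fragments of the three emails, in the order they are quoted.
emails = [["a", "b", "d", "f"], ["a", "b", "e", "f"], ["a", "c", "f"]]
graph = build_precedence_graph(emails)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Under that assumption, a points to b and c, b points to d and e, and c, d and e each point to f.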

Page 10: Scalable Discovery Of Hidden Emails From Large Folders

A basic framework (cont.)

Precedence graph: complications

The ideal case:

A chain of nodes, i.e., totally ordered hidden fragments.

Complicated cases:

Incompatible nodes, e.g., b & c, d & e. A partial order is necessary.

Complications in quoted fragments, e.g., deletion, insertion and forwarded messages. Use Longest Common Subsequence (LCS) to identify a match.

[Figure: the precedence graph with node levels a; b c; d e; f.]

Page 11: Scalable Discovery Of Hidden Emails From Large Folders

A basic framework (cont.)

People read documents sequentially, so a graphical representation isn’t acceptable.

Solution: the bulletized email model

Text devices:

Bullets mark incompatible nodes.

Offsets mark nested relations among bulletized fragments.

One bulletized hidden email suffices.

[Figure: the precedence graph (a; b c; d e; f) and the corresponding bulletized email:]

a

• c

• b

    > d

    > e

f

Page 12: Scalable Discovery Of Hidden Emails From Large Folders

Optimization for large folders and long emails

Two bottlenecks in hidden fragment identification:

Dealing with large folders: a large number of matches need to be performed between quoted fragments and other emails.

Dealing with long emails: how efficiently LCS matching is performed.

Two optimizations to overcome these bottlenecks:

Email filtering by indexing

LCS-Anchoring by indexing

Page 13: Scalable Discovery Of Hidden Emails From Large Folders

Optimizations: Email filtering by indexing

A quoted fragment F must be matched against every single email M in the primary folder MF and in the reference folders RF1, …, RFk.

The first optimization is to use a word index.

Index entry form: <ω, Lω>

ω is a word in the email corpus.

Lω is a list of ids of emails containing at least one occurrence of ω.

For example: <available, <id = 17, id = 287>>

The index does not contain high-frequency closed-class terms (i.e., stop-words such as “the”).
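As an illustration of the index entry form <ω, Lω>, the sketch below builds such an index with plain Python dictionaries. The tokenizer, the tiny `STOP_WORDS` set and the sample email bodies are assumptions made for the example; only the ids 17 and 287 for “available” come from the slide.

```python
import re
from collections import defaultdict

# Illustrative stop-word list only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "on", "i", "am"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_word_index(emails):
    """emails: dict of id -> body. Returns dict of word -> sorted list of ids."""
    index = defaultdict(set)
    for email_id, body in emails.items():
        for word in tokenize(body):
            if word not in STOP_WORDS:
                index[word].add(email_id)
    return {word: sorted(ids) for word, ids in index.items()}


emails = {
    17: "I am available on Monday",
    287: "Is the room available?",
    301: "See the agenda",
}
index = build_word_index(emails)
print(index["available"])  # [17, 287]
print("the" in index)      # False
```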

Page 14: Scalable Discovery Of Hidden Emails From Large Folders

Optimizations: Email filtering by indexing (cont.)

Algorithm EmailFiltering

Input: a word index, a frequent word list FW, a quoted fragment F

Output: a list of email ids possibly matching F

1. Tokenize F into a set of words ω, and remove all the stop-words.

2. For each ω not in the list FW, use the word index to identify Lω.

3. Return the unioned list, i.e., the union of Lω over all such ω.

The length of FW is defined as the “frequent word threshold” ft.
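The three steps of EmailFiltering can be sketched as below. The word index is assumed to be a dictionary from word to list of email ids. How FW is chosen is not stated on the slide; the helper `most_frequent_words`, which takes the ft words with the longest id lists, is an assumption, as are the sample index and stop-word set.

```python
import re


def most_frequent_words(word_index, ft):
    """Assumed construction of FW: the ft words that occur in the most emails."""
    ranked = sorted(word_index, key=lambda w: (-len(word_index[w]), w))
    return set(ranked[:ft])


def email_filtering(word_index, frequent_words, stop_words, fragment):
    """Return the sorted ids of emails that possibly match the quoted fragment."""
    # Step 1: tokenize F into a set of words and remove the stop-words.
    words = set(re.findall(r"[a-z0-9']+", fragment.lower())) - stop_words
    candidates = set()
    for word in words:
        # Step 2: skip frequent words, look up L_w for every other word.
        if word in frequent_words:
            continue
        candidates.update(word_index.get(word, ()))
    # Step 3: the union of all the lists looked up.
    return sorted(candidates)


word_index = {
    "available": [17, 287],
    "midterm": [17, 42, 99, 287],
    "seating": [42],
}
fw = most_frequent_words(word_index, ft=1)
result = email_filtering(
    word_index, fw, {"the", "is", "a"}, "Is the midterm seating plan available?"
)
print(fw)      # {'midterm'}
print(result)  # [17, 42, 287]
```

Email 99 contains only the frequent word “midterm”, so it is never considered as a candidate; that is the saving the filter aims for, at the cost of possibly missing matches that share only frequent words.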

Page 15: Scalable Discovery Of Hidden Emails From Large Folders

LCS-Anchoring by indexing

The reason for using LCS

Suppose the original email is a sequence of fragments OM = <F1, F2, …, F5>.

The user may edit the quotes in the following ways:

Delete the beginning and/or the end parts: QF1 = <F2, F3, F4>

In a more sophisticated setting, delete parts in the middle to reduce the length: QF2 = <F2, F4>

Furthermore, copy another fragment F6 from another email: QF3 = <F2, F6, F4>

Substring searching is not able to handle QF2 and QF3. LCS matching can correctly handle QF1, QF2 and QF3.

The complexity of LCS is quadratic: it is proportional to the product of the lengths of the fragment and the email.

LCS is not scalable for long emails and/or quotations.
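The difference between substring search and LCS matching on QF1, QF2 and QF3 can be checked with the sketch below. It reads LCS as longest common subsequence, which is what the QF2 and QF3 cases require, and treats each fragment label as one token; the function names are illustrative. The nested loops make the quadratic cost visible.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b, in O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]


def is_contiguous_run(quote, original):
    """True if quote appears in original as one unbroken run (substring search)."""
    n = len(quote)
    return any(original[i:i + n] == quote for i in range(len(original) - n + 1))


OM = ["F1", "F2", "F3", "F4", "F5"]
QF1 = ["F2", "F3", "F4"]
QF2 = ["F2", "F4"]
QF3 = ["F2", "F6", "F4"]

for name, qf in [("QF1", QF1), ("QF2", QF2), ("QF3", QF3)]:
    print(name, is_contiguous_run(qf, OM), lcs_length(qf, OM))
# QF1 True 3
# QF2 False 2
# QF3 False 2
```

Substring search succeeds only for QF1, while the LCS still recovers the fragments F2 and F4 shared with OM in QF2 and QF3.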

Page 16: Scalable Discovery Of Hidden Emails From Large Folders

LCS-Anchoring by indexing (cont.)

The proposal is to extend the word index from the email filtering step to tackle this problem.

For each email in the list Lω, record the positions at which the word ω occurs in the corresponding email, i.e., each entry in Lω is of the form <id, {pos1, …, posk}>.

For example, the word “available” may have the following index entry:

<available, <<id = 17, pos = {89, 3475}>, <id = 278, pos = {190, 345, 3805}>>>

Use the list {pos1, …, posk} as “anchors” to facilitate the matching between F and M.
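One possible way to use such positions is sketched below. The slide does not say how the anchors are applied, so everything here is an assumption: positions are counted in words, the anchors of a fragment are the positions of all its words in the email, and they are used to cut out a window of the email so that LCS matching only runs on that window. The names `build_positional_index` and `anchor_window` and the `slack` parameter are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_positional_index(emails):
    """emails: dict of id -> body. Returns word -> {id: [word positions]}."""
    index = defaultdict(dict)
    for email_id, body in emails.items():
        for pos, word in enumerate(tokenize(body)):
            index[word].setdefault(email_id, []).append(pos)
    return dict(index)


def anchor_window(index, fragment_words, email_id, slack=0):
    """Half-open word range of the email covering all anchors, or None."""
    anchors = [
        pos
        for word in set(fragment_words)
        for pos in index.get(word, {}).get(email_id, [])
    ]
    if not anchors:
        return None
    return max(0, min(anchors) - slack), max(anchors) + slack + 1


emails = {17: "Hello all the midterm is on Friday please be available at noon thanks"}
index = build_positional_index(emails)
window = anchor_window(index, tokenize("be available at noon"), 17, slack=1)
print(index["available"])  # {17: [9]}
print(window)              # (7, 13)
```

In this toy case the expensive matching would run on 6 words of the email instead of 13. A word that also occurs far from the quoted passage would widen the window, so a practical version would need a smarter choice of anchors.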

Page 17: Scalable Discovery Of Hidden Emails From Large Folders

The Enron case study

The data and the setup

The Enron email dataset, with word indexes over it used for the experiments.

The word index contains 160,203 unique words.

Focus on the inbox folders of the users. Of the 150 users, 137 have an inbox folder.

The number of emails in those folders:

Range: 3 to 1466

The average number of emails is 327. The median number of emails is 223.

Page 18: Scalable Discovery Of Hidden Emails From Large Folders

The Enron case study (cont.)

[Figure: the number of emails that contain at least one hidden fragment.]

Page 19: Scalable Discovery Of Hidden Emails From Large Folders

The Enron case study (cont.)

[Figure: the percentage of emails containing at least one hidden fragment.]

Page 20: Scalable Discovery Of Hidden Emails From Large Folders

The Enron case study (cont.)

[Figure: a histogram of the recollection rates for all users.]

The recollection rate is defined as the ratio nl / (nl + ng).

Page 21: Scalable Discovery Of Hidden Emails From Large Folders

Effectiveness of optimizations

Page 22: Scalable Discovery Of Hidden Emails From Large Folders

Effectiveness of optimizations (cont.)

Page 23: Scalable Discovery Of Hidden Emails From Large Folders

Effectiveness of optimizations (cont.)

Page 24: Scalable Discovery Of Hidden Emails From Large Folders

Conclusion

This paper studies the problem of reconstructing hidden emails using embedded quotations found in messages further down the thread hierarchy.

It optimizes the basic HiddenEmailFinder algorithm with word indexing, which:

Reduces the number of emails that need to be matched.

Reduces the amount of effort needed to find the LCS between the fragment and the email under consideration.

The Enron case study shows that the framework is robust in dealing with real folders.

It provides techniques that scale to large folders and long emails: EmailFiltering and LCS-Anchoring.

Further work includes applying natural language understanding techniques.