Modeling Identity in Archival Collections of Email: A Preliminary Study

Tamer Elsayed and Douglas W. Oard

Conference on Email and Anti-Spam (CEAS), July 28th, 2006

Department of Computer Science

College of Information Studies

Institute for Advanced Computer Studies

Real Problem

National Archives
Clinton White House
Tobacco Policy search request
32 million emails
200,000
80,000
hired 25 persons for 6 months …

Email Search

Searcher: Participant or Non-participant

Personal: my own emails, Shneiderman’s, Postel’s
Organizational: CS, UMIACS, White House, Enron
Public: TREC Enterprise, Usenet news, W3C

Meaning: Modeling Content
People: Modeling Identity

Identity

Email
Roles: Sender, Receivers, Mentioned
Relations: sent email to, sent, received, mentions, mentioned, mentioned to
Each of Sender, Receivers, and Mentioned has: Email Address, Name, Nickname

Outline

Problem
Identity Resolution Architecture
Evaluation
Conclusion

Entity Example

Email Address: [email protected]
Name: “Robert Bruce”
Nickname: “Bob”
Signature Block:
Robert E. Bruce
Senior Counsel
Enron North America Corp.
T (713) 345-7780
F (713) [email protected]

Evidence sources: Main Headers (915), Quoted Headers (8), Salutations (7), Free Signatures (9), Static Signature (140)

Enron Collection

Example of a large organizational collection
CMU version
about half a million emails
133,581 unique email addresses
~52% of emails are duplicates! (same address, subject, body)
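The duplicate criterion on this slide (same address, subject, body) can be sketched as a keyed de-duplication pass. This is a minimal illustration, not the authors' implementation: the `Email` record, its field names, the whitespace and case normalization, and the SHA-1 digest used as the key are all assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    """Hypothetical minimal record; the field names are assumptions."""
    sender: str
    subject: str
    body: str


def dedup_key(msg: Email) -> str:
    """Digest of (address, subject, body), the three fields the slide names."""
    # Normalization (lower-casing the address, collapsing whitespace) is an assumption.
    parts = (
        msg.sender.strip().lower(),
        " ".join(msg.subject.split()),
        " ".join(msg.body.split()),
    )
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()


def unique_emails(messages):
    """Keep the first message seen for each key, preserving input order."""
    seen = set()
    unique = []
    for msg in messages:
        key = dedup_key(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Keeping the first copy rather than the last is an arbitrary choice here; any copy would do for extracting associations, since duplicates carry the same fields.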

Typical Enron Email

Message Header
Message Body
Main Body: Salutation, Signature Block
Quoted Text: Quoted Header, Quoted Main Body, Quoted Signature

Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: [email protected]: [email protected]: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: '[email protected]@ENRON'

Hi Shari

Hope all is well.
Count me in for the group present.
See ya next week if not earlier

Liza

Elizabeth Sager
713-853-6349

-----Original Message-----
From: [email protected]@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]: [email protected]: Shhhh.... it's a SURPRISE !

Please call me (713) 207-5233

Thanks!

Shari
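The message structure on this slide suggests a simple way to separate the main body from quoted text: cut at the first line that looks like a quoted header. The sketch below assumes three marker shapes that appear in the example emails in this deck ("-----Original Message-----", "Forwarded by", and "On ... wrote:"). The regular expressions and the function name `split_body` are illustrative assumptions, not the heuristics actually used in the study.

```python
import re

# Marker patterns modelled on the examples in these slides; the exact
# regular expressions are assumptions, not the authors' heuristics.
QUOTE_MARKERS = [
    re.compile(r"^-+\s*Original Message\s*-+", re.IGNORECASE),
    re.compile(r"^-+\s*Forwarded by\b", re.IGNORECASE),
    re.compile(r"^On .+ wrote:\s*$"),
]


def split_body(body: str):
    """Return (main_body, quoted_text), split at the first quoted-header marker."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if any(p.match(stripped) for p in QUOTE_MARKERS):
            return "\n".join(lines[:i]).strip(), "\n".join(lines[i:]).strip()
    # No marker found: the whole message is treated as main body.
    return body.strip(), ""
```

Salutation and signature detection would then run on the first element only, and quoted-header extraction on the second.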

Identity Resolution Architecture

Processing steps:
Duplicate Detection
Extraction from Main Header
Extraction from Quoted Header
Body and Quoted Text Separation
Signature Line Detection
Salutation Line Detection
Nickname Extraction
Clustering Associations

Intermediate data: Unique emails, Main body, Quoted headers, Salutation lines, Signature lines

Associations: Address-Nickname Associations, Address-Name Associations, Address-Address Associations

Output: Entities

Extraction From Main Headers

Message-ID: <1486175.1075858665169.JavaMail.evans@thyme>
Date: Wed, 26 Sep 2001 09:25:19 -0700 (PDT)
From: [email protected]: [email protected], [email protected], [email protected], o'[email protected], [email protected]: New Email Address
X-From: Jim Mathes <[email protected]>
X-To: Vandini, Mark <[email protected]>, Urbon Steve <[email protected]>, Tony Sapienza <[email protected]>, Tom O'Rourke <[email protected]>, Tom Lyons <[email protected]>, Tom Hodgson <[email protected]>
X-cc:
X-bcc:

We have just launched our "New & Improved Website", www.newbedfordchamber.com and I have a new email address:

[email protected]

Please make the appropriate changes in your email address book.

Thank you,

Jim Mathes, President
New Bedford Area Chamber of Commerce

Highlighted: Name-Address Association (two instances), Address-Address Association
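A rough sketch of how associations like the ones highlighted on this slide might be pulled from main headers, using Python's standard `email.utils` helpers. Several things here are assumptions: the header dictionary layout, the function name, and in particular the rule that an address-address association arises when `From` and `X-From` carry different addresses (the slide does not say where that association comes from). The addresses in the usage example are placeholders, since the originals are obscured in this transcript.

```python
from email.utils import getaddresses, parseaddr


def main_header_associations(headers: dict):
    """Collect (address, name) and (address, address) pairs from one header dict.

    `headers` maps field names such as "From", "X-From", "X-To" to raw values.
    """
    addr_name = set()
    addr_addr = set()

    # Name-address pairs: any "Display Name <address>" entry in X-From / X-To.
    for field in ("X-From", "X-To"):
        for name, addr in getaddresses([headers.get(field, "")]):
            if name and "@" in addr:
                addr_name.add((addr.lower(), name.strip()))

    # Address-address pair (assumption): From and X-From carry different addresses.
    _, from_addr = parseaddr(headers.get("From", ""))
    _, xfrom_addr = parseaddr(headers.get("X-From", ""))
    if "@" in from_addr and "@" in xfrom_addr and from_addr.lower() != xfrom_addr.lower():
        addr_addr.add(tuple(sorted((from_addr.lower(), xfrom_addr.lower()))))

    return addr_name, addr_addr


# Usage with placeholder addresses.
names, addrs = main_header_associations({
    "From": "jmathes@old.example.com",
    "X-From": "Jim Mathes <jim@new.example.org>",
    "X-To": "Tom Lyons <tom@example.net>, Tom Hodgson <hodgson@example.net>",
})
```

Display names written as "Last, First" without quotes (as in "Vandini, Mark" above) contain a bare comma and would need extra handling before being passed to `getaddresses`; that step is omitted here.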

Extraction From Quoted Headers

Hi Jeff,

Did you get our registration packet? If not, stop by and pick one up because you need it. Make sure you get the one for new students.

Shawn

On Wednesday, November 03, 1999 11:18 AM, Jeff Dasovich [SMTP:[email protected]] wrote:
>
> ok, don't shoot me, but what's the deadline for scheduling for classes?
>
> signed,
> clueless

Name-Address Association

---------------------- Forwarded by Elizabeth Sager/HOU/ECT on 02/09/2000 12:02 PM ---------------------------

"Patricia Young" <[email protected]> on 02/09/2000 08:50:59 AM
To: Elizabeth Sager/HOU/ECT@ECT
cc:
Subject: If possible, would you forward your resume to me electronically? Thanks.

If possible, would you forward your resume to me electronically? Thanks.

Name-Address Association
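The two quoted-header shapes on this slide can be matched with a pair of regular expressions, one for the "On <date>, Name [SMTP:address] wrote:" form and one for the '"Name" <address> on <date>' form. This is only a sketch: the patterns and the function name are assumptions written to fit these two examples, and a real collection would need more variants.

```python
import re

# Two quoted-header shapes taken from the examples on this slide; the regular
# expressions themselves are assumptions, not the authors' patterns.
SMTP_STYLE = re.compile(
    r"^On .*?(?:AM|PM),\s*(?P<name>[^\[\]]+?)\s*\[SMTP:(?P<addr>[^\]\s]+)\]\s*wrote:",
    re.IGNORECASE,
)
NOTES_STYLE = re.compile(
    r'^"(?P<name>[^"]+)"\s*<(?P<addr>[^>\s]+)>\s+on\s+\d{2}/\d{2}/\d{4}'
)


def quoted_header_associations(quoted_text: str):
    """Return a list of (address, name) pairs found in quoted header lines."""
    pairs = []
    for line in quoted_text.splitlines():
        line = line.strip()
        for pattern in (SMTP_STYLE, NOTES_STYLE):
            m = pattern.match(line)
            if m:
                pairs.append((m.group("addr").lower(), m.group("name").strip()))
                break
    return pairs
```

Both patterns expect the quoted header on a single line, so any header that a mail client wrapped across lines would have to be re-joined first.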

Signature & Salutation Detection

From: [email protected]

The kiddies are going back to school already so now would be a good time to plan a trip to D.C. at last. Maybe early Sept?
Also I'd be game for a girls' trip to Destin.

Time to work!
Love,
-Sooz

Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002

The week is going OK. All the tennis and swimming has left me with sore muscles so this is my night off. Am planning to do some more house chores so I do not end up with another weekend like the last.

I'm still planning on coming to Austin next weekend, I'm just not sure when, but I'll let you know.

Call if you get lonely!

Love,
Sooz

Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002

Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime.

Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long.

Have a good afternoon!

love,
sooz

Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002

Nickname Extraction

From: [email protected]

Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime.

Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long.

Have a good afternoon!

love,
sooz (highlighted as the nickname)

Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002

3,151 address-nickname associations
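One way to picture nickname extraction from signature lines, based only on the examples in these slides: treat the lines that end every message from a sender as a static signature block, then take the short line just above that block as a nickname candidate. This is a simplified sketch under those assumptions; the function names, the two-word limit, and the punctuation stripping are illustrative and are not taken from the study.

```python
import re
from collections import Counter


def common_suffix_lines(bodies):
    """Lines shared at the end of every body: a stand-in for a static signature block."""
    split = [[ln.strip() for ln in b.strip().splitlines() if ln.strip()] for b in bodies]
    suffix = []
    for group in zip(*(reversed(lines) for lines in split)):
        if len(set(group)) != 1:
            break
        suffix.append(group[0])
    return list(reversed(suffix))


def nickname_candidates(bodies, max_words=2):
    """Count short lines that sit directly above the shared signature block."""
    # With a single message there is nothing to compare, so assume no static block.
    block_len = len(common_suffix_lines(bodies)) if len(bodies) > 1 else 0
    counts = Counter()
    for body in bodies:
        lines = [ln.strip() for ln in body.strip().splitlines() if ln.strip()]
        free = lines[: len(lines) - block_len]
        if not free:
            continue
        # Strip leading dashes and surrounding punctuation, then fold case.
        candidate = re.sub(r"^[\W_]+|[\W_]+$", "", free[-1]).lower()
        if candidate and len(candidate.split()) <= max_words:
            counts[candidate] += 1
    return counts
```

Applied to the three messages shown on the previous slide, the variants "-Sooz", "Sooz", and "sooz" would all fold to the same candidate, which is why the sketch normalizes case and punctuation before counting.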

Identifying Entities

Email Address: [email protected]
Name: “Robert Bruce”
Nickname: “Bob”
Signature Block:
Robert E. Bruce
Senior Counsel
Enron North America Corp.
T (713) 345-7780
F (713) [email protected]

Evidence sources: Main Headers (915), Quoted Headers (8), Salutations (7), Free Signatures (9), Static Signature (140)

Email Address: [email protected]
Name: “Robert”
Evidence sources: Quoted Headers (5), Main Headers (7)

82,084 addr-name
3,151 addr-nickname
19,708 addr-addr
66,715 entities

Outline

Problem
Identity Resolution Architecture
Evaluation
Conclusion
Future Work

Stratified Sampling

Sampled / total associations per stratum

Address-Name Associations
Main headers only: 50 / 29677 (weakest evidence), 50 / 31248 (stronger evidence)
Quoted headers only: 50 / 8042 (weakest evidence), 50 / 3828 (stronger evidence)
Both headers: 50 / 9289

Address-Nickname Associations
Salutations only: 50 / 272 (weakest evidence), 50 / 465 (stronger evidence)
Signatures only: 50 / 172 (weakest evidence), 50 / 1754 (stronger evidence)
Both: 50 / 490

Address-Address Associations
50 / 6514 (weakest evidence), 50 / 4194 (stronger evidence)
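The sampling scheme on this slide (50 associations drawn from each stratum, whatever its size) can be sketched in a few lines. The stratum labels, the fixed seed, and the function name are assumptions for illustration; how associations are assigned to the weakest or stronger evidence strata is taken as given and not modelled here.

```python
import random


def stratified_sample(strata, per_stratum=50, seed=0):
    """Draw up to `per_stratum` items uniformly at random from each stratum.

    `strata` maps a stratum label, e.g. ("address-name", "main headers only",
    "weakest"), to the list of associations that fall in it.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the draw repeatable
    sample = {}
    for label, items in strata.items():
        k = min(per_stratum, len(items))
        sample[label] = rng.sample(items, k)  # sampling without replacement
    return sample
```

Because every stratum contributes the same number of judged items, small strata such as "Signatures only" (172 associations) are sampled far more densely than large ones such as "Main headers only" (29677 associations).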

Judgment Process

[email protected] "home email"

[email protected] "alexis james-petty"

[email protected] “june deadrick”

[email protected] “robbie lewis”

[email protected] "terrie covarrubias"

[email protected] "randy"

[email protected] "phyllis"

[email protected] "tom"

Incorrect
Correct but not informative
Correct and somewhat informative
Correct and very informative

Evaluation Measures

Judged Associations
Correct
Informative
Very Informative

Accuracy

[Figure: three bar charts of percent accuracy (0 to 100), with bars for weakest, average, and stronger evidence. Address-Name Associations: Main Headers, Quoted Headers, Both, Overall. Address-Nickname Associations: Salutation, Signature, Both, Overall. Address-Address Associations: Main Headers.]

100% accuracy with multiple sources of evidence.

Address-name association was nearly perfect

80% minimum accuracy in address-nickname

96.7% entity accuracy

Informativeness

[Figure: bar charts of percent informative and percent very informative (0 to 100), with bars for weakest, average, and stronger evidence. Address-Name Associations: Main Headers, Quoted Headers, Both, Overall. Address-Nickname Associations: Salutation, Signature, Both, Overall. Address-Address Associations: Main Headers.]

Outline

Problem
Identity Resolution Architecture
Evaluation
Conclusion

Conclusion

Introduced a computational model of identity
A set of simple techniques, put together, provides a useful baseline
Assessed its potential utility in the context of one fairly complex email collection
Automatic detection of nicknames in salutations and signature lines
Most informative results come from the weakest, least accurate evidence
Accuracy and informativeness are both important

Limitations

Email address associated with single identity
Strength of evidence not exploited
Heuristics hand-tuned for Enron collection
Focus on personal attributes
No reconciliation of multiple identities for single person
No attempt to classify identities as machines or groups
Recall?

Thank You!
Questions?

Backup

Future Work

extend the model to exploit temporal features and behavioral evidence
implement machine learning techniques
perform ablation studies
characterize the coverage of our methods in more detail
replicate this work in other contexts
integrate these techniques with the ultimate applications for which computational models of identity are needed (e.g., social network analysis).

Helping in Judgments

Identity Framework

Diagram labels: Person, Group, Machine; Identity; Entity; Candidates

Modeling Identity

Attributes (stable explicit features): email addresses, names, nicknames, contact info
Associations: link attributes together; based on observations
Entities: representation of an identity; set of attributes in an undirected graph, linked by weighted associations
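The model on this slide (attributes as nodes, associations as weighted undirected links) maps naturally onto a small graph structure. The sketch below is one possible representation, not the authors' data structure: the class names are hypothetical, and using the number of supporting observations as the edge weight is an assumption suggested by the per-source counts shown in the entity example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attribute:
    kind: str   # e.g. "address", "name", or "nickname"
    value: str


@dataclass
class AssociationGraph:
    """Undirected graph: nodes are attributes, edge weights count observations."""
    weights: dict = field(default_factory=lambda: defaultdict(int))

    def observe(self, a: Attribute, b: Attribute, count: int = 1) -> None:
        # frozenset makes the edge undirected: (a, b) and (b, a) share one key.
        self.weights[frozenset((a, b))] += count

    def weight(self, a: Attribute, b: Attribute) -> int:
        return self.weights.get(frozenset((a, b)), 0)

    def neighbors(self, a: Attribute):
        return {other for edge in self.weights if a in edge for other in edge if other != a}


# Usage with a placeholder address and the counts from the entity example.
g = AssociationGraph()
addr = Attribute("address", "rb@example.com")
name = Attribute("name", "Robert Bruce")
nick = Attribute("nickname", "Bob")
g.observe(addr, name, 915)  # main headers
g.observe(name, addr, 8)    # quoted headers, same undirected edge
g.observe(addr, nick, 7)    # salutations
```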

Identifying Entities

First round limited transitive closure
Merging associations based on unique attributes
Address-address associations
No use of strength of evidence yet
66,715 entities
Covering 77,420 unique email addresses (58% of all addresses)
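A limited transitive closure of this kind can be sketched with a union-find structure. The reading assumed here is that only email addresses count as unique attributes, so entities merge only through address-address associations, while names and nicknames attach to an address's entity without triggering merges. That interpretation, the function names, and the output layout are assumptions, not details confirmed by the slides.

```python
from collections import defaultdict


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def build_entities(addr_addr, addr_name, addr_nick):
    """Group attributes into entities, merging only through email addresses."""
    uf = UnionFind()
    for a, b in addr_addr:  # the only associations that merge entities
        uf.union(a, b)
    entities = defaultdict(lambda: {"addresses": set(), "names": set(), "nicknames": set()})
    for addr in list(uf.parent):
        entities[uf.find(addr)]["addresses"].add(addr)
    for addr, name in addr_name:
        entities[uf.find(addr)]["addresses"].add(addr)
        entities[uf.find(addr)]["names"].add(name)
    for addr, nick in addr_nick:
        entities[uf.find(addr)]["addresses"].add(addr)
        entities[uf.find(addr)]["nicknames"].add(nick)
    return list(entities.values())
```

Under this reading, two addresses that merely share a common name such as "Robert" stay in separate entities unless an address-address association links them.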

Related Work

Attribute/association extraction
Name recognition and reference resolution
Applications: social network analysis, finding experts

Unjudged Associations

[Figure: bar chart of the number of unjudged associations (axis 0 to 5), for weakest and stronger evidence. Address-Name Associations: Main Headers, Quoted Headers. Address-Nickname Associations: Salutations, Signatures. Address-Address Associations: Main Headers.]

Only 19 (~3%)