Modeling Identity in Archival Collections of Email:
A Preliminary Study
Tamer Elsayed and Douglas W. Oard
Conference on Email and Anti-Spam (CEAS), July 28th, 2006
Department of Computer Science
College of Information Studies
Institute for Advanced Computer Studies
Modeling Identity in Archival Collections of Email: A Preliminary Study
Real Problem
National Archives
Clinton White House
Tobacco Policy search request
32 million emails
200,000
80,000
hired 25 persons for 6 months …
Email Search
Searcher: Participant or Non-participant
Personal: My own emails; Shneiderman's; Postel's
Organizational: CS; UMIACS; White House; Enron
Public: TREC Enterprise; Usenet news; W3C
Meaning: Modeling Content
People: Modeling Identity
Identity
[Diagram: an Email links three roles: Sender, Receivers, and Mentioned. Relation labels: sent email to, sent, received, mentions, mentioned, mentioned to. Each role carries three attributes: Email Address, Name, Nickname.]
Outline
Problem
Identity Resolution Architecture
Evaluation
Conclusion
Entity Example
Name: “Robert Bruce”
Nickname: “Bob”
Email Address
Signature Block:
  Robert E. Bruce
  Senior Counsel
  Enron North America Corp.
  T (713) 345-7780
  F (713) [email protected]
Sources: Static Signature (140), Main Headers (915), Quoted Headers (8), Salutations (7), Free Signatures (9)
Enron Collection
Example of a large organizational collection
CMU version
  about half a million emails
  133,581 unique email addresses
~52% of emails are duplicates! (same address, subject, body)
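The duplicate criterion on this slide (same address, subject, body) can be sketched as a hash over those three fields. This is a minimal illustration, not the authors' implementation: the Email tuple, the function names, and the choice to ignore case and surrounding whitespace are assumptions.

```python
import hashlib
from typing import Iterable, List, NamedTuple, Set


class Email(NamedTuple):
    sender: str   # sender address
    subject: str
    body: str


def dedup_key(msg: Email) -> str:
    """Hash the three fields named on the slide: address, subject, body."""
    # Assumption: address case and surrounding whitespace are ignored.
    parts = (msg.sender.strip().lower(), msg.subject.strip(), msg.body.strip())
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()


def unique_emails(messages: Iterable[Email]) -> List[Email]:
    """Keep the first message seen for each (address, subject, body) key."""
    seen: Set[str] = set()
    unique: List[Email] = []
    for msg in messages:
        key = dedup_key(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```

Keeping the first copy is an arbitrary choice; any copy would serve equally well for the extraction steps that follow.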
Typical Enron Email

Message Header:
Message-ID: <1494.1584620.JavaMail.evans@thyme>
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: [email protected]: [email protected]: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>
X-To: '[email protected]@ENRON'

Message Body
Salutation:
Hi Shari
Main Body:
Hope all is well.
Count me in for the group present.
See ya next week if not earlier
Signature Block:
Liza
Elizabeth Sager
713-853-6349

Quoted Text
Quoted Header:
-----Original Message-----
From: [email protected]@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]: [email protected]: Shhhh.... it's a SURPRISE !
Quoted Main Body:
Please call me (713) 207-5233
Thanks!
Quoted Signature:
Shari
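Separating the main body from quoted text, as labeled on this slide, can be sketched by scanning for the first line that looks like the start of a quoted or forwarded message. The marker patterns below are assumptions drawn from the examples in these slides; the actual heuristics used in the study are not given here.

```python
import re

# Assumed quote markers, based on the styles visible in the slide examples.
QUOTE_START = re.compile(
    r"^\s*(-+\s*Original Message\s*-+|-+\s*Forwarded by .*|On .+ wrote:)\s*$",
    re.IGNORECASE,
)


def split_body(text):
    """Return (main_body, quoted_text); quoted_text is '' if no marker is found."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if QUOTE_START.match(line):
            main = "\n".join(lines[:i]).strip()
            quoted = "\n".join(lines[i:]).strip()
            return main, quoted
    return text.strip(), ""
```

The quoted part keeps its marker line, so a later step can still read the quoted header from it.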
Identity Resolution Architecture
Processing steps:
  Duplicate Detection
  Extraction from Main Header
  Body and Quoted Text Separation
  Extraction from Quoted Header
  Signature Line Detection
  Salutation Line Detection
  Nickname Extraction
  Clustering Associations
Intermediate data:
  Unique emails
  Quoted headers
  Main body
  Salutation lines
  Signature lines
Associations:
  Address-Name Associations
  Address-Nickname Associations
  Address-Address Associations
Output: Entities
Extraction From Main Headers

Message-ID: <1486175.1075858665169.JavaMail.evans@thyme>
Date: Wed, 26 Sep 2001 09:25:19 -0700 (PDT)
From: [email protected]: [email protected], [email protected], [email protected], o'[email protected], [email protected]: New Email Address
X-From: Jim Mathes <[email protected]>
X-To: Vandini, Mark <[email protected]>, Urbon Steve <[email protected]>, Tony Sapienza <[email protected]>, Tom O'Rourke <[email protected]>, Tom Lyons <[email protected]>, Tom Hodgson <[email protected]>
X-cc:
X-bcc:

We have just launched our "New & Improved Website", www.newbedfordchamber.com and I have a new email address:

Please make the appropriate changes in your email address book.

Thank you,

Jim Mathes, President
New Bedford Area Chamber of Commerce

Highlighted on the slide: Name-Address Association (twice), Address-Address Association
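A name-address association for the sender can be sketched by pairing the address in From with the display name in X-From, as the example header suggests. This is an illustrative sketch only: the function name is hypothetical, recipient pairing (To with X-To) is left out, and the addresses used with it should be treated as made-up test data.

```python
import re
from email import message_from_string
from email.utils import parseaddr


def sender_name_address(raw_message):
    """Return (address, name) for the sender, or None if either part is missing."""
    msg = message_from_string(raw_message)
    _, address = parseaddr(msg.get("From", ""))
    x_from = msg.get("X-From", "")
    # Drop a trailing <...> part, which may hold an address or a directory path.
    name = re.sub(r"\s*<[^>]*>\s*$", "", x_from).strip().strip("\"'")
    if not address or not name:
        return None
    return address.lower(), name
```

Lowercasing the address is an assumption made so that the same mailbox written in different cases yields one association.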
Extraction From Quoted Headers

Hi Jeff,

Did you get our registration packet? If not, stop by and pick one up because you need it. Make sure you get the one for new students.

Shawn

On Wednesday, November 03, 1999 11:18 AM, Jeff Dasovich[SMTP:[email protected]] wrote:
>
>> ok, don't shoot me, but what's the deadline for scheduling for classes?
>> signed,
> clueless

Name-Address Association

---------------------- Forwarded by Elizabeth Sager/HOU/ECT on 02/09/2000 12:02 PM ---------------------------

"Patricia Young" <[email protected]> on 02/09/2000 08:50:59 AM
To: Elizabeth Sager/HOU/ECT@ECT
cc:
Subject: If possible, would you forward your resume to me electronically? Thanks.

If possible, would you forward your resume to me electronically? Thanks.

Name-Address Association
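The two quoted-header styles in these examples can be matched with regular expressions to recover name-address pairs. The patterns below are assumptions written against only these two examples; real quoted headers vary far more, and the addresses in any usage are placeholders.

```python
import re

# Loose address pattern: anything with an @ and no spaces or brackets.
ADDR = r"[^\s<>\[\]]+@[^\s<>\[\]]+"

PATTERNS = [
    # Style 1:  On <date>, Name[SMTP:address] wrote:
    re.compile(
        r"(?P<name>[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)+)\s*"
        r"\[SMTP:(?P<addr>" + ADDR + r")\]\s*wrote:"
    ),
    # Style 2:  "Name" <address> on <date>
    re.compile(r'"(?P<name>[^"]+)"\s*<(?P<addr>' + ADDR + r")>\s+on\s"),
]


def quoted_header_associations(quoted_text):
    """Return a list of (address, name) pairs found in quoted headers."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(quoted_text):
            found.append((match.group("addr").lower(), match.group("name").strip()))
    return found
```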
Signature & Salutation Detection
From: [email protected]

The kiddies are going back to school already so now would be a good time to plan a trip to D.C. at last. Maybe early Sept?
Also I'd be game for a girls' trip to Destin.
Time to work!
Love,
-Sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002

The week is going OK. All the tennis and swimming has left me with sore muscles so this is my night off. Am planning to do some more house chores so I do not end up with another weekend like the last.
I'm still planning on coming to Austin next weekend, I'm just not sure when, but I'll let you know.
Call if you get lonely!
Love,
Sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002

Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime.
Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long.
Have a good afternoon!
love,
sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002
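The three messages above share the same closing lines, which suggests one way to find signature lines: look for lines that recur near the end of several messages from one sender. The sketch below assumes that idea; it is not the hand-tuned heuristic used in the study, and the tail and min_messages thresholds are illustrative.

```python
from collections import Counter


def recurring_tail_lines(bodies, tail=6, min_messages=2):
    """Return normalized lines seen among the last `tail` non-empty lines
    of at least `min_messages` message bodies from one sender."""
    counts = Counter()
    for body in bodies:
        lines = [ln.strip().lower().lstrip("-").strip()
                 for ln in body.splitlines() if ln.strip()]
        # Count each line once per message, so one long message cannot dominate.
        counts.update(set(lines[-tail:]))
    return {line for line, n in counts.items() if n >= min_messages}
```

Lines are lowercased and leading dashes are dropped so that "-Sooz", "Sooz", and "sooz" count as the same line.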
Nickname Extraction
From: [email protected]

Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime.
Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long.
Have a good afternoon!
love,
sooz (nickname)
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002

3,151 address-nickname associations
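Once salutation and signature lines are known, a nickname can be tied to an address: a salutation nickname to the receiver, a signature nickname to the sender. The sketch below is a simplified stand-in under stated assumptions: the greeting and closing word lists are invented for illustration, and a salutation is only attributed when there is exactly one receiver.

```python
import re

# Assumed word lists; the study's actual cues are not given in the slides.
SALUTATION = re.compile(
    r"^(?:hi|hello|hey|dear)\s+([A-Za-z][A-Za-z'-]*)\s*[,:!.]?$", re.IGNORECASE
)
CLOSING = re.compile(r"^(?:love|thanks|regards|cheers|best)[,!.]*$", re.IGNORECASE)
NAME_LINE = re.compile(r"^-*\s*([A-Za-z][A-Za-z'-]*)\s*$")


def nickname_associations(sender, receivers, main_body):
    """Return (address, nickname) pairs from one message's main body."""
    lines = [ln.strip() for ln in main_body.splitlines() if ln.strip()]
    pairs = []
    if lines:
        match = SALUTATION.match(lines[0])
        # With several receivers the greeting is ambiguous, so skip it.
        if match and len(receivers) == 1:
            pairs.append((receivers[0], match.group(1)))
    # A single-word line right after a closing is taken as the sender's nickname.
    for prev, line in zip(lines, lines[1:]):
        match = NAME_LINE.match(line)
        if CLOSING.match(prev) and match:
            pairs.append((sender, match.group(1)))
    return pairs
```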
Identifying Entities
Name: “Robert Bruce”
Nickname: “Bob”
Email Address
Signature Block:
  Robert E. Bruce
  Senior Counsel
  Enron North America Corp.
  T (713) 345-7780
  F (713) [email protected]
Sources: Static Signature (140), Main Headers (915), Quoted Headers (8), Salutations (7), Free Signatures (9)
Email Address
Name: “Robert”
Sources: Quoted Headers (5), Main Headers (7)
82,084 addr-name
3,151 addr-nickname
19,708 addr-addr
66,715 entities
Outline
Problem
Identity Resolution Architecture
Evaluation (current section)
Conclusion
Future Work
Stratified Sampling
(each cell: sampled / stratum size)
Address-Name Associations
  Main headers only: 50 / 29677 (weakest evidence), 50 / 31248 (stronger evidence)
  Quoted headers only: 50 / 8042 (weakest evidence), 50 / 3828 (stronger evidence)
  Both headers: 50 / 9289
Address-Nickname Associations
  Salutations only: 50 / 272 (weakest evidence), 50 / 465 (stronger evidence)
  Signatures only: 50 / 172 (weakest evidence), 50 / 1754 (stronger evidence)
  Both: 50 / 490
Address-Address Associations
  50 / 6514 (weakest evidence), 50 / 4194 (stronger evidence)
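Drawing a fixed number of associations from each stratum, as in the table above, can be sketched with the standard library. The function name, the seed, and the stratum labels are illustrative assumptions; how the study actually drew its samples is not stated here.

```python
import random


def stratified_sample(strata, per_stratum=50, seed=0):
    """Draw up to `per_stratum` items, without replacement, from each stratum.

    `strata` maps a stratum label to a list of associations.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return {
        label: rng.sample(items, min(per_stratum, len(items)))
        for label, items in strata.items()
    }
```

Sampling the same count from every stratum means small strata (for example, 172 associations) are judged as thoroughly as large ones (for example, 31248).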
Judgment Process
Sample associations:
[email protected] "home email"
[email protected] "alexis james-petty"
[email protected] "june deadrick"
[email protected] "robbie lewis"
[email protected] "terrie covarrubias"
[email protected] "randy"
[email protected] "phyllis"
[email protected] "tom"
Judgment categories:
Incorrect
Correct but not informative
Correct and somewhat informative
Correct and very informative
Evaluation Measures
Judged Associations
Correct
Informative
Very Informative
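The three measures can be computed from the four judgment categories on the previous slide. The sketch below assumes each measure is a fraction of all judged associations and that "informative" covers both the somewhat and very informative categories; the slides do not spell out the denominators, so treat these definitions as one plausible reading.

```python
from collections import Counter

# Judgment categories from the Judgment Process slide.
INCORRECT, NOT_INFORMATIVE, SOMEWHAT_INFORMATIVE, VERY_INFORMATIVE = range(4)


def evaluation_measures(judgments):
    """Return accuracy, informative, and very_informative as fractions
    of the judged associations (assumed denominators)."""
    total = len(judgments)
    if total == 0:
        raise ValueError("no judged associations")
    counts = Counter(judgments)
    correct = total - counts[INCORRECT]
    informative = counts[SOMEWHAT_INFORMATIVE] + counts[VERY_INFORMATIVE]
    return {
        "accuracy": correct / total,
        "informative": informative / total,
        "very_informative": counts[VERY_INFORMATIVE] / total,
    }
```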
Accuracy
[Charts: Percent Accuracy (0 to 100) for weakest, average, and stronger evidence. Address-Name Associations: Main Headers, Quoted Headers, Both, Overall. Address-Nickname Associations: Salutation, Signature, Both, Overall. Address-Address Associations: Main Headers.]
100% accuracy with multiple sources of evidence.
Address-name association was nearly perfect
80% minimum accuracy in address-nickname
96.7% entity accuracy
Informativeness
[Charts: Percent Informative and Percent Very Informative (0 to 100) for weakest, average, and stronger evidence. Address-Name Associations: Main Headers, Quoted Headers, Both, Overall. Address-Nickname Associations: Salutation, Signature, Both, Overall. Address-Address Associations: Main Headers. A Percent Accuracy chart for Address-Nickname Associations (Salutation, Signature, Both, Overall) is repeated for comparison.]
Outline
Problem
Identity Resolution Architecture
Evaluation
Conclusion (current section)
Conclusion
Introduced a computational model of identity
  A set of simple techniques, put together, provides a useful baseline
  Assessed its potential utility in the context of one fairly complex email collection
Automatic detection of nicknames in salutations and signature lines
The most informative results came from the weakest evidence and were the least accurate
Accuracy and informativeness are both important
Limitations
Email address associated with single identity
Strength of evidence not exploited
Heuristics hand-tuned for Enron collection
Focus on personal attributes
No reconciliation of multiple identities for single person
No attempt to classify identities as machines or groups
Recall?
Future Work
Extend the model to exploit temporal features and behavioral evidence
Implement machine learning techniques
Perform ablation studies
Characterize the coverage of our methods in more detail
Replicate this work in other contexts
Integrate these techniques with the ultimate applications for which computational models of identity are needed (e.g., social network analysis)
Identity Framework
[Diagram: Person, Group, and Machine each have an Identity. Multiple Entity nodes appear beneath the identities, labeled Candidates.]
Modeling Identity
Attributes (stable explicit features)
  email addresses, names, nicknames, contact info
Associations
  Link attributes together
  Based on observations
Entities
  Representation of an identity
  Set of attributes in undirected graph
  Linked by weighted associations
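The model on this slide (attributes as nodes, weighted associations as undirected edges) can be sketched as a small adjacency map. The class and method names are hypothetical, and using the number of observations as the edge weight is an assumption suggested by the per-source counts on the entity example slides.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# An attribute is a (kind, value) pair, e.g. ("address", "rbruce@example.com").
Attribute = Tuple[str, str]


@dataclass
class Entity:
    """Attributes joined by weighted, undirected associations."""
    edges: Dict[Attribute, Dict[Attribute, int]] = field(
        default_factory=lambda: defaultdict(dict)
    )

    def observe(self, a: Attribute, b: Attribute, count: int = 1) -> None:
        """Record `count` observations linking a and b (assumed weight)."""
        # Store the edge in both directions because the graph is undirected.
        self.edges[a][b] = self.edges[a].get(b, 0) + count
        self.edges[b][a] = self.edges[b].get(a, 0) + count

    def attributes(self) -> Set[Attribute]:
        return set(self.edges)
```

For example, observing an address with the nickname "Bob" 7 times in salutations and 9 times in signatures would give that edge a weight of 16 under this assumption.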
Identifying Entities
First round: limited transitive closure
  Merging associations based on unique attributes
  Address-address associations
No use of strength of evidence yet
66,715 entities
  Covering 77,420 unique email addresses (58% of all addresses)
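The limited transitive closure described here can be sketched with a union-find structure over email addresses: addresses are treated as the unique attribute, address-address associations merge two addresses, and a shared name or nickname alone never merges anything. This is an interpretation of the slide, not the authors' code; the function name and the example addresses are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def cluster_associations(
    addr_name: Iterable[Pair], addr_nick: Iterable[Pair], addr_addr: Iterable[Pair]
) -> List[Dict[str, Set[str]]]:
    """Group associations into entities keyed by connected email addresses."""
    addr_addr = list(addr_addr)
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Only address-address associations merge clusters.
    for a, b in addr_addr:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    entities: Dict[str, Dict[str, Set[str]]] = {}

    def entity_for(address: str) -> Dict[str, Set[str]]:
        entity = entities.setdefault(
            find(address), {"addresses": set(), "names": set(), "nicknames": set()}
        )
        entity["addresses"].add(address)
        return entity

    for a, b in addr_addr:
        entity_for(a)
        entity_for(b)
    for address, name in addr_name:
        entity_for(address)["names"].add(name)
    for address, nick in addr_nick:
        entity_for(address)["nicknames"].add(nick)
    return list(entities.values())
```

With this reading, two addresses that both carry the name "Robert" stay in separate entities unless an address-address association links them, which matches the slide's note that merging is based on unique attributes.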
Related Work
Attribute/association extraction
Name recognition and reference resolution
Applications:
  Social network analysis
  Finding experts
Unjudged Associations
[Chart: number of Unjudged Associations (0 to 5) for weakest and stronger evidence. Address-Name Associations: Main Headers, Quoted Headers. Address-Nickname Associations: Salutations, Signatures. Address-Address Associations: Main Headers.]
Only 19 (~3%)