Enron Emails as Graph Data Corpus for Large-scale Graph Querying Experimentation
Michal Laclavík, Martin Šeleng, Marek Ciglan, Ladislav Hluchý
Motivation and Approach
Motivation
• To exploit information and knowledge included in email communication
Approach
• Social Network Extraction
• Entity extraction, such as People, Organizations, Locations, Contact data
• Forming semantic trees and graphs
• User interaction with graph data
Bratislava, 26 October 2011 GCCP 2011 2
Email Social Networks
• Email Social Networks are less explored
– Several scientific publications: Apache mailing list, Enron, …
– Commercial: Xobni (contacts and attachments)
• Benefit
– Web Social Network Sites: owned by third parties
– Email SN: owned by organization, individual or community
– Additional level of interaction and context is present in emails
• Information and Knowledge
– People, locations, contacts, products, services, attachments or links
– Interactions
– Time
– Discovering relations can bring significant benefits
– Spread of Activation – simple way to discover relations
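The spread of activation mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `decay` and `steps` parameters, the equal split of activation among neighbours, and the node names in `EXAMPLE_GRAPH` are all assumptions made for the example.

```python
from collections import defaultdict


def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Propagate activation from seed nodes over a graph.

    graph: dict mapping a node to the list of its neighbours.
    Returns a dict mapping each reached node to its accumulated activation.
    """
    activation = defaultdict(float)
    frontier = {node: 1.0 for node in seeds}
    for node, value in frontier.items():
        activation[node] += value
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, value in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            # Assumption: activation is damped and split equally among neighbours.
            share = value * decay / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for node, value in next_frontier.items():
            activation[node] += value
        frontier = next_frontier
    return dict(activation)


def related(graph, seeds, top=5, **kwargs):
    """Rank non-seed nodes by activation, highest first (ties broken by name)."""
    scores = spread_activation(graph, seeds, **kwargs)
    ranked = sorted(
        ((node, score) for node, score in scores.items() if node not in seeds),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return ranked[:top]


# Hypothetical toy graph: one person, two emails, and objects found in them.
EXAMPLE_GRAPH = {
    "person:A": ["email:1", "email:2"],
    "email:1": ["person:A", "company:X"],
    "email:2": ["person:A", "company:X", "phone:P"],
    "company:X": ["email:1", "email:2"],
    "phone:P": ["email:2"],
}
```

Calling `related(EXAMPLE_GRAPH, ["person:A"])` ranks the two emails first, then the company that appears in both of them, then the phone number that appears in only one.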
Ontea: Information Extraction Tool
• Regex patterns
• Gazetteers
• Results
– Key-value pairs
– Structured into trees, graphs
• Transformers, Configuration
• Automatic loading of extractors
• Visual Annotation Tool
• Integration with external tools
– GATE, Stemmers, Hadoop …
• Multilingual tests
– English, Slovak, Spanish, Italian
http://ontea.sf.net
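To make the combination of regex patterns and gazetteers concrete, here is a minimal sketch of pattern-based extraction into key-value pairs. It does not use Ontea's actual API or configuration: the `PATTERNS` and `GAZETTEER` tables, the `extract` function, and the sample address `mike@example.com` are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical patterns for illustration only (not Ontea's shipped configuration).
PATTERNS = {
    "contact:Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "contact:Phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

# Hypothetical gazetteer: known names listed per key.
GAZETTEER = {
    "org:Name": ["UBS Warburg Energy, LLC", "ENRON"],
}


def extract(text):
    """Return (key, value) pairs in order of appearance in the text."""
    found = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), key, match.group(0)))
    for key, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), key, name))
    found.sort()  # order by position in the text
    return [(key, value) for _, key, value in found]


SAMPLE = "Michael D. Grigsby, UBS Warburg Energy, LLC, phone 713-853-7031, mike@example.com"
pairs = extract(SAMPLE)
```

Each pair can then become a vertex attached to the sentence or paragraph it was found in, which is how the key-value results end up structured into trees and graphs.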
Business objects in Emails
• Study on 6 organizations shows:
– Objects can be identified by patterns and gazetteers
– It is possible to define a set of common objects
• Objects identified:
– Organization:
• org:Name, org:RegNo, org:TaxNo
– Person:
• person:Name, person:Function
– Contact:
• contact:Phone, contact:Email, contact:Webpage
– Address:
• address:ZIP, address:Street, address:Settlement
– Product:
• product:Name, product:Module, product:Component, product:BOID
– Document:
• doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest
– Inventory:
• inventory:ResID, inventory:ResType
– Other business object
• ID: BOID
Email Search Prototype
• Use of Social Network from email
• Includes extracted objects
• Full text of extracted objects
• Related objects discovered and ordered by spread of activation on social network graph
• Faceted search, navigation
• http://ikt.ui.sav.sk/esns/
gSemSearch: Graph based Semantic Search
Email Example
1 Vertex: Doc=>/home/misos/enron/test/6.eml
1 Vertex: Quote=>/6.eml0:1:0
2 Edge: (Doc=>/home/misos/enron/test/6.eml)=>(Quote=>/6.eml0:1:0)
1 Vertex: Paragraph=>/6.eml0:1:0
2 Edge: (Quote=>/6.eml0:1:0)=>(Paragraph=>/6.eml0:1:0)
1 Vertex: Sentence=>/6.eml0:1:0
2 Edge: (Paragraph=>/6.eml0:1:0)=>(Sentence=>/6.eml0:1:0)
1 Vertex: DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST)
2 Edge: (Sentence=>/6.eml0:1:0)=>(DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST))
1 Vertex: Email=>[email protected]
2 Edge: (Sentence=>/6.eml0:1:0)=>(Email=>[email protected])
1 Vertex: Email=>[email protected]
2 Edge: (Sentence=>/6.eml0:1:0)=>(Email=>[email protected])
1 Vertex: Person:Name=>Grigsby, Mike
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Grigsby, Mike)
1 Vertex: Company=>ENRON
2 Edge: (Sentence=>/6.eml0:1:0)=>(Company=>ENRON)
1 Vertex: Person:Name=>Badeer, Robert
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Badeer, Robert)
1 Vertex: Company=>ENRON
2 Edge: (Sentence=>/6.eml0:1:0)=>(Company=>ENRON)
1 Vertex: Person:GivenName=>Robert
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:GivenName=>Robert)
1 Vertex: Person:Name=>Badeer, Robert
2 Edge: (Sentence=>/6.eml0:1:0)=>(Person:Name=>Badeer, Robert)
1 Vertex: Paragraph=>/6.eml659:19:0
2 Edge: (Quote=>/6.eml0:1:0)=>(Paragraph=>/6.eml659:19:0)
1 Vertex: Sentence=>/6.eml659:19:0
2 Edge: (Paragraph=>/6.eml659:19:0)=>(Sentence=>/6.eml659:19:0)
1 Vertex: Person:Name=>Michael D. Grigsby
2 Edge: (Sentence=>/6.eml659:19:0)=>(Person:Name=>Michael D. Grigsby)
1 Vertex: Company=>UBS Warburg Energy, LLC
2 Edge: (Sentence=>/6.eml659:19:0)=>(Company=>UBS Warburg Energy, LLC)
1 Vertex: TelephoneNumber=>713-853-7031
2 Edge: (Sentence=>/6.eml659:19:0)=>(TelephoneNumber=>713-853-7031)
1 Vertex: TelephoneNumber=>713-408-6256
2 Edge: (Sentence=>/6.eml659:19:0)=>(TelephoneNumber=>713-408-6256)
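A dump in this form can be loaded into an in-memory graph with a small parser. The sketch below is an assumption about the format based only on the lines shown above: it treats the leading number as an opaque marker, reads `Type=>Value` for vertices and `(A)=>(B)` for edges, and the function names `parse_node` and `parse_dump` are hypothetical.

```python
from collections import defaultdict


def parse_node(text):
    """Split 'Type=>Value' into (type, value) at the first '=>'."""
    node_type, _, value = text.partition("=>")
    return node_type, value


def parse_dump(lines):
    """Return (vertices, edges) parsed from 'Vertex:' and 'Edge:' lines.

    vertices: set of (type, value) tuples, so repeated vertices collapse.
    edges: dict mapping a source vertex to the set of its target vertices.
    """
    vertices = set()
    edges = defaultdict(set)
    for line in lines:
        # Drop the leading numeric marker; its meaning is not assumed here.
        _, _, rest = line.strip().partition(" ")
        if rest.startswith("Vertex: "):
            vertices.add(parse_node(rest[len("Vertex: "):]))
        elif rest.startswith("Edge: "):
            body = rest[len("Edge: "):]
            # Strip the outer parentheses, then split at the first ')=>(' so
            # that values containing parentheses, such as '(PST)', survive.
            source, separator, target = body[1:-1].partition(")=>(")
            if separator:
                edges[parse_node(source)].add(parse_node(target))
    return vertices, dict(edges)


SAMPLE_LINES = [
    "1 Vertex: Sentence=>/6.eml0:1:0",
    "1 Vertex: DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST)",
    "2 Edge: (Sentence=>/6.eml0:1:0)=>(DateTime=>Fri, 8 Mar 2002 06:46:07 -0800 (PST))",
    "1 Vertex: Company=>ENRON",
    "1 Vertex: Company=>ENRON",
    "2 Edge: (Sentence=>/6.eml0:1:0)=>(Company=>ENRON)",
]
vertices, edges = parse_dump(SAMPLE_LINES)
```

Storing vertices in a set merges the repeated `Company=>ENRON` entries into one node, which is what lets objects shared between sentences and messages connect the graph.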
Enron Graph corpus Statistics
Description Size/Count
Corpus Size 2.5 GB
Compressed Corpus Size 217 MB
Messages 517,377
Nodes 8,269,278
Edges 20,383,709
Address 4,997
CityName 1,550
Company 52,286
DateTime 228,175
Email 162,754
MoneyAmount 28,992
Paragraph 2,631,292
Person 167,613
Quote 533,007
Sentence 3,800,504
TelephoneNumber 26,013
WebAddress 105,610
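A few ratios derived from the table give a feel for the graph's density. The snippet below only divides the counts listed above; the variable names are illustrative.

```python
# Counts copied from the statistics table above.
messages = 517_377
nodes = 8_269_278
edges = 20_383_709

edges_per_node = edges / nodes        # average number of edges per node
nodes_per_message = nodes / messages  # graph nodes produced per email message
edges_per_message = edges / messages  # graph edges produced per email message

print(f"{edges_per_node:.2f} edges per node")
print(f"{nodes_per_message:.1f} nodes per message")
print(f"{edges_per_message:.1f} edges per message")
```

The graph is sparse: there are roughly 2.5 edges per node, and each message contributes on the order of 16 nodes and 39 edges.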
Future Direction: Relations Discovery in Large Graph Data
• Motivation
– Graph/Network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
– Text can also be converted to a graph.
– Interconnecting graph data and searching for relations is crucial.
• Approach
– Forming semantic trees and graphs from text, web, communication, databases and LinkedData
– User interaction with graph data in order to achieve integration and data cleansing
– Users will do it if their effort has an immediate impact on search results
SGDB: Simple Graph Database
• Storage for graphs
• Optimized for graph traversing and spread of activation
• Faster than Neo4j for graph traversing operations
• Supports Blueprints API
• https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
• Graph Database Benchmarks
– Graph Traversal Benchmark for Graph Databases
– http://ups.savba.sk/~marek/gbench.html
– Blueprints API - possibility to test compliant Graph databases
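The kind of operation a traversal benchmark exercises can be sketched as a timed k-hop breadth-first search. This is an assumption about the workload, since the slides do not describe the benchmark's operations, and it runs over a plain Python dictionary rather than SGDB or the Blueprints API; `k_hop_neighbourhood` and `time_traversals` are hypothetical names.

```python
import time
from collections import deque


def k_hop_neighbourhood(graph, start, max_hops):
    """Return the set of nodes reachable from start within max_hops edges."""
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))
    return visited


def time_traversals(graph, starts, max_hops):
    """Run one traversal per start node; return (nodes visited, seconds)."""
    begin = time.perf_counter()
    visited_total = sum(
        len(k_hop_neighbourhood(graph, start, max_hops)) for start in starts
    )
    elapsed = time.perf_counter() - begin
    return visited_total, elapsed


# Hypothetical toy graph: a chain 0 - 1 - 2 - 3.
CHAIN = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

Running the same set of start nodes and hop limits against different stores, and comparing elapsed time per visited node, is one way to compare traversal speed across graph databases.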
Conclusion
• Email Archives
– Valuable source of knowledge
– Hidden Social Networks owned by Enterprise or Individual
– Information Extraction and Social Network Analysis can help
• Challenges
– Graph based Querying
– New data and approach for information search
– Relation search
• Applications
– Recommendation and Search in Emails
– Population of Databases (Cold start problem)
– Possibility to extend social network graph with transaction data, processed document repositories and other business data
– Business Intelligence and Knowledge Management