Data Wrangling: From Raw Data to Networks
Notes by Dr. Leto Peel
CSCI 5352: Network Analysis and Modeling
October 31, 2017
Creating networks from data
When creating networks from data we need to make a number of design decisions:
- How will we collect the data?
- What type of entity (node) to use and how to extract it?
- What type of relationship or interaction do our links represent?
- What time period?
- Directed or undirected links?
How we make these decisions depends on:
- the task we’re trying to achieve
- the model and algorithm we are using
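
As a minimal sketch of how two of these decisions (the time period and directed vs. undirected links) show up in code, the following groups timestamped interaction events into fixed-length windows and builds one edge set per window. The function name `snapshot_networks`, the `(timestamp, source, target)` event format, the 7-day window and the toy events are all assumptions made for illustration; they are not prescribed by these notes.

```python
from collections import defaultdict
from datetime import datetime


def snapshot_networks(events, window_days=7, directed=False):
    """Group (timestamp, source, target) events into per-window edge sets.

    Hypothetical helper: windows are counted in whole days from the
    earliest event, and undirected edges are stored as sorted pairs.
    """
    if not events:
        return {}
    start = min(t for t, _, _ in events)
    snapshots = defaultdict(set)
    for t, u, v in events:
        index = (t - start).days // window_days
        edge = (u, v) if directed else tuple(sorted((u, v)))
        snapshots[index].add(edge)
    return dict(snapshots)


# Toy events (made up for illustration)
events = [
    (datetime(2004, 9, 1, 9, 0), "a", "b"),
    (datetime(2004, 9, 2, 14, 30), "b", "a"),
    (datetime(2004, 9, 10, 11, 0), "a", "c"),
]

weekly_undirected = snapshot_networks(events, window_days=7)
weekly_directed = snapshot_networks(events, window_days=7, directed=True)
```

Changing `window_days` or `directed` gives a different sequence of networks from the same raw events, which is why these choices should follow from the task and the model.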
Motivating Example
Change-point detection
Aim: To understand how external events (“shocks”) are related to changes in network structure
Datasets
Two datasets
Enron
- Criminal investigation
- Single network
- Semi-structured
MIT Reality Mining
- Consenting participants
- Multiple networks
- Structured data
MIT Reality Mining dataset
- 94 participants
  - 68 MIT Media Lab (90% graduate students, 10% staff)
  - 26 incoming Sloan business school students
- Data collected between September 2004 and June 2005
- Rich dataset (phone data + survey data)
- Incentive: free use of exclusive phone
Rich Data
Phone data
- Communication events (voice, SMS)
- Phone charge status
- Phone active / on?
- Location (cell tower)
- Bluetooth devices
- App usage
Survey data
- Who are your friends?
- Have you travelled recently?
- Do you own a car?
- How long into the term did it take for your social circle to become what it is today?
- Preferred work/personal communication medium?
Noise
The dataset is very noisy.
Sources of noise / missing data:
- phone left at home
- no battery (or being charged)
- sensor error
- date discrepancies (reset)
Link reciprocity
The Bluetooth network is, in its raw form, a directed network.
Physical proximity is symmetric (if A is near B, then B is near A), so it doesn’t make sense to have a directed network of physical proximities.
- Links that only exist in one direction indicate a mismatch between reality and sensor.
- Two choices: minimal or maximal set.
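
One possible reading of the two choices, sketched below: the minimal set keeps an undirected link only when both devices detected each other, while the maximal set keeps a link when either device detected the other. This interpretation, the function name `symmetrize`, the `mode` argument and the toy sightings are assumptions for illustration rather than the notes' own definition.

```python
def symmetrize(directed_edges, mode="minimal"):
    """Turn directed sightings (u, v) into a set of undirected links.

    mode="minimal": keep a link only if both (u, v) and (v, u) were seen.
    mode="maximal": keep a link if either direction was seen.
    Hypothetical helper; self-loops are dropped.
    """
    if mode not in ("minimal", "maximal"):
        raise ValueError("mode must be 'minimal' or 'maximal'")
    directed = {(u, v) for u, v in directed_edges if u != v}
    undirected = set()
    for u, v in directed:
        reciprocated = (v, u) in directed
        if mode == "maximal" or reciprocated:
            undirected.add(tuple(sorted((u, v))))
    return undirected


# Toy sightings (made up): a and b saw each other, only a saw c
sightings = [("a", "b"), ("b", "a"), ("a", "c")]

minimal_links = symmetrize(sightings, mode="minimal")
maximal_links = symmetrize(sightings, mode="maximal")
```

Under this reading the minimal set trades missed links for fewer false ones, and the maximal set does the opposite; which is preferable depends on how the sensor tends to fail.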
Enron email dataset
- Largest supplier of natural gas to North America
- Named “America’s Most Innovative Company” by Fortune magazine from 1996 to 2001
- Misrepresentation of earnings and unethical practices
- End of 2001: One of the largest bankruptcies in American history
- Dataset publicly released during the FERC investigation
- 151 Enron employee email accounts
- ~0.5 million messages, 1.4 GB
Entity extraction
- To build a network we need to identify the nodes (i.e. the email addresses)
- >15,000 unique email addresses
- Only 151 employees part of the investigation
- Spam?
- Scalability issue?
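
A minimal sketch of the node-extraction step, using only Python's standard `email` package. It reads the From/To/Cc/Bcc headers of raw messages, lowercases the addresses, and counts sender-to-recipient pairs as directed links. The helper names and the sample message are made up for illustration; this is not the Enron parser listed under Resources, and it does nothing about spam, mailing lists or multiple addresses per person.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def extract_addresses(raw_message):
    """Return (senders, recipients) as lists of lowercased addresses."""
    msg = message_from_string(raw_message)
    senders = [
        addr.lower()
        for _, addr in getaddresses(msg.get_all("From", []))
        if addr
    ]
    recipients = []
    for header in ("To", "Cc", "Bcc"):
        recipients += [
            addr.lower()
            for _, addr in getaddresses(msg.get_all(header, []))
            if addr
        ]
    return senders, recipients


def build_edges(raw_messages):
    """Count directed (sender, recipient) pairs over all messages."""
    edges = Counter()
    for raw in raw_messages:
        senders, recipients = extract_addresses(raw)
        for s in senders:
            for r in recipients:
                edges[(s, r)] += 1
    return edges


# Hypothetical sample message (addresses are invented)
raw = (
    "From: Alice.Smith@enron.com\n"
    "To: bob.jones@enron.com, carol@example.com\n"
    "Subject: test\n"
    "\n"
    "Body text.\n"
)

edges = build_edges([raw])
nodes = {address for edge in edges for address in edge}
```

Restricting `nodes` to a chosen set of addresses afterwards is one way to limit the network to the employees of interest, but deciding which addresses belong in that set is the hard part.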
Identifying the right entities
How to identify key employee email addresses?
- Employees use custom folders, so we can’t rely on the “Sent mail” folder to find the sender address
- Checking the “Inbox” has a similar issue, and it also includes mailing lists
- The key addresses don’t necessarily match the most frequently occurring email addresses
Metadata
- Often we use metadata (non-network data) as part of network analysis
- e.g. comparing large-scale structure to node-level information
- For change-point detection we are interested in how changes relate to external events
Resources
Further reading:
Datasets:
- Enron emails: https://www.cs.cmu.edu/~./enron/
- MIT Reality Mining: http://realitycommons.media.mit.edu/realitymining.html
Code:
- Enron parser: http://tinyurl.com/letopeel/datasets.html
Software:
- Python
- Matlab
- Octave