data wrangling: from raw data to...

36
Data Wrangling: From Raw Data to Networks Notes by Dr. Leto Peel CSCI 5352: Network Analysis and Modeling October 31, 2017

Upload: dinhkien

Post on 17-Mar-2018

223 views

Category:

Documents


1 download

TRANSCRIPT

Data Wrangling: From Raw Data to Networks

Notes by Dr. Leto PeelCSCI 5352: Network Analysis and Modeling

October 31, 2017

Creating networks from data

When creating networks from data we need to make a number ofdesign descisions

I How will we collect the data?I What type of entity (node) to use and how to extract it?I What type of relationship or interaction do our links represent?I What time period?I Directed or undirected links?

How we make these decisions depends on:I the task we’re trying to achieveI the model and algorithm we are using

Creating networks from data

When creating networks from data we need to make a number ofdesign descisions

I How will we collect the data?I What type of entity (node) to use and how to extract it?I What type of relationship or interaction do our links represent?I What time period?I Directed or undirected links?

How we make these decisions depends on:I the task we’re trying to achieveI the model and algorithm we are using

Motivating Example

Change-point detection

Aim: To understand how external events “shocks” are related tochanges in network structure

Detecting change points

Real Interactions Sensors Data

Network Snapshots

Datasets

Two datasets

EnronI Criminal investigationI Single networkI Semi-structured

MIT Reality MiningI Consenting participantsI Multiple networksI Structured data

MIT Reality Mining dataset

I 94 participantsI 68 MIT Media Lab (90%

graduate students, 10% staff)I 26 incoming Sloan business

school students

I September 2004 and June 2005I Rich dataset (phone data +

survey data)I Incentive: free use of exclusive

phone

Rich Data

Phone dataI Communication events

(voice, sms)I Phone charge statusI Phone active / on?I Location (cell tower)I Bluetooth devicesI App usage

Survey dataI Who are your friends?I Have you travelled recently?I Do you own a car?I How long into the term did

it take for your social circleto become what it is today?

I Preferred work/personalcommunication medium?

Which network?

Friendship network

Voice call network

SMS network

Physical proximity network

Which network?

Friendship network

Voice call network

SMS network

Physical proximity network

Which network?

Friendship network

Voice call network

SMS network

Physical proximity network

Which network?

Friendship network

Voice call network

SMS network

Physical proximity network

Noise

The dataset is very noisy.

Sources of noise / missing data:

I phone left at homeI no battery (or being

charged)I sensor errorI date discrepencies (reset)

Noise

The dataset is very noisy.

Sources of noise / missing data:I phone left at homeI no battery (or being

charged)I sensor errorI date discrepencies (reset)

Link reciprocity

The bluetooth network is, in its raw form, a directed network.

It doesn’t make sense to have directed network of physicalproximities.

Link reciprocity

The bluetooth network is, in its raw form, a directed network.

It doesn’t make sense to have directed network of physicalproximities.

Link reciprocity

I Links that only exist in one direction indicate a mismatchbetween reality and sensor.

I Two choices: minimal or maximal set.

Temporal resolution

I Bluetooth scans every 2.5 minutesI What temporal resolution should we use?

Temporal resolution

Enron email dataset

I Largest supplier of natural gas to NorthAmerica

I “America’s Most Innovative Company” by themagazine Fortune from 1996 to 2001

I Misrepresentation of earnings and unethicalpractises

I End of 2001: One of the largest bankruptciesin American history

Enron email dataset

I Largest supplier of natural gas to NorthAmerica

I “America’s Most Innovative Company” by themagazine Fortune from 1996 to 2001

I Misrepresentation of earnings and unethicalpractises

I End of 2001: One of the largest bankruptciesin American history

Enron email dataset

I Dataset publically released during the FERCinvestigation

I 151 Enron employee email accountsI ∼ 0.5 million messages, 1.4GB

Email data

Email data

Custom email subfolders (name and depth)!

Email data

Email data

Entity extraction

I To build a network we need to identify the nodes (i.e. theemail addresses)

I >15,000 unique email addressesI Only 151 employees part of the investigationI Spam?I Scalability issue?

Entity extraction

I To build a network we need to identify the nodes (i.e. theemail addresses)

I >15,000 unique email addressesI Only 151 employees part of the investigationI Spam?I Scalability issue?

Identifying the right entities

How to identify key employee email addresses?

I Custom folders so we can’t check “Sent mail” folder for senderaddress

I Similar issue with checking “Inbox” + this includes mailing listsI Doesn’t match the most frequently occuring emails

Identifying the right entities

How to identify key employee email addresses?I Custom folders so we can’t check “Sent mail” folder for sender

address

I Similar issue with checking “Inbox” + this includes mailing listsI Doesn’t match the most frequently occuring emails

Identifying the right entities

How to identify key employee email addresses?I Custom folders so we can’t check “Sent mail” folder for sender

addressI Similar issue with checking “Inbox” + this includes mailing lists

I Doesn’t match the most frequently occuring emails

Identifying the right entities

How to identify key employee email addresses?I Custom folders so we can’t check “Sent mail” folder for sender

addressI Similar issue with checking “Inbox” + this includes mailing listsI Doesn’t match the most frequently occuring emails

Metadata

I Often we use metadata (non-network data) as part of networkanalysis

I e.g. Comparing large-scale structure to node level information

I For change-point detection we are interested in how changesrelate to external events

Events

Source: time.com

Resouces

Further reading:

Datasets:I Enron emails:

https://www.cs.cmu.edu/~./enron/I MIT Reality Mining:

http://realitycommons.media.mit.edu/realitymining.html

Code:I Enron parser: http:

//tinyurl.com/letopeel/datasets.html

Software:I PythonI MatlabI Octave