Dynamics in Graph Analysis (PyData Carolinas 2016)

Dynamics in Graph Analysis: Adding Time as Structure for Visual and Statistical Insight

Benjamin Bengfort (@bbengfort), District Data Labs


Post on 07-Jan-2017



Are graphs effective for analytics? Or why use graphs at all?

Algorithm Performance

More understandable implementations and native parallelism provide benefits, particularly to machine learning.

Visual Analytics

Humans can understand and interpret interconnection structures, leading to immediate insights.

“Graph technologies ease the modeling of your domain and improve the simplicity and speed of your queries.”

Marko A. Rodriguez, http://bit.ly/2cthd2L

Construction

Given a set of [paths, vertices], is a [constraint] graph construction possible?

Existence

Does there exist a [path, vertex, set] within [constraints]?

Optimization

Given several [paths, subgraphs, vertices, sets], is one the best?

Enumeration

How many [vertices, edges] exist with [constraints], and is it possible to list them?
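
As a rough illustration of the existence, optimization, and enumeration questions above, here is a minimal sketch on a toy graph in plain Python rather than graph-tool. The adjacency dict and the helper names (`shortest_path`, `path_exists`, `vertices_with_min_degree`) are hypothetical and exist only for this example.

```python
from collections import deque

# Hypothetical toy graph as an undirected adjacency dict.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
    "f": set(),  # isolated vertex
}


def shortest_path(graph, source, target):
    """Optimization: breadth-first search returns one shortest path, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def path_exists(graph, source, target):
    """Existence: is there any path between the two vertices?"""
    return shortest_path(graph, source, target) is not None


def vertices_with_min_degree(graph, degree):
    """Enumeration: list the vertices that satisfy a degree constraint."""
    return sorted(v for v, nbrs in graph.items() if len(nbrs) >= degree)
```

Breadth-first search is used here because every edge has the same cost; a weighted graph would need a different search.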

Traversals

Property Graphs

How do you model time?

Relational Database

Time Properties

Time Modifies Traversal

Example of Time Filtered Traversal: Data Model

Name: Emails Sent Network

Number of nodes: 6,174

Number of edges: 343,702

Average degree: 111.339

def sent_range(g, before=None, after=None):
    # Create an edge filtering function based on a date range.
    def inner(edge):
        sent = g.ep.sent[edge]
        # Apply both bounds when both are given; keep the edge otherwise.
        if before is not None and not sent < before:
            return False
        if after is not None and not sent > after:
            return False
        return True
    return inner

def degree_filter(degree=0):
    # Create a vertex filtering function based on minimum out-degree.
    def inner(vertex):
        return vertex.out_degree() > degree
    return inner

Example of Time Filtered Traversal

# Assumes: import graph_tool.all as gt
# Assumes dateparse is a date parser such as dateutil.parser.parse
# (the slide does not show the import).
print("{} vertices and {} edges".format(
    g.num_vertices(), g.num_edges()
))
# 6174 vertices and 343702 edges

aug = sent_range(g,
    after=dateparse("Aug 1, 2016 09:00:00 EST")
)
view = gt.GraphView(g, efilt=aug)
view = gt.GraphView(view, vfilt=degree_filter())

print("{} vertices and {} edges".format(
    view.num_vertices(), view.num_edges()
))
# 853 vertices and 24813 edges
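
The same filtering idea can be sketched without graph-tool, which may help when reading the GraphView calls above. This is an illustration only: the `EMAILS` edge list and the helpers `filter_sent_range` and `active_senders` are hypothetical, and timestamps are plain `datetime` objects.

```python
from datetime import datetime

# Hypothetical (sender, recipient, sent) edges.
EMAILS = [
    ("ann", "bob", datetime(2016, 7, 30, 12, 0)),
    ("bob", "cat", datetime(2016, 8, 2, 9, 30)),
    ("ann", "cat", datetime(2016, 8, 5, 17, 45)),
    ("cat", "dan", datetime(2016, 9, 1, 8, 0)),
]


def filter_sent_range(edges, before=None, after=None):
    """Keep edges whose timestamp lies strictly inside the given bounds."""
    kept = []
    for sender, recipient, sent in edges:
        if before is not None and not sent < before:
            continue
        if after is not None and not sent > after:
            continue
        kept.append((sender, recipient, sent))
    return kept


def active_senders(edges):
    """Vertices with out-degree greater than zero in the filtered edge set."""
    return sorted({sender for sender, _, _ in edges})


# Edges sent during August 2016, after 09:00 on the first of the month.
august = filter_sent_range(
    EMAILS,
    after=datetime(2016, 8, 1, 9, 0),
    before=datetime(2016, 9, 1, 0, 0),
)
```

Filtering edges first and vertices second mirrors the two stacked views: a vertex only counts as active if it still has outgoing edges inside the time window.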

Example of Time Filtered Traversal

What makes a graph dynamic?

Time Structures

Perform static analysis on dynamic components with time as a structure.

Dynamic Graphs

Multiple subgraphs representing the graph state at a discrete timestep.
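
One way to picture a dynamic graph as a sequence of subgraphs is to bucket timestamped edges by a discrete timestep. The sketch below uses the ISO week as the timestep and plain Python lists as subgraphs; the `EDGES` data and the `weekly_snapshots` helper are hypothetical, not taken from the talk.

```python
from collections import defaultdict
from datetime import date

# Hypothetical timestamped edges: (source, target, day).
EDGES = [
    ("doc1", "graph analysis", date(2016, 8, 1)),
    ("doc2", "graph analysis", date(2016, 8, 3)),
    ("doc2", "machine learning", date(2016, 8, 3)),
    ("doc3", "machine learning", date(2016, 8, 9)),
    ("doc4", "time series", date(2016, 8, 17)),
]


def weekly_snapshots(edges):
    """Group edges into one subgraph (edge list) per ISO (year, week)."""
    snapshots = defaultdict(list)
    for source, target, day in edges:
        year, week, _ = day.isocalendar()
        snapshots[(year, week)].append((source, target))
    # Return the snapshots in time order.
    return sorted(snapshots.items())


for (year, week), subgraph in weekly_snapshots(EDGES):
    print("{} week {}: {} edges".format(year, week, len(subgraph)))
```

Keying on (year, week) rather than the week number alone keeps snapshots from different years apart.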

Keyphrases over Time

Natural Language Graph Analysis: Data Ingestion

Natural Language Graph Analysis: Data Modeling

Name: Baleen Keyphrase Graph

Number of nodes: 2,682,624

Number of edges: 46,958,599

Average degree: 35.0095

Name: Sampled Keyphrase Graph

Number of nodes: 139,227

Number of edges: 257,316

Average degree: 3.6964

def degree_filter(degree=0):
    def inner(vertex):
        return vertex.out_degree() > degree
    return inner

g = gt.GraphView(g, vfilt=degree_filter(3))

Name: High Degree Phrase Graph

Number of nodes: 8,520

Number of edges: 112,320

Average degree: 26.366
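
The reported averages are consistent with the average degree being computed as 2E/N, where each edge counts toward the degree of both endpoints. A quick check using only the node and edge counts from the slides (the `average_degree` helper is a hypothetical name for this sketch):

```python
def average_degree(num_nodes, num_edges):
    """Average degree when every edge is counted at both endpoints."""
    return 2 * num_edges / num_nodes


# (nodes, edges) as reported on the slides.
REPORTED = {
    "Emails Sent Network": (6174, 343702),
    "Baleen Keyphrase Graph": (2682624, 46958599),
    "Sampled Keyphrase Graph": (139227, 257316),
    "High Degree Phrase Graph": (8520, 112320),
}

for name, (nodes, edges) in REPORTED.items():
    print("{}: {:.4f}".format(name, average_degree(nodes, edges)))
```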

Natural Language Graph Analysis: Data Wrangling

Basic Keyphrase Graph Information

Vertex Type Analysis

Primarily keyphrases and documents.

Degree Distribution

The degree distribution follows a power law.
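
A degree distribution is simply a count of how many vertices have each degree. The sketch below computes one for a hypothetical toy edge list in plain Python. The many-leaves, few-hubs shape it produces is the kind of skew a power law describes, though a graph this small cannot demonstrate a power law by itself.

```python
from collections import Counter


def degree_histogram(edges):
    """Map each degree value to the number of vertices with that degree."""
    degrees = Counter()
    for source, target in edges:
        degrees[source] += 1
        degrees[target] += 1
    return Counter(degrees.values())


# Hypothetical toy graph: one hub connected to many leaves, plus a short chain.
EDGES = [("hub", "leaf%d" % i) for i in range(8)]
EDGES += [("leaf0", "x"), ("x", "y")]

histogram = degree_histogram(EDGES)
for degree, count in sorted(histogram.items()):
    print("degree {}: {} vertices".format(degree, count))
```

On real data, plotting this histogram on log-log axes is a common first check: a power law appears as a roughly straight line.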

Natural Language Graph Analysis: Data Wrangling

# Assumes: import random and import graph_tool.all as gt
def ego_filter(g, ego, hops=2):
    # Compute distances from the ego once, rather than once per vertex.
    dist = gt.shortest_distance(g, source=ego)

    def inner(v):
        return dist[v] <= hops
    return inner

# Get a random document
v = random.choice([
    v for v in g.vertices()
    if g.vp.type[v] == 'document'
])

ego = gt.GraphView(
    g, vfilt=ego_filter(g, v, 1)
)
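
For readers without graph-tool at hand, the ego network idea can be sketched as a hop-limited breadth-first search in plain Python. The `GRAPH` adjacency dict and the `ego_network` helper are hypothetical and only illustrate the "within N hops of the ego" rule used by the filter above.

```python
from collections import deque


def ego_network(graph, ego, hops=2):
    """Return {vertex: distance} for vertices within `hops` edges of `ego`."""
    distance = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # do not expand past the hop limit
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical document/phrase adjacency dict.
GRAPH = {
    "doc1": ["graphs", "time"],
    "doc2": ["graphs"],
    "doc3": ["time", "layout"],
    "graphs": ["doc1", "doc2"],
    "time": ["doc1", "doc3"],
    "layout": ["doc3"],
}

one_hop = ego_network(GRAPH, "doc1", hops=1)
two_hops = ego_network(GRAPH, "doc1", hops=2)
```

With one hop the ego document sees only its own phrases; with two hops it also reaches the other documents that share those phrases.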

The Centrality of Time

Extract Week of the Year as Time Structure

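# Construct Time Structures to Keyphrase
h = gt.Graph(directed=False)
h.gp.name = h.new_graph_property('string')
h.gp.name = "Phrases by Week"

# Add vertex properties
h.vp.label = h.new_vertex_property('string')
h.vp.vtype = h.new_vertex_property('string')

# vidmap is referenced but not defined on the slide; a plain dict is
# assumed here, mapping weeks and phrases of g to their vertices in h
# so that each week and each phrase is added only once.
vidmap = {}

# Create graph from the keyphrase graph
for vertex in g.vertices():
    if g.vp.type[vertex] == 'document':
        dt = g.vp.pubdate[vertex]
        weekno = dt.isocalendar()[1]
        if ('week', weekno) not in vidmap:
            week = h.add_vertex()
            h.vp.label[week] = "Week %d" % weekno
            h.vp.vtype[week] = 'week'
            vidmap[('week', weekno)] = week
        week = vidmap[('week', weekno)]

        for neighbor in vertex.out_neighbours():
            if g.vp.type[neighbor] == 'phrase':
                key = ('phrase', int(neighbor))
                if key not in vidmap:
                    phrase = h.add_vertex()
                    h.vp.vtype[phrase] = 'phrase'
                    vidmap[key] = phrase
                h.add_edge(week, vidmap[key])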

PageRank Centrality

A variant of eigenvector centrality that has a scaling factor and prioritizes incoming links.

Eigenvector Centrality

A measure of relative influence where closeness to important nodes matters as much as other metrics.

Degree Centrality

A vertex is more important the more connections it has, e.g. a “celebrity”.

Betweenness Centrality

How many shortest paths pass through the given vertex, e.g. how often does information flow through it?
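
To make two of these definitions concrete, here is a minimal sketch of degree centrality and a textbook power-iteration PageRank in plain Python. It is not graph-tool's implementation; the `GRAPH` adjacency dict, the function names, and the damping value of 0.85 are assumptions chosen for illustration.

```python
def degree_centrality(graph):
    """Fraction of the other vertices each vertex is connected to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}


def pagerank(graph, damping=0.85, iterations=100):
    """Power iteration on an adjacency dict of outgoing links."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Rank held by vertices without outgoing links is spread evenly.
        dangling = sum(rank[v] for v in graph if not graph[v])
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in graph
        }
        # Each vertex passes its rank in equal shares along its out-links.
        for v, nbrs in graph.items():
            for u in nbrs:
                new_rank[u] += damping * rank[v] / len(nbrs)
        rank = new_rank
    return rank


# Hypothetical undirected graph; each edge is listed in both directions.
GRAPH = {
    "hub": ["a", "b", "c"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
}

ranks = pagerank(GRAPH)
central = degree_centrality(GRAPH)
```

The damping factor plays the role of the scaling factor mentioned above: it mixes rank received over incoming links with a uniform baseline, so the ranks always sum to one.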

What are the central weeks and phrases?

Betweenness Centrality Katz Centrality

Keyphrase Dynamics

Create Sequences of Time Ordered Subgraphs

Network Visualization

Layout: Edge and Vertex Positioning

Fruchterman-Reingold

SFDP (Yifan-Hu) Force Directed

Radial Tree Layout by MST

ARF Spring Block

Visual Properties of Vertices

Lane Harrison, The Links that Bind Us: Network Visualizations, http://blog.visual.ly/network-visualizations

Visual Properties of Edges

Lane Harrison, The Links that Bind Us: Network Visualizations, http://blog.visual.ly/network-visualizations

Visual Analysis

The Visual Analytics Mantra

Overview First, Zoom and Filter, Details on Demand

Questions?