Dynamics in Graph Analysis (PyData Carolinas 2016)
Dynamics in Graph Analysis: Adding Time as Structure for Visual and Statistical Insight
Benjamin Bengfort
@bbengfort, District Data Labs
Algorithm Performance: More understandable implementations and native parallelism provide benefits, particularly to machine learning.
Visual Analytics: Humans can understand and interpret interconnection structures, leading to immediate insights.
“Graph technologies ease the modeling of your domain and improve the simplicity and speed of your queries.” (Marko A. Rodriguez, http://bit.ly/2cthd2L)
Construction: Given a set of [paths, vertices], is a [constraint] graph construction possible?
Existence: Does there exist a [path, vertex, set] within [constraints]?
Optimization: Given several [paths, subgraphs, vertices, sets], is one the best?
Enumeration: How many [vertices, edges] exist with [constraints], and is it possible to list them?
Example of Time Filtered Traversal: Data Model
Name: Emails Sent Network
Number of nodes: 6,174
Number of edges: 343,702
Average degree: 111.339
def sent_range(g, before=None, after=None):
    # Create filtering function based on date range.
    def inner(edge):
        sent = g.ep.sent[edge]
        # Apply every bound that was given; an edge must satisfy all of them.
        if before is not None and not sent < before:
            return False
        if after is not None and not sent > after:
            return False
        return True
    return inner

def degree_filter(degree=0):
    # Create filtering function based on min degree.
    def inner(vertex):
        return vertex.out_degree() > degree
    return inner
Example of Time Filtered Traversal
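# Assumed imports; the slides do not show them.
import graph_tool.all as gt
from dateutil.parser import parse as dateparse

print("{} vertices and {} edges".format(
    g.num_vertices(), g.num_edges()
))
# 6174 vertices and 343702 edges

# Keep only emails sent after the given timestamp.
aug = sent_range(g,
    after=dateparse("Aug 1, 2016 09:00:00 EST")
)

# Filter edges by date, then drop vertices with no remaining out-edges.
view = gt.GraphView(g, efilt=aug)
view = gt.GraphView(view, vfilt=degree_filter())

print("{} vertices and {} edges".format(
    view.num_vertices(), view.num_edges()
))
# 853 vertices and 24813 edges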
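The same two-step filter can be sketched without graph-tool, which may help when checking the logic on a small sample. This is a minimal plain-Python sketch, not the graph-tool API: the `EMAILS` toy data and the `filtered_view` helper are hypothetical, and `sent_range` here works on `(sender, recipient, sent)` tuples rather than on edge descriptors.

```python
from datetime import datetime

# Hypothetical toy data: (sender, recipient, sent timestamp).
EMAILS = [
    ("ann", "bob", datetime(2016, 7, 30, 10, 0)),
    ("ann", "cat", datetime(2016, 8, 2, 9, 30)),
    ("bob", "cat", datetime(2016, 8, 3, 14, 0)),
    ("ann", "eve", datetime(2016, 8, 4, 11, 0)),
    ("cat", "ann", datetime(2016, 8, 5, 8, 15)),
    ("dan", "ann", datetime(2016, 7, 15, 16, 45)),
]


def sent_range(before=None, after=None):
    """Return a predicate keeping edges sent strictly inside the bounds."""
    def inner(edge):
        sent = edge[2]
        if before is not None and not sent < before:
            return False
        if after is not None and not sent > after:
            return False
        return True
    return inner


def filtered_view(edges, efilt, min_out_degree=0):
    """Filter edges first, then keep vertices with out-degree > min_out_degree.

    Assumes min_out_degree is non-negative. Edges that touch a removed
    vertex are hidden, mirroring how a filtered view behaves.
    """
    kept = [e for e in edges if efilt(e)]
    out_degree = {}
    for src, _dst, _sent in kept:
        out_degree[src] = out_degree.get(src, 0) + 1
    vertices = {v for v, d in out_degree.items() if d > min_out_degree}
    visible = [e for e in kept if e[0] in vertices and e[1] in vertices]
    return vertices, visible


aug = sent_range(after=datetime(2016, 8, 1, 9, 0))
vertices, visible = filtered_view(EMAILS, aug)
print("{} vertices and {} edges".format(len(vertices), len(visible)))
```

In this toy data, "eve" only receives mail, so she has no out-edges after filtering and the edge pointing at her disappears along with her.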
Time Structures: Perform static analysis on dynamic components with time as a structure.
Dynamic Graphs: Multiple subgraphs, each representing the graph state at a discrete timestep.
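One way to picture the dynamic-graph approach is to bucket timestamped edges into one edge list per timestep. The sketch below is plain Python under assumed names: `snapshots`, the toy `edges` list, and the choice of one day as the timestep are all illustrative, not taken from the talk.

```python
from collections import defaultdict
from datetime import datetime


def snapshots(edges, key):
    """Group timestamped edges into one edge list per discrete timestep.

    `key` maps a timestamp to its timestep label (a date, a week number, ...).
    """
    steps = defaultdict(list)
    for src, dst, when in edges:
        steps[key(when)].append((src, dst))
    # Sort by timestep so the snapshots read in chronological order.
    return dict(sorted(steps.items()))


# Hypothetical toy data: (source, target, timestamp).
edges = [
    ("a", "b", datetime(2016, 8, 1, 9, 0)),
    ("b", "c", datetime(2016, 8, 1, 17, 30)),
    ("a", "c", datetime(2016, 8, 2, 8, 0)),
    ("c", "d", datetime(2016, 8, 4, 12, 0)),
]

daily = snapshots(edges, key=lambda when: when.date())
for day, subgraph in daily.items():
    print(day, subgraph)
```

Each value in `daily` is a static subgraph, so any static analysis can be run per timestep and compared across steps.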
Natural Language Graph Analysis: Data Modeling
Name: Baleen Keyphrase Graph
Number of nodes: 2,682,624
Number of edges: 46,958,599
Average degree: 35.0095
Name: Sampled Keyphrase Graph
Number of nodes: 139,227
Number of edges: 257,316
Average degree: 3.6964
def degree_filter(degree=0):
    def inner(vertex):
        return vertex.out_degree() > degree
    return inner

# Keep only vertices with out-degree greater than 3.
g = gt.GraphView(g, vfilt=degree_filter(3))
Name: High Degree Phrase Graph
Number of nodes: 8,520
Number of edges: 112,320
Average degree: 26.366
Natural Language Graph Analysis: Data Wrangling
Basic Keyphrase Graph Information
Vertex Type Analysis: Primarily keyphrases and documents.
Degree Distribution: Power-law distribution of degree.
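A degree distribution is a count of how many vertices have each degree. A minimal plain-Python sketch, assuming an adjacency-list dictionary (the `ADJ` toy graph and the `degree_distribution` helper are hypothetical, not part of graph-tool):

```python
from collections import Counter


def degree_distribution(adj):
    """Map each degree k to the number of vertices that have degree k."""
    return Counter(len(neighbors) for neighbors in adj.values())


# Hypothetical toy graph: one well-connected phrase and several rare ones.
ADJ = {
    "graph": ["doc1", "doc2", "doc3", "doc4"],
    "doc1": ["graph", "time"],
    "doc2": ["graph"],
    "doc3": ["graph"],
    "doc4": ["graph"],
    "time": ["doc1"],
}

dist = degree_distribution(ADJ)
for k in sorted(dist):
    print("degree {}: {} vertices".format(k, dist[k]))
```

Even in this tiny example most vertices have degree 1 and a single vertex has a much higher degree, which is the skewed shape a power law produces at scale.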
Natural Language Graph Analysis: Data Wrangling
import random

def ego_filter(g, ego, hops=2):
    # Keep vertices within `hops` of the ego vertex.
    # Note: this runs one shortest-path search per vertex.
    def inner(v):
        dist = gt.shortest_distance(g, ego, v)
        return dist <= hops
    return inner

# Get a random document
v = random.choice([
    v for v in g.vertices()
    if g.vp.type[v] == 'document'
])

ego = gt.GraphView(
    g, vfilt=ego_filter(g, v, 1)
)
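The ego network can also be collected with a single hop-limited breadth-first search instead of one distance query per vertex. The following is a plain-Python sketch over an adjacency-list dictionary; `ego_network` and the toy `ADJ` graph are hypothetical names used only for illustration.

```python
from collections import deque


def ego_network(adjacency, ego, hops=2):
    """Return {vertex: distance} for every vertex within `hops` of `ego`."""
    dist = {ego: 0}
    queue = deque([ego])
    while queue:
        v = queue.popleft()
        # Do not expand past the hop limit.
        if dist[v] == hops:
            continue
        for neighbor in adjacency.get(v, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[v] + 1
                queue.append(neighbor)
    return dist


# Hypothetical toy graph linking documents and phrases.
ADJ = {
    "doc1": ["graph", "time"],
    "graph": ["doc1", "doc2"],
    "time": ["doc1"],
    "doc2": ["graph", "network"],
    "network": ["doc2"],
}

print(ego_network(ADJ, "doc1", hops=1))
print(ego_network(ADJ, "doc1", hops=2))
```

With `hops=1` the ego document sees only its own phrases; with `hops=2` it also reaches other documents that share a phrase with it.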
Extract Week of the Year as Time Structure
# Construct Time Structures to Keyphrase
h = gt.Graph(directed=False)
h.gp.name = h.new_graph_property('string')
h.gp.name = "Phrases by Week"

# Add vertex properties
h.vp.label = h.new_vertex_property('string')
h.vp.vtype = h.new_vertex_property('string')

# Assumption: each week and each phrase should appear once in h, so
# track week numbers and phrase vertices of g that were already added.
weeks = {}
vidmap = {}

# Create graph from the keyphrase graph
for vertex in g.vertices():
    if g.vp.type[vertex] == 'document':
        dt = g.vp.pubdate[vertex]
        weekno = dt.isocalendar()[1]
        if weekno not in weeks:
            week = h.add_vertex()
            h.vp.label[week] = "Week %d" % weekno
            h.vp.vtype[week] = 'week'
            weeks[weekno] = week
        week = weeks[weekno]

        for neighbor in vertex.out_neighbours():
            if g.vp.type[neighbor] == 'phrase':
                if int(neighbor) not in vidmap:
                    phrase = h.add_vertex()
                    h.vp.vtype[phrase] = 'phrase'
                    vidmap[int(neighbor)] = phrase
                h.add_edge(week, vidmap[int(neighbor)])
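The idea of turning a timestamp into a structural vertex can be checked on a few records without a graph library. This is a plain-Python sketch under assumed names: `DOCS` and `phrases_by_week` are hypothetical, and a dictionary of sets stands in for the week-to-phrase edges.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy corpus: (publication date, keyphrases in the document).
DOCS = [
    (date(2016, 8, 1), ["graph", "time"]),
    (date(2016, 8, 3), ["graph"]),
    (date(2016, 8, 9), ["time", "network"]),
]


def phrases_by_week(docs):
    """Link each ISO week label to the set of phrases published that week."""
    weeks = defaultdict(set)
    for pubdate, phrases in docs:
        # isocalendar() returns (ISO year, ISO week number, ISO weekday).
        weekno = pubdate.isocalendar()[1]
        weeks["Week %d" % weekno].update(phrases)
    return dict(weeks)


for week, phrases in sorted(phrases_by_week(DOCS).items()):
    print(week, sorted(phrases))
```

Documents from the same week collapse onto one week key, so phrases become comparable across weeks rather than across individual documents. Week numbers alone repeat every year, so a corpus spanning several years would need the ISO year in the key as well.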
PageRank Centrality: A variant of eigenvector centrality that has a scaling factor and prioritizes incoming links.
Eigenvector Centrality: A measure of relative influence where closeness to important nodes matters as much as other metrics.
Degree Centrality: A vertex is more important the more connections it has, e.g. a “celebrity”.
Betweenness Centrality: How many shortest paths pass through the given vertex, e.g. how often does information flow through it?
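Two of these measures are short enough to sketch directly. The code below is a simplified plain-Python illustration, not graph-tool's implementation: `degree_centrality`, `pagerank`, the damping value of 0.85, the fixed iteration count, and the star-shaped `ADJ` toy graph are all assumptions made for the example.

```python
def degree_centrality(adj):
    """Fraction of the other vertices that each vertex is connected to."""
    n = len(adj)
    return {v: len(nbrs) / max(n - 1, 1) for v, nbrs in adj.items()}


def pagerank(adj, damping=0.85, iterations=50):
    """Simplified power-iteration PageRank over out-link adjacency lists.

    Every vertex must appear as a key in `adj`.
    """
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        # Every vertex receives the same baseline share.
        new = {v: (1.0 - damping) / n for v in adj}
        for v, nbrs in adj.items():
            if nbrs:
                # Split this vertex's damped rank among its out-links.
                share = damping * rank[v] / len(nbrs)
                for u in nbrs:
                    new[u] += share
            else:
                # A vertex without out-links spreads its rank evenly.
                for u in adj:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank


# Hypothetical star graph: every leaf links to the hub and back.
ADJ = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}

degree = degree_centrality(ADJ)
rank = pagerank(ADJ)
for v in sorted(ADJ):
    print(v, round(degree[v], 3), round(rank[v], 3))
```

On this graph both measures agree that the hub is the most important vertex; they start to diverge on graphs where a low-degree vertex is linked from highly ranked neighbors.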
Layout: Edge and Vertex Positioning
Fruchterman-Reingold
SFDP (Yifan-Hu) Force Directed
Radial Tree Layout by MST
ARF Spring Block
Visual Properties of Vertices
Lane Harrison, The Links that Bind Us: Network Visualizations, http://blog.visual.ly/network-visualizations
Visual Properties of Edges
Lane Harrison, The Links that Bind Us: Network Visualizations, http://blog.visual.ly/network-visualizations