Network Analysis in Python with NetworkX
Johannes Wachs
Central European University, Center for Network Science
About me
PhD student at CEU’s Center for Network Science
Researcher at the Corruption Research Center Budapest
Consultant at Bondweaver
Python/NetworkX user in all three roles
Why Networks?
Networks are an excellent framework to study complexity.
They provide holistic information that is increasingly valuable (and quantifiable!) with the growth of information technology.
*New Yorker
Famous Example: Google
The problem: How do we order webpages?
The previous solution (Alta Vista): Count the number of links to each page.
>> Easy to manipulate, poor results
The innovation: Measure link quality (PageRank)
PageRank
It’s not how many links you get, but who links to you
(and who links to them, etc.)
*http://www.bloggingfever.com/
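The idea can be sketched as a small power iteration over a toy link graph. This is a minimal illustration only: it is not Google's production algorithm and not NetworkX's implementation (NetworkX ships its own `nx.pagerank`). The page names, the damping factor of 0.85, and the iteration count are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share ("random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "spam" is linked by two pages nobody links to,
# "quality" is linked only by "hub", which itself has four in-links.
links = {
    "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"],
    "hub": ["quality"],
    "s1": ["spam"], "s2": ["spam"],
}

rank = pagerank(links)

# Plain link counting, for comparison.
in_links = {p: 0 for p in rank}
for targets in links.values():
    for t in targets:
        in_links[t] += 1

for page in sorted(rank, key=rank.get, reverse=True):
    print(page, "in-links:", in_links[page], "rank:", round(rank[page], 3))
```

In this toy graph "spam" has more incoming links than "quality", yet "quality" ends up with the higher rank because its single link comes from a well-linked page.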
NetworkX
Python (2/3) library for network analysis
Made at LANL (Los Alamos National Laboratory), released 2005. Currently: v1.11 (October 2015)
Goals:
● provide tools to study networks
● standard interface suitable for many projects
● rapid development for collaborative/multidisciplinary projects
● interface to C/C++/FORTRAN
● ability to take in large, nonstandard datasets
NetworkX Workflow
Import data
● dictionary, Pandas DF, np array
● edgelist/adj Matrix CSV
● graph file format GML, GEXF
● JSON
Create graph
● Graph, DiGraph, MultiGraph, Bipartite
● Labels/Attributes for nodes and edges
Calculate
● clustering, centrality, degrees
● community detection
● null-models
Analyze
● Draw graph (networkX or export to Gephi)
● Distribution of network statistics vs attributes
● Export
● Interactive via D3.js
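The four workflow stages can be sketched end to end on a toy network. This is a minimal example under stated assumptions: the people, edges, roles, and output file name are hypothetical, and in practice the edge list would typically be read from a CSV or a DataFrame rather than typed in.

```python
import os
import tempfile

import networkx as nx

# 1. Import data: a hypothetical edge list (e.g. rows read from a CSV file).
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]
roles = {"ann": "manager", "bob": "employee", "cat": "employee", "dan": "employee"}

# 2. Create graph: an undirected Graph with one attribute per node.
G = nx.Graph()
G.add_edges_from(edges)
for name, role in roles.items():
    G.add_node(name, role=role)  # re-adding an existing node only updates its attributes

# 3. Calculate: degrees, degree centrality, and local clustering.
degree = dict(G.degree())
centrality = nx.degree_centrality(G)
clustering = nx.clustering(G)

# 4. Analyze / export: write a GEXF file that Gephi can open.
out_path = os.path.join(tempfile.mkdtemp(), "toy_network.gexf")
nx.write_gexf(G, out_path)

for name in G.nodes():
    print(name, degree[name], round(centrality[name], 2), round(clustering[name], 2))
```

Swapping `nx.Graph()` for `nx.DiGraph()` would keep the direction of each tie, which matters for nomination-style data such as "who do you consider an expert".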
Case Study 1: Collaboration and Expertise in a Firm
A 300-person department of a multinational wanted to know how its members were collaborating and how expertise was distributed across the unit.
We surveyed employees to see who they considered important collaborators. We also asked them to nominate knowledge experts (+sociometry).
Two layers: the network of collaboration, and the network (PageRanked!) of expertise.
Firm Results
We see large clusters of collaborating employees without direct access to expertise on the periphery.
Red: Managers
Pink: Team Leads
Black: Employees
Larger nodes are experts
Edges connect collaborating coworkers
Case Study 2: Twitter and Elections
For better or worse, Twitter has become a major platform for politics.
By scraping tweets, we can see how followers of different parties talk with each other around election time.
Networks show us the big picture: do left and right ever talk anymore?
*Reuters
Denmark 2015
10 major parties.
~100,000 accounts, 5.5 million tweets from six months before the election to one month after. Accounts grouped Left/Center/Right.
Sentiment Analysis via AFINN.
The most negative tweet during the campaign:
"Engang var #dkpol en kamp mellem de onde og de dumme. Nu er det de onde og dumme mod nogle andre onde og dumme. #fv15 #fv2015 #stemblankt"
Translation: "Once, #dkpol was a battle between evil and stupid. Now it's the evil and stupid against some other evil and stupid. #fv15 #fv2015 #stemblankt"
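AFINN is a word list that gives each word an integer valence score, negative for negative words and positive for positive ones; a text is commonly scored by summing the scores of its words. The sketch below shows that summing step on the translated tweet. It uses a tiny made-up lexicon rather than real AFINN entries, and `TOY_LEXICON` and `sentiment_score` are hypothetical names, so the numbers only illustrate the mechanism.

```python
import re

# Made-up scores for illustration; these are NOT the actual AFINN values.
TOY_LEXICON = {"evil": -3, "stupid": -2, "battle": -1, "friendly": 2, "great": 3}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all words in the text (unknown words count as 0)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


tweet = (
    "Once, #dkpol was a battle between evil and stupid. "
    "Now it's the evil and stupid against some other evil and stupid."
)
score = sentiment_score(tweet)
print(score)
```

Because the repeated negative words each add to the total, a tweet like this one lands far below zero under any lexicon that scores "evil" and "stupid" negatively.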
Twitter Results
Some of the Left and Right tweet at each other, but on the Left’s turf.
Long path between the Left and Right primary clusters.
Sentiment:
● Left/Right fight (negative sentiment) before the election, get more friendly after.
Red: Left
Yellow: Center
Blue: Right
Node size ~ PageRank
Corruption in Public Contracting
Public contracting is up to 25% of GDP in EU countries. This is a major avenue for corruption.
The Corruption Risk Index (CRI) grades contracts on the presence of red flags:
● Short bidding time (tell your friend ahead of time)
● Presence of dummy bids (create fake competitors for your friend)
● Overdetermination of requirements (make your friend uniquely eligible)
● etc.
Q: How is CRI distributed in the market of issuers and firms?
Corruption risk is clustered!
Significant corruption assortativity
Corruption Results
Nodes: Firms and Issuers
Red: High CRI
Hungary 2009
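One way to check an assortativity claim like this is to correlate the CRI scores at the two ends of every edge and compare the result with a null model in which the scores are shuffled across nodes. The sketch below is an assumption about how such a test could look, not the method used in the study: the graph, the CRI values, and the number of shuffles are invented, and the toy graph ignores the firm/issuer distinction. NetworkX also provides `nx.numeric_assortativity_coefficient` for the same kind of measure.

```python
import itertools
import random

import networkx as nx


def edge_correlation(G, values):
    """Pearson correlation of a node value across the two ends of each edge."""
    xs, ys = [], []
    for u, v in G.edges():
        # Count each undirected edge in both directions so the measure is symmetric.
        xs += [values[u], values[v]]
        ys += [values[v], values[u]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys hold the same values, so they share mean and variance
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    return cov / var  # assumes the values are not all identical


# Hypothetical CRI scores: one high-risk cluster, one low-risk cluster.
cri = {"h1": 0.80, "h2": 0.90, "h3": 0.70, "h4": 0.85,
       "l1": 0.10, "l2": 0.20, "l3": 0.15, "l4": 0.05}

G = nx.Graph()
G.add_edges_from(itertools.combinations(["h1", "h2", "h3", "h4"], 2))
G.add_edges_from(itertools.combinations(["l1", "l2", "l3", "l4"], 2))
G.add_edge("h1", "l1")  # a single bridge between the two clusters

observed = edge_correlation(G, cri)

# Null model: keep the network fixed and shuffle the scores among the nodes.
rng = random.Random(0)
nodes = list(G.nodes())
scores = [cri[node] for node in nodes]
n_shuffles = 2000
at_least_as_high = 0
for _ in range(n_shuffles):
    rng.shuffle(scores)
    shuffled = dict(zip(nodes, scores))
    if edge_correlation(G, shuffled) >= observed:
        at_least_as_high += 1
p_value = at_least_as_high / n_shuffles

print(round(observed, 2), p_value)
```

A correlation well above what the shuffled networks produce indicates that high-risk nodes are connected to other high-risk nodes more often than chance would suggest.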
Alternatives
● igraph: written in C/C++, packages in Python and R. Faster
● graph-tool: Python with data structures/algos in C++. Fastest
My rule of thumb: consider these options if
● the network has more than 50,000 nodes, or
● the analysis is integrated into production.
Thanks!