Congressional PageRank: Graph Analytics of US Congress With Neo4j

Congressional PageRank: Graph Analytics Of US Congress, by William Lyon. Graph Day - Austin, TX, January 2016

Upload: william-lyon

Post on 12-Jan-2017

TRANSCRIPT

Page 1: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Congressional PageRank: Graph Analytics Of US Congress

William Lyon

Graph Day - Austin, TX
January 2016

Page 2: Congressional PageRank: Graph Analytics of US Congress With Neo4j

About me

Software Developer @[email protected]

@lyonwj

lyonwj.com

William Lyon

Page 3: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Agenda

• Brief intro to Neo4j graph database
• Modeling US Congress as a graph
• Exploring the 114th Congress
• Finding influential legislators

Page 4: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Neo4j – Key Features

Native Graph Storage: Ensures data consistency and performance

Native Graph Processing: Millions of hops per second, in real time

“Whiteboard Friendly” Data Modeling: Model data as it naturally occurs

High Data Integrity: Fully ACID transactions

Powerful, Expressive Query Language: Requires 10x to 100x less code than SQL

Scalability and High Availability: Vertical and horizontal scaling optimized for graphs

Built-in ETL: Seamless import from other databases

Integration: Drivers and APIs for popular languages

MATCH(A)

Page 5: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Property Graph Model

Page 6: Congressional PageRank: Graph Analytics of US Congress With Neo4j

The Whiteboard Model Is the Physical Model

Page 7: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Relational Versus Graph Models

[Figure: Relational model (Person, Person-Friend, and Friend tables) versus graph model (Andreas, Delia, Tobias, and Mica connected by KNOWS relationships)]

Page 8: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Property Graph Model Components

Nodes
• The objects in the graph
• Can have name-value properties
• Can be labeled

Relationships
• Relate nodes by type and direction
• Can have name-value properties

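[Figure: two PERSON nodes (name: “Dan”, born: May 29, 1970, twitter: “@dan”; name: “Ann”, born: Dec 5, 1975) and a CAR node (brand: “Volvo”, model: “V70”), connected by LOVES, LIVES WITH, DRIVES, and OWNS relationships; one relationship carries the property since: Jan 10, 2011]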
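The components listed on this slide can be sketched as plain data structures. The following Python is only an illustration of the property graph model, not how Neo4j stores data. The class names, the `neighbors` helper, and the choice of which person drives or owns the car are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                                      # e.g. {"Person"}
    properties: dict = field(default_factory=dict)   # name-value pairs

@dataclass
class Relationship:
    start: Node                                      # direction: start -> end
    type: str                                        # e.g. "LOVES"
    end: Node
    properties: dict = field(default_factory=dict)

dan = Node({"Person"}, {"name": "Dan", "twitter": "@dan"})
ann = Node({"Person"}, {"name": "Ann"})
car = Node({"Car"}, {"brand": "Volvo", "model": "V70"})

# Assumed wiring; the transcript does not say which nodes each relationship joins.
relationships = [
    Relationship(dan, "LOVES", ann),
    Relationship(ann, "LOVES", dan),
    Relationship(dan, "DRIVES", car, {"since": "Jan 10, 2011"}),
    Relationship(ann, "OWNS", car),
]

def neighbors(node, rel_type):
    """Nodes reached from `node` by following outgoing `rel_type` relationships."""
    return [r.end for r in relationships if r.start is node and r.type == rel_type]

print([n.properties["name"] for n in neighbors(dan, "LOVES")])
```

Both nodes and relationships carry their own properties, which is what lets a fact such as `since` live on the relationship rather than on either endpoint.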

Page 9: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Cypher Query Language

Page 10: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Cypher: Powerful and Expressive Query Language

CREATE (:Person {name: "Dan"})-[:LOVES]->(:Person {name: "Ann"})

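[Figure: the query annotated. Each (:Person {name: …}) pattern is a NODE with a LABEL and a PROPERTY; the two nodes, Dan and Ann, are joined by a LOVES relationship]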

Page 11: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Express Complex Queries Easily with Cypher

Find all direct reports and how many people they manage, up to 3 levels down

Cypher Query

MATCH (boss)-[:MANAGES*0..3]->(sub),
      (sub)-[:MANAGES*1..3]->(report)
WHERE boss.name = "John Doe"
RETURN sub.name AS Subordinate, count(report) AS Total

SQL Query
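To make the variable-length pattern concrete, here is a rough Python sketch of what the Cypher query computes. It assumes the MANAGES relationships form a tree, so every person is reached by exactly one path. The org chart and the helper names `reachable` and `report_totals` are hypothetical.

```python
# Hypothetical org chart: manager -> direct reports (a tree, so paths are unique).
manages = {
    "John Doe": ["Ann", "Raj"],
    "Ann": ["Lee", "Kim"],
    "Lee": ["Pat"],
}

def reachable(manages, person, min_hops, max_hops):
    """People reachable from `person` in min_hops..max_hops MANAGES hops."""
    found = [person] if min_hops == 0 else []
    frontier = [person]
    for hops in range(1, max_hops + 1):
        frontier = [r for p in frontier for r in manages.get(p, [])]
        if hops >= min_hops:
            found.extend(frontier)
    return found

def report_totals(manages, boss):
    """Mirror of the query: sub is 0..3 hops below boss, report is 1..3 below sub."""
    totals = {}
    for sub in reachable(manages, boss, 0, 3):
        reports = reachable(manages, sub, 1, 3)
        if reports:  # MATCH drops subordinates that have no reports at all
            totals[sub] = len(reports)
    return totals

totals = report_totals(manages, "John Doe")
print(totals)
```

The two `reachable` calls correspond to the two path patterns in the MATCH clause; the Cypher version expresses the same traversal declaratively in four lines.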

Page 12: Congressional PageRank: Graph Analytics of US Congress With Neo4j

http://www.opencypher.org/

Page 13: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Getting Data into Neo4j

Cypher-Based “LOAD CSV” Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships

Command-Line Bulk Loader: neo4j-import
• For initial database population
• For loads with 10B+ records
• Up to 1M records per second

4.58 million things and their relationships…

Loads in 100 seconds!

Page 14: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Neo4j

Graph Database

• Property graph data model
• Nodes and relationships
• Native graph processing
• Cypher query language

Page 15: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Graphing US Congress

Page 16: Congressional PageRank: Graph Analytics of US Congress With Neo4j

https://github.com/legis-graph/legis-graph

Page 17: Congressional PageRank: Graph Analytics of US Congress With Neo4j

https://github.com/legis-graph/legis-graph

LOAD CSV WITH HEADERS FROM "file:///legislators.csv" AS line
MERGE (l:Legislator {thomasID: line.thomasID})
SET l = line
// Merge the state on its own first, so each state code maps to a single node
MERGE (s:State {code: line.state})
MERGE (l)-[:REPRESENTS]->(s)
…

US Congress

Page 18: Congressional PageRank: Graph Analytics of US Congress With Neo4j

https://github.com/legis-graph/legis-graph

Page 19: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Which legislators represent Texas?

MATCH (s:State {code: "TX"})<-[:REPRESENTS]-(l:Legislator) RETURN l,s;

Page 20: Congressional PageRank: Graph Analytics of US Congress With Neo4j

…include congressional body and party

MATCH (s:State {code: "TX"})<-[:REPRESENTS]-(l:Legislator)
MATCH (p:Party)<-[:IS_MEMBER_OF]-(l)-[:ELECTED_TO]->(b:Body)
RETURN b,l,s,p;

Page 21: Congressional PageRank: Graph Analytics of US Congress With Neo4j
Page 22: Congressional PageRank: Graph Analytics of US Congress With Neo4j
Page 23: Congressional PageRank: Graph Analytics of US Congress With Neo4j
Page 24: Congressional PageRank: Graph Analytics of US Congress With Neo4j

How to find influential legislators?

Page 25: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Bill Sponsorship

Page 26: Congressional PageRank: Graph Analytics of US Congress With Neo4j
Page 27: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Bill Cosponsorship

Page 28: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Degree centrality
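Degree centrality is the simplest of the measures covered here: a node's score is the number of relationships attached to it. A minimal Python sketch follows, using a made-up list of (legislator, bill) cosponsorship pairs rather than real legis-graph data. The function name `degree_centrality` is an assumption for the example.

```python
from collections import Counter

# Hypothetical (legislator, bill) cosponsorship pairs; not real legis-graph data.
cosponsorships = [
    ("alice", "hr1"),
    ("carol", "hr1"),
    ("alice", "hr2"),
    ("bob", "hr2"),
    ("alice", "s1"),
]

def degree_centrality(edges):
    """Count how many relationships touch each node, ignoring direction."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree

degree = degree_centrality(cosponsorships)
print(degree.most_common(3))
```

Because the score only looks at a node's immediate relationships, it is cheap to compute but says nothing about how well connected a node's neighbors are.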

Page 29: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Bill Cosponsorship

Page 30: Congressional PageRank: Graph Analytics of US Congress With Neo4j

• Cosponsors are “influenced by” bill sponsors

• Add INFLUENCED_BY relationships
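One way to picture this step is as a transformation from bills to legislator-to-legislator edges. The sketch below is a Python illustration under stated assumptions: the bill records are invented, and counting repeated cosponsorships as an edge weight is a choice made for the example, not something the slides specify.

```python
from collections import Counter

# Hypothetical bills; identifiers and names are made up for illustration.
bills = [
    {"id": "hr1", "sponsor": "bob", "cosponsors": ["alice", "carol"]},
    {"id": "hr2", "sponsor": "dave", "cosponsors": ["alice", "bob"]},
    {"id": "s1", "sponsor": "bob", "cosponsors": ["alice"]},
]

def influenced_by(bills):
    """Map (cosponsor, sponsor) -> number of bills linking the two."""
    weights = Counter()
    for bill in bills:
        for cosponsor in bill["cosponsors"]:
            if cosponsor != bill["sponsor"]:
                weights[(cosponsor, bill["sponsor"])] += 1
    return weights

edges = influenced_by(bills)
for (cosponsor, sponsor), weight in sorted(edges.items()):
    print(f"({cosponsor})-[:INFLUENCED_BY {{weight: {weight}}}]->({sponsor})")
```

The resulting edges point from cosponsor to sponsor, so a legislator whose bills attract many cosponsors accumulates many incoming INFLUENCED_BY relationships.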

Page 31: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Betweenness centrality

The number of times a node acts as a bridge along the shortest path between two other nodes.

https://en.wikipedia.org/wiki/Betweenness_centrality
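The definition above can be computed with Brandes' algorithm, which runs one breadth-first search per node. The following Python sketch handles small unweighted, undirected graphs and returns unnormalized scores. It is an illustration of the measure, not the implementation used in the talk, and the four-node path graph is a made-up example.

```python
from collections import deque

def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph {node: [neighbors]}."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

# Path graph a - b - c - d: only the two middle nodes act as bridges.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = betweenness(path)
print(scores)
```

In the path graph, b sits on the shortest paths a-c and a-d, and c sits on a-d and b-d, so each scores 2 while the endpoints score 0.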

Page 32: Congressional PageRank: Graph Analytics of US Congress With Neo4j
Page 33: Congressional PageRank: Graph Analytics of US Congress With Neo4j

image credit: https://en.wikipedia.org/wiki/PageRank

Page 34: Congressional PageRank: Graph Analytics of US Congress With Neo4j

image credit: https://en.wikipedia.org/wiki/PageRank

?

Page 35: Congressional PageRank: Graph Analytics of US Congress With Neo4j

PageRank

Cypher approximation

UNWIND range(1,10) AS round
MATCH (l:Legislator) WHERE rand() < 0.1
MATCH (l:Legislator)-[:INFLUENCED_BY]->(o:Legislator)
SET o.rank = coalesce(o.rank,0) + 1;

http://neo4j.com/blog/using-neo4j-hr-analytics/
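Read literally, the Cypher query above runs ten rounds, keeps each legislator with probability 0.1 in every round, and adds 1 to the rank of everyone that a kept legislator is INFLUENCED_BY. The Python sketch below mimics that behavior on a made-up edge list. The function name and parameters are assumptions for the example. Note that, as written, nothing propagates rank from node to node, so the result behaves more like a sampled in-degree count than a full PageRank.

```python
import random

# Hypothetical (cosponsor, sponsor) INFLUENCED_BY pairs.
edges = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("alice", "dave"),
    ("bob", "dave"),
]

def approximate_rank(edges, rounds=10, sample=0.1, seed=None):
    """Sampled rank: each round, a random subset of sources votes for its targets."""
    rng = random.Random(seed)
    sources = sorted({src for src, _ in edges})
    rank = {}
    for _ in range(rounds):
        # Corresponds to: MATCH (l:Legislator) WHERE rand() < 0.1
        chosen = {s for s in sources if rng.random() < sample}
        for src, dst in edges:
            if src in chosen:
                # Corresponds to: SET o.rank = coalesce(o.rank,0) + 1
                rank[dst] = rank.get(dst, 0) + 1
    return rank

print(approximate_rank(edges, seed=42))
```

With `sample=1.0` every source votes in every round, so each rank is simply the round count times the node's in-degree; lower sample rates trade accuracy for less work per round.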

Page 36: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Neo4j server extensions with Java

Page 37: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Neo4j server extensions with Java

curl http://localhost:7474/service/v1/pagerank/Person/KNOWS

Page 38: Congressional PageRank: Graph Analytics of US Congress With Neo4j

PageRank

Graph processing server extension

https://github.com/maxdemarzi/graph_processing

curl http://localhost:7474/service/v1/pagerank/Person/KNOWS

Page 39: Congressional PageRank: Graph Analytics of US Congress With Neo4j

PageRank

neo4j-noderank

https://github.com/graphaware/neo4j-noderank

Page 40: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Two issues

• Local vs global
• Iterative algorithms and graph complexity

Page 41: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Local vs global

Local
Global

Page 42: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Local vs global

Local: OLTP / realtime
Global: Offline / batch

Page 43: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Graph complexity

For iterative algorithms like PageRank, it’s all about the complexity of the graph: lots of paths, lots of iterations.

Page 44: Congressional PageRank: Graph Analytics of US Congress With Neo4j

PageRank

Graph global!

Page 45: Congressional PageRank: Graph Analytics of US Congress With Neo4j

PageRank

Graph global!

Iterative!
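Both properties are visible in a plain power-iteration version of PageRank: every pass reads every relationship in the graph, and the pass is repeated many times. The Python sketch below is a textbook formulation run on a made-up edge list. The damping factor of 0.85 and the even redistribution of rank from nodes with no outgoing links are conventional choices assumed here, not details taken from the talk.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):                      # iterative
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:                            # global: every edge is visited
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical (cosponsor, sponsor) INFLUENCED_BY pairs.
edges = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("alice", "dave"),
    ("bob", "dave"),
]
ranks = pagerank(edges)
for name, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Unlike a degree count, a node's score here depends on the scores of the nodes pointing at it, which is why a single local traversal cannot produce it.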

Page 46: Congressional PageRank: Graph Analytics of US Congress With Neo4j

• Efficient in-memory data processing and machine learning platform

• Graph analytics with GraphX
• In-memory message passing algorithm

Apache Spark is a fast and general engine for large-scale data processing.

http://spark.apache.org/

Page 47: Congressional PageRank: Graph Analytics of US Congress With Neo4j
Page 48: Congressional PageRank: Graph Analytics of US Congress With Neo4j

PageRank

Spark with Neo4j - Scala

https://github.com/AnormCypher/AnormCypher

Page 49: Congressional PageRank: Graph Analytics of US Congress With Neo4j

import org.anormcypher._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._

// total and batch determine how many pages of 1,000,000 rows are requested.
val total = 100000000
val batch = total / 1000000

// Spread the page offsets across partitions; each partition opens its own connection.
val links = sc.range(0, batch).repartition(batch).mapPartitionsWithIndex( (i, p) => {
  val dbConn = Neo4jREST("localhost", 9474, "/db/data/", "neo4j", "test")
  val q = "MATCH (l1:Legislator)-[:INFLUENCED_BY]->(l2:Legislator) RETURN id(l1) as from, id(l2) as to skip {skip} limit 1000000"
  p.flatMap( skip => {
    Cypher(q).on("skip" -> skip * 1000000).apply()(dbConn).map(row =>
      (row[Int]("from").toLong, row[Int]("to").toLong)
    )
  })
})

links.cache
links.count

// Build a GraphX graph from the (from, to) pairs and run 5 iterations of PageRank.
val edges = links.map( l => Edge(l._1, l._2, None))
val g = Graph.fromEdges(edges, "none")
val v = PageRank.run(g, 5).vertices

Extract subgraph. Run PageRank using Spark GraphX.

Page 50: Congressional PageRank: Graph Analytics of US Congress With Neo4j

// Write the PageRank scores back to Neo4j with one batched UNWIND statement per partition.
val res = v.repartition(total / 100000).mapPartitions( part => {
  val localConn = Neo4jREST("localhost", 9474, "/db/data/", "neo4j", "test")
  val updateStmt = Cypher("UNWIND {updates} as update MATCH (p) where id(p) = update.id SET p.pagerank = update.rank")
  // Materialize the partition once: an Iterator can only be traversed a single time.
  val updates = part.map( v => Map("id" -> v._1.toLong, "rank" -> v._2.toDouble)).toList
  val count = updateStmt.on("updates" -> updates).execute()(localConn)
  Iterator(updates.size)
})

Write back to graph

Page 51: Congressional PageRank: Graph Analytics of US Congress With Neo4j

PageRank

Mazerunner

http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

• Enables two-way ETL between Spark and Neo4j

• Run GraphX jobs from data in Neo4j

• Write results back to Neo4j

Page 52: Congressional PageRank: Graph Analytics of US Congress With Neo4j

PageRank

Mazerunner

http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

• Enables two-way ETL between Spark and Neo4j

• Run GraphX jobs from data in Neo4j

• Write results back to Neo4j

• Support for:
• PageRank
• Closeness Centrality
• Betweenness Centrality
• Triangle Counting
• Connected Components
• Strongly Connected Components

Page 53: Congressional PageRank: Graph Analytics of US Congress With Neo4j

https://github.com/neo4j-contrib/neo4j-mazerunner

Page 54: Congressional PageRank: Graph Analytics of US Congress With Neo4j

curl http://localhost:7474/service/mazerunner/analysis/pagerank/INFLUENCED_BY

Page 55: Congressional PageRank: Graph Analytics of US Congress With Neo4j

• Cosponsors are “influenced by” bill sponsors

• Add INFLUENCED_BY relationships

Page 56: Congressional PageRank: Graph Analytics of US Congress With Neo4j
Page 57: Congressional PageRank: Graph Analytics of US Congress With Neo4j
Page 58: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Who are the influential legislators?

Page 59: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Who are the influential legislators?

Page 60: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Influential legislators by topic

Page 61: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Influential legislators by topic

Page 62: Congressional PageRank: Graph Analytics of US Congress With Neo4j

graphdatabases.com

Page 63: Congressional PageRank: Graph Analytics of US Congress With Neo4j

http://graphgist.neo4j.com/

Page 64: Congressional PageRank: Graph Analytics of US Congress With Neo4j

http://portal.graphgist.org/challenge/index.html

Page 65: Congressional PageRank: Graph Analytics of US Congress With Neo4j

Links

• http://www.lyonwj.com/2015/09/20/legis-graph-congressional-data-using-neo4j/

• http://www.lyonwj.com/2015/10/11/congressional-pagerank/

• https://github.com/legis-graph/legis-graph

• https://github.com/neo4j-contrib/neo4j-mazerunner

• http://www.kennybastani.com/2014/11/graph-analytics-docker-spark-neo4j.html

• http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html