machine learning and graphx
Massive Graph Mining
Apache Spark’s GraphX and Data Mining
Who we are
Andy
@Noootsab
@NextLab_be
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool
Rand
@randhindi
@snips
Entrepreneur
PhD bioinformatics, etc.
Love data & ML
Graph 101
A graph is a mathematical representation of linked data.
It’s defined in terms of its Vertices and Edges, G(V,E).
A vertex is an entity that can carry a bag of data (generally small).
An edge connects vertices, and can also own a bag of data.
Graph 101
A graph represents data in a way that is less convenient for classical processing frameworks,
because the burden is not put on the observations themselves (rows) but on their linkage, and specifically on its density.
Thus, the problem is often translated into a self-join.
Graph 101
A graph G(V,E) has a reverse representation, its dual.
The dual is nothing other than the graph G’(V’,E’), where
● a vertex is an edge in G, and
● an edge is a vertex in G that has at least one edge.
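A minimal plain-Python sketch of this construction, reading the definition as what graph theory usually calls the line graph: every edge of G becomes a vertex of G’, and every vertex of G links the edges that meet at it. Under this reading a vertex of G only produces dual edges when at least two edges meet there. The function name `dual_graph` and the sample graph are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def dual_graph(edges):
    """Build the dual of an undirected graph given as a list of (u, v) edges.

    Dual vertices are the indices of the original edges.
    Each dual edge is (edge_id_a, edge_id_b, shared_vertex).
    """
    dual_vertices = list(range(len(edges)))

    # Group the original edges by the vertices they touch.
    incident = {}
    for edge_id, (u, v) in enumerate(edges):
        incident.setdefault(u, []).append(edge_id)
        incident.setdefault(v, []).append(edge_id)

    # Every pair of edges meeting at a vertex becomes a dual edge.
    dual_edges = set()
    for vertex, edge_ids in incident.items():
        for a, b in combinations(sorted(edge_ids), 2):
            dual_edges.add((a, b, vertex))
    return dual_vertices, sorted(dual_edges)

# Hypothetical star-shaped graph: three edges meeting at "b".
edges = [("a", "b"), ("b", "c"), ("b", "d")]
dual_vertices, dual_edges = dual_graph(edges)
```

Here the three original edges all meet at "b", so the dual is a triangle whose three edges are all labelled "b".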
Graph 101
The classical way to store or share the connectivity of a graph is using its tabular version, that is, its Adjacency Matrix.
ref: http://en.wikipedia.org/wiki/Adjacency_matrix
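As a small illustration of the adjacency matrix, here is a plain-Python sketch. The function name `adjacency_matrix`, the `directed` flag and the sample graph are assumptions made for this example only.

```python
def adjacency_matrix(vertices, edges, directed=False):
    """Return the adjacency matrix as a list of rows.

    matrix[i][j] == 1 means there is an edge from vertices[i] to vertices[j].
    """
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        if not directed:
            # An undirected edge is stored in both directions.
            matrix[index[v]][index[u]] = 1
    return matrix

# Hypothetical graph: a triangle a-b-c with a tail c-d.
matrix = adjacency_matrix(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")],
)
```

For an undirected graph the matrix is symmetric, and for a sparse graph most entries are zero, which is one reason edge lists are often preferred for large graphs.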
GraphX (Apache Spark)
Spark 101
GraphX (Apache Spark)
Offers a Graph API on top of Spark.
Enabling cross-world manipulations
GraphX (Apache Spark)
How it differs from other classical systems...
GraphX (Apache Spark)
Plenty of operators on both RDDs, but
GraphX (Apache Spark)
1. Sends messages to neighbors
2. Returns an RDD of aggregated messages
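The two steps can be emulated in plain Python to show the idea. This is a sketch of the pattern only, not the Spark API: `aggregate_messages`, `send` and `merge` are hypothetical names, and a dict stands in for the resulting RDD.

```python
def aggregate_messages(edges, vertex_attr, send, merge):
    """Emulate 'send messages along edges, then aggregate per vertex'.

    edges:       list of (src, dst) pairs
    vertex_attr: dict vertex -> attribute
    send:        function(src, dst, src_attr, dst_attr) -> list of (target, message)
    merge:       function(message_a, message_b) -> combined message
    """
    inbox = {}
    for src, dst in edges:
        for target, message in send(src, dst, vertex_attr[src], vertex_attr[dst]):
            if target in inbox:
                inbox[target] = merge(inbox[target], message)
            else:
                inbox[target] = message
    return inbox

# Example: in-degree, obtained by sending 1 to every destination and summing.
edges = [(1, 2), (1, 3), (2, 3)]
attrs = {1: None, 2: None, 3: None}
in_degree = aggregate_messages(
    edges,
    attrs,
    send=lambda src, dst, src_attr, dst_attr: [(dst, 1)],
    merge=lambda a, b: a + b,
)
```

Vertex 1 receives no message, so it does not appear in the result; only vertices that received at least one message get an aggregated value.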
GraphX (Apache Spark)
Offers higher-level operators and algorithms, like
GraphX (Apache Spark)
This one rules them all (and more)
More later...
PageRank and Pregel
Everybody knows PageRank, right?
If not: it’s our oil, our friend, our preferred black box…
It’s why Google Search works so well!
PageRank and Pregel
Essentially, PageRank is all about the importance of a node in a graph → Link Analysis.
The bottom line is:
● In-links are votes
● In-links from important nodes are more important → recursion
PageRank and Pregel
https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf
PageRank and Pregel
TL;DR
The importance of a node is the probability that a random (drunk) walker lands on that node.
So, it depends on:
1. the probability that he lands on one of its neighbors
2. the probability that he crosses a link from the neighbor to it
3. an arbitrary probability of teleportation
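Written as a formula, the three ingredients above give the usual PageRank equation. The symbols are assumptions of this note, not taken from the slides: r_j is the rank of node j, d_i the number of out-links of node i, N the number of nodes, and β the probability of following a link rather than teleporting.

```latex
r_j \;=\; \beta \sum_{i \to j} \frac{r_i}{d_i} \;+\; \frac{1 - \beta}{N}
```

Here r_i is the probability of being on a neighbor, 1/d_i the probability of crossing the link from that neighbor to j, and (1 - β)/N the teleportation term.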
PageRank and Pregel
Solution: Power Method/Iteration (recursive)
r_new = A x r_old
Matrix algebra is a pain in a distributed environment…
But wait, the process is rather graph oriented!
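A minimal single-machine sketch of the power iteration r_new = A x r_old, written over out-link lists instead of an explicit matrix. The damping value 0.85, the uniform handling of dangling nodes and the tiny sample graph are assumptions of this example.

```python
def pagerank(out_links, beta=0.85, tol=1e-10, max_iter=100):
    """Power iteration on a graph given as dict node -> list of out-neighbors.

    Every node must appear as a key of out_links, even with an empty list.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Teleportation share received by every node.
        new_rank = {v: (1.0 - beta) / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = beta * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank uniformly (an assumption).
                for t in nodes:
                    new_rank[t] += beta * rank[v] / n
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Hypothetical 3-node graph: a -> b, a -> c, b -> c, c -> a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Node c collects votes from both a and b, so it ends up with the highest rank; the ranks sum to 1.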
PageRank and Pregel
Pregel (Google again)
Based on BSP, Bulk Synchronous Parallel
BSP works in a message-passing style
PageRank and Pregel
During Superstep i, a vertex can:
● use messages received from Superstep i-1
● execute a function
● send messages
● vote to halt
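The superstep loop can be sketched in plain Python. This is a toy single-machine emulation under stated assumptions, not Pregel or GraphX code: `pregel`, `compute` and `propagate_max` are hypothetical names, returning `None` as the outgoing message plays the role of the vote to halt, and a halted vertex wakes up again when it receives a message.

```python
def pregel(vertex_values, edges, compute, max_supersteps=50):
    """Run synchronous supersteps until no vertex is active.

    compute(vertex, value, messages, superstep) -> (new_value, message_or_None)
    """
    neighbors = {v: [] for v in vertex_values}
    for src, dst in edges:
        neighbors[src].append(dst)

    values = dict(vertex_values)
    inbox = {v: [] for v in values}
    active = set(values)  # every vertex is active in superstep 0
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in values}
        for v in active:
            # Use the messages received from the previous superstep.
            values[v], message = compute(v, values[v], inbox[v], superstep)
            if message is not None:
                for n in neighbors[v]:
                    outbox[n].append(message)
        inbox = outbox
        # Only vertices that received a message run in the next superstep.
        active = {v for v, messages in inbox.items() if messages}
        superstep += 1
    return values

def propagate_max(vertex, value, messages, superstep):
    """Spread the largest value seen so far; halt when nothing changes."""
    best = max([value] + messages)
    if superstep == 0 or best > value:
        return best, best
    return value, None

# Hypothetical chain a - b - c - d with edges in both directions.
chain = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"), ("c", "d"), ("d", "c")]
result = pregel({"a": 3, "b": 6, "c": 2, "d": 1}, chain, propagate_max)
```

After a few supersteps every vertex holds the maximum value 6 and all vertices have voted to halt.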
PageRank and Pregel
In GraphX, as usual with Spark, it’s simple:
mapReduceTriplets
PageRank and Pregel
PageRank with Pregel:
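As a rough illustration of PageRank expressed in supersteps, here is a plain-Python sketch. It is not the GraphX code from the slide: the function name, the fixed number of supersteps, the starting rank of 1.0 and the reset probability of 0.15 are all assumptions of this example. Ranks are left unnormalized, so they sum to the number of vertices when no vertex is dangling.

```python
def pagerank_supersteps(edges, num_supersteps=20, reset_prob=0.15):
    """Vertex-centric PageRank: each superstep sends rank / out_degree
    along every edge, sums the messages per vertex, then updates the rank."""
    vertices = {v for edge in edges for v in edge}
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {v: 1.0 for v in vertices}
    for _ in range(num_supersteps):
        # Message phase: every edge carries a share of the source's rank.
        inbox = {v: 0.0 for v in vertices}
        for src, dst in edges:
            inbox[dst] += rank[src] / out_degree[src]
        # Vertex phase: combine the aggregated messages with the reset term.
        rank = {v: reset_prob + (1.0 - reset_prob) * inbox[v] for v in vertices}
    return rank

# Hypothetical 3-node graph: a -> b, a -> c, b -> c, c -> a.
pr = pagerank_supersteps([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
```

The message phase and the vertex phase map onto "send messages" and "execute a function" in the superstep description.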
PageRank and Pregel
Applying it to our USA.csv file:
OpenStreetMap
Founded by Steve Coast (UK, 2004)
Aims to take geodata out of governments’ hands and give it to the crowd
Actually, the crowd has to create it...
OSM
So it’s a Graph!
Node = Vertex
A single point in space defined by its latitude, longitude and node id
Way = Edge
A way can have between 2 and 2,000 nodes
OSM
The network is over-complex for what we need, thus:
● reducing cyclic ways, like roundabouts, to a single one
● transforming the nodes into sections, i.e. pieces of streets between 2 intersections
OSM
Hence, OSM ~ G(Node, Way)
Even if the match is not exact, we can still manipulate the data as a graph.
In our case, we don’t need the connectivity of an intersection, but the connectivity of a section.
This is given by G’ (the dual of G).
Dataset
● 80 cities
● 3M edges in total
● smallest city: 200 edges (Tempe)
● largest city: 200,000 edges (Los Angeles)
● Hypothesis: Cities with similar connectivity have similar PageRank distribution
Comparing Cities
[Figure panels: NYC, Chicago]
Fort Worth = Philadelphia?
Looks the same!
Smells like Spurious Correlation
Normalizing PageRank distributions
● Problem: PageRank is correlated with the size of the city
● size of city = number of sections (edges) in the graph
● Normalized PageRank = PageRank / size_of_city
● Now we can compare cities of different sizes!
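The normalization is a one-liner; a plain-Python sketch with made-up numbers follows. The function name and the sample values are illustrative assumptions, not data from the study.

```python
def normalize_pagerank(pageranks, size_of_city):
    """Divide every PageRank score by the number of sections in the city."""
    if size_of_city <= 0:
        raise ValueError("size_of_city must be positive")
    return [pr / size_of_city for pr in pageranks]

# Hypothetical scores for a city with 4 sections.
normalized = normalize_pagerank([2.0, 4.0, 1.0, 1.0], size_of_city=4)
```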
Fort Worth != Philadelphia!
Totally different!
Fort Worth before and after
Note that the range of PageRank is preserved
Distance between PageRank Distributions
● How to compare PageRank distributions?
● It’s not always a normal distribution!
● Can use the Kullback-Leibler divergence from information theory
● The Kullback-Leibler divergence of Q from P, denoted D_KL(P||Q), is a measure of the information lost when Q is used to approximate P
KL Divergence
● Easy to compute
● Unit is nats (or bits if using log2 instead of ln)
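A sketch of the computation in plain Python, using the standard discrete definition D_KL(P||Q) = Σ p_i ln(p_i / q_i). Both PageRank samples are first binned into histograms over the same range. The bin handling and the small `epsilon` smoothing that avoids division by zero are assumptions of this example, not necessarily what was used for the results on the slides.

```python
import math

def histogram(values, bins, low, high):
    """Count values into equal-width bins over [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins
    for x in values:
        i = int((x - low) / width)
        # Clamp values on or outside the boundaries into the edge bins.
        counts[min(max(i, 0), bins - 1)] += 1
    return counts

def kl_divergence(p_counts, q_counts, epsilon=1e-9):
    """D_KL(P||Q) in nats between two histograms sharing the same bins."""
    p_total = sum(p_counts) + epsilon * len(p_counts)
    q_total = sum(q_counts) + epsilon * len(q_counts)
    total = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + epsilon) / p_total
        q = (qc + epsilon) / q_total
        total += p * math.log(p / q)  # use math.log2 for bits
    return total

# Hypothetical histograms, not the city data.
same = kl_divergence([5, 5], [5, 5])
forward = kl_divergence([5, 5], [9, 1])
backward = kl_divergence([9, 1], [5, 5])
```

Note that the divergence is not symmetric: `forward` and `backward` differ, so the order of the two cities matters.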
Very different cities: Dallas & Seattle
● KL divergence = 18.407
● Dallas is irregular, Seattle is a perfect grid
Very similar cities: Atlanta & Boston
● KL divergence = 0.36
● Both are very irregular
Next steps
● Using multiple street topology indicators to measure the risk of car accidents
Q.E.D
Thanks for keeping up!
Question => Future[(Option[Response], Future[Question])]