GraphX - Stanford University · stanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13


Page 1: GraphX: Graph Analytics in Spark

Ankur Dave, Graduate Student, UC Berkeley AMPLab

Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica

Page 2: Machine Learning Landscape

[Figure: model and dependency classes paired with the architecture suited to each.]

Model & Dependencies: Architecture
Large & Dense: Parameter Server
Sparse: Graph-Parallel
Small & Dense: MapReduce

Page 3: Machine Learning Landscape

[Figure: the same landscape, with the Spark-based systems filled in.]

Model & Dependencies: Architecture
Large & Dense: Parameter Server
Sparse: GraphX
Small & Dense: Spark Dataflow Framework

Page 4: Graphs

Page 5: Social Networks

Page 6: Web Graphs

Page 7: User-Item Graphs

Page 8: Graph Algorithms

Page 9: PageRank

Page 10: Triangle Counting

Page 11: Collaborative Filtering

[Figure: the Users x Products matrix of Ratings is approximated (≈) by the product of a Users factor matrix and a Products factor matrix, with factor vectors f(i) and f(j).]

Page 12: Collaborative Filtering

[Figure: bipartite graph connecting User Factors f(1), f(2) to Product Factors f(3), f(4), f(5) through rating edges r13, r14, r24, r25.]

$$ f[i] = \arg\min_{w \in \mathbb{R}^d} \sum_{j \in \mathrm{Nbrs}(i)} \left( r_{ij} - w^T f[j] \right)^2 + \lambda \|w\|_2^2 $$
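The per-vertex objective above is a ridge regression over the neighbors of vertex i, so it has a closed-form minimizer. The derivation below is not on the slide; it is the standard least-squares result, written with the slide's symbols (I_d, the d x d identity matrix, is the only symbol introduced here).

$$
\begin{aligned}
0 &= \nabla_w \Big[ \sum_{j \in \mathrm{Nbrs}(i)} \left( r_{ij} - w^T f[j] \right)^2 + \lambda \|w\|_2^2 \Big]
   = -2 \sum_{j \in \mathrm{Nbrs}(i)} \left( r_{ij} - w^T f[j] \right) f[j] + 2 \lambda w \\
f[i] &= \Big( \sum_{j \in \mathrm{Nbrs}(i)} f[j] \, f[j]^T + \lambda I_d \Big)^{-1} \sum_{j \in \mathrm{Nbrs}(i)} r_{ij} \, f[j]
\end{aligned}
$$

Each update reads only the ratings on the edges of vertex i and the factors of its neighbors, which is why the computation fits a graph-parallel formulation.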

Pages 13-15: The Graph-Parallel Pattern

Page 16: Many Graph-Parallel Algorithms

Collaborative Filtering
» Alternating Least Squares
» Stochastic Gradient Descent
» Tensor Factorization

Structured Prediction
» Loopy Belief Propagation
» Max-Product Linear Programs
» Gibbs Sampling

Semi-supervised ML
» Graph SSL
» CoEM

Community Detection
» Triangle-Counting
» K-core Decomposition
» K-Truss

Graph Analytics
» PageRank
» Personalized PageRank
» Shortest Path
» Graph Coloring

Classification
» Neural Networks

Page 17: Modern Analytics

[Figure: an analytics pipeline over Raw Wikipedia (XML) with two branches.]
Raw Wikipedia (XML) -> Link Table (Title, Link) -> Hyperlinks -> PageRank -> Top 20 Pages (Title, PR)
Raw Wikipedia (XML) -> Editor Table (Editor, Title) -> Editor Graph -> Community Detection -> User Community (User, Com.) -> Top Communities (Com., PR)

Page 18: Tables

[Figure: the same pipeline, highlighting the stages that are tables.]

Page 19: Graphs

[Figure: the same pipeline, highlighting the stages that are graphs.]

Page 20: The GraphX API

Page 21: Property Graphs

Vertex Property:
•  User Profile
•  Current PageRank Value

Edge Property:
•  Weights
•  Relationships
•  Timestamps

Page 22: Creating a Graph (Scala)

    // VertexId and Edge are provided by org.apache.spark.graphx;
    // simplified definitions are shown here for reference.
    type VertexId = Long

    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(List(
        (1L, "Alice"),
        (2L, "Bob"),
        (3L, "Charlie")))

    case class Edge[ED](
      srcId: VertexId,
      dstId: VertexId,
      attr: ED)

    val edges: RDD[Edge[String]] =
      sc.parallelize(List(
        Edge(1L, 2L, "coworker"),
        Edge(2L, 3L, "friend")))

    val graph = Graph(vertices, edges)

[Figure: the resulting graph, with vertices 1 (Alice), 2 (Bob), 3 (Charlie) and edges 1 -> 2 "coworker" and 2 -> 3 "friend".]

Page 23: Graph Operations (Scala)

    class Graph[VD, ED] {
      // Table Views
      def vertices: RDD[(VertexId, VD)]
      def edges: RDD[Edge[ED]]
      def triplets: RDD[EdgeTriplet[VD, ED]]

      // Transformations
      def mapVertices[VD2](f: (VertexId, VD) => VD2): Graph[VD2, ED]
      def mapEdges[ED2](f: Edge[ED] => ED2): Graph[VD, ED2]
      def reverse: Graph[VD, ED]
      def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                   vpred: (VertexId, VD) => Boolean): Graph[VD, ED]

      // Joins
      def outerJoinVertices[U, VD2]
          (tbl: RDD[(VertexId, U)])
          (f: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]

      // Computation
      def mapReduceTriplets[A](
          sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
          mergeMsg: (A, A) => A): RDD[(VertexId, A)]

Page 24: Built-in Algorithms (Scala)

      // Continued from previous slide
      def pageRank(tol: Double): Graph[Double, Double]
      def triangleCount(): Graph[Int, ED]
      def connectedComponents(): Graph[VertexId, ED]
      // ...and more: org.apache.spark.graphx.lib
    }

[Figures: PageRank, Triangle Count, Connected Components]
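To make concrete what a call such as `pageRank(tol)` computes, here is a minimal plain-Scala sketch of the PageRank iteration on an in-memory edge list. It uses neither Spark nor GraphX and is not the library's implementation. The object name `PageRankSketch`, the reset probability of 0.15, the convention that every rank starts at 1.0, and the example graph are all assumptions made for illustration.

```scala
object PageRankSketch {
  type VertexId = Long

  /** Iterates rank(v) = resetProb + (1 - resetProb) * sum(rank(u) / outDeg(u))
    * over the in-neighbors u of v, until no rank changes by more than tol.
    * Assumes every edge endpoint is listed in `vertices`. */
  def pageRank(
      vertices: Seq[VertexId],
      edges: Seq[(VertexId, VertexId)],
      tol: Double,
      resetProb: Double = 0.15): Map[VertexId, Double] = {
    val outDeg: Map[VertexId, Int] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.size }

    @annotation.tailrec
    def iterate(ranks: Map[VertexId, Double]): Map[VertexId, Double] = {
      // Each edge carries rank(src) / outDeg(src) to its destination.
      val incoming: Map[VertexId, Double] = edges
        .map { case (src, dst) => dst -> ranks(src) / outDeg(src) }
        .groupBy(_._1)
        .map { case (dst, msgs) => dst -> msgs.map(_._2).sum }
      val next: Map[VertexId, Double] = vertices.map { v =>
        v -> (resetProb + (1.0 - resetProb) * incoming.getOrElse(v, 0.0))
      }.toMap
      val delta = vertices.map(v => math.abs(next(v) - ranks(v))).max
      if (delta <= tol) next else iterate(next)
    }

    iterate(vertices.map(v => v -> 1.0).toMap)
  }

  // Hypothetical example: a 3-cycle (1 -> 2 -> 3 -> 1) plus an extra edge 1 -> 3.
  val ranks: Map[VertexId, Double] =
    pageRank(Seq(1L, 2L, 3L), Seq((1L, 2L), (2L, 3L), (3L, 1L), (1L, 3L)), tol = 1e-6)
}
```

In this example vertex 3 receives rank from both other vertices and ends up ranked highest, while vertex 2 receives only half of vertex 1's rank and ends up lowest.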

Page 25: The triplets view

    class Graph[VD, ED] {
      def triplets: RDD[EdgeTriplet[VD, ED]]
    }

    class EdgeTriplet[VD, ED](
      val srcId: VertexId, val dstId: VertexId, val attr: ED,
      val srcAttr: VD, val dstAttr: VD)

[Figure: triplets maps the Graph with vertices 1 (Alice), 2 (Bob), 3 (Charlie) and edges "coworker" and "friend" to the following RDD.]

srcAttr   attr       dstAttr
Alice     coworker   Bob
Bob       friend     Charlie
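The triplets view is a join of each edge with the attributes of its two endpoints. The sketch below models that join with plain Scala collections rather than RDDs. The object name `TripletsSketch` and the use of a `Map` for the vertex table are assumptions for illustration, and the case classes only mirror the shape of the slide's signatures.

```scala
object TripletsSketch {
  type VertexId = Long

  final case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
  final case class EdgeTriplet[VD, ED](
      srcId: VertexId, dstId: VertexId, attr: ED, srcAttr: VD, dstAttr: VD)

  /** Joins each edge with the attributes of its source and destination vertices. */
  def triplets[VD, ED](
      vertices: Map[VertexId, VD],
      edges: Seq[Edge[ED]]): Seq[EdgeTriplet[VD, ED]] =
    edges.map { e =>
      EdgeTriplet(e.srcId, e.dstId, e.attr, vertices(e.srcId), vertices(e.dstId))
    }

  // The three-person graph from the "Creating a Graph" slide.
  val vertices: Map[VertexId, String] =
    Map(1L -> "Alice", 2L -> "Bob", 3L -> "Charlie")
  val edges: Seq[Edge[String]] =
    Seq(Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend"))

  // One row per edge, in the column order (srcAttr, attr, dstAttr).
  val rows: Seq[(String, String, String)] =
    triplets(vertices, edges).map(t => (t.srcAttr, t.attr, t.dstAttr))
}
```

`TripletsSketch.rows` holds one row per edge, which corresponds to the two-row table shown on the slide.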

Page 26: The subgraph transformation

    class Graph[VD, ED] {
      def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                   vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
    }

    graph.subgraph(epred = (edge) => edge.attr != "relative")

[Figure: the input Graph has vertices Alice, Bob, Charlie, David and edges labeled relative, friend, coworker, relative. After subgraph, the output Graph keeps all four vertices and only the friend and coworker edges.]

Page 27: The subgraph transformation

    class Graph[VD, ED] {
      def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                   vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
    }

    graph.subgraph(vpred = (id, name) => name != "Bob")

[Figure: the input Graph has vertices Alice, Bob, Charlie, David and edges labeled relative, friend, coworker, relative. After subgraph, the output Graph keeps Alice, Charlie, David and the two relative edges; Bob and his edges are gone.]
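The two subgraph slides can be reproduced with a small plain-Scala model. This is a sketch, not the GraphX implementation: the predicate here receives a bare `Edge` rather than an `EdgeTriplet`, the object name `SubgraphSketch` is hypothetical, and the edge directions and the exact placement of the "friend" and "coworker" edges are guesses, since the extracted figure does not preserve them.

```scala
object SubgraphSketch {
  type VertexId = Long

  final case class Edge(srcId: VertexId, dstId: VertexId, attr: String)

  final case class Graph(vertices: Map[VertexId, String], edges: Seq[Edge]) {
    /** Keeps vertices passing vpred, then keeps edges that pass epred
      * and whose endpoints both survived the vertex filter. */
    def subgraph(
        epred: Edge => Boolean = _ => true,
        vpred: (VertexId, String) => Boolean = (_, _) => true): Graph = {
      val keptVertices = vertices.filter { case (id, name) => vpred(id, name) }
      val keptEdges = edges.filter { e =>
        keptVertices.contains(e.srcId) && keptVertices.contains(e.dstId) && epred(e)
      }
      Graph(keptVertices, keptEdges)
    }
  }

  // Assumed layout of the slide's four-person graph (directions are guesses).
  val graph: Graph = Graph(
    Map(1L -> "Alice", 2L -> "Bob", 3L -> "Charlie", 4L -> "David"),
    Seq(
      Edge(1L, 2L, "coworker"),
      Edge(2L, 3L, "friend"),
      Edge(1L, 3L, "relative"),
      Edge(3L, 4L, "relative")))

  // Edge predicate only: every vertex stays, "relative" edges are dropped.
  val noRelatives: Graph = graph.subgraph(epred = e => e.attr != "relative")

  // Vertex predicate only: Bob is dropped, and so is every edge touching Bob.
  val noBob: Graph = graph.subgraph(vpred = (_, name) => name != "Bob")
}
```

The second call shows the design point the slide illustrates: a vertex predicate implicitly removes the edges incident to the removed vertices, so the result is always a well-formed graph.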

Page 28: Computation with mapReduceTriplets

    class Graph[VD, ED] {
      def mapReduceTriplets[A](
          sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
          mergeMsg: (A, A) => A): RDD[(VertexId, A)]
    }

    graph.mapReduceTriplets(
      edge => Iterator(
        (edge.srcId, 1),
        (edge.dstId, 1)),
      _ + _)

[Figure: mapReduceTriplets maps the Graph with vertices Alice, Bob, Charlie, David and edges relative, friend, coworker, relative to the following RDD.]

vertex id   degree
Alice       2
Bob         2
Charlie     3
David       1

Note: upgrade to aggregateMessages in Spark 1.2.0.
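The degree computation above follows a send-then-merge pattern: `sendMsg` emits messages along each edge, and `mergeMsg` combines the messages that arrive at the same vertex. The sketch below models that pattern with plain Scala collections. It is not the GraphX implementation; the object name `DegreeSketch`, the numeric vertex ids, and the edge directions are assumptions for illustration.

```scala
object DegreeSketch {
  type VertexId = Long

  final case class EdgeTriplet(srcId: VertexId, dstId: VertexId, attr: String)

  /** Runs sendMsg on every triplet, groups the emitted messages by target
    * vertex, and folds each group with mergeMsg. Vertices that receive no
    * message do not appear in the result. */
  def mapReduceTriplets[A](
      triplets: Seq[EdgeTriplet],
      sendMsg: EdgeTriplet => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): Map[VertexId, A] =
    triplets
      .flatMap(t => sendMsg(t))
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }

  // Assumed ids: 1 = Alice, 2 = Bob, 3 = Charlie, 4 = David.
  val triplets: Seq[EdgeTriplet] = Seq(
    EdgeTriplet(1L, 2L, "coworker"),
    EdgeTriplet(2L, 3L, "friend"),
    EdgeTriplet(1L, 3L, "relative"),
    EdgeTriplet(3L, 4L, "relative"))

  // Each edge sends a 1 to both endpoints; summing the 1s yields the degree.
  val degrees: Map[VertexId, Int] = mapReduceTriplets[Int](
    triplets,
    edge => Iterator((edge.srcId, 1), (edge.dstId, 1)),
    _ + _)
}
```

With the assumed ids, `DegreeSketch.degrees` gives Alice 2, Bob 2, Charlie 3, and David 1, matching the degree table on the slide.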

Page 29: How GraphX Works

Page 30: Encoding Property Graphs as RDDs

[Figure: a Property Graph with vertices A to F is split by a Vertex Cut across Machine 1 and Machine 2 and encoded as three RDDs, divided into Part. 1 and Part. 2.]

Vertex Table (RDD): A, B, C, D, E, F
Edge Table (RDD): A-B, A-C, C-D, B-C, A-E, A-F, E-F, E-D
Routing Table (RDD): for each vertex A to F, the edge partitions (1, 2, or both) that reference it
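The routing table in this encoding can be derived from the partitioned edge table alone. The plain-Scala sketch below computes it for the slide's eight edges. The assignment of the first four edges to partition 1 and the last four to partition 2 is an assumption (the extracted figure lists the edges but not their placement), and the names `RoutingTableSketch`, `routingTable`, and `cutVertices` are hypothetical.

```scala
object RoutingTableSketch {
  type PartitionId = Int

  // Assumed split of the slide's eight edges into two edge partitions.
  val edgePartitions: Map[PartitionId, Seq[(String, String)]] = Map(
    1 -> Seq(("A", "B"), ("A", "C"), ("C", "D"), ("B", "C")),
    2 -> Seq(("A", "E"), ("A", "F"), ("E", "F"), ("E", "D")))

  /** For each vertex, the set of edge partitions holding at least one of its edges. */
  def routingTable(
      parts: Map[PartitionId, Seq[(String, String)]]): Map[String, Set[PartitionId]] =
    parts.toSeq
      .flatMap { case (pid, es) =>
        es.flatMap { case (src, dst) => Seq(src -> pid, dst -> pid) }
      }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  val routing: Map[String, Set[PartitionId]] = routingTable(edgePartitions)

  // Vertices referenced by more than one edge partition are the ones the
  // vertex cut passes through; their attributes must reach every listed partition.
  val cutVertices: Set[String] =
    routing.collect { case (v, ps) if ps.size > 1 => v }.toSet
}
```

Under this assumed placement only A and D span both partitions, so only their attributes need to be shipped to more than one edge partition.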

Page 31: Graph System Optimizations

Specialized Data-Structures
Vertex-Cuts Partitioning
Remote Caching / Mirroring
Message Combiners
Active Set Tracking

Page 32: PageRank Benchmark

[Figure: two charts of Runtime (Seconds), one for the Twitter Graph (42M Vertices, 1.5B Edges) and one for the UK-Graph (106M Vertices, 3.7B Edges). Chart annotations: 7x, 18x.]

EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE

GraphX performs comparably to state-of-the-art graph processing systems.

Page 33: Future of GraphX

1.  Language support
    a)  Java API: PR #3234
    b)  Python API: collaborating with Intel, SPARK-3789

2.  More algorithms
    a)  LDA (topic modeling): PR #2388
    b)  Correlation clustering
    c)  Your algorithm here?

3.  Speculative
    a)  Streaming/time-varying graphs
    b)  Graph database–like queries

Page 34: Thanks!

http://spark.apache.org/graphx