unifying data-parallel and graph-parallel analytics · 2017-09-12 · graphx:! unifying...

48
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica Strata 2014 *These slides are best viewed in PowerPoint with animation.

Upload: others

Post on 04-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

GraphX:���Unifying Data-Parallel and Graph-Parallel Analytics

Presented by Joseph Gonzalez ���Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica Strata 2014

*These slides are best viewed in PowerPoint with animation.

Page 2: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Graphs are Central to Analytics

Raw Wikipedia

< / >!< / >!< / >!XML!

Hyperlinks PageRank Top 20 Pages Title PR

Text Table

Title Body Topic Model

(LDA) Word Topics Word Topic

Editor Graph Community Detection

User Community

User Com.

Term-Doc Graph

Discussion Table

User Disc.

Community Topic

Topic Com.

Page 3: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Update ranks in parallel

Iterate until convergence

Rank of user i Weighted sum of

neighbors’ ranks

3

R[i] = 0.15 +X

j2Nbrs(i)

wjiR[j]

PageRank: Identifying Leaders

Page 4: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

The Graph-Parallel Pattern

4  

Model / Alg. State

Computation depends only on the neighbors

Page 5: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Many Graph-Parallel Algorithms •  Collaborative Filtering

–  Alternating Least Squares –  Stochastic Gradient Descent –  Tensor Factorization

•  Structured Prediction –  Loopy Belief Propagation –  Max-Product Linear Programs –  Gibbs Sampling

•  Semi-supervised ML –  Graph SSL –  CoEM

•  Community Detection –  Triangle-Counting –  K-core Decomposition –  K-Truss

•  Graph Analytics –  PageRank –  Personalized PageRank –  Shortest Path –  Graph Coloring

•  Classification –  Neural Networks

5

Page 6: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Graph-Parallel Systems

6

oogle

Expose specialized APIs to simplify graph programming.

Exploit graph structure to achieve orders-of-

magnitude performance gains over more general ���data-parallel systems.

Page 7: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

PageRank on the Live-Journal Graph

22

354

1340

0 200 400 600 800 1000 1200 1400 1600

GraphLab

Naïve Spark

Mahout/Hadoop

Runtime (in seconds, PageRank for 10 iterations)

GraphLab is 60x faster than Hadoop GraphLab is 16x faster than Spark

Page 8: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Graphs are Central to Analytics

Raw Wikipedia

< / >!< / >!< / >!XML!

Hyperlinks PageRank Top 20 Pages Title PR

Text Table

Title Body Topic Model

(LDA) Word Topics Word Topic

Editor Graph Community Detection

User Community

User Com.

Term-Doc Graph

Discussion Table

User Disc.

Community Topic

Topic Com.

Page 9: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Separate Systems to Support Each View Table View Graph View

Dependency Graph

6. Before

8. After

7. After

Table

Result

Row

Row

Row

Row

Page 10: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Having separate systems ���for each view is ���

difficult to use and inefficient

10  

Page 11: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Difficult to Program and Use

Users must Learn, Deploy, and Manage multiple systems

Leads to brittle and often ���complex interfaces

 11  

Page 12: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Inefficient

12  

Extensive data movement and duplication across ���the network and file system

< / >!< / >!< / >!XML!

HDFS   HDFS   HDFS   HDFS  

Limited reuse internal data-structures ���across stages

Page 13: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Solution: The GraphX Unified Approach

Enabling users to easily and efficiently express the entire graph analytics pipeline

New API Blurs the distinction between

Tables and Graphs

New System Combines Data-Parallel Graph-Parallel Systems

Page 14: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Tables and Graphs are composable ���views of the same physical data

GraphX Unified Representation

Graph View Table View

Each view has its own operators that ���exploit the semantics of the view

to achieve efficient execution

Page 15: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

View a Graph as a Table

Id

Rxin Jegonzal Franklin Istoica

SrcId DstId

rxin jegonzal franklin rxin istoica franklin franklin jegonzal

Property (E)

Friend Advisor

Coworker PI

Property (V)

(Stu., Berk.) (PstDoc, Berk.)

(Prof., Berk) (Prof., Berk)

R  

J  

F  

I  

Property Graph Vertex Property Table

Edge Property Table

Page 16: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Table Operators Table (RDD) operators are inherited from Spark:

16

map

filter

groupBy

sort

union

join

leftOuterJoin

rightOuterJoin

reduce

count

fold

reduceByKey

groupByKey

cogroup

cross

zip

sample

take

first

partitionBy

mapWith

pipe

save

...

Page 17: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ])

// Table Views ----------------- def vertices: Table[ (Id, V) ] def edges: Table[ (Id, Id, E) ] def triplets: Table [ ((Id, V), (Id, V), E) ] // Transformations ------------------------------ def reverse: Graph[V, E] def subgraph(pV: (Id, V) => Boolean,

pE: Edge[V,E] => Boolean): Graph[V,E] def mapV(m: (Id, V) => T ): Graph[T,E] def mapE(m: Edge[V,E] => T ): Graph[V,T] // Joins ---------------------------------------- def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ] def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)] // Computation ---------------------------------- def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)], reduceF: (T, T) => T): Graph[T, E]

}

Graph Operators

17

Page 18: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Triplets Join Vertices and Edges The triplets operator joins vertices and edges:

The mrTriplets operator sums adjacent triplets. SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum FROM triplets AS t GROUPBY t.dstId

Triplets Vertices Edges

B

A

C

D

A B A C B C C D

A B A

B A C B C C D

Page 19: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

F  

E  

Map Reduce Triplets

Map-Reduce for each vertex

D  

B  

A  

C  

mapF( ) A   B  

mapF( ) A   C  

A1  

A2  

reduceF( , ) A1   A2   A  

19

Page 20: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

F  

E  

Example: Oldest Follower

D  

B  

A  

C  What is the age of the oldest follower for each user?

val oldestFollowerAge = graph .mrTriplets( e=> (e.dst.id, e.src.age),//Map (a,b)=> max(a, b) //Reduce ) .vertices

23 42

30

19 75

16 20

Page 21: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

We express the Pregel and GraphLab ���abstractions using the GraphX operators���

in less than 50 lines of code!

21

By composing these operators we can ���construct entire graph-analytics pipelines.

Page 22: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

DIY Demo this Afternoon

Page 23: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

GraphX System Design

Page 24: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Part. 2

Part. 1

Vertex Table

(RDD)

B C

A D

F E

A D

Distributed Graphs as Tables (RDDs)

D

Property Graph

B C

D

E

A A

F

Edge Table (RDD)

A B

A C

C D

B C

A E

A F

E F

E D

B

C

D

E

A

F

Routing Table

(RDD)

B

C

D

E

A

F

1  

2  

1   2  

1   2  

1  

2  

2D Vertex Cut Heuristic

Page 25: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Vertex Table

(RDD)

Caching for Iterative mrTriplets Edge Table

(RDD) A B

A C

C D

B C

A E

A F

E F

E D

Mirror Cache

B C D

A

Mirror Cache

D E F

A

B

C

D

E

A

F

B

C

D

E

A

F

A

D

Page 26: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Vertex Table

(RDD)

Edge Table (RDD)

A B

A C

C D

B C

A E

A F

E F

E D

Mirror Cache

B C D

A

Mirror Cache

D E F

A

Incremental Updates for Iterative mrTriplets

B

C

D

E

A

F

Change A A

Change E

Scan

Page 27: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Vertex Table

(RDD)

Edge Table (RDD)

A B

A C

C D

B C

A E

A F

E F

E D

Mirror Cache

B C D

A

Mirror Cache

D E F

A

Aggregation for Iterative mrTriplets

B

C

D

E

A

F

Change

Change

Scan

Change

Change

Change

Change

Local Aggregate

Local Aggregate

B C

D

F

Page 28: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Reduction in Communication Due to Cached Updates

0.1

1

10

100

1000

10000

0 2 4 6 8 10 12 14 16

Net

wor

k Co

mm

. (M

B)

Iteration

Connected Components on Twitter Graph

Most vertices are within 8 hops���of all vertices in their comp.

Page 29: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Benefit of Indexing Active Edges

0

5

10

15

20

25

30

0 2 4 6 8 10 12 14 16

Runt

ime

(Sec

onds

)

Iteration

Connected Components on Twitter Graph

Scan

Indexed

Scan All Edges

Index of “Active” Edges

Page 30: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Join Elimination Identify and bypass joins for unused triplets fields » Example: PageRank only accesses source attribute

30

0 2000 4000 6000 8000

10000 12000 14000

0 5 10 15 20

Com

mun

icatio

n (M

B)

Iteration

PageRank on Twitter Three Way Join

Join Elimination

Factor of 2 reduction in communication

Page 31: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Additional Query Optimizations

Indexing and Bitmaps: » To accelerate joins across graphs » To efficiently construct sub-graphs

Substantial Index and Data Reuse: » Reuse routing tables across graphs and sub-graphs » Reuse edge adjacency information and indices

31

Page 32: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Performance Comparisons

22 68

207 354

1340

0 200 400 600 800 1000 1200 1400 1600

GraphLab GraphX Giraph

Naïve Spark Mahout/Hadoop

Runtime (in seconds, PageRank for 10 iterations)

GraphX is roughly 3x slower than GraphLab

Live-Journal: 69 Million Edges

Page 33: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

GraphX scales to larger graphs

203

451

749

0 200 400 600 800

GraphLab

GraphX

Giraph

Runtime (in seconds, PageRank for 10 iterations)

GraphX is roughly 2x slower than GraphLab » Scala + Java overhead: Lambdas, GC time, … » No shared memory parallelism: 2x increase in comm.

Twitter Graph: 1.5 Billion Edges

Page 34: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

PageRank is just one stage…. ������

What about a pipeline?

Page 35: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

HDFS HDFS

Compute Spark Preprocess Spark Post.

A Small Pipeline in GraphX

Timed end-to-end GraphX is faster than GraphLab

Raw Wikipedia

< / >!< / >!< / >!XML!

Hyperlinks PageRank Top 20 Pages

342

1492

0 200 400 600 800 1000 1200 1400 1600

GraphLab + Spark GraphX

Giraph + Spark Spark

Total Runtime (in Seconds)

605

375

Page 36: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

The GraphX Stack���(Lines of Code)

GraphX (3575)

Spark

Pregel (28) + GraphLab (50)

PageRank (5)

Connected Comp. (10)

Shortest Path (10)

ALS (40) LDA

(120)

K-core (51) Triangle

Count (45)

SVD (40)

Page 37: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Status Alpha release as part of Spark 0.9

Seeking collaborators and feedback

Page 38: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Conclusion and Observations Domain specific views: Tables and Graphs » tables and graphs are first-class composable objects » specialized operators which exploit view semantics

Single system that efficiently spans the pipeline » minimize data movement and duplication » eliminates need to learn and manage multiple systems

Graphs through the lens of database systems » Graph-Parallel Pattern à Triplet joins in relational alg. » Graph Systems à Distributed join optimizations

38

Page 39: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Active Research Static Data à Dynamic Data » Apply GraphX unified approach to time evolving data » Model and analyze relationships over time

Serving Graph Structured Data » Allow external systems to interact with GraphX » Unify distributed graph databases with relational

database technology

39

Page 40: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Thanks!

[email protected] [email protected]

[email protected] [email protected]

http://amplab.github.io/graphx/

Page 41: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Graph Property 1���Real-World Graphs

41

100 102 104 106 108100

102

104

106

108

1010

degree

count

Top 1% of vertices are adjacent to

50% of the edges!

Num

ber o

f Ver

tices

AltaVista WebGraph1.4B Vertices, 6.6B Edges

Degree

More than 108 vertices ���have one neighbor.

0 20 40 60 80

100 120 140 160 180 200

2008 2009 2010 2011 2012 Ra

tio o

f Edg

es to

Vert

ices

Year

Facebook

Power-Law Degree Distribution Edges >> Vertices

Page 42: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Graph Property 2���Active Vertices

1

10

100

1000

10000

100000

1000000

10000000

100000000

0 10 20 30 40 50 60 70

Num

-Ver

tices

Number of Updates

51% updated only once! PageRank on Web Graph

Page 43: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Graphs are Essential to Data Mining and Machine Learning

Identify influential people and information

Find communities

Understand people’s shared interests

Model complex data dependencies

Page 44: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Ratings Items

Recommending Products Users

Page 45: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Low-Rank Matrix Factorization:

45

r13

r14

r24

r25

f(1)

f(2)

f(3)

f(4)

f(5) Use

r Fac

tors

(U)

Movie Factors (M

) U

sers Movies

Netflix U

sers

≈ x

Movies

f(i)

f(j)

Iterate:

f [i] = arg minw2Rd

X

j2Nbrs(i)

�rij � wT f [j]

�2+ �||w||22

Recommending Products

Page 46: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Liberal Conservative

Post

Post

Post

Post

Post

Post

Post

Post

Predicting User Behavior

Post

Post

Post

Post

Post

Post

Post  

Post  

Post  

Post  

Post  

Post  

Post  

Post  

?  ?  

?  

?  

?  ?  

?  

?   ?  ?  

?  

?  

?  ?  

?  ?  

?  

?  

?  

?  

?  

?  

?  

?  

?  

?  

?  

?   ?  

?  

46  

Conditional Random Field!Belief Propagation!

Page 47: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Count triangles passing through each vertex: "

Measures “cohesiveness” of local community

More Triangles Stronger Community

Fewer Triangles Weaker Community

1 2 3

4

Finding Communities

Page 48: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with

Preprocessing Compute Post Proc.

Example Graph Analytics Pipeline

48  

< / >!< / >!< / >!XML!

Raw Data ETL Slice Compute

Repeat

Subgraph PageRank Initial Graph

Analyze

Top Users