unifying data-parallel and graph-parallel analytics · 2017-09-12 · graphx:! unifying...

GraphX:��Unifying Data-Parallel and Graph-Parallel Analytics

Presented by Joseph Gonzalez ��Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica Strata 2014

*These slides are best viewed in PowerPoint with animation.

Graphs are Central to Analytics

Raw Wikipedia

< / >!< / >!< / >!XML!

Hyperlinks PageRank Top 20 Pages Title PR

Text Table

Title Body Topic Model

(LDA) Word Topics Word Topic

Editor Graph Community Detection

User Community

User Com.

Term-Doc Graph

Discussion Table

User Disc.

Community Topic

Topic Com.

Update ranks in parallel

Iterate until convergence

Rank of user i Weighted sum of

neighbors’ ranks

3

R[i] = 0.15 +X

j2Nbrs(i)

wjiR[j]

PageRank: Identifying Leaders

The Graph-Parallel Pattern

4

Model / Alg. State

Computation depends only on the neighbors

Many Graph-Parallel Algorithms •  Collaborative Filtering

–  Alternating Least Squares –  Stochastic Gradient Descent –  Tensor Factorization

•  Structured Prediction –  Loopy Belief Propagation –  Max-Product Linear Programs –  Gibbs Sampling

•  Semi-supervised ML –  Graph SSL –  CoEM

•  Community Detection –  Triangle-Counting –  K-core Decomposition –  K-Truss

•  Graph Analytics –  PageRank –  Personalized PageRank –  Shortest Path –  Graph Coloring

•  Classification –  Neural Networks

5

Graph-Parallel Systems

6

oogle

Expose specialized APIs to simplify graph programming.

Exploit graph structure to achieve orders-of-

magnitude performance gains over more general ��data-parallel systems.

PageRank on the Live-Journal Graph

22

354

1340

0 200 400 600 800 1000 1200 1400 1600

GraphLab

Naïve Spark

Mahout/Hadoop

Runtime (in seconds, PageRank for 10 iterations)

GraphLab is 60x faster than Hadoop GraphLab is 16x faster than Spark

Graphs are Central to Analytics

Raw Wikipedia

< / >!< / >!< / >!XML!

Hyperlinks PageRank Top 20 Pages Title PR

Text Table

Title Body Topic Model

(LDA) Word Topics Word Topic

Editor Graph Community Detection

User Community

User Com.

Term-Doc Graph

Discussion Table

User Disc.

Community Topic

Topic Com.

Separate Systems to Support Each View Table View Graph View

Dependency Graph

6. Before

8. After

7. After

Table

Result

Row

Row

Row

Row

Having separate systems ��for each view is ��

difficult to use and inefficient

10

Difficult to Program and Use

Users must Learn, Deploy, and Manage multiple systems

Leads to brittle and often ��complex interfaces

11

Inefficient

12

Extensive data movement and duplication across ��the network and file system

< / >!< / >!< / >!XML!

HDFS HDFS HDFS HDFS

Limited reuse internal data-structures ��across stages

Solution: The GraphX Unified Approach

Enabling users to easily and efficiently express the entire graph analytics pipeline

New API Blurs the distinction between

Tables and Graphs

New System Combines Data-Parallel Graph-Parallel Systems

Tables and Graphs are composable ��views of the same physical data

GraphX Unified Representation

Graph View Table View

Each view has its own operators that ��exploit the semantics of the view

to achieve efficient execution

View a Graph as a Table

Id

Rxin Jegonzal Franklin Istoica

SrcId DstId

rxin jegonzal franklin rxin istoica franklin franklin jegonzal

Property (E)

Friend Advisor

Coworker PI

Property (V)

(Stu., Berk.) (PstDoc, Berk.)

(Prof., Berk) (Prof., Berk)

R

J

F

I

Property Graph Vertex Property Table

Edge Property Table

Table Operators Table (RDD) operators are inherited from Spark:

16

map

filter

groupBy

sort

union

join

leftOuterJoin

rightOuterJoin

reduce

count

fold

reduceByKey

groupByKey

cogroup

cross

zip

sample

take

first

partitionBy

mapWith

pipe

save

...

class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ])

// Table Views ----------------- def vertices: Table[ (Id, V) ] def edges: Table[ (Id, Id, E) ] def triplets: Table [ ((Id, V), (Id, V), E) ] // Transformations ------------------------------ def reverse: Graph[V, E] def subgraph(pV: (Id, V) => Boolean,

pE: Edge[V,E] => Boolean): Graph[V,E] def mapV(m: (Id, V) => T ): Graph[T,E] def mapE(m: Edge[V,E] => T ): Graph[V,T] // Joins ---------------------------------------- def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ] def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)] // Computation ---------------------------------- def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)], reduceF: (T, T) => T): Graph[T, E]

}

Graph Operators

17

Triplets Join Vertices and Edges The triplets operator joins vertices and edges:

The mrTriplets operator sums adjacent triplets. SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum FROM triplets AS t GROUPBY t.dstId

Triplets Vertices Edges

B

A

C

D

A B A C B C C D

A B A

B A C B C C D

F

E

Map Reduce Triplets

Map-Reduce for each vertex

D

B

A

C

mapF( ) A B

mapF( ) A C

A1

A2

reduceF( , ) A1 A2 A

19

F

E

Example: Oldest Follower

D

B

A

C What is the age of the oldest follower for each user?

val oldestFollowerAge = graph .mrTriplets( e=> (e.dst.id, e.src.age),//Map (a,b)=> max(a, b) //Reduce ) .vertices

23 42

30

19 75

16 20

We express the Pregel and GraphLab ��abstractions using the GraphX operators��

in less than 50 lines of code!

21

By composing these operators we can ��construct entire graph-analytics pipelines.

DIY Demo this Afternoon

GraphX System Design

Part. 2

Part. 1

Vertex Table

(RDD)

B C

A D

F E

A D

Distributed Graphs as Tables (RDDs)

D

Property Graph

B C

D

E

A A

F

Edge Table (RDD)

A B

A C

C D

B C

A E

A F

E F

E D

B

C

D

E

A

F

Routing Table

(RDD)

B

C

D

E

A

F

1

2

1 2

1 2

1

2

2D Vertex Cut Heuristic

Vertex Table

(RDD)

Caching for Iterative mrTriplets Edge Table

(RDD) A B

A C

C D

B C

A E

A F

E F

E D

Mirror Cache

B C D

A

Mirror Cache

D E F

A

B

C

D

E

A

F

B

C

D

E

A

F

A

D

Vertex Table

(RDD)

Edge Table (RDD)

A B

A C

C D

B C

A E

A F

E F

E D

Mirror Cache

B C D

A

Mirror Cache

D E F

A

Incremental Updates for Iterative mrTriplets

B

C

D

E

A

F

Change A A

Change E

Scan

Vertex Table

(RDD)

Edge Table (RDD)

A B

A C

C D

B C

A E

A F

E F

E D

Mirror Cache

B C D

A

Mirror Cache

D E F

A

Aggregation for Iterative mrTriplets

B

C

D

E

A

F

Change

Change

Scan

Change

Change

Change

Change

Local Aggregate

Local Aggregate

B C

D

F

Reduction in Communication Due to Cached Updates

0.1

1

10

100

1000

10000

0 2 4 6 8 10 12 14 16

Net

wor

k Co

mm

. (M

B)

Iteration

Connected Components on Twitter Graph

Most vertices are within 8 hops��of all vertices in their comp.

Benefit of Indexing Active Edges

0

5

10

15

20

25

30

0 2 4 6 8 10 12 14 16

Runt

ime

(Sec

onds

)

Iteration

Connected Components on Twitter Graph

Scan

Indexed

Scan All Edges

Index of “Active” Edges

Join Elimination Identify and bypass joins for unused triplets fields » Example: PageRank only accesses source attribute

30

0 2000 4000 6000 8000

10000 12000 14000

0 5 10 15 20

Com

mun

icatio

n (M

B)

Iteration

PageRank on Twitter Three Way Join

Join Elimination

Factor of 2 reduction in communication

Additional Query Optimizations

Indexing and Bitmaps: » To accelerate joins across graphs » To efficiently construct sub-graphs

Substantial Index and Data Reuse: » Reuse routing tables across graphs and sub-graphs » Reuse edge adjacency information and indices

31

Performance Comparisons

22 68

207 354

1340

0 200 400 600 800 1000 1200 1400 1600

GraphLab GraphX Giraph

Naïve Spark Mahout/Hadoop


GraphX is roughly 3x slower than GraphLab

Live-Journal: 69 Million Edges

GraphX scales to larger graphs

203

451

749

0 200 400 600 800

GraphLab

GraphX

Giraph


GraphX is roughly 2x slower than GraphLab » Scala + Java overhead: Lambdas, GC time, … » No shared memory parallelism: 2x increase in comm.

Twitter Graph: 1.5 Billion Edges

PageRank is just one stage…. ��

What about a pipeline?

HDFS HDFS

Compute Spark Preprocess Spark Post.

A Small Pipeline in GraphX

Timed end-to-end GraphX is faster than GraphLab

Raw Wikipedia

< / >!< / >!< / >!XML!

Hyperlinks PageRank Top 20 Pages

342

1492

0 200 400 600 800 1000 1200 1400 1600

GraphLab + Spark GraphX

Giraph + Spark Spark

Total Runtime (in Seconds)

605

375

The GraphX Stack��(Lines of Code)

GraphX (3575)

Spark

Pregel (28) + GraphLab (50)

PageRank (5)

Connected Comp. (10)

Shortest Path (10)

ALS (40) LDA

(120)

K-core (51) Triangle

Count (45)

SVD (40)

Status Alpha release as part of Spark 0.9

Seeking collaborators and feedback

Conclusion and Observations Domain specific views: Tables and Graphs » tables and graphs are first-class composable objects » specialized operators which exploit view semantics

Single system that efficiently spans the pipeline » minimize data movement and duplication » eliminates need to learn and manage multiple systems

Graphs through the lens of database systems » Graph-Parallel Pattern à Triplet joins in relational alg. » Graph Systems à Distributed join optimizations

38

Active Research Static Data à Dynamic Data » Apply GraphX unified approach to time evolving data » Model and analyze relationships over time

Serving Graph Structured Data » Allow external systems to interact with GraphX » Unify distributed graph databases with relational

database technology

39

Thanks!

[email protected] [email protected]

[email protected] [email protected]

http://amplab.github.io/graphx/

Graph Property 1��Real-World Graphs

41

100 102 104 106 108100

102

104

106

108

1010

degree

count

Top 1% of vertices are adjacent to

50% of the edges!

Num

ber o

f Ver

tices

AltaVista WebGraph1.4B Vertices, 6.6B Edges

Degree

More than 108 vertices ��have one neighbor.

0 20 40 60 80

100 120 140 160 180 200

2008 2009 2010 2011 2012 Ra

tio o

f Edg

es to

Vert

ices

Year

Facebook

Power-Law Degree Distribution Edges >> Vertices

Graph Property 2��Active Vertices

1

10

100

1000

10000

100000

1000000

10000000

100000000

0 10 20 30 40 50 60 70

Num

-Ver

tices

Number of Updates

51% updated only once! PageRank on Web Graph

Graphs are Essential to Data Mining and Machine Learning

Identify influential people and information

Find communities

Understand people’s shared interests

Model complex data dependencies

Ratings Items

Recommending Products Users

Low-Rank Matrix Factorization:

45

r13

r14

r24

r25

f(1)

f(2)

f(3)

f(4)

f(5) Use

r Fac

tors

(U)

Movie Factors (M

) U

sers Movies

Netflix U

sers

≈ x

Movies

f(i)

f(j)

Iterate:

f [i] = arg minw2Rd

X

j2Nbrs(i)

�rij � wT f [j]

�2+ �||w||22

Recommending Products

Liberal Conservative

Post

Post

Post

Post

Post

Post

Post

Post

Predicting User Behavior

Post

Post

Post

Post

Post

Post

Post

Post

Post

Post

Post

Post

Post

Post

? ?

?

?

? ?

?

? ? ?

?

?

? ?

? ?

?

?

?

?

?

?

?

?

?

?

?

? ?

?

46

Conditional Random Field!Belief Propagation!

Count triangles passing through each vertex: "

Measures “cohesiveness” of local community

More Triangles Stronger Community

Fewer Triangles Weaker Community

1 2 3

4

Finding Communities

Preprocessing Compute Post Proc.

Example Graph Analytics Pipeline

48

< / >!< / >!< / >!XML!

Raw Data ETL Slice Compute

Repeat

Subgraph PageRank Initial Graph

Analyze

Top Users

unifying data-parallel and graph-parallel analytics · 2017-09-12 · graphx:! unifying...

Documents