graphx : unifying data-parallel and graph-parallel analytics

48
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica Strata 2014 *These slides are best viewed in PowerPoint with an

Upload: galvin

Post on 24-Feb-2016

172 views

Category:

Documents


0 download

DESCRIPTION

GraphX : Unifying Data-Parallel and Graph-Parallel Analytics . Presented by Joseph Gonzalez Joint work with Reynold Xin , Daniel Crankshaw , Ankur Dave, Michael Franklin, and Ion Stoica Strata 2014. *These slides are best viewed in P owerPoint with animation. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

GraphX:Unifying Data-Parallel and Graph-Parallel Analytics Presented by Joseph Gonzalez

Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion StoicaStrata 2014

*These slides are best viewed in PowerPoint with animation.

Page 2: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Graphs are Central to Analytics

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PRText

TableTitle Body Topic Model

(LDA) Word TopicsWor

d Topic

Editor GraphCommunityDetection

User Community

User Com.

Term-DocGraph

DiscussionTableUser Disc.

CommunityTopic

TopicCom.

Page 3: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Update ranks in parallel Iterate until convergence

Rank of user i Weighted sum of

neighbors’ ranks

3

PageRank: Identifying Leaders

Page 4: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

The Graph-Parallel Pattern

4

Model / Alg. State

Computation depends only on the

neighbors

Page 5: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Many Graph-Parallel Algorithms

• Collaborative Filtering– Alternating Least

Squares– Stochastic Gradient

Descent– Tensor Factorization

• Structured Prediction– Loopy Belief

Propagation– Max-Product Linear

Programs– Gibbs Sampling

• Semi-supervised ML

– Graph SSL – CoEM

• Community Detection– Triangle-Counting– K-core Decomposition– K-Truss

• Graph Analytics– PageRank– Personalized PageRank– Shortest Path– Graph Coloring

• Classification– Neural Networks

5

Page 6: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Graph-Parallel Systems

6

Pregeloogle

Expose specialized APIs to simplify graph programming.

Exploit graph structure to achieve orders-of-magnitude performance

gains over more general data-parallel systems.

Page 7: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

PageRank on the Live-Journal Graph

GraphLab

Naïve Spark

Mahout/Hadoop

0 200 400 600 800 1000 1200 1400 1600

22

354

1340

Runtime (in seconds, PageRank for 10 iter-ations)

GraphLab is 60x faster than HadoopGraphLab is 16x faster than Spark

Page 8: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Graphs are Central to Analytics

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PRText

TableTitle Body Topic Model

(LDA) Word TopicsWor

d Topic

Editor GraphCommunityDetection

User Community

User Com.

Term-DocGraph

DiscussionTableUser Disc.

CommunityTopic

TopicCom.

Page 9: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Separate Systems to Support Each View

Table View Graph View

Dependency Graph

Pregel

Table

Result

Row

Row

Row

Row

Page 10: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Having separate systems for each view is

difficult to use and inefficient

10

Page 11: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Difficult to Program and UseUsers must Learn, Deploy, and

Manage multiple systems

Leads to brittle and often complex interfaces

11

Page 12: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Inefficient

12

Extensive data movement and duplication across

the network and file system< / >< / >< / >

XML

HDFS HDFS HDFS HDFS

Limited reuse internal data-structures across stages

Page 13: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Solution: The GraphX Unified Approach

Enabling users to easily and efficiently express the entire graph

analytics pipeline

New APIBlurs the distinction between Tables and

Graphs

New SystemCombines Data-Parallel Graph-

Parallel Systems

Page 14: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Tables and Graphs are composable

views of the same physical data

GraphX Unified

Representation

Graph ViewTable View

Each view has its own operators that

exploit the semantics of the view to achieve efficient execution

Page 15: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

View a Graph as a TableId

RxinJegonzalFranklinIstoica

SrcId DstIdrxin jegonzal

franklin

rxin

istoica franklinfrankli

njegonzal

Property (E)Friend

AdvisorCoworker

PI

Property (V)(Stu., Berk.)

(PstDoc, Berk.)(Prof., Berk)(Prof., Berk)

R

J

F

I

Property GraphVertex Property Table

Edge Property Table

Page 16: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Table OperatorsTable (RDD) operators are inherited from Spark:

16

mapfiltergroupBysortunionjoinleftOuterJoinrightOuterJoin

reducecountfoldreduceByKeygroupByKeycogroupcrosszip

sampletakefirstpartitionBymapWithpipesave...

Page 17: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ])

// Table Views -----------------def vertices: Table[ (Id, V) ]def edges: Table[ (Id, Id, E) ]def triplets: Table [ ((Id, V), (Id, V), E) ]// Transformations ------------------------------def reverse: Graph[V, E]def subgraph(pV: (Id, V) => Boolean,

pE: Edge[V,E] => Boolean): Graph[V,E]def mapV(m: (Id, V) => T ): Graph[T,E] def mapE(m: Edge[V,E] => T ): Graph[V,T]// Joins ----------------------------------------def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]// Computation ----------------------------------def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],

reduceF: (T, T) => T): Graph[T, E]}

Graph Operators

17

Page 18: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Triplets Join Vertices and Edges

The triplets operator joins vertices and edges:

The mrTriplets operator sums adjacent triplets.SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum

FROM triplets AS t GROUPBY t.dstId

TripletsVertices Edges

BA

CD

A BA CB CC D

A BAB A C

B CC D

Page 19: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

F

E

Map Reduce TripletsMap-Reduce for each vertex

D

B

A

C

mapF( )A B

mapF( )A C

A1

A2

reduceF( , )A1 A2 A

19

Page 20: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

F

E

Example: Oldest Follower

D

B

A

CWhat is the age of the oldest follower for each user?val oldestFollowerAge = graph .mrTriplets( e=> (e.dst.id, e.src.age),//Map (a,b)=> max(a, b) //Reduce ) .vertices

23 42

30

19 75

1620

Page 21: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

We express the Pregel and GraphLab abstractions using the GraphX

operatorsin less than 50 lines of code!

21

By composing these operators we canconstruct entire graph-analytics

pipelines.

Page 22: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

DIY Demo this Afternoon

Page 23: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

GraphX System Design

Page 24: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Part. 2

Part. 1

Vertex Table (RDD)

B C

A D

F E

A D

Distributed Graphs as Tables (RDDs)

D

Property Graph

B C

D

E

AA

F

Edge Table (RDD)A B

A C

C D

B C

A E

A F

E F

E D

B

C

D

E

A

F

Routing

Table (RDD)

B

C

D

E

A

F

1

2

1 2

1 2

1

2

2D Vertex Cut Heuristic

Page 25: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Vertex Table (RDD)

Caching for Iterative mrTripletsEdge Table

(RDD)A B

A C

C D

B C

A E

A F

E F

E D

MirrorCache

BCD

A

MirrorCache

DEF

A

B

C

D

E

A

F

B

C

D

E

A

F

A

D

Page 26: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Vertex Table (RDD)

Edge Table (RDD)

A B

A C

C D

B C

A E

A F

E F

E D

MirrorCache

BCD

A

MirrorCache

DEF

A

Incremental Updates for Iterative mrTriplets

B

C

D

E

A

F

Change AA

Change E

Scan

Page 27: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Vertex Table (RDD)

Edge Table (RDD)

A B

A C

C D

B C

A E

A F

E F

E D

MirrorCache

BCD

A

MirrorCache

DEF

A

Aggregation for Iterative mrTriplets

B

C

D

E

A

F

Change

Change

Scan

Change

Change

Change

Change

LocalAggregate

LocalAggregate

BC

D

F

Page 28: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Reduction in Communication Due to

Cached Updates

0 2 4 6 8 10 12 14 160.1

110

1001000

10000Connected Components on Twitter Graph

Iteration

Net

wor

k Co

mm

. (M

B)

Most vertices are within 8 hopsof all vertices in their comp.

Page 29: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Benefit of Indexing Active Edges

0 2 4 6 8 10 12 14 1605

1015202530Connected Components on Twitter Graph

ScanIndexed

Iteration

Runt

ime

(Sec

onds

)

Scan All Edges

Index of “Active” Edges

Page 30: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Join EliminationIdentify and bypass joins for unused triplets fields

»Example: PageRank only accesses source attribute

30

0 2 4 6 8 10 12 14 16 18 200

5000

10000

15000PageRank on TwitterThree Way Join

Join Elimination

IterationCom

mun

icat

ion

(MB)

Factor of 2 reduction in communication

Page 31: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Additional Query Optimizations

Indexing and Bitmaps:»To accelerate joins across graphs»To efficiently construct sub-graphs

Substantial Index and Data Reuse:»Reuse routing tables across graphs and

sub-graphs»Reuse edge adjacency information and

indices31

Page 32: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Performance Comparisons

GraphLab

Giraph

Mahout/Hadoop

0 200 400 600 800 1000 1200 1400 160022

68207

3541340

Runtime (in seconds, PageRank for 10 iter-ations)

GraphX is roughly 3x slower than GraphLab

Live-Journal: 69 Million Edges

Page 33: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

GraphX scales to larger graphs

GraphLab

GraphX

Giraph

0 200 400 600 800

203

451

749

Runtime (in seconds, PageRank for 10 iter-ations)

GraphX is roughly 2x slower than GraphLab»Scala + Java overhead: Lambdas, GC time, …»No shared memory parallelism: 2x increase in comm.

Twitter Graph: 1.5 Billion Edges

Page 34: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

PageRank is just one stage….

What about a pipeline?

Page 35: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

HDFSHDFS

ComputeSpark Preprocess Spark Post.

A Small Pipeline in GraphX

Timed end-to-end GraphX is faster than GraphLab

Raw Wikipedia < / >< / >< / >

XML

Hyperlinks PageRank Top 20 Pages

GraphLab + Spark

Giraph + Spark

0 200 400 600 800 1000 1200 1400 1600

342

1492

Total Runtime (in Seconds)

605

375

Page 36: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

The GraphX Stack(Lines of Code)

GraphX (3575)

Spark

Pregel (28) + GraphLab (50)

PageRank (5)

Connected Comp. (10)

Shortest Path (10)

ALS(40) LDA

(120)

K-core(51)

Triangle

Count(45)

SVD(40)

Page 37: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

StatusAlpha release as part of Spark 0.9

Seeking collaborators and feedback

Page 38: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Conclusion and ObservationsDomain specific views: Tables and

Graphs»tables and graphs are first-class

composable objects»specialized operators which exploit view

semantics

Single system that efficiently spans the pipeline

»minimize data movement and duplication»eliminates need to learn and manage

multiple systems

Graphs through the lens of database systems

»Graph-Parallel Pattern Triplet joins in relational alg.

»Graph Systems Distributed join optimizations

38

Page 39: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Active ResearchStatic Data Dynamic Data

»Apply GraphX unified approach to time evolving data

»Model and analyze relationships over time

Serving Graph Structured Data»Allow external systems to interact with

GraphX»Unify distributed graph databases with

relational database technology 39

Page 41: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Graph Property 1Real-World Graphs

41

Top 1% of vertices are adjacent to50% of the

edges!

Num

ber o

f Ve

rtice

s

AltaVista WebGraph1.4B Vertices, 6.6B Edges

Degree

More than 108 vertices have one neighbor.

-Slope = α ≈ 2 2008 2009 2010 2011 20120

20406080

100120140160180200

Facebook

Year

Ratio

of E

dges

to V

erti

ces

Power-Law Degree DistributionEdges >> Vertices

Page 42: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Graph Property 2Active Vertices

0 10 20 30 40 50 60 701

10100

100010000

1000001000000

10000000100000000

Number of Updates

Num

-Ver

tice

s

51% updated only once!PageRank on Web Graph

Page 43: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Graphs are Essential to Data Mining and Machine Learning

Identify influential people and informationFind communitiesUnderstand people’s shared interestsModel complex data dependencies

Page 44: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Ratings Items

Recommending ProductsUsers

Page 45: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Low-Rank Matrix Factorization:

45

r13

r14

r24

r25

f(1)

f(2)

f(3)

f(4)

f(5)User

Fact

ors (

U)

Movie Factors (M

)Us

ers

MoviesNetflix Us

ers≈ x

Movies

f(i)f(j)

Iterate:

Recommending Products

Page 46: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Liberal Conservative

Post

Post

Post

Post

Post

Post

Post

Post

Predicting User Behavior

Post

Post

Post

Post

Post

Post

Post

Post

Post

Post

Post

Post

Post

Post

??

?

?

??

?

? ??

?

?

??

? ?

?

?

?

?

?

?

?

?

?

?

?

? ?

?

46

Conditional Random FieldBelief Propagation

Page 47: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Count triangles passing through each vertex:

Measures “cohesiveness” of local community

More TrianglesStronger Community

Fewer TrianglesWeaker Community

12 3

4

Finding Communities

Page 48: GraphX : Unifying Data-Parallel and Graph-Parallel Analytics

Preprocessing Compute Post Proc.

Example Graph Analytics Pipeline

48

< / >< / >< / >XML

RawData ETL Slice Comput

e

Repeat

Subgraph PageRankInitial Graph

Analyze

TopUsers