Posted on 11-Dec-2015

TRANSCRIPT

Page 1: Carnegie Mellon University Danny Bickson Yucheng Low Aapo Kyrola Carlos Guestrin Joe Hellerstein Alex Smola Parallel Machine Learning for Large-Scale Graphs

Carnegie Mellon University

Parallel Machine Learning for Large-Scale Graphs

The GraphLab Team:
Danny Bickson, Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joe Hellerstein, Alex Smola, Jay Gu, Joseph Gonzalez

Page 2:

Parallelism is Difficult

Wide array of different parallel architectures: GPUs, Multicore, Clusters, Clouds, Supercomputers

Different challenges for each architecture

High Level Abstractions to make things easier

Page 3:

How will we design and implement parallel learning systems?

Page 4:

... a popular answer: Map-Reduce / Hadoop

Build learning algorithms on top of high-level parallel abstractions

Page 5:

Map-Reduce for Data-Parallel ML

Excellent for large data-parallel tasks!

Data-Parallel (Map-Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics

Graph-Parallel: Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso

Page 6:

Example of Graph Parallelism

Page 7:

PageRank Example

Iterate:

R[i] = α + (1 − α) Σ_{j ∈ N(i)} R[j] / L[j]

where α is the random reset probability and L[j] is the number of links on page j.

[Figure: example graph with vertices 1–6.]
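The iteration above can be run directly; here is a minimal Python sketch on a hypothetical six-vertex graph (the graph, the value of `alpha`, and the iteration count are illustrative assumptions, not the slide's data):

```python
# Jacobi-style PageRank iteration: R[i] = alpha + (1 - alpha) * sum_j R[j] / L[j]
# (toy graph; vertices 1..6 as in the slide's figure, links chosen arbitrarily)
alpha = 0.15
links = {1: [2, 3], 2: [3], 3: [1], 4: [3, 5], 5: [6], 6: [4]}  # out-links
# in-neighbors of i are the pages j that link to i
in_nbrs = {i: [j for j in links if i in links[j]] for i in links}

R = {i: 1.0 for i in links}
for _ in range(50):  # contraction factor (1 - alpha) makes this converge
    R = {i: alpha + (1 - alpha) * sum(R[j] / len(links[j]) for j in in_nbrs[i])
         for i in links}
```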

Page 8:

Properties of Graph Parallel Algorithms

Dependency Graph · Iterative Computation (My Rank depends on my Friends' Ranks) · Local Updates

Page 9:

Addressing Graph-Parallel ML

We need alternatives to Map-Reduce

Data-Parallel (Map-Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics

Graph-Parallel (Map-Reduce? Pregel (Giraph)?): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso

Page 10:

Pregel (Giraph)

Bulk Synchronous Parallel Model: Compute → Communicate → Barrier

Page 11:

Problem: Bulk synchronous computation can be highly inefficient

Page 12:

BSP Systems Problem: Curse of the Slow Job

[Figure: three CPUs process data over several iterations; the barrier after each iteration forces every CPU to wait for the slowest one.]

Page 13:

The Need for a New Abstraction

If not Pregel, then what?

Data-Parallel (Map-Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics

Graph-Parallel (Pregel (Giraph)): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso

Page 14:

The GraphLab Solution

Designed specifically for ML needs: expresses data dependencies; iterative.

Simplifies the design of parallel programs: abstracts away hardware issues; automatic data synchronization; addresses multiple hardware architectures: Multicore, Distributed, Cloud computing (GPU implementation in progress).

Page 15:

What is GraphLab?

Page 16:

The GraphLab Framework

Graph Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model

Page 17:

Data Graph

A graph with arbitrary data (C++ objects) associated with each vertex and edge.

Graph: social network
Vertex data: user profile text, current interests estimates
Edge data: similarity weights

Page 18:

Update Functions

An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex:

pagerank(i, scope) {
  // Get neighborhood data (R[i], W_ij, R[j]) from scope

  // Update the vertex data:
  R[i] = α + (1 − α) Σ_{j ∈ N(i)} W_ji R[j]

  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i)
}

Dynamic computation
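The update-function model can be sketched in runnable Python (the function and variable names here are illustrative stand-ins, not the real C++ API; W_ji is taken as 1/L[j]):

```python
from collections import deque

# Sketch of a GraphLab-style update function with dynamic rescheduling.
def pagerank_update(v, rank, out_nbrs, in_nbrs, schedule, alpha=0.15, eps=1e-4):
    """Recompute rank[v] from its in-neighbors; reschedule out-neighbors on change."""
    new_rank = alpha + (1 - alpha) * sum(rank[j] / len(out_nbrs[j])
                                         for j in in_nbrs[v])
    if abs(new_rank - rank[v]) > eps:
        for j in out_nbrs[v]:
            schedule.append(j)  # dynamic computation: only affected vertices re-run
    rank[v] = new_rank

# Toy 3-cycle graph; the process repeats until the scheduler is empty.
out_nbrs = {1: [2], 2: [3], 3: [1]}
in_nbrs = {1: [3], 2: [1], 3: [2]}
rank = {v: 0.0 for v in out_nbrs}
schedule = deque(out_nbrs)
while schedule:
    pagerank_update(schedule.popleft(), rank, out_nbrs, in_nbrs, schedule)
```

On this symmetric cycle every rank converges to 1.0 under the unnormalized α + (1 − α)·sum formulation.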

Page 19:

The Scheduler

The scheduler determines the order in which vertices are updated: CPUs take vertices from the scheduler, run their update functions, and those update functions may in turn schedule new vertices. The process repeats until the scheduler is empty.

[Figure: CPU 1 and CPU 2 pulling vertices a–k from a shared scheduler queue.]

Page 20:

The GraphLab Framework

Graph Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model

Page 21:

Ensuring Race-Free Code

How much can computation overlap?

Page 22:

Need for Consistency?

No consistency gives higher throughput (#updates/sec), but potentially slower convergence of ML.

Page 23:

Inconsistent ALS

[Plot: train RMSE vs. number of updates (0–8M), Netflix data, 8 cores; dynamic consistent updates converge to a lower RMSE than dynamic inconsistent updates.]

Page 24:

Even Simple PageRank can be Dangerous

GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  ...

Page 25:

Inconsistent PageRank

Page 26:

Even Simple PageRank can be Dangerous

GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  ...

Read-write race: CPU 1 reads a bad PageRank estimate while CPU 2 computes the value.

Page 27:

Race Condition Can Be Very Subtle

Unstable version (accumulates through a reference to the shared center value):

GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  ...

Stable version (accumulates locally, writes once):

GraphLab_pagerank(scope) {
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  scope.center_value = sum
  ...

This was actually encountered in user code.
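The difference can be made concrete with a deterministic Python sketch of one bad interleaving (toy values; this simulates the race rather than using real threads):

```python
# Simulate the "unstable" update, which uses the shared center value as its
# accumulator, with a concurrent read happening mid-accumulation (toy values).
ALPHA = 0.15
center = 1.0                  # published PageRank of the center vertex
contribs = [0.5, 0.25]        # neighbor.value / neighbor.num_out_edges terms

center = 0.0                  # unstable version zeroes the shared cell first...
raced_read = center           # ...another CPU reads "the rank" right now: 0.0
for c in contribs:
    center += c
center = ALPHA + (1 - ALPHA) * center

# Stable version: accumulate into a local, publish exactly once.
local = sum(contribs)
stable_value = ALPHA + (1 - ALPHA) * local
# A concurrent reader sees either the old rank (1.0) or stable_value,
# never a partial accumulator like 0.0.
assert raced_read == 0.0 and center == stable_value
```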

Page 28:

GraphLab Ensures Sequential Consistency

For each parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: a parallel execution on CPU 1 and CPU 2 is equivalent, over time, to some sequential execution on a single CPU.]

Page 29:

Consistency Rules

Guaranteed sequential consistency for all update functions

Page 30:

Full Consistency

Page 31:

Obtaining More Parallelism

Page 32:

Edge Consistency

[Figure: under edge consistency, CPU 1 and CPU 2 update vertices whose scopes overlap only in a read; the overlapping read is safe.]

Page 33:

The GraphLab Framework

Graph Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model

Page 34:

Carnegie Mellon University

What algorithms are implemented in GraphLab?

Page 35:

Alternating Least Squares, Splash Sampler, Bayesian Tensor Factorization, Gibbs Sampling, Dynamic Block Gibbs Sampling, Matrix Factorization, Lasso, SVM, Belief Propagation, PageRank, CoEM, K-Means, SVD, LDA, Linear Solvers, …many others…

Page 36:

GraphLab Libraries

Matrix factorization: SVD, PMF, BPTF, ALS, NMF, Sparse ALS, Weighted ALS, SVD++, time-SVD++, SGD
Linear solvers: Jacobi, GaBP, Shotgun Lasso, sparse logistic regression, CG
Clustering: K-means, fuzzy K-means, LDA, K-core decomposition
Inference: discrete BP, NBP, kernel BP

Page 37:

Carnegie Mellon University

Efficient Multicore Collaborative Filtering

ACM KDD CUP Workshop 2011. LeBuSiShu team, 5th place in track 1.

Institute of Automation, Chinese Academy of Sciences; Machine Learning Dept, Carnegie Mellon University

Yao Wu, Qiang Yan, Danny Bickson, Yucheng Low, Qing Yang

Page 38:

Page 39:

ACM KDD CUP 2011

• Task: predict music score
• Two main challenges:
  • Data magnitude: 260M ratings
  • Taxonomy of data

Page 40:

Data taxonomy

Page 41:

Our approach

• Use ensemble method
• Custom SGD algorithm for handling taxonomy

Page 42:

Ensemble method

• Solutions are merged using linear regression
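Merging solutions by linear regression can be sketched with NumPy: fit blend weights on validation data by least squares (the predictions here are synthetic stand-ins, not the actual KDD Cup solutions):

```python
import numpy as np

# Sketch of ensemble blending by linear least squares (synthetic data).
rng = np.random.default_rng(0)
truth = rng.uniform(0, 100, size=1000)             # validation ratings
# Three "solutions": noisy versions of the truth with different biases.
preds = np.column_stack([truth + rng.normal(b, 5, size=1000) for b in (3, -2, 1)])

X = np.column_stack([preds, np.ones(len(truth))])  # add an intercept column
w, *_ = np.linalg.lstsq(X, truth, rcond=None)      # blend weights
blended = X @ w

rmse = lambda p: np.sqrt(np.mean((p - truth) ** 2))
```

Because each individual solution is a feasible blend (weight 1 on itself, 0 elsewhere), the least-squares blend can never have worse validation RMSE than the best single solution.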

Page 43:

Performance results

Blended Validation RMSE: 19.90

Page 44:

Classical Matrix Factorization

[Figure: sparse Users × Items rating matrix factorized into rank-d user and item factor matrices.]

Page 45:

MFITR

[Figure: sparse Users × Items rating matrix; the rank-d item factor is an "effective feature of an item", combining features of the artist, features of the album, and item-specific features.]

Page 46:

Intuitively, features of an artist and features of his/her album should be "similar". How do we express this?

[Figure: Artist → Album → Track hierarchy.]

• Penalty terms ensure Artist/Album/Track features are "close"
• Strength of penalty depends on "normalized rating similarity" (see neighborhood model)
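One hedged way to write such a penalty (the symbols, weights, and exact form here are illustrative, not the team's actual objective): to the usual squared error and regularizer, add terms pulling each track factor toward its album's factor and each album factor toward its artist's factor, scaled by a rating-similarity weight s:

```latex
\min_{U,V}\;\sum_{(u,i)\in\mathcal{R}}\bigl(r_{ui}-\mathbf{u}_u^{\top}\mathbf{v}_i\bigr)^2
+\lambda\bigl(\lVert U\rVert_F^2+\lVert V\rVert_F^2\bigr)
+\gamma\sum_{\text{track }t} s_{t,\mathrm{al}(t)}\,\bigl\lVert\mathbf{v}_t-\mathbf{v}_{\mathrm{al}(t)}\bigr\rVert^2
+\gamma\sum_{\text{album }a} s_{a,\mathrm{ar}(a)}\,\bigl\lVert\mathbf{v}_a-\mathbf{v}_{\mathrm{ar}(a)}\bigr\rVert^2
```

where al(t) is the album of track t and ar(a) is the artist of album a.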

Page 47:

Fine Tuning Challenge

Dataset has around 260M observed ratings; 12 different algorithms with 53 tunable parameters in total. How do we train and cross-validate all these parameters?

USE GRAPHLAB!

Page 48:

16 Cores Runtime

Page 49:

Speedup plots

Page 50:

Page 51:

Carnegie Mellon University

Who is using GraphLab?

Page 52:

Universities using GraphLab

Page 53:

Companies trying out GraphLab

2400+ unique downloads tracked (possibly many more from direct repository checkouts)

Startups using GraphLab

Page 54:

User community

Page 55:

Performance results

Page 56:

GraphLab vs. Pregel (BSP)

Multicore PageRank (25M Vertices, 355M Edges)

[Plots: histogram of number of updates per vertex (51% of vertices updated only once); L1 error vs. runtime (s) and vs. number of updates for GraphLab and Pregel (simulated via GraphLab).]

Page 57:

CoEM (Rosie Jones, 2005)

Named Entity Recognition Task: is "Dog" an animal? Is "Catalina" a place?

[Figure: bipartite graph of noun phrases ("the dog", "Australia", "Catalina Island") and contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant").]

Vertices: 2 million; Edges: 200 million

Hadoop: 95 cores, 7.5 hrs

Page 58:

CoEM (Rosie Jones, 2005)

[Plot: GraphLab CoEM speedup vs. number of CPUs, close to optimal up to 16 CPUs.]

GraphLab: 16 cores, 30 min. Hadoop: 95 cores, 7.5 hrs.

15x faster! 6x fewer CPUs!

Page 59:

Carnegie Mellon

GraphLab in the Cloud

Page 60:

CoEM (Rosie Jones, 2005)

[Plot: speedup vs. number of CPUs for the small and large problems, against optimal.]

GraphLab: 16 cores, 30 min. Hadoop: 95 cores, 7.5 hrs.

GraphLab in the Cloud: 32 EC2 machines, 80 secs (0.3% of the Hadoop time).

Page 61:

Cost-Time Tradeoff

Video co-segmentation results: a few machines help a lot, then diminishing returns. More machines means higher cost but faster runtime.

Page 62:

Netflix Collaborative Filtering

Alternating Least Squares Matrix Factorization. Model: 0.5 million nodes, 99 million edges (Users × Movies).

[Plot: runtime for Hadoop, MPI, and GraphLab against ideal scaling, at D=20 and D=100.]

Page 63:

Multicore Abstraction Comparison

Netflix Matrix Factorization

[Plot: log test error vs. number of updates (0–10M) for dynamic vs. round-robin scheduling.]

Dynamic computation, faster convergence.

Page 64:

The Cost of Hadoop

Page 65:

Carnegie Mellon University

Fault Tolerance

Page 66:

Fault Tolerance

Larger problems mean an increased chance of machine failure.

GraphLab2 introduces two fault-tolerance (checkpointing) mechanisms:
• Synchronous snapshots
• Chandy-Lamport asynchronous snapshots

Page 67:

Synchronous Snapshots

[Timeline: all machines run GraphLab, stop at a barrier, take a snapshot, then resume; repeated over time.]

Page 68:

Curse of the slow machine

[Plot: progress over time with no snapshot vs. synchronous snapshot.]

Page 69:

Curse of the Slow Machine

[Timeline: one slow machine delays the barrier + snapshot, stalling all other machines.]

Page 70:

Curse of the slow machine

[Plot: progress over time with no snapshot, synchronous snapshot, and delayed synchronous snapshot.]

Page 71:

Asynchronous Snapshots

The Chandy-Lamport algorithm is implementable as a GraphLab update function! Requires edge consistency.

struct chandy_lamport {
  void operator()(icontext_type& context) {
    save(context.vertex_data());
    foreach (edge_type edge, context.in_edges()) {
      if (edge.source() was not marked as saved) {
        save(context.edge_data(edge));
        context.schedule(edge.source(), chandy_lamport());
      }
    }
    ... // Repeat for context.out_edges
    Mark context.vertex() as saved;
  }
};
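The snapshot-as-update-function idea can be sketched in runnable Python (the data layout and function names are hypothetical stand-ins for the C++ functor):

```python
from collections import deque

# Sketch of a Chandy-Lamport-style snapshot driven by the scheduler:
# a vertex saves its own data, saves edges to not-yet-saved neighbors,
# and schedules those neighbors to continue the snapshot.
def snapshot_update(v, g, saved_v, saved_e, schedule):
    saved_v.setdefault(v, g["vertex"][v])        # save(context.vertex_data())
    for u in g["in"][v] + g["out"][v]:           # in_edges, then out_edges
        key = tuple(sorted((u, v)))              # undirected edge key
        if u not in saved_v and key not in saved_e:
            saved_e[key] = g["edge"][key]        # save(context.edge_data(edge))
            schedule.append(u)                   # context.schedule(u, ...)

g = {"vertex": {1: "a", 2: "b", 3: "c"},
     "in":  {1: [3], 2: [1], 3: [2]},
     "out": {1: [2], 2: [3], 3: [1]},
     "edge": {(1, 2): 1.0, (2, 3): 2.0, (1, 3): 3.0}}

saved_v, saved_e, schedule = {}, {}, deque([1])  # start the snapshot anywhere
while schedule:
    snapshot_update(schedule.popleft(), g, saved_v, saved_e, schedule)
```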

Page 72:

Snapshot Performance

[Plot: progress over time with no snapshot, synchronous snapshot, and asynchronous snapshot.]

Page 73:

Snapshot with 15s fault injection

Halt 1 out of 16 machines for 15s.

[Plot: progress over time with no snapshot, asynchronous snapshot, and synchronous snapshot under the injected fault.]

Page 74:

New challenges

Page 75:

Natural Graphs: Power Law

Yahoo! Web Graph: 1.4B vertices, 6.7B edges

[Log-log plot of the degree distribution: a "power law".]

Top 1% of vertices is adjacent to 53% of the edges!

Page 76:

Problem: High Degree Vertices

High degree vertices limit parallelism: they touch a large amount of state, require heavy locking, and are processed sequentially.

Page 77:

High Communication in Distributed Updates

[Figure: a high-degree vertex Y with neighbors spread over Machine 1 and Machine 2; data from neighbors is transmitted separately across the network.]

Page 78:

High Degree Vertices are Common

Netflix: popular movies (Users × Movies graph)
"Social" people: e.g. Obama
LDA: common words and shared hyperparameters (Docs × Words graph)

Page 79:

Two Core Changes to Abstraction

1. Factorized update functors: monolithic updates are decomposed into Gather, Apply, and Scatter phases.
2. Delta update functors: monolithic updates become composable update "messages": (f1 ∘ f2)(Y).

Page 80:

Decomposable Update Functors

Gather (user defined): accumulate over the scope's edges; partial results Δ1, Δ2, … are combined with a parallel sum. Locks are acquired only for the region within a scope (relaxed consistency).

Apply (user defined): apply the accumulated value Δ to the center vertex.

Scatter (user defined): update adjacent edges and vertices.

Page 81:

Factorized PageRank

double gather(scope, edge) {
  return edge.source().value().rank /
         scope.num_out_edge(edge.source())
}

double merge(acc1, acc2) { return acc1 + acc2 }

void apply(scope, accum) {
  old_value = scope.center_value().rank
  scope.center_value().rank = ALPHA + (1 - ALPHA) * accum
  scope.center_value().residual =
      abs(scope.center_value().rank - old_value)
}

void scatter(scope, edge) {
  if (scope.center_value().residual > EPSILON)
    reschedule(edge.target())
}
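The gather/merge/apply/scatter decomposition runs end-to-end in this Python sketch (plain dictionaries stand in for the GraphLab scope API; names are illustrative):

```python
# Runnable sketch of factorized gather/merge/apply/scatter PageRank.
ALPHA, EPSILON = 0.15, 1e-4

def gather(rank, out_deg, src):
    return rank[src] / out_deg[src]

def merge(acc1, acc2):
    return acc1 + acc2

def apply_(rank, residual, v, accum):
    old = rank[v]
    rank[v] = ALPHA + (1 - ALPHA) * accum
    residual[v] = abs(rank[v] - old)

def scatter(residual, v, schedule, out_nbrs):
    if residual[v] > EPSILON:
        schedule.extend(out_nbrs[v])

# Toy 3-cycle graph.
out_nbrs = {1: [2], 2: [3], 3: [1]}
in_nbrs = {1: [3], 2: [1], 3: [2]}
out_deg = {v: len(ns) for v, ns in out_nbrs.items()}
rank = {v: 0.0 for v in out_nbrs}
residual = {v: 0.0 for v in out_nbrs}
schedule = list(out_nbrs)
while schedule:
    v = schedule.pop(0)
    accum = 0.0
    for j in in_nbrs[v]:            # individual gathers could run in parallel
        accum = merge(accum, gather(rank, out_deg, j))
    apply_(rank, residual, v, accum)
    scatter(residual, v, schedule, out_nbrs)
```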

Page 82:

Factorized Updates: Significant Decrease in Communication

Split gather and scatter across machines:

[Figure: (F1 ∘ F2)(Y); each machine gathers locally, and only a small amount of data is transmitted over the network.]

Page 83:

Factorized Consistency

Neighboring vertices may be updated simultaneously:

[Figure: vertices A and B gathering at the same time.]

Page 84:

Factorized Consistency Locking

Gather on an edge cannot occur during Apply:

[Figure: vertex B gathers on its other neighbors while A is performing Apply.]

Page 85:

Decomposable Loopy Belief Propagation

Gather: accumulates the product of in-messages
Apply: updates the central belief
Scatter: computes out-messages and schedules adjacent vertices

Page 86:

Decomposable Alternating Least Squares (ALS)

Netflix: sparse Users × Movies rating matrix ≈ W · X, with user factors W and movie factors X.

Update function for a user factor w_i: Gather sums the terms contributed by each rated movie's factor x_j; Apply performs the matrix inversion and multiply.
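The gather/apply split for one user can be sketched with NumPy (synthetic factors and ratings; the ridge term `lam` is an assumption added for invertibility):

```python
import numpy as np

# Sketch of the ALS update for one user: gather sums x_j x_j^T and y_j x_j
# over the user's rated movies; apply inverts and multiplies.
rng = np.random.default_rng(1)
d, lam = 5, 0.1
X = rng.normal(size=(20, d))                 # movie factors of 20 rated movies
y = rng.normal(size=20)                      # the user's observed ratings

A = sum(np.outer(x, x) for x in X)           # gather: sum of x_j x_j^T
b = sum(yj * xj for yj, xj in zip(y, X))     # gather: sum of y_j x_j
w = np.linalg.solve(A + lam * np.eye(d), b)  # apply: matrix inversion & multiply
```

Because the gather terms just add up, they can be computed per-edge and merged in any order, which is exactly what the decomposition exploits.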

Page 87:

Comparison of Abstractions

Multicore PageRank (25M Vertices, 355M Edges)

[Plot: L1 error vs. runtime (s) for GraphLab1 and Factorized Updates.]

Page 88:

Need for Vertex Level Asynchrony

Exploit the commutative-associative "sum": with a purely factorized update, a single changed neighbor still forces a costly full gather!

Page 89:

Commut-Assoc Vertex Level Asynchrony

Exploit the commutative-associative "sum" of neighbor contributions into Y.

Page 90:

Commut-Assoc Vertex Level Asynchrony

Exploit the commutative-associative "sum": when one neighbor changes by Δ, just add + Δ to the sum.

Page 91:

Delta Updates: Vertex Level Asynchrony

Keep the old (cached) sum and fold the new contribution + Δ into it.

Page 92:

Delta Updates: Vertex Level Asynchrony

Deltas propagate to Y directly; the old (cached) sum never has to be regathered.

Page 93:

Delta Update

void update(scope, delta) {
  scope.center_value() = scope.center_value() + delta
  if (abs(delta) > EPSILON) {
    out_delta = delta * (1 - ALPHA) /
                scope.num_out_edge(scope.vertex())
    reschedule_out_neighbors(out_delta)
  }
}

double merge(delta1, delta2) { return delta1 + delta2 }

Program starts with: schedule_all(ALPHA)
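The delta-update scheme runs end-to-end in this Python sketch: each vertex keeps a cached value and applies incoming deltas instead of regathering its neighbors (toy graph; the dict-based scheduler that merges deltas is an illustrative stand-in):

```python
# Runnable sketch of delta-update PageRank.
ALPHA, EPSILON = 0.15, 1e-4

out_nbrs = {1: [2], 2: [3], 3: [1]}
rank = {v: 0.0 for v in out_nbrs}
pending = {v: ALPHA for v in out_nbrs}      # schedule_all(ALPHA)

while pending:
    v, delta = pending.popitem()
    rank[v] += delta                        # apply the delta to the cached value
    if abs(delta) > EPSILON:
        out_delta = delta * (1 - ALPHA) / len(out_nbrs[v])
        for u in out_nbrs[v]:
            # merge(delta1, delta2): deltas to the same vertex just add up
            pending[u] = pending.get(u, 0.0) + out_delta
```

Because the deltas commute and associate, the order in which the scheduler delivers them does not affect the final ranks.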

Page 94:

Multicore Abstraction Comparison

Multicore PageRank (25M Vertices, 355M Edges)

[Plot: L1 error vs. runtime (s) for Delta, Factorized, GraphLab 1, and Simulated Pregel.]

Page 95:

Distributed Abstraction Comparison

Distributed PageRank (25M Vertices, 355M Edges)

[Plots: runtime (s) and total communication (GB) vs. number of machines (8 CPUs per machine, 2–8 machines) for GraphLab1 and GraphLab2 (Delta Updates).]

Page 96:

PageRank on the Altavista Webgraph 2002

1.4B vertices, 6.7B edges

Hadoop: 9000 s, 800 cores
Prototype GraphLab2: 431 s, 512 cores

Known inefficiencies; 2x gain possible.

Page 97:

Summary of GraphLab2

Decomposed update functions (Gather, Apply, Scatter): expose parallelism in high-degree vertices.
Delta update functions: expose asynchrony in high-degree vertices.

Page 98:

Lessons Learned

Machine Learning:
• Asynchronous often much faster than synchronous
• Dynamic computation often faster, though it can be difficult to define optimal thresholds: science to do!
• Consistency can improve performance, and is sometimes required for convergence; though there are cases where relaxed consistency is sufficient

System:
• Distributed asynchronous systems are harder to build, but no distributed barriers == better scalability and performance
• Scaling up by an order of magnitude requires rethinking design assumptions, e.g. distributed graph representation
• High degree vertices & natural graphs can limit parallelism; need further assumptions on update functions

Page 99:

Summary

An abstraction tailored to Machine Learning: targets graph-parallel algorithms.

Naturally expresses data/computational dependencies and dynamic iterative computation.

Simplifies parallel algorithm design: automatically ensures data consistency; achieves state-of-the-art parallel performance on a variety of problems.

Page 100:

Carnegie Mellon

Parallel GraphLab 1.1: Multicore available today.
GraphLab2 (in the Cloud): soon…

http://graphlab.org

Documentation… Code… Tutorials…