
Page 1:

Carnegie Mellon

Machine Learning in the Cloud

Yucheng Low, Aapo Kyrola, Danny Bickson, Joey Gonzalez, Carlos Guestrin, Joe Hellerstein, David O'Hallaron

Page 2:

Machine Learning in the Real World

24 Hours a Minute (YouTube)
13 Million Wikipedia Pages
500 Million Facebook Users
3.6 Billion Flickr Photos

Page 3:

Parallelism is Difficult

Wide array of different parallel architectures: GPUs, Multicore, Clusters, Clouds, Supercomputers.

Different challenges for each architecture.

High-level abstractions make things easier.

Page 4:

MapReduce – Map Phase

Embarrassingly parallel, independent computation. No communication needed.

[Figure: CPUs 1-4 each map one input record to a value (12.9, 42.3, 21.3, 25.8).]

Page 5:

MapReduce – Map Phase

Embarrassingly parallel, independent computation. No communication needed.

[Figure: each CPU maps further records independently.]

Page 6:

MapReduce – Map Phase

Embarrassingly parallel, independent computation. No communication needed.

[Figure: the map phase completes over all records, still with no communication.]

Page 7:

MapReduce – Reduce Phase

Fold / Aggregation: the mapped values are combined by the reduce workers (CPU 1, CPU 2) into aggregate results.

[Figure: the per-record values from the map phase are folded into per-reducer aggregates. A minimal code sketch follows.]
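A minimal, self-contained C++ sketch of the map-then-fold pattern described on these slides (not part of the original deck); score() is a hypothetical per-record map step and the input values are taken from the figure:

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Hypothetical per-record map step: each record is scored independently.
double score(double record) { return record * 2.0; }

int main() {
  // Example input values from the slide's figure.
  std::vector<double> records = {12.9, 42.3, 21.3, 25.8};

  // Map phase: apply score() to every record; embarrassingly parallel,
  // no communication between records is needed.
  std::vector<double> mapped(records.size());
  std::transform(records.begin(), records.end(), mapped.begin(), score);

  // Reduce phase: fold/aggregate the mapped values into a single result.
  double total = std::accumulate(mapped.begin(), mapped.end(), 0.0);
  std::printf("aggregate = %f\n", total);
  return 0;
}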

Page 8:

MapReduce and ML

Excellent for large data-parallel tasks!

Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics.

Complex Parallel Structure: is there more to Machine Learning?

Page 9:

Iterative Algorithms?

We can implement iterative algorithms in MapReduce:

[Figure: each iteration is a full MapReduce pass over the data on CPUs 1-3, separated by barriers; a slow processor holds up every barrier.]

Page 10:

Iterative MapReduce

The system is not optimized for iteration:

[Figure: every iteration additionally pays a startup penalty and a disk penalty between MapReduce passes.]

Page 11:

Iterative MapReduce

Only a subset of the data needs computation (multi-phase iteration):

[Figure: each barrier-separated iteration still runs over all of the data, even when only a few partitions need work.]

Page 12:

MapReduce and ML

Excellent for large data-parallel tasks!

Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics.

Complex Parallel Structure: is there more to Machine Learning?

Page 13:

Structured Problems

Interdependent Computation: Not Map-Reducible

Example Problem: Will I be successful in research? Success depends on the success of others.

May not be able to safely update neighboring nodes [e.g., Gibbs Sampling].

Page 14:

Space of Problems

Sparse Computation Dependencies: can be decomposed into local "computation kernels."

Asynchronous Iterative Computation: repeated iterations over local kernel computations.

Page 15:

Parallel Computing and ML

Not all algorithms are efficiently data-parallel.

Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics.

Structured Iterative Parallel (GraphLab): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, Learning Graphical Models, Lasso, Sampling, ?

Page 16:

GraphLab Goals

Designed for ML needs:
- Express data dependencies
- Iterative

Simplifies the design of parallel programs:
- Abstract away hardware issues
- Addresses multiple hardware architectures: Multicore, Distributed, GPU, and others

Page 17:

GraphLab Goals

[Figure: quadrant chart of Simple vs. Complex Models and Small vs. Large Data; Data-Parallel tools cover simple models on large data ("Now"), the goal is complex models on large data.]

Page 18:

GraphLab Goals

[Figure: the same quadrant chart; GraphLab targets the complex-model, large-data quadrant.]

Page 19:

Carnegie Mellon

GraphLab

A Domain-Specific Abstraction for Machine Learning

Page 20:

Everything on a Graph

A graph with data associated with every vertex and edge.

Page 21:

Update Functions

Update Functions: operations applied on a vertex that transform the data in the scope of that vertex.

Page 22:

Update Functions

An update function can schedule the computation of any other update function. Scheduled computation is guaranteed to execute eventually (see the sketch below).

- FIFO Scheduling
- Prioritized Scheduling
- Randomized
- Etc.
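A minimal sketch of this idea in C++ (the VertexScope and Scheduler types are simplifications invented for the example; this is not the actual GraphLab API):

#include <cstddef>
#include <vector>

// Simplified stand-ins: a scope exposes one vertex's data and its adjacent
// data; a scheduler accepts follow-up work that will eventually execute.
struct VertexScope {
  double& vertex_data;
  std::vector<double*> neighbor_data;
  std::vector<std::size_t> neighbor_ids;
};

struct Scheduler {
  std::vector<std::size_t> pending;  // vertices scheduled for a future update
  void schedule(std::size_t vertex_id) { pending.push_back(vertex_id); }
};

// An update function transforms data in the scope of one vertex and may
// schedule the computation of any other update function.
void example_update(VertexScope& scope, Scheduler& sched) {
  double old_value = scope.vertex_data;
  double sum = 0.0;
  for (double* nbr : scope.neighbor_data) sum += *nbr;
  if (!scope.neighbor_data.empty()) scope.vertex_data = sum / scope.neighbor_data.size();

  // If this vertex changed, reschedule its neighbors.
  if (scope.vertex_data != old_value)
    for (std::size_t id : scope.neighbor_ids) sched.schedule(id);
}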

Page 23:

Example: PageRank

Graph = the Web.

Update Function: multiply the adjacent vertices' PageRank values by the edge weights and combine them to get the current vertex's PageRank.

"Prioritized" PageRank computation? Skip converged vertices (a sketch follows).
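As a concrete illustration, here is a sketch of such a PageRank update on one vertex (illustrative only, not the GraphLab implementation; the Vertex/InEdge layout and the damping factor are assumptions made for the example):

#include <cmath>
#include <cstddef>
#include <vector>

// Assumed layout for the sketch: each vertex stores its rank, and for every
// vertex we keep the ids and weights of its in-neighbors.
struct Vertex { double rank = 1.0; };
struct InEdge { std::size_t src; double weight; };

// Combine the adjacent vertices' PageRank values through the edge weights.
// Returns how much the rank moved, so a prioritized scheduler can skip or
// deprioritize converged vertices.
double pagerank_update(std::size_t v,
                       std::vector<Vertex>& vertices,
                       const std::vector<std::vector<InEdge>>& in_edges,
                       double damping = 0.85) {
  double sum = 0.0;
  for (const InEdge& e : in_edges[v]) sum += e.weight * vertices[e.src].rank;
  double new_rank = (1.0 - damping) + damping * sum;
  double change = std::fabs(new_rank - vertices[v].rank);
  vertices[v].rank = new_rank;
  return change;  // reschedule neighbors only if this exceeds a tolerance
}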

Page 24:

Example: K-Means Clustering

Graph = a (fully connected?) bipartite graph between Data vertices and Cluster vertices.

Update Functions (sketched below):
- Cluster Update: compute the average of the data connected by a "marked" edge.
- Data Update: pick the closest cluster and mark that edge; unmark the remaining edges.
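A sketch of these two update functions (illustrative only; one-dimensional data values and a marked_cluster field stand in for the bipartite graph's marked edges):

#include <cmath>
#include <cstddef>
#include <vector>

struct DataVertex { double value; std::size_t marked_cluster; };
struct ClusterVertex { double center; };

// Data update: pick the closest cluster and mark that edge.
void data_update(DataVertex& d, const std::vector<ClusterVertex>& clusters) {
  std::size_t best = 0;
  for (std::size_t c = 1; c < clusters.size(); ++c)
    if (std::fabs(d.value - clusters[c].center) <
        std::fabs(d.value - clusters[best].center))
      best = c;
  d.marked_cluster = best;
}

// Cluster update: average the data connected on a marked edge.
void cluster_update(std::size_t c, ClusterVertex& cl,
                    const std::vector<DataVertex>& data) {
  double sum = 0.0;
  std::size_t n = 0;
  for (const DataVertex& d : data)
    if (d.marked_cluster == c) { sum += d.value; ++n; }
  if (n > 0) cl.center = sum / n;
}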

Page 25:

Example: MRF Sampling

Graph = MRF.

Update Function:
- Read samples on adjacent vertices
- Read edge potentials
- Compute a new sample for the current vertex

Page 26:

Not Message Passing!

The graph is a data structure. Update functions perform parallel modifications to that data structure.

Page 27:

Safety

What if adjacent update functions occur simultaneously?

Page 28:

Safety

What if adjacent update functions occur simultaneously?

Page 29:

Importance of Consistency

Permit races? "Best-effort" computation? Is ML resilient to soft optimization?

True for some algorithms. Not true for many: it may work empirically on some datasets and fail on others.

Page 30:

Importance of Consistency

Many algorithms require strict consistency, or perform significantly better under strict consistency.

Example: Alternating Least Squares.

Page 31:

Importance of Consistency

Fast ML algorithm development cycle: Build, Test, Debug, Tweak Model.

The framework must behave predictably and consistently, and avoid problems caused by non-determinism. Otherwise: is the execution wrong, or is the model wrong?

Page 32:

Sequential Consistency

GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions which produces the same result.

[Figure: a parallel execution on CPU 1 and CPU 2 over time is equivalent to some sequential execution on a single CPU.]

Page 33:

Sequential Consistency

GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions which produces the same result.

A formalization of the intuitive concept of a "correct program":
- Computation does not read outdated data from the past.
- Computation does not read results of computation that occurs in the future.

Primary Property of GraphLab.

Page 34:

Global Information

What if we need global information?

Sum of all the vertices? Algorithm parameters? Sufficient statistics?

Page 35:

Shared Variables

Global aggregation through the Sync operation:
- A global parallel reduction over the graph data.
- Synced variables are recomputed at defined intervals.
- Sync computation is sequentially consistent, which permits correct interleaving of Syncs and Updates.

Examples: Sync = Sum of Vertex Values; Sync = Log-likelihood. (A sketch follows.)
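A minimal sketch of what such a Sync amounts to, assuming plain double vertex data and a fold functor (this is not the actual GraphLab Sync API):

#include <numeric>
#include <vector>

// A Sync is a global parallel reduction over the graph data whose result is
// published as a shared variable; here it is shown as a simple sequential fold.
struct SumOfVertexValues {
  double operator()(double acc, double vertex_value) const {
    return acc + vertex_value;  // fold one vertex into the running aggregate
  }
};

double run_sync(const std::vector<double>& vertex_data) {
  return std::accumulate(vertex_data.begin(), vertex_data.end(), 0.0,
                         SumOfVertexValues{});
}

int main() {
  std::vector<double> vertex_data = {0.5, 1.5, 2.0};
  double shared_sum = run_sync(vertex_data);  // recomputed at defined intervals
  return shared_sum > 0 ? 0 : 1;
}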

Page 36:

Sequential Consistency

GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of the update functions and Syncs which produces the same result.

[Figure: a parallel execution on CPU 1 and CPU 2 over time is equivalent to some sequential execution on a single CPU.]

Page 37:

Carnegie Mellon

GraphLab in the Cloud

Page 38:

Moving towards the cloud…

Purchasing and maintaining computers is very expensive.

Most computing resources are seldom used; they are needed only for deadlines…

In the cloud: buy time, access hundreds or thousands of processors, and only pay for the resources you need.

Page 39:

Distributed GraphLab Implementation

Mixed multi-threaded / distributed implementation (each machine runs only one instance). Requires all data to be in memory; move computation to the data.

MPI for management + TCP/IP for communication, via an asynchronous C++ RPC layer.

Ran on 64 EC2 HPC Nodes = 512 processors.

Page 40:

Carnegie Mellon

[Figure: the per-machine system stack, replicated across the cluster. On top of the Underlying Network sits an RPC Controller on each machine; above it, the Distributed Graph, Distributed Locks, the Execution Engine with its Execution Threads, and Shared Data in a Cache-Coherent Distributed K-V Store.]

Page 41:

Carnegie Mellon

GraphLab RPC

Page 42:

Write distributed programs easily

- Asynchronous communication
- Multithreaded support
- Fast
- Scalable
- Easy to use (every machine runs the same binary)

Page 43:

Carnegie Mellon

I

C++

Page 44:

Features

Easy RPC capabilities.

One-way calls:

rpc.remote_call([target_machine ID], printf, "%s %d %d %d\n", "hello world", 1, 2, 3);

Requests (call with return value):

vec = rpc.remote_request([target_machine ID], sort_vector, vec);

std::vector<int>& sort_vector(std::vector<int>& v) {
  std::sort(v.begin(), v.end());
  return v;
}

Page 45:

Features

Object Instance Context

MPI-like primitives (with MPI-like safety):
- dc.barrier()
- dc.gather(...)
- dc.send_to([target machine], [arbitrary object])
- dc.recv_from([source machine], [arbitrary object ref])

[Figure: one K-V Object per machine, each communicating through its RPC Controller.]

Page 46:

Request Latency

[Figure: request latency in microseconds vs. value length in bytes for GraphLab RPC and MemCached; ping RTT = 90 us.]

Page 47:

One-Way Call Rate

[Figure: one-way call throughput in Mbps vs. value length in bytes for GraphLab RPC and ICE; 1 Gbps physical peak.]

Page 48:

Serialization Performance

100,000 × one-way calls of a vector of 10 × {"hello", 3.14, 100}.

[Figure: issue and receive time in seconds for ICE, RPC Buffered, and RPC Unbuffered.]

Page 49:

Distributed Computing Challenges

Q1: How do we efficiently distribute the state?
- Potentially varying number of machines

Q2: How do we ensure sequential consistency?

Keeping in mind: limited bandwidth, high latency, performance.

Page 50:

Carnegie Mellon

Distributed Graph

Pages 51-55:

Two-stage Partitioning

- Initial overpartitioning of the graph.
- Generate the atom graph.
- Repartition as needed.

Page 56:

Ghosting

Ghost vertices are copies of neighboring vertices that live on remote machines. Ghost vertices/edges act as a cache for remote data; coherency is maintained using versioning, which decreases bandwidth utilization. (A sketch follows.)
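A small sketch of version-based coherency for ghosts (the GhostCache structure is invented for the example; the point is only that unchanged remote data need not be re-shipped):

#include <cstdint>
#include <unordered_map>

struct VertexRecord { double data; std::uint64_t version; };

struct GhostCache {
  std::unordered_map<std::uint64_t, VertexRecord> ghosts;  // vertex id -> cached copy

  // Compare against the owner's current version; fetch only if stale.
  bool needs_refresh(std::uint64_t vid, std::uint64_t owner_version) const {
    auto it = ghosts.find(vid);
    return it == ghosts.end() || it->second.version < owner_version;
  }

  // Apply an update shipped from the owning machine.
  void apply(std::uint64_t vid, double data, std::uint64_t version) {
    ghosts[vid] = {data, version};
  }
};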

Page 57:

Carnegie Mellon

Distributed Engine

Page 58:

Distributed Engine

Sequential consistency can be guaranteed through distributed locking, a direct analogue to the shared-memory implementation.

To improve performance, the user provides some "expert knowledge" about the properties of the update function.

Page 59:

Full Consistency

User says: the update function modifies all data in scope.

Acquire a write-lock on all vertices in the scope. Limited opportunities for parallelism.

Page 60:

Edge Consistency

User says: the update function only reads from adjacent vertices.

Acquire a write-lock on the center vertex and read-locks on adjacent vertices. More opportunities for parallelism.

Page 61:

Vertex Consistency

User says: the update function touches neither edges nor adjacent vertices.

Acquire a write-lock on the current vertex only. Maximum opportunities for parallelism. (A lock-acquisition sketch follows.)
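The three consistency levels map naturally onto lock acquisition. A sketch assuming one std::shared_mutex per vertex (unlocking, and the canonical lock ordering used to avoid deadlock, are omitted for brevity):

#include <cstddef>
#include <shared_mutex>
#include <vector>

enum class Consistency { Vertex, Edge, Full };

void lock_scope(std::size_t v,
                const std::vector<std::size_t>& neighbors,
                std::vector<std::shared_mutex>& locks,
                Consistency model) {
  locks[v].lock();                  // write-lock on the center vertex (all models)
  if (model == Consistency::Vertex) return;

  for (std::size_t n : neighbors) {
    if (model == Consistency::Edge)
      locks[n].lock_shared();       // edge consistency: read-locks on adjacent vertices
    else
      locks[n].lock();              // full consistency: write-locks on the whole scope
  }
}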

Page 62:

Performance Enhancements

Latency Hiding:
- "Pipelining" of far more update function calls than CPUs (about a 1K-deep pipeline).
- Hides the latency of lock acquisition and cache synchronization.

Lock Strength Reduction:
- A trick whereby the number of locks can be decreased while still providing the same guarantees.

Page 63:

Video Cosegmentation

Segments that "mean the same" are grouped together.

Model: 10.5 million nodes, 31 million edges.

Gaussian EM clustering + BP on a 3D grid.

Page 64:

Speedups

Page 65:

Video Segmentation

Page 66:

Video Segmentation

Page 67:

Chromatic Distributed Engine

Locking overhead is too high in high-degree models. Can we satisfy sequential consistency in a simpler way?

Observation: scheduling using vertex colorings can be used to automatically satisfy consistency.

Page 68:

Example: Edge Consistency

A (distance-1) vertex coloring: update functions can be executed on all vertices of the same color in parallel.

Page 69:

Example: Full Consistency

A (distance-2) vertex coloring: update functions can be executed on all vertices of the same color in parallel.

Page 70:

Example: Vertex Consistency

A (distance-0) vertex coloring: update functions can be executed on all vertices of the same color in parallel.

Page 71:

Chromatic Distributed Engine

[Figure: over time, each machine executes tasks on all vertices of color 0, then waits for data synchronization completion + a barrier; then executes tasks on all vertices of color 1, followed by another synchronization + barrier; and so on. A sketch follows.]
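A compact sketch of that color-by-color loop, with the user update function and the synchronization barrier passed in as callbacks (both assumed for the example):

#include <cstddef>
#include <functional>
#include <vector>

// Vertices are grouped by color. All vertices of one color share no edges,
// so their edge-consistent updates can run in parallel without locks; a data
// synchronization step plus a barrier separates consecutive colors.
void chromatic_engine(const std::vector<std::vector<std::size_t>>& vertices_by_color,
                      int num_iterations,
                      const std::function<void(std::size_t)>& update,
                      const std::function<void()>& sync_and_barrier) {
  for (int iter = 0; iter < num_iterations; ++iter) {
    for (const auto& same_color : vertices_by_color) {
      for (std::size_t v : same_color) update(v);  // safe to run in parallel within a color
      sync_and_barrier();                          // wait for all machines before the next color
    }
  }
}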

Page 72:

Experiments: Netflix Collaborative Filtering

Alternating Least Squares matrix factorization.

Model: 0.5 million nodes, 99 million edges.

[Figure: the Netflix ratings form a bipartite graph of Users and Movies, factorized with latent dimension d.]

Page 73:

Netflix

[Figure: speedup with increasing size of the matrix factorization.]

Page 74:

Netflix

Page 75:

Netflix

Page 76:

Experiments: Named Entity Recognition

CoEM algorithm on a web crawl (part of Tom Mitchell's NELL project).

Model: 2 million nodes, 200 million edges.

The graph is rather dense: a small number of vertices connect to almost all the vertices.

Page 77:

Named Entity Recognition (CoEM)

Page 78:

Named Entity Recognition (CoEM)

Bandwidth Bound

Page 79:

Named Entity Recognition (CoEM)

Page 80:

Future Work

Distributed GraphLab:
- Fault tolerance (spot instances are cheaper)
- Graph using an off-memory store (disk/SSD)
- GraphLab as a database
- Self-optimized partitioning
- Fast data graph construction primitives

GPU GraphLab? Supercomputer GraphLab?

Page 81:

Carnegie Mellon

Is GraphLab the Answer to (Life, the Universe and Everything)?

Probably Not.

Page 82:

Carnegie Mellon

GraphLab

graphlab.ml.cmu.edu: Parallel/Distributed implementation. LGPL (highly probable switch to MPL in a few weeks).

bickson.blogspot.com: very fast matrix factorization implementations, other examples, installation, comparisons, etc. ("Danny Bickson Marketing Agency")

Page 83:

Carnegie Mellon

Questions?

Bayesian Tensor Factorization, Gibbs Sampling, Dynamic Block Gibbs Sampling, Matrix Factorization, Lasso, SVM, Belief Propagation, PageRank, CoEM, SVD, Many Others…

Page 84:

Video Cosegmentation

Naïve idea: treat patches independently. Use Gaussian EM clustering on image features:
- E step: predict the membership of each patch given the cluster centers.
- M step: compute the cluster centers given the memberships of each patch.

Does not take relationships among patches into account!

Page 85:

Video Cosegmentation

Better idea: connect the patches using an MRF. Set edge potentials so that adjacent (spatially and temporally) patches prefer to be of the same cluster.

Gaussian EM clustering with a twist:
- E step: make unary potentials for each patch using the cluster centers; predict the membership of each patch using BP.
- M step: compute the cluster centers given the memberships of each patch.

D. Batra, et al. iCoseg: Interactive co-segmentation with intelligent scribble guidance. CVPR 2010.

Page 86:

Distributed Memory Programming APIs

- MPI
- Global Arrays
- GASnet
- ARMCI
- etc.

…do not make it easy…

Synchronous computation. Insufficient primitives for multi-threaded use. Also, not exactly easy to use…

If all your data is an n-D array…

Direct remote pointer access: severe limitations depending on the system architecture.