TRANSCRIPT
Carnegie Mellon
Machine Learning in the Cloud
Yucheng Low
Aapo Kyrola
Danny Bickson
Joey Gonzalez
Carlos Guestrin
Joe Hellerstein
David O’Hallaron
Machine Learning in the Real World
24 Hours of Video Uploaded to YouTube Every Minute
13 Million Wikipedia Pages
500 Million Facebook Users
3.6 Billion Flickr Photos
Parallelism is Difficult
Wide array of different parallel architectures:
Different challenges for each architecture
GPUs Multicore Clusters Clouds Supercomputers
High-level abstractions make things easier.
MapReduce – Map Phase
Embarrassingly parallel, independent computation on CPUs 1 through 4.
No communication needed.
[Figure: each CPU independently computes a value for its own block of data.]
MapReduce – Reduce Phase
Fold/Aggregation: CPUs 1 and 2 combine the values produced in the map phase.
[Figure: mapped values are aggregated into partial results on each reduce CPU.]
MapReduce and ML
Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics
Complex Parallel Structure: Is there more to Machine Learning?
Iterative Algorithms?
We can implement iterative algorithms in MapReduce:
[Figure: each iteration maps data blocks onto CPUs 1 through 3 and ends at a barrier; a slow processor holds up every barrier.]
Iterative MapReduce
System is not optimized for iteration:
[Figure: every iteration pays a startup penalty and a disk penalty in addition to the barriers.]
Iterative MapReduce
Only a subset of data needs computation (multi-phase iteration):
[Figure: with barriers between phases, every data block is reprocessed each iteration even though only a subset changed.]
MapReduce and ML
Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics
Complex Parallel Structure: Is there more to Machine Learning?
Structured Problems
Interdependent Computation:
Not Map-Reducible
Example Problem: Will I be successful in research?
Success depends on the success of others.
We may not be able to safely update neighboring nodes in parallel [e.g., Gibbs Sampling].
Space of Problems
Asynchronous Iterative Computation: repeated iterations over local kernel computations
Sparse Computation Dependencies: can be decomposed into local "computation-kernels"
GraphLab targets this space: structured iterative parallel computation, beyond the purely data-parallel.
Parallel Computing and ML
Not all algorithms are efficiently data parallel.
Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics
Complex Parallel Structure: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, Learning Graphical Models, Lasso, Sampling
GraphLab Goals
Designed for ML needs: express data dependencies, iterative.
Simplifies the design of parallel programs:
Abstract away hardware issues
Addresses multiple hardware architectures: multicore, distributed, GPU, and others
GraphLab Goals
[Figure: model complexity (Simple Models to Complex Models) vs. data size (Small Data to Large Data). Data-parallel frameworks cover simple models on large data ("Now"); the goal, and GraphLab's target, is complex models on large data.]
Carnegie Mellon
GraphLab
A Domain-Specific Abstraction for Machine Learning
Everything on a Graph
A graph with data associated with every vertex and edge.
Update Functions
Update Functions: operations applied on a vertex that transform the data in the scope of the vertex.
Update Functions
An update function can schedule the computation of any other update function.
Scheduled computation is guaranteed to execute eventually.
Schedulers: FIFO scheduling, prioritized scheduling, randomized, etc.
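A minimal sketch may make the model concrete. The following is a toy, sequential rendering of the abstraction, not the GraphLab API: data lives on vertices, an update function transforms the data in the scope of one vertex, and it may schedule other vertices, here through a plain FIFO queue. The damped neighbor-averaging update and the work budget are purely illustrative.

#include <cmath>
#include <cstddef>
#include <deque>
#include <vector>

struct Vertex { double value; std::vector<std::size_t> neighbors; };

int main() {
  // A tiny graph with data on every vertex.
  std::vector<Vertex> graph = {
    {1.0, {1, 2}}, {4.0, {0, 3}}, {2.0, {0}}, {8.0, {1}} };

  std::deque<std::size_t> fifo = {0, 1, 2, 3};   // FIFO scheduler
  std::size_t budget = 10000;                    // safety cap for the toy example

  while (!fifo.empty() && budget-- > 0) {
    std::size_t v = fifo.front(); fifo.pop_front();

    // "Update function": read data in the scope of v, write v's data,
    // and schedule neighbors if v changed enough (eventual execution).
    double sum = 0;
    for (std::size_t u : graph[v].neighbors) sum += graph[u].value;
    double next = 0.5 * graph[v].value + 0.5 * sum / graph[v].neighbors.size();
    if (std::fabs(next - graph[v].value) > 1e-6)
      for (std::size_t u : graph[v].neighbors) fifo.push_back(u);
    graph[v].value = next;
  }
  return 0;
}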
Example: PageRank
Graph = WWW
Update Function: multiply adjacent PageRank values by the edge weights and sum them to get the current vertex's PageRank.
"Prioritized" PageRank computation? Skip converged vertices.
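As a sketch only (plain C++ over an adjacency list, not the GraphLab API), one PageRank update might look like the following; the 0.15/0.85 damping constants are the standard choice and the convergence tolerance is illustrative.

#include <cmath>
#include <cstddef>
#include <vector>

struct InLink { std::size_t source; double weight; };
struct Page   { double rank; std::vector<InLink> in_links; };

// Recompute one vertex's PageRank from its in-neighbors; return true if the
// change was large enough that neighbors should be rescheduled (the
// "prioritized" variant skips vertices that have already converged).
bool pagerank_update(std::vector<Page>& web, std::size_t v, double tol = 1e-5) {
  double sum = 0;
  for (const InLink& e : web[v].in_links)
    sum += e.weight * web[e.source].rank;   // weighted neighbor PageRank
  double next = 0.15 + 0.85 * sum;
  bool changed = std::fabs(next - web[v].rank) > tol;
  web[v].rank = next;
  return changed;
}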
Example: K-Means Clustering
Graph: a (fully connected?) bipartite graph between Data vertices and Cluster vertices.
Update Functions:
Cluster Update: compute the average of the data connected on a "marked" edge.
Data Update: pick the closest cluster and mark that edge; unmark the remaining edges.
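The two update functions can be sketched in plain C++, with flat arrays standing in for the bipartite graph; the 2-D Point type and the "assignment" array (which plays the role of the marked edges) are illustrative, not the talk's code.

#include <cstddef>
#include <limits>
#include <vector>

struct Point { double x, y; };

// "Data update": pick the closest cluster center; marking the edge to that
// cluster is represented here by returning its index.
std::size_t data_update(const Point& p, const std::vector<Point>& centers) {
  std::size_t best = 0;
  double best_d = std::numeric_limits<double>::max();
  for (std::size_t k = 0; k < centers.size(); ++k) {
    double dx = p.x - centers[k].x, dy = p.y - centers[k].y;
    double d = dx * dx + dy * dy;
    if (d < best_d) { best_d = d; best = k; }
  }
  return best;
}

// "Cluster update": average of the data points whose marked edge points here.
Point cluster_update(std::size_t k, const std::vector<Point>& pts,
                     const std::vector<std::size_t>& assignment) {
  Point c{0, 0};
  std::size_t n = 0;
  for (std::size_t i = 0; i < pts.size(); ++i)
    if (assignment[i] == k) { c.x += pts[i].x; c.y += pts[i].y; ++n; }
  if (n) { c.x /= n; c.y /= n; }
  return c;
}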
Example: MRF Sampling
Graph = MRF
Update Function: read samples on adjacent vertices, read edge potentials, compute a new sample for the current vertex.
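For a concrete, if simplified, instance of such an update, here is a Gibbs-sampling step for a binary pairwise MRF in plain C++ (not the GraphLab API); the single "coupling" parameter stands in for the edge potentials and rewards agreeing with a neighbor's sample.

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

int gibbs_update(std::size_t v,
                 const std::vector<int>& sample,                       // current samples in {0,1}
                 const std::vector<std::vector<std::size_t>>& nbrs,    // adjacency list
                 double coupling,                                      // agreement potential
                 std::mt19937& rng) {
  // Unnormalized log-potentials for x_v = 0 and x_v = 1 given neighbor samples.
  double logp[2] = {0.0, 0.0};
  for (std::size_t u : nbrs[v]) logp[sample[u]] += coupling;
  // P(x_v = 1 | neighbors) = exp(logp[1]) / (exp(logp[0]) + exp(logp[1])).
  double p1 = 1.0 / (1.0 + std::exp(logp[0] - logp[1]));
  return std::bernoulli_distribution(p1)(rng) ? 1 : 0;
}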
Not Message Passing!
The graph is a data structure. Update functions perform parallel modifications to the data structure.
Safety
What if adjacent update functions occur simultaneously?
Importance of Consistency
Permit Races? “Best-effort” computation?
ML resilient to soft-optimization?
True for some algorithms.
Not true for many. May work empirically on some datasets; may fail on others.
Importance of Consistency
Many algorithms require strict consistency, or perform significantly better under strict consistency.
Alternating Least Squares
Importance of Consistency
Fast ML Algorithm development cycle:
Build
Test
Debug
Tweak Model
The framework must behave predictably and consistently and avoid problems caused by non-determinism. Is the execution wrong, or is the model wrong?
Sequential Consistency
GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions that produces the same result.
[Figure: a parallel execution on CPUs 1 and 2 versus an equivalent sequential execution on CPU 1, over time.]
Sequential Consistency
GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions that produces the same result.
A formalization of the intuitive concept of a "correct program":
- Computation does not read outdated data from the past.
- Computation does not read results of computation that occurs in the future.
This is the primary property of GraphLab.
Global Information
What if we need global information?
Sum of all the vertices?
Algorithm Parameters?
Sufficient Statistics?
Shared Variables
Global aggregation through the Sync operation: a global parallel reduction over the graph data.
Synced variables are recomputed at defined intervals.
Sync computation is sequentially consistent, which permits correct interleaving of Syncs and Updates.
Examples: Sync: sum of vertex values; Sync: log-likelihood.
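A sync can be pictured as a fold over the graph data whose per-worker partial results are merged, roughly like the following plain C++ sketch (sequential here, and not the GraphLab sync API; GraphLab runs it as a parallel reduction at the configured interval).

#include <cstddef>
#include <vector>

struct Accum { double sum = 0.0; std::size_t count = 0; };

// Fold one worker's slice of vertex values into a partial accumulator.
Accum fold_slice(const std::vector<double>& vertex_values,
                 std::size_t begin, std::size_t end) {
  Accum a;
  for (std::size_t i = begin; i < end; ++i) { a.sum += vertex_values[i]; ++a.count; }
  return a;
}

// Merge two partial accumulators (e.g., from different workers).
Accum merge(Accum a, const Accum& b) {
  a.sum += b.sum;
  a.count += b.count;
  return a;
}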
Sequential Consistency
GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions and Syncs that produces the same result.
[Figure: a parallel execution on CPUs 1 and 2 versus an equivalent sequential execution, over time.]
Carnegie Mellon
GraphLab in the Cloud
Moving towards the cloud…
Purchasing and maintaining computers is very expensive
Most computing resources are seldom used
Only for deadlines…
Buy time, access hundreds or thousands of processors
Only pay for needed resources
Distributed GL Implementation
Mixed multi-threaded / distributed implementation (each machine runs only one instance).
Requires all data to be in memory; move computation to data.
MPI for management + TCP/IP for communication; asynchronous C++ RPC layer.
Ran on 64 EC2 HPC nodes = 512 processors.
Carnegie Mellon
Underlying Network
[Figure: each machine runs an RPC controller on the underlying network; on top sit the distributed graph, distributed locks, a cache-coherent distributed K-V store holding shared data, and the execution engine with its execution threads.]
Carnegie Mellon
GraphLab RPC
Write distributed programs easily
Asynchronous communication
Multithreaded support
Fast
Scalable
Easy To Use
(Every machine runs the same binary)
Carnegie Mellon
I ♥ C++
Features
Easy RPC capabilities.

One-way calls:
rpc.remote_call([target_machine ID], printf, "%s %d %d %d\n", "hello world", 1, 2, 3);

Requests (call with return value):
vec = rpc.remote_request([target_machine ID], sort_vector, vec);

std::vector<int>& sort_vector(std::vector<int>& v) {
  std::sort(v.begin(), v.end());
  return v;
}
Features
Object instance context.
MPI-like primitives:
dc.barrier()
dc.gather(...)
dc.send_to([target machine], [arbitrary object])
dc.recv_from([source machine], [arbitrary object ref])
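As a usage sketch using only the primitives named above: machine 0 sends an object to every other machine, which receives it, and everyone synchronizes. The DC template parameter stands in for the RPC controller type, and the machine id is passed in rather than queried, so nothing beyond the listed calls is assumed.

#include <string>

template <typename DC>
void broadcast_params(DC& dc, int my_id, int num_machines, std::string& params) {
  if (my_id == 0) {
    for (int m = 1; m < num_machines; ++m)
      dc.send_to(m, params);      // send the object to each other machine
  } else {
    dc.recv_from(0, params);      // receive into the object reference
  }
  dc.barrier();                   // all machines wait until the exchange completes
}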
MPI-Like Safety
[Figure: a K-V object and an RPC controller on each machine.]
Request Latency
[Figure: request latency (microseconds) vs. value length (bytes) for GraphLab RPC and MemCached; ping RTT = 90 us.]
One-Way Call Rate
[Figure: one-way call throughput (Mbps) vs. value length (bytes) for GraphLab RPC and ICE; 1 Gbps physical peak.]
Serialization Performance
100,000 one-way calls of a vector of 10 x {"hello", 3.14, 100}.
[Figure: issue and receive times (seconds) for ICE RPC, Buffered RPC, and Unbuffered RPC.]
Distributed Computing Challenges
Q1: How do we efficiently distribute the state?
- Potentially varying #machines
Q2: How do we ensure sequential consistency?
Keeping in mind: limited bandwidth, high latency, performance.
Carnegie Mellon
Distributed Graph
Two-stage Partitioning
Initial overpartitioning of the graph
Generate atom graph
Repartition as needed
Ghosting
Ghost vertices are copies of neighboring vertices that live on remote machines.
Ghost vertices/edges act as a cache for remote data; coherency is maintained using versioning, which decreases bandwidth utilization.
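A minimal sketch of the versioning idea (illustrative types, not the GraphLab data structures): the ghost keeps the version it last saw, and the data is refetched only when the owner's version has advanced.

struct VertexCopy { long version; double data; };

// Refresh a local ghost copy only when the owner's copy is newer; otherwise the
// cached ghost is used and no data needs to cross the network.
void refresh_ghost(VertexCopy& ghost, const VertexCopy& owner) {
  if (owner.version > ghost.version) ghost = owner;
}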
Carnegie Mellon
Distributed Engine
Sequential consistency can be guaranteed through distributed locking, a direct analogue to the shared-memory implementation.
To improve performance, the user provides some "expert knowledge" about the properties of the update function.
Full Consistency
User says: update function modifies all data in scope.
Limited opportunities for parallelism.
Acquire write-lock on all vertices.
Edge Consistency
User: update function only reads from adjacent vertices.
More opportunities for parallelism.
Acquire write-lock on center vertex, read-lock on adjacent.
Vertex Consistency
User: update function does not touch edges or adjacent vertices.
Maximum opportunities for parallelism.
Acquire write-lock on current vertex.
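A sketch of edge-consistency locking for one update (illustrative, not the GraphLab implementation): write-lock the center vertex, read-lock the adjacent ones, and acquire all locks in a canonical order so that overlapping scopes cannot deadlock.

#include <algorithm>
#include <cstddef>
#include <shared_mutex>
#include <vector>

void lock_edge_scope(std::size_t center, std::vector<std::size_t> adjacent,
                     std::vector<std::shared_mutex>& vertex_locks) {
  adjacent.push_back(center);
  std::sort(adjacent.begin(), adjacent.end());   // canonical order avoids deadlock
  for (std::size_t v : adjacent) {
    if (v == center) vertex_locks[v].lock();          // write lock on the center vertex
    else             vertex_locks[v].lock_shared();   // read locks on adjacent vertices
  }
  // (the caller runs the update, then releases the locks in reverse order)
}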
Performance Enhancements
Latency Hiding: "pipelining" of many more update function calls than #CPUs (a pipeline about 1K deep) hides the latency of lock acquisition and cache synchronization.
Lock Strength Reduction: a trick by which the number of locks can be decreased while still providing the same guarantees.
Video Cosegmentation
Segments mean the same
Model: 10.5 million nodes, 31 million edges
Gaussian EM clustering + BP on 3D grid
Speedups
[Figure: video segmentation speedup plots.]
Chromatic Distributed Engine
Locking overhead is too high in high-degree models. Can we satisfy sequential consistency in a simpler way?
Observation: scheduling using vertex colorings can be used to automatically satisfy consistency.
Example: Edge Consistency
(distance 1) vertex coloring
Update functions can be executed on all vertices of the same color in parallel.
Example: Full Consistency
(distance 2) vertex coloring
Update functions can be executed on all vertices of the same color in parallel.
Example: Vertex Consistency
(distance 0) vertex coloring
Update functions can be executed on all vertices of the same color in parallel.
Chromatic Distributed Engine
Over time:
Execute tasks on all vertices of color 0
Data synchronization completion + barrier
Execute tasks on all vertices of color 1
Data synchronization completion + barrier
...
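A compact sketch of the idea (plain C++, not the GraphLab engine): greedily color the graph, then run one color at a time with a barrier in between. With a distance-1 coloring, no two adjacent vertices ever execute in the same phase, which yields edge consistency.

#include <cstddef>
#include <set>
#include <vector>

// Greedy distance-1 coloring: each vertex takes the smallest color not used
// by an already-colored neighbor.
std::vector<int> greedy_color(const std::vector<std::vector<std::size_t>>& nbrs) {
  std::vector<int> color(nbrs.size(), -1);
  for (std::size_t v = 0; v < nbrs.size(); ++v) {
    std::set<int> used;
    for (std::size_t u : nbrs[v]) if (color[u] >= 0) used.insert(color[u]);
    int c = 0;
    while (used.count(c)) ++c;
    color[v] = c;
  }
  return color;
}

// Execution skeleton: for each color, update every vertex of that color (these
// could all run in parallel), then synchronize data and barrier.
template <typename UpdateFn, typename BarrierFn>
void run_chromatic(const std::vector<int>& color, int num_colors,
                   UpdateFn update, BarrierFn barrier) {
  for (int c = 0; c < num_colors; ++c) {
    for (std::size_t v = 0; v < color.size(); ++v)
      if (color[v] == c) update(v);
    barrier();   // data synchronization completion + barrier
  }
}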
Experiments: Netflix Collaborative Filtering
Alternating Least Squares matrix factorization.
Model: 0.5 million nodes, 99 million edges.
Bipartite graph between Users and Movies, with latent factors of dimension d.
[Figure: Netflix speedup plots, for increasing size d of the matrix factorization.]
Experiments: Named Entity Recognition (CoEM)
CoEM algorithm (part of Tom Mitchell's NELL project), run on a web crawl.
Model: 2 million nodes, 200 million edges.
The graph is rather dense: a small number of vertices connect to almost all the vertices.
[Figure: CoEM speedup plots; performance becomes bandwidth bound.]
Future Work
Distributed GraphLab:
Fault tolerance; spot instances (cheaper)
Graph using off-memory store (disk/SSD)
GraphLab as a database
Self-optimized partitioning
Fast data graph construction primitives
GPU GraphLab? Supercomputer GraphLab?
Carnegie Mellon
Is GraphLab the Answer to (Life, the Universe, and Everything)?
Probably Not.
Carnegie Mellon
GraphLab
graphlab.ml.cmu.edu: parallel/distributed implementation
LGPL (highly probable switch to MPL in a few weeks)
bickson.blogspot.com: very fast matrix factorization implementations, other examples, installation, comparisons, etc.
Danny Bickson Marketing Agency
Carnegie Mellon
Questions?
Bayesian Tensor Factorization
Gibbs Sampling
Dynamic Block Gibbs Sampling
Matrix Factorization
Lasso
SVM
Belief Propagation
PageRank
CoEM
Many Others…
SVD
Video Cosegmentation
Naïve Idea: treat patches independently.
Use Gaussian EM clustering (on image features):
E step: predict membership of each patch given cluster centers.
M step: compute cluster centers given memberships of each patch.
Does not take relationships among patches into account!
Video Cosegmentation
Better Idea: connect the patches using an MRF. Set edge potentials so that adjacent (spatially and temporally) patches prefer to be of the same cluster.
Gaussian EM clustering with a twist:
E step: make unary potentials for each patch using cluster centers; predict membership of each patch using BP.
M step: compute cluster centers given memberships of each patch.
D. Batra, et al. iCoseg: Interactive co-segmentation with intelligent scribble guidance. CVPR 2010.
Distributed Memory Programming APIs
• MPI • Global Arrays • GASnet • ARMCI • etc.
…do not make it easy…
Synchronous computation. Insufficient primitives for multi-threaded use. Also, not exactly easy to use…
If all your data is an n-D array
Direct remote pointer access. Severe limitations depending on system architecture.