
Carnegie Mellon

Machine Learning in the Cloud

Yucheng Low

Aapo Kyrola

Danny Bickson

Joey Gonzalez

Carlos Guestrin

Joe Hellerstein

David O'Hallaron

Machine Learning in the Real World

24 Hours a Minute (YouTube)

13 Million Wikipedia Pages

500 Million Facebook Users

3.6 Billion Flickr Photos

Parallelism is Difficult

Wide array of different parallel architectures:

Different challenges for each architecture


GPUs Multicore Clusters Clouds Supercomputers

High Level Abstractions to make things easier.

MapReduce – Map Phase

Embarrassingly parallel, independent computation. No communication needed.

[Figure: four CPUs (CPU 1-4) each independently map their records to values, e.g. 12.9, 42.3, 21.3, 25.8, ...]

MapReduce – Reduce Phase

Fold/Aggregation.

[Figure: two CPUs fold the mapped values into aggregates, e.g. 2226.26 and 1726.31.]
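A minimal, self-contained C++ sketch of this map-then-fold pattern (the per-record computation and the data are illustrative; each map call is independent, so the calls can be spread across CPUs, while the fold aggregates the results):

    #include <numeric>
    #include <vector>

    // Map phase: an embarrassingly parallel, per-record computation.
    // No communication between records is needed.
    std::vector<double> map_phase(const std::vector<double>& records) {
        std::vector<double> mapped;
        mapped.reserve(records.size());
        for (double r : records)
            mapped.push_back(r * r);   // illustrative per-record work
        return mapped;
    }

    // Reduce phase: fold/aggregate the mapped values into one result.
    double reduce_phase(const std::vector<double>& mapped) {
        return std::accumulate(mapped.begin(), mapped.end(), 0.0);
    }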

MapReduce and ML

Excellent for large data-parallel tasks!


Is there more to Machine Learning?

Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics

Complex Parallel Structure: ?

Iterative Algorithms?

We can implement iterative algorithms in MapReduce:

[Figure: in each iteration the data is partitioned across CPU 1-3, computed, and collected again; a barrier separates iterations, so a slow processor holds up every iteration.]

Iterative MapReduce

The system is not optimized for iteration:

[Figure: as above, but each iteration additionally pays a startup penalty and a disk penalty.]

Iterative MapReduce

Only a subset of the data needs computation:

(multi-phase iteration)

[Figure: even when only a subset of the data needs computation in an iteration, MapReduce still processes all of it and waits at the barrier.]

MapReduce and ML

Excellent for large data-parallel tasks!


Is there more to Machine Learning?

Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics

Complex Parallel Structure: ?

Structured Problems


Interdependent Computation:

Not Map-Reducible

Example Problem: Will I be successful in research?

May not be able to safely update neighboring nodes. [e.g., Gibbs Sampling]

Success depends on the success of others.

Space of Problems


Asynchronous Iterative Computation: repeated iterations over local kernel computations.

Sparse Computation Dependencies: can be decomposed into local "computation kernels".

GraphLab: Data-Parallel and Structured Iterative Parallel

Parallel Computing and ML

Not all algorithms are efficiently data parallel.


Map Reduce (data-parallel): Cross Validation, Feature Extraction, Computing Sufficient Statistics

?: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, Learning Graphical Models, Lasso, Sampling

GraphLab Goals

Designed for ML needs:
- Express data dependencies
- Iterative

Simplifies the design of parallel programs:
- Abstract away hardware issues
- Addresses multiple hardware architectures: multicore, distributed, GPU, and others

GraphLab Goals

[Figure: a chart spanning Simple Models to Complex Models and Small Data to Large Data; data-parallel tools cover simple models on large data today ("Now"); the goal is GraphLab covering complex models on large data.]

Carnegie Mellon

GraphLab

A Domain-Specific Abstraction for Machine Learning

Everything on a Graph

A graph with data associated with every vertex and edge.

Update Functions

Update functions: operations applied on a vertex that transform the data in the scope of the vertex.

Update Functions

An update function can schedule the computation of any other update function:

Scheduled computation is guaranteed to execute eventually.

- FIFO Scheduling
- Prioritized Scheduling
- Randomized
- Etc.

Example: PageRank

Graph = WWW

Update Function: multiply adjacent PageRank values with edge weights to get the current vertex's PageRank.

“Prioritized” PageRank Computation? Skip converged vertices.
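A rough, self-contained C++ sketch of the idea behind this update function; the graph representation, the scheduling queue, and the damping and tolerance constants are illustrative assumptions, not the GraphLab API:

    #include <cmath>
    #include <cstddef>
    #include <deque>
    #include <vector>

    // Illustrative graph types (assumptions).
    struct Edge   { std::size_t src; double weight; };
    struct Vertex { double pagerank = 1.0;
                    std::vector<Edge> in_edges;            // weighted in-links
                    std::vector<std::size_t> out_neighbors; };

    // PageRank update: combine the adjacent PageRank values weighted by the
    // edges, and (prioritized variant) reschedule the out-neighbors only when
    // this vertex's value changed enough, i.e. skip converged vertices.
    void pagerank_update(std::vector<Vertex>& g, std::size_t v,
                         std::deque<std::size_t>& scheduler) {
        const double damping = 0.85, tol = 1e-4;           // assumed constants
        double sum = 0.0;
        for (const Edge& e : g[v].in_edges)
            sum += e.weight * g[e.src].pagerank;
        double new_rank = (1.0 - damping) + damping * sum;
        if (std::fabs(new_rank - g[v].pagerank) > tol)
            for (std::size_t u : g[v].out_neighbors)
                scheduler.push_back(u);                    // schedule other updates
        g[v].pagerank = new_rank;
    }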

Example: K-Means Clustering

Graph: a (fully connected) bipartite graph between Data vertices and Cluster vertices.

Update Functions:
- Cluster Update: compute the average of the data connected on a "marked" edge.
- Data Update: pick the closest cluster and mark that edge; unmark the remaining edges.
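A simplified, self-contained sketch of these two update functions over the bipartite graph, using one-dimensional data for brevity; the structs and the "marked edge" encoding are illustrative assumptions:

    #include <cstddef>
    #include <limits>
    #include <vector>

    // Illustrative bipartite-graph data (assumptions).
    struct DataVertex    { double x = 0.0; std::size_t marked_cluster = 0; };
    struct ClusterVertex { double center = 0.0; };

    // Data update: pick the closest cluster and "mark" that edge
    // (here recorded as the index of the marked cluster).
    void data_update(DataVertex& d, const std::vector<ClusterVertex>& clusters) {
        double best = std::numeric_limits<double>::max();
        for (std::size_t c = 0; c < clusters.size(); ++c) {
            double dist = (d.x - clusters[c].center) * (d.x - clusters[c].center);
            if (dist < best) { best = dist; d.marked_cluster = c; }
        }
    }

    // Cluster update: average the data connected on a marked edge.
    void cluster_update(std::size_t c, ClusterVertex& cl,
                        const std::vector<DataVertex>& data) {
        double sum = 0.0;
        std::size_t n = 0;
        for (const DataVertex& d : data)
            if (d.marked_cluster == c) { sum += d.x; ++n; }
        if (n > 0) cl.center = sum / n;
    }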

Example: MRF Sampling

- Read samples on adjacent vertices
- Read edge potentials
- Compute new sample for current vertex

Graph = MRF

Update Function:
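A small, self-contained sketch of such an update for a binary MRF with Ising-style edge potentials; the vertex struct and the coupling constant are illustrative assumptions:

    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Illustrative binary MRF vertex (assumption).
    struct MrfVertex { int sample = 0; std::vector<std::size_t> neighbors; };

    // Gibbs update for one vertex: read the samples on adjacent vertices and the
    // edge potentials, form the conditional distribution, and draw a new sample.
    void gibbs_update(std::vector<MrfVertex>& g, std::size_t v,
                      double coupling, std::mt19937& rng) {
        double score[2] = {0.0, 0.0};
        for (int s = 0; s < 2; ++s)
            for (std::size_t u : g[v].neighbors)
                if (g[u].sample == s) score[s] += coupling;   // edge potential
        double z = std::exp(score[0]) + std::exp(score[1]);
        std::bernoulli_distribution draw(std::exp(score[1]) / z);
        g[v].sample = draw(rng) ? 1 : 0;
    }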

Not Message Passing!

The graph is a data structure. Update functions perform parallel modifications to the data structure.

Safety

What if adjacent update functions execute simultaneously?

Importance of Consistency

Permit Races? “Best-effort” computation?

Is ML resilient to soft optimization?

True for some algorithms.

Not true for many. May work empirically on some datasets; may fail on others.

Importance of Consistency

Many algorithms require strict consistency, or perform significantly better under strict consistency.

Alternating Least Squares

Importance of Consistency

Fast ML Algorithm development cycle:

Build

Test

Debug

Tweak Model

The framework must behave predictably and consistently and avoid problems caused by non-determinism. (Is the execution wrong, or is the model wrong?)

Sequential Consistency

GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: a parallel execution on CPU 1 and CPU 2 is equivalent to some sequential execution on CPU 1 over time.]

This is a formalization of the intuitive concept of a "correct program":
- Computation does not read outdated data from the past.
- Computation does not read results of computation that occurs in the future.
This is the primary property of GraphLab.

Global Information

What if we need global information?

Sum of all the vertices?

Algorithm Parameters?

Sufficient Statistics?

Shared Variables

Global aggregation through the Sync operation:
- A global parallel reduction over the graph data.
- Synced variables are recomputed at defined intervals.
- Sync computation is sequentially consistent, which permits correct interleaving of Syncs and Updates.

Examples: Sync of the sum of vertex values; Sync of the log-likelihood.
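A minimal sketch of what a Sync amounts to, as a fold over the vertex data followed by an apply step; the vertex type and the mean example are illustrative assumptions (in the distributed setting each machine folds its own partition and the partial results are combined):

    #include <cstddef>
    #include <vector>

    // Illustrative vertex type (assumption).
    struct Vertex { double value = 0.0; };

    // Fold step: accumulate one vertex into the running aggregate.
    double fold_sum(double acc, const Vertex& v) { return acc + v.value; }

    // Apply step: post-process the aggregate into the shared variable,
    // e.g. turning a sum of vertex values into a mean.
    double apply_mean(double acc, std::size_t n) { return n ? acc / n : 0.0; }

    // A Sync over the whole graph: fold over every vertex, then apply.
    double sync_mean(const std::vector<Vertex>& graph) {
        double acc = 0.0;
        for (const Vertex& v : graph) acc = fold_sum(acc, v);
        return apply_mean(acc, graph.size());
    }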

Sequential Consistency

GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions and Syncs which produces the same result.

Carnegie Mellon

GraphLab in the Cloud

Moving towards the cloud…

Purchasing and maintaining computers is very expensive

Most computing resources are seldom used

Only for deadlines…

Buy time, access hundreds or thousands of processors

Only pay for needed resources

Distributed GraphLab Implementation

Mixed multi-threaded / distributed implementation (each machine runs only one instance). Requires all data to be in memory; move computation to the data.

MPI for management + TCP/IP for communication; asynchronous C++ RPC layer.

Ran on 64 EC2 HPC nodes = 512 processors.


Carnegie Mellon

Underlying Network

[Architecture diagram: each machine runs an RPC controller over the underlying network, and on top of it a distributed graph, distributed locks, shared data in a cache-coherent distributed K-V store, and an execution engine with execution threads.]

Carnegie Mellon

GraphLab RPC

Write distributed programs easily

- Asynchronous communication
- Multithreaded support
- Fast
- Scalable
- Easy to use

(Every machine runs the same binary)

Carnegie Mellon


C++

Features

Easy RPC capabilities.

One-way calls:

    rpc.remote_call([target_machine ID], printf,
                    "%s %d %d %d\n", "hello world", 1, 2, 3);

Requests (call with return value):

    vec = rpc.remote_request([target_machine ID],
                             sort_vector, vec);

    std::vector<int>& sort_vector(std::vector<int>& v) {
      std::sort(v.begin(), v.end());
      return v;
    }

Features

Object Instance Context

MPI-like primitives:
- dc.barrier()
- dc.gather(...)
- dc.send_to([target machine], [arbitrary object])
- dc.recv_from([source machine], [arbitrary object ref])


MPI-Like Safety

Request Latency

[Plot: latency (us) vs. value length (bytes), comparing GraphLab RPC and MemCached; ping RTT = 90 us.]

One-Way Call Rate

[Plot: throughput (Mbps) vs. value length (bytes), comparing GraphLab RPC and ICE; 1 Gbps physical peak.]

Serialization Performance

[Plot: time in seconds to issue and receive 100,000 one-way calls of a vector of 10 x {"hello", 3.14, 100}, comparing ICE, buffered RPC, and unbuffered RPC.]

Distributed Computing Challenges

Q1: How do we efficiently distribute the state?
- Potentially varying number of machines

Q2: How do we ensure sequential consistency?

Keeping in mind: limited bandwidth, high latency, performance.

Carnegie Mellon

Distributed Graph

Two-stage Partitioning
- Initial overpartitioning of the graph
- Generate atom graph
- Repartition as needed

Ghosting

Ghost vertices are copies of neighboring vertices that live on remote machines. Ghost vertices/edges act as a cache for remote data; coherency is maintained using versioning, which decreases bandwidth utilization.
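A rough sketch of the versioning idea: a ghost copy is refreshed only when the owning machine reports a newer version, so unchanged data never crosses the network (the struct and the fetch callback are illustrative assumptions, not the GraphLab data structures):

    #include <cstdint>

    // Illustrative ghost-cache entry (assumption).
    struct GhostVertex {
        double   data    = 0.0;   // cached copy of the remote vertex data
        uint64_t version = 0;     // version of the cached copy
    };

    // Refresh the ghost only if the owner's version has advanced; otherwise the
    // cached copy is reused and no vertex data is transferred.
    template <typename FetchFn>
    void refresh_ghost(GhostVertex& g, uint64_t owner_version, FetchFn fetch_remote) {
        if (owner_version > g.version) {
            g.data    = fetch_remote();   // e.g. an RPC to the owning machine
            g.version = owner_version;
        }
    }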

Carnegie Mellon

Distributed Engine

Sequential consistency can be guaranteed through distributed locking, a direct analogue to the shared-memory implementation.

To improve performance, the user provides some "expert knowledge" about the properties of the update function.

Full Consistency

User says: update function modifies all data in scope.

Limited opportunities for parallelism.

Acquire write-lock on all vertices.

Edge Consistency

User: update function only reads from adjacent vertices.

More opportunities for parallelism.

Acquire write-lock on center vertex, read-lock on adjacent.

Vertex Consistency

User: update function does not touch edges or adjacent vertices.

Maximum opportunities for parallelism.

Acquire write-lock on current vertex.
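To make the lock patterns above concrete, here is a sketch of lock acquisition for edge consistency; the per-vertex lock table is an illustrative assumption, plain mutexes stand in for reader/writer locks, and acquiring in increasing vertex-id order is one standard way to avoid deadlock:

    #include <algorithm>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    // Illustrative lock table: one mutex per vertex (assumption).
    struct LockTable {
        std::vector<std::mutex> locks;
        explicit LockTable(std::size_t n) : locks(n) {}
    };

    // Edge consistency: lock the center vertex plus its adjacent vertices
    // (write-lock on the center, read-locks on the neighbors in the real engine).
    // Locks are acquired in increasing vertex-id order to avoid deadlock.
    void lock_edge_scope(LockTable& t, std::size_t center,
                         std::vector<std::size_t> neighbors) {
        neighbors.push_back(center);
        std::sort(neighbors.begin(), neighbors.end());
        for (std::size_t v : neighbors) t.locks[v].lock();
    }

    void unlock_edge_scope(LockTable& t, std::size_t center,
                           std::vector<std::size_t> neighbors) {
        neighbors.push_back(center);
        for (std::size_t v : neighbors) t.locks[v].unlock();
    }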

Performance Enhancements

Latency Hiding:
- "Pipelining" of many more update function calls than CPUs (about a 1K-deep pipeline).
- Hides the latency of lock acquisition and cache synchronization.

Lock Strength Reduction:
- A trick where the number of locks can be decreased while still providing the same guarantees.

Video Cosegmentation

Segments mean the same

Model: 10.5 million nodes, 31 million edges

Gaussian EM clustering + BP on 3D grid

Speedups

[Plots: video segmentation speedup results.]

Chromatic Distributed Engine

Locking overhead is too high in high-degree models. Can we satisfy sequential consistency in a simpler way?

Observation: scheduling using vertex colorings can be used to automatically satisfy consistency.

Example: Edge Consistency

(distance 1) vertex coloring

Update functions can be executed on all vertices of the same color in parallel.

Example: Full Consistency

(distance 2) vertex coloring

Update functions can be executed on all vertices of the same color in parallel.

Example: Vertex Consistency

(distance 0) vertex coloring

Update functions can be executed on all vertices of the same color in parallel.

Chromatic Distributed Engine

Over time, on every machine:
- Execute tasks on all vertices of color 0.
- Data synchronization completion + barrier.
- Execute tasks on all vertices of color 1.
- Data synchronization completion + barrier.
- ... and so on for each color.
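A compact sketch of one sweep of this schedule; the color array, the update callback, and the single-machine loop are illustrative stand-ins (the real engine runs each color in parallel across machines and synchronizes data before the barrier):

    #include <cstddef>
    #include <functional>
    #include <vector>

    // One chromatic sweep: vertices of the same color share no edges, so they
    // may be updated in parallel; a barrier separates consecutive colors.
    void chromatic_sweep(const std::vector<int>& color,      // color of each vertex
                         int num_colors,
                         const std::function<void(std::size_t)>& update) {
        for (int c = 0; c < num_colors; ++c) {
            // In the distributed engine this inner loop runs in parallel on every
            // machine over its local vertices of color c.
            for (std::size_t v = 0; v < color.size(); ++v)
                if (color[v] == c) update(v);
            // ...data synchronization completion + barrier before the next color.
        }
    }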

Experiments

Netflix Collaborative Filtering

Alternating Least Squares Matrix Factorization

Model: 0.5 million nodes, 99 million edges

[Figure: the Netflix problem as a bipartite Users-Movies graph with latent dimension d.]

Netflix Speedup

[Plots: speedup with increasing size of the matrix factorization.]

Experiments

Named Entity Recognition (part of Tom Mitchell's NELL project): the CoEM algorithm on a Web crawl.

Model: 2 million nodes, 200 million edges

The graph is rather dense; a small number of vertices connect to almost all the vertices.

Named Entity Recognition (CoEM)

[Plots: CoEM results; the computation is bandwidth bound.]

Future Work

Distributed GraphLab:
- Fault tolerance (spot instances: cheaper)
- Graph using off-memory store (disk/SSD)
- GraphLab as a database
- Self-optimized partitioning
- Fast data graph construction primitives

GPU GraphLab? Supercomputer GraphLab?

Carnegie Mellon

Is GraphLab the Answer to (Life, the Universe and Everything)?

Probably Not.

Carnegie Mellon

graphlab.ml.cmu.edu: Parallel/Distributed Implementation

LGPL (highly probable switch to MPL in a few weeks)

GraphLab

bickson.blogspot.com: very fast matrix factorization implementations, other examples, installation, comparisons, etc.

Danny Bickson Marketing Agency

Microsoft Safe

Carnegie Mellon

Questions?

Bayesian Tensor Factorization

Gibbs Sampling

Dynamic Block Gibbs Sampling

Matrix Factorization

Lasso

SVM

Belief Propagation

PageRank

CoEM

Many Others…

SVD

Video Cosegmentation

Naïve idea: treat patches independently, using Gaussian EM clustering on image features.
- E step: predict the membership of each patch given the cluster centers.
- M step: compute the cluster centers given the memberships of each patch.
This does not take relationships among patches into account!

Video Cosegmentation

Better idea: connect the patches using an MRF. Set edge potentials so that adjacent (spatially and temporally) patches prefer to be in the same cluster.

Gaussian EM clustering with a twist:
- E step: make unary potentials for each patch using the cluster centers; predict the membership of each patch using BP.
- M step: compute the cluster centers given the memberships of each patch.
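A high-level, self-contained skeleton of that loop; the three steps are passed in as callables because their real implementations (Gaussian unaries, loopy BP over the spatio-temporal MRF, weighted means) are beyond this sketch, and all names here are illustrative assumptions:

    #include <cstddef>
    #include <functional>
    #include <utility>
    #include <vector>

    using Potentials  = std::vector<std::vector<double>>;  // per-patch unary potentials
    using Memberships = std::vector<std::vector<double>>;  // per-patch cluster memberships
    using Edges       = std::vector<std::pair<std::size_t, std::size_t>>;  // MRF edges

    // EM with a BP-based E step: the memberships come from BP over the MRF
    // rather than from independent per-patch assignments.
    std::vector<double> cosegmentation_em(
            const std::vector<double>& features, std::vector<double> centers,
            const Edges& mrf_edges, int iterations,
            const std::function<Potentials(const std::vector<double>&,
                                           const std::vector<double>&)>& unary_potentials,
            const std::function<Memberships(const Potentials&, const Edges&)>& run_bp,
            const std::function<std::vector<double>(const std::vector<double>&,
                                                    const Memberships&)>& weighted_means) {
        for (int it = 0; it < iterations; ++it) {
            // E step: unary potentials from the current centers, memberships via BP.
            Memberships members = run_bp(unary_potentials(features, centers), mrf_edges);
            // M step: recompute the cluster centers from the (soft) memberships.
            centers = weighted_means(features, members);
        }
        return centers;
    }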

D. Batra, et al. iCoseg: Interactive co-segmentation with intelligent scribble guidance. CVPR 2010.

Distributed Memory Programming APIs

• MPI
• Global Arrays
• GASnet
• ARMCI
• etc.

…do not make it easy…

Synchronous computation. Insufficient primitives for multi-threaded use. Also, not exactly easy to use…

If all your data is an n-D array

Direct remote pointer access. Severe limitations depending on system architecture.
