TRANSCRIPT
Carnegie Mellon
Machine Learning in the Cloud
Yucheng Low
Aapo Kyrola
Danny Bickson
Joey Gonzalez
Carlos Guestrin
Joe Hellerstein
David O’Hallaron
Machine Learning in the Real World
24 Hours of Video Uploaded to YouTube Every Minute
13 Million Wikipedia Pages
500 Million Facebook Users
3.6 Billion Flickr Photos
Parallelism is Difficult
Wide array of different parallel architectures:
Different challenges for each architecture
GPUs Multicore Clusters Clouds Supercomputers
High-level abstractions make things easier.
MapReduce – Map Phase
Embarrassingly parallel, independent computation on CPUs 1 through 4.
No communication needed.
[Figure: each CPU independently computes a value for its own block of data.]
MapReduce – Reduce Phase
Fold/Aggregation: CPUs 1 and 2 combine the values produced in the map phase.
[Figure: mapped values are aggregated into partial results on each reduce CPU.]
MapReduce and ML
Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics
Complex Parallel Structure: Is there more to Machine Learning?
Iterative Algorithms?
We can implement iterative algorithms in MapReduce:
[Figure: each iteration maps data blocks onto CPUs 1 through 3 and ends at a barrier; a slow processor holds up every barrier.]
Iterative MapReduce
System is not optimized for iteration:
[Figure: every iteration pays a startup penalty and a disk penalty in addition to the barriers.]
Iterative MapReduce
Only a subset of data needs computation (multi-phase iteration):
[Figure: with barriers between phases, every data block is reprocessed each iteration even though only a subset changed.]
MapReduce and ML
Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics
Complex Parallel Structure: Is there more to Machine Learning?
Structured Problems
Interdependent Computation:
Not Map-Reducible
Example Problem: Will I be successful in research?
Success depends on the success of others.
We may not be able to safely update neighboring nodes in parallel [e.g., Gibbs Sampling].
Space of Problems
Asynchronous Iterative Computation: repeated iterations over local kernel computations
Sparse Computation Dependencies: can be decomposed into local "computation-kernels"
GraphLab targets this space: structured iterative parallel computation, beyond the purely data-parallel.
Parallel Computing and ML
Not all algorithms are efficiently data parallel.
Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics
Complex Parallel Structure: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, Learning Graphical Models, Lasso, Sampling
GraphLab Goals
Designed for ML needs: express data dependencies, iterative.
Simplifies the design of parallel programs:
Abstract away hardware issues
Addresses multiple hardware architectures: multicore, distributed, GPU, and others
GraphLab Goals
[Figure: model complexity (Simple Models to Complex Models) vs. data size (Small Data to Large Data). Data-parallel frameworks cover simple models on large data ("Now"); the goal, and GraphLab's target, is complex models on large data.]
Carnegie Mellon
GraphLab
A Domain-Specific Abstraction for Machine Learning
Everything on a Graph
A graph with data associated with every vertex and edge.
Update Functions
Update Functions: operations applied on a vertex that transform the data in the scope of the vertex.
Update Functions
An update function can schedule the computation of any other update function.
Scheduled computation is guaranteed to execute eventually.
Schedulers: FIFO scheduling, prioritized scheduling, randomized, etc.
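A minimal sketch may make the model concrete. The following is a toy, sequential rendering of the abstraction, not the GraphLab API: data lives on vertices, an update function transforms the data in the scope of one vertex, and it may schedule other vertices, here through a plain FIFO queue. The damped neighbor-averaging update and the work budget are purely illustrative.

#include <cmath>
#include <cstddef>
#include <deque>
#include <vector>

struct Vertex { double value; std::vector<std::size_t> neighbors; };

int main() {
  // A tiny graph with data on every vertex.
  std::vector<Vertex> graph = {
    {1.0, {1, 2}}, {4.0, {0, 3}}, {2.0, {0}}, {8.0, {1}} };

  std::deque<std::size_t> fifo = {0, 1, 2, 3};   // FIFO scheduler
  std::size_t budget = 10000;                    // safety cap for the toy example

  while (!fifo.empty() && budget-- > 0) {
    std::size_t v = fifo.front(); fifo.pop_front();

    // "Update function": read data in the scope of v, write v's data,
    // and schedule neighbors if v changed enough (eventual execution).
    double sum = 0;
    for (std::size_t u : graph[v].neighbors) sum += graph[u].value;
    double next = 0.5 * graph[v].value + 0.5 * sum / graph[v].neighbors.size();
    if (std::fabs(next - graph[v].value) > 1e-6)
      for (std::size_t u : graph[v].neighbors) fifo.push_back(u);
    graph[v].value = next;
  }
  return 0;
}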
Example: PageRank
Graph = WWW
Update Function: multiply adjacent PageRank values by the edge weights and sum them to get the current vertex's PageRank.
"Prioritized" PageRank computation? Skip converged vertices.
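As a sketch only (plain C++ over an adjacency list, not the GraphLab API), one PageRank update might look like the following; the 0.15/0.85 damping constants are the standard choice and the convergence tolerance is illustrative.

#include <cmath>
#include <cstddef>
#include <vector>

struct InLink { std::size_t source; double weight; };
struct Page   { double rank; std::vector<InLink> in_links; };

// Recompute one vertex's PageRank from its in-neighbors; return true if the
// change was large enough that neighbors should be rescheduled (the
// "prioritized" variant skips vertices that have already converged).
bool pagerank_update(std::vector<Page>& web, std::size_t v, double tol = 1e-5) {
  double sum = 0;
  for (const InLink& e : web[v].in_links)
    sum += e.weight * web[e.source].rank;   // weighted neighbor PageRank
  double next = 0.15 + 0.85 * sum;
  bool changed = std::fabs(next - web[v].rank) > tol;
  web[v].rank = next;
  return changed;
}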
Example: K-Means Clustering
Graph: a (fully connected?) bipartite graph between Data vertices and Cluster vertices.
Update Functions:
Cluster Update: compute the average of the data connected on a "marked" edge.
Data Update: pick the closest cluster and mark that edge; unmark the remaining edges.
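The two update functions can be sketched in plain C++, with flat arrays standing in for the bipartite graph; the 2-D Point type and the "assignment" array (which plays the role of the marked edges) are illustrative, not the talk's code.

#include <cstddef>
#include <limits>
#include <vector>

struct Point { double x, y; };

// "Data update": pick the closest cluster center; marking the edge to that
// cluster is represented here by returning its index.
std::size_t data_update(const Point& p, const std::vector<Point>& centers) {
  std::size_t best = 0;
  double best_d = std::numeric_limits<double>::max();
  for (std::size_t k = 0; k < centers.size(); ++k) {
    double dx = p.x - centers[k].x, dy = p.y - centers[k].y;
    double d = dx * dx + dy * dy;
    if (d < best_d) { best_d = d; best = k; }
  }
  return best;
}

// "Cluster update": average of the data points whose marked edge points here.
Point cluster_update(std::size_t k, const std::vector<Point>& pts,
                     const std::vector<std::size_t>& assignment) {
  Point c{0, 0};
  std::size_t n = 0;
  for (std::size_t i = 0; i < pts.size(); ++i)
    if (assignment[i] == k) { c.x += pts[i].x; c.y += pts[i].y; ++n; }
  if (n) { c.x /= n; c.y /= n; }
  return c;
}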
Example: MRF Sampling
Graph = MRF
Update Function: read samples on adjacent vertices, read edge potentials, compute a new sample for the current vertex.
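For a concrete, if simplified, instance of such an update, here is a Gibbs-sampling step for a binary pairwise MRF in plain C++ (not the GraphLab API); the single "coupling" parameter stands in for the edge potentials and rewards agreeing with a neighbor's sample.

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

int gibbs_update(std::size_t v,
                 const std::vector<int>& sample,                       // current samples in {0,1}
                 const std::vector<std::vector<std::size_t>>& nbrs,    // adjacency list
                 double coupling,                                      // agreement potential
                 std::mt19937& rng) {
  // Unnormalized log-potentials for x_v = 0 and x_v = 1 given neighbor samples.
  double logp[2] = {0.0, 0.0};
  for (std::size_t u : nbrs[v]) logp[sample[u]] += coupling;
  // P(x_v = 1 | neighbors) = exp(logp[1]) / (exp(logp[0]) + exp(logp[1])).
  double p1 = 1.0 / (1.0 + std::exp(logp[0] - logp[1]));
  return std::bernoulli_distribution(p1)(rng) ? 1 : 0;
}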
Not Message Passing!
The graph is a data structure. Update functions perform parallel modifications to the data structure.
Safety
What if adjacent update functions occur simultaneously?
Importance of Consistency
Permit Races? “Best-effort” computation?
ML resilient to soft-optimization?
True for some algorithms.
Not true for many. May work empirically on some datasets; may fail on others.
Importance of Consistency
Many algorithms require strict consistency, or perform significantly better under strict consistency.
Alternating Least Squares
Importance of Consistency
Fast ML Algorithm development cycle:
Build
Test
Debug
Tweak Model
The framework must behave predictably and consistently and avoid problems caused by non-determinism. Is the execution wrong, or is the model wrong?
Sequential Consistency
GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions that produces the same result.
[Figure: a parallel execution on CPUs 1 and 2 versus an equivalent sequential execution on CPU 1, over time.]
Sequential Consistency
GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions that produces the same result.
A formalization of the intuitive concept of a "correct program":
- Computation does not read outdated data from the past.
- Computation does not read results of computation that occurs in the future.
This is the primary property of GraphLab.
Global Information
What if we need global information?
Sum of all the vertices?
Algorithm Parameters?
Sufficient Statistics?
Shared Variables
Global aggregation through the Sync operation: a global parallel reduction over the graph data.
Synced variables are recomputed at defined intervals.
Sync computation is sequentially consistent, which permits correct interleaving of Syncs and Updates.
Examples: Sync: sum of vertex values; Sync: log-likelihood.
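A sync can be pictured as a fold over the graph data whose per-worker partial results are merged, roughly like the following plain C++ sketch (sequential here, and not the GraphLab sync API; GraphLab runs it as a parallel reduction at the configured interval).

#include <cstddef>
#include <vector>

struct Accum { double sum = 0.0; std::size_t count = 0; };

// Fold one worker's slice of vertex values into a partial accumulator.
Accum fold_slice(const std::vector<double>& vertex_values,
                 std::size_t begin, std::size_t end) {
  Accum a;
  for (std::size_t i = begin; i < end; ++i) { a.sum += vertex_values[i]; ++a.count; }
  return a;
}

// Merge two partial accumulators (e.g., from different workers).
Accum merge(Accum a, const Accum& b) {
  a.sum += b.sum;
  a.count += b.count;
  return a;
}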
Sequential Consistency
GraphLab guarantees sequential consistency: for every parallel execution, there exists a sequential execution of update functions and Syncs that produces the same result.
[Figure: a parallel execution on CPUs 1 and 2 versus an equivalent sequential execution, over time.]
Carnegie Mellon
GraphLab in the Cloud
Moving towards the cloud…
Purchasing and maintaining computers is very expensive
Most computing resources are seldom used
Only for deadlines…
Buy time, access hundreds or thousands of processors
Only pay for needed resources
Distributed GL Implementation
Mixed multi-threaded / distributed implementation (each machine runs only one instance).
Requires all data to be in memory; move computation to data.
MPI for management + TCP/IP for communication; asynchronous C++ RPC layer.
Ran on 64 EC2 HPC nodes = 512 processors.
Carnegie Mellon
Underlying Network
[Figure: each machine runs an RPC controller on the underlying network; on top sit the distributed graph, distributed locks, a cache-coherent distributed K-V store holding shared data, and the execution engine with its execution threads.]
Carnegie Mellon
GraphLab RPC
Write distributed programs easily
Asynchronous communication
Multithreaded support
Fast
Scalable
Easy To Use
(Every machine runs the same binary)
Carnegie Mellon
I ♥ C++
Features
Easy RPC capabilities.

One-way calls:
rpc.remote_call([target_machine ID], printf, "%s %d %d %d\n", "hello world", 1, 2, 3);

Requests (call with return value):
vec = rpc.remote_request([target_machine ID], sort_vector, vec);

std::vector<int>& sort_vector(std::vector<int>& v) {
  std::sort(v.begin(), v.end());
  return v;
}
Features
Object instance context.
MPI-like primitives:
dc.barrier()
dc.gather(...)
dc.send_to([target machine], [arbitrary object])
dc.recv_from([source machine], [arbitrary object ref])
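As a usage sketch using only the primitives named above: machine 0 sends an object to every other machine, which receives it, and everyone synchronizes. The DC template parameter stands in for the RPC controller type, and the machine id is passed in rather than queried, so nothing beyond the listed calls is assumed.

#include <string>

template <typename DC>
void broadcast_params(DC& dc, int my_id, int num_machines, std::string& params) {
  if (my_id == 0) {
    for (int m = 1; m < num_machines; ++m)
      dc.send_to(m, params);      // send the object to each other machine
  } else {
    dc.recv_from(0, params);      // receive into the object reference
  }
  dc.barrier();                   // all machines wait until the exchange completes
}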
MPI-Like Safety
[Figure: a K-V object and an RPC controller on each machine.]
Request Latency
[Figure: request latency (microseconds) vs. value length (bytes) for GraphLab RPC and MemCached; ping RTT = 90 us.]
One-Way Call Rate
[Figure: one-way call throughput (Mbps) vs. value length (bytes) for GraphLab RPC and ICE; 1 Gbps physical peak.]
Serialization Performance
100,000 one-way calls of a vector of 10 x {"hello", 3.14, 100}.
[Figure: issue and receive times (seconds) for ICE RPC, Buffered RPC, and Unbuffered RPC.]
Distributed Computing Challenges
Q1: How do we efficiently distribute the state?
- Potentially varying #machines
Q2: How do we ensure sequential consistency?
Keeping in mind: limited bandwidth, high latency, performance.
Carnegie Mellon
Distributed Graph
Two-stage Partitioning
Initial overpartitioning of the graph
Generate atom graph
Repartition as needed
Ghosting
Ghost vertices are copies of neighboring vertices that live on remote machines.
Ghost vertices/edges act as a cache for remote data; coherency is maintained using versioning, which decreases bandwidth utilization.
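A minimal sketch of the versioning idea (illustrative types, not the GraphLab data structures): the ghost keeps the version it last saw, and the data is refetched only when the owner's version has advanced.

struct VertexCopy { long version; double data; };

// Refresh a local ghost copy only when the owner's copy is newer; otherwise the
// cached ghost is used and no data needs to cross the network.
void refresh_ghost(VertexCopy& ghost, const VertexCopy& owner) {
  if (owner.version > ghost.version) ghost = owner;
}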
Carnegie Mellon
Distributed Engine
Sequential consistency can be guaranteed through distributed locking, a direct analogue to the shared-memory implementation.
To improve performance, the user provides some "expert knowledge" about the properties of the update function.
Full Consistency
User says: update function modifies all data in scope.
Limited opportunities for parallelism.
Acquire write-lock on all vertices.
Edge Consistency
User: update function only reads from adjacent vertices.
More opportunities for parallelism.
Acquire write-lock on center vertex, read-lock on adjacent.
Vertex Consistency
User: update function does not touch edges or adjacent vertices.
Maximum opportunities for parallelism.
Acquire write-lock on current vertex.
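A sketch of edge-consistency locking for one update (illustrative, not the GraphLab implementation): write-lock the center vertex, read-lock the adjacent ones, and acquire all locks in a canonical order so that overlapping scopes cannot deadlock.

#include <algorithm>
#include <cstddef>
#include <shared_mutex>
#include <vector>

void lock_edge_scope(std::size_t center, std::vector<std::size_t> adjacent,
                     std::vector<std::shared_mutex>& vertex_locks) {
  adjacent.push_back(center);
  std::sort(adjacent.begin(), adjacent.end());   // canonical order avoids deadlock
  for (std::size_t v : adjacent) {
    if (v == center) vertex_locks[v].lock();          // write lock on the center vertex
    else             vertex_locks[v].lock_shared();   // read locks on adjacent vertices
  }
  // (the caller runs the update, then releases the locks in reverse order)
}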
Performance Enhancements
Latency Hiding: "pipelining" of many more update function calls than #CPUs (a pipeline about 1K deep) hides the latency of lock acquisition and cache synchronization.
Lock Strength Reduction: a trick by which the number of locks can be decreased while still providing the same guarantees.
Video Cosegmentation
Segments mean the same
Model: 10.5 million nodes, 31 million edges
Gaussian EM clustering + BP on 3D grid
Speedups
[Figure: video segmentation speedup plots.]
Chromatic Distributed Engine
Locking overhead is too high in high-degree models. Can we satisfy sequential consistency in a simpler way?
Observation: scheduling using vertex colorings can be used to automatically satisfy consistency.
Example: Edge Consistency
(distance 1) vertex coloring
Update functions can be executed on all vertices of the same color in parallel.
Example: Full Consistency
(distance 2) vertex coloring
Update functions can be executed on all vertices of the same color in parallel.
Example: Vertex Consistency
(distance 0) vertex coloring
Update functions can be executed on all vertices of the same color in parallel.
Chromatic Distributed Engine
Over time:
Execute tasks on all vertices of color 0
Data synchronization completion + barrier
Execute tasks on all vertices of color 1
Data synchronization completion + barrier
...
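A compact sketch of the idea (plain C++, not the GraphLab engine): greedily color the graph, then run one color at a time with a barrier in between. With a distance-1 coloring, no two adjacent vertices ever execute in the same phase, which yields edge consistency.

#include <cstddef>
#include <set>
#include <vector>

// Greedy distance-1 coloring: each vertex takes the smallest color not used
// by an already-colored neighbor.
std::vector<int> greedy_color(const std::vector<std::vector<std::size_t>>& nbrs) {
  std::vector<int> color(nbrs.size(), -1);
  for (std::size_t v = 0; v < nbrs.size(); ++v) {
    std::set<int> used;
    for (std::size_t u : nbrs[v]) if (color[u] >= 0) used.insert(color[u]);
    int c = 0;
    while (used.count(c)) ++c;
    color[v] = c;
  }
  return color;
}

// Execution skeleton: for each color, update every vertex of that color (these
// could all run in parallel), then synchronize data and barrier.
template <typename UpdateFn, typename BarrierFn>
void run_chromatic(const std::vector<int>& color, int num_colors,
                   UpdateFn update, BarrierFn barrier) {
  for (int c = 0; c < num_colors; ++c) {
    for (std::size_t v = 0; v < color.size(); ++v)
      if (color[v] == c) update(v);
    barrier();   // data synchronization completion + barrier
  }
}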
Experiments: Netflix Collaborative Filtering
Alternating Least Squares matrix factorization.
Model: 0.5 million nodes, 99 million edges.
Bipartite graph between Users and Movies, with latent factors of dimension d.
[Figure: Netflix speedup plots, for increasing size d of the matrix factorization.]
Experiments: Named Entity Recognition (CoEM)
CoEM algorithm (part of Tom Mitchell's NELL project), run on a web crawl.
Model: 2 million nodes, 200 million edges.
The graph is rather dense: a small number of vertices connect to almost all the vertices.
[Figure: CoEM speedup plots; performance becomes bandwidth bound.]
Future Work
Distributed GraphLab:
Fault tolerance; spot instances (cheaper)
Graph using off-memory store (disk/SSD)
GraphLab as a database
Self-optimized partitioning
Fast data graph construction primitives
GPU GraphLab? Supercomputer GraphLab?
Carnegie Mellon
Is GraphLab the Answer to (Life, the Universe, and Everything)?
Probably Not.
Carnegie Mellon
GraphLab
graphlab.ml.cmu.edu: parallel/distributed implementation
LGPL (highly probable switch to MPL in a few weeks)
bickson.blogspot.com: very fast matrix factorization implementations, other examples, installation, comparisons, etc.
Danny Bickson Marketing Agency
Carnegie Mellon
Questions?
Bayesian Tensor Factorization
Gibbs Sampling
Dynamic Block Gibbs Sampling
Matrix Factorization
Lasso
SVM
Belief Propagation
PageRank
CoEM
Many Others…
SVD
Video Cosegmentation
Naïve Idea: treat patches independently.
Use Gaussian EM clustering (on image features):
E step: predict membership of each patch given cluster centers.
M step: compute cluster centers given memberships of each patch.
Does not take relationships among patches into account!
Video Cosegmentation
Better Idea: connect the patches using an MRF. Set edge potentials so that adjacent (spatially and temporally) patches prefer to be of the same cluster.
Gaussian EM clustering with a twist:
E step: make unary potentials for each patch using cluster centers; predict membership of each patch using BP.
M step: compute cluster centers given memberships of each patch.
D. Batra, et al. iCoseg: Interactive co-segmentation with intelligent scribble guidance. CVPR 2010.
Distributed Memory Programming APIs
• MPI • Global Arrays • GASnet • ARMCI • etc.
…do not make it easy…
Synchronous computation. Insufficient primitives for multi-threaded use. Also, not exactly easy to use…
If all your data is an n-D array
Direct remote pointer access. Severe limitations depending on system architecture.