TRANSCRIPT
Carnegie Mellon
GraphLab: A New Framework for Parallel Machine Learning
Yucheng Low, Aapo Kyrola, Carlos Guestrin
Joseph Gonzalez, Danny Bickson, Joe Hellerstein
2
Exponential Parallelism
[Chart: processor speed (GHz, log scale 0.01–10) vs. release date, 1988–2010. Sequential performance increased exponentially until the mid-2000s, then became constant; parallel performance continues to increase exponentially.]
13 Million Wikipedia Pages. 3.6 Billion photos on Flickr.
Parallel Programming is Hard
Designing efficient parallel algorithms is hard:
- Race conditions and deadlocks
- Parallel memory bottlenecks
- Architecture-specific concurrency
- Difficult to debug
ML experts repeatedly address the same parallel design challenges
3
Avoid these problems by using high-level abstractions.
Graduate students
MapReduce – Map Phase
4–6
Embarrassingly parallel, independent computation. No communication needed.
[Figure: CPUs 1–4 each independently compute values for their portion of the data.]
MapReduce – Reduce Phase
7
Fold/Aggregation
[Figure: CPUs 1–2 aggregate the mapped values.]
Parallel Computing and ML
9
Not all algorithms are efficiently data parallel.
Data-Parallel: Cross Validation, Feature Extraction
Complex Parallel Structure: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, Sampling, Lasso
Common Properties
10
1) Sparse Data Dependencies: Sparse Primal SVM, Tensor/Matrix Factorization
2) Local Computations: Expectation Maximization, Optimization
3) Iterative Updates: Sampling, Belief Propagation
Gibbs Sampling
11
[Figure: 3x3 grid MRF over variables X1–X9.]
1) Sparse Data Dependencies
2) Local Computations
3) Iterative Updates
GraphLab is the Solution
Designed specifically for ML needs:
- Express data dependencies
- Iterative
Simplifies the design of parallel programs:
- Abstract away hardware issues
- Automatic data synchronization
- Addresses multiple hardware architectures
Implementation here is multi-core; distributed implementation in progress.
12
Data Graph
15
A graph with data associated with every vertex and edge.
[Figure: graph over vertices X1–X11. Example vertex data: x3: sample value, C(X3): sample counts. Example edge data: Φ(X6,X9): binary potential.]
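The data graph above can be sketched as a small structure that attaches arbitrary data to every vertex and edge. This is a minimal illustrative sketch; class and field names are my own, not GraphLab's actual C++ interface.

```python
# A minimal sketch of the data graph: arbitrary data attached to every
# vertex and edge. Names are illustrative, not GraphLab's real API.

class DataGraph:
    def __init__(self):
        self.vertex_data = {}   # vertex id -> data (e.g. sample, counts)
        self.edge_data = {}     # frozenset({u, v}) -> data (e.g. potential)
        self.neighbors = {}     # vertex id -> set of adjacent vertex ids

    def add_vertex(self, v, data=None):
        self.vertex_data[v] = data
        self.neighbors.setdefault(v, set())

    def add_edge(self, u, v, data=None):
        # undirected edge with shared data, as in a pairwise MRF
        self.edge_data[frozenset((u, v))] = data
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)

# Fragment of the Gibbs graph from the slide: sample value and sample
# counts on vertices, a binary potential on the edge (X6, X9).
g = DataGraph()
g.add_vertex("X6", {"sample": 0, "counts": [0, 0]})
g.add_vertex("X9", {"sample": 1, "counts": [0, 0]})
g.add_edge("X6", "X9", {"potential": [[1.0, 0.5], [0.5, 1.0]]})
```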
Update Functions
16
Update functions are operations which are applied to a vertex and transform the data in the scope of the vertex.
Gibbs update:
- Read samples on adjacent vertices
- Read edge potentials
- Compute a new sample for the current vertex
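The Gibbs update above can be sketched as a function over a vertex's scope: read the neighbors' current samples and the adjacent edge potentials, then draw a new sample. This assumes binary variables, and plain dicts stand in for the data graph; everything here is illustrative, not GraphLab's interface.

```python
import random

# Sketch of a Gibbs update over a vertex scope (binary variables).
# The dicts below are a toy two-vertex graph, purely for illustration.

neighbors = {"X5": ["X6"], "X6": ["X5"]}
vertex_data = {"X5": {"sample": 1}, "X6": {"sample": 0}}
edge_potential = {frozenset(("X5", "X6")): [[2.0, 0.5], [0.5, 2.0]]}

def gibbs_update(v, rng):
    # Unnormalized weight of each value of v given the neighbors' samples
    # and the adjacent edge potentials.
    weights = [1.0, 1.0]
    for u in neighbors[v]:
        pot = edge_potential[frozenset((u, v))]
        s_u = vertex_data[u]["sample"]
        for x in (0, 1):
            weights[x] *= pot[x][s_u]
    p0 = weights[0] / (weights[0] + weights[1])
    vertex_data[v]["sample"] = 0 if rng.random() < p0 else 1
    return vertex_data[v]["sample"]

new_sample = gibbs_update("X6", random.Random(0))
```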
Static Schedule
The scheduler determines the order of update function evaluations.
19
Synchronous schedule: every vertex updated simultaneously.
Round-robin schedule: every vertex updated sequentially.
Dynamic Schedule
Update functions can insert new tasks into the schedule.
22
FIFO Queue → Wildfire BP [Selvatici et al.]
Priority Queue → Residual BP [Elidan et al.]
Splash Schedule → Splash BP [Gonzalez et al.]
Obtain different algorithms simply by changing a flag!
--scheduler=fifo --scheduler=priority --scheduler=splash
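A dynamic schedule like the FIFO (Wildfire-style) one above can be sketched as a work queue that update functions push into. The "update" here is a toy relaxation (move each vertex toward its neighbors' average) that reschedules neighbors whenever its value changes noticeably; the rule and names are illustrative, not GraphLab's API.

```python
from collections import deque

# Sketch of a dynamic FIFO schedule: update functions insert new tasks.
# Toy update: relax each vertex toward the average of its neighbors,
# rescheduling neighbors when the change exceeds a tolerance.

neighbors = {0: [1], 1: [0, 2], 2: [1]}
value = {0: 0.0, 1: 10.0, 2: 0.0}
TOLERANCE = 0.01

queue = deque(neighbors)        # initial tasks: every vertex
in_queue = set(neighbors)

def schedule(v):
    if v not in in_queue:       # avoid duplicate pending tasks
        queue.append(v)
        in_queue.add(v)

def update(v):
    old = value[v]
    value[v] = sum(value[u] for u in neighbors[v]) / len(neighbors[v])
    if abs(value[v] - old) > TOLERANCE:
        for u in neighbors[v]:
            schedule(u)         # dynamically insert new tasks

while queue:
    v = queue.popleft()
    in_queue.discard(v)
    update(v)

# All three values converge to 5.0
```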
Global Information
What if we need global information?
- Sum of all the vertices?
- Algorithm parameters?
- Sufficient statistics?
23
Sync Operation
Sync is a fold/reduce operation over the graph.
25
Accumulate performs an aggregation over vertices. Apply makes a final modification to the accumulated data.
Example: compute the average of all the vertices. Accumulate function: add. Apply function: divide by |V|.
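The accumulate/apply pair in the averaging example above can be sketched as a fold over the vertex data followed by a final apply step. Vertex values and function names here are toy/illustrative.

```python
# Sketch of the Sync operation: fold (accumulate) over all vertex data,
# then a final apply. Example from the slide: average of all vertices.

vertex_values = [1, 3, 2, 1, 2, 1, 1]   # toy vertex data

def accumulate(acc, vertex_value):      # fold step, run per vertex
    return acc + vertex_value

def apply_fn(acc, num_vertices):        # final modification
    return acc / num_vertices

acc = 0                                  # initial accumulator value
for v in vertex_values:
    acc = accumulate(acc, v)
average = apply_fn(acc, len(vertex_values))
```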
Shared Data Table (SDT)
Global constant parameters and global computation (Sync operation).
26
Constant: Total # Samples. Constant: Temperature.
Sync: Sample Statistics. Sync: Loglikelihood.
Write-Write Race
28
Write-write race: if adjacent update functions write simultaneously, the final value depends on which write lands last.
[Figure: the left and right updates each write a different value to the shared edge; the final value is one of the two.]
Race Conditions + Deadlocks
This is just one of many possible races. Race-free code is extremely difficult to write.
29
GraphLab design ensures race-free operation.
Full Consistency
31
Only allow update functions two vertices apart to run in parallel. Reduced opportunities for parallelism.
Obtaining More Parallelism
32
Not all update functions will modify the entire scope!
Belief Propagation: only uses edge data.
Gibbs Sampling: only needs to read adjacent vertices.
Sequential Consistency
GraphLab guarantees sequential consistency:
36
For every parallel execution, there exists a sequential execution of update functions which will produce the same result.
[Figure: a two-CPU parallel execution over time, and an equivalent single-CPU sequential execution.]
Experiments
Shared-memory implementation in C++ using Pthreads.
Tested on a 16-processor machine: 4x Quad-Core AMD Opteron 8384, 64 GB RAM.
39
Belief Propagation + Parameter Learning, Gibbs Sampling, CoEM, Lasso, Compressed Sensing, SVM, PageRank, Tensor Factorization
Graphical Model Learning
40
3D retinal image denoising
Data Graph: 256x64x64 vertices
Update Function: Belief Propagation
Sync (edge-potential): Acc: compute inference statistics; Apply: take a gradient step
Graphical Model Learning
41
[Figure: speedup vs. number of CPUs (1–16) for the approx. priority schedule and the splash schedule; both close to the optimal line.]
15.5x speedup on 16 CPUs
Graphical Model Learning
42
Standard parameter learning takes a gradient step only after inference is completed (iterated).
With GraphLab: take gradient steps while inference is running (simultaneous).
[Figure: runtime comparison. Iterated: 2100 sec. Simultaneous: 700 sec.]
3x faster!
Gibbs Sampling
Two methods for sequential consistency:
43
Scopes (Edge Scope): graphlab(gibbs, edge, sweep);
Scheduling (Graph Coloring): graphlab(gibbs, vertex, colored);
[Figure: with a colored schedule, CPUs 1–3 update same-colored vertices in parallel across time steps t0–t3.]
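The graph-coloring route to sequential consistency can be sketched as follows: color the graph so that adjacent vertices get different colors, then vertices of the same color share no data dependencies and can be updated in parallel without locks. A greedy coloring followed by a color-by-color sweep; illustrative only.

```python
# Sketch of a colored schedule: greedy graph coloring, then execute the
# vertices one color class at a time. Same-colored vertices are never
# adjacent, so they can be updated concurrently without locking.

neighbors = {
    "X1": ["X2", "X4"], "X2": ["X1", "X3"], "X3": ["X2"],
    "X4": ["X1", "X5"], "X5": ["X4"],
}

def greedy_color(neighbors):
    color = {}
    for v in sorted(neighbors):
        used = {color[u] for u in neighbors[v] if u in color}
        c = 0
        while c in used:        # smallest color unused by neighbors
            c += 1
        color[v] = c
    return color

color = greedy_color(neighbors)

# One sweep: each color class could be dispatched to worker threads.
batches = [
    [v for v in sorted(neighbors) if color[v] == c]
    for c in sorted(set(color.values()))
]
```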
Gibbs Sampling
Protein-protein interaction networks [Elidan et al. 2006]
Pairwise MRF: 14K vertices, 100K edges
44
[Figure: speedup vs. number of CPUs (1–16) for the round-robin and colored schedules.]
10x speedup. Scheduling reduces locking overhead.
CoEM (Rosie Jones, 2005)
Named Entity Recognition task: is "Dog" an animal? Is "Catalina" a place?
[Figure: bipartite graph linking noun phrases ("the dog", "Australia", "Catalina Island") to contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant").]
Dataset | Vertices | Edges
Small   | 0.2M     | 20M
Large   | 2M       | 200M
Hadoop: 95 cores, 7.5 hrs
CoEM (Rosie Jones, 2005)
46
[Figure: speedup vs. number of CPUs (1–16) for the Small and Large datasets; the Large dataset scales near-optimally.]
Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min
15x faster on 6x fewer CPUs!
Lasso
47
L1-regularized Linear Regression
Shooting Algorithm (Coordinate Descent)
Due to the properties of the update, full consistency is needed.
Finance dataset from Kogan et al. [2009].
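The shooting algorithm above can be sketched as plain coordinate descent on the Lasso objective ||Xw - y||² + λ||w||₁: each step soft-thresholds one coordinate against the residual of the others. Because every coordinate update reads the shared residual, concurrent updates need full consistency, as the slide notes. Single-threaded illustrative sketch, not the talk's implementation.

```python
# Sketch of the Shooting algorithm (coordinate descent for the Lasso):
#   min_w ||Xw - y||^2 + lam * ||w||_1
# Each pass soft-thresholds one coordinate at a time.

def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def shooting_lasso(X, y, lam, sweeps=100):
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            rho, norm = 0.0, 0.0
            for i in range(n):
                # residual of row i excluding coordinate j's contribution
                r_ij = y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * r_ij
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho, lam / 2.0) / norm
    return w

# Decoupled toy problem: X = identity, so w_j = S(y_j, lam/2).
w = shooting_lasso([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.1], 0.5, sweeps=5)
```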
Full Consistency
50
[Figure: speedup vs. number of CPUs (1–16) under full consistency, for Dense and Sparse datasets.]
Relaxing Consistency
51
Why does this work? (Open question)
[Figure: speedup vs. number of CPUs (1–16) under relaxed consistency, for Dense and Sparse datasets; much closer to optimal.]
GraphLab
An abstraction tailored to machine learning. Provides a parallel framework which compactly expresses:
- Data/computational dependencies
- Iterative computation
Achieves state-of-the-art parallel performance on a variety of problems. Easy to use.
52
Future Work
Distributed GraphLab:
- Load balancing
- Minimize communication
- Latency hiding
- Distributed data consistency
- Distributed scalability
GPU GraphLab:
- Memory bus bottleneck
- Warp alignment
State-of-the-art performance for <Your Algorithm Here>.
53