A New Parallel Framework for Machine Learning. Joseph Gonzalez, joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, and David O'Hallaron.

TRANSCRIPT

  • Slide 1
  • A New Parallel Framework for Machine Learning. Joseph Gonzalez, joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O'Hallaron, and Alex Smola.
  • Slide 2
  • The foundation of computation is changing.
  • Slide 3
  • Parallelism: Hope for the Future. A wide array of different parallel architectures: GPUs, multicore, clusters, mini clouds, clouds. New challenges for designing machine learning algorithms: race conditions and deadlocks, managing distributed model state. New challenges for implementing machine learning algorithms: parallel debugging and profiling, hardware-specific APIs.
  • Slide 4
  • The foundation of computation is changing, and the scale of machine learning is exploding.
  • Slide 5
  • YouTube: 27 hours of video uploaded every minute. Wikipedia: 13 million pages. Facebook: 750 million users. Flickr: 3.6 billion photos.
  • Slide 6
  • Massive data provides opportunities for rich probabilistic structure.
  • Slide 7
  • [Figure: a social network connecting Shopper 1 and Shopper 2, with interests such as Cameras and Cooking.]
  • Slide 8
  • Massive structured problems on one side, advances in parallel hardware on the other: how do we connect them? Thesis: Parallel Learning and Inference in Probabilistic Graphical Models.
  • Slide 9
  • [Thesis outline diagram: massive structured problems and advances in parallel hardware, connected through probabilistic graphical models, parallel algorithms for probabilistic learning and inference, and GraphLab.]
  • Slide 12
  • Question: How will we design and implement parallel learning systems?
  • Slide 13
  • We could use... Threads, Locks, & Messages: build each new learning system using low-level parallel primitives.
  • Slide 14
  • Threads, Locks, and Messages. ML experts (often graduate students) repeatedly solve the same parallel design challenges: implement and debug a complex parallel system, then tune it for a specific parallel platform. Two months later the conference paper contains: "We implemented ______ in parallel." The resulting code is difficult to maintain, is difficult to extend, and couples the learning model to the parallel implementation.
  • Slide 15
  • A better answer: Map-Reduce / Hadoop. Build learning algorithms on top of high-level parallel abstractions.
  • Slide 16
  • [Figure: MapReduce Map Phase. Four CPUs perform embarrassingly parallel, independent computation on separate records; no communication is needed.]
  • Slide 19
  • [Figure: MapReduce Reduce Phase. The mapped values are combined by a fold/aggregation step across CPUs.]
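A minimal single-machine sketch of the two phases, assuming an arbitrary per-record map function (doubling is only a placeholder) and summation as the fold:

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
      std::vector<double> records = {12.9, 42.3, 21.3, 25.8};  // input data from the figure
      // Map phase: each record is transformed independently (no communication).
      std::vector<double> mapped(records.size());
      std::transform(records.begin(), records.end(), mapped.begin(),
                     [](double x) { return 2.0 * x; });        // placeholder per-record function
      // Reduce phase: fold the mapped values into a single aggregate.
      double total = std::accumulate(mapped.begin(), mapped.end(), 0.0);
      std::cout << "aggregate = " << total << "\n";
    }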
  • Slide 20
  • Map-Reduce for data-parallel ML: excellent for large data-parallel tasks! Data-parallel (Map-Reduce): cross validation, feature extraction, computing sufficient statistics. Graph-parallel: belief propagation, SVMs, kernel methods, deep belief networks, neural networks, tensor factorization, sampling, Lasso. Is there more to machine learning?
  • Slide 21
  • Concrete Example: Label Propagation
  • Slide 22
  • Label Propagation Algorithm. Social arithmetic: my interest estimate is a weighted mix of what I list on my profile and what my friends (here Sue Ann and Carlos) like; in the example this yields "I Like: 60% Cameras, 40% Biking". Recurrence algorithm: iterate until convergence. Parallelism: compute all Likes[i] in parallel.
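A hedged reading of the recurrence behind this slide, using the slide's Likes[i] and W_ij notation (the friend-set indexing is an assumption):

    $$\text{Likes}[i] \;\leftarrow\; \sum_{j \in \text{Friends}[i]} W_{ij}\,\text{Likes}[j]$$

iterated until no Likes[i] changes; all Likes[i] can be recomputed in parallel within an iteration.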
  • Slide 23
  • Properties of graph-parallel algorithms: a dependency graph (what I like depends on what my friends like), iterative computation, and factored computation.
  • Slide 24
  • Data-parallel vs. graph-parallel, revisited: Map-Reduce covers the data-parallel tasks (cross validation, feature extraction, computing sufficient statistics); can Map-Reduce also cover the graph-parallel tasks (belief propagation, SVMs, kernel methods, deep belief networks, neural networks, tensor factorization, sampling, Lasso)?
  • Slide 25
  • Why not use Map-Reduce for graph-parallel algorithms?
  • Slide 26
  • Data dependencies: Map-Reduce does not efficiently express dependent data; it assumes independent data rows. The user must code substantial data transformations, and data replication is costly.
  • Slide 27
  • Iterative algorithms: Map-Reduce does not efficiently express iterative algorithms. [Figure: each iteration fans the data out to CPUs and waits at a barrier, so a single slow processor stalls every iteration.]
  • Slide 28
  • MapAbuse: Iterative MapReduce. Only a subset of the data needs computation in each iteration. [Figure: the same data-to-CPUs-to-barrier pattern repeated across iterations.]
  • Slide 29
  • MapAbuse: Iterative MapReduce. The system is not optimized for iteration: each iteration pays a startup penalty and a disk penalty. [Figure: repeated data-to-CPUs-to-barrier iterations.]
  • Slide 30
  • Synchronous vs. asynchronous. Example algorithm: if a neighbor is red, turn red. Synchronous computation (Map-Reduce): evaluate the condition on all vertices in every phase; 4 phases of 9 computations each, 36 computations. Asynchronous computation (wave-front): evaluate the condition only when a neighbor changes; 4 phases of 2 computations each, 8 computations.
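An illustrative sketch of the two strategies on an assumed chain graph (not the slide's exact figure, so the counts differ): the synchronous version evaluates every vertex in every phase, while the asynchronous version evaluates a vertex only after a neighbor changes.

    #include <deque>
    #include <iostream>
    #include <vector>

    int main() {
      const int n = 9;                              // a chain of 9 vertices, vertex 0 starts red
      auto neighbors = [n](int v) {
        std::vector<int> nbrs;
        if (v > 0) nbrs.push_back(v - 1);
        if (v + 1 < n) nbrs.push_back(v + 1);
        return nbrs;
      };
      // Synchronous: evaluate the rule on every vertex in every phase.
      {
        std::vector<bool> red(n, false); red[0] = true;
        long evaluations = 0;
        for (bool changed = true; changed; ) {
          changed = false;
          std::vector<bool> next = red;
          for (int v = 0; v < n; ++v) {
            ++evaluations;                          // condition checked on all vertices
            for (int u : neighbors(v))
              if (red[u] && !red[v]) { next[v] = true; changed = true; }
          }
          red = next;
        }
        std::cout << "synchronous evaluations: " << evaluations << "\n";
      }
      // Asynchronous: only evaluate a vertex when a neighbor has changed.
      {
        std::vector<bool> red(n, false); red[0] = true;
        long evaluations = 0;
        std::deque<int> work = {1};                 // neighbors of the initially red vertex
        while (!work.empty()) {
          int v = work.front(); work.pop_front();
          ++evaluations;                            // condition checked only on scheduled vertices
          if (red[v]) continue;
          bool red_neighbor = false;
          for (int u : neighbors(v)) red_neighbor = red_neighbor || red[u];
          if (red_neighbor) {
            red[v] = true;
            for (int u : neighbors(v)) if (!red[u]) work.push_back(u);
          }
        }
        std::cout << "asynchronous evaluations: " << evaluations << "\n";
      }
    }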
  • Slide 31
  • Data-parallel algorithms can be inefficient: the limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms. [Chart: optimized in-memory MapReduce BP vs. asynchronous Splash BP.]
  • Slide 32
  • The need for a new abstraction: Map-Reduce is not well suited for graph-parallelism (belief propagation, SVMs, kernel methods, deep belief networks, neural networks, tensor factorization, sampling, Lasso).
  • Slide 33
  • What is GraphLab?
  • Slide 34
  • The GraphLab Framework: a graph-based data representation, update functions (user computation), a scheduler, a consistency model, and aggregates.
  • Slide 35
  • Data Graph: a graph with arbitrary data (C++ objects) associated with each vertex and edge. Example (social network): vertex data holds the user profile text and the current interest estimates; edge data holds similarity weights.
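A minimal sketch, with struct names of my own choosing rather than the GraphLab graph API, of what the per-vertex and per-edge C++ objects for this social-network example might look like:

    #include <map>
    #include <string>
    #include <vector>

    // Per-vertex state: the user's profile text and current interest estimates.
    struct VertexData {
      std::string profile_text;
      std::map<std::string, double> likes;   // e.g. {"cameras": 0.6, "biking": 0.4}
    };

    // Per-edge state: the similarity weight W_ij between two users.
    struct EdgeData {
      double similarity;
    };

    // A bare-bones data graph: vertices, edges, and per-vertex adjacency lists.
    struct Edge { int source, target; EdgeData data; };
    struct DataGraph {
      std::vector<VertexData> vertices;
      std::vector<Edge> edges;
      std::vector<std::vector<int>> adjacent_edges;  // edge ids incident to each vertex
    };

    int main() {
      DataGraph g;
      g.vertices = {VertexData{"likes cameras", {{"cameras", 1.0}}},
                    VertexData{"likes biking",  {{"biking", 1.0}}}};
      g.edges = {Edge{0, 1, EdgeData{0.5}}};
      g.adjacent_edges = {{0}, {0}};
      return 0;
    }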
  • Slide 36
  • The GraphLab framework outline again, now highlighting update functions (user computation).
  • Slide 37
  • Update Functions: an update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of that vertex. The label propagation update from the slide (the vertex-update step follows the recurrence above):

        label_prop(i, scope) {
          // Get neighborhood data
          (Likes[i], W_ij, Likes[j]) <- scope;
          // Update the vertex data
          Likes[i] <- sum over neighbors j of W_ij * Likes[j];
          // Reschedule neighbors if needed
          if Likes[i] changes then reschedule_neighbors_of(i);
        }
  • Slide 38
  • The GraphLab framework outline again, now highlighting the scheduler.
  • Slide 39
  • The Scheduler: the scheduler determines the order in which vertices are updated; CPUs repeatedly pull the next vertex from the scheduler and apply the update function, and the process repeats until the scheduler is empty. [Figure: CPU 1 and CPU 2 pulling vertices (a, b, c, ...) from a shared scheduler queue.]
  • Slide 40
  • Choosing a Schedule: GraphLab provides several different schedulers. Round-robin: vertices are updated in a fixed order (--scheduler=roundrobin). FIFO: vertices are updated in the order they are added (--scheduler=fifo). Priority: vertices are updated in priority order (--scheduler=priority). The choice of schedule affects the correctness and parallel performance of the algorithm; different algorithms are obtained by simply changing a flag.
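A single-threaded sketch of the update-function-plus-scheduler loop with names chosen for illustration (not the GraphLab API): a FIFO worklist stands in for the scheduler, and the update reschedules its neighbors only when the vertex value changes by more than a tolerance.

    #include <cmath>
    #include <deque>
    #include <iostream>
    #include <vector>

    struct Graph {
      std::vector<double> likes;                                   // vertex data: P(likes cameras)
      std::vector<std::vector<std::pair<int, double>>> nbrs;        // (neighbor id, weight W_ij)
    };

    // Update function: recompute Likes[i] from the neighborhood and report
    // whether the value changed enough to warrant rescheduling the neighbors.
    bool update(Graph& g, int i) {
      double new_value = 0.0, total_weight = 0.0;
      for (auto [j, w] : g.nbrs[i]) { new_value += w * g.likes[j]; total_weight += w; }
      if (total_weight > 0.0) new_value /= total_weight;
      bool changed = std::fabs(new_value - g.likes[i]) > 1e-6;
      g.likes[i] = new_value;
      return changed;
    }

    int main() {
      Graph g;
      g.likes = {0.8, 0.5, 0.3};                                   // initial interest estimates
      g.nbrs  = {{{1, 1.0}}, {{0, 0.5}, {2, 0.5}}, {{1, 1.0}}};
      std::deque<int> scheduler = {0, 1, 2};                       // FIFO scheduler seeded with all vertices
      while (!scheduler.empty()) {
        int i = scheduler.front(); scheduler.pop_front();
        if (update(g, i))                                          // reschedule neighbors if Likes[i] changed
          for (const auto& e : g.nbrs[i]) scheduler.push_back(e.first);
      }
      for (double p : g.likes) std::cout << p << " ";
      std::cout << "\n";
    }

Swapping the deque for a priority queue keyed on the size of the last change gives the priority-scheduler behavior described above.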
  • Slide 41
  • Dynamic computation: the scheduler focuses effort on the parts of the graph that are still slowly converging and skips parts that have already converged. [Figure: converged vs. slowly converging regions of the graph.]
  • Slide 42
  • The GraphLab framework outline again, now highlighting the consistency model.
  • Slide 43
  • GraphLab ensures sequential consistency: for each parallel execution, there exists a sequential execution of the update functions that produces the same result. [Figure: a parallel schedule on CPU 1 and CPU 2 mapped onto an equivalent single-CPU sequential schedule over time.]
  • Slide 44
  • Ensuring race-free code: how much can computation overlap?
  • Slide 45
  • Common problem: write-write race. Processors running adjacent update functions simultaneously modify shared data. [Figure: CPU 1 and CPU 2 both write the shared value; the final value depends on which write lands last.]
  • Slide 46
  • Nuances of sequential consistency: data consistency requirements depend on the update function, and some algorithms are robust to data races. GraphLab's solution: the user can choose from three consistency models (full, edge, vertex), and GraphLab automatically enforces the user's choice. [Figure: an unsafe overlapping execution vs. a safe overlap consisting only of reads.]
  • Slide 47
  • Consistency rules: sequential consistency is guaranteed for all update functions. [Figure: the data each update function may touch under the consistency rules.]
  • Slide 48
  • Full consistency: only update functions at least two vertices apart may run in parallel, which reduces the opportunities for parallelism.
  • Slide 49
  • Obtaining more parallelism: not all update functions modify the entire scope. Edge consistency is sufficient for a large number of algorithms, including label propagation.
  • Slide 50
  • Edge consistency: each update gets exclusive access to its vertex and adjacent edges, while adjacent vertices may only be read, so updates on non-adjacent vertices can safely run in parallel. [Figure: CPU 1 and CPU 2 overlap only on a safe read.]
  • Slide 51
  • Obtaining even more parallelism: "map" operations, such as feature extraction on vertex data, only touch their own vertex.
  • Slide 52
  • Vertex consistency: the update function only requires safe access to its own vertex data, allowing all vertex updates to run in parallel.
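A small sketch, on a hypothetical three-vertex graph of my own, of which vertex data an update on vertex v may write or read under each model (full: vertex and neighbors writable; edge: vertex writable, neighbors read-only; vertex: only the vertex itself):

    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical adjacency: neighbor ids for each vertex of a small triangle graph.
    std::vector<std::vector<int>> nbrs = {{1, 2}, {0, 2}, {0, 1}};

    struct LockSet { std::set<int> write_vertices, read_vertices; };

    // Which vertex data an update on vertex v may touch under each consistency model
    // (edges adjacent to v are treated as owned together with v in this sketch).
    LockSet scope_for(int v, const std::string& model) {
      LockSet s;
      s.write_vertices.insert(v);                        // every model owns its own vertex
      for (int u : nbrs[v]) {
        if (model == "full") s.write_vertices.insert(u);      // neighbors writable
        else if (model == "edge") s.read_vertices.insert(u);  // neighbors readable only
        // "vertex": neighbors are not touched at all
      }
      return s;
    }

    int main() {
      for (std::string model : {"full", "edge", "vertex"}) {
        LockSet s = scope_for(0, model);
        std::cout << model << ": write={";
        for (int v : s.write_vertices) std::cout << v << " ";
        std::cout << "} read={";
        for (int v : s.read_vertices) std::cout << v << " ";
        std::cout << "}\n";
      }
    }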
  • Slide 53
  • The GraphLab framework outline again, now highlighting aggregates.
  • Slide 54
  • Global state and computation: not everything fits the graph metaphor, for example a shared weight or a global computation over the whole graph.
  • Slide 55
  • Aggregates are computed asynchronously at fixed time intervals: a user-defined map operation (e.g., extract Likes[i]), a user-defined reduce operation (e.g., sum), and a user-defined normalization. Update functions have read-only access to a copy of the aggregate.
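A minimal sketch of the map, reduce, and normalize steps of an aggregate (data and names are illustrative, not the GraphLab aggregate API); update functions would see a read-only copy of the final value:

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
      // Vertex data: each user's current estimate of P(likes cameras).
      std::vector<double> likes = {0.6, 0.4, 0.9, 0.1};

      // Map: extract the quantity of interest from each vertex (identity here).
      std::vector<double> mapped(likes.size());
      std::transform(likes.begin(), likes.end(), mapped.begin(),
                     [](double p) { return p; });

      // Reduce: sum the mapped values.
      double total = std::accumulate(mapped.begin(), mapped.end(), 0.0);

      // Normalize: e.g. turn the sum into an average over vertices.
      double average = total / static_cast<double>(mapped.size());

      // Update functions would read a snapshot like this, refreshed at fixed intervals.
      const double aggregate_snapshot = average;
      std::cout << "average interest = " << aggregate_snapshot << "\n";
    }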
  • Slide 56
  • Anatomy of a GraphLab program: (1) define a C++ update function; (2) build the data graph using the C++ graph object; (3) set engine parameters: scheduler type and consistency model; (4) add initial vertices to the scheduler; (5) register global aggregates and set refresh intervals; (6) run the engine on the graph (a blocking C++ call); (7) the final answer is stored in the graph.
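A skeleton of those seven steps with hypothetical type and function names; the real GraphLab C++ API differs, so treat this only as an outline of the program shape:

    #include <vector>

    struct Graph { /* vertices, edges, adjacency; see the earlier data-graph sketch */ };
    struct EngineOptions { const char* scheduler; const char* consistency; };

    void my_update(Graph&, int) {}                    // (1) user-defined update function (stub)
    Graph build_graph() { return Graph{}; }           // (2) construct the data graph (stub)
    void run_engine(Graph&, const EngineOptions&,     // (6) blocking call that drains the scheduler (stub)
                    const std::vector<int>&) {}

    int main() {
      Graph g = build_graph();                         // step 2
      EngineOptions opts{"priority", "edge"};          // step 3: scheduler type + consistency model
      std::vector<int> initial_vertices = {0, 1, 2};   // step 4: seed the scheduler
      // step 5 (register aggregates with refresh intervals) omitted in this outline
      run_engine(g, opts, initial_vertices);           // step 6: blocking call
      // step 7: the final answer now lives in the vertex data of g
      return 0;
    }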
  • Slide 57
  • Algorithms implemented: PageRank, loopy belief propagation, Gibbs sampling, CoEM, graphical model parameter learning, probabilistic matrix/tensor factorization, alternating least squares, Lasso with sparse features, support vector machines with sparse features, and label propagation.
  • Slide 58
  • Implementing the GraphLab API: multi-core and cloud settings.
  • Slide 59
  • Multi-core implementation: implemented in C++ on top of Pthreads and GCC atomics. Consistency models are implemented using read-write locks on each vertex with canonically ordered lock acquisition (dining philosophers). Approximate schedulers use approximate FIFO/priority ordering to reduce locking overhead. Experimental Matlab/Java/Python support. A nearly complete implementation is available under the Apache 2.0 license at graphlab.org.
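A sketch, under simplifying assumptions of my own (plain exclusive locks per vertex rather than GraphLab's internals), of the canonically ordered lock acquisition idea: lock the vertex and its neighbors in ascending id order so that adjacent updates cannot deadlock.

    #include <algorithm>
    #include <shared_mutex>
    #include <vector>

    // One reader-writer lock per vertex of a small chain graph.
    std::vector<std::shared_mutex> vertex_locks(5);
    std::vector<std::vector<int>> nbrs = {{1}, {0, 2}, {1, 3}, {2, 4}, {3}};

    void locked_update(int v) {
      // Canonical order: collect the scope (v and its neighbors) and sort by vertex id.
      std::vector<int> scope = nbrs[v];
      scope.push_back(v);
      std::sort(scope.begin(), scope.end());
      for (int u : scope) vertex_locks[u].lock();        // acquire in ascending id order
      // ... apply the update function to vertex v here ...
      for (auto it = scope.rbegin(); it != scope.rend(); ++it)
        vertex_locks[*it].unlock();                      // release in reverse order
    }

    int main() {
      for (int v = 0; v < 5; ++v) locked_update(v);      // single-threaded demo of the protocol
    }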
  • Slide 60
  • Distributed cloud implementation: implemented in C++ on top of the multi-core implementation at each node, with a custom RPC built on top of TCP/IP and MPI. The graph is partitioned over the cluster using either ParMETIS (high-performance partitioning heuristics) or random cuts (which seem to work well on natural graphs). Consistency models are enforced using either distributed read-write locks with pipelined acquisition, or graph coloring with phased execution. No fault tolerance yet (a solution is in progress); still experimental.
  • Slide 61
  • Shared-memory experiments: shared-memory setting, 16-core workstation.
  • Slide 62
  • Loopy belief propagation: 3D retinal image denoising. Data graph: 1 million vertices, 3 million edges. Update function: the loopy BP update equation. Scheduler: approximate priority. Consistency model: edge consistency.
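The slide only names the "loopy BP update equation"; the standard sum-product message update for a pairwise MRF, stated here as an assumption about what the update function computes, is:

    $$m_{i \to j}(x_j) \;\propto\; \sum_{x_i} \psi_{ij}(x_i, x_j)\,\psi_i(x_i) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i)$$

where the ψ's are the edge and vertex potentials and N(i) is the neighborhood of vertex i.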
  • Slide 63
  • [Chart: loopy belief propagation speedup on the 16-core workstation; SplashBP achieves a 15.5x speedup, with the optimal speedup shown for reference (closer is better).]
  • Slide 64
  • Gibbs sampling: protein-protein interaction networks [Elidan et al. 2006], a discrete MRF with 14K vertices and 100K edges (backbone, side-chain, and protein-interaction structure). Provably correct parallelization using edge consistency and a round-robin scheduler.
  • Slide 65
  • [Chart: Gibbs sampling speedup; the chromatic Gibbs sampler approaches the optimal speedup (closer is better).]
  • Slide 66
  • CoEM (Rosie Jones, 2005): a named entity recognition task, e.g. is "dog" an animal? is "Catalina" a place? Noun phrases ("the dog", "Australia", "Catalina Island") are linked to contexts ("ran quickly", "travelled to", "is pleasant"). Graph: 2 million vertices, 200 million edges. Hadoop: 95 cores, 7.5 hours.
  • Slide 67
  • CoEM (Rosie Jones, 2005) results: Hadoop, 95 cores, 7.5 hours; GraphLab, 16 cores, 30 minutes: 15x faster with 6x fewer CPUs. [Chart: GraphLab CoEM speedup relative to optimal.]
  • Slide 68
  • Lasso: a regularized linear model with data matrix X (n x d), weights w (d x 1), and observations y (n x 1); the figure shows 5 features and 4 examples. Solved with the shooting algorithm (coordinate descent): updates on weight vertices modify losses on observation vertices, which requires the full consistency model. Financial prediction dataset from Kogan et al. [2009].
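A hedged, single-threaded sketch of the shooting algorithm (coordinate descent with soft-thresholding); the data matrix here is a placeholder, and the graph-based parallel formulation from the slide is not reproduced.

    #include <cmath>
    #include <iostream>
    #include <vector>

    // Soft-thresholding operator used by the Lasso coordinate update.
    double soft_threshold(double rho, double lambda) {
      if (rho >  lambda) return rho - lambda;
      if (rho < -lambda) return rho + lambda;
      return 0.0;
    }

    int main() {
      // Placeholder data: n = 4 examples, d = 3 features.
      std::vector<std::vector<double>> X = {{1, 0, 2}, {0, 1, 1}, {1, 1, 0}, {2, 0, 1}};
      std::vector<double> y = {3, 2, 2, 4};
      const double lambda = 0.1;
      const int n = X.size(), d = X[0].size();
      std::vector<double> w(d, 0.0);

      for (int sweep = 0; sweep < 50; ++sweep) {        // repeat until convergence in practice
        for (int j = 0; j < d; ++j) {                   // one "shooting" update per coordinate
          double rho = 0.0, z = 0.0;
          for (int i = 0; i < n; ++i) {
            double pred_minus_j = 0.0;
            for (int k = 0; k < d; ++k)
              if (k != j) pred_minus_j += X[i][k] * w[k];
            rho += X[i][j] * (y[i] - pred_minus_j);     // correlation with the partial residual
            z   += X[i][j] * X[i][j];
          }
          w[j] = (z > 0.0) ? soft_threshold(rho, lambda) / z : 0.0;
        }
      }
      for (double wj : w) std::cout << wj << " ";
      std::cout << "\n";
    }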
  • Slide 69
  • [Chart: Lasso speedup under the full consistency model on dense and sparse datasets (closer to optimal is better).]
  • Slide 70
  • Relaxing consistency: [Chart: Lasso speedup with relaxed consistency on dense and sparse datasets.] Why does this work? See the Shotgun ICML paper.
  • Slide 71
  • Experiments: Amazon EC2 high-performance nodes.
  • Slide 72
  • Matrix factorization: Netflix collaborative filtering via alternating least squares matrix factorization. Model: 0.5 million nodes, 99 million edges (a bipartite users-movies graph with latent dimension d).
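The slide only names the method; a standard regularized ALS update, stated here as an assumption about the model (U_u and V_m are the d-dimensional user and movie factors, r_um the observed ratings, λ the regularization weight, and Ω_u, Ω_m the observed entries for user u and movie m), alternates:

    $$U_u \leftarrow \Big(\sum_{m \in \Omega_u} V_m V_m^\top + \lambda I_d\Big)^{-1} \sum_{m \in \Omega_u} r_{um} V_m,
    \qquad
    V_m \leftarrow \Big(\sum_{u \in \Omega_m} U_u U_u^\top + \lambda I_d\Big)^{-1} \sum_{u \in \Omega_m} r_{um} U_u$$

Each user update depends only on the movies that user rated, which is what makes the computation graph-parallel.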
  • Slide 73
  • [Chart: Netflix speedup as the size (d) of the matrix factorization increases.]
  • Slide 74
  • Netflix results. [Chart.]
  • Slide 75
  • Netflix: comparison to an MPI implementation. [Chart: performance vs. number of cluster nodes.]
  • Slide 76
  • Summary: an abstraction tailored to machine learning that targets graph-parallel algorithms; it naturally expresses data and computational dependencies and dynamic, iterative computation; it simplifies parallel algorithm design by automatically ensuring data consistency; and it achieves state-of-the-art parallel performance on a variety of problems.
  • Slide 77
  • Current/future work: out-of-core storage; Hadoop/HDFS integration (graph construction, graph storage, launching GraphLab from Hadoop, fault tolerance through HDFS checkpoints); sub-scope parallelism and stochastic scopes to address the challenge of very high-degree nodes; and moving from update functions to update functors, which allows an update function to send state when rescheduling.
  • Slide 78
  • Check out GraphLab at http://graphlab.org: documentation, code, and tutorials. Questions & feedback: [email protected].