DESCRIPTION

A Framework for Machine Learning and Data Mining in the Cloud. Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joe Hellerstein. Big Data is everywhere: 900 million Facebook users, 28 million Wikipedia pages, 6 billion Flickr photos.

TRANSCRIPT

Page 1: Yucheng Low

Carnegie Mellon

A Framework for Machine Learning and Data Mining in the Cloud

Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joe Hellerstein

Page 2: Yucheng Low

Big Data is Everywhere

72 Hours a Minute (YouTube)

28 Million Wikipedia Pages

900 Million Facebook Users

6 Billion Flickr Photos

2

“… data a new class of economic asset, like currency or gold.”

“…growing at 50 percent a year…”

Page 3: Yucheng Low

How will we design and implement Big Learning systems?

Big Learning

3

Page 4: Yucheng Low

Shift Towards Use Of Parallelism in ML

GPUs, Multicore, Clusters, Clouds, Supercomputers

ML experts repeatedly solve the same parallel design challenges:

Race conditions, distributed state, communication…

Resulting code is very specialized: difficult to maintain, extend, debug…

Graduate students

Avoid these problems by using high-level abstractions

4

Page 5: Yucheng Low

CPU 1 CPU 2 CPU 3 CPU 4

MapReduce – Map Phase

5

Page 6: Yucheng Low

CPU 1 CPU 2 CPU 3 CPU 4

MapReduce – Map Phase

6

Embarrassingly parallel independent computation; no communication needed.

[Figure: each CPU independently computes a value.]

Page 7: Yucheng Low

CPU 1 CPU 2 CPU 3 CPU 4

MapReduce – Map Phase

7

[Figure: each CPU continues to compute values independently.]

Embarrassingly parallel independent computation; no communication needed.

Page 8: Yucheng Low

CPU 1 CPU 2 CPU 3 CPU 4

MapReduce – Map Phase

8

[Figure: each CPU continues to compute values independently.]

Embarrassingly parallel independent computation; no communication needed.

Page 9: Yucheng Low

CPU 1 CPU 2

MapReduce – Reduce Phase

9

[Figure: per-image feature values from the map phase, each labeled A (attractive) or U (ugly), are aggregated by the reduce phase into Attractive Face Statistics and Ugly Face Statistics.]
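To make the pattern concrete, here is a minimal, self-contained C++ sketch of this kind of map/reduce computation (illustrative only; the labels, feature, and data are invented for the example, and no Hadoop or MapReduce library is used):

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Image { std::string label; double pixel_sum; };  // toy stand-in for real image data

// "Map": independent per-record computation (no communication needed).
double extract_feature(const Image& img) { return img.pixel_sum / 255.0; }

int main() {
    std::vector<Image> images = {
        {"A", 120.0}, {"U", 30.0}, {"A", 200.0}, {"U", 60.0}, {"A", 180.0}};

    // Map phase: each record could be processed on a different CPU.
    std::vector<std::pair<std::string, double>> mapped;
    for (const auto& img : images)
        mapped.emplace_back(img.label, extract_feature(img));

    // Reduce phase: aggregate per label (here, the mean feature value per class).
    std::map<std::string, std::pair<double, int>> acc;  // label -> (sum, count)
    for (const auto& kv : mapped) {
        acc[kv.first].first += kv.second;
        acc[kv.first].second += 1;
    }
    for (const auto& kv : acc)
        std::cout << kv.first << " mean feature = "
                  << kv.second.first / kv.second.second << "\n";
    return 0;
}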

Page 10: Yucheng Low

MapReduce for Data-Parallel ML

Excellent for large data-parallel tasks!

Data-Parallel (MapReduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics

Graph-Parallel: Graphical Models (Gibbs Sampling, Belief Propagation, Variational Optimization), Semi-Supervised Learning (Label Propagation, CoEM), Graph Analysis (PageRank, Triangle Counting), Collaborative Filtering (Tensor Factorization)

Is there more to Machine Learning?

10

Page 11: Yucheng Low

Carnegie Mellon

Exploit Dependencies

Page 12: Yucheng Low

12

[Figure: a social network where a user interested in Underwater Hockey is connected both to users interested in Hockey and to users interested in Scuba Diving.]

Page 13: Yucheng Low

Graphs are Everywhere

Collaborative Filtering: Netflix (Users, Movies)

Text Analysis: Wiki (Docs, Words)

Probabilistic Analysis: Social Network

13

Page 14: Yucheng Low

Properties of Computation on Graphs

Dependency Graph

Iterative Computation

Local Updates

[Figure: my interests depend on my friends' interests.]

Page 15: Yucheng Low

ML Tasks Beyond Data-Parallelism

Data-Parallel (MapReduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics

Graph-Parallel: Graphical Models (Gibbs Sampling, Belief Propagation, Variational Optimization), Semi-Supervised Learning (Label Propagation, CoEM), Graph Analysis (PageRank, Triangle Counting), Collaborative Filtering (Tensor Factorization)

15

Page 16: Yucheng Low

2010: Shared Memory

Bayesian Tensor Factorization, Gibbs Sampling, Matrix Factorization, Lasso, SVM, Belief Propagation, PageRank, CoEM, SVD, LDA, Splash Sampler, Alternating Least Squares, Linear Solvers, …many others…

21

Page 17: Yucheng Low

22

Limited CPU Power, Limited Memory, Limited Scalability

Page 18: Yucheng Low

Distributed Cloud

- Distributing State
- Data Consistency
- Fault Tolerance

23

Unlimited amount of computation resources! (up to funding limitations)

Page 19: Yucheng Low

The GraphLab Framework

Graph Based Data Representation

Update Functions (User Computation)

Consistency Model

24

Page 20: Yucheng Low

Data Graph: data is associated with vertices and edges.

Vertex Data:
• User profile
• Current interest estimates

Edge Data:
• Relationship (friend, classmate, relative)

Graph:
• Social Network

25
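A minimal C++ sketch of how such vertex and edge data might be declared (illustrative only; these type names are invented and this is not the actual GraphLab graph API):

#include <string>
#include <vector>

// Data stored on each vertex: a user profile and current interest estimates.
struct VertexData {
    std::string user_name;
    std::vector<double> interest_estimates;  // e.g., one weight per topic
};

// Data stored on each edge: the relationship type between two users.
struct EdgeData {
    std::string relationship;  // "friend", "classmate", "relative", ...
};

// A toy adjacency-list graph holding the data; a distributed engine would
// additionally partition such a graph across machines.
struct Graph {
    std::vector<VertexData> vertices;
    struct Edge { int src, dst; EdgeData data; };
    std::vector<Edge> edges;
};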

Page 21: Yucheng Low

Distributed Graph

Partition the graph across multiple machines.

26

Page 22: Yucheng Low

Distributed Graph

Ghost vertices maintain adjacency structure and replicate remote data.

“ghost” vertices

27
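A toy, self-contained C++ sketch of the ghost idea (illustrative; the types and the synchronization step are simplified assumptions, not GraphLab's implementation): each machine stores the vertices it owns plus read-only cached copies (ghosts) of remote vertices its edges touch, and the ghosts are refreshed by a synchronization step.

#include <iostream>
#include <map>

// Per-machine view of a partitioned graph: owned vertex data plus
// locally cached copies ("ghosts") of remote vertices.
struct MachineView {
    std::map<int, double> owned;   // vertex id -> data this machine owns
    std::map<int, double> ghosts;  // vertex id -> cached copy of remote data
};

// Ghost synchronization: copy the current value of every ghost vertex
// from the machine that owns it (here, simply from the other view).
void sync_ghosts(MachineView& view, const MachineView& remote) {
    for (auto& g : view.ghosts) {
        auto it = remote.owned.find(g.first);
        if (it != remote.owned.end()) g.second = it->second;
    }
}

int main() {
    MachineView m1{{{0, 1.0}, {1, 2.0}}, {{2, 0.0}}};  // owns 0,1; ghost of 2
    MachineView m2{{{2, 3.0}, {3, 4.0}}, {{1, 0.0}}};  // owns 2,3; ghost of 1
    sync_ghosts(m1, m2);
    sync_ghosts(m2, m1);
    std::cout << "m1 ghost of vertex 2 = " << m1.ghosts[2] << "\n";  // now 3.0
    std::cout << "m2 ghost of vertex 1 = " << m2.ghosts[1] << "\n";  // now 2.0
}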

Page 23: Yucheng Low

Distributed Graph

Cut efficiently using HPC Graph partitioning tools (ParMetis / Scotch / …)

“ghost” vertices

28

Page 24: Yucheng Low

The GraphLab Framework

Graph Based Data Representation

Update Functions (User Computation)

Consistency Model

29

Page 25: Yucheng Low

Update Functions: a user-defined program, applied to a vertex, that transforms the data in the scope of that vertex.

Pagerank(scope) {
  // Update the current vertex data
  ...
  // Reschedule neighbors if needed
  if vertex.PageRank changes then reschedule_all_neighbors;
}

Dynamic computation

Update function applied (asynchronously) in parallel until convergence

Many schedulers available to prioritize computation

30
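Below is a minimal, self-contained C++ sketch of this update-function-plus-scheduler pattern on a toy in-memory graph (illustrative only; it does not use the actual GraphLab API, and the graph, damping factor, and tolerance are made up for the example). The update recomputes one vertex from its in-neighbors and reschedules its out-neighbors only when the value changed noticeably.

#include <cmath>
#include <iostream>
#include <queue>
#include <vector>

// Toy in-memory graph: one rank value per vertex, plus in/out adjacency.
struct Graph {
    std::vector<double> rank;
    std::vector<std::vector<int>> in_nbrs;
    std::vector<std::vector<int>> out_nbrs;
};

// The update function: recompute one vertex, reschedule neighbors if needed.
void pagerank_update(Graph& g, int v, std::queue<int>& scheduler,
                     std::vector<bool>& scheduled) {
    const double alpha = 0.15, tolerance = 1e-4;
    double sum = 0.0;
    for (int u : g.in_nbrs[v])
        sum += g.rank[u] / g.out_nbrs[u].size();
    double new_rank = alpha + (1.0 - alpha) * sum;
    double change = std::fabs(new_rank - g.rank[v]);
    g.rank[v] = new_rank;
    if (change > tolerance)              // dynamic computation: only reschedule
        for (int u : g.out_nbrs[v])      // neighbors when the change matters
            if (!scheduled[u]) {
                scheduled[u] = true;
                scheduler.push(u);
            }
}

int main() {
    // Small directed graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
    Graph g;
    g.rank = {1.0, 1.0, 1.0};
    g.out_nbrs = {{1, 2}, {2}, {0}};
    g.in_nbrs = {{2}, {0}, {0, 1}};

    // A sequential FIFO queue standing in for GraphLab's parallel schedulers.
    std::queue<int> scheduler;
    std::vector<bool> scheduled(3, true);
    for (int v = 0; v < 3; ++v) scheduler.push(v);

    while (!scheduler.empty()) {
        int v = scheduler.front();
        scheduler.pop();
        scheduled[v] = false;
        pagerank_update(g, v, scheduler, scheduled);
    }
    for (int v = 0; v < 3; ++v)
        std::cout << "PageRank[" << v << "] = " << g.rank[v] << "\n";
}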

Why Dynamic?

Page 26: Yucheng Low

Asynchronous Belief Propagation

Cumulative Vertex Updates

[Figure legend: Many Updates … Few Updates]

Algorithm identifies and focuses on hidden sequential structure

Graphical Model

Challenge = Boundaries

32

Page 27: Yucheng Low

33

Shared Memory Dynamic Schedule

[Figure: CPU 1 and CPU 2 repeatedly pull scheduled vertices (a, b, h, i, …) from a shared Scheduler and apply the update function to them.]

The process repeats until the scheduler is empty.

Page 28: Yucheng Low

Distributed Scheduling

[Figure: the graph is partitioned across two machines; each machine's scheduler holds pending vertices from its own partition.]

Each machine maintains a schedule over the vertices it owns.

Distributed consensus is used to identify completion.

34

Page 29: Yucheng Low

Ensuring Race-Free Code: How much can computation overlap?

35

Page 30: Yucheng Low

The GraphLab Framework

Graph Based Data Representation

Update Functions (User Computation)

Consistency Model

37

Page 31: Yucheng Low

PageRank Revisited

38

Pagerank(scope) {
  ...
}

Page 32: Yucheng Low

Racing PageRank: this was actually encountered in user code.

39

Page 33: Yucheng Low

Bugs

40

Pagerank(scope) {
  ...
}

Page 34: Yucheng Low

Bugs

41

Pagerank(scope) {
  ...
}

[The code on the slide highlights a temporary variable, tmp.]

Page 35: Yucheng Low

Throughput != Performance

No consistency gives higher throughput (#updates/sec) but potentially slower convergence of the ML algorithm.

42

Page 36: Yucheng Low

Racing Collaborative Filtering

43

[Plot: training error (RMSE, log scale) vs. time for d=20, with and without consistency ("Racing d=20").]

Page 37: Yucheng Low

Serializability

44

For every parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: an execution interleaved across CPU 1 and CPU 2 (parallel) corresponds to some ordering of the same updates on a single CPU (sequential).]

Page 38: Yucheng Low

Serializability Example

45

[Figure: read and write regions of two overlapping update-function scopes.]

Edge Consistency

Update functions one vertex apart can be run in parallel; the overlapping regions are only read.

Stronger and weaker consistency levels are available. The user-tunable consistency level trades off parallelism against consistency.

Page 39: Yucheng Low

Distributed Consistency

Solution 1: Graph Coloring

Solution 2: Distributed Locking

Page 40: Yucheng Low

Edge Consistency via Graph Coloring

Vertices of the same color are all at least one vertex apart. Therefore, all vertices of the same color can be run in parallel!

48
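A minimal C++ sketch of one way to obtain such a coloring, greedy coloring (illustrative; the slides do not prescribe this particular algorithm):

#include <iostream>
#include <vector>

// Greedy coloring: each vertex gets the smallest color unused by its
// already-colored neighbors, so no two adjacent vertices share a color.
std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
    const int n = adj.size();
    std::vector<int> color(n, -1);
    for (int v = 0; v < n; ++v) {
        std::vector<bool> used(n, false);
        for (int u : adj[v])
            if (color[u] >= 0) used[color[u]] = true;
        int c = 0;
        while (used[c]) ++c;
        color[v] = c;
    }
    return color;
}

int main() {
    // 4-cycle: 0-1-2-3-0 (2-colorable).
    std::vector<std::vector<int>> adj = {{1, 3}, {0, 2}, {1, 3}, {0, 2}};
    std::vector<int> color = greedy_color(adj);
    for (int v = 0; v < (int)color.size(); ++v)
        std::cout << "vertex " << v << " -> color " << color[v] << "\n";
    // Vertices of one color have no edges between them, so their update
    // functions can safely run in parallel under edge consistency.
}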

Page 41: Yucheng Low

Chromatic Distributed Engine

[Timeline: each machine executes tasks on all vertices of color 0, followed by ghost synchronization completion and a barrier; then all vertices of color 1, followed again by ghost synchronization and a barrier; and so on.]

49

Page 42: Yucheng Low

Matrix Factorization: Netflix Collaborative Filtering

Alternating Least Squares Matrix Factorization. Model: 0.5 million nodes, 99 million edges.

[Figure: bipartite Users-Movies graph of Netflix ratings, factorized with latent dimension d.]

50

Page 43: Yucheng Low

51

Netflix Collaborative Filtering

[Plots: speedup vs. # machines for D=20 and D=100 against ideal scaling, and runtime vs. # machines for Hadoop, MPI, and GraphLab.]

Page 44: Yucheng Low

The Cost of Hadoop

52

Page 45: Yucheng Low

CoEM (Rosie Jones, 2005): Named Entity Recognition Task

Is "Cat" an animal? Is "Istanbul" a place?

[Figure: bipartite graph of noun phrases ("the cat", "Australia", "Istanbul") and contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant").]

Vertices: 2 Million. Edges: 200 Million.

Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min
Distributed GraphLab: 32 EC2 nodes, 80 secs (0.3% of the Hadoop time)

53

Page 46: Yucheng Low

Problems:

Requires a graph coloring to be available.

Frequent barriers make it extremely inefficient for highly dynamic systems where only a small number of vertices are active in each round.

55

Page 47: Yucheng Low

Distributed Consistency

Solution 1: Graph Coloring

Solution 2: Distributed Locking

Page 48: Yucheng Low

Distributed Locking: edge consistency can be guaranteed through locking.

[Figure legend: RW Lock]

57

Page 49: Yucheng Low

Consistency Through Locking: acquire a write-lock on the center vertex and read-locks on adjacent vertices.

58
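A shared-memory sketch of that protocol using C++17 std::shared_mutex (illustrative; this is a multicore stand-in for the distributed locks, and the ascending-vertex-id lock order used here to avoid deadlock is an added assumption, not something stated on the slide):

#include <algorithm>
#include <deque>
#include <iostream>
#include <shared_mutex>
#include <vector>

// Edge-consistent update under locking: exclusive (write) lock on the center
// vertex, shared (read) locks on its neighbors. Locks are always acquired in
// ascending vertex-id order so two overlapping scopes cannot deadlock.
// Assumes no self-loops.
void locked_update(int center,
                   const std::vector<std::vector<int>>& nbrs,
                   std::vector<double>& data,
                   std::deque<std::shared_mutex>& locks) {
    std::vector<int> scope = nbrs[center];
    scope.push_back(center);
    std::sort(scope.begin(), scope.end());

    for (int v : scope) {                 // acquire in canonical order
        if (v == center) locks[v].lock();
        else             locks[v].lock_shared();
    }

    double sum = 0.0;                     // read neighbors, write the center
    for (int u : nbrs[center]) sum += data[u];
    if (!nbrs[center].empty()) data[center] = sum / nbrs[center].size();

    for (auto it = scope.rbegin(); it != scope.rend(); ++it) {
        if (*it == center) locks[*it].unlock();
        else               locks[*it].unlock_shared();
    }
}

int main() {
    std::vector<std::vector<int>> nbrs = {{1, 2}, {0, 2}, {0, 1}};  // triangle
    std::vector<double> data = {1.0, 2.0, 3.0};
    std::deque<std::shared_mutex> locks(3);  // one RW lock per vertex
    locked_update(0, nbrs, data, locks);     // normally called from many worker threads
    std::cout << "data[0] = " << data[0] << "\n";
}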

Page 50: Yucheng Low

Consistency Through Locking

Multicore setting: PThread RW-locks.

Distributed setting: distributed locks. Challenge: latency. Solution: pipelining.

[Figure: vertices A, B, C, D partitioned across Machine 1 and Machine 2; acquiring locks on a scope that spans both machines incurs network latency.]

59

Page 51: Yucheng Low

No Pipelining

[Timeline: lock scope 1 is requested; the request is processed; scope 1 is acquired; update_function 1 runs; scope 1 is released; the release is processed. Each step waits for the previous one to finish.]

60

Page 52: Yucheng Low

Pipelining / Latency Hiding: hide latency using pipelining

[Timeline: while scope 1 is acquired, updated, and released, lock requests for scopes 2 and 3 are already being processed; their acquisitions and updates overlap with earlier work.]

61

Page 53: Yucheng Low

Latency Hiding: hide latency using request buffering

[Timeline: same pipelined sequence as the previous slide.]

Residual BP on a 190K-vertex, 560K-edge graph, 4 machines:
No Pipelining: 472 s
Pipelining: 10 s
47x Speedup

62

Page 54: Yucheng Low

63

Video Cosegmentation

Segments mean the same

1740 Frames. Model: 10.5 million nodes, 31 million edges.

Probabilistic Inference Task

Page 55: Yucheng Low

64

Video Coseg. Speedups

[Plot: speedup vs. # machines for GraphLab compared to ideal scaling.]

Page 56: Yucheng Low

The GraphLab Framework

Graph Based Data Representation

Update Functions (User Computation)

Consistency Model

65

Page 57: Yucheng Low

What if machines fail? How do we provide fault tolerance?

Page 58: Yucheng Low

Checkpoint:
1: Stop the world
2: Write state to disk

Page 59: Yucheng Low

68

Snapshot Performance

[Plot: progress over time with and without a snapshot when one machine is slow; markers indicate the snapshot time and the slow machine.]

Because we have to stop the world, one slow machine slows everything down!

Page 60: Yucheng Low

How can we do better?

Take advantage of consistency

Page 61: Yucheng Low

70

Checkpointing: in 1985, Chandy and Lamport invented an asynchronous snapshotting algorithm for distributed systems.

[Figure: part of the system is already snapshotted, part is not yet snapshotted.]

Page 62: Yucheng Low

71

Checkpointing: Fine-Grained Chandy-Lamport.

Easily implemented within GraphLab as an Update Function!
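A rough, sequential C++ sketch of a snapshot expressed as an update function (an approximation of the fine-grained idea, not the exact GraphLab algorithm; edge data and the interaction with concurrently running regular updates are omitted): when a vertex runs its snapshot update, it saves its own data and schedules every neighbor that has not yet been snapshotted.

#include <iostream>
#include <queue>
#include <vector>

struct SnapshotState {
    std::vector<bool> done;            // vertex already snapshotted?
    std::vector<double> saved_vertex;  // saved vertex data
    // (edge data would be saved analogously; omitted for brevity)
};

// Snapshot "update function": save this vertex, then make sure every
// not-yet-snapshotted neighbor will also run a snapshot update.
void snapshot_update(int v,
                     const std::vector<std::vector<int>>& nbrs,
                     const std::vector<double>& vertex_data,
                     SnapshotState& s, std::queue<int>& scheduler) {
    if (s.done[v]) return;                 // already part of the snapshot
    s.saved_vertex[v] = vertex_data[v];    // record this vertex's state
    s.done[v] = true;
    for (int u : nbrs[v])
        if (!s.done[u]) scheduler.push(u); // neighbor still needs snapshotting
}

int main() {
    std::vector<std::vector<int>> nbrs = {{1}, {0, 2}, {1}};
    std::vector<double> vertex_data = {0.1, 0.2, 0.3};
    SnapshotState s{std::vector<bool>(3, false), std::vector<double>(3, 0.0)};
    std::queue<int> scheduler;
    scheduler.push(0);                     // snapshot initiated at vertex 0
    while (!scheduler.empty()) {
        int v = scheduler.front();
        scheduler.pop();
        snapshot_update(v, nbrs, vertex_data, s, scheduler);
    }
    for (double x : s.saved_vertex) std::cout << x << " ";
    std::cout << "\n";                     // regular updates keep running elsewhere
}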

Page 63: Yucheng Low

72

Async. Snapshot Performance

[Plot: progress over time with and without the asynchronous snapshot when one machine is slow.]

No penalty incurred by the slow machine!

Page 64: Yucheng Low

Summary

Extended the GraphLab abstraction to distributed systems.

Two different methods of achieving consistency:
Graph Coloring
Distributed Locking with pipelining

Efficient implementations.

Asynchronous fault tolerance with fine-grained Chandy-Lamport.

Performance, Usability, Efficiency, Scalability

73

Page 65: Yucheng Low

Carnegie Mellon University

Release 2.1 + many toolkits: http://graphlab.org

PageRank: 40x faster than Hadoop; 1B edges per second.

Triangle Counting: 282x faster than Hadoop; 400M triangles per second.

Major Improvements to be published in OSDI 2012