gaia: geo-distributed machine learning approaching lan...

51
Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu

Upload: others

Post on 16-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

Kevin HsiehAaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu†

Page 2: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Machine Learning and Big Data

•Machine learning is widely used to derive useful information from large-scale data

2

Image Classification

VideosPictures

Video Analytics

User Activities

Preference Prediction

……

Page 3: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Big Data is Geo-Distributed • A large amount of data is generated rapidly, all over the world

3

Page 4: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Centralizing Data is Infeasible [1, 2, 3]

• Moving data over wide-area networks (WANs) can be extremely slow

• It is also subject to data sovereignty laws

4

1.Vulimiri etal.,NSDI’152.Puetal.,SIGCOMM’153.Viswanathanetal.,OSDI’16

Page 5: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Geo-distributed ML is Challenging• No ML system is designed to run across data centers

(up to 53X slowdown in our study)

5

Page 6: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Our Goal

•Develop a geo-distributed ML system•Minimize communication over wide-area networks

•Retain the accuracy and correctness of ML algorithms

•Without requiring changes to the algorithms

6

KeyResult:1.8-53.5Xspeedupoverstate-of-the-artMLsystemsonWANs

Page 7: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Outline

•Problem & Goal•Background & Motivation•Gaia System Overview•Approximate Synchronous Parallel•System Implementation•Evaluation•Conclusion

7

Page 8: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

WorkerMachineN

ParameterServer

Background: Parameter Server Architecture

• The parameter server architecture has been widely adopted in many ML systems

8

…WorkerMachine1

Data1

ParameterServer

MLModelUpdateUpdate

ReadRead …

TrainingData

DataN

UpdateUpdate

ReadRead

Page 9: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

WorkerMachineN

ParameterServer

Background: Parameter Server Architecture

• The parameter server architecture has been widely adopted in many ML systems

9

…WorkerMachine1

Data1

ParameterServer

MLModelUpdateUpdate

ReadRead …

TrainingData

DataN

UpdateUpdate

ReadRead

Synchronization is critical to the accuracy and correctness of ML algorithms

Page 10: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Deploy Parameter Servers on WANs• Deploying parameter servers across data centers

requires a lot of communication over WANs

10

WorkerMachineN

ParameterServer

…WorkerMachine1

ParameterServer

MLModel

DataCenter1 DataCenter2

Page 11: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

WAN: Low Bandwidth and High Cost• WAN bandwidth is 15X smaller than LAN bandwidth on average,

and up to 60X smaller• In Amazon EC2, the monetary cost of WAN communication is

up to 38X the cost of renting machines

11

VirginiaCaliforniaOregonIrelandFrankfurtTokyoSeoulSingaporeSydneyMumbaiSão Paulo

0100200300400500600700800900

1000

Net

wor

k Ba

ndw

idth

(Mb/

s)

11 Amazon EC2 Regions

Page 12: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

3.7X 3.5X

23.8X

5.9X 4.4X

24.2X

05

10152025

LAN EC2-ALL V/C WAN S/S WANNor

mal

ized

Exe

cutio

n Ti

me

until

Con

verg

ence IterStore Bӧsen

ML System Performance on WANs

1) Cui et al., “Exploiting Iterative-ness for Parallel ML Computations”, SoCC’14

2) Wei et al., “Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics”, SoCC’15 12

MatrixFactorization

Page 13: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

3.7X 3.5X

23.8X

5.9X 4.4X

24.2X

05

10152025

LAN EC2-ALL V/C WAN S/S WANNor

mal

ized

Exe

cutio

n Ti

me

until

Con

verg

ence IterStore Bӧsen

ML System Performance on WANs

1) Cui et al., “Exploiting Iterative-ness for Parallel ML Computations”, SoCC’14

2) Wei et al., “Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics”, SoCC’15 13

Running ML systems on WANs can seriously slow down ML applications

11EC2Regions

Virginia/CaliforniaSingapore/SãoPaulo

MatrixFactorization

Page 14: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Outline

•Problem & Goal•Background & Motivation•Gaia System Overview•Approximate Synchronous Parallel•System Implementation•Evaluation•Conclusion

14

Page 15: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Gaia System Overview• Key idea: Decouple the synchronization model within

the data center from the synchronization model between data centers

15

ParameterServer

DataCenter1

ParameterServer

WorkerMachine

LocalSync

ParameterServer

ParameterServer

DataCenter2

WorkerMachineWorkerMachine

ApproximatelyCorrectModelCopy

ApproximatelyCorrectModelCopy

RemoteSync

Page 16: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Gaia System Overview• Key idea: Decouple the synchronization model within

the data center from the synchronization model between data centers

16

ParameterServer

DataCenter1

ParameterServer

WorkerMachine

LocalSync

ParameterServer

ParameterServer

DataCenter2

WorkerMachineWorkerMachine

ApproximatelyCorrectModelCopy

ApproximatelyCorrectModelCopy

RemoteSync

Communicate over WANsonly significant updates

Page 17: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

95.6% 95.2% 97.0%

0%20%40%60%80%

100%

10% 5% 1% 0.5% 0.1% 0.05% 0.01%Insi

gnifi

cant

Upd

ates

%

Threshold of Significant Updates

Matrix Factorization Topic Modeling Image Classification

Key Finding: Study of Update Significance

17

The vast majority of updates are insignificant

Page 18: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Outline

•Problem & Goal•Background & Motivation•Gaia System Overview•Approximate Synchronous Parallel•System Implementation•Evaluation•Conclusion

18

Page 19: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Approximate Synchronous Parallel

Thesignificancefilter• Filterupdatesbasedontheirsignificance

ASPselectivebarrier• Ensuresignificantupdatesarereadintime

Mirrorclock• Safeguardforpathologicalcases

19

Page 20: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

The Significance Filter

20

Worker Machine

ParameterServer

Parameter XValue Aggregated Update

Update (Δ1) on X

Significance Function

Other Parameters

Significance Threshold>?

𝐴𝑔𝑔. 𝑈𝑝𝑑𝑎𝑡𝑒𝑉𝑎𝑙𝑢𝑒

1%𝑇�

1%

Update (Δ2) on X

Δ1Δ1+Δ20

Page 21: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Approximate Synchronous Parallel

Thesignificancefilter• Filterupdatesbasedontheirsignificance

ASPselectivebarrier• Ensuresignificantupdatesarereadintime

Mirrorclock• Safeguardforpathologicalcases

21

Page 22: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

ASP Selective Barrier

22

Data Center 1 Data Center 2

Parameter Server Parameter Server

Significant Update

Significant Update

Significant Update

Significant Update

Arrive too late!

Data Center 1 Data Center 2

Parameter Server Parameter Server

Significant Update

Significant Update

Significant Update

Significant Update

SelectiveBarrier

Only workers that depend on these parameters are blocked

Page 23: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Outline

•Problem & Goal•Background & Motivation•Gaia System Overview•Approximate Synchronous Parallel•System Implementation•Evaluation•Conclusion

23

Page 24: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Put it All Together: The Gaia System

24

LocalServer

GaiaParameterServer

WorkerMachine

SignificanceFilter

ParameterStore

WorkerMachine

WorkerMachine

MirrorServer

MirrorClient

DataCenterBoundary

ControlQueue

DataQueue

GaiaParameterServer

Update AggregatedUpdate

SelectiveBarrier

Page 25: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Put it All Together: The Gaia System

25

LocalServer

GaiaParameterServer

WorkerMachine

SignificanceFilter

ParameterStore

WorkerMachine

WorkerMachine

MirrorServer

MirrorClient

DataCenterBoundary

ControlQueue

DataQueue

GaiaParameterServer

Update AggregatedUpdate

SelectiveBarrier

Control messages (barriers, etc.) are always prioritized

No change is required for ML algorithms and ML programs

Page 26: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Problem: Broadcast Significant Updates

26

Communication overhead is proportional to the number of data centers

Page 27: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Mitigation: Overlay Networks and Hubs

27

Save communication on WANs by aggregating the updates at hubs

Data Center Group Data Center Group Data Center Group

Data Center Group

Hub

HubHub

Hub

Page 28: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Outline

•Problem & Goal•Background & Motivation•Gaia System Overview•Approximate Synchronous Parallel•System Implementation•Evaluation•Conclusion

28

Page 29: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Methodology

• Applications• Matrix Factorization with the Netflix dataset• Topic Modeling with the Nytimes dataset• Image Classification with the ILSVRC12 dataset

• Hardware platform• 22 machines with emulated EC2 WAN bandwidth• We validated the performance with a real EC2 deployment

• Baseline• IterStore (Cui et al., SoCC’14) and GeePS (Cui et al., EuroSys’16) on WAN

• Performance metrics• Execution time until algorithm convergence• Monetary cost of algorithm convergence

29

Page 30: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

3.8X 3.7X6.0X

3.7X 4.8X8.5X

0

0.2

0.4

0.6

0.8

1

Matrix Factorization Topic Modeling Image Classification

Nor

mal

ized

Exe

c. T

ime

Baseline Gaia LAN

Performance – 11 EC2 Data Centers

30

Gaiaachieves3.7-6.0XspeedupoverBaselineGaiaisatmost1.40XofLANspeeds

Page 31: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

25.4X 14.1X53.5X

23.8X 17.3X 53.7X

0

0.2

0.4

0.6

0.8

1

Matrix Factorization

Topic Modeling Image Classification

Baseline Gaia LAN

3.7X 3.7X7.4X

3.5X 3.9X7.4X

0

0.2

0.4

0.6

0.8

1

Matrix Factorization

Topic Modeling Image Classification

Nor

mal

ized

Exe

c. T

ime

Baseline Gaia LAN

Performance and WAN Bandwidth

31

V/C WAN(Virginia/California)

S/S WAN(Singapore/São Paulo)

Gaiaachieves3.7-53.5XspeedupoverBaselineGaiaisatmost1.23XofLANspeeds

Page 32: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

2.6X5.7X 18.7X

00.5

11.5

22.5

Base

line

Gai

a

Base

line

Gai

a

Base

line

Gai

a

EC2-ALL V/C WAN S/S WAN

Results – EC2 Monetary Cost

32

4.2X 6.0X 28.5X0

0.51

1.52

2.5

Base

line

Gai

a

Base

line

Gai

a

Base

line

Gai

a

EC2-ALL V/C WAN S/S WAN

Nor

mlia

ed C

ost

Communication CostMachine Cost (Network)Machine Cost (Compute)

Matrix Factorization Topic Modeling

8.5X 10.7X 59.0X0

0.51

1.52

2.53

3.54

Base

line

Gai

a

Base

line

Gai

a

Base

line

Gai

a

EC2-ALL V/C WAN S/S WAN

Image Classification

Gaiais2.6-59.0XcheaperthanBaseline

Page 33: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

More in the Paper

•Convergence proof of Approximate Synchronous Parallel (ASP)

•ASP vs. fully asynchronous

•Gaia vs. centralizing data approach

33

Page 34: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Key Takeaways• The Problem: How to perform ML on geo-distributed data?

• Centralizing data is infeasible. Geo-distributed ML is very slow

• Our Gaia Approach• Decouple the synchronization model within the data center from

that across data centers • Eliminate insignificant updates across data centers

• A new synchronization model: Approximate Synchronous Parallel• Retain the correctness and accuracy of ML algorithms

• Key Results: • 1.8-53.5X speedup over state-of-the-art ML systems on WANs • at most 1.40X of LAN speeds• without requiring changes to algorithms 34

Page 35: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

Kevin HsiehAaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu†

Page 36: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Executive Summary• The Problem: How to perform ML on geo-distributed data?

• Centralizing data is infeasible. Geo-distributed ML is very slow

• Our Goal• Minimize communication over WANs• Retain the correctness and accuracy of ML algorithms• Without requiring changes to ML algorithms

• Our Gaia Approach• Decouple the synchronization model within the data center from

that across data centers: Eliminate insignificant updates on WANs• A new synchronization model: Approximate Synchronous Parallel

• Key Results: • 1.8-53.5X speedup over state-of-the-art ML systems on WANs• within 1.40X of LAN speeds 36

Page 37: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Approximate Synchronous Parallel

Thesignificancefilter• Filterupdatesbasedtheirsignificance

ASPselectivebarrier• Ensuresignificantupdatesarereadintime

Mirrorclock• Safeguardforpathologicalcases

37

Page 38: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Mirror Clock

38

Data Center 1 Data Center 2

Parameter Server Parameter Server

d

Data Center 2 Data Center 1

Parameter Server Parameter Server

Clock N

Clock N Clock N + DSGuarantees all significant updates

are seen after DS clocks

BarrierNo guarantee under extreme network conditions

Page 39: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Effect of Synchronization Mechanisms

39

-1.5E+09 -1.4E+09 -1.3E+09 -1.2E+09 -1.1E+09 -1.0E+09 -9.0E+08

0 250 500 750 1000

Object

ive valu

e

Time (Seconds)

Gaia Gaia_Async

Convergence value

0E+00 1E+08 2E+08 3E+08 4E+08 5E+08 6E+08 7E+08 8E+08 9E+08 1E+09

0 50 100 150 200 250 300 350

Object

ive valu

e

Time (Seconds)

Gaia Gaia_Async

Convergence value

MatrixFactorization TopicModeling

Page 40: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Methodology Details• Hardware

• A 22-node cluster. Each has a 16-core Intel Xeon CPU (E5-2698), a NVIDIA Titan X GPU, 64GB RAM, and a 40GbE NIC

• Application details• Matrix Factorization: SGD algorithm, 500 ranks• Topic Modeling: Gibbs sampling, 500 topics

• Convergence criteria• The value of the objective function changes less than 2% over the

course of 10 iterations• Significance Threshold

• 1% and shrinks over time 1%2�

40

Page 41: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

ML System Performance Comparison

• IterStore [Cui et al. SoCC’15] shows 10X performance improvement over PowerGraph [Gonzalez et al., OSDI’12] for Matrix Factorization

• PowerGraph matches the performance of GraphX [Gonzalez et al., OSDI’14], a Spark-based system

41

Page 42: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Matrix Factorization (1/3)• Matrix factorization (also known as collaborative filtering) is a

technique commonly used in recommender systems

42

Page 43: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Matrix Factorization (2/3)

43

Movie

User 4

Rank(UserPreferenceParameters)(θ)

Rank(MovieParameters)(x)

Page 44: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Matrix Factorization (3/3)

• Objective function (L2 regularization)

44

• Solve with stochastic gradient decent (SGD)

Page 45: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Background – BSP• BSP (Bulk Synchronous Parallel)

• All machines need to receive all updates before proceeding to the next iteration

45

Worker1

Worker2

Worker3Clock

0 1 2 3

Page 46: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Background – SSP• SSP (Stale Synchronous Parallel)

• Allows the fastest worker ahead of the slowest worker by a bounded number of iterations

46

Worker1

Worker2

Worker3Clock

0 1 2 3

Staleness=1

Page 47: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

Compare Against Centralizing Approach

47

GaiaSpeedupoverCentralize

GaiatoCentralizeCostRatio

MatrixFactorization EC2-ALL 1.11 3.54V/CWAN 1.22 1.00S/SWAN 2.13 1.17

TopicModeling EC2-ALL 0.80 6.14V/CWAN 1.02 1.26S/SWAN 1.25 1.92

ImageClassification EC2-ALL 0.76 3.33V/CWAN 1.12 1.07S/SWAN 1.86 1.08

Page 48: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

SSP Performance – 11 Data Centers

48

2.0X 2.0X 1.5X 1.5X1.8X 1.8X1.3X 1.3X

3.8X 3.7X 3.0X 2.7X

00.10.20.30.40.50.60.70.80.9

1

Baseline Gaia LAN Baseline Gaia LAN

BSP SSP

Nro

mal

ized

Exe

cutio

n Ti

me

Amazon-EC2Emulation-EC2Emulation-Full-Speed

Matrix Factorization

Page 49: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

SSP Performance – 11 Data Centers

49

Topic Modeling

2.0X2.5X 1.5X 1.7X

3.7X 4.8X2.0X

3.5X

00.10.20.30.40.50.60.70.80.9

1

Baseline Gaia LAN Baseline Gaia LAN

BSP SSP

Nro

mal

ized

Exe

cutio

n Ti

me Emulation-EC2

Emulation-Full-Speed

Page 50: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

SSP Performance – V/C WAN

50

Matrix Factorization

3.7X 2.6X3.5X 2.3X

00.10.20.30.40.50.60.70.80.9

1

BSP SSP

Baseline Gaia LAN

Topic Modeling

3.7X 3.1X3.9X 3.2X

00.10.20.30.40.50.60.70.80.9

1

BSP SSP

Baseline Gaia LAN

Page 51: Gaia: Geo-Distributed Machine Learning Approaching LAN Speedspeople.inf.ethz.ch/omutlu/pub/gaia-geo-distributed... · -9.0E+08 0 250 500 750 1000 Objective value Time (Seconds) Gaia

SSP Performance – S/S WAN

51

Matrix Factorization Topic Modeling

25X 16X24X 14X0

0.10.20.30.40.50.60.70.80.9

1

BSP SSP

Baseline Gaia LAN

14X 17X17X 21X0

0.10.20.30.40.50.60.70.80.9

1

BSP SSP

Baseline Gaia LAN