gaia: geo-distributed machine learning approaching lan...

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

Kevin HsiehAaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu†

Machine Learning and Big Data

•Machine learning is widely used to derive useful information from large-scale data

Image Classification

VideosPictures

Video Analytics

User Activities

Preference Prediction

……

Big Data is Geo-Distributed • A large amount of data is generated rapidly, all over the world

Centralizing Data is Infeasible [1, 2, 3]

• Moving data over wide-area networks (WANs) can be extremely slow

• It is also subject to data sovereignty laws

1.Vulimiri etal.,NSDI’152.Puetal.,SIGCOMM’153.Viswanathanetal.,OSDI’16

Geo-distributed ML is Challenging• No ML system is designed to run across data centers

(up to 53X slowdown in our study)

Our Goal

•Develop a geo-distributed ML system•Minimize communication over wide-area networks

•Retain the accuracy and correctness of ML algorithms

•Without requiring changes to the algorithms

KeyResult:1.8-53.5Xspeedupoverstate-of-the-artMLsystemsonWANs

Outline

•Problem & Goal•Background & Motivation•Gaia System Overview•Approximate Synchronous Parallel•System Implementation•Evaluation•Conclusion

WorkerMachineN

ParameterServer

Background: Parameter Server Architecture

• The parameter server architecture has been widely adopted in many ML systems

…WorkerMachine1

ParameterServer

MLModelUpdateUpdate

ReadRead …

TrainingData

UpdateUpdate

ReadRead

WorkerMachineN

ParameterServer

Background: Parameter Server Architecture

• The parameter server architecture has been widely adopted in many ML systems

…WorkerMachine1

ParameterServer

MLModelUpdateUpdate

ReadRead …

TrainingData

UpdateUpdate

ReadRead

Synchronization is critical to the accuracy and correctness of ML algorithms

Deploy Parameter Servers on WANs• Deploying parameter servers across data centers

requires a lot of communication over WANs

WorkerMachineN

ParameterServer

…WorkerMachine1

ParameterServer

MLModel

DataCenter1 DataCenter2

WAN: Low Bandwidth and High Cost• WAN bandwidth is 15X smaller than LAN bandwidth on average,

and up to 60X smaller• In Amazon EC2, the monetary cost of WAN communication is

up to 38X the cost of renting machines

VirginiaCaliforniaOregonIrelandFrankfurtTokyoSeoulSingaporeSydneyMumbaiSão Paulo

0100200300400500600700800900

11 Amazon EC2 Regions

3.7X 3.5X

5.9X 4.4X

10152025

LAN EC2-ALL V/C WAN S/S WANNor

ence IterStore Bӧsen

ML System Performance on WANs

1) Cui et al., “Exploiting Iterative-ness for Parallel ML Computations”, SoCC’14

2) Wei et al., “Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics”, SoCC’15 12

MatrixFactorization

3.7X 3.5X

5.9X 4.4X

10152025

LAN EC2-ALL V/C WAN S/S WANNor

ence IterStore Bӧsen

ML System Performance on WANs

1) Cui et al., “Exploiting Iterative-ness for Parallel ML Computations”, SoCC’14

2) Wei et al., “Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics”, SoCC’15 13

Running ML systems on WANs can seriously slow down ML applications

11EC2Regions

Virginia/CaliforniaSingapore/SãoPaulo

MatrixFactorization

Outline

Gaia System Overview• Key idea: Decouple the synchronization model within

the data center from the synchronization model between data centers

ParameterServer

DataCenter1

ParameterServer

WorkerMachine

LocalSync

ParameterServer

DataCenter2

WorkerMachineWorkerMachine

ApproximatelyCorrectModelCopy

RemoteSync

Gaia System Overview• Key idea: Decouple the synchronization model within

the data center from the synchronization model between data centers

ParameterServer

DataCenter1

ParameterServer

WorkerMachine

LocalSync

ParameterServer

DataCenter2

WorkerMachineWorkerMachine

ApproximatelyCorrectModelCopy

RemoteSync

Communicate over WANsonly significant updates

95.6% 95.2% 97.0%

0%20%40%60%80%

10% 5% 1% 0.5% 0.1% 0.05% 0.01%Insi

Threshold of Significant Updates

Matrix Factorization Topic Modeling Image Classification

Key Finding: Study of Update Significance

The vast majority of updates are insignificant

Outline

Approximate Synchronous Parallel

Thesignificancefilter• Filterupdatesbasedontheirsignificance

ASPselectivebarrier• Ensuresignificantupdatesarereadintime

Mirrorclock• Safeguardforpathologicalcases

The Significance Filter

Worker Machine

ParameterServer

Parameter XValue Aggregated Update

Update (Δ1) on X

Significance Function

Other Parameters

Significance Threshold>?

𝐴𝑔𝑔. 𝑈𝑝𝑑𝑎𝑡𝑒𝑉𝑎𝑙𝑢𝑒

1%𝑇�

Update (Δ2) on X

Δ1Δ1+Δ20

Thesignificancefilter• Filterupdatesbasedontheirsignificance

ASP Selective Barrier

Data Center 1 Data Center 2

Parameter Server Parameter Server

Significant Update

Arrive too late!

Significant Update

SelectiveBarrier

Only workers that depend on these parameters are blocked

Outline

Put it All Together: The Gaia System

LocalServer

GaiaParameterServer

WorkerMachine

SignificanceFilter

ParameterStore

WorkerMachine

MirrorServer

MirrorClient

DataCenterBoundary

ControlQueue

DataQueue

GaiaParameterServer

Update AggregatedUpdate

SelectiveBarrier

Put it All Together: The Gaia System

LocalServer

GaiaParameterServer

WorkerMachine

SignificanceFilter

ParameterStore

WorkerMachine

MirrorServer

MirrorClient

DataCenterBoundary

ControlQueue

DataQueue

GaiaParameterServer

Update AggregatedUpdate

SelectiveBarrier

Control messages (barriers, etc.) are always prioritized

No change is required for ML algorithms and ML programs

Problem: Broadcast Significant Updates

Communication overhead is proportional to the number of data centers

Mitigation: Overlay Networks and Hubs

Save communication on WANs by aggregating the updates at hubs

Data Center Group Data Center Group Data Center Group

Data Center Group

HubHub

Outline

Methodology

• Applications• Matrix Factorization with the Netflix dataset• Topic Modeling with the Nytimes dataset• Image Classification with the ILSVRC12 dataset

• Hardware platform• 22 machines with emulated EC2 WAN bandwidth• We validated the performance with a real EC2 deployment

• Baseline• IterStore (Cui et al., SoCC’14) and GeePS (Cui et al., EuroSys’16) on WAN

• Performance metrics• Execution time until algorithm convergence• Monetary cost of algorithm convergence

3.8X 3.7X6.0X

3.7X 4.8X8.5X

Matrix Factorization Topic Modeling Image Classification

Baseline Gaia LAN

Performance – 11 EC2 Data Centers

Gaiaachieves3.7-6.0XspeedupoverBaselineGaiaisatmost1.40XofLANspeeds

25.4X 14.1X53.5X

23.8X 17.3X 53.7X

Matrix Factorization

Topic Modeling Image Classification

Baseline Gaia LAN

3.7X 3.7X7.4X

3.5X 3.9X7.4X

Topic Modeling Image Classification

Baseline Gaia LAN

Performance and WAN Bandwidth

V/C WAN(Virginia/California)

S/S WAN(Singapore/São Paulo)

Gaiaachieves3.7-53.5XspeedupoverBaselineGaiaisatmost1.23XofLANspeeds

2.6X5.7X 18.7X

EC2-ALL V/C WAN S/S WAN

Results – EC2 Monetary Cost

4.2X 6.0X 28.5X0

Communication CostMachine Cost (Network)Machine Cost (Compute)

Matrix Factorization Topic Modeling

8.5X 10.7X 59.0X0

Image Classification

Gaiais2.6-59.0XcheaperthanBaseline

gaia: geo-distributed machine learning approaching lan...

Documents

low latency geo-distributed data...

low-latency geo-distributed data analytics...

cost minimization for big data processing in geo-distributed...

geo 596: internet mapping and distributed geographic...

low latency geo-distributed data analytics

increasing the performance of geo-distributed...

geo-distributed big data processing - github...

volley: automated data placement for geo-distributed cloud...

ambry: linkedin’s scalable geo-distributed object...

kurma: geo-distributed secure middleware for cloud-backed...

i2rs: a distributed geo-textual image retrieval and ... ·...

optimizing grouped aggregation in geo-distributed...

capacity provisioning problems in geo-distributed data...

gaia: geo-distributed machine learning approaching lan...

ttl-based approach for data aggregation in geo-distributed...

low latency geo-distributed data analytics - yale...

incremental deployment and migration of geo-distributed...

low latency geo-distributed data analytics · ·...

gaia: geo-distributed machine learning approaching lan...

2d distributed multi-parameter imaging - quantec geo