A Low-Latency Online Prediction Serving System · 2017-07-10 · Presenters: Alexey Tumanov, Daniel Crankshaw
TRANSCRIPT
Presenters: Alexey Tumanov, Daniel Crankshaw
Corey Zumar, Giulio Zhou, Xin Wang, Michael Franklin, Joseph Gonzalez, Ion Stoica
BDD/RISE Mini-Retreat 2017 · May 2, 2017
Clipper: A Low-Latency Online Prediction Serving System
Learning Produces a Trained Model
[Diagram: Big Data + complex model → Training/Learning → trained Model]
[Diagram: Query → Model → Decision ("CAT")]
Serving
[Diagram: Big Data → Training/Learning → Model → Serving; an Application sends a Query to the served Model and receives a Decision]
Prediction-serving for interactive applications. Timescale: ~10s of milliseconds
Google Translate Serving
- 140 billion words a day [1]
- 82,000 GPUs running 24/7
- Invented new hardware: the Tensor Processing Unit (TPU)
[1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
Prediction-Serving Raises New Challenges
Prediction-Serving Challenges
- Support low-latency, high-throughput serving workloads
  - Models getting more complex: 10s of GFLOPs [1]
  - Using specialized hardware for predictions
  - Deployed on the critical path: must maintain SLOs under heavy load
- Large and growing ecosystem of ML models and frameworks
[1] Deep Residual Learning for Image Recognition. He et al. CVPR 2015.
Prediction-Serving Challenges
- Support low-latency, high-throughput serving workloads
- Large and growing ecosystem of ML models and frameworks
Applications across many domains: content recommendation, fraud detection, personal assistants, robotic control, machine translation
Big Companies Build One-Off Systems
Problems:
- Expensive to build and maintain
- Highly specialized and require both ML and systems expertise
- Tightly-coupled model and application: difficult to change or update the model
- Only supports a single ML framework
- Varying physical resource requirements
- Difficult to deploy and brittle to manage
Large and growing ecosystem of ML models and frameworks
But most companies can't build new serving systems…
Use existing systems: Offline Scoring
[Diagram: Big Data → Training → Model; a batch analytics job scores all inputs X, producing predictions Y that are written to a datastore]
Use existing systems: Offline Scoring
Low-Latency Serving
[Diagram: the Application sends a Query, looks up the precomputed decision (X → Y) in the datastore, and receives the Decision]
Use existing systems: Offline Scoring
Problems:
- Requires the full set of queries ahead of time
  - Small and bounded input domain
- Wasted computation and space
  - Can render and store unneeded predictions
- Costly to update
  - Re-run the batch job
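To make the offline-scoring pattern concrete, here is a minimal sketch (not part of Clipper): a batch job precomputes a prediction for every anticipated input and writes it to a datastore, and the online path is just a lookup. The TrainedModel class, the dict-based datastore, and the example queries are hypothetical placeholders.

# Minimal sketch of the offline-scoring pattern described above.
# TrainedModel and the dict-based datastore are hypothetical placeholders.

class TrainedModel:
    def predict(self, x):
        return len(x) % 2          # stand-in for a real model's output

def batch_score(model, candidate_inputs):
    """Batch job: precompute a prediction for every anticipated query."""
    return {x: model.predict(x) for x in candidate_inputs}

# Offline: score the (small, bounded) input domain and store the results.
datastore = batch_score(TrainedModel(), ["query-a", "query-b"])

def serve(query):
    """Online path: a low-latency datastore lookup instead of model inference."""
    if query not in datastore:
        # Queries outside the precomputed domain require re-running the batch job.
        raise KeyError(f"no precomputed prediction for {query!r}")
    return datastore[query]

print(serve("query-a"))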
Prediction-Serving Challenges
- Support low-latency, high-throughput serving workloads
- Large and growing ecosystem of ML models and frameworks
Clipper Solutions
How does Clipper address these challenges?
- Simplifies deployment through a layered architecture
- Serves many models across ML frameworks concurrently
- Employs caching, batching, and scale-out for high-performance serving
Clipper Decouples Applications and Models
[Diagram: Applications issue Predict and Feedback calls over Clipper's RPC/REST interface; Clipper talks over RPC to model containers (MC), e.g., a Caffe model container]
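To make the decoupling concrete, the sketch below shows how an application might issue a Predict call against Clipper's REST interface. The host and port (assumed here to be the query frontend's default), the application name "digit-classifier", and the input encoding are illustrative assumptions; consult the Clipper documentation for the exact request format.

# Sketch of an application-side Predict call over Clipper's REST interface.
# The host/port, app name, and input encoding are illustrative assumptions.
import requests

CLIPPER_URL = "http://localhost:1337"   # query frontend (assumed default port)
APP_NAME = "digit-classifier"           # hypothetical registered application

def predict(input_vector):
    """Send one query to Clipper and return its decision."""
    resp = requests.post(
        f"{CLIPPER_URL}/{APP_NAME}/predict",
        json={"input": input_vector},
        timeout=1.0,                    # keep the caller's latency bounded
    )
    resp.raise_for_status()
    return resp.json()

print(predict([0.0] * 784))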
Clipper Architecture
[Diagram: Applications → Predict/Observe RPC/REST interface → Clipper → RPC → model containers (MC), e.g., Caffe]
- Model Abstraction Layer: provides a common interface to models while bounding latency and maximizing throughput
- Model Selection Layer: improves accuracy through bandit methods and ensembles, online learning, and personalization
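The Model Abstraction Layer relies on each model being wrapped in a container that exposes a uniform batch-predict interface, so Clipper can treat models from different frameworks identically. Below is a minimal illustrative sketch of such a wrapper; the class name, signature, and toy model are assumptions, not Clipper's actual container API.

# Minimal sketch of the common container interface assumed by the Model
# Abstraction Layer. Illustrative only; not Clipper's actual container API.
from typing import Callable, List

class ModelContainer:
    """Wraps a framework-specific predict function behind a uniform
    batch interface: a list of inputs in, one serialized output per input."""

    def __init__(self, predict_fn: Callable[[list], object]):
        self.predict_fn = predict_fn

    def predict_batch(self, inputs: List[list]) -> List[str]:
        return [str(self.predict_fn(x)) for x in inputs]

# Any framework's model can be wrapped the same way, e.g. a toy "model":
container = ModelContainer(lambda x: sum(x) > 0)
print(container.predict_batch([[1.0, -2.0], [3.0, 0.5]]))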
Clipper Architecture
[Diagram: the same layered stack, showing the key mechanism in each layer]
- Model Selection Layer: selection policy
- Model Abstraction Layer: caching, adaptive batching (caching illustrated below)
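As a rough illustration of the caching component, the sketch below memoizes (model, version, input) → prediction so repeated queries can skip model inference. The LRU eviction policy and capacity are illustrative choices, not necessarily what Clipper implements.

# Sketch of a prediction cache keyed on (model, version, input).
# LRU eviction and the 4096-entry capacity are illustrative choices.
from collections import OrderedDict

class PredictionCache:
    def __init__(self, capacity: int = 4096):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, model: str, version: int, x):
        key = (model, version, x)
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as recently used
            return self.entries[key]
        return None

    def put(self, model: str, version: int, x, prediction):
        key = (model, version, x)
        self.entries[key] = prediction
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used

cache = PredictionCache()
cache.put("cat-detector", 1, ("img-123",), "CAT")
print(cache.get("cat-detector", 1, ("img-123",)))  # cache hit -> "CAT"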
Clipper Model Selection Layer: Selection Policy
Selection policies supported by Clipper:
- Exploit multiple models to estimate confidence
- Use multi-armed bandit algorithms to learn the optimal model selection online
- Online personalization across ML frameworks
*See paper for details [NSDI 2017]
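As a rough illustration of bandit-based selection, the sketch below uses a generic Exp3-style algorithm to pick among deployed models and update weights from observed feedback. This is not Clipper's exact policy (see the NSDI 2017 paper for those); the reward definition, model names, and gamma parameter are assumptions.

# Generic Exp3-style multi-armed bandit for choosing among deployed models.
# Illustrative only; Clipper's actual selection policies are in the paper.
import math
import random

class Exp3ModelSelector:
    def __init__(self, model_names, gamma=0.1):
        self.models = list(model_names)
        self.gamma = gamma
        self.weights = {m: 1.0 for m in self.models}

    def _probs(self):
        total = sum(self.weights.values())
        k = len(self.models)
        return {m: (1 - self.gamma) * self.weights[m] / total + self.gamma / k
                for m in self.models}

    def select(self):
        probs = self._probs()
        r, acc = random.random(), 0.0
        for m in self.models:
            acc += probs[m]
            if r <= acc:
                return m
        return self.models[-1]

    def observe(self, model, reward):
        """reward in [0, 1], e.g. 1 if user feedback confirmed the prediction."""
        probs = self._probs()
        estimated = reward / probs[model]   # importance-weighted reward
        self.weights[model] *= math.exp(self.gamma * estimated / len(self.models))

selector = Exp3ModelSelector(["caffe-resnet", "sklearn-svm"])
chosen = selector.select()
selector.observe(chosen, reward=1.0)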
Batching to Improve Throughput
A single page load may generate many queries
- Why batching helps:
  - Hardware acceleration
  - Helps amortize system overhead
- Optimal batch size depends on:
  - Hardware configuration
  - Model and framework
  - System load
Adaptive Batching to Improve Throughput
Clipper solution: adaptively trade off latency and throughput (sketched below)
- Increase the batch size until the latency objective is exceeded (Additive Increase)
- If latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease)
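A minimal sketch of the additive-increase/multiplicative-decrease controller described above. The step size, backoff fraction, and maximum batch size are illustrative constants, not Clipper's tuned values.

# AIMD batch-size controller, as described on the slide.
# The additive step (+1), backoff factor (0.9), and cap are illustrative.
class AIMDBatchSizer:
    def __init__(self, latency_slo_ms, step=1, backoff=0.9, max_batch=1024):
        self.latency_slo_ms = latency_slo_ms
        self.step = step
        self.backoff = backoff
        self.max_batch = max_batch
        self.batch_size = 1

    def update(self, observed_batch_latency_ms):
        """Call after each batch with its measured end-to-end latency."""
        if observed_batch_latency_ms > self.latency_slo_ms:
            # Multiplicative decrease: cut the batch size by a fraction.
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        else:
            # Additive increase: probe for a larger batch under the SLO.
            self.batch_size = min(self.max_batch, self.batch_size + self.step)
        return self.batch_size

sizer = AIMDBatchSizer(latency_slo_ms=20.0)
for latency in [5.0, 8.0, 12.0, 25.0, 9.0]:   # hypothetical measurements
    print(sizer.update(latency))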
Conclusion
- Prediction-serving is an important and challenging area for systems research
  - Support low-latency, high-throughput serving workloads
  - Serve a large and growing ecosystem of ML frameworks
- Clipper is a first step towards addressing these challenges
  - Simplifies deployment through a layered architecture
  - Serves many models across ML frameworks concurrently
  - Employs caching, adaptive batching, and container scale-out to meet interactive serving workload demands
  - Going beyond an academic prototype to build a real, open-source system
https://github.com/ucbrise/clipper
[email protected]