neptune - acmsocc.github.io · microsoft kokarana@microsoft.com peter pietzuch imperial college...

NEPTUNEScheduling Suspendable Tasks

for Unified Stream/Batch Applications

SoCC, Santa Cruz, California, November 2019

Panagiotis GarefalakisImperial College London

pgaref@imperial.ac.uk

Konstantinos KaranasosMicrosoft

kokarana@microsoft.com

Peter PietzuchImperial College London

prp@imperial.ac.uk

Unified application example

Panagiotis Garefalakis - Imperial College London 2

Inference Job

Low-latencyresponses

TrainedModel

Historical data

Real-time data

Training Job

Iterate

Stream

Batch Application

Evolution of analytics frameworks

Batch frameworks

20142010 2018

Frameworks with hybrid

stream/batch applicationsStream frameworks

Unified stream/batch frameworks

Structured Streaming

Requirements> Latency: Execute inference job with minimum delay> Throughput: Batch jobs should not be compromised> Efficiency: Achieve high cluster resource utilization

Stream/Batch application requirements

Challenge: schedule stream/batch jobs to satisfy their diverse requirements

Stream/Batch application scheduling

2xTInference (stream) Job 2xT

3T TTraining (batch) Job

Stage1

Stage2

T2x 2x

3T3T3T

Stage1

Stage24x 3x

ApplicationCode

Driver

DAG Scheduler

submitApp Contextrun job

2xTInference (stream) Job 2xT

3T TTraining (batch) Job

T T T T

TWasted

resourcesCor

Stage1

Stage2

T2x 2x

3T3T3T

Stage1

Stage24x 3x

> Static allocation: dedicate resources to each job

Resources can not be shared across jobs

2xT 2xT

4T 8T2T 6T

Stage1

Stage2

T2x 2x

3T3T3T

Stage1

Stage24x 3x

> FIFO: first job runs to completion

Long batch jobs increase stream job latency

Inference (stream) Job

Training (batch) Jobsh

2xT 2xT

4T 8T2T 6T

Stage1

Stage2

T2x 2x

3T3T3T

Stage1

Stage24x 3x

> FAIR: weight share resources across jobs

queuingBetter packing with non-optimal latency

2xT 2xT

4T 8T2T 6T

Stage1

Stage2

T2x 2x

3T3T3T

Stage1

Stage24x 3x

> KILL: avoid queueing by preempting batch tasks

Better latency at the expense of extra work

2xT 2xT

4T 8T2T 6T

Stage1

Stage2

T2x 2x

3T3T3T

Stage1

Stage24x 3x

> NEPTUNE: minimize queueing and wasted work!

esInference (stream) Job

> How to minimize queuing for latency-sensitive jobs and wasted work?Implement suspendable tasks

> How to natively support stream/batch applications?Provide a unified execution framework

> How to satisfy different stream/batch application requirements and high-level objectives?Introduces custom scheduling policies

Challenges

> How to minimize queuing for latency-sensitive jobs and wasted work?Implement suspendable tasks

> How to natively support stream/batch applications?Provide a unified execution framework

> How to satisfy different stream/batch application requirements and high-level objectives?Introduces custom scheduling policies

NEPTUNEExecution framework for Stream/Batch applications

Support suspendable tasks

Introduce pluggable scheduling policies

Unified execution framework on top ofStructured Streaming

Typical tasks

ExecutorStack

Task run

Context

Iterator

Function

> Tasks: apply a function to a partition of data

> Subroutines that run in executor to completion

> Preemption problem: > Loss of progress (kill)> Unpredictable preemption times

(checkpointing)

Suspendable tasks

Function

Context

Iterator

Coroutine Stack

callyield

> Idea: use coroutines> Separate stacks to store task

state> Yield points handing over

control to the executor

> Cooperative preemption: > Suspend and resume in

milliseconds> Work-preserving

> Transparent to the user

ExecutorStack

Task run

Context

https://github.com/storm-enroute/coroutines

Execution framework

> Idea: centralized scheduler with pluggable policies

> Problem: not just assign but also suspend and resume

ExecutorExecutorDAG scheduler

Task Scheduler

Scheduling policy

ExecutorTasks

Low-pri job High-pri job

Running Paused

suspend & run task

App + job prioritiesLowHigh

launchtask

metrics

Scheduling policies

> Idea: policies trigger task suspension and resumption> Guarantee that stream tasks bypass batch tasks> Satisfy higher-level objectives i.e. balance cluster load> Avoid starvation by suspending up to a number of times

> Load-balancing (LB): takes into account executors’ memory conditions and equalize the number of tasks per node

> Locality- and memory aware (LMA): respect task locality preferences in addition to load-balancing

> Built as an extension to 2.4.0 (https://github.com/lsds/Neptune)

> Ported all ResultTask, ShuffleMapTask functionality across programming interfaces to coroutines

> Extended Spark’s DAG Scheduler to allow job stages with different requirements (priorities)

> Added additional Executor performance metrics as part of the heartbeat mechanism

Implementation

> Cluster– 75 nodes with 4 cores and 32 GB of memory each

> Workloads– LDA: ML training/inference application uncovering

hidden topics from a group of documents– Yahoo Streaming Benchmark: ad-analytics on a

stream of ad impressions– TPC-H decision support benchmark

Azure deployment

DIFF-EXEC FIFO FAIR KILL NEP-CL NEP-LB PRI-ONLY0

Benefit of NEPTUNE in stream latency

> LDA: training (batch) job using all available resources, with a latency-sensitive inference (stream) using 15% of resources

NEPTUNE achieves latencies comparable to the ideal for the latency-sensitive jobs

LBNeptune

LMANeptune

IsolationKILLFAIRFIFOStaticallocation

13%61%

median

Impact of resource demands in performance

Past to future> YSB: increasing stream job resource demands while batch

job using all available resources

0% 20% 40% 60% 80% 100%Cores used for Streaming

Efficiently share resources with low impact on throughput

NEPTUNE supports complex unified applications with diverse job requirements!

> Suspendable tasks using coroutines> Pluggable scheduling policies> Continuous unified analytics

Thank you!Questions?

Panagiotis Garefalakispgaref@imperial.ac.uk

Summary

https://github.com/lsds/Neptune

neptune - acmsocc.github.io · microsoft kokarana@microsoft.com peter pietzuch imperial college...

Documents

imperial armour volume 1 imperial guard & imperial navy...

stateful distributed dataflow graphs -...

tiananmen square & imperial palace · tiananmen square &...

imperialirwmp.org...water forum and rwmg members imperial...

imperial county, california, imperial valley area county,...

scaling deep learning on multi-gpu servers · scaling deep...

david c wilson imperial collegedavid c wilson, imperial ......

imperial community college district imperial valley

city of imperial imperial, california · city of imperial ....

georgia institute of technology * washington state...

imperial brands finance plc imperial brands plc

imperial collection - adobe sheets/imperial... · imperial...

imperial bioengineer september 2016 issue 10.9 imperial...

imperial standard: imperial oil, exxon, and the canadian

graham cooke imperial college & imperial college nhs trust

hermes: a distributed event- based middleware architecture...

network-aware operator placement for stream-processing...

calexico el centro imperial county of imperial · t:...

libvnf: building vnfs made easy - acmsocc.github.io ·...

bourbon/rye $6 [24,600] the lounge menu · east imperial...