neptune - acmsocc.github.io · microsoft kokarana@microsoft.com peter pietzuch imperial college...
Post on 22-May-2020
4 Views
Preview:
TRANSCRIPT
NEPTUNEScheduling Suspendable Tasks
for Unified Stream/Batch Applications
SoCC, Santa Cruz, California, November 2019
Panagiotis GarefalakisImperial College London
pgaref@imperial.ac.uk
Konstantinos KaranasosMicrosoft
kokarana@microsoft.com
Peter PietzuchImperial College London
prp@imperial.ac.uk
Unified application example
Panagiotis Garefalakis - Imperial College London 2
Inference Job
Low-latencyresponses
TrainedModel
Historical data
Real-time data
Training Job
Iterate
Stream
Batch Application
Evolution of analytics frameworks
Panagiotis Garefalakis - Imperial College London 3
Batch frameworks
20142010 2018
Frameworks with hybrid
stream/batch applicationsStream frameworks
Unified stream/batch frameworks
Structured Streaming
Requirements> Latency: Execute inference job with minimum delay> Throughput: Batch jobs should not be compromised> Efficiency: Achieve high cluster resource utilization
Stream/Batch application requirements
Panagiotis Garefalakis - Imperial College London 4
Challenge: schedule stream/batch jobs to satisfy their diverse requirements
Stream/Batch application scheduling
Panagiotis Garefalakis - Imperial College London 5
2xTInference (stream) Job 2xT
3T TTraining (batch) Job
Stage1
T
Stage2
T2x 2x
3T3T3T
Stage1
TT
Stage24x 3x
ApplicationCode
Driver
DAG Scheduler
submitApp Contextrun job
Stream/Batch application scheduling
Panagiotis Garefalakis - Imperial College London 6
2xTInference (stream) Job 2xT
3T TTraining (batch) Job
3T
3T
3T
T T T T
4T
3T
exec
utor
1ex
ecut
or 2
8T
T
T
TWasted
resourcesCor
es
2T 6T
Stage1
T
Stage2
T2x 2x
3T3T3T
Stage1
TT
Stage24x 3x
> Static allocation: dedicate resources to each job
Resources can not be shared across jobs
Stream/Batch application scheduling
Panagiotis Garefalakis - Imperial College London 7
2xT 2xT
3T T
4T 8T2T 6T
Stage1
T
Stage2
T2x 2x
3T3T3T
Stage1
TT
Stage24x 3x
> FIFO: first job runs to completion
3T
3T
3T
3T
T
T
T
T T
T
Long batch jobs increase stream job latency
Cor
es
T
Inference (stream) Job
Training (batch) Jobsh
ared
exe
cuto
rs
Stream/Batch application scheduling
Panagiotis Garefalakis - Imperial College London 8
2xT 2xT
3T T
4T 8T2T 6T
Stage1
T
Stage2
T2x 2x
3T3T3T
Stage1
TT
Stage24x 3x
> FAIR: weight share resources across jobs
Cor
es
3T
3T
3T
3T
T
T
T
T
T
T
T
queuingBetter packing with non-optimal latency
Inference (stream) Job
Training (batch) Jobsh
ared
exe
cuto
rs
Stream/Batch application scheduling
Panagiotis Garefalakis - Imperial College London 9
2xT 2xT
3T T
4T 8T2T 6T
Stage1
T
Stage2
T2x 2x
3T3T3T
Stage1
TT
Stage24x 3x
> KILL: avoid queueing by preempting batch tasks
Cor
es
3T
3T
3T
3T
T
T
T
T
T
T 3T
T 3T
Better latency at the expense of extra work
Inference (stream) Job
Training (batch) Jobsh
ared
exe
cuto
rs
Stream/Batch application scheduling
Panagiotis Garefalakis - Imperial College London 10
2xT 2xT
3T T
4T 8T2T 6T
Stage1
T
Stage2
T2x 2x
3T3T3T
Stage1
TT
Stage24x 3x
> NEPTUNE: minimize queueing and wasted work!
Cor
esInference (stream) Job
Training (batch) Jobsh
ared
exe
cuto
rs
3T
3T
3T
3T
T
T
T
T
T
2T
2TT
T
> How to minimize queuing for latency-sensitive jobs and wasted work?Implement suspendable tasks
> How to natively support stream/batch applications?Provide a unified execution framework
> How to satisfy different stream/batch application requirements and high-level objectives?Introduces custom scheduling policies
Challenges
Panagiotis Garefalakis - Imperial College London 11
> How to minimize queuing for latency-sensitive jobs and wasted work?Implement suspendable tasks
> How to natively support stream/batch applications?Provide a unified execution framework
> How to satisfy different stream/batch application requirements and high-level objectives?Introduces custom scheduling policies
NEPTUNEExecution framework for Stream/Batch applications
Panagiotis Garefalakis - Imperial College London 12
Support suspendable tasks
Introduce pluggable scheduling policies
Unified execution framework on top ofStructured Streaming
Typical tasks
Panagiotis Garefalakis - Imperial College London 13
ExecutorStack
Task run
Value
Context
Iterator
Function
> Tasks: apply a function to a partition of data
> Subroutines that run in executor to completion
> Preemption problem: > Loss of progress (kill)> Unpredictable preemption times
(checkpointing)
State
Suspendable tasks
Panagiotis Garefalakis - Imperial College London 14
Function
Context
Iterator
Coroutine Stack
callyield
> Idea: use coroutines> Separate stacks to store task
state> Yield points handing over
control to the executor
> Cooperative preemption: > Suspend and resume in
milliseconds> Work-preserving
> Transparent to the user
ExecutorStack
Task run
Value
State
Context
https://github.com/storm-enroute/coroutines
Execution framework
Panagiotis Garefalakis - Imperial College London 15
> Idea: centralized scheduler with pluggable policies
> Problem: not just assign but also suspend and resume
ExecutorExecutorDAG scheduler
Task Scheduler
Scheduling policy
ExecutorTasks
Low-pri job High-pri job
Running Paused
suspend & run task
App + job prioritiesLowHigh
Tasks
Incr
emen
taliz
er
Opt
imiz
er
launchtask
metrics
Scheduling policies
Panagiotis Garefalakis - Imperial College London 16
> Idea: policies trigger task suspension and resumption> Guarantee that stream tasks bypass batch tasks> Satisfy higher-level objectives i.e. balance cluster load> Avoid starvation by suspending up to a number of times
> Load-balancing (LB): takes into account executors’ memory conditions and equalize the number of tasks per node
> Locality- and memory aware (LMA): respect task locality preferences in addition to load-balancing
> Built as an extension to 2.4.0 (https://github.com/lsds/Neptune)
> Ported all ResultTask, ShuffleMapTask functionality across programming interfaces to coroutines
> Extended Spark’s DAG Scheduler to allow job stages with different requirements (priorities)
> Added additional Executor performance metrics as part of the heartbeat mechanism
Implementation
Panagiotis Garefalakis - Imperial College London 17
> Cluster– 75 nodes with 4 cores and 32 GB of memory each
> Workloads– LDA: ML training/inference application uncovering
hidden topics from a group of documents– Yahoo Streaming Benchmark: ad-analytics on a
stream of ad impressions– TPC-H decision support benchmark
Azure deployment
Panagiotis Garefalakis - Imperial College London 18
DIFF-EXEC FIFO FAIR KILL NEP-CL NEP-LB PRI-ONLY0
1
2
3
4
5
6S
tream
ing
late
ncy
(s)
Benefit of NEPTUNE in stream latency
Panagiotis Garefalakis - Imperial College London 19
> LDA: training (batch) job using all available resources, with a latency-sensitive inference (stream) using 15% of resources
NEPTUNE achieves latencies comparable to the ideal for the latency-sensitive jobs
LBNeptune
LMANeptune
IsolationKILLFAIRFIFOStaticallocation
37%
13%61%
54%
99th
median
5th
Impact of resource demands in performance
Panagiotis Garefalakis - Imperial College London 20
Past to future> YSB: increasing stream job resource demands while batch
job using all available resources
0% 20% 40% 60% 80% 100%Cores used for Streaming
0
2
4
6S
tream
ing
late
ncy
(s)
3.85
3.88
3.90
3.92
3.95
Bat
ch(M
even
ts/s
)
1.5%
Efficiently share resources with low impact on throughput
NEPTUNE supports complex unified applications with diverse job requirements!
> Suspendable tasks using coroutines> Pluggable scheduling policies> Continuous unified analytics
Thank you!Questions?
Panagiotis Garefalakispgaref@imperial.ac.uk
Summary
https://github.com/lsds/Neptune
top related