


The Curious Case of Container Orchestration and Scheduling in GPU-based Datacenters

Prashanth Thinakaran¹, Jashwant Raj Gunasekaran¹, Bikash Sharma², Mahmut Taylan Kandemir¹, Chita R. Das¹

¹The Pennsylvania State University  ²Facebook Inc

ACKNOWLEDGMENTS
This research was generously supported by several NSF grants (1320478, 1409095, 1629129, 1439021, 1629915, 1439057, 1626251, 1526750) and the NSF Chameleon Cloud project CH-819640 for the GPU cluster infrastructure.

Prashanth Thinakaran ([email protected])

Era of H/W Accelerators

➢ The end of Moore's law marks the beginning of accelerator innovations such as GPUs, TPUs, and FPGAs.

➢ State-of-the-art datacenter schedulers are still GPU-agnostic or only at an early stage of supporting GPUs.

➢ There is a need for GPU-aware scheduling policies in datacenter orchestrators such as Kubernetes.

The Curious Case of GPU Orchestration

Kube-Knots: GPU-aware Orchestrator

Overall Results and Discussion

➢ Peak Prediction (PP) improves cluster-wide GPU utilization by up to 80% at both the 50th and 90th percentiles.

➢ 33% savings in power consumption across the ten-node GPU cluster.

➢ Reduced the overall QoS violations of latency-sensitive queries by up to 53% when compared to the resource-agnostic scheduler, along with improved GPU utilization.

➢ PP scheduler’s misprediction rate (error) was as low as 16% for a window interval of 5s.

Alibaba Trace Analysis

➢ Head-of-line blocking in non-preemptible accelerators
➢ Limited accountability of GPU containers at the Docker level

Need for:
✓ QoS-aware batching
✓ Utilization-aware scheduling
✓ Application-aware bin packing

GPU Workload Analysis

Performance per Watt (PPW) of Memcached on GPU vs. CPU at varying query loads: the best PPW is achieved only at 100% GPU utilization.

Analysis of GPU-aware cluster schedulers

~50-80% internal memory fragmentation in GPU application frameworks

Need to expose the application framework's configs to the resource scheduler

Over-commitment in Alibaba datacenters

Average utilization: CPU -> 49%, Memory -> 82% (trace plots; x-axis is time in milliseconds)

Batch workloads show strong correlation across different utilization metrics: CPU usage is strongly correlated with memory usage

Container workloads show a similar trend for CPU and memory utilization

Correlation across utilization metrics is crucial for job placement (see the sketch below)

~20% of provisioned memory is unused: static provisioning leads to under-utilization
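To make the correlation argument concrete, here is a minimal sketch of how the CPU-memory correlation for a container could be computed; the utilization series are illustrative placeholders, not values from the Alibaba trace:

```python
import numpy as np

# Illustrative per-container utilization samples (percent); in the actual
# analysis these series would come from the Alibaba cluster trace.
cpu_util = np.array([20, 35, 50, 65, 80, 78, 60, 45, 30, 25], dtype=float)
mem_util = np.array([30, 42, 55, 68, 85, 82, 65, 50, 38, 33], dtype=float)

# Pearson correlation between the two metrics; values near 1 mean CPU and
# memory peak together, which a correlation-aware bin packer must respect.
corr = np.corrcoef(cpu_util, mem_util)[0, 1]
print(f"CPU-memory utilization correlation: {corr:.2f}")
```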

Knots queries node-level GPU utilization metrics through the pyNVML library (a polling sketch follows).
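A minimal sketch of the kind of per-GPU query a node agent could issue through pyNVML; the function name, polling scope, and reported fields are assumptions, not the actual Knots implementation:

```python
import pynvml

def sample_gpu_metrics() -> list:
    """Poll every visible GPU once and return the utilization metrics a
    node-level agent could report to the cluster scheduler."""
    pynvml.nvmlInit()
    try:
        samples = []
        for idx in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # SM / memory-controller %
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
            power = pynvml.nvmlDeviceGetPowerUsage(handle)        # milliwatts
            samples.append({
                "gpu": idx,
                "sm_util_pct": util.gpu,
                "mem_util_pct": util.memory,
                "mem_used_mb": mem.used // (1024 * 1024),
                "mem_total_mb": mem.total // (1024 * 1024),
                "power_w": power / 1000.0,
            })
        return samples
    finally:
        pynvml.nvmlShutdown()
```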

We evaluate four GPU-specific scheduling schemes (a simplified placement sketch follows this list):

Uniform: Kubernetes default scheduler; GPUs cannot be shared; low PPW and no QoS guarantees

Resource Agnostic Sharing: First-Fit Decreasing bin packing; high PPW; poor QoS and high queueing delays

Correlation Based Provisioning: bin packing based on utilization metrics; high PPW; assured QoS, but high queueing delays due to affinity constraints

Peak Prediction: predicts the resource peaks of co-scheduled apps via the Auto Correlation Factor (ACF); high PPW and assured QoS guarantees; low queueing delays
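A minimal sketch of a peak-aware placement check in the spirit of the Peak Prediction scheme, assuming each app comes with a predicted utilization time series; the function names, greedy first-fit loop, 100% capacity threshold, and workload labels are illustrative assumptions, not the Kube-Knots implementation:

```python
import numpy as np

GPU_CAPACITY = 100.0  # percent SM utilization available per GPU

def can_colocate(load_a: np.ndarray, load_b: np.ndarray,
                 capacity: float = GPU_CAPACITY) -> bool:
    """Two loads fit together only if their predicted utilization never
    sums past the GPU's capacity at any timestep in the window."""
    return bool(np.all(load_a + load_b <= capacity))

def peak_aware_place(pending: dict, gpus: dict) -> dict:
    """Greedy first-fit: place each pending app (name -> predicted utilization
    series) on the first GPU whose resident apps leave enough headroom at
    every predicted timestep; apps that fit nowhere stay queued."""
    placement = {}
    for app, pred in pending.items():
        for gpu, residents in gpus.items():
            current = sum(residents) if residents else np.zeros_like(pred)
            if can_colocate(current, pred):
                residents.append(pred)
                placement[app] = gpu
                break
    return placement

# Illustrative predicted utilization windows (percent, one sample per interval).
pending = {
    "latency-sensitive-app": np.array([20, 30, 80, 30, 20], dtype=float),
    "batch-app": np.array([70, 60, 15, 60, 70], dtype=float),
}
gpus = {"gpu-0": [], "gpu-1": []}
print(peak_aware_place(pending, gpus))
```

Because the two example apps peak at different timesteps, the sketch packs them onto the same GPU, which is the property that keeps PPW high without sacrificing QoS.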

GPU load across three different workload mixes scheduled using the Uniform scheduler

GPU load across three different workload mixes scheduled using the Peak Prediction scheduler

The Peak Prediction scheduler effectively consolidates the workload mixes for all workloads

Workloads: Djinn & Tonic suite (latency-sensitive jobs) and Rodinia HPC suite (batch jobs)

Predicting the resource peaks using early indicators through ARIMA (see the forecasting sketch below)

Receive-bandwidth peaks are a precursor to SM and memory utilization peaks
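A minimal sketch of this kind of short-horizon forecast using statsmodels' ARIMA; the sample series, the (2, 1, 2) model order, and the peak threshold are illustrative assumptions, and the same forecast could be run on the receive-bandwidth series since its peaks lead the SM and memory peaks:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Placeholder SM-utilization samples for one GPU, taken every 5 seconds
# (these would come from the node-level pyNVML polling shown earlier).
sm_util = np.array([35, 38, 41, 55, 72, 90, 88, 70, 52, 40, 37, 36,
                    39, 44, 58, 75, 92, 89, 68, 50], dtype=float)

# Fit a small ARIMA model and forecast the next few 5-second intervals.
model = ARIMA(sm_util, order=(2, 1, 2))
fitted = model.fit()
forecast = fitted.forecast(steps=3)

# Flag an upcoming peak if any forecasted point crosses a headroom threshold,
# so the scheduler can avoid co-locating another bursty app on this GPU.
PEAK_THRESHOLD = 85.0
peak_expected = bool(np.any(forecast >= PEAK_THRESHOLD))
print(f"forecast={np.round(forecast, 1)}, peak_expected={peak_expected}")
```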

Co-locating apps that do not peak at the same time, using the ACF (a minimal sketch follows)
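A minimal sketch of how the ACF could be used to characterize each app's peak periodicity and to reject co-locations whose peaks coincide; the traces, threshold, and helper names are illustrative assumptions rather than the actual Kube-Knots logic:

```python
import numpy as np
from statsmodels.tsa.stattools import acf

def dominant_period(util: np.ndarray, max_lag: int = 60) -> int:
    """Return the lag (in samples) with the strongest autocorrelation,
    i.e. an estimate of how often the app's utilization peaks repeat."""
    acf_vals = acf(util, nlags=max_lag)
    return int(np.argmax(acf_vals[1:]) + 1)  # skip lag 0

def peaks_collide(util_a: np.ndarray, util_b: np.ndarray,
                  threshold: float = 80.0) -> bool:
    """Crude check: do the two apps exceed the utilization threshold
    during the same sample intervals?"""
    return bool(np.any((util_a > threshold) & (util_b > threshold)))

# Two periodic utilization traces whose peaks are phase-shifted (illustrative).
t = np.arange(200)
app_a = 50 + 45 * np.maximum(0, np.sin(2 * np.pi * t / 40))
app_b = 50 + 45 * np.maximum(0, np.sin(2 * np.pi * (t - 20) / 40))

print("period A:", dominant_period(app_a), "period B:", dominant_period(app_b))
print("co-locatable:", not peaks_collide(app_a, app_b))
```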