TRANSCRIPT
The Curious Case of Container Orchestration and Scheduling in GPU-based Datacenters
Prashanth Thinakaran1, Jashwant Raj Gunasekaran1, Bikash Sharma2, Mahmut Taylan Kandemir1, Chita R. Das1
1 The Pennsylvania State University, 2 Facebook Inc.
ACKNOWLEDGMENTS
This research was generously supported by several NSF grants (1320478, 1409095, 1629129, 1439021, 1629915, 1439057, 1626251, 1526750) and the NSF Chameleon Cloud project CH-819640 for the GPU cluster infrastructure.
Prashanth Thinakaran ([email protected])
Era of H/W Accelerators
➢ The end of Moore’s law marks the beginning of accelerator innovations such as GPUs, TPUs, and FPGAs.
➢ State-of-the-art datacenter schedulers are still GPU-agnostic, or only at an early stage of GPU support.
➢ Need for GPU-aware scheduling policies in datacenter orchestrators like Kubernetes.
The Curious Case of GPU Orchestration
Kube-Knots: GPU-aware Orchestrator
Overall Results and Discussion
➢ Peak Prediction (PP) improves cluster-wide GPU utilization by up to 80% at both the 50th and 90th percentiles.
➢ 33% savings in power consumption across the ten-node GPU cluster.
➢ Reduced the overall QoS violations of latency-sensitive queries by up to 53% when compared to the resource-agnostic sharing scheduler.
➢ PP scheduler’s misprediction rate (error) was as low as 16% for a window interval of 5s.
Alibaba Trace Analysis
➢ Head-of-the-line blocking in non-preemptible accelerators
➢ Limited accountability of GPU containers at the Docker level
Need for:
✓ QoS-aware batching
✓ Utilization-aware scheduling
✓ Application-aware bin packing
GPU Workload Analysis
Best PPW achieved only @ 100% GPU Utilization
Performance Per Watt (PPW) of Memcached GPU vs CPU at varying query loads
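The PPW metric above is simply delivered throughput divided by power draw. A minimal sketch with hypothetical numbers (the function name and values are illustrative, not measurements from the paper):

```python
# Performance Per Watt (PPW): throughput divided by average power draw.
# The numbers below are hypothetical placeholders, not measured results.
def performance_per_watt(queries_per_sec: float, avg_power_watts: float) -> float:
    return queries_per_sec / avg_power_watts

# Example: compare a GPU and a CPU serving Memcached-style query loads
print("GPU PPW:", performance_per_watt(200_000, 180))  # queries per joule
print("CPU PPW:", performance_per_watt(40_000, 95))
```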
Analysis of GPU-aware cluster schedulers
~50-80% internal memory fragmentation
Need to expose the application framework's configs to the resource scheduler
Over-commitment in Alibaba datacenters
Average utilization: CPU -> 49%, Memory -> 82%
(X-axis of the trace plots is time in milliseconds.)
➢ Batch workloads show strong correlation across different metrics
➢ CPU usage is strongly correlated with memory usage
➢ Container workloads show a similar trend for CPU and memory utilization
➢ Correlation across utilization metrics is crucial for job placement (a minimal correlation sketch follows)
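A minimal sketch of the correlation analysis described above, using the Pearson correlation between CPU and memory utilization traces; the traces are hypothetical placeholders rather than Alibaba trace data:

```python
import numpy as np

# Hypothetical per-interval utilization samples (%) for one container
cpu_util = np.array([30, 45, 60, 72, 55, 40, 35, 65, 70, 50], dtype=float)
mem_util = np.array([40, 52, 68, 80, 61, 48, 44, 73, 78, 58], dtype=float)

# Pearson correlation coefficient between the two utilization metrics;
# a value close to 1 means the metrics rise and fall together.
corr = np.corrcoef(cpu_util, mem_util)[0, 1]
print(f"CPU-memory utilization correlation: {corr:.2f}")
```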
~20% of provisioned Memory is unused
Static provisioning leads to under-utilization
Knots queries the node-level GPU utilization through the pyNVML library to collect crucial utilization metrics, as sketched below.
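A minimal sketch, assuming the pynvml Python bindings, of how such node-level metrics can be sampled; this is illustrative and not the authors' implementation:

```python
import pynvml

def sample_gpu_metrics():
    """Sample per-GPU utilization, memory, and power metrics on this node."""
    pynvml.nvmlInit()
    metrics = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # SM and memory-controller utilization (%)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)            # used/total device memory (bytes)
            power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000 # milliwatts -> watts
            metrics.append({
                "gpu": i,
                "sm_util": util.gpu,
                "mem_util": util.memory,
                "mem_used_mb": mem.used // (1024 * 1024),
                "power_w": power_w,
            })
    finally:
        pynvml.nvmlShutdown()
    return metrics
```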
We evaluate four GPU-specific scheduling schemes:
➢ Uniform: Kubernetes' default scheduler; GPUs cannot be shared; low PPW and no QoS guarantees.
➢ Resource-Agnostic Sharing: First-Fit-Decreasing bin-packing (a minimal sketch follows this list); high PPW, but poor QoS and high queueing delays.
➢ Correlation-Based Provisioning: bin-packing based on utilization metrics; high PPW and assured QoS, but high queueing delays due to affinity constraints.
➢ Peak Prediction: predicts the resource peaks of co-scheduled apps via the auto-correlation factor (ACF); high PPW, assured QoS guarantees, and low queueing delays.
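A hedged sketch of First-Fit-Decreasing bin-packing as used by the Resource-Agnostic Sharing scheme, packing container GPU-memory requests onto GPUs; the function name and capacities are illustrative assumptions:

```python
from typing import Dict, List

def first_fit_decreasing(requests_mb: List[int], gpu_capacity_mb: int) -> List[Dict]:
    """Sort requests by size (decreasing) and place each on the first GPU that fits."""
    gpus: List[Dict] = []
    for req in sorted(requests_mb, reverse=True):         # largest requests first
        for gpu in gpus:
            if gpu["free_mb"] >= req:                      # first GPU with enough free memory
                gpu["packed"].append(req)
                gpu["free_mb"] -= req
                break
        else:                                              # no existing GPU fits -> open a new one
            gpus.append({"packed": [req], "free_mb": gpu_capacity_mb - req})
    return gpus

# Example: pack six container requests onto 16 GB GPUs
print(first_fit_decreasing([4096, 8192, 2048, 6144, 1024, 3072], gpu_capacity_mb=16384))
```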
GPU load across three different workload mixes scheduled using Uniform scheduler
GPU load across three different workload mixes scheduled using Peak Prediction scheduler
The Peak Prediction scheduler effectively consolidates the workload mixes for all workloads.
(Workloads shown: Djinn & Tonic workload suite (latency-sensitive jobs) and Rodinia HPC workload suite (batch jobs).)
Predicting the resource peaks using early indicators through ARIMA
Receive-bandwidth peaks are a precursor to SM and memory utilization peaks
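A minimal sketch, assuming the statsmodels ARIMA implementation, of forecasting an early-indicator series such as receive bandwidth; the series values and model order are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical per-second receive-bandwidth samples for one container (MB/s)
rx_bandwidth = np.array([12, 14, 13, 15, 22, 35, 48, 52, 50, 47], dtype=float)

# Fit a small ARIMA model and forecast the next few samples
model = ARIMA(rx_bandwidth, order=(2, 1, 1)).fit()
forecast = model.forecast(steps=5)
print("Forecasted receive bandwidth:", forecast)

# If the forecast crosses a threshold, an SM/memory utilization peak is expected
# to follow, so the scheduler can avoid placing another bursty container here.
PEAK_THRESHOLD = 45.0
print("Peak expected:", bool((forecast > PEAK_THRESHOLD).any()))
```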
Co-locating apps that do not peak at the same time, using ACF
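A hedged illustration of the co-location idea: two utilization traces are safe to place on the same GPU if they do not peak in the same intervals. This is a simplified stand-in for the ACF-based check, with hypothetical traces:

```python
import numpy as np

def peaks_overlap(util_a: np.ndarray, util_b: np.ndarray, quantile: float = 0.9) -> bool:
    """Return True if both series are simultaneously above their own peak quantile."""
    peak_a = util_a >= np.quantile(util_a, quantile)
    peak_b = util_b >= np.quantile(util_b, quantile)
    return bool(np.any(peak_a & peak_b))

# Hypothetical GPU utilization traces (%) sampled at a fixed interval
app_a = np.array([20, 25, 90, 95, 30, 20, 25, 85, 90, 30], dtype=float)
app_b = np.array([80, 85, 20, 25, 75, 80, 85, 20, 25, 70], dtype=float)

# Anti-correlated peaks -> candidates for co-location on the same GPU
print("Co-locate:", not peaks_overlap(app_a, app_b))
```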