TRANSCRIPT
The Curious Case of Container Orchestration and Scheduling in GPU-based Datacenters
Prashanth Thinakaran1, Jashwant Raj Gunasekaran1, Bikash Sharma2, Mahmut Taylan Kandemir1, Chita R. Das1
1 The Pennsylvania State University, 2 Facebook Inc.
ACKNOWLEDGMENTS
This research was generously supported by several NSF grants (1320478, 1409095, 1629129, 1439021, 1629915, 1439057, 1626251, 1526750) and the NSF Chameleon Cloud project CH-819640 for the GPU cluster infrastructure.
Prashanth Thinakaran ([email protected])
Era of H/W Accelerators
➢ The end of Moore’s law marks the beginning of accelerator innovations such as GPUs, TPUs, and FPGAs.
➢ State-of-the-art datacenter schedulers are still GPU-agnostic, or only at an early stage of GPU support.
➢ Need for GPU-aware scheduling policies in datacenter orchestrators like Kubernetes.
The Curious Case of GPU Orchestration
Kube-Knots: GPU-aware Orchestrator
Overall Results and Discussion
➢ Peak Prediction (PP) improves cluster-wide GPU utilization by up to 80% at both the 50th and 90th percentiles.
➢ 33% savings in power consumption across the ten-node GPU cluster.
➢ Reduced the overall QoS violations of latency-sensitive queries by up to 53% when compared to the resource-agnostic sharing scheduler.
➢ PP scheduler’s misprediction rate (error) was as low as 16% for a window interval of 5s.
Alibaba Trace Analysis
➢ Head-of-the-line blocking in non-preemptible accelerators
➢ Limited accountability of GPU containers at the Docker level
Need for:
✓ QoS-aware batching
✓ Utilization-aware scheduling
✓ Application-aware bin packing
GPU Workload Analysis
Best PPW achieved only @ 100% GPU Utilization
Performance Per Watt (PPW) of Memcached GPU vs CPU at varying query loads
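The PPW metric above is simply delivered throughput divided by power draw. A minimal sketch with hypothetical numbers (the function name and values are illustrative, not measurements from the paper):

```python
# Performance Per Watt (PPW): throughput divided by average power draw.
# The numbers below are hypothetical placeholders, not measured results.
def performance_per_watt(queries_per_sec: float, avg_power_watts: float) -> float:
    return queries_per_sec / avg_power_watts

# Example: compare a GPU and a CPU serving Memcached-style query loads
print("GPU PPW:", performance_per_watt(200_000, 180))  # queries per joule
print("CPU PPW:", performance_per_watt(40_000, 95))
```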
Analysis of GPU-aware cluster schedulers
~50-80% internal memory fragmentation
Need to expose the application framework's configs to the resource scheduler
Over-commitment in Alibaba datacenters
Average utilization: CPU -> 49%, Memory -> 82%
(X-axis of the trace plots is time in milliseconds.)
➢ Batch workloads show strong correlation across different metrics
➢ CPU usage is strongly correlated with memory usage
➢ Container workloads show a similar trend for CPU and memory utilization
➢ Correlation across utilization metrics is crucial for job placement (a minimal correlation sketch follows)
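A minimal sketch of the correlation analysis described above, using the Pearson correlation between CPU and memory utilization traces; the traces are hypothetical placeholders rather than Alibaba trace data:

```python
import numpy as np

# Hypothetical per-interval utilization samples (%) for one container
cpu_util = np.array([30, 45, 60, 72, 55, 40, 35, 65, 70, 50], dtype=float)
mem_util = np.array([40, 52, 68, 80, 61, 48, 44, 73, 78, 58], dtype=float)

# Pearson correlation coefficient between the two utilization metrics;
# a value close to 1 means the metrics rise and fall together.
corr = np.corrcoef(cpu_util, mem_util)[0, 1]
print(f"CPU-memory utilization correlation: {corr:.2f}")
```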
~20% of provisioned Memory is unused
Static provisioning leads to under-utilization
Knots queries the node-level GPU utilization through the pyNVML library to collect crucial utilization metrics, as sketched below.
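A minimal sketch, assuming the pynvml Python bindings, of how such node-level metrics can be sampled; this is illustrative and not the authors' implementation:

```python
import pynvml

def sample_gpu_metrics():
    """Sample per-GPU utilization, memory, and power metrics on this node."""
    pynvml.nvmlInit()
    metrics = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # SM and memory-controller utilization (%)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)            # used/total device memory (bytes)
            power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000 # milliwatts -> watts
            metrics.append({
                "gpu": i,
                "sm_util": util.gpu,
                "mem_util": util.memory,
                "mem_used_mb": mem.used // (1024 * 1024),
                "power_w": power_w,
            })
    finally:
        pynvml.nvmlShutdown()
    return metrics
```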
We evaluate four GPU-specific scheduling schemes:
➢ Uniform: Kubernetes' default scheduler; GPUs cannot be shared; low PPW and no QoS guarantees.
➢ Resource-Agnostic Sharing: First-Fit-Decreasing bin-packing (a minimal sketch follows this list); high PPW, but poor QoS and high queueing delays.
➢ Correlation-Based Provisioning: bin-packing based on utilization metrics; high PPW and assured QoS, but high queueing delays due to affinity constraints.
➢ Peak Prediction: predicts the resource peaks of co-scheduled apps via the auto-correlation factor (ACF); high PPW, assured QoS guarantees, and low queueing delays.
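A hedged sketch of First-Fit-Decreasing bin-packing as used by the Resource-Agnostic Sharing scheme, packing container GPU-memory requests onto GPUs; the function name and capacities are illustrative assumptions:

```python
from typing import Dict, List

def first_fit_decreasing(requests_mb: List[int], gpu_capacity_mb: int) -> List[Dict]:
    """Sort requests by size (decreasing) and place each on the first GPU that fits."""
    gpus: List[Dict] = []
    for req in sorted(requests_mb, reverse=True):         # largest requests first
        for gpu in gpus:
            if gpu["free_mb"] >= req:                      # first GPU with enough free memory
                gpu["packed"].append(req)
                gpu["free_mb"] -= req
                break
        else:                                              # no existing GPU fits -> open a new one
            gpus.append({"packed": [req], "free_mb": gpu_capacity_mb - req})
    return gpus

# Example: pack six container requests onto 16 GB GPUs
print(first_fit_decreasing([4096, 8192, 2048, 6144, 1024, 3072], gpu_capacity_mb=16384))
```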
GPU load across three different workload mixes scheduled using Uniform scheduler
GPU load across three different workload mixes scheduled using Peak Prediction scheduler
The Peak Prediction scheduler effectively consolidates the workload mixes for all workloads.
(Workloads shown: Djinn & Tonic workload suite (latency-sensitive jobs) and Rodinia HPC workload suite (batch jobs).)
Predicting the resource peaks using early indicators through ARIMA
Receive-bandwidth peaks are a precursor to SM and memory utilization peaks
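A minimal sketch, assuming the statsmodels ARIMA implementation, of forecasting an early-indicator series such as receive bandwidth; the series values and model order are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical per-second receive-bandwidth samples for one container (MB/s)
rx_bandwidth = np.array([12, 14, 13, 15, 22, 35, 48, 52, 50, 47], dtype=float)

# Fit a small ARIMA model and forecast the next few samples
model = ARIMA(rx_bandwidth, order=(2, 1, 1)).fit()
forecast = model.forecast(steps=5)
print("Forecasted receive bandwidth:", forecast)

# If the forecast crosses a threshold, an SM/memory utilization peak is expected
# to follow, so the scheduler can avoid placing another bursty container here.
PEAK_THRESHOLD = 45.0
print("Peak expected:", bool((forecast > PEAK_THRESHOLD).any()))
```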
Co-locating apps that do not peak at the same time, using ACF
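A hedged illustration of the co-location idea: two utilization traces are safe to place on the same GPU if they do not peak in the same intervals. This is a simplified stand-in for the ACF-based check, with hypothetical traces:

```python
import numpy as np

def peaks_overlap(util_a: np.ndarray, util_b: np.ndarray, quantile: float = 0.9) -> bool:
    """Return True if both series are simultaneously above their own peak quantile."""
    peak_a = util_a >= np.quantile(util_a, quantile)
    peak_b = util_b >= np.quantile(util_b, quantile)
    return bool(np.any(peak_a & peak_b))

# Hypothetical GPU utilization traces (%) sampled at a fixed interval
app_a = np.array([20, 25, 90, 95, 30, 20, 25, 85, 90, 30], dtype=float)
app_b = np.array([80, 85, 20, 25, 75, 80, 85, 20, 25, 70], dtype=float)

# Anti-correlated peaks -> candidates for co-location on the same GPU
print("Co-locate:", not peaks_overlap(app_a, app_b))
```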