© 2019 One Convergence, Inc. All rights reserved. GTC 2019
Composable Infrastructure for On-Prem Kubernetes-Based Systems (S9572)
Subrahmanyam Ongole, Architect, One Convergence, Inc.
Agenda
● Introduction
● State of the art
● Problem description
● Proposal
● Scale-out performance
Introduction
● One Convergence Products
  o http://www.oneconvergence.com
● Topic
  o GPU composition for Kubernetes workloads
DFabric
Scale-Out Architecture
● Why scale-out?
  o Scale-up vs. scale-out
    ▪ Affordable GPU servers
    ▪ Incrementally add new GPU hardware
    ▪ Resiliency: no single point of failure
    ▪ Higher network speeds via RDMA NICs
  o Challenges
    ▪ Cluster management
    ▪ Workload orchestration
    ▪ Resource management
    ▪ Achieving best performance
  o On-prem
    ▪ Cloud providers address this
    ▪ On-prem needs to be solved
[Diagram: one scale-up system (8-16 GPUs) vs. multiple scale-out systems (2-4 GPUs each), connected via RDMA NICs over a high-speed interconnect]
Platform of Choice
● Kubernetes
  o Cluster management
  o Container orchestration
  o Standard interfaces for network and storage
    ▪ CNI & CSI
  o Node-specific resource management
    ▪ Device plugins for GPUs, RDMA, etc.
GPU Allocation
● Pod spec
resources:
  limits:
    nvidia.com/gpu: 2  # requesting 2 GPUs
● Different types of GPUs
  o Label each node with the type of GPU
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
  o Specify using node selectors in the pod spec
nodeSelector:
  accelerator: nvidia-tesla-p100  # or nvidia-tesla-k80 etc.
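Putting the resource limit and node selector together, a complete pod spec might look like the following sketch (the pod name and container image are illustrative, not from the slides):

```yaml
# Minimal sketch: request 2 GPUs and pin the pod to P100 nodes.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-job            # illustrative name
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-base   # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 2         # requesting 2 GPUs
  nodeSelector:
    accelerator: nvidia-tesla-p100  # schedule only on labeled P100 nodes
```

The scheduler will only place this pod on a node that both carries the `accelerator=nvidia-tesla-p100` label and has 2 unallocated GPUs advertised by the device plugin.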
Challenges
● User needs to be aware of
  o GPU vendor, type of GPU, and GPU nodes
● Resource segmentation
  o Experimental vs. production jobs
● Better utilization of GPUs
  o Schedule by mutual agreement
● Multi-user
  o Isolation of workloads
● Cluster changes
  o Scale-out/scale-down
  o GPU health
● Topology
  o RDMA, NVLink®, etc.
● Complex with increasing number of users/nodes
Extending Kubernetes
● Custom Resources
  o Dynamically extend the Kubernetes API
  o CRDs - Custom Resource Definitions
    ▪ Handled by the API server
    ▪ Use Kubernetes storage
    ▪ Custom controller provides a declarative API
  o Aggregated APIs
    ▪ Separate service; complex
    ▪ Custom storage
● Operators
  o Combine Custom Resources & custom controllers
  o Encode domain knowledge
  o Examples
    ▪ etcd, Prometheus operators
    ▪ TF operator in Kubeflow
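As a sketch of the CRD mechanism above, a custom resource type could be registered like this (the group and kind names are hypothetical illustrations, not the DFabric API; `apiextensions.k8s.io/v1beta1` was the current CRD version around Kubernetes 1.13):

```yaml
# Illustrative CRD: registers a new "GPUPool" resource type with the API server.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: gpupools.example.com      # must be <plural>.<group>
spec:
  group: example.com              # hypothetical API group
  version: v1alpha1
  scope: Namespaced
  names:
    plural: gpupools
    singular: gpupool
    kind: GPUPool
```

Once applied, `kubectl get gpupools` works like any built-in resource; a custom controller then watches these objects and reconciles actual cluster state against them, which is the pattern Operators build on.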
DFabric
[Diagram: DFabric stack - Compose & Monitor, Resource Management & Optimization, User/Group Mgmt, and K8s scheduling with device plug-ins - spanning PCIe and Ethernet]
DFabric Architecture
● Microsegmentation (Pools)
● APIs/Operators
Pool & Group Benefits
● Abstracts resources
  o User doesn't need to be aware of GPU hardware
  o Groups determine GPU association
● Better utilization of GPUs
  o Better distribution of workload
● Isolation of workloads
  o Separate namespace per user
● Topology awareness
  o Schedules RDMA/GPUDirect wherever applicable
● Monitors changes to the cluster
  o Scale-out/scale-down
  o GPU health
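The per-user isolation point can be illustrated with plain Kubernetes objects: a namespace per user plus a quota that caps that user's GPU consumption (the names below are illustrative, and Kubernetes quotas on extended resources use the `requests.` prefix):

```yaml
# Illustrative sketch: one namespace per user, with a hard cap on GPUs.
apiVersion: v1
kind: Namespace
metadata:
  name: user-alice           # hypothetical user namespace
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: user-alice
spec:
  hard:
    requests.nvidia.com/gpu: 4   # this user can hold at most 4 GPUs
```

Pods created in `user-alice` that would push the namespace past 4 requested GPUs are rejected at admission time, which keeps one user from starving the shared pool.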
Disaggregated PCIe JBOG
● Introduction
● Static composition
  o Fixed at node composition time
● Dynamic composition
  o Dynamically attaches to pod
  o GPUs move across nodes
  o Device plugin requirements
[Diagram: hosts, each with NIC/SSD/GPU behind a PCIe switch, plus a disaggregated box of GPUs, all connected over a PCIe fabric]
Scale-Out Performance
3-node cluster; each node contains:
  o Lenovo™ ThinkSystem™ SD530
  o Intel® Xeon® Gold 6148 @ 2.4 GHz
    ▪ 384 GB RAM
    ▪ 20 cores
  o 2× NVIDIA® V100 GPUs, 16 GB each
  o Mellanox® 100 Gbps ConnectX®-5
    ▪ RDMA NIC
  o CUDA 9.0
  o cuDNN 7.4.1.5-1
  o TensorFlow 1.12
  o Mellanox OFED 4.5-1.0.1.0
  o NCCL; OpenMPI 3.0.0
  o Horovod 0.15.2
  o DKube/DFabric™ 1.0.3
[Chart: scale-out training performance with RDMA vs. without RDMA]
Summary
● Scale-out architecture
  o http://www.oneconvergence.com/blogs/
● Platform requirements
  o DFabric
  o http://www.oneconvergence.com/dfab

Thank You
Questions?