JEREMY EDER - RED HAT PERFORMANCE ENGINEERING
Enabling GPU-as-a-Service Providers with Red Hat OpenShift
@jeremyeder
Senior Principal Software Engineer, Red Hat
March 2018
Agenda
● OpenShift Cluster Overview
● Infrastructure Abstraction
● High Performance Features
● GPU Overview
Community Powered Innovation
What does an OpenShift Cluster look like?
[Diagram: a master (API/authentication, data store, scheduler, health/scaling) and RHEL nodes running containers (C), together with a service layer, routing layer, persistent storage, and registry, all on Red Hat Enterprise Linux across physical, virtual, private, public, and hybrid infrastructure]
Abstract away any infrastructure
[Diagram: service layer and routing layer on top of physical, virtual, private, public, and hybrid infrastructure]
● Bare Metal
● RHV
● OpenStack
● VMware
● GCE
● Azure
● AWS
● BYO nodes...
One Platform to...
OpenShift is the single platform to run any application:
● Old or new
● Monolithic/Microservice
[Diagram: workload verticals served by the platform: Big Data, NFV, FSI, Animation, ISVs, HPC, Machine Learning]
High Performance RFEs by Vertical

| Feature | FSI | NFV | ISV | BD/ML | ANIM | HPC |
| --- | --- | --- | --- | --- | --- | --- |
| NUMA (cpuset.cpus and cpuset.mems) | Yes | Yes | Yes | Maybe | Maybe | Yes |
| Device Passthrough (NIC/Disk/GPU etc...) | Yes | Yes | Yes | Maybe | Maybe | Yes |
| sysctl Support (non-namespaced too) | Yes | Yes | Yes | Yes | Yes | Yes |
| Separation of control- and data-plane | Yes | Yes | Yes | Yes | Yes | Yes |
| Node “fitness” (extended health info) | Yes | Yes | Maybe | Maybe | Maybe | Yes |
| Multi-homed pods | Yes | Yes | Maybe | Yes | Yes | Yes |
| Kernel Modules (DKMS-ish) | Yes | Yes | Maybe | Maybe | Yes | Maybe |
| Hugepages | Yes | Yes | Yes | Yes | Maybe | Maybe |
Why do this?
Enable containerization of Infrastructure Software
● Software-defined Storage and Networking
● Packet switching and routing tiers
● Multi-workloads (very different) within a single cluster
  ○ Layered schedulers (HPC/grid)
● Many more...
Enable containerization of Red Hat’s products
● Gluster/Container Native Storage
● Ceph
● OpenStack
● radanalytics
● KubeVirt
Upstream First: Kubernetes Working Groups
● Resource Management Working Group
  ○ Features Delivered
    ■ Device Plugins (GPU/Bypass/FPGA)
    ■ CPU Manager (exclusive cores)
    ■ Huge Pages Support
  ○ Extensive Roadmap
● Intel, IBM, Google, NVIDIA, Red Hat, many more...
Upstream First: Kubernetes Working Groups
● Network Plumbing Working Group
  ○ Formalized Dec 2017
● Goal is to implement a pseudo-standard collection of CRDs for multiple networks, owned by sig-network and maintained out of tree
● Separate control- and data-plane, Overlapping IPs, Fast Data-plane
● IBM, Intel, Red Hat, Huawei, Cisco, Tigera...at least.
GPU CLUSTER TOPOLOGY
OpenShift Cluster Topology
[Diagram: a control plane of three "master and etcd" nodes; an infrastructure tier of three "registry and router" nodes behind a load balancer (LB); compute nodes and a storage tier]
OpenShift Cluster Topology
Compute Nodes...
● How to enable software to take advantage of “special” hardware
● Create Node Pools
  ○ Mark them as “special”
  ○ Taints/Tolerations
  ○ ExtendedResourceToleration
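The node-pool idea on this slide can be sketched as a pod that tolerates a taint placed on the "special" nodes. The Python below only builds the manifest as a dictionary and applies a simplified version of the toleration match; the taint key `nvidia.com/gpu`, the pod name, and the image name are assumptions for illustration and do not come from the talk.

```python
import json

# Assumed taint key for the GPU node pool (hypothetical choice for this sketch).
GPU_TAINT_KEY = "nvidia.com/gpu"


def gpu_toleration(key=GPU_TAINT_KEY):
    """Toleration that matches any value of the given taint key."""
    return {"key": key, "operator": "Exists", "effect": "NoSchedule"}


def build_pod(name, image, tolerations):
    """Build a minimal Pod manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{"name": name, "image": image}],
            "tolerations": list(tolerations),
        },
    }


def tolerates(pod, taint):
    """Simplified toleration check; the real scheduler handles more cases."""
    for tol in pod["spec"].get("tolerations", []):
        # An empty effect on the toleration matches every effect.
        if tol.get("effect") not in (None, "", taint["effect"]):
            continue
        if tol.get("operator", "Equal") == "Exists":
            # An empty key with Exists matches every taint.
            if tol.get("key") in (None, "", taint["key"]):
                return True
        elif tol.get("key") == taint["key"] and tol.get("value") == taint.get("value"):
            return True
    return False


# Hypothetical pod and image names.
pod = build_pod("cuda-job", "example.com/cuda-app:latest", [gpu_toleration()])
taint = {"key": GPU_TAINT_KEY, "value": "present", "effect": "NoSchedule"}
print(json.dumps(pod, indent=2))
```

With the ExtendedResourceToleration admission plugin enabled, a toleration of this kind is added automatically to pods that request the matching extended resource, so users do not have to write it by hand.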
OpenShift Cluster Topology
Compute Nodes...
● How to enable software to take advantage of “special” hardware
● Tune/Configure the OS
  ○ Tuned Profiles
  ○ CPU Isolation
  ○ sysctls
OpenShift Cluster Topology
In OpenShift, there are three “types” of sysctls
Safe
● Enabled by default
● kernel.shm_rmid_forced
● net.ipv4.ip_local_port_range
● net.ipv4.tcp_syncookies
Unsafe
● Experimental Kubelet Flag
● kernel.sem*
● kernel.shm*
● kernel.msg*
● fs.mqueue.*
● net.*
Node-level
● Can’t set from a pod
● Potentially affects other pods
● Many interesting sysctls
● Use TuneD
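The three categories on this slide can be expressed as a small classifier. This is a rough sketch that uses only the names and patterns listed on the slide; the function name `classify_sysctl` is hypothetical, and the lists a given kubelet version actually enforces may differ.

```python
from fnmatch import fnmatchcase

# Safe sysctls as listed on the slide (enabled by default).
SAFE_SYSCTLS = {
    "kernel.shm_rmid_forced",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_syncookies",
}

# Namespaced but unsafe patterns as listed on the slide.
UNSAFE_PATTERNS = ("kernel.sem*", "kernel.shm*", "kernel.msg*", "fs.mqueue.*", "net.*")


def classify_sysctl(name):
    """Return 'safe', 'unsafe', or 'node-level' for a sysctl name."""
    if name in SAFE_SYSCTLS:
        return "safe"
    if any(fnmatchcase(name, pattern) for pattern in UNSAFE_PATTERNS):
        return "unsafe"
    # Everything else cannot be set from a pod and belongs in node tuning.
    return "node-level"


for sysctl in ("net.ipv4.tcp_syncookies", "kernel.shmmax", "vm.swappiness"):
    print(sysctl, "->", classify_sysctl(sysctl))
```

The safe set is checked first because some safe names, such as `kernel.shm_rmid_forced`, would otherwise match an unsafe pattern.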
OpenShift Cluster Topology
Compute Nodes...
● How to enable software to take advantage of “special” hardware
● Optimize your workload
  ○ Dedicate CPU cores
  ○ Consume hugepages
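A pod that asks for dedicated cores and hugepages might look like the sketch below. It assumes the kubelet runs the CPU Manager with the static policy, which gives exclusive cores to containers in Guaranteed pods that request whole CPUs, and that 2Mi hugepages are pre-allocated on the node. The function name, pod name, image, and sizes are hypothetical.

```python
def build_pinned_pod(name, image, cpus, memory, hugepages_2mi):
    """Build a Guaranteed-QoS pod with whole CPUs and 2Mi hugepages."""
    if not isinstance(cpus, int) or cpus < 1:
        raise ValueError("exclusive cores require a whole number of CPUs")
    # Requests must equal limits for Guaranteed QoS and for hugepages.
    resources = {
        "cpu": str(cpus),
        "memory": memory,
        "hugepages-2Mi": hugepages_2mi,
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "resources": {
                    "requests": dict(resources),
                    "limits": dict(resources),
                },
                # Hugepages are exposed to the container through a volume.
                "volumeMounts": [{"name": "hugepage", "mountPath": "/dev/hugepages"}],
            }],
            "volumes": [{"name": "hugepage", "emptyDir": {"medium": "HugePages"}}],
        },
    }


# Hypothetical names and sizes.
pinned = build_pinned_pod("pinned-app", "example.com/app:latest", 4, "8Gi", "1Gi")
print(pinned["spec"]["containers"][0]["resources"])
```

Fractional CPU requests such as `500m` would leave the container in the shared CPU pool, which is why the helper rejects anything other than a positive integer.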
OpenShift Cluster Topology
Compute Nodes...
● How to enable software to take advantage of “special” hardware
● Enable the Hardware
  ○ Install drivers
  ○ Deploy Device Plugin
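Deploying a device plugin usually means running it as a DaemonSet on the nodes that carry the hardware. The sketch below builds such a manifest as a Python dictionary. The image name, the node label, and the object names are placeholders, not the ones used in the talk; the only fixed detail is the kubelet's device-plugin socket directory, which the plugin needs in order to register. A real plugin may also need extra security settings, and older clusters use an earlier DaemonSet API version than `apps/v1`.

```python
# Directory where the kubelet expects device plugins to register their sockets.
PLUGIN_SOCKET_DIR = "/var/lib/kubelet/device-plugins"


def build_device_plugin_daemonset(image, node_selector):
    """Build a minimal DaemonSet manifest for a device plugin."""
    labels = {"name": "gpu-device-plugin"}  # hypothetical label
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "gpu-device-plugin", "namespace": "kube-system"},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    # Run only on nodes in the GPU pool.
                    "nodeSelector": dict(node_selector),
                    "containers": [{
                        "name": "device-plugin",
                        "image": image,
                        "volumeMounts": [{
                            "name": "device-plugin",
                            "mountPath": PLUGIN_SOCKET_DIR,
                        }],
                    }],
                    "volumes": [{
                        "name": "device-plugin",
                        "hostPath": {"path": PLUGIN_SOCKET_DIR},
                    }],
                },
            },
        },
    }


# Placeholder image and node label.
daemonset = build_device_plugin_daemonset(
    "example.com/gpu-device-plugin:latest",
    {"example.com/accelerator": "gpu"},
)
print(daemonset["spec"]["template"]["spec"]["nodeSelector"])
```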
OpenShift Cluster Topology
Compute Nodes...
● How to enable software to take advantage of “special” hardware
● Consume the Device
  ○ KubeFlow Template deployment
Kubernetes Deployment for STAC-A2
● All-in-One Kubernetes Installation
● (hack/local-up-cluster.sh)
● Node labeled
● Containers:
  ○ RHEL7+CUDA9
  ○ RHEL7+CUDA9+DEVICE-PLUGIN
  ○ RHEL7+CUDA9+STAC-A2
● CUDA 9
● 8 x NVIDIA Tesla V100 (Volta) GPUs
● HPE Apollo 6500 w/XL270d Gen9
● Red Hat Enterprise Linux 7.4
● Kubernetes 1.8 (setup info)
● nvidia-smi --applications-clocks=877,1380
● https://rhelblog.redhat.com/2017/11/21/red-hat-and-partners-deliver-new-performance-records-on-prominent-risk-analytics-benchmark/
● https://news.developer.nvidia.com/a-new-stac-a2-record/
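The clock setting on this slide uses the `--applications-clocks` option of `nvidia-smi`, which takes a memory clock and a graphics clock in MHz, in that order. The helper below only assembles the command line and does not run it, since applying it needs a supported GPU and usually elevated privileges; the function name is hypothetical.

```python
def application_clocks_command(memory_mhz, graphics_mhz):
    """Build the nvidia-smi argument list for fixed application clocks."""
    for value in (memory_mhz, graphics_mhz):
        if not isinstance(value, int) or value <= 0:
            raise ValueError("clocks must be positive integers in MHz")
    return ["nvidia-smi", f"--applications-clocks={memory_mhz},{graphics_mhz}"]


# Values taken from the slide: 877 MHz memory, 1380 MHz graphics.
command = application_clocks_command(877, 1380)
print(" ".join(command))
```

The returned list can be handed to `subprocess.run` on a node where the NVIDIA driver is installed.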
Kubernetes Deployment for STAC-A2
[Diagram: "kubectl create" submits the Benchmark pod, which requests "resources: limits: nvidia.com/gpu: 8"; the Kube Scheduler places it on the node where the Kubelet and the Device Plugin (daemonset) expose eight Volta GPUs]
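The resource request shown in the diagram can be written out as a full pod manifest. The sketch below builds one in Python and adds a simplified fit check that mirrors what the scheduler does with the GPU count the device plugin advertises. The pod name, image name, and helper names are assumptions; only the `nvidia.com/gpu: 8` limit comes from the slide.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def build_gpu_pod(name, image, gpus):
    """Build a Pod manifest that requests whole GPUs through limits."""
    if not isinstance(gpus, int) or gpus < 1:
        raise ValueError("GPUs are requested in whole units")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                "resources": {"limits": {GPU_RESOURCE: gpus}},
            }],
        },
    }


def fits(pod, allocatable):
    """Simplified check: does the node advertise enough GPUs for the pod?"""
    requested = sum(
        int(c.get("resources", {}).get("limits", {}).get(GPU_RESOURCE, 0))
        for c in pod["spec"]["containers"]
    )
    return requested <= int(allocatable.get(GPU_RESOURCE, 0))


# Hypothetical pod and image names; 8 GPUs as on the slide.
benchmark = build_gpu_pod("benchmark", "example.com/rhel7-cuda9-benchmark:latest", 8)
print(json.dumps(benchmark["spec"]["containers"][0]["resources"]))
```

Because `kubectl create -f` accepts JSON as well as YAML, the dictionary can be dumped with `json.dumps` and submitted as is.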
Recent GPU-related work on OpenShift
● Early KubeFlow involvement
● radanalytics templates for ML-workflow on OpenShift
● Machine-Learning OpenShift Commons
● Demo Repositories
  ○ https://github.com/zvonkok/nvidia-k8s
  ○ https://github.com/redhat-performance/openshift-psap
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews
Commoditizing GPU-as-a-Service Providers with Red Hat OpenShift
Tuesday, Mar 27, 1:00 PM - 1:25 PM, Room 210E
Red Hat OpenShift Container Platform, with Kubernetes at its core, can play an important role in building flexible hybrid cloud infrastructure. By abstracting infrastructure away from developers, workloads become portable across any cloud. With NVIDIA Volta GPUs now available in every public cloud [1], as well as from every computer maker, an abstraction layer like OpenShift becomes even more valuable. Through demonstrations, this session will introduce you to declarative models for consuming GPUs via OpenShift, as well as the two-level scheduling decisions that provide fast placement and stability.