
Page 1:

Machine Learning on OpenStack

How can Scientific OpenStack help?

Feb 2021, John Garbutt, StackHPC

Page 2:

StackHPC Company Overview

● Formed 2016, based in Bristol, UK
  ○ Presence in Cambridge, France and Poland
  ○ Currently 16 people

● Founded on HPC expertise
  ○ Software Defined Networking
  ○ Systems Integration
  ○ OpenStack Development and Operations

● Motivation to transfer this expertise into Cloud to address HPC & HPDA
● “Open” Modus Operandi

  ○ Upstream development of OpenStack capabilities
  ○ Consultancy/Support to end-user organizations in managing HPC service transition
  ○ Scientific-SIG engagement for the Open Infrastructure Foundation
  ○ Open Source, Design, Development and Community

Page 3:

What is needed for Machine Learning?

Page 4:

Training, Inference, Data and more

https://developers.google.com/machine-learning/crash-course/production-ml-systems

Page 5:

TensorFlow Extended (TFX)

https://github.com/GoogleCloudPlatform/tf-estimator-tutorials/blob/54c3099d3a687052bd463e1344a8836913ac2d26/00_Miscellaneous/tfx/02_tfx_end_to_end.ipynb

Page 6:

TensorFlow Extended (TFX)

https://github.com/GoogleCloudPlatform/tf-estimator-tutorials/blob/54c3099d3a687052bd463e1344a8836913ac2d26/00_Miscellaneous/tfx/02_tfx_end_to_end.ipynb

[Pipeline diagram: Transform Data → Model Training → Inference]

Page 7:

Machine Learning Breakdown
● Data Processing and Pipelines

  ○ Transform to extract Features and Labels (sketched in code below)
  ○ Data Visualization

● Training: Static vs Dynamic Model Training
  ○ Does the input data change over time?
  ○ Pipeline reproducibility

● Inference: Offline vs Online predictions
  ○ Regression and Classification raise similar questions
  ○ Decision latency can be critical, and may need more resources to get answers faster

● Model complexity
  ○ Linear, Non-linear, how deep, how wide, ...

● Flow: Dev -> Stage -> Prod
● MLOps: Configuration Management, Deployment tooling, Monitoring...

https://www.kubeflow.org/docs/started/kubeflow-overview/#introducing-the-ml-workflow
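To make the transform / train / infer split above concrete, here is a minimal, illustrative sketch. It uses scikit-learn purely for brevity (an assumption; the slides themselves reference TensorFlow/TFX), and the data and model are toy stand-ins.

```python
# Minimal sketch of the transform -> train -> inference flow described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Transform: extract features and labels from raw records.
rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 4))                # stand-in for raw input records
labels = (raw.sum(axis=1) > 0).astype(int)      # stand-in for extracted labels
features = StandardScaler().fit_transform(raw)  # feature extraction/scaling

# Static (offline) training on a fixed snapshot of the data.
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Offline (batch) inference: score a whole dataset at once.
batch_predictions = model.predict(X_test)

# Online inference would instead serve model.predict() behind a
# low-latency endpoint, one request at a time.
print("accuracy:", model.score(X_test, y_test))
```

A dynamic training setup would re-run the fit step as new data arrives, which is exactly where the pipeline reproducibility point above starts to matter.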

Page 8:

Infrastructure Requests
● Offline can fit batch (e.g. Slurm), but not online

  ○ Offline Training and Online Inference: you may want a mix of both

● Scale up
  ○ CPUs are not always the best price-performance
  ○ GPUs are often better; also IPUs, Tensor cores, new CPU instructions
  ○ Connect to disaggregated accelerators and/or storage

● Scale out
  ○ Distribute work via an RDMA low-latency interconnect

● High Performance Storage
  ○ Keep the model processing fed, share transformed data
  ○ RDMA-enabled low-latency access to data sets

● Monitoring to understand how your chosen model performs

Page 9:

Scientific OpenStack

Page 10:

HPC Stack 1.0

Page 11:

Motivations Driving a Change

● Manage the increasing complexity

● Better knowledge sharing

● Move away from Resource Silos

Page 12:

HPC Stack 1.0

Page 13:

HPC Stack 2.0

Page 14:

HPC Stack 2.0

Page 15:

HPC Stack 2.0

Page 16:

OpenStack Magnum
● Kubernetes clusters on demand (cluster creation sketched below)

  ○ … working to add support for K8s v1.20
  ○ Terraform and Ansible can be used to manage clusters

● Magnum Cluster Autoscaler
  ○ Automatically expands and shrinks the K8s cluster, within defined limits
  ○ Based on when pods can / can’t be scheduled

● Storage Integration
  ○ Cinder CSI for Volumes (ReadWriteOnce PVC)
  ○ Manila CSI for CephFS shares (ReadWriteMany PVC)

● Network Integration
  ○ Octavia Load Balancer as a Service

https://github.com/RSE-Cambridge/iris-magnum
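As a rough illustration of "Kubernetes clusters on demand", the sketch below creates a Magnum cluster with openstacksdk. It assumes a recent openstacksdk exposing the container_infrastructure_management (Magnum) proxy; the cloud name, template name and node counts are placeholders, not values from the talk.

```python
# Hedged sketch: create a Kubernetes cluster via Magnum using openstacksdk.
import openstack

conn = openstack.connect(cloud="my-cloud")  # cloud name from clouds.yaml (placeholder)

# Look up an existing cluster template, e.g. one with the autoscaler and
# Cinder/Manila CSI drivers enabled ("k8s-v1.20" is a hypothetical name).
template = conn.container_infrastructure_management.find_cluster_template("k8s-v1.20")

cluster = conn.container_infrastructure_management.create_cluster(
    name="ml-cluster",
    cluster_template_id=template.id,
    master_count=1,
    node_count=3,  # the autoscaler can grow/shrink this within defined limits
)
print("created cluster:", cluster.id)
```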

Page 17:

OpenStack GPUs

Page 18:

GPUs in OpenStack
● Ironic

  ○ Full access to hardware, including all GPUs and Networking

● Virtual Machine with PCI Passthrough
  ○ Share out a single physical machine
  ○ Flavors with one or multiple GPUs (see the flavor sketch below)
  ○ Some protection of GPU firmware
  ○ … restrictions around using data centre GPUs

● Virtual Machine with vGPU
  ○ Typical vGPU requires expensive licences
  ○ Time Slicing vGPU was created for VDI
  ○ Depends on whether your workloads can saturate a full GPU
  ○ … but the A100 includes MIG (multi-instance GPU)

● Some GPU features need RDMA networking
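For the PCI passthrough case above, GPU access is usually exposed through flavors. A hedged sketch, assuming a recent openstacksdk and a Nova PCI alias already configured by the operator in nova.conf; the alias name "p100" and the flavor sizes are placeholders.

```python
# Sketch: a flavor that requests one passed-through GPU via a Nova PCI alias.
import openstack

conn = openstack.connect(cloud="my-cloud")

flavor = conn.compute.create_flavor(
    name="gpu.1xP100", ram=65536, vcpus=8, disk=40)

# "pci_passthrough:alias" is the standard Nova extra spec for PCI passthrough;
# its value is "<alias-name>:<count>".
conn.compute.create_flavor_extra_specs(
    flavor, extra_specs={"pci_passthrough:alias": "p100:1"})
```

Because only flavors carry the GPU request, access can be controlled by making such flavors visible to selected projects, which is the quota workaround described on the next slide.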

Page 19:

GPU Resource Management
● GPUs are expensive and need careful management to get a good ROI
● Batch Queue System, e.g. Slurm

  ○ Sharing via batch jobs can be very efficient
  ○ … but not great for dynamic training and online inference

● OpenStack Quotas
  ○ Today there is no real support for GPU quotas
  ○ … but flavors that request GPUs can be limited to projects
  ○ and projects can be limited to specific subsets of hardware

● Reservations and Pre-emptibles
  ○ Scientific OpenStack is looking at OpenStack Blazar (see the sketch below)
  ○ Projects can reserve resources ahead of time
  ○ Option to use pre-emptibles to scale out when resources are free
  ○ … to stop people hanging on to GPUs and not using them

● Get in touch if you are interested in shaping this work

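To give a flavor of the reservation workflow, here is a hedged sketch using python-blazarclient. The client setup, lease fields and the resource_properties filter follow upstream examples but should be treated as assumptions, and all names, dates and properties are placeholders.

```python
# Sketch: reserve GPU hosts for a training run ahead of time with Blazar.
import openstack
from blazarclient import client as blazar_client

conn = openstack.connect(cloud="my-cloud")
blazar = blazar_client.Client(session=conn.session, service_type="reservation")

lease = blazar.lease.create(
    name="gpu-training-run",
    start="2021-03-01 09:00",
    end="2021-03-01 18:00",
    reservations=[{
        "resource_type": "physical:host",
        "min": 1,
        "max": 2,
        "hypervisor_properties": "",
        # Hypothetical host property used to select GPU nodes.
        "resource_properties": '["==", "$gpu_model", "P100"]',
    }],
    events=[],
)
```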

Page 20:

OpenStack and RDMA (Remote Direct Memory Access)

Page 21:

RDMA with OpenStack
● Ethernet with RoCEv2, also InfiniBand

  ○ Ethernet switches can use ECN and PFC to reduce packet loss, along with a larger MTU

● SR-IOV (port creation sketched below)
  ○ Hardware physical function (PF) mapped to multiple virtual functions (VFs)
  ○ Hardware configured to limit traffic to VLANs (and sometimes overlays)
  ○ … typically no security groups; bonding possible with some NICs; QoS possible
  ○ https://www.stackhpc.com/vxlan-ovs-bandwidth.html and https://www.stackhpc.com/sriov-kayobe.html

● Virtual Machine runs drivers for the specific NIC
  ○ … ignoring mdev for now

● Live-migration with SR-IOV
  ○ Makes use of hot unplug then hot plug
  ○ A bond with a virtual NIC is possible, but breaks RDMA
  ○ … in future mdev may help, but not today
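Putting the SR-IOV pieces together, the hedged sketch below creates a Neutron port with vnic_type=direct, which asks Neutron for a VF rather than a virtio port, and boots a VM on it with openstacksdk. The network, flavor and image names are placeholders.

```python
# Sketch: attach a VM to an RDMA-capable network via an SR-IOV virtual function.
import openstack

conn = openstack.connect(cloud="my-cloud")

net = conn.network.find_network("roce-net")  # provider VLAN behind SR-IOV NICs
port = conn.network.create_port(
    network_id=net.id,
    binding_vnic_type="direct",  # request a VF instead of a virtio port
)
server = conn.compute.create_server(
    name="rdma-node",
    flavor_id=conn.compute.find_flavor("gpu.1xP100").id,
    image_id=conn.image.find_image("ubuntu-20.04-mofed").id,  # image with NIC drivers
    networks=[{"port": port.id}],
)
```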

Page 22:

Kubernetes RDMA with OpenStack
● Some Pods need RDMA-enabled networking
● Kubernetes in VM with VF passthrough

  ○ OpenStack controls network isolation
  ○ Pods forced to use host networking to get RDMA (see the Pod sketch below)
  ○ … not so bad if one Pod per VM, with the Magnum cluster autoscaler

● Kubernetes in VM with PF passthrough
  ○ Kubelet manages virtual function passthrough to pods
  ○ OpenStack maps devices to physical networks
  ○ Switch port could be configured out of band to restrict allowed VLANs
  ○ A NIC that is passed through typically can’t be used by the host or any other VM

● Kubernetes deployed on an Ironic server
  ○ Similar to PF passthrough
  ○ … but Neutron could orchestrate the switch port configuration
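For the "VF passthrough, host networking" pattern above, a Pod can opt into the node's network namespace so it sees the RDMA NIC directly. A minimal sketch with the official kubernetes Python client; the container image and kubeconfig are placeholder assumptions.

```python
# Sketch: a Pod using host networking so it can reach the VM's RDMA NIC.
from kubernetes import client, config

config.load_kube_config()  # e.g. the kubeconfig produced by Magnum

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rdma-worker"),
    spec=client.V1PodSpec(
        host_network=True,  # share the node's network namespace, exposing the VF
        containers=[client.V1Container(
            name="worker",
            image="horovod/horovod:latest",  # placeholder image
            command=["sleep", "infinity"],
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

With one such Pod per VM, and the Magnum cluster autoscaler adding VMs on demand, host networking is much less of a compromise.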

Page 23:

RDMA Remote Storage
● OpenStack supports fast local storage

  ○ The ratio to CPU and RAM is typically fixed, but workload needs vary
  ○ Remote storage, like Ceph RBD, can have very high latency

● OpenStack supports provider VLANs
  ○ Can be shared with a select group of projects (see the RBAC sketch below)
  ○ You can have a Neutron router onto the network, if required

● Shared Storage can be a workload in OpenStack
  ○ Examples: Lustre, BeeGFS
  ○ Run on baremetal or VMs with RDMA enabled
  ○ Provide a shared file system to stage data into

● External appliances can be accessed via provider VLANs
  ○ Some storage can integrate with OpenStack Manila for Filesystem as a Service
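The "shared with a select group of projects" bullet maps to a Neutron RBAC policy on a provider VLAN. A hedged openstacksdk sketch; the physnet, VLAN ID, CIDR and project ID are placeholders.

```python
# Sketch: expose a storage provider VLAN to one project via Neutron RBAC.
import openstack

conn = openstack.connect(cloud="my-cloud")

storage_net = conn.network.create_network(
    name="lustre-net",
    provider_network_type="vlan",
    provider_physical_network="physnet1",
    provider_segmentation_id=1234,
)
conn.network.create_subnet(
    network_id=storage_net.id, ip_version=4, cidr="10.20.0.0/24")

# Grant a single project access, rather than marking the network shared.
conn.network.create_rbac_policy(
    object_type="network",
    object_id=storage_net.id,
    action="access_as_shared",
    target_tenant="PROJECT_ID",  # placeholder project ID
)
```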

Page 24:

Example workload: Monitoring Slurm

Page 25:
Page 26:
Page 27:
Page 28:

Example workload: Horovod Benchmarks

Page 29:

● Distributed deep learning framework
● Supported by the LF AI & Data Foundation
● https://github.com/horovod/horovod

● P3 AlaSKA
  ○ TCP 10GbE
  ○ RoCE 25GbE
  ○ InfiniBand 100Gb
  ○ 2 GPU nodes, each with 4 x P100 GPUs

● ResNet-50 Benchmark
  ○ All tests use 8 P100 GPUs
  ○ On baremetal, using OpenStack Ironic
  ○ Horovod on K8s with TensorFlow and Open MPI (see the training sketch below)
  ○ Note: higher is better

Horovod on P3 AlaSKA

Page 30:

Horovod

https://github.com/horovod/horovod/blob/master/docs/benchmarks.rst
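For context on what these benchmarks actually run, here is a minimal Horovod + TensorFlow sketch of the distributed data-parallel pattern; the model and data below are toy stand-ins, not the benchmark code itself.

```python
# Minimal Horovod + TensorFlow sketch: one process per GPU.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to one local GPU (e.g. 4 per node on P3 AlaSKA).
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
# Scale the learning rate by the worker count and wrap the optimizer so
# gradients are averaged across all workers (over MPI/NCCL).
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.001 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

images = tf.random.uniform([32, 224, 224, 3])
labels = tf.random.uniform([32], maxval=1000, dtype=tf.int64)
model.fit(images, labels, epochs=1,
          # Sync initial weights from rank 0 so all workers start identical.
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)
```

Launched with something like horovodrun -np 8 -H node1:4,node2:4 python train.py; the allreduce traffic is what makes the RoCE and InfiniBand interconnects matter in the results above.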

Page 31:

Example hardware: NVIDIA DGX A100

Page 32:

NVIDIA DGX A100
● 200Gb/s ConnectX-6 for each GPU
● Local NVMe to cache training data
● NVIDIA NVLink
  ○ DGX A100 has 6 NVIDIA NVSwitch fabrics
  ○ Each A100 GPU uses twelve NVLink interconnects, two to each NVSwitch
  ○ GPU-to-GPU communication at 600 GB/s
  ○ All-to-all peak of 4.8 TB/s in both directions
● A100 has Multi-Instance GPU (MIG)
  ○ Up to 7 MIG instances per A100
  ○ Each MIG GPU instance has its own memory, cache, and streaming multiprocessors
  ○ Multiple users can share the same GPU and run all instances simultaneously, maximizing GPU efficiency

https://developer.nvidia.com/blog/defining-ai-innovation-with-dgx-a100/

Page 33:

Next Steps with Scientific OpenStack

Page 34:

Scientific OpenStack Digital Assets
● Existing assets

  ○ Reference OpenStack architecture, configuration and operational tooling
  ○ Reference platforms and workloads, such as:

■ https://github.com/RSE-Cambridge/iris-magnum

● Edinburgh Institute of Astronomy
  ○ 2nd IRIS site starting to adopt Scientific OpenStack Digital Assets

● SKA are funding the move of P3 AlaSKA into IRIS
● Updates in March 2021 are due to include:

  ○ GPU and SR-IOV best practice guides, using P3 AlaSKA hardware
  ○ Magnum updated to support Kubernetes v1.20
  ○ Improved Resource Management via Blazar
  ○ Prometheus-based Utilization Monitoring
  ○ Assessment of porting JASMIN Cluster as a Service to IRIS

Page 35:

Summary of ML on OpenStack
● ML represents a diverse set of workloads

  ○ … with a correspondingly diverse set of Infrastructure needs
  ○ Latency-sensitive Online inference vs Offline training, Regression vs Classification, etc.
  ○ Wide variety of model complexity and size of data inputs

● Broad ecosystem of tools and platforms
  ○ Many assume Kubernetes is available for deployment
  ○ OpenStack can provide the resources needed, Kubernetes or not

● Challenging to use resources efficiently
  ○ No generic “best fit” mix of Compute, Networking and Storage
  ○ GPUs, Tensor cores, IPUs and FPGAs can be more efficient than CPUs
  ○ Storage and Networking need to keep processing fed
  ○ Demand for Online Inference (and training) can be hard to predict

Page 36:

Questions?