PARAVIRTUAL RDMA DEVICE

OpenFabrics Alliance, 12th ANNUAL WORKSHOP 2016

Aditya Sarwade, Adit Ranadive, Jorgen Hansen, Bhavesh Davda, George Zhang, Shelley Gong

[ April 5th, 2016 ] VMware, Inc.



OpenFabrics Alliance Workshop 2016

MOTIVATION

[Figure: side-by-side stacks for a Socket App and an RDMA App across User and Kernel space. Socket path: Sockets API, Sockets, TCP, IPv4/IPv6, Network Device, Device Driver, with Buffers and Buffer Headers, to an Ethernet Switch. RDMA path: RDMA Verbs API with Kernel Bypass to the Host Channel Adapter, to an InfiniBand Switch or Ethernet Switch. Transports: InfiniBand, iWARP (Internet Wide Area RDMA Protocol), RoCE (RDMA over Converged Ethernet).]

RDMA Enables
§ OS bypass
§ Zero-copy
§ Low latency (<1µs)
§ High bandwidth

Why not PCI Passthrough?
§ No live migration support
§ Transport dependent
§ Needs an HCA
§ Cannot share non-SRIOV HCA


INTRODUCTION

• Paravirtual RDMA (PVRDMA) is a new PCIe virtual NIC
• Supports standard Verbs API
• Uses HCA for performance, but works without it
• Multiple virtual devices can share an HCA without SR-IOV
• Supports vMotion (live migration)!


ARCHITECTURE

• Exposes a dual function PCIe device to the guest
  • VMXNET3
  • RDMA (RoCE)
• RDMA component reuses Ethernet properties from the paired NIC
• Plugs into the OFED stack in the VM
• Provides verbs-level emulation
• Guest kernel driver
• User level library
• Operates over ESX RDMA stack (VMkernel)
• GIDs generated by guest kernel registered with HCA

[Figure: Guest OS 1 stack. Labels: RDMA App, Buffers, libvrdma, libibverbs, PVRDMA Driver, PVRDMA NIC, PVRDMA Device Emulation.]


ARCHITECTURE (CONT.)

• Virtualize some hardware resources (like QPs and MRs)
  • Required for vMotion
  • Create corresponding physical resources on the HCA

[Figure: a vQP (SQ, RQ, holding WQEs) and a vCQ (holding CQEs) sit above the Emulation layer; a physical QP (SQ, RQ) and CQ sit on the HCA, which connects to the Network. Post: Virtual MR/QPs -> Physical. Poll: Physical QPs -> Virtual.]
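The two translation directions on this slide can be pictured with a small sketch. This is only an illustration, not the PVRDMA implementation: the class `EmulationLayer`, its method names, and the QPN values are hypothetical, and WQEs/CQEs are represented as plain strings.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class EmulationLayer:
    """Toy model: the guest sees stable virtual QPNs, the HCA assigns physical ones."""
    _next_vqpn: count = field(default_factory=lambda: count(1))
    v2p: dict = field(default_factory=dict)  # virtual QPN -> physical QPN
    p2v: dict = field(default_factory=dict)  # physical QPN -> virtual QPN

    def create_qp(self, physical_qpn: int) -> int:
        """Record a physical QP created on the HCA and return the guest-visible QPN."""
        vqpn = next(self._next_vqpn)
        self.v2p[vqpn] = physical_qpn
        self.p2v[physical_qpn] = vqpn
        return vqpn

    def post(self, vqpn: int, wqe: str) -> tuple:
        """Post path: translate virtual -> physical before the WQE reaches the HCA."""
        return (self.v2p[vqpn], wqe)

    def poll(self, physical_qpn: int, cqe: str) -> tuple:
        """Poll path: translate physical -> virtual before the CQE reaches the guest."""
        return (self.p2v[physical_qpn], cqe)
```

Because the guest only ever holds the virtual QPN, the physical entry in `v2p` could be swapped without the guest noticing, which is the property the slide ties to vMotion.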


ARCHITECTURE (CONT.)

• Guest MR registered directly with the HCA
  • Guest PA converted to machine addresses
  • Zero-copy

[Figure: memory registration flow. Stages: Application buffer; Guest VA, length; Guest PA list; MA (host PA) list; Guest VA -> MA list. Layers: Guest userspace, Guest kernel, Device emulation, HCA.]
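The address conversion chain (guest VA and length, then guest PA list, then MA list) can be sketched as follows. This is a simplified model under stated assumptions: a fixed 4 KiB page size, page tables represented as plain dictionaries, and hypothetical function names `pages_spanned` and `register_mr`; none of these come from the slides.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def pages_spanned(guest_va: int, length: int) -> list:
    """Return the guest virtual page numbers covered by [guest_va, guest_va + length)."""
    first = guest_va // PAGE_SIZE
    last = (guest_va + length - 1) // PAGE_SIZE
    return list(range(first, last + 1))


def register_mr(guest_va, length, guest_page_table, host_page_table):
    """Build the machine-address (MA) list for a guest buffer.

    guest_page_table: guest virtual page number -> guest physical page number
    host_page_table:  guest physical page number -> machine page number
    Both are plain dicts standing in for real page tables.
    """
    # Guest kernel step: guest VA range -> guest PA list (one entry per page).
    guest_pa_list = [guest_page_table[vpn] * PAGE_SIZE
                     for vpn in pages_spanned(guest_va, length)]
    # Device emulation step: guest PA list -> MA (host PA) list.
    ma_list = [host_page_table[pa // PAGE_SIZE] * PAGE_SIZE
               for pa in guest_pa_list]
    # What the HCA would be given: guest VA, length, and the MA list.
    return {"guest_va": guest_va, "length": length, "ma_list": ma_list}
```

The buffer contents are never touched in this flow; only page addresses are translated, which is consistent with the zero-copy bullet.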


CONTROL AND DATA PATH

[Figure: two guests (Guest OS 1 and Guest OS 2), each with RDMA App, Buffers, libvrdma, libibverbs, PVRDMA Driver and PVRDMA NIC, above PVRDMA Device Emulation, ESXi RDMA Stack, HCA Device Driver and HCA, connected over RoCE (RDMA over Converged Ethernet). Legend: Control Path, Data Path.]


RDMA TRANSPORT SELECTION

• PVRDMA Transport Selection
  • Memcpy – RDMA between peers on same host
  • TCP – RDMA between peers without HCAs (slow path)
  • RDMA – Fast Path RDMA between peers with HCAs
• PVRDMA vMotion
  • Leverage transport selection to support vMotion of RDMA VMs

[Figure: a vSphere Distributed Switch spanning ESX Host 1, ESX Host 2 and ESX Host 3. Labels: HCA, HCA (RDMA); NIC (TCP); Memcpy.]
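The selection rule listed above can be written as a short decision function. This is a sketch only: the names `Transport` and `select_transport` are hypothetical, and the choice to fall back to TCP when just one peer has an HCA is an assumption, since the slide only covers the cases where both peers or neither peer has one.

```python
from enum import Enum


class Transport(Enum):
    MEMCPY = "memcpy"  # peers on the same host
    TCP = "tcp"        # slow path, no HCA available
    RDMA = "rdma"      # fast path over HCAs


def select_transport(local_host: str, peer_host: str,
                     local_has_hca: bool, peer_has_hca: bool) -> Transport:
    """Pick a transport for a pair of PVRDMA peers."""
    if local_host == peer_host:
        return Transport.MEMCPY
    if local_has_hca and peer_has_hca:
        return Transport.RDMA
    # Assumption: any pair lacking an HCA on either side uses the TCP slow path.
    return Transport.TCP
```

Re-running the same function after a VM lands on a different host gives the new transport, which is one way to read the "leverage transport selection" bullet.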


vMOTION

§ Challenge:
  • Lots of RDMA state within hardware
  • Physical resource IDs (like QPNs/MR keys) may change after migration
  • Peers will not be aware of the new IDs
  • Currently, no support to create resources with specified IDs


vMOTION

§ Current (partial) solution:
  • Emulation layer can get virtual to physical translations from peer
  • Notify peer about vMotion and pause QP/CQ processing
  • After vMotion resume QPs with the new translations
  • Invisible to guest
  • Can only work when both endpoints are VMs
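The notify, pause and resume sequence can be sketched from the peer's side. This is a toy model, not the actual protocol: the class `PeerQP`, its method names, and the idea of queueing work requests while paused are illustrative assumptions.

```python
class PeerQP:
    """Hypothetical peer-side view of a connection to a QP that may migrate."""

    def __init__(self, remote_vqpn: int, remote_physical_qpn: int):
        self.remote_vqpn = remote_vqpn                  # stable, guest-visible ID
        self.remote_physical_qpn = remote_physical_qpn  # may change after vMotion
        self.paused = False
        self.pending = []  # work requests held back while paused
        self.sent = []     # (physical QPN, work request) pairs handed to the HCA

    def post_send(self, wr: str) -> None:
        """Send immediately, or hold the request if the remote side is migrating."""
        if self.paused:
            self.pending.append(wr)
        else:
            self.sent.append((self.remote_physical_qpn, wr))

    def on_vmotion_start(self) -> None:
        """Peer was notified about vMotion: pause QP processing."""
        self.paused = True

    def on_vmotion_done(self, new_physical_qpn: int) -> None:
        """Install the new translation, resume, and flush held requests."""
        self.remote_physical_qpn = new_physical_qpn
        self.paused = False
        pending, self.pending = self.pending, []
        for wr in pending:
            self.post_send(wr)
```

In this model `remote_vqpn` never changes, so nothing above the emulation layer needs to know that the physical QPN moved.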


vMOTION (FUTURE WORK)

• Support vMotion when one of the endpoints is native (non-VM)
  • Need hardware support
  • Recreate specific QPNs and MR keys
  • Ability to pause and resume QP state on the hardware
  • Save/Restore intermediate QP states
• Provide isolated resource space to each PVRDMA device
  • Guarantee that specified resources can be recreated
  • Avoid collisions with existing resources
• Expose hardware resources directly to guest
  • Lower virtualization overhead


PERFORMANCE

§ Testbed
  • 2 x Dell T320 Hosts, E5-2440 @ 2.40GHz, 24 GiB, Mellanox ConnectX-3
  • VMs: Ubuntu 12.04, 3.5.0.45, x86_64, 2 vCPUs, 2 GiB
  • OFED Send Latency Test
  • Half RTT for 10K iterations


CURRENT LIMITATIONS

• Communication between VM and native endpoints not supported
  • Need a way to create resources with specified IDs
  • May need additional hardware support from vendors
  • Formalize vMotion support on hardware
• Currently only supports RoCEv1 in the guest
  • Can still operate over underlying RoCEv2-only HCA
  • No InfiniBand/iWARP support (future work)
• No remote READ/WRITE support on DMA MRs
• No SRQ/Atomics support yet
  • SRQs not currently supported on host ESX
• Only supports Linux guests currently
• No failover support for PVRDMA


THANK YOU

Aditya Sarwade

VMware, Inc.