VMworld 2013: ESXi Native Networking Driver Model - Delivering on Simplicity and Performance

Margaret Petrus, VMware – TEX4759


DESCRIPTION

VMworld 2013 session by Margaret Petrus, VMware. Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare

TRANSCRIPT

Page 1: ESXi Native Networking Driver Model - Delivering on Simplicity and Performance

Margaret Petrus, VMware

TEX4759

Page 2: Disclaimer

This presentation may contain product features that are currently under development.

This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery.

Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Page 3: Key Takeaways

1. The benefits of moving to the native driver model, with an overview of the different layers.
2. A jumpstart to building your own native driver.
3. The significant CPU savings achieved with the native model, while retaining simplicity and supportability.

Page 4: Agenda

Overview of Native Model

Module Components and Interactions

Native Network Driver Deep Dive

Building Your Driver in the Native Model

Advanced Features

Performance

Summary

Page 5: Overview of Native Model

Page 6: Why Native Driver Model?

Foundation to build new extensible features for the ESXi hypervisor.

The increasing number of VMs in growing cloud deployments demands:
• Device driver robustness
• Best performance
• Better supportability, manageability, and debuggability

Provides long-term binary compatibility support.

Better flexibility and support for releasing new features in the networking and storage areas, etc.

Page 7: High-level Native Driver Model Overview

[Architecture diagram: inside the vmkernel, the Device Manager and the Device Layer sit between the I/O subsystems and the drivers; the legend distinguishes physical devices, logical devices, device and driver objects, and their relationships.]

Page 8: Module Components and Interactions

Page 9: Quick Comparison with VMKLNX Model

[Comparison diagram. Emulated Linux driver model: VM I/O passes through the I/O subsystems to vmkplexer and vmklinux, which host a Linux driver inside the vmkernel. Native driver model: the I/O subsystems talk to the Device Layer and Device Manager (Dev Mgr), which bind native and ESXi drivers directly inside the vmkernel.]

Page 10: Layer Interactions in Native Model vs. vmklinux Model

[Layer diagram: at user level, vmkdevmgr works with vmkctl and driver.map; in the kernel, the device layer (PCI, ACPI) binds a PCI native driver directly to the IO subsystems (scsi, net), whereas in the vmklinux model a vmklnx_driver is attached through the vmklinux layer.]

Page 11: Native Network Driver Deep Dive

Page 12: High Level Native Networking Driver Model (using elxnet)

elxnet – Emulex Native Driver for BE3 Devices

[Diagram: at user level, vmkdevmgr and vmkctl (with elxnet_devices.py); in the kernel, the device layer (PCI, ACPI) binds elxnet, which plugs into the Uplink Module and the IO subsystems (scsi, net).]

Page 13: Native Networking Driver Module Interactions

Module layer – Register/unregister the driver with the module layer interface:
• init_module()
• cleanup_module()

Device Driver layer – Register with the device driver interface:
• Provide vmk_DriverProps and vmk_DriverOps
• Callbacks for DriverAttachDevice(), DriverDetachDevice(), DriverScanDevice(), DriverForgetDevice(), DriverStartDevice(), DriverQuiesceDevice()

PCI layer – Needed for PCI config access, BAR mapping, SR-IOV, etc.:
• vmk_PCIReadConfig(), vmk_PCIWriteConfig()
• vmk_PCIMapIOResource(), vmk_PCIUnmapIOResource()

Uplink layer – Provides access to the networking stack:
• The driver interacts with the uplink directly for all operations
• Uplink registration results in logical child (vmnicX) creation
• Register networking HW capabilities and provide the appropriate callbacks

Management CLI – Supported only via esxcli, not ethtool!

Page 14: Module Layer

Page 15: Module Layer: init_module()

Key steps:
1. Register the module with the vmkernel via vmk_ModuleRegister().
2. Initialize the driver name via vmk_NameInitialize().
3. Create a heap via vmk_HeapCreate() and a memory pool via vmk_MemPoolCreate().
4. Register for driver logging via vmk_LogRegister().
5. Create a lock domain for the module via vmk_LockDomainCreate().
6. Register the driver with the driver database via vmk_DriverRegister(). This is where you register the driver properties, i.e. the device layer callback handlers.

static vmk_DriverOps elxnetDrvOps = {
   .attachDevice  = elxnet_attachDevice,
   .detachDevice  = elxnet_detachDevice,
   .scanDevice    = elxnet_scanDevice,
   .startDevice   = elxnet_startDevice,
   .quiesceDevice = elxnet_quiesceDevice,
   .forgetDevice  = elxnet_forgetDevice,
};

static vmk_DriverProps elxnetDrvProps = {
   .ops = &elxnetDrvOps,
};
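A minimal sketch of how step 6 might look with the ops table above; the exact vmk_DriverRegister() argument list is an assumption (the authoritative prototype is in the Native DDK headers), and status/elxnetDriver are hypothetical module-level variables:

   /* Hand the driver properties to the driver database (illustrative only). */
   status = vmk_DriverRegister(&elxnetDrvProps, &elxnetDriver);
   if (status != VMK_OK) {
      /* undo the earlier init_module() steps before failing module load */
   }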

Page 16: Module Layer: cleanup_module()

cleanup_module() performs the init_module() steps in the reverse order:

1. Unregister driver via vmk_DriverUnregister().

2. Destroy created lock domain via vmk_LockDomainDestroy().

3. Unregister driver log via vmk_LogUnregister().

4. Destroy heap via vmk_HeapDestroy().

5. Destroy memory pool via vmk_MemPoolDestroy().

6. Unregister module via vmk_ModuleUnregister().

Page 17: Device Layer

Page 18: How does Native Driver claim its devices?

1. The PCI bus driver scans the PCI bus, detects PCI NICs, and produces a PCI NIC device object.
2. The Device Layer notifies the Device Manager of the device's existence. The Device Manager consults the PCI bus plugin to locate the driver.
3. The NIC driver registers with the Device Layer, providing callbacks to claim the PCI NIC device object.
4. The Device Manager binds the NIC driver module to the PCI NIC device object.
5. The Device Layer calls the NIC driver's AttachDevice callback:
• The NIC driver claims the PCI NIC device object
• The NIC driver initializes the hardware
6. The Device Layer calls the NIC driver's StartDevice callback:
• The NIC driver leaves the quiesced state
7. The Device Layer calls the NIC driver's ScanDevice callback: the NIC driver produces the logical uplink device object.
8. The Device Layer notifies the Device Manager of the logical device's existence:
• The Device Manager consults the logical bus plugin to locate the driver
• The Device Manager binds the uplink device to the uplink driver
• The attach, start, and scan callbacks are invoked for the uplink device
9. The NIC driver registers uplink capabilities in the uplink registration callback.
10. The NIC driver can start RX and the networking subsystem can start TX on this NIC.

Page 19: Flow to claim NIC and make it IO-able

[Sequence diagram between the Device Layer, the NIC driver, and the networking subsystem:
1. The Device Layer invokes vmk_DriverAttachDevice(vmk_PCIDevice) and then vmk_DriverStartDevice(); the HW is now initialized for IO.
2. The Device Layer invokes vmk_DriverScanDevice(); the driver calls vmk_DeviceRegister(vmk_DeviceProps, vmkDev, &uplinkDev) to create and register the uplink device (uplinkDev).
3. vmk_UplinkAssociate() asynchronously notifies the driver of the uplink for the device, and the driver calls vmk_UplinkCapRegister() to register each capability.
4. The networking subsystem invokes vmk_UplinkStartIO(); the driver (1) arms interrupts in the HW, (2) enables interrupts in the vmkernel, and (3) updates the uplink link status. The uplink is now ready for Tx/Rx processing.]

Page 20: Device Layer: DriverAttachDevice()

The attachDevice callback registered in vmk_DriverRegister() is invoked.
• The driver should start driving this device and get it ready for IO.
• If the driver cannot drive the device, it should return an error and restore the device to its original state.

What is done in this routine?
1. Allocate memory for driver data structures.
2. Invoke vmk_DeviceGetRegistrationData() to get the PCI device handle.
3. Invoke vmk_PCIQueryDeviceID() to validate that the driver can support this device.
4. Invoke vmk_PCIQueryDeviceAddr() to get the PCI device address.
5. Create the DMA engine via vmk_DMAEngineCreate() with the right properties.
6. Map the BARs via vmk_PCIMapIOResource() calls.
7. Initialize the HW and ensure that it comes up fine, else error out.
8. Set up stats collection and other driver-specific state.
9. Allocate interrupt vectors via vmk_PCIAllocIntrCookie() (with typeVec, numVec).
10. Create the UplinkData – fill in the registration data ops and sharedData fields.
11. Do other controller setup and any other needed configuration.
12. Call vmk_DeviceSetAttachedDriverData() to associate the drvPrivDataPtr with the vmk_Device handle.

Page 21: Device Layer: DriverStartDevice()

Callback invoked after a successful attachDevice: the device is not ready, i.e. not in an IO-able state, until this callback completes.
Puts the device in an IO-able state.
Can be invoked to place a device back in an IO-able state any time after vmk_DriverQuiesceDevice() has explicitly put the device in the quiesced state.

What it does:
1. Get the drvPrivDataPtr using vmk_DeviceGetAttachedDriverData().
2. Post Rx fragments for all the Rx queues it supports.
3. Register the interrupts allocated during uplink shared data creation:
• Register interrupts via vmk_IntrRegister().
• Set affinity via vmk_NetPollInterruptSet().
4. Create any worker threads as worlds via vmk_WorldCreate().

Page 22: Device Layer: DriverScanDevice()

Invoked at least once after a device has been attached to a driver.

May be invoked at other device hotplug events as appropriate.

New devices may be registered only from this callback.

Main steps:
1. Find the bus type of the PCI device via vmk_BusTypeFind().
2. Create the logical address via vmk_LogicalCreateBusAddress().
3. Register the device with the vmkernel via vmk_DeviceRegister(), passing in the vmk_DeviceProps structure (a hypothetical example follows the struct below).

typedef struct {
   vmk_Driver registeringDriver;
   vmk_DeviceID *deviceID;                // uses VMK_UPLINK_DEVICE_IDENTIFIER
   vmk_DeviceOps *deviceOps;              // has callback .removeDevice
   vmk_AddrCookie registeringDriverData;  // holds drvPrivDataPtr
   vmk_AddrCookie registrationData;
} vmk_DeviceProps;
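As referenced above, a hypothetical ScanDevice fragment might populate vmk_DeviceProps before calling vmk_DeviceRegister(). The names mynicDriver, mynicUplinkDevOps, uplinkDeviceID, and adapter are illustrative (not taken from the elxnet source), and the .ptr accessor of vmk_AddrCookie is an assumption:

   vmk_DeviceProps deviceProps;

   deviceProps.registeringDriver         = mynicDriver;        /* handle returned by vmk_DriverRegister() */
   deviceProps.deviceID                  = &uplinkDeviceID;    /* built with VMK_UPLINK_DEVICE_IDENTIFIER */
   deviceProps.deviceOps                 = &mynicUplinkDevOps; /* supplies the .removeDevice callback     */
   deviceProps.registeringDriverData.ptr = adapter;            /* the drvPrivDataPtr                      */
   deviceProps.registrationData.ptr      = &adapter->regData;  /* the vmk_UplinkRegData for the uplink    */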

Page 23: Device Layer: DriverForgetDevice()

A notification callback from the vmkernel to indicate that the device is no longer accessible.
The driver should no longer wait indefinitely on any device operation.
It must always return success for any subsequent device callbacks:
• vmk_DriverQuiesceDevice()
• vmk_DriverDetachDevice()
This is a case-specific callback: it is invoked only on surprise removal and is not always called.

Page 24: Device Layer: DriverQuiesceDevice()

This callback places the device in the quiesced state to prepare for operations like device removal, driver unload, or system shutdown.
The callback indicates that the driver should:
• Complete any IO on the device
• Flush any device caches to quiesce the device

Steps (reverse of StartDevice):
1. Get the drvPrivDataPtr via vmk_DeviceGetAttachedDriverData().
2. Halt and destroy any worker threads created during StartDevice.
3. Handle all Tx completions.
4. Clean up all Rx queues.
5. Unregister interrupts for all Rx queues:
• Invoke vmk_NetPollInterruptUnSet() to remove affinity.
• Invoke vmk_IntrUnregister() to unregister the previously registered interrupt.

Page 25: Device Layer: DriverDetachDevice()

This is another handler passed in during the vmk_DriverRegister() call.
• The driver should stop driving this device and release its resources.
• The driver should not touch the device after this.

Steps:
1. Get the drvPrivDataPtr via vmk_DeviceGetAttachedDriverData().
2. Clean up all the resources allocated for your interface:
• Destroy any queues allocated
• Notify the HW that you are stopping all access
3. Clean up the UplinkData created and set up in DriverAttachDevice().
4. Release all interrupt vectors via vmk_PCIFreeIntrCookie().
5. Clean up any memory allocated for driver structures from the memory pool or heap.
6. Do any other control path cleanup, e.g. destroy spinlocks or semaphores.
7. Unmap the BARs via vmk_PCIUnmapIOResource().
8. Destroy the created DMA engine via vmk_DMAEngineDestroy().
9. Free up and clean out any other allocated resources.

Page 26: Logical Uplink Layer

Page 27: Uplink Layer Major Data Structures

vmk_UplinkRegData – uplink registration data
• The driver is responsible for allocating and populating this structure
• A pointer to this struct is stored in vmk_DeviceProps->registrationData

vmk_UplinkOps – handlers for basic uplink operations

vmk_UplinkSharedData – data shared between the uplink layer and the NIC driver
• Allocated and initialized by the driver
• Driver readable and writable
• Uplink layer readable only

vmk_UplinkSharedQueueInfo – shared info for all queues between the uplink layer and the driver

vmk_UplinkSharedQueueData – shared data for a single queue

Page 28: Uplink Layer: vmk_UplinkRegData

The driver associates the following registration data with the vmk_Device when creating the logical uplink (a hypothetical example follows the struct):

typedef struct vmk_UplinkRegData {
   vmk_revnum apiRevision;            // VMKAPI version
   vmk_ModuleID moduleID;             // module ID of NIC driver
   vmk_UplinkOps ops;
   vmk_UplinkSharedData *sharedData;  // runtime data shared between kernel & driver
   vmk_AddrCookie driverData;         // driver context data
} vmk_UplinkRegData;
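As referenced above, an illustrative fragment showing how a driver might fill in this structure. The mynic/adapter names are hypothetical, VMKAPI_REVISION is assumed to be the DDK's API revision macro, and the .ptr accessor of vmk_AddrCookie is an assumption:

   adapter->regData.apiRevision    = VMKAPI_REVISION;       /* VMKAPI version the driver was built against */
   adapter->regData.moduleID       = mynicModuleID;         /* saved when vmk_ModuleRegister() was called  */
   adapter->regData.ops            = mynicUplinkOps;        /* the vmk_UplinkOps table (see Page 29)       */
   adapter->regData.sharedData     = &adapter->sharedData;  /* driver-owned vmk_UplinkSharedData           */
   adapter->regData.driverData.ptr = adapter;               /* handed back to the driver in uplink CBs     */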

Page 29: Uplink Layer: vmk_UplinkOps

Structure containing function pointers for required driver operations.

The functions are callbacks from the vmkernel into the NIC driver.

typedef struct vmk_UplinkOps {
   vmk_UplinkTxCB uplinkTx;                      // Tx packet list CB
   vmk_UplinkMTUSetCB uplinkMTUSet;              // modify MTU CB
   vmk_UplinkStateSetCB uplinkStateSet;          // modify state CB
   vmk_UplinkStatsGetCB uplinkStatsGet;          // get stats CB
   vmk_UplinkAssociateCB uplinkAssociate;        // notify driver of associated uplink
   vmk_UplinkDisassociateCB uplinkDisassociate;  // notify driver of disassociated uplink
   vmk_UplinkCapEnableCB uplinkCapEnable;        // capability enable CB
   vmk_UplinkCapDisableCB uplinkCapDisable;      // capability disable CB
   vmk_UplinkStartIOCB uplinkStartIO;            // start IO CB
   vmk_UplinkQuiesceIOCB uplinkQuiesceIO;        // quiesce all IO
   vmk_UplinkResetCB uplinkReset;                // reset issued on uplink
} vmk_UplinkOps;
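For example (a sketch only, with hypothetical mynic_* handler names, not the elxnet implementation), a driver wires its own callbacks into this table and points vmk_UplinkRegData.ops at it:

   static vmk_UplinkOps mynicUplinkOps = {
      .uplinkTx           = mynic_uplinkTx,
      .uplinkMTUSet       = mynic_uplinkMTUSet,
      .uplinkStateSet     = mynic_uplinkStateSet,
      .uplinkStatsGet     = mynic_uplinkStatsGet,
      .uplinkAssociate    = mynic_uplinkAssociate,
      .uplinkDisassociate = mynic_uplinkDisassociate,
      .uplinkCapEnable    = mynic_uplinkCapEnable,
      .uplinkCapDisable   = mynic_uplinkCapDisable,
      .uplinkStartIO      = mynic_uplinkStartIO,
      .uplinkQuiesceIO    = mynic_uplinkQuiesceIO,
      .uplinkReset        = mynic_uplinkReset,
   };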

Page 30: Uplink Layer: vmk_UplinkSharedData

vmk_UplinkRegData->sharedData points to a driver-allocated data structure shared between the vmkernel and the NIC driver:

typedef struct vmk_UplinkSharedData {
   vmk_VersionedAtomic lock;                 // ensures snapshot consistency
   vmk_UplinkFlags flags;                    // uplink flags
   vmk_UplinkState state;                    // uplink state
   vmk_LinkStatus link;                      // uplink link status
   vmk_uint32 mtu;                           // uplink MTU
   vmk_EthAddress macAddr;                   // current logical MAC
   vmk_EthAddress hwMacAddr;                 // permanent HW MAC
   vmk_UplinkSupportedMode *supportedModes;
   vmk_uint32 supportedModesArraySz;
   vmk_UplinkDriverInfo driverInfo;          // driver info
   vmk_UplinkSharedQueueInfo *queueInfo;     // shared queue info
} vmk_UplinkSharedData;

Page 31: Uplink Layer: vmk_UplinkSharedQueueInfo

Defines the uplink-level shared queue info for all queues. For the queueData field, drivers need to populate at least one queue even if they do not support multiple queues (see the sketch after the struct below).

typedef struct vmk_UplinkSharedQueueInfo {
   vmk_UplinkQueueType supportedQueueTypes;
   vmk_UplinkQueueFilterClass supportedRxQueueFilterClasses;
   vmk_UplinkQueueID defaultRxQueueID;
   vmk_UplinkQueueID defaultTxQueueID;
   vmk_uint32 maxRxQueues;
   vmk_uint32 maxTxQueues;
   vmk_uint32 activeRxQueues;
   vmk_uint32 activeTxQueues;
   vmk_BitVector *activeQueues;
   vmk_uint32 maxTotalDeviceFilters;
   vmk_UplinkSharedQueueData *queueData;
} vmk_UplinkSharedQueueInfo;
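As referenced above, an illustrative single-queue setup; only fields from the struct above are used, while queueInfo and adapter->queueData are hypothetical driver-side storage:

   queueInfo->maxRxQueues    = 1;
   queueInfo->maxTxQueues    = 1;
   queueInfo->activeRxQueues = 0;                  /* queues are activated later      */
   queueInfo->activeTxQueues = 0;
   queueInfo->queueData      = adapter->queueData; /* array holding one Rx and one Tx
                                                      vmk_UplinkSharedQueueData entry */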

Page 32: Uplink Layer: vmk_UplinkSharedQueueData

Contains all the info about one specific Tx or Rx queue.
This struct is shared with the uplink layer.

typedef struct vmk_UplinkSharedQueueData {
   volatile vmk_UplinkQueueFlags flags;
   vmk_UplinkQueueType type;
   vmk_UplinkQueueID qid;
   volatile vmk_UplinkQueueState state;
   vmk_UplinkQueueFeature supportedFeatures;
   vmk_UplinkQueueFeature activeFeatures;
   vmk_uint32 maxFilters;
   vmk_uint32 activeFilters;
   vmk_NetPoll poll;                         // associated netPoll context
   vmk_DMAEngine dmaEngine;                  // associated DMA engine
   vmk_UplinkQueuePriority priority;         // Tx queue priority
   vmk_UplinkCoalesceParams coalesceParams;
} vmk_UplinkSharedQueueData;

Page 33: Creation of UplinkSharedData during DriverAttachDevice

Create/initialize the sharedData area:
• sharedData has a versioned atomic (not a spinlock)
• The uplink layer can only read from this area
• The driver can read from and write to this area
• The driver needs to define its own spinlock for writer serialization

Shared data:
1. Supported speed/duplex modes to be advertised to the uplink.
2. Current MTU setting, and link/speed/duplex states.
3. Queue info (numQ, supported queue types, supported filter classes).
4. Rx and Tx queue fields (flags, type, state, supportedFeatures, dmaEngine, maxFilters).
5. netPoll for each Rx queue via vmk_NetPollCreate().
6. Allocated default Rx and Tx queues (not yet activated).

Page 34: Uplink Layer: uplinkStartIO() Callback

1. Arm the interrupts (link, multiQ, etc.) in the HW.
2. Configure VLAN filtering as needed.
3. Change the internal driver state to IO-able.
4. Set the configured flow control.
5. Now enable interrupts in the vmkernel via vmk_IntrEnable().
6. Check for link status changes, update sharedData, and invoke vmk_UplinkUpdateLinkState() as needed.

Page 35: Uplink Layer: uplinkQuiesceIO() Callback

1. Check whether IO is already quiesced due to possible failures.
2. Disarm interrupts.
3. Disable netpoll via vmk_NetPollDisable() and vmk_NetPollFlushRx().
4. Mark the link state as down via vmk_UplinkUpdateLinkState().
5. Stop all Tx queues.
6. Sync all vectors via vmk_IntrSync().
7. Disable all vectors via vmk_IntrDisable().
8. Change the internal driver state to quiesced.

Page 36: Register NIC capabilities to Uplink Layer

Handled when uplinkAssociateCB() is invoked to associate the uplink with the device.
Call vmk_UplinkCapRegister() to register each capability.

Two capability types:
• No callbacks needed:
  VMK_UPLINK_CAP_IPV4_CSO
  VMK_UPLINK_CAP_VLAN_RX_STRIP
• Capabilities that require callbacks:
  VMK_UPLINK_CAP_MULTI_QUEUE
  VMK_UPLINK_CAP_COALESCE_PARAMS

Page 37: Examples of Capabilities with Callbacks

Callback ops for VMK_UPLINK_CAP_COALESCE_PARAMS:

typedef struct vmk_UplinkCoalesceParamsOps {
   vmk_UplinkCoalesceParamsGetCB getParams;
   vmk_UplinkCoalesceParamsSetCB setParams;
} vmk_UplinkCoalesceParamsOps;

Callback ops for VMK_UPLINK_CAP_PRIV_STATS:

typedef struct vmk_UplinkPrivStatsOps {
   vmk_UplinkPrivStatsLengthGetCB privStatsLengthGet;
   vmk_UplinkPrivStatsGetCB privStatsGet;
} vmk_UplinkPrivStatsOps;
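A hypothetical registration of a callback-backed capability from the uplinkAssociate handler. The mynic_* handlers are illustrative, and the way the ops table is passed as the third argument of vmk_UplinkCapRegister() is an assumption; the authoritative prototype is in the DDK headers:

   static vmk_UplinkCoalesceParamsOps mynicCoalesceOps = {
      .getParams = mynic_coalesceParamsGet,
      .setParams = mynic_coalesceParamsSet,
   };

   status = vmk_UplinkCapRegister(uplink, VMK_UPLINK_CAP_COALESCE_PARAMS,
                                  &mynicCoalesceOps);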

Page 38: Interrupt/Netpoll Handling

Registering interrupts: vmk_IntrProps is populated and passed to the vmkernel in DriverStartDevice().

Driver ack handler:
• Ack the interrupt to the HW if needed (INTx)
• Increment the interrupt counter

Driver ISR handler:
• Handle any queue notifications as needed
• Activate the netpoll for the particular queue via vmk_NetPollActivate()

Driver netpoll callback handler:
• Handle any Tx, Rx, or Ctrl events
• If there is work but the budget is exceeded, remain in poll mode and return VMK_TRUE
• If there is no more work, go back to interrupt mode and return VMK_FALSE

typedef struct vmk_IntrProps {
   vmk_Device device;
   vmk_Name deviceName;
   vmk_IntrAcknowledge acknowledgeInterrupt;  // driver ack handler
   vmk_IntrHandler handler;                   // driver ISR handler
   void *handlerData;
   vmk_uint64 attrs;
} vmk_IntrProps;
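An illustrative fragment showing how these properties might be filled in before vmk_IntrRegister() during DriverStartDevice(); adapter, q, and the mynic_* handlers are hypothetical:

   vmk_IntrProps intrProps;

   intrProps.device               = adapter->device;      /* the vmk_Device being started      */
   intrProps.acknowledgeInterrupt = mynic_intrAck;        /* ack to HW if needed, bump counter */
   intrProps.handler              = mynic_intrHandler;    /* schedule netpoll for the queue    */
   intrProps.handlerData          = &adapter->rxQueue[q]; /* per-queue context                 */
   intrProps.attrs                = 0;
   /* deviceName would also be initialized, e.g. via vmk_NameInitialize(). */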

Page 39: Packet Management VMKAPIs in the Tx/Rx Path

Basic allocation, release, and field manipulation:
• vmk_PktAlloc()
• vmk_PktRelease()
• vmk_PktReleasePanic()
• vmk_PktFrameLenGet()
• vmk_PktFrameLenSet()
• vmk_PktTrim()
• vmk_PktPartialCopy()

SG handling:
• vmk_PktSgArrayGet()
• vmk_PktSgElemGet()
• vmk_PktFrameMappedPointerGet()
• vmk_PktIsBufDescWritable()

Processing the sent-down packet list:
• vmk_PktListIterStart()
• vmk_PktListIterIsAtEnd()
• vmk_PktListGetFirstPkt()
• vmk_PktListIterInsertPktBefore()
• vmk_PktListIterRemovePkt()
• vmk_PktListAppendPkt()

Page 40: Packet Management VMKAPIs in the Tx/Rx Path

Parse/find the different layer headers:
• vmk_PktHeaderL2Find()
• vmk_PktHeaderL3Find()
• vmk_PktHeaderEntryGet()
• vmk_PktHeaderDataGet()
• vmk_PktHeaderDataRelease()
• vmk_PktHeaderLength()

Offload handling:
• vmk_PktIsMustCsum()
• vmk_PktSetCsumVfd()
• vmk_PktIsLargeTcpPacket()
• vmk_PktGetLargeTcpPacketMss()

VLAN handling:
• vmk_PktMustVlanTag()
• vmk_PktVlanIDGet()
• vmk_PktVlanIDSet()
• vmk_PktPriorityGet()
• vmk_PktPrioritySet()

Page 41: Advanced Features

MultiQueue Handling

SR-IOV

VXLAN Offload

Dynamic Load Balancing

Page 42: Multi-Queue Support

Register multi-queue support via VMK_UPLINK_CAP_MULTI_QUEUE.
The following callbacks are passed to the uplink when registering this capability (a hypothetical initializer follows the struct below):

typedef struct vmk_UplinkQueueOps {
   vmk_UplinkQueueAllocCB queueAlloc;
   vmk_UplinkQueueAllocWithAttrCB queueAllocWithAttr;
   vmk_UplinkQueueReallocWithAttrCB queueReallocWithAttr;
   vmk_UplinkQueueFreeCB queueFree;
   vmk_UplinkQueueQuiesceCB queueQuiesce;
   vmk_UplinkQueueStartCB queueStart;
   vmk_UplinkQueueFilterApplyCB queueApplyFilter;
   vmk_UplinkQueueFilterRemoveCB queueRemoveFilter;
   vmk_UplinkQueueStatsGetCB queueGetStats;
   vmk_UplinkQueueFeatureToggleCB queueToggleFeature;
   vmk_UplinkQueueTxPrioritySetCB queueSetPriority;
   vmk_UplinkQueueCoalesceParamsSetCB queueSetCoalesceParams;
} vmk_UplinkQueueOps;
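As referenced above, a sketch of the queue-ops table a driver might supply along with VMK_UPLINK_CAP_MULTI_QUEUE; the mynic_* handler names are illustrative and not from the elxnet source:

   static vmk_UplinkQueueOps mynicQueueOps = {
      .queueAlloc             = mynic_queueAlloc,
      .queueAllocWithAttr     = mynic_queueAllocWithAttr,
      .queueReallocWithAttr   = mynic_queueReallocWithAttr,
      .queueFree              = mynic_queueFree,
      .queueQuiesce           = mynic_queueQuiesce,
      .queueStart             = mynic_queueStart,
      .queueApplyFilter       = mynic_queueApplyFilter,
      .queueRemoveFilter      = mynic_queueRemoveFilter,
      .queueGetStats          = mynic_queueGetStats,
      .queueToggleFeature     = mynic_queueToggleFeature,
      .queueSetPriority       = mynic_queueSetPriority,
      .queueSetCoalesceParams = mynic_queueSetCoalesceParams,
   };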

Page 43: Multi-Queue VMKAPIs in the Tx/Rx Path

Refer to vmkapi_net_queue.h

Main list of APIs for implementing multi-queue support:
• vmk_UplinkQueueMkFilterID()
• vmk_UplinkQueueMkTxQueueID()
• vmk_UplinkQueueMkRxQueueID()
• vmk_UplinkQueueIDVal()
• vmk_UplinkQueueIDType()
• vmk_UplinkQueueFilterIDVal()
• vmk_UplinkQueueIDUserVal()
• vmk_UplinkQueueSetQueueIDUserVal()
• vmk_UplinkQueueIDQueueDataIndex()
• vmk_UplinkQueueSetQueueIDQueueDataIndex()
• vmk_UplinkQueueGetNumQueuesSupported()
• vmk_UplinkQueueStart()
• vmk_UplinkQueueStop()
• vmk_PktQueueIDGet()
• vmk_PktQueueIDSet()

Page 44: SR-IOV Support

Setup VFs:
• During DriverAttachDevice(), if SR-IOV is supported by the device, enable VFs via vmk_PCIEnableVFs().
• During DriverScanDevice(), the driver registers its VFs via vmk_PCIRegisterVF(), passing along its .removeVF callback:

static vmk_PCIVFDeviceOps elxnetVFDevOps = {
   .removeVF = elxnet_removeVFDevice
};

• Set the control callback for the VF with the vmkernel via vmk_PCISetVFPrivateData().

Cleanup VFs:
• The .removeVF callback registered during registration is called: vmk_PCIUnregisterVF() is invoked to unregister the particular VF from the vmkernel.
• DriverDetachDevice() should call vmk_PCIDisableVFs() to disable all of its VFs.

Misc VF VMKAPI:
• vmk_PCIGetVFPCIDevice() should be used during VF registration to get the vmk_PCIDevice handle of a PCI VF given its parent PF and VF index.

Page 45: VXLAN Offload Support

Register the VXLAN offload capability via VMK_UPLINK_CAP_ENCAP_OFFLOAD.

Callback ops for VMK_UPLINK_CAP_ENCAP_OFFLOAD:

typedef struct vmk_UplinkEncapOffloadOps {
   /** Handler used by vmkernel to notify that the VXLAN port number was updated */
   vmk_UplinkVXLANPortUpdateCB vxlanPortUpdate;
} vmk_UplinkEncapOffloadOps;

If supporting the RX VXLAN filter, indicate it in supportedRxQueueFilterClasses:

vmk_UplinkSharedQueueInfo->supportedRxQueueFilterClasses |=
   VMK_UPLINK_QUEUE_FILTER_CLASS_VXLAN;

Packet parser APIs to get information on the inner encapsulated headers:
• vmk_PktHeaderEncapFind()
• vmk_PktHeaderEncapL2Find()
• vmk_PktHeaderEncapL3Find()
• vmk_PktHeaderEncapL4Find()

Page 46: Dynamic Load Balancing

A new NetQ feature introduced in the ESXi 5.5 release:
• VMKNETDDI_QUEUEOPS_QUEUE_FEAT_DYNAMIC

NIC requirements to support this feature:
• The device must be able to support different NetQ "features" on any particular NetQ
• Adding or removing support for a particular NetQ must not require any critical operations

If the NIC driver registers DYNAMIC feature support, the load balancer can/will:
• Move filters between queues (i.e. bin-packing of filters), hence reducing the number of queues in use
• Unpack filters to more queues, either for latency-sensitive VMs or to reduce the burden on oversaturated queues

Page 47: Performance

Page 48: Throughput in Gbps on a 16VM Configuration

[Bar chart, throughput in Gbps, be2net vs. elxnet:]
Tx Throughput (256B): be2net 3.00, elxnet 3.03
Rx Throughput (256B): be2net 2.97, elxnet 3.02
Tx Throughput (64KB): be2net 9.40, elxnet 9.41
Rx Throughput (64KB): be2net 9.40, elxnet 9.40

Page 49: Overall CPU Gains on a 16VM Configuration

[Bar chart, overall CPU utilization, be2net vs. elxnet:]
Tx CPU Util (256B): be2net 320.89, elxnet 282.56 (12% savings)
Rx CPU Util (256B): be2net 335.32, elxnet 307.15 (8% savings)
Tx CPU Util (64KB): be2net 29.45, elxnet 29.34
Rx CPU Util (64KB): be2net 55.40, elxnet 52.04 (6% savings)

Page 50: Vmkernel Cost Savings on a 16VM Configuration

[Bar chart, vmkernel CPU utilization, be2net vs. elxnet:]
Tx CPU Util (256B): be2net 137.92, elxnet 89.50 (35% savings)
Rx CPU Util (256B): be2net 132.75, elxnet 96.29 (27% savings)
Tx CPU Util (64KB): be2net 8.06, elxnet 7.03 (13% savings)
Rx CPU Util (64KB): be2net 26.17, elxnet 21.34 (18% savings)

Page 51: Total Mean Ping Response Time (usec) on a 16VM Config

[Bar chart, total mean ping response time in usec, be2net vs. elxnet:]
128b: be2net 134.19, elxnet 133.39
256b: be2net 126.82, elxnet 116.23
512b: be2net 130.04, elxnet 122.41

elxnet reduced the mean ping response time by 1%, 6%, and 8% relative to be2net.

Page 52: Getting Started on the Native Driver…

Go to https://developercenter.vmware.com/group/iovp/certs/5.5/dev-kits for:
1. Native DDK Developer Guide
2. Needed toolchain RPMs:
   vmware-esx-common-toolchain
   vmware-esx-kmdk-psa-toolchain
3. Vib-Suite RPM:
   vmware-esx-vib-suite-5.5.0-0.0.xxxxxxx.i386.rpm
4. Vmkapi DDK RPM:
   vmware-esx-vmkapiddk-devtools-5.5.0-0.0.xxxxxxx.i386.rpm

Page 53: Summary

A layered model approach with easy extensibility for new features:
• Overview of the native model
• Interaction of the driver with the different layers
• Basic structs and handler ops for the different layers

The native model does not use the vmklinux compatibility layer:
• A layer of indirection is completely removed
• Translations (e.g. pkt <-> skb) are avoided
  o Allocation of skbs is not needed
  o Savings from avoiding slab allocation (especially at high packet rates)
• The driver communicates directly with the various layers
• Performance boost in CPU savings

New IO features for ESXi will only be developed for the native model.

Page 54: Questions?

Contact your VMware PM for more details on native model support and for the devkits.

Page 55: Other VMware Activities Related to This Session

HOL: HOL-SDC-1302 – vSphere Distributed Switch from A to Z

Page 56: TAP Membership Renewal – Great Benefits

• TAP Access membership includes:
  • New TAP Access NFR Bundle
  • Access to NDA Roadmap sessions at VMworld, PEX, and Onsite/Online
  • VMware Solution Exchange (VSX) and Partner Locator listings
  • VMware Ready logo (ISVs)
  • Partner University and other resources in Partner Central
• TAP Elite includes all of the above plus:
  • 5X the number of licenses in the NFR Bundle
  • Unlimited product technical support
  • 5 instances of SDK Support
  • Services Software Solutions Bundle
• Annual fees:
  • TAP Access – $750
  • TAP Elite – $7,500
• Send email to [email protected]

Page 57: TAP

TAP
• TAP support: 1-866-524-4966
• Email: [email protected]
• Partner Central: http://www.vmware.com/partners/partners.html

TAP Team
• Kristen Edwards – Sr. Alliance Program Manager
• Sheela Toor – Marketing Communication Manager
• Michael Thompson – Alliance Web Application Manager
• Audra Bowcutt –
• Ted Dunn –
• Dalene Bishop – Partner Enablement Manager, TAP

TAP Resources
• VMware Solution Exchange
• Marketplace support – [email protected]
• Partner Marketplace @ VMware booth pod TAP1

Page 58: THANK YOU
