TRANSCRIPT
Optimizing your virtual switch for VXLAN
Ron Fuller, VCP-NV, CCIE#5851 (R&S/Storage) Staff Systems Engineer – NSBU [email protected]
VXLAN Protocol Overview
Ethernet in IP overlay network
– Entire L2 frame encapsulated in UDP
– 50+ bytes of overhead
24 bit VXLAN Network Identifier
– 16 M logical networks
VXLAN can cross Layer 3 network boundaries
Overlay between ESXi hosts
– VMs do NOT see VXLAN ID
VTEP (VXLAN Tunnel End Point)
– VMkernel interface which serves as the endpoint for encapsulation/de-encapsulation of VXLAN traffic
Technology submitted to IETF for standardization
– With Cisco, Citrix, Red Hat, Broadcom, Arista and others
VXLAN Frame Format

Inner Ethernet Frame:
  Inner Dest MAC | Inner Source MAC | EtherType | Optional Inner 802.1Q | Original Ethernet Payload

VXLAN Encapsulated Frame:
  Outer Ethernet Header (14 bytes): Outer Dest MAC | Outer Source MAC | Optional Outer 802.1Q | EtherType
  Outer IP Header (20 bytes): IP Header Data* | IP Protocol | Header Checksum | Outer Source IP | Outer Dest IP
  Outer UDP Header (8 bytes): Source Port | Dest Port (8472) | UDP Length | UDP Checksum
  VXLAN Header (8 bytes): VXLAN Flags | RSVD | VXLAN NI (VNI) | RSVD
  Inner Ethernet Frame | FCS

*IP Header Data = Version, IHL, TOS, Length, ID
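The "50+ bytes of overhead" mentioned in the overview follows directly from these header sizes (the extra 4 bytes apply only when an outer 802.1Q tag is present):

  14 (outer Ethernet) + 20 (outer IP) + 8 (outer UDP) + 8 (VXLAN) = 50 bytes
  50 + 4 (optional outer 802.1Q tag) = 54 bytes

This is why the transport (underlay) MTU is commonly raised, typically to 1600 bytes, so that a 1500-byte inner frame plus VXLAN overhead is not fragmented.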
Network Adapter Offloads
- TCP Checksum Offload: the NIC calculates the TCP checksum on send and verifies it on receive, offloading that work from software.
- Large Receive Offload (LRO): the NIC aggregates received packets and hands a single large packet to software, so software sees fewer packets and interrupts.
- TCP Segmentation Offload (TSO): the operating system sends large TCP packets to the NIC, and the NIC segments them according to the physical MTU.
- Receive Side Scaling (RSS): the NIC distributes received packets among multiple queues, with a unique receive thread per queue to drive multiple CPUs.
These are important features for NSX vSwitch performance; a quick host-side check of the TSO setting follows below.
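On an ESXi host, hardware TSO is controlled by an advanced setting. A minimal check that it has not been disabled (the option path assumes a recent ESXi release):

  # 1 = offload TCP segmentation to the physical NIC when supported
  esxcli system settings advanced list -o /Net/UseHwTSO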
TCP Segmentation Offload – Pre-NSX: how is TCP forwarding optimized?
1. A VM on a VLAN dvPG sends a large TCP packet to its virtual NIC.
2. The NSX vSwitch passes the large packet to the TSO-enabled physical NIC.
3. The physical NIC segments the packet based on the physical MTU.
Offloading TSO to the NIC lets the host conserve CPU cycles while sending large TCP messages. Almost all NICs support TSO for VLAN traffic.

TCP Segmentation Offload – VM on a VXLAN Logical Switch: what changes with NSX?
Without TSO-for-VXLAN support in the physical NIC:
1. The VM sends a large TCP packet to its virtual NIC.
2. VXLAN encapsulation turns every packet into a UDP frame, and the NIC cannot perform TSO on the inner TCP packet inside a UDP frame, so encapsulation and segmentation both happen in software.
Segmenting a large TCP message in software is a CPU-intensive operation.

TCP Segmentation Offload – TSO support for VXLAN
With TSO-for-VXLAN support in the physical NIC:
1. The VM sends a large TCP packet to its virtual NIC.
2. The vSwitch performs VXLAN encapsulation.
3. The physical NIC performs TSO on the VXLAN packets.
TSO for VXLAN traffic is generically referred to as "VXLAN Offload" by NIC vendors, and CPU cycles are conserved while sending traffic.
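Whether a particular NIC actually provides VXLAN offload depends on both the adapter and the driver loaded on the host. A minimal way to see which driver and firmware an uplink is using (vmnic0 is a placeholder; compare the result against the NIC vendor's VXLAN offload documentation):

  # Driver name, driver version and firmware version for one uplink
  esxcli network nic get -n vmnic0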
Netqueue
[Diagram: traffic to MAC A, MAC B, … MAC n steered to separate NIC queues, each serviced by its own kernel thread/CPU core in the ESXi kernel]
- Netqueue is a feature designed to let a vSphere host receive line-rate traffic from the physical network.
- It works by driving multiple CPUs to handle receive packet processing.
- Before NSX there is no traffic encapsulation, and the VDS receives traffic destined to multiple unique MAC addresses (the MAC addresses assigned to VM vNICs).
- Netqueue steers traffic destined to each VM MAC to a unique NIC queue.
- Interrupts raised by the NIC queues drive multiple CPUs.
With NSX, all traffic to a vSphere host is destined to the VTEP MAC address, so the ability to drive multiple CPUs with Netqueue is lost.
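NetQueue is enabled by default on ESXi; one way to confirm is the kernel setting below (the setting name is an assumption based on current ESXi releases):

  # TRUE = NetQueue enabled
  esxcli system settings kernel list -o netNetqueueEnabled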
Receive Side Scaling (RSS)
[Diagram: flows A–D hashed across multiple NIC queues, each serviced by its own kernel thread/CPU core in the ESXi kernel]
- RSS tries to achieve the same end result as Netqueue: use multiple CPUs to handle receiving of packets.
- Instead of MAC addresses, RSS uses the packet's Layer 2 through Layer 4 headers to load-balance traffic across multiple queues.
- RSS is not enabled for all MACs; the VDS enables RSS for traffic destined to VTEP MAC addresses.
Receive Side Scaling (RSS) – receive path
1. A packet arrives at the default queue.
2. The MAC filters attached to the queue are evaluated to see whether RSS is required for the packet's destination MAC.
3. If RSS is not enabled for that destination MAC, the default queue/thread processes the traffic.
4. If RSS is enabled for the destination MAC, the packet is hashed with the configured RSS algorithm to one of the 4 reserved queues.
vSphere supports up to 4 queues for RSS, which is sufficient for receiving 10G traffic at 1500 bytes. A driver-level example of enabling RSS follows below.
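Enabling RSS is usually a driver module parameter rather than a vSwitch setting. A sketch for an Intel ixgbe-based NIC follows; the parameter name and values are driver specific (check the NIC vendor's documentation), and the values shown are illustrative only:

  # Request 4 RSS queues per port (one value per port), then reload the driver or reboot
  esxcli system module parameters set -m ixgbe -p "RSS=4,4"
  # Confirm the parameter took effect
  esxcli system module parameters list -m ixgbe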
vSphere Inbox v/s Async Drivers
- Inbox drivers – vSphere packages NIC drivers into ESXi; these are typically updated at major vSphere releases.
- Async drivers – new NIC driver functionality and performance improvements released by the NIC vendor independently of vSphere releases. These drivers can be downloaded from https://my.vmware.com/web/vmware/downloads
- Once an async driver matures, that version is pulled into the next major vSphere release.
- With some NICs the vSphere inbox driver is optimal for VXLAN traffic (i.e., it supports VXLAN TSO and RSS), whereas with many NICs the inbox driver is not optimal and requires an upgrade (on compute and edge hosts) to the recommended async version.
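To compare what is installed on a host against the vendor's recommended async driver, one option is to list the installed driver packages (a sketch; the "net-" prefix applies to vmklinux drivers such as net-ixgbe, while native drivers use their own package names):

  # Show installed VIBs and pick out NIC driver packages by name
  esxcli software vib list | grep -i net-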
Test Topology and Tools
Test Topology
NSX performance testing was done in typical NSX topologies documented in the design guide:
1. East–West traffic before NSX / network virtualization – establish a performance baseline on VDS with VMs on VLANs.
2. East–West traffic with NSX / network virtualization – logical switching.
3. Overlay to physical at Layer 3.
[Diagram: NSX performance testing topology – VMs (VM1–VM4) attached to an NSX vSwitch logical switch on Compute Host 1 and Compute Host 2, Edge Host 1 peering at Layer 3 with the physical network, and Physical Server 1 through Physical Server 8 on a VLAN]
Test Tools and Methodology
- iperf for bandwidth testing
  - Supports running TCP and UDP traffic
  - Supports multiple TCP and UDP sessions
  - UDP streams at a defined sending rate to verify the no-loss UDP rate
  - On the server: "iperf -s"
  - On the client: "iperf -c <server IP> -t <duration> -P <number of sessions>" (see the example below)
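For example, a 60-second test with 4 parallel TCP streams might look like this (10.0.0.10 is a placeholder server address):

  # On the server VM
  iperf -s
  # On the client VM
  iperf -c 10.0.0.10 -t 60 -P 4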
- Netperf for latency testing
  - TCP latency is measured with the netperf TCP request/response test (TCP_RR)
  - One transaction = 1 byte from client to server, then 1 byte from server to client
  - The test reports the number of transactions per second
  - Time for one round trip = 1 / (transactions per second)
  - On the server run "netserver"
  - On the client run "netperf -H <server_ip> -t TCP_RR -l <test duration>" (see the example below)
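As a worked example with an illustrative (not measured) number: if the TCP_RR test reports 20,000 transactions per second, the round-trip time is 1 / 20,000 s = 50 microseconds. A concrete invocation against a placeholder server address:

  # On the server VM
  netserver
  # On the client VM: 60-second TCP_RR test
  netperf -H 10.0.0.10 -t TCP_RR -l 60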
[Diagram: client–server TCP connection establishment (SYN, SYN+ACK, ACK) followed by TCP_RR transactions, each a 1-byte request and a 1-byte response]
CPU overhead with NSX
"NSX introduced VXLAN, routing and firewalling in the hypervisor. What does this cost in terms of CPU?"
The following sample methodology can be used to determine the CPU overhead for each feature:
1. Run a bandwidth test with VMs on the VDS (VLAN dvPG) and record total bandwidth and CPU: 18.426 Gbps, 364.46 % total CPU on the host.
2. Calculate the CPU per Gbps over VLAN: 364.46 / 18.426 ≈ 19.74.
3. Run a bandwidth test with VMs on a Logical Switch (no DLR and no firewall): 18.185 Gbps, 409.70 % total CPU on the host.
4. Calculate the CPU per Gbps over VXLAN: 409.70 / 18.185 ≈ 22.53.
5. Additional CPU for 1 Gbps over VXLAN compared to VLAN = CPU per Gbps over VXLAN – CPU per Gbps over VLAN: 22.53 - 19.74 = 2.79.
6. Additional CPU for 10 Gbps over VXLAN: 2.79 * 10 = 27.9.
7. Additional CPU for 10 Gbps over VXLAN on a 12-core hypervisor: 27.9 * 100 / 1200 ≈ 2.3 %.
A similar methodology can be followed to determine the additional CPU for running routing and firewalling on the host.
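One way to capture the "total CPU on host" numbers during each bandwidth run is esxtop in batch mode (the interval and iteration counts below are arbitrary choices):

  # Sample host statistics every 5 seconds for 24 iterations (~2 minutes) into a CSV for later analysis
  esxtop -b -d 5 -n 24 > nsx-cpu-baseline.csv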
Test Results
[Diagram: VM1–VM8 on each of two compute-cluster hosts, attached to the same logical switch through the NSX vSwitch]
[Chart: send throughput in Gbps (0–20) vs. TCP message size (64, 512, 1500, 32k, 64k)]
- Line-rate traffic (~18 Gbps) with 2 NICs per host with VXLAN.
- Additional CPU for VXLAN traffic between hosts is ~3 % of host CPU.
[Chart: Logical Switching – additional CPU % per Gbps (0–14) vs. TCP message size (64, 512, 1500, 32k, 64k); the Y axis shows the additional CPU overhead compared to the same baseline tests performed on VLAN-based networks]
Summary
• RSS support is required in NIC and vSphere driver to receive line rate VXLAN traffic on a 10 Gbps NIC.
• TSO for VXLAN traffic support on NIC allows a hypervisor sending traffic to conserve CPU cycles by offloading TCP segmentation to the PNIC.
• Check with NIC vendor for TSO and RSS driver support – may need an async driver!