Azure Stack HCI
TRANSCRIPT
Azure Stack HCI: The best infrastructure for hybrid
Module 3: Core Networking
Core Networking
Learnings Covered in this Unit
Simplifying the Network
Network Deployment Options
The Virtual Switch
Acceleration Technologies
Virtual Network Adapters
[Diagram: two physical NICs feeding a virtual switch; host vNICs for MGMT, SMB1, and SMB2; guest vmNICs attached to the same switch]

Network    | VLAN ID | QoS Weight | Accelerations
Management | 0       | 5          |
Storage A  | 5       | DCB 50     | vRDMA
Storage B  | 6       | DCB 50     | vRDMA
Guests     | 10-99   | 1-5        | SR-IOV
Three Types of Core Networks

Management
• Part of North-South network
• Used for host communication

Compute
• Part of North-South network
• Virtual machine traffic
• Needs varying levels of QoS
• May need SR-IOV, vRDMA

Storage
• East-West only
• Needs RDMA
• 10 Gbps+
• Can host Live Migration
Traffic in HCI: East-West (internal)
• Cluster heartbeats & inter-node comms
• [SMB] Storage Bus Layer
• [SMB] Cluster Shared Volume
• [SMB] Storage rebuild
• [SMB, possibly] Live Migrations
• Generally RDMA traffic

Traffic in S2D: North-South (external to the S2D cluster)
• VM tenant traffic
• Could be any protocol
NIC Terminology Refresher
pNIC = Physical NIC on the host
vNIC = Host Hyper-V virtual network adapter
vmNIC = Virtual machine Hyper-V virtual network adapter
tNIC = Microsoft team network interface (VLAN-tagged in the LBFO team)
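As a quick orientation, each NIC type can be listed with a different in-box cmdlet (a minimal sketch; the VM name is illustrative):

# pNICs: physical adapters in the host
Get-NetAdapter -Physical

# vNICs: host virtual adapters attached to the Hyper-V switch
Get-VMNetworkAdapter -ManagementOS

# vmNICs: virtual adapters belonging to a VM
Get-VMNetworkAdapter -VMName "VM1"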
Cluster Network Deployment Options

Converged Network
Combining multiple network intents (MGMT, Compute, Storage)
Best if deploying 3+ physical nodes
Connect pNICs to top-of-rack switches
RoCEv2 highly recommended

Switchless
North-South communication is a team, combining Compute and Management networks
Storage (E-W) is directly connected node to node
iWARP recommended
No need to configure Data Center Bridging (DCB) features
Really only for HCI clusters with 2 physical nodes

Hybrid
Best of both; easy deployment of Compute/Mgmt on North-South
Separate storage NICs into separate adapters, not teamed
iWARP or RoCE, it doesn't matter
DCB config not required, but recommended
Converged
[Diagram: two 10 Gb physical NICs per Hyper-V host in a SET VM switch; host vNICs for MGMT, CSV, and LM; VM1-VM3; SMB Multichannel across the links]

Switchless
[Diagram: a SET vSwitch per host carries North-South traffic via the top-of-rack switch; SMB1/SMB2 10 Gb storage NICs are directly connected node to node]

Hybrid
[Diagram: two 10 Gb physical NICs per Hyper-V host in a SET VM switch carrying the MGMT host vNIC and VM1-VM3; two additional 10 Gb physical NICs (SMB1, SMB2) connect to the top-of-rack switches for storage]
Networking Stack overview
[Diagram: Azure Stack HCI node — VMs and the host partition (VM storage/SMB) above a Hyper-V switch (SDN) with integrated NIC teaming; DCB-enabled pNICs below. Legend: RDMA, TCP/IP]
Networking Stack overview
• Virtual switch
• ManagementOS NICs
• VM NICs
Networking Stack overview
• Physical NICs
• ManagementOS NICs
• VM NICs
High availability
Load Balancing and Failover (LBFO) vs. Switch Embedded Teaming (SET):
[Diagram: LBFO — pNICs joined in a team beneath the vSwitch, with a tNIC between team and switch. SET — pNICs teamed directly in the vSwitch. vNICs and vmNICs attach to the vSwitch in both cases]
SET Switch benefits
Switch Embedded Teaming
SET Switch limitations
Switch Embedded Teaming
• Network adapters must be identical
• Hyper-V Port or Dynamic load balancing only
• Switch Independent mode only
• LACP/Static teaming not supported
• Active/Passive not supported
PowerShell command

New-VMSwitch -Name SETSwitch -EnableEmbeddedTeaming $true -NetAdapterName (Get-NetIPAddress -IPAddress 10.*).InterfaceAlias

• Automatically creates one management network adapter
• InterfaceAlias can also be queried with commands such as (Get-NetAdapter -InterfaceDescription Mellanox*).InterfaceAlias to select just one model
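To sanity-check the result, the embedded team can be inspected afterwards (a quick sketch using the switch name from the command above):

# List the team members and load-balancing settings of the SET switch
Get-VMSwitchTeam -Name SETSwitch

# Confirm embedded teaming is active on the switch
Get-VMSwitch -Name SETSwitch | Select-Object Name, EmbeddedTeamingEnabled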
Network Quality of Service
Bandwidth Mode

New-VMSwitch -Name "vSwitch" -AllowManagementOS $true -NetAdapterName NIC1 -MinimumBandwidthMode <Absolute | Default | None | Weight>
New-VMSwitch -Name "vSwitch" -AllowManagementOS $true -NetAdapterName NIC1 -MinimumBandwidthMode Weight
New-VMSwitch -Name "vSwitch" -AllowManagementOS $true -NetAdapterName NIC1 -MinimumBandwidthMode Absolute

Warning: the bandwidth mode can only be set when the switch is created; it cannot be changed afterwards.
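With Weight mode, reservations are then assigned per vNIC; a minimal sketch (the vNIC name and weight value are illustrative, echoing the Management weight of 5 from the earlier table):

# Reserve a relative share of bandwidth for the management host vNIC
Set-VMNetworkAdapter -ManagementOS -Name "Management" -MinimumBandwidthWeight 5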
Common Networking Challenges in Azure Stack HCI
Deployment Time | Complexity | Error Prone
Network ATC
New host management service on Azure Stack HCI
Install-WindowsFeature -Name NetworkATC
Available to all Azure Stack HCI subscribers (via feature update) in 2021
Intent
Complexity: HCI Converged Example
[Diagram: intent spanning pNIC, default OS settings, host vNICs, and guests]

Management VLAN: 100
Storage VLAN 1: 711
Storage VLAN 2: 712
Storage MTU: 9K
Cluster traffic class: 7
Cluster bandwidth reservation: 1%
RDMA traffic class: 3
RDMA bandwidth reservation: 50%
Rename-NetAdapter -Name <OldName> -NewName ConvergedPNIC1
Set-NetAdapterAdvancedProperty -Name ConvergedPNIC1 -RegistryKeyword VLANID -RegistryValue 0
New-VMSwitch -Name ConvergedSwitch -AllowManagementOS $false -EnableIov $true -EnableEmbeddedTeaming $true -NetAdapterName ConvergedPNIC1
Rename-NetAdapter -Name <OldName> -NewName ConvergedPNIC2
Set-NetAdapterAdvancedProperty -Name ConvergedPNIC2 -RegistryKeyword VLANID -RegistryValue 0
Add-VMSwitchTeamMember -VMSwitchName ConvergedSwitch -NetAdapterName ConvergedPNIC2
Set-NetAdapterRss -Name ConvergedPNIC1 -NumberOfReceiveQueues 16 -MaxProcessors 16 -BaseProcessorNumber 2 -MaxProcessorNumber 19
Set-NetAdapterRss -Name ConvergedPNIC2 -NumberOfReceiveQueues 16 -MaxProcessors 16 -BaseProcessorNumber 2 -MaxProcessorNumber 19
Complexity: Physical NICs, vSwitch, and VMQ
Add-VMNetworkAdapter -ManagementOS -SwitchName ConvergedSwitch -Name Management
Rename-NetAdapter -Name *Management* -NewName Management
New-NetIPAddress -InterfaceAlias Management -AddressFamily IPv4 -IPAddress 192.168.0.51 -PrefixLength 24 -DefaultGateway 192.168.0.1
Set-VMNetworkAdapterIsolation -ManagementOS -VMNetworkAdapterName Management -AllowUntaggedTraffic $True -IsolationMode VLAN -DefaultIsolationID 10
Add-VMNetworkAdapter -ManagementOS -SwitchName ConvergedSwitch -Name SMB01
Rename-NetAdapter -Name *SMB01* -NewName SMB01
Set-VMNetworkAdapterIsolation -ManagementOS -VMNetworkAdapterName SMB01 -AllowUntaggedTraffic $True -IsolationMode VLAN -DefaultIsolationID 11
Set-VMNetworkAdapterTeamMapping -ManagementOS -SwitchName ConvergedSwitch -VMNetworkAdapterName SMB01 -PhysicalNetAdapterName ConvergedPNIC1
Set-DnsClient -InterfaceAlias *SMB01* -RegisterThisConnectionsAddress $true
Complexity: Virtual NICs (Mgmt and SMB01)
Add-VMNetworkAdapter -ManagementOS -SwitchName ConvergedSwitch -Name SMB02
Rename-NetAdapter -Name *SMB02* -NewName SMB02
Set-VMNetworkAdapterIsolation -ManagementOS -VMNetworkAdapterName SMB02 -AllowUntaggedTraffic $True -IsolationMode VLAN -DefaultIsolationID 12
Set-VMNetworkAdapterTeamMapping -ManagementOS -SwitchName ConvergedSwitch -VMNetworkAdapterName SMB02 -PhysicalNetAdapterName ConvergedPNIC2
Set-DnsClient -InterfaceAlias *SMB02* -RegisterThisConnectionsAddress $true
New-NetIPAddress -InterfaceAlias *SMB01* -AddressFamily IPv4 -IPAddress 192.168.1.1 -PrefixLength 24
New-NetIPAddress -InterfaceAlias *SMB02* -AddressFamily IPv4 -IPAddress 192.168.2.1 -PrefixLength 24
Complexity: Virtual NICs (SMB02)
Install-WindowsFeature -Name Data-Center-Bridging
New-NetQosPolicy -Name 'Cluster' -Cluster -PriorityValue8021Action 7
New-NetQosTrafficClass -Name 'Cluster' -Priority 7 -BandwidthPercentage 1 -Algorithm ETS
New-NetQosPolicy -Name 'SMB' -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
New-NetQosTrafficClass -Name 'SMB' -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
New-NetQosPolicy -Name 'DEFAULT' -Default -PriorityValue8021Action 0
Disable-NetQosFlowControl -Priority 0, 1, 2, 4, 5, 6, 7
Enable-NetQosFlowControl -Priority 3
Set-NetQosDcbxSetting -InterfaceAlias ConvergedPNIC1 -Willing $False
Set-NetQosDcbxSetting -InterfaceAlias ConvergedPNIC2 -Willing $False
Enable-NetAdapterQos -InterfaceAlias ConvergedPNIC1, ConvergedPNIC2
<< Customer must now get the physical fabric configured to match these settings >>
Complexity: Configure DCB for Storage NICs
> 30+ cmdlets…
> 90+ parameters…
Match Settings on Switch
Repeat Exactly on Node 2, 3, 4…
Repeat Exactly on cluster a, b, c…
Goals
• Deploy your Network Host through only a few commands
• Don’t worry about turning every knob
• Don’t worry about changed defaults between OS versions
• Don’t worry about latest best practices
• Don’t worry about it changing (configuration drift)
You have enough to worry about
Add-NetIntent -Management -Compute -Storage -ClusterName HCI01 -AdapterName pNIC1, pNIC2
Networking in Azure Stack HCI with Network ATC
Deployment Time | Complexity | Error Prone
Summary: Network ATC
Intent-based host network deployment
Deploy the whole cluster with ~1 command
Easily replicate the same configuration to another cluster
Outcome driven; we’ll handle default OS changes
Always deployed with the latest, Microsoft supported and validated best practices
You stay in control with overrides
Auto-remediates configuration drift
Available in Azure Stack HCI 21H2
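As a sketch of what intent deployment looks like in practice (intent and adapter names are illustrative, and cmdlet details may differ between ATC releases):

# Declare one converged intent covering management, compute, and storage
Add-NetIntent -Name ConvergedIntent -Management -Compute -Storage -AdapterName pNIC1, pNIC2

# Review the intent and check that provisioning succeeded on every node
Get-NetIntent
Get-NetIntentStatus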
Networking Stack overview
Components
• Physical NICs
• Virtual Switch
Supporting technologies
• LBFO Teaming/SET
• Offloading technologies
• SMB Direct (RDMA)
• …
[Diagram: Azure Stack HCI node — VMs and the host partition (VM storage/SMB) above a Hyper-V switch (SDN) with integrated NIC teaming; DCB-enabled pNICs below. Legend: RDMA, TCP/IP]
Management OS vNICs

Virtual Machine vmNICs
• Almost the same as Management OS vNICs, except connected to VMs
• Azure Stack HCI supports Guest RDMA on the vmNIC
• You can use SR-IOV in VMs (more information in the SR-IOV slides)
RDMA (Remote Direct Memory Access)
• Typically East-West traffic
• Transfers data from an application (SMB) to pre-allocated memory of another system
• Low latency, high throughput, minimal host CPU processing
• Use Diskspd to test (must leverage SMB over the network)
• Two predominant RDMA “transports” on Windows:
  o iWARP: S2D recommended (lossless out of the box)
  o RoCE (RDMA over Converged Ethernet): lossless with DCB
Vendor iWARP RoCE
Broadcom No Yes
Cavium Yes Yes
Chelsio Yes No
Intel Yes No
Mellanox No Yes
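Regardless of vendor, a quick way to see whether the installed adapters expose RDMA (standard in-box cmdlets):

# Show RDMA capability and whether it is enabled per adapter
Get-NetAdapterRdma

# From the SMB client's perspective, list interfaces and their RDMA capability
Get-SmbClientNetworkInterface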
Remote Direct Memory Access (RDMA) Traffic Flow
[Diagram: file client to file server. Without RDMA, data is copied through app, SMB, OS, driver, and NIC adapter buffers on both sides. With RDMA (iWARP/RoCE), the rNICs move data directly between SMB buffers, bypassing the OS and driver buffers]
• Higher performance through offloading of network I/O processing onto the network adapter
• Higher throughput with low latency and the ability to take advantage of high-speed networks (such as RoCE, iWARP and InfiniBand*)
• Remote storage at the speed of direct storage
• Transfer rate of around 50 Gbps on a single NIC PCIe x8 port
• Compatible with SMB Multichannel for load balancing and failover
• Windows Server 2016 added support for RDMA on vNICs
*InfiniBand is not supported on SET Switch vNICs
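To light up RDMA on a host vNIC (the vEthernet names below are examples, following the SMB1/SMB2 convention used earlier):

# Enable RDMA on the host storage vNICs
Enable-NetAdapterRdma -Name "vEthernet (SMB1)", "vEthernet (SMB2)"

# Verify which adapters now report RDMA enabled
Get-NetAdapterRdma | Where-Object Enabled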
Remote Direct Memory Access (RDMA) Performance Limits

PCI Express Version | Transfer rate | x1        | x4      | x8       | x16
1.0                 | 2.5 GT/s      | 250 MB/s  | 1 GB/s  | 2 GB/s   | 4 GB/s
2.0                 | 5 GT/s        | 500 MB/s  | 2 GB/s  | 4 GB/s   | 8 GB/s
3.0                 | 8 GT/s        | 984 MB/s  | ~4 GB/s | ~8 GB/s  | ~16 GB/s
4.0                 | 16 GT/s       | 1969 MB/s | ~8 GB/s | ~16 GB/s | ~32 GB/s

• Performance is limited by the PCI Express slot
Example with a Mellanox ConnectX-3 Pro dual-port 40/56 Gigabit adapter:
• PCI Express 3.0 x8 card
• The dual port will not be able to deliver 80 Gb/s or 112 Gb/s
  o Maximum will be around 60 Gb/s in an 8 GT/s slot
  o Maximum will be around 30 Gb/s in a 5 GT/s slot
• For best performance, use two single-port cards
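The ~60 Gb/s ceiling follows directly from the table; a small sketch of the arithmetic (128b/130b is the PCIe 3.0 line encoding):

# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding, 8 lanes
$perLaneGbps = 8 * (128 / 130)   # ~7.88 Gb/s usable per lane
$slotGbps    = $perLaneGbps * 8  # ~63 Gb/s raw for an x8 slot
$slotGbps                        # protocol overhead brings this to ~60 Gb/s in practice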
Remote Direct Memory Access (RDMA) Technologies
• InfiniBand (IB)
• Internet Wide Area RDMA Protocol (iWARP)
• RDMA over Converged Ethernet (RoCE)
  o RoCE version 1
  o RoCE version 2
Remote Direct Memory Access (RDMA) Hardware
• InfiniBand (IB)
  o Mellanox
• Internet Wide Area RDMA Protocol (iWARP)
  o Chelsio T580-LP-CR (10-40 Gbps)
  o Chelsio T62100-LP-CR (40-50-100 Gbps)
  o QLogic FastLinQ QL45611HLCU (100 Gbps)
  o Intel
• RDMA over Converged Ethernet (RoCE)
  o Mellanox (ConnectX-3 Pro, ConnectX-4 EN and ConnectX-5 EN)
    ▪ ConnectX-4 LX (10-25-50 Gbps)
    ▪ ConnectX-4 EN (10-25-50-100 Gbps)
    ▪ ConnectX-5 EN
  o Cisco (UCS VIC 1385)
  o QLogic FastLinQ QL45611HLCU (100 Gbps)
  o Emulex/Broadcom (XE100 series)
RDMA Network Layers (iWARP and RoCE v2)
[Diagram: protocol layer comparison — iWARP carries RDMA over TCP/IP; RoCE v2 carries RDMA over UDP/IP]

RDMA Network Layers (RoCE v1 and v2)
[Diagram: RoCE v1 runs directly over Ethernet (not routable); RoCE v2 adds UDP/IP encapsulation (routable)]
Virtual Machine vmNICs (Guest RDMA)
• Azure Stack HCI supports Guest RDMA (Mode 3) on the vmNIC
• Device Manager will show both the Hyper-V Network Adapter and the VF for each vmNIC
• Requires a driver for the pNIC/VF inside the VMs
[Diagram: Server 01 — pNICs to physical switches; vSwitch in the host OS with MGMT, SMB1, and SMB2 vNICs; VMs with vmNICs receiving VFs via SR-IOV; SMB traffic flows over RDMA]
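A sketch of enabling Guest RDMA on a vmNIC (the VM name is illustrative; Set-VMNetworkAdapterRdma ships with recent Windows Server and Azure Stack HCI builds):

# Allow RDMA on the VM's network adapter (a weight greater than 0 enables it)
Set-VMNetworkAdapterRdma -VMName "VM1" -RdmaWeight 100

# Inside the guest, the VF should then report RDMA capability via Get-NetAdapterRdma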
Mapping vNICs (vRDMA) to pNICs
• Needed to avoid two vNICs with vRDMA ending up on one pNIC with RDMA
• By default the logic uses round robin, so this problem can happen
• https://technet.microsoft.com/en-us/library/mt732603.aspx

Invoke-Command -ComputerName $servers -ScriptBlock {
    $physicaladapters = Get-NetAdapter | where status -eq up | where Name -NotLike vEthernet* | Sort-Object
    Set-VMNetworkAdapterTeamMapping -VMNetworkAdapterName "SMB1" -ManagementOS -PhysicalNetAdapterName ($physicaladapters[0]).name
    Set-VMNetworkAdapterTeamMapping -VMNetworkAdapterName "SMB2" -ManagementOS -PhysicalNetAdapterName ($physicaladapters[1]).name
}
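The resulting mapping can be confirmed per node with the matching in-box Get cmdlet:

# Show which pNIC each storage vNIC is pinned to
Get-VMNetworkAdapterTeamMapping -ManagementOS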
What does “Lossless” mean anyway?
Taken from snia.org: https://www.snia.org/sites/default/files/ESF/How_Ethernet_RDMA_Protocols_Support_NVMe_over_Fabrics_Final.pdf

HCI enables Hyper-V compute, S2D, and SDN to be co-located on the same host (and switch ports)
But now you have congestion…
• iWARP uses TCP transport
• RoCE uses IB transport
• RoCE uses UDP as a tunnel
And congestion causes packet drops…
Data Center Bridging is REQUIRED for RoCE to handle congestion
Data Center Bridging (DCB)
• Data Center Bridging (DCB): can make RoCE “lossless”
• Priority Flow Control (PFC)
  • Required for RoCE
  • Optional for iWARP
• Enhanced Transmission Selection (ETS)
  • TX reservation requirements (minimums, not limits)
• ECN and DCBX not used in Windows
• Implementation guide: https://aka.ms/ConvergedRDMA
• (Windows) Configuration collector: https://aka.ms/Get-NetView
• (Windows) Validation tool: https://aka.ms/Validate-DCB
• RDMA connectivity tool: https://aka.ms/Test-RDMA
  • Not a stress tool; connectivity only!
  • Instructions in the Deployment Guide
Must be configured across all network hops for RoCE to be lossless under congestion.
QoS Inspection
Inspect NetQos
Use configuration guides:
• S2D/SDDC: https://aka.ms/ConvergedRDMA
• Similar guide with developer annotations: https://github.com/Microsoft/SDN/blob/master/Diagnostics/WS2016_ConvergedNIC_Configuration.docx

PS C:\DELETEME> Get-NetAdapterQos -Name "RoCE-01" -IncludeHidden -ErrorAction SilentlyContinue | Out-String -Width 4096
Name : RoCE-01
Enabled : True
Capabilities : Hardware Current
-------- -------
MacSecBypass : NotSupported NotSupported
DcbxSupport : IEEE IEEE
NumTCs(Max/ETS/PFC) : 8/8/8 8/8/8
OperationalTrafficClasses : TC TSA Bandwidth Priorities
-- --- --------- ----------
0 ETS 39% 0-2,4,6-7
1 ETS 1% 5
2 ETS 60% 3
OperationalFlowControl : Priorities 3,5 Enabled
OperationalClassifications : Protocol Port/Type Priority
-------- --------- --------
Default 0
NetDirect 445 3
Validate the host with Validate-DCB
• https://aka.ms/Validate-DCB
• Primary benefit
  • Validate the expected configuration on one to N systems or clusters
  • Validate the configuration meets best practices
• Secondary benefits
  • Doubles as DCB documentation for the expected configuration of your systems
  • Answers "What changed?" when faced with an operational issue
Test-RDMA
• https://aka.ms/Test-RDMA
• PoSH tool to test Network Direct (RDMA)
• Ping doesn’t do it!

>> C:\TEST\Test-RDMA.PS1 -IfIndex 3 -IsRoCE $true -RemoteIpAddress 192.168.2.111 -PathToDiskspd C:\TEST
VERBOSE: Diskspd.exe found at C:\TEST\Diskspd-v2.0.17\amd64fre\diskspd.exe
VERBOSE: The adapter Test-40G-2 is a physical adapter
VERBOSE: Underlying adapter is RoCE. Checking if QoS/DCB/PFC is configured on each physical adapter(s)
VERBOSE: QoS/DCB/PFC configuration is correct.
VERBOSE: RDMA configuration is correct.
VERBOSE: Checking if remote IP address, 192.168.2.111, is reachable.
VERBOSE: Remote IP 192.168.2.111 is reachable.
VERBOSE: Disabling RDMA on adapters that are not part of this test. RDMA will be enabled on them later.
VERBOSE: Testing RDMA traffic now. Traffic will be sent in a parallel job. Job details:
VERBOSE: 34251744 RDMA bytes sent per second
VERBOSE: 967346308 RDMA bytes written per second
VERBOSE: 35698177 RDMA bytes sent per second
VERBOSE: 976601842 RDMA bytes written per second
VERBOSE: Enabling RDMA on adapters that are not part of this test. RDMA was disabled on them prior to sending RDMA traffic.
VERBOSE: RDMA traffic test SUCCESSFUL: RDMA traffic was sent to 192.168.2.111
SMB Multichannel
Full throughput
• Bandwidth aggregation with multiple NICs
• Multiple CPU cores engaged when using Receive Side Scaling (RSS)
Automatic failover
• SMB Multichannel implements end-to-end failure detection
• Leverages NIC teaming if present, but does not require it
Automatic configuration
• SMB detects and uses multiple network paths
[Diagram: sample configurations — a single RSS-capable 10GbE NIC, multiple RDMA NICs, and a team of NICs, each connecting an SMB client to an SMB server through 1GbE/10GbE/IB switches]
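Multichannel behavior can be observed directly with the in-box SMB cmdlets (run on the client side):

# Interfaces SMB considers usable, with speed and RSS/RDMA capability
Get-SmbClientNetworkInterface

# Active connections per interface pair once traffic is flowing
Get-SmbMultichannelConnection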
SMB Multichannel – Single 10GbE NIC
1 session, without Multichannel
• No failover
• Can’t use full 10 Gbps
  o Only one TCP/IP connection
  o Only one CPU core engaged
1 session, with Multichannel
• No failover
• Full 10 Gbps available
  o Multiple TCP/IP connections
  o Receive Side Scaling (RSS) helps distribute load across CPU cores
[Diagram: SMB client and server each with a single RSS-capable 10GbE NIC through a 10GbE switch; per-core CPU utilization shown for cores 1-4]
SMB Multichannel – Multiple NICs
1 session, without Multichannel
• No automatic failover
• Can’t use full bandwidth
  o Only one NIC engaged
  o Only one CPU core engaged
[Diagram: SMB clients and servers with two RSS-capable 10GbE NICs each, connected through two 10GbE switches]
SMB Multichannel – Multiple NICs
1 session, with Multichannel
• Automatic NIC failover
• Combined NIC bandwidth available
  o Multiple NICs engaged
  o Multiple CPU cores engaged
[Diagram: same dual-NIC configuration, now with both NICs carrying traffic]
SMB Multichannel Performance
• Linear bandwidth scaling
  o 1 NIC – 1150 MB/sec
  o 2 NICs – 2330 MB/sec
  o 3 NICs – 3320 MB/sec
  o 4 NICs – 4300 MB/sec
• Leverages NIC support for RSS (Receive Side Scaling)
• Bandwidth for small I/Os is bottlenecked on CPU
[Chart: SMB client interface scaling — throughput (MB/sec) vs. I/O size from 512 B to 1 MB, for 1 x to 4 x 10GbE NICs]
SMB Multichannel + NIC Teaming
1 session, with NIC Teaming, no MC
• Automatic NIC failover
• Can’t use full bandwidth
  o Only one NIC engaged
  o Only one CPU core engaged
[Diagram: SMB clients and servers with teamed 1GbE and 10GbE NIC pairs through redundant switches]
SMB Multichannel + NIC Teaming
1 session, with NIC Teaming and MC
• Automatic NIC failover (faster with NIC Teaming)
• Combined NIC bandwidth available
  o Multiple NICs engaged
  o Multiple CPU cores engaged
[Diagram: same teamed configuration, now with both team members carrying traffic]
SMB Direct and SMB Multichannel
1 session, without Multichannel
• No automatic failover
• Can’t use full bandwidth
  o Only one NIC engaged
  o RDMA capability not used
[Diagram: SMB clients and servers with dual 10GbE R-NICs or dual 54Gb InfiniBand R-NICs through redundant switches]
SMB Direct and SMB Multichannel
1 session, with Multichannel
• Automatic NIC failover
• Combined NIC bandwidth available
  o Multiple NICs engaged
  o Multiple RDMA connections
[Diagram: same R-NIC configuration, now with both R-NICs carrying RDMA traffic]
Single Root I/O Virtualization (SR-IOV)
• “Network switch in the network adapter”
• Direct I/O to VM vmNICs: removes the CPU from the process of moving data to/from a VM. Data is DMA’d directly to/from the VM without the virtual switch “touching” it
• For high I/O workloads
• New in Windows Server 2016: support for SR-IOV in the SET Switch
Single Root I/O Virtualization (SR-IOV)
[Diagram: host with the Hyper-V extensible switch; the VM network stack uses either a synthetic NIC or a Virtual Function (VF) exposed directly by the SR-IOV NIC]
• Direct I/O to the NIC
• For high I/O workloads
Requires
• SR-IOV capable NICs
• Windows Server 2012 or higher VMs
• SET Switch
Benefits
• Maximizes use of host system processors and memory
• Reduces host CPU overhead for processing network traffic (by up to 50%)
• Reduces network latency (by up to 50%)
• Provides higher network throughput (by up to 30%)
• Full support for Live Migration
Single Root I/O Virtualization (SR-IOV): Installation Workflow
1. Check with the server vendor that the chipset supports SR-IOV. Note that some older systems have an SR-IOV menu item in their BIOS but do not enable all of the necessary IOMMU functionality needed for Windows Server 2012.
2. Check that NIC firmware and drivers support SR-IOV.
3. Enable SR-IOV in the server BIOS.
4. Enable SR-IOV in the NIC BIOS. Here we define the number of VFs per NIC.
5. Install NIC drivers on the host. Check driver advanced properties to ensure SR-IOV is enabled at the driver level.
6. Create an SR-IOV enabled Hyper-V switch (see the sketch below).
7. Enable SR-IOV for each VM we want to use it with (default).
8. Install the NIC driver inside the VM.
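Steps 6 and 7 map to two cmdlets; a minimal sketch (switch, adapter, and VM names are illustrative):

# Step 6: create the Hyper-V switch with IOV enabled (cannot be changed after creation)
New-VMSwitch -Name "SRIOVSwitch" -NetAdapterName "pNIC1" -EnableIov $true

# Step 7: give the VM's adapter a non-zero IOV weight so a VF is assigned
Set-VMNetworkAdapter -VMName "VM1" -IovWeight 1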
Single Root I/O Virtualization (SR-IOV): SET Switch
Windows Server 2016
• Support for SR-IOV in the SET Switch
• Host pNIC team with the SET Switch
• No need for a guest team; only one vNIC in the VM
Single Root I/O Virtualization (SR-IOV): VM Performance
SR-IOV vNIC performance, VM guest with 4 VPs
• 30 Gigabit (Mellanox ConnectX-3 Pro 40G), 3100 MB/s

Single Root I/O Virtualization (SR-IOV): VM Performance
SR-IOV vNIC performance, VM guest with 8 VPs
• 40 Gigabit (Mellanox ConnectX-3 Pro 40G), 4510 MB/s
Single Root I/O Virtualization (SR-IOV): VM Performance
• Windows Server 2012 R2 VM guest
• Performance example with SR-IOV vNIC
• Host pNIC: Mellanox ConnectX-3 Pro 40 Gigabit
• 8 VPs in the VM guest (vRSS not enabled)
• VM uses only core 2 (no vRSS in the VM guest)
• Test with Microsoft ctsTraffic.exe (send/receive)

Single Root I/O Virtualization (SR-IOV): VM Performance
• Windows Server 2012 R2 VM guest
• Performance example with SR-IOV vNIC
• Host pNIC: Mellanox ConnectX-3 Pro 40 Gigabit
• 8 VPs in the VM guest (vRSS enabled)
• VM uses only HT cores
• Test with Microsoft ctsTraffic.exe (send/receive)

Single Root I/O Virtualization (SR-IOV): VM Performance
• Windows Server 2012 R2 VM guest
• Performance example with SR-IOV vNIC
• Host pNIC: Mellanox ConnectX-3 Pro 40 Gigabit
• 16 VPs in the VM guest (vRSS enabled)
• VM uses all cores
• Test with Microsoft ctsTraffic.exe (send/receive)
SR-IOV and Virtual Machine Mobility
Live Migration, Quick Migration, and snapshots are supported with SR-IOV
1. The VF is presented to the virtual machine using SR-IOV on the source Hyper-V host:
   Set-VMNetworkAdapter VM -IOVWeight 1
2. The VF is removed from the VM once migration is started:
   Set-VMNetworkAdapter VM -IOVWeight 0
3. The VF is presented again on the destination host if it supports SR-IOV; network connectivity continues without SR-IOV if the destination host does not support it:
   Set-VMNetworkAdapter VM -IOVWeight 1
[Diagram: VF/PF assignment on source and destination hosts during migration]
Troubleshooting SR-IOV
1. The Hyper-V Manager Networking tab on each VM will show if SR-IOV is not operational.
2. Using Windows PowerShell, we can validate why SR-IOV is not operational. In this example, the BIOS and the NIC do not support SR-IOV:

PS C:\Windows\system32> (Get-VMHost).IovSupport
False
PS C:\Windows\system32> (Get-VMHost).IovSupportReasons
Ensure that the system has chipset support for SR-IOV and that I/O virtualization is enabled in the BIOS.
The chipset on the system does not do DMA remapping, without which SR-IOV cannot be supported.
The chipset on the system does not do interrupt remapping, without which SR-IOV cannot be supported.
To use SR-IOV on this system, the system BIOS must be updated to allow Windows to control PCI Express. Contact your system manufacturer for an update.
SR-IOV cannot be used on this system as the PCI Express hardware does not support Access Control Services (ACS) at any root port. Contact your system vendor for further information.

3. Event Viewer also indicates if an error exists when enabling SR-IOV on the VM network adapter. Check the Hyper-V SynthNIC log.
Dynamic VMMQ
[Chart: static vs. dynamic VMMQ throughput, ~5 Gbps vs. ~20 Gbps]
WS2016
• Multiple VMQs for the same virtual NIC (VMMQ)
• Statically assigned queues
WS2019 \ Azure Stack HCI
• Autotunes queues to:
  o Maximize virtual NIC throughput
  o Maintain consistent virtual NIC throughput
  o Maximize host CPU efficiency
• Premium certified adapters required
• Supports Windows and Linux guests
Try it out! (without special drivers) https://aka.ms/DVMMQ-Validation
Virtual Machine Multi-Queue (VMMQ)
• Feature that allows network traffic for a VM to be spread across multiple queues
• VMMQ is the evolution of VMQ with software vRSS
• High-traffic VMs benefit from the CPU load spreading that multiple queues can provide
• Disabled by default in the SET Switch
• Disabled by default in the VM guest
Virtual Machine Multi-Queue (VMMQ)
For VMMQ to be enabled for a VM, RSS needs to be enabled inside the VM
• VrssEnabled: true
• VmmqEnabled: false (off by default)
• VmmqQueuePairs: 16
• Use VMQ for low-traffic VMs
• Use VMMQ for high-traffic VMs
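Turning VMMQ on for a specific VM is a one-liner against its network adapter (VM name and queue count are illustrative):

# Enable VMMQ and set the number of queue pairs for a high-traffic VM
Set-VMNetworkAdapter -VMName "VM1" -VmmqEnabled $true -VmmqQueuePairs 16

# Inspect the resulting settings
Get-VMNetworkAdapter -VMName "VM1" | Select-Object Name, VmmqEnabled, VmmqQueuePairs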
Virtual Machine Multi-Queue (VMMQ) Performance
Performance example with VMQ vs. VMMQ
Hardware used:
• Mellanox ConnectX-3 Pro 40 Gigabit
• Dell T430 with Intel E5-2620v4
Software used:
• Microsoft NTttcp.exe
• Microsoft ctsTraffic.exe
Note:
• VM111-VM140: Windows Server 2016
• VM151-VM152: Windows Server 2012 R2
Virtual Machine Multi-Queue (VMQ) Performance
• Performance example with VMMQ disabled
• Host pNIC: Mellanox ConnectX-3 Pro 40 Gigabit
• 4 VPs in the VM guest
• VM uses VMQ base processor 2

Virtual Machine Multi-Queue (VMMQ) Performance
• Performance example with VMMQ enabled
• Host pNIC: Mellanox ConnectX-3 Pro 40 Gigabit
• 4 VPs in the VM guest
• 4 queues used (VMMQ uses 4 cores from 8 to 14)

Virtual Machine Multi-Queue (VMMQ) Performance
• Performance example with VMMQ enabled
• Host pNIC: Mellanox ConnectX-3 Pro 40 Gigabit
• 8 VPs in the VM guest
• 8 queues used (VMMQ uses 8 cores from 0 to 14)

Virtual Machine Multi-Queue (VMMQ) Performance
• Performance example with VMMQ enabled
• Host pNIC: Mellanox ConnectX-3 Pro 40 Gigabit
• 8 VPs in the VM guest
• 7 queues used (VMMQ uses 7 cores from 2 to 14)
• Test with Microsoft ctsTraffic.exe (send/receive)

Virtual Machine Multi-Queue (VMQ) Performance
• Windows Server 2012 R2 VM guest
• Performance example with VMMQ disabled
• Host pNIC: Mellanox ConnectX-3 Pro 40 Gigabit
• 4 VPs in the VM guest (vRSS disabled by default)
• VM uses core 10 (100%)

Virtual Machine Multi-Queue (VMQ) Performance
• Windows Server 2012 R2 VM guest
• Performance example with VMMQ disabled
• Host pNIC: Mellanox ConnectX-3 Pro 40 Gigabit
• 4 VPs in the VM guest (vRSS enabled in VM)
• VM uses core 10 (100%)

Virtual Machine Multi-Queue (VMMQ) Performance
• Windows Server 2012 R2 VM guest
• Performance example with VMMQ enabled
• Host pNIC: Mellanox ConnectX-3 Pro 40 Gigabit
• 4 VPs in the VM guest (vRSS enabled in VM)
• VM uses cores 2-14

Virtual Machine Multi-Queue (VMMQ) Performance
• Windows Server 2012 R2 VM guest
• Performance example with VMMQ enabled
• Host pNIC: Mellanox ConnectX-3 Pro 40 Gigabit
• 4 VPs in the VM guest (vRSS enabled in VM)
• VM uses cores 2-14
• Test with Microsoft ctsTraffic.exe (send/receive)
Virtual Receive-side Scaling (vRSS)
• vRSS is enabled by default in Windows Server 2016 and higher VMs
• vRSS is supported on host vNICs
• vRSS works with VMQ or VMMQ (VMMQ = use RSS queues in hardware)
• Not compatible with SR-IOV vmNICs
Virtual RSS in Azure Stack
[Diagram: incoming packets spread across virtual processors on NUMA nodes 0-3 via the vNIC]
• vRSS provides near line rate to a virtual machine on existing hardware, making it possible to virtualize traditionally network-intensive physical workloads
• Maximizes resource utilization by spreading virtual machine traffic across multiple virtual processors
• Helps virtualized systems reach higher speeds with 10 to 100 Gbps NICs
• Requires no hardware upgrade and works with any NICs that support RSS
Virtual RSS
• Supported guest OS with the latest Integration Services installed
• NIC must support RSS and VMQ
• VMQ must be enabled
• SR-IOV must be disabled for the network card using vRSS
• RSS enabled inside the guest:
  Enable-NetAdapterRss -Name "AdapterName"
• RSS configuration inside the guest is required (same as on a physical computer)
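A sketch of that in-guest configuration (adapter name and processor range are illustrative and should match the VM's vCPU layout):

# Inside the guest: enable RSS, then scope it to specific virtual processors
Enable-NetAdapterRss -Name "Ethernet"
Set-NetAdapterRss -Name "Ethernet" -BaseProcessorNumber 1 -MaxProcessors 4

# Verify the effective RSS settings
Get-NetAdapterRss -Name "Ethernet"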