AZURE ACCELERATED NETWORKING: SMARTNICS IN THE PUBLIC CLOUD
Firestone, D. et al.
NETWORK VIRTUALIZATION
Virtualization of OSI Layers 2 – 7
Routers and switches implemented in software emulate physical routers and switches
Provides similar features to those found on physical networking hardware, including:
Route Tables
Subnets
DHCP
DNS
Firewalls
Access Control Lists (ACLs)
Network Address Translation (NAT)
SOFTWARE-DEFINED NETWORKING (SDN)
Cloud providers (AWS, Azure) enable customers to configure virtual networks through software
Such networks run on the cloud platform hypervisor (NOT within the customer’s virtual machines)
Cloud providers generally offer SDN to end customers at no charge
Customers pay only for outbound data, virtual network peering (per GB), and public IPv4 addresses
https://azure.microsoft.com/en-us/pricing/details/virtual-network/
AZURE VIRTUAL FILTERING PLATFORM (VFP)
Virtual Machine 1
Virtual Machine 2
Virtual Machine 3
THE HIDDEN COST OF NETWORKING
CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU
Physical Server
Hypervisor processes, including SDN
Each rented CPU generates ~$900/year in revenue
Virtual Machine 1
Virtual Machine 2
Virtual Machine 3
THE HIDDEN COST OF NETWORKING
CPU CPU CPU CPU CPU CPU CPU
Virtual Machine 4
CPU CPU CPU
Physical Server
Hypervisor processes, including SDN
Can we get here?
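As a rough illustration of what is at stake, the revenue tied up in hypervisor cores can be estimated with simple arithmetic. The ~$900/year per core figure comes from the slide; the number of cores freed per server and the fleet size are hypothetical values chosen only for this example.

```python
# Back-of-the-envelope estimate of revenue recovered by freeing host cores.
REVENUE_PER_CORE_PER_YEAR = 900  # USD, approximate figure from the slide


def recovered_revenue(cores_freed_per_server: int, servers: int) -> int:
    """Yearly revenue (USD) if the freed cores are rented out instead."""
    return cores_freed_per_server * servers * REVENUE_PER_CORE_PER_YEAR


# Hypothetical inputs: 3 cores freed per server, 100,000 servers in the fleet.
per_server = recovered_revenue(cores_freed_per_server=3, servers=1)
fleet = recovered_revenue(cores_freed_per_server=3, servers=100_000)
print(f"per server: ${per_server:,}/year")  # per server: $2,700/year
print(f"fleet:      ${fleet:,}/year")       # fleet:      $270,000,000/year
```

Even a few cores per server add up quickly at fleet scale, which is why the design goals later in the deck focus on cost.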
HOST SDN
Host with SDN
Physical NIC
Virtual Machines
SR-IOV (SINGLE ROOT IO VIRTUALIZATION)
COMPARISON
Host SDN
Host SDN is full-featured
It is slow and expensive!
SR-IOV
SR-IOV is fast
It cannot apply all SDN rules
Can we get the best of both?
GENERIC FLOW TABLE (GFT) OFFLOAD
Just-In-Time Compilation
For the first packet in each flow, route it through Host SDN to apply policies
Store the result in a hash table (keyed on L2/L3/L4 properties) in the SR-IOV NIC
The SR-IOV NIC then forwards all subsequent packets of the flow over the VF directly to the VM
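The first-packet-then-cache idea can be sketched in a few lines of Python. This is a toy model, not the actual GFT implementation: the `FlowKey` fields, the `FlowTableNic` class, and the `example_policy` function are all hypothetical names chosen for illustration, and a real NIC performs the lookup in hardware.

```python
from dataclasses import dataclass
from typing import Callable, Dict, NamedTuple


class FlowKey(NamedTuple):
    """L2/L3/L4 fields identifying a flow (the field choice is an assumption)."""
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int


@dataclass
class Packet:
    key: FlowKey
    payload: bytes = b""


class FlowTableNic:
    """Toy model of a NIC holding an exact-match flow table."""

    def __init__(self, host_sdn: Callable[[FlowKey], str]):
        self.host_sdn = host_sdn  # slow path: full policy evaluation on the host
        self.flow_table: Dict[FlowKey, str] = {}
        self.slow_path_hits = 0
        self.fast_path_hits = 0

    def receive(self, packet: Packet) -> str:
        action = self.flow_table.get(packet.key)
        if action is None:
            # First packet of the flow: ask host SDN, then cache the decision.
            self.slow_path_hits += 1
            action = self.host_sdn(packet.key)
            self.flow_table[packet.key] = action
        else:
            # Later packets: served from the table without touching the host.
            self.fast_path_hits += 1
        return action


def example_policy(key: FlowKey) -> str:
    """Hypothetical policy: drop traffic to port 23, forward everything else."""
    return "drop" if key.dst_port == 23 else "forward-to-vm"


nic = FlowTableNic(example_policy)
flow = FlowKey("aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02",
               "10.0.0.4", "10.0.0.5", "tcp", 50000, 443)
actions = [nic.receive(Packet(flow)) for _ in range(5)]
print(actions[0], nic.slow_path_hits, nic.fast_path_hits)  # forward-to-vm 1 4
```

Only the first of the five packets reaches the host policy function; the other four are answered from the table, which is the property that lets the fast path avoid host CPU work.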
DESIGN GOALS
1. Minimize burning of CPU cores
2. Maintain programmability
3. Be as good as SR-IOV
4. Be extensible for new features
5. Be compatible with the entire fleet
6. Achieve high single-connection performance
7. Scale to 100GbE+
8. Be serviceable
Overall focus on cost minimization
SR-IOV WITH GFT VS SNAP?
How do the design goals of SR-IOV with GFT compare to those of SNAP?
Consider the following dimensions:
CPU Utilization
Programmability
Throughput
Future-proof
Compatibility
Single-connection performance
100GbE
Serviceability
Security
THE DESIGN OF THE SMARTNIC
Multiple options to consider:
ASIC-based NICs
Multicore SoC NICs
FPGA NICs
Burn host cores
ASIC
Application Specific Integrated Circuit
High performance
1-2 years development-to-manufacturing lead time
Cannot be changed after manufacturing
For well-specified, non-evolving applications, such as Bitcoin Mining (pictured), ASICs offer maximum performance for cost
MULTICORE SOC
System on Chip
Provides Linux environment, where one can run DPDK
Works well up to 40GbE
FPGA
Field-Programmable Gate Array
Like an ASIC but can be reprogrammed
Already used within Microsoft for Bing (Catapult)
WHICH IS BEST?
What are the advantages and disadvantages of these three hardware options along with using the host CPU?
DETAILED COMPARISON
ASICs SoCs FPGAs Host CPUs
Host CPU Utilization ✅ ✅ ✅ ❌
Programmability ❌ ✅ ✅ ✅
Throughput ✅ ❌ ✅ ❌
Future-proof ❌ ✅ ✅ ✅
Compatibility ✅ ✅ ✅
Single Connection Performance ✅ ❌ ✅ ❌
100 GbE ✅ ❌ ✅ ❌
Serviceability ❌ ✅ ✅ ✅
Ease of Development ❌ ✅ ✅ ✅
Power Efficiency ✅ ❌ ✅ ❌
SYSTEM DESIGN
ACCELNET ARCHITECTURE
SERVICEABILITY
Question: If all traffic goes through AccelNet, and AccelNet is exposed as a VF directly to the VM, how does Azure maintain uptime while servicing the FPGA or GFT?
SERVICEABILITY
KERNEL-BYPASS PROTOCOLS
DPDK: a Poll Mode Driver (PMD) transparently bonds the VMBus path and the VF
Exposes all DPDK APIs
RDMA: Fallback to TCP – open area of research
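The transparent switching between the VF and the VMBus mentioned for DPDK can be pictured as a bonded device that prefers the fast path and falls back when it disappears. The sketch below is a toy model under that assumption; `Path` and `BondedDevice` are hypothetical names and do not correspond to any DPDK API.

```python
class Path:
    """A data path that can be taken out of service (hypothetical model)."""

    def __init__(self, name: str):
        self.name = name
        self.available = True

    def send(self, data: bytes) -> str:
        if not self.available:
            raise ConnectionError(f"{self.name} is not available")
        return f"{self.name}:{len(data)}"


class BondedDevice:
    """Presents one device to the application and picks a path per send."""

    def __init__(self, fast: Path, fallback: Path):
        self.fast = fast
        self.fallback = fallback

    def send(self, data: bytes) -> str:
        # Prefer the fast path; use the fallback while the fast path is removed.
        path = self.fast if self.fast.available else self.fallback
        return path.send(data)


vf = Path("vf")
vmbus = Path("vmbus")
dev = BondedDevice(fast=vf, fallback=vmbus)

before = dev.send(b"hello")   # goes over the VF
vf.available = False          # VF removed, e.g. while the NIC is serviced
during = dev.send(b"hello")   # same call, now over the VMBus path
vf.available = True           # VF restored
after = dev.send(b"hello")    # back on the VF
print(before, during, after)  # vf:5 vmbus:5 vf:5
```

The application keeps calling the same `send` throughout; only the path underneath changes, at the cost of lower performance while the fallback is in use.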
PERFORMANCE
BATTLE OF THE CLOUDS
DISCUSSION QUESTIONS
1. What are the security concerns of AccelNet?
2. What are the drawbacks of AccelNet?
3. Which is the better approach: AccelNet or SNAP?
4. Should public clouds charge customers for the complexity of their networks?