IBM Research – Zurich Research Laboratory
Got Loss? Get zOVN!
Daniel Crisan, Robert Birke, Gilles Cressier, Cyriel Minkenberg, and Mitch Gusat
ACM SIGCOMM 2013, 12-16 August, Hong Kong, China
Application Performance in Virtualized Datacenter Networks
[Diagram: end-users access datacenter services over the global Internet (long-and-fat links); traffic enters through routers into the physical datacenter network (short-and-fat links) of switches and terminates at virtualized servers 1…N, each running a virtual switch that connects its VMs through their vNICs to the server's NIC.]
Physical Network: Lossless Links
• IBM has built flow-controlled links since the 1980s
• The High Performance Computing community builds large-scale lossless distributed systems
• Flow control improves performance
• The HPC and datacenter communities remain disconnected
• Why do we disregard Ethernet flow control?
  • PAUSE is widely available, yet largely ignored
  • Converged Enhanced Ethernet applies the HPC and storage lessons
  • Priority Flow Control (standardized in 2011)
  • Constantly improved on the road to 1T
Virtual Networks are Different
• Physical networks: packet forwarding with deterministic bandwidth and delay, link-level flow control, µs latencies
• Virtual networks: packet forwarding with bandwidth allocation, ms latencies, still in an embryonic stage
Research – Zurich Research Laboratory
Contributions
• Loss identification and characterization in virtual networks
• Dirty-slate approach for latency-sensitive applications
  • Exploit an L2 technique to the benefit of TCP and the application
• Introduce the zero-loss Overlay Virtual Network (zOVN)
  • Flow-controlled virtual switch
• Evaluation with Partition/Aggregate
  • Prototype implementation
  • Cross-layer simulation
Flow control improves application performance
Outline
• Introduction
• Losses in Virtual Networks
• zOVN Architecture
• Evaluation
• Conclusions
Losses in Virtual Networks
• Packets traverse a series of queues
• Each queue is a producer/consumer problem, and this coordination is not implemented correctly everywhere along the path (see the sketch below)
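The difference between a lossy and a flow-controlled queue can be made concrete with a minimal Python sketch (illustrative only; the real queues sit in the guest kernel and hypervisor, not in user space):

```python
import queue

buf = queue.Queue(maxsize=64)   # bounded queue between two stages of the path

def lossy_produce(pkt):
    # What many virtual-network queues effectively do today:
    # silently drop the packet when the consumer lags behind.
    try:
        buf.put_nowait(pkt)
    except queue.Full:
        pass                    # packet lost

def lossless_produce(pkt):
    # Correct producer/consumer coordination: back-pressure the producer
    # by blocking until the consumer frees a slot.
    buf.put(pkt)

def consume():
    return buf.get()            # consumer drains the queue at its own pace
```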
[Diagram: a physical machine running a vSwitch; VM 1 and VM 2 act as sources sending through their vNIC Tx into vSwitch ports A and B (Tx), while VM 3 acts as the sink receiving from port C (Rx) through its vNIC Rx.]
Losses in Virtual Networks (2)
• Measurement methodology: inject UDP packets at (1) and count how many still arrive at (6); a minimal sketch follows
• Loss locations:
  • vSwitch: between (3) and (4)
  • Receive stack: between (5) and (6)
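The slide does not name the injection tool; the Python sketch below only illustrates the idea of the measurement, with addresses and packet counts as assumed placeholders: blast UDP datagrams from a source VM at point (1) and count what survives at the sink socket, point (6).

```python
import socket

NUM_PACKETS = 100_000
SINK = ("10.0.0.3", 5001)        # assumed address of the sink VM (point 6)

def inject(count=NUM_PACKETS):
    """Point (1): send sequence-numbered UDP datagrams from a source VM."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq in range(count):
        s.sendto(seq.to_bytes(4, "big") + b"x" * 1000, SINK)

def count_arrivals(timeout=5.0):
    """Point (6): count how many datagrams actually reach the sink socket."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", SINK[1]))
    s.settimeout(timeout)
    received = 0
    try:
        while True:
            s.recvfrom(2048)
            received += 1
    except socket.timeout:
        pass
    return received             # losses = packets injected - received
```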
[Diagram: same setup as before, with measurement points (1)–(6) marked along the path: source applications and vNIC Tx in the sender VMs, the vSwitch input and output ports, and the vNIC Rx and sink application in the receiver VM.]
Losses in Virtual Networks (3)
Configuration  Hypervisor  vNIC    vSwitch
C1             Qemu/KVM    Virtio  Linux Bridge
C2             Qemu/KVM    Virtio  Open vSwitch
C3             Qemu/KVM    Virtio  VALE
C4             H2          N2      S4
C5             H2          E1000   S4
C6             Qemu/KVM    E1000   Linux Bridge
C7             Qemu/KVM    E1000   Open vSwitch
[Bar chart: injected UDP traffic (0–200 MBps) for configurations C1–C7, split into traffic received, traffic lost in the vSwitch, and traffic lost in the receive stack.]
Outline
• Introduction
• Losses in Virtual Networks
• zOVN Architecture
• Evaluation
• Conclusions
TX Path
[Diagram: transmit path through the VM, the zOVN bridge in the hypervisor, and the NIC. The application writes to the socket Tx (the write's return value propagates backpressure), the guest kernel's Qdisc enqueues and calls start_xmit on the vNIC Tx (start/stop queue, free skb), the zOVN bridge receives on Port A Rx, applies the overlay encapsulation and forwards to Port B Tx, and the NIC Tx sends the frame on the physical link, stopping when it receives a PAUSE; wake-ups propagate the flow control back up the chain.]
RX Path: Fix Stack Loss
[Diagram: receive path. The NIC Rx takes a frame from the physical link (and can send PAUSE), the NET RX softirq hands it via netif_receive_skb to the zOVN bridge's Port B Rx, the bridge removes the overlay encapsulation and forwards it through Port A Tx to the vNIC Rx (which can pause/resume its queue), and the guest kernel delivers it to the socket Rx where the application reads it; a setsockopt call lets each socket select lossy or lossless reception.]
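The slide only indicates that a setsockopt call selects lossy or lossless reception per socket; the actual option level and name are defined by zOVN's patched guest kernel and are not shown here. The snippet below is a purely hypothetical illustration of how an application would opt in:

```python
import socket

# Hypothetical constants: the real level/option numbers come from the zOVN
# guest-kernel patch and are not given on this slide.
SOL_ZOVN = 280
ZOVN_LOSSLESS_RX = 1

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask the patched guest kernel to pause the vNIC Rx queue instead of dropping
# packets when this socket's receive buffer fills up.
s.setsockopt(SOL_ZOVN, ZOVN_LOSSLESS_RX, 1)
```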
Lossless Virtual Switch
[Diagram: vSwitch with ports 1…N on the Tx (ingress from VMs) side and ports 1…N on the Rx (egress to VMs) side.]
Senders:
• Produce packets
• Start the forwarder
• Sleep
Receivers:
• Consume packets
• Start the forwarder
• Sleep
Forwarder:
• Moves packets from Tx ports to Rx ports
• Pauses a Tx port if the destination Rx port is full
• Wakes up paused Tx ports when something is consumed
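A minimal user-space sketch of this pause/wake-up discipline (the real zOVN bridge implements it inside the hypervisor; queue sizes and types here are illustrative):

```python
import collections
import threading

class RxPort:
    """Output (Rx) port of the vSwitch: a bounded queue with explicit backpressure."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.pkts = collections.deque()
        lock = threading.Lock()
        self.not_full = threading.Condition(lock)   # forwarder sleeps here
        self.not_empty = threading.Condition(lock)  # receiver sleeps here

    def forward_in(self, pkt):
        # Forwarder side: pause (block) while the Rx port is full instead of dropping.
        with self.not_full:
            while len(self.pkts) >= self.capacity:
                self.not_full.wait()
            self.pkts.append(pkt)
            self.not_empty.notify()

    def consume(self):
        # Receiver VM side: consuming a packet frees a slot and wakes the
        # forwarder, which in turn resumes the paused Tx ports.
        with self.not_empty:
            while not self.pkts:
                self.not_empty.wait()
            pkt = self.pkts.popleft()
            self.not_full.notify()
            return pkt

def forwarder(tx_queue, lookup_rx_port):
    # Move packets from a Tx port (a queue.Queue fed by a sender VM) to the
    # destination Rx port; the blocking forward_in() realizes
    # "pause Tx ports if Rx port full".
    while True:
        pkt = tx_queue.get()
        lookup_rx_port(pkt).forward_in(pkt)
```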
Fully Lossless Path
• Both loss locations are now fixed: the vSwitch (between (3) and (4)) and the receive stack (between (5) and (6))
[Diagram: same setup and measurement points (1)–(6) as in the loss-measurement experiment.]
Outline
• Introduction
• Losses in Virtual Networks
• zOVN Architecture
• Evaluation
• Conclusions
Partition/Aggregate Workload
• Problem: TCP incast
  • During the aggregate phase, buffers might overflow
  • For short flows TCP is ineffective: the ACK clock stalls and recovery must rely on timeouts
• Partition/Aggregate traffic stays inside the datacenter, so it is open to optimizations (see the sketch after the diagram below)
[Diagram: an incoming request (1) is partitioned by the master across four workers (2); the workers send their responses back (3) and the master aggregates them into the final answer (4).]
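To make the traffic pattern concrete, here is a minimal sketch of the partition/aggregate request flow (pure illustration; in the testbed each worker is a VM answering over TCP, which is what creates the synchronized incast fan-in):

```python
from concurrent.futures import ThreadPoolExecutor

def worker(shard_id, query):
    # Stand-in for a worker VM computing a partial answer over its data shard.
    return f"partial({shard_id}:{query})"

def partition_aggregate(query, num_workers=4):
    # (1) request arrives at the master, (2) it is partitioned to all workers
    # in parallel, (3) their responses return nearly simultaneously (the incast
    # burst), (4) the master aggregates them into the final answer.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(lambda i: worker(i, query), range(num_workers))
        return " | ".join(partials)

print(partition_aggregate("user=42"))
```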
Testbed Setup
[Diagram: four IBM x3550 M4 servers, each running a vSwitch and 16 VMs, attached by 10G links to an IBM G8264 10G data-network switch and by 1G links to an HP 1810-8G control-network switch.]
• 4 rack servers: 16 physical cores plus HyperThreading, Intel 10G adapters (ixgbe driver)
• 16 VMs per server:
  • 8 VMs for Partition/Aggregate traffic*
  • 8 VMs producing background flows
* as in "DCTCP: Efficient Packet Transport for the Commoditized Data Center", SIGCOMM 2010
Testbed Results (CUBIC)
[Plot: mean completion time [ms] (log scale, 1–1000) versus response size [packets] (log scale, 1–10000) for the four configurations below.]

Configuration  Virtual Network Flow Control  Physical Network Flow Control
LL             No                            No
LZ             No                            Yes
ZL             Yes                           No
ZZ             Yes                           Yes
• Virtual-only flow control (ZL) beats physical-only (LZ): the vSwitch is the primary congestion point, and congestion in the physical switch is negligible
• No improvement from flow control on long flows: long transfers can remain on lossy priorities
Simulation Setup
• Larger topology: 256 servers
  • 4 VMs per server: 3 VMs produce Partition/Aggregate traffic, 1 VM produces background flows
• Assumption: infinite CPU
Simulation Results (64 packets)
• The simulations confirm the findings from the prototype experiments
• (LZ) Physical-only flow control merely shifts the drop point into the virtual network
• (ZZ) Both flow controls are required for the best performance
[Bar chart: mean completion time [ms] (0–45) for NewReno, Vegas, and CUBIC under the four configurations LL, LZ, ZL, ZZ (virtual/physical network flow control off/on, as in the table above).]
Faster CPUs or Faster Networks?
• The loss ratio is governed by the ratio of CPU speed to network speed
TX side:
• A slow CPU coupled with a fast network is desirable
• e.g. a Xeon on a 1G network drops more than a Core2 on the same 1G network
RX side:
• A fast CPU coupled with a slow network is desirable
• e.g. a Xeon on a 10G network drops more than a Xeon on a 1G network
• These requirements conflict: the problem cannot be solved by changing the hardware
• The only solution: add flow control!
Conclusions
• Loss identification and characterization in Overlay Virtual Networks
• First flow-controlled vSwitch for future Overlay Virtual Networks
  • Dirty-slate approach for latency-sensitive applications
  • Un-tuned TCP
  • Commodity 1–10G Ethernet fabric
  • Trivial result replication
  • Orthogonal to other proposals
• Lossless links: an order-of-magnitude completion-time reduction for Partition/Aggregate
Backup
Encapsulation in Overlay Virtual Networks
Workflow
1. The source VM sends a packet to its attached vSwitch.
2. The vSwitch queries the fabric controller for the address of the destination.
3. The controller answers; the vSwitch caches the information.
4. The packet is sent over the physical network, encapsulated with new headers.
5. The packet is decapsulated at the destination vSwitch.
Packet format on the physical network: [outer Eth | outer IP | UDP | encapsulation header | inner Eth | inner IP | TCP | payload]
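The slide does not spell out the encapsulation format; as an illustration of step 4, the sketch below builds a VXLAN-style 8-byte encapsulation header (an assumption, since zOVN works with tunnel formats of this kind) and prepends it to the VM's original frame. The outer Ethernet/IP/UDP headers are then added by the host's own stack.

```python
import struct

VXLAN_UDP_PORT = 4789   # IANA-assigned UDP port for VXLAN-style tunnels

def encap_header(vni):
    # 8-byte VXLAN-style header: flags byte (0x08 = "VNI valid") plus reserved
    # bits, then the 24-bit virtual network identifier and a reserved byte.
    return struct.pack("!II", 0x08 << 24, (vni & 0xFFFFFF) << 8)

def encapsulate(inner_frame, vni):
    # Step (4): the tenant's original Eth|IP|TCP|payload frame becomes the
    # UDP payload carried across the physical datacenter network.
    return encap_header(vni) + inner_frame

# Example: wrap a dummy inner frame for tenant network 42.
udp_payload = encapsulate(b"\x00" * 60, vni=42)
```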
[Diagram: source server (VMs attached to a vSwitch with an address cache), fabric controller, physical network, and destination server, with steps (1)–(5) of the workflow marked on the corresponding components.]