mSwitch: A Highly-Scalable, Modular Software Switch
TRANSCRIPT
mSwitch: A Highly-Scalable, Modular Software Switch
Michio Honda (NetApp)*
Felipe Huici (NEC), Giuseppe Lettieri and Luigi Rizzo (Università di Pisa)
ACM SOSR'15, June 17
* this work was mostly done at NEC
Motivation
• Software switches are important
  – Interconnection between VMs/containers and NICs
  – Middleboxes, SDN, NFV
• Requirements
  – Throughput (e.g., 10 Gbps)
  – Scalability (e.g., 100 ports)
  – Flexibility (e.g., forwarding decision, packet modification)
  – CPU efficiency (e.g., allocate as many CPU resources to VMs as possible)
Are existing software switches able to meet these requirements?
Software Switch
[Diagram: VMs connected to NICs through a software switch]
Existing Software Switches
• OS-standard ones don't provide high throughput
• High-throughput ones lack port scalability and/or flexibility
[Figure: throughput (Gbps) vs. packet size (bytes, excluding CRC) for the FreeBSD bridge, Linux bridge, Open vSwitch, DPDK vSwitch and VALE]
while forwarding packets at high rates using DPDK vSwitch or VALE (the other switches do not yield high throughput or are not publicly available). In terms of CPU usage, the fundamental feature of DPDK vSwitch, and indeed, of any DPDK-based package, is that DPDK's poll-mode driver results in 100% utilization irrespective of the traffic rates being processed. In contrast, VALE relies on interrupts, so that user processes are woken up only on packet arrival. In our experiments, for the 10 CPU cores handling packets this results in a cumulative CPU utilization of about 140% for mSwitch, which also adopts an interrupt-based model, and a much higher but expected 1,000% for DPDK vSwitch (the full results for these experiments are in section 4.2).

High Density: Despite its high throughput, VALE, as we will show in section 4, scales poorly when packets are forwarded to an increasing number of ports, and the throughput further drops when packets from multiple senders are sent to a common destination port; both of these are common scenarios for a back-end virtualization switch containing a single NIC and multiple VMs.
For DPDK vSwitch, its requirement of having a core dedicated to each port limits its density. While it is possible to have around 62-78 or so cores on a system (e.g., 4 AMD CPU packages with 16 cores each, minus a couple of cores for the control daemon and operating system, or 4 Intel 10-core CPUs with hyper-threading enabled), that type of hardware represents an expensive proposition, and ultimately it may not make sense to have to add a CPU core just to be able to connect an additional VM or process to the switch. Finally, CuckooSwitch targets physical NICs (i.e., no virtual ports), so the experiments presented in that paper are limited to 8 ports total.

Flexibility: Most of the software switches currently available do not expressly target a flexible forwarding plane, limiting themselves to L2 forwarding. This is the case for the standard FreeBSD and Linux bridges, but also for newer systems such as VALE and CuckooSwitch. Instead, Open vSwitch supports the OpenFlow protocol, and as such provides the ability to match packets against a fairly comprehensive number of packet headers, and to apply actions to matching packets. However, as shown in figure 1 and in [19], Open vSwitch does not yield high throughput.
                 Throughput   CPU Usage   Density   Flexibility
FreeBSD switch       ✗            ✓          ✓           ✗
Linux switch         ✗            ✓          ✓           ✗
Open vSwitch         ✗            ✓          ✓           ✓
Hyper-Switch         ✗            ✓          ✗           ✓
DPDK vSwitch         ✓            ✗          ✗           ✓
CuckooSwitch         ✓            ✗          ✗           ✗
VALE                 ✓            ✓          ✗           ✗

Table 1. Characteristics of software switches with respect to throughput, CPU usage, port density and flexibility.
DPDK vSwitch takes the Open vSwitch code base and accelerates it through the use of the DPDK packet framework.
DPDK itself introduces a completely different, non-POSIX programming environment, making it difficult to adapt existing code to it. For DPDK vSwitch, this means that every Open vSwitch code release must be manually adapted to work within the DPDK vSwitch framework. In contrast, in section 5 we show how using mSwitch and applying a few, one-time code changes to Open vSwitch results in a 2.6-3 times performance boost.

Summary: Table 1 summarizes the characteristics of each of the currently available software switches with respect to the stated requirements; none of them simultaneously meets them.
3. mSwitch Design

Towards our goal of implementing a software switch with high throughput, reasonable CPU utilization, high port density and a flexible data plane, and taking into consideration the analysis of the problem space in the previous section, we can start to see a number of design principles.
First, in terms of throughput, there is no need to reinvent the wheel: several existing switches yield excellent performance, and we can leverage the techniques they use, such as packet batching [2, 6, 17, 18], lightweight packet representation [6, 9, 17] and optimized memory copies [9, 16, 18], to achieve this.
In addition, to obtain relatively low CPU utilization and flexible core assignment we should opt for an interrupt-based model, such that idle ports do not unnecessarily consume cycles that can be better spent by active processes or VMs. This is crucial if the switch is to act as a back-end, and has the added benefit of reducing the system's overall power consumption.
Further, we should design a forwarding algorithm that is lightweight and that, ideally, scales linearly with the number of ports on the switch; this would allow us to reach higher port densities than current software switches are capable of. Moreover, for a back-end switch muxing packets from a large number of sending virtual ports to a common destination port (e.g., a NIC), it is imperative that the forwarding algorithm is able to handle this incast problem efficiently.
Finally, the switch's data plane should be programmable while ensuring that this mechanism does not harm the system's ability to quickly switch packets between ports. This points towards a split between highly optimized switch code in charge of switching packets, and user-provided code to decide destination ports and potentially modify or filter packets.
3.1 Starting Point

Having identified a set of design principles, the next question is whether we should base a solution on one of the existing switches previously mentioned, or start from scratch. The Linux, FreeBSD and Open vSwitch switches are non-starters since they are not able to process packets with high throughput.
mSwitch Design Decisions

• Separation into fabric and logic
  – Fabric: switches packets between ports
  – Logic: modular forwarding decisions
[Diagram: the switching fabric runs in the kernel, connecting the NIC, the OS stack (socket API, apps) and virtual ports; apps/VMs attach to virtual ports through the netmap API; the switching logic plugs into the fabric]
• Interrupt model
  – Efficient, flexible CPU utilization
• Runs in the kernel
  – To efficiently handle interrupts
  – Integration with OS subsystems (network stack, device drivers, etc.)
• Separate, per-port packet buffers
  – Isolation
  – Copying is inexpensive anyway
• Output queue: reserve destination buffers with a lock, then copy packets without the lock
  – Concurrent senders can perform the copy in parallel
Scalable Packet Switching Algorithms
• Input queue: group packets for each destination port before forwarding
  – For a batch of input packets, lock each destination port and access its device register only once
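In plain C, this grouping step might be sketched as follows (a simplified user-space illustration; `struct batch_index`, `group_by_dest` and the size constants are hypothetical names, not mSwitch's actual code). A single pass over the batch chains packet indices per destination port, so the forwarding loop can then lock each destination and touch its device register exactly once per batch:

```c
#include <assert.h>

#define MAX_PORTS 64   /* assumed switch port count */
#define BATCH     32   /* assumed batch size */

/* Per-batch index: for each destination port, a chain of the
 * packet indices in the batch that are headed to it. */
struct batch_index {
    int head[MAX_PORTS];   /* first packet index per destination, -1 = none */
    int next[BATCH];       /* next packet index with the same destination */
};

/* Single pass over the batch: chain packets per destination port. */
static void group_by_dest(const unsigned *dst, int n, struct batch_index *bi)
{
    int tail[MAX_PORTS];

    for (int p = 0; p < MAX_PORTS; p++)
        bi->head[p] = tail[p] = -1;
    for (int i = 0; i < n; i++) {
        unsigned d = dst[i];

        bi->next[i] = -1;
        if (bi->head[d] < 0)
            bi->head[d] = i;        /* first packet for this port */
        else
            bi->next[tail[d]] = i;  /* append to the port's chain */
        tail[d] = i;
    }
}
```

The forwarding loop would then walk `head[d]`/`next[]` once per non-empty destination, amortizing the lock acquisition and device-register access over the whole batch.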
[Diagram: two senders copying packets concurrently into a shared output queue]
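The two-phase output queue described above can be sketched as follows (a minimal user-space illustration with hypothetical names, not the actual mSwitch implementation). The destination ring's lock is held only long enough to reserve a contiguous range of slots; the packet copies then happen outside the critical section, so concurrent senders copy in parallel:

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define RING_SLOTS 256   /* assumed ring size */
#define SLOT_SIZE  2048  /* assumed per-slot buffer size */

struct out_ring {
    pthread_mutex_t lock;
    unsigned tail;                        /* next free slot index */
    char slots[RING_SLOTS][SLOT_SIZE];
};

/* Phase 1: under the lock, reserve n contiguous slots for this sender. */
static int reserve_slots(struct out_ring *r, unsigned n, unsigned *first)
{
    int ok = 0;

    pthread_mutex_lock(&r->lock);
    if (r->tail + n <= RING_SLOTS) {
        *first = r->tail;
        r->tail += n;
        ok = 1;
    }
    pthread_mutex_unlock(&r->lock);
    return ok;
}

/* Phase 2: copy without holding the lock; the reserved slots are privately
 * owned, so other senders can reserve and copy concurrently. */
static void copy_packet(struct out_ring *r, unsigned slot,
                        const void *pkt, size_t len)
{
    memcpy(r->slots[slot], pkt, len < SLOT_SIZE ? len : SLOT_SIZE);
}
```

Because the critical section covers only the slot reservation, the (comparatively expensive) memory copies of multiple senders overlap instead of serializing on the destination port.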
Modular Switching Logic
• Switching logic is implemented as separate kernel modules that implement a lookup function
  – The return value indicates a destination switch port index, drop or broadcast
  – L2 learning is the default, but the logic can be changed at any time while the switch is running
A Full mSwitch Module
u_int
my_lookup(u_char *buf, const struct net_device *dev)
{
	struct ether_hdr *eh;

	eh = (struct ether_hdr *)buf;
	/* least significant byte */
	return eh->ether_dst[0];
}
CPU Utilization
• mSwitch efficiently utilizes CPUs
[Figure: throughput (Gbps) and cumulative CPU utilization (%) vs. number of destination virtual ports (1-9), for mSwitch and DPDK vSwitch; one CPU core for the NIC and one per virtual port]
Port Scalability
• mSwitch scales to many ports
[Figure: throughput (Gbps) and CPU utilization (%) vs. number of destination virtual ports (1-120), for mSwitch and VALE, showing both NIC-CPU and App-CPU utilization; one CPU core for the NIC and another one for all virtual ports]
mSwitch Module Use Cases
[Diagrams of three module use cases: (1) the Open vSwitch datapath module switching between a NIC and VMs; (2) a UDP/TCP port filter directing traffic (e.g., TCP 80 and 443, or UDP/TCP 5004) to middleboxes on virtual ports; (3) a 3-tuple mux/demux serving user-space stacks (e.g., TCP 80, TCP 53) alongside the OS stack (e.g., TCP 22)]
• Accelerated Open vSwitch datapath
  – 3x speedup
• Filtering for virtualized middleboxes
  – Efficiently directs relevant packets to middleboxes on virtual ports
• Support for user-space protocol stacks
  – With isolation
  – Can still use the OS's stack
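As an illustration, the port-filter logic for the middlebox case above (TCP 80 and 443) might look like the following lookup function (a user-space sketch; the header structs, port indices and constants are assumptions for illustration, not mSwitch's actual module):

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

#define PORT_NIC       0u  /* assumed index of the NIC switch port */
#define PORT_MIDDLEBOX 1u  /* assumed index of the middlebox virtual port */

struct eth_hdr { uint8_t dst[6], src[6]; uint16_t type; } __attribute__((packed));
struct ip4_hdr {
    uint8_t  vhl, tos;
    uint16_t len, id, off;
    uint8_t  ttl, proto;
    uint16_t sum;
    uint32_t src, dst;
} __attribute__((packed));
struct tcp_hdr { uint16_t sport, dport; } __attribute__((packed));

/* Steer TCP traffic for ports 80 and 443 to the middlebox port;
 * everything else continues on to the NIC. */
static unsigned filter_lookup(const uint8_t *buf)
{
    const struct eth_hdr *eh = (const void *)buf;

    if (ntohs(eh->type) != 0x0800)                 /* not IPv4 */
        return PORT_NIC;
    const struct ip4_hdr *ih = (const void *)(buf + sizeof(*eh));
    if (ih->proto != 6)                            /* not TCP */
        return PORT_NIC;
    const struct tcp_hdr *th =
        (const void *)((const uint8_t *)ih + (ih->vhl & 0x0f) * 4);
    uint16_t dport = ntohs(th->dport);
    return (dport == 80 || dport == 443) ? PORT_MIDDLEBOX : PORT_NIC;
}
```

Like the `my_lookup` example earlier, the module only decides the destination port index; the fabric does the actual switching.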
Conclusion

• A highly-scalable, modular software switch
  – Higher scalability and flexibility compared to DPDK vSwitch and VALE
• Already integrated into the netmap/VALE implementation
  – https://code.google.com/p/netmap/
  – Upstreamed into FreeBSD, works in Linux
• All the modules (e.g., Open vSwitch acceleration) are publicly available
  – https://github.com/cnplab
• The paper is open access:
  – http://web.sfc.wide.ad.jp/~micchie/papers/a1-honda-sosr15.pdf
• Other papers using mSwitch:
  – Martins et al., "ClickOS and the art of network function virtualization", USENIX NSDI'14
  – Honda et al., "Rekindling network protocol innovation with user-level stacks", ACM CCR 2014
Module complexity and performance
70 modified lines) to hook the Open vSwitch code to the mSwitch switching logic. In essence, mSwitch-OVS replaces Open vSwitch's datapath, which normally uses Linux's standard packet I/O, with mSwitch's fast packet I/O. As a result, we can avoid expensive, per-packet sk_buff allocations and deallocations.
[Figure: throughput (Gbps) vs. packet size (bytes) for OVS and mSwitch-OVS, (a) between NICs and (b) between virtual ports]

Figure 15: Throughput of mSwitch's Open vSwitch module (mSwitch-OVS) as opposed to that of standard Open vSwitch (OVS) on a single CPU core. Measurements are done when forwarding between NICs (left) and between virtual ports (right). In the latter, for standard Open vSwitch we use tap devices.
The results when forwarding between NICs using a single CPU core (Figure 15(a)) show that with relatively few small changes to Open vSwitch, mSwitch is able to achieve important throughput improvements: for small packets, we notice a 2.6-3x speed-up. The difference is also large when forwarding between virtual ports (Figure 15(b)), although part of those gains are certainly due to the presence of slow tap devices for Open vSwitch.
5.5 Module Complexity

As the final evaluation experiment, we look into how expensive the various modules are with respect to CPU frequency. Figure 16 summarizes how the throughput of mSwitch is affected by the complexity of the switching logic for minimum-sized packets and different CPU frequencies. As shown, hash-based functions (learning bridge or 3-tuple filter) are relatively inexpensive and do not significantly impact the throughput of the system. The middlebox filter is even cheaper, since it does not incur the cost of doing a hash look-up.
Conversely, Open vSwitch processing is much more CPU intensive, because OpenFlow performs packet matching against several header fields across different layers; the result is reflected in a much lower forwarding rate, and also an almost linear curve even at the highest clock frequencies.
6. GENERALITY AND LESSONS LEARNED

Through the process of designing, implementing and experimenting with mSwitch we have learned a number of lessons, as well as developed techniques that we believe are general and thus applicable to other software switch packages:
[Figure: throughput (Mpps) vs. CPU clock frequency (1.2-3.2 GHz) for the Baseline, Filter, L2 learn, 3-tuple and mSwitch-OVS modules]
Figure 16: Throughput comparison between different mSwitch modules for 60-byte packets.
• Interrupt vs. Polling Model: Using a polling model can yield some throughput gains with respect to an interrupt-based one, but at the cost of much higher CPU utilization. For a dedicated platform (e.g., CuckooSwitch, which uses the server it runs on solely as a hardware-switch replacement) this may not matter so much, but for a system seeking to run a software switch as a back-end to processes, containers, or virtual machines, an interrupt-based model (or a hybrid one such as NAPI) is more efficient and spares cycles that those processes can use. Either way, high CPU utilization equates to higher energy consumption, which is always undesirable. We also showed that the latency penalty arising from the use of an interrupt model is negligible for OpenFlow packet matching (Section 4.5).
• Data Plane Decoupling: Logically separating mSwitch into a fast, optimized switching fabric and a specialized, modular switching logic achieves the best of both worlds: the specialization required to reach high performance, with the flexibility and ease of development typically found in general packet processing systems (Section 3.4).
• High Port Density: The algorithm presented in Section 3.3 permits the implementation of a software switch with both high port density and high performance. The algorithm is not particular to mSwitch, and so it can be applied to other software packages.
• Destination Port Parallelism: Given the prevalence of virtualization technologies, it is increasingly common to have multiple sources in a software switch (e.g., containers or VMs) needing to concurrently access a shared destination port (e.g., a NIC). The algorithm described in Section 3.3 yields high performance under these scenarios and is generic, so it is applicable to other systems.
• Huge Packets: Figure 6 suggests that huge packets provide significant advantages for the switching plane in a high performance switch. Supporting these packets is important when interacting with entities (e.g., virtual machines) which have massive per-packet overheads (e.g., see [18]).
• Zero-Copy Client Buffers: Applications or virtual machines connected to ports on a software switch are likely to assemble packets in their own buffers, different from those of the underlying switch. To prevent costly transformations and memory copies, the switch should allow such clients to store output packets in their own buffers (Section 3.5).
• Application Model: How should functionality be
Measurement results for minimum-sized packet forwarding between two NICs