A.A. 2010-2011
Massimo RIMONDINI
RCNG – 02/11/10
Technologies for the Virtualization of Computer Networks
Virtualization
HA! HA! Easy!
Virtualization
D’Oh!
Virtualization
“...the act of decoupling the (logical) service from its (physical) realization...”
“execution of software in an environment separated from the underlying hardware resources”
“sufficiently complete simulation of the underlying hardware to allow software, typically a guest operating system, to run unmodified”
“complete simulation of the underlying hardware”
Virtualization
Full virtualization (emulation)
Partial virtualization
Paravirtualization (OS-assisted virtualization)
Hardware-assisted virtualization
OS-Level virtualization
Full Virtualization
Emulation of a fully fledged hardware box (e.g., x86)
Binary translation
For non-virtualizable instructions (which have different semantics in rings ≠ 0)
Direct execution
For performance
VirtualBox, Parallels, ~VMware, Microsoft Virtual PC, QEMU, Bochs
Picture from Understanding Full Virtualization, Paravirtualization, and Hardware Assist. White Paper. VMware.
(Emulation)
Partial Virtualization
E.g., address space virtualization
Supports multiple instances of a specific hardware device
Does not support running a guest OS
FreeBSD network stack virtualization project, IBM M44/44X
(More) of historical interest
Address space “virtualization” is a basic component in modern OSs
Paravirtualization (OS-assisted virtualization)
VMM / Hypervisor
Guest OS communicates with hypervisor
Changes to guest OS (to prevent non-virtualizable instructions from contacting bare metal)
Better performance
Support for hardware-assisted virtualization
Xen, VMware, Microsoft Hyper-V, Oracle VM Server for SPARC, VirtualBox
Picture from Understanding Full Virtualization, Paravirtualization, and Hardware Assist. White Paper. VMware.
Hypervisor (and VMM)
Hypervisor
Type 1 (native): runs on bare metal; loads prior to the OS (Microsoft Hyper-V, VMware vSphere)
Type 2 (hosted): runs within a conventional OS
Virtual Machine Monitor
Same as hypervisor (?)
Picture from Understanding Full Virtualization, Paravirtualization, and Hardware Assist. White Paper. VMware.
Transparent Paravirtualization
Photo credit goes to Flickr user Alexy.
Huh?
Transparent Paravirtualization: Virtual Machine Interface (VMI)
Single VMI-compliant guest kernel
VMI calls may have two implementations
inline native instructions (run on bare metal)
indirect calls to hypervisor
paravirt-ops
IBM+VMware+Red Hat+XenSource
Part of Linux kernel since 2.6.20
Picture from Understanding Full Virtualization, Paravirtualization, and Hardware Assist. White Paper. VMware.
Hardware-assisted Virtualization
Hypervisor runs below Ring 0
Sensitive calls are automatically trapped to the hypervisor
Effective guest isolation
AMD-V (Pacifica), Intel VT-x (Vanderpool)
VirtualBox, KVM, Microsoft Virtual PC, Xen, Parallels, ...
Picture from Understanding Full Virtualization, Paravirtualization, and Hardware Assist. White Paper. VMware.
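A quick check, not from the slides and Linux-specific: hardware virtualization support shows up as CPU feature flags, vmx for Intel VT-x and svm for AMD-V.

```shell
# Look for hardware virtualization support in the CPU feature list
egrep '(vmx|svm)' /proc/cpuinfo
```

If the command prints nothing, the CPU (or the BIOS configuration) does not expose hardware-assisted virtualization.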
OS-Level Virtualization
Single OS/kernel
Actually isolation (of contexts), not virtualization
No emulation overhead
Requires host kernel patch
Share same system call interface
Limits the set of runnable guests
Processes in a virtual server are regular processes on the host
Resources (e.g., memory) can be requested at runtime
Linux VServer, Parallels Virtuozzo Containers, OpenVZ, Solaris Containers, FreeBSD Jails; to a certain extent, UMview, UML
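As an illustrative sketch of OS-level virtualization in practice (container ID and template name are invented, assuming an OpenVZ host):

```shell
# Create a container from an OS template, then start it
vzctl create 101 --ostemplate centos-5
vzctl start 101
# Processes inside the container are ordinary processes on the host
vzctl exec 101 ps aux
```

Note how no guest kernel is booted: the container shares the host kernel and only its context (filesystem, network, process table) is isolated.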
Technique                         | Able to run guest OS | Unmodified guest | Unmodified host | Overhead                     | Flexibility
Full virtualization (emulation)   | ✓                    | ✓                | Depends         | High                         | Limited
Partial virtualization            | ✗                    | ✓                | ✓               | Low                          | Limited
Paravirtualization (OS-assisted)  | ✓                    | ✗                | ✗               | Low                          | High
Hardware-assisted virtualization  | ✓                    | ✓                | ✓               | Mostly offloaded to hardware | Average
OS-level virtualization           | Almost               | ✗                | ✗               | Low                          | High
Which virtualization for networking?
Requirements
Depend much on the context
Operational network
Experimentation
Anyway...
Performance and scalability
Flexibility
Configuration
Programmability (for development)
Support for mobility
Strong isolation
Ahem.... usability
Tools for Managing Virtual Network Scenarios
Netkit
Roma Tre University
VM engine: UML
VM interconnection: uml_switch
Core: shell scripts
Routing engine: Quagga, XORP
Lab description: (mostly) uses native router language
Lightweight
Easy-to-share labs
Several networking technologies, including MPLS forwarding
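As a hedged sketch of the lab format (machine and collision-domain names are invented for illustration), a minimal two-host Netkit lab is little more than a lab.conf plus per-machine startup files:

```ini
# lab.conf: interface eth0 of both machines attached to collision domain A
pc1[0]=A
pc2[0]=A
```

Each machine can then get a pcX.startup file containing ordinary shell commands (e.g., ifconfig eth0 10.0.0.1 netmask 255.255.255.0 up), and lstart boots the whole lab.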
Netkit
VNUML
Universidad Politécnica de Madrid, Telefónica I+D
VM engine: UML
VM interconnection: uml_switch
Core: python + perl scripts
Routing engine: Quagga
Lab description: XML
Build, then play
Support for distributed emulation (segmentation)
Round robin
Weighted (by CPU load before deploy) round robin
VNUML
H: ssh, scp, rsync
W: SNMP, telnet, TFTP
VNUML
Backplane: one or more 802.1Q-compliant switches
Host-switch and switch-switch connections are trunks
Marionnet
Université Paris
VM engine: UML
VM interconnection: uml_switch, vde
Core: OCaml
Routing engine: Quagga
Lab description: GUI (dot-based layout)
Ability to export project file for reuse
Network impairments (delay, loss → unidirectional links, bandwidth, flipped bits)
Switch status leds
Marionnet
Introducing indirection to support...
...stable endpoints
...port “defects”
GINI
McGill University
VM engine: UML
VM interconnection: customized uml_switch, implementation of a wireless channel
Core: C + python
Routing engine: custom implementation (compliant with the Linux TCP/IP stack)
Lab description: GUI, XML
Integrated task manager to start/stop nodes
Real-time performance plots
Cloonix
Vincent Perrier
VM engine: UML, (QEMU+)KVM, OpenWrt
VM interconnection: improved uml_switch
Core: C
Routing engine: N/A
Lab description: custom markup
Several customizations and hacks
Ability to plot the value of any kernel variable
Switch supports tcp sockets, real-time configuration with XML messages
Built-on-the-fly CDROM image for machine differentiation
IMUNES
University of Zagreb
VM engine: N/A (stack virtualization)
VM interconnection: N/A
Core: N/A
Routing engine: N/A
GUI
Based on FreeBSD VirtNET
IMUNES: VirtNET network state replication
In a vimage it is possible to
configure network interfaces
open sockets
run processes
(in some way) similar to UMview
Virtual Routers
DynaMIPS: Christophe Fillot (University of Technology of Compiègne)
Supported platforms (as of v0.2.7)
Cisco 7200 (NPE-100 to NPE-400)
Cisco 3600 (3620, 3640 and 3660)
Cisco 2691
Cisco 3725
Cisco 3745
No acceleration
CPU idle times must be tuned
“Of course, this emulator cannot replace a real router: you should be able to get a performance of about 1 kpps [...], to be compared to the 100 kpps delivered by a NPE-100 [...]. So, it is simply a complementary tool to real labs for administrators of Cisco networks or people wanting to pass their CCNA/CCNP/CCIE exams.”
DynaMIPS: fully virtualized hardware
ROM, RAM, NVRAM
Chassis
Console, AUX
PCMCIA ATA disks
ATM/Frame Relay/Ethernet virtual switch between emulator instances
Port adapters, network modules
Interface binding
UNIX socket
VDE
tap
host interface (optionally via libpcap)
UDP port
A few opcodes are missing (mostly FPU)
Can manage multiple instances (“hypervisor” mode)
Development stalled, but still a milestone
DynaMIPS
Dynagen: a Python frontend
Dynagui
GNS3
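A Dynagen lab is described in an INI-like .net file; a minimal sketch (the IOS image path is a placeholder, and the interface binding follows the f0/0 = R2 f0/0 convention):

```ini
[localhost]

    [[7200]]
    image = /opt/ios/c7200-image.bin
    npe = npe-400
    ram = 160

    [[ROUTER R1]]
    f0/0 = R2 f0/0

    [[ROUTER R2]]
    # no interface lines needed: the binding above is bidirectional
```

Dynagen then talks to the DynaMIPS "hypervisor" process to instantiate and wire the emulated routers.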
So, Performance and Scalability are 2 Bugaboos...
Carrier-grade equipment: 40 Gbps to 92 Tbps
Per-packet processing capability must scale with O(line_rate)
Aggregate switching capability must scale with O(port_count × line_rate)
But software routers need not run on a single server: RouteBricks
Columbia+Intel+UCLA+Berkeley+... (a 10-author paper!)
Click-based
Software routers: 1 Gbps to 3 Gbps
RouteBricks
Cluster router architecture
Parallelism across servers
Nodes can make independent decisions on a subset of the overall traffic
Parallelism within servers
CPU
I/O, memory
RouteBricks
Intra-cluster routing
Valiant Load Balancing (VLB): source → random intermediate node → destination
Randomizes input traffic
No centralized scheduling
Beware of reordering
Topology
Full mesh: not feasible (server fanout is limited!)
Commodity Ethernet switches: not viable!
Missing load-sensitive routing features
Cost
Tori and butterflies
RouteBricks: butterfly topology
2-ary 4-fly
RouteBricks
Experimental setup
4 Intel Xeon servers (Nehalem microarchitecture)
one 10 Gbps external line each
Full-mesh topology
Bottleneck
64-byte packets: <19Mpps sustained
Caused by the CPU: per-byte CPU load is higher for smaller packets
Programmability
NIC driver
2 Click elements
Click: UCLA
A modular software router
UNIX-pipe-like composition
Implemented as a Linux kernel extension
333,000 64-byte packets per second on a 700 MHz Pentium III
Element
Connection
Click
Element: a packet processing module performing a simple computation (e.g., decrease TTL, routing table lookup, packet queueing, etc.)
A C++ object with a state
Has multiple input/output ports
May have configuration settings
May export an interface (e.g., method to report queue length)
Click
Connection: a packet handoff path
Push processing: for unsolicited packets (e.g., from a device)
Pull processing: for packet schedulers
Only 1 connection per port
Element ports can be push (black), pull (white), or agnostic (outline)
Push and pull are not mixable!
Example: simple router queue
Click
Note: queues do not live in ports: they are objects
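The simple router queue above can be sketched as a Click configuration (a minimal illustration using standard Click elements; the device names are examples):

```click
// FromDevice pushes packets into the Queue;
// ToDevice pulls them out on its own schedule
FromDevice(eth0) -> Queue(1024) -> ToDevice(eth1);
```

The Queue element is where the push path ends and the pull path begins, which is exactly why it must be an explicit element rather than state hidden inside a port.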
Scheduling
Single thread (a multithreaded implementation is available)
Scheduling unit: element
Scheduling order: according to the packet’s path along the graph
Elements may be implicitly scheduled when their push/pull methods are called
Elements may have timers
Click: flow-based contexts
Answer the questions: “If I were to emit a packet on my second output...
...where might it go?”
...which Queues might it encounter?”
...and it stopped at the first Queue it encountered, where might it stop?”
Click
Declarative language:

src :: FromDevice(eth0);
ctr :: Counter;
sink :: Discard;
src -> ctr;
ctr -> sink;

or

FromDevice(eth0) -> Counter -> Discard;
Click
The Element class has ~20 virtual functions
Most default implementations are fine
Just override push, pull, and run_scheduled. Sample transparent element:

class NullElement: public Element {
 public:
  NullElement() { add_input(); add_output(); }
  const char *class_name() const { return "Null"; }
  NullElement *clone() const { return new NullElement; }
  const char *processing() const { return AGNOSTIC; }
  void push(int port, Packet *p) { output(0).push(p); }
  Packet *pull(int port) { return input(0).pull(); }
};
Click
Fine-grained elements are preferred
Not easy with BGP
Shared structures (e.g., routing tables)
Incorporated into the packet forwarding path
Example: IP Router
See page 12 of [Kohler]...
Virtual Switches
“Ordinary” Virtual Switch
A piece of software working at layer 2/3, inside the hypervisor or the hardware management layer
Can we do any better?
If we are in a virtual scenario, the network layer can...
...know about existing hosts (MAC/IP addresses) and their “movements”
...know interface operational mode (e.g., promiscuous)
...know about multicast memberships
...handle a flat topology (all the nodes are leaves, therefore we do not need STP)
...know the OS run by virtual machines (and, for example, run an OS-aware Deep Packet Inspection)
...support migration outside of a single subnet
VMware’s Approach
vNetwork Distributed Switch
“VMware’s next generation virtual networking solution for spanning multiple hosts with a single virtual switch representation”
Private VLANs (restrict communication between virtual machines on the same VLAN)
Network VMotion—tracking of VM networking state
3rd Party Virtual Switch support (Cisco Nexus 1000V Series Virtual Switch)
Bi-directional traffic shaping
VMware’s Approach
VEPA
Virtual Ethernet Port Aggregator
An IEEE standard proposal
Idea: off-load switching activities from hypervisor-based virtual switches to physical switches
VEPA+OVF: the future?
OVF (Open Virtualization Format): metadata describing a virtual machine
June 2009: Linux kernel patch for VEPA support
Tagged (VEPA-aware switch) and tagless (VEPA-agnostic switch that bounces packets back) variants
VEPA
From Hudson, Congdon. Tag-less Virtual Ethernet Port Aggregator (VEPA) Proposal. 2009.
OpenvSwitch
Citrix Systems, Nicira, Intel, NEC, Google...
Virtualization layer no longer (just) Ethernet-based
OpenvSwitch
QoS, tunneling, filtering
Tunneling supports transparent inter-subnet migration without breaking transport sessions
Interface status migration (e.g., to bind policies to interfaces tightly)
Support for VLANs and GRE tunnels (intelligent forwarding)
All VMs on a single host: VPNs within a single OpenvSwitch instance
All VMs on the same LAN: implement VPNs by VLANs
VMs on different subnets: implement VPNs by GRE tunnels
Operates in the hypervisor (dom0 in Xen)
Offers connectivity between VMs and physical interfaces on the host
Forwarding state and runtime configuration can be altered by some programming interfaces
VEPA compatible (the control layer is able to manage a VEPA-enabled switch)
OpenvSwitch
Configuration options
Port mirroring (SPAN, RSPAN – a variant of SPAN that allows remote monitoring)
QoS policies
NetFlow
Bonding
Unified CLI for managing distributed switches
Exports interfaces compatible with VDE and Linux bridges
Forwarding
Ability to manipulate the forwarding table (to support status migration, where “status”=flow counters, ACLs, tunnels, etc.)
Packet processing based on layer 2/3/4 headers. Actions: forward (to ≥1 ports), drop, en/decapsulate
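As a hedged illustration of the management interface (bridge, port, and address names are examples), a GRE tunnel towards another host can be configured with ovs-vsctl:

```shell
# Create a bridge and attach a physical port to it
ovs-vsctl add-br br0
ovs-vsctl add-port br0 eth0
# Add a GRE tunnel port towards a remote Open vSwitch instance
ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre options:remote_ip=192.168.1.10
```

This is the mechanism behind the inter-subnet VM migration scenario: traffic between VMs on different subnets rides the GRE tunnel transparently.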
OpenvSwitch
Forwarding paths
Fast path (kernel space)
forwarding engine
small code base (portability, performance, hardware implementation)
performance comparable to the Linux Ethernet bridge (pure kernel space, MAC-based forwarding only)
Slow path (user space)
forwarding logic (MAC learning, load balancing), remote management (NetFlow, OpenFlow)
VDE
University of Bologna
Virtual Distributed Ethernet: the Swiss army knife of emulated networks
A general VPN
A mobility support technology
A tool for network testing
A reconfigurable overlay
A privacy-preserving layer
...
VDE
vde_switch vde_switch
vde_cable
vde_switch
MAC address learning (with aging)
Ability to operate as a hub
VLANs
Fast STP
Can be connected to a tap interface
vde_cable
Interconnects vde_switches
Does not exist...
vde_plug
VDE: vde_cable = vde_plug + dpipe
vde_plug
socket → stdout
stdin → socket
dpipe
bidirectional pipe: stdout of the first command → stdin of the second, and vice versa
vde_switch
vde_plug
UNIX socket
dpipe
VDE
vde_switch UNIX socket
dpipe
dpipe vde_plug /tmp/vde1.ctl = vde_plug /tmp/vde2.ctl
vde_switch UNIX socket
vde_switch UNIX socket
dpipe+ssh
dpipe vde_plug /tmp/vde.ctl = ssh [email protected] vde_plug /tmp/vde_remote.ctl
vde_switch UNIX socket
VDE
vde_switch UNIX socket
vde_cryptcab
host1$ vde_cryptcab -s /tmp/vde.ctl -p 12000
host2$ vde_cryptcab -s /tmp/vde_local.ctl -c username@host1:12000
vde_switch UNIX socket
dpipe+ssh is not a very good solution...
Dropping ssh is not advisable
With ssh, we may experience interference between congestion control algorithms
vde_cryptcab (UDP-based)
VDE
With ssh, we may experience interference between congestion control algorithms: the emulated traffic's TCP connections are carried over the ssh TCP connection, and the two retransmission/congestion-control loops interact badly (TCP over TCP)
VDE
vde_plug2tap --daemon -s /tmp/myvde.ctl tap0
vde_plug2tap
Direct connection from vde_switch to host tap interface
The same connection can also be established when creating the vde_switch
UNIX socket
tap0
VDE
slirpvde
Internet connection
No root privileges
Connections are regenerated (a sort of masquerading)
Provides DHCP
UNIX socket
slirpvde -d -s /tmp/vde.ctl -dhcp
VDE
UNIX socket
dpipe
dpipe vde_plug /tmp/vde1.ctl = wirefilter -M /tmp/wiremgmt = vde_plug /tmp/vde2.ctl
UNIX socket
wirefilter
dpipe
wirefilter: emulates real cable features (runtime tunable)
% lost packets, delay, duplication, bandwidth, interface speed,
max packet queue capacity, MTU, bit corruption, reordering
VDE
Software that supports VDE as a userspace switch
VirtualBox
QEMU/KVM (with wrappers)
UML
OpenvSwitch
DynaMIPS
...
Testbeds
PlanetLab
A collection of machines distributed over the globe
Hosted by research institutions
Only accessible by organizations in the Consortium (no fee for academic institutions)
Each machine runs a software package (PLC – PlanetLab Central)
Includes a Linux-based OS
Node bootstrapping
Management, monitoring, and auditing tools
Supports distributed virtualization (slicing); uses VNET(+) for traffic isolation between slices
Stand-alone version for private use: MyPLC
What would the network administrator at your organization say about the experiment running on your local site?
Continuously running experiments:
Active network probing (within netiquette):
Service disruption and actions triggering administrative complaints are a no-go!
Emulab
University of Utah
A facility
Windows/Linux nodes; >1300 network ports that can be connected arbitrarily by remotely setting up VLANs on the switches
Virtual nodes (Xen-based); arbitrary topology
A software system
“a kind of "operating system" for controlling collections of networked devices of all types, for the purpose of controlled experimentation”
Acceptable use: “in principle, almost any research or experimental use of the testbed by experimenters that have a need for it is appropriate”
Users map
>500 nodes
High-end (2.4 GHz Quad Core Xeon “Nehalem”, 12GB RAM 1066MHz, 2x250GB SATA disks, 8 GbE interfaces)
Mid-end (3.0 GHz 64-bit Xeon, 2GB RAM 400Mhz, 2x146GB 10kRPM SCSI disks, 6 GbE interfaces)
Low-end (600MHz Intel Pentium III, 256 MB RAM, 13GB IDE hdd, 5 FE interfaces, WiFi)
12 switches (Cisco and HP)
Servers
DB, DNS, users, file, serial line, etc.
Remote power controllers
VINI
“a virtual network infrastructure that allows network researchers to evaluate their protocols and services in the wide area”
Runs on top of PlanetLab
42 nodes @ 27 sites
VINI
Approach & technologies
VINI
A VINI router
XORP: routing engine
Click: forwarding engine
VINI: Encapsulation
GENI
NSF project
“a virtual laboratory for exploring future internets at scale”
Keywords
Programmability
Virtualization and resource sharing
Federation (among participating organizations)
Slice-based experimentation
The PlanetLab team is also involved
So, which virtualization for networking?
Tool            | (node) Virtualization type | Node virtualization technology | Link virtualization technology
Netkit          | “Paravirtualization”       | UML                            | uml_switch
VNUML           | “Paravirtualization”       | UML                            | uml_switch
Marionnet       | “Paravirtualization”       | UML                            | uml_switch + vde
GINI            | “Paravirtualization”       | UML                            | uml_switch + customizations
Cloonix         | “Paravirtualization”       | UML, QEMU, OpenWrt             | uml_switch + customizations
IMUNES          | OS-level                   | None                           | VirtNET
DynaMIPS        | Emulation                  | Custom                         | Custom
RouteBricks     | None                       | None                           | None
Click           | “Partial virtualization”   | API                            | N/A
VMware vNetwork | “OS-level”                 | N/A                            | Custom
VEPA            | “OS-level”                 | N/A                            | Custom
OpenvSwitch     | “OS-level”                 | N/A                            | Custom
VDE             | OS-level                   | N/A                            | Custom
PlanetLab       | None                       | None                           | Overlay
Emulab          | Paravirtualization         | Xen                            | N/A / Overlay / None
VINI            | “Paravirtualization”       | UML                            | Overlay
GENI            | N/A                        | N/A                            | N/A
A one-fits-all proposal: The Network Hypervisor
Tunnels, VLANs, VRFs...: a pool of forwarding capacity
Rationale: virtualization should happen at the forwarding layer (amidst tunnels and VMs)
Network hypervisor
a mapper between the logical and physical network
gets a view of the logical network from the control plane
gets a view of the physical topology from a centralized management system
The Network Hypervisor
Control plane
Logical forwarding plane
Network Hypervisor
Physical forwarding plane
logical forwarding tables, ports
hardware switches
The Network Hypervisor
Forwarding in a “hypervised” switch
1. Mapping the packet to a logical context
2. Logical forwarding
3. Mapping the decision to a physical context
(steps 1–3 handled by the hypervisor)
4. Physical forwarding (handled by the IGP)
The Network Hypervisor
Prototype implementation
Physical switch: L2 over GRE
L2 packets are relevant in the logical context
GRE tunnels are the physical transport
L2-to-tunnel mapping (=logical forwarding decision) handled by the hypervisor
VLAN/MPLS tags to indicate logical context
Network hypervisor: distributed OpenFlow controller
Load balancing but no enforceable bandwidth on ports and links
Logical forwarding only at the edge of the network
OpenFlow
An open standard by which high-level routing decisions can be run on a separate server.
Supports experimentation without requiring vendors to expose internals.
References
Understanding Full Virtualization, Paravirtualization, and Hardware Assist. White paper. VMware. 2007.
What’s New in VMware vSphere™ 4: Virtual Networking. White paper. VMware. 2009.
VMware Virtual Networking Concepts. White paper. VMware. 2007.
Argyraki, Baset, Chun, Fall, Iannaccone, Knies, Kohler, Manesh, Nedevschi, Ratnasamy. Can Software Routers Scale?. PRESTO 2008.
Dobrescu, Egi, Argyraki, Chun, Fall, Iannaccone, Knies, Manesh, Ratnasamy. RouteBricks: Exploiting Parallelism to Scale Software Routers. SOSP 2009.
Casado, Koponen, Ramanathan, Shenker. Virtualizing the Network Forwarding Plane. PRESTO 2010.
Galán, Fernández, Ferrer, Martín. Scenario-Based Distributed Virtualization Management Architecture for Multi-host Environments. SVM 2008.
Loddo, Saiu. How to Implement a Virtual Network Laboratory in Six Months and Be Happy. SIGPLAN Workshop on ML, 2007.
Kohler, Morris, Chen, Jannotti, Kaashoek. The Click Modular Router. ACM Transactions on Computer Systems 18(3), August 2000.
Maheswaran, Malozemoff, Ng, Liao, Gu, Maniymaran, Raymond, Shaikh, Gao. GINI: A User-Level Toolkit for Creating Micro Internets for Teaching & Learning Computer Networking. SIGCSE Bulletin. 2009.
Pfaff, Pettit, Koponen, Amidon, Casado, Shenker. Extending Networking into the Virtualization Layer. HotNets 2009.
Davoli. VDE: Virtual Distributed Ethernet. Technical Report. University of Bologna. 2004.
Bavier, Feamster, Huang, Peterson, Rexford. In VINI Veritas: Realistic and Controlled Network Experimentation. SIGCOMM 2006.