ICT-257422
CHANGE
CHANGE: Enabling Innovation in the Internet Architecture through
Flexible Flow-Processing Extensions
Specific Targeted Research Project
FP7 ICT Objective 1.1 The Network of the Future
D4.3 – Protocols and mechanisms to combine flow
processing platforms
Due date of deliverable: December 30, 2011
Actual submission date: September 28, 2012
Start date of project October 1, 2010
Duration 36 months
Lead contractor for this deliverable Lancaster University
Version 3.0, September 28, 2012
Confidentiality status Public
© CHANGE Consortium 2012 Page 1 of (76)
Abstract
The CHANGE architecture, which is based around the notion of a flow processing platform, aims to re-
enable innovation in the Internet. However, before processing any data flow, the communicating hosts,
or agents acting on their behalf, must be able to locate the closest CHANGE platforms. This document
first discusses how these hosts and agents can efficiently locate such platforms, supported by a complete
comparison of different existing methods. It then presents how flows are efficiently attracted to the
discovered platforms. The last part of the document discusses how flows can migrate from one platform
to another.
Target Audience
For the project participants, this document describes three important mechanisms that allow users to
efficiently locate CHANGE platforms, attract traffic to the discovered platforms, and finally migrate
a flow from one specific platform to another. Moreover, this document describes additional techniques
to improve flow processing inside our platforms, such as dynamic allocation of network services and
traffic load balancing between platforms. The readers are expected to be familiar with Internet
protocols.
Disclaimer
This document contains material, which is the copyright of certain CHANGE consortium parties, and may
not be reproduced or copied without permission. All CHANGE consortium parties have agreed to the full
publication of this document. The commercial use of any information contained in this document may require
a license from the proprietor of that information.
Neither the CHANGE consortium as a whole, nor a certain party of the CHANGE consortium warrant that
the information contained in this document is capable of use, or that use of the information is free from risk,
and accept no liability for loss or damage suffered by any person using this information.
This document does not represent the opinion of the European Community, and the European Community is
not responsible for any use that might be made of its content.
Impressum
Full project title CHANGE: Enabling Innovation in the Internet Architecture through
Flexible Flow-Processing Extensions
Title of the workpackage D4.3 – Protocols and mechanisms to combine flow processing platforms
Editor Mehdi Bezahaf, Lancaster University
Project Co-ordinator Adam Kapovits, Eurescom
Technical Manager Felipe Huici, NEC
This project is co-funded by the European Union through the ICT programme under FP7.
Copyright notice © 2012 Participants in project CHANGE
Executive Summary
One of the main goals of the CHANGE project is to reinvigorate innovation on the Internet, in order to better
support current services and applications and enable those of tomorrow. This will be achieved by deploying
a set of flow processing platforms at critical points in the network. This document defines vital mechanisms
that are necessary for the proper functioning of these flow processing platforms.
In the first part of this document, we discuss the platform discovery mechanism. Before using any platform in
the network, we have to be able to efficiently find the closest set of platforms with enough resources to process
our flow, and from this set, we have to be able to select the one that satisfies our constraints and metrics. We
classify platform discovery mechanisms into two different classes: centralized approaches and decentralized
ones. We then present our distributed approach, which is based on IP anycast. We use a modified version of
the Kruskal minimum spanning tree algorithm to split platforms into clusters and assign addresses to each
cluster member.
Discovering and selecting the best platform to process our traffic is not enough, since this traffic needs to
reach this selected platform. The best case scenario is that the platform is situated on the flow’s path, such
that by default all flow packets will pass through the platform, and no action needs to be performed by the
latter. However, this is not always the case, and a CHANGE platform needs to be able to attract flows to itself,
process them, and redirect them to their final destination while avoiding any forwarding loops. We further
discuss how a flow is attracted to CHANGE platforms. We briefly recall the definition of FlowSpec [43],
our selected solution, then explain how FlowSpec, in conjunction with ExaBGP [3], which allows route
injection with arbitrary next-hops, can be used to implement a flow attraction mechanism. We further discuss
a novel load-balancing mechanism that can improve performance in scenarios where CHANGE is deployed
in networks with significant path diversity such as datacenters.
To summarize, this document presents and describes mechanisms crucial to the functioning of the CHANGE
architecture: platform discovery, resource allocation, flow attraction and load-balancing.
List of Authors
Authors Mehdi Bezahaf, Laurent Mathy, Gregory Detal, Simon van der Linden, Olivier Bonaventure, Pham
Quang Dung, Yves Deville, Costin Raiciu and Octavian Rinciog, Felipe Huici, Francesco Salvestrini
Participants Lancaster University, Universite catholique de Louvain, Polytechnic University of Bucharest, NEC
Europe Ltd., Nextworks s.r.l.
Work-package WP4 – Network Architecture Implementation
Security PUBLIC (PU)
Nature R
Version 3.0
Total number of pages 76
Contents
Executive Summary 4
List of Authors 5
List of Figures 9
List of Tables 10
1 Introduction 11
1.1 Platform Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Flow attraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Platform Discovery 13
2.1 How to find the right platform? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Centralized approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Decentralized approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2.1 k anycast addresses assignment . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2.2 Anycast Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2.3 Topologies used for experiments . . . . . . . . . . . . . . . . . . . . . . 18
2.1.2.4 Comparison of platforms discovery methods . . . . . . . . . . . . . . . . 20
2.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Traffic attraction and redirection 22
3.1 FlowSpec (RFC5575) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Performing Flow Attraction Using FlowSpec . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.1 Deploying FlowSpec Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.2 FlowSpec + Encapsulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Using ExaBGP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Traffic attraction using ExaBGP and FlowSpec . . . . . . . . . . . . . . . . . . . . 28
3.3.3 Experimentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 Service Composition and Inter-Platform Aspects 32
5 Flow Migration 35
5.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2.1 Consistency classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2.2 Ensuring per-packet consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2.3 Ensuring per-flow consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6 Load Balancing 39
6.1 Path Diversity at the Network Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.2 Case Study: Multipath TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.3 Controllable per-Flow Load-Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3.1 Path Selector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.3.2 Invertible Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3.3 Load Balancing Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3.4 Avoiding Polarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.4.1 Load-Balancing for Non-Controlled Flows . . . . . . . . . . . . . . . . . . . . . . 51
6.4.2 Forwarding Performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.4.3 MPTCP improvements with CFLB . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.4.4 Data Center Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.4.5 Testbed Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7 Motivating Case Study: Extending TCP 60
7.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.1.1 Connection setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.1.2 Adding subflows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.1.3 Reliable multipath delivery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.1.3.1 Flow control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.1.3.2 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.1.3.3 Freeing sender buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.1.3.4 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.1.3.5 Data sequence mappings . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.1.3.6 Content-modifying middleboxes . . . . . . . . . . . . . . . . . . . . . . 67
7.1.4 Connection and subflow teardown . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.2 Lessons learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Conclusion 71
References 71
List of Figures
2.1 CHANGE Platform Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Simulated topology with N = 1000 platforms and k = 10 desired platforms. . . . . . . . . . 18
2.3 Results of gtitm topology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Anycast on real topology with K = 3 desired platforms. . . . . . . . . . . . . . . . . . . . . 19
2.5 Relative error selecting the closest K platforms using geographical location. . . . . . . . . . 20
2.6 Comparison between anycast and virtual coordinates. . . . . . . . . . . . . . . . . . . . . . 21
3.1 Attraction mechanism terminology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Forwarding path of attracted packets with FlowSpec and tunnels. . . . . . . . . . . . . . . . 25
5.1 Flows that used to go to the processing module at the top now need to go to the one in the
middle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 As soon as the rules of the new configuration are installed on the transit and egress switches
with a new tag, packets entering the network are tagged so that they match the new rules. . . 37
6.1 More than 80% of the server pairs in popular models of data center topologies have two or
more paths between them. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.2 On average, for 70% of destinations, routers and switches have multiple next hops. . . . . . 41
6.3 The load-balanced paths between S and D. . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.4 The complete mode of operation of a CFLB router. . . . . . . . . . . . . . . . . . . . . . . 50
6.5 Deviation from an optimal distribution amongst two possible next-hops. . . . . . . . . . . . 52
6.6 Packet distribution computed every second amongst four possible next hops. . . . . . . . . . 53
6.7 CFLB gives equivalent forwarding performance as hash-based load balancers. . . . . . . . . 54
6.8 MPTCP needs few subflows to get a good Fat Tree utilization when using CFLB. . . . . . . 56
6.9 Regular MPTCP is unlikely to use all paths. MPTCP-CFLB on the other hand always man-
ages to use all the paths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.10 Regular MPTCP has a very small probability of using link A of Figure 6.11
and is thus suboptimal compared to MPTCP-CFLB. . . . . . . . . . . . . . . . . . . . . . . 57
6.11 Testbed – The maximum throughput available between S and D is at 200 Mbps due to the
bottleneck link between the router and the destination. . . . . . . . . . . . . . . . . . . . . 57
7.1 Problems with inferring the cumulative data ACK from subflow ACK . . . . . . . . . . . . 63
7.2 Flow Control on the path from C to S inadvertently stops the data flow from S to C . . . . . 66
List of Tables
6.1 General notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Glossary
attraction mechanism is the mechanism used to establish the attraction path.
attraction path is the path established to redirect traffic from the redirection point to the processing platform.
delivery mechanism is the mechanism used to establish the delivery path.
delivery path is the path from the processing platform to the grafting point.
destination is the destination of the flow that needs to be processed in the processing platform.
grafting point is the point on the initial path where the processed packets return to their initial path.
initial path is the path from the source to the destination when the attraction mechanism is not in place.
processing platform is the CHANGE platform in charge of the actual processing of packets, the attraction
mechanism and the delivery mechanism.
redirection point is the point on the initial path where the packets from the source to the destination will be
diverted from the initial path towards the processing platform via the attraction path.
source is the source of the flow that needs to be processed in the processing platform.
Acronyms
ALT Alternative Logical Topology.
AS Autonomous System.
ASBR Autonomous System Boundary Router.
BGP Border Gateway Protocol.
DNS Domain Name System.
DNSSEC Domain Name System Security Extensions.
EID Endpoint Identifier.
ETR Egress Tunnel Router.
FEC Forwarding Equivalence Class.
FlowSpec Flow Specification.
GRE Generic Routing Encapsulation.
IPsec Internet Protocol Security.
ITR Ingress Tunnel Router.
LISP Locator/Identifier Separation Protocol.
MPLS Multi-Protocol Label Switching.
NAT Network Address Translation.
NLRI Network Layer Reachability Information.
PKI Public Key Infrastructure.
RLOC Routing Locators.
RPKI Resource Public Key Infrastructure.
RR Route Reflector.
sBGP Secure BGP.
soBGP Secure Origin BGP.
1 Introduction
One of the main characteristics of the CHANGE architecture is its flow processing platforms. In this doc-
ument we tackle several important mechanisms crucial to the architecture’s correct functioning. We begin
by discussing the platform discovery mechanism: Where are platforms located? Which is the closest set of
platforms to our flow?
As a further mechanism, we discuss how flows are attracted to CHANGE platforms so that they can be pro-
cessed. We briefly recall the definition of FlowSpec [43], our selected solution, then explain how FlowSpec,
in conjunction with ExaBGP [3], which allows route injection with arbitrary next-hops, can be used as a flow
attraction mechanism. Finally, we discuss a novel load-balancing mechanism that can improve performance
in scenarios where CHANGE is deployed in networks with significant path diversity such as datacenters.
1.1 Platform Discovery
Before processing any flow, these platforms need to be located. By discovery we mean not only finding a
platform, but first finding the set of the closest platforms with enough resources to process our flow, and
then, from this set, selecting the one that satisfies our constraints and metrics.
We classify platform discovery mechanisms into two different classes: centralized and decentralized. In
the case of a centralized mechanism, using a centralized database, each platform will know the existence
of all other platforms with their supported functionality and resource availability. In such a case, users will
send requests to instantiate processing through an API, and the responses will include the identity of the
corresponding platform. In the decentralized case, we rely on a solution based on IP anycast. We use a
modified version of the Kruskal minimum spanning tree algorithm to split platforms into clusters and assign
addresses to each cluster member.
1.2 Flow attraction
Given that CHANGE platforms are not always on the initial path from a source to a destination, we need
to define a mechanism that allows these platforms to be able to attract flows. We split the flow attraction
mechanism into two steps: first, attracting the flows to the platform, and second, delivering back the processed
flow to the destination. Indeed, when processing is requested, the selected platform will activate the attraction
mechanism to attract packets corresponding to the flows that need to be processed. Once processed, the flow
needs to be correctly delivered to the destination.
We discussed and compared three possible solutions in deliverable D4.1:
• the first is based on a combination of DNS and one-to-one NATs;
• the second consists of using BGP announcements inside an AS and, in a limited scope, outside it;
• the third uses FlowSpec, a way to distribute matching rules to routers and to divert the
packets to the platform through a tunnel.
The first solution is the simplest, but fails when a non-cooperative user is involved (e.g., a DDoS attacker
can avoid a filtering platform simply by skipping the DNS look-up that would resolve to it). The second
solution would work, but can only provide coarse-grained matching of flows.
As a result of these limitations, we decided to implement the last option, based on FlowSpec. In this deliverable
we explain this solution at length, showing how it can be used to attract traffic. Finally, we explain how to
practically perform traffic attraction using FlowSpec and ExaBGP, a route injector.
1.3 Load Balancing
In the long term, if the project's vision is confirmed, a CHANGE site could consist of a complete data
center containing a set of switches and servers making up a number of platforms. Data centers are known
for having significant path diversity which can be leveraged by techniques such as load balancing to improve
the performance of flow processing in CHANGE platforms. To this end, in this deliverable we also introduce
a novel load-balancing scheme called CFLB, which unlike current, hash-based approaches, allows hosts to
explicitly select the load-balanced path they want to use for a specific flow.
2 Platform Discovery
2.1 How to find the right platform?
If the CHANGE vision succeeds, a multitude of platforms will be deployed globally, as shown in Figure 2.1.
This raises the obvious question of how flow owners find the appropriate platform on which to run their
desired flow processing functionality, and what requirements a good solution must meet.
There are two main requirements. First, the platform must have the processing functionality required by the
flow owner, and it must be willing, and have the resources, to participate in the processing. Second, out of
this feasible set, the platform should be selected subject to the constraints and metrics defined by the user,
for example minimising the overall processing cost.
Today, we expect cost to be dominated by end-to-end delay. Delay has become the single most important factor
affecting user experience, as exemplified by the efforts of Web providers to shorten paths [41] and to reduce the
number of RTTs required to download an average web object [17]. That is why we will use delay as our base
metric for discovering platforms. Arbitrary constraints can be implemented on top of the delay metric; once
nearby platforms are discovered, the user can query them to discover their various capabilities (e.g., required
bandwidth or CPU availability); this information enables the user to choose the most appropriate platform for
their needs.
We have considerable flexibility in designing solutions to meet these two goals, and the end solution will also
depend on the deployment type chosen. In a CDN-like deployment where there is a known set of platforms
and full trust between them, it is simple to create a database of supported functionalities and possibly of
available resources. In such a centralized case (see section 2.1.1), users will send requests to instantiate
processing via an external, opaque API, and replies will include the identity of the platform.
In a federated deployment (section 2.1.2), resource availability is sensitive information, so maintaining a
reliable database is infeasible, limiting the applicability of a centrally accessible API. In this case, distributed
solutions that better reflect the trust relationships between the platform owners, as well as being scalable, are
best suited.
Regardless of the deployment model, platform discovery needs to provide answers to the following questions
posed by the flow processing customers:
1. What is the closest platform to me? This might be used by Destination in Figure 2.1 to locate
platform E and instantiate an intrusion-detection system on all its traffic.
2. What is the closest platform to a given IP? A host could use this functionality to instantiate filter-
ing close to traffic sources with the purpose of defending against DDoS attacks. For instance, the
Destination could use platform A or C to filter traffic from the Source.
Figure 2.1: CHANGE Platform Discovery
3. What are the k closest platforms to a given IP? A generalization of the two questions above, this would
allow requesters to select the CHANGE platforms that can support the desired functionality at that
instant in time.
4. What is the platform closest to an end-to-end path? If we wanted to monitor a TCP flow, what platforms
should we use?
The most obvious solution is to have a database of the addresses of all CHANGE platforms, with the requesting
host using active measurements to choose the appropriate platform. This approach has a high cost for
each platform discovery, and does not support locating platforms close to another IP.
An alternative solution leverages Internet routing by using BGP anycast. With this, each platform will have
a common IP C, advertised via BGP anycast, and its own unique IP. When a flow-processing client wants to
find a platform, it creates a TCP connection to IP C on a known port; the packets will be routed to the
closest platform as determined by BGP. This solution gives the most accurate results, but does not directly
support finding the k-closest platforms, or the platforms close to one IP. Using multiple IPs for CHANGE
platforms solves the first problem.
To find platforms close to given IPs we need to be able to estimate latencies between two entities. There are
a number of solutions proposed in the research literature that can be applied:
1. DNS: The obvious solution is to have a database containing all platform addresses (for example, in
DNS), with the requesting user performing active measurements to choose the right platform. This
approach has a high cost for each discovery (active latency measurements must be triggered each time a
request is processed). In addition, this method cannot discover platforms that are close to a
given IP address.
2. The King approach [29] makes the assumption that each host is close topologically to its authoritative
DNS server. Thus, we can measure end-to-end delay of two hosts by measuring the delay between their
authoritative DNS servers. This solution is attractive because it uses the existing DNS infrastructure,
but its results are as close to reality as the assumption it relies on. With all the CDN deployments out
there and DNS-based server load balancing, it is unclear whether this assumption is true today, 10 years
after its proposal.
3. Virtual coordinate systems: Systems like GNP [49] or Vivaldi [21] actively measure delays between
sets of nodes, creating a Cartesian space. The promise is that with some initial measurements, and after
the system converges, these coordinates can be used directly to estimate the delay between hosts without
further active measurements. The main disadvantage of virtual coordinate systems is that they are
not very accurate (for example, it is difficult to deal with violations of the triangle inequality [42]).
4. Geographical location: According to this solution, the physical location of an Internet host can be
determined using only its public IP address. As we have emphasized, in today's networks the distance and
delay between two IP addresses are roughly proportional. However, existing solutions are not
100% accurate, due to technologies such as NATs or Provider Independent addresses that
break the relationship between physical location and IP address. In addition, existing solutions such as
the one offered by MaxMind [4] only offer city-level localization, so distances can only be estimated
between hosts in different cities.
5. BGP anycast: According to this solution, each platform can have a common IP address advertised via BGP
anycast, as well as a unique IP. When a client wants to find a platform, it creates a TCP connection
to the anycast IP, and packets will be routed to the nearest platform by the routing system. This solution
offers the most accurate results, but it cannot directly find the k closest platforms. In section
2.1.2.1 we describe how we improve this solution, so that we can find the closest k platforms, not
just one.
In the rest of this section we discuss our solution for a centralized approach and one for a distributed scenario.
Finally, note that there exists a simple approach to solve an instance of question 4 above: the source can find
on-path platforms towards the destination. The source runs traceroute to the destination to find the addresses
of the intermediary routers, and then attempts to connect to each IP on a specific port. If the connection is
successful, in the final step the platform authenticates itself to the user.
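These steps can be sketched in Python. This is a hedged sketch, not part of the deliverable's implementation: the control port number is a made-up placeholder, the code assumes the system traceroute tool is available, and the TCP probe is injectable so the logic can be exercised without a live network.

```python
import socket
import subprocess

CONTROL_PORT = 6633  # hypothetical CHANGE control port (placeholder, not specified in the document)

def on_path_hops(destination):
    """Run traceroute to `destination` and return the IPv4 addresses of intermediary routers."""
    out = subprocess.run(["traceroute", "-n", destination],
                         capture_output=True, text=True, check=True).stdout
    hops = []
    for line in out.splitlines()[1:]:          # skip the traceroute header line
        fields = line.split()
        if len(fields) >= 2 and fields[1].count(".") == 3:
            hops.append(fields[1])             # second field is the hop's address with -n
    return hops

def discover_on_path(hops, port=CONTROL_PORT, connect=None):
    """Return the hop IPs that accept a TCP connection on `port`.

    `connect` is injectable for testing; by default it performs a real TCP connect
    with a short timeout, mirroring the probe described in the text."""
    if connect is None:
        def connect(ip):
            try:
                socket.create_connection((ip, port), timeout=1).close()
                return True
            except OSError:
                return False
    return [ip for ip in hops if connect(ip)]
```

The authentication step that follows a successful connection is out of scope for this sketch.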
2.1.1 Centralized approach
In a centralized implementation, all platforms will be managed by a single entity. In this case, the set of all
platforms must respect the following properties:
Trustworthy: Being under the same administration, all platforms must fully trust each other and must trust
the information received from other platforms.
Transparency: With a single entity managing all the platforms, it is easy to create a database with supported
functionality and available resources. Given that this database will be managed by the entity that owns
all platforms, no sensitive commercial information regarding platforms will be exposed to the users.
In such a case, the discovery will be made very easily following these steps:
1. The user will send through an API requests to instantiate processing to that specific entity
2. The entity will search in its database the platform that has the following properties:
(a) the requested capabilities. This can be done by selecting from the database only the platforms
that have the right capabilities (e.g., CPU, bandwidth, load).
(b) proximity to the user. This can be done using the current protocols used by CDNs (Content
Delivery Networks), such as DNS indirection [50].
3. The entity will forward the request to the discovered platform.
4. The responses will include the identity of the discovered platform.
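The selection in step 2 can be sketched as follows. This is a minimal illustration, not the project's API: the Platform fields and capability thresholds are assumptions, and proximity is abstracted into a precomputed delay estimate (e.g., obtained through DNS indirection).

```python
from dataclasses import dataclass

@dataclass
class Platform:
    ident: str        # platform identity returned to the user
    cpu: float        # available CPU (cores) -- illustrative capability field
    bandwidth: float  # available bandwidth (Mbps) -- illustrative capability field
    delay: float      # estimated delay to the user (ms), assumed precomputed

def select_platform(platforms, min_cpu, min_bw):
    """Step 2 of the centralized discovery: (a) keep only platforms with the
    requested capabilities, then (b) pick the one closest to the user."""
    feasible = [p for p in platforms if p.cpu >= min_cpu and p.bandwidth >= min_bw]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.delay)
```

Steps 3 and 4 then amount to forwarding the request to the returned platform and echoing its identity back to the user.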
2.1.2 Decentralized approach
The only existing decentralized solution which preserves routing policies is based on IP Anycast, but it can
only find the closest platform to the requester. An obvious extension is to have k or more anycast addresses,
split the platforms into k groups and assign one address to all the hosts in the same group. The important
question is: how do we assign these addresses such that platforms, found by each host via IP anycast, are
indeed the k closest platforms to it? And if they are not the k closest, how do we minimize the average
increase in delay?
We want as few IP anycast addresses as possible: using many IP anycast addresses has an overhead as it
increases the total number of BGP UPDATE messages in the global routing system.
2.1.2.1 k anycast addresses assignment
In the following, we present our solution, which uses the classical Kruskal minimum spanning tree algorithm
as a starting point. The algorithm assigns x anycast addresses to N platforms (k ≤ x ≤ N ), where x is a
parameter. The idea is to split these N platforms into clusters of at most x members, with each member
connected to the others through minimum-cost edges; this ensures that nearby servers are placed in the same
clusters. To allow precise lookups, each platform has a different anycast address inside its cluster.
Our algorithm, presented in Algorithm 1, has two distinct parts. First, it splits platforms into clusters, where
each member has a different address. Second, it assigns addresses to each cluster member.
The first part is done using a modified version of Kruskal's [40] spanning-tree algorithm. From the original
algorithm we keep the idea of merging two different clusters when an edge with certain properties is found.
The properties we require are: a) the edge connects two different clusters, b) the edge has the minimum cost,
and c) the total number of vertices in the two clusters must not exceed x.
Once the clusters are formed, we must assign a different anycast address to each member. Addresses are
assigned by taking the edges in descending order of cost and assigning the same minimum possible address to
the two platforms connected by each edge.
Algorithm 1: Assign x anycast addresses to N platforms
Require: x ≤ N ∧ G = complete delay graph
  Sort the edges of G in ascending order of cost
  Put each platform in its own cluster
  i ← 0
  while #each cluster < x ∨ i < N do
    u ← left node of edge i
    v ← right node of edge i
    if cluster(u) ≠ cluster(v) ∧ #cluster(u) + #cluster(v) ≤ x then
      merge cluster(u) and cluster(v)
    end if
    i ← i + 1
  end while
  i ← N − 1
  while i ≥ 0 do
    u ← left node of edge i
    v ← right node of edge i
    if cluster(u) ≠ cluster(v) then
      minaddress ← find_address(cluster(u), cluster(v))
      address(u) ← minaddress
      address(v) ← minaddress
    end if
    i ← i − 1
  end while
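The two phases of Algorithm 1 can be sketched in Python as follows (our own simplified reading: the union-find bookkeeping, the tie-breaking among equal-cost edges, and the fallback for members left unpaired in the second phase are assumptions, not the project's implementation):

```python
# Sketch of the x-anycast-address assignment. Platforms are vertices 0..n-1;
# edges are (cost, u, v) tuples from the complete delay graph.

def assign_anycast_addresses(n, edges, x):
    parent = list(range(n))
    size = [1] * n

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    # Phase 1: Kruskal-style clustering, capping clusters at x members.
    for cost, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv and size[ru] + size[rv] <= x:
            parent[rv] = ru
            size[ru] += size[rv]

    # Phase 2: walk edges in descending cost order; give both endpoints
    # (when in different clusters) the smallest address unused in either cluster.
    address = [None] * n
    used = {}  # cluster root -> set of addresses already assigned inside it
    for cost, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv and address[u] is None and address[v] is None:
            taken = used.setdefault(ru, set()) | used.setdefault(rv, set())
            a = 0
            while a in taken:
                a += 1
            address[u] = address[v] = a
            used[ru].add(a)
            used[rv].add(a)
    # Members left unpaired get the lowest address free in their cluster.
    for p in range(n):
        if address[p] is None:
            r = find(p)
            taken = used.setdefault(r, set())
            a = 0
            while a in taken:
                a += 1
            address[p] = a
            used[r].add(a)
    return address
```

With four platforms, edge costs that pair platforms {0, 1} and {2, 3}, and x = 2, the sketch produces two clusters whose members receive distinct addresses, while far-apart platforms may share an address, which is the property anycast lookup relies on.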
2.1.2.2 Anycast Evaluation
We measured the performance of the method presented above, which finds the nearest platforms by using
anycast addresses, on multiple types of topologies. In these experiments we fixed k, the number of desired
platforms, and varied the number of anycast addresses that can be provided to these platforms.
For each experiment, we calculated the relative error introduced by selecting platforms with the anycast
method using the formula: err_rel = (anycast_delay − real_delay) × 100 / real_delay, where anycast_delay is
the sum of delays to the nearest k anycast addresses designated by our algorithm and real_delay is the sum of
delays to the k closest platforms in reality.
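As a small illustration of the metric (the delay sums below are made up, not measured data):

```python
def relative_error(anycast_delay, real_delay):
    # err_rel = (anycast_delay - real_delay) * 100 / real_delay
    return (anycast_delay - real_delay) * 100.0 / real_delay

# e.g. if the anycast-selected platforms sum to 230 ms of delay while the
# truly closest k platforms sum to 200 ms, the relative error is 15%.
err = relative_error(230, 200)
```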
For each topology and for each value of the number of anycast addresses, the relative error is calculated for
the following two methods of assignment of addresses:
Kruskal: Already detailed above. The disadvantage of this algorithm is that it requires an almost complete
graph with pings between platforms.
Figure 2.2: Simulated topology with N = 1000 platforms and k = 10 desired platforms. (Plot: relative error
vs. number of anycast addresses for Kruskal anycast, k = 10, Sets 1-3.)
Random: Each platform takes a random address from the existing pool of anycast addresses. The disadvantage
of this method is that it does not take the delay between platforms into account, so two nearby platforms
can be assigned the same anycast address.
2.1.2.3 Topologies used for experiments
Topologies with simulated data: The first type of topology that was used for experiments had 1,000 plat-
forms (in the future, we estimate that almost every active AS will have at least one flow processing
platform). As happens in the Internet today, all platforms can connect to others, and the delay between
two platforms is chosen randomly between 0 and 1000ms. We ran our algorithm on three random
topologies, we set k = 10, and varied the number of anycast addresses x from 10 to 200 (the minimum
is equal to 10, because to address 10 platforms we need at least 10 anycast addresses). The results are
shown in Figure 2.2.
As can be seen from the graph, the results are not influenced by the topology, but by the number of
addresses used. As expected, the error is smaller when more addresses are used. The error decreases
rapidly from 115% for 10 anycast addresses to 20% for 20 anycast addresses; beyond this value, the error
decreases slowly, following an almost linear trajectory.
gtitm topologies: Gtitm [61] is a network topology generator developed by Georgia Tech. It allows gener-
ation of network topologies that match the current architectural model of the Internet: a 3-level topology,
with a first level of relatively few but densely interconnected routers, and a last level with few
connections towards routers on the second level.
Based on these topologies we performed two types of experiments: one in which the model simulates the
platforms' arrangement on 3 levels, each experiment involving over 1,000 platforms, and a second
in which platforms have only a connection to routers on the last level, to simulate how the system
Figure 2.3: Results on gtitm topologies (relative error (%) vs. number of anycast addresses, Kruskal vs.
Random): (a) gtitm topology with N = 1252 platforms and k = 10 desired platforms; (b) gtitm topology with
N = 44 platforms and k = 7 desired platforms.
Figure 2.4: Anycast on real topology with K = 3 desired platforms. (Plot: error vs. number of anycast
addresses.)
behaves when there are few platforms located at the edge of the (current) Internet. For the first type of
architectural model, the results are shown in Figure 2.3(a), and for the second in Figure 2.3(b).
As we have already mentioned, we tested the performance of the anycast method for the two assignment
algorithms: Kruskal and Random. As the graphs show, for both models the Kruskal algorithm yields a
smaller error than the Random algorithm.
Also, compared with the results on the previous type of topology, the error introduced in this architectural
model is smaller; this is because many platforms are located very far from each other, with most delays
above 500 ms.
Real topologies: We ran the Kruskal anycast algorithm on a topology formed by 225 real servers located in
various places around the world. The data used for this experiment was a set of pings, lasting four hours
each, generated between these servers in 2008 [2]. In total, there are over 1.2 million records. For
k = 3 desired platforms, the results are shown in Figure 2.4.
As can be seen from the graph, the error behaves as in the other topologies: it decreases strongly up to
x = 10 anycast addresses, but from that point on it oscillates between 0% and 20%. Our current research
focuses on understanding why this oscillation exists; most likely, running the algorithm on the same data
several times would flatten it.
Figure 2.5: Relative error selecting the closest K platforms using geographical location. (Plot: error (%)
vs. K.)
2.1.2.4 Comparison of platform discovery methods
In order to determine the best way to find platforms, we need to make comparisons between the discovery
methods mentioned above. All of the following comparisons were made on the same real data [2].
The geographic location method was tested by selecting a random platform, calculating the average delay
from this platform to each other platform, and calculating distances from it to the other platforms using the
MaxMind framework [4]. Using this dataset and the method of least squares, we calculated the regression
line f(x) = a · x + b, where x is the distance in kilometers between platforms and f(x) is the delay between
these platforms. The regression slope was on average 0.02 and the intercept had values ranging from 10 to
100.
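A least-squares fit of this form can be sketched as follows (the sample distances and delays are invented for illustration; they merely yield a slope and intercept in the reported ranges, not the paper's measurements):

```python
# Ordinary least-squares fit of delay (ms) against distance (km).
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Made-up sample pairs, chosen so slope ~ 0.02 and intercept ~ tens of ms,
# consistent with the ranges reported above.
dist = [100, 500, 1000, 3000, 6000]
delay = [22, 30, 40, 80, 140]
a, b = fit_line(dist, delay)
estimated = a * 2000 + b  # estimated delay for a 2000 km pair
```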
Using this regression line and the known distances between platforms, we calculated the estimated delays.
With these data we selected the k closest platforms from one specific platform and calculated the relative
error with respect to the known delays. Figure 2.5 shows the relative error as a function of k. As can be
seen from the graph, for small k the error is quite high, because the small real delays, of a few
milliseconds, correspond to estimated values on the order of the intercept term, which usually takes values
ten times higher than the minimum delay.
We also calculated the relative error of selecting the k desired platforms. In the next experiment, we set
k = 3 and compared the average relative error of the geographical location method with the error introduced
by using anycast. The average relative error calculated for k = 3 is very high, at 389.91%.
To compare the performance of virtual coordinates with the method based on assigning k anycast addresses,
we performed the following test suite: we assumed that N − 1 platforms have already calculated their
coordinates and that the N-th platform wants to calculate its own by pinging the platforms that already
have coordinates. We recorded the relative error between the distances from this platform to all other
platforms using the new coordinates. As expected, when the number of pings is relatively low, the relative
error increases. The error values are shown graphically for up to 50 pings, selected from 22,000 pings. If
we select 10,000 pings, which is 50% of all pings from one platform to all other platforms, we get an error
of 13%. The results are shown in Figure 2.6.
Figure 2.6: Comparison between anycast and virtual coordinates. (Plot: error vs. number of anycast
addresses/pings, for anycast, network coordinates, and geolocation.)
2.2 Conclusion
In this chapter we discussed methods for CHANGE platform discovery. As we mentioned earlier, there
are two basic deployment models: centralized, in which all platforms are managed by a single entity, and
decentralized, in which platforms are administrated by several distinct entities.
In a CDN-like deployment where there is a known set of platforms and full trust between them it is simple to
create a database of supported functionalities and possibly of available resources. In such a case, users will
send requests to instantiate processing via an external, opaque API, and replies will include the identity of
the platform.
In a federated deployment, resource availability is sensitive information, making the maintenance of a
reliable database infeasible and limiting the applicability of a centrally accessible API. In this
deliverable we presented a suite of tests comparing a novel solution based on BGP anycast with network
virtual coordinates and geographical location. Each discovery solution has advantages and disadvantages,
but we are currently looking into a hybrid solution that can achieve both the accuracy of BGP anycast and
the flexibility of Virtual Coordinate Systems.
c© CHANGE Consortium 2012 Page 23 of (76)
3 Traffic attraction and redirection
In CHANGE, the processing of flows occurs in platforms that are not always on the initial path from a source
to a destination. In this case, to provide the ability to perform processing, flows must be drawn into a platform.
Two problems must be solved: attracting the flows to the platform, and delivering the processed flow to the
destination.
Figure 3.1: Attraction mechanism terminology. (Diagram: the initial path from source to destination; the
attraction path from the redirection point to the processing platform; and the delivery path from the
platform via the grafting point back to the destination.)
Figure 3.1 shows the terminology used in this deliverable for the problem of attracting flows into a platform
located outside the initial AS path. Note that the knowledge of the initial path is assumed to be obtained from
an external source of information. When processing is requested, the processing platform will activate the
attraction mechanism to attract packets corresponding to the flows that need to be processed. The attraction
mechanism is in charge of establishing the attraction path that will take the flow from the redirection point
to the platform. Once processed, the flow needs to be delivered to the destination. The delivery mechanism
is used to establish the delivery path: the processed flow is sent via the grafting point to the destination. It is
worth noting that depending on the mechanisms, the redirection point and grafting point positions may vary.
However, the grafting point must always be located downstream from the redirection point in the initial path.
We discussed several possible solutions in order to perform traffic attraction towards the CHANGE platforms
in deliverable D4.1. We compared three solutions: one based on a combination of DNS and one-to-one
NATs; one using BGP announcements inside, and with limited scope outside, an AS; and one using FlowSpec
[43], i.e., a way to distribute matching rules to routers, to divert the packets to the platform through a
tunnel. Based on this comparison, we chose to implement the latter, i.e., the attraction mechanism based on
FlowSpec.
In the following, we first give a review of FlowSpec. Second, we explain how one can perform traffic
attraction using FlowSpec. Finally, we explain how to practically perform traffic attraction using FlowSpec
and ExaBGP [3], a route injector.
3.1 FlowSpec (RFC 5575)
Flow Specification (FlowSpec) [43] is a technique to distribute traffic flow specifications. FlowSpec was
primarily defined to automate inter-domain coordination of traffic filtering such as required to mitigate denial-
of-service attacks. FlowSpec information is carried via the Border Gateway Protocol (BGP) by being encoded
as a BGP Network Layer Reachability Information (NLRI). This allows the routing system to propagate flow
specifications.
FlowSpec efficiently encodes rules as an n-tuple consisting of several matching criteria that are applied
to IP traffic. A packet is considered to match the FlowSpec when it matches all components (criteria)
present in the specification. FlowSpec can match on any of the following components:
• destination prefix,
• source prefix,
• IP protocol,
• source port,
• destination port,
• ICMP type,
• ICMP code,
• TCP flags,
• packet length,
• DSCP field,
• fragmentation.
The value matched by each component can be constrained using a numeric operator such as greater or equal,
not, etc. Multiple components can also be combined using the binary operators AND and OR.
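The all-components-must-match semantics can be illustrated with a small sketch (a toy model, not the RFC 5575 wire encoding; the field names and predicates are our own):

```python
# Toy FlowSpec-style matcher: a packet matches a rule only when it satisfies
# every component present in the specification (implicit AND across components).
import ipaddress

def matches(rule, packet):
    return all(predicate(packet[field]) for field, predicate in rule.items())

rule = {
    # destination prefix component
    "dst": lambda ip: ipaddress.ip_address(ip) in ipaddress.ip_network("192.168.0.0/24"),
    # IP protocol component
    "protocol": lambda p: p == "tcp",
    # destination port component, a numeric range built from >= and <=
    "destination-port": lambda p: 8080 <= p <= 8088,
}

pkt = {"dst": "192.168.0.1", "protocol": "tcp", "destination-port": 8080}
# matches(rule, pkt) is True; changing any one field to a non-matching
# value makes the whole rule fail.
```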
An action, called a Traffic Filtering Action, is associated with each FlowSpec. RFC 5575 [43] defines the
minimum set of filtering actions implemented on routers. These actions are the following:
Traffic-rate applies rate limiting to the matching packets.
Traffic-action provides a way to control the application of the ordered sequence of actions bound to rules for
a given NLRI. It allows defining rules which terminate the application of the sequence, providing ways
c© CHANGE Consortium 2012 Page 25 of (76)
to define flow processing with flexibility. This action also provides control messages for the activation
of traffic sampling and logging at the receiver side.
Redirect redirects the matching packets to a VRF routing instance. This permits tunneling the packets to a
destination.
Traffic-marking changes the DSCP field of the matching packet to the corresponding value.
These actions are exchanged with FlowSpec rules encoded as BGP extended community values. BGP ex-
tended communities [16, 54] are mechanisms commonly used to provide means to perform inbound traffic
engineering and Denial of Service attack mitigation when associated with actions performed on paths.
Internally, these communities are used to specify properties of paths in order to influence routing decisions
made about them. For example, a path can be tagged with the type of peering session over which it was
received. When such a path is selected as best, an Autonomous System Border Router (ASBR) will selectively
propagate it based on the business property described in the tagged community.
In order to be used by external parties, ISPs describe the set of communities that they recognize, their as-
sociated actions, and the type of distant ASes that are allowed to use such communities (customers, peers,
providers, etc.).
3.2 Performing Flow Attraction Using FlowSpec
In the rest of this section, we discuss how to use FlowSpec to attract traffic towards the processing platform.
We first discuss how to distribute FlowSpec rules amongst partner ASes. Second, we discuss how to for-
ward the packet from the redirection point to the platform. Finally we discuss the security of the FlowSpec
attraction solution.
3.2.1 Deploying FlowSpec Rules
To enable attraction of flows towards the processing platform, a partnership must be established between the
platform and multiple ASes. To attract a flow, one router on the initial path will be used to act as a redirection
point towards the platform. This partnership must therefore allow installing FlowSpec rules inside some
routers, i.e., there must exist a FlowSpec signaling mechanism. In the following we address this requirement.
As FlowSpec runs on top of BGP, a BGP overlay of iBGP sessions can be used to exchange FlowSpec rules.
This overlay will connect border routers from partner ASes and the platform using Route Reflectors (RR). To
install a FlowSpec rule to redirect one flow from its initial path, the platform will initiate a BGP UPDATE
message with the FlowSpec rule for this flow, which will be received by each router. These routers will then
install the rule and start redirecting the flows towards the platform. This has the nice property of ensuring
that each packet from the flow passing through one of the partner ASes will leave its initial path and be
redirected through the platform. However this might not scale and could overload each router with unused
rules. Indeed, the number of entries that each router can handle might be limited. This therefore limits the
maximum number of flows that can be attracted by the platform.
Figure 3.2: Forwarding path of attracted packets with FlowSpec and tunnels. (Diagram: processing platform
reached via a tunnel from the partner AS.)
To scale, additional information must be added to the system so that FlowSpec rules are installed in only
one router, thereby increasing the total number of flows that can be handled by the system. This information
can be carried using BGP communities. Routers from partner ASes must be configured with one or multiple
shared communities. The BGP UPDATE messages containing FlowSpec rules must then carry the community values
identifying the router, or set of routers, that needs to install the rule. The configuration of communities
must be done once per AS and be stored in the platform.
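One way to realize this selection could be a simple lookup from target routers to the communities they are configured with (a hypothetical sketch; the AS names, router names, and community values are invented for illustration):

```python
# Hypothetical mapping: each partner-AS border router listens for a dedicated
# BGP community, so a FlowSpec UPDATE tagged with that community is installed
# only by the intended redirection point.
ROUTER_COMMUNITIES = {
    ("AS65010", "border-1"): "65010:101",
    ("AS65010", "border-2"): "65010:102",
    ("AS65020", "border-1"): "65020:101",
}

def communities_for(targets):
    """Community values to attach to a FlowSpec UPDATE so that only the
    routers listed in `targets` install the rule."""
    return {ROUTER_COMMUNITIES[asn_router] for asn_router in targets}

# Tagging a rule for a single redirection point in AS65010:
tags = communities_for([("AS65010", "border-1")])
```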
Note that FlowSpec has some limitations. FlowSpec NLRI are valid if and only if the two following rules are
matched. First, the originator of the FlowSpec must match the originator of the best-match unicast route for
the destination prefix embedded in the flow specification. Second, there must not exist more specific unicast
routes, when compared with the flow destination prefix that have been received from a different neighboring
AS than the best-match unicast route. To allow building an overlay to disseminate FlowSpec rules, these two
rules must be relaxed as the platform will never be the best unicast route for the destination in each partner
AS. Of course, the platform will never be an originator for the destination prefix.
3.2.2 FlowSpec + Encapsulation
Figure 3.2 shows the complete forwarding path of the flow when the FlowSpec rules are installed in the
redirection point. Note that the processing platform and the partner AS are not directly connected. Once
a packet matches a FlowSpec rule at the redirection point, a specific action will be executed to forward the
packet to the platform. As native forwarding cannot be used, the packet must be encapsulated in a tunnel
towards the processing platform. This action is specified inside the BGP UPDATE messages as a Traffic
Filtering Action of the FlowSpec rule. The redirect action is used to point to a VRF instance, towards one
tunnel end-point inside the platform.
This requires that each router is pre-configured with one or multiple tunnels towards a next hop inside the
platform so that when one router receives a FlowSpec rule to install, it will have a route towards the next hop
associated to the redirect action, otherwise the matching packets would be dropped.
To deliver the packets to their destination, an encapsulation mechanism can be used as well. Here, the
grafting point can be located inside the same AS as the redirection point that diverts the packets towards
the processing platform (as shown in Figure 3.2). The idea is to have the redirection point and the grafting
point located on the edge of the AS, with the initial path passing through each of them. It must be ensured
that the grafting point is located downstream of the redirection point. This only works if the FlowSpec
rules are not installed in the grafting point, otherwise a loop is created.
3.3 Using ExaBGP
ExaBGP [3] is a BGP engine that allows injecting routes with arbitrary next-hops into a network (sourcing
IPv4/IPv6 routes over both IPv4 and IPv6 TCP connections) and mitigating DDoS attacks using FlowSpec (see
Section 3.1). ExaBGP is written in Python and is freely available at http://code.google.
com/p/exabgp.
The remainder of this section is divided into three parts. First, we discuss how to configure ExaBGP to
carry out route injection. Second, we discuss how to use ExaBGP to perform traffic attraction, i.e., dynamic
route injection. Finally, we show some experimentation results.
3.3.1 Configuration
A sample ExaBGP configuration can be found in Listing 3.1; it uses a syntax similar to that used for
configuring Juniper routers.
1 neighbor 192.168.127.128 {
2     description "a router";
3     router-id 192.168.127.1;
4     local-address 192.168.127.1;
5     local-as 65000;
6     peer-as 65534;
7
8     flow {
9         route optional-name-of-the-route {
10             match {
11                 source 10.0.0.1/32;
12                 destination 192.168.0.1/32;
13                 port =80 =8080;
14                 destination-port >8080&<8088 =3128;
15                 # destination-port [ 8080 3128 ];
16                 source-port >1024;
17                 protocol [ udp tcp ];
18                 # protocol [ 4 6 ];
19                 # protocol tcp;
20                 # packet-length >200&<300 >400&<500;
21                 # fragment not-a-fragment;
22                 # fragment [ first-fragment last-fragment ];
23                 # icmp-type [ unreachable echo-request echo-reply ];
24                 # icmp-code [ host-unreachable network-unreachable ];
25                 # tcp-flags [ urgent rst ];
26                 # dscp [ 10 20 ];
27                 # dscp >10&<20;
28             }
29             then {
30                 # bytes/seconds
31                 rate-limit 9600;
32                 # discard;
33                 # redirect 65500:12345;
34                 # redirect 1.2.3.4:5678;
35             }
36         }
37     }
38 }
Listing 3.1: Sample ExaBGP configuration in order to perform FlowSpec rule injection.
The sample configuration file described in Listing 3.1 injects one route into a BGP peer router,
192.168.127.128 (configured in lines 1 to 6). Multiple flow route injections can be configured in the flow
entry of the configuration file. A route is decomposed into two sections: a match and an action, linked by
the then keyword.
The match section (see lines 10 to 28) can contain multiple matching rules; we only explain the most common
ones:
source, destination specify the source/destination address prefix to match,
port, source-port, destination-port specify the ports to match; operators can be used to describe the set of
ports to match, e.g., >X specifies that the rule matches only ports greater than X, and the & symbol
combines multiple conditions.
protocol specifies the protocol to match: tcp, udp or icmp.
The then section contains the action to perform on the packets that match the rule. Three different actions are
available:
rate-limit this limits the rate of the flow,
discard discards the matching packets,
redirect redirects the matching packets.
3.3.2 Traffic attraction using ExaBGP and FlowSpec
ExaBGP is primarily meant to load a static configuration easily, but it can also be used to inject rules
dynamically. Dynamic rule injection can be performed by editing the configuration file and telling ExaBGP
to reload its configuration, which is done by sending it the SIGHUP signal:
kill -SIGHUP <pid of exabgp>
Therefore, to integrate ExaBGP into the CHANGE platform, we need to have an external program that
handles ExaBGP and its configuration. This external program can be queried by the CHANGE platform
(e.g., using RPC calls) in order to rewrite the configuration and tell ExaBGP to reload its configuration when
new flows need to be attracted or removed.
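Such an external program could look roughly as follows (a sketch only: the file path handling, the configuration template, and the rule fields are assumptions for illustration; only the SIGHUP reload mechanism comes from the text above):

```python
# Hypothetical helper the platform could call (e.g. behind an RPC endpoint)
# to install a FlowSpec rule: rewrite the ExaBGP configuration, then ask
# ExaBGP to reload it by sending SIGHUP.
import os
import signal

# Assumed single-rule template; a real deployment would manage many rules.
TEMPLATE = """neighbor 192.168.2.1 {{
    router-id 192.168.2.2;
    local-address 192.168.2.2;
    local-as 65000;
    peer-as 65001;
    flow {{
        route {{
            match {{
                source {source};
                destination {destination};
            }}
            then {{
                redirect {redirect};
            }}
        }}
    }}
}}
"""

def render_config(source, destination, redirect):
    return TEMPLATE.format(source=source, destination=destination,
                           redirect=redirect)

def install_rule(conf_path, exabgp_pid, source, destination, redirect):
    with open(conf_path, "w") as f:
        f.write(render_config(source, destination, redirect))
    # ExaBGP re-reads its configuration file on SIGHUP.
    os.kill(exabgp_pid, signal.SIGHUP)
```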
3.3.3 Experimentation
In the following, we show how ExaBGP can be combined with a FlowSpec-enabled router. The testbed is
composed of two devices: one running ExaBGP 2.0.1 (192.168.2.2), connected to a JunOS 10.3 Olive virtual
router (192.168.2.1). The following FlowSpec configuration has been used for the Juniper router:
...
protocols {
    bgp {
        group community-fs {
            type external;
            multihop;
            local-address 192.168.2.1;
            passive;
            export no-routes;
            peer-as 65000;
            local-as 65001;
            neighbor 192.168.2.2 {
                traceoptions {
                    file community-fs;
                    flag all;
                }
                family inet {
                    unicast;
                    flow {
                        no-validate all-routes;
                    }
                }
            }
        }
    }
}
policy-options {
    policy-statement all-routes {
        then accept;
    }
    policy-statement no-routes {
        then reject;
    }
}
...
Listing 3.2: JunOS configuration used for the experiment.
And the following configuration has been used for ExaBGP, injecting into the Juniper router a rule that
drops all packets from 192.168.0.2 to 192.168.1.2:
neighbor 192.168.2.1 {
    router-id 192.168.2.2;
    local-address 192.168.2.2;
    local-as 65000;
    peer-as 65001;
    flow {
        route {
            match {
                source 192.168.0.2/32;
                destination 192.168.1.2/32;
            }
            then {
                discard;
            }
        }
    }
}
Listing 3.3: ExaBGP configuration used for the experiment.
After running the two devices, we can validate that the route is correctly injected into the router by
looking at the FlowSpec table of the Juniper router:
1 root> show route table inetflow.0 extensive
2
3 inetflow.0: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
4 192.168.1.2,192.168.0.2/term:1 (1 entry, 1 announced)
5 TSI:
6 KRT in dfwd;
7 Action(s): discard, count
8 * BGP    Preference: 170/-101
9          Next hop type: Fictious
10         Next-hop reference count: 1
11         State: <Active Ext>
12         Peer AS: 65000
13         Age: 9
14         Task: BGP_65000_65001.192.168.2.2+38601
15         Announcement bits (1): 0-Flow
16         AS path: 65000 I
17         Communities: traffic-rate:0.0
18         Accepted
19         Localpref: 100
20         Router ID: 192.168.2.2
We can see that the entry is correctly added to the FlowSpec table of the router, that the correct action is
associated with the flow (see line 7), and that the extended community associated with the action is applied
to the route, i.e., traffic-rate:0.0 (see line 17).
3.4 Conclusion
In this section we presented the FlowSpec-based flow attraction mechanism. To date, we believe it to be
the best suited for attracting flows towards a platform, both in terms of providing the needed mechanism
and of having the highest chance of deployment in the current Internet. We showed that it is easy to use
ExaBGP, a tool that allows route injection as a BGP peer, to redirect flows at any FlowSpec-enabled router.
Finally, we performed simple experiments with ExaBGP and virtual routers and presented the results of these
experiments.
4 Service Composition and Inter-Platform Aspects
One of the key functions in the CHANGE architecture is service composition, which groups all the
aspects related to platform and resource resolution, flow routing, and subsequent resource allocation. The
main trigger for all the actions/phases above is an end-user request received through the Service
User-to-Network Interface (Service-UNI). As described in more detail in D4.2, this Service-UNI request is
primarily handled by the CHANGE Service Manager, which coordinates the following actions:
• at first, the decomposition of the overall end-to-end user service specification into specific low-level
service components to be implemented by the CHANGE platforms
• then, the flow routing and per-platform allocation of the specific actions to be implemented on the flow.
The service description specified at the Service-UNI could either explicitly identify the platforms and actions
involved in the end-to-end service (i.e. provide the CHANGE Service Manager with a set of exact routes and
actions to be implemented), or it could be more generic and just identify the service endpoints (i.e. source,
destination hosts) and the service parameters (e.g. a firewall policy, a NAT rule, etc.). Depending on the
service description provided at the Service-UNI, the CHANGE service composition function has different
scopes. In particular, in the former case (i.e. explicit route-actions specification), the service composition
actions mainly consist of status and AuthN/AuthZ checks on the required resource allocations, with the
subsequent provisioning via signaling procedures. Instead, in the latter case (i.e. generic service description)
all the exact platform and resource resolution is implemented by the CHANGE service composition, before
any subsequent signaling for provisioning.
To implement its decisions, the CHANGE Service Composer can take into account the domain network
topology, the platforms' capabilities, and their resource availability.
Since these resolution operations might require the allocation of additional platform(s) and/or flow process-
ing action(s) along the route, the routing and composition decisions should be jointly taken by the Service
Composer to obtain an optimal solution. The result of this process takes the form of a Flow Processing
Route (FPR), which contains the possibly exact and complete sequence of flow processing actions and packet
forwarding rules to be performed in the CHANGE domain.
In a more schematic and high level view, the ingress-to-egress resolution occurring in the CHANGE archi-
tecture could be summarized in the following steps:
1. Parse the client-originated service description in order to produce an intermediate-representation of all
the platform-level operations (i.e. resolve the service components)
2. Compute a loop-free flow path and identify flow processing resources to be allocated along it by taking
into account the following decision actions:
• Add any additional flow processing actions and/or platforms that might be required to implement
the service into the domain
• Adjust the bindings among all the identified flow processing actions in the domain to stitch the
different parts both within a platform (e.g. in case multiple processing modules are used) and
among platforms (e.g. in case flow attraction or any other routing aspect might be needed)
The following example describes a possible service description to be used by Service Composer to implement
the actions detailed above:
# Define the first platform
PLATFORM ID=0 NAME=alpha.platform.com
INTERFACE ID=0 MBOX_ID=0 IP=126.16.13.139 TYPE=REAL
INTERFACE ID=1 MBOX_ID=0 NAME=bridge0 TYPE=VIRTUAL
INTERFACE ID=2 MBOX_ID=0 NAME=bridge1 TYPE=VIRTUAL
# Define the second platform
PLATFORM ID=1 NAME=beta.platform.com
INTERFACE ID=0 MBOX_ID=1 IP=128.16.67.99 TYPE=REAL
INTERFACE ID=1 MBOX_ID=1 IP=126.16.13.139 NAME=bridge0 TYPE=VIRTUAL
INTERFACE ID=2 MBOX_ID=1 IP=126.16.13.139 NAME=bridge1 TYPE=VIRTUAL
# Forward (which also allows filtering)
FORWARD FROM=0:0 TO=0:1 FILTER="ip src=128.16.13.139"
# Add a processing module
PROCESSING_MODULE TYPE=FIREWALL CONFIG=... IN=0:1 OUT=0:0
# Traffic manipulation
ATTRACT_DNS NAME=nets.cs.pub.ro TARGET=0:0 DNS=...
...
The declarative language used in this example has two types of definitions: platforms and flow processing actions. The platform definition serves a twofold purpose: a) it identifies the platform's addressable control interface (i.e. alpha.platform.com and beta.platform.com), which is used by the control plane; b) it declares the platform interfaces involved in the service. The flow processing actions report the type of
the processing module that must be used and its bindings with the peering modules (that constitute a part of
the service slice inside a platform) or the ingress/egress interfaces (that form the bindings among different
service slices allocated between two data plane adjacent platforms).
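A front-end consuming such a description could be sketched as follows. The parsing logic and data layout are our own illustration, not part of the CHANGE specification, and quoted attribute values containing spaces (e.g. FILTER expressions) are not handled in this sketch:

```python
# Minimal, illustrative parser for the declarative service description
# shown above. Keywords are taken from the example; everything else
# (function name, data layout) is hypothetical.

def parse_service_description(text):
    platforms = {}   # platform ID -> {"name": ..., "interfaces": [...]}
    actions = []     # flow processing actions, in declaration order
    current = None   # platform currently being defined

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                       # skip blanks and comments
        keyword, _, rest = line.partition(" ")
        # each attribute is a KEY=VALUE token (no spaces inside values)
        attrs = dict(kv.split("=", 1) for kv in rest.split())
        if keyword == "PLATFORM":
            current = {"name": attrs["NAME"], "interfaces": []}
            platforms[int(attrs["ID"])] = current
        elif keyword == "INTERFACE":
            current["interfaces"].append(attrs)
        else:  # FORWARD, PROCESSING_MODULE, ATTRACT_DNS, ...
            actions.append((keyword, attrs))
    return platforms, actions
```

The resulting platform map and ordered action list correspond to the intermediate representation mentioned in step 1 above.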
Concerning the inter-platform aspects of the service composition process described above, the FPR is the key piece of information that binds resources and platforms together. Depending on the deployment model of the CHANGE architecture (i.e. centralized or distributed, as per D4.2), the FPR is managed differently, which impacts the way the different platforms cooperate for service setup and maintenance.
In particular, if a centralized model is deployed, the FPR is split into a sequence of per-platform segments, and each segment is directly signalled by the Signaling Manager to the corresponding platform. In this case, data-plane adjacent platforms have no adjacency in the control plane and do not peer or cooperate with each other for service provisioning and maintenance.
Conversely, if a distributed model is deployed, the overall FPR is passed by the Signaling Manager to the signaling instance running on the entry CHANGE platform (the first hop in the FPR), which uses it to initiate peer-mode distributed signaling along the identified flow path. On each platform, the local actions described in the FPR are implemented, and the remaining parts of the FPR are then forwarded to the next-hop adjacent platform (at the control plane level). In this case, the ingress and egress platforms handle the full end-to-end service as per the FPR, while the intermediate platforms only maintain their upstream and downstream peerings.
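The two deployment models can be contrasted with a small sketch. The FPR is modelled here simply as an ordered list of per-platform segments, and platforms as callables applying their local actions; these representations are our own illustration, not part of the CHANGE signaling specification:

```python
# Illustrative contrast of the two FPR signaling models described above.
# fpr: ordered list of (platform_id, local_actions) segments.
# platforms: platform_id -> callable that applies the local actions.

def signal_centralized(fpr, platforms):
    # The Signaling Manager talks to every platform directly;
    # data-plane neighbours never peer in the control plane.
    for platform_id, local_actions in fpr:
        platforms[platform_id](local_actions)

def signal_distributed(fpr, platforms):
    # Only the entry platform is contacted; each hop applies its own
    # segment, then "forwards" the remaining FPR to the next hop.
    if not fpr:
        return
    (platform_id, local_actions), rest = fpr[0], fpr[1:]
    platforms[platform_id](local_actions)
    signal_distributed(rest, platforms)
```

In both cases every platform ends up applying the same local actions in the same order; what differs is who drives the signaling.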
All the procedures described above primarily apply to the Internal-NNI signaling, i.e. they apply to a
CHANGE domain in which platforms can peer and share topology and resources information. As described
in D4.2, the Inter-AS NNI aspects can be assumed to be similar to the Internal-NNI ones in terms of semantics and abstract messages: although they do not share full information about flow processing resources and platforms, peering ISPs may adopt inter-platform cooperation mechanisms similar to the ones described for the Internal-NNI protocol, i.e. peer-style or centralized/management-style. The signaling messages used at the Inter-AS NNI to control a multi-domain flow processing service can be assumed to derive from the ones used at the Internal-NNI, but with contents that largely depend on the policies and trust relationships between peering ISPs (e.g. for resource and/or internal topology description, multi-domain FPR details, etc.).
5 Flow Migration
Within CHANGE platforms, flows are routed across OpenFlow switches. The logically-centralized controller
of each platform installs rules to handle flows in each switch. A rule essentially specifies a pattern that
matches certain packets, and an action to take on such packets. For various reasons, such as moving a virtual machine to a better place, the controller might need to change the rules in a collection of switches in order to change the path followed by certain flows. The controller does so by issuing remove and install commands.
The configuration of multiple switches cannot be changed atomically, so the network goes through a set of intermediate states between the initial and the final configuration. Although the initial and the final configurations of the network are correct, intermediate ones might cause severe problems such as broken connectivity, forwarding loops, and inconsistent paths, especially in the presence of middleboxes.
Figure 5.1: Flows that used to go to the processing module at the top now need to go to the one in the middle.
In Figure 5.1, the processing module at the top needs to be shut down, and all flows processed by this module must be handled by the other processing module. The controller must thus send install commands to the three switches on the path to the new processing module, and remove commands to the two switches crossed only by the old path towards the old processing module. In order to avoid losses, the controller must ensure that the rules at the two switches of the new path are installed before the rule at the ingress switch.
It does not have to be this way. Seamlessly changing the configuration of a network could be a primitive of
the OpenFlow platform.
Considerable work has already been done on avoiding undesired intermediate states in distributed routing protocols.
There are basically two different approaches: either we use an algorithm to determine the order in which
commands are issued to individual switches such that the order preserves certain properties, or we virtualize
the network and direct packets to the new virtual network once it is ready.
5.1 Algorithms
Francois et al. [25] show that it is possible to avoid all loops during the convergence of a link-state IGP such as OSPF or IS-IS, and they propose a protocol that lets the routers change their state in a safe order. They also propose [26] a way to update link metrics so as to avoid disruptions after a planned link-state change with OSPF. Unfortunately, these methods only consider link-state routing protocols and shortest-path routing; they cannot be applied to OpenFlow.
Fu et al. [27] introduce two conditions under which the forwarding table of an IP router can be changed during a reconfiguration without introducing a loop. They show that a forwarding table can always be updated, at least partially (i.e., one destination at a time), and propose an algorithm to reconfigure the network accordingly. Unlike the previous work, their method can be applied to any network with hop-by-hop, destination-based forwarding.
More recently, Vanbever et al. introduced an algorithm to find an order in which to update n forwarding tables, without disruptions, in n steps, updating exactly one forwarding table per step [57]. To this end, the algorithm first enumerates the orderings in which a loop arises so as to generate constraints, then uses linear programming to find an order that satisfies the constraints (i.e., a safe order). Such an order does not always exist, in which case one can fall back to the partial forwarding table updates of the aforementioned algorithm. They also show that this problem is NP-complete.
Unfortunately, none of these solutions is satisfactory for OpenFlow networks; in fact, only the last two could be applied at all. OpenFlow offers far more flexibility in forwarding: it is not limited to destination-based forwarding. OpenFlow switches' flow tables are also likely to be filled with a mix of aggregate entries of all sorts, since the space of matching patterns is large while flow tables remain limited as of today. In addition, the matching pattern of these aggregate rules might change as well, and it is unknown how the algorithms behave with aggregate rules that do not match on the destination only. Moreover, those solutions are aimed at planned, if not manual, reconfigurations; with OpenFlow, an automated and faster solution that can be abstracted by the platform is desirable. Finally, those solutions do not make it possible to enforce certain kinds of consistency, as exposed in the next section.
5.2 Virtualization
Another way to prevent disruptions during a reconfiguration is to have two or more virtual networks, with packets travelling across one or the other. This makes it possible to change the configuration of one network without disruptions by moving the traffic to the other before the reconfiguration. This is part of the idea presented by Reitblatt et al. [52].
In that paper, the authors introduce the problem of consistency in software-defined networks such as OpenFlow, together with different classes of consistency, and propose mechanisms to ensure them. We explain their work in more detail hereafter.
5.2.1 Consistency classes
The most obvious class of consistency is per-packet consistency. It guarantees that each packet is handled either entirely by rules of the initial configuration or entirely by rules of the final configuration. Such consistency is useful, for instance, in a network where packets must be processed by a middlebox: although both the initial and final configurations are correct, packets might end up not passing through the middlebox in some intermediate configuration. Note that this is a stronger requirement than the absence of loops.
The other class of consistency is per-flow consistency. Similarly, it guarantees that all packets of a flow are handled by rules of one configuration or the other, but not both. Suppose the middlebox of the previous example is stateful (it keeps per-flow state) and that there are two or more such middleboxes: in any configuration, packets of the same flow must pass through the same middlebox.
5.2.2 Ensuring per-packet consistency
Ensuring per-packet consistency is relatively simple. Packets arriving at ingress switches are tagged as they would be in a traditional 802.1Q VLAN, and rules on transit and egress switches match the tag.
When a new configuration is to be deployed, the new rules match a new tag. When the new rules are installed
in all transit and egress switches, the rules to tag the packets with the new tag are progressively installed with
a higher priority than the ones with the old tag on ingress switches, and the packets start flowing through the
network according to the new configuration. Once all ingress switches have been updated, the rules matching
the old tag can be removed.
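The update sequence above can be sketched against a hypothetical controller API; `install` and `remove` below stand in for the corresponding OpenFlow flow-mod operations and are not a real controller interface:

```python
# Two-phase, per-packet-consistent update as described above.
# The controller object and its install/remove methods are illustrative
# stand-ins for OpenFlow rule installation and removal.

def consistent_update(controller, ingress, internal, old_tag, new_tag,
                      new_rules):
    # Phase 1: install the new configuration on transit and egress
    # switches, matching the new tag; old-tagged traffic is unaffected.
    for switch in internal:
        controller.install(switch, match_tag=new_tag,
                           rules=new_rules[switch])
    # Phase 2: flip the ingress tagging rules one by one, with a higher
    # priority than the old ones. Each packet is now handled entirely
    # by either the old or the new configuration, never a mix of both.
    for switch in ingress:
        controller.install(switch, tag_with=new_tag, priority=2)
    # Phase 3: once no in-flight packet carries the old tag anymore,
    # garbage-collect the rules matching it.
    for switch in ingress + internal:
        controller.remove(switch, match_tag=old_tag)
```

The key invariant is simply ordering: no ingress switch applies the new tag before every transit and egress switch can handle it.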
[Figure 5.2 shows the network of Figure 5.1 with switches S1 to S4. Before the migration, the flow table of ingress switch S1 tags flow X with VLAN 0x3; transit switch S2 forwards VLAN 0x3 on port 0 and VLAN 0x4 on port 1. After the migration, S1 tags flow X with VLAN 0x4.]
Figure 5.2: As soon as the rules of the new configuration are installed on the transit and egress switches with a new tag, packets entering the network are tagged so that they match the new rules.
5.2.3 Ensuring per-flow consistency
Ensuring per-flow consistency is a bit more complicated. As with per-packet consistency, packets are tagged at the ingress switches while rules on transit and egress switches match the tags. However, unlike per-packet consistency, the tagging rules at the ingress cannot be replaced at an arbitrary time after the new rules have been installed on transit and egress switches: doing so could tag packets of the same flow with different tags, and these packets would end up being processed by different configurations.
Instead, the new tagging rule should take over only once all flows matching the old rule have terminated. With the current version of OpenFlow, this can be approximated with a timer, such that the rule is removed some time after its last packet was processed, but this is an imperfect solution. A better solution is to use the instantiation rules defined in DevoFlow [20]: a rule is instantiated for each new flow matching the instantiation rule. If such rules were used at the ingress switches, they could be replaced at any time: ongoing flows would still match the rules instantiated by the old rule until they terminate, while new flows would match rules instantiated with the new tag.
Conclusion
In CHANGE, platforms need to do more than avoid loops or black holes during a reconfiguration. Indeed, a platform contains various processing modules, and neither packets nor flows can be allowed to bypass the appropriate processing modules during a reconfiguration, so one of the two aforementioned classes of consistency is necessary. To our knowledge, there are no algorithms to enforce such consistency properties yet. Nevertheless, the virtualization mechanisms described above are satisfactory.
6 Load Balancing
In the long term, if the project's vision is confirmed, a CHANGE site could consist of a complete data center containing a set of switches and servers making up a number of platforms. Data centers are known for their significant path diversity, which can be leveraged by techniques such as load balancing to improve the performance of flow processing in CHANGE platforms. To this end, in this section we introduce a novel load-balancing scheme called CFLB which, unlike current hash-based approaches, allows hosts to explicitly select the load-balanced path they want to use for a specific flow.
Load balancing makes it possible to maximize throughput [35], achieve redundant connectivity [36] and reduce congestion [14]. Different forms of load balancing can be deployed at various layers of the protocol stack. At the datalink layer, frames can be distributed over parallel links between two devices [7]. At the application layer, requests can be spread over a pool of servers.
At the network layer, the most common technique, Equal-Cost Multi-Path (ECMP) [35, 18], allows routers
to forward packets over multiple equally-good paths. ECMP may both increase the network capacity and
improve the reaction of the control plane to failures [36]. Current ECMP-enabled routers proportionally
balance flows across a set of equal next hops on the path to the destination. Moreover, various methods
to practically perform the forwarding among multiple next hops are possible [14, 35]. The most deployed
next-hop selection method is solely based upon a hash computed over several fields of the regular packet
headers [1, 35]. Using a hash function ensures a somewhat fair distribution of the next-hop selection [14]
while preserving the packet sequence of transport-level flows.
Data center designs rely heavily on ECMP [5, 28, 30, 48] to spread the load among multiple paths and reduce congestion. This form of load balancing is naive, however: congestion can still occur inside the data center and lead to reduced performance. Data center traffic contains both mice and elephant flows [10, 37]. Mice flows are short and numerous, but they do not cause congestion; most of the data is carried by a small fraction of elephant flows. Based on this observation, several authors have proposed traffic engineering techniques that route elephant flows on non-congested paths (see [11, 19, 6] among others). Those techniques rely on OpenFlow switches [46] to control the server-to-server paths. Unfortunately, the scalability of such approaches is limited, as they may overload the flow tables of the OpenFlow switches.
In this chapter, we show that another design can take advantage of the path diversity that exists in data center networks, and thereby improve communication between CHANGE platforms. Current hash-based implementations rely on the IP and TCP headers to select the load-balanced path over which each flow is forwarded. In effect, the IP addresses and the TCP port numbers implicitly specify the flow path. Unfortunately, since routers rely on hash functions to perform the load balancing, it is very difficult for a host to predict the path that a specific flow will follow. We show in this chapter that hash functions are not the only practical way to enable path diversity. We propose a new deterministic scheme called Controllable per-Flow Load-Balancing (CFLB) that allows hosts to explicitly select the load-balanced path they want to
use for a specific flow. This is a major change compared to existing hash-based techniques and opens new possibilities. To allow packet steering by end hosts, CFLB replaces the hash-based next-hop selection method implemented today on load-balancing routers with an invertible mechanism based on a function such as a block cipher. CFLB routers apply this invertible procedure over selected fields of the packet headers to select a load-balanced next hop. CFLB does not rely on any state in routers or on any extension to packet headers: existing header fields are used to convey a path selector. CFLB is also transparent to hosts that do not want to perform packet steering; in this case, classic hash-based load balancing is performed, without the router having to distinguish controlled packets from non-controlled ones.
The remainder of this chapter is organized as follows. We first recall in Section 6.1 the basics of current hash-based load balancing. Then we consider in Section 6.2 MultiPath TCP as a case study to introduce our proposal. We provide a detailed description of the operation of CFLB in Section 6.3. In Section 6.4, we analyze the performance of CFLB: we first use trace-driven simulations to compare CFLB with existing hash-based techniques, and then implement CFLB in the Linux kernel to evaluate its packet forwarding performance. We also evaluate the benefits of CFLB for MultiPath TCP hosts. In Section 6.5, we discuss other possible applications.
6.1 Path Diversity at the Network Layer
There exist several proposals to enable path diversity at the network layer [47]. However, in practice only Equal-Cost Multi-Path (ECMP) [35] is currently deployed. ECMP is both a path selection scheme and a load distribution mechanism. To enable path diversity, it uses equal-cost paths (paths that tie in cost) so as to ensure loop-free forwarding. Depending on the resulting level of path diversity, routers then proportionally balance packets over their multiple next hops. This proportional aspect is an arbitrary design choice and is not in the scope of this deliverable. We focus on the practical implementation of the mapping (packet → next hop).
Various next-hop mapping methods exist to practically balance packets over load-balanced paths. They
should meet the following requirements.
Minimal disruption Packets from the same TCP flow should always follow the same path in order to avoid
packet reordering.
Transparency Load balancing operates on normal network packets. It does not require any additional fields
in the packet header.
Operate at Line Rate The additional computation required to balance the packets should be marginal.
An ECMP load balancer should also share the load fairly over the next hops. Since flows vary widely in terms of number of packets and volume, this is not easy [35, 14]. Most load-balancing methods that meet these requirements are based on a hash function [14, 1]. They compute a hash over the header fields that identify the flow: usually the source and destination IP addresses, the protocol number, and the source and destination ports. We call these fields the 5-tuple.
Figure 6.1: More than 80% of the server pairs in popular models of data center topologies have two or more paths between them.
Figure 6.2: On average, for 70% of destinations, routers and switches have multiple next hops.
The computed hash can then be used in various ways to select a next hop. The simplest and most widely deployed method is called Modulo-N: if there are N available next hops, the remainder of dividing the hash by N identifies the next hop to use. Because of the roughly uniform distribution of the hashes, this method results in a uniform distribution in terms of number of flows. A slightly more sophisticated method, which allows for a parameterizable distribution, is called Hash-Threshold: the space of hashes is divided into subspaces, each of which corresponds to a next hop. A third method is Highest Random Weight, in which the hash is computed not only over the 5-tuple but also over the identifier of a next hop. For each packet, a hash is computed for each next hop, and the next hop with the highest hash is selected.
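The three selection methods can be sketched as follows. This is an illustrative Python sketch using a general-purpose software hash; real routers use fast hardware hash functions, and the function names are ours:

```python
# Illustrative sketches of the three hash-based next-hop selection
# methods described above (Modulo-N, Hash-Threshold, Highest Random
# Weight), using SHA-256 truncated to 32 bits as a stand-in hash.
import hashlib

def _h(*fields):
    # Deterministic 32-bit hash over arbitrary fields.
    data = "|".join(map(str, fields)).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def modulo_n(five_tuple, n):
    # The remainder of the hash picks one of the n next hops.
    return _h(*five_tuple) % n

def hash_threshold(five_tuple, boundaries):
    # The hash space is split into subspaces; boundaries[i] is the
    # (exclusive) upper bound of next hop i's subspace, the last one
    # being 2**32. Unequal subspaces give a parameterizable split.
    h = _h(*five_tuple)
    for hop, bound in enumerate(boundaries):
        if h < bound:
            return hop

def highest_random_weight(five_tuple, n):
    # Hash over the 5-tuple *and* each next-hop identifier; the hop
    # with the highest hash wins.
    return max(range(n), key=lambda hop: _h(*five_tuple, hop))
```

All three are deterministic per 5-tuple, which preserves the packet order of a TCP flow, yet none lets a host predict, let alone choose, the selected next hop.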
ECMP is widely used in data centers. Several recent data center proposals have been optimized to support
it [51, 48]. We consider the path diversity between pairs of servers. We use three generic models: Fat-Tree [5],
VL2 [28], and BCube [30]. For each model, we instantiate a representative topology with approximately 600
servers. Since the simple spanning tree at the link layer also tends to be replaced by multipath-capable
routing [56], we also consider switches as load balancers1.
From Figure 6.1, we observe that more than 80% of the pairs of servers in these data center networks have at least two paths between them, with up to 64 paths in BCube. Moreover, Figure 6.2 shows that, on average, for 70% of destinations the switches and routers have two
1 In BCube models, servers not only act as end hosts; they also act as relay nodes for each other.
next hops or more. The maximum number of next hops is 10.
Because hosts cannot predict the output of the routers' hash functions, forcing a path through the network is hard. In general, hosts can only vary the transport header fields to try to influence the path selection. In the following section, we study possible interactions between a transport protocol and hash-based load balancers.
6.2 Case Study: Multipath TCP
In the remainder of this section, we consider MultiPath TCP (MPTCP) [23] as a case study to describe the interactions between transport-level protocols and per-flow load balancers.
As previously mentioned, hash-based load-balancing techniques rely on the 5-tuple to make their forwarding decisions. A TCP connection is identified by this 5-tuple, and thus a single TCP connection only uses one of the available paths. Splitting a single data stream among the different load-balanced paths may bring significant performance gains.
MPTCP is an extension to TCP that makes it possible to split a data stream over multiple TCP subflows while still presenting a standard TCP socket API to applications [9]. MultiPath TCP associates each TCP subflow (identified by its 5-tuple) with its MPTCP session, which is identified by a token. After the establishment of the initial TCP subflow of an MPTCP session, the token makes it possible to use arbitrary port numbers for subsequent TCP subflows, as the token uniquely associates each TCP subflow with its MPTCP session. The subflows can be established using the same IP address pair and different ports [51], or using distinct IP addresses of the same end hosts [23]. If the subflows follow distinct paths, MPTCP is able to balance traffic across them thanks to the Coupled Congestion Control [59].
Raiciu et al. evaluated MultiPath TCP inside data centers [51]. Simulations and measurements show that performance improves when MultiPath TCP is allowed to use multiple subflows. Due to the load balancing deployed in the routers and switches of the data center, two subflows may follow distinct paths, and thus the overall throughput of the MPTCP session may be higher. Additionally, the network may experience better load balancing, as potentially more links are used. However, in practice, MultiPath TCP establishes additional subflows on random port numbers: with standard hash-based load balancing, there is no guarantee that a different path will be chosen for each of these subflows. Furthermore, even when subflows follow distinct paths, there is no guarantee that these paths are independent, i.e., that they do not share a congested bottleneck.
Let us consider a source that has m load-balanced paths towards a destination and that l of those m paths
are distinct. If all paths are equiprobable, then the probability that k subflows go through k different paths
amongst the l distinct paths is defined by:
P_m(k, l) = l! / ((l − k)! × m^k)    (∀ k, l, m ∈ N | k ≤ l ≤ m)    (6.1)
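Equation 6.1 can be evaluated directly. The sketch below is illustrative (the function name is ours) and cross-checks the closed form against the equivalent product form Π_{i=0}^{k−1} (l − i)/m:

```python
# Direct evaluation of Equation 6.1: the probability that k subflows
# land on k different paths amongst the l distinct ones, when each
# subflow independently draws one of the m equiprobable load-balanced
# paths.
from math import factorial

def p_distinct(k, l, m):
    assert k <= l <= m
    return factorial(l) / (factorial(l - k) * m ** k)
```

The probability shrinks quickly as k grows, since each additional subflow multiplies in another factor smaller than one.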
If there are 16 load-balanced paths, which seems realistic with respect to the observations we made in the
previous section, the probability of covering 4 distinct paths is low: e.g., if a source generates either 2 or 4 subflows, the probability is respectively P_16(2,4) = 6.7% and P_16(4,4) = 0.5%. In practice, the equiprobability assumption between load-balanced paths may not hold, so the actual figures may be worse: since the load-balanced paths may be unbalanced, distinct paths may be even harder to obtain with a random approach.
One could argue that sources should generate as many subflows as possible to increase the probability given by Equation 6.1. However, establishing additional subflows comes at a cost: each new subflow requires a three-way handshake with cryptographic authentication before being established, and each additional subflow increases the memory requirements at the receiver due to a larger receive buffer [9]. From a performance viewpoint, MultiPath TCP would clearly benefit from being able to establish the minimum number of subflows needed to efficiently use distinct paths. This, however, requires the ability to deterministically map a subflow to a path offered by the network, in order to set up subflows on distinct paths.
6.3 Controllable per-Flow Load-Balancing
Controllable per-Flow Load-Balancing (CFLB) has been designed to overcome the limitations of current hash-based load-balancing techniques. CFLB allows sources to encode, inside existing fields of the packet header, the load-balanced path that each packet should follow. CFLB is transparent to applications that do not want to steer packets and does not require any changes on non-CFLB-aware end hosts: the network layer still balances non-controlled traffic without disrupting transport-layer flows. CFLB does not require storing any state in routers, which perform only simple calculations.
CFLB is not a source routing solution: it only enables hosts to select a load-balanced path among the loop-free paths offered by the network layer. Compared to a source routing solution, it does not require any header extension.
CFLB consists of four separate operations. First, the source specifies the desired path as a sequence of next-hop selections, which we call a path selector. Second, this path selector is encoded inside selected header fields of the packets; we call these header fields controllable. Third, each router recovers the encoded path selector from these fields and, finally, derives the load-balanced next-hop selection for the packet. These four operations allow CFLB-aware sources to steer their packets inside the network.
The remainder of this section is organized so as to explain the design choices behind CFLB. As a basic case study, we focus on IPv4 networks running MultiPath TCP end hosts2. MultiPath TCP allows the use of arbitrary port numbers for the additional TCP subflows (see Section 6.2), and thus we use the port numbers as the controllable fields of the packet header. We also define uncontrollable fields that are used to add randomness to the forwarding of packets from non-CFLB-aware sources. With IPv4, these uncontrollable fields are the source and destination IP addresses and the protocol number.
The design of CFLB starts from two assumptions. First, the same CFLB function should be used on all CFLB-capable devices of the data center. Second, the data center topology must be known either by the
2 CFLB also works in different network environments and for different applications (see Section 6.5).
sources or by a server that can be queried by the sources (in such a case, we can also envision that the server possesses global load information).
In the following, we first describe how CFLB translates a forwarding path into a path selector and how a router retrieves from it the next-hop selection it should apply. Second, we discuss how to encode the path selector inside the packet header using an invertible function. Then, we discuss how CFLB adds randomness for load distribution and avoids polarization. Finally, we summarize the complete operation of CFLB.
6.3.1 Path Selector
In a network offering path diversity, there exist multiple load-balanced paths between a source and a destination. Figure 6.3 shows the Directed Acyclic Graph (DAG) of all load-balanced paths between a source S and a destination D in a simple network; in this example there are five different load-balanced paths between S and D. Table 6.1 lists the symbols used in this section and their definitions.
Symbol     Definition
B          The radix of the path selector, i.e., the numeral base used to encode the path selector.
n_i        The next-hop selection of the router at position i.
L          The length of the path selector, i.e., the number of next-hop selections that can be encoded in it.
N_i        The number of load-balanced next hops available at the router at position i (for the sake of clarity, we ignore the destination prefix).
F(x)       The invertible function applied to x.
H(x)       The hash function applied to x.
cf         The controllable fields used.
uf         The uncontrollable fields used.
E_i(n_i)   The function applied to a next-hop selection; it adds "randomness" using uf.
D_i(x)     The inverse of E_i, i.e., D_i(E_i(x)) = x, ∀x ∈ [0, B[.
A||B       The concatenation of A and B.
Table 6.1: General notations.
CFLB allows sources to force a packet to follow a given load-balanced path. Such a path can be described as a sequence of routers and their next-hop selections. For instance, the load-balanced path highlighted in bold in Figure 6.3 can be expressed as the sequence R1 → R2, R4 → R7, where Ri → Rj means that router Ri forwards the packet to its neighbor Rj. There is no need to represent the next-hop selection of router R7 towards D, as R7 only has one possible next hop.
Knowing the number Ni of next hops available towards a destination at each router of the network, next-hop selections can be mapped to numbers ni ∈ [0, Ni[, where ni is the index of the next hop to be selected. With this representation, the highlighted path of Figure 6.3 can be expressed as the sequence of next-hop selections {(1 → 0), (4 → 2)}, where (i → j) specifies that router Ri on the path selects its jth next hop. In CFLB, we call such a sequence of next-hop selections a path selector.
Path Selector Representation To force a packet to follow a specific load-balanced path, CFLB encodes
the path selector inside the source and destination ports of the packet headers. CFLB stores a path selector
Figure 6.3: The load-balanced paths between S and D.
as a positional base-B unsigned integer, where B is known as the radix and is shared by all the nodes in the network. This makes it possible to maximize the number of next-hop selections that can be encoded inside the path selector while minimizing the number of bits used.
A path selector p can be generalized as:
p = Σ_{i=0}^{L−1} n_i × B^i    (6.2)
where n_i is an unsigned base-B digit that represents the next-hop selection of the router at the ith position within the path selector. For the moment, we assume that each router is able to determine its position
in the path selector (see the dedicated paragraph for more details).
Only routers having multiple load-balanced next hops to forward a packet must retrieve the path selector. In
this case, the router first extracts the path selector p from the packet’s header fields and then retrieves the
next-hop selection.
A path selector p can be inverted on the router having the ith position within the path selector to find the
next-hop selection ni it needs to apply on a packet by applying Equation 6.3.
n_i = ⌊p / B^i⌋ mod B    (6.3)
The integer division by Bi removes all load-balanced next-hop selections of upstream routers while the
modulo operation removes all load-balanced next-hop selections of downstream routers.
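Equations 6.2 and 6.3 can be sketched in a few lines of Python (an illustration under our own helper names, not part of the CFLB implementation):

```python
def encode_path_selector(selections, B):
    """Equation 6.2: pack the next-hop selections n_i (one base-B
    digit per position i) into a single unsigned integer p."""
    p = 0
    for i, n_i in selections.items():  # {position: next-hop index}
        assert 0 <= n_i < B
        p += n_i * B**i
    return p

def decode_selection(p, i, B):
    """Equation 6.3: recover the selection of the router at position i.
    The integer division drops the selections of upstream routers
    (lower positions), the modulo drops those of downstream routers
    (higher positions)."""
    return (p // B**i) % B

# Round trip with B = 3: positions 0 and 3 carry selections 2 and 1.
p = encode_path_selector({0: 2, 3: 1}, B=3)  # 2*3^0 + 1*3^3 = 29
```

Any position that was not explicitly encoded simply decodes to 0.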
Path Selector Length The fixed size of the packet header fields used to encode the path selector limits the
number of encodable next-hop selections to:
L = ⌊log_B(2^X)⌋    (6.4)
where X is the size in bits of the header fields used to encode the path selector. With IPv4, X = 32 since the
ports are used to convey the path selector.
Increasing B decreases L, thus the number of load-balanced next-hop selections that can be encoded inside
the path selector. As the potential value of ni is limited by the radix B, one should choose a B value so that it
is the maximum degree in every possible DAG (for each possible destination prefix). It can be expressed as:
B = max_i(N_i)    (6.5)
With such a radix, every router's next-hop selection can be represented, as any selection value is lower than the maximum number of next hops available at any router. In practice, most routers have a hardcoded upper bound on the number of next hops they can use for a destination; most of them use a maximum of 16 next hops [8].
Using a radix equal to the maximum number of next hops across all CFLB routers might, however, make inefficient use of the available bits in the packet header. Relaxing the problem by allowing CFLB routers to use more than one position inside the path selector gives more flexibility in the load-balanced next-hop selection values that can be encoded at the source. If a router uses n positions of the path selector, then a source can select a next hop among B^n on this CFLB router.
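For concreteness, the capacity L of Equation 6.4 can be computed with integer arithmetic (a sketch; `selector_length` is our own name):

```python
def selector_length(B, X):
    """Equation 6.4: L = floor(log_B(2^X)), i.e., the largest L such
    that L base-B digits fit in the X controllable bits. Integer
    arithmetic avoids floating-point rounding issues near exact
    powers."""
    capacity = 2**X
    L = 0
    while B**(L + 1) <= capacity:
        L += 1
    return L

# With the 32 bits of the IPv4 ports: radix 3 gives 20 positions,
# while radix 16 (the common next-hop upper bound) gives only 8.
```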
Position Inside the Path Selector Up to now, we assumed that each router knows its position inside
the path selector. However, in a real-world scenario it is impossible for a router to identify the sequence of upstream routers, and thus to know its position inside the path selector. To still enable a
source to construct a path selector and for routers to extract corresponding next-hop selections, CFLB uses
the Time-to-Live (TTL) of the packet to identify each router’s position inside the path selector. As the TTL
is decremented by each router and the path selector is bounded by L positions, we define the position i of each router within the path selector to be ttl_i mod L, where ttl_i is the TTL value of the packet received at the router whose position is i, i.e., the router located at the (64 − ttl_i + 1)th hop along the load-balanced path (64 being here the TTL used at the source). By knowing the initial TTL of the packets³, the source can encode the next-hop selection of each CFLB router on the path at the corresponding position inside the path selector. As previously mentioned, to allow a router to use more than one position inside the path selector, the router must decrement the TTL of the packets it forwards more than once. We do not discuss this extension further in this deliverable.
One Load-Balanced Path – Multiple Path Selectors When multiple connections between the same
pair of hosts want their packets to follow the same load-balanced path, using the same path selector generates
a collision, as each connection would use the same port numbers. To overcome this issue, CFLB can generate multiple path selectors that describe the same load-balanced path, using two solutions that can be
combined. First, the unused positions inside the path selector can be filled with random values. Second, by
changing the initial TTL, the position of each router in the path selector changes and thus the pair of ports
used to force a specific load-balanced path. This ensures that the 5-tuple used by the hosts varies from one
connection to another.
³Most operating systems use a system-wide default TTL.
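The effect of the initial TTL on the positions, and hence on the resulting pair of ports, can be illustrated as follows (a sketch with our own helper name; hops are counted from 1 at the first router):

```python
def selector_positions(hops, initial_ttl, L):
    """Position of each CFLB router inside the path selector:
    i = ttl_i mod L, where ttl_i = initial_ttl - (hop - 1) since
    every router on the way decrements the TTL once."""
    return {hop: (initial_ttl - (hop - 1)) % L for hop in hops}

# R1 (hop 1) and R4 (hop 3) on the highlighted path of Figure 6.3,
# with L = 20: an initial TTL of 64 yields positions {1: 4, 3: 2},
# while starting at 63 shifts them to {1: 3, 3: 1}. The same
# load-balanced path is therefore described by a different path
# selector, and thus a different pair of ports.
```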
Example Let us now illustrate how CFLB works in the simple network shown in Figure 6.3.
First, based on Equation 6.5, we can deduce that the radix B must be 3. Using this radix, from Equation 6.4,
we can encode 20 load-balanced next-hop selections inside a path selector.
Let us assume that the source uses an initial TTL of 64 and wants this packet to follow the highlighted path
in Figure 6.3. The positions of the next-hop selections of routers R1 and R4 inside the path
selector are respectively 4 (64 mod 20) and 2 (62 mod 20). The path selector computed by the source
based on Equation 6.2 can therefore be expressed as (where the notation [x→ y] refers to a router at position
x, which should select its yth next hop):
p = {[4 → 0], [2 → 2]} = 0 × 3^4 + 2 × 3^2 = 18
Note that this implies encoding a next-hop selection of 0 for all other positions inside the path selector. This
value is then encoded inside the packet header. R1 retrieves from the packet header the same path selector
and the TTL value to compute its position inside the path selector, i.e., 4. It then computes the next-hop
selection it needs to apply on the packet based on Equation 6.3:
n_4 = ⌊18 / 3^4⌋ mod 3 = 0
R1 decrements the TTL of the packet and forwards it to the next hop labeled 0, i.e., R2. R2 does not have
load-balanced next hops. It forwards the packet to R4 and decrements the TTL. R4 applies the same operation
as R1. R4 computes:
n_2 = ⌊18 / 3^2⌋ mod 3 = 2
The packet is therefore forwarded to R7 and then to D as R7 only has one possible next hop to forward the
packet.
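The arithmetic of this example can be checked directly (ignoring, for readability, the hashing and the invertible function introduced in the following sections):

```python
B = 3

# Source side: selection 0 at position 4 (for R1), selection 2 at
# position 2 (for R4); every other position encodes 0.
p = 0 * B**4 + 2 * B**2
assert p == 18

# R1, at position 64 mod 20 = 4, recovers its selection:
n4 = (p // B**4) % B   # floor(18/81) mod 3 = 0 -> forward to R2

# R4, at position 62 mod 20 = 2, recovers its selection:
n2 = (p // B**2) % B   # floor(18/9) mod 3 = 2 -> forward to R7
```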
6.3.2 Invertible Function
CFLB encodes the path selector inside the packet header using an invertible function F . This invertible
function must meet two important properties:
Bijection F must be bijective such that there is a one-to-one correspondence between its domain and image, which must both be {0, 1}^X where X is the length in bits of the controllable fields. F must also be invertible, i.e., ∃ F⁻¹, the inverse of F, such that:

F⁻¹(F(x)) = x  ∀x ∈ {0, 1}^X
Avalanche effect F must exhibit the avalanche effect [58]. Indeed, a simple bijective function might end up
in a poor distribution of the non-controlled traffic [14]. For instance, if only the port numbers are used as controllable fields, we do not want all web traffic to go through the same next hop. Therefore,
we require that the invertible function exhibits the avalanche effect, that is, a small variation of the input (e.g., different source ports) produces a large variation of the output.
Based on these two requirements, block ciphers such as Skip32⁴ or RC5 [53] are good candidates to implement this invertible function when 32 bits are controllable. When more bits are available, we recommend using a block-cipher mode of operation as a format-preserving encryption (FPE) construction, such as the eXtended CodeBook (XCB) mode of operation [45], which accepts arbitrarily-sized blocks provided they are as large as the blocks of the underlying block cipher. Depending on the number of bits that are controllable in the packet header, different types of block ciphers can be used with XCB, from 32-bit symmetric-key block ciphers to the most common ones such
as DES, 3DES or AES. Furthermore, efficient hardware-based implementations of such block ciphers exist [22, 34]. Using such functions to encode the path selector enables a router to apply the inverse of this function on the controllable fields of the packet header to retrieve the path selector.
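To make the two properties concrete, here is a toy 32-bit Feistel construction in Python. It is not Skip32 or RC5 (and should not be used in their place); it merely illustrates that a Feistel network is bijective whatever its round function, while a hash-based round function provides the avalanche effect:

```python
import hashlib

def _round(half, key, r):
    """Round function: 16 bits derived from the half-block, the key
    and the round index via SHA-256."""
    data = half.to_bytes(2, "big") + key + bytes([r])
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")

def toy_encrypt(x, key, rounds=8):
    """Bijective on [0, 2^32): split into 16-bit halves and apply
    standard Feistel rounds."""
    left, right = x >> 16, x & 0xFFFF
    for r in range(rounds):
        left, right = right, left ^ _round(right, key, r)
    return (left << 16) | right

def toy_decrypt(y, key, rounds=8):
    """Inverse permutation: undo the rounds in reverse order."""
    left, right = y >> 16, y & 0xFFFF
    for r in reversed(range(rounds)):
        left, right = right ^ _round(left, key, r), left
    return (left << 16) | right
```

Decryption recovers every 32-bit input exactly, which is the property a CFLB router relies on when applying F⁻¹ to the controllable fields.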
6.3.3 Load Balancing Efficiency
CFLB is designed based on the properties of hash-based load balancers. It must be transparent to sources that
do not need to control the load-balanced path taken by their packets. In this case, the controllable fields are
random and do not encode a path selector. CFLB must still distribute such packets efficiently amongst the
available load-balanced next hops.
Uncontrollable fields As Cao et al. showed, the most efficient packet distribution is achieved when all the fields representing a flow are used as input to the load-balancing function [14]. CFLB therefore also uses the uncontrollable fields, i.e., the source and destination addresses and the protocol number, as input. For this, the way the path selector is encoded changes slightly from Equation 6.2:
E_i(n_i) = (n_i + H(uf)) mod B    (6.6)
p = Σ_{i=0}^{L−1} E_i(n_i) × B^i    (6.7)
where H is a hash function and uf contains the uncontrollable fields of the packet header. This distributes packets efficiently over the available next hops of each router, while still allowing routers to recover the next-hop selections encoded by the sources.
Equation 6.7 can be inverted to find the next-hop selection ni to apply on the router whose position is i by
applying the following operation on the path selector extracted from the packet header:
D_i(x) = (x − H(uf)) mod B    (6.8)
⁴http://www.qualcomm.com.au/PublicationsDocs/skip32.c.
n_i = D_i(⌊p / B^i⌋)    (6.9)
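The pair E_i/D_i is simply a modular masking of each digit, as the following sketch shows (the function names are ours; h stands for the value H(uf)):

```python
def E(n_i, h, B):
    """Equation 6.6: mask a next-hop selection with the hash of the
    uncontrollable fields, so non-controlled packets look random."""
    return (n_i + h) % B

def D(x, h, B):
    """Equation 6.8: remove the mask, recovering the selection."""
    return (x - h) % B

# The mask cancels for every digit and every hash value:
# D(E(n, h, B), h, B) == n for all n in [0, B)
```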
Next-hop selection The next-hop selection ni is a value between 0 and B − 1. To select a next hop, CFLB applies a mapping between ni and a value between 0 and Ni − 1. However, as Ni ≤ B, a simple modulo operation leads to a poor load-balancing distribution when B is not a multiple of Ni. For instance, if Ni = 2 and B = 3 and the input is uniformly distributed, the router ends up forwarding two thirds of the incoming packets to the first next hop. To resolve this problem, CFLB computes the next-hop selection ni to apply on the packet as follows:
n_i = { D_i(⌊p / B^i⌋)        if D_i(⌊p / B^i⌋) < N_i
      { H(cf || uf) mod N_i    otherwise                    (6.10)
The intuition behind Equation 6.10 is that CFLB must distinguish whether the packet was controlled by a
source or not. If the packet was indeed controlled, the next-hop selected, ni, must be the one encoded in
the path selector. However, the non-controlled packets must be distributed randomly among the Ni available
next hops. In Equation 6.10, if the packet to forward is a controlled one, the decoded next-hop selection is lower than Ni (the number of next hops in the routing table for the packet's destination) and the decision encoded at the source is correctly applied. Otherwise, the packet is not a controlled one and is distributed randomly over one of the Ni available next hops.
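Equation 6.10 can be sketched as follows (function and argument names are ours; h_uf and h_fallback stand for the values H(uf) and H(cf||uf)):

```python
def select_next_hop(p, i, h_uf, h_fallback, B, N_i):
    """Decode the selection at position i (Equations 6.8 and 6.9); a
    value below N_i is taken as a deliberate, source-encoded choice,
    anything else falls back to hash-based load balancing."""
    n_i = ((p // B**i) - h_uf) % B
    if n_i < N_i:
        return n_i            # controlled packet: honor the encoding
    return h_fallback % N_i   # non-controlled packet: spread randomly
```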
In case of topological changes (transient or permanent), a CFLB-router will renumber indexes, i.e., update
the ni → Rj mapping and Ni, of its current available next hops towards each destination. In such a case,
while new flows are “aware” of the new state and are thus correctly controlled, previously existing ones may be impacted. Indeed, when the desired next hop no longer exists or its index has changed, the resulting path will change. In CFLB, the impacted controlled flows (i.e., the elephant flows) fall back to classic hash-based load balancing thanks to Equation 6.10. We consider that such topological changes should be quite marginal (occurring at a time scale greater than flow durations) and that new subflows may be created if impacted ones share a common bottleneck.
6.3.4 Avoiding Polarization
Some hash-based load-balancing techniques suffer from the polarization problem [44]. This problem arises
when there are several load-balancing routers in sequence. If they all perform the same computation on the
received packets, they will select the same next hop resulting in an uneven traffic distribution. With CFLB,
the polarization problem only arises when packets traverse more than L CFLB routers. Every router spaced
by L hops computes the same next-hop selection. CFLB solves this problem by assuming that every packet from the same flow received on a router has the same TTL⁵. CFLB therefore includes the TTL of
⁵This is a reasonable assumption since hosts use the same TTL for all packets and all packets from a flow follow the same path.
Figure 6.4: The complete mode of operation of a CFLB router (the controllable fields pass through F⁻¹ to yield p, the TTL gives the position i = TTL mod L, and a hash h of the uncontrollable fields is subtracted modulo B to obtain ni).
Network-wide constant: B = The radix in use in the network.
Network-wide constant: X = The number of bits that are controllable in the packet header.
Require: pckt = The packet to forward.
Ensure: The next-hop selection to apply on pckt.
 1: L ← ⌊log_B(2^X)⌋
 2: cf ← ExtractControllableFields(pckt)
 3: uf ← ExtractUncontrollableFields(pckt)
 4: ttl ← ExtractTTL(pckt)
 5: p ← F⁻¹(cf)
 6: n_i ← (⌊p / B^(ttl mod L)⌋ − H(uf || ttl)) mod B
 7: if n_i < N_i then
 8:   return n_i
 9: else
10:   return H(uf || cf || RouterID) mod N_i
11: end if

Algorithm 1: Pseudocode showing the operations performed by a CFLB router.
the packet inside the hash function ensuring that all routers will make different next-hop selections along the
path⁶. Equations 6.6 and 6.8 then respectively become:
E_i(n_i) = (n_i + H(uf || ttl_i)) mod B    (6.11)
D_i(p) = (⌊p / B^i⌋ − H(uf || ttl_i)) mod B    (6.12)
6.3.5 Summary
In the previous sections, we have explained all the design decisions behind the CFLB algorithm. For clarity,
we provide in this section the detailed pseudocode of CFLB.
Figure 6.4 and Algorithm 1 show, respectively, the operations and the pseudocode performed by a CFLB router to forward a packet among load-balanced next hops. The first operation is to extract the
controllable and the uncontrollable fields and the TTL from the packet header. Operation F−1 is the inverse of
the invertible function that extracts the path selector from the controllable fields. In Algorithm 1, after having
retrieved the next-hop selection ni, the router performs an if-then-else on the value ni, to determine whether
the packet was controlled by the source. The ai value, in Figure 6.4, corresponds to the addition
⁶Another solution could have been to use a simple router id as in classical hash-based load-balancing techniques [44]; however, this requires each router to be configured with a unique router id and requires the sources or the network information server to know all router ids.
Network-wide constant: B = The radix in use in the network.
Network-wide constant: X = The number of bits that are controllable in the packet header.
Require: path = A sequence of (ttl_i, n_i), where ttl_i is the TTL of the packet when received by router i and n_i the next hop that should be selected by router i.
Require: uf = The uncontrollable fields the source needs to use.
Ensure: The controllable fields (cf) to use to force a packet to follow the load-balanced path path.
1: L ← ⌊log_B(2^X)⌋
2: p ← 0
3: for (ttl_i, n_i) ∈ path do
4:   p ← p + ((n_i + H(uf || ttl_i)) mod B) × B^(ttl_i mod L)
5: end for
6: return F(p)

Algorithm 2: Pseudocode showing the path selector construction.
modulo B of the next-hop selection performed at position i and the hash computed on the uncontrollable
fields and the TTL. This ai value was inserted by the source at the ith position inside the path selector. The
router can thus retrieve it and compute the subtraction modulo B with the same hash value, to finally retrieve
the next-hop selection ni.
Algorithm 2 shows the pseudocode used by sources to construct a path selector. The source needs to first
compute the length of the path selector. This is needed because the routers position themselves inside the
path selector by using the TTL and the length of the path selector. Thus, this length has to be taken into
consideration to find the path selector. Then, the source iterates over all routers on the path (with a TTL ttli
and a next-hop selection ni) to compute the path selector.
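Putting Algorithms 1 and 2 together, the following Python sketch round-trips a path selector for the highlighted path of Figure 6.3. For readability, the invertible function F is taken as the identity and H is a truncated SHA-256; the function names and the encoding of uf are ours:

```python
import hashlib

B, L = 3, 20  # radix and selector length for X = 32 controllable bits

def H(*parts):
    """Stand-in for the hash H(...||...): SHA-256 truncated to 32 bits."""
    data = b"|".join(str(part).encode() for part in parts)
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def build_selector(path, uf):
    """Algorithm 2 with F = identity: path lists the (ttl_i, n_i)
    pairs of the CFLB routers along the load-balanced path."""
    p = 0
    for ttl_i, n_i in path:
        p += ((n_i + H(uf, ttl_i)) % B) * B**(ttl_i % L)
    return p

def router_select(p, uf, ttl, N_i, router_id):
    """Algorithm 1 with F^-1 = identity: decode the selection at
    position ttl mod L, falling back to hashing if it is >= N_i."""
    n_i = ((p // B**(ttl % L)) - H(uf, ttl)) % B
    if n_i < N_i:
        return n_i
    return H(uf, p, router_id) % N_i

# Source side: R1 receives TTL 64 and must pick next hop 0, R4
# receives TTL 62 and must pick next hop 2 (the example of 6.3.1).
uf = "10.0.0.1|10.0.0.9|6"  # hypothetical src/dst addresses, protocol
p = build_selector([(64, 0), (62, 2)], uf)
```

Calling router_select(p, uf, 64, 3, "R1") then recovers selection 0 and router_select(p, uf, 62, 3, "R4") recovers selection 2, matching the worked example.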
6.4 Evaluation
In this section, we evaluate the performance of CFLB compared to hash-based load balancing. Our goal is twofold: first, we evaluate its load-balancing and forwarding performance; second, we show through simulations and experiments how MultiPath TCP can benefit from CFLB to exploit the underlying path
diversity.
6.4.1 Load-Balancing for Non-Controlled Flows
The first requirement is that a router having multiple next hops for a given destination should uniformly
distribute the load [14] for non-controlled flows. If there exist N next hops for a given destination prefix, the load balancer should distribute 1/N of the total traffic to each next hop.
CFLB enables sources to steer controlled packets while also acting as a classic load balancer for non-
controlled packets (e.g., mice flows). To compare the hash-based load balancing techniques and CFLB,
we simulated each method using realistic traces and evaluated the fraction of packets forwarded to each next
hop. We based our simulations on the CAIDA passive traces collected in July 2008 at an Equinix data center
in San Jose, CA [55].
To analyze how CFLB balances the non-controlled traffic compared to hash-based techniques, we first simulated 10 million packets (extracted from the CAIDA traces) forwarded through one load balancer performing
a distribution among N = 2 next hops. Figure 6.5(a) shows the result of this simulation (computed every
second). There are three observations resulting from this figure. First, using CRC16 as a hash-based load
(a) After one load balancer. (b) After multiple load balancers.
Figure 6.5: Deviation from an optimal distribution amongst two possible next hops (CDF, in %, of the load repartition in % of packets; curves: CFLB-Skip32, CFLB-RC5, MD5 and, in (a), CRC16).
balancer gives a rather poor distribution of packets. Second, as the maximum deviation never reaches 4% of packets, the load distribution among the two output links is close to an equal 50/50% repartition of traffic for all evaluated techniques except CRC16. Third, CFLB, whatever the block cipher used, achieves a load distribution equivalent to that of a hash-based load balancer using MD5. We did not observe a significant impact of the B value used on the quality of the load distribution.
We also evaluated the load-balancing performance considering a sequence of several load balancers. Figure 6.5(b) shows the cumulative distribution of the maximum deviation of the load distribution after crossing four subsequent load balancers (computed every half second). The same observation as for Figure 6.5(a) applies: CFLB performs at least as well as a classical hash-based load-balancing technique.
Figure 6.6(a) and Figure 6.6(b) show, for respectively a hash-based load balancer using MD5 and CFLB using
RC5, the load balancing distribution of packets over time when N = 4. We analyze here the case of a router
having four outgoing links toward a given destination. We can notice that there are no significant differences
between the two techniques, as they behave in the same way over time. They both slightly fluctuate within the
same tight interval [22%, 28%] and their median is close to 25%. Simulations with other traces and different
Figure 6.6: Packet distribution computed every second amongst four possible next hops (% of packets over 20 seconds). (a) MD5 hash-based. (b) RC5 CFLB.
values of N provide similar results.
6.4.2 Forwarding Performance
The second requirement is the forwarding performance. In order to evaluate it, we implemented the forwarding path of CFLB as a module in the Linux kernel 2.6.38⁷. Note that the Université Catholique de Louvain partner has filed a patent for its CFLB code release; thus, the code is completely restricted to the European Commission and reviewers.
The basic behavior of the Linux kernel when dealing with multiple next hops for a given destination is to
apply a round robin distribution of packets based on the IP addresses, therefore performing a pure layer-3
load balancing. As this is not comparable to the hash-based load-balancing behavior introduced in Section 6.1, we extended the Linux kernel to take the 5-tuple of the packets into consideration and then apply a hash function to select a next hop (only CRC-like functions are available). To implement CFLB
⁷More information can be found at: http://inl.info.ucl.ac.be/cflb.
Figure 6.7: CFLB gives equivalent forwarding performance as hash-based load balancers (forwarded pps versus sent pps for Linux R-R, CRC16, CFLB-RC5 and CFLB-Skip32).
in the kernel, we extended the previously mentioned hash function to enable the deterministic selection of a next hop as described in Section 6.3. We used two different 32-bit block ciphers to implement the invertible function: RC5 and Skip32. The implementation of these two block ciphers is not optimized; the goal is solely to prove the feasibility of our solution (various techniques could be used to improve its performance [39, 31]). Note that CFLB also applies a CRC function on the uncontrollable fields to add randomness for non-controlled flows.
The goal of this experiment is to analyze the impact of using CFLB on packet forwarding: adding computation for each packet increases the processing delay and may also decrease the overall throughput achieved.
We deployed a testbed relying on three computers to emulate the forwarding path of a Linux router. The computer acting as a load balancer is an Intel Xeon X3440 @2.53GHz, and both sender and receiver are AMD Opteron 6128 @2GHz machines. The sender is connected through a 1Gbps link to the load balancer, which balances traffic amongst two 1Gbps links to the receiver. The traffic was generated using 8 parallel iperf⁸ generators creating UDP packets with a payload of 64 bytes, in order to overload the load balancer. The result of this experiment is given in Figure 6.7.
The classic Linux round-robin on the IP addresses obviously performs best (it only requires looking up the IP address in the routing cache to forward the packet). It forwards approximately 600,000 packets per second. Not far below, the classical hash-based technique using CRC16 and CFLB using RC5 forward respectively 570,000 and 560,000 packets per second. This performance drop compared to the standard Linux round-robin is mainly due to the more complex hash algorithm used to select the next hop. Finally, CFLB using Skip32 forwards up to 500,000 packets per second. We can conclude that CFLB, even using non-optimized block ciphers, comes with a marginal additional cost, as it gives forwarding performance equivalent to classical hash-based techniques.
6.4.3 MPTCP improvements with CFLB
In the following section, we evaluate the advantages of running MPTCP hosts, our case study, conjointly with a CFLB-enabled network. We first simulate a data center environment to show that when using
⁸http://iperf.sourceforge.net/
CFLB, MPTCP requires establishing fewer subflows for elephant connections than with a probabilistic approach. Finally, we also show in a small testbed that MPTCP can benefit from the usage of CFLB to avoid crossing hot spots.
6.4.4 Data Center Simulations
We performed simulations of MPTCP-enabled data centers and evaluated the performance achieved with a simple central flow-scheduling algorithm that allocates elephant subflows. The scheduler that we use for simulations simply counts the number of flows going through each link of the data center. In practice, hosts can use a technique similar to [19] to detect whether a connection corresponds to an elephant flow and, if so, query the scheduler to establish additional subflows. The scheduler then specifies to the host the ports to be used to set up a new subflow. The required ports are computed using the CFLB mode of operation, allowing a subflow to be mapped to a specific path in the network. We refer to this combination of MultiPath TCP and CFLB in the remainder of this section as MPTCP-CFLB.
To evaluate the benefits of CFLB with MPTCP in data centers, we first enhanced the htsim packet-level
simulator used in [51] to support path selection with CFLB. We consider exactly the same Fat-Tree datacenter
topology as discussed in Figure 2 of [51]. This simulated datacenter has 128 MPTCP servers, 80 eight-port
switches and uses 100 Mbps links. The traffic pattern is a permutation matrix, meaning that senders and
receivers are chosen at random with the constraint that receivers do not receive more than one connection.
The regular MPTCP bars of Figure 6.8 are the same as in Figure 2 of [51]. They show the throughput achieved by MPTCP when MPTCP subflows are load-balanced using ECMP. The MPTCP-CFLB bars show the throughput that MPTCP is able to obtain when CFLB balances the MPTCP subflows over the less loaded paths. The simulations show that with only 2 subflows, MPTCP-CFLB is much closer to the optimum than MPTCP with hash-based load balancing. Even with only one subflow (smartly allocated with the scheduler), improvements are considerable and MPTCP-CFLB achieves a good utilization of the network. This can be explained by the fact that relying on a random distribution of subflows results in a poor use of available resources.
Similar results have been observed on other data center topologies such as VL2 and BCube. We also performed simulations for an overloaded data center and observed that using MPTCP-CFLB conjointly with a flow scheduler focusing on less congested paths offers more fairness amongst the different connections.
6.4.5 Testbed Experiments
We applied a modification to the MPTCP Linux kernel 2.6.36 implementation [9] to add the deterministic
selection feature offered by CFLB. We created a netlink interface to the kernel so that a user-space module
can interact with MPTCP and announce to the kernel the subflows to create.
The CFLB functionality has been implemented in user space, supporting 32 controllable bits. Our prototype allows a source to control the source and destination ports to make packets follow a specific path inside an IPv4 network.
Figure 6.8: MPTCP needs few subflows to get a good Fat-Tree utilization when using CFLB (throughput as % of optimal versus number of MPTCP subflows, for MPTCP-CFLB and regular MPTCP).
The two block ciphers from subsection 6.4.2 (RC5 and Skip32) were also implemented
inside the kernel crypto library.
A Python library, pycflb, was developed to provide a simple API for interacting with the user-space CFLB. We also developed an RPC server to show the feasibility of centralizing the computation of CFLB in a server. The latter has information about the network topology and is the only component interacting with the pycflb library. pycflb must be configured with the cipher and key parameters in use in the network. Sources only query it to retrieve the ports to use or to recover the path taken by a specific flow. These three implementations (Linux MPTCP netlink interface, user-space CFLB and pycflb library) allow a source to deterministically map subflows to paths and represent approximately 4,000 lines of code.
When MultiPath TCP runs on a single-homed server, additional subflows are created by modifying the port numbers in a random manner. Since MultiPath TCP relies on tokens to identify the MPTCP connection to which a new subflow belongs (see Section 6.2), both the source and destination ports can be used to add
entropy. Combining CFLB and MultiPath TCP in the Linux MPTCP implementation provides a significant
benefit because the subflow 5-tuple can be selected in such a way that the underlying path diversity offered
by the network can be easily exploited.
We evaluate the benefit of this technique in a small testbed with a client and a server (AMD Opteron 6128
@2GHz) and two CFLB-capable routers (Xeon X3440 @2.53GHz).
In the first experiment, each host is connected to one router via a 1Gbps link. The routers are directly connected via seven 100 Mbps links, offering 7 distinct paths between the client and the server. If seven MPTCP subflows are created, an optimal usage of the network should result in about 700 Mbps of throughput. To evaluate this, we ran iperf between the hosts, creating traffic for one minute. The experiment was repeated 400 times to collect representative results. Figure 6.9 provides
Figure 6.9: Regular MPTCP is unlikely to use all paths. MPTCP-CFLB, on the other hand, always manages to use all the paths (PDF, in %, of the proportion of paths used).

Figure 6.10: Regular MPTCP has a very small probability of using link A of Figure 6.11 and is thus suboptimal compared to MPTCP-CFLB (average goodput in Mbps versus number of MPTCP subflows).

Figure 6.11: Testbed – S reaches the load balancer over a 1Gbps link; link A (100 Mbps) connects the load balancer directly to D, while 6 × 100 Mbps links lead to router R, itself connected to D by a 100 Mbps link. The maximum throughput available between S and D is 200 Mbps due to the bottleneck link between the router and the destination.
the probability distribution function of the number of distinct paths used by the classical MPTCP and our
enhanced MPTCP-CFLB implementation.
Figure 6.9 raises the following observation: as expected, using seven subflows, MPTCP-CFLB is able to take full benefit of the seven paths, while classical MultiPath TCP cannot efficiently utilize them. Indeed, the performance of MPTCP-CFLB is completely deterministic, as the MPTCP connection balances exactly its seven subflows over the seven paths. Among the 400 experiments with randomly selected paths, only two were able to use the seven paths. This confirms the analysis of Section 6.2 on the probability of selecting different paths by using random port numbers. Indeed, P_7(7,7) = 0.6% ≈ 2/400, which explains the poor ability of MPTCP to cover all 7 load-balanced paths. Most of the experiments result in four or five paths being used. This implies that two or three paths carry two competing TCP subflows from the same MPTCP connection.
Our second evaluation (Figure 6.11) still offers 7 distinct paths from the source to the destination, but this time the destination has two 100 Mbps links: one is a direct link from the load balancer to the destination and the second is attached to the router.
With only two subflows, MPTCP-CFLB is able to saturate the two 100 Mbps interfaces of the destination.
Figure 6.10 compares the performance of MPTCP and MPTCP-CFLB when the number of subflows varies. Each measurement with MPTCP was repeated 100 times and Figure 6.10 provides the average measured goodput. These measurements clearly show that, when using random port numbers, MPTCP is unable to efficiently use the two different 100 Mbps links. Increasing the number of subflows slowly increases the performance but, as explained in Section 6.2, adding a subflow to an MPTCP connection comes with a significant cost. Thus, the fewer TCP subflows are established, the better. MPTCP-CFLB is able to cover all the available paths at minimal cost.
6.5 Discussion
In this chapter, we have mainly focused on the utilization of CFLB in data center networks carrying TCP/IPv4 packets. As a case study, we considered the coupling with MPTCP. However, CFLB could be applied to other problems in different networking technologies. Extending CFLB to support another networking technology amounts to selecting the controllable and uncontrollable fields of the packet header that are used as input to the load-balancing algorithm. We briefly discuss some of these extensions in this section.
A first natural extension of CFLB would be to deploy it in IPv6 networks. In an IPv6 network, CFLB could also rely on the source and destination ports, but IPv6 packets additionally contain a 20-bit flow label field, whose semantics are still being debated within the IETF [15]. IPv6 sources could leverage CFLB to encode a path selector in the flow label field of their packets.
Although we illustrated the benefits of CFLB with MultiPath TCP, the same approach applies to single TCP/UDP connections. In this case only the source port can be controlled, since applications running on top of these transport protocols must specify the destination port.
A CFLB network has benefits beyond improving end-host performance. One side benefit is the monitorability of the load-balanced paths. Commercial networks often deploy monitoring tools that probe network paths to verify whether the network meets the stringent SLAs requested by their customers. Unfortunately, with load-balanced paths it is very difficult for a monitoring station to steer packets onto a specific path, which complicates network monitoring. This operational problem is one of the reasons why the MPLS-TP architecture prohibits the use of ECMP [13]. With CFLB, this problem disappears, since a monitoring station can easily steer packets along specific paths through CFLB routers.
MPLS networks often use ECMP to load balance the traffic. To enable MPLS routers to support ECMP
even when carrying non-IP packets, router vendors have proposed the utilization of special MPLS entropy
labels [38] to identify flows that can be load-balanced. CFLB could easily exploit these entropy labels to ensure both that flows are well balanced and that paths can be efficiently monitored.
6.6 Conclusion
Most data center networks nowadays rely on hash-based load-balancing to distribute the load over multiple paths. Hash-based techniques spread the load efficiently, but it is difficult to predict (and impossible to choose) the next hop that such load balancers will select. In this chapter, we have shown that it is possible to achieve efficient load-balancing while enabling hosts to explicitly select the paths of their flows. The CHANGE architecture could benefit from our technique, since it would allow it to predictably select multiple paths between platforms in order to increase network utilization and thus throughput.
7 Motivating Case Study: Extending TCP
In this section we provide an overview of our work to design Multipath TCP, guiding the reader through the lengthy MPTCP design process, which has taken the better part of four years.
Our experience in designing MPTCP has been one of the biggest motivations to develop a new Internet
architecture that is evolvable and incorporates flow processing and middleboxes as first class citizens of the
network.
As many researchers have lamented, changing the behavior of the core Internet protocols is very difficult [32].
An idea may have great merit, but unless there is a clear deployment path whereby the cost/benefit tradeoff
for early adopters is positive, then it is very unlikely to see widespread adoption.
The majority of applications use TCP for transport. Although newer protocols such as SCTP and DCCP exist
that may be a better match for application requirements, software writers must maximize the chance of a
successful connection. Understandably, they use TCP as it is always available and almost always works, and
then work around its limitations at a higher layer.
We wish to move from a single-path Internet to one where the robustness, performance and load-balancing
benefits of multipath transport are available to (almost) all applications. To support such unmodified applications we must work below the sockets API. Since there is no widely deployed signaling mechanism to select between transport protocols, we have to use options in TCP’s SYN exchange to negotiate new functionality.
The goal then is for an unmodified application to open a TCP connection in the normal way. When both
endpoints support MPTCP and multiple paths are available, MPTCP should be able to set up additional
subflows and stripe the connection’s data across these subflows, sending most data on the least congested
paths.
The potential benefits are clear, but there are potential costs too. If negotiating MPTCP can cause connections
to fail when regular TCP would have succeeded, then MPTCP is unlikely to be deployed. The second goal,
then, is for MPTCP to work in all current scenarios where regular TCP works. If a subflow fails for any
reason, the connection must be able to continue as long as some other subflow has connectivity.
Third, MPTCP must be able to utilize the network at least as well as regular TCP, but must not starve TCP.
The congestion control scheme described in [60] meets this requirement, but congestion control is not the
only factor that can limit throughput.
Finally MPTCP must be implementable in operating systems without using excessive memory or processing
power. As we will see, this requires careful consideration of both fast-path processing and overload scenarios.
7.1 Design
The five main mechanisms in TCP are:
• Connection setup handshake and state machine.
• Reliable transmission & acknowledgment of data.
• Congestion control.
• Flow control.
• Connection teardown handshake and state machine.
All of these need modifications to achieve robust, high-performance multipath operation, but congestion control has been described elsewhere [60], so we will not discuss it further here.
MPTCP is negotiated via new TCP options in SYN packets, and during this phase the endpoints also exchange
connection identifiers. These are then used to add new paths—subflows—to an existing connection. Subflows
resemble TCP flows on the wire, but they all share a single send and receive buffer at the endpoints. MPTCP
uses per subflow sequence numbers to detect losses and drive retransmissions, and connection-level sequence
numbers to allow reordering at the receiver. Connection-level acknowledgements are used to implement
proper flow control. We discuss the rationale behind these design choices below.
7.1.1 Connection setup
The TCP three-way handshake serves to synchronize state between the client and server1. In particular, initial
sequence numbers are exchanged and acknowledged, and TCP options carried in the SYN and SYN/ACK
packets are used to negotiate optional functionality.
MPTCP must use this initial handshake to negotiate multipath capability. An MP CAPABLE option2 is sent
in the SYN and echoed in the SYN/ACK if the server understands MPTCP and wishes to enable it. Although
this form of extension has been used many times, the Internet has grown a great number of middleboxes in
recent years. Does such a handshake still work?
Our tests in the previous section found that 6% of paths tested remove new options from SYN packets. This
rises to 14% for connections to port 80 (http). We did not observe any access networks that actually dropped
a SYN with a new option. Perhaps most importantly, no path removed options from data packets unless it
also removed them from the SYN, so it is possible to test a path using just the SYN exchange. A separate
study [12] probed Internet servers to see whether new options in SYN packets caused any problems. Of the
Alexa top 10,000 sites, 15 did not respond to a SYN packet containing a new option.
From these experiments we can conclude that negotiating MPTCP in the initial handshake is feasible, but
with some caveats. There is no real problem if a middlebox removes the MP CAPABLE option from the
SYN: MPTCP simply falls back to regular TCP behavior. However removing it from the SYN/ACK would
cause the client to believe MPTCP is not enabled, whereas the server believes it is. This mismatch would be
a problem if data packets were to be encoded differently with MPTCP. The obvious solution is to require the
third packet of the handshake (ACK of SYN/ACK) to carry an option indicating that MPTCP was enabled.
However this packet may be lost, so MPTCP must require all subsequent data packets to also carry the
1 The correct terms really should be active opener and passive opener, although even these ignore simultaneous open. For conciseness, we use the terms client and server, but we do not imply any additional limitations on how TCP is used.
2 Formally, this is subtype MP CAPABLE of a single TCP option used by MPTCP for multiple purposes.
option until one of them has been acked. If the first non-SYN packet received by the server does not contain
an MPTCP option, the server must assume the path is not MPTCP-capable, and drop back to regular TCP
behavior.
Finally, if a SYN needs to be retransmitted, it would be a good idea to follow the retransmitted SYN with one
that omits the MP CAPABLE option.
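The fallback rules described above can be summarized as a small decision function. This is an illustrative sketch with our own names, not the full RFC state machine:

```python
def mptcp_enabled(syn_had_mp_capable: bool,
                  synack_had_mp_capable: bool,
                  first_non_syn_had_option: bool) -> bool:
    """Sketch of the MP_CAPABLE fallback logic described in the text."""
    if not (syn_had_mp_capable and synack_had_mp_capable):
        # A middlebox stripped the option from the SYN or the SYN/ACK:
        # fall back to regular TCP behavior.
        return False
    if not first_non_syn_had_option:
        # The third ACK of the handshake may have been lost, so the server
        # only commits to MPTCP once a non-SYN packet carries the option.
        return False
    return True
```

Note how the decision is conservative in both directions: any missing option anywhere in the exchange collapses the connection to plain TCP rather than risking a state mismatch between the endpoints.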
It should be clear from this brief discussion of what should be the simplest part of MPTCP that anyone
designing extensions to TCP must no longer think of the mechanisms as concerning only two parties. Rather,
the negotiation is two-way with mediation, where the packets that arrive are not necessarily those that were
sent. This requires a more defensive approach to protocol design than has traditionally been the case.
7.1.2 Adding subflows
Once two endpoints have negotiated MPTCP, they can open additional subflows. In an ideal world there
would be no need to send new SYN packets before sending data on a new subflow - all that would be needed
is a way to identify the connection that packets belong to. In practice, though, we see that NATs and firewalls rarely pass data packets that were not preceded by a SYN.
Adding a subflow raises two problems. First, the new subflow needs to be associated with an existing MPTCP
flow. The classical five-tuple cannot be used as a connection identifier, as it does not survive NATs. Second,
MPTCP must be robust to an attacker that attempts to add its own subflow to an existing MPTCP connection.
When the first MPTCP subflow is established, the client and the server insert 64-bit random keys in the
MP CAPABLE option. These will be used to verify the authenticity of new subflows.
To open a new subflow, MPTCP performs a new SYN exchange using the additional addresses or ports it
wishes to use. Another TCP option, MP JOIN, is added to the SYN and SYN/ACK. This option carries a MAC of the keys from the original subflow, which prevents blind spoofing of MP JOIN packets by an adversary who wishes to hijack an existing connection. MP JOIN also contains a connection identifier derived as a hash of the recipient’s key [24]; this is used to match the new subflow to an existing connection.
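The two derivations just described can be sketched as follows. The hash and truncation choices below are illustrative; the exact formats are specified in [24]:

```python
import hashlib
import hmac

def connection_token(recipient_key: bytes) -> bytes:
    """Connection identifier derived as a (truncated) hash of the
    recipient's 64-bit key, used to match an MP_JOIN SYN to an
    existing MPTCP connection."""
    return hashlib.sha1(recipient_key).digest()[:4]

def join_mac(local_key: bytes, peer_key: bytes, handshake_nonces: bytes) -> bytes:
    """MAC over the MP_JOIN handshake, keyed with both 64-bit keys
    exchanged in MP_CAPABLE; an attacker who never saw the initial
    exchange cannot produce a valid MAC, so blind spoofing fails."""
    return hmac.new(local_key + peer_key, handshake_nonces, hashlib.sha1).digest()
```

A forged MP JOIN carrying a guessed token but an invalid MAC is simply discarded by the receiver, so hijacking requires knowledge of both keys.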
If the client is multi-homed, then it can easily initiate new subflows from any additional IP addresses it owns.
However, if only the server is multi-homed, the wide prevalence of NATs makes it unlikely that a new SYN
it sends will be received by a client. The solution is for the MPTCP server to send an ADD ADDR option
informing the client that the server has an additional address. The client may then initiate a new subflow. This asymmetry is not inherent: no protocol design limitation prevents the client from sending ADD ADDR or the server from initiating a new subflow with a SYN. But the Internet itself is so frequently asymmetric that we need two distinct ways, one implicit and one explicit, to indicate the existence of additional addresses.
7.1.3 Reliable multipath delivery
In a world without middleboxes, MPTCP could simply stripe data across the multiple subflows, with the
sequence numbers in the TCP headers indicating the sequence number of the data in the connection in the
normal TCP way. Our measurements show that this is infeasible in today’s Internet:
Figure 7.1: Problems with inferring the cumulative data ACK from subflow ACKs. (a) Drops due to incorrect inference; (b) stalls due to incorrect inference.
• We observed that 10% of access networks rewrite TCP initial sequence numbers (18% on port 80).
Some of this re-writing is by proxies that remove new options; a new subflow will fail on these paths.
But many that rewrite do pass new options - these appear to be firewalls that attempt to increase TCP
initial sequence number randomization. As a result, MPTCP cannot assume the sequence number space
on a new subflow is the same as that on the original subflow.
• Striping sequence numbers across two paths leaves gaps in the sequence space seen on any single path.
We found that 5% of paths (11% on port 80) do not pass on data after a hole - most of these seem to be
proxies that block new options on SYNs and so don’t present a problem as MPTCP is never enabled
on these paths. But a few do not appear to be proxies, and so would stall MPTCP. Perhaps worse, 26%
of paths (33% on port 80) do not correctly pass on an ACK for data the middlebox has not observed -
either the ACK is dropped or it is “corrected”.
Given the nature of today’s Internet, it appears extremely unwise to stripe a single TCP sequence space
across more than one path. The only viable solution is to use a separate contiguous sequence space for each
MPTCP subflow. For this to work, we must also send information mapping bytes from each subflow into the
overall data sequence space, as sent by the application. We shall return to the question of how to encode such
mappings after first discussing flow control and acknowledgments, as the three are intimately related.
7.1.3.1 Flow control
TCP’s receive window indicates the number of bytes beyond the sequence number from the acknowledgment
field that the receiver can buffer. The sender is not permitted to send more than this amount of additional
data.
Multipath TCP also needs to implement flow control, although packets now arrive over multiple subflows. If
we inherit TCP’s interpretation of receive window, this would imply an MPTCP receiver maintains a pool of
buffering per subflow, with receive window indicating per-subflow buffer occupancy. Unfortunately such an
interpretation can lead to a deadlock scenario:
1. The next packet that needs to be passed to the application was sent on subflow 1, but was lost.
2. In the meantime subflow 2 continues delivering data, and fills its receive window.
3. Subflow 1 fails silently.
4. The missing data needs to be re-sent on subflow 2, but there is no space left in the receive window,
resulting in a deadlock.
The receiver could solve this problem by re-allocating subflow 1’s unused buffer to subflow 2, but it can only
do this by rescinding the advertised window on subflow 1. Besides, the receiver does not know which subflow
the next packet will be sent on. The situation is made even worse because a TCP proxy3 on the path may hold
data for subflow 2, so even if the receiver opens its window, there is no guarantee that the first data to arrive
is the retransmitted missing packet.
The correct solution is to generalize TCP’s receive window semantics to MPTCP. For each connection a
single receive buffer pool should be shared between all subflows. The receive window then indicates the
maximum data sequence number that can be sent rather than the maximum subflow sequence number. As
a packet resent on a different subflow always occupies the same data sequence space, no such deadlock can
occur.
The problem for an MPTCP sender is that to calculate the highest data sequence number that can be sent, the
receive window needs to be added to the highest data sequence number acknowledged. However the ACK
field in the TCP header of an MPTCP subflow must, by necessity, indicate only subflow sequence numbers.
Does MPTCP need to add an extra data acknowledgment field for the receive window to be interpreted
correctly?
7.1.3.2 Acknowledgments
To correctly deduce a cumulative data acknowledgment from the subflow ACK fields, an MPTCP sender
might keep a scoreboard of which data sequence numbers were sent on each subflow. However, the inferred
value of the cumulative data ACK does not step in precisely the same way that an explicit cumulative data
ACK would. Consider the following sequence4:
1. Data sequence no. 1 is sent on subflow 1 with subflow sequence number 1001.
2. Receiver sends ACK for 1001 on subflow 1.
3. Data sequence no. 2 is sent on subflow 2 with subflow sequence number 2001.
4. Receiver sends ACK for 2001 on subflow 2.
5. ACK for 2001 arrives at sender (the RTT on subflow 2 was shorter).
6. ACK for 1001 arrives at sender.
3 Most will prevent MPTCP being negotiated, but a few do not.
4 The example uses packet sequence numbers for clarity, but MPTCP actually uses byte sequence numbers, just like TCP.
The receiver expected the ACK for 1001 to serve as an implicit data ACK for 1, and the ACK for 2001 as an implicit data ACK for 2. However, as the ACK for 2001 does not implicitly acknowledge both 1 and 2, the sender’s inferred data ACK is still 0 after step 5. Only after step 6 does the inferred data ACK become 2.
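The effect of steps 5 and 6 can be reproduced with a toy scoreboard (illustrative only, at packet rather than byte granularity):

```python
def inferred_data_ack(acked_data_seqs: set) -> int:
    """Cumulative data ACK a sender can infer from subflow ACKs: the
    largest n such that data sequence numbers 1..n have all been
    acknowledged on some subflow (toy scoreboard logic)."""
    n = 0
    while n + 1 in acked_data_seqs:
        n += 1
    return n

# After step 5 only data segment 2 has been acknowledged (via subflow 2),
# so the inferred cumulative data ACK is still 0; after step 6 it jumps to 2.
print(inferred_data_ack({2}), inferred_data_ack({1, 2}))  # prints: 0 2
```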
This sort of reordering is inevitable with multipath, and it would not by itself be a problem, except that the
receiver needs to code the receive window field relative to the implicit data ACK. Figure 7.1(a) shows the
problem. Suppose the receive window were only two packets, and the application is slow to empty the receive
buffer. In the ACK for 1001, the receiver closes the receive window to one packet. In the ACK for 2001 the
receiver closes the receive window completely, as there is no space remaining. Unfortunately when the ACK
for 1001 is finally received, the inferred cumulative data ACK is now 2; the sender adds the receive window
of size one to this, and concludes incorrectly that the receiver has sufficient buffer space for one more packet.
Figure 7.1(b) shows a similar situation where reordering causes sending opportunities to be missed.
To avoid such scenarios MPTCP must carry an explicit data acknowledgment field, which gives the left edge
of the receive window.
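With the explicit data acknowledgment in place, the sender's admissibility check generalizes TCP's window test to the data level. A minimal sketch (names are ours):

```python
def may_send(data_seq: int, length: int, data_ack: int, recv_window: int) -> bool:
    """MPTCP flow control at the data (connection) level: the advertised
    window is counted from the cumulative data ACK and is shared by all
    subflows. A segment retransmitted on a different subflow reuses the
    same data sequence range, so it always fits wherever the original
    did, avoiding the per-subflow deadlock described earlier."""
    return data_seq + length <= data_ack + recv_window
```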
7.1.3.3 Freeing sender buffers
Having an explicit cumulative DATA ACK also improves robustness when faced with middleboxes. Consider
a connection with two subflows; subflow 1 is direct, but subflow 2 traverses a middlebox that pro-actively
acknowledges TCP segments once it has received them in order, even though they have not yet reached the
TCP receiver. In our results 3% of paths had such proxies (6% on port 80); all of these removed MP CAPABLE
from SYNs so MPTCP would not enable on these paths, but future MPTCP-aware middleboxes might not do
so.
Consider what happens when connectivity to the receiver via subflow 2 is lost, as might happen if it moved
out of coverage of a wireless basestation. The middlebox acknowledges a segment, but only then discovers
it can no longer reach the receiver. The sender receives the subflow acknowledgment for this segment. If it
uses the subflow acknowledgment to free the data at the sender, then this segment cannot then be resent on
subflow 1 which is still working. As a result the connection fails.
If, instead, the sender uses the explicit cumulative DATA ACK to free buffers, such a failure is avoided.
7.1.3.4 Encoding
We have seen that in the forward path we need to encode a mapping of subflow bytes into the data sequence
space, and in the reverse path we need to encode cumulative data acknowledgments. There are two viable
ways to encode this additional data:
• Send the additional data in TCP options.
• Carry the additional data within the TCP payload, using a chunked or escaped encoding to separate
control data from payload data.
Figure 7.2: Flow Control on the path from C to S inadvertently stops the data flow from S to C
For the forward path we have not found any compelling arguments either way, but the reverse path is a
different matter.
Consider a hypothetical encoding that divides the payload into chunks where each chunk has a TLV header.
A data acknowledgment can then be embedded into the payload using its own chunk type. Under most
circumstances this works fine. However, unlike TCP’s pure ACK, anything embedded in the payload must be
treated as data. In particular:
• It must be subject to flow control because the receiver must buffer data to decode the TLV encoding.
• If lost, it must be retransmitted consistently, so that middleboxes can track sequence state correctly5
• If packets before it are lost, it might be necessary to wait for retransmissions before the data can be
parsed - causing head-of-line blocking.
Flow control presents the most obvious problem for the chunked payload encoding. Figure 7.2 provides an
example. Client C is pipelining requests to server S; meanwhile S’s application is busy sending the large
response to the first request so it isn’t yet ready to read the subsequent requests. At this point, S’s receive
buffer fills up.
S sends segment 10, C receives it and wants to send the DATA ACK, but cannot: flow control imposed by S’s receive window stops it. Because no DATA ACKs are received from C, S cannot free its send buffer,
so this fills up and blocks the sending application on S. S’s application will only read when it has finished
sending data to C, but it cannot do so because its send buffer is full. The send buffer can only empty when S
receives the DATA ACK from C, but C cannot send a DATA ACK until S’s application reads. This is a classic
deadlock cycle.
As no DATA ACK is received, S will eventually time out the data it sent to C and will retransmit it; after many
retransmits the whole connection will time out.
It has been suggested that this can be avoided if DATA ACKs are simply excluded from flow control. Unfortunately, any middlebox that buffers data can foil this; it is unaware that the DATA ACK is special because it looks just like any other TCP payload.
5 In our observations, the usual TCP proxies re-asserted the original content when sent a “retransmission” with different data. We also found one path that did this without exhibiting any other proxy behavior - this is symptomatic of a traffic normalizer [33] - and one on port 80 that reset the connection.
When the return path is lossy, decoding DATA ACKs will be delayed until retransmissions arrive - this will ef-
fectively trigger flow control on the forward path, reducing performance. In effect, this would break MPTCP’s
goal of doing “no worse” than TCP on the best path.
Our conclusion is that DATA ACKs cannot be safely encoded in the payload. The only real alternative is to
encode them in TCP options which (on a pure ACK packet) are not subject to flow control.
7.1.3.5 Data sequence mappings
If MPTCP must use options to encode DATA ACKs, it is simplest to also encode the mapping from subflow
sequence numbers to data sequence numbers in a TCP option. We refer to this as the data sequence number
mapping or DSM.
At first we thought that the DSM option simply needed to carry the data sequence number corresponding to
the start of the MPTCP segment. Unfortunately middleboxes and “smart” NICs make this far from simple.
Middleboxes that resegment data would cause a problem.6 TCP Segmentation Offload (TSO) hardware in
the NIC also resegments data and is commonly used to improve performance. The basic idea is that the OS
sends large segments and the NIC resegments them to match the receiver’s MSS. What does a NIC performing
TSO do with TCP options? We tested twelve NICs supporting TSO from four different vendors. All of them
copy a TCP option sent by the OS on a large segment into all the split segments.
If MPTCP’s DSM option only listed the data sequence number, TSO would copy the same DSM to more
than one segment, breaking the mapping. Instead the DSM option must say precisely which subflow bytes
map to which data sequence numbers. But this is further complicated by middleboxes that rewrite sequence
numbers; these are commonplace — 10% of paths. Instead, the DSM option must map the offset from
the subflow’s initial sequence number to the data sequence number, as the offset is unaffected by sequence
number rewriting. The option must also contain the length of the mapping. This is robust - as long as the
option is received, it does not greatly matter which packet carries it, so duplicate mappings caused by TSO
are not a problem.
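The offset-based mapping can be sketched as follows (class and field names are ours; the wire format of the DSM option differs):

```python
class DsmMapping:
    """Data sequence mapping carried in a TCP option: maps a range of
    subflow bytes, expressed as offsets from the subflow's initial
    sequence number (ISN), to data sequence numbers. Offsets survive
    middleboxes that rewrite absolute sequence numbers, and the explicit
    length makes duplicate copies of the option (e.g. from TSO) harmless."""

    def __init__(self, subflow_offset: int, data_seq: int, length: int):
        self.subflow_offset = subflow_offset
        self.data_seq = data_seq
        self.length = length

    def data_seq_for(self, subflow_isn: int, subflow_seq: int):
        off = subflow_seq - subflow_isn          # unaffected by ISN rewriting
        if self.subflow_offset <= off < self.subflow_offset + self.length:
            return self.data_seq + (off - self.subflow_offset)
        return None                              # byte not covered by this mapping

# Even if a middlebox rewrites the subflow ISN (say, to 77000), the
# receiver still resolves subflow bytes to the correct data sequence:
m = DsmMapping(subflow_offset=0, data_seq=5000, length=1460)
assert m.data_seq_for(77000, 77100) == 5100
```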
7.1.3.6 Content-modifying middleboxes
Many NAT devices include application-level gateway functionality for protocols such as FTP. IP addresses
and ports in the FTP control channel are re-written by such middleboxes to correct for the address changes
imposed by the NAT.
Multipath TCP and such content-modifying middleboxes have the potential to interact badly. In particular,
due to FTP’s ASCII encoding, re-writing an IP address in the payload can necessitate changing the length of
the payload. Subsequent sequence and ack numbers are then fixed up by the middlebox so they are consistent
from the point of view of the end systems.
Such length changes break the DSM option mapping - subflow bytes can be mapped to the wrong place in the
data stream. They also break every other mapping mechanism we considered, including chunked payloads.
6 We did not observe any that would both permit MPTCP and resegment, though.
There is no easy way to handle such middleboxes.
After much debate, we concluded that MPTCP must include a checksum in the DSM mapping so such
content changes can be detected. MPTCP rejects a modified segment and triggers a fallback process: if any other subflow exists, MPTCP terminates the subflow on which the modification occurred; if no other subflow exists, MPTCP drops back to regular TCP behavior for the remainder of the connection, allowing the middlebox to perform rewriting as it wishes.
Calculating a checksum over the data is comparatively expensive, and we did not wish to slow down MPTCP
just to catch such rare corner cases. MPTCP therefore uses the same 16-bit one’s-complement checksum used
in the TCP header. This allows the checksum over the payload to be calculated only once. The payload
checksum is added to a checksum of an MPTCP pseudo header covering the DSM mapping values and then
inserted into the DSM option. The same payload checksum is added to the checksum of the TCP pseudo-
header and then used in the TCP checksum field.
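The sharing described above relies on the additivity of the one's-complement sum: the payload sum is computed once and then folded into two different pseudo-header sums. A sketch (the pseudo-header byte layouts below are placeholders, not the real formats):

```python
def oc_sum(data: bytes) -> int:
    """16-bit one's-complement sum; odd-length data is zero-padded."""
    if len(data) % 2:
        data += b"\x00"
    s = sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    while s > 0xFFFF:                      # fold carries back into 16 bits
        s = (s & 0xFFFF) + (s >> 16)
    return s

def checksum(*partial_sums: int) -> int:
    """Combine partial one's-complement sums and complement the result,
    as in the TCP checksum computation."""
    s = sum(partial_sums)
    while s > 0xFFFF:
        s = (s & 0xFFFF) + (s >> 16)
    return (~s) & 0xFFFF

payload = b"example payload"
payload_sum = oc_sum(payload)              # computed once over the data...
dsm_pseudo = b"\x00\x01\x00\x10"           # placeholder DSM pseudo-header bytes
tcp_pseudo = b"\x7f\x00\x00\x01"           # placeholder TCP pseudo-header bytes
dsm_csum = checksum(payload_sum, oc_sum(dsm_pseudo))   # ...reused for the DSM option
tcp_csum = checksum(payload_sum, oc_sum(tcp_pseudo))   # ...and for the TCP header
```

Because the sum is associative, splitting the input into payload and pseudo-header pieces gives the same checksum as summing them together, which is exactly what makes the single payload pass sufficient.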
With this mechanism a software implementation incurs little additional cost from calculating the MPTCP
checksum. Unfortunately, modern NICs frequently perform checksum offload. If the TCP stack uses the
NIC to calculate checksums, with MPTCP it will still need to calculate the MPTCP checksum in software,
negating the benefits of checksum offload. There is little we can do about this, other than to note that future
NICs will likely perform MPTCP checksum offload too, if MPTCP is widely deployed. In the meantime,
MPTCP allows checksums to be disabled for high performance environments such as data-centers where
there is no chance of encountering such an application-level gateway.
The fallback-to-TCP process triggered by a checksum failure can also be triggered in other circumstances.
For example, if a routing change moves an MPTCP subflow to a path where a middlebox removes DSM
options, this also triggers the fallback procedure.
7.1.4 Connection and subflow teardown
TCP has two ways to indicate connection shutdown: FIN for normal shutdown and RST for errors such as
when one end no longer has state. With MPTCP, we need to distinguish subflow teardown from connection
teardown. With RST, the choice is clear: it must only terminate the subflow, or an error on a single subflow
would cause the whole connection to fail.
Normal shutdown is slightly more subtle. TCP FINs occupy sequence space; the FIN/FIN-ACK/ACK hand-
shake and the cumulative nature of TCP’s acknowledgments ensure that not only has all data been received,
but also both endpoints know the connection is closed and know who needs to hold TIMEWAIT state.
How then should a FIN on an MPTCP subflow be interpreted? Does it mean that the sending host has no
more data to send, or only that no more data will be sent on this subflow? Another way to phrase this is to
ask whether a FIN on a subflow occupies data sequence space, or just subflow sequence space?
Consider first what would happen if a FIN occupied data sequence space. This could be achieved by extending
the length of the DSM mapping in a packet to cover the FIN. Mapping the FIN into the data sequence space
in this way tells the receiver what the data sequence number of the last byte of the connection is, and hence
whether any more data is expected from other subflows.
Suppose now that some data had been transmitted on subflow A just before the last data and FIN were sent
on subflow B. If the receiver is really unlucky, subflow A may fail (perhaps due to mobility) before the last
data arrives. When the sender times out this data, it will wish to re-send it on subflow B, but it has already
sent a FIN on this subflow. Sending data after the FIN is sure to confuse middleboxes and firewalls that tore
down state when they observed the FIN.
This particular problem might be avoided by delaying sending the FIN until all outstanding data has been
DATA ACKed, but this adds an unnecessary RTT to all connections during which the receiving application
doesn’t know if more data will arrive.
Much simpler is for a FIN to have the more limited “no more data on this subflow” semantics, and this is what
MPTCP does. An explicit DATA FIN, carried in a TCP option, then indicates the end of the data sequence
space and can be sent as soon as the application closes the socket. To be safe, either the sender waits for the DATA ACK of the DATA FIN before sending a FIN on each subflow, or it sends the DATA FIN on all subflows together with a FIN.
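The two safe orderings just described can be stated as a predicate (a sketch with our own names):

```python
def safe_to_fin_subflow(data_fin_data_acked: bool,
                        fin_carries_data_fin: bool) -> bool:
    """A subflow FIN is safe either once the DATA_FIN has been DATA_ACKed
    (so no outstanding data will ever need retransmission on this
    subflow), or when the DATA_FIN itself accompanies the FIN on every
    subflow, so the receiver always learns the end of the data stream."""
    return data_fin_data_acked or fin_carries_data_fin
```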
MPTCP’s FIN semantics also allow subflows to be closed cleanly while allowing the connection to continue
on other subflows. Finally, to support mobility, MPTCP provides a REMOVE ADDR message, allowing one
subflow to indicate that other subflows using the specified address are closed. This is necessary to cleanly
cope with mobility when a host loses the ability to send from an address and so cannot send a subflow FIN.
7.2 Lessons learned
In today’s Internet, the three-way handshake involves not only the two communicating hosts, but also all the middleboxes on the path. Verifying the presence of a particular TCP option in a SYN+ACK is not sufficient
to ensure that a TCP extension can be safely used. As shown in the previous chapter, some middleboxes
pass TCP options that they don’t understand. This is safe for TCP options that are purely informative (e.g.
RFC1323 timestamps) but causes problems with other options such as those that redefine the semantics of
TCP header fields. For example, the large window extension in RFC1323 changes the semantics of the
window field of the TCP header and extends it beyond 16 bits. Nearly 20 years after the publication of
RFC1323, there are still stateful firewalls that do not understand this option in SYNs but block data packets
that are sent in the RFC1323 extended window. A TCP extension that changes the semantics of parts of the
packet header must include mechanisms to cope with middleboxes that do not understand the new semantics.
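One way to express this coping mechanism is to treat option negotiation in the handshake as a hint only, confirmed once the option is also seen on the data path (a minimal sketch; the set-based packet representation is illustrative):

```python
def extension_usable(syn_opts, synack_opts, first_data_opts, opt):
    """Decide whether a TCP extension can safely stay enabled.

    A middlebox may pass an option it does not understand on SYN packets
    yet strip it from (or block) later data packets, so seeing the option
    echoed in the SYN+ACK is necessary but not sufficient.
    """
    negotiated = opt in syn_opts and opt in synack_opts
    survives_data_path = opt in first_data_opts
    return negotiated and survives_data_path
```

Under this rule, an extension that redefines header semantics is only used once its option has demonstrably survived the full path, in both the handshake and the data transfer.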
A second issue for all TCP designers is the mutability of the TCP packets. In an end-to-end Internet, all the
information carried inside TCP packets is immutable. Today this is no longer true. The entire TCP header
and the payload must be considered as mutable fields. If a TCP extension needs to rely on a particular field,
it must check its value in a way that cannot be circumvented by middleboxes that do not understand this
extension. The DSM checksum is an example of a solution to deal with these problems.
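The idea behind such a checksum can be sketched as follows (zlib's CRC-32 stands in for whatever checksum the extension actually specifies; field names are illustrative): the sender covers the data-sequence mapping and payload with a checksum carried in a TCP option, so a middlebox that rewrites either without understanding the extension is detected.

```python
import zlib

def mapping_checksum(data_seq, length, payload):
    # Cover both the mapping fields and the payload, so neither can be
    # rewritten independently by an unaware middlebox without detection.
    header = data_seq.to_bytes(8, "big") + length.to_bytes(2, "big")
    return zlib.crc32(header + payload)

def receiver_accepts(data_seq, length, payload, checksum):
    # A mismatch signals that something on the path modified the segment;
    # the extension must then react (e.g. by falling back).
    return mapping_checksum(data_seq, length, payload) == checksum
```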
The third, and probably most important, point about new TCP extensions is that, to be deployable, they must
include techniques that enable them to fall back to regular TCP when something goes wrong.
If a middlebox interferes badly with a TCP extension, the problem must be detected and the extension auto-
matically disabled to preserve the data transfer. A TCP extension will only be deployed if its designers can
guarantee that it will transfer data correctly (and hopefully better) in all the situations where a regular TCP is
able to transfer data.
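The fallback rule can be stated compactly (a minimal sketch; state names are illustrative): the connection starts optimistically with the extension enabled and drops to plain TCP the first time middlebox interference is detected, so the data transfer itself never fails.

```python
class Connection:
    """Toy model of automatic fallback from an extension to regular TCP."""

    def __init__(self):
        self.mode = "extended"  # start optimistically with the extension on

    def on_segment(self, option_present, checksum_ok):
        # Any sign of interference (stripped option, failed integrity
        # check) permanently disables the extension for this connection;
        # the transfer continues as regular TCP.
        if self.mode == "extended" and (not option_present or not checksum_ok):
            self.mode = "regular_tcp"
```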
The last point that we would like to raise is that hidden middleboxes increase the complexity of the network.
This gives us a strong motivation to change the network architecture to recognize their role explicitly.
The CHANGE architecture aims to do precisely this, embracing flow-processing and implicitly middleboxes,
while allowing the Internet to evolve at the same time.
Conclusion
This document has tackled several important mechanisms that are required to enable CHANGE platforms to
process flows in real networks and the global Internet.
The first mechanism consists in locating the closest suitable CHANGE platform for a specific flow to process.
Indeed, CHANGE platforms are heterogeneous in the resources they offer (e.g., CPU, memory and bandwidth),
so, given a specific flow to process, we must be able to select the platform that satisfies that flow's
constraints and metrics. We classified platform localization mechanisms into two types (centralized and
decentralized) and proposed an approach for each class. In the centralized case, a central database suffices to
keep information about all platforms in terms of supported functionality and resource availability. In the
decentralized case, resource availability is more delicate to manage, and maintaining a reliable centralized
database is infeasible; here we presented our distributed approach, which is based on IP anycast. We used a
modified version of Kruskal's minimum spanning tree algorithm to split platforms into clusters and assign
addresses to each cluster member. We also discussed how a CHANGE platform dynamically manages its internal
resources between its servers when a host requests a network service.
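The core of the clustering step can be sketched as follows (our modified algorithm is more involved; this shows only the underlying idea): edges are added in increasing weight order, as in Kruskal's MST construction, but the process stops once k connected components remain, each component becoming one cluster of platforms.

```python
def kruskal_clusters(n, edges, k):
    """Cluster platforms 0..n-1 into k groups.

    edges: list of (weight, u, v), e.g. weight = measured latency between
    two platforms. Stopping Kruskal's algorithm early, with k components
    left, yields k clusters separated by the heaviest (longest) edges.
    """
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    for w, u, v in sorted(edges):
        if components == k:
            break  # k clusters remain: done
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

For example, four platforms with two cheap intra-site links and one expensive inter-site link split into the two obvious site clusters.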
A second mechanism that we discussed in this deliverable is flow attraction. CHANGE platforms are not
always on the initial path from a source to a destination; we therefore need a mechanism that allows these
platforms to attract flows. We split flow attraction into two steps: first, attracting the flows to the platform,
and second, delivering the processed flow back to the destination. Based on the comparison of three possible
solutions in deliverable D4.1, we chose to implement FlowSpec, a mechanism to distribute traffic flow
specifications. FlowSpec information is carried via BGP, which allows the routing system to propagate flow
specifications. We showed how FlowSpec, coupled with ExaBGP, a BGP engine that allows route injection with
arbitrary next-hops into the network, can be used to deploy the flow attraction mechanism required by
CHANGE platforms.
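To give a flavour of this, a platform could inject a FlowSpec rule through ExaBGP's text API: ExaBGP runs a helper script and forwards lines written to its stdout as BGP announcements. This is only a sketch; the exact command grammar varies across ExaBGP versions, and the prefix, port, and redirect community below are placeholders, not values from our deployment.

```python
import sys

def flow_rule(dst_prefix, dst_port, redirect_community):
    # Build an "announce flow route" line: match traffic to dst_prefix on
    # dst_port and redirect it (RFC 5575-style action) towards the platform.
    return ("announce flow route { match { "
            f"destination {dst_prefix}; destination-port ={dst_port}; "
            "} then { redirect " + redirect_community + "; } }")

if __name__ == "__main__":
    # ExaBGP reads this line from the helper process and announces it.
    sys.stdout.write(flow_rule("203.0.113.0/24", 80, "65000:100") + "\n")
    sys.stdout.flush()
```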
The final two mechanisms are an algorithm for allocating the resources of a set of CHANGE platforms to a set
of service requests, and a novel Controllable per-Flow Load-Balancing (CFLB) mechanism that can improve
performance in scenarios where CHANGE is deployed in networks with significant path diversity, such as
datacenters.
This deliverable, and the mechanisms described herein, provide the primitives needed to start implementing
CHANGE platforms in future work.