

ICT-257422

CHANGE

CHANGE: Enabling Innovation in the Internet Architecture through

Flexible Flow-Processing Extensions

Specific Targeted Research Project

FP7 ICT Objective 1.1 The Network of the Future

D4.3 – Protocols and mechanisms to combine flow

processing platforms

Due date of deliverable: December 30, 2011

Actual submission date: September 28, 2012

Start date of project October 1, 2010

Duration 36 months

Lead contractor for this deliverable Lancaster University

Version 3.0, September 28, 2012

Confidentiality status Public

© CHANGE Consortium 2012 Page 1 of (76)


Abstract

The CHANGE architecture, which is based around the notion of a flow processing platform, aims to re-enable innovation in the Internet. However, before any data flow can be processed, the communicating hosts, or agents acting on their behalf, must be able to locate the closest CHANGE platforms. This document first discusses how these hosts and agents can efficiently locate such platforms, supported by a thorough comparison of existing methods. It then presents how flows are efficiently attracted to the discovered platforms. The last part of the document discusses how a flow can migrate from one platform to another.

Target Audience

For the project participants, this document describes three important mechanisms that allow users to efficiently locate CHANGE platforms, attract traffic to the discovered platforms, and finally migrate a flow from one specific platform to another. Moreover, this document describes additional techniques to improve flow processing inside our platforms, such as dynamic allocation of network services and traffic load balancing between platforms. Readers are expected to be familiar with Internet protocols.

Disclaimer

This document contains material, which is the copyright of certain CHANGE consortium parties, and may

not be reproduced or copied without permission. All CHANGE consortium parties have agreed to the full

publication of this document. The commercial use of any information contained in this document may require

a license from the proprietor of that information.

Neither the CHANGE consortium as a whole, nor a certain party of the CHANGE consortium warrant that

the information contained in this document is capable of use, or that use of the information is free from risk,

and accept no liability for loss or damage suffered by any person using this information.

This document does not represent the opinion of the European Community, and the European Community is

not responsible for any use that might be made of its content.



Impressum

Full project title CHANGE: Enabling Innovation in the Internet Architecture through

Flexible Flow-Processing Extensions

Title of the workpackage D4.3 – Protocols and mechanisms to combine flow processing platforms

Editor Mehdi Bezahaf, Lancaster University

Project Co-ordinator Adam Kapovits, Eurescom

Technical Manager Felipe Huici, NEC

This project is co-funded by the European Union through the ICT programme under FP7.

Copyright notice © 2012 Participants in project CHANGE



Executive Summary

One of the main goals of the CHANGE project is to reinvigorate innovation on the Internet, in order to better support current services and applications and enable those of tomorrow. This will be achieved by deploying a set of flow processing platforms at critical points in the network. This document defines vital mechanisms that are necessary for the proper functioning of these flow processing platforms.

In the first part of this document, we discuss the platform discovery mechanism. Before using any platform in the network, we have to be able to efficiently find the closest set of platforms with enough resources to process our flow, and then select, from this set, the one that satisfies our constraints and metrics. We classify platform discovery mechanisms into two different classes: centralized approaches and decentralized ones. We then present our distributed approach, which is based on IP anycast. We use a modified version of Kruskal's minimum spanning tree algorithm to split platforms into clusters and assign addresses to each cluster member.

Discovering and selecting the best platform to process our traffic is not enough, since this traffic needs to reach the selected platform. The best-case scenario is that the platform is situated on the flow's path, such that by default all of the flow's packets pass through the platform and no action needs to be performed by the latter. However, this is not always the case, and a CHANGE platform needs to be able to attract flows to itself, process them, and redirect them to their final destination while avoiding any forwarding loops. We further discuss how a flow is attracted to CHANGE platforms. We briefly recall the definition of FlowSpec [43], our selected solution, then explain how FlowSpec, in conjunction with ExaBGP [3], which allows injecting routes with arbitrary next hops, can be used to implement a flow attraction mechanism. We further discuss a novel load-balancing mechanism that can improve performance in scenarios where CHANGE is deployed in networks with significant path diversity, such as datacenters.

To summarize, this document presents and describes mechanisms crucial to the functioning of the CHANGE

architecture: platform discovery, resource allocation, flow attraction and load-balancing.



List of Authors

Authors Mehdi Bezahaf, Laurent Mathy, Gregory Detal, Simon van der Linden, Olivier Bonaventure, Pham Quang Dung, Yves Deville, Costin Raiciu and Octavian Rinciog, Felipe Huici, Francesco Salvestrini

Participants Lancaster University, Université catholique de Louvain, Polytechnic University of Bucharest, NEC Europe Ltd., Nextworks s.r.l.

Work-package WP4 – Network Architecture Implementation

Security PUBLIC (PU)

Nature R

Version 3.0

Total number of pages 76



Contents

Executive Summary
List of Authors
List of Figures
List of Tables
1 Introduction
  1.1 Platform Discovery
  1.2 Flow attraction
  1.3 Load Balancing
2 Platform Discovery
  2.1 How to find the right platform?
    2.1.1 Centralized approach
    2.1.2 Decentralized approach
      2.1.2.1 k anycast addresses assignment
      2.1.2.2 Anycast Evaluation
      2.1.2.3 Topologies used for experiments
      2.1.2.4 Comparison of platforms discovery methods
  2.2 Conclusion
3 Traffic attraction and redirection
  3.1 FlowSpec (RFC5575)
  3.2 Performing Flow Attraction Using FlowSpec
    3.2.1 Deploying FlowSpec Rules
    3.2.2 FlowSpec + Encapsulation
  3.3 Using ExaBGP
    3.3.1 Configuration
    3.3.2 Traffic attraction using ExaBGP and FlowSpec
    3.3.3 Experimentation
  3.4 Conclusion
4 Service Composition and Inter-Platform Aspects
5 Flow Migration
  5.1 Algorithms
  5.2 Virtualization
    5.2.1 Consistency classes
    5.2.2 Ensuring per-packet consistency
    5.2.3 Ensuring per-flow consistency
6 Load Balancing
  6.1 Path Diversity at the Network Layer
  6.2 Case Study: Multipath TCP
  6.3 Controllable per-Flow Load-Balancing
    6.3.1 Path Selector
    6.3.2 Invertible Function
    6.3.3 Load Balancing Efficiency
    6.3.4 Avoiding Polarization
    6.3.5 Summary
  6.4 Evaluation
    6.4.1 Load-Balancing for Non-Controlled Flows
    6.4.2 Forwarding Performances
    6.4.3 MPTCP improvements with CFLB
    6.4.4 Data Center Simulations
    6.4.5 Testbed Experiments
  6.5 Discussion
  6.6 Conclusion
7 Motivating Case Study: Extending TCP
  7.1 Design
    7.1.1 Connection setup
    7.1.2 Adding subflows
    7.1.3 Reliable multipath delivery
      7.1.3.1 Flow control
      7.1.3.2 Acknowledgments
      7.1.3.3 Freeing sender buffers
      7.1.3.4 Encoding
      7.1.3.5 Data sequence mappings
      7.1.3.6 Content-modifying middleboxes
    7.1.4 Connection and subflow teardown
  7.2 Lessons learned
Conclusion
References



List of Figures

2.1 CHANGE Platform Discovery
2.2 Simulated topology with N = 1000 platforms and k = 10 desired platforms.
2.3 Results of gtitm topology.
2.4 Anycast on real topology with K = 3 desired platforms.
2.5 Relative error selecting the closest K platforms using geographical location.
2.6 Comparison between anycast and virtual coordinates.
3.1 Attraction mechanism terminology.
3.2 Forwarding path of attracted packets with FlowSpec and tunnels.
5.1 Flows that used to go to the processing module at the top now need to go to the one in the middle.
5.2 As soon as the rules of the new configuration are installed on the transit and egress switches with a new tag, packets entering the network are tagged so that they match the new rules.
6.1 More than 80% of the server pairs in popular models of data center topologies have two or more paths between them.
6.2 On average, for 70% of destinations, routers and switches have multiple next hops.
6.3 The load-balanced paths between S and D.
6.4 The complete mode of operation of a CFLB router.
6.5 Deviation from an optimal distribution amongst two possible next-hops.
6.6 Packet distribution computed every second amongst four possible next hops.
6.7 CFLB gives equivalent forwarding performance as hash-based load balancers.
6.8 MPTCP needs few subflows to get a good Fat Tree utilization when using CFLB.
6.9 Regular MPTCP is unlikely to use all paths. MPTCP-CFLB on the other hand always manages to use all the paths.
6.10 Regular MPTCP has a very small probability of using link A of Figure 6.11 and is thus suboptimal compared to MPTCP-CFLB.
6.11 Testbed – The maximum throughput available between S and D is at 200 Mbps due to the bottleneck link between the router and the destination.
7.1 Problems with inferring the cumulative data ACK from subflow ACK
7.2 Flow Control on the path from C to S inadvertently stops the data flow from S to C



List of Tables

6.1 General notations.



Glossary

attraction mechanism is the mechanism used to establish the attraction path.

attraction path is the path established to redirect traffic from the redirection point to the processing platform.

delivery mechanism is the mechanism used to establish the delivery path.

delivery path is the path from the processing platform to the grafting point.

destination is the destination of the flow that needs to be processed in the processing platform.

grafting point is the point on the initial path where the processed packets return to their initial path.

initial path is the path from the source to the destination when the attraction mechanism is not in place.

processing platform is the CHANGE platform in charge of the actual processing of packets, the attraction mechanism and the delivery mechanism.

redirection point is the point on the initial path where packets from the source to the destination are diverted from the initial path towards the processing platform via the attraction path.

source is the source of the flow that needs to be processed in the processing platform.



Acronyms

ALT Alternative Logical Topology.

AS Autonomous System.

ASBR Autonomous System Boundary Router.

BGP Border Gateway Protocol.

DNS Domain Name System.

DNSSEC Domain Name System Security Extensions.

EID Endpoint Identifier.

ETR Egress Tunnel Router.

FEC Forwarding Equivalence Class.

FlowSpec Flow Specification.

GRE Generic Routing Encapsulation.

IPsec Internet Protocol Security.

ITR Ingress Tunnel Router.

LISP Locator/Identifier Separation Protocol.

MPLS Multi-Protocol Label Switching.

NAT Network Address Translation.

NLRI Network Layer Reachability Information.

PKI Public Key Infrastructure.

RLOC Routing Locators.

RPKI Resource Public Key Infrastructure.

RR Route Reflector.

sBGP Secure BGP.

soBGP Secure Origin BGP.



1 Introduction

One of the main characteristics of the CHANGE architecture is its flow processing platforms. In this document we tackle several important mechanisms crucial to the architecture's correct functioning. We begin by discussing the platform discovery mechanism: Where are platforms located? Which is the closest set of platforms to our flow?

As a further mechanism, we discuss how flows are attracted to CHANGE platforms so that they can be processed. We briefly recall the definition of FlowSpec [43], our selected solution, then explain how FlowSpec, in conjunction with ExaBGP [3], which allows injecting routes with arbitrary next hops, can be used as a flow attraction mechanism. Finally, we discuss a novel load-balancing mechanism that can improve performance in scenarios where CHANGE is deployed in networks with significant path diversity, such as datacenters.

1.1 Platform Discovery

Before processing any flow, these platforms need to be localized. By discovery we do not mean merely finding a platform, but first finding a list of the closest platforms with enough resources to process our flow, and then selecting, from this set, the one that satisfies our constraints and metrics.

We classify platform discovery mechanisms into two different classes: centralized and decentralized. In the case of a centralized mechanism, using a centralized database, each platform knows of the existence of all other platforms, along with their supported functionality and resource availability. In such a case, users send requests to instantiate processing through an API, and the responses include the identity of the corresponding platform. In the decentralized case, we rely on a solution based on IP anycast. We use a modified version of Kruskal's minimum spanning tree algorithm to split platforms into clusters and assign addresses to each cluster member.
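The cluster-splitting step can be sketched as follows. This is a minimal illustration, not the project's implementation: it runs Kruskal's MST construction over an invented inter-platform delay matrix and stops early so that k connected components remain, each component then corresponding to one anycast address.

```python
# Sketch: single-linkage clustering of platforms via an early-stopped
# Kruskal MST construction. Edge weights (delays) are illustrative.

class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[rb] = ra
        return True

def kruskal_clusters(n, edges, k):
    """Split n platforms into k clusters.

    edges: list of (delay, u, v) between platforms u and v.
    Returns a list mapping platform index -> cluster id in [0, k).
    """
    ds = DisjointSet(n)
    components = n
    for delay, u, v in sorted(edges):
        if components == k:
            break  # stop: exactly k clusters remain
        if ds.union(u, v):
            components -= 1
    roots = {}
    return [roots.setdefault(ds.find(i), len(roots)) for i in range(n)]

# Example: 6 platforms forming two well-separated groups.
edges = [(1, 0, 1), (2, 1, 2), (1, 3, 4), (2, 4, 5), (50, 2, 3)]
clusters = kruskal_clusters(6, edges, 2)
```

Each resulting cluster would then be assigned one of the k anycast addresses.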

1.2 Flow attraction

Given that CHANGE platforms are not always on the initial path from a source to a destination, we need to define a mechanism that allows these platforms to attract flows. We split the flow attraction mechanism into two steps: first, attracting the flows to the platform, and second, delivering the processed flow back to the destination. Indeed, when processing is requested, the selected platform activates the attraction mechanism to attract packets corresponding to the flows that need to be processed. Once processed, the flow needs to be correctly delivered to the destination.

We discussed and compared three different possible solutions in deliverable D4.1:

• the first one is based on a combination of DNS and one-to-one NATs;

• the second solution consists of using BGP announcements inside, and with limited scope outside, an AS;

• and finally, a solution using FlowSpec, a way to distribute matching rules to routers and to divert packets to the platform using a tunnel.



The first solution is the simplest, but has problems such as failing when a non-cooperative user is involved (e.g., a DDoS attacker can simply avoid a filtering platform by foregoing the DNS look-up that would resolve to such a platform). The second solution would work, but can only provide coarse granularity when matching flows. As a result of these limitations, we decided to implement the last option, based on FlowSpec. In this deliverable we explain this solution at length, describing how it can be used to attract traffic. Finally, we explain how to practically perform traffic attraction using FlowSpec and ExaBGP, a route injector.
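To make the approach concrete, a flow-attraction rule injected by ExaBGP could look roughly like the following sketch. The syntax follows ExaBGP's configuration format for FlowSpec routes; the neighbor addresses, AS number, matched flow and redirect community are placeholders, not values taken from the project.

```
neighbor 192.0.2.1 {                       # border router peer (placeholder)
    router-id 192.0.2.2;
    local-address 192.0.2.2;
    local-as 64500;
    peer-as 64500;

    flow {
        route attract-web-flow {
            match {
                destination 203.0.113.10/32;   # flow to be processed
                protocol tcp;
                destination-port =80;
            }
            then {
                redirect 64500:100;            # divert towards the platform
            }
        }
    }
}
```

Once the rule is advertised, routers supporting FlowSpec match the flow and divert its packets towards the processing platform, as detailed in Chapter 3.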

1.3 Load Balancing

In the long term, if the project's vision is confirmed, a CHANGE site could consist of a complete data center containing a number of switches and servers making up a number of platforms. Data centers are known for having significant path diversity, which can be leveraged by techniques such as load balancing to improve the performance of flow processing in CHANGE platforms. To this end, in this deliverable we also introduce a novel load-balancing scheme called CFLB which, unlike current hash-based approaches, allows hosts to explicitly select the load-balanced path they want to use for a specific flow.
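The host-controlled selection that distinguishes CFLB from opaque hash-based schemes can be illustrated with a toy model (this is not the actual CFLB function, which is specified later in this document): if the router derives the next hop from an invertible function of the packet header, a host can pick the source port that maps to the path it wants.

```python
# Toy model of host-controllable load balancing. The router computes
# next_hop = F(src_port) mod n with an invertible F; a host can therefore
# find a source port that steers its flow onto a chosen path. F here is
# an invented 16-bit mixing function, not the CFLB one.

def mix(src_port: int, key: int = 0x5BD1) -> int:
    """Toy invertible 16-bit mixing function (XOR then rotate-left 5)."""
    x = (src_port ^ key) & 0xFFFF
    return ((x << 5) | (x >> 11)) & 0xFFFF

def router_next_hop(src_port: int, n_paths: int) -> int:
    """What the load-balancing router computes per flow."""
    return mix(src_port) % n_paths

def select_source_port(wanted_path: int, n_paths: int) -> int:
    """Host side: find a port the router maps to the wanted path.

    Brute force keeps the sketch simple; with an invertible F the
    host can solve for the port directly instead of searching."""
    for port in range(1024, 65536):
        if router_next_hop(port, n_paths) == wanted_path:
            return port
    raise ValueError("no suitable port found")

port = select_source_port(wanted_path=2, n_paths=4)
```

The point of the sketch is the contrast with a salted, secret hash: there, the host cannot predict which path a flow takes, while with an agreed invertible mapping it can.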



2 Platform Discovery

2.1 How to find the right platform?

If the CHANGE vision is successful, multitudes of platforms will be deployed globally, as shown in Figure 2.1. This, however, raises the obvious question of how flow owners find the appropriate platform on which to run their desired flow processing functionality, and what requirements a good solution must meet.

There are two main requirements: first, the platform must have the processing functionality required by the flow owner, and it should be willing, and have the resources, to participate in the processing. Out of this feasible set, the platform should be selected subject to the constraints and metrics defined by the user, for example minimising the overall processing cost.

Today we expect cost to equal end-to-end delay. Delay has become the single most important factor affecting user experience, as exemplified by the efforts of Web providers to shorten paths [41] and to reduce the number of RTTs required to download an average web object [17]. That is why we use delay as our base metric for discovering platforms. Arbitrary constraints can be implemented on top of the delay metric; once nearby platforms are discovered, the user can query them to discover their various capabilities (e.g., required bandwidth or CPU availability); this information enables the user to choose the most appropriate platform for their needs.

We have considerable flexibility in designing solutions to meet these two goals, and the end solution will also depend on the deployment type chosen. In a CDN-like deployment, where there is a known set of platforms and full trust between them, it is simple to create a database of supported functionalities and possibly of available resources. In such a centralized case (see Section 2.1.1), users send requests to instantiate processing via an external, opaque API, and replies include the identity of the platform.

In a federated deployment (Section 2.1.2), resource availability is sensitive information, so maintaining a reliable database is unfeasible, limiting the applicability of a centrally accessible API. In this case, distributed solutions that better reflect the trust relationships between the platform owners, as well as being scalable, are best suited.

Regardless of the deployment model, platform discovery needs to answer the following questions posed by the flow processing customers:

1. What is the closest platform to me? This might be used by the Destination in Figure 2.1 to locate platform E and instantiate an intrusion-detection system on all its traffic.

2. What is the closest platform to a given IP? A host could use this functionality to instantiate filtering close to traffic sources with the purpose of defending against DDoS attacks. For instance, the Destination could use platform A or C to filter traffic from the Source.



Figure 2.1: CHANGE Platform Discovery

3. What are the k closest platforms to a given IP? A generalization of the two questions above, this would allow requesting hosts to select the CHANGE platforms that can support the desired functionality at that instant in time.

4. What is the platform closest to an end-to-end path? If we wanted to monitor a TCP flow, what platforms should we use?

The most obvious solution is to have a database of the addresses of all CHANGE platforms, with the requesting host using active measurements to choose the appropriate platform. This approach incurs a high cost for each platform discovery, and does not support locating platforms close to another IP.

An alternative solution leverages Internet routing by using BGP anycast. With this, each platform has a common IP C, advertised via BGP anycast, as well as its own unique IP. When a flow-processing client wants to find a platform, it creates a TCP connection to IP C on a known port; the packets are routed to the closest platform as determined by BGP. This solution gives the most accurate results, but does not directly support finding the k closest platforms, or the platforms close to another IP. Using multiple anycast IPs for CHANGE platforms solves the first problem.

To find platforms close to given IPs we need to be able to estimate latencies between two entities. There are a number of solutions proposed in the research literature that can be applied:

1. DNS: The obvious solution is to have a database containing all platform addresses (for example, in DNS), with the requesting user performing active measurements to choose the right platform. This approach imposes a high cost for each discovery, since latency measurements must be triggered each time a request is processed. In addition, this method cannot discover platforms that are close to a given IP address.



2. The King approach [29] makes the assumption that each host is topologically close to its authoritative DNS server. Thus, we can estimate the end-to-end delay between two hosts by measuring the delay between their authoritative DNS servers. This solution is attractive because it uses the existing DNS infrastructure, but its results are only as close to reality as the assumption it relies on. With today's widespread CDN deployments and DNS-based server load balancing, it is unclear whether this assumption still holds, ten years after its proposal.

3. Virtual coordinate systems: Systems like GNP [49] or Vivaldi [21] actively measure delays between sets of nodes, embedding them in a Cartesian space. The promise is that with some initial measurements, and after the system converges, these coordinates can be used directly to estimate the delay between hosts without further active measurements. The main disadvantage of virtual coordinate systems is their limited accuracy (for example, it is difficult to deal with violations of the triangle inequality [42]).

4. Geographical location: In this solution, the physical location of an Internet host is determined using only its public IP address. As we have emphasized, in today's networks the distance and the delay between two IP addresses are roughly proportional. However, existing solutions are not fully accurate, because technologies such as NATs or Provider Independent Addresses break the relationship between physical location and IP address. In addition, existing solutions, such as the one offered by MaxMind [4], only offer city-level localization, so distances can only be calculated between hosts in different cities.

5. BGP anycast: In this solution, each platform has a common IP published via BGP anycast, as well as a unique IP. When a client wants to find a platform, it creates a TCP connection to the anycast IP, and packets are routed to the nearest platform by the routing system. This solution offers the most accurate results, but it cannot directly find the k closest platforms. In Section 2.1.2.1 we describe how this solution can be improved so that we can find the closest k platforms, not just one.

In the rest of this section we discuss our solution for a centralized approach and one for a distributed scenario.

Finally, note that there exists a simple approach to solve an instance of (iv) above: the source can find on-path platforms towards the destination. The source runs traceroute to the destination to learn the addresses of the intermediary routers, and then attempts to connect to each IP on a specific port. If a connection succeeds, the platform authenticates itself to the user in a final step.
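The traceroute-based probing described above can be sketched as follows. This is an illustrative sketch, not part of the deliverable: the platform control port and the shape of the authentication step are assumptions, since the document does not specify them.

```python
import re
import socket

def parse_hop_ips(traceroute_output):
    """Extract one router IP per hop line from plain `traceroute` output."""
    ips = []
    for line in traceroute_output.splitlines():
        m = re.search(r"\((\d+\.\d+\.\d+\.\d+)\)", line)
        if m:
            ips.append(m.group(1))
    return ips

def find_on_path_platforms(traceroute_output, port=8080, timeout=1.0):
    """Try to open a TCP connection to each hop on the (hypothetical)
    platform control port. Hops that accept the connection are candidate
    platforms; a real client would then run the platform's authentication
    step over that connection before trusting it."""
    candidates = []
    for ip in parse_hop_ips(traceroute_output):
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                candidates.append(ip)
        except OSError:
            pass  # hop is a plain router, filtered, or unreachable on that port
    return candidates
```

In practice the source would first run `traceroute <destination>` and feed its output to `find_on_path_platforms`.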

2.1.1 Centralized approach

In a centralized implementation, all platforms will be managed by a single entity. In this case, the set of all

platforms must respect the following properties:

Trustworthy: Being under the same administration, all platforms must fully trust each other and must trust

the information received from other platforms.

© CHANGE Consortium 2012 Page 17 of (76)


Transparency: With a single entity managing all the platforms, it is easy to create a database with supported

functionality and available resources. Given that this database will be managed by the entity that owns

all platforms, no sensitive commercial information regarding platforms will be exposed to the users.

In such a case, discovery proceeds very simply, following these steps:

1. The user sends a request to instantiate processing to that entity through an API.

2. The entity searches its database for a platform with the following properties:

(a) the requested capabilities. This can be done by selecting from the database only the platforms with sufficient resources (e.g., CPU, bandwidth, load).

(b) proximity to the user. This can be done using the protocols currently used by CDNs (Content Delivery Networks), such as DNS indirection [50].

3. The entity forwards the request to the discovered platform.

4. The response includes the identity of the discovered platform.
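The steps above amount to a filter-then-rank lookup, which can be sketched in a few lines of Python. This is a hypothetical model of the entity's database: the field names (cpu, bandwidth, load) are assumptions, and a precomputed delay map stands in for the CDN-style proximity estimate.

```python
def discover_platform(platforms, required, user_delay):
    """Return the closest platform that satisfies all requested capabilities.

    platforms  -- dict: name -> {"cpu": ..., "bandwidth": ..., "load": ...}
    required   -- minimum cpu/bandwidth and a maximum load ("max_load")
    user_delay -- dict: name -> estimated delay to the user (proximity step)
    """
    # Step 2(a): keep only platforms with the requested capabilities.
    eligible = [
        name for name, caps in platforms.items()
        if caps["cpu"] >= required["cpu"]
        and caps["bandwidth"] >= required["bandwidth"]
        and caps["load"] <= required["max_load"]
    ]
    if not eligible:
        return None
    # Step 2(b): among those, pick the one closest to the user.
    return min(eligible, key=lambda name: user_delay[name])
```

The entity would then forward the request to the returned platform and include its identity in the reply.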

2.1.2 Decentralized approach

The only existing decentralized solution that preserves routing policies is based on IP anycast, but it can only find the single closest platform to the requester. An obvious extension is to use k or more anycast addresses, split the platforms into k groups, and assign one address to all the hosts in the same group. The important question is: how do we assign these addresses such that the platforms found by each host via IP anycast are indeed the k closest platforms to it? And if they are not, how do we minimize the average increase in delay?

We also want as few IP anycast addresses as possible: each additional anycast address adds overhead, as it increases the total number of BGP UPDATE messages in the global routing system.

2.1.2.1 k anycast addresses assignment

In the following, we present our solution, which uses the classical Kruskal minimum spanning tree algorithm as a starting point. The algorithm assigns x anycast addresses to N platforms (k ≤ x ≤ N), where x is a parameter. The idea is to split the N platforms into clusters of at most x members, each member being connected to the other members through minimum-cost edges; this ensures that nearby platforms are placed in the same cluster. To allow precise lookups, each platform within a cluster has a different anycast address.

Our algorithm, presented in Algorithm 1, has two distinct parts. First, it splits the platforms into clusters; second, it assigns a different address to each cluster member.

The first part uses a modified version of Kruskal's [40] spanning-tree algorithm. We keep from the original algorithm the idea of merging two different clusters when an edge with certain properties is found.


The properties we require are: a) the edge connects two different clusters, b) the edge has the minimum cost among the remaining edges, and c) the total number of vertices in the two clusters must not exceed x.

Once the clusters are formed, we must assign a different anycast address to each member. Addresses are assigned by taking the edges in descending order of cost and giving the same minimum available address to the two platforms connected by each edge.

Algorithm 1: Assign x anycast addresses to N platforms
Require: x ≤ N ∧ G = complete delay graph
  Sort the edges of G in ascending order of cost
  Put each platform in its own cluster
  i ← 0
  while #(each cluster) < x ∨ i < N do
      u ← left node of edge i
      v ← right node of edge i
      if cluster(u) ≠ cluster(v) ∧ #cluster(u) + #cluster(v) ≤ x then
          Merge cluster(u) and cluster(v)
      end if
      i ← i + 1
  end while
  i ← N − 1
  while i ≥ 0 do
      u ← left node of edge i
      v ← right node of edge i
      if cluster(u) ≠ cluster(v) then
          minaddress ← find_address(cluster(u), cluster(v))
          address(u) ← minaddress
          address(v) ← minaddress
      end if
      i ← i − 1
  end while
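As an illustration, the two phases can be rendered in a few lines of Python. This is our own sketch, not the deliverable's implementation: clusters are tracked with a union-find structure, and the find_address step is modeled as picking the lowest address unused in either cluster.

```python
from collections import defaultdict

def cluster_platforms(n, edges, x):
    """Phase 1: merge clusters over edges in ascending cost order,
    never letting a cluster grow beyond x members (union-find)."""
    parent = list(range(n))
    size = [1] * n

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for cost, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv and size[ru] + size[rv] <= x:
            parent[ru] = rv
            size[rv] += size[ru]
    return [find(i) for i in range(n)]

def assign_addresses(n, edges, cluster):
    """Phase 2: walk edges in descending cost order; two still-unaddressed
    platforms in different clusters share the lowest address free in both,
    so distant platforms reuse addresses while clusters stay conflict-free."""
    address = [None] * n
    used = defaultdict(set)  # cluster id -> addresses already taken inside it

    def lowest_free(*clusters):
        a = 0
        while any(a in used[c] for c in clusters):
            a += 1
        return a

    for cost, u, v in sorted(edges, reverse=True):
        if cluster[u] != cluster[v] and address[u] is None and address[v] is None:
            a = lowest_free(cluster[u], cluster[v])
            address[u] = address[v] = a
            used[cluster[u]].add(a)
            used[cluster[v]].add(a)
    for i in range(n):  # leftovers (e.g. odd cluster sizes) get a free address
        if address[i] is None:
            a = lowest_free(cluster[i])
            address[i] = a
            used[cluster[i]].add(a)
    return address
```

Edges are (cost, u, v) tuples over platform indices 0..n-1; the returned address list maps each platform to its anycast address index.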

2.1.2.2 Anycast Evaluation

We measured the performance of the method presented above, which finds the nearest platforms using anycast addresses, on multiple types of topologies. In these experiments we fixed k, the number of desired platforms, and varied the number of anycast addresses available to the platforms.

For each experiment, we calculated the relative error introduced by selecting platforms with the anycast method as

err_rel = (anycast_delay − real_delay) · 100 / real_delay,

where anycast_delay is the sum of delays to the nearest k anycast addresses designated by our algorithm and real_delay is the sum of delays to the k platforms that are actually closest.
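The error metric is a one-liner; a minimal helper (our own, for illustration) makes the definition concrete:

```python
def relative_error(anycast_delays, real_delays):
    """err_rel in percent.

    anycast_delays -- delays to the k platforms found via the anycast addresses
    real_delays    -- delays to the true k closest platforms
    """
    anycast = sum(anycast_delays)
    real = sum(real_delays)
    return (anycast - real) * 100.0 / real
```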

For each topology and each number of anycast addresses, the relative error is calculated for the following two address-assignment methods:

Kruskal: Already detailed above. Its disadvantage is that it requires a near-complete graph of measured delays (pings) between platforms.



Figure 2.2: Simulated topology with N = 1000 platforms and k = 10 desired platforms.

Random: Each platform takes a random address from the existing anycast address pool. The disadvantage of this method is that it does not take the delay between platforms into account, so two nearby platforms can be assigned the same anycast address.

2.1.2.3 Topologies used for experiments

Topologies with simulated data: The first type of topology used for the experiments had 1,000 platforms (in the future, we estimate that almost every active AS will host at least one flow processing platform). As in today's Internet, every platform can reach every other, and the delay between two platforms is chosen randomly between 0 and 1000 ms. We ran our algorithm on three random topologies, set k = 10, and varied the number of anycast addresses x from 10 to 200 (the minimum is 10, because addressing 10 platforms requires at least 10 anycast addresses). The results are shown in Figure 2.2.

As the graph shows, the results are influenced not by the topology but by the number of addresses used. As expected, the error shrinks as more addresses are used. The error drops rapidly from 115% for 10 anycast addresses to 20% for 20 anycast addresses; beyond that value it decreases slowly, almost linearly.

gtitm topologies: Gtitm [61] is a network topology generator developed at Georgia Tech. It generates network topologies that follow the present architectural model of the Internet: a 3-level hierarchy, with relatively few but densely interconnected routers at the first level, and sparsely connected routers at the last level.

Based on these topologies we ran two types of experiments: one in which the model places platforms across all 3 levels, each experiment with over 1,000 platforms, and a second in which platforms connect only to routers at the last level, to simulate how the system


(a) gtitm topology with N = 1252 platforms and k = 10 desired platforms. (b) gtitm topology with N = 44 platforms and k = 7 desired platforms.

Figure 2.3: Results on gtitm topologies (relative error (%) vs. number of anycast addresses, for the Kruskal and Random assignments).

Figure 2.4: Anycast on a real topology with k = 3 desired platforms.

behaves when there are few platforms located at the edge of the (current) Internet. The results for the first type of architectural model are shown in Figure 2.3(a), and for the second in Figure 2.3(b).

As already mentioned, we tested the performance of the anycast method for the two assignment algorithms, Kruskal and Random. As the graphs show, for both models the Kruskal algorithm yields a smaller error than the Random algorithm.

Compared with the results on the previous type of topology, the error introduced in this architectural model also improves; this is because many platforms are located very far from each other, with most delays exceeding 500 ms.

Real topologies: We ran the Kruskal anycast algorithm on a topology formed by 225 real servers located in various places around the world. The data used for this experiment was a set of pings, lasting four hours each, generated between these servers in 2008 [2]; in total, there are over 1.2 million records. For k = 3 desired platforms, the results are shown in Figure 2.4.

As can be seen from the graph, the error behaves as in the other topologies: it drops sharply up to x = 10 anycast addresses, but from that point it oscillates between 0% and 20%. Our current research focuses on finding why this oscillation exists, but most likely, by running



Figure 2.5: Relative error selecting the closest K platforms using geographical location.

the algorithm on the same data several times, these oscillations would flatten.

2.1.2.4 Comparison of platform discovery methods

In order to determine the best way to find platforms, we compared the discovery methods mentioned above. All of the following comparisons were made on the same real data [2].

The geographic location method was tested by selecting a random platform, calculating the average delay from it to every other platform, and calculating the distances from it to the other platforms using the MaxMind framework [4]. Using this dataset and the method of least squares, we computed the regression line f(x) = a · x + b, where x is the distance in kilometers between platforms and f(x) is the delay between them. The regression slope was on average 0.02, and the intercept ranged from 10 to 100.
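The least-squares fit described above can be reproduced in a few lines of Python. This is an illustrative sketch of the fitting step only; the actual experiment used the MaxMind distances and the measured ping delays as input.

```python
def fit_delay_model(distances_km, delays_ms):
    """Ordinary least squares for f(x) = a*x + b, mapping distance (km)
    to estimated delay (ms). Returns the slope a and intercept b."""
    n = len(distances_km)
    mean_x = sum(distances_km) / n
    mean_y = sum(delays_ms) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(distances_km, delays_ms))
    var = sum((x - mean_x) ** 2 for x in distances_km)
    a = cov / var           # slope: ms of delay per km
    b = mean_y - a * mean_x  # intercept: fixed delay component
    return a, b
```

With a fitted (a, b), the estimated delay for a distance d is simply a * d + b, which is how the k "closest" platforms were selected in this experiment.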

Using this regression line and the known distances between platforms, we calculated the estimated delays. With these data we selected the k platforms closest to one specific platform and computed the relative error with respect to the known delays. Figure 2.5 shows the relative error as a function of k. As can be seen from the graph, the error is quite high for small k, because small real delays of a few milliseconds correspond to estimates on the order of the intercept term, which is usually ten times larger than the minimum delay.

We also calculated the relative error of selecting the k desired platforms. In the next experiment, we set k = 3 and compared the average relative error of the geographical-location method with the error introduced by using anycast. The average relative error for k = 3 is very high, at 389.91%.

To compare the performance of virtual coordinates with the method based on assigning k anycast addresses, we performed the following test suite: we assumed that N − 1 platforms have already computed their coordinates and that the Nth platform wants to compute its own by pinging the platforms that already have coordinates. We recorded the relative error of the distances from the new platform to all other platforms, computed with its new coordinates. As expected, if the number of pings is relatively low, the relative error increases. The error values are shown graphically for up to 50 pings, selected out of 22,000. If we select 10,000 pings, which



Figure 2.6: Comparison between anycast and virtual coordinates.

are 50% of all pings from one platform to all other platforms, we get an error of 13%. The results are shown

in Figure 2.6.

2.2 Conclusion

In this chapter we discussed methods for CHANGE platform discovery. As mentioned earlier, there are two basic deployment models: centralized, in which all platforms are managed by a single entity, and decentralized, in which platforms are administered by several distinct entities.

In a CDN-like deployment, where there is a known set of platforms with full trust between them, it is simple to create a database of supported functionalities and possibly of available resources. In such a case, users send requests to instantiate processing via an external, opaque API, and replies include the identity of the platform.

In a federated deployment, resource availability is sensitive information, so maintaining a reliable database is infeasible, which limits the applicability of a centrally accessible API. In this deliverable we presented a suite of tests comparing a novel solution based on BGP anycast with network virtual coordinates and geographical location. Each discovery solution has advantages and disadvantages, but we are currently looking into a hybrid solution that can achieve both the accuracy of BGP anycast and the flexibility of virtual coordinate systems.


3 Traffic attraction and redirection

In CHANGE, the processing of flows occurs in platforms that are not always on the initial path from a source to a destination. In this case, to provide the ability to perform processing, flows must be drawn into a platform. Two problems must be solved: attracting the flows to the platform, and delivering the processed flow to the destination.


Figure 3.1: Attraction mechanism terminology.

Figure 3.1 shows the terminology used in this deliverable for the problem of attracting flows into a platform located outside the initial AS path. Note that the knowledge of the initial path is assumed to be obtained from an external source of information. When processing is requested, the processing platform activates the attraction mechanism to attract the packets of the flows that need to be processed. The attraction mechanism is in charge of establishing the attraction path that takes the flow from the redirection point to the platform. Once processed, the flow needs to be delivered to the destination; the delivery mechanism establishes the delivery path, over which the processed flow is sent via the grafting point to the destination. It is worth noting that, depending on the mechanisms, the positions of the redirection point and grafting point may vary. However, the grafting point must always be located downstream from the redirection point on the initial path.

We discussed several possible solutions for attracting traffic towards the CHANGE platforms in deliverable D4.1. We compared three solutions: one based on a combination of DNS and one-to-one NATs, one using BGP announcements inside and, in limited scope, outside an AS, and one using FlowSpec [43], i.e., a way to distribute matching rules to routers, to divert the packets to the platform through a tunnel. Based on this comparison, we chose to implement the latter, i.e., the attraction mechanism based on FlowSpec.

In the following, we first give a review of FlowSpec. Second, we explain how one can perform traffic attraction using FlowSpec. Finally, we explain how to perform it in practice using FlowSpec and ExaBGP [3], a route injector.

3.1 FlowSpec (RFC 5575)

Flow Specification (FlowSpec) [43] is a technique to distribute traffic flow specifications. FlowSpec was primarily defined to automate the inter-domain coordination of traffic filtering, such as that required to mitigate denial-of-service attacks. FlowSpec information is carried via the Border Gateway Protocol (BGP), encoded as BGP Network Layer Reachability Information (NLRI). This allows the routing system to propagate flow specifications.

FlowSpec efficiently encodes rules as an n-tuple of matching criteria applied to IP traffic. A packet is considered to match a FlowSpec when it matches all components (criteria) present in the specification. FlowSpec can match on any of the following components:

• destination prefix,

• source prefix,

• IP protocol,

• source port,

• destination port,

• ICMP type,

• ICMP code,

• TCP flags,

• packet length,

• DSCP field,

• fragmentation.
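The all-components-must-match semantics can be sketched as a small predicate. This is a toy model of our own, not ExaBGP's or any router's actual data structures: a rule is a dict keyed by component name, where a value may be a constant, a CIDR prefix, or a callable standing in for a numeric operator.

```python
import ipaddress

PREFIX_COMPONENTS = {"source", "destination"}

def matches(packet, rule):
    """Return True iff `packet` satisfies every component present in `rule`.

    Components absent from the rule are wildcards; a packet matches the
    FlowSpec only when all present components match, per RFC 5575."""
    for component, expected in rule.items():
        value = packet.get(component)
        if value is None:
            return False  # packet lacks a field the rule constrains
        if component in PREFIX_COMPONENTS:
            # prefix components use CIDR containment
            if ipaddress.ip_address(value) not in ipaddress.ip_network(expected):
                return False
        elif callable(expected):
            # numeric-operator component, e.g. a port range predicate
            if not expected(value):
                return False
        elif value != expected:
            return False
    return True
```

For example, a rule {"destination": "192.168.0.0/24", "protocol": 6, "destination_port": lambda p: p in (80, 8080)} accepts only TCP packets to that prefix on port 80 or 8080.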

FlowSpec controls the value matched by a component using numeric operators such as greater-or-equal, not, etc. It also allows combining multiple components with the binary operators AND and OR.

An action, called a Traffic Filtering Action, is associated with each FlowSpec. RFC 5575 [43] defines the minimum set of filtering actions implemented on routers. These actions are the following:

Traffic-rate applies rate limiting to the matching packets.

Traffic-action controls the application of the ordered sequence of actions bound to the rules for a given NLRI. It allows defining rules that terminate the application of the sequence, providing ways

c© CHANGE Consortium 2012 Page 25 of (76)

Page 26: CHANGE - CORDIS...The CHANGE architecture, which is based around the notion of a flow processing platform, aims to re-enable innovation in the Internet. However, before processing

to define flow processing flexibly. This action also provides control messages for activating traffic sampling and logging at the receiver side.

Redirect redirects the matching packets to a VRF routing instance. This permits tunneling the packets to a destination.

Traffic-marking changes the DSCP field of the matching packets to the corresponding value.

These actions are exchanged alongside FlowSpec rules, encoded as BGP extended community values. BGP extended communities [16, 54] are commonly used to perform inbound traffic engineering and denial-of-service mitigation when associated with actions performed on paths.

Internally, these communities specify properties of paths that influence the routing decisions made about them. For example, a path can be tagged with the type of peering session over which it was received. When such a path is selected as best, an Autonomous System Border Router (ASBR) selectively propagates it based on the business property described by the tagged community.

To let external parties use them, ISPs describe the set of communities they recognize, the associated actions, and the types of distant ASes (customers, peers, providers, etc.) that are allowed to use them.

3.2 Performing Flow Attraction Using FlowSpec

In the rest of this section, we discuss how to use FlowSpec to attract traffic towards the processing platform. We first discuss how to distribute FlowSpec rules amongst partner ASes. Second, we discuss how to forward packets from the redirection point to the platform. Finally, we discuss the security of the FlowSpec attraction solution.

3.2.1 Deploying FlowSpec Rules

To enable attraction of flows towards the processing platform, a partnership must be established between the

platform and multiple ASes. To attract a flow, one router on the initial path will be used to act as a redirection

point towards the platform. This partnership must therefore allow installing FlowSpec rules inside some

routers, i.e., there must exist a FlowSpec signaling mechanism. In the following we address this requirement.

As FlowSpec runs on top of BGP, an overlay of iBGP sessions can be used to exchange FlowSpec rules. This overlay connects border routers from partner ASes and the platform using Route Reflectors (RRs). To install a FlowSpec rule that redirects a flow from its initial path, the platform initiates a BGP UPDATE message carrying the FlowSpec rule for this flow, which is received by each router. The routers then install the rule and start redirecting the flow towards the platform. This has the nice property of ensuring that every packet of the flow passing through one of the partner ASes leaves its initial path and is redirected through the platform. However, this might not scale and could overload every router with unused rules. Indeed, the number of entries each router can handle may be limited, which in turn limits the



Figure 3.2: Forwarding path of attracted packets with FlowSpec and tunnels.

To scale, additional information must be added to the system so that each FlowSpec rule is installed in only one router, increasing the total number of flows the system can handle. This information can be carried using BGP communities. Routers in partner ASes must be configured with one or more shared communities. The BGP UPDATE messages containing FlowSpec rules must then carry the BGP attributes of the router, or set of routers, that needs to install the rule. The community configuration must be done once per AS and stored in the platform.

Note that FlowSpec has some limitations. A FlowSpec NLRI is valid if and only if the two following rules hold. First, the originator of the FlowSpec must match the originator of the best-match unicast route for the destination prefix embedded in the flow specification. Second, there must be no more-specific unicast routes, compared with the flow destination prefix, received from a neighboring AS different from that of the best-match unicast route. To build an overlay that disseminates FlowSpec rules, these two rules must be relaxed, as the platform will never hold the best unicast route for the destination in each partner AS; indeed, the platform will never be an originator of the destination prefix at all.

3.2.2 FlowSpec + Encapsulation

Figure 3.2 shows the complete forwarding path of the flow when the FlowSpec rules are installed at the redirection point. Note that the processing platform and the partner AS are not directly connected. Once a packet matches a FlowSpec rule at the redirection point, a specific action is executed to forward the packet to the platform. As native forwarding cannot be used, the packet must be encapsulated in a tunnel towards the processing platform. This action is specified inside the BGP UPDATE messages as the Traffic Filtering Action of the FlowSpec rule: the redirect action points to a VRF instance, towards a tunnel end-point inside the platform.

This requires that each router is pre-configured with one or multiple tunnels towards a next hop inside the


platform, so that when a router receives a FlowSpec rule to install, it already has a route towards the next hop associated with the redirect action; otherwise the matching packets would be dropped.

To deliver the packets to their destination, an encapsulation mechanism can be used as well. Here, the grafting point can be located inside the same AS that redirects the packets towards the processing platform (as shown in Figure 3.2). The idea is to have both the redirection point and the grafting point at the edge of the AS, with the initial path passing through each of them. It must be ensured that the grafting point is located downstream of the redirection point. This only works if the FlowSpec rules are installed solely in the redirection point, and not in the grafting point; otherwise a loop is created.

3.3 Using ExaBGP

ExaBGP [3] is a BGP engine that allows injecting routes with arbitrary next hops into a network (sourcing IPv4/IPv6 routes over both IPv4 and IPv6 TCP connections) and mitigating DDoS attacks using FlowSpec (see Section 3.1). ExaBGP is written in Python and is freely available at http://code.google.com/p/exabgp.

The remainder of this section is decomposed into three parts. First, we discuss how to configure ExaBGP to carry out route injection. Second, we discuss how to use ExaBGP to perform traffic attraction, i.e., dynamic route injection. Finally, we show some experimental results.

3.3.1 Configuration

A sample ExaBGP configuration can be found in Listing 3.1; the syntax is similar to that used to configure Juniper routers.

 1 neighbor 192.168.127.128 {
 2     description "a router";
 3     router-id 192.168.127.1;
 4     local-address 192.168.127.1;
 5     local-as 65000;
 6     peer-as 65534;
 7
 8     flow {
 9         route optional-name-of-the-route {
10             match {
11                 source 10.0.0.1/32;
12                 destination 192.168.0.1/32;
13                 port =80 =8080;
14                 destination-port >8080&<8088 =3128;
15                 # destination-port [ 8080 3128 ];
16                 source-port >1024;
17                 protocol [ udp tcp ];
18                 # protocol [ 4 6 ];
19                 # protocol tcp;
20                 # packet-length >200&<300 >400&<500;
21                 # fragment not-a-fragment;
22                 # fragment [ first-fragment last-fragment ];
23                 # icmp-type [ unreachable echo-request echo-reply ];
24                 # icmp-code [ host-unreachable network-unreachable ];
25                 # tcp-flags [ urgent rst ];
26                 # dscp [ 10 20 ];
27                 # dscp >10&<20;
28             }
29             then {
30                 # bytes/seconds
31                 rate-limit 9600;
32                 # discard;
33                 # redirect 65500:12345;
34                 # redirect 1.2.3.4:5678;
35             }
36         }
37     }
38 }

Listing 3.1: Sample ExaBGP configuration for FlowSpec rule injection.

The sample configuration file shown in Listing 3.1 injects one route into a BGP peer router, 192.168.127.128 (configured on lines 1 to 6). Multiple flow route injections can be configured in the flow entry of the configuration file. A route is decomposed into two sections, a match and an action, linked by the then keyword.

The match section (lines 10 to 28) can contain multiple matching rules; we explain only the most common ones:

source, destination specify the source/destination address prefix to match,

port, source-port, destination-port specify the ports to match; operators can describe a set of ports, e.g., >X matches only ports greater than X, and the & symbol combines multiple conditions.


protocol specifies the protocol to match: tcp, udp or icmp.

The then section contains the action to perform on the packets that match the rule. Three actions are available:

rate-limit limits the rate of the flow,

discard discards the matching packets,

redirect redirects the matching packets.
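The operator syntax in the match section can be made concrete with a small parser that turns an ExaBGP-style port expression into a predicate. This is our own sketch, supporting only the numeric comparison operators shown in the listing; it is not ExaBGP's actual parser.

```python
import re

def parse_port_expr(expr):
    """Parse an expression like '>8080&<8088 =3128' into a predicate.

    Space-separated terms are ORed; '&' inside a term ANDs its conditions,
    mirroring the destination-port line in Listing 3.1."""
    ops = {">": lambda p, n: p > n, "<": lambda p, n: p < n,
           "=": lambda p, n: p == n,
           ">=": lambda p, n: p >= n, "<=": lambda p, n: p <= n}

    def parse_atom(atom):
        m = re.fullmatch(r"(>=|<=|>|<|=)(\d+)", atom)
        if m is None:
            raise ValueError("unsupported atom: " + atom)
        op, n = ops[m.group(1)], int(m.group(2))
        return lambda p: op(p, n)

    terms = [[parse_atom(a) for a in term.split("&")]
             for term in expr.split()]
    return lambda p: any(all(atom(p) for atom in term) for term in terms)
```

For instance, parse_port_expr(">8080&<8088 =3128") accepts ports 8081..8087 and 3128, and rejects everything else.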

3.3.2 Traffic attraction using ExaBGP and FlowSpec

ExaBGP is primarily meant to load a static configuration easily, but it can also be used to inject rules dynamically. Dynamic rule injection is performed by editing the configuration file and telling ExaBGP to reload it, which is triggered by the SIGHUP signal:

kill -SIGHUP <pid-of-exabgp>

Therefore, to integrate ExaBGP into the CHANGE platform, we need an external program that manages ExaBGP and its configuration. This external program can be queried by the CHANGE platform (e.g., via RPC calls) to rewrite the configuration and tell ExaBGP to reload it whenever new flows need to be attracted or removed.
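Such an external program could look roughly as follows. This is a minimal sketch under stated assumptions: the configuration templates and the flow-dictionary layout are our own, and a real deployment would add locking, validation, and error handling around the rewrite-and-reload cycle.

```python
import os
import signal

# Templates are illustrative; doubled braces are literal braces in the output.
NEIGHBOR_TEMPLATE = """\
neighbor {peer} {{
    router-id {local};
    local-address {local};
    local-as {local_as};
    peer-as {peer_as};
    flow {{
{routes}
    }}
}}
"""

ROUTE_TEMPLATE = """\
        route {{
            match {{
                source {src};
                destination {dst};
            }}
            then {{
                redirect {target};
            }}
        }}"""

def render_config(peer, local, local_as, peer_as, flows):
    """Render an ExaBGP configuration with one flow route per attracted flow."""
    routes = "\n".join(
        ROUTE_TEMPLATE.format(src=f["src"], dst=f["dst"], target=f["target"])
        for f in flows)
    return NEIGHBOR_TEMPLATE.format(peer=peer, local=local,
                                    local_as=local_as, peer_as=peer_as,
                                    routes=routes)

def reload_exabgp(pid, config_path, config_text):
    """Rewrite the configuration file, then make ExaBGP re-read it."""
    with open(config_path, "w") as fh:
        fh.write(config_text)
    os.kill(pid, signal.SIGHUP)  # equivalent to: kill -SIGHUP <pid-of-exabgp>
```

The CHANGE platform would call render_config with the current set of flows to attract and then reload_exabgp, so that adding or removing a flow is a pure configuration rewrite.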

3.3.3 Experimentation

In the following, we show how ExaBGP can be combined with a FlowSpec-enabled router. The testbed is composed of two devices: one running ExaBGP 2.0.1 (192.168.2.2), connected to a JunOS 10.3 Olive virtual router (192.168.2.1). The following FlowSpec configuration was used on the Juniper router:

...
protocols {
    bgp {
        group community-fs {
            type external;
            multihop;
            local-address 192.168.2.1;
            passive;
            export no-routes;
            peer-as 65000;
            local-as 65001;
            neighbor 192.168.2.2 {
                traceoptions {
                    file community-fs;
                    flag all;
                }
                family inet {
                    unicast;
                    flow {
                        no-validate all-routes;
                    }
                }
            }
        }
    }
}
policy-options {
    policy-statement all-routes {
        then accept;
    }
    policy-statement no-routes {
        then reject;
    }
}
...

Listing 3.2: JunOS configuration used for the experiment.

And the following configuration has been used for ExaBGP, injecting into the Juniper router a rule that drops all packets from 192.168.0.2 to 192.168.1.2:

neighbor 192.168.2.1 {
    router-id 192.168.2.2;
    local-address 192.168.2.2;
    local-as 65000;
    peer-as 65001;
    flow {
        route {
            match {
                source 192.168.0.2/32;
                destination 192.168.1.2/32;
            }
            then {
                discard;
            }
        }
    }
}

Listing 3.3: ExaBGP configuration used for the experiment.

After running the two devices, we can validate that the route is correctly injected into the router by looking at

the FlowSpec table contained in the Juniper router:

 1  root> show route table inetflow.0 extensive
 2
 3  inetflow.0: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
 4  192.168.1.2,192.168.0.2/term:1 (1 entry, 1 announced)
 5  TSI:
 6  KRT in dfwd;
 7  Action(s): discard,count
 8  *BGP    Preference: 170/-101
 9          Next hop type: Fictitious
10          Next-hop reference count: 1
11          State: <Active Ext>
12          Peer AS: 65000
13          Age: 9
14          Task: BGP_65000_65001.192.168.2.2+38601
15          Announcement bits (1): 0-Flow
16          AS path: 65000 I
17          Communities: traffic-rate:0.0
18          Accepted
19          Localpref: 100
20          Router ID: 192.168.2.2

Listing 3.4: FlowSpec table contained in the router after the experiment.

We can see that the entry is correctly added to the FlowSpec table of the router, that the correct action is associated with the flow (see line 7), and that the extended community associated with the action is applied to the route, i.e., traffic-rate:0.0 (see line 17).

3.4 Conclusion

In this section we presented the FlowSpec-based flow attraction mechanism. To date, we believe it to be the best suited for attracting flows towards a platform, both in terms of providing the needed mechanism and of having the highest deployment chance in the current Internet. We showed that it is easy to use ExaBGP,

a tool that allows route injection as a BGP peer, to redirect flows in any FlowSpec-enabled router. Finally,

we performed simple experiments with ExaBGP and virtual routers, and presented the results from such

experiments.


4 Service Composition and Inter-Platform Aspects

One of the key functions in the CHANGE architecture is service composition, which groups all the aspects related to platform and resource resolution, flow routing, and subsequent resource allocation. The

main trigger for all the actions/phases above is an end-user request received through the Service-User to

Network Interface. As described in more detail in D4.2, this Service-UNI request is primarily handled at the

CHANGE Service Manager that coordinates the following actions:

• at first, the decomposition of the overall end-to-end user service specification into specific low-level

service components to be implemented by the CHANGE platforms

• then, the flow routing and per-platform allocation of the specific actions to be implemented on the flow.

The service description specified at the Service-UNI could either explicitly identify the platforms and actions

involved in the end-to-end service (i.e. provide the CHANGE Service Manager with a set of exact routes and

actions to be implemented), or it could be more generic and just identify the service endpoints (i.e. source,

destination hosts) and the service parameters (e.g. a firewall policy, a NAT rule, etc.). Depending on the

service description provided at the Service-UNI, the CHANGE service composition function has different

scopes. In particular, in the former case (i.e. explicit route-actions specification), the service composition

actions mainly consist of status and AuthN/AuthZ checks on the required resource allocations, with the

subsequent provisioning via signaling procedures. Instead, in the latter case (i.e. generic service description)

all the exact platform and resource resolution is implemented by the CHANGE service composition, before

any subsequent signaling for provisioning.

To implement its decisions, the CHANGE Service Composer could take into account the domain network topology, the platforms' capabilities, and their resource availability.

Since these resolution operations might require the allocation of additional platform(s) and/or flow process-

ing action(s) along the route, the routing and composition decisions should be jointly taken by the Service

Composer to obtain an optimal solution. The result of this process takes the form of a Flow Processing

Route (FPR), which contains the possibly exact and complete sequence of flow processing actions and packet

forwarding rules to be performed in the CHANGE domain.

In a more schematic and high level view, the ingress-to-egress resolution occurring in the CHANGE archi-

tecture could be summarized in the following steps:

1. Parse the client-originated service description in order to produce an intermediate-representation of all

the platform-level operations (i.e. resolve the service components)

2. Compute a loop-free flow path and identify flow processing resources to be allocated along it by taking

into account the following decision actions:

Page 34 of (76) c© CHANGE Consortium 2012

Page 35: CHANGE - CORDIS...The CHANGE architecture, which is based around the notion of a flow processing platform, aims to re-enable innovation in the Internet. However, before processing

• Add any additional flow processing actions and/or platforms that might be required to implement

the service into the domain

• Adjust the bindings among all the identified flow processing actions in the domain to stitch the

different parts both within a platform (e.g. in case multiple processing modules are used) and

among platforms (e.g. in case flow attraction or any other routing aspect might be needed)

The following example describes a possible service description to be used by the Service Composer to implement

the actions detailed above:

# Define the first platform

PLATFORM ID=0 NAME=alpha.platform.com

INTERFACE ID=0 MBOX_ID=0 IP=126.16.13.139 TYPE=REAL

INTERFACE ID=1 MBOX_ID=0 NAME=bridge0 TYPE=VIRTUAL

INTERFACE ID=2 MBOX_ID=0 NAME=bridge1 TYPE=VIRTUAL

# Define the second platform

PLATFORM ID=1 NAME=beta.platform.com

INTERFACE ID=0 MBOX_ID=1 IP=128.16.67.99 TYPE=REAL

INTERFACE ID=1 MBOX_ID=1 IP=126.16.13.139 NAME=bridge0 TYPE=VIRTUAL

INTERFACE ID=2 MBOX_ID=1 IP=126.16.13.139 NAME=bridge1 TYPE=VIRTUAL

# Forward (which allows filtering too)

FORWARD FROM=0:0 TO=0:1 FILTER="ip src=128.16.13.139"

# Add a processing module

PROCESSING_MODULE TYPE=FIREWALL CONFIG=... IN=0:1 OUT=0:0

# Traffic manipulation

ATTRACT_DNS NAME=nets.cs.pub.ro TARGET=0:0 DNS=...

...

The declarative language used in this example has two different types of definitions: the platform and the flow

processing actions. The platform definition serves a twofold purpose: a) it identifies the platform’s addressable control interface (i.e. alpha.platform.com and beta.platform.com) used by the control plane;

b) it declares the platform interfaces involved in the service. The flow processing actions report the type of

the processing module that must be used and its bindings with the peering modules (that constitute a part of

the service slice inside a platform) or the ingress/egress interfaces (that form the bindings among different

service slices allocated between two data plane adjacent platforms).
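As an illustration, a minimal parser for this declarative language could look as follows (the grammar handling is inferred from the example above only, and is an assumption; quoted values containing spaces, such as the FILTER argument, are not handled in this sketch):

```python
# Sketch: turning the declarative service description into platform and action
# records, as the Service Composer might do in a first parsing pass.

def parse_service_description(text: str):
    platforms, actions = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        keyword, _, rest = line.partition(" ")
        # split "KEY=value" fields (values with spaces are not supported here)
        fields = dict(f.split("=", 1) for f in rest.split() if "=" in f)
        if keyword == "PLATFORM":
            platforms[fields["ID"]] = {"name": fields["NAME"], "interfaces": []}
        elif keyword == "INTERFACE":
            platforms[fields["MBOX_ID"]]["interfaces"].append(fields)
        else:  # FORWARD, PROCESSING_MODULE, ATTRACT_DNS, ...
            actions.append({"type": keyword, **fields})
    return platforms, actions
```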


Concerning the inter-platform aspects of the service composition process described above, the FPR is the key

information that binds resources and platforms among themselves. Depending on the deployment model of

the CHANGE architecture (i.e. centralized or distributed as per D4.2 ), the FPR is managed differently thus

impacting the way the different platforms cooperate for the service setup and maintenance.

In particular, if a centralized model is deployed, the FPR is split in a sub-sequence of per-platform actions

and each of these segments is directly signalled by the Signaling Manager to the corresponding platform. In

this case data-plane adjacent platforms have no adjacency in the control plane and do not peer/cooperate with each other in service provisioning/maintenance.

Conversely, if a distributed model is deployed, the overall FPR is passed by the Signaling Manager to the

signaling instance running on the entry CHANGE platform (first hop in the FPR) and there used to initiate a

peer-mode distributed signaling along the identified flow path. On each platform, the local actions described

in the FPR are implemented, and subsequently the remaining FPR parts are forwarded to the next hop /

adjacent platform (at control plane level). Therefore, in this case the ingress and egress platforms handle

the full end-to-end service as per FPR, while the intermediate platforms just maintain the upstream and

downstream peering.

All the procedures described above primarily apply to the Internal-NNI signaling, i.e. they apply to a

CHANGE domain in which platforms can peer and share topology and resources information. As described

in D4.2, the Inter-AS NNI aspects could be assumed to be similar to the Internal-NNI ones in terms of se-

mantics and abstract messages: though not sharing full information about flow processing resources and

platforms, peering ISPs may adopt inter-platform cooperation mechanisms similar to the ones described for

the Internal-NNI protocol, i.e. peer-style or centralized/management-style. The signaling messages used at

the Inter-AS NNI to control the multi-domain flow processing service can be assumed to derive from the ones

used at the Internal-NNI, but with contents that largely depend on the policies and trust relationships between

peering ISPs (e.g. for resource and/or internal topology description, multi-domain FPR details, etc.).


5 Flow Migration

Within CHANGE platforms, flows are routed across OpenFlow switches. The logically-centralized controller

of each platform installs rules to handle flows in each switch. A rule essentially specifies a pattern that

matches certain packets, and an action to take on such packets. The controller might need to change the rules in a collection of switches in order to change the path followed by certain flows, for instance after moving a virtual machine to a better place. The controller does so by issuing remove and install commands.

Changing the configuration of multiple switches at once is not possible, and therefore the network goes

through a set of intermediate states between the initial and the final configurations. Although the initial and the final configurations of the network are correct, intermediate ones might cause severe trouble such as broken connectivity, forwarding loops, and inconsistent paths, especially in the presence of middleboxes.


Figure 5.1: Flows that used to go to the processing module at the top now need to go to the one in the middle.

In Figure 5.1, the processing module at the top needs to be shut down, for instance, and all flows processed by this module will be handled by another processing module. The controller must thus send install commands to the three switches on the path to the new processing module, and remove commands to the two switches crossed by the old path to the old processing module. In order to avoid losses, the controller must take care that rules at the two switches of the new path are installed before rules at the ingress switch.
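The ordering constraint above can be sketched as follows (a simplified illustration; switches are identified by names, and the command vocabulary is hypothetical, not the OpenFlow wire protocol):

```python
# Sketch: to migrate a flow safely, rules are installed on the new path
# downstream-first, the ingress switch is updated last, and only then are
# the rules of the old path removed.

def migration_commands(old_path, new_path):
    """Return an ordered list of (command, switch) pairs for a safe migration."""
    ingress = new_path[0]
    # 1. Install on the new path, farthest switch first, skipping the ingress.
    cmds = [("install", sw) for sw in reversed(new_path) if sw != ingress]
    # 2. Flip the ingress switch over to the new path.
    cmds.append(("install", ingress))
    # 3. Remove rules on switches that only belong to the old path.
    cmds += [("remove", sw) for sw in old_path if sw not in new_path]
    return cmds
```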

It does not have to be this way. Seamlessly changing the configuration of a network could be a primitive of

the OpenFlow platform.

Substantial work has already been done on the problem of avoiding undesired intermediate states in distributed routing protocols.

There are basically two different approaches: either we use an algorithm to determine the order in which

commands are issued to individual switches such that the order preserves certain properties, or we virtualize

the network and direct packets to the new virtual network once it is ready.


5.1 Algorithms

Francois et al. [25] show that it is possible to avoid all loops during the convergence of a link-state IGP like

OSPF or IS-IS, and they propose a protocol to let the routers update in a safe order. They also propose [26] a way to update the link metrics so as to avoid disruptions after a planned link state change with OSPF.

Unfortunately, these methods only consider link-state routing protocols and shortest path routing. They

cannot be applied to OpenFlow.

Fu et al. [27] introduce two conditions to change the forwarding table of an IP router during a reconfiguration

without introducing a loop. They show that a forwarding table can always be updated, at least partially (i.e.,

a destination at a time). With this, they propose an algorithm to reconfigure the network. Unlike the previous work, their method can be applied to any network with hop-by-hop destination-based forwarding.

More recently, Vanbever et al. introduced another algorithm to find an order in which to update, without

disruptions, n forwarding tables in n steps, in each of which exactly one forwarding table is updated [57]. To

this end, the algorithm first enumerates orders in which a loop arises so as to generate constraints. Then, they use linear programming to find an order that matches the constraints (i.e., a safe order). Such an order does not always

exist, in which case we can always fall back to partial forwarding table updates as per the aforementioned

algorithm. They also show this problem is NP-complete.

Unfortunately, none of these solutions is satisfying for OpenFlow networks. Actually, only the last two

could be applied. OpenFlow offers a lot more flexibility in forwarding; it is not limited to destination-based

forwarding. OpenFlow switches’ flow tables are also likely to be filled with a mix of aggregate entries of all

sorts, since the space of matching patterns is large while flow tables remain limited in size today. In addition,

the matching pattern of these aggregate rules might change as well. It is unknown how the algorithms behave

with aggregate rules possibly not matching destination only. Moreover, those solutions are aimed at planned

if not manual reconfigurations. With OpenFlow, an automated and faster solution that could be abstracted

by the platform is desirable. Finally, those solutions do not make it possible to enforce certain kinds of consistency, as discussed in the next section.

5.2 Virtualization

Another way to prevent disruptions during reconfiguration is to have two or more virtual networks. Packets

thus travel across one or the other. This makes it possible to change the configuration of a network without disruptions by moving the traffic to the other network before the reconfiguration. This is part of the idea presented by

Reitblatt et al. [52].

In that paper, the authors introduce the problem of consistency in software-defined networks like OpenFlow

with different classes of consistency. Then, they propose mechanisms to ensure such consistency. We explain their work in more detail hereafter.


5.2.1 Consistency classes

The obvious class of consistency is per-packet consistency. It guarantees that each packet is handled either by rules of the initial configuration or by rules of the final configuration, never a mix of both. Such consistency is useful for

instance in a network where packets must be processed by a middlebox: although both the initial and final

configurations are correct, packets might end up not passing through the middlebox in some intermediate

configuration. Note that this is a stronger requirement than requiring no loops.

The other class of consistency is per-flow consistency. Similarly, it guarantees that packets of a flow are

handled by rules of one configuration or the other, but not both. Suppose that the middlebox of the previous example is stateful (it keeps per-flow state) and that there are two or more middleboxes. In any configuration,

packets of the same flow must pass through the same middlebox.

5.2.2 Ensuring per-packet consistency

Ensuring per-packet consistency is relatively simple. Packets arriving at ingress switches are tagged as they would be in a traditional 802.1Q VLAN. Rules on transit and egress switches match the tag.

When a new configuration is to be deployed, the new rules match a new tag. When the new rules are installed

in all transit and egress switches, the rules to tag the packets with the new tag are progressively installed with

a higher priority than the ones with the old tag on ingress switches, and the packets start flowing through the

network according to the new configuration. Once all ingress switches have been updated, the rules matching

the old tag can be removed.
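This two-phase scheme can be sketched as follows (a minimal illustration; the switch and rule representations are stand-ins, not an OpenFlow API):

```python
# Sketch of a per-packet-consistent update using a version tag (e.g. a VLAN
# id): new rules matching the new tag are installed on transit/egress switches
# first, then the ingress switches flip to the new tag, and only afterwards
# are the old-version rules removed.

def two_phase_update(ingress, internal, old_tag, new_tag, new_rules):
    """Return the ordered command log of a two-phase configuration update."""
    log = []
    # Phase 1: new rules, keyed on the new tag, on transit and egress switches.
    for sw in internal:
        log.append(("install", sw, new_tag, new_rules[sw]))
    # Phase 2: ingress switches start stamping packets with the new tag.
    for sw in ingress:
        log.append(("set-tag", sw, new_tag))
    # Cleanup: old-version rules can now be removed safely.
    for sw in internal:
        log.append(("remove", sw, old_tag))
    return log
```

Any packet is stamped with exactly one tag at the ingress, so it is processed entirely by one configuration version.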

Figure 5.2: As soon as the rules of the new configuration are installed on the transit and egress switches with a new tag, packets entering the network are tagged so that they match the new rules.


5.2.3 Ensuring per-flow consistency

Ensuring per-flow consistency is a bit more complicated. As for per-packet consistency, packets are tagged

at the ingress switches while rules on transit and egress switches match tags.

However, unlike per-packet consistency, the tagging rules at the ingress cannot be replaced at any time after the new rules have been installed on transit and egress switches. Doing so could tag packets of the same flow with different tags, and these packets would end up being processed by different configurations.

Instead, the new tagging rule should take over only when all flows matching the old rule have terminated. With

the current version of OpenFlow, this can be done with a timer such that the rule is removed some time after

the last packet was processed. But this is an imperfect solution. A better solution is to use the instantiation

rules defined in DevoFlow [20]: a rule is instantiated for each new flow matching the instantiation rule. If such

rules were used at ingress switches, they could be replaced at any time. The ongoing flows would still match

the rules instantiated by the old rules, until they terminated, while new flows would match rules instantiated

with the new tag.

5.3 Conclusion

In CHANGE, platforms need to do more than avoid loops or black holes during a reconfiguration. Indeed, there are various processing modules in a platform, and neither packets nor flows may be allowed to bypass a processing module during a reconfiguration. One of the two aforementioned classes of consistency is therefore necessary. To our knowledge, there are no algorithms yet to enforce such consistency properties. Nevertheless, the virtualization mechanisms described above are satisfactory.


6 Load Balancing

In the long term, if the vision of the project is confirmed, a CHANGE site could consist of a complete data center containing a number of switches and servers making up a number of platforms. Data centers are known

for having significant path diversity which can be leveraged by techniques such as load balancing to improve

the performance of flow processing in CHANGE platforms. To this end, in this section we introduce a novel

load-balancing scheme called CFLB, which unlike current, hash-based approaches, allows hosts to explicitly

select the load-balanced path they want to use for a specific flow.

Load balancing makes it possible to maximize throughput [35], achieve redundant connectivity [36] and reduce

congestion [14]. Different forms of load balancing can be deployed at various layers of the protocol stack. At

the datalink layer, frames can be distributed over parallel links between two devices [7]. At the application

layer, requests can be spread on a pool of servers.

At the network layer, the most common technique, Equal-Cost Multi-Path (ECMP) [35, 18], allows routers

to forward packets over multiple equally-good paths. ECMP may both increase the network capacity and

improve the reaction of the control plane to failures [36]. Current ECMP-enabled routers proportionally

balance flows across a set of equal next hops on the path to the destination. Moreover, various methods

to practically perform the forwarding among multiple next hops are possible [14, 35]. The most deployed

next-hop selection method is solely based upon a hash computed over several fields of the regular packet

headers [1, 35]. Using a hash function ensures a somewhat fair distribution of the next-hop selection [14]

while preserving the packet sequence of transport-level flows.

Data center designs rely heavily on ECMP [5, 28, 30, 48] to spread the load among multiple paths and reduce

congestion. This form of load balancing is naive: congestion can still occur inside the data center and lead to reduced performance. Data center traffic contains both mice and elephant flows [10, 37]. Mice flows are short and numerous, but they do not cause congestion. Most of the data is carried by a small fraction of elephant flows. Based on this observation, several authors have proposed traffic engineering techniques that route elephant flows on non-congested paths (see [11, 19, 6] among others). Those techniques

rely on OpenFlow switches [46] to control the server-server paths. Unfortunately, the scalability of such

approaches is limited, which may lead to an overload of the flow tables on the OpenFlow switches.

In this chapter, we show that another design is possible to take advantage of the path diversity that exists in data center networks, and thereby improve communication between CHANGE platforms. Current hash-based

implementations rely on the IP and TCP headers to select the load-balanced path over which each flow

is forwarded. In fact, the IP addresses and the TCP port numbers implicitly specify the flow path. But

unfortunately, since routers rely on hash functions to perform the load balancing, it is very difficult for a host

to predict the path that a specific flow will follow. We show in this chapter that hash functions are not the

only way to practically enable path diversity. We propose a new deterministic scheme called Controllable

per-Flow Load-Balancing (CFLB) that allows hosts to explicitly select the load-balanced path they want to


use for a specific flow. This is a major change compared to existing hash-based techniques and opens new

possibilities. To allow packet steering by end hosts, CFLB replaces the hash-based next-hop selection method

that is implemented today on load-balancing routers with an invertible mechanism using a function such as a

block cipher. CFLB routers apply this invertible procedure over selected fields of the packet headers to select

a load-balanced next hop. CFLB does not rely on any state in routers nor on any extension to packet headers; existing header fields are used to convey a path selector. CFLB is also transparent to hosts that do not want to

perform packet steering. In this case, classic hash-based load balancing is performed without the router distinguishing controlled packets from non-controlled ones.

The remainder of this chapter is organized as follows. We first recall in Section 6.1 current hash-based load-balancing basics. Then we consider in Section 6.2 MultiPath TCP as a case study to introduce our proposal. We provide a detailed description of the operation of CFLB in Section 6.3. In Section 6.4, we analyze the performance of CFLB: we first use trace-driven simulations to compare CFLB with existing hash-based techniques, and then implement CFLB in the Linux kernel to evaluate its packet forwarding performance. We also evaluate the benefits of CFLB for MultiPath TCP hosts. In Section 6.5, we discuss other possible applications.

6.1 Path Diversity at the Network Layer

There exist several proposals to enable path diversity at the network layer [47]. However, in practice only Equal-Cost Multi-Path (ECMP) [35] is currently deployed. ECMP is both a path selection scheme and a load distribution mechanism. To enable path diversity, it uses equal-cost paths, which ensures loop-free forwarding. According to the level of resulting path diversity, routers then proportionally balance packets over their multiple next hops. This proportional aspect is an arbitrary design choice, which is not in the scope of this deliverable. We focus on the practical implementation of the mapping (packet → next hop).

Various next-hop mapping methods exist to practically balance packets over load-balanced paths. They

should meet the following requirements.

Minimal disruption Packets from the same TCP flow should always follow the same path in order to avoid

packet reordering.

Transparency Load balancing operates on normal network packets. It does not require any additional fields

in the packet header.

Operate at Line Rate The additional computation required to balance the packets should be marginal.

An ECMP load balancer should also share the load fairly over the next hops. Since flows vary widely in terms of number of packets and volume, this is not easy [35, 14]. Most load-balancing methods that meet these

requirements are based on a hash function [14, 1]. They compute a hash over the fields that identify the flow

in the packet headers. These fields are usually the source and destination IP addresses, the protocol number

and the source and destination ports. We call these fields the 5-tuple.


Figure 6.1: More than 80% of the server pairs in popular models of data center topologies have two or morepaths between them.

Figure 6.2: On average, for 70% of destinations, routers and switches have multiple next hops.

The computed hash can then be used in various ways to select a next hop. The simplest and most deployed

method is called Modulo-N . If there are N available next hops, the remainder of dividing the hash by N is

used as an identifier of the next hop to use. Because of the roughly uniform distribution of the hashes, this

method results in a uniform distribution in terms of number of flows. A slightly more sophisticated method

that allows for a parameterizable distribution is called Hash-Threshold. The space of hashes is divided into

subspaces, where each one of them corresponds to a next hop. A third method is Highest Random Weight,

in which the hash is computed not only over the 5-tuple but also over the identifier of a next hop. For each

packet, a hash is computed for each next hop. The next hop for which the hash is the highest is selected.
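These three methods can be sketched as follows (Python's hashlib stands in here for a router's hardware hash function; the helper names are illustrative):

```python
# Sketch of the three next-hop selection methods described above:
# Modulo-N, Hash-Threshold, and Highest Random Weight (HRW).
import hashlib

def _hash(*fields) -> int:
    """32-bit hash over arbitrary fields (stand-in for a router's hash)."""
    data = "|".join(str(f) for f in fields).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def modulo_n(five_tuple, next_hops):
    return next_hops[_hash(*five_tuple) % len(next_hops)]

def hash_threshold(five_tuple, next_hops):
    # Equal subspaces here; unequal thresholds give a parameterizable split.
    region = 2**32 // len(next_hops)
    return next_hops[min(_hash(*five_tuple) // region, len(next_hops) - 1)]

def highest_random_weight(five_tuple, next_hops):
    # Hash over the 5-tuple AND the next-hop identifier; pick the largest.
    return max(next_hops, key=lambda hop: _hash(*five_tuple, hop))
```

All three are deterministic per flow, which preserves packet ordering within a TCP connection.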

ECMP is widely used in data centers. Several recent data center proposals have been optimized to support

it [51, 48]. We consider the path diversity between pairs of servers. We use three generic models: Fat-Tree [5],

VL2 [28], and BCube [30]. For each model, we instantiate a representative topology with approximately 600

servers. Since the simple spanning tree at the link layer also tends to be replaced by multipath-capable

routing [56], we also consider switches as load balancers1.

From Figure 6.1, we observe that more than 80% of the pairs of servers in these data center networks have at least two paths between them; there are up to 64 paths in BCube. Moreover, Figure 6.2 shows that, for 70% of destinations, the switches and routers have two

1In BCube models, servers not only act as end hosts, they also act as relay nodes for each other.


next hops or more. The maximum number of next hops is 10.

Due to the non-deterministic nature of hash functions, forcing a path in the network is hard. In general, hosts

can only vary the transport header fields to try to influence the path selection. In the following section, we

study possible interactions between a transport protocol and hash-based load balancers.

6.2 Case Study: Multipath TCP

In the remainder of this section, we consider MultiPath TCP (MPTCP) [23] as a case study to describe the

interactions between transport level protocols and per-flow load balancers.

As previously mentioned, hash-based load-balancing techniques rely on the 5-tuple to perform their forward-

ing decisions. A TCP connection is identified by this 5-tuple, and thus a single TCP connection only uses

one of the available paths due to the nature of the hash-based load-balancing technique. Splitting a single

data-stream among the different load-balanced paths may bring significant performance increases.

MPTCP is an extension to TCP that allows a data stream to be split over multiple TCP subflows while still

presenting a standard TCP socket API to applications [9]. MultiPath TCP associates each TCP subflow

(identified by its 5-tuple) with its MPTCP session, which is identified by a token. After the establishment of the initial

TCP subflow of an MPTCP-session, the token enables the use of any arbitrary port number in subsequent new

TCP subflows, as the token uniquely associates the TCP subflow to its corresponding MPTCP session. The

subflows can be established using the same IP addresses pair and different ports [51] or just using distinct IP

addresses of the same end hosts [23].

If the subflows follow distinct paths, MPTCP is able to balance traffic across distinct paths thanks to the

Coupled Congestion Control [59].

Raiciu et al. evaluated MultiPath TCP inside data centers [51]. Simulations and measurements show that

performance improves when MultiPath TCP is allowed to use multiple subflows in such data centers. Due to

the load balancing deployed in the routers and switches of the data center, two subflows may follow distinct

paths and thus the overall throughput of the MPTCP session may be higher. Additionally, the network may

experience a better load-balancing as potentially more links are used. However, in practice, MultiPath TCP

establishes additional subflows on random port numbers. With a standard hash-based load balancing, there is

no guarantee that a different path will be chosen for each of these subflows. Furthermore, even when the subflows follow distinct paths, nothing guarantees that those paths are independent, i.e., that they do not share a congested bottleneck.

Let us consider a source that has m load-balanced paths towards a destination and that l of those m paths

are distinct. If all paths are equiprobable, then the probability that k subflows go through k different paths

amongst the l distinct paths is defined by:

P_m(k,l) = l! / ((l − k)! × m^k)    (∀ k, l, m ∈ ℕ | k ≤ l ≤ m)    (6.1)
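Equation 6.1 can be evaluated directly; the sketch below implements the formula as written, with sanity checks on small cases that are easy to verify by hand:

```python
from math import factorial

def p_distinct(m: int, k: int, l: int) -> float:
    """Equation 6.1: probability that k subflows follow k different paths
    amongst the l distinct ones, out of m equiprobable load-balanced paths."""
    assert k <= l <= m
    return factorial(l) / (factorial(l - k) * m**k)
```

For instance, with m = l = k = 2 (two subflows over two equiprobable paths), the probability of landing on distinct paths is 2!/0!/2² = 0.5, and the probability drops quickly as k grows for a fixed m.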

If there are 16 load-balanced paths, which seems realistic with respect to the observations we made in the


previous section, the probability to cover 4 distinct paths is low: if a source generates either 2 or 4 subflows, the probability is respectively P_16(2,4) = 6.7% and P_16(4,4) = 0.5%. In practice, the equiprobability assumption between load-balanced paths may not hold, so the actual figures may be worse: the load-balanced paths may be unbalanced, making distinct paths even harder to set up with a random approach.

One could argue that sources could generate as many subflows as possible to increase the probability given by Equation 6.1. However, establishing additional subflows comes at a cost: each new subflow requires a three-way handshake with cryptographic authentication before being established, and each additional subflow increases the memory requirements at the receiver due to a larger receive buffer [9]. From a performance viewpoint, MultiPath TCP would clearly benefit from being able to establish the minimum number of subflows needed to efficiently utilize distinct paths. This, however, requires the ability to deterministically map a subflow to a path offered by the network, in order to set up subflows on distinct paths.

6.3 Controllable per-Flow Load-Balancing

Controllable per-Flow Load-Balancing (CFLB) has been designed to overcome the limitations of current

hash-based load-balancing techniques. CFLB allows sources to encode inside the existing fields of the packet

header the load-balanced path that each packet should follow. CFLB is transparent for applications that do

not want to steer packets and does not require any changes for non-CFLB-aware end hosts. The network layer

still balances non-controlled traffic without disrupting transport layer flows. CFLB does not require routers to store any state; routers only perform simple calculations.

CFLB is not a source routing solution; it only enables hosts to select a load-balanced path among the loop-free

paths offered by the network layer. Compared to a source routing solution, it does not require any header

extension.

CFLB is decomposed into four separate operations. First, the source specifies the desired path as a sequence of next-hop selections, which we call a path selector. Second, this path selector is encoded inside selected header fields of the packets; we call these header fields controllable. Third, each router recovers the encoded path selector from these fields and, fourth, derives the load-balanced next-hop selection for this packet. These four

operations allow CFLB-aware sources to steer their packets inside the network.

The remainder of this section is organized to help understand the design choices behind CFLB. As a basic case study, we focus on IPv4 networks running MultiPath TCP end hosts2. MultiPath TCP allows the use of arbitrary port numbers for the additional TCP subflows (see Section 6.2), and thus we use the port numbers as the controllable fields in the packet header. We also define uncontrollable

fields that are used to add randomness in the forwarding process of packets from non-CFLB-aware sources.

With IPv4, these uncontrollable fields are the source and destination IP addresses and the protocol number.

The design of CFLB starts from two assumptions. First, the same CFLB function should be used on CFLB

capable devices belonging to the data center. Second, the data center topology must be known either by the

2 CFLB also works in different network environments and for different applications (see Section 6.5).


sources or by a server that can be queried by sources (in such a case, we can also envision that the server

possesses global load information).

In the following, we first describe how CFLB translates a forwarding path into a path selector and how a router

retrieves from it the next-hop selection it should apply. Second, we discuss how to encode the path selector

inside the packet header thanks to an invertible function. Then, we discuss how CFLB adds randomness for

load distribution and avoids polarization. Finally, we summarize the complete operations of CFLB.

6.3.1 Path Selector

In a network offering path diversity, there exists multiple load-balanced paths between a source and a desti-

nation. Figure [figure][3][6]6.3 shows the Direct Acyclic Graph (DAG) of all load-balanced paths between

a source S and a destination D in a simple network. In this example there exist five different load-balanced

paths between nodes S and D. Table [table][1][6]6.1 lists the symbols used in this section and their defini-

tions.

Symbol      Definition
B           The radix of the path selector, i.e., the numeral base used to encode the path selector.
n_i         The next-hop selection of the router at the ith position.
L           The length of the path selector, i.e., the number of next-hop selections that can be encoded in it.
N_i         The number of load-balanced next hops available at a router whose position is i (for the sake of clarity, we ignore the destination prefix).
F(x)        The invertible function applied on x.
H(x)        The hash function applied on x.
cf          The controllable fields used.
uf          The uncontrollable fields used.
E_i(n_i)    The function applied on a next-hop selection; it adds "randomness" using uf.
D_i(x)      The function that performs the inverse of E_i, i.e., D_i(E_i(x)) = x, ∀x ∈ [0, B[.
A||B        The concatenation of A and B.

Table 6.1: General notations.

CFLB allows sources to force a packet to follow a given load-balanced path. Such a path can be described

as a set of subsequent routers and their next-hop selections. For instance, the load-balanced path highlighted

in bold in Figure 6.3 can be expressed as the following sequence: R1 → R2, R4 → R7. The

notation Ri → Rj means that router Ri forwards the packet to its neighbor Rj . There is no need to represent

the next-hop selection of router R7 towards D as R7 only has one possible next hop.

By knowing the number Ni of available next hops towards a destination for each router in the network, next-

hop selections can be mapped to a number n_i ∈ [0, N_i[, where n_i indicates the index of the next hop that should be selected. Using this representation, the highlighted path in Figure 6.3 can therefore

be expressed as the following sequence of next-hop selections: {(1→ 0), (4→ 2)}, where (i→ j) specifies

that router Ri on the path selects its jth next hop. In CFLB, we define this sequence of next-hop selections to

be a path selector.

Path Selector Representation To force a packet to follow a specific load-balanced path, CFLB encodes

the path selector inside the source and destination ports of the packet headers. CFLB stores a path selector


[Figure omitted: DAG of the load-balanced paths from S through routers R1–R8 to D, with next-hop indices labeling the load-balanced links.]

Figure 6.3: The load-balanced paths between S and D.

as a positional base-B unsigned integer, where B is known as the radix and shared by all the nodes in the

network. This maximizes the number of next-hop selections that can be encoded inside the path selector, while minimizing the number of bits used.

A path selector p can be generalized as:

p = Σ_{i=0}^{L−1} n_i × B^i    (6.2)

where n_i is an unsigned integer in base B that represents the next-hop selection of the router having the ith

position within the path selector. For the moment, we assume that each router is able to determine its position

in the path selector (see the dedicated paragraph for more details).

Only routers having multiple load-balanced next hops to forward a packet must retrieve the path selector. In

this case, the router first extracts the path selector p from the packet’s header fields and then retrieves the

next-hop selection.

A path selector p can be inverted on the router having the ith position within the path selector to find the

next-hop selection ni it needs to apply on a packet by applying Equation 6.3.

n_i = ⌊p / B^i⌋ mod B    (6.3)

The integer division by Bi removes all load-balanced next-hop selections of upstream routers while the

modulo operation removes all load-balanced next-hop selections of downstream routers.
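Equations 6.2 and 6.3 can be sketched as follows (a minimal illustration; the randomness and TTL handling introduced later in this section are left out):

```python
def encode_path_selector(selections: dict, base: int) -> int:
    """Equation 6.2: pack next-hop selections {position i: n_i}
    into a positional base-B unsigned integer."""
    assert all(0 <= n < base for n in selections.values())
    return sum(n * base**i for i, n in selections.items())

def decode_next_hop(p: int, i: int, base: int) -> int:
    """Equation 6.3: the integer division by B^i strips upstream
    selections, the modulo strips downstream ones."""
    return (p // base**i) % base
```

With B = 3, encoding selection 0 at position 4 and selection 2 at position 2 yields p = 0 × 3⁴ + 2 × 3² = 18, and each position decodes back to its own selection.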

Path Selector Length The fixed size of the packet header fields used to encode the path selector limits the

number of encodable next-hop selections to:

L = ⌊log_B(2^X)⌋    (6.4)

where X is the size in bits of the header fields used to encode the path selector. With IPv4, X = 32 since the

ports are used to convey the path selector.

Increasing B decreases L, thus the number of load-balanced next-hop selections that can be encoded inside

the path selector. As the potential value of n_i is limited by the radix B, one should choose a value of B such that it

c© CHANGE Consortium 2012 Page 47 of (76)

Page 48: CHANGE - CORDIS...The CHANGE architecture, which is based around the notion of a flow processing platform, aims to re-enable innovation in the Internet. However, before processing

is the maximum degree in every possible DAG (for each possible destination prefix). It can be expressed as:

B = max_i(N_i)    (6.5)

This makes it possible to represent, for any router in the network, a next-hop selection with a value lower than the maximum number of available next hops at any router. In practice, most routers have a hardcoded upper bound on the number of next hops they can use for a destination; most of them use a maximum of 16 next hops [8].

Using a radix equal to the maximum number of next hops across all CFLB routers might result in an inefficient usage of the available bits in the packet header. Relaxing the problem by allowing CFLB routers to use more than one position inside the path selector gives more flexibility in the load-balanced next-hop selection values that can be encoded at the source. If a router uses n positions of the path selector, then a source can select a next hop among B^n on this CFLB router.

Position Inside the Path Selector Up to now, we assumed that each router knows its position inside

the path selector. However, in a real-world scenario it is impossible for a router to identify the sequence of upstream routers, making it impossible for the router to know its position inside the path selector. To still enable a

source to construct a path selector and for routers to extract corresponding next-hop selections, CFLB uses

the Time-to-Live (TTL) of the packet to identify each router’s position inside the path selector. As the TTL

is decremented by each router and as the path selector is bounded by L, we can define the position i of each

router within the path selector to be ttl_i mod L, where ttl_i is the TTL value of the packet received at the router at position i, i.e., the router located at the (64 − ttl_i + 1)th hop along the load-balanced path (64 being the initial TTL used at the source). The source can, knowing the initial TTL of the packets3, encode

the next-hop selection of each CFLB router on the path at the corresponding position inside the path selector.

As previously mentioned, to allow a router to use more than one position inside the path selector, the router

must decrement the TTL of the packets it forwards by more than one. We do not discuss this extension further in this deliverable.

One Load-Balanced Path – Multiple Path Selectors When multiple connections between the same

pair of hosts want their packets to follow the same load-balanced path, using the same path selector generates

a collision, as each connection would use the same port numbers. To overcome this issue, CFLB can generate multiple path selectors describing the same load-balanced path, using two solutions that can be

combined. First, the unused positions inside the path selector can be filled with random values. Second, by

changing the initial TTL, the position of each router in the path selector changes and thus the pair of ports

used to force a specific load-balanced path. This ensures that the 5-tuple used by the hosts varies from one

connection to another.

3 Most operating systems use a system-wide default TTL.


Example Let us now illustrate how CFLB works in the simple network shown in Figure 6.3.

First, based on Equation 6.5, we can deduce that the radix B must be 3. Using this radix, from Equation 6.4,

we can encode 20 load-balanced next-hop selections inside a path selector.

Let us assume that the source uses an initial TTL of 64 and wants this packet to follow the highlighted path

in Figure 6.3. The positions of the next-hop selections of routers R1 and R4 inside the path

selector are respectively 4 (64 mod 20) and 2 (62 mod 20). The path selector computed by the source

based on Equation 6.2 can therefore be expressed as (where the notation [x→ y] refers to a router at position

x, which should select its yth next hop):

p = {[4 → 0], [2 → 2]} = 0 × 3^4 + 2 × 3^2 = 18

Note that this implies encoding a next-hop selection of 0 for all other positions inside the path selector. This

value is then encoded inside the packet header. R1 retrieves from the packet header the same path selector

and the TTL value to compute its position inside the path selector, i.e., 4. It then computes the next-hop

selection it needs to apply on the packet based on Equation 6.3:

n_4 = ⌊18 / 3^4⌋ mod 3 = 0

R1 decrements the TTL of the packet and forwards it to the next hop labeled 0, i.e., R2. R2 does not have

load-balanced next hops. It forwards the packet to R4 and decrements the TTL. R4 applies the same operation

as R1. R4 computes:

n_2 = ⌊18 / 3^2⌋ mod 3 = 2

The packet is therefore forwarded to R7 and then to D as R7 only has one possible next hop to forward the

packet.

6.3.2 Invertible Function

CFLB encodes the path selector inside the packet header using an invertible function F . This invertible

function must meet two important properties:

Bijection F must be bijective such that there is a one-to-one correspondence between its domain and image,

which must be {0, 1}^X, where X is the length in bits of the controllable fields. F must also be invertible, i.e., there must exist an inverse F⁻¹ such that:

F⁻¹(F(x)) = x  ∀ x ∈ {0, 1}^X

Avalanche effect F must exhibit the avalanche effect [58]. Indeed, a simple bijective function might end up

in a poor distribution of the non-controlled traffic [14]. For instance, if only the port numbers are used as controllable fields, we do not want all web traffic to go through the same next hop. Therefore,


we require that the invertible function exhibits the avalanche effect, that is, a small variation of the input (e.g., different source ports) produces a large variation of the output.

Based on these two requirements, block ciphers such as Skip324 or RC5 [53] are good candidates to implement this invertible function when 32 bits are controllable. When more bits are available, we recommend using a block cipher mode of operation as a format-preserving encryption (FPE) construction, such as the eXtended CodeBook (XCB) mode of operation [45], which accepts arbitrarily-sized blocks provided they are at least as large as the blocks of the underlying block cipher. Depending on the number of bits that are controllable in the packet header, different types of block ciphers can be used with XCB, from 32-bit symmetric-key block ciphers to more common ones such as DES, 3DES or AES. Furthermore, efficient hardware-based implementations of such block ciphers exist [22, 34]. Using such functions to encode the path selector enables a router to apply the inverse of this

ist [22, 34]. Using such functions to encode the path selector enables a router to apply the inverse of this

function on the controllable fields of the packet header to retrieve the path selector.

6.3.3 Load Balancing Efficiency

CFLB is designed based on the properties of hash-based load balancers. It must be transparent to sources that

do not need to control the load-balanced path taken by their packets. In this case, the controllable fields are

random and do not encode a path selector. CFLB must still distribute such packets efficiently amongst the

available load-balanced next hops.

Uncontrollable fields As Cao et al. showed, the most efficient packet distribution is achieved when the

fields representing a flow are all used as input to the load-balancing function [14]. CFLB therefore also uses the uncontrollable fields (source and destination addresses and the protocol number) as input. For that, the way the path selector is encoded changes slightly from Equation 6.2:

E_i(n_i) = (n_i + H(uf)) mod B    (6.6)

p = Σ_{i=0}^{L−1} E_i(n_i) × B^i    (6.7)

where H is a hash function and uf contains the uncontrollable fields of the packet header. This efficiently distributes packets over the available next hops of each router, while still allowing routers to recover the next-hop selections encoded by the sources.

Equation 6.7 can be inverted to find the next-hop selection ni to apply on the router whose position is i by

applying the following operation on the path selector extracted from the packet header:

D_i(x) = (x − H(uf)) mod B    (6.8)

4 http://www.qualcomm.com.au/PublicationsDocs/skip32.c.


n_i = D_i(⌊p / B^i⌋)    (6.9)

Next-hop selection The next-hop selection n_i is a value between 0 and B − 1. To select a next

hop, CFLB applies a mapping between ni and a value between 0 and Ni − 1. However, as Ni ≤ B, an issue

arises when using a simple modulus operation when B is not a multiple of Ni. In this case, the load balancing

distribution might be poor. For instance, if N_i = 2 and B = 3 and the input is uniformly distributed, then the

router ends up forwarding 75% of the incoming packets to the first next hop. To resolve this problem, CFLB

computes the next-hop selection n_i to apply on the packet as follows:

n_i = D_i(⌊p / B^i⌋),        if D_i(⌊p / B^i⌋) < N_i
n_i = H(cf || uf) mod N_i,   otherwise.    (6.10)

The intuition behind Equation 6.10 is that CFLB must distinguish whether the packet was controlled by a

source or not. If the packet was indeed controlled, the next-hop selected, ni, must be the one encoded in

the path selector. However, the non-controlled packets must be distributed randomly among the Ni available

next hops. In Equation 6.10, if the packet is a controlled one, the resulting next-hop selection is lower than N_i (the number of next hops in the routing table for the packet's destination) and the decision encoded at the source is correctly applied. Otherwise, the packet is treated as non-controlled, resulting in a random distribution of the packet over one of the N_i available next hops.
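The case analysis of Equation 6.10 can be sketched as follows (H is a placeholder MD5-based hash, an assumption for illustration; the decoding step follows Equations 6.8 and 6.9):

```python
import hashlib

def H(data: str) -> int:
    """Placeholder for the hash H; any uniform hash works."""
    return int.from_bytes(hashlib.md5(data.encode()).digest()[:4], "big")

def select_next_hop(p: int, i: int, base: int, n_hops: int, cf: str, uf: str) -> int:
    """Equation 6.10: honor the decoded selection when it is a valid
    next-hop index; otherwise fall back to classic hash-based balancing."""
    decoded = ((p // base**i) - H(uf)) % base   # D_i (Equations 6.8-6.9)
    if decoded < n_hops:
        return decoded                          # controlled packet
    return H(cf + "|" + uf) % n_hops            # non-controlled packet
```

A controlled packet thus deterministically reaches the encoded next hop, while anything that decodes out of range is spread hash-style over the N_i available next hops.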

In case of topological changes (transient or permanent), a CFLB router will renumber the indexes of its currently available next hops towards each destination, i.e., update the n_i → R_j mapping and N_i. In such a case, while new flows are "aware" of the new state and are thus correctly controlled, previously existing ones may be impacted. Indeed, when the desired next hop no longer exists or when its index has changed, the resulting path will change. In CFLB, the impacted controlled flows (i.e., the elephant flows) fall back to classic hash-based load balancing thanks to Equation 6.10. We consider that such topological changes should be quite marginal (occurring at a time scale greater than the flows' duration), and new subflows may be created if impacted ones share a common bottleneck.

6.3.4 Avoiding Polarization

Some hash-based load-balancing techniques suffer from the polarization problem [44]. This problem arises

when there are several load-balancing routers in sequence. If they all perform the same computation on the

received packets, they will select the same next hop resulting in an uneven traffic distribution. With CFLB,

the polarization problem only arises when packets traverse more than L CFLB routers. Every router spaced

by L hops computes the same next-hop selection. CFLB solves this problem by assuming that every packet from the same flow received at a router has the same TTL5. CFLB therefore includes the TTL of

5This is a reasonable assumption since hosts use the same TTL for all packets and all packets from a flow follow the same path.


[Figure omitted: per-packet processing pipeline of a CFLB router, from the header fields (IPsrc, IPdst, Proto, TTL, Portsrc, Portdst) through F⁻¹ on the controllable fields and H on the uncontrollable fields, down to the next-hop selection n_i at position i = TTL mod L.]

Figure 6.4: The complete mode of operation of a CFLB router.

Network-wide constant: B = the radix in use in the network.
Network-wide constant: X = the number of bits that are controllable in the packet header.
Require: pckt = the packet to forward.
Ensure: the next-hop selection to apply on pckt.
 1: L ← ⌊log_B(2^X)⌋
 2: cf ← ExtractControllableFields(pckt)
 3: uf ← ExtractUncontrollableFields(pckt)
 4: ttl ← ExtractTTL(pckt)
 5: p ← F⁻¹(cf)
 6: n_i ← (⌊p / B^(ttl mod L)⌋ − H(uf || ttl)) mod B
 7: if n_i < N_i then
 8:   return n_i
 9: else
10:   return H(uf || cf || RouterID) mod N_i
11: end if

Algorithm 1: Pseudocode showing the operations performed by a CFLB router.

the packet inside the hash function, ensuring that all routers make different next-hop selections along the path6. Equation 6.6 and Equation 6.8 respectively become:

E_i(n_i) = (n_i + H(uf || ttl_i)) mod B    (6.11)

D_i(p) = (⌊p / B^i⌋ − H(uf || ttl_i)) mod B    (6.12)

6.3.5 Summary

In the previous sections, we have explained all the design decisions behind the CFLB algorithm. For clarity,

we provide in this section the detailed pseudocode of CFLB.

Figure 6.4 and Algorithm 1 show respectively the operations and the pseudocode performed

by a CFLB router to forward a packet among load-balanced next hops. The first operation is to extract the

controllable and the uncontrollable fields and the TTL from the packet header. Operation F−1 is the inverse of

the invertible function that extracts the path selector from the controllable fields. In Algorithm 1, after having

retrieved the next-hop selection ni, the router performs an if-then-else on the value ni, to determine whether

the packet was controlled by the source. The a_i value in Figure 6.4 corresponds to the addition

6 Another solution could have been to use a simple router id, as in classical hash-based load-balancing techniques [44]; however, this requires each router to be configured with a unique router id and requires the sources or the network information server to know all router ids.


Network-wide constant: B = the radix in use in the network.
Network-wide constant: X = the number of bits that are controllable in the packet header.
Require: path = a sequence (ttl_i, n_i), where ttl_i is the TTL of the packet when received by router i and n_i the next hop that should be selected by router i.
Require: uf = the uncontrollable fields the source needs to use.
Ensure: the controllable fields (cf) to use to force a packet to follow the load-balanced path path.
1: L ← ⌊log_B(2^X)⌋
2: p ← 0
3: for (ttl_i, n_i) ∈ path do
4:   p ← p + ((n_i + H(uf || ttl_i)) mod B) × B^(ttl_i mod L)
5: end for
6: return F(p)

Algorithm 2: Pseudocode showing the path selector construction.

modulo B of the next-hop selection performed at position i and the hash computed on the uncontrollable

fields and the TTL. This ai value was inserted by the source at the ith position inside the path selector. The

router can thus retrieve it and compute the subtraction modulo B with the same hash value, to finally retrieve

the next-hop selection ni.

Algorithm 2 shows the pseudocode used by sources to construct a path selector. The source needs to first

compute the length of the path selector. This is needed because the routers position themselves inside the

path selector by using the TTL and the length of the path selector. Thus, this length has to be taken into

consideration to find the path selector. Then, the source iterates over all routers on the path (with a TTL ttli

and a next-hop selection ni) to compute the path selector.
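A compact end-to-end sketch of Algorithms 1 and 2, under two simplifying assumptions made for illustration: the invertible function F is taken as the identity, and H is a placeholder MD5-based hash:

```python
import hashlib
from math import floor, log

def H(data: str) -> int:
    """Placeholder for the hash H; any uniform hash works."""
    return int.from_bytes(hashlib.md5(data.encode()).digest()[:4], "big")

def selector_length(base: int, bits: int) -> int:
    """L = floor(log_B(2^X)) (Equation 6.4)."""
    return floor(bits * log(2) / log(base))

def build_selector(path, uf: str, base: int, bits: int = 32) -> int:
    """Algorithm 2: path is a sequence of (ttl_i, n_i) pairs."""
    L = selector_length(base, bits)
    p = 0
    for ttl_i, n_i in path:
        p += ((n_i + H(uf + str(ttl_i))) % base) * base**(ttl_i % L)
    return p  # F(p) omitted: F is the identity in this sketch

def router_select(p: int, ttl: int, uf: str, base: int, bits: int = 32) -> int:
    """Algorithm 1, lines 5-6: recover n_i at the router receiving TTL ttl."""
    L = selector_length(base, bits)
    return (p // base**(ttl % L) - H(uf + str(ttl))) % base
```

With B = 3 and an initial TTL of 64, a source can encode the selections of the example in Section 6.3.1 (selection 0 at the first load-balancing router, selection 2 two hops later) and each router recovers its own selection from the TTL it observes.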

6.4 Evaluation

In this section, we evaluate the performance of CFLB compared to hash-based load balancing. Our goal is twofold: first, we evaluate its load-balancing and forwarding performance, and second, we show through

simulations and experiments how MultiPath TCP can benefit from CFLB to exploit the underlying path

diversity.

6.4.1 Load-Balancing for Non-Controlled Flows

The first requirement is that a router having multiple next hops for a given destination should uniformly

distribute the load [14] for non-controlled flows. If there exist N next hops for a given destination prefix, the load balancer should distribute 1/N of the total traffic to each next hop.

CFLB enables sources to steer controlled packets while also acting as a classic load balancer for non-

controlled packets (e.g., mice flows). To compare the hash-based load balancing techniques and CFLB,

we simulated each method using realistic traces and evaluated the fraction of packets forwarded to each next

hop. We based our simulations on the CAIDA passive traces collected in July 2008 at an Equinix data center

in San Jose, CA [55].

To analyze how CFLB balances the non-controlled traffic compared to hash-based techniques, we first simu-

lated 10 million packets (extracted from the CAIDA traces) forwarded through one load balancer performing

a distribution among N = 2 next hops. Figure 6.5(a) shows the result of this simulation (computed every

second). There are three observations resulting from this figure. First, using CRC16 as a hash-based load


[Figure omitted: CDFs of the load repartition (% packets) for CFLB-Skip32, CFLB-RC5, MD5 and CRC16; panels: (a) after one load balancer, (b) after multiple load balancers.]

Figure 6.5: Deviation from an optimal distribution amongst two possible next hops.

balancer gives a rather poor distribution of packets. Second, as the maximum deviation never reaches 4% of the packets, the load distribution among the two output links is close to an even 50/50 split of the traffic for all evaluated techniques except CRC16. Third, whatever the block cipher used, CFLB achieves a load distribution equivalent to that of a hash-based load balancer using MD5. We did not observe a significant impact of the B value used on the quality of the load distribution.

We also evaluated the load balancing performance considering a sequence of several load balancers. Fig-

ure 6.5(b) shows the cumulative distribution of the maximum deviation of the load distribution after crossing

four subsequent load balancers (computed every half second). The same observation as for Figure 6.5(a) applies: CFLB performs at least as well as a classical hash-based load-balancing technique.

Figure 6.6(a) and Figure 6.6(b) show, respectively for a hash-based load balancer using MD5 and for CFLB using RC5, the load-balancing distribution of packets over time when N = 4. We analyze here the case of a router

having four outgoing links toward a given destination. We can notice that there are no significant differences

between the two techniques, as they behave in the same way over time. They both slightly fluctuate within the

same tight interval [22%, 28%] and their median is close to 25%. Simulations with other traces and different

values of N provide similar results.

Figure 6.6: Packet distribution computed every second amongst four possible next hops. (a) MD5 hash-based; (b) RC5 CFLB.

6.4.2 Forwarding Performance

The second requirement is forwarding performance. To evaluate it, we implemented the forwarding path of CFLB as a module in the Linux kernel 2.6.38. Note that the Université Catholique de Louvain partner has filed a patent for its CFLB code release; the code is therefore restricted to the European Commission and the reviewers.

The basic behavior of the Linux kernel when dealing with multiple next hops for a given destination is to

apply a round robin distribution of packets based on the IP addresses, therefore performing a pure layer-3

load balancing. As this is not comparable to the hash-based load-balancing behavior introduced in Section 6.1, we extended the Linux kernel to take the 5-tuple of the packets into consideration and

then apply a hash function to select a next hop (only CRC-like functions are available).

Figure 6.7: CFLB gives equivalent forwarding performance to hash-based load balancers (forwarded vs. sent packets per second, for Linux round-robin, CRC16, CFLB-RC5 and CFLB-Skip32).

To implement CFLB in the kernel, we extended the previously mentioned hash function to enable the deterministic selection of a next hop, as described in Section 6.3. More information about the implementation can be found at http://inl.info.ucl.ac.be/cflb. We used two different 32-bit block ciphers to implement

the invertible function: RC5 and Skip32. The implementation of these two block ciphers is not optimized; the goal is solely to prove the feasibility of our solution (various techniques could be used to improve its performance [39, 31]). Note that CFLB also applies a CRC function to the uncontrollable fields to add randomness for non-controlled flows.
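To make the invertibility argument concrete, here is a toy sketch (our own, not the restricted CFLB code; the cipher, key, and field layout are illustrative assumptions) in which a small Feistel permutation plays the role of RC5/Skip32. The router permutes the controllable bits and takes the result modulo N; a source inverts the permutation to force a desired next hop:

```python
import hashlib

MASK16 = 0xFFFF
KEY = b"shared-network-key"   # hypothetical key shared by source and routers

def _f(half, rnd):
    """Round function: 16 bits derived from the key and the round number."""
    d = hashlib.sha256(KEY + bytes([rnd]) + half.to_bytes(2, "big")).digest()
    return int.from_bytes(d[:2], "big")

def encrypt(x, rounds=8):
    """Toy 32-bit Feistel permutation standing in for RC5/Skip32."""
    l, r = x >> 16, x & MASK16
    for i in range(rounds):
        l, r = r, l ^ _f(r, i)
    return (l << 16) | r

def decrypt(y, rounds=8):
    """Inverse permutation: run the rounds backwards."""
    l, r = y >> 16, y & MASK16
    for i in reversed(range(rounds)):
        l, r = r ^ _f(l, i), l
    return (l << 16) | r

def lb_next_hop(controllable, n):
    """Router side: permute the controllable bits, pick hop = result mod n."""
    return encrypt(controllable) % n

def source_field_for(desired_hop, n):
    """Source side: invert the permutation to obtain controllable bits that
    the router will map to the desired next hop (any cipher output whose
    residue mod n equals desired_hop would do; we use the smallest one)."""
    return decrypt(desired_hop)

for hop in range(4):
    assert lb_next_hop(source_field_for(hop, 4), 4) == hop
```

The key point is that, unlike MD5 or CRC, the permutation can be run backwards: a source knowing the cipher and key can always compute controllable bits that steer the router's choice, while uncontrolled flows still see a pseudo-random spread.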

The goal of this experiment is to analyze the impact of using CFLB on packet forwarding: the added per-packet computation increases the processing delay and may also decrease the overall throughput.

We deploy a testbed relying on three computers to emulate the forwarding path of a Linux router. The

computer acting as a load balancer is an Intel Xeon X3440 @2.53GHz, and both sender and receiver are

AMD Opteron 6128 @2GHz. The sender is connected through a 1Gbps link to the load balancer, which

balances traffic amongst two 1 Gbps links to the receiver. The traffic was generated using 8 parallel iperf (http://iperf.sourceforge.net/) generators creating UDP packets with a 64-byte payload, in order to overload the load balancer. The result of this experiment is given in Figure 6.7.

The classic Linux round-robin on IP addresses obviously performs best (it only requires looking up the IP address in the routing cache to forward the packet). It forwards approximately 600,000 packets per second. Not far below, the classical hash-based technique using CRC16 and CFLB using RC5 forward 570,000 and 560,000 packets per second, respectively. This performance drop compared to the standard Linux round-robin is mainly due to the more complex hash algorithm used to select the next hop. Finally, CFLB using Skip32 forwards up to 500,000 packets per second. We can conclude that CFLB, even with non-optimized block ciphers, comes at a marginal additional cost, as its forwarding performance is equivalent to that of classical hash-based techniques.

6.4.3 MPTCP improvements with CFLB

In this section, we evaluate, as our case study, the advantages of running MPTCP hosts conjointly with a CFLB-enabled network. We first simulate a data center environment to show that, when using


CFLB, MPTCP needs to establish fewer subflows for elephant connections than with a probabilistic approach. We then show in a small testbed that MPTCP can benefit from CFLB to avoid crossing hot spots.

6.4.4 Data Center Simulations

We performed simulations of MPTCP-enabled data centers and evaluated the performance achieved with a simple central flow-scheduling algorithm that allocates elephant subflows. The scheduler we use in the simulations simply counts the number of flows traversing each link of the data center.

In practice, hosts can use a similar technique as in [19] to detect whether one connection corresponds to an

elephant flow, and if so query the scheduler to establish additional subflows. The scheduler then specifies to

the host the ports to be used to set up a new subflow. The required ports are computed using the CFLB mode of operation, which allows a subflow to be mapped onto a specific path in the network. We refer to this combination of MultiPath TCP and CFLB as MPTCP-CFLB in the remainder of this section.
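The counting scheduler described above can be sketched as follows. This is a hypothetical minimal version: the class and tie-breaking rule are ours, and the real scheduler would additionally return the CFLB-computed ports for the chosen path:

```python
from collections import defaultdict

class FlowScheduler:
    """Counts the flows traversing each link and allocates each new subflow
    to the path whose busiest link carries the fewest flows."""
    def __init__(self, paths):
        self.paths = paths               # path id -> list of link ids
        self.load = defaultdict(int)     # link id -> number of flows

    def allocate(self):
        best = min(self.paths,
                   key=lambda p: max(self.load[l] for l in self.paths[p]))
        for link in self.paths[best]:
            self.load[link] += 1
        # A real scheduler would now compute, via CFLB, the ports that map
        # the new subflow onto this path, and return them to the host.
        return best

# Three paths over four links; paths 0 and 1 share link "c", 1 and 2 share "b".
sched = FlowScheduler({0: ["a", "c"], 1: ["b", "c"], 2: ["b", "d"]})
print([sched.allocate() for _ in range(3)])   # → [0, 2, 0]
```

Note how the second allocation avoids both paths that share a link with the first flow, which is exactly the behavior that a random (hash-based) placement cannot guarantee.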

To evaluate the benefits of CFLB with MPTCP in data centers, we first enhanced the htsim packet-level

simulator used in [51] to support path selection with CFLB. We consider exactly the same Fat-Tree datacenter

topology as discussed in Figure 2 of [51]. This simulated datacenter has 128 MPTCP servers, 80 eight-port

switches and uses 100 Mbps links. The traffic pattern is a permutation matrix, meaning that senders and

receivers are chosen at random with the constraint that receivers do not receive more than one connection.

The regular MPTCP bars of Figure 6.8 are the same as in Figure 2 of [51]; they show the throughput achieved by MPTCP when MPTCP subflows are load-balanced using ECMP. The MPTCP-CFLB bars show

the throughput that MPTCP obtains when CFLB balances the MPTCP subflows over the least loaded paths. The simulations show that with only 2 subflows, MPTCP-CFLB is much closer to the optimum than MPTCP with hash-based load balancing. Even with a single subflow (smartly allocated by the scheduler), the improvement is considerable and MPTCP-CFLB achieves good network utilization. This can be explained by the fact that a random distribution of subflows results in poor use of the available resources.

Similar results have been observed on other data center topologies such as VL2 and BCube. We also per-

formed simulations for an overloaded data center and observed that using MPTCP-CFLB conjointly with a

flow scheduler focusing on less congested paths offers more fairness amongst the different connections.

6.4.5 Testbed Experiments

We applied a modification to the MPTCP Linux kernel 2.6.36 implementation [9] to add the deterministic

selection feature offered by CFLB. We created a netlink interface to the kernel so that a user-space module

can interact with MPTCP and announce to the kernel the subflows to create.

The CFLB functionality has been implemented in user space, with 32 controllable bits. Our prototype allows a source to control the source and destination ports so that its packets follow a specific path inside an IPv4 network.

Figure 6.8: MPTCP needs few subflows to get a good Fat-Tree utilization when using CFLB (throughput as a percentage of optimal versus the number of MPTCP subflows, for MPTCP-CFLB and regular MPTCP).

The two block ciphers from Section 6.4.2 (RC5 and Skip32) were also implemented inside the kernel crypto library.

A Python library, pycflb, was developed to provide a simple API for interacting with the user-space CFLB. We also developed an RPC server to show the feasibility of centralizing the CFLB computation in a server. The latter knows the network topology and is the only component that interacts with the pycflb library. pycflb must be configured with the cipher and key parameters used in the network. Sources only query it to retrieve the ports to use, or to recover the path taken by a specific flow. These three implementations (the Linux MPTCP netlink interface, the user-space CFLB and the pycflb library) allow a source to deterministically map subflows to paths, and represent approximately 4,000 lines of code.

When MultiPath TCP runs on a single-homed server, additional subflows are created by randomly modifying the port numbers. Since MultiPath TCP relies on tokens to identify the MPTCP connection to which a new subflow belongs (see Section 6.2), both the source and destination ports can be used to add entropy. Combining CFLB and MultiPath TCP in the Linux MPTCP implementation provides a significant benefit, because the subflow 5-tuple can be selected in such a way that the path diversity offered by the underlying network can be easily exploited.

We evaluate the benefit of this technique in a small testbed with a client and a server (AMD Opteron 6128

@2GHz) and two CFLB-capable routers (Xeon X3440 @2.53GHz).

In the first experiment, each host is connected to one router via a 1Gbps link. The routers are directly

connected via seven 100 Mbps links, which offer seven distinct paths between the client and the server. If seven MPTCP subflows are created, an optimal usage of the network should result in about 700 Mbps of throughput. To evaluate this, we ran iperf between the hosts, generating traffic for one minute. The experiment was repeated 400 times to collect representative results. Figure 6.9 provides

the probability distribution function of the number of distinct paths used by classical MPTCP and by our enhanced MPTCP-CFLB implementation.

Figure 6.9: Regular MPTCP is unlikely to use all paths; MPTCP-CFLB, on the other hand, always manages to use all the paths.

Figure 6.10: Regular MPTCP has a very small probability of using link A of Figure 6.11 and is thus suboptimal compared to MPTCP-CFLB (average goodput in Mbps versus the number of MPTCP subflows).

Figure 6.11: Testbed (S to LB over 1 Gbps; link A at 100 Mbps from LB directly to D; 6 × 100 Mbps links from LB to R; R to D over 100 Mbps). The maximum throughput available between S and D is 200 Mbps, due to the bottleneck links toward the destination.

Figure 6.9 leads to the following observation: as expected, with seven subflows MPTCP-CFLB takes full benefit of the seven paths, while classical MultiPath TCP cannot efficiently utilize them. Indeed, the behavior of MPTCP-CFLB is completely deterministic, as the MPTCP connection balances its seven subflows over exactly the seven paths. Among the 400 experiments with randomly selected paths, only two were able to use all seven paths. This confirms the analysis of Section 6.2 on the probability of selecting different paths using random port numbers: indeed, P7(7,7) = 0.6% ≈ 2/400, which explains why plain MPTCP so rarely covers the seven load-balanced paths. Most of the experiments result in four or five paths being used, which implies that two or three paths carry two competing TCP subflows from the same MPTCP connection.
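The probability that k uniformly hashed subflows cover all N equal-cost paths can be checked with a standard inclusion-exclusion surjection count (our own helper, not code from the deliverable):

```python
import math

def coverage_probability(n_paths, n_subflows):
    """Exact probability that n_subflows independent uniform path choices
    hit all n_paths distinct paths (inclusion-exclusion surjection count)."""
    n, k = n_paths, n_subflows
    surjections = sum((-1) ** j * math.comb(n, j) * (n - j) ** k
                      for j in range(n + 1))
    return surjections / n ** k

p = coverage_probability(7, 7)          # equals 7! / 7^7
print(f"P(7 random subflows cover 7 paths) = {p:.2%}")   # ≈ 0.61%
```

With k = N the count collapses to N!/N^N, i.e. 5040/823543 ≈ 0.61% for N = 7, matching the two-in-400 outcome observed in the testbed.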

Our second evaluation (Figure 6.11) still offers seven distinct paths from the source to the destination, but this time the destination has two 100 Mbps links: a direct link from the load balancer, and a second one attached to the router.

With only two subflows, MPTCP-CFLB is able to saturate the two 100 Mbps interfaces of the destination.

Figure 6.10 compares the performance of MPTCP and MPTCP-CFLB as the number of subflows varies. Each MPTCP measurement was repeated 100 times and Figure 6.10 reports the average measured goodput. These measurements clearly show that, when using random port


numbers, MPTCP is unable to efficiently use the two different 100 Mbps links. Increasing the number of

subflows slowly increases the performance but, as explained in Section 6.2, adding a subflow to an MPTCP connection comes at a significant cost. Thus, the fewer TCP subflows established, the better. MPTCP-CFLB covers all the available paths at minimal cost.

6.5 Discussion

In this chapter, we have mainly focused on the utilization of CFLB in data center networks carrying TCP/IPv4 packets, considering the coupling with MPTCP as a case study. However, CFLB could be applied to other problems and other networking technologies. Extending CFLB to support another networking technology amounts to selecting the controllable and uncontrollable fields of the packet header that are used as input to the load-balancing algorithm. We briefly discuss some of these extensions in this section.

A first natural extension of CFLB would be to deploy it in IPv6 networks. There, CFLB could also rely on the source and destination ports, but IPv6 packets additionally carry a 20-bit flow label field, whose semantics are still being debated within the IETF [15]. IPv6 sources could leverage CFLB to encode a path selector in the flow label of their packets.

Although we illustrated the benefits of CFLB with MultiPath TCP, the same approach applies to single TCP/UDP connections. In this case only the source port can be controlled, since applications running on top of these transport protocols need to specify the destination port.

A CFLB network has benefits beyond improving end-host performance. One side benefit is the monitorability of load-balanced paths. Commercial networks often deploy monitoring tools that probe network paths to verify whether the network meets the stringent SLAs requested by their customers. Unfortunately, with load-balanced paths it is very difficult for a monitoring station to steer probe packets onto a specific path, which complicates network monitoring. This operational problem is one of the reasons why the MPLS-TP architecture prohibits the use of ECMP [13]. With CFLB this problem disappears, since a monitoring station can easily steer packets along specific paths through CFLB routers.

MPLS networks often use ECMP to load balance the traffic. To enable MPLS routers to support ECMP

even when carrying non-IP packets, router vendors have proposed the utilization of special MPLS entropy

labels [38] to identify flows that can be load-balanced. CFLB could easily exploit these entropy labels to ensure both that flows are well balanced and that paths can be efficiently monitored.

6.6 Conclusion

Most data center networks nowadays rely on hash-based load balancing to distribute the load over multiple paths. Hash-based techniques efficiently spread the load, but the next hop that such a load balancer will choose is difficult to predict and impossible to select. In this chapter, we have shown that it is possible to achieve efficient load balancing while enabling hosts to explicitly select the paths of their flows. The CHANGE architecture could benefit from our technique, since it would allow it to predictably


select multiple paths between platforms in order to increase network utilization and thus throughput.


7 Motivating Case Study: Extending TCP

In this section we provide an overview of our work designing Multipath TCP, guiding the reader through the lengthy MPTCP design process, which has taken the better part of four years.

Our experience in designing MPTCP has been one of the biggest motivations to develop a new Internet

architecture that is evolvable and incorporates flow processing and middleboxes as first class citizens of the

network.

As many researchers have lamented, changing the behavior of the core Internet protocols is very difficult [32].

An idea may have great merit, but unless there is a clear deployment path whereby the cost/benefit tradeoff

for early adopters is positive, then it is very unlikely to see widespread adoption.

The majority of applications use TCP for transport. Although newer protocols such as SCTP and DCCP exist

that may be a better match for application requirements, software writers must maximize the chance of a

successful connection. Understandably, they use TCP as it is always available and almost always works, and

then work around its limitations at a higher layer.

We wish to move from a single-path Internet to one where the robustness, performance and load-balancing

benefits of multipath transport are available to (almost) all applications. To support such unmodified applications, we must work below the sockets API. As there is no widely deployed signaling mechanism to select between transport protocols, we have to use options in TCP's SYN exchange to negotiate new functionality.

The goal then is for an unmodified application to open a TCP connection in the normal way. When both

endpoints support MPTCP and multiple paths are available, MPTCP should be able to set up additional

subflows and stripe the connection’s data across these subflows, sending most data on the least congested

paths.

The potential benefits are clear, but there are potential costs too. If negotiating MPTCP can cause connections

to fail when regular TCP would have succeeded, then MPTCP is unlikely to be deployed. The second goal,

then, is for MPTCP to work in all current scenarios where regular TCP works. If a subflow fails for any

reason, the connection must be able to continue as long as some other subflow has connectivity.

Third, MPTCP must be able to utilize the network at least as well as regular TCP, but must not starve TCP.

The congestion control scheme described in [60] meets this requirement, but congestion control is not the

only factor that can limit throughput.

Finally MPTCP must be implementable in operating systems without using excessive memory or processing

power. As we will see, this requires careful consideration of both fast-path processing and overload scenarios.

7.1 Design

The five main mechanisms in TCP are:

• Connection setup handshake and state machine.

• Reliable transmission & acknowledgment of data.


• Congestion control.

• Flow control.

• Connection teardown handshake and state machine.

All of these need modifications to achieve robust, high-performance multipath operation, but congestion control has been described elsewhere [60], so we will not discuss it further here.

MPTCP is negotiated via new TCP options in SYN packets, and during this phase the endpoints also exchange

connection identifiers. These are then used to add new paths—subflows—to an existing connection. Subflows

resemble TCP flows on the wire, but they all share a single send and receive buffer at the endpoints. MPTCP

uses per subflow sequence numbers to detect losses and drive retransmissions, and connection-level sequence

numbers to allow reordering at the receiver. Connection-level acknowledgements are used to implement

proper flow control. We discuss the rationale behind these design choices below.

7.1.1 Connection setup

The TCP three-way handshake serves to synchronize state between the client and server1. In particular, initial

sequence numbers are exchanged and acknowledged, and TCP options carried in the SYN and SYN/ACK

packets are used to negotiate optional functionality.

MPTCP must use this initial handshake to negotiate multipath capability. An MP CAPABLE option2 is sent

in the SYN and echoed in the SYN/ACK if the server understands MPTCP and wishes to enable it. Although

this form of extension has been used many times, the Internet has grown a great number of middleboxes in

recent years. Does such a handshake still work?

Our tests in the previous section found that 6% of the paths tested remove new options from SYN packets. This rises to 14% for connections to port 80 (HTTP). We did not observe any access networks that actually dropped

a SYN with a new option. Perhaps most importantly, no path removed options from data packets unless it

also removed them from the SYN, so it is possible to test a path using just the SYN exchange. A separate study [12] probed Internet servers to see whether new options in SYN packets caused any problems. Of the Alexa top 10,000 sites, 15 did not respond to a SYN packet containing a new option.

From these experiments we can conclude that negotiating MPTCP in the initial handshake is feasible, but

with some caveats. There is no real problem if a middlebox removes the MP CAPABLE option from the

SYN: MPTCP simply falls back to regular TCP behavior. However removing it from the SYN/ACK would

cause the client to believe MPTCP is not enabled, whereas the server believes it is. This mismatch would be

a problem if data packets were to be encoded differently with MPTCP. The obvious solution is to require the

third packet of the handshake (ACK of SYN/ACK) to carry an option indicating that MPTCP was enabled.

However this packet may be lost, so MPTCP must require all subsequent data packets to also carry the

1 The correct terms really should be active opener and passive opener, although even these ignore simultaneous open. For conciseness, we use the terms client and server, but we do not imply any additional limitations on how TCP is used.

2 Formally, this is subtype MP CAPABLE of a single TCP option used by MPTCP for multiple purposes.

option until one of them has been acked. If the first non-SYN packet received by the server does not contain

an MPTCP option, the server must assume the path is not MPTCP-capable, and drop back to regular TCP

behavior.

Finally, if a SYN needs to be retransmitted, it would be a good idea to follow the retransmitted SYN with one

that omits the MP CAPABLE option.
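The fallback rules just described can be condensed into a small decision function (a sketch of the text's logic, not the implementation; the three booleans model whether a middlebox lets the option through at each stage of the handshake):

```python
def negotiate(opt_in_syn, opt_in_synack, opt_in_early_data):
    """Returns which protocol the connection ends up using, given whether
    middleboxes let the MP_CAPABLE option through at each stage."""
    if not opt_in_syn:           # option stripped from the SYN:
        return "tcp"             # server never sees it, plain TCP
    if not opt_in_synack:        # stripped from the SYN/ACK:
        return "tcp"             # client falls back to plain TCP
    if not opt_in_early_data:    # third ACK lost or stripped: the server
        return "tcp"             # assumes the path is not MPTCP-capable
    return "mptcp"

assert negotiate(True, True, True) == "mptcp"
assert negotiate(True, False, True) == "tcp"   # no client/server mismatch
```

The important property is that every failure branch lands on plain TCP on both sides, so the option stripping observed on 6–14% of paths degrades the connection rather than breaking it.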

It should be clear from this brief discussion of what should be the simplest part of MPTCP that anyone

designing extensions to TCP must no longer think of the mechanisms as concerning only two parties. Rather,

the negotiation is two-way with mediation, where the packets that arrive are not necessarily those that were

sent. This requires a more defensive approach to protocol design than has traditionally been the case.

7.1.2 Adding subflows

Once two endpoints have negotiated MPTCP, they can open additional subflows. In an ideal world there

would be no need to send new SYN packets before sending data on a new subflow: all that would be needed is a way to identify the connection the packets belong to. In practice, though, NATs and firewalls rarely pass data packets that were not preceded by a SYN.

Adding a subflow raises two problems. First, the new subflow needs to be associated with an existing MPTCP

flow. The classical five-tuple cannot be used as a connection identifier, as it does not survive NATs. Second,

MPTCP must be robust to an attacker that attempts to add his own subflow to an existing MPTCP connection.

When the first MPTCP subflow is established, the client and the server insert 64-bit random keys in the

MP CAPABLE option. These will be used to verify the authenticity of new subflows.

To open a new subflow, MPTCP performs a new SYN exchange using the additional addresses or ports it

wishes to use. Another TCP option, MP JOIN, is added to the SYN and SYN/ACK. This option carries a MAC of the keys from the original subflow; this prevents blind spoofing of MP JOIN packets by an adversary who wishes to hijack an existing connection. MP JOIN also contains a connection identifier derived

as a hash of the recipient’s key [24]; this is used to match the new subflow to an existing connection.
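These two pieces of the MP JOIN exchange can be sketched as follows. The field sizes and hash choices follow the MPTCP design referenced above [24], but the helper names and the dictionary standing in for the SYN are our own:

```python
import hashlib
import hmac
import os

def token(key):
    """Connection identifier sent in MP_JOIN: a 32-bit hash of the
    recipient's key, used to match the subflow to a connection."""
    return hashlib.sha1(key).digest()[:4]

def join_mac(key_a, key_b, nonce_a, nonce_b):
    """MAC over the handshake nonces, keyed with both 64-bit keys from the
    original exchange; without the keys, MP_JOIN cannot be blindly spoofed."""
    return hmac.new(key_a + key_b, nonce_a + nonce_b, hashlib.sha1).digest()[:8]

client_key, server_key = os.urandom(8), os.urandom(8)   # from MP_CAPABLE
nonce_c, nonce_s = os.urandom(4), os.urandom(4)

# The client's MP_JOIN SYN carries the server's token; the server matches
# it against its existing connections, then both sides verify MACs.
syn = {"token": token(server_key), "nonce": nonce_c}
assert syn["token"] == token(server_key)
mac = join_mac(client_key, server_key, nonce_c, nonce_s)
assert len(mac) == 8
```

Note that the token deliberately reveals nothing about the key itself, while the MAC proves the sender took part in the original MP CAPABLE exchange.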

If the client is multi-homed, then it can easily initiate new subflows from any additional IP addresses it owns.

However, if only the server is multi-homed, the wide prevalence of NATs makes it unlikely that a new SYN

it sends will be received by a client. The solution is for the MPTCP server to send an ADD ADDR option

informing the client that the server has an additional address; the client may then initiate a new subflow. This asymmetry is not inherent: no protocol design limitation prevents the client from sending ADD ADDR, or the server from sending a SYN for a new subflow. But the Internet itself is so frequently asymmetric that we need two distinct ways, one implicit and one explicit, to indicate the existence of additional addresses.

7.1.3 Reliable multipath delivery

In a world without middleboxes, MPTCP could simply stripe data across the multiple subflows, with the

sequence numbers in the TCP headers indicating the sequence number of the data in the connection in the

normal TCP way. Our measurements show that this is infeasible in today’s Internet:

Page 64 of (76) c© CHANGE Consortium 2012

Page 65: CHANGE - CORDIS...The CHANGE architecture, which is based around the notion of a flow processing platform, aims to re-enable innovation in the Internet. However, before processing

[ , ]

Recv BuffData ACK

1

Recv Wnd

2

Subflow:1001, Data:1

Ack:1001,W

nd:1

Subflow:2001,Data:2 Ack:2001,Wnd:0

[1, ]2 1

[1,2]3 0

Data ACK(inferred)

1

1

3 Subflow:1002,Data:3

Out of WindowDrop Segment

(a) Drops due to incorrect inference

[ ,

RecData ACK

1

Recv Wnd

2

1001,1

Ack 1

001,Wnd 1

2001,2

Ack 2001, Wnd 1

[1, ]2 1

[2, 3 1

Data ACK(inferred)

1

1

3

1002,3

[---app read---]

1002,3

Could send 3 – missed opportunity

(b) Stalls due to incorrect inference

Figure 7.1: Problems with inferring the cumulative data ACK from subflow ACK

• We observed that 10% of access networks rewrite TCP initial sequence numbers (18% on port 80).

Some of this rewriting is done by proxies that remove new options; a new subflow will fail on these paths. But many rewriters do pass new options: these appear to be firewalls that attempt to increase TCP initial sequence number randomization. As a result, MPTCP cannot assume that the sequence number space

on a new subflow is the same as that on the original subflow.

• Striping sequence numbers across two paths leaves gaps in the sequence space seen on any single path.

We found that 5% of paths (11% on port 80) do not pass on data after a hole. Most of these seem to be proxies that block new options on SYNs, and so present no problem, as MPTCP is never enabled on these paths; but a few do not appear to be proxies, and would stall MPTCP. Perhaps worse, 26% of paths (33% on port 80) do not correctly pass on an ACK for data the middlebox has not observed: either the ACK is dropped or it is "corrected".

Given the nature of today’s Internet, it appears extremely unwise to stripe a single TCP sequence space

across more than one path. The only viable solution is to use a separate contiguous sequence space for each

MPTCP subflow. For this to work, we must also send information mapping bytes from each subflow into the

overall data sequence space, as sent by the application. We shall return to the question of how to encode such

mappings after first discussing flow control and acknowledgments, as the three are intimately related.

7.1.3.1 Flow control

TCP’s receive window indicates the number of bytes beyond the sequence number from the acknowledgment

field that the receiver can buffer. The sender is not permitted to send more than this amount of additional

data.

Multipath TCP also needs to implement flow control, although packets now arrive over multiple subflows. If

we inherit TCP’s interpretation of receive window, this would imply an MPTCP receiver maintains a pool of

buffering per subflow, with receive window indicating per-subflow buffer occupancy. Unfortunately such an

interpretation can lead to a deadlock scenario:

1. The next packet that needs to be passed to the application was sent on subflow 1, but was lost.


2. In the meantime subflow 2 continues delivering data, and fills its receive window.

3. Subflow 1 fails silently.

4. The missing data needs to be re-sent on subflow 2, but there is no space left in the receive window,

resulting in a deadlock.

The receiver could solve this problem by re-allocating subflow 1’s unused buffer to subflow 2, but it can only

do this by rescinding the advertised window on subflow 1. Besides, the receiver does not know which subflow

the next packet will be sent on. The situation is made even worse because a TCP proxy3 on the path may hold

data for subflow 2, so even if the receiver opens its window, there is no guarantee that the first data to arrive

is the retransmitted missing packet.

The correct solution is to generalize TCP’s receive window semantics to MPTCP. For each connection a

single receive buffer pool should be shared between all subflows. The receive window then indicates the

maximum data sequence number that can be sent rather than the maximum subflow sequence number. As

a packet resent on a different subflow always occupies the same data sequence space, no such deadlock can

occur.

The problem for an MPTCP sender is that to calculate the highest data sequence number that can be sent, the

receive window needs to be added to the highest data sequence number acknowledged. However the ACK

field in the TCP header of an MPTCP subflow must, by necessity, indicate only subflow sequence numbers.

Does MPTCP need to add an extra data acknowledgment field for the receive window to be interpreted

correctly?

7.1.3.2 Acknowledgments

To correctly deduce a cumulative data acknowledgment from the subflow ACK fields, an MPTCP sender

might keep a scoreboard of which data sequence numbers were sent on each subflow. However, the inferred

value of the cumulative data ACK does not step in precisely the same way that an explicit cumulative data

ACK would. Consider the following sequence:

1. Data sequence no. 1 is sent on subflow 1 with subflow sequence number 1001.

2. Receiver sends ACK for 1001 on subflow 1.

3. Data sequence no. 2 is sent on subflow 2 with subflow sequence number 2001.

4. Receiver sends ACK for 2001 on subflow 2.

5. ACK for 2001 arrives at sender (the RTT on subflow 2 was shorter).

6. The ACK for 1001 arrives at the sender.

(Note that most TCP proxies prevent MPTCP from being negotiated, but a few do not. The example uses packet sequence numbers for clarity; MPTCP, like TCP, actually uses byte sequence numbers.)


The receiver expected the ACK for 1001 to serve as an implicit data ACK for 1, and the ACK for 2001 as an implicit data ACK for 2. However, as the ACK for 2001 does not implicitly acknowledge both 1 and 2, the sender's inferred data ACK is still 0 after step 5. Only after step 6 does the inferred data ACK become 2.
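The scoreboard inference above can be sketched as follows; the data structures and names are ours, not the protocol's:

```python
# Demonstration of why an inferred cumulative data ACK steps differently
# from an explicit one. The sender keeps a scoreboard mapping each
# (subflow, subflow sequence number) to the data sequence number sent.
# Illustrative names only.

scoreboard = {
    (1, 1001): 1,   # data seq 1 sent on subflow 1, subflow seq 1001
    (2, 2001): 2,   # data seq 2 sent on subflow 2, subflow seq 2001
}

acked_data = set()

def cumulative_data_ack():
    # Highest n such that data sequence numbers 1..n are all acked.
    n = 0
    while n + 1 in acked_data:
        n += 1
    return n

# Step 5: the ACK for subflow seq 2001 arrives first (shorter RTT).
acked_data.add(scoreboard[(2, 2001)])
assert cumulative_data_ack() == 0   # data seq 1 still outstanding

# Step 6: the ACK for subflow seq 1001 arrives.
acked_data.add(scoreboard[(1, 1001)])
assert cumulative_data_ack() == 2   # jumps straight from 0 to 2
```

The inferred value never steps through 1, which is exactly the mismatch that makes a receive window coded relative to the implicit data ACK unsafe.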

This sort of reordering is inevitable with multipath, and it would not by itself be a problem, except that the

receiver needs to code the receive window field relative to the implicit data ACK. Figure 7.1(a) shows the

problem. Suppose the receive window were only two packets, and the application is slow to empty the receive

buffer. In the ACK for 1001, the receiver closes the receive window to one packet. In the ACK for 2001 the

receiver closes the receive window completely, as there is no space remaining. Unfortunately when the ACK

for 1001 is finally received, the inferred cumulative data ACK is now 2; the sender adds the receive window

of size one to this, and concludes incorrectly that the receiver has sufficient buffer space for one more packet.

Figure 7.1(b) shows a similar situation where reordering causes sending opportunities to be missed.

To avoid such scenarios MPTCP must carry an explicit data acknowledgment field, which gives the left edge

of the receive window.

7.1.3.3 Freeing sender buffers

Having an explicit cumulative DATA ACK also improves robustness in the face of middleboxes. Consider a connection with two subflows; subflow 1 is direct, but subflow 2 traverses a middlebox that proactively acknowledges TCP segments once it has received them in order, even though they have not yet reached the TCP receiver. In our results, 3% of paths had such proxies (6% on port 80); all of these removed MP_CAPABLE from SYNs, so MPTCP would not be enabled on these paths, but future MPTCP-aware middleboxes might not do so.

Consider what happens when connectivity to the receiver via subflow 2 is lost, as might happen if the receiver moved out of coverage of a wireless base station. The middlebox acknowledges a segment, but only then discovers it can no longer reach the receiver. The sender receives the subflow acknowledgment for this segment. If it uses the subflow acknowledgment to free the data at the sender, this segment can no longer be resent on subflow 1, which is still working. As a result the connection fails.

If, instead, the sender uses the explicit cumulative DATA ACK to free buffers, such a failure is avoided.
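This buffer-freeing rule can be sketched in a few lines, with illustrative names (not RFC terminology):

```python
# Sketch of the buffer-freeing rule: retransmittable data is released
# only on a connection-level DATA ACK, never on a subflow ACK, since a
# proxy may have acknowledged a segment it can no longer deliver.

send_buffer = {1: b"hello", 2: b"world"}   # data seq -> payload

def on_subflow_ack(data_seq):
    # A subflow ACK may come from a middlebox, not the receiver:
    # free nothing.
    pass

def on_data_ack(cum_data_seq):
    # A DATA ACK is end-to-end: the receiver itself holds the data.
    for seq in [s for s in send_buffer if s <= cum_data_seq]:
        del send_buffer[seq]

on_subflow_ack(1)              # proxy on subflow 2 acked segment 1
assert 1 in send_buffer        # still available for resend on subflow 1
on_data_ack(1)                 # genuine end-to-end acknowledgment
assert 1 not in send_buffer
```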

7.1.3.4 Encoding

We have seen that in the forward path we need to encode a mapping of subflow bytes into the data sequence

space, and in the reverse path we need to encode cumulative data acknowledgments. There are two viable

ways to encode this additional data:

• Send the additional data in TCP options.

• Carry the additional data within the TCP payload, using a chunked or escaped encoding to separate

control data from payload data.

c© CHANGE Consortium 2012 Page 67 of (76)

Page 68: CHANGE - CORDIS...The CHANGE architecture, which is based around the notion of a flow processing platform, aims to re-enable innovation in the Internet. However, before processing

Figure 7.2: Flow Control on the path from C to S inadvertently stops the data flow from S to C

For the forward path we have not found any compelling arguments either way, but the reverse path is a

different matter.

Consider a hypothetical encoding that divides the payload into chunks where each chunk has a TLV header.

A data acknowledgment can then be embedded into the payload using its own chunk type. Under most

circumstances this works fine. However, unlike TCP’s pure ACK, anything embedded in the payload must be

treated as data. In particular:

• It must be subject to flow control because the receiver must buffer data to decode the TLV encoding.

• If lost, it must be retransmitted consistently, so that middleboxes can track sequence state correctly.

• If packets before it are lost, it may be necessary to wait for retransmissions before the data can be parsed, causing head-of-line blocking.

Flow control presents the most obvious problem for the chunked payload encoding. Figure 7.2 provides an

example. Client C is pipelining requests to server S; meanwhile, S's application is busy sending the large response to the first request, so it is not yet ready to read the subsequent requests. At this point, S's receive buffer fills up.

S sends segment 10. C receives it and wants to send the DATA ACK, but cannot: flow control imposed by S's receive window stops it. Because no DATA ACKs are received from C, S cannot free its send buffer, so the buffer fills up and blocks the sending application on S. S's application will only read once it has finished sending data to C, but it cannot finish because its send buffer is full. The send buffer can only empty when S receives the DATA ACK from C, but C cannot send a DATA ACK until S's application reads. This is a classic deadlock cycle.

As no DATA ACK is received, S will eventually time out the data it sent to C and will retransmit it; after many

retransmits the whole connection will time out.

It has been suggested that this can be avoided by simply excluding DATA ACKs from flow control. Unfortunately, any middlebox that buffers data can foil this: it is unaware that the DATA ACK is special, because it looks just like any other TCP payload. (In our observations, the usual TCP proxies re-asserted the original content when sent a "retransmission" with different data. We also found one path that did this without exhibiting any other proxy behavior, which is symptomatic of a traffic normalizer [33], and one on port 80 that reset the connection.)


When the return path is lossy, decoding DATA ACKs will be delayed until retransmissions arrive; this effectively triggers flow control on the forward path, reducing performance. In effect, this would break MPTCP's goal of doing "no worse" than TCP on the best path.

Our conclusion is that DATA ACKs cannot be safely encoded in the payload. The only real alternative is to

encode them in TCP options which (on a pure ACK packet) are not subject to flow control.

7.1.3.5 Data sequence mappings

If MPTCP must use options to encode DATA ACKs, it is simplest to also encode the mapping from subflow

sequence numbers to data sequence numbers in a TCP option. We refer to this as the data sequence number

mapping or DSM.

At first we thought that the DSM option simply needed to carry the data sequence number corresponding to

the start of the MPTCP segment. Unfortunately, middleboxes and "smart" NICs make this far from simple. Middleboxes that resegment data would cause a problem. TCP Segmentation Offload (TSO) hardware in

the NIC also resegments data and is commonly used to improve performance. The basic idea is that the OS

sends large segments and the NIC resegments them to match the receiver’s MSS. What does a NIC performing

TSO do with TCP options? We tested twelve NICs supporting TSO from four different vendors. All of them

copy a TCP option sent by the OS on a large segment into all the split segments.

If MPTCP's DSM option only listed the data sequence number, TSO would copy the same DSM to more than one segment, breaking the mapping. Instead, the DSM option must say precisely which subflow bytes map to which data sequence numbers. This is further complicated by middleboxes that rewrite sequence numbers, which are commonplace (10% of paths). The DSM option must therefore map the offset from the subflow's initial sequence number to the data sequence number, as the offset is unaffected by sequence-number rewriting. The option must also contain the length of the mapping. This is robust: as long as the option is received, it does not greatly matter which packet carries it, so duplicate mappings caused by TSO are not a problem.
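The offset-based mapping can be sketched as follows; the field names and layout are illustrative, not the wire format of the DSM option:

```python
# Sketch of a data sequence mapping keyed by the *offset* from the
# subflow's initial sequence number, so that middleboxes rewriting
# absolute sequence numbers do not corrupt it. Illustrative names.

def make_dsm(subflow_isn, subflow_seq, data_seq, length):
    # The option carries (offset, data_seq, length) -- never the
    # absolute subflow sequence number, which a middlebox may rewrite.
    return {"offset": subflow_seq - subflow_isn,
            "data_seq": data_seq, "length": length}

def apply_dsm(dsm, rcv_isn, rcv_subflow_seq):
    # The receiver reconstructs the offset from its own view of the
    # initial sequence number, whatever a middlebox rewrote it to.
    offset = rcv_subflow_seq - rcv_isn
    assert offset == dsm["offset"]
    return dsm["data_seq"]

dsm = make_dsm(subflow_isn=5000, subflow_seq=5100, data_seq=700, length=100)

# A middlebox shifts all subflow sequence numbers by +42, ISN included:
assert apply_dsm(dsm, rcv_isn=5042, rcv_subflow_seq=5142) == 700
# Duplicate copies of the option (e.g. produced by TSO) are harmless:
# the mapping is identical whichever packet delivers it.
```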

7.1.3.6 Content-modifying middleboxes

Many NAT devices include application-level gateway functionality for protocols such as FTP. IP addresses

and ports in the FTP control channel are re-written by such middleboxes to correct for the address changes

imposed by the NAT.

Multipath TCP and such content-modifying middleboxes have the potential to interact badly. In particular, due to FTP's ASCII encoding, rewriting an IP address in the payload can necessitate changing the length of the payload. Subsequent sequence and ACK numbers are then fixed up by the middlebox so that they are consistent from the point of view of the end systems.

Such length changes break the DSM option's mapping: subflow bytes can be mapped to the wrong place in the data stream. They also break every other mapping mechanism we considered, including chunked payloads.

(Regarding resegmenting middleboxes: we did not observe any that would both permit MPTCP and resegment.)


There is no easy way to handle such middleboxes.

After much debate, we concluded that MPTCP must include a checksum in the DSM mapping so such

content changes can be detected. MPTCP rejects a modified segment and triggers a fallback process: if any other subflow exists, MPTCP terminates the subflow on which the modification occurred; if no other subflow exists, MPTCP drops back to regular TCP behavior for the remainder of the connection, allowing the middlebox to perform rewriting as it wishes.

Calculating a checksum over the data is comparatively expensive, and we did not wish to slow down MPTCP

just to catch such rare corner cases. MPTCP therefore uses the same 16-bit one's complement checksum used

in the TCP header. This allows the checksum over the payload to be calculated only once. The payload

checksum is added to a checksum of an MPTCP pseudo header covering the DSM mapping values and then

inserted into the DSM option. The same payload checksum is added to the checksum of the TCP pseudo-

header and then used in the TCP checksum field.
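The checksum-sharing idea can be sketched in a few lines; the pseudo-header layouts below are purely illustrative and do not reproduce the actual MPTCP or TCP wire formats:

```python
import struct

def ones_sum(data):
    # 16-bit one's complement sum, the primitive behind the TCP checksum.
    if len(data) % 2:
        data += b"\x00"
    s = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while s >> 16:                      # fold carries back in
        s = (s & 0xFFFF) + (s >> 16)
    return s

def checksum(*partial_sums):
    # Combine partial sums, fold, and complement to get a checksum field.
    s = sum(partial_sums)
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return (~s) & 0xFFFF

payload = b"some application data"
payload_sum = ones_sum(payload)         # computed once over the payload

# Hypothetical pseudo-header contents, for illustration only.
tcp_pseudo = struct.pack("!4s4sHH", b"\x0a\x00\x00\x01",
                         b"\x0a\x00\x00\x02", 6, len(payload))
dsm_pseudo = struct.pack("!QII", 7000, 100, len(payload))

tcp_checksum = checksum(payload_sum, ones_sum(tcp_pseudo))  # TCP header
dsm_checksum = checksum(payload_sum, ones_sum(dsm_pseudo))  # DSM option

# A content-modifying middlebox that rewrites the payload (and fixes up
# the TCP checksum) still changes the DSM checksum, exposing the change.
modified_sum = ones_sum(b"SOME application data")
assert checksum(modified_sum, ones_sum(dsm_pseudo)) != dsm_checksum
```

The point of the design is visible in the two `checksum(...)` calls: the expensive pass over the payload happens once, and only the cheap pseudo-header sums differ.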

With this mechanism a software implementation incurs little additional cost from calculating the MPTCP

checksum. Unfortunately, modern NICs frequently perform checksum offload. If the TCP stack uses the

NIC to calculate checksums, with MPTCP it will still need to calculate the MPTCP checksum in software,

negating the benefits of checksum offload. There is little we can do about this, other than to note that future

NICs will likely perform MPTCP checksum offload too, if MPTCP is widely deployed. In the meantime,

MPTCP allows checksums to be disabled in high-performance environments, such as datacenters, where

there is no chance of encountering such an application-level gateway.

The fallback-to-TCP process triggered by a checksum failure can also be triggered in other circumstances.

For example, if a routing change moves an MPTCP subflow to a path where a middlebox removes DSM

options, this also triggers the fallback procedure.

7.1.4 Connection and subflow teardown

TCP has two ways to indicate connection shutdown: FIN for normal shutdown and RST for errors such as

when one end no longer has state. With MPTCP, we need to distinguish subflow teardown from connection

teardown. With RST, the choice is clear: it must only terminate the subflow, or an error on a single subflow

would cause the whole connection to fail.

Normal shutdown is slightly more subtle. TCP FINs occupy sequence space; the FIN/FIN-ACK/ACK handshake and the cumulative nature of TCP's acknowledgments ensure not only that all data has been received, but also that both endpoints know the connection is closed and know who needs to hold TIMEWAIT state.

How then should a FIN on an MPTCP subflow be interpreted? Does it mean that the sending host has no more data to send, or only that no more data will be sent on this subflow? Another way to phrase this is to ask whether a FIN on a subflow occupies data sequence space or just subflow sequence space.

Consider first what would happen if a FIN occupied data sequence space. This could be achieved by extending

the length of the DSM mapping in a packet to cover the FIN. Mapping the FIN into the data sequence space


in this way tells the receiver what the data sequence number of the last byte of the connection is, and hence

whether any more data is expected from other subflows.

Suppose now that some data had been transmitted on subflow A just before the last data and FIN were sent

on subflow B. If the receiver is really unlucky, subflow A may fail (perhaps due to mobility) before the last

data arrives. When the sender times out this data, it will wish to re-send it on subflow B, but it has already

sent a FIN on this subflow. Sending data after the FIN is sure to confuse middleboxes and firewalls that tore

down state when they observed the FIN.

This particular problem might be avoided by delaying sending the FIN until all outstanding data has been

DATA ACKed, but this adds an unnecessary RTT to all connections, during which the receiving application does not know whether more data will arrive.

Much simpler is for a FIN to have the more limited "no more data on this subflow" semantics, and this is what MPTCP does. An explicit DATA FIN, carried in a TCP option, then indicates the end of the data sequence space and can be sent as soon as the application closes the socket. To be safe, the sender either waits for the DATA ACK of the DATA FIN before sending a FIN on each subflow, or sends the DATA FIN on all subflows together with a FIN.

MPTCP’s FIN semantics also allow subflows to be closed cleanly while allowing the connection to continue

on other subflows. Finally, to support mobility, MPTCP provides a REMOVE_ADDR message, allowing one subflow to indicate that other subflows using the specified address are closed. This is necessary to cope cleanly with mobility when a host loses the ability to send from an address and so cannot send a subflow FIN.

7.2 Lessons learned

In today's Internet, the three-way handshake involves not only the two communicating hosts but also all the middleboxes on the path. Verifying the presence of a particular TCP option in a SYN+ACK is not sufficient to ensure that a TCP extension can be safely used. As shown in the previous chapter, some middleboxes pass TCP options that they do not understand. This is safe for TCP options that are purely informative (e.g., RFC1323 timestamps) but causes problems with options that redefine the semantics of TCP header fields. For example, the window scale extension in RFC1323 changes the semantics of the window field of the TCP header and effectively extends it beyond 16 bits. Nearly 20 years after the publication of RFC1323, there are still stateful firewalls that do not understand this option in SYNs but block data packets sent in the RFC1323 extended window. A TCP extension that changes the semantics of parts of the packet header must include mechanisms to cope with middleboxes that do not understand the new semantics.

A second issue for all TCP designers is the mutability of TCP packets. In a purely end-to-end Internet, all the information carried inside TCP packets would be immutable. Today this is no longer true: the entire TCP header and the payload must be considered mutable fields. If a TCP extension needs to rely on a particular field, it must check its value in a way that cannot be circumvented by middleboxes that do not understand the extension. The DSM checksum is an example of a solution to these problems.


The third, and probably most important, point about new TCP extensions is that to be deployable they must include techniques that enable them to fall back to regular TCP when something goes wrong. If a middlebox interferes badly with a TCP extension, the problem must be detected and the extension automatically disabled to preserve the data transfer. A TCP extension will only be deployed if its designers can guarantee that it will transfer data correctly (and hopefully better) in all the situations where regular TCP is able to transfer data.

The last point we would like to raise is that hidden middleboxes increase the complexity of the network. This gives us a strong motivation to change the network architecture to recognize their role explicitly. The CHANGE architecture aims to do precisely this, embracing flow processing, and implicitly middleboxes, while allowing the Internet to evolve at the same time.


Conclusion

This document has tackled several important mechanisms that are required to enable CHANGE platforms to process flows in real networks and the global Internet.

The first mechanism consists of locating the closest suitable CHANGE platform for a specific flow to be processed. CHANGE platforms are heterogeneous, each characterized by a set of resources (e.g., CPU, memory, and bandwidth), and, given a specific flow to process, we must be able to select a platform that satisfies the flow's constraints and metrics. We classified platform localization mechanisms into two types, centralized and decentralized, and proposed an approach for each class. In the centralized case, a central database suffices to keep information about all platforms in terms of supported functionality and resource availability. In the decentralized case, resource availability is more delicate to manage and maintaining a reliable centralized database is infeasible, so we presented a distributed approach based on IP anycast. We used a modified version of Kruskal's minimum spanning tree algorithm to split platforms into clusters and assign addresses to each cluster's members. We also discussed how a CHANGE platform dynamically manages its internal resources across its servers when a host requests a network service.
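As a generic illustration of Kruskal-style clustering (the deliverable's actual modification is described in the body of the document; the function name and the stopping rule shown here are assumptions), edges are added in increasing weight order until the desired number of clusters remains:

```python
# Generic sketch: run Kruskal's algorithm over platform-to-platform
# link weights (e.g. latencies), but stop merging once k connected
# components remain; each component becomes one anycast cluster.

def kruskal_clusters(n, edges, k):
    # edges: list of (weight, u, v) over nodes 0..n-1
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    components = n
    for w, u, v in sorted(edges):
        if components == k:
            break
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())

# Four platforms with hypothetical pairwise latencies; ask for 2 clusters.
edges = [(1, 0, 1), (2, 2, 3), (10, 1, 2), (12, 0, 3)]
clusters = kruskal_clusters(4, edges, k=2)
assert sorted(map(sorted, clusters)) == [[0, 1], [2, 3]]
```

Stopping Kruskal early in this way is equivalent to single-linkage clustering: the two heaviest inter-cluster links are simply never added.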

A second mechanism discussed in this deliverable is flow attraction. CHANGE platforms are not always on the initial path from a source to a destination, so we need a mechanism that allows these platforms to attract flows. We split the flow-attraction mechanism into two steps: first, attracting the flows to the platform, and second, delivering the processed flow back to the destination. Based on comparisons made in deliverable D4.1 between three possible solutions, we chose to implement FlowSpec, a mechanism for distributing traffic flow specifications. FlowSpec information is carried via BGP, which allows the routing system to propagate flow specifications. We showed how FlowSpec, coupled with ExaBGP, a BGP engine that allows injecting routes with arbitrary next hops into the network, can be used to deploy the flow attraction mechanism required by CHANGE platforms.

The final two mechanisms are an algorithm for calculating how to allocate the resources of a set of CHANGE platforms to a set of service requests, and a novel Controllable per-Flow Load-Balancing (CFLB) mechanism that can improve performance in scenarios where CHANGE is deployed in networks with significant path diversity, such as datacenters.

This deliverable, and the mechanisms described herein, provide the primitives needed to start implementing

CHANGE platforms in future work.


Bibliography

[1] Load balancing with Cisco Express Forwarding. Technical report, Cisco Systems, Inc., 1998.

[2] 4-hour PlanetLab ping trace, http://www.eecs.harvard.edu/~syrah/nc/sim/pings.4hr.stamp.gz, 2011 (accessed November 2011).

[3] A BGP engine and route injector, 2011 (accessed November 2011).

[4] IP address location technology, http://www.maxmind.com/app/ip-location, 2011 (accessed November 2011).

[5] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture.

ACM SIGCOMM CCR, 38:63–74, August 2008.

[6] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In Proc. USENIX NSDI, 2010.

[7] IEEE Standards Association. IEEE Std 802.1AX-2008 IEEE Standard for Local and Metropolitan Area

Networks - Link Aggregation. 2008.

[8] B. Augustin, T. Friedman, and R. Teixeira. Measuring Load-balanced Paths in the Internet. In Proc.

ACM IMC, volume 6, 2007.

[9] S. Barre, C. Paasch, and O. Bonaventure. MultiPath TCP: From Theory to Practice. In IFIP Networking,

Valencia, May 2011.

[10] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In

Proc. ACM IMC, pages 267–280, Melbourne, 2010.

[11] T. Benson, A. Anand, A. Akella, and M. Zhang. The case for fine-grained traffic engineering in data

centers. In Proc. of INM/WREN, pages 2–2, 2010.

[12] Andrea Bittau, Michael Hamburg, Mark Handley, David Mazieres, and Dan Boneh. The case for ubiquitous transport-level encryption. In USENIX Security'10, pages 26–26, Berkeley, CA, USA, 2010. USENIX Association.

[13] M. Bocci, S. Bryant, D. Frost, L. Levrau, and L. Berger. A Framework for MPLS in Transport Networks.

RFC 5921 (Informational), July 2010. Updated by RFC 6215.

[14] Z. Cao, Z. Wang, and E. Zegura. Performance of Hashing-Based Schemes for Internet Load Balancing.

In Proc. IEEE INFOCOM, 2000.

[15] B. Carpenter and S. Amante. Using the IPv6 flow label for equal cost multipath routing and link

aggregation in tunnels. Internet draft, draft-carpenter-flow-ecmp-05, IETF, July 2011.


[16] R. Chandra, P. Traina, and T. Li. BGP Communities Attribute. RFC 1997 (Proposed Standard), August

1996.

[17] J. Chu, N. Dukkipati, Y. Cheng, and M. Mathis. Increasing TCP's Initial Window. Internet draft, IETF, April 2011.

[18] Cisco. Server Cluster Designs with Ethernet. http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DC_Infra2_5/DCInfra_3.html#wp1088785.

[19] A. Curtis, W. Kim, and P. Yalagandula. Mahout: Low-overhead datacenter traffic management using

end-host-based elephant detection. In INFOCOM, pages 1629–1637. IEEE, 2011.

[20] A. R. Curtis, J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, and S. Banerjee. DevoFlow: Scaling

Flow Management for High-Performance Networks. In Proc. of ACM SIGCOMM, 2011.

[21] Frank Dabek, Russ Cox, Frans Kaashoek, and Robert Morris. Vivaldi: a decentralized network coordinate system. SIGCOMM Comput. Commun. Rev., 34(4):15–26, August 2004.

[22] C. De Canniere, O. Dunkelman, and M. Knezevic. KATAN and KTANTAN – A family of small and

efficient hardware-oriented block ciphers. In Proc. CHES 2009, pages 272–288, 2009.

[23] A. Ford, C. Raiciu, M. Handley, and O. Bonaventure. TCP Extensions for Multipath Operation with

Multiple Addresses. Internet draft, draft-ietf-mptcp-multiaddressed-04, IETF, July 2011.

[24] A. Ford, C. Raiciu, M. Handley, and O. Bonaventure. TCP extensions for multipath operation with

multiple addresses, Jul 2011. IETF draft (work in progress).

[25] P. Francois and O. Bonaventure. Avoiding transient loops during IGP convergence in IP networks. In

Proc. of IEEE INFOCOM, 2005.

[26] P. Francois, M. Shand, and O. Bonaventure. Disruption free topology reconfiguration in OSPF networks.

In Proc. of IEEE INFOCOM, 2007.

[27] J. Fu, P. Sjodin, and G. Karlsson. Loop-free updates of forwarding tables. IEEE Transactions on

Network and Service Management, 5(1), March 2008.

[28] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: a scalable and flexible data center network. In Proc. ACM SIGCOMM, 2009.

[29] Krishna P. Gummadi, Stefan Saroiu, and Steven D. Gribble. King: estimating latency between arbitrary

internet end hosts. SIGCOMM Comput. Commun. Rev., 32:11–11, July 2002.


[30] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: a high performance, server-centric network architecture for modular data centers. In Proc. ACM SIGCOMM, pages 63–74, 2009.

[31] S. Han, K. Jang, K. Park, and S. Moon. PacketShader: a GPU-accelerated software router. In Proc. ACM SIGCOMM, pages 195–206, 2010.

[32] M. Handley. Why the internet only just works. BT Technology Journal, 24:119–129, 2006.

[33] M. Handley, V. Paxson, and C. Kreibich. Network intrusion detection: evasion, traffic normalization,

and end-to-end protocol semantics. In Proc. USENIX Security Symposium, pages 9–9, 2001.

[34] A. Hodjat and I. Verbauwhede. A 21.54 Gbits/s Fully Pipelined AES Processor on FPGA. In Proc IEEE

Symp. Field-Programmable Custom Computing Machines, pages 308–309, 2004.

[35] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992 (Informational), November

2000.

[36] G. Iannaccone, C.-N. Chuah, S. Bhattacharyya, and C. Diot. Feasibility of IP Restoration in a Tier-1 Backbone. IEEE Network, 18(2), March 2004.

[37] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of data center traffic:

measurements & analysis. In Proc. ACM SIGCOMM IMC, pages 202–208, 2009.

[38] K. Kompella, J. Drake, S. Amante, W. Henderickx, and L. Yong. The Use of Entropy Labels in MPLS Forwarding. Internet draft, draft-ietf-mpls-entropy-label-00, IETF, May 2011.

[39] Michael E. Kounavis, Xiaozhu Kang, Ken Grewal, Mathew Eszenyi, Shay Gueron, and David Durham.

Encrypting the internet. In Proc. ACM SIGCOMM, pages 135–146, 2010.

[40] J. B. Kruskal. On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem. In

Proc. of the American Mathematical Society, 7, 1956.

[41] Craig Labovitz, Scott Iekel-Johnson, Danny McPherson, Jon Oberheide, and Farnam Jahanian. Internet inter-domain traffic. In Proc. ACM SIGCOMM, SIGCOMM '10, pages 75–86, New York, NY, USA, 2010. ACM.

[42] Cristian Lumezanu, Randy Baden, Neil Spring, and Bobby Bhattacharjee. Triangle inequality variations in the internet. In Proc. ACM IMC, IMC '09, pages 177–183, New York, NY, USA, 2009. ACM.

[43] P. Marques, N. Sheth, R. Raszuk, B. Greene, J. Mauch, and D. McPherson. Dissemination of Flow

Specification Rules. RFC 5575 (Proposed Standard), August 2009.


[44] R. Martin, M. Menth, and M. Hemmkeppler. Accuracy and Dynamics of Multi-Stage Load Balancing

for Multipath Internet Routing. In Proc. IEEE ICC, June 2007.

[45] D. A. McGrew and S. R. Fluhrer. The Extended Codebook (XCB) Mode of Operation. Cryptology

ePrint Archive, Report 2004/278, 2004.

[46] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: enabling innovation in campus networks. SIGCOMM CCR, 38:69–74, March 2008.

[47] P. Merindol, P. Francois, O. Bonaventure, S. Cateloin, and J.-J. Pansiot. An efficient algorithm to enable

path diversity in link state routing networks. Computer Networks, 55(1):1132–1149, April 2011.

[48] J. Mudigonda, P. Yalagandula, M. Al-Fares, and J. C. Mogul. SPAIN: COTS data-center Ethernet for

multipathing over arbitrary topologies. In Proc. USENIX NSDI, pages 18–18, 2010.

[49] T. S. Eugene Ng and Hui Zhang. Predicting internet network distance with coordinates-based approaches. In Proc. IEEE INFOCOM, pages 170–179, 2001.

[50] Gang Peng. CDN: Content distribution network. CoRR, cs.NI/0411069, 2004.

[51] C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, D. Wischik, and M. Handley. Improving datacenter performance and robustness with multipath TCP. In Proc. ACM SIGCOMM, August 2011.

[52] M. Reitblatt, N. Foster, J. Rexford, and D. Walker. Consistent Updates for Software-Defined Networks: Change You Can Believe In! In Proc. of ACM SIGCOMM HotNets-X Workshop, 2011.

[53] R. L. Rivest. The RC5 Encryption Algorithm. In Proc. FSE, volume 1008, pages 86–96, 1994.

[54] S. Sangli, D. Tappan, and Y. Rekhter. BGP Extended Communities Attribute. RFC 4360 (Proposed

Standard), February 2006.

[55] C. Shannon, E. Aben, Kc Claffy, and D. Andersen. The CAIDA Anonymized 2008 Internet Traces – 2008-07-17 12:59:07 - 2008-07-17 14:01:00. http://www.caida.org/data/passive/passive_2008_dataset.xml.

[56] J. Touch and R. Perlman. Transparent Interconnection of Lots of Links (TRILL): Problem and Applicability Statement. RFC 5556 (Informational), May 2009.

[57] L. Vanbever, S. Vissicchio, C. Pelsser, P. Francois, and O. Bonaventure. Seamless Network-Wide IGP

Migrations. In Proc. of ACM SIGCOMM, 2011.

[58] A. F. Webster and S. E. Tavares. On The Design Of S-Boxes. In Proc. CRYPTO, pages 523–534, 1986.


[59] D. Wischik, C. Raiciu, A. Greenhalgh, and M. Handley. Design, implementation and evaluation of congestion control for multipath TCP. In Proc. USENIX NSDI, pages 99–113, 2011.

[60] Damon Wischik, Costin Raiciu, Adam Greenhalgh, and Mark Handley. Design, implementation and evaluation of congestion control for multipath TCP. In NSDI '11, pages 8–8, Berkeley, CA, USA, 2011. USENIX Association.

[61] Ellen W. Zegura, Kenneth L. Calvert, and Samrat Bhattacharjee. How to model an internetwork. In Proceedings of IEEE INFOCOM, volume 2, pages 594–602, 1996.
