

Restoring TCP sessions with a DHT

Peter Boers
peter.boers@os3.nl

Wednesday 3rd August, 2016

Abstract

Traditionally, web-scale applications use various load balancing techniques to maintain TCP sessions. Usually this means sharing all TCP session information over a number of load balancing appliances. This model is unsustainable in very large infrastructures. Our research proposes a novel method of maintaining TCP session integrity: it sets out to investigate whether TCP session information can be stored in a Peer-to-Peer, Distributed Hash Table (DHT) overlay network. By evaluating a proof of concept, we conclude that storing TCP session information in a DHT might provide a reliable and scalable solution for maintaining TCP session integrity in the event of network failures.

1 Introduction

With the ever increasing number of nodes in networks comes an ever increasing demand on network management. How does one keep a dynamic environment with thousands of Virtual Machines (VMs) and various services online, clustered and balanced? In their recent draft RFC about their network architecture, Facebook has indicated a trend in network design and implementation [1].

The trend these days is to move towards IP based networks. Large parties such as Facebook are starting to rely on protocols such as the well established Border Gateway Protocol version 4 (BGP) as an Interior Gateway Protocol instead of OSPF or IS-IS [1, 2]. In their networks each host takes part in the BGP protocol to announce routes into the network. With this new network model there might be an opportunity to use the protocol properties of BGP and its extensions to create inherent redundancy in the network layer.

With web-scale infrastructure, load balancing is becoming an issue, as ever more bandwidth is needed to handle the throughput of these large infrastructures. Typical load balancing devices operate in Active/Passive mode and use technologies such as the Virtual Router Redundancy Protocol (VRRP) to maintain high availability. Furthermore, load balancers share information about sessions to make sure TCP sessions are consistent. As they are mostly located at the edge of a network, they can create bottlenecks and single points of failure.

The technical report by Facebook indicates the wish to remove the load balancer at the edge of the network and do more traffic engineering and balancing on the network itself [1]. Equal Cost Multi-Path routing (ECMP), together with Anycast, makes this possible: all hosts that are part of the service announce their membership of the Anycast group, and ECMP is used to achieve more fine-grained balancing over the network [3].

Anycast has been adopted by the global root Domain Name System (DNS) zones to create a highly available infrastructure which is transparent to the users of the DNS service [4]. It is well known that Anycast works well with stateless communication such as DNS, but that it is more difficult to use with stateful connections such as the Transmission Control Protocol (TCP).

In moving towards this Anycast and ECMP architecture, there is still a problem to be solved: when ECMP routes are recalculated due to a failure or a reconfiguration of paths, a TCP session still needs to reach the right Anycast host to maintain consistency. This research targets this problem and discusses a possible solution.

Distributed Hash Tables (DHTs) have long been used to store information in a distributed way. As web-scale architectures handle vast numbers of TCP sessions, if load balancing is moved from the edge of the network to the hosts, there must be a way for hosts to look up TCP session information when a host receives an invalid TCP session due to a topology change. We propose to use a DHT to store TCP session information and design a proof of concept, which we compare to more traditional load balancing methods.



2 Research Questions

Section 1 has defined the problem and hinted towards a solution. Here we state the direction of the research and scope the project:

How can a DHT be leveraged to maintain TCP session state in the case of a failure in a large BGP network with thousands of hosts [1]?

• What technical requirements are needed to maintain the TCP session in the case of a failure?

• Does using a DHT to look up invalid sessions provide enough performance so that the session can continue?

3 Related Work

In the past there have been a number of efforts to research the feasibility of using Anycast as a method to establish stateful resiliency in the network layer. What follows are a number of contributions about Anycast, load balancing in general, and scaling in large systems. This related work establishes a context and describes a number of existing solutions for scalable networks and load balancing.

3.1 Network Protocols

The TCP protocol is the workhorse of Internet traffic. It is a stateful protocol that ensures a coherent end-to-end bytestream from one application to another [5]. Research has been done to combine Anycast with TCP for load balancing solutions. The prime method that a number of these studies have proposed is an extension to the TCP protocol [6]. Miura et al. propose a modification of the TCP protocol so that it can handle a change of the TCP socket tuple. This would make it possible to use Anycast for the initial setup of the connection, after which the connection would be established on a different, unique Internet Protocol (IP) address of the server. Even though it is well thought out in theory, in practice this has not been implemented. We believe that a solution to the research questions should not necessitate a protocol change, as such a change greatly lowers the chance of it ever being implemented.

Weber et al. evaluated potential methods of setting up Anycast in IP version 6 (IPv6) networks, as Anycast was designed into that protocol [7]. In their survey they also touch upon the problems that exist when using a stateful protocol. Previous papers stated that using the source routing option in the IPv4 header could be an option; however, this is no longer recommended by the Internet Engineering Task Force (IETF), and due to security concerns it has been stripped from the protocol [8].

3.2 Load balancing

Load balancing is typically the domain of distributed systems research. A number of papers describe how it can be used in combination with Anycast. Castro et al. experimented with Anycast and Chord to create a group in which resources were shared [9]. They used it to schedule jobs, manage storage and manage bandwidth.

A comparative study from 2010 also establishes that today's load balancing model, where a small number of instances are responsible for the distribution of load, is not going to be feasible in the near future [10]. The paper mentions a number of different approaches to load balancing in cloud computing where the cloud is self-aware and can balance the load, within the application, between instances.

Research into combining load balancing and DHTs has mainly been done to improve load distribution in a Peer-to-Peer (P2P) network. [11] proposes a method for nodes in a P2P ring to become proximity-aware and balance the load between close neighbors.

3.3 Distributed Systems

This research looks towards the field of distributed systems to help find solutions to the problem of scaling. The area of distributed computing may provide a number of solutions and strategies which could be employed, such as DHT implementations and scaling strategies.

3.3.1 Distributed Hash Tables

Performance of DHT networks has been widely scrutinised. As the scale of a network increases, lookups inherently take longer; however, they always take O(log n) time, where n is the number of nodes in the network [12]. For example, a network of 10,000 nodes requires on the order of log2(10000) ≈ 13 hops per lookup. This achieves predictable scaling; the question, however, is whether it is feasible to incorporate this into the handling of TCP sessions.

Chord is one of the most well known examples of a DHT; however, the globally used BitTorrent network has adopted the Kademlia protocol as its DHT overlay [13]. This Peer-to-Peer overlay network is easier to implement in software than Chord, as it uses the XOR operation when calculating routing tables. Both systems achieve the same search efficiency, and both are a viable choice for our design.
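The XOR metric mentioned above can be sketched in a few lines. This is an illustrative example rather than the paper's implementation: node IDs are shortened to 8 bits for readability (real Kademlia IDs are 160 bits), and the bucket index is simply the position of the highest differing bit.

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia distance between two node IDs is their bitwise XOR."""
    return a ^ b

def bucket_index(a: int, b: int) -> int:
    """Index of the k-bucket for node b as seen from node a:
    the position of the most significant bit in which the IDs differ."""
    return xor_distance(a, b).bit_length() - 1

# Two example 8-bit node IDs (illustrative only).
node_a = 0b10110100
node_b = 0b10010111

print(xor_distance(node_a, node_b))  # 0b00100011 -> 35
print(bucket_index(node_a, node_b))  # highest differing bit is bit 5
```

Because each lookup step moves to a bucket covering at most half the remaining ID space, the number of hops is logarithmic in the number of nodes, matching the O(log n) bound above.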



3.3.2 Scaling

In the report by Facebook and Arista Networks [1], the authors indicate that a key feature of their design is that the network needs to be horizontally scalable. Our solution to the load balancing challenge must meet the same requirement to be viable, otherwise it will not be usable.

To help define how to scale in large systems, we fall back on distributed systems scaling theory by Tanenbaum and van Steen [14]. In their book they describe that scaling can be done in various ways, but that a number of areas can be identified to categorise types of scaling. Prime examples are administrative scaling, geographical scaling and functional scaling.

In conjunction with categories of scaling, they also state that two scaling strategies exist: scaling up and scaling out [14]. Scaling up means buying bigger and better hardware, a strategy advised and employed by vendors of traditional load balancers. Scaling out means buying more identical devices and connecting them together to solve a capacity problem. Our solution needs to be compatible with the latter scenario, as this is the growth strategy that Facebook and Arista propose in their report [1].

4 Traditional load balancing solutions

The design of load balancing solutions is very much a problem of handling scale: what is the right design for the number of hosts and the amount of traffic that you need to handle? Depending on the scale, distributed systems theory comes into play regarding the maintenance of a consistent view of the network and application. Load balancing can be done on many layers of the OSI stack: on the network layer, the transport layer and even the application layer. Looking at solutions on the network and transport layers, system administrators have the following options for their infrastructure.

4.1 Hardware load balancing

Consider the traditional approach to load balancing on layers 3/4, as shown in Figure 1. Here we see the Internet represented by the cloud, an edge router, and two hardware load balancers (in the box with the dotted line) which are in sync with each other. They provide a balancing solution for a number of back end servers that are able to reach both balancers through the network. To both the outside world and the inside world, the two balancing nodes appear to be one single unit.

These boxes share a session table and have some sort of keep-alive mechanism which triggers a failover if the active unit fails. If the appliances work correctly during a failover, the users and the servers will not notice that anything happened. This system is limited by the amount of bandwidth the appliances are able to support and the number of sessions that can be tracked in the session table. The obvious way to scale this is to buy bigger load balancers and increase their capacity: so-called scaling up [14].

These appliances can implement a number of different algorithms that attempt to share the load of incoming requests over the available hosts in the balancing pool. Examples of these algorithms are Round Robin and Weighted Least Connections [15].
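As an illustration (not taken from the paper or from any particular appliance), the two algorithms just mentioned can be sketched as follows; the host names, connection counts and weights are invented:

```python
from itertools import cycle

hosts = ["srv-a", "srv-b", "srv-c"]

# Round Robin: hand out back ends in a fixed rotation.
rr = cycle(hosts)
assignments = [next(rr) for _ in range(5)]
print(assignments)  # ['srv-a', 'srv-b', 'srv-c', 'srv-a', 'srv-b']

# Weighted Least Connections: pick the host with the lowest
# ratio of active connections to its configured weight.
active = {"srv-a": 10, "srv-b": 4, "srv-c": 7}
weights = {"srv-a": 1, "srv-b": 1, "srv-c": 2}

def weighted_least_connections() -> str:
    return min(hosts, key=lambda h: active[h] / weights[h])

print(weighted_least_connections())  # 'srv-c' (7/2 = 3.5 beats 4/1 and 10/1)
```

Round Robin ignores how busy a host is; Weighted Least Connections steers new sessions towards hosts with spare capacity, at the cost of tracking per-host state.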

Figure 1: Load balancing with an active (dark blue)/passive (light blue) setup. This setup scales up to n hosts depending on the capacity of the load balancers.

Next to various load balancing algorithms, the balancers are also capable of techniques such as TLS offloading, compression, proxying, caching and TCP connection marshaling; the latter means that a balancer marshals multiple external connections through one single connection to a backend device.

4.2 Software load balancing

The previous section describes solutions that are available as hardware appliances that can be bought and used. However, there are also many possibilities for installing software load balancers. Prime examples are Linux Virtual Server (LVS, http://www.linuxvirtualserver.org) and HAProxy (http://www.haproxy.org). LVS can be used in a similar way to hardware solutions: this Linux package can be installed in a virtual machine or on a bare-metal server and can function as a load balancer for a cluster.

LVS can be configured in an active/passive mode where the highly available pair shares a virtual IP address and is able to fail over if a keep-alive timer times out. As this description shows, LVS functions as a traditional load balancer, and only works on layers 3 and 4 of the OSI model.

HAProxy is another well known software load balancer on Linux which can work on layers 3 and 4, but is really designed to proxy services that use the HTTP protocol. It is optimized to provide good performance for high-traffic sites and can be tuned towards a specific application. It is often used together with LVS: HAProxy does the application level balancing and LVS takes care of the layer 3 balancing.

Another example is a package called "conntrackd", which synchronises the connection tracking table between highly available hosts. It is often used together with "keepalived"; configured together they can do similar things to LVS [16].
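The "conntrackd + keepalived" pairing can be pictured with a minimal keepalived fragment. This is a hedged sketch, not a tested configuration: the interface name, router ID and virtual address are invented, and conntrackd needs its own configuration alongside it.

```
vrrp_instance VI_1 {
    state MASTER            # the peer node would use state BACKUP
    interface eth0          # invented interface name
    virtual_router_id 51
    priority 100            # the backup node gets a lower priority
    advert_int 1            # VRRP advertisement interval in seconds
    virtual_ipaddress {
        192.0.2.10          # the shared virtual IP that fails over
    }
}
```

When the master stops advertising, the backup claims 192.0.2.10; conntrackd's synchronised connection tracking table is what lets established sessions survive that takeover.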

4.3 Scaling traditional architectures

The above examples operate on the premise that one or more appliances are active at any point in time, and that all appliances in the load balancing cluster share the same view of the network and are transparent towards the application and the user.

By adding technologies such as compression and offloading, they attempt to make the application more resilient and to reduce the amount of traffic on the network so that the appliances can scale better.

In doing so, these appliances and software packages become more and more complex, as they need to handle a wide variety of situations. Furthermore, by letting the load balancers perform such a wide variety of tasks, debugging problems on the network becomes more difficult, and a failure on the balancer may have serious ramifications.

4.3.1 How to handle more connections?

Even though balancing appliances are getting more and more efficient, at some point one needs to scale even further. This can be done by adding more appliances and creating an active/active setup with tens of nodes. These nodes will all be aware of each other's state and back end servers, and will provide a consistent view of the application they balance to the outside world.

By operating on the philosophy that all nodes in the network need to know the complete state and keep it in their own memory, it is easy to see that in large scale networks with thousands of hosts this is no longer possible. It simply costs too many resources to keep track of all this information and to ensure that all nodes have the same view.

Furthermore, a second reason this does not scale is that the network topology shown in Figure 1 requires a lot of investment to upgrade. Let us assume that as your application grows, your balancer is no longer able to handle the traffic and incoming connections. If you are running an active/passive setup as shown in the figure, you will need to buy two new appliances instead of one: in the case of a failure of one of the nodes, the other must be able to take over all of the traffic seamlessly, so simply buying one new device will not be good enough, as the failover device would not be able to handle all of the normal traffic.

Alternatively, if we assume that the setup is active/active, we would need to buy one or more appliances to return to a state which is acceptable for the network. Depending on the architecture and failure domain, a 2-node active/active setup may never operate above 50% capacity: if one node fails, the other must be able to handle all traffic if the network is to survive. Managing scaling in this scenario requires careful planning and forethought.
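The 50% figure generalises: in an n-node active/active cluster that must survive one node failure, each node can safely run at no more than (n-1)/n of its capacity. A quick sketch with invented numbers:

```python
def max_safe_utilization(n: int, failures_tolerated: int = 1) -> float:
    """Fraction of per-node capacity usable if the surviving nodes
    must absorb the load of `failures_tolerated` failed nodes."""
    assert n > failures_tolerated
    return (n - failures_tolerated) / n

print(max_safe_utilization(2))   # 0.5 -> the 50% ceiling for a 2-node pair
print(max_safe_utilization(10))  # 0.9 -> larger clusters waste less headroom
```

This is one reason scaling out with many small nodes is more capacity-efficient than an active/active pair: the reserved failover headroom shrinks as the cluster grows.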

4.4 Handling Failures

Traditional load balancers handle failures differently depending on the layer at which they operate. In general, when a load balancer detects a failure on one of the real servers, it will mark the failed node inactive and no longer forward any traffic to it. Depending on the type of load balancer and how it is configured, TCP sessions will either be seamlessly handed to a new server or need to be re-established.

5 New Network Design

As stated in Section 4, the more traditional solutions to scaling a large network assume that all load balancing nodes are equal and that all state needs to be synchronised between the members of the load balancing group. Due to the aforementioned problems of scalability and complexity, the draft RFC by Facebook and Arista tackles the problem in a new way, where the mantra of their network design is defined as follows:



‘Environments of this scale have a unique set ofnetwork requirements with an emphasis on opera-tional simplicity and network stability [1].’

5.1 Using proven tech - Routing

The solutions that Facebook and Arista propose for the network design form the basis of the scenario and tests of this research. In their design they assume that the network topology is based on layer 3. By doing this they alleviate many of the problems layer 2 has when it is necessary to scale up and out: at some point traditional layer 2 without enhancements will no longer handle the traffic within one single broadcast domain, especially with thousands of hosts. This means that in the new design all nodes on the network take part in the routing protocol of choice, which in this case is EBGP.

They have chosen this design and protocol to enable the network to scale horizontally faster and because BGP is relatively simple to set up and use. OSPF and IS-IS are widely used as Interior Gateway routing protocols but have much more protocol overhead, and are more difficult to configure and debug. Figure 2 shows an example topology as proposed in the RFC design.

Figure 2: The topology on which the experiments were conducted, showing the interconnects and the Anycast group with their unique identifiers.

Using the design in Figure 2, it is easy to see how this would scale horizontally: if you need more ports or bandwidth, the only thing you need to do is add more spines and leaves. The network then scales with exactly the number of ports and the amount of traffic that the newly added device allows, which makes it very predictable and cost effective.

5.1.1 All nodes are Routers

In the past, traditional network designs used a minimum number of routers, because appliances that support these routing protocols are very costly. Recently, however, vendors such as Cumulus Networks have been able to deliver white label boxes to customers. These devices run an operating system like Linux and have specific network drivers optimized for performance [17].

This enables users to install an open source routing daemon like Quagga. The Quagga daemon is a lightweight program that implements various network protocol standards [18]. Practically overnight, it has become possible to run a routing protocol cheaply on all sorts of hardware without paying license fees.

The architecture of Figure 2 has now becomepossible.

5.1.2 BGP unnumbered

To alleviate management problems in such a network, it is necessary to make sure that routers can be added to the BGP topology easily. The BGP protocol depends on the setup of a session with a neighbor [19]; when this is done correctly, routing information can be shared through this session. Normally this requires manual input and management. An extension to the BGP protocol by Cisco allows administrators to set up BGP sessions automatically by making use of certain features of IPv6: Neighbor Discovery (ND) and Router Advertisements (RA) [20].

Simply put, this extension relies on the fact that each interface of a router automatically configures a link-local IPv6 address with StateLess Address AutoConfiguration (SLAAC). The ND protocol ensures that duplicate addresses are not chosen, and by then making use of RAs the routers are able to set up a BGP session and share routes over the IPv6 local link. Figure 3 gives a good overview of how this works.
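As an aside, the modified EUI-64 scheme that SLAAC classically uses to derive a link-local address from an interface's MAC can be sketched as follows. The MAC address is invented, and modern stacks may instead use stable-privacy interface identifiers, so treat this purely as an illustration of the mechanism:

```python
def mac_to_link_local(mac: str) -> str:
    """Derive a modified EUI-64 IPv6 link-local address from a MAC address."""
    octets = [int(b, 16) for b in mac.split(":")]
    octets[0] ^= 0x02                                # flip the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]   # insert ff:fe in the middle
    groups = [f"{(eui64[i] << 8) | eui64[i + 1]:x}" for i in range(0, 8, 2)]
    return "fe80::" + ":".join(groups)

print(mac_to_link_local("00:11:22:33:44:55"))  # fe80::211:22ff:fe33:4455
```

Because the address is a pure function of the interface hardware, two directly connected routers can discover each other and open a BGP session without any address configuration.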

5.1.3 Link state detection

It is well known that IGP protocols such as IS-IS and OSPF are capable of detecting link state changes and rerouting packets very efficiently; BGP, however, is not as good at this in a standard configuration. Normally it sends keep-alive messages to its neighbor every 30 seconds or so, and changes its routes according to the information this generates.



Figure 3: The implementation of BGP unnumbered according to RFC 5549 [20]

In this topology that is not fast enough. However, together with the BGP unnumbered configuration it is possible to make BGP detect link failures by binding it to a specific interface. If the kernel detects a link state change on the interface, BGP updates its routes accordingly and thereby triggers a topology change on the network [21]. This is called setting up interface-scoped BGP sessions.

5.2 Load Balancing

Now that the topology has changed, the new design no longer allows for the traditional load balancing solution on layers 3 and 4: an administrator who chose to deploy one would greatly compromise the future ability to scale out horizontally. The question of sharing load across the servers, however, remains. The RFC by Facebook and Arista states that load balancing will be done at the network level using the principles of Anycast and Equal Cost MultiPath (ECMP) routing.

5.2.1 Anycast

The Anycast principle has been used to great effect in the global DNS root server infrastructure, enabling DNS requests to be routed towards the closest root DNS server according to the length of the path to the IP address. The theory is that if the same IP address is announced to the network from multiple locations, routers will automatically choose the route to that IP address with the least number of hops [19]. In the case of DNS this ensures that requests are serviced by the nodes closest to the requester, and that the load is evenly spread throughout the infrastructure.

The new network design makes use of these characteristics. All servers in the network shown in Figure 2 (the bottom layer) take part in an Anycast group. Each server announces a specific loopback address to the world on which it hosts a service; in this case it is the address 192.168.0.1. However, when one looks closely at the topology, one can see that the number of hops from the Internet (represented by the cloud at the top of Figure 2) to a server is always three. This holds regardless of the route a packet takes in the network, assuming packets always take the shortest route from the Internet to a server.

Facebook's and Arista's use of Anycast represents the membership of a server in a specific group: by announcing an Anycast IP address to the network, a server becomes available in a group of servers that host a specific application.

5.2.2 ECMP

The previous paragraph showed that the locality of a node no longer influences the route a packet takes through the network to a server, as all paths are of equal length. Instead, the path of a packet is determined by making use of ECMP [22]. The name of this protocol extension already suggests that it can be used in a case such as this, as the routes from the Internet to a server flow along multiple paths of equal cost.

In principle, ECMP calculates routes by hashing the source of a packet and then choosing the path it needs to take. All results of the hash are mapped to certain buckets, and if the result of the hash falls into a certain bucket, the router knows what the next hop of the packet will be. Figure 4 shows a schematic representation of how ECMP works: incoming packets are mapped to a bucket and then forwarded to the corresponding next hop.
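The bucket mechanism can be sketched in a few lines. This is an illustrative model, not a router implementation: real routers hash the flow in hardware, and the exact hash function and header fields vary per vendor; the spine names and flow values below are invented.

```python
import hashlib

next_hops = ["spine-1", "spine-2", "spine-3", "spine-4"]  # invented names

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  proto: str = "tcp") -> str:
    """Map a flow's 5-tuple to one of the equal-cost next hops.
    The same flow always hashes to the same bucket, so packets of one
    TCP session follow one path -- until the set of next hops changes."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return next_hops[bucket % len(next_hops)]

flow = ("198.51.100.7", "192.168.0.1", 49152, 80)
print(ecmp_next_hop(*flow))  # stable: the same flow gets the same next hop
```

Removing a spine from `next_hops` changes the modulus, so many flows rehash to a different path; that rehashing event is exactly what strands established TCP sessions on the wrong Anycast server.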

Figure 4: A representation of the hash buckets in the ECMP protocol

ECMP can be manipulated on routers by adding a weight to an outgoing interface. This means that administrators can easily control and balance the load of traffic on the network over all hops and servers. It allows very fine-grained control of the flows and is just as effective as the means of distributing load that traditional solutions use.



5.3 Handling Failures

In this design, when a failure occurs on the network, or when an administrator cycles a router, ECMP recalculates the routes according to the available interfaces towards the destination. Traditional solutions, however, allow for TCP session state synchronisation across the load balancing appliances: this ensures that the integrity of the TCP session remains intact when the connection is passed on to the next node. The client only notices a minimal hiccup, but is still able to access the service.

The new design does not cater for this, because when ECMP recalculates, the TCP session will be forwarded to a different server. That server will not recognize the session and will send a TCP Reset (RST) to the client, as the protocol standard prescribes. In doing so, the client's TCP session is terminated, which, depending on the use case, may be undesirable.

5.3.1 Maintaining TCP session integrity

As the new network solution no longer has a limited amount of nodes over which session states need to be synchronised, it is not realistic to expect that all TCP sessions are synchronised across the platform. Doing so would cause a quadratic growth in network management traffic every time a node is added. Such a network can be described as a fully connected network (complete graph) [23]. This does not scale, for obvious reasons.

As stated in the Research Questions in Section 2, we propose to solve this scaling problem by implementing a Distributed Hash Table solution to store the TCP session identifiers and help maintain the TCP session state.
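The quadratic growth can be made concrete: in a full mesh of n nodes, every node synchronises with every other, giving n(n-1)/2 pairwise links. A small illustration:

```python
def sync_links(n):
    """Number of pairwise synchronisation links in a full mesh of n nodes."""
    return n * (n - 1) // 2

# Every added node must talk to all existing nodes, so the link count
# grows quadratically with the size of the Anycast group.
for n in (10, 100, 1000):
    print(n, sync_links(n))  # 45, 4950, 499500
```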

6 New load balancer design

The new network design requires a new way of maintaining TCP session state across a very large network. Section 5 makes it clear that the old solution of synchronising state between all nodes in the Anycast group is infeasible, because traffic overhead increases quadratically every time a new node is added to the network.

This Section defines what a scalable solution to this problem looks like, states which protocols are involved and finally shows the design of a system that can solve the problem.

6.1 Distributed systems

Scaling has been a much discussed subject in the research community. In the past, M. Hill asserted that saying that something is scalable does not even mean anything [24]; in recent years, however, the consensus is that scalability is a key requirement in the area of Distributed systems and can be split up into a number of different elements [14].

In their RFC, Facebook and Arista talk about wanting to achieve horizontal scalability [1]. This can be seen as a form of functional and load scalability. Of course there are many other aspects of scalability that come into the network design as proposed. However, if one looks purely at achieving the most efficient design, then the functional and load scalability definitions match the closest.

The theory of distributed systems revolves around creating applications that, through a middleware layer, can communicate with each other to allow more nodes and systems to work together for coordination and communication purposes. In this way the middleware helps the infrastructure achieve the goal of the application, by combining the efforts of all nodes without any one of them having to know or do everything [14].

6.1.1 Maintaining state

As we have seen, the challenge that this research tries to solve is how to maintain TCP session state across thousands of servers. We believe that the most efficient way of doing this is by using the principles defined by distributed systems. Looking at the topology, all nodes in the Anycast group are equal participants without a centralised coordinator. This makes the Anycast group an ideal candidate for the Peer-To-Peer (P2P) overlay model. P2P overlays operate on the premise that all nodes in the network are equal participants and have equal resources [25].

6.2 Distributed Hash Table

A discovery that led to a great change in how files and information are shared on the Internet is the introduction of the Distributed Hash Table, Chord [12]. First introduced in 2001, it sets out to create a way in which P2P overlays are able to store and share information on the network. The authors propose an efficient way of searching for a piece of data on a P2P overlay which is made up of a ring of nodes. Simply put, by mapping pieces of information to a key, and making all nodes on the overlay responsible for a part of the key space, they can combine all the resources of the P2P overlay without having to store duplicate information.

DHT implementations come in various shapes and sizes; this research uses the Kademlia protocol to implement its DHT [13], because it has been widely adopted in P2P networks, a notable example being the BitTorrent network [26]. The Kademlia search algorithm enjoys wider adoption as it makes use of


the XOR operation to calculate where the information is located on the network. The maintenance of routing tables and calculation of routes is therefore less costly compared to the Chord implementation [27]. The search algorithm implemented by the Kademlia protocol achieves a predictable search complexity of

O(log n)

If a node does not hold the requested information locally, it can find it in log(n) hops, where n is the amount of nodes in the DHT network.
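The XOR metric can be illustrated with a simplified sketch using tiny IDs (this is an illustration of the idea, not the Kademlia implementation used in the proof of concept): the node responsible for a key is simply the one at minimal XOR distance from it.

```python
def xor_distance(node_id, key_id):
    """Kademlia distance between two identifiers: their bitwise XOR."""
    return node_id ^ key_id

def closest_node(node_ids, key_id):
    """The node responsible for a key is the one at minimal XOR distance."""
    return min(node_ids, key=lambda n: xor_distance(n, key_id))

# With four 4-bit node IDs, the key 0b0110 lands on 0b0111, the ID
# sharing the longest common prefix with it (XOR distance 1).
nodes = [0b0001, 0b0101, 0b0111, 0b1100]
print(closest_node(nodes, 0b0110))  # 7
```

Because each routing step halves the remaining search space, a full lookup visits O(log n) nodes.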

6.2.1 Maintaining state

To facilitate the storage of information in the P2P overlay we propose to use an implementation of a DHT. In this way we do not need to replicate information to all nodes on the DHT, but we do make sure that the information can be looked up in an efficient and predictable way. Ensuring this means we can make assertions about how scalable the new load balancing design is.

6.3 Transmission Control Protocol

The Transmission Control Protocol (TCP) is the de facto standard for providing reliable, end-to-end data transport over the Internet. TCP is implemented directly above layer 3 of the OSI stack and provides an interface for the application to transport a byte stream over a network. The TCP protocol can best be described as a state machine with a role for the client and a role for the server. The states of the machine describe the life cycle of TCP session setup, data transfer and session tear down. By sending sequence numbers and acknowledgements the client and the server ensure a reliable transfer of data from one point to the next [5]. Figure 5 shows the states that a TCP session can have at a certain instant.

6.3.1 Tracking TCP connections

A TCP connection is defined as the relationship between a socket on the client side of the connection and a socket on the server side of the connection. A socket is a tuple which combines an IP address and a port number. The IP address identifies the interface on which a packet arrives and the port number identifies the process to which the information is addressed. Together the client and server sockets form the TCP 4-tuple [29].

TCP connections can be tracked by the Linux Kernel, which provides information about the status of a TCP session between a client and a server [30]. On the server side this can be used to see if the TCP session is valid. In the case of an

[Diagram: the TCP state machine with states CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSING, CLOSE_WAIT, LAST_ACK and TIME_WAIT, and the transitions between them.]

Figure 5: The TCP state machine for the server and client. The server is shown on the left side of the figure and the client on the right side [28].

erroneous packet, the server will send the client a RST message. This means the socket becomes free on the server side and the client will need to re-initiate the connection. The connection tracking system detects the validity of a session by looking at the acknowledged bytes in the TCP header. If the number of acknowledged bytes is greater than zero, the protocol knows that the packet is not the first packet of a session and is therefore invalid.

In our proof of concept we need to act upon such a misplaced session. Instead of directly sending a RST message to the client, we need to look up the session on the DHT to see if it can be associated with a different node in the Anycast group. If so, we can forward the packet to the correct host. Otherwise, we act as the protocol dictates and send the RST message anyway.
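This decision logic can be sketched as follows. Here dht_lookup, forward_to and send_rst are hypothetical callables standing in for the DHT client, the packet forwarder and the protocol's RST response; the real proof of concept wires these to the conntrack and DHT machinery described later.

```python
def handle_misplaced_segment(client_socket, dht_lookup, forward_to, send_rst):
    """Decide what to do with a TCP segment no local session matches.

    client_socket: the client's "ip:port" string, used as the DHT key.
    Returns the owning node's socket if one was found, else None.
    """
    owner = dht_lookup(client_socket)
    if owner is not None:
        forward_to(owner)   # the session lives on another Anycast node
    else:
        send_rst()          # unknown everywhere: fall back to a RST
    return owner
```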

6.3.2 Maintaining State

To maintain session state we propose that every new session to the Anycast group is stored by the server on the DHT. As can be seen in Figure 2, servers announce two loopback addresses: one is the Anycast address and the other is their unique IPv4 address. The information stored on the DHT represents a TCP 4-tuple. The client socket acts as the key and the unique loopback address and port serve as the value. Figure 6 shows how a search on the DHT



{ "20.1.1.1:1234" : "10.0.0.2:80" }

Figure 6: An example, in JSON, of the information that will be stored in the DHT.

would work in the case of an erroneous session. When an unknown connection is received, the TCP client socket information is used as a key to look up the identifier of the correct node.
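The key/value layout of Figure 6 can be built with a trivial helper (a hypothetical illustration of the layout, not the proof-of-concept code itself):

```python
def session_entry(client_ip, client_port, server_ip, server_port):
    """Build the DHT key/value pair for a TCP session.

    The client socket is the key; the server's unique loopback socket
    is the value, matching the JSON example in Figure 6.
    """
    return ("%s:%d" % (client_ip, client_port),
            "%s:%d" % (server_ip, server_port))

print(session_entry("20.1.1.1", 1234, "10.0.0.2", 80))
# ('20.1.1.1:1234', '10.0.0.2:80')
```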

6.4 Combining all elements

Combining the previous information, we propose the following design to solve the problem of maintaining TCP session states across thousands of nodes in an Anycast group. The proof of concept will be made up of the following elements:

• A network topology implemented according to the new design.

• A group of servers taking part in a Kademlia DHT overlay network.

• A group of web servers which listen on an Anycast address and look up and store TCP session information in the provided DHT.

Figure 7, Figure 8 and Figure 9 show the different interactions that happen between the client, the server and the Kademlia network.

Figure 7: The applications running on the server. The DHT listens for messages on the overlay on UDP port 7000, while a web application is hosted on port 80 on the Anycast address.

6.4.1 Summary

The proposed design can in theory achieve horizontal scalability. When implemented well, it is possible to add and delete nodes from the network without having to explicitly inform every other node on the network. Nodes join the web server group by announcing the Anycast address and can join the DHT by knowing one single other node in the DHT.

Figure 8: The interactions between client, web server and Kademlia network when following a normal flow. Upon session establishment the session is asynchronously stored on the DHT network.

As the DHT search is very predictable, administrators can calculate the maximum search times needed to look up invalidated sessions.

7 Method

The new network design as laid out in Section 5 shows how load balancing will be done on a large BGP network. However, a question remains about how to preserve hundreds of thousands of TCP session states across such a large network. This Section lays out a scenario which we test and evaluate to see if the Research questions can be answered. The test scenario and results will be gathered by building a proof of concept.

7.1 Proof of concept

The design of the load balancer as elaborated upon in Section 6 was implemented in the scripting language Python 2.7. It was implemented in such a way that TCP requests were not blocked by look-ups in the DHT, to approach a real world scenario as closely as possible. This means that the web server was running in one thread and the DHT in another thread.



Figure 9: What happens in the event that an erroneous packet is sent to the web server. Following the arrows in an anti-clockwise direction, the server looks up the client socket, forwards the packet to the original server, and the original server services the session. This keeps happening until TCP session termination.

7.1.1 Web server

The web server was implemented using the socket library provided by Python. It serves a simple binary file to any client that connects to it. All new connections to the web server are asynchronously entered into the DHT.
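A minimal sketch of this layout, using only the Python standard library. store_session() and the value "10.0.0.2:80" are hypothetical stand-ins for the DHT insert and the server's unique socket; the real script calls into the Kademlia library here.

```python
import socket
import threading

def store_session(key, value):
    # Placeholder for the asynchronous DHT set operation.
    pass

def handle_client(conn, addr, payload):
    # Record the new session on the DHT in a separate thread so the
    # transfer itself is not blocked by the DHT round trip.
    threading.Thread(target=store_session,
                     args=("%s:%d" % addr, "10.0.0.2:80")).start()
    conn.sendall(payload)
    conn.close()

def serve_once(payload=b"binary-file-contents"):
    """Bind an ephemeral port, serve a single client, return the port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def run():
        conn, addr = srv.accept()
        handle_client(conn, addr, payload)
        srv.close()

    threading.Thread(target=run).start()
    return port
```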

7.1.2 DHT

The Kademlia protocol is implemented using a library available on GitHub. The overlay is set up by starting the network with a single root node and from that point adding nodes by bootstrapping from one of the existing nodes on the Kademlia network. The bootstrapping process makes sure the node finds its correct place in the overlay. All communication between nodes on the overlay is done using UDP.

7.1.3 Sniffer and Forwarder

If the server detects a wrong packet it looks up the correct destination in the DHT overlay, unpacks the IP header, removes the Anycast address, inserts the correct destination IP, recalculates all checksums and sends the packet towards the correct node. The look-up on the DHT overlay is only done once, as subsequent look-ups can be served from a local cache. Figure 10 shows the changes to the IP header needed to reroute the traffic to the right node. The value in red is the value that is stored with the client TCP session key on the DHT.

Figure 10: The changes that are made in the IP header on each host when an incorrect packet arrives.
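The checksum recalculation step can be sketched with the standard RFC 791 ones'-complement header checksum. This is a generic illustration of the computation, not the proof-of-concept code itself:

```python
import struct

def ipv4_checksum(header):
    """Compute the IPv4 header checksum (RFC 791).

    header: raw header bytes with the checksum field zeroed. The result
    is the ones'-complement of the ones'-complement sum of all 16-bit
    words in the header.
    """
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:                      # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# A well-known worked example: this header checksums to 0xb861.
hdr = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(ipv4_checksum(hdr)))  # 0xb861
```

After the forwarder overwrites the destination address, it must zero the checksum field and recompute it this way before putting the packet back on the wire.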

7.2 The Scenario

The scenario of this research assumes there is a network with four servers connected to a number of routers that allow access towards the Internet. The servers in this topology represent machines that host a number of web services on TCP port 80. Together these servers take part in a Distributed Hash Table overlay network and announce the shared Anycast address that hosts the web service with BGP. A visualisation of the topology is shown in Figure 2. Furthermore, we assume that it is configured and set up as described in Section 5, the new network design.

7.2.1 Testing the proof of concept

During testing we assume the role of a visitor of the Anycast site who is downloading a large file. The goal is to ensure that we have an active and valid TCP session during the time of the testing and that we are able to ensure TCP session continuity if the test is successful. Prerequisites for this test are that time is synchronised in the topology and that the DHT overlay and web server have started successfully.

We define the following metrics as relevant. Combined, they tell us what the performance of the DHT is:



• Setting time: the time it takes to set a key-value pair on the DHT overlay. It is measured from the moment a packet is sent on the network to the moment an acknowledgement packet is received.

• Detection time: the time it takes for the network to detect a link failure on the BGP topology, up to the moment the overlay reacts to an ECMP rerouted packet. According to the design this means the overlay reacts to the first ECMP rerouted TCP packet.

• Look up time: the time it takes for a node to look up a key-value pair on the DHT overlay network. It is the time it takes to send a query and receive a result from the network.

To measure these metrics we established the following procedure:

1. From one of the exit nodes of the topology, establish a TCP session to the Anycast address to download the file.

2. Collect the time stamps that are printed before and after storing the TCP session in the DHT. This is the DHT “setting time.”

3. Simulate a link failure on the network by manually shutting down an interface over which the traffic is flowing.

4. Let ECMP recalculate the path of the traffic so it will be forwarded to the next Anycast node.

5. Collect the time stamp in the system log for when the Kernel detects the interface shutdown and compare it with the time printed just before the script looks up an entry in the DHT. This is the total “link state detection time.”

6. Collect the time stamps before and after the session look-up on the DHT; this is the “look up time” of the network.

7. Verify the TCP packet is forwarded correctly and the TCP session is still active.

These steps were repeated a number of times to calculate an average time for the three metrics, so we can analyse whether the results are predictable and consistent.

7.2.2 Comparing to traditional solutions

With the results gathered in the testing phase, we can compare the achieved performance to three traditional load balancers available on the market today. Traditional balancers use health checks and timeout timers to measure the availability of nodes in their group. Depending on the configured timer thresholds, they act in a certain way.

What follows are a number of definitions that help in understanding how traditional load balancers work.

• Health Check: A traditional load balancer can be configured to check the functionality of an application, or simply the response to an ICMP request, to judge if a node is responsive.

• Check interval: A load balancer performs the health check at regular intervals. This can be configured with the “Check interval.”

• Attempts: Upon failure of a check, a balancer can be configured to make a number of further attempts before it takes action.

• Timeout timer: When a check fails, the “timeout timer” defines the maximum amount of time a balancer may wait before it needs to take action.

A combination of the aforementioned settings is used to configure load balancers to detect a failure. This is fundamentally different from the new design, which reacts to a stray TCP session rather than to these types of inputs.

Table 1 shows how these checks can be configured in the case of three vendors of traditional load balancers: the Amazon Web Services software balancer [31], the Kemp Technologies hardware appliance [32] and the f5 Networks hardware appliance [33].

7.3 Hardware and software used

The topology was built using certain hardware and software. This section briefly explains the hardware and software that has been used to conduct the experiments.

7.3.1 Hardware and Operating Systems

The topology was created by making use of Virtual Machines (VMs) with the VirtualBox provider. The topology needed to be capable of running 16 VMs. To be able to run this with some speed we used a hardware node with the following specifications: two Six-Core AMD Opteron(tm) 2435 processors, 64GB of RAM and a RAID 1 setup of two 60GB SAS hard drives. This node was running a fully patched Ubuntu 14.04.4 LTS operating system with VirtualBox version 4.3.36.

The Virtual Machines can be split up into two types: servers and routers. The server machines had the following specifications: one core of the Six-Core AMD Opteron(tm)



Product   Health check    Check Interval   Attempts   Timeout timer
AWS       ICMP/TCP/HTTP   30 (5)           2 (1)      5
Kemp      ICMP/TCP/HTTP   9 (3)            2 (1)      4
f5*       ICMP/TCP/HTTP   5                -          16

Table 1: The health check configuration offered by three major vendors. All values represented are default values. Known minimum values are shown between “( )”.
f5*: The timeout setting should be three times the Interval setting, plus 1 second [33]. f5 does not configure attempts.

Processor 2435, with 400MB of RAM and a logical volume of around 4 GB. The server VMs ran a fully patched version of Ubuntu 14.04.4 LTS.

The router VMs have the same hardware specifications, but run Cumulus Linux 3.0.0.

7.3.2 Software Installed

The hardware node uses Vagrant3 and Ansible4 to spin up the VMs that run the topology. The VagrantFile and packages were provided by Cumulus Linux and are available at https://github.com/packetninja. The server VMs required the following Python packages to run the proof of concept: pydht, pynetfilter_conntrack and ipy. The source of the proof of concept is available at https://github.com/pboers1988/dhtsession. The following Linux package was also installed from the repositories to run the code: libnetfilter-conntrack3.

7.3.3 Settings to run the topology

The Linux Kernel does not track incoming connections with a vanilla install. To enable connection tracking, the easiest solution was to create an iptables rule which instructed Linux to track all incoming connections in the connection tracking table:

iptables -I INPUT -m state --state NEW,RELATED,ESTABLISHED,INVALID,UNTRACKED -j ACCEPT

As some of the logic of the Python script depended on certain states of the TCP session, it was necessary to make it fast enough to intercept the states of the session in the connection tracking table. As the TCP stack in Linux is very efficient, we needed to introduce a delay on the network card to help the script cope with the speed and handle the logic. The delay was

3An interpreter that is able to read definition files and interface with a number of VM managers to automate the provisioning of VMs

4A configuration management tool based on Python

introduced on both links towards the leaf nodes in the topology with the tc tool:

tc qdisc add dev eth1 root netem delay 10ms

Lastly, default TCP behaviour dictates that all connections arriving on an interface that are not recognised will receive a TCP reset (RST) packet back. For our experiment we disabled the TCP RST packets, as these would cause the client to close its socket while we were trying to reroute the traffic to the correct destination. The following iptables rule was used to achieve this goal:

iptables -A OUTPUT -p tcp --tcp-flags RST RST -j DROP

8 Results

In this section we provide the results of testing the proof of concept according to the definitions stated in Section 7: “setting time,” “detection time” and “look up time.” We gathered data over a number of tests. Lastly, we show whether the proof of concept worked as expected.

8.1 Setting time

The setting time is defined as the time it takes to set a key-value pair on the DHT overlay of 4 nodes. It is measured by checking the time difference between the moment setting begins and an okay is returned. Figure 11 shows the results of 15 measurements.

8.2 Detection time

The detection time is defined as the time between the moment a router detects a link failure and the moment a rerouted packet is searched on the DHT. This time encompasses the recalculation of ECMP and may also include a number of retransmissions by the client, as some packets may have been dropped during the link failure. Figure 12 shows the results of 10 measurements.



Figure 11: This plot shows the time in seconds that it takes to set the key-value pair on the DHT (15 attempts).

Figure 12: This plot shows the time in seconds between the failure of a link and its detection (10 attempts).

8.3 Look up time

This is the time needed to look up a key on the network. It is measured from the moment the search starts until the result is received and can be processed. Figure 13 shows the results of 11 measurements.

Figure 13: This plot shows the time in seconds that it takes for a node to look up a key on the DHT (11 attempts).

8.4 Proof of concept

The last step of the proof of concept has not yet been implemented. Our Python script is able to look up sessions and forward them to the correct node; however, the correct node is not able to recognise the packet, for reasons we discuss later in the report (Section 9). Unfortunately this means the TCP session does not continue when the packet gets forwarded. Even though the proof of concept is not complete, this did not impact the gathering of the measurements, as those steps are implemented correctly.

9 Discussion

In this section we discuss the results of the research and point out some areas that need further investigation and work. In particular, some areas of the proof of concept still need work to really see if this solution is viable.

9.1 Proof of concept

The proof of concept was designed and built in a short space of time and at the point of testing was under active development. The design of the concept indicates that the last step of the chain is to reestablish the TCP session at the original node. This should happen when a packet is forwarded from the temporary node, which looked up a stray TCP session in the DHT. As the destination IP is changed during the forwarding phase from the Anycast address to the unique IPv4 address of the original node, the original node does not recognise the packet (Figure 10). In its connection tracking table the server sees a relationship between the Anycast address and the client socket, not between the unique IPv4 address and the client. It therefore treats the packet just as it would treat a regular misplaced TCP session and sends a TCP RST towards the client, completely defeating the purpose of the new design.

It should be relatively trivial to capture these packets and forward them to the local Anycast address, or perhaps to encapsulate the original Anycast packet in TCP or UDP and transport it that way from the temporary node to the original node; however, this has not been implemented and is as yet untested. This is unfortunate, as it leaves any judgement about the success of this proof of concept up for debate.

9.2 TCP Protocol

Even though the complete proof of concept is not finished at the time of writing, most of the elements of the design have been successfully implemented. Furthermore, the results show that it is not necessary to make an intrinsic change to the TCP stack, unlike Miura et al. [6]. This design does not step outside the bounds that the TCP protocol requires, and this research focuses more on how to implement it. Our solution states that before a TCP RST is sent, there needs to be a check to see if the session exists on a different node, by looking it up on the DHT.

We believe that this model is very achievable, because the only change necessary is on the server side and is completely transparent to the client. Only changing the implementation on



the server side is much easier than changing things on both the client and server side. With this design it should be possible to make the solution transparent for the TCP client.

9.3 Timing Results

Comparing the various timing results to more traditional solutions, we see that in this very small scale experiment our solution is much faster. Our solution can achieve the rerouting of TCP packets within two seconds; this is much faster than the theoretical minimum of traditional solutions, which is between 5 and 10 seconds. The results also show that setting, detection and look-up with this system are very predictable and consistent. Where traditional solutions rely on methods such as ICMP to check if a node is still alive, our method reacts to a misplaced packet and forwards it accordingly. This is inherently much faster than waiting until certain arbitrary timers in health checks have failed. Our model assumes that when a packet is rerouted, a failure has occurred and a problem needs to be solved.

9.3.1 Scaling

Previously, nodes could look up results in memory, which is obviously very fast; in our case network delay needs to be taken into account, as we often need to query another node in the DHT overlay. Even though this may sound like a performance cost, our solution is very scalable. The network latency, together with the fixed search cost of log(n) that the DHT provides, means that look-up time can be calculated exactly. These attributes are predictable, which means administrators can make choices about how far they would like to scale.

9.3.2 Calculating look up times

From the measurements gathered we can define the successful look-up time in relation to the amount of nodes in the network to make it a predictable model. Let n be the amount of nodes in the network; then the look-up time T depends on the average look-up time of every node L and the amount of nodes that need to be visited during a search on the DHT, log(n):

T(n) = L · log(n)

L in this case contains the average network latency and search time of one hop. T provides a means to estimate the look-up time of a key on the DHT overlay as a network increases in size.
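This model can be turned into a small estimator. Assuming a base-2 logarithm (as Kademlia halves the search space per hop) and taking L ≈ 0.1 s from the look-up measurements in Section 8; both choices are assumptions of this sketch:

```python
import math

def lookup_time(n, per_hop=0.1):
    """Estimated look-up time T(n) = L * log2(n) for an n-node overlay.

    per_hop (L) bundles average network latency and per-node search time.
    """
    return per_hop * math.log2(n)

# Growing the overlay from 4 nodes to a million only multiplies the
# estimated look-up time tenfold (0.2 s -> 2.0 s).
for n in (4, 1024, 1048576):
    print(n, round(lookup_time(n), 2))
```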

9.3.3 Implement as a native application

As mentioned in Section 7, to make the application work we needed to induce artificial delay on the link. Furthermore, the pydht library was only evaluated in a rudimentary way. We believe that a native binary implementation of the proof of concept needs to be made to ensure the accuracy of the timing results. The measurements in this research are more of an indication to see if the network performs in a consistent manner.

9.4 Robustness

The robustness of this solution is still very much debatable. Our test case was to see if one TCP stream could be rerouted. In the case of a network outage, or when multiple links fail simultaneously, there will be many more TCP sessions that need to be looked up within a very small time frame. It remains to be seen if a solution such as this is suitable for HTTP traffic. Previous research has shown that the longevity of an HTTP TCP session decreases logarithmically depending on the bandwidth between client and server [34]. An HTTP session could already be complete by the time a session is stored on the DHT.

It is well known and confirmed by research that TCP performance is reduced significantly under packet loss [35]. It could be easier to simply set up a new session instead of trying to save the existing one.

9.5 Open issues

Quite obviously, this solution still has to be tested on a larger scale with many TCP sessions, and it remains to be seen if it will achieve what is necessary to recover those TCP sessions. The conclusions of this research are based on observations at a very small scale in a virtual environment, whilst trying to recover one TCP session.

Secondly, the last step in the chain, the reestablishing of the TCP session between the original node and the client, has not been tested. This is of course the ultimate goal of the design and still needs to be implemented.

Lastly, the integrity of the DHT needs to be tested: this research has not looked at how DHTs react to constant setting and removal of data.

10 Conclusion

This research has designed a new solution to TCP session integrity management in large scale BGP networks. In building the proof of concept, we have shown that a DHT could be used to store TCP session information and provide a means for



a large cluster to spread out all the session information across a large network. Secondly, on a small scale, the presented solution outperforms traditional solutions. It does not wait for timeout timers, like traditional solutions, but reacts to TCP packets that are rerouted by ECMP and are not recognised by the new host on which they arrive.

Our design does not require any change in the TCP protocol, but could be made to work by changing some implementation logic on the server side. The client side of the TCP session can connect and interact with the server in a normal way. This makes eventual implementation much more achievable, as it is transparent to the client.

Finally, the measured results indicate that in the case of one TCP session our solution performs in a very predictable manner. Even though we believe that the proof of concept is a relatively naive implementation which could benefit from a number of improvements, the look-up results show that it is predictable and consistent. Along with the fact that DHT search performance scales efficiently, we think this model could be a viable solution for maintaining TCP session integrity on large BGP networks.

11 Future Work

To continue effort in this area of research we identify the following opportunities:

• Finishing the proof of concept by making sure the TCP session is actually restored.

• Testing the proof of concept on a larger scale, with more TCP sessions and more servers in the Anycast group.

• Measuring realistic web traffic to see if the DHT is robust enough to handle typical bursty HTTP traffic.

• Converting the Python script to a native binary or driver and measuring the performance.

Acknowledgements

I would like to thank Attila de Groot for providing the initial idea and the infrastructure to conduct this research, and Rama Darbha and Pete Lumbis of Cumulus Networks for their feedback and patience during the course of the project.

References

[1] P. Lapukhov, A. Premji and J. Mitchell. Use of BGP for routing in large-scale data centers. Tech. rep., IETF, 2016. url: https://datatracker.ietf.org/doc/draft-ietf-rtgwg-bgp-routing-large-dc/.

[2] Office of the Secretary-General. Chapter Two: Understanding Telecommunication Network Trends. 2011. url: https://www.itu.int/osg/spu/ip/chapter_two.html (visited on 30/05/2016).

[3] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. Tech. rep., IETF, 2000. url: https://tools.ietf.org/html/rfc2992.

[4] Wikipedia. Anycast — Wikipedia, The Free Encyclopedia. [Online; accessed 1-June-2016]. 2016. url: https://en.wikipedia.org/wiki/Anycast.

[5] Wikipedia. Transmission Control Protocol — Wikipedia, The Free Encyclopedia. [Online; accessed 1-June-2016]. 2016. url: https://en.wikipedia.org/wiki/Transmission_Control_Protocol.

[6] Hirokazu Miura et al. ‘Server load balancing with network support: Active anycast’. In: Active Networks. Springer, 2000, pp. 371–384.

[7] Scott Weber and Liang Cheng. ‘A survey of anycast in IPv6 networks’. In: Communications Magazine, IEEE 42.1 (2004), pp. 127–132.

[8] Valter Popeskic. Source-based routing in IPv4 and IPv6 networks. 2015. url: http://howdoesinternetwork.com/2014/source-based-routing.

[9] Miguel Castro et al. ‘Scalable application-level anycast for highly dynamic groups’. In: Group Communications and Charges. Technology and Business Models. Springer, 2003, pp. 47–57.

[10] Martin Randles, David Lamb and A. Taleb-Bendiab. ‘A comparative study into distributed load balancing algorithms for cloud computing’. In: Advanced Information Networking and Applications Workshops (WAINA), 2010 IEEE 24th International Conference on. IEEE, 2010, pp. 551–556.

[11] Yingwu Zhu and Yiming Hu. ‘Efficient, proximity-aware load balancing for DHT-based P2P systems’. In: Parallel and Distributed Systems, IEEE Transactions on 16.4 (2005), pp. 349–361.


[12] Ion Stoica et al. ‘Chord: A scalable peer-to-peer lookup service for internet applications’. In: ACM SIGCOMM Computer Communication Review 31.4 (2001), pp. 149–160.

[13] Petar Maymounkov and David Mazieres. ‘Kademlia: A peer-to-peer information system based on the XOR metric’. In: International Workshop on Peer-to-Peer Systems. Springer, 2002, pp. 53–65.

[14] Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Prentice-Hall, 2015.

[15] Wikipedia. Load balancing (computing) — Wikipedia, The Free Encyclopedia. [Online; accessed 1-June-2016]. 2016. url: https://en.wikipedia.org/wiki/Load_balancing_(computing).

[16] Pablo Neira Ayuso. Conntrackd. 2016. url: http://conntrack-tools.netfilter.org/conntrackd.html (visited on 28/06/2016).

[17] Cumulus Networks. Routing on the Host: An Introduction. 2016. url: https://support.cumulusnetworks.com/hc/en-us/articles/216805858-Routing-on-the-Host-An-Introduction (visited on 28/06/2016).

[18] Paul Jakma. Quagga Routing Suite. 2016. url: http://www.nongnu.org/quagga/.

[19] Wikipedia. Border Gateway Protocol — Wikipedia, The Free Encyclopedia. [Online; accessed 1-June-2016]. 2016. url: https://en.wikipedia.org/wiki/Border_Gateway_Protocol.

[20] F. Le Faucheur and E. Rosen. Advertising IPv4 Network Layer Reachability Information with an IPv6 Next Hop. Tech. rep., IETF, 2009. url: https://tools.ietf.org/html/rfc5549.

[21] Cumulus Networks. Managing Unnumbered Interfaces. 2016. url: https://docs.cumulusnetworks.com/display/DOCS/Border+Gateway+Protocol+-+BGP#BorderGatewayProtocol-BGP-ConfiguringBGPUnnumberedInterfaces.

[22] Wikipedia. ECMP — Wikipedia, The Free Encyclopedia. [Online; accessed 1-June-2016]. 2016. url: https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing.

[23] Wikipedia. Network Topology — Wikipedia, The Free Encyclopedia. [Online; accessed 1-June-2016]. 2016. url: https://en.wikipedia.org/wiki/Network_topology.

[24] Mark D. Hill. ‘What is scalability?’ In: ACM SIGARCH Computer Architecture News 18.4 (1990), pp. 18–21.

[25] Wikipedia. Peer-To-Peer — Wikipedia, The Free Encyclopedia. [Online; accessed 1-June-2016]. 2016. url: https://en.wikipedia.org/wiki/Peer-to-peer.

[26] Wikipedia. Kademlia — Wikipedia, The Free Encyclopedia. [Online; accessed 1-June-2016]. 2016. url: https://en.wikipedia.org/wiki/Kademlia.

[27] Jakob Jenkov. Peer Routing Table. 2014. url: http://tutorials.jenkov.com/p2p/peer-routing-table.html.

[28] Ivan Griffin. TCP state machine. 2016. url: http://www.texample.net/tikz/examples/tcp-state-machine/.

[29] Wikipedia. Network Socket — Wikipedia, The Free Encyclopedia. [Online; accessed 1-June-2016]. 2016. url: https://en.wikipedia.org/wiki/Network_socket.

[30] P. Ayuso. ‘Netfilter’s connection tracking system’. In: LOGIN: The USENIX Magazine 31.3 (2006).

[31] Amazon. Elastic Load Balancing - Configure Health Checks. 2016. url: http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/elb-healthchecks.html (visited on 28/06/2016).

[32] Kemp Technologies. Frequently Asked Questions. 2016. url: https://support.kemptechnologies.com/hc/en-us/articles/203863135-Web-User-Interface-WUI-#_Toc450210533 (visited on 28/06/2016).

[33] f5 solutions. Manual Chapter: Configuring Monitors. 2016. url: https://support.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/ltm_configuration_guide_10_0_0/ltm_appendixa_monitor_types.html#1172375 (visited on 28/06/2016).

[34] Joachim Charzinski. ‘HTTP/TCP connection and flow characteristics’. In: Performance Evaluation 42.2 (2000), pp. 149–162.

[35] Anurag Kumar. ‘Comparative performance analysis of versions of TCP in a local network with a lossy link’. In: IEEE/ACM Transactions on Networking (ToN) 6.4 (1998), pp. 485–498.
