
UNIVERSITY OF CATANIA

MASTER’S THESIS

A Distributed Algorithm for Stateless Load Balancing

Author: Andrea TINO

Supervisor: Prof. Eng. Orazio TOMARCHIO

Assistant Supervisor: Eng. Antonino BLANCATO

A thesis submitted in fulfillment of the requirements for the degree of Master of Engineering

in the

Faculty of Computer Science Engineering
Department of Electrical, Electronic and Computer Science Engineering

July 21, 2017

http://unict.it




http://www.dieei.unict.it



Declaration of Authorship

I, Andrea TINO, declare that this thesis titled, “A Distributed Algorithm for Stateless Load Balancing” and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:


“I like thinking that this work of mine kind of reflects my international personality. The idea of this algorithm crossed my mind while I was working in Japan (summer 2012) on non-deterministic mathematical models to describe fast similarity search algorithms. It’s incredible how some ideas come to life so spontaneously! Then, I started working on and developing the foundations of the algorithm in Italy. After I started as an employee at Microsoft, I kept on working on this project in Denmark. Even on vacation, I found time to work on this thesis while roaming in several areas of South Korea. It is also worth mentioning that I worked on some chapters while I was in The Netherlands. When I think about this, I feel happy!”

Andrea Tino


University of Catania

Abstract

Faculty of Computer Science Engineering

Department of Electrical, Electronic and Computer Science Engineering

Master of Engineering

A Distributed Algorithm for Stateless Load Balancing

by Andrea TINO

The algorithm presented in this thesis addresses the problem of balancing data units across different stations when storing large amounts of information in data stores or data centres. The approaches in use today are mainly based on a central balancing node, which often requires information from the different stations about their load state.

The algorithm proposed here follows the opposite strategy: data is balanced without the use of any centralized balancing unit, hence the distributed property, and without gathering any information from stations about their current load state, hence the stateless property.

This document goes through the details of the algorithm by describing the idea and the mathematical principles behind it. By means of an analytical proof, the balancing equation is derived and introduced. Later on, tests and simulations, carried out in different environments and with different technologies, illustrate the effectiveness of the approach. Results are introduced and discussed in the second part of this document, together with final notes about the current state of the art, challenges and deployment considerations in real scenarios.

(IT) The algorithm that is the subject of this thesis deals with the problem of balancing data units within a pool of different stations, in the context of persisting large amounts of information in server farms or data centres. The strategies currently in use are mainly based on a central balancing component which often needs some information from the network nodes about their current load state.

The algorithm proposed here takes a diametrically opposite approach: data balancing is performed without any centralized component, hence the distributed property, and without the need to obtain any data from the stations about their load state, hence the stateless property.

In this document, we examine the details of the algorithm through a description of the underlying idea and of the mathematical principles at its base. By means of an analytical proof, the balancing equation is derived and analysed. We then examine the tests and simulations, both carried out with different technologies, which support the effectiveness of the algorithm. The results are examined and discussed in the second part of this document, together with final notes on the current state of the technology in the field of data balancing. The issues and the usage scenarios of the algorithm are examined as well.




Contents

Declaration of Authorship

Abstract

1 Introduction
  1.1 About Balancing
      Objective
  1.2 Describing the scenario
  1.3 Characterization of balancing algorithms
    1.3.1 Randomness
    1.3.2 State
    1.3.3 Static vs. dynamic
    1.3.4 Centralization
    1.3.5 DU retrieval
  1.4 Well known balancing algorithms
    1.4.1 Round Robin
    1.4.2 Weighted Round Robin
    1.4.3 Random
    1.4.4 Source Address Hash
    1.4.5 Least Load
    1.4.6 Graph based algorithms
      Nearest Neighbour
      RAND
      Never Queue
      THRESHOLD
  1.5 System overview

2 The algorithm
  2.1 Network organization
      Station ordering
      Ring access
      Routing
  2.2 Unbalanced ring
  2.3 Balancing the ring
    2.3.1 Extending the ring
      Adapting concepts in extended ring
      Defining sizing equations
    2.3.2 Designing hash function φ
      Designing r.v. s_φ’s PDF
      Designing r.v. s_φ
    2.3.3 Understanding how φ works
  2.4 Ring balancing example
    2.4.1 Defining the ring
    2.4.2 Defining the formatting impulse
    2.4.3 Binding impulses to stations
    2.4.4 Calculating amplitudes
    2.4.5 Computing functions

3 Simulation results
  3.1 Small-size simulations
    3.1.1 Verifying load balance
    3.1.2 Evaluating load levels per station
  3.2 Large-size simulations
    3.2.1 Overview
    3.2.2 Evaluating the variance of hash segment amplitudes
    3.2.3 Evaluating load levels per station
      Migrations flows

4 System API
  4.1 Storing data
    4.1.1 Packet fragmentation
    4.1.2 Routing
  4.2 Retrieving data

5 Dynamic conditions
  5.1 Scalability
    5.1.1 Updating φ
      Broadcasting in DHT
    5.1.2 Load re-arrangement
    5.1.3 Scaling overall impact
    5.1.4 Ring scale-down
  5.2 Fault conditions
    5.2.1 Collisions threshold

6 Conclusions and final notes
  6.1 Open issues
  6.2 What’s next

A C/C++ simulation engine’s architecture

Acknowledgements

Bibliography


List of Abbreviations

BS    Balancing System
DSLB  Distributed Stateless Load Balancing
PA    Proposed Algorithm
SS    Storage System

BA    Balancing Algorithm
BP    Balancing Pool
DLB   Data Load Balancing
DLBA  Data Load Balancing Algorithm

DU    Data Unit
LD    Load Distribution
SL    Station Load

CDF   Cumulative Distribution Function
PDF   Probability Density Function

DHT   Distributed Hash Table
HS    Hash Segment
ID    IDentifier
LS    Leaf Set
LLS   Lower Leaf Set
P2P   Peer To Peer
ULS   Upper Leaf Set

API   Application Program Interface
r.v.  random variable


List of Symbols

$N$  Number of stations
$\Omega$  Balancing pool (set)
$P$  Data Units (packets) (set)
$s_i$  Station
$p$  Data Unit (packet)

$\psi$  Packet/station assignment application
$\Sigma$  Load distribution
$l$  Hash length (number of bits)
$\xi$  Hash function
$h$  Hash string (number)
$h_\xi$  Regular hash string (number)
$h_\varphi$  φ-hash string (number)
$\eta$  Station packet load (number)
$\eta^{(\xi)}$  Station packet load (via ξ) (number)
$\eta^{(\varphi)}$  Station packet load (via φ) (number)
$\pi_i$  Packet in station probability (probability)
$\pi_i^{(\xi)}$  Packet in station probability (via ξ) (probability)
$\pi_i^{(\varphi)}$  Packet in station probability (via φ) (probability)

$f$  PDF (function)
$F$  CDF (function)
$F^{-1}$  CDF inverse (function)
$g$  Formatting impulse (function)
$G$  Formatting impulse antiderivative (function)

$\Lambda$  Leaf set
$\Lambda_U$  Upper leaf set
$\Lambda_L$  Lower leaf set


dedicated to my Mother and my Father


Chapter 1

Introduction

1.1 About Balancing

The umbrella term balancing can refer to different problems and solutions: balancing of connections, of workloads, of tasks or of data. What sets one type of balancing apart from another is what is being balanced.

Definition 1 (Balancing). In Computing and Computer Science, balancing indicates the problem of distributing an indefinitely high number of entities across multiple subjects (stations). The selection of a station is performed so as to guarantee that, at any given time, all stations have (roughly) the same amount of entities.

From which follows:

Definition 2 (Balancing Algorithm). An algorithm, or a system based on a certain algorithm, designed to solve the balancing problem.

In the context of this research work, we are going to focus on a specific type of balancing:

Definition 3 (Data Load Balancing). A type of balancing focusing on units of data, often referred to as packets or, simply, data units.

The solutions to the latter are referred to as:

Definition 4 (Data Load Balancing Algorithm). A BA targeting DLB in order to solve it by minimizing a certain objective function.

This research effort focuses on DLB and DLBAs in order to introduce a new algorithm targeting multiple performance metrics.

Objective

The objective of this thesis is to employ this algorithm in a cloud storage system designed to serve different applications. The system must be capable of:

1. Accepting data as input which will be stored in a pool of servers (stations).

2. Retrieving stored data on demand.

3. Removing stored data on demand.

These essential functionalities must be enabled by the BA and its design.


1.2 Describing the scenario

Let us describe DLB and its most important aspects in formal terms.

The network. A certain number $N$ of stations is always considered; together they form a balancing pool, a set which we will indicate as $\Omega$. Each station $s_i \in \Omega$ (with $i = 1 \ldots N$) is connected to the others by a generic protocol. We do not consider any specific communication technology: the only required assumption is that the protocol employs direct addressing of each station (one unique address per station).

Station assignment. At any given time, a DU (or packet) $p \in P$ ($P$ being the set of all DUs) must be stored in the BP. The system/algorithm responsible for carrying out this activity is expressed by application $\psi : P \mapsto \Omega$, which assigns a DU to a station.

The way $\psi$ works is essentially the core of the balancing system. The application will choose a station with the objective of guaranteeing that, at any given time, all stations have roughly the same amount of packets stored in them.

Remark. Application $\psi$ typically receives a DU as input, $\psi(p)$; however, it may actually accept more arguments, $\psi(p, \cdot)$, depending on the strategy it uses to perform the balancing.

Loads. At any given time, each station $s_i$ in the BP will have a certain number of DUs assigned. We indicate this quantity as the station load, with operator $|s_i|$, thus:

$$|s_i| = \left\| \{ p \in P : \psi(p) = s_i \} \right\| \in \mathbb{N} \tag{1.1}$$

This quantity can sometimes be expressed as the number of bytes (or any of its multiples) of the total packets stored in a station:

$$|s_i| = \sum_{\forall p \in P : \psi(p) = s_i} |p| \in \mathbb{R} \tag{1.2}$$

where $|p|$ indicates the length of a packet (typically in bytes). Unless otherwise specified, we will refer to the former definition.

To have an overview of the balancing state, another quantity is introduced: the load distribution, indicated with symbol $\Sigma = (|s_1|, |s_2|, \ldots, |s_N|)$, representing the ordered vector of station loads at any given time.
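As a minimal sketch of the quantities just defined, the following Python fragment computes the station load of equation (1.1) and the load distribution $\Sigma$ for a toy pool. The assignment `psi` used here (packet ID modulo $N$) and the station names are assumptions made only for illustration; they are not the assignment application proposed in this thesis.

```python
from collections import Counter

def load_distribution(packets, psi, stations):
    """Return the ordered vector of station loads (|s_1|, ..., |s_N|).

    Each load counts the packets p for which psi(p) is that station,
    as in equation (1.1).
    """
    counts = Counter(psi(p) for p in packets)
    return tuple(counts.get(s, 0) for s in stations)

# Hypothetical pool of N = 3 stations and a toy assignment application.
stations = ["s1", "s2", "s3"]

def psi(p):
    # Illustrative assumption: assign by packet ID modulo N.
    return stations[p % len(stations)]

sigma = load_distribution(range(10), psi, stations)
print(sigma)  # (4, 3, 3)
```

With 10 packets over 3 stations the loads differ by at most one unit, which is what "roughly the same amount of packets" means in this setting.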

Time. The algorithm runtime does not require a continuous description. Therefore, time will be considered discrete, $t \in \mathbb{N}$, and characterized by events, an event being, for instance, the arrival of a new DU to route in the network.

1.3 Characterization of balancing algorithms

DLBAs can be differentiated based on several points of view. Considering this classification is important in order to locate the proposed algorithm inside the taxonomy of today’s most used systems.

1.3.1 Randomness

Algorithms can employ non-deterministic components, like pseudo-random number generators, in order to pick the station to associate with a DU. This approach is reasonable because, provided the generator is characterized by a uniform PDF, it guarantees a fairly good level of balancing at low cost.

Proposition 1 (Determinism). The algorithm being proposed is fully deterministic.

1.3.2 State

In order to perform balancing, algorithms may require stations to keep information regarding their load state (e.g. disk usage or residual available space). Statefulness implies that stations communicate the state information to other stations or to special nodes in the network; by employing this knowledge, the BA can perform a more precise job. The downside is mostly related to communication overhead, as state information must be regularly exchanged.

Proposition 2 (Statefulness). The algorithm being proposed is stateless: it does not require stations to send any information in order to perform balancing.

1.3.3 Static vs. dynamic

The balancing can happen at two possible points in time:

• At runtime: the association to a station is performed while the DU is being transmitted to stations. In these conditions, the same DU might be routed to different stations depending on the contingent situation. It is also possible for DUs to be re-routed. This behaviour makes the algorithm dynamic. As a rule of thumb, a BA is dynamic when it is not possible to know in advance where a DU will be routed until the algorithm is actually run.

• Before runtime: the destination station is known before the algorithm is run. This makes the algorithm static. As a rule of thumb, a BA is static when the same DU is always routed to the same station.

Proposition 3 (Staticity). The algorithm being proposed is static.

1.3.4 Centralization

This property determines whether the algorithm requires a central node in the network used to perform the balancing. A centralized BA requires a unit, called balancer, which takes care of routing the DU to its destination station. Conversely, a distributed BA will not need this extra component. Centralized BAs are easier to implement, but they have 2 major downsides:

• All traffic must pass through the balancer which acts like a hub node.

• The balancer represents a single point of failure in the network. If it goes down, the whole network is compromised. Safety mechanisms can be employed in order to avoid network downtime by limiting the outage to the balancing feature only: if the balancer fails, the traffic will still be routed to stations but no balancing will occur.

This property also impacts the topology of the network. Typically, centralized algorithms employ a star topology where the balancer is the central node.

Proposition 4 (Non-centralization). The algorithm being proposed is distributed.


1.3.5 DU retrieval

A very important characteristic of BAs is the way a DU can be retrieved once it has been stored in a station. This does not strictly relate to the BA itself, as DU retrieval is more an aspect of the storage algorithm (which employs the BA). However, the 2 systems are connected and will be treated as one.

An essential part of the data retrieval story is DU and station identification. Since a DU is assigned to a station, the association performed by $s_i = \psi(p)$ must be identifiable. We have already introduced station identifiers, so we need to do the same with DUs and introduce, for a packet $p$, its identifier, indicated as $p \in \mathbb{N}$ (the couple $(p, p)$ is unique).

A key concept to understand is that the station association application $\psi$ works both with packets, $\psi : P \mapsto \Omega$, and with packet IDs, $\psi : \mathbb{N} \mapsto \Omega$.

• If the algorithm is dynamic, then $\psi$ is not a bijective application and it will not always return the same station when invoked. In such a case, the association $(p, s_i)$ must be saved somewhere. This condition requires the algorithm to employ a database functioning as a lookup table, which of course takes resources and impacts memory.

• If the BA is static, then $\psi$ is bijective and it will always be possible to retrieve the station where a packet $p$ was routed by simply calculating $\psi(p)$. It means that it is not necessary to store the coordinates of a packet in order to retrieve it.

When a DU is sent for storage, its ID is used by the owner as a key to retrieve it at a later time.
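The difference between the two cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `psi_static` hashes the packet ID with SHA-256 and reduces it modulo the pool size, and `DynamicBalancer` picks stations at random; both names and both strategies are hypothetical and are not the algorithm proposed in this thesis.

```python
import hashlib
import random

STATIONS = ["s1", "s2", "s3", "s4"]  # hypothetical balancing pool

def psi_static(packet_id: int) -> str:
    """Static assignment: the station is a pure function of the packet ID."""
    digest = hashlib.sha256(str(packet_id).encode()).digest()
    return STATIONS[int.from_bytes(digest, "big") % len(STATIONS)]

class DynamicBalancer:
    """Dynamic assignment: the choice cannot be recomputed, so it is saved."""

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)
        self.lookup = {}  # packet ID -> station (the lookup table)

    def store(self, packet_id: int) -> str:
        station = self._rng.choice(STATIONS)
        self.lookup[packet_id] = station
        return station

    def locate(self, packet_id: int) -> str:
        return self.lookup[packet_id]

# Static case: retrieval only needs the packet ID.
assert psi_static(42) == psi_static(42)

# Dynamic case: retrieval needs the stored association.
balancer = DynamicBalancer()
where = balancer.store(7)
assert balancer.locate(7) == where
```

The lookup table of the dynamic variant grows by one entry per stored packet, which is the memory cost mentioned above; the static variant keeps no per-packet state.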

1.4 Well known balancing algorithms

The proposed algorithm competes with other algorithms available on the market today and commonly used in many different application domains. We are going to describe some of these in order to compare, later, how the proposed approach ranks among them.

1.4.1 Round Robin

This class of algorithms keeps a counter $c = 1 \ldots N$ which points to the destination station $s_c$ where the current DU will be routed. The counter is incremented at every new incoming packet: $c_{t+1} = (c_t + 1) \bmod N$. These BAs guarantee a very precise balancing, as fairness in packet association is their strongest point.

Such algorithms are deterministic, stateless, static and require a packet lookup table. They are typically centralized and the balancer keeps track of the counter. However, this condition is not a limitation, as it is still possible to use Round Robin in a distributed way, though such an approach is not common in the market today.
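The counter rule can be sketched as follows. The class name `RoundRobin` and the station names are hypothetical; the sketch also assumes a zero-based counter ($0 \ldots N-1$) so that the modulo update can be applied directly as an index.

```python
class RoundRobin:
    """Centralized round-robin balancer (illustrative sketch)."""

    def __init__(self, stations):
        self.stations = list(stations)
        self.counter = 0  # assumption: zero-based counter

    def assign(self, packet):
        station = self.stations[self.counter]
        # Counter update: c_{t+1} = (c_t + 1) mod N
        self.counter = (self.counter + 1) % len(self.stations)
        return station

rr = RoundRobin(["s1", "s2", "s3"])
order = [rr.assign(p) for p in range(7)]
print(order)  # ['s1', 's2', 's3', 's1', 's2', 's3', 's1']
```

Note that the destination depends only on the arrival order and not on the packet itself, which is why the association has to be recorded in order to retrieve a packet later.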

1.4.2 Weighted Round Robin

Like Round Robin, these algorithms guarantee fairness by looping through stations. However, the counter is used in a different way: every station $s_i$ is not assigned one number, but a range of contiguous numbers $c'_i \ldots c''_i$. The counter will range from 1 to $C = \max_i \{c'_i, c''_i\}$ and be incremented according to the rule $c_{t+1} = (c_t + 1) \bmod C$.

The wider the interval $c''_i - c'_i$ for one station $s_i$, the more packets that station will receive. Although this seems to break the balancing principle, this approach makes it possible to take into account stations that do not have the same storage capabilities. A possible application is assigning more DUs to stations having a larger storage capacity.

These algorithms are typically centralized, they are deterministic and still require a packet lookup table. Given their nature, they can be either static/stateless or dynamic/stateful; the latter implementation is valid when the characteristics of stations can change in time, the state being typically related to storage capabilities.
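A possible sketch of the weighted variant is given below. It assumes a zero-based counter over $C$ values, where $C$ is the sum of the range widths, and it describes each station by the width of its contiguous range; the class name `WeightedRoundRobin` and the weights are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

class WeightedRoundRobin:
    """Weighted round robin over contiguous counter ranges (sketch)."""

    def __init__(self, weights):
        # weights: station -> width of its contiguous counter range (> 0)
        self.stations = list(weights)
        self.upper = []  # exclusive upper bound of each station's range
        total = 0
        for station in self.stations:
            total += weights[station]
            self.upper.append(total)
        self.total = total  # C, the number of counter values
        self.counter = 0    # assumption: zero-based counter

    def assign(self, packet):
        # Find the station whose range contains the current counter value.
        index = bisect_right(self.upper, self.counter)
        self.counter = (self.counter + 1) % self.total
        return self.stations[index]

wrr = WeightedRoundRobin({"s1": 1, "s2": 3})
counts = Counter(wrr.assign(p) for p in range(8))
print(counts["s1"], counts["s2"])  # 2 6
```

In this example station `s2` has a range three times wider than `s1` and therefore receives three times as many packets.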

1.4.3 Random

Random algorithms use a random number generator employing a discrete random variable distributed in range $1 \ldots N$ to choose the destination station $s_i$. Such algorithms are non-deterministic, dynamic, stateless and always require a packet lookup table. They can be either centralized or distributed, though the former are the most common in the market.

One important aspect of these BAs is the requirement on the probability distribution of the random variable employed by the number generator. The distribution must be uniform in order to have proper balancing.
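The effect of a uniform distribution can be observed with a short simulation. The sketch below relies on Python's `random.Random.choice`, which selects uniformly among the given stations; the function name `random_assign`, the seed and the pool size are illustrative assumptions.

```python
import random
from collections import Counter

def random_assign(packets, stations, rng):
    """Assign each packet to a uniformly chosen station.

    The returned dictionary plays the role of the packet lookup table.
    """
    return {p: rng.choice(stations) for p in packets}

stations = ["s1", "s2", "s3"]
lookup = random_assign(range(30000), stations, random.Random(0))
loads = Counter(lookup.values())

# Each station is expected to hold about 30000 / 3 = 10000 packets.
for station in stations:
    print(station, loads[station])
```

With a uniform choice the loads concentrate around the mean; a skewed distribution would shift the loads accordingly, which is why uniformity is a requirement.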

1.4.4 Source Address Hash

Commonly used in TCP/IP applications, these classes of centralized algorithms employ a balancing unit which assigns a hash range $h'_i \ldots h''_i$ to each station $s_i$. Hashes are evaluated numerically, so ranges are basically contiguous sequences of integer numbers $h'_i, h''_i \in \mathbb{N}$. When a packet arrives, the balancer computes a hash $h \in \mathbb{N}$ on the source address (the hashing function can either be one of the well known cryptographic ones or some other ad-hoc implementation) and routes the DU to the station whose hash range includes the calculated hash.

Source Address Hash algorithms guarantee that packets coming from the same source are routed to the same station. These algorithms are deterministic, static, stateless and require a lookup table. The balancing relies on the hash function: a cryptographic hash is necessary to guarantee that packets whose source addresses differ by only a few bits do not all end up being stored in the same station. Given their implementation, such algorithms can offer a fairly good balancing.
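The range-based routing described above can be sketched as follows. The sketch assumes a 16-bit hash obtained by truncating SHA-256 and equally sized ranges; the helper names `source_hash`, `build_ranges` and `route`, as well as the addresses, are hypothetical.

```python
import hashlib

def source_hash(address: str, bits: int = 16) -> int:
    """Numeric hash of the source address, truncated to `bits` bits."""
    digest = hashlib.sha256(address.encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

def build_ranges(stations, bits: int = 16):
    """Split [0, 2^bits) into contiguous ranges (h'_i, h''_i), one per station."""
    size = 1 << bits
    n = len(stations)
    return [
        (station, i * size // n, (i + 1) * size // n - 1)
        for i, station in enumerate(stations)
    ]

def route(address: str, ranges, bits: int = 16) -> str:
    """Return the station whose hash range includes the address hash."""
    h = source_hash(address, bits)
    for station, low, high in ranges:
        if low <= h <= high:
            return station
    raise ValueError("hash outside every range")

ranges = build_ranges(["s1", "s2", "s3"])
# Packets from the same source always reach the same station.
assert route("10.0.0.1", ranges) == route("10.0.0.1", ranges)
```

Unequal ranges could be used in the same way to favour stations with a larger capacity; the equal split is only the simplest choice for this illustration.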

1.4.5 Least Load

These algorithms usually rely on a centralized balancer; however, it is still possible to perform the balancing in a distributed way (though this implementation is not common in the market). When a DU arrives, it is routed to the station with the lowest current load. This means that the balancer needs to know the load of each station, which is the reason why these algorithms are stateful and typically introduce a considerable overhead in the network. Least Load algorithms are deterministic, dynamic and require a lookup table.
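A minimal sketch of a centralized Least Load balancer is shown below. For simplicity it assumes that the balancer itself tracks the loads, measured as number of packets, instead of receiving them from stations; the class name `LeastLoad` is hypothetical.

```python
class LeastLoad:
    """Centralized least-load balancer (illustrative sketch)."""

    def __init__(self, stations):
        self.load = {s: 0 for s in stations}  # known state of each station
        self.lookup = {}                      # packet ID -> station

    def assign(self, packet_id):
        # Route to the station with the lowest current load.
        station = min(self.load, key=self.load.get)
        self.load[station] += 1
        self.lookup[packet_id] = station
        return station

ll = LeastLoad(["s1", "s2", "s3"])
for packet_id in range(7):
    ll.assign(packet_id)
print(ll.load)  # {'s1': 3, 's2': 2, 's3': 2}
```

In a real deployment the `load` dictionary would have to be refreshed through messages from the stations, which is the source of the overhead mentioned above.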


1.4.6 Graph based algorithms

There is a class of (typically) stateful, distributed and dynamic algorithms which calculate the destination station for a DU based on a graph search algorithm on the network. These algorithms are usually highly scalable; moreover, no assumption is made on the topology of the network (free topology).

Nearest Neighbour

A random node $s_i$ is picked in the network in order to initiate the transmission of a DU. The balancing is performed only in the context of the subnet represented by the chosen node and its direct neighbours $s_j$ (with $j \neq i$). The algorithm picks the station which minimizes a specific metric, usually the load state of each node.

RAND

This algorithm is non-deterministic and randomly selects a node (station) where to route a packet $p$. A threshold $L \in \mathbb{R}$ is considered for a certain metric (usually the load state): if the packet exceeds the threshold, $|p| > L$, then the DU is re-routed to another randomly selected station; otherwise it is stored in the current one.

The most prominent characteristic of this algorithm is its statelessness. Since RAND does not require any information from stations, the implementation is very easy, minimal overhead is generated (due only to threshold-exceeding packets which need re-transmission) and the balancing is fairly good.

CYCLIC. This algorithm is a variant of RAND in which minimal state information is kept: the last station a packet was re-transmitted to is always remembered by the system. This guarantees that the same station is not picked twice in a row in case of consecutive threshold exceed occurrences.
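RAND and its CYCLIC variant can be sketched with a single function. The sketch makes several assumptions that are not stated in the text: the compared metric is the current load of the visited station, the threshold test is evaluated locally by that station, and a maximum number of hops (`max_hops`) bounds the re-routing. The function name `rand_route` is hypothetical.

```python
import random

def rand_route(loads, threshold, rng, max_hops=100, cyclic=False):
    """Route one packet with RAND (or CYCLIC when cyclic=True).

    loads: station -> current load, updated in place.
    """
    stations = list(loads)
    last = None  # CYCLIC only: last station the packet was re-routed from
    station = None
    for _ in range(max_hops):
        candidates = [s for s in stations if not (cyclic and s == last)]
        station = rng.choice(candidates)
        if loads[station] < threshold:
            break  # threshold not exceeded: store here
        last = station  # threshold exceeded: re-route
    loads[station] += 1
    return station

loads = {"s1": 5, "s2": 0}
chosen = rand_route(loads, threshold=3, rng=random.Random(1), cyclic=True)
print(chosen)  # s2
```

In this two-station example the CYCLIC rule guarantees that, if the overloaded station `s1` is visited first, the next hop goes to `s2`.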

Never Queue

There are algorithms which use state information exchanged among stations in order to evaluate data not strictly relating to nodes. Typically the state is represented by station-specific quantities, like the current load or the residual amount of storage memory left. Never Queue employs a different state across the network, in order to be able to evaluate, the moment a DU arrives and needs to be stored, the station to which transmitting the packet implies the least cost. Thus, the algorithm always transmits the packet to the fastest available node.

This balancing strategy (which requires a lookup table) is poor, as it does not guarantee a good quality of the overall balancing across stations. What it does guarantee, and this is the reason why we mention this approach, is good performance in processing packets in terms of throughput and latency. However, this has a cost in overhead due to the amount of information exchanged by nodes to refresh the network state.

THRESHOLD

This class of highly dynamic algorithms uses network state information, derived from message exchange among stations, to decide where a packet will be stored. An incoming packet $p$ is initially routed to a random node (this makes the algorithm non-deterministic); that station then compares the packet size with a load threshold $L \in \mathbb{R}$ and decides what to do. If the threshold is not exceeded, $|p| < L$, then the DU is stored in the current station; otherwise another station is picked via a polling mechanism. A maximum number of attempts $M \in \mathbb{N}$ is considered, after which a packet will stay in the current station even though the threshold is exceeded.

[Figure: User, Storage sys., Balancing sys. and Srv pool, connected by the Store, Retrieve and Balance operations]

FIGURE 1.1: Overall system architecture. The end user interacts only with the storage system, while the balancing system is hidden from the user and transparent to the storage system with regard to accessing the server pool.

Even though this approach looks a lot like RAND, it differs in the way a station is selected. Only the initial node selection is random; when the threshold is exceeded, the next station is picked with a process based on the analysis of the network state.

LEAST. This algorithm works like THRESHOLD but is limited to one single iteration. When a packet arrives at a randomly selected node and the threshold is exceeded, the algorithm will poll a certain pool of stations to pick the next one in its first attempt to route the packet. The station is usually picked based on its current load (least loaded node). After that, no further attempt is performed. Think of LEAST as THRESHOLD where $M = 1$.
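THRESHOLD and LEAST can be sketched with one function whose `max_attempts` parameter plays the role of $M$. As assumptions for illustration only, the compared metric is the current load of the visited station, the polling mechanism samples a few other stations at random (`poll_size`) and selects the least loaded one. The function name `threshold_route` is hypothetical.

```python
import random

def threshold_route(loads, limit, max_attempts, rng, poll_size=2):
    """Route one packet with THRESHOLD; max_attempts = 1 gives LEAST.

    loads: station -> current load, updated in place.
    """
    stations = list(loads)
    current = rng.choice(stations)  # only the first selection is random
    attempts = 0
    while loads[current] >= limit and attempts < max_attempts:
        others = [s for s in stations if s != current]
        if not others:
            break  # single-station pool: nothing to poll
        polled = rng.sample(others, min(poll_size, len(others)))
        current = min(polled, key=loads.get)  # least loaded polled station
        attempts += 1
    # After M attempts the packet stays where it is, even above the limit.
    loads[current] += 1
    return current

loads = {"a": 9, "b": 9, "c": 0}
chosen = threshold_route(loads, limit=5, max_attempts=1, rng=random.Random(0))
print(chosen)  # c
```

In this example every path leads to station `c`: either it is selected first, or the overloaded first station polls the two others and picks the least loaded one.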

1.5 System overview

Before detailing how the PA works, it is important to understand the architecture of the storage system being designed as part of the research in this thesis. From a point of view based on the API that the overall system exposes, we recognize 2 major components:

• Storage system: the component interacting with the end user and providing the API for storing and retrieving data.

• Balancing system: the component responsible for arranging the storage pool and balancing data across its servers.

As pointed out by Figure 1.1, the architecture separates concerns by defining 2 sets of APIs: one exposed to the user, for submitting data and retrieving it, and another one, hidden from the user, which is responsible for balancing DUs in the storage pool.


Chapter 2

The algorithm

In chapter 1, we have anticipated the most important properties that the PA shows. What makes this algorithm innovative is the fact that it is at the same time distributed, static, stateless and does not require a lookup table. In this chapter we are going to describe the mathematics behind the algorithm which makes all this possible.

2.1 Network organization

The PA does not allow stations to be arranged in a free connection scheme. A strong assumption is made on the topology of the network, according to the DHT¹ scheme. It is important to make a few considerations about the protocol used by stations to communicate with each other: no assumption is made about the networking technology, as any protocol can be employed in the network (several protocols can actually be used, as long as they are compatible with each other), except for one requirement:

Proposition 5 (Networking protocol). Stations are free to employ any arbitrary communication protocol, as long as this guarantees direct addressing and that a message can be delivered from / to every pair of stations.

The real-case scenario considered here is the Internet and the TCP/IP protocol. Given proposition 5, one station is in principle capable of communicating with every other one in the network; in practice this rarely happens because a node’s address must be known beforehand, which is the reason why direct addressing is essential. Networks can be organized according to DHT specifications thanks to a limited knowledge of other nodes’ addresses.

The direct consequence of proposition 5 is that a separation is made between stations’ physical and logical connection schemes. Physically, stations can be arranged without any constraint; however, the limited knowledge of other nodes in the graph allows the system to generate an overlay network, which is the one being considered here.

A station is supposed to have a very limited knowledge of other stations, thus keeping in memory only a few of them (considered as neighbours). According to DHT specifications, a ring topology is employed; it derives from nodes holding a neighbour set of only 2 nodes, a predecessor and a successor, as shown in figure 2.1.

¹ Distributed Hash Tables are employed in distributed networks. This network protocol was adopted by 4 major P2P systems: Chord, Pastry, CAN and Tapestry.


[Figure: logical ring connecting Station 1 through Station 8]

FIGURE 2.1: An N = 8 network example showing the logical ring topology. Each station is assigned an ID (typically the IP address hash) and packets are routed by content.

Station ordering

A natural order occurs among stations. When the ring is formed, every node si computes an identifier Id(si) ∈ ℕ given by the hash of the station’s address. This identifier is used to build the ring, as every station needs to locate its predecessor and successor in the network:

Lemma 1 (Ring construction). As long as every station has a minimal initial set of connections which guarantees that all nodes form a connected graph, the ring can be built by having every station reshape its neighbourhood with one successor and one predecessor.

After this initialization phase, the ring is on-line and ready to accept packets. This aspect is very important as it allows us to treat the set of stations Ω as ordered, and we can define:

Definition 5 (Station preceding operator). Given a pair of stations (si, sj) ∈ Ω² (i ≠ j and i, j ≤ N), operator ≺: Ω × Ω ↦ {true, false} defines the preceding relation between them. The operator works as follows:

si ≺ sj ⇐⇒ Id(si) < Id(sj) (2.1)

The first important result is the following:

Theorem 2 (Ring complete ordering). The set of stations Ω with precedence operator ≺: Ω × Ω ↦ {true, false} is a completely (totally) ordered set.

Proof. Immediate by considering that operator ≺ on Ω, by its definition, maps directly onto operator < on ℕ, which is a totally ordered set.
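As a minimal sketch of how stations could derive their IDs and ring order, the snippet below hashes each station address and sorts by the result. The choice of SHA-256, the truncation to 32 bits and the example addresses are assumptions made for illustration only; the text merely requires Id(si) to be the hash of the station's address.

```python
import hashlib


def station_id(address: str, bits: int = 32) -> int:
    """Hypothetical Id(s): SHA-256 of the address, truncated to `bits` bits."""
    digest = hashlib.sha256(address.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def precedes(a: str, b: str) -> bool:
    """Preceding operator of definition 5: a precedes b iff Id(a) < Id(b)."""
    return station_id(a) < station_id(b)


def build_ring(addresses):
    """Return the addresses sorted by ID, i.e. in logical ring order."""
    return sorted(addresses, key=station_id)


# Example addresses are placeholders, not taken from the thesis.
addresses = [f"10.0.0.{n}" for n in range(1, 9)]
ring = build_ring(addresses)
```

Each station's successor and predecessor are then simply its right and left neighbours in `ring`, with the last station wrapping around to the first.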

Ring access

The ring is the place where data is stored, and the purpose of the PA is to help the storage system balance all DUs across stations. The first detail we focus on is how the ring is accessed when the SS needs to send a packet to be stored or retrieved. The ring has to be kept safe both from external nodes and from the very nodes that are part of the network. The way to guarantee the latter is through the following:

FIGURE 2.2: Access to the ring is guarded by proxies.

Proposition 6 (Limited knowledge principle). To guarantee safety and scalability, every node in the system has a partial knowledge of the overall network.

In order to protect the ring from external activity, direct access to the network must be forbidden:

Proposition 7 (Zero knowledge principle). To guarantee safety from external intrusions, no node, except for those in the ring, knows the address of any station in the system.

This principle, though valid, cannot be adopted as-is because we would otherwise end up with an isolated network. However, in order to fulfil the security features promoted by proposition 7, it is possible to build a guarding system around the ring which hides it from the external world. A collection of proxy stations is employed for this purpose. As shown in figure 2.2, those stations are exposed to the external world and act as intermediaries to the ring, whose addresses are kept private (in order to fulfil proposition 6, proxy stations know the addresses of only a few stations in the ring).

Routing

The SS we are designing is distributed. The basic idea, according to the DHT specifications, is that, in the context of a transmission, a packet enters the ring from an arbitrary station called the entry point. From there, every station, which has the BA deployed, knows whether that packet should be stored there or routed to a different station.

Every station keeps a limited knowledge of the network. This knowledge is represented by the set of neighbour nodes a station keeps. Given the topology, we define a parameter called leaf radius, r ∈ ℕ (r < N), which represents the number of successor (or predecessor) nodes every station holds in its neighbourhood.

Definition 6 (Leaf Set). Every station si ∈ Ω keeps track of its neighbour nodes (plus itself) in an ordered set called leaf set: Λ(si) ⊂ Ω. The leaf set’s cardinality is always ‖Λ(si)‖ = 2r + 1, where r is the leaf radius of the ring. The following equation holds:

Λ(si) = {sj ∈ Ω : ai,j = 1} ∪ {si} (2.2)

where A = [ai,j] ∈ ℕ^(N×N) is the adjacency matrix of the network.


Note how a station’s leaf set contains the station itself. Also, the leaf set is always an ordered set as it is, by definition, a subset of Ω, which we proved to be ordered in theorem 2. Unless otherwise specified, we will always consider r = 1.

Lastly, it is sometimes convenient to picture a station’s leaf set extensively as the ordered vector of neighbour stations (including si):

$$\Lambda(s_i) = \left( s'_i, \ldots, s_{i-1}, s_i, s_{i+1}, \ldots, s''_i \right)$$

The notation above helps us identify nodes s′i and s′′i as the extreme nodes in every station’s leaf set. Those nodes will play an important role when defining routing function ψ later on.

Definition 7 (Upper Leaf Set). Let si ∈ Ω be a station and Λ(si) ⊂ Ω its leaf set. The upper leaf set ΛU(si) ⊂ Λ(si) is defined as the set of all neighbours which the station precedes:

ΛU(si) = {sj ∈ Λ(si) : si ≺ sj} (2.3)

This set always has cardinality ‖ΛU(si)‖ = r.

Definition 8 (Lower Leaf Set). Let si ∈ Ω be a station and Λ(si) ⊂ Ω its leaf set. The lower leaf set ΛL(si) ⊂ Λ(si) is defined as the set of all neighbours which the station is preceded by:

ΛL(si) = {sj ∈ Λ(si) : sj ≺ si} (2.4)

This set always has cardinality ‖ΛL(si)‖ = r.
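The leaf set and its upper and lower halves can be sketched as follows. The function name `leaf_sets` and the list-based ring are hypothetical. Neighbours are taken by ring position, wrapping around at the ends of the list; reading the preceding relation circularly at the ring's seam is an assumption of this sketch, made so that both halves always contain r stations.

```python
def leaf_sets(ring, i, r=1):
    """Return (lower, full, upper) leaf sets of the station at index i.

    `ring` is the list of station IDs sorted in ascending order.
    """
    n = len(ring)
    if n < 2 * r + 1:
        raise ValueError("the ring needs at least 2r + 1 stations")
    # r predecessors, farthest first, so that the result stays ordered.
    lower = [ring[(i - k) % n] for k in range(r, 0, -1)]
    # r successors, nearest first.
    upper = [ring[(i + k) % n] for k in range(1, r + 1)]
    return lower, lower + [ring[i]] + upper, upper


ring = [10, 20, 30, 40, 50]
lower, full, upper = leaf_sets(ring, 2, r=1)
```

The first and last elements of `full` play the role of the extreme nodes s′i and s′′i.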

When a station receives a packet, it performs certain operations in order to understand whether that packet is to be stored there or elsewhere. In the latter case, the station picks one of the stations in its leaf set and routes the packet there. The next station repeats the same sequence of operations until the packet is stored in a node. This algorithm is represented by function ψ : P ↦ Ω. To perform it, every station in the ring uses the same hash function:

Definition 9 (Hash function). Let P be the set of packets and H ⊆ ℕ; we define ξ : P ↦ H as a hash function used to calculate the station to which a packet is routed in the ring.

Not all hash functions can be used to route packets in the ring:

Proposition 8 (Cryptographic hash function). Hash function ξ : P ↦ H is a cryptographic hash function. By definition, ξ behaves in such a way that a one-bit change in the input packet changes, on average, about 50% of the bits of the output hash.

It is possible to consider many cryptographic hash functions among those currently employed in modern systems. Among the most common today are the following families: SHA² and MD³.
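The bit-diffusion behaviour described in proposition 8 can be observed empirically. The sketch below flips each input bit in turn and measures which fraction of the SHA-256 output bits changes; the payload and the helper name `bit_diff_fraction` are placeholders chosen for illustration.

```python
import hashlib


def bit_diff_fraction(data: bytes, bit: int) -> float:
    """Fraction of SHA-256 output bits that change when input bit `bit` is flipped."""
    flipped = bytearray(data)
    flipped[bit // 8] ^= 1 << (bit % 8)
    a = int.from_bytes(hashlib.sha256(data).digest(), "big")
    b = int.from_bytes(hashlib.sha256(bytes(flipped)).digest(), "big")
    return bin(a ^ b).count("1") / 256


packet = b"example packet payload"
fractions = [bit_diff_fraction(packet, bit) for bit in range(len(packet) * 8)]
mean_fraction = sum(fractions) / len(fractions)
```

For a well-behaved cryptographic hash, `mean_fraction` is expected to sit close to 0.5, while individual trials scatter around that value.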

As anticipated earlier, stations self-organize in a logical overlay ring by assigning IDs. A station’s ID Id(si) is computed by applying the same hash function ξ to the station’s address. For formal consistency, we intend hash function ξ to also work on stations, ξ : Ω ↦ H, which is perfectly valid as hash functions do not really care about the type of data fed as input as long as it is a bitstream. The expression hi = Id(si) = ξ(si) is to be intended as hash function ξ calculated on station si’s address.

² Secure Hash Algorithm. SHA-1 (160 bits), SHA-256 (256 bits), SHA-512 (512 bits).
³ Message Digest. MD2, MD4 and MD5 (128 bits) are today considered unsafe; MD6 (up to 512 bits).

As soon as the ring is initialized and ready to work, the topology defines the ordering of stations, s1 ≺ s2 ≺ · · · ≺ si ≺ · · · ≺ sN−1 ≺ sN, following the order of IDs (hashes): h1 < h2 < · · · < hi < · · · < hN−1 < hN. From here the DHT assigns a hash segment to each station:

Definition 10 (Hash Segment). Let si ∈ Ω be a station in the ring and hi = Id(si) = ξ(si) its ID. Station si’s hash segment Ξ(si) is defined as the set of contiguous hashes ranging from hi up to hi+1 (excluded):

$$\Xi(s_i) = \begin{cases} \{ h \in H : h_i \le h < h_{i+1} \} & \text{if } i \ne N \\ \{ h \in H : h_i \le h \le h_M \} \cup \{ h \in H : 0 \le h < h_1 \} & \text{if } i = N \end{cases} \tag{2.5}$$

Where hM ∈ H is the highest value that hash function ξ can produce: ξ(·) ∈ [0, hM].
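With global knowledge of all station IDs, the owner of a hash under definition 10 can be found with a binary search. The sketch below is only a reference oracle under that assumption (real stations know just their leaf set); `owner_index` and the example IDs are hypothetical.

```python
from bisect import bisect_right


def owner_index(station_ids, h):
    """Index i of the station whose hash segment contains hash h.

    `station_ids` must be sorted in ascending order.
    """
    # bisect_right counts the IDs that are <= h, so subtracting 1 gives the
    # last station whose ID does not exceed h. When h is below the first ID
    # the result is -1, which the modulo maps to the last station: this is
    # the wrap-around part of the last station's segment.
    return (bisect_right(station_ids, h) - 1) % len(station_ids)


ids = [100, 400, 900]  # example IDs in a 10-bit hash space [0, 1023]
```

For instance, hash 399 belongs to the first station, hash 400 to the second, and both 950 and 50 belong to the last one.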

Routing function ψ employs hash function ξ in order to compute the destination station for a packet. The algorithm is deployed on every station and always behaves the same way:

Algorithm 1 Routing a packet in the ring

Require: Ring initialized
Require: Station si has ID hi = ξ(si)
Require: Station si has associated hash segment Ξ(si)
Require: Station si has associated leaf set Λ(si)

 1: function ψ(p ∈ P)
 2:     hp ← ξ(p)
 3:     Λ ← ∅
 4:     if hp ≥ hi then
 5:         Λ ← ΛU(si) \ {s′′i} ∪ {si}
 6:     else
 7:         Λ ← ΛL(si)
 8:     end if
 9:     for s ← Λ do
10:         if h′ ≤ hp ≤ h′′ then        ▷ Since Ξ(s) = [h′, h′′] ⊂ H
11:             return s
12:         end if
13:     end for
14:     if hp ≥ hi then                  ▷ Having that Λ(si) = (s′i, . . . , si, . . . , s′′i)
15:         return s′′i                  ▷ The packet either belongs to s′′i or further
16:     else
17:         return s′i                   ▷ The packet belongs to a station preceding s′i
18:     end if
19: end function
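A runnable sketch of the routing step and of its repeated application is given below. Stations are identified by their index in the sorted list of IDs and the packet hash hp is assumed to be already computed. One deliberate deviation from algorithm 1 should be noted: the sketch tests up front whether the current station owns the hash, so that the wrap-around part of the last station's segment (hashes below h1) is recognised there. This check is an assumption of the sketch rather than a step of the algorithm as written; the names `in_segment`, `psi` and `route` are likewise hypothetical.

```python
def in_segment(ids, j, h):
    """True if hash h falls in the hash segment of station j (definition 10)."""
    if j < len(ids) - 1:
        return ids[j] <= h < ids[j + 1]
    # Last station: its segment wraps around the end of the hash space.
    return h >= ids[j] or h < ids[0]


def psi(ids, i, h, r=1):
    """One application of the routing function at station i; returns a station index."""
    n = len(ids)
    # Addition of this sketch: stop as soon as the current station owns the hash.
    if in_segment(ids, i, h):
        return i
    if h >= ids[i]:
        # Upper leaf set without its extreme node (the station itself was checked above).
        candidates = [(i + k) % n for k in range(1, r)]
        extreme = (i + r) % n
    else:
        # Lower leaf set, farthest predecessor first.
        candidates = [(i - k) % n for k in range(r, 0, -1)]
        extreme = (i - r) % n
    for s in candidates:
        if in_segment(ids, s, h):
            return s
    return extreme


def route(ids, entry, h, r=1):
    """Apply psi repeatedly from the entry station; return the list of visited stations."""
    path = [entry]
    for _ in range(len(ids)):
        nxt = psi(ids, path[-1], h, r)
        if nxt == path[-1]:
            return path
        path.append(nxt)
    raise RuntimeError("routing did not converge")


ids = [100, 400, 900]
path = route(ids, 0, 950)  # visits stations 0, 1 and 2 in turn
```

The last element of the returned path is the station where the packet is stored; the other elements are the intermediate hops.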

We can now provide a better formal description of a ring by introducing its definition:

Definition 11 (Ring). Let Ω be the set of stations, r ∈ ℕ be the leaf radius, ξ : · ↦ H ⊆ ℕ the hash function used by each station and ψ : P ↦ Ω the routing function, based on ξ, used to assign packets to stations. Then we define R = (Ω, r, ξ, ψ) as a fully qualified ring overlay across N = ‖Ω‖ stations si ∈ Ω, where packets p ∈ P are routed and delivered to each station via routing function ψ employing hash function ξ.

Given algorithm 1, we have the following result:

Lemma 3. Let R = (Ω, r, ξ, ψ) be a ring, then the following holds:

ψ(p) = si ⇐⇒ ξ(p) ∈ Ξ(si), ∀p ∈ P

Proof. Immediate by considering the first exit point of algorithm 1.

Regarding ψ, we want to describe a few more important aspects:

Definition 12 (Routing). Function ψ is repeatedly executed for a certain number of iterations from the moment packet p enters the ring until it finds its destination station. The routing is over when ψ returns the same station where it is evaluated.

It is now evident that a single application of function ψ does not necessarily route the packet to the correct station. It is necessary to perform a certain number of iterations and apply ψ to the same packet in different stations. This scheme generates a recursive condition which we want to make more evident. Let us denote with ψk ∈ Ω the station returned by the k-th application of ψ; the recursive definition is completed by setting the initial condition:

$$\begin{cases} \psi_{k+1} = \psi(\psi_k) \\ \psi_1 = s_i \end{cases} \tag{2.6}$$

As for every recursive function, we ask ourselves whether the recursive definition in equation 2.6 converges to a value. As per definition 12, we expect function ψ to assume a value at a certain iteration b ∈ ℕ and to keep it in every future iteration b + k, k ∈ ℕ. For this reason, it is imperative that the cyclic application of ψ does not lead to an infinite sequence of iterations, which would make the recursive definition generate an alternating sequence.

Theorem 4 (Routing is always successful). Let R = (Ω, r, ξ, ψ) be a ring and p ∈ P a packet entering it from station si ∈ Ω. Let b ∈ ℕ be the number of different applications of routing function ψ, across the different stations of the ring, before p finally reaches its destination. Then b is always bounded: b ≤ B ∈ ℕ.

Proof. We proceed by contradiction, thus assuming ∃p ∈ P : b → ∞. By analyzing algorithm 1, in order to have an infinite number of iterations, we need to make sure that function ψ, when evaluated on station si ∈ Ω, never returns the current station: ψ(p) ≠ si. For such a condition to hold, the following must occur:

∃p ∈ P : hp = ξ(p) ∉ Ξ(si), ∀si ∈ Ω

which translates into:

∃p ∈ P : hp = ξ(p) ∉ [hi, h_i^M], ∀si ∈ Ω

Since we do not know whether si is the last station in the ring, we use h_i^M to indicate the final hash in station si’s hash segment.

However, since Ξ(si) = [hi, h_i^M] in the equations above depends only on si, and since those equations hold for all stations, we can consider the totality of the hash segments:

$$\bigcup_{s \in \Omega} \Xi(s) = \bigcup_{k=1}^{N} \left[ h_k, h_k^M \right] = [0, h_M]$$


So we can re-write the previous equations as:

∃p ∈ P : hp = ξ(p) ∉ [0, hM]

which is contradictory, as hash function ξ is, by definition, bounded in range, ξ(·) ∈ [0, hM], and since hp is calculated via hash function ξ, it must fall in that range.

Theorem 4 proves that the recursive definition introduced before converges:

Corollary 4.1 (Recursive application of ψ converges). Let R = (Ω, r, ξ, ψ) be a ring and p ∈ P a packet entering it from station si ∈ Ω. Then the recursive term ψk during the routing of the packet converges to station sj ∈ Ω after b ∈ ℕ iterations:

$$\lim_{k \to \infty} \psi_k = \psi_b = s_j$$

In order to avoid confusion between the final computed destination station and the intermediate hopping stations calculated by the several iterations of the routing function, from now on we will indicate with expression si = ψ(p) the station where packet p is stored at the end of the routing process. That is, we consider ψ as returning ψb = si (last iteration) unless otherwise specified.

At this point, we have completed describing and formalizing the storage system.

2.2 Unbalanced ring

We now move forward by analyzing what problems this structure presents in terms of balancing. Even though only the storage system has been covered so far, it is important to point out that, as it is now, the architecture already enables a primitive form of balancing. Packets are, in fact, distributed across different stations, and modern P2P networks are entirely based on this scheme. What kind of balancing do we end up with?

Since routing algorithm ψ employs hash function ξ, the balancing state of the ring depends entirely on ξ. Under static conditions (the ring does not change), the destination of packet p is determined the moment hp = ξ(p) is computed. Routing function ψ is executed many times because the knowledge that each station has of the ring is limited, but that number of iterations does not affect the destination station being computed.

The question we want to answer is: “Given hash h ∈ H, what is the probability that hash function ξ, computed on a random input packet p ∈ P, returns h, that is Pr{ξ(p) = h}?”. Since ξ is a cryptographic hash function, it has an interesting property: the probability distribution of the generated hashes is approximately uniform, which means that:

$$\forall h \in H, \forall p \in P, \quad \Pr\{\xi(p) = h\} = \frac{1}{h_M + 1} \tag{2.7}$$

since ξ : · ↦ [0, hM]. If we indicate with l ∈ ℕ the number of bits of the hash (hash length), l = ⌈log2 hM⌉, then ξ : · ↦ [0, 2^l − 1] and we can write:

$$\forall h \in H, \forall p \in P, \quad \Pr\{\xi(p) = h\} = 2^{-l} \tag{2.8}$$

What we want to focus on is knowing, in the long run, how many packets each station gets with this configuration. From this knowledge, we will then analytically calculate the information we need about the balancing.

The problem of Packets in Stations is very close (though not entirely equivalent) to another well-known one which we will take into consideration: Balls in Bins⁴. We will see that the scenario of throwing a ball in an area full of bins and assessing in which bin the ball falls is equivalent to producing a random packet, calculating its hash and checking into which station it is going to be routed.

Our analysis starts with identifying the probability for a packet to be routed into one station of the ring:

Theorem 5 (Packet-in-station probability). Let R = (Ω, r, ξ, ψ) be a ring. Then the probability that a packet p ∈ P is routed to station si ∈ Ω is:

$$\pi_i^{(\xi)} = \Pr\{\psi_\xi(p) = s_i\} = \frac{\|\Xi(s_i)\|}{1 + h_M}$$

Where ‖Ξ(si)‖ is station si’s hash segment’s length (hash coverage).

Proof. Since each station si owns a specific hash segment Ξ(si) = [hi, h_i^M], the proof is immediate by considering that every hash h ∈ H ≡ [0, hM] has the same probability of being selected, as per equation 2.8. So we just need to multiply that probability by the length of the segment.

We use expressions π_i^(ξ) and ψ_ξ(·) to indicate the probability for a packet to fall into a certain station and the routing function ψ, both when hashing function ξ is employed in the ring. The reason why we want to make the hash function explicit is that, later, we are going to evaluate the same hash-based quantities with a different hashing function and compare results.

Given one station si, the length of its hash segment ‖Ξ(si)‖ is an important quantity. We can efficiently formalize its value by using expression ‖Ξ(si)‖ = δ_{i,i+1} and by defining the following quantity:

Definition 13 (Segment length calculator). Let Ω be the set of stations and let every station si ∈ Ω be assigned a hash segment [hi, h_i^M] ⊂ ℕ such that the hash partitioning is circular, thus last station sN has hash segment [hN, hM] ∪ [0, h1 − 1]. We define the segment length calculator function as the application returning the number of hashes in the segment assigned to one station:

$$\delta_{i,j} = \left\| \mathbf{1}_{i,j} \cdot 2^l - \left\| h_{i \bmod (N+1)} - h_{j \bmod (N+1)} \right\| \right\|$$

having ∀i, j ∈ ℕ ∧ i, j > 0 and:

$$\mathbf{1}_{i,j} = \begin{cases} 1 & i > j \\ 0 & \text{otherwise} \end{cases}$$

Then we can express the packet-in-station probability in theorem 5 as follows:

$$\pi_i^{(\xi)} = \Pr\{\psi_\xi(p) = s_i\} = 2^{-l} \cdot \delta_{i,i+1} \tag{2.9}$$
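As a small worked sketch, the segment lengths ‖Ξ(si)‖ and the resulting packet-in-station probabilities can be computed for a toy ring. Note that the code obtains the lengths directly with modular arithmetic on consecutive IDs rather than through the indicator form of definition 13; this shortcut, the helper name `segment_lengths` and the example IDs are assumptions of the sketch. At least two stations with distinct IDs are assumed.

```python
from fractions import Fraction


def segment_lengths(ids, l):
    """Number of hashes owned by each station in the space [0, 2**l - 1].

    `ids` must hold at least two distinct station IDs in ascending order.
    """
    n = len(ids)
    space = 2 ** l
    # The modulo makes the last station's length wrap around the hash space.
    return [(ids[(i + 1) % n] - ids[i]) % space for i in range(n)]


l = 10
ids = [100, 400, 900]
lengths = segment_lengths(ids, l)             # [300, 500, 224]
pi = [Fraction(d, 2 ** l) for d in lengths]   # packet-in-station probabilities
```

The lengths add up to the size of the hash space, so the probabilities add up to 1, and the station with the widest segment has the highest probability of receiving a packet.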

Theorem 6 (Packets-in-station probability). Let R = (Ω, r, ξ, ψ) be a ring. Let µi = |si| = 0 . . . m be a r.v. counting the number of packets in station si, where m ∈ ℕ represents the total number of packets sent so far to the ring. Then the probability that station si ∈ Ω has k ∈ ℕ packets is:

$$\Pr\{\mu_i^{(\xi)} = k\} = \binom{m}{k} \cdot \left[ \pi_i^{(\xi)} \right]^k \cdot \left[ 1 - \pi_i^{(\xi)} \right]^{m-k}$$

Proof. Immediate. We consider m packets routed into the ring and we want to calculate the probability that k of them were routed to station si. This calls for Bernoulli trials. The probability for a packet to end up in one station is given by theorem 5.

⁴ The problem describes a non-deterministic scenario where balls are thrown on an area full of bins in a random direction, as described in (Kolchin, 1998).

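Thanks to theorem 6, we know the PDF of r.v. µ_i^(ξ) and we are able to calculate how many packets one station gets on average:

$$\eta_i^{(\xi)}(m) = E\left[ \mu_i^{(\xi)} \right] = \sum_{k=0}^{m} k \cdot \Pr\{\mu_i^{(\xi)} = k\} = \sum_{k=0}^{m} k \binom{m}{k} \cdot \left[ \pi_i^{(\xi)} \right]^k \cdot \left[ 1 - \pi_i^{(\xi)} \right]^{m-k} \tag{2.10}$$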

As described in (Kolchin, 1998), in the Balls in Bins problem, the following holds:

$$\sum_{k=0}^{m} k \binom{m}{k} \cdot p^k (1-p)^{m-k} = mp, \quad \forall m \in \mathbb{N}, m > 0, p \in [0, 1] \subseteq \mathbb{R}$$

which allows us to calculate the average load per station in a simpler form:

$$\eta_i^{(\xi)} = m \cdot 2^{-l} \cdot \delta_{i,i+1} \tag{2.11}$$
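The simplification used to reach equation 2.11 can be checked numerically. The sketch below evaluates the binomial sum of equation 2.10 directly and compares it with m · p, then computes the average load for a toy ring. The segment lengths (300, 500, 224 hashes in a 10-bit space) and the function names are example assumptions.

```python
from math import comb, isclose


def binomial_mean(m, p):
    """Direct evaluation of the sum in equation 2.10 for success probability p."""
    return sum(k * comb(m, k) * p ** k * (1 - p) ** (m - k) for k in range(m + 1))


def expected_loads(m, lengths, l):
    """Average load per station as in equation 2.11: m * 2**-l * segment length."""
    return [m * d / 2 ** l for d in lengths]


mean = binomial_mean(50, 0.3)                      # numerically equal to 50 * 0.3
loads = expected_loads(1024, [300, 500, 224], 10)  # [300.0, 500.0, 224.0]
```

With m = 1024 packets in a 10-bit hash space, each station is expected to hold exactly as many packets as it owns hashes, which makes the dependence of the load on the segment length explicit.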

Proposition 9 (Unbalanced ring). Let R = (Ω, r, ξ, ψ) be a ring. Given equation 2.11, the network is not balanced. The wider a station’s hash segment is, the more packets that station gets:

$$\|\Xi(s_i)\| > \|\Xi(s_j)\| \iff \eta_i^{(\xi)} > \eta_j^{(\xi)}, \quad \forall s_i, s_j \in \Omega$$

Proposition 9 is very important as it states that the system architecture, so far, is able to achieve a very high level of decentralization; however, it fails in balancing stations.
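The imbalance stated by proposition 9 can also be observed with a short simulation. The sketch below hashes 20000 synthetic packets with SHA-256, keeps the first 16 bits as the hash, and counts how many packets land in each of three stations. The station IDs, the 16-bit truncation and the packet payloads are all arbitrary choices made for illustration.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

L_BITS = 16
ids = [1000, 8000, 40000]  # hypothetical station IDs in [0, 2**16 - 1]


def xi(packet: bytes) -> int:
    """Hypothetical packet hash: the first 16 bits of SHA-256."""
    return int.from_bytes(hashlib.sha256(packet).digest()[:2], "big")


def owner(h: int) -> int:
    """Index of the station whose hash segment contains h (wrap-around included)."""
    return (bisect_right(ids, h) - 1) % len(ids)


m = 20000
counts = Counter(owner(xi(str(n).encode())) for n in range(m))
```

The segments cover 7000, 32000 and 26536 hashes respectively, so the second station is expected to collect roughly half of the packets and the first one only about a tenth, far from the m / N share each would get in a balanced ring.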

2.3 Balancing the ring

Equation 2.11 is our starting point to take the current architecture and try to modify it in order to reach load balancing. Our ideal model is one in which each station gets the same number of packets:

Definition 14 (Ideal station load). Given a network of N ∈ ℕ different stations, we define the following as the ideal load per station:

$$\eta_i^\star = \frac{m}{N}, \quad \forall s_i \in \Omega$$

where m ∈ ℕ is the total number of packets sent to the network.


Definition 14 points out that our final goal is to have our architecture move towards the Balls in Bins model. Our goal can also be expressed by considering the probability that each single bin has to get a ball:

$$\pi_i^\star = N^{-1}, \quad \forall s_i \in \Omega \tag{2.12}$$

If we can have equation 2.9 converge to equation 2.12, our goal is reached, since both definition 14 and equation 2.12 describe a uniformly distributed r.v.

2.3.1 Extending the ring

Moving forward to our goal, as per definition 14, we need to understand why the ring is not balanced. We can identify 2 possible causes:

1. Hash function ξ does not take into account the fact that stations have hash segments of different lengths. It actually assumes that all stations have hash segments of the same size.

2. Stations should have hash segments of the same size.

The 2 problems described above are actually 2 possible explanations of the same issue: with regard to balancing, the network structure and the hash function are not well coupled together. For our solution, we choose to accept the standpoint offered by point 1, which blames the hash function rather than the stations.

Our approach is replacing hash function ξ with another one:

Definition 15 (Hash function φ). Let φ : P ↦ [0, φM] ⊂ ℝ be a hash function. By employing function φ, a ring can achieve balancing and each station receives approximately the same number of packets:

$$\eta_i^{(\phi)} \approx \eta_i^\star = \frac{m}{N}$$

where m ∈ ℕ is the total number of packets sent to the network. Also, we still define l as the length (bits) of hashes generated by φ: l = ⌈log2 φM⌉.

Definition 15 represents a goal for us. In the next section we are going to design φ so that load balancing in the ring is achieved.

The first thing we notice about φ is that we have designed it to return real numbers, thus the hash space is no longer discrete, but continuous. We will see that the continuous characterization of φ will not be a problem when employed in the ring and in routing algorithm ψ. Furthermore, in our model, we will consider function φ to be used on packets only; we will not be using this new hash function to compute the IDs of stations: for them, we will keep using hash function ξ.

Definition 16 (Extended ring). Let Ω be the set of stations and r ∈ ℕ be the leaf radius. Let ξ : · ↦ H ⊆ ℕ be the underlying hash function and φ : P ↦ [0, φM] ⊂ ℝ be the balancing hash function used by each station and based on ξ. Let ψ : P ↦ Ω be the routing function, based on φ, used to assign packets to stations. Then we define R = (Ω, r, ξ, φ, ψ) as the extended ring overlay where packets are balanced via hash function φ.

Given definition 16, we see that φ does not act as a replacement of ξ, so we will actually consider the former as an extension of the latter.


Adapting concepts in extended ring

With hash function φ in place, routing algorithm ψ needs to be slightly changed. Actually, since the whole ring structure is based on hashes, we need to adjust a few definitions so that extended ring (Ω, r, ξ, φ, ψ) can be properly described.

The most important concept to introduce in the extended architecture is that φ acts transparently with regard to the hash space. Hash function ξ returns hashes in the discrete space H ≡ [0, hM] ⊂ ℕ, while φ generates hashes in the continuous space Φ ≡ [0, φM] ⊂ ℝ. We design φ such that φM = hM; thanks to this, one space contains the other but they are bound by the same extremes: H ⊂ Φ.

This also means that hash segments can be expressed both as enumerable sets and as real intervals. Which of the two is meant will be specified via set identities or inferred from context.

The next concept to adapt is stations and their IDs. Nothing changes with regard to this matter: every station keeps using hash function ξ to calculate the hash of its address hi ∈ H; however, the important point here is understanding that the hash identifying one station is also contained in the space of φ-hashes: hi ∈ Φ.

The last aspect to cover is routing function ψ. Given the assumptions above, algorithm 1 remains almost unchanged. What changes is line 2 where, instead of using hash function ξ to calculate the packet’s hash, hash function φ is used: hp ← φ(p). All other operations remain unchanged because Φ extends H. The direct consequence of this last point is the following:

Lemma 7. Let R = (Ω, r, ξ, φ, ψ) be an extended ring, then the following holds:

ψ(p) = si ⇐⇒ φ(p) ∈ Ξ(si), ∀p ∈ P

Proof. Immediate by considering lemma 3 and the fact that function φ replaces ξ in algorithm 1 at line 2.

Defining sizing equations

Equation 2.12 describes the PDF of r.v. s⋆ ∈ Ω, which represents the bin (station in an ideally balanced ring) where a ball (packet) falls. As that equation prescribes, if we want to have the ring balanced, we need to make sure that all stations have the same probability of receiving a packet, which is not the case for a normal ring (Ω, r, ξ, ψ) as per equation 2.9.

So, by employing hash function φ, r.v. sφ ∈ Ω can be defined as the station where a packet falls, assuming the extended ring (Ω, r, ξ, φ, ψ) is in place.

R.v. sφ’s PDF is the starting point of our sizing effort. Since sφ is continuous, and given lemma 7, the probability that a packet is routed to station si in the extended ring is:

$$\pi_i^{(\phi)} = \Pr\{h_i \le \phi(p) \le h_i^M\} = \int_{\Xi(s_i)} f_\phi(r)\,dr \tag{2.13}$$

where fφ : [0, hM] ⊂ ℝ ↦ ℝ is the PDF of r.v. hφ, which represents the generated φ-hashes, and Ξ(si) ⊆ Φ is station si’s continuous hash segment. It is important to notice how this equation relates r.v. sφ ∈ Ω (since its PDF has expression Σ_{k=1}^{N} π_k^(φ) · δ(r − k)⁵) to r.v. hφ ∈ Φ.

Recalling equation 2.12, we basically want π_i^(φ) = π_i⋆:

$$\int_{h_i}^{h_i^M} f_\phi(r)\,dr = N^{-1}, \quad \forall s_i \in \Omega \tag{2.14}$$

We start from equation 2.14. Our purpose is to design hash function φ’s implementation so that this equation holds. This approach will guarantee that r.v. sφ behaves like the uniformly distributed r.v. s⋆ in the Balls in Bins scenario.

2.3.2 Designing hash function φ

Equation 2.14 represents a constraint on fφ. This expression points out an important relationship:

Proposition 10 (Relationship between r.v. sφ and hφ). Given equation 2.14, the effort of designing hash function φ is transferred to r.v. hφ, as its PDF fφ is the subject of such design.

Designing r.v. sφ’s PDF

Of course, we cannot extract function fφ from the integral sign in equation 2.14, so we need to make some assumptions on it.

Definition 17 (Formatting impulse). Let g : ℝ ↦ ℝ be a continuous function, bounded in both domain and value, with the following constraints:

1. g(r) ≥ 0, ∀r ∈ ℝ.

2. g is a compact-support⁶ function: ∃r1, r2 ∈ ℝ, r1 < r2 : g(r) = 0, ∀r ∉ [r1, r2].

3. ∃A ∈ ℝ, A > 0 : g(r) ≤ A, ∀r ∈ ℝ.

4. ∫_{−∞}^{+∞} g(r) dr ≤ 1.

5. It is possible to calculate g’s antiderivative: ∃G(r) : G′(r) = g(r).

We define g as fφ’s formatting impulse: a function used to shape r.v. hφ’s PDF and solve equation 2.14. Because of its definition, we use expression g_{r1,r2,A}(r) to refer to an impulse with amplitude A and definition interval [r1, r2] ⊂ ℝ.

We now want to show an important result concerning definition 17 which will be useful later on in this chapter:

Lemma 8 (Impulse antiderivative is invertible). Let g : ℝ ↦ ℝ be a formatting impulse. Then its antiderivative G : ℝ ↦ ℝ is invertible: ∃G⁻¹.

Proof. The inverse function theorem⁷ states that a continuously differentiable univariate function with nonzero derivative in a certain interval is invertible therein. In our case, G is the antiderivative of a continuous function, thus it is itself continuous and differentiable by definition of primitive function. Thus we meet the conditions of invertibility.

The nonzero derivative condition is not met by g’s definition. However, this does not undermine its invertibility: rather, it does not guarantee that the inverse function is also continuously differentiable.

FIGURE 2.3: Hash-partitioning of a ring into different segments, one for each station (s1, s2, . . . , sk, . . . , sN over [h1, h_1^M], [h2, h_2^M], . . . , [hk, h_k^M], [hN, hM] and [0, h1]). For each segment, a different impulse is used, whose coverage matches the segment’s length.

⁵ Function δ is intended to be a generalized function, or distribution: ⟨δ, ϕ⟩ = ϕ(0).
⁶ Compact-support functions are used in distributional calculus.
⁷ As described in (Nijenhuis, 1974), the theorem provides a sufficient condition for a function to be invertible.

Function fφ’s domain [0, hM] ⊂ ℝ can be partitioned into N different segments: one hash segment Ξ(si) ≡ [hi, h_i^M] for each station si.

The basic idea is to have function fφ employ impulse g to cover the different segments in the whole hash space [0, hM] ⊂ ℝ, as shown in figure 2.3. So, for each hash segment [hi, h_i^M] ⊂ ℝ, impulse g_{h_i, h_i^M, A_i} is considered and employed to calculate fφ’s values falling into that specific segment. For the last segment, relative to sN, since it crosses the max hash value hM, we actually need to use 2 different impulses: g_{h_N, h_M, A_{N,1}} and g_{0, h_1, A_{N,2}}. Amplitudes A1, A2, . . . , AN,1, AN,2 are sized quantities and their values will be calculated later in this chapter.

Definition 18 (Function fφ’s structure). Let R = (Ω, r, ξ, φ, ψ) be an extended ring and let hφ ∈ [0, hM] ⊂ ℝ be the r.v. representing a φ-hash; then its PDF is formally defined as:

$$f_\phi(r) = \sum_{k=1}^{N-1} g_{h_k, h_k^M, A_k}(r) + g_{h_N, h_M, A_{N,1}}(r) + g_{0, h_1, A_{N,2}}(r)$$

It is important to notice that fφ is not a regular unconstrained function: it is a PDF, thus it must meet certain requirements. Later on, we will verify that those requirements are actually met. We can now take equation 2.14 and replace fφ with its definition:

$$\int_{h_i}^{h_i^M} f_\phi(r)\,dr = \int_{h_i}^{h_i^M} g_{h_i, h_i^M, A_i}(r)\,dr = N^{-1}, \quad \forall i = 1 \ldots N-1 \tag{2.15}$$

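In case last station sN is considered, then 2.14 becomes:

$$\begin{aligned} \int_{h_i}^{h_i^M} f_\phi(r)\,dr &= \int_{\Xi(s_N)} \left[ g_{h_N, h_M, A_{N,1}}(r) + g_{0, h_1, A_{N,2}}(r) \right] dr \\ &= \int_{h_N}^{h_M} g_{h_N, h_M, A_{N,1}}(r)\,dr + \int_{0}^{h_1} g_{0, h_1, A_{N,2}}(r)\,dr = N^{-1} \end{aligned} \tag{2.16}$$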

Since g has an antiderivative as per definition 17, we can proceed further in both equations:

$$\left[ G_{h_i, h_i^M, A_i}(r) \right]_{h_i}^{h_i^M} = N^{-1}, \quad \forall i = 1 \ldots N-1 \tag{2.17}$$

And:

$$\left[ G_{h_N, h_M, A_{N,1}}(r) \right]_{h_N}^{h_M} + \left[ G_{0, h_1, A_{N,2}}(r) \right]_{0}^{h_1} = N^{-1} \tag{2.18}$$

Equations 2.17 and 2.18 are the closed-form constraints we have just derived from equation 2.14.

Remark (Solutions of equations 2.16 and 2.18). Regarding the last station’s hash segment, we have to use 2 different impulses whose amplitudes AN,1 and AN,2 can be sized via equations 2.16 and 2.18. Those equations provide a possibly infinite set of solutions where both amplitudes are interdependent. Among the possible ones, we choose to have each impulse cover half of the target value:

$$\int_{h_N}^{h_M} g_{h_N, h_M, A_{N,1}}(r)\,dr = \left[ G_{h_N, h_M, A_{N,1}}(r) \right]_{h_N}^{h_M} = \frac{1}{2N} \tag{2.19}$$

And:

$$\int_{0}^{h_1} g_{0, h_1, A_{N,2}}(r)\,dr = \left[ G_{0, h_1, A_{N,2}}(r) \right]_{0}^{h_1} = \frac{1}{2N} \tag{2.20}$$

Remark (Requirements on impulse). In definition 17, we have required formatting impulse g to have an antiderivative G whose expression is known. This assumption is pretty strong but not essential. As the calculations so far show, equations 2.15 and 2.16 can actually be used to size the value of the impulse’s amplitude by using an alternative method to exact integration. Throughout the rest of our analysis, equations 2.17 and 2.18 will always be referred to as the preferred amplitude sizing method; however, it will always be implicitly intended that equations 2.15 and 2.16 can replace them.

The process of designing fφ is completed, as the equations above can be used to calculate all impulse amplitudes A1, A2, . . . , AN,1, AN,2:

Proposition 11 (Defining function fφ). In order to reach load balancing in the ring, the following operations are considered:

1. Formatting impulse g_{h_i, h_i^M, A_i} : [hi, h_i^M] ⊂ ℝ ↦ [0, Ai] ⊂ ℝ is defined for each station si ∈ Ω.

2. Function fφ is designed as per definition 18 by summing all impulses together.

3. Function fφ will be parametric on the set of impulse amplitudes, hence its input space will be ℝ^(N+2): fφ(A1, . . . , AN,1, AN,2, r) with r, Ak ∈ ℝ, Ak > 0, ∀k = 1 . . . N.

4. For each impulse, the corresponding antiderivative:

$$G_{h_i, h_i^M, A_i}(r) = \int_{0}^{r} g_{h_i, h_i^M, A_i}(x)\,dx \tag{2.21}$$

is calculated.

5. For each impulse, by means of equations 2.17 and 2.18, the corresponding amplitude Ai is calculated.
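The sizing procedure above can be illustrated with the simplest possible impulse. The sketch below assumes a rectangular impulse (constant value A over its interval), which is a choice made here purely for illustration and not one prescribed by the text; with it, equations 2.17, 2.19 and 2.20 reduce to dividing the target area by the segment width. Continuous segments are assumed to end where the next station's ID begins, and the first ID is assumed to be strictly positive and the last one strictly below hM.

```python
from fractions import Fraction


def rectangular_amplitudes(ids, h_max):
    """Size one rectangular impulse per hash segment.

    Returns a list of (start, end, amplitude) pieces. `ids` holds the
    station IDs in ascending order, with 0 < ids[0] and ids[-1] < h_max.
    """
    n = len(ids)
    pieces = []
    for i in range(n - 1):
        width = ids[i + 1] - ids[i]
        # Equation 2.17 with a constant impulse: A_i * width = 1 / N.
        pieces.append((ids[i], ids[i + 1], Fraction(1, n * width)))
    # Last station: two impulses, each covering half of 1 / N (equations 2.19, 2.20).
    pieces.append((ids[-1], h_max, Fraction(1, 2 * n * (h_max - ids[-1]))))
    pieces.append((0, ids[0], Fraction(1, 2 * n * ids[0])))
    return pieces


pieces = rectangular_amplitudes([100, 400, 900], 1023)
total_area = sum((end - start) * a for start, end, a in pieces)
```

Narrow segments receive tall impulses and wide segments receive low ones, so that every station ends up with the same area 1/N under fφ and the total area is 1.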

Now that we know fφ’s formal definition, we need to verify that such an expression meets the constraints of a PDF:

Theorem 9 (Function fφ is a regular PDF). Let R = (Ω, r, ξ, φ, ψ) be an extended ring with N = ‖Ω‖ stations and let g_{h_i, h_i^M, A_i} : [hi, h_i^M] ⊂ ℝ ↦ [0, Ai] ⊂ ℝ be the formatting impulse for each station si ∈ Ω such that equations 2.17 and 2.18 hold. Then function fφ, as per definition 18, is a regular PDF.


Proof. The 3 basic properties of PDF functions must be met:

1. fφ is positive and bounded given the definition of impulse g:

$$0 \le g_{h_i, h_i^M, A_i}(r) \le A_i, \quad \forall i = 1 \ldots N, r \in \mathbb{R}$$

2. Given its definition, fφ’s domain is the union of all non-overlapping domains of the impulses:

$$\bigcup_{k=1}^{N-1} \left[ h_k, h_k^M \right] \cup [h_N, h_M] \cup [0, h_1] \equiv [0, h_M] \subset \mathbb{R}$$

Thus the function is 0 outside its definition range:

$$f_\phi(r) = 0, \ \forall r < 0 \lor r > h_M \implies \lim_{r \to \pm\infty} f_\phi(r) = 0$$

3. fφ’s area is unitary because of equations 2.17 and 2.18:

$$\int_{-\infty}^{+\infty} f_\phi(r)\,dr = \int_{0}^{h_M} f_\phi(r)\,dr = N \cdot N^{-1} = 1$$

The following comes as a direct consequence of theorem 9:

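Corollary 9.1 (Function $F_\varphi$ is a regular CDF). Function $F_\varphi : \mathbb{R} \mapsto \mathbb{R}$ has the following form:

$$F_\varphi(r) = \sum_{k=1}^{N-1} G_{h_k,h_k^M,A_k}(r) + G_{h_N,h^M,A_{N,1}}(r) + G_{0,h_1,A_{N,2}}(r) \qquad (2.22)$$

It is a regular CDF and it is r.v. $h_\varphi$’s CDF.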

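Proof. Immediate by considering r.v. $h_\varphi$’s CDF’s definition:

$$\begin{aligned}
F_\varphi(r) &= \int_{-\infty}^{r} f_\varphi(x)\,dx = \int_{-\infty}^{r} \left[ \sum_{k=1}^{N-1} g_{h_k,h_k^M,A_k}(x) + g_{h_N,h^M,A_{N,1}}(x) + g_{0,h_1,A_{N,2}}(x) \right] dx \\
&= \int_{-\infty}^{r} \sum_{k=1}^{N-1} g_{h_k,h_k^M,A_k}(x)\,dx + \int_{-\infty}^{r} g_{h_N,h^M,A_{N,1}}(x)\,dx + \int_{-\infty}^{r} g_{0,h_1,A_{N,2}}(x)\,dx \\
&= \sum_{k=1}^{N-1} \int_{-\infty}^{r} g_{h_k,h_k^M,A_k}(x)\,dx + \int_{-\infty}^{r} g_{h_N,h^M,A_{N,1}}(x)\,dx + \int_{-\infty}^{r} g_{0,h_1,A_{N,2}}(x)\,dx \\
&= \sum_{k=1}^{N-1} \left[ G_{h_k,h_k^M,A_k}(x) \right]_{-\infty}^{r} + \left[ G_{h_N,h^M,A_{N,1}}(x) \right]_{-\infty}^{r} + \left[ G_{0,h_1,A_{N,2}}(x) \right]_{-\infty}^{r}
\end{aligned}$$

Where $G_{h_k,h_k^M,A_k}$ is defined according to equation 2.21. Given the impulse’s definition, its antiderivative and its binding to hash segments as per definition 18, we know that the following holds:

$$\lim_{r \to \infty} g_{r_1,r_2,A}(r) = 0 \;\land\; r_1, r_2 \ge 0 \implies \lim_{r \to -\infty} G_{r_1,r_2,A}(r) = 0$$

Hence, leading to the following result:

$$\left[ G_{r_1,r_2,A}(x) \right]_{-\infty}^{r} = G_{r_1,r_2,A}(r) - \lim_{r \to -\infty} G_{r_1,r_2,A}(r) = G_{r_1,r_2,A}(r), \quad \forall r \in \mathbb{R}$$

Which leads us to equation 2.22. Finally, theorem 9 has proved that $f_\varphi(x)$ is a regular PDF, and this covers the proof of all aspects of the thesis.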

Designing r.v. sφ

Although proposition 11 describes the procedure for calculating $f_\varphi$, it does not provide a way to build r.v. $h_\varphi$ such that it generates hashes according to that function. This problem is known in the literature as random variable generation⁸. By employing this technique, we obtain the algorithm for calculating the φ-hashes which allow us to balance load in the ring. For the sake of completeness, the proof of this process is described below:

Theorem 10 (R.v. generation via inverse transform). Let $F : \mathbb{R} \mapsto [0, 1] \subset \mathbb{R}$ be a continuous invertible function meeting the characteristics of a CDF:

1. Bounded in [0, 1]: $0 \le F(r) \le 1, \; \forall r \in \mathbb{R}$.

2. $\lim_{r \to -\infty} F(r) = 0$.

3. $\lim_{r \to +\infty} F(r) = 1$.

4. Monotone increasing: $r_1 < r_2 \implies F(r_1) \le F(r_2), \; \forall r_1, r_2 \in \mathbb{R}$.

Let $U \in \mathbb{R}$ be a continuous uniformly distributed r.v. over $[0, 1] \subset \mathbb{R}$. Define $X \in \mathbb{R}$ as a r.v. such that the following transformation holds:

$$X = F^{-1}(U) \qquad (2.23)$$

Where $F^{-1} : [0, 1] \subset \mathbb{R} \mapsto \mathbb{R}$ denotes F’s inverse function. Then X is distributed as F:

$$\Pr\{X \le x\} = F(x), \quad \forall x \in \mathbb{R} \qquad (2.24)$$

Proof. We need to prove that equation 2.23 causes r.v. X to be distributed according to CDF F. Starting from equation 2.24, we need to prove the following:

$$\Pr\{F^{-1}(U) \le x\} = F(x), \quad \forall x \in \mathbb{R}$$

Since F is invertible, the function is both injective and surjective. Also, F is, by hypothesis, a continuous function. Thanks to those 2 conditions, we can apply F to both sides of the inequality under the probability sign:

$$\Pr\{F(F^{-1}(U)) \le F(x)\} = F(x) \implies \Pr\{U \le F(x)\} = F(x), \quad \forall x \in \mathbb{R}$$

The sign of the inequality is left unchanged because F is monotone increasing. Now let a = F(x); as x ranges in $\mathbb{R}$, a will range in [0, 1] because F is a CDF and takes values in that interval. So the previous equation becomes:

$$\Pr\{U \le a\} = a, \quad \forall a \in [0, 1] \subset \mathbb{R}$$

But $\Pr\{U \le a\} = F_U(a)$, and U is assumed to be a continuous uniformly distributed r.v. over [0, 1]. The previous equation is exactly r.v. U’s CDF’s formal definition, which proves the thesis.

⁸ As described in (Haugh, 2004), r.v. transformation via inverse transform is a known technique which makes it possible to define a random variable by using the inverse of its CDF.
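As a small numerical illustration of theorem 10, the sketch below applies the inverse transform to a distribution that is not part of this work: the exponential distribution, chosen here only because its CDF has a closed-form inverse on its support. The function names and the rate value are assumptions made for this example.

```python
import math
import random


def exponential_cdf(x: float, rate: float) -> float:
    """CDF of an exponential r.v.: F(x) = 1 - exp(-rate * x) for x >= 0, else 0."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0


def exponential_inverse_cdf(u: float, rate: float) -> float:
    """Inverse of the CDF above for u in [0, 1): F^{-1}(u) = -ln(1 - u) / rate."""
    return -math.log(1.0 - u) / rate


def sample(n: int, rate: float, seed: int = 0) -> list:
    """Draw n values X = F^{-1}(U), with U uniform on [0, 1) (equation 2.23)."""
    rng = random.Random(seed)
    return [exponential_inverse_cdf(rng.random(), rate) for _ in range(n)]


def empirical_cdf(samples: list, x: float) -> float:
    """Fraction of samples not exceeding x, an estimate of Pr{X <= x}."""
    return sum(1 for s in samples if s <= x) / len(samples)
```

Comparing `empirical_cdf(sample(20000, 2.0), x)` with `exponential_cdf(x, 2.0)` at a few points x should show the two values agreeing up to sampling noise, which is what equation 2.24 states.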

Theorem 10 answers our question about the implementation of hash function φ. When considering a ring (Ω, r, ξ, φ, ψ), we can use hash function ξ to build hash function φ in order to reach balancing. Before detailing this process, we need to make sure we meet all conditions defined by the theorem above:

Lemma 11 (Function $F_\varphi$ is invertible). Let $f_\varphi$ be r.v. $h_\varphi$’s PDF defined as per proposition 11. Let $F_\varphi$ be its CDF. Then $F_\varphi$ is invertible and we will indicate with $F_\varphi^{-1}$ its inverse.

Proof. The proof is almost immediate. We consider $F_\varphi$’s definition, as per corollary 9.1, and note that it is built up from many different impulse antiderivatives $G_i = G_{h_i,h_i^M,A_i}$ with $i = 1 \ldots N$. It is possible to invert a function piecewise; given $F_\varphi$’s structure, we can prove its invertibility by proving that each impulse antiderivative $G_i$ is itself invertible. Thanks to lemma 8, every impulse antiderivative is actually invertible, and this proves the thesis.

The process for achieving this result is as follows:

Proposition 12 (Hash function φ’s implementation). Let R = (Ω, r, ξ, φ, ψ) be an extended ring. In order to build hash function φ, the following operations must be performed at ring initialization time:

1. For each station $s_i \in \Omega$, compute its hash identifier $h_i = \xi(s_i)$ and sort all ID hashes by increasing value: $h_1, h_2, \ldots, h_N$.

2. Define a formatting impulse $g(r, r_1, r_2, A)$ to use, parametric in $(r_1, r_2, A) \in \mathbb{R}^3$. It is possible to use one impulse definition for all stations or a different impulse definition for each of them.

3. Bind each station’s associated impulse $g_{r_1,r_2,A}$ to the station’s hash segment $\Xi(s_i) = [h_i, h_i^M]$, thus obtaining an impulse $g_i = g_{h_i,h_i^M,A_i}$ parametric in amplitude $A_i$.

4. Use sizing equations 2.17 and 2.18 to compute, for each impulse $g_i$, the value of amplitude $A_i$ which achieves balancing. This yields an array of fully qualified impulses $g_1, g_2, \ldots, g_{N-1}, g_{N,1}, g_{N,2}$ (no longer parametric).

5. Build PDF function $f_\varphi$ as per definition 18.

6. Compute CDF function $F_\varphi$ as per corollary 9.1.

7. As per theorem 10, compute φ as:

$$\varphi(\cdot) = F_\varphi^{-1}\left[ \left(2^l - 1\right)^{-1} \cdot \xi(\cdot) \right] \qquad (2.25)$$

Equation 2.25 represents hash function φ’s formal definition.

We will refer to this equation as the Balancing Equation.


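[Figure 2.4: the bottom axis is the ξ hash-space $[0, h^M]$, marked at $h_1, h_2, \ldots, h_k, \ldots, h_{N-1}, h_N$, with segments of lengths $h_1$, $h_2 - h_1$, ..., $h_N - h_{N-1}$, $h^M - h_N$ belonging to stations $s_N, s_1, s_2, \ldots, s_k, \ldots, s_{N-1}, s_N$. The top axis is the φ hash-space [0, 1], marked at $\frac{1}{2N}$, $\frac{1}{2N} + \frac{1}{N}$, ..., $\frac{1}{2N} + \frac{k}{N}$, ..., $\frac{1}{2N} + \frac{N-1}{N}$, with segments of lengths $\frac{1}{2N}, \frac{1}{N}, \ldots, \frac{1}{N}, \frac{1}{2N}$ belonging to the same stations. The mapping between the two axes is labelled φ.]

FIGURE 2.4: Hash segments mapped onto φ segments illustrating how hash function φ works. The top part of the diagram shows the φ hash-space while the bottom part the ξ hash-space.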

2.3.3 Understanding how φ works

Equation 2.25 allows us to balance the ring. Before moving on, we would like to point out a few important facts regarding the balancing equation to better understand how it works.

Figure 2.4 illustrates the basic principle behind φ: the balancing application maps an unevenly distributed space (regular hashes, that is ξ hashes) onto an evenly distributed space (φ hashes). The even partitioning of the φ space is based on the number of stations N.

We remark that, given its definition (equation 2.25), hash function φ has a domain which spans values ranging in interval $[0, 1] \subset \mathbb{R}$. This interval is subdivided into equal segments, each one assigned to a station. We will call them φ-segments, and we will later use the term φ-coverage to indicate the same concept.

2.4 Ring balancing example

For the sake of completeness, we are going to provide an example of how to build hash function φ for a simple small ring consisting of a few stations.

Since our purpose here is to provide a realistic scenario that is easy to understand, we are going to consider the following conditions:

1. The ring will be made of N = 6 stations.

2. Leaf set radius is the minimum: r = 1.

3. We are going to consider extremely short hashes with l = 10 bits. This means that $h^M = 2^{10} - 1 = 1023$.

Remark. Hash function ξ will be considered but not defined, as we are not going to actually use it in our calculations.

We will now follow proposition 12’s prescriptions.


Station      Hash identifier   HS (Ξ(si) ⊂ N)                  HS (Ξ(si) ⊂ R)
s1 = St. 1   h1 = 101          101 . . . 209                   [101, 210)
s2 = St. 2   h2 = 210          210 . . . 339                   [210, 340)
s3 = St. 3   h3 = 340          340 . . . 552                   [340, 553)
s4 = St. 4   h4 = 553          553 . . . 700                   [553, 701)
s5 = St. 5   h5 = 701          701 . . . 997                   [701, 998)
s6 = St. 6   h6 = 998          998 . . . 1023 ∪ 0 . . . 100    [998, 1023] ∪ [0, 101)

TABLE 2.1: Values of hash identifiers and hash segments (HS) for each station in the example.

2.4.1 Defining the ring

We must first define ring R = (Ω, r, ξ, φ, ψ). Remember that hash function φ is the last quantity we will define.

Each station $s_i \in \Omega$ must first define an identifier and compute hash function ξ on it in order to calculate its hash identifier $h_i$.

As table 2.1 shows, once each station receives its hash identifier, hash segments are defined so that routing is possible in the ring.

2.4.2 Defining the formatting impulse

According to the second point of proposition 12, we must move on to defining the formatting impulse to use in order to achieve balancing. We can choose to either define one impulse type for all stations or a different impulse type for each one of them. We are going to choose the first option for 2 reasons:

1. Choosing one single impulse type is easier from a computational point of view, as its parametric antiderivative needs to be formulated only once.

2. Choosing more impulse types has not proved, so far, to be any more beneficial than using a single one. The quality of the balancing is not impacted by this choice⁹.

For the sake of simplicity, we are going to consider a very simple impulse type, the rectangular impulse:

$$g_{r_1,r_2,A}(r) = A \cdot \Pi\left( \frac{r - r_1}{r_2 - r_1} \right) \qquad (2.26)$$

Having:

$$\Pi(r) = \begin{cases} 1 & 0 \le r \le 1 \\ 0 & \text{otherwise} \end{cases}$$

The impulse we have chosen in equation 2.26 is compliant with definition 17. Since the proof is immediate, we will not cover it.

⁹ This is based on observations from simulations run so far. No proof supports or refutes this theory, though.


2.4.3 Binding impulses to stations

Now we need to bind every station’s HS to an impulse in order to get a collection of impulses, all parametric with respect to their amplitudes.

$$\begin{aligned}
s_1 &\implies g_1 = g_{101,210,A_1} \\
s_2 &\implies g_2 = g_{210,340,A_2} \\
s_3 &\implies g_3 = g_{340,553,A_3} \\
s_4 &\implies g_4 = g_{553,701,A_4} \\
s_5 &\implies g_5 = g_{701,998,A_5} \\
s_6 &\implies g_{6,1} = g_{998,1023,A_{6,1}}, \; g_{6,2} = g_{0,101,A_{6,2}}
\end{aligned}$$

Remark (Impulse for last station). The last station in the ring receives special treatment because it might span two hash intervals, since it includes the highest hash $h^M$ and the lowest one (the null hash). Thus 2 impulses are actually used.

2.4.4 Calculating amplitudes

At the moment, we have a collection of N + 1 = 7 impulses, all parametric with respect to their amplitudes. We need to find the values of those amplitudes so that these impulses define $f_\varphi$ in a way that lets hash function φ balance the load in the ring.

Equations 2.17 and 2.18 will be used to size those impulses. Since the sizing equations require the impulse’s antiderivative, we need to calculate it first. Given its extremely simple definition, the calculation is almost immediate:

$$G_{r_1,r_2,A}(r) = \int_0^r g_{r_1,r_2,A}(x)\,dx = \int_0^r A \cdot \Pi\left( \frac{x - r_1}{r_2 - r_1} \right) dx = A \cdot \left[ \Pi\left( \frac{r - r_1}{r_2 - r_1} \right) \cdot (r - r_1) + (r_2 - r_1) \cdot H(r - r_2) \right]$$

Where H(r) is Heaviside’s step function¹⁰:

$$H(r) = \begin{cases} 1 & r > 0 \\ 0 & \text{otherwise} \end{cases}$$

In order to apply equations 2.17 and 2.18, we need to calculate the following quantity (the binding to hash segments is done later):

$$\begin{aligned}
\left[ G_{r_1,r_2,A} \right]_{r_1}^{r_2} &= \left[ A \cdot \Pi\left( \frac{r - r_1}{r_2 - r_1} \right) \cdot (r - r_1) + A \cdot (r_2 - r_1) \cdot H(r - r_2) \right]_{r_1}^{r_2} \\
&= A \cdot \left[ \Pi\left( \frac{r - r_1}{r_2 - r_1} \right) \cdot (r - r_1) + (r_2 - r_1) \cdot H(r - r_2) \right]_{r_1}^{r_2} \\
&= A \cdot (r_2 - r_1)
\end{aligned}$$

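We can now apply equation 2.17 to $g_1, \ldots, g_5$:

$$\left[ G_{h_k,h_k^M,A_k} \right]_{h_k}^{h_k^M} = N^{-1} \implies A_k = \left( h_k^M - h_k \right)^{-1} \cdot N^{-1}, \quad \forall k = 1 \ldots 5$$

The same goes for $g_{6,1}$ and $g_{6,2}$, where we apply equations 2.19 and 2.20:

$$\left[ G_{h_N,h^M,A_{N,1}} \right]_{h_N}^{h^M} = \frac{1}{2N} \implies A_{N,1} \cdot (h^M - h_N) = \frac{1}{2N} \implies A_{N,1} = \frac{1}{2 (h^M - h_N) \cdot N}$$

$$\left[ G_{0,h_1,A_{N,2}} \right]_{0}^{h_1} = \frac{1}{2N} \implies A_{N,2} \cdot h_1 = \frac{1}{2N} \implies A_{N,2} = \frac{1}{2 h_1 \cdot N}$$

We now have the amplitudes of all impulses:

$$\begin{aligned}
A_1 &= (h_2 - h_1)^{-1} \cdot N^{-1} = (210 - 101)^{-1} \cdot 6^{-1} = 654^{-1} \\
A_2 &= (h_3 - h_2)^{-1} \cdot N^{-1} = (340 - 210)^{-1} \cdot 6^{-1} = 780^{-1} \\
A_3 &= (h_4 - h_3)^{-1} \cdot N^{-1} = (553 - 340)^{-1} \cdot 6^{-1} = 1278^{-1} \\
A_4 &= (h_5 - h_4)^{-1} \cdot N^{-1} = (701 - 553)^{-1} \cdot 6^{-1} = 888^{-1} \\
A_5 &= (h_6 - h_5)^{-1} \cdot N^{-1} = (998 - 701)^{-1} \cdot 6^{-1} = 1782^{-1} \\
A_{6,1} &= (h^M - h_6)^{-1} \cdot (2N)^{-1} = (1023 - 998)^{-1} \cdot 12^{-1} = 300^{-1} \\
A_{6,2} &= h_1^{-1} \cdot (2N)^{-1} = 101^{-1} \cdot 12^{-1} = 1212^{-1}
\end{aligned}$$

¹⁰ Heaviside’s function exists in the literature in different forms; here we consider the variation where the function assumes only 2 values: 0 and 1.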
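The amplitude values can be cross-checked with exact rational arithmetic. The sketch below assumes rectangular impulses and the hash identifiers of table 2.1; the function name `rectangular_amplitudes` and the constants are introduced here only for illustration.

```python
from fractions import Fraction

H_MAX = 2**10 - 1  # h^M for l = 10 bits
STATION_HASHES = [101, 210, 340, 553, 701, 998]  # h_1 ... h_6 from table 2.1


def rectangular_amplitudes(hashes, h_max):
    """Return a list of (r1, r2, A) triples, one per rectangular impulse.

    Stations 1 .. N-1 get one impulse of area 1/N each; the last station
    gets two impulses of area 1/(2N), on [h_N, h^M] and on [0, h_1).
    """
    n = len(hashes)
    impulses = []
    for r1, r2 in zip(hashes, hashes[1:]):
        impulses.append((r1, r2, Fraction(1, n * (r2 - r1))))
    impulses.append((hashes[-1], h_max, Fraction(1, 2 * n * (h_max - hashes[-1]))))
    impulses.append((0, hashes[0], Fraction(1, 2 * n * hashes[0])))
    return impulses


if __name__ == "__main__":
    for r1, r2, amplitude in rectangular_amplitudes(STATION_HASHES, H_MAX):
        print(f"[{r1}, {r2}]: A = {amplitude}")
```

Summing `A * (r2 - r1)` over the returned triples gives the total area under $f_\varphi$, which must equal 1 for the function to be a PDF.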

2.4.5 Computing functions

Now that all impulses have been properly sized and we have their values, function $f_\varphi$ is fully defined. As a direct result, function $F_\varphi$ is fully defined as well. Given its simplicity, $F_\varphi$ can be easily inverted piecewise.
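With rectangular impulses, $F_\varphi$ is piecewise linear, so the piecewise inversion can be sketched in a few lines. The code below is a minimal illustration under that assumption, using the example ring of table 2.1; the helper names `cdf_breakpoints`, `phi_from_xi` and `station_of` are hypothetical and not part of the original implementation.

```python
from bisect import bisect_right
from fractions import Fraction

H_MAX = 2**10 - 1
STATION_HASHES = [101, 210, 340, 553, 701, 998]


def cdf_breakpoints(hashes, h_max):
    """Breakpoints (xs, ys) of the piecewise-linear CDF, with ys[i] = F(xs[i])."""
    n = len(hashes)
    xs = [0] + list(hashes) + [h_max]
    # Area 1/(2N) on [0, h_1), 1/N on each inner segment, 1/(2N) on [h_N, h^M].
    areas = [Fraction(1, 2 * n)] + [Fraction(1, n)] * (n - 1) + [Fraction(1, 2 * n)]
    ys = [Fraction(0)]
    for area in areas:
        ys.append(ys[-1] + area)
    return xs, ys


def phi_from_xi(xi_hash, hashes, h_max):
    """Balancing equation 2.25: F^{-1}(xi / h^M), inverted one linear piece at a time."""
    xs, ys = cdf_breakpoints(hashes, h_max)
    u = Fraction(xi_hash, h_max)
    # Locate the linear piece containing u; clamp so that u = 1 uses the last piece.
    i = min(bisect_right(ys, u) - 1, len(xs) - 2)
    return xs[i] + (u - ys[i]) * (xs[i + 1] - xs[i]) / (ys[i + 1] - ys[i])


def station_of(value, hashes):
    """0-based index of the station whose hash segment contains value.

    Values below h_1 wrap around to the last station.
    """
    return (bisect_right(hashes, value) - 1) % len(hashes)
```

Feeding every possible 10-bit ξ hash through `phi_from_xi` and routing the result with `station_of` should give station loads that differ only by rounding, whereas routing the ξ hashes directly gives loads proportional to the segment lengths of table 2.1.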


Chapter 3

Simulation results

In this chapter we are going to describe the simulations which were performed in order to validate, from a practical standpoint, all the results analytically achieved in chapter 5.

As a general overview, two different simulation systems were designed and developed:

• Regular simulations A high-level engine developed in Matlab¹ and Mathematica² targeting small-size simulations in order to produce real data to validate the whole system.

• High performance simulations A low-level engine developed in C/C++ and targeting large-size simulations in order to produce high fidelity data to validate the system in real life conditions.

Both solutions were used to prove that all analytical results in chapter 5 do provide a valid description of the system’s behaviour.

3.1 Small-size simulations

This simulation engine was developed to generate results in the context of a controlled environment where conditions are similar to those in real life. The main features of this simulation set are:

• Functional definition of impulses and functions.

• Real hashes are calculated using the standard Crypto³ library.

• All big integers are normalized into a smaller interval.

Of course, given its nature, the engine also comes with some limitations and downsides:

• Even though real hashes are used, their values are normalized to fit a smaller interval. Thus, these values cannot be considered high fidelity.

• Simulations are slow. Given the application of functional calculus, impulse functions and their antiderivatives are defined in open form, thus requiring numerical integration to be performed every time.

• Given the different subsystems being used, numerical accuracy is not guaranteed.

¹ Mathworks Matlab https://www.mathworks.com/products/matlab.html.
² Wolfram Mathematica https://www.wolfram.com/mathematica/.
³ OpenSSL library was used to compute regular hashes. More information available in appendix A.

[Figure 3.1: two polar plots, titled "PHCP non-balanced case" (left) and "PHCP in balanced case" (right).]

FIGURE 3.1: Showing the Polar Hash Coverage Plot (PHCP) of a simulation on an N = 10 station ring after sending $m = 10^3$ packets. Both plots show the configuration of the station hash segments together with the final load levels at the end of the simulation. The plot on the left refers to a normal ring (hash function ξ applied), the one on the right refers to an extended ring where hash function φ based on the same ξ is considered. The same packets were sent in both rings.

On the one hand, this set of simulations is relatively easy to implement; on the other hand, it comes with certain intrinsic limitations (mainly related to the subsystems being used). The other set of simulations is meant to address those issues and provide better numerical fidelity.

3.1.1 Verifying load balance

One of the most basic simulations is used to verify that the algorithm effectively helps the ring achieve load balance given different hash segment distributions among stations in the hash range interval $[0, h^M]$. These simulations perform the following operations:

1. The hash space is divided into N random parts, each assigned to one station.

2. A total number of m packets (random numeric vectors) are generated and fed to the hash function, which can be either ξ or φ (both are considered in order to compare loads per station at the end of one simulation).

3. Packets are assigned to stations according to routing function ψ based on the selected hash function.

4. Final results are collected: the number of packets per each station is tracked.
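The four steps above can be sketched as a small self-contained simulation. This is not the Matlab/Mathematica engine used in this work: it is a minimal illustration that assumes rectangular impulses, 32-bit hashes obtained by truncating SHA-256 (a stand-in for ξ), and hypothetical helper names (`xi`, `make_phi`, `station_of`, `simulate`).

```python
import hashlib
import random
from bisect import bisect_right
from fractions import Fraction

BITS = 32
H_MAX = 2**BITS - 1


def xi(packet: bytes) -> int:
    """Stand-in for the regular hash: the first BITS bits of SHA-256 (an assumption)."""
    return int.from_bytes(hashlib.sha256(packet).digest()[: BITS // 8], "big")


def make_phi(hashes):
    """Build phi for sorted station hashes, assuming rectangular impulses."""
    n = len(hashes)
    xs = [0] + list(hashes) + [H_MAX]
    areas = [Fraction(1, 2 * n)] + [Fraction(1, n)] * (n - 1) + [Fraction(1, 2 * n)]
    ys = [Fraction(0)]
    for area in areas:
        ys.append(ys[-1] + area)

    def phi(packet: bytes) -> Fraction:
        u = Fraction(xi(packet), H_MAX)
        i = min(bisect_right(ys, u) - 1, len(xs) - 2)
        return xs[i] + (u - ys[i]) * (xs[i + 1] - xs[i]) / (ys[i + 1] - ys[i])

    return phi


def station_of(value, hashes):
    """Index of the station owning value; values below h_1 wrap to the last station."""
    return (bisect_right(hashes, value) - 1) % len(hashes)


def simulate(n_stations: int, m_packets: int, seed: int = 0):
    """Return (loads under xi, loads under phi) for the same m random packets."""
    rng = random.Random(seed)
    # Step 1: split the hash space into N random parts.
    hashes = sorted(rng.sample(range(1, H_MAX), n_stations))
    phi = make_phi(hashes)
    loads_xi = [0] * n_stations
    loads_phi = [0] * n_stations
    for _ in range(m_packets):
        # Step 2: generate a random packet and hash it with both functions.
        packet = rng.getrandbits(128).to_bytes(16, "big")
        # Steps 3 and 4: route the packet and track the load of each station.
        loads_xi[station_of(xi(packet), hashes)] += 1
        loads_phi[station_of(phi(packet), hashes)] += 1
    return loads_xi, loads_phi
```

Calling `simulate(10, 5000)` returns the two load vectors for one ring; the loads under φ are expected to stay close to m/N, while those under ξ follow the random segment lengths.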

[Figure 3.2: nine panels, titled St. 1 to St. 9, each plotting station load over time.]

FIGURE 3.2: Showing load state (in blue) $|s_k|$ in each station $s_k$ as time grows. In this simulation, hash function ξ is used (normal ring). The green line shows the expected load state (uniform) for each point in time.

Definition 19 (Polar Hash Coverage Plot (PHCP)). Let R = (Ω, r, ξ, φ, ψ) be a ring with ‖Ω‖ = N stations. The Polar Hash Coverage Plot is a set of N vectors in the 2D space:

$$E = \left\{ A_k \cdot e^{i \cdot \omega_k}, \; k = 1 \ldots N \right\}$$

The amplitude of every vector indicates the packet load relative to the station it refers to, while the phase indicates the station’s hash segment amplitude and position in the ring:

$$A_k = \frac{\eta_k}{m} \qquad \omega_k = \frac{\|\Xi(s_k)\|}{h^M} + \omega_{k,0}$$

Where $\omega_{k,0} = \sum_{j=0}^{k-1} \omega_j$ indicates the phase shift due to all stations preceding $s_k$.

Figure 3.1 shows the PHCP of the same simulation in which the same m packets have been sent to the network, with and without load balancing hash function φ in place. The vectors in the second plot (on the right) have roughly the same amplitude in comparison with the first diagram (on the left), indicating that hash function φ is effectively able to provide balancing on the same set of packets across the stations in the ring.

3.1.2 Evaluating load levels per station

Another set of simulations is used to measure the difference between the final load state in each station and the expected one (uniform) after the network has been fed with a certain number of packets.


[Figure 3.3: nine panels, titled St. 1 to St. 9, each plotting station load over time.]

FIGURE 3.3: Showing load state (in blue) $|s_k|$ in each station $s_k$ as time grows. In this simulation set (same as in figure 3.1), hash function φ is used (extended ring). The green line shows the expected load state (uniform) for each point in time.

These simulations also have the objective of showing how the network behaves with and without balancing hash function φ in place. These normal vs. extended ring scenarios are important as they allow us to visually assess the work done by φ in reshaping the load distribution in the network. For this analysis to be effective, it is crucial that both scenarios are evaluated on the exact same set of generated packets. To guarantee this condition, when randomly generating packets, the same seed is used when evaluating the normal and the extended ring during one simulation session.

Figures 3.2 and 3.3 show, respectively, the same simulation session first conducted on the ring and then again on the same ring but extended (φ in place). As time grows, each station reports its load level. In the normal ring, station load levels do not all meet the expected load level $\eta^{\star} = \frac{m}{N}$. On the other hand, when φ is in place (figure 3.3), all station loads tend to match the expected levels.

Remark (Discrete time). This set of simulations is very important as the load state in each station is evaluated at every time instant. In this context, time is considered discrete and time instants are associated with events. The only event being considered here is the generation of a random packet.

3.2 Large-size simulations

This simulation engine was developed for two reasons: getting high fidelity simulation data, and providing an initial implementation of the algorithm. As a direct result, we could deliver the first implementation of the algorithm described in the previous chapter. The main features of this system are the following:

• Being developed in C/C++, the application is very fast at computing regular hashes and at processing φ-hashes.

• Simulations can be run sequentially or in parallel (packet generation).

• The standard Crypto library is used, therefore all generated hashes are real hashes and not simulated quantities.

• Big integers are employed, so no scaling is performed in order to adapt real data to simulation artifacts, hence providing more fidelity to the real scenarios.

Simulation flow. In the context of this simulation effort, several computation and memory intensive runs have been scheduled on a dedicated pool of servers. A detailed description of the infrastructure being used is available in appendix A; here we provide a brief synopsis of how these simulations work:

1. When the engine starts, an initialization phase checks memory and other conditions.

2. Random packets are generated. Packets are generated as random bitstreams of a specific size. Different sizes can be specified, and during one simulation the size can range in a certain interval.

3. Hashes (using ξ and φ) are computed.

4. Routing of packets is performed for each hash by using application ψ.

5. All results are persisted in memory. Data manipulation is then performed in order to extract information of interest.

6. Simulation output files are generated.

7. Post-processing is performed by generating diagrams and aggregated quantities using output files.

3.2.1 Overview

Many simulations have been run, all targeting different network structures and conditions. Before showing results, we need to provide a synopsis of which configurations have been considered in order to understand what was actually simulated. Every simulation run is characterized by the following properties:

• Number of stations in the ring: N. This parameter directly impacts the size of the network.

• Number of generated packets: m.

• Leaf radius r. For all simulations, the radius is unitary: r = 1.

• Packet size S ∈ {100Kb, 1Mb, 3Mb, 10Mb}.

Since the same configuration can be run several times with different seed values, aggregate properties describing one simulation group/batch include:


• Number of simulations in batch: C ∈ N.

• Overall simulation time of the batch: T ∈ R.

Grouping simulations. Simulations conducted in the context of this research can be classified using the parameters described above. The following batches were run:

[Diagram: grid of tiles, one per simulation run. Every tile reports ξ,φ and # together with the packet size S, the number of generated packets m and the number of stations N; some tiles also carry the marker PAR. The tiles group as follows:]

Marker   Packet size S   Generated packets m   Stations N   Number of tiles
(none)   100Kb           13.1M                 10           30
(none)   1Mb             13.1M                 10           10
PAR      1Mb             89M                   30           49
PAR      3Mb             89M                   30           10
PAR      10Mb            89M                   30           1
PAR      1Mb             10M                   50           10
PAR      1Mb             10M                   100          10

The diagram above illustrates the different configurations used to run simulations. Each tile is read as follows: the top-left corner reports the hash functions computed (ξ,φ), the bottom-left corner reports the #-symbol, and the remaining fields report the packet size S, the number m of generated packets and the number N of stations. Tiles come in four types: regular, parallel (marked PAR), intensive, and parallel intensive (marked PAR).

For each simulation, a different seed was used (hence the #-symbol in the bottom-left corner) and both regular and φ-hashes were computed (top-left corner).


[Figure 3.4: six scatter plots. Top row: σ²/η of $h_\xi$ (vertical axis) against σ of $h_\xi$ (horizontal axis). Bottom row: σ²/η of $h_\varphi$ (vertical axis) against σ of $h_\varphi$ (horizontal axis).]

FIGURE 3.4: Plotting standard deviation vs dispersion factor of generated ξ-hashes and φ-hashes during simulation batches (from left to right): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations).

3.2.2 Evaluating the variance of hash segment amplitudes

Two pieces of information were of interest and, accordingly, two different types of data were extracted from every simulation:

1. The statistical variation of regular hash values and φ-hash values, in order to see whether patterns exist.

2. The statistical relation between the distribution of hash segment amplitudes and the distribution of φ-hash values. Since more φ-hashes are routed into a specific segment if that segment has a small amplitude, we want to assess whether special patterns arise in φ-hash values in case of high variance in segment amplitudes.

Figure 3.4 reports possible patterns between variations of regular and φ-hashes. In general we can conclude that φ-hashes have a more localized behaviour, as their variations are more contained than those of regular hashes generated via hash function ξ.

This is expected: if we consider the whole hash space $[0, h^M] \subset \mathbb{R}$, hash function ξ has a uniform distribution over that range; on the other hand, φ is characterized by a distribution which allocates hashes with different probabilities in different sub-intervals of the overall hash range. This last observation is the main reason why we want to investigate the relation between the variance of segment lengths and the variance of φ-hashes.

Hashes and segment amplitudes. As anticipated, the following questions were of interest with regard to the behaviour of φ-hashes and the distribution of hash segment lengths $\|\Xi(s_i)\|, \forall s_i \in \Omega$:

1. If all stations in the ring are arranged in a way such that the distribution of hash segment lengths is approximately uniform, what behaviour should we expect from φ-hashes?


2. If all stations in the ring define very different hash segments (some very wide and some very short), what behaviour should we expect from φ-hashes?

[Figure 3.5: three plots, one per batch, each showing for every simulation the standard deviation σ of HS amplitudes and the standard deviation σ of φ-hashes.]

FIGURE 3.5: Plotting standard deviation of hash segment lengths and standard deviation of φ-hashes during each simulation in batches (from top to bottom): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations).

The diagrams in figure 3.5 try to capture such behaviour and describe it from a statistical point of view. Both questions raised above can be mathematically mapped onto one statistical descriptor which, therefore, becomes of high interest in this context: the standard deviation of hash segment amplitudes and φ-hashes.

By looking at those diagrams, we can observe a very weak trend whereby the variance of φ-hashes tends to be higher when the variance of hash segment lengths is higher. This can be classified as a pattern only with great caution, as the trend is not immediately evident and there are some cases where it does not show up. Our conclusion is that a correlation between segment lengths $\Xi(s_i)$ and hashes $h_\varphi$ is probably present; however, more variables are involved and more investigation in this regard is necessary.

[Figure 3.6: two plots with axes "Station $s_k$" and "Simulations", showing load $\eta_k^{(\xi)}$ (first plot) and load $\eta_k^{(\varphi)}$ (second plot).]

FIGURE 3.6: Plotting station loads $\eta_k^{(\xi)}$ (no balancing) and $\eta_k^{(\varphi)}$ (balanced ring) at the end of four N30 simulations with different seeds.

3.2.3 Evaluating load levels per station

This set of high performance simulations has been used, of course, to verify the quality of the balancing performed by hash function φ.

Figure 3.6 shows station loads in the context of four different simulations with 30 stations. Packets are balanced across stations, and the balancing is evident when comparing loads to simulations where no balancing is performed.

Migration flows

A concept that is extremely important in the context of these simulations and, more generally, in the context of this research effort, is the following:

Definition 20 (Migration flow ξ-φ). Let R = (Ω, r, ξ, φ, ψ) be a ring and p ∈ P a packet. Let $s_i = \psi^{(\xi)}(p)$ be the station where the packet is routed to by using hash function ξ, and let $s_j = \psi^{(\varphi)}(p)$ be the station where the packet is routed to by using hash function φ. The virtual transition that packet p experiences from $s_i$ to $s_j$ is called migration flow.

[Figure 3.7: circular diagram with one sector per station, S0 to S29, each sector carrying a 0% to 100% scale.]

FIGURE 3.7: Migration flows in an N30 ring.

Definition 20 is the foundation of a differential analysis conducted during all simulations. By collecting all hashes and mapping them to stations, it is possible to extract all migration flows at the end of a simulation. There are as many flows as generated packets, so the final step is to aggregate this information by counting identical flows.
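The aggregation step described above can be sketched as follows. This is a minimal illustration, not the simulator used for the thesis: the two routing callables and the toy packets are hypothetical stand-ins for ψ(ξ) and ψ(φ).

```python
from collections import Counter

def migration_flows(packets, route_xi, route_phi):
    """Count how many packets virtually move from each xi-selected
    station to each phi-selected station (one flow per packet)."""
    flows = Counter()
    for p in packets:
        flows[(route_xi(p), route_phi(p))] += 1
    return flows

# Toy example (assumption): 10 packets, 3 stations, arbitrary routing rules.
packets = range(10)
route_xi = lambda p: p % 3
route_phi = lambda p: (p + 1) % 3 if p % 2 else p % 3

flows = migration_flows(packets, route_xi, route_phi)
for (src, dst), count in sorted(flows.items()):
    print(f"s{src} -> s{dst}: {count} packet(s)")
```

Entries with identical source and destination correspond to packets that are not moved by the balancing function.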

Visualizing migration flows is difficult using tables, thus circo-diagrams4 are employed instead. Figure 3.7 shows migration flows for a simulation with 30 stations and 1M generated packets. The diagram describes how packets virtually move from one station to another when hash function φ is used for routing. Thanks to these diagrams it is possible to state the following:

Proposition 13 (Packet migrations). Let R = (Ω, r, ξ, φ, ψ) be a ring, then stations behave in two different ways:

• Wide-coverage stations are more likely to donate packets to other stations.

• Narrow-coverage stations are more likely to accept packets from other stations.

4Circo-diagrams have been generated by using software Circos: http://circos.ca/. To read these diagrams, read the on-line documentation.


It is also worth noticing that all migration flows are localized in adjacent stations when considering one node in the ring. This pattern is interesting because, contrary to expectations, the re-arrangement performed by φ does not move packets far from their ξ-selected station.


Chapter 4

System API

In this chapter we describe the interactions the system exposes to the end user for storing and retrieving data, and the protocols used in the ring to provide those services.

In chapter 1 we covered the system architecture. As we recall, the end user interacts with the storage system in order to take advantage of its services; what happens on the other side, in the ring, is not known to the user. The questions we want to answer are: "What happens when the user sends data to be stored?" and "How can the user retrieve data previously stored?".

The overall system exposes a minimal API consisting of 4 primitives:

1. Store By means of this functionality, the user can transmit a DU and have it persisted in the system. Upon invoking this API, the user typically receives a token in return, which is used later to retrieve that same data.

2. Retrieve By invoking this API, the user can retrieve data previously stored in the system. If the operation is successful, the user receives the data in return.

3. Remove The user can remove previously stored data by invoking this primitive. No data is returned except a status code indicating whether the removal was successful, plus optional additional information (such as the total amount of data deleted).

4. Update This functionality is used to update existing data. It ultimately results in the sequential application of a remove and a store invocation.

We are going to examine in detail the first 2 primitives, store and retrieve, as the others are largely based on them.

4.1 Storing data

The API for transmitting a DU and having it persisted in the system requires the user to provide the byte stream as input. An identification token is returned if the call is successful:

t_token store(stream_t& input)

The moment the user invokes store on DU p ∈ P, 2 things happen in sequence:

1. The token is computed by calculating the hash of the input packet h = ξ(p).


2. DU p’s size is compared against fragmentation threshold c ∈ N. If p exceeds the threshold, |p| > c, the packet is fragmented into smaller units.

The token is returned to the user if the storage process is successful. Given the DHT and content addressing, it is possible to retrieve the DU later by using that specific hash.

4.1.1 Packet fragmentation

The fragmentation process is necessary for some important reasons:

• The ring has a high level of control traffic. Given the DHT and routing algorithm ψ, many transmissions occur between contiguous stations in the network. In order to reduce the latency of communications, the network tends to favour quantity over size, thus allowing many packets to be exchanged as long as their size is small enough.

• When a packet reaches a station which is not its final destination, another routing iteration is necessary. This means that another communication must be performed with one of the contiguous stations in the ring. However, if that link is in use because another packet is being transmitted, the incoming one must be queued. In order to reduce the time packets spend traversing stations while hopping, packet size is kept reasonably low.

• By dealing with small DUs, it is possible to ensure better balancing over time. If data were stored without being broken down into smaller pieces, units stored across stations would not have the same size, which goes against one of the assumptions of our balancing algorithm: all data units have the same size.

As a packet is submitted for storage, the fragmentation process breaks it down into n smaller units:

$$ n = \left\lceil \frac{|p|}{c} \right\rceil $$

In case the original DU is fragmented into smaller units, the final returned token is still the hash of the original packet. Later on we will see that, by using the same token, all fragments can be retrieved. For this to happen, it is necessary to create fragments in a specific way:

Definition 21 (Packet fragmentation). Given packet p ∈ P and fragmentation threshold c ∈ N (number of bytes), application ζ : P ↦ 2^P returns the set of fragments p1, p2, . . . , pn given input p. Every returned fragment has the following format:

1. The hash of the original packet.

2. The sequence number of the fragment (needed when re-constructing the packet).

3. The hash of this fragment.

4. The data stream (up to c bytes).

As shown in figure 4.1.
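The fragmentation rule of definition 21 can be sketched as below. This is only an illustration under stated assumptions: SHA-256 stands in for hash function ξ, the `Fragment` class and the `fragment`/`reassemble` helpers are hypothetical names, and "the hash of this fragment" is taken to mean the hash of the fragment's data.

```python
import hashlib
import math
from dataclasses import dataclass

def xi(data: bytes) -> bytes:
    """Stand-in for hash function xi (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()

@dataclass(frozen=True)
class Fragment:
    parent_hash: bytes  # 1. hash of the original packet
    seq: int            # 2. sequence number, starting at 1
    own_hash: bytes     # 3. hash of this fragment's data
    data: bytes         # 4. data stream, up to c bytes

def fragment(p: bytes, c: int) -> list[Fragment]:
    """Split p into n = ceil(|p| / c) fragments of at most c bytes."""
    n = math.ceil(len(p) / c)
    parent = xi(p)
    fragments = []
    for k in range(n):
        chunk = p[k * c:(k + 1) * c]
        fragments.append(Fragment(parent, k + 1, xi(chunk), chunk))
    return fragments

def reassemble(fragments) -> bytes:
    """Rebuild the packet by ordering fragments on their sequence number."""
    return b"".join(f.data for f in sorted(fragments, key=lambda f: f.seq))

packet = bytes(range(256)) * 10      # 2560 bytes
parts = fragment(packet, 1000)       # threshold c = 1000 bytes
print(len(parts), [len(f.data) for f in parts])
```

In the protocol only packets with |p| > c are fragmented; the helper does not enforce that check, so the caller is expected to apply it.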


[Figure 4.1 layout: fields ξ(p), Seq. k, ξ(pk), . . . form the control information, followed by the Data field of at most c bytes.]

FIGURE 4.1: Data unit format.

Remark. The frame format for non-fragmented packets, called whole packets, is the same; however, the first field (parent hash) is null (all zeros) and the sequence number is −1, which is the value to inspect in order to distinguish whole packets from fragments.

4.1.2 Routing

After the fragmentation phase, which generates no fragments if the original packet’s size does not exceed threshold c, every fragment pk ∈ P is sent to the ring to be routed according to routing function ψ based on balancing hash function φ.

As anticipated in chapter 1, the ring is never directly accessed by users. Proxies are employed instead. A proxy station serves as an intermediary that balances access to the ring and hides all the ring’s stations from the outside world. When the store primitive is invoked, the system, from the user’s computer, sends a store request SReq to one of the known available proxies. When receiving a request, the proxy decides which station of the ring to pick for letting SReq enter the network. The decision is made by a balancing algorithm based on the link usage of each known station: the proxy’s goal is to avoid overloading one station with incoming traffic.
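The text does not specify the proxy's balancing algorithm. One simple policy that matches its goal, shown here purely as an assumption, is to pick the known station with the lowest reported link usage; the function name and the usage figures are hypothetical.

```python
def pick_entry_station(link_usage: dict[str, float]) -> str:
    """Return the known station with the lowest reported link usage."""
    return min(link_usage, key=link_usage.get)

# Hypothetical link usage (fraction of capacity in use) known to a proxy.
usage = {"s3": 0.70, "s11": 0.20, "s25": 0.50}
entry = pick_entry_station(usage)
print(entry)
```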

Once the request reaches one of the stations, algorithm 1 guarantees that a station is found for the packet. The system client ensures that all packets undergo the same process. If every fragment is successfully stored, the original packet’s hash is returned to the user as a token for retrieving all fragments.

Asynchronous communications For performance reasons, the best approach is to make every communication asynchronous. This means that when a node (one of the stations, a proxy or the user client node) sends an SReq, it does not keep the connection open, waiting for the final response, until the packet is successfully routed. It is much better to send the request as a datagram transmission. The sender will receive a store response SRes when its request has been processed. Every intermediate node that forwards the request waits for the response and, after receiving it, constructs its own response for the node that sent it the request in the first place. This increases the available bandwidth, as links are not held for long periods.

Employing asynchronous transmissions complicates the communication protocol but allows better performance. One of the complications is represented by timers, which every station has to implement in order to raise an error when the response is not delivered within a reasonable time (request transmission failure). In a synchronous scheme, timers are handled by the transmission protocol (e.g. TCP/IP) transparently to the caller; in asynchronous scenarios, however, the station has to implement timers on its own for each sent request. Figure 4.2 shows both communication schemes.

[Figure 4.2 diagram: sequence diagrams with lifelines User, Proxy, Entry station and Dst station. In both the synchronous and the asynchronous scheme, store(stream) is propagated as SReq and the user finally receives the token; in the asynchronous scheme the outcome travels back as separate SRes messages.]

FIGURE 4.2: Synchronous vs. asynchronous communication model when storing a single packet.
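The per-request timer mentioned above can be illustrated with a short sketch. It uses Python's asyncio rather than the .NET/WCF stack of the actual implementation, and the station coroutines, delays and return values are hypothetical.

```python
import asyncio

async def send_request(transmit, timeout_s: float) -> str:
    """Send a request and wait for its response, reporting a
    transmission failure if no response arrives within timeout_s."""
    try:
        return await asyncio.wait_for(transmit(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "timeout"

async def responsive_station() -> str:
    await asyncio.sleep(0.01)   # simulated routing delay
    return "success"

async def silent_station() -> str:
    await asyncio.sleep(1.0)    # response arrives far too late
    return "success"

async def demo():
    ok = await send_request(responsive_station, timeout_s=0.2)
    failed = await send_request(silent_station, timeout_s=0.05)
    return ok, failed

results = asyncio.run(demo())
print(results)
```

The sender owns the timer, so a lost SRes surfaces as an explicit failure instead of a connection that stays open indefinitely.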

As part of the effort in writing tests and simulations of the algorithm, an actual implementation of the ring has been developed in Microsoft .NET using the WCF1 communication library. Today, asynchronous transmissions are easy to implement, as the IT industry has moved in that direction, providing developers with the APIs required to implement such protocols.

4.2 Retrieving data

The other side of the story, a little more complicated, is about getting data back. We are going to cover this topic by considering the 2 possible scenarios:

1. Retrieving a whole packet.

2. Retrieving a fragmented packet.

In both cases, the process always starts with the same set of operations: the user holds a token received when storing data in the past and uses it to retrieve that stream through the retrieve primitive:

stream_t& retrieve(t_token t)

[Figure 4.3 layout: fields ξ(p), Total n, followed by the fragment hashes ξ(p1) . . . ξ(pn).]

FIGURE 4.3: Packet info format.

1Microsoft’s Windows Communication Foundation: a library consisting of a collection of highly customizable and flexible network protocols.

Retrieving a whole packet As soon as the user invokes the retrieve primitive, a retrieve request RReq message is built through the proxy and routed in the ring. The token is the hash of the original DU, so, by following the DHT retrieval, the request is routed to the destination station. Once there, the station searches its database for the stored stream.

To make the packet search fast, a dictionary can be used inside every station. Since DUs are saved according to the format shown in figure 4.1, the hash of the stream is always available and can be used as the key for looking up that specific packet when an RReq is routed to a station.

As soon as the stream is retrieved, it is sent back to the request originator: the end user, who receives the DU as the result of the retrieve call.

Retrieving a fragmented packet When the packet was fragmented at the time it was stored in the ring, a problem occurs. When the request is sent and reaches the destination station based on the token (the original packet’s hash), nothing is found. The original packet has been fragmented and each fragment has a different hash, completely unrelated to the token (since we use cryptographic hashing, there is no way to get the original stream from the hash).

One way to solve this issue is to store all the fragments’ hashes in the token, which would become an array of hashes and grow in size. Although this solution might work, it is not desirable: the user should still be able to locate all fragments just by having the original packet’s hash. In order to do so, we need to modify the store protocol.

After a packet p ∈ P has been fragmented into n units pk ∈ P, and before transmitting them, a packet info unit is constructed:

Definition 22 (Packet info DU). Given packet p ∈ P such that its size exceeds the fragmentation threshold, |p| > c, a special data unit is built to track information about it and all its fragments. The stream contains the following fields:

1. The original packet’s hash h = ξ(p).

2. The number of fragments n.

3. The hash of each single fragment pk (k = 1 . . . n) in order (from first to last).

As shown in figure 4.3.

In the revised store protocol, before calling store on each single fragment, the same primitive is called on the packet info DU, which is built right after computing all hashes (original packet and its fragments). This initial call routes the packet info to a station by using the original packet’s hash.

Thanks to this approach, when retrieving a DU, the first RReq reaches the station where the system finds the packet info. Using that, the system then issues n retrieve calls in order to fetch each single fragment. After getting all streams, the original DU can be rebuilt; the order in which to combine the fragments is given by the sequence number in each retrieved fragment.

[Figure 4.4 diagram: sequence diagram. retrieve(token) is propagated as RReq to the station, which performs a lookup and returns the pkt-info through RRes messages. Then, in a loop over all fragments pk, retrieve(token pk) returns fragment pk; finally aggregate(pk) yields packet p.]

FIGURE 4.4: Sequence diagram showing the retrieval protocol in case of a fragmented packet.

As shown in figure 4.4, retrieving and rebuilding a stored packet might take some time: not only the ring size influences this latency, but the number of fragments plays a significant role too. It goes without saying that a larger packet requires more time to be fully retrieved.
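The revised store protocol and the two retrieval scenarios can be summarised in a toy model. Everything here is a simplifying assumption: a single dictionary stands in for the whole ring (no routing, no proxies), SHA-256 stands in for ξ, and fragment order is taken from the packet info list rather than from sequence numbers.

```python
import hashlib

def xi(data: bytes) -> bytes:
    """Stand-in for hash function xi (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()

ring = {}  # hypothetical stand-in for the DHT: hash -> (kind, content)

def store(p: bytes, c: int) -> bytes:
    """Store p and return its token, the hash of the original packet."""
    token = xi(p)
    if len(p) <= c:
        ring[token] = ("whole", p)
        return token
    chunks = [p[i:i + c] for i in range(0, len(p), c)]
    # The packet info is stored first, under the original packet's hash.
    ring[token] = ("info", [xi(chunk) for chunk in chunks])
    for chunk in chunks:
        ring[xi(chunk)] = ("fragment", chunk)
    return token

def retrieve(token: bytes) -> bytes:
    """Fetch a whole packet, or rebuild it from its fragments."""
    kind, content = ring[token]
    if kind == "whole":
        return content
    # 'content' lists the fragment hashes in order: one lookup each.
    return b"".join(ring[h][1] for h in content)

big = b"abc" * 1000                  # 3000 bytes, above the threshold
token = store(big, 1024)
restored = retrieve(token)
print(len(ring[token][1]), restored == big)
```

A fragmented packet costs 1 + n lookups on retrieval, which is the latency effect discussed above.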


Chapter 5

Dynamic conditions

In the previous chapters we have described and analyzed the behaviour of the ringunder static conditions.

Definition 23 (Dynamic conditions). Let R = (Ω, r, ξ, φ, ψ) be a ring. We say the network is under dynamic conditions when any of its characterizing elements changes:

1. Stations si ∈ Ω. Stations might disconnect or new stations might extend the ring. This possibility also covers the event of stations faulting and going off-line.

2. Leaf set radius r changes.

3. Any of the connections in the overlay ring changes.

4. Hash function ξ or φ changes.

5. Routing strategy ψ changes.

Static conditions are the opposite of dynamic: the ring does not change. So, why do we need to talk about dynamic conditions? Why should the ring change?

Ideally, if well designed, the system can be configured with a certain number of stations and a certain radius, and work optimally under static conditions. However, today every system is exposed to dynamic conditions, as many different planned or unplanned events may occur:

1. One station enters a faulty state. This can happen for any reason, such as a hardware issue (e.g. hard disk failure, data corruption, etc.) or a software problem (e.g. system failure, emergency system reboot, etc.).

2. Stations can experience network issues. This can cause either a permanent offline state or a temporary one, if machines have a way to automatically recover from these types of failures.

3. More stations are required because the system needs to serve a higher volume of data (planned scale-up).

4. One or more stations need to undergo planned or unplanned maintenance.

5. Security related issues force some stations to be pulled away from the ring.

Those enumerated above are only a few possibilities. The point here is that a storage system must take into account such circumstances, which are part of the real world of connected systems.


When dynamic conditions are in place, the ring structure and the balancing algorithm described so far need to be revised in order to avoid performance degradation and, in more critical cases, service outage. We are going to examine the following dynamic cases:

• Scalability The ability of the ring to grow or shrink in a flexible way, causing the least possible performance degradation.

– Station join A station joins the ring causing it to expand.

– Station removal A station is pulled off the ring, causing it to shrink.

• Fault conditions One station experiences internal problems which cause it to become unresponsive.

5.1 Scalability

What happens when a station joins the ring? When such an event occurs, a few operations are needed to re-initialize the ring:

1. The new station needs to build its leaf-set in order to identify its successors and predecessors.

2. All nodes in the neighbourhood of the new station must re-arrange their leaf-sets in order to update their successors or predecessors, depending on the leaf-set radius r.

3. Balancing hash function φ must be re-designed, as the ring has changed. Since we have more stations, we have different hash segments, and this impacts function φ’s implementation.

The first 2 operations are infrastructural and can be addressed through well-known protocols currently employed in DHT-based networks; since the problem is not new, we do not discuss it further. The 3rd point, though, is a different story, as it poses a new situation inside our network architecture: stations must be synchronized to use a new balancing hash function φ.

Lemma 12 (Balancing hash function φ’s outdatedness upon ring scaling). Let R = (Ω, r, ξ, φ, ψ) be a ring with N = ‖Ω‖ stations. Consider, at any point in time, one station joining R or being pulled out of it, causing the number of stations to become N′ = N ± 1. Then hash function φ is no longer suited for balancing the ring.

Proof. Immediate by considering proposition 12. According to it, hash function φ depends on the number of stations in the ring; if that changes, the hash segments affecting Fφ’s codomain change too, hence causing the original hash function φ not to reflect the new state of the network anymore.

The main problem we want to face here is the process of synchronising stations in the ring, which consists of:

1. Computing new balancing hash function φ′.

2. Updating all stations to use new hash function φ′.

3. Rearranging packets across stations to ensure the balancing state of the ring.


The last point is crucial. We assume that a station joining the ring comes with no packets stored in it, since any other scenario would not make sense. When the new station s∗ is on-line in the ring, the load distribution changes from:

$$ \Sigma = (|s_1|, |s_2|, \ldots, |s_N|), \quad |s_i| \approx m \cdot n^{-1}, \ \forall i = 1 \ldots N $$

to this form:

$$ \Sigma' = (|s_1|, |s_2|, \ldots, |s^*| = 0, \ldots, |s_N|), \quad |s_i| \approx m \cdot n^{-1}, \ \forall s_i \in \Omega \setminus \{s^*\} $$

This implies that the ring is not balanced anymore, hence the last point mentioned in the synchronization process introduced earlier, which looks more and more expensive as we investigate the challenges introduced by the dynamic conditions just taken into consideration.

If the operations required to synchronize the ring get too expensive (time-wise), the proposed algorithm has a serious scalability issue, as the network would adapt badly to dynamic conditions. Our purpose is, therefore, to understand how expensive it actually is to scale the ring.

Given our analysis so far, we have been able to break the scalability issue down into 2 sub-problems:

1. Updating hash function φ to φ′ and aligning all stations in the ring to use it.

2. Re-arranging existing stored packets across stations in order to bring the load distribution in the ring back to its balanced state.

We are going to look at these two problems separately and evaluate the final performance impact later.

Conjecture 1 (Scaling overall impact). Let R = (Ω, r, ξ, φ, ψ) be a ring experiencing a scaling process due to one station joining or leaving the network.

• Let τφ ∈ R measure the performance impact (latency) of the process of updating hash function φ to φ′ on all stations in the ring.

• Let τψ ∈ R measure the performance impact of the process of rearranging packets among stations in order to take the ring back to its balanced condition.

• Let τS ∈ R measure the overall latency experienced by the system while carrying out the two operations above in order to scale the ring.

We expect the following inequality to hold:

τS ≤ τφ + τψ (5.1)

Conjecture 1 expresses our expectation that the overall performance impact caused by the two scaling operations does not have to be computed as the sum of the latencies introduced by each of them, as the two operations can be carried out in parallel rather than sequentially. We try to prove this throughout the rest of this chapter.
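As an illustration only, and under the additional assumption (not made in the text) that the two operations are fully independent and start at the same time, the overall latency is that of the slower operation, which is consistent with inequality (5.1) for non-negative latencies:

$$ \tau_S = \max(\tau_\varphi, \tau_\psi) \le \tau_\varphi + \tau_\psi, \qquad \tau_\varphi, \tau_\psi \ge 0 $$

With hypothetical values τφ = 2 s and τψ = 5 s, a fully parallel execution would take 5 s against 7 s for a sequential one. Any dependency between the two operations moves τS somewhere between these two extremes.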


[Figure 5.1 layout: header fields MT, hsrc, ξ(Data), . . . followed by the Data field of at most c bytes.]

FIGURE 5.1: Message format.

5.1.1 Updating φ

Hash function φ is a global contract in the network.

Definition 24 (Global contract). A variable or, more generally, a piece of information shared by all stations in the ring. The main assumption is that all stations keep an exact copy of the same value.

The protocol we need to design for updating hash function φ to φ′ on all stations is, more generally speaking, a protocol to update a global contract in the network. Since a DHT is designed for distributed scenarios, every condition implying a certain level of centrality causes the system to perform worse, and this is the case here. The PA is based on a distributed approach; however, the balancing process is carried out through a global contract, namely hash function φ. This explains why we should expect this process to be relatively expensive.

Broadcasting in DHT

In order to have a global contract updated, we need to broadcast a message containing the updated contract on the network, because every single node must be reached. In terms of the API of the balancing system, the message to send is PUM (φ Update Message). The cost of updating φ is equal to the cost of broadcasting a message in the ring.

Since the broadcast occurs in the context of a network overlay, we need to create a protocol specific to message broadcasting. Generally speaking, we can define a message format which all transmissions between stations in the ring must comply with. The message must contain, at least, the following information:

1. Message Type An enumeration indicating the type of communication (e.g. RReq, RRes, etc.).

2. Source hash The hash of the source station (not strictly needed, since a station receiving a message knows its neighbours and is able to generate the hash of their IP addresses, but useful for performance reasons).

3. Body hash The hash of field Body. This is used for routing the message (destination hash).

4. Body The content to transmit.
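The four fields listed above map naturally onto a small record type. The sketch below is an assumption about one possible encoding, not the format used by the thesis implementation: the class and enum names are hypothetical, and SHA-256 again stands in for ξ.

```python
import hashlib
from dataclasses import dataclass, field
from enum import Enum

def xi(data: bytes) -> bytes:
    """Stand-in for hash function xi (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()

class MessageType(Enum):
    SREQ = "SReq"
    SRES = "SRes"
    RREQ = "RReq"
    RRES = "RRes"
    PUM = "PUM"

@dataclass(frozen=True)
class Message:
    msg_type: MessageType                  # 1. message type
    source_hash: bytes                     # 2. hash of the source station
    body: bytes                            # 4. content to transmit
    body_hash: bytes = field(init=False)   # 3. derived from the body

    def __post_init__(self):
        # The body hash doubles as destination hash for routing.
        object.__setattr__(self, "body_hash", xi(self.body))

msg = Message(MessageType.PUM, xi(b"10.0.0.1"), b"updated contract")
print(msg.msg_type.value, msg.body_hash.hex()[:16])
```

Deriving the body hash inside the record keeps the two fields from drifting apart when a message is rebuilt at an intermediate station.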

A possible implementation of the broadcasting protocol has to operate at station level. Since the architecture is distributed, we cannot employ any centralized entity.


Algorithm 2 Message broadcasting in the ring

Require: Ring initialized
Require: Station si has ID hi = ξ(si)
Require: Station si has an associated leaf set Λ(si)
Require: Station si receives packet p ∈ P from station ssrc ∈ Λ(si)
Require: Global variable hp ∈ N is available
Require: Global variable d ∈ {−1, 0, 1} is available and initially set to 0

function BROADCAST(p ∈ P)
    hsrc ← ξ(ssrc)                     ▷ Actually computed or taken from message
    if hsrc < hi then
        d∗ ← −1                        ▷ Message from LLS
    else
        d∗ ← 1                         ▷ Message from ULS
    end if
    if ξ(p) = hp ∧ d + d∗ = 0 then     ▷ Same message from opposite side of ring
        return                         ▷ Abort. End condition reached
    else if ξ(p) = hp then             ▷ Duplicate message from same side of ring
        return                         ▷ Don’t send again
    end if
    hp ← ξ(p)
    Λ ← ∅
    if hsrc < hi then
        Λ ← ΛU(si)                     ▷ Message from LLS =⇒ Send to ULS
    else
        Λ ← ΛL(si)                     ▷ Message from ULS =⇒ Send to LLS
    end if
    for s ← Λ do
        Send p to s
    end for
end function

Stations can recognize this type of communication by inspecting the content of the message; a possible solution is using a flag field in the message or, better, using a special value in the destination address field1.

Lemma 13 (Message broadcasting complexity). Let R = (Ω, r, ξ, φ, ψ) be a ring with N = ‖Ω‖ stations. The cost of transmitting a message in broadcast, in the best-case scenario, is:

$$ \Theta^*_B = \left\lceil \frac{N}{2r} \right\rceil $$

where Θ∗B is expressed in number of message hops2.

Proof. Without loss of generality, let sA ∈ Ω be the station initiating the broadcast transmission in the ring. As soon as a station receives a broadcast message, it consumes the content and then forwards it to the side of its own leaf-set opposite to the node it received the message from, as per algorithm 2. If sA starts the protocol by sending the message to only one side of its own leaf-set, then the maximum number of hops required to cover the whole ring is:

$$ \Theta_B = \left\lceil \frac{N}{r} \right\rceil $$

because one station forwards the message in one go to r neighbours. However, initiator sA can be smarter and send the message to all nodes in its own leaf-set (both sides). This triggers a symmetric chain on both sides of the ring, thus leading to the thesis, as the best-case scenario is when all messages travel at the same speed and the last transmission occurs at the very opposite side of the ring (the hypothesis is that no delayed transmission occurs).

1Usually protocols use the all-1 string to indicate a broadcasting address.

2A hop, in the scenario of message routing, is a single direct transmission from one node to another.
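The hop counts in lemma 13 and its proof can be checked against a small idealised simulation. The model below is an assumption-laden sketch, not algorithm 2 itself: stations are numbered 0 to N − 1, transmissions proceed in synchronous rounds with no losses, and the travel direction is tracked explicitly instead of being derived from hash comparisons.

```python
import math

def broadcast_hops(n_stations: int, r: int, both_sides: bool = True) -> int:
    """Synchronous rounds needed for a broadcast started by station 0
    to reach every station of a ring with leaf-set radius r."""
    reached = {0}
    # Each frontier entry is (station, direction the message travels in).
    frontier = [(0, +1), (0, -1)] if both_sides else [(0, +1)]
    hops = 0
    while len(reached) < n_stations:
        hops += 1
        next_frontier = []
        for station, direction in frontier:
            targets = [(station + direction * k) % n_stations
                       for k in range(1, r + 1)]
            new = [t for t in targets if t not in reached]
            reached.update(new)
            # Newly reached stations forward away from the sender.
            next_frontier.extend((t, direction) for t in new)
        frontier = next_frontier
    return hops

two_sided = broadcast_hops(30, 2)
one_sided = broadcast_hops(30, 2, both_sides=False)
print(two_sided, math.ceil(30 / (2 * 2)))
print(one_sided, math.ceil(30 / 2))
```

In this model the simulated count never exceeds the corresponding bound, since each round extends the covered arc by r stations per active side.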

Finger tables A well-known routing-enhancing technique, often used in DHTs, is the employment of finger tables. Briefly, it consists in arranging leaf-sets in the ring in such a way that the LLF is empty and the ULF contains the successors at relative positions 2^0, 2^1, 2^2, and so on until reaching 2^l. This way of linking stations implies a higher cost from a control point of view, because it takes more time to re-arrange those links at initialization time and when dynamic conditions are in place (e.g. one station joining or leaving the ring).

On the other hand, this pattern also ensures better performance from a routing standpoint, hence guaranteeing an even better complexity than the one considered in lemma 13 in message-broadcasting scenarios.

So, as we can see, the cost of updating global contract φ is acceptable, and many well-known approaches from the literature can be considered. Therefore, we do not detail this issue any further.

5.1.2 Load re-arrangement

The part of the scaling cost we are most worried about is the re-arrangement of packets. This operation is not required just from a balancing point of view: there is a more critical aspect which needs to be addressed as soon as one station joins the ring, namely packet retrieval.

Let us consider a scenario where station s∗ has joined the ring and hash function φ has been updated. Assume that s∗’s predecessor is now station si. If no load re-arrangement is performed, then RReq messages targeting a packet p whose hash hp = ξ(p) is now covered by s∗, hp ∈ Ξ(s∗), will not find it, as the packet is still stored in si, the station that was covering hp before the ring scaled up.
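The retrieval miss described above can be reproduced with a toy coverage model. The sketch makes several assumptions: small integers replace real hashes, each station covers the hashes from its own ID up to the next station's ID, hashes below the first ID wrap to the last station, and the balancing function φ is left out entirely.

```python
import bisect

def owner(station_hashes: list[int], h: int) -> int:
    """ID of the station whose segment contains hash h.
    station_hashes must be sorted; index -1 wraps to the last station."""
    idx = bisect.bisect_right(station_hashes, h) - 1
    return station_hashes[idx]

before = [10, 40, 70]        # ring before the join
after = [10, 25, 40, 70]     # station with ID 25 has joined
packet_hash = 30

stored_at = owner(before, packet_hash)   # where the packet was stored
looked_up_at = owner(after, packet_hash) # where a later RReq is routed
print(stored_at, looked_up_at)
```

The packet stays in the station that stored it, while requests issued after the join are routed to the new station; until the packet is moved, the lookup fails.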

The question we want to answer is: "How badly is the retrieve primitive impacted by the ring scaling up?". The example we just considered suggests that only a portion of the ring is impacted by the scale-up, so the packet re-arrangement should only occur between 2 stations; however, this is something that needs to be proved.

Theorem 14 (Load re-arrangement upon scale-up by 1 station). Let R = (Ω, r, ξ, φ, ψ) be a ring. Let s∗ ∈ Ω be a station joining the ring, causing hash function φ to be updated to φ′ on all stations. Let us also consider station si now becoming s∗’s predecessor, so that the hash segment of s∗ is Ξ(s∗) = [h∗, h^M_i], assuming that h∗ = ξ(s∗). Then the packet re-arrangement effort required to make all packets in the new network retrievable and to re-balance the ring impacts all stations in the network.


Proof. Recalling how hash function φ works, as described in section 2.3.3, we need to understand whether the joining of a station causes the φ segments in its domain to change boundaries (see figure 2.4). We can visualize the impact by considering the domain mapping diagram under dynamic conditions. We consider, for simplicity and without loss of generality, si = s1:

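[Domain mapping diagram. Before the join: the ξ space [0, hM] is split by station hashes h1, h2, . . . , hk, . . . , hN−1, hN into segments of length h1, h2 − h1, . . . , hN − hN−1, hM − hN, shown against φ’s domain [0, 1], whose boundaries 1/(2N), 1/(2N) + 1/N, . . . , 1/(2N) + k/N, . . . , 1/(2N) + (N−1)/N delimit segments of amplitude 1/(2N), 1/N, . . . , 1/N, 1/(2N). After the join: h∗ is inserted between h1 and h2, giving the new ξ segments h∗ − h1 and h2 − h∗, while the boundaries in [0, 1] become 1/(2(N+1)), 1/(2(N+1)) + 1/(N+1), . . . , 1/(2(N+1)) + k/(N+1), . . . , 1/(2(N+1)) + N/(N+1), with inner segments of amplitude 1/(N+1), one of which is the new φ segment.]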

As it is possible to see, additional station s∗ causes only one change in the ξ space (φ’s codomain), but it causes all φ segments to resize in order to make room for an additional interval of amplitude 1/(N+1).
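The resizing of the φ segments can be made concrete by computing the segment boundaries before and after a join. The sketch assumes the boundary positions 1/(2N) + k/N read from the domain mapping diagram; the helper name and the choice N = 4 are illustrative.

```python
from fractions import Fraction

def phi_boundaries(n: int) -> list[Fraction]:
    """Boundaries 1/(2n) + k/n, k = 0 .. n-1, of the phi segments in [0, 1]."""
    return [Fraction(1, 2 * n) + Fraction(k, n) for k in range(n)]

before = phi_boundaries(4)   # ring with N = 4 stations
after = phi_boundaries(5)    # same ring after one station joins
shared = set(before) & set(after)

print([str(b) for b in before])
print([str(b) for b in after])
print("boundaries left in place:", len(shared))
```

No boundary keeps its position in this example, which is why the join affects every φ segment rather than only those next to the new station.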

As we can see, our initial assumption was unfortunately not quite right. The whole domain of hash function φ is impacted and, possibly, all packets need to be re-routed according to new hash function φ′. Nonetheless, we still lack basic information that can tell us how bad the re-arrangement effort really is, such as:

1. Are packets re-routed to new stations completely unrelated to the original one? Or is there a pattern?

2. Do all packets require re-routing? Is there a percentage of them that remains in the current station when hash function φ transitions to φ′?

These two questions are crucial to evaluate the cost of the re-arrangement effort, so more investigation is needed.

Lemma 15 (Station transition direction upon packets re-arrangement). Under the same hypothesis and conditions of theorem 14, any packet p ∈ P stored in any station sj ∈ Ω of the ring, if moved because of the re-arrangement, is moved either:

• To any of sj’s successors if sj ≺ s∗.

• To any of sj’s predecessors if s∗ ≺ sj.

Proof. To show this, we consider the ring in the 2 different configurations (before and after the scale-up). The diagram below shows hash function φ’s domain in both conditions (N stations at the bottom and N + 1 at the top), and also plots the location of 2 hashes therein.


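[Diagram: φ’s domain for ‖Ω‖ = N, with boundaries 1/(2N) + (k−1)/N, 1/(2N) + k/N, 1/(2N) + (k+1)/N delimiting segments of amplitude 1/N assigned to stations . . . , sk−2, sk−1, sk, sk+1, . . . , compared with φ’s domain for ‖Ω‖ = N + 1, with boundaries 1/(2(N+1)) + (k−1)/(N+1), 1/(2(N+1)) + k/(N+1), 1/(2(N+1)) + (k+1)/(N+1), 1/(2(N+1)) + (k+2)/(N+1) delimiting segments of amplitude 1/(N+1) assigned to stations . . . , sk−2, sk−1, s∗, sk, sk+1, . . . . Two hashes h′ and h′′ are marked on both.]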

As we can see, hash h′ initially falls in station sk−1, but after the transition it ends up in station s∗’s coverage. In the same way, hash h′′ initially falls in station sk+1, but after the transition it ends up in station sk’s coverage. The formulation of the lemma implies that packets can also remain in the same station. This is indeed possible, as the diagram shows regions of the top and bottom hash spaces which have values in common.

We now know that a minimal pattern is present when re-routing packets. However, lemma 15 does not provide much information. A more interesting result can be stated but, before that, we need to introduce a new quantity:

Definition 25 (Packet’s station transition delta). Under the hypothesis of dynamic conditions originating from the ring scaling up by one station, let si and sj be the original station and the new station (after re-arrangement) for any packet p; then quantity ∆(p) ∈ ℤ represents the signed number of stations packet p had to be moved across:

\[
\Delta(p) =
\begin{cases}
i - j & \text{if } |i - j| \le \frac{N}{2} \\
\operatorname{sign}(i - j) \cdot N - i + j & \text{otherwise}
\end{cases}
\]

∆(p) tells whether a packet was moved from its original station or not (∆(p) = 0), and also the direction of the move (∆(p) < 0 or ∆(p) > 0). The value of ∆(p) for each packet is the subject of the next important result:

Theorem 16 (Packets station transition delta upon scale-up by 1 station). Under the hypothesis and conditions of theorem 14, the transition delta ∆(p) of any packet p ∈ P stored in any station si ∈ Ω of the ring is, at most, unitary in absolute value: |∆(p)| ≤ 1.

Proof. The theorem states that if a packet is moved, it is moved to one of the two directly contiguous stations. To prove this statement, we re-formulate the thesis in an equivalent form. In conjunction with lemma 15, we need to prove that:

1. A packet hosted in a station preceding s∗ is re-routed, at most, to its immediate successor.

2. A packet hosted in a station preceded by s∗ is re-routed, at most, to its immediate predecessor.

We will first prove the first point, and then the second by considering it as the mirrored condition of the former.

Let us consider a linear bounded real space divided into N ∈ ℕ equal parts. Every part is marked with an identifying number k = 1 … N. Then we let N rise to N + 1 and assume that every existing segment shrinks in order to make space for segment N + 1 which is, therefore, added as the last one. This scenario abstracts the condition where the φ coverage of each station shrinks to lower φ values due to s∗ joining the ring, in the specific case where all stations being considered are predecessors (down to s1) of s∗.

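[Diagram: the same real interval partitioned into N equal parts (row [N], boundaries at k/N, (k+1)/N and (k+2)/N, segments of amplitude 1/N) and into N + 1 equal parts (row [N + 1], boundaries at k/(N+1), (k+1)/(N+1), (k+2)/(N+1) and (k+3)/(N+1), segments of amplitude 1/(N+1)), with point a marked.]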

Without loss of generality, we consider a point a ∈ [0, 1] ⊂ ℝ and re-express the thesis as follows: "Is it possible to find any combination of a, k and N such that, after the shift from N to N + 1, a falls into a segment further than the successor of its original one?". Formally, this question is stated as follows:

\[
\exists\, a \in [0, 1] \subset \mathbb{R},\ N \in \mathbb{N},\ N > 0,\ k \in \mathbb{N},\ k = 1 \dots N :
\begin{cases}
a < \dfrac{k+1}{N} \\[4pt]
a \ge \dfrac{k+2}{N+1}
\end{cases}
\]

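If that system of inequalities has no solution, then the thesis is confirmed. By developing both inequalities we get the following:

\[
\begin{cases} a - \frac{k+1}{N} < 0 \\ a - \frac{k+2}{N+1} \ge 0 \end{cases}
\implies
\begin{cases} aN - k - 1 < 0 \\ a(N+1) - k - 2 \ge 0 \end{cases}
\implies
\begin{cases} aN - k - 1 < 0 \\ aN + a - k - 2 \ge 0 \end{cases}
\]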

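By isolating N, we get:

\[
\begin{cases} aN < k + 1 \\ aN \ge k + 2 - a \end{cases}
\implies
\begin{cases} N < \frac{k+1}{a} \\ N \ge \frac{k+2-a}{a} \end{cases}
\;\lor\;
\begin{cases} -k - 1 < 0 \\ -k - 2 \ge 0 \end{cases}
\]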

The second system arises because both members of both inequalities were divided by a, so the case a = 0 must be considered separately. This last system is easily proved to be impossible:

\[
\begin{cases} k + 1 > 0 \\ k + 2 \le 0 \end{cases}
\implies
\begin{cases} k > -1 \\ k \le -2 \end{cases}
\]

Returning to the former system and considering from now on a ∈ (0, 1] ⊂ ℝ, we can develop further and get:

\[
\frac{k+2-a}{a} \le N < \frac{k+1}{a}
\implies \frac{k+2-a}{a} < \frac{k+1}{a}
\implies k + 2 - a < k + 1
\implies a > 1
\]

The system has solutions only for a > 1; however, this contradicts our hypothesis that a ∈ (0, 1] ⊂ ℝ, thus the system has no solutions within the definition boundaries of a, N and k.

We still need to prove the symmetric case of stations that are successors of s∗. However, this can be skipped by observing that such a scenario mirrors the one just proved.
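The first case of the proof can also be checked numerically on the same abstraction. The following is a minimal sketch and not part of the thesis' C/C++ engine: it assumes the unit interval is split into N and then N + 1 equal parts, indexes segments from 0, and samples exact rational points a ∈ [0, 1). The names `segment_index` and `max_shift` are hypothetical.

```python
from fractions import Fraction
from math import floor


def segment_index(a: Fraction, parts: int) -> int:
    """Index (0-based) of the equal-width segment of [0, 1) containing a."""
    return floor(a * parts)


def max_shift(n: int, samples: int = 1000) -> int:
    """Largest segment-index increase observed when n parts become n + 1."""
    worst = 0
    for i in range(samples):
        a = Fraction(i, samples)  # exact rational point in [0, 1)
        shift = segment_index(a, n + 1) - segment_index(a, n)
        assert shift >= 0  # a point never moves to a lower-indexed segment
        worst = max(worst, shift)
    return worst


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, max_shift(n))
```

Under these assumptions the observed shift never exceeds one segment, which is what the inequality system above rules out analytically.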

As a direct result, we have the following:


Corollary 16.1 (Packet lookup failure at re-arrangement time). Under the hypothesis and conditions of theorem 14, if packet p ∈ P is not found in station si ∈ Ω while the system is in the process of re-arranging packets, then it will be found in the previous or next node depending on whether s∗ ≺ si or si ≺ s∗.
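A lookup routine could exploit the corollary as sketched below. This is only an illustration under stated assumptions: stations are modelled as in-memory dictionaries indexed by position in the ring, and the routine simply tries both neighbours instead of deciding the direction from the position of s∗. The name `lookup_during_rearrangement` is hypothetical.

```python
from typing import Dict, List, Optional

Station = Dict[str, bytes]  # hypothetical in-memory store: hash -> packet body


def lookup_during_rearrangement(
    ring: List[Station], i: int, key: str
) -> Optional[bytes]:
    """Look for key in station i, then in its previous and next neighbours."""
    n = len(ring)
    for index in (i, (i - 1) % n, (i + 1) % n):
        packet = ring[index].get(key)
        if packet is not None:
            return packet
    return None
```

Checking at most three stations keeps the extra cost of a lookup during re-arrangement constant, regardless of the ring size.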

Lemma 15, theorem 16 and corollary 16.1 provide the answers to our initial questions. To draw our conclusions: the ring is not perfectly scalable, as all stations need to rearrange their packets under dynamic conditions; however, the effort is extremely localized in the context of each station.

5.1.3 Scaling overall impact

We now have more information to evaluate conjecture 1. Considering the characteristics of the operations of updating hash function φ across stations and re-distributing packets, we now understand that they can be executed in parallel. As soon as the joining station computes φ′, it starts the protocol for broadcasting this knowledge in the ring. At the same time, the same station can start going through all its packets and evaluating the new hash function on them in order to re-route its DUs. This process can be started in every station the moment φ′ becomes available while it is traversing the ring.

That being said, moving packets is more expensive than computing the new hash function or receiving it from other stations, thus the time needed for re-routing DU loads in the network is far higher: τψ ≫ τφ, so the overall scaling time is basically defined by τψ.

5.1.4 Ring scale-down

All the considerations made so far regarding the ring scaling up can be transferred to the opposite case where a station leaves the network. A few remarks must be made, though, in relation to this dynamic condition:

• When a station leaves the network, the physical detachment from the other nodes is not performed until all its packets are re-routed. This is crucial and different from the scenario of a station joining the ring; in fact, here we cannot afford to lose a whole bucket of packets.

• A station leaving the network is not the same scenario as a station abandoning the ring. The former is a controlled process which happens through a specific protocol and requires time; the latter is a sudden event which cannot be controlled, and its nature is described later in this chapter.

5.2 Fault conditions

As anything can happen, stations in the ring might enter unexpected states. The reasons for such a scenario can be many, hardware or software related, and adequate countermeasures can be considered. Nonetheless, when it comes to disaster recovery, it is not so much about the cases we know, but rather about everything we do not know. So, we will now consider the possibility of a station becoming unavailable without asking ourselves why. What we ask instead is: "How do we guarantee data retrieval services and balancing in such conditions?".

FIGURE 5.2: Multiple hashing mechanism for achieving safe redundancy. Hashes are computed and then concatenated to the data stream, hence generating packets ready to be sent. [Diagram: data stream S yields packets p1, p2, p3, … carrying hashes h_S, h_S^(1), h_S^(2), … respectively.]

When a station goes down, the first issue is infrastructural. If the ring is set to have leaf-set radius r = 1, then we have a problem, as the ring basically breaks apart and messages cannot be routed across stations. Of course, if the radius is higher (r > 1), then no immediate consequences are experienced in terms of message routing. In both cases, existing protocols for DHT networks in literature can fix dangling links and isolate the unavailable station; the only difference is that a unitary radius ring will experience some downtime until links are fixed. This is one of the reasons why non-unitary radius rings are more robust to disasters.

The second issue to solve concerns data retrieval. A station went down unexpectedly, thus there was no time to apply any scale-down protocol (in fact, the scenario here is not a station leaving the ring, but a station disappearing from it). The direct consequence is virtual data loss: all packets stored in that station are now unavailable and, when any RReq targeting one of those DUs is sent to the ring, the destination station will not find the packet hash in its database.

It is clear that, to solve this issue, something has to be done before the station goes down. However, we cannot make any assumption on this condition and its timing. So we need to change the data storage protocol to target situations where emergency packet retrieval is needed, as we cannot afford, for any reason, the possibility of data becoming unavailable to users.

In chapter 4 we described the API for storing a packet in the ring. Our intention is to modify the storage protocol (primitive store) in order to save one packet in multiple locations in the ring without losing balancing. The procedure applies to both packets and fragments; in general, we consider a certain stream of data to be sent for storage:

1. The data stream S to send is processed and its hash computed: h_S = φ(S).

2. Another hash is computed, using the previously computed hash h_S as input: h_S^(1) = φ(h_S).

3. The same recursive operation is repeated ϱ ∈ ℕ times and several hashes are computed in chain: h_S^(k) = φ(h_S^(k−1)).

4. ϱ different packets are generated by constructing a frame with the same body (the data stream) but a different associated hash, as per figure 4.1, and then sent to the ring.
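The steps above can be sketched as follows. This is a minimal illustration, assuming SHA-256 as a stand-in for hash function φ and a frame made of the hash followed by the body; the actual frame layout of figure 4.1 is not reproduced here, and the names `hash_chain` and `build_packets` are hypothetical.

```python
import hashlib
from typing import List


def hash_chain(stream: bytes, copies: int) -> List[bytes]:
    """Return [h_S, h_S^(1), ..., h_S^(copies)] as raw digests.

    SHA-256 is only a stand-in for the thesis' hash function phi.
    """
    digest = hashlib.sha256(stream).digest()  # h_S = phi(S)
    chain = [digest]
    for _ in range(copies):
        digest = hashlib.sha256(digest).digest()  # h_S^(k) = phi(h_S^(k-1))
        chain.append(digest)
    return chain


def build_packets(stream: bytes, copies: int) -> List[bytes]:
    """Prefix the same body with each hash of the chain (assumed frame layout)."""
    return [h + stream for h in hash_chain(stream, copies)]
```

Only the first digest is computed on the full stream; every further link of the chain hashes a fixed-size digest, so the cost of the redundant hashes does not depend on the length of S.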


FIGURE 5.3: Packet retrieval session under the hypothesis of one station down. The diagram illustrates how a failed RReq triggers the emergency retrieval process. [Sequence diagram between User, Proxy, Entry station, Dst station 1 and Dst station 2: retrieve(φ(p)) is routed to Dst station 1, whose lookup returns null and an RRes (error); retrieve(φ(φ(p))) is then routed to Dst station 2, whose lookup returns packet p in a successful RRes.]

The procedure just described generates ϱ different copies of the same DU, and they are all sent to different locations in the ring. Thanks to the Lamport scheme³, we can compute ϱ more hashes of the same initial stream and use them as storage keys.

Remark. Generating the first hash h_S is potentially expensive because the input stream can be long (although bounded, considering fragmentation threshold c). The same cannot be said for the other hashes h_S^(k), because they are computed on another hash (a very short string). So the process of computing the redundant hashes is very cheap.

³ The process of generating the hash of a hash is used today in security-related scenarios in order to generate ephemeral keys. The scheme has been proved to be safe and, by using a secure cryptographic hash function, irreversible.


How can this procedure help us when attempting to retrieve a DU stored in an unavailable station? We consider again the broken scenario described before, where station si suddenly becomes unavailable:

1. The system tries to retrieve packet p via its hash h = φ(p).

2. The RReq message reaches station si−1, as it now covers the hash segment that si covered when it was on-line. However, station si−1 cannot find hash h in its database, thus it returns an error in the RRes.

3. The system acknowledges that the first RReq was not successful, so it tries again to retrieve the packet by computing φ(h).

4. The second RReq now reaches another station sj where the packet is found and returned.

Figure 5.3 illustrates the protocol just described.
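The retrieval protocol can be sketched in a few lines. This is an illustration under simplifying assumptions, not the thesis' implementation: SHA-256 stands in for φ, hashes are mapped to stations with a plain modulo rule instead of the ring's hash segments, and a station that is down simply makes the lookup fail (the takeover by station si−1 is not modelled). The names `Ring`, `route` and `phi` are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Set


def phi(data: bytes) -> bytes:
    """Stand-in for the thesis' hash function phi (SHA-256 here)."""
    return hashlib.sha256(data).digest()


def route(key: bytes, n: int) -> int:
    """Assumed routing rule: map a hash uniformly onto one of n stations."""
    return int.from_bytes(key, "big") % n


class Ring:
    def __init__(self, n: int) -> None:
        self.stations: List[Dict[bytes, bytes]] = [{} for _ in range(n)]
        self.down: Set[int] = set()

    def store(self, body: bytes, copies: int) -> List[bytes]:
        """Store the body under h, phi(h), ... (copies extra keys)."""
        keys = [phi(body)]
        for _ in range(copies):
            keys.append(phi(keys[-1]))
        for key in keys:
            self.stations[route(key, len(self.stations))][key] = body
        return keys

    def lookup(self, key: bytes) -> Optional[bytes]:
        """Single RReq: None models an RRes carrying an error."""
        index = route(key, len(self.stations))
        if index in self.down:
            return None  # the data of a failed station is unreachable
        return self.stations[index].get(key)

    def retrieve(self, body_hash: bytes, copies: int) -> Optional[bytes]:
        """Try h first, then walk the chain phi(h), phi(phi(h)), ..."""
        key = body_hash
        for _ in range(copies + 1):
            packet = self.lookup(key)
            if packet is not None:
                return packet
            key = phi(key)
        return None
```

In this sketch the retrieval fails only when every copy was routed to a station that is down, which is the collision scenario examined in the next section.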

5.2.1 Collisions threshold

As we promote the idea of introducing redundancy of packets in the network as a means to achieve good levels of disaster recovery, we should make this effort as efficient as possible, avoiding unnecessary cost. Since we are routing the same packet in the ring with different hashes, we want to make sure that the copies do not all end up in the same station. If we generated one copy of a packet and both the original and the copy were routed to the same station, our effort would be pointless: the moment that station goes down, our emergency retrieval procedure would fail. On the other hand, we do not want to generate too many copies of the same packet, as we would waste precious memory in our stations. How do we find a good balance? Let us start by considering collisions in the ring:

Lemma 17 (Packets collision probability). Let R = (Ω, r, ξ, φ, ψ) be a ring with ‖Ω‖ = N stations, and let p1 ∈ P and p2 ∈ P be two packets. Then the probability that they collide (are routed onto the same station) is:

\[
\gamma = \frac{1}{N} \tag{5.2}
\]

Proof. A collision occurs when p1 and p2 are routed to the same station si ∈ Ω: ψ(p1) = ψ(p2) = si. The probability of this event can be defined as follows:

\[
\gamma = \Pr\{\psi(p_1) = \psi(p_2) = s_i\}, \quad \forall p_1, p_2 \in P,\ \forall s_i \in \Omega
\]

We first consider packet p1 routed in the ring to station sk ∈ Ω and then consider packet p2 being processed: the probability of a collision with p1 is the probability of p2 being routed to sk. Since sk can be any station of the ring, this calls for the Law of Total Probability:

\[
\gamma = \sum_{k=1}^{N} \Pr\{\psi(p_2) = s_k \mid \psi(p_1) = s_k\} \cdot \Pr\{\psi(p_1) = s_k\} \tag{5.3}
\]

Since ψ is based on hash function φ, which is based on ξ, which is a cryptographic hash function, consecutive applications of ψ are not interdependent. This means that:

\[
\Pr\{\psi(p_2) = s_k \mid \psi(p_1) = s_k\} = \Pr\{\psi(p_2) = s_k\}, \quad \forall p_1, p_2 \in P,\ \forall s_k \in \Omega
\]


We can rewrite equation 5.3 as follows:

\[
\gamma = \sum_{k=1}^{N} \Pr\{\psi(p_2) = s_k\} \cdot \Pr\{\psi(p_1) = s_k\}
\]

Recalling theorem 5 and the definition of πk ∈ [0, 1] ⊂ ℝ as the probability of a packet being routed to station k, we can write our equation as follows:

\[
\gamma = \sum_{k=1}^{N} \pi_k \cdot \pi_k = \sum_{k=1}^{N} \pi_k^2
\]

We are under the hypothesis of a balanced ring, since hash function φ is applied; so, according to equation 2.12, we have:

\[
\gamma = \sum_{k=1}^{N} \frac{1}{N^2} = \frac{1}{N^2} \cdot \sum_{k=1}^{N} 1 = \frac{1}{N^2} \cdot N = \frac{1}{N}
\]

Which proves the thesis.
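The value γ = 1/N can be cross-checked with a small Monte Carlo experiment. The sketch below assumes a perfectly balanced ring, modelled by drawing the destination station of each packet uniformly at random; it does not use the thesis' hash functions, and the name `estimate_collision_probability` is hypothetical.

```python
import random


def estimate_collision_probability(n: int, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which two uniformly routed packets share a station."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if rng.randrange(n) == rng.randrange(n):
            hits += 1
    return hits / trials
```

For instance, with n = 10 the estimate should settle close to 0.1 as the number of trials grows.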

The follow-up to lemma 17 is calculating the average number of collisions experienced in the ring when sending ϱ packets. Remember that sending ϱ copies p1 … pϱ of packet p does not create a correlation between the different instances being sent. This is because we are sending different hashes h_S, h_S^(1), …, h_S^(ϱ), related to each other by the Lamport chain which guarantees that the hashes are not (stochastically) interdependent.

Lemma 18 (Average number of collisions). Let R = (Ω, r, ξ, φ, ψ) be a ring with ‖Ω‖ = N stations. Then, when generating m ∈ ℕ packets, the average number of collisions experienced between different couples of units is:

\[
\eta_\gamma = \binom{m}{2} \frac{1}{N} \tag{5.4}
\]

Proof. We introduce r.v. y ∈ ℕ counting the number of collisions between couples of the m packets. This variable can range from 0 up to the number of all possible combinations of two different packets: |C(m, 2)| = \(\binom{m}{2}\). We also introduce r.v. χ ∈ {0, 1} ⊂ ℕ defined as follows:

\[
\chi(p_1, p_2) =
\begin{cases}
1 & \text{if a collision occurs between packets } p_1 \text{ and } p_2 \\
0 & \text{otherwise (no collision)}
\end{cases}
\]

Remembering that C(m, 2) enumerates all possible combinations of packets (order does not matter), we can define y as follows:

\[
y = \sum_{(p_1, p_2) \in C(m, 2)} \chi(p_1, p_2)
\]

R.v. y’s mean value can then be calculated as:

\[
\eta_\gamma = E[y] = E\left[ \sum_{(p_1, p_2) \in C(m, 2)} \chi(p_1, p_2) \right]
\]


Since operator E[·] is linear, we have that:

\[
E\left[ \sum_{(p_1, p_2) \in C(m, 2)} \chi(p_1, p_2) \right] = \sum_{(p_1, p_2) \in C(m, 2)} E[\chi(p_1, p_2)]
\]

R.v. χ is discrete and distributed on two values only, so its mean can be easily calculated:

\[
E[\chi(p_1, p_2)] = 1 \cdot \Pr\{\chi = 1\} + 0 \cdot \Pr\{\chi = 0\} = \Pr\{\chi = 1\} = \gamma, \quad \forall p_1, p_2 \in P
\]

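So, back to r.v. y’s mean value:

\[
\eta_\gamma = \sum_{(p_1, p_2) \in C(m, 2)} E[\chi(p_1, p_2)]
= \sum_{k=1}^{|C(m,2)|} \gamma
= \sum_{k=1}^{|C(m,2)|} \frac{1}{N}
= \frac{1}{N} \cdot \sum_{k=1}^{|C(m,2)|} 1
= \frac{1}{N} \binom{m}{2}
\]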

Proving the thesis.
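For small values of m and N, equation 5.4 can be verified exhaustively. The sketch below assumes every routing of the m packets onto the N stations is equally likely (balanced ring with independent hashes) and compares the closed form with a brute-force average; both function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def expected_collisions(m: int, n: int) -> Fraction:
    """Closed form of equation 5.4: C(m, 2) / N."""
    return Fraction(comb(m, 2), n)


def brute_force_expected_collisions(m: int, n: int) -> Fraction:
    """Average number of colliding pairs over all n**m equally likely routings."""
    total = 0
    for routing in product(range(n), repeat=m):
        total += sum(1 for a, b in combinations(routing, 2) if a == b)
    return Fraction(total, n ** m)
```

The enumeration grows as N^m, so it is only meant as a sanity check on tiny rings.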

Thanks to lemma 18, we can now try to calculate a reasonable value for ϱ and decide how many clones of a packet we should send in the network to ensure an effective level of redundancy.

Theorem 19 (Optimal ϱ). Let R = (Ω, r, ξ, φ, ψ) be a ring with ‖Ω‖ = N stations and let p ∈ P be a packet sent with redundancy factor ϱ ∈ ℕ. Then, in order to guarantee that at least 50% of sent packets do not collide, the optimal redundancy factor is:

\[
\varrho < \varrho_{opt} = N
\]

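Proof. Let β ∈ [0, 1] ⊂ ℝ be the percentage of collisions that we allow on the total number of packets ϱ + 1 (the original packet and its clones) sent to the network. So, the following holds:

\[
\binom{\varrho+1}{2} \frac{1}{N} < \beta(\varrho+1)
\implies \binom{\varrho+1}{2} < \beta N (\varrho+1)
\implies \binom{\varrho+1}{2} \frac{1}{\varrho+1} < \beta N
\]

\[
\implies \frac{(\varrho+1)!}{2!\,(\varrho-1)!} \cdot \frac{1}{\varrho+1} < \beta N
\implies \frac{(\varrho+1)\,\varrho\,(\varrho-1)!}{2\,(\varrho-1)!} \cdot \frac{1}{\varrho+1} < \beta N
\implies \frac{\varrho}{2} < \beta N
\implies \varrho < 2\beta N
\]

Which proves the thesis by considering β = 1/2.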
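The bound can be turned into a small helper for picking the redundancy factor. This is a sketch under the theorem's assumptions; the names `max_redundancy` and `within_budget` are hypothetical, and exact fractions are used to avoid rounding issues on the strict inequality.

```python
from fractions import Fraction
from math import ceil, comb


def max_redundancy(n: int, beta: Fraction = Fraction(1, 2)) -> int:
    """Largest integer rho satisfying the strict bound rho < 2 * beta * n."""
    return ceil(2 * beta * n) - 1


def within_budget(rho: int, n: int, beta: Fraction = Fraction(1, 2)) -> bool:
    """Check C(rho + 1, 2) / n < beta * (rho + 1), the proof's starting inequality."""
    return Fraction(comb(rho + 1, 2), n) < beta * (rho + 1)
```

For a ring of N = 10 stations and β = 1/2, `max_redundancy(10)` gives 9: with ϱ = 9 the expected number of collisions is 45/10 = 4.5, below the allowed 5, while ϱ = 10 gives 5.5 against an allowance of 5.5 and no longer satisfies the strict inequality.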


Chapter 6

Conclusions and final notes

Simulations have shown the effectiveness of the balancing performed by the algorithm. Together with the use of known distributed architectures (DHT networks), the proposed balancing approach is feasible and potentially employable in real-case scenarios.

6.1 Open issues

The algorithm currently presents some challenges which must be addressed in order to make the architecture more flexible and less costly from a network performance standpoint (traffic and control overhead).

Scalability is the first priority. The analysis performed so far has provided good upper bounds on the cost of scaling up the ring by one station; however, more remains to be investigated. More simulations should be run on scaling rings, and a differential analysis must be carried out to identify possible patterns which can be exploited.

6.2 What’s next

As a continuation of the effort described in this document, the next action items to focus on are:

1. Improving C/C++ simulations to target more advanced scenarios.

2. Performing more simulations on very large networks (1000 stations and more) and higher traffic volumes.

3. Developing simulations targeting traffic handling in the ring, in order to get more information about the impact on network performance introduced by the PA.

4. Enriching simulations with more features addressing differential analysis on scaling rings.

The next iteration should focus on collecting more information regarding the performance of the algorithm, with special focus on high-variance conditions in the amplitudes of hash segments. Furthermore, it can be beneficial to evaluate migration flows in scaling scenarios.


Appendix A

C/C++ simulation engine’s architecture

The C/C++ simulation engine has been developed with the following technologies:

• Intel’s Threading Building Blocks (TBB¹) for parallel packet generation and hash computation.

• GNU C/C++ compiler.

• Boost² C++ libraries for big integers and other utilities.

• Tina’s Random Number Generator (TRNG³) library for randomizers.

• OpenSSL⁴ cryptographic library for hash computation.

• Circos⁵ library for circo-diagrams generation (migration flows).

Simulation steps. Simulations can run sequentially or in parallel. When running in parallel, a Monte Carlo approach is used so that packet generation and hash computation can be performed much faster. When running a simulation, the following steps are performed:

1. Pre-compilation configuration: compilation variables are assigned. The engine is based on STL⁶, and parameters such as the number of stations N and the number of generated packets m are all defined as compile-time constants; thus they need to be set before compiling.

2. Compilation: the simulation engine is compiled in order to produce the simulation executables.

3. Post-compilation configuration: simulation input files are prepared in order to specify hash segments and other network descriptive variables.

4. Execution: simulations run.

5. Data extraction: output data is generated in order to get aggregated information and markup files to be used for generating circo-diagrams.

¹ Intel’s library for multi-threaded processing. https://www.threadingbuildingblocks.org/.

² Boost libraries. http://www.boost.org/.

³ Random number generator library. https://www.numbercrunch.de/trng/.

⁴ Standard SSL implementation. https://www.openssl.org/.

⁵ Circos. http://circos.ca/.

⁶ C++ Standard Template Library allows the use of generic types and compile-time constants.


Every simulation generates 3 files:

• A data file tracking hash segments per each station and all generated packets, hashes and φ-hashes.

• A table file containing a matrix used by Circos to generate migration flows.

• A Karyotype file used by Circos to generate other diagrams (for future use).

Infrastructure. All simulations mentioned in this document have been run on a pool of Intel 4-core machines: HP ProLiant DL180 G6 (64 bit) running CentOS 6 (RHEL).


List of Figures

1.1 Overall system architecture. The end user interacts only with the storage system, while the balancing system is hidden to the user and transparent to the storage system with regards to accessing the server pool.

2.1 A N = 8 network example showing the logical ring topology. Each station is assigned with an ID (typically the IP address hash) and packets are routed by content.

2.2 Access to the ring is guarded by proxies.

2.3 Hash-partitioning of a ring into different segments, one per each station. For each segment, a different impulse is used, its coverage matches the segment’s length.

2.4 Hash segments mapped onto φ segments illustrating how hash function φ works. The top part of the diagram shows the φ hash-space while the bottom part the ξ hash-space.

3.1 Showing the Polar Hash Coverage Plot (PHCP) of a simulation on an N = 10 station ring after sending m = 10³ packets. Both plots show the configuration of the station hash segments together with the final load levels at the end of the simulation. The plot on the left refers to a normal ring (hash function ξ applied), the one on the right refers to an extended ring where hash function φ based on same ξ is considered. The same packets were sent in both rings.

3.2 Showing load state (in blue) |sk| in each station sk as time grows. In this simulation, hash function ξ is used (normal ring). The green line shows the expected load state (uniform) for each point in time.

3.3 Showing load state (in blue) |sk| in each station sk as time grows. In this simulations set (same as in figure 3.1), hash function φ is used (extended ring). The green line shows the expected load state (uniform) for each point in time.

3.4 Plotting standard deviation vs dispersion factor of generated ξ-hashes and φ-hashes during simulations batches (from left to right): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations).

3.5 Plotting standard deviation of hash segment lengths and standard deviation of φ-hashes during each simulations in batches (from top to bottom): N = 10 (40 simulations), N = 30 (60 simulations) and N = 50 (10 simulations).

3.6 Plotting station loads η_k^(ξ) (no balancing) and η_k^(φ) (balanced ring) at the end of four N30 simulations with different seeds.

3.7 Migrations flows in a N30 ring.

4.1 Data unit format.

4.2 Synchronous vs. asynchronous communication model when storing a single packet.

4.3 Packet info format.

4.4 Sequence diagram showing the retrieval protocol in case of a fragmented packet.

5.1 Message format.

5.2 Multiple hashing mechanism for achieving safe redundancy. Hashes are computed and then concatenated to the data stream, hence generating packets ready to be sent.

5.3 Packet retrieval session under the hypothesis of one station down. The diagram illustrates how a failed RReq triggers the emergency retrieval process.

List of Tables

2.1 Showing, in the example, values of hash identifiers and hash segments for each station.

Acknowledgements

Thanks to my supervisor, Prof. Eng. O. Tomarchio, for having enough patience and waiting a few more years for me to finish this research while working in Denmark.

Thanks to Medilink srl, my host company for my master traineeship, during which this research effort was started and completed in its first step. They provided everything I needed (resources, infrastructure) to complete my work.

Thanks to my Team Lead at Microsoft, Horina, for her flexibility and availability, allowing me to submit this work in time.

Graphics and artwork. Icons and graphics in figures created by Katemangostar - Freepik.com.

Last but not least, thanks to all the amazing public libraries in Copenhagen which have hosted me and my work during many weekends spent on this thesis.


