Architectures for Peer-to-Peer Media Streaming in Large Scale Systems

UNIVERSITÉ DE NEUCHÂTEL
Architectures for Peer-to-Peer Media Streaming in Large Scale Systems
par Marc Schiely, Institut d'informatique
Thèse présentée le 8 décembre 2009 à la Faculté des Sciences pour l'obtention du grade de Docteur ès Sciences.
Acceptée sur proposition du jury:
Prof. Pascal Felber, directeur de thèse, Université de Neuchâtel, Suisse
Prof. Peter Kropf, rapporteur, Université de Neuchâtel, Suisse
Prof. Laszlo Böszörmenyi, rapporteur, Université de Klagenfurt, Autriche
Prof. Benoît Garbinato, rapporteur, Université de Lausanne, Suisse


Acknowledgements

I rarely had the opportunity to thank the people who helped me to complete my thesis over the last few years. Many of my friends were always ready to discuss problems with me and tried to understand me. First of all I want to thank Silvia, who shared the office, and a great time with many deadlines, with me. All my other friends from the institute also gave me a lot of energy and joy, namely Sabina, Raphael, Leo, Fred, Steve, Heiko, Claire, Christian, Olena and Walther.

Many thanks also to Pascal and Peter for their support over all these years and for the possibility to work with them and to participate in many winter schools and conferences. Thanks a lot to Professor Böszörmenyi and Professor Garbinato for their valuable work.

My family also motivated me throughout the years and helped me to finish my thesis. Many thanks to my parents Christa and Markus for the patience they had with me and the support they offered me. My sisters Andrea and Corinne and my brother Dominic were also there when I needed them, and my grandmother was an important person who kept asking me about my work and motivating me. There were many other important friends at my side whom I do not explicitly list here; thanks to all of them.

Abstract

Keywords: Peer-to-Peer, Media Streaming, Tree-based Architectures, Dynamic Adaptation Algorithms, Tit-for-Tat.
Mots-clés: Pair-à-Pair, diffusion de flux, arbres de diffusion, algorithmes d'adaptation dynamique, Tit-for-Tat.

The distribution of substantial amounts of data to a large number of clients is a problem that is often tackled by peer-to-peer (P2P) architectures. Bottlenecks are alleviated by distributing the work of the server to all participating peers. Content is no longer passed directly from the server to all clients but only to a small subset of peers from where it is forwarded to a different subset of clients. These basic P2P ideas can also be applied to the distribution of live content, such as video streams. Additional timing constraints and bandwidth requirements of this application context lead to new challenges. Peer failures or late arriving packets directly influence the user perception, which is not the case in simple file distribution scenarios.

This thesis first analyzes some of the major problems faced by P2P live media streaming, and then presents a new architecture to address these challenges. Starting from a tree-based approach, the architecture is enhanced with adaptation algorithms to finally evolve into a mesh-based system. The in-depth analysis of tree-based architectures shows that it is important to adapt a node's position in the tree according to its bandwidth capacity. The same analysis is conducted for mesh-based architectures and it is shown that the position on the distribution path has a significant influence on performance. Another important problem concerns the fairness aspect in terms of collaborators and so-called free-riders. A P2P system works best if all peers contribute their resources. This can be ensured by tit-for-tat mechanisms where peers return as much as they get. In this thesis a new kind of tit-for-tat mechanism is developed that combines bandwidth contribution with robustness: the more bandwidth a peer provides, the more robust its position on the path becomes.

Contents

1 Introduction
  1.1 Context
  1.2 Challenges in P2P Media Streaming
    1.2.1 Scalability
    1.2.2 Robustness
    1.2.3 Latency
    1.2.4 Throughput Optimization
    1.2.5 Timing Constraints
    1.2.6 Fairness
  1.3 Research Goal
  1.4 Contributions
    1.4.1 Throughput Optimization
    1.4.2 Adaptation Algorithms
    1.4.3 Fairness and Robustness
    1.4.4 Mesh-based Architectures
  1.5 Structure
2 Related Work
  2.1 Overview of Existing Systems
  2.2 Analytical Models for Download Rate
  2.3 Adaptation Algorithms
  2.4 Fairness and Robustness
  2.5 Mesh-based Approaches
  2.6 Summary
3 Peer-to-peer Distribution Architectures providing Uniform Download Rates
  3.1 Introduction
  3.2 System Model and Definitions
    3.2.1 Uniform Rate
  3.3 A Study of Three Architectures
    3.3.1 Linear Chain Architecture
    3.3.2 Analysis
    3.3.3 Mesh Architecture
    3.3.4 Analysis
    3.3.5 Parallel Trees
    3.3.6 Analysis
  3.4 Comparative Analysis
  3.5 Summary
4 Self-organization in Cooperative Content Distribution Networks
  4.1 Introduction
  4.2 P2P Content Distribution
  4.3 Dynamic Reorganization Algorithm
  4.4 Evaluation
  4.5 Summary
5 Tit-for-tat Revisited: Trading Bandwidth for Reliability in P2P Media Streaming
  5.1 Introduction
  5.2 The CrossFlux Architecture
    5.2.1 Design Guidelines
    5.2.2 Distribution Overlay
    5.2.3 Joining the System
    5.2.4 Content Distribution
    5.2.5 Departures and Failures
    5.2.6 Overlay Optimization
  5.3 Evaluation
    5.3.1 Simulations
    5.3.2 Modelnet Simulations
    5.3.3 Load Balancing
    5.3.4 PlanetLab
  5.4 Summary
6 Tree-based Analysis of Mesh Overlays for Peer-to-Peer Streaming
  6.1 Introduction
  6.2 Mesh-based P2P Streaming
    6.2.1 Mesh Overlay Properties
    6.2.2 Tree-based View of Mesh Overlays
    6.2.3 Analysis
    6.2.4 Evaluation
  6.3 Mesh Adaptation Algorithm
    6.3.1 Algorithm
    6.3.2 Evaluation
  6.4 Summary
7 Conclusions
  7.1 Summary
  7.2 Perspectives of Future Work
A Prototype Implementation
  A.1 System Overview
  A.2 Buffer Management
  A.3 Network Layer
  A.4 HeapTop
  A.5 Failure Recovery
B List of Publications

List of Figures

3.1 Linear chains with one expansion step and k = 2
3.2 Linear chains with two expansion steps and k = 2
3.3 Linear chains with one expansion step and k = 4
3.4 Download time (in rounds) of the linear chain architecture for different k
3.5 Download time (in rounds) of the linear chain architecture for different s
3.6 Mesh with one expansion step and k = 2
3.7 Mesh with two expansion steps and any k
3.8 Download time of the mesh architecture for different C
3.9 Download time of the mesh architecture for different s
3.10 Parallel trees with N = 8 and k = 2
3.11 Download time for the parallel trees architecture for different C
3.12 Download time for different architectures with k = 100
3.13 Download time for different architectures with k = 4
4.1 Positions of a slow node in the binary tree
4.2 Average reception time depending on the height of slow nodes
4.3 Throughput measured to estimate effective bandwidth
4.4 Original distribution tree
4.5 Possible configuration obtained after algorithm execution
4.6 Average improvement factor
4.7 Average number of exchanges per node
4.8 Bandwidth capacity of the HeapTop tree vs. an optimal binary tree
4.9 Average improvement factor with two parallel trees
4.10 Best case improvement factor for two parallel trees
4.11 Configurations for average reception time tests
4.12 Average reception times with one slow node
4.13 Reception times of the peers for the initial random tree
4.14 Reception times of the peers for the HeapTop tree
5.1 Example in which the path diversity property is not satisfied
5.2 Example of content distribution in normal and backup mode
5.3 Illustration of the path diversity property
5.4 Example of content distribution after failure
5.5 Average improvement factor with two stripes
5.6 Impact of failures on the average download capacity
5.7 Download capacity (without HeapTop)
5.8 Download capacity compared to the optimal (without HeapTop)
5.9 Download capacity (with HeapTop)
5.10 Download capacity compared to the optimal (with HeapTop)
5.11 Maximal and max-average path lengths
5.12 Cumulative distribution function of the number of children
5.13 Node stress for different population sizes and distribution D4
5.14 Average improvement factor with two parallel trees
5.15 Best case improvement factor for two parallel trees
5.16 Cumulative distribution function of the free receive buffer space
5.17 Cumulative distribution function of the used capacity
5.18 Comparison of recovery time with backup links and without
5.19 Download rate for a PlanetLab node with a parent failure
6.1 Mesh overlay and its two diffusion trees
6.2 Optimal construction of K trees
6.3 Average tree height for different number of parents
6.4 CDF of the height of diffusion trees in mesh overlays
6.5 Propagation delay for varying number of trees
6.6 Peers p and parent exchange their positions
6.7 Tree heights for different proportions of upload bandwidth
6.8 CDF of the height of diffusion trees

List of Tables

4.1 Distributions of peer classes for the simulations
5.1 Distributions of peer classes for evaluation

List of Algorithms

1 HeapTop algorithm at peer p
2 Reception of JRQ(si, j) at peer p for stripe si and new peer j
3 Adapting position of peer p in the mesh overlay

Chapter 1
Introduction

1.1 Context

The typical usage of the Internet has evolved over the last ten years from simple message transfer and Web browsing to applications that involve much larger amounts of data, such as video and music streaming. Additionally, the number of Internet users has increased exponentially, which has made the demand for bandwidth and server capacities explode [24]. Simply adding more servers to deal with this high demand is very expensive, as the costs increase linearly with the number of users and additional hardware is needed to balance the load among the servers. Therefore new communication paradigms have been proposed to overcome the shortcomings of classical server/client architectures.

Instead of directly downloading content from servers, clients are requested to forward content as well and to cooperate such that resources from clients are also integrated in the distribution process. Initially, peer-to-peer (P2P) architectures were mainly used for distributing large files to a big number of users, but they have soon been integrated in other applications such as media streaming. Instead of only supporting pre-generated files, the P2P approach has been applied to live generated content, such as music or video streams, which have more stringent requirements in terms of timing. The first live content streaming systems evolved in the late nineties and many have been proposed afterwards.

In this thesis we analyze the problem of P2P media streaming, and we propose solutions and a new architecture based on the results of our study. In the rest of this section we discuss the context of this thesis and summarize our contributions.

1.2 Challenges in P2P Media Streaming

While the use of P2P techniques for media streaming offers many benefits, it also comes with a number of challenges that have to be dealt with.
In addition to the problems usually encountered in classical P2P systems, such as high churn rates, additional challenges are specific to media streaming. We take a closer look at these issues and discuss existing solutions.

1.2.1 Scalability

In all live streaming systems a single streaming source exists which produces the streaming content. We assume that this central, unique instance never fails, as the whole streaming system would fail otherwise. We usually distinguish between three types of P2P architectures: (1) systems that rely on a single central instance (different from the source) of a specific component (e.g., the Akamai CDN [53]), (2) fully decentralized architectures where each participating peer has the same role (except the source), and (3) hybrid or hierarchical models with some peers (usually called super-peers) having a specific role. Obviously, the scalability of the first approach is limited by the capacity of the central instance, which also represents a single point of failure. The hybrid approach mitigates this problem by essentially replicating the service provided by the super-peers.

Fully decentralized architectures have the greatest potential for scalability and reliability, but they have to face additional complexity in their topology management protocols. As there is no global knowledge about the network, all operations have to rely on local, partial knowledge. This may lead to sub-optimal performance (when compared to centralized, omniscient algorithms) but this is usually a small price for having a scalable system. Further, as each peer has the same role, failures are not as critical as the failure of a centralized component or a super-peer in a hybrid model. In environments with high churn, where many nodes join and leave the system, this property is essential.

1.2.2 Robustness

One of the major advantages of the P2P paradigm is also one of its biggest problems: the peers act in an unpredictable manner and independently of each other. The rate at which peers join and leave the system can be very high. Worse, departures may be ungraceful in the sense that peers fail or leave without prior notification. As a consequence, P2P systems must incorporate some form of self-healing mechanism and construct robust topologies (e.g., using redundant paths).

In general, the robustness of P2P networks can be increased by using either data redundancy or path redundancy. The most widely used techniques for data redundancy are forward error correction (FEC) [45], layered coding and multiple description coding (MDC) [10]. FEC uses encoding techniques, such as Reed-Solomon codes, to encode a number of packets n into m packets where m > n. Any subset k (k ≥ n) of these m packets is enough to reconstruct the n original packets.
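To make the any-n-of-m property concrete, the following toy sketch (not from the thesis, and far simpler than the Reed-Solomon codes mentioned above) encodes n = 2 data packets into m = 3 packets by adding a single XOR parity packet; any two of the three packets suffice to recover the original data. Real FEC schemes generalize this idea to arbitrary n and m.

```python
# Toy erasure code: n = 2 data packets, m = 3 coded packets (2 data + 1 XOR parity).
# Any 2 of the 3 packets are sufficient to rebuild both data packets.
# Illustration only; production systems use Reed-Solomon or similar codes.

def encode(p0: bytes, p1: bytes) -> list[bytes]:
    parity = bytes(a ^ b for a, b in zip(p0, p1))
    return [p0, p1, parity]              # m = 3 packets sent over the network

def decode(received: dict[int, bytes]) -> tuple[bytes, bytes]:
    # 'received' maps packet index (0, 1, 2) to payload; at least 2 entries are needed.
    if 0 in received and 1 in received:
        return received[0], received[1]
    if 0 in received:                    # packet 1 lost: p1 = p0 XOR parity
        return received[0], bytes(a ^ b for a, b in zip(received[0], received[2]))
    if 1 in received:                    # packet 0 lost: p0 = p1 XOR parity
        return bytes(a ^ b for a, b in zip(received[1], received[2])), received[1]
    raise ValueError("need at least 2 of the 3 packets")

chunk0, chunk1 = b"AAAA", b"BBBB"
packets = encode(chunk0, chunk1)
assert decode({0: packets[0], 2: packets[2]}) == (chunk0, chunk1)   # survives loss of packet 1
```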
In layered coding, the original media stream is split into different layers of quality. The base layer is the most important and must be received in any case. Each additional layer improves the quality of the stream.

Finally, MDC divides a stream into different descriptions (substreams) where each description can be distributed on a different network path to avoid network failures or congestion. Any subset of descriptions can be used to decode the original stream. The more descriptions are available, the higher the streaming quality is. The error resilience is therefore higher than in layered coding, as the loss of a description does not lead to a streaming interruption but only to a temporarily decreased streaming quality.

Redundancy in data alone does not help if a peer has only one other peer serving the data: if this single data source fails, the receiving peer does not get any data. A better strategy is to have multiple sources that serve different parts of the stream such that, if a subset of the neighbors fail, the remaining peers can still serve the data needed to play back the stream. Most existing P2P media streaming systems provide such support for path diversity using redundant distribution trees or mesh-based topologies [34].

1.2.3 Latency

P2P media streaming architectures are based on application-level overlay networks. This means that messages from the source to a given peer typically follow a longer route than IP's shortest path. In order to minimize the network stretch and the end-to-end latency, peers that are physically close should be neighbors in the logical overlay. Thus, the construction of the distribution architecture does not only need to take into account robustness, but also performance metrics [42]. The neighbors of a node have to be selected in an intelligent way to optimize the chosen metrics.

1.2.4 Throughput Optimization

In traditional P2P file sharing systems, each peer tries to download content as fast as possible, i.e., to maximize its effective bandwidth. In contrast, media streaming architectures must provide a timely, constrained download rate for a smooth playback of the stream. Multiple cooperating peers are needed to balance out bandwidth fluctuations, so that the loss or degradation of service from one peer can be compensated by other peers.

The service capacity of a P2P system consists of the aggregate upload bandwidth of all participating nodes. As this bandwidth is a scarce resource, its usage must be optimized. In the optimal case, each peer can obtain a peak service capacity equal to the aggregate bandwidth divided by the number of nodes.

1.2.5 Timing Constraints

For distributing a large file to a high number of clients, it is typically split into a number of equal-sized chunks. Each chunk can be distributed independently and the file can be reassembled at the peers. A participating node requests missing chunks from neighbors based on a chunk-selection strategy (e.g., a rarest-chunk-first strategy) until it has the complete file. The order has no importance as the file is only used when its download is completed.

Live streaming systems are very different in this respect: a chunk is only useful for a peer if it arrives before its scheduled playback time. Chunks far in the future cannot be requested as they often do not exist yet. These timing constraints make the trading window of a peer very narrow and impose additional mechanisms to handle bandwidth degradations and peer failures. As we cannot rely on guarantees from the transport layer that a packet arrives on time, we need some over-provisioning of bandwidth to deal with small delays and limit the risk of missing some deadlines. This requirement is tightly coupled with the throughput optimization problem: the more download bandwidth each peer gets, the higher the probability that all blocks arrive on time.
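The narrow trading window can be expressed as a simple per-chunk admission rule. The sketch below is illustrative only; the chunk duration, window size and safety margin are hypothetical parameters, not values taken from the thesis. A chunk is worth requesting only if it has not yet been played back, is close enough to the playback position to already exist, and can still be downloaded before its deadline.

```python
# Illustrative trading-window check for a live stream split into fixed-duration chunks.
# All parameters are made up for the example.

CHUNK_DURATION = 1.0    # seconds of media per chunk
WINDOW_SIZE    = 30     # how many future chunks a peer is willing to trade
SAFETY_MARGIN  = 2.0    # request only chunks whose deadline is at least this far away

def playback_deadline(chunk_id: int, stream_start: float) -> float:
    """Wall-clock time at which this chunk must be available for smooth playback."""
    return stream_start + chunk_id * CHUNK_DURATION

def worth_requesting(chunk_id: int, playback_chunk: int,
                     now: float, stream_start: float) -> bool:
    if chunk_id <= playback_chunk:                  # already played back: useless
        return False
    if chunk_id > playback_chunk + WINDOW_SIZE:     # too far ahead: may not exist yet
        return False
    # is there enough time left to download it before its deadline?
    return playback_deadline(chunk_id, stream_start) - now >= SAFETY_MARGIN
```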
1.2.6 Fairness

An important observation made on current peer-to-peer file distribution systems is the existence of selfish peers, so-called free-riders [1]. These peers try to download from the system without serving other nodes. As peer-to-peer systems live from cooperation, this is an important problem that needs to be addressed.

Without any central instance that controls data flows in the system, the peers must have the ability to penalize peers that are unfair. Ideally, each peer should only get as much as it contributes, except in the initialization phase where a peer is not yet able to provide data, but this objective is not compatible with the uniform bandwidth requirement of media streaming. Yet, a peer that contributes shall get more advantages than a node that does not contribute at all, for instance lower delays or higher reliability.

1.3 Research Goal

The goal of this thesis is to study cooperative distribution of streaming media and large files from a networking perspective. The problem should be addressed by analysis, modeling, prototype implementation and experimental validation. The algorithms presented should be practical enough to be implemented.

The solutions developed in this thesis should not depend on special peers with dedicated roles (e.g., centralized nodes, super-peers). Instead, all peers should be assumed to be equal in terms of their role in the network but heterogeneous in terms of bandwidth. The presented architectures should further be resilient to failures and provide simple incentives for peers to contribute to stream distribution.

1.4 Contributions

We briefly summarize below how the different problems identified are solved in this thesis.

1.4.1 Throughput Optimization

In today's Internet, users are typically connected by asymmetric links, such as ADSL, where the upload bandwidth is smaller than the download bandwidth. Therefore in this work the assumption is made that the upload bandwidth is the limiting resource (bandwidth is usually also more constrained than CPU and disk space). The peers must obviously have a download capacity that is at least as high as the maximum streaming rate to be able to receive the stream.

Under the assumption that all peers want to consume a stream with a given rate, one must develop architectures able to guarantee the same minimum download rate across the whole network. In an environment where peers have different upload capacities, the position of each node has a high impact on overall download performance; e.g., if a peer with high bandwidth capacity is placed at the end of a distribution chain, then its upload bandwidth is wasted. In Chapter 3 we present our different approaches for architectures providing uniform download rates and their analysis.

1.4.2 Adaptation Algorithms

The architectures developed in Chapter 3 can only be constructed with global knowledge of all peers. Maintaining such information would necessitate non-scalable solutions, e.g., a central tracker node. Therefore, algorithms to dynamically approximate optimal distribution architectures have been developed in Chapter 4. Peers initially join the system at any position where they are accepted (each node has a maximal number of neighbors). Starting from this configuration, nodes dynamically adapt their position according to their upload bandwidth: the higher its upload capacity, the nearer to the source a peer should be. It is analytically shown that this adaptation leads to a significantly higher throughput.

1.4.3 Fairness and Robustness

The position of the peers in distribution trees does not only affect overall performance, but it also lets peers at leaves operate as free-riders, i.e., consume without contributing. To prevent peers from being selfish, additional fairness mechanisms have to be implemented. Systems like BitTorrent [17] use a tit-for-tat mechanism where peers only return as many chunks as they receive (except for the bootstrap phase). This approach can slow down the system as peers are waiting for data before they send chunks. In a live streaming system this delay can have a strong impact on the stream quality and may break down the system. Therefore a new fairness approach based on robustness has been developed and is presented in Chapter 5.

The more children a peer p serves, the higher the number of backup links it gets. In the case of a failure of one of its parents, the peer can choose a new source among all its backups. A peer that is not contributing (has no children) will be strongly affected by a failure and has to rejoin the system, an operation that is time-consuming and produces gaps in the stream. On the other hand, a cooperating peer has a high chance to find a stable backup peer with sufficient upload capacity.
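The bandwidth-for-reliability incentive can be pictured with the following schematic sketch. It is only an illustration of the idea described above; the actual CrossFlux mechanisms are specified in Chapter 5, and the one-backup-per-child rule and the data structures below are simplifications.

```python
# Schematic view of robustness-based tit-for-tat: a peer that serves more children
# accumulates more backup links and can recover locally from a parent failure
# instead of rejoining the whole system.

class Peer:
    def __init__(self, name: str):
        self.name = name
        self.children: set["Peer"] = set()    # peers this node uploads to
        self.backups: set["Peer"] = set()     # alternative parents earned by contributing

    def serve(self, child: "Peer") -> None:
        self.children.add(child)
        self.backups.add(child)    # simplified reward: each served child grants a backup link

    def on_parent_failure(self) -> "Peer | None":
        alive = [b for b in self.backups if b.is_alive()]
        if alive:
            return alive[0]        # switch to a backup parent, playback continues
        return None                # free-rider: must rejoin, producing a gap in the stream

    def is_alive(self) -> bool:
        return True                # placeholder for a real failure detector
```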
1.4.4 Mesh-based Architectures

As opposed to tree-based systems (studied in the first chapters of the thesis), in mesh-based systems peers have more flexibility in choosing neighbors. Instead of having a fixed position and forwarding chunks from its parent to its children, a peer p trades chunks with a much larger set of neighbors whose composition evolves dynamically. If one of p's active neighbors fails, then p simply discards it from its neighbor set and chooses another neighbor to trade chunks with. This mechanism makes mesh-based systems much more robust than tree-based architectures, where a parent failure affects the child and its whole sub-tree. Mesh-based architectures are analyzed in Chapter 6 by considering them as a collection of distribution trees.
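As a contrast to the tree-based recovery sketched before, the mesh behavior described here amounts to little more than maintaining a neighbor set. The fragment below is again a simplified illustration; neighbor discovery (gossip, tracker) and chunk scheduling are assumed and left out.

```python
import random

# Minimal illustration of mesh-style robustness: a failed neighbor is simply
# dropped and replaced by another known peer; no subtree is affected.

class MeshPeer:
    def __init__(self, name: str, target_degree: int = 4):
        self.name = name
        self.target_degree = target_degree
        self.neighbors: set[str] = set()     # active trading partners
        self.known_peers: set[str] = set()   # candidates learned elsewhere (assumed)

    def handle_failure(self, failed: str) -> None:
        self.neighbors.discard(failed)
        self.refill()

    def refill(self) -> None:
        candidates = list(self.known_peers - self.neighbors - {self.name})
        while len(self.neighbors) < self.target_degree and candidates:
            self.neighbors.add(candidates.pop(random.randrange(len(candidates))))
```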
1.5 Structure

The thesis is divided into 7 chapters: Chapter 2 discusses related work, Chapters 3 to 5 form the main technical part, and Chapter 7 concludes the thesis and outlines future work.

Outline. Chapter 3 analyzes different architectures which all have the same property of providing a uniform download rate to all participating peers. Parts of the chapter have been published in [48]. Chapter 4 presents an adaptation algorithm which can further enhance the throughput of distribution architectures. The content has been partially published in [51]. In Chapter 5 a new P2P media streaming architecture, integrating the results from the previous chapters, is presented. This chapter has partially been published in [49] and in [50]. Chapter 6 analyzes the structure of mesh-based systems and makes the link from the tree-based CrossFlux design to mesh-based architectures. Parts of this chapter have been published in [6].

Chapter 2
Related Work

2.1 Overview of Existing Systems

Although many proposals for P2P media streaming architectures exist, only few were implemented and deployed. One very popular system which is widely used in Europe is Zattoo [60]. It became available in 2006 and quickly attracted a large user base (more than 20% of all Swiss Internet users). Zattoo offers dozens of TV channels and operates in several countries. Unfortunately no technical details are available. The same holds for PPLive [43], which has only been studied by performance measurements [23].

We can identify the following reasons for the slow growth of P2P media streaming systems:

1. A critical mass of users is needed for cooperation to be effective. If there are only few participants, then the media server can use traditional multicast communication.

2. Most existing systems fall short of providing the properties users expect from media streaming architectures: (1) no interruptions and no jitter; (2) fast startup of the stream; and (3) quick recovery from failures such that the stream is played back continuously also under high churn.

3. P2P systems have to deal with legal and political issues. The owners of a stream lose control of how the stream is being distributed. Further, any user with video recording equipment is able to serve streams, which opens the door to copyright infringement.

We describe below some of the most well-known streaming systems and highlight the differences with our CrossFlux architecture described in Chapter 5. Given the large number of proposals found in the literature, this list is not exhaustive.

CoolStreaming. One of the most widely used systems is CoolStreaming [61], which was deployed with up to 30,000 distinct users in 2004 but was stopped due to copyright issues in 2005. Unlike many other systems, CoolStreaming is data-driven, considers bandwidth heterogeneity, and tries to reduce latency between pairs of peers. A set of backup nodes is maintained to deal with failures and to adapt to changing network properties. Backup nodes are periodically contacted to see if they provide higher performance than certain nodes currently serving the stream; in that case, nodes may be exchanged. A limitation of this optimization strategy is that it is restricted to the nodes of the backup set. In contrast, the algorithms used in CrossFlux allow us to perform transitive optimizations, i.e., optimizations not limited to exchanges with direct neighbors.

Chunkyspread. Chunkyspread [56] is an unstructured approach to media streaming. It uses a multi-tree (multi-description) based structure. The structure is very dynamic as each peer periodically looks for new partners in its local environment. It exchanges information (load, latency, creation of loops) with each neighbor to search for the best parent-child pairs for each tree. The constraints on these relationships are (1) avoid loops, (2) satisfy any tit-for-tat constraints, (3) adapt load (it shall be in a per-peer defined range) and (4) reduce latency. In contrast to Chunkyspread, CrossFlux combines fairness with robustness. Trees are built in a more structured way to include backup links.

End System Multicast. End System Multicast (ESM) [14] is a P2P media streaming solution that provides several desirable properties. An overlay mesh is initially constructed and multiple spanning trees, rooted at each possible source, are constructed on top of it. The trees are then incrementally enhanced by adding or dropping additional links depending on a utility function. In CrossFlux, we try to construct a good mesh from the beginning and incorporate performance metrics during the joining process of new nodes. In addition, CrossFlux introduces a notion of fairness by using links between nodes in one direction to serve streaming data and in the other direction as backup links.

PeerStreaming. PeerStreaming [30] differs from other systems in that it adapts the streaming bit-rate dynamically to the available bandwidth, which directly depends on the number of serving peers. The clients reading the stream receive different parts from multiple altruistic serving peers. A new node joins the system by asking for a list of serving peers and connects to a number of them. The main drawback is that, unlike in CrossFlux, there is no incentive for the serving peers to participate in the system and to help distribute the stream.

GnuStream. GnuStream [27] is built on top of the Gnutella P2P substrate [47, 20]. A peer in GnuStream queries the Gnutella network to locate multiple parents that have part of the stream. Parts of the stream are then requested from these parents and aggregated in the peer for playback. As GnuStream relies upon Gnutella, its implementation is very simple: joins and searches are mapped to the underlying protocols, while failure recovery is achieved by simply exchanging a failed source with another one. This simplicity comes at the price of some performance loss. Gnutella is not optimized for live media streaming and, therefore, may not perform as well as a system that has been designed specifically for that purpose, as CrossFlux is.

SplitStream. SplitStream [8] is a P2P media streaming architecture that focuses on robustness. As in our model, the stream is split into multiple stripes that can be distributed independently. A distinct tree is constructed for each of these stripes, spanning over all participating peers. The robustness in SplitStream comes from the fact that each node is an inner node in at most one tree and a leaf in all the other trees. Thus, if a peer fails, only one distribution tree is affected and has to be rebuilt. In CrossFlux, peers may be placed as interior nodes in more than one tree, but quick recovery from the failure of a peer p is achieved by using backup paths which do not include p. Additionally, fairness is introduced in the tree architecture by rewarding forwarding peers with higher robustness.

CollectCast. The CollectCast [22] architecture is built on top of a P2P DHT substrate, such as Chord, CAN, or Pastry. Failures or stream degradations are handled by exchanging active senders. Further, CollectCast tries to optimize the download rate at each peer by selecting the best performing peers out of a candidate set. In contrast, CrossFlux does not rely on fixed candidate sets but performs a more global optimization by moving peers across the trees.

CoopNet. CoopNet [36] combines a classical client-server model with a P2P architecture. The server is responsible for directing joining nodes to potential parents and for reconnecting peers upon failure of their parents. The central instance obviously limits scalability and represents a single point of failure, whereas in CrossFlux there is no central component.

NICE. NICE [2] uses a hybrid architecture in which peers are clustered in a hierarchical layer structure. Each cluster has a leader, which also belongs to the next layer above. Latency can be optimized by selecting as leader a peer that is close to the center of the cluster. The system focuses on low-bandwidth streams distributed to a large receiver set. Thus, optimization of the available bandwidth is not a major objective of NICE and has not been explicitly addressed.

ZIGZAG. ZIGZAG [54] is another layer-based architecture. Like NICE [2], it constructs clusters that are grouped in a hierarchical structure. Unlike NICE, ZIGZAG dynamically adapts to the load of the cluster heads: if a node has too many children or insufficient bandwidth capacity, it can distribute the load by reconfiguring the cluster. ZIGZAG does not use path redundancy and it is not clear how well it scales when distributing high-bandwidth streams.
2.2 Analytical Models for Download Rate

Two main approaches exist for dealing with differences in uplink bandwidth in overlay multicast systems. Narada [12], CollectCast [22] and GnuStream [27] use bandwidth measurements to improve the overlay structure by dynamically replacing links. In contrast, Scattercast [11], SplitStream [8], Overcast [26] and ALMI [38] use degree-constrained structures to deal with heterogeneity. If a peer's degree is saturated when a new peer wants to connect, then some reorganization needs to take place. CoopNet [36] uses both of these techniques. It deploys multiple parallel trees and reorganizes them based on performance feedback.

All of these systems do not try to uniformly distribute the download rate to all peers. Instead, they send distinct streams at different rates, or they consider bounded streams and use buffers to deal with timing problems. Our goal is to minimize the buffer requirements by evening out the download rate at all peers.

In [44], the authors investigate the impact of heterogeneous uplink bandwidth capacities on Scribe [9]. Their experiments show that heterogeneity may create distribution trees with high depths, which is not desirable. After proposing several ways to address the problem, they conclude that heterogeneity in DHT-based multicast protocols remains a challenging open problem.

Analytical models have been proposed for peers with homogeneous bandwidth capacities [3, 59], as well as for heterogeneous peers but for non-uniform download rates [7]. Different architectures for homogeneous and heterogeneous bandwidth constraints are analyzed. In contrast to this work, the authors make the assumption that the downlink and uplink capacities are symmetric and do not consider uniform download rates.

2.3 Adaptation Algorithms

Many architectures for content distribution have been proposed. Most of these systems build an overlay network that is kept throughout the distribution process. Links are only changed if either a neighbor fails or the performance heavily degrades. Affected nodes then simply rejoin the tree starting at the root. Most architectures do not actively reconfigure links before a degradation occurs.

CollectCast [22] is an example of such a passive system. The authors propose an architecture that works on two different sets of nodes for media streaming. From a set of potential senders the best ones are taken and form the active set. The other potential senders are kept in a standby set. During the streaming process peers passively measure bandwidth and latency. If the quality of the media streaming falls below a threshold, a peer from the active set is exchanged with one from the standby set. A similar exchange technique has been proposed in GnuStream [27] for use with the Gnutella system.

Other systems like Scattercast [11] try to construct near-optimal distribution trees in advance. A set of agents is deployed across the network; the agents together provide a multicast service. The number of clients that join an agent is limited by its bandwidth capacity. The goal of Scattercast is to construct a degree-constrained spanning tree across all agents while keeping the average delay between the source and all destinations at a minimum. This problem is known to be NP-hard.

One system which dynamically adapts to the network conditions was presented with TMesh [58]. The architecture aims at reducing latencies between nodes in a multicast group. Based on a set of heuristics, new links are added to the existing tree or mesh. If a new link reduces the overall latency, then it is kept; otherwise, it is dropped.
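The passive sender-swapping strategy described above (an active set refreshed from a standby set when measured quality drops) reduces to a small control loop. The sketch below is a generic illustration, not CollectCast's actual protocol; the quality scores and the threshold are placeholders.

```python
# Generic active/standby sender management for passive adaptation: senders are
# swapped only when the measured quality of an active sender falls below a
# threshold; nothing is reorganized proactively.

QUALITY_THRESHOLD = 0.8   # placeholder value

def adapt_senders(active: dict[str, float], standby: dict[str, float]) -> None:
    """Both dicts map a sender id to a measured quality score in [0, 1]."""
    for sender, quality in list(active.items()):
        if quality < QUALITY_THRESHOLD and standby:
            best_standby = max(standby, key=standby.get)
            if standby[best_standby] > quality:
                del active[sender]
                active[best_standby] = standby.pop(best_standby)
                standby[sender] = quality        # demote the underperforming sender
```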
2.4 Fairness and Robustness

The impact of fairness on download performance has been studied in [15]. A framework is presented to evaluate the maximum achievable download rate of receivers as a function of altruism. The results show that fairness has a high impact on the performance of receivers and that a small degree of altruism brings significant benefit. In [13] a taxation model is presented in which peers with higher upload capacity help compensate the bandwidth of peers with lower bandwidth.

A framework based on game theory is presented in [31]. In this paper incentive-based strategies to enforce peer cooperation are evaluated and compared.

In [57] it is shown that not only is a high number of users necessary to build a robust system, but the main contribution to the system is provided by some stable peers with high upload capacity. This confirms our approach of propagating well-performing peers toward the root and enhancing their robustness with additional backup links.

2.5 Mesh-based Approaches

Many mesh-based P2P streaming systems have been proposed in the last few years [37, 33, 40], but none of them has been formally analyzed due to their complexity. Mainly, these architectures have been studied by means of simulations [18, 32] or experimental evaluation [41].

A comparative study of tree- and mesh-based approaches for media streaming is presented in [34]. The authors first propose an organized view of data delivery in mesh overlays, which consists of data diffusion and swarming phases, and later introduce delivery trees, which they discover in mesh overlays in a similar fashion to the diffusion trees described in our thesis. This work is different in that it focuses on formally analyzing properties of diffusion trees rather than evaluating them by simulation. Further, an overlay adaptation algorithm that improves the properties of these trees is proposed.

A different approach to analyzing P2P media streaming systems are fluid models. In [29] the authors present a stochastic fluid model that takes into account peer churn, heterogeneous peer upload capacities, peer buffering and delays. In this thesis the distribution trees created in a mesh are analyzed such that known adaptations for tree-based approaches can be applied to meshes.
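The tree-based view of mesh overlays referred to above (and developed in Chapter 6) can be made concrete with a small sketch: if every peer records, for each stripe, the neighbor it currently receives that stripe from, then each stripe induces a diffusion tree rooted at the source. The code below illustrates this under that assumption; it is not the analysis of Chapter 6, and the example topology is invented.

```python
# Build the per-stripe diffusion tree induced by a mesh overlay, assuming each
# peer knows which parent currently delivers a given stripe to it.

def diffusion_tree(parent_of: dict[str, str], source: str) -> dict[str, list[str]]:
    """parent_of maps peer -> neighbor it receives this stripe from.
    Returns the tree as an adjacency list rooted at the source."""
    tree: dict[str, list[str]] = {source: []}
    for peer, parent in parent_of.items():
        tree.setdefault(parent, []).append(peer)
        tree.setdefault(peer, [])
    return tree

def tree_height(tree: dict[str, list[str]], root: str) -> int:
    children = tree.get(root, [])
    return 0 if not children else 1 + max(tree_height(tree, c) for c in children)

# Example mesh with two stripes: same peers, two different parent assignments.
stripe0 = {"b": "s", "c": "s", "d": "b", "e": "c"}
stripe1 = {"b": "c", "c": "s", "d": "c", "e": "b"}
for stripe in (stripe0, stripe1):
    t = diffusion_tree(stripe, "s")
    print(t, "height:", tree_height(t, "s"))
```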
2.6 Summary

The systems presented in Section 2.1 study different important aspects of media streaming in P2P systems. They focus mostly on isolated problems and their solutions. The architectures proposed in this thesis are not only adapted during data dissemination but are also designed from the beginning to meet the challenges of P2P media streaming.

Several analytical models developed for P2P media streaming are discussed in Section 2.2. Many of them focus on systems composed of peers with homogeneous upload bandwidths, whereas only few consider heterogeneous upload bandwidths. The present thesis studies a model that assumes peers with heterogeneous up- and download bandwidths, with the objective of providing all peers throughout the system with sufficient download bandwidth for receiving the media stream at the same streaming rate.

Section 2.3 presents systems based on adaptation algorithms. In many of them, peers use a backup set for replacing underperforming peers from the current neighbor set with more powerful ones. Such an approach could also be used in our architectures, leading to a more robust system. In contrast, the adaptation algorithms presented in this work are designed to optimize bandwidth usage instead of reducing latency. They are suited to both tree- and mesh-based architectures.

Section 2.4 discusses the problem of fairness in P2P media streaming. The literature studies this topic in detail, mainly through experimental approaches. All those studies conclude that the problem of free-riders is an important issue in P2P systems. The architectures presented in this thesis introduce a novel approach to fairness, by trading bandwidth contribution against increased robustness.

Finally, mesh-based models are discussed in Section 2.5. This thesis presents a novel analytical approach to study tree-based diffusion patterns in mesh-based architectures. With this approach, existing algorithms and models for tree-based architectures can be applied to mesh-based systems.

Chapter 3
Peer-to-peer Distribution Architectures providing Uniform Download Rates

Parts of this chapter have been published in: M. Schiely and P. Felber. Peer-to-peer Distribution Architectures providing Uniform Download Rates. Proceedings of the International Symposium on Distributed Objects and Applications (DOA'05), October 2005.

3.1 Introduction

Early studies of content distribution architectures have primarily focused on homogeneous systems where the bandwidth capacities of all peers are similar, or on simple heterogeneous scenarios where different classes of peers with symmetric bandwidth try to minimize the average download duration. Such settings are not representative of real-world streaming networks.

In this chapter, we study the problem of content distribution under the assumption that peers have heterogeneous and asymmetric bandwidth (typical for ADSL connections), with the objective to provide uniform download rates to all peers, a desirable property for distributing streaming content. Our goal is to propose and analyze different architectures for peer-to-peer networks that are able to sustain large populations of clients while delivering a given stream to all of them, under the simplifying assumption that peers are fair and no failures occur.

Unlike previous studies, we assume that the peers have heterogeneous and asymmetric bandwidth (typical for ADSL connections) and we aim at providing a uniform download rate to each of them. This property is crucial for applications like media streaming, for which users expect an uninterrupted stream of data.

We consider simple models with two classes of peers that differ in their uplink capacities. We study several architectures that achieve optimal utilization of the aggregate uplink capacity of the system and share it equally between all the peers. It obviously follows that fast peers must share more bandwidth than they receive, but this unfairness can be balanced by placing them nearer to the source for increased reliability and shorter latency.

The analytical models developed in this chapter provide interesting insights on the performance of content distribution architectures with uniform download rates in various configurations. By comparing them with other architectures providing non-uniform rates, we conclude that uniformity can be achieved with little additional complexity and no performance penalty.

The rest of the chapter is organized as follows: we first present the system model in Section 3.2. Then, we analyze three different architectures providing uniform download rates in Section 3.3 and compare them in Section 3.4. Finally, Section 3.5 summarizes the chapter.

3.2 System Model and Definitions

For the rest of this chapter we use the following model. We assume that nodes in the network have different upload capacities. We analyze content distribution architectures with two classes of nodes, referred to as fast and slow peers according to their upload bandwidth. All nodes in a class have the same bandwidth. The data stream is sent by a single source which has the same bandwidth as fast nodes. To simplify the analysis, we assume that the source receives the data at the same uniform rate as the other peers before distributing it within the content distribution network. We shall ignore latency in our model.

As is the case for typical ADSL connections, we assume that the slow peers are essentially limited by their uplink capacity and have sufficient download bandwidth to receive the data at the same uniform rate as the other peers (as we shall see, this rate is not higher than the uplink capacity of the fast peers). We consider Nf fast peers in class F with upload bandwidth Bf and Ns slow peers in class S with upload bandwidth Bs (Bf > Bs). For the sake of simplicity, we assume in our analysis that Bs = Bf/k with k being an integer value. The total number of peers is N = Nf + Ns.

We analyze the behavior of different architectures when transmitting a large content. We assume that the file being transmitted is split into C chunks that can be sent independently: as soon as a peer has received a chunk, it can start sending it to another peer. We consider one unit of time to be the time necessary for transmitting the whole content at the uniform rate r that is provided to all peers. Each chunk is thus received in 1/C unit of time. For clarity, we shall describe the different architectures under the assumption that we transmit the whole file at once, and we shall introduce chunks later in the analysis. As the total download time is a function of the number of chunks, our main objective of supporting streaming data corresponds to situations where C → ∞.

A peer may receive chunks from the source via different paths. For instance, in the case of SplitStream [8], the source splits the content into several layers and sends each of them along distinct trees spanning all the nodes. Two chunks sent at the same time by the source may thus traverse a different number of peers and be received at different times. This implies that each peer may have to buffer some chunks until all of those sent at the same time have been received. We compute T as the maximal difference in distance between a peer and the closest common node along the paths to the source via distinct incoming links. This value indicates the buffer space needed at the peer. For instance, in Figure 3.1, the first node of the right chain receives chunks from the source in 1 (directly), 2 (via one peer), and 3 (via two peers) units of time, and we have T = 3 - 1 = 2. Clearly, small values of T are desirable and we shall also compare the different architectures with respect to this property.

3.2.1 Uniform Rate

As previously mentioned, our goal is to provide the same download rate to all peers in the network. Obviously, the maximal rate r that can be achieved corresponds to the aggregate upload bandwidth of all nodes divided by the number of peers (Bs < r < Bf).
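The uniform rate r can be checked numerically. The short sketch below assumes equal class sizes (Nf = Ns), as in the examples used in Section 3.3, and reproduces the rates r = 3/4 Bf for k = 2 and r = 5/8 Bf for k = 4 that appear in Figures 3.1 and 3.3.

```python
from fractions import Fraction

def uniform_rate(n_fast: int, n_slow: int, b_fast: Fraction, k: int) -> Fraction:
    """Maximal uniform rate: aggregate upload bandwidth divided by the number of peers."""
    b_slow = b_fast / k
    return (n_fast * b_fast + n_slow * b_slow) / (n_fast + n_slow)

b_f = Fraction(1)                                           # rates expressed as fractions of Bf
assert uniform_rate(100, 100, b_f, 2) == Fraction(3, 4)     # k = 2  ->  r = 3/4 * Bf
assert uniform_rate(100, 100, b_f, 4) == Fraction(5, 8)     # k = 4  ->  r = 5/8 * Bf
```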
Westudyseveral architecturesthatachieveoptimal uti-lizationoftheaggregateuplinkcapacityofthesystemandshareitequallybetweenall thepeers. Itobviouslyfollowsthatfastpeersmustsharemorebandwidth than they receive, but this unfairness can be balanced by placingthemnearertothesourceforincreasedreliabilityandshorterlatency.Theanalytical modelsdevelopedinthischapterprovideinterestingin-sights on the performance of content distribution architectures with uniformdownloadrates invarious congurations. Bycomparingthemwithotherarchitecturesprovidingnon-uniformrates, weconcludethatuniformitycanbeachievedwithlittleadditionalcomplexityandnoperformancepenalty.The rest of the chapter is organized as follows: we rst present the systemmodel in Section 3.2. Then, we analyze three dierent architectures providinguniformdownloadrates inSection3.3andcompare theminSection3.4.Finally,Section3.5summarizesthechapter.3.2 SystemModelandDenitionsFor therest of this chapter weusethefollowingmodel. Weassumethatnodesinthenetworkhavedierentuploadcapacities. Weanalyzecontentdistributionarchitectureswithtwoclassesofnodes, referredtoasfast andslowpeersaccordingtotheiruploadbandwidth. All nodesinaclasshavethesamebandwidth. Thedatastreamissentbyasinglesourcewhichhasthe same bandwidthas fast nodes. Tosimplifythe analysis, we assumethat the source receives the data at the same uniform rate as the other peersbefore distributing it within the content distribution network. We shall ignorelatencyinourmodel.20As is the case for typical ADSL connections, we assume that the slow peersareessentiallylimitedbytheiruplinkcapacityandhavesucientdownloadbandwidthtoreceivethedataatthesameuniformrateastheotherpeers.1We consider Nffast peers inclass FwithuploadbandwidthBfandNsslowpeersinclassSwithuploadbandwidthBs(Bf> Bs). Forthesakeofsimplicity, weassumeinouranalysisthatBs=Bfkwithkbeinganintegervalue. ThetotalnumberofpeersisN= Nf+ Ns.Weanalyzethebehaviorof dierentarchitectureswhentransmittingalarge content. We assume that the le beingtransmittedis split intoCchunks that canbesent independently: as soonas apeer has receivedachunk, it can start sending it to another peer. We consider one unit of time tobe the time necessary for transmitting the whole content at the uniform rater that is provided to all peers. Each chunk is thus received in1Cunit of time.Forclarity,weshalldescribethedierentarchitectureswiththeassumptionthat we transmit the whole le at once and we shall introduce chunks later intheanalysis. Astotaldownloadtimeisafunctionofthenumberofchunks,ourmainobjectiveof supportingstreamingdatacorrespondstosituationswhereC .Apeermayreceivechunksfromthesourceviadierentpaths. Forin-stance,inthecaseofSplitStream[8],thesourcesplitsthecontentintosev-eral layers and sends each of them along distinct trees spanning all the nodes.Two chunks sent at the same timeby the source may thus traverse a dierentnumberof peersandbereceivedatdierenttimes. Thisimpliesthateachpeer may have to buer some chunks until all of those sent at the same timehavebeenreceived. WecomputeTasthemaximal dierenceindistancebetweenapeerandtheclosestcommonnodealongthepathstothesourceviadistinctincominglinks. Thisvalueindicatesthebuerspaceneededatthe peer. For instance, in Figure 3.1, the rst node of the right chain receiveschunks from the source in 1 (directly), 2 (via one peer), and 3 (via two peers)unitsof timeandwehaveT=3 1=2. 
Clearly, small valuesof Taredesirableandweshallalsocomparethedierentarchitectureswithrespecttothisproperty.1Asweshallsee,thisrateisnothigherthantheuplinkcapacityofthefastpeers.213.2.1 UniformRateAspreviouslymentioned, ourgoal istoprovidethesamedownloadratetoall peers in the network. Obviously, the maximal rate rthat can be achievedcorrespondstotheaggregateuploadbandwidthofallnodesdividedbythenumber of peers (Bs1)inparallelstartingfromasinglesource.Obviously, anexpansion(i.e., forkingof chains) canonlybe achievedbyfast peers, as they have more upload capacity than the target download rate23t=2t=14t=1t=0t=13t=12t=11t=10t=9t=8t=7t=6t=5t=4t=3FFFFFFFFFFFSSS1/4SSSS3/41/41/21/4SSSSSFFigure3.1: Linearchainswithoneexpansionstepandk = 2(Bs=Bf2).r(whereris theaggregateuploadbandwidthof all peers dividedbythenumber of peers). Usingthis freecapacityallows us tobuildtheservicecapacitymrnecessarytoservempeersinparallel.Informally, thegrowingphaseproceeds as follows. Therst fast node(thesource)startsachainbyservingoneotherfastpeerwithrater. TheremainingbandwidthBf rwillbeusedinanotherchain. Thesecondfastpeer againserves another fast peer withrater, whichalsoleaves it withBf rremainingbandwidth. Thisprocesscontinuesuntil thesumof theremainingbandwidthsoftherstpfastnodesissucienttoserveanotherpeer,i.e.,p(Bf r) r. GiventhatBs=1kBf,pcanbecomputedas:p =

k + 1k 1

24t=0t=13t=14SS S FFFSSSSSSSS SSSSFFt=1t=2t=3t=5t=6t=7t=8t=9t=10t=11t=12t=43/41/41/4 1/43/41/2 1/21/41/41/4FFFFFFFFFFFigure3.2: Linearchainswithtwoexpansionstepsandk = 2(Bs=Bf2).In the formula above,depending on the value of k,some bandwidth maybelostintheintegerconversion. Thiscanbeavoidedbyexpandingtoknodesatonce. Thenumberofpeerspknecessaryforthisexpansioncanbecomputedbysolvingpk(Bf r) = r(k 1),whichgives:pk= k + 1Intherestofthechapter,weshallassumeexpansionstokchainsusingpkpeers (insteadof 2chains usingppeers). Eachfast peer caninturnforkanotherkchainswiththehelpofpk 1otherfastpeers. Byrepeatingthisprocess, thenumberof chainscanbemultipliedbykeveryiteration.Eachexpansionobviouslyrequirespkunitsof time. Exampleswithk=2(r=34Bf)andk=4(r=58Bf)areshowninFigures3.1, 3.2, and3.3. It25t=13t=0t=14F FF FF Ft=1t=2t=3t=5t=6t=7t=8t=9t=10t=11t=12t=45/82/8 2/81/83/8 3/8FFFFF FF FSSSSSSS2/81/82/82/81/8SSS SSSSSFFigure3.3: Linearchainswithoneexpansionstepandk = 4(Bs=Bf4).is important to note that the peers are organized as a directed acyclic graph(DAG).Phase2- Parallel phase. Theparallel phasestartswhenthegrowingphasehas nishedits expansiontompeers. It constructs twosets ofm2linearchains, composedrespectivelyof fastandslowpeers. Eachchainofslowpeers is combinedwithachainof fast peers. Aslowpeer serves itssuccessoratrateBf/k. AfastpeerservesitssuccessoratraterandthenextslowpeerinthecompanionchainatrateBf r. Thus, eachpeerisservedatrater. Phase2proceedsuntilallfastpeersarebeingserved(seeFigures3.1,3.2,and3.3).26Phase 3- Shrinkingphase. Inthelast phase, weareleft withasetof slowpeerstoserveatrater. Asaslowpeercannotserveanotherpeerbyitself, thebandwidthofseveral peersmustbecombined, whichleadstoshrinking down the number of parallel chains. This phase is almost symmet-ricaltothegrowingphase,inthatwecanservepkslowpeersfromeachsetof kchains. We repeat this process until all slow peers have been served (seeFigures3.1,3.2,and3.3).3.3.2 AnalysisWe can easily notice that delays of T= k are encountered during the growingphase. Thecaseof theshrinkingphaseismoresubtle, asTgrowslargerif wekeepitperfectlysymmetrictothegrowingphase. Byallowingsomeasymmetry, wecanbothboundthedelaysbythesamevalueT=kandreducethetotallengthoftheshrinkingphase.Wenowcomputethenumberofpeersthatcanbeservedwithinagiventimeinterval. Afterpksteps, kpeerscanstartagainanotherchain. If wedenesasthenumberof expansionsteps, wecancalculatethenumberofpeersintherstphaseN1tobe:N1=s1

i=0kipk= pkks 1k 1The shrinking phase is built in a symmetric manner. Therefore the num-ber of nodes N3inthethirdphaseis thesameas inthegrowingphase:N3=N1. GiventheconstraintthatN1 + N3 N, themaximal valueofsis:smax= logk

Nk 12pk+ 1

Thenumberof nodesN2thatcanbeservedinphase2inagiventimeintervalTis:N2= ks(T 2spk + 1)27Indeed, thereareksparallel nodesandphase2lastsforthegiventimeinterval minus the duration of the growing and shrinking phases. The numberof peers served in a time interval Twith s growing steps (1 s smax|) isthen:N(T, s, k) = 2pkks 1k 1+ ks(T 2spk + 1)We observe that the number of peers servedinagiventime intervalgrows with s, producing thus more ecient content distribution architectures(compareN(14, 1, 2) = 24inFigure3.1andN(14, 2, 2) = 30inFigure3.3).Solving the equation for Tgives the number of units of time necessary toserveNpeers:T(N, s, k) =N(k 1) 2pk(ks 1)ks(k 1)+ 2spk 1 (3.2)Assuming that the content is split into chunks, the total download time forthe complete le is then 1 +1CT(N, s, k), i.e., the time necessary to transmitthewholeleatraterplusthepropagationtimeofthechunksthroughthecontentdistributionnetwork. UsingEquation(3.2)leadsto:T(N, s, k, C) = 1 +1C

N(k 1) 2pk(ks 1)ks(k 1)+ 2spk 1

(3.3)Figure3.4showsthetimenecessarytocompletethedownloadwiththelinearchainarchitecturefordierentvaluesof kandC. Weobservethatperformanceimproveswithlargernumbersofchunks,becauseallpeerscanbeactivemost of thetime. Incontrast, withfewchunks onlyafractionof thepeerswill beuploadingatanypointintime, whiletheothershaveeitheralreadyforwardedtheentireleornotyetreceivedasinglechunk.Therefore, thevalueof k, whichinuencesthedepthof thecontentdistri-butionarchitecture, hasmoreimpactonperformancewhenthenumberofchunksissmall. InFigure3.4onecannoticeindeedthatthereductionofdownload times starts earlier with small values of k because they yield deeperarchitectures.Figure 3.5compares the downloadtimes for dierent values of s(the28100101102103104105106103104105106107108Number of roundsNumber of clients NC=102 k=2k=4C=104 k=2k=4C=106 k=2k=4Figure3.4: Downloadtime(inrounds)of thelinearchainarchitecturefordierentvaluesofkandC(s = 4).valuesmaxcorrespondstothemaximal numberof expansionpossiblewiththegivenpeerpopulation). Asexpected,performanceimproveswithhighervalues of s because they produce architectures which have shorter paths fromthe source to all other peers. The optimal value smax exhibits extremely goodscalability.3.3.3 MeshArchitectureThelinearchainsarchitecturecanbeimprovedinseveral waysif weallowpeerstobeorganizedasadirectedgraphwithcycles. Wecanreducethedurationof thegrowingphaseandthusthelengthof thepaths(andcon-sequentlythelatency); wecansimplifynetworkmanagementbyonlyusingconnections with identical bandwidth capacities; and we can limit the size ofbuersateachpeertoaconstantvalue.TheresultingmesharchitectureisshowninFigure3.6(fork=2andoneexpansionstep)andFigure3.7(forageneralvalueofkandtwoexpan-sionsteps). Thedownloadbandwidthforeachpeeristheaggregateuploadbandwidth divided by the number of peers (Bs+BfN). In the mesh a node does29100101102103104105106103104105106107108Number of roundsNumber of clients Ns=4s=6s=8s=smaxFigure3.5: Downloadtime(inrounds)of thelinearchainarchitecturefordierentvaluesofs(k = 2,C= 102).notonlyreceivedatafromitsparent,butalsofromitssiblings. Thesourcehas2kfastpeersaschildrenandsendsdataatrateBf2ktoeachofthem;theremaining missing bandwidthBf2is provided by their siblings. The rst-levelfastpeerstogetherservek2childrenwiththeirremainingbandwidthofBf2;again,theremainingbandwidthk12kBfisprovidedbythesiblings. Second-level peershaveenoughbandwidthtocompletelyservek2children. Eachthird-levelchildcaninturnexpandtok2peersinthreesteps.As inthe previous architecture, one canbuildlinear chains after theexpansion phase before reducing the architecture to one peer. The shrinkingphaseissymmetrictothegrowingphase,asshowninFigure3.6.UsingonlyconnectionswithidenticalrateBf2ksimpliessignicantlythemanagement of the architecture. The throughput is controlled by the sourceand peers only dier in their number of outgoing connections: the outdegreeisalways2kforfastnodesand2forslownodes. Allpeershaveanindegreeofk + 1.30t=6t=7t=8t=9t=10t=11t=12FS SS SFFFFSS S S SS S S SS SS SS SFFFF S S FF S S F1/4F F FF F F FFFFt=0t=1t=2t=3t=4t=51/4FFigure3.6: Meshwithoneexpansionstepandk = 2(Bs=Bf2).3.3.4 AnalysisOne can note in Figure 3.6 that the rst level fast peers receive chunks fromthesourceatt=1andfromtheirsiblingatt=2; similarly, secondlevelpeersreceivechunksatt=2andt=3; onthethirdlevel, all chunksarereceived simultaneously at t = 3. 
A similar observation can be made with theshrinking phase and it follows that constant delays of T= 1 are encounteredinthiscontentdistributionarchitecture.Forcomputingthenumberof nodeswhichcanbeservedintimeTweagainanalyzethethreephases. Aswehaveseen,afastpeercanexpandtok2peersinthreeunitsoftimewiththehelpof2k + k2otherfastpeers. If31t=5t=9t=8t=10t=11t=12t=7t=61/(2k)2k2k2k2k2k2t=0t=1t=2t=3t=4k2k2k2k2k2k2k22kk21/(2k)(k1)/(2k)r1/2k2k2S2kr2k 2k1/kr2k 2k1/(2k)1/kk2k 2k 2k 2kFFigure3.7: Meshwithtwoexpansionstepsandanyk(Bs=Bfk).wedenestobethenumberofexpansionsteps, thenthenumberofpeersservedintherstphaseis:N1= 1 + (2k + 2k2)s1

i=0k2i= 1 + 2kk2s 1k 1Theshrinkingphaseagainissymmetricinthenumberof nodessothenumberofnodesinthethirdphaseN3isequaltoN1,thusN3= N1. GiventheconstraintthatN1 + N3 Nwecancomputethemaximalvalueofs:32smax=12logk

(N 2)(k 1)4k+ 1

Inphase2, k2k2(s1)parallelnodescanbeservedintheremainingtimeT 6s 1. IntotalthenumberofpeersservedwithinTunitsoftimeforagivennumberofsexpansionsteps1 s smax|isthen:N(T, s, k) = 2 + 4kk2s 1k 1+ k2s(T 6s 1)Solving the equation for Tand introducing the number of chunks Cgives:T(N, s, k, C) = 1 +1C

1k2s

N 2 4kk2s 1k 1

+ 6s + 1

Figure 3.8: Download time of the mesh architecture for different values of C (k = 2, s = 4).

Figure 3.8 shows the time necessary to complete the download with the use of the mesh architecture for different values of C and k. As expected, the download times follow the same general shape as for the linear chains architecture in Figure 3.4, but performance is significantly improved due to the faster expansion of the mesh architecture.

Figure 3.9: Download time of the mesh architecture for different values of s (k = 2, C = 10^2).

We can observe in Figure 3.9 that a higher number of expansion steps s also produces flatter architectures and therefore reduces the download time. The maximal expansion for a given peer population, s_max, yields the best download times, which are almost constant, independent of the population size.

3.3.5 Parallel Trees

The third architecture studied in this chapter consists in constructing multiple trees spanning all the nodes and sending a separate part of the content in parallel to each tree, similarly to SplitStream [8] and PTree_k [3] (as Nf = Ns, we shall use binary trees). If we construct k + 1 trees that distribute content at rate Bf/(2k), then every peer will receive data at the same uniform rate r.

We construct parallel trees by placing each fast peer (except the source) as interior node in k trees. Fast nodes will thus serve 2k other peers at rate Bf/(2k) (i.e., at aggregate rate Bf). The slow nodes are placed as interior nodes in a single tree and must thus serve two other nodes at rate Bf/(2k) (i.e., at aggregate rate Bf/k). As the number of leaves in a complete binary tree is equal to the number of interior nodes plus one and the source is a fast node, the constraint Nf = Ns is met. Figure 3.10 illustrates the parallel tree architecture (peers are numbered for clarity). Note that every peer except the source appears in all trees.

Figure 3.10: Parallel trees with N = 8 and k = 2 (Bs = Bf/2).

3.3.6 Analysis

We first need to determine the depth d of the trees. At each level i in the tree, we have 2^i nodes (the root is at level 0). Thus, the number of nodes in a binary tree of depth d is \sum_{i=0}^{d} 2^i = 2^{d+1} - 1. Considering the special role of the source, the N - 1 remaining nodes can be placed in parallel trees of depth d = \lceil \log_2(N - 1) \rceil.

It follows from the construction of the trees that delays of T = \lceil \log_2(N - 1) \rceil are encountered in this content distribution architecture. Delays grow with the number of peers, in contrast to the other architectures studied in this chapter.

The number of nodes that can be served by the parallel tree architecture in a given time interval T can be computed as follows (the first term represents the source):

N(T) = 1 + \sum_{i=0}^{T-1} 2^i = 2^T

Solving this equation for T and introducing the number of chunks C leads to the time used to distribute a file to all nodes:

T(C, N) = 1 + \frac{1}{C} \lceil \log_2 N \rceil

Figure 3.11: Download time for the parallel trees architecture for different values of C.

Figure 3.11 shows the time necessary to complete the download with the parallel tree architecture for two values of C (improvements become unnoticeable when C grows larger). As the download time is a function of the depth of the trees, which increases logarithmically with the number of peers, performance degrades only slowly with the population size.

3.4 Comparative Analysis

In this section we compare the three architectures presented in this chapter with the linear chain architecture analyzed in [7] (referred to as Linear). In contrast to our architectures, in Linear the peers have symmetric bandwidth capacities. The peers are organized in separate chains according to their bandwidth capacity and there is no cooperation between fast and slow nodes. Fast peers can therefore finish the download faster.

Figure 3.12: Download time for different architectures with k = 100, C = 100 and s = s_max. Linear shows the completion times for a population of 10^9 peers with symmetric bandwidth.

As we can see in Figure 3.12, this difference leads to a stepwise function with the fast nodes completing their download faster than the slow nodes (Nf = Ns). In contrast, the uniform architectures all scale well and yield an almost constant download rate independent of the population size. As expected, uniform linear chains are less efficient than the mesh and parallel tree architectures due to the longer paths.

In Figure 3.13 we can observe that with a smaller difference between fast and slow peers (lower value of k) the download time of Linear grows, whereas it decreases for the linear chains and the mesh architecture (remember that a unit of time is defined as a function of the uniform rate r). We can further see that the mesh architecture performs slightly better than parallel trees in Figure 3.12, unlike in Figure 3.13. This is due to the fact that the mesh architecture expands as a function of k^{2s}, whereas the expansion of parallel trees does not depend on k. Thus the expansion in the mesh will grow faster when k is large. Higher values of C do not produce interesting results as the difference between the various architectures quickly becomes unnoticeable.

Figure 3.13: Download time for different architectures with k = 4, C = 100 and s = s_max. Linear shows the completion times for a population of 10^9 peers with symmetric bandwidth.
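The trends visible in Figures 3.12 and 3.13 can be approximated directly from the closed-form expressions of Sections 3.3.4 and 3.3.6. The sketch below contrasts the mesh and parallel-tree download times for a few population sizes; it reuses the illustrative MeshModel class introduced earlier (not part of the thesis prototype) and only covers the two models whose formulas are given above.

// Sketch: contrasting the mesh and parallel-tree models for several population sizes.
// Assumes T_mesh(N,s,k,C) from Section 3.3.4 and T_tree(C,N) = 1 + ceil(log2 N)/C
// from Section 3.3.6; class and method names are illustrative.
public class ArchitectureComparison {

    // Parallel trees: the download time grows with the (logarithmic) tree depth.
    static double parallelTreeTime(long n, long c) {
        double depth = Math.ceil(Math.log(n) / Math.log(2));   // ceil(log2 N)
        return 1.0 + depth / c;
    }

    public static void main(String[] args) {
        long c = 100;
        for (int k : new int[] {4, 100}) {
            for (long n = 100_000; n <= 1_000_000_000L; n *= 10) {
                // The mesh model assumes at least one expansion step.
                int s = Math.max(1, (int) Math.floor(MeshModel.sMax(n, k)));
                double mesh = MeshModel.downloadTime(n, s, k, c);
                double trees = parallelTreeTime(n, c);
                System.out.printf("k=%3d  N=%.0e  mesh=%.3f  trees=%.3f%n",
                                  k, (double) n, mesh, trees);
            }
        }
    }
}

With these parameters the mesh comes out slightly ahead for large k and slightly behind for small k, which is consistent with the discussion of Figures 3.12 and 3.13.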
3.5 Summary

In this chapter, we have studied the problem of providing uniform download rates to a population of peers with asymmetric and heterogeneous bandwidth capacities. The architectures that best achieve this goal among those studied in the chapter are the mesh and the parallel tree, but the latter requires peers to buffer data for a duration proportional to the depth of the trees. As the number of chunks grows, i.e., when the stream duration becomes very long, the differences between all the architectures become insignificant.

Although we only focused on analytical models for simple content distribution architectures, we believe that our analysis provides some important insights as to how to set up peer-to-peer networks for distributing streaming data. It can also guide the design of cooperative applications that organize the nodes in a more dynamic manner than chains or trees. In particular, the system needs to build up upload capacity as fast as possible (which corresponds to maximizing the number of expansion steps) and the content should be partitioned into a large number of chunks (but not too many chunks, as each one adds some coordination and connection overhead). By properly combining high and low capacity nodes, one can provide a high initial quality of service to every peer and even out their differences in a truly cooperative manner.

The models used in this chapter assumed that no nodes fail and that the capacity remains stable over time. In a real environment peers are not stable and often fail or refuse to participate in distribution. Furthermore, network capacities change over time due to changes in usage. Both of these challenges are analyzed in the following chapters based on the models introduced in this chapter.

Chapter 4

Self-organization in Cooperative Content Distribution Networks

Parts of this chapter have been published in: M. Schiely, L. Renfer and P. Felber. Self-organization in Cooperative Content Distribution Networks. Proceedings of the IEEE International Symposium on Network Computing and Applications (NCA'05), July 2005.

4.1 Introduction

The architectures presented in Chapter 3 are constructed in one pass and are not adapted to conditions that change afterwards. The environment can change due to peers failing or bandwidth that is fluctuating because of varying link usage.

In contrast to Chapter 3, in this chapter we aim at providing techniques that are efficient in heterogeneous settings, adaptive so as to tolerate run-time changes like bandwidth fluctuations, and practical enough to be implementable in real systems. For the sake of simplicity, our study mostly focuses on architectures with binary trees; the principles and algorithms presented here do, however, also apply to other architectures, as will be discussed later. The main metric we consider is the average time for each of the clients to receive the complete content. Earlier studies [3, 59] have developed analytical models and indicated theoretical limits for this problem, but they only considered homogeneous scenarios where all the peers have identical bandwidth. In particular, a comparison of several distribution architectures based on linear chains, trees, and parallel trees has indicated that performance can be maximized if all the peers can use their upload capacity and the content is split in enough small blocks so that the peers are all active at the same time.

The contributions of this chapter are as follows: We first analyze the problem of cooperative distribution of content from a single source to a large number of heterogeneous clients and we identify the limitations of existing solutions. We propose techniques and algorithms that dynamically optimize the distribution network, based on the observed effective bandwidth capacities, in order to avoid bottlenecks and improve global throughput. These algorithms have several desirable features. Most notably, they are fully decentralized and work by only performing local reorganizations; as such, they might stop short of producing an optimal configuration, but perform extremely well under the aforementioned constraints. We analyze the properties of our algorithms and we evaluate them by means of simulations, as well as experimentally in a LAN and in the Internet using the PlanetLab [39] testbed.

The chapter is organized as follows: We first present classical tree-based distribution architectures and analyze their shortcomings in Section 4.2. Section 4.3 introduces the principles, mechanisms, and algorithms proposed to dynamically improve the efficiency of tree-based content distribution. Section 4.4 presents results from simulations and experimental evaluation, and Section 4.5 summarizes the chapter.

4.2 P2P Content Distribution

Tree-based Architectures.
Different architectures have been developed for organizing clients in a P2P fashion for cooperatively distributing content, e.g., a large file. The key idea is to have clients that have already downloaded the file help redistribute it to other clients, instead of relying on a single source. The time necessary to send the file to all peers is no longer proportional to the number of clients in the network, as for classical client-server distribution, but proportional to the logarithm of the number of peers.

As an example, consider the situation where a server must replicate a critical file, e.g., an antivirus update, to all 100,000 machines of a large company. Given a file size of 4 MB and a server (client) bandwidth capacity of 100 Mb/s (10 Mb/s) with 90% link utilization, a classical client/server distribution protocol would distribute the file by iteratively serving groups of 10 simultaneous clients in u = 32 Mb / 9 Mb/s = 3.55 seconds. Updating 100,000 clients would thus necessitate (100,000/10) * u, i.e., almost 10 hours.

In contrast, cooperative distribution leverages the bandwidth of the nodes that have already obtained the file, thus dynamically increasing the service capacity of the system as the file propagates to the clients. As each client that has already received the file can serve another client while the server updates 10 new clients, we can compute the number of clients updated at time t as n(t) = 2 n(t - u) + 10 = 2^{t/u} * 10 - 10. Updating 100,000 clients would thus necessitate less than 1 minute. The exponential increase of the number of served peers provides a sharp contrast with the linear progression of traditional client/server distribution (see [19] for a more detailed analysis).
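As a quick cross-check of the arithmetic in this example, the sketch below reproduces both the iterative client/server estimate and the doubling recurrence n(t) = 2 n(t - u) + 10. It is only an illustration of the calculation; the class name and output format are made up.

// Sketch: server pushes to 10 clients per round of u = 32 Mb / 9 Mb/s seconds;
// in the cooperative case every updated client also serves one client per round.
public class UpdateTimeExample {

    static final double ROUND = 32.0 / 9.0;    // u, about 3.55 seconds per round
    static final int CLIENTS = 100_000;

    public static void main(String[] args) {
        // Classical client/server: 10 clients per round, one round after another.
        double clientServer = (CLIENTS / 10.0) * ROUND;

        // Cooperative: n(t) = 2 n(t - u) + 10, i.e. the served population roughly
        // doubles (plus the server's 10 fresh clients) every round.
        long served = 0;
        int rounds = 0;
        while (served < CLIENTS) {
            served = 2 * served + 10;
            rounds++;
        }
        System.out.printf("client/server: %.0f s (about %.1f hours)%n",
                          clientServer, clientServer / 3600);
        System.out.printf("cooperative:   %d rounds = %.1f s%n", rounds, rounds * ROUND);
    }
}

Running this yields roughly 9.9 hours for the client/server case and 14 rounds (about 50 seconds) for the cooperative case, matching the figures quoted above.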
The simplest architecture for cooperative content distribution consists in forming a chain (or pipeline) in which each client downloads the file from one peer and uploads it to another peer. The file is divided into small blocks of a given size that can be transmitted independently from each other: as soon as a block is received at one peer, it is forwarded to the next peer. This architecture leads to impressively short distribution times in high speed networks with full duplex connectivity. The total distribution time is essentially the time to send the whole file to the first node plus the delay for the first block to reach the last node.

If each peer serves more than one other peer, we obtain trees instead of linear chains. As the bandwidths of upload connections have to be shared between several downloaders, such architectures are best adapted in settings where peers (especially those close to the source) have large upload capacities.

Chains and tree architectures have the disadvantage that the failure of a node adversely impacts the whole subtree rooted at that node. Indeed, once the only link to the subtree is broken, no data can flow to any of its peers. To address this problem, one can organize the peers into multiple spanning trees, with each peer belonging to all the trees and being interior node of at most one of them, and have the source send distinct blocks to each tree. Such architectures based on parallel trees have been used in SplitStream [8] to improve bandwidth efficiency and increase robustness. Obviously, the failure of a peer will affect at most one of the distribution trees and leave the rest operational. Analytical models and analysis of these architectures in homogeneous settings can be found in [3]. We shall primarily focus on architectures based on a single binary tree in the rest of the chapter, although we shall briefly discuss extensions for n-ary and parallel trees.

Dealing with Heterogeneity. The performance of content distribution using a single tree composed of peers with heterogeneous bandwidth directly depends on the organization of the nodes in the tree. One slow peer ps can increase the average reception time of all the peers in the subtree rooted at ps, even if they have more bandwidth and computational power than ps.

To show the effect of a single slow peer ps in a balanced binary distribution tree of n nodes, we compute the average reception time depending on the position of ps in the tree. We assume a symmetric bandwidth of Bf for the fast peers and the source S, and Bs [...]

[...] >= 0 be the download capacity of child i. Based on these measurements, we can distinguish two cases:

1. u_n < \sum_{i=1}^{m} u_i with u_j = u_n for some nodes j and u_k [...]

[...] if |C_p(s_i)| > 0
f_p(s_i), if |C_p(s_i)| = 0          (5.1)

where fp(si) is the number of new children that p can accept for stripe si and Cp(si) is the set of children of p for stripe si. One can note that the free capacity of p has more weight than that of its children. This allows us to favor new connections to nodes high in the trees. Note that a peer may return 0 for fp(si) if it still has enough upstream capacity but is already interior node of several distribution trees other than si.

A new peer starts to join the system at the source. It can obtain information about the source, for instance, from a Web page (e.g., IP address, stream rate). To join a distribution tree, a peer q issues a join request JRQ for each stripe to the source. Join requests traverse the distribution trees using biased random walks (which have only a fraction of the overhead of a broadcast). A join request JRQ for stripe si is propagated along the distribution tree of si as follows. If the current node p can accept q as a child for si, it sends a message (CAN) to q together with its healthiness hp and the path from the root to p to inform it that p is able to (can) accept q as a new child. Then, if p has children in si, it forwards the join request to a child chosen at random according to a biased distribution in which the probability of choosing a child is proportional to its healthiness. These messages traverse the associated distribution trees until enough potential parents for q are found (trade-off between waiting time and quality of parent position).

During such a random walk, joining node q typically receives several CAN replies. It then selects among the replies the node p closest to the root and, upon tie, the node with the highest healthiness, under the condition that the path diversity property is satisfied for the connection from p to q. Note that the property can be verified using the path information embedded in the CAN message. The join procedure finishes when the new node starts receiving chunks from its parent. If q receives no valid replies, it issues another join request that will likely follow a different path in the distribution tree. Note that q can also request multiple random walks to be conducted in parallel to quickly gather more candidates.

The behavior of the source differs from other peers in that it always tries to have the same number of children for each stripe. The source accepts a new child for stripe si if it has sufficient bandwidth and no other stripe has fewer children than are currently registered for si.

The source serves directly the first few peers in parallel, during the bootstrap phase. Thereafter, new peers will connect deeper in the distribution trees. This unfairness between early and late joiners is compensated over time by the HeapTop algorithm that continuously optimizes the distribution tree and changes the position of the nodes.

Algorithm 2 Reception of JRQ(si, j) at peer p for stripe si and new peer j
  if fp(si) >= 1 and j not in Cp then
    send CAN(p, hp(si), path source -> p) to j
  end if
  if Cp is not empty then
    c <- biased random node from Cp
    send JRQ(si, j) to c
  end if
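A possible way to implement the forwarding step of Algorithm 2 is sketched below. The healthiness-proportional child selection follows the description given above; the Peer and Stripe types, the message-sending methods and the healthiness values themselves are placeholders, not the CrossFlux API.

import java.util.List;
import java.util.Random;

// Sketch: handling a join request JRQ(si, j) at peer p, as in Algorithm 2.
public class JoinRequestHandler {

    interface Stripe {}
    interface Peer {
        int freeSlots(Stripe s);                 // f_p(s_i)
        double healthiness(Stripe s);            // h_p(s_i)
        List<Peer> children(Stripe s);           // C_p(s_i)
        List<Peer> pathFromSource();
        void sendCan(Peer target, double healthiness, List<Peer> path);
        void forwardJoinRequest(Peer next, Stripe s, Peer joiner);
    }

    private final Random random = new Random();

    void onJoinRequest(Peer p, Stripe si, Peer j) {
        // If p can still accept a child for this stripe, advertise itself to j.
        if (p.freeSlots(si) >= 1 && !p.children(si).contains(j)) {
            p.sendCan(j, p.healthiness(si), p.pathFromSource());
        }
        // Forward the request along the tree: pick a child with probability
        // proportional to its healthiness (biased random walk).
        List<Peer> children = p.children(si);
        if (!children.isEmpty()) {
            p.forwardJoinRequest(pickProportionalToHealthiness(children, si), si, j);
        }
    }

    private Peer pickProportionalToHealthiness(List<Peer> children, Stripe si) {
        double total = children.stream().mapToDouble(c -> c.healthiness(si)).sum();
        double r = random.nextDouble() * total;
        for (Peer c : children) {
            r -= c.healthiness(si);
            if (r <= 0) return c;
        }
        return children.get(children.size() - 1);   // fallback for rounding errors
    }
}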
A new peer always connects to the distribution trees as a leaf. Therefore, it has no children and no backup links initially. This conscious design decision is motivated by the fact that, in typical P2P systems, many peers remain connected a very short amount of time: the longer a peer has been online, the higher the probability that it remains connected [21], [46]. Therefore, departures among the volatile population of newcomers will have limited impact. As peers remain in the system, they will accept children and consequently acquire backup links. They may also move upward the tree if they have good service capacity. This approach acts as an incentive for peers to remain connected for long periods of time and contribute well to the system.

Note that the heuristics used to meet these criteria will not produce optimal distribution trees. The dynamic reorganization of the nodes in the trees has been precisely designed to improve the efficiency of the trees after their construction.

5.2.4 Content Distribution

The chunks of each stripe are forwarded along the associated distribution trees in a straightforward manner: each inner node of a distribution tree forwards incoming chunks to all of its children in that tree. We assume that the links between nodes are reliable (we use TCP in our implementation).

Peers buffer the chunks for some time, so that they can transmit them to their neighbors over secondary links in case of a failure. To dispose of buffered chunks, each peer regularly sends a notification to its backup neighbors indicating the last chunk it has received for each relevant stripe. This mechanism allows secondary sources to dispose of the chunks that they buffer for retransmission purposes. If the buffers of a peer are full, it may delete the chunks in its retransmission buffers even if the peers downstream of secondary links have not yet acknowledged their reception.

5.2.5 Departures and Failures

Upon a node failure, its children in each stripe must find a new parent. This operation must be very fast to guarantee smooth playback of the media stream. CrossFlux relies on the backup links for quick failover: affected children ask their backup sources to send missing chunks. Failures can be detected by children when a network connection is closed or times out.

By ensuring that each contributing node has at least one valid secondary link for each stripe, the system can be quickly reconfigured after a failure while providing good load balancing: the children of a failed node will request the missing stripes from distinct peers with high probability. Obviously, backup sources must have spare bandwidth to send the missing stripes, even with degraded performance, until the primary link is restored. Typically, peers keep some free bandwidth for dealing with failures, i.e., they underestimate their spare capacity when computing their healthiness. The amount of spare bandwidth can be reduced over time as the probability of a parent failure decreases [21], [46]. The number of parents for a node p is equal to the number of stripes m and corresponds to the maximal number of nodes p has to serve as a backup. This also limits the additional bandwidth used for serving as a backup in the case of a failure.

After promoting a secondary link to primary, the peers affected by the failure execute the join protocol to find a new parent and revert the status of the secondary link.
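The failover path just described could look roughly as follows in code; the link type, its methods and the rejoin call are stand-ins for the corresponding CrossFlux mechanisms, not their actual implementation.

// Sketch of the failover of Section 5.2.5: promote the secondary link, fetch the
// missing chunks, then rejoin the tree and revert the link to secondary status.
class FailoverSketch {

    interface Link {
        void promoteToPrimary();
        void revertToSecondary();
        void requestChunksFrom(int stripe, long firstMissingChunk);
    }

    void onParentFailure(int stripe, Link backup, long lastChunkReceived) {
        backup.promoteToPrimary();                            // quick switch to the backup source
        backup.requestChunksFrom(stripe, lastChunkReceived + 1);
        rejoin(stripe);                                       // find a definitive new parent
        backup.revertToSecondary();                           // backup becomes secondary again
    }

    void rejoin(int stripe) {
        // Placeholder for the join protocol of the previous section (JRQ / CAN exchange).
    }
}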
5.2.6 Overlay Optimization

When optimizing effective throughput, one needs to take into account the dynamism of the underlying network and the bandwidth heterogeneity. To that end, HeapTop [51] is used in CrossFlux to dynamically move fast nodes upward the trees toward the root. For scalability reasons, reorganization of the tree should affect as few nodes as possible. Exchanging the position of a node with its parent is a local operation that can be easily implemented because both nodes are directly connected with each other and they essentially have to exchange their respective neighbors.

Each node continuously executes the HeapTop algorithm. Peer p periodically compares its bandwidth capacity with that of its parent. If p's bandwidth is notably higher than its parent's bandwidth, then they switch positions, i.e., they exchange their parents and children, under the condition that the diversity property is still satisfied. The algorithm preserves the structure of the initial tree (even if it is not balanced), but the position of the nodes evolves over time. To avoid pairwise exchanges resulting from short bandwidth fluctuations, the estimations are based on a weighted moving average.

Given the special role of the source node, it appears clearly that the peers cannot move from one 1st-level subtree to another 1st-level subtree. As such, the resulting distribution tree may be slightly sub-optimal, but performing further optimizations would necessitate non-local operations and higher complexity.

If there is no bandwidth fluctuation, the tree will quickly reach a stable configuration. In the worst case, a node located at depth d - 1 (the root is at depth 0) can initiate d - 1 exchanges. The actual number of exchanges depends on both the initial configuration of the tree and the order in which exchanges are performed.

Several important considerations must be taken into account when using HeapTop to optimize the distribution trees in CrossFlux. First, the HeapTop algorithm is run independently in the distribution trees of each stripe. Second, leaf nodes are not exchanged with inner nodes if the former is already inner node of several other trees (typically 2, as previously discussed). Finally, before performing any pairwise exchange, it is verified that the path diversity property will be preserved in the new configuration.
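A minimal sketch of the pairwise exchange at the heart of HeapTop is given below, assuming a plain binary-tree node with a smoothed bandwidth estimate. The threshold for "notably higher" and the moving-average weight are illustrative choices, and the path diversity check is only stubbed out; this is not the thesis implementation.

// Sketch: one HeapTop step. A node compares its smoothed bandwidth with its
// parent's and, if notably higher, swaps positions with the parent by exchanging
// parents and children.
class HeapTopSketch {
    static final double THRESHOLD = 1.2;    // "notably higher": 20% more bandwidth (made up)
    static final double ALPHA = 0.3;        // weight of the newest bandwidth sample (made up)

    static class Node {
        Node parent, left, right;
        double bandwidthEstimate;            // weighted moving average

        void observeBandwidth(double sample) {
            bandwidthEstimate = ALPHA * sample + (1 - ALPHA) * bandwidthEstimate;
        }
    }

    // Called periodically on every node; a node never swaps with the source (root).
    static void maybeSwapWithParent(Node n) {
        if (n.parent == null) return;                    // n is the source
        Node p = n.parent;
        if (p.parent == null) return;                    // never swap with the source itself
        if (n.bandwidthEstimate > THRESHOLD * p.bandwidthEstimate
                && pathDiversityPreserved(n, p)) {
            swap(n, p);                                  // exchange parents and children
        }
    }

    static void swap(Node child, Node parent) {
        Node grandParent = parent.parent;
        // The child takes the parent's place under the grandparent...
        child.parent = grandParent;
        if (grandParent.left == parent) grandParent.left = child; else grandParent.right = child;
        // ...the former parent (and the child's former sibling) become the child's children...
        Node cl = child.left, cr = child.right;
        child.left = (parent.left == child) ? parent : parent.left;
        child.right = (parent.right == child) ? parent : parent.right;
        if (child.left != null) child.left.parent = child;
        if (child.right != null) child.right.parent = child;
        // ...and the child's former subtrees are handed over to the former parent.
        parent.left = cl;
        parent.right = cr;
        if (cl != null) cl.parent = parent;
        if (cr != null) cr.parent = parent;
        parent.parent = child;
    }

    static boolean pathDiversityPreserved(Node n, Node p) {
        return true;   // placeholder: CrossFlux verifies the real property here
    }
}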
5.3 Evaluation

A prototype of CrossFlux has been developed in Java (see Appendix A). The implementation has been designed in such a way that it can be deployed in simulated settings, in controlled environments such as clusters, and in large-scale networks.

Most of the experiments were carried out in a simulator, which allowed us to observe the behavior of the system with large client populations without the need of a critical mass of users. Several experiments have been performed in distributed settings using the Modelnet [55] network simulator and finally the evaluation has been completed using PlanetLab. Some results of this evaluation are discussed in this section.

5.3.1 Simulations

Experimental Setup. Three main classes of peers have been simulated (Fast F: 1024 Kbit/s, Medium M: 512 Kbit/s, Slow S: 128 Kbit/s), chosen to match the observations that have been made of real-world populations in an earlier study of the BitTorrent protocol [25]. Simulated population sizes range from 500 to 4,000 peers. As the upload bandwidth is usually the limiting factor, download capacities are not explicitly taken into account (peers of classes M and S typically have asymmetric bandwidth). Each simulated peer's class was chosen randomly according to the 6 distributions D1, ..., D6 shown in Table 5.1.

        Class F   Class M   Class S
  D1      90%       5%        5%
  D2      60%      30%       10%
  D3      50%      25%       25%
  D4      30%      60%       10%
  D5      25%      25%       50%
  D6       5%      90%        5%

Table 5.1: Distributions of peer classes for evaluation.

Dynamic Adaptation of the Overlay. First the dynamic optimization of the overlay using the HeapTop algorithm has been studied. The main criterion considered is the average upload bandwidth capacity using the tree adapted by HeapTop, as compared with that of an initial randomly generated tree. For every considered distribution, binary trees were constructed by iteratively adding each node at a valid position, chosen by traversing the tree from the root until a leaf or a node with a single child is encountered. This join procedure is much simpler than in CrossFlux but makes it easier to observe the effect of HeapTop in isolation. Both balanced and unbalanced trees have been experimented with. As the differences between both settings were negligible, only results for balanced trees are shown; they are also valid for unbalanced trees.

Figure 5.5: Average improvement factor with two stripes for different population sizes and various class distributions.

The improvement factor of HeapTop has been evaluated with different population sizes for each node distribution. HeapTop has been simulated by running it separately on two stripes, as implemented in CrossFlux. Figure 5.5 shows the improvement factor f, defined as the ratio of the average bandwidth B_HT of the tree generated by HeapTop to the average bandwidth B_R of the random initial tree: f = B_HT / B_R.

One can observe that the gain is significant (up to 350%). The best improvement factor observed during simulations was 750%, which gives a measure of the potential benefits of dynamic overlay optimizations in CrossFlux.

Figure 5.6: Impact of failures on the average download capacity for a population of 2,000 nodes with distribution D4.

Failure Recovery. The ability of CrossFlux to recover from node failures has been tested on a population of 2,000 nodes with upload bandwidths matching the distribution D4, which best reflects real-world scenarios. In a first phase of the simulation, all the nodes were added to the system and the simulation was run for some time without activating HeapTop to stabilize the system. Then, simultaneously, a fraction of nodes chosen randomly was shut down, and HeapTop was activated to improve recovery effectiveness. The failures were detected without delay, so as not to depend on timeout parameters.

Figure 5.6 shows the average download capacity as a function of the simulation time (discrete steps). The failure occurred at time t and recovery directly began by switching to backup links. It can be observed that the impact of the failures on the download capacity is very moderate. The subsequent improvement, with the download capacity reaching above its previous value, is due to HeapTop.

Download Performance. CrossFlux's scalability and performance has been studied by simulating a content distribution network with a single source, 4 stripes, and a streaming rate set to 320 Kbit/s. The simulation was started with only the source, and then iteratively nodes were added using the join algorithm. The upload capacity of each node was chosen randomly according to the distributions and classes introduced in Table 5.1. The maximal number of children of a node was determined based on its upload capacity. When all the nodes completed the join procedure for all stripes, the available download capacity of each node (higher than the actual streaming rate) was computed. The experiments were done (1) with HeapTop disabled (Figures 5.7 and 5.8) and (2) with HeapTop activated (Figures 5.9 and 5.10).

Figure 5.7: Average download capacity as a function of the number of peers (without HeapTop).

As can be seen in Figure 5.7, the system scales well with the node population for all distributions: the download capacity does not degrade when adding new peers and is consistently above 320 Kbit/s even for distributions with a high fraction of slow and medium-speed nodes. It can even be observed that the average download capacity increases with the node population for distributions with many fast nodes (e.g., distribution D1).
This can be explained by the fact that slow nodes can more significantly impact the performance of small networks, because the restrictions imposed by the path diversity property sometimes enforce placing slow nodes in the interior of the distribution trees. This problem becomes less relevant in large networks where more positions are available for placing nodes.

Figure 5.8: Average download capacity compared to the optimal service capacity (without HeapTop).

Figure 5.8 compares the effective download capacity obtained with CrossFlux to the optimal service capacity, i.e., the ratio of the aggregate upload bandwidth of the network to the number of nodes. As one can see, CrossFlux utilizes between 60% and 95% of the available bandwidth depending on the distribution. In Figures 5.9 and 5.10 it can be seen that HeapTop significantly increases the average download capacity for all distributions, as expected from the simulation results of HeapTop.

Path Lengths. In addition to being bandwidth-efficient, a good distribution tree should also balance well the load on all the nodes. This is generally the case for balanced trees, as all nodes have approximately the same number of children and all leaves the same depth. Obviously, the heterogeneity of the node capacities does not allow maintaining balanced trees.

Figure 5.9: Average download capacity as a function of the number of peers (with HeapTop).

Figure 5.10: Average download capacity compared to the optimal service capacity (with HeapTop).

To evaluate the quality of the distribution trees obtained with CrossFlux, the lengths of the paths from the root have been evaluated for populations of 2,000 nodes. Distributions D1 and D6 were used, as they are the two most homogeneous distributions and are expected to produce reasonably balanced trees. The node degree of other distributions is too uneven to draw meaningful conclusions. The maximum path length over all nodes and all stripes has been computed, as well as the maximum over all nodes of the average path length, computed as the mean over all stripes (denoted by max-average). For comparison, the maximal path length of a balanced tree with a constant node degree d and n nodes is in O(log_d n).

Figure 5.11: Maximal and max-average path lengths compared to the theoretical maximal path length.

Figure 5.11 shows the maximal and max-average path length for both distributions, as well as the asymptotic depth of a balanced tree with a degree equal to the average number of children of the interior nodes of the distribution trees produced by CrossFlux (11.15 and 6.05). It can be observed that paths remain reasonably short, which indicates good structure balancing. The growth of the path length follows a logarithmic curve, within a factor of 2-4 from the theoretical optimum. The reason for this difference is that the degree of CrossFlux trees varies significantly from node to node depending on their capacity, which makes the comparison unfair.
Figure 5.12: Cumulative distribution function of the number of served children.

To further evaluate how effective the join procedure is at producing a balanced tree, the number of stripes served by each node has been compared. Figure 5.12 shows the cumulative distribution function of the number of children that are served by a node for the different distributions of Table 5.1. As one can see, the load in CrossFlux is quite balanced. Independently of the distribution, only a small portion of the nodes (at most 20%) serve more than 6 children, although in the setup the fast nodes have the capacity to serve 12 children. This indicates that the nodes with less capacity also have to contribute.

Node Stress. Node stress is defined as the number of management packets a node receives in a time unit. The main overhead is due to (1) the join procedure, during which join requests are issued and forwarded, and (2) the reporting mechanism used by children to send their healthiness value to their parents. In CrossFlux, random walks are used instead of broadcasts during the join to reduce the number of messages. Figure 5.13 shows the average node stress for distribution D4 for different population sizes.

Figure 5.13: Node stress for different population sizes and distribution D4.

At simulation time 280 a fraction of the nodes (10% of the population size) failed. As one can see, the average number of management packets increases during the join procedure and quickly stabilizes afterwards to the amount of healthiness reporting messages. During failure recovery the average node stress contains peaks but remains within 25% of its previous value. The average node stress is the same for all three population sizes, thus highlighting the low overhead of the random walks.

5.3.2 Modelnet Simulations

Experimental Setup. Modelnet is a network simulator that emulates a virtual network on top of a set of machines (typically a cluster). The software to be evaluated is deployed on multiple virtual hosts residing on each machine. The traffic generated by these virtual hosts is routed through the simulator, which mimics the behavior of the modeled links (delay, throughput, loss) and forwards it to the destination.

Modelnet [55] was used to evaluate CrossFlux on a small testbed. In Modelnet, each end-to-end link in the topology can be assigned different values for bandwidth, latency, and loss rate. For this purpose, the Inet generator [28] was used to generate a random transit-stub topology of 4,000 nodes with 50 CrossFlux clients spread across 19 stubs. The bandwidth of each link was chosen randomly in the range from 512 Kbit/s to 1024 Kbit/s. The number of stripes that a node could serve was determined according to its connection speed. A single streaming source has been set up to serve an endless stream, which was split into chunks of 40 Kbit and distributed using 8 stripes. The streaming rate was fixed to 320 Kbit/s, thus each peer should receive at least 8 chunks per second.

Dynamic Adaptation of the Overlay. As the structure of the distribution trees obtained with CrossFlux depends on many parameters that cannot be easily controlled (including random factors), we have studied the behavior of HeapTop using simulations that faithfully reproduce the operations of the algorithm and evaluate its efficiency.

Figure 5.14: Average improvement factor with two parallel trees for different population sizes and various class distributions.

We simulated HeapTop by running it on the inner nodes of each stripe, as happens in CrossFlux. In other words, a leaf in the initial tree is never promoted to inner node.
Figure 5.14 shows the improvement factor for different population sizes and various class distributions.

Figure 5.15: Best case improvement factor for two parallel trees for different population sizes and various class distributions.

One can observe that the gain is significant (up to almost 400%). Figure 5.15 shows the best improvement factor observed during the simulations (up to 750%) and gives a measure of the potential benefits of HeapTop for CrossFlux.

In the implementation of CrossFlux, a buffer was used where received chunks are stored until they have been read for playback. The average size of this buffer gives an estimate of the download rate of the peer: if the buffer becomes empty, then the peer does not receive the content at a sufficient rate.

Figure 5.16 shows the cumulative distribution function of the fraction of nodes with a given percentage of free receive buffer space. As one can see, there is much less free buffer when HeapTop is used than without. With HeapTop activated, 90% of the nodes can fill their buffer up to 80% of the available space. When HeapTop is disabled, only 50% can fill their buffer up to 80%. This indicates that fast nodes are effectively moved toward the root and chunks are flowing faster from the root to the leaves.

Figure 5.16: Cumulative distribution function of the free receive buffer space.

Figure 5.17: Cumulative Distribution Function of the Used Capacity.

5.3.3 Load Balancing

To evaluate the load balancing and fairness properties of the join procedure, we compared the number of stripes served by each node. To that end, we added all 50 nodes sequentially, with one new host joining every 5 seconds. Figure 5.17 shows the cumulative distribution