
June 25, 2015
Thesis Proposal

Better End-to-End Adaptation Using Centralized Predictive Control

Junchen Jiang

July 2, 2015

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Hui Zhang, Co-Chair
Vyas Sekar, Co-Chair
Peter Steenkiste
Srinivasan Seshan
Ion Stoica (UC Berkeley)

Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy.

Copyright © 2015 Junchen Jiang


Keywords:




Abstract

The transport layer and application layer of the network stack use end-to-end adaptation protocols (e.g., TCP and bitrate-adaptive video) to achieve high performance by continuously adapting endpoint behavior to changes in network conditions. The traditional belief is that these protocols must be run independently by endpoints to achieve desirable performance. In essence, they use reactive logic triggered only by locally observable events. For instance, TCP reacts to a packet loss by halving the congestion window.

In this thesis, we argue that centralized predictive control can lead to better end-to-end adaptation and large performance improvements at both the transport layer and the application layer. We show that it is feasible to decouple adaptation logics from end-to-end adaptation protocols and centralize them into a global controller that performs predictive control using a global view of different connections' performance. For instance, TCP with centralized predictive control can predict the best congestion window using the performance of other, similar TCP sessions.

To deliver the promised performance benefits of centralized predictive control, we must address two key technical challenges. First, we present prediction algorithms that accurately predict the optimal adaptation behavior of endpoints by exploiting the structural information in the global view (e.g., some connections are subject to the same network bottleneck). Second, we present designs of a scalable control platform that leverage the persistence of optimal decisions to minimize the negative impact of the inherent delay between the controller and widely distributed endpoints.

This thesis will present algorithms and system designs of centralized predictive control for both the transport layer and the application layer. We show that our approach can lead to better performance for TCP, Internet video, and real-time communication applications like Skype. Our preliminary experiments have shown that centralized predictive control significantly improves Internet video quality.




Contents

1 Introduction
  1.1 Roadmap

2 Background and Related Work
  2.1 Background
  2.2 Related Work

3 CPC Overview
  3.1 CPC vs. Today's Approaches
  3.2 Challenges of CPC
  3.3 Key Insights of CPC

4 Accurate Data-Driven Prediction Logics
  4.1 Technical Overview
  4.2 My Prior Work
  4.3 Proposed Work

5 Scalable CPC Control Platform
  5.1 Technical Overview
  5.2 My Prior Work
  5.3 Proposed Work

6 Timeline and Risk Analysis
  6.1 Timeline
  6.2 Risk Analysis

Bibliography



List of Figures

3.1 Three components of CPC architecture (local logics, global controller and CPC API), and the logical control loop (measurement collection, predictive control and pushing adaptation decisions)

3.2 A relatively small number of combinations of critical features (i.e., critical clusters) are shared by many bad quality video sessions.

3.3 Slight staleness (e.g., 30 minutes) of critical features (i.e., structure) does not impact prediction accuracy significantly.

3.4 The best CDNs for different content providers persist on a timescale of tens of seconds to minutes.

4.1 Benefits of CFA: more accurate prediction on video quality and better quality through accurate prediction.

4.2 Benefits of DDA: more accurate prediction on throughput and better bitrate selection based on accurate throughput prediction.

5.1 Splitting CPC controller into three control loops



List of Tables

1.1 Mapping my prior work and proposed work to CPC's insights and solutions to address the challenges.

2.1 Some examples of how different end-to-end adaptation protocols impact performance of different layers by tuning their adaptation parameters.

2.2 Terminology used in this document and their description.



Chapter 1

Introduction

The ultimate goal of end-to-end adaptation protocols in the transport layer (e.g., TCP) and the application layer (e.g., DASH [2] video) is to ensure high performance by adapting endpoints' behavior to changes in network conditions and resource availability. For instance, TCP tunes its congestion window in response to network congestion or queue buildup. DASH video players tune bitrate and CDN to cope with throughput fluctuation and content availability.

The traditional tenet of end-to-end adaptation protocols states that, for fast reaction to network events and to respect the fate-sharing principle [6], they should be implemented and run independently by endpoints. For instance, TCP's instant reaction to packet loss observed by end hosts is necessary to preserve packet conservation [17]. The fate-sharing principle requires that critical state of a connection (e.g., sequence numbering) be kept only by end hosts.

However, with this distributed design, today's end-to-end adaptation has fundamental limitations: it is based on local information (e.g., locally observed events), and its logics are reactive (e.g., trial-and-error) and hard-coded as one-size-fits-all (the same logic is used in different scenarios and over time). For instance, adaptive streaming protocols usually start with a statically configured bitrate. If this bitrate is too low, the protocol might not even be able to reach the optimal rate by the time the video has ended (e.g., for a 30s or 60s news clip). The hard-coded one-size-fits-all logics of TCP do not consistently achieve optimal performance in diverse network contexts, such as data center, wireless, and large bandwidth-delay product networks.

This thesis argues for an alternative approach, centralized predictive control (CPC), where adaptation logics are separated from end-to-end adaptation protocols and run by a global controller, which has a global view of many sessions' performance.¹ The global view enables predictive control (i.e., the ability to predict the optimal decision without trying all decisions) and flexible control logic (i.e., the ability to customize control logic based on network context, rather than using one-size-fits-all logics).

The technical contribution of this thesis is to address two key challenges of the CPC approach. First, while the global view enables predictive control, it is challenging to accurately predict performance and optimal decisions. We present accurate data-driven prediction logics based on the insight of "critical features": "bottlenecks" that impact session performance and are shared across many sessions. For instance, TCP sessions impacted by the same factors (e.g., congested links) have similar performance and similar best adaptation decisions (e.g., initial congestion window).

¹ We generalize the notion of a session in this thesis to the interaction of two end hosts in the context of different layers, such as a TCP session, HTTP session, video session, etc.

Second, because CPC uses a global controller, it introduces additional complexity: scalability, fault tolerance, stability, and additional delay between the controller and end hosts. These issues are challenging for end-to-end adaptation because of its scale (potentially billions of end hosts with global distribution) and highly dynamic nature (adaptation must be made immediately with fresh information). Our design of the CPC control platform addresses these challenges using several key insights, including leveraging the persistence of optimal adaptation decisions to minimize the negative impact of delay between controller and end hosts, and decoupling the controller design to combine both global analytics and real-time per-session control.

To put it in context, CPC is inspired by a diverse set of related research. First, CPC shares a similar motivation with SDN (e.g., [28, 45]) in the network layer: the control plane should be decoupled from the data plane and centralized for more flexibility. Second, CPC is inspired by the seminal work of CM [4] and SPAND [36], which showed the benefits of sharing congestion information between TCP sessions in the context of a web server. CPC extends this work to a global scale by taking advantage of technological trends such as cloud computing and large-scale analytics platforms. Third, CPC is inspired by the recent work of Remy [37, 44], which showed the benefits of customizing congestion control logics for different network characteristics during offline training. CPC generalizes this work to online training of adaptation logics, which is expected to unlock more of the benefits of customizing adaptation logics based on more accurate knowledge of the targeted network.

Organization of the thesis proposal:
• Chapter 2: Background and related work.
• Chapter 3: Overview of challenges and insights of CPC.
• Chapter 4: Design of accurate data-driven prediction logics.
• Chapter 5: Design of scalable control platform.
• Chapter 6: Timeline of the proposed thesis work and risk analysis.

1.1 Roadmap

This thesis makes the case for CPC in three steps (Table 1.1 maps my prior work and proposed work to this roadmap):
• Step 1: We make the case for data-driven protocols (the design of protocols should be driven by data, such as a global view of real-time performance measurements). The case for data-driven protocols suggests the potential performance benefits of CPC.
• Step 2: We use measurement studies to present insights that enable the technical designs of CPC (§3).
• Step 3: We present techniques to build a practical CPC system (§4 and §5).



Columns: My prior work (Internet video) | Proposed work | Future
Case for (global) data-driven (Step 1): [27] (Not primary contributor); Data-driven protocols
Measurement on insights of CPC (Step 2): [21]
Accurate data-driven prediction (Step 3): CFA, DDA [23]
Scalable control platform (Step 3): C3 [11] (Not primary contributor)
Skype, TCP; Other protocols

Table 1.1: Mapping my prior work and proposed work to CPC's insights and solutions to address the challenges.



Chapter 2

Background and Related Work

2.1 Background

End-to-end adaptation protocols are used at the transport layer and application layer of the network stack (e.g., TCP in the transport layer, and DASH streaming protocols in the application layer). These protocols are critical to ensuring high performance in each layer. Different protocols use different adaptation parameters to adapt end hosts' behavior to changes in network conditions or resource availability. Table 2.1 gives some examples of end-to-end adaptation protocols.

Network stack | Protocols | Adaptation parameters | Impacted performance
Application layer | DASH, Skype | Bitrate/CDN/peer selection | Quality of video/Skype/web
Application layer | HTTP | Parallel TCP, server/cache selection | Throughput
Transport layer | TCP | Congestion control | Throughput, latency

Table 2.1: Some examples of how different end-to-end adaptation protocols impact performance of different layers by tuning their adaptation parameters.

The ideal design of end-to-end adaptation protocols should tune adaptation parameters optimally and quickly so that each session can adapt to network conditions and resource availability. For instance, the ideal bitrate-adaptive video player should always choose the highest bitrate sustainable by the available throughput, and the ideal TCP should use the largest initial window size that incurs no packet loss.
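The "highest sustainable bitrate" rule can be illustrated with a minimal Python sketch. The function name `pick_bitrate`, the bitrate ladder, and the fallback to the lowest bitrate when nothing is sustainable are assumptions made for illustration; they are not part of any protocol described in this proposal.

```python
def pick_bitrate(bitrates_kbps, predicted_throughput_kbps):
    """Return the highest bitrate that the predicted throughput can sustain.

    Assumption: if no bitrate is sustainable, fall back to the lowest one.
    """
    sustainable = [b for b in bitrates_kbps if b <= predicted_throughput_kbps]
    return max(sustainable) if sustainable else min(bitrates_kbps)


# Hypothetical bitrate ladder (kbps) and throughput predictions.
ladder = [350, 700, 1500, 3000]
choice_fast = pick_bitrate(ladder, 2000)   # 1500 is the highest value <= 2000
choice_slow = pick_bitrate(ladder, 100)    # nothing fits, so the lowest is used
```

The quality of the choice depends entirely on how accurate the throughput prediction is, which is the problem the rest of this proposal addresses.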

Table 2.2 below lists the terminology used in this document.

Name | Description | Examples
(Adaptation) Parameter | Control knob | CDN
(Adaptation) Decision | Specific value of an adaptation parameter | Akamai
Session | Interaction of two end hosts in the context of different layers | TCP session, HTTP session, video session
Session features | Features of a session | ISP, CDN, connection type
Global view | Performance measurements from many distributed end hosts | Video quality measurements of all viewers

Table 2.2: Terminology used in this document and their description.



2.2 Related Work

Related work of this thesis falls into the following categories.

Centralized control platform for network protocols:
• Network layer: Centralized architectures for network-layer functionality have gained much attention since the seminal work on centralized systems for routing (e.g., [5, 45]) and, more recently, with a focus on software-defined networks (e.g., [28]) and their applications, including network function virtualization (e.g., [1]), network updates (e.g., [24]) and traffic engineering (e.g., [15, 19, 30]). CPC shares many system issues with these studies, including scalability (e.g., [7, 41]) and fault tolerance (e.g., [31, 46]). The most similar application to CPC is SDN's traffic engineering. However, unlike these studies, CPC makes no assumption about knowledge of available resources (e.g., total capacity between two points) or visibility of all traffic. As a result, CPC does not focus on centralized scheduling (as in [15, 19, 30]), but explores a data-driven design of protocols.
• Transport layer: Traditionally, transport-layer protocols are implemented in a distributed fashion. The closest work to CPC is CM [4] and SPAND [36], which have suggested the benefits of sharing congestion control information across individual TCP sessions, though within the scope of the same host (e.g., a server). There has been recent work on applying the concepts of SDN to the transport layer (e.g., SDT [14] and FastPass [32]) or taking advantage of the existence of SDN (e.g., OpenTCP [12]). So far, these efforts either focus on specific scenarios like data centers or only work when SDN is deployed in the network.
• Application layer: The most similar work to CPC is the case for a global control plane for video (e.g., [26, 27]). It shows substantial spatial and temporal diversity of performance across different CDNs and content providers [26], which suggests the potential performance improvement from making adaptation decisions using a global view of client-side measurements. This thesis plans to generalize the concept of centralized control to multiple applications (such as Skype) and give a systematic and practical algorithmic design.

Machine learning and other approaches to prediction: CPC leverages the power of data-driven prediction. To put it in the context of machine learning techniques, CPC's prediction algorithm (CFA, §4) is, in essence, an instance of a "variable kernel conditional density estimation" method [40]. It addresses the curse of dimensionality by contracting parts of the feature space that are not critical for prediction or in which there is too little available data. It leverages application-specific insights and is thus able to outperform conventional machine learning techniques (e.g., decision tree, naive Bayes and SVM [35]) in scalability and accuracy.

There are several studies on predicting network performance by leveraging the history of the same client-server pair (e.g., [13, 18, 29, 39, 43]). However, they are less reliable when the available history of the same client and server is sparse. In contrast, CPC is able to predict more accurately by leveraging a global view of many sessions' performance measurements.

Other improvements on end-to-end adaptation protocols: We focus on other improvements to Internet video and congestion control.

• Internet video: There is a large literature on measuring video quality in the wild (e.g., content popularity [34, 47], quality issues [21] and server selection [38, 42]) and on techniques to improve user experience (e.g., bitrate adaptation algorithms [16, 20], cross-CDN optimization and federation [3, 26, 33] and cross-provider cooperation [10, 22, 48]). While our work borrows ideas from this prior work (e.g., critical features are inspired by the quality "bottleneck" in [21]), CPC enhances these approaches by using accurate prediction to perform predictive control.
• Congestion control: TCP congestion control algorithms have been studied for multiple decades; for brevity, we focus only on recent developments. Remy [37, 44] provides a tool to demonstrate the feasibility and limitations of offline TCP training, which suggests the need to customize congestion control algorithms for different network characteristics. PCC [8] argues that using A/B testing and the observed performance results in a simpler congestion control algorithm and better performance. Compared to Remy (offline training) and PCC (local view), CPC stands for a more general design point where congestion control can be determined with real-time global information.

Other related work includes studies on the benefit of increasing the TCP initial congestion window (e.g., [9]), for which CPC provides a platform to dynamically set the initial congestion window. Another system similar to CPC is split TCP (e.g., [25]), which differs from CPC in that CPC does not split the connection between end hosts.



Chapter 3

CPC Overview

3.1 CPC vs. Today's Approaches

To motivate CPC, we first revisit two limitations of today's end-to-end protocols, local reactive adaptation and hard-coded logics, which are inherent to the traditional design tenet that end-to-end adaptation protocols are run independently by endpoints.

Local reactive adaptation: Local reactive adaptation uses locally observed information and reactive strategies such as trial-and-error. This is suboptimal for two reasons.
• Suboptimal initial configurations: Existing protocols typically use static configuration parameters that are often suboptimal. For example, adaptive streaming protocols usually start with a statically configured bitrate. If this bitrate is too low, the protocol might not even be able to reach the optimal rate by the time the video has ended (e.g., for a 30s or 60s news clip). Similarly, in many cases, the initial window size of TCP is too small, which may cause a transfer to take far more RTTs than necessary.
• Inefficient exploration of decision space: As protocols and applications become more sophisticated, the number of configuration choices increases dramatically. For instance, with a video application one can select the initial bitrate, the CDN, and, at a finer granularity, the proxy or the web server from which to stream the content. This makes it hard, or even infeasible, for a reactive protocol to explore the configuration space and select the best configuration.

Hard-coded logics: Today's end-to-end adaptation protocols are hard-coded; e.g., TCP is compiled into the OS kernel, and video adaptation logic is hard-coded in the video player binary. This causes two issues:
• Hampering customization to diverse network contexts: It is widely known that different network contexts require different TCP logics to deal with different network properties. For instance, wireless networks require different TCP logics from those used in data centers or in networks with a high bandwidth-delay product. Recent work on Remy has shown the benefits of customizing congestion control logics to different network characteristics.
• Difficult to evolve and deploy: There has been tremendous research on TCP that has yielded many congestion control algorithms. However, their evaluation and deployment are mostly confined to simulation because TCP is hard-coded in the OS kernel, which has a long cycle for update and deployment. Similar ossification is also common in the application layer. For instance, video bitrate adaptation logics are compiled into video players, which typically are part of mobile/smart TV apps, and thus have a long update cycle [11] of up to months.

Benefits of CPC: CPC overcomes the above limitations of today's end-to-end adaptation protocols by introducing two new capabilities.
• Data-driven predictive control: The global view of many connections' performance enables the protocols to accurately predict the outcome of making a particular choice, e.g., would a stream be able to sustain a particular bitrate? Would a TCP connection experience any loss given a particular initial window size? In theory, perfect prediction would allow protocols to use "optimal" configuration parameters and make "optimal" decisions. For example, it would be possible to pick the largest sustainable bitrate for a video stream, or the largest window size for which a TCP connection won't experience congestion losses.
• Flexible logics: CPC decouples adaptation logics from end hosts. This enables the adaptation logics to be customized for each session at any time based on its network context and application requirements.

3.2 Challenges of CPC

CPC architecture: CPC consists of three major components. (1) Global controller, which runs the data-driven prediction logics and makes predictive decisions for end hosts. (2) Local logics, which run inside the end hosts to monitor performance and execute decisions made by the global controller. (3) CPC API, which provides the interface between the global controller and local logics. Logically, CPC uses a control loop that consists of three steps: (1) Measurement collection, which maintains a global view of performance measurements collected from end hosts via the CPC API. (2) Predictive control, which makes predictive adaptation decisions for each session based on the global view. (3) Pushing adaptation decisions to end hosts for execution via the CPC API.

[Figure 3.1 diagram: a Global Controller (predictive control) connected over the Internet, through the CPC API, to End Points running Local Logics; the CPC API carries measurement collection and the pushing of adaptation decisions.]

Figure 3.1: Three components of CPC architecture (local logics, global controller and CPC API), and the logical control loop (measurement collection, predictive control and pushing adaptation decisions)
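The three-step control loop can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the CPC implementation: the class names `GlobalController` and `LocalLogic`, the in-memory dictionary that stands in for the global view, and the "highest mean quality wins" rule are all hypothetical and chosen only to make the loop concrete.

```python
from collections import defaultdict
from statistics import mean


class GlobalController:
    """Toy global controller (hypothetical): keeps a global view and decides."""

    def __init__(self):
        # Global view: (session features, decision) -> observed quality samples.
        self.global_view = defaultdict(list)

    def collect(self, features, decision, quality):
        """Step 1: measurement collection, as reported by local logics."""
        self.global_view[(features, decision)].append(quality)

    def decide(self, features, candidates):
        """Step 2: predictive control; pick the best predicted candidate."""
        def predicted(decision):
            samples = self.global_view.get((features, decision))
            # Assumption: a decision with no history is never preferred.
            return mean(samples) if samples else float("-inf")
        return max(candidates, key=predicted)


class LocalLogic:
    """Toy local logic (hypothetical): runs on the end host."""

    def __init__(self, features):
        self.features = features
        self.decision = None

    def apply(self, decision):
        """Step 3: execute the decision pushed by the controller."""
        self.decision = decision


controller = GlobalController()
controller.collect(("ISP-A", "NYC"), "CDN-1", 0.9)
controller.collect(("ISP-A", "NYC"), "CDN-2", 0.4)
controller.collect(("ISP-A", "NYC"), "CDN-2", 0.6)

client = LocalLogic(("ISP-A", "NYC"))
client.apply(controller.decide(client.features, ["CDN-1", "CDN-2"]))
```

In this sketch the prediction is a plain average over sessions with identical features; Chapter 4 replaces that step with prediction based on critical features.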

Challenge I: Accurate data-driven prediction: The first challenge is to have a highly accurate prediction algorithm in the predictive control logics. However, achieving high prediction accuracy is hard for two reasons.



• Need for expressive models: First, many complex factors can affect a session's performance, so the prediction models must be expressive enough to capture them. Taking a real-world event as an example, when an overloaded Level3 CDN edge server caused bad quality for video sessions of Comcast (ISP) users in New York City, the prediction model had to be able to isolate the sessions associated with that specific ISP and city and using that specific CDN; with a coarser grouping (e.g., all sessions in Comcast), we may not be able to identify the bad quality and predict accurately. In addition, different sessions may be subject to different factors; some sessions may be bottlenecked by their last-mile connection while others may be bottlenecked by content availability at the CDN.
• Need for fresh updates: Second, a session's performance changes rapidly, so the prediction algorithm must be scalable enough to make predictions based on fresh measurements. This is particularly challenging given the sheer volume of performance measurements (millions of concurrent video sessions and billions of concurrent TCP sessions) and the fact that the algorithm must also be expressive and complex. For instance, a complex machine learning algorithm (e.g., SVM) might take too long to build the prediction model (e.g., it takes SVM more than one hour to process 10 minutes of data from one video site), so predictions are made only with very stale data, which degrades prediction accuracy.

Challenge II: Scalable CPC control platform: A scalable CPC control platform must address two issues:
• Scalable global controller, which must meet three arguably conflicting goals at a scale of potentially billions of end hosts with global distribution: (1) Accurate prediction benefits from using measurements over a long history (e.g., several hours) as input. (2) Given the variability in performance across time and space (e.g., ISP-CDN combinations), the CPC controller needs an up-to-date global view (at most tens of seconds stale). (3) It needs to be responsive at sub-second timescales to handle new session arrivals.
• Functionality separation between end hosts and controller: The goal of separating the functionality of end hosts and the global controller is to (1) minimize the negative impact of the delay between end hosts and controller, (2) minimize overhead on both end points and controller, and (3) tolerate failures of the controller and of the communication between controller and end hosts.

3.3 Key Insights of CPC

This section describes the key insights and how they are used to address the above challenges. We give more details of the solutions in the next two chapters.

Insight I: Critical features: The key insight underlying our approach to accurate data-driven prediction is that the performance of a session is determined by a subset of critical features, and thus sessions that match on the values of the critical features have similar performance. For instance, if all sessions have good quality except for sessions in a specific CDN, "CDN" is the critical feature of these bad quality sessions.

The critical features naturally offer an expressive prediction model: one can predict a session's performance using the historical sessions that match the values of its critical features, because their performance is determined by the same features.



[Figure 3.2 plot: number of clusters (log scale, 10 to 100000) versus time (3/11 to 3/17); series: problem clusters, critical clusters.]

Figure 3.2: A relatively small number of combinations of critical features (i.e., critical clusters) are shared by many bad quality video sessions.

Insight II: Persistence of critical features: Critical features tend to persist on a long timescale (tens of minutes). For instance (Figure 3.3), using critical features learned 30 minutes ago yields prediction accuracy very similar to using critical features learned now.

[Figure 3.3 plot: degradation of accuracy (0 to 0.6) versus structure staleness in minutes (4, 32, 256); series: BufRatio, AvgBitrate, JoinTime, VSF.]

Figure 3.3: Slight staleness (e.g., 30 minutes) of critical features (i.e., structure) does not impact prediction accuracy significantly.

Persistence of critical features naturally leads to a scalable implementation of the predictive control logics of the global controller. As we will see in the next chapter, learning critical features takes longer than the actual prediction and decision making. Therefore, we can decouple the prediction logics into an offline process for learning critical features and an online process for prediction and decision making, such that prediction and decision making can leverage the most recent measurements as well as up-to-date critical features.

Insight III: Persistence of optimal decisions: We have found that the optimal decisions of some adaptation parameters have a relatively long persistence, on a timescale of seconds to tens of seconds. For instance (Figure 3.4), the best CDNs for different content providers persist on a timescale of tens of seconds to minutes. Note that there are adaptation parameters whose optimal decision changes every second or even more often. For instance, the congestion window size may need to be updated on receipt of a packet ACK to preserve the packet conservation principle.

This insight naturally suggests that, despite the additional delay between end hosts and the global controller, the global controller can make timely decisions for those adaptation parameters whose optimal decisions persist on a timescale longer than the delay between end hosts and the global controller. For instance, the CDN of a video client can be selected by the global controller because the optimal CDN typically changes on a timescale of tens of seconds, which is much longer than the delay between end hosts and the global controller.
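The rule of thumb above (centralize a parameter only if its optimal decision outlives the controller delay) can be written down as a small Python sketch. The function name `placement`, the `safety_factor` margin, and all numeric values are illustrative assumptions, not measurements from this proposal.

```python
def placement(persistence_s, controller_delay_s, safety_factor=10.0):
    """Decide where an adaptation parameter should be controlled.

    Returns "global" when the optimal decision persists much longer than the
    end-host-to-controller delay, and "local" otherwise. The safety_factor
    margin is an assumption made for this sketch.
    """
    if persistence_s >= safety_factor * controller_delay_s:
        return "global"
    return "local"


# Hypothetical numbers: CDN choice persists ~30 s, per-ACK window updates
# persist ~50 ms, and the controller round trip takes ~200 ms.
cdn_placement = placement(persistence_s=30.0, controller_delay_s=0.2)
cwnd_placement = placement(persistence_s=0.05, controller_delay_s=0.2)
```

Under these assumed numbers, CDN selection is assigned to the global controller while per-ACK congestion window updates stay with the local logics, which matches the split described in the text.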



[Figure 3.4 plot: CDF (0 to 1) versus persistence in minutes (log scale, 1 to 1000); series: CP A, CP B, CP C.]

Figure 3.4: The best CDNs for different content providers persist on a timescale of tens of seconds to minutes.



Chapter 4

Accurate Data-Driven Prediction Logics

4.1 Technical Overview

The key reason for CPC's better performance is its capability to predict, at any time, the outcome of a session under a specific adaptation decision based on a global view of different sessions' performance; for instance, whether a video client using a specific CDN and bitrate will experience any re-buffering. A decision-making system can then make the optimal decision for any session based on the predicted performance of every possible decision. Recall from §3.2 that an accurate prediction algorithm must meet the needs for expressive models and fresh updates. For convenience, in this chapter we define a session to denote a connection using a specific adaptation decision (e.g., a video client using a specific CDN and bitrate), and we would like to accurately predict its performance.

This section presents a practical prediction system called CFA. CFA uses a data-driven approach: with a global view of performance measurements from many historical and concurrent sessions, it is likely to find some similar sessions (e.g., those with the same AS and CDN as the session under prediction) whose performance is similar to that of the session under prediction. Given the insight of critical features (§3.3), in order to predict the performance of a session s, it is sufficient to use sessions that match s on its critical features within a short history, because their performance is impacted by the same decisive factors as the performance of s.

CFA framework based on critical features: Formally, we denote the session aggregation $\mathrm{Agg}(s, F, T)$ as the set of sessions that match session $s$ on the features in a feature set $F$ and that happened within time window $T$ before $s$. In particular, the set of all features is denoted by $F^{all}$. Based on the insight of critical features, CFA uses two steps to predict the performance of a session $s$.
• Structure learning: First, CFA learns the critical features of $s$, denoted as $F^{critical}_s$;
• Value estimation: Then CFA estimates the performance of $s$ based on the sessions in $\mathrm{Agg}(s, F^{critical}_s, T_{est})$ (e.g., the average of their performance), where $T_{est}$ is a small time window, e.g., 5 minutes.
There are two practical issues with this framework: (1) How to learn the critical features of a session? (2) How to be scalable enough to use fresh updates? We postpone the latter to the next chapter.
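The aggregation and the value-estimation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CFA implementation: the `Session` record, the function names `agg` and `estimate`, the use of seconds for time, and the plain average are hypothetical choices, and the critical features are taken as given rather than learned.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Session:
    features: dict                   # e.g. {"cdn": "X", "isp": "A"}
    start: float                     # start time in seconds
    quality: Optional[float] = None  # unknown for the session under prediction


def agg(s, feature_set, window, history):
    """Agg(s, F, T): sessions in `history` that match s on every feature in F
    and started within `window` seconds before s."""
    return [
        h for h in history
        if 0 < s.start - h.start <= window
        and all(h.features.get(f) == s.features.get(f) for f in feature_set)
    ]


def estimate(s, critical_features, history, t_est=300.0):
    """Value estimation: average quality over Agg(s, F_critical, T_est).
    Returns None when no matching session exists (an assumption of this sketch)."""
    matched = agg(s, critical_features, t_est, history)
    return mean(h.quality for h in matched) if matched else None


history = [
    Session({"cdn": "X", "isp": "A"}, 800.0, 0.5),
    Session({"cdn": "X", "isp": "B"}, 900.0, 1.0),
    Session({"cdn": "Y", "isp": "A"}, 950.0, 0.1),  # different CDN
    Session({"cdn": "X", "isp": "A"}, 100.0, 0.0),  # outside the 5-minute window
]
s = Session({"cdn": "X", "isp": "A"}, 1000.0)
prediction = estimate(s, {"cdn"}, history)
```

With "cdn" as the only critical feature, the estimate averages the two recent sessions on CDN X regardless of their ISP; adding "isp" to the feature set narrows the aggregation to a single session.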

Learning critical features: By definition, the sessions in $Agg(s, F_s^{critical}, T_{est})$ should have the same (or a very similar) performance distribution as those in $Agg(s, F_s^{all}, T_{est})$, i.e., the sessions matching $s$ on all features, because critical features reflect the important factors that determine performance. So a naive solution is to look for the feature set $F$ such that $Agg(s, F, T_{est})$ has the performance distribution most similar to that of $Agg(s, F_s^{all}, T_{est})$. However, this naive method is not practical, because $Agg(s, F_s^{all}, T_{est})$ rarely has enough sessions to yield a reliable performance distribution. CFA addresses this problem by taking advantage of the persistence of critical features (§3.3). Critical features tend to persist on a long timescale (tens of minutes), which is an order of magnitude longer than performance persists (minutes). This suggests that if we replace $T_{est}$ by the persistence of critical features, $T_{learn}$ ($\gg T_{est}$), we are likely to get much more data in $Agg(s, F_s^{all}, T_{learn})$ while still being able to learn the critical features, because the sessions in $Agg(s, F_s^{all}, T_{learn})$ share the same critical features as those in $Agg(s, F_s^{critical}, T_{learn})$.
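The structure-learning step can be sketched as a search over small feature subsets, comparing each candidate aggregation against the all-feature aggregation over the longer window $T_{learn}$. The text does not fix a similarity measure, so the quartile-based distance below is an assumption, as are the `Session` type, the `max_size` bound on subset size, and the tie-breaking rule that prefers the first (smallest) subset found.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, NamedTuple, Optional


class Session(NamedTuple):
    """Hypothetical session record; field names are assumptions."""
    features: Dict[str, str]
    timestamp: float
    quality: float


def agg(s: Session, feature_set: Iterable[str], window: float,
        history: List[Session]) -> List[Session]:
    """Agg(s, F, T): sessions matching s on F within `window` before s."""
    feature_set = list(feature_set)
    return [
        h for h in history
        if 0 < s.timestamp - h.timestamp <= window
        and all(h.features.get(f) == s.features.get(f) for f in feature_set)
    ]


def quantile_points(values: List[float], k: int = 4) -> List[float]:
    """k evenly spaced order statistics of a non-empty sample."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // k] for i in range(k)]


def distance(a: List[float], b: List[float]) -> float:
    """Assumed similarity measure: mean absolute gap between quantile points."""
    qa, qb = quantile_points(a), quantile_points(b)
    return sum(abs(x - y) for x, y in zip(qa, qb)) / len(qa)


def learn_critical_features(s: Session, all_features: Iterable[str],
                            history: List[Session], t_learn: float,
                            max_size: int = 2) -> Optional[FrozenSet[str]]:
    """Pick the feature subset whose aggregation over T_learn is closest
    to the all-feature aggregation; smaller subsets win ties."""
    all_features = sorted(all_features)
    reference = [h.quality for h in agg(s, all_features, t_learn, history)]
    if not reference:
        return None  # not enough data even over the long window
    best, best_d = None, float("inf")
    for size in range(1, max_size + 1):
        for candidate in combinations(all_features, size):
            sample = [h.quality for h in agg(s, candidate, t_learn, history)]
            d = distance(sample, reference)
            if d < best_d:  # strict comparison keeps the earlier, smaller subset
                best, best_d = frozenset(candidate), d
    return best
```

The exhaustive search is exponential in `max_size`, which is tolerable here only because this step runs at the coarse timescale of $T_{learn}$ rather than per prediction.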

4.2 My Prior Work

Three of my prior studies have focused on accurate prediction and the underlying insight of critical features.

Study on Structures of Internet Video Quality Problems: This work is a first step towards understanding the structure of Internet video quality issues in the wild. To quantify the structure of video quality problems, we define a problem cluster as a set of bad-quality video sessions matching on certain features (e.g., CDN, ISP, content genre), and define critical clusters as those likely to “cause” many problem clusters. Our findings show that (1) a relatively small number of critical clusters (2% of problem clusters) can account for most observed quality problems and (2) they are persistent (50% of problem clusters last for more than 2 hours, with a few pathological critical clusters spanning several hours).

CFA for Video Quality Prediction: This work presents a practical prediction algorithm, CFA, for video quality prediction based on the insights of critical features and their persistence. We present real-world evidence for the needs of expressive models and fresh updates, and explain why conventional machine learning techniques are not sufficient. Trace-driven evaluation and real-world experiments show that CFA achieves up to 60% improvement in prediction accuracy compared to conventional machine learning techniques (such as decision trees), and up to 15% improvement across different quality metrics relative to current approaches.

[Figure 4.1(a): improvement over NB (%) of DT and CFA on BufRatio, AvgBitrate, JoinTime, and VSF. CFA is more accurate than conventional ML.]

[Figure 4.1(b): join time (ms) for CFA, CDN1, CDN2, CDN3, and Random. Accurate prediction yields better video quality.]

Figure 4.1: Benefits of CFA: more accurate prediction of video quality and better quality through accurate prediction.

Throughput Prediction for Video Initial Bitrate Selection: In this project, we develop a data-driven prediction algorithm, DDA, for throughput prediction by leveraging throughput measured on sessions of different servers and clients. Trace-driven evaluation on real-world datasets shows that DDA predicts throughput more accurately than simple predictors and conventional machine learning algorithms; e.g., DDA’s 80th-percentile prediction error is 50% lower than that of other algorithms. We also show that this improved accuracy enables video players to select a higher sustainable initial bitrate; e.g., compared to initial bitrate selection without prediction, DDA leads to a 4x higher average bitrate.

[Figure 4.2(a): CDF of prediction error (%) for DDA, DT, LS, NB, and LM. DDA is more accurate than conventional ML in predicting throughput.]

[Figure 4.2(b): Accurate prediction yields higher average bitrate with less re-buffering.]

Figure 4.2: Benefits of DDA: more accurate prediction of throughput and better bitrate selection based on accurate throughput prediction.

4.3 Proposed Work

CFA for Skype Quality Improvement: This will be collaborative work with the Skype team and researchers at Microsoft. The goal is to use real-time analytics on Skype client-side measurements to generate insights and improve Skype quality. This seems a perfect fit for CFA. In particular, we plan to answer the following questions:
• What are the control knobs that have an impact on Skype quality?
• How much does using critical features help predict Skype quality accurately?
• How can CFA be enhanced with auxiliary information (e.g., topology and packet-level information)?
• What are the differences between video streaming and real-time communication such as Skype, in terms of the applicability and benefits of CPC? Why is CPC/CFA more/less useful for Skype than for video streaming?



Chapter 5

Scalable CPC Control Platform

5.1 Technical Overview

The introduction of a separate controller adds complexity to the end-to-end architecture, which must be addressed by a scalable and robust control platform.

Scalable global controller: Recall from §3.2 that the global controller must satisfy three requirements: (1) the ability to run structure learning over a long-term history (i.e., performance measurements of all users in the last tens of minutes), (2) the ability to update the prediction model with a fresh global view of a short history (i.e., performance measurements of all users in the last tens of seconds), and (3) the ability to respond to end hosts within a second. Unfortunately, satisfying all three requirements simultaneously is hard. For instance, structure learning may take a long time (e.g., tens of minutes), so naively running structure learning and predictive decision making together would take too long to produce predictions based on an up-to-date global view.

Our design achieves these goals by splitting the global controller into three loosely coupled control loops (Figure 5.1).
• Offline structure learning, which operates at the timescale of tens of minutes and runs the structure learning of CFA to learn critical features based on the performance measurements of the last tens of minutes.
• Online model updating, which operates at the timescale of tens of seconds to update a global model (the performance estimation for different values of critical features) using an up-to-date global view of performance measurements.
• Fine-grained per-session control, which operates at the timescale of milliseconds and makes the actual decisions for end hosts.
The rationale of this triple-loop design is that how often each loop runs matches the persistence of its results. Offline structure learning happens every tens of minutes and outputs critical features, which persist on the timescale of tens of minutes (§3.3). Online model updating happens every tens of seconds and outputs value estimates for different values of critical features (e.g., the performance of different CDNs and their ranking), which persist on the timescale of tens of seconds (§3.3).
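To make the triple-loop split concrete, here is a minimal single-process sketch in Python. The class name, the injected `learner` callable, the measurement format (a feature dictionary paired with a quality value where lower is better), and the default periods are all assumptions for illustration; a real deployment would run the loops as separate services.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Measurement = Tuple[Dict[str, str], float]  # (features, quality; lower is better)


class TripleLoopController:
    """Hypothetical controller that runs each loop at its own period."""

    def __init__(self, learner: Callable[[List[Measurement]], Sequence[str]],
                 learn_period: float = 600.0, update_period: float = 10.0):
        self.learner = learner              # structure-learning routine (assumed)
        self.learn_period = learn_period    # tens of minutes
        self.update_period = update_period  # tens of seconds
        self.critical_features: Tuple[str, ...] = ()
        self.model: Dict[Tuple, float] = {}
        self._next_learn = 0.0
        self._next_update = 0.0

    def tick(self, now: float, measurements: List[Measurement]) -> None:
        """Run whichever of the two slow loops is due at time `now`."""
        if now >= self._next_learn:
            # Loop 1: offline structure learning.
            self.critical_features = tuple(self.learner(measurements))
            self._next_learn = now + self.learn_period
            self._next_update = now  # model keys changed, so refresh now
        if now >= self._next_update:
            # Loop 2: online model updating over fresh measurements.
            buckets = defaultdict(list)
            for features, quality in measurements:
                key = tuple(features.get(f) for f in self.critical_features)
                buckets[key].append(quality)
            self.model = {k: mean(v) for k, v in buckets.items()}
            self._next_update = now + self.update_period

    def decide(self, features: Dict[str, str], knob: str,
               options: Sequence[str]) -> Optional[str]:
        """Loop 3: per-session control, a pure lookup in the current model."""
        scored = []
        for option in options:
            candidate = dict(features, **{knob: option})
            key = tuple(candidate.get(f) for f in self.critical_features)
            if key in self.model:
                scored.append((self.model[key], option))
        return min(scored)[1] if scored else None
```

Because `decide` only reads the most recent model, its latency is independent of how long structure learning or model updating takes, which is the point of decoupling the loops.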

Finally, cloud computing infrastructure provides a practical platform for implementing the triple-loop architecture of the CPC global controller. First, it provides a global presence of servers that are close to end hosts, which meets the needs of fine-grained per-session decision making. Second, the proposed triple-loop architecture requires an efficient communication channel between layers, which cloud infrastructure typically provides. Third, cloud infrastructure provides elastic resources, which lets us easily scale up to more end hosts.

[Figure 5.1 diagram: Offline Structure Learning (every 10s of minutes) outputs critical features; Online Model Updating (every 10s of seconds) outputs the global model (value estimation for critical feature values); Per-session Control (in real time) sends adaptation decisions to end hosts.]

Figure 5.1: Splitting CPC controller into three control loops

Separation of functionalities between end hosts and controller: The CPC control platform decouples the functionalities of end hosts and the global controller. One extreme strawman is to have the global controller make all adaptation decisions. However, this strawman has two problems. First, there is an inherent delay between end hosts and the global controller, which may create instability and even pathological cases (e.g., congestion collapse due to the violation of packet conservation). Second, relying on the global controller for all adaptation decisions is not fault tolerant, as performance will degrade if the controller fails.

To address the above issues, we make two design decisions. First, the controller should only make adaptation decisions whose optimal values change on a timescale longer than the delay between end hosts and the controller. For instance, the CDN of a video client should be selected by the global controller because the optimal CDN typically changes on the timescale of tens of seconds (Figure 3.4), which is much longer than the delay between end hosts and the controller. In contrast, the TCP congestion window should (for most of the time) be selected by the local logics of end hosts, since it must change on a timescale of less than a second to meet packet conservation. However, the key parameters of the congestion control algorithm (e.g., initial congestion window and AIMD parameters) can be decided by the global controller, as they are more likely to persist on a long timescale. Second, for the purpose of fault tolerance, the local logics of end hosts should keep running a default adaptation logic even while its decisions are overridden by the controller. Once the controller fails or times out, the end hosts can automatically fall back to the default local logics.
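The two design decisions can be illustrated with a short Python sketch. The function and class names, the safety `margin` of 10x, and the convention that the controller query raises `TimeoutError` or `ConnectionError` on failure are assumptions made for this example, not part of the CPC design as stated.

```python
from typing import Callable, Optional


def controller_should_own(persistence_s: float, controller_delay_s: float,
                          margin: float = 10.0) -> bool:
    """Decision 1: delegate a knob to the controller only if its optimal
    value persists much longer than the controller round-trip delay.
    The 10x margin is an assumed threshold."""
    return persistence_s >= margin * controller_delay_s


class EndHost:
    """Decision 2: always compute a local default, let the controller
    override it, and fall back when the controller is unavailable."""

    def __init__(self, default_logic: Callable[[dict], str],
                 query_controller: Callable[[dict, float], Optional[str]],
                 timeout_s: float = 1.0):
        self.default_logic = default_logic
        self.query_controller = query_controller  # hypothetical CPC API call
        self.timeout_s = timeout_s

    def decide(self, state: dict) -> str:
        # The default is computed first so the fallback is always ready.
        local = self.default_logic(state)
        try:
            override = self.query_controller(state, self.timeout_s)
        except (TimeoutError, ConnectionError):
            return local  # controller failed or timed out
        # None means the controller has no opinion for this session.
        return local if override is None else override
```

With a controller delay of 0.1 s, for example, `controller_should_own(30.0, 0.1)` holds for a CDN choice that persists for tens of seconds, while `controller_should_own(0.05, 0.1)` does not for a congestion window that changes within tens of milliseconds.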

5.2 My Prior Work

Scalable Implementation of CFA (part of CFA): Persistence of critical features naturally leads to a scalable implementation of CFA. CFA learns the critical features at a coarse timescale, and when making predictions, CFA reuses the learned critical features and runs simple logic over fresh data.

C3 (not major contributor): C3 is a scalable global control platform used for optimization of Internet video quality. Its implementation benefits from two key ideas introduced in this chapter: decoupling global analytics from per-session control for scalability and a fresh global view, and a fallback strategy of local logics for fault tolerance.

5.3 Proposed Work

CPC for TCP: We propose to build a prototype of CPC for TCP, including the three components of CPC (global controller, local logics and CPC API). The design of CPC for TCP needs to address all the challenges mentioned at the beginning of this chapter, and we will examine the effectiveness of the proposed solutions. At a high level, we are interested in answering the following questions:
• What are the design differences between CPC for TCP and CPC for video (C3)?
• What differences between TCP and Internet video cause these design differences?
• Can CPC lead to more/less performance improvement for TCP than for video?

Another deliverable of this project is to deploy CPC for TCP for public use. We envision several modes of deployment. For instance, the global controller of CPC for TCP can run as a public service, and the local logics can be released as a kernel patch or as part of a user-space TCP for wide adoption.



Chapter 6

Timeline and Risk Analysis

6.1 Timeline
• By July 17, 2015 – Position paper on the case for data-driven protocols (submission to HotNets’15)
• By Sep, 2015 – CFA for video (submission to NSDI’16)
• By Jan, 2016 – CFA for Skype (submission to SIGCOMM’16)
• By Sep, 2016 – CPC for TCP (submission to NSDI’17)

6.2 Risk Analysis

CFA for Skype: Because of the collaborative nature of the project, the major risk is that its scope may not be under our control. The ideal case is to develop some variant of CFA (CFA++) that can predict Skype quality accurately and scalably, and to demonstrate that CFA++ achieves better Skype quality. As a backup plan, we should be able to argue that applying data-driven techniques to the Skype dataset can lead to a better understanding of real-time network conditions, especially of critical structures (e.g., bottlenecks). Another risk of this project is potential issues with the Skype dataset, such as privacy issues, incorrect data, and broken collection methods. We should confirm these with the Microsoft team as soon as possible.

CPC for TCP: There are few risks in designing and building a prototype of CPC for TCP. The major risk is that the deployment may struggle to gain wide adoption.
• Plan A: run the local logics in end hosts as a Linux patch and release it for public use.
• Plan B: run the local logics as part of user-level TCP or TCP over UDP.
• Plan C: write the whole system as part of NS2.
• Plan D: give up the plan for TCP deployment and deploy it over json video player.

In any case, we should run the global controller as a service in a public cloud. To bootstrap the deployment, we should deploy the local logics on PlanetLab or GENI.



Bibliography

[1] Network functions virtualisation. http://portal.etsi.org/nfv/nfv_white_paper.pdf. 2.2

[2] I. Sodagar. The MPEG-DASH Standard for Multimedia Streaming Over the Internet. IEEEMultimedia, 2011. 1

[3] Athula Balachandran, Vyas Sekar, Aditya Akella, and Srinivasan Seshan. Analyzing thepotential benefits of cdn augmentation strategies for internet video workloads. 2013. 2.2

[4] Hari Balakrishnan, Hariharan S Rahul, and Srinivasan Seshan. An integrated congestionmanagement architecture for internet hosts. In ACM SIGCOMM Computer CommunicationReview, volume 29, pages 175–187. ACM, 1999. 1, 2.2

[5] Matthew Caesar, Donald Caldwell, Nick Feamster, Jennifer Rexford, Aman Shaikh, andJacobus van der Merwe. Design and implementation of a routing control platform. In NSDI2005. 2.2

[6] David Clark. The design philosophy of the darpa internet protocols. ACM SIGCOMMComputer Communication Review, 18(4):106–114, 1988. 1

[7] Advait Dixit, Fang Hao, Sarit Mukherjee, TV Lakshman, and Ramana Kompella. Towardsan elastic distributed sdn controller. In ACM HotSDN, 2013. 2.2

[8] Mo Dong, Qingxi Li, Doron Zarchy, P. Brighten Godfrey, and Michael Schapira. Pcc: Re-architecting congestion control for consistent high performance. In 12th USENIX Sympo-sium on Networked Systems Design and Implementation (NSDI 15), pages 395–408, Oak-land, CA, 2015. USENIX Association. 2.2

[9] Nandita Dukkipati, Tiziana Refice, Yuchung Cheng, Jerry Chu, Tom Herbert, Amit Agar-wal, Arvind Jain, and Natalia Sutin. An argument for increasing tcp’s initial congestionwindow. Computer Communication Review, 40(3):26–33, 2010. 2.2

[10] Benjamin Frank, Ingmar Poese, Yin Lin, Georgios Smaragdakis, Anja Feldmann, BruceMaggs, Jannis Rake, Steve Uhlig, and Rick Weber. Pushing cdn-isp collaboration to thelimit. ACM SIGCOMM CCR, 43(3), 2013. 2.2

[11] Aditya Ganjam, Faisal Siddiqi, Jibin Zhan, Ion Stoica, Junchen Jiang, Vyas Sekar, and HuiZhang. C3: Internet-scale control plane for video quality optimization. In To appear inNSDI. USENIX, 2015. 1.1, 3.1

[12] Monia Ghobadi, Soheil Hassas Yeganeh, and Yashar Ganjali. Rethinking end-to-end con-gestion control in software-defined networks. In Proceedings of the 11th ACM Workshop

25

http://portal.etsi.org/nfv/nfv_white_paper.pdf

http://portal.etsi.org/nfv/nfv_white_paper.pdf


on Hot Topics in Networks, pages 61–66. ACM, 2012. 2.2

[13] Qi He, Constantine Dovrolis, and Mostafa Ammar. On the predictability of large transfertcp throughput. ACM SIGCOMM Computer Communication Review, 35(4):145–156, 2005.2.2

[14] Chi-Yao Hong. Software defined transport. PhD thesis, University of Illinois at Urbana-Champaign, 2015. 2.2

[15] Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri,and Roger Wattenhofer. Achieving high utilization with software-driven wan. In ACMSIGCOMM 2013. 2.2

[16] Te-Yuan Huang, Ramesh Johari, Nick McKeown, Matthew Trunnell, and Mark Watson. Abuffer-based approach to rate adaptation: evidence from a large video streaming service. InACM SIGCOMM 2014. 2.2

[17] V. Jacobson. Congestion avoidance and control. In ACM SIGCOMM Computer Communi-cation Review, volume 18, pages 314–329. ACM, 1988. 1

[18] Manish Jain and Constantinos Dovrolis. End-to-end estimation of the available bandwidthvariation range. In ACM SIGMETRICS Performance Evaluation Review, volume 33, pages265–276. ACM, 2005. 2.2

[19] Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh,Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, et al. B4: Experience with aglobally-deployed software defined wan. In ACM SIGCOMM 2013. 2.2

[20] Junchen Jiang, Vyas Sekar, and Hui Zhang. Improving Fairness, Efficiency, and Stabilityin HTTP-Based Adaptive Streaming with Festive . In ACM CoNEXT 2012. 2.2

[21] Junchen Jiang, Vyas Sekar, Ion Stoica, and Hui Zhang. Shedding light on the structure ofinternet video quality problems in the wild. In CoNEXT. ACM, 2013. 1.1, 2.2

[22] Junchen Jiang, Xi Liu, Vyas Sekar, Ion Stoica, and Hui Zhang. Eona: Experience-orientednetwork architecture. In ACM HotNets, 2014. 2.2

[23] Junchen Jiang, Vyas Sekar, and Yi Sun. Dda: Cross-session throughput prediction withapplications to video bitrate selection. arXiv preprint arXiv:1505.02056, 2015. 1.1

[24] Naga Praveen Katta, Jennifer Rexford, and David Walker. Incremental consistent updates.In Proceedings of the second ACM SIGCOMM workshop on Hot topics in software definednetworking, HotSDN ’13, pages 49–54, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2178-5. doi: 10.1145/2491185.2491191. URL http://doi.acm.org/10.1145/2491185.2491191. 2.2

[25] Swastik Kopparty, Srikanth V Krishnamurthy, Michalis Faloutsos, and Satish K Tripathi.Split tcp for mobile ad hoc networks. In Global Telecommunications Conference, 2002.GLOBECOM’02. IEEE, volume 1, pages 138–142. IEEE, 2002. 2.2

[26] Harry Liu, Ye Wang, Yang Richard Yang, Alexander Tian, and Hao Wang. Optimizing Costand Performance for Content Multihoming. In Proc. SIGCOMM, 2012. 2.2

[27] Xi Liu, Florin Dobrian, Henry Milner, Junchen Jiang, Vyas Sekar, Ion Stoica, and Hui

26

http://doi.acm.org/10.1145/2491185.2491191

http://doi.acm.org/10.1145/2491185.2491191


Zhang. A case for a coordinated internet video control plane. In Proceedings of the ACMSIGCOMM 2012 conference on Applications, technologies, architectures, and protocolsfor computer communication, pages 359–370. ACM, 2012. 1.1, 2.2

[28] Nick McKeown. Software-defined networking. INFOCOM keynote talk, 17(2):30–32,2009. 1, 2.2

[29] Mariyam Mirza, Joel Sommers, Paul Barford, and Xiaojin Zhu. A machine learning ap-proach to tcp throughput prediction. In ACM SIGMETRICS Performance Evaluation Re-view, volume 35, pages 97–108. ACM, 2007. 2.2

[30] Matthew K Mukerjee, JungAh Hong, Junchen Jiang, David Naylor, Dongsu Han, Srini-vasan Seshan, and Hui Zhang. Enabling near real-time central control for live video de-livery in cdns. In Proceedings of the 2015 ACM conference on SIGCOMM. ACM, 2015.2.2

[31] Aurojit Panda, Colin Scott, Ali Ghodsi, Teemu Koponen, and Scott Shenker. Cap fornetworks. In ACM HotSDN, 2013. 2.2

[32] Jonathan Perry, Amy Ousterhout, Hari Balakrishnan, Devavrat Shah, and Hans Fugal. Fast-pass: A centralized zero-queue datacenter network. In Proceedings of the 2014 ACM con-ference on SIGCOMM, pages 307–318. ACM, 2014. 2.2

[33] Larry Peterson and Bruce Davie. Framework for cdn interconnection. 2013. 2.2

[34] Louis Plissonneau and Ernst Biersack. A longitudinal view of http video streaming perfor-mance. In Proc. MMSys, 2012. 2.2

[35] Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector ma-chines, regularization, optimization, and beyond. MIT press, 2002. 2.2

[36] Srinivasan Seshan, Mark Stemm, and Randy H Katz. Spand: Shared passive networkperformance discovery. In USENIX Symposium on Internet Technologies and Systems,pages 135–146, 1997. 1, 2.2

[37] Anirudh Sivaraman, Keith Winstein, Pratiksha Thaker, and Hari Balakrishnan. An exper-imental study of the learnability of congestion control. In Proceedings of the 2014 ACMconference on SIGCOMM, pages 479–490. ACM, 2014. 1, 2.2

[38] Ao-Jan Su, David R Choffnes, Aleksandar Kuzmanovic, and Fabian E Bustamante. Draft-ing behind akamai (travelocity-based detouring). ACM SIGCOMM CCR, 2006. 2.2

[39] Martin Swany and Rich Wolski. Multivariate resource performance forecasting in the net-work weather service. In Proceedings of the 2002 ACM/IEEE conference on Supercomput-ing, pages 1–10. IEEE Computer Society Press, 2002. 2.2

[40] George R Terrell and David W Scott. Variable kernel density estimation. The Annals ofStatistics, pages 1236–1265, 1992. 2.2

[41] Amin Tootoonchian, Sergey Gorbunov, Yashar Ganjali, Martin Casado, and Rob Sherwood.On controller performance in software-defined networks. In USENIX Workshop on HotTopics in Management of Internet, Cloud, and Enterprise Networks and Services (Hot-ICE), 2012. 2.2

27


[42] Ruben Torres, Alessandro Finamore, Jin Ryong Kim, Marco Mellia, Maurizio M. Munafo,and Sanjay Rao. Dissecting Video Server Selection Strategies in the YouTube CDN. InICDCS, 2011. 2.2

[43] Sudharshan Vazhkudai, Jennifer M Schopf, and Ian Foster. Predicting the performance ofwide area data transfers. In Parallel and Distributed Processing Symposium., ProceedingsInternational, IPDPS 2002, Abstracts and CD-ROM, pages 10–pp. IEEE, 2001. 2.2

[44] Keith Winstein and Hari Balakrishnan. Tcp ex machina: Computer-generated congestioncontrol. In ACM SIGCOMM Computer Communication Review, volume 43, pages 123–134. ACM, 2013. 1, 2.2

[45] Hong Yan, David A Maltz, TS Eugene Ng, Hemant Gogineni, Hui Zhang, and Zheng Cai.Tesseract: A 4d network control plane. In NSDI, volume 7, pages 27–27, 2007. 1, 2.2

[46] Hong Yan, David A Maltz, TS Eugene Ng, Hemant Gogineni, Hui Zhang, and Zheng Cai.Tesseract: A 4d network control plane. In NSDI, volume 7, pages 27–27, 2007. 2.2

[47] H Yin et al. Inside the Bird’s Nest: Measurements of Large-Scale Live VoD from the 2008Olympics. In Proc. IMC, 2009. 2.2

[48] M. Yu, W. Jiang, H. Li, and I. Stoica. Tradeoffs in cdn designs for throughput orientedtraffic. In Proceedings of the 8th international conference on Emerging networking experi-ments and technologies, pages 145–156. ACM, 2012. 2.2

28
