Optimizing Big Data Analytics Frameworks in Geographically Distributed Datacenters
by
Shuhao Liu
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2019 by Shuhao Liu
Abstract

Optimizing Big Data Analytics Frameworks in Geographically Distributed Datacenters

Shuhao Liu
Doctor of Philosophy
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
2019
A variety of Internet applications rely on big data analytics frameworks to efficiently
process large volumes of raw data. As such applications expand to a continental or even
global scale, their raw data input can be generated and stored in different datacenters.
The performance of data analytics jobs is likely to suffer, as transferring data among
workers located in different datacenters is expensive.
In this dissertation, we propose a series of system optimizations for big data analytics
frameworks that are deployed across geographically distributed datacenters. Our work
optimizes the following three components in their architectural design:
Inter-datacenter network transfers. Few measures have been taken to handle the unpredictability and scarcity of available inter-datacenter bandwidth. With awareness of job-level performance, we focus on expediting inter-datacenter coflows, which are collections of parallel flows associated with job-level communication requirements. We propose novel scheduling and routing strategies to minimize coflow completion times.
These strategies are implemented in Siphon, a software-defined overlay network deployed
atop inter-datacenter networks.
Shuffles. Data analytics frameworks all provide a set of parallel operators as application building blocks to simplify development. Some operators trigger all-to-all transfers among worker nodes, known as shuffles. Shuffles are the source of inter-datacenter traffic, and their performance is critical. Our work focuses on improving network utilization during shuffles by adopting a new push-based mechanism that allows individual flow transfers to start early.
Graph analytics APIs. Graph analytics is among the most important categories of algorithms natively supported by typical big data analytics frameworks. Bulk Synchronous Parallel (BSP) is the state-of-the-art synchronization model for parallelizing graph algorithms, but it generates a significant amount of traffic among worker nodes in every iteration. We propose a Hierarchical Synchronous Parallel (HSP) model, which is designed to reduce the demand for inter-datacenter transfers. HSP achieves this goal without sacrificing algorithm correctness, resulting in better performance and lower monetary cost.
We have implemented and tested the prototypes based on Apache Spark, one of the
most popular data analytics frameworks. Extensive experimental results on real public
clouds across multiple geographical regions have shown their effectiveness.
To my family
Acknowledgements
It has been more than four years since I started my first day at the University of Toronto. At that time, I was a student who knew little about research in computer science and was a bit afraid of speaking English in public. Four years later, with all the blood, sweat, and tears of PhD study in memory, I could not be more grateful and thankful.

First and foremost, I would like to express my sincere appreciation to Prof. Baochun Li, my PhD advisor, for his guidance throughout the years. His vision in research and his methodology of mentoring students are innovative and effective. He has also given me career advice, shaped my values, and corrected my bad work habits. I am very lucky to have Prof. Baochun Li as my advisor, my mentor, my collaborator, and my friend.

I would like to thank my colleagues from the iQua research group: Hong Xu, Wei Wang, Jun Li, Li Chen, Liyao Xiang, Weiwei Fang, Xiaoyan Yin, Yilun Wu, Yinan Liu, Zhiming Hu, Jingjie Jiang, Shiyao Ma, Yanjiao Chen, Hao Wang, Hongyu Huang, Wanyu Lin, Xu Yuan, Wenxin Li, Siqi Ji, Jiapin Lin, Tracy Cheng, Jiayue Li, Yuanxiang Gao, Chen Ying, and Yifan Gong. Even when we were working on different projects, they would always offer me a helping hand. Having meetings and discussions with them has broadened my knowledge. I would like to give my special gratitude to Li Chen, who has been my closest collaborator. We worked side by side for countless hours and solved hard research problems together. It has been a joyful experience working with her.

Finally, I would like to thank my family for their unconditional love. Though I have been tens of thousands of miles away from home, I can always feel their care and support. Life changed a lot during these four years, but their love has always been as solid as a rock.

This dissertation serves as the ultimate milestone of my PhD study, and this journey will be my lifelong treasure. Without all the help and support I received throughout the years, this dissertation would not have been possible.
Contents
Acknowledgements v
Table of Contents vi
List of Tables ix
List of Figures x
1 Introduction 1
1.1 Contributions . . . . . . . . . . . . 7
1.2 Organization . . . . . . . . . . . . 9
2 Background and Related Work 10
2.1 Wide-Area Data Analytics . . . . . . . . . . . . 12
2.2 Network Optimization for Data Analytics . . . . . . . . . . . . 12
2.3 Software-Defined Networking . . . . . . . . . . . . 13
2.4 Optimizing Shuffle in Data Analytics . . . . . . . . . . . . 14
2.5 Distributed Graph Analytics . . . . . . . . . . . . 15
3 Siphon: Expediting Inter-Datacenter Coflows 16
3.1 Motivation and Background . . . . . . . . . . . . 19
3.2 Scheduling Inter-Datacenter Coflows . . . . . . . . . . . . 21
3.2.1 Inter-Coflow Scheduling . . . . . . . . . . . . 21
3.2.2 Intra-Coflow Scheduling . . . . . . . . . . . . 26
3.2.3 Multi-Path Routing . . . . . . . . . . . . 29
3.2.4 A Flow’s Life in Siphon . . . . . . . . . . . . 31
3.3 Siphon: Design and Implementation . . . . . . . . . . . . 32
3.3.1 Overview . . . . . . . . . . . . 32
3.3.2 Data Plane . . . . . . . . . . . . 33
3.3.3 Connections for Inter-Datacenter Links . . . . . . . . . . . . . . . 36
3.3.4 Control Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Controller-Data Plane Interaction in Siphon . . . . . . . . . . . . . . . . 39
3.4.1 Inefficiency of Reactive Control in OpenFlow . . . . . . . . . . . . 40
3.4.2 Caching Dynamic Forwarding Logic . . . . . . . . . . . . . . . . . 42
3.4.3 Customizing the Message Processing Logic . . . . . . . . . . . . . 44
3.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5.1 Macro-Benchmark Tests . . . . . . . . . . . . . . . . . . . . . . . 47
3.5.2 Single Coflow Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.3 Inter-Coflow Scheduling . . . . . . . . . . . . . . . . . . . . . . . 57
3.5.4 Aggregators: Stress Tests . . . . . . . . . . . . . . . . . . . . . . 58
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4 Optimizing Shuffle in Wide Area Data Analytics 61
4.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.1 Fetch-based Shuffle . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.2 Problems with Fetch in Wide-Area Data Analytics . . . . . . . . 63
4.2 Transferring Shuffle Input across Datacenters . . . . . . . . . . . . . . . . 64
4.2.1 Transferring Shuffle Input: Timing . . . . . . . . . . . . . . . . . 65
4.2.2 Transferring Shuffle Input: Choosing Destinations . . . . . . . . . 66
4.2.3 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Implementation on Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.2 transferTo(): Enforced Data Transfer in Spark . . . . . . . . . . 71
4.3.3 Implementation Details of tranferTo() . . . . . . . . . . . . . . 73
4.3.4 Automatic Push/Aggregate . . . . . . . . . . . . . . . . . . . . . 76
4.3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4.1 Cluster Configurations . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.2 Job Completion Time . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4.3 Cross-Region Traffic . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.4 Stage Execution Time . . . . . . . . . . . . . . . . . . . . . . . . 86
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5 A Hierarchical Synchronous Parallel Model 89
5.1 Background and Motivation . . . . . . . . . . . . 92
5.2 Hierarchical Synchronous Parallel Model . . . . . . . . . . . . 94
5.2.1 Overview . . . . . . . . . . . . 95
5.2.2 Model Formulation and Description . . . . . . . . . . . . 96
5.2.3 Proof of Convergence and Correctness . . . . . . . . . . . . 99
5.2.4 Rate of Convergence . . . . . . . . . . . . 100
5.2.5 PageRank Example: a Numerical Verification . . . . . . . . . . . . 103
5.3 Prototype Implementation . . . . . . . . . . . . 104
5.4 Experimental Evaluation . . . . . . . . . . . . 107
5.4.1 Methodology . . . . . . . . . . . . 108
5.4.2 WAN Bandwidth Usage . . . . . . . . . . . . 109
5.4.3 Performance and Total Cost Analysis . . . . . . . . . . . . 111
5.4.4 Rate of Convergence . . . . . . . . . . . . 111
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6 Concluding Remarks 113
6.1 Conclusion . . . . . . . . . . . . 113
6.2 Future Directions . . . . . . . . . . . . 114
Bibliography 116
List of Tables
3.1 Peak TCP throughput (Mbps) achieved across different regions on the Google Cloud Platform. . . . . . 20
3.2 Summary of prototype network applications implemented in Siphon . . . . . 46
3.3 Summary of shuffles in different workloads (presenting the run with the median application run time). . . . . 48
3.4 The overall throughput for 6 concurrent data fetches. . . . . . 59
4.1 The specifications of four workloads used in the evaluation. . . . . . . . . 80
5.1 Summary of the datasets used. . . . . . . . . . . . 108
5.2 WAN bandwidth usage comparison. . . . . . . . . . . . 109
List of Figures
1.1 An overview of our work in the architectural design of a general data analytics framework. . . . . . 4
2.1 Key terminologies and concepts in a data analytics framework. . . . . . . 11
3.1 A case of inter-datacenter coflow scheduling. . . . . . 21
3.2 A complete execution graph of Monte Carlo simulation. . . . . . 24
3.3 Network flows across datacenters in the shuffle phase of a simple job. . . . . . 26
3.4 Job timeline with LFGF scheduling. . . . . . 27
3.5 Job timeline with naive scheduling. . . . . . 27
3.6 Flexibility in routing improves performance. . . . . . 30
3.7 A flow’s life through Siphon. . . . . . 32
3.8 An architectural overview of Siphon. . . . . . 32
3.9 The architectural design of a Siphon aggregator. . . . . . 36
3.10 The architecture of the Siphon Controller. . . . . . 37
3.11 An example to show the benefits of a programmable data plane in software-defined networks. . . . . . 41
3.12 The message processing diagram in Siphon. . . . . . 43
3.13 Prototype inheritance of Message Processing Objects, illustrating the programming interfaces. . . . . . 45
3.14 Average application run time. . . . . . 48
3.15 Shuffle completion time and stage completion time comparison (presenting the run with the median application run time). . . . . . 49
3.16 Average job completion time across 5 runs. . . . . . 52
3.17 Breakdowns of the reduce stage execution across 5 runs. . . . . . 52
3.18 CDF of shuffle read time (presenting the run with the median job completion time). . . . . . 53
3.19 The summary of inter-datacenter traffic in the shuffle phase of the sort application. . . . . . 56
3.20 Bandwidth distribution among datacenters. . . . . . 56
3.21 Average and 90th percentile CCT comparison. . . . . . 56
3.22 The switching capacities of a Siphon aggregator on three different types of instances. . . . . . 59
4.1 Motivation of optimizing shuffle: case 1. . . . . . 65
4.2 Motivation of optimizing shuffle: case 2. . . . . . 66
4.3 A snippet of a sample execution graph of a data analytics job. . . . . . 67
4.4 Implementation of transferTo(). . . . . . 73
4.5 Implementation of transferTo() implicit embedding. . . . . . 77
4.6 Geographical deployment of testbed. . . . . . 80
4.7 Average job completion time under HiBench. . . . . . 82
4.8 Total volume of cross-datacenter traffic under different workloads. . . . . . 85
4.9 Stage execution time breakdown under each workload. . . . . . 88
5.1 A motivating example for HSP. . . . . . 93
5.2 An example run of PageRank under HSP. . . . . . 105
5.3 The flow chart that shows central coordination in HSP. . . . . . 106
5.4 Application runtime under HSP, normalized by the runtime under BSP. . . . . . 109
5.5 Estimated cost breakdown for running applications. . . . . . 110
5.6 Rate of convergence analysis for PageRank on uk-2014-host. . . . . . 110
Chapter 1
Introduction
Since the dawn of cloud computing, an increasing number of applications and services
have been built thanks to the ability to efficiently extract useful information from a
massive amount of raw data in commodity datacenters. Some of these applications
need to digest petabytes of raw data daily or even hourly, launching their jobs on
tens of thousands of physical machines in parallel. Big data analytics frameworks, e.g.,
Apache Hadoop [1] and Apache Spark [84], make it feasible to develop and manage such
applications at scale.
At a high level, a typical big data analytics framework executes a data analytics job
in several stages. Each stage consists of a number of tasks that can process different
partitions of the dataset, running on different worker machines in parallel.
The dependencies among stages can be abstracted as a Directed Acyclic Graph (DAG), and a child stage cannot start until its parent stages have completely finished. Between two consecutive stages, the intermediate data layout of the parent stage is completely reorganized, resulting in an all-to-all traffic pattern among workers, known as the shuffle phase. Due to the volume of traffic generated, shuffles may constitute a significant fraction of the job completion time [40], even if workers are located within a single datacenter where bandwidth is abundantly available [7].
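To make the all-to-all pattern concrete, the following is a minimal, self-contained Python sketch of hash partitioning between a map stage and a reduce stage. The records, the byte-sum partitioner, and the two-mapper setup are hypothetical illustrations, not the internals of any particular framework.

```python
from collections import defaultdict

def partition_of(key, num_reducers):
    """Deterministic toy partitioner: assign a key to a reduce partition."""
    return sum(key.encode()) % num_reducers

def hash_partition(records, num_reducers):
    """Split one map task's output into per-reducer shuffle blocks."""
    blocks = defaultdict(list)
    for key, value in records:
        blocks[partition_of(key, num_reducers)].append((key, value))
    return blocks

# Hypothetical outputs of two map tasks running on different workers.
map_outputs = [
    [("a", 1), ("b", 1), ("c", 1)],
    [("a", 1), ("c", 1), ("d", 1)],
]

num_reducers = 2
shuffle_blocks = [hash_partition(out, num_reducers) for out in map_outputs]

# Each reducer must collect its block from *every* mapper before it can
# start: this all-to-all transfer is the shuffle phase.
reduce_inputs = {
    r: [rec for blocks in shuffle_blocks for rec in blocks.get(r, [])]
    for r in range(num_reducers)
}
print(reduce_inputs)
```

When the two mappers sit in different datacenters, every one of these per-reducer blocks may become an inter-datacenter flow.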
As many Internet applications and services expand into global businesses, not only
have the volumes of input data been increasing continuously and exponentially, but also
the geographical diversity of datasets has been growing. User-generated content and system logs, two major sources of data to be processed [26], are naturally generated and stored on servers housed in geographically distributed datacenters. Netflix, for example, hosts its global video streaming service on Amazon Web Services (AWS), a global cloud platform. It is tricky to run a recommender system continuously on ever-changing user preference data that are generated in different AWS regions.
Challenges arise in the efficiency of big data analytics frameworks, which are not designed with an awareness of datacenter boundaries. Inter-datacenter Wide-Area Networks (WANs) offer orders of magnitude lower bandwidth capacity than intra-datacenter networks [38]. Even when such WANs are physically dedicated optical links (e.g., Google's B4 [42]), the bandwidth available to a single cloud tenant is still scarce, as it is shared among millions of users. As a result, the shuffle phase takes much longer when bulk inter-datacenter flow transfers are required.
Administrators of these data analytics applications can do little: as cloud tenants, they have no control over how their traffic traverses inter-datacenter WANs.
This problem is known as wide-area data analytics [64, 65]. As a tenant in the
public cloud, how can we process a geographically distributed dataset at a
low cost?
As an additional requirement, a practical solution must be transparent:
1. Compatibility with existing code. With an entire ecosystem of libraries and applications built on top of popular data analytics frameworks like Hadoop and Spark, it makes sense to maintain API consistency. Any attempt to modify existing well-defined APIs is undesirable.
2. Optional user intervention. The usefulness of data analytics frameworks stems from the fact that they hide plenty of implementation details from users, allowing them to focus solely on their business logic. From the developers' point of view, it should make no difference where or how the input datasets are physically stored and processed. To avoid violating this principle, any interface for taking in user insights about the geographical data distribution must be optional, and users should be able to remain completely unaware of it.
3. Ease of deployment on mainstream cloud platforms. Deploying and managing a
data analytics framework is easy and well-supported on most cloud platforms, and it
should not become more difficult when the deployment scales out to multiple datacenters.
One straightforward and intuitive solution is to aggregate the entire dataset into a
single datacenter before processing it. However, moving a massive amount of raw data via
inter-datacenter WANs is expensive, both time-wise and money-wise. Evaluation results in both our work [56] and existing work [64] show that it is less effective than processing raw data in place. It is also sometimes infeasible to transfer raw datasets across borders due to regulatory restrictions [79].
There are generally two lines of existing work that tackle the wide-area data analytics problem. One line attempts to design optimal mechanisms for assigning input data and computation tasks across datacenters [39, 64, 78, 79], while the other adjusts application workloads to reduce the demand for inter-datacenter communication [38, 77]. While both lines of work have proven effective, we wish
to rethink the problem from the architectural perspective: can we redesign some other
key system components in a data analytics framework, so that the bottlenecked inter-
datacenter transfers are minimized?
In this dissertation, complementary to all existing proposals in the literature, we pro-
pose a series of system component redesigns, which are readily implementable
and deployable on existing data analytics frameworks, to optimize the effi-
ciency of wide-area data analytics. We have two major objectives: (1) improving
the job-level performance, i.e., minimizing job completion times; and (2) reducing the
amount of data transferred across datacenters. As a result, the monetary cost of running
a data analytics job across datacenters can be minimized.
Figure 1.1: An overview of our work in the architectural design of a general data analytics framework.
To achieve both objectives, we have redesigned and implemented several system com-
ponents with the awareness of datacenter boundaries, and we have integrated them into
each architectural layer. Fig. 1.1 shows an overview of our work in the architectural
design of a general data analytics framework. Existing work in the literature focuses on
resource scheduling and machine learning APIs. In comparison, we have redesigned
or improved three different yet important system components (depicted in dark blocks)
from the ground up. Our designs have the potential to work in conjunction with existing
ones for further efficiency improvements.
As the first part of this dissertation, we focus on expediting the delivery of inter-node
data flows. We are motivated by a simple observation: employing the same transport
for both inter- and intra-datacenter data transfers is not efficient. In Apache Spark [84],
for example, the out-of-the-box solution for inter-node data transfers is based on on-demand,
single TCP connections. It is effective enough within a single datacenter; however, the
performance of inter-datacenter data transfers is very likely to suffer from long-tail, un-
predictable flow completion times [38], because of the high-latency, high-jitter, and low-
capacity WAN links.
It is conceivable to decouple inter-datacenter from intra-datacenter data transfers
completely and schedule their delivery strategically for the sake of job-level performance.
Specifically, our goal is to collectively optimize the completion time of inter-datacenter
coflows [20,22], which is a flow group abstraction that directly connects the performance
of network transfers to that of a data analytics job. For example, all flows generated in a
shuffle phase can be abstracted as a coflow, as the next-stage computation cannot start
until the coflow is fully completed.
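As a toy illustration of why scheduling order matters for coflow completion times (CCTs), the sketch below models coflows served one at a time over a single shared inter-datacenter link. The sizes and bandwidth are made-up numbers, and the serial-service model is a deliberate simplification; the actual scheduler in this dissertation is far more sophisticated.

```python
def average_cct(coflow_sizes, bandwidth):
    """Serve coflows one at a time over a shared link and return the mean
    completion time. A coflow completes only when all its bytes are sent."""
    now, total = 0.0, 0.0
    for size in coflow_sizes:
        now += size / bandwidth  # this coflow occupies the link fully
        total += now             # its CCT is the cumulative finish time
    return total / len(coflow_sizes)

sizes = [900, 100, 300]  # hypothetical coflow sizes, in MB
bw = 100                 # hypothetical shared link bandwidth, in MB/s

fifo = average_cct(sizes, bw)          # arrival order
sjf = average_cct(sorted(sizes), bw)   # smallest coflow first
print(fifo, sjf)
```

Even in this crude model, serving the smallest coflow first cuts the average CCT substantially, which is the intuition behind size-aware inter-coflow scheduling.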
We propose a novel coflow scheduling algorithm tailored to minimizing the
inter-datacenter coflow completion time. To realize it across real datacenters in the
cloud, we have designed and implemented Siphon, an inter-datacenter overlay network
that can transparently aggregate and handle inter-datacenter data transfers.
Inspired by prior work on traffic engineering in software-defined WANs [37, 42], the
architectural design of Siphon follows the software-defined networking principle for the
sake of flexibility in routing and scheduling. Coflow scheduling and flow routing decisions
are made on a central controller, which maintains a global view of the inter-datacenter
overlay.
Siphon can expedite the inter-datacenter coflow transfers, by making optimal schedul-
ing decisions, given an existing collection of flows in the network. As a natural follow-up
question, can we optimize further by generating fewer competing inter-datacenter flows in wide-area data analytics?
This part of the responsibility is handled by the resource management layer in a data
analytics framework. Most existing work in the literature has focused on resource assignment and scheduling towards this goal. It optimizes the execution of a wide-area data analytics job by intelligently assigning individual tasks to datacenters, such that the overhead of moving data across datacenters is minimized. For example,
Geode [79], WANAnalytics [78] and Pixida [45] have proposed various task placement
strategies, reducing the volume of inter-datacenter traffic. Iridium [64] achieves shorter
job completion times by leveraging a redistributed input dataset, along with mechanisms
for making optimal task assignment decisions.
Despite their promising outlook, even the best resource scheduling strategies may not
achieve optimality due to the level of abstraction needed to solve the problem. As an
example, in Spark, resource schedulers can only operate at the granularity of computa-
tion tasks. Potential optimizations on the actual materialization of API calls, especially
shuffles that generate inter-datacenter data transfers directly, have been overlooked.
As the second part of our work, we take an alternative approach: we put the execution behavior of a single shuffle under the microscope, analyzing its pros and cons. It turns out
that the state-of-the-art fetch-based shuffle, where receivers of the shuffle traffic initiate
all flows at the same time, may under-utilize the inter-datacenter bandwidth. To solve
this problem, we propose a push-based shuffle mechanism in wide-area data
analytics, which improves the bandwidth utilization and eventually the job
performance by allowing early inter-datacenter transfers.
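The intuition can be captured with a back-of-the-envelope model. Assuming, for simplicity, that flows do not contend for bandwidth, a fetch-based shuffle delays every transfer until the last map task finishes, while a push-based shuffle lets each output leave as soon as its map task completes, overlapping transfers with remaining computation. The numbers below are hypothetical.

```python
def fetch_finish(map_finish_times, transfer_times):
    """Fetch-based shuffle: reducers initiate all flows only after the
    last map task has finished (a global barrier)."""
    barrier = max(map_finish_times)
    return max(barrier + t for t in transfer_times)

def push_finish(map_finish_times, transfer_times):
    """Push-based shuffle: each map task's output is pushed toward its
    destination as soon as that task finishes."""
    return max(m + t for m, t in zip(map_finish_times, transfer_times))

# Hypothetical map finish times (s) and per-output WAN transfer times (s).
maps = [2.0, 6.0, 10.0]
xfers = [8.0, 6.0, 2.0]

print(fetch_finish(maps, xfers))  # every flow waits for the slowest map
print(push_finish(maps, xfers))   # early outputs overlap with computation
```

In this example the push-based variant finishes the shuffle earlier precisely because the large transfer from the fastest mapper starts while the slowest mapper is still computing.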
Finally, we focus on graph analytics, an important and special category of big data
analytics applications. Graph analytics, machine learning, and SQL queries are among
the foundations of many big data analytics applications [82]. Popular data analytics
frameworks, e.g., Spark, usually support them as libraries that are built based on basic
data parallel APIs.
In particular, distributed graph analytics relies on a synchronization model to au-
tomatically convert itself into an iterative sequence of data parallel operations. Most
data analytics frameworks support graph analytics by adopting BSP [82], which relies on
heavy communications and synchronizations across worker nodes. Representative works
in the literature, e.g., Pregel [59], PowerGraph [29] and GraphX [30], are solely designed
and optimized for processing graphs within a single datacenter. Gemini [90], one of
the state-of-the-art solutions, even assumes a high-performance cluster with 100 Gbps of
bandwidth capacity between worker nodes. As a result, they are not well equipped to address the challenge of running across geographically distributed datacenters,
i.e., wide-area graph analytics.
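For concreteness, here is a minimal Python sketch of PageRank under a BSP-style superstep loop. This illustrates only the synchronization pattern that BSP imposes; it is not the HSP model proposed later, nor the actual implementation in any graph framework. The toy graph is hypothetical, and dangling vertices (with no out-edges) are assumed away.

```python
def pagerank_bsp(adj, iterations=30, d=0.85):
    """PageRank as BSP supersteps: in each iteration every vertex sends
    its rank share to all out-neighbors, then a global barrier follows
    before any vertex may begin the next superstep. Assumes every vertex
    has at least one out-edge."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        contrib = {v: 0.0 for v in adj}
        for v, outs in adj.items():
            share = rank[v] / len(outs)
            for u in outs:
                contrib[u] += share  # one message per out-edge
        # Global synchronization barrier: all ranks update at once.
        rank = {v: (1 - d) / n + d * contrib[v] for v in adj}
    return rank

# A toy 4-vertex graph. When vertices are partitioned across datacenters,
# these per-edge messages may cross WAN links in every single superstep.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}
ranks = pagerank_bsp(graph)
print(ranks)
```

Every superstep here ends in a global barrier, which is exactly what makes BSP traffic-heavy across datacenters: no message batching or deferral across the WAN is possible without relaxing the synchronization model.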
In wide-area data analytics, Gaia [38] and Clarinet [77] have demonstrated their
efficiency by tweaking the workflow of machine learning and SQL queries, respectively.
Inspired by their work, we intend to answer another question: can we tweak the workflow of graph analytics so that it can also run efficiently across geographically distributed
datacenters?
One possible solution is to allow asynchronous computation on different partitions of
the graph. Existing systems implementing such an asynchronous parallel model include
GraphUC [33] and Maiter [87]. However, neither system can guarantee the convergence
or the correctness of graph applications [82], which is unacceptable.
Inspired by Gaia [38] in machine learning, a promising approach is to partially re-
lax the strong synchronization among worker nodes. As the third part of our work, we
propose a new synchronization model for graph analytics, which focuses on
reducing the inter-datacenter bandwidth usage while having a strong conver-
gence guarantee in wide-area graph analytics.
1.1 Contributions
We first investigate how to expedite inter-datacenter coflow transfers in wide-area data
analytics. To this end, we design and implement Siphon, an inter-datacenter overlay net-
work that can be integrated with any existing data analytics framework (e.g., Apache
Spark). With Siphon, inter-datacenter coflows are discovered, routed, and scheduled au-
tomatically at runtime. Specifically, Siphon serves as a transport service that accelerates
and schedules the inter-datacenter traffic with the awareness of workload-level dependen-
cies and performance, while being utterly transparent to analytics applications. Novel
intra-coflow and inter-coflow scheduling and routing strategies have been designed and
implemented in Siphon, based on a software-defined networking architecture.
On our cloud-based testbeds, we have extensively evaluated Siphon’s performance in
accelerating coflows generated by a broad range of workloads. With a variety of Spark
jobs, Siphon can reduce the completion time of a single coflow by up to 76%. With
respect to the average coflow completion time, Siphon outperforms the state-of-the-art
scheme by 10%.
We proceed to optimize the runtime execution of shuffle, a phase in job execution
that triggers all data transfers in data analytics. We design a new proactive push-based
shuffle mechanism, and implement a prototype based on Apache Spark, with a focus
on minimizing the network traffic incurred in the shuffle stages of data analytics jobs. The
objective of this framework is to strategically and proactively aggregate the output data
of mapper tasks to a subset of worker datacenters, as a replacement for Spark’s original
passive fetch mechanism across datacenters. It improves the performance of wide-area
analytics jobs by avoiding repetitive data transfers, thereby improving the utilization of inter-datacenter links. Our extensive experimental results using standard benchmarks across
six Amazon EC2 regions have shown that our proposed framework is able to reduce job
completion times by up to 73%, as compared to the existing baseline implementation in
Spark.
Finally, we focus on wide-area graph analytics, an important set of iterative algorithms
that are commonly implemented as a library on data analytics frameworks. Existing
graph analytics frameworks are not designed to run well across multiple datacenters,
as they implement a Bulk Synchronous Parallel model that requires excessive wide-area
data transfers. To address this challenge, we propose a new Hierarchical Synchronous
Parallel (HSP) model designed and implemented for synchronization across datacenters
with a much-improved efficiency in inter-datacenter communication. Our new model re-
quires no modifications to graph analytics applications, yet guarantees their convergence
and correctness. Our prototype implementation on Apache Spark can achieve up to 32%
lower WAN bandwidth usage, 49% faster convergence, and 30% less total cost for bench-
mark graph algorithms, with input data stored across five geographically distributed
datacenters.
1.2 Organization
The remainder of this dissertation is organized as follows:
Chapter 2 introduces the background and reviews related literature.
Chapter 3 presents our work on expediting inter-datacenter coflow transfers. We pro-
pose a set of strategies that can efficiently route and schedule inter-datacenter
coflows. We also present the design, implementation, and evaluation of Siphon, a
software-defined inter-datacenter framework that can realize these strategies in data
analytics frameworks. This chapter is based on our work published in USENIX
ATC 2018 [53], in collaboration with Li Chen and Baochun Li, and our work published in the IEEE Journal on Selected Areas in Communications [55], in collaboration with Baochun Li.
Chapter 4 presents our work on optimizing shuffle in wide-area data analytics. We
present a detailed system study on the behavior of shuffles and propose a new
Push/Aggregate mechanism to minimize the job completion time. This chapter
is based on our work published in IEEE ICDCS 2017 [56], in collaboration with Hao
Wang and Baochun Li.
Chapter 5 presents our work on a Hierarchical Synchronous Parallel model for wide-
area graph analytics. This novel synchronization model generates less inter-datacenter
traffic when a graph algorithm is executed as an iterative workflow. This chapter
is based on our work published in IEEE INFOCOM 2018 [54], in collaboration with
Li Chen, Baochun Li, and Aiden Carnegie.
Chapter 6 concludes this dissertation, with a summary of our work and a discussion
on future directions.
Chapter 2
Background and Related Work
Since the advent of MapReduce [24], generations of data analytics frameworks have been
designed and continuously optimized to support the growing need for big data processing.
These frameworks expose their APIs as a set of well-defined parallel operations, allowing
developers to express a data-parallel workflow serially, without worrying about the
management details of the underlying distributed computing environment. As a result,
data analytics frameworks make data analytics applications more maintainable, more
scalable, and easier to develop.
In this chapter, we first introduce a few common terms used throughout this dissertation
before reviewing the related work in the literature. Our terms are largely consistent
with those of Apache Spark [84], as Spark is popular and the prototype implementations
of our work are all based on it. These terms are illustrated in Fig. 2.1.
In a deployed big data analytics framework, we usually refer to the collection of computing
resources it manages as a cluster. Clusters are typically organized in a master-slave
architecture. The master node of the cluster is responsible for tracking and maintaining
the status of all slaves, a.k.a. worker nodes or executors.
Developers define their workflow using a sequence of parallel API calls. They then
submit the compiled binary to the master to start parallel processing. The master takes
the binary, interprets the workflow, and starts it as a data analytics job. The job completion
time is one of the most important metrics for evaluating the performance of a data
[Figure: the master interprets a user-defined application into a job dependency graph of stages; each ongoing stage is parallelized into tasks, which are scheduled and assigned to workers in the cluster.]
Figure 2.1: Key terminologies and concepts in a data analytics framework.
analytics framework.
Depending on the defined workflow, a job can be further divided into multiple stages,
whose dependencies are expressed as a Directed Acyclic Graph (DAG). A stage can
start to run as soon as all of its dependencies are fulfilled, in the form of a set of parallel
tasks. Each task within a stage is executed on a worker node, running the same
piece of code on a different partition of the input data. Note that a stage is not considered
complete unless all of its tasks complete successfully. If a task takes longer to run
for some reason (e.g., a slow worker or network congestion), it becomes a straggler
and delays the completion of the entire stage. As a result, stage completion time is also a
valid metric in performance evaluation.
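To make the straggler effect concrete, the following sketch (with purely hypothetical task durations) shows how a single slow task dominates the stage completion time:

```python
# A stage completes only when its slowest task does, so one straggler
# dominates the stage completion time.
# Task durations (seconds) are hypothetical, for illustration only.
task_durations = [12.1, 11.8, 12.4, 11.9, 47.3]  # the last task is a straggler

stage_completion_time = max(task_durations)
mean_task_time = sum(task_durations) / len(task_durations)

print(f"mean task time:        {mean_task_time:.1f} s")   # 19.1 s
print(f"stage completion time: {stage_completion_time:.1f} s")  # 47.3 s
```

Even though four of the five tasks finish in about 12 seconds, the stage takes 47.3 seconds to complete.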
In Spark, the input datasets are split into partitions that can be processed in parallel.
Logical computation is organized into consecutive map() and reduce() stages:
map() operates on each individual partition to filter or sort, while reduce() reorganizes
and collects the summaries of map() intermediate results. An all-to-all communication
pattern is triggered between mappers and reducers, which is called a shuffle phase.
These intermediate data shuffles are well known as costly operations in data analytics
jobs, since they incur intensive traffic across worker nodes.
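As an illustration only, not Spark's actual implementation, the shuffle pattern can be sketched in pure Python as a hash-partitioned word count, where every mapper contributes data to every reducer:

```python
from collections import defaultdict

def map_phase(partition):
    """Emit (word, 1) pairs from one input partition (a toy word count)."""
    return [(word, 1) for record in partition for word in record.split()]

def shuffle(map_outputs, num_reducers):
    """Hash-partition every mapper's output by key: each mapper sends
    data to every reducer, i.e., an all-to-all communication pattern."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for mapper_output in map_outputs:
        for key, value in mapper_output:
            buckets[hash(key) % num_reducers][key].append(value)
    return buckets

def reduce_phase(bucket):
    """Aggregate the values collected for each key."""
    return {key: sum(values) for key, values in bucket.items()}

partitions = [["a b a"], ["b c"], ["a c c"]]
map_outputs = [map_phase(p) for p in partitions]
results = [reduce_phase(b) for b in shuffle(map_outputs, num_reducers=2)]
merged = {k: v for r in results for k, v in r.items()}
print(merged)  # word counts: a=3, b=2, c=3
```

When mappers and reducers live in different datacenters, every arrow in this all-to-all pattern becomes an inter-datacenter flow.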
2.1 Wide-Area Data Analytics
Running a data analytics job whose input data originates from multiple geographically
distributed datacenters is commonly known as wide area data analytics in the research
literature.
Inter-datacenter networks are critical resources in wide-area data analytics. Even though they
may consist of dedicated optical links [42], preliminary measurements suggest that they are
heavily shared [27, 38, 57]. Since wide-area network links easily become the performance
bottleneck, existing works strive to reduce the usage of inter-datacenter bandwidth.
Geode [79], WANalytics [78] and Pixida [45] propose task placement strategies aiming
at reducing the total volume of traffic among datacenters. Iridium [64] and Flutter [39],
on the other hand, argue that less cross-datacenter traffic does not necessarily result in
a shorter job completion time. Thus, they propose online heuristics that make joint
decisions on both input data migration and task placement across datacenters.
Heintz et al. [35] propose an algorithm to produce an entire job execution plan,
including both data and task placement. However, their model requires substantial prior
knowledge, such as intermediate data sizes, which makes it far from practical. Hung et al. [41]
propose a greedy scheduling heuristic to schedule multiple concurrent analytic jobs, with
the objective of reducing average job completion time.
However, all these efforts focus on adding wide-area network awareness to the compu-
tation framework, without tackling the lower-level inter-datacenter data transfers directly.
Our work is orthogonal and complementary to these efforts.
2.2 Network Optimization for Data Analytics
A variety of flow scheduling algorithms (e.g., [36, 76, 80, 85]) have been proposed to improve
flow completion times and meet deadlines in datacenter networks. They focus on the
average behavior of independent flows, and are thus oblivious to job-level performance.
Considering the impact of flow completion on job-level performance, coflow scheduling
algorithms (e.g., [20, 22, 25]) have been proposed to minimize the average coflow completion
time within a datacenter network, which is assumed to be free of congestion. Without
such assumptions, joint coflow scheduling and routing strategies [52, 88] have been proposed
for datacenter networks in which both the core and the edge are congested.
Different from these models, the wide-area network has a congested core and a
congestion-free edge, since inter-datacenter links have much lower bandwidth than
the access links of each datacenter. Apart from the different network model, our proposed
coflow scheduling (Chapter 3.2) handles the uncertainty of fluctuating bandwidth
in the wide area, while existing efforts assume that bandwidth capacities remain
unchanged. Also, previous efforts are limited to performance evaluations with
simulations or emulations. In contrast, we have implemented our practical yet effective
scheduling algorithms in Chapter 3 and demonstrated their performance with real-world
deployments and experiments.
2.3 Software-Defined Networking
Software-Defined Networking (SDN) is a rising star in network management. Starting from
a campus network [61], it has developed into a promising technology for next-generation
networks [46]. The principle of software-defined networking is to decouple the
packet forwarding intelligence from the hardware. The system design of Siphon (Chapter
3) follows this principle, applying it in an inter-datacenter overlay network.
OpenFlow [4] is the first standardized protocol designed for communications between
the controller and the data plane. Static rules are installed and cached on the data plane
switches, to perform longest prefix matches at runtime.
Despite the fine granularity of control, the OpenFlow-based control plane suffers from
scalability problems. Some workaround solutions attempt to offload the controller by
applying label-switching technology [6, 10, 11, 18] or a distributed control plane [34, 62].
However, they are still far from effective in dealing with production traffic [15].
Reconfigurable dataplanes have emerged as the future of OpenFlow 2.0 [15]. With
new advancements in hardware, switches today are able to process packets at line rate
using reconfigurable logic instead of static rules [71]. Following this direction,
network programming primitives (e.g., FAST [63], P4 [15], Probabilistic NetKAT [28],
Domino [70]) and compilers (e.g., SNAP [44]) are proposed.
The interactions between the Siphon controller and aggregators (Chapter 3.4) are
greatly inspired by this line of work. However, due to hardware constraints, sophisticated
algorithms are difficult to implement with the limited number of supported operations [69]. By
taking advantage of an overlay data plane, which is more easily programmable and reconfigurable,
the controller-data plane interaction scheme in Siphon is designed to be more flexible. It
adapts better to the inter-datacenter overlay environment, eliminating the unnecessary
complexity incurred by processing multi-layer packet headers.
2.4 Optimizing Shuffle in Data Analytics
When it comes to running data analytics jobs within a datacenter, there exist several
proposals on optimizing shuffle input data placement. iShuffle [31] proposes to shuffle
before the reducer tasks are launched in Hadoop: shards of the shuffle input are pushed
to the predicted reducers during shuffle write, a "shuffle-on-write"
service. However, it is not practical to predict reducer placement beforehand, especially
in Spark. MapReduce Online [23] also proposes a push-based shuffle mechanism,
in order to optimize performance under continuous queries. Unfortunately, general
analytics jobs that are submitted randomly will not benefit.
As a comparison, our work in Chapter 4 optimizes shuffles that involve inter-datacenter
transfers in a general data analytics framework. We make no assumptions about prior
knowledge of the workload. We focus on generic operations that trigger shuffles at
runtime. Further, our solution cooperates with the task scheduling module in the
data analytics framework, rather than predicting and then overriding its decision
making, which would violate design principles.
2.5 Distributed Graph Analytics
A variety of distributed graph analytics systems, most of which implement the BSP
model, have been proposed in the literature; representatives include [59, 90]. These
systems focus on computing environments within a high-performance cluster, where
bandwidth between worker nodes is abundant. In contrast, Chapter 5 proposes a
new synchronization model that can be integrated seamlessly with these systems, serving
as an alternative to BSP when running across multiple datacenters.
Algorithm-level optimizations such as [68] can certainly reduce the required inter-
datacenter traffic. They can be applied to specific categories of graph algorithms, and
are orthogonal to optimizations on the system or the synchronization model.
As closely related work, asynchronous (GraphLab [58], PowerGraph [29]) or partially
asynchronous (GraphUC [33], Maiter [87]) synchronization models can potentially reduce
the need for inter-datacenter communication, but they cannot always guarantee algorithm
convergence [82]. The guarantees in our work (Chapter 5) are rooted in strategic
switching between synchronous (global) and asynchronous (local) modes. This limited
extent of asynchrony hits a sweet spot, with both minimal inter-datacenter traffic
and convergence guarantees.
Chapter 3
Siphon: Expediting Inter-Datacenter Coflows
Despite the particular traffic patterns generated by wide-area data analytics, improving
performance by directly accelerating the completion of inter-datacenter data transfers has been
largely neglected. The literature has either attempted to design optimal mechanisms for
assigning input data and computation tasks across datacenters [39, 64, 78, 79], or tried to
adjust the application workloads to reduce demands on inter-datacenter communication
[38, 77]. On the other hand, inter-datacenter traffic engineering can only be
performed by cloud service providers at a flow group level [37, 42]. Cloud tenants have
no control over their generated inter-datacenter traffic. This inability is likely to lead to
sub-optimal utilization of inter-datacenter bandwidth.
To fill this gap, we propose a deliberate design of a fast delivery service for data
transfers across datacenters, with the goal of improving application performance from
an orthogonal and complementary perspective to the existing efforts. Moreover, it has
been observed that an application cannot proceed until all its flows complete [21], which
indicates that its performance is determined by the collective behavior of all these flows,
rather than any individual one. We incorporate the awareness of this important application
semantic, abstracted as coflows [22], into our design, to better satisfy application
requirements and further improve application-level performance.
Existing efforts have investigated the scheduling of coflows within a single datacenter
[20, 22, 52, 86], where the network is assumed to be congestion-free and abstracted as a
giant switch. Unfortunately, such an assumption no longer holds in the inter-datacenter
WANs, yet the requirement for optimal coflow scheduling to improve application perfor-
mance becomes even more critical.
In this chapter, given the observation that data analytics frameworks have complete
knowledge of all generated network flows, we propose three strategies that can signifi-
cantly reduce the coflow completion time, which translates directly into a shorter job
completion time.
First, we have designed a novel and practical inter-coflow scheduling algorithm to
minimize the average coflow completion time, despite the unpredictable available band-
width in wide-area networks. The algorithm is based on Monte Carlo simulations to
handle the uncertainty, with several optimizations to ensure its timely completion and
enforcement.
Second, we have proposed a simple yet effective intra-coflow scheduling policy. It tries
to prioritize a subset of flows such that the potential straggler tasks can be accelerated.
Finally, we have designed a greedy multi-path routing algorithm, which detours a
subset of the traffic on a bottlenecked link to an alternate idle path, such that the
slowest flow in a shuffle can be finished earlier.
Further, to enforce these scheduling and routing strategies, we have designed and
implemented Siphon, a new building block for data analytics frameworks that is designed
to provide a transparent and unified platform to expedite inter-datacenter coflows.
From the perspective of big data analytics frameworks, Siphon decouples inter-datacenter
transfers from intra-datacenter traffic, serving as a transport with full coflow awareness.
It can be easily integrated into existing frameworks with minimal changes in source code,
while being completely transparent to the analytics applications atop. We have integrated
Siphon into Apache Spark [84].
The aforementioned coflow scheduling strategies become feasible because Siphon’s ar-
chitectural design follows the software-defined networking principle. The network control
plane, which makes control decisions such as routing and scheduling, is logically central-
ized and decoupled from the network data plane, which deals with the actual flows.
For the datapath, Siphon employs aggregator daemons on all (or a subset of) work-
ers, forming a virtual overlay network atop the inter-datacenter WAN, aggregating and
forwarding inter-datacenter traffic efficiently. At the same time, a controller can make
centralized routing and scheduling decisions on the aggregated traffic and enforce them
on aggregators. Also, the controller can work closely with the resource scheduler of the
data parallel framework, to maintain a global and up-to-date knowledge about ongoing
inter-datacenter coflows at runtime.
Because of the size and scale of inter-datacenter WANs, it takes a noticeable amount
of time for the controller to update a rule on aggregators. This latency is sometimes
significant, in that routing and scheduling decisions cannot take effect immediately. To
address this problem, Siphon implements a novel approach for data plane-controller in-
teractions.
Unlike traditional link-layer software-defined networking [61], the aggregators, which
act like switches, take advantage of their software flexibility. Rather than caching static
control rules in a flow table, an aggregator can cache dynamic control logic that is
installed by the controller. In particular, we can define algorithms using a concise
set of Javascript APIs, which are later interpreted by a light-weight interpreter at
each aggregator within the data plane. When new flows arrive, an aggregator can make
routing and scheduling decisions based on locally cached control logic, without requiring
controller intervention. As a result, the volume of interaction with the controller is
minimized. This mechanism greatly improves efficiency in Siphon, where network latency
makes a significant difference.
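The idea can be sketched as follows. This is a simplified Python stand-in (in Siphon the control logic is expressed in Javascript); the Aggregator class and the size-based routing rule are hypothetical, for illustration only:

```python
class Aggregator:
    """A minimal stand-in for a Siphon aggregator: it caches control
    logic (a function) installed by the controller, instead of static
    per-flow rules, and evaluates it locally for each new flow."""

    def __init__(self):
        self._route_logic = None

    def install_logic(self, fn):
        # One controller round-trip installs the *logic*; afterwards,
        # routing decisions require no controller intervention.
        self._route_logic = fn

    def on_new_flow(self, flow):
        return self._route_logic(flow)

# Hypothetical control logic: detour large flows to an alternate path.
def route(flow):
    return "alternate-path" if flow["size_mb"] > 100 else "direct-path"

agg = Aggregator()
agg.install_logic(route)
print(agg.on_new_flow({"size_mb": 500}))  # alternate-path
print(agg.on_new_flow({"size_mb": 10}))   # direct-path
```

The key point is that only the (infrequent) logic installation crosses the wide area; per-flow decisions are made locally at the aggregator.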
We have evaluated our proposed coflow scheduling strategies with Siphon. Across
five geographical regions on Google Cloud, we have evaluated the performance of Siphon
from a variety of aspects, and the effectiveness of intra-coflow scheduling in accelerating
several real Spark jobs. Our experimental results have shown an up to 76% reduction
in the shuffle read time. Further experiments with the Facebook coflow benchmark [22]
have shown a ∼10% reduction in the average coflow completion time, as compared to
state-of-the-art schemes.
This chapter is organized as follows: We first motivate inter-datacenter coflow schedul-
ing and analyze its practical challenges (Chapter 3.1). Then, in Chapter 3.2, we propose
a novel inter-datacenter coflow scheduling algorithm based on the idea of Monte Carlo
simulation to tackle these challenges. To realize it, we implement Siphon (Chapter 3.3), a
high-performance inter-datacenter overlay that provides the flexibility of inter-datacenter
traffic engineering. Chapter 3.4 describes the architectural optimization made in Siphon’s
software-defined networking design. The final chapters present extensive system evalua-
tions in a real cloud environment.
3.1 Motivation and Background
In modern big data analytics, the network stack traditionally serves to deliver individual
flows in a timely fashion [7, 8, 85], while being oblivious to the application workload.
Recent work argues that, by leveraging workload-level knowledge of flow interdepen-
dence, the proper scheduling of coflows can improve the performance of applications in
datacenter networks [22].
As an application is deployed at an inter-datacenter scale, the network is more likely
to be a system bottleneck [64]. Existing efforts in wide-area data analytics [38, 64, 77]
all seek to avoid this bottleneck, rather than mitigating it. Therefore, it is necessary to
enforce a systematic way of scheduling inter-datacenter coflows for better link utilization,
given the fact that the timely completion of coflows can play an even more significant
role in application performance.
            Oregon  Carolina  Tokyo  Belgium  Taiwan
Oregon        3000       236    250    152.0     194
Carolina       237      3000   83.8      251    45.1
Tokyo         83.8      81.7   3000     89.2     586
Belgium        249       242   86.6     3000    76.0
Taiwan         182      35.8    508     68.0    3000

Table 3.1: Peak TCP throughput (Mbps) achieved across different regions on the Google Cloud Platform.
As an example, suppose we have two coflows, A and B, sharing the same inter-
datacenter link. Without a proper scheduling mechanism, A and B will have to
fair-share the available bandwidth, leading to the late completion of both coflows. However,
with a simple preemptive schedule that prioritizes A, A can complete significantly
faster while B still completes at exactly the same time.
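The arithmetic behind this example can be checked with a quick sketch (the coflow sizes and link bandwidth are hypothetical):

```python
# Two coflows, A and B, each with 1000 Mb remaining, share a 100 Mbps
# inter-datacenter link. All numbers are hypothetical.
capacity = 100.0                      # Mbps
size = {"A": 1000.0, "B": 1000.0}     # Mb

# Fair sharing: each coflow gets half the link, so with equal sizes
# both finish only when all the data has been sent.
fair = {c: sum(size.values()) / capacity for c in size}

# Preemptive priority (A first): A gets the full link, then B does.
prio = {"A": size["A"] / capacity,
        "B": (size["A"] + size["B"]) / capacity}

print(fair)  # {'A': 20.0, 'B': 20.0} -> average CCT 20 s
print(prio)  # {'A': 10.0, 'B': 20.0} -> average CCT 15 s
```

Prioritizing A halves its completion time without delaying B at all, cutting the average coflow completion time from 20 s to 15 s.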
First, inter-datacenter networks have a different network model. Networks are usually
modeled as a big switch [22] or a fat tree [67] in the recent coflow scheduling literature,
where the ingress and egress ports at the workers are identified as the bottleneck. This is
no longer true in wide-area data analytics, as the available bandwidth on inter-datacenter
links is orders of magnitude lower than the edge capacity (see Table 3.1, measured with iperf3
in TCP mode on standard 2-core instances; rows and columns represent source and
destination datacenters, respectively; these statistics match the reports in [38]).
Second, the available inter-datacenter bandwidth fluctuates over time. Unlike in
datacenter networks, the completion time of a given flow is hardly predictable,
which makes the effectiveness of existing deterministic scheduling strategies (e.g., [22, 86])
questionable. The reason is easy to understand: though the aggregate link bandwidth
between a pair of datacenters might be abundant, it is shared among a multitude of users
and their applications, with varied, unsynchronized and unpredictable networking
patterns.
Third, our ability to properly schedule and route inter-datacenter flows is limited.
We may gain full control via software-defined networking within a datacenter [88], but
such a technology is not readily available in inter-datacenter WANs. Flows through
[Figure] Figure 3.1: An example with two coflows, A and B, being sent through two inter-datacenter links. Based on link bandwidth measurements and flow sizes, the duration distributions of four flows are depicted with box plots. Note that the expected durations of A1 and B2 are the same.
inter-datacenter links are typically delivered with best effort on direct paths, without the
intervention of application developers.
To summarize, wide-area data analytics calls for redesigned coflow scheduling and routing
strategies, as well as a new platform to realize them in existing data analytics
frameworks. In this chapter, Siphon is designed from the ground up for this purpose.
It is an application-layer, pluggable building block that is readily deployable. It can
support a better WAN transport mechanism and transparently enforce a flexible set of
coflow scheduling disciplines, by closely interacting with the data parallel frameworks.
A Spark job with tasks across multiple datacenters, for example, can take advantage of
Siphon to improve its performance by reducing its inter-datacenter coflow completion
times.
3.2 Scheduling Inter-Datacenter Coflows
3.2.1 Inter-Coflow Scheduling
Inter-coflow scheduling is the primary focus of the literature [20,22,72,86]. In this section,
we first analyze the practical network model of wide-area data analytics. Based on the
new observations, we propose the details of a Monte Carlo simulation-based scheduling
algorithm.
♦ Design Objectives
Similar to [22, 88], we assume complete knowledge of ongoing coflows, i.e., the source,
the destination and the size of each flow are known as soon as the coflow arrives. Despite
recent work [20, 86] that deals with zero or partial prior knowledge, we argue that this
assumption is practical in modern data parallel frameworks: the task scheduler is fully
aware of the potential cross-worker traffic before launching the tasks in the next stage
and triggering the communication stage [1, 21, 84]. We will elaborate further on its
feasibility in Sec. 3.3.4.
Our major objective is to minimize the average coflow completion time, in alignment
with the existing literature. However, we focus on inter-datacenter coflows, which are
constrained by a different network model. In particular, based on the measurements in
Table 3.1, we conclude that inter-datacenter links are the only bottlenecked resources,
and congestion can hardly happen at the ingress or egress ports. For convenience, we call
this a dumbbell network structure.
Compared to coflow scheduling within a single datacenter, where the network is typically
modeled as a giant, congestion-free switch [20, 22], inter-datacenter coflow scheduling
is different in two ways.
The good news is that, because the number of inter-datacenter links is orders of magnitude
lower than the number of available paths within a single datacenter, the problem complexity
is reduced significantly as well. Though the decision space of scheduling a given set of coflows
remains the same, the cost of calculating the coflow completion times under a specific
scheduling order is orders of magnitude lower.
The bad news, however, is that inter-datacenter bandwidth is a dynamic resource.
Scheduling across coflows must take runtime variations into account, making scheduling
decisions that have a higher probability of completing coflows faster. This requirement, on
the contrary, complicates the problem. It is the challenge we focus on in this chapter.
♦ Schedule with Bandwidth Uncertainty
Coflow scheduling in the big switch network model has been proven to be NP-hard, as it can
be reduced to an instance of the concurrent open shop scheduling with coupled resources
problem [22]. With a dumbbell network structure, as contention is removed from the
edge, each inter-datacenter link can be considered an independent resource that services
the coflows (jobs). Therefore, it makes sense to perform fully preemptive coflow
scheduling, as resource sharing always results in an increased average completion time [36].
The problem may seem simpler with this network model. However, it is the shared
nature of inter-datacenter links that complicates the scheduling. The real challenge is that,
being shared among a multitude of unknown users, the available bandwidth on a given link is
not predictable. In fact, the available bandwidth is a random variable whose distribution
can be inferred from historical measurements. Thus, the flow durations are also random
variables. The coflow scheduling problem in wide-area data analytics can be reduced to
the independent probabilistic job shop scheduling problem [13], which is also NP-hard.
We seek a heuristic algorithm to solve this online scheduling problem. An intuitive
approach is to make an estimation of the flow completion times, e.g., based on the
expectation of recent measurements, such that we can solve the problem by adopting a
deterministic scheduling policy such as Minimum-Remaining Time First (MRTF) [22,86].
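As a sketch of this deterministic baseline (a simplification, not the exact algorithm of [22, 86]), MRTF can be approximated by ordering coflows by their expected remaining time under a point estimate of the available bandwidth; all names, sizes and bandwidths below are hypothetical:

```python
def mrtf_order(coflows, expected_bandwidth):
    """Deterministic baseline: order coflows by their expected remaining
    time, computed from a single point estimate of available bandwidth.
    This ignores the *distribution* of bandwidth, which is the weakness
    the Monte Carlo approach addresses."""
    def remaining_time(coflow):
        # A coflow finishes only when its slowest flow does.
        return max(size / expected_bandwidth[link]
                   for link, size in coflow["flows"].items())
    return sorted(coflows, key=remaining_time)

coflows = [  # hypothetical flow sizes (Mb) per inter-datacenter link
    {"name": "A", "flows": {"dc1-dc2": 800.0}},
    {"name": "B", "flows": {"dc1-dc2": 200.0, "dc3-dc2": 600.0}},
]
bw = {"dc1-dc2": 100.0, "dc3-dc2": 50.0}  # expected Mbps per link
print([c["name"] for c in mrtf_order(coflows, bw)])  # ['A', 'B']
```

Here A's expected remaining time is 8 s while B's is 12 s (bottlenecked on the slower link), so A is scheduled first.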
Unfortunately, this naive approach fails to model the probabilistic distribution of flow
durations. Fig. 3.1 shows a simple example in which deterministic scheduling does not
work. In this example, the available bandwidth on Links 1 and 2 have distinct distributions,
because the users sharing each link have distinct networking behaviors. With Coflows A and
B competing, the box plots depict the skewed distributions of flow durations if the
corresponding coflow gets all the available bandwidth.
With a naive, deterministic approach that considers only the average, scheduling either
A or B first will result in the same average coflow completion time. However, it is easy to
observe that, with a higher probability, the duration of flow A1 will be shorter than
[Figure] Figure 3.2: The complete execution graph of Monte Carlo simulation, given 3 ongoing coflows, A, B and C. The coflow scheduling order is determined by the distributions at the end of all branches.
B2. Thus, prioritizing Coflow A over B should yield an optimal schedule.
In conclusion, inter-datacenter coflow scheduling decisions should consider the distribution
of available bandwidth, which may be estimated from weighted samples of
historical measurements.
♦ Monte Carlo Simulation-based Scheduling
To incorporate such uncertainty, we propose an online Monte Carlo simulation-based
inter-coflow scheduling algorithm, which is greatly inspired by the offline algorithm pro-
posed in [13].
The basic idea of Monte Carlo simulation is simple and intuitive: For every candidate
scheduling order, we repeatedly simulate its execution and calculate its cost, i.e., the
simulated average coflow completion time. With enough rounds of simulations, the cost
distribution will approximate the actual distribution of average coflow completion time.
Based on this simulated cost distribution, we can choose among all candidate scheduling
orders at a certain confidence level.
As an example, Fig. 3.2 illustrates an algorithm execution graph with 3 ongoing
coflows. There are 6 potential scheduling orders, corresponding to the 6 branches in the
graph. To perform one round of simulation, the scheduler generates a flow duration for
each node in the graph, by randomly drawing from their estimated distributions.
By summing up the cost for each branch, the round yields a best scheduling decision instance,
which results in a counter increment. After many rounds, the best scheduling order
will converge to the branch with the maximum counter value.
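The core loop can be sketched in Python as follows. This is a simplified, single-link illustration with hypothetical duration distributions (the real algorithm simulates per-link bandwidth and adds the optimizations discussed in this section), mirroring Fig. 3.1: A and B have the same expected duration, but A's distribution is skewed shorter, so scheduling A first wins most simulated rounds:

```python
import itertools
import random
from collections import Counter

def monte_carlo_schedule(coflows, rounds=2000, seed=1):
    """Each round, draw one duration per coflow from its distribution,
    evaluate the average CCT of every candidate order under that draw,
    and credit the winning order. After many rounds, pick the order
    that won most often (cf. the execution graph in Fig. 3.2)."""
    rng = random.Random(seed)
    wins = Counter()
    names = sorted(coflows)
    for _ in range(rounds):
        durations = {n: coflows[n](rng) for n in names}
        def avg_cct(order):
            t, total = 0.0, 0.0
            for n in order:          # coflows run one at a time
                t += durations[n]
                total += t
            return total / len(order)
        best = min(itertools.permutations(names), key=avg_cct)
        wins[best] += 1
    return max(wins, key=wins.get)

# Hypothetical duration distributions (seconds). A and B have the same
# *mean* duration, but A's is skewed shorter, as in Fig. 3.1.
coflows = {
    "A": lambda rng: rng.choice([1.0, 1.0, 1.0, 5.0]),  # mean 2.0
    "B": lambda rng: 2.0,                               # constant 2.0
}
print(monte_carlo_schedule(coflows))  # ('A', 'B') wins most rounds
```

A deterministic, expectation-based scheduler sees the two orders as equivalent, while the simulation correctly learns that scheduling A first is better about 75% of the time.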
One major concern of this algorithm is its high complexity. In fact, it works just like
a brute-force, exhaustive search in each round of simulation. With n ongoing coflows,
there will be up to n! branches in the graph of simulation.
We argue that given a much smaller number of links to simulate, the algorithm
complexity is acceptable, especially with the following techniques to limit the simulation
search space.
Bounded search depth. In online coflow scheduling, all we care about is the coflow
that should be scheduled next. This property makes a full simulation towards all leaf
nodes unnecessary. Therefore, we set an upper bound, d, on the search depth, and
simulate the rest of the branches using the MRTF heuristic and the median flow durations. This
way, the search space is limited to a polynomial size of Θ(n^d). In our implementation, we
heuristically set d = 3, for an empirical balance between complexity and precision.
Early termination. Some "bad" scheduling decisions can be identified easily. For example,
scheduling an elephant coflow first will always result in a longer average. Based on
this observation, after several rounds of full simulation, we cut down the branches whose
performance is always significantly worse. This technique limits the search breadth, resulting
in O(nd) complexity.
Online incremental simulation. As an online simulation, the scheduling algorithm
should quickly react to recent events, such as coflow arrivals and completions. Whenever
a new event comes, the previous job execution graph will be updated accordingly, by
pruning or creating branches. Luckily, the existing useful simulation results (or partial
results) can be preserved to avoid repetitive computation.
Simulation timeout. Since Monte Carlo simulation can be terminated at any time,
we have the flexibility to bound the execution time for each round of simulation, at the
cost of a less reliable simulation result. If a timeout happens, we can simply launch more
[Figure] Figure 3.3: Network flows across datacenters in the shuffle phase of a simple job.
parallel processes to run the Monte Carlo simulation to speed it up.
These optimizations are inspired by similar techniques adopted in Monte Carlo Tree
Search (MCTS), but our algorithm differs from MCTS conceptually. In every simulation,
MCTS tends to reach the leaves of a single branch in the decision tree, where the outcome
can be revealed. In comparison, our algorithm has to go through all branches at a certain
depth; otherwise, we cannot determine the optimal schedule for the particular instance
of available bandwidth.
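To make the bounded-depth search concrete, the following Python sketch (an illustrative stand-in, not the actual Siphon implementation) evaluates candidate "next" coflows on a single bottleneck link: orderings are enumerated exhaustively only up to depth d, and each remaining branch is finished with a smallest-remaining-size heuristic standing in for MRTF. The coflow-size model, trial count, and bandwidth sampler are all assumptions made for illustration.

```python
from itertools import permutations

def simulate_order(sizes_in_order, bandwidth):
    """Average coflow completion time when coflows drain a single
    bottleneck link sequentially."""
    t, total = 0.0, 0.0
    for size in sizes_in_order:
        t += size / bandwidth
        total += t
    return total / len(sizes_in_order)

def choose_next_coflow(sizes, depth=3, trials=100, bw_sampler=lambda: 1.0):
    """Pick the coflow to schedule next via bounded-depth Monte Carlo
    simulation. Orderings are enumerated exhaustively only up to `depth`;
    the tail of each branch is completed heuristically (smallest size
    first, approximating MRTF). One bandwidth sample per trial stands in
    for the bandwidth uncertainty the simulation averages over."""
    best_id, best_cost = None, float("inf")
    for first in range(len(sizes)):
        rest = [i for i in range(len(sizes)) if i != first]
        cost = 0.0
        for _ in range(trials):
            bw = bw_sampler()
            branch_costs = []
            # enumerate prefixes up to the bounded depth, finish with the heuristic
            for prefix in permutations(rest, min(depth - 1, len(rest))):
                tail = sorted(set(rest) - set(prefix), key=lambda i: sizes[i])
                order = [first, *prefix, *tail]
                branch_costs.append(
                    simulate_order([sizes[i] for i in order], bw))
            cost += min(branch_costs)
        cost /= trials
        if cost < best_cost:
            best_id, best_cost = first, cost
    return best_id
```

With deterministic bandwidth and a single link, the sketch reduces to shortest-first scheduling, which is what the MRTF rollout would also recommend.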
♦ Scalability
In wide-area data analytics, a centralized Monte Carlo simulation-based scheduling al-
gorithm may be questioned with respect to its scalability, as making and enforcing a
scheduling decision may incur seconds of delay.
Fortunately, we can exploit the parallelism and staleness tolerance of our algorithm. The beauty
of Monte Carlo simulation is that, by nature, the algorithm is embarrassingly parallel
and tolerant of stale synchronization. Thus, we can potentially scale out the
implementation to a great number of scheduler instances placed in all worker datacenters,
to minimize the running time of the scheduling algorithm and the propagation delays in
enforcing scheduling decisions.
3.2.2 Intra-Coflow Scheduling
To schedule flows belonging to the same coflow, we have designed a preemptive scheduling
policy to help flows share the limited link bandwidth efficiently. Our scheduling policy
Figure 3.4: Job timeline with LFGF scheduling.

Figure 3.5: Job timeline with naive scheduling.
is called Largest Flow Group First (LFGF), whose goal is to minimize job completion
times. A Flow Group is defined as a group of all the flows that are destined to the same
reduce task. The size of a flow group is the total size of all the flows within, representing
the total amount of data received in the shuffle phase by the corresponding reduce task.
As suggested by its name, LFGF preemptively prioritizes the flow group of the largest
size.
The rationale behind LFGF is to coordinate the scheduling order of flow groups so that
the task requiring more computation can start earlier, by receiving its flows earlier.
Here we assume that a task's execution time is proportional to the total amount of data
it receives for processing, which is an intuitive assumption given no prior knowledge about
the job.
As an example, we consider a simple Spark job that consists of two reduce tasks
launched in datacenter 2, both of which fetch data from two mappers in datacenter 1
and one mapper in datacenter 3, as shown in Fig. 3.3. Corresponding to the two reducers
R1 and R2, two flow groups share both inter-datacenter links, with sizes of 200
MB and 150 MB, respectively. For simplicity, we assume the two links have the same
bandwidth, and that the computation time per unit of data equals its network transfer
time.
With LFGF, Flow Group 1, corresponding to R1, has a larger size and thus will be
scheduled first. As is illustrated in Fig. 3.4, the two flows (M1-R1, M2-R1) in Flow Group
1 are scheduled first through the link between datacenter 1 and 2. The same applies to
another flow (M3-R1) of Flow Group 1 on the link between datacenter 3 and 2. When
Flow Group 1 completes at time 3, i.e., all its flows complete, R1 starts processing the
200 MB data received, and finishes within 4 time units. The other reduce task R2 starts
at time 5, processes the 150 MB data with 3 units of time, and completes at time 8,
which becomes the job completion time.
If the scheduling order is reversed as shown in Fig. 3.5, Flow Group 2 will complete
first, and thus R2 finishes at time 5. Although R1 starts at the same time as R2 in
Fig. 3.4, its execution time is longer due to its larger flow group size, which results in
a longer job completion time. This example intuitively justifies the essence of LFGF —
for a task that takes longer to finish, it is better to start it earlier by scheduling its flow
group earlier.
The proposed scheduling algorithm can be easily implemented in the controller, whose
execution is triggered by the shuffle flows reported by the TaskScheduler. In particular,
for each reduce task, we first calculate the total amount of data sent by all the flows
destined to it, and then assign priorities to all the tasks accordingly. This way, all the
flows destined to a task have the same priority, and flows in a task with more shuffle
data will have a higher priority to be scheduled. The priority number associated with
each flow will be conveyed to each aggregator the flow traverses, where a rule (flowId:
nextHop, priority) is installed in its forwarding table. The priority queues in these
aggregators will enforce the scheduling given the priorities obtained from the installed
rules, to be elaborated in Sec. 3.3.2.
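The priority assignment described above can be sketched as follows; the flow tuple format and the function name are illustrative, not the actual controller code.

```python
from collections import defaultdict

def lfgf_priorities(flows):
    """Assign LFGF priorities to shuffle flows.

    `flows` is a list of (flow_id, reduce_task, size) tuples. All flows
    destined to the same reduce task form a flow group and share one
    priority; the largest group gets priority 0, i.e., it is scheduled
    first.
    """
    group_size = defaultdict(int)
    for _, task, size in flows:
        group_size[task] += size
    # rank reduce tasks by total group size, largest first
    ranked = sorted(group_size, key=group_size.get, reverse=True)
    task_prio = {task: rank for rank, task in enumerate(ranked)}
    return {fid: task_prio[task] for fid, task, _ in flows}
```

On the example of Fig. 3.3, R1's 200 MB group outranks R2's 150 MB group, so all three flows to R1 receive the higher priority.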
3.2.3 Multi-Path Routing
Beyond ordering the coflows, optimizing route selection for individual flows is also made
feasible with Siphon. Aggregators, being deployed in geographically distributed data-
centers, can relay flows among datacenters efficiently. Provided that the topology of
inter-datacenter networks is publicly available [42], multi-path routing is an attractive
means of reducing coflow completion time.
Existing solutions, including RAPIER [88] and the work of Li et al. [52], exploit multiple
equal-cost paths within a single datacenter to improve the average coflow completion
time. Unfortunately, it is not ideal to directly apply their multi-path routing algorithms
here: available paths between two datacenters have different costs, as a multi-hop path
will incur a higher bandwidth usage cost compared to the direct link.
Being aware of this concern, we design a simple and efficient multi-path routing
algorithm to utilize available link bandwidth better and to balance network load. The
idea is similar to water-filling — it identifies the bottleneck link, and shifts some traffic
to the alternative path with the lightest network load in an iterative fashion.
The intuition that motivates this algorithm can be illustrated with the example shown
in Fig. 3.6. The shuffle phase consists of two fetches, each representing a group of flows
between a pair of source and destination datacenters. The first fetch is between DC2 and
DC3, involving four flows represented by the gray lines; the second fetch is between DC2
and DC1, which consists of two flows represented by the black lines. For simplicity, all
the flows are of the same size (100 MB), and all the inter-datacenter link bandwidth is
the same (10 MB/s). If these flows are routed directly as on the left side of the figure, the
link between DC2 and DC3 will become the bottleneck, resulting in a shuffle completion
time of 40 seconds. However, if multi-path routing is allowed to split some traffic from
the direct path (DC2-DC3) onto the alternative path (DC2-DC1, DC1-DC3), the network
load will be better balanced across links, which naturally speeds up the shuffle phase. As
illustrated by the right side of Fig. 3.6, when 100 MB of traffic is shifted from DC2-DC3
to the alternative two-hop path, DC2-DC3 and DC2-DC1 carry the same network load.
With this routing, the shuffle completes in only 30 seconds.

Figure 3.6: Flexibility in routing improves performance.
The bottleneck link is identified based on the time it takes to finish all of its passing
flows. In the first iteration, we calculate each link's load and the time it takes to
finish all the passing flows, assuming that all flows go through their direct links. In
particular, for each link l, the link load is Dl = Σi di, where di represents
the total amount of data of fetch i whose direct path is link l. The completion
time is thus tl = Dl/Bl, where Bl represents the bandwidth of link l. We
identify the most heavily loaded link l∗, which has the largest tl∗, and choose one of its
alternative paths, the one with the lightest load, for traffic re-routing. To compute
the fraction of traffic to be re-routed from l∗, denoted α, we solve the equation
Dl∗(1 − α)/Bl∗ = (Dl∗α + Dl′)/Bl′, where l′ is the most heavily loaded link on the
selected detour path.
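One iteration of this water-filling step can be sketched as follows, using the numbers of Fig. 3.6 as a usage example; the string link names and the single-detour-per-link mapping are simplifications for illustration.

```python
def rebalance_bottleneck(load, bandwidth, detours):
    """One iteration of the water-filling re-routing heuristic.

    `load[l]` is the traffic (MB) whose path uses link l, `bandwidth[l]`
    its capacity (MB/s), and `detours[l]` the links on the lightest
    alternative path for l. Returns the bottleneck link, the fraction
    alpha of its traffic to shift, and the updated link loads.
    """
    # bottleneck: the link that takes longest to drain its flows
    l_star = max(load, key=lambda l: load[l] / bandwidth[l])
    # l': the most heavily loaded link on the chosen detour path
    l_prime = max(detours[l_star], key=lambda l: load[l] / bandwidth[l])
    D, B = load[l_star], bandwidth[l_star]
    Dp, Bp = load[l_prime], bandwidth[l_prime]
    # solve D*(1 - alpha)/B = (D*alpha + Dp)/Bp for alpha
    alpha = max((Bp * D - B * Dp) / (D * (B + Bp)), 0.0)
    shifted = alpha * D
    new_load = dict(load)
    new_load[l_star] -= shifted
    for l in detours[l_star]:
        new_load[l] += shifted
    return l_star, alpha, new_load

# Fig. 3.6 example: four 100 MB flows on DC2-DC3, two on DC2-DC1, 10 MB/s links
load = {"DC2-DC3": 400, "DC2-DC1": 200, "DC1-DC3": 0}
bandwidth = {l: 10 for l in load}
detours = {"DC2-DC3": ["DC2-DC1", "DC1-DC3"]}
bottleneck, alpha, new_load = rebalance_bottleneck(load, bandwidth, detours)
```

On this input the closed-form solution gives α = 0.25, i.e., 100 MB shifted onto the two-hop path, leaving 300 MB on each of DC2-DC3 and DC2-DC1 and a 30-second shuffle, matching the example above.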
This algorithm can be readily implemented in the controller, since the link bandwidth
and the flow sizes are all available, as mentioned earlier. The calculated routing decision
for each flow is conveyed to all the aggregators along its assigned path, where a basic
forwarding rule (flowId: nextHop) indicating its next-hop datacenter will be installed
and enforced in the data plane.
3.2.4 A Flow’s Life in Siphon
We now examine how a flow from a data analytics job is established and routed through
Siphon aggregators, using the simple case shown in Fig. 3.7. The integration between
Siphon and data parallel frameworks, such as Spark, is designed as a simple RPC-based
publish-subscribe API. To initiate a data flow to datacenter 2, the Spark executor in
datacenter 1 simply needs to send its outgoing traffic to the local aggregator, specify-
ing the destination datacenter and host with the publish API. To receive data through
Siphon, the Spark executor in datacenter 2 simply calls the subscribe API.
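A minimal in-memory sketch of this publish/subscribe interaction is shown below. Only the publish and subscribe names come from the text; the class name and the buffering details are hypothetical stand-ins for the aggregator's RPC server, which in reality relays data across datacenters over the overlay.

```python
from collections import defaultdict, deque

class LocalAggregatorStub:
    """In-memory stand-in for the aggregator's RPC publish/subscribe API.

    The real aggregator forwards published data over inter-datacenter
    TCP connections; here a local mailbox keyed by (datacenter, host)
    imitates that delivery for illustration.
    """
    def __init__(self):
        self._mailbox = defaultdict(deque)

    def publish(self, dest_dc, dest_host, payload):
        # the real daemon would forward this through the overlay
        self._mailbox[(dest_dc, dest_host)].append(payload)

    def subscribe(self, dc, host):
        # the real daemon delivers data its RPC server has received
        box = self._mailbox[(dc, host)]
        return box.popleft() if box else None

agg = LocalAggregatorStub()
agg.publish("dc2", "executor-7", b"shuffle-block-0")
```

A subscriber on the destination side then retrieves the payload with `agg.subscribe("dc2", "executor-7")`.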
Upon receiving a flow, the aggregator checks whether there exists a pre-installed
rule for this flow. If a match is found, the flow will be switched to the corresponding
output queue, which is connected to an aggregator in another datacenter. Otherwise,
the aggregator will consult the Siphon controller for new rules that reflect its routing
and scheduling decisions. In Fig. 3.7, the flow is sent to the output queue corresponding
to datacenter 2, which will be transferred via pre-established parallel TCP connections
between these aggregators.
When a flow arrives at the aggregator in datacenter 2, the destination in its header will
be checked by the aggregator. If the destination executor is within the local datacenter,
the flow will be handled by the RPC server. Since the destination has already subscribed
to the data from the source, the flow will be delivered successfully by the RPC server.
If the destination is in a different datacenter, this aggregator will serve as a relay and
apply rules from the controller to select its output queue.
Figure 3.7: A flow's life through Siphon.
Figure 3.8: An architectural overview of Siphon.
3.3 Siphon: Design and Implementation
3.3.1 Overview
To realize any coflow scheduling strategy in wide-area data analytics, we need a system
that can flexibly enforce the scheduling decisions. Traditional traffic engineering [37, 42]
techniques can certainly be applied, but they are only available to cloud providers, which
have full control over the traffic; common cloud tenants do not have such flexibility. As
concluded in Sec. 3.1, Siphon is designed and implemented as a host-based building
block to achieve this goal.
Fig. 3.8 shows a high-level overview of Siphon’s architecture. Processes, called aggre-
gator daemons, are deployed on all (or a subset of) workers, interacting with the worker
processes of the data parallel framework directly. Conceptually, all these aggregators will
form an overlay network, which is built atop inter-datacenter WANs and supports the
data parallel frameworks.
In order to ease the development and deployment of potential optimizations for inter-
datacenter transfers, the Siphon overlay is managed following the software-defined networking
principle. Specifically, aggregators operate as application-layer switches at the data plane,
being responsible for efficiently aggregating, forwarding and scheduling traffic within
the overlay. Network and flow statistics are also collected by the aggregators actively.
Meanwhile, all routing and scheduling decisions are made by the central Siphon controller.
With a flexible design to accommodate a wide variety of flow scheduling disciplines, the
centralized controller can make fine-grained control decisions, based on coflow information
provided by the resource scheduler of data parallel frameworks.
3.3.2 Data Plane
Siphon’s data plane consists of a group of aggregator daemons, collectively forming an
overlay that handles inter-datacenter transfers requested by the data parallel frameworks.
Working as application-layer switches, the aggregators are designed with two objectives:
they should be simple for data parallel frameworks to use, and they should support high
switching performance.
♦ Software Message Switch
The main functionality of an aggregator is to work as a software switch, which takes care
of fragmenting, forwarding, aggregating and prioritizing the data flows generated by
data parallel frameworks.
After receiving data from a worker in the data parallel framework, an aggregator
will first divide the data into fragments, so that they are easily addressable and
schedulable. These data fragments are called messages. Each data flow will be split into
a sequence of messages to be forwarded within Siphon. A minimal header, with a flow
identifier and a sequence number, will be attached to each message. Upon reaching the
desired destination aggregator, the messages will be reassembled and delivered to the final
destination worker.
The aggregators can forward the messages to any peer aggregator as an intermediate
nexthop or the final destination, depending on the forwarding decisions made by
the controller. Inheriting the design of traditional OpenFlow switches, the aggregator
looks up a forwarding table that stores all the forwarding rules in a hash table, to ensure
high performance. Wildcards in forwarding rule matching are also supported,
thanks to the hierarchical organization of the flow identifiers. If neither the flow identifier
nor a wildcard matches, the aggregator will consult the controller. A forwarding
rule includes a nexthop to enforce routing decisions, and a flow weight to enforce flow scheduling
decisions.
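A simplified sketch of such a hash-based forwarding table with hierarchical wildcard fallback might look as follows; the colon-separated flow identifier format and the rule fields are assumptions made for illustration, not the actual C++ data structure.

```python
class ForwardingTable:
    """Hash-based forwarding table with hierarchical wildcard fallback.

    Flow identifiers are assumed hierarchical, e.g. "job7:shuffle2:flow5";
    a rule may match an exact id or a prefix wildcard such as
    "job7:shuffle2:*". Each rule carries a nexthop and a flow weight.
    """
    def __init__(self):
        self.rules = {}

    def install(self, pattern, nexthop, weight=1.0):
        self.rules[pattern] = (nexthop, weight)

    def lookup(self, flow_id):
        # exact match first, as in a plain hash-table lookup
        if flow_id in self.rules:
            return self.rules[flow_id]
        # then fall back to progressively shorter hierarchical wildcards
        parts = flow_id.split(":")
        for i in range(len(parts) - 1, 0, -1):
            wildcard = ":".join(parts[:i]) + ":*"
            if wildcard in self.rules:
                return self.rules[wildcard]
        return None  # miss: the aggregator would consult the controller
```

A specific rule for one flow thus overrides a coarser per-shuffle wildcard, while unmatched flows fall through to the controller.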
Since messages forwarded to the same nexthop share the same link, we use a priority
queue to buffer all pending outbound messages, in support of scheduling decisions. Priorities
can be assigned to individual flows sharing a queue, which takes effect when the queue is
backlogged behind a fully saturated outbound link. The control plane is responsible for
assigning a priority to each flow.
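The per-nexthop priority queue can be sketched with a binary heap; this is an illustrative Python stand-in for the aggregator's C++ structure, with FIFO order preserved among messages of equal priority via a monotonic counter.

```python
import heapq
from itertools import count

class PriorityOutputQueue:
    """Per-nexthop outbound message queue (illustrative stand-in).

    Lower priority values drain first. The (priority, sequence) tuple
    keys make the heap stable, so messages of the same flow leave in
    the order they were enqueued.
    """
    def __init__(self):
        self._heap = []
        self._seq = count()
        self.flow_priority = {}  # installed by the control plane

    def enqueue(self, flow_id, message):
        prio = self.flow_priority.get(flow_id, float("inf"))
        heapq.heappush(self._heap, (prio, next(self._seq), flow_id, message))

    def dequeue(self):
        _, _, flow_id, message = heapq.heappop(self._heap)
        return flow_id, message

queue = PriorityOutputQueue()
queue.flow_priority = {"flowA": 1, "flowB": 0}  # installed by the controller
queue.enqueue("flowA", b"msg-1")
queue.enqueue("flowB", b"msg-2")
```

When the outbound link is saturated, the backlogged queue drains the higher-priority flow (flowB above) first, which is exactly the preemption the scheduling policies rely on.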
♦ Performance-Centric Implementation
Since an aggregator is I/O-bound, it is designed and implemented with performance in
mind. It has been implemented in C++ from scratch, following the event-driven asynchronous
programming paradigm. Several optimizations are adopted to maximize its efficiency.
Multi-threading. To fully utilize the multiple CPU cores in modern VMs, all operations
performed in response to events are executed in parallel by a worker thread pool
throughout our implementation. The switch itself is also designed to be multi-threaded. In order
to reduce the overhead of excessive mutual exclusion locks, lock-free data structures are
used whenever possible. As an example, the forwarding table, which is read far more
frequently than it is written, is protected by a reader-writer spin lock instead of a
traditional mutex lock.
Event-driven design. Events are raised and handled asynchronously, including all
network I/O events. All components are loosely coupled with one another, as each
function in these components is triggered only when the specific events it is designed to handle
are raised. As examples of this event-driven design, the switch will start forwarding
messages in an input queue as soon as the queue raises a PacketIn event, and the output
queue will be consumed as soon as a corresponding worker TCP connection raises a
DataSent event, indicating that the outbound link is ready.
Coroutine-based pipeline design pattern. Because an aggregator may commu-
nicate with a number of peers at the same time, work conservation must be preserved. In
particular, it should avoid head-of-line blocking, where one congested flow may take all
resources and slow down other non-congested flows. An intuitive implementation based
on input and output queues cannot achieve this goal. To solve this problem, our imple-
mentation takes advantage of a utility called “stackful coroutine,” which can be considered
as a procedure that can be paused and resumed freely, just like a thread whose context
switch is controlled explicitly. In an aggregator, each received message is associated with
a coroutine, and the total number of active coroutines is bounded per flow. This
way, we can guarantee that non-congested flows are served promptly, even when coexisting
with resource “hogs.”
Minimized memory copying. Excessive memory copying is a common design
flaw that degrades performance. We use smart pointers and reference
counting in our implementation to avoid memory copying as messages are forwarded. In
the lifetime of a message through an aggregator, it is only copied between the kernel
socket buffers for TCP connections and the aggregator’s virtual address space. Within
the aggregator, a message is always accessed using a smart pointer, and passed between
different components by copying the pointer, rather than the data in the message itself.
Periodical measurements on link qualities. Siphon aggregators are also respon-
sible for measuring and reporting live statistics, such as the one-way delay, round-trip
time, and throughput on each of its inter-datacenter links, and estimates of available
inter-datacenter bandwidth. These live statistics will be reported to the controller peri-
odically for it to make informed decisions about routing and flow scheduling.
The architecture of an aggregator is illustrated in Fig. 3.9. Externally, all aggregators
are connected to the central Siphon controller for control plane decisions, and are
responsible for directly interacting with Apache Spark for data transfers. Internally, each
aggregator is designed and implemented as a high-performance software-defined switch
that operates in the application layer.

Figure 3.9: The architectural design of a Siphon aggregator.
3.3.3 Connections for Inter-Datacenter Links
In each aggregator, an instance of the Connection component implements all the neces-
sary mechanisms to send or receive data over its corresponding inter-datacenter link. It
collectively manages a set of pre-established TCP connections to another aggregator in
a different datacenter, and handles network I/O asynchronously with I/O events. The
use of multiple pre-established TCP connections in parallel helps to saturate available
bandwidth capacities across datacenters.
To coordinate and synchronize all the pre-established TCP connections to another
aggregator, all underlying TCP connections are implemented as workers, which consis-
tently and automatically produce or consume the shared input queue and output queue.
In particular, whenever a worker TCP connection receives a complete message, it en-
queues it into the input queue and raises a notification event, notifying the downstream
component, the Switch, to forward it to its destination. This way, messages received from
all underlying TCP connections are consumed sequentially, since the Switch only pulls
data from the input queue.

Figure 3.10: The architecture of the Siphon Controller.
The design of the Connection component focuses on both high performance and the
flexibility of enforcing a variety of flow scheduling policies. For the sake of performance,
messages with known next-hop destinations are buffered in the output queue of the
corresponding Connection. Whenever there is a worker TCP connection ready to send
data, it dequeues one message from the output queue and sends it out. This will maximize
the aggregate performance as all worker TCP connections are combined.
3.3.4 Control Plane
The controller in Siphon is designed to make flexible control plane decisions, including
flow scheduling and routing.
Although the controller is a logically centralized entity, our design objective is to
make it highly scalable, so that it is easy to be deployed on a cluster of machines or VMs
when needed. As shown in Fig. 3.10, the architectural design of the controller completely
decouples the decision making processes from the server processes that directly respond
to requests from Siphon aggregators, connecting them only with a Redis database server.
Should the need arise, the decision-making processes, server processes, and the Redis
database can easily be distributed across multiple servers or VMs, without incurring
additional configuration or management costs.
The Redis database server provides a reliable and high-performance key-value store
and a publish/subscribe interface for inter-process communication. It is used to keep
all the states within the Siphon datapath, including all the live statistics reported by
the aggregators. The publish/subscribe interface allows server processes to communicate
with decision-making processes via the Redis database.
The server processes, implemented in node.js, directly handle the connections from
all Siphon aggregators. These server processes are responsible for parsing all the reports
or requests sent from the aggregators, storing the parsed information into the Redis
database, and responding to requests with control decisions made by the decision-making
processes. How the decision-making processes are implemented is flexible, depending
on the requirements of the scheduling algorithm.
In inter-coflow scheduling, the controller requires full knowledge of a coflow before
it starts. This is achieved by integrating the resource scheduler of the data parallel
framework with the controller's Pub/Sub interface. In Spark, in particular, the task scheduler
running in the driver program has such knowledge as soon as the reduce tasks are
scheduled and placed on workers. We have modified the driver program such that,
whenever new tasks are scheduled, the generated traffic information will
be published to the controller. The incremental Monte Carlo simulations will then be
triggered on the corresponding parallel decision makers.
The advantage of the software-defined networking architecture is that, with a central
“brain” for the entire Siphon substrate, it is straightforward to implement, deploy and
debug a wide variety of decision-making algorithms to improve performance. Our design
of the controller maximizes this flexibility of centralized decision making. Siphon features
built-in support for the rapid development of these decision makers:
First, the event-driven programming model, backed by the node.js server and the Redis
publish/subscribe interface, is well suited for rapid decision maker development. Unlike
long-running processes that have to poll the database frequently, the Siphon controller
triggers decision makers on data plane events. This way, a decision maker can be imple-
mented as a callback function.
Second, real-time measurements from the aggregators, such as available bandwidth,
are accessible to all running decision makers. In particular, such information is frequently
updated by the Siphon aggregators, and persists in the Redis database, which can be
retrieved by the decision makers with a readily available Python package.
Third, customized events and information reports can easily be plugged in to extend
the decision makers' capabilities. The controller allows developers to define their
customized data plane events, and store additional information in the database. New
decision makers can then make use of this information in similar ways.
In addition to the measurements available at the controller (bandwidth, etc.), our
Siphon-Spark integration makes the controller aware of all concurrent flows generated.
As soon as tasks are scheduled by Spark, a new customized event, onShuffleScheduled,
will be raised. Based on this design and integration with Spark, we have developed two
novel, orthogonal decision makers at the Siphon controller out of the box. Both are designed
with the sole purpose of improving the aggregate throughput of the concurrent inter-
datacenter flows generated in a Spark cluster. Unlike traditional coflow scheduling [20,
22], the proposed policies and algorithms aim to accelerate a single coflow.
3.4 Controller-Data Plane Interaction in Siphon
With software-defined networking as its underlying principle, the control and data planes
in Siphon are completely separated. The control plane is implemented with a centralized
controller, and it communicates — either proactively or reactively — with all the aggre-
gators, collecting real-time measurement statistics and deploying its control decisions.
Traditional ways of scaling the control plane do not work smoothly in Siphon, whose
aggregators are spread across geographically distributed datacenters located far from
one another. Siphon implements a novel controller-data plane interaction scheme. By
caching dynamic forwarding logic instead of static flow tables on the aggregators, our
scheme greatly improves the scalability of the software-defined networking architecture.
3.4.1 Inefficiency of Reactive Control in OpenFlow
In the OpenFlow-like controller-data plane interaction, the data plane queries the
controller for the forwarding decision on each new flow. This approach is brute-force
but fine-grained, because the controller can make flow-specific or even packet-specific
decisions in a straightforward manner.
This scheme incurs a heavy workload on the controller when deployed at large
scale. Since the control plane has to handle all data plane events while ensuring decision
integrity, the single centralized controller approach suffers from high workload and
reliability issues.
Three major issues limit the scalability of this design: i) There will be a significant flow
initiation delay, since a flow's head packets must be forwarded to the controller, and the
communication latency adds to the delay; ii) The controller workload is high, and it
may be congested by bursty network messages because of the all-to-one traffic pattern;
iii) Flow rerouting is necessary but expensive, and may encounter severe consistency
issues across the network.
Existing approaches to scaling an OpenFlow control plane, such as hierarchical or par-
titioning solutions, either sacrifice the granularity of control or limit the generality of possible
network applications. Recent progress on the programmable data plane, as
introduced in OpenFlow 2.0 [3], seems to be a promising cure for this scalability problem.
In this chapter, we implement a novel scalable and readily-deployable scheme for
controller-data plane interactions in Siphon. Inspired by the idea of a reconfigurable
data plane, the central Siphon controller caches control logic closer to the data plane
by injecting script code to the interpreter runtime on data plane nodes. Whenever the
Siphon data plane node is processing a new message, it directly dispatches the message
header information to the provided handler in the script code, to perform the desired
computation and actions based on locally cached controller intelligence.
An example of a load-balancing network application is shown in Fig. 3.11.

Figure 3.11: An example showing the benefits of a programmable data plane in software-defined networks. There exist two available shortest paths with the same capacity between ingress Switch A and egress Switch B. The network application tries to balance the load between the two paths. The rate of Flow 1 is 3 units, while the rate of Flows 2, 3 and 4 is 2 units each. (a) OpenFlow-style: installing static rules. (b) Siphon: caching dynamic forwarding logic.

In a traditional OpenFlow network (Fig. 3.11(a)), a rule has to be installed on Switch A reactively
for each new flow. In the depicted example, four messages are required to achieve globally
optimal load-balancing. Note that the response time for rule installation will add to the
overhead of flow initiation.
In Siphon (Fig. 3.11(b)), on the contrary, the loadBalancer logic is cached on Switch
A beforehand by the central controller. When a new flow comes in, the path selection
is completely handled by the cached loadBalancer function, without querying the con-
troller. At the same time, the controller will proactively update Switch A with mea-
surement statistics of path load, ensuring the global optimality of forwarding decisions.
Only two update messages are necessary in this case. Moreover, loadBalancer takes
local measurement statistics into consideration. Notably, the delay() function returns the
monitored delay of a given path, and calling it on a failed path returns Inf.
Therefore, failed paths are avoided automatically, without the slow reaction of the
controller.
In conclusion, the benefits of dynamic logic caching are three-fold: i) the time overhead
of flow initiation is reduced, because the otherwise inevitable communication latency is avoided; ii)
both the computation and the network load on the controller are likely to be reduced, since
statistics are updated on demand; and iii) all route selection can react to network failures
quickly, enabling fast recovery.
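The cached load-balancing logic of Fig. 3.11(b) can be paraphrased in Python as follows; this is a stand-in for the injected JavaScript, and the measurement callback and state names are illustrative.

```python
import math

def make_load_balancer(path_delay):
    """Build the cached forwarding logic, mirroring the loadBalancer
    sketch in Fig. 3.11(b). `path_delay` measures a path's delay and
    returns math.inf for a failed path. The controller proactively
    refreshes the sorted path list; the data plane node calls the
    returned function locally for every new flow.
    """
    state = {"sorted_paths_by_load": []}

    def update(sorted_paths):
        # proactive knowledge update pushed by the controller
        state["sorted_paths_by_load"] = sorted_paths

    def load_balancer(flow):
        # pick the least-loaded path that is still reachable
        for path in state["sorted_paths_by_load"]:
            if path_delay(path) < math.inf:
                return path
        return None

    return update, load_balancer

# usage: path Y has failed, so new flows are steered to path X locally,
# without any round trip to the controller
update, lb = make_load_balancer(lambda p: math.inf if p == "Y" else 5.0)
update(["Y", "X"])
```

Because the failure check runs against local measurements, the cached function reroutes around a dead path immediately, even before the controller learns about the failure.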
3.4.2 Caching Dynamic Forwarding Logic
Generally speaking, message processing in software-defined networks involves three stages:
header matching, computing, and action application. Messages from different flows are iden-
tified through header matching. Then, computation is required to derive the desired list
of actions (e.g., header modification, or forwarding to a given next-hop node). Finally, the
actions are applied to the message.
In software-defined networking, the computing stage is completely migrated to the
control plane. The core concept of our approach is to allow most of the messages to be
processed without consulting the central controller.
To this end, the Siphon controller preemptively caches the control logic (instead of
static forwarding rules) on the data plane. As a follow-up, it periodically compiles
and pushes the required global statistics. In Siphon, the communication messages be-
tween the controller and aggregators are encoded in JavaScript Object Notation (JSON)
[Figure 3.12: The message processing diagram in Siphon. An incoming message passes through the Header Matcher, a flow table lookup, and the Message Dispatcher, which selects a cached Message Processing Object (processMessage(destList)); the object consults the Cached Knowledge Database and the RTT/Bandwidth Monitor, and emits an action list (e.g., writeLabel("minDelay"), toNextHop(10), toController()) executed by the Action Executor. The Controller Proxy receives knowledge updates and logic cache installations from the Siphon Controller.]
strings, which naturally enables logic caching via Javascript code injection. The
Siphon aggregators embed a lightweight Javascript interpreter to execute
the code.
Specifically, the Siphon controller programs the message processing logic in a series of
Javascript objects. Each object has a key processMessage() method, defining
the logic to process messages carrying a given forwarding preference label in the header. Each
object is then serialized into a JSON message and sent to the aggregators. The
message is then parsed and rebuilt into the original Javascript objects. These objects that
contain the message processing logic, namely the Message Processing Objects, are then
registered with the Javascript interpreter on the data plane node. Note that all of the above tasks
can be completed even before the first flow in the network initiates.
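The serialize-and-rebuild step can be sketched as below. The exact Siphon wire encoding is not specified in the text, so the field names (prefType, processMessageSrc) and the use of the Function constructor to rehydrate the method are assumptions; they only illustrate the idea of shipping logic inside a JSON string.

```javascript
// Controller side: serialize an object, carrying the function body as a
// string inside the JSON message (hypothetical field names).
const mpoSpec = JSON.stringify({
  prefType: 'minDelay',
  processMessageSrc: 'return [{ action: "toNextHop", hop: destList[0] }];'
});

// Aggregator side: parse the JSON and rebuild the processMessage() method
// via code injection (a real deployment would sandbox this).
function rebuildMPO(json) {
  const spec = JSON.parse(json);
  return {
    prefType: spec.prefType,
    processMessage: new Function('destList', spec.processMessageSrc)
  };
}

const mpo = rebuildMPO(mpoSpec);
const actions = mpo.processMessage([10, 12]);
// actions[0] is { action: 'toNextHop', hop: 10 }
```

Once rebuilt, the object behaves like any locally defined Message Processing Object, which is why installation can complete before any flow arrives.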
The processing of a single message is illustrated in Fig. 3.12. Whenever a
new message arrives, the header matcher unpacks the message headers, reading the
specified destination information and forwarding preference labels. This information is
used by the message dispatcher to select an appropriate stored Message Processing Object
to process the message.
The member function processMessage() of the selected Message Processing Object
is then called, with an argument indicating the list of destinations. The return value
of this function is a list of pre-defined actions, e.g., rewriting the Siphon message header,
relaying to a given next-hop node, or further consulting the central controller. Finally,
the action executor applies the actions, which concludes a message processing cycle.
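The cycle above can be condensed into a few lines. The message layout, registry shape, and action names here are illustrative assumptions, not Siphon's actual wire format:

```javascript
// Hypothetical registry of installed Message Processing Objects, keyed by
// the forwarding preference label they handle.
const registry = {
  minDelay: { processMessage: dests => [{ action: 'toNextHop', hop: dests[0] }] }
};

function handleMessage(msg) {
  // 1. Header matching: read destinations and the preference label.
  const { prefLabel, destList } = msg.header;
  // 2. Dispatch: select the MPO registered for this preference label.
  const mpo = registry[prefLabel];
  if (!mpo) return [{ action: 'toController' }]; // default: consult controller
  // 3. Compute: the MPO returns the action list to be executed.
  return mpo.processMessage(destList);
}
```

The action executor would then iterate over the returned list and apply each action to the message.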
The above process seems simple; however, it cannot be completed without a partic-
ular slice of global network knowledge. At the time a new Message Processing Object
is installed on an aggregator, it can also subscribe to updates of some customized
variables (i.e., global knowledge) from the controller. This global knowledge can be
intermediate variables used by the algorithm inside processMessage(), which cannot be
obtained locally.
Moreover, the measurement results of local network delays and bandwidth statistics
can be used directly by processMessage(). As a result, forwarding decision making
can react directly to data plane events, such as link failures, without involving the
controller.
3.4.3 Customizing the Message Processing Logic
Software-defined networking allows network operators to implement a va-
riety of novel network applications in pure software, customizing the control logic over
different network flows. In Siphon, programming a network application is as simple as in
OpenFlow. Implementing a customized Siphon network application involves two steps
in general: defining a Message Processing Object, and subscribing to the variables to be cached
as global knowledge.
Message Processing Objects are the core objects that define the control logic. Different
Message Processing Objects can be used to process messages with different labels. On
data plane nodes, new Message Processing Objects are registered to the controller proxy,
[Figure 3.13: Prototype inheritance of Message Processing Objects, illustrating the programming interfaces. The Controller Proxy exposes registerMPO(MPO), which sets the proxy as the MPO's prototype, and processMessage(message), which dispatches to a registered MPO; it also holds knowledgeDB and monitorStat, updated via JSON messages from the Siphon Controller. A Message Processing Object defines prefType (the field to match the preference label), processMessage(), and sharedVar (a property shared per preference type), and inherits access to this.knowledgeDB and this.monitorStat.]
and inherit the database and monitor access.
Thanks to the prototype inheritance mechanism in Javascript, customized Message
Processing Objects can be programmed with the simple interfaces shown in Fig. 3.13. The prefType
property defines the type of messages to be handled, and its value matches the preference
label in message headers. The processMessage() function programs the message processing
logic, as addressed previously. Meanwhile, variables that are shared among
the flows of the same prefType can be defined or modified via the sharedVar property. For
example, a counter can be added as a shared variable to count the number of processed
messages. Access to the global knowledge database and monitor statistics is made via
the this.knowledgeDB and this.monitorStat interfaces directly.
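A customized Message Processing Object using these interfaces might look as follows. This is a sketch under stated assumptions: the knowledgeDB contents (a per-destination list of next hops) and the counter example are hypothetical, and the real objects inherit knowledgeDB and monitorStat through the controller proxy prototype rather than defining them inline.

```javascript
// Sketch of a customized MPO: prefType matches the header label, sharedVar
// counts processed messages across all flows of this preference type, and
// knowledgeDB stands in for the controller-pushed global knowledge.
const counterMPO = {
  prefType: 'minDelay',
  sharedVar: { processed: 0 },
  knowledgeDB: { bestPaths: { dc1: [10, 12, 14] } }, // hypothetical contents
  processMessage(destList) {
    this.sharedVar.processed += 1;              // shared across flows
    const hop = this.knowledgeDB.bestPaths[destList[0]][0];
    return [{ action: 'toNextHop', hop }];
  }
};

counterMPO.processMessage(['dc1']);
counterMPO.processMessage(['dc1']);
// counterMPO.sharedVar.processed is now 2
```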
The second step is to define the subscribed data from the controller. Together with
the Message Processing Object installation, the result of some user-defined functions can
be subscribed to update the knowledge database.
We implemented several prototype network applications, in the form of Message
Processing Objects. Tab. 3.2 summarizes their implementation. Typically, these
| Network App | Summary | Global Knowledge |
|---|---|---|
| "min-delay" | Choose the next hop which directs to the path with the minimum possible one-way delay. Desired bandwidth can be specified as a prerequisite. | Best 3 paths with the least latency for each possible destination. The corresponding available bandwidth is given, too. |
| "max-bandwidth" | Choose the next hop which directs to the path with the most available bandwidth. | Best 3 paths with the most available bandwidth for each possible destination. |
| "bw-allocation" | Allocate a fraction of the available bandwidth to a single flow, according to the desired relative weight. | None. |
| "load-balancer" | Choose the next hop for a message to balance the load among all available paths. Either per-message or per-flow load balancing can be specified. | Same as above. |

Table 3.2: Summary of prototype network applications implemented in Siphon
applications are enough to satisfy the general traffic engineering needs to operate Siphon
in inter-datacenter networks. Note that multicast is automatically supported by these
applications. Also, per-message multipath routing can be enabled by the options specified
in the preference label field of a message header.
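As an illustration of how such an application can run entirely on cached knowledge, the "min-delay" entry of Table 3.2 can be sketched as below. The per-destination structure (three delay-sorted paths with available bandwidth) follows the table; the concrete numbers and field names are assumptions.

```javascript
// Cached global knowledge for "min-delay": per destination, the best 3 paths
// sorted by latency, each with its available bandwidth (illustrative values).
const bestPaths = {
  tokyo: [
    { hop: 3, delayMs: 42, bwMbps: 90 },
    { hop: 7, delayMs: 55, bwMbps: 180 },
    { hop: 9, delayMs: 80, bwMbps: 150 }
  ]
};

// Pick the fastest cached path that satisfies an optional bandwidth
// prerequisite, as Table 3.2 describes; null means "consult the controller".
function minDelayNextHop(dest, desiredBwMbps = 0) {
  const ok = bestPaths[dest].filter(p => p.bwMbps >= desiredBwMbps);
  return ok.length ? ok[0].hop : null; // entries are pre-sorted by delay
}
```

With no bandwidth prerequisite the top entry wins; a prerequisite of, say, 100 Mbps skips to the second-fastest path.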
3.4.4 Discussion
One may think that caching control logic on the relays violates the software-defined
networking principle. Indeed, in some cases, the aggregators in Siphon seem to work in
the same way as a traditional router. However, this is not the case, because the control logic
is cached by the central controller. In other words, the controller is free to delete or
update the cached logic at any time. Ultimately, the forwarding decisions remain under
the full control of the central controller.
Caching forwarding logic is compatible with the OpenFlow-style of control, that is,
caching static forwarding rules. Directly consulting the controller is the default action to
be applied to a Siphon message. In particular, the message will be forwarded to the con-
troller under the following conditions: i) there is no matching Message Processing Object
installed; ii) processMessage() raises an exception due to lack of global knowledge.
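The two fallback conditions can be captured in a small dispatch wrapper. This sketch assumes a registry keyed by preference label and a generic toController action; both are illustrative names, not Siphon's actual interfaces.

```javascript
// Default action: forward to the controller when (i) no Message Processing
// Object matches the label, or (ii) processMessage() throws, e.g., for lack
// of cached global knowledge.
function dispatch(registry, msg) {
  const mpo = registry[msg.header.prefLabel];
  if (!mpo) return [{ action: 'toController' }];           // condition (i)
  try {
    return mpo.processMessage(msg.header.destList);
  } catch (e) {
    return [{ action: 'toController' }];                   // condition (ii)
  }
}
```

This keeps the OpenFlow-style behavior as the safety net while letting the common case stay on the data plane.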
Another concern about this design is that, because forwarding decisions can be
made locally, it might be infeasible to ensure the correctness of message forwarding. For
example, the forwarding decisions might result in a loop. Siphon solves this problem by
consistently updating the knowledge cached on aggregators. Since the cached knowledge
on different relays is drawn from the same database, a loop can be avoided easily even
with greedy path selection. Moreover, the locality of forwarding decision making guar-
antees consistency during a network update: there will never be inconsistent rules
installed on the aggregators.
3.5 Performance Evaluation
In this section, we present our results from a comprehensive set of experimental eval-
uations of Siphon, organized into three parts. First, we provide a coarse-grained
comparison to show the application-level performance improvements of using Siphon.
A comprehensive set of machine learning workloads is used to evaluate our framework
against baseline Spark. Then, we answer the question of how Siphon
expedites a single coflow by putting a simple shuffle under the microscope. Finally, we
evaluate our inter-coflow scheduling algorithm, using the state-of-the-art heuristic as
a baseline.
3.5.1 Macro-Benchmark Tests
Experimental Setup. In this experiment, we run 6 different machine learning work-
loads on a 160-core cluster that spans 5 geographical regions. Performance
metrics such as application runtime, stage completion time and shuffle read time are to
be evaluated. The shuffle read time is defined as the completion time of the slowest data
[Bar chart comparing Siphon and Spark for the six workloads (ALS, PCA, BMM, Pearson, W2V, FG), with application run time (s) on the y-axis; the annotated differences per workload are -26.1, -14.1, -8.5, -23.8, -4.1, and +1.7, respectively.]
Figure 3.14: Average application run time.
| Workload | # Shuffles | Total Bandwidth Usage (GB) | Extra Bandwidth Usage (MB) | Siphon Shuffle Read Time (s) | Spark Shuffle Read Time (s) | Runtime Reduction (%) | Cost Difference (¢) |
|---|---|---|---|---|---|---|---|
| ALS | 18 | 40.47 | 2186.3 | 46.8 | 90.5 | 48.3 | -26.56 |
| PCA | 2 | 0.51 | 37.6 | 3.3 | 13.7 | 76.1 | -6.80 |
| BMM | 1 | 42.3 | 2911.1 | 48.9 | 97.8 | 50.0 | -29.26 |
| Pearson | 2 | 0.57 | 23.8 | 3.6 | 13.1 | 72.6 | -6.23 |
| W2V | 5 | 0.45 | 10.2 | 5.8 | 9.6 | 39.9 | -2.49 |
| FG | 2 | 0.57 | 20.5 | 1.77 | 1.87 | 5.4 | -0.05 |

Table 3.3: Summary of shuffles in different workloads (presenting the run with the median application run time).
fetch in a shuffle. It reflects the time needed for the last task to start computing, and it
determines the stage completion time to some extent.
The Spark-Siphon cluster. We set up a 160-core, 520 GB-memory Spark cluster.
Specifically, 40 n1-highmem-2 instances are evenly distributed across 5 Google Cloud dat-
acenters (N. Carolina, Oregon, Belgium, Taiwan, and Tokyo). Each instance provides 2
vCPUs, 13 GB of memory, and 20 GB of SSD storage. Except for one instance
in the N. Carolina region, which works as both the Spark master and driver, all instances serve as
Spark standalone executors. All instances run Apache Spark 2.1.0.
The Siphon aggregators run on 10 of the executors, 2 in each datacenter. An aggrega-
tor is responsible for handling Pub/Sub requests from 4 executors in the same datacenter.
The Siphon controller runs on the same instance as the Spark master, in order to minimize
the communication overhead between them.
[Figure 3.15: Shuffle completion time and stage completion time comparison (presenting the run with the median application run time). Each subplot compares Siphon and Spark stage completion times and shuffle times: (a) Alternating Least Squares (in CDF); (b) Principal Component Analysis; (c) Block Matrix Multiplication; (d) Pearson's Correlation; (e) Word2Vec distributed representation of words; (f) FP-Growth frequent item sets.]
Note that, to make the comparison fair, we do not launch extra resources for Siphon
aggregators. Even though they share some computation resources and system I/O with
their co-located Spark executors, the consumption is minimal.
Workload specifications. 6 machine learning workloads, with multiple jobs and
multiple stages, are used for evaluation.
• ALS: Alternating Least Squares.
• PCA: Principal Component Analysis.
• BMM: Block Matrix Multiplication.
• Pearson: Pearson’s correlation.
• W2V: Word2Vec distributed representation of words.
• FG: FP-Growth frequent item sets.
These workloads are representative ones from the Spark-Perf benchmark1, the offi-
cial Spark performance test suite created by Databricks2. The workloads that are not
evaluated in this chapter share the same characteristics, in terms of network traffic
pattern and computational intensity, as one or more of the selected ones. We set the scale
factor to 2.0, which is designed for a 160-core, 600 GB-memory cluster.
Methodologies. With different workloads, we compare the performance of job exe-
cution with and without Siphon integrated as the cross-datacenter data transfer service.
Note that, without Siphon, Spark works in the same way as out-of-the-box, vanilla
Spark, except for one slight change to the TaskScheduler. Our modification eliminates
the randomness in Spark task placement decisions. In other words, each task in a
given workload will be placed on a fixed executor across different runs. This way, we
guarantee that the impact of task placement on performance has been eliminated.
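The idea behind this modification can be sketched as a deterministic assignment. The actual TaskScheduler change lives inside Spark (in Scala); the function below only mirrors the principle that, with randomness removed, the task-to-executor mapping is identical across runs.

```javascript
// A deterministic placement sketch (hypothetical, not Spark's code): sort
// task and executor identifiers and assign round-robin, so the same inputs
// always yield the same mapping.
function fixedPlacement(taskIds, executors) {
  const ts = [...taskIds].sort();
  const ex = [...executors].sort();
  const placement = {};
  ts.forEach((t, i) => { placement[t] = ex[i % ex.length]; });
  return placement;
}
```

Any deterministic rule would do; the point is that repeated runs of a workload see the same placement, isolating the effect of the data transfer service.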
Performance. We run each workload on the same input dataset 5 times. The
average application run time comparisons across the 5 runs are shown in Fig. 3.14. Below,
we focus on job execution details, taking the run with the median application run time
as an example. Table 3.3 summarizes the total shuffle size and shuffle read time of each
workload. Further, Fig. 3.15 breaks down the time for network transfers and computation
in each stage, providing more insight.
Among the 6 workloads, BMM, the most network-intensive workload, benefits most
1 https://github.com/databricks/spark-perf
2 https://databricks.com/
from Siphon. It enjoys a 23.6% reduction in average application run time. The reason
is that it has one huge shuffle (sending more than 40 GB of data in one shot), and
Siphon can help significantly. This observation is confirmed by Fig. 3.15(c), which shows
that Siphon manages to reduce the shuffle read time by around 50 seconds.
Another network-intensive workload is ALS, an iterative workload. The average run
time has been reduced by 13.2%. The reason can be easily seen with the information
provided in Table 3.3. During a single run of the application, 40.47 GB of data is shuffled
through the network, in 18 stages. Siphon collectively reduces the shuffle time by more
than 30 seconds. Fig. 3.15(a) shows the CDFs of shuffle completion times and stage
completion times, using Spark and Siphon respectively (note the x-axis is in log scale).
As we observe, the long tail of the stage completion time distribution is reduced because
Siphon has significantly improved the performance of all shuffle phases.
The rest of the workloads generate much less shuffle traffic, but their shuffle read
times have also been reduced (by 5.4%∼76.1%).
PCA and Pearson are two workloads that have network-intensive stages. Their shuffle
read time constitutes a significant share of some of the stages, but they also have
computation-intensive stages that dominate the application run time. For these workloads,
Siphon still improves job-level performance by minimizing the time spent on shuffles
(Table 3.3).
W2V and FG are two representative workloads whose computation time dominates
the application execution. With these workloads, Siphon can hardly make a difference
in terms of application run time, which is mostly decided by the computation stragglers.
An extreme example is shown in Fig. 3.15(e). Even though the shuffle read time has been
reduced by 4 seconds (Table 3.3), the computation stragglers in Stage 4 and Stage 6 will
still slow down the application by 0.7% (Fig. 3.14). Siphon is not designed to accelerate
these computation-intensive data analytics applications.
Cost Analysis. As the acceleration of Spark shuffle reads in Siphon is partially due
[Figure 3.16: Average job completion time across 5 runs. Sort job completion time (s), broken into map and reduce stages, for the four schemes; annotated values: 186.5 and 215.2 (Spark), 157.3 and 188.9 (Naive), 139.4 and 170.3 (Multipath), 134.7 and 164.0 (Siphon).]
[Figure 3.17: Breakdowns of the reduce stage execution across 5 runs, into task execution and shuffle read; annotated values (s): 72.2 and 186.2 (Spark), 56.3 and 155.3 (Naive), 49.4 and 135.3 (Multipath), 48.7 and 130.3 (Siphon).]
to the relaying of traffic through intermediate datacenters, a natural concern is how it affects the
overall cost of running the data analytics jobs. On the one hand, relaying traffic
increases the total WAN bandwidth usage, which is charged for by public cloud providers.
On the other hand, the acceleration of jobs reduces the cost of computation resources.
We present the total cost of running the machine learning jobs in Table 3.3, based on
Google Cloud pricing3. Each instance used in our experiment costs $1.184 per hour, and
our cluster costs 0.6578 ¢ per second. As a comparison, inter-datacenter bandwidth
costs only 1 cent per GB.
As a result, Siphon actually reduces the total cost of running all workloads (Table 3.3).
On the one hand, only a small portion of inter-datacenter traffic is relayed. On the
other hand, the idle time of computing resources is reduced significantly, and the resulting
savings exceed the extra bandwidth cost.
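The cost differences in Table 3.3 can be reproduced with simple arithmetic from the quoted prices. This sketch assumes the compute savings equal the shuffle read time reduction and that 1 GB = 1000 MB for billing; both are reconstructions, not statements from the text.

```javascript
// Back-of-the-envelope check of Table 3.3's Cost Difference column, using
// 0.6578 cents/second for the cluster and 1 cent/GB for WAN bandwidth.
function costDifferenceCents(extraBandwidthMB, sparkShuffleS, siphonShuffleS) {
  const bandwidthCost = (extraBandwidthMB / 1000) * 1.0;           // extra WAN cost
  const computeSaving = (sparkShuffleS - siphonShuffleS) * 0.6578; // idle time saved
  return bandwidthCost - computeSaving;
}

// BMM: 2911.1 MB relayed, shuffle read 97.8 s -> 48.9 s
costDifferenceCents(2911.1, 97.8, 48.9).toFixed(2); // "-29.26", matching Table 3.3
```

The same formula reproduces the ALS entry (-26.56 ¢), which supports the interpretation that compute savings dominate the extra bandwidth cost.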
3.5.2 Single Coflow Tests
Experimental Setup. In the previous experiment, Siphon works well in terms of speed-
ing up the coflows in complex machine learning workloads. However, one question remains
unanswered: how does each component of Siphon contribute to the overall reduction in
the coflow completion time? In this experiment, we use a smaller cluster to answer this
3 https://cloud.google.com/products/calculator/
[Figure 3.18: CDF of shuffle read time (presenting the run with the median job completion time), comparing Spark, Naive, Multipath, and Siphon over shuffle read times from 0 to 80 s.]
question by examining a single coflow more closely.
The cross-datacenter Spark cluster consists of 19 workers and 1 master, spanning
5 datacenters. The Spark master and driver run on a dedicated node in Oregon. The
geographical locations of the worker nodes are shown in Fig. 3.19, in which the number of
executors in each datacenter is shown in a black square. The same instance type
(n1-highmem-2) is used.
Most software configurations are the same as the settings used in Sec. 3.5.1, including
the Spark patch. In other words, the cluster still uses a fixed task placement for a given
workload.
In order to study the system performance on a single coflow, we decided
to use the Sort application from the HiBench benchmark suite [40]. Sort has only two
stages: a map stage that sorts input data locally, and a reduce stage that sorts after
a full shuffle. The only coflow is triggered at the start of the reduce stage, which
makes it easier to analyze. We prepare the benchmark by generating 2.73 GB of raw input in
HDFS. Every datacenter in the experiment stores an arbitrary fraction of the input data
without replication, and the distribution of data sizes is skewed.
We compare the shuffle-level performance achieved by the following 4 schemes, with
the hope of providing a comprehensive analysis of the contribution of each component of
Siphon:
• Spark: The vanilla Spark framework, with fixed task placement decisions, as the
baseline for comparison.
• Naive: Spark using Siphon as its data transfer service, without any flow scheduling
or routing decision makers. In this case, messages are scheduled in a round-robin
manner, and the inter-datacenter flows are sent directly through the link between
the source and destination aggregators.
• Multi-path: The Naive scheme with the multi-path routing decision maker enabled
in the controller.
• Siphon: The complete Siphon evaluated in Sec. 3.5.1. Both LFGF intra-coflow
scheduling and multi-path routing decision makers are enabled.
Job and stage level performance. Fig. 3.16 illustrates the performance of the sort
jobs achieved by the 4 schemes aforementioned across 5 runs, with respect to their job
completion times, as well as their stage completion times for both the map and reduce stages.
As expected, all 3 schemes using Siphon improve job performance by accelerat-
ing the reduce stage, as compared to Spark. With Naive, the performance improvement is
due to the higher throughput achieved by pre-established parallel TCP connections between
Siphon aggregators. The improvement of Multi-path over Naive is attributed to a further
reduction of reduce stage completion times: with multi-path routing, the network load
can be better balanced across links to achieve higher throughput and faster network
transfers. Finally, it is not surprising that Siphon, enjoying the advantages of both
intra-coflow scheduling and multi-path routing, achieves the best job performance.
To obtain fine-grained insights on the performance improvement, we break the
reduce stage completion time down further into two parts: the shuffle read time (i.e., the coflow
completion time) and the task execution time. As shown in Fig. 3.17, the improvement
of Naive over Spark is mainly attributed to a reduction of the shuffle read time. Multi-
path achieves a substantial improvement in shuffle read time over Naive, since the network
transfer completes faster by mitigating the bottleneck through multi-path routing. Siphon
achieves a similar shuffle read time to Multi-path, with a slight reduction in the task
execution time. This implies that multi-path routing is the main contributing factor to the
performance improvement, while intra-coflow scheduling helps marginally with straggler
mitigation, as expected.
Shuffle: Spark vs. Naive. To allow a more in-depth analysis of the performance
improvement achieved by the baseline Siphon (Naive), we present the CDFs of the shuffle
read times achieved by Spark and Naive, respectively, in Fig. 3.18. Compared with the
CDF of Spark, which exhibits a long tail, all the shuffle read times are reduced by ∼10 s
with Naive, thanks to the improved throughput achieved by persistent, parallel TCP
connections between aggregators.
Shuffle: intra-coflow scheduling and multi-path routing. We further study
the effectiveness of the decision makers, with Multi-path and Siphon’s CDFs presented in
Fig. 3.18.
With multi-path routing enabled, both Multi-path and Siphon achieve shorter com-
pletion times (∼50 s) for their slowest flows, compared to Naive (>60 s) with
direct routing. Such an improvement comes from the improved throughput with a
better balanced load across multiple paths. It is also worth noting that the percentage of
short completion times achieved with Multi-path is smaller than with Naive: 22% of shuffle
reads complete within 18 s with Multi-path, while 35% do with Naive. The reason
is that by rerouting flows from bottleneck links to lightly loaded ones via alternative
paths, the network load, as well as the shuffle read times, is better balanced.
It is also clearly shown that with LFGF scheduling, the completion time of the slow-
est shuffle read is almost the same as that achieved by Multi-path. This meets our
expectation, since the slowest flow will always finish at the same time regardless of the
scheduling order, given a fixed amount of network capacity.
We further illustrate the inter-datacenter traffic during the sort job run time in
Fig. 3.19, to intuitively show the advantage of multi-path routing. The sizes of the
[Figure 3.19: The summary of inter-datacenter traffic in the shuffle phase of the sort application. The five datacenters (S. Carolina, Oregon, Belgium, Tokyo, and Taiwan) host 19 executors in total (3 or 4 per datacenter, shown in black squares); the traffic volume between each pair of datacenters (roughly 148-227 MB per direction) is annotated on the corresponding link.]
[Figure 3.20: Bandwidth distribution among datacenters: available bandwidth (Mbps, roughly 150-210) on each directed link between Toronto (T), Victoria (V), and Montreal (M).]
[Figure 3.21: Average and 90th percentile CCT comparison: normalized CCT (%) by coflow size bin (<25%, 25-49%, 50-74%, ≥75%, and All); annotated values: 95.7, 98.7, 98.3, 89.6, 89.7, 94.9, 99.3, 96.7, 85.9, 99.5.]
traffic between each pair of datacenters are shown around the bidirectional arrow line,
the thickness of which is proportional to the amount of available bandwidth shown in
Table 3.1.
The narrow link from Taiwan to S. Carolina, which needs to transfer the largest
amount of data, becomes the bottleneck. With our multi-path routing algorithm, part of the
traffic is rerouted through Oregon. We can observe that the original traffic load
along this path is not heavy (only 149 MB from Taiwan to Oregon and 170 MB from
Oregon to S. Carolina), and both alternate links have more available bandwidth. This
demonstrates that our routing algorithm works effectively in selecting optimal paths to
balance loads and alleviate bottlenecks.
3.5.3 Inter-Coflow Scheduling
In this section, we evaluate the effectiveness of the Monte Carlo simulation-based inter-coflow
scheduling algorithm, by comparing the average and the 90th-percentile Coflow Comple-
tion Times (CCTs) with existing heuristics.
Testbed. To make the comparison fair, we set up a testbed on a private cloud, with 3
datacenters located in Victoria, Toronto, and Montreal, respectively. We have conducted
long-term bandwidth measurements among them, with more than 1000 samples collected
for each link. The distributions are depicted in Fig. 3.20, and are further used in the
online Monte Carlo simulation.
Benchmark. We use the Facebook benchmark [22] workload, which is a 1-hour
coflow trace from 150 workers. We assume workers are evenly distributed in the 3 data-
centers, and generate aggregated flows on inter-datacenter links. To avoid overflow, the
flow sizes are scaled down, with the average load on inter-datacenter links reduced by
30%.
Methodology. A coflow generator, together with a Siphon aggregator, is deployed
in each datacenter. All generated traffic goes through Siphon, which can enforce
inter-coflow scheduling decisions on the inter-datacenter links. As a baseline, we experiment
with the Minimum Remaining Time First (MRTF) policy, which is the state-of-the-art
heuristic with full coflow knowledge [86]. The measured CCTs are then normalized to the
performance of the baseline algorithm.
Performance. Fig. 3.21 shows that Monte Carlo simulation-based inter-coflow schedul-
ing outperforms MRTF in terms of both average and tail CCTs. Considering all coflows,
the average CCT is reduced by ∼10%. Since the coflow sizes in the workload follow a
long-tail distribution, we further categorize coflows into 4 bins, based on the total coflow
size. Apparently, the performance gain mostly stems from expediting the largest bin:
elephant coflows that can easily overlap with each other. Beyond MRTF, Monte Carlo
simulations can carefully evaluate all possible near-term coflow orderings with respect to the
unpredictable flow completion times, and enforce a decision that is statistically optimal.
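The core idea can be sketched for a single bottleneck link: sample the link bandwidth from the measured distribution, simulate each candidate ordering of the pending coflows, and pick the ordering with the smallest expected average CCT. The coflow sizes, bandwidth samples, and serial (one-coflow-at-a-time) service model below are simplifying assumptions, not the full algorithm.

```javascript
// Average CCT of coflows served serially in a given order at bandwidth bwGbps.
function avgCCT(order, sizesGb, bwGbps) {
  let t = 0, total = 0;
  for (const c of order) { t += sizesGb[c] / bwGbps; total += t; }
  return total / order.length;
}

// Monte Carlo step: evaluate each candidate ordering over sampled bandwidth
// realizations and keep the one with the lowest expected average CCT.
function bestOrdering(orderings, sizesGb, bwSamplesGbps) {
  let best = null, bestCost = Infinity;
  for (const order of orderings) {
    const cost = bwSamplesGbps
      .map(bw => avgCCT(order, sizesGb, bw))
      .reduce((a, b) => a + b, 0) / bwSamplesGbps.length;
    if (cost < bestCost) { bestCost = cost; best = order; }
  }
  return best;
}

// Shortest-first wins in this toy case, consistent with remaining-time-style
// heuristics; with overlapping elephant coflows the simulated orderings can
// diverge from the heuristic, which is where the gain in Fig. 3.21 comes from.
const chosen = bestOrdering([['big', 'small'], ['small', 'big']],
                            { small: 1, big: 10 }, [1.5, 1.8, 2.1]);
```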
3.5.4 Aggregators: Stress Tests
Switching capacities and CPU overheads. To evaluate the switching capacity and
CPU overhead of our application-layer switch in a Siphon aggregator daemon, we have
conducted two experiments with VM instances that have a different number of vCPUs.
Two Siphon aggregators, connecting with each other in a “dumbbell” topology, are de-
ployed on two VM instances, with one of them receiving data from 16 mock Spark work-
ers, and another one forwarding the data eventually to a destination worker. To avoid
bottlenecks on the network link between the aggregators before reaching their switching
capacities, we run all the VM instances in the same datacenter, and use a message size
of 1 MB, the same fragmentation size as deployed to support Spark.
The switching capacities of the second aggregator, when running on three different
types of VMs, are illustrated in Fig. 3.22, with the x-axis representing the switching
rate, i.e., the number of messages switched per core within a second, and y-axis standing
for the average CPU load of the aggregator. When running on a 2-core instance, the
switching capacity is reached when each core handles 2000 messages per second. When
4-core and 8-core instances are used, the capacities are 1600 and 1000 messages per core
per second, respectively. Due to the increasing synchronization cost incurred between
concurrent threads when writing received messages to their shared output queues, it is
unavoidable that the switching capacity per vCPU core decreases as more cores become
available in a VM instance. Fortunately, the total number of messages that an aggregator
is able to forward per second at full vCPU load (4000, or 32 Gbps, on 2 cores; 6400,
or 51.2 Gbps, on 4 cores; and 8000, or 64 Gbps, on 8 cores) is more than
sufficient to saturate inter-datacenter link capacities, or even intra-datacenter links.
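The quoted aggregate rates follow directly from the 1 MB message size. A quick check, treating 1 MB as 8 Mbit (the text's figures are consistent with this convention):

```javascript
// Aggregate forwarding rate in Gbps: messages/core/second × cores × 8 Mbit
// per 1 MB message, divided by 1000 to convert Mbit/s to Gbit/s.
const gbps = (msgsPerCorePerSec, cores) =>
  (msgsPerCorePerSec * cores * 8) / 1000;

gbps(2000, 2); // 32 Gbps on 2 cores
gbps(1600, 4); // 51.2 Gbps on 4 cores
gbps(1000, 8); // 64 Gbps on 8 cores
```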
It is worth noting that the aggregator daemon does not necessarily slow down the com-
putation on the co-located worker. Even though it consumes some CPU resources of the
[Figure 3.22: The switching capacities of a Siphon aggregator on three different types of instances: average CPU load (%) versus switching rate (messages/core/second, up to 2000) for 2-core, 4-core, and 8-core instances.]
| Data Size (MB) | 2 | 8 | 32 | 128 | 512 |
|---|---|---|---|---|---|
| Throughput-Direct (Mbps) | 51.9 | 98.1 | 159.1 | 224.1 | 250.0 |
| Throughput-Siphon (Mbps) | 95.7 | 225.9 | 270.1 | 278.6 | 279.4 |
| Improvement (%) | +84.4 | +130.3 | +69.8 | +24.3 | +11.8 |

Table 3.4: The overall throughput for 6 concurrent data fetches.
worker, the resources are shared temporally. This is because whenever the worker starts
sending data to other workers, the current phase of computation has already concluded.
The next phase of computation will not start until it has already received all data, which
indicates that the corresponding aggregator is not busy anymore.
Inter-datacenter throughput. In the third experiment, we wish to evaluate how
Siphon takes advantage of the pre-established parallel TCP connections between the
aggregators, to accelerate the data fetches in wide-area data analytics.
We deploy 6 nodes to simulate the fetchers in the N. Carolina region, and 6 servers to
respond to their fetch requests in Belgium. During the test, all fetchers start at the
same time, and each fetcher will request a file of a given size from its assigned server. As
soon as all fetchers have received all the desired data, we stop timing and calculate the
aggregated average throughput for these fetches.
The results are shown in Table 3.4. The second row of the table shows the achieved
throughput by establishing new TCP connections to fetch data, which simulates the
default data fetch strategies in Spark. The third row shows the average throughput
when the fetched data is sent using Siphon, via two aggregators deployed in the two
datacenters, respectively.
Clearly, Siphon dramatically improves the aggregated throughput of fetching data
across different datacenters, especially when the sizes of data are relatively small
(<128 MB). When the data size is 8 MB, Siphon more than doubles the throughput.
This is mostly because the pre-established connections in Siphon skip the slow-start phase
of TCP, which usually takes a long time on such inter-datacenter links.
As the data size increases (>128 MB), the throughput improvement is less significant
because direct TCP connections are able to ramp up. However, we can still benefit from
the multiplexed connections — messages are multiplexed through all parallel connec-
tions, such that stragglers are avoided. When the data size is very small (2 MB), the
improvement is less significant because the network flows are completed rapidly, and the
constant overhead to use the Siphon aggregator as a relay matters. Fortunately, Siphon
still offers >80% better aggregated throughput.
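The size dependence of these gains can be illustrated with a toy model of TCP slow start. This is a sketch for intuition only; the parameters (RTT, path rate, initial window) are illustrative assumptions, not measured values from this experiment.

```python
def transfer_time(size_mb, rtt_s, rate_mbps, warm=False):
    """Approximate time to move size_mb over one connection.

    A cold connection starts with a tiny congestion window that doubles
    every RTT (slow start); a warm, pre-established connection is assumed
    to send at the full path rate immediately.
    """
    per_rtt_mb = rate_mbps * rtt_s / 8.0       # MB deliverable per RTT at full rate
    window = per_rtt_mb if warm else 0.01      # initial window, in MB per RTT
    sent, elapsed = 0.0, 0.0
    while sent < size_mb:
        sent += min(window, per_rtt_mb)
        window = min(window * 2.0, per_rtt_mb)  # exponential growth until capped
        elapsed += rtt_s
    return elapsed

# Relative benefit of skipping slow start shrinks as transfers grow:
gain_8mb = transfer_time(8, 0.1, 250) / transfer_time(8, 0.1, 250, warm=True)
gain_512mb = transfer_time(512, 0.1, 250) / transfer_time(512, 0.1, 250, warm=True)
```

Consistent with the trend in Table 3.4, the relative gain is largest for small transfers and fades once a cold connection has had time to ramp up.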
3.6 Summary
We make four original contributions in this work:
• We have proposed a novel and practical inter-coflow scheduling algorithm for wide-area data analytics. Starting from an analysis of the network model, new challenges in inter-datacenter coflow scheduling have been identified and addressed.
• We have designed an intra-coflow scheduling policy and a multi-path routing algorithm that improve WAN utilization in wide-area data analytics.
• We have designed a novel interaction scheme between the controller and the dataplane in software-defined WANs, which can significantly reduce the overhead of rule updates.
• We have built Siphon, a transparent and unified building block that can easily extend existing data parallel frameworks with out-of-the-box capability of expediting inter-datacenter coflows.
Chapter 4
Optimizing Shuffle in Wide Area Data Analytics
In this chapter, we propose to take a systems-oriented approach to improve the bandwidth
utilization on inter-datacenter links during the shuffle phase. Previous studies show that
the shuffle phase constitutes over 60% of job completion time for network-intensive jobs [31],
and it will become an even more severe bottleneck in wide-area data analytics, where
shuffle traffic is sent via inter-datacenter WANs.
Rather than employing the common fetch-based shuffle, we enforce a push-based
shuffle mechanism, which allows early inter-datacenter transfers and reduces link idle
times. As a result, both the shuffle completion time and the job completion time can be
reduced.
Our new system framework is first and foremost designed to be practical: it has been
implemented in Apache Spark to optimize the runtime performance of wide-area analytic
jobs in a variety of real-world benchmarks. To achieve this objective, our framework
focuses on the shuffle phase, and strategically aggregates the output data of mapper tasks
in each shuffle phase to a subset of datacenters. In our proposed solution, the output
of mapper tasks is proactively and automatically pushed to be stored in the destination
datacenters, without requiring any intervention from a resource scheduler. Our solution
is orthogonal and complementary to existing task assignment mechanisms proposed in
Chapter 4. Optimizing Shuffle in Wide Area Data Analytics 63
the literature, and it remains effective even with the simplest task assignment strategy.
Compared to existing task assignment mechanisms, the design philosophy in our
proposed solution is remarkably different. The essence of traditional task assignment is
to move a computation task to be closer to its input data to exploit data locality; in
contrast, by proactively moving data in the shuffle phase from mapper to reducer tasks,
our solution improves data locality even further. As the core of our system framework, we
have implemented a new method, called transferTo(), on Resilient Distributed Datasets
(RDDs), which is a basic data abstraction in Spark. This new method proactively sends
data in the shuffle phase to a specific datacenter that minimizes inter-datacenter traffic.
It can be either used explicitly by application developers or embedded implicitly by the
job scheduler. With the implementation of this method, the semantics of aggregating the
output data of mapper tasks can be captured in a simple and intuitive fashion, making
it straightforward for our system framework to be used by existing Spark jobs.
With the new transferTo() method at its core, our new system framework enjoys a
number of salient performance advantages. First, it pipelines inter-datacenter transfers
with the preceding mappers. Starting data transfers early can help improve the utilization
of inter-datacenter links. Second, when task execution fails at the reducers, repetitive
transfers of the same datasets across datacenters can be avoided, since they are already
stored at the destination datacenter by our new framework. Finally, the application
programming interface (API) in our system framework is intentionally exposed to the
application developers, who are free to use this mechanism explicitly to optimize their
job performance.
Note that, even though the analysis and implementation in this chapter are based
entirely on Apache Spark, the push-based shuffle mechanism can be applied to other
general data analytics frameworks, including Apache Hadoop.
We have deployed our new system framework in a Spark cluster across six Amazon
EC2 regions. By running workloads from the HiBench [40] benchmark suite, we have
conducted a comprehensive set of experimental evaluations. Our experimental results
have shown that our framework speeds up the completion of general analytic jobs
by 14% to 73%. Also, with our implementation, the impact of bandwidth and delay
jitter in wide-area networks is minimized, resulting in a lower degree of performance
variation over time.
4.1 Background and Motivation
4.1.1 Fetch-based Shuffle
Both Apache Hadoop and Spark are designed to be deployed in a single datacenter. Since
datacenter networks typically have abundant bandwidth, network transfers are considered
even less expensive than local disk I/O in Spark [84]. With this assumption in mind, the
shuffle phase is implemented with a fetch-based mechanism by default. To understand
the basic idea in our proposed solution, we need to provide a brief explanation of the
fetch-based shuffle in Spark.
In Spark, a data analytic job is divided into several stages, and launched in a stage-
by-stage manner. A typical stage starts with a shuffle, when all the output data from
the previous stages is already available. The workers of the new stage, i.e., reducers in
this shuffle, will fetch the output data from the previous stages, which constitutes the
shuffle input, stored as a collection of local files on the mappers. Because the reducers are
launched at the same time, shuffle input is fetched concurrently, resulting in a concurrent
all-to-all communication pattern. For better fault tolerance, the shuffle input will not be
deleted until the next stage finishes. When failures occur on the reducer side, the related
files will be fetched from the mappers again, without the need to re-run the map tasks.
4.1.2 Problems with Fetch in Wide-Area Data Analytics
Though effective within a single datacenter, it is quite a different story when it comes
to wide-area data analytics across geographically distributed datacenters, due to limited
bandwidth availability on wide-area links between datacenters [39]. Given the potential
bottlenecks on inter-datacenter links, there are two major problems with fetch-based
shuffles.
First, as a shuffle will only begin when all mappers are finished — a barrier-like
synchronization — inter-datacenter links are usually well under-utilized most of the time,
but likely to be congested with bursty traffic when the shuffle begins. The links are
under-utilized, because when some mappers finish their tasks earlier, their output cannot
be transmitted immediately to the reducers. Yet, when the shuffle is started by all
the reducers at the same time, they initiate concurrent network flows to fetch their
corresponding shuffle input, leading to bursty traffic that may contend for the limited
inter-datacenter bandwidth, resulting in potential congestion.
Second, when failures occur on reducers with the traditional fetch-based shuffle mech-
anism, data must be fetched again from the mappers over slower inter-datacenter network
links. Since a stage will not be considered complete until all its tasks are executed
successfully, the slowest tasks, called the stragglers, directly affect the overall stage
completion time. Re-fetching shuffle input over inter-datacenter links will slow down
these stragglers even further, and negatively affect the overall job completion times.
4.2 Transferring Shuffle Input across Datacenters
To improve the performance of shuffle in wide-area data analytics, we will need to answer
two important questions: when and where should we transfer the shuffle input from
mappers to reducers? The approach we have taken in our system framework is simple
and quite intuitive: we should proactively push shuffle input as soon as any data partition
is ready, and aggregate it to a subset of worker datacenters.
[Figure: timelines for workers A and B under (a) fetch-based shuffle and (b) proactive push, showing map, shuffle write, data transfer, shuffle read and reduce phases across stages N and N+1.]
Figure 4.1: Mappers typically cannot finish their work at the same time. In this case, if we proactively push the shuffle input to the datacenter where the reducer is located (b), the inter-datacenter link will be better utilized as compared to leaving it on the mappers (a).
4.2.1 Transferring Shuffle Input: Timing
Both problems of fetch-based shuffle stem from the fact that the shuffle input is co-
located with mappers in different datacenters. Therefore, they can be solved if, rather
than asking reducers to fetch the shuffle input from the mappers, we can proactively push
the shuffle input to those datacenters where the reducers are located, as soon as such
input data has been produced by each mapper.
As an example, consider a job illustrated in Fig. 4.1. Reducers in stage N + 1 need to
fetch the shuffle input from mappers A and B, located in another datacenter. We assume
that the available bandwidth across datacenters is 1/4 of the capacity of a single datacenter network link,
which is an optimistic estimate. Fig. 4.1(a) shows what happens with the fetch-based
shuffle mechanism, where shuffle input is stored on A and B, respectively, and transferred
as soon as stage N+1 starts at t = 10. Two flows share the inter-datacenter link, allowing
both reducers to start at t = 18. In contrast, in Fig. 4.1(b), shuffle input is pushed to the
datacenter hosting the reducer immediately after it is computed by each mapper. Inter-
datacenter transfers are allowed to start at t = 4 and t = 8, respectively, without the
need for sharing link bandwidth. As a result, reducers will be able to start at t = 14.

[Figure: timelines for workers A and B under (a) fetch-based shuffle and (b) proactive push when a reduce task fails, showing the cross-datacenter re-fetch required in (a).]

Figure 4.2: In the case of reducer failures, if we proactively push the shuffle input to the datacenter where the reducer is located (Fig. 4.2(b)), data re-fetching across datacenters can be eliminated, reducing the time needed for failure recovery as compared to the case where the shuffle input is located on the mappers (Fig. 4.2(a)).
Fig. 4.2 shows an example of the case of reducer failures. With the traditional fetch-
based shuffle mechanism, the failed reducer will need to fetch its input data again from
another datacenter, if such data is stored with the mappers, shown in Fig. 4.2(a). In
contrast, if the shuffle input is stored with the reducer instead when it fails, the reducer
can read from the local datacenter, which is much more efficient.
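The timing advantage illustrated in Fig. 4.1 can be checked with a few lines of arithmetic. This is a simplified model under stated assumptions: equal-sized map outputs, each occupying the inter-datacenter link alone for 4 time units, fair bandwidth sharing among concurrent flows; the numbers are illustrative, and the function names are ours.

```python
def fetch_based_finish(stage_start, solo_times):
    """All reducers fetch at once when stage N+1 starts; with fair sharing,
    n equal-sized concurrent flows each take n times their solo duration."""
    n = len(solo_times)
    return stage_start + n * max(solo_times)

def push_based_finish(output_ready_times, solo_time):
    """Each map output is pushed as soon as it is ready; transfers queue
    on the inter-datacenter link one after another at full rate."""
    link_free = 0.0
    for ready in sorted(output_ready_times):
        link_free = max(ready, link_free) + solo_time
    return link_free

fetch_done = fetch_based_finish(stage_start=10, solo_times=[4, 4])      # -> 18
push_done = push_based_finish(output_ready_times=[4, 8], solo_time=4)   # -> 12
```

In this toy model the pushed transfers complete at t = 12, well before the t = 18 finish of the fetch-based shuffle, because neither flow ever has to share the bottleneck link.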
4.2.2 Transferring Shuffle Input: Choosing Destinations
Clearly, proactively pushing shuffle input to be co-located with reducers is beneficial,
but a new problem arises with this new mechanism: since the reduce tasks will not
be placed until the map stage finishes, how can we decide the destination hosts of the
proactive pushes?
It is indeed a tough question, because the placement of reducers is actually decided
by the shuffle input distribution at the start of each stage. In other words, our choice of
push destinations will in turn impact the reducer placement. Although this seems like a
circular dependency, we believe there are already enough hints to give a valid answer in
[Figure: three partitions of shuffle input, each split into three shards, stored across Datacenters 1 and 2 and dispatched all-to-all to Reducers A, B and C, whose output feeds the final-stage worker.]
Figure 4.3: A snippet of a sample execution graph of a data analytic job.
wide-area data analytics. Specifically, since placement decisions seek to minimize cross-datacenter
traffic, both task placement and shuffle input placement tend to concentrate at the datacenter level. We
can exploit this tendency as a vital clue.
Our analysis starts by gaining a detailed understanding of shuffle behaviors in MapReduce.
Fig. 4.3 shows a snippet of an abstracted job execution graph and its data transfers. The
depicted shuffle involves 3 partitions of shuffle input, which will then be dispatched to 3
reducers. In this case, each partition of the shuffle input is saved as 3 shards based on
specific user-defined rules, e.g., the keys in the key-value pairs. During data shuffle, each
shard will be fetched by the corresponding reducer, forming an all-to-all traffic pattern.
In other words, every reducer will access all partitions of the shuffle input, fetching the
assigned shards from each.
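This all-to-all pattern can be sketched in a few lines. This is a minimal illustration under assumptions of ours: hashing the key is used as one common example of a user-defined sharding rule, and the function names are not Spark's.

```python
def split_into_shards(partition, n_reducers):
    """Split one partition of shuffle input (key-value pairs) into
    n_reducers shards, here by hashing the key."""
    shards = [[] for _ in range(n_reducers)]
    for key, value in partition:
        shards[hash(key) % n_reducers].append((key, value))
    return shards

def reducer_fetch(all_partitions_shards, r):
    """Reducer r visits every partition and takes exactly its own shard,
    forming the all-to-all traffic pattern described above."""
    return [kv for shards in all_partitions_shards for kv in shards[r]]

# Three partitions of shuffle input, sharded for two reducers:
parts = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")], [(1, "e")]]
sharded = [split_into_shards(p, 2) for p in parts]
fetched = [reducer_fetch(sharded, r) for r in range(2)]
```

Every key-value pair reaches exactly one reducer, and all pairs sharing a key land at the same reducer, which is what makes the per-key aggregation in the next stage correct.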
We assume that shuffle input is placed in M datacenters. The sizes of the partitions
stored in these datacenters are s1, s2, . . . , sM , respectively. Without loss of generality,
let the sizes be sorted in the non-ascending order, i.e., s1 ≥ s2 ≥ . . . ≥ sM . Also, each
partition is divided into N shards, with respect to the N reducers, R1, R2, . . . , RN. Though
different in size in practice, all shards of a particular partition tend to be about the
same size for the sake of load balancing [66]. Thus, we assume the shards in a partition
are equal in size.
If a reducer Rk is placed in Datacenter ik, the total volume of its data fetched from
non-local datacenters will be

$$d^{(k)}_{i_k} = \sum_{\substack{1 \le j \le M \\ j \neq i_k}} d^{(k)}_{i_k,j} = \sum_{\substack{1 \le j \le M \\ j \neq i_k}} \frac{1}{N}\, s_j.$$

Each term in the summation, $d^{(k)}_{i_k,j}$, denotes the size of the data to be fetched from Datacenter
j.
Let S be the total size of the shuffle input, i.e., $S = \sum_{j=1}^{M} s_j$. We have

$$d^{(k)}_{i_k} = \sum_{j=1}^{M} \frac{1}{N}\, s_j - \frac{1}{N}\, s_{i_k} = \frac{1}{N}\,(S - s_{i_k}) \;\ge\; \frac{1}{N}\,(S - s_1). \quad (4.1)$$
The equality holds if and only if ik = 1. In other words, the minimum cross-datacenter
traffic can be achieved when the reducer is placed in the datacenter which stores the most
shuffle input.
The inequality Eq. (4.1) holds for every reducer Rk (k ∈ {1, 2, . . . , N}). Then, the
total volume of cross-datacenter traffic incurred by this shuffle satisfies
$$D = \sum_{k=1}^{N} d^{(k)}_{i_k} \;\ge\; N \cdot \frac{1}{N}\,(S - s_1) = S - s_1. \quad (4.2)$$

Again, the equality holds if and only if i1 = i2 = . . . = iN = 1.
Without any prior knowledge of the application workflow, we reach two conclusions for
optimizing a general wide-area data analytic job.
First, given a shuffle input distribution, the datacenter with the largest fraction of
shuffle input will be favored by the reducer placement. This is a direct corollary of
Eq. (4.2).
Second, shuffle input should be aggregated to a subset of datacenters as much as
possible. The minimum volume of data to be fetched across datacenters is S − s1.
Therefore, in order to further reduce cross-datacenter traffic in a shuffle, we should
increase s1/S, the fraction of shuffle input placed in Datacenter 1. As an extreme case, if all
shuffle input is aggregated in Datacenter 1, there is no need for cross-datacenter traffic
in future stages.
In summary, compared to a scattered placement, a better placement decision is to
aggregate all shuffle input into the subset of datacenters which already store the largest fractions.
Without loss of generality, in the subsequent sections of this chapter, we will aggregate
to a single datacenter as an example.
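These two conclusions can be verified numerically. The snippet below is a small sketch of Eqs. (4.1) and (4.2); the function names and the example sizes are ours.

```python
def remote_traffic_per_reducer(sizes, dc, n_reducers):
    """Eq. (4.1): a reducer placed in datacenter dc fetches
    (S - s_dc) / N from non-local datacenters."""
    return (sum(sizes) - sizes[dc]) / n_reducers

def best_aggregator(sizes):
    """The datacenter storing the largest fraction of shuffle input
    minimizes per-reducer, and hence total, cross-datacenter traffic."""
    return max(range(len(sizes)), key=lambda j: sizes[j])

sizes = [60, 30, 10]    # s_1 >= s_2 >= s_3, so S = 100
n = 3                   # number of reducers
best = best_aggregator(sizes)                            # -> 0 (Datacenter 1)
total = n * remote_traffic_per_reducer(sizes, best, n)   # -> 40 = S - s_1, per Eq. (4.2)
```

Placing all three reducers in the datacenter holding 60% of the shuffle input yields the minimum total of S − s1 = 40; any other placement can only fetch more remotely.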
4.2.3 Summary and Discussion
According to the analysis throughout this section, the strategy of Push/Aggregate,
i.e., proactively pushing the shuffle input to be aggregated in a subset of worker datacenters,
can be beneficial in wide-area data analytics. It can reduce both stage completion
time and network contention, thanks to higher utilization of the inter-datacenter links. Also,
duplicated inter-datacenter data transfers can be avoided in the case of task failures, further
reducing the pressure on the bottleneck links.
One may argue that even with aggregated shuffle input, a good task placement
decision is still required; otherwise, more overall inter-datacenter traffic may be generated. Indeed,
the problem itself sounds like a dilemma, where task placement and shuffle input placement
depend on each other. However, with the above hints, we are able to break the
dilemma by placing the shuffle input first. After that, a better task placement decision
can be made even by the default resource schedulers, which follow a simple, straightforward
strategy to exploit host-level data locality.
Conceptually speaking, the ultimate goal of Push/Aggregate is to proactively improve
the best data locality that a possible task placement decision can achieve. Thus, the
Push/Aggregate operations are completely orthogonal and complementary to the task
placement decisions.
4.3 Implementation on Spark
In this section, we present our implementation of the Push/Aggregate mechanism on
Apache Spark. We take Spark as an example in this chapter due to its better performance [84]
and better support for machine learning algorithms with MLlib [2]. However,
the idea can be applied to Hadoop as well.
4.3.1 Overview
In order to implement the Push/Aggregate shuffle mechanism, we need to modify
two default behaviors in Spark: i) Spark should be allowed to directly push the output of
an individual map task to a remote worker node, rather than storing it on the local disk;
and ii) the receivers of the output of map tasks should be selected automatically within
the specified aggregator datacenters.
One possible implementation would be to replace the default shuffle mechanism
completely, by enabling remote disks, located in the aggregator datacenters,
to serve as potential storage in addition to the local disk on a mapper. Though this
approach is simple and straightforward, there are two major issues.
On the one hand, although the aggregator datacenters are specified, it is hard for
mappers to decide the exact destination worker nodes to place the map output. In
Spark, it is the Task Scheduler’s responsibility to make centralized decisions on task
and data placement, considering both data locality and load balance among workers.
However, a mapper by itself, without synchronization with the global Task Scheduler,
can hardly have sufficient information to make the decision in a distributed manner, while
still keeping the Spark cluster load-balanced. On the other hand, the push will not start
until the entire map output is ready in the mapper memory, which introduces unnecessary
buffering time.
Both problems are tough to solve, requiring undesirable changes to other Spark com-
ponents such as the Task Scheduler. To tackle these issues, it is natural to ask:
rather than implementing a new mechanism to select the storage of map output, is it
possible to leave the decisions to the Task Scheduler?
Because the Task Scheduler has knowledge of computation tasks, we need to
generate additional tasks in the aggregator datacenter, whose computation is as simple
as receiving the output of mappers.
In this chapter, we add a new transformation on RDDs, transferTo(), to achieve
this goal. From a high level, transferTo() provides a means to explicitly transfer
a dataset to be stored in a specified datacenter, while the host-level data placement
decisions are made by the Spark framework itself for the sake of load balance. In addition,
we implement an optional mechanism in Spark to automatically enforce transferTo()
before a shuffle. This way, if this option is enabled, the developers are allowed to use the
Push/Aggregate mechanism in all shuffles without changing a single line of code in their
applications.
4.3.2 transferTo(): Enforced Data Transfer in Spark
transferTo() is implemented as a method of the base RDD class, the abstraction of
datasets in Spark. It takes one optional parameter, which gives all worker hosts in the
aggregator datacenter. However, in most cases, the parameter can be omitted, such that
all data will be transferred to a datacenter that is likely to store the largest fraction of
the parent RDD, as is suggested in Sec. 4.2.3. It returns a new RDD, TransferredRDD,
which represents the dataset after the transfer operation. Therefore, transferTo() can
be used in the same way as other native RDD transformations, including chaining with
other transformations.
When an analytic application is submitted, Spark will interpret the transformation by
launching an additional set of receiver tasks, whose only job is to receive all data in the parent RDD.
The Task Scheduler can then place them in the same manner as other computation tasks,
thus achieving automatic host-level load balance. Because the preferredLocations
attributes of these receiver tasks are set to be in the aggregator datacenters, the default
Task Scheduler will satisfy these placement requirements as long as the datacenters have
workers available. This way, from the application’s perspective, the parent RDD is thus
explicitly pushed to the aggregator datacenters, without violating any default host-level
scheduling policies.
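The scheduler-driven placement of these receiver tasks can be mimicked with a toy greedy scheduler. This is our own sketch, not Spark's actual scheduling logic; host names and the least-loaded heuristic are illustrative assumptions.

```python
def place_receiver_tasks(n_tasks, host_loads, preferred_hosts):
    """Place each receiver task on the least-loaded host among its
    preferredLocations, keeping the aggregation load-balanced; fall
    back to any known host if no preferred host is available."""
    placement = []
    for _ in range(n_tasks):
        candidates = [h for h in preferred_hosts if h in host_loads] or list(host_loads)
        chosen = min(candidates, key=lambda h: host_loads[h])
        host_loads[chosen] += 1
        placement.append(chosen)
    return placement

# Two hosts in the aggregator datacenter (a1, a2) and one outside (b1):
loads = {"a1": 0, "a2": 0, "b1": 0}
plan = place_receiver_tasks(4, loads, preferred_hosts=["a1", "a2"])
```

All four receiver tasks land inside the aggregator datacenter, spread evenly across its hosts, which is the behavior the preferredLocations mechanism delegates to the Task Scheduler.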
An added benefit of transferTo() is that, since the receiver tasks require no
shuffle from the parent RDD, they can be pipelined with the preceding computation
tasks. In other words, if transferTo() is called upon the output dataset of a map task,
the actual data transfer will start as soon as there is a fraction of data available, without
waiting until the entire output dataset is ready. This pipelining feature is enabled by
Spark without any further change, which automatically solves the second issue mentioned
in Sec. 4.3.1.
It is worth noting that transferTo() can be directly used as a developer API. For
developers, it provides the missing function that allows explicit data migration across
worker nodes. transferTo() enjoys the following desirable features:
Non-Intrusiveness and Compatibility. The introduction of the new API modifies
no original behavior of the Spark framework, maximizing compatibility with existing
Spark applications. In other words, changes made to the Spark codebase regarding
transferTo() are purely incremental, rather than intrusive. Thus, our
patched version of Spark maintains 100% backward compatibility with legacy code.
Consistency. The principal programming concept remains consistent. In Spark, an RDD
is the abstraction of a dataset. The APIs allow developers to process a dataset by
applying transformations on the corresponding RDD instance. The implementation of
[Figure: task graphs for InputRDD.map(…).reduce(…) (a) and InputRDD.map(…).transferTo([A]).reduce(…) (b), with input partitions A1, A2 and B1, showing the preferredLocations of each task.]
Figure 4.4: An example to show how the preferredLocations attribute works without (a) or with (b) the transferTo() transformation. A* represents all available hosts in datacenter A, while Ax represents the host which is selected as the storage of the third map output partition.
transferTo() inherits the same principle.
Minimum overhead. transferTo() strives to eliminate unnecessary overhead introduced
by enforced data transfers. For example, if a partition of the dataset is already located
in the specified datacenter, no cross-node transfer is made. Also, unnecessary disk I/O
is avoided.
4.3.3 Implementation Details of transferTo()
As a framework for building big data analytic applications, Spark strives to serve its
developers. By letting the framework itself make a multitude of miscellaneous decisions
automatically, the developers are no longer burdened by the common problems in distributed
computing, e.g., communication and synchronization among nodes. Spark thus provides
such a high-level abstraction that developers are allowed to program as if the cluster were
a single machine.
An easier life comes at the price of less control. The details of distributed computing,
including communications and data transfers among worker nodes, are completely
hidden from the developers. This poses a challenge for implementing transferTo(), where
we intend to explicitly control cross-node transfers of intermediate data: Spark does not
expose such functionality to application developers.
Here we close this gap, by leveraging the internal preferredLocations attribute of
an RDD.
♦ preferredLocations in Spark
preferredLocations is a native attribute of each partition of every RDD, used to specify
host-level data locality preferences. It plays an important role when the Task Scheduler
tries to place the corresponding computation on individual worker nodes: the
Task Scheduler treats preferredLocations as a list of higher-priority hosts, and strives
to satisfy the placement preferences whenever possible.
A simple example is illustrated in Fig. 4.4 (a), where the input dataset is transformed
by a map() and a reduce(). The input RDD has 3 partitions, located on two hosts in
Datacenter A and one host in Datacenter B, respectively. Thus, 3 corresponding map
tasks are generated, with preferredLocations the same as input data placement. Since
the output of map tasks is stored locally, the preferredLocations of all reducers will
be the union of the mapper hosts.
This way, the Task Scheduler can have enough hints to place tasks to maximize host-
level data locality and minimize network traffic.
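The propagation of preferredLocations in this example can be written down directly. This is a sketch mirroring Fig. 4.4(a); the helper names and host labels are ours.

```python
def map_prefs(input_hosts):
    """Each map task prefers the host(s) storing its input partition."""
    return [sorted(hosts) for hosts in input_hosts]

def reduce_prefs(map_output_hosts):
    """Map output is stored locally on the mappers, so every reducer,
    which reads all partitions, prefers the union of the mapper hosts."""
    return sorted(set().union(*[set(h) for h in map_output_hosts]))

# Three input partitions: two on DC-A hosts, one on a DC-B host.
inputs = [["A1"], ["A2"], ["B1"]]
```

With these inputs, the map tasks inherit the input placement unchanged, and every reducer's preference list is the union {A1, A2, B1}, exactly the hint structure the Task Scheduler uses to maximize host-level locality.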
♦ Specifying the Preferred Locations for transferTo() Tasks
In our implementation of transferTo(), we generate an additional computation task
right after each map task, whose preferredLocations attribute filters out all hosts that
are not in the aggregator datacenters.
Why do we launch new tasks, rather than directly changing the preferredLocations
of mappers? The reason is simple: if mappers are directly placed in the aggregator datacenter,
it will be the raw input data that is transferred across datacenters. This is usually
undesirable, because map output is very likely to be smaller than the raw input data.
If the parent mapper is already located in the aggregator datacenter, the generated task
does nothing. Otherwise, the parent partition of map output needs to be transferred,
and the corresponding task provides a list of all worker nodes in the aggregator
datacenter as its preferredLocations. In the latter case, the Task Scheduler selects
one worker node from the list to place the task, which simply receives output from the
corresponding mapper.
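This rule can be summarized in a few lines. This is our own sketch of the preference computation; the host-to-datacenter mapping below is an illustrative assumption.

```python
def transfer_task_prefs(output_host, host_to_dc, aggregator_dc):
    """preferredLocations of the transferTo() task following one map task:
    stay put if the map output is already in the aggregator datacenter,
    otherwise prefer every worker host of that datacenter."""
    if host_to_dc[output_host] == aggregator_dc:
        return [output_host]          # transparent: no cross-datacenter move
    return sorted(h for h, dc in host_to_dc.items() if dc == aggregator_dc)

# Two hosts in datacenter A, one in datacenter B:
hosts = {"A1": "A", "A2": "A", "B1": "B"}
```

For map output already on A1, the task is a no-op; for output on B1, any host of datacenter A is acceptable, and the Task Scheduler picks one, which is exactly the behavior shown in Fig. 4.4(b).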
As another example, Fig. 4.4(b) shows how transferTo() can impact the preferredLocations
of all tasks in a simple job. As compared to Fig. 4.4(a), the map output is explicitly
transferred to Datacenter A. Because the first two partitions are already placed in Datacenter
A, the two corresponding transferTo() tasks are completely transparent. On the other
hand, since the third partition originated in Datacenter B, the subsequent transferTo()
task should prefer any host in Datacenter A. As a result of task execution, the map
output partition will eventually be transferred to a random host in Datacenter A, which is
selected by the Task Scheduler. Finally, since all input of the reducers is in Datacenter A, the
shuffle can happen within a single datacenter, realizing the Push/Aggregate mechanism.
Note that we can omit the destination datacenter of transferTo(). If no parameter is
provided, transferTo() will automatically decide the aggregator datacenter, by selecting
the one with the most partitions of map output.
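The default selection rule is essentially a one-liner. This sketch is ours; ties between datacenters are broken arbitrarily here.

```python
from collections import Counter

def pick_aggregator_dc(map_output_dcs):
    """Default destination when transferTo() gets no parameter: the
    datacenter holding the most partitions of map output."""
    return Counter(map_output_dcs).most_common(1)[0][0]
```

For example, with two map output partitions in datacenter A and one in B, `pick_aggregator_dc(["A", "B", "A"])` selects A, so only the single partition in B needs to cross datacenters.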
♦ Optimized Transfers in the case of Map-side Combine
There is a special case in some transformations, e.g., reduceByKey(), which require
a MapSideCombine before a shuffle. Strictly speaking, MapSideCombine is a part of the reduce
task, but it allows the output of map tasks to be combined on the mappers before being
sent through the network, in order to reduce the traffic.
In wide-area data analytics, it is critical to reduce cross-datacenter traffic for the sake
of performance. Therefore, our implementation of transferTo() makes a smart decision
by performing MapSideCombine before the transfer whenever possible. In transferTo(),
we pipeline any MapSideCombine operations with the preceding map task, and avoid
repetitive computation on the receivers before writing the shuffle input to disks.
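The saving from combining before the push can be seen in a minimal example. This is our sketch of a map-side combiner, not Spark's actual Aggregator implementation.

```python
def map_side_combine(map_output, combine):
    """Merge values sharing a key on the mapper before the push, so less
    data crosses the inter-datacenter link (e.g., for reduceByKey)."""
    merged = {}
    for key, value in map_output:
        merged[key] = combine(merged[key], value) if key in merged else value
    return sorted(merged.items())

raw = [("a", 1), ("b", 2), ("a", 3), ("a", 5)]
pushed = map_side_combine(raw, lambda x, y: x + y)   # -> [("a", 9), ("b", 2)]
```

Four key-value pairs shrink to two before leaving the mapper, and the receiver can write the combined records to disk directly without repeating the computation.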
4.3.4 Automatic Push/Aggregate
Even though transferTo() is sufficient to serve as the fundamental building block of
Push/Aggregate, a mechanism is required to apply transferTo() automatically, without
explicit intervention from the application developers. To this end, we modified
the default DAGScheduler component in Spark, to add an optional feature that automat-
ically inserts transferTo() before all potential shuffles in the application.
The programmers can enable this feature by setting a property option, spark.shuffle.aggregation,
to true in their Spark cluster configuration file or in their code. We did not enable
this feature by default for backward compatibility considerations. Once enabled, the
transferTo() method will be embedded implicitly and automatically to the code before
each shuffle, such that the shuffle inputs can be pushed to the aggregator datacenters.
Specifically, when a data analytic job is submitted, we use the DAGScheduler to embed
the necessary transferTo() transformations into the originally submitted code. In Spark,
DAGScheduler is responsible for rebuilding the entire workflow of a job based on con-
secutive RDD transformations. Also, it decomposes the data analytic job into several
shuffle-separated stages.
Since DAGScheduler natively identifies all data shuffles, we propose to add a transferTo()
transformation ahead of each shuffle, such that the shuffle input can be aggregated.
Fig. 4.5 illustrates an example of implicit transferTo() embedding. Since groupByKey()
triggers a shuffle, the transferTo() transformation is embedded automatically right
before it, to start proactive transfers of the shuffle input.
Original code:

    val InRDD = In1 + In2
    InRDD.filter(…).groupByKey(…).collect()

Produced code (processed by DAGScheduler):

    val InRDD = In1 + In2
    InRDD.filter(…).transferTo(…).groupByKey(…).collect()

Figure 4.5: An example of implicit embedding of the transferTo() transformation. transferTo() aggregates all shuffle input in DC1 before the groupByKey() transformation starts. For the partition natively stored in DC1, transferTo() simply does nothing.

Note that because transferTo() is inserted automatically, no parameter is provided
to the method. Therefore, it works the same way as if the aggregator datacenter
were omitted, i.e., the datacenter that generates the largest fraction of shuffle input
is chosen. We approximate the optimal selection by choosing the datacenter storing the
largest amount of map input, which is information known to the MapOutputTracker
at the beginning of the map task.
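The selection heuristic amounts to an argmax over per-datacenter input sizes. A minimal sketch (the datacenter names and byte counts here are made up for illustration):

```python
def pick_aggregator(map_input_bytes):
    """Choose the datacenter holding the largest amount of map input,
    approximating the one that will generate the most shuffle input."""
    return max(map_input_bytes, key=map_input_bytes.get)

# Hypothetical per-datacenter map input sizes, in bytes.
sizes = {"DC1": 6 * 2**30, "DC2": 2 * 2**30, "DC3": 1 * 2**30}
print(pick_aggregator(sizes))  # -> DC1
```

When transferTo() is called explicitly with a destination, that argument overrides this heuristic, as described in Section 4.3.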
4.3.5 Discussion
In addition to the basic feature that enforces aggregation of shuffle input, the
implementation of transferTo() raises several points worth discussing in wide-area
data analytics.
Reliability of Proactive Data Transfers. One possible concern is that, since the
push mechanism for shuffle input is a new feature in Spark, the reliability of computation,
e.g., fault tolerance, might be compromised. However, this is not the case.
Because transferTo() is implemented by creating additional receiver tasks rather
than changing any internal implementations, all native features provided by Spark are
inherited. From the Spark framework's perspective, the introduced proactive data transfers
are the same as regular data exchanges between a pair of worker nodes. Therefore,
in case of failure, built-in recovery mechanisms, such as retries or relaunches, will be
triggered automatically in the same manner.
Expressing Cross-region Data Transfers as Computation. Essentially, transferTo()
provides a new interpretation of inter-datacenter transfers. In particular, they can be
expressed in the form of computation, since transferTo() is implemented as a transformation.
This conforms with our intuition that moving a large volume of data across
datacenters consumes both computation and network resources, comparable to those of
a normal computing task.
This concept can help in several ways. For example, inter-datacenter data transfers
can be shown in the Spark WebUI, which is helpful for debugging wide-area
data analytics jobs by visualizing the critical inter-datacenter traffic.
Implicit vs. Explicit Embedding. Instead of implicitly embedding transferTo()
using DAGScheduler, developers are allowed to explicitly control data placement
at the granularity of datacenters. In some real-world data analytics applications, this is
meaningful because the developers always know their data better.
For example, it is possible in production that the shuffle input is larger in size than
the raw data. In this case, to minimize inter-datacenter traffic, it is the raw data rather
than the shuffle input that should be aggregated. The developers can be fully aware of
this situation; however, it is difficult for the Spark framework itself to make this call,
resulting in an unnecessary waste of bandwidth.
Another example is cached datasets. In Spark, developers are allowed to
call cache() on any intermediate RDD, in order to persist the represented dataset in
memory. These cached datasets will not be garbage collected until the application exits.
In practice, intermediate datasets that will be used several times in an application
should be cached to avoid repetitive computation. In wide-area data analytics, caching
these datasets across multiple datacenters is extremely expensive, since reusing them will
induce repetitive inter-datacenter traffic. Fortunately, with the help of transferTo(),
developers can cache after all data is aggregated in a single datacenter,
avoiding duplicated cross-datacenter traffic.
Limitations. Even though the Push/Aggregate shuffle has many desirable properties,
it does have limitations that users should be aware of. The effectiveness of transferTo()
relies on sufficient computation resources in the aggregator datacenter: it will launch
additional tasks there, consuming more computation resources. If the chosen aggregator
datacenter cannot complete all reduce tasks because of insufficient resources, the reducers
will eventually be placed in other datacenters, which would be less effective.
We think this limitation is acceptable in wide-area data analytics for two reasons. On
the one hand, Push/Aggregate essentially trades more computation resources for lower job
completion times and less cross-datacenter traffic. Because cross-datacenter network
resources are the bottleneck in wide-area data analytics, this trade-off is reasonable. On
the other hand, in practice, it is common for a Spark cluster to be shared by multiple jobs,
such that the available resources within one datacenter are more than enough for a single
job. Besides, when the cluster is multiplexed by many concurrent jobs, it is very likely
that the workload can be rebalanced across datacenters, keeping utilization high.
4.4 Experimental Evaluation
In this section, we present a comprehensive evaluation of our proposed implementation.
Our experiments are deployed across multiple Amazon Elastic Compute Cloud (EC2)
regions. Selected workloads from HiBench [40] are used as benchmarks to evaluate both
the job-level and the stage-level performance.
Figure 4.6: The geographical locations and the number of instances in our Spark cluster. Instances in 6 different Amazon EC2 regions are employed: N. Virginia, N. California, São Paulo, Frankfurt, Singapore, and Sydney. Each region has 4 instances running, except N. Virginia, where two extra special nodes are deployed.
Workload    Specification
WordCount   The total size of generated input files is 3.2 GB.
Sort        The total size of generated input data is 320 MB.
TeraSort    The input has 32 million records. Each record is 100 bytes in size.
PageRank    The input has 500,000 pages. The maximum number of iterations is 3.
NaiveBayes  The input has 100,000 pages, with 100 classes.

Table 4.1: The specifications of the five workloads used in the evaluation.
The highlights of our evaluation results are as follows:
1. Our implementation speeds up workloads from the HiBench benchmark suite, re-
ducing the average job completion time by 14% ∼ 73%.
2. The performance is more predictable and stable, despite the bandwidth jitters on
inter-datacenter links.
3. The volume of cross-datacenter traffic can be reduced by about 16% ∼ 90%.
4.4.1 Cluster Configurations
Amazon EC2 is one of the most popular cloud service providers today. It provides
computing resources hosted in datacenters around the globe. Since EC2 is
a production environment for a great number of big data analytics applications, we decided
to run our experimental evaluation by leasing instances across regions.
Cluster Resources. We set up a Spark cluster with 26 nodes in total, spanning 6
geographically distributed regions on different continents, as shown in Fig. 4.6.
Four worker nodes are leased in each datacenter. The Spark master node and the Hadoop
Distributed File System (HDFS) NameNode are deployed on 2 dedicated instances in the
N. Virginia region, respectively.
All instances in use are of the type m3.large, which has 2 vCPUs, 7.5 GB of memory,
and a 32 GB Solid-State Drive (SSD) for disk storage. The network performance of the
instances is reported as “moderate.” Our measurements show that there is approximately
1 Gbps of bandwidth capacity between a pair of instances within a region. However, the
cross-region network capacity varies over time. Our preliminary investigation is consistent
with previous empirical studies [39, 57]: the available bandwidth of inter-datacenter links
fluctuates greatly. Some links can have as little as 80 Mbps of capacity, while others
may have up to 300 Mbps of bandwidth.
Software Settings. The instances in our cluster run Ubuntu 14.04 LTS 64-bit
(HVM). To set up a distributed file system, we use HDFS from Apache Hadoop 2.6.4.
Our implementation is developed based on Apache Spark 1.6.1, built with Java 1.8 and
Scala 2.11.8. The Spark cluster is started in standalone mode, without the intervention
of external resource managers. This way, we leave Spark's internal data locality
mechanism to make task placement decisions in a coarse-grained and greedy manner.
Workload Specifications. Within the cluster, we run five selected workloads from
the HiBench benchmark suite: WordCount, Sort, TeraSort, PageRank, and NaiveBayes.
These workloads are good candidates for testing the efficiency of data analytics frameworks,
with increasing complexity. Among them, WordCount is the simplest, involving
only one shuffle. PageRank and NaiveBayes are relatively more complex and require
several iterations at runtime, with multiple consecutive shuffles; they are two representative
machine learning workloads. The workloads are configured to run at “large scale,”
which is one of the default options in HiBench. The specifications of their settings are
listed in Table 4.1. The maximum parallelism of both map and reduce is set to 8, as
there are 8 cores available within each datacenter.

Figure 4.7: The average job completion time under different HiBench workloads. For each workload, we present a 10% trimmed mean over 10 runs, with an error bar representing the interquartile range as well as the median value.
Baselines. We use two naive solutions in wide-area data analytics, referred to as
“Spark” and “Centralized”, as the baselines against which our proposed shuffle mechanism
is compared. “Spark” represents the deployment of Spark across geo-distributed
datacenters, without any optimization for the wide-area network; job execution is
completely blind to the network bottleneck. The “Centralized” scheme refers to the naive
and greedy solution in which all raw data is sent to a single datacenter before being processed.
After all data is centralized within a cluster, Spark works within a single datacenter to
process it.
As a comparison, Spark patched with our proposed shuffle mechanism is referred
to as “AggShuffle” in the remainder of this section, meaning that the shuffle input is
aggregated in a single datacenter. Note that we do not explicitly embed the transferTo()
transformation; only the implicit transformations are involved in the experiments, leaving
the benchmark source code unchanged.
4.4.2 Job Completion Time
The completion time of a data analytics job is the primary performance metric. Here,
we report the measurement results from HiBench, which records the duration of running
each workload. With 10 iterative runs of the 5 different workloads, the mean and the
distribution of completion times are depicted in Fig. 4.7. Note that running Spark
applications across EC2 regions is subject to unpredictable network performance, as the
available bandwidth and network latency fluctuate dramatically over time. As a result,
running the same workload with the same execution plan at different times may result
in very different performance. To eliminate the incurred randomness as much as possible,
we introduce the following statistical methods to process the data.
Trimmed average of the job completion time. The bars in Fig. 4.7 report the 10%
trimmed mean of the job completion time measurements over 10 runs. In particular,
the maximum and the minimum values are discarded before we compute the average.
This methodology, in a sense, eliminates the impact of the long-tail distribution on the
mean.
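The statistic can be reproduced as follows; with 10 samples, a 10% trimmed mean discards exactly the single minimum and maximum before averaging (a sketch of the methodology, not the HiBench reporting code; the sample values are invented):

```python
def trimmed_mean(samples, trim=0.1):
    """Drop the lowest and highest `trim` fraction of samples, then average.
    With 10 samples and trim=0.1, exactly the min and max are discarded."""
    k = int(len(samples) * trim)
    ordered = sorted(samples)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)

# Ten hypothetical job completion times (seconds), with one long-tail outlier.
runs = [102, 98, 105, 97, 300, 101, 99, 103, 96, 100]
print(trimmed_mean(runs))  # -> 100.625; the 300 s outlier no longer skews the mean
```

For comparison, the plain mean of these samples is 120.1 s, dominated by the single outlier, which illustrates why trimming is used here.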
According to Fig. 4.7, AggShuffle offers the best performance among all three schemes
in the evaluation. For example, AggShuffle shows as much as a 73% and a 63% reduction in
job completion time, as compared to Spark and Centralized, respectively. Under other
workloads, using Spark as the baseline, our mechanism achieves at least a 15% performance
gain in terms of job durations.
As compared to the Centralized mechanism, we can easily see that AggShuffle is still
beneficial, except for TeraSort. Under the TeraSort workload, the average job completion
time with Centralized is only 4% higher, which is a minor improvement in practice. The
reason lies in the TeraSort algorithm. In the HiBench implementation, there
is a map transformation before all shuffles, which actually bloats the input data size.
In other words, the input of the first shuffle is even larger in size than the
raw input. Consequently, extra data will be transferred to the destination datacenter,
incurring unnecessary overhead. Looking ahead, this analysis is supported by the cross-
datacenter traffic measurements shown in Fig. 4.8. TeraSort turns out to be a perfect
example of the necessity of developers' interventions: only the application
developers can tell beforehand that the data size will increase. This problem can be resolved
by explicitly calling transferTo() before the map, and we can then expect further
improvement from AggShuffle.
Interquartile range and the median of the job completion time. In addition to
the average job completion time, we think the distribution of durations of the same job
matters in wide-area data analytics. To provide this distribution information in
Fig. 4.7, according to our measurements over 10 iterative runs, we add the
interquartile range and the median to the figure as error bars. The interquartile range
shows the range from the 25th percentile to the 75th percentile of the distribution, and
the median value is shown as a dot in the middle of each error bar.
Fig. 4.7 clearly shows that AggShuffle outperforms both other schemes in terms of
minimizing the variance. In other words, it can provide wide-area data analytics
applications with more stable, and hence more predictable, performance. This is an important
feature: as the experimental results suggest, even when running in the same environment
settings, the completion time of a wide-area analytics job varies significantly over time.
We argue that the ability to limit the variance of data analytics frameworks is a performance
metric that has been overlooked in the literature.
The reason for AggShuffle's stability is two-fold. On the one hand, the major source of
performance fluctuation is the network: unlike datacenter networks, the wide-area links
interconnecting datacenters are highly unstable, with no performance guarantees. Flash
congestion and temporary losses of connectivity are common, and their impact is magnified
in the job completion times. On the other hand, AggShuffle initiates early data transfers
without waiting for the reducers to start. This way, concurrent bulk traffic on bottleneck
links is smoothed over time, with less link sharing and a better chance for data transfers
to complete quickly.

Figure 4.8: Total volume of cross-datacenter traffic under different workloads.
As for TeraSort, rather than offering help, our proposed aggregation of shuffle input
actually burdens the cross-datacenter network. Again, this can be resolved by explicitly
invoking transferTo() for optimality.
4.4.3 Cross-Region Traffic
The volume of cross-datacenter traffic incurred by wide-area analytics applications is
another effective metric for evaluation. During our experiments on EC2, we tracked the
cross-datacenter traffic among Spark worker nodes. The averages of our measurements are
shown in Fig. 4.8. Note that in this figure, the “Centralized” scheme indicates the cross-region
traffic required to aggregate all data into the centralized datacenter.
Except for TeraSort, in which transferTo() is automatically called on a bloated
dataset, all other workloads in the evaluation enjoy much lower bandwidth usage
with AggShuffle. As shuffle input is proactively aggregated in early stages and all further
computation is likely to be scheduled within one datacenter, cross-datacenter traffic
is reduced significantly on average. In particular, it is worth noting that under the
PageRank workload, the required cross-datacenter traffic is reduced by 91.3%.
Fig. 4.8 also shows that the “Centralized” scheme requires the least cross-datacenter
traffic in TeraSort among the three schemes, which is consistent with the conclusion of
our previous discussion.
4.4.4 Stage Execution Time
In Fig. 4.9, we break down the execution of different workloads by putting them under
the microscope. Specifically, we show the detailed average stage completion times in
our evaluation. The length of the stacked bars represents the trimmed average execution
time of each stage under a specific workload. Again, the error bars show the interquartile
ranges and median values.
As inferred from the large variances, any stage in Spark may suffer from degraded
performance, most likely due to poor data locality. In comparison, the Centralized strategy
usually performs well in late stages, while having the longest average completion time in
early stages, presumably the result of collecting all raw data in the early stages. AggShuffle,
however, finishes both early and late stages quickly. Similar to the Centralized scheme,
it offers an exceptionally low variance in the completion time of late stages.
Although different stages under different workloads have specific features and patterns,
we are still able to provide some useful insights. The “magic” behind AggShuffle is
that it proactively improves data locality during shuffle phases, without the need to
transfer excessive data. As shuffle input is aggregated in a smaller number of
datacenters, the achievable data locality is high enough to guarantee better performance.
Note that in Fig. 4.9, the total completion time of all stages is not necessarily equivalent
to the job completion time presented in Fig. 4.7. First, though stacked together in
Fig. 4.9, some of the stages may overlap with each other at runtime, so their summation
does not directly yield the total job completion time. Second, stage completion time is
measured and reported by Spark, while job completion time is measured by HiBench,
using different definitions. Third, cross-stage delays such as scheduling and queuing are
not covered by the stage completion time measurements.
4.5 Summary
In this chapter, we have designed and implemented a new system framework that optimizes
network transfers in the shuffle stages of wide-area data analytics jobs. The gist of
our new framework lies in the design philosophy that the output data from mapper tasks
should be proactively aggregated to a subset of datacenters, rather than passively fetched
as they are needed by reducer tasks in the shuffle stage. The upshot of such proactive
aggregation is that data transfers can be pipelined and started as soon as computation
finishes, and do not need to be repeated when reducer tasks fail. The core of our new
framework is a simple transferTo transformation on Spark RDDs, which allows it to
be implicitly embedded by the Spark DAG scheduler, or explicitly added by application
developers. Our extensive experimental evaluations with the HiBench benchmark suite
on Amazon EC2 have clearly demonstrated the effectiveness of our new framework, which
is complementary to existing task assignment algorithms in the literature.
Figure 4.9: Stage execution time breakdown under each workload (WordCount, Sort, TeraSort, PageRank, and NaiveBayes; Spark, Centralized, and AggShuffle). Each segment in the stacked bars represents the life span of a stage.
Chapter 5
A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics
Graph analytics is an important category of big data analytics applications. It serves as
the foundation of many popular Internet services, including PageRank-based Web search [16]
and social networking [17]. These services are typically deployed at a global scale, with
user-related data naturally generated and stored in geographically distributed, commodity
datacenters [32, 64, 65, 77]. In fact, popular cloud providers, such as Amazon,
Microsoft, and Google, all operate tens of datacenters around the world [38], offering
convenient access to both storage and computing resources.
A wide variety of applications would benefit if the performance of processing geographically
distributed graph data could be improved. In Internet-scale graph analytics, production
graphs are typically as large as billions of vertices and trillions of edges [48, 74],
taking terabytes of storage. For example, it is reported that Web search engines operate
on an indexable Web graph consisting of an ever-growing 50 billion websites and one
trillion hyperlinks between them [48]. Such large scales with rapid rates of change [38],
coupled with the high costs of Wide-Area Network (WAN) data transfers [64] and possible
regulatory constraints [79], make it expensive, inefficient, or simply infeasible to centralize
the entire dataset in a central location, even though this is a commonly used
approach [9, 51] for analytics.
Therefore, it is critical to design efficient mechanisms to run graph analytics applica-
tions in a geographically distributed manner across multiple datacenters in a paradigm
called wide-area graph analytics. In particular, the fundamental challenge is to process
the graph with raw input data stored and computing resources distributed in globally
operated datacenters, which are inter-connected by wide-area networks (WANs).
Unfortunately, existing distributed graph analytics frameworks are not sufficiently
competent to address such a challenge. Representative works in the literature, e.g.,
Pregel [59], PowerGraph [29] and GraphX [30], are solely designed and optimized for
processing graphs within a single datacenter. Gemini [90], one of the state-of-the-art
solutions, even assumes a high-performance cluster with 100 Gbps of bandwidth capacity
between worker nodes. Unfortunately, this assumption is far from the reality of inter-
datacenter WANs, whose available capacity is typically hundreds of Mbps [38].
In this chapter, we argue that the inefficiency of wide-area graph analytics stems from
the Bulk Synchronous Parallel (BSP) model [75], which is the dominant synchronization
model implemented by most popular graph analytics engines [82]. The primary
reason for its popularity is that BSP works seamlessly with the vertex-centric programming
abstraction [59], which eases the development of graph analytics applications. Many
graph algorithms and optimizations, namely vertex programs, are exclusively designed
with such an abstraction and BSP [68, 83] in mind.
In a vertex program under BSP, the application runs in a sequence of “supersteps,”
or iterations, which apply updates to vertices and edges iteratively. Message passing
and synchronization occur between two consecutive supersteps, while local computation
is performed within each superstep. Since each superstep typically allows communication
between neighboring vertices only, it takes at least k supersteps until the algorithm converges
on a graph whose diameter is k [83]. Thus, k message passing phases will happen
in serial, incurring excessive — and not always necessary [87] — inter-datacenter traffic
in wide-area graph analytics. In conclusion, BSP is designed for deployments where the
communication cost among workers is uniformly low. When adopted in wide-area
graph analytics, it generates excessive inter-datacenter traffic, lowering job performance.
One possible solution is to loosen the BSP model by allowing asynchronous updates
on different graph partitions. This way, new iterations of computation are able to proceed
with partially stale vertex/edge properties, relaxing the hard requirement on inter-
datacenter communication to a best-effort model. Existing systems implementing such
an asynchronous parallel model include GraphUC [33] and Maiter [87]. However, neither
system can guarantee the convergence or the correctness of graph applications [82].
Our objective is to design a new synchronization model for wide-area graph analytics,
which satisfies three important requirements:
1. WAN efficiency. The new model should require fewer rounds of inter-datacenter
communication and generate less inter-datacenter traffic.
2. Correctness. The new model should ensure the application can return the same
result as if it was executed under BSP.
3. Transparency. The new model should require absolutely no change to the existing
applications, by retaining the same set of vertex-centric abstraction APIs.
In this chapter, we introduce Hierarchical Synchronous Parallel (HSP), a novel
synchronization model designed for efficiency in wide-area graph analytics. In contrast to
BSP, which requires complete, global synchronizations among all worker nodes in all
datacenters, HSP allows partial, local synchronizations within each datacenter as additional
update steps. Specifically, HSP automatically switches between two modes of execution,
global and local, in a two-level hierarchical organization. The global mode is the
same as BSP, where all datacenters respond to central coordination. The local mode, on
the other hand, allows each datacenter to work autonomously without coordinating with
the others. Our theoretical analysis shows that, if the mode switch happens strategically,
HSP can guarantee the convergence and correctness of all vertex programs. In addition,
if the implementation of the vertex program is considered practical [83], HSP can en-
sure a much higher rate of convergence, as compared to BSP with the same amount of
inter-datacenter traffic generated.
We have implemented the HSP model on GraphX [30], an open-source general graph
analytics framework built on top of Apache Spark [84]. The original implementation of
GraphX supports the BSP vertex-centric programming abstraction. In our prototype,
we have extended the framework with HSP, by allowing synchronization to be bounded
within a single datacenter, and by implementing a feature on a central coordinator that
automatically switches the mode of execution. With our implementation, we have performed
an extensive evaluation of HSP in five real geographically distributed datacenters on
Google Cloud, using three empirical benchmark workloads on two large-scale real-world
graph datasets. The results show that HSP is efficient in running wide-area graph analytics:
it requires significantly fewer cross-datacenter synchronizations before guaranteed
algorithm convergence, and reduces WAN bandwidth usage by 22.4% to 32.2%. The
monetary cost of running graph applications can be reduced by up to 30.4%.
5.1 Background and Motivation
Graphs in production are typically too large to be efficiently processed by a single machine
[59]. Distributed graph analytics frameworks are thus developed to run graph analytics
in parallel on multiple worker nodes. Before running the actual analytics, the input graph
is divided into several partitions, each of which is held and processed by a worker. The
framework then handles the synchronization and necessary message passing among
workers automatically, allowing developers to work solely on the analytics logic itself.
Most of the state-of-the-art solutions [29, 30, 90] provide a vertex-centric abstraction
for developers to work on — similar to Google’s Pregel — and implement the Bulk Syn-
chronous Parallel (BSP) model for inter-node synchronization [82].

Figure 5.1: A Connected-Component algorithm executed under different synchronization models: (a) supersteps under BSP; (b) allowing additional local synchronization. White circles indicate active vertices, and the arrows represent message passing.

Such an integration
of programming abstraction and synchronization model allows developers to “think like
a vertex,” making the development of graph analytics applications intuitive and easy to
debug.
Even though it is well known that the BSP model requires excessive communication,
the bandwidth capacity among workers is seldom considered a system bottleneck [90].
When deployed within a high-performance cluster where bandwidth is readily abundant,
BSP performs well with a large number of system optimization techniques, such as
advanced graph partitioning strategies and load balancing. Unfortunately, this is no longer
true in wide-area data analytics, where inter-datacenter data transfers incur a much
higher cost, in terms of both time and monetary expense.
Fig. 5.1a illustrates a sample execution of the Connected-Component (CC) algorithm
under BSP. The algorithm runs on a six-vertex graph, which is cut into two partitions.
Within a superstep, each vertex tries to update itself with the smallest vertex ID seen
so far among its neighbors. A vertex becomes inactive as soon as it can receive no further
updates. The algorithm has converged when no vertex remains active, and we can then compute the
number of connected components by counting the distinct remaining vertex IDs. Fig. 5.1a shows
that CC converges in a total of 6 supersteps, while the first three supersteps require
message passing from DC A to DC B.
However, it is easy to observe that the first two inter-datacenter messages are unnecessary
in this example: they transfer values that will be immediately overridden in the next
superstep. The insight is that we can sometimes hold off inter-datacenter synchronization
until multiple cycles of synchronization within a single datacenter have been performed,
for the sole purpose of minimizing cross-datacenter traffic. For example, one possible
optimization is illustrated in Fig. 5.1b. It allows updates of vertex IDs to happen within
a datacenter, without updating the vertex that is owned by both datacenters. Inter-datacenter
communication happens only once, at the fourth step, when both partitions have already
converged locally. This new principle of synchronization yields the same result as the
CC algorithm under BSP, yet generates only 1/3 of the inter-datacenter traffic.
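The accounting in this kind of example can be reproduced with a short, single-process simulation. The sketch below is illustrative only: the six-vertex path graph and its two-datacenter cut are hypothetical stand-ins for the graph of Fig. 5.1, and message counting is simplified to one message per active vertex per reachable neighbor.

```python
# Single-process model of Connected-Components label propagation.
# The 6-vertex path graph and its cut between DC A and DC B are
# hypothetical stand-ins for the example in Fig. 5.1.
EDGES = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
DC = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}

def neighbors(v):
    return [b if a == v else a for a, b in EDGES if v in (a, b)]

def superstep(labels, active, allowed):
    """One synchronous round: active vertices broadcast their label,
    but only vertices in `allowed` may receive and update."""
    cross, new = 0, dict(labels)
    for v in active:
        for u in neighbors(v):
            if u not in allowed:
                continue              # local mode: nothing leaves the DC
            if DC[u] != DC[v]:
                cross += 1            # one inter-datacenter message
            if labels[v] < new[u]:
                new[u] = labels[v]
    return new, {u for u in labels if new[u] != labels[u]}, cross

def run(local_first):
    labels = {v: v for v in DC}
    if local_first:                   # HSP-style: converge each DC locally
        for dc in ("A", "B"):
            part = {v for v in DC if DC[v] == dc}
            act = set(part)
            while act:
                labels, act, _ = superstep(labels, act, part)
    active, rounds, cross_total = set(labels), 0, 0
    while active:                     # global (BSP-style) supersteps
        labels, active, c = superstep(labels, active, set(labels))
        rounds += 1
        cross_total += c
    return labels, rounds, cross_total
```

On this toy input, `run(False)` converges in six supersteps, matching the count in the text, while `run(True)` reaches the same labels with strictly fewer inter-datacenter messages.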
The example shown in Fig. 5.1 inspires us to explore such a new principle, for the
sake of minimizing inter-datacenter traffic. To achieve this objective, we wish to carefully
design a new synchronization model, called the Hierarchical Synchronous Parallel (HSP)
model. As an alternative to BSP, it must guarantee that the correctness of any vertex
program is retained, even for programs far more complex than the Connected-Component algorithm.
5.2 Hierarchical Synchronous Parallel Model
In this section, we introduce the Hierarchical Synchronous Parallel (HSP) model. We
will first explain the high-level principle of its design and the general idea behind its
correctness guarantee. We will then formulate HSP model theoretically and explain it
in greater detail. With our formulation, we present a formal proof of its correctness and
rate of convergence in wide-area graph analytics. Finally, we use a simple PageRank
application as an example to illustrate the effectiveness of HSP.
5.2.1 Overview
Generally speaking, HSP extends the BSP model for wide-area graph analytics by performing
synchronization in a two-level hierarchy. Beyond BSP-style global synchronization, HSP allows
local synchronization among worker nodes located in a single datacenter, which completely
avoids inter-datacenter communication. To achieve this, HSP introduces two modes of
execution, global and local, and switches between them strategically.
In the global mode, HSP has exactly the same behavior as BSP, where each syn-
chronization is a global, all-to-all communication among all worker nodes, regardless of
which datacenter they are located in. We call one iteration of the execution in HSP a
“global update,” which is equivalent to a superstep in BSP. Global updates are essential
to the correctness of graph analytics, because it is necessary to spread the information
outside of individual datacenters.
In the local mode, the worker nodes are organized in different autonomous datacen-
ters. Workers housed in the same datacenter work synchronously. In particular, they
still run in iterations, or “local updates,” as if they are running the vertex program un-
der BSP. The difference is that, if a vertex has mirrors in multiple datacenters, called
a “global vertex,” we mark it inactive and do not update it until switching back to the
global mode. Since synchronizing the property of these global vertices is the only source
of inter-datacenter traffic, we completely eliminate the need for inter-datacenter commu-
nication in the local mode. In addition, it is worth noting that without the need for
global synchronization, execution at different datacenters can be asynchronous.
Without running in the local mode, HSP is equivalent to BSP. Thus, it still guar-
antees correctness. However, running in the local mode takes advantage of low-cost
synchronization within a datacenter, allowing more updates in the same amount of time.
An interesting question is when to switch between these two modes.
Our mode-switching strategy is designed based on the theoretical analysis in the next
subsection, for the sake of the algorithm convergence guarantee; here we introduce its
general principle. On the one hand, local updates should allow at least one information
exchange between any pair of vertices, so the number of local updates should be
no lower than the diameter of the local partition. On the other hand, we should not leave
any worker idle before reaching global convergence. As a result, HSP switches away
from the local mode as soon as every datacenter has executed as many local updates as its
partition diameter, or the local updates in any datacenter converge. Then, while HSP is
running in the global mode, it switches back to the local mode as soon as the
algorithm is considered "more converged," a metric that will be introduced later.
The intuition behind HSP is that, in general, graph algorithms need to spread the in-
formation on a vertex to all vertices of the entire graph. In other words, every vertex has
to “get its voice heard.” In wide-area data analytics, communicating with neighbors does
not always come at similar prices. Therefore, instead of requiring every vertex to talk
to its neighbors in every iteration, HSP organizes communication hierarchically. It al-
lows information to spread well within a closed neighborhood, before inter-neighborhood
communication. This way, the entire graph can still converge, while generating much less
inter-datacenter WAN traffic.
5.2.2 Model Formulation and Description
Before formulating the HSP model, we first give a formal definition of a vertex program.
Given a graph $G = (V, E)$ with initial properties on all vertices $\boldsymbol{x}^{(0)} \in \mathbb{R}^{|V|}$, a vertex programming application defines the function used to update each vertex in a superstep. Specifically, a combiner function $g(\cdot)$ combines the messages received by each vertex, and a compute function $f(\cdot)$ computes the updated property of a vertex from its old property and the combined incoming message. Without loss of generality, let $f_i : \mathbb{R}^{|V|} \to \mathbb{R}$ denote the update function defined on vertex $i \in \{1, 2, \ldots, |V|\}$, such that, in BSP, we have
$$x_i^{(k+1)} = f_i\big(x_i^{(k)},\, g_i(\boldsymbol{x}^{(k)})\big) := f_i(\boldsymbol{x}^{(k)}). \tag{5.1}$$
Or equivalently, define $\boldsymbol{F} : \mathbb{R}^{|V|} \to \mathbb{R}^{|V|}$ such that
$$\boldsymbol{x}^{(k+1)} = \boldsymbol{F}(\boldsymbol{x}^{(k)}).$$
The objective of the vertex programming application is to compute $\boldsymbol{x}^* \in \mathbb{R}^{|V|}$ satisfying $\boldsymbol{x}^* = \boldsymbol{F}(\boldsymbol{x}^*)$, by iteratively applying the update defined in Eq. (5.1) until convergence under BSP. By definition, $\boldsymbol{x}^*$ is a fixed point under operator $\boldsymbol{F}$, denoted as $\boldsymbol{x}^* \in \operatorname{Fix}\boldsymbol{F}$.
In practice, the application is considered converged when a valid approximation xxx(N)
is obtained after N supersteps. We define a valid approximation as follows:
Definition 1 (Distance metric). A distance metric on $\mathbb{R}^{|V|}$ is a function
$$D : \mathbb{R}^{|V|} \times \mathbb{R}^{|V|} \to [0,\infty),$$
where $[0,\infty)$ is the set of non-negative real numbers, such that for all $\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z} \in \mathbb{R}^{|V|}$ the following conditions are satisfied:
$$D(\boldsymbol{x},\boldsymbol{y}) = 0 \iff \boldsymbol{x} = \boldsymbol{y} \quad \text{(identity of indiscernibles)},$$
$$D(\boldsymbol{x},\boldsymbol{y}) = D(\boldsymbol{y},\boldsymbol{x}) \quad \text{(symmetry)},$$
$$D(\boldsymbol{x},\boldsymbol{y}) \le D(\boldsymbol{x},\boldsymbol{z}) + D(\boldsymbol{z},\boldsymbol{y}) \quad \text{(triangle inequality)}.$$
For example, the Chebyshev norm, i.e.,
$$D(\boldsymbol{x},\boldsymbol{y}) = \max\big\{|x_1 - y_1|, |x_2 - y_2|, \ldots, |x_{|V|} - y_{|V|}|\big\},$$
is a commonly used distance metric because it is easy to compute in a distributed environment.
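As a concrete illustration of why the Chebyshev norm composes well with a partitioned vertex set, the following sketch (with a hypothetical two-worker partitioning) computes it as the maximum of per-partition maxima, so each worker needs to report only one scalar to the coordinator.

```python
# The Chebyshev (L-infinity) distance decomposes over any partition of
# the index set: the global maximum is the maximum of per-worker maxima.
# The two-partition split below is hypothetical.
def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def chebyshev_distributed(x, y, partitions):
    """Each entry of `partitions` is the index set held by one worker."""
    partials = [max(abs(x[i] - y[i]) for i in part) for part in partitions]
    return max(partials)              # one scalar per worker is enough

x = [0.20, 0.35, 0.15, 0.10, 0.20]
y = [0.18, 0.30, 0.22, 0.10, 0.20]
assert chebyshev(x, y) == chebyshev_distributed(x, y, [[0, 1], [2, 3, 4]])
```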
Definition 2 (A valid approximation of a fixed point). Given a pre-defined error bound $\delta \in (0,\infty)$, $\boldsymbol{x}^{(N)}$ is a valid approximation of $\boldsymbol{x}^*$, a fixed point under operator $\boldsymbol{F}$, if it satisfies
$$D\big(\boldsymbol{x}^{(N)}, \boldsymbol{x}^{(N+1)}\big) = D\big(\boldsymbol{x}^{(N)}, \boldsymbol{F}(\boldsymbol{x}^{(N)})\big) \le \delta, \tag{5.2}$$
where $D : \mathbb{R}^{|V|} \times \mathbb{R}^{|V|} \to [0,\infty)$ is a distance metric.
To process the graph in $d$ geographically distributed datacenters, it is partitioned into $d$ large subgraphs. We define each vertex as either a local vertex or a global vertex. Specifically, vertex $i$ is a local vertex of datacenter $j$ if datacenter $j$ stores its original copy and its property $x_i$ is not required to update any vertex stored in a different datacenter; in other words, $x_i$ is never delivered outside of datacenter $j$ when running the application under BSP. Let $I_j$ denote the index set of all local vertices of datacenter $j$. Otherwise, if $x_i$ is required to update vertices stored in multiple datacenters, we define vertex $i$ as a global vertex, and let $I_G$ denote the index set of all global vertices. By construction, the $d+1$ index sets $I_1, I_2, \ldots, I_d, I_G$ are mutually exclusive and their union is $\{1, 2, \ldots, |V|\}$.
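These index sets can be derived mechanically from the graph cut. The sketch below is a simplified illustration under an edge-cut placement with undirected edges, where every vertex has a single home datacenter and any vertex incident to a cut edge is treated as global; in a vertex-cut engine such as GraphX, a global vertex would instead be one with mirrors in multiple datacenters. All names in the sketch are hypothetical.

```python
from collections import defaultdict

# Derive the index sets I_1, ..., I_d and I_G from a vertex placement
# and an edge list. Simplifying assumption: edge-cut placement with
# undirected edges, so both endpoints of a cut edge become global.
def classify(placement, edges):
    global_set = set()
    for u, v in edges:
        if placement[u] != placement[v]:
            global_set.update((u, v))   # value must cross the cut
    local_sets = defaultdict(set)
    for v, dc in placement.items():
        if v not in global_set:
            local_sets[dc].add(v)
    return dict(local_sets), global_set

placement = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
I_local, I_G = classify(placement, edges)
# Only the endpoints of the cut edge (3, 4) are global here.
```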
With a globally partitioned graph, we define $\boldsymbol{F}_j(\cdot)$, the local update function in datacenter $j$, as follows:
$$\big[\boldsymbol{F}_j(\boldsymbol{x})\big]_i = \begin{cases} f_i(\boldsymbol{x}) & i \in I_j \\ x_i & \text{otherwise.} \end{cases} \tag{5.3}$$
Using the notation introduced above, we present a detailed description of the HSP
model in Procedure 1; the procedure for local updates in a single datacenter is listed
separately as Procedure 2. Since the local updates in different datacenters are asynchronous, a
central coordination mechanism is required to decide when to switch back to the global mode.
This is achieved by two methods, voteModeSwitch() and forceModeSwitch(). The former is called once
the number of local updates reaches the subgraph diameter, while the latter is called
upon local convergence. The central coordinator receives the signals triggered by these
two methods, and enforces a mode switch when appropriate (line 12 of Procedure 1).
Procedure 1 Execution of a vertex programming application under the Hierarchical Synchronous Parallel (HSP) model.
 1: Set execution mode to global, global update counter k ← 0, current error δ0 ← ∞;
 2: while δk > δ do
 3:   if execution mode is global then
 4:     Perform a global update: x(k+1) ← F(x(k));
 5:     δk+1 ← D(x(k+1), x(k));
 6:     if δk+1 < δk then
 7:       Switch execution mode to local;
 8:     else
 9:       δk+1 ← δk;
10:     k ← k + 1;
11:   else
12:     Apply local updates in each datacenter concurrently (as in Procedure 2), until any datacenter calls forceModeSwitch() or all datacenters call voteModeSwitch();
13:     Switch execution mode to global;
14: return x(k).

Procedure 2 Local updates in datacenter j.
 1: Set local iteration counter nj ← 0, local error δ0 ← ∞; d denotes the diameter of the local partition;
 2: repeat
 3:   nj ← nj + 1;
 4:   Perform an in-place local update: x(k,nj) ← Fj(x(k,nj−1));
 5:   δnj ← D(x(k,nj), x(k,nj−1));
 6:   if δnj < δ then
 7:     forceModeSwitch();
 8:   if nj == d then
 9:     voteModeSwitch();
10: until Mode switch is forced by any or voted by all datacenters;
11: Execution mode switched to global.
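The two procedures can also be condensed into runnable form. The following Python sketch is a simplified, single-threaded stand-in for the driver logic: real local updates run concurrently in each datacenter, whereas here they are applied in sequence, and the update functions, distance metric and four-vertex example are all hypothetical.

```python
def hsp(F, F_locals, D, x0, diameters, delta):
    """Sequential sketch of Procedures 1 and 2. F is the global update,
    F_locals[j] the local update of datacenter j (which must leave
    global vertices untouched), D a distance metric, and diameters[j]
    the diameter of partition j."""
    x, prev_err, err, k = x0, float("inf"), float("inf"), 0
    while err > delta:
        x_new = F(x)                        # one global update
        err = D(x_new, x)
        improved = err < prev_err
        prev_err, x, k = err, x_new, k + 1
        if err > delta and improved:        # progress made: go local
            for j, Fj in enumerate(F_locals):
                for _ in range(diameters[j]):   # voteModeSwitch() at d
                    x_next = Fj(x)
                    done = D(x_next, x) < delta  # forceModeSwitch()
                    x = x_next
                    if done:
                        break
    return x, k                             # k counts global updates

# Hypothetical example: Jacobi-style averaging on a 4-vertex path cut
# between vertices 1 and 2, which are therefore global and frozen locally.
def F(x):
    return [(x[0] + x[1]) / 2, (x[0] + x[1] + x[2]) / 3,
            (x[1] + x[2] + x[3]) / 3, (x[2] + x[3]) / 2]
def F_A(x):   # updates I_A = {0} only
    return [(x[0] + x[1]) / 2, x[1], x[2], x[3]]
def F_B(x):   # updates I_B = {3} only
    return [x[0], x[1], x[2], (x[2] + x[3]) / 2]
def D(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, k = hsp(F, [F_A, F_B], D, [1.0, 0.0, 0.0, 1.0], [1, 1], 1e-6)
```

Under this averaging example the iterates settle on a near-constant vector, and `k` reports how many global updates the driver needed.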
5.2.3 Proof of Convergence and Correctness
To prove the convergence and correctness guarantee of the proposed HSP model, the only
requirement for the vertex programming application is that the iterative computation is
correctly implemented; that is, the application can converge within a finite number of
iterations under BSP. Formally, we have the following assumption:
Assumption 1 (Convergence under BSP). Given any arbitrary initial value $\boldsymbol{x}^{(0)}$ and a distance metric $D(\cdot,\cdot)$, the sequence of successive approximations $\{\boldsymbol{x}^{(k)}\}$ approaches $\boldsymbol{x}^*$, a fixed point under operator $\boldsymbol{F}$, as the number of iterations of the global update Eq. (5.1) approaches infinity. That is,
$$\lim_{k\to\infty} D\big(\boldsymbol{x}^{(k)}, \boldsymbol{x}^*\big) = 0, \quad \text{where } \boldsymbol{x}^* \in \operatorname{Fix}\boldsymbol{F}. \tag{5.4}$$
Theorem 1 (Convergence and correctness guarantee of HSP). If a vertex programming
application satisfies Assumption 1, given any pre-defined error bound δ ∈ (0,∞), it will
also return a valid approximation of xxx∗ in finite time under Procedure 1.
Proof. Since $\delta_{k+1}$ may be overridden at line 9 of Procedure 1 with the previous value in the sequence, it suffices to consider the original values before being overridden:
$$\delta_{k+1} = D\big(\boldsymbol{x}^{(k+1)}, \boldsymbol{x}^{(k)}\big) \le D\big(\boldsymbol{x}^{(k+1)}, \boldsymbol{x}^*\big) + D\big(\boldsymbol{x}^{(k)}, \boldsymbol{x}^*\big) \quad \text{(triangle inequality)},$$
which approaches 0 as $k \to \infty$ by Eq. (5.4).

In a real-world implementation where the precision of any number is bounded, the sequence $\{\delta_k\}$ eventually falls below any positive threshold, i.e., $\forall \delta \in (0,\infty)$, $\exists k^* < \infty$ s.t. $\delta_{k^*} < \delta$, and Procedure 1 will return the estimate $\boldsymbol{x}^{(k^*)}$. Since $\boldsymbol{x}^{(k^*)}$ satisfies Eq. (5.2), it is a valid approximation of $\boldsymbol{x}^*$. ∎
As shown in the proof, the convergence guarantee of HSP relies heavily upon its
global updates. As long as the application can converge under BSP, applying local updates in
between does not affect the result of the vertex programming algorithm.
5.2.4 Rate of Convergence
Even with the convergence and the correctness guarantee, one may still be skeptical
about the effectiveness of HSP. How could the additional local updates help with the
application execution? Is it a guarantee that it will generate less inter-datacenter traffic?
It is difficult to answer these questions without any prior knowledge about the actual
application itself, since the vertex programming model provides developers with a substantial amount of flexibility. Generally speaking, developers are able to code whatever they desire, making it difficult to reach any useful conclusion about such applications.
However, to ensure scalability when processing very large datasets, practical vertex
programming applications tend to share some characteristics. Yan et al. [83], in particular,
investigated a set of well-implemented vertex programming algorithms, namely
practical Pregel algorithms, and summarized the common characteristics of a practical
vertex programming application under BSP. Such applications require linear space usage,
linear computation and linear communication cost per superstep. In addition,
practical Pregel algorithms require at most a logarithmic number of supersteps until
convergence, i.e., they exhibit at least a linear rate of convergence.
With the latter characteristic as an assumption, HSP can ensure effectiveness by
allowing additional local updates in different datacenters.
Assumption 2 (Practical implementation). The vertex programming application converges at a linear or superlinear rate under BSP, i.e.,
$$\exists \mu \in [0,1) \ \text{s.t.} \ \lim_{k\to\infty} \frac{D\big(\boldsymbol{x}^{(k+1)}, \boldsymbol{x}^*\big)}{D\big(\boldsymbol{x}^{(k)}, \boldsymbol{x}^*\big)} \le \mu. \tag{5.5}$$
To study the rate of convergence, we consider each cycle of synchronization in HSP. A
cycle of synchronization is defined as the interval between two consecutive mode switches
from local to global. In other words, a cycle includes several consecutive global updates
and the subsequent local updates, until a switch back to the global mode.
In particular, during the local mode within each synchronization cycle, several iter-
ations of local updates are applied in each individual datacenter at the same time. For
the convenience of our subsequent proof, we collectively formulate these local updates as
a function
$$\boldsymbol{F}_{(n_1,n_2,\ldots,n_d)}(\boldsymbol{x}) := \Big(\prod_{j=1}^{d} \boldsymbol{F}_j^{n_j}\Big)(\boldsymbol{x}),$$
where $n_j$ denotes the number of iterations of local updates applied in datacenter $j$.
Lemma 1. $\boldsymbol{x}^*$ is a fixed point under the operator $\widehat{\boldsymbol{F}} := \boldsymbol{F}_{(n_1,n_2,\ldots,n_d)}$ for any $n_1, n_2, \ldots, n_d$.

Proof. Given $j \in \{1,2,\ldots,d\}$ and $i \in I_j$, according to Eq. (5.3) we have $\big[\boldsymbol{F}_j(\boldsymbol{x}^*)\big]_i = f_i(\boldsymbol{x}^*) = x_i^*$. Given $i \notin I_j$, $\big[\boldsymbol{F}_j(\boldsymbol{x}^*)\big]_i = x_i^*$ by definition. Thus, $\boldsymbol{x}^*$ is a fixed point under $\boldsymbol{F}_j$.

Since no global update is applied, the properties of global vertices remain unchanged after $\widehat{\boldsymbol{F}}(\cdot)$ is applied, i.e., $\big[\widehat{\boldsymbol{F}}(\boldsymbol{x})\big]_i = x_i$ for all $i \in I_G$.

Further, $\boldsymbol{F}_j(\cdot)$ depends only on the local vertices of datacenter $j$. Thus, the individual functions $\boldsymbol{F}_j$ ($j = 1, 2, \ldots, d$) defined in Eq. (5.3) commute with one another, and
$$\boldsymbol{F}_{(n_1,\ldots,n_j,\ldots,n_d)}(\boldsymbol{x}^*) = \boldsymbol{F}_{(n_1,\ldots,n_j-1,\ldots,n_d)}\big(\boldsymbol{F}_j(\boldsymbol{x}^*)\big) = \boldsymbol{F}_{(n_1,\ldots,n_j-1,\ldots,n_d)}(\boldsymbol{x}^*) = \cdots = \boldsymbol{F}_{(0,\ldots,0)}(\boldsymbol{x}^*) = \boldsymbol{x}^*. \qquad \blacksquare$$
Lemma 2. $\widehat{\boldsymbol{F}} : \mathbb{R}^{|V|} \to \mathbb{R}^{|V|}$ is a contraction mapping; in particular,
$$D\big(\widehat{\boldsymbol{F}}(\boldsymbol{x}), \widehat{\boldsymbol{F}}(\boldsymbol{y})\big) \le \mu^{\sum_{j=1}^{d} n_j}\, D(\boldsymbol{x},\boldsymbol{y}), \quad \forall \boldsymbol{x},\boldsymbol{y} \in \mathbb{R}^{|V|}.$$

Proof. According to the Banach fixed-point theorem [12], an equivalent condition of Eq. (5.5) is that $\boldsymbol{F}$ is a contraction mapping: for all $\boldsymbol{x},\boldsymbol{y} \in \mathbb{R}^{|V|}$,
$$D\big(\boldsymbol{F}(\boldsymbol{x}), \boldsymbol{F}(\boldsymbol{y})\big) \le \mu D(\boldsymbol{x},\boldsymbol{y}).$$

Construct $\bar{\boldsymbol{y}} \in \mathbb{R}^{|V|}$ by letting
$$\bar{y}_i = \begin{cases} y_i & i \in I_j \\ x_i & \text{otherwise.} \end{cases}$$
Thus,
$$D\big(\boldsymbol{F}_j(\boldsymbol{x}), \boldsymbol{F}_j(\boldsymbol{y})\big) = D\big(\boldsymbol{F}_j(\boldsymbol{x}), \boldsymbol{F}_j(\bar{\boldsymbol{y}})\big) = D\big(\boldsymbol{F}(\boldsymbol{x}), \boldsymbol{F}(\bar{\boldsymbol{y}})\big) \le \mu D(\boldsymbol{x}, \bar{\boldsymbol{y}}) \le \mu D(\boldsymbol{x},\boldsymbol{y}).$$
Applying this bound to each of the $\sum_{j=1}^{d} n_j$ local updates that compose $\widehat{\boldsymbol{F}}$ yields
$$D\big(\widehat{\boldsymbol{F}}(\boldsymbol{x}), \widehat{\boldsymbol{F}}(\boldsymbol{y})\big) \le \mu^{\sum_{j=1}^{d} n_j}\, D(\boldsymbol{x},\boldsymbol{y}). \qquad \blacksquare$$
Theorem 2 (Rate of convergence of HSP). If a vertex programming application satisfies
Assumptions 1 and 2, then after the same number of global updates, HSP converges to $\boldsymbol{x}^*$
at a higher rate than BSP.

Proof. Consider an HSP synchronization cycle that includes $n$ global updates and a set
of local updates collectively denoted by $\widehat{\boldsymbol{F}}$. After the synchronization cycle, the original estimate
$\boldsymbol{x}^{(k)}$ is updated to $\widehat{\boldsymbol{F}}\big(\boldsymbol{F}^n(\boldsymbol{x}^{(k)})\big)$, whereas after the same number of global synchronization iterations, BSP obtains $\boldsymbol{x}^{(k+n)} = \boldsymbol{F}^n(\boldsymbol{x}^{(k)})$.

The average rate of convergence of HSP, $\mu_{\mathrm{HSP}}$, satisfies
$$
\mu_{\mathrm{HSP}}^{n} = \lim_{k\to\infty} \frac{D\big(\widehat{\boldsymbol{F}}(\boldsymbol{x}^{(k+n)}), \boldsymbol{x}^*\big)}{D\big(\boldsymbol{x}^{(k)}, \boldsymbol{x}^*\big)}
= \lim_{k\to\infty} \frac{D\big(\widehat{\boldsymbol{F}}(\boldsymbol{x}^{(k+n)}), \widehat{\boldsymbol{F}}(\boldsymbol{x}^*)\big)}{D\big(\boldsymbol{x}^{(k)}, \boldsymbol{x}^*\big)} \quad \text{(Lemma 1)}
$$
$$
\le \mu^{\sum_{j=1}^{d} n_j} \cdot \lim_{k\to\infty} \frac{D\big(\boldsymbol{x}^{(k+n)}, \boldsymbol{x}^*\big)}{D\big(\boldsymbol{x}^{(k)}, \boldsymbol{x}^*\big)} \quad \text{(Lemma 2)}
= \mu^{\sum_{j=1}^{d} n_j} \cdot \mu_{\mathrm{BSP}}^{n} < \mu_{\mathrm{BSP}}^{n},
$$
where $\boldsymbol{x}^{(k+n)} = \boldsymbol{F}^n(\boldsymbol{x}^{(k)})$ denotes the BSP estimate.

Therefore, HSP provides a strictly higher average rate of convergence as compared to BSP. ∎
Given a distance metric $D$ and an error bound $\delta \in (0,\infty)$, Theorem 2 implies that HSP requires fewer global updates, and thus less inter-datacenter traffic, to reach a valid approximation.
5.2.5 PageRank Example: a Numerical Verification
To verify our findings in the previous theorems, we compare the convergence under HSP
and BSP using a simple PageRank example shown in Fig. 5.2. Fig. 5.2(a) shows the
5-vertex graph in the example, while the graph is partitioned into two datacenters by
cutting the central vertex. Note that the diameter in each partition is 1 (ignoring the
global vertex); therefore, HSP runs local updates only once in every local mode of
execution. We plot every single estimation achieved by both synchronization models in
Fig. 5.2(b), whose y-axis shows the Euclidean norm between estimations and xxx∗ in log
scale.
Since PageRank is a practical Pregel algorithm by definition [83], BSP shows a perfectly linear rate of convergence (the red line in Fig. 5.2(b)). As a comparison, HSP, depicted by the black line (the lower bound of the gray area), shows a much higher rate of convergence, given the same number of global updates as the number of supersteps in BSP.
If we consider the x-axis as algorithm runtime in a real system, the gray area indicates
the possible convergence rates of HSP. The lower bound, shown by the black line, assumes
that local updates incur no cost, since they do not introduce inter-datacenter traffic. The
upper bound, on the other hand, assumes that a local update takes exactly the same amount
of time as a superstep in BSP. In reality, the time needed by a local update lies between
these two extremes, and HSP can always converge faster.
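A comparison in the spirit of Fig. 5.2(b) can be reproduced with a small, self-contained simulation. Everything below is a hypothetical stand-in rather than the experiment in the figure: a five-vertex graph whose center vertex is the only global vertex, a GraphX-style unnormalized PageRank update (rank = 0.15 + 0.85 × weighted in-ranks), and a simplified HSP schedule that applies one local round per datacenter (the partition diameter is 1) after every productive global update.

```python
# Tiny PageRank experiment comparing BSP supersteps against HSP global
# synchronizations. Graph, cut and tolerances are hypothetical: vertex 2
# is the (mirrored) global vertex; {0, 1} live in DC A, {3, 4} in DC B.
OUT = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
LOCAL = {"A": [0, 1], "B": [3, 4]}

def update(x, targets):
    """GraphX-style rank update applied to `targets` only."""
    y = list(x)
    for i in targets:
        y[i] = 0.15 + 0.85 * sum(x[j] / len(OUT[j])
                                 for j in OUT if i in OUT[j])
    return y

def dist(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def bsp(x0, tol=1e-6):
    x, steps = list(x0), 0
    while True:
        y = update(x, range(len(OUT)))
        steps += 1
        if dist(x, y) < tol:
            return y, steps
        x = y

def hsp(x0, tol=1e-6):
    x, syncs, prev = list(x0), 0, float("inf")
    while True:
        y = update(x, range(len(OUT)))      # one global synchronization
        syncs += 1
        d, x = dist(x, y), y
        if d < tol:
            return x, syncs
        if d < prev:                        # progress made: go local
            for part in LOCAL.values():     # one round per DC (diameter 1)
                x = update(x, part)         # global vertex 2 stays frozen
        prev = d

x0 = [1.0, 0.5, 1.0, 0.5, 1.0]              # an asymmetric start
ranks_bsp, supersteps = bsp(x0)
ranks_hsp, global_syncs = hsp(x0)
```

On this toy input, both schedules converge to the same ranks (which sum to the number of vertices), and the HSP-style schedule reaches the tolerance in fewer global synchronizations than BSP uses supersteps.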
5.3 Prototype Implementation
We have implemented a prototype of the HSP synchronization model on GraphX [30],
and it retains full compatibility with existing analytics applications. Our prototype
implementation is non-trivial to complete, yet it makes a strong case that HSP can be
seamlessly integrated with existing BSP-based graph analytics engines.
Systems design. GraphX is an open-source graph analytics engine built on top of
Apache Spark [84]. It is an ideal platform for us to implement our prototype, due to its
full interoperability with general dataflow analytics and machine learning applications in
the popular Spark framework.
[Figure 5.2 graphics: (a) the example directed graph, a five-vertex graph spanning DC A and DC B whose final ranks, used as x∗, are 0.331283, 0.307311, 0.170795, 0.160607 and 0.030000; (b) the convergence of PageRank under BSP and HSP, plotting ‖x(k) − x∗‖2 on a log scale against the number of global synchronizations, with local and global updates marked.]
Figure 5.2: A PageRank example with the damping factor set to 0.15 [16]. All values used in computation are rounded to the sixth decimal place; therefore, norms lower than 10−6 make little sense and are ignored in the figure.
In GraphX, the vertex-centric programming abstraction is supported via a Pregel
API, which allows developers to pass in their customized vertex update and message
passing functions. The graph analytics applications are executed under the BSP model,
with all workers proceeding in supersteps. It takes advantage of the Resilient Distributed
Dataset (RDD) programming abstraction [84]. An RDD represents a collection of data
stored on multiple workers. Parallel computation on the data can be modeled as sequential transformations made on the RDD, allowing developers to program as if the application were running on a single machine. In particular, Pregel models a graph using a VertexRDD and an
EdgeRDD, while a superstep is modeled as a series of sequential transformations on them.
Our implementation retains the original Pregel API, providing full compatibility with
existing applications and requiring no change to their code. To enable HSP instead of
BSP, users can simply pass an additional option, -Dspark.graphx.hsp.enabled=true,
along with the application submission command.
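For illustration, a submission might look like the following. Only the configuration key itself comes from the text; the master URL, application class, jar and input path are hypothetical placeholders, and while the text passes the option as a -D JVM system property, spark-submit can equivalently set it via --conf:

```shell
# Hypothetical submission; only spark.graphx.hsp.enabled is from the text.
spark-submit \
  --master spark://driver-host:7077 \
  --conf spark.graphx.hsp.enabled=true \
  --class example.PageRankApp \
  graph-analytics.jar hdfs:///graphs/enwiki-2013.edgelist
```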
Under the hood, our modifications to the original GraphX codebase take place within
[Figure 5.3 graphic: a flow chart of the driver-side coordination. After the input graph is partitioned, a mode-select step either performs one global update and checks for convergence, or launches one DC Manager Thread per datacenter; each thread performs local updates until its partition converges or the number of updates reaches the partition diameter, with an accumulator tracking forced and voted mode switches. The threads are then joined and execution returns to the global mode.]
Figure 5.3: The flow chart that shows central coordination in HSP. Blocks in black indicate the original Pregel implementation in GraphX.
the Pregel implementation and include two separate components: the central logic to
switch between local and global execution modes, as well as the RDD transformation
sequence that actually implements the local updates.
Mode switches. We have implemented Procedures 1 and 2 for mode switching. They run in the Spark "driver" program, which centrally manages all workers acquired by the application. The flow chart is shown in Fig. 5.3. We utilize two Accumulators to assist in decision making. An Accumulator serves as a global variable, which can be added to by workers and read by the central coordinator.
To switch from the global to the local mode, one Accumulator is used to track the
differences (i.e., distance) between consecutive updates. A mode switch will be made once
the difference decreases. On the other hand, to switch back to global, we use the other
Accumulator to track the progress made by individual local updates. voteModeSwitch()
and forceModeSwitch() are implemented by adding to this Accumulator. Each individual
datacenter checks the value of the global variable after each local update to decide
whether to proceed, and changes the value once it converges locally or reaches the preset number of local updates.
Local updates. Implementing local updates in GraphX is challenging, because
the RDD transformations on a graph are designed to hide some runtime details from
the developers. These details include data distributions across the workers, which are
essential to the concept of local updates. We need to dig deep into the RDD internals
for our implementation.
One of the major challenges is to identify global vertices and local vertices.
In Spark, the actual dataset placement is decided at runtime depending on the worker
availabilities. Thus, the co-located graph partitions in the same datacenters can only
be identified at runtime. Fortunately, such runtime information is accessible via the
preferredLocations feature in an RDD, which provides insights about the actual worker
where a data partition is placed. We make full use of this feature, creating a set containing
all global vertices once the graph is fully loaded to the available workers. Since graph
partitioning remains unchanged throughout the application execution, the global vertex
set can be cached in memory without being recomputed.
With knowledge about all global vertices, we create a SubGraph out of the co-located
graph partitions in each datacenter. The local updates will be carried out asynchronously
on different SubGraph instances, and they will be controlled by DC Manager Threads on
the Spark driver, as depicted in Fig. 5.3. All transformations made on the VertexRDD
and the EdgeRDD of a SubGraph are similar to the original BSP transformations. However,
they have been carefully rewritten such that no global vertex is modified during local
updates.
5.4 Experimental Evaluation
We have evaluated the effectiveness and correctness of our prototype implementation in
real datacenters, with empirical benchmarking workloads and real-world graph datasets.
In this section, we summarize and analyze our experimental results.
Dataset        # Vertices   # Edges       Harmonic Diameter   Edgelist Size (MB)
enwiki-2013    4,206,785    101,355,853   5.24                1556.9
uk-2014-host   4,769,354    50,829,923    21.48               801.4

Table 5.1: Summary of the used datasets.
5.4.1 Methodology
Experiment platforms. We deploy a 10-worker Spark cluster on Google Cloud. Each
worker is a regular Ubuntu 16.04 LTS instance, with 2 CPU cores and 7.5GB of memory.
Our modified version of GraphX is based on Spark v2.2.0, and the cluster is deployed in
the standalone mode.
In particular, two workers are deployed in each of five geographical regions, including N. Virginia, Oregon, Tokyo, Belgium, and Sydney. Preliminary measurements of the available bandwidth show ∼3 Gbps of capacity within a datacenter. Inter-datacenter bandwidth is more than an order of magnitude lower, ranging from 50 Mbps to 230 Mbps. These findings are similar to the measurements reported in [38].
Applications. We use three benchmarking applications to evaluate the effectiveness
of HSP, including PageRank (PR), ConnectedComponents (CC), and ShortestPaths (SP).
We use the default implementations provided by GraphX without changing a single line
of code. Also, the default graph partitioning strategy is used, which preserves the original
edge partitioning in the HDFS input file.
PR represents random walk algorithms, an important category of algorithms that seek
the steady state of a graph. CC and SP represent graph traversal algorithms.
These two categories cover the most common vertex programs in practice
[73]. The three applications exhibit different degrees of network intensiveness: PR requires
more time for synchronization, while SP is more computation-intensive.
Input datasets. We use two web datasets from WebGraph [14]. The key features
of the datasets are summarized in Table 5.1. Both datasets have more than 4 million
vertices. However, uk-2014-host has much fewer edges, making the diameter of the
Dataset        Workload   # HSP Global Sync.   # BSP Supersteps   HSP Usage (GB)   BSP Usage (GB)   Reduction (%)
enwiki-2013    PR         46                   74                 18.39            27.14            32.2
enwiki-2013    CC         5                    7                  0.69             0.91             23.4
enwiki-2013    SP         7                    10                 0.59             0.84             30.6
uk-2014-host   PR         35                   52                 21.48            31.47            31.7
uk-2014-host   CC         12                   20                 0.71             0.95             25.4
uk-2014-host   SP         15                   23                 0.50             0.64             22.4

Table 5.2: WAN bandwidth usage comparison.
[Figure 5.4 graphic: normalized application runtimes under HSP on enwiki-2013 and uk-2014-host (PR: 0.90x and 0.86x; CC: 0.97x and 0.93x; SP: 1.02x and 1.04x).]
Figure 5.4: Application runtime under HSP, normalized by the runtime under BSP.
graph much higher. In other words, enwiki-2013 is more “dense” in terms of vertex
connectivity. Experimenting on these two datasets makes a strong case that HSP can
work well on natural, real-world graphs.
5.4.2 WAN Bandwidth Usage
Apart from the correctness guarantee and the API transparency, the design objective
of HSP is WAN efficiency in wide-area graph analytics. As compared to BSP, HSP is
expected to significantly reduce the required number of global synchronizations as well
as the WAN bandwidth usage. These statistics in the experiments are calculated and
summarized in Table 5.2, with all combinations of benchmarks and datasets.
In general, HSP has met our expectations; more than 22% reduction in WAN band-
width usage can be observed in all workloads. Among the applications, PR benefits most
[Figure 5.5 graphics: stacked cost bars (US$) on Google Cloud, splitting WAN usage and instance costs under BSP and HSP. (a) enwiki-2013: total cost reductions of 29.79% (PR), 15.54% (CC) and 9.41% (SP). (b) uk-2014-host: reductions of 30.43% (PR), 14.79% (CC) and 8.21% (SP).]
Figure 5.5: Estimated cost breakdown for running applications. The calculation follows the Google Cloud pricing model as of July 2017, where 10 instances cost $0.95/h and WAN traffic is $0.08/GB.
[Figure 5.6 graphics: the delta ‖x(k) − x(k−1)‖2 on a log scale for PageRank on uk-2014-host under BSP and HSP, plotted (a) against the number of global synchronizations and (b) against the execution time in seconds.]
Figure 5.6: Rate of convergence analysis for PageRank on uk-2014-host. The delta is organized by the number of global synchronizations and the application execution time, respectively.
from running under HSP, enjoying an over 30% reduction in total inter-datacenter traffic. CC and SP require fewer global synchronizations before convergence even under BSP, leaving less room for improvement.
HSP also works well on both datasets, despite the differences in graph diameters.
Because uk-2014-host, with its larger diameter, is partitioned with less fragmentation across
the 5 datacenters, the graph traversal applications (CC and SP) see a higher reduction
in the number of global synchronizations under HSP. Another interesting finding is that
running CC on enwiki-2013 under HSP takes only 5 global synchronizations, which
reaches the expected minimum in a 5-datacenter setting.
5.4.3 Performance and Total Cost Analysis
WAN bandwidth usage, along with instance usage, directly contributes to the monetary
cost of running analytics in the public cloud. In most cloud pricing models, inter-datacenter
traffic is charged by the GB, while instances are charged by hours of machine time.
In our experiments, the runtime of each workload is summarized in Fig. 5.4, which
shows the normalized time for running applications under HSP as compared to BSP.
The performance varies across applications, due to their different degrees of network
intensiveness. PR, for example, achieves 14% less application runtime because it
originally spent more time transferring a huge amount of data across datacenters. SP, on
the other hand, shows slightly degraded performance, since the extra computation time
incurred by local updates exceeds the savings in network transfers.
However, we argue that the possible performance degradation is acceptable when
considering the total cost. We illustrate the cost breakdown of our experiments in Fig. 5.5.
For PR, WAN usage contributes the majority of the final monetary cost.
Since both machine time and inter-datacenter traffic have been reduced, HSP is about
30% cheaper than BSP. The other two applications are relatively less network-intensive,
but WAN usage still constitutes a large proportion of their cost. Even though the application
runtimes are similar, HSP can still save about 10%.
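The shape of these breakdowns can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses the prices quoted in Fig. 5.5 and the WAN volumes from Table 5.2; the runtimes, however, are assumed values for illustration (one hour under BSP, scaled by the 0.86x PR ratio from Fig. 5.4), not the measured ones, so the resulting saving differs slightly from the 29.79% reported in the figure.

```python
# Back-of-the-envelope cost model with the prices quoted in Fig. 5.5:
# 10 instances at $0.95/h in total, WAN traffic at $0.08/GB.
INSTANCE_RATE = 0.95   # USD per hour for the whole 10-worker cluster
WAN_RATE = 0.08        # USD per GB of inter-datacenter traffic

def job_cost(runtime_hours, wan_gb):
    return runtime_hours * INSTANCE_RATE + wan_gb * WAN_RATE

# PageRank on enwiki-2013: WAN volumes from Table 5.2, runtimes assumed.
bsp_cost = job_cost(runtime_hours=1.00, wan_gb=27.14)  # 3.1212 USD
hsp_cost = job_cost(runtime_hours=0.86, wan_gb=18.39)  # 2.2882 USD
saving = 1 - hsp_cost / bsp_cost                       # about 27%
```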
5.4.4 Rate of Convergence
To further verify the theoretical proof of a higher rate of convergence under HSP (Sec. 5.2.4),
we study the convergence speed of PageRank in our experiments. The results are shown
in Fig. 5.6. Unlike Fig. 5.2(b), we measure the “delta” of the ranks (as a Euclidean
norm) between two consecutive global synchronizations, rather than the distance to the
true ranks, because the true ranks are unknown.
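As a sketch, the per-synchronization delta and the resulting linear-convergence check could be computed as follows (the function names are ours, not part of the prototype):

```python
import math

def delta(ranks_prev, ranks_curr):
    """Euclidean norm of the change in ranks between two consecutive
    global synchronizations; a convergence proxy when the true ranks
    are unknown."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ranks_prev, ranks_curr)))

def convergence_rates(deltas):
    """Successive ratios delta_{k+1} / delta_k over the sequence of
    measured deltas; a roughly constant ratio below 1 indicates
    linear convergence."""
    return [d1 / d0 for d0, d1 in zip(deltas, deltas[1:])]
```

A smaller (more nearly constant) ratio for HSP than for BSP corresponds to the faster linear convergence reported below.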
We observe that Fig. 5.6a matches the numerical analysis in Fig. 5.2(b). HSP
converges linearly, yet at a rate 1.49x that of BSP with the same number of global
synchronizations.
In Fig. 5.6b, we plot the deltas against the end times of global synchronizations. The
two models show a similar speed of convergence in the early stages of execution, while
HSP accelerates more in the later stages. The reason is that, in the beginning, HSP
spends more time running local updates, which roughly double the time interval between
global synchronizations. Later, however, local updates take much less time because the
local vertices can be considered “more converged,” and the progress of HSP accelerates
significantly.
5.5 Summary
We introduce Hierarchical Synchronous Parallel (HSP), a new synchronization model
that is designed to run graph analytics on geographically distributed datasets efficiently.
HSP has a local mode of execution, which allows workers in different regions to work
asynchronously and avoids WAN traffic. By carefully designing the strategy for mode
switches, we have proved, both theoretically and experimentally, that HSP guarantees
the convergence and correctness of existing graph applications without change. Our prototype
implementation and evaluation show that HSP can reduce the WAN bandwidth usage by
up to 32%, leading to a significant reduction in monetary cost for analyzing graph data
in the cloud. We conclude that HSP is a general, efficient, and readily implementable
synchronization model that can benefit wide-area graph analytics systems.
Chapter 6
Concluding Remarks
6.1 Conclusion
In this dissertation, we explore several system approaches to optimizing the performance
of data analytics frameworks deployed across multiple geographically distributed
datacenters. By revisiting the design principles of general data analytics frameworks,
we have proposed three sets of system optimizations, targeting different layers in their
architecture.
First, at the networking layer, we directly alleviate the bottleneck of inter-datacenter
data transfers in wide-area data analytics. We have designed and
implemented Siphon — a building block that can be seamlessly integrated with existing
data parallel frameworks — to expedite coflow transfers. Following the principles of
software-defined networking, a controller implements and enforces several novel coflow
scheduling strategies. A novel approach to caching controller rules in the data plane has
also been adopted and evaluated.
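To illustrate what coflow-aware scheduling means in this context, the sketch below shows one classic policy, shortest-coflow-first, which prioritizes the coflow with the fewest remaining bytes; this is a simplified illustration of the general idea, not Siphon's exact scheduling strategy, and the names are ours:

```python
def schedule_coflows(coflows):
    """Order coflows by total remaining bytes, smallest first.

    `coflows` maps a coflow id to the list of remaining sizes (in
    bytes) of its constituent parallel flows. Serving the smallest
    coflows first tends to reduce the average coflow completion time,
    analogous to shortest-job-first for single flows.
    """
    return sorted(coflows, key=lambda cid: sum(coflows[cid]))
```

The job-level awareness comes from scheduling at the granularity of whole coflows rather than individual flows: a coflow only completes when its last flow finishes.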
To evaluate the effectiveness of Siphon in expediting coflows as well as analytics jobs,
we have conducted extensive experiments on real testbeds, with Siphon deployed across
geo-distributed datacenters. The results have demonstrated that Siphon can effectively
reduce the completion time of a single coflow by up to 76%, and reduce the average
coflow completion time as well.
Second, at the workflow execution layer, we optimize the timing of inter-datacenter
data transfers involved in a shuffle. We have designed and implemented a new
proactive data aggregation framework based on Apache Spark, with a focus on optimizing
the network traffic incurred in the shuffle stages of data analytics jobs. The objective of
this framework is to strategically and proactively aggregate the output data of mapper
tasks in a subset of worker datacenters, as a replacement for Spark’s original passive
fetch mechanism across datacenters. It improves the performance of wide-area analytics
jobs by avoiding repetitive data transfers, thereby improving the utilization of inter-datacenter
links. Our extensive experimental results using standard benchmarks across six Amazon
EC2 regions have shown that our proposed framework reduces job completion
times by up to 73%, as compared to the existing baseline implementation in Spark.
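A minimal sketch of the aggregation-site selection idea: greedily aggregate map output in the datacenters that already hold the most of it, so that the fewest bytes cross inter-datacenter links. This is an illustration of the principle under that simplifying assumption, not the framework's actual placement algorithm, and the function name is ours:

```python
def pick_aggregators(map_output_gb, k=1):
    """Choose k aggregation datacenters for a shuffle.

    `map_output_gb` maps a datacenter name to the GB of map output
    stored there. Picking the datacenters with the most local map
    output minimizes the data that must move over WAN links before
    the reduce phase.
    """
    ranked = sorted(map_output_gb, key=map_output_gb.get, reverse=True)
    return ranked[:k]
```

Aggregating proactively, as soon as map output is available, also overlaps the WAN transfers with the remaining map computation instead of deferring them to reduce time.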
Last but not least, at the algorithm API layer, we focus on optimizing wide-area graph
analytics, an important subset of data analytics applications. We have presented a new
Hierarchical Synchronous Parallel model, designed and implemented for synchronization
across datacenters with much improved efficiency in inter-datacenter communication.
Our new model requires no modifications to graph analytics applications, yet guarantees
their convergence and correctness. Our prototype implementation on Apache Spark
achieves up to 32% lower WAN bandwidth usage, 49% faster convergence, and 30% lower
total cost for benchmark graph algorithms, with input data stored across five geographically
distributed datacenters.
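The core control loop of HSP can be sketched as follows. The operator names are placeholders for the framework's actual implementation, and the sequential loop over partitions stands in for datacenters iterating in parallel:

```python
def hsp_run(partitions, local_update, global_sync, locally_converged,
            max_global=50):
    """Sketch of the Hierarchical Synchronous Parallel loop.

    Each datacenter iterates locally (incurring no WAN traffic) until
    its own partition is locally converged, and only then do all
    datacenters perform one global synchronization. `global_sync`
    returns True once the computation has globally converged.
    """
    for _ in range(max_global):
        for p in partitions:            # local mode: WAN-free iterations
            while not locally_converged(p):
                local_update(p)
        if global_sync(partitions):     # one inter-datacenter barrier
            return partitions
    return partitions
```

The key contrast with BSP is that a global (WAN-crossing) barrier happens once per batch of local iterations rather than once per iteration, which is where the WAN bandwidth savings come from.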
6.2 Future Directions
We will continue our work in wide-area graph analytics, further relaxing our assumptions
about the deployment environment.
Both BSP and HSP still require synchronization among all workers after each
iteration, so stragglers can degrade the overall system performance significantly. As a result,
existing systems partition the input graph evenly among workers before proceeding to
the iterative computation for load balancing, relying heavily on carefully designed
heuristics [19, 90].
These heuristics, despite their complexity, usually assume a high-performance, homogeneous
computing environment. However, this assumption may not hold in practice. In
both public and private clouds, it is common to have heterogeneous workers,
which possess different levels of computation capacity [5]. For example, virtual
machine instances might be heterogeneous due to resource sharing and the coexistence
of multiple generations of hardware [43]. To account for such heterogeneity, some partitioning
strategies in the literature require profiling or learning the system behavior [47, 50],
which is a costly process, while others assume a known resource model [49, 81] or prior
knowledge [60, 89] of the heterogeneity. Because of these additional deployment efforts
or assumptions, a static heterogeneity-aware partitioning strategy alone is less practical
and less effective.
We argue that the input graph should be dynamically repartitioned throughout the iterative
execution of graph analytics to achieve better load balance. To this end, we plan
to design a new distributed graph analytics system that can automatically adapt to a
heterogeneous environment without any prior knowledge. Beyond existing systems that
allow dynamic load balancing among workers, we will redesign the workload migration
framework from the ground up for efficiency, with complete awareness of the workers’
heterogeneous computing capacities.
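One possible repartitioning rule, sketched under the assumption that each worker's per-iteration processing speed can be measured online (the function and argument names are ours):

```python
def rebalance(load, speed):
    """Repartition vertex load across heterogeneous workers in
    proportion to their measured processing speed, so that every
    worker finishes an iteration at roughly the same time.

    `load` maps a worker id to its current vertex count; `speed` maps
    a worker id to its measured vertices processed per second. A
    sketch of the adaptive idea, not a full migration protocol.
    """
    total_load = sum(load.values())
    total_speed = sum(speed.values())
    return {w: round(total_load * s / total_speed) for w, s in speed.items()}
```

Comparing the returned targets with the current `load` yields the set of vertices to migrate; a practical system must also weigh this migration cost against the straggler time it eliminates.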
Bibliography
[1] Apache Hadoop Official Website. http://hadoop.apache.org/. [Online; accessed
1-May-2016].
[2] MLlib: Apache Spark Website. http://spark.apache.org/mllib/. [Online; ac-
cessed 1-May-2016].
[3] Open Network Foundation Official Website. https://www.opennetworking.org/.
[Online; accessed 6-May-2015].
[4] OpenFlow White Paper. https://www.opennetworking.org/images/stories/downloads/sdn-resources/white-papers/wp-sdn-newnorm.pdf. [Online; accessed 6-May-2015].
[5] Martín Abadi, Paul Barham, Jianmin Chen, et al. TensorFlow: A System for Large-
Scale Machine Learning. In Proc. USENIX Symposium on Operating Systems Design
and Implementation (OSDI), 2016.
[6] Kanak Agarwal, Colin Dixon, Eric Rozner, and John Carter. Shadow MACs: Scal-
able Label-Switching for Commodity Ethernet. In Proc. ACM SIGCOMM Workshop
on Hot Topics in Software Defined Networking (HotSDN), 2014.
[7] Mohammad Alizadeh, Albert Greenberg, David A Maltz, Jitendra Padhye, Parveen
Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data Center
TCP (DCTCP). In Proc. ACM SIGCOMM, 2010.
[8] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown,
Balaji Prabhakar, and Scott Shenker. pFabric: Minimal Near-Optimal Datacenter
Transport. In Proc. ACM SIGCOMM, 2013.
[9] Aditya Auradkar, Chavdar Botev, Shirshanka Das, Dave De Maagd, Alex Feinberg,
Phanindra Ganti, Lei Gao, Bhaskar Ghosh, Kishore Gopalakrishna, Brendan Harris,
et al. Data Infrastructure at LinkedIn. In Proc. IEEE International Conference on
Data Engineering (ICDE), 2012.
[10] Daniel O Awduche. MPLS and Traffic Engineering in IP Networks. IEEE Commu-
nications Magazine, 37(12):42–47, 1999.
[11] Daniel O Awduche and Johnson Agogbua. Requirements for Traffic Engineering over
MPLS. Technical report, RFC 2702, September 1999.
[12] Stefan Banach. Sur les Opérations Dans les Ensembles Abstraits et Leur Application
aux Équations Intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922.
[13] J Christopher Beck and Nic Wilson. Proactive Algorithms for Job Shop Scheduling
with Probabilistic Durations. Journal of Artificial Intelligence Research, 28:183–232,
2007.
[14] Paolo Boldi and Sebastiano Vigna. The Webgraph Framework I: Compression Tech-
niques. In Proc. ACM International Conference on World Wide Web (WWW), 2004.
[15] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rex-
ford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, et al. P4:
Programming Protocol-Independent Packet Processors. ACM SIGCOMM Computer
Communication Review, 44(3):87–95, 2014.
[16] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web
Search Engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998.
[17] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui
Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry C Li, et al. TAO:
Facebook’s Distributed Data Store for the Social Graph. In Proc. USENIX Annual
Technical Conference (ATC), 2013.
[18] Martin Casado, Teemu Koponen, Scott Shenker, and Amin Tootoonchian. Fabric:
a Retrospective on Evolving SDN. In Proc. ACM SIGCOMM Workshop on Hot
Topics in Software Defined Networking (HotSDN), 2012.
[19] Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. PowerLyra: Differentiated
Graph Computation and Partitioning on Skewed Graphs. In Proc. ACM European
Conference on Computer Systems (Eurosys), 2015.
[20] Mosharaf Chowdhury and Ion Stoica. Efficient Coflow Scheduling Without Prior
Knowledge. In Proc. ACM SIGCOMM, 2015.
[21] Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I Jordan, and Ion Stoica.
Managing Data Transfers in Computer Clusters with Orchestra. In Proc. ACM
SIGCOMM, 2011.
[22] Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. Efficient Coflow Scheduling with
Varys. In Proc. ACM SIGCOMM, 2014.
[23] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M Hellerstein, Khaled Elmeleegy,
and Russell Sears. MapReduce Online. In Proc. USENIX Symposium on Networked
Systems Design and Implementation (NSDI), 2010.
[24] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on
Large Clusters. In Proc. USENIX Symposium on Operating Systems Design and
Implementation (OSDI), 2004.
[25] Fahad R Dogar, Thomas Karagiannis, Hitesh Ballani, and Antony Rowstron. De-
centralized Task-Aware Scheduling for Data Center Networks. In Proc. ACM SIG-
COMM, 2014.
[26] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From Data Mining
to Knowledge Discovery in Databases. AI magazine, 17(3):37, 1996.
[27] Yuan Feng, Baochun Li, and Bo Li. Airlift: Video Conferencing as a Cloud Ser-
vice Using inter-Datacenter Networks. In Proc. IEEE International Conference on
Network Protocols (ICNP), 2012.
[28] Nate Foster, Dexter Kozen, Konstantinos Mamouras, Mark Reitblatt, and Alexan-
dra Silva. Probabilistic NetKAT. In Proc. European Symposium on Programming
Languages and Systems, 2016.
[29] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin.
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In
Proc. USENIX Symposium on Operating Systems Design and Implementation
(OSDI), 2012.
[30] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J
Franklin, and Ion Stoica. GraphX: Graph Processing in a Distributed Dataflow
Framework. In Proc. USENIX Symposium on Operating Systems Design and Imple-
mentation (OSDI), 2014.
[31] Yanfei Guo, Jia Rao, Dazhao Cheng, and Xiaobo Zhou. iShuffle: Improving Hadoop
Performance with Shuffle-on-Write. In Proc. USENIX International Conference on
Autonomic Computing (ICAC), 2013.
[32] Ashish Gupta, Fan Yang, Jason Govig, Adam Kirsch, Kelvin Chan, Kevin Lai,
Shuo Wu, Sandeep Govind Dhoot, Abhilash Rajesh Kumar, Ankur Agiwal, et al.
Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing. Proc. VLDB
Endowment, 7(12):1259–1270, 2014.
[33] Minyang Han and Khuzaima Daudjee. Giraph Unchained: Barrierless Asynchronous
Parallel Execution in Pregel-Like Graph Processing Systems. Proc. VLDB Endow-
ment, 8(9):950–961, 2015.
[34] Soheil Hassas Yeganeh and Yashar Ganjali. Kandoo: a Framework for Efficient and
Scalable Offloading of Control Applications. In Proc. ACM SIGCOMM Workshop
on Hot Topics in Software Defined Networking (HotSDN), 2012.
[35] Benjamin Heintz, Abhishek Chandra, Ramesh K Sitaraman, and Jon Weissman.
End-to-End Optimization for Geo-Distributed MapReduce. IEEE Transactions on
Cloud Computing, 2015.
[36] C. Y. Hong, M. Caesar, and P. B. Godfrey. Finishing Flows Quickly with Preemptive
Scheduling. In Proc. ACM SIGCOMM, 2012.
[37] Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan
Nanduri, and Roger Wattenhofer. Achieving High Utilization with Software-Driven
WAN. In Proc. ACM SIGCOMM, 2013.
[38] Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R.
Ganger, Phillip B. Gibbons, and Onur Mutlu. Gaia: Geo-Distributed Machine
Learning Approaching LAN Speeds. In Proc. USENIX Symposium on Networked
Systems Design and Implementation (NSDI), 2017.
[39] Zhiming Hu, Baochun Li, and Jun Luo. Flutter: Scheduling Tasks Closer to Data
Across Geo-Distributed Datacenters. In Proc. IEEE INFOCOM, 2016.
[40] Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. The HiBench
Benchmark Suite: Characterization of the MapReduce-Based Data Analysis. In
Proc. International Conference on Data Engineering Workshops (ICDEW), 2010.
[41] Chien-Chun Hung, Leana Golubchik, and Minlan Yu. Scheduling Jobs Across Geo-
Distributed Datacenters. In Proc. ACM Symposium on Cloud Computing (SoCC),
2015.
[42] Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun
Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, et al. B4: Experi-
ence with a Globally-Deployed Software Defined WAN. In Proc. ACM SIGCOMM,
2013.
[43] Jiawei Jiang, Bin Cui, Ce Zhang, and Lele Yu. Heterogeneity-Aware Distributed
Parameter Servers. In Proc. ACM International Conference on Management of Data
(SIGMOD), 2017.
[44] Lavanya Jose, Lisa Yan, George Varghese, and Nick McKeown. Compiling Packet
Programs to Reconfigurable Switches. In Proc. USENIX Symposium on Networked
Systems Design and Implementation (NSDI), 2015.
[45] Konstantinos Kloudas, Margarida Mamede, Nuno Preguiça, and Rodrigo Rodrigues.
Pixida: Optimizing Data Parallel Jobs in Wide-Area Data Analytics. Proc. VLDB
Endowment, 9(2):72–83, 2015.
[46] Diego Kreutz, Fernando MV Ramos, Paulo Esteves Verissimo, Christian Esteve
Rothenberg, Siamak Azodolmolky, and Steve Uhlig. Software-Defined Networking:
A Comprehensive Survey. Proceedings of the IEEE, 103(1):14–76, 2015.
[47] Dinesh Kumar, Arun Raj, Deepankar Patra, and Dharanipragada Janakiram.
GraphIVE: Heterogeneity-Aware Adaptive Graph Partitioning in GraphLab. In
Proc. IEEE International Conference on Parallel Processing Workshops (ICCPW),
2014.
[48] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Dandapani Sivakumar,
Andrew Tompkins, and Eli Upfal. The Web as a Graph. In Proc. ACM SIGMOD-
SIGACT-SIGART Symposium on Principles of Database Systems, 2000.
[49] Shailendra Kumar, Sajal K Das, and Rupak Biswas. Graph Partitioning for Parallel
Applications in Heterogeneous Grid Environments. In Proc. IEEE International
Parallel and Distributed Processing Symposium (IPDPS), 2002.
[50] Michael LeBeane, Shuang Song, Reena Panda, Jee Ho Ryoo, and Lizy K John.
Data Partitioning Strategies for Graph Workloads on Heterogeneous Clusters. In
Proc. IEEE International Conference for High Performance Computing, Networking,
Storage and Analysis (SC), 2015.
[51] George Lee, Jimmy Lin, Chuang Liu, Andrew Lorek, and Dmitriy Ryaboy. The Uni-
fied Logging Infrastructure for Data Analytics at Twitter. Proc. VLDB Endowment,
5(12):1771–1780, 2012.
[52] Yupeng Li, Shaofeng H-C Jiang, Haisheng Tan, Chenzi Zhang, Guihai Chen, Jipeng
Zhou, and Francis Lau. Efficient Online Coflow Routing and Scheduling. In Proc.
ACM International Symposium on Mobile Ad Hoc Networking and Computing (Mo-
biHoc), 2016.
[53] Shuhao Liu, Li Chen, and Baochun Li. Siphon: a High-Performance Substrate for
Inter-Datacenter Transfers in Wide-Area Data Analytics. In Proc. USENIX Annual
Technical Conference (ATC), 2017.
[54] Shuhao Liu, Li Chen, Baochun Li, and Aiden Carnegie. A Hierarchical Synchronous
Parallel Model for Wide-Area Graph Analytics. In Proc. IEEE INFOCOM, 2018.
[55] Shuhao Liu and Baochun Li. Stemflow: Software-Defined Inter-Datacenter Over-
lay as a Service. IEEE Journal on Selected Areas in Communications (JSAC),
35(11):2563–2573, 2017.
[56] Shuhao Liu, Hao Wang, and Baochun Li. Optimizing Shuffle in Wide-Area Data An-
alytics. In Proc. IEEE International Conference on Distributed Computing Systems
(ICDCS), 2017.
[57] Zimu Liu, Yuan Feng, and Baochun Li. Bellini: Ferrying Application Traffic Flows
through Geo-Distributed Datacenters in the Cloud. In Proc. IEEE GLOBECOM,
2013.
[58] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and
Joseph M Hellerstein. Distributed GraphLab: a Framework for Machine Learning
and Data Mining in the Cloud. Proc. VLDB Endowment, 5(8):716–727, 2012.
[59] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn,
Naty Leiser, and Grzegorz Czajkowski. Pregel: A System for Large-Scale Graph
Processing. In Proc. ACM SIGMOD International Conference on Management of
Data (SIGMOD), 2010.
[60] Christian Mayer, Muhammad Adnan Tariq, Chen Li, and Kurt Rothermel.
GrapH: Heterogeneity-Aware Graph Computation with Adaptive Partitioning. In
Proc. IEEE International Conference on Distributed Computing Systems (ICDCS),
2016.
[61] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson,
Jennifer Rexford, Scott Shenker, and Jonathan Turner. OpenFlow: Enabling Inno-
vation in Campus Networks. ACM SIGCOMM Computer Communication Review,
38(2):69–74, 2008.
[62] Hesham Mekky, Fang Hao, Sarit Mukherjee, Zhi-Li Zhang, and TV Lakshman.
Application-Aware Data Plane Processing in SDN. In Proc. ACM SIGCOMM Work-
shop on Hot Topics in Software Defined Networking (HotSDN), 2014.
[63] Masoud Moshref, Apoorv Bhargava, Adhip Gupta, Minlan Yu, and Ramesh Govin-
dan. Flow-Level State Transition as a New Switch Primitive for SDN. In Proc. ACM
SIGCOMM Workshop on Hot Topics in Software Defined Networking (HotSDN),
2014.
[64] Qifan Pu, Ganesh Ananthanarayanan, Peter Bodik, Srikanth Kandula, Aditya
Akella, Paramvir Bahl, and Ion Stoica. Low Latency Geo-Distributed Data An-
alytics. In Proc. ACM SIGCOMM, 2015.
[65] Ariel Rabkin, Matvey Arye, Siddhartha Sen, Vivek S Pai, and Michael J Freedman.
Aggregation and Degradation in JetStream: Streaming Analytics in the Wide Area.
In Proc. USENIX Symposium on Networked Systems Design and Implementation
(NSDI), 2014.
[66] Smriti R Ramakrishnan, Garret Swart, and Aleksey Urmanov. Balancing Reducer
Skew in MapReduce Workloads Using Progressive Sampling. In Proc. ACM Sympo-
sium on Cloud Computing (SoCC), 2012.
[67] Jeff Rasley, Brent Stephens, Colin Dixon, Eric Rozner, Wes Felter, Kanak Agar-
wal, John Carter, and Rodrigo Fonseca. Planck: Millisecond-Scale Monitoring and
Control for Commodity Networks. In Proc. ACM SIGCOMM, 2014.
[68] Semih Salihoglu and Jennifer Widom. Optimizing Graph Algorithms on Pregel-Like
Systems. Proc. VLDB Endowment, 7(7):577–588, 2014.
[69] Naveen Kr Sharma, Antoine Kaufmann, Thomas E Anderson, Arvind Krishna-
murthy, Jacob Nelson, and Simon Peter. Evaluating the Power of Flexible Packet
Processing for Network Resource Allocation. In Proc. USENIX Symposium on Net-
worked Systems Design and Implementation (NSDI), 2017.
[70] Anirudh Sivaraman, Alvin Cheung, Mihai Budiu, Changhoon Kim, Mohammad
Alizadeh, Hari Balakrishnan, George Varghese, Nick McKeown, and Steve Licking.
Packet Transactions: High-Level Programming for Line-Rate Switches. In Proc.
ACM SIGCOMM, 2016.
[71] Anirudh Sivaraman, Changhoon Kim, Ramkumar Krishnamoorthy, Advait Dixit,
and Mihai Budiu. DC.p4: Programming the Forwarding Plane of a Data-Center
Switch. In Proc. ACM SIGCOMM Symposium on SDN Research (SOSR), 2015.
[72] Hengky Susanto, Hao Jin, and Kai Chen. Stream: Decentralized Opportunistic
Inter-Coflow Scheduling for Datacenter Networks. In Proc. IEEE International Con-
ference on Network Protocols (ICNP), 2016.
[73] Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, and
John McPherson. From Think Like a Vertex to Think Like a Graph. Proc. VLDB
Endowment, 7(3):193–204, 2013.
[74] Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The Anatomy
of the Facebook Social Graph. arXiv preprint arXiv:1111.4503, 2011.
[75] Leslie G Valiant. A Bridging Model for Parallel Computation. Communications of
the ACM, 33(8):103–111, 1990.
[76] Balajee Vamanan, Jahangir Hasan, and TN Vijaykumar. Deadline-Aware Datacenter
TCP (D2TCP). In Proc. ACM SIGCOMM, 2012.
[77] Raajay Viswanathan, Ganesh Ananthanarayanan, and Aditya Akella. Clarinet:
Wan-Aware Optimization for Analytics Queries. In Proc. USENIX Symposium on
Operating Systems Design and Implementation (OSDI), 2016.
[78] Ashish Vulimiri, Carlo Curino, Philip Brighten Godfrey, Thomas Jungblut, Kon-
stantinos Karanasos, Jitendra Padhye, and George Varghese. WANalytics: Analyt-
ics for a Geo-Distributed Data-Intensive World. In Proc. Conference on Innovative
Data Systems Research (CIDR), 2015.
[79] Ashish Vulimiri, Carlo Curino, Philip Brighten Godfrey, Thomas Jungblut, Jitu
Padhye, and George Varghese. Global Analytics in the Face of Bandwidth and
Regulatory Constraints. In Proc. USENIX Symposium on Networked Systems Design
and Implementation (NSDI), 2015.
[80] Damon Wischik, Costin Raiciu, Adam Greenhalgh, and Mark Handley. De-
sign, Implementation and Evaluation of Congestion Control for Multipath TCP.
In Proc. USENIX Symposium on Networked Systems Design and Implementation
(NSDI), 2011.
[81] Ning Xu, Bin Cui, Lei Chen, Zi Huang, and Yingxia Shao. Heterogeneous Environ-
ment Aware Streaming Graph Partitioning. IEEE Transactions on Knowledge and
Data Engineering (TKDE), 27(6):1560–1572, June 2015.
[82] Da Yan, Yingyi Bu, Yuanyuan Tian, Amol Deshpande, et al. Big Graph Analytics
Platforms. Foundations and Trends® in Databases, 7(1-2):1–195, 2017.
[83] Da Yan, James Cheng, Kai Xing, Yi Lu, Wilfred Ng, and Yingyi Bu. Pregel Algo-
rithms for Graph Connectivity Problems with Performance Guarantees. Proc. VLDB
Endowment, 7(14):1821–1832, 2014.
[84] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Mur-
phy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient Dis-
tributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.
In Proc. USENIX Symposium on Networked Systems Design and Implementation
(NSDI), 2012.
[85] David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, and Randy Katz.
DeTail: Reducing the Flow Completion Time Tail in Datacenter Networks. In Proc.
ACM SIGCOMM, 2012.
[86] Hong Zhang, Li Chen, Bairen Yi, Kai Chen, Mosharaf Chowdhury, and Yanhui
Geng. CODA: Toward Automatically Identifying and Scheduling Coflows in the
Dark. In Proc. ACM SIGCOMM, 2016.
[87] Yanfeng Zhang, Qixin Gao, Lixin Gao, and Cuirong Wang. Maiter: an Asynchronous
Graph Processing Framework for Delta-Based Accumulative Iterative Computation.
IEEE Transactions on Parallel and Distributed Systems (TPDS), 25(8):2091–2100,
2014.
[88] Yangming Zhao, Kai Chen, Wei Bai, Minlan Yu, Chen Tian, Yanhui Geng, Yiming
Zhang, Dan Li, and Sheng Wang. Rapier: Integrating Routing and Scheduling for
Coflow-Aware Data Center Networks. In Proc. IEEE INFOCOM, 2015.
[89] Amelie Zhou, Shadi Ibrahim, and Bingsheng He. On Achieving Efficient Data Trans-
fer for Graph Processing in Geo-Distributed Datacenters. In Proc. IEEE Interna-
tional Conference on Distributed Computing Systems (ICDCS), 2017.
[90] Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. Gemini: A
Computation-Centric Distributed Graph Processing System. In Proc. USENIX Sym-
posium on Operating Systems Design and Implementation (OSDI), 2016.