TRANSCRIPT
Coflow: Mending the Application-Network Gap in Big Data Analytics
Mosharaf Chowdhury
UC Berkeley
Big Data
The volume of data businesses want to make sense of is increasing
• Increasing variety of sources: web, mobile, wearables, vehicles, scientific, …
• Cheaper disks, SSDs, and memory
• Stalling processor speeds
Big Datacenters for Massive Parallelism
[Timeline, 2005–2015: MapReduce, Hadoop, Dryad, DryadLINQ, Hive, Pregel, Dremel, Spark, GraphLab, Storm, Spark-Streaming, BlinkDB, GraphX]
Data-Parallel Applications
Multi-stage dataflow
• Computation interleaved with communication
Computation stage (e.g., Map, Reduce)
• Distributed across many machines
• Tasks run in parallel
Communication stage (e.g., Shuffle)
• Between successive computation stages
[Figure: Map stage, shuffle, Reduce stage]
A communication stage cannot complete until all the data have been transferred
Communication is Crucial
Performance
• As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck
• Facebook jobs spend ~25% of runtime on average in intermediate comm.1
1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.
Faster Communication Stages: Networking Approach
“Configuration should be handled at the system level”
Flow
• Transfers data from a source to a destination
• Independent unit of allocation, sharing, load balancing, and/or prioritization
Existing Solutions
[Timeline, 1980s–2000s: GPS, WFQ, RED, CSFQ, ECN, RCP, XCP, DCTCP, D2TCP, D3, PDQ, FCP, DeTail, pFabric; the focus shifts from per-flow fairness to flow completion time]
Independent flows cannot capture the collective communication behavior common in data-parallel applications
Why Do They Fall Short?
[Figure: a shuffle from senders s1–s3 to receivers r1–r2 across a datacenter network with three input and three output links]
[Per-flow fair sharing: shuffle completion time = 5; average flow completion time = 3.66]
Solutions focusing on flow completion time cannot further decrease the shuffle completion time
Improve Application-Level Performance1
[Figure repeated: per-flow fair sharing gives shuffle completion time = 5 and average flow completion time = 3.66]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
Slow down faster flows to accelerate slower flows
[Data-Proportional Allocation: every flow finishes at time 4; shuffle completion time = 4, average flow completion time = 4]
Applications know their performance goals, but they have no means to let the network know
Faster Communication Stages: Systems Approach
“Configuration should be handled by the end users”

Networking approach: “Configuration should be handled at the system level”
Systems approach: “Configuration should be handled by the end users”

MIND THE GAP
Holistic Approach: Applications and the Network Working Together

Coflow
Communication abstraction for data-parallel applications to express their performance goals:
1. Minimize completion times,
2. Meet deadlines, or
3. Perform fair allocation.
[Coflow structures: single flow, parallel flows, broadcast, aggregation, shuffle, all-to-all]

How to schedule coflows online …
1. … for faster completion of coflows?
2. … to meet more deadlines?
3. … for fair allocation of the network?
[Figure: N senders and N receivers connected through a datacenter fabric]
Varys1
Enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
1. Efficient Coflow Scheduling with Varys, SIGCOMM’2014.
Coflow
Communication abstraction for data-parallel applications to express their performance goals
A coflow conveys:
1. The size of each flow,
2. The total number of flows, and
3. The endpoints of individual flows.
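To make this concrete, here is a minimal sketch (in Scala, with illustrative names rather than the actual Varys data structures) of the information a coflow conveys, together with the derived metrics that the ordering heuristics later in the talk rely on:

```scala
// A coflow is just a named collection of flows; each flow exposes its
// endpoints and size. Names and types are illustrative, not Varys internals.
case class Flow(src: String, dst: String, bytes: Long)

case class Coflow(id: String, flows: Seq[Flow]) {
  def width: Int   = flows.size                 // total number of flows
  def size: Long   = flows.map(_.bytes).sum     // total bytes across all flows
  def length: Long = flows.map(_.bytes).max     // size of the largest flow
  def bottleneck: Long = {                      // bytes through the most-loaded port
    val bySrc = flows.groupBy(_.src).values.map(_.map(_.bytes).sum)
    val byDst = flows.groupBy(_.dst).values.map(_.map(_.bytes).sum)
    (bySrc ++ byDst).max
  }
}
```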
Benefits of Inter-Coflow Scheduling
[Example: two coflows over Link 1 and Link 2; Coflow 1 has a 3-unit flow, Coflow 2 has 6-unit and 2-unit flows]
Fair Sharing: Coflow1 comp. time = 5, Coflow2 comp. time = 6
Smallest-Flow First1,2: Coflow1 comp. time = 5, Coflow2 comp. time = 6
The Optimal: Coflow1 comp. time = 3, Coflow2 comp. time = 6
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
Inter-Coflow Scheduling
Fair Sharing: Coflow1 comp. time = 6, Coflow2 comp. time = 6
Flow-level Prioritization1: Coflow1 comp. time = 6, Coflow2 comp. time = 6
The Optimal: Coflow1 comp. time = 3, Coflow2 comp. time = 6
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
Inter-Coflow Scheduling is NP-Hard
Concurrent Open Shop Scheduling1
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
[The same example: Coflow 1 with a 3-unit flow, Coflow 2 with 6-unit and 2-unit flows across Link 1 and Link 2]
1. A Note on the Complexity of the Concurrent Open Shop Problem, Journal of Scheduling, 9(4):389–396, 2006.
Inter-Coflow Scheduling is NP-Hard
Concurrent Open Shop Scheduling with Coupled Resources
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
• Consider matching constraints
[Figure: the same two coflows mapped onto a datacenter fabric with three input and three output links; link loads of 3, 6, and 2 units]
Varys
Employs a two-step algorithm to minimize coflow completion times
1. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting if needed
2. Allocation algorithm: allocates minimum required resources to each coflow to finish in minimum time
Ordering Heuristic
[Example: three coflows C1, C2, C3 over a 3×3 datacenter fabric, with flows of 3, 3, 3, 6, 5, and 5 units]
Length: C1 = 3, C2 = 5, C3 = 6
Shortest-First: C1 ends at 3, C2 ends at 13, C3 ends at 19 (Total CCT = 35)
Ordering Heuristic
Width: C1 = 3, C2 = 2, C3 = 1
Narrowest-First: C3 ends at 6, C2 ends at 16, C1 ends at 19 (Total CCT = 41)
(Shortest-First, for comparison: Total CCT = 35)
Ordering Heuristic
Size: C1 = 9, C2 = 10, C3 = 6
Smallest-First: C3 ends at 6, C1 ends at 9, C2 ends at 19 (Total CCT = 34)
(Shortest-First: 35; Narrowest-First: 41)
Ordering Heuristic
Bottleneck: C1 = 3, C2 = 10, C3 = 6
Smallest-Bottleneck: C1 ends at 3, C3 ends at 9, C2 ends at 19 (Total CCT = 31)
(Shortest-First: 35; Narrowest-First: 41; Smallest-First: 34)
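In code, the ordering step is essentially a sort of the active coflows by one of these metrics. A hedged sketch using the illustrative Coflow descriptor from earlier (the full Varys heuristic additionally accounts for current port capacities when computing effective bottlenecks):

```scala
// Smallest-Bottleneck-First: schedule coflows in non-decreasing order of
// their bottleneck, preempting lower-priority coflows when a new one arrives.
// The alternatives compared above would sort by a different metric:
//   _.length (Shortest-First, total CCT 35), _.width (Narrowest-First, 41),
//   _.size (Smallest-First, 34); sorting by _.bottleneck gives 31 here.
def order(active: Seq[Coflow]): Seq[Coflow] =
  active.sortBy(_.bottleneck)
```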
Allocation Algorithm
• A coflow cannot finish before its very last flow
• Finishing flows faster than the bottleneck cannot decrease a coflow’s completion time
• Allocate minimum flow rates such that all flows of a coflow finish together, on time
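A minimal sketch of this allocation rule: compute the earliest time at which the coflow’s bottleneck port can finish given the bandwidth available to it, then give every flow just enough rate to finish at that same instant. The Coflow/Flow descriptor and the per-port bandwidth map are the illustrative ones from earlier, not the actual Varys internals.

```scala
// Rates are in bytes/sec; availBps maps each port to the bandwidth this
// coflow may use there. Assumes every port the coflow touches is present.
def allocate(cf: Coflow, availBps: Map[String, Double]): Map[Flow, Double] = {
  // Earliest finish time at each port = bytes through it / its bandwidth.
  val srcTimes = cf.flows.groupBy(_.src).map { case (p, fs) => fs.map(_.bytes).sum / availBps(p) }
  val dstTimes = cf.flows.groupBy(_.dst).map { case (p, fs) => fs.map(_.bytes).sum / availBps(p) }
  val finishTime = (srcTimes ++ dstTimes).max   // the bottleneck decides

  // Minimum rate per flow: just enough to finish exactly at finishTime,
  // so no flow finishes needlessly early and no bandwidth is wasted.
  cf.flows.map(f => f -> f.bytes / finishTime).toMap
}
```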
Varys
Enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
The Need for Coordination
[Example: two coflows C1 and C2 across the fabric; bottlenecks: C1 = 4, C2 = 5]
Scheduling with coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13)
The Need for Coordination
Scheduling with coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13)
Scheduling without coordination: C1 ends at 7, C2 ends at 12 (Total CCT = 19)
Uncoordinated local decisions interleave coflows, hurting performance
Varys Architecture1
Centralized master-slave architecture
• Applications use a client library to communicate with the master
• Actual timing and rates are determined by the coflow scheduler
[Figure: the Varys master hosts the coflow scheduler, a topology monitor, and a usage estimator; computation tasks (senders, receivers, drivers) call the Varys client library (reg, put, get), and per-machine Varys daemons sit between the (distributed) file system and the network interface]
1. Download from http://varys.net
Varys
Enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
The Coflow API
• register • put • get • unregister
[Figure: driver (JobTracker), mappers, and reducers connected by a broadcast coflow b and a shuffle coflow s]
@driver: b ← register(BROADCAST); s ← register(SHUFFLE)
id ← b.put(content) …
ids1 ← s.put(content) …
@mapper: b.get(id) …
@reducer: s.get(ids1) …
s.unregister(); b.unregister()
1. NO changes to user jobs
2. NO storage management
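The pseudocode above maps onto ordinary job code. Below is a hedged Scala sketch of one plausible arrangement; the interface and call signatures are illustrative rather than the exact Varys client library, and the precise division of calls between driver, mappers, and reducers may differ.

```scala
// Illustrative client interface mirroring register/put/get/unregister.
trait CoflowClient {
  def register(coflowType: String): String                          // returns a coflow id
  def put(coflowId: String, dataId: String, data: Array[Byte]): Unit
  def get(coflowId: String, dataId: String): Array[Byte]
  def unregister(coflowId: String): Unit
}

object CoflowApiSketch {
  // Driver: one coflow per communication stage of the job.
  def driver(c: CoflowClient, jobConf: Array[Byte]): (String, String) = {
    val bId = c.register("BROADCAST")
    val sId = c.register("SHUFFLE")
    c.put(bId, "jobConf", jobConf)            // broadcast content published once
    (bId, sId)
  }

  // Mapper: read the broadcast, publish its shuffle partition.
  def mapper(c: CoflowClient, bId: String, sId: String, mapId: Int, out: Array[Byte]): Unit = {
    val conf = c.get(bId, "jobConf")          // fetch the broadcast content
    c.put(sId, s"map-$mapId", out)            // publish this mapper's shuffle output
  }

  // Reducer: fetch the shuffle partitions it needs.
  def reducer(c: CoflowClient, sId: String, mapIds: Seq[Int]): Seq[Array[Byte]] =
    mapIds.map(id => c.get(sId, s"map-$id"))

  // Driver, once the stage is done: tear down both coflows.
  def cleanup(c: CoflowClient, bId: String, sId: String): Unit = {
    c.unregister(sId); c.unregister(bId)
  }
}
```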
1. Does it improve performance?
2. Can it beat non-preemptive solutions?
3. Do we really need coordination?
YES

Evaluation
A 3000-machine trace-driven simulation matched against a 100-machine EC2 deployment
Better than Per-Flow Fairness
Communication improvement / job improvement:
• EC2: 1.85X / 1.25X
• Simulation: 3.21X / 1.11X
• Communication-heavy jobs improve even more: 2.50X–3.16X and 3.39X–4.86X
Preemption is Necessary [Sim.]
[Bar chart, average CCT normalized to Varys: Varys = 1.00, per-flow fairness = 3.21, FIFO (Orchestra1) = 5.65, per-flow prioritization = 5.53, FIFO-LM = 22.07, NC = 1.10]
What about starvation? NO
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
Lack of Coordination Hurts [Sim.]
Smallest-flow-first (per-flow prioritization2,3)
• Minimizes flow completion time
FIFO-LM4 performs decentralized coflow scheduling
• Suffers due to local decisions
• Works well for small, similar coflows
[Bar chart repeated, normalized to Varys = 1.00: per-flow fairness = 3.21, FIFO1 = 5.65, per-flow prioritization = 5.53, FIFO-LM = 22.07, NC = 1.10]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014.
Coflow
Communication abstraction for data-parallel applications to express their performance goals
A coflow conveys:
1. The size of each flow,
2. The total number of flows, and
3. The endpoints of individual flows.
In practice, this information is often unavailable ahead of time because of pipelining between stages, speculative execution, and task failures and restarts.

How to Perform Coflow Scheduling Without Complete Knowledge?
Implications
Minimizing avg. comp. time | Flows in a single link | Coflows in an entire datacenter
With complete knowledge | Smallest-Flow-First | Ordering by bottleneck size + data-proportional rate allocation
Without complete knowledge | Least-Attained Service (LAS) | ?
Revisiting Ordering Heuristics
[Recap: Shortest-First (Total CCT = 35), Narrowest-First (41), Smallest-First (34), Smallest-Bottleneck (31); all four require complete knowledge of coflow sizes up front ✖]
Coflow-Aware LAS (CLAS)
Set a priority that decreases with how much a coflow has already sent
• The more a coflow has sent, the lower its priority
• Smaller coflows finish faster
Use the total size of coflows to set priorities
• Avoids the drawbacks of full decentralization

Coflow-Aware LAS (CLAS)
Continuous priorities reduce to fair sharing when similar coflows coexist
• Priority oscillation
FIFO works well for similar coflows
[Example with two similar coflows. Fair sharing: Coflow1 comp. time = 6, Coflow2 comp. time = 6. FIFO: Coflow1 comp. time = 3, Coflow2 comp. time = 6.]
Discretized Coflow-Aware LAS (D-CLAS)
Priority discretization
• Change priority when total size exceeds predefined thresholds
Scheduling policies
• FIFO within the same queue
• Prioritization across queues
Weighted sharing across queues
• Guarantees starvation avoidance
[Figure: K priority queues, Q1 (highest priority) … QK (lowest priority), each served FIFO]
How to Discretize Priorities?
Exponentially spaced thresholds: 0, E·A, E²·A, …, E^(K−1)·A, ∞
• K: number of queues
• A: threshold constant
• E: threshold exponent
Loose coordination suffices to calculate global coflow sizes
• Slaves make independent decisions in between
Small coflows (smaller than E·A) do not experience coordination overheads!
[Figure: Q1, the highest-priority queue, holds coflows below E·A; QK, the lowest-priority queue, holds coflows above E^(K−1)·A; each queue is served FIFO]
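A minimal sketch of the queue-assignment rule in Scala. The K, A, and E values are only illustrative, and the real design uses weighted sharing across queues rather than the strict priority shown here:

```scala
object DClasSketch {
  val K = 10                    // number of priority queues
  val A = 10L * 1024 * 1024     // threshold constant (10 MB, illustrative)
  val E = 10.0                  // threshold exponent

  // Queue 0 (Q1, highest priority) holds coflows that have sent < E*A bytes;
  // queue q >= 1 holds coflows in [E^q * A, E^(q+1) * A); the last queue is unbounded.
  def queueOf(bytesSent: Long): Int = {
    var q = 0
    var threshold = E * A
    while (q < K - 1 && bytesSent >= threshold) {
      q += 1
      threshold *= E
    }
    q
  }

  // Prioritization across queues, FIFO (by arrival time) within a queue.
  case class ActiveCoflow(id: String, bytesSent: Long, arrivalTime: Long)
  def schedule(active: Seq[ActiveCoflow]): Seq[ActiveCoflow] =
    active.sortBy(cf => (queueOf(cf.bytesSent), cf.arrivalTime))
}
```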
Closely Approximates Varys [Sim. & EC2]
[Bar chart, average CCT normalized to Varys: Varys = 1.00, per-flow fairness = 3.21, FIFO1 = 5.65, per-flow prioritization2,3 = 5.53, FIFO-LM4 = 22.07, Varys w/o complete knowledge = 1.10]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014.
My Contributions
Datacenter resource allocation
• FairCloud (SIGCOMM’12) @HP
• HARP (SIGCOMM’12) @Microsoft Bing
• ViNEYard (ToN’12)
Network-aware applications
• Spark (NSDI’12): top-level Apache project
• Sinbad (SIGCOMM’13): merged at Facebook
Application-aware network scheduling (Coflow)
• Orchestra (SIGCOMM’11): merged with Spark
• Varys (SIGCOMM’14): open-source
• Aalo (SIGCOMM’15): open-source
Coflow: Communication-First Big Data Systems
In-datacenter analytics
• Cheaper SSDs and DRAM, proliferation of optical networks, and resource disaggregation will make the network the primary bottleneck
Inter-datacenter analytics
• Bandwidth-constrained wide-area networks
End-user delivery
• Faster and more responsive delivery of analytics results over the Internet for a better end-user experience

Mind the gap between networking and systems
• Better capture application-level performance goals using coflows
• Coflows improve application-level performance and usability
• Extends the networking and scheduling literature
• Coordination – even if not free – is worth paying for in many cases

[email protected]
http://mosharaf.com
Improve Flow Completion Times
[Figure: the same 3×2 shuffle. Per-flow fair sharing: shuffle completion time = 5, average flow completion time = 3.66. Smallest-Flow First1,2: shuffle completion time = 6, average flow completion time = 2.66.]
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
Distributions of Coflow Characteristics
[CDFs across the Facebook trace: coflow length (bytes), coflow width (number of flows), coflow size (bytes), and coflow bottleneck size (bytes)]
Traffic Sources
1. Ingest and replicate new data
2. Read input from remote machines, when needed
3. Transfer intermediate data
4. Write and replicate output
[Percentage of traffic by category at Facebook: 30%, 14%, 46%, and 10%]
Performance
Facebook jobs spend ~25% of runtime on average in intermediate comm.
Month-long trace from a 3000-machine MapReduce production cluster at Facebook: 320,000 jobs, 150 million tasks
[CDF: distribution of shuffle durations, i.e., fraction of runtime spent in shuffle vs. fraction of jobs]
Theoretical Results
Structure of optimal schedules
• Permutation schedules might not always lead to the optimal solution
Approximation ratio of COSS-CR
• Polynomial-time algorithm with a constant approximation ratio (64/3)1
The need for coordination
• Fully decentralized schedulers can perform arbitrarily worse than the optimal
1. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University, 2014.
The Coflow API
• register • put • get • unregister
[Figure: driver (JobTracker), mappers, and reducers connected by a broadcast coflow b and a shuffle coflow s]
@driver: b ← register(BROADCAST, numFlows); s ← register(SHUFFLE, numFlows, {b})
id ← b.put(content, size) …
ids1 ← s.put(content, size) …
@mapper: b.get(id) …
@reducer: s.get(ids1) …
s.unregister(); b.unregister()
1. NO changes to user jobs
2. NO storage management
Varys
Employs a two-step algorithm to support coflow deadlines
1. Admission control: do not admit any coflow that cannot be completed within its deadline without violating existing deadlines
2. Allocation algorithm: allocate minimum required resources to each coflow to finish it at its deadline
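A hedged sketch of the admission test, reusing the illustrative Coflow descriptor from earlier: a new coflow needs bytes/deadline of bandwidth at every port it touches, and it is admitted only if that much bandwidth is still unreserved at each of those ports. The real Varys admission control operates on its richer internal state; this is only the idea.

```scala
import scala.collection.mutable

case class DeadlineCoflow(cf: Coflow, deadlineSec: Double)

// portCapacityBps must contain every port a submitted coflow can touch.
class AdmissionControl(portCapacityBps: Map[String, Double]) {
  // Bandwidth already promised to admitted-but-unfinished coflows, per port.
  private val reserved = mutable.Map[String, Double]().withDefaultValue(0.0)

  // Minimum rate needed at each port to finish exactly at the deadline.
  // Assumes sender and receiver ports are distinct, as in a shuffle.
  private def demand(d: DeadlineCoflow): Map[String, Double] = {
    val perPort = d.cf.flows.groupBy(_.src) ++ d.cf.flows.groupBy(_.dst)
    perPort.map { case (p, fs) => p -> fs.map(_.bytes).sum / d.deadlineSec }
  }

  // Admit only if the extra reservation fits under every port's capacity,
  // i.e., no already-admitted deadline is put at risk.
  def tryAdmit(d: DeadlineCoflow): Boolean = {
    val need = demand(d)
    val fits = need.forall { case (p, bps) => reserved(p) + bps <= portCapacityBps(p) }
    if (fits) need.foreach { case (p, bps) => reserved(p) += bps }
    fits
  }
}
```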
More Predictable
[Charts: percentage of coflows that met their deadline, were not admitted, or missed their deadline, comparing Varys against per-flow fairness and against EDF1 (Earliest-Deadline First), in the EC2 deployment and the Facebook trace simulation]
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
Optimizing Communication Performance: Systems Approach
“Let users figure it out”
Number of communication-related parameters*:
• Spark v1.1.1: 6
• Hadoop v1.2.1: 10
• YARN v2.6.0: 20
*Lower bound. Does not include many parameters that can indirectly impact communication, e.g., the number of reducers. Also excludes control-plane communication/RPC parameters.
Experimental Methodology
Varys deployment in EC2
• 100 m2.4xlarge machines
• Each machine has 8 CPU cores, 68.4 GB memory, and a 1 Gbps NIC
• ~900 Mbps/machine during all-to-all communication
Trace-driven simulation
• Detailed replay of a day-long Facebook trace (circa October 2010)
• 3000-machine, 150-rack cluster with 10:1 oversubscription