TRANSCRIPT
Coflow: Mending the Application-Network Gap in Big Data Analytics
Mosharaf Chowdhury
UC Berkeley
Big Data
The volume of data businesses want to make sense of is increasing
• Increasing variety of sources: web, mobile, wearables, vehicles, scientific, …
• Cheaper disks, SSDs, and memory
• Stalling processor speeds
Big Datacenters for Massive Parallelism
[Timeline, 2005–2015: MapReduce, Hadoop, Dryad, DryadLINQ, Hive, Pregel, Dremel, Spark, GraphLab, Storm, Spark-Streaming, BlinkDB, GraphX]
Data-Parallel Applications
Multi-stage dataflow
• Computation interleaved with communication
Computation stage (e.g., Map, Reduce)
• Distributed across many machines
• Tasks run in parallel
Communication stage (e.g., Shuffle)
• Between successive computation stages
[Figure: Map stage, shuffle, Reduce stage]
A communication stage cannot complete until all the data have been transferred
Communication is Crucial
Performance
• As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck
• Facebook jobs spend ~25% of runtime on average in intermediate comm.1
1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.
Faster Communication Stages: Networking Approach
“Configuration should be handled at the system level”
Flow
• Transfers data from a source to a destination
• Independent unit of allocation, sharing, load balancing, and/or prioritization
Existing Solutions
[Timeline, 1980s–2000s: GPS, WFQ, RED, CSFQ, ECN, RCP, XCP, DCTCP, D2TCP, D3, PDQ, FCP, DeTail, pFabric; the focus shifts from per-flow fairness to flow completion time]
Independent flows cannot capture the collective communication behavior common in data-parallel applications
Why Do They Fall Short?
[Figure: a shuffle from senders s1–s3 to receivers r1–r2 across a datacenter network with three input and three output links]
[Per-flow fair sharing: shuffle completion time = 5; average flow completion time = 3.66]
Solutions focusing on flow completion time cannot further decrease the shuffle completion time
Improve Application-Level Performance1
[Figure repeated: per-flow fair sharing gives shuffle completion time = 5 and average flow completion time = 3.66]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
Slow down faster flows to accelerate slower flows
[Data-Proportional Allocation: every flow finishes at time 4; shuffle completion time = 4, average flow completion time = 4]
Applications know their performance goals, but they have no means to let the network know
Faster Communication Stages: Systems Approach
“Configuration should be handled by the end users”

Networking approach: “Configuration should be handled at the system level”
Systems approach: “Configuration should be handled by the end users”

MIND THE GAP
Holistic Approach: Applications and the Network Working Together

Coflow
Communication abstraction for data-parallel applications to express their performance goals:
1. Minimize completion times,
2. Meet deadlines, or
3. Perform fair allocation.
[Coflow structures: single flow, parallel flows, broadcast, aggregation, shuffle, all-to-all]

How to schedule coflows online …
1. … for faster completion of coflows?
2. … to meet more deadlines?
3. … for fair allocation of the network?
[Figure: N senders and N receivers connected through a datacenter fabric]
Varys1
Enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
1. Efficient Coflow Scheduling with Varys, SIGCOMM’2014.
Coflow
Communication abstraction for data-parallel applications to express their performance goals
A coflow conveys:
1. The size of each flow,
2. The total number of flows, and
3. The endpoints of individual flows.
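To make this concrete, here is a minimal sketch (in Scala, with illustrative names rather than the actual Varys data structures) of the information a coflow conveys, together with the derived metrics that the ordering heuristics later in the talk rely on:

```scala
// A coflow is just a named collection of flows; each flow exposes its
// endpoints and size. Names and types are illustrative, not Varys internals.
case class Flow(src: String, dst: String, bytes: Long)

case class Coflow(id: String, flows: Seq[Flow]) {
  def width: Int   = flows.size                 // total number of flows
  def size: Long   = flows.map(_.bytes).sum     // total bytes across all flows
  def length: Long = flows.map(_.bytes).max     // size of the largest flow
  def bottleneck: Long = {                      // bytes through the most-loaded port
    val bySrc = flows.groupBy(_.src).values.map(_.map(_.bytes).sum)
    val byDst = flows.groupBy(_.dst).values.map(_.map(_.bytes).sum)
    (bySrc ++ byDst).max
  }
}
```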
Benefits of Inter-Coflow Scheduling
[Example: two coflows over Link 1 and Link 2; Coflow 1 has a 3-unit flow, Coflow 2 has 6-unit and 2-unit flows]
Fair Sharing: Coflow1 comp. time = 5, Coflow2 comp. time = 6
Smallest-Flow First1,2: Coflow1 comp. time = 5, Coflow2 comp. time = 6
The Optimal: Coflow1 comp. time = 3, Coflow2 comp. time = 6
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
Inter-Coflow Scheduling
Fair Sharing: Coflow1 comp. time = 6, Coflow2 comp. time = 6
Flow-level Prioritization1: Coflow1 comp. time = 6, Coflow2 comp. time = 6
The Optimal: Coflow1 comp. time = 3, Coflow2 comp. time = 6
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
Inter-Coflow Scheduling is NP-Hard
Concurrent Open Shop Scheduling1
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
[The same example: Coflow 1 with a 3-unit flow, Coflow 2 with 6-unit and 2-unit flows across Link 1 and Link 2]
1. A Note on the Complexity of the Concurrent Open Shop Problem, Journal of Scheduling, 9(4):389–396, 2006.
Inter-Coflow Scheduling is NP-Hard
Concurrent Open Shop Scheduling with Coupled Resources
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
• Consider matching constraints
[Figure: the same two coflows mapped onto a datacenter fabric with three input and three output links; link loads of 3, 6, and 2 units]
Varys
Employs a two-step algorithm to minimize coflow completion times
1. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting if needed
2. Allocation algorithm: allocates minimum required resources to each coflow to finish in minimum time
Ordering Heuristic
[Example: three coflows C1, C2, C3 over a 3×3 datacenter fabric, with flows of 3, 3, 3, 6, 5, and 5 units]
Length: C1 = 3, C2 = 5, C3 = 6
Shortest-First: C1 ends at 3, C2 ends at 13, C3 ends at 19 (Total CCT = 35)
Ordering Heuristic
Width: C1 = 3, C2 = 2, C3 = 1
Narrowest-First: C3 ends at 6, C2 ends at 16, C1 ends at 19 (Total CCT = 41)
(Shortest-First, for comparison: Total CCT = 35)
Ordering Heuristic
Size: C1 = 9, C2 = 10, C3 = 6
Smallest-First: C3 ends at 6, C1 ends at 9, C2 ends at 19 (Total CCT = 34)
(Shortest-First: 35; Narrowest-First: 41)
Ordering Heuristic
Bottleneck: C1 = 3, C2 = 10, C3 = 6
Smallest-Bottleneck: C1 ends at 3, C3 ends at 9, C2 ends at 19 (Total CCT = 31)
(Shortest-First: 35; Narrowest-First: 41; Smallest-First: 34)
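In code, the ordering step is essentially a sort of the active coflows by one of these metrics. A hedged sketch using the illustrative Coflow descriptor from earlier (the full Varys heuristic additionally accounts for current port capacities when computing effective bottlenecks):

```scala
// Smallest-Bottleneck-First: schedule coflows in non-decreasing order of
// their bottleneck, preempting lower-priority coflows when a new one arrives.
// The alternatives compared above would sort by a different metric:
//   _.length (Shortest-First, total CCT 35), _.width (Narrowest-First, 41),
//   _.size (Smallest-First, 34); sorting by _.bottleneck gives 31 here.
def order(active: Seq[Coflow]): Seq[Coflow] =
  active.sortBy(_.bottleneck)
```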
Allocation Algorithm
• A coflow cannot finish before its very last flow
• Finishing flows faster than the bottleneck cannot decrease a coflow’s completion time
• Allocate minimum flow rates such that all flows of a coflow finish together, on time
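A minimal sketch of this allocation rule: compute the earliest time at which the coflow’s bottleneck port can finish given the bandwidth available to it, then give every flow just enough rate to finish at that same instant. The Coflow/Flow descriptor and the per-port bandwidth map are the illustrative ones from earlier, not the actual Varys internals.

```scala
// Rates are in bytes/sec; availBps maps each port to the bandwidth this
// coflow may use there. Assumes every port the coflow touches is present.
def allocate(cf: Coflow, availBps: Map[String, Double]): Map[Flow, Double] = {
  // Earliest finish time at each port = bytes through it / its bandwidth.
  val srcTimes = cf.flows.groupBy(_.src).map { case (p, fs) => fs.map(_.bytes).sum / availBps(p) }
  val dstTimes = cf.flows.groupBy(_.dst).map { case (p, fs) => fs.map(_.bytes).sum / availBps(p) }
  val finishTime = (srcTimes ++ dstTimes).max   // the bottleneck decides

  // Minimum rate per flow: just enough to finish exactly at finishTime,
  // so no flow finishes needlessly early and no bandwidth is wasted.
  cf.flows.map(f => f -> f.bytes / finishTime).toMap
}
```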
Varys
Enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
The Need for Coordination
[Example: two coflows C1 and C2 across the fabric; bottlenecks: C1 = 4, C2 = 5]
Scheduling with coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13)
The Need for Coordination
Scheduling with coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13)
Scheduling without coordination: C1 ends at 7, C2 ends at 12 (Total CCT = 19)
Uncoordinated local decisions interleave coflows, hurting performance
Varys Architecture1
Centralized master-slave architecture
• Applications use a client library to communicate with the master
• Actual timing and rates are determined by the coflow scheduler
[Figure: the Varys master hosts the coflow scheduler, a topology monitor, and a usage estimator; computation tasks (senders, receivers, drivers) call the Varys client library (reg, put, get), and per-machine Varys daemons sit between the (distributed) file system and the network interface]
1. Download from http://varys.net
Varys
Enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
The Coflow API
• register • put • get • unregister
[Figure: driver (JobTracker), mappers, and reducers connected by a broadcast coflow b and a shuffle coflow s]
@driver: b ← register(BROADCAST); s ← register(SHUFFLE)
id ← b.put(content) …
ids1 ← s.put(content) …
@mapper: b.get(id) …
@reducer: s.get(ids1) …
s.unregister(); b.unregister()
1. NO changes to user jobs
2. NO storage management
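The pseudocode above maps onto ordinary job code. Below is a hedged Scala sketch of one plausible arrangement; the interface and call signatures are illustrative rather than the exact Varys client library, and the precise division of calls between driver, mappers, and reducers may differ.

```scala
// Illustrative client interface mirroring register/put/get/unregister.
trait CoflowClient {
  def register(coflowType: String): String                          // returns a coflow id
  def put(coflowId: String, dataId: String, data: Array[Byte]): Unit
  def get(coflowId: String, dataId: String): Array[Byte]
  def unregister(coflowId: String): Unit
}

object CoflowApiSketch {
  // Driver: one coflow per communication stage of the job.
  def driver(c: CoflowClient, jobConf: Array[Byte]): (String, String) = {
    val bId = c.register("BROADCAST")
    val sId = c.register("SHUFFLE")
    c.put(bId, "jobConf", jobConf)            // broadcast content published once
    (bId, sId)
  }

  // Mapper: read the broadcast, publish its shuffle partition.
  def mapper(c: CoflowClient, bId: String, sId: String, mapId: Int, out: Array[Byte]): Unit = {
    val conf = c.get(bId, "jobConf")          // fetch the broadcast content
    c.put(sId, s"map-$mapId", out)            // publish this mapper's shuffle output
  }

  // Reducer: fetch the shuffle partitions it needs.
  def reducer(c: CoflowClient, sId: String, mapIds: Seq[Int]): Seq[Array[Byte]] =
    mapIds.map(id => c.get(sId, s"map-$id"))

  // Driver, once the stage is done: tear down both coflows.
  def cleanup(c: CoflowClient, bId: String, sId: String): Unit = {
    c.unregister(sId); c.unregister(bId)
  }
}
```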
1. Does it improve performance?
2. Can it beat non-preemptive solutions?
3. Do we really need coordination?
YES

Evaluation
A 3000-machine trace-driven simulation matched against a 100-machine EC2 deployment
Better than Per-Flow Fairness
Communication improvement / job improvement:
• EC2: 1.85X / 1.25X
• Simulation: 3.21X / 1.11X
• Communication-heavy jobs improve even more: 2.50X–3.16X and 3.39X–4.86X
Preemption is Necessary [Sim.]
[Bar chart, average CCT normalized to Varys: Varys = 1.00, per-flow fairness = 3.21, FIFO (Orchestra1) = 5.65, per-flow prioritization = 5.53, FIFO-LM = 22.07, NC = 1.10]
What about starvation? NO
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
Lack of Coordination Hurts [Sim.]
Smallest-flow-first (per-flow prioritization2,3)
• Minimizes flow completion time
FIFO-LM4 performs decentralized coflow scheduling
• Suffers due to local decisions
• Works well for small, similar coflows
[Bar chart repeated, normalized to Varys = 1.00: per-flow fairness = 3.21, FIFO1 = 5.65, per-flow prioritization = 5.53, FIFO-LM = 22.07, NC = 1.10]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014.
Coflow
Communication abstraction for data-parallel applications to express their performance goals
A coflow conveys:
1. The size of each flow,
2. The total number of flows, and
3. The endpoints of individual flows.
In practice, this information is often unavailable ahead of time because of pipelining between stages, speculative execution, and task failures and restarts.

How to Perform Coflow Scheduling Without Complete Knowledge?
Implications
Minimizing avg. comp. time | Flows in a single link | Coflows in an entire datacenter
With complete knowledge | Smallest-Flow-First | Ordering by bottleneck size + data-proportional rate allocation
Without complete knowledge | Least-Attained Service (LAS) | ?
Revisiting Ordering Heuristics
[Recap: Shortest-First (Total CCT = 35), Narrowest-First (41), Smallest-First (34), Smallest-Bottleneck (31); all four require complete knowledge of coflow sizes up front ✖]
Coflow-Aware LAS (CLAS)
Set a priority that decreases with how much a coflow has already sent
• The more a coflow has sent, the lower its priority
• Smaller coflows finish faster
Use the total size of coflows to set priorities
• Avoids the drawbacks of full decentralization

Coflow-Aware LAS (CLAS)
Continuous priorities reduce to fair sharing when similar coflows coexist
• Priority oscillation
FIFO works well for similar coflows
[Example with two similar coflows. Fair sharing: Coflow1 comp. time = 6, Coflow2 comp. time = 6. FIFO: Coflow1 comp. time = 3, Coflow2 comp. time = 6.]
Discretized Coflow-Aware LAS (D-CLAS)
Priority discretization
• Change priority when total size exceeds predefined thresholds
Scheduling policies
• FIFO within the same queue
• Prioritization across queues
Weighted sharing across queues
• Guarantees starvation avoidance
[Figure: K priority queues, Q1 (highest priority) … QK (lowest priority), each served FIFO]
How to Discretize Priorities?
Exponentially spaced thresholds: 0, E·A, E²·A, …, E^(K−1)·A, ∞
• K: number of queues
• A: threshold constant
• E: threshold exponent
Loose coordination suffices to calculate global coflow sizes
• Slaves make independent decisions in between
Small coflows (smaller than E·A) do not experience coordination overheads!
[Figure: Q1, the highest-priority queue, holds coflows below E·A; QK, the lowest-priority queue, holds coflows above E^(K−1)·A; each queue is served FIFO]
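A minimal sketch of the queue-assignment rule in Scala. The K, A, and E values are only illustrative, and the real design uses weighted sharing across queues rather than the strict priority shown here:

```scala
object DClasSketch {
  val K = 10                    // number of priority queues
  val A = 10L * 1024 * 1024     // threshold constant (10 MB, illustrative)
  val E = 10.0                  // threshold exponent

  // Queue 0 (Q1, highest priority) holds coflows that have sent < E*A bytes;
  // queue q >= 1 holds coflows in [E^q * A, E^(q+1) * A); the last queue is unbounded.
  def queueOf(bytesSent: Long): Int = {
    var q = 0
    var threshold = E * A
    while (q < K - 1 && bytesSent >= threshold) {
      q += 1
      threshold *= E
    }
    q
  }

  // Prioritization across queues, FIFO (by arrival time) within a queue.
  case class ActiveCoflow(id: String, bytesSent: Long, arrivalTime: Long)
  def schedule(active: Seq[ActiveCoflow]): Seq[ActiveCoflow] =
    active.sortBy(cf => (queueOf(cf.bytesSent), cf.arrivalTime))
}
```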
Closely Approximates Varys [Sim. & EC2]
[Bar chart, average CCT normalized to Varys: Varys = 1.00, per-flow fairness = 3.21, FIFO1 = 5.65, per-flow prioritization2,3 = 5.53, FIFO-LM4 = 22.07, Varys w/o complete knowledge = 1.10]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014.
My Contributions
Datacenter resource allocation
• FairCloud (SIGCOMM’12) @HP
• HARP (SIGCOMM’12) @Microsoft Bing
• ViNEYard (ToN’12)
Network-aware applications
• Spark (NSDI’12): top-level Apache project
• Sinbad (SIGCOMM’13): merged at Facebook
Application-aware network scheduling (Coflow)
• Orchestra (SIGCOMM’11): merged with Spark
• Varys (SIGCOMM’14): open-source
• Aalo (SIGCOMM’15): open-source
Coflow: Communication-First Big Data Systems
In-datacenter analytics
• Cheaper SSDs and DRAM, proliferation of optical networks, and resource disaggregation will make the network the primary bottleneck
Inter-datacenter analytics
• Bandwidth-constrained wide-area networks
End-user delivery
• Faster and more responsive delivery of analytics results over the Internet for a better end-user experience

Mind the gap between networking and systems
• Better capture application-level performance goals using coflows
• Coflows improve application-level performance and usability
• Extends the networking and scheduling literature
• Coordination – even if not free – is worth paying for in many cases

[email protected]
http://mosharaf.com
Improve Flow Completion Times
[Figure: the same 3×2 shuffle. Per-flow fair sharing: shuffle completion time = 5, average flow completion time = 3.66. Smallest-Flow First1,2: shuffle completion time = 6, average flow completion time = 2.66.]
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
Distributions of Coflow Characteristics
[CDFs across the Facebook trace: coflow length (bytes), coflow width (number of flows), coflow size (bytes), and coflow bottleneck size (bytes)]
Traffic Sources
1. Ingest and replicate new data
2. Read input from remote machines, when needed
3. Transfer intermediate data
4. Write and replicate output
[Percentage of traffic by category at Facebook: 30%, 14%, 46%, and 10%]
Performance
Facebook jobs spend ~25% of runtime on average in intermediate comm.
Month-long trace from a 3000-machine MapReduce production cluster at Facebook: 320,000 jobs, 150 million tasks
[CDF: distribution of shuffle durations, i.e., fraction of runtime spent in shuffle vs. fraction of jobs]
Theoretical Results
Structure of optimal schedules
• Permutation schedules might not always lead to the optimal solution
Approximation ratio of COSS-CR
• Polynomial-time algorithm with a constant approximation ratio (64/3)1
The need for coordination
• Fully decentralized schedulers can perform arbitrarily worse than the optimal
1. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University, 2014.
The Coflow API
• register • put • get • unregister
[Figure: driver (JobTracker), mappers, and reducers connected by a broadcast coflow b and a shuffle coflow s]
@driver: b ← register(BROADCAST, numFlows); s ← register(SHUFFLE, numFlows, {b})
id ← b.put(content, size) …
ids1 ← s.put(content, size) …
@mapper: b.get(id) …
@reducer: s.get(ids1) …
s.unregister(); b.unregister()
1. NO changes to user jobs
2. NO storage management
Varys
Employs a two-step algorithm to support coflow deadlines
1. Admission control: do not admit any coflow that cannot be completed within its deadline without violating existing deadlines
2. Allocation algorithm: allocate minimum required resources to each coflow to finish it at its deadline
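A hedged sketch of the admission test, reusing the illustrative Coflow descriptor from earlier: a new coflow needs bytes/deadline of bandwidth at every port it touches, and it is admitted only if that much bandwidth is still unreserved at each of those ports. The real Varys admission control operates on its richer internal state; this is only the idea.

```scala
import scala.collection.mutable

case class DeadlineCoflow(cf: Coflow, deadlineSec: Double)

// portCapacityBps must contain every port a submitted coflow can touch.
class AdmissionControl(portCapacityBps: Map[String, Double]) {
  // Bandwidth already promised to admitted-but-unfinished coflows, per port.
  private val reserved = mutable.Map[String, Double]().withDefaultValue(0.0)

  // Minimum rate needed at each port to finish exactly at the deadline.
  // Assumes sender and receiver ports are distinct, as in a shuffle.
  private def demand(d: DeadlineCoflow): Map[String, Double] = {
    val perPort = d.cf.flows.groupBy(_.src) ++ d.cf.flows.groupBy(_.dst)
    perPort.map { case (p, fs) => p -> fs.map(_.bytes).sum / d.deadlineSec }
  }

  // Admit only if the extra reservation fits under every port's capacity,
  // i.e., no already-admitted deadline is put at risk.
  def tryAdmit(d: DeadlineCoflow): Boolean = {
    val need = demand(d)
    val fits = need.forall { case (p, bps) => reserved(p) + bps <= portCapacityBps(p) }
    if (fits) need.foreach { case (p, bps) => reserved(p) += bps }
    fits
  }
}
```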
More Predictable
[Charts: percentage of coflows that met their deadline, were not admitted, or missed their deadline, comparing Varys against per-flow fairness and against EDF1 (Earliest-Deadline First), in the EC2 deployment and the Facebook trace simulation]
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
Optimizing Communication Performance: Systems Approach
“Let users figure it out”
Number of communication-related parameters*:
• Spark v1.1.1: 6
• Hadoop v1.2.1: 10
• YARN v2.6.0: 20
*Lower bound. Does not include many parameters that can indirectly impact communication, e.g., the number of reducers. Also excludes control-plane communication/RPC parameters.
Experimental Methodology
Varys deployment in EC2
• 100 m2.4xlarge machines
• Each machine has 8 CPU cores, 68.4 GB memory, and a 1 Gbps NIC
• ~900 Mbps/machine during all-to-all communication
Trace-driven simulation
• Detailed replay of a day-long Facebook trace (circa October 2010)
• 3000-machine, 150-rack cluster with 10:1 oversubscription