is advance knowledge of flow sizes a plausible assumption? · 2019. 12. 18. · fs known in...
TRANSCRIPT
Is advance knowledge of flow sizes a plausible assumption?
Vojislav Dukic, Sangeetha Abdu Jyothi, Bojan Karlas,
Muhsen Owaida, Ce Zhang, Ankit Singla
Flow size information can increase efficiency
2
Performance
FS known in advance
FS not known in advance
Flow size information can increase efficiency
3
…fine-grained circuit scheduling
…packet scheduling within switches
…deadline-aware prioritization
…adaptive routing
…congestion control schemes
100 KB
100 MB
!5
Typically, first-in first-out
Packet scheduling at a switch queue
100 MB
100 KB
!5
Typically, first-in first-out
Packet scheduling at a switch queue
100 MB
100 KB
!6
Packet scheduling at a switch queue
But could apply shortest job first idea!
100 MB
100 KB
!7
Packet scheduling at a switch queue
Least remaining bytes first[*pFabric: Minimal Near-Optimal Datacenter Transport; SIGCOMM ’13]
100 MB
100 KB
!7
Packet scheduling at a switch queue
Least remaining bytes first[*pFabric: Minimal Near-Optimal Datacenter Transport; SIGCOMM ’13]
100 MB
100 KB
!8
Packet scheduling at a switch queue
Simple, but problematic: do we know the shortest flow?
100 MB
100 KB
Assumption:Flow size is know in advance
“In many datacenter applications flow sizes or deadlines are known at initiation time and can be conveyed to the network stack (e.g., through a socket API)...”
pFabric – SIGCOMM ’13Improves FCT by 4x
“When an application calls send() or sendto() on a socket, the operating
system sends this demand in a request message to the Fastpass arbiter,
specifying the destination and the number of bytes.”
FastPass – SIGCOMM ‘14 Improves FCT by 15x
“The sender must specify the size of a message when presenting its first byte to the transport... Knowledge of message sizes is particularly valuable because it allows transports to prioritize shorter messages.”
Homa – SIGCOMM ’18Reduces tail latency by 100x
Time
Assumption: Flow size is know in advance
Assumption: Flow size is know in advance
“In this paper, we question the validity of this assumption, and point out that, for many applications, such information is difficult to obtain, and may even be unavailable.”
PIAS – NSDI ‘15
“A number of applications are unable to provide size/deadline information at the start of their flows, e.g. database access
and HTTP chunked transfer.”
Karuna – SIGCOMM ‘16
“We ignore flow size and duration as they cannot be acquired until a flow finishes.”
CODA – SIGCOMM ’16
Time
Example alternative: Flow aging
Packets sent = packets remaining
012340123 5
100 MB
100 KB
Example alternative: Flow aging
Packets sent = packets remaining
45
100 MB
100 KB
Possible alternative: flow aging
Works well for long tail flow size distributions
Possible alternative: flow aging
Works well for long tail flow size distributions
Works only when relative flow size is needed
Doesn’t work for equally sized flows
Is advance knowledge of flow sizes a plausible assumption?
100% knowledge is likely intractable* * or at least too expensive
Is advance knowledge of flow sizes a plausible assumption?
100% knowledge is likely intractable* * or at least too expensive
… but partial knowledge is plausible and useful
Is advance knowledge of flow sizes a plausible assumption?
17
Sources of flow size information?
How useful is imprecise / partial knowledge?
Incentives for operators and users?
18
Sources of flow size information?
How useful is imprecise / partial knowledge?
Incentives for operators and users?
Flow size estimation: design space
19
Many apps know flow sizes
Doable in private DCs
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
Learning from past traces
Flow size estimation: design space
19
Many apps know flow sizes
Doable in private DCs
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
Learning from past tracesSome apps don’t
Change a lot of applications
Change network API
0KB
50KB
100KB
150KB
200KB
250KB
1 1.002 1.004 1.006 1.008 1.01
Buffe
r occ
upan
cy
Time (s)
1G network
Flow size estimation: design space
20
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
Learning from past traces
Buffer occupancy
0KB
50KB
100KB
150KB
200KB
250KB
1 1.002 1.004 1.006 1.008 1.01
Buffe
r occ
upan
cy
Time (s)
1G network
Flow size estimation: design space
21
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
Learning from past traces
Buffer occupancy
0KB
50KB
100KB
150KB
200KB
250KB
1 1.002 1.004 1.006 1.008 1.01
Buffe
r occ
upan
cy
Time (s)
1G network
Flow size estimation: design space
22
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
Learning from past traces
Buffer occupancy
0KB
50KB
100KB
150KB
200KB
250KB
1 1.002 1.004 1.006 1.008 1.01
Buffe
r occ
upan
cy
Time (s)
1G network
Flow size estimation: design space
23
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
Learning from past traces
Buffer occupancy
0KB
50KB
100KB
150KB
200KB
250KB
1 1.002 1.004 1.006 1.008 1.01
Buffe
r occ
upan
cy
Time (s)
1G network10G network
Flow size estimation: design space
24
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
Learning from past traces
Buffer occupancy
Flow size estimation: design space
25
while(data in buffer):y = write(socket_desc,buffer+x,100KB
)
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
Learning from past traces
Flow size estimation: design space
26
flow_size = f(disk I/O,memory I/O,past network traffic,computation
)
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
Learning from past traces
Learning flow size
27
* Values are in R2
1 – perfect prediction0 – mean value prediction
WorkloadWeb server
TensorFlow
PageRank
KMeans
SGD
Performance0.960.970.830.880.79
+
TraceDisk I/O
Memory I/O
CPU cycles
Past traffic
+
ModelGBDT
FFNN
LSTM =
28
Sources of flow size information?
How useful is imprecise / partial knowledge?
Incentives for operators and users?
Flow scheduling using imprecise knowledge
[TensorFlow workload]
0 0.05
0.1 0.15
0.2 0.25
0.3 0.35
0.4 0.45
pFabric pHost FastPass
Mea
n FC
T (m
s)Perfect
Fifo
Mean FCT (ms)
1
[1. pHost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric; CoNEXT ’15]
2
[2. Fastpass: A Centralized “Zero-Queue” Datacenter Network; SIGCOMM ’14]
Flow scheduling using imprecise knowledge
0 0.05
0.1 0.15
0.2 0.25
0.3 0.35
0.4 0.45
pFabric pHost FastPass
Mea
n FC
T (m
s)Perfect
Fifo
Mean FCT (ms)
[TensorFlow workload]
1
[1. pHost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric; CoNEXT ’15]
2
[2. Fastpass: A Centralized “Zero-Queue” Datacenter Network; SIGCOMM ’14]
Least remaining bytes first
Flow scheduling using imprecise knowledge
0 0.05
0.1 0.15
0.2 0.25
0.3 0.35
0.4 0.45
pFabric pHost FastPass
Mea
n FC
T (m
s)Perfect
Fifo
Mean FCT (ms)
[TensorFlow workload]
1
[1. pHost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric; CoNEXT ’15]
2
[2. Fastpass: A Centralized “Zero-Queue” Datacenter Network; SIGCOMM ’14]
Flow scheduling using imprecise knowledge
0 0.05
0.1 0.15
0.2 0.25
0.3 0.35
0.4 0.45
pFabric pHost FastPass
Mea
n FC
T (m
s)Perfect
FifoPrediction
Mean FCT (ms)
[TensorFlow workload]
1
[1. pHost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric; CoNEXT ’15]
2
[2. Fastpass: A Centralized “Zero-Queue” Datacenter Network; SIGCOMM ’14]
Flow scheduling using imprecise knowledge
0 0.05
0.1 0.15
0.2 0.25
0.3 0.35
0.4 0.45
pFabric pHost FastPass
Mea
n FC
T (m
s)Perfect
FifoPrediction
Mean FCT (ms)
[TensorFlow workload]
1
[1. pHost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric; CoNEXT ’15]
2
[2. Fastpass: A Centralized “Zero-Queue” Datacenter Network; SIGCOMM ’14]
0 0.05
0.1 0.15
0.2 0.25
0.3 0.35
0.4 0.45
pFabric pHost FastPass
Mea
n FC
T (m
s)Perfect
FifoPrediction
Aging
Flow scheduling using imprecise knowledge
Mean FCT (ms)
[TensorFlow workload]
1
[1. pHost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric; CoNEXT ’15]
2
[2. Fastpass: A Centralized “Zero-Queue” Datacenter Network; SIGCOMM ’14]
1 1.2 1.4 1.6 1.8
2 2.2 2.4 2.6
PageRankKMeans SGD
TensorFlowWeb Server
Rela
tive
perfo
rman
ce
degr
adat
ion
Coflow scheduling using imprecise knowledge (Sincronia*)
35
Slowdown
[*Sincronia: Near-Optimal Network Design for Coflows; SIGCOMM ’18]
1 1.2 1.4 1.6 1.8
2 2.2 2.4 2.6
PageRankKMeans SGD
TensorFlowWeb Server
Rela
tive
perfo
rman
ce
degr
adat
ion
Coflow scheduling using imprecise knowledge (Sincronia*)
36
Slowdown
[*Sincronia: Near-Optimal Network Design for Coflows; SIGCOMM ’18]
Workload R2
Web server 0.96
TensorFlow 0.97
PageRank 0.83
KMeans 0.88
SGD 0.79
1 1.2 1.4 1.6 1.8
2 2.2 2.4 2.6
PageRankKMeans SGD
TensorFlowWeb Server
Rela
tive
perfo
rman
ce
degr
adat
ion
Coflow scheduling using imprecise knowledge (Sincronia*)
37
Slowdown
[*Sincronia: Near-Optimal Network Design for Coflows; SIGCOMM ’18]
Workload R2
Web server 0.96
TensorFlow 0.97
PageRank 0.83
KMeans 0.88
SGD 0.79
1 1.2 1.4 1.6 1.8
2 2.2 2.4 2.6
PageRankKMeans SGD
TensorFlowWeb Server
Rela
tive
perfo
rman
ce
degr
adat
ion
Coflow scheduling using imprecise knowledge (Sincronia*)
38
Slowdown
[*Sincronia: Near-Optimal Network Design for Coflows; SIGCOMM ’18]
Workload R2
Web server 0.96
TensorFlow 0.97
PageRank 0.83
KMeans 0.88
SGD 0.79
0
0.2
0.4
0.6
0.8
1
1 10 100
CDF
Latency (µs)
CPU
Fast enough, deployable learning?
39
CDF
0
0.2
0.4
0.6
0.8
1
1 10 100
CDF
Latency (µs)
CPU
Fast enough, deployable learning?
40
CDF
0
0.2
0.4
0.6
0.8
1
1 10 100
CDF
Latency (µs)
CPUFPGA
Fast enough, deployable learning?
41
CDF
Flow size estimation: design space
42
Works only for repetitive workloads
Must identify the application Substantial implementation effort.
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
Learning from past traces
Learning from past traces
Every technique provides some knowledge
43
Estimation effort
Flow sizeknowledge
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
Learning from past traces
Every technique provides some knowledge
43
Estimation effort
Flow sizeknowledge
Exact sizes given by application
TCP buffer occupancy
Monitoring system calls
44
Sources of flow size information?
How useful is imprecise / partial knowledge?
Incentives for operators and users?
What we want: More knowledge ⇒ better performance
Knowledge = Effort
Performance
0% 100%
Time
Unknown
Time
Unknown Known
TimeTime
Unknown Known
Q1: Can this flow’s performance deteriorate?
TimeTime
Unknown Known
Q2: Can the whole system’s performance deteriorate?
TimeTime
Q1: Can this flow’s performance deteriorate?
Unknown Known
Q2: Can the whole system’s performance deteriorate?
Q1: Can this flow’s performance deteriorate?
Unknown Known
Old scheduling problem, but novel, interesting questions raised by our unique angle!
Q2: Can the whole system’s performance deteriorate?
Q1: Can this flow’s performance deteriorate?
Unknown Known
Q1: Can this flow’s performance deteriorate?
Unknown Known
Least remaining bytes first
Q1: Can this flow’s performance deteriorate?
Unknown Known
Least remaining bytes first
+ 1
Flow aging0101223
Q1: Can this flow’s performance deteriorate?
Unknown Known
A flow’s performance cannot deteriorate!
Least remaining bytes first
+ 1
Flow aging0101223
For coflow scheduling with mean flow size used for unknowns,performance can in fact deteriorate
Unknown Known
Q1: Can this flow’s performance deteriorate?
Q2: Can the whole system’s performance deteriorate?
Unknown Known
Q2: Can the whole system’s performance deteriorate?
Average co/flow completion time system-wide can deteriorate!
Unknown Known
0 0.5
1 1.5
2 2.5
3 3.5
0% 20% 40% 60% 80% 100%
Slow
down
Percentage of known flows
Slowdown
Perfect knowledge performance
Unknown Known
Q2: Can the whole system’s performance deteriorate?
0 0.5
1 1.5
2 2.5
3 3.5
0% 20% 40% 60% 80% 100%
Slow
down
Percentage of known flows
Slowdown
Open question: positive results with some restrictions?
Perfect knowledge performance
Unknown Known
Q2: Can the whole system’s performance deteriorate?
55
Sources of flow size information? Partial knowledge available from many sources
How useful is imprecise / partial knowledge?Can still provide performance improvements
Incentives for operators and users? Need better guarantees on value of investment
Is advance knowledge of flow sizes a plausible assumption?
Thank you for your attention
Performance improvement
0
0.2
0.4
0.6
0.8
1
0 1KB 10KB 100KB 1MB 10MB All
Perfo
rman
ce
Size of known flows
SGDPageRank
TensorFlow
Features
58
Feature DescriptionStart time, tf Start time of f relative to job start timeFlow gap Time since the end of the previous flowFirst call Size of the first system call tfNetwork in Data received until tfNetwork out Data sent until tfNetwork in(d) Data received from flow’s dest. d until tfNetwork out(d) Data sent by this host to d until tfCPU CPU cycles used until tfDisk I/O Total disk I/O until tfMemory I/O Total memory I/O until tfPrevious flows Flow sizes for last k flows