Transport Level Protocol Performance Evaluation for Bulk Data Transfers
Matei Ripeanu, The University of Chicago
http://www.cs.uchicago.edu/~matei/
Abstract: Before developing new protocols targeted at bulk data transfers, the achievable performance and the limitations of the broadly used TCP protocol should be carefully investigated. Our first goal is to explore TCP's bulk-transfer throughput as a function of network path properties, number of concurrent flows, loss rates, competing traffic, etc., using analytical models, simulations, and real-world experiments. The second objective is to repeat this evaluation for some of TCP's replacement candidates (e.g., NETBLT). This should allow an informed decision on whether (or not) to put effort into developing and/or using new protocols specialized for bulk transfers.
Application requirements (GriPhyN):
• Efficient management of 10s to 100s of PetaBytes (PB) of data; many PBs of new raw data per year.
• Granularity: file sizes of 10MB to 1GB.
• Large pipes: OC3 and up, with high latencies (see the rough numbers below).
• Efficient bulk data transfers that gracefully share the network with other applications.
• Projects: CMS, ATLAS, LIGO, SDSS.
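To put these requirements in perspective, the small Python sketch below (our own back-of-the-envelope calculation, using the T3/OC3/OC12 link rates that appear elsewhere on this poster) prints the ideal, loss-free transfer times for files in the 10MB to 1GB range:

```python
# Ideal (loss-free, full-link-rate) transfer times for the file sizes of interest.
# Link rates as used on this poster: T3 = 43.2 Mbps, OC3 = 155 Mbps, OC12 = 622 Mbps.
LINKS_MBPS = {"T3": 43.2, "OC3": 155.0, "OC12": 622.0}
FILE_SIZES_MB = [10, 100, 1000]  # 10MB to 1GB granularity

for link, mbps in LINKS_MBPS.items():
    for size_mb in FILE_SIZES_MB:
        seconds = (size_mb * 8) / mbps  # MB -> Mbit, divided by the link rate in Mbps
        print(f"{size_mb:>5} MB over {link:<4}: {seconds:6.1f} s ideal")
```

Even in the ideal case a 1GB file needs roughly 50 seconds on an OC3, which is the baseline the "Ideal" curves in the simulation plots below refer to.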
[Figure: stable-state throughput as a percentage of the bottleneck link rate (RTT=80ms, MSS=1460 bytes) vs. loss indication rate (log scale, 1e-2 down to 1e-8), for a T3 link (43.2Mbps), an OC3 link (155Mbps), and an OC12 link (622Mbps).]
(Rough) analytical stable-state throughput estimates (based on [Math96]):

Throughput ≈ (MSS / RTT) · C / √p                          for p ≥ 8 / (3·Wmax²)
Throughput ≈ (MSS / RTT) · Wmax / (1 + (1/8)·p·Wmax²)      for p < 8 / (3·Wmax²)

where p is the loss indication rate, Wmax the maximum window size (in segments), and C a constant of order 1 (≈ √(3/2) for periodic losses).
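As a quick sanity check against the throughput plot above, the following minimal Python sketch evaluates the reconstructed estimate (the function and parameter names are ours, not from the poster):

```python
from math import sqrt

def tcp_throughput_bps(mss_bytes, rtt_s, loss_rate, wmax_segments, c=sqrt(1.5)):
    """Rough stable-state TCP throughput estimate (after [Math96]), in bits/s."""
    if loss_rate >= 8.0 / (3.0 * wmax_segments ** 2):
        # Loss-limited regime: the classic MSS*C/(RTT*sqrt(p)) formula.
        segments_per_rtt = c / sqrt(loss_rate)
    else:
        # Window-limited regime: throughput is capped by the maximum window Wmax.
        segments_per_rtt = wmax_segments / (1.0 + loss_rate * wmax_segments ** 2 / 8.0)
    return segments_per_rtt * mss_bytes * 8 / rtt_s

# Example: OC3-like path, RTT=80ms, MSS=1460 bytes, Wmax sized to fill the pipe.
wmax = int(155e6 * 0.080 / (1460 * 8))  # ~bandwidth-delay product in segments
for p in (1e-3, 1e-5, 1e-7):
    print(f"p={p:.0e}: {tcp_throughput_bps(1460, 0.080, p, wmax) / 155e6:.0%} of OC3")
```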
Main inefficiencies TCP is blamed for:
• Overhead. However, less than 15% of the time is spent in TCP processing proper.
• Flow control. Claim: a rate-based protocol would be faster. However, there is no proof that this is better than (self-)ACK clocking.
• Congestion control. The underlying problem is that the lower layers do not give explicit congestion feedback, so TCP assumes any packet loss is a congestion signal. Also, the mechanism does not scale.
Questions: Is TCP appropriate/usable? What about rate-based protocols?
We want to optimize link utilization and per-file transfer delay while maintaining "fair" sharing.
TCP refresher:
[Figure: congestion window vs. time — slow start (exponential growth), congestion avoidance (linear growth), fast retransmit; packet loss is discovered either through the fast retransmit/fast recovery mechanism or through a timeout.]
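For intuition about the sawtooth in the refresher figure, here is a toy, RTT-granularity model of the congestion window (our own simplification, not part of the poster): exponential growth during slow start, additive increase afterwards, and a multiplicative halving at every loss.

```python
import random

def simulate_cwnd(rtts=200, ssthresh=64, wmax=256, loss_prob=0.02, seed=1):
    """Toy TCP congestion-window trace, one value per RTT (in segments)."""
    random.seed(seed)
    cwnd, trace = 1.0, []
    for _ in range(rtts):
        trace.append(cwnd)
        if random.random() < loss_prob:      # loss detected via duplicate ACKs
            ssthresh = max(cwnd / 2, 2)      # fast retransmit / fast recovery
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, wmax)       # slow start: exponential growth
        else:
            cwnd = min(cwnd + 1, wmax)       # congestion avoidance: linear growth
    return trace

print(" ".join(f"{int(w)}" for w in simulate_cwnd()[:40]))
```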
Simulations (using the NS simulator; the topology is described after the plots below). Key findings:
Significant throughput improvements can be achieved just by tuning the end systems and the network path: set proper window sizes, disable delayed ACKs, use SACK and ECN, use jumbo frames, etc.
For high link loss rates, striping is a legitimate and effective solution.
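As an illustration of the "proper window sizes" point, the sketch below sizes a sender's socket buffers to the bandwidth-delay product so the TCP window can fill the pipe (a minimal example of ours; the host, port, and link parameters are placeholders, and OS-level settings such as SACK, ECN, window scaling, and jumbo frames still have to be enabled outside the application):

```python
import socket

def open_tuned_sender(host, port, link_bps, rtt_s):
    """Connect a TCP socket with buffers sized to the bandwidth-delay product."""
    bdp_bytes = int(link_bps * rtt_s / 8)  # bandwidth-delay product in bytes
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request buffers of at least one BDP; the kernel may clamp these to its limits.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
    s.connect((host, port))
    return s

# Example: OC12-like path (622 Mbps) with 100ms RTT -> ~7.8MB of buffer per direction.
# sock = open_tuned_sender("data.example.org", 5001, 622e6, 0.100)  # placeholder host
```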
[Figure: 100MB transfer time (sec, log scale) vs. link loss rate (log scale), OC3 link, 80ms RTT, MSS=1460 initially. Series: MSS=1460, DelAck, huge WS; MSS=1460, DelAck, WS ok; MSS=1460, WS ok; MSS=9000; MSS=9000, FACK; MSS=9000, FACK, 5 flows; Ideal.]
[Figure: 1GB transfer time (sec, log scale) vs. link loss rate (log scale), OC12 link, 100ms RTT, MSS=1460 initially. Series: MSS=1460, DelAck; MSS=1460, WS ok; MSS=9000; FACK; 5 flows; 25 flows; Ideal.]
[Simulation topology: end systems attached through 1Gbps, 1ms RTT links to a bottleneck link of either OC3 (35ms) or OC12 (45ms).]
[Figures: 0.5GB striped transfer, OC3 link (155Mbps), 80ms RTT, MSS=9000, using up to 1000 parallel flows, at link loss rates of 0 and 0.1%. Panels show transfer time (sec) vs. number of parallel flows (stripes) used, and packets dropped (left scale) together with the standard deviation of the transfer time (right scale) vs. number of parallel flows used.]
TCP striping issues: widespread usage exposes scaling problems in TCP's congestion control mechanism:
• Unfair allocation: a small number of flows grabs almost all of the available bandwidth.
• Reduced efficiency: a large number of packets are dropped.
• Rule of thumb: keep fewer flows in the system than the 'pipe size' expressed in packets (see the sketch below).
Striping is not 'TCP unfriendly' as long as link loss rates are high; however, even high link loss rates do not remove the unfairness.
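To make the rule of thumb concrete, this minimal sketch (our own illustration, using the OC3 parameters from the striping experiment) computes the 'pipe size' in packets, i.e., the bandwidth-delay product divided by the segment size:

```python
def pipe_size_packets(link_bps, rtt_s, mss_bytes):
    """Bandwidth-delay product expressed in MSS-sized packets."""
    return int(link_bps * rtt_s / (mss_bytes * 8))

# OC3 link (155 Mbps), 80ms RTT, MSS=9000: roughly 170 packets fit in the pipe,
# so using many hundreds of stripes mostly produces extra packet drops.
print(pipe_size_packets(155e6, 0.080, 9000))
```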
Conclusions:
• TCP can work well with careful end-host and network tuning.
• For fair sharing with other users, mechanisms are needed to provide congestion feedback and to distinguish genuine link losses from congestion indications.
• In addition, admission mechanisms based on the number of parallel flows might be beneficial.
GridFTP and iperf performance (between LBNL and ANL via ES-Net):
[Figure: bandwidth (Mbps) vs. number of TCP streams (up to 35), for GridFTP and iperf.]
Striping: widely used (browsers, FTP, etc.) with good practical results, but not 'TCP friendly'!
• RFC 2140 / Ensemble TCP: share information and congestion management among parallel flows.
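For reference, this is the kind of striping that browsers and FTP/HTTP clients apply in practice. The minimal sketch below (our own example, with a placeholder URL) fetches one file over several parallel TCP connections using HTTP range requests and reassembles the pieces:

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_range(url, start, end):
    """Fetch bytes [start, end] of a resource over its own TCP connection."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def striped_download(url, total_bytes, stripes=5):
    """Download a file of known size using `stripes` parallel TCP flows."""
    chunk = -(-total_bytes // stripes)  # ceiling division
    ranges = [(i * chunk, min((i + 1) * chunk, total_bytes) - 1) for i in range(stripes)]
    with ThreadPoolExecutor(max_workers=stripes) as pool:
        parts = pool.map(lambda r: fetch_range(url, *r), ranges)
    return b"".join(parts)

# Example (placeholder URL and size):
# data = striped_download("http://data.example.org/file.dat", 100 * 2**20, stripes=5)
```

Each stripe runs its own congestion control, which is exactly the source of the unfairness discussed above; RFC 2140 / Ensemble TCP addresses this by sharing congestion state across the parallel connections.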
(Courtesy of MCS/ANL.)
Future work:
• What are the optimal buffer sizes for bulk transfers?
• Can we use ECN and large buffers to reliably detect congestion without using dropped packets as a congestion indicator?
• Assuming the link loss rate pattern is known, can it be used to reliably detect congestion and improve throughput?
(GridFTP/iperf measurement setup: OC12, ANL to LBNL, 56ms RTT, Linux end hosts.)