ricochet: lateral error correction for time-critical...
TRANSCRIPT
MotivationSystem
Evaluation
Ricochet: Lateral Error Correction forTime-Critical Multicast
Mahesh Balakrishnan1, Ken Birman1, Amar Phanishayee2,Stefan Pleisch1
1Cornell University, Ithaca, NY2Carnegie Mellon University, Pittsburgh, PA
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Multicast in the Datacenter
Commodity Datacenters:Extreme Scale-OutData Replication:
Fault ToleranceHigh AvailabilityPerformance
Updating data in multiplelocations: Multicast!
Data is Clustered / Replicated / Published / Cached...
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Multicast in the Datacenter
Commodity Datacenters:Extreme Scale-OutData Replication:
Fault ToleranceHigh AvailabilityPerformance
Updating data in multiplelocations: Multicast!
Upd
ates
Data is Clustered / Replicated / Published / Cached...
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
How is Multicast Used?
Financial Pub-Sub Example:Each equity is mapped toa multicast group.Each node is interestedin a different set ofequities ...
Each node in many groups=⇒ Low per-group data rate
High per-node data rate=⇒ Overload
Tracking S&P 500
Tracking Portfolio
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
How is Multicast Used?
Financial Pub-Sub Example:Each equity is mapped toa multicast group.Each node is interestedin a different set ofequities ...
Each node in many groups=⇒ Low per-group data rate
High per-node data rate=⇒ Overload
Tracking S&P 500
Tracking Portfolio
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
How is Multicast Used?
Financial Pub-Sub Example:Each equity is mapped toa multicast group.Each node is interestedin a different set ofequities ...
Each node in many groups=⇒ Low per-group data rate
High per-node data rate=⇒ Overload
Tracking S&P 500
Tracking Portfolio
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Recovering Lost Packets... Fast!
Loss occurs on overloaded end-hosts.Real-Time Apps: Financial Trading, Mission Control...Foobooks.com?
Massive volume...Stale inventory = Lost $$$
Required: rapid, scalable recovery from packet loss
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Problem Statement
Time-Critical ReliableMulticastScalability:
Number of ReceiversNumber of SendersNumber of Groups
Multicast Data
Lost Packet
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Problem Statement
Time-Critical ReliableMulticastScalability:
Number of Receivers
Number of SendersNumber of Groups
Multicast Data
Lost Packet
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Problem Statement
Time-Critical ReliableMulticastScalability:
Number of ReceiversNumber of Senders
Number of Groups
Multicast Data
Lost Packet
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Problem Statement
Time-Critical ReliableMulticastScalability:
Number of ReceiversNumber of SendersNumber of Groups
Multicast Data
Lost Packet
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Design Space for Reliable MulticastHow does latency scale?
Existing mechanisms:ACK/timeout: RMTP/RMTP-IINAK/sender-based sequencing: SRMGossip-based: Bimodal Multicast, lpbcastForward Error Correction
Fundamental Insight: latency ∝ 1datarate
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
NAK/Sender-Based Sequencing: SRM
Scalable Reliable Multicast - Developed 1997
Loss discoveredon next packetfrom samesender in samegrouplatency ∝ 1
datarate
data rate: at asingle sender, ina single group 0
1000
2000
3000
4000
5000
6000
7000
8000
9000
1 2 4 8 16 32 64 128
Mill
isec
onds
Groups per Node (Log Scale)
SRM Average Discovery Delay
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Forward Error Correction
Pros:Tunable, Proactive Overhead: (r , c)
Time-Critical : No RetransmissionCons:
FEC packets are generated over a stream of dataHave to wait for r data packets before generating FEClatency ∝ 1
datarate
data rate: at a single sender, in a single group
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Receiver-Based Forward Error Correction
Receivers generate XORs ofincoming multicast packets ...
... and exchange with otherreceiversA receiver can recover from atmost one missing packet in anXOR
EDCBAK
erne
l Buf
fer
F
FEDK
erne
l Buf
fer
B CApp Buffer App Buffer
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Receiver-Based Forward Error Correction
Receivers generate XORs ofincoming multicast packets ...... and exchange with otherreceiversA receiver can recover from atmost one missing packet in anXOR
EKer
nel B
uffe
r
Ker
nel B
uffe
r
App Buffer App Buffer
B C ED FXOR
A B C D
X
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Receiver-Based Forward Error Correction
Receiver sends XOR to crandomly chosen receiversGossip-style RandomnessTunable Overhead: (r , c)rate-of-firelatency ∝ 1P
s dataratedata rate: across all senders, ina single group
Multicast Data
SENDERS
GROUP
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Receiver-Based Forward Error Correction
Receiver sends XOR to crandomly chosen receivers
Gossip-style RandomnessTunable Overhead: (r , c)rate-of-firelatency ∝ 1P
s dataratedata rate: across all senders, ina single group
Multicast Data
SENDERS
GROUP
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Receiver-Based Forward Error Correction
Receiver sends XOR to crandomly chosen receiversGossip-style RandomnessTunable Overhead: (r , c)rate-of-fire
latency ∝ 1Ps datarate
data rate: across all senders, ina single group
Multicast Data
SENDERS
GROUP
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Receiver-Based Forward Error Correction
Receiver sends XOR to crandomly chosen receiversGossip-style RandomnessTunable Overhead: (r , c)rate-of-firelatency ∝ 1P
s dataratedata rate: across all senders, ina single group
Multicast Data
SENDERS
GROUP
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Lateral Error Correction: Principle
B2
INC
OM
ING
DA
TA P
ACK
ETS
A1
A2A3
B2
B1
A4
A5
B4
B3
A6
Repair Packet I:(A1, A2, A3, A4, A5)
Repair Packet II:(B1,B2,B3,B4,B5)
B5
Node n1
Node n2
A1
A2
A3
B1
A4
A5B4
B3
A6
B5
INC
OM
ING
DA
TA P
ACK
ETS
Recovery
Loss
Repairs from n1 to n2 include data from common groups.
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Lateral Error Correction: Principle
B2
INC
OM
ING
DA
TA P
ACK
ETS
A1
A2A3
B2
B1
A4
A5
B4
B3
A6
Repair Packet I:(A1, A2, A3, A4, A5)
Repair Packet II:(B1,B2,B3,B4,B5)
B5
Node n1
Node n2
A1
A2
A3
B1
A4
A5B4
B3
A6
B5
INC
OM
ING
DA
TA P
ACK
ETS
Recovery
LossRepair Packet I:(A1, A2, A3, B1, B2)
B2
INC
OM
ING
DA
TA P
ACK
ETS
A1
A2A3
B2
B1
A4
A5
B4
B3
A6
Repair Packet II:(A4, A5, B3, B4, A6)
B5
Node n1
Node n2
A1
A2
A3
B1
A4
A5B4
B3
A6
B5
INC
OM
ING
DA
TA P
ACK
ETS
Recovery
Loss
Repairs from n1 to n2 include data from common groups.
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Nodes and Intersections
2
13
Node n1 belongs togroups A, B, and C
Divides groups intodisjoint intersectionsIs unaware of groups itdoes not belong to (D)
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Nodes and Intersections
2
13
Node n1 belongs togroups A, B, and CDivides groups intodisjoint intersections
Is unaware of groups itdoes not belong to (D)
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Nodes and Intersections
2
1
4
3
Node n1 belongs togroups A, B, and CDivides groups intodisjoint intersectionsIs unaware of groups itdoes not belong to (D)
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Regional Selection
1
A
A
Select targets for repairsfrom intersections, notgroupsFrom each intersection,select proportionalfraction of cA:cx
A = |x ||A| · cA
latency ∝ 1PsP
g datarate
data rate: across all senders, in intersections of groups
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Regional Selection
1
A
A
Select targets for repairsfrom intersections, notgroups
From each intersection,select proportionalfraction of cA:cx
A = |x ||A| · cA
latency ∝ 1PsP
g datarate
data rate: across all senders, in intersections of groups
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Regional Selection
1
Aac
A a
A abc
1.0cAab
Select targets for repairsfrom intersections, notgroupsFrom each intersection,select proportionalfraction of cA:cx
A = |x ||A| · cA
latency ∝ 1PsP
g datarate
data rate: across all senders, in intersections of groups
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Regional Selection
1
Aac
A a
A abc
1.0cAab
Select targets for repairsfrom intersections, notgroupsFrom each intersection,select proportionalfraction of cA:cx
A = |x ||A| · cA
latency ∝ 1PsP
g datarate
data rate: across all senders, in intersections of groups
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Design SpaceReceiver-Based FECLateral Error Correction
Systems Issues
Overheads:Membership State:# of intersections < # of known nodes.Computational:XORs are fast... 150-300 ms per packet.Bandwidth:(r , c) =⇒ c
r+c repair overhead.
Group Membership Service: Any old one works.
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Experimental Evaluation
Cornell Cluster: 64 1.3 Ghz nodesJava Implementation running on Linux 2.6.12Three Loss Models: {Uniform, Burst, Markov}Grouping Parameters: g ∗ s = d ∗ n
g: Number of Groups in Systems: Average Size of Groupd: Groups joined by each Noden: Number of Nodes in System
Each node joins d randomly selected groups from g groups
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Where does loss occur in a Datacenter?
Packet Loss occurs at end-hosts: independent and bursty
0
20
40
60
80
100
1801601401201008060
burs
t len
gth
(pac
kets
) receiver r1: loss bursts in A and Breceiver r1: data bursts in A and B
0
20
40
60
80
100
1801601401201008060
burs
t len
gth
(pac
kets
)
time in seconds
receiver r2: loss bursts in Areceiver r2: data bursts in A
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Distribution of Recovery Latency16 Nodes, 128 groups per node, 10 nodes per group, Uniform *% Loss
0
10
20
30
40
50
0 50 100 150 200 250
Rec
over
y Pe
rcen
tage
Milliseconds
96.8% LEC + 3.2% NAK
0
10
20
30
40
50
0 50 100 150 200 250
Rec
over
y Pe
rcen
tage
Milliseconds
92% LEC + 8% NAK
0
10
20
30
40
50
0 50 100 150 200 250
Rec
over
y Pe
rcen
tage
Milliseconds
84% LEC + 16% NAK
(a) 10% Loss Rate (b) 15% Loss Rate (c) 20% Loss Rate
Most lost packets recovered < 50ms by LEC.Remainder via reactive NAKs.
Claim: Ricochet is reliable and time-critical.
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Scalability in Groups64 nodes, * groups per node, 10 nodes per group, Loss Model: Uniform 1%
90
92
94
96
98
100
2 4 8 16 32 64 128 256 512 1024
LE
C R
ecov
ery
Perc
enta
ge
Groups per Node (Log Scale)
Recovery %
0
10000
20000
30000
40000
50000
2 4 8 16 32 64 128 256 512 1024M
icro
seco
nds
Groups per Node (Log Scale)
Average Recovery Latency
Claim: Ricochet scales to hundreds of groups. Comparison: at128 groups, SRM latency was 8 seconds. 400 times slower!
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
CPU time and XORs per data packet64 nodes, * groups per node, 10 nodes per group, Loss Model: Uniform 1%
0
50
100
150
200
250
300
350
400
450
2 4 8 16 32 64 128 256 512 1024
Mic
rose
cond
s
Groups per Node (Log Scale)
Updating Repair BinsData Packet Accounting
Waiting in BufferXOR
0
1
2
3
4
5
2 4 8 16 32 64 128 256 512 1024N
umbe
r
Groups per Node (Log Scale)
XORs per Data Packet
Claim: Ricochet is lightweight=⇒ Time-Critical Apps can run over it
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Resilience to Burstiness64 nodes, 128 groups per node, 10 nodes per group, Loss Model: Bursty 1%
30
40
50
60
70
80
90
100
5 10 15 20 25 30 35 40
Rec
over
y Pe
rcen
tage
Burst Size
Resilience to Bursty Losses: Recovery Percentage
L=0.1%L=1%
L=10%
15000 20000 25000 30000 35000 40000 45000 50000 55000 60000 65000
5 10 15 20 25 30 35 40Rec
over
y L
aten
cy (M
icro
seco
nds)
Burst Size
Resilience to Bursty Losses: Recovery Latency
L=0.1%L=1%
L=10%
... can handle short bursts (5-10 packets) well. Good enough?
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Staggering64 nodes, 128 groups per node, 10 nodes per group, Loss Model: Bursty 1%
10 20 30 40 50 60 70 80 90
100
1 2 3 4 5 6 7 8 9 10
Rec
over
y Pe
rcen
tage
Stagger
Staggering: LEC Recovery Percentage
b=10burst=50
burst=100Uniform
10000 15000 20000 25000 30000 35000 40000 45000 50000 55000 60000 65000
1 2 3 4 5 6 7 8 9 10Rec
over
y L
aten
cy (M
icro
seco
nds)
Stagger
Staggering: Recovery Latency
b=10burst=50
burst=100Uniform
Stagger of i : Encode every i th packetStagger 6, burst of 100 packets =⇒ 90% recovered at 50 ms!
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Conclusion
Multicast in Datacenters:large numbers of low-rate groupsaggregate load can be high, causing packet loss
Ricochet is the first protocol to scale in the number ofgroups in the systemLayered under high-level platforms: Tempest, Axis2Available for download:http://www.cs.cornell.edu/projects/quicksilver/Ricochet.html
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Overflow
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast
MotivationSystem
Evaluation
Impact of Loss Rate on LEC64 nodes, 128 groups per node, 10 nodes per group, Loss Model: *
0
20
40
60
80
100
0 0.05 0.1 0.15 0.2 0.25
LE
C R
ecov
ery
Perc
enta
ge
Loss Rate
Effect of Loss Rate on Recovery %
Bursty b=10Uniform
Markov m=10 10000 15000 20000 25000 30000 35000 40000 45000 50000 55000
0 0.05 0.1 0.15 0.2 0.25Rec
over
y L
aten
cy (M
icro
seco
nds)
Loss Rate
Effect of Loss Rate on Latency
Bursty b=10Uniform
Markov m=10
Works well at typical datacenter loss rates: 1-5%
Mahesh Balakrishnan Ricochet: Lateral Error Correction for Time-Critical Multicast