fault tolerant service function chainingali/papers/sigcomm20-ftc-slides.pdf · alice bob alice...
TRANSCRIPT
Fault Tolerant Service Function ChainingM. GHAZNAVI , E . JALALPOUR, B . WONG, R. BOUTABA, A . MASHTIZADEH
UNIVERSITY OF WATERLOO
Middleboxes and Service Function Chains
1
Firewall IDS NAT Internet
Middlebox Failures
2
Demystifying the Dark Side of the Middle: A Field Study ofMiddlebox Failures in Datacenters
Rahul PotharajuPurdue University
Navendu JainMicrosoft Research
ABSTRACT
Network appliances or middleboxes such as firewalls, intru-sion detection and prevention systems (IDPS), load bal-ancers, and VPNs form an integral part of datacenters andenterprise networks. Realizing their importance and short-comings, the research community has proposed software im-plementations, policy-aware switching, consolidation appli-ances, moving middlebox processing to VMs, end hosts, andeven o!oading it to the cloud. While such e"orts can usemiddlebox failure characteristics to improve their reliability,management, and cost-e"ectiveness, little has been reportedon these failures in the field.In this paper, we make one of the first attempts to perform
a large-scale empirical study of middlebox failures over twoyears in a service provider network comprising thousands ofmiddleboxes across tens of datacenters. We find that mid-dlebox failures are prevalent and they can significantly im-pact hosted services. Several of our findings di"er in key as-pects from commonly held views: (1) Most failures are greydominated by connectivity errors and link flaps that exhibitintermittent connectivity, (2) Hardware faults and overloadproblems are present but they are not in majority, (3) Mid-dleboxes experience a variety of misconfigurations such asincorrect rules, VLAN misallocation and mismatched keys,and (4) Middlebox failover is ine"ective in about 33% of thecases for load balancers and firewalls due to configurationbugs, faulty failovers and software version mismatch. Fi-nally, we analyze current middlebox proposals based on ourstudy and discuss directions for future research.
Categories and Subject Descriptors
C.2.3 [Computer-Communication Network]: NetworkOperations—network management
General Terms
Network management; Reliability
Keywords
Datacenters; Network reliability; Middleboxes
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-
tion on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from [email protected].
IMC’13, October 23–25, 2013, Barcelona, Spain.
Copyright 2013 ACM 978-1-4503-1953-9/13/10 ...$15.00..
20
81
36
2
43
11 1 3
0
25
50
75
100
L2 Switches L3 Routers Middleboxes Others
Perc
en
t C
on
trib
uti
on
High Severity Incidents
Population
4235
11 9 30
25
50
75
100
Lost connectivity
High Latency
Service Error
SecuritySLA
Perc
en
t C
on
trib
uti
on
Figure 1: Middleboxes contribute to 43% of high-severity incidents despite being 11% of the popula-tion (top). The top-5 categories of service impact inthese incidents caused by middleboxes (bottom).
1. INTRODUCTIONToday’s datacenters and enterprises deploy a variety of
intermediary network devices or middleboxes to distributeload (e.g., load balancers), enable remote connectivity (e.g.,VPNs), improve performance (e.g., proxies) and security(e.g., firewalls, IDPS), as well as to support new tra#cclasses and applications [1–4]. Given these valuable benefits,the market for middleboxes is estimated to exceed $10B by2016 [5], and their number is becoming comparable to thatof routers in enterprise networks [1, 6].
These benefits, however, come at a high cost: middle-boxes constitute a significant fraction of the network capitaland operational expenses [1]; they are complex to manageand expensive to troubleshoot [6]; and their outages cangreatly impact service performance and availability [7]. Forinstance, in December 2012, a load balancing misconfigura-tion a"ected multiple Google services including Chrome andGmail [8]. In a 2011 survey of 1,000 organizations [9], 36%and 42% of the respondents indicated failure of a firewalldue to DDoS attacks at the application layer and networklayer, respectively; the very attack firewalls are deployed toprotect against.
Demystifying the dark side of the middle:A field study of middlebox failures in datacentersIMC 2013
NAT
Middlebox Fault Tolerance
3
NAT Internet
Alice
Bob
✕
NAT
Consistent State Replication
4
NAT Internet
Alice
Bob
Alice ⬄AppleBob ⬄Bing
Alice ⬄AppleBob ⬄Bing
NAT
Consistent State Replication
5
NAT Internet
Alice
Bob
Alice ⬄AppleBob ⬄Bing
Bob ⬄Bing
Previous ApproachesEXTERNALIZED STATE
StatelessNF, NSDI 2017
CHC, NSDI 2019
SNAPSHOT BASED
Pico Replication, SoCC 2013
FTMB, SIGCOMM 2015
REINFORCE, CoNEXT 2018
6
Externalized State Approach
7
NAT Internet
Fault Tolerant Data Store
Read/Write
State
NAT
Snapshot Based Approaches
8
NAT Internet
Alice
Bob
Alice ⬄AppleBob ⬄Bing
Alice ⬄AppleBob ⬄Bing Primary state
Replicated state
Snapshot Based Approaches for a Chain
9
FW
FW
IDS NAT
IDPS NAT
InternetHigh LatencyLow Throughput
Primary stateReplicated state
Our Approach
10
Fault Tolerant
Firewall IDS NATF IF’ NI’
…
Primary stateReplicated state
Goals
Consistent state replication to tolerate ! middlebox failures
Minimizing performance overhead during normal operation
Minimizing disruption during middlebox failures
11
Fault Tolerant Chaining (FTC)In-chain replication◦ Replicates a chain’s state instead of the state of individual middleboxes◦ Each middlebox’s state replicated to subsequent ! middlebox servers
Transactional packet processing ◦ Simplifies the development of multi-threaded middleboxes◦ Improves scalability and performance
Data dependency vectors◦ Enables concurrent state replication
12
r1 r3r2
Normal Operation
13
m1 m2 m3
Forw. Buffer21 2 3
r1 r3r2
Normal Operation
14
m1 m2 m3
Forw. Buffer1 21 2 3
r1 r3r2
Normal Operation
15
m1 m2 m3
Forw. Buffer3 1 21 2 3
Failure Recovery
16
r3r2r1
m1 m2 m3
13 21 32✕r’
2
m2
2
1 2Forw. Buffer
Primary stateReplicated state
Transactional Packet ProcessingExisting approaches◦ Single thread or batched packet processing◦ FTMB: multi threaded packet processing
◦ Tracking state changes in granularity of each state variable read/write◦ Frequent periodic state snapshots
Our approach◦ Packet transaction model for concurrent packet processing◦ Using isolation property to tracking state changes in granularity of packet transactions
17
Data Dependency VectorsTracking data changes instead of thread operations
Enabling different number of threads at the middlebox and replicas◦ Fail over to smaller machine◦ Scale up to a larger machine
18
Middlebox Product Throughput CPU Core
IPSecHP VSR1001 268 Mbps 1HP VSR1008 926 Mbps 8
WAN Optimizer
STEELHEAD CCX770M 10 Mbps 2STEELHEAD CCX1555M 50 Mbps 4
WAFBarracuda Level 1 100 Mbps 1Barracuda Level 5 200 Mbps 2
Data Dependency Vectors Example
19
1
2
3
4
5
Middlebox
Replica hold
W(1)
R(1), W(3)
⟨0,x,x⟩
⟨1,x,4⟩
⟨0,3,4⟩ ⟨1,x,4⟩≥ ⟨1,3,4⟩ ⟨1,x,4⟩≥
⟨0,3,4⟩ ⟨0,x,x⟩≥
⟨0,3,4⟩ ⟨1,3,4⟩ ⟨2,3,5 ⟩
Middlebox’s dependency vector:
Replica’s dependency vector:
1 2
⟨0,3,4⟩ ⟨1,3,4⟩ ⟨2,3,5 ⟩4 5
? ✓
✓ ⟨0,x,x⟩ ⟨1,x,4⟩
⟨0,3,4⟩⟨0,3,4⟩ ⟨1,3,4⟩
⟨1,x,4⟩
EvaluationMETHOD
Comparing FTC with:
NF, Non-Fault tolerant system◦ Ideal performance
FTMB (SIGCOMM 2015)◦ State logging + Snapshots
FTMB + Snapshot (SIGCOMM 2015)◦ State logging + Snapshots
ENVIRONMENTS
A cluster of 12 servers◦ 40 Gbps network
SAVI Cloud environment◦ A virtual network of OVS switches
MoonGen and pktGen traffic generators◦ UDP traffic
◦ Packet size: 256 B
20
Fault Tolerant NATs
21
1 2 4 8Threads
0
2
4
6
8
10
Thro
ughp
ut(M
pps)
NF FTC FTMB 2x higher throughput
Fault Tolerant Chains – Throughput
22
2 3 4 5Chain Length
0
2
4
6
8
10
Thro
ughp
ut(M
pps)
NF FTC FTMB FTMB+Snapshot
3.5x higher throughput
39% drop due to snapshots
1.8x higher throughput
Fault Tolerant Chains – Latency
23
40 60 80 100 120 140 160 180
Latency (µs)
0.0
0.2
0.4
0.6
0.8
1.0
Pack
ets
(CD
F)
NF FTC FTMB
Conclusion
Keep operation of a chain of middleboxes online after ! middleboxes fail
◦ In-chain replication
◦ Transactional packet processing
◦ Data dependency vectors
24