fault tolerant service function chainingali/papers/sigcomm20-ftc-slides.pdf · alice bob alice...

Post on 19-Oct-2020

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Fault Tolerant Service Function ChainingM. GHAZNAVI , E . JALALPOUR, B . WONG, R. BOUTABA, A . MASHTIZADEH

UNIVERSITY OF WATERLOO

Middleboxes and Service Function Chains

1

Firewall IDS NAT Internet

Middlebox Failures

2

Demystifying the Dark Side of the Middle: A Field Study ofMiddlebox Failures in Datacenters

Rahul PotharajuPurdue University

rpothara@purdue.edu

Navendu JainMicrosoft Research

navendu@microsoft.com

ABSTRACT

Network appliances or middleboxes such as firewalls, intru-sion detection and prevention systems (IDPS), load bal-ancers, and VPNs form an integral part of datacenters andenterprise networks. Realizing their importance and short-comings, the research community has proposed software im-plementations, policy-aware switching, consolidation appli-ances, moving middlebox processing to VMs, end hosts, andeven o!oading it to the cloud. While such e"orts can usemiddlebox failure characteristics to improve their reliability,management, and cost-e"ectiveness, little has been reportedon these failures in the field.In this paper, we make one of the first attempts to perform

a large-scale empirical study of middlebox failures over twoyears in a service provider network comprising thousands ofmiddleboxes across tens of datacenters. We find that mid-dlebox failures are prevalent and they can significantly im-pact hosted services. Several of our findings di"er in key as-pects from commonly held views: (1) Most failures are greydominated by connectivity errors and link flaps that exhibitintermittent connectivity, (2) Hardware faults and overloadproblems are present but they are not in majority, (3) Mid-dleboxes experience a variety of misconfigurations such asincorrect rules, VLAN misallocation and mismatched keys,and (4) Middlebox failover is ine"ective in about 33% of thecases for load balancers and firewalls due to configurationbugs, faulty failovers and software version mismatch. Fi-nally, we analyze current middlebox proposals based on ourstudy and discuss directions for future research.

Categories and Subject Descriptors

C.2.3 [Computer-Communication Network]: NetworkOperations—network management

General Terms

Network management; Reliability

Keywords

Datacenters; Network reliability; Middleboxes

Permission to make digital or hard copies of all or part of this work for personal or

classroom use is granted without fee provided that copies are not made or distributed

for profit or commercial advantage and that copies bear this notice and the full cita-

tion on the first page. Copyrights for components of this work owned by others than

ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-

publish, to post on servers or to redistribute to lists, requires prior specific permission

and/or a fee. Request permissions from permissions@acm.org.

IMC’13, October 23–25, 2013, Barcelona, Spain.

Copyright 2013 ACM 978-1-4503-1953-9/13/10 ...$15.00..

20

81

36

2

43

11 1 3

0

25

50

75

100

L2 Switches L3 Routers Middleboxes Others

Perc

en

t C

on

trib

uti

on

High Severity Incidents

Population

4235

11 9 30

25

50

75

100

Lost connectivity

High Latency

Service Error

SecuritySLA

Perc

en

t C

on

trib

uti

on

Figure 1: Middleboxes contribute to 43% of high-severity incidents despite being 11% of the popula-tion (top). The top-5 categories of service impact inthese incidents caused by middleboxes (bottom).

1. INTRODUCTIONToday’s datacenters and enterprises deploy a variety of

intermediary network devices or middleboxes to distributeload (e.g., load balancers), enable remote connectivity (e.g.,VPNs), improve performance (e.g., proxies) and security(e.g., firewalls, IDPS), as well as to support new tra#cclasses and applications [1–4]. Given these valuable benefits,the market for middleboxes is estimated to exceed $10B by2016 [5], and their number is becoming comparable to thatof routers in enterprise networks [1, 6].

These benefits, however, come at a high cost: middle-boxes constitute a significant fraction of the network capitaland operational expenses [1]; they are complex to manageand expensive to troubleshoot [6]; and their outages cangreatly impact service performance and availability [7]. Forinstance, in December 2012, a load balancing misconfigura-tion a"ected multiple Google services including Chrome andGmail [8]. In a 2011 survey of 1,000 organizations [9], 36%and 42% of the respondents indicated failure of a firewalldue to DDoS attacks at the application layer and networklayer, respectively; the very attack firewalls are deployed toprotect against.

Demystifying the dark side of the middle:A field study of middlebox failures in datacentersIMC 2013

NAT

Middlebox Fault Tolerance

3

NAT Internet

Alice

Bob

NAT

Consistent State Replication

4

NAT Internet

Alice

Bob

Alice ⬄AppleBob ⬄Bing

Alice ⬄AppleBob ⬄Bing

NAT

Consistent State Replication

5

NAT Internet

Alice

Bob

Alice ⬄AppleBob ⬄Bing

Bob ⬄Bing

Previous ApproachesEXTERNALIZED STATE

StatelessNF, NSDI 2017

CHC, NSDI 2019

SNAPSHOT BASED

Pico Replication, SoCC 2013

FTMB, SIGCOMM 2015

REINFORCE, CoNEXT 2018

6

Externalized State Approach

7

NAT Internet

Fault Tolerant Data Store

Read/Write

State

NAT

Snapshot Based Approaches

8

NAT Internet

Alice

Bob

Alice ⬄AppleBob ⬄Bing

Alice ⬄AppleBob ⬄Bing Primary state

Replicated state

Snapshot Based Approaches for a Chain

9

FW

FW

IDS NAT

IDPS NAT

InternetHigh LatencyLow Throughput

Primary stateReplicated state

Our Approach

10

Fault Tolerant

Firewall IDS NATF IF’ NI’

Primary stateReplicated state

Goals

Consistent state replication to tolerate ! middlebox failures

Minimizing performance overhead during normal operation

Minimizing disruption during middlebox failures

11

Fault Tolerant Chaining (FTC)In-chain replication◦ Replicates a chain’s state instead of the state of individual middleboxes◦ Each middlebox’s state replicated to subsequent ! middlebox servers

Transactional packet processing ◦ Simplifies the development of multi-threaded middleboxes◦ Improves scalability and performance

Data dependency vectors◦ Enables concurrent state replication

12

r1 r3r2

Normal Operation

13

m1 m2 m3

Forw. Buffer21 2 3

r1 r3r2

Normal Operation

14

m1 m2 m3

Forw. Buffer1 21 2 3

r1 r3r2

Normal Operation

15

m1 m2 m3

Forw. Buffer3 1 21 2 3

Failure Recovery

16

r3r2r1

m1 m2 m3

13 21 32✕r’

2

m2

2

1 2Forw. Buffer

Primary stateReplicated state

Transactional Packet ProcessingExisting approaches◦ Single thread or batched packet processing◦ FTMB: multi threaded packet processing

◦ Tracking state changes in granularity of each state variable read/write◦ Frequent periodic state snapshots

Our approach◦ Packet transaction model for concurrent packet processing◦ Using isolation property to tracking state changes in granularity of packet transactions

17

Data Dependency VectorsTracking data changes instead of thread operations

Enabling different number of threads at the middlebox and replicas◦ Fail over to smaller machine◦ Scale up to a larger machine

18

Middlebox Product Throughput CPU Core

IPSecHP VSR1001 268 Mbps 1HP VSR1008 926 Mbps 8

WAN Optimizer

STEELHEAD CCX770M 10 Mbps 2STEELHEAD CCX1555M 50 Mbps 4

WAFBarracuda Level 1 100 Mbps 1Barracuda Level 5 200 Mbps 2

Data Dependency Vectors Example

19

1

2

3

4

5

Middlebox

Replica hold

W(1)

R(1), W(3)

⟨0,x,x⟩

⟨1,x,4⟩

⟨0,3,4⟩ ⟨1,x,4⟩≥ ⟨1,3,4⟩ ⟨1,x,4⟩≥

⟨0,3,4⟩ ⟨0,x,x⟩≥

⟨0,3,4⟩ ⟨1,3,4⟩ ⟨2,3,5 ⟩

Middlebox’s dependency vector:

Replica’s dependency vector:

1 2

⟨0,3,4⟩ ⟨1,3,4⟩ ⟨2,3,5 ⟩4 5

? ✓

✓ ⟨0,x,x⟩ ⟨1,x,4⟩

⟨0,3,4⟩⟨0,3,4⟩ ⟨1,3,4⟩

⟨1,x,4⟩

EvaluationMETHOD

Comparing FTC with:

NF, Non-Fault tolerant system◦ Ideal performance

FTMB (SIGCOMM 2015)◦ State logging + Snapshots

FTMB + Snapshot (SIGCOMM 2015)◦ State logging + Snapshots

ENVIRONMENTS

A cluster of 12 servers◦ 40 Gbps network

SAVI Cloud environment◦ A virtual network of OVS switches

MoonGen and pktGen traffic generators◦ UDP traffic

◦ Packet size: 256 B

20

Fault Tolerant NATs

21

1 2 4 8Threads

0

2

4

6

8

10

Thro

ughp

ut(M

pps)

NF FTC FTMB 2x higher throughput

Fault Tolerant Chains – Throughput

22

2 3 4 5Chain Length

0

2

4

6

8

10

Thro

ughp

ut(M

pps)

NF FTC FTMB FTMB+Snapshot

3.5x higher throughput

39% drop due to snapshots

1.8x higher throughput

Fault Tolerant Chains – Latency

23

40 60 80 100 120 140 160 180

Latency (µs)

0.0

0.2

0.4

0.6

0.8

1.0

Pack

ets

(CD

F)

NF FTC FTMB

Conclusion

Keep operation of a chain of middleboxes online after ! middleboxes fail

◦ In-chain replication

◦ Transactional packet processing

◦ Data dependency vectors

24

top related