Fault Tolerant Service Function Chaining M. GHAZNAVI, E. JALALPOUR, B. WONG, R. BOUTABA, A. MASHTIZADEH UNIVERSITY OF WATERLOO

Post on 19-Oct-2020

Page 1

Fault Tolerant Service Function Chaining

M. GHAZNAVI, E. JALALPOUR, B. WONG, R. BOUTABA, A. MASHTIZADEH

UNIVERSITY OF WATERLOO

Page 2

Middleboxes and Service Function Chains

[Diagram: Firewall → IDS → NAT → Internet]

Page 3

Middlebox Failures

Demystifying the Dark Side of the Middle: A Field Study of Middlebox Failures in Datacenters

Rahul Potharaju, Purdue University

Navendu Jain, Microsoft Research

ABSTRACT

Network appliances or middleboxes such as firewalls, intrusion detection and prevention systems (IDPS), load balancers, and VPNs form an integral part of datacenters and enterprise networks. Realizing their importance and shortcomings, the research community has proposed software implementations, policy-aware switching, consolidation appliances, moving middlebox processing to VMs, end hosts, and even offloading it to the cloud. While such efforts can use middlebox failure characteristics to improve their reliability, management, and cost-effectiveness, little has been reported on these failures in the field.

In this paper, we make one of the first attempts to perform a large-scale empirical study of middlebox failures over two years in a service provider network comprising thousands of middleboxes across tens of datacenters. We find that middlebox failures are prevalent and they can significantly impact hosted services. Several of our findings differ in key aspects from commonly held views: (1) Most failures are grey dominated by connectivity errors and link flaps that exhibit intermittent connectivity, (2) Hardware faults and overload problems are present but they are not in majority, (3) Middleboxes experience a variety of misconfigurations such as incorrect rules, VLAN misallocation and mismatched keys, and (4) Middlebox failover is ineffective in about 33% of the cases for load balancers and firewalls due to configuration bugs, faulty failovers and software version mismatch. Finally, we analyze current middlebox proposals based on our study and discuss directions for future research.

Categories and Subject Descriptors

C.2.3 [Computer-Communication Network]: Network Operations, network management

General Terms

Network management; Reliability

Keywords

Datacenters; Network reliability; Middleboxes

IMC’13, October 23–25, 2013, Barcelona, Spain.

[Figure 1, top: bar chart of Percent Contribution for L2 Switches, L3 Routers, Middleboxes, and Others, with series High Severity Incidents and Population. Bottom: bar chart of Percent Contribution for Lost connectivity, High Latency, Service Error, Security, and SLA.]

Figure 1: Middleboxes contribute to 43% of high-severity incidents despite being 11% of the population (top). The top-5 categories of service impact in these incidents caused by middleboxes (bottom).

1. INTRODUCTION

Today’s datacenters and enterprises deploy a variety of intermediary network devices or middleboxes to distribute load (e.g., load balancers), enable remote connectivity (e.g., VPNs), improve performance (e.g., proxies) and security (e.g., firewalls, IDPS), as well as to support new traffic classes and applications [1–4]. Given these valuable benefits, the market for middleboxes is estimated to exceed $10B by 2016 [5], and their number is becoming comparable to that of routers in enterprise networks [1, 6].

These benefits, however, come at a high cost: middleboxes constitute a significant fraction of the network capital and operational expenses [1]; they are complex to manage and expensive to troubleshoot [6]; and their outages can greatly impact service performance and availability [7]. For instance, in December 2012, a load balancing misconfiguration affected multiple Google services including Chrome and Gmail [8]. In a 2011 survey of 1,000 organizations [9], 36% and 42% of the respondents indicated failure of a firewall due to DDoS attacks at the application layer and network layer, respectively; the very attack firewalls are deployed to protect against.

Demystifying the dark side of the middle: A field study of middlebox failures in datacenters, IMC 2013

Page 4

Middlebox Fault Tolerance

[Diagram labels: NAT (two instances), Internet, Alice, Bob.]

Page 5

Consistent State Replication

[Diagram labels: NAT (two instances), Internet, Alice, Bob. The state Alice ⬄ Apple, Bob ⬄ Bing appears twice.]

Page 6

Consistent State Replication

[Diagram labels: NAT (two instances), Internet, Alice, Bob. One copy of the state reads Alice ⬄ Apple, Bob ⬄ Bing; the other reads only Bob ⬄ Bing.]
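The two Consistent State Replication slides differ in one detail: in the second, one copy of the state holds only Bob ⬄ Bing. A minimal sketch of why that matters, assuming a toy NAT whose state is a dictionary from client to remote endpoint. The class and function names are hypothetical and do not come from the talk.

```python
class Nat:
    """Toy NAT: its state is a mapping from client to remote endpoint."""

    def __init__(self):
        self.table = {}

    def lookup(self, client):
        # None means the NAT has no state for this client's connection.
        return self.table.get(client)


def add_mapping(primary, replica, client, remote, replicate=True):
    """Record a mapping on the primary and, if `replicate` is true, on the replica."""
    primary.table[client] = remote
    if replicate:
        replica.table[client] = remote


primary, replica = Nat(), Nat()
# Alice's update never reaches the replica (the inconsistent case).
add_mapping(primary, replica, "Alice", "Apple", replicate=False)
add_mapping(primary, replica, "Bob", "Bing")

# Fail over: the replica takes the place of the primary.
survivors = {client: replica.lookup(client) for client in ("Alice", "Bob")}
```

In this sketch Bob's mapping survives the failover while Alice's is lost, so her connection would break. Consistent replication aims to rule that case out.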

Page 7

Previous Approaches

EXTERNALIZED STATE

StatelessNF, NSDI 2017

CHC, NSDI 2019

SNAPSHOT BASED

Pico Replication, SoCC 2013

FTMB, SIGCOMM 2015

REINFORCE, CoNEXT 2018

Page 8

Externalized State Approach

[Diagram: the NAT, on the path to the Internet, reads and writes its state in a fault tolerant data store.]
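A rough sketch of the externalized state idea, assuming the data store can be modeled as a key-value object with `get` and `put`. `CountingStore` and `ExternalizedNat` are hypothetical names for illustration; they are not the APIs of StatelessNF or CHC.

```python
class CountingStore:
    """Stand-in for a fault tolerant data store; counts reads and writes."""

    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0

    def get(self, key):
        self.reads += 1
        return self.data.get(key)

    def put(self, key, value):
        self.writes += 1
        self.data[key] = value


class ExternalizedNat:
    """NAT that keeps no local state; every lookup goes to the store."""

    def __init__(self, store):
        self.store = store

    def process(self, client, remote):
        mapping = self.store.get(client)
        if mapping is None:
            # First packet of the connection: create the mapping in the store.
            self.store.put(client, remote)
            mapping = remote
        return mapping


store = CountingStore()
nat_a = ExternalizedNat(store)
nat_a.process("Alice", "Apple")
nat_a.process("Bob", "Bing")

# nat_a fails; a fresh instance attached to the same store carries on.
nat_b = ExternalizedNat(store)
recovered = nat_b.process("Alice", "Apple")
```

The replacement instance needs no recovery step because it never held state of its own. The counters show the price in this toy model: every packet costs at least one store access.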

Page 9

Snapshot Based Approaches

[Diagram labels: NAT (two instances), Internet, Alice, Bob. The state Alice ⬄ Apple, Bob ⬄ Bing appears twice. Legend: Primary state, Replicated state.]

Page 10

Snapshot Based Approaches for a Chain

[Diagram labels: FW, IDS, NAT in one row and FW, IDPS, NAT in another; Internet. Annotations: High Latency, Low Throughput. Legend: Primary state, Replicated state.]

Page 11

Our Approach

[Diagram: a fault tolerant chain of Firewall, IDS, NAT. State labels: F; I, F’; N, I’. Legend: Primary state, Replicated state.]

Page 12

Goals

Consistent state replication to tolerate f middlebox failures

Minimizing performance overhead during normal operation

Minimizing disruption during middlebox failures

Page 13

Fault Tolerant Chaining (FTC)

In-chain replication
◦ Replicates a chain’s state instead of the state of individual middleboxes
◦ Each middlebox’s state is replicated to the f subsequent middlebox servers

Transactional packet processing
◦ Simplifies the development of multi-threaded middleboxes
◦ Improves scalability and performance

Data dependency vectors
◦ Enable concurrent state replication
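In-chain replication can be sketched as a placement rule. The following is a minimal illustration, not FTC's implementation: it writes f for the number of failures to tolerate, numbers the servers 0 to n-1, and assumes that "subsequent" wraps around at the end of the chain. All function names are hypothetical.

```python
from itertools import combinations


def replica_servers(i, n, f):
    """Servers holding replicas of middlebox i's state in a chain of n servers.

    Assumption: "subsequent" wraps around at the end of the chain.
    """
    if not 0 <= f < n:
        raise ValueError("need 0 <= f < n")
    return [(i + k) % n for k in range(1, f + 1)]


def state_holders(i, n, f):
    """Server i itself plus the f servers that replicate its state."""
    return [i] + replica_servers(i, n, f)


def survives(n, f, failed):
    """True if every middlebox's state still has at least one live copy."""
    return all(
        any(server not in failed for server in state_holders(i, n, f))
        for i in range(n)
    )


# Chain of 3 with f = 1: each server replicates its predecessor's state.
placement = {i: replica_servers(i, 3, 1) for i in range(3)}

# Chain of 5 with f = 2: check every possible pair of failed servers.
all_ok = all(survives(5, 2, set(pair)) for pair in combinations(range(5), 2))
```

Because each state lives on f + 1 distinct servers, any f failures leave at least one copy, and no dedicated replica servers are needed in this model.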

Page 14

Normal Operation

[Diagram labels: replicas r1, r2, r3; middleboxes m1, m2, m3; Forw. and Buffer; packets 1, 2, 3.]

Page 15

Normal Operation

[Diagram: same layout as Page 14 at a later step.]

Page 16

Normal Operation

[Diagram: same layout as Page 14 at a later step.]

Page 17

Failure Recovery

[Diagram labels: replicas r1, r2, r3 and r’2; middleboxes m1, m2, m3 and a second m2; a failure marker (✕); Forw. and Buffer; packets 1, 2, 3. Legend: Primary state, Replicated state.]

Page 18

Transactional Packet Processing

Existing approaches
◦ Single-threaded or batched packet processing
◦ FTMB: multi-threaded packet processing
  ◦ Tracks state changes at the granularity of each state variable read/write
  ◦ Frequent periodic state snapshots

Our approach
◦ Packet transaction model for concurrent packet processing
◦ Uses the isolation property to track state changes at the granularity of packet transactions
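A small sketch of the packet transaction idea, under stated assumptions: state is split into partitions, each guarded by a lock, and a transaction takes its locks in a fixed order. That locking scheme is a choice made for this illustration and is not necessarily the mechanism FTC uses. `TransactionalState` and `count_packet` are hypothetical names.

```python
import threading
from contextlib import ExitStack


class TransactionalState:
    def __init__(self, num_partitions):
        self.partitions = [dict() for _ in range(num_partitions)]
        self._locks = [threading.Lock() for _ in range(num_partitions)]
        self._log_lock = threading.Lock()
        self.log = []  # one entry per committed packet transaction

    def run(self, partition_ids, body):
        """Run body(partitions) while holding the locks of the named partitions.

        `body` returns the writes it wants, as {(partition_id, key): value}.
        """
        with ExitStack() as stack:
            # A fixed lock order avoids deadlock between concurrent transactions.
            for pid in sorted(set(partition_ids)):
                stack.enter_context(self._locks[pid])
            writes = body(self.partitions)
            for (pid, key), value in writes.items():
                self.partitions[pid][key] = value
            # State changes are recorded once per transaction, not per variable access.
            with self._log_lock:
                self.log.append(writes)


def count_packet(partitions):
    # Read the current count and stage the incremented value.
    return {(0, "packets"): partitions[0].get("packets", 0) + 1}


def worker(state, packets):
    for _ in range(packets):
        state.run([0], count_packet)


state = TransactionalState(num_partitions=2)
threads = [threading.Thread(target=worker, args=(state, 200)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Isolation means the read and the write in `count_packet` cannot interleave with another thread's, so no increments are lost and the log holds exactly one entry per packet.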

Page 19

Data Dependency Vectors

Tracking data changes instead of thread operations

Enabling different numbers of threads at the middlebox and replicas
◦ Fail over to a smaller machine
◦ Scale up to a larger machine

Middlebox       Product               Throughput   CPU Cores
IPSec           HP VSR1001            268 Mbps     1
IPSec           HP VSR1008            926 Mbps     8
WAN Optimizer   STEELHEAD CCX770M     10 Mbps      2
WAN Optimizer   STEELHEAD CCX1555M    50 Mbps      4
WAF             Barracuda Level 1     100 Mbps     1
WAF             Barracuda Level 5     200 Mbps     2

Page 20

Data Dependency Vectors Example

[Diagram: a middlebox and a replica exchange state updates in numbered steps 1 to 5; the replica holds an update that arrives too early.]

Transactions at the middlebox and their dependency vectors:
◦ W(1): ⟨0,x,x⟩
◦ R(1), W(3): ⟨1,x,4⟩

Middlebox’s dependency vector: ⟨0,3,4⟩, ⟨1,3,4⟩, ⟨2,3,5⟩

Replica’s dependency vector: ⟨0,3,4⟩, ⟨1,3,4⟩, ⟨2,3,5⟩

Checks at the replica:
◦ ⟨0,3,4⟩ ≥ ⟨1,x,4⟩ ? (hold)
◦ ⟨0,3,4⟩ ≥ ⟨0,x,x⟩ ✓
◦ ⟨1,3,4⟩ ≥ ⟨1,x,4⟩ ✓
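The example can be replayed in a few lines. This is a sketch under two assumptions read off the slide's vectors rather than stated in it: a replica may apply an update when its own vector is ≥ the update's vector on every entry that is not x, and applying an update advances the sequence number of every partition the transaction read or wrote. The function names are hypothetical.

```python
X = None  # "don't care": the transaction did not touch this partition


def can_apply(replica_vec, dep_vec):
    """The slide's check: replica >= dependency on every entry that is not x."""
    return all(d is X or r >= d for r, d in zip(replica_vec, dep_vec))


def apply_update(replica_vec, dep_vec):
    """Advance the sequence number of every partition the transaction touched.

    Assumption inferred from the slide's vectors: reads and writes both advance it.
    """
    return [r if d is X else r + 1 for r, d in zip(replica_vec, dep_vec)]


def replay(replica_vec, arrivals):
    """Apply updates as they become applicable and hold the rest."""
    held = list(arrivals)
    history = [list(replica_vec)]
    progress = True
    while held and progress:
        progress = False
        for dep in held:
            if can_apply(replica_vec, dep):
                replica_vec = apply_update(replica_vec, dep)
                history.append(replica_vec)
                held.remove(dep)
                progress = True
                break  # restart the scan, since `held` changed
    return history, held


# The second transaction's update arrives first and must be held.
history, held = replay([0, 3, 4], [[1, X, 4], [0, X, X]])
```

Here `history` walks through the replica vectors ⟨0,3,4⟩, ⟨1,3,4⟩, ⟨2,3,5⟩ shown on the slide, even though the updates arrived out of order. Updates that touch disjoint partitions pass the check independently, which is what lets replication proceed concurrently.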

Page 21

Evaluation

METHOD

Comparing FTC with:

NF, Non-Fault tolerant system
◦ Ideal performance

FTMB (SIGCOMM 2015)
◦ State logging + Snapshots

FTMB + Snapshot (SIGCOMM 2015)
◦ State logging + Snapshots

ENVIRONMENTS

A cluster of 12 servers
◦ 40 Gbps network

SAVI Cloud environment
◦ A virtual network of OVS switches

MoonGen and pktGen traffic generators
◦ UDP traffic
◦ Packet size: 256 B

Page 22

Fault Tolerant NATs

[Plot: Throughput (Mpps) versus Threads (1, 2, 4, 8) for NF, FTC, and FTMB. Annotation: 2x higher throughput.]

Page 23

Fault Tolerant Chains – Throughput

[Plot: Throughput (Mpps) versus Chain Length (2, 3, 4, 5) for NF, FTC, FTMB, and FTMB+Snapshot. Annotations: 3.5x higher throughput; 39% drop due to snapshots; 1.8x higher throughput.]

Page 24

Fault Tolerant Chains – Latency

[Plot: CDF of packets versus Latency (µs), from 40 to 180 µs, for NF, FTC, and FTMB.]

Page 25

Conclusion

Keep a chain of middleboxes operating after f middleboxes fail

◦ In-chain replication

◦ Transactional packet processing

◦ Data dependency vectors