Page 1: Trumpet: Timely and Precise Triggers in Data Centers

Masoud Moshref, Minlan Yu, Ramesh Govindan, Amin Vahdat

Page 2: The Problem

Human-in-the-loop failure assessment and repair

Long failure repair times in large networks

Evolve or Die, SIGCOMM 2016

Page 3: Humans in the Loop

Detect → Locate → Inspect → Fix

Page 4: Programs in the Loop

Detect → Locate → Inspect → Fix

Page 5: Our Focus

Detect

A framework for programmed detection of events in large datacenters

Page 6: Events

Link failure, DDoS, traffic surge, packet delay, lost packet, packet burst, switch failure, incast, load imbalance, black hole, congestion, traffic hijack, loop, middlebox failure, burst loss

❖ Availability
❖ Performance
❖ Security

Page 7: Our Focus

Detect

Aggregated, often sampled measures of network health

Page 8: Fine Timescale Events

40 ms burst

Timeouts lasting several hundred ms

Detecting Transient Congestion

Page 9: Fine Timescale Events

Did this tenant see a sudden increase in traffic over the last few milliseconds?

Detecting Attack Onset

Page 10: Inspect Every Packet

Link failure, DDoS, traffic surge, packet delay, lost packet, packet burst, switch failure, incast, load imbalance, black hole, congestion, traffic hijack, loop, middlebox failure, burst loss

Some event definitions may require inspecting every packet

Page 11: Eventing Framework Requirements

Expressivity
▸ Set of possible events not known a priori

Fine timescale eventing
▸ Capture transient and onset events

Per-packet processing
▸ Precise event determination

Because data centers will require high availability and high utilization

Page 12: A Key Architectural Question

Where do we place eventing functionality?

Switches, Hosts, NICs

❖ Are programmable
❖ Have processing power for fine-timescale eventing
❖ Already inspect every packet

Page 13

We explore the design of a host-based eventing framework

Page 14: Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

Page 15: Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

Trumpet has a logically centralized event manager that aggregates local events from per-host packet monitors

Page 16: Event Definition

For each packet matching [Filter], group by [Flow-granularity], and report every [Time-interval] each group that satisfies [Predicate]

Flow volumes, loss rate, loss pattern (bursts), delay

Page 17: Event Example

For each packet matching [Service IP Prefix], group by [5-tuple], and report every [10 ms] any flow whose [sum (is_lost & is_burst) > 10%]

Is there any flow sourced by a service that sees a burst of losses in a small interval?

Page 18: Event Example

For each packet matching [Cluster IP Prefix and Port], group by [Job IP Prefix], and report every [10 ms] any job whose [sum (volume) > 100 MB]

Is there a job in a cluster that sees abnormal traffic volumes in a small interval?
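The four-part event definition (filter, flow granularity, time interval, predicate) can be pictured as a small data structure. The following is a minimal Python sketch, not Trumpet's implementation: the `Trigger` class, the packet dictionary fields, the `evaluate` helper, and the toy threshold are all hypothetical, and the example checks one interval's packets as a batch rather than at line rate.

```python
from collections import defaultdict
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Callable, Dict, Hashable, List


@dataclass
class Trigger:
    """Hypothetical container for the four parts of an event definition."""
    packet_filter: Callable[[dict], bool]    # which packets to consider
    group_key: Callable[[dict], Hashable]    # flow granularity
    interval_ms: int                         # how often to report
    predicate: Callable[[List[dict]], bool]  # condition on one group's packets


def evaluate(trigger: Trigger, packets: List[dict]) -> List[Hashable]:
    """Return the groups that satisfy the predicate for one time interval."""
    groups: Dict[Hashable, List[dict]] = defaultdict(list)
    for pkt in packets:
        if trigger.packet_filter(pkt):
            groups[trigger.group_key(pkt)].append(pkt)
    return [key for key, pkts in groups.items() if trigger.predicate(pkts)]


# Assumed example in the spirit of the volume trigger: packets destined to a
# cluster prefix, grouped by the sender's /24 prefix, with a toy byte threshold.
CLUSTER = ip_network("10.1.0.0/16")

volume_trigger = Trigger(
    packet_filter=lambda p: ip_address(p["dst"]) in CLUSTER,
    group_key=lambda p: ip_network(p["src"] + "/24", strict=False),
    interval_ms=10,
    predicate=lambda pkts: sum(p["size"] for p in pkts) > 3000,
)

packets = [
    {"src": "10.2.1.5", "dst": "10.1.0.9", "size": 1500},
    {"src": "10.2.1.7", "dst": "10.1.0.9", "size": 1500},
    {"src": "10.2.1.8", "dst": "10.1.0.9", "size": 1500},
    {"src": "10.2.9.1", "dst": "10.1.0.9", "size": 1500},
    {"src": "10.2.1.5", "dst": "192.168.0.1", "size": 9000},  # fails the filter
]

hot_jobs = evaluate(volume_trigger, packets)
```

Here only the 10.2.1.0/24 group exceeds the toy threshold, so it is the only group reported for the interval.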

Page 19: Trumpet Design

[Figure labels: Controller; Trumpet Event Manager; Server; VM; Hypervisor; Software switch; Trumpet Packet Monitor; Triggers; Trigger Reports; Event; Report]

Page 20: Trumpet Event Manager

Congestion?

Congestion Triggers

Contains event attributes, detects local events
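One way to picture how a centralized manager could combine per-host trigger reports into a network-wide event is sketched here. The slides do not specify the aggregation rule, so the summing logic, the `EventManager` class, and its method names are assumptions made purely for illustration.

```python
from collections import defaultdict


class EventManager:
    """Hypothetical manager that sums per-host values for one event."""

    def __init__(self, threshold):
        self.threshold = threshold
        # (interval, group) -> {host: value reported by that host}
        self.reports = defaultdict(dict)

    def on_trigger_report(self, host, interval, group, value):
        """Record a host's local report; return True if the event now holds."""
        self.reports[(interval, group)][host] = value
        total = sum(self.reports[(interval, group)].values())
        return total > self.threshold


manager = EventManager(threshold=100)
first = manager.on_trigger_report("host-a", interval=0, group="job-1", value=60)
second = manager.on_trigger_report("host-b", interval=0, group="job-1", value=70)
```

In this toy run neither host's local value crosses the threshold alone, but the combined view does once the second report arrives.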

Page 21: Trumpet Event Manager

Page 22: Trumpet Event Manager

Large flow?

Large Flow Triggers

Trumpet can be used by programs to drill down to potential root causes

Page 23: Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

The monitor optimizes packet processing to inspect every packet and evaluate predicates at fine timescales

Page 24: The Packet Monitor

[Figure labels: Server; VM; Hypervisor; Software switch; Trumpet Packet Monitor]

Page 25: A Key Assumption

[Figure labels: Server; VM; Hypervisor; Software switch; Trumpet Packet Monitor]

Piggyback on CPU core used by software switch
❖ Conserves server CPU resources
❖ Avoids inter-core synchronization

Page 26

Can a single core monitor thousands of triggers at full packet rate (14.8 Mpps) on a 10G NIC?
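The 14.8 Mpps figure follows from Ethernet framing arithmetic for minimum-size packets. The sketch assumes the standard 20 bytes of per-frame overhead on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap); the function name is illustrative.

```python
# Preamble (7) + start-of-frame delimiter (1) + inter-frame gap (12).
WIRE_OVERHEAD_BYTES = 20


def max_packet_rate(link_bps: float, frame_bytes: int) -> float:
    """Packets per second a link can carry at a given Ethernet frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


rate_pps = max_packet_rate(10e9, 64)   # 64-byte frames on a 10G link
budget_ns = 1e9 / rate_pps             # time available per packet
print(f"{rate_pps / 1e6:.2f} Mpps, {budget_ns:.1f} ns per packet")
```

At this rate a single core has roughly 67 ns per packet, which is why the work done for each packet must stay very small.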

Page 27: Two Obvious Tricks

Use kernel bypass
▸ Avoid kernel stack overhead

Use polling to have tighter scheduling
▸ Trigger time intervals at 10 ms

Necessary, but far from sufficient...

Page 28: Monitor Design

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Example trigger fields (values listed in slide order):
Filter: Source IP = 10.1.1.0/24; Source IP = 20.2.2.0/24
Flow granularity: Service IP prefix; 5-tuple
Time interval: 10 ms; 100 ms
Predicate: Sum(loss) > 10%; Sum(size) < 10 MB

With 1000s of triggers

Page 29: Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Which of these should be performed
❖ On-path
❖ Off-path

Page 30: Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

Which operations to do on-path?
❖ 70 ns to forward and inspect packet

Page 31: Design Challenges

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

How to schedule off-path operations?
❖ Off-path on same core, can delay packets
❖ Bound delay to a few µs

Page 32: Strawman Design

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

[Figure labels: Packet History; On-Path; Off-Path]

Doesn't scale to large numbers of triggers

Page 33: Strawman Design

Packet → Match filters → Update statistics at flow granularity → Check predicate at time-interval

[Figure labels: On-Path; Off-Path]

Still cannot reach goal
❖ Memory subsystem becomes a bottleneck

Page 34: Trumpet Monitor Design

On-Path: Packet → Match filters → Update statistics at 5-tuple granularity

Off-Path: Gather statistics at flow granularity → Check predicate at time-interval

Page 35: Optimizations

On-Path: Packet → Match filters → Update statistics at 5-tuple granularity

❖ Use tuple-space search for matching
❖ Match on first packet, cache match
❖ Lay out tables to enable cache prefetch
❖ Use TLB huge pages for tables
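The first two on-path optimizations can be illustrated together: tuple-space search keeps one hash table per combination of prefix lengths, and the result of that search is cached per 5-tuple so only a flow's first packet pays for it. This is a simplified Python sketch under assumptions (IPv4 source and destination prefixes only, a plain dictionary as the flow cache); the class and method names are hypothetical and do not reflect Trumpet's actual data layout.

```python
from ipaddress import ip_address


def mask_ip(ip: int, prefix_len: int) -> int:
    """Keep the top prefix_len bits of a 32-bit IPv4 address."""
    return ip & ((0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF)


class TriggerMatcher:
    def __init__(self):
        # Tuple space: one hash table per (src_prefix_len, dst_prefix_len).
        self.tables = {}
        # 5-tuple -> matched trigger ids, filled on a flow's first packet.
        # Note: this sketch does not invalidate the cache when filters change.
        self.flow_cache = {}
        self.slow_lookups = 0

    def add_filter(self, trigger_id, src_prefix, src_len, dst_prefix, dst_len):
        key = (mask_ip(int(ip_address(src_prefix)), src_len),
               mask_ip(int(ip_address(dst_prefix)), dst_len))
        table = self.tables.setdefault((src_len, dst_len), {})
        table.setdefault(key, []).append(trigger_id)

    def _tuple_space_search(self, src: int, dst: int):
        """Probe each per-tuple table once with the masked addresses."""
        self.slow_lookups += 1
        matched = []
        for (src_len, dst_len), table in self.tables.items():
            key = (mask_ip(src, src_len), mask_ip(dst, dst_len))
            matched += table.get(key, [])
        return matched

    def match(self, five_tuple):
        cached = self.flow_cache.get(five_tuple)
        if cached is None:  # first packet of this flow
            src = int(ip_address(five_tuple[0]))
            dst = int(ip_address(five_tuple[1]))
            cached = self.flow_cache[five_tuple] = self._tuple_space_search(src, dst)
        return cached


matcher = TriggerMatcher()
matcher.add_filter("t1", "10.1.1.0", 24, "0.0.0.0", 0)
matcher.add_filter("t2", "20.2.2.0", 24, "0.0.0.0", 0)

flow = ("10.1.1.7", "30.0.0.1", 1234, 80, "tcp")
first = matcher.match(flow)    # runs the tuple-space search
second = matcher.match(flow)   # served from the flow cache
```

The number of hash probes per search grows with the number of distinct prefix-length combinations rather than with the number of triggers, which is the property that makes the approach attractive when there are thousands of filters.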

Page 36: Optimizations

Off-Path: Gather statistics at flow granularity → Check predicate at time-interval

❖ Lazy cleanup of statistics across intervals
❖ Lay out tables to enable cache prefetch
❖ Bounded-delay cooperative scheduling
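Lazy cleanup can be pictured as stamping each counter with the interval in which it was last written, so that stale values are reset on first touch instead of by a sweep over every entry at each interval boundary. The slides only name the optimization; the epoch-stamp approach and the `LazyCounter` class here are assumptions chosen as one common way to get that behavior.

```python
class LazyCounter:
    """Per-flow counter that resets itself when first touched in a new interval."""

    def __init__(self):
        self.epoch = -1
        self.value = 0

    def add(self, amount, current_epoch):
        if self.epoch != current_epoch:  # stale data from an earlier interval
            self.epoch = current_epoch
            self.value = 0
        self.value += amount

    def read(self, current_epoch):
        # A counter not written in this interval counts as zero.
        return self.value if self.epoch == current_epoch else 0


counter = LazyCounter()
counter.add(100, current_epoch=0)
counter.add(50, current_epoch=0)
in_interval_0 = counter.read(current_epoch=0)
counter.add(10, current_epoch=1)   # resets on first touch in the new interval
in_interval_1 = counter.read(current_epoch=1)
stale_read = counter.read(current_epoch=2)
```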

Page 37: Bounded Delay Cooperative Scheduling

[Figure labels: Off-Path; On-Path; Bounded Delay]

Bound delay to a few µs
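The idea of running off-path work on the packet-processing core while bounding the delay it adds can be sketched with a generator that yields after each small unit of work. This is an illustrative Python sketch, not Trumpet's scheduler: `sweep`, `run_off_path`, `main_loop`, and the 5 µs default budget are hypothetical, and a real implementation would poll a NIC queue rather than call a Python function.

```python
import time


def sweep(flow_table, reports, threshold):
    """Off-path work as a generator: check one flow, then yield control."""
    for flow, byte_count in flow_table.items():
        if byte_count > threshold:
            reports.append(flow)
        yield


def run_off_path(task, budget_s, clock=time.perf_counter):
    """Advance the task until it finishes or the time budget is used up.

    At least one step runs per call, so the added delay is bounded by the
    budget plus one step. Returns True when the task has finished.
    """
    deadline = clock() + budget_s
    for _ in task:
        if clock() >= deadline:
            return False  # hand the core back to packet processing
    return True


def main_loop(poll_packets, handle_packet, task, budget_s=5e-6):
    """Alternate on-path packet handling with bounded slices of off-path work."""
    done = False
    while not done:
        for pkt in poll_packets():
            handle_packet(pkt)               # on-path
        done = run_off_path(task, budget_s)  # off-path, bounded
```

Because the generator keeps its position between calls, a long sweep is spread across many short slices, with packets handled in between.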

Page 38: Research Questions

What eventing architecture permits programmability and visibility?

How can we achieve precise eventing at fine timescales?

What is the performance envelope of such an eventing framework?

Trumpet can monitor thousands of triggers at full packet rate on a 10G NIC

Page 39: Evaluation

Trumpet is expressive
❖ Transient congestion
❖ Burst loss
❖ Attack onset

Trumpet scales to thousands of triggers

Trumpet is DoS-resilient

Page 40: Detecting Transient Congestion

[Figure labels: Congestion; Large Flow (Reactive); 40 ms]

Trumpet can detect millisecond-scale congestion events

Page 41: Scalability

Trumpet can process❉ 14.8 Mpps
❖ 64-byte packets at 10G
❖ 650-byte packets at 4x10G

... while evaluating 16K triggers at 10 ms granularity

❉ Xeon E5-2650, 10-core 2.3 GHz, Intel 82599 10G NIC

Page 42: Performance Envelope

[Figure labels: Triggers matched by each flow; How often each predicate is checked]

Above this rate, Trumpet would miss events

Page 43: Performance Envelope

At moderate packet rates, can detect events at 1ms

Number of <trigger, flow> pairs increases statistics gathering overhead

Page 44: Performance Envelope

Need to profile and provision Trumpet deployment

Above 10ms, CPU can sustain full packet rate

Page 45: Conclusion

Future datacenters will need fast and precise eventing
▸ Trumpet is an expressive system for host-based eventing

Trumpet can process 16K triggers at full packet rate
▸ ... without delaying packets by more than 10 µs

Future work: scale to 40G NICs
▸ ... perhaps with NIC or switch support

https://github.com/USC-NSL/Trumpet

Page 46: A Big Discrepancy

Outage budget for five 9s availability

24 seconds per month

99.999% uptime

Long failure durations due to time to root-cause failures
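The outage budget is simple arithmetic on the availability target. The short sketch computes it for a few month lengths (the function name is illustrative); the result is about 24 seconds for a 28-day month and slightly more for longer months.

```python
def outage_budget_seconds(availability: float, days: float) -> float:
    """Allowed downtime, in seconds, over a period of the given length."""
    return (1.0 - availability) * days * 24 * 60 * 60


for days in (28, 30, 31):
    print(days, round(outage_budget_seconds(0.99999, days), 1))
```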

Page 47

Every optimization is necessary❉

❉ Details in the paper