A REAL-TIME PACKET SCAN ARCHITECTURE Tim Sherwood UC Santa Barbara

Upload: jasmine-shepherd

Post on 27-Mar-2015


Page 1:

A REAL-TIME PACKET SCAN ARCHITECTURE

Tim Sherwood, UC Santa Barbara

Page 2:

Big Questions

• Can my system be optimized further?
  – If so, then how and when?
  – How much benefit can I expect?
  – Have I seen this behavior before?

• Is my system working “correctly”?
  – Soft errors, backdoors, hardware bugs

• Am I under attack?
  – If so, then by whom?
  – Am I witness to an attack?

Online Monitors

Page 3:

To Protect and Serve

• Our machines are constantly under attack.

• We cannot rely on end users; we need networks that actively defend themselves.

This requires the protection system to operate at 10 to 40 Gb/s. (We aim at current and next-generation networks.)

• IDS/IPS are promising ways of providing protection.
• Market for such systems: $918.9 million by the end of 2007.
• Snort: a widely accepted open source IDS.

Page 4:

The Problem

Our computing infrastructure is fast:
• Processors → ~10^9 instructions/second
• Network routers → ~10^9 bytes/second

It is beyond our ability to monitor naively:
• Full traces are near impossible to gather
• Sampling may miss important data
• Intrusive monitoring will change the data

New Architectures are Required

Page 5:

Why a new Computer Architecture

[Figure labels: Latency; Common Case]

• Design for the worst-case stream
  – Network vendors ship by wire rate
  – Denial of service and reliability
  – Caches are no help

• Throughput is critical
  – 40 Gigabit link = packet out every 8 ns
  – Each packet needs multiple memory references
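The 8 ns figure can be reproduced with a short calculation. The sketch below assumes back-to-back minimum-size packets of 40 bytes; the slide does not state the packet size, so `PACKET_BYTES` is an assumption, and the function name is made up for illustration.

```python
LINK_RATE_BPS = 40e9   # 40 Gb/s link, from the slide
PACKET_BYTES = 40      # assumption: minimum-size 40-byte packets, not stated on the slide


def packet_interval_ns(link_rate_bps, packet_bytes):
    """Time between back-to-back packets on a saturated link, in nanoseconds."""
    return packet_bytes * 8 / link_rate_bps * 1e9


interval = packet_interval_ns(LINK_RATE_BPS, PACKET_BYTES)
print(f"{interval:.1f} ns per packet")
```

Under this assumption every memory reference a packet needs has to fit inside that interval, which is why worst-case throughput rather than average latency drives the design.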

Page 6:

Packet Scan Architecture

• High Performance Packet Scan Architecture
  – Underlying primitives to support high-throughput monitors
  – Algorithm-architecture co-design

• Example primitive: String Matching
  – 0.4MB and 10Gbps for the Snort rule set (>10,000 characters)

• Bit-Split String Matching Algorithm
  – Reduces out edges from 256 to 2
  – Formal language: correctness and efficiency

• Memory Tile Based Design
  – Memory throughput is the key
  – Data is distributed over tiles with bounded contention

• Performance/area beats the best techniques we examined by a factor of 10 or more.

Page 7:

Packet Scan Architecture

• String Matching (examine packet content)

• Bit-Split String Matching Algorithm

• A Memory Tile Based Architecture

• Building a Real System

• Is it really correct?

• Future Work

Page 8:

Scanning for Intrusions

Most IDS define a set of rules. A string defines a suspicious transmission.

We are not building a full IDS; rather, we are building the primitives from which full systems can be built.

CodeRed worm: web flow established, uricontent with “/root.exe”

[Figure labels: Traffic In, Traffic Out, Scan, Software IDS]

Page 9:

Multiple String Matching

• The multiple string matching algorithm:
  – Input: a set of strings/patterns S, and a buffer b
  – Output: every occurrence of an element of S in b
  – Extra constraint: b is really a stream

• How to implement:
  – Option 1) search for each string independently
  – Option 2) combine strings together and search all at once

A string can be anywhere in the payload of a packet.

Strings: AB, CA
Input: A B D F C A B
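Option 1 can be sketched in a few lines to make the problem statement concrete. This is an illustrative baseline only, not the talk's method: the function name is hypothetical, and reading the slide's example strings as AB and CA is an assumption about the garbled figure.

```python
def find_all_naive(patterns, buffer):
    """Option 1: scan the buffer once per pattern; return sorted (start, pattern) pairs."""
    hits = []
    for p in patterns:
        start = buffer.find(p)
        while start != -1:
            hits.append((start, p))
            start = buffer.find(p, start + 1)   # keep going to catch every occurrence
    return sorted(hits)


hits = find_all_naive(["AB", "CA"], "ABDFCAB")
print(hits)
```

The cost grows with the number of patterns because the buffer is rescanned for each one, which is the motivation for Option 2.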

Page 10:

Why hardware

• Snort: >1,000 rules, growing at 1 rule/day or more
• Active research into automated rule building
• Strings are not limited to just [a-z]+

• We need a high-speed string matching technique with stringent worst-case performance.

• Many algorithms are targeted at average-case performance. Aho-Corasick can scan once and output all matches, but it is too big to be on-chip.

Page 11:

The Aho-Corasick Algorithm

• Given a finite set P of patterns, build a deterministic finite automaton G accepting the set of all patterns in P.

Page 12:

The Aho-Corasick Algorithm

• An Aho/Corasick String Matching Automaton for a given finite set P of patterns is a (deterministic) finite automaton G accepting the set of all words containing a word of P as a suffix. G consists of the following components:

• finite set Q of states
• finite alphabet A
• transition function g: Q × A → Q + {fail}
• failure function h: Q → Q + {fail}
• initial state q0 in Q
• a set F of final states

Page 13:

On String Matching and Languages

• This should not be any big surprise
  – P is a finite language (FL)
  – FL ⊂ RL (regular languages)
  – An RL can be recognized by a regular expression (RE)
  – An RE can be simulated with an NFA
  – An NFA can be simulated with a DFA

• This last step is the problem
  – Aho and Corasick show that for an FL there is no exponential blow up in state

Page 14:

An AC Automaton Example

• Example: P = {he, she, his, hers}

[Figure: AC automaton for P with states 0 to 9 and edges labeled h, e, r, s, i. Legend: initial state, accepting state, state, transition function. Edges pointing back to State 0 are not shown.]

• The construction: linear time.
• The search of all patterns in P: linear time.
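The automaton for P = {he, she, his, hers} can be built and run with the textbook goto/failure/output construction. The sketch below is a minimal illustration, not the talk's implementation; the function names are hypothetical, and states are numbered in insertion order, which gives the 10 states 0 to 9 that the slide uses.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto = [{}]      # goto[state][char] -> next state
    out = [set()]    # out[state] -> patterns recognized on entering the state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] |= out[fail[s]]           # inherit patterns that end here as suffixes
    return goto, fail, out


def search(patterns, text):
    """Scan text once; return (end_index, pattern) for every occurrence."""
    goto, fail, out = build_ac(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits.extend((i, p) for p in sorted(out[s]))
    return hits


hits = search(["he", "she", "his", "hers"], "hxhers")
print(hits)
```

Each input character is consumed once and the failure links only ever move toward the root, which is where the linear search time comes from.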

Page 15:

Matching on the example

[Figure: the AC automaton for P = {he, she, his, hers}, states 0 to 9.]

Input stream: h x h e r s

Only scan the input stream once.

Page 16:

Linear Time: So what’s the problem

[Figure: state table with rows 0, 1, 2, ..., 16,384 and columns 0, 1, 2, 3, ..., 255; each row holds 256 next-state pointers of 14 bits each.]

• How to implement it on chip?

• Problem: size too big to be on-chip
  – ~10,000 nodes
  – 256 out edges per node
  – Requires 16,384 × 256 × 14 bits ≈ 7.3 MB (quoted as ~10MB)

• Solution: partition into small state machines
  – Fewer strings per machine
  – Fewer out edges per machine
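The sizing argument can be checked with a few lines of arithmetic. The sketch below takes the three factors from the slide; the constant and function names are made up for illustration. The raw product comes to 7 MiB, the same order of magnitude as the roughly 10 MB quoted.

```python
STATES = 16_384      # 2**14 table rows, enough for ~10,000 nodes
OUT_EDGES = 256      # one next-state pointer per possible byte value
POINTER_BITS = 14    # bits needed to name one of 16,384 states


def table_bits(states, out_edges, pointer_bits):
    """Total storage for a flat next-state table, in bits."""
    return states * out_edges * pointer_bits


full_bits = table_bits(STATES, OUT_EDGES, POINTER_BITS)
print(full_bits, "bits =", full_bits / 8 / 2**20, "MiB")
```

Both remedies on the slide attack a factor of this product: fewer strings per machine shrinks the state count and pointer width, and fewer out edges shrinks the row width.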

Page 17:

Packet Scan Architecture

• String Matching

• Bit-Split String Matching Algorithm (many tiny FSMs working together)

• A Memory Tile Based Architecture

• Building a Real System

• Is it really correct?

• Future Work

Page 18:

An example

P0 = { he, she, his, hers }

Page 19:

An example

P0 = { he, she, his, hers }

check for agreement

Page 20:

An example of Bit-Split

P0 = { he, she, his, hers }

[Figure: the AC automaton for P0 (states 0 to 9) beside the bit-split machine B03, whose edges are labeled 0 or 1 and whose states are sets of AC states: b0 {0}, b1 {0,1}, b2 {0,3}, b3 {0,1,2,6}, b4 {0,1,4}, b5 {0,3,7,8}, b6 {0,1,2,5,6}, b7 {0,3,9}. Bit patterns shown include 0110 1000 and 0111 0011. Edges pointing back to State 0 are not shown.]

Page 21:

Compact State Set

P0 = { he, she, his, hers }

[Figure: the AC automaton for P0 beside B03 with each state reduced to the accepting AC states it contains: b0 {}, b1 {}, b2 {}, b3 {2}, b4 {}, b5 {7}, b6 {2,5}, b7 {9}. Edges are labeled 0 or 1. Edges pointing back to State 0 are not shown.]

Page 22:

An example of Bit-Split

P0 = { he, she, his, hers }

[Figure: the AC automaton for P0 with two of its bit-split machines, edges labeled 0 or 1.
B03: b0 {}, b1 {}, b2 {}, b3 {2}, b4 {}, b5 {7}, b6 {2,5}, b7 {9}.
B04: b0 {}, b1 {}, b2 {}, b3 {}, b4 {2}, b5 {}, b6 {2,5}, b7 {}, b8 {2,7}, b9 {9}.
Edges pointing back to State 0 are not shown.]

Page 23:

Nice Properties

• The number of states in Bij is rigorously bounded by the number of states in Pi

• No exponential blow up in state

• Linear construction time

• Possible to traverse multiple edges at a time to multiply throughput

Page 24:

Matching on the example

[Figure: the AC automaton for P0 with bit-split machines B03 and B04 (compact state sets).]

Input stream: h x h e
Bits fed to B03: 0 1 0 0
Bits fed to B04: 1 1 1 0

How do you “combine” the results from the different state machines? Only if all the state machines agree is there actually a match.

Agreed match: state 2
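The two bit sequences on this slide, 0 1 0 0 and 1 1 1 0, are what the input h x h e looks like when only one bit of each character is kept. The sketch below reproduces them under the assumption that bit 0 is the most significant bit of the 8-bit ASCII code, a numbering that is consistent with the slide's values but not stated on it; the function name is hypothetical.

```python
def bit_stream(text, k):
    """Bit k of each character's 8-bit code, with k = 0 as the most significant bit."""
    return [(ord(ch) >> (7 - k)) & 1 for ch in text]


b3 = bit_stream("hxhe", 3)   # what machine B03 would see
b4 = bit_stream("hxhe", 4)   # what machine B04 would see
print(b3, b4)
```

Each tiny machine therefore sees a binary stream, which is why every one of its states needs only two out edges.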

Page 25:

Packet Scan Architecture

• String Matching

• Bit-Split String Matching Algorithm

• A Memory Tile Based Architecture (SRAM tiles implement FSM)

• Building a Real System

• Is it really correct?

• Future Work

Page 26:

Our Main Idea: Bit-Split

• Partition rules (P) into smaller sets (P0 to Pn)

• Build an AC state machine for each subset

• For each DFA Pi, rip the state machine apart into 8 tiny state machines (Bi0 through Bi7)

• Each tiny machine searches for 1 bit in the 8-bit encoding of an input character
  – Only if all the different B machines agree can there actually be a match

Page 27:

How to Implement

• The AC state machine is equivalent to the 8 tiny state machines.

• The 8 tiny state machines can run independently, which means in parallel

• Intersection done with bit-wise AND.

• 8 is intuitive but not optimal

• How to build a system to implement this algorithm? Our algorithm makes it feasible to be on-chip
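The split into 8 binary machines and the AND of their results can be sketched in software. This is an illustrative reconstruction under stated assumptions, not the authors' code: it builds a full Aho-Corasick DFA, derives each bit machine by subset construction over the characters whose bit k has a given value, and takes bit 0 to be the most significant bit. All function names are hypothetical.

```python
from collections import deque


def build_dfa(patterns):
    """Aho-Corasick as a full DFA: delta[state][byte] -> state, out[state] -> pattern ids."""
    goto, out = [{}], [set()]
    for pid, p in enumerate(patterns):
        s = 0
        for c in p.encode():
            if c not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(pid)
    fail = [0] * len(goto)
    delta = [[0] * 256 for _ in goto]
    for c, s in goto[0].items():
        delta[0][c] = s
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        out[r] |= out[fail[r]]                   # inherit suffix matches
        for c in range(256):
            if c in goto[r]:
                s = goto[r][c]
                fail[s] = delta[fail[r]][c]
                delta[r][c] = s
                queue.append(s)
            else:
                delta[r][c] = delta[fail[r]][c]  # resolve the failure path up front
    return delta, out


def bit_split(delta, out, k):
    """Binary machine for bit k: each state is a set of DFA states, each row has 2 edges."""
    start = frozenset([0])
    index = {start: 0}
    trans, pmv = [], []
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        row = []
        for v in (0, 1):
            nxt = frozenset(delta[s][c] for s in cur
                            for c in range(256) if ((c >> (7 - k)) & 1) == v)
            if nxt not in index:
                index[nxt] = len(index)
                queue.append(nxt)
            row.append(index[nxt])
        trans.append(row)
        pmv.append(set().union(*(out[s] for s in cur)))  # partial match vector as a set
    return trans, pmv


def bit_split_search(patterns, text):
    """Run the 8 bit machines side by side; report patterns on which all of them agree."""
    delta, out = build_dfa(patterns)
    machines = [bit_split(delta, out, k) for k in range(8)]
    states = [0] * 8
    hits = []
    for i, c in enumerate(text.encode()):
        agreed = set(range(len(patterns)))
        for k, (trans, pmv) in enumerate(machines):
            states[k] = trans[states[k]][(c >> (7 - k)) & 1]
            agreed &= pmv[states[k]]             # intersection, bit-wise AND in hardware
        hits.extend((i, patterns[pid]) for pid in sorted(agreed))
    return hits


hits = bit_split_search(["he", "she", "his", "hers"], "ushers")
print(hits)
```

The set intersection plays the role of the bit-wise AND: a pattern is reported only when every one of the 8 machines is in a state whose partial match vector includes it.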

Page 28:

A Hardware Implementation

• A rule module is equivalent to an AC state machine
• Rule modules, tiles are structurally equivalent
• All full match vectors are concatenated to indicate which strings are matched
• One tile stores one tiny bit-split state machine

[Figure: String Match Engine. Each byte from the payload (8 bits) is sent to Rule Modules 0 through N, and their outputs form the complete set of matches for all rules. Inside a rule module, Tiles 0 to 3 each take a 2-bit input (bits [0:1], [2:3], [4:5], [6:7]; 2 bits from each byte) and each produce a 16-bit partial match vector, combined into a 16-bit full match vector. Inside a state machine tile, a decoder on the 8-bit current state selects one of 256 rows (0 to 255), each holding 4 next-state pointers of 8 bits and a 16-bit partial match vector; a 4:1 mux driven by the input picks the next state. Other labels: config data, output latch, control block.]

Page 29:

An efficient Implementation

State tables, one per tile, in transcript order. Columns 00, 01, 10, 11 give the next state for each 2-bit input; PMV is the partial match vector. Rows without entries are unused.

Table 1
State  00  01  10  11  PMV
0       0   1   0   0  0000
1       0   2   0   0  0000
2       0   3   0   0  1000
3       0   4   0   0  1110
4       0   4   0   0  1111
5 to 9  (unused)

Table 2
State  00  01  10  11  PMV
0       1   0   2   0  0000
1       1   0   3   0  0000
2       1   0   5   0  0000
3       1   6   5   0  0000
4       7   0   2   0  1000
5       0   4   5   0  0000
6       7   0   2   0  1100
7       9   0   3   0  0000
8       1   0   3   0  0010
9       1   0   3   0  0001

Table 3
State  00  01  10  11  PMV
0       1   0   0   2  0000
1       1   3   0   2  0000
2       4   0   0   2  0000
3       1   0   5   6  1000
4       1   7   0   2  0000
5       1   0   0   8  0000
6       4   0   0   2  0010
7       1   0   5   6  1100
8       4   0   0   2  0001
9       (unused)

Table 4
State  00  01  10  11  PMV
0       0   0   1   2  0000
1       0   0   3   2  0000
2       0   0   4   2  0000
3       0   0   3   5  1000
4       0   0   6   2  0000
5       0   0   4   7  0010
6       0   0   3   5  1100
7       0   0   4   2  0001
8 to 9  (unused)

Input, 2 bits per tile (columns labeled Tile 0 to Tile 3):
Cycle  Char  Bit pairs
0      h     01 10 10 00
1      x     01 11 10 00
2      h     01 10 10 00
3      e     01 10 01 01

Partial match vectors read out per cycle (one row per tile, transcript order):
h     x     h     e
0000  0000  0000  1100
0000  1000  1110  1111
0000  0000  0000  1000
0000  0000  0000  1000

Full match vector per cycle:
Cycle 0: 0000
Cycle 1: 0000
Cycle 2: 0000
Cycle 3: 1000
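The trace on this slide can be replayed in software from the four state tables. The sketch below copies the table rows as they appear in the transcript. Which table serves which bit pair is not recoverable from the garbled tile labels, so the ordering used here is an inference: it is the assignment under which each table reproduces one of the slide's partial match vector traces for the input h x h e. The leftmost PMV bit is assumed to stand for "he", followed by "she", "his" and "hers".

```python
# Rows are (next on 00, next on 01, next on 10, next on 11, PMV), copied from the slide.
# Tables are listed in bit-pair order, most significant pair first (inferred ordering).
TABLES = [
    [  # bit pair 0: first table in the transcript
        (0, 1, 0, 0, 0b0000), (0, 2, 0, 0, 0b0000), (0, 3, 0, 0, 0b1000),
        (0, 4, 0, 0, 0b1110), (0, 4, 0, 0, 0b1111),
    ],
    [  # bit pair 1: fourth table in the transcript
        (0, 0, 1, 2, 0b0000), (0, 0, 3, 2, 0b0000), (0, 0, 4, 2, 0b0000),
        (0, 0, 3, 5, 0b1000), (0, 0, 6, 2, 0b0000), (0, 0, 4, 7, 0b0010),
        (0, 0, 3, 5, 0b1100), (0, 0, 4, 2, 0b0001),
    ],
    [  # bit pair 2: second table in the transcript
        (1, 0, 2, 0, 0b0000), (1, 0, 3, 0, 0b0000), (1, 0, 5, 0, 0b0000),
        (1, 6, 5, 0, 0b0000), (7, 0, 2, 0, 0b1000), (0, 4, 5, 0, 0b0000),
        (7, 0, 2, 0, 0b1100), (9, 0, 3, 0, 0b0000), (1, 0, 3, 0, 0b0010),
        (1, 0, 3, 0, 0b0001),
    ],
    [  # bit pair 3: third table in the transcript
        (1, 0, 0, 2, 0b0000), (1, 3, 0, 2, 0b0000), (4, 0, 0, 2, 0b0000),
        (1, 0, 5, 6, 0b1000), (1, 7, 0, 2, 0b0000), (1, 0, 0, 8, 0b0000),
        (4, 0, 0, 2, 0b0010), (1, 0, 5, 6, 0b1100), (4, 0, 0, 2, 0b0001),
    ],
]
PATTERNS = ["he", "she", "his", "hers"]   # assumed PMV bit order, leftmost bit first


def scan(text):
    """Return the full match vector after each character of text."""
    states = [0] * len(TABLES)
    vectors = []
    for ch in text:
        full = 0b1111
        for t, table in enumerate(TABLES):
            pair = (ord(ch) >> (6 - 2 * t)) & 0b11   # the two bits of the byte for tile t
            states[t] = table[states[t]][pair]
            full &= table[states[t]][4]              # AND in this tile's partial match vector
        vectors.append(full)
    return vectors


def decode(vector):
    """Translate a full match vector into pattern names."""
    return [p for j, p in enumerate(PATTERNS) if (vector >> (3 - j)) & 1]


for ch, v in zip("hxhe", scan("hxhe")):
    print(ch, format(v, "04b"), decode(v))
```

Each cycle costs one memory read per tile plus a 4-way AND, so the work per input byte is fixed regardless of how many patterns the tables encode.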


Page 31:

Performance of Hardware

Page 32:

Performance of Hardware

Key Metric: Throughput*Character/Area

Page 33:

Packet Scan Architecture

• String Matching

• Bit-Split String Matching Algorithm

• A Memory Tile Based Architecture

• Building a Real System (integration and interfaces, FPGA)

• Is it really correct?

• Future Work

Page 34:

Prototype Design

[Figure: prototype block diagram. An Avalon Bus (50MHz, 12Gbps) connects a Microprocessor (control/update), a 100Mbps Ethernet Interface (promiscuous), DMA, and the String Match Engine (~1Gbps); Device Drivers/Application Layer sit on top.]

String Match Engine interface:
• Bus side (connect to bus): clk, reset, cs, address, writedata
• Reg Interface outputs: byte_in, data_enable, data_low, data_high, address, we, rst
• SM Core inputs: byte_in, data_enable, data, address, we, rst, clk
• SM Core output: vector out

Page 35:

Interface With Avalon Bus

[Figure: Reg Interface and SM Core signal diagram. Bus side (connect to bus): clk, reset, cs, address, writedata. Reg Interface outputs: byte_in, data_enable, data_low, data_high, address, we, rst. SM Core inputs: byte_in, data_enable, data, address, we, rst, clk. SM Core output: vector out.]

sme_write_tile(Base_add, 0, 1, 0, 0x0001, 0x00000000)
  Arguments after Base_add, in order: module number, tile number, address, upper data, lower data.
  This function is for initializing the memory in the string match engines.

sme_send_byte(Base_add, byte_from_packet)
  This function is for sending actual data to the string match engine.

Page 36:

Packet Scan Architecture

• String Matching

• Bit-Split String Matching Algorithm

• A Memory Tile Based Architecture

• Building a Real System

• Is it really correct? (Proofs: yes)

• Future Work

Page 37:

A Formalization

Page 38:

Splits DFA as an NFA

Page 39:

Correctness stems from RL subset

The above property is sufficient; is it necessary?

Exploiting fixed wildcards is possible; what about more general patterns?

Page 40:

Packet Scan Architecture

• String Matching

• Bit-Split String Matching Algorithm

• A Memory Tile Based Architecture

• Building a Real System

• Is it really correct?

• Future Work (extensions and applications)

Page 41:

Primitives for Security

• Packet Address List Lookup

• Packet Address Range Query

• Packet Classification

• String Finding

• Regular Expression Finding

• Stateful Flow Monitors

• Packet Ordering

Page 42:

Related Work

• Software based
  – Good for ~100Mb/s, common case

• FPGA-based
  – Many schemes map rules down to a specialized circuit
  – Near optimal utilization of hardware resources
  – Implementing state machines on block-RAMs [Cho and Mangione-Smith]
  – Concurrent to our work: mapping state machines to on-chip SRAM [Aldwairi et al.]
  – Bloom filters [Dharmapurikar et al.]: excellent filter in the common case

• TCAM-based
  – Require all patterns to be shorter than or equal to the TCAM width
  – Cutting long patterns: 2Gbps with 295KB TCAM [Yu et al.]

Page 43:

Conclusions

• New Tile-based Architecture
  – 0.4MB and 10Gbps for the Snort rule set (>10,000 characters)
  – Possible to be used for other applications, e.g. IP lookups, packet classification

• New Bit-split Algorithm
  – General purpose enough for many other applications, e.g. spam detection, peephole optimization, IP lookups, packet classification, etc.
  – Feasible to implement on other tile-based architectures

Page 44:

Thanks

• Lin Tan

• Brett Brotherton

• Prof. Ryan Kastner

• Prof. Ömer Egecioglu

• Shreyas Prasad, Shashi Mysore, Bita Mazloom, Ted Huffmire, Banit Agrawal

Page 45:

All done.