Nerd Lunch

TRANSCRIPT

  • Slide 1

    RouteBricks

    Scaling Software Routers with Modern Servers

    Kevin Fall

    Intel Labs, Berkeley

    Feb 24, 2010, Ericsson, San Jose, CA

  • Slide 2

    Project Participants

    Intel Labs:
    Gianluca Iannaccone (co-PI, researcher)
    Sylvia Ratnasamy (co-PI, researcher)
    Kevin Fall (principal engineer)
    Allan Knies (principal engineer)
    Maziar Manesh (research engineer)
    Eddie Kohler (Click expert)
    Dan Dahle (tech strategy)
    Badarinath Kommandur (tech strategy)

    École Polytechnique Fédérale de Lausanne (EPFL), Switzerland:
    Katerina Argyraki (faculty)
    Mihai Dobrescu (student)
    Diaqing Chu (student)

  • Slide 3

    Outline

    Introduction

    Approach: cluster-based router

    RouteBricks implementation

    Performance results

    Next steps

  • Slide 4

    RouteBricks: in a nutshell

    A high-speed router using IA server components

    fully programmable: control and data plane

    extensible: evolve networks via software upgrade

    incrementally scalable: flat cost per bit

  • Slide 5

    Motivation

    Network infrastructure is doing more than ever before

    Packet-pushing (routing) is no longer the whole story: security, data loss protection, application optimization, etc.

    This has led to a proliferation of special appliances...

    ...and to notions that perhaps routers could do more: Cisco and Juniper supporting open APIs; the OpenFlow consortium: Stanford, HP, Broadcom, Cisco

    But these platforms weren't born programmable

  • Slide 6

    Motivation

    If flexibility ultimately implies programmability...

    Hard to beat IA platforms and their ecosystem

    Or price

    However, must deal with persistent folklore:

    "IA can't do high-speed packet processing"

    But today's IA isn't the IA you know from your youth:

    multicore, multiple integrated memory controllers, PCIe, multi-queue NICs, ...

  • Slide 7

    Motivation

    Combine a desire for more programmability...

    with new router-friendly server trends

    -> a new opportunity for IA servers?

    RouteBricks: how might we

    build a big (~1 Tbps) IA-based software router?

  • Slide 8

    Challenge

    traditional software routers

    research prototypes (2007): 1 - 2 Gbps

    Vyatta* datasheet (2009): 2 - 4 Gbps

    current carrier-grade routers

    line speeds: 10/40 Gbps; aggregate switching speeds: 40 Gbps to 92 Tbps!

    * Other names and brands may be claimed as properties of others

  • Slide 9

    Strategy

    1. A cluster-based router architecture

    each server need only scale to line speeds (10-40 Gbps), rather than aggregate speeds (40 Gbps to 92 Tbps)

    2. Understand whether modern server architectures can scale to line speeds (10-40 Gbps)

    if not, why?

    3. Leverage open-source control plane implementations

    xorp, quagga, etc. [but we focus on data plane here]

  • Slide 10

    Broader Benefits

    1. infrastructure that is well-known and cheaper to evolve

    familiar programming environment

    separately-evolvable network software and hardware

    reduced cost -> more frequent upgrade opportunity

    2. networks with the benefits of the PC ecosystem

    high-volume manufacturing

    widespread supply/support

    state-of-the-art process technologies (ride Moore's Law)

    evolving PC platform features (power mgmt, crypto, etc.)

  • Slide 11

    Outline

    Introduction

    Approach: cluster-based router

    RouteBricks implementation

    Performance results

    Next steps

  • Slide 12

    Traditional router architecture

    [Figure: router with N ports (#1, 2, 3, ... N), per-port speed R bps, R in each direction]

  • Slide 13

    Traditional router architecture

    [Figure: each linecard performs IP address lookup, queue management, and shaping, backed by address tables, FIB, and ACLs, and runs at R bps; the switch fabric and switch scheduler run at N*R; a control processor runs IOS/quagga/xorp, etc.]

  • Slide 14

    Moving to a cluster-router

    [Figure: the traditional router diagram from the previous slide: N ports at R bps, linecards, switch fabric, switch scheduler, control processor (runs IOS/quagga/xorp, etc.)]

    step 1: a single server implements one port; N ports -> N servers

  • Slide 15

    Moving to a cluster-router

    step 1: a single server implements one port; N ports -> N servers

    [Figure: each port server implements the linecard functions (IP address lookup, queue management, shaping; address tables, FIB, ACLs) in software; the switch fabric, switch scheduler, and control processor (runs IOS/quagga/xorp, etc.) remain]

    Each server must process at least 2R traffic (in + out)
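    (A quick aside on the 2R figure: a port server terminates R of inbound traffic and sources R of outbound traffic on its external port, so it must move at least R + R = 2R through its NICs, memory, and CPUs; for R = 10 Gbps that is 20 Gbps per server, before counting any traffic it relays on behalf of other servers.)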

  • Slide 16

    Moving to a cluster-router

    step 2: replace the switch fabric and scheduler with a distributed, software-based solution

    [Figure: N port servers still attached to the switch fabric, switch scheduler, and control processor (runs IOS/quagga/xorp, etc.)]

  • Slide 17

    Moving to a cluster-router

    step 2: replace the switch fabric and scheduler with a distributed, software-based solution

    [Figure: N port servers connected by a server-to-server interconnect topology; control processor (runs IOS/quagga/xorp, etc.)]

    distributed scheduling algorithms, based on Valiant Load Balancing (VLB)

  • Slide 18

    Example: VLB over a mesh* (*other topologies offer different tradeoffs)

    # servers: N
    internal fanout: N-1
    internal link capacity: R*N / [N(N-1)/2] = 2R/(N-1)
    processing/server [out+in+through]: 3R (2R)*

    N servers can achieve switching speeds of N*R bps, provided each server can process packets at 3R (*2R for the Direct-VLB average case)

    [Figure: N-server full mesh; N ports at R bps per port, R in each direction]
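    (A brief derivation of the numbers above, following the slide's accounting: the full mesh has N(N-1)/2 internal links and the router carries N*R of external traffic, so the per-link share is R*N / [N(N-1)/2] = 2R/(N-1); equivalently, VLB sends each packet over at most two internal hops, ingress -> intermediate and intermediate -> egress, so a total hop load of 2*N*R spread over N(N-1) directed links again gives 2R/(N-1). Per server, the work is R leaving its external port, R arriving on its external port, and up to R of relayed through-traffic as an intermediate, i.e. 3R; Direct-VLB skips the extra hop when the intermediate coincides with the ingress or egress, bringing the average down to about 2R.)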

  • Slide 19

    Outline

    Introduction

    Approach: cluster-based router

    RouteBricks implementation

    RB4 prototype

    Click overview

    Performance results

    Next steps

  • Slide 20

    RB4: hardware architecture

    4 dual-socket NHM-EP (Nehalem) servers with 10 Gbps ports

    8x 2.8 GHz cores (no SMT)
    8 MB L3 cache
    6x 1 GB DDR3
    2 PCIe 2.0 slots (8 lanes)
    default BIOS settings

    2x 10 Gbps Oplin cards per server (dual port, PCIe 1.1)
    (now using Niantic / PCIe 2.0)
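    (Back-of-the-envelope, not stated on the slide: two dual-port 10 Gbps cards give each server 4 x 10 Gbps = 40 Gbps of raw NIC capacity, which is what makes the 20-40 Gbps per-server targets discussed later plausible at the I/O level.)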

  • Slide 21

    RB4: software architecture

    [Figure: per-server software stack, 10 Gbps ports. Linux 2.6.24 with a kernel Click runtime hosting the RB data plane, RB VLB, and the packet processing (linecard) functions, all implemented in Click; the RB device driver sits below, the NICs below that; user space is left unmodified. Hooks for new services mark the place for value-added services (e.g., monitoring, energy proxy, management, etc.)]

  • Slide 22

    Click Overview

    Modular, extensible software router

    built on Linux as kernel module

    combines versatility and high performance

    Architecture consists of elements that implement packet processing functions

    a configuration language that connects elements into a packet data flow

    an internal scheduler that decides which element to run

    Large open-source library (200+ elements) means new routing applications can often be written with just a configuration script

    slide material courtesy of E. Kohler, UCLA
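    To make the "configuration script" idea concrete, here is a minimal sketch of a Click configuration (not from the talk; the interface names and queue size are illustrative): a few standard elements connected by '->' form a complete packet data flow.

        // Minimal Click configuration (sketch): bridge packets from eth0 to eth1.
        // FromDevice pushes received packets downstream; Queue buffers them;
        // ToDevice pulls from the Queue and transmits.
        FromDevice(eth0) -> Counter -> Queue(1024) -> ToDevice(eth1);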

  • Slide 23

    RB4: software architecture

    [Figure: same per-server software stack as before (Linux 2.6.24, kernel Click runtime, RB device driver, NICs, unmodified user space, hooks for value-added services)]

    Intel 10G driver:

    polling-only operation (no interrupts)

    transfers packets to memory in batches of k (we use k = 16)

    RSS with up to 32/64 rx/tx NIC queues

  • Slide 24

    Outline

    Introduction

    Approach: cluster-based router

    RouteBricks implementation

    RB4 prototype

    Click overview

    Performance results

    cluster scalability

    single server scalability

    Next steps

  • Slide 25

    Cluster Scalability

    recall: VLB over a mesh

    # servers: N
    internal fanout: N-1
    internal link capacity: 2R/(N-1)
    processing/server: 3R (2R)

    [Figure: N-server full mesh; N ports at R bps per port]

  • Slide 26

    Cluster Scalability

    [Plot: cost in # of servers (y axis, log scale, 1 to 10,000) vs. number of ports (x axis, log scale, 1 to 10,000), with reference line y = x]

    Assumes 10 Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)

  • Slide 27

    Cluster Scalability

    [Plot: same axes; curve for one server scaling to 20 Gbps with typical fanout]

    Assumes 10 Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)

  • Slide 28

    Cluster Scalability

    [Plot: same axes; curves for one server scaling to 20 Gbps with typical fanout, and 20 Gbps with higher fanout]

    Assumes 10 Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)

  • Slide 29

    Cluster Scalability

    [Plot: same axes; curves for one server scaling to 20 Gbps with typical fanout, 20 Gbps with higher fanout, and a server scaling to 40 Gbps plus higher fanout]

    Assumes 10 Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)

  • Slide 30

    Cluster Scalability

    [Plot: same axes and curves as before: 20 Gbps with typical fanout, 20 Gbps with higher fanout, 40 Gbps plus higher fanout]

    Conclusions so far:

    (1) a VLB-based server cluster scales well and is cost-effective

    (2) feasible if a single server can scale to at least 20 Gbps (2R)

    Assumes 10 Gbps ports; typical server fanout = 5 PCIe slots (2x10G or 8x1G ports/slot)
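    (To spell out the 20 Gbps threshold: with R = 10 Gbps ports and the Direct-VLB average per-server load of about 2R from the mesh example, each server must sustain roughly 2 x 10 Gbps = 20 Gbps of packet processing; the 3R worst case would be 30 Gbps.)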

  • Slide 31

    Outline

    Introduction

    Approach: cluster-based router

    RouteBricks implementation

    RB4 prototype

    Click overview

    Performance results

    cluster scalability

    single server scalability

    Next steps

  • Slide 32

    RB4: software architecture

    [Figure: same per-server software stack as before (Linux 2.6.24, kernel Click runtime with RB VLB and the RB data plane, RB device driver, NICs, unmodified user space, hooks for value-added services)]

    Tested 3 packet processing functions (so far):

    1. simple forwarding (fwd)

    2. IPv4 forwarding (rtr)

    3. AES-128 encryption (ipsec)

  • Slide 33

    Test Configuration

    packet processing functions:

    simple forwarding (no header processing; ~ bridging)

    IPv4 routing (longest-prefix destination lookup, 256K-entry routing table)

    AES-128 packet encryption

    test traffic:

    fixed-size packets (64B-1024B)

    abilene: real-world packet trace from the Abilene/Internet2 backbone

    [Figure: traffic generation server -> test server (Click runtime, RB device driver, packet processing, NICs) -> traffic sink]

  • Slide 34

    Performance versus packet size

    Performance for simple forwarding under different input traffic workloads; results in bits-per-second (top) and packets-per-second (bottom).

    In all our tests, the real-world Abilene and 1024B packet workloads achieve similar performance; hence, from here on, we only consider two extreme traffic workloads: 64B and 1024B packets.

  • Slide 35

    Performance with different packet processing functions (64B, 1KB pkts)

    Simple forwarding and IPv4 forwarding for (realistic) traffic workloads with larger packets achieve ~25 Gbps; limited by traffic generation due to the number of PCIe slots.

    Encryption is CPU limited.

    [Figure panels: Simple Forwarding, IPv4 Forwarding, Encrypted Forwarding]

  • Slide 36

    Memory Loading

    64B workload, NHM

    "nom" and "benchmark" represent upper bounds on available memory bandwidth, normalized by packet rate to compare with actual apps. "nom" is based on nominal rated capacity; "benchmark" refers to empirically observed load using a stream-like read/write random access workload.

    All applications are well below the estimated upper bounds. Per-packet memory load is constant as a function of packet rate.

    [Plot: per-packet memory load vs. packet rate (Mpps)]

  • Slide 37

    QuickPath (inter-socket) Loading

    64B workload, NHM

    "benchmark" refers to the maximum load on the inter-socket QuickPath link with a stream-like workload.

    All applications are well below the estimated upper bound. Per-packet inter-socket load is constant versus packet rate.

  • Slide 38

    QuickPath (I/O) Loading

    64B workload, NHM

    "benchmark" refers to the maximum load on the I/O QuickPath link we have been able to generate with a NIC.

    All applications are well below the estimated upper bound. Per-packet I/O load is constant versus packet rate.

  • Slide 39

    Per-packet load on CPU

    64B workload, NHM

    application          instr/pkt (CPI)
    simple forwarding    1,033  (1.19)
    IPv4 forwarding      1,595  (1.01)
    encryption           14,221 (0.55)

    All applications reach the CPU cycles upper bound (CPU saturation). CPU load is (fairly) constant as a function of packet rate.
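    (Rough arithmetic implied by this table and the RB4 hardware: simple forwarding costs about 1,033 instructions x 1.19 CPI = ~1,230 cycles per packet, so the 8 x 2.8 GHz = 22.4 G cycles/s available per server bound it at roughly 18 Mpps; IPv4 forwarding at ~1,600 cycles/packet lands near 14 Mpps, and AES-128 encryption at ~7,800 cycles/packet near 2.9 Mpps. These are CPU-only upper bounds, consistent with the CPU being the bottleneck for 64B packets.)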

  • Slide 40

    Single server scalability

    Key results

    (1) NHM server performance is sufficient to enable VLB clustering for realistic input traffic

    (2) falls short for worst-case traffic

    (3) CPUs are the bottleneck for 64B packet workloads

    (4) scaling: constant per-packet load with increasing packet rate

  • Slide 41

    Outline

    Introduction

    Approach: cluster-based router

    RouteBricks implementation

    RB4 prototype

    Click overview

    Performance results

    cluster scalability

    single server scalability

    Next steps

  • Slide 42

    Next Steps

    RB prototype

    control plane

    additional packet processing functions

    new hardware when available

    management interface

    reliability / robustness improvements

    power

    packaging

  • Slide 43

    Thanks

    http://routebricks.org

    Also: see the paper in SOSP 2009

  • Slide 44

    Backups

  • Slide 45

    Click on multicore

    Each core (or HW thread) runs one instance of Click

    instance is statically scheduled and pinned to the core

    best performance when one core handles the entire dataflow of a packet

    Click runs an internal scheduler to decide which element to run
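    As a sketch of what static pinning can look like in a Click configuration (an illustration, not the RB4 configuration; it assumes Click's multithreaded build and its StaticThreadSched element, and the interface names are placeholders):

        // Two independent forwarding paths; StaticThreadSched pins the
        // FromDevice tasks (and the push work they drive) to threads 0 and 1.
        in0 :: FromDevice(eth0) -> Queue(1024) -> ToDevice(eth1);
        in1 :: FromDevice(eth2) -> Queue(1024) -> ToDevice(eth3);
        StaticThreadSched(in0 0, in1 1);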


  • Slide 47

    4-port VLB mesh, 10Gbps ports

    [Figure: 4-port mesh with 10 Gbps external ports and 5 Gbps internal links. Each server has internal fanout = 3; each server runs at avg. 20 Gbps.]

    [Figure: 8-port mesh with 10 Gbps external ports and 2.5 Gbps internal links. Each server has internal fanout = 7; each server runs at avg. 20 Gbps.]

  • Slide 48

    8-port VLB mesh, server @ 20 Gbps vs. 40 Gbps

    [Figure: 10 Gbps external ports, 2.5 Gbps internal links. Each server has internal fanout = 7; each server runs at avg. 20 Gbps.]

    [Figure: 10 Gbps external ports, 5 Gbps internal links. Each server has internal fanout = 3; each server runs at avg. 40 Gbps.]

  • Slide 49

    8-port VLB mesh, server @ 20 Gbps: towards 1000 ports

    [Figure: 10 Gbps external ports, 2.5 Gbps internal links. Each server has internal fanout = 7; each server runs at avg. 20 Gbps.]

    And each server has a maximum internal fanout of 32 (1 Gbps ports)

  • Slide 50

    8-port VLB mesh, server @ 20 Gbps: towards 1000 ports

    1000 servers, each with a 10 Gbps external port, plus (log_32(1000) - 1) * 1000 servers, interconnected by a 32-ary 1000-fly topology (total 2000 servers)

    Each server has fanout = 32

    Each internal link runs at 0.625 Gbps (= 2 * 10 / 32)
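    (Checking the slide's arithmetic: log_32(1000) is just under 2, since 32^2 = 1024, so (log_32(1000) - 1) * 1000 is roughly another 1000 interconnect servers, for about 2000 servers in total; and with VLB's two internal traversals of the 10 Gbps per-port rate spread across a fanout of 32, each internal link carries 2 * 10 / 32 = 0.625 Gbps.)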

  • Slide 51

    More generally

    Different topologies offer tradeoffs between:

    per-server forwarding capability

    per-server fanout (# slots/server, ports/slot)

    number of servers required

    The first two are inputs (for us); the number of servers required dominates router cost.