PacketShader as a Future Internet Platform
AsiaFI Summer School
2011.8.11.
Sue Moon
in collaboration with:
Joongi Kim, Seonggu Huh, Sangjin Han, Keon Jang, KyoungSoo Park ‡
Advanced Networking Lab, CS, KAIST ‡ Networked and Distributed Computing Systems Lab, EE, KAIST
Network Systems
Routers are the last bastion of niche supercomputers.
Cisco CRS-1 4-slot 320 Gbps chassis = $15K on eBay
Cisco's CRS-3 Multishelf System = 322 Tbps = $15K × 10^3 + α
Cost = special-purpose ASIC chips + TCAMs + S/W suites
Why? People want Huge Freaking Routers!
Why do people go for large routers?
Management overhead
Routing protocol scalability
Hurdles for innovation:
Special-purpose ASICs, network processors
Proprietary S/W (Cisco's IOS, Juniper's JunOS, Huawei's ???)
Cost
Recent Trends
New functions keep being added:
Firewalls / IDS / IPS
Monitoring and measurement
VPN gears
VoIP
Traffic engineering
Switches with minimal L3 functions at the network edge
Next Generation Networking vs Future Internet
Evolution vs. revolution:
today's problems addressed with hacks and patches
vs.
architectural re-design to attack the problems at the roots
Both approaches need to address ever-increasing traffic demand.
Notable breakthrough: a new protocol that could replace IP
                       2010            2020
Internet users         1.8B            5B
Mobile subscribers     4.6B            10B
Connected machines     1.4B            40B
Bytes of information   800 × 10^18     53 × 10^21
Challenges Ahead
Design and demonstrate “overarching architecture”
Routing + security + mobility
NSF Future Internet Architecture (FIA) projects
• XIA
• NDN
• MobilityFirst
• Nebula
NSF GENI projects
• Testbed building – Mostly OpenFlow driven
Need a platform to play with
Fast but cost-effective
Software routers!
Positioning of 10+ Gbps Routers
Commercial competitor?
Core routers with 100+ Tbps capacity? No.
Edge routers with 100+ Gbps capacity with complex features? Maybe
Experimental platforms?
• NetFPGA
• ServerSwitch
• RouteBricks
• ATCA-based boxes
PacketShader: A GPU-Accelerated Software Router
High-performance
Our prototype: 40 Gbps on a single box
Software Router
Despite its name, not limited to IP routing
You can implement whatever you want on it.
Driven by software
Flexible
Friendly development environments
Based on commodity hardware
Cheap
Fast evolution
Now 10 Gigabit NICs are a commodity
$200–$300 per port
Great opportunity for software routers
Achilles’ Heel of Software Routers
Low performance
Due to CPU bottleneck
Year  Reference                        H/W                             IPv4 Throughput
2008  Egi et al.                       Two quad-core CPUs              3.5 Gbps
2008  "Enhanced SR" (Bolla et al.)     Two quad-core CPUs              4.2 Gbps
2009  "RouteBricks" (Dobrescu et al.)  Two quad-core CPUs (2.8 GHz)    8.7 Gbps
Not capable of supporting even a single 10G port
What is PacketShader?
A high-speed software router
Presented at SIGCOMM 2010
40 Gbps on a single box
Main ideas
Optimizing packet I/O
Utilizing GPU for packet processing
Four applications
IPv4/IPv6 forwarding, OpenFlow switch, and IPsec gateway
GPU = Graphics Processing Unit
The heart of graphics cards
Mainly used for real-time 3D game rendering
Massively-parallel processing capacity
(Ubisoft's AVATAR, from http://ubi.com)
“Silicon Budget” in CPU and GPU
Xeon X5550: 4 cores, 731M transistors
GTX480: 480 cores, 3,200M transistors

[Diagram: relative die area spent on ALUs in the CPU vs. the GPU]
Per-Packet CPU Cycles for 10G
Cycles needed per packet:
IPv4:  1,200 (packet I/O) +   600 (IPv4 lookup)             = 1,800 cycles
IPv6:  1,200 (packet I/O) + 1,600 (IPv6 lookup)             = 2,800 cycles
IPsec: 1,200 (packet I/O) + 5,400… (encryption and hashing) = 6,600 cycles

Your budget: 1,400 cycles
(10G, min-sized packets, dual quad-core 2.66 GHz CPUs)
(In x86; cycle numbers are from RouteBricks [Dobrescu09] and ours.)
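As a sanity check on the budget line above, the 1,400-cycle figure follows from line rate and clock speed alone. A minimal calculation in C (the 20-byte per-frame Ethernet overhead is a standard assumption, not stated on the slide):

    /* Per-packet cycle budget at 10G line rate with minimum-sized frames.
     * A 64B frame plus 20B of Ethernet overhead (preamble + inter-frame
     * gap) occupies 84B on the wire, giving ~14.88M packets per second. */
    #include <stdio.h>

    int main(void) {
        double pps = 10e9 / (84 * 8);       /* ~14.88 Mpps at 10 Gbps */
        double cycles = 8 * 2.66e9;         /* dual quad-core @ 2.66 GHz */
        printf("budget: %.0f cycles/packet\n", cycles / pps);
        /* prints ~1430, matching the ~1,400-cycle budget above */
        return 0;
    }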
PacketShader Part 1: I/O Optimization
Packet I/O: 1,200 reduced to 200 cycles per packet

Main ideas:
Huge packet buffer
Batch processing

The other stages are unchanged (600 cycles for IPv4 lookup, 1,600 for IPv6 lookup, 5,400… for encryption and hashing); only the packet I/O share of the 1,800 / 2,800 / 6,600-cycle totals shrinks.
PacketShader Part 2: GPU Offloading
GPU offloading for
memory-intensive or
compute-intensive operations

The expensive stages (IPv4 lookup: 600 cycles, IPv6 lookup: 1,600, encryption and hashing: 5,400…) move to the GPU; packet I/O stays on the CPU.
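To make the offloading idea concrete, a minimal CUDA sketch: one GPU thread per packet, each doing an independent table lookup. The single 2^16-entry table is an illustrative simplification, not PacketShader's actual kernel or table layout.

    /* Illustrative CUDA kernel: one thread per packet in the batch, each
     * performing an independent, memory-bound IPv4 lookup (simplified to
     * a single 2^16-entry first-stage table). */
    #include <stdint.h>

    __global__ void ipv4_lookup(const uint32_t *dst, const uint16_t *tbl16,
                                uint16_t *hop, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            hop[i] = tbl16[dst[i] >> 16];   /* one lookup per thread */
    }

    /* Launch with one thread per packet, e.g.:
     * ipv4_lookup<<<(n + 255) / 256, 256>>>(d_dst, d_tbl16, d_hop, n); */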
User-space Packet Processing
Packet processing in the kernel is bad:
• The kernel has higher scheduling priority; an overloaded kernel may starve user-level processes.
• Some CPU extensions such as MMX and SSE are not available.
• Buggy kernel code causes irreversible damage to the system.

Processing in user space is good:
• Rich, friendly development and debugging environment
• Seamless integration with 3rd-party libraries such as CUDA or OpenSSL
• Easy to develop a virtualized data plane

But packet processing in user space is known to be three times slower!
Our solution: (1) batching + (2) better core-queue mapping (sketched below)
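A sketch of what the batched user-space receive loop might look like; struct chunk and the io_* calls are hypothetical stand-ins, not PacketShader's actual API:

    /* Hypothetical batched RX/TX loop: each call moves a whole chunk
     * (e.g., 64 packets) across the kernel boundary, amortizing the
     * per-syscall and per-packet costs. */
    struct chunk { int cnt; char *buf; };         /* batch of packets */

    int io_recv_chunk(int rxq, struct chunk *c);  /* assumed API */
    int io_send_chunk(int txq, struct chunk *c);  /* assumed API */

    void forward_loop(int rxq, int txq)
    {
        struct chunk c;
        for (;;) {
            int n = io_recv_chunk(rxq, &c);  /* one syscall, many packets */
            for (int i = 0; i < n; i++) {
                /* per-packet work: lookup, header rewrite, ... */
            }
            io_send_chunk(txq, &c);          /* transmit the whole batch */
        }
    }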
Inefficiencies of Linux Network Stack
[Chart: CPU cycle breakdown in packet RX, annotated with the corresponding fixes:]
• Software prefetch
• Huge packet buffer
• Compact metadata
• Batch processing
Huge Packet Buffer
Eliminates per-packet buffer allocation cost.

[Diagram: Linux per-packet buffer allocation vs. our huge packet buffer scheme]
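A minimal sketch of the huge-buffer idea, assuming fixed-size cells carved out of one contiguous allocation (cell size and count are made-up values):

    /* One contiguous buffer allocated once at startup and carved into
     * fixed-size cells; the NIC DMAs packet i into cell i, so no
     * per-packet allocation or deallocation ever happens. */
    #include <stdlib.h>

    #define CELL_SIZE 2048        /* large enough for any Ethernet frame */
    #define NUM_CELLS 65536

    static char *huge_buf;

    int init_huge_buffer(void)
    {
        huge_buf = aligned_alloc(4096, (size_t)NUM_CELLS * CELL_SIZE);
        return huge_buf ? 0 : -1;
    }

    static inline char *cell(int i)   /* address of the i-th packet cell */
    {
        return huge_buf + (size_t)i * CELL_SIZE;
    }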
Advantages of GPU for Packet Processing
1. Raw computation power
2. Memory access latency
3. Memory bandwidth
Comparison between the Intel X5550 CPU and the NVIDIA GTX480 GPU
(1/3) Raw Computation Power
Compute-intensive operations in software routers
Hashing, encryption, pattern matching, network coding, compression, etc.
GPU can help!
CPU: 2.66 GHz × 4 cores × 4-way superscalar = 43 × 10^9 instructions/sec
GPU: 1.4 GHz × 480 cores = 672 × 10^9 instructions/sec
CPU < GPU, by more than 15×
(2/3) Memory Access Latency
Software routers suffer lots of cache misses.
The GPU can effectively hide memory latency.

[Diagram: when a GPU core's current thread stalls on a cache miss, the core switches to Thread 2, Thread 3, ..., keeping itself busy.]
(3/3) Memory Bandwidth
The CPU's memory bandwidth (empirical) < 25 GB/s

Each forwarded packet crosses memory four times:
1. RX: NIC → RAM
2. RX: RAM → CPU
3. TX: CPU → RAM
4. TX: RAM → NIC

Your budget for packet processing can be less than 10 GB/s.
The GPU's memory bandwidth: 174 GB/s
Scaling with Multiple Multi-Core CPUs

[Diagram: the PacketShader workflow, device driver → pre-shader → shader → post-shader → device driver, replicated across CPUs for scaling]
Results (w/ 64B packets)

Throughput (Gbps):
           CPU-only   CPU+GPU   GPU speedup
IPv4       28.2       39.2      1.4×
IPv6        8         38.2      4.8×
OpenFlow   15.6       32        2.1×
IPsec       3         10.2      3.5×
Example 1: IPv6 forwarding
Longest prefix matching on 128-bit IPv6 addresses
Algorithm: binary search on hash tables [Waldvogel97]
7 hashings + 7 memory accesses
[Diagram: binary search over prefix lengths 1 to 128, probing lengths such as 64, 96, and 80.]
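A simplified sketch of the search loop, assuming one hash table per prefix length and pre-inserted marker entries that steer the search toward longer prefixes; marker precomputation and backtracking on false positives are omitted:

    /* Binary search on hash tables over prefix lengths [Waldvogel97],
     * simplified. Probing the middle length either hits (an entry or a
     * marker, so search longer prefixes) or misses (search shorter ones).
     * Lengths 1..128 take about log2(128) = 7 probes, matching the
     * "7 hashings + 7 memory accesses" above. */
    #include <stdint.h>

    struct entry { int is_marker; int next_hop; };

    /* assumed helper: hash the first `len` bits of addr into table[len] */
    struct entry *ht_lookup(const uint8_t addr[16], int len);

    int ipv6_lpm(const uint8_t addr[16])
    {
        int lo = 1, hi = 128, best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            struct entry *e = ht_lookup(addr, mid);
            if (e) {
                if (!e->is_marker)
                    best = e->next_hop;   /* real prefix: remember it */
                lo = mid + 1;             /* a longer match may exist */
            } else {
                hi = mid - 1;             /* no match at this length */
            }
        }
        return best;                      /* -1 if no prefix matched */
    }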
Example 1: IPv6 forwarding
(Routing table was randomly generated with 200K entries)
[Graph: throughput (Gbps) vs. packet size (bytes: 64, 128, 256, 512, 1024, 1514) for CPU-only vs. CPU+GPU; throughput is bounded by motherboard I/O capacity.]
Results

Year  Reference                        H/W                             IPv4 Throughput
2008  Egi et al.                       Two quad-core CPUs              3.5 Gbps
2008  "Enhanced SR" (Bolla et al.)     Two quad-core CPUs              4.2 Gbps
2009  "RouteBricks" (Dobrescu et al.)  Two quad-core CPUs (2.8 GHz)    8.7 Gbps
2010  PacketShader (CPU-only)          Two quad-core CPUs (2.66 GHz)   28.2 Gbps
2010  PacketShader (CPU+GPU)           Two quad-core CPUs + two GPUs   39.2 Gbps
PacketShader 1.0
GPU: a great opportunity for fast packet processing
PacketShader: optimized packet I/O + GPU acceleration
Scalable with the number of multi-core CPUs, GPUs, and high-speed NICs

Current prototype:
Supports IPv4, IPv6, OpenFlow, and IPsec
40 Gbps performance on a single PC
PacketShader 2.0
Control plane integration
Dynamic routing protocols with Quagga or XORP
Opportunistic offloading:
CPU at low load
GPU at high load
(one possible dispatch policy is sketched below)
Multi-functional, modular programming environment
Integration with Click? [Kohler99]
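One plausible reading of opportunistic offloading as code: dispatch on backlog, letting the CPU handle short queues (lower latency) and the GPU absorb long ones (higher throughput). The threshold and function names are assumptions, not the actual design:

    /* Illustrative dispatch: CPU path while the backlog is short, GPU
     * batches once it grows. GPU_THRESHOLD is an arbitrary knob. */
    struct chunk;                           /* batch of packets, as before */

    void process_on_cpu(struct chunk *c);   /* assumed CPU fast path  */
    void enqueue_for_gpu(struct chunk *c);  /* assumed async GPU path */

    #define GPU_THRESHOLD 1024              /* queued packets before offload */

    void dispatch(struct chunk *c, int queued_pkts)
    {
        if (queued_pkts < GPU_THRESHOLD)
            process_on_cpu(c);              /* low load: latency wins */
        else
            enqueue_for_gpu(c);             /* high load: throughput wins */
    }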
#3 Multi-functional, modular programming environment

[Diagram: modular pipeline. Packets are received from the NICs (Recv) as chunks; filters split chunks into subchunks, gathering queues re-batch them, and Module 1 … Module N run as schedulable tasks cascaded by Y/N branches. Packets that need it go to the slow path (e.g., the Linux TCP/IP stack) or are dropped; the rest are sent back out the NICs (Send).]
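The diagram suggests a Click-like element interface operating on chunks rather than single packets; a hypothetical sketch, with every name here an assumption about the planned design:

    /* Hypothetical module interface for the chunk pipeline: a module
     * processes a whole chunk and assigns each packet an output port,
     * which the framework uses to route packets to the next module,
     * the slow path, or the drop port. */
    struct chunk;                        /* batch of packets */

    struct module {
        const char *name;
        void *state;                     /* per-module private data */
        /* process a batch; writes one output-port index per packet */
        void (*process)(struct module *self, struct chunk *c, int *out_port);
    };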
Factors behind PacketShader 1.0 Performance

Batching, batching, batching!
The I/O engine (modified NIC driver) uses contiguous huge packet buffers called "chunks".
The user-level process pipelines multiple chunks.
The GPU processes multiple chunks in parallel.

Hardware-aware optimizations:
No NUMA node crossing
Minimized cache conflicts among multi-cores
Review of I/O Capacity for 100+ Gbps
QuickPath Interconnect (QPI):
CPU socket-to-socket link for remote memory access
IOH-to-IOH link for I/O traffic
CPU-to-IOH link for CPU-to-peripheral connections
Today's QPI link: 12.8 GB/s, or 102.4 Gbps

Sandy Bridge:
On recall at the moment (the early-2011 chipset recall)
Expected to boost performance to 60 Gbps w/o modification
Review of Memory B/W for 100+ Gbps
For 100 Gbps forwarding we need 400 Gbps of memory bandwidth (each packet crosses memory four times), plus bookkeeping.

Current configuration:
Triple-channel DDR3 1,333 MHz
32 GB/s per socket (theoretical) and 17.9 GB/s (empirical)

On a NUMA system:
More nodes
Careful placement...
Future Work
Consider other architectures
AMD’s APU
Tilera’s tiled many-core chips
Intel's MIC
Become a platform for all new FIA architectures
Advantage over NetFPGA, ServerSwitch, ATCA solutions
QUESTIONS?
THANK YOU!
For more details
https://shader.kaist.edu/