

Massively-Parallel Packet Processing with GPUs to Accelerate Software Routers

Sangjin Han¹, Keon Jang¹, KyoungSoo Park², and Sue Moon¹
¹Department of Computer Science, KAIST, {sangjin, keonjang}@an.kaist.ac.kr
²Department of Electrical Engineering, KAIST, [email protected]

PacketShader

The fastest software router in the world
» 4.5x faster than RouteBricks [SOSP '09]
Eliminates the CPU bottleneck
» with massively-parallel GPU computation
Highly scalable Packet I/O
» 40 Gbps throughput even with 64B packets
Flexible architecture
» Runs in user-mode on commodity x86 servers
Implementations: IPv4, IPv6, OpenFlow, and IPsec

Motivation

PC-based Software Routers

Easy to fix problems
Ready for additional functionalities
Cheap & Ubiquitous ($1,000 ~ $10,000)
Low Performance: 1-10 Gbps today

General-Purpose Computation on GPUs

GPUs are widely used for data-intensive workloads
» e.g., medical imaging, bioinformatics, finance, etc.
High performance with massively-parallel processing

                             Price   # of cores   # of HW threads   Peak performance
    CPU (Intel Core i7 920)  $260    4            8                 43 GFLOPS
    GPU (NVIDIA GTX280)      $260    240          30,720            933 GFLOPS

"The massive core array of GPUs matches the inherent parallelism in stateless packet processing."

Design

Data-path in a software router = Packet I/O + Packet Processing

Our Packet I/O Engine

Huge packet buffer
» instead of per-packet buffers
Batch packet processing
» effectively amortizes per-packet cost (see the sketch below)
Multi-core/NUMA scalable
» RSS and NUMA-aware data placement
Efficient user-level interface
» supports multiple queues
» resistant to receive livelock

[Figure: I/O batch performance with a single CPU core]
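
The poster contains no code, but the batching idea above can be made concrete with a small sketch. The following is a minimal, hypothetical user-level interface in the spirit of the engine described here: packets arrive in chunks backed by one huge buffer, and a single call per chunk amortizes the per-packet cost. All type, field, and function names (ps_chunk, ps_recv_chunk, ps_send_chunk) are illustrative assumptions, not PacketShader's actual API.

    /*
     * Minimal sketch of a batched, user-level packet I/O loop.
     * Illustrative only; ps_chunk, ps_recv_chunk, and ps_send_chunk are
     * assumed names, not the real PacketShader interface.
     */
    #include <stdint.h>

    #define CHUNK_MAX 64                     /* packets fetched per call            */

    struct ps_packet {
        uint32_t offset;                     /* offset into the shared huge buffer  */
        uint32_t len;
    };

    struct ps_chunk {
        uint8_t *buf;                        /* one huge, contiguous packet buffer  */
        struct ps_packet pkt[CHUNK_MAX];
        int cnt;                             /* number of packets in this chunk     */
        int queue;                           /* RSS queue the chunk arrived on      */
    };

    /* Stubs standing in for the engine; a real implementation would cross into
     * a kernel driver once per *chunk*, not once per packet.                   */
    static int ps_recv_chunk(int ifindex, struct ps_chunk *c) { (void)ifindex; c->cnt = 0; return c->cnt; }
    static int ps_send_chunk(int ifindex, struct ps_chunk *c) { (void)ifindex; return c->cnt; }

    static void forward_loop(int rx_if, int tx_if, struct ps_chunk *chunk)
    {
        for (;;) {
            if (ps_recv_chunk(rx_if, chunk) <= 0)   /* receive up to CHUNK_MAX packets */
                continue;

            /* ... per-chunk processing, e.g. hand the chunk to the GPU framework ... */

            ps_send_chunk(tx_if, chunk);            /* transmit the whole batch */
        }
    }

With a few dozen packets per chunk, fixed per-call costs (such as the user-kernel crossing and descriptor bookkeeping) are paid once per chunk rather than once per packet.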

GPU Processing Framework

General framework
» Provides a convenient way for parallel packet processing
» Applications (e.g., IPv4 routing) are implemented on it
Highly scalable
» with multiple CPU cores, NUMA nodes, NICs, and GPUs
Optimization support
» Pipelining in CPU-GPU collaboration (see the sketch below)
» Gather/Scatter for more parallelism

[Figure: 3-stage workflow]
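
To illustrate the 3-stage workflow and the CPU-GPU pipelining listed above, here is a hedged sketch: the CPU gathers the fields the GPU needs into a dense array, a kernel processes one packet per thread, and the CPU scatters the results back to packets. The kernel, struct, and buffer names are assumptions for illustration, not the framework's actual code.

    // Illustrative 3-stage chunk processing (names and layout are assumptions).
    #include <cuda_runtime.h>
    #include <stdint.h>

    __global__ void lookup_kernel(const uint32_t *keys, uint16_t *results, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per packet
        if (i < n)
            results[i] = (uint16_t)(keys[i] & 0xFF);     // placeholder "lookup"
    }

    struct Pipeline {
        cudaStream_t stream;
        uint32_t *h_keys, *d_keys;    // gathered per-packet keys (host / device)
        uint16_t *h_res,  *d_res;     // per-packet results
    };

    void pipeline_init(Pipeline &p, int max_pkts)
    {
        cudaStreamCreate(&p.stream);
        cudaMallocHost((void **)&p.h_keys, max_pkts * sizeof(uint32_t));  // pinned memory
        cudaMallocHost((void **)&p.h_res,  max_pkts * sizeof(uint16_t));  // for async copies
        cudaMalloc((void **)&p.d_keys, max_pkts * sizeof(uint32_t));
        cudaMalloc((void **)&p.d_res,  max_pkts * sizeof(uint16_t));
    }

    void process_chunk(Pipeline &p, const uint32_t *dst_ip_of_pkt, int n)
    {
        // Stage 1 (CPU): gather the lookup keys into one dense array.
        for (int i = 0; i < n; i++)
            p.h_keys[i] = dst_ip_of_pkt[i];

        // Stage 2 (GPU): copy keys, run one thread per packet, copy results back.
        cudaMemcpyAsync(p.d_keys, p.h_keys, n * sizeof(uint32_t),
                        cudaMemcpyHostToDevice, p.stream);
        lookup_kernel<<<(n + 255) / 256, 256, 0, p.stream>>>(p.d_keys, p.d_res, n);
        cudaMemcpyAsync(p.h_res, p.d_res, n * sizeof(uint16_t),
                        cudaMemcpyDeviceToHost, p.stream);
        cudaStreamSynchronize(p.stream);

        // Stage 3 (CPU): scatter results back to packets, i.e. use p.h_res[i]
        // to pick the output port for packet i, then transmit the chunk.
    }

Running two such Pipeline instances on alternating chunks would let one chunk's copies and kernel execution overlap with the next chunk's gather/scatter work on the CPU, which is the pipelining optimization listed above.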

Hardware Setup

    Item            Specification                           Qty
    CPU             Xeon X5550 (quad-core, 2.66 GHz)        2
    RAM             DDR3 ECC FB-DIMM 2GB, 1,333 MHz         6
    Motherboard     Super Micro X8DAH+                      1
    Graphics card   NVIDIA GTX285 (240 cores, 1GB DDR3)     2
    NIC             Intel X520-DA2 (dual-port 10GbE)        4

[Figure: The interior view of our prototype server, equipped with 2 quad-core CPUs, 2 GPUs, and 8×10GbE ports]

Our prototype costs less than $7,000, which is 10+ times cheaper than hardware-based commercial routers with similar performance.

Performance

Basic Packet I/O performance
» 40 Gbps forwarding for any packet size

We implemented 4 applications on top of PacketShader and measured their performance.

    Application       What we offloaded             Type                     CPU-only¹   CPU+GPU¹
    IPv4 routing²     IPv4 longest prefix matching  Simple                   28.8 Gbps   38.1 Gbps⁴
    IPv6 routing²     IPv6 longest prefix matching  Memory-intensive         8.1 Gbps    31.4 Gbps
    OpenFlow switch³  Wildcard matching             Configuration-dependent  9.6 Gbps    24.6 Gbps
    IPsec gateway     Encryption and HMAC           Compute-intensive        2.8 Gbps    5.3 Gbps

1) All throughputs are measured with min-sized (64B) packets.
2) Configured with 300K IPv4 prefixes and 200K IPv6 prefixes.
3) Configured with 64K exact-match flows and 64 wildcard-match flows.
4) This performance is 4.5x higher than the fastest software router previously reported (RouteBricks in SOSP '09).
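
As a concrete, hedged illustration of the "simple" IPv4 offload in the table above, the sketch below performs a GPU longest-prefix match with a DIR-24-8-style two-level table: one 16-bit entry per /24 prefix, plus an overflow table for longer prefixes. The table encoding and names are assumptions for illustration, not PacketShader's actual data structure.

    // Sketch: IPv4 longest-prefix match on the GPU, DIR-24-8-style (illustrative).
    #include <stdint.h>

    // tbl24[ip >> 8]: if bit 15 is clear, the low 15 bits are the next-hop ID;
    // if bit 15 is set, the low 15 bits select a 256-entry block in tbl_long
    // that is indexed by the last octet of the address.
    __global__ void ipv4_lookup(const uint32_t *dst_ips, uint16_t *next_hops,
                                const uint16_t *tbl24, const uint16_t *tbl_long,
                                int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one packet per thread
        if (i >= n)
            return;

        uint32_t ip = dst_ips[i];
        uint16_t e  = tbl24[ip >> 8];                    // match on the first 24 bits

        if (e & 0x8000)                                  // some prefix is longer than /24
            e = tbl_long[((uint32_t)(e & 0x7FFF) << 8) | (ip & 0xFF)];

        next_hops[i] = e;
    }

Each thread issues independent memory accesses, so thousands of in-flight lookups can hide memory latency; this is consistent with the memory-intensive IPv6 lookup above showing the largest CPU-to-GPU speedup.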

This work was supported by NAP of Korea Research Council of Fundamental Science & Technology (KRCF).

7th USENIX NSDI, April 28, 2010
Advanced Networking Lab & Networked and Distributed Computing Systems Lab, KAIST