

Massively-Parallel Packet Processing with GPUs to Accelerate Software Routers

Sangjin Han¹, Keon Jang¹, KyoungSoo Park², and Sue Moon¹
¹Department of Computer Science, KAIST, {sangjin, keonjang}@an.kaist.ac.kr
²Department of Electrical Engineering, KAIST, [email protected]

PacketShader

The fastest software router in the world
» 4.5x faster than RouteBricks [SOSP '09]
Eliminates the CPU bottleneck
» with massively-parallel GPU computation
Highly scalable Packet I/O
» 40 Gbps throughput even with 64B packets
Flexible architecture
» Runs in user-mode on commodity x86 servers
Implementations: IPv4, IPv6, OpenFlow, and IPsec

Motivation

PC-based Software Routers

Easy to fix problems
Ready for additional functionalities
Cheap & Ubiquitous ($1,000 ~ $10,000)
Low Performance: 1-10 Gbps today

General-Purpose Computation on GPUs

GPUs are widely used for data-intensive workloads
» e.g., medical imaging, bioinformatics, finance, etc.
High performance with massively-parallel processing

                             Price   # of cores   # of HW threads   Peak performance
    CPU (Intel Core i7 920)  $260    4            8                 43 GFLOPS
    GPU (NVIDIA GTX280)      $260    240          30,720            933 GFLOPS

"The massive core array of GPUs matches the inherent parallelism in stateless packet processing."

Design

Data-path in a software router = Packet I/O + Packet Processing

Our Packet I/O Engine

Huge packet buffer
» instead of per-packet buffers
Batch packet processing
» effectively amortizes per-packet cost (see the sketch below)
Multi-core/NUMA scalable
» RSS and NUMA-aware data placement
Efficient user-level interface
» supports multiple queues
» resistant to receive livelock

[Figure: I/O batch performance with a single CPU core]
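
The poster contains no code, but the batching idea above can be made concrete with a small sketch. The following is a minimal, hypothetical user-level interface in the spirit of the engine described here: packets arrive in chunks backed by one huge buffer, and a single call per chunk amortizes the per-packet cost. All type, field, and function names (ps_chunk, ps_recv_chunk, ps_send_chunk) are illustrative assumptions, not PacketShader's actual API.

    /*
     * Minimal sketch of a batched, user-level packet I/O loop.
     * Illustrative only; ps_chunk, ps_recv_chunk, and ps_send_chunk are
     * assumed names, not the real PacketShader interface.
     */
    #include <stdint.h>

    #define CHUNK_MAX 64                     /* packets fetched per call            */

    struct ps_packet {
        uint32_t offset;                     /* offset into the shared huge buffer  */
        uint32_t len;
    };

    struct ps_chunk {
        uint8_t *buf;                        /* one huge, contiguous packet buffer  */
        struct ps_packet pkt[CHUNK_MAX];
        int cnt;                             /* number of packets in this chunk     */
        int queue;                           /* RSS queue the chunk arrived on      */
    };

    /* Stubs standing in for the engine; a real implementation would cross into
     * a kernel driver once per *chunk*, not once per packet.                   */
    static int ps_recv_chunk(int ifindex, struct ps_chunk *c) { (void)ifindex; c->cnt = 0; return c->cnt; }
    static int ps_send_chunk(int ifindex, struct ps_chunk *c) { (void)ifindex; return c->cnt; }

    static void forward_loop(int rx_if, int tx_if, struct ps_chunk *chunk)
    {
        for (;;) {
            if (ps_recv_chunk(rx_if, chunk) <= 0)   /* receive up to CHUNK_MAX packets */
                continue;

            /* ... per-chunk processing, e.g. hand the chunk to the GPU framework ... */

            ps_send_chunk(tx_if, chunk);            /* transmit the whole batch */
        }
    }

With a few dozen packets per chunk, fixed per-call costs (such as the user-kernel crossing and descriptor bookkeeping) are paid once per chunk rather than once per packet.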

GPU Processing Framework

General framework
» Provides a convenient way for parallel packet processing
» Applications (e.g., IPv4 routing) are implemented on it
Highly scalable
» with multiple CPU cores, NUMA nodes, NICs, and GPUs
Optimization support
» Pipelining in CPU-GPU collaboration (see the sketch below)
» Gather/Scatter for more parallelism

[Figure: 3-stage workflow]
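
To illustrate the 3-stage workflow and the CPU-GPU pipelining listed above, here is a hedged sketch: the CPU gathers the fields the GPU needs into a dense array, a kernel processes one packet per thread, and the CPU scatters the results back to packets. The kernel, struct, and buffer names are assumptions for illustration, not the framework's actual code.

    // Illustrative 3-stage chunk processing (names and layout are assumptions).
    #include <cuda_runtime.h>
    #include <stdint.h>

    __global__ void lookup_kernel(const uint32_t *keys, uint16_t *results, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per packet
        if (i < n)
            results[i] = (uint16_t)(keys[i] & 0xFF);     // placeholder "lookup"
    }

    struct Pipeline {
        cudaStream_t stream;
        uint32_t *h_keys, *d_keys;    // gathered per-packet keys (host / device)
        uint16_t *h_res,  *d_res;     // per-packet results
    };

    void pipeline_init(Pipeline &p, int max_pkts)
    {
        cudaStreamCreate(&p.stream);
        cudaMallocHost((void **)&p.h_keys, max_pkts * sizeof(uint32_t));  // pinned memory
        cudaMallocHost((void **)&p.h_res,  max_pkts * sizeof(uint16_t));  // for async copies
        cudaMalloc((void **)&p.d_keys, max_pkts * sizeof(uint32_t));
        cudaMalloc((void **)&p.d_res,  max_pkts * sizeof(uint16_t));
    }

    void process_chunk(Pipeline &p, const uint32_t *dst_ip_of_pkt, int n)
    {
        // Stage 1 (CPU): gather the lookup keys into one dense array.
        for (int i = 0; i < n; i++)
            p.h_keys[i] = dst_ip_of_pkt[i];

        // Stage 2 (GPU): copy keys, run one thread per packet, copy results back.
        cudaMemcpyAsync(p.d_keys, p.h_keys, n * sizeof(uint32_t),
                        cudaMemcpyHostToDevice, p.stream);
        lookup_kernel<<<(n + 255) / 256, 256, 0, p.stream>>>(p.d_keys, p.d_res, n);
        cudaMemcpyAsync(p.h_res, p.d_res, n * sizeof(uint16_t),
                        cudaMemcpyDeviceToHost, p.stream);
        cudaStreamSynchronize(p.stream);

        // Stage 3 (CPU): scatter results back to packets, i.e. use p.h_res[i]
        // to pick the output port for packet i, then transmit the chunk.
    }

Running two such Pipeline instances on alternating chunks would let one chunk's copies and kernel execution overlap with the next chunk's gather/scatter work on the CPU, which is the pipelining optimization listed above.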

Hardware Setup

    Item            Specification                           Qty
    CPU             Xeon X5550 (quad-core, 2.66 GHz)        2
    RAM             DDR3 ECC FB-DIMM 2GB, 1,333 MHz         6
    Motherboard     Super Micro X8DAH+                      1
    Graphics card   NVIDIA GTX285 (240 cores, 1GB DDR3)     2
    NIC             Intel X520-DA2 (dual-port 10GbE)        4

[Figure: The interior view of our prototype server, equipped with 2 quad-core CPUs, 2 GPUs, and 8×10GbE ports]

Our prototype costs less than $7,000, which is 10+ times cheaper than hardware-based commercial routers with similar performance.

Performance

Basic Packet I/O performance
» 40 Gbps forwarding for any packet size

We implemented 4 applications on top of PacketShader and measured their performance.

    Application       What we offloaded             Type                     CPU-only¹   CPU+GPU¹
    IPv4 routing²     IPv4 longest prefix matching  Simple                   28.8 Gbps   38.1 Gbps⁴
    IPv6 routing²     IPv6 longest prefix matching  Memory-intensive         8.1 Gbps    31.4 Gbps
    OpenFlow switch³  Wildcard matching             Configuration-dependent  9.6 Gbps    24.6 Gbps
    IPsec gateway     Encryption and HMAC           Compute-intensive        2.8 Gbps    5.3 Gbps

1) All throughputs are measured with min-sized (64B) packets.
2) Configured with 300K IPv4 prefixes and 200K IPv6 prefixes.
3) Configured with 64K exact-match flows and 64 wildcard-match flows.
4) This performance is 4.5x higher than the fastest software router previously reported (RouteBricks in SOSP '09).
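
As a concrete, hedged illustration of the "simple" IPv4 offload in the table above, the sketch below performs a GPU longest-prefix match with a DIR-24-8-style two-level table: one 16-bit entry per /24 prefix, plus an overflow table for longer prefixes. The table encoding and names are assumptions for illustration, not PacketShader's actual data structure.

    // Sketch: IPv4 longest-prefix match on the GPU, DIR-24-8-style (illustrative).
    #include <stdint.h>

    // tbl24[ip >> 8]: if bit 15 is clear, the low 15 bits are the next-hop ID;
    // if bit 15 is set, the low 15 bits select a 256-entry block in tbl_long
    // that is indexed by the last octet of the address.
    __global__ void ipv4_lookup(const uint32_t *dst_ips, uint16_t *next_hops,
                                const uint16_t *tbl24, const uint16_t *tbl_long,
                                int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one packet per thread
        if (i >= n)
            return;

        uint32_t ip = dst_ips[i];
        uint16_t e  = tbl24[ip >> 8];                    // match on the first 24 bits

        if (e & 0x8000)                                  // some prefix is longer than /24
            e = tbl_long[((uint32_t)(e & 0x7FFF) << 8) | (ip & 0xFF)];

        next_hops[i] = e;
    }

Each thread issues independent memory accesses, so thousands of in-flight lookups can hide memory latency; this is consistent with the memory-intensive IPv6 lookup above showing the largest CPU-to-GPU speedup.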

This work was supported by NAP of Korea Research Council of Fundamental Science & Technology (KRCF).

7th USENIX NSDI, April 28, 2010
Advanced Networking Lab & Networked and Distributed Computing Systems Lab, KAIST