TRANSCRIPT
Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing
Yuhao Zhu*, Yangdong Deng‡, Yubei Chen‡
Presenters: Abraham Addisie, Vaibhav Gogte
*Electrical and Computer Engineering, University of Texas at Austin
‡Institute of Microelectronics, Tsinghua University
Outline
• Introduction
• Motivation
• Related Work
• GPU Overview
• Hermes Architecture
• Adaptive Warp Scheduling
• Hardware Implementation
• Experimental Analysis
• Conclusion
Introduction – IP Packet Processing

Processing of an IP packet at a router. After receiving an IP packet:
1. Check the IP header
2. Classify the packet
3. Look up the routing table
4. Decrement the Time to Live (TTL) value
5. Fragment the packet if it exceeds the Maximum Transmission Unit (MTU)

New processing requirements are being added to this list, e.g., deep packet inspection.

Before forwarding, the router rewrites the MAC header while the IP header's addresses stay unchanged:

Incoming packet:          Outgoing packet:
  MAC Header                MAC Header
    Source MAC: mx            Source MAC: new
    Dest MAC:   my            Dest MAC:   new
  IP Header                 IP Header
    Source IP:  x             Source IP:  x
    Dest IP:    y             Dest IP:    y
  Data                      Data
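The per-packet steps above can be sketched as follows. This is an illustrative model of the slide's pipeline, not the paper's implementation; the `Packet` fields, the `routing_table` dictionary, and the MTU constant are our assumptions, and the longest-prefix-match lookup is reduced to an exact-match dictionary lookup for brevity.

```python
# Illustrative sketch of the per-packet processing steps listed above.
# Names (Packet, routing_table, MTU) are our assumptions, not from the paper.
from dataclasses import dataclass

MTU = 1500  # bytes; typical Ethernet MTU, used here only for illustration


@dataclass
class Packet:
    dst_ip: str
    ttl: int
    length: int


def process(pkt, routing_table):
    # 1. Check the IP header (here reduced to a TTL sanity check).
    if pkt.ttl <= 1:
        return None  # drop: TTL expired
    # 2./3. Classify the packet and look up the next hop
    #       (real routers use longest-prefix match, omitted here).
    next_hop = routing_table.get(pkt.dst_ip, "default")
    # 4. Decrement TTL.
    pkt.ttl -= 1
    # 5. Flag the packet for fragmentation if it exceeds the MTU.
    needs_fragmentation = pkt.length > MTU
    return next_hop, pkt.ttl, needs_fragmentation
```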
Motivation

Internet traffic is increasing exponentially
• Multimedia applications, social networks, the Internet of Things

Network protocols are being added and modified
• Transition from IPv4 (32-bit addresses) to IPv6 (128-bit addresses)

New, computationally demanding tasks are being added
• Deep packet inspection

⇒ Routers must be both high-throughput and highly programmable.
Related Work

ASIC-based router:
• Long design turnaround
• High non-recurring engineering cost

Network processor (NP) based router:
• No effective programming model
• Intel discontinued its NP router business

GPP (software) based router:
• Low performance

GPU-based router:
• High performance + high programmability
Related Work – CPU vs. GPU Throughput

[Figure: a GPP (software) based router runs on a low-throughput processor, while a GPU-based software router runs on a high-throughput processor. PacketShader: Han et al. [2010].]
Exploiting High-Throughput GPUs for IP Routing

Processing of one packet is independent of the others
• Data-level parallelism = packet-level parallelism

A GPU-based router has been shown to outperform a software-based router by 30x in throughput (PacketShader: Han et al. [2010]).

Pipeline: Packet Queue → Batching → Parallel Processing by the GPU
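Because each packet is independent, a batch can be processed entirely in parallel. The sketch below stands in a thread pool for the GPU (a simplification on our part; the real systems batch packets into CUDA kernels), with a hypothetical `lookup` function as the per-packet kernel:

```python
# Sketch of packet-level parallelism: packets are independent, so a batch
# can be handed to parallel workers. A thread pool stands in for the GPU
# here; this is our simplification, not the paper's CUDA implementation.
from concurrent.futures import ThreadPoolExecutor


def lookup(dst_ip, table):
    # Stand-in for the per-packet routing-table-lookup kernel.
    return table.get(dst_ip, "default")


def process_batch(batch, table):
    # Every packet in the batch is processed independently, in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda ip: lookup(ip, table), batch))
```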
Limitations of Existing GPU-Based Routers

Memory mapping from the CPU's main memory to the GPU's device memory goes over the PCIe bus, with a peak bandwidth of 8 GB/s
• GPU throughput = 30x the CPU's without memory mapping
• Reduced to 5x the CPU's once the memory-mapping overhead is included

Cannot guarantee minimum latency for an individual packet
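The drop from 30x to 5x can be understood with a simple serialized-stage model. This is our illustration, not the paper's analysis: if the PCIe transfer and the GPU computation do not overlap, the effective speedup over the CPU is the harmonic combination of the two stages' speedups.

```python
# Illustrative model (our assumption, not from the paper): when the PCIe
# transfer and GPU processing serialize, the effective speedup over the
# CPU is 1 / (1/S_compute + 1/S_transfer).
def effective_speedup(s_compute, s_transfer):
    return 1.0 / (1.0 / s_compute + 1.0 / s_transfer)

# Under this model, a 30x compute speedup combined with a transfer stage
# running at ~6x the CPU's rate yields the observed ~5x overall speedup.
```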
Solution: Hermes
Architecture of the NVIDIA GTX480 – Shared Memory Hierarchy

Hermes: integrated CPU/GPU IP routing
• Lower packet-transfer overhead, via the shared memory hierarchy
• Lower per-packet latency, via adaptive warp scheduling
Adaptive Warp Issue

The CPU monitors the arrival pattern of packets and the resources available in the GPU.

[Figure: the CPU monitors the packets and writes the number of packets to be processed into a Task FIFO; data moves through the shared memory to an array of SMPs, with a minimum fetch granularity of one warp.]

Trade-off in the FIFO update granularity:
• Too large – average packet delay increases
• Too small – GPU fetch scheduling becomes complicated
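The CPU-side issue logic above can be sketched as follows. The warp size of 32 threads is the standard CUDA granularity; the policy details (issuing only whole warps, holding leftovers for the next round) are our assumptions about the mechanism, not the paper's exact algorithm.

```python
# Sketch of adaptive warp issue: the CPU accumulates arriving packets and
# pushes work into the Task FIFO in multiples of one warp (32 threads),
# the minimum fetch granularity. Policy details are our assumptions.
from collections import deque

WARP_SIZE = 32  # standard CUDA warp size


class TaskFIFO:
    def __init__(self):
        self.entries = deque()  # each entry: number of packets to process

    def issue(self, pending_packets):
        # Issue only whole warps; leftover packets wait for the next round,
        # trading a little latency for full-warp utilization.
        warps = pending_packets // WARP_SIZE
        if warps:
            self.entries.append(warps * WARP_SIZE)
        return pending_packets - warps * WARP_SIZE
```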
In-Order Commit

UDP users expect packets to arrive in order, so warps are committed in order.

[Figure: the Warp Allocator records the warp IDs in flight; a Lookup Table (LUT) maps each Delay Commit Queue (DCQ) entry ID to a warp ID; the Warp Scheduler issues warps to the shader cores; at the write-back stage the DCQ ensures that warps are committed in order.]
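The DCQ described above behaves like a reorder buffer: warps may finish out of order, but results leave only from the head. The sketch below is our simplified model of that mechanism, not the hardware design.

```python
# Sketch of in-order commit via a Delay Commit Queue (DCQ): warps may
# complete out of order, but results are released only from the head of
# the queue, in allocation order. A simplified model of the mechanism.
from collections import OrderedDict


class DelayCommitQueue:
    def __init__(self):
        self.slots = OrderedDict()  # warp_id -> result (None = in flight)

    def allocate(self, warp_id):
        self.slots[warp_id] = None  # record the warp as in flight

    def complete(self, warp_id, result):
        self.slots[warp_id] = result  # out-of-order completion is allowed

    def commit(self):
        # Release finished warps from the head only, preserving order.
        done = []
        while self.slots and next(iter(self.slots.values())) is not None:
            done.append(self.slots.popitem(last=False))
        return done
```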
Hardware and Area Overhead

Task FIFO
• 32 bits × 1028 entries
• Area = 0.053 mm²

Delay Commit Queue (DCQ)
• Size depends on the maximally allowed concurrent warps (MCWs) and the number of shader cores
• 8 bits × 1028 entries
• Area = 0.013 mm²

DCQ–Warp LUT
• Size depends on the number of MCWs
• 16 bits × 32 entries
• Area = 0.006 mm²

The hardware overhead is negligible!
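The raw storage implied by the sizes above can be checked with a quick calculation (storage bits only; the mm² figures come from the paper's synthesis, not from this arithmetic):

```python
# Quick check of the raw storage implied by the structure sizes above
# (storage bits only; the area figures come from the paper).
def storage_bits(width_bits, entries):
    return width_bits * entries

task_fifo = storage_bits(32, 1028)  # 32,896 bits ≈ 4.0 KiB
dcq       = storage_bits(8, 1028)   #  8,224 bits ≈ 1.0 KiB
lut       = storage_bits(16, 32)    #    512 bits
```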
Experimental Setup

The cycle-accurate GPGPU-Sim simulator is used to evaluate performance.

Benchmarks
• Checking the IP header, packet classification, routing table lookup, decrementing TTL, IP fragmentation, and deep packet inspection
• Both burst and sparse traffic patterns

QoS parameters: throughput, delay, delay variance
Throughput Evaluation

[Figures: computing rates of the benchmark applications under burst traffic without DCQ, and under sparse traffic without DCQ.]

• With no packet queueing, the baseline CPU/GPU is still unable to deliver at the input rate
• Hermes outperforms the baseline CPU/GPU by a factor of 5
• Better resource utilization with increasing MCW
Delay Analysis

[Figures: packet delay under burst traffic without DCQ, and delay with DCQ vs. without DCQ.]

• Simple processing on the GPU lets the CPU-side waiting overlap with GPU processing
• Packet delay is reduced by 81.2%!
• Divergent branches take longer to process, starving the packets
Conclusion

• Lack of QoS support and CPU-GPU communication overhead are the major bottlenecks of existing GPU-based routers
• Hermes is a closely coupled CPU-GPU solution that:
  • Meets stringent delay requirements
  • Enables QoS through optimized configuration
  • Requires only minimal hardware extensions
• A novel, high-quality packet-processing engine for future software routers
Discussion Points

• Are GPUs really easy to program for packet processing?
• How do the performance and area overheads compare with ASIC-based routers?
• Is router programmability really a crucial concern?