TRANSCRIPT
Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing
Yuhao Zhu*, Yangdong Deng‡, Yubei Chen‡
Presenters: Abraham Addisie, Vaibhav Gogte
*Electrical and Computer Engineering, University of Texas at Austin
‡Institute of Microelectronics, Tsinghua University
Outline
• Introduction
• Motivation
• Related Work
• GPU Overview
• Hermes Architecture
• Adaptive Warp Scheduling
• Hardware Implementation
• Experimental Analysis
• Conclusion
Introduction – IP Packet Processing

Processing of an IP packet at a router. After receiving an IP packet:
1. Check the IP header
2. Classify the packet
3. Look up the routing table
4. Decrement the Time to Live (TTL) value
5. Fragment the packet if it exceeds the Maximum Transmission Unit (MTU)

New processing requirements are being added to this list, e.g., deep packet inspection.

Before forwarding, the router rewrites the MAC header while the IP header's addresses stay unchanged:

Incoming packet:          Outgoing packet:
  MAC Header                MAC Header
    Source MAC: mx            Source MAC: new
    Dest MAC:   my            Dest MAC:   new
  IP Header                 IP Header
    Source IP:  x             Source IP:  x
    Dest IP:    y             Dest IP:    y
  Data                      Data
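The per-packet steps above can be sketched as follows. This is an illustrative model of the slide's pipeline, not the paper's implementation; the `Packet` fields, the `routing_table` dictionary, and the MTU constant are our assumptions, and the longest-prefix-match lookup is reduced to an exact-match dictionary lookup for brevity.

```python
# Illustrative sketch of the per-packet processing steps listed above.
# Names (Packet, routing_table, MTU) are our assumptions, not from the paper.
from dataclasses import dataclass

MTU = 1500  # bytes; typical Ethernet MTU, used here only for illustration


@dataclass
class Packet:
    dst_ip: str
    ttl: int
    length: int


def process(pkt, routing_table):
    # 1. Check the IP header (here reduced to a TTL sanity check).
    if pkt.ttl <= 1:
        return None  # drop: TTL expired
    # 2./3. Classify the packet and look up the next hop
    #       (real routers use longest-prefix match, omitted here).
    next_hop = routing_table.get(pkt.dst_ip, "default")
    # 4. Decrement TTL.
    pkt.ttl -= 1
    # 5. Flag the packet for fragmentation if it exceeds the MTU.
    needs_fragmentation = pkt.length > MTU
    return next_hop, pkt.ttl, needs_fragmentation
```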
Motivation

Internet traffic is increasing exponentially
• Multimedia applications, social networks, the Internet of Things

Network protocols are being added and modified
• Transition from IPv4 (32-bit addresses) to IPv6 (128-bit addresses)

New, computationally demanding tasks are being added
• Deep packet inspection

⇒ Routers must be both high-throughput and highly programmable.
Related Work

ASIC-based router:
• Long design turnaround
• High non-recurring engineering cost

Network processor (NP) based router:
• No effective programming model
• Intel discontinued its NP router business

GPP (software) based router:
• Low performance

GPU-based router:
• High performance + high programmability
Related Work – CPU vs. GPU Throughput

[Figure: a GPP (software) based router runs on a low-throughput processor, while a GPU-based software router runs on a high-throughput processor. PacketShader: Han et al. [2010].]
Exploiting High-Throughput GPUs for IP Routing

Processing of one packet is independent of the others
• Data-level parallelism = packet-level parallelism

A GPU-based router has been shown to outperform a software-based router by 30x in throughput (PacketShader: Han et al. [2010]).

Pipeline: Packet Queue → Batching → Parallel Processing by the GPU
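Because each packet is independent, a batch can be processed entirely in parallel. The sketch below stands in a thread pool for the GPU (a simplification on our part; the real systems batch packets into CUDA kernels), with a hypothetical `lookup` function as the per-packet kernel:

```python
# Sketch of packet-level parallelism: packets are independent, so a batch
# can be handed to parallel workers. A thread pool stands in for the GPU
# here; this is our simplification, not the paper's CUDA implementation.
from concurrent.futures import ThreadPoolExecutor


def lookup(dst_ip, table):
    # Stand-in for the per-packet routing-table-lookup kernel.
    return table.get(dst_ip, "default")


def process_batch(batch, table):
    # Every packet in the batch is processed independently, in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda ip: lookup(ip, table), batch))
```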
Limitations of Existing GPU-Based Routers

Memory mapping from the CPU's main memory to the GPU's device memory goes over the PCIe bus, with a peak bandwidth of 8 GB/s
• GPU throughput = 30x the CPU's without memory mapping
• Reduced to 5x the CPU's once the memory-mapping overhead is included

Cannot guarantee minimum latency for an individual packet
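The drop from 30x to 5x can be understood with a simple serialized-stage model. This is our illustration, not the paper's analysis: if the PCIe transfer and the GPU computation do not overlap, the effective speedup over the CPU is the harmonic combination of the two stages' speedups.

```python
# Illustrative model (our assumption, not from the paper): when the PCIe
# transfer and GPU processing serialize, the effective speedup over the
# CPU is 1 / (1/S_compute + 1/S_transfer).
def effective_speedup(s_compute, s_transfer):
    return 1.0 / (1.0 / s_compute + 1.0 / s_transfer)

# Under this model, a 30x compute speedup combined with a transfer stage
# running at ~6x the CPU's rate yields the observed ~5x overall speedup.
```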
Solution: Hermes
Architecture of the NVIDIA GTX480 – Shared Memory Hierarchy

Hermes: integrated CPU/GPU IP routing
• Lower packet-transfer overhead, via the shared memory hierarchy
• Lower per-packet latency, via adaptive warp scheduling
Adaptive Warp Issue

The CPU monitors the arrival pattern of packets and the resources available in the GPU.

[Figure: the CPU monitors the packets and writes the number of packets to be processed into a Task FIFO; data moves through the shared memory to an array of SMPs, with a minimum fetch granularity of one warp.]

Trade-off in the FIFO update granularity:
• Too large – average packet delay increases
• Too small – GPU fetch scheduling becomes complicated
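The CPU-side issue logic above can be sketched as follows. The warp size of 32 threads is the standard CUDA granularity; the policy details (issuing only whole warps, holding leftovers for the next round) are our assumptions about the mechanism, not the paper's exact algorithm.

```python
# Sketch of adaptive warp issue: the CPU accumulates arriving packets and
# pushes work into the Task FIFO in multiples of one warp (32 threads),
# the minimum fetch granularity. Policy details are our assumptions.
from collections import deque

WARP_SIZE = 32  # standard CUDA warp size


class TaskFIFO:
    def __init__(self):
        self.entries = deque()  # each entry: number of packets to process

    def issue(self, pending_packets):
        # Issue only whole warps; leftover packets wait for the next round,
        # trading a little latency for full-warp utilization.
        warps = pending_packets // WARP_SIZE
        if warps:
            self.entries.append(warps * WARP_SIZE)
        return pending_packets - warps * WARP_SIZE
```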
In-Order Commit

UDP users expect packets to arrive in order, so warps are committed in order.

[Figure: the Warp Allocator records the warp IDs in flight; a Lookup Table (LUT) maps each Delay Commit Queue (DCQ) entry ID to a warp ID; the Warp Scheduler issues warps to the shader cores; at the write-back stage the DCQ ensures that warps are committed in order.]
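The DCQ described above behaves like a reorder buffer: warps may finish out of order, but results leave only from the head. The sketch below is our simplified model of that mechanism, not the hardware design.

```python
# Sketch of in-order commit via a Delay Commit Queue (DCQ): warps may
# complete out of order, but results are released only from the head of
# the queue, in allocation order. A simplified model of the mechanism.
from collections import OrderedDict


class DelayCommitQueue:
    def __init__(self):
        self.slots = OrderedDict()  # warp_id -> result (None = in flight)

    def allocate(self, warp_id):
        self.slots[warp_id] = None  # record the warp as in flight

    def complete(self, warp_id, result):
        self.slots[warp_id] = result  # out-of-order completion is allowed

    def commit(self):
        # Release finished warps from the head only, preserving order.
        done = []
        while self.slots and next(iter(self.slots.values())) is not None:
            done.append(self.slots.popitem(last=False))
        return done
```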
Hardware and Area Overhead

Task FIFO
• 32 bits × 1028 entries
• Area = 0.053 mm²

Delay Commit Queue (DCQ)
• Size depends on the maximally allowed concurrent warps (MCWs) and the number of shader cores
• 8 bits × 1028 entries
• Area = 0.013 mm²

DCQ–Warp LUT
• Size depends on the number of MCWs
• 16 bits × 32 entries
• Area = 0.006 mm²

The hardware overhead is negligible!
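The raw storage implied by the sizes above can be checked with a quick calculation (storage bits only; the mm² figures come from the paper's synthesis, not from this arithmetic):

```python
# Quick check of the raw storage implied by the structure sizes above
# (storage bits only; the area figures come from the paper).
def storage_bits(width_bits, entries):
    return width_bits * entries

task_fifo = storage_bits(32, 1028)  # 32,896 bits ≈ 4.0 KiB
dcq       = storage_bits(8, 1028)   #  8,224 bits ≈ 1.0 KiB
lut       = storage_bits(16, 32)    #    512 bits
```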
Experimental Setup

The cycle-accurate GPGPU-Sim simulator is used to evaluate performance.

Benchmarks
• Checking the IP header, packet classification, routing table lookup, decrementing TTL, IP fragmentation, and deep packet inspection
• Both burst and sparse traffic patterns

QoS parameters: throughput, delay, delay variance
Throughput Evaluation

[Figures: computing rates of the benchmark applications under burst traffic without DCQ, and under sparse traffic without DCQ.]

• With no packet queueing, the baseline CPU/GPU is still unable to deliver at the input rate
• Hermes outperforms the baseline CPU/GPU by a factor of 5
• Better resource utilization with increasing MCW
Delay Analysis

[Figures: packet delay under burst traffic without DCQ, and delay with DCQ vs. without DCQ.]

• Simple processing on the GPU lets the CPU-side waiting overlap with GPU processing
• Packet delay is reduced by 81.2%!
• Divergent branches take longer to process, starving the packets
Conclusion

• Lack of QoS support and CPU-GPU communication overhead are the major bottlenecks of existing GPU-based routers
• Hermes is a closely coupled CPU-GPU solution that:
  • Meets stringent delay requirements
  • Enables QoS through optimized configuration
  • Requires only minimal hardware extensions
• A novel, high-quality packet-processing engine for future software routers
Discussion Points

• Are GPUs really easy to program for packet processing?
• How do the performance and area overheads compare with ASIC-based routers?
• Is router programmability really a crucial concern?