TRANSCRIPT
-
Evaluating the Potential of Graphics Processors for
High Performance Embedded Computing
Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng
Department of Micro-/Nano-electronics
Tsinghua University
-
Outline
Motivation
HPEC Implementation and Evaluation
Kernel Benchmarks
Synthetic Aperture Radar
Performance Comparison
Conclusion
-
HPEC: High Performance Embedded Computing
Future IT infrastructure demands even higher computing power:
High performance radar: 800 GFLOPS (Giga FLoating-point Operations Per Second)
4G wireless base station: 1 Gbit/s data rate per customer and up to 200 subscribers in service area
CMU driverless car: 270 GFLOPS
…
-
Implication
An increasing number of high performance embedded applications will be implemented with multi-core devices:
Intel: cluster-based Internet routers
IBM: signal processing and radar applications on the Cell processor
Huawei: multi-core base stations
…
Systematically evaluating the potential of GPUs:
Performance
Scalability
-
HPEC Challenge Benchmark
Developed by MIT Lincoln Laboratory*
Quantitatively evaluates HPEC systems
Kernel benchmarks: extracted from a broad range of signal and image processing applications
* The HPEC Challenge Benchmark Suite, R. Haney, T. Meuse, J. Kepner, HPEC 2006
-
Kernel Benchmarks

| Category | Benchmark | Description |
|---|---|---|
| Signal Processing | TDFIR | Time-domain finite impulse response filtering |
| Signal Processing | FDFIR | Frequency-domain finite impulse response filtering |
| Signal Processing | QR | QR factorization: prevalent in target recognition algorithms |
| Signal Processing | SVD | Singular value decomposition: produces a basis for the matrix as well as the rank for reducing interference |
| Signal Processing | CFAR | Constant false-alarm rate detection: find targets in an environment with varying background noise |
| Communication | CT | Corner turn (matrix transpose) to place radar data into a contiguous row for efficient FFT |
| Information Processing | PM | Pattern matching: identify stored tracks that match a target |
| Information Processing | GA | Graph optimization via genetic algorithm: removing uncorrelated data relations |
| Information Processing | DB | Database operations to store and query target tracks |
-
Benchmark Properties

| Benchmark | Data Set | Workload (MFLOP)* | Task-Level Parallelism | Data Structure | Data Size | Data Correlation | Memory Access |
|---|---|---|---|---|---|---|---|
| TDFIR | Set1 | 268.4 | 64 | Vector | 4096 | Low | Low |
| TDFIR | Set2 | 1.97 | 20 | Vector | 1024 | Low | Low |
| FDFIR | Set1 | 34 | 64 | Vector | 4096 | Low | Low |
| FDFIR | Set2 | 2.21 | 20 | Vector | 1024 | Low | Low |
| CT | Set1 | 2 | 1 | Matrix | 50x5000 | Very Low | Very High |
| CT | Set2 | 30 | 1 | Matrix | 750x5000 | Very Low | Very High |
| PM | Set1 | 1.21 | 72 | Vector | 64 | Low | Low |
| PM | Set2 | 13.59 | 256 | Vector | 128 | Low | Low |
| CFAR | Set1 | 0.17 | 384 | Vector | 64 | Medium | Low |
| CFAR | Set2 | 150.5 | 6144 | Vector | 3500 | Medium | Low |
| CFAR | Set3 | 41.1 | 3072 | Vector | 1909 | Medium | Low |
| CFAR | Set4 | 17.7 | 480 | Vector | 9900 | Medium | Low |
| GA | Set1 | 0.011 | 50 | Vector | 8 | Medium | High |
| GA | Set2 | 0.51 | 200 | Vector | 96 | Medium | High |
| GA | Set3 | 0.015 | 100 | Vector | 5 | Medium | High |
| GA | Set4 | 0.11 | 400 | Vector | 10 | Medium | High |
| QR | Set1 | 397 | 1 | Matrix | 500x100 | High | Medium |
| QR | Set2 | 30.5 | 1 | Matrix | 180x60 | High | Medium |
| QR | Set3 | 45 | 1 | Matrix | 150x150 | High | Medium |
| SVD | Set1 | 0.24 | 1 | Matrix | 500x100 | High | Medium |
| SVD | Set2 | 0.88 | 1 | Matrix | 180x60 | High | Medium |
| DB | Set1 | 440 | 1 | Tree | 440 | High | Very High |
| DB | Set2 | 700 | 1 | Tree | 700 | High | Very High |

* The workloads of CT and DB are measured in MB and Transactions, respectively
-
Implementation on GPU (1)
Plenty of data-level parallelism
Raw computing power
Loops of multiplication and accumulation (MAC)
TDFIR FDFIR CFAR
-
Implementation on GPU (2)
Plenty of task-level parallelism
Synchronization between blocks
PM GA
-
Implementation on GPU (3)
Memory access operations
Global memory access coalescing
Shared memory for local operations
CT DB
-
Implementation on GPU (4)
Advanced linear algebra operations
Hard to exploit parallelism
Pipelining the row updates of the matrix
QR SVD
(a) Thread assignment (b) Step 1 (c) Steps 2 and 3
-
Experiment Environment
CPU
Intel Core 2 Duo, 2.66GHz
4GB memory
GPU
NVIDIA Tesla C2050: 448 cores, 1.15GHz
3GB memory
DSP
ADSP-TS101S Tiger SHARC T2-PCI
8 DSP processors, 600MHz
24Mbits on-chip memory per DSP
-
Performance Comparison

| Kernels | Data Set | DSP Throughput (GFLOPS)* | CPU Throughput (GFLOPS)* | GPU Throughput (GFLOPS)* | Speedup (GPU vs. DSP / GPU vs. CPU) |
|---|---|---|---|---|---|
| TDFIR | Set 1 | 6.865 | 3.382 | 97.506 | 14.2/28.8 |
| TDFIR | Set 2 | 0.84 | 3.326 | 23.130 | 27.5/6.9 |
| FDFIR | Set 1 | 3.144 | 0.541 | 61.681 | 19.6/114.1 |
| FDFIR | Set 2 | 0.588 | 0.542 | 11.955 | 20.3/22.1 |
| CT | Set 1 | — | 1.194 | 17.177 | 14.3 |
| CT | Set 2 | — | 0.501 | 35.545 | 70.9 |
| PM | Set 1 | — | 0.871 | 7.761 | 8.9 |
| PM | Set 2 | — | 0.281 | 21.241 | 75.6 |
| CFAR | Set 1 | 0.488 | 1.154 | 2.234 | 4.5/1.9 |
| CFAR | Set 2 | 2.568 | 1.314 | 17.319 | 6.7/13.1 |
| CFAR | Set 3 | 2.408 | 1.313 | 13.962 | 5.8/10.6 |
| CFAR | Set 4 | 2.088 | 1.261 | 8.301 | 3.9/6.6 |
| GA | Set 1 | — | 0.562 | 1.177 | 2.1 |
| GA | Set 2 | — | 0.683 | 8.571 | 12.5 |
| GA | Set 3 | — | 0.441 | 0.589 | 1.4 |
| GA | Set 4 | — | 0.373 | 2.249 | 6.0 |
| QR | Set 1 | 1.552 | 1.704 | 54.309 | 34.9/31.8 |
| QR | Set 2 | 3.056 | 0.901 | 5.679 | 1.8/6.3 |
| QR | Set 3 | 2.408 | 0.904 | 6.686 | 2.7/7.4 |
| SVD | Set 1 | 2.576 | 0.747 | 4.175 | 1.6/5.6 |
| SVD | Set 2 | 0.6 | 0.791 | 2.684 | 4.5/3.4 |
| DB | Set 1 | — | 112.3 | 126.8 | 1.13 |
| DB | Set 2 | — | 5.794 | 8.459 | 1.46 |

*The throughputs of CT and DB are measured in Mbytes/s and Transactions/s, respectively.
-
Power Efficiency Comparison
CPU: 65W, GPU: 238W, DSP: 10W
The GPU suffers from low power efficiency
-
Synthetic Aperture Radar Benchmark
Simulating a sensor processing chain

| Data Set | Set 1 | Set 2 | Set 3 |
|---|---|---|---|
| Image Size | 382x266 | 762x512 | 1144x756 |
| Workload (MFLOP): FFT/IFFT | 28.61 | 113.38 | 259.92 |
| Workload (MFLOP): Match Filtering | 6.42 | 22.06 | 47.08 |
| Workload (MFLOP): Interpolation | 56.88 | 195.34 | 416.96 |
| Workload (MFLOP): Miscellaneous | 1.23 | 4.43 | 9.62 |
| Workload (MFLOP): Total | 93.14 | 335.21 | 733.58 |
-
Performance Result

| Data Set | Kernel | CPU Throughput (GFLOPS) | GPU Throughput (GFLOPS) | Speedup |
|---|---|---|---|---|
| Set 1 | FFT/IFFT | 0.463 | 5.259 | 11.3 |
| Set 1 | Filtering | 0.538 | 17.165 | 31.8 |
| Set 1 | Interpolation | 0.256 | 19.274 | 75.1 |
| Set 1 | Overall | 0.312 | 8.316 | 26.6 |
| Set 2 | FFT/IFFT | 0.581 | 9.252 | 15.9 |
| Set 2 | Filtering | 0.545 | 25.241 | 46.3 |
| Set 2 | Interpolation | 0.252 | 17.332 | 68.8 |
| Set 2 | Overall | 0.327 | 9.507 | 29.1 |
| Set 3 | FFT/IFFT | 0.832 | 15.155 | 18.2 |
| Set 3 | Filtering | 0.523 | 26.856 | 51.3 |
| Set 3 | Interpolation | 0.248 | 18.569 | 74.7 |
| Set 3 | Overall | 0.346 | 11.403 | 32.8 |
-
Overview of Optimization Techniques
Maximizing the usage of on-chip resources
Shared memory
Registers
Reducing memory access time
Global memory access coalescing
Overlapping transfers with computation
Reducing divergence
Warp-level parallelism
…
-
Architecture Implications
SIMD width: suitable for large vector computing
Dynamically configurable SIMD width according to the application
Shared memory superior to cache for embedded applications
Data prefetch is preferred
Special functions for specific applications
Dedicated efficient shuffle network for FFT, etc.
Power efficiency is quite low now
Reorganizing memory access patterns
New interconnect technologies: 3D stacking
-
Conclusion
Efficient implementations of the HPEC benchmarks on NVIDIA's Fermi
Performance comparison with CPU:
• Kernels: 10X speedup
• SAR: 30X speedup
A detailed analysis provides key insights:
Optimizing data-parallel algorithms
Bottlenecks of the GPU architecture for HPEC
Publications:
Design Automation and Test in Europe (DATE), March 2011
Journal of Parallel and Distributed Computing, submitted and under review
-
Thank You!