
  • Evaluating the Potential of Graphics Processors for High Performance Embedded Computing

    Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng

    Department of Micro-/Nano-electronics

    Tsinghua University

  • 2

    Outline

    Motivation

    HPEC Implementation and Evaluation

    Kernel Benchmarks

    Synthetic Aperture Radar

    Performance Comparison

    Conclusion

  • 3

    HPEC: High Performance Embedded Computing

    Future IT infrastructure demands ever higher computing power

    High performance radar: 800 GFLOPS (Giga FLoating-point Operations Per Second)

    4G wireless base station: 1 Gbit/s data rate per customer and up to 200 subscribers in the service area

    CMU driverless car: 270 GFLOPS

  • 4

    Implication

    An increasing number of high performance embedded applications will be implemented on multi-core devices

    Intel: cluster-based Internet routers

    IBM: signal processing and radar applications on the Cell processor

    Huawei: multi-core base stations

    This work systematically evaluates the potential of GPUs for HPEC

    Performance

    Scalability

  • 5

    HPEC Challenge Benchmark

    Developed by MIT Lincoln Laboratory*

    Quantitatively evaluates HPEC systems

    Kernel benchmarks: extracted from a broad range of signal and image processing applications

    * The HPEC Challenge Benchmark Suite, R. Haney, T. Meuse, J. Kepner, HPEC 2006

  • 6

    Kernel Benchmarks

    Category               | Benchmark | Description
    -----------------------|-----------|------------------------------------------------------------
    Signal Processing      | TDFIR     | Time-domain finite impulse response filtering
                           | FDFIR     | Frequency-domain finite impulse response filtering
                           | QR        | QR factorization: prevalent in target recognition algorithms
                           | SVD       | Singular value decomposition: produces a basis for the matrix as well as the rank for reducing interference
                           | CFAR      | Constant false-alarm rate detection: find targets in an environment with varying background noise
    Communication          | CT        | Corner turn (matrix transpose) to place radar data into a contiguous row for efficient FFT
    Information Processing | PM        | Pattern matching: identify stored tracks that match a target
                           | GA        | Graph optimization via genetic algorithm: removing uncorrelated data relations
                           | DB        | Database operations to store and query target tracks

  • 7

    Benchmark Properties

    Benchmark | Data Set | Workload (MFLOP)* | Task-Level Parallelism | Data Structure | Data Size | Data Correlation | Memory Access
    ----------|----------|-------------------|------------------------|----------------|-----------|------------------|--------------
    TDFIR     | Set 1    | 268.4             | 64                     | Vector         | 4096      | Low              | Low
              | Set 2    | 1.97              | 20                     | Vector         | 1024      | Low              | Low
    FDFIR     | Set 1    | 34                | 64                     | Vector         | 4096      | Low              | Low
              | Set 2    | 2.21              | 20                     | Vector         | 1024      | Low              | Low
    CT        | Set 1    | 2                 | 1                      | Matrix         | 50x5000   | Very Low         | Very High
              | Set 2    | 30                | 1                      | Matrix         | 750x5000  | Very Low         | Very High
    PM        | Set 1    | 1.21              | 72                     | Vector         | 64        | Low              | Low
              | Set 2    | 13.59             | 256                    | Vector         | 128       | Low              | Low
    CFAR      | Set 1    | 0.17              | 384                    | Vector         | 64        | Medium           | Low
              | Set 2    | 150.5             | 6144                   | Vector         | 3500      | Medium           | Low
              | Set 3    | 41.1              | 3072                   | Vector         | 1909      | Medium           | Low
              | Set 4    | 17.7              | 480                    | Vector         | 9900      | Medium           | Low
    GA        | Set 1    | 0.011             | 50                     | Vector         | 8         | Medium           | High
              | Set 2    | 0.51              | 200                    | Vector         | 96        | Medium           | High
              | Set 3    | 0.015             | 100                    | Vector         | 5         | Medium           | High
              | Set 4    | 0.11              | 400                    | Vector         | 10        | Medium           | High
    QR        | Set 1    | 397               | 1                      | Matrix         | 500x100   | High             | Medium
              | Set 2    | 30.5              | 1                      | Matrix         | 180x60    | High             | Medium
              | Set 3    | 45                | 1                      | Matrix         | 150x150   | High             | Medium
    SVD       | Set 1    | 0.24              | 1                      | Matrix         | 500x100   | High             | Medium
              | Set 2    | 0.88              | 1                      | Matrix         | 180x60    | High             | Medium
    DB        | Set 1    | 440               | 1                      | Tree           | 440       | High             | Very High
              | Set 2    | 700               | 1                      | Tree           | 700       | High             | Very High

    * The workloads of CT and DB are measured in MB and transactions, respectively

  • 8

    Implementation on GPU (1)

    Plenty of data-level parallelism

    Raw computing power

    Loops of multiplication and accumulation (MAC)

    TDFIR / FDFIR / CFAR
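The MAC loop at the heart of time-domain FIR filtering can be sketched in plain Python (an illustrative sketch, not the authors' CUDA kernel; the tiny input vector and filter are made up):

```python
# Time-domain FIR filtering: each output sample is a loop of
# multiply-accumulate (MAC) operations over the filter taps.
# On a GPU, each iteration of the outer loop is independent, so one
# thread can own each sample, exposing abundant data-level parallelism.

def tdfir(x, h):
    """Direct-form convolution of input vector x with filter h."""
    n, m = len(x), len(h)
    y = [0.0] * (n + m - 1)
    for i in range(n):          # parallelizable across threads
        for j in range(m):      # MAC loop over filter taps
            y[i + j] += x[i] * h[j]
    return y

# HPEC's TDFIR sets use 64/20 independent filters over vectors of
# length 4096/1024; here a tiny example:
print(tdfir([1.0, 2.0, 3.0], [1.0, 1.0]))  # [1.0, 3.0, 5.0, 3.0]
```

FDFIR performs the same filtering in the frequency domain via FFTs, and CFAR replaces the MAC with a sliding-window noise estimate, but all three share this regular loop structure.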

  • 9

    Implementation on GPU (2)

    Plenty of task-level parallelism

    Synchronization between blocks

    PM / GA
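The task-level parallelism of PM can be illustrated with a minimal Python sketch; the sum-of-absolute-differences metric and the tiny pattern library here are assumptions for illustration, not the benchmark's actual scoring function:

```python
# Pattern matching: score a target vector against every stored pattern.
# Each pattern comparison is an independent task, which is why PM maps
# naturally to task-level parallelism (e.g., one thread block per
# pattern on the GPU, with a final reduction to pick the winner).

def best_match(target, patterns):
    """Return the index of the stored pattern closest to the target."""
    def score(p):
        return sum(abs(a - b) for a, b in zip(target, p))
    return min(range(len(patterns)), key=lambda i: score(patterns[i]))

patterns = [[0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
print(best_match([1.0, 1.2], patterns))  # 2
```

The final min-reduction is where synchronization between blocks becomes necessary, as the slide notes.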

  • 10

    Implementation on GPU (3)

    Memory access operations

    Global memory access coalescing

    Shared memory for local operations

    CT / DB
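A tiled corner turn in the spirit of the coalescing strategy can be sketched as follows (a Python model of the tiling; the tile size and the 2x3 matrix are illustrative, and the real CUDA kernel stages each tile in shared memory so that both global reads and writes stay coalesced):

```python
# Corner turn (matrix transpose): a naive transpose makes either the
# reads or the writes strided in global memory. Processing the matrix
# tile by tile lets each thread block read a tile contiguously, turn it
# in on-chip shared memory, and write it out contiguously.

TILE = 2  # illustrative; real kernels use e.g. 16x16 or 32x32 tiles

def corner_turn(a, rows, cols):
    """Transpose a row-major flat matrix 'a' (rows x cols), tile by tile."""
    out = [0] * (rows * cols)
    for ti in range(0, rows, TILE):
        for tj in range(0, cols, TILE):
            # In CUDA, one thread block handles this tile via shared
            # memory; here we simply iterate over its elements.
            for i in range(ti, min(ti + TILE, rows)):
                for j in range(tj, min(tj + TILE, cols)):
                    out[j * rows + i] = a[i * cols + j]
    return out

a = [1, 2, 3,
     4, 5, 6]                # 2x3 matrix, row-major
print(corner_turn(a, 2, 3))  # [1, 4, 2, 5, 3, 6]
```

DB follows the same principle in reverse: its tree traversals produce irregular accesses, which is why it shows the "Very High" memory-access intensity in the properties table.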

  • 11

    Implementation on GPU (4)

    Advanced linear algebra operations

    Hard to exploit parallelism

    Pipelining the row updates of the matrix

    QR / SVD

    (a) Thread assignment (b) Step 1 (c) Steps 2 and 3
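The row-update structure that a pipelined QR can exploit is visible in a sequential Givens-rotation sketch (illustrative Python, not the CUDA implementation; each rotation touches only two rows, and the per-column updates within a rotation are independent, which is what the pipeline parallelizes):

```python
# QR factorization via Givens rotations: each rotation zeroes one
# subdiagonal entry and updates two rows of the matrix. The column loop
# inside each rotation is embarrassingly parallel, and rotations on
# disjoint row pairs can be pipelined across thread blocks.
import math

def givens_qr(a):
    """In-place QR of a square matrix (list of row lists); returns R."""
    n = len(a)
    for j in range(n):
        for i in range(n - 1, j, -1):   # zero column j bottom-up
            x, y = a[i - 1][j], a[i][j]
            r = math.hypot(x, y)
            if r == 0.0:
                continue
            c, s = x / r, y / r
            # Row update: rotate rows i-1 and i (parallel over columns)
            for k in range(j, n):
                t1, t2 = a[i - 1][k], a[i][k]
                a[i - 1][k] = c * t1 + s * t2
                a[i][k] = -s * t1 + c * t2
    return a

r = givens_qr([[3.0, 1.0], [4.0, 2.0]])
# R is upper triangular; R[0][0] = hypot(3, 4) = 5
print(round(r[0][0], 6), round(abs(r[1][0]), 6))
```

SVD has the same flavor: its dominant cost is a sequence of structured row/column updates with limited task-level parallelism, matching the "hard to exploit parallelism" point above.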

  • 12

    Experiment Environment

    CPU

    Intel Core 2 Duo, 2.66 GHz

    4 GB memory

    GPU

    NVIDIA Tesla C2050: 448 cores, 1.15 GHz

    3 GB memory

    DSP

    ADSP-TS101S TigerSHARC T2-PCI board

    8 DSP processors, 600 MHz

    24 Mbit on-chip memory per DSP

  • 13

    Performance Comparison

    Kernel | Data Set | DSP Throughput (GFLOPS)* | CPU Throughput (GFLOPS)* | GPU Throughput (GFLOPS)* | Speedup (vs. DSP / vs. CPU)
    -------|----------|--------------------------|--------------------------|--------------------------|----------------------------
    TDFIR  | Set 1    | 6.865                    | 3.382                    | 97.506                   | 14.2 / 28.8
           | Set 2    | 0.84                     | 3.326                    | 23.130                   | 27.5 / 6.9
    FDFIR  | Set 1    | 3.144                    | 0.541                    | 61.681                   | 19.6 / 114.1
           | Set 2    | 0.588                    | 0.542                    | 11.955                   | 20.3 / 22.1
    CT     | Set 1    | n/a                      | 1.194                    | 17.177                   | n/a / 14.3
           | Set 2    | n/a                      | 0.501                    | 35.545                   | n/a / 70.9
    PM     | Set 1    | n/a                      | 0.871                    | 7.761                    | n/a / 8.9
           | Set 2    | n/a                      | 0.281                    | 21.241                   | n/a / 75.6
    CFAR   | Set 1    | 0.488                    | 1.154                    | 2.234                    | 4.5 / 1.9
           | Set 2    | 2.568                    | 1.314                    | 17.319                   | 6.7 / 13.1
           | Set 3    | 2.408                    | 1.313                    | 13.962                   | 5.8 / 10.6
           | Set 4    | 2.088                    | 1.261                    | 8.301                    | 3.9 / 6.6
    GA     | Set 1    | n/a                      | 0.562                    | 1.177                    | n/a / 2.1
           | Set 2    | n/a                      | 0.683                    | 8.571                    | n/a / 12.5
           | Set 3    | n/a                      | 0.441                    | 0.589                    | n/a / 1.4
           | Set 4    | n/a                      | 0.373                    | 2.249                    | n/a / 6.0
    QR     | Set 1    | 1.552                    | 1.704                    | 54.309                   | 34.9 / 31.8
           | Set 2    | 3.056                    | 0.901                    | 5.679                    | 1.8 / 6.3
           | Set 3    | 2.408                    | 0.904                    | 6.686                    | 2.7 / 7.4
    SVD    | Set 1    | 2.576                    | 0.747                    | 4.175                    | 1.6 / 5.6
           | Set 2    | 0.6                      | 0.791                    | 2.684                    | 4.5 / 3.4
    DB     | Set 1    | n/a                      | 112.3                    | 126.8                    | n/a / 1.13
           | Set 2    | n/a                      | 5.794                    | 8.459                    | n/a / 1.46

    * The throughputs of CT and DB are measured in MB/s and transactions/s, respectively.

  • 14

    Power Efficiency Comparison

    CPU: 65 W, GPU: 238 W, DSP: 10 W

    The GPU suffers from low power efficiency
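Combining the power figures on this slide with the TDFIR Set 1 throughputs from the performance-comparison table makes the point concrete (a quick GFLOPS/W calculation, not a measurement from the talk): despite a roughly 29x raw speedup over the CPU, the 238 W GPU delivers only about 8x the CPU's GFLOPS/W and still trails the 10 W DSP.

```python
# Power efficiency in GFLOPS/W, computed from the slide's power figures
# and the TDFIR Set 1 throughputs reported in the performance table.
power = {"CPU": 65.0, "GPU": 238.0, "DSP": 10.0}          # watts
tdfir_set1 = {"CPU": 3.382, "GPU": 97.506, "DSP": 6.865}  # GFLOPS

for dev in ("DSP", "CPU", "GPU"):
    eff = tdfir_set1[dev] / power[dev]
    print(f"{dev}: {eff:.3f} GFLOPS/W")
```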

  • 15

    Synthetic Aperture Radar Benchmark

    Simulating a sensor processing chain

    Data Set            | Set 1   | Set 2   | Set 3
    --------------------|---------|---------|---------
    Image Size          | 382x266 | 762x512 | 1144x756
    Workload (MFLOP):   |         |         |
      FFT/IFFT          | 28.61   | 113.38  | 259.92
      Matched Filtering | 6.42    | 22.06   | 47.08
      Interpolation     | 56.88   | 195.34  | 416.96
      Miscellaneous     | 1.23    | 4.43    | 9.62
      Total             | 93.14   | 335.21  | 733.58

  • 16

    Performance Result

    Data Set | Kernel        | CPU Throughput (GFLOPS) | GPU Throughput (GFLOPS) | Speedup
    ---------|---------------|-------------------------|-------------------------|--------
    Set 1    | FFT/IFFT      | 0.463                   | 5.259                   | 11.3
             | Filtering     | 0.538                   | 17.165                  | 31.8
             | Interpolation | 0.256                   | 19.274                  | 75.1
             | Overall       | 0.312                   | 8.316                   | 26.6
    Set 2    | FFT/IFFT      | 0.581                   | 9.252                   | 15.9
             | Filtering     | 0.545                   | 25.241                  | 46.3
             | Interpolation | 0.252                   | 17.332                  | 68.8
             | Overall       | 0.327                   | 9.507                   | 29.1
    Set 3    | FFT/IFFT      | 0.832                   | 15.155                  | 18.2
             | Filtering     | 0.523                   | 26.856                  | 51.3
             | Interpolation | 0.248                   | 18.569                  | 74.7
             | Overall       | 0.346                   | 11.403                  | 32.8

  • 17

    Overview of Optimization Techniques

    Maximizing the usage of on-chip resources

    Shared memory

    Registers

    Reducing memory access time

    Coalesced global memory accesses

    Overlapping transfers with computation

    Reducing divergence

    Warp-level parallelism

    ……

  • 18

    Architecture Implications

    SIMD width: suitable for large vector computing

    Dynamically configurable SIMD width according to the application

    Shared memory is superior to cache for embedded applications

    Data prefetch is preferred

    Special functions for specific applications

    Dedicated, efficient shuffle network for FFT, etc.

    Power efficiency is currently quite low

    Reorganizing memory access patterns

    New interconnection technologies: 3D stacking

  • 19

    Conclusion

    Efficient implementations of the HPEC benchmarks on NVIDIA's Fermi

    Performance comparison with the CPU

    • Kernels: 10x speedup

    • SAR: 30x speedup

    A detailed analysis provides key insights

    Optimizing data-parallel algorithms

    Bottlenecks of the GPU architecture for HPEC

    Publications:

    Design Automation and Test in Europe (DATE), March 2011

    Journal of Parallel and Distributed Computing, submitted and under review

  • 20

    Thank You!