
  • Evaluating the Potential of Graphics Processors for High Performance Embedded Computing

    Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng

    Department of Micro-/Nano-electronics

    Tsinghua University

  • 2

    Outline

    Motivation

    HPEC Implementation and Evaluation

    Kernel Benchmarks

    Synthetic Aperture Radar

    Performance Comparison

    Conclusion

  • 3

    HPEC: High Performance Embedded Computing

    Future IT infrastructure demands ever higher computing power

    High performance radar: 800 GFLOPS (Giga FLoating-point Operations Per Second)

    4G wireless base station: 1 Gbit/s data rate per customer and up to 200 subscribers in the service area

    CMU driverless car: 270 GFLOPS

  • 4

    Implication

    An increasing number of high performance embedded applications will be implemented on multi-core devices

    Intel: cluster-based Internet routers

    IBM: signal processing and radar applications on the Cell processor

    Huawei: multi-core base stations

    This work systematically evaluates the potential of GPUs for HPEC

    Performance

    Scalability

  • 5

    HPEC Challenge Benchmark

    Developed by MIT Lincoln Laboratory*

    Quantitatively evaluates HPEC systems

    Kernel benchmarks: extracted from a broad range of signal and image processing applications

    * The HPEC Challenge Benchmark Suite, R. Haney, T. Meuse, J. Kepner, HPEC 2006

  • 6

    Kernel Benchmarks

    Category               | Benchmark | Description
    -----------------------|-----------|------------------------------------------------------------
    Signal Processing      | TDFIR     | Time-domain finite impulse response filtering
                           | FDFIR     | Frequency-domain finite impulse response filtering
                           | QR        | QR factorization: prevalent in target recognition algorithms
                           | SVD       | Singular value decomposition: produces a basis for the matrix as well as the rank for reducing interference
                           | CFAR      | Constant false-alarm rate detection: find targets in an environment with varying background noise
    Communication          | CT        | Corner turn (matrix transpose) to place radar data into a contiguous row for efficient FFT
    Information Processing | PM        | Pattern matching: identify stored tracks that match a target
                           | GA        | Graph optimization via genetic algorithm: removing uncorrelated data relations
                           | DB        | Database operations to store and query target tracks

  • 7

    Benchmark Properties

    Benchmark | Data Set | Workload (MFLOP)* | Task-Level Parallelism | Data Structure | Data Size | Data Correlation | Memory Access
    ----------|----------|-------------------|------------------------|----------------|-----------|------------------|--------------
    TDFIR     | Set 1    | 268.4             | 64                     | Vector         | 4096      | Low              | Low
              | Set 2    | 1.97              | 20                     | Vector         | 1024      | Low              | Low
    FDFIR     | Set 1    | 34                | 64                     | Vector         | 4096      | Low              | Low
              | Set 2    | 2.21              | 20                     | Vector         | 1024      | Low              | Low
    CT        | Set 1    | 2                 | 1                      | Matrix         | 50x5000   | Very Low         | Very High
              | Set 2    | 30                | 1                      | Matrix         | 750x5000  | Very Low         | Very High
    PM        | Set 1    | 1.21              | 72                     | Vector         | 64        | Low              | Low
              | Set 2    | 13.59             | 256                    | Vector         | 128       | Low              | Low
    CFAR      | Set 1    | 0.17              | 384                    | Vector         | 64        | Medium           | Low
              | Set 2    | 150.5             | 6144                   | Vector         | 3500      | Medium           | Low
              | Set 3    | 41.1              | 3072                   | Vector         | 1909      | Medium           | Low
              | Set 4    | 17.7              | 480                    | Vector         | 9900      | Medium           | Low
    GA        | Set 1    | 0.011             | 50                     | Vector         | 8         | Medium           | High
              | Set 2    | 0.51              | 200                    | Vector         | 96        | Medium           | High
              | Set 3    | 0.015             | 100                    | Vector         | 5         | Medium           | High
              | Set 4    | 0.11              | 400                    | Vector         | 10        | Medium           | High
    QR        | Set 1    | 397               | 1                      | Matrix         | 500x100   | High             | Medium
              | Set 2    | 30.5              | 1                      | Matrix         | 180x60    | High             | Medium
              | Set 3    | 45                | 1                      | Matrix         | 150x150   | High             | Medium
    SVD       | Set 1    | 0.24              | 1                      | Matrix         | 500x100   | High             | Medium
              | Set 2    | 0.88              | 1                      | Matrix         | 180x60    | High             | Medium
    DB        | Set 1    | 440               | 1                      | Tree           | 440       | High             | Very High
              | Set 2    | 700               | 1                      | Tree           | 700       | High             | Very High

    * The workloads of CT and DB are measured in MB and transactions, respectively

  • 8

    Implementation on GPU (1)

    Plenty of data-level parallelism

    Raw computing power

    Loops of multiplication and accumulation (MAC)

    TDFIR / FDFIR / CFAR
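The MAC loop at the heart of time-domain FIR filtering can be sketched in plain Python (an illustrative sketch, not the authors' CUDA kernel; the tiny input vector and filter are made up):

```python
# Time-domain FIR filtering: each output sample is a loop of
# multiply-accumulate (MAC) operations over the filter taps.
# On a GPU, each iteration of the outer loop is independent, so one
# thread can own each sample, exposing abundant data-level parallelism.

def tdfir(x, h):
    """Direct-form convolution of input vector x with filter h."""
    n, m = len(x), len(h)
    y = [0.0] * (n + m - 1)
    for i in range(n):          # parallelizable across threads
        for j in range(m):      # MAC loop over filter taps
            y[i + j] += x[i] * h[j]
    return y

# HPEC's TDFIR sets use 64/20 independent filters over vectors of
# length 4096/1024; here a tiny example:
print(tdfir([1.0, 2.0, 3.0], [1.0, 1.0]))  # [1.0, 3.0, 5.0, 3.0]
```

FDFIR performs the same filtering in the frequency domain via FFTs, and CFAR replaces the MAC with a sliding-window noise estimate, but all three share this regular loop structure.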

  • 9

    Implementation on GPU (2)

    Plenty of task-level parallelism

    Synchronization between blocks

    PM / GA
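The task-level parallelism of PM can be illustrated with a minimal Python sketch; the sum-of-absolute-differences metric and the tiny pattern library here are assumptions for illustration, not the benchmark's actual scoring function:

```python
# Pattern matching: score a target vector against every stored pattern.
# Each pattern comparison is an independent task, which is why PM maps
# naturally to task-level parallelism (e.g., one thread block per
# pattern on the GPU, with a final reduction to pick the winner).

def best_match(target, patterns):
    """Return the index of the stored pattern closest to the target."""
    def score(p):
        return sum(abs(a - b) for a, b in zip(target, p))
    return min(range(len(patterns)), key=lambda i: score(patterns[i]))

patterns = [[0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
print(best_match([1.0, 1.2], patterns))  # 2
```

The final min-reduction is where synchronization between blocks becomes necessary, as the slide notes.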

  • 10

    Implementation on GPU (3)

    Memory access operations

    Global memory access coalescing

    Shared memory for local operations

    CT / DB
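A tiled corner turn in the spirit of the coalescing strategy can be sketched as follows (a Python model of the tiling; the tile size and the 2x3 matrix are illustrative, and the real CUDA kernel stages each tile in shared memory so that both global reads and writes stay coalesced):

```python
# Corner turn (matrix transpose): a naive transpose makes either the
# reads or the writes strided in global memory. Processing the matrix
# tile by tile lets each thread block read a tile contiguously, turn it
# in on-chip shared memory, and write it out contiguously.

TILE = 2  # illustrative; real kernels use e.g. 16x16 or 32x32 tiles

def corner_turn(a, rows, cols):
    """Transpose a row-major flat matrix 'a' (rows x cols), tile by tile."""
    out = [0] * (rows * cols)
    for ti in range(0, rows, TILE):
        for tj in range(0, cols, TILE):
            # In CUDA, one thread block handles this tile via shared
            # memory; here we simply iterate over its elements.
            for i in range(ti, min(ti + TILE, rows)):
                for j in range(tj, min(tj + TILE, cols)):
                    out[j * rows + i] = a[i * cols + j]
    return out

a = [1, 2, 3,
     4, 5, 6]                # 2x3 matrix, row-major
print(corner_turn(a, 2, 3))  # [1, 4, 2, 5, 3, 6]
```

DB follows the same principle in reverse: its tree traversals produce irregular accesses, which is why it shows the "Very High" memory-access intensity in the properties table.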

  • 11

    Implementation on GPU (4)

    Advanced linear algebra operations

    Hard to exploit parallelism

    Pipelining the row updates of the matrix

    QR / SVD

    (a) Thread assignment (b) Step 1 (c) Steps 2 and 3
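The row-update structure that a pipelined QR can exploit is visible in a sequential Givens-rotation sketch (illustrative Python, not the CUDA implementation; each rotation touches only two rows, and the per-column updates within a rotation are independent, which is what the pipeline parallelizes):

```python
# QR factorization via Givens rotations: each rotation zeroes one
# subdiagonal entry and updates two rows of the matrix. The column loop
# inside each rotation is embarrassingly parallel, and rotations on
# disjoint row pairs can be pipelined across thread blocks.
import math

def givens_qr(a):
    """In-place QR of a square matrix (list of row lists); returns R."""
    n = len(a)
    for j in range(n):
        for i in range(n - 1, j, -1):   # zero column j bottom-up
            x, y = a[i - 1][j], a[i][j]
            r = math.hypot(x, y)
            if r == 0.0:
                continue
            c, s = x / r, y / r
            # Row update: rotate rows i-1 and i (parallel over columns)
            for k in range(j, n):
                t1, t2 = a[i - 1][k], a[i][k]
                a[i - 1][k] = c * t1 + s * t2
                a[i][k] = -s * t1 + c * t2
    return a

r = givens_qr([[3.0, 1.0], [4.0, 2.0]])
# R is upper triangular; R[0][0] = hypot(3, 4) = 5
print(round(r[0][0], 6), round(abs(r[1][0]), 6))
```

SVD has the same flavor: its dominant cost is a sequence of structured row/column updates with limited task-level parallelism, matching the "hard to exploit parallelism" point above.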

  • 12

    Experiment Environment

    CPU

    Intel Core 2 Duo, 2.66 GHz

    4 GB memory

    GPU

    NVIDIA Tesla C2050: 448 cores, 1.15 GHz

    3 GB memory

    DSP

    ADSP-TS101S TigerSHARC T2-PCI board

    8 DSP processors, 600 MHz

    24 Mbit on-chip memory per DSP

  • 13

    Performance Comparison

    Kernel | Data Set | DSP Throughput (GFLOPS)* | CPU Throughput (GFLOPS)* | GPU Throughput (GFLOPS)* | Speedup (vs. DSP / vs. CPU)
    -------|----------|--------------------------|--------------------------|--------------------------|----------------------------
    TDFIR  | Set 1    | 6.865                    | 3.382                    | 97.506                   | 14.2 / 28.8
           | Set 2    | 0.84                     | 3.326                    | 23.130                   | 27.5 / 6.9
    FDFIR  | Set 1    | 3.144                    | 0.541                    | 61.681                   | 19.6 / 114.1
           | Set 2    | 0.588                    | 0.542                    | 11.955                   | 20.3 / 22.1
    CT     | Set 1    | n/a                      | 1.194                    | 17.177                   | n/a / 14.3
           | Set 2    | n/a                      | 0.501                    | 35.545                   | n/a / 70.9
    PM     | Set 1    | n/a                      | 0.871                    | 7.761                    | n/a / 8.9
           | Set 2    | n/a                      | 0.281                    | 21.241                   | n/a / 75.6
    CFAR   | Set 1    | 0.488                    | 1.154                    | 2.234                    | 4.5 / 1.9
           | Set 2    | 2.568                    | 1.314                    | 17.319                   | 6.7 / 13.1
           | Set 3    | 2.408                    | 1.313                    | 13.962                   | 5.8 / 10.6
           | Set 4    | 2.088                    | 1.261                    | 8.301                    | 3.9 / 6.6
    GA     | Set 1    | n/a                      | 0.562                    | 1.177                    | n/a / 2.1
           | Set 2    | n/a                      | 0.683                    | 8.571                    | n/a / 12.5
           | Set 3    | n/a                      | 0.441                    | 0.589                    | n/a / 1.4
           | Set 4    | n/a                      | 0.373                    | 2.249                    | n/a / 6.0
    QR     | Set 1    | 1.552                    | 1.704                    | 54.309                   | 34.9 / 31.8
           | Set 2    | 3.056                    | 0.901                    | 5.679                    | 1.8 / 6.3
           | Set 3    | 2.408                    | 0.904                    | 6.686                    | 2.7 / 7.4
    SVD    | Set 1    | 2.576                    | 0.747                    | 4.175                    | 1.6 / 5.6
           | Set 2    | 0.6                      | 0.791                    | 2.684                    | 4.5 / 3.4
    DB     | Set 1    | n/a                      | 112.3                    | 126.8                    | n/a / 1.13
           | Set 2    | n/a                      | 5.794                    | 8.459                    | n/a / 1.46

    * The throughputs of CT and DB are measured in MB/s and transactions/s, respectively.

  • 14

    Power Efficiency Comparison

    CPU: 65 W, GPU: 238 W, DSP: 10 W

    The GPU suffers from low power efficiency
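Combining the power figures on this slide with the TDFIR Set 1 throughputs from the performance-comparison table makes the point concrete (a quick GFLOPS/W calculation, not a measurement from the talk): despite a roughly 29x raw speedup over the CPU, the 238 W GPU delivers only about 8x the CPU's GFLOPS/W and still trails the 10 W DSP.

```python
# Power efficiency in GFLOPS/W, computed from the slide's power figures
# and the TDFIR Set 1 throughputs reported in the performance table.
power = {"CPU": 65.0, "GPU": 238.0, "DSP": 10.0}          # watts
tdfir_set1 = {"CPU": 3.382, "GPU": 97.506, "DSP": 6.865}  # GFLOPS

for dev in ("DSP", "CPU", "GPU"):
    eff = tdfir_set1[dev] / power[dev]
    print(f"{dev}: {eff:.3f} GFLOPS/W")
```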

  • 15

    Synthetic Aperture Radar Benchmark

    Simulating a sensor processing chain

    Data Set            | Set 1   | Set 2   | Set 3
    --------------------|---------|---------|---------
    Image Size          | 382x266 | 762x512 | 1144x756
    Workload (MFLOP):   |         |         |
      FFT/IFFT          | 28.61   | 113.38  | 259.92
      Matched Filtering | 6.42    | 22.06   | 47.08
      Interpolation     | 56.88   | 195.34  | 416.96
      Miscellaneous     | 1.23    | 4.43    | 9.62
      Total             | 93.14   | 335.21  | 733.58

  • 16

    Performance Result

    Data Set | Kernel        | CPU Throughput (GFLOPS) | GPU Throughput (GFLOPS) | Speedup
    ---------|---------------|-------------------------|-------------------------|--------
    Set 1    | FFT/IFFT      | 0.463                   | 5.259                   | 11.3
             | Filtering     | 0.538                   | 17.165                  | 31.8
             | Interpolation | 0.256                   | 19.274                  | 75.1
             | Overall       | 0.312                   | 8.316                   | 26.6
    Set 2    | FFT/IFFT      | 0.581                   | 9.252                   | 15.9
             | Filtering     | 0.545                   | 25.241                  | 46.3
             | Interpolation | 0.252                   | 17.332                  | 68.8
             | Overall       | 0.327                   | 9.507                   | 29.1
    Set 3    | FFT/IFFT      | 0.832                   | 15.155                  | 18.2
             | Filtering     | 0.523                   | 26.856                  | 51.3
             | Interpolation | 0.248                   | 18.569                  | 74.7
             | Overall       | 0.346                   | 11.403                  | 32.8

  • 17

    Overview of Optimization Techniques

    Maximizing the usage of on-chip resources

    Shared memory

    Registers

    Reducing memory access time

    Coalesced global memory accesses

    Overlapping transfers with computation

    Reducing divergence

    Warp-level parallelism

    ……

  • 18

    Architecture Implications

    SIMD width: suitable for large vector computing

    Dynamically configurable SIMD width according to the application

    Shared memory is superior to cache for embedded applications

    Data prefetch is preferred

    Special functions for specific applications

    Dedicated, efficient shuffle network for FFT, etc.

    Power efficiency is currently quite low

    Reorganizing memory access patterns

    New interconnection technologies: 3D stacking

  • 19

    Conclusion

    Efficient implementations of the HPEC benchmarks on NVIDIA's Fermi

    Performance comparison with the CPU

    • Kernels: 10x speedup

    • SAR: 30x speedup

    A detailed analysis provides key insights

    Optimizing data-parallel algorithms

    Bottlenecks of the GPU architecture for HPEC

    Publications:

    Design Automation and Test in Europe (DATE), March 2011

    Journal of Parallel and Distributed Computing, submitted and under review

  • 20

    Thank You!