Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

Presented by: Ahmad Lashgar
ECE Department, University of Tehran
Seminar of Parallel Processing. Instructor: Dr. Fakhraie
29 Dec 11

ISCA 2010
Original authors: Victor W Lee et al.
Intel Corporation

Some slides are included from original paper only for educational purposes



Abstract

• Is the GPU the silver bullet of parallel computing?
• How large is the gap between peak and achievable performance?


Overview

• Abstract
• Architecture
  – CPU: Intel Core i7
  – GPU: Nvidia GTX280
• Implications for throughput computing applications
• Methodology
• Results
• Analyzing the results
• Platform optimization guides
• Conclusion


Architecture (1)

• Intel Core i7-960
  – 4 cores, 3.2 GHz
  – 2-way multi-threading
  – 4-wide SIMD
  – L1 32KB, L2 256KB, L3 8MB
  – 32 GB/s memory bandwidth

[DIXON’2010]


Architecture (2)

• Nvidia GTX280
  – 30 cores, 1.3 GHz
  – 1024-way multi-threading
  – 8-way SIMD
  – 16KB software-managed cache (shared memory)
  – 141 GB/s memory bandwidth

[LINDHOLM’2008]


Architecture (3)

                          Core i7-960       GTX280
Cores                     4                 30
Frequency (GHz)           3.2               1.3
Transistors               0.7B (263 mm²)    1.4B (576 mm²)
Memory Bandwidth (GB/s)   32                141
SP SIMD                   4                 8
DP SIMD                   2                 1
Peak SP scalar GFLOPS     25.6              116.6
Peak SP SIMD GFLOPS       102.4             311.1 (933.1)
Peak DP SIMD GFLOPS       51.2              77.8

Red text marks numbers that are not the authors'.
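
As a rough sketch, the peak figures in the table can be reproduced from cores × clock × SIMD width × floating-point operations per lane per cycle. The per-cycle factors below are assumptions chosen to match the table (2 for the CPU, counting a multiply plus an add; 1 for the GPU's base figure and 3 for the figure in parentheses). The 1.296 GHz GPU clock is also an assumption standing in for the rounded 1.3 GHz, and `peak_gflops` is a hypothetical helper name.

```python
def peak_gflops(cores, ghz, simd_width, flops_per_lane_per_cycle):
    """Peak GFLOPS = cores x clock (GHz) x SIMD lanes x flops per lane per cycle."""
    return cores * ghz * simd_width * flops_per_lane_per_cycle


# Core i7-960: 4 cores at 3.2 GHz.
# Assumption: 2 flops per lane per cycle (one multiply plus one add).
cpu_sp_scalar = peak_gflops(4, 3.2, 1, 2)
cpu_sp_simd = peak_gflops(4, 3.2, 4, 2)
cpu_dp_simd = peak_gflops(4, 3.2, 2, 2)

# GTX280: 30 cores, 8-wide SIMD.
# Assumption: 1.296 GHz is the exact clock behind the rounded 1.3 GHz.
gpu_sp_simd = peak_gflops(30, 1.296, 8, 1)      # base figure
gpu_sp_simd_max = peak_gflops(30, 1.296, 8, 3)  # figure in parentheses

print(f"CPU SP scalar: {cpu_sp_scalar:.1f} GFLOPS")
print(f"CPU SP SIMD:   {cpu_sp_simd:.1f} GFLOPS")
print(f"CPU DP SIMD:   {cpu_dp_simd:.1f} GFLOPS")
print(f"GPU SP SIMD:   {gpu_sp_simd:.1f} GFLOPS (max {gpu_sp_simd_max:.1f})")
```

Under these assumptions the CPU values match the table (25.6, 102.4 and 51.2), and the GPU values land within about 0.1 of the table's 311.1 and 933.1.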


Implications for throughput computing applications

1. Number of cores difference
2. Cache size/multi-threading
3. Bandwidth difference


1. Number of cores difference

• It is all about the core complexity:
  – The common goal: improving pipeline efficiency
  – CPU goal: single-thread performance
    • Exploiting ILP
    • Sophisticated branch predictor
    • Multiple issue logic
  – GPU goal: throughput
    • Interleaving hundreds of threads


2. Cache size/multi-threading

• CPU goal: reducing memory latency
  – Programmer-transparent data caching
    • Increasing the cache size to capture the working set
  – Prefetching (HW/SW)
• GPU goal: hiding memory latency
  – Interleave the execution of hundreds of threads so that they hide each other's latency
• Notice:
  – The CPU also uses multi-threading for latency hiding
  – The GPU also uses software-controlled caching (shared memory) for reducing memory latency
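
A minimal sketch of why latency hiding needs so many threads, under a simplified model that is not taken from the paper: each thread computes for a fixed number of cycles and then waits on memory, and the core stays busy only if the other threads can cover that wait. The function name and all cycle counts are hypothetical.

```python
import math


def threads_to_hide_latency(latency_cycles, work_cycles):
    """Threads needed so that some thread is always ready to run.

    Model: each thread computes for `work_cycles`, then waits `latency_cycles`
    for memory. While one thread waits, the remaining threads must supply at
    least `latency_cycles` of work: (n - 1) * work_cycles >= latency_cycles.
    """
    return math.ceil(latency_cycles / work_cycles) + 1


# Hypothetical numbers, chosen only for illustration.
print(threads_to_hide_latency(400, 20))  # 21
print(threads_to_hide_latency(400, 4))   # 101
```

The less work a thread does between memory accesses, the more threads must be interleaved, which is consistent with the GPU keeping hundreds of threads in flight per core.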


3. Bandwidth difference

• Bandwidth versus latency
• CPU goal: single-thread performance
  – Workloads do not demand many memory accesses
  – Bring the data as soon as possible
• GPU goal: throughput
  – There are many memory accesses, so provide high bandwidth
  – Latency does not matter; the core will hide it!


Methodology (1)

• Hardware
  – Intel Core i7-960, 6GB DRAM, GTX280 1GB
• Software
  – SUSE Enterprise 11
  – CUDA Toolkit 2.3


Methodology (2)

• Optimizations
  – On CPU:
    • SGEMM, SpMV and FFT from Intel MKL 10
    • Always 2 threads per core
  – On GPU:
    • Best possible algorithm for SpMV, FFT and MC
    • Often 128 to 256 threads per core (to leverage shared memory and register-file usage)
  – Interleaving GPU execution and host-to-device/device-to-host (HD/DH) memory transfers where possible


Results

• The HD/DH data transfer time is not considered
• Only 2.5X on average
  – Far from what previous research reported (100X)


Where does the speedup in previous research come from?!

• Which CPU and GPU are compared?
• How much optimization is performed on the CPU and on the GPU?
  – Where both platforms were optimized, a much lower speedup was reported (as in this paper)


Analyzing the results (1)

1. Bandwidth
2. Compute flops (single precision)
3. Compute flops (double precision)
4. Reduction and synchronization
5. Fixed function


Analyzing the results (2)

1. Bandwidth
  – Peak: GTX280/Core i7-960 ~ 4.7X
  – Feature: large working set; performance is bounded by the bandwidth
  – Examples
    • SAXPY (5.3X)
    • LBM (5X)
    • SpMV (1.9X)
      – CPU benefits from caching
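
To illustrate why a kernel such as SAXPY is bounded by bandwidth, the sketch below estimates its attainable GFLOPS from memory bandwidth alone. It assumes 4-byte floats and no cache reuse, so each element costs 2 flops and 12 bytes of traffic; these traffic assumptions and the helper names are not from the paper. The bandwidth values are the 32 and 141 GB/s from the architecture slides.

```python
def saxpy(a, x, y):
    """Return a*x + y element-wise (single-precision in the original kernel)."""
    return [a * xi + yi for xi, yi in zip(x, y)]


FLOPS_PER_ELEMENT = 2   # one multiply and one add
BYTES_PER_ELEMENT = 12  # assumed: read x[i], read y[i], write y[i], 4 bytes each


def bandwidth_bound_gflops(bandwidth_gb_s):
    """Attainable GFLOPS when memory traffic is the only limit."""
    intensity = FLOPS_PER_ELEMENT / BYTES_PER_ELEMENT  # flops per byte
    return bandwidth_gb_s * intensity


cpu = bandwidth_bound_gflops(32)    # Core i7-960
gpu = bandwidth_bound_gflops(141)   # GTX280

print(saxpy(2.0, [1.0, 2.0], [10.0, 20.0]))  # [12.0, 24.0]
print(f"CPU bound: {cpu:.1f} GFLOPS, GPU bound: {gpu:.1f} GFLOPS, ratio: {gpu / cpu:.1f}X")
```

Both bounds sit far below the platforms' peak compute rates, so in this model the speedup tracks the bandwidth ratio. With the rounded table values that ratio comes out near 4.4X, slightly under the ~4.7X quoted above; the slides do not say how the 4.7X figure was derived.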


Analyzing the results (3)

2. Compute Flops (Single Precision)
  – Peak: GTX280/Core i7-960 ~ 3X
  – Feature: bounded by computation, benefits from more cores
  – Examples
    • SGEMM, Conv and FFT (2.8-4X)


Analyzing the results (4)

3. Compute Flops (Double Precision)
  – Peak: GTX280/Core i7-960 ~ 1.5X
  – Feature: bounded by computation, benefits from more cores
  – Examples
    • MC (1.8X)
    • Blitz (5X)
      – Uses transcendental operations
    • Sort (1.25X slower)
      – Due to decrease in SIMD width usage
      – Depends on scalar performance


Analyzing the results (5)

4. Reduction and Synchronization
  – Feature: the more threads, the higher the synchronization overhead
  – Examples
    • Hist (1.8X)
      – On CPU, 28% of the time is spent on atomic operations
      – On GPU, the atomic operations are much slower
    • Solv (1.9X slower)
      – Multiple kernel launches to preserve cache coherency on GPU


Analyzing the results (6)

5. Fixed function
  – Feature: interpolation, texturing and transcendental operations are a bonus on GPU
  – Examples
    • Bilat (5.7X)
      – On CPU, 66% of the time is spent on transcendental operations
    • GJK (14.9X)
      – Uses texture lookup


Platform optimization guides

• CPU programmers have heavily relied on increasing clock frequency
• Their applications do not benefit from TLP and DLP
• Today's CPUs use wider SIMD, which stays idle if not exploited by the programmer (or compiler)
• This paper showed that careful multi-threading can reduce the gap heavily
  – For LBM, from 114X down to 5X
• Let’s learn some optimization tips from the authors


CPU optimization

• Scalability (4X):
  – Scale the kernel with the number of threads
• Blocking (5X):
  – Be aware of cache hierarchy and use it efficiently
• Regularizing (1.5X):
  – Align the data regularly to take advantage of SIMD
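
The blocking tip can be sketched with a tiled matrix multiply. This is a generic loop-tiling illustration, not the authors' code: the function names and the `block` parameter are hypothetical, and plain Python shows only the loop structure, not the cache benefit. In compiled code the tile size would be chosen so that the tiles being worked on fit in cache.

```python
def matmul_naive(a, b, n):
    """Reference n x n multiply: streams through all of b for every row of a."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, block=2):
    """Same arithmetic, reordered so each block x block tile is reused while it is hot."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Work on one tile; min() handles n not divisible by block.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]
print(matmul_blocked(a, b, 3) == matmul_naive(a, b, 3))  # True
```

Both versions perform the same multiply-adds; only the traversal order changes, which is what lets a real implementation keep the working set inside the cache hierarchy.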


GPU optimization

• Global synchronization
  – Reduce the atomic operations
• Shared memory
  – Use shared memory to reduce off-chip demand
  – Shared memory is multi-banked and is efficient for gather/scatter operations
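
One common way to reduce atomic operations in a histogram is privatization: each worker fills a private copy and only the merged counts touch the shared result. The sketch below is a sequential Python model of that idea, not the authors' implementation. It only counts how many updates would hit the shared histogram, and the function names and worker count are hypothetical.

```python
def histogram_shared(data, bins):
    """Every element updates the shared histogram directly."""
    hist = [0] * bins
    shared_updates = 0
    for v in data:
        hist[v] += 1            # would be one atomic add per element
        shared_updates += 1
    return hist, shared_updates


def histogram_privatized(data, bins, workers):
    """Each worker counts into a private copy, then merges it once."""
    hist = [0] * bins
    shared_updates = 0
    chunk = (len(data) + workers - 1) // workers
    for w in range(workers):
        local = [0] * bins      # private copy, e.g. held in shared memory
        for v in data[w * chunk:(w + 1) * chunk]:
            local[v] += 1       # no contention with other workers
        for b in range(bins):
            if local[b]:
                hist[b] += local[b]   # one shared update per non-empty bin
                shared_updates += 1
    return hist, shared_updates


data = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3] * 10
h1, u1 = histogram_shared(data, 4)
h2, u2 = histogram_privatized(data, 4, workers=4)
print(h1 == h2, u1, u2)  # True 100 16
```

In this model the number of contended updates drops from one per element to at most one per bin per worker.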


Conclusion

• This work analyzed the performance of important throughput computing kernels on CPU and GPU
  – The gap is much lower than previous reports (~2.5X)
• Recommendation for a throughput computing architecture:
  – High compute
  – High bandwidth
  – Large cache
  – Gather/scatter support
  – Efficient synchronization
  – Fixed function units


Thank you for your attention. Any questions?


References

[LEE’2010] V. W. Lee et al., Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU, ISCA 2010

[DIXON’2010] M. Dixon et al., The Next-Generation Intel® Core™ Microarchitecture, Intel® Technology Journal, Volume 14, Issue 3, 2010

[LINDHOLM’2008] E. Lindholm et al., NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, 2008
