Debunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

Victor W. Lee, et al., Intel Corporation

ISCA ’10, June 19-23, 2010, Saint-Malo, France

Mythbusters view on the topic

CPU vs GPU: http://videosift.com/video/MythBusters-CPU-vs-GPU-or-Paintball-Cannons-are-Cool

Full movie: http://www.nvidia.com/object/nvision08_gpu_v_cpu.html

The Initial Claim

Over the past 4 years NVIDIA has made a great many claims regarding how porting various types of applications to run on GPUs instead of CPUs can tremendously improve performance by anywhere from 10x to 500x.

But it actually began much earlier (SIGGRAPH 2004): http://pl887.pairlitesite.com/talks/2004-08-08-GP2-CPU-vs-GPU-BillMark.pdf

Intel’s Response?

Intel, unsurprisingly, sees the situation differently, but has remained relatively quiet on the issue, possibly because Larrabee was going to be positioned as a discrete GPU.

Intel’s Response? (cont)

The recent announcement that Larrabee has been repurposed as an HPC/scientific computing solution may therefore be partially responsible for Intel ramping up an offensive against NVIDIA's claims regarding GPU computing.

At the International Symposium on Computer Architecture (ISCA) this June, a team from Intel presented a whitepaper purporting to investigate the real-world performance delta between CPUs and GPUs.

But before that….

December 16, 2009: one month after ISCA’s final papers were due.

The Federal Trade Commission filed an antitrust-related lawsuit against Intel Wednesday, accusing the chip maker of deliberately attempting to hurt its competition and ultimately consumers.

The Federal Trade Commission's complaint against Intel for alleged anticompetitive practices has a new twist: graphics chips.

2009 was expensive for Intel

The European Commission fined Intel nearly 1.5 billion USD,

the US Federal Trade Commission sued Intel on anti-trust grounds, and

Intel settled with AMD for another 1.25 billion USD.

If nothing else it was an expensive year, and while Intel settling with AMD was a significant milestone for the company, it was not the end of their troubles.

Finally the settlement(s)

The EU fine is still under appeal ($1.45B)

8/4/2010: Intel settles with the FTC

Then there is the whole Dell issue….

So back to the paper: What did Intel say?

Throughput Computing

Kernels: What is a kernel?

Kernels selected: SGEMM, MC, Conv, FFT, SAXPY, LBM, Solv, SpMV, GJK, Sort, RC, Search, Hist, Bilat
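To make "What is a kernel?" concrete, here is a minimal sketch of SAXPY, one of the kernels listed above. SAXPY is the standard single-precision "a times x plus y" operation; the function name, the pure-Python form, and the traffic estimate below are illustrative assumptions, not code from the paper.

```python
def saxpy(a, x, y):
    """Return a*x + y element-wise (the SAXPY operation)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


# Per element: 2 floating-point operations (one multiply, one add).
FLOPS_PER_ELEMENT = 2
# Per element in single precision: read x, read y, write y (4 bytes each).
# This assumes no reuse from cache, which is an illustrative simplification.
BYTES_PER_ELEMENT = 3 * 4

result = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)  # [12.0, 24.0, 36.0]
print(FLOPS_PER_ELEMENT / BYTES_PER_ELEMENT)  # flops per byte of memory traffic
```

With so little arithmetic per byte moved, a kernel like this tends to be limited by memory bandwidth rather than by compute on either kind of processor.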

The Hardware selected

CPU: 3.2GHz Core i7-960, 6GB RAM

GPU: 1.3GHz eVGA GeForce GTX280 w/ 1GB

Optimizations:

CPU: Multithreading, cache blocking, and reorganization of memory accesses for SIMDification.

GPU: Minimizing global synchronization, and using local shared buffers.
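As a rough illustration of the cache blocking mentioned for the CPU, the sketch below tiles a matrix multiply so that each small block of the operands is reused while it is still likely to be in cache. The function names and the `block` parameter are hypothetical, and this is not the paper's implementation. In pure Python it shows only the loop structure; any real speedup needs compiled code.

```python
def matmul_naive(a, b):
    """Reference triple loop: c = a @ b for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same result, but iterating over block x block tiles."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):          # tile rows of a and c
        for kk in range(0, m, block):      # tile the shared dimension
            for jj in range(0, p, block):  # tile columns of b and c
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]      # reused across the inner j loop
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(matmul_blocked(a, b) == matmul_naive(a, b))  # True
```

In a tuned version the block size would be chosen so that the working tiles fit in a particular cache level, and the innermost loop would be the one handed to SIMD instructions.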

This even made Slashdot

Hardware: Intel, NVIDIA Take Shots At CPU vs. GPU Performance

And PCWorld

Intel: 2-year-old Nvidia GPU Outperforms 3.2GHz Core I7

Intel researchers have published the results of a performance comparison between their latest quad-core Core i7 processor and a two-year-old Nvidia graphics card, and found that the Intel processor can't match the graphics chip's parallel processing performance.

http://www.pcworld.com/article/199758/intel_2yearold_nvidia_gpu_outperforms_32ghz_core_i7.html

From the paper's abstract:

In the past few years there have been many studies claiming GPUs deliver substantial speedups ...over multi-core CPUs...[W]e perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average.

Do you have a problem with this statement?

Intel's own paper indirectly raises a question when it notes:

The previously reported LBM number on GPUs claims 114X speedup over CPUs. However, we found that with careful multithreading, reorganization of memory access patterns, and SIMD optimizations, the performance on both CPUs and GPUs is limited by memory bandwidth and the gap is reduced to only 5X.
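The quoted LBM result rests on both processors being limited by memory bandwidth. When that holds, run time is roughly bytes moved divided by sustained bandwidth, so the achievable speedup approaches the ratio of the two bandwidths. The sketch below works through that arithmetic. The bandwidth figures are assumed peak values for a GTX280-class and a Core i7-960-class part, not numbers taken from these slides, and measured bandwidth will differ.

```python
def bandwidth_bound_speedup(gpu_gb_per_s, cpu_gb_per_s):
    """Upper-bound speedup when both sides are purely bandwidth limited."""
    if gpu_gb_per_s <= 0 or cpu_gb_per_s <= 0:
        raise ValueError("bandwidths must be positive")
    return gpu_gb_per_s / cpu_gb_per_s


# Assumed peak memory bandwidths in GB/s (illustrative, not from the slides).
ASSUMED_GPU_GB_PER_S = 141.0
ASSUMED_CPU_GB_PER_S = 32.0

ratio = bandwidth_bound_speedup(ASSUMED_GPU_GB_PER_S, ASSUMED_CPU_GB_PER_S)
print(f"{ratio:.1f}x")  # 4.4x
```

Under these assumptions the bound comes out at a single-digit multiple, the same order as the 5X gap quoted above and nowhere near 114X.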

What is important about the context?

The International Symposium on Computer Architecture (ISCA) in Saint-Malo, France, interestingly enough, is the same event where NVIDIA’s Chief Scientist Bill Dally received the prestigious 2010 Eckert-Mauchly Award for his pioneering work in architecture for parallel computing.

NVIDIA Blog Response:

It’s a rare day in the world of technology when a company you compete with stands up at an important conference and declares that your technology is *only* up to 14 times faster than theirs.

http://blogs.nvidia.com/blog/2010/06/23/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel/

NVIDIA Blog Response: (cont)

The real myth here is that multi-core CPUs are easy for any developer to use and see performance improvements.

Undergraduate students learning parallel programming at M.I.T. disputed this when they looked at the performance increase they could get from different processor types and compared this with the amount of time they needed to spend in re-writing their code.

According to them, for the same investment of time as coding for a CPU, they could get more than 35x the performance from a GPU.

Despite substantial investments in parallel computing tools and libraries, efficient multi-core optimization remains in the realm of experts like those Intel recruited for its analysis.

In contrast, the CUDA parallel computing architecture from NVIDIA is a little over 3 years old and already hundreds of consumer, professional and scientific applications are seeing speedups ranging from 10 to 100x using NVIDIA GPUs.

Questions

Where did the 2.5x, 5x, and 14x come from?

How big were the problems that Intel used for comparisons? [compare w/ cache size]

How were they selected?

What optimizations were done?

Fermi cards were almost certainly unavailable when Intel commenced its project, but it's still worth noting that some of the GF100's architectural advances partially address (or at least alleviate) certain performance-limiting handicaps Intel points to when comparing Nehalem to a GT200 processor.
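On the first question above, one reason several different headline numbers can come out of a single study is that per-kernel speedups can be summarized in more than one way. The sketch below uses made-up speedup values (hypothetical, not the paper's data) to show how the maximum, the arithmetic mean, and the geometric mean of the same list diverge. Which summary the paper used is not stated in these slides.

```python
import math


def arithmetic_mean(values):
    """Plain average; dominated by the largest speedups."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Average of ratios; treats a 2x gain and a 2x loss symmetrically."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical GPU-over-CPU speedups for four unnamed kernels.
hypothetical_speedups = [0.5, 2.0, 4.0, 16.0]

print("max:       ", max(hypothetical_speedups))
print("arithmetic:", arithmetic_mean(hypothetical_speedups))
print("geometric: ", round(geometric_mean(hypothetical_speedups), 2))
```

For these made-up values the maximum is 16, the arithmetic mean is about 5.6, and the geometric mean is about 2.8, so "up to", "on average", and a single-kernel figure can all be accurate descriptions of the same measurements.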

Bottom Line

Parallelization is hard, whether you're working with a quad-core x86 CPU or a 240-core GPU; each architecture has strengths and weaknesses that make it better or worse at handling certain kinds of workloads.
