Download - Gpu IEEE Paper
-
8/13/2019 Gpu IEEE Paper
1/14
..........................................................................................................................................................................................................
THE GPU COMPUTING ERA..........................................................................................................................................................................................................
GPUCOMPUTING IS AT A TIPPING POINT, BECOMING MORE WIDELY USED IN DEMANDING
CONSUMER APPLICATIONS AND HIGH-PERFORMANCE COMPUTING. THIS ARTICLE
DESCRIBES THE RAPID EVOLUTION OFGPU ARCHITECTURESFROM GRAPHICS
PROCESSORS TO MASSIVELY PARALLEL MANY-CORE MULTIPROCESSORS, RECENT
DEVELOPMENTS IN GPU COMPUTING ARCHITECTURES, AND HOW THE ENTHUSIASTIC
ADOPTION OF CPUGPU COPROCESSING IS ACCELERATING PARALLEL APPLICATIONS.
......As we ent er the era of GPUcomputing, demanding applications withsubstantial parallelism increasingly use themassively parallel computing capabilitiesof GPUs to achieve superior performanceand efficiency. Today GPU computingenables applications that we previouslythought infeasible because of long execu-
tion times.With the GPUs rapid evolution from a
configurable graphics processor to a pro-grammable parallel processor, the ubiqui-tous GPU in every PC, laptop, desktop,and workstation is a many-core multi-threaded multiprocessor that excels atboth graphics and computing applications.Todays GPUs use hundreds of parallelprocessor cores executing tens of thou-sands of parallel threads to rapidly solvelarge problems having substantial inherent
parallelism. Theyre now the most perva-sive massively parallel processing platformever available, as well as the most cost-effective.
Using NVIDIA GPUs as examples, thisarticle describes the evolution of GPU com-puting and its parallel computing model, theenabling architecture and software develop-ments, how computing applications useCPUGPU coprocessing, example applica-tion performance speedups, and trends inGPU computing.
GPU computings evolutionWhy have GPUs evolved to have large
numbers of parallel threads and manycores? The driving force continues to bethe real-time graphics performance neededto render complex, high-resolution 3Dscenes at interactive frame rates for games.
Rendering high-definition graphics scenes
is a problem with tremendous inherent par-allelism. A graphics programmer writes asingle-thread program that draws one pixel, andthe GPU runs multiple instances of thisthread in paralleldrawing multiple pixelsin parallel. Graphics programs, written inshading languages such as Cg or High-Level Shading Language (HLSL), thus scaletransparently over a wide range of threadand processor parallelism. Also, GPU com-puting programswritten in C or Cwith the CUDA parallel computing
model,1,2 or using a parallel computingAPI inspired by CUDA such as Direct-Compute3 or OpenCL4scale transparentlyover a wide range of parallelism. Softwarescalability, too, has enabled GPUs torapidly increase their parallelism andperformance with increasing transistordensity.
GPU technology developmentThe demand for faster and higher-
definition graphics continues to drive the
John Nickolls
William J. Dally
NVIDIA
................................................
56 Pub lished by the IEEE Comp uter Society 0 272-1732/10/$26.00 c 2010 IEEE
-
8/13/2019 Gpu IEEE Paper
2/14
development of increasingly parallel GPUs.Table 1 lists significant milestones in NVIDIAGPU technology development that drove theevolution of unified graphics and computing
GPUs. GPU transistor counts increased expo-nentially, doubling roughly every 18 monthswith increasing semiconductor density. Sincetheir 2006 introduction, CUDA parallelcomputing cores per GPU also doublednearly every 18 months.
In the early 1990s, there were no GPUs.Video graphics array (VGA) controllers gen-erated 2D graphics displays for PCs to accel-erate graphical user interfaces. In 1997,NVIDIA released the RIVA 128 3D single-chip graphics accelerator for games and 3D
visualization applications, programmed withMicrosoft Direct3D and OpenGL. Evolvingto modern GPUs involved adding pro-grammability incrementallyfrom fixedfunction pipelines to microcoded processors,configurable processors, programmable pro-cessors, and scalable parallel processors.
Early GPUsThe first GPU was the GeForce 256, a
single-chip 3D real-time graphics processorintroduced in 1999 that included nearly
every feature of high-end workstation 3Dgraphics pipelines of that era. It containeda configurable 32-bit floating-point vertextransform and lighting processor, and a con-
figurable integer pixel-fragment pipeline,programmed with OpenGL and MicrosoftDirectX 7 (DX7) APIs.
GPUs first used floating-point arithmeticto calculate 3D geometry and vertices, thenapplied it to pixel lighting and color valuesto handle high-dynamic-range scenes andto simplify programming. They imple-mented accurate floating-point rounding toeliminate frame-varying artifacts on movingpolygon edges that would otherwise sparkleat real-time frame rates.
As programmable shaders emerged, GPUsbecame more flexible and programmable. In2001, the GeForce 3 introduced the first pro-grammable vertex processor that executedvertex shader programs, along with a config-urable 32-bit floating-point pixel-fragmentpipeline, programmed with OpenGL andDX8. The ATI Radeon 9700, introducedin 2002, featured a programmable 24-bitfloating-point pixel-fragment processor pro-grammed with DX9 and OpenGL. TheGeForce FX and GeForce 68005 featured
Table 1. NVIDIA GPU technology development.
Date Product Transistors CUDA cores Technology
1997 RIVA 128 3 million 3D graphics accelerator
1999 GeForce 256 25 million First GPU, programmed with DX7 and OpenGL
2001 GeForce 3 60 million First programmable shader GPU, programmedwith DX8 and OpenGL
2002 GeForce FX 125 million 32-bit floating-point (FP) programmable GPU with
Cg programs, DX9, and OpenGL
2004 GeForce 6800 222 million 32-bit FP programmable scalable GPU, GPGPU
Cg programs, DX9, and OpenGL
2006 GeForce 8800 681 million 128 First unified graphics and computing GPU,
programmed in C with CUDA
2007 Tesla T8, C870 681 million 128 First GPU computing system programmed in C
with CUDA
2008 GeForce GTX 280 1.4 billion 240 Unified graphics and computing GPU, IEEE FP,
CUDA C, OpenCL, and DirectCompute
2008 Tesla T10, S1070 1.4 billion 240 GPU computing clusters, 64-bit IEEE FP, 4-Gbyte
memory, CUDA C, and OpenCL
2009 Fermi 3.0 billion 512 GPU computing architecture, IEEE 754-2008 FP,
64-bit unified addressing, caching, ECC
memory, CUDA C, C++, OpenCL, and
DirectCompute
..........................................................
MARCH/APRIL 2010 57
-
8/13/2019 Gpu IEEE Paper
3/14
programmable 32-bit floating-point pixel-fragment processors and vertex processors,programmed with Cg programs, DX9, andOpenGL. These processors were highly multi-threaded, creating a thread and executing a
thread program for each vertex and pixelfragment. The GeForce 6800 scalable pro-cessor core architecture facilitated multipleGPU implementations with different num-bers of processor cores.
Developing the Cg language6 for pro-gramming GPUs provided a scalable parallelprogramming model for the programmablefloating-point vertex and pixel-fragment pro-cessors of GeForce FX, GeForce 6800, andsubsequent GPUs. A Cg program resemblesa C program for a single thread that draws
a single vertex or single pixel. The multi-threaded GPU created independent threadsthat executed a shader program to drawevery vertex and pixel fragment.
In addition to rendering real-time graph-ics, programmers also used Cg to computephysical simulations and other general-purpose GPU (GPGPU) computations.Early GPGPU computing programs achievedhigh performance, but were difficult to writebecause programmers had to express non-graphics computations with a graphics API
such as OpenGL.
Unified computing and graphics GPUsThe GeForce 8800 introduced in 2006
featured the first unified graphics and com-puting GPU architecture7,8 programmablein C with the CUDA parallel computingmodel, in addition to using DX10 andOpenGL. Its unified streaming processorcores executed vertex, geometry, and pixelshader threads for DX10 graphics programs,and also executed computing threads for
CUDA C programs. Hardware multithread-ing enabled the GeForce 8800 to efficientlyexecute up to 12,288 threads concurrentlyin 128 processor cores. NVIDIA deployedthe scalable architecture in a family ofGeForce GPUs with different numbers ofprocessor cores for each market segment.
The GeForce 8800 was the first GPU touse scalar thread processors rather than vectorprocessors, matching standard scalar languageslike C, and eliminating the need to managevector registers and program vector
operations. It added instructions to supportC and other general-purpose languages,including integer arithmetic, IEEE 754floating-point arithmetic, and load/storememory access instructions with byte address-
ing. It provided hardware and instructions tosupport parallel computation, communica-tion, and synchronizationincluding threadarrays, shared memory, and fast barriersynchronization.
GPU computing systemsAt first, users built personal supercom-
puters by adding multiple GPU cards toPCs and workstations, and assembled clustersof GPU computing nodes. In 2007, respond-ing to demand for GPU computing systems,
NVIDIA introduced the Tesla C870, D870,and S870 GPU card, deskside, and rack-mount GPU computing systems containingone, two, and four T8 GPUs. The T8GPU was based on the GeForce 8800GPU, configured for parallel computing.
The second-generation Tesla C1060 andS1070 GPU computing systems introducedin 2008 used the T10 GPU, based on theGPU in GeForce GTX 280. The T10 fea-tured 240 processor cores, 1-teraflop-per-second peak single-precision floating-point
rate, IEEE 754-2008 double-precision 64-bit floating-point arithmetic, and 4-GbyteDRAM memory. Today there are TeslaS1070 systems with thousands of GPUswidely deployed in high-performance com-puting systems in production and research.
NVIDIA introduced the third-generationFermi GPU computing architecture in2009.9 Based on user experience with priorgenerations, it addressed several key areas tomake GPU computing more broadly appli-cable. Fermi implemented IEEE 754-2008
and significantly increased double-precisionperformance. It added error-correcting code(ECC) memory protection for large-scaleGPU computing, 64-bit unified addressing,cached memory hierarchy, and instructionsfor C, C, Fortran, OpenCL, andDirectCompute.
GPU computing ecosystemThe GPU computing ecosystem is expand-
ing rapidly, enabled by the deployment ofmore than 180 million CUDA-capable
....................................................
58 IEEE MICRO
...............................................................................................................................................................................................
HOT CHIPS
-
8/13/2019 Gpu IEEE Paper
4/14
GPUs. Researchers and developers have en-thusiastically adopted CUDA and GPUcomputing for a diverse range of applica-tions,10 publishing hundreds of technicalpapers, writing parallel programming text-
books,
11
and teaching CUDA programmingat more than 300 universities. The CUDAZone (see http://www.nvidia.com/object/cuda_home_new.html) lists more than1,000 links to GPU computing applications,programs, and technical papers. The 2009GPU Technology Conference (see http://www.nvidia.com/object/research_summit_posters.html) published 91 research posters.
Library and tools developers are makingGPU development more productive. GPUcomputing languages include CUDA C,
CUDA C, Portland Group (PGI)CUDA Fortran, DirectCompute, andOpenCL. GPU mathematics packages in-clude MathWorks Matlab, Wolfram Mathe-matica, National Instruments Labview, Sci-Comp SciFinance, and PyCUDA. NVIDIAdeveloped the parallel Nsight GPU develop-ment environment, debugger, and analyzerintegrated with Microsoft Visual Studio.GPU libraries include C productivitylibraries, dense linear algebra, sparse linearalgebra, FFTs, video and image processing,
and data-parallel primitives. Computer systemmanufacturers are developing integratedCPUGPU coprocessing systems in rack-mount server and cluster configurations.
CUDA scalable parallel architectureCUDA is a hardware and software copro-
cessing architecture for parallel computingthat enables NVIDIA GPUs to execute pro-grams written with C, C, Fortran,OpenCL, DirectCompute, and other lan-guages. Because most languages were
designed for one sequential thread, CUDApreserves this model and extends it with aminimalist set of abstractions for expressingparallelism. This lets the programmer focuson the important issues of parallelismhowto design efficient parallel algorithmsusing a familiar language.
By design, CUDA enables the develop-ment of highly scalable parallel programsthat can run across tens of thousands of con-current threads and hundreds of processorcores. A compiled CUDA program executes
on any size GPU, automatically using moreparallelism on GPUs with more processorcores and threads.
A CUDA program is organized into ahost program, consisting of one or more se-quential threads running on a host CPU, andone or more parallel kernels suitable for exe-cution on a parallel computing GPU. A ker-nel executes a sequential program on a set oflightweight parallel threads. As Figure 1shows, the programmer or compiler organ-
izes these threads into a grid of thread blocks.The threads comprising a thread block cansynchronize with each other via barriersand communicate via a high-speed, per-block shared memory.
Threads from different blocks in the samegrid can coordinate via atomic operations ina global memory space shared by all threads.Sequentially dependent kernel grids can syn-chronize via global barriers and coordinatevia global shared memory. CUDA requiresthat thread blocks be independent, which
Grid 1
Grid 0
Per-applicationglobal memory
Per-thread privatelocal memory
Thread
Thread block
Per-block
shared memory
Figure 1. The CUDA hierarchy of threads, thread blocks, and grids of
blocks, with corresponding memory spaces: per-thread private local,
per-block shared, and per-application global memory spaces.
..........................................................
MARCH/APRIL 2010 59
-
8/13/2019 Gpu IEEE Paper
5/14
provides scalability to GPUs with different
numbers of processor cores and threads.Thread blocks implement coarse-grainedscalable data parallelism, while the light-weight threads comprising each threadblock provide fine-grained data parallelism.Thread blocks executing different kernelsimplement coarse-grained task parallelism.Threads executing different paths implementfine-grained thread-level parallelism. Detailsof the CUDA programming model are avail-able in the programming guide.2
Figure 2 shows some basic features of par-
allel programming with CUDA. It containssequential and parallel implementations ofthe SAXPY routine defined by the basic linearalgebra subroutines (BLAS) library. Givenscalar a and vectors x and y containing nfloating-point numbers, it performs the up-date y ax y. The serial implementationis a simple loop that computes one elementof yper iteration. The parallel kernel executeseach of these independent iterations in paral-lel, assigning a separate thread to computeeach element of y. The __global__
modifier indicates that the procedure is a ker-nel entry point, and the extended function-call syntax saxpy(. . .)launches the kernel saxpy() in parallelacrossBblocks ofTthreads each. Each threaddetermines which element it should pro-cess from its integer thread block indexblockIdx.x, its thread index within itsblockthreadIdx.x, and the total numberof threads per block blockDim.x.
This example demonstrates a com-mon parallelization pattern, where we can
transform a serial loop with independent
iterations to execute in parallel across manythreads. In the CUDA paradigm, the pro-grammer writes a scalar programthe paral-lel saxpy() kernelthat specifies thebehavior of a single thread of the kernel.This lets CUDA leverage standard C lan-guage with only a few small additions, suchas built-in thread and block index variables.The SAXPY kernel is also a simple exampleof data parallelism, where parallel threadseach produce assigned result data elements.
GPU computing architectureTo address different market segments,GPU architectures scale the number of pro-cessor cores and memories to implement dif-ferent products for each segment while usingthe same scalable architecture and software.NVIDIAs scalable GPU computing archi-tecture varies the number of streaming multi-processors to scale computing performance,and varies the number of DRAM memoriesto scale memory bandwidth and capacity.
Each multithreaded streaming multiproc-
essor provides sufficient threads, processorcores, and shared memory to execute oneor more CUDA thread blocks. The parallelprocessor cores within a streaming multi-processor execute instructions for parallelthreads. Multiple streaming multiprocessorsprovide coarse-grained scalable data andtask parallelism to execute multiple coarse-grained thread blocks (possibly running dif-ferent kernels) in parallel. Multithreadingand parallel-pipelined processor cores withineach streaming multiprocessor implement
void saxpy(uint n, float a,float *x, float *y)
{uint i;for(i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
void serial_sample(){
// Call serial SAXPY function
saxpy(n, 2.0, x, y);}
(a)
__global__ void saxpy(uint n, float a,float *x, float *y)
{uint i = blockIdx.x*blockDim.x + threadIdx.x;if(i < n)
y[i] = a*x[i] + y[i];
}
void parallel_sample(){
// Launch parallel SAXPY kernel
// using n/256 blocks of 256 threads eachsaxpy(n, 2.0, x, y);
}
(b)
Figure 2. Serial (a) and parallel CUDA (b) SAXPY kernels computing y ax y.
....................................................
60 IEEE MICRO
...............................................................................................................................................................................................
HOT CHIPS
-
8/13/2019 Gpu IEEE Paper
6/14
fine-grained data and thread-level parallelismto execute hundreds of fine-grained threadsin parallel. Application programs using theCUDA model thus scale transparently tosmall and large GPUs with different num-bers of streaming multiprocessors and pro-cessor cores.
Fermi computing architectureTo illustrate GPU computing architecture,
Figure 3 shows the third-generation Fermicomputing architecture configured with
16 streaming multiprocessors, each with32 CUDA processor cores, for a total of512 cores. The GigaThread work schedulerdistributes CUDA thread blocks to streamingmultiprocessors with available capacity, dy-namically balancing the computing workloadacross the GPU, and running multiple kerneltasks in parallel when appropriate. The multi-threaded streaming multiprocessors scheduleand execute CUDA thread blocks and indi-vidual threads. Each streaming multiprocessorexecutes up to 1,536 concurrent threads to
help cover long latency loads from DRAMmemory. As each thread block completes exe-cuting its kernel program and releases itsstreaming multiprocessor resources, the workscheduler assigns a new thread block to thatstreaming multiprocessor.
The PCIe host interface connectsthe GPU and its DRAM memory withthe host CPU and system memory. TheCPUGPU coprocessing and data transfersuse the bidirectional PCIe interface. Thestreaming multiprocessor threads access
system memory via the PCIe interface, andCPU threads access GPU DRAM memoryvia PCIe.
The GPU architecture balances its parallelcomputing power with parallel DRAMmemory controllers designed for high mem-ory bandwidth. The Fermi GPU in Figure 3has six high-speed GDDR5 DRAM interfa-ces, each 64 bits wide. Its 40-bit addresseshandle up to 1 Tbyte of address space forGPU DRAM and CPU system memoryfor large-scale computing.
D
RAMi
nterface
DRAMi
nterface
Hostinterface
GigaThread
L2 cacheDRAMi
nterface
DRAMi
nterface
D
RAMi
nterface
DRAMi
nterface
SM SM SM SM SM SM SM SM
SMSMSMSMSMSMSMSM
L1 L1 L1 L1 L1 L1 L1 L1
L1L1L1L1L1L1L1L1
Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex Tex
TexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTexTex
Figure 3. Fermi GPU computing architecture with 512 CUDA processor cores organized as 16 streaming multiprocessors
(SMs) sharing a common second-level (L2) cache, six 64-bit DRAM interfaces, and a host interface with the host CPU,
system memory, and I/O devices. Each streaming multiprocessor has 32 CUDA cores.
..........................................................
MARCH/APRIL 2010 61
-
8/13/2019 Gpu IEEE Paper
7/14
Cached memory hierarchyFermi introduces a parallel cached mem-
ory hierarchy for load, store, and atomicmemory accesses by general applications.Each streaming multiprocessor has a first-level
(L1) data cache, and the streaming multi-processors share a common 768-Kbyte unifiedsecond-level (L2) cache. The L2 cache connectswith six 64-bit DRAM interfaces and the PCIeinterface, which connects with the host CPU,system memory, and PCIe devices. It cachesDRAM memory locations and system mem-ory pages accessed via the PCIe interface.The unified L2 cache services load, store,atomic, and texture instruction requests fromthe streaming multiprocessors and requestsfrom their L1 caches, and fills the streaming
multiprocessor instruction caches and uni-form data caches.
Fermi implements a 40-bit physicaladdress space that accesses GPU DRAM,CPU system memory, and PCIe deviceaddresses. It provides a 40-bit virtual addressspace to each application context and maps itto the physical address space with translationlookaside buffers and page tables.
ECC memoryFermi introduces ECC memory protec-
tion to enhance data integrity in large-scaleGPU computing systems. Fermi ECC cor-rects single-bit errors and detects double-biterrors in the DRAM memory, GPU L2cache, L1 caches, and streaming multiproces-sor registers. The ECC lets us integrate thou-sands of GPUs in a system while maintaininga high mean time between failures (MTBF)for high-performance computing and super-computing systems.
Streaming multiprocessorThe Fermi streaming multiprocessor
introduces several architectural features thatdeliver higher performance, improve its pro-grammability, and broaden its applicability.
As Figure 4 shows, the streaming multiproc-essor execution units include 32 CUDA pro-cessor cores, 16 load/store units, and fourspecial function units (SFUs). It has a64-Kbyte configurable shared memory/L1cache, 128-Kbyte register file, instructioncache, and two multithreaded warp schedu-lers and instruction dispatch units.
Efficient multithreadingThe streaming multiprocessor implements
zero-overhead multithreading and threadscheduling for up to 1,536 concurrentthreads. To efficiently manage and execute
this many individual threads, the multiproc-essor employs the single-instruction multiple-thread (SIMT) architecture introduced in thefirst unified computing GPU.7,8 The SIMTinstruction logic creates, manages, schedules,and executes concurrent threads in groups of32 parallel threads called warps. A CUDAthread block comprises one or more warps.Each Fermi streaming multiprocessor hastwo warp schedulers and two dispatch unitsthat each select a warp and issue an instruc-tion from the warp to 16 CUDA cores,
16 load/store units, or four SFUs. Becausewarps execute independently, the streamingmultiprocessor can issue two warp instruc-tions to appropriate sets of CUDA cores,load/store units, and SFUs.
To support C, C, and standardsingle-thread programming languages, eachstreaming multiprocessor thread is indepen-dent, having its own private registers, condi-tion codes and predicates, private per-threadmemory and stack frame, instructionaddress, and thread execution state. The
SIMT instructions control the execution ofan individual thread, including arithmetic,memory access, and branching and controlflow instructions. For efficiency, the SIMTmultiprocessor issues an instruction to awarp of 32 independent parallel threads.
The streaming multiprocessor realizes fullefficiency and performance when all threadsof a warp take the same execution path. Ifthreads of a warp diverge at a data-dependentconditional branch, execution serializes foreach branch path taken, and when all paths
complete, the threads converge to the sameexecution path. The Fermi streaming multi-processor extends the flexibility of the SIMTindependent thread control flow with indi-rect branch and function-call instructions,and trap handling for exceptions anddebuggers.
Thread instructionsParallel thread execution (PTX) instruc-
tions describe the execution of a single threadin a parallel CUDA program. The PTX
....................................................
62 IEEE MICRO
...............................................................................................................................................................................................
HOT CHIPS
-
8/13/2019 Gpu IEEE Paper
8/14
instructions focus on scalar (rather thanvector) operations to match standard scalarprogramming languages. Fermi implementsthe PTX 2.0 instruction set architecture(ISA), which targets C, C, Fortran,OpenCL, and DirectCompute programs.Instructions include
32-bit and 64-bit integer, addressing,and floating-point arithmetic;
load, store, and atomic memory access; texture and multidimensional surface
access;
individual thread flow control with pre-dicated instructions, branching, func-tion calls, and indirect function callsfor C virtual functions; and
parallel barrier synchronization.
CUDA coresEach pipelined CUDA core executes a
scalar floating point or integer instructionper clock for a thread. With 32 cores, thestreaming multiprocessor can execute up to32 arithmetic thread instructions per clock.
FP = Floating pointINT = Integer arithmetic logicLD/ST = Load/store
SFU = Special function unit
Instruction cache
Warp scheduler Warp scheduler
Dispatch unitDispatch unit
Register file (128 Kbytes)
Core Core
Core Core
Core Core
Core
Core
Core
Core Core
Core Core
CoreCore
CoreCore
CoreCore
Core
Core
Core
CoreCore
CoreLD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/STLD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
SFU
SFU
SFU
SFU
Uniform cache
64-Kbyte shared memory and L1 cache
Core
Core Core
CoreCore
Core
Core
Interconnect network
CUDA core
Dispatch port
Operand collector
FP unit
Result.queue
INT unit
Figure 4. The Fermi streaming multiprocessor has 32 CUDA processor cores, 16 load/store
units, four special function units, a 64-Kbyte configurable shared memory/L1 cache,
128-Kbyte register file, instruction cache, and two multithreaded warp schedulers and
instruction dispatch units.
..........................................................
MARCH/APRIL 2010 63
-
8/13/2019 Gpu IEEE Paper
9/14
The integer unit implements 32-bit precisionfor scalar integer operations, including 32-bitmultiply and multiply-add operations, andefficiently supports 64-bit integer operations.The Fermi integer unit adds bit-field insert
and extract, bit reverse, and populationcount.
IEEE 754-2008 floating-point arithmeticThe Fermi CUDA core floating-point unit
implements the IEEE 754-2008 floating-point arithmetic standard for 32-bit single-precision and 64-bit double-precision results,including fused multiply-add (FMA) instruc-tions. FMA computes D A* B Cwithno loss of precision by retaining full precisionin the intermediate product and addition,
then rounding the final sum to form theresult. Using FMA enables fast division andsquare-root operations with exactly roundedresults.
Fermi raises the throughput of 64-bitdouble-precision operations to half that ofsingle-precision operations, a dramatic im-provement over the T10 GPU. This perfor-mance level enables broader deployment ofGPUs in high-performance computing.The floating-point instructions handle sub-normal numbers at full speed in hardware,
allowing small values to retain partialprecision rather than flushing them tozero or calculating subnormal values inmulticycle software exception handlers asmost CPUs do.
The SFUs execute 32-bit floating-pointinstructions for fast approximations of recip-rocal, reciprocal square root, sin, cos, exp,and log functions. The approximations areprecise to better than 22 mantissa bits.
Unified memory addressing and access
The streaming multiprocessor load/storeunits execute load, store, and atomic mem-ory access instructions. A warp of 32 activethreads presents 32 individual byte addresses,and the instruction accesses each memoryaddress. The load/store units coalesce 32individual thread accesses into a minimalnumber of memory block accesses.
Fermi implements a unified thread ad-dress space that accesses the three separateparallel memory spaces of Figure 1: per-thread local, per-block shared, and global
memory spaces. A unified load/store in-struction can access any of the three mem-ory spaces, steering the access to the correctmemory, which enables general C andC pointer access anywhere. Fermi pro-
vides a terabyte 40-bit unified byte addressspace, and the load/store ISA supports64-bit byte addressing for future growth.The ISA also provides 32-bit addressinginstructions when the program can limitits accesses to the lower 4 Gbytes of addressspace.
Configurable shared memory and L1 cacheOn-chip shared memory provides low-
latency, high-bandwidth access to datashared by cooperating threads in the same
CUDA thread block. Fast shared memorysignificantly boosts the performance ofmany applications having predictable regularaddressing patterns, while reducing DRAMmemory traffic.
Fermi introduces a configurable-capacityL1 cache to aid unpredictable or irregularmemory accesses, along with a configurable-capacity shared memory. Each streamingmultiprocessor has 64 Kbytes of on-chipmemory, configurable as 48 Kbytes of sharedmemory and 16 Kbytes of L1 cache, or as
16 Kbytes of shared memory and 48 Kbytesof L1 cache.
CPU+GPU coprocessingHeterogeneous CPUGPU coprocessing
systems evolved because the CPU and GPUhave complementary attributes that allowapplications to perform best using bothtypes of processors. CUDA programs arecoprocessing programsserial portions exe-cute on the CPU, while parallel portionsexecute on the GPU. Coprocessing optimizes
total application performance.With coprocessing, we use the right core
for the right job. We use a CPU core (opti-mized for low latency on a single thread) fora codes serial portions, and we use GPUcores (optimized for aggregate throughputon a codes parallel portions) for parallel por-tions of code. This approach gives moreperformance per unit area or power thaneither CPU or GPU cores alone.
The comparison in Table 2 illustrates theadvantage of CPUGPU coprocessing using
....................................................
64 IEEE MICRO
...............................................................................................................................................................................................
HOT CHIPS
-
8/13/2019 Gpu IEEE Paper
10/14
Amdahls law. The table compares the per-formance of four configurations:
a system containing one latency-optimized (CPU) core,
a system containing 500 throughput-optimized (GPU) cores,
a system containing 10 CPU cores, and a coprocessing system that contains a
single CPU core and 450 GPU cores.
Table 2 assumes that a CPU core is 5faster and 50 the area of a GPU corenumbers consistent with contemporaryCPUs and GPUs. The coprocessing systemdevotes 10 percent of its area to the singleCPU core and 90 percent of its area tothe 450 GPU cores.
Table 2 compares the four configurationson both a parallel-intensive program (with aserial fraction of only 0.5 percent) and on amostly sequential program (with a serial frac-tion of 75 percent). The coprocessing archi-
tecture is the fastest on both programs. Onthe parallel-intensive program, the coprocess-ing architecture is slightly slower on theparallel portion than the pure GPU configu-ration (0.44 seconds versus 0.40 seconds) butmore than makes up for this by running thetiny serial portion 5 faster (1 second versus5 seconds). The heterogeneous architecturehas an advantage over the pure throughput-optimized configuration here because serialperformance is important even for mostlyparallel codes.
On the mostly sequential program, thecoprocessing configuration matches the per-formance of the multi-CPU configurationon the codes serial portion but runs the par-allel portion 45 as fast, giving a slightlyfaster overall performance. Even on mostlysequential codes, its more efficient to runthe codes parallel portion on a throughput-optimized architecture.
The coprocessing architecture provides
the best performance across a wide rangeof the serial fraction because it uses theright core for each task. By using a latency-optimized CPU to run the codes serialfraction, it gives the best possible perfor-mance on the serial fractionwhich is im-portant even for mostly parallel codes. Byusing throughput-optimized cores to runthe codes parallel portion, it gives near-optimal performance on the parallel frac-tion as wellwhich becomes increasinglyimportant as codes become more parallel.
Its wasteful to use large, inefficientlatency-optimized cores to run parallelcode segments.
Application performanceMany applications consist of a mixture of
fundamentally serial control logic and inher-ently parallel computations. Furthermore,these parallel computations are frequentlydata-parallel in nature. This directly matchesthe CUDA coprocessing programmingmodel, namely a sequential control thread
Table 2. CPU+GPU coprocessing execution time, assuming that a CPU core is 53 faster
and 503 the area of a GPU core.
Configuration
Processing
time for 1 CPU
core
Processing
time for 500
GPU cores
Processing
time for 10 CPU
cores
Processing time
for 1 CPU core +
450 GPU cores
Program type Area 50 500 500 500
Parallel-intensive
program
0.5% serial code
99.5% paralleliz-
able code
Total
1.0
199.0
200.0
5.0
0.4
5.4
1.0
19.9
20.9
1.00
0.44
1.44
Mostly sequential
program
75% serial code
25% paralleliz-
able code
Total
150.0
50.0
200.0
750.0
0.1
750.1
150.0
5.0
155.0
150.0
0.11
150.11
..........................................................
MARCH/APRIL 2010 65
-
8/13/2019 Gpu IEEE Paper
11/14
capable of launching a series of parallel ker-nels. The use of parallel kernels launchedfrom a sequential program also makes it rel-atively easy to parallelize an applications in-dividual components rather than rewrite theentire application.
Many applicationsboth academic re-search and industrial productshave beenaccelerated using CUDA to achieve signifi-cant parallel speedups.10 Such applicationsfall into a variety of problem domains,including quantum chemistry, moleculardynamics, computational fluid dynamics,quantum physics, finite element analysis,astrophysics, bioinformatics, computer vi-sion, video transcoding, speech recognition,computed tomography, and computationalmodeling.
Table 3 lists some representative applica-tions along with the runtime speedupsobtained for the whole application usingCPUGPU coprocessing over CPU alone,as measured by application developers.12-22
The speedups using GeForce 8800, TeslaT8, GeForce GTX 280, Tesla T10, andGeForce GTX 285 range from 9 to morethan 130, with the higher speedups reflect-ing applications where more of the work ran inparallel on the GPU. The lower speedupswhile still quite attractiverepresent
applications that are limited by the codesCPU portion, coprocessing overhead, or by di-vergence in the codes GPU fraction.
The speedups achieved on this diverse setof applications validate the programmabilityof GPUsin addition to their performance.
The breadth of applications that have beenported to CUDA (more than 1,000 on theCUDA Zone) demonstrates the range ofthe programming model and architecture.
Applications with dense matrices, sparsematrices, and arbitrary pointer structureshave all been successfully implemented inCUDA with impressive speedups. Similarly,applications with diverse control structuresand significant data-dependent control,such as ray tracing, have achieved good per-formance in CUDA.
Many real-world applications (such asinteractive ray tracing) are composed ofmany different algorithms, each with varyingdegrees of parallelism. OptiX, our interactiveray-tracing software developers kit built inthe CUDA architecture, provides a mecha-nism to control and schedule a wide varietyof tasks on both the CPU and GPU. Sometasks are primarily serial and execute onthe CPU, such as compilation, data structuremanagement, and coordination with theoperating system and user interaction.
Table 3. Representative CUDA application coprocessing speedups.
Application Field Speedup
Two-electron repulsion integral12 Quantum chemistry 130
Gromacs13 Molecular dynamics 137
Lattice Boltzmann14
3D computational fluiddynamics (CFD)
100
Euler solver15 3D CFD 16
Lattice quantum chromodynamics16 Quantum physics 10
Multigrid finite element method and
partial differential equation solver17Finite element analysis 27
N-body physics18 Astrophysics 100
Protein multiple sequence alignment19 Bioinformatics 36
Image contour detection20 Computer vision 130
Portable media converter* Consumer video 20
Large vocabulary speech recognition21 Human interaction 9
Iterative image reconstruction22 Computed tomography 130
Matlab accelerator** Computational modeling 100................................................................................................................................* Elemental Technologies, Badaboom media converter, 2009; http://badaboomit.com.** Accelereyes, Jacket GPU engine for Matlab, 2009; http://www.accelereyes.com.
....................................................
66 IEEE MICRO
...............................................................................................................................................................................................
HOT CHIPS
-
8/13/2019 Gpu IEEE Paper
12/14
Other tasks, such as building an accelerationstructure or updating animations, may runeither on the CPU or the GPU dependingon the choice of algorithms and the perfor-mance required. However, the real power
of OptiX is the manner in which it managesparallel operations that are either regularor highly irregular in a fine-grained fashion,and this is executed entirely on the GPU.
To accomplish this, OptiX decomposesthe algorithm into a series of states. Someof these states consist of uniform operationssuch as generating rays according to a cameramodel, or applying a tone-mapping operator.Other states consist of highly irregular com-putation, such as traversing an accelerationstructure to locate candidate geometry, or
intersecting a ray with that candidate geom-etry. These operations are highly irregular be-cause they require an indefinite number ofoperations and can vary in cost by an orderof magnitude or more between rays. Auser-provided program that executes withinan abstract ray-tracing machine controls thedetails of each operation.
To execute this state machine, each warpof parallel threads on the GPU selects ahandful of rays that have matching state.Each warp executes in SIMT fashion, caus-
ing each ray to transition to a new state(which might not be the same for all rays).This process repeats until all rays completetheir operations. OptiX can control boththe prioritization of the states and the scopeof the search for rays to target trade-offs indifferent GPU generations. Even if rays tem-porarily diverge to different states, theyll re-turn to a small number of common statesusually traversal and intersection. This state-machine model gives the GPU an opportu-nity to reunite diverged threads with other
threads executing the same code, whichwouldnt occur in a straightforward SIMTexecution model.
The net result is that CPUGPU copro-cessing enables fast, interactive ray tracing ofcomplex scenes while you watch, which is anapplication that researchers previously con-sidered too irregular for a GPU.
G PU computing is at the tipping point.Single-threaded processor perfor-mance is no longer scaling at historic rates.
Thus, we must use parallelism for theincreased performance required to delivermore value to users. A GPU thats opti-mized for throughput delivers parallelperformance much more efficiently than
a CPU thats optimized for latency.A heterogeneous coprocessing architecturethat combines a single latency-optimizedcore (a CPU) with many throughput-optimized cores (a GPU) performs betterthan either alternative alone. This is becauseit uses the right processor for the right jobthe CPU for serial sections and critical pathsand the GPU for the parallel sections.
Because they efficiently execute programsacross a range of parallelism, heterogeneousCPUGPU architectures are becoming
pervasive. In high-performance computing,technical computing, and consumer mediaprocessing, CPUGPU coprocessing hasbecome the architecture of choice. Weexpect the rate of adoption of GPU com-puting in these beachhead areas to acceler-ate, and for GPU computing to spread tobroader application areas. In addition toaccelerating existing applications, we expectGPU computing to enable new classes ofapplications, such as building compellingvirtual worlds and performing universal
speech translation.As the GPU computing era progresses,
GPU performance will continue to scale atMoores law ratesabout 50 percent peryeargiving an order of magnitudeincrease in performance in five to six years.
At the same time, GPU architecture willevolve to further increase the span of appli-cations that it can efficiently address. GPUcores will not become CPUsthey will con-tinue to be optimized for throughput, ratherthan latency. However, they will evolve to
become more agile and better able to handlearbitrary control and data access patterns.GPUs and their programming systems willalso evolve to make GPUs even easier toprogramfurther accelerating their rate ofadoption. M ICRO
Acknowledgments
We thank Jen-Hsun Huang of NVIDIAfor his Hot Chips 21 keynote23 that inspiredthis article, and the entire NVIDIA team thatbrings GPU computing to market.
..........................................................
MARCH/APRIL 2010 67
-
8/13/2019 Gpu IEEE Paper
13/14
....................................................................References
1. J. Nickolls et al., Scalable Parallel Program-
ming with CUDA, ACM Queue, vol. 6,
no. 2, 2008, pp. 40-53.
2. NVIDIA,NVIDIA CUDA Programming Guide,
2009; http://developer.download.nvidia.c o m/ c o mp u te / c u d a / 2 _ 3 / to o l k i t / d o c s /
NVIDIA_CUDA_Programming_Guide_2.3.pdf.
3. C. Boyd, DirectCompute: Capturing the
Teraflop, Microsoft Personal Developers
Conf., 2009; http://ecn.channel9.msdn.
com/o9/pdc09/ppt/CL03.pptx.
4. Khronos,The OpenCL Specification, 2009;
http://www.khronos.org/OpenCL.
5. J. Montrym and H. Moreton, The GeForce
6800, IEEE Micro, vol. 25, no. 2, 2005,
pp. 41-51.
6. W.R. Mark et al., Cg: A System for Pro-gramming Graphics Hardware in a C-like
Language, Proc. Special Interest Group
on Computer Graphics (Siggraph), ACM
Press, 2003, pp. 896-907.
7. E. Lindholm et al., NVIDIA Tesla: A Unified
Graphics and Computing Architecture,
IEEE Micro, vol. 28, no. 2, 2008, pp. 39-55.
8. J. Nickolls and D. Kirk, Graphics and
Computing GPUs, Computer Organization
and Design: The Hardware/Software
Interface, D .A . P att er so n an d J. L.
Hennessy, 4th ed., Morgan Kaufmann,
2009, pp. A2-A77.
9. NVIDIA, Fermi: NVIDIAs Next Generation
CUDA Compute Architecture, 2009; http://
www.nvidia.com/content/PDF/fermi_white_
papers/NVIDIA_Fermi_Compute_Architecture_
Whitepaper.pdf.
10. M. Garland et al., Parallel Computing Expe-
riences with CUDA, IEEE Micro, vol. 28,
no. 4, 2008, pp. 13-27.
11. D.B. Kirk and W.W. Hwu, Programming
Massively Parallel Processors: A Hands-on
Approach, Morgan Kaufmann, 2010.
12. I.S. Ufimtsev and T.J. Martinez, Quantum
Chemistry on Graphical Processing Units.
1. Strategies for Two-Electron Integral Eval-
uation,J. Chemical Theory and Computa-
tion, vol. 4, no. 2, 2008, pp. 222-231.
13. M.S. Friedrichs et al., Accelerating Molec-
ular Dynamic Simulation on Graphics
Processing Units, J. Computational Chem-
istry, vol. 30, no. 6, 2009, pp. 864-872.
14. J. Tolke and M. Krafczyk, TeraFLOP Com-
puting on a Desktop PC with GPUs for 3D
CFD, Intl J. Computational Fluid Dynam-
ics, vol. 22, no. 7, 2008, pp. 443-456.
15. T. Brandvik and G. Pullan, Acceleration of a
3D Euler Solver Using Commodity Graphics
Hardware, Proc. 46th Am. Inst. of Aero-
nautics and Astronautics (AIAA) AerospaceSciences Meeting, AIAA, 2008; http://
www.aiaa.org/agenda.cfm?lumeetingid=
1065&dateget=08-Jan-08#session8907.
16. M.A. Clark et al., Solving Lattice QCD Sys-
tems of Equations Using Mixed Precision
Solvers on GPUs, Computer Physics
Comm., 2009; http://arxiv.org/abs/0911.
3191v2.
17. D. Goddeke and R. Strzodka, Perfor-
mance and Accuracy of Hardware-Oriented
Native-, Emulated-, and Mixed-Precision
Solvers in FEM Simulations (Part 2: DoublePrecision GPUs), Ergebnisberichte des
Instituts fur Angewandte Mathematik
[Reports on Findings of the Inst. for
Applied Mathematics], Dortmund Univ. of
Technology, no. 370, 2008; http://www.
mathematik.uni-dortmund.de/~goeddeke/
pubs/GTX280_mixedprecision.pdf.
18. R.G. Belleman, J. Bedorf, and S.P. Zwart,
High Performance Direct Gravitational
N-body Simulations on Graphics Processing
Units II: An Implementation in CUDA, New
Astronomy,vol. 13, no. 2, 2008, pp. 103-112.19. Y. Liu, B. Schmidt, and D.L. Maskell, MSA-
CUDA: Multiple Sequence Alignment on
Graphics Processing Units with CUDA,
Proc. 20th IEEE Intl Conf. Application-
Specific Systems, Architectures and Pro-
cessors,IEEE CS Press, 2009, pp. 121-128.
20. B. Catanzaro et al., Efficient, High-Quality
Image Contour Detection,Proc. IEEE Intl
Conf. Computer Vision,IEEE CS Press, 2009;
http://www.cs.berkeley.edu/~catanzar/
Damascene/iccv2009.pdf.
21. J. Chong et al., Data-Parallel Large Vocabu-
lary Continuous Speech Recognition on
Graphics Processors, tech. report
UCB/EECS-2008-69, Univ. of California at
Berkeley, 2008; http://www.eecs.berkeley.
edu/Pubs/TechRpts/2008/EECS-2008-69.pdf.
22. Y. Pan et al., Feasibility of GPU-Assisted
Iterative Image Reconstruction for Mobile
C-Arm CT, Proc. Intl Soc. for Photonics
and Optonics (SPIE), vol. 7258, SPIE
2009; http://www.sci.utah.edu/~ypan/Pan_
SPIE2009.pdf.
....................................................
68 IEEE MICRO
...............................................................................................................................................................................................
HOT CHIPS
-
8/13/2019 Gpu IEEE Paper
14/14
23. J.H. Huang, 2009: The GPU Computing Tip-
ping Point, Proc. IEEE Hot Chips 21, 2009;
http://www.hotchips.org/archives/hc21.
John Nickolls is the director of architecture
at NVIDIA for GPU computing. Histechnical interests include developing parallelprocessing products, languages, and architec-tures. Nickolls has a PhD in electricalengineering from Stanford University.
William J. Dally is the chief scientist andsenior vice president of research at NVIDIAand the Willard R. and Inez Kerr BellProfessor of Engineering at StanfordUniversity. His technical interests includeparallel computer architecture, parallel
programming systems, and interconnectionnetworks. Hes a member of the National
Academy of Engineering and a fellow ofIEEE, ACM, and the American Academy of
Arts and Sciences. Dally has a PhD in
computer science from the CaliforniaInstitute of Technology.
Direct questions and comments aboutthis article to John Nickolls, NVIDIA, 2701San Tomas Expressway, Santa Clara, CA95050; [email protected].
..........................................................
MARCH/APRIL 2010 69