
NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE

To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.

The modern 3D graphics processing unit (GPU) has evolved from a fixed-function graphics pipeline to a programmable parallel processor with computing power exceeding that of multicore CPUs. Traditional graphics pipelines consist of separate programmable stages of vertex processors executing vertex shader programs and pixel-fragment processors executing pixel shader programs. (Montrym and Moreton provide additional background on the traditional graphics processor architecture [1].)

NVIDIA's Tesla architecture, introduced in November 2006 in the GeForce 8800 GPU, unifies the vertex and pixel processors and extends them, enabling high-performance parallel computing applications written in the C language using the Compute Unified Device Architecture (CUDA) [2-4] parallel programming model and development tools. The Tesla unified graphics and computing architecture is available in a scalable family of GeForce 8-series GPUs and Quadro GPUs for laptops, desktops, workstations, and servers. It also provides the processing architecture for the Tesla GPU computing platforms introduced in 2007 for high-performance computing.

In this article, we discuss the requirements that drove the unified graphics and parallel computing processor architecture, describe the Tesla architecture, and explain how it is enabling widespread deployment of parallel computing and graphics applications.

The road to unification

The first GPU was the GeForce 256, introduced in 1999. It contained a fixed-function 32-bit floating-point vertex transform and lighting processor and a fixed-function integer pixel-fragment pipeline, which were programmed with OpenGL and the Microsoft DX7 API [5]. In 2001, the GeForce 3 introduced the first programmable vertex processor executing vertex shaders, along with a configurable 32-bit floating-point fragment pipeline, programmed with DX8 [5] and OpenGL [6]. The Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with DX9 and OpenGL [7, 8]. The GeForce FX added 32-bit floating-point pixel-fragment processors. The XBox 360 introduced an early unified GPU in 2005, allowing vertices and pixels to execute on the same processor [9].

Erik Lindholm

John Nickolls

Stuart Oberman

John Montrym

NVIDIA

0272-1732/08/$20.00 © 2008 IEEE. Published by the IEEE Computer Society.


Vertex processors operate on the vertices of primitives such as points, lines, and triangles. Typical operations include transforming coordinates into screen space, where they feed the setup unit and the rasterizer, and setting up lighting and texture parameters for use by the pixel-fragment processors. Pixel-fragment processors operate on rasterizer output, which fills the interior of primitives, along with the interpolated parameters.

Vertex and pixel-fragment processors have evolved at different rates: Vertex processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering. Vertex processors have traditionally supported more-complex processing, so they became programmable first. For the last six years, the two processor types have been functionally converging as the result of a need for greater programming generality. However, the increased generality also increased the design complexity, area, and cost of developing two separate processors.

Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one. However, typical workloads are not well balanced, leading to inefficiency. For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true. The addition of more-complex primitive processing in DX10 makes it much harder to select a fixed processor ratio [10]. All these factors influenced the decision to design a unified architecture.

A primary design objective for Tesla was to execute vertex and pixel-fragment shader programs on the same unified processor architecture. Unification would enable dynamic load balancing of varying vertex- and pixel-processing workloads and permit the introduction of new graphics shader stages, such as geometry shaders in DX10. It also let a single team focus on designing a fast and efficient processor and allowed the sharing of expensive hardware such as the texture units. The generality required of a unified processor opened the door to a completely new GPU parallel-computing capability. The downside of this generality was the difficulty of efficient load balancing between different shader types.

Other critical hardware design requirements were architectural scalability, performance, power, and area efficiency.

The Tesla architects developed the graphics feature set in coordination with the development of the Microsoft Direct3D DirectX 10 graphics API [10]. They developed the GPU's computing feature set in coordination with the development of the CUDA C parallel programming language, compiler, and development tools.

Tesla architecture

The Tesla architecture is based on a scalable processor array. Figure 1 shows a block diagram of a GeForce 8800 GPU with 128 streaming-processor (SP) cores organized as 16 streaming multiprocessors (SMs) in eight independent processing units called texture/processor clusters (TPCs). Work flows from top to bottom, starting at the host interface with the system PCI-Express bus. Because of its unified-processor design, the physical Tesla architecture doesn't resemble the logical order of graphics pipeline stages. However, we will use the logical graphics pipeline flow to explain the architecture.

At the highest level, the GPU's scalable streaming processor array (SPA) performs all the GPU's programmable calculations. The scalable memory system consists of external DRAM control and fixed-function raster operation processors (ROPs) that perform color and depth frame buffer operations directly on memory. An interconnection network carries computed pixel-fragment colors and depth values from the SPA to the ROPs. The network also routes texture memory read requests from the SPA to DRAM and read data from DRAM through a level-2 cache back to the SPA.

The remaining blocks in Figure 1 deliver input work to the SPA. The input assembler collects vertex work as directed by the input command stream. The vertex work distribution block distributes vertex work packets to the various TPCs in the SPA. The TPCs execute vertex shader programs and (if enabled) geometry shader programs. The resulting output data is written to on-chip buffers. These buffers then pass their results to the viewport/clip/setup/raster/zcull block to be rasterized into pixel fragments. The pixel work distribution unit distributes pixel fragments to the appropriate TPCs for pixel-fragment processing. Shaded pixel fragments are sent across the interconnection network for processing by depth and color ROP units. The compute work distribution block dispatches compute thread arrays to the TPCs. The SPA accepts and processes work for multiple logical streams simultaneously. Multiple clock domains for GPU units, processors, DRAM, and other units allow independent power and performance optimizations.

Command processing

The GPU host interface unit communicates with the host CPU, responds to commands from the CPU, fetches data from system memory, checks command consistency, and performs context switching.

The input assembler collects geometric primitives (points, lines, triangles, line strips, and triangle strips) and fetches associated vertex input attribute data. It has peak rates of one primitive per clock and eight scalar attributes per clock at the GPU core clock, which is typically 600 MHz.

The work distribution units forward the input assembler's output stream to the array of processors, which execute vertex, geometry, and pixel shader programs, as well as computing programs. The vertex and compute work distribution units deliver work to processors in a round-robin scheme. Pixel work distribution is based on the pixel location.

Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operation processor.

Streaming processor array

The SPA executes graphics shader thread programs and GPU computing programs and provides thread control and management. Each TPC in the SPA roughly corresponds to a quad-pixel unit in previous architectures [1]. The number of TPCs determines a GPU's programmable processing performance and scales from one TPC in a small GPU to eight or more TPCs in high-performance GPUs.

Texture/processor cluster

As Figure 2 shows, each TPC contains a geometry controller, an SM controller (SMC), two streaming multiprocessors (SMs), and a texture unit. Figure 3 expands each SM to show its eight SP cores. To balance the expected ratio of math operations to texture operations, one texture unit serves two SMs. This architectural ratio can vary as needed.

Geometry controller

The geometry controller maps the logical graphics vertex pipeline into recirculation on the physical SMs by directing all primitive and vertex attribute and topology flow in the TPC. It manages dedicated on-chip input and output vertex attribute storage and forwards contents as required.

DX10 has two stages dealing with vertex and primitive processing: the vertex shader and the geometry shader. The vertex shader processes one vertex's attributes independently of other vertices. Typical operations are position space transforms and color and texture coordinate generation. The geometry shader follows the vertex shader and deals with a whole primitive and its vertices. Typical operations are edge extrusion for stencil shadow generation and cube map texture generation. Geometry shader output primitives go to later stages for clipping, viewport transformation, and rasterization into pixel fragments.

Figure 2. Texture/processor cluster (TPC).

Streaming multiprocessor

The SM is a unified graphics and computing multiprocessor that executes vertex, geometry, and pixel-fragment shader programs and parallel computing programs. As Figure 3 shows, the SM consists of eight streaming-processor (SP) cores, two special-function units (SFUs), a multithreaded instruction fetch and issue unit (MT Issue), an instruction cache, a read-only constant cache, and a 16-Kbyte read/write shared memory.

The shared memory holds graphics input buffers or shared data for parallel computing. To pipeline graphics workloads through the SM, vertex, geometry, and pixel threads have independent input and output buffers. Workloads can arrive and depart independently of thread execution. Geometry threads, which generate variable amounts of output per thread, use separate output buffers.

Each SP core contains a scalar multiply-add (MAD) unit, giving the SM eight MAD units. The SM uses its two SFU units for transcendental functions and attribute interpolation—the interpolation of pixel attributes from vertex attributes defining a primitive. Each SFU also contains four floating-point multipliers. The SM uses the TPC texture unit as a third execution unit and uses the SMC and ROP units to implement external memory load, store, and atomic accesses. A low-latency interconnect network between the SPs and the shared-memory banks provides shared-memory access.

The GeForce 8800 Ultra clocks the SPs and SFU units at 1.5 GHz, for a peak of 36 Gflops per SM (the eight MAD units, at two flops per clock each, contribute 24 Gflops; the two SFUs' eight floating-point multipliers add 12 Gflops). To optimize power and area efficiency, some SM non-data-path units operate at half the SP clock rate.

SM multithreading. A graphics vertex or pixel shader is a program for a single thread that describes how to process a vertex or a pixel. Similarly, a CUDA kernel is a C program for a single thread that describes how one thread computes a result. Graphics and computing applications instantiate many parallel threads to render complex images and compute large result arrays. To dynamically balance shifting vertex and pixel shader thread workloads, the unified SM concurrently executes different thread programs and different types of shader programs.

To efficiently execute hundreds of threads in parallel while running several different programs, the SM is hardware multithreaded. It manages and executes up to 768 concurrent threads in hardware with zero scheduling overhead.

To support the independent vertex, primitive, pixel, and thread programming model of graphics shading languages and the CUDA C/C++ language, each SM thread has its own thread execution state and can execute an independent code path. Concurrent threads of computing programs can synchronize at a barrier with a single SM instruction. Lightweight thread creation, zero-overhead thread scheduling, and fast barrier synchronization support very fine-grained parallelism efficiently.

Figure 3. Streaming multiprocessor (SM).

Single-instruction, multiple-thread. To manage and execute hundreds of threads running several different programs efficiently, the Tesla SM uses a new processor architecture we call single-instruction, multiple-thread (SIMT). The SM's SIMT multithreaded instruction unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. The term warp originates from weaving, the first parallel-thread technology. Figure 4 illustrates SIMT scheduling. The SIMT warp size of 32 parallel threads provides efficiency on plentiful fine-grained pixel threads and computing threads.

Each SM manages a pool of 24 warps, with a total of 768 threads. Individual threads composing a SIMT warp are of the same type and start together at the same program address, but they are otherwise free to branch and execute independently. At each instruction issue time, the SIMT multithreaded instruction unit selects a warp that is ready to execute and issues the next instruction to that warp's active threads. A SIMT instruction is broadcast synchronously to a warp's active parallel threads; individual threads can be inactive due to independent branching or predication.

The SM maps the warp threads to the SP cores, and each thread executes independently with its own instruction address and register state. A SIMT processor realizes full efficiency and performance when all 32 threads of a warp take the same execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads reconverge to the original execution path. The SM uses a branch synchronization stack to manage independent threads that diverge and converge. Branch divergence only occurs within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths. As a result, Tesla architecture GPUs are dramatically more efficient and flexible on branching code than previous-generation GPUs, as their 32-thread warps are much narrower than the SIMD width of prior GPUs [1].

SIMT architecture is similar to single-instruction, multiple-data (SIMD) design, which applies one instruction to multiple data lanes. The difference is that SIMT applies one instruction to multiple independent threads in parallel, not just multiple data lanes. A SIMD instruction controls a vector of multiple data lanes together and exposes the vector width to the software, whereas a SIMT instruction controls the execution and branching behavior of one thread.

Figure 4. Single-instruction, multiple-thread (SIMT) warp scheduling.

In contrast to SIMD vector architectures, SIMT enables programmers to write thread-level parallel code for independent threads as well as data-parallel code for coordinated threads. For program correctness, programmers can essentially ignore SIMT execution attributes such as warps; however, they can achieve substantial performance improvements by writing code that seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: programmers can safely ignore cache line size when designing for correctness but must consider it in the code structure when designing for peak performance. SIMD vector architectures, on the other hand, require the software to manually coalesce loads into vectors and to manually manage divergence.
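A minimal CUDA sketch of this behavior follows; the kernels and names are illustrative, not from the article. The first kernel's data-dependent branch can split a warp, so the warp executes both paths serially; the second makes the branch uniform across a warp, avoiding divergence:

```cuda
// Hypothetical kernel whose data-dependent branch may diverge within a warp.
__global__ void divergent(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f)           // per-thread condition: may split a 32-thread warp
        out[i] = in[i] * 2.0f;  // taken path runs with off-path threads disabled
    else
        out[i] = 0.0f;          // then the other path; the warp reconverges after
}

// Same computation with a warp-uniform condition: all threads of a block
// (hence of each warp) take the same side, so no serialization occurs.
__global__ void uniform(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if ((blockIdx.x & 1) == 0)  // uniform across the whole block
        out[i] = in[i] * 2.0f;
    else
        out[i] = 0.0f;
}
```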

SIMT warp scheduling. The SIMT approach of scheduling independent warps is simpler than previous GPU architectures' complex scheduling. A warp consists of up to 32 threads of the same type—vertex, geometry, pixel, or compute. The basic unit of pixel-fragment shader processing is the 2 × 2 pixel quad. The SM controller groups eight pixel quads into a warp of 32 threads. It similarly groups vertices and primitives into warps and packs 32 computing threads into a warp. The SIMT design shares the SM instruction fetch and issue unit efficiently across 32 threads but requires a full warp of active threads for full performance efficiency.

As a unified graphics processor, the SM schedules and executes multiple warp types concurrently—for example, concurrently executing vertex and pixel warps. The SM warp scheduler operates at half the 1.5-GHz processor clock rate. At each cycle, it selects one of the 24 warps to execute a SIMT warp instruction, as Figure 4 shows. An issued warp instruction executes as two sets of 16 threads over four processor cycles. The SP cores and SFU units execute instructions independently, and by issuing instructions between them on alternate cycles, the scheduler can keep both fully occupied.

Implementing zero-overhead warp scheduling for a dynamic mix of different warp programs and program types was a challenging design problem. A scoreboard qualifies each warp for issue each cycle. The instruction scheduler prioritizes all ready warps and selects the one with highest priority for issue. Prioritization considers warp type, instruction type, and "fairness" to all warps executing in the SM.

SM instructions. The Tesla SM executes scalar instructions, unlike previous GPU vector instruction architectures. Shader programs are becoming longer and more scalar, and it is increasingly difficult to fully occupy even two components of the prior four-component vector architecture. Previous architectures employed vector packing—combining sub-vectors of work to gain efficiency—but that complicated the scheduling hardware as well as the compiler. Scalar instructions are simpler and compiler friendly. Texture instructions remain vector based, taking a source coordinate vector and returning a filtered color vector.

High-level graphics and computing-language compilers generate intermediate instructions, such as DX10 vector or PTX scalar instructions [10, 2], which are then optimized and translated to binary GPU instructions. The optimizer readily expands DX10 vector instructions to multiple Tesla SM scalar instructions. PTX scalar instructions optimize to Tesla SM scalar instructions about one to one. PTX provides a stable target ISA for compilers and provides compatibility over several generations of GPUs with evolving binary instruction set architectures. Because the intermediate languages use virtual registers, the optimizer analyzes data dependencies and allocates real registers. It eliminates dead code, folds instructions together when feasible, and optimizes SIMT branch divergence and convergence points.

Instruction set architecture. The Tesla SM has a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.

Floating-point and integer operations include add, multiply, multiply-add, minimum, maximum, compare, set predicate, and conversions between integer and floating-point numbers. Floating-point instructions provide source operand modifiers for negation and absolute value. Transcendental function instructions include cosine, sine, binary exponential, binary logarithm, reciprocal, and reciprocal square root. Attribute interpolation instructions provide efficient generation of pixel attributes. Bitwise operators include shift left, shift right, logic operators, and move. Control flow includes branch, call, return, trap, and barrier synchronization.

The floating-point and integer instructions can also set per-thread status flags for zero, negative, carry, and overflow, which the thread program can use for conditional branching.

Memory access instructions. The texture instruction fetches and filters texture samples from memory via the texture unit. The ROP unit writes pixel-fragment output to memory.

To support computing and C/C++ language needs, the Tesla SM implements memory load/store instructions in addition to graphics texture fetch and pixel output. Memory load/store instructions use integer byte addressing with register-plus-offset address arithmetic to facilitate conventional compiler code optimizations.

For computing, the load/store instructions access three read/write memory spaces:

- local memory for per-thread, private, temporary data (implemented in external DRAM);
- shared memory for low-latency access to data shared by cooperating threads in the same SM; and
- global memory for data shared by all threads of a computing application (implemented in external DRAM).

The memory instructions load-global, store-global, load-shared, store-shared, load-local, and store-local access global, shared, and local memory. Computing programs use the fast barrier synchronization instruction to synchronize threads within the SM that communicate with each other via shared and global memory.
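A minimal CUDA sketch of the three spaces as a programmer sees them; the kernel, names, and sizes are illustrative assumptions, not from the article:

```cuda
__global__ void spaces(const float* gin, float* gout)  // gin/gout: global memory
{
    float tmp[4];                  // per-thread local-memory candidate (the
                                   // compiler may keep it in registers or
                                   // spill it to external DRAM)
    __shared__ float tile[256];    // per-SM shared memory, shared by the CTA

    int t = threadIdx.x;
    tmp[0] = gin[t] * gin[t];      // load-global
    tile[t] = tmp[0];              // store-shared
    __syncthreads();               // fast barrier: all CTA writes now visible

    // read a value a neighboring thread wrote before the barrier
    gout[t] = tile[(t + 1) % blockDim.x];  // load-shared, then store-global
}
```

A launch such as spaces<<<1, 256>>>(d_in, d_out); pairs one 256-thread CTA with the 256-element shared tile.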

To improve memory bandwidth and reduce overhead, the local and global load/store instructions coalesce individual parallel thread accesses from the same warp into fewer memory block accesses. The addresses must fall in the same block and meet alignment criteria. Coalescing memory requests boosts performance significantly over separate requests. The large thread count, together with support for many outstanding load requests, helps cover load-to-use latency for local and global memory implemented in external DRAM.
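A sketch of the access patterns involved, with illustrative kernels: in the first, the 32 threads of a warp read 32 consecutive words, which the hardware can merge into a few block accesses; in the second, a large stride scatters the accesses across blocks and defeats coalescing:

```cuda
__global__ void coalesced(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];           // thread k of a warp touches word k: coalesces
}

__global__ void strided(const float* in, float* out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];  // neighbors' addresses fall in different
                              // memory blocks: separate requests
}
```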

The latest Tesla architecture GPUs provide efficient atomic memory operations, including integer add, minimum, maximum, logic operators, swap, and compare-and-swap operations. Atomic operations facilitate parallel reductions and parallel data structure management.
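For instance, a histogram is a common use of the integer atomic add; this sketch assumes a GPU with global atomics and uses CUDA's atomicAdd, with an illustrative 256-bin layout:

```cuda
// Each thread classifies one byte and bumps its bin; atomicAdd performs
// the read-modify-write without races between threads.
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);  // bins[] has 256 entries, zeroed by host
}
```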

Streaming processor. The SP core is the primary thread processor in the SM. It performs the fundamental floating-point operations, including add, multiply, and multiply-add. It also implements a wide variety of integer, comparison, and conversion operations. The floating-point add and multiply operations are compatible with the IEEE 754 standard for single-precision FP numbers, including not-a-number (NaN) and infinity values. The unit is fully pipelined, and latency is optimized to balance delay and area.

The add and multiply operations use IEEE round-to-nearest-even as the default rounding mode. The multiply-add operation performs a multiplication with truncation, followed by an add with round-to-nearest-even. The SP flushes denormal source operands to sign-preserved zero and flushes results that underflow the target output exponent range to sign-preserved zero after rounding.

Special-function unit. The SFU supports computation of both transcendental functions and planar attribute interpolation [11]. A traditional vertex or pixel shader design contains a functional unit to compute transcendental functions. Pixels also need an attribute-interpolating unit to compute the per-pixel attribute values at the pixel's x, y location, given the attribute values at the primitive's vertices.

For functional evaluation, we use quadratic interpolation based on enhanced minimax approximations to approximate the reciprocal, reciprocal square root, log2 x, 2^x, and sin/cos functions. Table 1 shows the accuracy of the function estimates. The SFU unit generates one 32-bit floating-point result per cycle.


The SFU also supports attribute interpolation, to enable accurate interpolation of attributes such as color, depth, and texture coordinates. The SFU must interpolate these attributes in the (x, y) screen space to determine the values of the attributes at each pixel location. We express the value of a given attribute U in an (x, y) plane in plane equations of the following form:

U(x, y) = (A_U x + B_U y + C_U) / (A_W x + B_W y + C_W)

where A, B, and C are interpolation parameters associated with each attribute U, and W is related to the distance of the pixel from the viewer for perspective projection. The attribute interpolation hardware in the SFU is fully pipelined, and it can interpolate four samples per cycle.

In a shader program, the SFU cangenerate perspective-corrected attributes asfollows:

N Interpolate 1/W, and invert to formW.

N Interpolate U/W.N Multiply U/W by W to form perspec-

tive-correct U.
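A device-function sketch of those three steps, assuming the plane-equation coefficients A, B, C are given; the struct and function names are hypothetical:

```cuda
struct Plane { float a, b, c; };  // plane-equation coefficients for one attribute

__device__ float plane_eval(Plane p, float x, float y)
{
    return p.a * x + p.b * y + p.c;                  // A*x + B*y + C
}

__device__ float perspective_correct_u(Plane u_over_w, Plane one_over_w,
                                       float x, float y)
{
    float w  = 1.0f / plane_eval(one_over_w, x, y);  // interpolate 1/W, invert
    float uw = plane_eval(u_over_w, x, y);           // interpolate U/W
    return uw * w;                                   // perspective-correct U
}
```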

SM controller. The SMC controls multiple SMs, arbitrating the shared texture unit, load/store path, and I/O path. The SMC serves three graphics workloads simultaneously: vertex, geometry, and pixel. It packs each of these input types into the warp width, initiating shader processing, and unpacks the results.

Each input type has independent I/O paths, but the SMC is responsible for load balancing among them. The SMC supports static and dynamic load balancing based on driver-recommended allocations, current allocations, and relative difficulty of additional resource allocation. Load balancing of the workloads was one of the more challenging design problems due to its impact on overall SPA efficiency.

Texture unit

The texture unit processes one group of four threads (vertex, geometry, pixel, or compute) per cycle. Texture instruction sources are texture coordinates, and the outputs are filtered samples, typically a four-component (RGBA) color. Texture is a separate unit external to the SM connected via the SMC. The issuing SM thread can continue execution until a data dependency stall.

Each texture unit has four texture address generators and eight filter units, for a peak GeForce 8800 Ultra rate of 38.4 gigabilerps/s (a bilerp is a bilinear interpolation of four samples). Each unit supports full-speed 2:1 anisotropic filtering, as well as high-dynamic-range (HDR) 16-bit and 32-bit floating-point data format filtering.

The texture unit is deeply pipelined. Although it contains a cache to capture filtering locality, it streams hits mixed with misses without stalling.

Table 1. Function approximation statistics.

| Function  | Input interval | Accuracy (good bits) | ULP* error | % exactly rounded | Monotonic |
| 1/x       | [1, 2)         | 24.02                | 0.98       | 87                | Yes       |
| 1/sqrt(x) | [1, 4)         | 23.40                | 1.52       | 78                | Yes       |
| 2^x       | [0, 1)         | 22.51                | 1.41       | 74                | Yes       |
| log2 x    | [1, 2)         | 22.57                | N/A**      | N/A               | Yes       |
| sin/cos   | [0, π/2)       | 22.47                | N/A        | N/A               | No        |

* ULP: unit in the last place.
** N/A: not applicable.


Rasterization

Geometry primitives output from the SMs go in their original round-robin input order to the viewport/clip/setup/raster/zcull block. The viewport and clip units clip the primitives to the standard view frustum and to any enabled user clip planes. They transform postclipping vertices into screen (pixel) space and reject whole primitives outside the view volume as well as back-facing primitives.

Surviving primitives then go to the setup unit, which generates edge equations for the rasterizer. Attribute plane equations are also generated for linear interpolation of pixel attributes in the pixel shader. A coarse-rasterization stage generates all pixel tiles that are at least partially inside the primitive.

The zcull unit maintains a hierarchical z surface, rejecting pixel tiles if they are conservatively known to be occluded by previously drawn pixels. The rejection rate is up to 256 pixels per clock. The screen is subdivided into tiles; each TPC processes a predetermined subset. The pixel tile address therefore selects the destination TPC. Pixel tiles that survive zcull then go to a fine-rasterization stage that generates detailed coverage information and depth values for the pixels.

OpenGL and Direct3D require that a depth test be performed after the pixel shader has generated final color and depth values. When possible, for certain combinations of API state, the Tesla GPU performs the depth test and update ahead of the fragment shader, possibly saving thousands of cycles of processing time, without violating the API-mandated semantics.

The SMC assembles surviving pixels into warps to be processed by an SM running the current pixel shader. When the pixel shader has finished, the pixels are optionally depth tested if this was not done ahead of the shader. The SMC then sends surviving pixels and associated data to the ROP.

Raster operations processor

Each ROP is paired with a specific memory partition. The TPCs feed data to the ROPs via an interconnection network. ROPs handle depth and stencil testing and updates and color blending and updates. The memory controller uses lossless color (up to 8:1) and depth compression (up to 8:1) to reduce bandwidth. Each ROP has a peak rate of four pixels per clock and supports 16-bit floating-point and 32-bit floating-point HDR formats. ROPs support double-rate-depth processing when color writes are disabled.

Each memory partition is 64 bits wide and supports double-data-rate DDR2 and graphics-oriented GDDR3 protocols at up to 1 GHz, yielding a per-partition bandwidth of about 16 Gbytes/s (8 bytes × 2 transfers per clock × 1 GHz).

Antialiasing support includes up to 16× multisampling and supersampling. HDR formats are fully supported. Both algorithms support 1, 2, 4, 8, or 16 samples per pixel and generate a weighted average of the samples to produce the final pixel color. Multisampling executes the pixel shader once to generate a color shared by all pixel samples, whereas supersampling runs the pixel shader once per sample. In both cases, depth values are correctly evaluated for each sample, as required for correct interpenetration of primitives.

Because multisampling runs the pixel shader once per pixel (rather than once per sample), multisampling has become the most popular antialiasing method. Beyond four samples, however, storage cost increases faster than image quality improves, especially with HDR formats. For example, a single 1,600 × 1,200 pixel surface storing 16 four-component, 16-bit floating-point samples requires 1,600 × 1,200 × 16 × (64 bits color + 32 bits depth) = 368 Mbytes.

For the vast majority of edge pixels, two colors are enough; what matters is more-detailed coverage information. The coverage-sampling antialiasing (CSAA) algorithm provides low-cost coverage samples, allowing upward scaling. By computing and storing Boolean coverage at up to 16 samples and compressing redundant color, depth, and stencil information into the memory footprint and bandwidth of four or eight samples, 16× antialiasing quality can be achieved at 4× antialiasing performance. CSAA is compatible with existing rendering techniques including HDR and stencil algorithms. Edges defined by the intersection of interpenetrating polygons are rendered at the stored sample count quality (4× or 8×). Table 2 summarizes the storage requirements of the three algorithms.

Table 2. Comparison of antialiasing modes.

| Feature (per quality level 1×/4×/16×) | Brute-force supersampling | Multisampling | Coverage sampling |
| Texture and shader samples            | 1 / 4 / 16                | 1 / 1 / 1     | 1 / 1 / 1         |
| Stored color and z samples            | 1 / 4 / 16                | 1 / 4 / 16    | 1 / 4 / 4         |
| Coverage samples                      | 1 / 4 / 16                | 1 / 4 / 16    | 1 / 4 / 16        |

Memory and interconnect

The DRAM memory data bus width is 384 pins, arranged in six independent partitions of 64 pins each. Each partition owns 1/6 of the physical address space. The memory partition units directly enqueue requests. They arbitrate among hundreds of in-flight requests from the parallel stages of the graphics and computation pipelines. The arbitration seeks to maximize total DRAM transfer efficiency, which favors grouping related requests by DRAM bank and read/write direction, while minimizing latency as far as possible. The memory controllers support a wide range of DRAM clock rates, protocols, device densities, and data bus widths.

Interconnection network. A single hub unit routes requests to the appropriate partition from the nonparallel requesters (PCI-Express, host and command front end, input assembler, and display). Each memory partition has its own depth and color ROP units, so ROP memory traffic originates locally. Texture and load/store requests, however, can occur between any TPC and any memory partition, so an interconnection network routes requests and responses.

Memory management unit. All processing engines generate addresses in a virtual address space. A memory management unit performs virtual to physical translation. Hardware reads the page tables from local memory to respond to misses on behalf of a hierarchy of translation look-aside buffers spread out among the rendering engines.

Parallel computing architecture

The Tesla scalable parallel computing architecture enables the GPU processor array to excel in throughput computing, executing high-performance computing applications as well as graphics applications. Throughput applications have several properties that distinguish them from CPU serial applications:

- extensive data parallelism—thousands of computations on independent data elements;
- modest task parallelism—groups of threads execute the same program, and different groups can run different programs;
- intensive floating-point arithmetic;
- latency tolerance—performance is the amount of work completed in a given time;
- streaming data flow—requires high memory bandwidth with relatively little data reuse;
- modest inter-thread synchronization and communication—graphics threads do not communicate, and parallel computing applications require limited synchronization and communication.

GPU parallel performance on throughput problems has doubled every 12 to 18 months, pulled by the insatiable demands of the 3D game market. Now, Tesla GPUs in laptops, desktops, workstations, and systems are programmable in C with CUDA tools, using a simple parallel programming model.

Data-parallel problem decomposition

To map a large computing problem effectively to a highly parallel processing architecture, the programmer or compiler decomposes the problem into many small problems that can be solved in parallel. For example, the programmer partitions a large result data array into blocks and further partitions each block into elements, so that the result blocks can be computed independently in parallel, and the elements within each block can be computed cooperatively in parallel. Figure 5 shows the decomposition of a result data array into a 3 × 2 grid of blocks, in which each block is further decomposed into a 5 × 3 array of elements.

The two-level parallel decomposition maps naturally to the Tesla architecture: Parallel SMs compute result blocks, and parallel threads compute result elements.

The programmer or compiler writes a program that computes a sequence of result grids, partitioning each result grid into coarse-grained result blocks that are computed independently in parallel. The program computes each result block with an array of fine-grained parallel threads, partitioning the work among threads that compute result elements.

Cooperative thread array or thread block

Unlike the graphics programming model, which executes parallel shader threads independently, parallel-computing programming models require that parallel threads synchronize, communicate, share data, and cooperate to efficiently compute a result. To manage large numbers of concurrent threads that can cooperate, the Tesla computing architecture introduces the cooperative thread array (CTA), called a thread block in CUDA terminology.

A CTA is an array of concurrent threads that execute the same thread program and can cooperate to compute a result. A CTA consists of 1 to 512 concurrent threads, and each thread has a unique thread ID (TID), numbered 0 through m. The programmer declares the 1D, 2D, or 3D CTA shape and dimensions in threads. The TID has one, two, or three dimension indices. Threads of a CTA can share data in global or shared memory and can synchronize with the barrier instruction. CTA thread programs use their TIDs to select work and index shared data arrays. Multidimensional TIDs can eliminate integer divide and remainder operations when indexing arrays.
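For example, a 2D CTA shape lets each thread derive its matrix indices directly from the built-in index variables, with no integer divide or remainder; in this sketch the 16 × 16 block shape and row-major layout are illustrative:

```cuda
__global__ void scale2d(float* m, int width, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    m[y * width + x] *= s;                          // one element per thread
}

// launch (host side), assuming width and height are multiples of 16:
//   dim3 block(16, 16);
//   dim3 grid(width / 16, height / 16);
//   scale2d<<<grid, block>>>(d_m, width, 2.0f);
```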

Each SM executes up to eight CTAs concurrently, depending on CTA resource demands. The programmer or compiler declares the number of threads, registers, shared memory, and barriers required by the CTA program. When an SM has sufficient available resources, the SMC creates the CTA and assigns TID numbers to each thread. The SM executes the CTA threads concurrently as SIMT warps of 32 parallel threads.

Figure 5. Decomposing result data into a grid of blocks partitioned into elements to be computed in parallel.


CTA grids

To implement the coarse-grained block and grid decomposition of Figure 5, the GPU creates CTAs with unique CTA ID and grid ID numbers. The compute work distributor dynamically balances the GPU workload by distributing a stream of CTA work to SMs with sufficient available resources.

To enable a compiled binary program to run unchanged on large or small GPUs with any number of parallel SM processors, CTAs execute independently and compute result blocks independently of other CTAs in the same grid. Sequentially dependent application steps map to two sequentially dependent grids. The dependent grid waits for the first grid to complete; then the CTAs of the dependent grid read the result blocks written by the first grid.
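A CUDA sketch of two sequentially dependent grids; the kernel bodies are trivial placeholders. Kernel launches issued to the same (default) stream run in order, which realizes the dependency the article describes:

```cuda
__global__ void stepA(float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (float)i;          // first grid writes result blocks
}

__global__ void stepB(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] + 1.0f;      // dependent grid reads them
}

void run(float* d_buf0, float* d_buf1, int nBlocks, int nThreads)
{
    stepA<<<nBlocks, nThreads>>>(d_buf0);
    // implicit intergrid barrier: the second grid does not start until
    // every CTA of the first grid has completed its global writes
    stepB<<<nBlocks, nThreads>>>(d_buf0, d_buf1);
}
```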

Parallel granularity

Figure 6 shows levels of parallel granularity in the GPU computing model. The three levels are

- thread—computes result elements selected by its TID;
- CTA—computes result blocks selected by its CTA ID;
- grid—computes many result blocks, and sequential grids compute sequentially dependent application steps.

Higher levels of parallelism use multiple GPUs per CPU and clusters of multi-GPU nodes.

Parallel memory sharing

Figure 6 also shows levels of parallel read/write memory sharing:

- local—each executing thread has a private per-thread local memory for register spill, stack frame, and addressable temporary variables;
- shared—each executing CTA has a per-CTA shared memory for access to data shared by threads in the same CTA;
- global—sequential grids communicate and share large data sets in global memory.

Figure 6. Nested granularity levels: thread (a), cooperative thread array (b), and grid (c). These have corresponding memory-sharing levels: local per-thread, shared per-CTA, and global per-application.

Threads communicating in a CTA use the fast barrier synchronization instruction to wait for writes to shared or global memory to complete before reading data written by other threads in the CTA. The load/store memory system uses a relaxed memory order that preserves the order of reads and writes to the same address from the same issuing thread and from the viewpoint of CTA threads coordinating with the barrier synchronization instruction. Sequentially dependent grids use a global intergrid synchronization barrier between grids to ensure global read/write ordering.
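A standard shared-memory tree reduction illustrates the barrier idiom this paragraph describes; the kernel is a common CUDA pattern rather than code from the article, and assumes a power-of-two block size (256 here):

```cuda
__global__ void block_sum(const float* in, float* out)
{
    __shared__ float smem[256];
    int tid = threadIdx.x;

    smem[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                   // all loads into smem are now visible

    // Tree reduction: each round halves the number of active threads.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            smem[tid] += smem[tid + s];
        __syncthreads();               // every thread reaches the barrier
    }

    if (tid == 0)
        out[blockIdx.x] = smem[0];     // one partial sum per CTA
}
```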

Transparent scaling of GPU computing

Parallelism varies widely over the range of GPU products developed for various market segments. A small GPU might have one SM with eight SP cores, while a large GPU might have many SMs totaling hundreds of SP cores.

The GPU computing architecture transparently scales parallel application performance with the number of SMs and SP cores. A GPU computing program executes on any size of GPU without recompiling, and is insensitive to the number of SM multiprocessors and SP cores. The program does not know or care how many processors it uses.

The key is decomposing the problem into independently computed blocks as described earlier. The GPU compute work distribution unit generates a stream of CTAs and distributes them to available SMs to compute each independent block. Scalable programs do not communicate among CTA blocks of the same grid; the same grid result is obtained if the CTAs execute in parallel on many cores, sequentially on one core, or partially in parallel on a few cores.

CUDA programming model

CUDA is a minimal extension of the C and C++ programming languages. A programmer writes a serial program that calls parallel kernels, which can be simple functions or full programs. The CUDA program executes serial code on the CPU and executes parallel kernels across a set of parallel threads on the GPU. The programmer organizes these threads into a hierarchy of thread blocks and grids as described earlier. (A CUDA thread block is a GPU CTA.)

Figure 7 shows a CUDA program executing a series of parallel kernels on a heterogeneous CPU–GPU system. KernelA and KernelB execute on the GPU as grids of nBlkA and nBlkB thread blocks (CTAs), which instantiate nTidA and nTidB threads per CTA.

Figure 7. CUDA program sequence of kernel A followed by kernel B on a heterogeneous CPU–GPU system.

The CUDA compiler nvcc compiles an integrated application C/C++ program containing serial CPU code and parallel GPU kernel code. The CUDA runtime API manages the GPU as a computing device that acts as a coprocessor to the host CPU with its own memory system.

The CUDA programming model is similar in style to a single-program multiple-data (SPMD) software model—it expresses parallelism explicitly, and each kernel executes on a fixed number of threads. However, CUDA is more flexible than most SPMD implementations because each kernel call dynamically creates a new grid with the right number of thread blocks and threads for that application step.

CUDA extends C/C++ with the declaration specifier keywords __global__ for kernel entry functions, __device__ for global variables, and __shared__ for shared-memory variables. A CUDA kernel's text is simply a C function for one sequential thread. The built-in variables threadIdx.{x, y, z} and blockIdx.{x, y, z} provide the thread ID within a thread block (CTA) and the CTA ID within a grid, respectively. The extended function call syntax kernel<<<nBlocks, nThreads>>>(args); invokes a parallel kernel function on a grid of nBlocks, where each block instantiates nThreads concurrent threads, and args are ordinary arguments to function kernel().

Figure 8 shows an example serial C program and a corresponding CUDA C program. The serial C program uses two nested loops to iterate over each array index and compute c[idx] = a[idx] + b[idx] each trip. The parallel CUDA C program has no loops. It uses parallel threads to compute the same array indices in parallel, and each thread computes only one sum.

Figure 8. Serial C (a) and CUDA C (b) examples of programs that add arrays.
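Figure 8 itself is an image and is not reproduced here; the following is a minimal sketch of the kind of code it contrasts, with illustrative names and a 2D decomposition assumed to divide evenly:

```cuda
// Serial C version: nested loops visit every index of a 2D array.
void add_serial(int n, int m, const float* a, const float* b, float* c)
{
    for (int y = 0; y < n; ++y)
        for (int x = 0; x < m; ++x) {
            int idx = y * m + x;
            c[idx] = a[idx] + b[idx];
        }
}

// CUDA version: no loops; the grid/block decomposition replaces them,
// and each thread computes exactly one sum.
__global__ void add_parallel(int m, const float* a, const float* b, float* c)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = y * m + x;
    c[idx] = a[idx] + b[idx];
}

// launch (host side), assuming m and n are multiples of 16:
//   dim3 block(16, 16);
//   dim3 grid(m / 16, n / 16);
//   add_parallel<<<grid, block>>>(m, d_a, d_b, d_c);
```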

Scalability and performance

The Tesla unified architecture is designed for scalability. Varying the number of SMs, TPCs, ROPs, caches, and memory partitions provides the right mix for different performance and cost targets in the value, mainstream, enthusiast, and professional market segments. NVIDIA's Scalable Link Interconnect (SLI) enables multiple GPUs to act together as one, providing further scalability.

CUDA C/C++ applications executing on Tesla computing platforms, Quadro workstations, and GeForce GPUs deliver compelling computing performance on a range of large problems, including more than 100× speedups on molecular modeling, more than 200 Gflops on n-body problems, and real-time 3D magnetic-resonance imaging [12-14]. For graphics, the GeForce 8800 GPU delivers high performance and image quality for the most demanding games [15].

Figure 9 shows the GeForce 8800 Ultra physical die layout implementing the Tesla architecture shown in Figure 1. Implementation specifics include

- 681 million transistors, 470 mm²;
- TSMC 90-nm CMOS;
- 128 SP cores in 16 SMs;
- 12,288 processor threads;
- 1.5-GHz processor clock rate;
- peak 576 Gflops in processors;
- 768-Mbyte GDDR3 DRAM;
- 384-pin DRAM interface;
- 1.08-GHz DRAM clock;
- 104-Gbyte/s peak bandwidth; and
- typical power of 150 W at 1.3 V.

Figure 9. GeForce 8800 Ultra die layout.

The Tesla architecture is the first ubiquitous supercomputing platform. NVIDIA has shipped more than 50 million Tesla-based systems. This wide availability, coupled with C programmability and the CUDA software development environment, enables broad deployment of demanding parallel-computing and graphics applications.

With future increases in transistor density, the architecture will readily scale processor parallelism, memory partitions, and overall performance. Increased numbers of multiprocessors and memory partitions will support larger data sets and richer graphics and computing, without a change to the programming model.

We continue to investigate improved scheduling and load-balancing algorithms for the unified processor. Other areas of improvement are enhanced scalability for derivative products, reduced synchronization and communication overhead for compute programs, new graphics features, increased realized memory bandwidth, and improved power efficiency.

Acknowledgments

We thank the entire NVIDIA GPU development team for their extraordinary effort in bringing Tesla-based GPUs to market.

References

1. J. Montrym and H. Moreton, "The GeForce 6800," IEEE Micro, vol. 25, no. 2, Mar./Apr. 2005, pp. 41-51.
2. CUDA Technology, NVIDIA, 2007; http://www.nvidia.com/CUDA.
3. CUDA Programming Guide 1.1, NVIDIA, 2007; http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf.
4. J. Nickolls, I. Buck, K. Skadron, and M. Garland, "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, Mar./Apr. 2008, pp. 40-53.
5. DX Specification, Microsoft; http://msdn.microsoft.com/directx.
6. E. Lindholm, M.J. Kilgard, and H. Moreton, "A User-Programmable Vertex Engine," Proc. 28th Ann. Conf. Computer Graphics and Interactive Techniques (Siggraph 01), ACM Press, 2001, pp. 149-158.
7. G. Elder, "Radeon 9700," Eurographics/Siggraph Workshop Graphics Hardware, Hot 3D Session, 2002; http://www.graphicshardware.org/previous/www_2002/presentations/Hot3D-RADEON9700.ppt.
8. Microsoft DirectX 9 Programmable Graphics Pipeline, Microsoft Press, 2003.
9. J. Andrews and N. Baker, "Xbox 360 System Architecture," IEEE Micro, vol. 26, no. 2, Mar./Apr. 2006, pp. 25-37.
10. D. Blythe, "The Direct3D 10 System," ACM Trans. Graphics, vol. 25, no. 3, July 2006, pp. 724-734.
11. S.F. Oberman and M.Y. Siu, "A High-Performance Area-Efficient Multifunction Interpolator," Proc. 17th IEEE Symp. Computer Arithmetic (Arith-17), IEEE Press, 2005, pp. 272-279.
12. J.E. Stone et al., "Accelerating Molecular Modeling Applications with Graphics Processors," J. Computational Chemistry, vol. 28, no. 16, 2007, pp. 2618-2640.
13. L. Nyland, M. Harris, and J. Prins, "Fast N-Body Simulation with CUDA," GPU Gems 3, H. Nguyen, ed., Addison-Wesley, 2007, pp. 677-695.
14. S.S. Stone et al., "How GPUs Can Improve the Quality of Magnetic Resonance Imaging," Proc. 1st Workshop on General Purpose Processing on Graphics Processing Units, 2007; http://www.gigascale.org/pubs/1175.html.
15. A.L. Shimpi and D. Wilson, "NVIDIA's GeForce 8800 (G80): GPUs Re-architected for DirectX 10," AnandTech, Nov. 2006; http://www.anandtech.com/video/showdoc.aspx?i=2870.

Erik Lindholm is a distinguished engineer at NVIDIA, working in the architecture group. His research interests include graphics processor design and parallel graphics architectures. Lindholm has an MS in electrical engineering from the University of British Columbia.

John Nickolls is director of GPU computing architecture at NVIDIA. His interests include parallel processing systems, languages, and architectures. Nickolls has a BS in electrical engineering and computer science from the University of Illinois and MS and PhD degrees in electrical engineering from Stanford University.

Stuart Oberman is a design manager in the GPU hardware group at NVIDIA. His research interests include computer arithmetic, processor design, and parallel architectures. Oberman has a BS in electrical engineering from the University of Iowa and MS and PhD degrees in electrical engineering from Stanford University. He is a senior member of the IEEE.

John Montrym is a chief architect at NVIDIA, where he has worked in the development of several GPU product families. His research interests include graphics processor design, parallel graphics architectures, and hardware-software interfaces. Montrym has a BS in electrical engineering from the Massachusetts Institute of Technology.

Direct questions and comments about this article to Erik Lindholm or John Nickolls, NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050; [email protected] or [email protected].

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.
