gpu algorithms i the early years - madalgo.au.dk · animusic demo[ati] overview gpus today have 10s...

GPU Algorithms IThe Early Years

MADALGO Summer School on Algorithms for Modern Parallel

and Distributed Models

Suresh VenkatasubramanianUniversity of Utah

Motivation

Fast Computation of Generalized Voronoi Diagrams UsingGraphics Hardware[HIKL+99]

Motivation[Mar03]

Animusic Demo[ATI]

Overview

GPUs today have 10s of cores (soon, 100s !)Have huge data bandwidth (100s of GB/s)Force a SIMD-style mode of computationHPC on the cheap !

However,GPU models constantly changingAlmost no algorithmic work on GPUs (challenge!)Disconnect between programming model and execution model

No proofs!

Overview

GPUs today have 10s of cores (soon, 100s !)Have huge data bandwidth (100s of GB/s)Force a SIMD-style mode of computationHPC on the cheap !

However,GPU models constantly changingAlmost no algorithmic work on GPUs (challenge!)Disconnect between programming model and execution model

No proofs!

Outline

1999 2006

Programmablepipeline

1999 2006

sorting

1999 2006

sortingmatrices

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

sortingmatrices

graphs

Outline

1999 2006

sorting

1999 2006

sortingmatrices

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

sortingmatrices

graphs

Outline

1999 2006

sorting

1999 2006

sortingmatrices

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

sortingmatrices

graphs

Outline

1999 2006

sorting

1999 2006

sortingmatrices

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

sortingmatrices

graphs

Outline

1999 2006

sorting

1999 2006

sortingmatrices

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

sortingmatrices

graphs

Outline

1999 2006

sorting

1999 2006

sortingmatrices

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

1999 2006

sortingmatrices

geometry

A streaming model

sortingmatrices

graphs

This Lecture

1999 2006

sortingmatrices

The von Neumann Bottleneck[Bac77]

Von NeumannBottleneck

Coined by John Backus

Has always been a problem (regardless of technology)

GPU design is an attempt to get around it

Addressing the Bottleneck

• Systolic arrays move data between a grid of processingelements.

• Vector processors operate on multiple words at a time

LD A, 5

ADD A, B

LD C, 10

• SIMD processors have a fixed program that runs ondifferent streams

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

1, . . . , 3, . . . , 6, . . . , 7

21, . . . , 9, . . . , 4, . . . , 1

101, . . . , 17, . . . , 6, . . . , 2

Instructions

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

1, . . . , 3, . . . , 6, . . . , 7

21, . . . , 9, . . . , 4, . . . , 1

101, . . . , 17, . . . , 6, . . . , 2

Instructions

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

1, . . . , 3, . . . , 6, . . . , 7

21, . . . , 9, . . . , 4, . . . , 1

101, . . . , 17, . . . , 6, . . . , 2

Instructions

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

1, . . . , 3, . . . , 6, . . . , 7

21, . . . , 9, . . . , 4, . . . , 1

101, . . . , 17, . . . , 6, . . . , 2

Instructions

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

1, . . . , 3, . . . , 6, . . . , 7

21, . . . , 9, . . . , 4, . . . , 1

101, . . . , 17, . . . , 6, . . . , 2

Instructions

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

1, . . . , 3, . . . , 6, . . . , 7

21, . . . , 9, . . . , 4, . . . , 1

101, . . . , 17, . . . , 6, . . . , 2

Instructions

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

1, . . . , 3, . . . , 6, . . . , 7

21, . . . , 9, . . . , 4, . . . , 1

101, . . . , 17, . . . , 6, . . . , 2

Instructions

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

LD A, 5

ADD A, B

LD C, 10

1, . . . , 3, . . . , 6, . . . , 7

21, . . . , 9, . . . , 4, . . . , 1

101, . . . , 17, . . . , 6, . . . , 2

Instructions

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

Instructions

SISD MISD

MIMDSIMD

Flynn’s taxonomy

GPU Basics

The GPU Pipeline[Shr10]

Geometrytransforma-tions

LightingGeometrytransforma-tions

Lighting ClippingGeometrytransforma-tions

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Texturing

Lighting Clipping

Vertex pipeline

TexturingFragments

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Texturing

Lighting Clipping

Vertex pipeline

TexturingFragments

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Texturing

Lighting Clipping

Vertex pipeline

TexturingFragments

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Texturing

Lighting Clipping

Vertex pipeline

TexturingFragments

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Texturing

Lighting Clipping

Vertex pipeline

TexturingFragments

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Texturing

Lighting Clipping

Vertex pipeline

TexturingFragments

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Texturing

Lighting Clipping

Vertex pipeline

TexturingFragments

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Texturing

Lighting Clipping

Vertex pipeline

TexturingFragments

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Texturing

Lighting Clipping

Vertex pipeline

TexturingFragments

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Texturing

Lighting Clipping

Vertex pipeline

TexturingFragments

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Texturing

Lighting Clipping

Vertex pipeline

TexturingFragments

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Lighting Clipping

Vertex pipeline

Fragment pipeline

Fragment shader operations

Every pixel acts like an SIMD processor

actually, some fixed number are processed in parallel

Fragment processor could perform simple straight-line operationsand conditionals (no looping)(limited) texture memory for local storageEach pixel processor could do a simple reduce (add, blend)Computation initiated by “rendering call” from host machine.All computation resides on GPU from start of the vertex pipelineComputation proceeds in passes: output could be rendered orstored in memory for next pass.

Every pixel acts like an SIMD processoractually, some fixed number are processed in parallel

Fragment processor could perform simple straight-line operationsand conditionals (no looping)

(limited) texture memory for local storageEach pixel processor could do a simple reduce (add, blend)Computation initiated by “rendering call” from host machine.All computation resides on GPU from start of the vertex pipelineComputation proceeds in passes: output could be rendered orstored in memory for next pass.

Fragment processor could perform simple straight-line operationsand conditionals (no looping)(limited) texture memory for local storage

Each pixel processor could do a simple reduce (add, blend)Computation initiated by “rendering call” from host machine.All computation resides on GPU from start of the vertex pipelineComputation proceeds in passes: output could be rendered orstored in memory for next pass.

Fragment processor could perform simple straight-line operationsand conditionals (no looping)(limited) texture memory for local storageEach pixel processor could do a simple reduce (add, blend)

Computation initiated by “rendering call” from host machine.All computation resides on GPU from start of the vertex pipelineComputation proceeds in passes: output could be rendered orstored in memory for next pass.

Fragment processor could perform simple straight-line operationsand conditionals (no looping)(limited) texture memory for local storageEach pixel processor could do a simple reduce (add, blend)Computation initiated by “rendering call” from host machine.

All computation resides on GPU from start of the vertex pipelineComputation proceeds in passes: output could be rendered orstored in memory for next pass.

Fragment processor could perform simple straight-line operationsand conditionals (no looping)(limited) texture memory for local storageEach pixel processor could do a simple reduce (add, blend)Computation initiated by “rendering call” from host machine.All computation resides on GPU from start of the vertex pipeline

Computation proceeds in passes: output could be rendered orstored in memory for next pass.

Simple GPU Algorithms

Voronoi Diagrams

Voronoi diagram is lower envelope of collection of distancefunctions

Voronoi Diagrams

GPU Voronoi Diagrams[HIKL+99]

For each point, render a cone of colored triangles

Use many triangles to approximate smooth coneUse shading to encode distance as color value

Fragment processor at (x, y) receives stream of pixels p1, p2, . . .

min← 0if depth(pi) < min then

{GPU Z-test}min = depth(pi) {GPU blending operation}color(x, y) = color(pi)

end if

Rendering engine is the mapperGathering happens automatically, with fixed key (x, y)Fragment processors implement reduce

For each point, render a cone of colored trianglesUse many triangles to approximate smooth cone

Use shading to encode distance as color value

end if

For each point, render a cone of colored trianglesUse many triangles to approximate smooth coneUse shading to encode distance as color value

end if

min← 0

if depth(pi) < min then

end if

{GPU Z-test}

min = depth(pi) {GPU blending operation}color(x, y) = color(pi)

end if

{GPU Z-test}min = depth(pi) {GPU blending operation}

color(x, y) = color(pi)end if

end if

Rendering engine is the mapper

Gathering happens automatically, with fixed key (x, y)Fragment processors implement reduce

end if

Rendering engine is the mapperGathering happens automatically, with fixed key (x, y)

Fragment processors implement reduce

end if

Simple GPU Matrix Multiplication[HCH03, AM03]

C = A · BCij = ∑

kAikBkj

C = A · BCij = ∑

kAikBkj

C = A · BCij = ∑

kAikBkj

B for i = 1 . . . k doC[x, y] = C[x, y] +A[i, k] ∗ B[k, j]

end for

C = A · BCij = ∑

kAikBkj

end for

• GPU loops have to be unrolled

C = A · BCij = ∑

kAikBkj

end for

• This works only if k is small

C = A · BCij = ∑

kAikBkj

C = A · BCij = ∑

kAikBkj

C = A · BCij = ∑

kAikBkj

end for

C = A · BCij = ∑

kAikBkj

end for

C = A · BCij = ∑

kAikBkj

end for

C = A · BCij = ∑

kAikBkj

C = A · BCij = ∑

kAikBkj

C = A · BCij = ∑

kAikBkj

end for

C = A · BCij = ∑

kAikBkj

end for

C = A · BCij = ∑

kAikBkj

end for

C = A · BCij = ∑

kAikBkj

C = A · BCij = ∑

kAikBkj

C = A · BCij = ∑

kAikBkj

end for

C = A · BCij = ∑

kAikBkj

end for

C = A · BCij = ∑

kAikBkj

end for

C = A · BCij = ∑

kAikBkj

C = A · BCij = ∑

kAikBkj

C = A · BCij = ∑

kAikBkj

end for

C = A · BCij = ∑

kAikBkj

end for

C = A · BCij = ∑

kAikBkj

end for

GPU Matrix Multiplication II[FSH04]

BPass 1

BPass 2

• Multiple passes increase number of memory access, but help withcaching

BPass 2

• Each ”field” is actually four values

BPass 2

• Can get a factor 4 speedup with careful partitioning of matrix

BPass 2

• More complicated methods needed for sparse multiplication

BPass 1

BPass 2

BPass 1

BPass 2

BPass 1

BPass 2

BPass 1

BPass 2

BPass 1

BPass 2

Sorting on the GPU

Can we sort on the GPU ?

Compute parallelism is not enough

Need SIMD structures (find repeated instruction patterns)External memory methods manage I/O bottlenecks but don’texploit compute power.Plan: Use sorting networks:

Many simple (and local) compute elementsHigh-throughput and synchronous

Sorting on the GPU

Compute parallelism is not enoughNeed SIMD structures (find repeated instruction patterns)

External memory methods manage I/O bottlenecks but don’texploit compute power.Plan: Use sorting networks:

Sorting on the GPU

Compute parallelism is not enoughNeed SIMD structures (find repeated instruction patterns)External memory methods manage I/O bottlenecks but don’texploit compute power.

Plan: Use sorting networks:

Sorting on the GPU

Compute parallelism is not enoughNeed SIMD structures (find repeated instruction patterns)External memory methods manage I/O bottlenecks but don’texploit compute power.Plan: Use sorting networks:

Sorting on the GPU

Many simple (and local) compute elements

High-throughput and synchronous

Sorting on the GPU

Bitonic Sorting

min(a, b)

max(a, b)

min(a, b)

max(a, b)

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

Bitonic sort requires log2 n layers, n/2 comparators/layer

Bitonic Sorting

min(a, b)

max(a, b)

min(a, b)

max(a, b)

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

Bitonic Sorting

min(a, b)

max(a, b)

min(a, b)

max(a, b)

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

Bitonic Sorting

min(a, b)

max(a, b)

min(a, b)

max(a, b)

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

Bitonic Sorting

min(a, b)

max(a, b)

min(a, b)

max(a, b)

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

Bitonic Sorting

min(a, b)

max(a, b)

min(a, b)

max(a, b)

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

Bitonic Sorting

min(a, b)

max(a, b)

min(a, b)

max(a, b)

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

Bitonic Sorting

min(a, b)

max(a, b)

min(a, b)

max(a, b)

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

Bitonic Sorting

min(a, b)

max(a, b)

min(a, b)

max(a, b)

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

Bitonic Sorting

min(a, b)

max(a, b)

min(a, b)

max(a, b)

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

min(a, b)

max(a, b)

37486215

GPU Bitonic Sorting[PDC+03, GGKM06]

Texturea0a3

Texture

• Fill 2D array with values

Texture

• Fill 2D array with values• (for each pass) construct quadrilateral with lookup

values

Texture

values

• Texture hardware locates lookup values, and fragmentprogram does comparisons

Texture

values

• log2 n passes used to complete the computation

Texture

values

Texture

values

Texture

values

Texture

values

Texture

values

Texture

values

Texturea0a3

Texture

values

Texture

values

Texture

values

Texturea0a3

Texture

values

Texture

values

Texture

values

Texturea0a3

Texture

values

Texture

values

Texture

values

Review

This Lecture

Brief history of GPU modelSimple GPU SIMD modelExamples: Voronoi diagrams, matrix multiplication and sorting

Next Lecture

More simple GPU examplesToy example of algorithmic view: “GPU as streaming processor”The CUDA model for modern GPUs“Hello world” example: matrix multiplication in CUDA

Questions?

[Dur12]

References I

Ádám Moravánsky.Dense matrix algebra on the gpu.http://www.shaderx2.com/shaderx.PDF, 2003.

ATI/Animusic.Pipe dream.http://developer.amd.com/ARCHIVE/LEGACYDEMOS/pages/ATIRadeon9700Real-TimeDemos.aspx.

J. Backus.Can programming be liberated from the von Neumann style ? afunctional style and its algebra of programs.ACM Turing Award Lecture, 1977.

Dallin Durfee.Strange quark.http://sqcomic.com/2012/08/19/, Aug 19 2012.

References II

K. Fatahalian, J. Sugerman, and P. Hanrahan.Understanding the efficiency of gpu algorithms for matrix-matrixmultiplication.In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference onGraphics hardware, pages 133–137. ACM, 2004.

N. Govindaraju, J. Gray, R. Kumar, and D. Manocha.Gputerasort: high performance graphics co-processor sorting for largedatabase management.In Proceedings of the 2006 ACM SIGMOD international conference onManagement of data, pages 325–336. ACM, 2006.

Jesse D. Hall, Nathan A. Carr, and John C. Hart.Cache and bandwidth aware matrix multiplication on the gpu.Technical Report UIUCDCS-R-2003-2328, UIUC, 2003.

References III

K.E. Hoff III, J. Keyser, M. Lin, D. Manocha, and T. Culver.Fast computation of generalized voronoi diagrams using graphicshardware.In Proceedings of the 26th annual conference on Computer graphics andinteractive techniques, pages 277–286. ACM Press/Addison-WesleyPublishing Co., 1999.

John Markoff.From PlayStation to Supercomputer for $50,000.http://www.nytimes.com/2003/05/26/business/technology-from-playstation-to-supercomputer-for-50000.html,May 2003.New York Times.

T.J. Purcell, C. Donner, M. Cammarano, H.W. Jensen, and P. Hanrahan.Photon mapping on programmable graphics hardware.In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference onGraphics hardware, pages 41–50. Eurographics Association, 2003.

References IV

D. Shreiner.OpenGL programming guide: the official guide to learning OpenGL, versions3.0 and 3.1, volume 1.Addison-Wesley Professional, 2010.

gpu algorithms i the early years - madalgo.au.dk · animusic demo[ati] overview gpus today have 10s...

Documents