GPU Algorithms I: The Early Years
MADALGO Summer School on Algorithms for Modern Parallel
and Distributed Models
Suresh Venkatasubramanian, University of Utah
Motivation
Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware [HIKL+99]
Motivation [Mar03]
Animusic Demo [ATI]
Overview
GPUs today have 10s of cores (soon, 100s !)Have huge data bandwidth (100s of GB/s)Force a SIMD-style mode of computationHPC on the cheap !
However,GPU models constantly changingAlmost no algorithmic work on GPUs (challenge!)Disconnect between programming model and execution model
No proofs!
Outline
1999 → 2006: Programmable pipeline
    sorting, matrices, geometry
    A streaming model
From 2006: CUDA
    sorting, matrices, graphs
This Lecture
1999 → 2006: Programmable pipeline
    sorting, matrices
The von Neumann Bottleneck [Bac77]
Coined by John Backus
Has always been a problem (regardless of technology)
GPU design is an attempt to get around it
Addressing the Bottleneck
• Systolic arrays move data between a grid of processing elements (e.g. + and × cells).
• Vector processors operate on multiple words at a time:
    LD A, 5
    ADD A, B
    LD C, 10
    ...
• SIMD processors have a fixed program that runs on different streams:
    1, . . . , 3, . . . , 6, . . . , 7
    21, . . . , 9, . . . , 4, . . . , 1
    101, . . . , 17, . . . , 6, . . . , 2
Flynn's taxonomy (Instructions × Data): SISD, SIMD, MISD, MIMD
GPU Basics
The GPU Pipeline [Shr10]
Vertex pipeline: Geometry transformations → Lighting → Clipping
Fragment pipeline: Texturing → Fragments → Depth/Stencil
Fragment shader operations
Every pixel acts like a SIMD processor (actually, some fixed number are processed in parallel)
Fragment processor could perform simple straight-line operations and conditionals (no looping)
(limited) texture memory for local storage
Each pixel processor could do a simple reduce (add, blend)
Computation initiated by a "rendering call" from the host machine
All computation resides on the GPU from the start of the vertex pipeline
Computation proceeds in passes: output could be rendered or stored in memory for the next pass
Simple GPU Algorithms
Voronoi Diagrams
The Voronoi diagram is the lower envelope of a collection of distance functions.
GPU Voronoi Diagrams [HIKL+99]
For each point, render a cone of colored triangles
Use many triangles to approximate a smooth cone
Use shading to encode distance as color value
Fragment processor at (x, y) receives a stream of pixels p1, p2, . . .

min ← ∞
if depth(pi) < min then            {GPU Z-test}
    min ← depth(pi)
    color(x, y) ← color(pi)        {GPU blending operation}
end if

The rendering engine is the mapper
Gathering happens automatically, with fixed key (x, y)
Fragment processors implement reduce
Simple GPU Matrix Multiplication[HCH03, AM03]
C = A · B,  Cij = ∑k Aik Bkj

Each output cell C[x, y] runs the same fragment program, accumulating a dot product:

for i = 1 . . . k do
  C[x, y] = C[x, y] + A[x, i] ∗ B[i, y]
end for

• GPU loops have to be unrolled
• This works only if k (the inner dimension) is small
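The per-cell fragment program above can be sketched as ordinary CPU code (illustrative, with small example matrices). The innermost loop is the one that had to be fully unrolled into the fragment program on early GPUs, hence the small-k caveat.

```python
# Fragment-style matrix multiplication: every output cell (x, y) runs the
# same straight-line accumulation of row x of A against column y of B.

def matmul_fragment_style(A, B):
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for x in range(n):          # one "fragment" per output cell
        for y in range(m):
            acc = 0
            for i in range(k):  # the loop that must be unrolled on the GPU
                acc += A[x][i] * B[i][y]
            C[x][y] = acc
    return C

C = matmul_fragment_style([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)  # [[19, 22], [43, 50]]
```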
GPU Matrix Multiplication II[FSH04]
The matrices A and B are partitioned, and the product is accumulated over several rendering passes (Pass 1, Pass 2, . . .).

• Multiple passes increase the number of memory accesses, but help with caching
• Each “field” is actually four values (x, y, z, u)
• Can get a factor-4 speedup with careful partitioning of the matrix
• More complicated methods are needed for sparse multiplication
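A CPU sketch of the multi-pass accumulation (illustrative: the chunk size stands in for the GPU's per-pass instruction limit, and the function name is made up). Each pass handles one slice of the inner dimension and accumulates its partial products into C.

```python
# Multi-pass matrix multiplication: split the inner dimension into fixed-size
# chunks and accumulate one chunk's contribution to C per "rendering pass".

def matmul_multipass(A, B, chunk=2):
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for start in range(0, k, chunk):  # one pass per chunk of the inner dimension
        for x in range(n):
            for y in range(m):
                for i in range(start, min(start + chunk, k)):
                    C[x][y] += A[x][i] * B[i][y]  # accumulate across passes
    return C

C = matmul_multipass([[1, 2, 3]], [[1], [1], [1]], chunk=2)
print(C)  # [[6]]
```

Re-reading C between passes is the extra memory traffic the slide mentions; the payoff is that each pass touches a small, cache-friendly slice of A and B.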
Sorting on the GPU
Can we sort on the GPU?

Compute parallelism alone is not enough:
We need SIMD structure (find repeated instruction patterns)
External-memory methods manage the I/O bottleneck but don’t exploit compute power

Plan: use sorting networks:
Many simple (and local) compute elements
High throughput and synchronous operation
Bitonic Sorting
Each comparator takes two inputs a and b and outputs min(a, b) and max(a, b).

(Figure: a comparator network built up layer by layer on the input sequence 3 7 4 8 6 2 1 5.)

Bitonic sort requires log² n layers, with n/2 comparators per layer.
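The network above corresponds to the following recursive formulation (a sketch assuming a power-of-two input length): sort the two halves in opposite directions to form a bitonic sequence, then merge with layers of comparators that compare elements half the sequence apart.

```python
# Recursive bitonic sort for power-of-two input lengths.

def bitonic_sort(a, ascending=True):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    first = bitonic_sort(a[:mid], True)    # ascending half
    second = bitonic_sort(a[mid:], False)  # descending half: together, bitonic
    return bitonic_merge(first + second, ascending)

def bitonic_merge(a, ascending):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    # One layer of comparators: compare elements mid positions apart.
    for i in range(mid):
        if (a[i] > a[i + mid]) == ascending:
            a[i], a[i + mid] = a[i + mid], a[i]
    return (bitonic_merge(a[:mid], ascending) +
            bitonic_merge(a[mid:], ascending))

print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Every comparator in a layer is independent, which is exactly the SIMD structure the GPU needs.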
GPU Bitonic Sorting[PDC+03, GGKM06]
(Figure: a 2D texture holding the values a0, a1, a2, a3, with lookup indices 0 1 / 3 2.)

• Fill a 2D array (texture) with the values
• (For each pass) construct a quadrilateral with the lookup values
• The texture hardware locates the lookup values, and the fragment program does the comparisons
• log² n passes complete the computation
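The pass structure can be sketched iteratively (a sketch for power-of-two n, names illustrative): each (k, j) pair below corresponds to one rendering pass, and within a pass every element runs the same compare-and-swap against its partner at index i XOR j — a pure SIMD layer.

```python
# Iterative bitonic sort, organized as the log^2 n data-parallel passes
# that the GPU version executes one rendering pass at a time.

def bitonic_sort_passes(a):
    n = len(a)                     # n must be a power of two
    k = 2
    while k <= n:                  # outer stages: merge lengths 2, 4, ..., n
        j = k // 2
        while j >= 1:              # one pass per j: log^2 n passes in total
            for i in range(n):     # every element does the same work (SIMD)
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort_passes([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The loop body of the inner `for` is the fragment program; the texture lookups compute `partner` and fetch its value.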
Review
This Lecture
Brief history of the GPU model
A simple GPU SIMD model
Examples: Voronoi diagrams, matrix multiplication, and sorting
Next Lecture
More simple GPU examples
A toy example of the algorithmic view: “GPU as streaming processor”
The CUDA model for modern GPUs
A “Hello, world” example: matrix multiplication in CUDA
Questions?
[Dur12]
References I
Ádám Moravánsky. Dense matrix algebra on the GPU. http://www.shaderx2.com/shaderx.PDF, 2003.

ATI/Animusic. Pipe Dream. http://developer.amd.com/ARCHIVE/LEGACYDEMOS/pages/ATIRadeon9700Real-TimeDemos.aspx.

J. Backus. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. ACM Turing Award Lecture, 1977.

Dallin Durfee. Strange Quark. http://sqcomic.com/2012/08/19/, Aug 19, 2012.
References II
K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 133–137. ACM, 2004.

N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 325–336. ACM, 2006.

Jesse D. Hall, Nathan A. Carr, and John C. Hart. Cache and bandwidth aware matrix multiplication on the GPU. Technical Report UIUCDCS-R-2003-2328, UIUC, 2003.
References III
K. E. Hoff III, J. Keyser, M. Lin, D. Manocha, and T. Culver. Fast computation of generalized Voronoi diagrams using graphics hardware. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 277–286. ACM Press/Addison-Wesley, 1999.

John Markoff. From PlayStation to supercomputer for $50,000. New York Times, May 2003. http://www.nytimes.com/2003/05/26/business/technology-from-playstation-to-supercomputer-for-50000.html

T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 41–50. Eurographics Association, 2003.
References IV
D. Shreiner. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Versions 3.0 and 3.1, volume 1. Addison-Wesley Professional, 2010.