GPU Algorithms I: The Early Years

MADALGO Summer School on Algorithms for Modern Parallel and Distributed Models

Suresh Venkatasubramanian, University of Utah

Motivation

Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware [HIKL+99]

Motivation [Mar03]

Animusic Demo [ATI]

Overview

• GPUs today have 10s of cores (soon, 100s!)
• Have huge data bandwidth (100s of GB/s)
• Force a SIMD-style mode of computation
• HPC on the cheap!

However,

• GPU models constantly changing
• Almost no algorithmic work on GPUs (challenge!)
• Disconnect between programming model and execution model

No proofs!

Outline

[Timeline figure, 1999 to 2006. Labels, in order of appearance: Programmable pipeline; sorting; matrices; geometry; A streaming model; CUDA; sorting; matrices; graphs]

This Lecture

[Timeline figure, 1999 to 2006. Labels shown: Programmable pipeline; sorting; matrices]

The von Neumann Bottleneck [Bac77]

[Figure label: Von Neumann Bottleneck]

• Coined by John Backus
• Has always been a problem (regardless of technology)
• GPU design is an attempt to get around it

Addressing the Bottleneck

• Systolic arrays move data between a grid of processing elements.
• Vector processors operate on multiple words at a time.
• SIMD processors have a fixed program that runs on different streams.

[Figures: operators + and ×; an instruction sequence LD A, 5 / ADD A, B / LD C, 10 / ...; three data streams 1, ..., 3, ..., 6, ..., 7 and 21, ..., 9, ..., 4, ..., 1 and 101, ..., 17, ..., 6, ..., 2]

Flynn’s taxonomy

                  Single instruction   Multiple instruction
Single data       SISD                 MISD
Multiple data     SIMD                 MIMD
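
The SIMD bullet can be made concrete with a small sketch. It is not taken from the slides: the three-instruction program only loosely mirrors the LD/ADD listing in the figure, and the opcodes, the register names `A` and `X`, the `run_simd` helper, and the stream values are all hypothetical. The point is only that a single fixed instruction sequence is applied in lockstep to every stream, so instructions are fetched once while the data differs per lane.

```python
# Hypothetical sketch of SIMD execution: one fixed program, many data streams.
# The instruction set (LD, ADD, MUL) and all names here are illustrative only.

PROGRAM = [
    ("LD", "A", 5),      # A <- 5
    ("ADD", "A", "X"),   # A <- A + X, where X is the lane's current element
    ("MUL", "A", 2),     # A <- A * 2
]


def step(registers, instruction):
    """Apply one instruction to one lane's register file."""
    op, dst, src = instruction
    # A string operand names a register; anything else is an immediate value.
    value = registers[src] if isinstance(src, str) else src
    if op == "LD":
        registers[dst] = value
    elif op == "ADD":
        registers[dst] += value
    elif op == "MUL":
        registers[dst] *= value
    else:
        raise ValueError(f"unknown opcode: {op}")


def run_simd(program, streams):
    """Run the same program over each element position of several streams.

    Every lane executes the same instruction at the same time; only the
    data (register X) differs between lanes.
    """
    outputs = [[] for _ in streams]
    for elements in zip(*streams):          # one element from each stream
        lanes = [{"A": 0, "X": x} for x in elements]
        for instruction in program:         # single instruction stream
            for lane in lanes:              # applied to every lane
                step(lane, instruction)
        for out, lane in zip(outputs, lanes):
            out.append(lane["A"])
    return outputs


if __name__ == "__main__":
    streams = [[1, 3, 6], [2, 9, 4], [10, 17, 6]]
    print(run_simd(PROGRAM, streams))
```

In this toy model each output element is `(5 + x) * 2` for the corresponding input `x`. The inner loop over lanes is sequential here; on real SIMD hardware it would happen in parallel.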

GPU Basics

The GPU Pipeline [Shr10]

[Pipeline figure]

• Vertex pipeline: geometry transformations, lighting, clipping
• Fragment pipeline: texturing, fragments, depth/stencil
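
As a rough illustration of how the two halves of the pipeline fit together, here is a minimal sketch under heavy simplifying assumptions. It handles isolated points rather than triangles, omits lighting, texturing, and the stencil test, and clips by discarding points outside the cube [-1, 1]^3. The function names `vertex_stage` and `fragment_stage` and their parameters are hypothetical and do not correspond to any graphics API.

```python
# Hypothetical two-stage pipeline sketch: points only, no lighting or texturing.

def vertex_stage(points, scale, offset):
    """Transform each point, then clip against the cube [-1, 1]^3."""
    visible = []
    for x, y, z, color in points:
        # Geometry transformation: uniform scale followed by a translation.
        tx = x * scale + offset[0]
        ty = y * scale + offset[1]
        tz = z * scale + offset[2]
        # Clipping (simplified): discard points outside the view volume.
        if all(-1.0 <= c <= 1.0 for c in (tx, ty, tz)):
            visible.append((tx, ty, tz, color))
    return visible


def fragment_stage(points, width, height):
    """Turn each point into one fragment and resolve visibility by depth test."""
    depth = [[float("inf")] * width for _ in range(height)]
    frame = [[None] * width for _ in range(height)]
    for x, y, z, color in points:
        # Map coordinates in [-1, 1] to pixel indices.
        col = min(int((x + 1.0) / 2.0 * width), width - 1)
        row = min(int((y + 1.0) / 2.0 * height), height - 1)
        if z < depth[row][col]:     # depth test: keep the nearest fragment
            depth[row][col] = z
            frame[row][col] = color
    return frame


if __name__ == "__main__":
    scene = [
        (0.0, 0.0, 0.5, "red"),
        (0.0, 0.0, 0.2, "blue"),    # same pixel as red, but nearer
        (3.0, 0.0, 0.0, "green"),   # outside the view volume, clipped
    ]
    clipped = vertex_stage(scene, scale=1.0, offset=(0.0, 0.0, 0.0))
    print(fragment_stage(clipped, width=2, height=2))
```

Both stages apply the same small operation independently to every vertex or fragment, which is what makes the pipeline a natural fit for the SIMD-style execution described earlier.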

Page 28: GPU Algorithms I The Early Years - madalgo.au.dk · Animusic Demo[ATI] Overview GPUs today have 10s of cores (soon, 100s !) Have huge data bandwidth (100s of GB/s) Force a SIMD-style

The GPU Pipeline[Shr10]

Geometrytransforma-tions

Geometrytransforma-tions

LightingGeometrytransforma-tions

Lighting ClippingGeometrytransforma-tions

Lighting Clipping

Vertex pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

Texturing

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragments

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Page 29: GPU Algorithms I The Early Years - madalgo.au.dk · Animusic Demo[ATI] Overview GPUs today have 10s of cores (soon, 100s !) Have huge data bandwidth (100s of GB/s) Force a SIMD-style

The GPU Pipeline[Shr10]

Geometrytransforma-tions

Geometrytransforma-tions

Lighting

Geometrytransforma-tions

Lighting ClippingGeometrytransforma-tions

Lighting Clipping

Vertex pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

Texturing

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragments

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Page 30: GPU Algorithms I The Early Years - madalgo.au.dk · Animusic Demo[ATI] Overview GPUs today have 10s of cores (soon, 100s !) Have huge data bandwidth (100s of GB/s) Force a SIMD-style

The GPU Pipeline[Shr10]

Geometrytransforma-tions

Geometrytransforma-tions

Lighting

Geometrytransforma-tions

Lighting Clipping

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

Texturing

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragments

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Page 31: GPU Algorithms I The Early Years - madalgo.au.dk · Animusic Demo[ATI] Overview GPUs today have 10s of cores (soon, 100s !) Have huge data bandwidth (100s of GB/s) Force a SIMD-style

The GPU Pipeline[Shr10]

Geometrytransforma-tions

Geometrytransforma-tions

LightingGeometrytransforma-tions

Lighting Clipping

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

Texturing

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragments

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Page 32: GPU Algorithms I The Early Years - madalgo.au.dk · Animusic Demo[ATI] Overview GPUs today have 10s of cores (soon, 100s !) Have huge data bandwidth (100s of GB/s) Force a SIMD-style

The GPU Pipeline[Shr10]

Geometrytransforma-tions

Geometrytransforma-tions

LightingGeometrytransforma-tions

Lighting ClippingGeometrytransforma-tions

Lighting Clipping

Vertex pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

Texturing

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragments

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Page 33: GPU Algorithms I The Early Years - madalgo.au.dk · Animusic Demo[ATI] Overview GPUs today have 10s of cores (soon, 100s !) Have huge data bandwidth (100s of GB/s) Force a SIMD-style

The GPU Pipeline[Shr10]

Geometrytransforma-tions

Geometrytransforma-tions

LightingGeometrytransforma-tions

Lighting ClippingGeometrytransforma-tions

Lighting Clipping

Vertex pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

Texturing

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragments

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Geometrytransforma-tions

Lighting Clipping

Vertex pipeline

TexturingFragmentsDepth/Stencil

Fragment pipeline

Fragment shader operations

• Every pixel acts like an SIMD processor
    (actually, some fixed number are processed in parallel)

• Fragment processor could perform simple straight-line operations and conditionals (no looping)

• (limited) texture memory for local storage

• Each pixel processor could do a simple reduce (add, blend)

• Computation initiated by “rendering call” from host machine

• All computation resides on GPU from start of the vertex pipeline

• Computation proceeds in passes: output could be rendered or stored in memory for next pass
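The execution model described on this slide can be mimicked on the CPU. The following is a minimal sketch in plain Python, not real GPU code: `render_pass`, `scale_and_add` and `clamp_at_ten` are hypothetical names chosen for illustration. Each fragment program is loop-free, reads its inputs as textures, and writes only to its own pixel; the second pass reads the output of the first.

```python
# CPU sketch of the early fragment-shader model (all names are hypothetical).

def render_pass(fragment_program, textures, width, height):
    """One 'rendering call': run the same program independently at every pixel."""
    return [[fragment_program(x, y, textures) for x in range(width)]
            for y in range(height)]

def scale_and_add(x, y, textures):
    # Straight-line code with read-only texture lookups, no loops.
    a, b = textures
    return 2 * a[y][x] + b[y][x]

def clamp_at_ten(x, y, textures):
    # A conditional is allowed; the result goes only to pixel (x, y).
    (src,) = textures
    v = src[y][x]
    return v if v < 10 else 10

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
# Pass 1 writes an intermediate result; pass 2 reads it back as a texture.
pass1 = render_pass(scale_and_add, (a, b), 2, 2)
pass2 = render_pass(clamp_at_ten, (pass1,), 2, 2)
```

The host-side loop inside `render_pass` stands in for the hardware, which would process some fixed number of pixels at a time.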

Simple GPU Algorithms

Voronoi Diagrams

[Figure: point sites and their distance functions]

Voronoi diagram is lower envelope of collection of distance functions
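The statement can be written out in symbols. The notation below ($p_i$, $d_i$, $f$, $\mathrm{Vor}$) is introduced here for illustration and does not appear on the slide; it assumes point sites in the plane with Euclidean distance.

```latex
% Sites p_1, ..., p_n in the plane; d_i is the distance function of site p_i.
d_i(x) = \lVert x - p_i \rVert, \qquad
f(x) = \min_{1 \le i \le n} d_i(x), \qquad
\mathrm{Vor}(p_i) = \{\, x : d_i(x) = f(x) \,\}
```

The graph of each $d_i$ is a cone with its apex at $p_i$, so the lower envelope $f$ is what is visible when the cones are viewed from below.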

GPU Voronoi Diagrams[HIKL+99]

• For each point, render a cone of colored triangles

• Use many triangles to approximate smooth cone

• Use shading to encode distance as color value

Fragment processor at (x, y) receives stream of pixels p1, p2, . . .

min ← ∞
for each pixel pi in the stream do
    if depth(pi) < min then {GPU Z-test}
        min = depth(pi) {GPU blending operation}
        color(x, y) = color(pi)
    end if
end for

• Rendering engine is the mapper

• Gathering happens automatically, with fixed key (x, y)

• Fragment processors implement reduce
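The per-pixel minimum above can be simulated on the CPU. This is a rough sketch under stated assumptions, not the implementation of [HIKL+99]: the function name `gpu_voronoi` is hypothetical, each cone is evaluated exactly rather than approximated by triangles, the running minimum starts at infinity (the "far" value), and the color of a site is simply its index.

```python
import math

def gpu_voronoi(sites, width, height):
    """Discrete Voronoi diagram via a simulated depth buffer.

    For every site we 'render a cone': each pixel receives a fragment whose
    depth is the distance to the site and whose color is the site's index.
    """
    INF = float("inf")
    depth = [[INF] * width for _ in range(height)]   # running min, starts "far"
    color = [[None] * width for _ in range(height)]
    for site_id, (sx, sy) in enumerate(sites):        # one cone per site
        for y in range(height):
            for x in range(width):
                d = math.hypot(x - sx, y - sy)        # cone height above (x, y)
                if d < depth[y][x]:                   # Z-test
                    depth[y][x] = d                   # keep the new minimum
                    color[y][x] = site_id             # write the winning color
    return color, depth

# Two sites on a 5x1 strip of pixels.
color, depth = gpu_voronoi([(0, 0), (4, 0)], width=5, height=1)
```

With a strict comparison, a pixel equidistant from two sites keeps the site that was rendered first.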

Simple GPU Matrix Multiplication[HCH03, AM03]

$C = A \cdot B$, where $C_{ij} = \sum_k A_{ik} B_{kj}$

[Figure: matrices A and B]

Fragment processor at (x, y) computes one entry of C. In the loop below, i is the summation index and k denotes the inner dimension (the number of terms in the sum):

for i = 1 . . . k do
    C[x, y] = C[x, y] + A[x, i] ∗ B[i, y]
end for

• GPU loops have to be unrolled

• This works only if k is small
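To make the unrolling constraint concrete, here is a small Python sketch. It is not the code of [HCH03] or [AM03]; `make_unrolled_program` and `multiply` are hypothetical names, indices are 0-based, and k is assumed to be at least 1. The generated program contains one multiply-add term per inner index, so its length grows linearly with k, which is why the approach only suits small k.

```python
def make_unrolled_program(k):
    """Build a loop-free fragment program for inner dimension k (k >= 1).

    Early fragment processors had no loops, so the k-term dot product is
    emitted as straight-line code.
    """
    terms = " + ".join(f"A[x][{i}] * B[{i}][y]" for i in range(k))
    source = f"lambda A, B, x, y: {terms}"
    return eval(source), source

def multiply(A, B):
    n, k, m = len(A), len(B), len(B[0])
    program, _ = make_unrolled_program(k)
    # One rendering pass: every output pixel (x, y) runs the same program.
    return [[program(A, B, x, y) for y in range(m)] for x in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = multiply(A, B)
_, source = make_unrolled_program(2)
```

Printing `source` shows the straight-line expression that a real shader compiler would have had to fit within the instruction limit.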

GPU Matrix Multiplication II[FSH04]

[Figure: matrices A and B, shown for Pass 1 and Pass 2]

• Multiple passes increase the number of memory accesses, but help with caching

[Figure: a single texture entry with four components x, y, z, u]

• Each "field" is actually four values

• Can get a factor 4 speedup with careful partitioning of matrix

• More complicated methods needed for sparse multiplication
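The multi-pass idea can be sketched as follows. This is an illustrative CPU simulation, not the blocking scheme of [FSH04]: the name `multipass_matmul` and the `block` parameter are assumptions made here, and the packing of four values per texture entry is not modeled. Each pass covers a short range of inner indices (short enough to unroll on real hardware) and adds its partial sums to the previous pass's output.

```python
def multipass_matmul(A, B, block):
    """Accumulate C = A * B over ceil(k / block) passes.

    Each pass handles `block` consecutive inner indices and adds its partial
    sums to the output of the previous pass, which it reads back as an input.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    passes = 0
    for start in range(0, k, block):
        stop = min(start + block, k)
        prev = C                                   # result of the previous pass
        C = [[prev[x][y] + sum(A[x][i] * B[i][y] for i in range(start, stop))
              for y in range(m)] for x in range(n)]
        passes += 1
    return C, passes

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C, passes = multipass_matmul(A, B, block=2)
```

Every pass re-reads and re-writes all of C, which is the extra memory traffic mentioned above; the benefit is that each pass touches only a narrow slice of A and B.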

Sorting on the GPU

Can we sort on the GPU?

• Compute parallelism is not enough

• Need SIMD structures (find repeated instruction patterns)

• External memory methods manage I/O bottlenecks but don’t exploit compute power

• Plan: Use sorting networks:
    Many simple (and local) compute elements
    High-throughput and synchronous

Bitonic Sorting

[Figure: a comparator takes inputs a and b and outputs min(a, b) and max(a, b); a sorting network built from such comparators is applied to the input sequence 3 7 4 8 6 2 1 5]

Bitonic sort requires O(log^2 n) layers, with n/2 comparators per layer
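The layer count can be checked on the slide's input with a small sequential sketch in Python. The function names are illustrative assumptions, and the inner loop stands in for comparators that would run in parallel. For n a power of two, the exact number of layers is (log n)(log n + 1)/2, which is O(log^2 n).

```python
def compare_exchange(values, i, j, ascending):
    """One comparator on positions i < j: min goes to i if ascending, else to j."""
    a, b = values[i], values[j]
    lo, hi = min(a, b), max(a, b)
    values[i], values[j] = (lo, hi) if ascending else (hi, lo)


def bitonic_sort(data):
    """Sort a list whose length is a power of two.

    Returns (sorted_list, number_of_comparator_layers).
    """
    values = list(data)
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    layers = 0
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between compared positions
            for i in range(n):
                partner = i ^ j
                if partner > i:  # handle each pair once: n/2 comparators
                    ascending = (i & k) == 0
                    compare_exchange(values, i, partner, ascending)
            layers += 1
            j //= 2
        k *= 2
    return values, layers
```

For the eight keys 3 7 4 8 6 2 1 5 this uses 1 + 2 + 3 = 6 layers of 4 comparators each.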

GPU Bitonic Sorting [PDC+03, GGKM06]

[Figure: a texture holding the values a0, a1, a2, a3, with positions labeled 0, 1, 2, 3]

• Fill 2D array with values
• (for each pass) construct quadrilateral with lookup values
• Texture hardware locates lookup values, and fragment program does comparisons
• O(log^2 n) passes used to complete the computation
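The per-pass structure above can be imitated on the CPU. The following is a rough Python sketch under stated assumptions, not the implementation of [PDC+03] or [GGKM06]. `fragment` is a hypothetical stand-in for the fragment program, the texture is a row-major list of lists, and each pass writes a fresh texture (as a render target would) instead of swapping in place.

```python
def fragment(texture, width, x, y, j, k):
    """Hypothetical 'fragment program': compute one output texel."""
    i = y * width + x                      # linear index of this texel
    partner = i ^ j                        # lookup position for this pass
    px, py = partner % width, partner // width
    mine, other = texture[y][x], texture[py][px]
    ascending = (i & k) == 0               # direction of this merge block
    keep_min = (i < partner) == ascending
    return min(mine, other) if keep_min else max(mine, other)


def gpu_style_bitonic_sort(values, width):
    """Sort values laid out row-major in a texture with `width` columns.

    Returns (sorted_list, number_of_passes). Assumes len(values) is a
    power of two and a multiple of width.
    """
    n = len(values)
    if n == 0 or n & (n - 1) or width <= 0 or n % width:
        raise ValueError("need a power-of-two length divisible by width")
    height = n // width
    texture = [list(values[r * width:(r + 1) * width]) for r in range(height)]
    passes = 0
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            # One render pass: every texel is computed independently from
            # the previous texture and written to a new one.
            texture = [
                [fragment(texture, width, x, y, j, k) for x in range(width)]
                for y in range(height)
            ]
            passes += 1
            j //= 2
        k *= 2
    return [v for row in texture for v in row], passes
```

Each texel gathers its partner's value and writes only its own output, so every comparator is evaluated twice, once from each end. This suits a fragment program that can read from arbitrary texture positions but write only to its own pixel.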

Review

This Lecture

Brief history of GPU model
Simple GPU SIMD model
Examples: Voronoi diagrams, matrix multiplication and sorting

Next Lecture

More simple GPU examples
Toy example of algorithmic view: “GPU as streaming processor”
The CUDA model for modern GPUs
“Hello world” example: matrix multiplication in CUDA

Questions?

[Dur12]

References I

Ádám Moravánsky. Dense matrix algebra on the GPU. http://www.shaderx2.com/shaderx.PDF, 2003.

ATI/Animusic. Pipe dream. http://developer.amd.com/ARCHIVE/LEGACYDEMOS/pages/ATIRadeon9700Real-TimeDemos.aspx.

J. Backus. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. ACM Turing Award Lecture, 1977.

Dallin Durfee. Strange quark. http://sqcomic.com/2012/08/19/, Aug 19 2012.

References II

K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 133–137. ACM, 2004.

N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 325–336. ACM, 2006.

Jesse D. Hall, Nathan A. Carr, and John C. Hart. Cache and bandwidth aware matrix multiplication on the GPU. Technical Report UIUCDCS-R-2003-2328, UIUC, 2003.

References III

K.E. Hoff III, J. Keyser, M. Lin, D. Manocha, and T. Culver. Fast computation of generalized Voronoi diagrams using graphics hardware. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 277–286. ACM Press/Addison-Wesley Publishing Co., 1999.

John Markoff. From PlayStation to Supercomputer for $50,000. http://www.nytimes.com/2003/05/26/business/technology-from-playstation-to-supercomputer-for-50000.html, May 2003. New York Times.

T.J. Purcell, C. Donner, M. Cammarano, H.W. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 41–50. Eurographics Association, 2003.

References IV

D. Shreiner. OpenGL programming guide: the official guide to learning OpenGL, versions 3.0 and 3.1, volume 1. Addison-Wesley Professional, 2010.