General Purpose Computing using Graphics Hardware

Hanspeter Pfister, Harvard University


Page 1: General Purpose Computing using Graphics Hardware

General Purpose Computing using Graphics Hardware

Hanspeter Pfister, Harvard University

Page 2: General Purpose Computing using Graphics Hardware

2

Acknowledgements

• Won-Ki Jeong, Harvard University
• Kayvon Fatahalian, Stanford University

Page 3: General Purpose Computing using Graphics Hardware

3

GPU (Graphics Processing Unit)

• PC hardware dedicated to 3D graphics
  – Massively parallel SIMD processor

• Performance pushed by game industry

NVIDIA SLI System

Page 4: General Purpose Computing using Graphics Hardware

4

GPGPU

• General-purpose computation on the GPU
  – Started in the computer graphics research community
  – Maps computational problems onto the graphics rendering pipeline

Image courtesy Jens Krueger, Aaron Lefohn, and Won-Ki Jeong

Page 5: General Purpose Computing using Graphics Hardware

5

Why GPU for computing?

• GPU is fast
  – Massively parallel
    • CPU: ~4 cores (16 SIMD lanes) @ 3.2 GHz (Intel Quad Core)
    • GPU: ~30 cores (240 SIMD lanes) @ 1.3 GHz (NVIDIA GT200)
  – High memory bandwidth
• Programmable
  – NVIDIA CUDA, DirectX Compute Shader, OpenCL
• High-precision floating point support
  – 64-bit floating point (IEEE 754)
• Inexpensive desktop supercomputer
  – NVIDIA Tesla C1060: ~1 TFLOPS @ ~$1,000

Page 6: General Purpose Computing using Graphics Hardware

6

FLOPS

Image Courtesy NVIDIA

Page 7: General Purpose Computing using Graphics Hardware

7

Memory Bandwidth

Image Courtesy NVIDIA

Page 8: General Purpose Computing using Graphics Hardware

8

GPGPU Biomedical Examples

Level-Set Segmentation (Lefohn et al.)

EM Image Processing (Jeong et al.)
Image Registration (Strzodka et al.)

CT/MRI Reconstruction (Sumanaweera et al.)

Page 9: General Purpose Computing using Graphics Hardware

9

Overview

1. GPU Architecture Overview
2. GPU Programming Overview
   – Programming Model
   – NVIDIA CUDA
   – OpenCL
3. Application Example
   – CUDA ITK

Page 10: General Purpose Computing using Graphics Hardware

1. GPU Architecture Overview

Kayvon Fatahalian, Stanford University

10

Page 11: General Purpose Computing using Graphics Hardware

11

What’s in a GPU?

[Block diagram: eight Compute Cores, texture (Tex) units, Input Assembly, Rasterizer, Output Blend, Video Decode, and a Work Distributor.]

Heterogeneous chip multi-processor (highly tuned for graphics). Which parts are HW and which are SW?

Page 12: General Purpose Computing using Graphics Hardware

12

CPU-“style” cores

[Diagram: a CPU-style core with Fetch/Decode, ALU (Execute), and Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache.]

Page 13: General Purpose Computing using Graphics Hardware

13

Slimming down

[Diagram: the slimmed-down core keeps only Fetch/Decode, ALU (Execute), and Execution Context.]

Idea #1: Remove components that help a single instruction stream run fast.

Page 14: General Purpose Computing using Graphics Hardware

14

Two cores (two threads in parallel)

[Diagram: two slimmed-down cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context, each running one thread of the compiled diffuseShader below.]

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

Page 15: General Purpose Computing using Graphics Hardware

15

Four cores (four threads in parallel)

[Diagram: four slimmed-down cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context.]

Page 16: General Purpose Computing using Graphics Hardware

16

Sixteen cores (sixteen threads in parallel)

[Diagram: sixteen slimmed-down cores, each with its own ALU.]

16 cores = 16 simultaneous instruction streams

Page 17: General Purpose Computing using Graphics Hardware

17

Instruction stream sharing

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

But… many threads should be able to share an instruction stream!

Page 18: General Purpose Computing using Graphics Hardware

18

Recall: simple processing core

[Diagram: Fetch/Decode, ALU (Execute), Execution Context.]

Page 19: General Purpose Computing using Graphics Hardware

19

Add ALUs

Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs: SIMD processing.

[Diagram: one Fetch/Decode unit feeding ALU 1 through ALU 8, eight execution contexts (Ctx), and Shared Ctx Data.]

Page 20: General Purpose Computing using Graphics Hardware

20

Modifying the code

Original compiled shader:

<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)

Processes one thread using scalar ops on scalar registers.

Page 21: General Purpose Computing using Graphics Hardware

21

Modifying the code

New compiled shader:

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul vec_o0, vec_r0, vec_r3
VEC8_mul vec_o1, vec_r1, vec_r3
VEC8_mul vec_o2, vec_r2, vec_r3
VEC8_mov vec_o3, l(1.0)

Processes 8 threads using vector ops on vector registers.

Page 22: General Purpose Computing using Graphics Hardware

22

Modifying the code

[Diagram: the same VEC8_diffuseShader; threads 1 through 8 map one-to-one onto ALU 1 through ALU 8, sharing a single Fetch/Decode unit and the Shared Ctx Data.]

Page 23: General Purpose Computing using Graphics Hardware

23

128 threads in parallel

16 cores = 128 ALUs = 16 simultaneous instruction streams

Page 24: General Purpose Computing using Graphics Hardware

24

But what about branches?

[Diagram: time (clocks) runs downward; ALU 1 through ALU 8 execute the shader below in lockstep.]

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

Page 25: General Purpose Computing using Graphics Hardware

25

But what about branches?

[Same diagram: after evaluating the condition, the eight lanes hold the mask T T T F F F F F.]

Page 26: General Purpose Computing using Graphics Hardware

26

But what about branches?

[Same diagram: lanes whose mask is F idle while the T lanes execute the if-branch, and vice versa for the else-branch.]

Not all ALUs do useful work! Worst case: 1/8 performance.

Page 27: General Purpose Computing using Graphics Hardware

27

But what about branches?

[Same diagram: both sides of the branch execute one after the other before all lanes resume the unconditional code.]

Page 28: General Purpose Computing using Graphics Hardware

28

Clarification

• Option 1: Explicit vector instructions
  – Intel/AMD x86 SSE, Intel Larrabee
• Option 2: Scalar instructions, implicit HW vectorization
  – HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
  – NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures

SIMD processing does not imply SIMD instructions.

In practice: 16 to 64 threads share an instruction stream.

Page 29: General Purpose Computing using Graphics Hardware

29

Stalls!

Texture access latency = 100s to 1000s of cycles.

We have removed the fancy caches and logic that help avoid stalls.

Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.

Page 30: General Purpose Computing using Graphics Hardware

30

But we have LOTS of independent threads.

Idea #3: Interleave processing of many threads on a single core to avoid stalls caused by high-latency operations.

Page 31: General Purpose Computing using Graphics Hardware

31

Hiding stalls

[Diagram: time (clocks) runs downward; threads 1 through 8 occupy the core, which has one Fetch/Decode unit, eight ALUs, and Shared Ctx Data.]

Page 32: General Purpose Computing using Graphics Hardware

32

Hiding stalls

[Diagram: the context storage is split into four groups (threads 1–8, 9–16, 17–24, 25–32) that take turns on the same eight ALUs.]

Page 33: General Purpose Computing using Graphics Hardware

33

Hiding stalls

[Timeline: group 1 (threads 1–8) runs until it stalls; the core then switches to a runnable group.]

Page 34: General Purpose Computing using Graphics Hardware

34

Hiding stalls

[Timeline: while group 1 is stalled, groups 2 through 4 run in turn.]

Page 35: General Purpose Computing using Graphics Hardware

35

Hiding stalls

[Timeline: each of the four groups alternates between stalled and runnable; whenever one group stalls, another runnable group fills the ALUs.]

Page 36: General Purpose Computing using Graphics Hardware

36

Throughput!

[Timeline: each group starts, runs, stalls, resumes, and finishes later than it would have running alone.]

Increase the run time of one group to maximize the throughput of many groups.

Page 37: General Purpose Computing using Graphics Hardware

37

Storing contexts

[Diagram: Fetch/Decode, eight ALUs, and a 32 KB pool of context storage.]

Page 38: General Purpose Computing using Graphics Hardware

38

Twenty small contexts

[Diagram: the context pool divided into twenty small contexts (1 through 20).]

(maximal latency-hiding ability)

Page 39: General Purpose Computing using Graphics Hardware

39

Twelve medium contexts

[Diagram: the context pool divided into twelve medium contexts (1 through 12).]

Page 40: General Purpose Computing using Graphics Hardware

40

Four large contexts

[Diagram: the context pool divided into four large contexts (1 through 4).]

(low latency-hiding ability)

Page 41: General Purpose Computing using Graphics Hardware

41

GPU block diagram key

• = single “physical” instruction stream fetch/decode (functional unit control)
• = SIMD programmable functional unit (FU), control shared with other functional units; one functional unit may contain multiple 32-bit “ALUs”
• = execution context storage
• = fixed-function unit
• = 32-bit mul-add unit
• = 32-bit multiply unit

Page 42: General Purpose Computing using Graphics Hardware

42

Example: NVIDIA GeForce GTX 280

• NVIDIA-speak:
  – 240 stream processors
  – “SIMT execution” (automatic HW-managed sharing of the instruction stream)
• Generic speak:
  – 30 processing cores
  – 8 SIMD functional units per core
  – 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock)
  – Best case: 240 mul-adds + 240 muls per clock
  – 1.3 GHz clock
  – 30 × 8 × (2 + 1) × 1.3 ≈ 933 GFLOPS
• Mapping data-parallelism to the chip:
  – Instruction stream shared across 32 threads
  – 8 threads run on the 8 SIMD functional units in one clock

Page 43: General Purpose Computing using Graphics Hardware

43

GTX 280 core

[Block diagram: ten texture (Tex) units, Zcull/Clip/Rast, Output Blend, a Work Distributor, and the array of processing cores.]

Page 44: General Purpose Computing using Graphics Hardware

44

Example: ATI Radeon 4870

• AMD/ATI-speak:
  – 800 stream processors
  – Automatic HW-managed sharing of the scalar instruction stream (like “SIMT”)
• Generic speak:
  – 10 processing cores
  – 16 SIMD functional units per core
  – 5 mul-adds per functional unit (5 × 2 = 10 flops/clock)
  – Best case: 800 mul-adds per clock
  – 750 MHz clock
  – 10 × 16 × 5 × 2 × 0.75 = 1.2 TFLOPS
• Mapping data-parallelism to the chip:
  – Instruction stream shared across 64 threads
  – 16 threads run on the 16 SIMD functional units in one clock

Page 45: General Purpose Computing using Graphics Hardware

45

ATI Radeon 4870 core

[Block diagram: Zcull/Clip/Rast, Output Blend, a Work Distributor, ten texture (Tex) units, and the array of processing cores.]

Page 46: General Purpose Computing using Graphics Hardware

46

Summary: three key ideas

1. Use many “slimmed down” cores running in parallel
2. Pack cores full of ALUs by sharing an instruction stream across groups of threads
   – Option 1: Explicit SIMD vector instructions
   – Option 2: Implicit sharing managed by hardware
3. Avoid latency stalls by interleaving the execution of many groups of threads
   – When one group stalls, work on another group

Page 47: General Purpose Computing using Graphics Hardware

2. GPU Programming Models

Programming Model
NVIDIA CUDA

OpenCL

47

Page 48: General Purpose Computing using Graphics Hardware

48

Task parallelism

• Distribute tasks across processors based on their dependencies
• Coarse-grain parallelism

[Diagram: a task dependency graph of Tasks 1 through 9 and its assignment across 3 processors (P1, P2, P3) over time.]

Page 49: General Purpose Computing using Graphics Hardware

49

Data parallelism

• Run a single kernel over many elements
  – Each element is updated independently
  – The same operation is applied to each element
• Fine-grain parallelism
  – Many lightweight threads, cheap context switching
  – Maps well to an ALU-heavy architecture: the GPU

[Diagram: one kernel applied to data elements P1 … Pn in parallel.]

Page 50: General Purpose Computing using Graphics Hardware

GPU-friendly Problems

• Data-parallel processing
• High arithmetic intensity
  – Keeps the GPU busy all the time
  – Computation offsets memory latency
• Coherent data access
  – Access large chunks of contiguous memory
  – Exploit fast on-chip shared memory

50

Page 51: General Purpose Computing using Graphics Hardware

The Algorithm Matters

• Jacobi: parallelizable

for(int i=0; i<num; i++)
{
    vnp1[i] = (vn[i-1] + vn[i+1])/2.0;   // vn = old iterate, vnp1 = new iterate (boundary handling omitted)
}

• Gauss-Seidel: difficult to parallelize

for(int i=0; i<num; i++)
{
    v[i] = (v[i-1] + v[i+1])/2.0;        // v[i] depends on the v[i-1] just updated in this sweep
}
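
On the GPU, the Jacobi update becomes one thread per element; a minimal CUDA sketch (the kernel and array names are illustrative, not from the slides, and boundary elements are simply skipped):

__global__ void jacobiStep(const float* vn, float* vnp1, int num)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i > 0 && i < num-1)                        // interior points only
        vnp1[i] = (vn[i-1] + vn[i+1]) / 2.0f;     // reads only the old iterate, so threads are independent
}

Gauss-Seidel has no such one-thread-per-element mapping, because each v[i] depends on the v[i-1] written earlier in the same sweep.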

51

Page 52: General Purpose Computing using Graphics Hardware

Example: Reduction

• Serial version (O(N))

for(int i=1; i<N; i++)
{
    v[0] += v[i];
}

• Parallel version (O(log N))

width = N/2;
while(width >= 1)
{
    for(int i=0; i<width; i++)
    {
        v[i] += v[i+width];   // computed in parallel
    }
    width /= 2;
}
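
A hedged sketch of how one step of this pattern maps onto a CUDA block using shared memory (names are illustrative; it assumes blockDim.x is a power of two, and each block writes one partial sum that the host or a second kernel pass adds up):

// One block reduces blockDim.x elements of 'in' into one partial sum.
__global__ void reduceSum(const float* in, float* out, int n)
{
    extern __shared__ float sdata[];                 // size = blockDim.x floats, set at launch
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;             // load one element into shared memory
    __syncthreads();

    for(int width = blockDim.x / 2; width > 0; width /= 2)
    {
        if(tid < width)
            sdata[tid] += sdata[tid + width];        // same pairwise add as above
        __syncthreads();
    }

    if(tid == 0)
        out[blockIdx.x] = sdata[0];                  // one partial sum per block
}

// launch example: reduceSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);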

52

Page 53: General Purpose Computing using Graphics Hardware

53

GPU programming languages

• Using graphics APIs
  – GLSL, Cg, HLSL
• Computing-specific APIs
  – DX11 Compute Shaders
  – NVIDIA CUDA
  – OpenCL

Page 54: General Purpose Computing using Graphics Hardware

54

NVIDIA CUDA

• C-extension programming language
  – No graphics API needed
  – Supports debugging tools
• Extensions / API
  – Function qualifiers: __global__, __device__, __host__
  – Variable qualifiers: __shared__, __constant__
  – Low-level functions
    • cudaMalloc(), cudaFree(), cudaMemcpy(), …
    • __syncthreads(), atomicAdd(), …
• Program types
  – Device program (kernel): runs on the GPU
  – Host program: runs on the CPU and calls device programs
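
A small illustrative sketch putting these qualifiers together (scale, clampToUnit, and scaleAndClamp are invented names for the example, not part of CUDA or of the slides):

__constant__ float scale;                       // constant memory, set from the host

__device__ float clampToUnit(float x)           // callable only from device code
{
    return fminf(fmaxf(x, 0.0f), 1.0f);
}

__global__ void scaleAndClamp(const float* in, float* out, int n)   // kernel: the device program
{
    __shared__ float tile[256];                 // shared memory, visible to one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
    {
        tile[threadIdx.x] = in[i] * scale;
        out[i] = clampToUnit(tile[threadIdx.x]);
    }
}

// host side: cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));
//            scaleAndClamp<<<numBlocks, 256>>>(d_in, d_out, n);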

Page 55: General Purpose Computing using Graphics Hardware

55

CUDA Programming Model

• Kernel
  – GPU program that runs on a thread grid
• Thread hierarchy
  – Grid: a set of blocks
  – Block: a set of threads
  – Grid size × block size = total number of threads

[Diagram: a grid of blocks 1 … n, each block containing threads that all run the same kernel.]

Page 56: General Purpose Computing using Graphics Hardware

56

CUDA Memory Structure

• Memory hierarchy
  – PC memory: off-card
  – GPU global memory: off-chip / on-card
  – Shared memory / registers / cache: on-chip
• The host can read/write global memory
• Threads within a block communicate through shared memory

[Diagram: PC memory (DRAM) ↔ GPU global memory (DRAM, on the graphics card) ↔ GPU shared memory (on-chip) ↔ ALUs in the GPU core.]

Page 57: General Purpose Computing using Graphics Hardware

57

Synchronization

• Threads in the same block can communicate using shared memory
• No HW global synchronization function yet
• __syncthreads()
  – Barrier for threads within the current block only
• __threadfence()
  – Flushes global memory writes to make them visible to all threads
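
A minimal sketch of the block barrier in use (an illustrative kernel, assuming a block size of 256 and an array length that is a multiple of 256):

__global__ void reverseWithinBlock(float* data)
{
    __shared__ float tile[256];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = data[i];                    // each thread writes one element of shared memory
    __syncthreads();                      // barrier: all writes to tile[] are now visible within the block

    data[i] = tile[blockDim.x - 1 - t];   // safe to read an element written by another thread
}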

Page 58: General Purpose Computing using Graphics Hardware

58

Example: CPU Vector Addition

// Pair-wise addition of vector elements
// CPU version : serial add
void vectorAdd(float* iA, float* iB, float* oC, int num)
{
    for(int i=0; i<num; i++)
    {
        oC[i] = iA[i] + iB[i];
    }
}

Page 59: General Purpose Computing using Graphics Hardware

59

Example: CUDA Vector Addition

// Pair-wise addition of vector elements
// CUDA version : one thread per addition
__global__ void vectorAdd(float* iA, float* iB, float* oC)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
}

Page 60: General Purpose Computing using Graphics Hardware

60

Example: CUDA Host Code

float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
// … initialize h_A and h_B

// allocate device memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );
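
The slide stops at the kernel launch; the usual follow-up, sketched here with the same runtime calls, is to copy the result back to the host and release the memory:

// copy the result back to the host
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// clean up
cudaFree(d_A);  cudaFree(d_B);  cudaFree(d_C);
free(h_A);  free(h_B);  free(h_C);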

Page 61: General Purpose Computing using Graphics Hardware

61

OpenCL (Open Computing Language)

• First industry standard for a computing language
  – Based on the C language
  – Platform independent
    • NVIDIA, ATI, Intel, …
• Data- and task-parallel compute model
  – Uses all computational resources in the system
    • CPU, GPU, …
  – Work-item: same as a thread / fragment / etc.
  – Work-group: a group of work-items
    • Work-items in the same work-group can communicate
  – Multiple work-groups execute in parallel

Page 62: General Purpose Computing using Graphics Hardware

62

OpenCL program structure

• Host program (CPU)
  – Platform layer
    • Query compute devices
    • Create context
  – Runtime
    • Create memory objects
    • Compile and create kernel program objects
    • Issue commands (i.e., kernel launches) to the command queue
    • Synchronize commands
    • Clean up OpenCL resources
• Kernel (CPU, GPU)
  – C-like code with some extensions
  – Runs on the compute device

Page 63: General Purpose Computing using Graphics Hardware

63

CUDA vs. OpenCL comparison

• Conceptually almost identical
  – Work-item == thread
  – Work-group == block
  – Similar memory model
    • Global, local, shared memory
  – Kernel, host program
• CUDA is highly optimized, but only for NVIDIA GPUs
• OpenCL can be used widely, on any GPUs/CPUs
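
For comparison, the CUDA vector-addition kernel from page 59 written as an OpenCL kernel (a sketch; get_global_id(0) plays the role of threadIdx.x + blockDim.x * blockIdx.x):

__kernel void vectorAdd(__global const float* iA,
                        __global const float* iB,
                        __global float* oC)
{
    int idx = get_global_id(0);    // global work-item index
    oC[idx] = iA[idx] + iB[idx];
}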

Page 64: General Purpose Computing using Graphics Hardware

64

Implementation status of OpenCL

• Specification 1.0 released by Khronos
• NVIDIA released a Beta 1.2 driver and SDK
  – Available to registered GPU computing developers
• Apple will include it in Mac OS X Snow Leopard
  – Q3 2009
  – NVIDIA and ATI GPUs, Intel CPUs for Mac
• More companies will join

Page 65: General Purpose Computing using Graphics Hardware

GPU optimization tips: configuration

• Identify the bottleneck
  – Compute- or bandwidth-bound (use the profiler)
  – Focus on the most expensive but parallelizable parts (Amdahl's law)
• Maximize parallel execution
  – Use large inputs (many threads)
  – Avoid divergent execution
  – Use limited resources efficiently
    • Minimize shared memory / register use

65

Page 66: General Purpose Computing using Graphics Hardware

GPU optimization tips: memory

• Memory access: the most important optimization
  – Minimize device-to-host memory overhead
    • Overlap kernels with memory copies (asynchronous copy)
  – Avoid shared memory bank conflicts
  – Coalesce global memory accesses
  – Texture or constant memory can help (cached)

[Diagram: PC memory (DRAM) ↔ GPU global memory (DRAM, on the graphics card) ↔ GPU shared memory (on-chip) ↔ ALUs.]
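
A sketch of what coalescing means in practice (two illustrative kernels; in the first, neighboring threads touch neighboring addresses, in the second they do not):

__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
        out[i] = in[i];        // thread k touches element k: one wide, coalesced transaction per warp
}

__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if(i < n)
        out[i] = in[i];        // neighboring threads are 'stride' elements apart: many narrow transactions
}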

66

Page 67: General Purpose Computing using Graphics Hardware

67

GPU optimization tips: instructions

• Use less expensive operators
  – Division: 32 cycles, multiplication: 4 cycles
    • Multiply by 0.5 instead of dividing by 2.0
  – Atomic operators are expensive
    • Possible race conditions
  – Double precision is much slower than float
  – Use less accurate floating point instructions when possible
    • __sinf(), __expf(), __powf()
• Eliminate unnecessary instructions
  – Loop unrolling
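
A small illustrative kernel combining these tips (the computation itself is made up for the example):

__global__ void shadePixels(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
    {
        float v = x[i] * 0.5f;          // multiply by 0.5 instead of dividing by 2.0

        float s = 0.0f;
        #pragma unroll                  // ask the compiler to unroll this fixed-length loop
        for(int k = 1; k <= 4; k++)
            s += __sinf(v * k);         // fast, less accurate single-precision intrinsic

        x[i] = s;
    }
}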

Page 68: General Purpose Computing using Graphics Hardware

3. Application Example

CUDA ITK

68

Page 69: General Purpose Computing using Graphics Hardware

ITK image filters implemented using CUDA

• Convolution filters
  – Mean filter
  – Gaussian filter
  – Derivative filter
  – Hessian of Gaussian filter
• Statistical filter
  – Median filter
• PDE-based filter
  – Anisotropic diffusion filter

69

Page 70: General Purpose Computing using Graphics Hardware

70

CUDA ITK

• CUDA code is integrated into ITK
  – Transparent to ITK users
  – No need to modify existing code that uses the ITK library
• Check the environment variable ITK_CUDA
  – Entry point
    • GenerateData() or ThreadedGenerateData()
  – If ITK_CUDA == 0
    • Execute the original ITK code
  – If ITK_CUDA == 1
    • Execute the CUDA code
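
A hypothetical illustration of that dispatch (not the actual CUDA ITK source; UseCuda is an invented helper):

#include <cstdlib>
#include <cstring>

// returns true when the ITK_CUDA environment variable is set to "1"
bool UseCuda()
{
    const char* flag = std::getenv("ITK_CUDA");
    return flag != 0 && std::strcmp(flag, "1") == 0;
}

// Inside a filter's GenerateData() / ThreadedGenerateData(), the idea is roughly:
//     if (UseCuda())  run the CUDA implementation;
//     else            run the original ITK CPU code;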

Page 71: General Purpose Computing using Graphics Hardware

71

Convolution filters

• Weighted sum of neighbors
  – For a size-n filter, each pixel is reused n times
• Non-separable filter (anisotropic)
  – Reuse data through shared memory
• Separable filter (Gaussian)
  – N-dimensional convolution = N separate 1D convolutions

[Diagram: the image convolved with a 1D kernel along each axis in turn.]

Page 72: General Purpose Computing using Graphics Hardware

72

Naïve C/CUDA implementation

• Read from the input image whenever it is needed

int xdim, ydim;    // size of input image
float *in, *out;   // input/output image of size xdim*ydim
float *w;          // convolution kernel of size n*m

for(int x=0; x<xdim; x++)
{
    for(int y=0; y<ydim; y++)
    {
        // compute the convolution for pixel (x,y); boundary handling omitted
        float sum = 0;
        for(int sx=x-n/2; sx<=x+n/2; sx++)
        {
            for(int sy=y-m/2; sy<=y+m/2; sy++)
            {
                int wx = sx - x + n/2;
                int wy = sy - y + m/2;
                sum += w[wy*n + wx] * in[sy*xdim + sx];   // load from global memory, n*m times per pixel
            }
        }
        out[y*xdim + x] = sum;
    }
}

// n*m global memory loads per pixel, for xdim*ydim pixels

Page 73: General Purpose Computing using Graphics Hardware

73

Improved CUDA convolution filter

• For a size n*m filter, each pixel is reused n*m times
  – Save n*m-1 global memory loads per pixel by using shared memory

__global__ void cudaConvolutionFilter2DKernel(float* in, float* out, float* w)
{
    // copy this block's pixels from global to shared memory: slow loads, done only once
    // (sharedmem[] is a __shared__ array; the indexing is elided on the slide)
    sharedmem[] = in[][];

    __syncthreads();

    // sum neighbor pixel values
    float _sum = 0;
    for(uint j=threadIdx.y; j<threadIdx.y + m; j++)
    {
        for(uint i=threadIdx.x; i<threadIdx.x + n; i++)
        {
            uint wx = i - threadIdx.x;
            uint wy = j - threadIdx.y;
            // load from shared memory (fast), n*m times
            _sum += w[wy*n + wx] * sharedmem[j*sharedmemdim.x + i];
        }
    }
    out[...] = _sum;   // write the result for this thread's pixel (indexing elided on the slide)
}

Page 74: General Purpose Computing using Graphics Hardware

74

CUDA Gaussian filter

• Apply a 1D convolution filter along each axis
  – Use temporary buffers: ping-pong rendering

// temp[0], temp[1] : temporary buffers for intermediate results
void cudaDiscreteGaussianImageFilter(float* in, float* out, float stddev)
{
    // create the Gaussian weights
    w = ComputeGaussKernel(stddev);

    temp[0] = in;

    // call the 1D convolution CUDA kernel with the Gaussian weights, once per axis,
    // ping-ponging between temp[0] and temp[1]
    dim3 G, B;
    int i;
    for(i=0; i<dimension; i++)
    {
        cudaConvolutionFilter1DKernel<<<G,B>>>(temp[i%2], temp[(i+1)%2], w);
    }

    out = temp[i%2];   // the buffer holding the final result
}

Page 75: General Purpose Computing using Graphics Hardware

Median filter

• Viola et al. [VIS 03]
  – Find the median by bisection of histogram bins
  – log(# bins) iterations
    • 8-bit pixels: log(256) = 8 iterations

[Diagram: four bisection steps narrowing the intensity range around an example pixel neighborhood.]

// copy the current block from global to shared memory
min = 0;
max = 255;
pivot = (min+max)/2.0f;
for(i=0; i<8; i++)
{
    count = 0;
    for(j=0; j<kernelsize; j++)
    {
        if(kernel[j] > pivot) count++;
    }
    if(count < kernelsize/2) max = floor(pivot);
    else                     min = ceil(pivot);
    pivot = (min + max)/2.0f;
}
return floor(pivot);

75

Page 76: General Purpose Computing using Graphics Hardware

76

Perona & Malik anisotropic diffusion

• Nonlinear diffusion
  – Adaptive smoothing based on the magnitude of the gradient
  – Preserves edges (high gradient)
• Numerical solution
  – Euler explicit integration (iterative method)
  – Finite differences for derivative computation

[Images: input image, linear diffusion, P & M diffusion]

Page 77: General Purpose Computing using Graphics Hardware

77

Performance

• Convolution filters
  – Mean filter: ~140x
  – Gaussian filter: ~60x
  – Derivative filter
  – Hessian of Gaussian filter
• Statistical filter
  – Median filter: ~25x
• PDE-based filter
  – Anisotropic diffusion filter: ~70x

Page 78: General Purpose Computing using Graphics Hardware

78

CUDA ITK

• Source code available at
  – http://sourceforge.net/projects/cudaitk/

Page 79: General Purpose Computing using Graphics Hardware

79

CUDA ITK Future Work

• ITK GPU image class
  – Reduce CPU-to-GPU memory I/O
  – Pipelining support
• Native interface for GPU code
  – Similar to ThreadedGenerateData(), but for GPU threads
• Numerical library (vnl)
• Out-of-GPU-core / GPU-cluster processing
  – Processing large images (10–100 terabytes)
• GPU platform-independent implementation
  – OpenCL could be a solution

Page 80: General Purpose Computing using Graphics Hardware

80

Conclusions

• GPU computing delivers high performance
  – Many scientific computing problems are parallelizable
  – More consistency/stability in HW and SW
    • The main GPU architecture is mature
    • An industry-wide programming standard now exists (OpenCL)
  – Better support and tools are available
    • C-based language, compiler, and debugger
• Issues
  – Not every problem is suitable for GPUs
    • Re-engineering of algorithms/software is required
  – Unclear future performance growth of GPU hardware
    • Intel's Larrabee

Page 81: General Purpose Computing using Graphics Hardware

thrust

• thrust: a library of data-parallel algorithms and data structures for CUDA, with an interface similar to the C++ Standard Template Library
• C++ template metaprogramming automatically chooses the fastest code path at compile time

Page 82: General Purpose Computing using Graphics Hardware

thrust::sort

#include <thrust/host_vector.h>

#include <thrust/device_vector.h>

#include <thrust/generate.h>

#include <thrust/sort.h>

#include <cstdlib>

int main(void)

{

// generate random data on the host

thrust::host_vector<int> h_vec(1000000);

thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer to device and sort

thrust::device_vector<int> d_vec = h_vec;

// sort 140M 32b keys/sec on GT200

thrust::sort(d_vec.begin(), d_vec.end());

return 0;
}

http://thrust.googlecode.com
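
thrust also provides other data-parallel primitives; for example, the reduction from page 52 can be written with thrust::reduce (a sketch; by default it sums the elements):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main(void)
{
    // one million ones on the device
    thrust::device_vector<int> d_vec(1000000, 1);

    // data-parallel sum, computed on the GPU
    int sum = thrust::reduce(d_vec.begin(), d_vec.end());

    return (sum == 1000000) ? 0 : 1;
}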