gpu computing with cuda lecture 8 - cuda libraries - …gpu computing with cuda lecture 8 - cuda...

GPU Computing with CUDALecture 8 - CUDA Libraries - CUFFT, PyCUDA

Christopher Cooper Boston University

August, 2011UTFSM, Valparaíso, Chile

1

Outline of lecture

‣Overview:

- Discrete Fourier Transform (DFT)

- Fast Fourier Transform (FFT)

‣ Algorithm

‣Motivation, examples

‣CUFFT: A CUDA based FFT library

‣ PyCUDA: GPU computing using scripting languages

2

3

Bell, Dalton, Olson. Towards AMG on GPU

CUDA Libraries

Fourier Transform

‣ Fourier Transform

‣ Inverse Fourier Transform

4

u(k) =! !

"!e"ikxu(x)dx

u(x) =! !

"!eikxu(k)dk

Real Wave

RealWave

Discrete Fourier Transform (DFT)

‣ The case of discrete we have aliasing

5

u(x)

Real Discrete, bounded

Wave Bounded, discrete

Values at sample points repeat at k2 = k1+N

sin(x)sin(5x)

sin(3x)sin(7x)

Discrete Fourier Transform (DFT)

‣DFT

‣ Inverse DFT

6

uk =N!1!

j=0

uje! 2!i

N kj k = 0, 1, ..., N ! 1

j = 0, 1, ..., N ! 1uj =N!1!

k=0

uke2!iN kj

Fast Fourier Transform (FFT)

‣ Fast method to calculate the DFT

‣Computations drop from to

- N = 104:

‣ Naive: 108 computations

‣ FFT: 4*104 computations

‣Many algorithms, let’s look at Cooley-Tukey radix-2

7

O(N2) O(N log(N))

Huge reduction!


‣Cooley-Tukey radix 2

- Assume N being a power of 2

8

uk =N!1!

j=0

uje! 2!i

N kj

Divide sum into even and odd parts

uk =N/2!1!

j=0

u2je! 2!i

N k(2j) +N/2!1!

j=0

u2j+1e! 2!i

N k(2j+1)


‣Cooley-Tukey radix 2

- Assume N being a power of 2

8

uk =N!1!

j=0

uje! 2!i

N kj

Divide sum into even and odd parts

uk =N/2!1!

j=0

u2je! 2!i

N k(2j) +N/2!1!

j=0

u2j+1e! 2!i

N k(2j+1)

Even Odd


‣ By doing this recursively until there is no sum, you get log(N) levels

‣ Sum is decomposed and redundant operations appear

‣ 4 point transform

9

uk =N/2!1!

j=0

u2je! 2!i

N/2 kj + e!2!iN k

N/2!1!

j=0

u2j+1e! 2!i

N/2 kj

uk = u0 + u1e! 2!

4 ik + u2e! 2!

4 i2k + u3e! 2!

4 i3k

uk = u0 + u2e! 2!

4 i2k + e!2!4 ik(u1 + u3e

! 2!4 i2k)

uk = u0 + u2e!!ik + e!

!2 ik(u1 + u3e

!!ik)

k = 0, 1, 2, 3


10

u0 = u0 + u2e0 + e0(u1 + u3e

0)u1 = u0 + u2e

!!i + e!!2 i(u1 + u3e

!!i)

u2 = u0 + u2e!2!i + e!!i(u1 + u3e

!2!i)u3 = u0 + u2e

!3!i + e!3 !2 i(u1 + u3e

!3!i)

periodicity! e0 = e!2!i = 1, e!!i = e!3!i = "1u0

u2

u1

u3

u0

u1

u2

u3

FFT - Motivation

‣ Signal processing

- Signal comes in time domain, but want the frequency spectrum

‣Convolution, filters

- Signals can be filtered with convolutions

11

F{f ! g} = F{f} · F{g}

! t

0f(s)g(t! s) ds, 0 " t <#

FFT - Motivation

‣ Partial Differential Equations (PDEs) - Spectral methods

- Use DFTs to calculate derivatives

12

uj =N!1!

k=0

uke2!iN kj 2!

Nj = xj For evenly spaced grid

uj =N!1!

k=0

ukeikxj

!uj

!x=

N!1!

k=0

ikukeikxj

!2uj

!x2=

N!1!

k=0

!k2ukeikxj

!!u

!x= iku

FFT - Motivation

‣Advantages

- Spectral accuracy

‣ Limitations

- Grid constraints

- Boundary condition constraint

13

O(N!m)

O(cN ) 0 < c < 1

CUFFT

‣CUFFT: CUDA library for FFTs on the GPU

‣ Supported by NVIDIA

‣ Features:

- 1D, 2D, 3D transforms for complex and real data

- Batch execution for multiple transforms

- Up to 128 million elements (limited by memory)

- In-place or out-of-place transforms

- Double precision on GT200 or later

- Allows streamed execution: simultaneous computation and data movement 14

CUFFT - Types

‣ cufftHandle

- Handle type to store CUFFT plans

‣ cufftResult

- Return values, like CUFFT_SUCCESS, CUFFT_INVALID_PLAN, CUFFT_ALLOC_FAILED, CUFFT_INVALID_TYPE, etc.

‣ cufftReal

‣ cufftDoubleReal

‣ cufftComplex

‣ cufftDoubleComplex

15

CUFFT - Transform types

‣ R2C: real to complex

‣C2R: Complex to real

‣C2C: complex to complex

‣D2Z: double to double complex

‣ Z2D: double complex to double

‣ Z2Z: double complex to double complex

16

CUFFT - Plans

‣ cufftPlan1d()

‣ cufftPlan2d()

‣ cufftPlan3d()

‣ cufftPlanMany()

17

CUFFT - Functions

‣ cufftDestroy

- Free GPU resources

‣ cufftExecC2C, R2C, C2R, Z2Z, D2Z, Z2D

- Performs the specified FFT

‣ For more details see the CUFFT Library documentation available in the NVIDIA website!

18

CUFFT - Performance considerations

‣ Several algorithms for different sizes

‣ Performance recommendations

- Restrict size to be a multiple of 2, 3, 5 or 7

- Restrict the power-of-two factorization term of the X-dimension to be at least a multiple of 16 for single and 8 for double

- Restrict the X-dimension of single precision transforms to be strictly a power of two between 2(2) and 2048(8192) for Tesla (Fermi) GPUs

19

CUFFT - Performance considerations

‣CUFFT vs FFTW

- CUFFT is good for larger, power of two sized FFTs

- CUFFT is not good for small sized FFTs

‣ CPU can store all data in cache

‣ GPU data transfers take too long

20http://www.sharcnet.ca/~merz/CUDA_benchFFT/

http://www.sharcnet.ca/~merz/CUDA_benchFFT/

http://www.sharcnet.ca/~merz/CUDA_benchFFT/

CUFFT - Example

21

#include<cufft.h>#defineNX256#defineBATCH10

cufftHandleplan;cufftComplex*data;cudaMalloc((void**)&data,sizeof(cufftComplex)*NX*BATCH);/*Createa1DFFTplan.*/cufftPlan1d(&plan,NX,CUFFTC2C,BATCH);

/*UsetheCUFFTplantotransformthesignalinplace.*/cufftExecC2C(plan,data,data,CUFFTFORWARD);

/*DestroytheCUFFTplan.*/cufftDestroy(plan);cudaFree(data);

00.2

0.40.6

0.81

00.2

0.40.6

0.810

0.2

0.4

0.6

0.8

1

x

Fourier spectral method for 2D Poisson Eqn

y

u

CUFFT - Example

‣ Solve Poisson equation using FFT

‣Consider periodic boundary conditions

22

!2u =r2 " 2!2

!4e!

r2

2!2

uan = e!r2

2!2 r =!

(x! 0.5)2 + (y ! 0.5)2

CUFFT - Example

‣ Steps

23

FFT system

Find derivative

Transform back

!2u = f

!k2u = f

u = ! f

k2

u = i!t

!! f

k2

"

k2 = k2x + k2

y

CUFFT - Example

24

intmain(){intN=64;floatxmax=1.0f,xmin=0.0f,ymin=0.0f,h=(xmax‐xmin)/((float)N),s=0.1,s2=s*s;float*x=newfloat[N*N],*y=newfloat[N*N],*u=newfloat[N*N],*f=newfloat[N*N],*u_a=newfloat[N*N],*err=newfloat[N*N];

floatr2;for(intj=0;j<N;j++)for(inti=0;i<N;i++){x[N*j+i]=xmin+i*h;y[N*j+i]=ymin+j*h;

r2=(x[N*j+i]‐0.5)*(x[N*j+i]‐0.5)+(y[N*j+i]‐0.5)*(y[N*j+i]‐0.5);f[N*j+i]=(r2‐2*s2)/(s2*s2)*exp(‐r2/(2*s2));u_a[N*j+i]=exp(‐r2/(2*s2));//analyticalsolution}

float*k=newfloat[N];for(inti=0;i<=N/2;i++){k[i]=i*2*M_PI;}for(inti=N/2+1;i<N;i++){k[i]=(i‐N)*2*M_PI;}

CUFFT - Example

25

//Allocatearraysonthedevicefloat*k_d,*f_d,*u_d;cudaMalloc((void**)&k_d,sizeof(float)*N);cudaMalloc((void**)&f_d,sizeof(float)*N*N);cudaMalloc((void**)&u_d,sizeof(float)*N*N);

cudaMemcpy(k_d,k,sizeof(float)*N,cudaMemcpyHostToDevice);cudaMemcpy(f_d,f,sizeof(float)*N*N,cudaMemcpyHostToDevice);

cufftComplex*ft_d,*f_dc,*ft_d_k,*u_dc;cudaMalloc((void**)&ft_d,sizeof(cufftComplex)*N*N);cudaMalloc((void**)&ft_d_k,sizeof(cufftComplex)*N*N);cudaMalloc((void**)&f_dc,sizeof(cufftComplex)*N*N);cudaMalloc((void**)&u_dc,sizeof(cufftComplex)*N*N);

dim3dimGrid(int((N‐0.5)/BSZ)+1,int((N‐0.5)/BSZ)+1);dim3dimBlock(BSZ,BSZ);real2complex<<<dimGrid,dimBlock>>>(f_d,f_dc,N);

cufftHandleplan;cufftPlan2d(&plan,N,N,CUFFT_C2C);

CUFFT - Example

26

cufftExecC2C(plan,f_dc,ft_d,CUFFT_FORWARD);

solve_poisson<<<dimGrid,dimBlock>>>(ft_d,ft_d_k,k_d,N);

cufftExecC2C(plan,ft_d_k,u_dc,CUFFT_INVERSE);

complex2real<<<dimGrid,dimBlock>>>(u_dc,u_d,N);

cudaMemcpy(u,u_d,sizeof(float)*N*N,cudaMemcpyDeviceToHost);

floatconstant=u[0];for(inti=0;i<N*N;i++){u[i]‐=constant;//substractu[0]toforcethearbitraryconstanttobe0}

CUFFT - Example

27

__global__voidsolve_poisson(cufftComplex*ft,cufftComplex*ft_k,float*k,intN){inti=threadIdx.x+blockIdx.x*BSZ;intj=threadIdx.y+blockIdx.y*BSZ;intindex=j*N+i;

if(i<N&&j<N){floatk2=k[i]*k[i]+k[j]*k[j];if(i==0&&j==0){k2=1.0f;}ft_k[index].x=‐ft[index].x/k2;ft_k[index].y=‐ft[index].y/k2;}}

CUFFT - Example

28

__global__voidreal2complex(float*f,cufftComplex*fc,intN){inti=threadIdx.x+blockIdx.x*blockDim.x;intj=threadIdx.y+blockIdx.y*blockDim.y;intindex=j*N+i;

if(i<N&&j<N){fc[index].x=f[index];fc[index].y=0.0f;}}

__global__voidcomplex2real(cufftComplex*fc,float*f,intN){inti=threadIdx.x+blockIdx.x*BSZ;intj=threadIdx.y+blockIdx.y*BSZ;intindex=j*N+i;

if(i<N&&j<N){f[index]=fc[index].x/((float)N*(float)N);//dividebynumberofelementstorecovervalue}}

PyCUDA

‣ Python + CUDA = PyCUDA

‣ Python: scripting language easy to code, but slow

‣CUDA difficult to code, but fast!

‣ PyCUDA wants to take the best of both worlds

‣ http://mathema.tician.de/software/pycuda

29

http://mathema.tician.de/software/pycuda

http://mathema.tician.de/software/pycuda

PyCUDA

‣ Scripting language

- High level programming language that is interpret by another program at runtime rather than compiled

- Advantages: ease on programmer

- Disadvantages: slow (specially inner loops)

‣ PyCUDA

- CUDA codes does not need to be a constant at compile time

- Machine generated code: automatic manage of resources

30

PyCUDA

31

importpycuda.autoinitimportpycuda.driverasdrvimportnumpy

frompycuda.compilerimportSourceModulemod=SourceModule("""__global__voidmultiply_them(float*dest,float*a,float*b){constinti=threadIdx.x;dest[i]=a[i]*b[i];}""")

multiply_them=mod.get_function("multiply_them")

a=numpy.random.randn(400).astype(numpy.float32)b=numpy.random.randn(400).astype(numpy.float32)

dest=numpy.zeros_like(a)multiply_them(drv.Out(dest),drv.In(a),drv.In(b),block=(400,1,1),grid=(1,1))

printdest‐a*b

PyCUDA

‣ Transferring data

‣ Executing a kernel

32

importnumpya=a.astype(numpy.float32)a_gpu=cuda.mem_alloc(a.nbytes)

cuda.memcpy_htod(a_gpu,a)

frompycuda.compilerimportSourceModulemod=SourceModule("""__global__voiddoublify(float*a){intidx=threadIdx.x+threadIdx.y*4;a[idx]*=2;}""")...#Allocate,generateandtransferfunc=mod.get_function("doublify")func(a_gpu,block=(4,4,1))

a_doubled=numpy.empty_like(a)cuda.memcpy_dtoh(a_doubled,a_gpu)printa_doubledprinta

gpu computing with cuda lecture 8 - cuda libraries - …gpu computing with cuda lecture 8 - cuda...

Documents