gpu computing with cuda lecture 8 - cuda libraries - …gpu computing with cuda lecture 8 - cuda...
TRANSCRIPT
GPU Computing with CUDALecture 8 - CUDA Libraries - CUFFT, PyCUDA
Christopher Cooper Boston University
August, 2011UTFSM, Valparaíso, Chile
1
Outline of lecture
‣Overview:
- Discrete Fourier Transform (DFT)
- Fast Fourier Transform (FFT)
‣ Algorithm
‣Motivation, examples
‣CUFFT: A CUDA based FFT library
‣ PyCUDA: GPU computing using scripting languages
2
3
Bell, Dalton, Olson. Towards AMG on GPU
CUDA Libraries
Fourier Transform
‣ Fourier Transform
‣ Inverse Fourier Transform
4
u(k) =! !
"!e"ikxu(x)dx
u(x) =! !
"!eikxu(k)dk
Real Wave
RealWave
Discrete Fourier Transform (DFT)
‣ The case of discrete we have aliasing
5
u(x)
Real Discrete, bounded
Wave Bounded, discrete
Values at sample points repeat at k2 = k1+N
sin(x)sin(5x)
sin(3x)sin(7x)
Discrete Fourier Transform (DFT)
‣DFT
‣ Inverse DFT
6
uk =N!1!
j=0
uje! 2!i
N kj k = 0, 1, ..., N ! 1
j = 0, 1, ..., N ! 1uj =N!1!
k=0
uke2!iN kj
Fast Fourier Transform (FFT)
‣ Fast method to calculate the DFT
‣Computations drop from to
- N = 104:
‣ Naive: 108 computations
‣ FFT: 4*104 computations
‣Many algorithms, let’s look at Cooley-Tukey radix-2
7
O(N2) O(N log(N))
Huge reduction!
Fast Fourier Transform (FFT)
‣Cooley-Tukey radix 2
- Assume N being a power of 2
8
uk =N!1!
j=0
uje! 2!i
N kj
Divide sum into even and odd parts
uk =N/2!1!
j=0
u2je! 2!i
N k(2j) +N/2!1!
j=0
u2j+1e! 2!i
N k(2j+1)
Fast Fourier Transform (FFT)
‣Cooley-Tukey radix 2
- Assume N being a power of 2
8
uk =N!1!
j=0
uje! 2!i
N kj
Divide sum into even and odd parts
uk =N/2!1!
j=0
u2je! 2!i
N k(2j) +N/2!1!
j=0
u2j+1e! 2!i
N k(2j+1)
Even Odd
Fast Fourier Transform (FFT)
‣ By doing this recursively until there is no sum, you get log(N) levels
‣ Sum is decomposed and redundant operations appear
‣ 4 point transform
9
uk =N/2!1!
j=0
u2je! 2!i
N/2 kj + e!2!iN k
N/2!1!
j=0
u2j+1e! 2!i
N/2 kj
uk = u0 + u1e! 2!
4 ik + u2e! 2!
4 i2k + u3e! 2!
4 i3k
uk = u0 + u2e! 2!
4 i2k + e!2!4 ik(u1 + u3e
! 2!4 i2k)
uk = u0 + u2e!!ik + e!
!2 ik(u1 + u3e
!!ik)
k = 0, 1, 2, 3
Fast Fourier Transform (FFT)
10
u0 = u0 + u2e0 + e0(u1 + u3e
0)u1 = u0 + u2e
!!i + e!!2 i(u1 + u3e
!!i)
u2 = u0 + u2e!2!i + e!!i(u1 + u3e
!2!i)u3 = u0 + u2e
!3!i + e!3 !2 i(u1 + u3e
!3!i)
periodicity! e0 = e!2!i = 1, e!!i = e!3!i = "1u0
u2
u1
u3
u0
u1
u2
u3
FFT - Motivation
‣ Signal processing
- Signal comes in time domain, but want the frequency spectrum
‣Convolution, filters
- Signals can be filtered with convolutions
11
F{f ! g} = F{f} · F{g}
! t
0f(s)g(t! s) ds, 0 " t <#
FFT - Motivation
‣ Partial Differential Equations (PDEs) - Spectral methods
- Use DFTs to calculate derivatives
12
uj =N!1!
k=0
uke2!iN kj 2!
Nj = xj For evenly spaced grid
uj =N!1!
k=0
ukeikxj
!uj
!x=
N!1!
k=0
ikukeikxj
!2uj
!x2=
N!1!
k=0
!k2ukeikxj
!!u
!x= iku
FFT - Motivation
‣Advantages
- Spectral accuracy
‣ Limitations
- Grid constraints
- Boundary condition constraint
13
O(N!m)
O(cN ) 0 < c < 1
CUFFT
‣CUFFT: CUDA library for FFTs on the GPU
‣ Supported by NVIDIA
‣ Features:
- 1D, 2D, 3D transforms for complex and real data
- Batch execution for multiple transforms
- Up to 128 million elements (limited by memory)
- In-place or out-of-place transforms
- Double precision on GT200 or later
- Allows streamed execution: simultaneous computation and data movement 14
CUFFT - Types
‣ cufftHandle
- Handle type to store CUFFT plans
‣ cufftResult
- Return values, like CUFFT_SUCCESS, CUFFT_INVALID_PLAN, CUFFT_ALLOC_FAILED, CUFFT_INVALID_TYPE, etc.
‣ cufftReal
‣ cufftDoubleReal
‣ cufftComplex
‣ cufftDoubleComplex
15
CUFFT - Transform types
‣ R2C: real to complex
‣C2R: Complex to real
‣C2C: complex to complex
‣D2Z: double to double complex
‣ Z2D: double complex to double
‣ Z2Z: double complex to double complex
16
CUFFT - Plans
‣ cufftPlan1d()
‣ cufftPlan2d()
‣ cufftPlan3d()
‣ cufftPlanMany()
17
CUFFT - Functions
‣ cufftDestroy
- Free GPU resources
‣ cufftExecC2C, R2C, C2R, Z2Z, D2Z, Z2D
- Performs the specified FFT
‣ For more details see the CUFFT Library documentation available in the NVIDIA website!
18
CUFFT - Performance considerations
‣ Several algorithms for different sizes
‣ Performance recommendations
- Restrict size to be a multiple of 2, 3, 5 or 7
- Restrict the power-of-two factorization term of the X-dimension to be at least a multiple of 16 for single and 8 for double
- Restrict the X-dimension of single precision transforms to be strictly a power of two between 2(2) and 2048(8192) for Tesla (Fermi) GPUs
19
CUFFT - Performance considerations
‣CUFFT vs FFTW
- CUFFT is good for larger, power of two sized FFTs
- CUFFT is not good for small sized FFTs
‣ CPU can store all data in cache
‣ GPU data transfers take too long
20http://www.sharcnet.ca/~merz/CUDA_benchFFT/
CUFFT - Example
21
#include<cufft.h>#defineNX256#defineBATCH10
cufftHandleplan;cufftComplex*data;cudaMalloc((void**)&data,sizeof(cufftComplex)*NX*BATCH);/*Createa1DFFTplan.*/cufftPlan1d(&plan,NX,CUFFTC2C,BATCH);
/*UsetheCUFFTplantotransformthesignalinplace.*/cufftExecC2C(plan,data,data,CUFFTFORWARD);
/*DestroytheCUFFTplan.*/cufftDestroy(plan);cudaFree(data);
00.2
0.40.6
0.81
00.2
0.40.6
0.810
0.2
0.4
0.6
0.8
1
x
Fourier spectral method for 2D Poisson Eqn
y
u
CUFFT - Example
‣ Solve Poisson equation using FFT
‣Consider periodic boundary conditions
22
!2u =r2 " 2!2
!4e!
r2
2!2
uan = e!r2
2!2 r =!
(x! 0.5)2 + (y ! 0.5)2
CUFFT - Example
‣ Steps
23
FFT system
Find derivative
Transform back
!2u = f
!k2u = f
u = ! f
k2
u = i!t
!! f
k2
"
k2 = k2x + k2
y
CUFFT - Example
24
intmain(){intN=64;floatxmax=1.0f,xmin=0.0f,ymin=0.0f,h=(xmax‐xmin)/((float)N),s=0.1,s2=s*s;float*x=newfloat[N*N],*y=newfloat[N*N],*u=newfloat[N*N],*f=newfloat[N*N],*u_a=newfloat[N*N],*err=newfloat[N*N];
floatr2;for(intj=0;j<N;j++)for(inti=0;i<N;i++){x[N*j+i]=xmin+i*h;y[N*j+i]=ymin+j*h;
r2=(x[N*j+i]‐0.5)*(x[N*j+i]‐0.5)+(y[N*j+i]‐0.5)*(y[N*j+i]‐0.5);f[N*j+i]=(r2‐2*s2)/(s2*s2)*exp(‐r2/(2*s2));u_a[N*j+i]=exp(‐r2/(2*s2));//analyticalsolution}
float*k=newfloat[N];for(inti=0;i<=N/2;i++){k[i]=i*2*M_PI;}for(inti=N/2+1;i<N;i++){k[i]=(i‐N)*2*M_PI;}
CUFFT - Example
25
//Allocatearraysonthedevicefloat*k_d,*f_d,*u_d;cudaMalloc((void**)&k_d,sizeof(float)*N);cudaMalloc((void**)&f_d,sizeof(float)*N*N);cudaMalloc((void**)&u_d,sizeof(float)*N*N);
cudaMemcpy(k_d,k,sizeof(float)*N,cudaMemcpyHostToDevice);cudaMemcpy(f_d,f,sizeof(float)*N*N,cudaMemcpyHostToDevice);
cufftComplex*ft_d,*f_dc,*ft_d_k,*u_dc;cudaMalloc((void**)&ft_d,sizeof(cufftComplex)*N*N);cudaMalloc((void**)&ft_d_k,sizeof(cufftComplex)*N*N);cudaMalloc((void**)&f_dc,sizeof(cufftComplex)*N*N);cudaMalloc((void**)&u_dc,sizeof(cufftComplex)*N*N);
dim3dimGrid(int((N‐0.5)/BSZ)+1,int((N‐0.5)/BSZ)+1);dim3dimBlock(BSZ,BSZ);real2complex<<<dimGrid,dimBlock>>>(f_d,f_dc,N);
cufftHandleplan;cufftPlan2d(&plan,N,N,CUFFT_C2C);
CUFFT - Example
26
cufftExecC2C(plan,f_dc,ft_d,CUFFT_FORWARD);
solve_poisson<<<dimGrid,dimBlock>>>(ft_d,ft_d_k,k_d,N);
cufftExecC2C(plan,ft_d_k,u_dc,CUFFT_INVERSE);
complex2real<<<dimGrid,dimBlock>>>(u_dc,u_d,N);
cudaMemcpy(u,u_d,sizeof(float)*N*N,cudaMemcpyDeviceToHost);
floatconstant=u[0];for(inti=0;i<N*N;i++){u[i]‐=constant;//substractu[0]toforcethearbitraryconstanttobe0}
CUFFT - Example
27
__global__voidsolve_poisson(cufftComplex*ft,cufftComplex*ft_k,float*k,intN){inti=threadIdx.x+blockIdx.x*BSZ;intj=threadIdx.y+blockIdx.y*BSZ;intindex=j*N+i;
if(i<N&&j<N){floatk2=k[i]*k[i]+k[j]*k[j];if(i==0&&j==0){k2=1.0f;}ft_k[index].x=‐ft[index].x/k2;ft_k[index].y=‐ft[index].y/k2;}}
CUFFT - Example
28
__global__voidreal2complex(float*f,cufftComplex*fc,intN){inti=threadIdx.x+blockIdx.x*blockDim.x;intj=threadIdx.y+blockIdx.y*blockDim.y;intindex=j*N+i;
if(i<N&&j<N){fc[index].x=f[index];fc[index].y=0.0f;}}
__global__voidcomplex2real(cufftComplex*fc,float*f,intN){inti=threadIdx.x+blockIdx.x*BSZ;intj=threadIdx.y+blockIdx.y*BSZ;intindex=j*N+i;
if(i<N&&j<N){f[index]=fc[index].x/((float)N*(float)N);//dividebynumberofelementstorecovervalue}}
PyCUDA
‣ Python + CUDA = PyCUDA
‣ Python: scripting language easy to code, but slow
‣CUDA difficult to code, but fast!
‣ PyCUDA wants to take the best of both worlds
‣ http://mathema.tician.de/software/pycuda
29
PyCUDA
‣ Scripting language
- High level programming language that is interpret by another program at runtime rather than compiled
- Advantages: ease on programmer
- Disadvantages: slow (specially inner loops)
‣ PyCUDA
- CUDA codes does not need to be a constant at compile time
- Machine generated code: automatic manage of resources
30
PyCUDA
31
importpycuda.autoinitimportpycuda.driverasdrvimportnumpy
frompycuda.compilerimportSourceModulemod=SourceModule("""__global__voidmultiply_them(float*dest,float*a,float*b){constinti=threadIdx.x;dest[i]=a[i]*b[i];}""")
multiply_them=mod.get_function("multiply_them")
a=numpy.random.randn(400).astype(numpy.float32)b=numpy.random.randn(400).astype(numpy.float32)
dest=numpy.zeros_like(a)multiply_them(drv.Out(dest),drv.In(a),drv.In(b),block=(400,1,1),grid=(1,1))
printdest‐a*b
PyCUDA
‣ Transferring data
‣ Executing a kernel
32
importnumpya=a.astype(numpy.float32)a_gpu=cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu,a)
frompycuda.compilerimportSourceModulemod=SourceModule("""__global__voiddoublify(float*a){intidx=threadIdx.x+threadIdx.y*4;a[idx]*=2;}""")...#Allocate,generateandtransferfunc=mod.get_function("doublify")func(a_gpu,block=(4,4,1))
a_doubled=numpy.empty_like(a)cuda.memcpy_dtoh(a_doubled,a_gpu)printa_doubledprinta