TRANSCRIPT
Applications of Programming the GPU Directly from Python Using NumbaPro
Supercomputing 2013, November 20, 2013
Travis E. Oliphant, Ph.D.
Introduction
[Diagram: the Continuum ecosystem: Enterprise Python, Scientific Computing, Data Processing, Data Analysis, Visualization, Scalable Computing, and Wakari]
• Products
• Training
• Support
• Consulting
Anaconda: a free Python distribution. An Enterprise version, which includes GPU support, is also available.
Wakari: Scientific Python in your browser. Also available to install in your own data center.
Big Picture

Empower domain experts, subject-matter experts, and other occasional programmers with high-level tools that exploit modern hardware.
Array-Oriented Computing

Why array-oriented computing?
• Express domain knowledge directly in arrays (tensors, matrices, vectors) --- easier to teach programming in the domain
• Can take advantage of parallelism and accelerators like the GPU
[Diagram: six separate objects, each carrying its own Attr1, Attr2, Attr3, contrasted with a single table whose columns are Attr1, Attr2, Attr3 and whose rows are Object1 through Object6 --- objects-of-attributes versus arrays-of-attributes]
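The table layout on the right is exactly what NumPy encourages: one contiguous array per attribute, so a computation over all objects becomes a single vectorized expression. A minimal sketch (attribute names and values are illustrative, not from the deck):

import numpy as np

# one column per attribute; row i holds object i
attr1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
attr2 = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5])

# whole-column computation, no per-object Python loop
attr3 = attr1 * attr2 + 1.0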
Why Python

• License
• Community
• Readable Syntax
• Modern Constructs
• Batteries Included
Free and Open Source, Permissive License
• Broad and friendly community
• Over 36,000 packages on PyPI
• Commercial support
• Many conferences (PyData, SciPy, PyCon...)
• Executable pseudo-code
• Can understand and edit code a year later
• Fun to develop
• Use of indentation
IPython
• Interactive prompt on steroids (Notebook)
• Requires less working memory from the programmer
• Allows failing quickly for exploration
• List comprehensions
• Iterator protocol and generators
• Meta-programming
• Introspection
• JIT compiler and concurrency (Numba)
• Internet (FTP, HTTP, SMTP, XML-RPC)
• Compression and databases
• Great visualization tools (Bokeh, Matplotlib, etc.)
• Powerful libraries for STEM
• Integration with C/C++/Fortran
Breaking the Speed Barrier (Numba!)
Numba aims to be the world’s best array-oriented compiler.
rapid iteration and development + fast code execution = ideal combination!

Python syntax, but no GIL. Native-code speed for numerical computing (NumPy code).
NumPy + Mamba = Numba
[Diagram: a Python function is compiled to machine code through LLVM-PY and the LLVM library; LLVM front-ends and back-ends (CLANG, OpenCL, ISPC, CUDA, OpenMP) target Intel, NVIDIA, Apple, AMD, and ARM hardware]
Example

from numba import jit
from math import sin, pi

@jit('f8(f8)')
def sinc(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)
Numba Compiler Overview

[Diagram: C++, C, Fortran, and Numba front-ends all emit LLVM IR, which LLVM compiles to x86, ARM, and PTX]
Numba turns Python into a “compiled language”
~150x speed-up: real-time image processing in Python (a 50 fps Mandelbrot renderer)
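As a rough sketch of how such speed-ups are measured (the Mandelbrot benchmark itself is not reproduced here; the function and sizes below are illustrative), compare a pure-Python loop against the same function passed through jit:

import timeit
import numpy as np
from numba import jit

def sum2d(a):
    s = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            s += a[i, j]
    return s

sum2d_fast = jit(sum2d)     # same source, compiled through LLVM

a = np.random.rand(1000, 1000)
sum2d_fast(a)               # first call triggers compilation
print(timeit.timeit(lambda: sum2d(a), number=3))       # interpreted
print(timeit.timeit(lambda: sum2d_fast(a), number=3))  # compiled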
Anaconda Accelerate
Python and NumPy stack compiled for parallel architectures (GPUs and multi-core machines)
• Compile NumPy array expressions for the CPU and GPU
• Create parallel-for loops
• Parallel execution of ufuncs
• Run ufuncs on the GPU
• Write CUDA directly in Python!
• Requires CUDA 5.5
Fast development and execution
$ conda install accelerate
NumbaPro Features
• CUDA Python
• Vectorize --- NumPy functions on the GPU
• Array expressions
• Parallel-for loops
• Access to fast libraries (cuRAND, cuFFT, cuBLAS)
Compile NumPy array expressions
import numbapro
from numba import autojit

@autojit
def formula(a, b, c):
    a[1:,1:] = a[1:,1:] + b[1:,:-1] + c[1:,:-1]

@autojit
def express(m1, m2):
    m2[1:-1:2,0,...,::2] = (m1[1:-1:2,...,::2] *
                            m1[-2:1:-2,...,::2])
    return m2
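A possible driver for the first function (array sizes are illustrative); @autojit specializes and compiles on the first call, so subsequent calls run at native speed:

import numpy as np

a = np.random.rand(100, 100)
b = np.random.rand(100, 100)
c = np.random.rand(100, 100)

formula(a, b, c)   # compiled for float64 2-d arrays on first call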
Create parallel-for loops

The prange construct spawns compiled tasks in threads (like an OpenMP parallel-for pragma).

import numbapro  # import first to make prange available
from numba import autojit, prange

@autojit
def parallel_sum2d(a):
    sum = 0.0
    for i in prange(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i,j]
    return sum
Fast vectorize
NumPy’s ufuncs take “kernels” and apply the kernel element-by-element over entire arrays. Write the kernels in Python!
import numpy
from numbapro import vectorize
from math import sin, pi

@vectorize(['f8(f8)', 'f4(f4)'])
def sinc(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)

x = numpy.linspace(-5, 5, 100)
y = sinc(x)
Ufuncs in parallel (multi-thread or GPU)

import numpy
from numbapro import vectorize
from math import sin, pi

@vectorize(['f8(f8)', 'f4(f4)'], target='gpu')
def sinc(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)

@vectorize(['f8(f8)', 'f4(f4)'], target='parallel')
def sinc2(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)

x = numpy.linspace(-5, 5, 1000)
y = sinc(x)   # on the GPU
z = sinc2(x)  # on multiple CPUs
Example Benchmark
For a simple computation, the overhead of memory transfer is overcome after about 1 million floats.
Memory transfer and set-up cost about 1 ms per call.
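One way to amortize that overhead is to keep data on the GPU between calls by handing device arrays to the ufunc explicitly. A minimal sketch, assuming the gpu-targeted sinc from the previous slide and the NumbaPro-era cuda transfer API (to_device, device_array_like, copy_to_host):

import numpy as np
from numbapro import cuda

x = np.linspace(-5, 5, 10000000)
d_x = cuda.to_device(x)           # one host -> device copy
d_y = cuda.device_array_like(x)   # output allocated on the device
sinc(d_x, out=d_y)                # kernel launch, no implicit copies
y = d_y.copy_to_host()            # one device -> host copy at the end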
Using Vectorize

from numbapro import vectorize
sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'
@vectorize([sig], target='gpu')
def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height

    x = tid % width
    y = tid / width

    real = min_x + x * pixel_size_x
    imag = min_y + y * pixel_size_y

    c = complex(real, imag)
    z = 0.0j

    for i in range(iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i
    return 255
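One way to invoke this kernel (a sketch; the image size and iteration count below are illustrative): launch one ufunc element per pixel by passing an array of thread ids, then reshape the result into an image.

import numpy as np

width, height, iters = 1024, 768, 255
tids = np.arange(width * height, dtype=np.uint32)
pixels = mandel(tids, -2.0, 1.0, -1.0, 1.0, width, height, iters)
image = pixels.reshape(height, width)   # escape-time value per pixel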
Kind     Time      Speed-up
Python   263.6     1.0x
CPU      2.639     100x
GPU      0.1676    1573x

(GPU: Tesla S2050)
Grouping Calculations
from numbapro import guvectorize

prototype = "void(float32[:,:], float32[:,:], float32[:,:])"

@guvectorize([prototype], '(m,n),(n,p)->(m,p)', target='gpu')
def matmul(A, B, C):
    m, n = A.shape
    n, p = B.shape
    for i in range(m):
        for j in range(p):
            C[i, j] = 0
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
Create “generalized ufuncs” whose elements are “arrays”
import numpy as np

matrix_ct = 1000

# creates an array of 1000 x 2 x 4
A = np.arange(matrix_ct * 2 * 4,
              dtype=np.float32).reshape(1000, 2, 4)
# creates an array of 1000 x 4 x 5
B = np.arange(matrix_ct * 4 * 5,
              dtype=np.float32).reshape(1000, 4, 5)
# outputs an array of 1000 x 2 x 5
C = matmul(A, B)
Using cuBLAS
import numpy as np
from numbapro.cudalib import cublas

N = 100  # matrix size; any value works (not specified in the original)

A = np.array(np.arange(N ** 2, dtype=np.float32).reshape(N, N))
B = np.array(np.arange(N) + 10, dtype=A.dtype)
D = np.zeros_like(A, order='F')

# NumPy
E = np.dot(A, np.diag(B))

# cuBLAS
blas = cublas.Blas()
blas.gemm('T', 'T', N, N, N, 1.0, A, np.diag(B), 1.0, D)
FFT Convolution with cuFFT
from numbapro import cuda, vectorize
from numbapro.cudalib import cufft

@vectorize(['complex64(complex64, complex64)'], target='gpu')
def vmult(a, b):
    return a * b

# host -> device
d_img = cuda.to_device(img)    # image
d_fltr = cuda.to_device(fltr)  # filter
# FFT forward
cufft.fft_inplace(d_img)
cufft.fft_inplace(d_fltr)
# multiply
vmult(d_img, d_fltr, out=d_img)  # in place
# FFT inverse
cufft.ifft_inplace(d_img)
# device -> host
filtered_img = d_img.copy_to_host()

Works with 1-d, 2-d, and 3-d arrays.
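The snippet assumes img and fltr already exist as same-shaped complex arrays. A hypothetical setup (the image and filter values are illustrative): pad the filter out to the image's shape and cast both to complex64, the dtype the in-place FFT and vmult expect.

import numpy as np

image = np.random.rand(512, 512)    # stand-in for a real image
kernel = np.ones((5, 5)) / 25.0     # simple box-blur filter

img = image.astype(np.complex64)
fltr = np.zeros_like(img)
fltr[:kernel.shape[0], :kernel.shape[1]] = kernel.astype(np.complex64)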
Monte-Carlo Pricing and cuRAND

import math
import numpy as np
from numbapro import vectorize, cuda
from numbapro.cudalib import curand

@vectorize(['f8(f8, f8, f8, f8, f8)'], target='gpu')
def step(last, dt, c0, c1, noise):
    return last * math.exp(c0 * dt + c1 * noise)

def monte_carlo_pricer(paths, dt, interest, volatility):
    n = paths.shape[0]
    blksz = cuda.get_current_device().MAX_THREADS_PER_BLOCK
    gridsz = int(math.ceil(float(n) / blksz))
    # Instantiate cuRAND PRNG
    prng = curand.PRNG(curand.PRNG.MRG32K3A)
    # Allocate device-side array
    d_normdist = cuda.device_array(n, dtype=np.double)
    c0 = interest - 0.5 * volatility ** 2
    c1 = volatility * math.sqrt(dt)
    # Simulation loop
    d_last = cuda.to_device(paths[:, 0])
    for j in range(1, paths.shape[1]):
        prng.normal(d_normdist, mean=0, sigma=1)
        d_paths = cuda.to_device(paths[:, j])
        step(d_last, dt, c0, c1, d_normdist, out=d_paths)
        d_paths.copy_to_host(paths[:, j])
        d_last = d_paths
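A hypothetical driver (the path count, step count, and market parameters are illustrative, not from the deck): seed the first column with the spot price and let the pricer fill in the rest.

import numpy as np

n_paths, n_steps = 100000, 100
spot, interest, volatility = 100.0, 0.05, 0.2
dt = 1.0 / n_steps

paths = np.zeros((n_paths, n_steps), dtype=np.double)
paths[:, 0] = spot
monte_carlo_pricer(paths, dt, interest, volatility)

estimated_price = np.mean(paths[:, -1])   # discounting omitted for brevity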
Tuned kernel

import math
from numbapro import jit, cuda

@jit('void(double[:], double[:], double, double, double, double[:])', target='gpu')
def step(last, paths, dt, c0, c1, normdist):
    # cuda.grid(1) expands to
    # cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    i = cuda.grid(1)
    if i >= paths.shape[0]:
        return
    noise = normdist[i]
    paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)
Benchmark
http://continuum.io/blog/monte-carlo-pricer
CUDA Python

import numpy as np
from numbapro import cuda
from numba import autojit

@autojit(target='gpu')
def array_scale(src, dst, scale):
    tid = cuda.threadIdx.x
    blkid = cuda.blockIdx.x
    blkdim = cuda.blockDim.x

    i = tid + blkid * blkdim

    if i >= dst.shape[0]:
        return

    dst[i] = src[i] * scale

N = 1024       # illustrative size; not specified in the original
block = 32
grid = N // block

src = np.arange(N, dtype=np.float64)
dst = np.empty_like(src)

array_scale[grid, block](src, dst, 5.0)
CUDA Development directly in Python
Example: Matrix Multiplication

import numpy as np
from numbapro import cuda
from numba import f4

bpg = 50
tpb = 32
n = bpg * tpb

@cuda.jit(argtypes=[f4[:,:], f4[:,:], f4[:,:]])
def cu_square_matrix_mul(A, B, C):
    sA = cuda.shared.array(shape=(tpb, tpb), dtype=f4)
    sB = cuda.shared.array(shape=(tpb, tpb), dtype=f4)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y

    x = tx + bx * bw
    y = ty + by * bh

    acc = 0.
    for i in range(bpg):
        if x < n and y < n:
            sA[ty, tx] = A[y, tx + i * tpb]
            sB[ty, tx] = B[ty + i * tpb, x]

        cuda.syncthreads()

        if x < n and y < n:
            for j in range(tpb):
                acc += sA[ty, j] * sB[j, tx]

        cuda.syncthreads()

    if x < n and y < n:
        C[y, x] = acc

A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, n)), dtype=np.float32)
C = np.empty_like(A)

stream = cuda.stream()
with stream.auto_synchronize():
    dA = cuda.to_device(A, stream)
    dB = cuda.to_device(B, stream)
    dC = cuda.to_device(C, stream)
    cu_square_matrix_mul[(bpg, bpg), (tpb, tpb), stream](dA, dB, dC)
    dC.to_host(stream)
Performance Results
About 6x faster on the GPU (GeForce GTX 560 Ti) than on the CPU (Core i7).
How to get it
• Anaconda Accelerate (Anaconda Add-On)
• Available as part of on-premise Wakari
• On wakari.io --- cloud-based Python (GPU instances coming soon)
•http://github.com/ContinuumIO/numbapro-examples
Anaconda user:
    conda install accelerate

Other Python users:
    pip install conda
    conda init
    conda install accelerate
enterprise.wakari.io