TRANSCRIPT
Applications of Programming the GPU Directly from Python Using NumbaPro
Supercomputing 2013, November 20, 2013
Travis E. Oliphant, Ph.D.
Introduction
[Diagram: the Continuum ecosystem: Enterprise Python, Scientific Computing, Data Processing, Data Analysis, Visualization, Scalable Computing, and Wakari]
• Products
• Training
• Support
• Consulting
Anaconda: a free Python distribution. An Enterprise version, which includes GPU support, is also available.
Wakari: Scientific Python in your browser. Also available to install in your own data center.
Big Picture

Empower domain experts, subject-matter experts, and other occasional programmers with high-level tools that exploit modern hardware.
Array-Oriented Computing

Why array-oriented computing?
• Express domain knowledge directly in arrays (tensors, matrices, vectors) --- easier to teach programming in the domain
• Can take advantage of parallelism and accelerators like the GPU
[Diagram: six separate objects, each carrying its own Attr1, Attr2, Attr3, contrasted with a single table whose columns are Attr1, Attr2, Attr3 and whose rows are Object1 through Object6 --- objects-of-attributes versus arrays-of-attributes]
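The table layout on the right is exactly what NumPy encourages: one contiguous array per attribute, so a computation over all objects becomes a single vectorized expression. A minimal sketch (attribute names and values are illustrative, not from the deck):

import numpy as np

# one column per attribute; row i holds object i
attr1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
attr2 = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5])

# whole-column computation, no per-object Python loop
attr3 = attr1 * attr2 + 1.0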
Why Python

• License
• Community
• Readable Syntax
• Modern Constructs
• Batteries Included
Free and Open Source, Permissive License
• Broad and friendly community
• Over 36,000 packages on PyPI
• Commercial support
• Many conferences (PyData, SciPy, PyCon...)
• Executable pseudo-code
• Can understand and edit code a year later
• Fun to develop
• Use of indentation
IPython
• Interactive prompt on steroids (Notebook)
• Requires less working memory from the programmer
• Allows failing quickly for exploration
• List comprehensions
• Iterator protocol and generators
• Meta-programming
• Introspection
• JIT compiler and concurrency (Numba)
• Internet (FTP, HTTP, SMTP, XML-RPC)
• Compression and databases
• Great visualization tools (Bokeh, Matplotlib, etc.)
• Powerful libraries for STEM
• Integration with C/C++/Fortran
Breaking the Speed Barrier (Numba!)
Numba aims to be the world’s best array-oriented compiler.
rapid iteration and development + fast code execution = ideal combination!

Python syntax, but no GIL. Native-code speed for numerical computing (NumPy code).
NumPy + Mamba = Numba
[Diagram: a Python function is compiled to machine code through LLVM-PY and the LLVM library; LLVM front-ends and back-ends (CLANG, OpenCL, ISPC, CUDA, OpenMP) target Intel, NVIDIA, Apple, AMD, and ARM hardware]
Example

from numba import jit
from math import sin, pi

@jit('f8(f8)')
def sinc(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)
Numba Compiler Overview

[Diagram: C++, C, Fortran, and Numba front-ends all emit LLVM IR, which LLVM compiles to x86, ARM, and PTX]
Numba turns Python into a “compiled language”
~150x speed-up: real-time image processing in Python (a 50 fps Mandelbrot renderer)
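As a rough sketch of how such speed-ups are measured (the Mandelbrot benchmark itself is not reproduced here; the function and sizes below are illustrative), compare a pure-Python loop against the same function passed through jit:

import timeit
import numpy as np
from numba import jit

def sum2d(a):
    s = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            s += a[i, j]
    return s

sum2d_fast = jit(sum2d)     # same source, compiled through LLVM

a = np.random.rand(1000, 1000)
sum2d_fast(a)               # first call triggers compilation
print(timeit.timeit(lambda: sum2d(a), number=3))       # interpreted
print(timeit.timeit(lambda: sum2d_fast(a), number=3))  # compiled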
Anaconda Accelerate
Python and NumPy stack compiled for parallel architectures (GPUs and multi-core machines)
• Compile NumPy array expressions for the CPU and GPU
• Create parallel-for loops
• Parallel execution of ufuncs
• Run ufuncs on the GPU
• Write CUDA directly in Python!
• Requires CUDA 5.5
Fast development and execution
$ conda install accelerate
NumbaPro Features
• CUDA Python
• Vectorize --- NumPy functions on the GPU
• Array expressions
• Parallel-for loops
• Access to fast libraries (cuRAND, cuFFT, cuBLAS)
Compile NumPy array expressions
import numbapro
from numba import autojit

@autojit
def formula(a, b, c):
    a[1:,1:] = a[1:,1:] + b[1:,:-1] + c[1:,:-1]

@autojit
def express(m1, m2):
    m2[1:-1:2,0,...,::2] = (m1[1:-1:2,...,::2] *
                            m1[-2:1:-2,...,::2])
    return m2
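A possible driver for the first function (array sizes are illustrative); @autojit specializes and compiles on the first call, so subsequent calls run at native speed:

import numpy as np

a = np.random.rand(100, 100)
b = np.random.rand(100, 100)
c = np.random.rand(100, 100)

formula(a, b, c)   # compiled for float64 2-d arrays on first call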
Create parallel-for loops

The prange construct spawns compiled tasks in threads (like an OpenMP parallel-for pragma).

import numbapro  # import first to make prange available
from numba import autojit, prange

@autojit
def parallel_sum2d(a):
    sum = 0.0
    for i in prange(a.shape[0]):
        for j in range(a.shape[1]):
            sum += a[i,j]
    return sum
Fast vectorize
NumPy’s ufuncs take “kernels” and apply the kernel element-by-element over entire arrays. Write the kernels in Python!
import numpy
from numbapro import vectorize
from math import sin, pi

@vectorize(['f8(f8)', 'f4(f4)'])
def sinc(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)

x = numpy.linspace(-5, 5, 100)
y = sinc(x)
Ufuncs in parallel (multi-thread or GPU)

import numpy
from numbapro import vectorize
from math import sin, pi

@vectorize(['f8(f8)', 'f4(f4)'], target='gpu')
def sinc(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)

@vectorize(['f8(f8)', 'f4(f4)'], target='parallel')
def sinc2(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)

x = numpy.linspace(-5, 5, 1000)
y = sinc(x)   # on the GPU
z = sinc2(x)  # on multiple CPUs
Example Benchmark
For a simple computation, the overhead of memory transfer is overcome after about 1 million floats.
Memory transfer and set-up cost about 1 ms per call.
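One way to amortize that overhead is to keep data on the GPU between calls by handing device arrays to the ufunc explicitly. A minimal sketch, assuming the gpu-targeted sinc from the previous slide and the NumbaPro-era cuda transfer API (to_device, device_array_like, copy_to_host):

import numpy as np
from numbapro import cuda

x = np.linspace(-5, 5, 10000000)
d_x = cuda.to_device(x)           # one host -> device copy
d_y = cuda.device_array_like(x)   # output allocated on the device
sinc(d_x, out=d_y)                # kernel launch, no implicit copies
y = d_y.copy_to_host()            # one device -> host copy at the end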
Using Vectorize

from numbapro import vectorize
sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'
@vectorize([sig], target='gpu')
def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height

    x = tid % width
    y = tid / width

    real = min_x + x * pixel_size_x
    imag = min_y + y * pixel_size_y

    c = complex(real, imag)
    z = 0.0j

    for i in range(iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i
    return 255
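One way to invoke this kernel (a sketch; the image size and iteration count below are illustrative): launch one ufunc element per pixel by passing an array of thread ids, then reshape the result into an image.

import numpy as np

width, height, iters = 1024, 768, 255
tids = np.arange(width * height, dtype=np.uint32)
pixels = mandel(tids, -2.0, 1.0, -1.0, 1.0, width, height, iters)
image = pixels.reshape(height, width)   # escape-time value per pixel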
Kind     Time      Speed-up
Python   263.6     1.0x
CPU      2.639     100x
GPU      0.1676    1573x

(GPU: Tesla S2050)
Grouping Calculations
from numbapro import guvectorize

prototype = "void(float32[:,:], float32[:,:], float32[:,:])"

@guvectorize([prototype], '(m,n),(n,p)->(m,p)', target='gpu')
def matmul(A, B, C):
    m, n = A.shape
    n, p = B.shape
    for i in range(m):
        for j in range(p):
            C[i, j] = 0
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
Create “generalized ufuncs” whose elements are “arrays”
import numpy as np

matrix_ct = 1000

# creates an array of 1000 x 2 x 4
A = np.arange(matrix_ct * 2 * 4,
              dtype=np.float32).reshape(1000, 2, 4)
# creates an array of 1000 x 4 x 5
B = np.arange(matrix_ct * 4 * 5,
              dtype=np.float32).reshape(1000, 4, 5)
# outputs an array of 1000 x 2 x 5
C = matmul(A, B)
Using cuBLAS
import numpy as np
from numbapro.cudalib import cublas

N = 100  # matrix size; any value works (not specified in the original)

A = np.array(np.arange(N ** 2, dtype=np.float32).reshape(N, N))
B = np.array(np.arange(N) + 10, dtype=A.dtype)
D = np.zeros_like(A, order='F')

# NumPy
E = np.dot(A, np.diag(B))

# cuBLAS
blas = cublas.Blas()
blas.gemm('T', 'T', N, N, N, 1.0, A, np.diag(B), 1.0, D)
FFT Convolution with cuFFT
from numbapro import cuda, vectorize
from numbapro.cudalib import cufft

@vectorize(['complex64(complex64, complex64)'], target='gpu')
def vmult(a, b):
    return a * b

# host -> device
d_img = cuda.to_device(img)    # image
d_fltr = cuda.to_device(fltr)  # filter
# FFT forward
cufft.fft_inplace(d_img)
cufft.fft_inplace(d_fltr)
# multiply
vmult(d_img, d_fltr, out=d_img)  # in place
# FFT inverse
cufft.ifft_inplace(d_img)
# device -> host
filtered_img = d_img.copy_to_host()

Works with 1-d, 2-d, and 3-d arrays.
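The snippet assumes img and fltr already exist as same-shaped complex arrays. A hypothetical setup (the image and filter values are illustrative): pad the filter out to the image's shape and cast both to complex64, the dtype the in-place FFT and vmult expect.

import numpy as np

image = np.random.rand(512, 512)    # stand-in for a real image
kernel = np.ones((5, 5)) / 25.0     # simple box-blur filter

img = image.astype(np.complex64)
fltr = np.zeros_like(img)
fltr[:kernel.shape[0], :kernel.shape[1]] = kernel.astype(np.complex64)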
Monte-Carlo Pricing and cuRAND

import math
import numpy as np
from numbapro import vectorize, cuda
from numbapro.cudalib import curand

@vectorize(['f8(f8, f8, f8, f8, f8)'], target='gpu')
def step(last, dt, c0, c1, noise):
    return last * math.exp(c0 * dt + c1 * noise)

def monte_carlo_pricer(paths, dt, interest, volatility):
    n = paths.shape[0]
    blksz = cuda.get_current_device().MAX_THREADS_PER_BLOCK
    gridsz = int(math.ceil(float(n) / blksz))
    # Instantiate cuRAND PRNG
    prng = curand.PRNG(curand.PRNG.MRG32K3A)
    # Allocate device-side array
    d_normdist = cuda.device_array(n, dtype=np.double)
    c0 = interest - 0.5 * volatility ** 2
    c1 = volatility * math.sqrt(dt)
    # Simulation loop
    d_last = cuda.to_device(paths[:, 0])
    for j in range(1, paths.shape[1]):
        prng.normal(d_normdist, mean=0, sigma=1)
        d_paths = cuda.to_device(paths[:, j])
        step(d_last, dt, c0, c1, d_normdist, out=d_paths)
        d_paths.copy_to_host(paths[:, j])
        d_last = d_paths
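A hypothetical driver (the path count, step count, and market parameters are illustrative, not from the deck): seed the first column with the spot price and let the pricer fill in the rest.

import numpy as np

n_paths, n_steps = 100000, 100
spot, interest, volatility = 100.0, 0.05, 0.2
dt = 1.0 / n_steps

paths = np.zeros((n_paths, n_steps), dtype=np.double)
paths[:, 0] = spot
monte_carlo_pricer(paths, dt, interest, volatility)

estimated_price = np.mean(paths[:, -1])   # discounting omitted for brevity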
Tuned kernel

import math
from numbapro import jit, cuda

@jit('void(double[:], double[:], double, double, double, double[:])', target='gpu')
def step(last, paths, dt, c0, c1, normdist):
    # cuda.grid(1) expands to
    # cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    i = cuda.grid(1)
    if i >= paths.shape[0]:
        return
    noise = normdist[i]
    paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)
Benchmark
http://continuum.io/blog/monte-carlo-pricer
CUDA Python

import numpy as np
from numbapro import cuda
from numba import autojit

@autojit(target='gpu')
def array_scale(src, dst, scale):
    tid = cuda.threadIdx.x
    blkid = cuda.blockIdx.x
    blkdim = cuda.blockDim.x

    i = tid + blkid * blkdim

    if i >= dst.shape[0]:
        return

    dst[i] = src[i] * scale

N = 1024       # illustrative size; not specified in the original
block = 32
grid = N // block

src = np.arange(N, dtype=np.float64)
dst = np.empty_like(src)

array_scale[grid, block](src, dst, 5.0)
CUDA Development directly in Python
Example: Matrix Multiplication

import numpy as np
from numbapro import cuda
from numba import f4

bpg = 50
tpb = 32
n = bpg * tpb

@cuda.jit(argtypes=[f4[:,:], f4[:,:], f4[:,:]])
def cu_square_matrix_mul(A, B, C):
    sA = cuda.shared.array(shape=(tpb, tpb), dtype=f4)
    sB = cuda.shared.array(shape=(tpb, tpb), dtype=f4)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y

    x = tx + bx * bw
    y = ty + by * bh

    acc = 0.
    for i in range(bpg):
        if x < n and y < n:
            sA[ty, tx] = A[y, tx + i * tpb]
            sB[ty, tx] = B[ty + i * tpb, x]

        cuda.syncthreads()

        if x < n and y < n:
            for j in range(tpb):
                acc += sA[ty, j] * sB[j, tx]

        cuda.syncthreads()

    if x < n and y < n:
        C[y, x] = acc

A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, n)), dtype=np.float32)
C = np.empty_like(A)

stream = cuda.stream()
with stream.auto_synchronize():
    dA = cuda.to_device(A, stream)
    dB = cuda.to_device(B, stream)
    dC = cuda.to_device(C, stream)
    cu_square_matrix_mul[(bpg, bpg), (tpb, tpb), stream](dA, dB, dC)
    dC.to_host(stream)
Performance Results
About 6x faster on the GPU (GeForce GTX 560 Ti) than on the CPU (Core i7).
How to get it
• Anaconda Accelerate (Anaconda Add-On)
• Available as part of on-premise Wakari
• On wakari.io --- cloud-based Python (GPU instances coming soon)
•http://github.com/ContinuumIO/numbapro-examples
Anaconda user:
    conda install accelerate

Other Python users:
    pip install conda
    conda init
    conda install accelerate
enterprise.wakari.io