applications of programming the gpu directly from python ... · breaking the speed barrier (numba!)...

27
Applications of Programming the GPU Directly from Python Using NumbaPro Supercomputing 2013 November 20, 2013 Travis E. Oliphant, Ph.D.

Upload: others

Post on 17-Oct-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Applications of Programming the GPU Directly from Python Using NumbaPro

Supercomputing 2013November 20, 2013

Travis E. Oliphant, Ph.D.

Page 2: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Inroduction

Enterprise

Python

Scientific

Computing

Data Processing

Data Analysis

Visualisation

Scalable

Computing

Wakari

•Products•Training•Support•Consulting

Free Python distributionEnterprise version available

which includes GPU support

Scientific Python in your browserAvailable to install in your data-center

Page 3: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Big PictureEmpower domain experts, subject

matter experts, and other occasional programmers with high-level tools that

exploit modern hardware

Array Oriented Computing

?

Page 4: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Why Array-oriented computing

•Express domain knowledge directly in arrays (tensors, matrices, vectors) --- easier to teach programming in domain

•Can take advantage of parallelism and accelerators like the GPU

Object

Attr1

Attr2

Attr3

Object

Attr1

Attr2

Attr3

Object

Attr1

Attr2

Attr3

Object

Attr1

Attr2

Attr3

Object

Attr1

Attr2

Attr3

Object

Attr1

Attr2

Attr3

Attr1 Attr2 Attr3

Object1

Object2

Object3

Object4

Object5

Object6

Page 5: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Why PythonLicense

Community

Readable Syntax

Modern Constructs

Batteries Included

Free and Open Source, Permissive License

• Broad and friendly community• Over 36,000 packages on PyPI• Commercial Support• Many conferences (PyData, SciPy, PyCon...)

• Executable pseudo-code• Can understand and edit code a year later• Fun to develop• Use of Indentation

IPython

• Interactive prompt on steroids (Notebook)• Allows less working memory • Allows failing quickly for exploration

• List comprehensions• Iterator protocol and generators• Meta-programming• Introspection• JIT Compiler and Concurrency (Numba)

• Internet (FTP, HTTP, SMTP, XMLRPC)• Compression and Databases• Great Visualization tools (Bokeh, Matplotlib, etc.)• Powerful libraries for STEM• Integration with C/C++/Fortran

Page 6: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Breaking the Speed Barrier (Numba!)

Numba aims to be the world’s best array-oriented compiler.

rapid iteration and development+

fast code executionideal combination!=

Python syntax but no GILNative code speed for Numerical

computing (NumPy code)

Page 7: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

NumPy + Mamba = Numba

LLVM Library

Intel Nvidia AppleAMD

OpenCLISPC CUDA CLANGOpenMP

LLVM-PY

Python Function Machine Code

ARM

Page 8: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Example@jit(‘f8(f8)’)def sinc(x): if x==0.0: return 1.0 else: return sin(x*pi)/(pi*x)

Numba

Page 9: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Compiler Overview

LLVM IR

x86C++

ARM

PTX

C

Fortran

Numba

Numba turns Python into a “compiled language”

Page 10: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

~150x speed-up Real-time image processing in Python (50 fps Mandelbrot)

Page 11: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Anaconda Accelerate

Python and NumPy stack compiled to Parallel Architectures

(GPUs and multi-core machines)

• Compile NumPy array expressions for the CPU and GPU

• Create parallel-for loops• Parallel execution of ufuncs• Run ufuncs on the GPU• Write CUDA directly in Python!• Requires CUDA 5.5

Fast development and execution

$ conda install accelerate

Page 12: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

NumbaPro Features

•CUDA Python•Vectorize --- NumPy functions on the GPU•Array expressions•Parallel for loops•Access to fast libraries (cuRAND, cuFFT, cuBLAS)

Page 13: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Compile NumPy array expressions

import numbaprofrom numba import autojit

@autojitdef formula(a, b, c): a[1:,1:] = a[1:,1:] + b[1:,:-1] + c[1:,:-1]

@autojitdef express(m1, m2): m2[1:-1:2,0,...,::2] = (m1[1:-1:2,...,::2] * m1[-2:1:-2,...,::2]) return m2

Page 14: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Create parallel-for loops“prange” directive that spawns compiled tasks in threads (like Open-MP parallel-for pragma)

import numbapro # import first to make prange availablefrom numba import autojit, prange

@autojitdef parallel_sum2d(a): sum = 0.0 for i in prange(a.shape[0]): for j in range(a.shape[1]): sum += a[i,j]

Page 15: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Fast vectorize

NumPy’s ufuncs take “kernels” and apply the kernel element-by-element over entire arrays Write kernels in Python!

from numbapro import vectorizefrom math import sin, pi

@vectorize([‘f8(f8)’, ‘f4(f4)’])def sinc(x): if x==0.0: return 1.0 else: return sin(x*pi)/(pi*x)

x = numpy.linspace(-5,5,100)y = sinc(x)

Page 16: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Ufuncs in parallel (multi-thread or GPU)from numbapro import vectorizefrom math import sin

@vectorize([‘f8(f8)’, ‘f4(f4)’], target=‘gpu’)def sinc(x): if x==0.0: return 1.0 else: return sin(x*pi)/(pi*x)

@vectorize([‘f8(f8)’, ‘f4(f4)’], target=‘parallel’)def sinc2(x): if x==0.0: return 1.0 else: return sin(x*pi)/(pi*x)

x = numpy.linspace(-5,5,1000)y = sinc(x) # On GPU

z = sinc2(x) # Multiple CPUs

Page 17: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Example Benchmark

Overhead of memory transfer is over-come after about 1 million floats for simple computation

About 1ms of overhead for memory transfer and set-up

Page 18: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Using Vectorizefrom numbapro import vectorize

sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

@vectorize([sig], target='gpu')def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters): pixel_size_x = (max_x - min_x) / width pixel_size_y = (max_y - min_y) / height

x = tid % width y = tid / width

real = min_x + x * pixel_size_x imag = min_y + y * pixel_size_y

c = complex(real, imag) z = 0.0j

for i in range(iters): z = z * z + c if (z.real * z.real + z.imag * z.imag) >= 4: return i return 255

Kind Time Speed-up

Python 263.6 1.0x

CPU 2.639 100x

GPU 0.1676 1573x Tesla S2050

Page 19: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Grouping Calculations

prototype = "void(float32[:,:], float32[:,:], float32[:,:])"@guvectorize([prototype], '(m,n),(n,p)->(m,p)', target='gpu')def matmul(A, B, C): m, n = A.shape n, p = B.shape for i in range(m): for j in range(p): C[i, j] = 0 for k in range(n): C[i, j] += A[i, k] * B[k, j]

Create “generalized ufuncs” whose elements are “arrays”

# creates an array of 1000 x 2 x 4A = np.arange(matrix_ct * 2 * 4, dtype=np.float32 ).reshape(1000, 2, 4)# creates an array of 1000 x 4 x 5B = np.arange(matrix_ct * 4 * 5, dtype=np.float32 ).reshape(1000, 4, 5)# outputs an array of 1000 x 2 x 5C = matmul(A, B)

Page 20: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Using cuBLAS

import numpy as npfrom numbapro.cudalib import cublas

A = np.array(np.arange(N ** 2, dtype=np.float32).reshape(N, N))B = np.array(np.arange(N) + 10, dtype=A.dtype)D = np.zeros_like(A, order='F')

# NumPyE = np.dot(A, np.diag(B))

# cuBLASblas = cublas.Blas()blas.gemm('T', 'T', N, N, N, 1.0, A, np.diag(B), 1.0, D)

Page 21: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

FFT Convolution with cuFFT

from numbapro.cudalib import cufft# host -> deviced_img = cuda.to_device(img) # imaged_fltr = cuda.to_device(fltr) # filter# FFT forwardcufft.fft_inplace(d_img)cufft.fft_inplace(d_fltr)# multplyvmult(d_img, d_fltr, out=d_img) # inplace# FFT inversecufft.ifft_inplace(d_img)# device -> hostfilted_img = d_img.copy_to_host()

@vectorize(['complex64(complex64, complex64)'], target='gpu')def vmult(a, b): return a * b

works with 1d,2d,3d

Page 22: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Monte-Carlo Pricing and cuRANDfrom numbapro import vectorize, cudafrom numbapro.cudalib import curand@vectorize(['f8(f8, f8, f8, f8, f8)'], target='gpu')def step(last, dt, c0, c1, noise): return last * math.exp(c0 * dt + c1 * noise) def monte_carlo_pricer(paths, dt, interest, volatility): n = paths.shape[0] blksz = cuda.get_current_device().MAX_THREADS_PER_BLOCK gridsz = int(math.ceil(float(n) / blksz)) # Instantiate cuRAND PRNG prng = curand.PRNG(curand.PRNG.MRG32K3A) # Allocate device side array d_normdist = cuda.device_array(n, dtype=np.double) c0 = interest - 0.5 * volatility ** 2 c1 = volatility * math.sqrt(dt) # Simulation loop d_last = cuda.to_device(paths[:, 0]) for j in range(1, paths.shape[1]): prng.normal(d_normdist, mean=0, sigma=1) d_paths = cuda.to_device(paths[:, j]) step(d_last, dt, c0, c1, d_normdist, out=d_paths) d_paths.copy_to_host(paths[:, j]) d_last = d_paths

from numbapro import jit, cuda@jit('void(double[:], double[:], double, double, double, double[:])', target='gpu')def step(last, paths, dt, c0, c1, normdist): # expands to i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x i = cuda.grid(1) if i >= paths.shape[0]: return noise = normdist[i] paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)

Tuned kernel

Page 23: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Benchmark

http://continuum.io/blog/monte-carlo-pricer

Page 24: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

CUDA Pythonfrom numbapro import cudafrom numba import autojit

@autojit(target=‘gpu’)def array_scale(src, dst, scale): tid = cuda.threadIdx.x blkid = cuda.blockIdx.x blkdim = cuda.blockDim.x

i = tid + blkid * blkdim

if i >= n: return

dst[i] = src[i] * scale

src = np.arange(N, dtype=np.float)dst = np.empty_like(src)

array_scale[grid, block](src, dst, 5.0)

CUDA Developmentdirectly in Python

Page 25: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Example: Matrix [email protected](argtypes=[f4[:,:], f4[:,:], f4[:,:]])def cu_square_matrix_mul(A, B, C): sA = cuda.shared.array(shape=(tpb, tpb), dtype=f4) sB = cuda.shared.array(shape=(tpb, tpb), dtype=f4) tx = cuda.threadIdx.x ty = cuda.threadIdx.y bx = cuda.blockIdx.x by = cuda.blockIdx.y bw = cuda.blockDim.x bh = cuda.blockDim.y

x = tx + bx * bw y = ty + by * bh

acc = 0. for i in range(bpg): if x < n and y < n: sA[ty, tx] = A[y, tx + i * tpb] sB[ty, tx] = B[ty + i * tpb, x]

cuda.syncthreads()

if x < n and y < n: for j in range(tpb): acc += sA[ty, j] * sB[j, tx]

cuda.syncthreads()

if x < n and y < n: C[y, x] = acc

bpg = 50tpb = 32n = bpg * tpb

A = np.array(np.random.random((n, n)), dtype=np.float32)B = np.array(np.random.random((n, n)), dtype=np.float32)C = np.empty_like(A)

stream = cuda.stream()with stream.auto_synchronize(): dA = cuda.to_device(A, stream) dB = cuda.to_device(B, stream) dC = cuda.to_device(C, stream) cu_square_matrix_mul[(bpg, bpg), (tpb, tpb), stream](dA, dB, dC) dC.to_host(stream)

Page 26: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

Performance Results

About 6x faster on the GPU.

GeForce GTX 560 Ticore i7

Page 27: Applications of Programming the GPU Directly from Python ... · Breaking the Speed Barrier (Numba!) Numba aims to be the world’s best array-oriented compiler. rapid iteration and

How to get it

•Anaconda Accelerate (Anaconda Add-On)•Available as part of on-premise Wakari •On wakari.io --- cloud based Python (GPU instances coming soon)

•http://github.com/ContinuumIO/numbapro-examples

conda install acceleratepip install condaconda initconda install accelerate

Anaconda UserOther Python users

enterprise.wakari.io