The High Performance Python Landscape by Ian Ozsvald
www.morconsulting.com
The High Performance Python Landscape - profiling and fast calculation
Ian Ozsvald @IanOzsvald MorConsulting.com
[email protected] @IanOzsvald PyDataLondon February 2014
What is “high performance”?
● Profiling to understand system behaviour
● We often ignore this step...
● Speeding up the bottleneck
● Keeps you on 1 machine (if possible)
● Keeping team speed high
[email protected] @IanOzsvald PyDataLondon February 2014
“High Performance Python”
• “Practical Performant Programming for Humans”
• Please join the mailing list via IanOzsvald.com
[email protected] @IanOzsvald PyDataLondon February 2014
cProfile
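A minimal sketch of profiling the talk's Julia-set kernel with the stdlib's cProfile. The function body is reconstructed from the line_profiler listing later in the talk; the grid size and constants here are illustrative stand-ins (the talk benchmarks a much larger 1000x1000 grid):

```python
import cProfile
import io
import pstats

def calculate_z_serial_purepython(maxiter, zs, cs):
    """Julia update rule, pure Python (from the talk's profiled listing)."""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
    return output

# Small illustrative inputs; the constant c is a common Julia-set choice.
zs = [complex(x / 50.0 - 1.5, y / 50.0 - 1.5)
      for x in range(100) for y in range(100)]
cs = [complex(-0.62772, -0.42193)] * len(zs)

profiler = cProfile.Profile()
profiler.enable()
result = calculate_z_serial_purepython(50, zs, cs)
profiler.disable()

# Render the function-level statistics, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

cProfile reports per-function totals only; the next slides show line-by-line alternatives.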
[email protected] @IanOzsvald PyDataLondon February 2014
line_profiler

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def calculate_z_serial_purepython(maxiter, zs, cs):
    12         1         6870   6870.0      0.0      output = [0] * len(zs)
    13   1000001       781959      0.8      0.8      for i in range(len(zs)):
    14   1000000       767224      0.8      0.8          n = 0
    15   1000000       843432      0.8      0.8          z = zs[i]
    16   1000000       786013      0.8      0.8          c = cs[i]
    17  34219980     36492596      1.1     36.2          while abs(z) < 2 and n < maxiter:
    18  33219980     32869046      1.0     32.6              z = z * z + c
    19  33219980     27371730      0.8     27.2              n += 1
    20   1000000       890837      0.9      0.9          output[i] = n
    21         1            4      4.0      0.0      return output
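A quick consistency check on the hit counts in the table above: the `while` test on line 17 runs once more per pixel than the loop body on lines 18-19, because each pixel's final test fails and exits the loop:

```python
pixels = 1_000_000       # 1000 x 1000 grid, one list entry per pixel
body_hits = 33_219_980   # hits reported for lines 18 and 19
# Each pixel tests the while condition one extra time (the failing test).
test_hits = body_hits + pixels
```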
[email protected] @IanOzsvald PyDataLondon February 2014
memory_profiler

Line #    Mem usage    Increment   Line Contents
================================================
     9   89.934 MiB    0.000 MiB   @profile
    10                             def calculate_z_serial_purepython(maxiter, zs, cs):
    12   97.566 MiB    7.633 MiB       output = [0] * len(zs)
    13  130.215 MiB   32.648 MiB       for i in range(len(zs)):
    14  130.215 MiB    0.000 MiB           n = 0
    15  130.215 MiB    0.000 MiB           z = zs[i]
    16  130.215 MiB    0.000 MiB           c = cs[i]
    17  130.215 MiB    0.000 MiB           while n < maxiter and abs(z) < 2:
    18  130.215 MiB    0.000 MiB               z = z * z + c
    19  130.215 MiB    0.000 MiB               n += 1
    20  130.215 MiB    0.000 MiB           output[i] = n
    21  122.582 MiB    7.633 MiB       return output
[email protected] @IanOzsvald PyDataLondon February 2014
memory_profiler mprof
Before & after an improvement: https://github.com/scikit-learn/scikit-learn/pull/2248
[email protected] @IanOzsvald PyDataLondon February 2014
Transforming memory_profiler into a resource profiler?
[email protected] @IanOzsvald PyDataLondon February 2014
Profiling possibilities
● CPU (line by line or by function)
● Memory (line by line)
● Disk read/write (with some hacking)
● Network read/write (with some hacking)
● mmaps
● File handles
● Network connections
● Cache utilisation via libperf?
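The "resource profiler" idea can be partly sketched with the standard library alone. Note these are my illustrative stand-ins, not tools from the talk: `resource.getrusage` reports process-level peak memory, and `tracemalloc` (stdlib since Python 3.4, so after the Python 2.7 code shown here) tracks Python-level allocations:

```python
import resource
import tracemalloc

# Peak resident set size so far (Unix only; Linux reports KiB, macOS bytes).
peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Python-level allocation tracking: measure the cost of building a list.
tracemalloc.start()
data = [complex(i, -i) for i in range(100_000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
```

Hooking disk, network, and file-handle counters into the same per-line report would need the "hacking" mentioned above.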
[email protected] @IanOzsvald PyDataLondon February 2014
Cython 0.20 (pyx annotations)

#cython: boundscheck=False
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

Pure CPython lists runtime: 12 s
Cython lists runtime: 0.19 s
Cython numpy runtime: 0.16 s
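The runtimes quoted on these slides come from timing the same kernel under different compilers. A minimal sketch of taking such a timing with the stdlib's `timeit`, using the pure-Python baseline; the grid here is deliberately small so it runs quickly, and the absolute numbers will differ from the slide's:

```python
import timeit

def calculate_z(maxiter, zs, cs):
    """Pure-Python baseline of the Julia kernel being compared."""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

side = 50  # the talk uses a 1000 x 1000 grid; reduced here for speed
coords = [x / (side / 3.0) - 1.5 for x in range(side)]
zs = [complex(re, im) for re in coords for im in coords]
cs = [complex(-0.62772, -0.42193)] * len(zs)

# Take the best of three runs, the usual way to reduce timing noise.
timings = timeit.repeat(lambda: calculate_z(50, zs, cs), number=1, repeat=3)
best = min(timings)
```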
[email protected] @IanOzsvald PyDataLondon February 2014
Cython + numpy + OMP nogil

#cython: boundscheck=False
from cython.parallel import parallel, prange
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    cdef unsigned int i, length, n
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)
    with nogil, parallel():
        for i in prange(length, schedule="guided"):
            z = zs[i]
            c = cs[i]
            n = 0
            while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                n = n + 1
            output[i] = n
    return output

Runtime: 0.05 s
[email protected] @IanOzsvald PyDataLondon February 2014
ShedSkin 0.9.4 annotations

def calculate_z(maxiter, zs, cs):  # maxiter: [int], zs: [list(complex)], cs: [list(complex)]
    output = [0] * len(zs)         # [list(int)]
    for i in range(len(zs)):       # [__iter(int)]
        n = 0                      # [int]
        z = zs[i]                  # [complex]
        c = cs[i]                  # [complex]
        while n < maxiter and (… < 4):  # [complex]
            z = z * z + c          # [complex]
            n += 1                 # [int]
        output[i] = n              # [int]
    return output                  # [list(int)]

Couldn't we generate Cython pyx? Runtime: 0.22 s
[email protected] @IanOzsvald PyDataLondon February 2014
Pythran (0.40)

#pythran export calculate_z_serial_purepython(int, complex list, complex list)
def calculate_z_serial_purepython(maxiter, zs, cs):
    …

Support for OpenMP on numpy arrays
Author Serge made an overnight fix – superb support!
List runtime: 0.4 s

#pythran export calculate_z(int, complex[], complex[], int[])
…
#omp parallel for schedule(dynamic)

OMP numpy runtime: 0.10 s
[email protected] @IanOzsvald PyDataLondon February 2014
PyPy nightly (and numpypy)
● “It just works” on Python 2.7 code
● Clever list strategies (e.g. unboxed, uniform)
● Little support for pre-existing C extensions (e.g. the existing numpy)
● multiprocessing, IPython etc. all work fine
● Python list code runtime: 0.3 s
● (pypy)numpy support is incomplete, bugs are being tackled (numpy runtime 5 s [CPython+numpy 56 s])
[email protected] @IanOzsvald PyDataLondon February 2014
Numba 0.12

from numba import jit

@jit(nopython=True)
def calculate_z_serial_purepython(maxiter, zs, cs, output):
    # couldn't create output here, had to pass it in
    # output = numpy.zeros(len(zs), dtype=np.int32)
    for i in xrange(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # while n < maxiter and abs(z) < 2:  # abs unrecognised
        while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
            z = z * z + c
            n += 1
        output[i] = n
    # return output

Runtime: 0.4 s
Some Python 3 support, some GPU
prange support missing (was in 0.11)?
0.12 introduces temporary limitations
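The workaround noted in the comments above — pre-allocating `output` and passing it in rather than creating or returning it inside the jitted function — can be sketched as follows. A plain Python list stands in for the numpy array so this runs without Numba installed; under Numba the `@jit` decorator and an `np.zeros` buffer would be used as on the slide:

```python
def calculate_z_serial_purepython(maxiter, zs, cs, output):
    """Fill a pre-allocated output buffer in place (no return needed)."""
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
            z = z * z + c
            n += 1
        output[i] = n

# Illustrative inputs: starting points along the real axis.
zs = [complex(x / 100.0 - 1.5, 0.0) for x in range(200)]
cs = [complex(-0.62772, -0.42193)] * len(zs)
output = [0] * len(zs)  # caller allocates; the function mutates it
calculate_z_serial_purepython(50, zs, cs, output)
```

Pushing allocation out to the caller is a common pattern when a JIT cannot yet compile the allocating call itself.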
[email protected] @IanOzsvald PyDataLondon February 2014
Tool Tradeoffs

● PyPy: no learning curve (pure Py only) – easy win?
● ShedSkin: easy (pure Py only) but fairly rare
● Cython pure Py: hours to learn – team cost low (and lots of online help)
● Cython numpy OMP: days+ to learn – heavy team cost?
● Numba/Pythran: hours to learn; install a bit tricky (Anaconda easiest for Numba)
● Pythran OMP: very impressive result for little effort
● Numba: big toolchain which might hurt productivity?
● (numexpr not covered – great for numpy and easy to use)
[email protected] @IanOzsvald PyDataLondon February 2014
Wrap up
● Our profiling options should be richer
● 4–12 physical CPU cores are commonplace
● The cost of hand-annotating code is reduced agility
● JITs/AST compilers are getting fairly good, but manual intervention still gives the best results

BUT! CONSIDER:
● Automation should (probably) be embraced ($CPUs < $humans), as team velocity is probably higher
[email protected] @IanOzsvald PyDataLondon February 2014
Thank You
• [email protected]• @IanOzsvald
• MorConsulting.com
• Annotate.io
• GitHub/IanOzsvald