The High Performance Python Landscape by Ian Ozsvald
www.morconsulting.com
The High Performance Python Landscape - profiling and fast calculation
Ian Ozsvald @IanOzsvald MorConsulting.com
[email protected] @IanOzsvald PyDataLondon February 2014
What is “high performance”?
● Profiling to understand system behaviour
● We often ignore this step...
● Speeding up the bottleneck
● Keeps you on 1 machine (if possible)
● Keeping team speed high
[email protected] @IanOzsvald PyDataLondon February 2014
“High Performance Python”
• “Practical Performant Programming for Humans”
• Please join the mailing list via IanOzsvald.com
[email protected] @IanOzsvald PyDataLondon February 2014
cProfile
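A minimal sketch of profiling the talk's Julia-set kernel with the stdlib's cProfile. The function body is reconstructed from the line_profiler listing later in the talk; the grid size and constants here are illustrative stand-ins (the talk benchmarks a much larger 1000x1000 grid):

```python
import cProfile
import io
import pstats

def calculate_z_serial_purepython(maxiter, zs, cs):
    """Julia update rule, pure Python (from the talk's profiled listing)."""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
    return output

# Small illustrative inputs; the constant c is a common Julia-set choice.
zs = [complex(x / 50.0 - 1.5, y / 50.0 - 1.5)
      for x in range(100) for y in range(100)]
cs = [complex(-0.62772, -0.42193)] * len(zs)

profiler = cProfile.Profile()
profiler.enable()
result = calculate_z_serial_purepython(50, zs, cs)
profiler.disable()

# Render the function-level statistics, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

cProfile reports per-function totals only; the next slides show line-by-line alternatives.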
[email protected] @IanOzsvald PyDataLondon February 2014
line_profiler

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def calculate_z_serial_purepython(maxiter, zs, cs):
    12         1         6870   6870.0      0.0      output = [0] * len(zs)
    13   1000001       781959      0.8      0.8      for i in range(len(zs)):
    14   1000000       767224      0.8      0.8          n = 0
    15   1000000       843432      0.8      0.8          z = zs[i]
    16   1000000       786013      0.8      0.8          c = cs[i]
    17  34219980     36492596      1.1     36.2          while abs(z) < 2 and n < maxiter:
    18  33219980     32869046      1.0     32.6              z = z * z + c
    19  33219980     27371730      0.8     27.2              n += 1
    20   1000000       890837      0.9      0.9          output[i] = n
    21         1            4      4.0      0.0      return output
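A quick consistency check on the hit counts in the table above: the `while` test on line 17 runs once more per pixel than the loop body on lines 18-19, because each pixel's final test fails and exits the loop:

```python
pixels = 1_000_000       # 1000 x 1000 grid, one list entry per pixel
body_hits = 33_219_980   # hits reported for lines 18 and 19
# Each pixel tests the while condition one extra time (the failing test).
test_hits = body_hits + pixels
```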
[email protected] @IanOzsvald PyDataLondon February 2014
memory_profiler

Line #    Mem usage    Increment   Line Contents
================================================
     9   89.934 MiB    0.000 MiB   @profile
    10                             def calculate_z_serial_purepython(maxiter, zs, cs):
    12   97.566 MiB    7.633 MiB       output = [0] * len(zs)
    13  130.215 MiB   32.648 MiB       for i in range(len(zs)):
    14  130.215 MiB    0.000 MiB           n = 0
    15  130.215 MiB    0.000 MiB           z = zs[i]
    16  130.215 MiB    0.000 MiB           c = cs[i]
    17  130.215 MiB    0.000 MiB           while n < maxiter and abs(z) < 2:
    18  130.215 MiB    0.000 MiB               z = z * z + c
    19  130.215 MiB    0.000 MiB               n += 1
    20  130.215 MiB    0.000 MiB           output[i] = n
    21  122.582 MiB    7.633 MiB       return output
[email protected] @IanOzsvald PyDataLondon February 2014
memory_profiler mprof
Before & after an improvement: https://github.com/scikit-learn/scikit-learn/pull/2248
[email protected] @IanOzsvald PyDataLondon February 2014
Transforming memory_profiler into a resource profiler?
[email protected] @IanOzsvald PyDataLondon February 2014
Profiling possibilities
● CPU (line by line or by function)
● Memory (line by line)
● Disk read/write (with some hacking)
● Network read/write (with some hacking)
● mmaps
● File handles
● Network connections
● Cache utilisation via libperf?
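The "resource profiler" idea can be partly sketched with the standard library alone. Note these are my illustrative stand-ins, not tools from the talk: `resource.getrusage` reports process-level peak memory, and `tracemalloc` (stdlib since Python 3.4, so after the Python 2.7 code shown here) tracks Python-level allocations:

```python
import resource
import tracemalloc

# Peak resident set size so far (Unix only; Linux reports KiB, macOS bytes).
peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Python-level allocation tracking: measure the cost of building a list.
tracemalloc.start()
data = [complex(i, -i) for i in range(100_000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
```

Hooking disk, network, and file-handle counters into the same per-line report would need the "hacking" mentioned above.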
[email protected] @IanOzsvald PyDataLondon February 2014
Cython 0.20 (pyx annotations)

#cython: boundscheck=False
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

Pure CPython lists runtime: 12 s
Cython lists runtime: 0.19 s
Cython numpy runtime: 0.16 s
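The runtimes quoted on these slides come from timing the same kernel under different compilers. A minimal sketch of taking such a timing with the stdlib's `timeit`, using the pure-Python baseline; the grid here is deliberately small so it runs quickly, and the absolute numbers will differ from the slide's:

```python
import timeit

def calculate_z(maxiter, zs, cs):
    """Pure-Python baseline of the Julia kernel being compared."""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

side = 50  # the talk uses a 1000 x 1000 grid; reduced here for speed
coords = [x / (side / 3.0) - 1.5 for x in range(side)]
zs = [complex(re, im) for re in coords for im in coords]
cs = [complex(-0.62772, -0.42193)] * len(zs)

# Take the best of three runs, the usual way to reduce timing noise.
timings = timeit.repeat(lambda: calculate_z(50, zs, cs), number=1, repeat=3)
best = min(timings)
```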
[email protected] @IanOzsvald PyDataLondon February 2014
Cython + numpy + OMP nogil

#cython: boundscheck=False
from cython.parallel import parallel, prange
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    cdef unsigned int i, length, n
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)
    with nogil, parallel():
        for i in prange(length, schedule="guided"):
            z = zs[i]
            c = cs[i]
            n = 0
            while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                n = n + 1
            output[i] = n
    return output

Runtime: 0.05 s
[email protected] @IanOzsvald PyDataLondon February 2014
ShedSkin 0.9.4 annotations

def calculate_z(maxiter, zs, cs):  # maxiter: [int], zs: [list(complex)], cs: [list(complex)]
    output = [0] * len(zs)         # [list(int)]
    for i in range(len(zs)):       # [__iter(int)]
        n = 0                      # [int]
        z = zs[i]                  # [complex]
        c = cs[i]                  # [complex]
        while n < maxiter and (… < 4):  # [complex]
            z = z * z + c          # [complex]
            n += 1                 # [int]
        output[i] = n              # [int]
    return output                  # [list(int)]

Couldn't we generate Cython pyx? Runtime: 0.22 s
[email protected] @IanOzsvald PyDataLondon February 2014
Pythran (0.40)

#pythran export calculate_z_serial_purepython(int, complex list, complex list)
def calculate_z_serial_purepython(maxiter, zs, cs):
    …

Support for OpenMP on numpy arrays
Author Serge made an overnight fix – superb support!
List runtime: 0.4 s

#pythran export calculate_z(int, complex[], complex[], int[])
…
#omp parallel for schedule(dynamic)

OMP numpy runtime: 0.10 s
[email protected] @IanOzsvald PyDataLondon February 2014
PyPy nightly (and numpypy)
● “It just works” on Python 2.7 code
● Clever list strategies (e.g. unboxed, uniform)
● Little support for pre-existing C extensions (e.g. the existing numpy)
● multiprocessing, IPython etc. all work fine
● Python list code runtime: 0.3 s
● (pypy)numpy support is incomplete, bugs are being tackled (numpy runtime 5 s [CPython+numpy 56 s])
[email protected] @IanOzsvald PyDataLondon February 2014
Numba 0.12

from numba import jit

@jit(nopython=True)
def calculate_z_serial_purepython(maxiter, zs, cs, output):
    # couldn't create output here, had to pass it in
    # output = numpy.zeros(len(zs), dtype=np.int32)
    for i in xrange(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # while n < maxiter and abs(z) < 2:  # abs unrecognised
        while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
            z = z * z + c
            n += 1
        output[i] = n
    # return output

Runtime: 0.4 s
Some Python 3 support, some GPU
prange support missing (was in 0.11)?
0.12 introduces temporary limitations
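The workaround noted in the comments above — pre-allocating `output` and passing it in rather than creating or returning it inside the jitted function — can be sketched as follows. A plain Python list stands in for the numpy array so this runs without Numba installed; under Numba the `@jit` decorator and an `np.zeros` buffer would be used as on the slide:

```python
def calculate_z_serial_purepython(maxiter, zs, cs, output):
    """Fill a pre-allocated output buffer in place (no return needed)."""
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
            z = z * z + c
            n += 1
        output[i] = n

# Illustrative inputs: starting points along the real axis.
zs = [complex(x / 100.0 - 1.5, 0.0) for x in range(200)]
cs = [complex(-0.62772, -0.42193)] * len(zs)
output = [0] * len(zs)  # caller allocates; the function mutates it
calculate_z_serial_purepython(50, zs, cs, output)
```

Pushing allocation out to the caller is a common pattern when a JIT cannot yet compile the allocating call itself.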
[email protected] @IanOzsvald PyDataLondon February 2014
Tool Tradeoffs

● PyPy: no learning curve (pure Py only) – easy win?
● ShedSkin: easy (pure Py only) but fairly rare
● Cython pure Py: hours to learn – team cost low (and lots of online help)
● Cython numpy OMP: days+ to learn – heavy team cost?
● Numba/Pythran: hours to learn; install a bit tricky (Anaconda easiest for Numba)
● Pythran OMP: very impressive result for little effort
● Numba: big toolchain which might hurt productivity?
● (numexpr not covered – great for numpy and easy to use)
[email protected] @IanOzsvald PyDataLondon February 2014
Wrap up
● Our profiling options should be richer
● 4–12 physical CPU cores are commonplace
● The cost of hand-annotating code is reduced agility
● JITs/AST compilers are getting fairly good, but manual intervention still gives the best results

BUT! CONSIDER:
● Automation should (probably) be embraced ($CPUs < $humans), as team velocity is probably higher
[email protected] @IanOzsvald PyDataLondon February 2014
Thank You
• [email protected]• @IanOzsvald
• MorConsulting.com
• Annotate.io
• GitHub/IanOzsvald