McGill HPC · December 1st, 2016

Advanced and Parallel Python

December 1st, 2016

1

http://tinyurl.com/cq-advanced-python-20161201
By: Bart Oldeman and Pier-Luc St-Onge

Financial Partners

2

Setup for the workshop

1. Get a user ID and password paper (provided in class):
   ##: userNMXXXXXXXXXX **********

2. Access to the local computer (replace ## and ___ with appropriate values; "___" is provided in class):
   a. User name: csuser##
   b. Password: ___@[S##

3. HTTPS connection to Colosse (replace **********):
   a. https://jupyter.calculquebec.ca
   b. User name: userNM
   c. Password: **********
   d. If requested:
      i. Click the Start Server button, set walltime 8

3

Select Modules - Change Notebook Kernel

● In the Software tab, select:
  ○ compilers/llvm/3.7.1
  ○ compilers/gcc/4.8.5

● Open notebooks/01-stack.ipynb
  ○ File -> Save and Checkpoint

4

Import Examples and Exercises

In case the cq-formation-advanced-python folder is not in your home directory, open a Terminal and type:

module load apps/git/1.8.5.3 # If on Colosse

git clone -b ulaval \

https://github.com/calculquebec/cq-formation-advanced-python.git

cd cq-formation-advanced-python

5

Outline

● Revisiting the Scientific Python Stack
● Why (and What) is Python?
  ○ Accelerating Python code: PyPy and Numpy
  ○ Using C code from Python code
● Finding Bottlenecks - Profiling code
● Compiling Python Code
  ○ Using Cython and Numba
● Parallelizing Python Programs
  ○ Parallel Programming Concepts
  ○ The multiprocessing Module
  ○ MPI for Python (mpi4py)

6

7

The Scientific Python stack

Scientific Python stack

In the introductory workshop we looked at:
● Python itself
● Numpy, for numerical array objects
● Scipy, for higher level routines
● IPython, an advanced Python shell
● Matplotlib, for plotting

On top of that we introduce some new components, for example:
● Cython, for speed and interfacing
● mpi4py, for using MPI in Python

8

9

Speeding up Python programs

Speeding up Python

Central example: approx_pi.c / approx_pi.py:

// approx_pi.c
double approx_pi(int intervals)
{
    double pi = 0.0;
    int i;
    for (i = 0; i < intervals; i++) {
        pi += (4 - ((i % 2) * 8)) /
              (double)(2 * i + 1);
    }
    return pi;
}

10

# approx_pi.py
def approx_pi(intervals):
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / \
              float(2 * i + 1)
    return pi

Speeding up Python

Compile:
$ gcc -O2 pi_collect.c approx_pi.c -o pi_collect
$ ./pi_collect 100000000
.. Time = 0.88 sec

Python run (example on Guillimin):
$ module load iomkl/2015b Python/3.5.0
$ python pi_collect.py approx_pi 100000000

The compiled C code runs almost 100 times faster than the Python code (0.88 vs. 66 seconds with intervals = 100000000). Note that "approx_pi" is the module to import for pi_collect.py.
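pi_collect.py itself is not shown on the slides; as a rough, hypothetical sketch of what it does (the actual script lives in the course repository and may differ), it imports the module named on the command line, times its approx_pi() call, and prints the result:

```python
# Hypothetical sketch of pi_collect.py; the real script is in the
# cq-formation-advanced-python repository.
import importlib
import sys
import time

def collect(module_name, intervals):
    # import the module that provides approx_pi(), e.g. "approx_pi"
    mod = importlib.import_module(module_name)
    start = time.time()
    pi = mod.approx_pi(intervals)
    print('pi = %.10f .. Time = %.2f sec' % (pi, time.time() - start))
    return pi

if __name__ == '__main__' and len(sys.argv) > 2:
    collect(sys.argv[1], int(sys.argv[2]))
```

This is why the same driver can benchmark approx_pi, approx_pi_numpy, approx_pi_f2py, and the Cython variants interchangeably: each just has to expose an approx_pi() function.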

11

Speeding up Python

How to speed up: two approaches

1. Make Python go faster
   a. Use the PyPy just-in-time compiler
   b. Use Numpy with vectorized code
   c. Use Cython

2. Call C code from Python
   a. Manually
   b. Use SWIG
   c. Use Ctypes
   d. Use Cython
   e. ...

12

Speeding up Python using PyPy

How to speed up: use PyPy:
$ module add pypy/3-2.4.0
$ pypy3 pi_collect.py approx_pi 100000000

gives 2.2 seconds (30 times faster).

An alternative to PyPy is Numba (not installed on Guillimin).

13

Speeding up with numpy

How to speed up: use vectorized code:

from __future__ import division  # only needed for Python 2.x
import numpy

def approx_pi(intervals):
    pi1 = 4/numpy.arange(1, intervals*2, 4)
    pi2 = -4/numpy.arange(3, intervals*2, 4)
    return numpy.sum(pi1) + numpy.sum(pi2)

$ python3 pi_collect.py approx_pi_numpy 100000000

gives 1.4 seconds (47 times faster). Drawback: extra memory use.

How to speed up: Cython: see later.
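As a sanity check (not on the slides), the vectorized version can be compared against the pure Python loop on a small input: arange(1, 2N, 4) generates the positive denominators 1, 5, 9, ... and arange(3, 2N, 4) the negative ones 3, 7, 11, ...:

```python
import numpy

def approx_pi_loop(intervals):
    # pure Python version, as before
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / float(2 * i + 1)
    return pi

def approx_pi_numpy(intervals):
    # vectorized version: positive terms 4/1, 4/5, ... and
    # negative terms -4/3, -4/7, ..., summed as whole arrays
    pi1 = 4 / numpy.arange(1, intervals * 2, 4)
    pi2 = -4 / numpy.arange(3, intervals * 2, 4)
    return numpy.sum(pi1) + numpy.sum(pi2)

print(approx_pi_loop(1000))   # ~3.1406 (1000 terms of the Leibniz series)
print(approx_pi_numpy(1000))  # same value, no Python-level loop
```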

14

15

Interfacing with C/C++/Fortran

Interfacing with C and C++

● There are at least 14 different ways to do it:
   1. By hand using the Python API (*)
   2. Pyrex
   3. Cython (**)
   4. SWIG (*)
   5. SIP
   6. Boost.Python
   7. PyCXX
   8. CTypes (*)
   9. Py++
  10. f2py (*)
  11. PyD
  12. Interrogate
  13. Robin
  14. Pybind11

  (*) Quick introduction
  (**) Most popular now, more thorough introduction

16

Using the Python API

● Pros: no extra dependencies
● Cons: a lot of boilerplate code, which can change between Python versions

/* Example of wrapping approx_pi() with the Python-C-API. */
#include <Python.h>
#include "approx_pi.h"

/* wrapped approx_pi() */
static PyObject* approx_pi_func(PyObject* self, PyObject* args)
{
    int value;
    double answer;
    if (!PyArg_ParseTuple(args, "i", &value))  /* parse input: Python int to C int */
        return NULL;
    /* if the above function returns 0, an appropriate Python exception will
     * have been set, and the function simply returns NULL */
    answer = approx_pi(value);
    /* construct the output from approx_pi: C double to Python float */
    return Py_BuildValue("f", answer);
}

17

Using the Python API

/* define functions in module */
static PyMethodDef PiMethods[] =
{
    {"approx_pi", approx_pi_func, METH_VARARGS, "approximate Pi"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef PiModule = {
    PyModuleDef_HEAD_INIT, "approx_pi_pyapi", NULL, -1, PiMethods,
    NULL, NULL, NULL, NULL
};

/* module initialization */
PyMODINIT_FUNC PyInit_approx_pi_pyapi(void)
{
    return PyModule_Create(&PiModule);
}

Compile using $ python3 setup_approx_pi_pyapi.py build_ext --inplace with this setup script:

from distutils.core import setup, Extension

# define the extension module
module = Extension('approx_pi_pyapi', sources=['approx_pi_pyapi.c', 'approx_pi.c'])

setup(ext_modules=[module])  # run the setup

18

Using CTypes

● Pros: the ctypes package is in Python by default; pure Python solution
● Cons: wrapped code must be in a shared library; the interface is not fast

First compile approx_pi_ctypes.so:
$ gcc -fPIC -shared -O2 approx_pi.c -o approx_pi_ctypes.so

# approx_pi_ctypes.py

""" Example of wrapping approx_pi using ctypes. """

import ctypes

approx_pi_dll = ctypes.cdll.LoadLibrary('./approx_pi_ctypes.so') # find and load the library

approx_pi_dll.approx_pi.argtypes = [ctypes.c_int] # set the argument type

approx_pi_dll.approx_pi.restype = ctypes.c_double # set the return type

def approx_pi(arg):

''' Wrapper for approx_pi '''

return approx_pi_dll.approx_pi(arg)
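The same argtypes/restype pattern works for any shared library. As an illustration (not from the slides), here it is applied to cos() from the standard C math library, located portably with ctypes.util.find_library:

```python
import ctypes
import ctypes.util

# locate and load the C math library (e.g. libm.so.6 on Linux)
libm = ctypes.cdll.LoadLibrary(ctypes.util.find_library('m'))
libm.cos.argtypes = [ctypes.c_double]  # argument type: C double
libm.cos.restype = ctypes.c_double     # return type: C double

print(libm.cos(0.0))  # 1.0
```

Without the restype declaration, ctypes assumes the function returns a C int, which would silently corrupt the double result.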

19

Using SWIG

● Mature solution
● Wrapper file is autogenerated from an interface file:

/* approx_pi_swig.i */

/* Example of wrapping approx_pi using SWIG. */

%module approx_pi_swig

%{

/* the resulting C file should be built as a python extension */

#define SWIG_FILE_WITH_INIT

/* Includes the header in the wrapper code */

#include "approx_pi.h"

%}

/* Parse the header file to generate wrappers */

%include "approx_pi.h"

20

Using SWIG

● Use distutils as before (python3 setup_approx_pi_swig.py build_ext --inplace) but mention the interface file in the setup script:

from distutils.core import setup, Extension

approx_pi_module = Extension("_approx_pi_swig",
                             sources=["approx_pi.c", "approx_pi_swig.i"])
setup(ext_modules=[approx_pi_module])

● This generates three files: approx_pi_swig.py, approx_pi_swig_wrap.c, and _approx_pi_swig*.so

21

Using f2py

● Fortran version: approx_pi.f90

subroutine approx_pi(intervals, pi)
  integer, intent(in) :: intervals
  double precision, intent(out) :: pi
  integer i
  pi = 0
  do i = 0, intervals - 1
    pi = pi + (4 - (mod(i,2) * 8)) / dble(2 * i + 1)
  enddo
end subroutine approx_pi

● Compile using
  f2py3 -c -m approx_pi_f2py approx_pi.f90
● Then do
  python3 pi_collect.py approx_pi_f2py 100000000

22

23

Cython

Cython

● Cython compiles from Python (with extensions) to C.
● Based on Pyrex.
● Goals: faster execution (especially with those extensions) and easier interoperability with other C code.
● Cython files use the .pyx extension.

24

Cython

● Example: approx_pi_cython1.pyx (same as approx_pi.py)

def approx_pi(intervals):
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / float(2 * i + 1)
    return pi

● Executing python3 setup_cython.py build_ext --inplace with

from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules = cythonize("*.pyx"))

turns all .pyx files into .c files and .so modules.
● Run python3 pi_collect.py approx_pi_cython1 100000000
  ○ 25 seconds: the C code uses only Python objects.

25

Cython: declare variables

● Need to declare variables using cdef to make it fast
● Example: approx_pi_cython2.pyx

def approx_pi(int intervals):
    cdef double pi
    cdef int i
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / float(2 * i + 1)
    return pi

● Execute python3 setup_cython.py build_ext --inplace
● Run python3 pi_collect.py approx_pi_cython2 100000000
  ○ 0.89 seconds: almost as fast as native C.

26

Cython: division

● Inspecting approx_pi_cython2.c we found it uses __Pyx_mod_long(__pyx_v_i, 2) instead of a plain __pyx_v_i % 2. This is because in C, -1%10=-1, but in Python, -1%10=9.
● Here we can ignore this and tell Cython to use C behaviour, by adding the line
  #cython:cdivision=True
● Execute python3 setup_cython.py build_ext --inplace
  ○ Check that approx_pi_cython3.c uses %.
● Run python3 pi_collect.py approx_pi_cython3 100000000
  ○ 0.88 seconds: the same as native C.
● Note: use Cython in IPython/Jupyter using "%load_ext cythonmagic" and "%%cython" in a cell.

27
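The C-versus-Python modulo difference is easy to see from Python itself (an aside, not on the slides): math.fmod() follows the C convention, while % follows Python's:

```python
import math

print(-1 % 10)            # 9: Python's % takes the sign of the divisor
print(math.fmod(-1, 10))  # -1.0: C-style remainder takes the sign of the dividend
```

For our pi loop the operand i is never negative, which is why cdivision=True is safe here.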

Cython: wrapping C code

● Last but not least: interfacing with C code:

# approx_pi_cython4.pyx
cdef extern from "approx_pi.h":
    double c_approx_pi "approx_pi" (int intervals)
    # C name: approx_pi, Cython name: c_approx_pi

def approx_pi(int intervals):
    return c_approx_pi(intervals)

● Plus a special setup_cython4.py script:

from distutils.core import setup, Extension
from Cython.Distutils import build_ext

setup(cmdclass={'build_ext': build_ext},
      ext_modules=[Extension("approx_pi_cython4",
                             sources=["approx_pi_cython4.pyx", "approx_pi.c"])])

● Execute python3 setup_cython4.py build_ext --inplace
● Run python3 pi_collect.py approx_pi_cython4 100000000

28

Parallel Programming Concepts

29

Vocabulary

● Serial tasks
  ○ Any task that cannot be split in two simultaneous sequences of actions
  ○ Examples: starting a process, reading a file, any communication between two processes
● Parallel tasks
  ○ Data parallelism: the same action applied on different data. Could be serial tasks done in parallel.
  ○ Process parallelism: one action on one set of data. The action is split in multiple processes or threads.
    ■ Data partitioning: rectangles or blocks

30

Parallel tasks

● Parallel efficiency (scaling)
  ○ Amdahl's law: how long does it take to compute a task with an infinite number of processors?
  ○ Gustafson's law: what size of problem can we solve in a given time with N processors?
● Shared memory
  ○ Multiple threads share the same memory space in a single process: full read and write access.
● Distributed memory
  ○ Each process has its own memory space
  ○ Information is sent and received by messages
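As a small illustration of Amdahl's law (the formula itself is not on the slides): with serial fraction s, the best possible speedup on N processors is 1 / (s + (1 - s)/N), which approaches 1/s as N grows:

```python
def amdahl_speedup(serial_fraction, n_procs):
    # speedup predicted by Amdahl's law for a given serial fraction
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

# With a 5% serial fraction, the speedup is capped at 1/0.05 = 20x,
# no matter how many processors are used:
print(amdahl_speedup(0.05, 16))     # ~9.1
print(amdahl_speedup(0.05, 10**9))  # ~20
```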

31

Distributed Memory Model

32

[Figure: Process 1 and Process 2 each hold their own copy of A(10), connected by a network: these are different variables!]

Serial Code Parallelization

● Implicit Parallelization - minimum work for you
  ○ Threaded libraries (MKL, ACML, GOTO, etc.)
  ○ Compiler directives (OpenMP)
  ○ Good for desktops and shared memory machines
● Explicit Parallelization - work is required!
  ○ You tell what should be done on which CPU
  ○ Solution for distributed clusters (shared nothing!)
● Hybrid Parallelization - work is required!
  ○ Mix of implicit and explicit parallelization
    ■ Vectorization and parallel CPU instructions
  ○ Good for accelerators (CUDA, OpenCL, etc.)

33

The multiprocessing Module

34

The multiprocessing Module

● Because of the implementation of CPython (the Global Interpreter Lock), only one thread at a time can execute Python code
  ○ This avoids common issues with the shared memory model: race conditions, ...
  ○ There is a threading module, but it is not recommended for speeding up CPU-bound Python code
● Solution: the multiprocessing module!

35

Pool of Workers

For embarrassingly parallel tasks, the Pool class allows the creation of worker processes. Each process will compute different data.

Warning: only works in a script!

36

from multiprocessing import Pool

def prod(values):
    return values[0] * values[1]

if __name__ == '__main__':
    N = 12
    values = [(i + 1, N - i)
              for i in range(0, N)]
    print(values)

    workers = Pool(processes=4)
    results = workers.map(prod, values)
    print(results)

Pool of Workers

● Run: python script.py
● What happens with 4 workers:

37

Pool of Workers

Asynchronous map calls can be used in order to do something else in the main process. The map_async() method returns an AsyncResult object, which can be used to wait until all workers are done.

38

from multiprocessing import Pool
import time

def prod(values):
    time.sleep(1)
    return values[0] * values[1]

if __name__ == '__main__':
    N = 12
    values = [(i + 1, N - i)
              for i in range(0, N)]
    print(values)

    workers = Pool(processes=4)
    results = workers.map_async(prod, values)
    print('Waiting...')
    print(results.get(timeout=10))

Pool of Workers

Asynchronous map calls can use a callback function. The main process then has to wait by first closing access to the workers, and then joining the pool of workers.

39

from multiprocessing import Pool

def prod(values):
    return values[0] * values[1]

def printRes(results):
    print(results)

if __name__ == '__main__':
    N = 12
    values = [(i + 1, N - i)
              for i in range(0, N)]
    print(values)

    workers = Pool(processes=4)
    results = workers.map_async(prod,
                                values, callback=printRes)
    print('Waiting...')
    workers.close()
    workers.join()

Pool of Workers

● class Pool([processes[, ...]])
  ○ processes: number of worker processes. If None, processes=multiprocessing.cpu_count()
  ○ Methods:
    ■ map(func, iterable[, ...]): returns results
    ■ map_async(func, iterable[, ...]): returns an AsyncResult object
    ■ close(): closes access to worker processes
    ■ join(): waits for all workers to exit. Must call close() before.

40

Pool of Workers

● class AsyncResult
  ○ Methods:
    ■ get([timeout]): blocking, gets the results as soon as they are available. In case of error, get() raises the exception.
    ■ wait([timeout]): blocking, waits until the call is done
    ■ ready(): non-blocking, returns a boolean indicating if the call has completed.
    ■ successful(): non-blocking, returns a boolean indicating if the call has succeeded.
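A short sketch (hypothetical, building on the earlier Pool examples) tying these methods together: poll ready() while doing other work, then check successful() before reading the results:

```python
from multiprocessing import Pool
import time

def square(x):
    time.sleep(0.1)  # pretend this is real work
    return x * x

if __name__ == '__main__':
    with Pool(processes=2) as workers:
        res = workers.map_async(square, range(8))
        while not res.ready():   # non-blocking: the main process is free
            time.sleep(0.05)     # ... to do something else here
        print(res.successful())  # True: no worker raised an exception
        print(res.get())         # the 8 results, in order
```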

41

Exercise - Baby Genomic

● Edit baby-genomic.py○ Use a pool of 4 workers○ Use the asynchronous map function

○ Provide a callback function that will print results at the end

○ Tip: use the edProxy() function in order to call the real editDistance() function.

● Run:time -p python baby-genomic.py

42

The Process class

● https://docs.python.org/2/library/multiprocessing.html
  ○ The Process class: manually spawn and control each process
    Process(target=fct, args=(arg1, arg2)).start()
  ○ Communication channels:
    ■ The Pipe class: to communicate between two processes; one sends data, one receives data
    ■ The Queue class: a shared pipe managed with locks and semaphores; one puts data, one gets data
  ○ Synchronization:
    ■ The Lock class: one acquires the lock, one releases the lock
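A minimal sketch (not from the slides) combining Process and Queue: the child puts its result in the queue, and the parent gets it:

```python
from multiprocessing import Process, Queue

def worker(q, n):
    q.put(n * n)  # send the result back to the parent through the queue

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q, 6))
    p.start()
    print(q.get())  # 36, blocking until the child has put its result
    p.join()
```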

43

44

MPI for Python (mpi4py)

MPI for Python

● The mpi4py package provides bindings from Python to MPI (Message Passing Interface).
● MPI functions are then available in Python, but with some simplifications:
  ○ MPI_Init() and MPI_Finalize() are done automatically
  ○ The bindings can auto-detect many values that need to be specified as explicit parameters in the C and Fortran bindings.
  ○ Example:
    dest = 1; tag = 54321;
    MPI_Send( &matrix, count, MPI_INT, dest, tag, MPI_COMM_WORLD )
    becomes
    MPI.COMM_WORLD.Send(matrix, dest=1, tag=54321)

45

MPI for Python

● Import as from mpi4py import MPI
● Then often use comm = MPI.COMM_WORLD
● Two variations for most functions:
  a. all lowercase, e.g. comm.recv()
    ■ works on general Python objects, using pickle (can be slow)
    ■ the received object (value) is returned:
      ● matrix = comm.recv(source=0, tag=MPI.ANY_TAG)
  b. capitalized, e.g. comm.Recv()
    ■ works fast on numpy arrays & other buffers
    ■ the received object is given as a parameter:
      ● comm.Recv(matrix, source=0, tag=MPI.ANY_TAG)
    ■ Specify [matrix, MPI.INT], or [data, count, MPI.INT], if autodetection fails.

46

Conclusions

● Main techniques covered:
  ○ Speeding up: PyPy, Numba, CTypes, Cython
  ○ Parallel programming: multiprocessing, mpi4py
● Useful links:
  ○ http://www.scipy-lectures.org/advanced/interfacing_with_c/interfacing_with_c.html
  ○ https://github.com/kwmsmith/scipy-2015-cython-tutorial
  ○ https://docs.python.org/3/library/multiprocessing.html
  ○ http://materials.jeremybejarano.com/MPIwithPython

47

Questions?

● Calcul Quebec support team:
  ○ support@calculquebec.ca
● Specific site support teams:
  ○ briaree@calculquebec.ca
  ○ colosse@calculquebec.ca
  ○ guillimin@calculquebec.ca
  ○ mammouth@calculquebec.ca

48
