python performance 101


Upload: ankur-gupta

Post on 06-May-2015


Page 1: Python Performance 101

About me :)

● Computer Programmer,
● Coding in Python for last 3 years,
● Part of the Team at HP that developed an early warning software that parses over 40+TB of data annually to find problems before they happen, (Coded in Python)
● Skills in Django, PyQt,
● Http://uptosomething.in [ Homepage ]

Page 2: Python Performance 101

Python Performance 101

Performance : Why it matters ?

Performance : Measurement

Performance : Low hanging fruits

Python Performance : Interpreter

Python Performance : Outsourcing to C/C++

Page 3: Python Performance 101

Performance : Measurement

Reading cProfile output

ncalls : the number of calls.

tottime : the total time spent in the given function (excluding time spent in calls to sub-functions).

percall : the quotient of tottime divided by ncalls.

cumtime : the total time spent in this function and all sub-functions (from invocation till exit). This figure is accurate even for recursive functions.

percall : the quotient of cumtime divided by primitive calls.

filename:lineno(function) : provides the respective data of each function.
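To see these columns on a real report, here is a minimal sketch for Python 3. The function names `fib` and `profile_fib` are made up for illustration; only `cProfile` and `pstats` come from the standard library.

```python
import cProfile
import io
import pstats


def fib(n):
    # Deliberately recursive, so ncalls is reported as "total/primitive".
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def profile_fib(n=15):
    # Profile a single call and capture the report as text.
    profiler = cProfile.Profile()
    profiler.enable()
    result = fib(n)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumtime and print only the five most expensive entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


result, report = profile_fib()
print(report)
```

The printed table has the ncalls, tottime, percall, cumtime, percall and filename:lineno(function) columns described above.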

Page 4: Python Performance 101

Performance : Measurement

$ sudo apt-get install python graphviz
$ wget http://gprof2dot.jrfonseca.googlecode.com/git/gprof2dot.py
$ python -m cProfile -o out.pstats ex21.py
$ python gprof2dot.py -f pstats out.pstats | dot -Tpng -o output.png
$ gimp output.png

Page 5: Python Performance 101

Performance : Measurement

RunSnakeRun : http://www.vrplumber.com/programming/runsnakerun/

Page 6: Python Performance 101

Python Performance : Low Hanging Fruits

● String concatenation Benchmark ( http://sprocket.io/blog/2007/10/string-concatenation-performance-in-python/ )

add: a + b + c + d

add equals: a += b; a += c; a += d

format strings: '%s%s%s%s' % (a, b, c, d)

named format strings: '%(a)s%(b)s%(c)s%(d)s' % {'a': a, 'b': b, 'c': c, 'd': d}

join: ''.join([a,b,c,d])

#!/usr/bin/python
# benchmark various string concatenation methods. Run each 5*1,000,000 times
# and pick the best time out of the 5. Repeats for string lengths of
# 4, 16, 64, 256, 1024, and 4096. Outputs in CSV format via stdout.
import timeit

tests = {
    'add': "x = a + b + c + d",
    'join': "x = ''.join([a,b,c,d])",
    'addequals': "x = a; x += b; x += c; x += d",
    'format': "x = '%s%s%s%s' % (a, b, c, d)",
    'full_format': "x = '%(a)s%(b)s%(c)s%(d)s' % {'a': a, 'b': b, 'c': c, 'd': d}"
}

count = 1
for i in range(6):
    count = count * 4
    init = "a = '%s'; b = '%s'; c = '%s'; d = '%s'" % \
        ('a' * count, 'b' * count, 'c' * count, 'd' * count)

    for test in tests:
        t = timeit.Timer(tests[test], init)
        best = min(t.repeat(5, 1000000))
        print "'%s',%s,%s" % (test, count, best)

Page 7: Python Performance 101

Python Performance : Low Hanging Fruits

* Simple addition is the fastest string concatenation for small strings, followed by add equals.

* ''.join() is the fastest string concatenation for large strings.

* Named format is always the worst performer.

* Using string formatting for joins is as good as add equals for large strings, but for small strings it's mediocre.

Page 8: Python Performance 101

Python Performance : Low Hanging Fruits

newlist = []
for word in oldlist:
    newlist.append(word.upper())

newlist = map(str.upper, oldlist)

newlist = [s.upper() for s in oldlist]

upper = str.upper
newlist = []
append = newlist.append
for word in oldlist:
    append(upper(word))

I wouldn't do this
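The loop variants on this slide can be timed side by side. The following is a rough sketch for Python 3; the sample data and the function names are made up for illustration, and the timings it prints will differ between machines and interpreter versions.

```python
import timeit

oldlist = ["alpha", "beta", "gamma", "delta"] * 250  # made-up sample data


def with_append(words):
    newlist = []
    for word in words:
        newlist.append(word.upper())
    return newlist


def with_map(words):
    # list() is needed on Python 3, where map() returns a lazy iterator.
    return list(map(str.upper, words))


def with_comprehension(words):
    return [s.upper() for s in words]


def with_localized_names(words):
    # Bind the method lookups to local names once, outside the loop.
    upper = str.upper
    newlist = []
    append = newlist.append
    for word in words:
        append(upper(word))
    return newlist


variants = [with_append, with_map, with_comprehension, with_localized_names]
for func in variants:
    # Best of three runs, 200 calls each.
    seconds = min(timeit.repeat(lambda: func(oldlist), number=200, repeat=3))
    print("%-22s %.4f s" % (func.__name__, seconds))
```

All four functions return the same list, so the choice between them is about speed and readability only.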

wdict = {}
for word in words:
    if word not in wdict:
        wdict[word] = 0
    wdict[word] += 1

wdict = {}
for word in words:
    try:
        wdict[word] += 1
    except KeyError:
        wdict[word] = 1

Exception for branching
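As a small sketch, the two counting styles can be wrapped in functions and checked against each other. The function names and the sample sentence are made up for illustration. `collections.Counter` is not on the slide; it is included only as a standard-library alternative that does the same job.

```python
from collections import Counter

words = "the quick brown fox jumps over the lazy dog the end".split()


def count_with_membership_test(words):
    # Pays for an "in" test on every word, seen before or not.
    wdict = {}
    for word in words:
        if word not in wdict:
            wdict[word] = 0
        wdict[word] += 1
    return wdict


def count_with_exception(words):
    # Pays for an exception only the first time a word is seen.
    wdict = {}
    for word in words:
        try:
            wdict[word] += 1
        except KeyError:
            wdict[word] = 1
    return wdict


counts = count_with_exception(words)
print(counts["the"])
print(counts == count_with_membership_test(words) == dict(Counter(words)))
```

Which style wins depends on the data: the exception version tends to pay off when most words repeat, because the except branch is then rarely taken.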

Page 9: Python Performance 101

Python Performance : Low Hanging Fruits

Function call overhead

import time
x = 0
def doit1(i):
    global x
    x = x + i
list = range(100000)
t = time.time()
for i in list:
    doit1(i)
print "%.3f" % (time.time()-t)

import time
x = 0
def doit2(list):
    global x
    for i in list:
        x = x + i
list = range(100000)
t = time.time()
doit2(list)
print "%.3f" % (time.time()-t)

>>> t = time.time()
>>> for i in list:
...     doit1(i)
...
>>> print "%.3f" % (time.time()-t)
0.758
>>> t = time.time()
>>> doit2(list)
>>> print "%.3f" % (time.time()-t)
0.204
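The same comparison can be sketched with timeit instead of time.time(), which repeats the measurement and keeps the best run. The function names below are made up for illustration, and the globals from the slide are replaced by return values so the two versions can be checked against each other.

```python
import timeit

numbers = list(range(100000))


def add(total, i):
    return total + i


def sum_with_calls(values):
    # One Python-level function call per element.
    total = 0
    for i in values:
        total = add(total, i)
    return total


def sum_in_one_call(values):
    # The loop body is inlined, so there is no per-element call.
    total = 0
    for i in values:
        total = total + i
    return total


for func in (sum_with_calls, sum_in_one_call):
    seconds = min(timeit.repeat(lambda: func(numbers), number=10, repeat=3))
    print("%-16s %.3f s" % (func.__name__, seconds))
```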

Page 10: Python Performance 101

Python Performance : Low Hanging Fruits

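xrange vs range: in Python 2, range() builds the whole list in memory, while xrange() produces its values lazily, one at a time.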

Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n). When testing "a in b", b should be a set or dictionary instead of a list or tuple.
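A quick way to see the difference is to time the same lookup against a list and a set. This is only a sketch with made-up data; absolute numbers depend on the machine.

```python
import timeit

items = list(range(10000))
as_list = items
as_set = set(items)
missing = -1  # worst case for the list: every element gets compared

list_seconds = timeit.timeit(lambda: missing in as_list, number=1000)
set_seconds = timeit.timeit(lambda: missing in as_set, number=1000)

print("list: %.5f s" % list_seconds)
print("set:  %.5f s" % set_seconds)
```

Building the set costs O(n) once, so the conversion pays off when the same container is tested many times.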

Lists perform well as either fixed length arrays or variable length stacks. However, for queue applications using pop(0) or insert(0,v), collections.deque() offers superior O(1) performance because it avoids the O(n) step of rebuilding a full list for each insertion or deletion.
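A minimal sketch of the queue pattern with collections.deque; the job names are made up for illustration.

```python
from collections import deque

queue = deque()
for job in ["a", "b", "c"]:
    queue.append(job)        # enqueue on the right

first = queue.popleft()      # replaces list.pop(0)
queue.appendleft("urgent")   # replaces list.insert(0, v)

print(first)
print(list(queue))
```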

In functions, local variables are accessed more quickly than global variables, builtins, and attribute lookups. So, it is sometimes worth localizing variable access in inner-loops.
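A sketch of what localizing a lookup can look like in an inner loop. The function names are made up for illustration, and any speedup should be measured rather than assumed, since it varies between interpreter versions.

```python
import math
import timeit

values = list(range(1, 10001))


def roots_global_lookup(values):
    result = []
    for v in values:
        # Looks up the global "math", then its attribute "sqrt", every time.
        result.append(math.sqrt(v))
    return result


def roots_local_lookup(values):
    # Resolve both lookups once and keep them in local variables.
    sqrt = math.sqrt
    result = []
    append = result.append
    for v in values:
        append(sqrt(v))
    return result


for func in (roots_global_lookup, roots_local_lookup):
    seconds = min(timeit.repeat(lambda: func(values), number=100, repeat=3))
    print("%-20s %.4f s" % (func.__name__, seconds))
```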

http://wiki.python.org/moin/PythonSpeed

http://wiki.python.org/moin/PythonSpeed/PerformanceTips

Page 11: Python Performance 101

Python : Multi-core Architecture

● In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe. (However, since the GIL exists, other features have grown to depend on the guarantees that it enforces.) More here: http://wiki.python.org/moin/GlobalInterpreterLock

● Use multiprocessing to work around the GIL

from multiprocessing import Process, Queue

def f(iq, oq):
    # get() blocks until a chunk arrives; empty() can misreport right after a put()
    values = iq.get()
    oq.put(sum(values))

if __name__ == '__main__':
    inputQueue = Queue()
    outputQueue = Queue()
    values = range(0, 1000000)
    processOne = Process(target=f, args=(inputQueue, outputQueue))
    processTwo = Process(target=f, args=(inputQueue, outputQueue))
    # split the work into two halves, one per process
    inputQueue.put(values[0:len(values)//2])
    inputQueue.put(values[len(values)//2:])
    processOne.start()
    processTwo.start()
    processOne.join()
    processTwo.join()
    outputOne = outputQueue.get()
    outputTwo = outputQueue.get()
    print sum([outputOne, outputTwo])
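The same split-and-sum idea can also be sketched with multiprocessing.Pool, which manages the queues internally. This version is for Python 3; `split_in_chunks` and `parallel_sum` are hypothetical helper names, not part of the slides.

```python
from multiprocessing import Pool


def split_in_chunks(values, parts):
    # Ceiling division, so every element lands in some chunk.
    size = max(1, (len(values) + parts - 1) // parts)
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, workers=2):
    chunks = split_in_chunks(values, workers)
    with Pool(processes=workers) as pool:
        # Each worker process sums one chunk.
        partial_sums = pool.map(sum, chunks)
    return sum(partial_sums)


if __name__ == '__main__':
    print(parallel_sum(list(range(1000000))))
```

For a job as cheap as summing integers, the cost of shipping the chunks to the workers can outweigh the gain; the pattern pays off when the per-chunk work is heavy.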

Page 12: Python Performance 101

Python : Multi-core Architecture

● IPC is encapsulated: Queue, Pipe, Lock.

● Use the logging module to log across processes, e.g. with SocketHandler.

● Good practice is to spawn at most 2 * (number of cores) processes.

● Debugging is a little painful: cProfile has to be attached to each process, each process dumps its own stats output, and the dumps can then be joined. Still a little painful.
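One possible way to do the per-process profiling described above is sketched here for Python 3. Each worker runs under its own cProfile.Profile and dumps a stats file, and pstats.Stats.add() joins the files afterwards. The names `work`, `profiled_worker` and `merge_stats`, and the stats file names, are made up for illustration.

```python
import cProfile
import io
import os
import pstats
import tempfile
from multiprocessing import Process


def work(n):
    return sum(i * i for i in range(n))


def profiled_worker(n, stats_path):
    # Profile only this process and write its stats to its own file.
    profiler = cProfile.Profile()
    profiler.runcall(work, n)
    profiler.dump_stats(stats_path)


def merge_stats(paths):
    # Load the first dump, fold the others into it, return the report text.
    buffer = io.StringIO()
    merged = pstats.Stats(paths[0], stream=buffer)
    merged.add(*paths[1:])
    merged.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


if __name__ == '__main__':
    folder = tempfile.mkdtemp()
    paths = [os.path.join(folder, "worker%d.pstats" % i) for i in range(2)]
    workers = [Process(target=profiled_worker, args=(50000, p)) for p in paths]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(merge_stats(paths))
```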

Page 13: Python Performance 101

Python : Interpreter

CPython - the default install everyone uses

Jython - Python on the JVM, currently targets Python 2.5, true concurrency, strong JVM integration. About even with CPython speed-wise, maybe a bit slower.

IronPython - Python on the CLR, currently targets 2.6, with a 2.7 pre-release available, true concurrency, good CLR integration. Speed comparison with CPython varies greatly depending on which feature you're looking at.

PyPy - Python on RPython (a static subset of python), currently targets 2.5, with a branch targeting 2.7, has a GIL, and a JIT, which can result in huge performance gains (see http://speed.pypy.org/).

Unladen Swallow - a branch of CPython utilizing LLVM to do just in time compilation. Branched from 2.6, although with the acceptance of PEP 3146 it is slated for merger into py3k.

Source: Alex Gaynor @ Quora

Page 14: Python Performance 101

Python : Interpreter

PyPy

Http://pypy.org

PyPy is a fast, compliant alternative implementation of the Python language (2.7.1). It has several advantages and distinct features:

Speed: thanks to its Just-in-Time compiler, Python programs often run faster on PyPy. (What is a JIT compiler?)

Memory usage: large, memory-hungry Python programs might end up taking less space than they do in CPython.

Compatibility: PyPy is highly compatible with existing python code. It supports ctypes and can run popular python libraries like twisted and django.

Sandboxing: PyPy provides the ability to run untrusted code in a fully secure way.

Stackless: PyPy can be configured to run in stackless mode, providing micro-threads for massive concurrency.

Source : http://pypy.org

Page 15: Python Performance 101

Python : Interpreter

● Unladen Swallow

An optimization branch of CPython, intended to be fully compatible and significantly faster.

http://code.google.com/p/unladen-swallow/

● Mandate is to merge the codebase with the Python 3.x series.
● It's a Google-sponsored project.
● Known to be used at YouTube, which is written in Python.

Page 16: Python Performance 101

Python : Interpreter Benchmarks

Source: http://morepypy.blogspot.com/2009/11/some-benchmarking.html

Page 17: Python Performance 101

Python : Interpreter Benchmarks

Source: http://morepypy.blogspot.com/2009/11/some-benchmarking.html

Page 18: Python Performance 101
Page 19: Python Performance 101

Python : Outsourcing to C/C++

● Ctypes
● SWIG
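For the ctypes route, here is a minimal sketch that calls strlen from the C runtime without writing any wrapper code. It assumes a POSIX system where the C library can be located; the choice of strlen is just an illustration.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system. find_library("c") locates the C runtime; if it
# returns None, CDLL(None) uses the symbols already loaded into the process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"Hello World")
print(length)
```

Declaring argtypes and restype is optional for simple calls, but it lets ctypes check the arguments and convert the return value correctly.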

Page 20: Python Performance 101

Python : Outsourcing to C/C++

● $ sudo apt-get install libboost-python-dev
● $ sudo apt-get install python-dev
● $ sudo apt-get install swig

/*hellomodule.c*/
#include <stdio.h>

void say_hello(const char* name) {
    printf("Hello %s!\n", name);
}

/*hello.i*/
%module hello
extern void say_hello(const char* name);

$ swig -python hello.i
$ gcc -fpic -c hellomodule.c hello_wrap.c -I/usr/include/python2.7/
$ gcc -shared hellomodule.o hello_wrap.o -o _hello.so

>>> import hello
>>> hello.say_hello("World")
Hello World!