

Armed and Dangerous
Sex, software, politics, and firearms. Life's simple pleasures…

Python speed optimization in the real world
Posted on 2013-03-24 by esr
I shipped reposurgeon 2.29 a few minutes ago. The main improvement in this version is
speed – it now reads in and analyzes Subversion repositories at a clip of more than
11,000 commits per minute. This is, in case you are in any doubt, ridiculously fast – faster than the native
Subversion tools do it, and for certain far faster than any of the rival conversion utilities can manage. It’s well
over an order of magnitude faster than when I began seriously tuning for speed three weeks ago. I’ve learned
some interesting lessons along the way.
The impetus for this tune-up was the Battle for Wesnoth repository. The project’s senior devs finally decided
to move from Subversion to git recently. I wasn’t actively involved in the decision myself, since I’ve been semi-
retired from Wesnoth for a while, but I supported it and was naturally the person they turned to to do the
conversion. Doing surgical runs on that repository rubbed my nose in the fact that code with good enough
performance on a repository 500 or 5000 commits long won’t necessarily cut it on a repository with over 
56000 commits. Two-hour waits for the topological-analysis phase of each load to finish were kicking my ass
 – I decided that some serious optimization effort seemed like a far better idea than twiddling my thumbs.
First I’ll talk about some things that didn’t work.
pypy, which is alleged to use fancy JIT compilation techniques to speed up a lot of Python programs, failed
miserably on this one. My pypy runs were 20%-30% slower than plain Python. The pypy site warns that
pypy’s optimization methods can be defeated by tricky, complex code, and perhaps that accounts for it;
reposurgeon is nothing if not algorithmically dense.
cython didn’t emulate pypy’s comic pratfall, but didn’t deliver any speed gains distinguishable from noise
either. I wasn’t very surprised by this; what it can compile is mainly control structure, which I didn’t expect to
be a substantial component of the runtime compared to (for example) string-bashing during stream-file
parsing.
My grandest (and perhaps nuttiest) plan was to translate the program into a Lisp dialect with a decent
compiler. Why Lisp? Well…I needed (a) a language with unlimited-extent types that (b) could be compiled to
machine-code for speed, and (c) minimized the semantic distance from Python to ease translation (that last
point is why you Haskell and ML fans should refrain from even drawing breath to ask your obvious question;
instead, go read this ). After some research I found Steel Bank Common Lisp (SBCL) and began reading up
on what I’d need to do to translate Python to it.
The learning process was interesting. Lisp was my second language; I loved it and was already expert in it by
1980, well before I learned C. But since 1982 the only Lisp programs I’ve written have been Emacs modes. I’ve
done a whole hell of a lot of those, including some of the most widely used ones like GDB and VC, but
semantically Emacs Lisp is a sort of living fossil coelacanth from the 1970s, dynamic scoping and all.
Common Lisp, and more generally the evolution of Lisp implementations with decent alien type bindings,
passed me by. And by the time Lisp got good enough for standalone production use in modern environments I
already had Python in hand.
So, for me, reading the SBCL and Common Lisp documentation was a strange mixture of learning a new
language and returning to very old roots. Yay for lexical scoping! I recoded about 6% of reposurgeon in SBCL,
then hit a couple of walls. One of the lesser walls was a missing feature in Common Lisp corresponding to
the __str__ special method in Python. Lisp types don’t know how to print themselves, and as it turns out
reposurgeon relies on this capability in various and subtle ways. Another problem was that I couldn’t easily
see how to duplicate Python’s subprocess-control interface – at all, let alone portably across Common Lisp
implementations.
But the big problem was CLOS, the Common Lisp Object System. I like most of the rest of Common Lisp
now that I’ve studied it. OK, it’s a bit baroque and heavyweight and I can see where it’s had a couple of 
kitchen sinks pitched in – if I were choosing a language on purely esthetic grounds I’d prefer Scheme. But I
could get comfortable with it, except for CLOS.
But me no buts about multimethods and the power of generics – I get that, OK? I see why it was done the
way it was done, but the brute fact remains that CLOS is an ugly pile of ugly. More to the point in this
particular context, CLOS objects are quite unlike Python objects (which are in many ways more like CL
defstructs). It was the impedance mismatch between Python and CLOS objects that really sank my
translation attempt, which I had originally hoped could be done without seriously messing with the
architecture of the Python code. Alas, that was not to be. Which refocused me on algorithmic methods of 
improving the Python code.
Now I’ll talk about what did work.
What worked, ultimately, was finding operations that have instruction costs O(n**2) in the number of commits
and squashing them. At this point a shout-out goes to Julien “FrnchFrgg” Rivaud, a very capable hacker trying
to use reposurgeon for some work on the Blender repository. He got interested in the speed problem (the
Blender repo is also quite large) and was substantially helpful with both patches and advice. Working
together, we memoized some expensive operations and eliminated others, often by incrementally computing
reverse-lookup pointers when linking objects together in order to avoid having to traverse the entire repository
later on.
Even just finding all the O(n**2) operations isn’t necessarily easy in a language as terse and high-level as
Python; they can hide in very innocuous-looking code and method calls. The biggest bad boy in this case
turned out to be child-node computation. Fast import streams express “is a child of” directly; for obvious
reasons, a repository analysis often has to look at all the children of a given parent. This operation blows up
quite badly on very large repositories even if you memoize it; the only way to make it fast is to precompute all the child lists up front and keep them current as the repository is mutated.
 
 Another time sink (the last one to get solved) was identifying all tags and resets attached to a particular 
commit. The brute-force method (look through all tags for any with a from member matching the commit’s
mark) is expensive mainly because to look through all tags you have to look through all the events in the
stream – and that’s expensive when there are 56K of them. Again, the solution was to give each commit a list
of back-pointers to the tags that reference it and make sure all the mutation operations update it properly.
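
A minimal sketch of the back-pointer idea, with hypothetical Commit and Tag classes rather than reposurgeon's actual internals: every link operation also updates the reverse direction, so later lookups need no repository-wide scan.

    class Commit:
        def __init__(self, mark):
            self.mark = mark
            self.parents = []       # Commit objects rather than bare marks
            self.children = []      # reverse pointers, maintained incrementally
            self.attachments = []   # tags/resets that reference this commit

        def add_parent(self, parent):
            # Linking updates both directions, so child lookup is O(1) later
            # instead of a scan over every event in the stream.
            self.parents.append(parent)
            parent.children.append(self)

    class Tag:
        def __init__(self, name, target):
            self.name = name
            self.target = target                # the Commit this tag points at
            target.attachments.append(self)     # back-pointer kept in sync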
It all came good in the end. In the last benchmarking run before I shipped 2.29 it processed 56424 commits in
303 seconds. That’s 186 commits per second, 11160 per minute. That’s good enough that I plan to lay off 
serious speed-tuning efforts; the gain probably wouldn’t be worth the increased code complexity.
UPDATE: A week later, after more speed-tuning mainly by Julien (because it was still slow on the very large
repo he’s working with) analysis speed is up to 282 commits/sec (16920 per minute) and a curious thing has
occurred. pypy now actually produces a speedup, up to around 338 commits/sec (20280 per minute).
We don’t know why, but apparently the algorithmic optimizations somehow gave pypy’s JIT better traction.
This is particularly odd because the density of the code actually increased.
This entry was posted in Software and tagged reposurgeon by esr. Bookmark the permalink
[http://esr.ibiblio.org/?p=4861].

120 THOUGHTS ON “PYTHON SPEED OPTIMIZATION IN THE REAL WORLD”

Pingback: Python speed optimization in the real world | dropsafe
on 2013-03-24 at 19:55:07  said:
Has been a pleasure watching hackers at work on irc, and the early warning for the blog post :-)
 – Foo Quuxman
on 2013-03-24 at 20:23:45  said:
Hmmm. I learned a new word today: Memoization. I’ve had few formal programming classes, and
none recently. I keep up with programming trends by lurking on various lists–but that often shows me
techniques without naming them. Anyways, I regularly “memoize” functions but never knew there was
a formal name for it.
 
on 2013-03-24 at 20:51:33  said:
Is it 11k commits per second or per minute? First paragraph says second, last paragraph says
minute.
esr  
on 2013-03-24 at 21:01:30  said:
>Is it 11k commits per second or per minute? First paragraph says second, last paragraph says
minute.
Typo. I got it right the second time; I’ve fixed the incorrect first instance.
esr  
on 2013-03-24 at 21:07:03  said:
>Anyways, I regularly “memoize” functions but never knew there was a formal name for it.
Oddly enough, my situation was opposite – I knew the word, but how to memoize systematically was
something I’d never learned until this last three weeks. I don’t write code that is both performance-
critical and compute-bound very often, so I haven’t before had enough use for this technique to nail it
down.
on 2013-03-24 at 21:53:43  said:
Python is faster than a lot of people think it is.
 
You have to figure out how to let most of the looping happen inside C builtins.
Usually, if it needs to go fast, someone has already made a library.
Occasionally, I will write C or Pyrex/Cython to speed it up.
But the last time that happened was in 2003…
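
A tiny sketch of that point: the same reduction written as an explicit Python loop and as a call into the C-implemented sum() builtin, where the iteration happens in C.

    values = list(range(1000000))

    # Explicit Python-level loop: every iteration runs interpreter bytecode.
    total = 0
    for v in values:
        total += v

    # The same work pushed into a builtin: the loop runs inside sum(), in C.
    total = sum(values)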
Joshua Kronengold 
on 2013-03-25 at 00:55:29  said:
Nice writing, although I’m somewhat surprised that you haven’t discovered what I (as someone who
frequently works with “big data” setups) have long since determined — that when the going gets slow,
it’s time to pull out a profiler and see if some part of your codebase is running -far- more often than
you’ve anticipated; a sure sign that something upstream of it is suffering big O problems.
esr  
on 2013-03-25 at 01:22:17  said:
>when the going gets slow, it’s time to pull out a profiler and see if some part of your codebase is
running -far- more often than you’ve anticipated; a sure sign that something upstream of it is suffering
big O problems.
I’m well aware of the principle. Unfortunately, my experience is that Python profilers suck rather badly
 – you generally end up having to write your own instrumentation to gather timings, which is what I did
in this case. It helped me find the obscured O(n**2) operations.
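
A rough sketch of that sort of hand-rolled instrumentation (illustrative only, not the actual “timings” command): accumulate wall-clock time per labelled operation and dump a table at the end.

    import time
    from collections import defaultdict

    TIMINGS = defaultdict(float)

    def timed(label):
        """Decorator that accumulates wall-clock time under a label."""
        def wrap(func):
            def inner(*args, **kwargs):
                start = time.time()
                try:
                    return func(*args, **kwargs)
                finally:
                    TIMINGS[label] += time.time() - start
            return inner
        return wrap

    def report():
        # Print the most expensive operations first.
        for label, secs in sorted(TIMINGS.items(), key=lambda kv: -kv[1]):
            print("%-30s %8.2fs" % (label, secs))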
John Wiseman 
on 2013-03-25 at 03:11:12  said:
“Once of the lesser walls was a missing feature in Common Lisp corresponding to the __str__ 
special method in Python.”
on 2013-03-25 at 05:06:07  said:
> you generally end up having to write your own instrumentation to gather timings, which is what I did
in this case.
Do you deem it good enough to show the rest of the world?
Beat Bolli 
on 2013-03-25 at 08:08:31  said:
Looks like a classical runtime/memory trade-off. Have you compared the working set size before and
after the speedup?
on 2013-03-25 at 08:12:37  said:
>Looks like a classical runtime/memory trade-off. Have you compared the working set size before
and after the speedup?
It is most certainly that. I didn’t bother measuring the working set because the only metric of that that
mattered to me was “doesn’t trigger noticeable swapping”.
esr  
on 2013-03-25 at 08:13:26  said:
>Do you deem it good enough to show the rest of the world?
Look at the implementation of the “timings” command.
 
>You want print-object: http://www.lispworks.com/documentation/HyperSpec/Body/f_pr_obj.htm  
“The function print-object is called by the Lisp printer; it should not be called by the user.”
 Anyway, this looks like an analogue of Python repr(), not print – it’s supposed to print a
representation that’s invertible (can be fed back to read-eval). I use str() for dumping the fast-import
stream representations of objects, which is not invertible by Python itself.
JustSaying
on 2013-03-25 at 08:21:47  said:
Big O optimization trumps (or at worst equals in lucky cases) any compiler-aware information,
because the degrees-of-freedom in the semantics not modeled by the language (and the declared
types) is always a superset. Yet another reason why computers will never program themselves
creatively and why I think the Singularity is nonsense.
I don’t know enough about the details of CLOS nor defstructs to grasp the detailed reasons for the
claimed impedance mismatch between the CLOS and Python.
Programmers are rightfully proud when they achieve an order-of-magnitude gain in performance. I
don’t see programmers run away from their babies and disappear into thin air without ever bragging to
any one of their accomplishment. How lonely that would be otherwise.
Nancy Lebovitz 
on 2013-03-25 at 08:49:17  said:
Checking to make sure I understand: Memoization is looking up and recording the data you’re likely
to keep needing instead of looking it up every time you need it?
esr  
 
>Checking to make sure I understand: Memoization is looking up and recording the data you’re likely
to keep needing instead of looking it up every time you need it?
Correct. It works when the results of an expensive function (a) change slowly, and (b) are small and
cheap to store. Also there has to be a way to know when the cached results have become invalid so
you can clear the cache.
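
A minimal sketch of that pattern, using a hypothetical children_of() lookup: compute the result once, serve it from the cache afterwards, and clear the cache whenever a mutation could make it stale.

    class Repository:
        def __init__(self, events):
            self.events = events
            self._children_cache = {}    # commit mark -> cached child list

        def children_of(self, mark):
            # Memoized: the expensive scan runs once per mark until invalidated.
            if mark not in self._children_cache:
                self._children_cache[mark] = [e for e in self.events
                                              if mark in getattr(e, "parents", ())]
            return self._children_cache[mark]

        def invalidate(self):
            # Any mutation that changes parent links must clear the cache,
            # otherwise stale results would be served.
            self._children_cache.clear()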
Since you’re not a programmer, I’ll add that big-O notation is a way of talking about how your 
computation costs scale up with the size of your input data. O(1) is constant time, O(n) is linear in
the size of the input set, O(n**2) is as the square of the size, O(2**n) as the number of subsets of the
data set. Also you’ll see O(log n), typically associated with the cost of finding a specified item in a
tree or hash table. And O(n log n) which is the expected cost function of various good sorting
algorithms. In general, O(1) < O(log n) < O(n) < O(n log n) < O(n**2) < O(2**n). Normally anything
O(n log n) or below is tolerable, O(n**2) is pretty bad, and O(2**n) is unusably slow.
Rick C
on 2013-03-25 at 09:15:11  said:
Nancy, it would be more accurate to say you record the results of complex calculations and then
reuse the stored result later, rather than recalculate it every time.
Shenpen
on 2013-03-25 at 10:08:19  said:
> Once of the lesser walls was a missing feature in Common Lisp corresponding to the __str__ 
special method in Python. Lisp types don’t know how to print themselves, and as it turns out
reposurgeon relies on this capability in various and subtle ways.
Does it also rely on everybody using this and doing it in a sensible, readable way in their classes.
 Also, have you checked Jython?
iajrz
 
Rick: all calculations are functions, aren’t they? But if you had to do a look-up which requires
expensive/extensive/recurrent parsing, can that be called a calculation?
It is still good for memoization…
The Monster  
on 2013-03-25 at 10:29:15  said:
I’m a big believer that a data structure with one-way pointers is vastly inferior to one that includes
back-pointers. With back-pointers, you can always traverse the structure in any direction. Without
them, you have to do searches, which are always expensive, and progressively more expensive as
the structure grows.
I, too, was unfamiliar with the verb “memoize”, but have made use of the idea behind it many times.
 At my last job, I wrote some utility programs that had to know where to find some files that weren’t
stored in well-known locations (but were very unlikely to move once they’d been put in a given place,
because that was a PITA). Since a find is a very expensive operation, I made the utility installer 
dispatch an at now job to do the find once and cache the result in a specific location that the other 
utilities knew about.
on 2013-03-25 at 10:44:42  said:
>Does it also rely on everybody using this and doing it in a sensible, readable way in their classes.
My code doesn’t assume that every class in the universe has a sensible __str__, but it does assume
that almost every class defined in reposurgeon has its own __str__ that is useful for 
progress/debugging messages, and (this is the key point) the system str() will recurse down through
all such methods when told to print an arbitrary structure.
>Also, have you checked Jython?
No. Is there any reason I should expect it to be faster than c-python? I thought it was mainly aimed
at allowing programmers to use the Java library classes, rather than at performance per se.
Mike E
on 2013-03-25 at 11:17:53  said:
“Memoization”; good to know the name for that. I found myself doing that extensively while trying to
work through the problems at Project Euler (projecteuler.net), which is a marvelous resource with a
series of incrementally more difficult mathematical programming puzzles for those interested in such
a thing.
on 2013-03-25 at 11:41:19  said:
> the system str() will recurse down through all such methods when told to print an arbitrary
structure.
I’m not sure what this means. The closest thing I can think of is the fact that system structure types
(such as list/tuple/dict) will call str (or, looks like it’s actually repr at least half the time) on their 
children.
That’s a kind of narrow definition of “arbitrary structure” for my taste.
esr  
on 2013-03-25 at 12:01:56  said:
>That’s a kind of narrow definition of “arbitrary structure” for my taste.
Perhaps I was unclear. I created __str__ methods for all the classes that are parts of Repository –
the effect is that when requesting structure dumps for debugging instrumentation I can just say str()
on whatever implicit object pointer I have and the intuitively useful thing will happen. I don’t know
how to duplicate this effect in CL. What it would probably require is for the system print function to
magically call a str generic whenever it reaches a CLOS object.
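
A toy sketch of that pattern (not reposurgeon code); note the caveat raised above that containers call repr() on their elements, so the usual trick is to alias __repr__ to __str__:

    class Commit:
        def __init__(self, mark, comment):
            self.mark = mark
            self.comment = comment

        def __str__(self):
            return "commit %s (%s)" % (self.mark, self.comment)

        # Containers call repr() on their elements, so aliasing __repr__
        # to __str__ is what makes str() on a list of commits readable.
        __repr__ = __str__

    print(str([Commit(":1", "initial"), Commit(":2", "fix typo")]))
    # [commit :1 (initial), commit :2 (fix typo)]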
Patrick Maupin
@The Monster:
 I, too, was unfamiliar with the verb “memoize”, but have made use of the idea behind it many
times. At my last job, I wrote some utility programs that had to know where to find some files
that weren’t stored in well-known locations (but were very unlikely to move once they’d
been put in a given place, because that was a PITA). Since a find is a very expensive
operation, I made the utility installer dispatch an at now job to do the find once and cache
the result in a specific location that the other utilities knew about.
Congratulations! You reinvented bash’s command hash. :-)
But seriously, this is a great idea, and like most great ideas, will have multiple independent
inventions by multiple clever people.
Garrett
on 2013-03-25 at 12:08:56  said:
I would just follow up on esr’s excellent overview of big-O notation above with one point which is often
missed by developers. The impact of the algorithm is usually seen as data sets grow larger. For 
small data sets, the complexity of the operation frequently is overtaken by other concerns.
To provide a mundane example: a car goes much faster than you can walk, but if you are a city-
dweller it’s probably faster to walk to your neighbour’s house than to drive.
 Adam
on 2013-03-25 at 12:23:26  said:
> O(1) < O(log n) < O(n) < O(n log n) < O(n**2) < O(2**n)
While that's theoretically true, it's interesting to note that in practice, O(1) = O(log n). For typical
problems, you should just mentally macro expand "log2 n" to 30. The only way you're going to get it
different enough from 30 to make any difference is to have n be so small that the operation in
question is effectively instant. For example, to shave a mere 1/3 from that "constant" requires n to
decrease by three orders of magnitude.
Maybe you want to work on an atypical problem. For the biggest problem most people could possibly
attempt, (log2 n) < 60. For Google, it might be 70. For the crackpot who wants to count every
 
on 2013-03-25 at 12:23:51  said:
Well, looks like the Boston Lisp folks called it. At their last meeting a couple of weeks ago, some of 
them predicted that you would:
a) discover that your speed problem is better solved by algorithmic optimization than by switching to
a faster language or compiler;
b) write a post critiquing the shortcomings of Common Lisp.
They were pretty spot on except they thought you would critique CL’s lack of libraries, not the
ugliness of CLOS. :)
>esr’s excellent overview of big-O notation
Entertainingly, one of the downsides of being an entirely self-taught programmer is that I didn’t learn
big-O notation or the associated reflexes until relatively late in my career. It wasn’t intuitive for me
until, oh, probably less than five years ago.
esr  
on 2013-03-25 at 12:53:53  said:
>They were pretty spot on except they thought you would critique CL’s lack of libraries, not the
ugliness of CLOS. :)
 And I might have gotten to that if I’d gotten around CLOS.
John Wiseman 
 
>“The function print-object is called by the Lisp printer; it should not be called by the user.”
Correct. You define a custom print-object method on your data types, and it is called by the
implementation whenever you cause a value of that type to be printed–by calling print, prin1, format,
or whatever. Just like you don’t explicitly call __str__.
> Anyway, this looks like an analogue of Python repr(), not print – it’s supposed to print a
representation that’s invertible (can be fed back to read-eval).
It is used for both, actually. If *print-readably* is T, then it must either print a readable (“invertible”)
representation or throw an error–”repr mode”. Otherwise, it can print whatever it wants–”str mode”.
John Wiseman 
Lispers usually use the print-unreadable-object helper macro. See
http://clhs.lisp.se/Body/m_pr_unr.htm   for an example.
on 2013-03-25 at 14:46:32  said:
> Once of the lesser walls was a missing feature in Common Lisp corresponding to the __str__
special method in Python. Lisp types don’t know how to print themselves, and as it turns out
reposurgeon relies on this capability in various and subtle ways. Another problem was that I couldn’t
easily see how to duplicate Python’s subprocess-control interface
what about CL’s much hyped ability to have new features added very easily (I’m thinking of Paul
Graham’s writings): adding a macro would not have solved your problems? not worth your time? too
tricky?
Faré 
on 2013-03-25 at 15:05:02  said:
 
case, there is no perfect answer, but EXECUTOR does a decent job on the major implementations
(SBCL, CCL and a few more).
CLOS is ugly but (1) it’s more expressive and powerful than any other object system I’ve heard of 
(e.g. multiple inheritance, multiple-dispatch, method combinations, accessors, meta-object protocol,
etc.), and (2) you can hide the ugly behind suitable macros, and many people have.
Regarding __str__ and print-method, see John Wiseman’s answer; though in this case you might
want to define your own serialize-object method and have a mixin defining a print-object method that
wraps a call to that in a print-unreadable-object.
esr  
on 2013-03-25 at 15:23:12  said:
>what about CL’s much hyped ability to have new features added very easily (I’m thinking of Paul
Graham’s writings): adding a macro would not have solved your problems? not worth your time? too
tricky?
Dunno. Would have looked into it more deeply, but CLOS blocked the translation. Now that I know
SBCL exists, though, I’ll probably do a project in it from scratch sometime and learn these things.
esr  
on 2013-03-25 at 15:25:08  said:
>though in this case you might want to define your own serialize-object method and have a mixin
defining a print-object method that wraps a call to that in a print-unreadable-object.
Yes, I thought the answer would be something much like that.
Good to know that UIOP:RUN-PROGRAM exists – next time I try something like this I’ll look it up.
dtsund
on 2013-03-25 at 15:52:00  said:
 
>Entertainingly, one of the downsides of being an entirely self-taught programmer is that I didn’t learn
big-O notation or the associated reflexes until relatively late in my career. It wasn’t intuitive for me
until, oh, probably less than five years ago.
Weren’t you also a mathematician, at least briefly? My first exposure to the notation was in Real
 Analysis, after which grasping it in a CS context was almost trivial.
Jay Maynard 
on 2013-03-25 at 15:52:23  said:
Let’s go up a metalevel. I was mildly surprised you considered switching languages at all before
attacking the algorithms’ speed issues. This seems unlike you. How did you get there?
Jeff Read
on 2013-03-25 at 16:02:30  said:
CLOS is ugly but (1) it’s more expressive and powerful than any other object system I’ve
heard of (e.g. multiple inheritance, multiple-dispatch, method combinations, accessors,
meta-object protocol, etc.), and (2) you can hide the ugly behind suitable macros, and many
people have.
Historically there was T’s object system: as powerful as CLOS but actually beautiful.
The closest I can find in a modern running Scheme is RScheme’s object system, but RScheme has
sadly been lacking in maintenance or interest and is still quite riddled with bugs.
esr  
on 2013-03-25 at 16:06:53  said:
>Let’s go up a metalevel. I was mildly surprised you considered switching languages at all before
attacking the algorithms’ speed issues. This seems unlike you. How did you get there?
You’re right – it was unlike me (on the evidence anyone else has available). I’ve actually been
wondering if anyone would notice this and bring it up.
 
out the stuff that could be attacked that way. In my defense, I will note that the remaining O(n**2)
code was pretty well obscured; it took a couple of weeks of concentrated attention by two able
hackers to find it, and that was after I’d built the machinery for gathering timings.
esr  
>Weren’t you also a mathematician, at least briefly?
I was, but my concentration was in abstract algebra, logic, and finite mathematics. I didn’t actually
learn a lot of real analysis (I had a fondness for topology that was unrelated to my main interests, but
I approached it through set and group theory rather than differential geometry). It may also be that
big-O notation wasn’t as prominent then (in the 1970s) as it later became, so I’d have been less likely
to encounter it even if I had been learning more on the continuous side.
BRM aka Brian R. Marshall 
on 2013-03-25 at 16:37:46  said:
 Another note for non-programmers…
 A “profiler” is a tool to determine how much time is spent running different parts of a program. As
ESR noted, sometimes it is better to add some code to the program to get the required results.
(Such code generally isn’t used/run when not trying to speed up the program.)
Sometimes, at least as a first try, a programmer can tell from the code where it is worth trying to
speed things up.
In any case, this kind of analysis is very useful. A junior/lousy programmer may attempt to speed up
a program by reworking code that obviously can be made to run faster. But if a program takes 10
minutes to run and this code accounts for only 10 seconds of that time, it is a waste of time trying to
speed it up. Even if it can be made to run 10 times faster, the program run time goes from 600
(590+10) seconds to 591 (590+1) seconds.
Sometimes this kind of improvement is worse than a waste of time. The code may be written in a
way that makes it obvious what it is supposed to do and that it is, in fact, doing it. Reworking code that makes unimportant improvements but also makes the code obscure and subtle is bad practice.
 
on 2013-03-25 at 17:11:19  said:
> Also you’ll see O(log n), typically associated with the cost of finding a specified item in a tree or 
hash table.
Small correction: hash table insertion and lookup are expected O(1), not O(lg n).
(At worst, they are O(n), but this degenerate case hardly ever happens unless you piss off a
cryptographer. )
JustSaying
@Adam:
It’s interesting to note that in practice, O(1) = O(log n). For typical problems, you should just
mentally macro expand “log2 n” to 30. The only way you’re going to get it different enough
from 30 to make any difference is to have n be so small that the operation in question is
effectively instant. For example, to shave a mere 1/3 from that “constant” requires n to
decrease by three orders of magnitude.
Why are you claiming that 300% efficiency increase is irrelevant (equivalent to constant) in the
presence of iterations that range over 3 orders-of-magnitude?
log n is still log n, not constant.
 Are you claiming that no such cases occur?
JustSaying
@esr:
You’re right – it was unlike me (on the evidence anyone else has available). I’ve actually been
wondering if anyone would notice this and bring it up.
I had assumed that you wanted to test out whether there was a fundamental advantage of your long-
lost love over your new one. I have observed that you favor continuity of code bases over other 
considerations, so I should have realized I was wrong. Perhaps I was distracted.
Jay Maynard 
on 2013-03-26 at 03:37:13  said:
JustSaying: I think the point is that going from 30 to 1 is almost never enough improvement to be
worth doing, and it’s effectively linear (when you have a 2**30 scale factor on input producing scale
factor of 30 on output, there are much bigger fish to fry).
Jay Maynard 
on 2013-03-26 at 03:47:33  said:
>I will note that the remaining O(n**2) code was pretty well obscured; it took a couple of weeks of 
concentrated attention by two able hackers to find it
 Are there any O(n**2) traps within Python itself we can avoid that you found, or was this all your 
algorithms’ fault?
another user 
Did you consider profiling reposurgeon for performance bottlenecks, rewriting the relevant pieces of 
code in C/C++ and using bindings? I personally like boost-python.
Maybe if you keep 90% of code written in Python and rewrite 10% of performance-critical code in C,
you can approach the speed of a C program.
JustSaying
 
almost never
That is why I asked if there are no cases. I can’t think of a case at the moment, but I am skeptical of 
saying there are none. I think log n is still log n and I should remember it as that, while also factoring
in that it might nearly always be too low of a priority. All of us experienced programmers, I am sure,
share BRM’s experience that obfuscating code for insignificant efficiency gains is myopic.
Winter
@JustSaying
“I think log n is still log n and I should remember it as that, while also factoring in that it might nearly
always be too low of a priority.”
O(log n) vs O(n) corresponds to [c1 * log(n) + d1] < [c2 * n + d2] for some n > N. In practice the
constants may be so large that n > N is out of your reach.
So, your implementation might indeed scale as O(log n), but it could still run much slower for
practical n.
on 2013-03-26 at 11:12:22  said:
Of course lg n isn’t *really* a constant, but it’s often useful to think of it that way. It’s also useful at
times to assume a spherical cow.
They say that premature optimization is the root of all evil. If you need to sort the items in a dropdown
box, you’re probably fine to use an n^2 sort. Those are fast and easy to code up, which means fewer 
bugs. It’s a dropdown box, so your user experience will be crap if you have more than a few dozen
items anyway. At most, you’ll add a few milliseconds, which isn’t noticeable. When “n” is small
enough, even the difference between O(n) and O(n^2) doesn’t matter. An extra lg n is completely
irrelevant. That’s one of Knuth’s “small efficiencies”.
However, some optimizations aren’t premature. If you have lists of a billion items, n^2 sorts are out of 
the question. Let’s say you’re typically sorting a billion items. Then lg n is 30. Assume that once in a
while, you need to sort 10 billion items. Then lg n is a hair over 33. That adds 11% to your runtime.
Instead of spending 10x longer processing 10x more items, you’ll have to spend 11x longer. The
difference is negligible: An order of magnitude change of the input size in either direction affects your 
total runtime by only 11% over an O(n) algorithm. Given a gigantic three orders of magnitude
change of input size, the lg n factor results in only 66%. That’s not nothing, but it’s also not the real
problem. You’ll need more memory before you’ll need more CPU.
In short, when “lg n” really varies, “n” is small enough that the entire operation doesn’t matter. When
“n” is large enough to matter, “lg n” varies so little that the variation doesn’t matter. Not much
anyway.
You could improve your spherical cow model by making it an oblate spheroid, adding another smaller 
one as a head, and adding four cylindrical legs — but that won’t change the air resistance enough to
stop the cow from making a big mess when it hits the ground.
esr  
on 2013-03-26 at 11:14:30  said:
>Are there any O(n**2) traps within Python itself we can avoid that you found, or was this all your 
algorithms’ fault?
I don’t know that yet. It was probably all my code, but there could be O(n**2) traps within Python as
well.
>boost-python
*twitch*
Merciful $DEITY. Boost is bad enough. Don’t inflict it on Python.
esr  
>Did you consider profiling reposurgeon for performance bottlenecks, rewriting the relevant pieces of 
code in C/C++ and using bindings?
Yes, for about a half-second. Then I realized how ridiculous the idea was and abandoned it.
That strategy only works when the stuff you need to do fast fits in C’s type ontology without incurring
so much code complexity that you end up with more problems than you started with. There was no
chance that would be true of reposurgeon’s internals – none at all.
Garrett
@JustSaying:
I work with filesystems for a living. When you have a large on-disk data structure you need to search,
loading another block off of disk is a big cost. OTOH, searching that block in memory is
comparatively cheap. For some of our data structures we use binary or hash trees to locate the block
we need, but then pack the block as an array. This avoids extra pointers and allows us to cram a few
more entries per block. In these cases, cutting the number of block loads from 20 to 10 can be a big
savings if the operation must occur in real-time for a client (as opposed to a background processing
operation). Spinning rust is slow …
 
@esr:
O(1) is constant time [...] O(log n), typically associated with the cost of finding a specified
item in a tree or hash table.
O(1) is typically associated with the cost of accessing a specified item in an array by index.
@Winter: O(log n) vs O(n) corresponds to [c1 * log(n) + d1] < [c2 * n + d2]
@Adam:
In short, when “lg n” really varies, “n” is small enough that the entire operation doesn’t
matter.
That is only if c1 is small relative to d1 and the “universe”.
The curve for log(n) flattens faster than even sqrt.
The sacrosanct rule to not do premature optimization appears to be deprecated under open-
extension, because profiling isn’t available.
If your caller must call you a billion times (perhaps deep in some nested function hierarchy), and
you are employing a log(n) tree or hash instead of an array, then the difference in application
performance can be 300% at n = 1000, 400% at n = 10,000, 500% at n = 100,000, etc.
So log(n) is never the same as constant. The cow is never spherical except when we “touch him only
one way” — Steve Jobs.
on 2013-03-26 at 12:39:34  said:
@esr: “…I didn’t bother measuring the working set because the only metric of that that mattered to
me was “doesn’t trigger noticeable swapping”.”
Sure, for your current number of commits. Now, using caching, the design trades off runtime for an
upper limit based on the memory of the box.
 
on 2013-03-26 at 13:41:41  said:
>Now, using caching, the design trades off runtime for an upper limit based on the memory of the
box.
Indeed so. It’s easier to buy memory than more processor speed these days.
Winter 
@JustSaying
The constants for O(log n) tend to be larger than for O(n), else you would have tried the log n
algorithm first. And indeed, log n matters if n is in the billions. But at that point, you are tweaking all
algorithms.
on 2013-03-26 at 18:26:07  said:
> Looks like a classical runtime/memory trade-off. Have you compared the working set size before
and after the speedup?
TL,DR: see below
In fact, I should mention that before I worked on refactoring for speed, I began searching for ways to cut a
lot of the memory used by reposurgeon. Most of the gain was obtained by using __slots__ on most
instantiated structures, but I did some dict eviction and copy-on-write optimization on a really
memory-hungry part: the filemaps.
Reposurgeon already was optimized in that regard (Eric had already implemented a rather good
COW scheme for PathMaps), but the fact that PathMap’s snapshotting required a new dictionary
each time — to be able to later replace an entry by its copy — was taking its toll… So I devised a
tweak to take snapshots even less often, then a further optimization which is a real memory
usage/code complexity tradeoff.
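
A minimal illustration of the __slots__ trick on a made-up class: declaring the attribute set up front removes the per-instance __dict__, which adds up over hundreds of thousands of objects.

    class Blob(object):
        # Without __slots__ every instance carries a __dict__;
        # with it, attributes live in fixed slots and use far less memory.
        __slots__ = ("mark", "path", "size")

        def __init__(self, mark, path, size):
            self.mark = mark
            self.path = path
            self.size = size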
Returning to simpler structures would probably gain some speed too, but the fact is that on my
machine, reposurgeon still tops out at 75% of my 4GB of RAM when converting the blender repository —
and I suspect Battle for Wesnoth to be such a contender too. Sure, one can trade computational
cost and even code readability for memory, but the bargain is not the same when you can trade
200MB of temporary memory for an O(n**2) to expected O(n) reduction — e.g. store previous hits in a
set/dict instead of searching them backwards in the “already seen” list — as when you trade 2GB
of memory used through the whole import for only a constant factor — one of the costs of the smart
COW PathMap over a plain list of dicts is that built-in types don’t have interpreter overhead, and in fact run
at C speed, but that’s only a constant factor rather than a whole new complexity class.
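
A sketch of that set-versus-list change on toy data (the Commit class and marks here are placeholders):

    class Commit(object):
        def __init__(self, mark):
            self.mark = mark

    commits = [Commit(":1"), Commit(":2"), Commit(":1")]   # toy data with a duplicate

    # Quadratic pattern: 'in' on a list rescans the list on every iteration.
    seen_list = []
    for commit in commits:
        if commit.mark not in seen_list:
            seen_list.append(commit.mark)

    # Same logic with a set: membership tests are O(1) on average.
    seen_set = set()
    for commit in commits:
        if commit.mark not in seen_set:
            seen_set.add(commit.mark)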
 As for the optimization itself, it is amusing to note that Eric and I actually started optimizing for
speed each in his own corner, without coordinating… At first we were doing orthogonal changes, then as the
set of molasses reduced we began stepping on each other’s toes^W^W^W^W^W collaborating more
;-) Also note that while Eric says he drove his optimizations by profiling, I was less smart and
just wandered in the code searching for code that looked unpythonic or unpretty to my eyes — at the risk of
premature or over-optimization. I was more seeking refactors for clarity and code compactness — and
iterators galore because I love them too much for my own good — than real speed optimizations; it
just happens that I seem to find ugly O(n**2) code.
The last thing that can explain why Eric didn’t find the places to optimize at first sight is that
reposurgeon is big and its internal structures were originally made to mirror the fast-import format.
This legacy still shows a lot. That decision was sane at the time, when reposurgeon was less
complex and able than now, and there still are several tangible benefits to the similarity — like
the ability of reposurgeon to round-trip fast-import streams to the exact character. But over the course of
time — and especially in the last few weeks, as Eric and I started to optimize — internal objects
like Commits have come to track more and more of their relationships with their surroundings, to the point that now
they collectively maintain the whole DAG in memory, in both directions.
 At first, Commits only stored the marks of their parents. To find parents, a sweep over the complete
set of events was needed, because a mark is only a string containing a colon and a number, and
marks aren’t even necessarily consecutive… Eric made that computation remember its results,
then switched altogether to storing the commit objects instead, diverging from fast-import towards a
graph representation. For children, I first memoized the function searching for all commits whose
parents contained self, then replaced that altogether by code that stores the children list on commits
but keeps it synchronized at all times with the parent lists. And for tags/resets, Eric and I both tried
to make commits know which tags/resets pointed to them, always kept in sync with the information
on tags telling where they point to.
TL,DR: Some of the inefficiencies were hidden, but most of them were due to the lack of
information stored. Some loops that were only O(n) were actually called O(n) times by another
function — which in a codebase that dense is not easy to spot — and it was not possible to make
the inner loop more efficient short of doing large refactors… All these problems combined tend to
make a poor human’s brain automatically sweep over and search for some other more palatable
optimization. The needed refactors were difficult to do, not because the end result isn’t known but
because the changes had to be made consistently everywhere.
Keeping commits very small and ensuring each state was correct was an imperative goal for me.
Kudos to Eric and his approach to writing code, documentation, and test suites at the same time, or
else none of these refactorings could have happened for fear of breaking everything… And I broke a
lot of things… but noticed right away. Some parts of the code were actually relying on some
invariants that came from the fact that parent and children lists were generated up front! Finding those
was hard and a blocker for the refactorings.
I already said far too much for a small comment, sorry for that :-(
Sigivald
on 2013-03-26 at 18:38:48  said:
BRM said: Sometimes this kind of improvement is worse than a waste of time. The code may be
written in a way that makes it obvious what it is supposed to do and that it is, in fact, doing it.
Reworking code that makes unimportant improvements but also makes the code obscure and subtle
is bad practice.
“Premature optimization” is a related problem.
First, see if it’s slow.
Then, see what part of it is actually making it slow.
Then fix that part.
(And if, as in the quote above, the speed improvement is minor compared to the added complexity,
don’t fix it .)
on 2013-03-26 at 18:53:08  said:
>(Eric had already implemented a rather good COW scheme for PathMaps)
 
memory footprint, and in so doing enabled me to solve a fiendishly subtle bug in branch processing
that had stalled the completion of the Subversion reader for six months. To invoke it, the repository
had to contain a Subversion branch creation, followed by a deletion, followed by a move of another 
branch to the deleted name.
I still don’t know what exactly was wrong with my original implementation, but a small generalization
of Hudson’s code (from CoW filepath sets to CoW filepath maps) enabled me to use it to remove a
particular O(n**2) ancestry computation in which I suspected the bug was lurking. Happily that
suspicion proved correct.
on 2013-03-26 at 20:48:55  said:
By the way, Eric, what profiler did you try to use, and what are you missing in it? What features
would you like to see in a profiler?
esr  
on 2013-03-26 at 21:47:41  said:
>By the way, Eric, what profiler did you try to use, and what are you missing in it? What features
would you like to see in a profiler?
The stock Python profiler. Unfortunately, it’s pretty bad about assigning time to method calls! I’ve
always thought this was odd given that the standard style is so OO.
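
For reference, a minimal invocation of the stock profile/cProfile module looks something like this (the analyze() function is just a stand-in):

    import cProfile
    import pstats

    def analyze():
        # Stand-in for the expensive repository-analysis code.
        return sum(i * i for i in range(10 ** 6))

    cProfile.run("analyze()", "analysis.prof")
    stats = pstats.Stats("analysis.prof")
    stats.sort_stats("cumulative").print_stats(10)   # top 10 entries by cumulative time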
Faré 
on 2013-03-26 at 23:43:09  said:
BTW, SBCL has SB-SPROF for profiling, which is quite informative, though it is not obvious at first
how to read the results.
Faré 
on 2013-03-26 at 23:55:19  said:
(Also, if you do complex shell pipes or string substitutions, INFERIOR-SHELL:RUN is a richer front-
end on top of UIOP:RUN-PROGRAM. An implementation of it on top of EXECUTOR:RUN-PROGRAM
or IOLIB:SPAWN would be nice, but hasn’t been done yet.)
Shenpen
on 2013-03-27 at 05:22:59  said:
>Entertainingly, one of the downsides of being an entirely self-taught programmer 
 Actually I think the standard schoolish way of learning theory, then hands-on experience, then more
work experience, is not useful at all in the two fields I was taught, programming/database design and
business administration. We memorize and barf back theoretical definitions which we don’t care
about because we have no idea what they are good for, and which are often too formal to seem really
useful; we take an exam, forget them, and later on it is hard to apply them to practical problems, or even
realize that the problems we face have anything to do with them.
It would be better to do hands-on practice first, try to figure out solutions, usually fail, then be told
to do it X way without an explanation, and then learn the theory of why we were told so.
Example: I remember memorizing, not really understanding, taking an exam of, and then promptly
forgetting database normalization: 3NF, BCNF, 4NF. Then years later actually designing databases,
figuring out a common sense way of doing it, then realizing this actually sounds something like
BCNF. Then I went back to the textbook, looked up 4NF and actually my design got better. And then
 – realizing it is all too slow and we have to denormalize for speed :-)
Same with business administration: only after many years of work did I get the philosophy of accounting,
and going back to the textbook, things started to make sense.
What would be a world like in which every construction engineer would first work as a mason and
carpenter?
on 2013-03-27 at 05:59:09  said:
It would be nice if you told us more about your dissatisfaction with CLOS. I can think of lack of dot
 
and for situations when it does there are with-slots and with-accessors. Maybe decorators for 
classes/generic functions/methods would not harm, too (macrology and MOP helps, but in some
situations I would indeed prefer decorators as they’re easier to combine and using MOP may switch
off compiler optimizations for CLOS). Other than that, is the problem reduced to the fact that CLOS is
unlike Python’s object model? If so, I’m not sure whether it’s a problem of CLOS or one of Python. I
for one often miss CLOS when I write Python or JavaScript code. Besides multimethods/MOP/etc.
there are other good sides to it, for instance using (func object) instead of object.method notation
makes completion and call tips work much better; also, CLOS is very well suited for “live editing”,
when you make modifications without restarting your program – that’s usually hard to achieve in JS
and very hard in Python.
esr  
on 2013-03-27 at 08:39:11  said:
>It would be nice if you told us more about your dissatisfaction with CLOS.
Dot notation would have been nice, but that’s just syntax and un-Lispy (though you should look into
e7 if any documentation for it is still on the web). I think the “feature” that stuck most in my craw was
having to declare stub generics on penalty of a style warning from the compiler. Bletch! I dislike the
requirement that all methods be globally exposed, too.
For this particular translation, I wanted a class system that simulated Python behavior more closely.
I’m sure this could be done with sufficiently complex macro wrappers but that seemed like a
forbidding amount of work and possibly dangerous to maintainability.
The Monster  
on 2013-03-27 at 09:04:36  said:
> It’s easier to buy memory than more processor speed these days.
The original driver for 64-bit architectures was people who wanted to cache their entire database in
RAM, and the 32-bit machines couldn’t address enough memory to do that.
Jeff Read
 
What would be a world like in which every construction engineer would first work as a
mason and carpenter?
My dad served as a mentor to a couple of UConn mech eng students a few years back for their 
senior project. His big complaint was that while they were smart and knew their physics, they didn’t
know how to machine at all. He thought it terribly important that an engineer gain experience as a
machinist, since a technical drawing is basically a set of instructions to the machinist who will
actually make the part.
on 2013-03-27 at 20:26:31  said:
> Actually I think the standard schoolish way of learning theory, then hands-on
>experience, then more work experience, is not useful at all in the two fields I
>was taught, programming/database design and business administration.
This was my constant complaint about my CS classes. They taught plenty of theory of the various
modern programming techniques, but there was so little practical application, and what little there
was was so contrived (a square is a rectangle is a shape for class inheritance for example) that while
I could give you the reasons why you would want to do these things on an intellectual level, I had no
gut understanding of why you would go through the extra work.
BRM aka Brian R. Marshall 
on 2013-03-27 at 22:54:01  said:
Tangential to the matter at hand, but…
Probably anyone who is into database design has heard this one, but…
1NF, 2NF and 3NF can be described as:
“The key, the whole key and nothing but the key”
Jakub Narebski
 
> [...] what little there was was so contrived (a square is a rectangle is a shape for class inheritance
for example) [...]
Particularly because the square / rectangle relationship is just a bad fit and a bad example of OOP
inheritance (where more specialized class is usually extended, not limited).
Patrick Maupin
@Jakub:
where more specialized class is usually extended, not limited
That’s a really good observation.
LS
“where more specialized class is usually extended, not limited”
“That’s a really good observation.”
Yes, but it just goes to show that while OOP is a good fit for many problems, it doesn’t make things
much easier. Coming up with a really good set of classes, with the right ‘responsibilities’ is difficult.
Finding hidden gotchas in the inheritance hierarchy is difficult. It’s only after you’ve struggled quite a
while with these issues that you end up with a good set of classes that make the actual program
construction easy.
This is not really an OOP thing. If you’re doing plain old procedural programming, the hard part is
figuring out how to partition the problem. Once you do that, everything seems to fall into place.
Either way, what you are doing is actually trying to understand the problem you are trying to solve.
That’s the hard part.
William Newman
 
ESR wrote “Dot notation [for CLOS] would have been nice, but that’s just syntax and un-Lispy ”
It’s not what you’re looking for, but you might at least be amused that I often use a macro
DEF.STRUCT which is a fairly thin layer over stock CL:DEFSTRUCT which, among other things,
makes accessor names use #\. instead of #\- as the separator between structure class name and
slot name. (E.g., after (DEF.STRUCT PLANAR-POINT X Y), PLANAR-POINT.X is the name of an
accessor function.)
More seriously, when you talked earlier about the apparent limitations of CL for printing objects, my
impression is that the CL printer is more powerful and flexible than in most languages. It has some
ugly misfeatures in its design (e.g., making stream printing behavior depend too much on global
special variables instead of per-stream settings). It tends to be slow. But it is fundamentally
functional and flexible enough that on net I’d list it as an advantage of CL vs. most other languages in
something like Peter Norvig’s chart. The feature I’ve pushed hardest on is *PRINT-READABLY*
coupled with complementary readmacros to allow the object to be read back in. In CL, these hooks
are expressive enough to let me do tricks like writing out a complex cyclic data structure at the
REPL, and later scrolling back in the REPL or even in the transcript of a previous session, cutting out
the printed form, pasting it into my new REPL prompt, and getting “the same” thing. (Of course the
implementor of the readmacros needs to decide how “the same” copes with technicalities like shared
structure, e.g. by memoization or not.) I am not an expert in Python 2.x or Ocaml or Haskell, but I’ve
read about them and written thousands of lines of each, and it’s not clear to me that their 
printer/reader configurability is powerful enough to support this.
esr  
on 2013-03-28 at 13:14:35  said:
>More seriously, when you talked earlier about the apparent limitations of CL for printing objects, my
impression is that the CL printer is more powerful and flexible than in most languages.
You may well be right. It wouldn’t surprise me if you were.
But you’re falling into a trap here that I find often besets Lisp advocates (and as I criticize, remember 
that I love the language myself). You’re confusing theoretical availability with practicality to hand. As
you note, and as I had previously noticed, print behavior has ugly dependencies on global variables.
Separately, supposing your “more powerful” is really there, it is difficult to use, requiring arcane and
poorly documented invocations. Contrast this with Python str(), which ten minutes after you’ve first
seen it looks natural enough that any fool can use it.
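(To make the contrast concrete, here is a minimal sketch of the str()/repr() protocol; the PlanarPoint class is purely hypothetical, not anything from reposurgeon:)

class PlanarPoint(object):
    "Hypothetical example: human-readable and unambiguous printing."
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __str__(self):
        # What str() and print show to humans.
        return "(%s, %s)" % (self.x, self.y)
    def __repr__(self):
        # An unambiguous form, conventionally eval()-able.
        return "PlanarPoint(%r, %r)" % (self.x, self.y)

p = PlanarPoint(3, 4)
print(str(p))    # (3, 4)
print(repr(p))   # PlanarPoint(3, 4)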
Programming languages should not be tavern puzzles. The Lispy habit of saying “yes, you can do X
provided you’re willing to sacrifice a goat at midnight and then dance widdershins around a flowerpot”
is one of the reasons Lisp advocates are often dismissed as semi-crackpots. Yes, LISP has
 
on 2013-03-28 at 13:22:53  said:
Those of you who are concerned about software patents have reason to celebrate: Uniloc just got
handed its ass  in their patent suit against Rackspace. In the very same East Texas district court
that patent trolls venue-shop for to get patent-troll-friendly rulings. Uniloc is a notorious, and
heretofore rather successful, patent troll; basically if you do any sort of license verification for a piece
of proprietary software, expect to be sued by Uniloc.
The defense cited not only In re Bilski  but two other, more recent cases: Cybersource v. Retail 
Decisions and Dealertrack v. Huber , which establish that for purposes of the “machine or 
transformation test” of patentability, a general-purpose computer is not a specific enough machine,
and transformation of data is not sufficient transformation.
Given the way the sausage that is law gets made in Murka, I’m not going to say it’s game over for 
software patentholders yet. But their job just got a whole lot harder.
John Wiseman 
> You’re confusing theoretical availability with practicality to hand
Writing the equivalent of simple __str__ and __repr__ methods in Lisp is very easy, it’s not some
theoretically-powerful-but-practically-difficult beast. You just have to know that print-object and *print-
readably* exist, like you have to know that __str__ and __repr__ exist.
If you want to support pretty-printing or printing cyclic data structures in a way that they can be read
back in, then you need to learn some more Lisp, but that’s actually not hard either (well, except
pretty-printing–that can be a beast). As far as I know neither is even possible in Python using the
standard for printing & reading.
 
> Programming languages should not be tavern puzzles. The Lispy habit of saying “yes, you can do X provided you’re willing to sacrifice a goat at midnight and then dance widdershins around a flowerpot” is one of the reasons Lisp advocates are often dismissed as semi-crackpots.
@John Wiseman:
> If you want to support pretty-printing or printing cyclic data structures in a way that they can be read back in, then you need to learn some more Lisp, but that’s actually not hard either (well, except pretty-printing–that can be a beast). As far as I know neither is even possible in Python using the standard for printing & reading.
In the general case, there is usually no real reason to worry about printing for round-tripping in Python, because pickle handles things like circular references quite nicely.
As far as the other goes, there are several ways to pretty print things, including leveraging the standard str() functions by providing your own __str__.
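(A minimal sketch of that pickle round-trip; the Node class here is hypothetical, built only to contain a circular reference:)

import pickle

class Node(object):
    "Hypothetical node type, just enough to form a cycle."
    def __init__(self, name):
        self.name = name
        self.next = None

a, b = Node("a"), Node("b")
a.next, b.next = b, a            # a -> b -> a: a circular structure

blob = pickle.dumps(a)           # pickle copes with the cycle
a2 = pickle.loads(blob)
assert a2.next.next is a2        # the cycle survives the round-trip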
Jessica Boxer
@esr 
> Unfortunately, my experience is that Python profilers suck rather badly
Whatever happened to “batteries included”?
Improving the performance of anything beyond a trivial program without a profiler is like painting a
portrait wearing a blindfold. It is a plain observable fact that programs don’t spend their time where
programmers think they do. It is much more fun to write a cool optimization than an effective one.
 
Which reminds me of the aphorism that those who don’t use UNIX are doomed to reinvent it, badly.[*]
One more reason not to do Python, as if there weren’t enough already.
[*] BTW, I know that isn’t actually what Henry Spencer said, but I didn’t want to use the real one
since plainly ESR is not lacking understanding here, just tools.
esr  
on 2013-03-28 at 16:48:09  said:
>Whatever happened to “batteries included”?
It’s a question I’ve wondered about myself in this case. There aren’t many places where Python fails
to live up to its billing; this is one. Actually, the most serious one I can think of offhand.
>Nonetheless, it sounds like you recognize this and implemented a custom, rube goldberg profiler.
That’s too negative. What I did is often useful in conjunction with profilers even when they don’t suck
 – I sampled a timer after each phase in my repo analysis and reported both elapsed time and
percentages. When several different phases call (for example) the same lookup-commit-by-mark
code, custom instrumentation of the phases can tell you things that function timings alone will not.
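(Roughly this shape – a sketch of per-phase instrumentation, not reposurgeon’s actual code:)

import time

class PhaseTimer(object):
    "Record elapsed time for each named phase and report percentages."
    def __init__(self):
        self.phases = []
        self.mark = time.time()
    def phase(self, name):
        now = time.time()
        self.phases.append((name, now - self.mark))
        self.mark = now
    def report(self):
        total = sum(t for _, t in self.phases) or 1.0
        for name, t in self.phases:
            print("%-24s %8.2fs %5.1f%%" % (name, t, 100.0 * t / total))

timer = PhaseTimer()
# ... read and parse the dump stream ...
timer.phase("stream parsing")
# ... topological analysis ...
timer.phase("topological analysis")
timer.report()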
esr  
>s/Lisp/Linux/g
Linux is not even within orders of magnitude as bad as Lisp is this way – they’re really not
comparable. The real-world evidence for that is penetration levels.
Jay Maynard 
on 2013-03-28 at 17:17:39  said:
Jessica, what is *your* weapon of choice for the problem space Python occupies?
esr
on 2013-03-28 at 17:33:40  said:
>Jessica, what is *your* weapon of choice for the problem space Python occupies?
I’m curious about that myself. I would rate Ruby approximately as good (though with weaker library support), Perl somewhat inferior due to long-term maintainability issues, and nothing else anywhere near as good.
on 2013-03-28 at 18:10:40  said:
> I’m curious about that myself. I would rate Ruby approximately as good (though with weaker library support), Perl somewhat inferior due to long-term maintainability issues, and nothing else anywhere near as good.
Given Jessica’s putative requirements (must be statically typed and work with the .NET framework),
Boo would be the closest thing to Python; but really, C# is a good enough language that few people
working within those constraints have a reason to switch away from it.
Jessica Boxer
on 2013-03-28 at 18:21:03  said:
I’m not really sure what “problem space” Python occupies. It seems to me that “every programming
problem” is its domain, according to its advocates.
Nonetheless, as I have said here a number of times as a general programming language I think C# is
the best system I have used (system including all the peripheral items that make a language usable.)
The problem space Eric is referring to, what I want to call “batch” tools, I find C# excellent for that kind of work.
I doubt you love that answer, but there it is.
 
angularjs as a helper for javascript. It is super cool, very useful, and has probably tripled my speed in writing browser-side code.
I have never used Ruby, but I have read a little about it and know someone who has a lot of expertise.
 Anything that describes itself as “the good parts of perl” is unlikely to be appealing to me – because I
don’t think perl has any good parts.
Jessica Boxer
@esr 
> That’s too negative. What I did is often useful in conjunction with profilers even when they don’t
suck – I sampled a timer after each phase in my repo analysis and reported both elapsed time and
percentages.
I didn’t read your code, but FWIW, a sampling profiler is more than adequate for 95% of profiling
needs. Seems to me that you just created the batteries required, assuming you made it general
enough.
Certainly for optimizing what you need is “show me the top five places my code spends most of its
time”, which is what that gives you. So Rube Goldberg be damned, sounds like a great tool you built.
Patrick Maupin
@esr:
> It’s a question I’ve wondered about myself in this case. There aren’t many places where Python fails to live up to its billing; this is one. Actually, the most serious one I can think of offhand.
In my experience, one of the stock profilers (cProfile) works quite well. But you do have to take into
consideration the number of calls that are made to a given method (this data is reported as well as
total time spent in each call). An attribute lookup for a call is quite properly assigned to the calling
function.
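(For instance – a sketch, with analyze() standing in for whatever you are actually profiling:)

import cProfile
import pstats

def analyze():
    # Stand-in workload; substitute the real entry point.
    return sum(i * i for i in range(100000))

cProfile.run("analyze()", "profile.out")
stats = pstats.Stats("profile.out")
# ncalls is printed alongside tottime/cumtime, so a cheap function
# called millions of times is as visible as one genuinely slow call.
stats.sort_stats("cumulative").print_stats(10)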
> The problem space Eric is referring to, what I want to call “batch” tools, I find C# excellent for that kind of work.
I agree that C# is a good language. Like Python, its domain space is huge, so it’s worthwhile honing
your abilities on a general purpose tool like either one of these rather than remembering arcane batch
syntax.
But I’m not going to use a Microsoft OS and I’m not going to use a non-Microsoft implementation of 
C#, so I’m not using C#.
uma
esr:
A combination of clojure and jython is one possibility.
rrenaud 
on 2013-03-28 at 23:32:45  said:
You have a performance problem, and your first instinct is to rewrite the code in a different language,
rather than find algorithmic bottlenecks? Maybe you should stop hating on computer science
education, and start taking some CS classes?
esr  
on 2013-03-28 at 23:56:23  said:
>You have a performance problem, and your first instinct is to rewrite the code in a different
language, rather than find algorithmic bottlenecks? Maybe you should stop hating on computer 
science education, and start taking some CS classes?
How do you pack that many misconceptions into two sentences? It must take both native talent and
a lot of practice.
>A combination of clojure and jython is one possibility
Intriguing thought. I may try it on a future project.
Jay Maynard 
on 2013-03-29 at 01:59:31  said:
I have a ready-made C# project that I could hack on if I felt the need…though something tells me that
diving into a 500 KLOC package as an introduction to a language may not be that good an idea…
after all, I learned to hate C++ from diving into a then-800 KLOC package…
Jay Maynard 
on 2013-03-29 at 02:03:23  said:
Jessica, I’d say Python’s problem space is that group of programs for which an interpreted language
is good enough, little to no bit-bashing is needed, and its I/O capabilities are good enough. Yeah,
that’s a pretty wide domain, but by no means “every programming problem”.
Jeff Read
on 2013-03-29 at 12:20:44  said:
> A combination of clojure and jython is one possibility
Holy crap, if you thought CL had warts — wait till you get a load of Clojure. I tried wrapping my head
around it for a “joke project”. I’d been joking around on Reddit about L33tStart — a fictional init(1)
replacement written in ClojureScript and running on Node — and decided that such a blasphemous
thing should really, actually exist.
It didn’t take much exposure to Clojure(Script) for me to discover that I was allergic. That combined
with Clojure’s community of twenty-something naïfs (“holy shit, guys, Rich Hickey is such a genius!
 
if you minimize side effects and code in strictly functional style, your programs become simpler and
more tractable!”) is enough to turn me right off the language and actively discourage other smart folks
from adopting it.
 Anyway, Clojure is strictly only as powerful as JScheme or Kawa — so if you like Scheme you can
use one of those and gain all of Clojure’s java-interop advantages, plus the awesomeness of working
directly in (a somewhat reduced form of) Scheme.
rrenaud 
on 2013-03-29 at 13:02:46  said:
So you find the algorithmic bottlenecks and fix them in Python. Then you begin a failed translation to
Lisp for no reason?
on 2013-03-29 at 13:58:10  said:
rrenaud – you’ve messed up your reading comprehension somewhere, or just didn’t read through the
comments from before your initial post – he _thought_ he found them, then attempted a rewrite, then
found more.
 As he said March 25 at 4:06 pm: “””At the time I began looking at Lisp, I believed – mistakenly – that
I had already found and optimized out the stuff that could be attacked that way. In my defense, I will
note that the remaining O(n**2) code was pretty well obscured; it took a couple of weeks of 
concentrated attention by two able hackers to find it, and that was after I’d built the machinery for 
gathering timings.”””
Jay Maynard 
on 2013-03-29 at 15:26:26  said:
rrenaud: Why do you think I said that was unlike Eric? Unlike you, apparently, I do know him
personally and have what I think is a decent grasp on his hacking style, and the idea that he’d commence a port for performance reasons before making sure every last drop of speed was wrung out of it algorithmically is something that he’d normally ridicule with vigor.
esr
on 2013-03-29 at 16:12:35  said:
>the idea that he’d commence a port for performance reasons before making sure every last drop of 
speed was wrung out of it algorithmically is something that he’d normally ridicule with vigor.
Indeed. But to be fair, I didn’t actually give enough information in the OP to exclude the following two
theories: (1) Eric had a momentary attack of brain-damage and behaved in a way he would normally
ridicule, (2) Eric had a momentary attack of “oooh, look at the shiny Lisp” and put more effort into
thinking about a port to that specific language than the evidence justified.
Neither theory is true, mind you. But I can’t entirely blame anyone for entertaining them, because I
didn’t convey the whole sequence of events exactly.
rrenaud’s biggest mistake was to suppose that I hate CS education; in fact, while I have little use for 
the version taught everywhere but a handful of excellent schools like MIT/CMU/Stanford, “mild
contempt” would be a far better description than “hate”. If these places were doing their job properly,
hackers at my skill level and above wouldn’t be rara aves – and I wish that were the case, because
there’s lots of work to go around.
His funniest mistake was that he thought CS education would fix the mistake he believed me to be
making. See above…
on 2013-03-29 at 19:58:24  said:
2) was the theory I’d come up with…figuring you had a sudden need to connect with your roots or 
something. Like I occasionally fire up a CP/M system.
Jay Maynard 
on 2013-03-29 at 20:00:33  said:
 And CS education, to me, seems to be a good way to train people to be computing theorists, which
is almost entirely orthogonal to hacking ability. I’ve never had a single CS course, have no plans to do
 
esr:
 Another possibility is chicken scheme and cython, with possibly a thin layer of “C” glue.
http://www.call-cc.org/ 
on 2013-03-30 at 00:01:33  said:
My experience with performance tuning is that you get the greatest gains by starting with a really bad
algorithm. Fortunately, there are a lot of those lying around.
 janzert
on 2013-03-30 at 03:05:40  said:
It would be interesting to see the performance of pypy on the post optimization version. The question
being, did the algorithmic optimization that was done help or hurt the relative performance of pypy?
esr  
on 2013-03-30 at 06:30:12  said:
>It would be interesting to see the performance of pypy on the post optimization version. The
question being, did the algorithmic optimization that was done help or hurt the relative performance of 
pypy?
It’s easy enough to run that test that I’m doing it now. Timing stock Python on the 56K-commit benchmark repo: 270sec (208 commits/sec). Same with pypy: 178sec (315 commits/sec). That’s interesting – actually a significant speedup this time. I wasn’t seeing that when I wrote the OP, quite
 
might be worth running a bisection to find out what.
Russ Nelson 
on 2013-03-31 at 02:08:21  said:
Warning: Jay and Jessica, if you fail to appreciate Python as the transcendent language of the gods,
you will be replaced by a small Python script after the Singularity!
esr  
on 2013-03-31 at 02:35:32  said:
>Warning: Jay and Jessica, if you fail to appreciate Python as the transcendent language of the
gods, you will be replaced by a small Python script after the Singularity!
“There is another theory which states this has already occurred.”
Jay Maynard 
on 2013-03-31 at 08:16:48  said:
Heh. Python is *my* weapon of choice for the problems it can handle.
Jacob Hallén 
on 2013-03-31 at 19:11:20  said:
Go see the people in the PyPy channel on Freenode about why your code is slow. Slowness is considered to be a bug, unless your code is too short to overcome warmup.
Jeff Read
on 2013-04-01 at 17:54:59  said:
> Warning: Jay and Jessica, if you fail to appreciate Python as the transcendent language of the gods, you will be replaced by a small Python script after the Singularity!
 Any singularity based on Python will itself meet a day of reckoning with the Gods of the Copybook
Headings, who insist that bugs caught at runtime are orders of magnitude more expensive to fix than
bugs caught at compile time.
Traceback (most recent call last):
File “/usr/bin/singularity.py”, line 8643, in run_ai
File “/usr/lib/python2.7/dist-packages/ai/ai.py”, line 137406, in get_neuron_state
File “/usr/lib/python2.7/dist-packages/ai/neuralnet.py”, line 99205, in query_neuron
File “/usr/lib/python2.7/dist-packages/ai/neuron.py”, line 20431, in query_synapse
TypeError: expected object of type ‘SynapseConfiguration’, got ‘NoneType’
Strong static typing systems are not put into languages just to make your lives miserable, folks.
Jakub Narebski
on 2013-04-01 at 21:34:34  said:
@Jeff Read: Strong typing does not necessarily mean static typing, ask ML (or Haskell, I’m not sure
which), with its implied types (and correctness checking that can discover errors in an algorithm by
type mismatch).
Patrick Maupin
@Jakub Narebski:
There are actually three different things that get conflated on typing:
strength
static/dynamic
explicit/implicit
Python has reasonably strong typing that is dynamic and implicit.
 
Typing on older languages is usually explicit. Even C#, which has “implicit” local variables, still
requires variable declarations for those. You tell the compiler — here’s a variable, figure it out based
on its use.
Strong is usually good. Static is usually good. Implicit is usually good. It wasn’t until recently that
you could have all three.
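(A quick, hypothetical illustration of those three axes in Python terms:)

x = 3            # implicit: no declaration needed
x = "three"      # dynamic: the same name can be rebound to a different type

try:
    "3" + 4      # strong: no silent coercion between str and int
except TypeError as err:
    print(err)   # e.g. cannot concatenate 'str' and 'int' objects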
Jeff Read
on 2013-04-04 at 18:49:36  said:
> Strong typing does not necessarily mean static typing, ask ML (or Haskell, I’m not sure which), with its implied types (and correctness checking that can discover errors in an algorithm by type mismatch).
Never said it did. I chose the phrase “strong static typing” specifically to contrast with weak static
typing (e.g., C) and strong dynamic typing (e.g., Python, Lisp).
 Also, both Haskell and ML support type inference.
Alexander Todorov 
on 2013-04-07 at 17:53:02  said:
> I’m well aware of the principle. Unfortunately, my experience is that Python profilers suck rather badly – you generally end up having to write your own instrumentation to gather timings, which is what I did in this case. It helped me find the obscured O(n**2) operations.
Did you use any profiling tools at all? I’m interested to hear if there are any ready-made tools that