Going Faster with Python

Creative Commons License
Going Faster With Python by Asim Ihsan is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Table of contents

Revision History

Introduction

This article is about optimisation in Python. Optimisation means seeking to minimise the usage of some resource during the execution of a software system. Although commonly treated as synonymous with minimising execution time, I’ll also be covering minimising memory occupancy.

This article’s target audience is intermediate-level Python and beginner-level C developers; you’ve written non-trivial code in Python before and can compile and run a “Hello World!” example in C.

Although the article’s primary focus is optimising to reduce the execution time of Python programmes, it will restrict itself to single-process optimisations; parallelising tasks, both on a multicore system using e.g. multiprocessing1 and across many systems using e.g. celery2 or ParallelPython3, is a topic worthy of its own article.

Part 1 - When Do I Optimise?

“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” –Donald Knuth4

If we know software needs to be fast then why can’t we code with this perspective front-and-centre at all times?

There are many problems with this statement, the first being: how fast? The best way to achieve performance requirements is to associate them with precise, quantifiable, and measurable metrics.5 In this case examples would be:

In some environments, such as hard real-time operating system components or mission-critical command-and-control servers, these are reasonable requirements. In others, such as a personal blog, they would be patently ridiculous. Context matters.

Another problem with the “omnipresent” approach is that there is no free lunch. There are many characteristics of a good software implementation, of which performance is only one.6,7 In no particular order some are:

As you prioritise some of the above properties you must sacrifice attention to others. An absolute focus on performance must necessarily come at the cost of, for example, readability and maintainability.

Even if your software system places a premium on performance it is a fallacy to suggest that this may only be addressed at the code tuning level. You have many options, at many stages:8

There is a third, and perhaps the most counter-intuitive, problem with constant optimisation throughout coding. It is not only a standard heuristic but a repeatedly verified observation that software systems tend to spend most of their execution time in a minority of their code.

Given that programs spend the majority of their time in a minority of their code, constantly optimising everything is, in the best case, mostly wasted effort! This “execution locality”, and how it impacts optimisation, is expressed in Amdahl’s Law,12 which describes how the total speedup of a software system after speeding up a constituent subcomponent depends on what fraction of the total time the subcomponent takes up:

\[ \begin{align} &\begin{aligned} \textrm{Execution time}_{\textrm{new}} & = \textrm{Execution time}_{\textrm{old}} \times \left( \left( 1 - \textrm{Fraction}_{\textrm{enhanced}} \right) + \frac{ \textrm{Fraction}_{\textrm{enhanced}}}{\textrm{Speedup}_{\textrm{enhanced}}} \right) \\ \textrm{Speedup}_{\textrm{overall}} & = \frac{\textrm{Execution time}_{\textrm{old}}}{\textrm{Execution time}_{\textrm{new}}} \\ & = \frac{1}{ \left( 1 - \textrm{Fraction}_{\textrm{enhanced}} \right) + \frac{ \textrm{Fraction}_{\textrm{enhanced}}}{\textrm{Speedup}_{\textrm{enhanced}}}} \end{aligned} \end{align} \]

Here’s an example. Suppose we have a web server and there is a routine we could optimise such that it becomes 10 times faster. Assuming that the web server process is busy with this routine 40% of the time, what is the overall speedup after the optimisation?

\[ \begin{align} &\begin{aligned} \textrm{Fraction}_{\textrm{enhanced}} & = 0.4 \\ \textrm{Speedup}_{\textrm{enhanced}} & = 10 \\ \textrm{Speedup}_{\textrm{overall}} & = \frac{1}{(1-0.4) + \frac{0.4}{10}} \\ & = \frac{1}{0.64} \\ & = 1.5625 \end{aligned} \end{align} \]

Indeed, by Amdahl’s Law, the maximum possible speedup of the software system is \(\frac{1}{0.6} \approx 1.67\).
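The same arithmetic as a few lines of Python (a sketch of mine, not from the article’s repository):

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup per Amdahl's Law.
    return 1.0 / ((1.0 - fraction_enhanced) +
                  fraction_enhanced / speedup_enhanced)

print amdahl_speedup(0.4, 10)            # 1.5625
print amdahl_speedup(0.4, float("inf"))  # 1.666...: the ceiling from above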

In short: software optimisation is often best achieved during the requirements analysis and component design stages, but the time will come when you can’t shy away from a performance problem. At this point the next step is determining what is slow, rather than jumping straight into coding; this is Part 2.

In parting I’ll leave you with this:

“If there’s a moral to this story, it is this: do not let performance considerations stop you from doing what is right. You can always make the code faster with a little cleverness. You can rarely recover so easily from a bad design…Design the program you want in the way it should be designed. Then, and only then, should you worry about performance. More often than not, you’ll discover the program is fast enough on your first pass.” –Elliotte Rusty Harold13

Part 2 - How Do I Know Where To Optimise?

If Part 1 is “how fast?”, Part 2, in an extraordinary leap of atrocious grammar, is “where slow?”. Your software architecture is modular with well-defined interfaces, you’ve defined precise, quantifiable, and measurable performance metrics, and lo and behold you’re not meeting them after implementing a significant portion of tested functionality. Now what?

Basic timing

Particularly if you’re looking at system command-line utilities or scientific computing, simply knowing “how much time did this take to finish?” or “how much RAM did it use at its peak?” is a good first step.

Typical approaches to doing this use top and time, and are very simple and effective. Often, however, people forget that measurements must be repeated in order to gain confidence in their accuracy. Hence, to help you get started, I’ve created a noddy little script for this article, src/utilities/measureproc.py.14
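For a single unrepeated run, plain time is the quickest check; a sketch (on OS X the BSD -l flag also reports peak resident set size, while GNU time on Linux uses -v):

/usr/bin/time -l python src/utilities/longscript.py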

Let’s assume we’re using this little toy script:

#!/usr/bin/env python

import time

if __name__ == "__main__":
    a = range(2 ** 24)
    print a[-1]
    time.sleep(1)

In order to make repeated measurements of its CPU and memory usage:

(going_faster_with_python)Mill:going_faster_with_python ai$ pwd
/Users/ai/Programming/going_faster_with_python

(going_faster_with_python)Mill:going_faster_with_python ai$ python src/utilities/measureproc.py python src/utilities/longscript.py
Summary of 5 runs
metric  | min    | Q1     | median | Q3     | max   
--------+--------+--------+--------+--------+-------
clock   | 1.96   | 1.97   | 1.98   | 1.97   | 2.08  
user    | 0.68   | 0.68   | 0.68   | 0.68   | 0.76  
system  | 0.22   | 0.22   | 0.22   | 0.22   | 0.25  
rss_max | 525.39 | 525.39 | 525.39 | 525.39 | 525.39

The script outputs four important metrics:

The script summarises the measurements of these metrics over N runs using:

We note the following observations:

Logging

The humblest yet most important of approaches, logging is an old friend to most of us. In fact some of you may be thinking “Why am I even bothering with this article? Everyone knows that configurable tracing is critical in any production system.” Although logging isn’t the focus of this article I wanted to cover some quick points regarding it.

With such discipline, which in time becomes habit, you can react quite effectively to a wide variety of inquiries, such as:

For Django fans django-debug-toolbar18 helps tie together the behaviour of Django components throughout the request cycle, including SQL queries and time taken to execute them.
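One habit worth singling out: let Logger.exception15 capture tracebacks for you instead of formatting them by hand. A minimal sketch (parse_metric is a hypothetical helper, not from the article):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def parse_metric(line):
    try:
        epoch, name, value = line.rstrip("\n").split(",")
        return (int(epoch), name, int(value))
    except ValueError:
        # Logger.exception logs at ERROR level and appends the traceback.
        logger.exception("could not parse line: %r", line)
        return None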

PyCounter

TODO

Decorator that records frequency and time spent for individual functions.

To install: pip install pycounters.

To use:

import pycounters
import logging

# Report all counter values through a standard logging logger,
# automatically, every 300 seconds.
reporter = pycounters.reporters.LogReporter(logging.getLogger("counters"))
pycounters.register_reporter(reporter)
pycounters.start_auto_reporting(seconds=300)
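The setup above only reports; to record something you decorate the functions of interest. A minimal sketch, assuming PyCounters’ report_start_end decorator (the name is from my reading of its docs, and handle_request is a hypothetical function):

@pycounters.report_start_end("requests")  # counts calls and times each one
def handle_request(request):              # hypothetical function
    pass  # ... do the real work here ...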

For more information see the PyCounters documentation.19

CPU profiling

Detailed logging, and coming up with efficient and useful toolchains for analysing the resulting logs, can be a chore. Fortunately, particularly if you’re dealing with smaller command-line-based tools or scientific computing, CPU profiling is another option.

In CPU profiling you gather information about function call chains (who calls whom) and how long functions take to return. There are two types of CPU profiling methods:

  1. Deterministic profiling. Your measurements are comprehensive, in that you record every single function call at the greatest possible precision. If the effort of such measurement does not impede the actual functionality of the software system this method is fine, but this is a big assumption.
  2. Statistical profiling. Your measurements are sporadic but regular, with a configurable interval. By being sporadic the hope is that you have less, and hopefully negligible, impact on the actual functionality of the software system, at the cost of less precise measurements.

Let’s run through a variety of CPU profilers with a toy example.

Our toy example

Let’s assume there is a BZIP2 compressed log file that we want to parse, and that each line is a measurement of a metric in the format epoch, metric name, metric value, e.g.:

1362330056,cpu_usage,112
1362330057,cpu_usage,21
...

Our objective is to determine the arithmetic mean of the cpu_usage values throughout the whole file.

I’ve created a script src/utilities/generate_log.py20 that will generate a log in this format into the path src/utilities/example.log.bz2. Please run this script before continuing.

In order to parse this log we’re using src/cpu_profiling/parse_log.py.21 It has shown poor performance and we want to see which parts are slow (I’ve deliberately made this script very non-idiomatic and obtuse in places; this is not an example of good code, unsurprising considering the context):

#!/usr/bin/env python

from __future__ import division

import os
import sys
import bz2
import contextlib
import re

# -------------------------------------------------------------------
#   Constants.
# -------------------------------------------------------------------
current_dir = os.path.abspath(os.path.join(__file__, os.pardir))
parent_dir = os.path.join(current_dir, os.pardir)
log_filepath = os.path.join(parent_dir, "utilities", "example.log.bz2")

re_log_line = re.compile("(.*?),(.*?),(.*)\n")
# -------------------------------------------------------------------

def main():
    cpu_usages = []
    with contextlib.closing(bz2.BZ2File(log_filepath)) as f_in:
        for line in f_in:
            process_line(line, cpu_usages)
    summarise(cpu_usages)

def summarise(cpu_usages):
    print "avg: %s" % (sum(cpu_usages) / len(cpu_usages), )

def process_line(line, cpu_usages):
    re_obj = re_log_line.search(line)
    try:
        elems = re_obj.groups()
    except:
        pass
    else:
        if elems[1] == "cpu_usage":
            cpu_usages.append(int(elems[2]))

if __name__ == "__main__":
    main()

Looking at this script there are many possible reasons for performance problems:

These are all valid points. However, put aside the toy script for a moment and consider the bigger picture. You may instead be faced with a convoluted, complex, and poorly documented system, where such points are not obvious. Hence, instead of jumping in and “optimising” the code, your response to anyone who suggests such “optimisations”, in simple or complex scenarios, is always:

"Before we waste time on opinions I’d like to proceed on the basis of empirical measurements.

cProfile

cProfile22 is a deterministic profiler that’s part of the Python standard library. It’s easy to use, efficient enough that it has negligible impact on many programmes, and with a little trickery can be used with a decorator to profile individual functions.

To use it on our toy example:

python -m cProfile -o profile.stats parse_log.py

This will output a file profile.stats to the current working directory. It contains low-level details that may be parsed out and summarised. One command I like using sorts the statistics by total time spent in particular functions, as follows:

(going_faster_with_python)Mill:src ai$ python -c "import pstats; p = pstats.Stats('profile.stats');
p.sort_stats('time').print_stats(5)"
Sun Mar  3 17:49:04 2013    profile.stats

         20000398 function calls (20000376 primitive calls) in 31.770 seconds

   Ordered by: internal time
   List reduced from 67 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5000000   12.930    0.000   23.044    0.000 cpu_profiling/parse_log.py:31(process_line)
        1    8.666    8.666   31.756   31.756 cpu_profiling/parse_log.py:21(main)
  5000000    7.917    0.000    7.917    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
  5000000    1.621    0.000    1.621    0.000 {method 'groups' of '_sre.SRE_Match' objects}
  5000065    0.575    0.000    0.575    0.000 {method 'append' of 'list' objects}

Interpreting the results:

What can we conclude from the results? These conclusions are ordered from most to least important:

You can also use the cProfile output to trace which functions are calling whom. Rather than cover that in detail here I’m going to cover the same functionality with a better user interface in the “callgrind” section below. However, here is how you’d get such information from the command line for who is calling the hot functions:

(going_faster_with_python)Mill:src ai$ python -c "import pstats;
p = pstats.Stats('profile.stats');
p.sort_stats('cumulative').print_callers(5)"

   Ordered by: cumulative time
   List reduced from 67 to 5 due to restriction <5>

Function                                         was called by...
                                                     ncalls  tottime  cumtime
cpu_profiling/parse_log.py:3(<module>)           <-
cpu_profiling/parse_log.py:21(main)              <-       1    7.869   27.992  cpu_profiling/parse_log.py:3(<module>)
cpu_profiling/parse_log.py:31(process_line)      <- 5000000   11.267   20.078  cpu_profiling/parse_log.py:21(main)
{method 'search' of '_sre.SRE_Pattern' objects}  <- 5000000    6.946    6.946  cpu_profiling/parse_log.py:31(process_line)
{method 'groups' of '_sre.SRE_Match' objects}    <- 5000000    1.360    1.360  cpu_profiling/parse_log.py:31(process_line)

and for who the hot functions are calling:

(going_faster_with_python)Mill:src ai$ python -c "import pstats;
p = pstats.Stats('profile.stats');
p.sort_stats('cumulative').print_callees(5)"

   Ordered by: cumulative time
   List reduced from 67 to 5 due to restriction <5>

Function                                         called...
                                                     ncalls  tottime  cumtime
cpu_profiling/parse_log.py:3(<module>)           ->       3    0.000    0.000  /Users/ai/Programming/.envs/going_faster_with_python/lib/python2.7/posixpath.py:60(join)
                                                          1    0.000    0.000  /Users/ai/Programming/.envs/going_faster_with_python/lib/python2.7/posixpath.py:341(abspath)
                                                          1    0.000    0.000  /Users/ai/Programming/.envs/going_faster_with_python/lib/python2.7/re.py:188(compile)
                                                          1    0.000    0.000  /usr/local/Cellar/python/2.7.3/lib/python2.7/__future__.py:48(<module>)
                                                          1    0.000    0.000  /usr/local/Cellar/python/2.7.3/lib/python2.7/contextlib.py:1(<module>)
                                                          1    7.869   27.992  cpu_profiling/parse_log.py:21(main)
cpu_profiling/parse_log.py:21(main)              ->       1    0.000    0.000  /usr/local/Cellar/python/2.7.3/lib/python2.7/contextlib.py:149(__init__)
                                                          1    0.000    0.000  /usr/local/Cellar/python/2.7.3/lib/python2.7/contextlib.py:151(__enter__)
                                                          1    0.000    0.001  /usr/local/Cellar/python/2.7.3/lib/python2.7/contextlib.py:153(__exit__)
                                                          1    0.000    0.045  cpu_profiling/parse_log.py:28(summarise)
                                                    5000000   11.267   20.078  cpu_profiling/parse_log.py:31(process_line)
cpu_profiling/parse_log.py:31(process_line)      -> 5000000    0.505    0.505  {method 'append' of 'list' objects}
                                                    5000000    1.360    1.360  {method 'groups' of '_sre.SRE_Match' objects}
                                                    5000000    6.946    6.946  {method 'search' of '_sre.SRE_Pattern' objects}
{method 'search' of '_sre.SRE_Pattern' objects}  ->
{method 'groups' of '_sre.SRE_Match' objects}    ->

A final trick with cProfile is that you can craft a decorator to trigger it only for particular functions. This is useful when the overhead of cProfile over all the code is too high, but you need a profile of a function and the subfunctions it calls:23

import cProfile

def profileit(name):
    def inner(func):
        def wrapper(*args, **kwargs):
            prof = cProfile.Profile()
            retval = prof.runcall(func, *args, **kwargs)
            # Note use of name from outer scope
            prof.dump_stats(name)
            return retval
        return wrapper
    return inner

@profileit("profile_for_func1_001")
def func1():
    pass  # ... function body ...
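The dumped file can then be inspected with pstats exactly as before, e.g.:

python -c "import pstats; pstats.Stats('profile_for_func1_001').sort_stats('time').print_stats(5)"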

line_profiler

cProfile is rather coarse: because it only traces function calls, it is only precise at the function level. Using a module called line_profiler we can measure CPU occupancy at the line level.

After a pip install line_profiler you just need to decorate the functions you’re interested in with @profile. Note that this renders the script unexecutable on its own: @profile is injected by kernprof at profiling time rather than imported, so a normal python run raises a NameError. I’ve already done the decorating in src/cpu_profiling/parse_log_line_profiler.py. Afterwards execute kernprof.py:

(going_faster_with_python)Mill:src ai$ kernprof.py -l -v cpu_profiling/parse_log_line_profiler.py

avg: 49.9978002
Wrote profile results to parse_log_line_profiler.py.lprof
Timer unit: 1e-06 s

File: cpu_profiling/parse_log_line_profiler.py
Function: main at line 21
Total time: 105.217 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    21                                           @profile
    22                                           def main():
    23         1            2      2.0      0.0      cpu_usages = []
    24         1           34     34.0      0.0      with contextlib.closing(bz2.BZ2File(log_filepath)) as f_in:
    25   5000001     11602598      2.3     11.0          for line in f_in:
    26   5000000     93565103     18.7     88.9              process_line(line, cpu_usages)
    27         1        49088  49088.0      0.0      summarise(cpu_usages)

File: cpu_profiling/parse_log_line_profiler.py
Function: process_line at line 32
Total time: 44.2081 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    32                                           @profile
    33                                           def process_line(line, cpu_usages):
    34   5000000     14758591      3.0     33.4      re_obj = re_log_line.search(line)
    35   5000000      4406648      0.9     10.0      try:
    36   5000000      6765236      1.4     15.3          elems = re_obj.groups()
    37                                               except:
    38                                                   pass
    39                                               else:
    40   5000000      5814440      1.2     13.2          if elems[1] == "cpu_usage":
    41   5000000     12463137      2.5     28.2              cpu_usages.append(int(elems[2]))

The output is quite intuitive and you’ll note it confirms many of our intuitions from the “cProfile” section. We already knew that process_line() was quite slow. However, line_profiler indicates that appending to a list in order to track CPU usage takes up about 28% of the execution time of the function. Is it, however, the append itself, the int conversion, or the tuple indexing that takes up this time? You’d need to do a bit of refactoring and spread the code over several lines to aid line_profiler, as in the sketch below.
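A sketch of that refactoring (mine, not in the article’s repository); the regex is copied from parse_log.py, and @profile is injected by kernprof rather than imported:

import re

re_log_line = re.compile("(.*?),(.*?),(.*)\n")

@profile  # injected into builtins by kernprof
def process_line(line, cpu_usages):
    re_obj = re_log_line.search(line)
    try:
        elems = re_obj.groups()
    except AttributeError:  # the article's bare except, narrowed here
        pass
    else:
        if elems[1] == "cpu_usage":
            raw_value = elems[2]       # tuple indexing
            value = int(raw_value)     # string-to-int conversion
            cpu_usages.append(value)   # the append itself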

However, the “total time” values seem a bit off. Why did cProfile run the script in 30-odd seconds while this one spends eons just in the individual functions? Recall that both cProfile and line_profiler are deterministic profilers; in fact the actual execution of our script is faster without such profiling (although cProfile doesn’t add that much overhead):

(going_faster_with_python)Mill:src ai$ python utilities/measureproc.py python cpu_profiling/parse_log.py

Summary of 5 runs
metric  | min   | Q1    | median | Q3    | max  
--------+-------+-------+--------+-------+------
clock   | 24.15 | 24.23 | 24.67  | 26.02 | 31.76
user    | 24.00 | 24.12 | 24.48  | 25.86 | 30.46
system  | 0.07  | 0.09  | 0.09   | 0.11  | 0.15 
rss_max | 46.02 | 46.04 | 46.04  | 46.04 | 46.05

callgrind

TODO

This is far superior to RunSnakeRun, and to boot is actually usable.

Installation on Mac OS X:

To use this:
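One plausible route (an assumption on my part; pyprof2calltree is the same converter the profilestats section below mentions) is to convert cProfile output to callgrind format and open it in qcachegrind:

pip install pyprof2calltree
pyprof2calltree -i profile.stats -k    # -k opens the result in {k,q}cachegrind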

For more information see the KCachegrind home page and the OS X installation guide.25,26

profilestats

TODO

Decorator for profiling individual functions then converting the profiling data to kcachegrind format. Of course you could just use the cProfile decorator trick explained above and then call pyprof2calltree, but to each their own.

To install this: pip install profilestats.

Usage:
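A minimal sketch, assuming the decorator interface described on the profilestats PyPI page (the print_stats argument and the cachegrind-format output file are my reading of it):

from profilestats import profile

@profile(print_stats=10)  # prints the top 10 entries on exit; cachegrind-format
                          # output is also dumped for kcachegrind to open
def expensive_function():
    return sum(x * x for x in xrange(10 ** 6))

if __name__ == "__main__":
    print expensive_function()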

For more information see the profilestats PyPI page.27

statprof

TODO

Statistical profiler. Intended to have a lighter impact than cProfile: it samples the call stack at regular intervals rather than deterministically tracing every call.

To install: pip install statprof.

To use:

import statprof

statprof.start()
try:
    my_questionable_function()
finally:
    statprof.stop()
    statprof.display()

For more information see the statprof repository.28

plop

TODO

Statistical profiler, low CPU overhead.

To install: pip install plop tornado

To use:
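A sketch based on my reading of the plop README (treat the module paths and the --datadir flag as assumptions): run the script under the collector, which writes a profile under ./profiles, then browse it in the Tornado-based viewer:

python -m plop.collector cpu_profiling/parse_log.py
python -m plop.viewer --datadir=profiles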

For more information see the plop repository and the Dropbox tech blog post.29,30

Memory profiling

TODO

pympler

TODO
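A minimal sketch, assuming two documented Pympler entry points: asizeof for deep object sizes and SummaryTracker for diffing heap snapshots:

from pympler import asizeof, tracker

data = [range(100) for _ in xrange(100)]
print asizeof.asizeof(data)  # deep size in bytes, including referents

tr = tracker.SummaryTracker()
more_data = [dict.fromkeys(xrange(100)) for _ in xrange(100)]
tr.print_diff()              # objects allocated since the tracker was created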

meliae

TODO
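A minimal sketch, assuming meliae’s documented scanner/loader API: dump the object graph from the running process, then analyse the dump offline:

from meliae import scanner
scanner.dump_all_objects("memory.json")  # dump every reachable object to JSON

# Later, typically in a separate interpreter session:
from meliae import loader
om = loader.load("memory.json")
print om.summarize()  # object counts and total sizes, grouped by type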

Part 3 - How Do I Optimise?

Introduction

CPython and Bytecode Analysis

import this
for i in xrange(5):
    print i
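The standard library’s dis module is the usual starting point for bytecode analysis; a minimal sketch (mine) disassembling a function much like the loop above:

import dis

def count_to_five():
    for i in xrange(5):
        print i

dis.dis(count_to_five)  # one line per bytecode instruction, e.g. FOR_ITER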

Cython

numpy

Cython with numpy

PyPy

Part 4 - Case Study 1 - A Log Parser

Introduction

Use Cases

Initial Code

Initial Profiling

Part 5 - Case Study 2 - N-gram Language Models

Introduction

Use Cases

Initial Code

Initial Profiling

References

Cramer, David. “Django-debug-toolbar.” Accessed February 26, 2013. https://github.com/django-debug-toolbar/django-debug-toolbar.

Darnell, Ben. “Plop: Low-overhead Profiling for Python.” Accessed March 06, 2013. https://tech.dropbox.com/2012/07/plop-low-overhead-profiling-for-python/.

———. “Plop.” Accessed March 06, 2013. https://github.com/bdarnell/plop.

Fowler, Martin, and Kent Beck. Refactoring: Improving the Design of Existing Code. Addison-Wesley Professional, 1999.

Hellman, Doug. “CProfile.” Accessed February 26, 2013. http://pymotw.com/2/profile/index.html.

Ihsan, Asim. “Generate_log.py.” Accessed March 03, 2013. https://github.com/asimihsan/going_faster_with_python/blob/master/src/utilities/generate_log.py.

———. “Measureproc.py.” Accessed March 03, 2013. https://github.com/asimihsan/going_faster_with_python/blob/master/src/utilities/measureproc.py.

———. “Parse_log.py.” Accessed March 03, 2013. https://github.com/asimihsan/going_faster_with_python/blob/master/src/cpu_profiling/parse_log.py.

Knuth, Donald E. “An Empirical Study of FORTRAN Programs.” Software: Practice and Experience 1, no. 2 (1971): 105–133.

———. “Structured Programming With Go To Statements.” Computing Surveys 6 (1974): 261–301.

Leskes, Boaz. “PyCounter.” Accessed March 06, 2013. http://pycounters.readthedocs.org/en/latest/.

McConnell, Steve. Code Complete. Microsoft Press, 2009.

Oram, Andy, and Greg Wilson. Beautiful Code: Leading Programmers Explain How They Think. O’Reilly Media, Incorporated, 2007.

O’Sullivan, Bryan. “Statprof.” Accessed March 06, 2013. https://github.com/bos/statprof.py.

Patterson, David A., and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 2009.

PythonDocs. “Logger.exception.” Accessed February 26, 2013. http://docs.python.org/2/library/logging.html#logging.Logger.exception.

———. “Multiprocessing.” Accessed February 24, 2013. http://docs.python.org/2/library/multiprocessing.html.

Schlichting, Hanno. “Profilestats.” Accessed March 06, 2013. https://pypi.python.org/pypi/profilestats.

Sedgewick, Robert, and Kevin Wayne. Algorithms. 4 ed. Addison-Wesley Professional, 2011.

Sentry. “Sentry.” Accessed February 26, 2013. https://www.getsentry.com/.

Sissel, Jordan. “Logstash.” Accessed February 26, 2013. http://www.logstash.net/.

Solem, Ask. “Celery.” Accessed February 24, 2013. http://www.celeryproject.org/.

Tsui, Frank, and Orlando Karam. Essentials of Software Engineering. Jones & Bartlett Learning, 2010.

Vanovschi, Vitalii. “ParallelPython.” Accessed February 24, 2013. http://www.parallelpython.com/.

Weidendorfer, Josef. “KCachegrind.” Accessed March 06, 2013. http://kcachegrind.sourceforge.net/html/Home.html.

———. “KCachegrind Source.” Accessed March 06, 2013. http://kcachegrind.sourceforge.net/html/Download.html.

Williams, Lloyd G., and Connie U. Smith. “Five Steps To Solving Software Performance Problems.” Software Engineering Research and Performance Engineering Services (2002).

jsdalton. “CProfile Decorator (Stackoverflow).” Accessed March 03, 2013. http://stackoverflow.com/questions/5375624/a-decorator-that-profiles-a-method-call-and-logs-the-profiling-result.

justfalter. “How To Install Qcachegrind (kcachegrind) on Mac OSX Snow Leopard.” Accessed March 06, 2013. http://langui.sh/2011/06/16/how-to-install-qcachegrind-kcachegrind-on-mac-osx-snow-leopard/.


  1. PythonDocs, “Multiprocessing,” accessed February 24, 2013, http://docs.python.org/2/library/multiprocessing.html.

  2. Ask Solem, “Celery,” accessed February 24, 2013, http://www.celeryproject.org/.

  3. Vitalii Vanovschi, “ParallelPython,” accessed February 24, 2013, http://www.parallelpython.com/.

  4. Donald E. Knuth, “Structured Programming With Go To Statements,” Computing Surveys 6 (1974): 261–301.

  5. Lloyd G. Williams and Connie U. Smith, “Five Steps To Solving Software Performance Problems,” Software Engineering Research and Performance Engineering Services (2002): 2.

  6. Frank Tsui and Orlando Karam, Essentials of Software Engineering (Jones & Bartlett Learning, 2010), chap. 9 ‘Implementation.’

  7. Martin Fowler and Kent Beck, Refactoring: Improving the Design of Existing Code (Addison-Wesley Professional, 1999), chap. 2 ‘Principles in Refactoring.’

  8. Steve McConnell, Code Complete (Microsoft Press, 2009), chap. 25 ‘Code-Tuning Strategies.’

  9. Robert Sedgewick and Kevin Wayne, Algorithms, 4 ed. (Addison-Wesley Professional, 2011).

  10. Donald E. Knuth, “An Empirical Study of FORTRAN Programs,” Software: Practice and Experience 1, no. 2 (1971): 105–133.

  11. David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface (Morgan Kaufmann, 2009), chap. 1 ‘Fundamentals of Computer Design.’

  12. Ibid., chap. 1 ‘Fundamentals of Computer Design.’

  13. Andy Oram and Greg Wilson, Beautiful Code: Leading Programmers Explain How They Think (O’Reilly Media, Incorporated, 2007), chap. 5 ‘Correct, Beautiful, Fast.’

  14. Asim Ihsan, “Measureproc.py,” accessed March 03, 2013, https://github.com/asimihsan/going_faster_with_python/blob/master/src/utilities/measureproc.py.

  15. PythonDocs, “Logger.exception,” accessed February 26, 2013, http://docs.python.org/2/library/logging.html#logging.Logger.exception.

  16. Sentry, “Sentry,” accessed February 26, 2013, https://www.getsentry.com/.

  17. Jordan Sissel, “Logstash,” accessed February 26, 2013, http://www.logstash.net/.

  18. David Cramer, “Django-debug-toolbar,” accessed February 26, 2013, https://github.com/django-debug-toolbar/django-debug-toolbar.

  19. Boaz Leskes, “PyCounter,” accessed March 06, 2013, http://pycounters.readthedocs.org/en/latest/.

  20. Asim Ihsan, “Generate_log.py,” accessed March 03, 2013, https://github.com/asimihsan/going_faster_with_python/blob/master/src/utilities/generate_log.py.

  21. Asim Ihsan, “Parse_log.py,” accessed March 03, 2013, https://github.com/asimihsan/going_faster_with_python/blob/master/src/cpu_profiling/parse_log.py.

  22. Doug Hellman, “CProfile,” accessed February 26, 2013, http://pymotw.com/2/profile/index.html.

  23. Jsdalton, “CProfile Decorator (Stackoverflow),” accessed March 03, 2013, http://stackoverflow.com/questions/5375624/a-decorator-that-profiles-a-method-call-and-logs-the-profiling-result.

  24. Josef Weidendorfer, “KCachegrind Source,” accessed March 06, 2013, http://kcachegrind.sourceforge.net/html/Download.html.

  25. Josef Weidendorfer, “KCachegrind,” accessed March 06, 2013, http://kcachegrind.sourceforge.net/html/Home.html.

  26. Justfalter, “How To Install Qcachegrind (kcachegrind) on Mac OSX Snow Leopard,” accessed March 06, 2013, http://langui.sh/2011/06/16/how-to-install-qcachegrind-kcachegrind-on-mac-osx-snow-leopard/.

  27. Hanno Schlichting, “Profilestats,” accessed March 06, 2013, https://pypi.python.org/pypi/profilestats.

  28. Bryan O’Sullivan, “Statprof,” accessed March 06, 2013, https://github.com/bos/statprof.py.

  29. Ben Darnell, “Plop,” accessed March 06, 2013, https://github.com/bdarnell/plop.

  30. Ben Darnell, “Plop: Low-overhead Profiling for Python,” accessed March 06, 2013, https://tech.dropbox.com/2012/07/plop-low-overhead-profiling-for-python/.