High Performance Python: Profiling

Note

This article is part of an ongoing series about profiling Python programs that starts here [4].

Note

This text is a draft and may be subject to changes.

There are two things we can profile, execution time and memory consumption.

Profiling Execution Time, Part I

From the Linux command line we can profile the execution time of any program with the time command:

$time ./main.py  # This starts fighty

real    0m14.098s
user    0m13.937s
sys     0m0.132s

To implement fighty, I first wrote the roster. It is a regular Python class which has a list fighters as member, and will gradually be filled with the API methods. One method we need here is populate(), which, ah, populates the list with fighters:

class Roster(object):

    def __init__(self):
        self.fighters = None

    def populate(self, n):
        self.fighters = []
        for i in range(0, n):
            self.fighters.append(Fighter())

As data structure for a fighter I chose a class. Why this choice is not so bad at all, even memory-wise, will become evident later on.

So, each fighter is an instance of class Fighter and its attributes are initialized with random values:

class FightData(object):
    def __init__(self):
        self.ini = None
        self.pos = None
        self.hp = None
        self.killed_in_round = None
        self.kills = None
        self.rank = None
        self.opp = None
        self.hits_sent = None
        self.misses_sent = None
        self.dmg_sent = None
        self.hits_rcvd = None
        self.misses_rcvd = None


class Fighter(object):

    ID_SEQ = 1

    def __init__(self, **kw):
        self.fight_data = FightData()
        if not kw:
            self.populate()
        else:
            for k, v in kw:
                setattr(self, k, v)
            if self.id > self.__class__.ID_SEQ:
                self.__class__.ID_SEQ = self.id

    def populate(self):
        self.id = self.__class__.ID_SEQ
        self.__class__.ID_SEQ += 1
        self.hp = 5
        self.ini0 = 3
        self.ini1 = 6
        self.att0 = 6
        self.att1 = 9
        self.dfn0 = 5
        self.dfn1 = 8
        self.dmg0 = 3
        self.dmg1 = 6
        ap = 10
        while ap > 0:
            r = randint(0, 8)
            ap -= 1
            if r == 0:
                self.ini0 += 1
            elif r == 1:
                self.ini1 += 1
            elif r == 2:
                self.att0 += 1
            elif r == 3:
                self.att1 += 1
            elif r == 4:
                self.dfn0 += 1
            elif r == 5:
                self.dfn1 += 1
            elif r == 6:
                self.dmg0 += 1
            elif r == 7:
                self.dmg1 += 1
            elif r == 8:
                self.hp += 1

I was curious how Python would perform when I populate the roster with, say, 100,000 fighters. The result you see in the time example above, — and I must say, I am somewhat disappointed. Python took way longer as expected, and that only to populate the list in memory. Also my memory gauge was ascending noticably!

The latter phenomenon lead me to abandon profiling execution time at this point and rather look at the memory consumption. We will return to this topic later.

Profiling Memory Consumption

Memory profiling in Python in my opinion is a really rocky subject. The standard library does not contain any module to help here, there is only sys.getsizeof(), which is okay for simple data types, but will lie to you about more complex ones like dicts or classes.

More elaborate is Guppy [5], almost a suite, which contains a module heapy to measure the memory consumption. But documentation sucks. I was not able to figure out how to use it until I found this tutorial [5]. And its biggest issue is that Guppy seems not to be maintained any more, the latest package on PyPI dates from 2008 (Python 2.6). I was able to compile it for Python 2.7, but on Python 3.2 compilation fails.

Then there is Pympler [6], another module to analyse memory consumption. It seems rather modern, works with Python 3.2, but—documentation sucks. Anyways, we will use it here.

Here is a simple test script that, based on Pympler, prints out the used memory for several simple and more complex data types.

import sys
import random
from pympler.asizeof import asizeof, leng

class Fighter_double:
    def __init__(self):
        self.ini = random.randint(1, 10**7) * 1.0
        self.att = random.randint(1, 10**7) * 1.0
        self.dfn = random.randint(1, 10**7) * 1.0
        self.dmg = random.randint(1, 10**7) * 1.0
        self.hp = random.randint(1, 10**7) * 1.0


class Fighter(object):
    def __init__(self):
        self.ini = random.randint(1, 10**7)
        self.att = random.randint(1, 10**7)
        self.dfn = random.randint(1, 10**7)
        self.dmg = random.randint(1, 10**7)
        self.hp = random.randint(1, 10**7)


class Fighter_slots(object):
    __slots__ = ['ini', 'att', 'dfn', 'dmg', 'hp']
    def __init__(self):
        self.ini = random.randint(1, 10**7)
        self.att = random.randint(1, 10**7)
        self.dfn = random.randint(1, 10**7)
        self.dmg = random.randint(1, 10**7)
        self.hp = random.randint(1, 10**7)

def p(s, v):
    v1 = asizeof(v)
    v1b = sys.getsizeof(v)
    try:
        v2 = len(v)
    except TypeError:
        v2 = '-'
    v3 = leng(v)
    if v3 is None:
        v3 = '-'
    print("{:25} | asizeof={:5n} | sizeof={:3n} | len={:>5} | leng={:>5}".format(s, v1, v1b, v2, v3))

def pe(s, v, e, v0):
    l1 = len(v)
    l2 = leng(v)
    se = asizeof(e)
    sv = asizeof(v)
    sv0 = asizeof(v0)
    s_per_e = int(0.5 + (sv - sv0) / l1)
    e_diff = int(0.5 + s_per_e - se)
    print(s)
    print("  len:               {:10n}".format(l1))
    print("  leng:              {:10n}".format(l2))
    print("  Size total:        {:10n}".format(sv))
    print("  Size empty:        {:10n}".format(sv0))
    print("  Element size:      {:10n}".format(se))
    print("  Size per element:  {:10n}".format(s_per_e))
    print("  Element size diff: {:10n}".format(e_diff))

def create_fighters():
    i = 12
    d = 4.5
    s0 = ""
    s = "übung"
    s1 = "1"
    s2 = "ü"
    di0 = {}
    di = {'foo': 1}
    di1 = {'foo': 1, 'bar': 2}
    di2 = {'foobar': 1}
    li0 = []
    li1 = [1]
    li2 = [1,2,3]
    tp0 = tuple()
    tp1 = (1,)
    tp2 = (1,2,3)
    f_double = Fighter_double()
    f = Fighter()
    f_slots = Fighter_slots()

    p("INT", i)
    p("DOUBLE", d)
    p("STRING ('')", s0)
    p("STRING ('übung')", s)
    p("STRING ('1')", s1)
    p("STRING ('ü')", s2)
    p("DICT (empty)", di0)
    p("DICT (1 INT)", di)
    p("DICT (1 INT long key)", di2)
    p("DICT (2 INTs)", di1)
    p("LIST (empty)", li0)
    p("LIST (1 INT)", li1)
    p("LIST (3 INTs)", li2)
    p("TUPLE (empty)", tp0)
    p("TUPLE (1 INT)", tp1)
    p("TUPLE (3 INTs)", tp2)
    p("Fighter double", f_double)
    p("Fighter", f)
    p("Fighter slots", f_slots)

    print()

    n = 10**2
    # Careful to not generate duplicate items in list. asizeof() will count
    # references to the same item only once!
    a = random.sample(range(n), n)
    pe("LIST of INTs", a, a[0], [])
    print()
    b = tuple(a)
    pe("TUPLE of INTs", b, b[0], tuple())
    print()
    a = [ Fighter() for i in range(n) ]
    pe("LIST of Fighters", a, a[0], [])
    print()
    a = [ Fighter_slots() for i in range(n) ]
    pe("LIST of Fighters w/ slots", a, a[0], [])
    print()
    a = [ [random.randint(1,10**7)
           , random.randint(1,10**7)
           , random.randint(1,10**7)
           , random.randint(1,10**7)
           , random.randint(1,10**7) ] for i in range(n) ]
    pe("LIST of LIST w/ 5 INTs", a, a[0], [])
    print()
    a = tuple([ (random.randint(1,10**7)
           , random.randint(1,10**7)
           , random.randint(1,10**7)
           , random.randint(1,10**7)
           , random.randint(1,10**7) ) for i in range(n) ])
    pe("TUPLE of TUPLE w/ 5 INTs", a, a[0], [])


if __name__ == '__main__':

    import locale
    locale.setlocale(locale.LC_ALL, "")
    create_fighters()

We find that a simple integer in Python takes up 32 bytes, a double 24 bytes:

INT                       | asizeof=   32 | sizeof= 28 | len=    - | leng=    1
DOUBLE                    | asizeof=   24 | sizeof= 24 | len=    - | leng=    -

And an instance of a class with a handful of integer attributes, like a fighter, uses 824 bytes. Each instance! (Using doubles instead of integers does not gain much.):

Fighter double            | asizeof=  824 | sizeof= 64 | len=    - | leng=    -
Fighter                   | asizeof=  864 | sizeof= 64 | len=    - | leng=    -

asizeof() calculates the size of the object including its items, where-as sys.getsizeof() only returns the size of the object itself. In a similar fashion does leng() count the real amount of items, even if they are not used, where-as len() only counts the used items.

Unused items’ result from the fact that for speed reasons Python allocates memory not for each single item as you add it, but fetches a bigger memory block that can contain several items. If you add less items than may fit into that block, the memory is still occupied:

LIST of INTs
  len:                      100
  leng:                     118
  Size total:             4,064
  Size empty:                72
  Element size:              32
  Size per element:          40
  Element size diff:          8

An empty list occupies 72 bytes, 118 ints (118 reserved, 100 used) need 3,776 bytes. The above result names as total size 4,064 bytes, which means 216 bytes are unaccounted for.

Let us now look at fighty again, how much memory the roster and its fighter list occupies:

$ time ./main.py
Locale: ('en_GB', 'UTF-8')
100,000 fighters enlisted in roster.

==============
Memory Profile
==============
Roster                    | asizeof=226,426,808 | sizeof= 64 | len=    - | leng=    -
Fighters
  len:                   100,000
  leng:                  112,506
  Size total:        226,426,368
  Size empty:                 72
  Element size:            4,344
  Size per element:        2,264
  Element size diff:      -2,079
One Fighter               | asizeof=4,344 | sizeof= 64 | len=    - | leng=    -

One fighter uses 4,344 bytes, should not 100,000 of them occupy 434,400,000 bytes? Pympler only counts memory used by unique objects: maybe there are fighters which have the same attribute values… I do not know how to find that out reliably.

Anyways, the list of 100,000 fighters, remember, this is just the list, we have not calculated any fighting data yet, takes up 226 MB RAM. Ouch! Populating one million fighters I was too short tempered to wait for the results — my developer machine almost started to swap, and it has 16 GB RAM.

Slots to save the day…

…but the nights are still troubled…

We can reduce the needed memory for a fighter (and thus for any class) significantly, if we use slots [1] instead of regular attributes:

Roster                    | asizeof=32,025,224 | sizeof= 64 | len=    - | leng=    -
Fighters
  len:                  100,000
  leng:                 112,506
  Size total:        32,024,784
  Size empty:                72
  Element size:             784
  Size per element:         320
  Element size diff:       -463
One Fighter               | asizeof=  784 | sizeof=136 | len=    - | leng=    -

And here is proof that a class with slots is really not a bad container for data. Compare its memory consumption with that of a list or tuple:

LIST of Fighters
  len:                      100
  leng:                     118
  Size total:            51,680
  Size empty:                72
  Element size:             864
  Size per element:         516
  Element size diff:       -347

LIST of Fighters w/ slots
  len:                      100
  leng:                     118
  Size total:            25,816
  Size empty:                72
  Element size:             344
  Size per element:         257
  Element size diff:        -86

LIST of LIST w/ 5 INTs
  len:                      100
  leng:                     118
  Size total:            28,120
  Size empty:                72
  Element size:             272
  Size per element:         280
  Element size diff:          8

TUPLE of TUPLE w/ 5 INTs
  len:                      100
  leng:                     100
  Size total:            26,456
  Size empty:                72
  Element size:             256
  Size per element:         264
  Element size diff:          8

What surprises me here is that the total size of a list of clots [2] is smaller than a list of lists or a tuple of tuples, even if the size of a single clot is larger than that of a similar list (344 bytes to 272 bytes). Like above, I am not quite sure if this is a result of Pympler only counting unique objects or, some hideous optimization of Python itself [3].

More about profiling execution time in a later article.

[1]Keep in mind that slots are way less than a panacea, google has a lot of discussions about their shortcomings.
[2]Classes with slots
[3]The oncoming Python 3.3 will have key-sharing dictionaries [7] for classes.
[4]../high-performance-python-prima-facie.html
[5](1, 2) http://pypi.python.org/pypi/guppy/
[6]http://packages.python.org/Pympler/
[7]http://docs.python.org/dev/whatsnew/3.3.html#pep-412