Porting feedgenerator to Py3k

Actually, we are already a few steps into my effort to port Pelican [2] to Python 3. I’ll write about the first steps later, promise. (Here is a first impression [3].)

Feedgenerator 1.2.1 is a stand-alone version of the Django module of the same name. During the port I got the impression that the code was rather old. And although Pelican’s tests indicated that the port was quite successful, I felt uneasy about all those decodings and encodings to and from unicode that had to be changed or removed for Python 3.

Syntactically, everything was in order after the changes, but are strings with unicode characters still correctly handled?

I wanted proof.

A quick glance over the shoulder into Django’s repository revealed that time has not stood still there either. Django 1.5 is a WIP [1] port to Py3k, and the feedgenerator module has evolved as well. So, let’s see if we can build a standalone version of it again.

The new feedgenerator depends on too much functionality in other Django modules for it to be practical to extract it and zip it into a single file, as was the case earlier. Instead, I took all the modules it depends on and arranged them as submodules of the new standalone feedgenerator.

And kudos to the Django developers, really, I’m impressed. That was all. The new feedgenerator runs on Python 2 and 3 from the same codebase, with no rewriting by 2to3 necessary.

Proof — again this shouting from the lower ranks…

Ok, the tests I wrote myself. My goal was to ensure that unicode characters inside a text field of a feed get properly encoded, regardless of whether the code runs under Python 2 or 3; in other words, that the resulting feed is the same in both cases.

Argh, I almost became blond.

To recap: Python 2 has the types str and unicode, while Python 3 knows str (text/unicode) and bytes (data). So Py3 str replaces the old unicode and Py3 bytes replaces the old str. At least we are advised to treat it that way in new code. To make things worse, Python 2 did not complain when we stored encoded unicode characters in a str; even unicode characters are bytes, after all! Shit and fans got acquainted when we tried to encode such a mess…
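To make the recap concrete, here is a minimal sketch of the text/bytes split as Python 3 sees it. The sample word and the variable names are just illustrative choices, and the printed values assume Python 3.

```python
# Text lives in str, encoded data lives in bytes (Python 3 semantics).
text = "Gr\u00fc\u00dfe"            # "Grüße" as text, 5 characters
data = text.encode("utf-8")         # the same word as UTF-8 encoded bytes

print(type(text).__name__, len(text))   # str 5
print(type(data).__name__, len(data))   # bytes 7 (ü and ß take two bytes each)

# Decoding with the matching codec gives back the original text.
roundtrip = data.decode("utf-8")
print(roundtrip == text)                # True
```

On Python 2 the same two roles were played by unicode and str, which is exactly where the confusion starts.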

Ok, as a simple test, I created the output of a feed to act as the expected result, and some fixtures as input for the feedgenerator. All string literals were marked as unicode with from __future__ import unicode_literals, as Django also does. So the created feed should easily be identical to the expected result, right?


Only in Python 3 were both the same. In Python 2 the resulting feed was encoded, i.e. bytes, and I had to encode the expected result, too, to get a match. To quote The Doctor: “WHAT? … WHAT? … WHAT?”
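One way to make such a comparison independent of the interpreter is to bring both sides to text before comparing. The sketch below goes the other way round from what I did in the test (it decodes the generated feed instead of encoding the expected one); the helper name to_text and the sample strings are hypothetical, not part of feedgenerator.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it arrived as bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

expected = "<title>Caf\u00e9</title>"

# Stand-ins for the two situations described above:
feed_as_bytes = expected.encode("utf-8")   # an encoded feed, as Python 2 produced
feed_as_text = expected                    # a text feed, as Python 3 produced

matches_bytes = to_text(feed_as_bytes) == expected
matches_text = to_text(feed_as_text) == expected
print(matches_bytes, matches_text)
```

Either direction works, as long as both sides of the comparison end up with the same type.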

Look here at the test-case [4].

Typically, you would write a generated feed into a file. But for the test-case I kept everything in memory: the feed’s method writeString() generates the output in a StringIO buffer and returns its value. And StringIO is what I suspected of murder …err, of juggling with encodings.

A new test-case [5] proves me right. Please, really, read the code and my annotations there. You won’t believe your eyes.

tl;dr In Python 2, StringIO returns the same type we originally wrote into it: str will be str and friends will be friends (or something like that). Not so in Python 3. Here, StringIO always returns str (remember, that’s unicode now!). Even if we feed bytes into it, we get out str, with a surprising change to the content! Again, please look at the code.
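For the Python 3 side, here is a minimal sketch using the standard io module. It is not the linked test-case; in particular, converting the bytes with str() before writing is only an assumption about how bytes might end up in a StringIO with changed content.

```python
import io

# Text in, text out: the normal case.
buf = io.StringIO()
buf.write("caf\u00e9")
text_out = buf.getvalue()

payload = "caf\u00e9".encode("utf-8")   # the same word as bytes

# io.StringIO refuses bytes when they are written directly.
try:
    io.StringIO().write(payload)
    rejected = False
except TypeError:
    rejected = True

# Assumption: the bytes pass through str() on their way into the buffer.
# What arrives is the repr of the bytes object, not the decoded text.
buf2 = io.StringIO()
buf2.write(str(payload))
mangled = buf2.getvalue()

# io.BytesIO is the in-memory buffer meant for encoded data.
raw = io.BytesIO()
raw.write(payload)
bytes_out = raw.getvalue()

print(repr(text_out), rejected, repr(mangled), bytes_out == payload)
```

So in Python 3 the choice of buffer already decides whether you are dealing with text or with data.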

Check out [6] the complete feedgenerator from GitHub. Although the six module is included in that package, the tests require you to install six globally (or in the virtualenv). Sorry about that, for now.

[1] Work In Progress, as I learned yesterday ;)