Python 3 Unicode-Bytes Quirks

Imagine your program compares two variables that contain strings. You are pretty sure that under certain circumstances both variables hold the same string, but somehow Python insists that they do not!

For a quick debugging check, you print both variables and get:

>>> print(s1, s2, s1 == s2)
b'xyz' b'xyz' False

What happened?

Make sure you also test the types of the variables:

>>> print(type(s1), type(s2))
<class 'bytes'> <class 'str'>

Confused? Well, I was the first time I saw that.

In Python 2 it was perfectly legal to write foo = str(something) to cast something to a string.

But watch out if you do this in Python 3, especially if the castee is of type bytes. The idiom str(some_bytes) actually returns the string representation of the bytes object (the same text that repr() gives, b prefix and quotes included). It does not perform a cast, which would have required some kind of decoding!
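To make the difference concrete, here is a minimal sketch of what str() does to a bytes object when no encoding is given. The variable names are only illustrative.

```python
data = b'xyz'

# Without an encoding, str() falls back to repr(): the result is text that
# merely *looks like* a bytes literal, including the b prefix and the quotes.
as_repr = str(data)
assert as_repr == repr(data) == "b'xyz'"
assert len(as_repr) == 6          # b ' x y z '  (not 3 characters)

# A bytes object and a str never compare equal in Python 3,
# even when they print identically.
assert data != as_repr
assert data != 'xyz'
```

If you suspect this kind of mix-up in a larger program, running the interpreter with the -b option makes Python emit a BytesWarning for str(some_bytes) calls without an encoding and for bytes/str comparisons.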

See how the variables were initialized to solve the mystery:

>>> s1 = b'xyz'
>>> s2 = str(s1)
>>> print(s1, s2)
b'xyz' b'xyz'
>>> print(type(s1), type(s2))
<class 'bytes'> <class 'str'>

What we actually wanted to say is:

>>> s1 = b'xyz'
>>> s2 = str(s1, encoding='utf-8')
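As a quick check that the decoding form behaves as intended, the following sketch compares the results. It assumes the bytes really are UTF-8 encoded; s3 is an extra name introduced here to show the equivalent bytes.decode() spelling.

```python
s1 = b'xyz'

# Passing an encoding makes str() decode the bytes instead of calling repr().
s2 = str(s1, encoding='utf-8')

# bytes.decode() is an equivalent and arguably more explicit spelling.
s3 = s1.decode('utf-8')

assert s2 == s3 == 'xyz'
assert type(s2) is str

# To compare the two variables, bring them to the same type first.
assert s1 == s2.encode('utf-8')
assert s1.decode('utf-8') == s2
```

Which side you convert is a design choice: decoding at the point where bytes enter the program and working with str everywhere else tends to keep such comparisons out of trouble.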

Remember this new idiom, and stick it on a Post-it on your fridge, dm. ;)

Dirk Makowski