Unicode (UTF-8) reading and writing to files in Python

Question

Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file’s encoding.

Supposing the file is encoded in UTF-8, we can use:

>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")

Then f.read returns a decoded Unicode object:

>>> f.read()
u'Capit\xe1l\n\n'

In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (it does not in 2.x).

We can also use open from the codecs standard library module:

>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1l\n\n'

Note, however, that this can cause problems when mixing read() and readline().

Leave a Comment Cancel reply