Python 3.x makes a clear distinction between the types:
- str = '...' literals = a sequence of Unicode characters (Latin-1, UCS-2 or UCS-4, depending on the widest character in the string)
- bytes = b'...' literals = a sequence of octets (integers between 0 and 255)
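A minimal sketch of that distinction, runnable on any Python 3 interpreter. The sample word and variable names are only illustrative. It shows that a str counts and yields characters, while a bytes object counts octets and yields integers.

```python
text = 'héllo'                 # str: a sequence of Unicode characters
data = text.encode('UTF-8')    # bytes: a sequence of octets

text_len = len(text)           # counts characters
data_len = len(data)           # counts octets; 'é' needs two of them in UTF-8

first_char = text[0]           # indexing a str gives a one-character str
first_octet = data[0]          # indexing bytes gives an int between 0 and 255

print(type(text).__name__, text_len, repr(first_char))
print(type(data).__name__, data_len, first_octet)
```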
If you’re familiar with:
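- Java or C#, think of str as String and bytes as byte[];
- SQL, think of str as NVARCHAR and bytes as BINARY or BLOB;
- Windows registry, think of str as REG_SZ and bytes as REG_BINARY.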
If you’re familiar with C(++), then forget everything you’ve learned about
char and strings, because a character is not a byte. That idea is long obsolete.
You use str when you want to represent text.
You use bytes when you want to represent low-level binary data like structs.
import struct
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]  # unpack returns a tuple
You can encode a str to a bytes object.
>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'
And you can decode a bytes into a str.
>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'
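As a rough illustration of why the encoding name matters, the sketch below encodes the same character two ways and then decodes it once correctly and once with the wrong codec. The choice of UTF-16-LE and Latin-1 as the contrasting codecs is mine, purely for demonstration.

```python
euro = '€'

utf8 = euro.encode('UTF-8')        # three octets
utf16 = euro.encode('UTF-16-LE')   # two octets: same text, different bytes

# Decoding with the matching encoding restores the original text.
round_trip = utf8.decode('UTF-8')

# Decoding with a different encoding does not fail here, but the result is garbage:
# Latin-1 maps each octet to the code point with the same number.
mojibake = utf8.decode('Latin-1')

print(utf8, utf16, round_trip, repr(mojibake))
```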
But you can’t freely mix the two types.
>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
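One way around that error is to convert explicitly so both operands have the same type. The following is only a sketch with illustrative variable names. Which direction to convert depends on whether the result is headed for a binary sink or for text processing.

```python
bom = b'\xEF\xBB\xBF'
text = 'Text with a UTF-8 BOM'

# Option 1: bring everything into bytes (e.g. before writing to a binary file).
as_bytes = bom + text.encode('UTF-8')

# Option 2: bring everything into str (e.g. before further text processing).
as_text = bom.decode('UTF-8') + text

# The standard 'utf-8-sig' codec drops a leading BOM while decoding.
without_bom = as_bytes.decode('utf-8-sig')

print(as_bytes)
print(repr(as_text))
print(without_bom)
```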
The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.
>>> b'A' == b'\x41'
True
But I must emphasize, a character is not a byte.
>>> 'A' == b'A'
False
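The same quirk shows up when Python prints a bytes object: octets that happen to be printable ASCII are displayed as letters, and the rest as hex escapes. A small sketch, with arbitrarily chosen octet values:

```python
raw = bytes([0x41, 0x42, 0xE2, 0x82, 0xAC])   # five octets built from integers

shown = repr(raw)            # printable ASCII octets appear as letters, the rest as \x escapes
octets = list(raw)           # underneath, it is still just integers

mixed = ('A' == b'A')                        # a str never equals a bytes object
matched = ('A'.encode('ASCII') == b'A')      # equal once both sides are bytes

print(shown, octets, mixed, matched)
```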
In Python 2.x
Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:
- unicode = u'...' literals = sequence of Unicode characters = 3.x str
- str = '...' literals = sequences of confounded bytes/characters
  - Usually text, encoded in some unspecified encoding.
  - But also used to represent binary data like struct.pack output.
To ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, so that binary strings (which should be bytes in 3.x) can be distinguished from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert the literal to a Unicode string in 3.x.
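A hedged sketch of what "does nothing in 2.x" means in practice. The snippet is written to run on either interpreter line and relies on the fact that Python 2.6+ defines bytes as another name for str. The variable names are illustrative.

```python
import sys

literal = b'abc'

if sys.version_info[0] == 2:
    # Python 2.6+: bytes is just another name for str, so the prefix changes nothing.
    expected_type = str
else:
    # Python 3: the prefix selects the separate bytes type.
    expected_type = bytes

print(type(literal) is expected_type)
```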
b'...' literals in Python have the same purpose that they do in PHP.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
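For illustration, here is how those forms behave in Python 3. This is a sketch; the rb combination shown is accepted by Python 3.3 and later, and the sample strings are arbitrary.

```python
tab = '\t'            # one character: a tab
raw = r'\t'           # two characters: a backslash and the letter t
raw_bytes = rb'\t'    # prefixes combine: two octets, 0x5C and 0x74

multi = """first line
second line"""        # the line break becomes part of the string

print(len(tab), len(raw), raw_bytes, repr(multi))
```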