To parse an HTTP header, you could use cgi.parse_header() (note that the cgi module is deprecated since Python 3.11 and removed in Python 3.13):
import cgi
_, params = cgi.parse_header('text/html; charset=utf-8')
print(params['charset'])  # -> utf-8
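On Python 3 versions where cgi is no longer available, the same header value can be parsed with the standard library's email.message.Message. This is a minimal sketch rather than a drop-in replacement; the variable names are illustrative:

```python
from email.message import Message

# Wrap the raw header value in a Message so the stdlib parses its parameters.
msg = Message()
msg['content-type'] = 'text/html; charset=utf-8'

content_type = msg.get_content_type()  # main value without parameters
charset = msg.get_content_charset()    # charset parameter (lowercased), or None
print(content_type, charset)
```

get_content_charset() returns None when the header carries no charset parameter, so callers still need a fallback.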
Or using the response object:
import urllib2  # Python 2; in Python 3 use urllib.request instead
response = urllib2.urlopen('http://example.com')
response_encoding = response.headers.getparam('charset')
# or in Python 3: response.headers.get_content_charset(default)
In general, the server may lie about the encoding or not report it at all (the default depends on the content type), or the encoding might be specified inside the response body, e.g., in a <meta> element in HTML documents or in the XML declaration for XML documents. As a last resort, the encoding could be guessed from the content itself.
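The fallback order described above (header, then a declaration inside the body, then a default) could be sketched roughly as follows in Python 3. The helper names charset_from_header and guess_encoding, the regular expressions, and the 1024-byte window are assumptions made for illustration, not part of any library; a real HTML or XML parser handles far more cases:

```python
import re
from email.message import Message

# Hypothetical, deliberately simple patterns for in-body declarations.
META_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)
XML_DECL_RE = re.compile(rb'^\s*<\?xml[^>]*encoding=["\']([\w.-]+)["\']', re.IGNORECASE)

def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg['content-type'] = content_type
    return msg.get_content_charset()

def guess_encoding(content_type, body, default='utf-8'):
    """Try the header first, then in-body declarations, then a default."""
    declared = charset_from_header(content_type)
    if declared:
        return declared
    head = body[:1024]  # assumption: declarations appear near the start
    for pattern in (XML_DECL_RE, META_CHARSET_RE):
        match = pattern.search(head)
        if match:
            return match.group(1).decode('ascii').lower()
    return default  # a detector such as chardet could be plugged in here

print(guess_encoding('text/html', b'<meta charset="windows-1251">'))
```

Trusting the header first is only one possible ordering; since the server may report a wrong encoding, some applications prefer the in-body declaration or content-based detection.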
You could use requests to get Unicode text:
import requests  # pip install requests
url = 'http://example.com'
r = requests.get(url)
unicode_str = r.text  # may use `chardet` to auto-detect encoding
Or BeautifulSoup to parse HTML (and convert to Unicode as a side effect):
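import urllib2  # Python 2; in Python 3 use urllib.request instead
from bs4 import BeautifulSoup  # pip install beautifulsoup4
# naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')  # may use `cchardet` for speed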
# ...
Or bs4.UnicodeDammit directly for arbitrary content (not necessarily HTML):
from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# -> Sacré bleu!
print(dammit.original_encoding)
# -> utf-8