A good way to get the charset/encoding of an HTTP response in Python

To parse the HTTP header you could use cgi.parse_header():

_, params = cgi.parse_header('text/html; charset=utf-8')
print params['charset']  # -> utf-8

Or using the response object:

response = urllib2.urlopen('http://example.com')
response_encoding = response.headers.getparam('charset')
# or in Python 3: response.headers.get_content_charset(default)

In general the server may lie about the encoding or not report it at all (the default depends on … Read more
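
Put together, a minimal Python 2 sketch of that approach might look like this (the URL, the utf-8 fallback and the 'replace' error handling are assumptions for illustration):

import urllib2

response = urllib2.urlopen('http://example.com')
# The declared charset, if the server sent one in the Content-Type header.
charset = response.headers.getparam('charset')
# Assume UTF-8 when nothing is declared; as noted above, the server may also lie.
html = response.read().decode(charset or 'utf-8', 'replace')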

urllib2 file name

Did you mean urllib2.urlopen? You could potentially lift the intended filename if the server was sending a Content-Disposition header by checking remotefile.info()['Content-Disposition'], but as it is I think you'll just have to parse the URL. You could use urlparse.urlsplit, but if you have any URLs like the second example, you'll end up having to … Read more
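
A rough Python 2 sketch of both ideas (the URL is made up, and a real Content-Disposition value would still need its filename= parameter parsed out):

import posixpath
import urllib2
import urlparse

remotefile = urllib2.urlopen('http://example.com/files/report.pdf')

# 1) The server may name the file explicitly; this is None when it does not.
disposition = remotefile.info().getheader('Content-Disposition')

# 2) Otherwise fall back to the last path component of the URL.
path = urlparse.urlsplit(remotefile.geturl()).path
filename = posixpath.basename(path)
print filename  # -> report.pdf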

Python urllib2 URLError HTTP status code.

You shouldn’t check for a status code after catching URLError, since that exception can be raised in situations where there’s no HTTP status code available, for example when you’re getting connection refused errors. Use HTTPError to check for HTTP-specific errors, and then use URLError to check for other problems: try: urllib2.urlopen(url) except urllib2.HTTPError, e: … Read more
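
Fleshed out a little, that pattern might look like the following sketch (the URL and what each branch prints are placeholders):

import urllib2

url = 'http://example.com/'  # example URL
try:
    response = urllib2.urlopen(url)
except urllib2.HTTPError, e:
    # The server answered, but with an error status (404, 500, ...).
    print 'HTTP error:', e.code
except urllib2.URLError, e:
    # No HTTP status at all: DNS failure, connection refused, and so on.
    print 'Failed to reach the server:', e.reason
else:
    print 'Status:', response.getcode()

Note that HTTPError is a subclass of URLError, so the order of the except clauses matters.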

Read file object as string in python

You can use Python in interactive mode to search for solutions. If f is your object, you can enter dir(f) to see all methods and attributes. There's one called read. Enter help(f.read) and it tells you that f.read() is the way to retrieve a string from a file object.
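
For example, a trivial sketch (the filename is made up):

f = open('example.txt')   # hypothetical file
contents = f.read()       # read() with no argument returns the whole file as one string
f.close()
print contents

# In the interactive interpreter you can explore the object first:
#   dir(f)        lists attributes and methods, including read
#   help(f.read)  documents what read() returns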

Source interface with Python and urllib2

Unfortunately the stack of standard library modules in use (urllib2, httplib, socket) is somewhat badly designed for the purpose — at the key point in the operation, HTTPConnection.connect (in httplib) delegates to socket.create_connection, which in turn gives you no “hook” whatsoever between the creation of the socket instance sock and the sock.connect call, for you … Read more
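
One workaround, assuming Python 2.7 (where httplib.HTTPConnection accepts a source_address argument), is to plug a connection factory that binds to your chosen interface into a custom urllib2 handler; the address below is an example and this is a sketch, not the answer's own code:

import functools
import httplib
import urllib2

SOURCE_IP = '192.0.2.10'  # example: address of the interface to send from

class BoundHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        # Build HTTPConnection objects that bind to SOURCE_IP (port 0 = any free port)
        # before connecting to the target host.
        factory = functools.partial(httplib.HTTPConnection,
                                    source_address=(SOURCE_IP, 0))
        return self.do_open(factory, req)

opener = urllib2.build_opener(BoundHTTPHandler)
opener.open('http://example.com/')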

How do I add a header to urllib2 opener?

You can add the headers directly to the OpenerDirector object returned by build_opener. From the last example in the urllib2 docs: OpenerDirector automatically adds a User-Agent header to every Request. To change this:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')

Also, remember that a few standard headers (Content-Length, Content-Type and Host) are … Read more
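
If you want plain urllib2.urlopen() calls to pick those headers up as well, you can also install the opener globally; a small sketch (the header values are only examples):

import urllib2

opener = urllib2.build_opener()
# addheaders is a list of (name, value) tuples sent with every request
# made through this opener.
opener.addheaders = [('User-agent', 'Mozilla/5.0'),
                     ('Accept-Language', 'en-US,en')]
urllib2.install_opener(opener)   # make it the default for urlopen()
response = urllib2.urlopen('http://www.example.com/')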

How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup? [duplicate]

As justhalf points out above, my question here is essentially a duplicate of this question. The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 characters. This apparently confuses BeautifulSoup about which encoding is in use, and when trying to first decode … Read more
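
One common way around that kind of problem (not necessarily the fix the truncated answer arrives at) is to decode the bytes yourself, replacing the invalid sequences, before handing the text to the parser; a sketch assuming BeautifulSoup 4:

from bs4 import BeautifulSoup

# Example data: valid UTF-8 apart from one rogue 0xFF byte.
raw_bytes = '<p>caf\xc3\xa9 and a rogue byte: \xff</p>'
text = raw_bytes.decode('utf-8', 'replace')   # invalid bytes become U+FFFD
soup = BeautifulSoup(text, 'html.parser')
print repr(soup.p.get_text())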

How to download any(!) webpage with correct charset in python?

When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:

fp = urllib2.urlopen(request)
charset = fp.headers.getparam('charset')

You can use BeautifulSoup to locate a meta element in the HTML:

soup = BeautifulSoup.BeautifulSoup(data)
meta = soup.findAll('meta', {'http-equiv': lambda v: v.lower() == 'content-type'})

If neither is available, browsers typically fall back to user … Read more
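
Stitched together, that fallback chain might look roughly like the sketch below (it keeps the old BeautifulSoup 3 API used in the excerpt; the URL and the ISO-8859-1 last resort are assumptions):

import cgi
import urllib2
import BeautifulSoup

request = urllib2.Request('http://example.com/')
fp = urllib2.urlopen(request)
data = fp.read()

# 1) charset declared in the HTTP Content-Type header, if any
charset = fp.headers.getparam('charset')

# 2) otherwise look for <meta http-equiv="Content-Type" content="...; charset=...">
if charset is None:
    soup = BeautifulSoup.BeautifulSoup(data)
    meta = soup.findAll('meta',
                        {'http-equiv': lambda v: v and v.lower() == 'content-type'})
    if meta:
        _, params = cgi.parse_header(meta[0].get('content', ''))
        charset = params.get('charset')

# 3) last resort: a browser-style default (assumed here)
text = data.decode(charset or 'ISO-8859-1', 'replace')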

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)