A good way to get the charset/encoding of an HTTP response in Python

To parse the HTTP header you could use cgi.parse_header():

_, params = cgi.parse_header('text/html; charset=utf-8')
print params['charset']  # -> utf-8

Or using the response object:

response = urllib2.urlopen('http://example.com')
response_encoding = response.headers.getparam('charset')
# or in Python 3: response.headers.get_content_charset(default)

In general the server may lie about the encoding or not report it at all (the default depends on … Read more
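
Put together, a minimal Python 2 sketch of that approach might look like this (the URL, the utf-8 fallback and the 'replace' error handling are assumptions for illustration):

import urllib2

response = urllib2.urlopen('http://example.com')
# The declared charset, if the server sent one in the Content-Type header.
charset = response.headers.getparam('charset')
# Assume UTF-8 when nothing is declared; as noted above, the server may also lie.
html = response.read().decode(charset or 'utf-8', 'replace')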

urllib2 file name

Did you mean urllib2.urlopen? You could potentially lift the intended filename if the server was sending a Content-Disposition header by checking remotefile.info()['Content-Disposition'], but as it is I think you'll just have to parse the URL. You could use urlparse.urlsplit, but if you have any URLs like the second example, you'll end up having to … Read more
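
A rough Python 2 sketch of both ideas (the URL is made up, and a real Content-Disposition value would still need its filename= parameter parsed out):

import posixpath
import urllib2
import urlparse

remotefile = urllib2.urlopen('http://example.com/files/report.pdf')

# 1) The server may name the file explicitly; this is None when it does not.
disposition = remotefile.info().getheader('Content-Disposition')

# 2) Otherwise fall back to the last path component of the URL.
path = urlparse.urlsplit(remotefile.geturl()).path
filename = posixpath.basename(path)
print filename  # -> report.pdf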

Python urllib2 URLError HTTP status code.

You shouldn’t check for a status code after catching URLError, since that exception can be raised in situations where there’s no HTTP status code available, for example when you’re getting connection refused errors. Use HTTPError to check for HTTP-specific errors, and then use URLError to check for other problems: try: urllib2.urlopen(url) except urllib2.HTTPError, e: … Read more
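
Fleshed out a little, that pattern might look like the following sketch (the URL and what each branch prints are placeholders):

import urllib2

url = 'http://example.com/'  # example URL
try:
    response = urllib2.urlopen(url)
except urllib2.HTTPError, e:
    # The server answered, but with an error status (404, 500, ...).
    print 'HTTP error:', e.code
except urllib2.URLError, e:
    # No HTTP status at all: DNS failure, connection refused, and so on.
    print 'Failed to reach the server:', e.reason
else:
    print 'Status:', response.getcode()

Note that HTTPError is a subclass of URLError, so the order of the except clauses matters.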

Read file object as string in python

You can use Python in interactive mode to search for solutions. If f is your object, you can enter dir(f) to see all methods and attributes. There's one called read. Enter help(f.read) and it tells you that f.read() is the way to retrieve a string from a file object.
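
For example, a trivial sketch (the filename is made up):

f = open('example.txt')   # hypothetical file
contents = f.read()       # read() with no argument returns the whole file as one string
f.close()
print contents

# In the interactive interpreter you can explore the object first:
#   dir(f)        lists attributes and methods, including read
#   help(f.read)  documents what read() returns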

Source interface with Python and urllib2

Unfortunately the stack of standard library modules in use (urllib2, httplib, socket) is somewhat badly designed for the purpose — at the key point in the operation, HTTPConnection.connect (in httplib) delegates to socket.create_connection, which in turn gives you no “hook” whatsoever between the creation of the socket instance sock and the sock.connect call, for you … Read more
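
One workaround, assuming Python 2.7 (where httplib.HTTPConnection accepts a source_address argument), is to plug a connection factory that binds to your chosen interface into a custom urllib2 handler; the address below is an example and this is a sketch, not the answer's own code:

import functools
import httplib
import urllib2

SOURCE_IP = '192.0.2.10'  # example: address of the interface to send from

class BoundHTTPHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        # Build HTTPConnection objects that bind to SOURCE_IP (port 0 = any free port)
        # before connecting to the target host.
        factory = functools.partial(httplib.HTTPConnection,
                                    source_address=(SOURCE_IP, 0))
        return self.do_open(factory, req)

opener = urllib2.build_opener(BoundHTTPHandler)
opener.open('http://example.com/')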

How do I add a header to urllib2 opener?

You can add the headers directly to the OpenerDirector object returned by build_opener. From the last example in the urllib2 docs: OpenerDirector automatically adds a User-Agent header to every Request. To change this:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')

Also, remember that a few standard headers (Content-Length, Content-Type and Host) are … Read more
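
If you want plain urllib2.urlopen() calls to pick those headers up as well, you can also install the opener globally; a small sketch (the header values are only examples):

import urllib2

opener = urllib2.build_opener()
# addheaders is a list of (name, value) tuples sent with every request
# made through this opener.
opener.addheaders = [('User-agent', 'Mozilla/5.0'),
                     ('Accept-Language', 'en-US,en')]
urllib2.install_opener(opener)   # make it the default for urlopen()
response = urllib2.urlopen('http://www.example.com/')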

How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup? [duplicate]

As justhalf points out above, my question here is essentially a duplicate of this question. The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 characters. This apparently confuses BeautifulSoup about which encoding is in use, and when trying to first decode … Read more
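
One common way around that kind of problem (not necessarily the fix the truncated answer arrives at) is to decode the bytes yourself, replacing the invalid sequences, before handing the text to the parser; a sketch assuming BeautifulSoup 4:

from bs4 import BeautifulSoup

# Example data: valid UTF-8 apart from one rogue 0xFF byte.
raw_bytes = '<p>caf\xc3\xa9 and a rogue byte: \xff</p>'
text = raw_bytes.decode('utf-8', 'replace')   # invalid bytes become U+FFFD
soup = BeautifulSoup(text, 'html.parser')
print repr(soup.p.get_text())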

How to download any(!) webpage with correct charset in python?

When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:

fp = urllib2.urlopen(request)
charset = fp.headers.getparam('charset')

You can use BeautifulSoup to locate a meta element in the HTML:

soup = BeautifulSoup.BeautifulSoup(data)
meta = soup.findAll('meta', {'http-equiv': lambda v: v.lower() == 'content-type'})

If neither is available, browsers typically fall back to user … Read more
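
Stitched together, that fallback chain might look roughly like the sketch below (it keeps the old BeautifulSoup 3 API used in the excerpt; the URL and the ISO-8859-1 last resort are assumptions):

import cgi
import urllib2
import BeautifulSoup

request = urllib2.Request('http://example.com/')
fp = urllib2.urlopen(request)
data = fp.read()

# 1) charset declared in the HTTP Content-Type header, if any
charset = fp.headers.getparam('charset')

# 2) otherwise look for <meta http-equiv="Content-Type" content="...; charset=...">
if charset is None:
    soup = BeautifulSoup.BeautifulSoup(data)
    meta = soup.findAll('meta',
                        {'http-equiv': lambda v: v and v.lower() == 'content-type'})
    if meta:
        _, params = cgi.parse_header(meta[0].get('content', ''))
        charset = params.get('charset')

# 3) last resort: a browser-style default (assumed here)
text = data.decode(charset or 'ISO-8859-1', 'replace')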

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)