How to use Python requests to fake a browser visit, a.k.a. generate a User-Agent?

Provide a User-Agent header:

    import requests

    url = 'http://www.ichangtou.com/#company:data_000008.html'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

    response = requests.get(url, headers=headers)
    print(response.content)

FYI, here is a list of User-Agent strings for different browsers: List of all Browsers

As a side note, there is a pretty useful third-party package called …

Web scraping with Python [closed]

Use urllib2 in combination with the brilliant BeautifulSoup library (the code below is Python 2):

    import urllib2
    from BeautifulSoup import BeautifulSoup
    # or if you're using BeautifulSoup4:
    # from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

    for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
        tds = row('td')
        print tds[0].string, tds[1].string
        # will print date and sunrise
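For readers on Python 3 without third-party packages, here is a minimal sketch of the same row-extraction idea using the standard library's html.parser instead of BeautifulSoup. The `RowCollector` class, the sample HTML, and its cell values are made up for illustration; only the `spad` table class comes from the answer. It assumes tidy markup with no nested tables, so it is not a replacement for BeautifulSoup on messy real-world pages.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect <td> text from tables with a given class (hypothetical helper)."""

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.in_table = False   # True while inside a matching <table>
        self.in_cell = False    # True while inside a <td> of that table
        self.rows = []          # one list of cell strings per <tr>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and self.table_class in classes:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag == "td" and self.rows:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


# Made-up sample page; a real page would be fetched with urllib.request.urlopen.
SAMPLE_HTML = """
<table class="spad"><tbody>
  <tr><td>2011-01-01</td><td>07:58</td></tr>
  <tr><td>2011-01-02</td><td>07:57</td></tr>
</tbody></table>
"""

collector = RowCollector("spad")
collector.feed(SAMPLE_HTML)
for date, sunrise in collector.rows:
    print(date, sunrise)
```

Each entry in `collector.rows` corresponds to one `tr`, mirroring the `tds[0]` / `tds[1]` access in the original loop.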

How to save an image locally using Python whose URL address I already know?

Python 2

Here is a more straightforward way if all you want to do is save it as a file:

    import urllib

    urllib.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")

The second argument is the local path where the file should be saved.

Python 3

As SergO suggested, the code below should work with Python 3:

    import urllib.request

    urllib.request.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")
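The Python documentation describes urlretrieve as a legacy interface, so a possible alternative is to stream the response to disk with urlopen and shutil.copyfileobj. The sketch below assumes Python 3; the `save_url` helper name and the 10-second timeout are illustrative choices, not part of the original answer.

```python
import shutil
import urllib.request
from pathlib import Path


def save_url(url, dest, timeout=10):
    """Stream the body at `url` into the file `dest` (hypothetical helper)."""
    dest = Path(dest)
    # Copy in chunks so a large image is never held fully in memory.
    with urllib.request.urlopen(url, timeout=timeout) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (performs a network request, so it is left commented out):
# save_url("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")
```

As with urlretrieve, the second argument is the local path where the file should be saved.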

How can I efficiently parse HTML with Java?

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

    String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
    Elements links …

How can I pass variable into an evaluate function?

You have to pass the variable as an argument to the pageFunction like this:

    const links = await page.evaluate((evalVar) => {
      console.log(evalVar); // 2. should be defined now
      …
    }, evalVar); // 1. pass variable as an argument

You can pass in multiple variables by passing more arguments to page.evaluate():

    await page.evaluate((a, b, c) => …

Web-scraping JavaScript page with Python

EDIT Sept 2021: phantomjs isn't maintained any more, either.

EDIT 30/Dec/2017: This answer appears in top results of Google searches, so I decided to update it. The old answer is still at the end.

dryscrape isn't maintained anymore, and the library the dryscrape developers recommend is Python 2 only. I have found using Selenium's Python library …

How can I get the Google cache age of any URL or web page? [closed]

Use the URL

    https://webcache.googleusercontent.com/search?q=cache:<your url without "http://">

Example:

    https://webcache.googleusercontent.com/search?q=cache:stackoverflow.com

It contains a header like this:

    This is Google's cache of https://stackoverflow.com/. It is a snapshot of the page as it appeared on 21 Aug 2012 11:33:38 GMT. The current page could have changed in the meantime. Learn more

Tip: To quickly find your search term …
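The URL pattern above can be built programmatically. This is a small sketch, assuming Python 3; the `google_cache_url` helper is hypothetical, and stripping `https://` as well as percent-encoding the remainder are assumptions that go beyond the answer, which only mentions dropping `http://`. It builds the string only and does not check whether the endpoint still serves cached pages.

```python
from urllib.parse import quote

CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"


def google_cache_url(url):
    """Return the cache lookup URL for `url` (hypothetical helper)."""
    # Drop the scheme; handling https:// too is an assumption of this sketch.
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    # Keep "/" and ":" readable; percent-encode characters such as "&" or spaces.
    return CACHE_PREFIX + quote(url, safe="/:")


print(google_cache_url("http://stackoverflow.com"))
```

For the example in the answer, this prints the same `cache:stackoverflow.com` URL shown above.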

Headless Browser and scraping – solutions [closed]

If Ruby is your thing, you may also try:

https://github.com/chriskite/anemone (dev stopped)
https://github.com/sparklemotion/mechanize
https://github.com/postmodern/spidr
https://github.com/stewartmckee/cobweb
http://watirwebdriver.com/ (Selenium)

Also, the Nokogiri gem can be used for scraping: http://nokogiri.org/

There is a dedicated book from Packt Publishing about how to use Nokogiri for scraping.
