web-scraping – Page 10

How to use Python requests to fake a browser visit, a.k.a. generate a User-Agent?

October 24, 2022 by Tarik

Provide a User-Agent header:

    import requests

    url = "http://www.ichangtou.com/#company:data_000008.html"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    print(response.content)

FYI, here is a list of User-Agent strings for different browsers: List of all Browsers

As a side note, there is a pretty useful third-party package called …
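The same header can also be attached without third-party packages. What follows is only a sketch under stated assumptions, not part of the original answer: it swaps requests for the standard library's urllib.request, and the helper name build_browser_request and the example.com URL are made up for illustration. It only builds the request object; nothing is sent until urlopen is called.

```python
import urllib.request

# User-Agent string copied from the excerpt above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/39.0.2171.95 Safari/537.36"
)


def build_browser_request(url, user_agent=BROWSER_UA):
    """Hypothetical helper: return a Request carrying a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder URL (an assumption); building the request does not touch the network.
request = build_browser_request("http://example.com/")

# urllib stores header names capitalized, so the key reads "User-agent".
print(request.get_header("User-agent"))

# To actually fetch the page:
# with urllib.request.urlopen(request, timeout=10) as response:
#     body = response.read()
```

Keeping the header in one helper means every request a scraper sends carries the same identity, which is easier to change later than strings scattered through the code.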

Web scraping with Python [closed]

October 19, 2022 by Tarik

Use urllib2 in combination with the brilliant BeautifulSoup library:

    import urllib2
    from BeautifulSoup import BeautifulSoup
    # or if you're using BeautifulSoup4:
    # from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

    for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
        tds = row('td')
        print tds[0].string, tds[1].string
        # will print date and sunrise
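The snippet in this answer is Python 2 (urllib2 and the print statement). As a dependency-free Python 3 sketch of the same row-by-row extraction, the standard library's html.parser can collect the cell text of a table with a given class. This is a swapped-in technique, not the answer's method: the class name TableRows, the sample HTML, and its values are invented for illustration, and the parser assumes reasonably well-formed markup. BeautifulSoup is likely still the more convenient tool for real pages.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td> in every row of tables with a given class."""

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.rows = []      # finished rows, each a list of cell strings
        self._depth = 0     # greater than 0 while inside a matching table
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self._depth or self.table_class in classes:
                self._depth += 1
        elif self._depth and tag == "tr":
            self._row = []
        elif self._row is not None and tag == "td":
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None
        elif tag == "table" and self._depth:
            self._depth -= 1


# Made-up sample data standing in for a downloaded page.
SAMPLE_HTML = """
<table class="spad">
  <tbody>
    <tr><td>2022-10-19</td><td>06:58</td></tr>
    <tr><td>2022-10-20</td><td>06:59</td></tr>
  </tbody>
</table>
"""

parser = TableRows("spad")
parser.feed(SAMPLE_HTML)
for date, sunrise in parser.rows:
    print(date, sunrise)
```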

How to save an image locally using Python whose URL address I already know?

October 18, 2022 by Tarik

Python 2

Here is a more straightforward way if all you want to do is save it as a file:

    import urllib
    urllib.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")

The second argument is the local path where the file should be saved.

Python 3

As SergO suggested, the code below should work with Python 3:

    import urllib.request
    urllib.request.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")
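The Python 3 documentation lists urlretrieve under its legacy interface, so one possible alternative is to stream the response to disk with urlopen. The following is a minimal sketch, assuming a helper named save_url (hypothetical, not from the answer) and a 10-second default timeout chosen arbitrarily.

```python
import shutil
import urllib.request
from pathlib import Path


def save_url(url, dest, timeout=10.0):
    """Hypothetical helper: stream the resource at `url` into the file `dest`."""
    dest_path = Path(dest)
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            dest_path.open("wb") as out:
        # Copy in chunks so large images are never held fully in memory.
        shutil.copyfileobj(response, out)
    return dest_path


# Usage with the URL from the answer (requires network access):
# save_url("http://www.digimouth.com/news/media/2011/09/google-logo.jpg",
#          "local-filename.jpg")
```

Because urlopen also accepts file:// URLs, the helper can be exercised against a local file without touching the network.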

How can I efficiently parse HTML with Java?

October 16, 2022 by Tarik

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after. Its party trick is a CSS selector syntax to find elements, e.g.:

    String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
    Elements links …

How can I pass variable into an evaluate function?

October 8, 2022 by Tarik

You have to pass the variable as an argument to the pageFunction like this:

    const links = await page.evaluate((evalVar) => {
        console.log(evalVar); // 2. should be defined now
        …
    }, evalVar); // 1. pass variable as an argument

You can pass in multiple variables by passing more arguments to page.evaluate():

    await page.evaluate((a, b, c) => …

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org

October 8, 2022 by Tarik

I stumbled on this issue once. If you're using macOS, go to Macintosh HD > Applications > Python 3.6 folder (or whatever version of Python you're using) and double-click the "Install Certificates.command" file. 😀
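To see which certificate locations a given interpreter consults, before or after that step, a small diagnostic along these lines may help. It is a sketch using only the standard ssl module, not something from the original answer, and it does not contact any server.

```python
import ssl

# Where this Python build looks for trusted CA certificates by default.
paths = ssl.get_default_verify_paths()
print("cafile:", paths.cafile)
print("capath:", paths.capath)

# A default context verifies both the certificate chain and the host name.
context = ssl.create_default_context()
print("verify_mode:", context.verify_mode)
print("check_hostname:", context.check_hostname)

# Number of CA certificates already loaded into the context. Certificates
# found through a capath directory are loaded lazily, so 0 is only a hint.
print("loaded CA certs:", context.cert_store_stats()["x509_ca"])
```

If cafile and capath are both None, or point at files that do not exist, that would be consistent with the CERTIFICATE_VERIFY_FAILED error described in the question.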

Web-scraping JavaScript page with Python

October 5, 2022 by Tarik

EDIT Sept 2021: PhantomJS isn't maintained any more, either.

EDIT 30/Dec/2017: This answer appears in top results of Google searches, so I decided to update it. The old answer is still at the end.

dryscrape isn't maintained anymore, and the library the dryscrape developers recommend is Python 2 only. I have found using Selenium's Python library …

How can I get the Google cache age of any URL or web page? [closed]

October 4, 2022 by Tarik

Use the URL

    https://webcache.googleusercontent.com/search?q=cache:<your url without "http://">

Example:

    https://webcache.googleusercontent.com/search?q=cache:stackoverflow.com

It contains a header like this:

    This is Google's cache of https://stackoverflow.com/. It is a snapshot of the page as it appeared on 21 Aug 2012 11:33:38 GMT. The current page could have changed in the meantime. Learn more

Tip: To quickly find your search term …
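The URL pattern and the header sentence quoted in this answer can be handled programmatically. The sketch below is an illustration under assumptions: the function names cache_url and snapshot_time are invented, stripping https:// as well as http:// is a guess at the intent, and the regular expression matches only the header wording quoted in the answer. It does not fetch anything or check that the cache endpoint still responds.

```python
import re
from datetime import datetime, timezone

CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"


def cache_url(url):
    """Build the cache lookup URL, dropping a leading http:// or https://."""
    return CACHE_PREFIX + re.sub(r"^https?://", "", url)


def snapshot_time(header_text):
    """Return the snapshot timestamp found in the header text, or None."""
    match = re.search(
        r"as it appeared on (\d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2}) GMT",
        header_text,
    )
    if match is None:
        return None
    # %b expects English month abbreviations under the default C locale.
    parsed = datetime.strptime(match.group(1), "%d %b %Y %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


# Header sentence taken from the example in the answer.
header = (
    "It is a snapshot of the page as it appeared on "
    "21 Aug 2012 11:33:38 GMT."
)
taken = snapshot_time(header)

print(cache_url("http://stackoverflow.com"))
print(taken.isoformat())
print("age:", datetime.now(timezone.utc) - taken)
```

Subtracting the parsed timestamp from the current UTC time gives the cache age as a timedelta, which is the quantity the question asks about.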

Headless Browser and scraping – solutions [closed]

September 23, 2022 by Tarik

If Ruby is your thing, you may also try:

    https://github.com/chriskite/anemone (development stopped)
    https://github.com/sparklemotion/mechanize
    https://github.com/postmodern/spidr
    https://github.com/stewartmckee/cobweb
    http://watirwebdriver.com/ (Selenium)

Also, the Nokogiri gem can be used for scraping: http://nokogiri.org/

There is a dedicated book from Packt Publishing about how to use Nokogiri for scraping.

How to find elements by class

September 11, 2022 by Tarik

You can refine your search to find only those divs with a given class. In BeautifulSoup 4 the method is find_all (BeautifulSoup 3 spells it findAll): mydivs = soup.find_all("div", {"class": "stylelistrow"})