Using Python Requests with JavaScript pages

Good news: there is now a requests module that supports JavaScript: https://pypi.org/project/requests-html/

    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get('http://www.yourjspage.com')
    r.html.render()  # this call executes the JS in the page

As a bonus this wraps BeautifulSoup, I think, so you can do things like

    r.html.find('#myElementID', first=True).text

which returns the content of the HTML element …

How to scrape a website which requires login using Python and BeautifulSoup?

You can use mechanize:

    import http.cookiejar  # cookielib in Python 2

    import mechanize
    from bs4 import BeautifulSoup

    cj = http.cookiejar.CookieJar()
    br = mechanize.Browser()
    br.set_cookiejar(cj)
    br.open("https://id.arduino.cc/auth/login/")

    br.select_form(nr=0)  # select the first form on the page
    br.form['username'] = 'username'
    br.form['password'] = 'password.'
    br.submit()

    print(br.response().read())

Or urllib – Login to website using urllib2
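If you would rather stay with the Python 3 standard library, the same cookie-carrying login can be sketched roughly like this. It is a minimal sketch, not the answer's own method: the helper names `make_opener` and `build_login_request` are hypothetical, and the form field names `username` and `password` are assumed to match the real login form, as in the mechanize snippet.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Return an opener that stores cookies, plus the jar it writes to."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_login_request(url, username, password):
    """Build a POST request carrying the login form fields."""
    # Field names are assumptions; inspect the real form to confirm them.
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(url, data=form.encode("utf-8"))


opener, jar = make_opener()
request = build_login_request(
    "https://id.arduino.cc/auth/login/", "username", "password."
)
# Nothing is sent until the opener is asked to open the request:
# response = opener.open(request)
# Later opener.open(...) calls reuse whatever cookies the login response set.
```

Many login forms also carry hidden fields such as a CSRF token, which this sketch does not handle; the page can then be passed to BeautifulSoup as usual.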

Selenium with Scrapy for dynamic page

It really depends on how you need to scrape the site and what data you want to get. Here's an example of how you can follow pagination on eBay using Scrapy+Selenium:

    import scrapy
    from selenium import webdriver

    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['ebay.com']
        start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

        def __init__(self):
            self.driver = …

What is the difference between web-crawling and web-scraping? [duplicate]

Crawling would be essentially what Google, Yahoo, MSN, etc. do, looking for ANY information. Scraping is generally targeted at certain websites, for specific data, e.g. for price comparison, so the two are coded quite differently. Usually a scraper will be bespoke to the websites it is supposed to be scraping, and would be doing things a (good) …
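To make the contrast concrete, here is a rough sketch that runs both styles against a tiny in-memory "site". Everything in it is an illustrative assumption: the `PAGES` dictionary, its URLs and markup, and the helper names `crawl` and `scrape_price` are made up so the example runs without a network.

```python
from collections import deque
from html.parser import HTMLParser
import re

# A tiny in-memory "web" (made-up pages) so the sketch runs offline.
PAGES = {
    "/": '<a href="/books">Books</a> <a href="/about">About</a>',
    "/books": '<a href="/books/1">Book 1</a> <a href="/">Home</a>',
    "/books/1": '<h1>Python 101</h1> <span class="price">$25.00</span>',
    "/about": "<p>About us</p>",
}


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start):
    """Crawler: follow every link, breadth-first, and record every page seen."""
    seen, queue = [start], deque([start])
    while queue:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(PAGES.get(url, ""))
        for link in parser.links:
            if link not in seen:
                seen.append(link)
                queue.append(link)
    return seen


def scrape_price(url):
    """Scraper: pull one specific field from one known page layout."""
    match = re.search(r'<span class="price">\$([\d.]+)</span>', PAGES[url])
    return float(match.group(1)) if match else None


visited = crawl("/")
price = scrape_price("/books/1")
```

The crawler knows nothing about what the pages contain and simply discovers all of them, while the scraper is tied to one page layout and breaks as soon as that markup changes.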

How to scrape only visible webpage text with BeautifulSoup?

Try this:

    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import urllib.request

    def tag_visible(element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True

    def text_from_html(body):
        soup = BeautifulSoup(body, 'html.parser')
        texts = soup.findAll(text=True)
        visible_texts = filter(tag_visible, texts)
        return u" ".join(t.strip() for t in visible_texts)

    html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
    …
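The same filtering idea can be tried without any third-party install. The sketch below swaps BeautifulSoup for the standard library's `html.parser`, so it is a different technique from the answer's code, and the names `VisibleTextParser`, `visible_text` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

# Tags whose text content a browser does not render in the page body.
# <meta> is left out because it has no closing tag and no text of its own.
HIDDEN_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collect text that is not nested inside a hidden tag."""

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._hidden_depth == 0:
            self.chunks.append(text)

    # HTML comments go to handle_comment, which is not overridden,
    # so they never reach self.chunks.


def visible_text(body):
    parser = VisibleTextParser()
    parser.feed(body)
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>Storm</title><style>p {color: red}</style></head>"
    "<body><!-- hidden note --><p>Snow fell.</p>"
    "<script>var x = 1;</script><p>Roads closed.</p></body></html>"
)
visible = visible_text(sample)
```

Neither version catches elements hidden through CSS (for example `display: none`); both only filter by tag name and comments.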

Scraping html tables into R data frames using the XML package

…or a shorter try:

    library(XML)
    library(RCurl)
    library(rlist)

    theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team", .opts = list(ssl.verifypeer = FALSE))
    tables <- readHTMLTable(theurl)
    tables <- list.clean(tables, fun = is.null, recursive = FALSE)
    n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

The picked table is the longest one on the page:

    tables[[which.max(n.rows)]]

Problem HTTP error 403 in Python 3 Web Scraping

This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected). Try setting a known browser user agent with:

    from urllib.request import Request, urlopen

    req = Request(
        url="http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1",
        headers={'User-Agent': 'Mozilla/5.0'}
    )
    webpage = urlopen(req).read()

This works for me. By …
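You can check what urllib announces by default, and what your own request will announce, without sending anything. A small sketch; the helper name `browser_request` is made up for illustration and the URL is just the base of the one in the answer.

```python
from urllib.request import Request, build_opener


def browser_request(url, user_agent="Mozilla/5.0"):
    """Build a request that announces a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


# The default opener adds a "Python-urllib/<version>" User-Agent header,
# which is what server-side filters tend to recognize.
default_agent = dict(build_opener().addheaders)["User-agent"]

req = browser_request("http://www.cmegroup.com/trading/products/")
# urllib stores header names capitalized as "User-agent".
sent_agent = req.get_header("User-agent")
# Nothing has been sent yet; urlopen(req).read() would perform the fetch.
```

A header set on the `Request` takes precedence over the opener's default, so the server sees the browser-like value instead of the Python one.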

Retrieve links from web page using Python and BeautifulSoup [closed]

Here's a short snippet using the SoupStrainer class in BeautifulSoup:

    import httplib2
    from bs4 import BeautifulSoup, SoupStrainer

    http = httplib2.Http()
    status, response = http.request('http://www.nytimes.com')

    for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            print(link['href'])

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Edit: Note that I used the SoupStrainer class …
