web-scraping – Page 8

Simple jQuery selector only selects first element in Chrome..?

February 27, 2023 by Tarik

If jQuery isn’t present on the page, and no other code assigns something to $, Chrome’s JS console makes $ a shortcut for document.querySelector(), which returns only the first match. You can get what you want with $$(), which the console makes a shortcut for document.querySelectorAll(). To know if the page contains jQuery, you can execute jQuery …

Selenium-Debugging: Element is not clickable at point (X,Y)

February 20, 2023 by Tarik

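Another element is covering the element you are trying to click. You could use execute_script() to click it through JavaScript instead:
    # driver is an existing WebDriver instance.
    # Selenium 4 style; older releases used driver.find_element_by_class_name('pagination-r')
    from selenium.webdriver.common.by import By
    element = driver.find_element(By.CLASS_NAME, 'pagination-r')
    driver.execute_script("arguments[0].click();", element)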

What should I use to open a url instead of urlopen in urllib3

February 20, 2023 by Tarik

urllib3 is a different library from urllib and urllib2. It adds features that the standard-library urllib modules lack, such as connection re-use. The documentation is here: https://urllib3.readthedocs.org/ If you’d like to use urllib3, you’ll need to pip install urllib3. A basic example looks like this: from bs4 …
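
As a rough illustration of the connection re-use mentioned above (this is not the example truncated in the excerpt), here is a minimal sketch. The helper name fetch_text and its pool parameter are made up for this illustration; PoolManager and request() are urllib3's own.

```python
import urllib3


def fetch_text(url, pool=None):
    """GET url and return (status code, body decoded as text).

    fetch_text and its pool argument are hypothetical names for this sketch.
    """
    # A PoolManager keeps connections open and reuses them across requests,
    # so pass the same one in when fetching many pages from one host.
    manager = pool or urllib3.PoolManager()
    response = manager.request("GET", url)
    # Decoding as UTF-8 is an assumption; real pages may declare another charset.
    return response.status, response.data.decode("utf-8", errors="replace")
```

Creating one PoolManager up front and passing it to every call is what lets the library reuse connections; building a new manager per request would give that benefit away.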

Converting html to text with Python

February 19, 2023 by Tarik

soup.get_text() outputs what you want:
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html)
    print(soup.get_text())
Output:
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. …
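
If installing BeautifulSoup is not an option, the same idea can be approximated with the standard library alone. The sketch below is a swapped-in alternative, not what get_text() does internally: it joins text nodes with single spaces and skips script and style contents, so its whitespace will differ from BeautifulSoup's. The names TextExtractor and html_to_text are invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For anything beyond simple pages, BeautifulSoup remains the more forgiving choice, since it copes better with malformed markup.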

Web scraping with Java

February 15, 2023 by Tarik

jsoup
Extracting the title is not difficult, and you have many options; search here on Stack Overflow for “Java HTML parsers”. One of them is jsoup. If you know the page structure, you can navigate the page using the DOM; see http://jsoup.org/cookbook/extracting-data/dom-navigation It’s a good library and I’ve used it in my recent projects.

Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)

February 14, 2023 by Tarik

Attempt to use XHR2 responseType = "document" and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch. See https://gist.github.com/1138724 for an example of how I detect responseType = "document" support (synchronously checking response === null on an object URL created from a text/html blob). Use the Chrome WebRequest API to hide headers such as X-Requested-With.

Is it ok to scrape data from Google results? [closed]

February 11, 2023 by Tarik

Google disallows automated access in their TOS, so if you accept their terms you would be breaking them. That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google and used the results to power their search engine, Bing; they got caught red-handed in 2011 🙂 There are two options to scrape …

How to manage log in session through headless chrome?

February 9, 2023 by Tarik

You can save user data with the userDataDir option when launching Puppeteer. This stores the session and other data related to launching Chrome.
    puppeteer.launch({ userDataDir: "./user_data" });
The docs don’t go into great detail, but here’s a link: https://pptr.dev/#?product=Puppeteer&version=v1.6.1&show=api-puppeteerlaunchoptions

Extracting an information from web page by machine learning

February 3, 2023 by Tarik

First, your task fits into the information extraction area of research. There are mainly two levels of complexity for this task. The first is extracting from a given HTML page or a website with a fixed template (like Amazon). In this case the best way is to look at the HTML code of the pages and craft the …
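
For the fixed-template case, one way to act on what you see in the HTML is to pull out the text of elements carrying a known class. The sketch below is only an illustration under assumptions: the class names in the usage note ("price", "title") are hypothetical, the names ClassTextExtractor and extract_by_class are invented here, and it uses the standard library's html.parser rather than any particular scraping tool.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.values = []
        self._depth = 0    # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested element inside a match
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:  # the matching element just closed
            self.values.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_by_class(html, target_class):
    """Return the text of each element with the given class (hypothetical helper)."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.values
```

With a page whose template marks prices as class "price", extract_by_class(page, "price") would return one string per price element. The approach only holds while the site keeps its template stable.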

How to run Scrapy from within a Python script

January 29, 2023 by Tarik

All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 requires:
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished