web-crawler – Page 4

getting Forbidden by robots.txt: scrapy

February 23, 2023 by Tarik

Since Scrapy 1.1 (released 2016-05-11), the crawler downloads robots.txt before it starts crawling. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:

    ROBOTSTXT_OBEY = False

Here are the release notes
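To see why a request can end up "forbidden by robots.txt", here is a minimal sketch of the kind of check a robots.txt-obeying crawler performs. It uses Python's standard-library parser, not Scrapy's own implementation, and the robots.txt content, bot name, and URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public", "https://example.com/private/page"):
        print(url, is_allowed(ROBOTS_TXT, "mybot", url))
```

With ROBOTSTXT_OBEY = False, Scrapy skips this kind of check entirely, so use it only where you are entitled to ignore the site's robots.txt.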

PyPi download counts seem unrealistic

February 20, 2023 by Tarik

This is kind of an old question at this point, but I noticed the same thing about a package I have on PyPI and investigated further. It turns out PyPI keeps reasonably detailed download statistics, including (apparently slightly anonymised) user agents. From that, it was apparent that most people downloading my package were things like … Read more

Designing a web crawler

February 18, 2023 by Tarik

If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern scraper: In the course of extracting links, any Web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must … Read more
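The idea of a URL-seen test can be sketched as follows. This is a small in-memory illustration, not the design from the paper (whose details are cut off in the excerpt above). The canonicalization rules, the SHA-1 fingerprint, and the names canonicalize and UrlSeenTest are all assumptions made for the example.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal.

    Assumed rules: lowercase scheme and host, empty path becomes "/",
    and the fragment is dropped.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class UrlSeenTest:
    """Remember fixed-size fingerprints of the URLs encountered so far."""

    def __init__(self) -> None:
        self._fingerprints = set()

    def check_and_add(self, url: str) -> bool:
        """Return True if url was seen before; otherwise record it and return False."""
        fingerprint = hashlib.sha1(canonicalize(url).encode("utf-8")).digest()
        if fingerprint in self._fingerprints:
            return True
        self._fingerprints.add(fingerprint)
        return False


if __name__ == "__main__":
    seen = UrlSeenTest()
    for url in ("http://Example.com/a#top", "http://example.com/a", "http://example.com/b"):
        print(url, "seen" if seen.check_and_add(url) else "new")
```

Storing fingerprints rather than full URLs keeps memory use per entry constant. A crawler at real scale would need something beyond a single in-process set.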

Search in html source with GOOGLE? [closed]

February 12, 2023 by Tarik

I’ve come across the following resources on my travels (some already mentioned above). HTML mark-up-focused search engines: Nerdydata. I’d also like to throw in the following. Huge website crawl data archives: Common Crawl – ‘years of free web page data to help change the world’ (over 250TB). How can we analyze this crawl data? For … Read more

crawler vs scraper

February 10, 2023 by Tarik

A crawler gets web pages — i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore) it downloads whatever is linked to from the starting point(s). A scraper takes pages that have been downloaded or, in a more general sense, … Read more
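The crawler half of that description can be sketched as a breadth-first traversal with a depth limit and a list of ignored file types. This is a toy under stated assumptions: pages come from a caller-supplied fetch function (here a hypothetical in-memory SITE dictionary instead of real HTTP), links are used as-is without resolving relative URLs, and the names crawl, extract_links, and SITE are invented for the example.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def crawl(start_urls, fetch, max_depth=2, ignored_suffixes=(".pdf", ".zip")):
    """Download everything reachable from start_urls within max_depth links."""
    pages = {}
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        if url in pages or url.lower().endswith(ignored_suffixes):
            continue  # already downloaded, or a file type we ignore
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched
        pages[url] = html
        if depth < max_depth:
            for link in extract_links(html):
                queue.append((link, depth + 1))
    return pages


# Hypothetical in-memory "website" standing in for real HTTP requests.
SITE = {
    "/": '<a href="/a">A</a> <a href="/file.pdf">PDF</a>',
    "/a": '<a href="/b">B</a>',
    "/b": '<a href="/c">C</a>',
    "/c": "end of the line",
}

if __name__ == "__main__":
    print(sorted(crawl(["/"], SITE.get, max_depth=2)))
```

A scraper would then take the downloaded pages this returns and pull specific data out of them.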

How to run Scrapy from within a Python script

January 29, 2023 by Tarik

All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 requires:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished

How can I use different pipelines for different spiders in a single Scrapy project

January 13, 2023 by Tarik

Just remove all pipelines from the main settings and define them inside the spider instead. This sets the pipelines to use per spider:

    from scrapy.spiders.init import InitSpider

    class testSpider(InitSpider):
        name = "test"
        custom_settings = {
            'ITEM_PIPELINES': {
                'app.MyPipeline': 400
            }
        }

What is the difference between web-crawling and web-scraping? [duplicate]

January 6, 2023 by Tarik

Crawling would be essentially what Google, Yahoo, MSN, etc. do, looking for ANY information. Scraping is generally targeted at certain websites, for specific data, e.g. for price comparison, so the two are coded quite differently. Usually a scraper will be bespoke to the websites it is supposed to be scraping, and would be doing things a (good) … Read more

Hide Email Address from Bots – Keep mailto:

January 2, 2023 by Tarik

The issue with your request is specifically the “Supporting screen-readers”, as by definition screen readers are a “bot” of some sort. If a screen-reader needs to be able to interpret the email address, then a page-crawler would be able to interpret it as well. Also, the point of the mailto attribute is to be the … Read more

Detecting ‘stealth’ web-crawlers

December 28, 2022 by Tarik

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were then updated automatically … Read more
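The log-examination step described above might look roughly like the sketch below. It is not the system from the answer. It assumes each log line starts with the client IP address (as in the common access-log format), ignores time windows, and only reports candidate addresses rather than issuing firewall rules. The function name, threshold, and whitelist range are hypothetical.

```python
from collections import Counter
from ipaddress import ip_address, ip_network


def find_offenders(log_lines, max_requests, whitelist=()):
    """Return non-whitelisted IPs that made more than max_requests requests."""
    networks = [ip_network(entry) for entry in whitelist]
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        try:
            addr = ip_address(fields[0])  # assumed: client IP is the first field
        except ValueError:
            continue  # not a parseable address, skip the line
        counts[addr] += 1
    return sorted(
        str(addr)
        for addr, hits in counts.items()
        if hits > max_requests and not any(addr in net for net in networks)
    )


if __name__ == "__main__":
    sample = (
        ['203.0.113.5 - - [28/Dec/2022:10:00:00 +0000] "GET / HTTP/1.1" 200 512'] * 5
        + ['198.51.100.7 - - [28/Dec/2022:10:00:01 +0000] "GET / HTTP/1.1" 200 512'] * 5
        + ['192.0.2.1 - - [28/Dec/2022:10:00:02 +0000] "GET / HTTP/1.1" 200 512'] * 2
    )
    # Hypothetical whitelist entry standing in for a known good crawler range.
    print(find_offenders(sample, max_requests=3, whitelist=["198.51.100.0/24"]))
```

A production version would count requests per time window and feed the result into whatever firewall tooling is in use.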