PyPI download counts seem unrealistic

This is kind of an old question at this point, but I noticed the same thing about a package I have on PyPI and investigated further. It turns out PyPI keeps reasonably detailed download statistics, including (apparently slightly anonymised) user agents. From that, it was apparent that most people downloading my package were things like …

Designing a web crawler

If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern scraper: In the course of extracting links, any Web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must …
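As a rough illustration of the idea in the quoted passage, here is a minimal in-memory sketch of a URL-seen test. It is not the paper's data structure: the `normalize` helper, the `UrlSeen` class, and the choice of a truncated SHA-1 fingerprint are all assumptions made for this example.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, treat an empty path as '/'.

    Hypothetical helper: real crawlers usually apply more normalization rules.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class UrlSeen:
    """Hypothetical URL-seen test backed by a set of fixed-size fingerprints."""

    def __init__(self):
        self._fingerprints = set()

    def check_and_add(self, url: str) -> bool:
        """Return True if the URL was seen before; otherwise record it and return False."""
        # Store an 8-byte fingerprint rather than the full URL to save memory.
        fingerprint = hashlib.sha1(normalize(url).encode("utf-8")).digest()[:8]
        if fingerprint in self._fingerprints:
            return True
        self._fingerprints.add(fingerprint)
        return False
```

A crawler would call `check_and_add` on every extracted link and only queue the link when the call returns `False`. Truncated fingerprints can collide, so a very small fraction of new URLs could be wrongly skipped; that trade-off is part of this sketch, not a claim about the paper.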

Search in html source with GOOGLE? [closed]

I’ve come across the following resources on my travels (some already mentioned above):

HTML mark-up-focused search engines: Nerdydata

I’d also like to throw in the following:

Huge website crawl data archives: Common Crawl – ‘years of free web page data to help change the world’ (250TB+)

How can we analyze this crawl data? For …

crawler vs scraper

A crawler gets web pages: given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore), it downloads whatever is linked to from the starting point(s). A scraper takes pages that have been downloaded or, in a more general sense, …
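The distinction can be sketched in a few lines of Python. Everything named here is hypothetical: `crawl` follows links breadth-first up to `max_depth` and skips some file types, while `scrape_title` pulls one piece of data out of a page that has already been downloaded. The `fetch` argument is an assumed callable that returns the HTML for a URL (or `None`), so the example needs no network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_urls, fetch, max_depth=2, ignore_ext=(".jpg", ".png", ".pdf")):
    """Crawler: download pages reachable from start_urls; return {url: html}."""
    pages = {}
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        # Skip pages already downloaded and file types we were told to ignore.
        if url in pages or url.lower().endswith(ignore_ext):
            continue
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth < max_depth:
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                queue.append((urljoin(url, href), depth + 1))
    return pages


class TitleParser(HTMLParser):
    """Captures the text inside <title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape_title(html):
    """Scraper: extract one field (the title) from an already-downloaded page."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

In this sketch the crawler knows nothing about what the pages contain, and the scraper knows nothing about how they were obtained; a real fetcher would also need error handling, politeness delays, and robots.txt checks.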

How to run Scrapy from within a Python script

All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 requires:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished

What is the difference between web-crawling and web-scraping? [duplicate]

Crawling would be essentially what Google, Yahoo, MSN, etc. do, looking for ANY information. Scraping is generally targeted at certain websites, for specific data, e.g. for price comparison, so the two are coded quite differently. Usually a scraper will be bespoke to the websites it is supposed to be scraping, and would be doing things a (good) …

Detecting ‘stealth’ web-crawlers

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were then updated automatically …
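The general approach described above (count requests per IP in the logs, skip whitelisted ranges, block the rest) might look roughly like the sketch below. It is not the system from the answer: the function names, the threshold, the assumption that each log line starts with the client IP, and the use of `iptables` as the firewall are all choices made for this example. The rules are only returned as strings, never executed.

```python
import ipaddress
from collections import Counter


def find_offenders(log_lines, threshold, whitelist=()):
    """Return sorted IPs with more than `threshold` requests, excluding whitelisted ranges.

    Assumes each log line begins with the client IP (as in the common log format).
    """
    networks = [ipaddress.ip_network(cidr) for cidr in whitelist]
    counts = Counter()
    for line in log_lines:
        fields = line.split(None, 1)
        if not fields:
            continue
        try:
            ip = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address; ignore the line
        counts[ip] += 1
    return sorted(
        str(ip)
        for ip, hits in counts.items()
        if hits > threshold and not any(ip in net for net in networks)
    )


def block_rules(offenders):
    """Build (but do not run) one iptables DROP rule per offender.

    iptables is an assumption; the original answer does not name the firewall.
    """
    return [f"iptables -A INPUT -s {ip} -j DROP" for ip in offenders]
```

A production version would count within a sliding time window rather than over the whole log, and would refresh the whitelist on a schedule, as the answer describes.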