Change IP address dynamically?

One approach with Scrapy uses two components, RandomProxy and RotateUserAgentMiddleware. Register both in DOWNLOADER_MIDDLEWARES in settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
        'tutorial.randomproxy.RandomProxy': 100,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        'tutorial.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
    }

The scrapy.contrib paths above are from Scrapy 0.x; from Scrapy 1.0 on, the built-in middlewares live under scrapy.downloadermiddlewares.

Random Proxy

You can use scrapy-proxies. This component will process Scrapy requests …
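For illustration, the two custom components named in the settings could look roughly like the sketch below. This is not the code of scrapy-proxies or of the original RotateUserAgentMiddleware: the class bodies, the USER_AGENTS and PROXIES lists, and the proxy addresses are all hypothetical. It relies only on the downloader-middleware hook process_request(request, spider) and on the fact that Scrapy's HttpProxyMiddleware honours request.meta['proxy'].

```python
import random

# Hypothetical pools; replace them with your own values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXIES = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


class RotateUserAgentMiddleware:
    """Set a randomly chosen User-Agent header on every outgoing request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None tells Scrapy to keep processing this request


class RandomProxy:
    """Route every outgoing request through a randomly chosen proxy."""

    def __init__(self, proxies=PROXIES):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # HttpProxyMiddleware reads the proxy URL from request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        return None
```

Because RandomProxy is registered at 100 and HttpProxyMiddleware at 110, the proxy is already set in request.meta by the time HttpProxyMiddleware sees the request.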

Click a Button in Scrapy

Scrapy cannot execute JavaScript. If you absolutely must interact with the JavaScript on the page, use Selenium. With Scrapy alone, the solution depends on what the button does. If it only reveals content that was previously hidden, you can scrape the data without a problem; it doesn’t …
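To make the "previously hidden content" case concrete, here is a minimal sketch. The page, the details id and the HiddenTextParser class are hypothetical, and the standard library's html.parser stands in for Scrapy selectors so the example runs without any dependencies. The point is only that content hidden with CSS is still present in the downloaded HTML.

```python
from html.parser import HTMLParser

# Hypothetical page: the button only toggles the CSS display of #details.
PAGE = """
<html><body>
  <button onclick="document.getElementById('details').style.display='block'">Show details</button>
  <div id="details" style="display:none">Price: 42 EUR</div>
</body></html>
"""


class HiddenTextParser(HTMLParser):
    """Collect the text inside the element with the given id.

    Sketch only: it assumes every opened tag inside the target is closed
    (no bare void elements such as <br>).
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


parser = HiddenTextParser("details")
parser.feed(PAGE)
hidden_text = "".join(parser.chunks).strip()
print(hidden_text)  # prints "Price: 42 EUR"
```

Inside a spider, a selector such as response.css('#details::text').get() would play the same role.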

Scrapy Unit Testing

The way I’ve done it is to create fake responses, so you can test the parse function offline while still exercising it against real HTML. A drawback of this approach is that your local HTML file may not reflect the latest state online. So if the HTML changes online you may have …

TypeError: Object of type 'bytes' is not JSON serializable

You are creating those bytes objects yourself:

    item['title'] = [t.encode('utf-8') for t in title]
    item['link'] = [l.encode('utf-8') for l in link]
    item['desc'] = [d.encode('utf-8') for d in desc]
    items.append(item)

Each of those t.encode(), l.encode() and d.encode() calls creates a bytes object. Do not do this; leave it to the JSON serialiser to encode these strings. Next, …
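The difference is easy to reproduce with the standard json module alone. In the sketch below the titles list and the variable names are made up for illustration. A dict of str values serialises cleanly, while the same values encoded to bytes raise the TypeError from the title.

```python
import json

# Hypothetical scraped values; Scrapy selectors already return str in Python 3.
titles = ["Café menu", "Plain title"]

# Leaving the values as str works; ensure_ascii=False keeps "é" readable.
item_ok = {"title": titles}
encoded = json.dumps(item_ok, ensure_ascii=False)

# Encoding each value first produces bytes, which json cannot serialise.
item_bad = {"title": [t.encode("utf-8") for t in titles]}
try:
    json.dumps(item_bad)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(encoded)
print(error_message)
```

If bytes are really needed, for example when writing to a binary file, encode the finished JSON string once with encoded.encode("utf-8") rather than encoding each field.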

How to run Scrapy from within a Python script

All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 requires:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished

Selenium with Scrapy for dynamic page

It really depends on how you need to scrape the site and what data you want to get. Here’s an example of how you can follow pagination on eBay using Scrapy + Selenium:

    import scrapy
    from selenium import webdriver

    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['ebay.com']
        start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

        def __init__(self):
            self.driver = …

How to use PyCharm to debug Scrapy projects

The scrapy command is a Python script, which means you can start it from inside PyCharm. When you examine the scrapy executable (which scrapy), you will see its contents:

    #!/usr/bin/python
    from scrapy.cmdline import execute
    execute()

This means that a command like scrapy crawl IcecatCrawler can also be executed like this: …
