Run a Scrapy spider in a Celery Task

The Twisted reactor cannot be restarted. A workaround for this is to let the Celery task fork a new child process for each crawl you want to execute, as proposed in the following post: Running Scrapy spiders in a Celery task. This gets around the “reactor is not restartable” issue by utilizing the multiprocessing … Read more
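A minimal sketch of the child-process idea, using only the standard library (the function names and the placeholder crawl are illustrative, not taken from the post): each run happens in a fresh multiprocessing.Process, so it gets brand-new interpreter state and, with it, a fresh Twisted reactor.

```python
import multiprocessing

def run_in_child(target, *args):
    """Run target(*args) in a fresh child process and wait for it.

    Because the child starts with clean interpreter state, a Twisted
    reactor started inside `target` never collides with a reactor that
    already ran in the parent (or in a previous task).
    """
    p = multiprocessing.Process(target=target, args=args)
    p.start()
    p.join()
    return p.exitcode

def crawl(url):
    # Placeholder for the real Scrapy crawl; illustrative only.
    print("crawling", url)

if __name__ == "__main__":
    run_in_child(crawl, "http://example.com")
    run_in_child(crawl, "http://example.org")  # fine: new process, new reactor
```

The second call would fail with “ReactorNotRestartable” if both crawls ran in the same process; forking sidesteps that entirely.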

Passing an argument to a callback function [duplicate]

This is what you’d use the meta keyword for.

def parse(self, response):
    for sel in response.xpath('//tbody/tr'):
        item = HeroItem()
        # Item assignment here
        url = 'https://' + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()
        yield Request(url, callback=self.parse_profile, meta={'hero_item': item})

def parse_profile(self, response):
    item = response.meta.get('hero_item')
    item['weapon'] = response.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    yield item

Also note, doing sel = Selector(response) is a waste … Read more
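The mechanics can be illustrated without Scrapy itself. This is a toy sketch (every name here is hypothetical) of how a request object’s meta dict carries an item from one callback to the next:

```python
class Request:
    """Toy stand-in for scrapy.Request: holds a callback and a meta dict."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

def parse(rows):
    # First callback: build an item, then hand it onward via meta.
    for row in rows:
        item = {"server": row["server"]}
        yield Request("https://" + row["server"] + ".battle.net/",
                      callback=parse_profile,
                      meta={"hero_item": item})

def parse_profile(request):
    # Second callback: pull the item back out of meta and enrich it.
    item = request.meta["hero_item"]
    item["weapon"] = "doomhammer"
    return item

requests = list(parse([{"server": "eu"}]))
item = requests[0].callback(requests[0])  # the scheduler does this in Scrapy
```

In real Scrapy the engine invokes the callback with a Response whose `.meta` is copied from the originating Request; the dict hand-off is the same.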

Running Scrapy spiders in a Celery task

Okay, here is how I got Scrapy working with my Django project that uses Celery for queuing up what to crawl. The actual workaround came primarily from joehillen’s code located here: http://snippets.scrapy.org/snippets/13/

First the tasks.py file:

from celery import task

@task()
def crawl_domain(domain_pk):
    from crawl import domain_crawl
    return domain_crawl(domain_pk)

Then the crawl.py file:

from multiprocessing … Read more
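The crawl.py side pairs the task with a multiprocessing child. A hedged sketch of that pattern (the worker body and result plumbing are illustrative, not joehillen’s exact code): the child process does the crawl and ships its result back over a Queue, so the parent — the Celery worker — never starts a reactor of its own.

```python
import multiprocessing

def _crawl_worker(domain_pk, queue):
    # In the real code this would configure and run the Scrapy crawler;
    # here we just compute a placeholder result.
    queue.put("crawled domain %s" % domain_pk)

def domain_crawl(domain_pk):
    """Run one crawl in a throwaway child process and return its result."""
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=_crawl_worker, args=(domain_pk, queue))
    p.start()
    result = queue.get()  # blocks until the child reports back
    p.join()
    return result
```

Each call to the crawl_domain task then pays the cost of one fork, but never trips over an already-stopped reactor.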

scrapy text encoding

Since Scrapy 1.2.0, a new setting, FEED_EXPORT_ENCODING, is introduced. By specifying it as utf-8, JSON output will not be escaped. That is, add to your settings.py:

FEED_EXPORT_ENCODING = 'utf-8'
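The effect mirrors the ensure_ascii flag of json.dumps. A quick standard-library illustration of what escaped versus unescaped output looks like (the exporter wiring is Scrapy’s, but the escaping behavior below is plain json):

```python
import json

# Default JSON encoding escapes non-ASCII characters...
escaped = json.dumps({"title": "café"})
# ...while disabling ensure_ascii emits raw UTF-8, which is what
# FEED_EXPORT_ENCODING = 'utf-8' asks Scrapy's JSON feed exporter to do.
unescaped = json.dumps({"title": "café"}, ensure_ascii=False)
```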

how to filter duplicate requests based on url in scrapy

You can write a custom dupe filter for duplicate removal and add it in settings:

import os
from scrapy.dupefilter import RFPDupeFilter

class CustomFilter(RFPDupeFilter):
    """A dupe filter that considers specific ids in the url"""

    def __getid(self, url):
        mm = url.split("&refer")[0]  # or something like that
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True … Read more
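The dedupe logic itself is plain Python and can be sketched without Scrapy. This toy class (names are mine, not the answer’s) shows what request_seen boils down to: normalize the URL, then check it against a set of fingerprints already seen.

```python
class SimpleDupeFilter:
    """Sketch of the RFPDupeFilter idea: dedupe on a normalized url."""
    def __init__(self):
        self.fingerprints = set()

    def _getid(self, url):
        # Treat everything after "&refer" as noise, as in the answer.
        return url.split("&refer")[0]

    def request_seen(self, url):
        fp = self._getid(url)
        if fp in self.fingerprints:
            return True   # duplicate: Scrapy would drop this request
        self.fingerprints.add(fp)
        return False

f = SimpleDupeFilter()
first = f.request_seen("http://example.com/page?id=1&refer=a")
second = f.request_seen("http://example.com/page?id=1&refer=b")
```

Both URLs normalize to the same fingerprint, so the second is flagged as a duplicate. The real filter is activated by pointing the DUPEFILTER_CLASS setting at your subclass.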

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)