InterfaceError: connection already closed (using django + celery + Scrapy)

Unfortunately this is a problem with the django + psycopg2 + celery combination. It’s an old, unsolved problem; take a look at this thread for background: https://github.com/celery/django-celery/issues/121 Basically, when Celery starts a worker, it forks the process along with the database connection from the django.db framework. If this connection drops for some reason, a new one is not created. Celery … Read more
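The usual workaround is a check-and-reconnect step before each unit of work: if the inherited connection is no longer usable, discard it and open a fresh one (in Django, `django.db.close_old_connections()` does essentially this). A minimal standard-library sketch of that pattern, using sqlite3 as a stand-in for psycopg2 since neither Django nor Celery is assumed here:

```python
import sqlite3

def ensure_usable(conn, dsn=':memory:'):
    """Return a usable connection: reuse `conn` if it still works,
    otherwise open a fresh one (what a worker should do per task)."""
    try:
        conn.execute('SELECT 1')   # cheap liveness probe
        return conn
    except sqlite3.ProgrammingError:  # raised when the connection is closed
        return sqlite3.connect(dsn)

conn = sqlite3.connect(':memory:')
conn.close()                 # simulate the forked connection being dropped
conn = ensure_usable(conn)   # worker recovers instead of crashing
print(conn.execute('SELECT 1').fetchone())  # -> (1,)
```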

Passing a argument to a Scrapy callback function [duplicate]

This is what you’d use the meta keyword argument for:

```python
def parse(self, response):
    for sel in response.xpath('//tbody/tr'):
        item = HeroItem()
        # Item assignment here
        url = ('https://' + item['server'] + '.battle.net/' +
               sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip())
        yield Request(url, callback=self.parse_profile, meta={'hero_item': item})

def parse_profile(self, response):
    item = response.meta.get('hero_item')
    item['weapon'] = response.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
    yield item
```

Also note, doing `sel = Selector(response)` is a waste … Read more

Send Post Request in Scrapy

The answers above don’t really solve the problem: they send the data as parameters instead of as JSON in the body of the request. From http://bajiecc.cc/questions/1135255/scrapy-formrequest-sending-json:

```python
my_data = {'field1': 'value1', 'field2': 'value2'}
request = scrapy.Request(
    url,
    method='POST',
    body=json.dumps(my_data),
    headers={'Content-Type': 'application/json'},
)
```
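The key detail is that `body` must already be a JSON string; Scrapy will not serialize a dict for you. A quick standard-library check of that encoding step (the field names are just the placeholders from the snippet above):

```python
import json

my_data = {'field1': 'value1', 'field2': 'value2'}
body = json.dumps(my_data)          # what goes into Request(body=...)
headers = {'Content-Type': 'application/json'}

# the server decodes the body back to the original payload
assert json.loads(body) == my_data
print(body)  # -> {"field1": "value1", "field2": "value2"}
```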

Scrapy – Reactor not Restartable [duplicate]

You cannot restart the reactor, but you should be able to run it multiple times by forking a separate process:

```python
import scrapy
import scrapy.crawler as crawler
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote …
```

Read more
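The essence of the trick is independent of Scrapy: each crawl runs in a freshly forked process, so every run gets its own reactor, and results travel back to the parent through a `Queue`. A stripped-down, Scrapy-free sketch of that structure (`run_crawl` is a hypothetical stand-in for the function that would configure the crawler and call `reactor.run()`):

```python
from multiprocessing import Process, Queue

def run_crawl(q):
    # stand-in for the real worker: set up the crawler, reactor.run(),
    # then report the outcome back to the parent
    q.put('finished')

def crawl_twice():
    results = []
    for _ in range(2):   # a fresh process each time means a fresh reactor
        q = Queue()
        p = Process(target=run_crawl, args=(q,))
        p.start()
        results.append(q.get())
        p.join()
    return results

if __name__ == '__main__':
    print(crawl_twice())  # -> ['finished', 'finished']
```

The `if __name__ == '__main__':` guard matters on platforms where multiprocessing spawns rather than forks, since the child re-imports the module.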

Best way for a beginner to learn screen scraping by Python [closed]

I agree that the Scrapy docs give off that impression. But I believe, as I found for myself, that if you are patient with Scrapy, go through the tutorials first, and then bury yourself in the rest of the documentation, you will not only start to understand the different parts of Scrapy better, but … Read more

suppress Scrapy Item printed in logs after pipeline

Another approach is to override the __repr__ method of the Item subclasses to selectively choose which attributes (if any) to print at the end of the pipeline:

```python
from scrapy.item import Item, Field

class MyItem(Item):
    attr1 = Field()
    attr2 = Field()
    # ...
    attrN = Field()

    def __repr__(self):
        """only print out attr1 after exiting the Pipeline"""
```

… Read more
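The mechanism here is ordinary Python: Scrapy logs items using their `repr()`, so whatever `__repr__` returns is what lands in the log line. A Scrapy-free illustration with a plain class (`attr1`/`attr2` are just the placeholder names from the snippet above):

```python
class MyItem:
    def __init__(self, attr1, attr2):
        self.attr1 = attr1
        self.attr2 = attr2   # sensitive or bulky: kept out of the repr

    def __repr__(self):
        # only attr1 shows up when the item is logged
        return repr({'attr1': self.attr1})

item = MyItem('visible', 'hidden')
print(item)  # -> {'attr1': 'visible'}
```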

Crawling with an authenticated session in Scrapy

Do not override the parse function in a CrawlSpider: When you are using a CrawlSpider, you shouldn’t override the parse function. There’s a warning in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules. Logging in before … Read more

Access django models inside of Scrapy

If anyone else is having the same problem, this is how I solved it. I added this to my scrapy settings.py file:

```python
def setup_django_env(path):
    import imp, os
    from django.core.management import setup_environ

    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)
    setup_environ(project)

setup_django_env('/path/to/django/project/')
```

Note: the path above is to your django project folder, … Read more
