How to force Scrapy to crawl duplicate URLs?
You’re probably looking for the dont_filter=True argument on Request(). See http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects
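As an illustration, a minimal sketch (the spider name, URL, and callback names are placeholders, not from the answer above): passing dont_filter=True on a Request tells the scheduler to skip the duplicate filter, so the same URL can be fetched again.

    import scrapy

    class DuplicatesSpider(scrapy.Spider):
        name = "dupes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # dont_filter=True bypasses the dupefilter, so this request is
            # scheduled even though the URL has already been crawled
            yield scrapy.Request(response.url, callback=self.parse_again, dont_filter=True)

        def parse_again(self, response):
            yield {"url": response.url}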
Unfortunately this is a problem with the django + psycopg2 + celery combo. It’s an old and unsolved problem. Take a look at this thread to understand: https://github.com/celery/django-celery/issues/121 Basically, when celery starts a worker, it forks the database connection from the django.db framework. If this connection drops for some reason, it doesn’t create a new one. Celery … Read more
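A common workaround (not the truncated answer above, just a sketch of the usual fix, assuming a reasonably recent Django that provides connections.close_all()) is to close the inherited connections when each worker process starts, so Django opens a fresh one on first use:

    from celery.signals import worker_process_init
    from django.db import connections

    @worker_process_init.connect
    def reset_db_connections(**kwargs):
        # drop connections forked from the parent process; Django will
        # reconnect lazily the next time a query runs in this worker
        connections.close_all()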
This is what you’d use the meta keyword for.

    def parse(self, response):
        for sel in response.xpath('//tbody/tr'):
            item = HeroItem()
            # Item assignment here
            url = "https://" + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()
            yield Request(url, callback=self.parse_profile, meta={'hero_item': item})

    def parse_profile(self, response):
        item = response.meta.get('hero_item')
        item['weapon'] = response.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
        yield item

Also note, doing sel = Selector(response) is a waste … Read more
You can crawl a local file using a URL of the following form: file:///path/to/file.html
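For instance, a minimal sketch (the file path and spider name are placeholders) showing a file:// URL in start_urls:

    import scrapy

    class LocalFileSpider(scrapy.Spider):
        name = "localfile"
        # point this at an HTML file on disk
        start_urls = ["file:///path/to/file.html"]

        def parse(self, response):
            # the response is parsed exactly like one fetched over HTTP
            yield {"title": response.xpath("//title/text()").get()}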
The answers above do not really solve the problem. They send the data as parameters instead of as JSON in the body of the request. From http://bajiecc.cc/questions/1135255/scrapy-formrequest-sending-json:

    import json
    import scrapy

    my_data = {'field1': 'value1', 'field2': 'value2'}
    request = scrapy.Request(
        url,
        method='POST',
        body=json.dumps(my_data),
        headers={'Content-Type': 'application/json'},
    )
You cannot restart the reactor, but you should be able to run it more times by forking a separate process:

    import scrapy
    import scrapy.crawler as crawler
    from scrapy.utils.log import configure_logging
    from multiprocessing import Process, Queue
    from twisted.internet import reactor

    # your spider
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://quotes.toscrape.com/tag/humor/']

        def parse(self, response):
            for quote … Read more
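The snippet above is cut off before the process-forking part. Purely as an illustration of the same idea (a sketch, not the original code), one way to run each crawl in a fresh child process so the reactor never needs restarting:

    from multiprocessing import Process
    from scrapy.crawler import CrawlerProcess

    def _crawl(spider_cls):
        # runs inside the child process, so a brand-new reactor is used each time
        process = CrawlerProcess(settings={"LOG_ENABLED": False})
        process.crawl(spider_cls)
        process.start()  # blocks until the crawl finishes

    def run_spider(spider_cls):
        p = Process(target=_crawl, args=(spider_cls,))
        p.start()
        p.join()

    # run_spider(QuotesSpider) can now be called repeatedly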
I agree that the Scrapy docs give off that impression. But I believe, as I found for myself, that if you are patient with Scrapy, go through the tutorials first, and then bury yourself in the rest of the documentation, you will not only start to understand the different parts of Scrapy better, but … Read more
Another approach is to override the __repr__ method of the Item subclasses to selectively choose which attributes (if any) to print at the end of the pipeline:

    from scrapy.item import Item, Field

    class MyItem(Item):
        attr1 = Field()
        attr2 = Field()
        # ...
        attrN = Field()

        def __repr__(self):
            """only print out attr1 after exiting the Pipeline""" … Read more
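The body of that __repr__ is cut off above; as a sketch of what such an override could look like (the field names are placeholders, not the original answer’s code):

    from scrapy.item import Item, Field

    class MyItem(Item):
        attr1 = Field()
        attr2 = Field()

        def __repr__(self):
            # only attr1 appears in the "Scraped from <...>" log line
            return repr({"attr1": self.get("attr1")})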
Do not override the parse function in a CrawlSpider: When you are using a CrawlSpider, you shouldn’t override the parse function. There’s a warning in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules. Logging in before … Read more
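As an illustration of leaving parse alone, a minimal sketch using current Scrapy import paths (which differ from the 0.14 docs linked above); the spider name, start URL, and allow pattern are placeholders:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MyCrawlSpider(CrawlSpider):
        name = "example"
        start_urls = ["http://quotes.toscrape.com/"]

        # point the rule at a custom callback; CrawlSpider's own parse()
        # is left untouched so it can keep dispatching responses to the rules
        rules = (
            Rule(LinkExtractor(allow=r"/tag/"), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            yield {"url": response.url}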
If anyone else is having the same problem, this is how I solved it. I added this to my scrapy settings.py file:

    def setup_django_env(path):
        import imp, os
        from django.core.management import setup_environ

        f, filename, desc = imp.find_module('settings', [path])
        project = imp.load_module('settings', f, filename, desc)
        setup_environ(project)

    setup_django_env('/path/to/django/project/')

Note: the path above is to your django project folder, … Read more
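As a side note, setup_environ was removed from newer Django releases; on Django 1.7+ the rough equivalent (a sketch, with a placeholder settings module name) would be:

    import os
    import django

    # "myproject.settings" is a placeholder for your own settings module
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
    django.setup()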