How to force Scrapy to crawl duplicate URLs?
You’re probably looking for the dont_filter=True argument on Request(). See http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects
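As an illustration, a minimal sketch (the spider name, URL, and callback names are placeholders, not from the answer above): passing dont_filter=True on a Request tells the scheduler to skip the duplicate filter, so the same URL can be fetched again.

    import scrapy

    class DuplicatesSpider(scrapy.Spider):
        name = "dupes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # dont_filter=True bypasses the dupefilter, so this request is
            # scheduled even though the URL has already been crawled
            yield scrapy.Request(response.url, callback=self.parse_again, dont_filter=True)

        def parse_again(self, response):
            yield {"url": response.url}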
Unfortunately this is a problem with the django + psycopg2 + celery combo. It’s an old and unsolved problem. Take a look at this thread to understand: https://github.com/celery/django-celery/issues/121 Basically, when celery starts a worker, it forks the database connection from the django.db framework. If this connection drops for some reason, it doesn’t create a new one. Celery … Read more
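A common workaround (not the truncated answer above, just a sketch of the usual fix, assuming a reasonably recent Django that provides connections.close_all()) is to close the inherited connections when each worker process starts, so Django opens a fresh one on first use:

    from celery.signals import worker_process_init
    from django.db import connections

    @worker_process_init.connect
    def reset_db_connections(**kwargs):
        # drop connections forked from the parent process; Django will
        # reconnect lazily the next time a query runs in this worker
        connections.close_all()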
This is what you’d use the meta keyword for.

    def parse(self, response):
        for sel in response.xpath('//tbody/tr'):
            item = HeroItem()
            # Item assignment here
            url = "https://" + item['server'] + '.battle.net/' + sel.xpath('td[@class="cell-BattleTag"]//a/@href').extract()[0].strip()
            yield Request(url, callback=self.parse_profile, meta={'hero_item': item})

    def parse_profile(self, response):
        item = response.meta.get('hero_item')
        item['weapon'] = response.xpath('//li[@class="slot-mainHand"]/a[@class="slot-link"]/@href').extract()[0].split('/')[4]
        yield item

Also note, doing sel = Selector(response) is a waste … Read more
You can crawl a local file using a URL of the following form: file:///path/to/file.html
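For instance, a minimal sketch (the file path and spider name are placeholders) showing a file:// URL in start_urls:

    import scrapy

    class LocalFileSpider(scrapy.Spider):
        name = "localfile"
        # point this at an HTML file on disk
        start_urls = ["file:///path/to/file.html"]

        def parse(self, response):
            # the response is parsed exactly like one fetched over HTTP
            yield {"title": response.xpath("//title/text()").get()}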
The answers above do not really solve the problem. They send the data as parameters instead of as JSON in the body of the request. From http://bajiecc.cc/questions/1135255/scrapy-formrequest-sending-json:

    import json
    import scrapy

    my_data = {'field1': 'value1', 'field2': 'value2'}
    request = scrapy.Request(
        url,
        method='POST',
        body=json.dumps(my_data),
        headers={'Content-Type': 'application/json'},
    )
You cannot restart the reactor, but you should be able to run it more times by forking a separate process:

    import scrapy
    import scrapy.crawler as crawler
    from scrapy.utils.log import configure_logging
    from multiprocessing import Process, Queue
    from twisted.internet import reactor

    # your spider
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://quotes.toscrape.com/tag/humor/']

        def parse(self, response):
            for quote … Read more
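The snippet above is cut off before the process-forking part. Purely as an illustration of the same idea (a sketch, not the original code), one way to run each crawl in a fresh child process so the reactor never needs restarting:

    from multiprocessing import Process
    from scrapy.crawler import CrawlerProcess

    def _crawl(spider_cls):
        # runs inside the child process, so a brand-new reactor is used each time
        process = CrawlerProcess(settings={"LOG_ENABLED": False})
        process.crawl(spider_cls)
        process.start()  # blocks until the crawl finishes

    def run_spider(spider_cls):
        p = Process(target=_crawl, args=(spider_cls,))
        p.start()
        p.join()

    # run_spider(QuotesSpider) can now be called repeatedly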
I agree that the Scrapy docs give off that impression. But I believe, as I found for myself, that if you are patient with Scrapy, go through the tutorials first, and then bury yourself in the rest of the documentation, you will not only start to understand the different parts of Scrapy better, but … Read more
Another approach is to override the __repr__ method of the Item subclasses to selectively choose which attributes (if any) to print at the end of the pipeline:

    from scrapy.item import Item, Field

    class MyItem(Item):
        attr1 = Field()
        attr2 = Field()
        # ...
        attrN = Field()

        def __repr__(self):
            """only print out attr1 after exiting the Pipeline""" … Read more
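The body of that __repr__ is cut off above; as a sketch of what such an override could look like (the field names are placeholders, not the original answer’s code):

    from scrapy.item import Item, Field

    class MyItem(Item):
        attr1 = Field()
        attr2 = Field()

        def __repr__(self):
            # only attr1 appears in the "Scraped from <...>" log line
            return repr({"attr1": self.get("attr1")})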
Do not override the parse function in a CrawlSpider: When you are using a CrawlSpider, you shouldn’t override the parse function. There’s a warning in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules. Logging in before … Read more
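As an illustration of leaving parse alone, a minimal sketch using current Scrapy import paths (which differ from the 0.14 docs linked above); the spider name, start URL, and allow pattern are placeholders:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MyCrawlSpider(CrawlSpider):
        name = "example"
        start_urls = ["http://quotes.toscrape.com/"]

        # point the rule at a custom callback; CrawlSpider's own parse()
        # is left untouched so it can keep dispatching responses to the rules
        rules = (
            Rule(LinkExtractor(allow=r"/tag/"), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            yield {"url": response.url}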
If anyone else is having the same problem, this is how I solved it. I added this to my scrapy settings.py file:

    def setup_django_env(path):
        import imp, os
        from django.core.management import setup_environ

        f, filename, desc = imp.find_module('settings', [path])
        project = imp.load_module('settings', f, filename, desc)
        setup_environ(project)

    setup_django_env('/path/to/django/project/')

Note: the path above is to your django project folder, … Read more
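As a side note, setup_environ was removed from newer Django releases; on Django 1.7+ the rough equivalent (a sketch, with a placeholder settings module name) would be:

    import os
    import django

    # "myproject.settings" is a placeholder for your own settings module
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
    django.setup()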