How to force Scrapy to crawl a duplicate URL?
You’re probably looking for the dont_filter=True argument on Request(). See http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects
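For example, a minimal sketch of a spider that re-fetches a page it has already seen (the spider name, URL, and stopping condition here are illustrative, not from the quoted answer):

    import scrapy

    class DuplicatesSpider(scrapy.Spider):
        name = "duplicates"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            self.times = getattr(self, "times", 0) + 1
            if self.times < 3:  # illustrative guard so the loop terminates
                # dont_filter=True bypasses the built-in duplicate-request
                # filter, so the same URL is scheduled and fetched again
                yield scrapy.Request(response.url, callback=self.parse,
                                     dont_filter=True)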
The answers above do not really solve the problem: they send the data as form parameters instead of as JSON in the body of the request. From http://bajiecc.cc/questions/1135255/scrapy-formrequest-sending-json:

    import json
    import scrapy

    my_data = {'field1': 'value1', 'field2': 'value2'}
    request = scrapy.Request(
        url,
        method='POST',
        body=json.dumps(my_data),
        headers={'Content-Type': 'application/json'},
    )
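As an aside, if you are on Scrapy 1.8 or later (an assumption about your version), scrapy.http.JsonRequest wraps the same pattern, serializing the data and setting the header for you:

    from scrapy.http import JsonRequest

    my_data = {'field1': 'value1', 'field2': 'value2'}
    # JsonRequest JSON-encodes `data` into the body and sets the
    # Content-Type: application/json header; the method defaults to POST
    # (url here is the same variable as in the snippet above)
    request = JsonRequest(url, data=my_data)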
You cannot restart the reactor, but you should be able to run it more than once by forking a separate process:

    import scrapy
    import scrapy.crawler as crawler
    from scrapy.utils.log import configure_logging
    from multiprocessing import Process, Queue
    from twisted.internet import reactor

    # your spider
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://quotes.toscrape.com/tag/humor/']

        def parse(self, response):
            for quote …
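The quoted snippet is cut off above. A sketch of how the rest of this pattern usually continues (the run_spider wrapper below is a reconstruction of the approach described, not part of the original answer):

    # a wrapper that runs each crawl in a child process, so every call
    # gets a fresh Twisted reactor instead of the unrestartable one
    def run_spider(spider):
        def f(q):
            try:
                runner = crawler.CrawlerRunner()
                deferred = runner.crawl(spider)
                deferred.addBoth(lambda _: reactor.stop())
                reactor.run()
                q.put(None)
            except Exception as e:
                q.put(e)

        q = Queue()
        p = Process(target=f, args=(q,))
        p.start()
        result = q.get()
        p.join()
        if result is not None:
            raise result

    # can now be called repeatedly without a ReactorNotRestartable error
    run_spider(QuotesSpider)
    run_spider(QuotesSpider)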
Here you go:

    var phantom = require('phantom');

    phantom.create(function (ph) {
        ph.createPage(function (page) {
            var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
            page.open(url, function () {
                // inject jQuery into the page before evaluating
                page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function () {
                    page.evaluate(function () {
                        // log the href of every link in the list items
                        $('.listMain > li').each(function () {
                            console.log($(this).find('a').attr('href'));
                        });
                    }, function () {
                        ph.exit();
                    });
                });
            });
        });
    });
Pass the spider arguments on the process.crawl method:

    process.crawl(spider, input='inputargument', first='James', last='Bond')
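For context, a minimal sketch of the receiving side (the CrawlerProcess setup and spider class here are illustrative): keyword arguments passed to process.crawl are forwarded to the spider's __init__:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["http://quotes.toscrape.com/"]

        def __init__(self, input=None, first=None, last=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # the keyword arguments from process.crawl() arrive here
            self.input = input
            self.first = first
            self.last = last

        def parse(self, response):
            self.logger.info("crawled %s for %s %s",
                             response.url, self.first, self.last)

    process = CrawlerProcess()
    process.crawl(MySpider, input="inputargument", first="James", last="Bond")
    process.start()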
You should run the scrapy crawl spider_name command from within a Scrapy project folder, where the scrapy.cfg file resides. From the docs: "Crawling: to put our spider to work, go to the project's top level directory and run:"

    scrapy crawl dmoz
Crawling the Web is conceptually simple. Treat the Web as a very complicated directed graph: each page is a node, and each link is a directed edge. You could start from the assumption that a single well-chosen starting point will eventually lead to every other point. This won't be strictly true, but in practice I …
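To make the graph framing concrete, here is a minimal breadth-first crawl sketch (the requests and BeautifulSoup libraries are my choice; the answer above names no tools). Pages are nodes, links are edges, and a visited set keeps each node from being expanded twice:

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed, max_pages=50):
        visited = set()        # nodes already expanded
        queue = deque([seed])  # frontier, visited in BFS order
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue  # unreachable node: skip it
            soup = BeautifulSoup(response.text, "html.parser")
            for a in soup.find_all("a", href=True):
                # resolve relative links, then follow the outgoing edge
                queue.append(urljoin(url, a["href"]))
        return visited

    print(crawl("http://quotes.toscrape.com/"))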
You can lock tables using the MySQL LOCK TABLES command like this:

    LOCK TABLES tablename WRITE;
    # Do other queries here
    UNLOCK TABLES;

See: http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html
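If you are issuing these statements from Python, a sketch using MySQL Connector/Python (the driver, connection details, table name, and column are assumptions for illustration):

    import mysql.connector  # assumption: the mysql-connector-python driver

    conn = mysql.connector.connect(
        host="localhost", user="user", password="secret", database="mydb"
    )
    cur = conn.cursor()
    try:
        # exclusive write lock: other sessions cannot read or write tablename
        cur.execute("LOCK TABLES tablename WRITE")
        cur.execute("UPDATE tablename SET counter = counter + 1")  # illustrative query
        conn.commit()
    finally:
        cur.execute("UNLOCK TABLES")  # always release the lock
        cur.close()
        conn.close()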
To completely ignore all breakpoints in Chrome:

1. Open your page in the Chrome browser.
2. Press F12, or right-click the page and select Inspect.
3. In the Sources panel, press Ctrl+F8 to deactivate all breakpoints (or click the Deactivate breakpoints button at the top-right corner).

All breakpoints and debugger statements will be deactivated.