How to force Scrapy to crawl a duplicate URL?
You’re probably looking for the dont_filter=True argument on Request(). See http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects
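For example, a minimal sketch of a spider that re-fetches a page it has already seen (the spider name, URL, and stopping condition here are illustrative, not from the quoted answer):

    import scrapy

    class DuplicatesSpider(scrapy.Spider):
        name = "duplicates"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            self.times = getattr(self, "times", 0) + 1
            if self.times < 3:  # illustrative guard so the loop terminates
                # dont_filter=True bypasses the built-in duplicate-request
                # filter, so the same URL is scheduled and fetched again
                yield scrapy.Request(response.url, callback=self.parse,
                                     dont_filter=True)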
The answers above do not really solve the problem: they send the data as form parameters instead of as JSON in the body of the request. From http://bajiecc.cc/questions/1135255/scrapy-formrequest-sending-json:

    import json
    import scrapy

    my_data = {'field1': 'value1', 'field2': 'value2'}
    request = scrapy.Request(
        url,
        method='POST',
        body=json.dumps(my_data),
        headers={'Content-Type': 'application/json'},
    )
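As an aside, if you are on Scrapy 1.8 or later (an assumption about your version), scrapy.http.JsonRequest wraps the same pattern, serializing the data and setting the header for you:

    from scrapy.http import JsonRequest

    my_data = {'field1': 'value1', 'field2': 'value2'}
    # JsonRequest JSON-encodes `data` into the body and sets the
    # Content-Type: application/json header; the method defaults to POST
    # (url here is the same variable as in the snippet above)
    request = JsonRequest(url, data=my_data)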
You cannot restart the reactor, but you should be able to run it more than once by forking a separate process:

    import scrapy
    import scrapy.crawler as crawler
    from scrapy.utils.log import configure_logging
    from multiprocessing import Process, Queue
    from twisted.internet import reactor

    # your spider
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://quotes.toscrape.com/tag/humor/']

        def parse(self, response):
            for quote …
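The quoted snippet is cut off above. A sketch of how the rest of this pattern usually continues (the run_spider wrapper below is a reconstruction of the approach described, not part of the original answer):

    # a wrapper that runs each crawl in a child process, so every call
    # gets a fresh Twisted reactor instead of the unrestartable one
    def run_spider(spider):
        def f(q):
            try:
                runner = crawler.CrawlerRunner()
                deferred = runner.crawl(spider)
                deferred.addBoth(lambda _: reactor.stop())
                reactor.run()
                q.put(None)
            except Exception as e:
                q.put(e)

        q = Queue()
        p = Process(target=f, args=(q,))
        p.start()
        result = q.get()
        p.join()
        if result is not None:
            raise result

    # can now be called repeatedly without a ReactorNotRestartable error
    run_spider(QuotesSpider)
    run_spider(QuotesSpider)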
Here you go:

    var phantom = require('phantom');

    phantom.create(function (ph) {
        ph.createPage(function (page) {
            var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
            page.open(url, function () {
                // inject jQuery into the page before evaluating
                page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function () {
                    page.evaluate(function () {
                        // log the href of every link in the list items
                        $('.listMain > li').each(function () {
                            console.log($(this).find('a').attr('href'));
                        });
                    }, function () {
                        ph.exit();
                    });
                });
            });
        });
    });
Pass the spider arguments on the process.crawl method:

    process.crawl(spider, input='inputargument', first='James', last='Bond')
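For context, a minimal sketch of the receiving side (the CrawlerProcess setup and spider class here are illustrative): keyword arguments passed to process.crawl are forwarded to the spider's __init__:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["http://quotes.toscrape.com/"]

        def __init__(self, input=None, first=None, last=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # the keyword arguments from process.crawl() arrive here
            self.input = input
            self.first = first
            self.last = last

        def parse(self, response):
            self.logger.info("crawled %s for %s %s",
                             response.url, self.first, self.last)

    process = CrawlerProcess()
    process.crawl(MySpider, input="inputargument", first="James", last="Bond")
    process.start()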
You should run the scrapy crawl spider_name command from within a Scrapy project folder, where the scrapy.cfg file resides. From the docs: "Crawling: to put our spider to work, go to the project's top level directory and run:"

    scrapy crawl dmoz
Crawling the Web is conceptually simple. Treat the Web as a very complicated directed graph: each page is a node, and each link is a directed edge. You could start from the assumption that a single well-chosen starting point will eventually lead to every other point. This won't be strictly true, but in practice I …
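To make the graph framing concrete, here is a minimal breadth-first crawl sketch (the requests and BeautifulSoup libraries are my choice; the answer above names no tools). Pages are nodes, links are edges, and a visited set keeps each node from being expanded twice:

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed, max_pages=50):
        visited = set()        # nodes already expanded
        queue = deque([seed])  # frontier, visited in BFS order
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue  # unreachable node: skip it
            soup = BeautifulSoup(response.text, "html.parser")
            for a in soup.find_all("a", href=True):
                # resolve relative links, then follow the outgoing edge
                queue.append(urljoin(url, a["href"]))
        return visited

    print(crawl("http://quotes.toscrape.com/"))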
You can lock tables using the MySQL LOCK TABLES command like this:

    LOCK TABLES tablename WRITE;
    # Do other queries here
    UNLOCK TABLES;

See: http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html
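If you are issuing these statements from Python, a sketch using MySQL Connector/Python (the driver, connection details, table name, and column are assumptions for illustration):

    import mysql.connector  # assumption: the mysql-connector-python driver

    conn = mysql.connector.connect(
        host="localhost", user="user", password="secret", database="mydb"
    )
    cur = conn.cursor()
    try:
        # exclusive write lock: other sessions cannot read or write tablename
        cur.execute("LOCK TABLES tablename WRITE")
        cur.execute("UPDATE tablename SET counter = counter + 1")  # illustrative query
        conn.commit()
    finally:
        cur.execute("UNLOCK TABLES")  # always release the lock
        cur.close()
        conn.close()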
To completely ignore all breakpoints in Chrome:

1. Open your page in the Chrome browser.
2. Press F12, or right-click the page and select Inspect.
3. In the Sources panel, press Ctrl+F8 to deactivate all breakpoints (or click the Deactivate breakpoints button at the top-right corner).

All breakpoints and debugger statements will be deactivated.