Scraping dynamic content using python-Scrapy

You can also solve it with ScrapyJS (no need for Selenium and a real browser): this library provides Scrapy+JavaScript integration using Splash. Follow the installation instructions for Splash and ScrapyJS, then start the Splash Docker container:

    $ docker run -p 8050:8050 scrapinghub/splash

Put the following settings into settings.py:

    SPLASH_URL = 'http://192.168.59.103:8050'
    DOWNLOADER_MIDDLEWARES = {
        'scrapyjs.SplashMiddleware': 725,
… Read more
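
For context, a minimal sketch of what a spider request through Splash can look like once the middleware above is enabled; the spider name, target URL, and wait time are placeholder assumptions, not part of the original answer:

    import scrapy

    class JsPageSpider(scrapy.Spider):
        # Hypothetical spider: asks Splash to render the page before Scrapy parses it
        name = "js_example"

        def start_requests(self):
            yield scrapy.Request(
                "http://www.example.com",          # placeholder URL
                callback=self.parse,
                meta={
                    "splash": {
                        "endpoint": "render.html",  # return the rendered HTML
                        "args": {"wait": 0.5},      # give client-side JS time to run
                    }
                },
            )

        def parse(self, response):
            # the response now contains the JavaScript-rendered markup
            yield {"title": response.css("title::text").get()}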

Get href using CSS selector with Scrapy

What you're looking for is:

    Link = Link1.css('span[class=title] a::attr(href)').extract()[0]

Since you're also matching on the span's "class" attribute, you can even write:

    Link = Link1.css('span.title a::attr(href)').extract()[0]

Please note that the ::text pseudo-element and the ::attr(attributename) functional pseudo-element are NOT standard CSS3 selectors. They are extensions to CSS selectors introduced in Scrapy 0.20. Edit (2017-07-20): starting from Scrapy 1.0, … Read more
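
As a quick, self-contained illustration (the HTML snippet is invented for this example), the same selector can be tried directly on a Selector object; newer Scrapy versions also offer .get()/.getall() as friendlier spellings of extract_first()/extract():

    from scrapy.selector import Selector

    # Invented markup, just to exercise the selector
    html = '<div><span class="title"><a href="/movies/42">Some film</a></span></div>'
    sel = Selector(text=html)

    link = sel.css('span.title a::attr(href)').extract()[0]
    same_link = sel.css('span.title a::attr(href)').get()   # equivalent, returns None instead of raising when nothing matches

    print(link, same_link)   # /movies/42 /movies/42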

Scraping a JSON response with Scrapy

It's the same as using Scrapy's HtmlXPathSelector for HTML responses. The only difference is that you should use the json module to parse the response:

    class MySpider(BaseSpider):
        …
        def parse(self, response):
            jsonresponse = json.loads(response.text)
            item = MyItem()
            item["firstName"] = jsonresponse["firstName"]
            return item
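
A small self-contained sketch of the same idea with current Scrapy APIs; the spider name, endpoint, and field names are placeholders, and response.json() assumes Scrapy 2.2 or later (otherwise fall back to json.loads(response.text)):

    import scrapy

    class JsonSpider(scrapy.Spider):
        # Hypothetical spider hitting a placeholder JSON endpoint
        name = "json_example"
        start_urls = ["https://api.example.com/user"]   # placeholder URL

        def parse(self, response):
            data = response.json()   # Scrapy >= 2.2; older versions: json.loads(response.text)
            yield {
                "firstName": data.get("firstName"),
                "lastName": data.get("lastName"),
            }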

Scrapy and proxies

From the Scrapy FAQ:

    Does Scrapy work with HTTP proxies?
    Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell:

    C:\>set http_proxy=http://proxy:port
    csh% setenv http_proxy … Read more
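
Besides the environment variable, HttpProxyMiddleware also honours a per-request proxy set in request.meta; a rough sketch, with the spider name, URL, and proxy address as placeholders:

    import scrapy

    class ProxySpider(scrapy.Spider):
        # Hypothetical spider; the proxy address below is a placeholder
        name = "proxy_example"

        def start_requests(self):
            yield scrapy.Request(
                "http://www.example.com",
                meta={"proxy": "http://127.0.0.1:8080"},  # picked up by HttpProxyMiddleware
                callback=self.parse,
            )

        def parse(self, response):
            self.logger.info("Fetched %s via proxy", response.url)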

How to give a delay between each request in Scrapy?

There is a setting for that:

    DOWNLOAD_DELAY
    Default: 0

    The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

    DOWNLOAD_DELAY = 0.25    # 250 ms of delay

Read the docs: https://doc.scrapy.org/en/latest/index.html
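
The delay can also be scoped to a single spider through custom_settings; a sketch under the assumption that only this one spider should be throttled (spider name and URL are placeholders):

    import scrapy

    class PoliteSpider(scrapy.Spider):
        # Hypothetical spider showing the delay applied per spider rather than globally
        name = "polite_example"
        start_urls = ["http://www.example.com"]

        custom_settings = {
            "DOWNLOAD_DELAY": 0.25,            # wait 250 ms between requests to the same site
            "RANDOMIZE_DOWNLOAD_DELAY": True,  # jitter the delay (0.5x to 1.5x) to look less mechanical
        }

        def parse(self, response):
            self.logger.info("Got %s", response.url)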

scrapy: Call a function when a spider quits

It looks like you can register a signal listener through the dispatcher. I would try something like:

    from scrapy import signals
    from scrapy.xlib.pydispatch import dispatcher

    class MySpider(CrawlSpider):
        def __init__(self):
            dispatcher.connect(self.spider_closed, signals.spider_closed)

        def spider_closed(self, spider):
            # second param is the instance of the spider about to be closed.

In newer versions of Scrapy, scrapy.xlib.pydispatch is deprecated. Instead you … Read more
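
Since the answer notes that scrapy.xlib.pydispatch is deprecated, here is a sketch of the pattern documented for current Scrapy versions, connecting through the crawler's signal manager in from_crawler; the spider name and URL are placeholders:

    import scrapy
    from scrapy import signals

    class MySpider(scrapy.Spider):
        # Hypothetical spider showing the non-deprecated way to hook spider_closed
        name = "signal_example"
        start_urls = ["http://www.example.com"]

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            return spider

        def spider_closed(self, spider):
            spider.logger.info("Spider closed: %s", spider.name)

        def parse(self, response):
            pass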

How can I use multiple requests and pass items between them in Scrapy (Python)?

No problem. The following is a corrected version of your code:

    def page_parser(self, response):
        sites = hxs.select('//div[@class="row"]')
        items = []

        request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
        request.meta['item'] = item
        yield request

        request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2, meta={'item': item})
        yield request

        yield Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3, meta={'item': item})

    def parseDescription1(self, response):
        item = response.meta['item']
        item['desc1'] = "test"
        return item

    def parseDescription2(self, response):
        item = response.meta['item']
        … Read more
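
A self-contained sketch of the same pattern with placeholder URLs and field names, showing one item dict threaded through request.meta and yielded only by the final callback:

    import scrapy

    class ChainSpider(scrapy.Spider):
        # Hypothetical spider; the URLs and fields are placeholders
        name = "chain_example"
        start_urls = ["http://www.example.com"]

        def parse(self, response):
            item = {"title": response.css("title::text").get()}
            # Hand the partially-filled item to the next callback through request.meta
            yield scrapy.Request(
                "http://www.example.com/details",
                callback=self.parse_details,
                meta={"item": item},
                dont_filter=True,
            )

        def parse_details(self, response):
            item = response.meta["item"]
            item["details_url"] = response.url
            yield item   # only the last callback in the chain yields the finished item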

Scrapy – how to manage cookies/sessions

Three years later, I think this is exactly what you were looking for: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar

Just use something like this in your spider's start_requests method:

    for i, url in enumerate(urls):
        yield scrapy.Request("http://www.example.com", meta={'cookiejar': i}, callback=self.parse_page)

And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:

    def parse_page(self, response):
        # do some … Read more
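
To round out the truncated example, a sketch of how a follow-up request might reattach the same cookiejar so each session's cookies stay separate; the spider name, URLs, and callback names are placeholders:

    import scrapy

    class SessionSpider(scrapy.Spider):
        # Hypothetical spider keeping one cookie session per start URL
        name = "cookiejar_example"

        def start_requests(self):
            urls = ["http://www.example.com/login?user=1", "http://www.example.com/login?user=2"]
            for i, url in enumerate(urls):
                # each distinct 'cookiejar' value gets an independent cookie session
                yield scrapy.Request(url, meta={"cookiejar": i}, callback=self.parse_page)

        def parse_page(self, response):
            # reattach the same cookiejar so this session's cookies are sent again
            yield scrapy.Request(
                "http://www.example.com/otherpage",
                meta={"cookiejar": response.meta["cookiejar"]},
                callback=self.parse_other_page,
                dont_filter=True,
            )

        def parse_other_page(self, response):
            yield {"session": response.meta["cookiejar"], "url": response.url}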
