Scraping dynamic content using python-Scrapy

You can also solve it with ScrapyJS (no need for Selenium and a real browser): this library provides Scrapy+JavaScript integration using Splash. Follow the installation instructions for Splash and ScrapyJS, then start the Splash Docker container:

    $ docker run -p 8050:8050 scrapinghub/splash

Put the following settings into settings.py:

    SPLASH_URL = 'http://192.168.59.103:8050'
    DOWNLOADER_MIDDLEWARES = {
        'scrapyjs.SplashMiddleware': 725,
… Read more
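
For context, a minimal sketch of what a spider request through Splash can look like once the middleware above is enabled; the spider name, target URL, and wait time are placeholder assumptions, not part of the original answer:

    import scrapy

    class JsPageSpider(scrapy.Spider):
        # Hypothetical spider: asks Splash to render the page before Scrapy parses it
        name = "js_example"

        def start_requests(self):
            yield scrapy.Request(
                "http://www.example.com",          # placeholder URL
                callback=self.parse,
                meta={
                    "splash": {
                        "endpoint": "render.html",  # return the rendered HTML
                        "args": {"wait": 0.5},      # give client-side JS time to run
                    }
                },
            )

        def parse(self, response):
            # the response now contains the JavaScript-rendered markup
            yield {"title": response.css("title::text").get()}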

Get href using CSS selector with Scrapy

What you're looking for is:

    Link = Link1.css('span[class=title] a::attr(href)').extract()[0]

Since you're also matching on the span's "class" attribute, you can even write:

    Link = Link1.css('span.title a::attr(href)').extract()[0]

Please note that the ::text pseudo-element and the ::attr(attributename) functional pseudo-element are NOT standard CSS3 selectors. They are extensions to CSS selectors introduced in Scrapy 0.20. Edit (2017-07-20): starting from Scrapy 1.0, … Read more
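
As a quick, self-contained illustration (the HTML snippet is invented for this example), the same selector can be tried directly on a Selector object; newer Scrapy versions also offer .get()/.getall() as friendlier spellings of extract_first()/extract():

    from scrapy.selector import Selector

    # Invented markup, just to exercise the selector
    html = '<div><span class="title"><a href="/movies/42">Some film</a></span></div>'
    sel = Selector(text=html)

    link = sel.css('span.title a::attr(href)').extract()[0]
    same_link = sel.css('span.title a::attr(href)').get()   # equivalent, returns None instead of raising when nothing matches

    print(link, same_link)   # /movies/42 /movies/42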

Scraping a JSON response with Scrapy

It's the same as using Scrapy's HtmlXPathSelector for HTML responses. The only difference is that you should use the json module to parse the response:

    class MySpider(BaseSpider):
        …
        def parse(self, response):
            jsonresponse = json.loads(response.text)
            item = MyItem()
            item["firstName"] = jsonresponse["firstName"]
            return item
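
A small self-contained sketch of the same idea with current Scrapy APIs; the spider name, endpoint, and field names are placeholders, and response.json() assumes Scrapy 2.2 or later (otherwise fall back to json.loads(response.text)):

    import scrapy

    class JsonSpider(scrapy.Spider):
        # Hypothetical spider hitting a placeholder JSON endpoint
        name = "json_example"
        start_urls = ["https://api.example.com/user"]   # placeholder URL

        def parse(self, response):
            data = response.json()   # Scrapy >= 2.2; older versions: json.loads(response.text)
            yield {
                "firstName": data.get("firstName"),
                "lastName": data.get("lastName"),
            }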

Scrapy and proxies

From the Scrapy FAQ:

    Does Scrapy work with HTTP proxies?
    Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell:

    C:\>set http_proxy=http://proxy:port
    csh% setenv http_proxy … Read more
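
Besides the environment variable, HttpProxyMiddleware also honours a per-request proxy set in request.meta; a rough sketch, with the spider name, URL, and proxy address as placeholders:

    import scrapy

    class ProxySpider(scrapy.Spider):
        # Hypothetical spider; the proxy address below is a placeholder
        name = "proxy_example"

        def start_requests(self):
            yield scrapy.Request(
                "http://www.example.com",
                meta={"proxy": "http://127.0.0.1:8080"},  # picked up by HttpProxyMiddleware
                callback=self.parse,
            )

        def parse(self, response):
            self.logger.info("Fetched %s via proxy", response.url)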

How to give a delay between each request in Scrapy?

There is a setting for that:

    DOWNLOAD_DELAY
    Default: 0

    The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

    DOWNLOAD_DELAY = 0.25    # 250 ms of delay

Read the docs: https://doc.scrapy.org/en/latest/index.html
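
The delay can also be scoped to a single spider through custom_settings; a sketch under the assumption that only this one spider should be throttled (spider name and URL are placeholders):

    import scrapy

    class PoliteSpider(scrapy.Spider):
        # Hypothetical spider showing the delay applied per spider rather than globally
        name = "polite_example"
        start_urls = ["http://www.example.com"]

        custom_settings = {
            "DOWNLOAD_DELAY": 0.25,            # wait 250 ms between requests to the same site
            "RANDOMIZE_DOWNLOAD_DELAY": True,  # jitter the delay (0.5x to 1.5x) to look less mechanical
        }

        def parse(self, response):
            self.logger.info("Got %s", response.url)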

scrapy: Call a function when a spider quits

It looks like you can register a signal listener through the dispatcher. I would try something like:

    from scrapy import signals
    from scrapy.xlib.pydispatch import dispatcher

    class MySpider(CrawlSpider):
        def __init__(self):
            dispatcher.connect(self.spider_closed, signals.spider_closed)

        def spider_closed(self, spider):
            # second param is the instance of the spider about to be closed.

In newer versions of Scrapy, scrapy.xlib.pydispatch is deprecated. Instead you … Read more
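
Since the answer notes that scrapy.xlib.pydispatch is deprecated, here is a sketch of the pattern documented for current Scrapy versions, connecting through the crawler's signal manager in from_crawler; the spider name and URL are placeholders:

    import scrapy
    from scrapy import signals

    class MySpider(scrapy.Spider):
        # Hypothetical spider showing the non-deprecated way to hook spider_closed
        name = "signal_example"
        start_urls = ["http://www.example.com"]

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            return spider

        def spider_closed(self, spider):
            spider.logger.info("Spider closed: %s", spider.name)

        def parse(self, response):
            pass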

How can I use multiple requests and pass items between them in Scrapy (Python)?

No problem. The following is a corrected version of your code:

    def page_parser(self, response):
        sites = hxs.select('//div[@class="row"]')
        items = []

        request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
        request.meta['item'] = item
        yield request

        request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2, meta={'item': item})
        yield request

        yield Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3, meta={'item': item})

    def parseDescription1(self, response):
        item = response.meta['item']
        item['desc1'] = "test"
        return item

    def parseDescription2(self, response):
        item = response.meta['item']
        … Read more
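
A self-contained sketch of the same pattern with placeholder URLs and field names, showing one item dict threaded through request.meta and yielded only by the final callback:

    import scrapy

    class ChainSpider(scrapy.Spider):
        # Hypothetical spider; the URLs and fields are placeholders
        name = "chain_example"
        start_urls = ["http://www.example.com"]

        def parse(self, response):
            item = {"title": response.css("title::text").get()}
            # Hand the partially-filled item to the next callback through request.meta
            yield scrapy.Request(
                "http://www.example.com/details",
                callback=self.parse_details,
                meta={"item": item},
                dont_filter=True,
            )

        def parse_details(self, response):
            item = response.meta["item"]
            item["details_url"] = response.url
            yield item   # only the last callback in the chain yields the finished item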

Scrapy – how to manage cookies/sessions

Three years later, I think this is exactly what you were looking for: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar

Just use something like this in your spider's start_requests method:

    for i, url in enumerate(urls):
        yield scrapy.Request("http://www.example.com", meta={'cookiejar': i}, callback=self.parse_page)

And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:

    def parse_page(self, response):
        # do some … Read more
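
To round out the truncated example, a sketch of how a follow-up request might reattach the same cookiejar so each session's cookies stay separate; the spider name, URLs, and callback names are placeholders:

    import scrapy

    class SessionSpider(scrapy.Spider):
        # Hypothetical spider keeping one cookie session per start URL
        name = "cookiejar_example"

        def start_requests(self):
            urls = ["http://www.example.com/login?user=1", "http://www.example.com/login?user=2"]
            for i, url in enumerate(urls):
                # each distinct 'cookiejar' value gets an independent cookie session
                yield scrapy.Request(url, meta={"cookiejar": i}, callback=self.parse_page)

        def parse_page(self, response):
            # reattach the same cookiejar so this session's cookies are sent again
            yield scrapy.Request(
                "http://www.example.com/otherpage",
                meta={"cookiejar": response.meta["cookiejar"]},
                callback=self.parse_other_page,
                dont_filter=True,
            )

        def parse_other_page(self, response):
            yield {"session": response.meta["cookiejar"], "url": response.url}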
