scrapy – Page 5 – Tarik Billa

Change IP address dynamically?

March 24, 2023 by Tarik

One approach with Scrapy uses two components, RandomProxy and RotateUserAgentMiddleware. Register both in DOWNLOADER_MIDDLEWARES in settings.py. The built-in module paths below are the Scrapy 1.0+ names; older releases used scrapy.contrib.downloadermiddleware.* instead.

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
        'tutorial.randomproxy.RandomProxy': 100,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'tutorial.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
    }

Random Proxy

You can use scrapy-proxies. This component will process Scrapy requests …
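As an illustration of the second component named above, here is a minimal sketch of what a user-agent rotating downloader middleware could look like. It is not the post's own implementation: the USER_AGENTS pool and the constructor arguments are assumptions made for the example. The only Scrapy contract it relies on is the process_request(request, spider) hook, which may return None to let the request continue.

```python
import random

# Hypothetical pool of user-agent strings; replace with your own list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/110.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_2) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.3 Safari/605.1.15",
]


class RotateUserAgentMiddleware:
    """Sketch of a downloader middleware that picks a random User-Agent."""

    def __init__(self, user_agents=USER_AGENTS, rng=random):
        # Both arguments have defaults so the class can be built with no args.
        self.user_agents = list(user_agents)
        self.rng = rng

    def process_request(self, request, spider):
        # Called for every outgoing request that passes through the middleware.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Returning None lets the request continue through the remaining
        # middlewares and on to the downloader.
        return None
```

Disabling the built-in UserAgentMiddleware (mapping it to None, as in the settings above) keeps it from setting a User-Agent of its own alongside this one.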

Using Scrapy with authenticated (logged in) user session

March 8, 2023 by Tarik

In the code above, the FormRequest used to authenticate has the after_login function set as its callback, so after_login is called with the response to the login attempt. It then checks that you are successfully logged in by searching the page …
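The check that such a callback performs can be sketched without Scrapy installed. Everything named here is an assumption for illustration: FAILURE_MARKER is a made-up error string, FakeResponse stands in for a Scrapy response (only its bytes body attribute is used), and login_succeeded is a hypothetical helper that an after_login callback could call before continuing the crawl.

```python
from dataclasses import dataclass

# Hypothetical failure marker; use text that the site's login-error page
# actually contains.
FAILURE_MARKER = b"authentication failed"


@dataclass
class FakeResponse:
    """Stand-in for a Scrapy response; only the body attribute is used."""
    body: bytes


def login_succeeded(response):
    """Return True when the login response lacks the failure marker."""
    # Scrapy exposes the raw page as bytes in response.body, so the marker
    # is a bytes literal as well.
    return FAILURE_MARKER not in response.body
```

In a spider, after_login would typically log an error and stop when this returns False, and otherwise go on to yield the requests that need the authenticated session.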

Click a Button in Scrapy

March 3, 2023 by Tarik

Scrapy cannot interpret JavaScript. If you absolutely must interact with the JavaScript on the page, use Selenium. If you are using Scrapy, the solution depends on what the button does. If it just shows content that was previously hidden, you can scrape the data without a problem; it doesn’t …

getting Forbidden by robots.txt: scrapy

February 23, 2023 by Tarik

Since Scrapy 1.1 (released 2016-05-11), newly created projects obey robots.txt by default, so the crawler downloads robots.txt before crawling. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:

    ROBOTSTXT_OBEY = False

The Scrapy 1.1 release notes describe this change.

Scrapy Unit Testing

February 15, 2023 by Tarik

The way I’ve done it is to create fake responses, so that you can test the parse function offline while still exercising it against real HTML. A problem with this approach is that your local HTML file may not reflect the latest state online, so if the HTML changes online you may have …

TypeError: Object of type ‘bytes’ is not JSON serializable

February 5, 2023 by Tarik

You are creating those bytes objects yourself:

    item['title'] = [t.encode('utf-8') for t in title]
    item['link'] = [l.encode('utf-8') for l in link]
    item['desc'] = [d.encode('utf-8') for d in desc]
    items.append(item)

Each of those t.encode(), l.encode() and d.encode() calls creates a bytes string. Do not do this; leave it to the JSON format to serialise these. Next, …
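The difference is easy to reproduce with the standard json module alone. The sketch below uses made-up titles: it shows that a dict holding bytes values cannot be dumped, while the same values left as str serialise without any manual encoding.

```python
import json

# Made-up sample data standing in for scraped titles.
titles = ["Scrapy tutorial", "Caf\u00e9 guide"]

# Encoding by hand produces bytes objects, which json.dumps rejects.
encoded_item = {"title": [t.encode("utf-8") for t in titles]}
try:
    json.dumps(encoded_item)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Leaving the values as str lets the json module handle escaping itself.
item = {"title": titles}
serialised = json.dumps(item)

print(error_message)
print(serialised)
```

The same applies to Scrapy's JSON feed export: items whose fields are plain str values are serialised without any encode() calls.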

How to run Scrapy from within a Python script

January 29, 2023 by Tarik

All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 requires:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished

How can I use different pipelines for different spiders in a single Scrapy project

January 13, 2023 by Tarik

Remove all pipelines from the main settings and define them inside the spider instead. This sets the pipeline to use per spider:

    class testSpider(InitSpider):
        name = "test"
        custom_settings = {
            'ITEM_PIPELINES': {
                'app.MyPipeline': 400
            }
        }

selenium with scrapy for dynamic page

January 8, 2023 by Tarik

It really depends on how you need to scrape the site and what data you want to get. Here’s an example of how you can follow pagination on eBay using Scrapy+Selenium:

    import scrapy
    from selenium import webdriver

    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['ebay.com']
        start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

        def __init__(self):
            self.driver = …

How to use PyCharm to debug Scrapy projects

December 26, 2022 by Tarik

The scrapy command is a Python script, which means you can start it from inside PyCharm. If you examine the scrapy binary (which scrapy), you will see that it is actually a Python script:

    #!/usr/bin/python
    from scrapy.cmdline import execute
    execute()

This means that a command like scrapy crawl IcecatCrawler can also be executed like this: …