web-crawler – Page 5

Get a list of URLs from a site [closed]

December 24, 2022 by Tarik

I didn’t mean to answer my own question, but I just thought of running a sitemap generator. The first one I found, http://www.xml-sitemaps.com, has a nice text output, which is perfect for my needs.

How to find all links / pages on a website

December 19, 2022 by Tarik

Check out linkchecker—it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script up a solution for creating the directory tree.
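As a rough illustration of the "script up a solution" step, here is a minimal Python sketch that turns a list of crawled URLs into an indented directory tree. It assumes you have already extracted plain URLs from the report; the report format itself is not modeled here, and the helper names build_tree and render, as well as the sample URLs, are hypothetical.

```python
from urllib.parse import urlparse


def build_tree(urls):
    """Nest URL path segments into a dict of dicts, keyed by host then segment."""
    tree = {}
    for url in urls:
        parsed = urlparse(url)
        node = tree.setdefault(parsed.netloc, {})
        for segment in parsed.path.split("/"):
            if segment:  # skip empty segments from leading/trailing slashes
                node = node.setdefault(segment, {})
    return tree


def render(tree, indent=0):
    """Return the tree as a list of indented lines, children sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render(tree[name], indent + 1))
    return lines


# Hypothetical sample input; replace with the URLs taken from your crawl report.
urls = [
    "http://www.example.com/",
    "http://www.example.com/docs/intro.html",
    "http://www.example.com/docs/api/index.html",
    "http://www.example.com/about.html",
]
print("\n".join(render(build_tree(urls))))
```

A nested dict keeps the sketch short; query strings and fragments are ignored, so pages that differ only by query parameters collapse into one entry.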

How to pass a user defined argument in scrapy spider

December 17, 2022 by Tarik

Spider arguments are passed in the crawl command using the -a option. For example:

    scrapy crawl myspider -a category=electronics -a domain=system

Spiders can access arguments as attributes:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def __init__(self, category='', **kwargs):
            self.start_urls = [f'http://www.example.com/{category}']  # py36
            super().__init__(**kwargs)  # python3

        def parse(self, response):
            self.log(self.domain)  # system

Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments

Update …

how to detect search engine bots with php?

November 28, 2022 by Tarik

I use the following code, which seems to be working fine:

    function _bot_detected() {
        return (
            isset($_SERVER['HTTP_USER_AGENT'])
            && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
        );
    }

Update 16-06-2017: added mediapartners (see https://support.google.com/webmasters/answer/1061943?hl=en).

Difference between BeautifulSoup and Scrapy crawler?

November 17, 2022 by Tarik

Scrapy is a web-spider or web-scraper framework. You give Scrapy a root URL to start crawling, then you can specify constraints such as how many URLs to crawl and fetch. It is a complete framework for web scraping or crawling. BeautifulSoup, by contrast, is a parsing library which also does a pretty good …

TypeError: can’t use a string pattern on a bytes-like object in re.findall()

November 14, 2022 by Tarik

You want to convert html (a bytes-like object) into a string using .decode, e.g. html = response.read().decode('utf-8'). See Convert bytes to a Python String.
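The following minimal sketch reproduces the error and shows the fix. The raw value is a hypothetical stand-in for what response.read() returns, and the href pattern is only an example. It also shows an alternative that the answer above does not mention: matching with a bytes pattern instead of decoding.

```python
import re

# Hypothetical sample data standing in for the bytes from response.read().
raw = b'<a href="/one">1</a> <a href="/two">2</a>'

# A str pattern applied to bytes raises the TypeError from the question.
try:
    re.findall(r'href="([^"]+)"', raw)
except TypeError as exc:
    print(f"TypeError: {exc}")

# Fix: decode the bytes first, then search the resulting str.
html = raw.decode("utf-8")
links = re.findall(r'href="([^"]+)"', html)
print(links)

# Alternative: keep the data as bytes and use a bytes pattern (note the rb prefix).
byte_links = re.findall(rb'href="([^"]+)"', raw)
print(byte_links)
```

Decoding is usually the more convenient choice, since the matches come back as ordinary strings. It does assume the page really is UTF-8 encoded.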

keep rsync from removing unfinished source files

November 11, 2022 by Tarik

It seems to me the problem is transferring a file before it’s complete, not that you’re deleting it. If this is Linux, it’s possible for a file to be open by process A while process B unlinks it. There’s no error, but of course A is wasting its time. Therefore, the fact that …

Finding the layers and layer sizes for each Docker image

October 19, 2022 by Tarik

Check out dive, written in Go. Awesome tool!

How to request Google to re-crawl my website? [closed]

October 8, 2022 by Tarik

There are two options. The first (and better) one is using the Fetch as Google option in Webmaster Tools that Mike Flynn commented about. Here are detailed instructions:

1. Go to https://www.google.com/webmasters/tools/ and log in.
2. If you haven’t already, add and verify the site with the “Add a Site” button.
3. Click on the site name for …

Sending “User-agent” using Requests library in Python

September 29, 2022 by Tarik

The user-agent should be specified as a field in the header. Here is a list of HTTP header fields; you’d probably be interested in the request-specific fields, which include User-Agent.

If you’re using requests v2.13 and newer

The simplest way to do what you want is to create a dictionary and specify your headers directly, …
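A minimal sketch of the dictionary approach, assuming the requests package is installed. The user-agent string my-crawler/1.0 and the example URL are hypothetical placeholders. The request is only prepared here, not sent, so the header can be inspected without network access.

```python
import requests

# Hypothetical user-agent string; use whatever identifies your client.
headers = {"User-Agent": "my-crawler/1.0"}

# Build the request without sending it, to inspect the outgoing headers.
request = requests.Request("GET", "http://www.example.com/", headers=headers)
prepared = request.prepare()
print(prepared.headers["User-Agent"])

# To actually send it, pass the same dictionary to requests.get:
# response = requests.get("http://www.example.com/", headers=headers, timeout=10)
```

Headers passed this way apply to that single request only.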