web-crawler – Tarik Billa

Guide on crawling the entire web?

August 24, 2023 by Tarik

Crawling the Web is conceptually simple. Treat the Web as a very complicated directed graph. Each page is a node. Each link is a directed edge. You could start with the assumption that a single well-chosen starting point will eventually lead to every other point. This won’t be strictly true, but in practice I …
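
The graph view above can be tried on a toy example. The following is a minimal sketch, not the author's code: `reachable` and the four-page `toy_web` dictionary are hypothetical names invented for illustration. It follows directed edges breadth-first from one seed and shows why a single starting point may not reach every node.

```python
from collections import deque

def reachable(graph, seed):
    """Return the set of pages reachable from seed by following links."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical toy web: keys are pages, values are the pages they link to.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["a"],  # links into the cluster, but no page links to it
}

found = reachable(toy_web, "a")
unreached = set(toy_web) - found  # pages the seed "a" can never discover
```

Because edges are directed, a page that only links outward (like `d` here) stays invisible unless it is used as a seed itself.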

How to find sitemap.xml path on websites?

June 10, 2023 by Tarik

There is no standard, so there is no guarantee. With that said, it’s common for the sitemap to be self-labeled and at the root, like this: example.com/sitemap.xml Paths are case-sensitive on some servers, so keep that in mind. If it’s not there, look in the robots file at the root: example.com/robots.txt If you don’t …
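
The lookup order described above could be sketched as follows. This is an illustration under assumptions: `sitemap_candidates` is a hypothetical helper, it assumes the robots file advertises sitemaps with `Sitemap:` lines (as the sitemaps.org protocol allows), and it only builds the list of URLs to try without fetching anything.

```python
from urllib.parse import urljoin

def sitemap_candidates(base_url, robots_txt=""):
    """List sitemap URLs to try: the root default first, then robots.txt entries."""
    candidates = [urljoin(base_url, "/sitemap.xml")]
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons stay intact.
        name, sep, value = line.partition(":")
        value = value.strip()
        # The field name is compared case-insensitively.
        if sep and name.strip().lower() == "sitemap" and value:
            if value not in candidates:
                candidates.append(value)
    return candidates

# Hypothetical robots.txt content for illustration.
robots = (
    "User-agent: *\n"
    "Disallow: /private\n"
    "Sitemap: https://example.com/sitemap_index.xml\n"
)
urls_to_try = sitemap_candidates("https://example.com/blog/", robots)
```

Neither location is guaranteed to exist, so each candidate still has to be requested and checked.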

How to write a crawler?

March 11, 2023 by Tarik

You’ll be reinventing the wheel, to be sure. But here are the basics: A list of unvisited URLs – seed this with one or more starting pages. A list of visited URLs – so you don’t go around in circles. A set of rules for URLs you’re not interested in – so you don’t index the …
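
The three pieces listed above fit together roughly like this. It is a minimal sketch rather than a complete crawler: `crawl`, `LinkParser`, the `fetch` callback, the `is_excluded` rule, and the `max_pages` cap are all assumptions made for illustration. Fetching is passed in as a function so the loop can be tried on canned pages without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, is_excluded, max_pages=100):
    unvisited = deque(seeds)   # URLs still to process
    visited = set()            # URLs already processed
    while unvisited and len(visited) < max_pages:
        url = unvisited.popleft()
        if url in visited or is_excluded(url):
            continue
        visited.add(url)
        html = fetch(url)      # expected to return None on failure
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in visited:
                unvisited.append(absolute)
    return visited

# Canned pages standing in for real HTTP responses.
pages = {
    "http://example.com/": '<a href="/about">About</a> <a href="/logo.png">Logo</a>',
    "http://example.com/about": '<a href="/">Home</a> <a href="#top">Top</a>',
}
visited = crawl(["http://example.com/"], pages.get, lambda u: u.endswith(".png"))
```

A real crawler would also need politeness delays, robots.txt handling, and error handling, none of which are shown here.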

Crawler vs scraper

February 10, 2023 by Tarik

A crawler gets web pages, i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore) it downloads whatever is linked to from the starting point(s). A scraper takes pages that have been downloaded or, in a more general sense, …
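
To make the scraper half of the distinction concrete, here is a small sketch that works only on a page that has already been downloaded. It assumes that "scraping" here means pulling a specific piece of data out of the page; `TitleScraper` and `scrape_title` are hypothetical names, and the page title is just one example of a field to extract.

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Extract the text inside the <title> element of an HTML document."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self._in_title:
            self.title += data

def scrape_title(html):
    scraper = TitleScraper()
    scraper.feed(html)
    return scraper.title.strip()
```

Note that nothing here follows links or downloads anything; that part is the crawler's job.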

Detecting ‘stealth’ web-crawlers

December 28, 2022 by Tarik

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were then updated automatically …
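
The log-examining step described above might look something like the following. This is not the system from the post, only a sketch under assumptions: `find_offenders` is a hypothetical function, it assumes the client IP is the first whitespace-separated field of each log line, and it uses a simple request-count threshold. It returns the addresses to block and leaves the firewall side out entirely.

```python
from collections import Counter
from ipaddress import ip_address, ip_network

def find_offenders(log_lines, threshold, whitelist=()):
    """Return IPs with more than `threshold` requests that are not whitelisted."""
    networks = [ip_network(entry) for entry in whitelist]
    hits = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        try:
            addr = ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address; skip the line
        hits[addr] += 1
    return sorted(
        str(addr)
        for addr, count in hits.items()
        if count > threshold and not any(addr in net for net in networks)
    )
```

For example, calling `find_offenders(lines, 1000, ["192.0.2.0/24"])` would ignore any address in the whitelisted range, however busy it is.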

Get a list of URLs from a site [closed]

December 24, 2022 by Tarik

I didn’t mean to answer my own question, but I just thought about running a sitemap generator. The first one I found, http://www.xml-sitemaps.com, has nice text output. Perfect for my needs.