Guide on crawling the entire web?

Crawling the Web is conceptually simple. Treat the Web as a very complicated directed graph. Each page is a node. Each link is a directed edge.
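A graph traversal like this can be sketched as a breadth-first search. The example below is a minimal, self-contained sketch: the "web" here is a hypothetical in-memory dictionary of pages and their outgoing links (a real crawler would fetch each page over HTTP and parse its `<a>` tags to discover the edges).

```python
from collections import deque

# Toy stand-in for the web: each page (node) maps to the pages it
# links to (directed edges). A real crawler builds these edges by
# fetching the page and extracting its links.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
}

def crawl(seeds):
    """Breadth-first traversal of the link graph from one or more seeds."""
    visited = set()
    frontier = deque(seeds)
    order = []
    while frontier:
        page = frontier.popleft()
        if page in visited:
            continue  # never process the same page twice in one traversal
        visited.add(page)
        order.append(page)
        for link in TOY_WEB.get(page, []):
            if link not in visited:
                frontier.append(link)
    return order
```

Breadth-first order tends to be a reasonable default for crawling, since it explores pages close to the seeds before wandering deep into any one site.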

You could start with the assumption that a single well-chosen starting point will eventually lead to every other page. This won’t be strictly true, but in practice I think you’ll find it’s mostly true. Still, chances are you’ll need multiple (maybe thousands of) starting points.

You will want to make sure you don’t traverse the same page twice within a single traversal. In practice the traversal will take so long that the real questions become how long before you come back to a particular node, and how you detect and deal with changes (the second time you reach a page, it may have changed).
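One simple way to detect such changes between visits is to fingerprint each page's content with a hash and compare it against the fingerprint recorded on the previous pass. This is a sketch, not a full solution (hypothetical function names; real pages often contain timestamps or ads that change on every fetch, so you may want to normalize the content before hashing):

```python
import hashlib

# url -> content fingerprint recorded on the previous crawl pass
seen: dict[str, str] = {}

def fingerprint(content: bytes) -> str:
    """Hash the page body; a different hash on a later visit means it changed."""
    return hashlib.sha256(content).hexdigest()

def changed_since_last_visit(url: str, content: bytes) -> bool:
    """Record the current fingerprint and report whether it differs
    from the one stored on the previous visit (False on first visit)."""
    fp = fingerprint(content)
    previous = seen.get(url)
    seen[url] = fp
    return previous is not None and previous != fp
```

In a real crawler you would also honor HTTP caching signals (`ETag`, `Last-Modified`) so you can skip re-downloading pages that the server says haven't changed.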

The killer will be how much data you need to store and what you want to do with it once you’ve got it.
