Designing a web crawler

If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern crawler: In the course of extracting links, any Web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must … Read more
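The URL-seen test described above can be sketched as an in-memory set of canonicalized URLs. This is a minimal illustration only: production crawlers keep compact fingerprints (or Bloom filters) rather than full URL strings, and the normalization rules and class names here are invented for the example.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Normalize a URL so trivial variants map to one key (illustrative rules only)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower().rstrip(".")
    path = parts.path or "/"
    # Drop the fragment: it never changes the document fetched.
    return urlunsplit((scheme, netloc, path, parts.query, ""))

class UrlSeenTest:
    """Answer 'have we already enqueued this URL?' for a crawl frontier."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def should_enqueue(self, url: str) -> bool:
        key = canonicalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

frontier = UrlSeenTest()
print(frontier.should_enqueue("HTTP://Example.com/a"))    # first sighting: True
print(frontier.should_enqueue("http://example.com/a#x"))  # duplicate after canonicalization: False
```

A real crawler would also bound this structure's memory, e.g. by storing 64-bit hashes of the canonical URLs instead of the strings themselves.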

Search in html source with GOOGLE? [closed]

I’ve come across the following resources on my travels (some already mentioned above): HTML-markup-focused search engines: NerdyData. I’d also like to throw in the following: huge website-crawl data archives: Common Crawl – ‘years of free web page data to help change the world’ (over 250 TB). How can we analyze this crawl data? For … Read more
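One simple way to analyze crawl data like Common Crawl's is to work with its URL index, whose records are JSON lines of capture metadata. The sample records and field subset below are invented for illustration; a real analysis would stream lines from the actual index rather than a hard-coded list.

```python
import json
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical CDX-style index lines (one JSON record per capture, fields trimmed).
cdx_lines = [
    '{"url": "https://example.com/a", "status": "200"}',
    '{"url": "https://example.com/b", "status": "200"}',
    '{"url": "https://blog.example.org/post", "status": "404"}',
]

def hosts_by_frequency(lines):
    """Count successful (HTTP 200) captures per hostname: a toy crawl-metadata analysis."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("status") == "200":
            counts[urlsplit(record["url"]).hostname] += 1
    return counts

print(hosts_by_frequency(cdx_lines).most_common())  # [('example.com', 2)]
```

The same pattern (parse a record, filter, aggregate) scales to the full archive when the lines are streamed rather than held in memory.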

How do websites know they’re not the default home page or search provider?

There is simply no way to do that with JavaScript, because the default search engine and homepage are user preferences, and a page cannot read them without the user’s permission — doing so would be a security/privacy issue. What Google does at every user visit is show a promo ad with a close icon and a go … Read more

What is the difference between web-crawling and web-scraping? [duplicate]

Crawling is essentially what Google, Yahoo, MSN, etc. do: looking for ANY information. Scraping is generally targeted at certain websites, for specific data, e.g. for price comparison, so the two are coded quite differently. Usually a scraper will be bespoke to the websites it is supposed to be scraping, and would be doing things a (good) … Read more
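The "bespoke to one website" point can be made concrete with a tiny scraper keyed to a single site's known markup. The HTML snippet, class names, and pattern below are all invented for the example; the fragility is the point — change the site's markup and this breaks, unlike a general-purpose crawler.

```python
import re
from typing import Optional

# Invented product-page fragment; a real scraper would fetch this over HTTP.
html = '<div class="product"><span class="price">$19.99</span></div>'

def extract_price(page: str) -> Optional[float]:
    """Pull the price out of this one site's markup (deliberately site-specific)."""
    match = re.search(r'class="price">\$([\d.]+)<', page)
    return float(match.group(1)) if match else None

print(extract_price(html))  # 19.99
```

For anything beyond a throwaway script, an HTML parser (e.g. BeautifulSoup) is more robust than regular expressions, but each scraper still encodes assumptions about one site's structure.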

How do search engines deal with AngularJS applications?

(2022) Use server-side rendering if possible, and generate URLs with pushState. Google can and will run JavaScript now, so it is very possible to build a site using only JavaScript, provided you create a sensible URL structure. However, page speed has become a progressively more important ranking factor, and pages built client-side typically perform poorly … Read more

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)