Re: [benchivers] Efficient Search Engines
Aug 5, 2003, 12:42 PM
Post #9 of 9

Most search engines consist of two parts: searching and ranking.
To search for files/URLs/whatever that contain a set of words, they usually use what's generally called an "inverted index". Two main indexes are needed: a word index and a document index. The document index is a simple mapping of IDs to document locations (URLs, file paths, what have you). The word index has an entry for each word found, listing the IDs of every document that contains that word.
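Concretely, in Perl the two indexes could be nothing more than a pair of hashes (a sketch with made-up entries; a real engine would tie them to DBM files or a database rather than keep them in memory):

    use strict;
    use warnings;

    # document index: document id => location
    my %doc_index = (
        1 => 'http://example.com/intro.html',
        2 => 'http://example.com/faq.html',
    );

    # word index: word => set (hash) of ids of documents containing it
    my %word_index = (
        'search' => { 1 => 1, 2 => 1 },
        'engine' => { 1 => 1 },
    );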
You would need two modules: one to index a document and another to find documents. Indexing is simple: you break each document into words and add the new document's ID to the word-index entry for each word found in the document. Finding documents can be a tad more complex, depending on how much boolean logic you want to support. For example, if you only need simple ANDs across all search terms, the final search result is just the intersection of the document-ID sets for each term; ORs are simply the union. Combinations of ANDs and ORs can be derived easily.
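In Perl, both pieces might look something like this (an untested sketch; the function names, the regex word-splitting, and the sample documents are just for illustration):

    use strict;
    use warnings;

    my %doc_index;     # document id => location
    my %word_index;    # word => { document id => 1, ... }
    my $next_id = 1;

    # Indexing: split the text into words and file the new document's
    # id under every word it contains.
    sub index_document {
        my ($location, $text) = @_;
        my $id = $next_id++;
        $doc_index{$id} = $location;
        $word_index{$_}{$id} = 1 for map { lc } $text =~ /(\w+)/g;
        return $id;
    }

    # AND search: intersect the id sets of all (distinct) terms.
    sub search_and {
        my %uniq  = map { lc $_ => 1 } @_;
        my @terms = keys %uniq;
        my %count;
        $count{$_}++ for map { keys %{ $word_index{$_} || {} } } @terms;
        return grep { $count{$_} == @terms } keys %count;
    }

    # OR search: union of the id sets.
    sub search_or {
        my %seen;
        $seen{$_} = 1 for map { keys %{ $word_index{lc $_} || {} } } @_;
        return keys %seen;
    }

    index_document('a.txt', 'Perl makes a fine search engine');
    index_document('b.txt', 'ranking the found documents is the hard part');
    print "$doc_index{$_}\n" for search_and('search', 'engine');   # prints a.txt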
How found documents are ranked varies considerably from one search engine to the next. Google was unique in using links between documents to calculate rank (its PageRank algorithm). More traditional methods used word frequency, locality, and other statistical measures.
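For a rough idea of the traditional approach, here's a frequency-only ranker (again just a sketch; it assumes you still have each matching document's text handy, and it ignores document length, term rarity, word position, and everything else a real engine would weigh in):

    use strict;
    use warnings;

    # Score each matching document by how many times the search terms
    # appear in its text, highest score first.  Raw counts only.
    sub rank_by_frequency {
        my ($terms, %doc_text) = @_;
        my %score;
        for my $id (keys %doc_text) {
            my $text = lc $doc_text{$id};
            for my $term (@$terms) {
                $score{$id} += () = $text =~ /\b\Q$term\E\b/g;
            }
        }
        return sort { $score{$b} <=> $score{$a} } keys %score;
    }

    my %docs = (
        1 => 'search engines index documents so a search is fast',
        2 => 'a document about something else entirely',
    );
    print join(' ', rank_by_frequency([ 'search', 'documents' ], %docs)), "\n";  # 1 2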
Note, what I described is the bare-bones basics. I didn't cover common-word filtering, contextual maps, part-of-speech filtering, word stemming, semantic approximation, synonym replacement and a blizzard of other techniques that separate the men from the boys in the world of search engines.
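Just to give a flavour of the first two of those, you'd normalize words before they ever reach the index (the stop-word list and suffix rules here are invented for the example; a real stemmer would be something like Porter's algorithm):

    use strict;
    use warnings;

    my %stop_word = map { $_ => 1 } qw(a an and the of to in is it);

    # Drop common words, then apply a very crude suffix stemmer.
    sub normalize_words {
        my ($text) = @_;
        my @words = grep { !$stop_word{$_} } map { lc } $text =~ /(\w+)/g;
        for (@words) {
            s/(?:ing|ed|s)$// if length($_) > 4;   # crude stemming
        }
        return @words;
    }

    print join(' ', normalize_words('The engines indexed documents')), "\n";
    # prints: engine index document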
In short, you could build a search engine using Perl, but I don't see why you'd want to, given the wide variety of choices already available out there.