Home: Perl Programming Help: Advanced: Efficient Search Engines

 



benchivers
Novice

Apr 19, 2003, 11:57 AM

Post #1 of 9 (1522 views)
Efficient Search Engines

Hi,

Does anybody know how the major search engines work? e.g. Google.

Do they operate using Perl? Is Perl the most efficient and web server friendly language for creating search engines?

Does anyone know what structure the search engines work on? Flat-file? Oracle? SQL? I would have thought a flat file was the most efficient method, since only a file would need to be opened rather than a program run to perform the query. Is this correct?

Does anyone know how Google can search through so many webpages within just a couple of seconds, or not even that, and still return relevant results?

Does anyone have any ideas for creating an efficient search engine?

Any help with this subject would be most appreciated.

Many Regards,

Ben Chivers
Where's the damn coffee? zzzZZZZZ!!!


Paul
Enthusiast

Apr 20, 2003, 2:02 AM

Post #2 of 9 (1519 views)
Re: [benchivers] Efficient Search Engines


Quote
Do they operate using Perl?


Hehe, no.


Quote
Is Perl the most efficient and web server friendly language for creating search engines?


No.


Quote
Flat-file?


You really think a search engine with millions of records uses a flat file? :)


Quote
I would have thought a flat file was the most efficient method, since only a file would need to be opened rather than a program run to perform the query. Is this correct?


Definitely not.


Quote
Does anyone know how Google can search through so many webpages within just a couple of seconds, or not even that, and still return relevant results?


Try networking several thousand servers. :)

Read this article....

http://www.internetweek.com/story/INW20010427S0010


(This post was edited by Paul on Apr 20, 2003, 2:07 AM)


benchivers
Novice

Apr 20, 2003, 3:44 AM

Post #3 of 9 (1514 views)
Re: [Paul] Efficient Search Engines

Hi Paul,

Thanks for your reply, but it wasn't very helpful.

What scripting languages do you think Google use?

What is the best DBMS to use for managing and maintaining a database?

Cheers,

Ben Chivers
Where's the damn coffee? zzzZZZZZ!!!


davorg
Thaumaturge / Moderator

Apr 20, 2003, 6:49 AM

Post #4 of 9 (1511 views)
Re: [benchivers] Efficient Search Engines


In Reply To
What scripting languages do you think Google use?


What makes you think they must use a "scripting language"? I expect they use some combination of C and C++, and I wouldn't describe either of those as a "scripting language" (though personally, I wouldn't describe Perl as a "scripting language" either).


In Reply To
What is the best DBMS to use for managing and maintaining a database?


That depends on all sorts of factors: what hardware you're using, what OS you're running, how much money you can spend, and how much emphasis you place on performance and reliability.

I'd guess that Google use a commercial RDBMS like Oracle or DB2.

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


uri
Thaumaturge

Apr 27, 2003, 3:48 PM

Post #5 of 9 (1503 views)
Re: [davorg] Efficient Search Engines

actually, RDBMSs are not a good choice for search engines. they usually don't have the ability to index all the words of a document (though some have added support for that). document indexes that support complex queries are very different beasts from general-purpose DBs. i saw a talk by a google honcho and learned they have a modified linux for their platform. they designed the system so they can drop in additional servers whenever they need to, without major reconfiguration. no one who builds large search engines writes them in anything but c or c++, for speed reasons.

there may be plenty of perl behind the scenes at some of the major search sites, but none in the main search pipeline.


Paul
Enthusiast

Apr 27, 2003, 4:28 PM

Post #6 of 9 (1502 views)
Re: [uri] Efficient Search Engines

FYI: Google uses Red Hat.


uri
Thaumaturge

Apr 27, 2003, 5:07 PM

Post #7 of 9 (1500 views)
Re: [Paul] Efficient Search Engines

i heard it was redhat, but i wasn't sure when i wrote that, so i didn't say it. it is modified for sure to handle their special needs. i recall the modifications had to do with speeding up context switches and related speed issues.
this talk was at LISA in philly this last november.


Paul
Enthusiast

Apr 28, 2003, 2:07 AM

Post #8 of 9 (1493 views)
Re: [uri] Efficient Search Engines

They also use 8,000 servers...phew..


rsmah
newbie

Aug 5, 2003, 12:42 PM

Post #9 of 9 (1415 views)
Re: [benchivers] Efficient Search Engines

Most search engines consist of two parts: searching and ranking.

To find the files/urls/whatever that contain a set of words, they usually use something called an "inverted index". Two main indexes are generally needed: a word index and a document index. The document index is a simple mapping of ids to document locations (URLs, file paths, what have you). The word index has an entry for each word found, listing the ids of every document that contains that word.

You would need two modules: one to index a document and another to find documents. Indexing is simple: you break each document into words and add the document's id to the word-index entry for each word found in it. Finding documents can be a tad more complex depending on how much boolean logic you want to support. For example, if you do simple ANDs of all search terms, the final result is simply the intersection of the document-id sets for each term. ORs are simply the union. Combinations of ANDs and ORs can be derived easily.
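A minimal sketch of those two modules (Python rather than Perl, purely for illustration; the documents and names are invented):

```python
from collections import defaultdict

doc_index = {}                 # document id -> location (URL, file path, ...)
word_index = defaultdict(set)  # word -> set of ids of documents containing it

def index_document(doc_id, location, text):
    """Break the document into words; record its id under each word."""
    doc_index[doc_id] = location
    for word in text.lower().split():
        word_index[word].add(doc_id)

def search_and(terms):
    """AND query: intersect the document-id sets of every term."""
    sets = [word_index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(terms):
    """OR query: union of the document-id sets."""
    return set().union(*(word_index.get(t.lower(), set()) for t in terms))

index_document(0, "a.html", "perl search engine")
index_document(1, "b.html", "perl tutorial")
# search_and(["perl", "search"]) -> {0}
# search_or(["tutorial", "engine"]) -> {0, 1}
```

Real engines store the word index on disk in compressed posting lists rather than in memory, but the intersection/union logic is the same idea.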

How found documents are ranked varies considerably from one search engine to the next. Google was unique in using the links between documents to calculate rank. More traditional methods used word frequency, locality, and other statistical measures.
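A toy version of the link-based idea, in the spirit of PageRank (the graph, damping value, and iteration count here are invented for illustration; this is not Google's actual algorithm):

```python
def link_rank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Each page repeatedly shares its rank among the pages it links to,
    so pages that many others link to accumulate a higher rank."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new[target] += share
        rank = new
    return rank

# "b" is linked to by both "a" and "c", so it ends up ranked highest:
ranks = link_rank({"a": ["b"], "b": ["c"], "c": ["b"]})
```

Note this ranking depends only on the link graph, not on the query; in practice it would be combined with the word-frequency style measures mentioned above.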

Note, what I described is the bare-bones basics. I didn't cover common-word filtering, contextual maps, part-of-speech filtering, word stemming, semantic approximation, synonym replacement, and a blizzard of other techniques that separate the men from the boys in the world of search engines.
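Two of those refinements, common-word (stop-word) filtering and word stemming, can be sketched crudely like this (the stop list and suffix rules are invented; a real engine would use far richer lists and a proper stemmer such as Porter's):

```python
# Hypothetical stop list and suffix rules, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")

def normalize(text):
    """Drop stop words, then strip one common suffix from each word."""
    out = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            # Only strip when a reasonable stem would remain.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        out.append(word)
    return out

# normalize("the engines of searching") -> ["engin", "search"]
```

Running documents and queries through the same normalization step is what lets a search for "engines" match a document that says "engine".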

In short, you could build a search engine using Perl, but I don't see why you'd want to, given the wide variety of choices already available out there.

Cheers,
Rob

 
 

