Home: Perl Programming Help: Intermediate:
Making a search

 



benn600
User


Jun 12, 2005, 7:41 PM

Post #1 of 14 (2975 views)
Making a search

I am adding a search to my web site and everything works well except for one line, which I think I need. This one line slows the search down by a factor of about 2 to 4, and I doubt most searches need it. My approach is simply to split both the search string and the database records into separate words. The input string is so small that I don't notice any slowdown from splitting it, but splitting the entire database of files takes a very long time. Here is the code:

(@sin) = split(/ /, $sin);

It's very simple: I then use a few foreach loops to compare words with eq. I tried using =~ instead, but that also slows everything way down, and it reports words as matching even when we wouldn't consider them the same at all (though I understand why it does that). Anyway, an answer to my first question is all I need. Thanks a lot!
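To be concrete, the word-comparison part looks roughly like this (the sample strings here are just made up for illustration):

```perl
# Rough sketch of the current approach: split both the search string
# and a database record into words, then compare word pairs with eq.
my $sin    = "pocket pc review";          # the user's search string
my $record = "a review of a pocket pc";   # one record from the database

my @sin  = split(/ /, $sin);
my $hits = 0;
foreach my $word (@sin) {
    foreach my $dbword (split(/ /, $record)) {
        $hits++ if $word eq $dbword;
    }
}
print "$hits matching words\n";
```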
----------------------------------------------------------------------------
http://www.ppcpathways.com/
Visit my new site devoted to reviewing the latest PocketPC products and news that I built in CGI-Perl!


rork
User

Jun 13, 2005, 8:12 AM

Post #2 of 14 (2973 views)
Re: [benn600] Making a search

It might be better to show some more of your code and an example of your database, but here is how I would search.

Separating the words in the search string is fine, but keep the database in one piece: you can then use a regexp for each search word, possibly with word boundaries.

If you have a large database it would be a good idea to iterate over it with a while loop, instead of slurping the whole file at once, which I think is what you do now.
--
Don't reinvent the wheel, use it, abuse it or hack it.


benn600
User


Jun 13, 2005, 10:14 PM

Post #3 of 14 (2968 views)
Re: [rork] Making a search

I just take a few fields of data in a foreach loop (split on |) and append them to a simple scalar, e.g. $.. = "$.. Name: $name", adding the important fields. More weight is given to names and things like that. What do you mean by "use a regexp for every search word"? Word boundaries? Sorry, I don't understand those terms. Thank you very much for your response!
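Roughly like this, with invented field names (my real records have more fields):

```perl
# Sketch: split one |-separated record and build a single searchable
# scalar from its fields. Field names are invented for illustration.
my $record = "3|Battleship by Handmark|Handmark|games";
my ($id, $name, $maker, $category) = split(/\|/, $record);

my $searchable = "Name: $name Maker: $maker Category: $category";
print "$searchable\n";
```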


rork
User

Jun 13, 2005, 10:47 PM

Post #4 of 14 (2965 views)
Re: [benn600] Making a search

RegExp is short for regular expression: things like match, substitute, etc. The big advantage is that it doesn't have to match the complete string; it can match a word inside a large string.

Here is some code:

Code
my $searchstring = "perl regular expressions";
my $found = 1;

open (DATA, "<", "database.txt") or die "Can't open database.txt: $!";
my $data = do { local $/; <DATA> };   # slurp the whole database into one scalar
close(DATA);

foreach my $word (split(/\s+/, $searchstring)) {
    if ($data !~ m/$word/) {
        $found = 0;
        last;
    }
}

if ($found) {
    print "Found the string";
}
else {
    print "Nothing found";
}


$data !~ m/$word/

This is the part that searches for a word in the database. I use a negated match (!~), which is true when $word is not found in $data; for a positive match use =~. Right now it looks for perl anywhere in the text, so it makes no difference between perl and http://www.perlguru.com/. To match whole words only you can use word boundaries (\b) or whitespace (\s):

$data !~ m/\b$word\b/;

To ignore case, add the i switch at the end of the regexp:

$data !~ m/\b$word\b/i;

More info about regular expressions and metacharacters can be found at http://perldoc.perl.org/search.html?q=perlre; look at the top four perlre documents. I think perlreref is the most useful.

I once made a search engine using a flat text database. Each line contained "url \t title \t keywords", which makes it easier to iterate over:

Code
open(DATA, "<", "database.txt") or die "Can't open database.txt: $!";
while (<DATA>) {
    chomp;
    my ($url, $title, $keywords) = split(/\t/, $_);
    if ($keywords =~ m/$searchstring/) {
        print "$url\n";
    }
}
close(DATA);


If you have a large database this will really speed things up, since you use much less memory at a time.

I hope you can work it out now.


benn600
User


Jun 14, 2005, 7:10 AM

Post #5 of 14 (2951 views)
Re: [rork] Making a search

Well, thank you very much! That is very helpful. I do have a question, though: when I used ($input =~ $database), it slowed things down and returned results for just about any combination. How will this regexp approach compare for speed, and will it also match just about anything?


benn600
User


Jun 14, 2005, 7:23 AM

Post #6 of 14 (2950 views)
Re: [benn600] Making a search

How would I make an index? I hear about this a lot when it comes to searching. My site deals with around 50 text files per search, on average. If you do go to my site, it is not the one in my signature; I am rewriting it from the ground up, and luckily I am almost finished. My Perl file is about 250 KB; imagine writing that twice! It is very large for me. The only way I can think of doing an index would be to update it every time there is a change in the database, and it would look like this:

$word,$article,$forum,$messages

amazing|1,2,3,4|3,5,7|1,7

The search would find the word and then only need to open those file numbers under the type being searched. In fact, I didn't give users the ability to search more than one section at once (even though it would be simple to add, since the actual search is separated from fetching the data) because it was getting slow with only 250 articles and a page or two for each. Would this index idea be worth trying? Updating it worries me more than anything. I could write something I run by hand that rebuilds the index from scratch, which wouldn't be bad, but I like search results reflecting changes immediately, which searching the actual database gives me. Articles are only editable through my admin pages, so hooking the update in there wouldn't be too difficult.
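As a sketch, looking a word up in one of those index lines would go something like this (field order as in the example above):

```perl
# Look a search word up in an index line of the form
#   word|article numbers|forum numbers|message numbers
my $line   = "amazing|1,2,3,4|3,5,7|1,7";
my $search = "amazing";

my ($word, $articles, $forums, $messages) = split(/\|/, $line);
my @article_ids;
if ($word eq $search) {
    @article_ids = split(/,/, $articles);   # only these files need opening
    print "open article files: @article_ids\n";
}
```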

This could really save time, I think. Does anyone else have a sense of how much it would help? I think I might do this; I have another project on my hands first, and then I'll get going on it if you tell me to!


rork
User

Jun 14, 2005, 9:34 AM

Post #7 of 14 (2946 views)
Re: [benn600] Making a search

First question: what do you consider slow? 5 seconds, 10 seconds, 30 seconds, 2 minutes?

Your regexp was backwards: you tried to match $database within $input. A large regexp can be slow, but with small ones it's not a real problem. I think the biggest gain is to iterate over each file line by line instead of splitting the complete file.

If you really want to match every word, the best approach is to open each text file, iterate over it line by line and use a regexp on each line. The database you want to set up is too much maintenance, unless you write a script to do it.

Here is some code to search the original databases, which answers your original question:

Code
opendir(DIR, "data") or die "Can't open data: $!";   # assume your databases are in the folder "data"
my @files = readdir(DIR);
closedir(DIR);

my $input = $q->param('search');

foreach my $file (@files) {
    next if $file eq "." or $file eq "..";
    open(FILE, "<", "data/$file") or die "Can't open data/$file: $!";
    while (<FILE>) {
        if (m/$input/i) {
            # get the ID, title and content of the page somehow
            print $title . "<BR>\n";
        }
    }
    close(FILE);
}


But I wouldn't do that, because of exactly your problem: it's way too slow. (I did think of using Google with the site: operator for this.)

Now, every program is amazing to someone, so no one will search for "amazing" and there is no need to index it (the same goes for and, or, the, to, user, etc.). I think it's best to make a database like the example I showed. For instance, from your site:

http://www.ppcpathways.com/cgi-bin/review/main.pl?action=article&sendingidnumber=8&page=0|Battleship by Handmark|battleship,games,handmark,multiplayer

Use keywords someone will actually search for (it's a good idea to log searches so you know what people want to find).

If you want, you can write a script that takes a file, splits the text, filters out the common words and adds the rest to the database. There are also scripts on the internet (and probably pieces of code you could drop into your forum script) that assign keywords to pages.
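A very rough sketch of such a keyword-extraction step (the stop-word list here is only an example):

```perl
# Sketch: lowercase a chunk of text, split it into words and drop
# common stop words; what's left are candidate keywords.
my %stop = map { $_ => 1 } qw(a an and by for in is it of or the to user);

sub keywords {
    my ($text) = @_;
    my %seen;
    foreach my $word (split(/\W+/, lc $text)) {
        next if $word eq '' or $stop{$word};
        $seen{$word}++;
    }
    return sort keys %seen;
}

my @kw = keywords("Battleship by Handmark is a multiplayer game");
print join(",", @kw), "\n";
```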


benn600
User


Jun 14, 2005, 7:12 PM

Post #8 of 14 (2939 views)
Re: [rork] Making a search

What are my options for extracting only letters and numbers from a string? I have decided to go with an index. In fact, I am making 27 index files (one for each letter and one for numbers). This should make searching even a very large database very fast. It is a simple index, as follows:

$word, $articles, ...

I have more than just $articles, but that one refers to the folder /articles, which contains 1.txt, 2.txt, etc.; $articles is stored as 1.x, where 1 is the file number and x is the number of times the word appears in that file. The index update is in a subroutine, so when I make a change I can just call the subroutine. I am anxious to see how well this works. Right now I use a huge list of replace lines to remove characters such as :, $, %, etc. I also use $sin =~ s/<([^>]|\n)*>//g; to remove HTML and $sin =~ tr/A-Z/a-z/; to make everything lowercase. When I remove HTML it seems to take out the text in the middle of tags as well; is there a way to keep the text? Sometimes I adjust the font size or wrap text in <B>, and the search ignores whatever is between the bold tags.

Thanks!


KevinR
Veteran


Jun 14, 2005, 10:58 PM

Post #9 of 14 (2936 views)
Re: [benn600] Making a search

If your search site just searches about 50 text files I see little need for an index, although it would not hurt anything, especially if the site grows. Maybe look at this "simple" search script if you want to:

http://www.scriptarchive.com/nms-download.cgi?s=nms-search&c=zip
-------------------------------------------------


benn600
User


Jun 14, 2005, 11:31 PM

Post #10 of 14 (2934 views)
Re: [KevinR] Making a search

I actually have the index working well already. My site will grow without a doubt, and that search script looks nice but very simple; I didn't even see any weighting, which my search does based on the number of times a word is found.


davorg
Thaumaturge / Moderator

Jun 15, 2005, 5:01 AM

Post #11 of 14 (2933 views)
Re: [benn600] Making a search


In Reply To
How would I make an index? I have heard this a lot when it comes to searching. My site deals with around 50 text files per search, on average.


Have you considered switching from text files to using a real database? There are a couple of good open source (i.e. free) ones around - look at MySQL and PostgreSQL.

That would probably help your speed issue.

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


davorg
Thaumaturge / Moderator

Jun 15, 2005, 5:03 AM

Post #12 of 14 (2932 views)
Re: [KevinR] Making a search


In Reply To
If your search site just searches about 50 text files I see little need for an index, although it would not hurt anything, especially if the site grows. Maybe look at this "simple" search script if you want to:

http://www.scriptarchive.com/nms-download.cgi?s=nms-search&c=zip


Could you please link to the real nms site instead of Matt's?



rork
User

Jun 22, 2005, 12:40 AM

Post #13 of 14 (2924 views)
Re: [benn600] Making a search

I'm just curious: how much time does this method save you?

You can save yourself some lines by using a single substitution with a character class:

s/[;,\$%]//g;

Add whatever other characters you need, and escape metacharacters where necessary ($ is escaped here because an unescaped $ would interpolate a variable into the pattern).
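For example, and to come back to your HTML question: swap each tag for a space instead of deleting it, and the text between the tags survives (the sample string and character list are only illustrative):

```perl
# Clean a string: replace tags with spaces (keeps inner text),
# strip punctuation with one character class, tidy whitespace.
my $sin = 'Price: $5,99 for <B>bold</B> text;';
$sin =~ s/<[^>]*>/ /g;     # each tag becomes a space
$sin =~ s/[;,:\$%()]//g;   # strip these characters in one pass
$sin =~ s/\s+/ /g;         # collapse runs of whitespace
$sin =~ s/^\s+|\s+$//g;    # trim the ends
$sin = lc $sin;
print "$sin\n";
```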


benn600
User


Jun 22, 2005, 6:07 AM

Post #14 of 14 (2922 views)
Re: [rork] Making a search

Unfortunately, I used about five big foreach loops nested quite deep; I tabbed everything out and they go a long way in. However, even as it is set up now, it saves a lot of time: my old method took seconds (3-8 or more), and now a search typically takes under a second, occasionally one or two. I could probably rewrite it again and make it even faster, but I've put so much time into PocketPC Pathways 2.0 that I just want to start using it, and it is pretty fast already. Most results come back within a second.

 
 

