CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Intermediate:
expensive scripts -- making them cheaper

 



scooper
Deleted

Sep 30, 2000, 1:11 PM

Post #1 of 8 (1002 views)
expensive scripts -- making them cheaper Can't Post

I've got this recursive search sub which compares given keywords to
keywords in a flat file, and pushes the results to a hash. The problem
is that while it's not nescessarily SLOW, it seems to be EXPENSIVE
-- it consumes about 50 percent of the CPU ON A 16K
FLAT FILE DATABASE. NG :

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


open (UPDATESFILE, $updatesfile) or die "Error opening $updatesfile: $!\n";
flock(UPDATESFILE, 1); # SHARED LOCK
while (<UPDATESFILE> ){
$updatefile_body = $_;

@updatefile_text = split(/\n/, $updatefile_body);
chop;
foreach $line (@updatefile_text) {

@updatefields = split(/\t/, $line);
$update_record = "$updatefields[11]";

foreach $term (@search_string) {
if (!($update_record =~ /$term/)) {
$include{$updatefields[0]} = 'false';
last;
} else {
$include{$updatefields[0]} = 'true';
}
}
}
}
close(UPDATESFILE);

</pre><HR></BLOCKQUOTE>

There is an awful lot of array and hashing of things going on before
this to get us the $term values (which live in an array called @search_string)
If I comment all that out, and hardcode @search_string, I still have
an expensive script. So this is how I arrived at the fact that this
is the portion of my script which is causing a problem.

If I include the data as a portion of the file using __DATA__ , main: ATA,
etc. the script is still expensive. This rules out 'Disk thrash' as
a problem. Besides, it's a 16K file!!!

If I change the method of matching from this:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


if (!($update_record =~ /$term/)) {
$include{$updatefields[0]} = 'false';
last;
} else {
$include{$updatefields[0]} = 'true';
}
</pre><HR></BLOCKQUOTE>

To this:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


if (!(grep $term $update_record)) {
$include{$updatefields[0]} = 'false';
last;
} else {
$include{$updatefields[0]} = 'true';
}
</pre><HR></BLOCKQUOTE>

the script is still expensive. This removes the variable of the method
of match. I was thinking that the variables were interpolated, or that
previous variables were interpolated -- but this doesn't seem to be
the case.

If I remove columns from the database that _ARE NOT_ being read, the
script is still expensive. So, I think that the size of the database
isn't really the issue. Never mind the fact that I really did bother
to drill down to the 11th element of each record.

So, this leads me to believe that the problem is that I iterate $term
over each $update_record as many times as there are records and terms.
(vis a vis 11 terms * 2000 records = 22,000 recursions).

Or that I am using a FLAT FILE database. I do recall somewhere that
this is the least efficient method of storing data.

To add insult to injury -- if I sleep the 'foreach' like this:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


foreach $term (@search_string) {
sleep(.005); # shouldn't this be a 1 or a 5??
if (!($update_record =~ /$term/)) {
$include{$updatefields[0]} = 'false';
last;
} else {
$include{$updatefields[0]} = 'true';
}
}
</pre><HR></BLOCKQUOTE>

CPU usage goes down to 10 - 15 percent at the cost of performance.

Any help would be appreciated!


[This message has been edited by scooper (edited 09-30-2000).]


dws
Deleted

Sep 30, 2000, 4:59 PM

Post #2 of 8 (1002 views)
Re: expensive scripts -- making them cheaper [In reply to] Can't Post

Where is the recursion happening? I only see iteration.

Your script is going a fair amount of unecessary work. First, you're actually reading the file line-by-line, so the split to extract lines isn't necessary.
<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>

while (<UPDATESFILE> ){ $updatefile_body = $_; @updatefile_text = split(/\n/, $updatefile_body);
chop;
foreach $line (@updatefile_text) {</pre><HR></BLOCKQUOTE>can be rewritten as
<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>

while (<UPDATESFILE> ) {
chomp($line = $_);</pre><HR></BLOCKQUOTE>
Your matching code is suspicious. A search term of "foo" will match again "food". Is that what you intend? If not, you'll need a slightly more sophisticated regular expression.

One thing you could do that would save quite a bit of work would be to assemble a single regular expression for all of the search terms. Something like
<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>

$searchre = "(?:" . join("|", @search_string) . ")";</pre><HR></BLOCKQUOTE>Then use the /o modifier to compile it once (instead of n_search_terms * n_records times).
<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>

if ( $update_record =~ m/$searchre/o ) {</pre><HR></BLOCKQUOTE>
If the search strings might contain otherwise valid regex characters, you'll need to quote them. I'll leave that as homework (it's covered in perlre).

I'm a bit concerned by your math. You say 11 terms, and your code fragment uses 11 as an index to get the 12th field of a tab delimited record. Are these 11s related? If so, double-check your code, since you're ignoring fields 1 through 10.


[This message has been edited by dws (edited 09-30-2000).]


japhy
Enthusiast

Sep 30, 2000, 10:28 PM

Post #3 of 8 (1002 views)
Re: expensive scripts -- making them cheaper [In reply to] Can't Post

Using values of 'true' and 'false' is a silly practice, from a simplicity standpoint. Just
use 1 for true and 0 (or "") for false. Then you can do tests like

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


if ($flag) { ... }
# instead of
if ($flag eq 'true') { ... }
</pre><HR></BLOCKQUOTE>

And you can use the ?: operator here:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


$include{$first} = (exists $include{$first}) ?
$include{$first} :
($update_record =~ $searchre);
</pre><HR></BLOCKQUOTE>

And the $searchre is slow. Make the elements of @keywords pre-compiled regexes:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


my @keywords = map qr{@{[ split ]}}, @favorite_keywords;
</pre><HR></BLOCKQUOTE>

That should speed up the program some.

------------------
Jeff "japhy" Pinyan -- accomplished author, consultant, hacker, and teacher



[This message has been edited by japhy (edited 10-01-2000).]


scooper
Deleted

Oct 1, 2000, 8:27 AM

Post #4 of 8 (1002 views)
Re: expensive scripts -- making them cheaper [In reply to] Can't Post

dws --

yep, again you are right. It's not recursion -- it's iteration. The sub
doesn't call itself at any point. my error.

The rewrite to the code in the fist paragraph makes some sense. I wasn't
aware that it could be done that way, since all I've ever seen is the
other way :-)

This is a neat thing:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


$searchre = "(?:" . join("|", @search_string) . ")";
</pre><HR></BLOCKQUOTE>

joins terms like "foo bar" with a pipe (?:foo|bar). This matches either
what preceeds or follows the pipe. instant on-the-fly extended regex!

This means 'foo' OR 'bar' but in this sense I need 'foo' AND 'bar' --
'foo' is a property of 'bar' (think HOT water as opposed to COLD water,
or this is 'bar' and it's 'foo'). Could that be then revised as :

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


$searchre = "(?:" . join(" ", @search_string) . ")";
</pre><HR></BLOCKQUOTE>


So then, why use 'join'? You're using it to iterate over the list
@search_string !!!

So then why the regex?

The code matches limited keywords that are NOT user defined (OK, you didn't
know this...). So in this case, matching 'food' to 'foo' couldn't happen
because 'food' would never exist. So in this sense the search terms are
acting like 'keys' in the database, each word refers to it's respective owner.

So I strip it out and use this to iterate:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


$searchre = join(" ", @search_string);
</pre><HR></BLOCKQUOTE>

Then there's the other part:
<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


($update_record =~ m/$searchre/o); # match the search string in the update record, and only compile the pattern once.
</pre><HR></BLOCKQUOTE>

well that breaks the sub, because if we compile it only once -- WE DON'T ITERATE
OVER THE LIST.

remove that.

If / is the delimiter, then the initial m is OPTIONAL.

remove that too? OK.

That leaves us with what we had:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


($update_record =~ /$searchre/);
</pre><HR></BLOCKQUOTE>

and we have removed foreach $term(@search_string). That leaves us with:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


sub search_database{
foreach $favorite_keywords (@favorite_keywords) {
$input_string = $favorite_keywords;
@search_string = split(/\s+/, $input_string);
$searchre = join(" ", @search_string);
open (UPDATESFILE, $updatesfile) or die "Error opening updatefile. $updatesfile: $!\n";
flock(UPDATESFILE, 1); # this is a SHARED LOCK
while (<UPDATESFILE> ){
chomp($line= $_);
@updatefields = split(/\t/, $line);
$update_record = "$updatefields[11]";
if (!($update_record =~ /$searchre/)) {
$include{$updatefields[0]} = 'false' unless exists $include{$updatefields[0]};
} else {
$include{$updatefields[0]} = 'true';
}
}
close(UPDATESFILE);
}
}
</pre><HR></BLOCKQUOTE>


It's tighter_code++ ... check this out:

I did a sort of benchmark using Time::HiRes.

My code returns a search for '5_spades' in 5.311 seconds. Then I pasted
our revised sub OVER my own.

The same search (same database, search string, etc...) returns in
believe_it_or_not .2118 seconds!

Still pushes the processor up to utilization in the .50 range, but
for far less actual time.

I hope that makes you feel GOOD!

There won't be any regex characters in the search string for reasons
mentioned above :-) -- but thanks for the warning.

Yes, I've been concerned with my math for quite some time now (it's pretty much
always wrong), but those 11's are NOT related -- I am ignoring fields
_0_thru_10_ in this search.

The 11th element of the array is subdivided into 9 parts -- and each part
contains a variable number of keywords.

It looks like this:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


0\t1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11a|11b|11c|11d|11e|11f|11g|11h|11i|\t12\t
</pre><HR></BLOCKQUOTE>

The 11th element is delimited by pipes, and what is inside the pipes is
CSV text.

Thanks again for your help !! Now if I can only figure out how to get the processor
utilization back into a realistic range.


[This message has been edited by scooper (edited 10-01-2000).]


rGeoffrey
User / Moderator

Oct 1, 2000, 9:59 AM

Post #5 of 8 (1002 views)
Re: expensive scripts -- making them cheaper [In reply to] Can't Post

The following is based on the "tighter" block of code in the previous reply.

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


sub search_database
{
foreach my $searchre (@favorite_keywords) {
$searchre = join(" ", split(/\s+/, $searchre));

open (UPDATESFILE, $updatesfile) or die "Error opening updatefile. $updatesfile: $!\n";
flock(UPDATESFILE, 1); # this is a SHARED LOCK

while (<UPDATESFILE> ){
chomp $_;
my ($first, $update_record) = (split(/\t/, $_))[0,11];
unless ($update_record =~ /$searchre/) {
$include{$first} = 'false' unless exists $include{$first};
} else {
$include{$first} = 'true';
}
}

close(UPDATESFILE);
}
}
</pre><HR></BLOCKQUOTE>

For starters we can simplify the opening of the foreach loop. There are several variables that are only used to get from the current foreach item to $searchre ($favorite_keywords, $input_string, and @search_string).

Inside the while $line is set to be $_ when $_ would do just fine, so $line is gone.

@updatefields is wasteful because we really only care about the 0 and 11 elements. So now we can use an array slice to get just those two elements from the split.

And finally I changed 'if (! (stuff))' to be 'unless (stuff)' because I think it looks nicer.

However we can improve again to get this

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


sub search_database
{
open (UPDATESFILE, $updatesfile) or die "Error opening updatefile. $updatesfile: $!\n";
flock(UPDATESFILE, 1); # this is a SHARED LOCK

my @keywords = map { join(" ", split(/\s+/, $_)) } @favorite_keywords;

while (<UPDATESFILE> ){
chomp $_;
my ($first, $update_record) = (split(/\t/, $_))[0,11];
foreach my $searchre (@keywords) {
unless ($update_record =~ /$searchre/) {
$include{$first} | |= 'false';
} else {
$include{$first} = 'true';
}
}
}

close(UPDATESFILE);
}
</pre><HR></BLOCKQUOTE>

$include{$first} can only have three values, 'true', 'false', or it does not exist. So we can use the | |= (or equal) operator for false. If $include{$first} is either 'true' or 'false because it has already been set, then the | | will leave the value unchanged otherwise it will be set to 'false'. Note that this will generate some warnings if you perl -w. And this is why I only use -w when I really need it.

Also this version switches the foreach and while loops. This way you only open the file once. The first way you open it once for each time through the foreach and then close it, providing a small but vulnerable window for someone else to grab, lock, and change the file. And to make this switch a little quicker, the building of $searchre is moved outside the loop so it only is done once.



[This message has been edited by rGeoffrey (edited 10-01-2000).]


scooper
Deleted

Oct 1, 2000, 12:39 PM

Post #6 of 8 (1002 views)
Re: expensive scripts -- making them cheaper [In reply to] Can't Post

Hey rGeoffrey,japhy,

Good to see ya here! You bring up som points that I know are problematic
throughout most of the code that I have been writing:

1. Spending too much energy getting from point A to point B
2. Repetitively opening & closing file FOREACH iteration of a loop

I think that I will eventually solve problem 1 -- that has a lot to do
with exposure to a more mature writing style, and 'vocabulary'. I'm still
at the point of do instruction 1, followed by 2, followed by ...; what
I'm seeing is that with one or two lines you COULD do instructions
1..3.

This also brings to a realization that I had on two tricks that you have.
You really DO like map! I do see why, because it allows you the ability
to iterate over a list while applying the block to each element of the
list.

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


map { join(" ", split(/\s+/, $_)) } @favorite_keywords;
# take whatever is coming in, and split it up into words (on whitespace).
# join it back together with a single space.
# do that as many times as there are elements in THE ARRAY @favorite_keywords
</pre><HR></BLOCKQUOTE>


OT -- I go back to the post and notice that japhy has joined our
little group. way cool....

The other thing that I need to play with a little more is 'my'. I've
had situations in the past where I probably needed a my, but didn't
really understand the concept. Because I'm trying to UNDERSTAND WHY
you guys come up with what you do, I had a look in the 'nutshell' book.

'Declares one or more private variables to exist only within the innermost
enclosing block, subroutine, eval, or file.'

So now I'm not so scared of it -- now that I've seen it used twice +
managed to RTFM.

japhy makes the point that 'using values of 'true' and 'false' is a silly
practice, from a simplicity standpoint. Just use 1 for true and 0 (or "")
for false.' I agree with that -- I'm so sure he can see this in my
code:


<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


foreach $key (keys %include) {
if ($include{$key} eq 'true') {
$num_results++ ;
} else {
}
}
</pre><HR></BLOCKQUOTE>

I changed that to:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


foreach $key (keys %include) {
if($include{$key}){$num_results++};
}
</pre><HR></BLOCKQUOTE>

and also did this:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


while (($key,$include{$key}) = each(%include)) {
if($include{$key}){$num_results++};
}
</pre><HR></BLOCKQUOTE>

(sheepishly) I do confess, I did try this:


<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>



map { if($include{$key}){$num_results++} } keys %include;

</pre><HR></BLOCKQUOTE>
I really do want to play with map (lol), but the list value should be an array.


So now it looks like this:


<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


sub search_database
{
open (UPDATESFILE, $updatesfile) or die "Error opening updatefile. $updatesfile: $!\n";
flock(UPDATESFILE, 1); # this is a SHARED LOCK
my @keywords = map qr{@{[ split ]}}, @favorite_keywords;
while (<UPDATESFILE> ){
chomp $_;
my ($first, $update_record) = (split(/\t/, $_))[0,11];
foreach my $searchre (@keywords) {
unless ($update_record =~ /$searchre/) {
$include{$first} = (exists $include{$first}) ?
$include{$first} :
($update_record =~ $searchre);
} else {
$include{$first} = '1';
}
}
}
close(UPDATESFILE);
}
</pre><HR></BLOCKQUOTE>


this returns 19 results in .78 seconds -- still very fast.


Thanks for the help you guys!

[This message has been edited by scooper (edited 10-01-2000).]


japhy
Enthusiast

Oct 1, 2000, 1:49 PM

Post #7 of 8 (1002 views)
Re: expensive scripts -- making them cheaper [In reply to] Can't Post

A couple things I noticed about your coding in general, scoop:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


while (($key,$hash{$key}) = each %hash) {
if ($hash{$key}) { $found++ }
}
</pre><HR></BLOCKQUOTE>

You don't need to assign the values from each() like that -- in truth, that slows it
down about 45% the speed of just doing:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


while (($key,$val) = each %hash) {
if ($val) { $found++ }
}
</pre><HR></BLOCKQUOTE>

And, from a purely experience-induced level, I'd be inclined to rewrite that as:

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


$found += grep $_, values %hash;
</pre><HR></BLOCKQUOTE>

That line runs 2/3 times faster than the other code. It takes advantage of the grep() function and the values() function -- it adds, to the $found variable, the number of true values in %hash. Idiomatic expressions like that are what add flair to your programs, and a sparkle in your smile. Wink

And the example of ?: I used in my code was to get around the unless-else clause. If I
understand your code correctly, for all intents and purposes, if a string fails on ONE pattern, it doesn't matter if it can match the others...

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


foreach my $searchre (@keywords) {
unless ($update_record =~ /$searchre/) {
$include{$first} = (exists $include{$first}) ?
$include{$first} :
($update_record =~ $searchre);
} else {
$include{$first} = '1';
}
}

# can be rewritten as

foreach my $searchre (@keywords) {
$include{$first} = ($update_record =~ $searchre) or last;
}
</pre><HR></BLOCKQUOTE>

That code does a lot of things at once:

1. evaluates $update_record =~ $searchre
2. if it succeeds, sets $include{first} to 1
3. if it fails, sets $include{$first} to ''
4. if $include{$first} is false, executes last()

It's this kind of condensing of code and efficiency-hording techniques that make your programs faster, and, assuming you don't do anything crazy, easier to follow.


------------------
Jeff "japhy" Pinyan -- accomplished author, consultant, hacker, and teacher



scooper
Deleted

Oct 1, 2000, 4:35 PM

Post #8 of 8 (1002 views)
Re: expensive scripts -- making them cheaper [In reply to] Can't Post

yes -- that's a pretty good observation. right now I'm just spending a lot of time just getting this sucker to work!

Still, you guys (dws, rGeoffrey, and yourself) are great teachers. I learn a new trick every time we swap posts. I guess I have to learn more about the way perl is written, and this is a comfortable way to do that.

Unfortunately, I have to write a lot of code to do that -- but I'm easily spending 12-14 hours a day on this for the next two months. It's a big project, so expect to see quite a bit of me for the next 3 weeks or so.

I guess the trick is to do as much as you can in as little space as possible.

BTW --

<BLOCKQUOTE><font size="1" face="Arial,Helvetica,sans serif">code:</font><HR>


$found += grep $_, values %hash;
</pre><HR></BLOCKQUOTE>

grep returns the number of times it finds what it was fed in the values hash
$found gets incremented each time grep finds

Thanks again!!

scooper



[This message has been edited by scooper (edited 10-01-2000).]

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives