Help speeding up this code?

 



Chivalri
Novice

Aug 8, 2008, 11:50 AM

Post #1 of 13 (1409 views)
Help speeding up this code?

Hi Everyone,

I have written some Perl code dealing with large amounts of data. The script reads in one line at a time from a flat file, then searches the line for any reference to information from another flat file. This creates a nested loop, and when one flat file is 300 MB+ and the other is 1 MB, it takes a while to run. Any suggestions for improving the performance of this code?

Thanks!

The offending code:

Code
#%hash has been filled with metadata arrays read in
#from the smaller flat file.
#The first element of each array is the table name we are searching for;
#the rest are irrelevant to this loop.
#
#fh_log is a file handle to the large 300 MB+ flat file.
#Each line may contain references to 1 or more table names, and
#I need to capture all of them.
#(@command holds the fields of the current line; the code that parses
#$cur_line into @command is not shown here.)

while($cur_line = <fh_log>)
{
    foreach $key (keys %hash)
    {
        $table = $hash{$key}[0];
        if($command[0] =~ m/$table/i)   #If this query contains reference to this table
        {
            $table_stats{$table}++;     #count up how many times this table is referenced.
            $table_refs++;              #total table references (for % calc)
        }
    }
}



KevinR
Veteran


Aug 8, 2008, 3:34 PM

Post #2 of 13 (1408 views)
Re: [Chivalri] Help speeding up this code?

The code looks about as good as it's going to get. The problem is the size of the files. But ask on www.perlmonks.com and see if anyone knows something that might help you.
-------------------------------------------------


shawnhcorey
Enthusiast


Aug 9, 2008, 3:51 AM

Post #3 of 13 (1397 views)
Re: [Chivalri] Help speeding up this code?


In Reply To

Code
while($cur_line = <fh_log>)
{
    foreach $key (keys %hash)
    {
        $table = $hash{$key}[0];
        if($command[0] =~ m/$table/i)   #If this query contains reference to this table
        {
            $table_stats{$table}++;     #count up how many times this table is referenced.
            $table_refs++;              #total table references (for % calc)
        }
    }
}



You can speed up the program by pre-compiling the regular expression 'm/$table/i'. See `perldoc perlop` and search for 'Quote and Quote-like Operators'. You may also want to read `perldoc perlretut` and `perldoc perlre`.
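
For illustration only, here is a minimal sketch of that idea with qr//; the table names and the input handle are made up for the example, not taken from the original code:

Code
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: @tables stands in for the table names pulled out of %hash.
my @tables = ( 'CUSTOMERS', 'ORDERS', 'ORDER_ITEMS' );

# Compile each pattern once, up front. Interpolating $table into m//i
# inside the inner loop forces Perl to rebuild the pattern every time
# $table changes; a qr// object is compiled exactly once.
my %table_re = map { $_ => qr/$_/i } @tables;

while ( my $line = <STDIN> ) {
    for my $table ( keys %table_re ) {
        print "found $table\n" if $line =~ $table_re{$table};
    }
}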

__END__

I love Perl; it's the only language where you can bless your thingy.

Perl documentation is available at perldoc.perl.org. The list of standard modules and pragmatics is available in perlmodlib.



KevinR
Veteran


Aug 9, 2008, 10:11 AM

Post #4 of 13 (1382 views)
Re: [shawnhcorey] Help speeding up this code?

I was going to suggest that, but the value of $table changes with each iteration of the loop. So I think he would need to compile a regexp for each value of $table and then loop through <fh_log>. Is that what you meant?
-------------------------------------------------


shawnhcorey
Enthusiast


Aug 9, 2008, 10:58 AM

Post #5 of 13 (1380 views)
Re: [KevinR] Help speeding up this code?


In Reply To
I was going to suggest that, but the value of $table changes with each iteration of the loop. So I think he would need to compile a regexp for each value of $table and then loop through <fh_log>. Is that what you meant?


Yup. Here's an example program:


Code
#!/usr/bin/perl

use strict;
use warnings;

# The words are in capitals to highlight the use of the m//i flag in the compiled REs.
# Note the RE pattern __[^_]+__ in the word list.
# Note the word boundaries, \b, which ensure that the OR in FOR is not counted as a word.
# Note that the capture parentheses in the compiled REs do work.
#
my @Words    = qw( MY PERL WHILE IF FOR USE OR ELSIF __[^_]+__ );
my @Words_re = map { qr/\b($_)\b/i } @Words;

my %Count = ();
while( <> ){
    for my $re ( @Words_re ){
        for my $each ( /$re/ ){
            $Count{$1} ++;
        }
    }
}

for my $word ( sort keys %Count ){
    printf "%5d %s\n", $Count{$word}, $word;
}

__END__




KevinR
Veteran


Aug 9, 2008, 12:23 PM

Post #6 of 13 (1377 views)
Re: [shawnhcorey] Help speeding up this code?

Nice example. Hopefully the OP gives it a try and reports back. I'm curious whether precompiling the patterns from 1 MB worth of data will actually speed things up. Since the other file is so big, 300 MB, I think it will.
-------------------------------------------------


Chivalri
Novice

Aug 11, 2008, 8:12 AM

Post #7 of 13 (1280 views)
Re: [KevinR] Help speeding up this code?

Thanks for the feedback, everyone. I went ahead and modified the code using the precompiling suggestions, but it seems to have made the code slower! I am not sure why this happened (I am new to precompiling regexps).

Here is the modified code. One major difference is that since I read the dataset from a file, I couldn't use qw to build the initial array. If you know a way to fix this, please let me know.


Code
my @tables;
my @tables_re;

#reading in the small file
while($cur_line = <size_log>)
{
    my @command = split(/;;;+/, $cur_line);
    push @tables, $command[0];
    #parse remainder of line...
}
push @tables, qw( __[^_]+__ );
@tables_re = map { qr/\b($_)\b/i } @tables;

#pattern match against the big file
while(<fh_log>)
{
    for my $re (@tables_re)
    {
        for my $each (/$re/)    #If this query contains reference to this table
        {
            $table_stats{$1}++; #count up how many times this table is referenced.
            $table_refs++;
        }
    }
}


On my test data set (8 MB large file and 26 KB small file), it was running in about 20 seconds before. Now it runs in about 5 minutes.


KevinR
Veteran


Aug 11, 2008, 9:54 AM

Post #8 of 13 (1275 views)
Re: [Chivalri] Help speeding up this code?

This part of the code:


Code
for my $re (@tables_re)
{
    for my $each (/$re/)    #If this query contains reference to this table
    {
        $table_stats{$1}++; #count up how many times this table is referenced.
        $table_refs++;
    }
}


Maybe it should be written using "if":


Code
for my $re (@tables_re)
{
    if (/($re)/)            #If this query contains reference to this table
    {
        $table_stats{$1}++; #count up how many times this table is referenced.
        $table_refs++;
    }
}

-------------------------------------------------


shawnhcorey
Enthusiast


Aug 11, 2008, 12:26 PM

Post #9 of 13 (1266 views)
Re: [Chivalri] Help speeding up this code?


In Reply To
Here is the modified code. One major difference is since I read the dataset from a file, I couldn't use qw in the initial array. if you know a way to fix this, please let me know.


This is to be expected. Don't take everything you find in an example literally.


In Reply To

Code
push @tables, qw( __[^_]+__ );



You didn't add this just because you saw it in my example, did you? Are you really looking for strings starting and ending with double underscores, with no intervening underscores?


In Reply To
On my test data set (8 MB large file and 26 KB small file), it was running in about 20 seconds before. Now it runs in about 5 minutes.


I'm not sure why this would happen.



Chivalri
Novice

Aug 11, 2008, 1:07 PM

Post #10 of 13 (1262 views)
Re: [shawnhcorey] Help speeding up this code?

True enough, I did add it since I saw it in your example and wasn't sure why you threw it in there.
Unfortunately, the data comes in pretty much any format. I am matching data like "LOGMNRG_ICOL$" and "CHANGE_DETECT_296991" against log file lines like "select pitagname, minvalue, maxvalue, LASTMODIFYDATE, COMPMAXSECS from standardpitag where eq...". The script basically runs against large DBs to see what users are doing, and parses that into a series of reports. The log files can grow to a few hundred megabytes after just one day, and we will generally use a week or more of data to generate these reports.

The only other thing I could think of is that putting this amount of data in an array seems to be much slower than using a hash, so I switched to using a hash. This brought the test data set down to about 3 minutes and 40 seconds (the baseline is 25-30 seconds using the original code). Here is the code updated with hashes instead of arrays and precompiled regexps:


Code
my %htable_re;

#reading in the small file
while($cur_line = <size_log>)
{
    my @command = split(/;;;+/, $cur_line);
    $htable_re{$command[0]} = qr/\b($command[0])\b/i;
    #parse remainder of line...
}

#pattern match against the big file
while(<fh_log>)
{
    foreach $key (keys %htable_re)
    {
        if(/$htable_re{$key}/)      #If this query contains reference to this table
        {
            $table_stats{$key}++;   #count up how many times this table is referenced.
            $table_refs++;
        }
    }
}




KevinR
Veteran


Aug 11, 2008, 1:37 PM

Post #11 of 13 (1259 views)
Re: [Chivalri] Help speeding up this code?

There is no guarantee that using qr// will speed things up. Even the Perl documentation says:



Quote
Since Perl may compile the pattern at the moment of execution of qr() operator, using qr() may have speed advantages in some situations, notably if the result of qr() is used standalone


In your case it looks like precompiling all the regexps is slowing things down. Maybe someone on www.perlmonks.com can shed some light on why that is. I tend to use qr// sparingly, so I have little experience with how you are attempting to use it.
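
For what it's worth, here is a minimal sketch of the "used standalone" distinction the documentation is drawing; the pattern and the input are made up for the example, not taken from the OP's data:

Code
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative only: the pattern and the input are made up.
my $re = qr/\b(ORDERS)\b/i;

while ( my $line = <STDIN> ) {
    # "Standalone" use: bind the compiled pattern directly.
    if ( $line =~ $re ) {
        print "standalone match: $1\n";
    }
    # By contrast, interpolating it into a larger pattern,
    # e.g. $line =~ /FROM\s+$re/, makes Perl assemble a new
    # combined pattern, which can give up some of the benefit.
}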
-------------------------------------------------


shawnhcorey
Enthusiast


Aug 11, 2008, 5:30 PM

Post #12 of 13 (1242 views)
Re: [KevinR] Help speeding up this code?


In Reply To
There is no guarantee that using qr// will speed things up. Even the Perl documentation says:



Quote
Since Perl may compile the pattern at the moment of execution of qr() operator, using qr() may have speed advantages in some situations, notably if the result of qr() is used standalone


In your case it looks like precompiling all the regexps is slowing things down. Maybe someone on www.perlmonks.com can shed some light on why that is. I tend to use qr// sparingly, so I have little experience with how you are attempting to use it.


I ran some benchmarks, using the script itself as the data for the test. I got about a 5-to-1 speed-up. All I can say is that you have really, really big datasets and it's going to take time no matter what you do.


Code
#!/usr/bin/perl

use strict;
use warnings;
use utf8;

use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;
$Data::Dumper::Maxdepth = 0;

use Benchmark;

my $x = 'a';
my @words = ();
print "Creating words...\n";
for ( 1 .. 1_000 ){
    push @words, $x;
    $x ++;
}

print "creating REs...\n";
my @words_re = map { qr/\b($_)\b/i } @words;

my @data = <>;

print "timing...\n";
timethese( 100, {
    search      => \&search,
    precompiled => \&precompiled,
} );

sub search {
    my %count = ();

    for ( @data ){
        for my $word ( @words ){
            if( /$word/ ){
                $count{$word} ++;
            }
        }
    }
}

sub precompiled {
    my %count = ();

    for ( @data ){
        for my $re ( @words_re ){
            if( /$re/ ){
                $count{$1} ++;
            }
        }
    }
}



The results are:

Code
Creating words... 
creating REs...
timing...
Benchmark: timing 100 iterations of precompiled, search...
precompiled: 21 wallclock secs (18.13 usr + 0.00 sys = 18.13 CPU) @ 5.52/s (n=100)
search: 119 wallclock secs (94.82 usr + 0.08 sys = 94.90 CPU) @ 1.05/s (n=100)




KevinR
Veteran


Aug 11, 2008, 10:31 PM

Post #13 of 13 (1237 views)
Re: [shawnhcorey] Help speeding up this code?

Stuff like this varies so much between one version of Perl and another, and even between operating systems and operating system versions, that it is difficult to compare code run on one computer with code run on another. It would be interesting to see if the OP gets similar results running your test code on his system.
-------------------------------------------------

 
 

