Huge data file and looping best practices

 



carillonator (Novice) | Apr 25, 2009, 2:34 PM | Post #1 of 6

Hi,

I'm pretty new to Perl, but I have experience with PHP. I've been asked to improve a Perl script, written by a questionable coder, that analyzes a set of data about patents. The data file has 8 million lines, which look like this:


Code
patent #, char1, char2, char3, ... , char480 

1234567,1,0,1,0,1,0, ... (480 characteristics)
(x 8 million lines)


The script compares each [binary] characteristic of each patent with the corresponding characteristic of every other patent and counts the number of differences for each patent pair (my attempt at improved code is below).

I see that the entire 6G data file is read into memory, so what is the best way to go through it one line at a time? I've also read that using the bareword filehandle OUT for the input file is not a good idea, as opposed to a lexical (scalar) filehandle, but again I'm not sure what the best approach is.

The program will be run on an 8-core machine with 64G of memory. Notice that it takes arguments limiting execution to a certain range of iterations of the first loop, so I can run 7 instances at the same time (one per core) on different parts of the data. Or is there a smarter way to allocate resources? I only know how to do this using the for loop. Can I run only a certain range of iterations of the first loop while still getting the memory benefits of a while loop?
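
This is roughly what I mean by going one line at a time (untested, variable names are just placeholders); what I can't see is how to do the pairwise comparison this way, since each patent still has to be compared against every later line in the file:

Code
#!/usr/bin/perl
use strict;
use warnings;

# $startat and $endat are line numbers bounding this instance's share of the work
my ($startat, $endat) = @ARGV;

open my $patentsFH, '<', 'patents.csv'
    or die "Could not open patents.csv: $!";

while (my $line = <$patentsFH>) {
    next if $. < $startat;    # $. is the current input line number
    last if $. > $endat;      # past our range, stop reading
    chomp $line;
    my ($patno, @chars) = split /,/, $line;
    # ... this patent still needs to be compared against
    # every later line in the file ...
}

close $patentsFH;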

Since the full run will take a VERY long time, even slight improvements could save days or weeks. Any input on making this script as smart and efficient as possible would be greatly appreciated.

Thanks in advance!!


Code
#!/usr/bin/perl
use strict;

my ($patno1, $patno2, @record1, @record2);

my $startat = $ARGV[0];
my $endat   = $ARGV[1];

open(OUT, "<patents.csv") || die("Could not open file!\n");
my @lines = <OUT>;
close(OUT);

# clear variance file if it exists
open(OUT, ">variance.csv") || die("Could not open file variance.csv!\n");
close(OUT);

map(chomp, @lines);

# iterate over the assigned range of patents
for (my $i = $startat; $i <= $endat; $i++)
{
    @record1 = split(/,/, $lines[$i]);
    $patno1  = shift(@record1);

    # iterate through the remaining lines to compare
    for (my $j = $i + 1; $j <= $#lines; $j++)
    {
        @record2 = split(/,/, $lines[$j]);
        $patno2  = shift(@record2);

        my $variance = 0;

        # iterate through each characteristic
        for (my $k = 0; $k <= $#record1; $k++)
        {
            if ($record1[$k] != $record2[$k])
            {
                $variance++;
            }
        }

        open(OUT, ">>variance.csv") || die("Could not open file variance.csv!\n");
        print OUT $patno1 . "," . $patno2 . "," . $variance . "\n";
        close(OUT);
    }
}



(This post was edited by carillonator on Apr 26, 2009, 4:55 AM)


FishMonger (Veteran / Moderator) | Apr 26, 2009, 9:25 AM | Post #2 of 6
Re: [carillonator] Huge data file and looping best practices

The first and major problem I see is the poor choice of data storage. A CSV file with 480 fields is absurd. This data should be stored in a relational database.

Slurping big files into memory is never a good approach. I'd start by using Tie::File http://search.cpan.org/~mjd/Tie-File-0.96/lib/Tie/File.pm which accesses the CSV file as an array without slurping it into RAM.
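
For example (untested sketch; adjust the file name to match yours):

Code
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

# tie @lines to the file; records are fetched from disk on demand
tie my @lines, 'Tie::File', 'patents.csv'
    or die "Could not tie patents.csv: $!";

# @lines now acts like an array of the file's lines (record separators
# already stripped), so you can index into it just as you do now
my ($patno, @record) = split /,/, $lines[0];
print "first patent: $patno, ", scalar @record, " characteristics\n";

untie @lines;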

Why are you opening/closing the variance.csv file ($#lines * $endat + 1) times? That is going to create a huge bottleneck. Open it once and leave it open until the end.
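
In other words, structure it something like this (just a sketch of the shape, with dummy loops standing in for yours):

Code
#!/usr/bin/perl
use strict;
use warnings;

# open the output file ONCE, before the loops
open my $varianceFH, '>', 'variance.csv'
    or die "Could not open variance.csv: $!";

for my $i (0 .. 2) {                 # stand-in for your outer loop
    for my $j ($i + 1 .. 3) {        # stand-in for your inner loop
        my $variance = 0;            # stand-in for the real count
        print $varianceFH "$i,$j,$variance\n";
    }
}

close $varianceFH;                   # close once, after all the work is done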

Using map when chomping the array is very inefficient. chomp is a list operator, so use it as such.

Code
chomp @lines;

However, if you use Tie::File, you won't need to use chomp.

Perl has two different ways to write a for/foreach loop. The C-style initialization that you're using is syntactically very noisy.

Instead of:

Code
for(my $i=$startat;$i<=$endat;$i++)

use:

Code
for my $i ( $startat .. $endat )


Tie::File will help with memory, but looping over the file line-by-line would be more efficient still. I haven't fully analyzed what you're trying to achieve, but it appears that you're comparing the fields in one line against the same fields in the following line. If that's the case, when looping line-by-line you need to store the first line in a variable and then do the compare when reading the next line. After the compare, you reassign the first variable with the value of the current line and repeat the process as you traverse the file.


FishMonger (Veteran / Moderator) | Apr 26, 2009, 9:45 AM | Post #3 of 6
Re: [carillonator] Huge data file and looping best practices

See if this works better.


Code
#!/usr/bin/perl

use strict;
use warnings;

open my $patentsFH, '<', 'patents.csv'
    or die "Could not open patents.csv file! $!";

open my $varianceFH, '>', 'variance.csv'
    or die "Could not open variance.csv file! $!";

my $line1 = <$patentsFH>;
chomp $line1;

while (my $line2 = <$patentsFH>) {

    chomp $line2;
    my ($patno1, @record1) = split(/,/, $line1);
    my ($patno2, @record2) = split(/,/, $line2);
    $line1 = $line2;

    my $variance = 0;

    # iterate through each characteristic
    for my $index ( 0 .. $#record1 ) {
        $variance++ if $record1[$index] != $record2[$index];
    }

    print $varianceFH "$patno1,$patno2,$variance\n";
}



carillonator (Novice) | Apr 26, 2009, 10:09 AM | Post #4 of 6
Re: [FishMonger] Huge data file and looping best practices

FishMonger, thanks for the replies. I'll check out Tie::File.

What makes this program such a monster is that it compares each patent to every other patent, not just one patent to the next as your code does. That's roughly (8 million)^2 / 2, or about 3.2 x 10^13, comparisons.
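
If I go the Tie::File route, I guess the pairwise version would look something like this (untested, just combining your suggestion with my existing range arguments):

Code
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

my ($startat, $endat) = @ARGV;

# tie the file so the inner loop can index any line without holding 6G in RAM
tie my @lines, 'Tie::File', 'patents.csv'
    or die "Could not tie patents.csv: $!";

# append, as in my original script, so several instances can share the file
open my $varianceFH, '>>', 'variance.csv'
    or die "Could not open variance.csv: $!";

for my $i ($startat .. $endat) {
    my ($patno1, @record1) = split /,/, $lines[$i];

    # compare this patent against every later one in the file
    for my $j ($i + 1 .. $#lines) {
        my ($patno2, @record2) = split /,/, $lines[$j];

        my $variance = 0;
        for my $k (0 .. $#record1) {
            $variance++ if $record1[$k] != $record2[$k];
        }
        print $varianceFH "$patno1,$patno2,$variance\n";
    }
}

close $varianceFH;
untie @lines;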

Thanks for the better syntax in file handles and loops, too.


FishMonger (Veteran / Moderator) | Apr 26, 2009, 10:29 AM | Post #5 of 6
Re: [carillonator] Huge data file and looping best practices

In that case, I'd start by sorting the file and then counting the number of groups of lines that differ. From that you should be able to work up a formula that calculates the $variance, but offhand I don't know what that formula would be.


carillonator (Novice) | Apr 27, 2009, 5:33 AM | Post #6 of 6
Re: [FishMonger] Huge data file and looping best practices

You're absolutely right. We actually did do the sorting in order to calculate the total number of combinations, but it never occurred to me that the individual differences could be derived from there.

Thanks!

 
 

