Apr 25, 2009, 2:34 PM
Post #1 of 6
Huge data file and looping best practices
I'm pretty new to Perl, but I have experience with PHP. I have been asked to improve a Perl script, written by a questionable coder, that analyzes a set of data about patents. The data file has 8 million lines that look like this:
patent #, char1, char2, char3, ... , char480
1234567,1,0,1,0,1,0, ... (480 characteristics)
(x 8 million lines)
The script compares each binary characteristic of each patent with the corresponding characteristic of every other patent and counts the number of differences for each patent pair (my attempt at the improved code is below).
I see that the entire 6G data file is read into memory, so what is the best way to process it one line at a time? I have also read that opening the input file on the bareword filehandle OUT, as opposed to a lexical (scalar) filehandle, is a bad idea, but again I'm not sure what the right approach is; my rough sketch of a line-at-a-time loop is just below.
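Here is what I think a line-at-a-time loop with a lexical filehandle and a three-argument open looks like (the variable names are just mine):

open(my $in, '<', 'patents.csv') or die "Could not open patents.csv: $!\n";
while (my $line = <$in>) {
    chomp $line;
    my ($patno, @chars) = split /,/, $line;   # patent number plus its 480 characteristics
    # ... comparison work would go here ...
}
close($in);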
The program will be run on an 8-core machine with 64G of memory. Notice that it takes arguments limiting execution to a certain range of iterations of the first loop, so I can run 7 instances at the same time (one per core) on different parts of the data. Or is there a smarter way to allocate resources? I only know how to do this with the for loop. Can I restrict the first loop to a certain range of iterations while still getting the memory benefits of the while loop? My guess is sketched below.
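My guess at range-limiting a streaming while loop (using the same start/end arguments the for-loop version takes) is this, though I don't know whether it's the smart way:

my ($start, $end) = @ARGV;    # e.g. 1 and 1142857 for the first of the 7 instances
open(my $in, '<', 'patents.csv') or die "Could not open patents.csv: $!\n";
my $lineno = 0;
while (my $line = <$in>) {
    $lineno++;
    next if $lineno < $start;   # skip lines before this instance's range
    last if $lineno > $end;     # stop once past the range
    # compare this line against every later line here (a second pass over the file)
}
close($in);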
Since this program will take a VERY long time to run, even the slightest improvements could save days or weeks. Any input on making this script as smart and efficient as possible would be greatly appreciated.
Thanks in advance!!
open(OUT, "<patents.csv")|| die("Could not open file!\n");
#clear variance file if it exists
open(OUT, ">variance.csv")|| die("Could not open file variance.csv!\n");
# iterate over all patents
# iterate through other lines to compare
# iterate through each characteristic
open(OUT, ">>variance.csv")|| die("Could not open file variance.csv!\n");
print OUT $patno1.",".$patno2.",".$variance."\n";
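One other idea I ran across while searching (I'm not sure it's sound, and to_bits is just a helper I made up): pack each patent's 480 characteristics into a bit string once, then count differences with a bitwise XOR instead of looping 480 times per pair:

# to_bits: turn one CSV line into (patent number, packed bit string)
sub to_bits {
    chomp(my $line = shift);
    my ($patno, @chars) = split /,/, $line;
    return ($patno, pack('b*', join('', @chars)));   # 480 bits -> 60 bytes
}

my ($patno1, $bits1) = to_bits("1234567,1,0,1,0");
my ($patno2, $bits2) = to_bits("7654321,0,0,1,1");
my $variance = unpack('%32b*', $bits1 ^ $bits2);     # counts the differing bits
print "$patno1,$patno2,$variance\n";                 # prints 1234567,7654321,2

If that works, the innermost loop collapses to a single unpack, which I would hope is much faster.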
(This post was edited by carillonator on Apr 26, 2009, 4:55 AM)