Stumped - need faster way to use short file to search uber-long file

 



Jessica
New User

Dec 11, 2009, 3:54 PM

Post #1 of 5 (811 views)

In short, I have one long file (FILE1) that is sorted by position: 1 through 2485959639 (yes, about 2.5 billion lines of data). I am trying to compare this file with an 'annotation' file (FILE2) that has about 250,000 lines, each with the name of the item I am interested in and the start and stop positions of that item.

The only way I can get this to work is very basic code that reads FILE1 line by line and, for each line, compares it to FILE2 line by line. As you can imagine, this is taking for...e....ver. At the rate it is running now, this script will take about a month to finish.

I have tried a binary search, but the problem is that the start/stop sites of some items overlap. So I can't sort FILE2 by position and do a search, because I will end up with 1 match when there might be 2 or 3. In reverse, I have tried going through each item position in FILE2 to search FILE1, but that script takes up too much memory, since I have to hold the whole of FILE1 rather than just read it line by line.

Here is what I have (a bit simplified to protect identifying items):

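my @file2 = <FILE2>;
while (my $file1 = <FILE1>) {
    # The leading digits on each FILE1 line are the position.
    if ($file1 =~ /^([0-9]+)/) {
        my $position = $1;
        foreach my $file2 (@file2) {
            if ($file2 =~ /regexprtomatchitemnames\t([0-9]*)\t([0-9]*)/) {
                my ($start, $end) = ($1, $2);
                if ($position >= $start && $position <= $end) {
                    # $file1 still ends with its newline, so don't add another.
                    print OUTFILE $file1;
                }
            }
        }
    }
}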

Help?!?!?!?!


FishMonger
Veteran / Moderator

Dec 11, 2009, 4:27 PM

Post #2 of 5 (805 views)

Please post a sample of the actual data in each file and point out what portions you need to match.


Jessica
New User

Dec 11, 2009, 5:04 PM

Post #3 of 5 (798 views)

FILE1 (2 columns, 1 is the unique position, 2 is data)
Position Data
1 blue
2 green
3 green
4 red
5 blue
6 yellow
7 blue
8 green
9 red
10 blue
...
2485959637 green

FILE2 (3 columns, one is the item name, 2 and 3 are the start and stop positions that correlate to column 1 in FILE1)
Item Start Stop
1 8 216
2 188 1024
3 11803 18544
4 16832 19516
5 24345 24401
6 24407 24477
7 27290 27503
8 27508 27576
9 27582 27683
10 198469 198818
....
259652 2485959214 2485959635

Basically, I want every line in FILE1 whose position is covered by an item in FILE2 to be printed to an output file. So, for item 1 in FILE2, I want to find lines 8 through 216 in FILE1 and copy them to another file. Then I want to pull lines 188 through 1024 and print them to the same file (even if this means that lines 188 through 216 are printed twice...I need this to happen). The limiting factor so far has been that more complex scripts require the whole of FILE1 to be read in as an array or similar, which takes too much memory, and the script won't run.
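
One way to meet this requirement without holding FILE1 in memory is a single pass over FILE1 with only the FILE2 ranges kept in memory (about 250,000 pairs). What follows is a minimal sketch, not a tested solution for the real data. It assumes both files are whitespace-separated as in the samples, that FILE1 is sorted by position, and that output ordered by position (rather than grouped by item) is acceptable. The names load_intervals and sweep are hypothetical helpers made up for this illustration, and the demo data is invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read "item start stop" rows and return [start, stop] pairs sorted by start.
# Header rows and anything without two numeric columns are skipped.
sub load_intervals {
    my ($fh) = @_;
    my @intervals;
    while (my $row = <$fh>) {
        my (undef, $start, $stop) = split ' ', $row;
        next unless defined $stop && $start =~ /^\d+$/ && $stop =~ /^\d+$/;
        push @intervals, [ $start, $stop ];
    }
    return [ sort { $a->[0] <=> $b->[0] } @intervals ];
}

# Stream the position-sorted data once. Each line is printed once for every
# range that covers its position, so overlapping items still give duplicates.
sub sweep {
    my ($data_fh, $intervals, $out_fh) = @_;
    my $next = 0;    # index of the next range that has not started yet
    my @active;      # stop positions of ranges covering the current line
    while (my $line = <$data_fh>) {
        my ($pos) = $line =~ /^(\d+)/ or next;    # skip header rows
        # Activate every range that starts at or before this position.
        while ($next < @$intervals && $intervals->[$next][0] <= $pos) {
            push @active, $intervals->[$next][1];
            $next++;
        }
        # Drop ranges that ended before this position.
        @active = grep { $_ >= $pos } @active;
        print {$out_fh} $line for @active;
        # Stop early once every range has been used up.
        last if !@active && $next >= @$intervals;
    }
}

# Tiny demo on in-memory data (invented values, not the real files).
my $demo_data  = join '', map { "$_ colour$_\n" } 1 .. 6;
my $demo_items = "Item Start Stop\nA 2 4\nB 3 5\n";
open my $demo_data_fh,  '<', \$demo_data  or die "can't open demo data: $!";
open my $demo_items_fh, '<', \$demo_items or die "can't open demo items: $!";
sweep($demo_data_fh, load_intervals($demo_items_fh), \*STDOUT);
```

In the demo, positions 3 and 4 are covered by both items and so are printed twice, while positions 2 and 5 are printed once. The trade-off is the output order: lines come out in FILE1 order with the duplicates adjacent, instead of all of item 1 followed by all of item 2.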


FishMonger
Veteran / Moderator

Dec 11, 2009, 5:34 PM

Post #4 of 5 (792 views)


Code
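#!/usr/bin/perl

use strict;
use warnings;
use Tie::File;

my $file1  = 'file1';
my $file2  = 'file2';
my $output = 'outputfile';

# Tie::File presents file1 as an array of lines without loading it all into memory.
tie my @file1, 'Tie::File', $file1 or die "can't tie '$file1' $!";
open my $file2_fh, '<', $file2  or die "can't open '$file2' $!";
open my $out_fh,   '>', $output or die "can't open '$output' $!";

while ( <$file2_fh> ) {
    chomp;
    my (undef, $start, $stop) = split /\s+/;
    # Skip the header row and anything else without numeric positions.
    next unless defined $stop && $start =~ /^\d+$/ && $stop =~ /^\d+$/;
    # Positions are 1-based, array indexes are 0-based. Tie::File strips the
    # line ending from each record by default, so put it back when printing.
    print $out_fh "$_\n" for @file1[ $start - 1 .. $stop - 1 ];
}

close $out_fh or die "can't close '$output' $!";
untie @file1;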
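
A possible variation on the same idea, for anyone who wants the item-by-item output order without Tie::File: record the byte offset of every Nth line in one pass, then seek to the nearest recorded offset for each range and read forward from there. This is only a sketch under assumptions: line N of the file holds position N (no header row), and the names build_sparse_index and copy_lines and the step size are hypothetical choices for this illustration. Building the index still costs one full read of the long file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Record the byte offset of every $step-th line boundary.
# $offsets[$k] is the offset at which line $k * $step + 1 begins.
sub build_sparse_index {
    my ($fh, $step) = @_;
    my @offsets = (0);
    seek $fh, 0, 0;
    my $n = 0;
    while (my $line = <$fh>) {
        $n++;
        push @offsets, tell($fh) if $n % $step == 0;
    }
    return \@offsets;
}

# Print lines $first .. $last (1-based) by seeking to the closest
# indexed offset at or before $first and reading forward.
sub copy_lines {
    my ($fh, $index, $step, $first, $last, $out_fh) = @_;
    my $k = int( ($first - 1) / $step );
    return if $k > $#$index;          # range starts past the end of the file
    seek $fh, $index->[$k], 0;
    my $n = $k * $step;               # number of lines before the seek point
    while (my $line = <$fh>) {
        $n++;
        next if $n < $first;
        last if $n > $last;
        print {$out_fh} $line;
    }
}

# Tiny demo on in-memory data (invented values, not the real files).
my $demo_src = join '', map { "$_ colour$_\n" } 1 .. 10;
open my $demo_fh, '<', \$demo_src or die "can't open demo data: $!";
my $demo_step  = 3;
my $demo_index = build_sparse_index($demo_fh, $demo_step);
# Ranges are handled in item order, so overlapping ranges repeat lines.
for my $range ( [ 2, 4 ], [ 3, 5 ] ) {
    copy_lines($demo_fh, $demo_index, $demo_step, @$range, \*STDOUT);
}
```

With a step of, say, one million lines, the index for a 2.5-billion-line file would hold only a few thousand offsets, at the cost of reading up to one step's worth of lines before each range starts.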



(This post was edited by FishMonger on Dec 11, 2009, 5:36 PM)


Jessica
New User

Dec 16, 2009, 9:03 AM

Post #5 of 5 (767 views)

This is much faster, though it still took a couple of days. Thanks!

 
 

