CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Intermediate:
Improving speed when comparing two lists (nested loop)

 



sessmurda
Novice

Oct 10, 2012, 7:51 AM

Post #1 of 6 (1990 views)
Improving speed when comparing two lists (nested loop) Can't Post

This is an issue I've had a few times before but always was able to find a variable to make a hash key that matched between two data sets, and improve speed that way. Right now I am working with two files.


Code
 
Scaffold604 947743 947827
Scaffold604 941827 941924
Scaffold604 938360 938389
Scaffold604 932556 933427
Scaffold3712 176554 176704
Scaffold3712 180811 180934
Scaffold3712 181470 181554


and

Code
 
27051543 Scaffold446353 19 1442
27051545 Scaffold121539 7428 11547
27051546 Scaffold121539 80165 98618
27051547 Scaffold121539 167332 175700
27051549 Scaffold267160 81607 82942
27051550 Scaffold267160 79672 80108


What I want to do is find all the lines in file 1 that match the Scaffold number and fall within the range that comes after. Right now I am loading the 2nd file into a hash, and as I loop through the first file. The while loop is below.


Code
 
while (<IN>)
{
chomp;
my @fields = split /\t/;
foreach my $PAC (sort keys %loci)
{
my ($scaf, $start, $stop) = @{$loci{$PAC}};
my $min = ($start - 100);
my $max = ($stop + 100);
if ( ($scaf eq $fields[0]) && ($fields[1] > $min) && ($fields[2] < $max) )
{
print "$scaf\t$fields[1]\t$fields[2]\n";
}
}
}
close (IN);

This works but is very time intensive, was wondering if anyone had any suggestions as some of the files I want to use this on are millions of lines long.

Thanks in advance


Laurent_R
Veteran / Moderator

Oct 10, 2012, 10:25 AM

Post #2 of 6 (1983 views)
Re: [sessmurda] Improving speed when comparing two lists (nested loop) [In reply to] Can't Post

Hi,

how long is your file 2? Does it also have millions of reords, or is it much less?

I am asking because there may be more efficient ways to populate your hash.


sessmurda
Novice

Oct 10, 2012, 11:28 AM

Post #3 of 6 (1982 views)
Re: [Laurent_R] Improving speed when comparing two lists (nested loop) [In reply to] Can't Post

I want to run it on a few different data sets, the hash file at max is 50000 lines but the other files contain between 1-100 million lines each. There are multiple records in the million line files that should match single records in 50000 line files, so I cannot delete from the hash once I match because of that.


Laurent_R
Veteran / Moderator

Oct 10, 2012, 11:56 AM

Post #4 of 6 (1981 views)
Re: [sessmurda] Improving speed when comparing two lists (nested loop) [In reply to] Can't Post

My idea was to see if populating the hash with all values, instead of only ranges, is feasible.

For example, if you have the range 19..1442, create a hash entry for each value between 19 and 1442. Same thing for the other ranges.

Access time to a hash does not depend on the number of elements in the hash (it is said to be in O(1), if you know this notation, except if very rare pathological cases). This means you could check directly for your value in the hash, and this would be very fast. I have done relatively similar things in the very recent past, with a param file of a few hundreds of thousands of entries, and a data file of about 35 million records, and obtained a speed-up of 60 times.

So the question is: do all intermediate values of your file 2 fit into a hash? If they do, you may achive a huge performance gain.


sessmurda
Novice

Oct 10, 2012, 12:00 PM

Post #5 of 6 (1980 views)
Re: [Laurent_R] Improving speed when comparing two lists (nested loop) [In reply to] Can't Post

No they do not, the main purpose is to split into 2 segments, those that fall within, and those that are outside. My current strategy is to find all those that fall within then using another script I wrote which is a hash-based version of 'grep -f' that runs much faster. There are far fewer that are outside than inside, and if I could only grab those that would be awesome as well.

Thanks again for your in depth replies


Chris Charley
User

Oct 11, 2012, 12:52 PM

Post #6 of 6 (1954 views)
Re: [sessmurda] Improving speed when comparing two lists (nested loop) [In reply to] Can't Post


Quote
to find a variable to make a hash key that matched between two data sets, and improve speed that way

Just using a hash does not 'magically' improve speed. :-)
If it is used the way you have set up, it is effectively being accessed linearly to find the matches. I believe you chose the wrong item to be a key, $PAC from the hash %loci. There is no corresponding value in the other file you wish to match against. The correct item to choose as a key is the Scaffold604, Scaffold3712 items. To see if they are present then in the other file, just test the Scaffold... key you have in the other file to see if it is in the hash.


Code
#!/usr/bin/perl 
use strict;
use warnings;

my $f1 = <<EOF;
Scaffold604 947743 947827
Scaffold604 941827 941924
Scaffold604 938360 938389
Scaffold604 932556 933427
Scaffold3712 176554 176704
Scaffold3712 180811 180934
Scaffold3712 181470 181554
Scaffold446353 19 1400
Scaffold121539 168000 175000
EOF

my $f2 = <<EOF;
27051543 Scaffold446353 19 1442
27051545 Scaffold121539 7428 11547
27051546 Scaffold121539 80165 98618
27051547 Scaffold121539 167332 175700
27051549 Scaffold267160 81607 82942
27051550 Scaffold267160 79672 80108
EOF

my %data;
open my $fh2, '<', \$f2 or die $!;
while (<$fh2>) {
chomp;
my (undef, $scaff, $start, $stop) = split /\t/;
push @{ $data{$scaff} }, {start => $start, stop => $stop};
}
close $fh2 or die $!;

open INSIDE, ">", 'inside.txt' or die $!;
open OUTSIDE, ">", 'outside.txt' or die $!;

open my $fh1, '<', \$f1 or die $!;
while (<$fh1>) {
my ($scaff, $field1, $field2) = split /\t/;
if (my $aref = $data{$scaff}) { # array reference

my (%seen_inside, %seen_outside);

for my $href (@$aref) { # hash reference
if ($href->{start} - 100 < $field1 && $field2 < $href->{stop} + 100) {
print INSIDE unless $seen_inside{$_}++;
}
else {
print OUTSIDE unless $seen_outside{$_}++;
}
}
}
}
close $fh1 or die $!;
close INSIDE or die $!;
close OUTSIDE or die $!;


 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives