speeding up clustering program

 



pjshort42
Novice

Jul 29, 2012, 5:50 AM

Post #1 of 15 (3595 views)
speeding up clustering program

Hello,

I am trying to measure clustering in a network by counting how many groups of 4 nodes have at least 3 connections between them (and also counting the number with 4 connections between them). The data input file is essentially a list of "node A touches node B" pairs. I am using a series of nested loops, and it is taking upwards of 10 hours to run for my really large data sets. Any ideas for how to speed this up?

The array @gene_list ends up being about 1000 nodes in total and the file input of connections is about 9000 entries long.


Code
my $start = time;
print 'Taking some time to read file...', "\n";
open FILE2, "my file.txt" or die $!;

@nodeA = ();
@nodeB = ();
while (<FILE2>) {
    my @columns = split /\t/, $_;
    if (exists $columns[1]) {
        push @nodeA, $columns[0];
        push @nodeB, $columns[1];
    }
}

## list of unique nodes ##
@combined_list = (@nodeA, @nodeB);
%contact_count = ();
foreach (@combined_list) {
    $contact_count{$_}++;
}
@gene_list = keys %contact_count;
print scalar keys %contact_count;

# we want to select a particular set of nodes and find the number of edges between them
$full_count = 0;
$quad_count = 0;
for ($i = 0; $i < scalar @gene_list; $i++) {
    for ($j = $i + 1; $j < scalar @gene_list; $j++) {
        for ($k = $j + 1; $k < scalar @gene_list; $k++) {
            for ($l = $k + 1; $l < scalar @gene_list; $l++) {
                @to_consider = ($gene_list[$i], $gene_list[$j], $gene_list[$k], $gene_list[$l]);
                #print $to_consider[0], "\t", $to_consider[1], "\n";
                $count = 0;
                for ($q = 0; $q < scalar @nodeA; $q++) {
                    if (grep(/$nodeA[$q]/, @to_consider) and grep(/$nodeB[$q]/, @to_consider)) {
                        $count += 1;
                    }
                }
                if ($count > 2) {
                    $full_count += 1;
                }
                if ($count > 3) {
                    $quad_count += 1;
                }
            }
        }
    }
}



my $end = time;
my $elapsed = $end - $start;
$triple = $full_count - $quad_count;
print "\n", "total time ", $elapsed, " seconds \n";
print 'The total number of cluster groups is ', $full_count, ' with ', $quad_count, ' quadruple groups and ', $triple, ' triple counts.', "\n";



FishMonger
Veteran / Moderator

Jul 29, 2012, 7:01 AM

Post #2 of 15 (3590 views)
Re: [pjshort42] speeding up clustering program [In reply to]

Use the Devel::NYTProf module to profile your script. It will generate a report that tells you where the script is spending its time, which will be the sections that you need to refactor.

http://search.cpan.org/~timb/Devel-NYTProf-4.06/lib/Devel/NYTProf.pm
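For reference, a typical profiling run looks like this (the script name is a placeholder; this assumes Devel::NYTProf is installed from CPAN):

```shell
# Run the script under the profiler; this writes ./nytprof.out
perl -d:NYTProf cluster.pl

# Convert the raw profile into a browsable HTML report (see nytprof/index.html)
nytprofhtml
```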


Laurent_R
Veteran / Moderator

Jul 29, 2012, 8:40 AM

Post #3 of 15 (3587 views)
Re: [FishMonger] speeding up clustering program [In reply to]

There is at least one thing you should try: compute the size of your array @gene_list only once at the beginning, not billions of times.


Code
for ($i = 0; $i < scalar @gene_list; $i++) {
    for ($j = $i + 1; $j < scalar @gene_list; $j++) {
        for ($k = $j + 1; $k < scalar @gene_list; $k++) {
            for ($l = $k + 1; $l < scalar @gene_list; $l++) {


If your array has 1,000 entries, the first "for" instruction computes the size of the array about 1,000 times, the second one about 499,500 times, the third one about 166,167,000 times, etc.

If the compiler is not able to optimize this away, it is quite normal that it takes ages to run.

Do something like this:


Code
my $array_size = scalar @gene_list;
for ($i = 0; $i < $array_size; $i++) {
    for ($j = $i + 1; $j < $array_size; $j++) {
        for ($k = $j + 1; $k < $array_size; $k++) {
            for ($l = $k + 1; $l < $array_size; $l++) {


Another thing you might consider is the foreach syntax, which is more Perlish and is supposed to be slightly faster than the C-style for. It usually does not make a noticeable difference, but since these instructions are repeated so many times, it makes sense to try something like:


Code
my $max = scalar @gene_list - 1;
for my $i (0 .. $max) {
    for my $j ($i + 1 .. $max) {
        for my $k ($j + 1 .. $max) { # ...


I tried your solution and mine with only 3 loops ($i, $j and $k, as just above): my code ran in 15 seconds, yours in 45 seconds. (This doesn't mean you will get the same improvement ratio: in both cases the innermost loop did almost nothing (it only incremented a variable), while your real code does many other things, so optimizing the looping mechanism will not pay off as much relative to the total execution time.)
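A quick way to check this kind of difference yourself is the core Benchmark module. A minimal sketch (the array size and iteration count here are made up; both subs just count elements so their results can be compared):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @gene_list = (1 .. 1000);

# C-style loop: re-evaluates "scalar @gene_list" on every iteration
sub c_style {
    my $n = 0;
    for (my $i = 0; $i < scalar @gene_list; $i++) { $n++ }
    return $n;
}

# Range-based for over a precomputed upper bound
sub range_style {
    my $n   = 0;
    my $max = $#gene_list;
    for my $i (0 .. $max) { $n++ }
    return $n;
}

die "results differ" unless c_style() == range_style();

# Run each sub 2000 times and print a comparison table
cmpthese(2000, { c_style => \&c_style, range_style => \&range_style });
```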

Finally, look very carefully at everything that happens in the two innermost loops; they are executed so many times that any optimization you find there will pay off a lot. In particular, you again have this "scalar @nodeA" (or whatever it is) in a loop condition, which should be changed to a precalculated variable so that the array size is not recalculated every time.

Having said that, I definitely agree with FishMonger that you should profile your code to get a better idea of where the program is spending a lot of time.


pjshort42
Novice

Jul 30, 2012, 5:25 AM

Post #4 of 15 (3564 views)
Re: [Laurent_R] speeding up clustering program [In reply to]

Thank you!

I reformatted a bit by first building a hash of all the interactions, which resulted in a much shorter grep down the line. While it is not super fast, it is probably about 10-fold faster thanks to the change in computing the array size and to using the hash.


Laurent_R
Veteran / Moderator

Jul 30, 2012, 10:27 AM

Post #5 of 15 (3556 views)
Re: [pjshort42] speeding up clustering program [In reply to]

Did you try the "for my $i (0 .. $max)" form, rather than the C-style for? In my tests with your code yesterday, it brought quite a large improvement.


pjshort42
Novice

Jul 31, 2012, 1:32 AM

Post #6 of 15 (3527 views)
Re: [Laurent_R] speeding up clustering program [In reply to]

I did try that and it ended up at just about the same speed, but it is good to start using that style anyway.

I think the reason (found using the module posted earlier) is that it is the second part of the code (where it takes the output of the quadruple loop and counts matches) that is taking the vast majority of the time.

here is what I have changed it to as of now:

Code
use List::Util qw(first max maxstr min minstr reduce shuffle sum);

my $start = time;
print 'Taking some time to read file...', "\n";
open FILE2, "my_file.txt" or die $!;

@nodeA = ();
@nodeB = ();
@contacts = ();
while (<FILE2>) {
    my @columns = split /\t/, $_;
    if (exists $columns[2]) {
        push @nodeA, $columns[0];
        push @nodeB, $columns[1];
    }
}

%interactions = ();
for ($h = 0; $h < scalar @nodeA; $h++) {
    push @{ $interactions{$nodeA[$h]} }, $nodeB[$h];
}

## list of unique genes ##
@combined_list = (@nodeA, @nodeB);
%contact_count = ();
foreach (@combined_list) {
    $contact_count{$_}++;
}

@gene_list = keys %contact_count;

# we want to select a particular number of genes and find the number of edges between them
$full_count  = 0;
$quad_count  = 0;
$quint_count = 0;
my $gene_size = scalar @gene_list - 1;
for my $i (0 .. $gene_size) {
    for my $j ($i + 1 .. $gene_size) {
        for my $k ($j + 1 .. $gene_size) {
            for my $l ($k + 1 .. $gene_size) {
                @to_consider = ($gene_list[$i], $gene_list[$j], $gene_list[$k], $gene_list[$l]);
                #print $to_consider[0], "\t", $to_consider[1], "\n";
                $count = 0;
                for my $o (0 .. 3) {
                    for my $p (0 .. 3) {
                        @grep_add = grep(/$to_consider[$o]/, @{ $interactions{$to_consider[$p]} });
                        $count += scalar @grep_add;
                    }
                }
                if ($count > 2) { $full_count  += 1; }
                if ($count > 3) { $quad_count  += 1; }
                if ($count > 5) { $quint_count += 1; }
            }
        }
    }
}

my $end = time;
my $elapsed = $end - $start;
$triple    = $full_count - $quad_count;
$real_quad = $quad_count - $quint_count;
print "\n", "total time ", $elapsed, " seconds \n";
print 'The total number of cluster groups is ', $full_count, ' with ', $quint_count, ' quintuple groups, ', $real_quad, ' quadruple groups, and ', $triple, ' triple counts.', "\n";


Using the hash and much smaller grep method seems to be working much faster (but still fairly slow!).
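One further idea along the same lines: the regex grep still scans a list, and a pattern match can also hit substrings (a node named "g1" would match "g12"). Storing each edge as a hash key gives exact, constant-time lookups, so checking a group of 4 nodes costs at most 6 lookups. A sketch with made-up node names:

```perl
use strict;
use warnings;

# Build an exact edge lookup instead of grepping with regexes.
# %edge maps "A\tB" (and "B\tA") to 1 for every connection read.
my %edge;
my @pairs = ( [ 'g1', 'g2' ], [ 'g2', 'g3' ], [ 'g3', 'g1' ], [ 'g1', 'g4' ] );
for my $p (@pairs) {
    my ( $x, $y ) = @$p;
    $edge{"$x\t$y"} = 1;
    $edge{"$y\t$x"} = 1;
}

# Count edges inside a group of nodes: for 4 nodes, 6 unordered
# pairs, hence 6 hash lookups - no regex, no list scan.
sub edges_in_group {
    my @g = @_;
    my $count = 0;
    for my $i ( 0 .. $#g - 1 ) {
        for my $j ( $i + 1 .. $#g ) {
            $count++ if $edge{"$g[$i]\t$g[$j]"};
        }
    }
    return $count;
}

print edges_in_group(qw(g1 g2 g3 g4)), "\n";   # 4 of the 6 possible edges exist
```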


Laurent_R
Veteran / Moderator

Jul 31, 2012, 2:50 AM

Post #7 of 15 (3519 views)
Re: [pjshort42] speeding up clustering program [In reply to]

Yes, you are probably right that the second part of the code is taking most of the time.

But to know exactly where, you should use the code profiler suggested by FishMonger.

That may give you some further ideas to speed up the program.


pjshort42
Novice

Jul 31, 2012, 4:15 AM

Post #8 of 15 (3516 views)
Re: [Laurent_R] speeding up clustering program [In reply to]

Yes, I have used the code profiler, and it is the second part of the code. I can't think of much more to speed it up; I may just be stuck with it.

Thanks for all of the help!


Laurent_R
Veteran / Moderator

Jul 31, 2012, 10:57 AM

Post #9 of 15 (3502 views)
Re: [pjshort42] speeding up clustering program [In reply to]

Well, I do not understand well enough what exactly you are trying to do.

I tested at work today a complete Perl rewrite that I did of a PL/SQL-type program and managed to speed it up by a factor of more than 60 (on a given data set, the original program took 1 h 18 min to run; my new version ran in 1 min 14 s, and I have added one further improvement today that I haven't had time to test yet). But this implied that I fully understood what the original program was doing, so that I could think of a completely different way to do it, taking advantage of Perl's powerful data structures (hashes of hashes of hashes), which are not available in PL/SQL.

I do not understand what you are doing well enough to guide you much further.

(To tell the truth, I have done quite a lot of performance-improvement work over the last few years, but this is the first time I have obtained such a huge improvement ratio; my best previous success was a factor of 20, with which I was already very happy and quite proud.)


pjshort42
Novice

Aug 1, 2012, 2:27 AM

Post #10 of 15 (3486 views)
Re: [Laurent_R] speeding up clustering program [In reply to]

What I am trying to do is take the first file, which is organized in a two-column format (nodeA, nodeB) with several thousand entries. Each row represents a network connection (so a connection between node A and node B in the network).

What I am trying to do is measure clustering by counting up the number of times that I find certain arrangements. In this case, I am looking for groups of 4 nodes that share 4 or more edges together (evidence of clustering).

Before the first loop, I take all of the node connections from the first list and remove any repeats with the hash, so we end up with a list of every node in the network.

The first loop is attempting to iterate through this entire list of nodes for every possible combination. What the loop should be doing is:
first: 1,2,3,4
second: 1,2,3,5
third: 1,2,3,6
.
.
etc., until all of the combinations are covered.

Each of these combinations is then run through grep to add up how many times one member of the list of 4 nodes is connected to another of the 4 nodes. If it is more than 4, we add 1 to our count and continue on to the next set of four genes.

Let me know if that makes sense!

Edit: The code I posted is adding up >2, >3, and >5; hopefully this doesn't confuse you. I was running it that way to check whether it was working correctly on a dummy file earlier. What I am really looking for is >4 and >5, but those thresholds can be changed to whatever, depending on how many connections we are looking for!


(This post was edited by pjshort42 on Aug 1, 2012, 2:56 AM)


BillKSmith
Veteran

Aug 1, 2012, 7:22 AM

Post #11 of 15 (3477 views)
Re: [pjshort42] speeding up clustering program [In reply to]


Quote
until all of the combinations are covered.



Consider using the function combinations in the module Algorithm::Combinatorics. Be sure to read the 'SUBROUTINES' section of the documentation for information on how to use the function as an iterator. I have not studied your code, but your example suggests that you are examining every permutation rather than every combination, which would carry a huge time penalty. Even if you are doing it correctly, I would expect the C code in the module to be faster than an all-Perl solution.
Good Luck,
Bill
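To get a feel for the size of that penalty: each combination of 4 nodes has 4! = 24 orderings, so enumerating permutations instead of combinations does 24 times the work. A quick core-Perl check of the counts for 1000 nodes (the choose() helper here is just an illustrative binomial-coefficient sketch):

```perl
use strict;
use warnings;

# Binomial coefficient C(n, k), computed iteratively.
# Each intermediate value is an integer and stays below 2**53,
# so plain floating-point arithmetic is exact here.
sub choose {
    my ( $n, $k ) = @_;
    my $r = 1;
    $r = $r * ( $n - $_ + 1 ) / $_ for 1 .. $k;
    return $r;
}

my $combinations = choose( 1000, 4 );     # unordered groups of 4 nodes
my $permutations = $combinations * 24;    # 4! orderings per group
printf "combinations: %.0f, permutations: %.0f\n", $combinations, $permutations;
```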


pjshort42
Novice

Aug 1, 2012, 9:18 AM

Post #12 of 15 (3469 views)
Re: [BillKSmith] speeding up clustering program [In reply to]

I will look into the module, and hopefully it will speed up that part of the code. I thought that with the $i, $i+1, $j+1, etc. sort of structure it would cover only the combinations (I don't think there would be any repeats in this loop, as the list is unique).

If my thinking isn't correct on that, though, let me know.


Laurent_R
Veteran / Moderator

Aug 1, 2012, 10:51 AM

Post #13 of 15 (3461 views)
Re: [pjshort42] speeding up clustering program [In reply to]

I think you are doing combinations in your code.

However, it also really depends on the data in your file. If your file contains (nodeA, nodeB) and, somewhere else in the same file, (nodeB, nodeA), then you might really be processing permutations without knowing it.

If you can have that kind of unseen duplicate, then you should preprocess your file to remove them. With 4 nodes, you may end up with as much as 24 times less processing to do, if I understand well enough what you are doing.
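One way to remove such hidden duplicates is to store every edge with its endpoints in sorted order, so that (B, A) and (A, B) collapse to the same key. A sketch with made-up node names:

```perl
use strict;
use warnings;

# Canonicalize each edge so (A, B) and (B, A) map to the same key,
# then keep only the first occurrence of each key.
my @raw = ( [ 'g2', 'g1' ], [ 'g1', 'g2' ], [ 'g3', 'g1' ] );
my %seen;
my @edges;
for my $e (@raw) {
    my $key = join "\t", sort @$e;   # "g1\tg2" for both orderings
    push @edges, $e unless $seen{$key}++;
}
print scalar @edges, "\n";   # 2 unique edges remain of the 3 read
```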


BillKSmith
Veteran

Aug 1, 2012, 12:27 PM

Post #14 of 15 (3453 views)
Re: [pjshort42] speeding up clustering program [In reply to]

I agree with Laurent_R. You are taking combinations, but not combinations of the right things. You should be examining all combinations of four nodes, yet your data repeats a node for every connection to it, so you are taking combinations of connections rather than of nodes. Process your list of all nodes with uniq from List::MoreUtils to remove all duplicates before starting to compute combinations.
Good Luck,
Bill
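If List::MoreUtils is not available, the same deduplication can be done with the classic core-Perl "seen hash" idiom, which is equivalent in effect to uniq and preserves order:

```perl
use strict;
use warnings;

# Order-preserving deduplication: keep each node the first time it appears.
my @nodes  = qw(g1 g2 g1 g3 g2 g4);
my %seen;
my @unique = grep { !$seen{$_}++ } @nodes;
print "@unique\n";   # g1 g2 g3 g4
```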


pjshort42
Novice

Aug 2, 2012, 1:47 AM

Post #15 of 15 (3406 views)
Re: [BillKSmith] speeding up clustering program [In reply to]

I was under the impression that this part of the code should take care of the repeats:


Code
@combined_list = (@nodeA, @nodeB);
%contact_count = ();
foreach (@combined_list) {
    $contact_count{$_}++;
}


The %contact_count hash keys should not contain any repeats, so further down in the code @gene_list is created as an array of the keys of this hash.

I don't think it is this part of the code that is slowing things down, but rather the grep in the second set of loops. Hopefully that makes sense!

 
 

