CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Beginner:
Clustal alignment parsing

 



newtoperlprog
Novice

Jul 10, 2014, 8:35 AM

Post #1 of 9 (3718 views)
Clustal alignment parsing Can't Post

Dear All,

I am trying to parse the output of clustalw program for multiple sequence alignment. I am fairly new to perl programming and am trying my best to understand the various concepts.
I have an alignment file (see attachment) and what I am trying to do:
a) Join the various blocks of alignment in one line, so it will be something like ids followed by alignments.
b) Print only those sequences which are conserved (greater than 15 residues) across the alignment length (marked by an asterisks).

Below is my code:

Code
#!/usr/bin/perl 
use strict;
use warnings;

my %ALIGNMENT;
my $FILE = "clustal.aln";
read_alignment($FILE);

foreach my $key ( keys my $ALIGNMENT )
{
printf "%-15s %s\n", ($key, $ALIGNMENT{$key});
}
sub read_alignment
{
my ($file) = @_;
my ($line);
local (*TMP);
open(TMP, "< $file") or die "can't open file '$file'\n";
while ( $line = <TMP> )
{
my $key;
next if $line =~ /^CLUSTAL/;
( $key, my $sequence ) = $line =~ /[\S]+/;
if ( !exists $ALIGNMENT{$key} )
{
$ALIGNMENT{$key} = "$sequence";
}
}
close TMP
}

Attachments: clustal.aln (3.57 KB)


Laurent_R
Veteran / Moderator

Jul 10, 2014, 11:04 AM

Post #2 of 9 (3621 views)
Re: [newtoperlprog] Clustal alignment parsing [In reply to] Can't Post

I am not a biologist and therefore don't understand completely what you are saying.

You did not say what problem you are encountering with your script, so we are left to try to guess.

One thing I can say is that this:

Code
foreach my $key ( keys my $ALIGNMENT )   
{
printf "%-15s %s\n", ($key, $ALIGNMENT{$key});
}

is certainly not working, not doing what you want and might probably not even compile if you really have use strict and use warnings. There is no $ALIGNMENT variable. I guess you want to go through the hash, then you should probably have this:


Code
foreach my $key ( keys %ALIGNMENT )   
{
printf "%-15s %s\n", $key, $ALIGNMENT{$key};
}


Also, although this might not be technically necessary, good practices would recommend that you use a lexical filehandler and change this:

Code
     my ($line);  
local (*TMP);
open(TMP, "< $file") or die "can't open file '$file'\n";
while ( $line = <TMP> )
# ...


to this:


Code
      open my $TMP, "<", $file or die "can't open file $file\n";  
while ( my $line = <$TMP> )
# ...



(This post was edited by Laurent_R on Jul 12, 2014, 2:18 AM)


Chris Charley
User

Jul 11, 2014, 9:13 AM

Post #3 of 9 (3227 views)
Re: [newtoperlprog] Clustal alignment parsing [In reply to] Can't Post

I'm unclear about your problem. For the data:

Code
gnl|hbvcds|FJ356716_X_C-H      tcgccccggtcaacgacctggattgaggaatacatcaaagactgtgtatttaaggactgg 
gnl|hbvcds|HM066946_X_P-H tcgccccggtcaacgacctggattgaggactacatcaaagactgtgtatttaaggactgg
gnl|hbvcds|HM117850_X_P-H tcgccccggtcaacgacctggattgaggaatacatcaaagactgtgtatttaaggactgg
gnl|hbvcds|HM117851_X_P-H tcgccccagtcaacgacctggattgaggaatacatcaaagactgtgtgtttaaggactgg
******* ********************* ***************** ************

It has consecutive asterisks greater than 15 from positions 8 to 28 and 30 to 46. Are these the conserved sequences for this block? Do you want to print only the bases, [tagc], corresponding to these positions?

Do you want to get all these conserved bases from the complete file concatenated for each id? Guess this is a start!


bulrush
Novice

Jul 11, 2014, 11:37 AM

Post #4 of 9 (3136 views)
Re: [Chris Charley] Clustal alignment parsing [In reply to] Can't Post

I have had some problems with


Code
if ( !exists $ALIGNMENT{$key} )


so I always write it this way now:


Code
if ( !(exists $ALIGNMENT{$key} ) )

-----
* Redhat Linux RHEL 5.5.56
* Perl 5.8.8


Laurent_R
Veteran / Moderator

Jul 11, 2014, 3:13 PM

Post #5 of 9 (3126 views)
Re: [bulrush] Clustal alignment parsing [In reply to] Can't Post

Is this answering to any code snippet written by Chris Charley?

You might consider this:


Code
do_something() unless exists $ALIGNMENT{$key};


or using a lower precedence negation operator:


Code
if ( not exists $ALIGNMENT{$key} )



Laurent_R
Veteran / Moderator

Jul 12, 2014, 2:29 AM

Post #6 of 9 (2653 views)
Re: [newtoperlprog] Clustal alignment parsing [In reply to] Can't Post

Newtoperlprog has sent me further questions by PM. I think it is best to put it in the public forum. This is the message by Newtoperlprog (reformated to make the code clearer):

____________

Thank very much for your reply and time. Now I am trying to do a little different thing. Now I am reading the file in a while loop and creating a hash variable.
The problem is its printing the same sequence with different key value.
Kindly help me and provide some directions for the same.
Regards

Code
#!/usr/bin/perl 
use warnings;
use strict;

my %hash; my @temp;my $id; my $seq;
my $file = $ARGV[0];
open (A, $file);
while (<A>)
{
chomp $_;
my $line = $_;
if ($line =~/^CLUSTAL/ || $line =~/^\s*$/)
{
next;
}
# print "$line\n";
if ($line =~/(.*\|.*\|.*)\s+(.*)/)
{
$id = $1; $seq = $2;
# print "$id\t$seq\n";
# $hash{$id} = 1;
push @{$hash{$id}}, $seq;
# $hash{$id}.= $seq;
#$hash{$id}++;
}
}

foreach my $totalseq (keys %hash)
{
if (exists $hash{$id})
{
# print "$totalseq\t$hash{$id}\n";
print "$totalseq\t@{$hash{$id}}\n";
}
}
close (A);



Laurent_R
Veteran / Moderator

Jul 12, 2014, 2:41 AM

Post #7 of 9 (2644 views)
Re: [newtoperlprog] Clustal alignment parsing [In reply to] Can't Post

Hi Newtoperlprog,

I think your error is in the last section, the foreach loop:


Code
foreach my $totalseq (keys %hash)  
{
if (exists $hash{$id})
{
# print "$totalseq\t$hash{$id}\n";
print "$totalseq\t@{$hash{$id}}\n";
}
}

You are using the $id variable without ever resetting it, so that it has retained the last value assigned to it in the previous loop, this is unlikely to be what you want.

Although I am not entirely sure of what you want to do, I would suspect you probably wanted something like this:


Code
foreach my $totalseq (keys %hash)  
{
if (exists $hash{$totalseq})
{
# print "$totalseq\t$hash{$totalseq}\n";
print "$totalseq\t@{$hash{$totalseq}}\n";
}
}


although the :


Code
    if (exists $hash{$totalseq})

is useless because it is bound to exist since you are looping on the keys of the hash. So it could be simplified to:


Code
foreach my $totalseq (keys %hash)  
{
print "$totalseq\t@{$hash{$totalseq}}\n";
}


If this is not what you need, please explain what you want.

There are a number of other things that could be improved in your code, but I'll put that in a separate post.


Laurent_R
Veteran / Moderator

Jul 12, 2014, 3:12 AM

Post #8 of 9 (2623 views)
Re: [newtoperlprog] Clustal alignment parsing [In reply to] Can't Post

Hi,

this is my attempt to rewrite more effectively your code:


Code
#!/usr/bin/perl  
use warnings;
use strict;

my %hash;
# my @temp;my $id; my $seq; -- see note 1
my $file = $ARGV[0];
open my $A, "<", $file or die "could not open $file $!";
# see note 2
while (<$A>)
{
chomp; # note 3
next if /^CLUSTAL/ or /^\s*$/;
if (/(.*\|.*\|.*)\s+(.*)/)
{
my ($id, $seq) = ($1, $2);
# see note 4
push @{$hash{$id}}, $seq;
}
}
close ($A); # see note 5

foreach my $totalseq (keys %hash)
{
print "$totalseq\t@{$hash{$totalseq}}\n";
# see note 6
}


Note 1:
# these variables are not needed (@temp) or should declared be in a smaller scope (the 2 others)

Note 2:
# use a lexical file handle ($A), not a bare word file handle
# use the 3-argument syntax for open
# check if opening the file was successful
# I would also change the name of the filehandle to something more meaningful, but did not change it here to avoid confusing you with too many changes

Note 3:
# if you are using the $_ default variable, then you don't need to spell it out in a chomp or in a regex, use completely the advantage of $_.
# if you prefer to use a named variable, then rather do it this way:

Code
while (my $line = <$A>) { 
chomp $line;
# etc.


Note 4:
These $seq and $id variables are temporary variables used only within the loop. Make them lexically scoped to the loop. If you had done that, the compiler would have told you about your error (using by mistake $id in the final loop).

Note 5:
# you no longer need this file handle, close it immediately

Note 6:
# explanations for these change were given in my previous post

An additional point: you could possibly make your code more concise and replace this:

Code
     if (/(.*\|.*\|.*)\s+(.*)/)  
{
my ($id, $seq) = ($1, $2);
# see note 4
push @{$hash{$id}}, $seq;
}


with a single line of code:


Code
push @{$hash{$1}}, $2 if  /(.*\|.*\|.*)\s+(.*)/;


I did not do it in the code above to avoid making too many changes and to avoid using a terse syntax which might be a bit difficult for a beginner.


newtoperlprog
Novice

Jul 14, 2014, 11:53 AM

Post #9 of 9 (1122 views)
Re: Clustal alignment parsing [In reply to] Can't Post

Dear All,

I am very much grateful to all of your wonderful suggestions and guidance.
I need some additional insight and help from all of you. The regular expression I used to match the "CLUSTAL" to get id and seq might not work for very long sequence alignment. There might be some missing information.
So I guess the good way would be to make the id as key and the sequence as the value and then print the whole alignment as key and value and then as Chris Charley mentioned print out the [atgc] corresponding to each '*' mark in the sequence alignment in the new line.
Again thankyou for all your help and suggestions.

Code
#!/usr/bin/perl 
use strict;
use warnings;

my %alignment;
my $FILE = "clustal.aln";
read_alignment($FILE);

foreach my $key ( keys my %alignment )
{
printf "%-15s %s\n", ($key, $alignment{$key});
}
sub read_alignment
{
my ($file) = @_;
my ($line);
open my $TMP, "<", $file or die "Check the file $file\n";
while ( $line = <$TMP> )
{
my $key;
next if $line =~ /^CLUSTAL/;
( $key, my $sequence ) = $line =~ /[\S]+/;
if (not exists $alignment{$key})
{
$alignment{$key} = $sequence;
}
}
close ($TMP);
}



(This post was edited by FishMonger on Jul 14, 2014, 2:21 PM)
Attachments: clustal.aln (1.20 KB)

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives