CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Beginner:
extracting sequences

 



muluwork
Novice

May 8, 2012, 3:20 PM

Post #1 of 5 (2674 views)
extracting sequences Can't Post

Hello there,
I have one big problem, but I don't know how to address it. I think the expert in this forum can solve it easily as I am not a programmer. I only know very little about perl. I am just biologist happen to face this problem, but think it can be solved by perl. I am really tired of cut and paste from a huge database. Would you please help me? Here is my problem. I want to extract sequences from an entry which is genebank format. I want to extract nucleotide sequences. The input looks like as attached (input) and also the output shall look like as attached (output).

Here a bit explanation to make it easier. For example, the input file in the attachment has two entries and each entry end by two slashes, '//'. Some entries have a term called 'CDS' but some others not. As it is seen in the input file (one entry has CDS but the other not, please see the attachment) I need only those entries with CDS. In front of the term CDS there are two numbers, in the attached example; i.e., '25..762'. The sequences to be extracted is determined by the second number '762'. Then at the bottom of each entry numbered sequence is found. So extraction shall begin from 763 till the end of this sequence. The name of the header of each entry shall begin with '>' and the sentence in front shall be taken from each entry definition line which is labeled as 'DEFINITION'. For the attached example it shall be '>Danio rerio cDNA clone MGC:198355 IMAGE:9039344, complete cds.' (please see the attached output)

Please save me time. If I do it manually it will take me a lot of time, which I don't have at this point. It is not a homework. It is a research stuff.
Your help is much appreciated!
Attachments: input (8.53 KB)
  output (0.32 KB)


wickedxter
User

May 8, 2012, 6:36 PM

Post #2 of 5 (2607 views)
Re: [muluwork] extracting sequences [In reply to] Can't Post

Hello and welcome....

after some diggin i found that the bioperl module supports the GenBank sequence format your useing...
http://www.bioperl.org/wiki/GenBank_sequence_format have you tried useing the module?


muluwork
Novice

May 9, 2012, 12:44 PM

Post #3 of 5 (2321 views)
Re: [wickedxter] extracting sequences [In reply to] Can't Post

Thanks Wickedxter! I saw the link and went through some of the modules. Still hard to grasp but I will read it further and try. If you have an idea of a module that fit may need can you please forward me to begin with. Thanks again!


Chris Charley
User

May 9, 2012, 1:57 PM

Post #4 of 5 (2312 views)
Re: [muluwork] extracting sequences [In reply to] Can't Post

I don't work with Biogen data, but reading up on them, it seems Bio::SeqIO provides the functionality to parse 'GenBank' formatted files. Here is a script that worked with your input file. I'm not sure if I am using all the correct methods, but the results are good, I believe.

I reviewed the functionality for the 2 modules whose features I used, http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/SeqIO.pm and it's parent module, http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Seq.pm


Code
   #!/usr/bin/perl    
use strict;
use warnings;
use Bio::SeqIO;

my $in = Bio::SeqIO->new( -file => "input.txt" ,
-format => 'GenBank');

open my $out, ">", 'fasta.out' or
die "Unable to open fasta.out for writing. $!";

while ( my $seq = $in->next_seq() ) {
my @features = $seq->get_SeqFeatures(); # just top level
foreach my $feat ( @features ) {
next unless $feat->primary_tag eq 'CDS';
print $out ">", $seq->desc, "\n";
print $out join("\n",
$seq->subseq(1 + $feat->end, $seq->length) =~ /.{1,60}/g), "\n";
}
}
close $out or die $!;


__END__
*** print out

>Danio rerio cDNA clone MGC:198355 IMAGE:9039344, complete cds.
GCGTTCATGGAGAAAACTTCACTCGTGGTCAAGACATGCCTGCTACCTGCATCCCCGAGC
TCGGCCTTCAGTCACCAGGCTGCTCTGATCCCACCTGCGGTTTGCAGTTTACCAGAGGCT
GTCCAAGGCAGTCTACCACAACCTGCGTGACATTAACACTGTCTGGTCTTAGCAGTAGTT
GCAGTAGGAAGTCCCCCCCTTTCTCTGTCTTTAGGTTCTCACCTTCTGTTTTTGTAAATA
ATGTTTGGCTCGCCAGCTTTA


Chris

Update: Yes, there is an output method and I'm posting it here. Also, there is a test at the end to see if everything is still in the correct format.


Code
#!/usr/bin/perl 
use strict;
use warnings;
use Bio::SeqIO;
use 5.014;

my $in = Bio::SeqIO->new( -file => "input.txt" ,
-format => 'GenBank');
my $out = Bio::SeqIO->new( -file => '>fasta.dat',
-format => 'fasta');

while ( my $seq = $in->next_seq() ) {
my @features = $seq->get_SeqFeatures(); # just top level
foreach my $feat ( @features ) {
next unless $feat->primary_tag eq 'CDS';
my $subseq = $seq->trunc(1 + $feat->end, $seq->length);
$out->write_seq($subseq);
}
}

my $test = Bio::SeqIO->new( -file => 'fasta.dat',
-format => 'fasta');

while (my $seq = $test->next_seq()) {
say "ID";
say $seq->id, "\n";
say "SEQ";
say $seq->seq, "\n";
say "DESC";
say $seq->desc;
}
__END__

*** fasta.dat

>BC171628 Danio rerio cDNA clone MGC:198355 IMAGE:9039344, complete cds.
GCGTTCATGGAGAAAACTTCACTCGTGGTCAAGACATGCCTGCTACCTGCATCCCCGAGC
TCGGCCTTCAGTCACCAGGCTGCTCTGATCCCACCTGCGGTTTGCAGTTTACCAGAGGCT
GTCCAAGGCAGTCTACCACAACCTGCGTGACATTAACACTGTCTGGTCTTAGCAGTAGTT
GCAGTAGGAAGTCCCCCCCTTTCTCTGTCTTTAGGTTCTCACCTTCTGTTTTTGTAAATA
ATGTTTGGCTCGCCAGCTTTA

*** test run

C:\Old_Data\perlp>perl t5.pl
ID
BC171628

SEQ
GCGTTCATGGAGAAAACTTCACTCGTGGTCAAGACATGCCTGCTACCTGCATCCCCGAGCTCGGCCTTCAGTCACCAGGC
TGCTCTGATCCCACCTGCGGTTTGCAGTTTACCAGAGGCTGTCCAAGGCAGTCTACCACAACCTGCGTGACATTAACACT
GTCTGGTCTTAGCAGTAGTTGCAGTAGGAAGTCCCCCCCTTTCTCTGTCTTTAGGTTCTCACCTTCTGTTTTTGTAAATA
ATGTTTGGCTCGCCAGCTTTA

DESC
Danio rerio cDNA clone MGC:198355 IMAGE:9039344, complete cds.



(This post was edited by Chris Charley on May 10, 2012, 8:20 AM)


muluwork
Novice

May 9, 2012, 2:30 PM

Post #5 of 5 (2307 views)
Re: [Chris Charley] extracting sequences [In reply to] Can't Post

Thank you very much indeed Chris. It works perfect. Bioperl will be my friend. I will read more on it. Thanks once again :)

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives