
Chris Charley
User
May 9, 2012, 1:57 PM
Post #4 of 5
(2019 views)
|
|
Re: [muluwork] extracting sequences
[In reply to]
|
Can't Post
|
|
I don't work with Biogen data, but reading up on them, it seems Bio::SeqIO provides the functionality to parse 'GenBank' formatted files. Here is a script that worked with your input file. I'm not sure if I am using all the correct methods, but the results are good, I believe. I reviewed the functionality for the 2 modules whose features I used, http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/SeqIO.pm and it's parent module, http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Seq.pm #!/usr/bin/perl use strict; use warnings; use Bio::SeqIO; my $in = Bio::SeqIO->new( -file => "input.txt" , -format => 'GenBank'); open my $out, ">", 'fasta.out' or die "Unable to open fasta.out for writing. $!"; while ( my $seq = $in->next_seq() ) { my @features = $seq->get_SeqFeatures(); # just top level foreach my $feat ( @features ) { next unless $feat->primary_tag eq 'CDS'; print $out ">", $seq->desc, "\n"; print $out join("\n", $seq->subseq(1 + $feat->end, $seq->length) =~ /.{1,60}/g), "\n"; } } close $out or die $!; __END__ *** print out >Danio rerio cDNA clone MGC:198355 IMAGE:9039344, complete cds. GCGTTCATGGAGAAAACTTCACTCGTGGTCAAGACATGCCTGCTACCTGCATCCCCGAGC TCGGCCTTCAGTCACCAGGCTGCTCTGATCCCACCTGCGGTTTGCAGTTTACCAGAGGCT GTCCAAGGCAGTCTACCACAACCTGCGTGACATTAACACTGTCTGGTCTTAGCAGTAGTT GCAGTAGGAAGTCCCCCCCTTTCTCTGTCTTTAGGTTCTCACCTTCTGTTTTTGTAAATA ATGTTTGGCTCGCCAGCTTTA Chris Update: Yes, there is an output method and I'm posting it here. Also, there is a test at the end to see if everything is still in the correct format.
#!/usr/bin/perl use strict; use warnings; use Bio::SeqIO; use 5.014; my $in = Bio::SeqIO->new( -file => "input.txt" , -format => 'GenBank'); my $out = Bio::SeqIO->new( -file => '>fasta.dat', -format => 'fasta'); while ( my $seq = $in->next_seq() ) { my @features = $seq->get_SeqFeatures(); # just top level foreach my $feat ( @features ) { next unless $feat->primary_tag eq 'CDS'; my $subseq = $seq->trunc(1 + $feat->end, $seq->length); $out->write_seq($subseq); } } my $test = Bio::SeqIO->new( -file => 'fasta.dat', -format => 'fasta'); while (my $seq = $test->next_seq()) { say "ID"; say $seq->id, "\n"; say "SEQ"; say $seq->seq, "\n"; say "DESC"; say $seq->desc; } __END__ *** fasta.dat >BC171628 Danio rerio cDNA clone MGC:198355 IMAGE:9039344, complete cds. GCGTTCATGGAGAAAACTTCACTCGTGGTCAAGACATGCCTGCTACCTGCATCCCCGAGC TCGGCCTTCAGTCACCAGGCTGCTCTGATCCCACCTGCGGTTTGCAGTTTACCAGAGGCT GTCCAAGGCAGTCTACCACAACCTGCGTGACATTAACACTGTCTGGTCTTAGCAGTAGTT GCAGTAGGAAGTCCCCCCCTTTCTCTGTCTTTAGGTTCTCACCTTCTGTTTTTGTAAATA ATGTTTGGCTCGCCAGCTTTA *** test run C:\Old_Data\perlp>perl t5.pl ID BC171628 SEQ GCGTTCATGGAGAAAACTTCACTCGTGGTCAAGACATGCCTGCTACCTGCATCCCCGAGCTCGGCCTTCAGTCACCAGGC TGCTCTGATCCCACCTGCGGTTTGCAGTTTACCAGAGGCTGTCCAAGGCAGTCTACCACAACCTGCGTGACATTAACACT GTCTGGTCTTAGCAGTAGTTGCAGTAGGAAGTCCCCCCCTTTCTCTGTCTTTAGGTTCTCACCTTCTGTTTTTGTAAATA ATGTTTGGCTCGCCAGCTTTA DESC Danio rerio cDNA clone MGC:198355 IMAGE:9039344, complete cds.
(This post was edited by Chris Charley on May 10, 2012, 8:20 AM)
|