filter huge genbank file by organism



Oct 22, 2012, 11:14 AM

Post #1 of 9 (5127 views)

Hi everyone,

I stopped programming in Perl for a while and now I'm stuck! Can you please help me?

This is my code:


use strict;
use warnings;

sub ReadFile {

    my $infile = $_[0]; my $hash_string = $_[1];

    open (INFILE , "<$infile") or die ("Cannot open infile fasta: $infile. $!\n");
    my $string = "";
    my @strings;

    while (my $line = <INFILE>) {
        chomp ($line);
        if ($line =~ m/LOCUS/){
            $string = $line;
            push (@strings, $string);
            $hash_string -> {$string} = "$line\n";
        }
        else {
            $hash_string -> {$string} .= $line;
        }
    }
    close INFILE;
}

sub FindSpecies {

    my $hash_string = $_[0]; my $outfile = $_[1];

    open (OUTFILE , ">$outfile") or die ("Cannot open outfile fasta: $outfile. $!\n");

    foreach my $line (keys %$hash_string) {
        if ($line =~ m/ORGANISM/){
            if (my $sp =~ (/((.+)\n(.+)\;(.+)\;(.+)(.+))/)) {
                $sp = $5; my $string = "";
                if ($sp =~ m/Equinodermata/){
                    print OUTFILE "$hash_string->{$string}\n";
                }
            }
        }
    }
    close OUTFILE;
}

my $infile = $ARGV[0];
my %hash_string = ();
my $outfile = $infile.".final";

ReadFile ($infile, \%hash_string);
FindSpecies (\%hash_string, $outfile);

What am I trying to do? I have a huge GenBank file from NCBI and I want to filter it by ORGANISM (phylum Echinodermata). I wrote this code and it runs without any error, but my output file ends up empty. What am I doing wrong?

Thanks a lot guys!


Oct 22, 2012, 11:50 AM

Post #2 of 9 (5126 views)
Re: [andreiareis] filter huge genbank file by organism

open (OUTFILE , ">$outfile") or die ("Cannot open outfile fasta: $outfile. $!\n");

Every time you open the file for writing with >, it clobbers everything already in the file.

#change this 
open (INFILE , "<$infile") or die ("Cannot open infile fasta: $infile. $!\n");
#to this
open (INFILE , "$infile") or die ("Cannot open infile fasta: $infile. $!\n");

#change this
open (OUTFILE , ">$outfile") or die ("Cannot open outfile fasta: $outfile. $!\n");
#to this (>> is append mode)
open (OUTFILE , ">>$outfile") or die ("Cannot open outfile fasta: $outfile. $!\n");

Also look into the three-argument form of open.
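
For reference, the three-argument form of open mentioned above might look like the following. This is only a minimal sketch: the sub name copy_lines is made up for illustration and is not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $infile to $outfile line by line and
# return the number of lines copied. Nothing here is GenBank-specific.
sub copy_lines {
    my ($infile, $outfile) = @_;

    # Three-argument open keeps the mode ('<', '>', '>>') separate from
    # the file name, so odd characters in the name cannot change the mode.
    # Lexical filehandles ($in, $out) replace the global INFILE/OUTFILE.
    open(my $in,  '<', $infile)  or die "Cannot open '$infile' for reading: $!\n";
    open(my $out, '>', $outfile) or die "Cannot open '$outfile' for writing: $!\n";

    my $count = 0;
    while (my $line = <$in>) {
        print {$out} $line;
        $count++;
    }

    close $in;
    close $out or die "Cannot close '$outfile': $!\n";
    return $count;
}
```

Called as copy_lines($ARGV[0], "$ARGV[0].final"), this would mirror the file naming used in the first post.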


Oct 22, 2012, 1:34 PM

Post #3 of 9 (5122 views)
Re: [wickedxter] filter huge genbank file by organism

Thanks for your reply :)

but it didn't work :(

Chris Charley

Oct 22, 2012, 3:55 PM

Post #4 of 9 (5114 views)
Re: [andreiareis] filter huge genbank file by organism

I see some errors in your program, but it would be helpful if you could attach a file (part of your input file) with lines in it that would match LOCUS, ORGANISM and Equinodermata.

Looking at your program, it is hard to determine what it is you wish to capture, save, etc.

Please try to describe what it is you want to achieve. :-)
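
One likely reason the output stays empty can be shown in isolation. The sketch below assumes the hash is built the way ReadFile in the first post builds it (key = LOCUS line, value = record text) and uses a made-up, shortened record. FindSpecies tests the keys for ORGANISM, and the keys never contain that word.

```perl
use strict;
use warnings;

# Hypothetical, shortened record stored the way ReadFile stores it:
# key = the LOCUS line, value = the text of the whole record.
my %records = (
    'LOCUS       X00001' =>
          "LOCUS       X00001\n"
        . "  ORGANISM  Example species\n"
        . "            Eukaryota; Metazoa; Echinodermata.\n",
);

# grep in scalar context returns the number of matching elements.
my $key_hits   = grep { /ORGANISM/ } keys %records;
my $value_hits = grep { /ORGANISM/ } values %records;

print "keys matching ORGANISM:   $key_hits\n";
print "values matching ORGANISM: $value_hits\n";
```

The pattern only finds ORGANISM when it is applied to the values, so the test in FindSpecies probably needs to look at $hash_string->{$line} rather than at $line.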


Update: I think you are parsing a 'genbank' file. I've attached a sample genbank file. Is this what you are parsing and what do you want from it?

(This post was edited by Chris Charley on Oct 23, 2012, 7:29 AM)
Attachments: input.txt (8.68 KB)

Veteran / Moderator

Oct 23, 2012, 12:07 AM

Post #5 of 9 (5104 views)
Re: [Chris Charley] filter huge genbank file by organism

Yes, please provide a data sample.


Oct 23, 2012, 8:15 AM

Post #6 of 9 (5097 views)
Re: [Laurent_R] filter huge genbank file by organism

This is my test file.
Attachments: (98.2 KB)

Veteran / Moderator

Oct 23, 2012, 9:23 AM

Post #7 of 9 (5083 views)
Re: [andreiareis] filter huge genbank file by organism

Rather than manually parsing the file, I think it would be better to use one of the Bio:: modules designed for parsing this file format.

I'm not a biologist and I've never worked with any of the related modules, but the Bio::GenBankParser module looks like it might be a good starting point.

Here's my initial test.


use strict;
use warnings;
use Bio::GenBankParser;
use Data::Dumper;

my $parser = Bio::GenBankParser->new( file => '');

while ( my $seq = $parser->next_seq ) {
    print Dumper $seq;
}

and its output

[rkb@099-91-RKB01 ~]$ ./  

ERROR (line 131): Invalid section: Was expecting commented line, or
header, or locus, or dbsource, or definition, or
accession line, or project line, or version line, or
keywords, or source line, or organism, or reference, or
features, or base count, or contig, or origin, or
comment, or record delimiter

ERROR (line 131): Invalid startrule: Was expecting eofile but found
COMP" instead

So, either that wasn't the best module choice, or there is a problem with the format of the file.

You may want to look over some of the other module choices.

Since you mentioned fasta file in your die statement, you might want to look over some of those modules.

You may also want to look at the site


Oct 23, 2012, 9:23 AM

Post #8 of 9 (5082 views)
Re: [andreiareis] filter huge genbank file by organism

LOCUS NM_001204213 10781 bp mRNA linear PRI 21-OCT-2012
DEFINITION Homo sapiens nitric oxide synthase 1 (neuronal) (NOS1), transcript
variant 3, mRNA.
ACCESSION NM_001204213
VERSION NM_001204213.1 GI:323635430
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Equinodermata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 10781)
AUTHORS Wang,J., Ma,X.H., Xiang,B., Wu,J.Y., Wang,Y.C., Deng,W., Li,M.L.,
Wang,Q., He,Z.L. and Li,T.
TITLE [Association study of NOS1 gene polymorphisms and schizophrenia]
JOURNAL Zhonghua Yi Xue Yi Chuan Xue Za Zhi 29 (4), 459-463 (2012)
PUBMED 22875507
REMARK GeneRIF: One NOS1 SNP (rs1520811) was found to be associated with
REFERENCE 2 (bases 1 to 10781)
AUTHORS Bielau,H., Brisch,R., Bernard-Mittelstaedt,J., Dobrowolny,H.,
Gos,T., Baumann,B., Mawrin,C., Bernstein,H.G., Bogerts,B. and
TITLE Immunohistochemical evidence for impaired nitric oxide signaling of
the locus coeruleus in bipolar disorder
JOURNAL Brain Res. 1459, 91-99 (2012)
PUBMED 22560594
REMARK GeneRIF: The current data on nNOS suggest a dysregulation of the
nitrergic system in bipolar disorder.
REFERENCE 3 (bases 1 to 10781)
AUTHORS Kwan,K.Y., Lam,M.M., Johnson,M.B., Dube,U., Shim,S., Rasin,M.R.,
Sousa,A.M., Fertuzinhos,S., Chen,J.G., Arellano,J.I., Chan,D.W.,
Pletikos,M., Vasung,L., Rowitch,D.H., Huang,E.J., Schwartz,M.L.,
Willemsen,R., Oostra,B.A., Rakic,P., Heffer,M., Kostovic,I.,
Judas,M. and Sestan,N.
TITLE Species-dependent posttranscriptional regulation of NOS1 by FMRP in
the developing cerebral cortex
JOURNAL Cell 149 (4), 899-911 (2012)
PUBMED 22579290
REMARK GeneRIF: Study identifies a species-dependent posttranscriptional
regulation of human NOS1 by FMRP in specific neocortical circuits
during column development and synaptogenesis, and showed it to be
altered in Fragile X syndrome.
REFERENCE 4 (bases 1 to 10781)
AUTHORS Boissel,J.P., Zelenka,M., Godtel-Armbrust,U., Feuerstein,T.J. and
TITLE Transcription of different exons 1 of the human neuronal nitric
oxide synthase gene is dynamically regulated in a cell- and
stimulus-specific manner
JOURNAL Biol. Chem. 384 (3), 351-362 (2003)
PUBMED 12715886
REFERENCE 5 (bases 1 to 10781)
AUTHORS Larsson,B. and Phillips,S.C.
TITLE Isolation and characterization of a novel, human neuronal nitric
oxide synthase cDNA
JOURNAL Biochem. Biophys. Res. Commun. 251 (3), 898-902 (1998)
PUBMED 9791007
REFERENCE 6 (bases 1 to 10781)
AUTHORS Wang,Y., Goligorsky,M.S., Lin,M., Wilcox,J.N. and Marsden,P.A.
TITLE A novel, testis-specific mRNA transcript encoding an NH2-terminal
truncated nitric-oxide synthase
JOURNAL J. Biol. Chem. 272 (17), 11392-11401 (1997)
PUBMED 9111048
REFERENCE 7 (bases 1 to 10781)
AUTHORS Chen,P.F., Tsai,A.L. and Wu,K.K.
TITLE Cysteine 99 of endothelial nitric oxide synthase (NOS-III) is
critical for tetrahydrobiopterin-dependent NOS-III stability and
JOURNAL Biochem. Biophys. Res. Commun. 215 (3), 1119-1129 (1995)
PUBMED 7488039
REFERENCE 8 (bases 1 to 10781)
AUTHORS Kishimoto,J., Spurr,N., Liao,M., Lizhi,L., Emson,P. and Xu,W.
TITLE Localization of brain nitric oxide synthase (NOS) to human
chromosome 12
JOURNAL Genomics 14 (3), 802-804 (1992)
PUBMED 1385308
REMARK Erratum:[Genomics 1993 Feb;15(2):465]
REFERENCE 9 (bases 1 to 10781)
AUTHORS Lowenstein,C.J., Glatt,C.S., Bredt,D.S. and Snyder,S.H.
TITLE Cloned and expressed macrophage nitric oxide synthase contrasts
with the brain enzyme
JOURNAL Proc. Natl. Acad. Sci. U.S.A. 89 (15), 6711-6715 (1992)
PUBMED 1379716
REFERENCE 10 (bases 1 to 10781)
AUTHORS Bredt,D.S., Ferris,C.D. and Snyder,S.H.
TITLE Nitric oxide synthase regulatory sites. Phosphorylation by cyclic
AMP-dependent protein kinase, protein kinase C, and
calcium/calmodulin protein kinase; identification of flavin and
calmodulin binding sites
JOURNAL J. Biol. Chem. 267 (16), 10976-10981 (1992)
PUBMED 1375933
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from AC068799.14, AK294435.1,
U17327.1, AK307481.1, BC033208.1, AC026364.36 and BE207961.1.

Summary: The protein encoded by this gene belongs to the family of
nitric oxide synthases, which synthesize nitric oxide from
L-arginine. Nitric oxide is a reactive free radical, which acts as
a biologic mediator in several processes, including
neurotransmission, and antimicrobial and antitumoral activities. In
the brain and peripheral nervous system, nitric oxide displays many
properties of a neurotransmitter, and has been implicated in
neurotoxicity associated with stroke and neurodegenerative
diseases, neural regulation of smooth muscle, including
peristalsis, and penile erection. This protein is ubiquitously
expressed, with high level of expression in skeletal muscle.
Multiple transcript variants that differ in the 5' UTR have been
described for this gene but the full-length nature of these
transcripts is not known. Additionally, alternatively spliced
transcript variants encoding different isoforms (some
testis-specific) have been found for this gene.[provided by RefSeq,
Feb 2011].

Transcript Variant: This variant (3, also known as TnNOS) contains
2 unique alternate exons (Tex 1 and Tex 2) at the 5' end compared
to variant 1, resulting in translation initiation from a downstream
AUG, and an isoform (isoform 3, also known as nNOSgamma) with a
shorter N-terminus compared to isoform 1. This variant is
specifically expressed in the testis, and the encoded isoform has
catalytic activity (PMID:9111048). Variants 3 and 4 encode the same

Sequence Note: This RefSeq record was created from transcript and
genomic sequence data to make the sequence consistent with the
reference genome assembly. The genomic coordinates used for the
transcript record were based on transcript alignments.

Publication Note: This RefSeq record includes a subset of the
publications that are available for this gene. Please see the Gene
record to access additional publications.
COMPLETENESS: complete on the 3' end.
1-99 AC068799.14 78066-78164
100-155 AC068799.14 81047-81102
156-900 AK294435.1 1368-2112
901-1504 U17327.1 2283-2886
1505-2910 AK307481.1 2913-4318
2911-5345 U17327.1 4293-6727
5346-5843 BC033208.1 2019-2516
5844-10525 AC026364.36 33532-38213 c
10526-10781 BE207961.1 1-256 c
FEATURES Location/Qualifiers
source 1..10781
/organism="Homo sapiens"
gene 1..10781
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
/note="nitric oxide synthase 1 (neuronal)"
exon 1..99
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 100..155
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
misc_feature 150..152
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
/note="upstream in-frame stop codon"
exon 156..284
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 285..430
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
CDS 312..3608
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
/note="isoform 3 is encoded by transcript variant 3; NOS
type I; neuronal NOS; constitutive NOS; peptidyl-cysteine
S-nitrosylase NOS1"
/product="nitric oxide synthase, brain isoform 3"
exon 431..593
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 594..685
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 686..827
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 828..967
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 968..1142
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 1143..1244
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 1245..1439
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 1440..1525
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 1526..1670
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 1671..1775
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 1776..1834
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 1835..1951
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 1952..2126
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 2127..2265
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 2266..2344
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 2345..2538
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 2539..2708
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 2709..2919
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 2920..3007
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 3008..3129
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 3130..3278
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 3279..3473
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 3474..3592
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
exon 3593..10775
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
STS 3659..3870
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
STS 3984..4141
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
STS 7815..7898
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
STS 8775..10103
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
STS 8801..8903
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
STS 8821..9938
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
STS 9634..9790
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
STS 10003..10091
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
STS 10552..10675
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
polyA_signal 10758..10763
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
polyA_site 10775
/gene_synonym="bNOS; IHPS1; N-NOS; NC-NOS; nNOS; NOS"
1 gataggtggg ggttgagaaa tggctgggca gggcagcaaa gcaactgcca aggactgggc
61 aaaaggcaat agaatgcaat tgaagcagga cgaatgcaga tgaggaaact gagaatcaca
121 gagggttttg gtgctcaacg agggtcacat aaccaccccc cacctcagga aaacagtccc
181 ccacaaagaa tggcagcccc tccaagtgtc cacgcttcct caaggtcaag aactgggaga
241 ctgaggtggt tctcactgac accctccacc ttaagagcac attggaaacg ggatgcactg
301 agtacatctg catgggctcc atcatgcatc cttctcagca tgcaaggagg cctgaagacg
361 tccgcacaaa aggacagctc ttccctctcg ccaaagagtt tattgatcaa tactattcat
421 caattaaaag atttggctcc aaagcccaca tggaaaggct ggaagaggtg aacaaagaga
481 tcgacaccac tagcacttac cagctcaagg acacagagct catctatggg gccaagcacg
541 cctggcggaa tgcctcgcgc tgtgtgggca ggatccagtg gtccaagctg caggtattcg
601 atgcccgtga ctgcaccacg gcccacggga tgttcaacta catctgtaac catgtcaagt
661 atgccaccaa caaagggaac ctcaggtctg ccatcaccat attcccccag aggacagacg
721 gcaagcacga cttccgagtc tggaactccc agctcatccg ctacgctggc tacaagcagc
781 ctgacggctc caccctgggg gacccagcca atgtgcagtt cacagagata tgcatacagc
841 agggctggaa accgcctaga ggccgcttcg atgtcctgcc gctcctgctt caggccaacg
901 gcaatgaccc tgagctcttc cagattcctc cagagctggt gttggaagtt cccatcaggc
961 accccaagtt tgagtggttc aaggacctgg ggctgaagtg gtacggcctc cccgccgtgt
1021 ccaacatgct cctagagatt ggcggcctgg agttcagcgc ctgtcccttc agtggctggt
1081 acatgggcac agagattggt gtccgcgact actgtgacaa ctcccgctac aatatcctgg
1141 aggaagtggc caagaagatg aacttagaca tgaggaagac gtcctccctg tggaaggacc
1201 aggcgctggt ggagatcaat atcgcggttc tctatagctt ccagagtgac aaagtgacca
1261 ttgttgacca tcactccgcc accgagtcct tcattaagca catggagaat gagtaccgct
1321 gccggggggg ctgccctgcc gactgggtgt ggatcgtgcc ccccatgtcc ggaagcatca
1381 cccctgtgtt ccaccaggag atgctcaact accggctcac cccctccttc gaataccagc
1441 ctgatccctg gaacacgcat gtctggaaag gcaccaacgg gacccccaca aagcggcgag
1501 ccattggctt caagaagcta gcagaagctg tcaagttctc ggccaagctg atggggcagg
1561 ctatggccaa gagggtgaaa gcgaccatcc tctatgccac agagacaggc aaatcgcaag
1621 cttatgccaa gaccttgtgt gagatcttca aacacgcctt tgatgccaag gtgatgtcca
1681 tggaagaata tgacattgtg cacctggaac atgaaactct ggtccttgtg gtcaccagca
1741 cctttggcaa tggagatccc cctgagaatg gggagaaatt cggctgtgct ttgatggaaa
1801 tgaggcaccc caactctgtg caggaagaaa ggaagagcta caaggtccga ttcaacagcg
1861 tctcctccta ctctgactcc caaaaatcat caggcgatgg gcccgacctc agagacaact
1921 ttgagagtgc tggacccctg gccaatgtga ggttctcagt ttttggcctc ggctcacgag
1981 cataccctca cttttgcgcc ttcggacacg ctgtggacac cctcctggaa gaactgggag
2041 gggagaggat cctgaagatg agggaagggg atgagctctg tgggcaggaa gaggctttca
2101 ggacctgggc caagaaggtc ttcaaggcag cctgtgatgt cttctgtgtg ggagatgatg
2161 tcaacattga aaaggccaac aattccctca tcagcaatga tcgcagctgg aagagaaaca
2221 agttccgcct cacctttgtg gccgaagctc cagaactcac acaaggtcta tccaatgtcc
2281 acaaaaagcg agtctcagct gcccggctcc ttagccgtca aaacctccag agccctaaat
2341 ccagtcggtc aactatcttc gtgcgtctcc acaccaacgg gagccaggag ctgcagtacc
2401 agcctgggga ccacctgggt gtcttccctg gcaaccacga ggacctcgtg aatgccctga
2461 tcgagcggct ggaggacgcg ccgcctgtca accagatggt gaaagtggaa ctgctggagg
2521 agcggaacac ggctttaggt gtcatcagta actggacaga cgagctccgc ctcccgccct
2581 gcaccatctt ccaggccttc aagtactacc tggacatcac cacgccacca acgcctctgc
2641 agctgcagca gtttgcctcc ctagctacca gcgagaagga gaagcagcgt ctgctggtcc
2701 tcagcaaggg tttgcaggag tacgaggaat ggaaatgggg caagaacccc accatcgtgg
2761 aggtgctgga ggagttccca tctatccaga tgccggccac cctgctcctg acccagctgt
2821 ccctgctgca gccccgctac tattccatca gctcctcccc agacatgtac cctgatgaag
2881 tgcacctcac tgtggccatc gtttcctacc gcactcgaga tggagaagga ccaattcacc
2941 acggcgtatg ctcctcctgg ctcaaccgga tacaggctga cgaactggtc ccctgtttcg
3001 tgagaggagc acccagcttc cacctgcccc ggaaccccca agtcccctgc atcctcgttg
3061 gaccaggcac cggcattgcc cctttccgaa gcttctggca acagcggcaa tttgatatcc
3121 aacacaaagg aatgaacccc tgccccatgg tcctggtctt cgggtgccgg caatccaaga
3181 tagatcatat ctacagggaa gagaccctgc aggccaagaa caagggggtc ttcagagagc
3241 tgtacacggc ttactcccgg gagccagaca aaccaaagaa gtacgtgcag gacatcctgc
3301 aggagcagct ggcggagtct gtgtaccgag ccctgaagga gcaagggggc cacatatacg
3361 tctgtgggga cgtcaccatg gctgctgatg tcctcaaagc catccagcgc atcatgaccc
3421 agcaggggaa gctctcggca gaggacgccg gcgtattcat cagccggatg agggatgaca
3481 accgatacca tgaggatatt tttggagtca ccctgcgaac gtacgaagtg accaaccgcc
3541 ttagatctga gtccattgcc ttcattgaag agagcaaaaa agacaccgat gaggttttca
3601 gctcctaact ggaccctctt gcccagccgg ctgcaagttt tgtaagcgcg gacagacact
3661 gctgaacctt tcctctggga ccccctgtgg ccctcgctct gcctcctgtc cttgtcgctg
3721 tgccctggtt tccctcctcg ggcttctcgc ccctcagtgg tttcctcggc cctcctgggt
3781 ttactccttg agttttcctg ctgcgatgca atgcttttct aatctgcagt ggctcttaca
3841 aaactctgtt cccactccct ctcttgccga caagggcaac tcacgggtgc atgaaaccac
3901 tggaacatgg ccgtcgctgt gggggttttt ttctctgggg ttcccctgga aaggctgcag
3961 gaactaggca caagctctct gagccagtcc ctcagccact gaagtccccc tttctccttt
4021 tttatgatga cattttggtt gtgcgtgcct gtgtgtgtgt gtgtgtgtgt gtgtgtgtgt
4081 gtgatgggcc aggtctctgt ccgtcctctt ccctgcacaa gtgtgtcgat cttagattgc
4141 cactgctttc attgaagacc ctcaatgcca agaaacgtgt ccctggccca tattaatccc
4201 tcgtgtgtcc ataattaggg tccacgccca tgtacctgaa acatttggaa gccccataat
4261 tgttctagtt agaaagggtt cagggcatgg ggagaggagt gggaaattga ttaaaggggc
4321 tgtctcccaa tgaaagaggc attcccagaa tttgctgcat ttagattttg ataccagtga
4381 gcagagccct catgtgacat gaacccatcc aatggattgt gcaaatcccc tccccaaacc
4441 cacccatacc agctagaatc acttgacttt gccacatcca ttgactgacc ccctcctcca
4501 gcaatagcat ccaaggggcc tggaagttat gttgttcaaa gaagcctggt ggcaataagg
4561 atcttcccac tttgccactg gatgactttg gatgggtcac ttgtcctcag tttttcctag
4621 tcataatgtc atacgaacct aaagaatatg aatggattaa atgttaaagc tttggtgcct
4681 ggaaacaata tcaagtaaca atatgattat tattttttta ttcccccaaa gcgggcttgc
4741 tgcttcaccc ttggggatga aataatggaa gctggttaaa gtggatgagg ttggaaagag
4801 ttgccataat gaggtcccac gtggcttctt cgataggagc cacaacttgg ggtgggaaga
4861 acttgtccct caggcttgtt gccctctgca gttgatctcc aaagttttaa acctgttaaa
4921 ttaattttga caaataagtt accctcaact cagatcaaaa atgggcagcc aagtcttcgg
4981 taggaattgg agccggtgta attcctccct aagaggcaac ctgttgaatt tactctctca
5041 gagtaaatgg tgggaaggga tccctttgta tactttttta aatactacaa attagtgtca
5101 ggcagttccc agaaagagac aagaaatcct agtggcctcc cagactgcag ggtccccaag
5161 gatggaaagg gaatgttctg ctggttctac cctgtttgtt gtgtcttgct atacagaaaa
5221 accacatttc ttttatatac tgtacgtggg catatcttgt tgttcagttt gggtgtctgc
5281 taaagaggaa gtgcactggc cctctttgaa agggctttac agtgggggca ccaagacccc
5341 aaagggccca ggccaggaga ctgttaaagt gaaaaggcaa tctatgactc accttgctct
5401 gccatccctg gcagccccca ccggtgtcct gttcctgcca catggagctt gacttcatgc
5461 cagctataat ctcccctgcc ttcctttaat cccaatttcc cctgctcact cttccacaga
5521 tataaagaac aaacacttag catcccacac tcaccccttc taatcctgaa gggaagccca
5581 ttctaaactc ctttcctgca aacccatttc cagctcctag tagctttcct cccaaaggct
5641 ttctttccaa tcctttatag ctttggagac gcctccccaa ttccccaggg aaggaaactg
5701 ttgtgtccaa tccccattaa agacaaattg atcagtgctt cccactccaa gtcaagcttt
5761 atgcaggaat gcttttccat cagggaataa atacttagaa gcgcttacaa ggtgccaggc
5821 acctcctttc tgcatgtgcc tgcctttcta gtagcagaca gatggaaaca ttgtctcatt
5881 ttgtcaagga gtccaaagaa atgattataa aaccaggatt catccttctt ctccagaaag
5941 attttttttt aagtaaacac ctttcaatcc ccaacacaag ctgcttcaca actccaggct
6001 agaaggcagg agagcgatct gatgtgtttc tttcatttgc cagaattcct gataccaaaa
6061 gcctctctct ctgttgagta acctctcaag gaccagagtg gagtccagat tgttaggctc
6121 agatcaaggg tggggaaata ctgccctctc gtggtggctt ttcatccagg cctcgtagcc
6181 aaccgtttaa gtgcaaaata gaattaagca atgggtaagc aaaatagggt tgacaagata
6241 tttgggggtt attcgggtta tggcccattt atttccctct tccccctgaa ttgaccagta
6301 gcagctccag ccccatttca caaaagtgag tttggccagg aggaatgaga cgtctcctga
6361 aataggaaca ccggaacatc atgctcacct gccatcacta tgcatccagt tcccacagct
6421 tgtgtcgtga aagagcagag agatgatgtt aaactccttg ggaggagaga gggcttcttt
6481 tggtttccct ggagtgagac agccaggtgt ctttcttttg cggggggaca cttcagaccc
6541 atcaatatgg aattttggga gccgacctga gtgcaaatcc taattctgcc cctgttggtg
6601 cagatggctg tgggcggctc acttgacctt ttagagtctg catacccacc tgtataacaa
6661 ggtggattga atgagacaat gcccacgaaa tgcccagtta cagtacctgg ttcaaaactt
6721 actgcatttt aatttttcac ttaacttata acatgtcttg cttctccagt gtgtggaagg
6781 caccgggcag tttgcagaga taagcaaaac acagttcctc tcgtgcagaa ggttagaatc
6841 tatttttttt tttgacagag tcttgctctg tcacccaggc tggcgtacag tggtacgatc
6901 tcagctcact gcatcctctg cctcccccag ttcaagtgat tcttctgcct cggcctcctg
6961 agtaactggg actacaggcg cctaccacca cgcccagcta agttttgtat ttttagtaga
7021 gtcagggttt caccatgttg gccaggctgg tcttgaattc ctgacctcaa atgatccacg
7081 cacctcagcc tcccaaagtg ctggattaca ggcatgagcc accacgccca gccaaaggtt
7141 ataatctgat ggagagagac acccgtcttg gaactgacat aaatttctgg ggtttgagaa
7201 atgggcggga tttcactggt agcttctgga aggtaagagt tgtccaggaa ttgggaagag
7261 tgagaggaaa ggcacggaca gggagcatgt aagataaatt gaggctggct ttggaaggct
7321 gaggagggtg agaaaaggtg ggctgggacc agaccgtggg gagaggtgag tggcattaca
7381 agaaatttag gctttattca gaaggcaaca gggagtccct aagaatgttt ttcaaaaagg
7441 gacattaagg cgattggagt tatacttgga aaagaaagtt ctggccacag tacagagcat
7501 ggcccgttga gctgttgggg gggttattgc tgcaaccaag gcttgagtga gggaagaggc
7561 ggatgtagtg ataaagagac tccaggaact gaatcagcgt acctggcacc ccatccattg
7621 tagagggtga gaataaagga gaaattaaag catcttgcag gctgggcgcg gtagctcatg
7681 tctgtaatcc cagcactttg ggaggccgag gtgggtgtat cagttgaggt caggagttgg
7741 agaccagtca gccagttagt agaaaccctg actctactaa gaaaatacaa aaattagctg
7801 ggcatggtgg catgcgcctg tagtctcagc tacctgggag gctgaggaag gaggatcgct
7861 tgagcccagg aggtggaggc tgcagtgagc caagattgta ccactgcact ccagcctggg
7921 tgacagagca agactcttat ctcaaaaaaa ataaaataaa ataaaataaa ataaaacatc
7981 ttgcccctag ctgagagaga ggtctctgaa gagcaggctc agggaaaaga tgagttttca
8041 gagctgatgt gatagtcagc ttctctggag tcaacagggt gaatccttcc caagtccagc
8101 catgcccaga tgcccggagg gaaaactgac ccccagccag tagacattgg ctaagaacac
8161 agaatcttct gaccaaacac gctttcagca gctgcctgct ctggactttg aaagaggtca
8221 ggtcttgccc taagctcaaa acaagtgaga ggtgtcctga cctagctcat agggcaaatg
8281 gtcctaatag gatgggcaat ccagatgcct gagccccttc actccgacag caccagcgcc
8341 taatgcagcc ttttcattct tgccattagg aaatctgtgg acttctagcc tgtgttttaa
8401 accagccatg tttccttgta tatttcccta cccgctgccc cacataccca gcatgccgct
8461 gtggccacca tgtcctcaaa gccttctgtc tgtatcagga atgtagtctg agactgccag
8521 gaagcaacaa ggagagagaa acactaacta gtcttccttt ataacccatt catactctct
8581 ggctgtcccc aaccttcata gtctcctgca tccaaatgtc ctctttggct caaaaagtag
8641 gccaggcatg gtggttcatg cctgtaatag cactttggga gactgaggtg ggaggatcac
8701 ttggggccag gagtttgaga ccagcttggg caacacagcg caatctcgtc tctactaaaa
8761 aaaaaaaaaa aaaaaaatta gctgggcatg atggcatgct cctgtggtcc cagctacttg
8821 ggaggctgag gcaggaggat cacttggtcc caggagtttg aggcgacagt gagctaggat
8881 cgcaccactg cactccagcc tgagtgacag agcaagaccc tgtctctaaa aaaaattaaa
8941 atgaaagacc aggtgctggg attaaggaaa cacaggtctg agggtctgag ggaaggggcc
9001 tgcctcccag ggagtcaaca tagatgttcc ccatgaacag ggatttgact ttggaggcca
9061 acctggcctg gcctctgccc tttatctcac actccctatc cttggcccac tgccagtccc
9121 tgccttgtgg caaaggggcc ccaaaagaaa agctgccctt ccccaaatgt aaggacccag
9181 gtacactttc acccgtggaa agcagtgtct gtcgagagtc tgtttcctat taatacttat
9241 caaagccatg tgcgagggag gtggtcagct gtcaatatgc cttagtatgt ttatatgagt
9301 ttgttttgtt ctaaaatacc caaacagttc tggtcaagcg gggctatgcc cgtctggccc
9361 aaaacacagt ccgttattaa cgagatggcc ctggcaggcg ggaacaaatc tgcctccatg
9421 cactgcttcc tgtagtcttt tagaaagtaa ctccaggaca tcgaagtgcc cagatttgac
9481 tcctaagttc taggagactg tagcgcaggg tctgtcaacc ttagcactat tggcatttgg
9541 ggctgggtaa ttctttcttg tgggggccgt cttgggtact gtaggaagct gagcagcatt
9601 cctggcctcc atccacaaga tacctgtagc agtgtcctgc caacggtaac aatcaagtat
9661 gtcatcagac attgcccaat gtccccaggg ggcaacaccc ctctcttgga cttcagggtc
9721 aagagaatct ctgctggcta ccccaggact tctcattata gatttcctgg agcacgcagc
9781 agaaactttg cctagcccag tggttgtttc cattatctgc tgccaaagtg ggatttgagg
9841 gtgtccgggg gagggggcat ggggagggca gtatgctttc aaaaacccct cccaggccag
9901 gcgtggtggc tcatgcctgt aatcacagga ctttgggagg ccgaggctgg cagatcactt
9961 gaggctggga gttagagacc aacctggcta acatggcaaa acctcgtctc tactaaaaat
10021 acaaaaatca gcccggcgtg gtggcgggca tctgtaatcc catctactcg ggaggctgag
10081 gcaggagaat tacttgaacc caggaggcag aggctgcagt gagccgagat ggcaccactg
10141 cactccagct tgttgacaga atgagaccct gtggaaaaaa aaaaaaaagc cctcccatgc
10201 cagaacagag gatggcagtc tgtttcaata agacactgtg tccttggtgt tggttctgat
10261 taagactcac tgagatccag tgctcttgag ctgggtctca gtcccctccc atgtcctgtg
10321 ctctgccgcc actgttttca ttgttgtgtt ctcgttgtga ttgttaagac tcacactcct
10381 ggctcagcag tggttttcca gaaggcccaa agagcggtgc cgggcacccc acgtcgcagt
10441 gtccgttccg ggcttgggaa gctggggagg tgggcagacc tggtcgcatc tcaccacaca
10501 cacacacaca cacacacaca cacacgctgt cagaaactcg gccgtccccc ctacctctga
10561 gctctcaatg ctgctaatct ctgccaagtg tccctgtgct ccagcacctt ccttgaagga
10621 ctgacgccca ccccacgctc tttgcgaggt tgtccaggct gtgtttgtcg catgctcttc
10681 ttctgtatag ttctcatctt ccaattttat gggattcaac aaaagcctat tatgcttgtt
10741 tgcattatgg ttacaatatt aaaaagtgga ttcaaaaaaa a

Chris Charley

Oct 23, 2012, 9:27 AM

Post #9 of 9 (5078 views)
Re: [andreiareis] filter huge genbank file by organism

OK, I see where the word 'Equinodermata' is in your file.

What info do you want to get from this (after the Equinodermata word)?

Update: I am attaching a sample program I wrote to parse a file in 'genbank' format. I'm sure you can use the general structure to parse your file, even if it's not exactly the same as the example I used.

Update 2: It might help to search BioPerl's group or ask your question there.
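
The attached program is not reproduced here. As a rough illustration of one possible structure, the sketch below reads the file one record at a time and keeps only records whose ORGANISM lineage mentions a given taxon. It assumes the standard GenBank flat-file layout (LOCUS starting in column 1, ORGANISM indented, lineage lines indented beneath it). The sub names record_matches and filter_genbank are hypothetical.

```perl
use strict;
use warnings;

# Return 1 if the ORGANISM block of $record mentions $taxon, else 0.
# The block runs from the ORGANISM keyword up to the next line that
# starts in column 1 (for example REFERENCE), or to the end of the record.
sub record_matches {
    my ($record, $taxon) = @_;
    my ($lineage) = $record =~ /^[ ]*ORGANISM\b(.*?)(?=^\S|\z)/ms
        or return 0;
    # \Q...\E quotes any regex metacharacters in the taxon name.
    return $lineage =~ /\b\Q$taxon\E\b/i ? 1 : 0;
}

# Copy matching records from filehandle $in to filehandle $out.
# Only one record is held in memory at a time, which matters for a
# huge input file. Returns the number of records written.
sub filter_genbank {
    my ($in, $out, $taxon) = @_;
    my $record  = '';
    my $written = 0;

    while (my $line = <$in>) {
        # A LOCUS line in column 1 starts a new record: flush the old one.
        if ($line =~ /^LOCUS\b/ && length $record) {
            if (record_matches($record, $taxon)) {
                print {$out} $record;
                $written++;
            }
            $record = '';
        }
        $record .= $line;
    }

    # Flush the last record in the file.
    if (length $record && record_matches($record, $taxon)) {
        print {$out} $record;
        $written++;
    }
    return $written;
}
```

With both files opened using three-argument open, a call such as filter_genbank($in, $out, 'Echinodermata') would write the matching records. Matching only inside the ORGANISM block avoids false hits when the taxon name appears in a TITLE or COMMENT line.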

(This post was edited by Chris Charley on Oct 24, 2012, 7:22 AM)
Attachments: (1.56 KB)

