CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Beginner:
Extracted a list of sequences from a fasta file

 



alfumao
Novice


Nov 12, 2009, 4:09 AM

Post #1 of 7 (1914 views)
Extracted a list of sequences from a fasta file Can't Post

Hello everybody,

Iīm trying to extract a list of sequences from a local sequence file (these sequences are not online so I canīt try to retrieve them from remote databases).


-I have the list of names on a file called list.txt
(Example of name on the list BCANI_B0229 )

-I have a file called genome.fasta, where the sequences of the list are stores among thousands of other sequences.

>BCANI_B0229 |genewise 2.00
MRPRSLQWRISLWLGLGISMLWAIAALVTASQLSREMNVVFDSALEETAQRILPLAVLDITERESDDEMN
QRMATLRKHDEYFTYVVRNTAGEVLLQSHSADLSIFPPFTTMGFQTTPTHRIYFDAALQGAVNIAVAEPL
GARRGVAMQVLWRLALPLGLVIPLSLIGVWAIVRVSMAPLWAFRSDIEARGSGDLSPVTVNDLPSEVSPI
AQAVNRLMDKLKRALEAERSFTANSAHELRTPIAAALAQTQRLIAETQDKAARDRGEQIEAALHRLSRLS
EKLMQLAKAEGGRLHAAEPIDMGVILRMVVRELSQPHGNENRIRLSIPDMAVKIDIDADAFAILARNLIE
NALKHGPHDEPVEVLLTASGKLVVSNGGPAVPPDILERLSHRFERGATKAEGTGLGLAIAKAIAAGTGGT
IDLFSPREGRKDGFEVRFTPGGRVLH
>BCANI_B0475 |fshgen1.25
MRLPLPHSLPIWLLTILISGLLTTQIATLWIVSQDRADANNALELFRLSERATYLVKLLHATPPDDWNNL
AIRVSGSGEAMNITSEPDVTTALASENDLAELEDVLVARLTRYGVVDARIRRDTTGYRLPKEETAVHDST
EMGIIEKQIDSLAATTRSGASLTTSLQFNDGRWLNFITHITPIDPIVTSDTIPLFTLVALCVILGAIWAT
QRLIAPYRILEDAIDRIGGDLKSPPLPETGIREYTAAARAVNSMQARLLDYVAEREQLAAALAHDLRTPL
TRMKLRMELLDDDALRQSLSRDLNDIEAISRSVIDFATSELAHEKAERLDLWSLLLSVADKYPQVSLDEK
GSDYRDAICLCQPISIQRCITNLIDNALAYGGKVRLSLHRDGDDLLLRIKDNGPGISPERIEEMFKPFSR
ADKSRNRESGGFGLGLTIARNIARKNGGEISLRNDPAGGLIAELRLPGSRLMPR
...
X (n thousand times)

-All the sequences are in the same FASTA format (in this format each sequence has its own header begining with a ">" symbol and a name composed by letters or numbers.
After this header you find the sequence of letters that compose the sequence itself.)

Iīve been told to do that using Bioperl, but Iīve read somewhere that perl but be faster and less complicated to perform this task.
I guess a pattern match is required to determine when the sequence begins (^>) and to match the name from the list after the ">" symbol. Then I need to retrieve the sequence that is related to that name.

Looking on the internet I found a one line script:

perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID = <idfile>; chomp @ID; %ID = map {($_, 1)} @ID;} $inmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat mybigfile.fa

But Iīm Unable to run it (and even to understand it).

Can some body help me with this problem.

Yours gratefully

Alvaro


shawnhcorey
Enthusiast


Nov 12, 2009, 6:40 AM

Post #2 of 7 (1908 views)
Re: [alfumao] Extracted a list of sequences from a fasta file [In reply to] Can't Post

One-line Perl script are notoriously hard to understand. Try something like this:

Code
#!/usr/bin/perl 

use strict;
use warnings;

use Data::Dumper;

# Make Data::Dumper pretty
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent = 1;

# Set maximum depth for Data::Dumper, zero means unlimited
$Data::Dumper::Maxdepth = 0;

my %sequences = ();
my $name = '';

while( <DATA> ){
chomp;
if( m{ \A \> (.*) }msx ){
$name = $1;
}else{
$sequences{$name} .= $_;
}
}

print Dumper \%sequences;

__DATA__
>BCANI_B0229 |genewise 2.00
MRPRSLQWRISLWLGLGISMLWAIAALVTASQLSREMNVVFDSALEETAQRILPLAVLDITERESDDEMN
QRMATLRKHDEYFTYVVRNTAGEVLLQSHSADLSIFPPFTTMGFQTTPTHRIYFDAALQGAVNIAVAEPL
GARRGVAMQVLWRLALPLGLVIPLSLIGVWAIVRVSMAPLWAFRSDIEARGSGDLSPVTVNDLPSEVSPI
AQAVNRLMDKLKRALEAERSFTANSAHELRTPIAAALAQTQRLIAETQDKAARDRGEQIEAALHRLSRLS
EKLMQLAKAEGGRLHAAEPIDMGVILRMVVRELSQPHGNENRIRLSIPDMAVKIDIDADAFAILARNLIE
NALKHGPHDEPVEVLLTASGKLVVSNGGPAVPPDILERLSHRFERGATKAEGTGLGLAIAKAIAAGTGGT
IDLFSPREGRKDGFEVRFTPGGRVLH
>BCANI_B0475 |fshgen1.25
MRLPLPHSLPIWLLTILISGLLTTQIATLWIVSQDRADANNALELFRLSERATYLVKLLHATPPDDWNNL
AIRVSGSGEAMNITSEPDVTTALASENDLAELEDVLVARLTRYGVVDARIRRDTTGYRLPKEETAVHDST
EMGIIEKQIDSLAATTRSGASLTTSLQFNDGRWLNFITHITPIDPIVTSDTIPLFTLVALCVILGAIWAT
QRLIAPYRILEDAIDRIGGDLKSPPLPETGIREYTAAARAVNSMQARLLDYVAEREQLAAALAHDLRTPL
TRMKLRMELLDDDALRQSLSRDLNDIEAISRSVIDFATSELAHEKAERLDLWSLLLSVADKYPQVSLDEK
GSDYRDAICLCQPISIQRCITNLIDNALAYGGKVRLSLHRDGDDLLLRIKDNGPGISPERIEEMFKPFSR
ADKSRNRESGGFGLGLTIARNIARKNGGEISLRNDPAGGLIAELRLPGSRLMPR


__END__

I love Perl; it's the only language where you can bless your thingy.

Perl documentation is available at perldoc.perl.org. The list of standard modules and pragmatics is available in perlmodlib.

Get Markup Help. Please note the markup tag of "code".


alfumao
Novice


Nov 12, 2009, 7:09 AM

Post #3 of 7 (1906 views)
Re: [shawnhcorey] Extracted a list of sequences from a fasta file [In reply to] Can't Post

Dear shawnhcorey,

I tried your script adding the file names things as you can see here,

#!/c:/Perl -w

use strict;
use warnings;

use Data::Dumper;

# Make Data::Dumper pretty
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent = 1;

# Set maximum depth for Data::Dumper, zero means unlimited
$Data::Dumper::Maxdepth =1000000;

my %sequences = ('BME.fasta'); # this is the genomic file with all the sequences
my $name = 'list.txt'; #this is the list of sequence names I want to retrieve from the genomic big file

while( <DATA> ){
chomp;
if( m{ \A \> (.*) }msx ){
$name = $1;
}else{
$sequences{$name} .= $_;
}
}

print Dumper \%sequences;
__DATA__
(You wrote the sequences under _DATA_, I hope this doesnīt mean I have to do the same because this would mean to write several Megabytes...)


I donīt get anything back...Maybe I forgot to say that I need to retrieve the sequences Iīm searching for, in a file I can use later, (to use that data later performing genomic analysis).
By the way, I don't know if this is normal, but instead of printing messages on the screen, my perl interpreter just pops for less than a second showing some kind of message I cannot read...
So, if the retrieved sequences should appear on the screen, I just can't see anything...May my perl be installed wrong also???

Thank you very much anyway, and sorry for not being able to explain myself better the first time...

Best Regards


shawnhcorey
Enthusiast


Nov 12, 2009, 7:17 AM

Post #4 of 7 (1905 views)
Re: [alfumao] Extracted a list of sequences from a fasta file [In reply to] Can't Post

Try:

Code
#!/usr/bin/perl 

use strict;
use warnings;

use Data::Dumper;

# Make Data::Dumper pretty
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent = 1;

# Set maximum depth for Data::Dumper, zero means unlimited
$Data::Dumper::Maxdepth = 0;

my $filename = 'BME.fasta';

open my $fasta_fh, '<', $filename or die "could not open $filename: $!\n";

my %sequences = ();
my $name = '';

while( <$fasta_fh> ){
chomp;
if( m{ \A \> (.*) }msx ){
$name = $1;
}else{
$sequences{$name} .= $_;
}
}
die "could not read $filename: $!\n" if $!;
close $fasta_fh or die "could not close $filename: $!\n";

print Dumper \%sequences;


__END__

I love Perl; it's the only language where you can bless your thingy.

Perl documentation is available at perldoc.perl.org. The list of standard modules and pragmatics is available in perlmodlib.

Get Markup Help. Please note the markup tag of "code".


alfumao
Novice


Nov 12, 2009, 7:44 AM

Post #5 of 7 (1903 views)
Re: [shawnhcorey] Extracted a list of sequences from a fasta file [In reply to] Can't Post

hello again shawnhcorey,

I tried your new script and it dumps all the Fasta file on the interpreter screen, but doesnīt give an output file with the listed sequences (the ones I need to extract from the whole fasta file).
I donīt really find the place to locate my list of sequence identities in your script, and if I donīt give that input, itīs impossible for the dumper to extract that particular set of sequences from the whole fasta...Do I explain myself? (English is no my first language as you can easily see and sometimes is hard for me to explain things the right way, I beg your pardon for that...)

Should we add this kind of line to your the script in order to get the output for later use...

open (OUTFILE, ">>extracted_sequences") || die ("cannot open outfile")

Thank you very much for your kind help and attention

Best Regards


shawnhcorey
Enthusiast


Nov 12, 2009, 8:40 AM

Post #6 of 7 (1899 views)
Re: [alfumao] Extracted a list of sequences from a fasta file [In reply to] Can't Post

You have stated how you want the output. How about providing an example?

__END__

I love Perl; it's the only language where you can bless your thingy.

Perl documentation is available at perldoc.perl.org. The list of standard modules and pragmatics is available in perlmodlib.

Get Markup Help. Please note the markup tag of "code".


alfumao
Novice


Nov 12, 2009, 9:18 AM

Post #7 of 7 (1897 views)
Re: [shawnhcorey] Extracted a list of sequences from a fasta file [In reply to] Can't Post

Hello again shawnhcorey ,

Iīll try to provide de example you told me about.

I have a file containing 5 thousand secuences with this format (FASTA):

>BCANI_B0229 |genewise 2.00
MRPRSLQWRISLWLGLGISMLWAIAALVTASQLSREMNVVFDSALEETAQRILPLAVLDITERESDDEMN
QRMATLRKHDEYFTYVVRNTAGEVLLQSHSADLSIFPPFTTMGFQTTPTHRIYFDAALQGAVNIAVAEPL
GARRGVAMQVLWRLALPLGLVIPLSLIGVWAIVRVSMAPLWAFRSDIEARGSGDLSPVTVNDLPSEVSPI
AQAVNRLMDKLKRALEAERSFTANSAHELRTPIAAALAQTQRLIAETQDKAARDRGEQIEAALHRLSRLS
EKLMQLAKAEGGRLHAAEPIDMGVILRMVVRELSQPHGNENRIRLSIPDMAVKIDIDADAFAILARNLIE
NALKHGPHDEPVEVLLTASGKLVVSNGGPAVPPDILERLSHRFERGATKAEGTGLGLAIAKAIAAGTGGT
IDLFSPREGRKDGFEVRFTPGGRVLH
>BCANI_B0475 |fshgen1.25
MRLPLPHSLPIWLLTILISGLLTTQIATLWIVSQDRADANNALELFRLSERATYLVKLLHATPPDDWNNL
AIRVSGSGEAMNITSEPDVTTALASENDLAELEDVLVARLTRYGVVDARIRRDTTGYRLPKEETAVHDST
EMGIIEKQIDSLAATTRSGASLTTSLQFNDGRWLNFITHITPIDPIVTSDTIPLFTLVALCVILGAIWAT
QRLIAPYRILEDAIDRIGGDLKSPPLPETGIREYTAAARAVNSMQARLLDYVAEREQLAAALAHDLRTPL
TRMKLRMELLDDDALRQSLSRDLNDIEAISRSVIDFATSELAHEKAERLDLWSLLLSVADKYPQVSLDEK
GSDYRDAICLCQPISIQRCITNLIDNALAYGGKVRLSLHRDGDDLLLRIKDNGPGISPERIEEMFKPFSR
ADKSRNRESGGFGLGLTIARNIARKNGGEISLRNDPAGGLIAELRLPGSRLMPR

(Imagine there are 5000 of these in a row)

I want to extract 150 particular sequences from that set. I have the names of those 150 in a single txt file:

BCANI_B0175
BCANI_B0229
BCANI_B1265
BCANI_B3422
..... (up to 150 names)

I think the program should:

-Open the Fasta (big sequences file),
-Open the list and read the 150 names (headers) into an array.
-Take each of the names in a loop to locate the sequence in the big file and extract each hit writting it on an "appending" output file, so that after the programs end the loop iterations with each of the 150 names from the list,( locating the sequence and then writting it to the outfile).
to get that its necessary to parse the search taking into account that each sequence begins with the ">" (^>) symbol followed by any number or letter, and ends with a carriage return (/n)
- As result we obtain a file where the sequences referred to the names on the list are stored in fasta format as they were on the big file.

Is that the explanation you asked me for?

Forgive me if I make waste time answering my dumb questions...I dont know if Iīll ever get to program any Perl... :$

Thank you for your patience ;)

Al

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives