How to extract protein names from sequence file

 



jtra00
Novice

Jan 24, 2012, 9:34 AM

Post #1 of 13 (10584 views)
How to extract protein names from sequence file

How do I retrieve only the header (protein name) from the protein sequence file below?
For example, for the first protein I would like to extract only rev_sp|P31946|
When I tried, I got everything, including the sequence.


file protein.txt

>rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3

NEGEGADGEDGQNESTWLTLNDRLLQMILTSDKYSEENLTDLEAIAEDFATKALSCAKEP

SNLIEYYFVSFNLALGLRIPHTPQMEKKSIEFAEQYAQQSNSVTTQKNDGSAVESLYRFY

DGKMKLYFVKSEPQTANPILYKDLLELVDNCIDQLEAEIKERYEKGMQQKKENRETKQEI

SSIVRWSSRRAGVVNKYAVSLLNREENSLEHGQETVAKMAAAMDDYREAQEALKAKQVLE

SKDMTM

>rev_sp|P31946-2|1433B_HUMAN Isoform Short of 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB

NEGEGADGEDGQNESTWLTLNDRLLQMILTSDKYSEENLTDLEAIAEDFATKALSCAKEP

SNLIEYYFVSFNLALGLRIPHTPQMEKKSIEFAEQYAQQSNSVTTQKNDGSAVESLYRFY

DGKMKLYFVKSEPQTANPILYKDLLELVDNCIDQLEAEIKERYEKGMQQKKENRETKQEI

SSIVRWSSRRAGVVNKYAVSLLNREENSLEHGQETVAKMAAAMDDYREAQEALKAKQVLE

SKDM

>rev_sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1

QNEDEVDQLAEKNQEEGDGQMDSTWLTLNDRLLQMILTSDKYSEESLTDLEAIADDFAAK

ALRCARDPSNLIEYYFVSFNLALGLRIPHTPPLETMAIDSAAKYAVLSNEAAEKRDNGTA

FEALYRHYDGKMKYYFVKSEGTNAAPILHKDLVDLIDCCILKLETEVMQRYERIMKLKDE

GGKNEEKQEISSIIRWSARRAGIVNKYAVSLLNREEVTLEVDMGAVKKMSEVMEDYREAQ

EALKAQYVLDERDDM

>rev_sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH PE=1 SV=4

NGEGAEEDQQDSTWLTLNDRLLQMILTSDKYSDENLTDLEAIADDFAQKALLCAQEPANQ

IEYYFVSFNLALGLRIPHTPQMQEKSIEFAEKYAAESAEVVSNKKEGSAVEALYRYYDGK

MKLYFVKSEYQFDNCNKILFKDLLSLVDNCVTELEKEIKERYAKVKELKKENGDAMTKQE

ISSIVRWSSRRAGVVNKYAVSLLNRDENSLPENLETVAKMASAMDDYREAQEALRARQLL

QERDGM

>rev_sp|P61981|1433G_HUMAN 14-3-3 protein gamma OS=Homo sapiens GN=YWHAG PE=1 SV=2



NNGEGGDDDQQDSTWLTLNDRLLQMILTSDKYSDENLTDLEAIADDFATKALHCAQEPAN

QIEYYFVSYNLALGLRIPHTPQMHEKSIEHAESYAKESSEVVTARKEGTAVEALYRYYDG

KMKLYFVKSEYQTESCNKILYNDLLSLVDQCVAELEKEIKERYARVMEIKKENGDASTKQ

EISSIVRWSSRRAGVVNKYAVSLLNREENSLPENLETVNKMAAAMDDYREAQEALRAKQV

LQERDVM


(This post was edited by jtra00 on Jan 24, 2012, 9:41 AM)


rovf
Veteran

Jan 24, 2012, 9:46 AM

Post #2 of 13 (10581 views)
Re: [jtra00] How to extract protein names from sequence file

We can't say what you did wrong if you don't show the part of the code that is supposed to extract the data.


jtra00
Novice

Jan 24, 2012, 9:54 AM

Post #3 of 13 (10580 views)
Re: [rovf] How to extract protein names from sequence file

This is what I have been trying recently; it gives errors.

my $infile = 'hdec1.csv';
my $outfile = 'listdec.txt';

open (FILEHANDLE, '<', $infile);


my @inlines;
@inlines = <FILEHANDLE>;
my @outlines;


close FILEHANDLE;

foreach my $line (@inlines) {

chomp($line);

$line =~ /cmgTrapLocation\.\d\s\=\srev\_\s"(\w+)"\s/m;

push(@outlines,$line);

my $outfile = join("\n",@outlines);

}
print OUT $outfile;
close OUT;

exit;


rovf
Veteran

Jan 24, 2012, 10:00 AM

Post #4 of 13 (10577 views)
Re: [jtra00] How to extract protein names from sequence file


Quote
This is what I have been trying recently; it gives errors.


If you have errors, you should also post the errors.

Also, your code is missing the following:


Code
use strict;
use warnings;


Please add these two lines and rerun the program. It will then make more sense to discuss it.


jtra00
Novice

Jan 24, 2012, 10:05 AM

Post #5 of 13 (10576 views)
Re: [rovf] How to extract protein names from sequence file

The error is
Print() on unopened filehandle OUT at line 30.


jtra00
Novice

Jan 24, 2012, 10:09 AM

Post #6 of 13 (10575 views)
Re: [jtra00] How to extract protein names from sequence file

By the way, thanks for suggesting use warnings; previously, the program would hang forever until I stopped it.


jtra00
Novice

Jan 24, 2012, 11:27 AM

Post #7 of 13 (10571 views)
Re: [jtra00] How to extract protein names from sequence file

A simple grep followed by a regular expression should have been enough. Oh well!

grep -e ">" old.fasta >new.fasta


jtra00
Novice

Jan 24, 2012, 11:29 AM

Post #8 of 13 (10570 views)
Re: [jtra00] How to extract protein names from sequence file

This is what I get. Now I need to format the output so that each > entry is on its own line.
Any ideas?
Thanks


jtra00
Novice

Jan 24, 2012, 11:30 AM

Post #9 of 13 (10569 views)
Re: [jtra00] How to extract protein names from sequence file

>rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
>rev_sp|P31946-2|1433B_HUMAN Isoform Short of 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB
>rev_sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
>rev_sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH PE=1 SV=4
>rev_sp|P61981|1433G_HUMAN 14-3-3 protein gamma OS=Homo sapiens GN=YWHAG PE=1 SV=2


histrung
Novice

Jan 24, 2012, 6:48 PM

Post #10 of 13 (10544 views)
Re: [jtra00] How to extract protein names from sequence file

Is this what you want? I used the input you showed in the first post.

Code
 
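#!/usr/bin/perl
use strict;
use warnings;

my $infile  = 'hdec1.csv';
my $outfile = 'listdec.txt';

# Read every line of the input file; stop if it cannot be opened.
open(my $in, '<', $infile) or die "Cannot open $infile: $!";
my @inlines = <$in>;
close($in);

# Capture the rev_sp|ACCESSION| part of each header line (without the leading >).
my @outlines;
foreach my $line (@inlines) {
    chomp($line);
    push(@outlines, $1) if $line =~ />(rev_sp\|.*?\|)/;
}

# Write one protein name per line.
my $protein = join("\n", @outlines);
open(my $out, '>', $outfile) or die "Cannot open $outfile: $!";
print $out $protein."\n";
close($out);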

Output
cat listdec.txt
rev_sp|P31946|
rev_sp|P31946-2|
rev_sp|P62258|
rev_sp|Q04917|
rev_sp|P61981|

Just egrep and sed:
egrep ">" hdec1.csv | sed -e 's/\(.\)\(.*|\)\(.*\)/\2/'
rev_sp|P31946|
rev_sp|P31946-2|
rev_sp|P62258|
rev_sp|Q04917|
rev_sp|P61981|



(This post was edited by histrung on Jan 24, 2012, 7:11 PM)


rovf
Veteran

Jan 25, 2012, 12:31 AM

Post #11 of 13 (10517 views)
Re: [jtra00] How to extract protein names from sequence file

The message "print on unopened filehandle OUT" means that you are trying to print to a file handle named OUT that was never opened.

You have to associate a file handle with a file, using open, before you can write to it.

See

perldoc -f open
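
As a minimal sketch of that point: the handle is opened first, the result of open is checked, and only then is anything printed to it. The file name listdec.txt comes from the code earlier in this thread; the helper name write_lines is made up here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: write each string in @lines to $path, one per line.
sub write_lines {
    my ($path, @lines) = @_;

    # Associate the handle with the file before printing to it;
    # die with the system error message if the file cannot be opened.
    open(my $out, '>', $path) or die "Cannot open $path for writing: $!";
    print {$out} "$_\n" for @lines;
    close($out) or die "Cannot close $path: $!";

    return scalar @lines;
}

write_lines('listdec.txt', 'rev_sp|P31946|', 'rev_sp|P31946-2|');
```

Using a lexical handle (my $out) rather than a bareword such as OUT also lets use strict catch a misspelled handle name at compile time.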


BillKSmith
Veteran

Jan 26, 2012, 7:22 AM

Post #12 of 13 (10468 views)
Re: [histrung] How to extract protein names from sequence file

Seems like a one-liner to me.


Code
  

perl -p -e"$_='' if(!/>(rev_sp\|.*?\|)/);" hdecl.txt >listdec.txt

Good Luck,
Bill
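
For comparison, the one-liner above blanks out non-matching lines and leaves each matching header line whole, since -p prints $_ unchanged when the pattern matches. If only the rev_sp|...| part is wanted, a minimal sketch could look like the following. The helper name protein_name is an assumption for illustration, and the @sample lines are copied from the first post rather than read from a file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the rev_sp|ACCESSION| part of a FASTA
# header line, or undef for sequence lines and blank lines.
sub protein_name {
    my ($line) = @_;
    return $line =~ /^>(rev_sp\|.*?\|)/ ? $1 : undef;
}

# Sample lines copied from the first post: two headers and one sequence line.
my @sample = (
    ">rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3\n",
    "NEGEGADGEDGQNESTWLTLNDRLLQMILTSDKYSEENLTDLEAIAEDFATKALSCAKEP\n",
    ">rev_sp|P31946-2|1433B_HUMAN Isoform Short of 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB\n",
);

# Keep only the lines that yielded a name, then print one name per line.
my @names = grep { defined } map { protein_name($_) } @sample;
print "$_\n" for @names;
```

This prints rev_sp|P31946| and rev_sp|P31946-2|. The non-greedy .*? stops at the first | after the accession, so nothing after the second | is captured.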


jtra00
Novice

Jan 27, 2012, 9:22 AM

Post #13 of 13 (10395 views)
Re: [BillKSmith] How to extract protein names from sequence file

Thank you very much, it worked.

 
 

