How to extract protein names from sequence file



jtra00
Novice

Jan 24, 2012, 9:34 AM



How do I retrieve only the header (protein name) from the protein sequence file below?
For example, for the first protein I would like to extract only rev_sp|P31946|
When I tried, I got everything, including the sequence.


file protein.txt

>rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
NEGEGADGEDGQNESTWLTLNDRLLQMILTSDKYSEENLTDLEAIAEDFATKALSCAKEP
SNLIEYYFVSFNLALGLRIPHTPQMEKKSIEFAEQYAQQSNSVTTQKNDGSAVESLYRFY
DGKMKLYFVKSEPQTANPILYKDLLELVDNCIDQLEAEIKERYEKGMQQKKENRETKQEI
SSIVRWSSRRAGVVNKYAVSLLNREENSLEHGQETVAKMAAAMDDYREAQEALKAKQVLE
SKDMTM
>rev_sp|P31946-2|1433B_HUMAN Isoform Short of 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB
NEGEGADGEDGQNESTWLTLNDRLLQMILTSDKYSEENLTDLEAIAEDFATKALSCAKEP
SNLIEYYFVSFNLALGLRIPHTPQMEKKSIEFAEQYAQQSNSVTTQKNDGSAVESLYRFY
DGKMKLYFVKSEPQTANPILYKDLLELVDNCIDQLEAEIKERYEKGMQQKKENRETKQEI
SSIVRWSSRRAGVVNKYAVSLLNREENSLEHGQETVAKMAAAMDDYREAQEALKAKQVLE
SKDM
>rev_sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
QNEDEVDQLAEKNQEEGDGQMDSTWLTLNDRLLQMILTSDKYSEESLTDLEAIADDFAAK
ALRCARDPSNLIEYYFVSFNLALGLRIPHTPPLETMAIDSAAKYAVLSNEAAEKRDNGTA
FEALYRHYDGKMKYYFVKSEGTNAAPILHKDLVDLIDCCILKLETEVMQRYERIMKLKDE
GGKNEEKQEISSIIRWSARRAGIVNKYAVSLLNREEVTLEVDMGAVKKMSEVMEDYREAQ
EALKAQYVLDERDDM
>rev_sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH PE=1 SV=4
NGEGAEEDQQDSTWLTLNDRLLQMILTSDKYSDENLTDLEAIADDFAQKALLCAQEPANQ
IEYYFVSFNLALGLRIPHTPQMQEKSIEFAEKYAAESAEVVSNKKEGSAVEALYRYYDGK
MKLYFVKSEYQFDNCNKILFKDLLSLVDNCVTELEKEIKERYAKVKELKKENGDAMTKQE
ISSIVRWSSRRAGVVNKYAVSLLNRDENSLPENLETVAKMASAMDDYREAQEALRARQLL
QERDGM
>rev_sp|P61981|1433G_HUMAN 14-3-3 protein gamma OS=Homo sapiens GN=YWHAG PE=1 SV=2
NNGEGGDDDQQDSTWLTLNDRLLQMILTSDKYSDENLTDLEAIADDFATKALHCAQEPAN
QIEYYFVSYNLALGLRIPHTPQMHEKSIEHAESYAKESSEVVTARKEGTAVEALYRYYDG
KMKLYFVKSEYQTESCNKILYNDLLSLVDQCVAELEKEIKERYARVMEIKKENGDASTKQ
EISSIVRWSSRRAGVVNKYAVSLLNREENSLPENLETVNKMAAAMDDYREAQEALRAKQV
LQERDVM


(This post was edited by jtra00 on Jan 24, 2012, 9:41 AM)


rovf
Veteran

Jan 24, 2012, 9:46 AM


Re: [jtra00] How to extract protein names from sequence file

We can't say what you did wrong if you don't show the part of the code that is supposed to extract the data.


jtra00
Novice

Jan 24, 2012, 9:54 AM


Re: [rovf] How to extract protein names from sequence file

This is what I have been trying recently, with errors.

my $infile = 'hdec1.csv';
my $outfile = 'listdec.txt';

open (FILEHANDLE, '<', $infile);


my @inlines;
@inlines = <FILEHANDLE>;
my @outlines;


close FILEHANDLE;

foreach my $line (@inlines) {

chomp($line);

$line =~ /cmgTrapLocation\.\d\s\=\srev\_\s"(\w+)"\s/m;

push(@outlines,$line);

my $outfile = join("\n",@outlines);

}
print OUT $outfile;
close OUT;

exit;


rovf
Veteran

Jan 24, 2012, 10:00 AM


Re: [jtra00] How to extract protein names from sequence file


Quote
This is what I have been trying recently, with errors.


If you have errors, you should also post the errors.

Also, your code is missing the following:


Code
use strict;
use warnings;


Please add these two lines and rerun the program. It will then make more sense to discuss it.
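
As a rough illustration of why these two lines help, the sketch below prints to a file handle that was never opened and collects the warning that use warnings produces. It is not part of the original program: the handle name OUT is borrowed from the code above, while @warnings and the $SIG{__WARN__} handler exist only for this demonstration.

```perl
use strict;
use warnings;

# Collect run-time warnings in an array instead of letting them go
# straight to STDERR, so the example can show what Perl reports.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# OUT has never been opened, so this print fails and, with
# "use warnings" in effect, triggers a warning.
my $ok = print OUT "hello\n";

print "print returned false\n" unless $ok;
print "warning: $_" for @warnings;
```

Without use warnings the failed print would go unnoticed unless its return value were checked.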


jtra00
Novice

Jan 24, 2012, 10:05 AM


Re: [rovf] How to extract protein names from sequence file

The error is:
print() on unopened filehandle OUT at line 30.


jtra00
Novice

Jan 24, 2012, 10:09 AM


Re: [jtra00] How to extract protein names from sequence file

By the way, thanks for suggesting use warnings. Previously the program would hang forever until I stopped it.


jtra00
Novice

Jan 24, 2012, 11:27 AM


Re: [jtra00] How to extract protein names from sequence file

A simple grep followed by a regular expression should have been okay. Oh well!

grep -e ">" old.fasta > new.fasta
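
For comparison, roughly the same filtering can be done in Perl itself. This is only a sketch: it assumes FASTA-style input in which header lines begin with >, and the sub name header_lines and the shortened sample data are made up for illustration. Unlike the grep command, which matches > anywhere on a line, the pattern here is anchored to the start of the line.

```perl
use strict;
use warnings;

# Return only the lines that start with '>' (the FASTA headers).
sub header_lines {
    my @lines = @_;
    return grep { /^>/ } @lines;
}

# Small made-up sample in the same layout as the file in the first post.
my @sample = (
    ">rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha\n",
    "NEGEGADGEDGQNESTWLTLNDRLLQMILTSDKYSEENLTDLEAIAEDFATKALSCAKEP\n",
    "\n",
    ">rev_sp|P62258|1433E_HUMAN 14-3-3 protein epsilon\n",
    "QNEDEVDQLAEKNQEEGDGQMDSTWLTLNDRLLQMILTSDKYSEESLTDLEAIADDFAAK\n",
);

# Each kept line still ends in a newline, so every header prints on its own line.
print header_lines(@sample);
```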


jtra00
Novice

Jan 24, 2012, 11:29 AM


Re: [jtra00] How to extract protein names from sequence file

This is what I get. Now I need to format the output so that each > header is on its own line.
Any ideas?
Thanks


jtra00
Novice

Jan 24, 2012, 11:30 AM


Re: [jtra00] How to extract protein names from sequence file

>rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
>rev_sp|P31946-2|1433B_HUMAN Isoform Short of 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB
>rev_sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
>rev_sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH PE=1 SV=4
>rev_sp|P61981|1433G_HUMAN 14-3-3 protein gamma OS=Homo sapiens GN=YWHAG PE=1 SV=2


histrung
Novice

Jan 24, 2012, 6:48 PM


Re: [jtra00] How to extract protein names from sequence file

Is this what you want? I used the input you showed in the first post.

Code
 
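#!/usr/bin/perl
use strict;
use warnings;

# Read the whole input file into memory.
my $infile = 'hdec1.csv';
open (FILEHANDLE, '<', $infile) or die "Cannot open $infile: $!";
my @inlines = <FILEHANDLE>;
close(FILEHANDLE);

# Keep only the ID part of each header line, e.g. rev_sp|P31946|
my @outlines;
foreach my $line (@inlines) {
    chomp($line);
    push(@outlines, $1) if ( $line =~ />(rev_sp\|.*?\|)/ );
}

# Write one ID per line to the output file.
my $protein = join("\n", @outlines);
my $outfile = 'listdec.txt';
open (OUT, '>', $outfile) or die "Cannot open $outfile: $!";
print OUT $protein."\n";
close OUT;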

Output
cat listdec.txt
rev_sp|P31946|
rev_sp|P31946-2|
rev_sp|P62258|
rev_sp|Q04917|
rev_sp|P61981|

Or with just egrep and sed:
egrep ">" hdec1.csv | sed -e 's/\(.\)\(.*|\)\(.*\)/\2/'
rev_sp|P31946|
rev_sp|P31946-2|
rev_sp|P62258|
rev_sp|Q04917|
rev_sp|P61981|
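
One difference between the two approaches is worth noting. The Perl pattern uses the non-greedy .*?\| and so stops at the first | after rev_sp|, while the sed expression uses the greedy .*| and runs to the last | on the line. For the headers in this thread the result is the same, but a header with a | later in its description would give different output. A small sketch, where the sub names are illustrative and the second header is invented purely to show the difference:

```perl
use strict;
use warnings;

# Non-greedy: stop at the first '|' after "rev_sp|".
sub id_nongreedy {
    my ($header) = @_;
    return $header =~ />(rev_sp\|.*?\|)/ ? $1 : undef;
}

# Greedy: run to the last '|' on the line, like the sed expression.
sub id_greedy {
    my ($header) = @_;
    return $header =~ />(.*\|)/ ? $1 : undef;
}

my $real     = '>rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha';
my $invented = '>rev_sp|P31946|1433B_HUMAN made-up text with a | in it';

for my $h ($real, $invented) {
    print "non-greedy: ", id_nongreedy($h), "\n";
    print "greedy:     ", id_greedy($h), "\n";
}
```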



(This post was edited by histrung on Jan 24, 2012, 7:11 PM)


rovf
Veteran

Jan 25, 2012, 12:31 AM


Re: [jtra00] How to extract protein names from sequence file

The message "print() on unopened filehandle OUT" means that you are trying to print to a file handle named OUT that was never opened.

You have to associate a file handle with a file before you can write to it.

See

perldoc -f open
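
A minimal sketch of that advice: open the handle, check that the open succeeded, and only then print. The file name listdec.txt comes from the original script; the sub name write_lines, the sample IDs and the use of a temporary directory are assumptions made for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Open the file for writing, print each line, then close it.
sub write_lines {
    my ($path, @lines) = @_;
    open(my $out, '>', $path) or die "Cannot open $path for writing: $!";
    print {$out} "$_\n" for @lines;
    close($out) or die "Cannot close $path: $!";
    return;
}

# A temporary directory keeps the sketch from overwriting a real file.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/listdec.txt";
write_lines($path, 'rev_sp|P31946|', 'rev_sp|P62258|');
print "wrote $path\n";
```

Checking the result of open with "or die" means a missing directory or a permissions problem is reported at the open, rather than later as an unopened-filehandle warning.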


BillKSmith
Veteran

Jan 26, 2012, 7:22 AM


Re: [histrung] How to extract protein names from sequence file

Seems like a one-liner to me.


Code
  

perl -p -e"$_='' if(!/>(rev_sp\|.*?\|)/);" hdecl.txt >listdec.txt

Good Luck,
Bill
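
Note that the one-liner blanks out the non-matching lines and prints each matching header line whole, description included. If only the ID part (for example rev_sp|P31946|) is wanted, a script along the following lines should do it. This is a sketch under stated assumptions: the sub name extract_ids and the in-memory sample are illustrative, and real use would open the actual input file instead.

```perl
use strict;
use warnings;

# Read lines from a file handle and return the ID part of each header.
sub extract_ids {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        push @ids, $1 if $line =~ /^>(rev_sp\|.*?\|)/;
    }
    return @ids;
}

# In-memory sample standing in for the real input file.
my $sample = <<'END';
>rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha
NEGEGADGEDGQNESTWLTLNDRLLQMILTSDKYSEENLTDLEAIAEDFATKALSCAKEP

>rev_sp|P31946-2|1433B_HUMAN Isoform Short of 14-3-3 protein beta/alpha
SKDM
END

open(my $fh, '<', \$sample) or die "Cannot open in-memory sample: $!";
print "$_\n" for extract_ids($fh);
close($fh);
```

Reading line by line avoids holding the whole file in memory, which can matter for large sequence files.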


jtra00
Novice

Jan 27, 2012, 9:22 AM


Re: [BillKSmith] How to extract protein names from sequence file

Thank you very much, it worked.