How to extract protein names from sequence file



jtra00
Novice

Jan 24, 2012, 9:34 AM


Views: 24624
How to extract protein names from sequence file

How do I retrieve only the header (protein name) from the protein sequence file below?
For example, for the first protein I would like to extract only rev_sp|P31946|
When I tried, I got everything, including the sequence.


file protein.txt

>rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3

NEGEGADGEDGQNESTWLTLNDRLLQMILTSDKYSEENLTDLEAIAEDFATKALSCAKEP

SNLIEYYFVSFNLALGLRIPHTPQMEKKSIEFAEQYAQQSNSVTTQKNDGSAVESLYRFY

DGKMKLYFVKSEPQTANPILYKDLLELVDNCIDQLEAEIKERYEKGMQQKKENRETKQEI

SSIVRWSSRRAGVVNKYAVSLLNREENSLEHGQETVAKMAAAMDDYREAQEALKAKQVLE

SKDMTM

>rev_sp|P31946-2|1433B_HUMAN Isoform Short of 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB

NEGEGADGEDGQNESTWLTLNDRLLQMILTSDKYSEENLTDLEAIAEDFATKALSCAKEP

SNLIEYYFVSFNLALGLRIPHTPQMEKKSIEFAEQYAQQSNSVTTQKNDGSAVESLYRFY

DGKMKLYFVKSEPQTANPILYKDLLELVDNCIDQLEAEIKERYEKGMQQKKENRETKQEI

SSIVRWSSRRAGVVNKYAVSLLNREENSLEHGQETVAKMAAAMDDYREAQEALKAKQVLE

SKDM

>rev_sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1

QNEDEVDQLAEKNQEEGDGQMDSTWLTLNDRLLQMILTSDKYSEESLTDLEAIADDFAAK

ALRCARDPSNLIEYYFVSFNLALGLRIPHTPPLETMAIDSAAKYAVLSNEAAEKRDNGTA

FEALYRHYDGKMKYYFVKSEGTNAAPILHKDLVDLIDCCILKLETEVMQRYERIMKLKDE

GGKNEEKQEISSIIRWSARRAGIVNKYAVSLLNREEVTLEVDMGAVKKMSEVMEDYREAQ

EALKAQYVLDERDDM

>rev_sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH PE=1 SV=4

NGEGAEEDQQDSTWLTLNDRLLQMILTSDKYSDENLTDLEAIADDFAQKALLCAQEPANQ

IEYYFVSFNLALGLRIPHTPQMQEKSIEFAEKYAAESAEVVSNKKEGSAVEALYRYYDGK

MKLYFVKSEYQFDNCNKILFKDLLSLVDNCVTELEKEIKERYAKVKELKKENGDAMTKQE

ISSIVRWSSRRAGVVNKYAVSLLNRDENSLPENLETVAKMASAMDDYREAQEALRARQLL

QERDGM

>rev_sp|P61981|1433G_HUMAN 14-3-3 protein gamma OS=Homo sapiens GN=YWHAG PE=1 SV=2



NNGEGGDDDQQDSTWLTLNDRLLQMILTSDKYSDENLTDLEAIADDFATKALHCAQEPAN

QIEYYFVSYNLALGLRIPHTPQMHEKSIEHAESYAKESSEVVTARKEGTAVEALYRYYDG

KMKLYFVKSEYQTESCNKILYNDLLSLVDQCVAELEKEIKERYARVMEIKKENGDASTKQ

EISSIVRWSSRRAGVVNKYAVSLLNREENSLPENLETVNKMAAAMDDYREAQEALRAKQV

LQERDVM


(This post was edited by jtra00 on Jan 24, 2012, 9:41 AM)


rovf
Veteran

Jan 24, 2012, 9:46 AM


Views: 24621
Re: [jtra00] How to extract protein names from sequence file

We can't say what you did wrong if you don't show the part of the code that is supposed to extract the data.


jtra00
Novice

Jan 24, 2012, 9:54 AM


Views: 24620
Re: [rovf] How to extract protein names from sequence file

This is what I have been trying recently, with errors.

my $infile = 'hdec1.csv';
my $outfile = 'listdec.txt';

open (FILEHANDLE, '<', $infile);


my @inlines;
@inlines = <FILEHANDLE>;
my @outlines;


close FILEHANDLE;

foreach my $line (@inlines) {

chomp($line);

$line =~ /cmgTrapLocation\.\d\s\=\srev\_\s"(\w+)"\s/m;

push(@outlines,$line);

my $outfile = join("\n",@outlines);

}
print OUT $outfile;
close OUT;

exit;


rovf
Veteran

Jan 24, 2012, 10:00 AM


Views: 24617
Re: [jtra00] How to extract protein names from sequence file


Quote
This is what I have been trying recently, with errors.


If you have errors, you should also post the errors.

Also, your code is missing


Code
use strict;
use warnings;


Please add these lines and rerun the program. Then it makes more sense to discuss it.
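
To illustrate why these two lines help, here is a minimal sketch (not the original poster's script) that prints to a handle named OUT without ever opening it, as the code posted above does. The @caught array and the __WARN__ handler are additions for illustration only; they collect the warnings so they can be shown on standard output.

```perl
use strict;
use warnings;

# Collect warnings in an array instead of letting them go to STDERR.
# (@caught and this handler are illustrative, not part of the original script.)
my @caught;
$SIG{__WARN__} = sub { push @caught, @_ };

# OUT is never opened, mirroring the script posted above.
print OUT "rev_sp|P31946|\n";
close OUT;

# Show what "use warnings" reported.
print "warning: $_" for @caught;
```

Without "use warnings", the print fails silently and nothing is written, which makes this kind of mistake hard to notice.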


jtra00
Novice

Jan 24, 2012, 10:05 AM


Views: 24616
Re: [rovf] How to extract protein names from sequence file

The error is:
print() on unopened filehandle OUT at line 30.


jtra00
Novice

Jan 24, 2012, 10:09 AM


Views: 24615
Re: [jtra00] How to extract protein names from sequence file

By the way, thanks for suggesting use warnings. Previously, the program would hang forever until I stopped it.


jtra00
Novice

Jan 24, 2012, 11:27 AM


Views: 24611
Re: [jtra00] How to extract protein names from sequence file

A simple grep followed by a regular expression should have been okay. Oh well!

grep -e ">" old.fasta > new.fasta


jtra00
Novice

Jan 24, 2012, 11:29 AM


Views: 24610
Re: [jtra00] How to extract protein names from sequence file

This is what I get. Now I need to format the output so that each > entry is on its own line.
Any ideas?
Thanks


jtra00
Novice

Jan 24, 2012, 11:30 AM


Views: 24609
Re: [jtra00] How to extract protein names from sequence file

>rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
>rev_sp|P31946-2|1433B_HUMAN Isoform Short of 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB
>rev_sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
>rev_sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH PE=1 SV=4
>rev_sp|P61981|1433G_HUMAN 14-3-3 protein gamma OS=Homo sapiens GN=YWHAG PE=1 SV=2


histrung
Novice

Jan 24, 2012, 6:48 PM


Views: 24584
Re: [jtra00] How to extract protein names from sequence file

Is this what you want? I used the input you showed in the first post.

Code
 
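#!/usr/bin/perl
use strict;
use warnings;

# Read every line of the input file into memory.
my $infile = 'hdec1.csv';
open(my $in, '<', $infile) or die "Cannot open $infile: $!";
my @inlines = <$in>;
close($in);

# Keep only the "rev_sp|ACCESSION|" prefix of each header line.
my @outlines;
foreach my $line (@inlines) {
    chomp($line);
    push(@outlines, $1) if $line =~ />(rev_sp\|.*?\|)/;
}

# Write one identifier per line.
my $outfile = 'listdec.txt';
open(my $out, '>', $outfile) or die "Cannot open $outfile: $!";
print {$out} join("\n", @outlines), "\n";
close($out) or die "Cannot close $outfile: $!";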

Output
cat listdec.txt
rev_sp|P31946|
rev_sp|P31946-2|
rev_sp|P62258|
rev_sp|Q04917|
rev_sp|P61981|

Just egrep and sed:
egrep ">" hdec1.csv | sed -e 's/\(.\)\(.*|\)\(.*\)/\2/'
rev_sp|P31946|
rev_sp|P31946-2|
rev_sp|P62258|
rev_sp|Q04917|
rev_sp|P61981|
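
The Perl pattern above uses a non-greedy .*? while the sed pattern uses a greedy .*, and both give the same result here only because these headers contain no pipe after the accession. A small sketch of the difference follows; the lazy_id and greedy_id helpers and the second header are made up for illustration and do not come from the thread's data.

```perl
use strict;
use warnings;

# Non-greedy: stops at the first pipe after "rev_sp|".
sub lazy_id {
    my ($header) = @_;
    my ($id) = $header =~ />(rev_sp\|.*?\|)/;
    return $id;
}

# Greedy, like the sed expression: skips one character, then runs to the last pipe.
sub greedy_id {
    my ($header) = @_;
    my ($id) = $header =~ /^.(.*\|)/;
    return $id;
}

# Real header from the first post.
my $real = '>rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3';
# Hypothetical header with an extra pipe in the description (not from the thread).
my $tricky = '>rev_sp|P31946|1433B_HUMAN made-up|description';

for my $header ($real, $tricky) {
    print "lazy:   ", lazy_id($header), "\n";
    print "greedy: ", greedy_id($header), "\n";
}
```

If headers could ever contain a pipe in the description, the non-greedy form is the safer of the two.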



(This post was edited by histrung on Jan 24, 2012, 7:11 PM)


rovf
Veteran

Jan 25, 2012, 12:31 AM


Views: 24557
Re: [jtra00] How to extract protein names from sequence file

The message "print() on unopened filehandle OUT" means that you are trying to print to a file handle named OUT that was never opened.

You have to associate a file handle with a file before you can write to it.

See

perldoc -f open
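
As a minimal sketch of that step, assuming the output file name listdec.txt used in the earlier posts and a single sample identifier, the handle is opened for writing before anything is printed to it:

```perl
use strict;
use warnings;

# Output name taken from the earlier posts; adjust as needed.
my $outfile = 'listdec.txt';

# Associate the handle with the file *before* printing to it,
# and stop with a message if the file cannot be opened.
open(my $out, '>', $outfile) or die "Cannot open $outfile for writing: $!";

# Sample identifier from the first post.
print {$out} "rev_sp|P31946|\n";

close($out) or die "Cannot close $outfile: $!";
```

A lexical handle (my $out) is used here instead of the bareword OUT; either works, but a lexical handle is limited to its scope and is closed automatically when it goes out of scope.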


BillKSmith
Veteran

Jan 26, 2012, 7:22 AM


Views: 24508
Re: [histrung] How to extract protein names from sequence file

Seems like a one-liner to me.


Code
  

perl -p -e"$_='' if(!/>(rev_sp\|.*?\|)/);" hdecl.txt >listdec.txt

Good Luck,
Bill
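
For readers who want to see what the one-liner does step by step, here is a rough sketch of the same filter as a script. It reads two sample records copied from the first post (sequence lines shortened) from an in-memory string rather than from a file; the @kept and @ids arrays are names chosen for illustration.

```perl
use strict;
use warnings;

# Two sample records from the first post; sequence lines are shortened.
my $sample = <<'FASTA';
>rev_sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
NEGEGADGEDGQNESTW
>rev_sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
QNEDEVDQLAEKNQEEG
FASTA

# Read the sample through an in-memory file handle.
open(my $in, '<', \$sample) or die "Cannot read sample data: $!";

my (@kept, @ids);
while (my $line = <$in>) {
    # The one-liner blanks every line that does not match, so only matches survive.
    next unless $line =~ />(rev_sp\|.*?\|)/;
    push @kept, $line;   # the whole header line, as the one-liner prints it
    push @ids,  $1;      # just the captured identifier
}
close($in);

print "Whole header lines:\n", @kept;
print "Identifiers only:\n";
print "$_\n" for @ids;
```

Note that the one-liner keeps the whole header line, not only the rev_sp|...| part. The double quotes around the code suit the Windows command prompt; in a Unix shell the code would need single quotes so that the shell does not expand $_ itself.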


jtra00
Novice

Jan 27, 2012, 9:22 AM


Views: 24435
Re: [BillKSmith] How to extract protein names from sequence file

Thank you very much, it worked.