text search in file

 



cmccabe1
Novice

Jul 9, 2013, 9:48 AM

Post #1 of 13 (689 views)
text search in file

I have a (test.vcf) file in which the INFO column is always 5th and looks like:



RS=199476396;RSPOS=985955;dbSNPBuildID=136;SSR=0;SAO=1;VP=0x050260000a01000002110100;GENEINFO=AGRN:375790;WGT=1;VC=SNV;PM;S3D;NSM;REF;OTHERKG;LSD;OM;CLNALLE=1;CLNHGVS=NC_000001.10:g.985955G>C;CLNSRC=OMIM Allelic Variant;CLNORIGIN=1;CLNSRCID=103320.0001;CLNSIG=5;CLNDSDB=GeneReviews:NCBI:OMIM:Orphanet;CLNDSDBID=NBK1168:C1850792:254300:590;CLNDBN=Myasthenia\x2c limb-girdle\x2c familial;CLNACC=RCV000019902.1

RS=207460006;RSPOS=1199489;dbSNPBuildID=136;SSR=0;SAO=0;VP=0x050060080001000002110100;GENEINFO=UBE2J2:118424;WGT=1;VC=SNV;PM;INT;OTHERKG;LSD;OM;CLNALLE=1;CLNHGVS=NC_000001.10:g.1199489G>A;CLNSRC=.;CLNORIGIN=2;CLNSRCID=.;CLNSIG=1;CLNDSDB=.;CLNDSDBID=.;CLNDBN=.;CLNACC=.

RS=144003672;RSPOS=1245104;dbSNPBuildID=134;SSR=0;SAO=2;VP=0x050268020a01000002100120;GENEINFO=ACAP3:116983|PUSL1:126789;WGT=1;VC=SNV;PM;PMC;S3D;NSM;REF;R5;OTHERKG;LSD;CLNALLE=1;CLNHGVS=NC_000001.10:g.1245104C>A;CLNSRC=.;CLNORIGIN=2;CLNSRCID=.;CLNSIG=1;CLNDSDB=.;CLNDSDBID=.;CLNDBN=.;CLNACC=.

RS=145324009;RSPOS=1469331;dbSNPBuildID=134;SSR=0;SAO=2;VP=0x050268000a01000002100120;GENEINFO=ATAD3A:55210;WGT=1;VC=SNV;PM;PMC;S3D;NSM;REF;OTHERKG;LSD;CLNALLE=1;CLNHGVS=NC_000001.10:g.1469331G>A;CLNSRC=.;CLNORIGIN=2;CLNSRCID=.;CLNSIG=1;CLNDSDB=.;CLNDSDBID=.;CLNDBN=.;CLNACC=.



Out of each line all that is needed is RS= up to the ;, GENEINFO= up to the ;, CLNHGVS= up to the ;, CLNSIG= up to the ;, CLNDBN= up to the ;. Is there a command that can do this? Thank you.


FishMonger
Veteran / Moderator

Jul 9, 2013, 10:05 AM

Post #2 of 13 (686 views)
Re: [cmccabe1] text search in file

Yes, you should use the split function along with an array slice to extract those 5 fields. Then, if needed, use the split function on those fields to separate the 'key'='value' pairs.

Here's an example using your data.

Code
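#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

# Paragraph mode: each read returns one blank-line-separated record.
$/ = '';

while (my $line = <DATA>) {
    # Split on ';' and keep the fields at positions 0, 6, 17, 21 and 24
    # (where RS, GENEINFO, CLNHGVS, CLNSIG and CLNDBN sit in the first record).
    my @fields = (split(/;/, $line))[0,6,17,21,24];
    print Dumper \@fields;
}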

__DATA__
RS=199476396;RSPOS=985955;dbSNPBuildID=136;SSR=0;SAO=1;VP=0x050260000a01000002110100;GENEINFO=AGRN:375790;WGT=1;VC=SNV;PM;S3D;NSM;REF;OTHERKG;LSD;OM;CLNALLE=1;CLNHGVS=NC_000001.10:g.985955G>C;CLNSRC=OMIM Allelic Variant;CLNORIGIN=1;CLNSRCID=103320.0001;CLNSIG=5;CLNDSDB=GeneReviews:NCBI:OMIM:Orphanet;CLNDSDBID=NBK1168:C1850792:254300:590;CLNDBN=Myasthenia\x2c limb-girdle\x2c familial;CLNACC=RCV000019902.1

RS=207460006;RSPOS=1199489;dbSNPBuildID=136;SSR=0;SAO=0;VP=0x050060080001000002110100;GENEINFO=UBE2J2:118424;WGT=1;VC=SNV;PM;INT;OTHERKG;LSD;OM;CLNALLE=1;CLNHGVS=NC_000001.10:g.1199489G>A;CLNSRC=.;CLNORIGIN=2;CLNSRCID=.;CLNSIG=1;CLNDSDB=.;CLNDSDBID=.;CLNDBN=.;CLNACC=.

RS=144003672;RSPOS=1245104;dbSNPBuildID=134;SSR=0;SAO=2;VP=0x050268020a01000002100120;GENEINFO=ACAP3:116983|PUSL1:126789;WGT=1;VC=SNV;PM;PMC;S3D;NSM;REF;R5;OTHERKG;LSD;CLNALLE=1;CLNHGVS=NC_000001.10:g.1245104C>A;CLNSRC=.;CLNORIGIN=2;CLNSRCID=.;CLNSIG=1;CLNDSDB=.;CLNDSDBID=.;CLNDBN=.;CLNACC=.

RS=145324009;RSPOS=1469331;dbSNPBuildID=134;SSR=0;SAO=2;VP=0x050268000a01000002100120;GENEINFO=ATAD3A:55210;WGT=1;VC=SNV;PM;PMC;S3D;NSM;REF;OTHERKG;LSD;CLNALLE=1;CLNHGVS=NC_000001.10:g.1469331G>A;CLNSRC=.;CLNORIGIN=2;CLNSRCID=.;CLNSIG=1;CLNDSDB=.;CLNDSDBID=.;CLNDBN=.;CLNACC=.


Outputs:

Code
$VAR1 = [
          'RS=199476396',
          'GENEINFO=AGRN:375790',
          'CLNHGVS=NC_000001.10:g.985955G>C',
          'CLNSIG=5',
          'CLNDBN=Myasthenia\\x2c limb-girdle\\x2c familial'
        ];
$VAR1 = [
          'RS=207460006',
          'GENEINFO=UBE2J2:118424',
          'CLNORIGIN=2',
          'CLNDSDBID=.',
          undef
        ];
$VAR1 = [
          'RS=144003672',
          'GENEINFO=ACAP3:116983|PUSL1:126789',
          'CLNALLE=1',
          'CLNSRCID=.',
          'CLNDSDBID=.'
        ];
$VAR1 = [
          'RS=145324009',
          'GENEINFO=ATAD3A:55210',
          'CLNHGVS=NC_000001.10:g.1469331G>A',
          'CLNSIG=1',
          'CLNDBN=.'
        ];
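
The second step mentioned in this post, splitting the extracted fields into 'key'='value' pairs, could look roughly like the sketch below. It is only an illustration: the @fields list is copied by hand from the first record's output, and the %info hash is a name made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The five fields of the first record, copied from the output above (illustrative input).
my @fields = (
    'RS=199476396',
    'GENEINFO=AGRN:375790',
    'CLNHGVS=NC_000001.10:g.985955G>C',
    'CLNSIG=5',
    'CLNDBN=Myasthenia\x2c limb-girdle\x2c familial',
);

my %info;    # hypothetical name: maps each key to its value
for my $field (@fields) {
    # A limit of 2 keeps any later '=' characters inside the value.
    my ($key, $value) = split /=/, $field, 2;
    $info{$key} = $value;
}

print "$_ => $info{$_}\n" for sort keys %info;
```

Passing a limit of 2 to split is a defensive choice here: none of the sample values contain an '=', but a value that did would otherwise be cut short.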



FishMonger
Veteran / Moderator

Jul 9, 2013, 10:09 AM

Post #3 of 13 (684 views)
Re: [cmccabe1] text search in file

I should point out that the format of your data needs to be consistent for this approach to work. If it's not, then you'll need to use one or more regexes to extract the data. As you can see from the output of my example, the format of the 2nd and 3rd records is not consistent with the other 2.
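
One way to spot such records before trusting the positional slice is to check that each position really holds the key you expect. The sketch below is an assumption-laden illustration, not part of the original example: mismatched_positions and %expected are hypothetical names, and the expected layout is simply taken from the first sample record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Key expected at each slice position, taken from the first sample record.
my %expected = (0 => 'RS', 6 => 'GENEINFO', 17 => 'CLNHGVS', 21 => 'CLNSIG', 24 => 'CLNDBN');

# Hypothetical helper: list the positions where a record does not hold the expected key.
sub mismatched_positions {
    my ($record) = @_;
    my @fields = split /;/, $record;
    return grep {
        my $field = $fields[$_];
        # A missing field or a field with a different key counts as a mismatch.
        !defined($field) || $field !~ /^\Q$expected{$_}\E=/;
    } sort { $a <=> $b } keys %expected;
}

# Second sample record from the question.
my $record = 'RS=207460006;RSPOS=1199489;dbSNPBuildID=136;SSR=0;SAO=0;VP=0x050060080001000002110100;GENEINFO=UBE2J2:118424;WGT=1;VC=SNV;PM;INT;OTHERKG;LSD;OM;CLNALLE=1;CLNHGVS=NC_000001.10:g.1199489G>A;CLNSRC=.;CLNORIGIN=2;CLNSRCID=.;CLNSIG=1;CLNDSDB=.;CLNDSDBID=.;CLNDBN=.;CLNACC=.';

my @bad = mismatched_positions($record);
print @bad ? "Positions that do not match: @bad\n" : "Record matches the expected layout\n";
```

For the second record this reports positions 17, 21 and 24, which are the three slots where the Dumper output shows CLNORIGIN, CLNDSDBID and undef instead of the wanted fields.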


(This post was edited by FishMonger on Jul 9, 2013, 10:10 AM)


cmccabe1
Novice

Jul 9, 2013, 10:17 AM

Post #4 of 13 (679 views)
Re: [FishMonger] text search in file

I am new to Perl. If I create a .pl file with the code in it, how can I run that script? What command do I use, or is there another way? Thank you.


FishMonger
Veteran / Moderator

Jul 9, 2013, 10:27 AM

Post #5 of 13 (675 views)
Re: [cmccabe1] text search in file

Are you on windows or a *nix system?

I'll assume you're on Windows. Think of this Perl script as if it were a batch script and execute it the same way.

For example, assuming the script is named example.pl
----------------------------------
c:\test>example.pl
or
c:\test>perl example.pl


cmccabe1
Novice

Jul 9, 2013, 10:41 AM

Post #6 of 13 (669 views)
Re: [FishMonger] text search in file

I am using Cygwin on a windows OS.

I created a vcfsplit.pl file with the code in it.

Thank you.


Laurent_R
Veteran / Moderator

Jul 9, 2013, 10:52 AM

Post #7 of 13 (665 views)
Re: [cmccabe1] text search in file

You can either change the file permissions to 755 (with the "chmod 755 example.pl" bash command) and then run it as:


Code
./example.pl


provided you have this "shebang" line at the top of your program:


Code
#!/usr/bin/perl


Or, even simpler:


Code
perl example.pl



cmccabe1
Novice

Jul 9, 2013, 11:57 AM

Post #8 of 13 (653 views)
Re: [Laurent_R] text search in file

I am getting this error:

Name "main::DATA" used only once: possible typo at ./splitvcf.pl line 9.
readline() on unopened filehandle DATA at ./splitvcf.pl line 9.

Thank you.


Laurent_R
Veteran / Moderator

Jul 9, 2013, 11:16 PM

Post #9 of 13 (641 views)
Re: [cmccabe1] text search in file

Did you copy the __DATA__ section that FishMonger used at the end of his example into your program?


cmccabe1
Novice

Jul 10, 2013, 6:22 AM

Post #10 of 13 (631 views)
Re: [Laurent_R] text search in file

I placed the __DATA__ section under the code:

Here is the error I get:
perl splitvcf.pl >output.txt
Can't open perl script "splitvcf.pl": No such file or directory

the splitvcf.pl is in the home directory of cygwin.

Thank you.



FishMonger
Veteran / Moderator

Jul 10, 2013, 7:13 AM

Post #11 of 13 (625 views)
Re: [cmccabe1] text search in file

I've never been much of a fan of cygwin.

Are you executing it from within the cygwin shell?

Are you in that same directory as the script when you execute it, or are you in some other directory?

Try specifying the full path.


cmccabe1
Novice

Jul 10, 2013, 7:27 AM

Post #12 of 13 (618 views)
Re: [FishMonger] text search in file

That was it; the command runs, but I only get output from the first record. Since the format may differ between records, as you suggested in your third post, is it possible to use RS= up to the ;, GENEINFO= up to the ;, CLNHGVS= up to the ;, CLNSIG= up to the ;, CLNDBN= up to the ; and, if that text is not there, leave it undefined or blank? Thank you for all your help.


FishMonger
Veteran / Moderator

Jul 10, 2013, 8:17 AM

Post #13 of 13 (614 views)
Re: [cmccabe1] text search in file

If your data format is inconsistent, then you'll need to use a regex. Please read over the info in the following perldocs.

perlrequick - Perl regular expressions quick start
http://perldoc.perl.org/5.16.2/perlrequick.html

perlretut - Perl regular expressions tutorial
http://perldoc.perl.org/5.16.2/perlretut.html

Once you read over that info, try to write a regex that extracts the info you need. Your first tries may not work but you can then post your efforts and we will help you to fix them.
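
As a starting point to compare your own attempts against, here is a minimal sketch of the kind of regex this could lead to. It is one possible approach, not the only one, and it makes some assumptions: the string passed in is just the INFO column, extract_by_name and @wanted are names invented for this example, and a key that is absent simply yields undef.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @wanted = qw(RS GENEINFO CLNHGVS CLNSIG CLNDBN);

# Hypothetical helper: look each key up by name, wherever it sits in the record.
sub extract_by_name {
    my ($record) = @_;
    my %found;
    for my $key (@wanted) {
        # (?:^|;) requires the key to start a field; [^;\n]* captures
        # everything up to the next ';' or the end of the line.
        my ($value) = $record =~ /(?:^|;)\Q$key\E=([^;\n]*)/;
        $found{$key} = $value;    # stays undef when the key is absent
    }
    return \%found;
}

# Third sample record from the question, one that the positional slice got wrong.
my $record = 'RS=144003672;RSPOS=1245104;dbSNPBuildID=134;SSR=0;SAO=2;VP=0x050268020a01000002100120;GENEINFO=ACAP3:116983|PUSL1:126789;WGT=1;VC=SNV;PM;PMC;S3D;NSM;REF;R5;OTHERKG;LSD;CLNALLE=1;CLNHGVS=NC_000001.10:g.1245104C>A;CLNSRC=.;CLNORIGIN=2;CLNSRCID=.;CLNSIG=1;CLNDSDB=.;CLNDSDBID=.;CLNDBN=.;CLNACC=.';

my $info = extract_by_name($record);
for my $key (@wanted) {
    # Print an empty string for keys that were not found.
    my $value = defined $info->{$key} ? $info->{$key} : '';
    print "$key=$value\n";
}
```

Because each key is matched by name rather than by position, extra or missing flags such as PMC or R5 earlier in the record no longer shift the result.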

 
 

