File parsing problem, Use of uninitialised value error.

 



MB123
Novice

Nov 11, 2012, 10:18 AM

Post #1 of 31 (9403 views)
File parsing problem, Use of uninitialised value error. Can't Post

Hi all,

I have this code, partly written by myself.




Code
$cod{1}="Int"; 
$cod{2}="non";
$cod{3}="syn";
$cod{4}="stop";

$file="input.txt";
open IN, "$file";
open OUT, ">output.txt";
print OUT "Coordinate No of Strains AA Change\n";
while(<IN>){
if(m/^FT\s+SNP\s+(\d+)/){
$SNP=$1;
}elsif(m/^FT\s+\/note="(.*)"/){
$line=$1;
$count = ($line =~ tr/=/=/);
$line =~ m/\((AA \w+->\w+)\)\s*$/;
$change = $1 || "";
}elsif(m/^FT\s+\/colour=(\d+)/){
print OUT "$SNP $count $change\n" if $cod{$1} eq "non";
}
}


I want to use it on a text file in the format shown below-


Code
FT	SNP	27534 
FT /note="refAllele: T SNPstrains: 7564_8#80=C (non-synonymous) (AA Leu->Ser) "
FT /colour=2
FT SNP 27682
FT /note="refAllele: T SNPstrains: 7414_8#37=C (synonymous) "
FT /colour=3
FT SNP 27710
FT /note="refAllele: G SNPstrains: 7083_1#32=T (non-synonymous) (AA Val->Phe) 7521_5#41=T (non-synonymous) (AA Val->Phe) "
FT /colour=2
FT SNP 27771
FT /note="refAllele: A SNPstrains: 7480_8#28=G (non-synonymous) (AA His->Arg) "
FT /colour=2
FT SNP 28047
FT /note="refAllele: A SNPstrains: 7480_7#86=T (non-synonymous) (AA Lys->Ile) "
FT /colour=2
FT SNP 28490
FT /note="refAllele: G SNPstrains: 7083_1#4=T (non-synonymous) (AA Gly->Cys) 7554_6#38=T (non-synonymous) (AA Gly->Cys) "
FT /colour=2
FT SNP 28492
FT /note="refAllele: C SNPstrains: 7414_7#66=A (synonymous) 7414_8#44=A (synonymous) 7521_6#54=A (synonymous) "
FT /colour=3
FT SNP 28548
FT /note="refAllele: C SNPstrains: 7414_8#65=T (non-synonymous) (AA Ser->Leu) "
FT /colour=2
FT SNP 28787
FT /note="refAllele: G SNPstrains: 7414_7#14=A (non-synonymous) (AA Asp->Asn) "
FT /colour=2
FT SNP 28840
FT /note="refAllele: C SNPstrains: 7414_8#51=T (synonymous) 7414_8#71=T (synonymous) "
FT /colour=3
FT SNP 28941
FT /note="refAllele: A SNPstrains: 7083_1#1=G (non-synonymous) (AA Gln->Arg) "
FT /colour=2
FT SNP 29080
FT /note="refAllele: A SNPstrains: 7414_7#49=G (synonymous) 7521_6#39=G (synonymous) 7564_8#91=G (synonymous) 7712_8#14=G (synonymous) "
FT /colour=3
FT SNP 29214
FT /note="refAllele: T SNPstrains: 7554_6#36=C (non-synonymous) (AA Val->Ala) "
FT /colour=2
FT SNP 29574
FT /note="refAllele: C SNPstrains: 7065_8#73=T (non-synonymous) (AA Pro->Leu) "
FT /colour=2
FT SNP 29610
FT /note="refAllele: C SNPstrains: 7480_8#12=T "
FT /colour=1
FT SNP 29658
FT /note="refAllele: T SNPstrains: 7564_8#79=A "


My ideal output, as I'm sure you can see, would have the Mutation coordinate, e.g. 27534, the number of strains that show this mutation, and the amino acid change, e.g. AA Leu->Ser.

However, when I run the code I receive the following error:


Code
Use of uninitialized value ($count or $change) in concatenation (.) or string at Script.pl line 24, <IN> line XXXX.


The lines it refers to are any that contain this:


Code
FT		/colour=2


That is, those that follow a non-synonymous mutation line.

Any help would be greatly appreciated.

Many thanks,

MB


Laurent_R
Enthusiast / Moderator

Nov 11, 2012, 11:59 AM

Post #2 of 31 (9401 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

The error is probably in this line:

print OUT "$SNP $count $change\n" if $cod{$1} eq "non";

One (or more) of the four variables there is not initialized.

But this could be because one of the previous if/elsif branches did not match anything, in which case $count or $SNP might not be initialized.

Possibly $1 in the array subscript is not between 1 and 4.


MB123
Novice

Nov 11, 2012, 12:44 PM

Post #3 of 31 (9394 views)
Re: [Laurent_R] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

Thank you for your reply,

Forgive me if I am wrong, but array subscript refers to -

Code
$cod{1}="Int"; 
$cod{2}="non";
$cod{3}="syn";
$cod{4}="stop";


does it not? I was under the impression that these were matching to the numbers contained in the

Code
/colour=?


part of the file, in which case these are either 1, 2, 3 or 4.


Laurent_R
Enthusiast / Moderator

Nov 11, 2012, 2:08 PM

Post #4 of 31 (9386 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

When you have:


Code
print OUT "$SNP $count $change\n" if $cod{$1} eq "non"


$cod{$1} has a subscript of $1. And $1 refers to the first captured group of the last successful regular expression match, i.e. in this case the digits representing the colors. So, you are right.

In fact, looking again at the warning message:


Code
Use of uninitialized value ($count or $change) in concatenation (.) or string at Script.pl line 24, <IN> line XXXX.


the uninitialized value occurs earlier in that line, somewhere in the string "$SNP $count $change\n", which concatenates the three variables. So either $SNP, $count or $change is not initialized (or possibly two of them, or all three), most probably because one of the three prior regex matches did not work as expected (or some part of the data does not match the regex).

Having a look, I think this line may not work properly:


Code
$line =~ m/\((AA \w+->\w+)\)\s*$/;

as it does not match

Code
refAllele: G SNPstrains: 7083_1#32=T (non-synonymous) (AA Val->Phe) 7521_5#41=T (non-synonymous) (AA Val->Phe)

But then, maybe that's what you want.

Having a poor knowledge of genetics, I don't necessarily understand exactly your logic. I'll try to run your program on your data and get back to you, that will be far easier.


Laurent_R
Enthusiast / Moderator

Nov 11, 2012, 2:59 PM

Post #5 of 31 (9385 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

Hi,

I tried to run the code on your data and I did not get your warning message about uninitialized value.

The output file also seems OK to me:


Code
$ cat output.txt 
Coordinate No of Strains AA Change
27534 1 AA Leu->Ser
27710 2 AA Val->Phe
27771 1 AA His->Arg
28047 1 AA Lys->Ile
28490 2 AA Gly->Cys
28548 1 AA Ser->Leu
28787 1 AA Asp->Asn
28941 1 AA Gln->Arg
29214 1 AA Val->Ala
29574 1 AA Pro->Leu


So presumably your data is larger than what you showed us and somewhere else down the input file, some part of the data is not consistent with the data segment you gave us.

Your warning message says:


Code
Use of uninitialized value ($count or $change) in concatenation (.) or string at Script.pl line 24, <IN> line XXXX.


The "line XXXX" part (XXXX is presumably a number) says where in the input file the error occurred. Find the first such line number XXXX in the input file and look carefully at the group of three lines ending at that line.

If you don't see anything wrong, post these lines; I or someone else might spot the problem.

Some advice concerning usually accepted better programming practices:
- Always "use strict;"
- Always "use warnings;"

These two diagnostic pragmas will tell you a lot about possible errors in your program, often even before it runs. They will force you to declare your variables ("my" statement) and to think about where they need to exist.

- Always check the return value of system calls such as opening a file
- Use the more modern syntax to open your files (see example below).

So, this is a quick rewrite of your script with such advice in mind (plus a little bit of reformatting for clarity, but you don't need to agree with me on the reformatting):


Code
use warnings; 
use strict;

my %cod;
$cod{1} = "Int";
$cod{2} = "non";
$cod{3} = "syn";
$cod{4} = "stop";

my $file = "input.txt";
open IN, "<", $file or die "could not open $file $! \n";
open OUT, ">", "output.txt" or die "could not open output.txt $! \n";
print OUT "Coordinate No of Strains AA Change\n";
my ($SNP, $count, $change);
while(<IN>){
if (m/^FT\s+SNP\s+(\d+)/) {
$SNP = $1;
}
elsif (m/^FT\s+\/note="(.*)"/) {
my $line = $1;
$count = ($line =~ tr/=/=/);
$line =~ m/\((AA \w+->\w+)\)\s*$/;
$change = $1 || "";
}
elsif (m/^FT\s+\/colour=(\d+)/) {
print OUT "$SNP $count $change\n" if $cod{$1} eq "non";
}
}


One final additional point: I don't think the hash at the beginning is very useful since you are checking $cod{$1} only against "non", which will match only if $1 = 2, so that you could have your conditional statement:

print OUT "$SNP $count $change\n" if $1 == 2; # or : if $1 eq '2';

and could forget altogether about the %cod hashtable (unless of course you showed us only part of your code).
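
A caveat on the `$change = $1 || "";` line used in both versions of the script: when a match fails, Perl leaves `$1` untouched, so it would still hold the capture from the last successful match (here, the whole note text) rather than being empty. A minimal sketch of one way around that, capturing in list context instead; the helper name `aa_change` is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return the trailing "(AA xxx->yyy)" annotation of a
# note string, or an empty string when the note has none.
sub aa_change {
    my ($note) = @_;
    # In list context a failed match returns an empty list, so $change
    # stays undef instead of inheriting a stale $1.
    my ($change) = $note =~ m/\((AA \w+->\w+)\)\s*$/;
    return defined $change ? $change : "";
}

print aa_change('refAllele: T SNPstrains: 7564_8#80=C (non-synonymous) (AA Leu->Ser) '), "\n";
print "[", aa_change('refAllele: C SNPstrains: 7480_8#12=T '), "]\n";
```

The second call prints empty brackets, because the note carries no amino acid change.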


(This post was edited by Laurent_R on Nov 11, 2012, 3:01 PM)


MB123
Novice

Nov 11, 2012, 3:37 PM

Post #6 of 31 (9382 views)
Re: [Laurent_R] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

Thank you for the update of my code.

Yes, my data is roughly ~70,000 lines. Is there a way to expand the errors that perl shows me? Because I can only see as far back as line # 69523. I presume the error would occur before this line.

Anyway, here are a couple of lines that returned the error message-

Code
FT	SNP	2815273 
FT /note="refAllele: G SNPstrains: 7748_6#66=A (non-synonymous) (AA Pro->Leu) "
FT /colour=2
FT SNP 2815907
FT /note="refAllele: C SNPstrains: 7414_8#17=T (non-synonymous) (AA Val->Ile) "
FT /colour=2
FT SNP 2816235
FT /note="refAllele: G SNPstrains: 7748_6#47=T (non-synonymous) (AA Lys->Asn) "
FT /colour=2


I can't see any differences between those and the sample I gave earlier. Also, it may be of some help that my output file only writes in the coordinates column values, and leaves the other two blank.

Many thanks.


Laurent_R
Enthusiast / Moderator

Nov 11, 2012, 4:05 PM

Post #7 of 31 (9380 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

Hi,

I also don't see anything special in those lines.

And I tried to run the program on those 9 lines, with no warnings and it generated correct output.


Code
$ perl uninit.pl 

~
$ cat output.txt
Coordinate No of Strains AA Change
2815273 1 AA Pro->Leu
2815907 1 AA Val->Ile
2816235 1 AA Lys->Asn


The input lines you provided were probably not the faulty ones.


MB123
Novice

Nov 12, 2012, 5:52 AM

Post #8 of 31 (9370 views)
Re: [Laurent_R] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

Hi,

Is there a way that I can find the faulty lines without having to check through all ~70,000 lines by eye?

Thanks


BillKSmith
Veteran

Nov 12, 2012, 10:15 AM

Post #9 of 31 (9361 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

All of your data consists of groups of three records. Your code will fail if this pattern is broken because it will have access to 'stale' data. A better approach is to enforce the order. Report all discrepancies.

Code
use warnings; 
use strict;
my %cod = (
1 => "Int",
2 => "non",
3 => "syn",
4 => "stop",
);
my $file = "input.txt";
open my $IN, "<", $file or die "could not open $file $! \n";
open my $OUT, ">", "output.txt" or die "could not open output.txt $! \n";
print $OUT "Coordinate No of Strains AA Change\n";

while (my $record = <$IN>) {
my ($SNP) = $record =~ m/^FT\s+SNP\s+(\d+)/;
$record = <$IN>;
my ($line) = $record =~ (m/^FT\s+\/note="(.*)"/);
my $count = $line =~ tr/=/=/ ;
my ($change) = $line =~ m/\((AA \w+->\w+)\)\s*/;
$record = <$IN>;
my ($color) = $record =~ m/^FT\s+\/colour=([1-4])/;
next if defined $color and $color =~ m/[134]/;
if (defined $count and defined $change and defined $color and $color == 2) {
print $OUT "$SNP $count $change\n"
}
else {
$SNP ||= 'invalid';
$count ||= 'invalid';
$change ||= 'invalid';
$color ||= 'invalid';
warn "Invalid data block near $.\n"
. "SNP = $SNP count = $count change = $change color = $color\n"
;
}
}
close $OUT;
close $IN;


You will have to decide whether it is better to warn or to die on errors.

Based on the testing you have done so far, I suspect that you have either a blank, missing, or duplicated line. They would all be hard to find without their context.
Good Luck,
Bill


Laurent_R
Enthusiast / Moderator

Nov 12, 2012, 10:16 AM

Post #10 of 31 (9361 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

Yes, the error message tells you where it encountered the problem in the input file, at the very end of the message (what you quoted as XXXX in your post is actually the line number where the problem occurred). All you need is that line plus the two previous ones.

Assuming your error message refers to line 128, you could print the faulty lines by typing at the prompt:


Code
perl -ne 'print if 126..128' input.txt

(If under Windows, it becomes:
perl -ne "print if 126..128" input.txt
i.e. replacing single quotes by double quotes.)


(This post was edited by Laurent_R on Nov 12, 2012, 10:30 AM)


MB123
Novice

Nov 12, 2012, 11:14 AM

Post #11 of 31 (9356 views)
Re: [Laurent_R] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

My problem is that it appears most of the lines return the error (I assume because they become out of synch from a missing or duplicated line). The result is my command line runs for ~30 seconds listing lines where the error is found, and then when it finishes I am limited with how far up I can scroll to find the source of the error.

Is there a way to find the very first line where the error occurred?


Laurent_R
Enthusiast / Moderator

Nov 12, 2012, 11:47 AM

Post #12 of 31 (9352 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

If you are on Unix or Linux, run your program with a command like this one:

perl my_program.pl <params if any> 2>&1 | more

This will stop the output at the first page (the 2>&1 is needed because warnings are written to STDERR, not STDOUT).

An alternative that would work even under Windows is to redirect the error output to a file and then open the file:

perl my_program.pl <params if any> 2> a_file.txt

This way you can get the first errors.
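
The same capture can also be done from inside the script, which sidesteps differences between shells. A minimal sketch that reopens STDERR (the stream Perl writes its warnings to) onto a log file; the file name `warnings.log` is an arbitrary choice for illustration:

```perl
use strict;
use warnings;
use IO::Handle;

my $log = 'warnings.log';    # hypothetical file name

# Reopen STDERR onto a file so that every warning lands there in order,
# with the first one at the top of the file.
open STDERR, '>', $log or die "could not open $log: $!\n";
STDERR->autoflush(1);        # write each warning out immediately

my $count;                   # deliberately left undefined
print "count = $count\n";    # the resulting warning goes to warnings.log
```

Opening the log file in an editor afterwards shows the earliest warning first, however many follow it.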


MB123
Novice

Nov 13, 2012, 3:36 AM

Post #13 of 31 (9328 views)
Re: [Laurent_R] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

Hi,

I'm sorry but I am not sure I fully understand. I am on Windows, and I typed exactly what you put into my perl command line (changing the file names to the files that I have) and it returns-

Code
>was unexpected at this time

If I remove the second '>' it returns-

Code
The system cannot find the file specified.



BillKSmith
Veteran

Nov 13, 2012, 5:50 AM

Post #14 of 31 (9313 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

The script I posted in post #9 is designed to report synch errors. Change the 'warn' to 'die' to make it stop at the first error.
Good Luck,
Bill


MB123
Novice

Nov 13, 2012, 7:34 AM

Post #15 of 31 (9311 views)
Re: [BillKSmith] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

Hi Bill,

Thanks for your reply. I ran your script after changing 'warn' to 'die' in this line -


Code
die "Invalid data block near $.\n"


It returned the same result - repeated errors of uninitialised values running for ~30 seconds with the furthest back cell I can see being 69523. The output remained the same - only the 'coordinate' column filled with the other two left blank.


BillKSmith
Veteran

Nov 13, 2012, 8:58 AM

Post #16 of 31 (9302 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

Sorry about that. I thought that I was catching all the errors. One way to stop your program on the first warning is to add your own signal handler. At the beginning of your own script, add:

Code
$SIG{'__WARN__'} = sub{die $_[0]};


This should force your program to die with the usual message (including input line number) on the very first warning.
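
A related option, if only the uninitialised-value warnings matter, is to let the `warnings` pragma promote that one category to a fatal error. A small sketch, with made-up values standing in for a real record and a hypothetical `format_row` helper:

```perl
use strict;
use warnings;
use warnings FATAL => 'uninitialized';

# Hypothetical helper: build one output row from the three fields.
sub format_row {
    my ($snp, $count, $change) = @_;
    return "$snp $count $change\n";
}

# With the category made fatal, the first use of an undefined value dies
# instead of warning, so eval can trap it and report it.
my $row = eval { format_row(703, undef, 'AA Ala->Thr') };
print defined $row ? $row : "stopped: $@";
```

Unlike the `__WARN__` handler, this leaves other kinds of warnings non-fatal.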
Good Luck,
Bill


MB123
Novice

Nov 13, 2012, 9:37 AM

Post #17 of 31 (9300 views)
Re: [BillKSmith] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

Thank you, that worked perfectly. However the result was a bit unexpected - it failed on the very first non-synonymous change -

Code
FT	SNP	26 
FT /note="refAllele: C SNPstrains: 7521_5#13=T "
FT /colour=1
FT SNP 49
FT /note="refAllele: C SNPstrains: 7469_7#75=T "
FT /colour=1
FT SNP 102
FT /note="refAllele: C SNPstrains: 7521_5#67=A "
FT /colour=1
FT SNP 139
FT /note="refAllele: T SNPstrains: 7414_8#36=C "
FT /colour=1
FT SNP 177
FT /note="refAllele: A SNPstrains: 7480_7#6=G "
FT /colour=1
FT SNP 233
FT /note="refAllele: T SNPstrains: 7480_7#22=C "
FT /colour=1
FT SNP 259
FT /note="refAllele: C SNPstrains: 7564_8#83=T "
FT /colour=1
FT SNP 314
FT /note="refAllele: A SNPstrains: 7414_8#83=G "
FT /colour=1
FT SNP 319
FT /note="refAllele: T SNPstrains: 7414_7#41=C "
FT /colour=1
FT SNP 329
FT /note="refAllele: C SNPstrains: 7564_8#31=T "
FT /colour=1
FT SNP 375
FT /note="refAllele: G SNPstrains: 7564_8#11=A "
FT /colour=1
FT SNP 413
FT /note="refAllele: T SNPstrains: 7414_8#83=C "
FT /colour=1
FT SNP 414
FT /note="refAllele: G SNPstrains: 7521_5#22=A "
FT /colour=1
FT SNP 433
FT /note="refAllele: T SNPstrains: 7083_1#5=C 7414_8#8=C 7480_8#49=C "
FT /colour=1
FT SNP 442
FT /note="refAllele: T SNPstrains: 7065_8#2=C 7065_8#94=C 7083_1#2=C 7083_1#3=C 7083_1#41=C 7083_1#42=C 7083_1#43=C "
FT /colour=1
FT SNP 460
FT /note="refAllele: T SNPstrains: 7564_8#14=C "
FT /colour=1
FT SNP 703
FT /note="refAllele: G SNPstrains: 7521_5#39=A (non-synonymous) (AA Ala->Thr) "
FT /colour=2

That is, the last 3 lines of this data. The error referred to line 52 specifically, which is the last line -

Code
FT		/colour=2

Am I right to assume that the error is therefore a mistake in my code rather than a duplication or deletion in the data?


BillKSmith
Veteran

Nov 13, 2012, 2:13 PM

Post #18 of 31 (9290 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

The line number of your colour=2 line should be divisible by three. Your lines are out of sync. The error is reported for this line because it is the first line after the real error that attempted to print and failed. Open your data file with your editor and find the first "colour=" line that has a line number that is not divisible by three. Find the last "colour=" line before it whose line number is. The error is between those two lines. Remember, you have already narrowed your hunt to 52 lines.
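
The hunt described above can also be automated. A minimal sketch, assuming the file is meant to be a strict repetition of SNP, note and colour lines starting at line 1; the helper name `first_out_of_sync` and the sample lines are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical helper: given the lines of the file, return the number of the
# first line whose type breaks the SNP / note / colour cycle, or 0 if none.
sub first_out_of_sync {
    my @lines = @_;
    my @expected = (
        qr/^FT\s+SNP\s+\d+/,       # lines 1, 4, 7, ...
        qr/^FT\s+\/note=/,         # lines 2, 5, 8, ...
        qr/^FT\s+\/colour=\d+/,    # lines 3, 6, 9, ...
    );
    for my $i (0 .. $#lines) {
        return $i + 1 unless $lines[$i] =~ $expected[$i % 3];
    }
    return 0;
}

my @sample = (
    "\n",                          # stray blank line at the top
    "FT SNP 26\n",
    qq{FT /note="refAllele: C SNPstrains: 7521_5#13=T "\n},
    "FT /colour=1\n",
);
print "first out-of-sync line: ", first_out_of_sync(@sample), "\n";
```

On a real file the lines would come from a filehandle, for example by passing `<$in>` in list context after opening the input file.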
Good Luck,
Bill


Laurent_R
Enthusiast / Moderator

Nov 13, 2012, 2:39 PM

Post #19 of 31 (9287 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post


In Reply To
Hi,

I'm sorry but I am not sure I fully understand. I am on Windows, and I typed exactly what you put into my perl command line (changing the file names to the files that I have) and it returns-

Code
>was unexpected at this time

If I remove the second '>' it returns-

Code
The system cannot find the file specified.



Sorry, I was probably not clear enough.

The "<params if any>" notation was just meant to say "put your params there if your script needs any"; it was not meant to be typed literally.

The idea was to have a command like: "perl program 2> file.txt" if there was no parameter, and "perl program param1 param2 2> file.txt" if you had two parameters.

But anyway, you have moved forward in the meantime, and, if I understand the last messages, it seems that the first error occurred on line 52 of your input, so you don't have to read too much of your input file to find where it is. And since 52 cannot be divided by 3, you probably have an out-of-sync problem, due either to an inconsistency in your input file, or to a yet-unnoticed bug in your program leading to, for example, a line being skipped during processing.

If you still cannot find where, please post your whole input up to the line corresponding to the first occurrence of the warning, and we can try to find out.


MB123
Novice

Nov 13, 2012, 3:12 PM

Post #20 of 31 (9283 views)
Re: [Laurent_R] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

I found some white space at the top of my text file which I removed. However the error now occurs in the same place, being

Code
FT           /colour=2

but it is now line 51 of my input file.

The error refers to use of uninitialised value $count at line 30 of the script below

Code
#!/usr/bin/perl -w 


use warnings;
use strict;

my %cod;
$cod{1} = "Int";
$cod{2} = "non";
$cod{3} = "syn";
$cod{4} = "stop";

$SIG{'__WARN__'} = sub{die $_[0]};
my $file = "BSAC.txt";
open IN, "<", $file or die "could not open $file $! \n";
open OUT, ">", "output.txt" or die "could not open output.txt $! \n";
print OUT "Coordinate No of Strains AA Change\n";
my ($SNP, $count, $change);
while(<IN>){
if (m/^FT\s+SNP\s+(\d+)/) {
$SNP = $1;
}
elsif (m/^FT\s+\/note="(.*)"/) {
my $line = $1;
$count = ($line =~ tr/=/=/);
$line =~ m/\((AA \w+->\w+)\)\s*$/;
$change = $1 || "";
}
elsif (m/^FT\s+\/colour=(\d+)/) {
print OUT "$SNP $count $change\n" if $cod{$1} eq "non";
}
}


This line -


Code
print OUT "$SNP $count $change\n" if $cod{$1} eq "non";


The entire input up to the error is the one I have pasted above, up to SNP position 703. I am very confused by this!


BillKSmith
Veteran

Nov 13, 2012, 6:41 PM

Post #21 of 31 (9273 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

I ran the code you posted in post #20 against the data that you posted in post #17. It ran without error and output one data record with all three fields. Perhaps something is changing in the cut and paste process. Would you please post the 51-line data file as an attachment?


Contents of output.txt

Code
Coordinate	No of Strains	AA Change 
703 1 AA Ala->Thr

Good Luck,
Bill


FishMonger
Veteran / Moderator

Nov 13, 2012, 7:10 PM

Post #22 of 31 (9271 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

I've been following this thread but until now I've not contributed.

So far, the helpers have indirectly stated that you first need to correct the issues with the input before you can parse the data correctly. While that might be the ideal approach, it is not always/often possible. Instead, I suggest that you add additional/different error checking to accommodate the problems.

Here's my test script, which has the minimal level of error checking that I think it might need, but in a production level script, the error checking/handling would be expanded.

In this example I'm putting the input data inside the script but also include commented out lines that read the input data from an external file.


Code
#!/usr/bin/perl 

use strict;
use warnings;
use Carp;

my %cod = (
1 => "Int",
2 => "non",
3 => "syn",
4 => "stop",
);

my $input_file = 'BSAC.txt';
my $output_file = 'output.txt';

#open my $input_fh, '<', $input_file or croak "could not open '$input_file' <$!>\n";
open my $output_fh, '>', $output_file or croak "could not open '$output_file' <$!>\n";

printf {$output_fh} ("%-12s %-15s %-10s\n", 'Coordinate', 'No of Strains', 'AA Change');
print {$output_fh} '-' x 38, "\n";

RECORD:
#while (my $line = <$input_fh>) {
while (my $line = <DATA>) {
chomp $line;
if ( $line =~ /^FT \s+ SNP \s+ (\d+)/x) {
my $snp = $1;

#my $note = <$input_fh>;
my $note = <DATA>;
if ( $note =~ /^FT \s+ \/note = "(.+)"/x ) {
$note = $1;
}
else {
carp qq(format error parsing "note" at or near line $. - skipping this record\n);
next RECORD;
}

my $count = ($note =~ tr/=/=/) || 0;
my ($change) = $note =~ /\(AA ([^)]+)\) \s+$/x ? $1 : '';

#my $colour = <$input_fh>;
my $colour = <DATA>;
$colour or do {
carp qq(format error parsing "colour" at or near line $. - skipping this record\n);
next RECORD;
};

if ( $colour =~ /^FT \s+ \/colour = (\d+)/x ) {
$colour = $1;
}
else {
carp qq(format error parsing "colour" at or near line $. - skipping this record\n);
next RECORD;
}

if ($cod{$colour} eq 'non') {
printf {$output_fh} ("%-12s %-14d %-10s\n", $snp, $count, $change);
}
}
}
#close $input_fh;
close $output_fh;


__DATA__

FT SNP 27534
FT /note="refAllele: T SNPstrains: 7564_8#80=C (non-synonymous) (AA Leu->Ser) "
FT /colour=2
FT SNP 27682
FT /note="refAllele: T SNPstrains: 7414_8#37=C (synonymous) "
FT /colour=3
FT SNP 27710
FT /note="refAllele: G SNPstrains: 7083_1#32=T (non-synonymous) (AA Val->Phe) 7521_5#41=T (non-synonymous) (AA Val->Phe) "
FT /colour=2
FT SNP 27771
FT /note="refAllele: A SNPstrains: 7480_8#28=G (non-synonymous) (AA His->Arg) "
FT /colour=2
FT SNP 28047
FT /note="refAllele: A SNPstrains: 7480_7#86=T (non-synonymous) (AA Lys->Ile) "
FT /colour=2
FT SNP 28490
FT /note="refAllele: G SNPstrains: 7083_1#4=T (non-synonymous) (AA Gly->Cys) 7554_6#38=T (non-synonymous) (AA Gly->Cys) "

FT SNP 28492
FT /note="refAllele: C SNPstrains: 7414_7#66=A (synonymous) 7414_8#44=A (synonymous) 7521_6#54=A (synonymous) "
FT /colour=3
FT SNP 28548
FT /note="refAllele: C SNPstrains: 7414_8#65=T (non-synonymous) (AA Ser->Leu) "
FT /colour=2
FT SNP 28787
FT /note="refAllele: G SNPstrains: 7414_7#14=A (non-synonymous) (AA Asp->Asn) "
FT /colour=2
FT SNP 28840
FT /note="refAllele: C SNPstrains: 7414_8#51=T (synonymous) 7414_8#71=T (synonymous) "
FT /colour=3
FT SNP 28941
FT /note="refAllele: A SNPstrains: 7083_1#1=G (non-synonymous) (AA Gln->Arg) "
FT /colour=2
FT SNP 29080
FT /note="refAllele: A SNPstrains: 7414_7#49=G (synonymous) 7521_6#39=G (synonymous) 7564_8#91=G (synonymous) 7712_8#14=G (synonymous) "
FT /colour=3
FT SNP 29214
FT /note="refAllele: T SNPstrains: 7554_6#36=C (non-synonymous) (AA Val->Ala) "
FT /colour=2
FT SNP 29574
FT /note="refAllele: C SNPstrains: 7065_8#73=T (non-synonymous) (AA Pro->Leu) "
FT /colour=2
FT SNP 29610
FT /note="refAllele: C SNPstrains: 7480_8#12=T "
FT /colour=1
FT SNP 29658
FT /note="refAllele: T SNPstrains: 7564_8#79=A "


Based on the sample input data, this is the contents of output.txt.

Code
Coordinate   No of Strains   AA Change  
--------------------------------------
27534 1 Leu->Ser
27710 2 Val->Phe
27771 1 His->Arg
28047 1 Lys->Ile
28548 1 Ser->Leu
28787 1 Asp->Asn
28941 1 Gln->Arg
29214 1 Val->Ala
29574 1 Pro->Leu


And here are the "error" messages sent to stderr, which I might direct to an "error" file for later review.

Code
format error parsing "colour" at or near line 19 - skipping this record 
at D:\test\Perl-1.pl line 54, <DATA> line 19.
format error parsing "colour" at or near line 48 - skipping this record
at D:\test\Perl-1.pl line 46, <DATA> line 48.



Laurent_R
Enthusiast / Moderator

Nov 13, 2012, 11:21 PM

Post #23 of 31 (9267 views)
Re: [FishMonger] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

I definitely agree that the code should be more robust against inconsistent data, either by being able to cope with some data format variation or by at least checking the input and producing an error message when the format is unexpected.

At the same time, I believe that knowing the data you are dealing with inside out is of paramount importance. I deal daily with masses of data from external sources. There are so many possible errors and format variations that it is not possible to forecast every one of them. But knowing the data very well goes a long way towards successful data munging.

As an additional note, relying on a succession of 3 types of line is not robust enough. For each line, you need to check that it is a line of the type you expect. If it isn't, then reject it with an error message and try to get back in sync by searching for the next first line of a group of three (otherwise you'll have to reject the whole file).
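
The check-and-resynchronise idea above can be sketched roughly as follows. This is only an illustration working on an in-memory list of lines; the helper name `parse_records`, the hash keys and the error wording are all assumptions, not part of any script in this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: parse lines into records, dropping malformed groups
# and resuming at the next SNP line. Returns (\@records, \@errors).
sub parse_records {
    my @lines = @_;
    my (@records, @errors, %rec);
    my $n = 0;
    for my $line (@lines) {
        $n++;
        if ($line =~ m/^FT\s+SNP\s+(\d+)/) {
            push @errors, "incomplete record before line $n" if %rec;
            %rec = (snp => $1);              # a SNP line always starts a new group
        }
        elsif (exists $rec{snp} && !exists $rec{note}
               && $line =~ m/^FT\s+\/note="(.*)"/) {
            $rec{note} = $1;
        }
        elsif (exists $rec{note} && $line =~ m/^FT\s+\/colour=(\d+)/) {
            push @records, { %rec, colour => $1 };
            %rec = ();                       # group complete
        }
        else {
            push @errors, "unexpected line $n";
            %rec = ();                       # drop the partial group, wait for the next SNP
        }
    }
    push @errors, "incomplete record at end of input" if %rec;
    return (\@records, \@errors);
}

my ($records, $errors) = parse_records(
    "FT SNP 27534\n",
    qq{FT /note="refAllele: T SNPstrains: 7564_8#80=C (non-synonymous) (AA Leu->Ser) "\n},
    "FT /colour=2\n",
);
print scalar(@$records), " record(s), ", scalar(@$errors), " error(s)\n";
```

Collecting the errors in a list rather than warning immediately makes it easy to print only the first one, or to write them all to a separate file.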


MB123
Novice

Nov 14, 2012, 6:17 AM

Post #24 of 31 (9259 views)
Re: [FishMonger] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

Hi,

Thank you for your input. I ran your script using my entire file and it produced a blank output file, bar the headers. It also returned

Code
format error parsing "note" at or near line XXX - skipping this record at script.pl line 34 <$input_fh> line XXX.

The line refers to those which have

Code
FT     SNP       703


for example.

I have attached the 51 line file for Bill.

Many thanks.
Attachments: 51 line.txt (1.48 KB)


FishMonger
Veteran / Moderator

Nov 14, 2012, 6:40 AM

Post #25 of 31 (9257 views)
Re: [MB123] File parsing problem, Use of uninitialised value error. [In reply to] Can't Post

The reason it did not produce the desired output is that the format of these "note" lines differs from what you previously posted. Because of that, the regex failed to match, so the script moved on to the next record(s), which also failed for the same reason.

The portion of the line I'm referring to is:

/note="

compared to this:

"/note=""

Take note of the difference with the quotes.
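
If regenerating the file is not an option, the note lines could be normalised before matching. The sketch below assumes the extra quoting follows the usual spreadsheet/CSV convention (the whole field wrapped in one more pair of quotes, every inner quote doubled); the attachment's exact layout is not shown in this thread, so treat both the sample line and the helper name `normalise_note_line` as guesses:

```perl
use strict;
use warnings;

# Hypothetical helper: undo CSV-style quoting on a note line.
# Assumption: the field is wrapped in extra quotes and inner quotes are doubled.
sub normalise_note_line {
    my ($line) = @_;
    if ($line =~ m/^(FT\s+)"(\/note=.*)"\s*$/) {
        my ($prefix, $field) = ($1, $2);
        $field =~ s/""/"/g;          # collapse doubled quotes
        return "$prefix$field\n";
    }
    return $line;                    # already in the expected form
}

my $odd = qq{FT   "/note=""refAllele: C SNPstrains: 7521_5#13=T """\n};
print normalise_note_line($odd);
```

Lines that already start with `/note="` pass through unchanged, so the helper can be applied to every line before the existing regexes run.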


(This post was edited by FishMonger on Nov 14, 2012, 6:43 AM)
