parsing rows of data with different lengths

 



kch
Novice

Apr 6, 2010, 9:33 AM

Post #1 of 9 (5576 views)
parsing rows of data with different lengths

I am not an expert at Perl, but I thought this question was most appropriate for the regular expressions forum.

I have data consisting of rows where one column is an aggregate of 10 attributes. For example,

MB 1 UL L F UT 100 BX C 20CT

signifies

A1 = MB
A2 = 1
A3 = UL
...
A8 = BX
A9 = C
A10 = 20CT

However, some rows have missing values. I can identify a missing value because I know the possible codes for each attribute.
So, for example, if the row were

MB 1 UL L F UT 100 C 20CT

I would know that A8 is missing, since attribute 8 can only be BX or P (and in this particular case a P is left blank in the data).

So I am wondering whether I could first fill in these missing values (with P for attribute 8) and then split the code into 10 attributes (columns)?

Thanks in advance!


roolic
User

Apr 6, 2010, 12:14 PM

Post #2 of 9 (5560 views)
Re: [kch] parsing rows of data with different lengths


Code
my $string = 'MB 1 UL L F UT 100 C 20CT';
$string =~ s/^((?:\w+\s+){7})(?!(?:BX|P)\s+)/$1P /; # inserting the P attribute
my @attrs = split(/\s+/, $string);
$attrs[7] = '' if $attrs[7] eq 'P'; # processing the P attribute
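
To see what the substitution does on both kinds of row, here is a minimal sketch that wraps it in a helper and runs it on the two sample rows from the question. The name fill_attr8 is made up for illustration; the regex is the one above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): inserts the
# placeholder 'P' as the 8th field when that field is neither BX nor P,
# then splits the row on whitespace.
sub fill_attr8 {
    my ($row) = @_;
    $row =~ s/^((?:\w+\s+){7})(?!(?:BX|P)\s+)/$1P /;
    return split /\s+/, $row;
}

for my $row ( 'MB 1 UL L F UT 100 C 20CT', 'MB 1 UL L F UT 100 BX C 20CT' ) {
    my @attrs = fill_attr8($row);
    print scalar(@attrs), ": @attrs\n";
}
# 10: MB 1 UL L F UT 100 P C 20CT
# 10: MB 1 UL L F UT 100 BX C 20CT
```

The negative lookahead (?!...) only succeeds when the 8th token is not BX or P, so a complete row is left untouched.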



kch
Novice

Apr 6, 2010, 5:07 PM

Post #3 of 9 (5545 views)
Re: [roolic] parsing rows of data with different lengths

Thanks.
This works fine, but what if I want to read in a file and do it for every row?
I tried playing around with the /g modifier but wasn't able to get it to work. I am able to read in the file with


open $TEXTREV, "a.txt";
my @str = <$TEXTREV>;
my $str = join("", @str);

and when I do the following, only the first entry gets corrected (P attribute inserted):

$str =~ s/^((?:\w+\s+){7})(?!(?:BX|P)\s+)/$1P /; #inserting P attribute
my @attrs = split(/\s+/, $str);
$attrs[7] = '' if $attrs[7] eq 'P';


Also, all 10 attributes have missing values here and there, so would I need 10 different s/// commands like the one above? Would you mind elaborating on what

my @attrs = split(/\s+/, $str);
$attrs[7] = '' if $attrs[7] eq 'P';

does?

Thank you


roolic
User

Apr 6, 2010, 9:10 PM

Post #4 of 9 (5527 views)
Re: [kch] parsing rows of data with different lengths

Try the following:


Code
open (INFILE, "< a.txt") || die "can not read: $!";
open (OUTFILE, "> out.csv") || die "can not write: $!";

while ( my $string = <INFILE> ){
    $string =~ s/^\W+//; # removing garbage at the start of the line
    $string =~ s/\W+$//; # removing \n, \r etc. at the end of the line

    # inserting the P attribute via a substitution
    # 'MB 1 UL L F UT 100 C 20CT' -> 'MB 1 UL L F UT 100 P C 20CT'
    $string =~ s/^((?:\w+\s+){7})(?!(?:BX|P)\s+)/$1P /;

    # the following makes an array of strings like
    # ('MB','1','UL','L','F','UT','100','P','C','20CT')
    my @attrs = split(/\s+/, $str);

    # if the 8th element (index 7, because indexes start from 0) is 'P',
    # make it blank according to the requirements
    $attrs[7] = '' if $attrs[7] eq 'P';

    # printing into a CSV data file (readable by Excel etc.)
    print OUTFILE join(',', @attrs)."\n";
}
close INFILE;
close OUTFILE;


Note that you should not join the whole file into a single string: because of the '^' (start of string) anchor, the regex provided would only work for the first row. It is better to process the file line by line.
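
If the file really has to stay in one string, the anchor can be made to apply to every row with the /m and /g modifiers. This is only a sketch under assumptions: it needs Perl 5.10 or later for \h (horizontal whitespace), which replaces \s so that a match cannot run across a newline into the next row. Line-by-line processing remains the simpler route.

```perl
use strict;
use warnings;

# Two rows joined into one string, as in the question.
my $text = "MB 1 UL L F UT 100 C 20CT\n"
         . "MB 1 UL L F UT 100 BX C 20CT\n";

# /m lets ^ match at the start of every line, /g repeats the substitution.
# \h+ (Perl 5.10+) matches spaces and tabs but not newlines.
( my $fixed = $text ) =~ s/^((?:\w+\h+){7})(?!(?:BX|P)\h+)/$1P /mg;

print $fixed;
# MB 1 UL L F UT 100 P C 20CT
# MB 1 UL L F UT 100 BX C 20CT
```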

If you'd like to process the data further (rather than store it in CSV), you can build an array of arrays (a 2D matrix) in the following way:

place before the while() loop
my $data = [];

place inside the while() loop (instead of, or in addition to, the 'print')
push @{$data},\@attrs;

then you can access the required row and column via
$data->[row-1][column-1]
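
A small self-contained sketch of that array-of-arrays idea, using two in-memory rows in place of the file (the @rows list is an assumption made so the example runs on its own):

```perl
use strict;
use warnings;

# Stand-in for the lines read from the file.
my @rows = (
    'MB 1 UL L F UT 100 P C 20CT',
    'MB 1 UL L F UT 100 BX C 20CT',
);

my $data = [];                       # reference to an array of rows
for my $string (@rows) {
    my @attrs = split /\s+/, $string;
    push @{$data}, \@attrs;          # @attrs is a new array on every pass
}

# Row 2, column 8 (both counted from 1) is index [1][7].
print $data->[1][7], "\n";           # BX
print scalar( @{$data} ), " rows\n"; # 2 rows
```

Because @attrs is declared with my inside the loop, each stored reference points at its own array rather than at one shared array.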


roolic
User

Apr 6, 2010, 9:15 PM

Post #5 of 9 (5525 views)
Re: [roolic] parsing rows of data with different lengths

Oops, it should be
my @attrs = split(/\s+/, $string);
not
my @attrs = split(/\s+/, $str);
in the code of the previous message.


kch
Novice

Apr 7, 2010, 7:37 AM

Post #6 of 9 (5508 views)
Re: [roolic] parsing rows of data with different lengths

Would it be possible to add an if/else that first counts the number of fields in a row and, if it exceeds 10 (attributes), deletes one attribute, while doing nothing to rows that have exactly 10 attributes?

Some of my data has more than 10 attributes. For example,

MB 1 UL L F UT 100 BL BX P 20 CT

Here we have 11 attributes, and we have to delete BL to make it
MB 1 UL L F UT 100 BX P 20 CT

However, when I run the code, I get
MB 1 UL R NF REG L P F UT 85 BL BX P 20CT


In another case, when we have 8 attributes
WNST 1 F 85 BX C -3.00 20 CT

we get
WNST 1 R R F REG 85 BX C -3 20CT

which is what we want (the code is doing what it is supposed to), but we now have 11 attributes and I'd like to delete -3 from the above to make it

WNST 1 R R F REG 85 BX C 20CT

Regarding the code itself, I recoded the above to take into account other attributes in different positions:
$string=~ s/^((?:\w+\s+){2})(?!(?:UL|L|DUL|DLX|M)\s+)/$1R /;
$string=~ s/^((?:\w+\s+){3})(?!(?:M|R)\s+)/$1R /;
$string=~ s/^((?:\w+\s+){4})(?!(?:F|NF)\s+)/$1NF /;
$string=~ s/^((?:\w+\s+){5})(?!(?:LT|UT|R)\s+)/$1REG /;
$string =~ s/^((?:\w+\s+){7})(?!(?:BX|P)\s+)/$1P /;


thank you


roolic
User

Apr 8, 2010, 3:42 AM

Post #7 of 9 (5479 views)
Re: [kch] parsing rows of data with different lengths

Since there could be a lot of conditions and the logic is rather complex, using regular expressions alone is not flexible and makes the code hard to read. So I'd recommend checking the attributes one by one inside an inner loop, like the following:

Code
my @input = split( /\s+/, $string );
my @attrs = ();
my $index = 0; # the position
while ( @input ) {
    $index++;
    my $current = shift @input; # reading the next input

    if( $index <= 2 ){
        # just copy the first two values
        push @attrs, $current;
        next; # skip the other checks and go on to read the next value
    }

    if( $index == 3 ){
        if( $current =~ /^(?:UL|L|DUL|DLX|M)$/ ){
            # process eligible value
            push @attrs, $current;
            next;
        }
        # the current element is not eligible
        # to be 3rd, so set the 3rd element to 'R'
        # and treat the current value as the next element (4th)
        push @attrs, 'R';
        $index++;
    }

    if( $index == 4 ){
        # same way
        if( $current =~ /^(?:M|R)$/ ){
            push @attrs, $current;
            next;
        }
        push @attrs, 'R';
        $index++;
    }

    if( $index == 5 ){
        # same way
        if( $current =~ /^(?:F|NF)$/ ){
            push @attrs, $current;
            next;
        }
        push @attrs, 'NF';
        $index++;
    }

    # insert processing for 6-7 here

    if( $index == 8 ){
        # the following is an example and could be used for any index:
        # checking a condition to skip the attribute,
        # with an additional check of the next (further) value
        if( $current =~ /^BL$/ && $input[0] =~ /(?:BX|P)/ ){
            # skipping BL and reading the next value for the next if()
            $current = shift @input;
        }
        if( $current =~ /^(?:BX|P)$/ ){
            # push a blank string if 'P'
            push @attrs, ( $current ne 'P' ) ? $current : '';
            next;
        }
        push @attrs, ''; # default is a blank string
        $index++;
    }
    # insert processing for 9-10 here
}
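
The repeated "same way" blocks can also be driven from a table. The following is only a sketch: the allowed codes and defaults are the ones posted earlier in this thread, while the table layout, the name normalize() and the choice to fill position 8 with 'P' rather than a blank are assumptions made for illustration. It fills in missing values for positions 3 to 8 and does not drop surplus tokens.

```perl
use strict;
use warnings;

# One entry per position 1..8; undef means "copy the token unchecked".
# Codes and defaults come from the thread; the layout is hypothetical.
my @rules = (
    undef,                                                 # 1
    undef,                                                 # 2
    { ok => qr/^(?:UL|L|DUL|DLX|M)$/, default => 'R'   },  # 3
    { ok => qr/^(?:M|R)$/,            default => 'R'   },  # 4
    { ok => qr/^(?:F|NF)$/,           default => 'NF'  },  # 5
    { ok => qr/^(?:LT|UT|R)$/,        default => 'REG' },  # 6
    undef,                                                 # 7
    { ok => qr/^(?:BX|P)$/,           default => 'P'   },  # 8
);

sub normalize {
    my ($row) = @_;
    my @input = split ' ', $row;
    my @attrs;
    for my $rule (@rules) {
        if ( !$rule ) {
            # no check for this position: take the next token as-is
            push @attrs, @input ? shift(@input) : '';
        }
        elsif ( @input && $input[0] =~ $rule->{ok} ) {
            push @attrs, shift @input;        # eligible value
        }
        else {
            push @attrs, $rule->{default};    # missing: fill in, keep token
        }
    }
    return ( @attrs, @input );                # positions 9+ pass through
}

# Input adapted from the 8-attribute row quoted in the thread.
print join( ' ', normalize('WNST 1 F 85 BX C -3.00 20CT') ), "\n";
# WNST 1 R R F REG 85 BX C -3.00 20CT
```

Adding a rule for another position then means adding one line to @rules instead of another if block.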



kch
Novice

Apr 8, 2010, 12:25 PM

Post #8 of 9 (5469 views)
Re: [roolic] parsing rows of data with different lengths

Do we have to start with
open (INFILE, "< at.txt") || die "can not read: $!";
open (OUTFILE, "> output.csv") || die "can not write: $!";

while (my $string = <INFILE>) {


or simply

$string=<INFILE> ?

Also, for
if( $current =~ /^BL$/ && $input[0] =~ /(?:BX|P)/ ){
is this simply ignoring BL, or also deleting it? The problem is that I do not know exactly what these strings will be, so I'd like a command that says "if it is NOT equal to XYZ, then erase the string".

thanks



roolic
User

Apr 12, 2010, 8:56 PM

Post #9 of 9 (5440 views)
Re: [kch] parsing rows of data with different lengths

Sorry for the delay.
If you have to process the whole file, the row parser should sit inside the loop that fetches the rows:

Code
while ( my $string = <INFILE> ){
    $string =~ s/^\W+//; # removing garbage at the start of the line
    $string =~ s/\W+$//; # removing \n, \r etc. at the end of the line

    my @input = split( /\s+/, $string );
    my @attrs = ();
    my $index = 0; # the position
    while ( @input ) {

        # place conditions by index here

    }
    # printing into a CSV data file (readable by Excel etc.)
    print OUTFILE join(',', @attrs)."\n";
}


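For the 8th attribute, use the following condition to skip (ignore, delete) any value that is not itself 'BX' or 'P' if the next value is 'BX' or 'P', or if there are just 2 elements left:


Code
# $current is checked first so that a valid BX or P is never skipped:
# a complete row also has exactly 2 elements left at this point
if( $current !~ /^(?:BX|P)$/
    && ( ( @input && $input[0] =~ /^(?:BX|P)$/ ) || scalar @input == 2 ) ){
    # assign a new value to $current;
    # the previous value will not be
    # taken into account anymore = ignored
    $current = shift @input;
}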
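
As a rough check of that rule, the sketch below isolates the decision for the 8th attribute in a helper and tries it on three tails of a row. The name attr8 and the unshift that hands a non-matching token back for attribute 9 are assumptions for illustration. The test on $current is there so that a valid BX or P is never discarded, since a complete row also has exactly two elements left at that point.

```perl
use strict;
use warnings;

# Hypothetical helper: $current is the token under inspection,
# $input is a reference to the array of tokens that follow it.
sub attr8 {
    my ( $current, $input ) = @_;
    if (   $current !~ /^(?:BX|P)$/
        && @$input
        && ( $input->[0] =~ /^(?:BX|P)$/ || @$input == 2 ) )
    {
        $current = shift @$input;       # discard the unexpected token
    }
    if ( $current =~ /^(?:BX|P)$/ ) {
        return $current eq 'P' ? '' : $current;   # 'P' is stored as blank
    }
    unshift @$input, $current;          # belongs to attribute 9: put it back
    return '';
}

my @rest   = ( 'BX', 'C', '20CT' );
my $eighth = attr8( 'BL', \@rest );     # surplus BL in front of BX
print "[$eighth] @rest\n";              # [BX] C 20CT

@rest   = ('20CT');
$eighth = attr8( 'C', \@rest );         # 8th attribute missing
print "[$eighth] @rest\n";              # [] C 20CT

@rest   = ( 'P', '20CT' );
$eighth = attr8( 'BX', \@rest );        # complete row, nothing skipped
print "[$eighth] @rest\n";              # [BX] P 20CT
```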


 
 

