string processing efficiency

 



randizzle
Novice

Oct 31, 2008, 9:57 AM

Post #1 of 15 (2530 views)
string processing efficiency

Hi, I know there is a better way to do the following than how I have it:

Take a line from a file, get a substring at a specific position and a specific length. This is going to be a zip code (5 or 9 digits).

Take a broken zip code (you know how Excel kills leading zeroes...), shift it $numberOfZeroesMissing to the right.

Insert $numberOfZeroesMissing amount of 0's on the left.

Replace that substring in the original line, write to file. Position, length, input, and output are ARGVs.

Example: MA1526 10312008 should become MA0152610312008 (position 3, length 5, one zero missing)

Example: PR1526521 10312008 should become PR00152652110312008 (position 3, length 9, two zeroes missing)

The way I have it now I'm using split, foreach's, whiles nested in whiles, ifs, character arrays, etc. and it's over 90 lines of actual code. It just runs slow. The files I work on are 200,000+ lines, text, fixed-width fields, and sent by clients who probably make them in Excel in parts. Northeast and Puerto Rico zips start with 0 or 00.

I KNOW it's possible to cut it at least in half using regex's, but I can't figure out an efficient way. Can you help me? Thanks!


KevinR
Veteran


Oct 31, 2008, 10:30 AM

Post #2 of 15 (2525 views)
Re: [randizzle] string processing efficiency

In your example lines:

Example: MA1526 10312008 should become MA0152610312008 (position 3, length 5, one zero missing)

Example: PR1526521 10312008 should become PR00152652110312008 (position 3, length 9, two zeroes missing)

how do you know the first line is not length 9 with 4/5 missing zeros?

Why is the first line length 5? Looks like length 4 (1526). Why is the second line length 9? Looks like length 7 (1526521).


FishMonger
Veteran / Moderator

Oct 31, 2008, 10:50 AM

Post #3 of 15 (2527 views)
Re: [randizzle] string processing efficiency

You should use sprintf to format the zip with leading zeros.

perldoc -f sprintf

You say you're working with fixed-width fields, but your examples indicate otherwise.
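A minimal sketch of the sprintf suggestion, assuming the field holds only digits followed by blanks. The helper name pad_zip is made up for illustration and is not from the thread.

```perl
use strict;
use warnings;

# pad_zip is a hypothetical helper: it takes the raw field and the
# field width (5 or 9) and returns the zero-padded zip code.
sub pad_zip {
    my ($field, $width) = @_;
    (my $digits = $field) =~ s/\s+$//;       # drop the trailing blanks
    return sprintf "%0${width}d", $digits;   # left-pad with zeros
}

print pad_zip('1526 ', 5), "\n";       # prints 01526
print pad_zip('1526521  ', 9), "\n";   # prints 001526521
```

A field that is already complete, such as '60640', comes back unchanged.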


FishMonger
Veteran / Moderator

Oct 31, 2008, 10:55 AM

Post #4 of 15 (2526 views)
Re: [randizzle] string processing efficiency

Here's an example that should get you going, assuming you're actually working with fixed-width fields.


Code
my $line = 'PR1526521  10312008'; 
my $zip = sprintf("%09d", int(substr($line, 2,9)));
$line =~ s/^(..)([\d ]{9})/$1$zip/;

print $line;



KevinR
Veteran


Oct 31, 2008, 11:22 AM

Post #5 of 15 (2521 views)
Re: [FishMonger] string processing efficiency


In Reply To
You should use sprintf to format the zip with leading zeros.

perldoc -f sprintf

You say you're working with fixed-width fields, but your examples indicate otherwise.


That would be my suggestion as well. It could be that there are spaces in the text that are getting collapsed because he is not using the code tags to retain formatting.

Edit:

yes, the missing spaces show up in the source code:


Code
MA1526 10312008 should become MA0152610312008 (position 3, length 5, one zero missing)  

Example: PR1526521  10312008 should become PR00152652110312008 (position 3, length 9, two zeroes missing)



(This post was edited by KevinR on Oct 31, 2008, 11:24 AM)


randizzle
Novice

Oct 31, 2008, 11:41 AM

Post #6 of 15 (2516 views)
Re: [FishMonger] string processing efficiency

Yea well it's not "fixed" permanently, each file will have different positions (any integer) and length (5 or 9 only), but it will be the same ones throughout the file. That's why they're command line args.

Thanks though, I'll try your code modified a little:

while (<INPUT>)
{
my $line = $_;
my $zip = substr($line,$zipStart,$zipLength); # zipStart and zipLength are ARGV[0] and ARGV[1]


if ($zipLength == 5) {
my $zip = sprintf("%05d", int(substr($line, 2,5)));
$line =~ s/^(..)([\d ]{5})/$1$zip/;
} # end if


else { # 5 and 9 are the only valid choices, error checking code not shown
my $zip = sprintf("%09d", int(substr($line, 2,9)));
$line =~ s/^(..)([\d ]{9})/$1$zip/;
} # end else


print OUTPUT $line;
} # end while



KevinR
Veteran


Oct 31, 2008, 11:59 AM

Post #7 of 15 (2516 views)
Re: [randizzle] string processing efficiency

another approach:


Code
while (<DATA>) { 
my ($s,$z,$d) = $_ =~ /^(\S\S)(\d+\s*)(\S+)$/;
$n = length $z;
$z = sprintf "%0${n}d",$z;
print "$s$z$d\n";
}
__DATA__
MA1526 10312008
PR1526521  10312008


Although I think you could use unpack() to good use since the length of the lines will either be 15 or 19 characters.


Code
while (<DATA>) { 
chomp;
my ($s,$z,$d);
if (length $_ == 19) {
($s,$z,$d) = unpack ("A2A9A*",$_);
$z = sprintf "%09d",$z;
}
else {
($s,$z,$d) = unpack ("A2A5A*",$_);
$z = sprintf "%05d",$z;
}
print "$s$z$d\n";
}
__DATA__
MA1526 10312008
PR1526521  10312008


The advantage is that unpack() is very fast and efficient.
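To illustrate why unpack() fits here: the "A" format reads a fixed number of characters and strips trailing spaces from the result, so the blanks left by the missing zeros disappear without a separate step. The sketch below assumes the two-character state prefix from the example lines; split_record is a made-up name.

```perl
use strict;
use warnings;

# split_record is a hypothetical helper: it cuts a record into
# state (2 chars), zip ($zip_len chars) and everything after it.
sub split_record {
    my ($line, $zip_len) = @_;
    return unpack "A2 A${zip_len} A*", $line;   # "A" strips trailing spaces
}

my ($state, $zip, $rest) = split_record('MA1526 10312008', 5);
printf "%s%05d%s\n", $state, $zip, $rest;   # prints MA0152610312008
```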


(This post was edited by KevinR on Oct 31, 2008, 12:00 PM)


KevinR
Veteran


Oct 31, 2008, 12:03 PM

Post #8 of 15 (2514 views)
Re: [randizzle] string processing efficiency

the code could be reduced a bit at the risk of being less readable to some:


Code
while (<DATA>) { 
chomp;
my $x = length $_ == 19 ? 9 : 5;
my ($s,$z,$d) = unpack ("A2A${x}A*",$_);
print $s,(sprintf "%0${x}d",$z),$d,"\n";
}
__DATA__
MA1526 10312008
PR1526521  10312008



(This post was edited by KevinR on Oct 31, 2008, 12:05 PM)


randizzle
Novice

Oct 31, 2008, 12:29 PM

Post #9 of 15 (2509 views)
Re: [KevinR] string processing efficiency


Quote
Although I think you could use unpack() to good use since the length of the lines will either be 15 or 19 characters.



It's just the substring zip that will either be 9 or 5 characters.

Actually the lines can have any arbitrary size (1,000+ bytes!) and any number of fields at whatever size the clients feel like sending me. They're not techy, so they'll have data in a spreadsheet and send it to me, sometimes in a spreadsheet, sometimes in text, sometimes delimited, sometimes fixed width, sometimes (grrr) with no eols so I have a text file with one line that is 190,000,000,000 or so bytes (it's probably an unconverted Mac file).

I do need this to be sort of readable, but thanks for your super efficient method.



randizzle
Novice

Oct 31, 2008, 12:33 PM

Post #10 of 15 (2508 views)
Re: [KevinR] string processing efficiency


Quote
how do you know the first line is not length 9 with 4/5 missing zeros?

Sorry, there will only be one or two missing zeroes, that's how US (including Puerto Rico) zip codes are.


Quote
Why is the first line length 5? Looks like length 4 (1526). Why is the second line length 9? Looks like length 7 (1526521).


The first is actually "1526 " (blank...fixed width) and the second is "1526521  " (blank blank). Those are two fields.



randizzle
Novice

Oct 31, 2008, 1:07 PM

Post #11 of 15 (2504 views)
Re: [KevinR] string processing efficiency

One more thing I ASS-U-ME'd everyone knew: not all the zips in the file will be broken, I might have a list like

Code
 ... 
... Abt           IL60640E10312008 ...
... Best Buy      MA1640 E10312008 ...
... Circuit City  PR603  E10312008 ...
... TigerDirect   IL60640E10312008 ...
... CompUSA       NY5740 E10312008 ...
...



or a list like

Code
 ... 
... Abt           IL606404001E10312008 ...
... Best Buy      MA16406582 E10312008 ...
... Circuit City  PR6030020  E10312008 ...
... TigerDirect   IL606404006E10312008 ...
... CompUSA       NY57402346 E10312008 ...
...



I made those zips up, sorry for anyone in Massachusetts, New York, Puerto Rico, and Illinois!


FishMonger
Veteran / Moderator

Oct 31, 2008, 2:05 PM

Post #12 of 15 (2498 views)
Re: [randizzle] string processing efficiency

I like Kevin's idea of using unpack. It does reduce the readability for those who don't understand how pack works, but it is more efficient.

We can increase the efficiency a little more as well as condense it a little more by dropping chomp and the assignment of the 3 vars ($s,$z,$d).


Code
$zipLength = $ARGV[1] || 9; 

while (<DATA>) {
printf "%s%0${zipLength}d%s\n", unpack("A20A${zipLength}A*", $_);
}

__DATA__
... Abt           IL606404001E10312008 ...
... Best Buy      MA16406582 E10312008 ...
... Circuit City  PR6030020  E10312008 ...
... TigerDirect   IL606404006E10312008 ...
... CompUSA       NY57402346 E10312008 ...

Should output:

Code
... Abt           IL606404001E10312008 ...
... Best Buy      MA016406582E10312008 ...
... Circuit City  PR006030020E10312008 ...
... TigerDirect   IL606404006E10312008 ...
... CompUSA       NY057402346E10312008 ...



(This post was edited by FishMonger on Oct 31, 2008, 2:07 PM)


KevinR
Veteran


Oct 31, 2008, 7:01 PM

Post #13 of 15 (2493 views)
Re: [randizzle] string processing efficiency


In Reply To

Quote
Although I think you could use unpack() to good use since the length of the lines will either be 15 or 19 characters.



It's just the substring zip that will either be 9 or 5 characters.

Actually the lines can have any arbitrary size (1,000+ bytes!) and any number of fields at whatever size the clients feel like sending me. They're not techy, so they'll have data in a spreadsheet and send it to me, sometimes in a spreadsheet, sometimes in text, sometimes delimited, sometimes fixed width, sometimes (grrr) with no eols so I have a text file with one line that is 190,000,000,000 or so bytes (it's probably an unconverted Mac file).

I do need this to be sort of readable, but thanks for your super efficient method.


Hard to know what to suggest if the lines can be any arbitrary size. There could be many many false matches in a file that could be one long line. There has to be a rule or a set of rules that can be applied to the problem, if not, it will take a lot of filtering and double-checking to add those missing zeros.

I assumed the lines were what you posted.


randizzle
Novice

Nov 1, 2008, 2:22 AM

Post #14 of 15 (2491 views)
Re: [KevinR] string processing efficiency


Quote
Hard to know what to suggest if the lines can be any arbitrary size. There could be many many false matches in a file that could be one long line. There has to be a rule or a set of rules that can be applied to the problem, if not, it will take a lot of filtering and double-checking to add those missing zeros.


KevinR, you're making it more difficult than it really is. There don't need to be any rules; I vim the file and see for myself what position the zip code is in.

Let's just start at the point where I have a 5 (or 9) character (digits) field in each record, and some will have trailing spaces. For x trailing spaces, shift 5 (or 9) - x characters in the field to the right x times, and insert x zeroes in the left.

The steps as I see them (9 digit field length):
1254854__ (2 spaces at the end)
11254854_ (shift 9-2=7 chars to the right once)
111254854 (twice)
011254854 (insert one zero)
001254854 (and a second zero)
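The steps above can be collapsed into a single substitution: capture the digits and the trailing blanks, then emit one zero per blank followed by the digits. This is only a sketch, assuming the field contains nothing but digits and trailing spaces; shift_and_fill is a made-up name.

```perl
use strict;
use warnings;

# shift_and_fill is a hypothetical helper that performs the
# "shift right, insert zeros on the left" steps in one pass.
sub shift_and_fill {
    my ($field) = @_;
    # $1 = the digits, $2 = the trailing blanks; one '0' per blank.
    $field =~ s/^(\d+)( +)$/('0' x length($2)) . $1/e;
    return $field;
}

print shift_and_fill('1254854  '), "\n";   # prints 001254854
```

A field with no trailing blanks does not match the pattern and is returned as it was.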

Like I said, using regex there has to be a faster way to do it, and all of your replies are leading me there.

I'll know exactly where the zip code field will be, no worries about that. I only work on one file at a time, and that's also the main reason I have command line args (that I will literally type myself):

Code
>perl zipfix.pl $0 $1 $2 $3 
$0 = zip code position (integer)
$1 = zip code length (5 or 9)
$2 = input file
$3 = output file


I'll just do

Code
my $line = $_; 
my $zipCode = substr($line,$zipStart,$zipLength);


and I have, in $zipCode, a 5- or 9-byte length string that may or may not need to be fixed. This part is solved already. And yes I will also do

Code
substr($line,$zipStart,$zipLength) = $zipCodeFixed;


to reinsert the corrected zip back in its place.
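Put together, the extract, fix, and reinsert sequence might look like the sketch below. It assumes a zero-based offset, as substr uses (so "position 3" from the first post becomes offset 2), and it leaves the line untouched when the field is not digits followed by blanks. The name fix_line is made up for illustration.

```perl
use strict;
use warnings;

# fix_line is a hypothetical helper: it zero-pads the zip field
# found at $zip_start (zero-based) with width $zip_length.
sub fix_line {
    my ($line, $zip_start, $zip_length) = @_;

    # Skip lines too short to contain the whole field.
    return $line if length($line) < $zip_start + $zip_length;

    my $zip = substr($line, $zip_start, $zip_length);

    # Only touch fields that are digits followed by optional blanks.
    return $line unless $zip =~ /^(\d+) *$/;

    # Assigning to substr() replaces the field in place.
    substr($line, $zip_start, $zip_length) = sprintf "%0${zip_length}d", $1;
    return $line;
}

print fix_line('MA1526 10312008', 2, 5), "\n";       # prints MA0152610312008
print fix_line('PR1526521  10312008', 2, 9), "\n";   # prints PR00152652110312008
```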

The fixing is all I really need help on. I tried your code and it almost works, just for some reason it changes the field length in the output file to 7 (I didn't see any numbers in your code that would do that) and it doesn't add zeroes or remove trailing spaces. I think I can figure the rest out on my own though, thanks for your help. I'll post back during the work week if I need more help.
Thanks again!


FishMonger
Veteran / Moderator

Nov 1, 2008, 6:03 AM

Post #15 of 15 (2484 views)
Re: [randizzle] string processing efficiency

Did you test the last version I posted which used a slightly modified version of Kevin's unpack approach?

We'd need to run some benchmark tests, but I'm willing to bet that using unpack would be more efficient than using the combination of substr and a regex.
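One way to run such a comparison is the core Benchmark module. The sketch below is only a starting point: it times both approaches on a single sample line, the sub names are made up, and the numbers it prints will vary by machine and Perl version, so nothing is claimed here about which one wins.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = 'PR1526521  10312008';   # sample record with a 9-character zip field

# Hypothetical test subs; both return the corrected line.
sub with_unpack {
    my ($state, $zip, $rest) = unpack 'A2 A9 A*', $line;
    return sprintf '%s%09d%s', $state, $zip, $rest;
}

sub with_substr {
    my $copy = $line;
    my $zip  = substr($copy, 2, 9);
    $zip =~ s/ +$//;                             # strip the trailing blanks
    substr($copy, 2, 9) = sprintf '%09d', $zip;  # write the padded zip back
    return $copy;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'unpack'       => \&with_unpack,
    'substr+regex' => \&with_substr,
});
```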

Here's a more complete test script example that uses the unpack approach and in this example, it also strips out the leading and trailing dots and spaces from your sample data.


Code
[root@perlman ~]# cat randizzle.pl 
#!/usr/bin/perl

use strict;
use warnings;

my ($zipStart, $zipLength) = @ARGV;

while ( my $line = <DATA>) {

$line =~ s/[. ]+$//; # remove trailing spaces and dots

$line = sprintf "%s%0${zipLength}d%s\n",
unpack("A${zipStart}A${zipLength}A*", $line);

$line =~ s/^[. ]+//; # remove leading spaces and dots

print $line;
}

__DATA__
... Abt           IL60640E10312008 ...
... Best Buy      MA1640 E10312008 ...
... Circuit City  PR603  E10312008 ...
... TigerDirect   IL60640E10312008 ...
... CompUSA       NY5740 E10312008 ...

Note that the character class in both substitutions holds a dot as well as a space.


Code
[root@perlman ~]# ./randizzle.pl 20 5 
Abt           IL60640E10312008
Best Buy      MA01640E10312008
Circuit City  PR00603E10312008
TigerDirect   IL60640E10312008
CompUSA       NY05740E10312008


 
 

