regex across \n

 



zeppelinmage
New User

Oct 4, 2012, 1:54 AM

Post #1 of 7 (7052 views)
regex across \n

Hello, I am having issues with my Perl regex. For some reason I just can't get this to work.

I have some very large (50mb+) text files from which I need to clean up the data. There is a lot of garbage in it. The data I want looks like this:

UBUS01 KMSC 312300
COT UA /OV LRD360030 /TM 2350 /FL055 /TP PA32 /SK SKC /TB NEG /IC
NEG /RM /TA UNKN=


Basically, I need to extract all data between "UB" and "=" into $1 so I can operate on it later. It can be either two or three lines long.

My code is the following:

while (<>) {
/(UB.*=)/sm;
print OUT "$1\n"; #This prints to file so I can check the output. Once the code works, I'll be removing it.
}

The regex as coded keeps outputting empty $1. /(UB.*)/sm; returns the first line, /(.*=)/sm; returns the last line. I just can't put it all together. If it's better/easier, I do have an option to remove all \n from the file. (But then, if it's essentially all "one line", would the regex fire multiple times across that line?)

Thanks.


BillKSmith
Veteran

Oct 4, 2012, 5:42 AM

Post #2 of 7 (7044 views)
Re: [zeppelinmage] regex across \n

Your problem has nothing to do with the regular expression. It is fine as written. The problem is that the diamond operator (<>) reads one line at a time, and no single line contains both "UB" and "=", so nothing matches. When you change the regex to look for only one of them, only the first or the last line matches.

You want to 'slurp' the entire file into a single string before you do the match. The minimum change to correct your problem is to undefine the INPUT_RECORD_SEPARATOR ($/). Of course, your while loop would then run only once.

Code
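undef $/;            # no record separator: <DATA> returns the whole input as one string
while (<DATA>) {     # runs only once in slurp mode
    /(UB.*=)/sm;
    print "$1\n";    # prints the captured text so the output can be checked
}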
__DATA__
UBUS01 KMSC 312300
COT UA /OV LRD360030 /TM 2350 /FL055 /TP PA32 /SK SKC /TB NEG /IC
NEG /RM /TA UNKN=

Good Luck,
Bill


zeppelinmage
New User

Oct 4, 2012, 12:52 PM

Post #3 of 7 (7037 views)
Re: [BillKSmith] regex across \n

Okay, that helps somewhat - I now get multiple lines of output. Unfortunately, since I need to operate on each individual match, I need the while loop to fire per match.


Laurent_R
Veteran / Moderator

Oct 4, 2012, 3:48 PM

Post #4 of 7 (7030 views)
Re: [zeppelinmage] regex across \n


In Reply To
I have some very large (50mb+) text files from which I need to clean up the data.


By the standards of the files I work with, these are very SMALL files. Mine are typically between 10 and 20 GB, and sometimes up to 700 GB or even more.

;-)

You don't give enough information on your input file, but I would think that reading the file record by record, after having set the input record separator ($/) to "=" or to "=\n", would probably help you very much. Each read then returns one chunk ending in "=" instead of one line.
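
A minimal sketch of that idea, assuming every entry ends with "=" followed by a newline. The second sample entry and the helper name read_records are made up for illustration, and the sample is read from a string only so the example is self-contained:

```perl
use strict;
use warnings;

# Sample input held in a string; a real script would open the data file.
# The first entry is from the original post, the second one is invented.
my $sample = <<'END';
garbage line
UBUS01 KMSC 312300
COT UA /OV LRD360030 /TM 2350 /FL055 /TP PA32 /SK SKC /TB NEG /IC
NEG /RM /TA UNKN=
UBUS01 KABC 312305
XYZ UA /OV ABC /TM 2355 /FL080 /TP C172 /SK SKC=
END

# read_records (hypothetical helper) splits text into chunks ending in "=\n".
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "=\n";    # each read returns everything up to and including "=\n"
    my @records;
    while (my $record = <$fh>) {
        chomp $record;   # chomp strips the current $/, i.e. the trailing "=\n"
        push @records, $record;
    }
    close $fh;
    return @records;
}

my @records = read_records($sample);
print scalar(@records), " records\n";
print "---\n$_\n" for @records;
```

Each record still carries whatever garbage preceded the entry, so a regex (or a further split) is needed to cut it down to the wanted part. Only one record is held in memory at a time, which also matters for much larger files.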


zeppelinmage
New User

Oct 4, 2012, 4:51 PM

Post #5 of 7 (7027 views)
Re: [Laurent_R] regex across \n


If I can't open the text file in Notepad and parse it by hand, to me it's very large. ;) (I don't do this very often.)

My complete input file can be found here: http://vortex.plymouth.edu/~stjones00/Apr10.txt

The problem I have is there are incomplete entries mixed in with complete entries (plus other extraneous entries I don't want), so I need to parse out the wanted data. (It will begin with UA or UUA and end with =, but I need the leading line, hence my beginning the pattern with UB.) Then I need to take these individual, complete entries and perform some operations on them. (Test for a specific value, remove \n, etc.)

My thought was to match the pattern into $1 and send that to a subroutine to perform the operations.


BillKSmith
Veteran

Oct 4, 2012, 8:42 PM

Post #6 of 7 (7023 views)
Re: [zeppelinmage] regex across \n

On a slurped file, your regex will make only one match (i.e. everything from the first UB to the last =), because .* is greedy. The non-greedy equivalent (.*?) stops at the first = it can, and would probably be closer to what you want.
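
To illustrate the difference, here is a small sketch on a made-up string standing in for a slurped file (the two entries are invented, not taken from the real data). It also shows one way to get a loop that fires once per match, using the /g modifier in a while condition:

```perl
use strict;
use warnings;

# Two invented entries with junk between them.
my $text = "junk\nUBUS01 AAA\nUA one=\nmore junk\nUBUS01 BBB\nUUA two=\n";

# Greedy: .* runs to the LAST "=", so both entries come back as one match.
my @greedy = $text =~ /(UB.*=)/sg;

# Non-greedy: .*? stops at the FIRST "=" after each "UB".
my @lazy = $text =~ /(UB.*?=)/sg;

print "greedy matches: ", scalar(@greedy), "\n";    # 1
print "lazy matches:   ", scalar(@lazy), "\n";      # 2

# In scalar context, /g resumes where the previous match ended,
# so the loop body runs once per match.
while ($text =~ /(UB.*?=)/sg) {
    my $entry = $1;
    $entry =~ tr/\n/ /;    # example operation: fold the entry onto one line
    print "$entry\n";
}
```

One caveat: if an incomplete entry starts with "UB" but has no "=" of its own, the non-greedy match will run on into the next entry, so the pattern may need tightening for real data.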

I would certainly try Laurent's suggestion first. Break the file into strings ending in '='.
Good Luck,
Bill


zeppelinmage
New User

Oct 4, 2012, 9:49 PM

Post #7 of 7 (7019 views)
Re: [BillKSmith] regex across \n

It worked! Thanks, guys.

For the record, this is what I ended up doing:


Code
$/ = "="; 
while (<DATA>) {
/(UB.*?U+A.*?=)$/sm;
print OUT "$1\n";
}
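
For anyone adapting this, a slightly more defensive sketch of the same loop. It is not the code from the post: it skips chunks where the pattern does not match (instead of printing an undefined $1), anchors with \z rather than $ with /m, and hands each entry to a subroutine. The names process_entry and extract_entries and the sample text are assumptions for illustration:

```perl
use strict;
use warnings;

# process_entry is a hypothetical placeholder for the per-entry work
# (testing for a value, removing newlines, and so on).
sub process_entry {
    my ($entry) = @_;
    $entry =~ s/\n/ /g;    # join the lines of the entry with spaces
    return $entry;
}

# extract_entries reads "="-terminated chunks from an open filehandle
# and returns the processed entries that match the pattern.
sub extract_entries {
    my ($fh) = @_;
    local $/ = "=";        # read up to and including each "="
    my @entries;
    while (my $chunk = <$fh>) {
        # Same pattern as above; \z anchors at the very end of the chunk.
        next unless $chunk =~ /(UB.*?U+A.*?=)\z/s;
        push @entries, process_entry($1);
    }
    return @entries;
}

# In-memory sample: one wanted entry (from the original post) plus noise.
my $sample = "noise line\nUBUS01 KMSC 312300\n"
           . "COT UA /OV LRD360030 /TM 2350 /FL055 /TP PA32 /SK SKC /TB NEG /IC\n"
           . "NEG /RM /TA UNKN=\nsome other entry without the marker=\n";

open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for extract_entries($in);
close $in;
```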


 
 

