Optimising a regex for stripping unwanted whitespace



May 20, 2008, 10:36 AM

Post #1 of 3 (7648 views)
Optimising a regex for stripping unwanted whitespace and newlines from an XML file?

I have a program that writes data as XML currently using XML::Structured. One side effect is that if I want to write a value <INFLUENCE ID="3">1</INFLUENCE> the tag gets broken across three lines as

     <INFLUENCE ID="3">
       1
     </INFLUENCE>

which is acceptable but undesirable.

I'm currently using the following on the whole text of the XML file and repeating it once for each affected tag type (currently two).

   $rawdata =~ s/(<INFLUENCE\s[^>]+>)\n\s*(\S*)\n\s*(<\/INFLUENCE>)/$1$2$3/g;

Thinking further, I know it's a long way below optimum, and I'm not fond of the use of the "\S*" match in the middle.

Obviously I could use alternation to match all the tag types in one hit but I think I'd be better off testing for any tag with something like

   $rawdata =~ s/(<(\w+)\s[^>]*>)\n\s*([+\-0-9eE]*)\n\s*(<\/\2)/$1$3$4/g;

I'm not happy with the character class [+\-0-9eE]: I want to match any of the characters in a number, even if it's in scientific notation, and I might want to do the same for text. I'm wondering whether I should use a negated class such as [^\n<>], which should accept ANY single line that isn't an XML tag. I'm worried I could come unstuck with "greed", though, since it could match the whole line, including the whitespace at the beginning, if \s* failed to eat it first. I assume that with two greedy matches the first one "wins".
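
On the greed question: Perl works through a pattern from left to right, so the earlier \s* takes the leading whitespace first and gives it back only if the rest of the pattern cannot match otherwise. A negated class such as [^\n<>] would still pick up trailing spaces, though, and the numeric class as written has no ".", so decimals would not match. A minimal sketch of one way around both, assuming a made-up sample document (the ROOT and NAME tags and the $compact variable are illustrative, not from the original program), starts the capture on a non-space character and makes the rest lazy:

```perl
use strict;
use warnings;

# Hypothetical sample resembling the output described above.
my $rawdata = <<'XML';
<ROOT>
  <INFLUENCE ID="3">
    1.5e-3
  </INFLUENCE>
  <NAME LANG="en">
    some text
  </NAME>
</ROOT>
XML

# Work on a copy so the original text stays available for comparison.
(my $compact = $rawdata) =~ s{
    (<(\w+)\s[^>]*>)      # capture 1: opening tag with attributes; capture 2: tag name
    \s*\n\s*              # line break plus any whitespace around it
    ([^\s<>][^\n<>]*?)    # capture 3: one line of content, starting at a non-space
    \s*\n\s*              # trailing whitespace, line break, indentation
    (</\2>)               # capture 4: closing tag with the same name
}{$1$3$4}gx;

print $compact;
```

With this sample the script prints the INFLUENCE and NAME elements each on a single line (<INFLUENCE ID="3">1.5e-3</INFLUENCE> and <NAME LANG="en">some text</NAME>) and leaves ROOT alone, because ROOT has no attributes and the pattern requires whitespace after the tag name. Adding the closing ">" after \2 stops a tag name from matching a longer name that merely starts with it.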


Jun 9, 2008, 5:43 AM

Post #2 of 3 (7222 views)
Re: [BorisE] Optimising a regex for stripping unwanted whitespace

Some ideas:

1. use \S*? instead of \S*
2. use two substitute operations:


This is less error-prone and less likely to leave you wondering "how come it's not working???"
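
A minimal sketch of what a two-step approach could look like. The patterns and the variable names ($xml, $step1, $step2) are assumptions made for illustration, not the code from this reply. The first substitution removes the line break after an opening tag that is followed by text, and the second removes the line break between text and a closing tag:

```perl
use strict;
use warnings;

# Hypothetical input: one value split across three lines.
my $xml = qq{<INFLUENCE ID="3">\n    1\n  </INFLUENCE>\n};

# Step 1: drop the line break and indentation after an opening tag,
# but only when text (not another tag) follows.
(my $step1 = $xml) =~ s/(<\w+\s[^>]*>)\s*\n\s*(?=[^\s<])/$1/g;

# Step 2: drop the line break and indentation between text and a
# closing tag; the lookbehind skips closing tags that follow other tags.
(my $step2 = $step1) =~ s/(?<=[^\s>])\s*\n\s*(<\/\w+>)/$1/g;

print $step2;
```

Each pattern is short enough to test on its own, which makes it easier to see which half is misbehaving when a tag fails to collapse. Note that step 2 does not check that the closing tag matches the opening one, so it is only suitable when the file has no mixed content.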


Jun 25, 2008, 4:48 AM

Post #3 of 3 (7118 views)
Re: [BorisE] Optimising a regex for stripping unwanted whitespace


This forum is about regular expressions, but I have a different solution to your problem that doesn't rely on one big RegExp.

You can split your file on the "\n" character, so that you get each line as an array element. Then traverse the array and create a new file with the same XML tags, keeping a "\n" wherever required.
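
A rough sketch of that line-by-line idea, assuming the whole file is already in a string. The helper name join_split_tags is made up for this example, and small per-line patterns are still used to recognise the tags:

```perl
use strict;
use warnings;

# Hypothetical helper: joins "open tag / value / close tag" triples
# that were written on three separate lines.
sub join_split_tags {
    my ($text) = @_;
    my @lines = split /\n/, $text;
    my @out;
    my $i = 0;
    while ($i < @lines) {
        my $line = $lines[$i];
        # An opening tag with attributes, alone on its line.
        my ($tag) = $line =~ /^\s*<(\w+)\s[^>]*>\s*$/;
        if (defined $tag
            && $i + 2 <= $#lines
            && $lines[$i + 1] !~ /[<>]/                    # value line holds no markup
            && $lines[$i + 2] =~ /^\s*<\/\Q$tag\E>\s*$/)   # matching closing tag
        {
            (my $value = $lines[$i + 1]) =~ s/^\s+|\s+$//g;
            (my $close = $lines[$i + 2]) =~ s/^\s+|\s+$//g;
            $line =~ s/\s+$//;
            push @out, $line . $value . $close;   # keep the opening tag's indentation
            $i += 3;
        }
        else {
            push @out, $line;
            $i++;
        }
    }
    # Assumes the input ended with a newline and restores it.
    return join("\n", @out) . "\n";
}

print join_split_tags(qq{<ROOT>\n  <INFLUENCE ID="3">\n    1\n  </INFLUENCE>\n</ROOT>\n});
```

This keeps every other line exactly as it was and only touches triples where the middle line contains no angle brackets, at the cost of more code than a single substitution.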

