Optimising a regex for stripping unwanted whitespace

 



BorisE
Novice

May 20, 2008, 10:36 AM

Post #1 of 3 (3378 views)
Optimising a regex for stripping unwanted whitespace

and newlines from an XML file?

I have a program that writes data as XML, currently using XML::Structured. One side effect is that when I want to write a value such as <INFLUENCE ID="3">1</INFLUENCE>, the element gets broken across three lines as


Code
     <INFLUENCE ID="3"> 
1
</INFLUENCE>




which is acceptable but undesirable.

I'm currently using the following on the whole text of the XML file and repeating it once for each affected tag type (currently two).


Code
   $rawdata =~ s/(<INFLUENCE\s[^>]+>)\n\s*(\S*)\n\s*(<\/INFLUENCE>)/$1$2$3/g;



Thinking further, I know it's a long way from optimal, and I'm not fond of the "\S*" match in the middle.

Obviously I could use alternation to match all the tag types in one hit, but I think I'd be better off testing for any tag with something like


Code
   $rawdata =~ s/(<(\w+)\s[^>]*>)\n\s*([+\-0-9eE]*)\n\s*(<\/\2)/$1$3$4/g;



I'm not happy with the character class [+\-0-9eE], as I want to match any of the characters in a number even if it's in scientific notation (and I might want to do the same for text). I'm wondering if I should use a negated class such as [^\n<>] instead, which should accept ANY single line that isn't an XML tag. My worry is that I could come unstuck with "greed": the class could match the whole line, including the whitespace at the beginning, if \s* failed to eat it first. I assume that with two greedy matches the first one "wins".
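The greed question can be checked with a small test. The following is only a sketch: the sample data is made up for illustration, and the closing `>` in the last group is an addition to the pattern above. It suggests that the leftmost greedy quantifier does get first claim, so \s* takes the leading whitespace before [^\n<>]* sees any of it.

```perl
use strict;
use warnings;

# Made-up sample in the shape described in the post.
my $rawdata = <<'XML';
<ROOT>
     <INFLUENCE ID="3">
1.5e-3
</INFLUENCE>
     <NAME LANG="en">
   some text
</NAME>
</ROOT>
XML

# \s* sits to the left of the negated class and is greedy, so it consumes
# the leading whitespace first; the class then captures the rest of the line.
$rawdata =~ s/(<(\w+)\s[^>]*>)\n\s*([^\n<>]*)\n\s*(<\/\2>)/$1$3$4/g;

print $rawdata;
```

Two caveats with this sketch: trailing whitespace on the value line would still end up inside the captured value, and the pattern only matches opening tags that carry at least one attribute, because of the \s after the tag name.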


meloyelo
User

Jun 9, 2008, 5:43 AM

Post #2 of 3 (2952 views)
Re: [BorisE] Optimising a regex for stripping unwanted whitespace

Some ideas:

1. use \S*? instead of \S*
2. use two substitute operations:

s{\s*</INFLUENCE>}{</INFLUENCE>}g;
s{(<INFLUENCE.*?>)\s*}{$1}g;

This is less prone to error and less likely to leave you wondering "how come it's not working???"
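As a rough illustration of the second idea, here are the two substitutions applied to a made-up sample held in a variable (the variable name and the data are assumptions, not taken from the original program):

```perl
use strict;
use warnings;

# Made-up sample; the second value has whitespace on both sides.
my $xml = join "\n",
    '<ROOT>',
    '     <INFLUENCE ID="3">',
    '1',
    '</INFLUENCE>',
    '     <INFLUENCE ID="4">',
    '  -2.5E+10  ',
    '</INFLUENCE>',
    '</ROOT>',
    '';

for ($xml) {
    # Remove any whitespace (newlines included) just before the closing tag.
    s{\s*</INFLUENCE>}{</INFLUENCE>}g;
    # Remove any whitespace just after the opening tag.
    s{(<INFLUENCE.*?>)\s*}{$1}g;
}

print $xml;
```

Because each substitution only looks at one side of the value, it never has to describe the value itself, and whitespace trailing the value is stripped as well. The tag name is still hard-coded, so this would need repeating (or an alternation) for each affected tag type.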


amodiahs
Novice

Jun 25, 2008, 4:48 AM

Post #3 of 3 (2848 views)
Re: [BorisE] Optimising a regex for stripping unwanted whitespace

Hello,

This forum is about regular expressions, but I have a different solution to your problem that does not rely on one big regex.

You can split the file contents on the "\n" character, so that each line ends up in its own array element. Then traverse the array and write a new file with the same XML tags, keeping a "\n" only where it is required.
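A minimal sketch of this line-by-line approach follows. The helper name join_split_tags is hypothetical, and small anchored patterns are still used to recognise a tag line; the point is only that each line is examined on its own rather than with one multi-line regex.

```perl
use strict;
use warnings;

# Hypothetical helper: joins an opening tag line, a tag-free value line and
# the matching closing tag line into a single line.
sub join_split_tags {
    my ($text) = @_;
    my @lines = split /\n/, $text;
    my @out;
    my $i = 0;

    while ($i < @lines) {
        # Is this line an opening tag on its own (indentation is kept)?
        my ($open, $name) = $lines[$i] =~ /^(\s*<(\w+)(?:\s[^>]*)?>)\s*$/;

        if (defined $name && $i + 2 < @lines) {
            my $value = $lines[$i + 1];
            # Is the line after the value the matching closing tag?
            my ($close) = $lines[$i + 2] =~ /^\s*(<\/\Q$name\E>)\s*$/;

            if (defined $close && $value !~ /[<>]/) {
                $value =~ s/^\s+|\s+$//g;    # trim both ends of the value
                push @out, $open . $value . $close;
                $i += 3;
                next;
            }
        }

        push @out, $lines[$i++];             # copy every other line unchanged
    }

    return join("\n", @out) . "\n";
}

print join_split_tags(qq{<ROOT>\n     <INFLUENCE ID="3">\n1\n</INFLUENCE>\n</ROOT>\n});
```

Unlike the regex versions above, this also handles tags without attributes, at the cost of more code. Note that the sketch always ends its output with a single newline, whatever the input ended with.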

 
 

