Complex regular subexpression recursion limit

 



trutwijd
New User

Apr 7, 2005, 9:16 AM

Post #1 of 5 (3315 views)
Complex regular subexpression recursion limit

I'm running a program under ActiveState Perl on Windows XP/2000 that is crashing due to the above error message. The exact text is:

Complex regular subexpression recursion limit (32766) exceeded

The program terminates hard, and I only realized this was the reason after running with -w. The program parses a large number of XML files, and the regular expression it dies on is one I wrote to remove XML comments containing a certain phrase, for example any XML comment containing the word "josh": <!-- Hey josh, strip this comment -->

$text =~ s/(<!--(?:(?!<!--|-->).)*josh(?:(?!<!--|-->).)*-->)//gi;

I used the lookahead assertion syntax after reading about it in the Perl Cookbook, which also explains how hard it is to use regular expressions to work with HTML/XML.

The program chokes on some large XML files. Is there any way to increase this recursion limit? (A Google search seems to indicate the answer is no.) If not, is there a better way to write the regular expression to do the same task?

Thanks,

Josh
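
One possible rewrite, offered as a sketch rather than a confirmed fix: the perldiag entry for this warning suggests simplifying the pattern or doing the looping in Perl code. The version below matches each comment with a plain non-greedy .*? (a quantified single atom rather than a quantified group) and decides in the replacement code whether to drop the comment. The sub name strip_comments_containing is made up for illustration. Unlike the original pattern, a match here can run across a stray nested <!--, so results may differ on malformed input.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every <!-- ... --> comment whose text
# contains $word (case-insensitive), and leave other comments alone.
sub strip_comments_containing {
    my ($text, $word) = @_;

    # /s lets . match newlines so multi-line comments are handled;
    # /e evaluates the replacement as Perl code for each comment found.
    $text =~ s{(<!--.*?-->)}{
        index(lc($1), lc($word)) >= 0 ? '' : $1
    }gse;

    return $text;
}
```

It would be called as $text = strip_comments_containing($text, 'josh'); in place of the original substitution.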


davorg
Thaumaturge / Moderator

Apr 8, 2005, 5:42 AM

Post #2 of 5 (3305 views)
Re: [trutwijd] Complex regular subexpression recursion limit

Do not use regular expressions to parse HTML or XML.

If you're parsing HTML then use an HTML parser.

If you're parsing XML then use an XML parser.

But _never_ use regular expressions to parse HTML or XML.

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


trutwijd
New User

Apr 8, 2005, 6:03 AM

Post #3 of 5 (3303 views)
Re: [davorg] Complex regular subexpression recursion limit

In theory that's a great idea, but in the real world not all XML is loadable by a parser. We're using an XML editor here that, for better or worse, leaves junk in a lot of the files that makes them invalid to Perl's parsers; I've tried almost all of 'em. The regular expression is attempting to strip some of that junk out so that further on in the program the file can be loaded by a parser.

Even if I could load all these files in an XML parser without pre-processing, how does that help me with my problem? I don't know of any XML parser that can strip a comment tag based on its contents.


(This post was edited by trutwijd on Apr 8, 2005, 6:04 AM)


davorg
Thaumaturge / Moderator

Apr 8, 2005, 6:12 AM

Post #4 of 5 (3300 views)
Re: [trutwijd] Complex regular subexpression recursion limit


In Reply To
In theory that's a great idea, but in the real world not all XML is loadable by a parser. We're using an XML editor here that, for better or worse, leaves junk in a lot of the files that makes them invalid to Perl's parsers; I've tried almost all of 'em. The regular expression is attempting to strip some of that junk out so that further on in the program the file can be loaded by a parser.


Actually you have that logic turned on its head. By definition, XML is data that can be parsed by an XML parser. If the data you are getting can't be parsed by an XML parser, then it isn't XML. Perhaps you should punt the problem back to the people who create the data and ask them to provide _valid_ XML.


In Reply To
Even if I could load all these files in an XML parser without pre-processing, how does that help me with my problem? I don't know of any XML parser that can strip a comment tag based on its contents.


I use XML::LibXML for all my XML work. I've never had to bother with comments but a quick glance through the documentation shows that there is an XML::LibXML::Comment subclass of the main node class which gives you information about the contents of a comment. So it looks like you could use that.
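
A minimal sketch of that idea, assuming the input is already well-formed enough for XML::LibXML to load (which the files discussed in this thread may not be). It selects comment nodes with the XPath expression //comment(), reads each one's text with data, and detaches matching nodes with unbindNode. The sub name remove_comments_matching is hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse $xml, drop every comment node whose text
# contains $word (case-insensitive), and return the serialized document.
sub remove_comments_matching {
    my ($xml, $word) = @_;

    # load_xml dies if the string is not well-formed XML.
    my $doc = XML::LibXML->load_xml(string => $xml);

    for my $comment ($doc->findnodes('//comment()')) {
        # data returns the comment text without the <!-- and --> delimiters.
        $comment->unbindNode if $comment->data =~ /\Q$word\E/i;
    }

    return $doc->toString;
}
```

The filtering is done in Perl rather than in XPath because XPath 1.0's contains() is case-sensitive.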



trutwijd
New User

Apr 8, 2005, 6:35 AM

Post #5 of 5 (3297 views)
Re: [davorg] Complex regular subexpression recursion limit

I REALLY wish I could punt this back to the source of the problem; I've tried. That road led nowhere, unfortunately. Much of what this program was written for is finding all the problems that this editor tool is inserting into the XML. I have to deal with garbage like duplicate attributes, attributes not separated by spaces, invalid DTD definitions, characters outside the defined character set, etc. It's a big mess.

Anyway, thanks for the LibXML link. This program currently uses LibXML to check whether the XML chunk is valid. The regular expression that fails runs before this step, though.
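
For reference, a check of this kind can be sketched as below, assuming XML::LibXML's load_xml, which dies on a parse error. It only tests well-formedness, not validity against a DTD. The sub name is_well_formed is hypothetical, and the program's actual check is not shown in this thread.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return 1 if $xml parses as well-formed XML, else 0.
sub is_well_formed {
    my ($xml) = @_;

    # eval traps the exception load_xml throws on a parse error,
    # leaving $doc undefined in that case.
    my $doc = eval { XML::LibXML->load_xml(string => $xml) };

    return defined $doc ? 1 : 0;
}
```

A pre-processing step such as the comment-stripping substitution would run on the raw text first, and only text for which this returns 1 would go on to further parser-based handling.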

 
 

