
japhy
Enthusiast
Jan 24, 2000, 7:53 AM
Post #3 of 3
(1633 views)
|
Re: How to remove HTML CODE sections.
There is a family of HTML:: modules on CPAN written specifically for parsing HTML correctly, but I'm not too familiar with them. I've been practicing writing regular expressions to parse complex strings, and I believe I've come up with one that matches ordinary HTML tags. I'm still working on matching comments, DTD declarations, and SSI directives.

This regex will match made-up tags too, like <AAA href="...">, and I'm still trying to find the W3C specification that says which characters tag names (both built-in and user-defined) may contain. For now the regex only allows letters, digits, and underscores in tag names. The attribute-name part also allows hyphens. It accepts opening and closing tags, attributes with or without a value, and values that are double-quoted, single-quoted, or unquoted (an unquoted value stops at whitespace or at the closing >).

code:

    $text =~ s{ < /? \w+ ( \s+ [-\w]+ ( \s*=\s* ( "[^"]*" | '[^']*' | [^\s>]+ ) )? )* \s* >}{}gx;

That regex matches and removes normal-looking HTML tags. Again, it's probably safer to use one of the HTML:: modules.
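For anyone who wants to try this out, here is a minimal sketch that wraps a variant of the regex above in a sub and runs it on a sample string. The sub name `strip_tags` and the sample input are made up for illustration. The optional slashes, the optional attribute value, and the use of `[^\s>]+` for unquoted values are assumptions of this sketch, so treat it as a starting point rather than a finished tag stripper.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this example): removes tag-like
# sequences from a string and returns what is left.
sub strip_tags {
    my ($text) = @_;
    $text =~ s{
        < /? \w+                    # opening bracket, optional slash, tag name
        (?:
            \s+ [-\w]+              # attribute name, hyphens allowed
            (?:
                \s* = \s*
                (?: "[^"]*" | '[^']*' | [^\s>]+ )   # quoted or unquoted value
            )?                      # the value part is optional
        )*
        \s* /? >                    # optional trailing slash, closing bracket
    }{}gx;
    return $text;
}

my $html = q{<p class="intro">Hello, <a href='x.html' target=_top>world</a>!<br /></p>};
print strip_tags($html), "\n";      # prints: Hello, world!
```

Comments such as `<!-- ... -->` are left alone because `!` is not a word character. On the other hand, plain text like `a<b and c>d` looks enough like a tag with two valueless attributes that it gets eaten, which is one more reason to prefer a real parser when the input is not under your control.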
|