Jul 3, 2000, 7:01 AM
Post #2 of 3
There is an O'Reilly book called 'Mastering Regular Expressions', and the author walks through this scenario. I totally recommend getting it. Here is the regular expression, followed by my attempt to explain it.
Re: need help with html parser
/(<[^\/>]+>)([^<]*)(<\/[^>]+>)?/
In the first set of parentheses I match a less-than sign, one or more characters that are neither a slash nor a greater-than sign, and then the greater-than sign. I include the slash in the negated character class so that I don't match a closing HTML tag.
Then I match any number of characters that are not a less-than sign. That gets your text.
Lastly, in the third set of parentheses, I match a less-than sign, a slash, one or more characters that are not a greater-than sign, and then the greater-than sign. This gets the closing HTML tag. The question mark after that last set makes it optional.
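To make the three groups concrete, here is a small sketch in Python (my choice for illustration only; the pattern itself is the one described in the three steps above, with no whitespace inside the groups and without the surrounding slashes). The names TAG_TEXT and split_tag are made up for this example.

```python
import re

# The pattern described above, written as a Python raw string.
# Group 1: opening tag, group 2: text, group 3: optional closing tag.
TAG_TEXT = re.compile(r"(<[^/>]+>)([^<]*)(</[^>]+>)?")


def split_tag(html):
    """Return (opening tag, text, closing tag or None) for the first match.

    Returns None when no opening tag is found at all.
    """
    match = TAG_TEXT.search(html)
    return match.groups() if match else None


print(split_tag("<b>Hello world</b>"))  # ('<b>', 'Hello world', '</b>')
print(split_tag("<p>No closing tag"))   # ('<p>', 'No closing tag', None)
```

The second call shows what the trailing question mark buys you: the match still succeeds, and the third group simply comes back empty.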
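This should work for the situation you described, but I doubt it will work for every way HTML can be written. For example, an opening tag with a slash inside an attribute value, such as a URL in an href, will not match the first set of parentheses, and nested tags are not handled.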
If you are going to do really complex HTML parsing, then I would recommend writing a parsing program.
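As a rough idea of what such a program might look like, here is a sketch built on the HTMLParser class from Python's standard library. This is only one possible approach, not something from the book, and the names TagTextCollector and collect are hypothetical. It keeps a stack of open tags and records each piece of text together with the innermost tag it sits in.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect (tag, text) pairs for text found inside each element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.pairs = []      # (innermost open tag, text) in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, skipping any unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            self.pairs.append((self.open_tags[-1], text))


def collect(html):
    """Parse an HTML string and return the collected (tag, text) pairs."""
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


print(collect('<p>See <a href="/docs/index.html">the docs</a> now</p>'))
# [('p', 'See'), ('a', 'the docs'), ('p', 'now')]
```

Unlike the single regular expression, this copes with slashes in attribute values and with tags nested inside other tags.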
Hope this helps. Good luck.