need help with html parser

 



shaba
Deleted

Jul 1, 2000, 9:50 PM

Post #1 of 3 (2576 views)

hi!

i need some help coming up with a parser that reads in a line of text from an html file and looks for any html tags and separates those from just plain text. i am not yet comfortable with regular expressions and i am having a lot of trouble. so far i have tried this:

if ($line =~ /(<(.+)> )?/) { print $1; }

this works fine for printing out the html tags but i still don't know how to extract the text part of the line out.

basically, if
$line = <b> here is some bolded text </b> then i need $1 = <b>, $2 = here is ..., $3 = </b>.

my main problem is that $line could also be
<b> here is some bolded text
and i still need $1 = <b> and $2 = here is ...

please help!

Shaheeb R.


mckhendry
Deleted

Jul 3, 2000, 7:01 AM

Post #2 of 3 (2576 views)
Re: need help with html parser

There is an O'Reilly book out called 'Mastering Regular Expressions', and the author walks through this scenario. I totally recommend getting it. Here is the regular expression, and now I'll try to explain it.

/(<[^\/>]+>)([^<]*)(<\/[^>]+>)?/

In the first set of parentheses I match a less-than sign, one or more characters that are not a slash or a greater-than sign, and then the greater than sign. I include the slash in the negated character class so that I don't match a closing HTML tag.

Then I match any number of characters that are not a less-than sign. That gets your text.

Lastly I match a less-than sign, a slash, one or more characters that are not a greater-than sign, and then the greater-than sign. This gets the closing HTML tag. The question mark after that last set makes it optional.

This should work for the situation you described, but I doubt it will work for every way that HTML can be written.
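
To see the three groups in action, here is a minimal sketch that runs the pattern against both example lines from the question. The helper name split_line is made up for illustration, and the slashes inside the pattern are backslash-escaped on the assumption that the pattern sits between / delimiters. The whitespace trim on the text is an extra step that the explanation above does not mention.

```perl
use strict;
use warnings;

# The three-group pattern described above, slashes escaped for the // delimiters.
my $re = qr/(<[^\/>]+>)([^<]*)(<\/[^>]+>)?/;

# Hypothetical helper: returns (opening tag, text, closing tag or undef),
# or an empty list when the line has no opening tag.
sub split_line {
    my ($line) = @_;
    return unless $line =~ $re;
    my ($open, $text, $close) = ($1, $2, $3);
    $text =~ s/^\s+|\s+$//g;    # trim whitespace around the text (an addition)
    return ($open, $text, $close);
}

for my $line ('<b> here is some bolded text </b>',
              '<b> here is some bolded text') {
    my ($open, $text, $close) = split_line($line);
    printf "%s | %s | %s\n", $open, $text, defined $close ? $close : '(none)';
}
```

When the closing tag is missing, the third group does not take part in the match, so $3 is undefined and should be checked with defined before it is used.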

If you are going to do really complex HTML parsing then I would recommend writing a parsing program.
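
As a rough idea of what the smallest such program might look like, here is a sketch of a tokenizer that handles any number of tags on a line by using split with a capturing group. The name tokenize and the 'tag'/'text' labels are invented for this example. It assumes a tag never contains a > character inside an attribute value, and it knows nothing about comments or script blocks, so it is still only suitable for simple input.

```perl
use strict;
use warnings;

# Hypothetical helper: break a line into [type, value] pairs,
# where type is either 'tag' or 'text'.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    # The capturing group makes split return the tags as well as the text between them.
    for my $piece (split /(<[^>]*>)/, $line) {
        next unless $piece =~ /\S/;          # skip empty or whitespace-only pieces
        if ($piece =~ /^<[^>]*>$/) {
            push @tokens, ['tag', $piece];
        }
        else {
            my $text = $piece;
            $text =~ s/^\s+|\s+$//g;         # trim surrounding whitespace
            push @tokens, ['text', $text];
        }
    }
    return @tokens;
}

print "$_->[0]: $_->[1]\n" for tokenize('<p>some <b>bold</b> text');
```

Because every piece is labeled, the caller does not need to know in advance how many tags a line holds or whether a closing tag is present.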

Hope this helps. Good luck


errr
Deleted

Jul 5, 2000, 8:12 PM

Post #3 of 3 (2576 views)
Re: need help with html parser

Hate to spoil the regex fun, but HTML cannot be parsed well with a regex. Use the HTML::Parse module: http://www.perl.com/pub/doc/manual/html/pod/perlfaq9.html#How_do_I_remove_HTML_from_a_stri

you're doing simple stuff so maybe not what you want.. but modules are always good
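
For anyone who wants to try the module route, here is a hedged sketch. Note the swap: it uses HTML::Parser (the event-driven CPAN module, version 3 API) and not the HTML::Parse module named above. HTML::Parser is not part of core Perl, so the sketch checks that it is installed before using it. The @tokens layout is made up for illustration.

```perl
use strict;
use warnings;

# HTML::Parser comes from CPAN, not core Perl, so check for it first.
my $have_parser = eval { require HTML::Parser; 1 };

my @tokens;    # filled with [type, value] pairs (layout invented for this example)

if ($have_parser) {
    my $p = HTML::Parser->new(
        api_version => 3,
        # 'text' passes the original source of the tag, 'dtext' the decoded text
        start_h => [ sub { push @tokens, ['tag',  $_[0]] }, 'text'  ],
        end_h   => [ sub { push @tokens, ['tag',  $_[0]] }, 'text'  ],
        text_h  => [ sub { push @tokens, ['text', $_[0]] }, 'dtext' ],
    );
    $p->unbroken_text(1);    # deliver each run of text in one piece
    $p->parse('<b> here is some bolded text </b>');
    $p->eof;                 # signal end of input so buffered text is flushed
    print "$_->[0]: $_->[1]\n" for @tokens;
}
else {
    print "HTML::Parser is not installed\n";
}
```

The handlers receive each tag and each run of text as separate events, so a missing closing tag simply means one fewer 'tag' entry.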

 
 

