HTML Anchors

 



ernie
stranger

Jul 9, 2001, 1:48 AM

Post #1 of 5 (5636 views)
HTML Anchors

Hi,

I need to create a regex that strips the HTML tags out of an HTML document while leaving the links (anchors) intact.
For example, I need the following text

<BODY><P><BR><A HREF="www.perlguru.com"><FONT SIZE=2>Perl Guru</FONT></A></P></BODY>

... to look like this ...

<A HREF="www.perlguru.com">Perl Guru</A>

So far I've assigned the HTML to a single scalar and stripped out comments, script tags, etc. The following will strip out ALL tags:

$HTML =~ s/(<[^>]*>)*//gi;

... but the A and /A tags are really causing me problems!


I've spent ages on this already. Can anyone help, please?

Thanks
Ernie
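
For readers following along: the pattern above cannot tell an anchor tag from any other tag, because `[^>]*` accepts every tag name. One possible way to express "every tag except A and /A" in a single substitution is a negative lookahead. The following is only a minimal sketch, not something proposed in the thread; the variable name `$stripped` is made up for illustration, and the pattern still assumes that no `>` appears inside attribute values or comments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ernie's sample input from the post above.
my $html = '<BODY><P><BR><A HREF="www.perlguru.com"><FONT SIZE=2>Perl Guru</FONT></A></P></BODY>';

# Remove every tag unless its name is exactly "a" (opening or closing).
# (?!/?a\b) is a negative lookahead: the match is refused when the text
# right after "<" is "a" or "/a" followed by a word boundary, so tags
# such as <address> are still removed.
(my $stripped = $html) =~ s{<(?!/?a\b)[^>]*>}{}gi;

print "$stripped\n";
```

On the sample input this leaves only the anchor and its text. It is still a regex, so it shares the usual weaknesses of regex-based HTML handling.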




mhx
Enthusiast

Jul 9, 2001, 2:50 AM

Post #2 of 5 (5633 views)
Re: HTML Anchors

Hi Ernie,

Have a look at the HTML::Parser package. Your task can be performed without a single regex (and perhaps more accurately) with the following code:

Code
#!/bin/perl -w 
use strict;
use HTML::Parser;

# Create some simple HTML
my $HTML = <<HTML;
<BODY><P><BR><!-- This is a comment -->
<A HREF="www.perlguru.com"><FONT SIZE=2>Perl Guru</FONT></A>
</P></BODY>
HTML

# Remove the linebreaks
$HTML =~ s/\r?\n//g;

# This will hold the parsed HTML
my $parsed;

# Set up a new HTML::Parser object
my $p = HTML::Parser->new( api_version => 3 );
$p->report_tags( 'a' ); # report only anchors
$p->handler( comment => '' ); # ignore comments
$p->handler( default => sub { $parsed .= shift }, "text" );
$p->parse( $HTML ); # parse the HTML

# Print the parsed HTML
print $parsed;

The output of that code is simply:

Code
<A HREF="www.perlguru.com">Perl Guru</A>

Hope this helps.

-- Marcus



BbBoy
stranger

Jul 10, 2001, 5:34 AM

Post #3 of 5 (5625 views)
Re: HTML Anchors

Or, just:

$HTML =~ s/<a(.*?)<\/a>/\[a$1\[\/a\]/ig;
$HTML =~ s/<(.*?)>//g;
$HTML =~ s/\[a(.*?)\[\/a]/<a$1<\/a>/ig;
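
The three substitutions work as a placeholder trick: hide each anchor pair behind square brackets, strip whatever tags remain, then turn the placeholders back into tags. A minimal sketch of the same steps wrapped in a sub, run on Ernie's sample, might look like this. The sub name `keep_anchors` is hypothetical and not from the thread; the patterns are the ones above, only written with `{}` delimiters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep <a>...</a> pairs, drop all other tags.
sub keep_anchors {
    my ($html) = @_;

    # 1. Hide each anchor pair behind "[a ... [/a]" placeholders.
    #    The backslash after $1 stops Perl from reading "$1[" as an
    #    array element.
    $html =~ s{<a(.*?)</a>}{\[a$1\[/a\]}ig;

    # 2. Strip every remaining "<...>" tag.
    $html =~ s{<(.*?)>}{}g;

    # 3. Turn the placeholders back into anchor tags.
    $html =~ s{\[a(.*?)\[/a\]}{<a$1</a>}ig;

    return $html;
}

my $sample = '<BODY><P><BR><A HREF="www.perlguru.com"><FONT SIZE=2>Perl Guru</FONT></A></P></BODY>';
print keep_anchors($sample), "\n";
```

Note that the restored tags come back in lowercase, because the replacement text spells them that way, and that `<a` without a word boundary would also match tags such as `<abbr`.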



ernie
stranger

Jul 10, 2001, 8:30 AM

Post #4 of 5 (5619 views)
Re: HTML Anchors

Thanks guys.

HTML::Parser didn't work for me, but I learnt a lot trying to get it (and TokeParser) to work.

The regular expression solution is exactly what I was after. It works perfectly.

Many thanks
Ernie




mhx
Enthusiast

Jul 12, 2001, 11:54 AM

Post #5 of 5 (5605 views)
Re: HTML Anchors

Hi Ernie & BbBoy,

I guess the regex solution will handle many cases.
But you can easily build some valid HTML that will make the regex produce very funny results ;-)
Take the following HTML code:

Code
<BODY><P><BR><!-- This is a comment --> 
<IMG SRC="pics/rightarrow.gif" ALT="->" WIDTH=200>
<A HREF="www.perlguru.com">
<FONT SIZE=2>Perl Guru</FONT><!-- /A> Ooops! --></A>

It's completely valid, and the filtered output should simply be

Code
<A HREF="www.perlguru.com">Perl Guru</A>

which is also what the HTML::Parser example delivers. But if you feed this through the regex, it will result in

Code
  
" WIDTH=200>

Perl Guru Ooops! -->

which is certainly not the wanted result.
OK, this is a constructed example, but I just wanted to point out that the regex may fail even if the HTML is valid. If this doesn't bother you, that's fine.
Just a small hint if you want to keep the regex: add the s modifier, so that the dot also matches newline characters and anchors spread over several lines are still matched. Alternatively, you can of course strip the newlines before applying the regex.

-- Marcus
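
To illustrate the hint about the s modifier, here is a small sketch that tests the anchor pattern against an anchor spread over two lines, first without and then with /s, and finally after removing the newlines. The sample string and variable names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An anchor whose content continues on the next line.
my $html = qq{<P><A HREF="www.perlguru.com">\n<FONT SIZE=2>Perl Guru</FONT></A></P>};

# Without /s the dot stops at the newline, so the pattern cannot
# reach the closing tag on the second line.
my $without_s = ($html =~ /<a(.*?)<\/a>/i) ? 'match' : 'no match';

# With /s the dot also matches "\n".
my $with_s = ($html =~ /<a(.*?)<\/a>/is) ? 'match' : 'no match';

# Alternative: remove the line breaks first, then /s is not needed.
(my $flat = $html) =~ s/\r?\n//g;
my $flattened = ($flat =~ /<a(.*?)<\/a>/i) ? 'match' : 'no match';

print "without /s:        $without_s\n";
print "with /s:           $with_s\n";
print "newlines removed:  $flattened\n";
```

Either fix only addresses tags split across lines; the problems with `>` inside attribute values and comments shown above remain.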

