HTML Anchors



ernie
stranger

Jul 9, 2001, 1:48 AM


HTML Anchors

Hi,

I need to create a regex that strips the HTML tags from an HTML document but leaves the links (anchors) intact.
For example, I need the following text

<BODY><P><BR><A HREF="www.perlguru.com"><FONT SIZE=2>Perl Guru</FONT></A></P></BODY>

... to look like this ...

<A HREF="www.perlguru.com">Perl Guru</A>

So far I've assigned the HTML to a single scalar and stripped out comments, script tags, etc. The following will strip out ALL tags:

$HTML =~ s/(<[^>]*>)*//gi;

... but the A and /A tags are really causing me problems!


I've spent ages on this already. Can anyone help, please?

Thanks
Ernie




mhx
Enthusiast

Jul 9, 2001, 2:50 AM


Re: HTML Anchors

Hi Ernie,

have a look at the HTML::Parser module. Your task can be performed without a regex for the tags (and perhaps more accurately) with the following code:

Code
#!/usr/bin/perl -w
use strict;
use HTML::Parser;

# Create some simple HTML
my $HTML = <<HTML;
<BODY><P><BR><!-- This is a comment -->
<A HREF="www.perlguru.com"><FONT SIZE=2>Perl Guru</FONT></A>
</P></BODY>
HTML

# Remove the linebreaks
$HTML =~ s/\r?\n//g;

# This will hold the parsed HTML
my $parsed = '';

# Set up a new HTML::Parser object
my $p = HTML::Parser->new( api_version => 3 );
$p->report_tags( 'a' ); # report only anchors
$p->handler( comment => '' ); # ignore comments
$p->handler( default => sub { $parsed .= shift }, "text" );
$p->parse( $HTML ); # parse the HTML
$p->eof; # signal end of document and flush any buffered text

# Print the parsed HTML
print $parsed;

The output of that code is simply:

Code
<A HREF="www.perlguru.com">Perl Guru</A>

Hope this helps.

-- Marcus



BbBoy
stranger

Jul 10, 2001, 5:34 AM


Re: HTML Anchors

Or, just:

$HTML =~s/<a(.*?)<\/a>/\[a$1\[\/a\]/ig;
$HTML =~s/<(.*?)>//g;
$HTML =~s/\[a(.*?)\[\/a]/<a$1<\/a>/ig;
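For anyone following along, the three substitutions above can be read as protect, strip, restore: the anchor's angle brackets are temporarily swapped for square brackets, every remaining tag is deleted, and the anchor is put back. Below is a minimal sketch that applies the same three regexes unchanged and prints the string after each step. The sub name anchor_stages and the surrounding scaffolding are hypothetical and only there for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: runs the three substitutions from the post and
# returns the string as it looks after each one.
sub anchor_stages {
    my ($html) = @_;
    my @stages;

    # 1. Protect: rewrite "<a ...</a>" as "[a ...[/a]" so that the
    #    outer anchor brackets no longer look like a tag.
    $html =~ s/<a(.*?)<\/a>/\[a$1\[\/a\]/ig;
    push @stages, $html;

    # 2. Strip: delete everything that still looks like "<...>".
    $html =~ s/<(.*?)>//g;
    push @stages, $html;

    # 3. Restore: turn the placeholders back into a real anchor.
    $html =~ s/\[a(.*?)\[\/a]/<a$1<\/a>/ig;
    push @stages, $html;

    return @stages;
}

my $html = '<BODY><P><BR><A HREF="www.perlguru.com">'
         . '<FONT SIZE=2>Perl Guru</FONT></A></P></BODY>';
print "$_\n" for anchor_stages($html);
```

Two side effects worth knowing about: the restored tag name comes back as a lowercase "a" whatever the input case was, and any literal "[a ...[/a]" already present in the text would be turned into an anchor by the last step.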



ernie
stranger

Jul 10, 2001, 8:30 AM


Re: HTML Anchors

Thanks guys.

HTML::Parser didn't work for me, but I learnt a lot trying to get it (and TokeParser) to work.

The regular expression solution is exactly what I was after. It works perfectly.

Many thanks
Ernie




mhx
Enthusiast

Jul 12, 2001, 11:54 AM


Re: HTML Anchors

Hi Ernie & BbBoy,

I guess the regex solution will handle many cases.
But you can easily build some valid HTML that will make the regex produce very funny results ;-)
Take the following HTML code:

Code
<BODY><P><BR><!-- This is a comment --> 
<IMG SRC="pics/rightarrow.gif" ALT="->" WIDTH=200>
<A HREF="www.perlguru.com">
<FONT SIZE=2>Perl Guru</FONT><!-- /A> Ooops! --></A>

It's completely valid, and the filtered output should simply be

Code
<A HREF="www.perlguru.com">Perl Guru</A>

which is also what the HTML::Parser example delivers. But if you feed this through the regex, it will result in

Code
  
" WIDTH=200>

Perl Guru Ooops! -->

which is certainly not the wanted result.
OK, this is a contrived example, but I just wanted to point out that the regex can fail even when the HTML is valid. If this doesn't bother you, that's fine.
Just a small hint if you want to keep the regex: add the s modifier to each substitution, so that the dot also matches newline characters and tags spread over several lines are handled. Alternatively, you can of course filter out the newlines before applying the regex.
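As a rough sketch of that hint, here are the same three substitutions with the s modifier added to each, run on input whose tags are split across lines. The sub name strip_keep_anchors_s and the sample string are made up for illustration. This only addresses the newline issue; the comment and ALT="->" cases from the example above would still trip it up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the three substitutions from earlier in the thread,
# each with /s so that "." also matches newline characters.
sub strip_keep_anchors_s {
    my ($html) = @_;
    $html =~ s/<a(.*?)<\/a>/\[a$1\[\/a\]/gis;  # protect the anchor
    $html =~ s/<(.*?)>//gs;                    # strip all remaining tags
    $html =~ s/\[a(.*?)\[\/a]/<a$1<\/a>/gis;   # restore the anchor
    return $html;
}

# Both the A tag and the FONT tag contain a line break.
my $multiline = qq{<P><A\nHREF="www.perlguru.com"><FONT\nSIZE=2>Perl Guru</FONT></A></P>};
print strip_keep_anchors_s($multiline), "\n";
```

The line break inside the anchor tag is kept as it was; only the surrounding tags are removed.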

-- Marcus