HTML Anchors


Jul 9, 2001, 1:48 AM



I need to create a regex that strips the HTML tags out of an HTML document but leaves the links (anchors) intact.
For example, I need the following text

<BODY><P><BR><A HREF="www.perlguru.com"><FONT SIZE=2>Perl Guru</FONT></A></P></BODY>

... to look like this ...

<A HREF="www.perlguru.com">Perl Guru</A>

So far I've assigned the HTML to a single scalar, and stripped out comments, script tags, etc. The following will strip out ALL tags :

$HTML =~ s/(<[^>]*>)*//gi;

... but the A and /A tags are really causing me problems!

I've spent ages on this already. Can anyone help, please?



Jul 9, 2001, 2:50 AM

Re: HTML Anchors

Hi Ernie,

Have a look at the HTML::Parser package. Your task can be performed without a single regex (and perhaps more accurately) with the following code:

#!/bin/perl -w 
use strict;
use HTML::Parser;

# Create some simple HTML
my $HTML = <<HTML;
<BODY><P><BR><!-- This is a comment -->
<A HREF="www.perlguru.com"><FONT SIZE=2>Perl Guru</FONT></A>
HTML

# Remove the linebreaks
$HTML =~ s/\r?\n//g;

# This will hold the parsed HTML
my $parsed;

# Set up a new HTML::Parser object
my $p = HTML::Parser->new( api_version => 3 );
$p->report_tags( 'a' ); # report only anchors
$p->handler( comment => '' ); # ignore comments
$p->handler( default => sub { $parsed .= shift }, "text" );
$p->parse( $HTML ); # parse the HTML
$p->eof; # signal end of input so any buffered text is flushed

# Print the parsed HTML
print $parsed;

The output of that code is simply:

<A HREF="www.perlguru.com">Perl Guru</A>

Hope this helps.

-- Marcus


Jul 10, 2001, 5:34 AM

Re: HTML Anchors

Or, just:

$HTML =~s/<a(.*?)<\/a>/\[a$1\[\/a\]/ig;
$HTML =~s/<(.*?)>//g;
$HTML =~s/\[a(.*?)\[\/a]/<a$1<\/a>/ig;
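
If you'd rather not juggle placeholders, a single substitution with a negative lookahead should give the same result on the sample from the question. This is only a sketch: the variable name $stripped is made up for the example, and it has the same blind spots as any regex here (comments, or a ">" inside an attribute value, will still confuse it).

```perl
use strict;
use warnings;

# Same sample as in the original question.
my $HTML = '<BODY><P><BR><A HREF="www.perlguru.com"><FONT SIZE=2>Perl Guru</FONT></A></P></BODY>';

# Remove every tag unless the text right after "<" is "a" or "/a" followed by
# a word boundary, so <A ...> and </A> stay while <ADDRESS>, <FONT>, etc. go.
# [^>]* also matches newlines, so tags spread over several lines are covered.
(my $stripped = $HTML) =~ s{<(?!/?a\b)[^>]*>}{}gi;

print "$stripped\n";   # <A HREF="www.perlguru.com">Perl Guru</A>
```

The \b matters: without it, tags such as <ADDRESS> or <AREA> would be mistaken for anchors and left in place.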



Jul 10, 2001, 8:30 AM

Re: HTML Anchors

Thanks guys.

HTML::Parser didn't work for me, but I learnt a lot trying to get it (and TokeParser) to work.

The regular expression is exactly what I was after. It works perfectly.

Many thanks


Jul 12, 2001, 11:54 AM

Re: HTML Anchors

Hi Ernie & BbBoy,

I guess the regex solution will handle many cases.
But you can easily build some valid HTML that will make the regex produce very funny results ;-)
Take the following HTML code:

<BODY><P><BR><!-- This is a comment --> 
<IMG SRC="pics/rightarrow.gif" ALT="->" WIDTH=200>
<A HREF="www.perlguru.com">
<FONT SIZE=2>Perl Guru</FONT><!-- /A> Ooops! --></A>

It's completely valid, and the filtered output should simply be

<A HREF="www.perlguru.com">Perl Guru</A>

which is also what the HTML::Parser example delivers. But if you feed this through the regex, it will result in

" WIDTH=200>

Perl Guru Ooops! -->

which is certainly not the wanted result.
Ok, this is a constructed example, but I just wanted to point out that there is the possibility that the regex may fail even if the HTML is valid. If this doesn't bother you, it's ok.
Just a small hint if you want to keep the regex: Add the s modifier to the regex, so the dot will match newline characters and thus allow tags to be spread over several lines. Alternatively, you can of course filter the newlines before applying the regex.
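
To illustrate the s modifier hint, here is a rough sketch using the three substitutions from the earlier reply with s added to each. The sample string and the sub name strip_keeping_anchors are made up for this example; the anchor's opening and closing tags deliberately sit on different lines.

```perl
use strict;
use warnings;

# Hypothetical sample: a newline separates <A ...> from </A>.
my $sample = qq{<P><A HREF="www.perlguru.com">\n<FONT SIZE=2>Perl Guru</FONT></A></P>};

sub strip_keeping_anchors {
    my ($html) = @_;
    $html =~ s/<a(.*?)<\/a>/\[a$1\[\/a\]/igs;   # protect anchors as [a ... [/a]
    $html =~ s/<(.*?)>//gs;                      # strip every remaining tag
    $html =~ s/\[a(.*?)\[\/a]/<a$1<\/a>/igs;     # turn the placeholders back into tags
    return $html;
}

print strip_keeping_anchors($sample), "\n";
```

With the s modifier the anchor survives (as a lowercase <a ...>, since the replacement text writes the tag name literally, and with the newline still inside it). Without s, the first substitution never matches across the line break, so the second one strips the anchor tags along with everything else.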

-- Marcus