HTML Anchors



Jul 9, 2001, 1:48 AM

Post #1 of 5 (19304 views)
HTML Anchors


I need to create a regex that strips the HTML tags out of an HTML document while leaving the links (anchors) intact.
For example, I need the following text

<BODY><P><BR><A HREF=""><FONT SIZE=2>Perl Guru</FONT></A></P></BODY>

... to look like this ...

<A HREF="">Perl Guru</A>

So far I've assigned the HTML to a single scalar and stripped out comments, script tags, etc. The following will strip out ALL tags:

$HTML =~ s/<[^>]*>//g;

... but the A and /A tags are really causing me problems!

I've spent ages on this already. Can anyone help, please?



Jul 9, 2001, 2:50 AM

Post #2 of 5 (19301 views)
Re: HTML Anchors

Hi Ernie,

Have a look at the HTML::Parser package. Your task can be performed without a single regex (and perhaps more accurately) with the following code:

#!/bin/perl -w
use strict;
use HTML::Parser;

# Create some simple HTML
my $HTML = <<'HTML';
<BODY><P><BR><!-- This is a comment -->
<A HREF=""><FONT SIZE=2>Perl Guru</FONT></A>
HTML

# Remove the linebreaks
$HTML =~ s/\r?\n//g;

# This will hold the parsed HTML
my $parsed = '';

# Set up a new HTML::Parser object
my $p = HTML::Parser->new( api_version => 3 );
$p->report_tags( 'a' );       # report only anchor start/end tags
$p->handler( comment => '' ); # ignore comments
# Append the source text of every remaining event (anchor tags and plain text)
$p->handler( default => sub { $parsed .= shift }, "text" );
$p->parse( $HTML );           # parse the HTML
$p->eof;                      # signal end of input so buffered text is flushed

# Print the parsed HTML
print $parsed, "\n";

The output of that code is simply:

<A HREF="">Perl Guru</A>

Hope this helps.

-- Marcus


Jul 10, 2001, 5:34 AM

Post #3 of 5 (19293 views)
Re: HTML Anchors

Or, just:

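# 1. Disguise each anchor pair as [a ...[/a] so the next step leaves it alone
$HTML =~ s/<a(.*?)<\/a>/\[a$1\[\/a\]/ig;
# 2. Strip every remaining tag
$HTML =~ s/<(.*?)>//g;
# 3. Turn the placeholders back into anchor tags
$HTML =~ s/\[a(.*?)\[\/a]/<a$1<\/a>/ig;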
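
To see what each of the three substitutions does, it may help to print the string after every step. The following is only a sketch: the helper name keep_anchors_stages and the variable $sample are made up for illustration, and the input is the example line from the first post. The regexes themselves are the three from above, unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the thread): runs the three
# substitutions and returns the string as it looks after each one.
sub keep_anchors_stages {
    my ($html) = @_;
    my @stages;

    # 1. Disguise each anchor pair: "<a ...</a>" becomes "[a ...[/a]".
    $html =~ s/<a(.*?)<\/a>/\[a$1\[\/a\]/ig;
    push @stages, $html;

    # 2. Delete everything that still looks like a tag.
    $html =~ s/<(.*?)>//g;
    push @stages, $html;

    # 3. Turn the placeholders back into anchor tags.
    $html =~ s/\[a(.*?)\[\/a]/<a$1<\/a>/ig;
    push @stages, $html;

    return @stages;
}

my $sample = '<BODY><P><BR><A HREF=""><FONT SIZE=2>Perl Guru</FONT></A></P></BODY>';
print "$_\n" for keep_anchors_stages($sample);
# prints:
# <BODY><P><BR>[a HREF=""><FONT SIZE=2>Perl Guru</FONT>[/a]</P></BODY>
# [a HREF="">Perl Guru[/a]
# <a HREF="">Perl Guru</a>
```

Note that the tag name comes back in lowercase, because the replacement in the last step writes a literal "<a". The leftover ">" after the HREF attribute survives step 2 only because no "<" precedes it any more.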



Jul 10, 2001, 8:30 AM

Post #4 of 5 (19287 views)
Re: HTML Anchors

Thanks guys.

The HTML::Parser approach didn't work for me, but I learnt a lot trying to get it (and TokeParser) to work.

The regular expression solution is exactly what I was after. It works perfectly.

Many thanks


Jul 12, 2001, 11:54 AM

Post #5 of 5 (19273 views)
Re: HTML Anchors

Hi Ernie & BbBoy,

I guess the regex solution will handle many cases.
But you can easily build some valid HTML that will make the regex produce very funny results ;-)
Take the following HTML code:

<BODY><P><BR><!-- This is a comment --> 
<IMG SRC="pics/rightarrow.gif" ALT="->" WIDTH=200>
<A HREF="">
<FONT SIZE=2>Perl Guru</FONT><!-- /A> Ooops! --></A>

It's completely valid, and the filtered output should simply be

<A HREF="">Perl Guru</A>

which is also what the HTML::Parser example delivers. But if you feed this through the regex, it will result in

" WIDTH=200>

Perl Guru Ooops! -->

which is certainly not the wanted result.
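
For anyone who wants to reproduce this, here is a small sketch that feeds the HTML above through the three substitutions from post #3. The helper name regex_filter is made up for illustration; the regexes are copied unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: the three substitutions from post #3, unchanged.
sub regex_filter {
    my ($html) = @_;
    $html =~ s/<a(.*?)<\/a>/\[a$1\[\/a\]/ig;   # never matches here: <A and </A> sit on different lines
    $html =~ s/<(.*?)>//g;                     # stops at the first ">", even inside ALT="->" or a comment
    $html =~ s/\[a(.*?)\[\/a]/<a$1<\/a>/ig;    # nothing left to restore
    return $html;
}

my $html = <<'HTML';
<BODY><P><BR><!-- This is a comment -->
<IMG SRC="pics/rightarrow.gif" ALT="->" WIDTH=200>
<A HREF="">
<FONT SIZE=2>Perl Guru</FONT><!-- /A> Ooops! --></A>
HTML

print regex_filter($html);
```

Two things go wrong. The dot does not match a newline by default, so the first substitution never sees the anchor as one unit and the anchor tags are stripped along with everything else. And "<(.*?)>" ends at the first ">" it finds, which here is the one inside the ALT attribute and the one inside the comment.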
OK, this is a contrived example, but I just wanted to point out that the regex may fail even if the HTML is valid. If this doesn't bother you, that's fine.
Just a small hint if you want to keep the regex: add the s modifier to each substitution so that the dot also matches newline characters, which allows tags to be spread over several lines. Alternatively, you can of course strip the newlines before applying the regex.
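
A minimal sketch of the s modifier hint, assuming the same three substitutions with s added to each. The name keep_anchors_multiline and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical variant of the three substitutions with /s added,
# so that "." also matches newline characters.
sub keep_anchors_multiline {
    my ($html) = @_;
    $html =~ s/<a(.*?)<\/a>/\[a$1\[\/a\]/igs;  # an anchor may now span lines
    $html =~ s/<(.*?)>//gs;                    # so may any other tag
    $html =~ s/\[a(.*?)\[\/a]/<a$1<\/a>/igs;   # restore the anchors
    return $html;
}

# The anchor and the FONT tag are both broken across lines.
my $html = "<P><A HREF=\"\">\n<FONT\nSIZE=2>Perl Guru</FONT></A></P>";
print keep_anchors_multiline($html), "\n";
# prints:
# <a HREF="">
# Perl Guru</a>
```

This only deals with line breaks. A ">" inside an attribute value or a comment, as in the example above, still confuses the tag-stripping step.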

-- Marcus

