Parsing HTML

 



Steerpike
Deleted

Apr 20, 2000, 3:07 PM

Post #1 of 3 (618 views)
Parsing HTML

Hi folks, thanks in advance for any help with this request :)

Could anyone tell me, preferably using pattern matching/regular expressions, a quick and efficient way of lifting link names from HTML?

For example, taking 'URL Name' from the following HTML:

<P><A HREF="www.url.com"><I>URL Name</I></A><FONT SIZE="+2">Blah Blah</FONT></P>

I am aware that there are probably a few modules dedicated to solving such problems, but I would really prefer a 'manual' pattern-matching approach :)

Thanks again, all :)


Steerpike
Deleted

Apr 20, 2000, 3:19 PM

Post #2 of 3 (618 views)
Re: Parsing HTML [In reply to]

Oh, I have an addendum :)

Let's instead use the HTML:

<P><A HREF="www.url.com"><I>URL Name</I></A><FONT SIZE="+2">Blah Blah</FONT><A HREF="www.url.com/url/">Another Link</A></P>

And assume I want only the first link title in any given piece of HTML :)

Thanks again :)



Cure
User

Apr 20, 2000, 10:13 PM

Post #3 of 3 (619 views)
Re: Parsing HTML [In reply to]

Hi

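#!/usr/bin/perl -w

use strict;
use HTML::Parser;

# Only a start-tag handler is installed up front. The text and end
# handlers are switched on while the parser is inside an <A HREF> element.
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&start, "self,tagname,attr" ],
);

$parser->parse(<<EOFOO);
<P><A HREF="www.url.com"><I>URL Name</I></A><FONT SIZE="+2">Blah
Blah</FONT><A HREF="www.url.com/url/">Another Link</A></P>
EOFOO
$parser->eof;    # signal end of document so any buffered text is flushed

# Each entry is [ href, link text ]; $parser->{urls}[0] is the first link.
for ( @{ $parser->{urls} } )
{
    print "$_->[0] $_->[1]\n";
}

# Called for every start tag (tag and attribute names arrive lowercased).
sub start
{
    my ( $self, $tag, $attr ) = @_;

    if ( $tag eq 'a' && exists $attr->{href} )
    {
        $self->{_current_url} = $attr->{href};

        # Collect the decoded text of everything inside the link,
        # including text wrapped in nested tags such as <I>.
        $self->handler( text => sub {
                my ( $self, $text ) = @_;
                $self->{_current_text} .= $text;
            },
            "self, dtext" );

        $self->handler( end => \&end, "self, tagname" );
    }
}

# Called for end tags while a link is open; </a> closes the current link.
sub end
{
    my ( $self, $tag ) = @_;

    if ( $tag eq 'a' )
    {
        push @{ $self->{urls} },
            [ $self->{_current_url}, $self->{_current_text} ];
        delete $self->{_current_url};
        delete $self->{_current_text};

        # Stop collecting text until the next <A HREF> start tag.
        $self->handler( text => undef );
        $self->handler( end  => undef );
    }
}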

Cure
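
For comparison, the 'manual' pattern-matching approach asked about in the first post could look roughly like the sketch below. It uses only core Perl. The helper name first_link_title is made up for illustration and does not come from the thread, and the sketch assumes fairly tidy markup: no '>' characters inside attribute values and no links hidden in comments or scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the visible text of
# the first <A HREF=...> element in $html, or nothing if there is no link.
sub first_link_title {
    my ($html) = @_;

    # Match the first opening <a ... href ...> tag, then capture lazily up
    # to the nearest </a>. /i ignores tag case, /s lets . match newlines.
    return unless $html =~ m{<a\b[^>]*\bhref\b[^>]*>(.*?)</a\s*>}is;
    my $title = $1;

    $title =~ s/<[^>]+>//g;       # drop nested tags such as <I>...</I>
    $title =~ s/\s+/ /g;          # collapse runs of whitespace
    $title =~ s/^\s+|\s+$//g;     # trim both ends
    return $title;
}

my $html = '<P><A HREF="www.url.com"><I>URL Name</I></A>'
         . '<FONT SIZE="+2">Blah Blah</FONT>'
         . '<A HREF="www.url.com/url/">Another Link</A></P>';

print first_link_title($html), "\n";    # prints "URL Name"
```

Because the match is not global, it stops at the first link, which is what the addendum asks for. On messier real-world HTML a regex like this is easy to fool, which is the main reason to prefer the HTML::Parser version above.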

 
 

