generate list of "href" urls from html file

 



nijisei
New User

Aug 30, 2003, 8:12 PM

Post #1 of 7 (787 views)

Hi all,

Please forgive me if this has already been answered.

Let's say I have a static HTML file with several links, each an <a> tag with an href attribute. How do I read this file into a Perl script and parse it for these links, ultimately ending up with a list of the URLs?

In other words, I have a file like this:

===========================================

<html>

<body>

blah blah blah<br>

<a href="www.foo.com">This is one link</a><br>

blah blah blah<br>

<a href="www.bar.com">This is another link</a></br>

blah blah blah<br>

<a href="www.sandwich.com">This is a third link</a></br>

</body>

</html>

===========================================

In my Perl script, I want to end up with a list variable that looks like this:

===========================================

@list_of_urls = ("http://www.foo.com", "http://www.bar.com", "http://www.sandwich.com")

===========================================

Thanks for your help!


KevinR
Veteran


Aug 31, 2003, 3:13 PM

Post #2 of 7 (781 views)
Re: [nijisei] generate list of "href" urls from html file

I assume your example HTML file has an error in it; the anchor tags are not coded properly, because the href values are missing the http:// scheme:

<a href="www.foo.com">

I am going to assume the tags in your real file are already correct:

<a href="http://www.foo.com">

If that is the case, and each anchor tag sits on one line as in your example, this should work for you:


Code
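#!/perl/bin/perl.exe
use strict;
use warnings;

print "Content-type: text/html\n\n";
my $path = '/path/to/your/file.html';
my @array;
# Three-argument open with an error check, so a bad path is reported
open(my $fh, '<', $path) or die "Cannot open $path: $!";
while (my $line = <$fh>) {
    # The inner while with /g collects every anchor on the line
    while ($line =~ /<a\s+[^>]*?href\s*=\s*["']?([^"'>\s]+)/ig) {
        push @array, $1;
    }
}
close($fh);
print "$_<br>\n" for @array;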

-------------------------------------------------


davorg
Thaumaturge / Moderator

Sep 1, 2003, 1:33 AM

Post #3 of 7 (777 views)
Re: [nijisei] generate list of "href" urls from html file

You really don't want to go parsing HTML documents with regular expressions. You should use a real HTML parser to do the job.

The HTML::Parser distribution (get it from CPAN if it's not already installed) comes with a module called HTML::LinkExtor which does just what you want - it extracts link URLs from HTML.


Code
use strict;
use warnings;
use HTML::LinkExtor;

my $p = HTML::LinkExtor->new;
$p->parse_file('something.html');
# Each element of @links is an array ref: [$tag, $attr_name => $url, ...]
my @links = $p->links;


There are other examples in the docs of how to do more complex things (like only getting links that appear in 'a' tags).
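As a rough sketch of the 'a'-tags-only case, assuming the callback form of HTML::LinkExtor->new, something like the following could work. The sample markup, the @list_of_urls variable and the keep_anchor_hrefs sub name are made up for illustration; the markup is parsed from a string here so the example stands alone, but parse_file would do the same job for a file on disk.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Sample markup stands in for the real file; the URLs are placeholders.
my $html = <<'HTML';
<html><body>
<img src="http://www.foo.com/logo.png">
<a href="http://www.foo.com">This is one link</a><br>
<a href="http://www.bar.com">This is another link</a><br>
</body></html>
HTML

my @list_of_urls;

# LinkExtor calls this once per link-bearing tag, passing the tag name
# followed by attribute => URL pairs.
sub keep_anchor_hrefs {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a' && defined $attr{href};
    push @list_of_urls, $attr{href};
}

my $p = HTML::LinkExtor->new(\&keep_anchor_hrefs);
$p->parse($html);
$p->eof;    # signal end of input so buffered text is processed

print "$_\n" for @list_of_urls;
```

The img tag in the sample is skipped because the callback only keeps tags named 'a', so only the two anchor URLs are printed.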

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


Paul
Enthusiast

Sep 1, 2003, 2:16 AM

Post #4 of 7 (775 views)
Re: [davorg] generate list of "href" urls from html file


Quote
You really don't want to go parsing HTML documents with regular expressions.


How do you think HTML::LinkExtor does it? :)


davorg
Thaumaturge / Moderator

Sep 1, 2003, 2:44 AM

Post #5 of 7 (773 views)
Re: [Paul] generate list of "href" urls from html file


In Reply To

Quote
You really don't want to go parsing HTML documents with regular expressions.


How do you think HTML::LinkExtor does it? :)


Well, it uses HTML::Parser of course.

But if your actual question is "how does HTML::Parser do it?" then the answer is that it certainly doesn't use regular expressions. HTML::Parser uses a library written in C which you can take a look at here. Now I don't claim to be an expert on parsers, but it looks to me like a state machine, which would imply it's an LR (or bottom-up) parser.
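To give a feel for what "a state machine" means here, this is a toy scanner in Perl that walks the input one character at a time and flips between two states. It is only an illustration of the idea and not HTML::Parser's actual code; the scan_tags name and the two state names are invented, and it deliberately ignores real-world details such as a '>' inside a quoted attribute value, comments and script blocks.

```perl
use strict;
use warnings;

# Toy scanner: returns the raw contents of every <...> tag it sees.
sub scan_tags {
    my ($html) = @_;
    my $state = 'TEXT';    # either 'TEXT' or 'TAG'
    my $buf   = '';
    my @tags;
    for my $ch (split //, $html) {
        if ($state eq 'TEXT') {
            # Outside a tag: only '<' changes the state
            if ($ch eq '<') { $state = 'TAG'; $buf = ''; }
        }
        else {
            # Inside a tag: '>' ends it, anything else is collected
            if ($ch eq '>') { push @tags, $buf; $state = 'TEXT'; }
            else            { $buf .= $ch; }
        }
    }
    return @tags;
}

print "$_\n" for scan_tags('<p>hi <a href="http://www.foo.com">x</a></p>');
```

No pattern matching is involved at any point; what the scanner does with a character depends only on that character and on the state it is currently in.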

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


KevinR
Veteran


Sep 1, 2003, 1:42 PM

Post #6 of 7 (769 views)
Re: [davorg] generate list of "href" urls from html file

Thanks Dave, I was unaware of that module. I appreciate your informative posts, always very helpful and in the spirit of sharing.

Thanks,

Kevin
-------------------------------------------------


nijisei
New User

Sep 4, 2003, 1:11 PM

Post #7 of 7 (741 views)
Re: [KevinR] generate list of "href" urls from html file

Thanks for your help, everyone. Mission accomplished! :)

 
 

