loop through websites



thePanda41
Novice

Jul 30, 2007, 6:39 AM


Views: 3764
loop through websites

I am attempting to run through a website which contains a bunch of links. Is it possible to run a loop that will open each link individually, run my regular expressions to capture certain data, then go on to the next links and do the same thing?


KevinR
Veteran


Jul 30, 2007, 9:13 AM


Views: 3761
Re: [thePanda41] loop through websites

Sounds possible.
-------------------------------------------------


adaykin
Novice

Jul 30, 2007, 12:15 PM


Views: 3759
Re: [thePanda41] loop through websites

Check out the Robot::UA class. I can't remember whether it is a default module, but you will most likely need to get familiar with it and with the LWP classes.
------------------------------------------------------------

New Horizon Designs <-- My site, just updated the GUI to a PHP Nuke interface


KevinR
Veteran


Jul 30, 2007, 4:12 PM


Views: 3757
Re: [adaykin] loop through websites

Robot::UA? Maybe you are thinking of LWP::UserAgent?
-------------------------------------------------


adaykin
Novice

Jul 30, 2007, 6:25 PM


Views: 3755
Re: [KevinR] loop through websites

Sorry, I meant to say LWP::RobotUA. It's a default module installed with Perl; just go to your command prompt and type "perldoc LWP::RobotUA", which should give you a start. I'm using it now to traverse a few sites.
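
A minimal sketch of fetching a page with LWP::RobotUA might look like the following. The agent name, contact address, and the roughly one-second delay are placeholders that do not come from this thread, so adjust them for the site you are visiting.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder identity: RobotUA requires an agent name and a contact address.
my $ua = LWP::RobotUA->new(
    agent => 'my-link-walker/0.1',
    from  => 'me@example.com',
);
$ua->delay(1 / 60);    # delay is given in minutes, so this is about one second

# Fetch one page and return its body, or undef if the request failed
# (RobotUA also refuses URLs that the site's robots.txt disallows).
sub fetch_page {
    my ($url) = @_;
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $html = fetch_page($ARGV[0]);
    print defined($html)
        ? length($html) . " characters fetched\n"
        : "could not fetch $ARGV[0]\n";
}
```

Run it as "perl fetch.pl http://www.example.com/" (the script name is arbitrary). Without an argument it only builds the user agent and exits.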
------------------------------------------------------------



KevinR
Veteran


Jul 30, 2007, 8:16 PM


Views: 3754
Re: [adaykin] loop through websites

No LWP modules are core modules, but they might be included with some distributions of Perl. You can see a list of the core (5.8) modules starting with "L" here:

http://perldoc.perl.org/index-modules-L.html
-------------------------------------------------


adaykin
Novice

Jul 31, 2007, 6:42 AM


Views: 3751
Re: [KevinR] loop through websites

Well, if he gets Perl from ActiveState, LWP comes installed by default. Even on the Linux machines I have touched that already had Perl installed, every one came with the LWP modules.

I would recommend ActiveState if Perl isn't already installed on your machine. They provide Perl as a binary with an executable installer that is easy to set up.
------------------------------------------------------------



hydpm
User

Jul 31, 2007, 7:26 AM


Views: 3746
Re: [thePanda41] loop through websites

You can use linkchecker.
I think it will serve your purpose.


hydpm
User

Jul 31, 2007, 7:28 AM


Views: 3745
Re: [thePanda41] loop through websites

    use WE_Frontend::LinkChecker;
    my $lc = WE_Frontend::LinkChecker->new(-url => "http://www/",
                                           -restrict => [..]);
    my $errors = $lc->check_html;
    print $errors;


thePanda41
Novice

Jul 31, 2007, 8:26 AM


Views: 3740
Re: [thePanda41] loop through websites

thanks guys, I'll test some of that out and see where I get.


KevinR
Veteran


Jul 31, 2007, 9:54 AM


Views: 3738
Re: [wingsof5r] loop through websites


In Reply To
You can use linkchecker.
I think it will serve your purpose.


It might work for all I know; I have never heard of that module before. But the description of the module is:

WE_Frontend::LinkChecker - check a site for broken links
-------------------------------------------------


KevinR
Veteran


Jul 31, 2007, 9:59 AM


Views: 3735
Re: [adaykin] loop through websites


In Reply To
Well, if he gets Perl from ActiveState, LWP comes installed by default. Even on the Linux machines I have touched that already had Perl installed, every one came with the LWP modules.

I would recommend ActiveState if Perl isn't already installed on your machine. They provide Perl as a binary with an executable installer that is easy to set up.


Sorry, mate, I hope it did not seem as though I was trying to nit-pick your suggestion. The distinction between a "default" and a "core" module was my only concern.

Kevin
-------------------------------------------------


KevinR
Veteran


Jul 31, 2007, 10:04 AM


Views: 3734
Re: [thePanda41] loop through websites


In Reply To
thanks guys, I'll test some of that out and see where I get.


If nothing else works, you can always fetch the page with LWP or LWP::Simple, use HTML::TokeParser to pull the links out of the HTML, and then loop through them.

http://search.cpan.org/~gaas/HTML-Parser-3.56/lib/HTML/TokeParser.pm

There is an example in the TokeParser documentation for getting the links from an HTML document.
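
A rough sketch of that approach, assuming LWP::Simple, HTML::TokeParser and URI are installed. The sub name extract_links and the <title> regex are only illustrations; swap in the patterns for the data you actually want to capture.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TokeParser;
use URI;

# Return the absolute URL of every <a href="..."> found in $html,
# in document order and without duplicates.
sub extract_links {
    my ($html, $base) = @_;
    my $parser = HTML::TokeParser->new(\$html);
    my (@links, %seen);
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless defined $href;    # skip anchors such as <a name="...">
        my $abs = URI->new_abs($href, $base)->as_string;
        push @links, $abs unless $seen{$abs}++;
    }
    return @links;
}

# Only touch the network when a start URL is given on the command line.
if (defined(my $start = shift @ARGV)) {
    my $index = get($start);
    die "Cannot fetch $start\n" unless defined $index;
    for my $link (extract_links($index, $start)) {
        my $page = get($link);
        next unless defined $page;    # unreachable or non-HTTP link
        # Placeholder regex: replace with your own patterns.
        if ($page =~ m{<title>\s*(.*?)\s*</title>}is) {
            print "$link: $1\n";
        }
    }
}
```

Relative links are resolved against the start URL with URI->new_abs, so get() always receives a full URL. This follows links one level deep only; it does not recurse into the pages it visits.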
-------------------------------------------------


hydpm
User

Aug 1, 2007, 7:09 AM


Views: 3727
Re: [thePanda41] loop through websites

I have used the linkchecker tool in a project.
linkchecker-4.0-1.i386.rpm can be downloaded from
http://linkchecker.sourceforge.net/

Hope this helps.

You can write a Perl script that invokes this utility.

I have used something like the following:

---------------------
echo "#Links to validate the integration for apache"
echo "#############################################"
linkchecker -r 0 http://rhel4-in-qa1.spikesource.in/
linkchecker -r 0 http://rhel4-in-qa1.spikesource.in/withauth -u guest -p guest
linkchecker -r 0 https://rhel4-in-qa1.spikesource.in/ -u guest -p guest
linkchecker -r 0 https://rhel4-in-qa1.spikesource.in/withauth -u guest -p guest
--------------------------------------

It is just part of the code.
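
Calling the utility from Perl could look roughly like this. It assumes linkchecker is on your PATH and uses only the options shown in the shell lines above (-r 0, -u, -p); the sub name linkchecker_command is made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for one linkchecker run.
# The user name and password are optional and are added only when given.
sub linkchecker_command {
    my ($url, $user, $password) = @_;
    my @cmd = ('linkchecker', '-r', '0', $url);
    push @cmd, '-u', $user, '-p', $password if defined $user;
    return @cmd;
}

# Check every URL given on the command line; nothing runs without arguments.
for my $url (@ARGV) {
    my @cmd = linkchecker_command($url);
    # The list form of system() bypasses the shell, so URLs need no quoting.
    system(@cmd) == 0
        or warn "linkchecker reported a problem for $url (status $?)\n";
}
```

To check a protected URL, pass the credentials as extra arguments, for example linkchecker_command($url, 'guest', 'guest').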


adaykin
Novice

Aug 2, 2007, 7:12 AM


Views: 3716
Re: [KevinR] loop through websites


In Reply To

In Reply To
Well, if he gets Perl from ActiveState, LWP comes installed by default. Even on the Linux machines I have touched that already had Perl installed, every one came with the LWP modules.

I would recommend ActiveState if Perl isn't already installed on your machine. They provide Perl as a binary with an executable installer that is easy to set up.


Sorry, mate, I hope it did not seem as though I was trying to nit-pick your suggestion. The distinction between a "default" and a "core" module was my only concern.

Kevin



np man I see where you were coming from
------------------------------------------------------------
