Fetch server info

florida2
Novice

Jul 15, 2003, 1:04 PM

Post #1 of 14 (1410 views)
Fetch server info

I need to parse the links on my NT server and get the server name between "http://" and the first "/". I also need a count of how many I have of each one. Here's a short example:

If my File::Find gets this info:
http://riverserver/dir1/dir2/index.html
http://riverserver/directA/home.html
http://the.sun.com/dirB/aPage.html

I want to be able to get a count of how many I have of each:
riverserver = 2
the.sun.com = 1
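
(Just to be clear about what I'm after, the counting itself boils down to a hash keyed on the host name. Here is a standalone sketch on just the three example lines above, not my real script:)

Code
#!/usr/bin/perl
use strict;
use warnings;

# Tally whatever sits between "http://" and the first "/".
my %count;
while (my $line = <DATA>) {
    $count{$1}++ if $line =~ m{https?://([^/\s]+)};
}
print "$_ = $count{$_}\n" for sort keys %count;

__DATA__
http://riverserver/dir1/dir2/index.html
http://riverserver/directA/home.html
http://the.sun.com/dirB/aPage.html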


I tried this using "File::Find" in my web directory and it doesn't work:

Code
use File::Find;
require 5.6.1;

my $dir = '/mainDirectory';
my $line;
my %server;

sub searcher
{
    if ( $_ =~ /\.(?:html?|pl)$/ )
    {
        my $name = $File::Find::name;
        open( F, $name ) || warn "Can't open file $name: $!\n";

        while ( $line = <F> )
        {
            if ( $line =~ m{^https?://([^/]+)} ) # I think my problem is this reg expression??
            {
                $server{$1}++;
            }
        }
        close F;
    }
}

find( \&searcher, $dir );

for ( keys %server )
{
    print "$_ : $server{$_}\n";
}


The above script doesn't seem to work correctly. Please advise how I can get this to work?


davorg
Thaumaturge / Moderator

Jul 16, 2003, 1:46 AM

Post #2 of 14 (1405 views)
Re: [florida2] Fetch server info

You should really think about using HTML::LinkExtor to extract the links from your documents and URI to extract the various parts from the URL.

This is untested, but the program would look something like this:

Code
#!/usr/bin/perl

use strict;
use warnings;
use HTML::LinkExtor;
use URI;
use File::Find;

my %servers;
my $dir = '/mainDirectory';

find(\&searcher, $dir);

sub searcher {
    return unless /\.(html?|pl)/;

    my $p = HTML::LinkExtor->new;
    $p->parse_file($File::Find::name);
    foreach ($p->links) {              # for each link found...
        my ($tag, %attrs) = @$_;
        foreach my $a (keys %attrs) {  # for each attribute of the link
            my $url = URI->new($attrs{$a});
            $servers{$url->host}++;
        }
    }
}


That will populate your %servers hash for you.

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


florida2
Novice

Jul 16, 2003, 5:06 AM

Post #3 of 14 (1400 views)
Re: [davorg] Fetch server info

Thanks. I tried putting a print statement in to get output, but I can't get it to print any data.



Code
use strict;
use warnings;
use HTML::LinkExtor;
use URI;
use File::Find;

my %servers;
my $dir = '/perl/bin/dester';

find(\&searcher, $dir);

sub searcher {
    return unless /\.(html?|pl)/;

    my $p = HTML::LinkExtor->new;
    $p->parse_file($File::Find::name);
    foreach ($p->links) {              # for each link found...
        my ($tag, %attrs) = @$_;
        foreach my $a (keys %attrs) {  # for each attribute of the link
            my $url = URI->new($attrs{$a});

            $servers{$url->host}++;
            print "$_ = $attrs{$_}\n"; # doesn't print anything
            print;                     # prints reference addresses only
        }
    }
}



(This post was edited by davorg on Jul 16, 2003, 5:34 AM)


davorg
Thaumaturge / Moderator

Jul 16, 2003, 5:41 AM

Post #4 of 14 (1396 views)
Re: [davorg] Fetch server info

OK, well there was a little more to it than I thought.

1/ You need to create an absolute URL object, and to do that you need to know what base to use for any relative links. In your case you can just use your own host name as the base, as you're only interested in the host and any relative links _must_ be on your host.

2/ You need to ignore any links that aren't 'http'.

3/ My code wasn't producing any output.

This code seems to do what (I think) you want.


Code
#!/usr/bin/perl

use strict;
use warnings;
use HTML::LinkExtor;
use URI;
use File::Find;

my %servers;
my $dir = 'C:/somewhere';

# base URL used to turn any relative URLs into absolute ones
my $base = 'http://your.host.here/';

find(\&searcher, $dir);

sub searcher {
    return unless /\.(html?|pl)/;

    my $p = HTML::LinkExtor->new;
    $p->parse_file($File::Find::name);
    foreach ($p->links) {              # for each link found...
        my ($tag, %attrs) = @$_;
        foreach my $a (keys %attrs) {  # for each attribute of the link
            my $url = URI->new_abs($attrs{$a}, $base);

            next unless $url->scheme =~ /^http/;

            $servers{$url->host}++;
        }
    }
}

for (keys %servers) {
    print "$_ : $servers{$_}\n";
}


Sean Burke's book Perl & LWP has lots of good stuff about this kind of work.

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


florida2
Novice

Jul 16, 2003, 8:14 AM

Post #5 of 14 (1393 views)
Re: [davorg] Fetch server info

Thanks, it now works!

Please explain the object-oriented features of your program. I'm lost on the new_abs and scheme methods (which I assume are all part of the URI module) and on "$servers{$url->host}++". If you have the time, can you please explain the part below?


Code
my $p = HTML::LinkExtor->new;       # constructor here
$p->parse_file($File::Find::name);  # parse_file is a method doing what with the File::Find module??
foreach ($p->links)
{
    # for each link found...
    my ($tag, %attrs) = @$_;        # Looks like you're creating an array ref here doing what?
    foreach my $a (keys %attrs)
    {
        # for each attribute of link
        my $url = URI->new_abs($attrs{$a}, $base);   # is new_abs a method doing what?
        next unless $url->scheme =~ /^http/;         # is scheme a method doing what?
        $servers{$url->host}++;
    }
}



davorg
Thaumaturge / Moderator

Jul 17, 2003, 3:54 AM

Post #6 of 14 (1383 views)
Re: [florida2] Fetch server info

OK. Here goes. A couple of posts back I provided links to the docs for both HTML::LinkExtor and URI. You could look there for more details.


Code
# This creates a new HTML::LinkExtor object and stores
# a reference to it in $p
my $p = HTML::LinkExtor->new;

# "parse_file" is an HTML::LinkExtor method. It parses the given
# file looking for links and builds a list of the links found
$p->parse_file($File::Find::name);

# You can access the list of found links using the "links"
# method.
foreach ($p->links) {
    # Each element in this list is a reference to an array.
    # The first element of the array is the tag name, and the
    # remaining elements are in pairs (attr name and attr value).
    # This attribute data is, of course, perfect for storing in
    # a hash.
    # For example [ 'a', 'href', 'http://foo.com' ]
    # We pull this array into a scalar containing the tag name
    # and a hash containing the attr name/value pairs
    my ($tag, %attrs) = @$_;

    # We then iterate across the attributes
    foreach my $a (keys %attrs) {
        # This creates a new URI object.
        # We use the "new_abs" constructor to create an
        # absolute URI, i.e. one that always contains a scheme
        # (e.g. http://) and a host (e.g. foo.com).
        # Because some of the links we've found might be
        # relative, we need to give it a base URL to add to
        # relative URLs to create absolute ones.
        my $url = URI->new_abs($attrs{$a}, $base);

        # "scheme" is a method on the URI object which returns
        # the scheme of the URL. We are only interested in URLs
        # with the scheme "http".
        next unless $url->scheme =~ /^http/;

        # "host" is a method which returns the hostname part of
        # the URL. We get the hostname and increment the
        # associated value in the %servers hash.
        $servers{$url->host}++;
    }
}
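
To see what new_abs, scheme and host actually return, here's a tiny standalone example (the URLs are just made-up sample values):

Code
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Resolve a relative link against a base page, then pull it apart.
my $base = 'http://riverserver/dir1/index.html';
my $abs  = URI->new_abs('../dir2/page.html', $base);

print $abs, "\n";           # http://riverserver/dir2/page.html
print $abs->scheme, "\n";   # http
print $abs->host, "\n";     # riverserver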



Hope that helps. Let me know if you need any more help.

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


florida2
Novice

Jul 17, 2003, 4:15 AM

Post #7 of 14 (1381 views)
Re: [davorg] Fetch server info

davorg,

Thanks for all your time and patience. I now understand this better. Thank you!!


florida2
Novice

Jul 17, 2003, 1:09 PM

Post #8 of 14 (1373 views)
Re: [davorg] Fetch server info

I'm running the script and it works everywhere, including my Unix server and my NT server (which has Perl 5.8), except when I run it starting in the NT server's web root (f:\inetpub\wwwroot\), where it gives me the following message:

Code
Use of uninitialized value in hash element at F:\Perl\bin\scriptname.pl line 32.
Use of uninitialized value in hash element at F:\Perl\bin\scriptname.pl line 32.
Use of uninitialized value in hash element at F:\Perl\bin\scriptname.pl line 32.
Can't locate object method "host" via package "URI::_foreign" at F:\Perl\bin\scriptname.pl line 32.


Line 32 is this line:

Code
$servers{$url->host}++;


If I run the script starting in any directory below the web root (e.g. f:\inetpub\wwwroot\nextDirectory), it works with no problems. It only fails when I start it from the web root itself on my NT server.

If possible, can you advise why this is happening?


davorg
Thaumaturge / Moderator

Jul 18, 2003, 1:58 AM

Post #9 of 14 (1370 views)
Re: [florida2] Fetch server info

Hmm... looks like you're finding some other kind of link that we didn't consider earlier.

Can you try adding the line:

Code
print ref $url, " - $attrs{$a}\n";

just after you create the URL object. That will tell us what type of URL is breaking things.
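
If it helps, here's a small standalone illustration (with made-up link values) of what "ref $url" reports for different kinds of link:

Code
#!/usr/bin/perl
use strict;
use warnings;
use URI;

my $base = 'http://your.host.here/';

# Each link resolves to a URI object of a different class,
# and that class name is what "ref $url" prints.
for my $link ('http://riverserver/dir1/', '/relative/page.html',
              'mailto:someone@example.com', 'httpt://broken.example/') {
    my $url = URI->new_abs($link, $base);
    printf "%-32s -> %s\n", $link, ref $url;
}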

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


florida2
Novice

Jul 18, 2003, 8:58 AM

Post #10 of 14 (1368 views)
Re: [davorg] Fetch server info

Thanks. I did as you suggested, and this is what caused the error:
httpt:

I assume that is a problem with the URI module?
I added this to get it to work:

Code
next if $url->scheme =~ /^httpt/;



Thanks for all your help and for helping me understand Perl a lot better!


Thanks


davorg
Thaumaturge / Moderator

Jul 19, 2003, 3:05 AM

Post #11 of 14 (1363 views)
Re: [florida2] Fetch server info

Isn't that an error in the original HTML documents?

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


florida2
Novice

Jul 19, 2003, 10:55 AM

Post #12 of 14 (1360 views)
Re: [davorg] Fetch server info

Yes, sorry, I should have clarified that. It was an error on an HTML page that caused the problem. So should I assume there is something wrong with the URI module?

I am very happy with the script you supplied and it now works great!


davorg
Thaumaturge / Moderator

Jul 19, 2003, 11:22 PM

Post #13 of 14 (1356 views)
Re: [florida2] Fetch server info

I'm not sure. I think the URI module is doing the best it can under the circumstances. It's creating a URI of type URI::_foreign, which is basically saying "well, you've told me this is a URI, so I'm going to create a URI object for you, but it really doesn't look like a valid URI to me, so you're on your own when it comes to dealing with it!"
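
For what it's worth, a link with a misspelled scheme ends up as a URI::_foreign object, which doesn't have a host method at all; that's exactly the error you saw. A defensive alternative to checking the scheme by name would be to test for the method first (untested sketch):

Code
#!/usr/bin/perl
use strict;
use warnings;
use URI;

my $url = URI->new_abs('httpt://oops.example/page.html', 'http://your.host.here/');

print ref $url, "\n";                    # URI::_foreign
print "no host method\n" unless $url->can('host');

# So in the searcher sub you could write:
#   next unless $url->can('host');
# instead of checking the scheme explicitly.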

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


florida2
Novice

Jul 22, 2003, 4:12 AM

Post #14 of 14 (1351 views)
Re: [davorg] Fetch server info

Thanks, I dealt with it using a "next if" statement.

I appreciate all the information and your time!

Thanks again.

 
 

