
davorg
Thaumaturge
/ Moderator
Jul 17, 2003, 3:54 AM
Post #6 of 14
Re: [florida2] Fetch server info
OK. Here goes. A couple of posts back I provided links to the docs for both HTML::LinkExtor and URI. You could look there for more details.
# This creates a new HTML::LinkExtor object and stores
# a reference to it in $p
my $p = HTML::LinkExtor->new;

# "parse_file" is an HTML::LinkExtor method. It parses the given
# file looking for links and builds a list of the links found
$p->parse_file($File::Find::name);

# You can access the list of found links using the "links"
# method.
foreach ($p->links) {
  # Each element in this list is a reference to an array.
  # The first element of the array is the tag name, and the
  # remaining elements are in pairs (attr name and attr value).
  # This attribute data is, of course, perfect for storing in
  # a hash.
  # For example [ 'a', 'href', 'http://foo.com' ]
  # We pull this array into a scalar containing the tag name
  # and a hash containing the attr name/value pairs
  my ($tag, %attrs) = @$_;

  # We then iterate across the attributes
  foreach my $a (keys %attrs) {
    # This creates a new URI object.
    # We use the "new_abs" constructor to create an
    # absolute URI, i.e. one that always contains a scheme
    # (e.g. http://) and a host (e.g. foo.com).
    # Because some of the links we've found might be
    # relative, we need to give it a base URL to add to
    # relative URLs to create absolute ones.
    my $url = URI->new_abs($attrs{$a}, $base);

    # "scheme" is a method on the URI object which returns
    # the scheme of the URL. We are only interested in URLs
    # whose scheme starts with "http" (so "https" counts too).
    next unless $url->scheme =~ /^http/;

    # "host" is a method which returns the hostname part of
    # the URL. We get the hostname and increment the
    # associated value in the %servers hash.
    $servers{$url->host}++;
  }
}

Hope that helps. Let me know if you need any more help.

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks
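
As a footnote to the walk-through above, here is a minimal standalone sketch of the two idioms it relies on: unpacking each link array into a tag name plus an attribute hash, and tallying hostnames in a hash. It is an illustration only, not the code from the thread. The `@links` data is made up, in the shape that the "links" method returns, and a simple regex stands in for `URI->new_abs(...)->host` so that the sketch runs with core Perl alone. Because of that stand-in, relative links are skipped here rather than resolved against a base URL.

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape that $p->links returns:
# each element is [ tag, attr_name => attr_value, ... ].
my @links = (
    [ 'a',   'href', 'http://foo.com/index.html' ],
    [ 'img', 'src',  'http://foo.com/logo.png',
             'lowsrc', 'http://bar.org/small.png' ],
    [ 'a',   'href', 'mailto:someone@example.com' ],
    [ 'a',   'href', 'about.html' ],
);

my %servers;

foreach my $link (@links) {
    # The first element goes into $tag; the remaining
    # name/value pairs fill the %attrs hash.
    my ($tag, %attrs) = @$link;

    foreach my $attr (keys %attrs) {
        # Assumption: a simple regex used as a stand-in for the URI
        # module. It only recognises absolute http/https URLs, so
        # relative links and other schemes are skipped.
        next unless $attrs{$attr} =~ m{^https?://([^/:?#]+)}i;

        # Count one more link pointing at this host.
        $servers{ lc $1 }++;
    }
}

# Report the tally, one host per line, sorted by hostname.
foreach my $host (sort keys %servers) {
    print "$host: $servers{$host}\n";
}
```

With the sample data this counts two links for foo.com and one for bar.org. In real code the URI module is the better choice, since it resolves relative links and handles the many URL forms a regex like this would miss.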