
Robot: storing URLs in hashes of hashes, causing great pain!

 



tules
Novice

Sep 1, 2008, 7:25 PM

Post #1 of 3 (886 views)
Robot: storing URLs in hashes of hashes, causing great pain!

You should be able to see what I'm trying to do here: I want to generate an anonymous hash reference for each key, where each key is a URL. Each anonymous hash should then contain keys for all the URLs extracted from that page, each of which opens up into another anonymous hash for the links on that page, and so on ad infinitum. The result should be a tree-like structure mapping out all the URLs crawled by the bot. Please help, this is urgent as I need to impress my new boss :)



Code
   


use strict;
use warnings;
use Sort::Array qw(Discard_Duplicates);
use LWP::Simple;
use HTML::LinkExtor;

# Seed the hash with one URL whose value is an empty anonymous hash
my %hash = ("http://www.myspace.com" => {});

my $p = HTML::LinkExtor->new();

my @collected_stuff;

while (my ($key, $value) = each %hash) {

    # Fetch the page and pull every link out of it
    my $content = get($key);
    next unless defined $content;
    $p->parse($content)->eof;
    my @links = $p->links;

    # Drop links to images, scripts, archives and other non-HTML resources
    my @links1;
    foreach my $link (@links) {
        next if $$link[2] =~ m/\.(js|css|png|wav|mp3|rm|mpg|bmp|jpg|rar|tar|zip|tif|gif|mp4)$/;
        push @links1, $$link[2];    # $$link[2] is the URL of the first link attribute
    }

    # Keep only absolute http links, with any trailing slash stripped
    my @links2;
    foreach my $link1 (@links1) {
        $link1 =~ s/\/$//;
        push @links2, $link1 if $link1 =~ /^http:/;
    }

    @links2 = Discard_Duplicates(
        empty_fields => 'delete',
        data         => \@links2,
    );

    foreach my $link2 (@links2) {
        print "$hash{$key}\n$hash{$key}{$link2}\n$link2\n";
    }
}

$hash{$key} prints out HASH(0x239b94).

$hash{$key}{$link2} prints out nothing at all, and seems to throw the warning "Use of uninitialized value".

So what's the deal? Do I need to use references? And if so, how?
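
For reference, here is a minimal sketch of the assignment the loop above never makes, assuming each extracted link is meant to become a nested anonymous hash under its parent page. It mirrors the last foreach in the code above; nothing in it comes from the original post.


Code

# Sketch only: give each link its own anonymous hash under the parent URL.
foreach my $link2 (@links2) {
    # Assigning an empty hashref creates the nested level explicitly...
    $hash{$key}{$link2} = {};

    # ...or merely treating it as a hash autovivifies it:
    # $hash{$key}{$link2}{depth} = 1;

    print "$hash{$key}\n$hash{$key}{$link2}\n$link2\n";
}

# Caveat: adding new top-level keys to %hash while iterating it with
# each() is not reliable in Perl; a separate work queue of URLs still
# to crawl is the usual way to grow the tree as you go.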


sycoogtit
User

Sep 6, 2008, 5:35 PM

Post #2 of 3 (849 views)
Re: [tules] Robot: storing URLs in hashes of hashes, causing great pain!

Going off of the example at http://search.cpan.org/~gaas/HTML-Parser-3.56/lib/HTML/LinkExtor.pm#EXAMPLE, here's something:


Code
#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

my @links = ();

# maximum number of pages to get so this doesn't run forever
my $max_pages = 5;

# number of pages retrieved so far
my $num_pages = 0;

# hash of pages we've gotten so we don't keep getting the same page
my %retrieved_pages;

get_page();

# Set up a callback that collects links
sub callback {
    my ($tag, %attr) = @_;
    while (my ($key, $val) = each %attr) {
        return if $val =~ /\.(js|css|png|wav|mp3|rm|mpg|bmp|jpg|rar|tar|zip|tif|gif|mp4)$/;
        return if $val =~ /javascript:/;
    }

    push(@links, values %attr);
}

sub get_page {
    my ($url) = @_;

    $num_pages++;
    $url = "http://www.myspace.com/" if !defined $url or $url eq "";
    my $ua = LWP::UserAgent->new;

    @links = ();
    # Make the parser. Unfortunately, we don't know the base yet
    # (it might be different from $url)
    my $p = HTML::LinkExtor->new(\&callback);

    # Request document and parse it as it arrives
    my $res = $ua->request(HTTP::Request->new(GET => $url),
                           sub { $p->parse($_[0]) });

    # Expand all URLs to absolute ones
    my $base = $res->base;
    @links = map { url($_, $base)->abs } @links;

    # Print them out
    print "$url links..... (num_pages: $num_pages)\n";
    print join("\n", @links), "\n";
    $retrieved_pages{$url} = 1;

    for my $link (@links) {
        if ($num_pages < $max_pages) {
            print "$link\n";
            get_page($link) unless $retrieved_pages{$link};
        }
    }
}


The only problem with this is that it's a depth-first search, which means it might never get to the 2nd or 3rd link on your starting page. I'm sure you can do a little tweaking so it's breadth-first.
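
One possible version of that tweak, sketched below: keep a queue of URLs and pull from the front instead of recursing, which makes the crawl breadth-first. The queue-based rewrite is illustrative and not from the original posts; it reuses the seed URL and callback style from the example above.


Code

#!/usr/bin/perl
# Sketch only: breadth-first variant of the crawler above, using a work queue.
use strict;
use warnings;

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

my $max_pages = 5;           # stop after this many pages
my %retrieved_pages;         # pages already fetched
my @queue = ("http://www.myspace.com/");   # seed URL, as in the example above

my $ua = LWP::UserAgent->new;

while (@queue and keys(%retrieved_pages) < $max_pages) {
    my $url = shift @queue;             # FIFO order is what makes this breadth-first
    next if $retrieved_pages{$url}++;   # skip anything already fetched

    # Collect links for this page via the same callback style as above
    my @links;
    my $p = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, values %attr;
    });

    my $res = $ua->request(HTTP::Request->new(GET => $url),
                           sub { $p->parse($_[0]) });

    # Expand relative links against the response base
    my $base = $res->base;
    @links = map { url($_, $base)->abs->as_string } @links;

    print "$url links.....\n", join("\n", @links), "\n";

    # Enqueue unseen http links; they are fetched only after everything
    # already waiting in the queue, i.e. level by level.
    for my $link (@links) {
        push @queue, $link if $link =~ /^http:/ and not $retrieved_pages{$link};
    }
}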

--
http://bunsooter.com


KevinR
Veteran


Sep 6, 2008, 5:59 PM

Post #3 of 3 (847 views)
Re: [sycoogtit] Robot: storing URLs in hashes of hashes, causing great pain!

Don't bother; he already figured this out and posted it on a different forum.
-------------------------------------------------

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives