Robot: storing URLs in multidimensional hash



tules
Novice

Sep 1, 2008, 7:27 PM



You should be able to see what I'm trying to do here: I want to generate an anonymous hash reference for each key, which is a URL. Each anonymous hash will then contain a key for every URL extracted from that page, and each of those keys in turn points to another anonymous hash for the links on that page, and so on ad infinitum. The result should be a tree-like structure mapping out all the URLs crawled by the bot. Please help, this is urgent as I need to impress my new boss :)



Code

use warnings;
use Sort::Array qw(Discard_Duplicates);
use LWP::Simple;
use HTML::LinkExtor;

my %hash = ( "http://www.myspace.com" => {} );

my $p = HTML::LinkExtor->new();

my @collected_stuff;

while ( ( my $key, my $value ) = each(%hash) ) {
    my $content = get($key);

    $p->parse($content)->eof;

    my @links = $p->links;

    my @links1 = ();

    foreach my $link (@links) {
        if ( $$link[2] =~ m/\.(js|css|png|wav|mp3|rm|mpg|bmp|jpg|rar|tar|zip|tif|gif|mp4)$/g ) { next; }
        push( @links1, $$link[2] );
    }

    my @links2 = ();

    foreach my $link1 (@links1) {
        $link1 =~ s/\/$//g;
        if ( $link1 =~ /^(http:)/g ) {
            push( @links2, $link1 );
        }
    }

    @links2 = Discard_Duplicates(
        empty_fields => 'delete',
        data         => \@links2,
    );

    foreach my $link2 (@links2) {
        print "$hash{$key}\n$hash{$key}{$link2}\n$link2\n";
    }
}











$hash{$key} prints out HASH(0x239b94).

$hash{$key}{$link2} prints out nothing at all, and seems to throw the warning "Use of uninitialized value".

So what's the deal? Do I need to use references? And if so, how?
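
For readers following the question: HASH(0x239b94) is what a hash reference looks like when it is printed as a string, and $hash{$key}{$link2} is undefined because nothing was ever assigned to it. A minimal sketch of the nested-hash tree follows. It assumes a hypothetical %links_on table in place of get() and HTML::LinkExtor so that it runs without network access; the example.* URLs and the names %tree, %seen, @queue and dump_tree are illustrative, not from the original post.

```perl
use strict;
use warnings;

# Hypothetical link map standing in for get() + HTML::LinkExtor.
my %links_on = (
    'http://a.example' => [ 'http://b.example', 'http://c.example' ],
    'http://b.example' => [ 'http://d.example', 'http://a.example' ],
);

my $start = 'http://a.example';
my %tree  = ( $start => {} );
my %seen  = ( $start => 1 );

# Queue of [url, node] pairs; node is the anonymous hash for that url.
my @queue = ( [ $start, $tree{$start} ] );

while ( my $item = shift @queue ) {
    my ( $url, $node ) = @$item;
    for my $link ( @{ $links_on{$url} || [] } ) {
        next if $seen{$link}++;    # skip pages already placed in the tree
        $node->{$link} = {};       # the assignment the original loop never made
        push @queue, [ $link, $node->{$link} ];
    }
}

# Print the tree, indenting two spaces per level.
sub dump_tree {
    my ( $node, $depth ) = @_;
    for my $url ( sort keys %$node ) {
        print '  ' x $depth, $url, "\n";
        dump_tree( $node->{$url}, $depth + 1 );
    }
}

dump_tree( \%tree, 0 );
```

Using a separate queue, rather than adding keys to the hash that each() is walking, avoids modifying a hash while iterating over it.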


PGScooter
stranger

Sep 1, 2008, 7:43 PM


Re: [tules] Robot: storing URLs in multidimensional hash

Hi Tules,

I still don't understand much of this code, but the output HASH(0x239b94) from $hash{$key} definitely tells me something. It isn't really an error: it is what Perl prints when a reference is used as a string. I usually see the same kind of thing when I try to print an array reference directly.

Here $hash{$key} holds a hash reference, so I would try dereferencing it and printing its keys:

Code
my %inner = %{ $hash{$key} };
print "$_\n" for keys %inner;


I hope that works, but wouldn't be surprised if I messed it up :)
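
A small sketch of what dereferencing looks like in practice may help here. It assumes a hand-built hash with made-up example.* URLs; the variable names are illustrative only.

```perl
use strict;
use warnings;

my %hash  = ( 'http://a.example' => { 'http://b.example' => {} } );
my $inner = $hash{'http://a.example'};

# Printed as a string, a reference shows its type and address.
print "as a string: $inner\n";

# ref() reports what kind of reference it is.
my $type = ref $inner;
print "ref type: $type\n";

# Dereferencing with %$inner gives access to the hash it points to.
my @children = sort keys %$inner;
print "children: @children\n";
```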


tules
Novice

Sep 2, 2008, 2:55 PM


Re: [PGScooter] Robot: storing URLs in multidimensional hash

I think it means you need to dereference it. It's OK, I've decided to take a completely different route anyway. Observe!




Code

use strict;
use warnings;
use Sort::Array qw(Discard_Duplicates);
use LWP::Simple;
use HTML::LinkExtor;

# Flat list of records: each link gets an id and the id of its parent page.
my @collected_stuff;

my %hash = (
    url    => "http://www.myspace.com",
    id     => 1,
    parent => 0,
);

push @collected_stuff, {%hash};    # store a copy as an anonymous hash

my $p = HTML::LinkExtor->new();

# Only the first page is crawled: the loop ends once links have been added.
while ( scalar(@collected_stuff) < 2 ) {
    my $parent = $collected_stuff[0];
    my $key    = $parent->{url};

    my $content = get($key);
    last unless defined $content;    # fetch failed, nothing to parse

    $p->parse($content)->eof;

    # Each link is [tag, attribute => url, ...]; element 2 is the first URL.
    my @links1;
    foreach my $link ( $p->links ) {
        next if $$link[2] =~ m/\.(js|css|png|wav|mp3|rm|mpg|bmp|jpg|rar|tar|zip|tif|gif|mp4)$/;
        push @links1, $$link[2];
    }

    # Keep absolute http links only, without a trailing slash.
    my @links2;
    foreach my $link1 (@links1) {
        $link1 =~ s/\/$//;
        push @links2, $link1 if $link1 =~ /^http:/;
    }

    @links2 = Discard_Duplicates(
        empty_fields => 'delete',
        data         => \@links2,
    );

    last unless @links2;    # no links found: stop instead of looping forever

    # Ids continue from the last id used so far.
    my $next_id = $collected_stuff[-1]{id} + 1;
    foreach my $link2 (@links2) {
        push @collected_stuff, {
            url    => $link2,
            id     => $next_id++,
            parent => $parent->{id},
        };
    }

    # Print every record, one field per line.
    foreach my $ref_hash (@collected_stuff) {
        foreach my $name ( sort keys %$ref_hash ) {
            print $name . " " . $$ref_hash{$name} . "\n";
        }
        print "\n\n";
    }
}
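
With flat records like these (url, id, parent), the tree can be recovered later by grouping records under their parent id. The following is only a sketch under that assumption: the records are made up, and %children_of and tree_lines are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical records in the same shape as the url/id/parent hashes above.
my @records = (
    { url => 'http://a.example', id => 1, parent => 0 },
    { url => 'http://b.example', id => 2, parent => 1 },
    { url => 'http://c.example', id => 3, parent => 1 },
    { url => 'http://d.example', id => 4, parent => 2 },
);

# Index the records by the id of their parent.
my %children_of;
push @{ $children_of{ $_->{parent} } }, $_ for @records;

# Walk down from a parent id, returning one indented line per record.
sub tree_lines {
    my ( $parent_id, $depth ) = @_;
    my @lines;
    for my $rec ( @{ $children_of{$parent_id} || [] } ) {
        push @lines, ( '  ' x $depth ) . $rec->{url};
        push @lines, tree_lines( $rec->{id}, $depth + 1 );
    }
    return @lines;
}

my @lines = tree_lines( 0, 0 );    # parent 0 marks the root
print "$_\n" for @lines;
```

The flat list is easy to store and print, while the parent ids keep enough information to rebuild the nesting on demand.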



prefectuous
Novice

Oct 29, 2008, 8:01 AM


Re: [tules] Robot: storing URLs in multidimensional hash

To keep this organized, you will probably want to create an object that knows how to handle a certain amount of recursion.
A very basic example follows. To add your link data, just pass it to the constructor. I just horked this up and haven't tested it, but you should get the idea.

If you put one of the autoloading base classes in @ISA, you can simplify these objects quite a bit and attach whatever additional data you want to the bot without having to rewrite the class.
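
One way to read the autoloading suggestion is a base class whose AUTOLOAD method turns any unknown method name into a getter/setter for the hash key of the same name. The sketch below is an assumption about what was meant, not a specific CPAN module; AutoAccessor and LinkRecord are hypothetical names.

```perl
use strict;
use warnings;

package AutoAccessor;    # hypothetical base class

our $AUTOLOAD;

# Any unknown method becomes a getter/setter for the key of that name.
sub AUTOLOAD {
    my $self = shift;
    ( my $name = $AUTOLOAD ) =~ s/.*:://;    # strip the package prefix
    return if $name eq 'DESTROY';            # object destruction is not a field
    $self->{$name} = shift if @_;
    return $self->{$name};
}

package LinkRecord;      # hypothetical subclass
our @ISA = ('AutoAccessor');

sub new {
    my ( $class, %data ) = @_;
    return bless { %data }, $class;
}

package main;

my $rec = LinkRecord->new( url => 'http://a.example' );
$rec->title('Example page');    # no title() method was written anywhere
print $rec->url, " / ", $rec->title, "\n";
```

The trade-off is that a misspelled method name silently creates a new field instead of raising an error.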
_________________________

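package Mylink;

use strict;
use warnings;

# Constructor: takes key/value pairs (your link data) as the node's fields.
sub new {
    my $class = shift;
    my $self  = {@_};
    $self->{Branch} = [];    # child nodes
    return bless $self, $class;
}

# Create a child node, record this node as its parent, and return the child.
sub addbranch {
    my $self = shift;
    # ref($self) gives the class name; bless cannot take an object as the class.
    my $child = ref($self)->new( @_, Uptree => $self );
    push @{ $self->{Branch} }, $child;
    return $child;
}

# Return the parent node (undef at the root).
sub uptree {
    my $self = shift;
    return $self->{Uptree};
}

1;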

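#!/bin/perl
use strict;
use warnings;

# Start from a root node so there is something to add branches to.
my $ThisRecordPointer = Mylink->new( url => "http://www.myspace.com" );

# Building a tree.
# getalink() is not defined here; it stands for your own routine. This sketch
# assumes it returns a hash reference of link data to descend into a new
# branch, an empty hash reference to step back up, and undef when finished.

while ( defined( my $link = getalink() ) ) {
    if (%$link) {
        $ThisRecordPointer = $ThisRecordPointer->addbranch(%$link);
    }
    elsif ( defined $ThisRecordPointer->uptree() ) {
        $ThisRecordPointer = $ThisRecordPointer->uptree();
    }
}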
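
Since the example above was posted untested, here is a self-contained variant that can be run as a single file. It is a sketch only: LinkNode is a hypothetical name modelled on Mylink, the example.* URLs are made up, the depth method is an addition, and the parent link is weakened with Scalar::Util (a core module) so that parent and child do not keep each other alive.

```perl
use strict;
use warnings;

package LinkNode;    # hypothetical name, modelled on Mylink above
use Scalar::Util qw(weaken);

sub new {
    my ( $class, %data ) = @_;
    my $self = bless { %data, Branch => [] }, $class;
    # A weak parent link avoids a reference cycle between parent and child.
    weaken( $self->{Uptree} ) if ref $self->{Uptree};
    return $self;
}

sub addbranch {
    my ( $self, %data ) = @_;
    my $child = ref($self)->new( %data, Uptree => $self );
    push @{ $self->{Branch} }, $child;
    return $child;
}

sub uptree { return $_[0]{Uptree} }

# Depth of a node: the number of uptree() steps needed to reach the root.
sub depth {
    my $node  = shift;
    my $depth = 0;
    $depth++ while $node = $node->uptree;
    return $depth;
}

package main;

my $root   = LinkNode->new( url => 'http://a.example' );
my $page_b = $root->addbranch( url => 'http://b.example' );
my $page_d = $page_b->addbranch( url => 'http://d.example' );
$root->addbranch( url => 'http://c.example' );

print $page_d->{url}, " is at depth ", $page_d->depth, "\n";
print "root has ", scalar @{ $root->{Branch} }, " branches\n";
```

Because the parent link is weak, the tree stays alive only as long as something holds the root, so keep $root in scope while walking the nodes.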