Robot: storing URLs in multidimensional hash

 



tules
Novice

Sep 1, 2008, 7:27 PM

Post #1 of 4 (1682 views)
Robot: storing URLs in multidimensional hash

You should be able to see what I'm trying to do here: I want to generate an anonymous hash reference for each key, which is a URL. Each anonymous hash will then contain keys for all the URLs extracted from that page, each of which opens up into another anonymous hash for the links on that page, and so on ad infinitum. The result should be a tree-like structure mapping out all the URLs crawled by the bot. Please help, this is urgent as I need to impress my new boss :)



Code
    


use warnings;
use Sort::Array qw(Discard_Duplicates);
use LWP::Simple;
use HTML::LinkExtor;

my %hash = ("http://www.myspace.com" => {});

my $p = HTML::LinkExtor->new();

my @collected_stuff;

while ((my $key, my $value) = each(%hash)) {
    my $content = get($key);

    $p->parse($content)->eof;

    my @links = $p->links;

    # Skip links to scripts, stylesheets, archives and media files
    my @links1 = ();
    foreach my $link (@links) {
        if ($$link[2] =~ m/\.(js|css|png|wav|mp3|rm|mpg|bmp|jpg|rar|tar|zip|tif|gif|mp4)$/) { next; }
        push(@links1, $$link[2]);
    }

    # Strip trailing slashes and keep only absolute http: links
    my @links2 = ();
    foreach my $link1 (@links1) {
        $link1 =~ s/\/$//;
        if ($link1 =~ /^http:/) {
            push(@links2, $link1);
        }
    }

    @links2 = Discard_Duplicates(
        empty_fields => 'delete',
        data         => \@links2,
    );

    # Nothing is ever assigned to $hash{$key}{$link2}, so it is still undef here
    foreach my $link2 (@links2) {
        print "$hash{$key}\n$hash{$key}{$link2}\n$link2\n";
    }
}

$hash{$key} prints out HASH(0x239b94).

$hash{$key}{$link2} prints nothing at all and triggers the warning "Use of uninitialized value".

So what's the deal? Do I need to use references? And if so, how?


PGScooter
stranger

Sep 1, 2008, 7:43 PM

Post #2 of 4 (1681 views)
Re: [tules] Robot: storing URLs in multidimensional hash

Hi Tules,

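I still don't understand much of this code, but "$hash{$key} prints out HASH(0x239b94)" definitely tells me something. That isn't really an error; it's what Perl prints when you use a reference as a string. I usually see it when I print a reference without dereferencing it first.

Since $hash{$key} holds a hash reference, I would try dereferencing it and printing its keys:

Code
# $hash{$key} is a hash reference, so dereference it with %{ ... }
my %children = %{ $hash{$key} };
print join("\n", keys %children), "\n";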


I hope that works, but wouldn't be surprised if I messed it up :)
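
To make the dereferencing part concrete, here's a rough sketch of building the nested hash and reading it back. It is untested against a real crawl: %links_on is a made-up table standing in for page fetches, and grow, %seen and the example.com URLs are just names I picked for illustration.

```perl
use strict;
use warnings;

# Hypothetical link table standing in for real page fetches.
my %links_on = (
    'http://example.com'   => [ 'http://example.com/a', 'http://example.com/b' ],
    'http://example.com/a' => [ 'http://example.com/c', 'http://example.com' ],
);

my %seen;    # URLs already expanded, so cycles do not recurse forever

# Fill $node (a hash ref) with one empty hash per link found on $url,
# then descend into each link until $depth runs out.
sub grow {
    my ($node, $url, $depth) = @_;
    return if $depth <= 0 or $seen{$url}++;
    for my $link (@{ $links_on{$url} || [] }) {
        $node->{$link} = {};    # this assignment creates the child level
        grow($node->{$link}, $link, $depth - 1);
    }
}

my $start = 'http://example.com';
my %tree  = ( $start => {} );
grow($tree{$start}, $start, 3);

# A reference used as a string gives HASH(0x...); dereference it instead.
my $as_string = '' . $tree{$start};
my @children  = sort keys %{ $tree{$start} };
print "$_\n" for @children;
```

The point is that $tree{$start}{$link} stays undefined until something is assigned to it, so each child has to be created with an empty {} before you can descend into it.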
The more you teach me, the more I learn. The more I learn, the more I teach.


tules
Novice

Sep 2, 2008, 2:55 PM

Post #3 of 4 (1662 views)
Re: [PGScooter] Robot: storing URLs in multidimensional hash

It means you need to dereference it, I think. It's OK, I've decided to take a completely different route anyway. Observe!




Code
   


use strict;
use warnings;
use Sort::Array qw(Discard_Duplicates);
use LWP::Simple;
use HTML::LinkExtor;

my @collected_stuff;

# Each record holds a URL, its own id, and the id of the page that linked to it
my %hash = (
    url    => "http://www.myspace.com",
    id     => 1,
    parent => 0,
);

push(@collected_stuff, {%hash});

# Runs once: the loop ends as soon as the root page's links have been added
while (scalar(@collected_stuff) < 2) {
    my $root    = $collected_stuff[0];
    my $content = get($root->{url});
    last unless defined $content;    # get() returns undef on failure

    # Use a fresh parser per page so links do not accumulate across pages
    my $p = HTML::LinkExtor->new();
    $p->parse($content)->eof;

    my @links = $p->links;

    # Skip links to scripts, stylesheets, archives and media files
    my @links1 = ();
    foreach my $link (@links) {
        if ($$link[2] =~ m/\.(js|css|png|wav|mp3|rm|mpg|bmp|jpg|rar|tar|zip|tif|gif|mp4)$/) { next; }
        push(@links1, $$link[2]);
    }

    # Strip trailing slashes and keep only absolute http: links
    my @links2 = ();
    foreach my $link1 (@links1) {
        $link1 =~ s/\/$//;
        if ($link1 =~ /^http:/) {
            push(@links2, $link1);
        }
    }

    @links2 = Discard_Duplicates(
        empty_fields => 'delete',
        data         => \@links2,
    );
    last unless @links2;    # nothing to add, so stop instead of looping forever

    # New ids continue from the last id used
    my $next_id = $collected_stuff[-1]{id} + 1;
    foreach my $link2 (@links2) {
        push(@collected_stuff, {
            url    => $link2,
            id     => $next_id++,
            parent => $root->{id},
        });
    }

    foreach my $ref_hash (@collected_stuff) {
        foreach my $name (keys %$ref_hash) {
            print $name . " " . $$ref_hash{$name} . "\n";
        }
        print "\n\n";
    }
}
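
For comparison, here's a rough sketch of how the flat url/id/parent layout could be carried past the first page. It is only an illustration: %links_on is a made-up stand-in for fetching and parsing pages, and @records, %seen and the example.com URLs are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fetching a page and extracting its links.
my %links_on = (
    'http://example.com'   => [ 'http://example.com/a', 'http://example.com/b' ],
    'http://example.com/a' => [ 'http://example.com/c' ],
);

my @records = ( { url => 'http://example.com', id => 1, parent => 0 } );
my %seen    = ( 'http://example.com' => 1 );

# Walk the array by index; records pushed inside the loop are visited later,
# which gives a breadth-first crawl without any nesting.
for (my $i = 0; $i < @records; $i++) {
    my $page = $records[$i];
    for my $link (@{ $links_on{ $page->{url} } || [] }) {
        next if $seen{$link}++;    # skip URLs that already have a record
        push @records, {
            url    => $link,
            id     => scalar(@records) + 1,    # ids simply count upward
            parent => $page->{id},
        };
    }
}

printf "%d %s (parent %d)\n", @{$_}{qw(id url parent)} for @records;
```

The tree is implicit here: to find the children of a page, look for the records whose parent matches its id.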



prefectuous
Novice

Oct 29, 2008, 8:01 AM

Post #4 of 4 (1416 views)
Re: [tules] Robot: storing URLs in multidimensional hash

To keep this organized, you will probably want to create an object that knows how to handle a certain amount of recursion.
A very basic example follows. To add your link data, just pass it to the constructor. I just horked this up and haven't tested it, but you should get the idea.

If you use one of the autoloading base classes in @ISA, you can simplify these objects quite a bit and add whatever additional data you want to the bot without having to rewrite the class.
_________________________

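package Mylink ;

use strict ;
use warnings ;

# Constructor: any key/value pairs passed in become the node's data
sub new {
    my $class = shift ;
    my %foo = @_ ;
    my $self = \%foo ;
    bless ($self, $class) ;
    $self->{'Branch'} = [] ;    # child nodes
    return $self ;
}

# Create a child node, remember this node as its parent, and return the child
sub addbranch {
    my $self = shift ;
    # ref($self) gives the class name; calling new on the object itself
    # would bless the child into a bogus package
    my $O = ref($self)->new(@_, 'Uptree' => $self) ;
    push @{ $self->{'Branch'} }, $O ;
    return $O ;
}

# Return the parent node (undef at the root)
sub uptree {
    my $self = shift ;
    return $self->{'Uptree'} ;
}

1 ;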

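#!/bin/perl

use strict ;
use warnings ;

# Assumes the Mylink package above has already been loaded.
# Start at the root of the tree.
my $ThisRecordPointer = Mylink->new() ;

# Building a tree
# getalink() is not defined here. It is assumed to return a hash ref of link
# data to descend into, an empty hash ref to step back up one level, and
# undef when there is nothing left to crawl.
while (defined(my $link = getalink())) {
    if (%$link) {
        # Got data: add a branch and descend into it
        $ThisRecordPointer = $ThisRecordPointer->addbranch(%$link) ;
    }
    elsif (defined $ThisRecordPointer->uptree()) {
        # No data: step back up to the parent (stay put at the root)
        $ThisRecordPointer = $ThisRecordPointer->uptree() ;
    }
}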
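
If you want to try the idea without a crawler attached, here's a small self-contained variant you can run as-is. Treat it as a sketch: the class is renamed to Node so it doesn't clash with Mylink, and the outline method and the example.com URLs are made up for illustration.

```perl
use strict;
use warnings;

package Node;    # hypothetical name, to avoid clashing with Mylink

use Scalar::Util qw(weaken);

sub new {
    my ($class, %args) = @_;
    return bless { %args, Branch => [] }, $class;
}

sub addbranch {
    my ($self, %args) = @_;
    my $child = ref($self)->new(%args, Uptree => $self);
    weaken($child->{Uptree});    # parent link is weak, so the cycle can be freed
    push @{ $self->{Branch} }, $child;
    return $child;
}

# Return one indented line per node, depth first.
sub outline {
    my ($self, $depth) = @_;
    $depth ||= 0;
    my $indent = '  ' x $depth;
    my @lines  = ("$indent$self->{url}");
    push @lines, $_->outline($depth + 1) for @{ $self->{Branch} };
    return @lines;
}

package main;

my $root  = Node->new(url => 'http://example.com');
my $first = $root->addbranch(url => 'http://example.com/a');
$first->addbranch(url => 'http://example.com/c');
$root->addbranch(url => 'http://example.com/b');

my @lines = $root->outline;
print "$_\n" for @lines;
```

Because each node only knows its own branches, a walk like outline is just a method that calls itself on every child, which is the "certain amount of recursion" mentioned above.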

 
 

