List an array

 



tot94
Novice

Sep 17, 2015, 1:50 PM

Post #1 of 10 (1687 views)
List an array

Hello,

I am trying to print an array that contains the links from a web page, but the script's output is not what I expected to see. Here is the code:


Code
#!/usr/bin/perl -w 

use strict;
use warnings;

use WWW::Mechanize;

print "Which website scraped ? : \n ";
my $url = <>;

my $mech = WWW::Mechanize->new( autocheck => 1 );
my $result = $mech->get( $url );

my @links = $mech->find_all_links( url_regex => qr/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ );

foreach my $links (@links)
{
print "$links\n";
}


The output:


Code
antoine@antoine-UX31E:~/prg/web_scraping$ perl web.pl 
Which website scraped ? :
http://toto.fr
WWW::Mechanize::Link=ARRAY(0x2dc6ae0)
WWW::Mechanize::Link=ARRAY(0x2dd70a8)
WWW::Mechanize::Link=ARRAY(0x2dd77f8)
WWW::Mechanize::Link=ARRAY(0x2dcf270)
WWW::Mechanize::Link=ARRAY(0x2dd7648)
WWW::Mechanize::Link=ARRAY(0x2dd7360)
WWW::Mechanize::Link=ARRAY(0x2dd6ec8)
WWW::Mechanize::Link=ARRAY(0x2db6230)
WWW::Mechanize::Link=ARRAY(0x2dcf3d8)
WWW::Mechanize::Link=ARRAY(0x2db6290)
WWW::Mechanize::Link=ARRAY(0x2dd71f8)
WWW::Mechanize::Link=ARRAY(0x2dd73c0)
WWW::Mechanize::Link=ARRAY(0x2dcf570)
WWW::Mechanize::Link=ARRAY(0x2d6abc0)
WWW::Mechanize::Link=ARRAY(0x2dcf990)
WWW::Mechanize::Link=ARRAY(0x2d6a980)
WWW::Mechanize::Link=ARRAY(0x2dd7048)
WWW::Mechanize::Link=ARRAY(0x2dd79c0)
WWW::Mechanize::Link=ARRAY(0x2dcf6f0)
WWW::Mechanize::Link=ARRAY(0x2dd9d18)
WWW::Mechanize::Link=ARRAY(0x2dd7588)
WWW::Mechanize::Link=ARRAY(0x2dda378)
WWW::Mechanize::Link=ARRAY(0x2dda630)
WWW::Mechanize::Link=ARRAY(0x2dd7240)
WWW::Mechanize::Link=ARRAY(0x2dd00f8)
WWW::Mechanize::Link=ARRAY(0x2ddaa20)
WWW::Mechanize::Link=ARRAY(0x2dda0a8)
WWW::Mechanize::Link=ARRAY(0x2ddc8f8)
WWW::Mechanize::Link=ARRAY(0x2dcfc90)
WWW::Mechanize::Link=ARRAY(0x2dda300)
WWW::Mechanize::Link=ARRAY(0x2dda720)
WWW::Mechanize::Link=ARRAY(0x2dda528)
WWW::Mechanize::Link=ARRAY(0x2ddaa50)
WWW::Mechanize::Link=ARRAY(0x2ddc8c8)
WWW::Mechanize::Link=ARRAY(0x2dd9fd0)
WWW::Mechanize::Link=ARRAY(0x2dcfa20)
WWW::Mechanize::Link=ARRAY(0x2ddcac0)


Do you know why it prints the links this way?
Thanks


BillKSmith
Veteran

Sep 17, 2015, 8:09 PM

Post #2 of 10 (1683 views)
Re: [tot94] List an array

What do you expect? The method find_all_links returns a list of WWW::Mechanize::Link objects. Your output is exactly what you should expect for a list of objects.

You need the WWW::Mechanize::Link module to extract the information that you want. I cannot provide you with code because I do not use these modules.
Good Luck,
Bill
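
To illustrate the point above, here is a minimal sketch of the difference between printing a link object and calling its accessor methods. The link is built by hand with hypothetical values (the example.com URL and the text are assumptions) so that it runs without fetching a page; in the real script the objects would come from find_all_links.

```perl
use strict;
use warnings;
use WWW::Mechanize::Link;

# Hypothetical link built by hand so the example runs without network access.
my $link = WWW::Mechanize::Link->new({
    url  => '/docs/intro.html',
    text => 'Introduction',
    tag  => 'a',
    base => 'http://example.com/',
});

# Interpolating the object prints its class name and memory address,
# which is the kind of output shown in the first post.
print "As a string: $link\n";

# The accessor methods return the data stored in the object.
my $url  = $link->url;
my $text = $link->text;
print "URL:  $url\n";
print "Text: $text\n";
```

In the original loop this would amount to replacing print "$links\n" with a call such as $links->url.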


tot94
Novice

Sep 18, 2015, 1:22 AM

Post #3 of 10 (1679 views)
Re: [tot94] List an array

Hello,

I was expecting the links to be listed as URLs such as "http://...". Is it a type problem? Or do I have to perform another operation on the @links variable?


Laurent_R
Veteran / Moderator

Sep 18, 2015, 2:29 AM

Post #4 of 10 (1678 views)
Re: [tot94] List an array

Hi,
you might try to use the Data::Dumper module to visualize the contents of your objects:

Code
use Data::Dumper; # near the top of your script

and, in your foreach loop, something like this:

Code
print Dumper($links);   # $links already holds a reference, so no backslash is needed

Having said that, the proper way of exploring these objects is certainly to use the methods provided by the WWW::Mechanize module. I do not use it myself and can only recommend looking into the documentation.
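
For reference, a small self-contained sketch of dumping an object with Data::Dumper. The My::Link class and its contents are made up for the illustration; they only mimic an array reference blessed into a class. The two configuration variables are optional: Sortkeys keeps the hash keys in a stable order between runs, and Indent selects a more compact layout.

```perl
use strict;
use warnings;
use Data::Dumper;

# Sort hash keys so repeated runs print attributes in the same order.
$Data::Dumper::Sortkeys = 1;
# Use a more compact indentation style than the default.
$Data::Dumper::Indent = 1;

# Hypothetical stand-in for a link object: an array reference blessed
# into a made-up class.
my $object = bless [
    'http://example.com/',
    'Example',
    { rel => 'nofollow', class => 'external' },
], 'My::Link';

# Dumper returns a string describing the whole structure, class included.
my $dump = Dumper($object);
print $dump;
```

This is a debugging aid only; it shows what is inside the object but is not a way to extract the data.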


FishMonger
Veteran / Moderator

Sep 18, 2015, 6:06 AM

Post #5 of 10 (1674 views)
Re: [tot94] List an array

Using HTML::LinkExtor might be easier.
http://search.cpan.org/~gaas/HTML-Parser-3.71/lib/HTML/LinkExtor.pm
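
A minimal sketch of what that could look like, assuming the HTML is already available as a string (the $html snippet below is made up for the example). When HTML::LinkExtor is created without a callback, it accumulates the links and returns them from the links method as array references of the form [tag, attribute => value, ...].

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical page content; in a real script this would be the fetched HTML.
my $html = '<a href="http://example.com/a">A</a>'
         . '<img src="/logo.png">'
         . '<a href="/b.html">B</a>';

# No callback given, so the parser stores the links it finds.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

# Keep only the href attribute of <a> elements.
my @hrefs;
for my $link ( $parser->links ) {
    my ( $tag, %attr ) = @$link;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
}

print "$_\n" for @hrefs;
```

Passing a base URL as the second argument to new should make the module return absolute URLs instead of the raw attribute values.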


tot94
Novice

Sep 18, 2015, 10:53 AM

Post #6 of 10 (1669 views)
Re: [Laurent_R] List an array

Yo!
So I've updated the script like this:

Code
#!/usr/bin/perl -w 

use strict;
use warnings;

use WWW::Mechanize;
use Data::Dumper;

print "Which website scraped ? : \n ";
#my $url = "http://www.perlguru.com";
my $url = <>;

my $mech = WWW::Mechanize->new( autocheck => 1 );
my $result = $mech->get( $url );

# Notes on the URL regex:
# qr/STRING/  compiles STRING as a regular expression
# / ... /     delimiters of the expression
# $           end of the line
# \d+         digits (0-9), one or more times (matching as many as possible)
# .+          any character except \n, one or more times (matching as many as possible)
# @           sigil meaning the variable is an array

#my @links = $mech->find_all_links( url_regex => qr/\d+.+\.pdf$/ );
my @links = $mech->find_all_links( url_regex => qr/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ );

foreach my $links (@links)
{
# print("Hello");
print (Dumper ($links));
}

And here is what is displayed when I run the script:


Code
tot94@mac:~/prg/web_scraping$ perl web.pl 
Which website scraped ? :
https://en.wikipedia.org/wiki/Uniform_Resource_Locator
$VAR1 = bless( [
'https://en.wikipedia.org/wiki/Uniform_Resource_Locator',
undef,
undef,
'link',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'/' => '/',
'href' => 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator',
'rel' => 'canonical'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'http://news.bbc.co.uk/1/hi/technology/8306631.stm',
'"Berners-Lee "sorry" for slashes"',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'rel' => 'nofollow',
'class' => 'external text',
'href' => 'http://news.bbc.co.uk/1/hi/technology/8306631.stm'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'http://www.w3.org/Conferences/IETF92/WWX_BOF_mins.html',
'"Living Documents BoF Minutes"',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'rel' => 'nofollow',
'class' => 'external text',
'href' => 'http://www.w3.org/Conferences/IETF92/WWX_BOF_mins.html'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'http://www.w3.org/Addressing/URL/url-spec.txt',
'"Uniform Resource Locators (URL): A Syntax for the Expression of Access Information of Objects on the Network"',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'rel' => 'nofollow',
'href' => 'http://www.w3.org/Addressing/URL/url-spec.txt',
'class' => 'external text'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'http://tools.ietf.org/html/rfc1738',
'"Uniform Resource Locators (URL)"',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'rel' => 'nofollow',
'href' => 'http://tools.ietf.org/html/rfc1738',
'class' => 'external text'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'http://www.atm.tut.fi/list-archive/ietf-announce/msg13572.html',
'"Completion of IANA Selection of IDNA Prefix"',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'rel' => 'nofollow',
'class' => 'external text',
'href' => 'http://www.atm.tut.fi/list-archive/ietf-announce/msg13572.html'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'http://tools.ietf.org/html/rfc2396',
'"Uniform Resource Identifiers (URI): Generic Syntax"',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'rel' => 'nofollow',
'href' => 'http://tools.ietf.org/html/rfc2396',
'class' => 'external text'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'https://tools.ietf.org/html/rfc7595',
'"Guidelines and Registration Procedures for URI Schemes"',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'href' => 'https://tools.ietf.org/html/rfc7595',
'class' => 'external text',
'rel' => 'nofollow'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'https://tools.ietf.org/html/rfc3305',
'"Report from the Joint W3C/IETF URI Planning Interest Group: Uniform Resource Identifiers (URIs), URLs, and Uniform Resource Names (URNs): Clarifications and Recommendations"',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'href' => 'https://tools.ietf.org/html/rfc3305',
'class' => 'external text',
'rel' => 'nofollow'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'http://tools.ietf.org/html/rfc3986',
'"Uniform Resource Identifiers (URI): Generic Syntax"',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'class' => 'external text',
'href' => 'http://tools.ietf.org/html/rfc3986',
'rel' => 'nofollow'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'http://www.w3.org/International/articles/idn-and-iri/',
'"An Introduction to Multilingual Web Addresses"',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'class' => 'external text',
'href' => 'http://www.w3.org/International/articles/idn-and-iri/',
'rel' => 'nofollow'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'https://www.w3.org/International/wiki/IRIStatus',
'"What is Happening with "International URLs""',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'rel' => 'nofollow',
'class' => 'external text',
'href' => 'https://www.w3.org/International/wiki/IRIStatus'
}
], 'WWW::Mechanize::Link' );
$VAR1 = bless( [
'https://url.spec.whatwg.org/',
'URL specification',
undef,
'a',
bless( do{\(my $o = 'https://en.wikipedia.org/wiki/Uniform_Resource_Locator')}, 'URI::https' ),
{
'class' => 'external text',
'href' => 'https://url.spec.whatwg.org/',
'rel' => 'nofollow'
}
], 'WWW::Mechanize::Link' );

Why do I also capture the class and so much other detail?
Is my URL regex wrong?

Thanks
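
On the regex question, one way to check is to run the pattern against a few sample href values outside of the scraper. The sketch below uses the pattern from the script unchanged; the four sample values are made up to represent an absolute URL, a site-relative link, a fragment and a URL with a query string.

```perl
use strict;
use warnings;

# The pattern from the script above, unchanged.
my $url_regex = qr/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

# Hypothetical href values of the kinds a page may contain.
my @hrefs = (
    'http://tools.ietf.org/html/rfc1738',   # absolute URL, simple path
    '/wiki/Internet',                       # relative to the site root
    '#History',                             # fragment only
    'http://example.com/search?q=perl',     # query string
);

# Record which values the pattern keeps and which it filters out.
my %matched;
for my $href (@hrefs) {
    $matched{$href} = $href =~ $url_regex ? 1 : 0;
    printf "%-40s %s\n", $href, $matched{$href} ? 'kept' : 'filtered out';
}
```

Only the first value matches. Because url_regex is applied to the href as written in the page, relative links and URLs containing characters such as ? or = are dropped by this pattern. If the goal is to filter on the full address, find_all_links also accepts a url_abs_regex criterion that is matched against the absolute URL.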


BillKSmith
Veteran

Sep 18, 2015, 1:11 PM

Post #7 of 10 (1664 views)
Re: [tot94] List an array

Although it is possible to make this approach work, it is a very bad idea for several reasons. The first is that it totally defeats the advantages of OO: future changes to the module's internals could cause your program to fail. You might be able to ignore these objections if it were very easy to do, but if that were the case, you would not be asking for more help.

Laurent's other suggestion would result in a far better program. The knowledge that you gain would be useful in every network application you work on in the future (especially updates to this one). Study the documentation for Mechanize. The code to properly do what you want is probably trivial once you have identified the appropriate methods. Remember, the purpose of the module is to do common tasks for us.
Good Luck,
Bill
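
As a rough sketch of what the method-based version might look like: the code below fetches a page, asks for all <a> links and prints the absolute URL and link text of each through the url_abs and text methods. The list_links helper is a hypothetical name, and a temporary local HTML file with made-up content stands in for the website so that the example runs without network access; a real script would pass the URL typed by the user (after chomp) instead.

```perl
use strict;
use warnings;
use File::Temp;
use URI::file;
use WWW::Mechanize;

# Hypothetical helper: fetch $url and return one "absolute-url<TAB>text"
# string per <a> link found on the page.
sub list_links {
    my ( $mech, $url ) = @_;
    $mech->get($url);
    my @lines;
    for my $link ( $mech->find_all_links( tag => 'a' ) ) {
        # The link text can be undefined, for example for image-only links.
        my $text = defined $link->text ? $link->text : '';
        push @lines, $link->url_abs . "\t" . $text;
    }
    return @lines;
}

# Made-up local page standing in for a real website.
my $page = File::Temp->new( SUFFIX => '.html' );
print {$page} '<html><body>'
            . '<a href="http://example.com/a">First</a> '
            . '<a href="http://example.com/b">Second</a>'
            . '</body></html>';
close $page;

my $mech  = WWW::Mechanize->new( autocheck => 1 );
my @lines = list_links( $mech, URI::file->new_abs( $page->filename ) );
print "$_\n" for @lines;
```

Nothing here depends on how the link objects are stored internally, which is the point being made above.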


Laurent_R
Veteran / Moderator

Sep 18, 2015, 3:46 PM

Post #8 of 10 (1658 views)
Re: [tot94] List an array

Hmm, maybe I wasn't clear enough. The only reason I suggested using Data::Dumper was to show you that you actually get quite a lot of information back, but that this information was hidden behind a list of object references with your original syntax. Now you can see there is quite a bit of data behind that somewhat cryptic list of object references, and you know that what you get from your command can be useful.

However, I am not suggesting that you should use Data::Dumper, or even the background information supplied by Data::Dumper, to retrieve your data. I really don't think you should try to do that.

But now that you can see that the information has been retrieved and is there, you know that you can go further, and I would really urge you to RTFM, i.e. look at the documentation of the module's methods that will enable you to access the available data. As I already said, I personally don't use that module and therefore know very little about it, and I don't have time now to search the docs, so I can't help you much further. But, really, read the module's documentation to find the methods that you need and figure out how to use them.


bulrush
User

Sep 19, 2015, 5:49 AM

Post #9 of 10 (1646 views)
Re: [tot94] List an array

Try this:


Code
my @links = $mech->find_all_links( url_regex => qr/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ );  

foreach my $links (@links)
{
print "$links->{'url'}\n";
print "$links->{'text'}\n";
print "$links->{'name'}\n";
print "$links->{'tag'}\n";
print "$links->{'base'}\n";
print "$links->{'attr'}\n";
}



Laurent_R
Veteran / Moderator

Sep 20, 2015, 2:14 AM

Post #10 of 10 (1634 views)
Re: [bulrush] List an array

Bill and I were warning against precisely this temptation to access the data directly. WWW::Mechanize is object-oriented and supplies scores of methods for accessing the data; those methods should be used for that. It is generally rather bad practice to peek directly into the data structures that store the objects.
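
A small sketch of why this matters in practice. Judging by the Data::Dumper output earlier in the thread, the link objects are array references, so hash-style access such as $links->{'url'} would be expected to fail at run time, while the accessor methods do not depend on the internal layout. The link below is built by hand with made-up values so the example runs on its own.

```perl
use strict;
use warnings;
use WWW::Mechanize::Link;

# Hypothetical link built by hand; no network access is needed.
my $link = WWW::Mechanize::Link->new({
    url  => 'http://example.com/',
    text => 'Example',
    tag  => 'a',
});

# Hash-style access reaches into the object's storage. It is wrapped in
# eval here so the resulting run-time error can be reported instead of
# ending the script.
my $direct = eval { $link->{url} };
my $error  = $@;
print "Direct access failed: $error" if $error;

# The accessor methods work regardless of how the object is stored.
my %field = map { ( $_ => $link->$_() ) } qw(url text tag);
print "$_: $field{$_}\n" for sort keys %field;
```

Even direct access that happens to work today, for example by array index, could break if a later version of the module changes its internals.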

 
 

