a little script that makes use of LWP::Simple

 



dilbert
User

Jan 29, 2018, 4:50 AM

Post #1 of 6
a little script that makes use of LWP::Simple

Hello dear Perl experts,


I'm quite new to programming, and to OO programming especially. Nonetheless, I'm trying to write a very simple spider for web crawling. The code is below.

The goal: I need to fetch the data from this page: http://europa.eu/youth/volunteering/evs-organisation_en

First step: I view the page source to find the HTML elements:


Code
view-source:https://europa.eu/youth/volunteering/evs-organisation_en


Note: I need to fetch the data that comes right below this line:


Code
<h3>EVS accredited organisations search results: <span class="ey_badge">6066</span></h3>  </div>


In fact, below this line is all the interesting data that I need to fetch.
Note: see the bottom of the page - there are many more pages, and I need to switch to them and fetch their content in the same way.


I have several options: I could do this with the PHP Simple HTML DOM Parser
(cf. http://simplehtmldom.sourceforge.net/manual.htm), but I like Perl more than PHP, and I guess this is a true Perl job.


So I think I can solve it this way, but I have to extend this code a bit:




Code
#!C:\Perl\bin\perl

use warnings;

BEGIN {
    # send all default print output to links.txt
    open my $file1, "+>>", "links.txt";
    select($file1);
}

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

# the URL I want it to start at
$URL = "view-source:https://europa.eu/youth/volunteering/evs-organisation_en";

# request and receive the contents of the web page
for ($URL) {
    $contents = get($URL);
    $browser  = LWP::UserAgent->new('IE 6');
    $browser->timeout(10);
    my $request  = HTTP::Request->new(GET => $URL);
    my $response = $browser->request($request);

    # tell me if there is an error
    if ($response->is_error()) { printf "%s\n", $response->status_line; }
    $contents = $response->content();

    # extract the links from the HTML
    my ($page_parser) = HTML::LinkExtor->new(undef, $URL);
    $page_parser->parse($contents)->eof;
    @links = $page_parser->links;

    # print the links to the links.txt file
    foreach $link (@links) { print "$$link[2]\n"; }
}



Well, I have to extend this code a bit ...


Zhris
Enthusiast

Jan 29, 2018, 8:36 AM

Post #2 of 6
Re: [dilbert] a little script that makes use of LWP::Simple

Hi,


Quote
First step: I view the page source to find the HTML elements:


view-source is a browser-based command: it tells the browser to output the response as plain text rather than render it according to its actual content type, HTML in this case. You should not need to include view-source in your URL.

I have written a little script that extracts the data out of each block and cleans it up a little. The browse function is generic: it takes an input ref which contains the url and the xpaths of the parent and children, and uses them to construct the output ref.

It is just to give you an idea of an approach I might take. It does not yet navigate across each page, but you may want to use it as a basis.


Code
use strict;
use warnings FATAL => qw#all#;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use Data::Dumper;

my $handler_relurl      = sub { q#https://europa.eu# . $_[0] };
my $handler_trim        = sub { $_[0] =~ s#^\s*(.+?)\s*$#$1#r };
my $handler_val         = sub { $_[0] =~ s#^[^:]+:\s*##r };
my $handler_split       = sub { [ split $_[0], $_[1] ] };
my $handler_split_colon = sub { $handler_split->( qr#; #, $_[0] ) };
my $handler_split_comma = sub { $handler_split->( qr#, #, $_[0] ) };

my $conf =
{
    url      => q#https://europa.eu/youth/volunteering/evs-organisation_en#,
    parent   => q#//div[@class="vp ey_block block-is-flex"]#,
    children =>
    {
        internal_url => [ q#//a/@href#, [ $handler_relurl ] ],
        external_url => [ q#//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, [ $handler_trim ] ],
        title        => [ q#//h4# ],
        topics       => [ q#//div[@class="org_cord"]#, [ $handler_val, $handler_split_colon ] ],
        location     => [ q#//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, [ $handler_trim ] ],
        hand         => [ q#//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, [ $handler_trim, $handler_split_comma ] ],
        pic_number   => [ q#//p[contains(.,'PIC no')]#, [ $handler_val ] ],
    }
};

print Dumper browse( $conf );

sub browse
{
    my $conf = shift;

    my $ref = [ ];

    my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 );
    my $response = $lwp_useragent->get( $conf->{url} );
    die $response->status_line unless $response->is_success;
    my $content = $response->decoded_content;

    my $html_treebuilder_xpath = HTML::TreeBuilder::XPath->new_from_content( $content );
    my @nodes = $html_treebuilder_xpath->findnodes( $conf->{parent} );
    for my $node ( @nodes )
    {
        push @$ref, { };

        while ( my ( $key, $val ) = each %{$conf->{children}} )
        {
            my $xpath    = $val->[0];
            my $handlers = $val->[1] // [ ];

            $val = ($node->findvalues( qq#.$xpath# ))[0] // next;
            $val = $_->( $val ) for @$handlers;
            $ref->[-1]->{$key} = $val;
        }
    }

    return $ref;
}


Output of the first block:


Code
{
    'internal_url' => 'https://europa.eu/youth/volunteering/organisation/948417016_en',
    'external_url' => 'http://www.apd.ge',
    'location' => 'Tbilisi, Georgia',
    'title' => '"Academy for Peace and Development" Union',
    'topics' => [
        'Access for disadvantaged',
        'Youth (Participation, Youth Work, Youth Policy)',
        'Intercultural/intergenerational education and (lifelong)learning'
    ],
    'pic_number' => '948417016',
    'hand' => [
        'Receiving',
        'Sending'
    ]
}


Chris


dilbert
User

Jan 29, 2018, 11:04 AM

Post #3 of 6
Re: [Zhris] a little script that makes use of LWP::Simple

 
Hello dear Chris,


Many thanks for the quick answer and the ideas you give here: this is more than I expected. It gives me an idea of an approach I can take.

Now I have to take care of navigating across each page.


Many thanks again for this great posting; it sheds light on the process.


I especially like this part:



Code
use LWP::UserAgent;  
use HTML::TreeBuilder::XPath;
use Data::Dumper;



and the subsequent code with all its crystal-clear lines.


Many thanks for your support here!!

Keep up the great work! It rocks.

dilbert ;)



dilbert
User

Jan 30, 2018, 7:21 AM

Post #4 of 6
Re: [dilbert] a little script that makes use of LWP::Simple

Hello dear all,


At first glance, the issue of scraping from page to page can be solved via different approaches.

We have the pagination at the bottom of the page; see for example:



Code
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=5


and


Code
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=6


and


Code
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=7


and so forth.


Well, we can use these URLs as a base.

If we have an array from which we load the URLs that need to be visited, we would come across all the pages.

Note: we have more than 6000 results, and on each page there are 21 little entries, each representing one record,

so we have roughly 289 pages to visit (6066 results / 21 per page, rounded up).

We can increment the page number (shown in the URLs above) and count up to the last page.

How do you like this idea?


Zhris
Enthusiast

Jan 30, 2018, 7:29 PM

Post #5 of 6
Re: [dilbert] a little script that makes use of LWP::Simple

Hi,

Hardcoding the total number of pages isn't practical as it could vary. You could:

- extract the number of results from the first page, divide that by the results per page ( 21 ) and round it down.
- extract the url from the "last" link at the bottom of the page, create a URI object and read the page number from the query string.

Note that I say round down above, because the query page number begins at 0, not 1.

Looping over could be as simple as:


Code
my $url_pattern = 'https://europa.eu/youth/volunteering/evs-organisation_en?page=%s';

for my $page ( 0 .. $last )
{
    my $url = sprintf $url_pattern, $page;

    ...
}
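
For example, $last could be worked out from the result count in the &lt;h3&gt; badge you quoted on the first page, following the first option above. This is only an untested sketch of that idea, reusing the same modules as the script above:


Code
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;

my $ua = LWP::UserAgent->new( timeout => 10 );
my $response = $ua->get( 'https://europa.eu/youth/volunteering/evs-organisation_en' );
die $response->status_line unless $response->is_success;

my $tree  = HTML::TreeBuilder::XPath->new_from_content( $response->decoded_content );
my $total = $tree->findvalue( q{//h3/span[@class="ey_badge"]} );    # e.g. 6066

my $per_page = 21;
my $last     = int( ( $total - 1 ) / $per_page );    # last page index, 0-based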


Otherwise, I personally would probably try to incorporate paging into the $conf, perhaps as an iterator which, upon each call, fetches the next node; behind the scenes it automatically increments the page when there are no nodes left, until there are no pages left. But this is probably beyond the scope of what you need, and a basic looping mechanism should be sufficient.
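
Very roughly, such an iterator could look like the sketch below (untested; it assumes page is the query parameter, as in the URLs you posted, and that the last page index is already known):


Code
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;

# returns a closure that hands back one parent node per call, fetching the
# next page behind the scenes whenever the current page runs out of nodes
sub make_node_iterator
{
    my ( $conf, $last_page ) = @_;

    my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 );
    my $page  = 0;
    my @nodes = ( );

    return sub
    {
        until ( @nodes or $page > $last_page )
        {
            my $url      = $conf->{url} . '?page=' . $page++;
            my $response = $lwp_useragent->get( $url );
            die $response->status_line unless $response->is_success;

            my $tree = HTML::TreeBuilder::XPath->new_from_content( $response->decoded_content );
            @nodes   = $tree->findnodes( $conf->{parent} );
        }

        return shift @nodes;    # undef once every page is exhausted
    };
}

# usage:
# my $next_node = make_node_iterator( $conf, $last );
# while ( my $node = $next_node->() ) { ... extract the children as in browse() ... }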

I checked Web::Scraper in case it had features to handle paging, which it unfortunately doesn't. It is, however, a much more feature-rich replacement for my solution above, and it could be used in its place if you preferred.
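
Just to give a flavour of Web::Scraper, here is a minimal, untested sketch that only collects the organisation titles; the CSS selector is an assumption based on the same markup my XPaths above target:


Code
use Web::Scraper;
use URI;

# one scraper rule: collect the text of every h4 inside a result block
my $orgs = scraper {
    process 'div.ey_block h4', 'titles[]' => 'TEXT';
};

my $result = $orgs->scrape( URI->new('https://europa.eu/youth/volunteering/evs-organisation_en') );
print "$_\n" for @{ $result->{titles} || [] };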

Finally, if you eventually need to look into distributing the scrape process across multiple processes, there are various ways this could be incorporated, but you should consider doing it asynchronously via HTTP::Async.
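
As a rough, untested sketch of the HTTP::Async idea (the page range and the page parameter are assumptions based on the pagination URLs above):


Code
use HTTP::Async;
use HTTP::Request;

my $async = HTTP::Async->new( slots => 5, timeout => 10 );

# queue a handful of result pages
for my $page ( 0 .. 9 )
{
    $async->add( HTTP::Request->new( GET => "https://europa.eu/youth/volunteering/evs-organisation_en?page=$page" ) );
}

# responses come back as they complete, not necessarily in order
while ( my $response = $async->wait_for_next_response )
{
    next unless $response->is_success;
    my $content = $response->decoded_content;
    # hand $content over to the same HTML::TreeBuilder::XPath parsing as in browse()
}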

Let us know how you get on.

Chris


(This post was edited by Zhris on Jan 30, 2018, 7:36 PM)


dilbert
User

Jan 31, 2018, 10:44 AM

Post #6 of 6
Re: [Zhris] a little script that makes use of LWP::Simple

hello dear Chris, hello dear all. ;)



Note: first of all, dear Chris, this is just a quick posting (at the moment I am not at home), but I wanted to share some ideas with you and to say many thanks for the continued help!!

To begin at the beginning: the scraper runs very well - I have tried it out. This is just overwhelming.

Besides the results that this script brings, it is a great chance to see how (!!) Perl works, and a great chance to learn and to dig deeper into Perl.

BTW: it applies not only to the above-mentioned page of volunteering organisations (altogether more than 6000 records),

but also to the volunteering projects (search results: 506 projects found); see https://europa.eu/youth/volunteering_en and especially https://europa.eu/youth/volunteering/project_en?field_eyp_country_value=All&country=&type_1=All&topic=&date_start=&date_end=

As mentioned above, I want to learn more and more; Perl is pretty difficult, and for me it seems harder to dive into than PHP or Python. But the data structure there is (almost) the same, so the scraper can be applied here too - wonderful.
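
If the markup really is almost identical, I guess I can reuse Zhris's browse() by copying $conf and only swapping the url - an untested sketch; the parent/children XPaths may still need adjusting for the projects page:


Code
# reuse the conf from Zhris's script, pointing it at the projects page
my $conf_projects =
{
    %$conf,
    url => q#https://europa.eu/youth/volunteering/project_en#,
};

print Dumper browse( $conf_projects );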

But anyway, Perl is very, very powerful, and as we see, we can do a lot of things with it. I will try to gain more XPath knowledge, and I will try to apply the solution to more than only one target - perhaps there is a kind of swiss-army-knife solution, a robust scraper/parser that works for such (pretty difficult) sites with blocks like the ones on this European target.

Conclusion: after parsing the data, I will try to store it all in a MySQL DB. This is the next step in this little project.
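
For the storage step, something along these lines is what I have in mind - only a sketch; the database name, table, columns and credentials are placeholders:


Code
use DBI;

# connect to a local MySQL database (placeholder credentials)
my $dbh = DBI->connect( 'DBI:mysql:database=evs;host=localhost', 'user', 'password',
                        { RaiseError => 1, AutoCommit => 1 } );

my $sth = $dbh->prepare( q{
    INSERT INTO organisations (pic_number, title, location, internal_url, external_url)
    VALUES (?, ?, ?, ?, ?)
} );

# $conf and browse() as in Zhris's script; one row per scraped block
my $records = browse( $conf );

for my $rec ( @$records )
{
    $sth->execute( @{$rec}{ qw(pic_number title location internal_url external_url) } );
}

$dbh->disconnect;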

As for the next step - looping over the different pages: I have done a quick search to find some solutions that may give hints on how we can do this "pagination thing".

First of all, I like the idea of incrementing the page - as you say, "behind the scenes it automatically increments the page when there are no nodes left, until there are no pages left". Chris, this is a great idea.

Here is what a quick search turned up - solutions that could be applied, more or less:


A Bash solution: https://stackoverflow.com/questions/35423019/iterate-through-urls

if I wanted to [..] download a bunch of .ts files from a website, and the url format is http://example.com/video-1080Pxxxxx.ts

Question: Where the xxxxx is a number from 00000 to 99999 (required zero padding), how would I iterate through that in bash so that it tries every integer starting at 00000, 00001, 00002, etc.?

Loop over the integer values from 0 to 99999, and use printf to pad to 5 digits.


Code
for x in {0..99999}; do
    zx=$(printf '%05d' $x)   # zero-pad to 5 digits
    url="http://example.com/video-1080P${zx}.ts"
    ...                      # do something with url
done


A Perl solution:

https://stackoverflow.com/questions/14983580/for-loop-in-perl-how-to-do-something-with-each-url


Code
my @arr(somting somting1);  
for my $i(0 .. $#arr){
my url = get (www.$arr[$i].com);
do something with url...
}



and Andy Lester says: don't use index variables like that to loop over your array. Iterate over the array directly, like this:


Code
my @domains = ( ... );
for my $domain ( @domains ) {
    my $url = "http://www.$domain.com";
    ...
}


And he adds the following note: as others have said, always use strict; use warnings;, especially as a beginner.



Some thoughts derived from a Perl Maven article:

https://perlmaven.com/simple-way-to-fetch-many-web-pages gives an example of downloading many pages using HTTP::Tiny:


Code
use strict;
use warnings;
use 5.010;

use HTTP::Tiny;

my @urls = qw(
    https://perlmaven.com/
    https://cn.perlmaven.com/
    https://br.perlmaven.com/
);

my $ht = HTTP::Tiny->new;

foreach my $url (@urls) {
    say "Start $url";
    my $response = $ht->get($url);
    if ($response->{success}) {
        say 'Length: ', length $response->{content};
    } else {
        say "Failed: $response->{status} $response->{reason}";
    }
}


The code is quite straightforward. We have a list of URLs in the @urls array. An HTTP::Tiny object is created and assigned to the $ht variable. Then, in a for-loop, we go over each URL and fetch it. To save space, the example only prints the size of each page.


Well Chris, very nice examples. Here is a Python example, using Beautiful Soup:

BeautifulSoup looping through URLs: https://stackoverflow.com/questions/27752860/beautifulsoup-looping-through-urls
The suggestion there: follow the pagination by making an endless loop and following the "Next" link until it is not found.
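
The same "follow the Next link" idea in Perl might look like this - an untested sketch; the XPath for the pager link is only a guess at the page markup:


Code
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use URI;

my $ua  = LWP::UserAgent->new( timeout => 10 );
my $url = 'https://europa.eu/youth/volunteering/evs-organisation_en';

while ( defined $url )
{
    my $response = $ua->get( $url );
    die $response->status_line unless $response->is_success;

    my $tree = HTML::TreeBuilder::XPath->new_from_content( $response->decoded_content );

    # ... extract the result blocks on this page, as in Zhris's browse() ...

    # look for the "next" pagination link; stop when there is none
    my ($next) = $tree->findvalues( q{//li[contains(@class,"pager-next")]/a/@href} );
    $url = defined $next ? URI->new_abs( $next, $url )->as_string : undef;
}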


Dear Chris, dear all - as I said, this was just a quick posting, but I wanted to share these ideas with you and to say many thanks once more for the continued help!!


In the next few days I will try to figure out how to iterate over all the pages and, last but not least, how to load the whole dataset into a MySQL DB.

But that can wait for the moment...


I keep coming back to this wonderful thread on a regular basis - and yes, I'll keep you informed about how it is going.

regards
dilbert

- I love this place!!! ;)
Keep up this great place for idea exchange and knowledge transfer - it is a great place for discussing ideas and, yes, for learning!!! ;)


(This post was edited by dilbert on Jan 31, 2018, 10:49 AM)

 
 

